spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-07-16 11:12:25 +03:00

Author	SHA1	Message	Date
Ines Montani	ec5ee9e616	Auto-format	2018-11-26 18:54:20 +01:00
Ines Montani	968aff2f6a	Update tests for pytest 4.x (#2965) ## Description - [x] Replace marks in params for pytest 4.0 compat ([see here](https://docs.pytest.org/en/latest/deprecations.html#marks-in-pytest-mark-parametrize)) - [x] Un-xfail passing tests (some fixes in a recent update resolved a bunch of issues, but tests were apparently never updated here) ### Types of change ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-11-26 18:14:57 +01:00
Marc Puig	98fe1ab259	Catalan Language Support (#2940) * Catalan language support * Adding Catalan to documentation	2018-11-26 15:25:47 +01:00
Ines Montani	048416f265	Fix formatting	2018-11-26 13:27:41 +01:00
Shawn Cicoria	7601ae0cff	fixes symbolic link on py3 and windows (#2949 ) * fixes symbolic link on py3 and windows during setup of spacy using command python -m spacy link en_core_web_sm en closes #2948 * Update spacy/compat.py Co-Authored-By: cicorias <cicorias@users.noreply.github.com>	2018-11-24 15:34:23 +01:00
Ines Montani	350c8d25b0	Add EntityRecognizer.label property	2018-11-18 00:06:26 +01:00
Ines Montani	017bc2ef2f	Expose TextCategorizer via __all__	2018-11-18 00:06:13 +01:00
Ines Montani	b4581435f6	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-11-16 13:08:22 +01:00
Ines Montani	e2f75eb492	Fix message formatting	2018-11-16 13:08:20 +01:00
Matthew Honnibal	2874b8efd8	Fix tok2vec loading in spacy train	2018-11-15 23:34:54 +00:00
Matthew Honnibal	2ddd428834	Fix pretrain script	2018-11-15 23:34:35 +00:00
Matthew Honnibal	f8afaa0c1c	Fix pretrain	2018-11-15 22:46:53 +00:00
Matthew Honnibal	6af6950e46	Fix pretrain	2018-11-15 22:45:36 +00:00
Matthew Honnibal	3e7b214e57	Make pretrain script work with stream from stdin	2018-11-15 22:44:07 +00:00
Matthew Honnibal	8fdb9bc278	💫 Add experimental ULMFit/BERT/Elmo-like pretraining (#2931 ) * Add 'spacy pretrain' command * Fix pretrain command for Python 2 * Fix pretrain command * Fix pretrain command	2018-11-15 22:17:16 +01:00
Ines Montani	02fc73ca53	💫 Create random IDs for SVGs to prevent ID clashes (#2927) Resolves #2924. ## Description Fixes problem where multiple visualizations in Jupyter notebooks would have clashing arc IDs, resulting in weirdly positioned arc labels. Generating a random ID prefix so even identical parses won't receive the same IDs for consistency (even if effect of ID clash isn't noticeable here). ### Types of change bug fix ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-11-15 11:40:10 +01:00
Ines Montani	e89708c3eb	💫 Allow matching non-ORTH attributes in PhraseMatcher (#2925 ) * Allow matching non-orth attributes in PhraseMatcher (see #1971) Usage: PhraseMatcher(nlp.vocab, attr='POS') * Allow attr argument to be int * Fix formatting * Fix typo	2018-11-15 03:00:58 +01:00
Ines Montani	0d5b142c78	Fix typos and whitespace	2018-11-14 19:12:34 +01:00
Ines Montani	bd1b0e396a	Add deprecation warning for PhraseMatcher max_length	2018-11-14 19:10:46 +01:00
Ines Montani	64257bf3a7	Fix formatting	2018-11-14 19:10:21 +01:00
Ines Montani	b3cadd5b81	Delete _matcher2_notes.py	2018-11-14 16:19:12 +01:00
mauryaland	87ce435aff	Check if the word is in one of the regular lists specific to each POS (#2886 )	2018-11-14 15:58:43 +01:00
Daniel Hershcovich	d3d419ecc0	Allow input text of length up to max_length, inclusive (#2922 )	2018-11-13 16:46:29 +01:00
Matthew Honnibal	5fc98ade04	Set version to 2.1.0a2	2018-11-08 09:56:56 +01:00
Matthew Honnibal	ad44982f01	Fix dropout in tensorizer, update comment	2018-11-03 12:46:58 +00:00
Matthew Honnibal	ba365ae1c9	Normalize gradient by number of words in tensorizer	2018-11-03 10:53:22 +00:00
Matthew Honnibal	dac3f1b280	Improve Tensorizer	2018-11-03 10:52:50 +00:00
Matthew Honnibal	2527ba68e5	Fix tensorizer	2018-11-02 23:29:54 +00:00
Matthew Honnibal	db08b168a3	Set version to 2.0.17	2018-10-29 23:22:18 +01:00
Suraj Rajan	0bf14082a4	Added more constructs for dependency tree matcher (#2836)	2018-10-29 23:21:39 +01:00
Matthew Honnibal	e2ae25d6f5	Try setting older regex version, to align with conda	2018-10-29 13:39:00 +01:00
Matthew Honnibal	d4fa9af56f	Set version to 2.0.17.dev0	2018-10-28 16:15:26 +01:00
Matthew Honnibal	b2e2bba8b0	Fix missing comma	2018-10-28 00:09:16 +02:00
Wannaphong Phatthiyaphaibun	2d2765fd8a	Change PyThaiNLP Url (#2876 )	2018-10-27 14:46:07 +02:00
Matthew Honnibal	817e1fc5e5	Fix out-of-bounds access in NER training The helper method state.B(1) gets the index of the first token of the buffer, or -1 if no such token exists. Normally this is safe because we pass this to functions like state.safe_get(), which returns an empty token. Here we used it directly as an array index, which is not okay! This error may have been the cause of out-of-bounds access errors during training. Similar errors may still be around, so must be hunted down. Hunting this one down took a long time...I printed out values across training runs and diffed, looking for points of divergence between runs, when no randomness should be allowed.	2018-10-27 01:12:50 +02:00
Matthew Honnibal	9447739027	Merge branch 'master' of https://github.com/explosion/spaCy	2018-10-27 00:50:48 +02:00
Matthew Honnibal	ad068f51be	Fix out-of-bounds access in NER training The helper method state.B(1) gets the index of the first token of the buffer, or -1 if no such token exists. Normally this is safe because we pass this to functions like state.safe_get(), which returns an empty token. Here we used it directly as an array index, which is not okay! This error may have been the cause of out-of-bounds access errors during training. Similar errors may still be around, so must be hunted down. Hunting this one down took a long time...I printed out values across training runs and diffed, looking for points of divergence between runs, when no randomness should be allowed.	2018-10-27 00:46:30 +02:00
Grivaz	57f274b693	raise error when setting overlapping entities as doc.ents (#2880 )	2018-10-26 23:29:16 +02:00
Ines Montani	48b1bc44d3	Update version to 2.0.16	2018-10-15 14:39:25 +02:00
Ines Montani	a0f6647160	Increment version	2018-10-15 14:20:55 +02:00
Ines Montani	7bc7fa8f1e	Increment version	2018-10-15 01:40:44 +02:00
Matthew Honnibal	8612b75890	Set version to 2.0.14	2018-10-15 00:10:04 +02:00
Matthew Honnibal	d6e9cf8b09	Set version to 2.0.14.dev1	2018-10-15 00:09:02 +02:00
Matthew Honnibal	8ccfa52d19	Unhack prefer_gpu	2018-10-14 23:27:09 +02:00
Matthew Honnibal	41adf3572b	Set version to v2.0.14	2018-10-14 23:15:34 +02:00
Matthew Honnibal	38aa835ada	Workaround bug in thinc require_gpu	2018-10-14 23:15:08 +02:00
Matthew Honnibal	91593b7378	Add tests for prefer_gpu() and require_gpu()	2018-10-14 23:05:22 +02:00
Matthew Honnibal	62c70b3163	Import prefer_gpu and require_gpu functions from Thinc	2018-10-14 23:03:06 +02:00
Ines Montani	295da0f11b	Increment version to 2.0.14.dev0	2018-10-14 16:37:46 +02:00
Matthew Honnibal	7de0dcb91f	Merge branch 'master' of https://github.com/explosion/spaCy	2018-10-14 16:12:23 +02:00
Keshan	cb075c8e72	Adding "This is a sentence" example to Sinhala (#2846 )	2018-10-14 00:06:40 +02:00
Matthew Honnibal	9cfab5933a	Set version to 2.0.13	2018-10-13 19:42:16 +02:00
Matthew Honnibal	6a6ae5b0af	Merge branch 'master' of https://github.com/explosion/spaCy	2018-10-13 19:41:00 +02:00
mauryaland	36514b5762	Rule-based French Lemmatizer (#2818) ## Description Add a rule-based French Lemmatizer following the English one and the excellent PR for [Greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class. ### Types of change - Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version. - Add several files containing exhaustive list of words for each part of speech - Add some lemma rules - Add POS that are not checked in the standard Lemmatizer, i.e. PRON, DET, ADV and AUX - Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentioned - Modify the lemmatize function to check in lookup table as a last resort - Init files are updated so the model can support all the functionalities mentioned above - Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py ## Checklist - [X] I have submitted the spaCy Contributor Agreement. - [X] I ran the tests, and all new and existing tests passed. - [X] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-10-13 16:38:21 +02:00
Matthew Honnibal	de46286107	Merge branch 'master' of https://github.com/explosion/spaCy	2018-10-13 16:11:16 +02:00
Ines Montani	cb57b35bb8	Also include lowercase norm exceptions	2018-10-13 15:37:30 +02:00
JKhakpour	74a30d883c	Add Persian(Farsi) language support (#2797 )	2018-10-13 15:31:49 +02:00
Matthew Honnibal	c3ddf98b1e	Set version to 2.0.13.dev4	2018-10-13 15:20:59 +02:00
Marina Lysyuk	b76fe08308	Correcting lang/ru/examples.py (#2845 ) * Correct some grammatical inaccuracies in lang\ru\examples.py; filled Contributor Agreement * Correct some grammatical inaccuracies in lang\ru\examples.py * Move contributor agreement to separate file	2018-10-13 15:19:43 +02:00
Matthew Honnibal	67ddce68d8	Unskip test	2018-10-02 23:47:55 +02:00
Matthew Honnibal	4cf5ce2cc2	Revert "Remove problematic test" This reverts commit `bdebbef455`.	2018-10-02 23:47:24 +02:00
Matthew Honnibal	bdebbef455	Remove problematic test	2018-10-02 23:16:29 +02:00
Matthew Honnibal	6afc6ffe56	Skip seemingly problematic test	2018-10-02 22:33:40 +02:00
Matthew Honnibal	9e4079ddb2	Merge branch 'master' of https://github.com/explosion/spaCy	2018-10-02 19:44:43 +02:00
Matthew Honnibal	40f228c2f2	Set version to 2.0.13.dev3	2018-10-02 19:44:25 +02:00
Ines Montani	ea20b72c08	💫 Make like_num work for prefixed numbers (#2808 ) * Only split + prefix if not numbers * Make like_num work for prefixed numbers * Add test for like_num	2018-10-01 10:49:14 +02:00
Filipe Caixeta	6c498f9ff4	Update Portuguese Language (#2790) * Add words to Portuguese language _num_words * Add words to Portuguese language _num_words * Portuguese - Add/remove stopwords, fix tokenizer, add currency symbols * Extended punctuation and norm_exceptions in the Portuguese language	2018-09-29 09:51:45 +02:00
Matthew Honnibal	b39810d692	Fix copy_reg compatibility on _serialize module	2018-09-28 15:23:14 +02:00
Matthew Honnibal	f82f8ba5dd	Fix serialization when empty parser model. Closes #2482	2018-09-28 15:18:52 +02:00
Matthew Honnibal	d5a6c63b62	Add regression test for #2482	2018-09-28 15:18:30 +02:00
Matthew Honnibal	e3e9fe18d4	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-09-28 14:27:35 +02:00
Matthew Honnibal	0323f5be0c	Fix _serialize module	2018-09-28 14:27:24 +02:00
Ines Montani	5d56eb70d7	Tidy up tests	2018-09-27 16:41:57 +02:00
Ines Montani	1f1bab9264	Remove unused import	2018-09-27 16:41:37 +02:00
Matthew Honnibal	6430b1fe64	Restore encoding arg on msgpack-numpy	2018-09-27 15:58:21 +02:00
Matthew Honnibal	2ac69facc6	Fix Python 2 test failure	2018-09-27 15:34:16 +02:00
Matthew Honnibal	72778375fb	Merge branch 'master' of https://github.com/explosion/spaCy	2018-09-27 13:54:49 +02:00
Matthew Honnibal	96fe314d8d	Fix bug when too many entity types. Fixes #2800	2018-09-27 13:54:34 +02:00
Suraj Rajan	bbdc6456c6	Set up dependency tree pattern matching skeleton (#2732 )	2018-09-27 13:27:18 +02:00
Matthew Honnibal	8809dc4514	Remove deprecated encoding argument to msgpack	2018-09-27 12:56:23 +02:00
Matthew Honnibal	bae6b3e2b3	Merge branch 'master' of https://github.com/explosion/spaCy	2018-09-27 12:50:31 +02:00
Ines Montani	71cdbeada7	Revert "Also include lowercase norm exceptions" This reverts commit `70f4e8adf3`.	2018-09-27 12:29:25 +02:00
darindf	8227566805	Fix error (#2802 ) * Fix error ValueError: cannot resize an array that references or is referenced by another array in this way. Use the resize function * added spaCy Contributor Agreement	2018-09-26 21:31:03 +02:00
Ines Montani	5e0dfb34fa	Merge branch 'master' of https://github.com/explosion/spaCy	2018-09-26 11:13:58 +02:00
Ines Montani	70f4e8adf3	Also include lowercase norm exceptions	2018-09-25 12:22:02 +02:00
Keshan	9a016d17c2	Adding basic support for Sinhala language. (#2788 ) * adding Sinhala language package, stop words, examples and lex_attrs. * Adding contributor agreement * Updating contributor agreement	2018-09-25 12:18:25 +02:00
Matthew Honnibal	b42c123e5d	Fix regression introduced by `1759abf1e`	2018-09-25 11:08:58 +02:00
Matthew Honnibal	500898907b	Fix regression in parser.begin_training()	2018-09-25 11:08:31 +02:00
Ines Montani	3c4e3ade30	Fix typo (closes #2784 )	2018-09-21 10:45:11 +02:00
mauryaland	68b3c544d5	Adding French hyphenated first name (#2786 )	2018-09-21 10:38:13 +02:00
Matthew Honnibal	1759abf1e5	Fix bug in sentence starts for non-projective parses The set_children_from_heads function assumed parse trees were projective. However, non-projective parses may be passed in during deserialization, or after deprojectivising. This caused incorrect sentence boundaries to be set for non-projective parses. Close #2772.	2018-09-19 14:50:06 +02:00
Matthew Honnibal	48fd36bf05	Fix test for issue 2772	2018-09-19 14:47:27 +02:00
Matthew Honnibal	6cd920e088	Add xfail test for deprojectivization SBD bug	2018-09-19 14:00:31 +02:00
Matthew Honnibal	99a6011580	Avoid adding empty layer in model, to keep models backwards compatible	2018-09-14 22:51:58 +02:00
Matthew Honnibal	c046392317	Trigger on_data hooks in parser model	2018-09-14 20:51:21 +02:00
Matthew Honnibal	5afd98dff5	Add a stepping function, for changing batch sizes or learning rates	2018-09-14 18:37:16 +02:00
Matthew Honnibal	27c00f4f22	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-09-14 12:30:57 +02:00
Andrew Ongko	81564cc4e8	Update Indonesian model (#2752) * adding e-KTP in tokenizer exceptions list * add exception token * removing lines containing spaces as it won't matter since we use .split() method in the end, added new tokens in exception * add tokenizer exceptions list * combining base_norms with norm_exceptions * adding norm_exception * fix double key in lemmatizer * remove unused import on punctuation.py * reformat stop_words to reduce number of lines, improve readability * updating tokenizer exception * implement is_currency for lang/id * adding orth_first_upper in tokenizer_exceptions * update the norm_exception list * remove bunch of abbreviations * adding contributors file	2018-09-14 12:30:32 +02:00
Filipe Caixeta	fe515085f3	Add words to Portuguese language _num_words (#2759) * Add words to Portuguese language _num_words * Add words to Portuguese language _num_words	2018-09-14 12:30:16 +02:00
Matthew Honnibal	f32b52e611	Fix bug that caused deprojectivisation to run multiple times	2018-09-14 12:12:54 +02:00
Matthew Honnibal	8f2a6367e9	Fix usage of PyTorch BiLSTM in ud_train	2018-09-13 22:54:59 +00:00
Matthew Honnibal	afeddfff26	Fix PyTorch BiLSTM	2018-09-13 22:54:34 +00:00
Matthew Honnibal	a26fe8e7bb	Small hack in Language.update to make torch work	2018-09-13 22:51:52 +00:00
Matthew Honnibal	445b81ce3f	Support bilstm_depth argument in ud-train	2018-09-13 19:30:22 +02:00
Matthew Honnibal	b43643a953	Support bilstm_depth option in parser	2018-09-13 19:29:49 +02:00
Matthew Honnibal	45032fe9e1	Support option of BiLSTM in Tok2Vec (requires pytorch)	2018-09-13 19:28:35 +02:00
Matthew Honnibal	3eb9f3e2b8	Fix defaults for ud-train	2018-09-13 18:05:48 +02:00
Matthew Honnibal	59cf533879	Improve ud-train script. Make config optional	2018-09-13 14:24:08 +02:00
Matthew Honnibal	3e3a309764	Fix tagger	2018-09-13 14:14:38 +02:00
Matthew Honnibal	da7650e84b	Fix maximum doc length in ud_train script	2018-09-13 14:10:25 +02:00
Matthew Honnibal	a95eea4c06	Fix multi-task objective for parser	2018-09-13 14:08:55 +02:00
Matthew Honnibal	21321cd6cf	Add tok2vec property to parser model	2018-09-13 14:08:43 +02:00
Matthew Honnibal	d6aa60139d	Fix tagger training on GPU	2018-09-13 14:05:37 +02:00
Grivaz	aeba99ab0d	Introduces a bulk merge function, in order to solve issue #653 (#2696 ) * Fix comment * Introduce bulk merge to increase performance on many span merges * Sign contributor agreement * Implement pull request suggestions	2018-09-10 16:41:42 +02:00
tyburam	476472d181	Lex_attrs for Polish language (#2750) * Signed spaCy contributor agreement * Added Polish version of English lex_attrs	2018-09-10 11:53:57 +02:00
Sainath Adapa	77139bc03c	Basic support for Telugu language (#2751 )	2018-09-10 11:53:18 +02:00
Maxim Kupfer	cebe50b5b8	Remove ')' for clarity (#2737) Sorry, don't mean to be nitpicky, I just noticed this when going through the CLI and thought it was a quick fix. That said, if this was intentional then please let me know.	2018-09-10 11:31:49 +02:00
Matthew Honnibal	b2cb1fc67d	Merge matcher tests	2018-09-06 01:39:53 +02:00
Suraj Krishnan Rajan	356af7b0a1	Fix tests	2018-09-06 01:39:36 +02:00
Piotr Żelasko	bdb2165bd1	Less norm computations in token similarity (#2730 ) * Less norm computations in token similarity * Contributor agreement	2018-09-05 21:50:23 +02:00
Aniruddha Adhikary	4530ddcc51	update bengali token rules for hyphen and digits (#2731 )	2018-09-05 21:49:00 +02:00
Nathaniel J. Smith	26849874ad	When calling getoption() in conftest.py, pass a default option (#2709 ) * When calling getoption() in conftest.py, pass a default option This is necessary to allow testing an installed spacy by running: pytest --pyargs spacy * Add contributor agreement	2018-09-03 09:57:52 +02:00
Matthew Honnibal	4d2d7d5866	Fix new feature flags	2018-08-27 02:12:39 +02:00
Matthew Honnibal	598dbf1ce0	Fix character-based tokenization for Japanese	2018-08-27 01:51:38 +02:00
Matthew Honnibal	3763e20afc	Pass subword_features and conv_depth params	2018-08-27 01:51:15 +02:00
Matthew Honnibal	8051136d70	Support subword_features and conv_depth params in Tok2Vec	2018-08-27 01:50:48 +02:00
Matthew Honnibal	9c33d4d1df	Add more hyper-parameters to spacy ud-train * subword_features: Controls whether subword features are used in the word embeddings. True by default (specifically, prefix, suffix and word shape). Should be set to False for languages like Chinese and Japanese. * conv_depth: Depth of the convolutional layers. Defaults to 4.	2018-08-27 01:48:46 +02:00
Ines Montani	e9022f7b33	Remove docstrings for deprecated arguments (see #2703 )	2018-08-26 14:23:13 +02:00
Ines Montani	559f4139e3	Add FAC to spacy.explain (resolves #2706 )	2018-08-26 14:13:50 +02:00
Matthew Honnibal	51a9efbf3b	Add draft Binder class	2018-08-22 13:12:51 +02:00
Matthew Honnibal	5ce459d2ee	Fix error in vocab	2018-08-16 17:18:09 +02:00
Matthew Honnibal	00febda2e3	Improve alignment around quotes	2018-08-16 01:04:34 +02:00
Matthew Honnibal	66a3f2ba21	Lower-case text before alignment	2018-08-16 00:42:36 +02:00
Matthew Honnibal	595c893791	Expose noise_level option in train CLI	2018-08-16 00:41:44 +02:00
Matthew Honnibal	8365226bf3	Fix lookup of symbols in vocab.	2018-08-15 23:43:34 +02:00
Matthew Honnibal	b9f0588580	Set version to v2.1.0a1	2018-08-15 17:22:39 +02:00
Matthew Honnibal	e968016417	Note link between issues #2671 and #2675	2018-08-15 17:18:28 +02:00
Matthew Honnibal	63bdc734ba	Skip flakey test	2018-08-15 16:56:55 +02:00
Matthew Honnibal	ce512e1d47	Fix #2671 : Incorrect match ID on some patterns	2018-08-15 16:19:08 +02:00
Matthew Honnibal	f12b9190f6	Xfail test for issue #2671	2018-08-15 15:55:31 +02:00
Matthew Honnibal	7cfa665ce6	Add failing test for issue 2671: Incorrect rule ID returned from matcher	2018-08-15 15:54:33 +02:00
Matthew Honnibal	1b2a5869ab	Set version to v2.1.0a2.dev0	2018-08-15 15:38:52 +02:00
Matthew Honnibal	5080760288	Add extra comment on 'add label' in parser	2018-08-15 15:37:24 +02:00
Matthew Honnibal	6e749d3c70	Skip flakey parser test	2018-08-15 15:37:04 +02:00
Matthew Honnibal	6ea981c839	Add converter for jsonl NER data	2018-08-14 14:04:32 +02:00
Matthew Honnibal	a9fb6d5511	Fix docs2jsonl function	2018-08-14 14:03:48 +02:00
Matthew Honnibal	ea2edd1e2c	Merge branch 'feature/docs_to_json' into develop	2018-08-14 13:23:42 +02:00
Matthew Honnibal	6ec236ab08	Fix label-clobber bug in parser.begin_training() The parser.begin_training() method was rewritten in v2.1. The rewrite introduced a regression, where if you added labels prior to begin_training(), these labels were discarded. This patch fixes that.	2018-08-14 13:20:19 +02:00
Matthew Honnibal	02c5c114d0	Fix usage of deprecated freqs.txt in init-model	2018-08-14 13:19:15 +02:00
Matthew Honnibal	2a5a61683e	Add function to get train format from Doc objects Our JSON training format is annoying to work with, and we've wanted to retire it for some time. In the meantime, we can at least add some missing functions to make it easier to live with. This patch adds a function that generates the JSON format from a list of Doc objects, one per paragraph. This should be a convenient way to handle a lot of data conversions: whatever format you have the source information in, you can use it to set up a Doc object. This approach should offer better future-proofing as well. Hopefully, we can steadily rewrite code that is sensitive to the current data format, so that it instead goes through this function. Then when we change the data format, we won't have such a problem.	2018-08-14 13:13:10 +02:00
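
Note on commits 817e1fc5e5 and ad068f51be in the log above: their message describes a sentinel index (-1 meaning "no such token") being used directly as an array index instead of going through a bounds-checked accessor. The following is a minimal Python sketch of that failure mode, under stated assumptions. The names `EMPTY_TOKEN`, `first_buffer_index` and `safe_get` are hypothetical stand-ins for illustration and are not spaCy's actual Cython internals. In Python an index of -1 silently wraps to the last element, whereas in C-level code it would read outside the array.

```python
# Hypothetical illustration only; this is not spaCy's real parser state code.
EMPTY_TOKEN = "<empty>"


def first_buffer_index(buffer_start, n_tokens):
    """Return the index of the first buffer token, or -1 if the buffer is empty."""
    return buffer_start if 0 <= buffer_start < n_tokens else -1


def safe_get(tokens, i):
    """Return tokens[i], or EMPTY_TOKEN when i is the -1 sentinel or out of range."""
    if 0 <= i < len(tokens):
        return tokens[i]
    return EMPTY_TOKEN


tokens = ["Apple", "is", "hiring"]

# The buffer is exhausted, so the helper reports the -1 sentinel.
idx = first_buffer_index(buffer_start=3, n_tokens=len(tokens))

# Sentinel used directly as an index: Python wraps around to the last token.
unchecked = tokens[idx]

# Sentinel routed through the bounds-checked accessor instead.
checked = safe_get(tokens, idx)
```

Routing every sentinel-bearing index through one accessor keeps the "no such token" case in a single place, which is the pattern the commit message says was bypassed.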

5492 Commits