spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-27 02:16:32 +03:00

Author	SHA1	Message	Date
Muhammad Irfan	224a7f8e94	examples	2020-03-04 15:49:06 +05:00
Muhammad Irfan	03376c9d9b	Basque language added and tested.	2020-03-04 11:58:56 +05:00
adrianeboyd	9be90dbca3	Improve token head verification (#5079 ) * Improve token head verification Improve the verification for valid token heads when heads are set: * in `Token.head`: heads come from the same document * in `Doc.from_array()`: head indices are within the bounds of the document * Improve error message	2020-03-03 21:44:51 +01:00
adrianeboyd	8c20dae6f7	Fix model-final/model-best meta from train CLI (#5093 ) * Fix model-final/model-best meta * include speed and accuracy from final iteration * combine with speeds from base model if necessary * Include token_acc metric for all components	2020-03-03 21:43:25 +01:00
Sofie Van Landeghem	a0998868ff	prevent updating cfg if the Model was already defined (#5078 )	2020-03-03 13:58:56 +01:00
Sofie Van Landeghem	d307e9ca58	take care of global vectors in multiprocessing (#5081 ) * restore load_nlp.VECTORS in the child process * add unit test * fix test * remove unnecessary import * add utf8 encoding * import unicode_literals	2020-03-03 13:58:22 +01:00
adrianeboyd	d078b47c81	Break out of infinite loop as intended (#5077 )	2020-03-03 12:29:05 +01:00
adrianeboyd	697bec764d	Normalize IS_SENT_START to SENT_START for Matcher (#5080 )	2020-03-03 12:22:39 +01:00
adrianeboyd	2281c4708c	Restore empty tokenizer properties (#5026 ) * Restore empty tokenizer properties * Check for types in tokenizer.from_bytes() * Add test for setting empty tokenizer rules	2020-03-02 11:55:02 +01:00
Sofie Van Landeghem	c6b12ab02a	Bugfix/get doc (#5049 ) * new (broken) unit test * fixing get_doc method	2020-03-02 11:49:28 +01:00
Ines Montani	648f61d077	Tidy up compiler flags and imports (#5071 )	2020-03-02 11:48:10 +01:00
Ines Montani	7efaa76168	Update errors.py	2020-02-28 12:23:31 +01:00
Ines Montani	37691e6d5d	Simplify warnings	2020-02-28 12:20:23 +01:00
Ines Montani	5da3ad682a	Tidy up and auto-format	2020-02-28 11:57:41 +01:00
adrianeboyd	65d7bab10f	Initialize all values in a2b/b2a in new align (#5063 )	2020-02-27 18:43:00 +01:00
Sofie Van Landeghem	06f0a8daa0	Default settings to configurations (#4995 ) * fix grad_clip naming * cleaning up pretrained_vectors out of cfg * further refactoring Model init's * move Model building out of pipes * further refactor to require a model config when creating a pipe * small fixes * making cfg in nn_parser more consistent * fixing nr_class for parser * fixing nn_parser's nO * fix printing of loss * architectures in own file per type, consistent naming * convenience methods default_tagger_config and default_tok2vec_config * let create_pipe access default config if available for that component * default_parser_config * move defaults to separate folder * allow reading nlp from package or dir with argument 'name' * architecture spacy.VocabVectors.v1 to read static vectors from file * cleanup * default configs for nel, textcat, morphologizer, tensorizer * fix imports * fixing unit tests * fixes and clean up * fixing defaults, nO, fix unit tests * restore parser IO * fix IO * 'fix' serialization test * add .cfg to manifest fix example configs with additional arguments * replace Morphologizer with Tagger * add IO bit when testing overfitting of tagger (currently failing) * fix IO - don't initialize when reading from disk * expand overfitting tests to also check IO goes OK * remove dropout from HashEmbed to fix Tagger performance * add defaults for sentrec * update thinc * always pass a Model instance to a Pipe * fix piped_added statement * remove obsolete W029 * remove obsolete errors * restore byte checking tests (work again) * clean up test * further test cleanup * convert from config to Model in create_pipe * bring back error when component is not initialized * cleanup * remove calls for nlp2.begin_training * use thinc.api in imports * allow setting charembed's nM and nC * fix for hardcoded nM/nC + unit test * formatting fixes * trigger build	2020-02-27 18:42:27 +01:00
Adriane Boyd	9f740a9891	Add a few more Danish tokenizer exceptions	2020-02-26 14:59:03 +01:00
Ines Montani	1c212215cd	Merge pull request #5064 from adrianeboyd/feature/german-tokenization Improve German tokenization	2020-02-26 13:41:44 +01:00
Adriane Boyd	d1f703d78d	Improve German tokenization Improve German tokenization with respect to Tiger.	2020-02-26 13:06:52 +01:00
Ines Montani	ed9358420e	Merge branch 'master' into pr/5060	2020-02-26 12:51:29 +01:00
adrianeboyd	ff184b7a9c	Add tag_map argument to CLI debug-data and train (#4750 ) (#5038 ) Add an argument for a path to a JSON-formatted tag map, which is used to update and extend the default language tag map.	2020-02-26 12:10:38 +01:00
svlandeg	18ff97589d	update spacy to 2.2.4.dev0	2020-02-26 10:50:05 +01:00
svlandeg	fc6e34c3a1	fix bugs from porting master to develop	2020-02-26 08:44:22 +01:00
Ines Montani	c1a5ece65f	Tidy up setup and update requirements tests	2020-02-25 15:46:39 +01:00
Ines Montani	5d21d3e8b9	Merge branch 'develop' into pr/5008	2020-02-25 15:24:47 +01:00
Ines Montani	d50152b917	Merge pull request #5019 from questoph/master Optimizing tokenization for Luxembourgish (dealing with apostrophe infixes)	2020-02-25 14:48:50 +01:00
Ines Montani	4440a072d2	Merge pull request #5006 from svlandeg/bugfix/multiproc-underscore load Underscore state when multiprocessing	2020-02-25 14:46:02 +01:00
svlandeg	d821c95eb0	debugging prints	2020-02-23 17:38:33 +01:00
svlandeg	58568bd0cd	fix	2020-02-23 16:45:37 +01:00
svlandeg	0f55e51704	assert we found the root_dir	2020-02-23 16:33:58 +01:00
svlandeg	783da088ea	avoid try except	2020-02-23 16:21:21 +01:00
svlandeg	b49a3afd0c	use clean_underscore fixture	2020-02-23 15:49:20 +01:00
Tom Keefe	ddf63b97a8	make idx available via to_array (#5030 )	2020-02-22 14:13:06 +01:00
Sofie Van Landeghem	44f4142ce4	add two abbreviations and some additional unit tests (#5040 )	2020-02-22 14:12:32 +01:00
Sofie Van Landeghem	479bd8d09f	add lemma option to displacy 'dep' visualiser (#5041 ) * add lemma option to displacy 'dep' visualiser * more compact list comprehension * add option to doc * fix test and add lemmas to util.get_doc * fix capital * remove lemma from get_doc * cleanup	2020-02-22 14:11:51 +01:00
adrianeboyd	2164e71ea8	Improved Romanian tokenization for UD RRT (#5036 ) Modifications to Romanian tokenization to improve tokenization for UD_Romanian-RRT.	2020-02-19 16:15:59 +01:00
svlandeg	9f1447bf71	where areth thou, file ?	2020-02-19 17:09:29 +02:00
svlandeg	9834527f2c	hack to switch between CLI folder setup and local setup	2020-02-19 16:22:48 +02:00
svlandeg	5c2f645470	root dir one level up	2020-02-19 16:15:56 +02:00
svlandeg	b20351792a	assert prints for more clarity	2020-02-19 15:51:53 +02:00
Ines Montani	a3335d36b8	Merge branch 'develop' into refactor/remove-symlinks	2020-02-18 17:22:20 +01:00
Ines Montani	09cbeaef27	Remove symlinks, data dir and related stuff	2020-02-18 17:20:17 +01:00
Ines Montani	e3f40a6a0f	Tidy up and auto-format	2020-02-18 15:38:18 +01:00
Ines Montani	1278161f47	Tidy up and fix issues	2020-02-18 15:17:03 +01:00
Ines Montani	de11ea753a	Merge branch 'master' into develop	2020-02-18 14:47:23 +01:00
Ines Montani	80e95d02b1	Allow spacy attr in token pattern	2020-02-18 14:32:53 +01:00
Jan Jessewitsch	c7e4fe9c5c	Fix/Improve German stop words (#5024 ) * Fix German stop words Two stop words ("einige" and "einigen") are sticking together. Remove three nouns that may serve as stop words in a specific context (e.g. religious or news) but are not applicable for general use. * Create Jan-711.md	2020-02-17 18:59:22 +01:00
Kabir Khan	f6ed07b85c	Use nlp.pipe in EntityRuler for phrase patterns in add_patterns (#4931 ) * Fix ent_ids and labels properties when id attribute used in patterns * use set for labels * sort end_ids for comparison in entity_ruler tests * fixing entity_ruler ent_ids test * add to set * Run make_doc optimistically if using phrase matcher patterns. * remove unused coveragerc I was testing with * format * Refactor EntityRuler.add_patterns to use nlp.pipe for phrase patterns. Improves speed substantially. * Removing old add_patterns function * Fixing spacing * Make sure token_patterns loaded as well, before generator was being emptied in from_disk	2020-02-16 18:17:47 +01:00
Sofie Van Landeghem	72c964bcf4	define pretrained_dims which is used by build_text_classifier (#5004 )	2020-02-16 17:21:17 +01:00
adrianeboyd	3b22eb651b	Sync Span __eq__ and __hash__ (#5005 ) * Sync Span __eq__ and __hash__ Use the same tuple for `__eq__` and `__hash__`, including all attributes except `vector` and `vector_norm`. * Update entity comparison in tests Update `assert_docs_equal()` test util to compare `Span` properties for ents rather than `Span` objects.	2020-02-16 17:20:36 +01:00
adrianeboyd	0c47a53b5e	Use int only in key2row for better performance (#4990 ) Cast all keys and rows to `int` in `vectors.key2row` for more efficient access and serialization.	2020-02-16 17:19:41 +01:00
adrianeboyd	5b102963bf	Require HEAD for is_parsed in Doc.from_array() (#5011 ) Modify flag settings so that `DEP` is not sufficient to set `is_parsed` and only run `set_children_from_heads()` if `HEAD` is provided. Then the combination `[SENT_START, DEP]` will set deps and not clobber sent starts with a lot of one-word sentences.	2020-02-16 17:17:09 +01:00
Sofie Van Landeghem	2572460175	add tok2vec parameters to train script to facilitate init_tok2vec (#5021 )	2020-02-16 17:16:41 +01:00
Sofie Van Landeghem	a27c77ce62	add message when cli train script throws exception (#5009 ) * add message when cli train script throws exception * fix formatting	2020-02-15 15:50:17 +01:00
questoph	5352fc8fc3	Update tokenizer_exceptions.py	2020-02-14 12:02:15 +01:00
questoph	d1f0b397b5	Update punctuation.py	2020-02-13 22:18:51 +01:00
svlandeg	2729d9164d	cleanup	2020-02-12 22:59:37 +01:00
svlandeg	6bbd816569	formatting	2020-02-12 22:50:27 +01:00
svlandeg	34986c7bfd	test versions of required libs across different places	2020-02-12 22:49:50 +01:00
svlandeg	6e717c62ed	avoid the tests interacting with each other through the global Underscore variable	2020-02-12 13:21:31 +01:00
svlandeg	7939c63886	use English instead of model	2020-02-12 12:26:27 +01:00
svlandeg	46628d8890	add some asserts	2020-02-12 12:12:52 +01:00
svlandeg	51d37033c8	remove old comment	2020-02-12 12:10:05 +01:00
svlandeg	65f5b48b5d	add comment	2020-02-12 12:06:27 +01:00
svlandeg	05dedaa2cf	add unit test	2020-02-12 12:00:13 +01:00
svlandeg	ecbb9c4b9f	load Underscore state when multiprocessing	2020-02-12 11:50:42 +01:00
Ines Montani	2ed49404e3	Improve setup.py and call into Cython directly (#4952 ) * Improve setup.py and call into Cython directly * Add numpy to setup_requires * Improve clean helper * Update setup.cfg * Try if it builds without pyproject.toml * Update MANIFEST.in	2020-02-11 17:46:18 -05:00
adrianeboyd	99a543367d	Set GPU before loading any models in train CLI (#4989 ) Set the GPU before loading any existing models in the train CLI so that you can start with a base model and train on GPU.	2020-02-11 17:45:41 -05:00
adrianeboyd	842dfddbb9	Standardize Greek tag map setup (#4997 ) * Rename `tag_map.py` to `tag_map_fine.py` to indicate that it's not the default tag map * Remove duplicate generic UD tag map and load `../tag_map.py` instead	2020-02-11 17:44:56 -05:00
Sofie Van Landeghem	9b84f987bd	fix grad_clip naming (#4967 )	2020-02-10 20:33:16 -05:00
Antti Ajanki	e1f777b151	Improvements for Finnish tokenizer (#4985 ) * don't split on a colon. Colon is used to attach suffixes for abbreviations * tokenize on any of LIST_HYPHENS (except a single hyphen), not just on -- * simplify infix rules by merging similar rules	2020-02-10 20:32:43 -05:00
Sofie Van Landeghem	781e95cf53	Ensure doc.similarity returns a float (on develop) (#4969 )	2020-02-10 20:31:49 -05:00
Filip Bednárik	d4f4060bf3	Add Slovak language tools implementation (#4943 ) * Add correct stopwords for Slovak language * Add SNK Tags * Disable formatting lint for TAGS * Add example sentences for Slovak language * Add Slovak numerals in base form * Add lex_attrs to sk init * Add contributor agreement	2020-02-03 13:03:59 +01:00
Sofie Van Landeghem	cabd60fa1e	Small fixes to as_example (#4957 ) * label in span not writable anymore * Revert "label in span not writable anymore" This reverts commit `ab442338c8`. * fixing yield - remove redundant list	2020-02-03 13:02:12 +01:00
Tyler Couto	9fa9d7f2cb	Fix for Issue 4665 - conllu2json (#4953 ) * Fix for Issue 4665 - conllu2json - Allowing HEAD to be an underscore * Added contributor agreement	2020-02-03 13:01:48 +01:00
Matthew Honnibal	71b93f33bb	Set dev version	2020-01-30 15:41:45 +01:00
Matthew Honnibal	ba6d78132d	Fix dev version	2020-01-30 10:35:09 +01:00
Ines Montani	ccef9f2f44	Update version	2020-01-29 17:52:22 +01:00
adrianeboyd	5ee9d8c9b8	Add MORPH attr, add support in retokenizer (#4947 ) * Add MORPH attr / symbol for token attrs * Update retokenizer for MORPH	2020-01-29 17:45:46 +01:00
adrianeboyd	a365359b36	Add convert CLI option to merge CoNLL-U subtokens (#4722 ) * Add convert CLI option to merge CoNLL-U subtokens Add `-T` option to convert CLI that merges CoNLL-U subtokens into one token in the converted data. Each CoNLL-U sentence is read into a `Doc` and the `Retokenizer` is used to merge subtokens with features as follows: * `orth` is the merged token orth (should correspond to raw text and `# text`) * `tag` is all subtoken tags concatenated with `_`, e.g. `ADP_DET` * `pos` is the POS of the syntactic root of the span (as determined by the Retokenizer) * `morph` is all morphological features merged * `lemma` is all subtoken lemmas concatenated with ` `, e.g. `de o` * with `-m` all morphological features are combined with the tag using the separator `__`, e.g. `ADP_DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art` * `dep` is the dependency relation for the syntactic root of the span (as determined by the Retokenizer) Concatenated tags will be mapped to the UD POS of the syntactic root (e.g., `ADP`) and the morphological features will be the combined features. In many cases, the original UD subtokens can be reconstructed from the available features given a language-specific lookup table, e.g., Portuguese `do / ADP_DET / Definite=Def|Gender=Masc|Number=Sing|PronType=Art` is `de / ADP`, `o / DET / Definite=Def|Gender=Masc|Number=Sing|PronType=Art` or lookup rules for forms containing open class words like Spanish `hablarlo / VERB_PRON / Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|VerbForm=Inf`. * Clean up imports	2020-01-29 17:44:25 +01:00
Sofie Van Landeghem	569cc98982	Update spaCy for thinc 8.0.0 (#4920 ) * Add load_from_config function * Add train_from_config script * Merge configs and expose via spacy.config * Fix script * Suggest create_evaluation_callback * Hard-code for NER * Fix errors * Register command * Add TODO * Update train-from-config todos * Fix imports * Allow delayed setting of parser model nr_class * Get train-from-config working * Tidy up and fix scores and printing * Hide traceback if cancelled * Fix weighted score formatting * Fix score formatting * Make output_path optional * Add Tok2Vec component * Tidy up and add tok2vec_tensors * Add option to copy docs in nlp.update * Copy docs in nlp.update * Adjust nlp.update() for set_annotations * Don't shuffle pipes in nlp.update, decruft * Support set_annotations arg in component update * Support set_annotations in parser update * Add get_gradients method * Add get_gradients to parser * Update errors.py * Fix problems caused by merge * Add _link_components method in nlp * Add concept of 'listeners' and ControlledModel * Support optional attributes arg in ControlledModel * Try having tok2vec component in pipeline * Fix tok2vec component * Fix config * Fix tok2vec * Update for Example * Update for Example * Update config * Add eg2doc util * Update and add schemas/types * Update schemas * Fix nlp.update * Fix tagger * Remove hacks from train-from-config * Remove hard-coded config str * Calculate loss in tok2vec component * Tidy up and use function signatures instead of models * Support union types for registry models * Minor cleaning in Language.update * Make ControlledModel specifically Tok2VecListener * Fix train_from_config * Fix tok2vec * Tidy up * Add function for bilstm tok2vec * Fix type * Fix syntax * Fix pytorch optimizer * Add example configs * Update for thinc describe changes * Update for Thinc changes * Update for dropout/sgd changes * Update for dropout/sgd changes * Unhack gradient update * Work on refactoring _ml * Remove _ml.py module * WIP upgrade cli scripts for thinc * Move some _ml stuff to util * Import link_vectors from util * Update train_from_config * Import from util * Import from util * Temporarily add ml.component_models module * Move ml methods * Move typedefs * Update load vectors * Update gitignore * Move imports * Add PrecomputableAffine * Fix imports * Fix imports * Fix imports * Fix missing imports * Update CLI scripts * Update spacy.language * Add stubs for building the models * Update model definition * Update create_default_optimizer * Fix import * Fix comment * Update imports in tests * Update imports in spacy.cli * Fix import * fix obsolete thinc imports * update srsly pin * from thinc to ml_datasets for example data such as imdb * update ml_datasets pin * using STATE.vectors * small fix * fix Sentencizer.pipe * black formatting * rename Affine to Linear as in thinc * set validate explicitly to True * rename with_square_sequences to with_list2padded * rename with_flatten to with_list2array * chaining layernorm * small fixes * revert Optimizer import * build_nel_encoder with new thinc style * fixes using model's get and set methods * Tok2Vec in component models, various fixes * fix up legacy tok2vec code * add model initialize calls * add in build_tagger_model * small fixes * setting model dims * fixes for ParserModel * various small fixes * initialize thinc Models * fixes * consistent naming of window_size * fixes, removing set_dropout * work around Iterable issue * remove legacy tok2vec * util fix * fix forward function of tok2vec listener * more fixes * trying to fix PrecomputableAffine (not successful yet) * alloc instead of allocate * add morphologizer * rename residual * rename fixes * Fix predict function * Update parser and parser model * fixing few more tests * Fix precomputable affine * Update component model * Update parser model * Move backprop padding to own function, for test * Update test * Fix p. affine * Update NEL * build_bow_text_classifier and extract_ngrams * Fix parser init * Fix test add label * add build_simple_cnn_text_classifier * Fix parser init * Set gpu off by default in example * Fix tok2vec listener * Fix parser model * Small fixes * small fix for PyTorchLSTM parameters * revert my_compounding hack (iterable fixed now) * fix biLSTM * Fix uniqued * PyTorchRNNWrapper fix * small fixes * use helper function to calculate cosine loss * small fixes for build_simple_cnn_text_classifier * putting dropout default at 0.0 to ensure the layer gets built * using thinc util's set_dropout_rate * moving layer normalization inside of maxout definition to optimize dropout * temp debugging in NEL * fixed NEL model by using init defaults ! * fixing after set_dropout_rate refactor * proper fix * fix test_update_doc after refactoring optimizers in thinc * Add CharacterEmbed layer * Construct tagger Model * Add missing import * Remove unused stuff * Work on textcat * fix test (again :)) after optimizer refactor * fixes to allow reading Tagger from_disk without overwriting dimensions * don't build the tok2vec prematurely * fix CharacterEmbed init * CharacterEmbed fixes * Fix CharacterEmbed architecture * fix imports * renames from latest thinc update * one more rename * add initialize calls where appropriate * fix parser initialization * Update Thinc version * Fix errors, auto-format and tidy up imports * Fix validation * fix if bias is cupy array * revert for now * ensure it's a numpy array before running bp in ParserStepModel * no reason to call require_gpu twice * use CupyOps.to_numpy instead of cupy directly * fix initialize of ParserModel * remove unnecessary import * fixes for CosineDistance * fix device renaming * use refactored loss functions (Thinc PR 251) * overfitting test for tagger * experimental settings for the tagger: avoid zero-init and subword normalization * clean up tagger overfitting test * use previous default value for nP * remove toy config * bringing layernorm back (had a bug - fixed in thinc) * revert setting nP explicitly * remove setting default in constructor * restore values as they used to be * add overfitting test for NER * add overfitting test for dep parser * add overfitting test for textcat * fixing init for linear (previously affine) * larger eps window for textcat * ensure doc is not None * Require newer thinc * Make float check vaguer * Slop the textcat overfit test more * Fix textcat test * Fix exclusive classes for textcat * fix after renaming of alloc methods * fixing renames and mandatory arguments (staticvectors WIP) * upgrade to thinc==8.0.0.dev3 * refer to vocab.vectors directly instead of its name * rename alpha to learn_rate * adding hashembed and staticvectors dropout * upgrade to thinc 8.0.0.dev4 * add name back to avoid warning W020 * thinc dev4 * update srsly * using thinc 8.0.0a0 ! Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> Co-authored-by: Ines Montani <ines@ines.io>	2020-01-29 17:06:46 +01:00
adrianeboyd	a938566b62	Fix Sentencizer.pipe() for empty doc (#4940 )	2020-01-28 11:36:49 +01:00
adrianeboyd	06b251dd1e	Add support for pos/morphs/lemmas in training data (#4941 ) Add support for pos/morphs/lemmas throughout `GoldParse`, `Example`, and `docs_to_json()`.	2020-01-28 11:36:29 +01:00
adrianeboyd	adc9745718	Modify morphology to support arbitrary features (#4932 ) * Restructure tag maps for MorphAnalysis changes Prepare tag maps for upcoming MorphAnalysis changes that allow arbitrary features. * Use default tag map rather than duplicating for ca / uk / vi * Import tag map into defaults for ga * Modify tag maps so all morphological fields and features are strings * Move features from `"Other"` to the top level * Rewrite tuples as strings separated by `","` * Rewrite morph symbols for fr lemmatizer as strings * Export MorphAnalysis under spacy.tokens * Modify morphology to support arbitrary features Modify `Morphology` and `MorphAnalysis` so that arbitrary features are supported. * Modify `MorphAnalysisC` so that it can support arbitrary features and multiple values per field. `MorphAnalysisC` is redesigned to contain: * key: hash of UD FEATS string of morphological features * array of `MorphFeatureC` structs that each contain a hash of `Field` and `Field=Value` for a given morphological feature, which makes it possible to: * find features by field * represent multiple values for a given field * `get_field()` is renamed to `get_by_field()` and is no longer `nogil`. Instead a new helper function `get_n_by_field()` is `nogil` and returns `n` features by field. * `MorphAnalysis.get()` returns all possible values for a field as a list of individual features such as `["Tense=Pres", "Tense=Past"]`. * `MorphAnalysis`'s `str()` and `repr()` are the UD FEATS string. * `Morphology.feats_to_dict()` converts a UD FEATS string to a dict where: * Each field has one entry in the dict * Multiple values remain separated by a separator in the value string * `Token.morph_` returns the UD FEATS string and you can set `Token.morph_` with a UD FEATS string or with a tag map dict. * Modify get_by_field to use np.ndarray Modify `get_by_field()` to use np.ndarray. Remove `max_results` from `get_n_by_field()` and always iterate over all the fields. * Rewrite without MorphFeatureC * Add shortcut for existing feats strings as keys Add shortcut for existing feats strings as keys in `Morphology.add()`. * Check for '_' as empty analysis when adding morphs * Extend helper converters in Morphology Add and extend helper converters that convert and normalize between: * UD FEATS strings (`"Case=dat,gen|Number=sing"`) * per-field dict of feats (`{"Case": "dat,gen", "Number": "sing"}`) * list of individual features (`["Case=dat", "Case=gen", "Number=sing"]`) All converters sort fields and values where applicable.	2020-01-23 22:01:54 +01:00
Sofie Van Landeghem	0a0de85409	Fix gold training (#4938 ) * label in span not writable anymore * Revert "label in span not writable anymore" This reverts commit `ab442338c8`. * ensure doc is not None	2020-01-23 22:00:24 +01:00
adrianeboyd	199d89943e	Add as_example to Sentencizer pipe() (#4933 )	2020-01-22 15:40:31 +01:00
Yohei Tamura	708a4d27eb	fix nlp.evaluate (#4924 ) (#4925 ) * new file: test_issue4924.py * modified: spacy/gold.pyx * modified: test_issue4924.py for python2	2020-01-20 12:17:46 +01:00
Kabir Khan	b9afcd56e3	Fix ent_ids and labels properties when id attribute used in patterns (#4900 ) * Fix ent_ids and labels properties when id attribute used in patterns * use set for labels * sort end_ids for comparison in entity_ruler tests * fixing entity_ruler ent_ids test * add to set	2020-01-16 02:01:31 +01:00
adrianeboyd	90c52128dc	Improve train CLI with base model (#4911 ) Improve train CLI with a provided base model so that you can: * add a new component * extend an existing component * replace an existing component When the final model and best model are saved, reenable any disabled components and merge the meta information to include the full pipeline and accuracy information for all components in the base model plus the newly added components if needed.	2020-01-16 01:58:51 +01:00
svlandeg	ee828d5a9a	bugfix typo conv_window	2020-01-14 09:02:58 +01:00
adrianeboyd	d2f3a44b42	Improve train CLI sentrec scoring (#4892 ) * reorder to metrics to prioritize F over P/R * add sentrec to model metrics	2020-01-08 16:52:14 +01:00
adrianeboyd	e55fa1899a	Report length of dev dataset correctly (#4891 )	2020-01-08 16:51:51 +01:00
adrianeboyd	e1b493ae85	Add sentrec shortcut to Language (#4890 )	2020-01-08 16:51:24 +01:00
adrianeboyd	d24bca62f6	Add CJK to character classes (#4884 ) * Add CJK character class as uncased * Incorporate Chinese URL test case Un-xfail Chinese URL test instance	2020-01-08 16:50:19 +01:00
adrianeboyd	aef83e8070	Mark most Hungarian tokenizer test cases as slow (#4883 ) * Mark most Hungarian tokenizer test cases as slow Mark most Hungarian tokenizer test cases as slow to reduce the runtime of the test suite in ordinary usage: * for normal tests: run default tests plus 10% of the detailed tests * for slow tests: run all tests * Rework to mark individual tests as slow	2020-01-08 12:34:06 +01:00
Sofie Van Landeghem	7b96a5e10f	Reduce mem usage in training Entity Linker (#4811 ) * move nlp processing for el pipe to batch training instead of preprocessing * adding dev eval back in, and limit in articles instead of entities * use pipe whenever possible * few more small doc changes * access dev data through generator * tqdm description * small fixes * update documentation	2020-01-06 14:59:50 +01:00
Sofie Van Landeghem	6e9b61b49d	add warning in debug_data for punctuation in entities (#4853 )	2020-01-06 14:59:28 +01:00
adrianeboyd	d652ff215d	Add trailing whitespace to multiline test text (#4877 )	2020-01-06 14:58:59 +01:00
adrianeboyd	de69bc6509	Fix and improve URL pattern (#4882 ) * match domains longer than `hostname.domain.tld` like `www.foo.co.uk` * expand allowed characters in domain names while only matching lowercase TLDs so that "this.That" isn't matched as a URL and can be split on the period as an infix (relevant for at least English, German, and Tatar)	2020-01-06 14:58:30 +01:00
Sofie Van Landeghem	a1b22e90cd	serialize ENT_ID (#4852 ) * expand serialization test for custom token attribute * add failing test for issue 4849 * define ENT_ID as attr and use in doc serialization * fix few typos	2020-01-06 14:57:34 +01:00
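
Note on commit adc9745718: its message describes converters between three representations of morphological features (a UD FEATS string, a per-field dict, and a list of individual features). The following is a minimal standalone sketch of those three shapes, not spaCy's implementation. The function names `feats_str_to_dict`, `feats_dict_to_list` and `feats_list_to_str` are hypothetical, and plain alphabetical sorting is an assumption standing in for "sort fields and values where applicable".

```python
# Standalone illustration only; these helpers are hypothetical and are
# not part of spaCy's Morphology API.

def feats_str_to_dict(feats):
    """UD FEATS string -> per-field dict, e.g. {"Case": "dat,gen"}."""
    if not feats or feats == "_":  # "_" denotes an empty analysis
        return {}
    result = {}
    for pair in feats.split("|"):
        field, values = pair.split("=", 1)
        # Multiple values stay in one comma-separated string, sorted.
        result[field] = ",".join(sorted(values.split(",")))
    return dict(sorted(result.items()))


def feats_dict_to_list(feat_dict):
    """Per-field dict -> list of individual Field=Value features."""
    return [
        f"{field}={value}"
        for field in sorted(feat_dict)
        for value in sorted(feat_dict[field].split(","))
    ]


def feats_list_to_str(feat_list):
    """List of individual features -> normalized UD FEATS string."""
    by_field = {}
    for feat in feat_list:
        field, value = feat.split("=", 1)
        by_field.setdefault(field, set()).add(value)
    if not by_field:
        return "_"
    return "|".join(
        f"{field}={','.join(sorted(values))}"
        for field, values in sorted(by_field.items())
    )


as_dict = feats_str_to_dict("Case=dat,gen|Number=sing")
as_list = feats_dict_to_list(as_dict)
as_str = feats_list_to_str(["Number=sing", "Case=gen", "Case=dat"])
```

Under these assumptions, a round trip through the three helpers returns the same normalized string regardless of the order in which the features were supplied.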

6775 Commits