spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-14 13:47:13 +03:00

Author	SHA1	Message	Date
Ines Montani	cb02bff0eb	Add blank:{lang} shortcut to util.load_model	2020-05-21 20:24:07 +02:00
Matthw Honnibal	df87c32a40	Pass smaller doc sample into model initialize	2020-05-21 20:17:24 +02:00
Ines Montani	581bda9f98	Update senter test and auto-format	2020-05-21 20:17:14 +02:00
Ines Montani	0f1beb5ff2	Tidy up and avoid absolute spacy imports in core	2020-05-21 20:05:03 +02:00
svlandeg	51715b9f72	span / noun chunk has +1 because end is exclusive	2020-05-21 19:56:56 +02:00
Adriane Boyd	132b2a6898	Merge remote-tracking branch 'upstream/master-tmp' into HEAD	2020-05-21 19:50:30 +02:00
Adriane Boyd	17ee9ab53a	Fix _SP/POS=SPACE in strings serialization tests	2020-05-21 19:49:08 +02:00
Ines Montani	245f91df78	Fix merge issues	2020-05-21 19:42:13 +02:00
Matthw Honnibal	3b5cfec1fc	Tweak memory management in train_from_config	2020-05-21 19:32:04 +02:00
Matthw Honnibal	f075655deb	Fix shape inference in begin_training	2020-05-21 19:26:29 +02:00
svlandeg	84d5b7ad0a	Merge remote-tracking branch 'upstream/master' into bugfix/noun-chunks # Conflicts: # spacy/lang/el/syntax_iterators.py # spacy/lang/en/syntax_iterators.py # spacy/lang/fa/syntax_iterators.py # spacy/lang/fr/syntax_iterators.py # spacy/lang/id/syntax_iterators.py # spacy/lang/nb/syntax_iterators.py # spacy/lang/sv/syntax_iterators.py	2020-05-21 19:19:50 +02:00
svlandeg	f7d10da555	avoid unnecessary loop to check overlapping noun chunks	2020-05-21 19:15:57 +02:00
Ines Montani	631e20d0c6	Fix test and schemas	2020-05-21 19:01:02 +02:00
Ines Montani	d34fc0915e	Remove serialization getter	2020-05-21 18:48:21 +02:00
Ines Montani	f44897e4c6	Update warning IDs	2020-05-21 18:39:11 +02:00
Ines Montani	24f72c669c	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
Ines Montani	c6ec19c844	Add missing declaration	2020-05-21 17:30:05 +02:00
Matthew Honnibal	884d9b060d	Merge pull request #5466 from adrianeboyd/feature/omit-extra-lexeme-info Add option to omit extra lexeme tables in CLI	2020-05-21 16:40:02 +02:00
Matthew Honnibal	e6c4c1a507	Merge pull request #5468 from adrianeboyd/feature/cli-conllu-misc-ner Improve handling of NER in CoNLL-U MISC	2020-05-21 16:39:46 +02:00
Matthew Honnibal	26cd6a0229	Merge pull request #5462 from adrianeboyd/feature/lemmatizer-all-upos Extend lemmatizer rules for all UPOS tags	2020-05-21 16:05:31 +02:00
Matthew Honnibal	cad9b290a2	Merge branch 'master' into feature/omit-extra-lexeme-info	2020-05-21 16:04:24 +02:00
Matthew Honnibal	1f572ce89b	Merge pull request #5473 from explosion/fix/travis-tests Fix Python 2.7 compat	2020-05-21 15:56:16 +02:00
Ines Montani	a9cb2882cb	Rename argument: doc_or_span/obj -> doclike (#5463 ) * doc_or_span -> obj * Revert "doc_or_span -> obj" This reverts commit `78bb9ff5e0`. * obj -> doclike * Refer to correct object	2020-05-21 15:17:39 +02:00
Ines Montani	bea863acd2	Fix naming conflict and formatting	2020-05-21 14:24:38 +02:00
Ines Montani	bd6353715a	Merge branch 'master' into fix/travis-tests	2020-05-21 14:23:04 +02:00
Ines Montani	d8f3190c0a	Tidy up and auto-format	2020-05-21 14:14:01 +02:00
Ines Montani	56de520afd	Try to fix tests on Travis (2.7)	2020-05-21 14:04:57 +02:00
adrianeboyd	d45602bc11	Merge branch 'master' into feature/omit-extra-lexeme-info	2020-05-21 10:26:01 +02:00
svlandeg	b221bcf1ba	fixing all languages	2020-05-21 00:17:28 +02:00
svlandeg	b509a3e7fc	fix: use actual range in 'seen' instead of subtree	2020-05-20 23:06:39 +02:00
svlandeg	36a94c409a	failing test to reproduce overlapping spans problem	2020-05-20 23:06:03 +02:00
adrianeboyd	49ef06d793	Add option for base model in init-model CLI (#5467 ) Intended for languages like Chinese with a custom tokenizer.	2020-05-20 18:49:11 +02:00
Adriane Boyd	4b229bfc22	Improve handling of NER in CoNLL-U MISC	2020-05-20 18:48:51 +02:00
Matthew Honnibal	609c0ba557	Fix accidentally quadratic runtime in Example.split_sents (#5464 ) * Tidy up train-from-config a bit * Fix accidentally quadratic perf in TokenAnnotation.brackets When we're reading in the gold data, we had a nested loop where we looped over the brackets for each token, looking for brackets that start on that word. This is accidentally quadratic, because we have one bracket per word (for the POS tags). So we had an O(N^2) behaviour here that ended up being pretty slow. To solve this I'm indexing the brackets by their starting word on the TokenAnnotations object, and having a property to provide the previous view. Fixes	2020-05-20 18:48:18 +02:00
Adriane Boyd	daaa7bf451	Add option to omit extra lexeme tables in CLI	2020-05-20 15:51:44 +02:00
Adriane Boyd	8cba0e41d8	Return lowercase form as default except for PROPN	2020-05-20 15:35:08 +02:00
adrianeboyd	9393253b66	Remove peeking from Parser.begin_training (#5456 ) Inspect all instances in `Parser.begin_training` rather than only the first 1000.	2020-05-20 15:18:06 +02:00
Matthw Honnibal	fda7355508	Fix train-from-config	2020-05-20 12:30:21 +02:00
Matthw Honnibal	24efd54a42	Merge from develop	2020-05-20 12:27:31 +02:00
Sofie Van Landeghem	7f5715a081	Various fixes to NEL functionality, Example class etc (#5460 ) * setting KB in the EL constructor, similar to how the model is passed on * removing wikipedia example files - moved to projects * throw an error when nlp.update is called with 2 positional arguments * rewriting the config logic in create pipe to accommodate for other objects (e.g. KB) in the config * update config files with new parameters * avoid training pipeline components that don't have a model (like sentencizer) * various small fixes + UX improvements * small fixes * set thinc to 8.0.0a9 everywhere * remove outdated comment	2020-05-20 11:41:12 +02:00
Adriane Boyd	4fa9670537	Extend lemmatizer rules for all UPOS tags	2020-05-20 10:15:43 +02:00
Matthew Honnibal	664a3603b0	Set version to v3.0.0.dev8	2020-05-19 17:15:39 +02:00
adrianeboyd	40e65d6f63	Fix most_similar for vectors with unused rows (#5348 ) * Fix most_similar for vectors with unused rows Address issues related to the unused rows in the vector table and `most_similar`: * Update `most_similar()` to search only through rows that are in use according to `key2row`. * Raise an error when `most_similar(n=n)` is larger than the number of vectors in the table. * Set and restore `_unset` correctly when vectors are added or deserialized so that new vectors are added in the correct row. * Set data and keys to the same length in `Vocab.prune_vectors()` to avoid spurious entries in `key2row`. * Fix regression test using `most_similar` Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-19 16:41:26 +02:00
Sofie Van Landeghem	f00de445dd	default models defined in component decorator (#5452 ) * move defaults to pipeline and use in component decorator * black formatting * relative import	2020-05-19 16:20:03 +02:00
adrianeboyd	70da1fd2d6	Add warning for misaligned character offset spans (#5007 ) * Add warning for misaligned character offset spans * Resolve conflict * Filter warnings in example scripts Filter warnings in example scripts to show warnings once, in particular warnings about misaligned entities. Co-authored-by: Ines Montani <ines@ines.io>	2020-05-19 16:01:18 +02:00
adrianeboyd	0061992d95	Update Polish tokenizer for UD_Polish-PDB (#5432 ) Update Polish tokenizer for UD_Polish-PDB, which is a relatively major change from the existing tokenizer. Unused exceptions files and conflicting test cases removed. Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-19 15:59:55 +02:00
adrianeboyd	a5cd203284	Reduce stored lexemes data, move feats to lookups (#5238 ) * Reduce stored lexemes data, move feats to lookups * Move non-derivable lexemes features (`norm / cluster / prob`) to `spacy-lookups-data` as lookups * Get/set `norm` in both lookups and `LexemeC`, serialize in lookups * Remove `cluster` and `prob` from `LexemesC`, get/set/serialize in lookups only * Remove serialization of lexemes data as `vocab/lexemes.bin` * Remove `SerializedLexemeC` * Remove `Lexeme.to_bytes/from_bytes` * Modify normalization exception loading: * Always create `Vocab.lookups` table `lexeme_norm` for normalization exceptions * Load base exceptions from `lang.norm_exceptions`, but load language-specific exceptions from lookups * Set `lex_attr_getter[NORM]` including new lookups table in `BaseDefaults.create_vocab()` and when deserializing `Vocab` * Remove all cached lexemes when deserializing vocab to override existing normalizations with the new normalizations (as a replacement for the previous step that replaced all lexemes data with the deserialized data) * Skip English normalization test Skip English normalization test because the data is now in `spacy-lookups-data`. * Remove norm exceptions Moved to spacy-lookups-data. * Move norm exceptions test to spacy-lookups-data * Load extra lookups from spacy-lookups-data lazily Load extra lookups (currently for cluster and prob) lazily from the entry point `lg_extra` as `Vocab.lookups_extra`. * Skip creating lexeme cache on load To improve model loading times, do not create the full lexeme cache when loading. The lexemes will be created on demand when processing. * Identify numeric values in Lexeme.set_attrs() With the removal of a special case for `PROB`, also identify `float` to avoid trying to convert it with the `StringStore`. * Skip lexeme cache init in from_bytes * Unskip and update lookups tests for python3.6+ * Update vocab pickle to include lookups_extra * Update vocab serialization tests Check strings rather than lexemes since lexemes aren't initialized automatically, account for addition of "_SP". * Re-skip lookups test because of python3.5 * Skip PROB/float values in Lexeme.set_attrs * Convert is_oov from lexeme flag to lex in vectors Instead of storing `is_oov` as a lexeme flag, `is_oov` reports whether the lexeme has a vector. Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-19 15:59:14 +02:00
Sofie Van Landeghem	0d94737857	Feature toggle_pipes (#5378 ) * make disable_pipes deprecated in favour of the new toggle_pipes * rewrite disable_pipes statements * update documentation * remove bin/wiki_entity_linking folder * one more fix * remove deprecated link to documentation * few more doc fixes * add note about name change to the docs * restore original disable_pipes * small fixes * fix typo * fix error number to W096 * rename to select_pipes * also make changes to the documentation Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-18 22:27:10 +02:00
Matthew Honnibal	333b1a308b	Adapt parser and NER for transformers (#5449 ) * Draft layer for BILUO actions * Fixes to biluo layer * WIP on BILUO layer * Add tests for BILUO layer * Format * Fix transitions * Update test * Link in the simple_ner * Update BILUO tagger * Update __init__ * Import simple_ner * Update test * Import * Add files * Add config * Fix label passing for BILUO and tagger * Fix label handling for simple_ner component * Update simple NER test * Update config * Hack train script * Update BILUO layer * Fix SimpleNER component * Update train_from_config * Add biluo_to_iob helper * Add IOB layer * Add IOBTagger model * Update biluo layer * Update SimpleNER tagger * Update BILUO * Read random seed in train-from-config * Update use of normal_init * Fix normalization of gradient in SimpleNER * Update IOBTagger * Remove print * Tweak masking in BILUO * Add dropout in SimpleNER * Update thinc * Tidy up simple_ner * Fix biluo model * Unhack train-from-config * Update setup.cfg and requirements * Add tb_framework.py for parser model * Try to avoid memory leak in BILUO * Move ParserModel into spacy.ml, avoid need for subclass. * Use updated parser model * Remove incorrect call to model.initialize in PrecomputableAffine * Update parser model * Avoid divide by zero in tagger * Add extra dropout layer in tagger * Refine minibatch_by_words function to avoid oom * Fix parser model after refactor * Try to avoid div-by-zero in SimpleNER * Fix infinite loop in minibatch_by_words * Use SequenceCategoricalCrossentropy in Tagger * Fix parser model when hidden layer * Remove extra dropout from tagger * Add extra nan check in tagger * Fix thinc version * Update tests and imports * Fix test * Update test * Update tests * Fix tests * Fix test Co-authored-by: Ines Montani <ines@ines.io>	2020-05-18 22:23:33 +02:00
Ines Montani	a41e28ceba	Merge pull request #5436 from ilivans/fix_errors_with_codes	2020-05-18 10:45:56 +02:00
Ilkyu Ju	72a25c9cef	Very minor issues in Korean example sentences (#5446 ) * Add contributor agreement * Improve ko translation of example sentences I fixed unnatural translations and word spacing errors. * Update osori.md	2020-05-17 13:43:34 +02:00
svlandeg	6fb6a8518c	bump to 3.0.0.dev7 and thinc to 8.0.0a8	2020-05-15 13:25:54 +02:00
svlandeg	047f3d7d94	remove ops argument for Adam	2020-05-15 13:25:00 +02:00
svlandeg	e0fda2bd81	throw warning when model_cfg is None	2020-05-15 11:02:10 +02:00
adrianeboyd	908dea3939	Skip duplicate lexeme rank setting (#5401 ) Skip duplicate lexeme rank setting within `_fix_pretrained_vectors_name()`.	2020-05-14 18:26:12 +02:00
adrianeboyd	f49e2810e6	Add Polish lemmatizer (#5413 ) * Add Polish lemmatizer Contributed by @ryszardtuora * Add missing import	2020-05-14 18:23:19 +02:00
adrianeboyd	e63880e081	Use Token.sent_start for Span.sent (#5439 ) Use `Token.sent_start` for sentence boundaries in `Span.sent` so that `Doc.sents` and `Span.sent` return the same sentence boundaries.	2020-05-14 18:22:51 +02:00
adrianeboyd	780b869345	Fix syntax iterators for Persian (#5437 )	2020-05-14 16:51:03 +02:00
Ilia Ivanov	712d9d4820	fixup! Fix ErrorsWithCodes().__class__ return value	2020-05-14 15:45:58 +02:00
Ilia Ivanov	a987e9e45d	Fix ErrorsWithCodes().__class__ return value	2020-05-14 14:14:15 +02:00
Vishnu Priya VR	9ce059dd06	Limiting noun_chunks for specific languages (#5396 ) * Limiting noun_chunks for specific languages * Limiting noun_chunks for specific languages Contributor Agreement * Addressing review comments * Removed unused fixtures and imports * Add fa_tokenizer in test suite * Use fa_tokenizer in test * Undo extraneous reformatting Co-authored-by: adrianeboyd <adrianeboyd@gmail.com>	2020-05-14 12:58:06 +02:00
Sofie Van Landeghem	b04738903e	prevent None in gold fields (#5425 ) * set gold fields to empty list instead of keeping them as None * add unit test	2020-05-13 22:08:50 +02:00
adrianeboyd	113e7981d0	Check that row is within bounds when adding vector (#5430 ) Check that row is within bounds for the vector data array when adding a vector. Don't add vectors with rank OOV_RANK in `init-model` (change is due to shift from OOV as 0 to OOV as OOV_RANK).	2020-05-13 22:08:28 +02:00
adrianeboyd	07639dd6ac	Remove TAG from da/sv tokenizer exceptions (#5428 ) Remove `TAG` value from Danish and Swedish tokenizer exceptions because it may not be included in a tag map (and these settings are problematic as tokenizer exceptions anyway).	2020-05-13 10:25:54 +02:00
adrianeboyd	24e7108f80	Modify array type to accommodate OOV_RANK (#5429 ) Modify indices array type in `Vocab.prune_vectors` to accommodate OOV_RANK index as max(uint64).	2020-05-13 10:25:05 +02:00
svlandeg	102c8c7e2f	fix fan_in renaming	2020-05-12 13:56:10 +02:00
adrianeboyd	440b81bddc	Improve exceptions for 'd (would/had) in English (#5379 ) Instead of treating `'d` in contractions like `I'd` as `would` in all cases in the tokenizer exceptions, leave the tagging and lemmatization up to later components.	2020-05-08 15:10:57 +02:00
adrianeboyd	c963e269ba	Add method to update / reset pkuseg user dict (#5404 )	2020-05-08 11:21:46 +02:00
Samuel Rodríguez Medina	5e55bfa821	Fixed tests for Swedish that were written in Danish. (#5395 )	2020-05-05 14:06:27 +02:00
Adriane Boyd	565e0eef73	Add tokenizer option for token match with affixes To fix the slow tokenizer URL (#4374) and allow `token_match` to take priority over prefixes and suffixes by default, introduce a new tokenizer option for a token match pattern that's applied after prefixes and suffixes but before infixes.	2020-05-05 10:35:33 +02:00
Adriane Boyd	792c8af8cf	Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match	2020-05-05 09:25:57 +02:00
Matthew Honnibal	eb117e2fce	Add load_config_from_str helper	2020-05-02 14:09:21 +02:00
adrianeboyd	c045a9c7f6	Fix logic in train CLI timing eval on GPU (#5387 ) Run CPU timing in first iteration only	2020-05-01 12:05:33 +02:00
Samuel Rodríguez Medina	148b036e0c	Spanish like num improvement (#5381 ) * Add tests for Spanish like_num. * Add missing numbers in Spanish lexical attributes for like_num. * Modify Spanish test function name. * Add contributor agreement.	2020-04-30 11:13:23 +02:00
Samuel Rodríguez Medina	8602daba85	Swedish like_num (#5371 ) * Sign contributor agreement. * Add like_num functionality to Swedish. * Update spacy/tests/lang/sv/test_lex_attrs.py Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update contributor agreement Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2020-04-29 21:25:22 +02:00
adrianeboyd	74da669326	Fix problems with lower and whitespace in variants (#5361 ) * Initialize lower flag explicitly * Handle whitespace words from GoldParse correctly when creating raw text with orth variants * Return the text with original casing if anything goes wrong	2020-04-29 13:01:25 +02:00
adrianeboyd	3f43c73d37	Normalize TokenC.sent_start values for Matcher (#5346 ) Normalize TokenC.sent_start values to booleans for the `Matcher`.	2020-04-29 12:57:30 +02:00
adrianeboyd	bdff76dede	Various updates/additions to CLI scripts (#5362 ) * `debug-data`: determine coverage of provided vectors * `evaluate`: support `blank:lg` model to make it possible to just evaluate tokenization * `init-model`: add option to truncate vectors to N most frequent vectors from word2vec file * `train`: * if training on GPU, only run evaluation/timing on CPU in the first iteration * if training is aborted, exit with a non-0 exit status	2020-04-29 12:56:46 +02:00
Sofie Van Landeghem	cfdaf99b80	Fix passing of component configuration (#5374 ) * add kwargs to to_disk methods in docs - otherwise crashes on 'exclude' argument * add fix and test for Issue 5137	2020-04-29 12:56:17 +02:00
Ines Montani	efec28ce70	Merge pull request #5367 from adrianeboyd/feature/simplify-warnings-v2	2020-04-29 12:55:37 +02:00
Sofie Van Landeghem	f67343295d	Update NEL examples and documentation (#5370 ) * simplify creation of KB by skipping dim reduction * small fixes to train EL example script * add KB creation and NEL training example scripts to example section * update descriptions of example scripts in the documentation * moving wiki_entity_linking folder from bin to projects * remove test for wiki NEL functionality that is being moved	2020-04-29 12:53:53 +02:00
adrianeboyd	a6e521cd79	Add is_sent_end token property (#5375 ) Reconstruction of the original PR #4697 by @MiniLau. Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema because the Matcher is only going to be able to support `IS_SENT_START`.	2020-04-29 12:53:16 +02:00
Ines Montani	eac47971f1	Merge pull request #5258 from mirfan899/master	2020-04-29 12:51:55 +02:00
adrianeboyd	d5f18f8307	Add missing import	2020-04-28 14:01:29 +02:00
adrianeboyd	ac40a8f7a5	Add missing import	2020-04-28 14:00:11 +02:00
Adriane Boyd	3a045572ed	Add missing import	2020-04-28 13:48:37 +02:00
Adriane Boyd	bc39f97e11	Simplify warnings	2020-04-28 13:37:37 +02:00
adrianeboyd	f8ac5b9f56	bugfix in span similarity (#5155 ) (#5358 ) * bugfix in span similarity * also rewrite doc.pyx for clarity * formatting Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2020-04-27 16:51:27 +02:00
Sofie Van Landeghem	9203d821ae	Add 2 ini files in tests/lang (#5359 )	2020-04-27 13:01:54 +02:00
Punitvara	b2b7e1f37a	This PR adds Gujarati Language class along with (#5355 ) * This PR adds Gujarati Language class along with - stop words * Add test for gu tokenizer	2020-04-27 11:07:37 +02:00
sabiqueqb	fc91660aa2	Gh 5339 language class for malayalam (#5342 ) * Initialize Malayalam Language class * Add lex_attrs and examples for Malayalam * Add spaCy Contributor Agreement * Add test for ml tokenizer	2020-04-27 09:45:08 +02:00
adrianeboyd	84e06f9fb7	Improve GoldParse NER alignment (#5335 ) Improve GoldParse NER alignment by including all cases where the start and end of the NER span can be aligned, regardless of internal tokenization differences. To do this, convert BILUO tags to character offsets, check start/end alignment with `doc.char_span()`, and assign the BILUO tags for the aligned spans. Alignment for `O/-` tags is handled through the one-to-one and multi alignments.	2020-04-23 16:58:23 +02:00
adrianeboyd	521f361052	Switch to new gold.align method (#5334 ) * Switch from original `_align` to new simpler alignment algorithm from #4526 * Remove alignment normalizations beyond whitespace and lowercasing	2020-04-21 19:31:03 +02:00
Matthew Honnibal	b2ef6100af	Only run backprop once when shared tok2vec weights (#5331 ) Previously, pipelines with shared tok2vec weights would call the tok2vec backprop callback multiple times, once for each pipeline component. This caused errors for PyTorch, and was inefficient. Instead, accumulate the gradient for all but one component, and just call the callback once.	2020-04-21 19:30:41 +02:00
adrianeboyd	bf5c13d170	Modify jieba install message (#5328 ) Modify jieba install message to instruct the user to use `ChineseDefaults.use_jieba = False` so that it's possible to load pkuseg-only models without jieba installed.	2020-04-20 22:06:53 +02:00
Matthew Honnibal	6918d99b6c	Improve GPU usage for train-with-config (#5330 ) * Adjust for no ops in Optimizer * Fix gpu in train-from-config * Update train-from-config script * Fix parser * Fix GPU efficiency of padding backprop	2020-04-20 22:06:28 +02:00
adrianeboyd	f7471abd82	Add pkuseg and serialization support for Chinese (#5308 ) * Add pkuseg and serialization support for Chinese Add support for pkuseg alongside jieba * Specify model through `Language` meta: * split on characters (if no word segmentation packages are installed) ``` Chinese(meta={"tokenizer": {"config": {"use_jieba": False, "use_pkuseg": False}}}) ``` * jieba (remains the default tokenizer if installed) ``` Chinese() Chinese(meta={"tokenizer": {"config": {"use_jieba": True}}}) # explicit ``` * pkuseg ``` Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}}}) ``` * The new tokenizer setting `require_pkuseg` is used to override `use_jieba` default, which is intended for models that provide a pkuseg model: ``` nlp_pkuseg = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "require_pkuseg": True}}}) nlp = Chinese() # has `use_jieba` as `True` by default nlp.from_bytes(nlp_pkuseg.to_bytes()) # `require_pkuseg` overrides `use_jieba` when calling the tokenizer ``` Add support for serialization of tokenizer settings and pkuseg model, if loaded * Add sorting for `Language.to_bytes()` serialization of `Language.meta` so that the (emptied, but still present) tokenizer metadata is in a consistent position in the serialized data Extend tests to cover all three tokenizer configurations and serialization * Fix from_disk and tests without jieba or pkuseg * Load cfg first and only show error if `use_pkuseg` * Fix blank/default initialization in serialization tests * Explicitly initialize jieba's cache on init * Add serialization for pkuseg pre/postprocessors * Reformat pkuseg install message	2020-04-18 17:01:53 +02:00
Jakob Jul Elben	663333c3b2	Fixes #5413 (#5315 ) * Fix 5314 * Add contributor * Resolve requested changes Co-authored-by: Jakob Jul Elben <jakob@datamaga.com>	2020-04-16 13:29:02 +02:00
Leander Fiedler	a3401b1194	issue5230 changed reference to function to anonymous function	2020-04-15 21:52:52 +02:00
Leander Fiedler	cef0c909b9	issue5230 changed reference to function to anonymous function	2020-04-15 19:28:33 +02:00
Paolo Arduin	1ca32d8f9c	Matcher support for Span as well as Doc (#5113 ) * Matcher support for Span, as well as Doc #5056 * Removes an import unused * Signed contributors agreement * Code optimization and better test * Add error message for bad Matcher call argument * Fix merging	2020-04-15 13:51:33 +02:00
adrianeboyd	98c59027ed	Use max(uint64) for OOV lexeme rank (#5303 ) * Use max(uint64) for OOV lexeme rank * Add test for default OOV rank * Revert back to thinc==7.4.0 Requiring the updated version of thinc was unnecessary. * Define OOV_RANK in one place Define OOV_RANK in one place in `util`. * Fix formatting [ci skip] * Switch to external definitions of max(uint64) Switch to external definitions of max(uint64) and confirm that they are equal.	2020-04-15 13:49:47 +02:00
adrianeboyd	3d2c308906	Add Doc init from list of words and text (#5251 ) * Add Doc init from list of words and text Add an option to initialize a `Doc` from a text and list of words where the words may or may not include all whitespace tokens. If the text and words are mismatched, raise an error. * Fix error code * Remove all whitespace before aligning words/text * Move words/text init to util function * Update error message * Rename to get_words_and_spaces * Fix formatting	2020-04-14 19:15:52 +02:00
Paolo Arduin	8ce408d2e1	Comparison predicate handling for `!=` (#5282 ) * Fix #5281 * Optim test	2020-04-14 19:14:15 +02:00
Leander Fiedler	6700006830	issue5230 attempted fix of pytest segfault for python3.5	2020-04-12 09:34:54 +02:00
Leander Fiedler	d60e2d3ebf	issue5230 added unit test for dumping and loading knowledgebase	2020-04-12 09:08:41 +02:00
Leander Fiedler	d2bb649227	issue5230 filter warnings in addition to filterwarnings to prevent deprecation warnings in python35(win) setup to pop up	2020-04-10 23:21:13 +02:00
Leander Fiedler	ca2a7a44db	issue5230 store string values of warnings to remotely debug failing python35(win) setup	2020-04-10 22:26:55 +02:00
Leander Fiedler	88ca40a15d	issue5230 raise warnings as errors to remotely debug failing python35(win) setup	2020-04-10 21:45:53 +02:00
Leander Fiedler	a7bdfe42e1	issue5230 added print statement to warnings filter to remotely debug failing python35(win) setup	2020-04-10 21:14:33 +02:00
Leander Fiedler	8c1d0d628f	issue5230 writer now checks instance of loc parameter before trying to operate on it	2020-04-10 20:35:52 +02:00
Umar Butler	8952effcc4	Fixed Typo in Warning (#5284 ) * Fixed typo in cli warning Fixed a typo in the warning for the provision of exactly two labels, which have not been designated as binary, to textcat. * Create and signed contributor form	2020-04-09 15:46:15 +02:00
Sofie Van Landeghem	42364dcd9f	Remove "pala" tokenizer exception for Spanish (#5265 )	2020-04-09 10:21:20 +02:00
adrianeboyd	cf579a398d	Add __init__.py to eu and hy tests (#5278 )	2020-04-08 20:03:06 +02:00
adrianeboyd	ae4af52ce7	Add ideographic stops to sentencizer (#5263 ) Add ideographic half- and fullwidth full stops to default sentencizer punctuation.	2020-04-08 12:58:39 +02:00
adrianeboyd	fa760010a5	Set rank for new vector in Vocab.set_vector (#5266 ) Set `Lexeme.rank` for vectors added with `Vocab.set_vector` so that the lexeme `ID` accessed by a model points the right row for the new vector.	2020-04-07 12:04:51 +02:00
lfiedler	e1e25c7e30	issue5230: added unittest test case for completion	2020-04-06 21:36:02 +02:00
Leander Fiedler	cde96f6c64	issue5230: optimized unit test a bit	2020-04-06 20:51:12 +02:00
Leander Fiedler	71cc903d65	issue5230: replaced open statements on path objects so that serialization still works and files are closed	2020-04-06 20:30:41 +02:00
Leander Fiedler	273ed452bb	issue5230: added unicode declaration at top of the file	2020-04-06 19:22:32 +02:00
Leander Fiedler	1cd975d4a5	issue5230: fixed resource warnings in language	2020-04-06 18:54:32 +02:00
Leander Fiedler	493c77462a	issue5230: test cases covering known sources of resource warnings	2020-04-06 18:46:51 +02:00
adrianeboyd	c981aa6684	Use inline flags in token_match patterns (#5257 ) * Use inline flags in token_match patterns Use inline flags in `token_match` patterns so that serializing does not lose the flag information. * Modify inline flag * Modify inline flag	2020-04-06 13:19:04 +02:00
adrianeboyd	e8be15e9b7	Improve tokenization for UD Spanish AnCora (#5253 )	2020-04-06 13:18:23 +02:00
adrianeboyd	f4ef64a526	Improve tokenization for UD Dutch corpora (#5259 ) * Improve tokenization for UD Dutch corpora Improve tokenization for UD Dutch Alpino and LassySmall. * Format Dutch tokenizer exceptions	2020-04-06 13:18:07 +02:00
Muhammad Irfan	406d5748b3	add missing Urdu tags	2020-04-05 20:55:38 +05:00
Sofie Van Landeghem	b2e93be867	Optimizer defaults (#5244 ) * set optimizer defaults to mimic thinc 7 + bump to dev6 * larger error range for senter overfitting test	2020-04-03 13:02:46 +02:00
YohannesDatasci	beef184e53	Armenian language support (#5246 ) * add Armenian language and test cases * agreement submission	2020-04-03 13:02:18 +02:00
Michael Leichtfried	2b14997b68	Remove duplicated branch in if/else-if statement (#5234 ) * Remove duplicated branch in if-elif-statement * Add contributor agreement for leicmi	2020-04-02 14:47:42 +02:00
adrianeboyd	b71a11ff6d	Update morphologizer (#5108 ) * Add pos and morph scoring to Scorer Add pos, morph, and morph_per_type to `Scorer`. Report pos and morph accuracy in `spacy evaluate`. * Update morphologizer for v3 * switch to tagger-based morphologizer * use `spacy.HashCharEmbedCNN` for morphologizer defaults * add `Doc.is_morphed` flag * Add morphologizer to train CLI * Add basic morphologizer pipeline tests * Add simple morphologizer training example * Remove subword_features from CharEmbed models Remove `subword_features` argument from `spacy.HashCharEmbedCNN.v1` and `spacy.HashCharEmbedBiLSTM.v1` since in these cases `subword_features` is always `False`. * Rename setting in morphologizer example Use `with_pos_tags` instead of `without_pos_tags`. * Fix kwargs for spacy.HashCharEmbedBiLSTM.v1 * Remove defaults for spacy.HashCharEmbedBiLSTM.v1 Remove default `nM/nC` for `spacy.HashCharEmbedBiLSTM.v1`. * Set random seed for textcat overfitting test	2020-04-02 14:46:32 +02:00
adrianeboyd	d107afcffb	Raise error for inplace resize with new vector dim (#5228 ) Raise an error if there is an attempt to resize the vectors in place with a different vector dimension.	2020-04-02 10:43:13 +02:00
Jacob Lauritzen	0b76212831	Extend and fix Danish examples (#5227 ) * Extend and fix Danish examples This PR fixes two examples, adds additional examples translated from the English version, and adds punctuation. The two changed examples are: * "fortov" changed to "fortovet", which is more [used](https://www.google.com/search?client=firefox-b-d&sxsrf=ALeKk0143gEuPe4IbIUpzBBt-oU10OMVqA%3A1585549036477&ei=7I6BXuvJHMGOrwSqi46oCQ&q=l%C3%B8behjul+p%C3%A5+fortov&oq=l%C3%B8behjul+p%C3%A5+fortov&gs_lcp=CgZwc3ktYWIQAzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQR1DT8xZY0_MWYK_0FmgAcAZ4AIABAIgBAJIBAJgBAKABAaoBB2d3cy13aXo&sclient=psy-ab&ved=0ahUKEwjr7964xsHoAhVBx4sKHaqFA5UQ4dUDCAo&uact=5) and more natural. The Swedish and Norwegian examples also use this version of the word. * "stor by" changed to "storby". In Danish we have a specific noun to describe a large, metropolitan city which is different from just describing a city as "large". In this sentence it would be much more natural to describe London as a "storby". Google even corrects a search for "London stor by" to "London storby". * Sign contrib agreement	2020-04-02 10:42:35 +02:00
Sofie Van Landeghem	ab59f3124e	fix NEL overfitting test for GPU (#5236 )	2020-04-02 10:32:52 +02:00
Sofie Van Landeghem	311133e579	Train textcat with config (#5143 ) * bring back default build_text_classifier method * remove _set_dims_ hack in favor of proper dim inference * add tok2vec initialize to unit test * small fixes * add unit test for various textcat config settings * logistic output layer does not have nO * fix window_size setting * proper fix * fix W initialization * Update textcat training example * Use ml_datasets * Convert training data to `Example` format * Use `n_texts` to set proportionate dev size * fix _init renaming on latest thinc * avoid setting a non-existing dim * update to thinc==8.0.0a2 * add BOW and CNN defaults for easy testing * various experiments with train_textcat script, fix softmax activation in textcat bow * allow textcat train script to work on other datasets as well * have dataset as a parameter * train textcat from config, with example config * add config for training textcat * formatting * fix exclusive_classes * fixing BOW for GPU * bump thinc to 8.0.0a3 (not published yet so CI will fail) * add in link_vectors_to_models which got deleted Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-03-29 19:40:36 +02:00
adrianeboyd	ce0e538068	Check whether doc is instantiated in Example.get_gold_parses() (#5167 ) * Check whether doc is instantiated When creating docs to pair with gold parses, modify test to check whether a doc is unset rather than whether it contains tokens. * Restore test of evaluate on an empty doc * Set a minimal gold.orig for the scorer Without a minimal gold.orig the scorer can't evaluate empty docs. This is the v3 equivalent of #4925.	2020-03-29 13:57:00 +02:00
Sofie Van Landeghem	d6d95674c1	bugfix in span similarity (#5155 ) * bugfix in span similarity * also rewrite doc.pyx for clarity * formatting	2020-03-29 13:56:07 +02:00
Nikhil Saldanha	4f27a24f5b	Add kannada examples (#5162 ) * Add example sentences for Kannada * sign contributor agreement	2020-03-29 13:54:42 +02:00
adrianeboyd	d47b810ba4	Fix exclusive_classes in textcat ensemble (#5166 ) Pass the exclusive_classes setting to the bow model within the ensemble textcat model.	2020-03-29 13:52:34 +02:00
adrianeboyd	963bd890c1	Modify Vector.resize to work with cupy and improve resizing (#5216 ) * Modify Vector.resize to work with cupy Modify `Vectors.resize` to work with cupy. Modify behavior when resizing to a different vector dimension so that individual vectors are truncated or extended with zeros instead of having the original values filled into the new shape without regard for the original axes. * Update spacy/tests/vocab_vectors/test_vectors.py Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com> Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-03-29 13:51:20 +02:00
Sofie Van Landeghem	1f9852abc3	Fix parser @ GPU (#5210 ) * ensure self.bias is numpy array in parser model * 2 more little bug fixes for parser on GPU * removing testing GPU statement * remove commented code	2020-03-28 23:09:35 +01:00
Sofie Van Landeghem	9b412516e7	Fixing pickling of the parser (#5218 ) * fix __reduce__ for pickling parser * setting the move object as 'state' during pickling * unskip test_issue4725 - works again	2020-03-27 19:35:26 +01:00
Ines Montani	92b9b631ef	xfail -> skip	2020-03-27 10:51:32 +01:00
Ines Montani	ee4bb0e3b6	Fix import	2020-03-26 21:44:18 +01:00
Ines Montani	4fe2299586	xfail hanging test	2020-03-26 20:58:13 +01:00
Ines Montani	f12a46472c	Remove unicode declarations	2020-03-26 15:18:32 +01:00
Ines Montani	7453df79d1	Fix argument	2020-03-26 14:09:02 +01:00
Ines Montani	e7341db5dc	Add sent_start to pattern schema	2020-03-26 14:05:40 +01:00
Ines Montani	70ee4ef4fd	Fix small errors	2020-03-26 13:47:31 +01:00
Ines Montani	46568f40a7	Merge branch 'master' into tmp/sync	2020-03-26 13:38:14 +01:00
adrianeboyd	8d3563f1c4	Minor bugfixes for train CLI (#5186 ) * Omit per_type scores from model-best calculations The addition of per_type scores to the included metrics (#4911) causes errors when they're compared while determining the best model, so omit them for this `max()` comparison. * Add default speed data for interrupted train CLI Add better speed meta defaults so that an interrupted iteration still produces a best model. Co-authored-by: Ines Montani <ines@ines.io>	2020-03-26 10:46:50 +01:00
adrianeboyd	a04f802099	Fix GoldParse init when token count differs (#5191 ) Fix the `GoldParse` initialization when the number of tokens has changed (due to merging subtokens with the parser).	2020-03-26 10:46:23 +01:00
adrianeboyd	d88a377bed	Remove Vectors.from_glove (#5209 )	2020-03-26 10:45:47 +01:00
Ines Montani	828acffc12	Tidy up and auto-format	2020-03-25 12:28:12 +01:00
adrianeboyd	86c43e55fa	Improve Lithuanian tokenization (#5205 ) * Improve Lithuanian tokenization Modify Lithuanian tokenization to improve performance for UD_Lithuanian-ALKSNIS. * Update Lithuanian tokenizer tests	2020-03-25 11:28:12 +01:00
adrianeboyd	1a944e5976	Improve Italian tokenization (#5204 ) Improve Italian tokenization for UD_Italian-ISDT.	2020-03-25 11:28:02 +01:00
adrianeboyd	923a453449	Modifications/updates to Portuguese tokenization (#5203 ) Modifications to Portuguese tokenization for UD_Portuguese-Bosque. Instead of splitting contractions as exceptions, they are kept as merged tokens.	2020-03-25 11:27:53 +01:00
adrianeboyd	4117a5c705	Improve French tokenization (#5202 ) Improve French tokenization for UD_French-Sequoia.	2020-03-25 11:27:42 +01:00
Ines Montani	a3d09ffe61	Merge pull request #5201 from adrianeboyd/feature/ud-tokenization-nb-v2 Improved tokenization for UD_Norwegian-Bokmaal	2020-03-25 11:27:31 +01:00
Sofie Van Landeghem	218e1706ac	Bugfix linking vectors (#5196 ) * restore call to _load_vectors * bump to thinc 8.0.0a3 * bump to 3.0.0.dev4	2020-03-25 10:20:11 +01:00
Adriane Boyd	09d442f5ad	Merge remote-tracking branch 'upstream/master' into feature/ud-tokenization-da	2020-03-25 09:41:52 +01:00
Adriane Boyd	cba2d1d972	Disable failing abbreviation test UD_Danish-DDT has (as far as I can tell) hallucinated periods after abbreviations, so the changes are an artifact of the corpus and not due to anything meaningful about Danish tokenization.	2020-03-25 09:39:26 +01:00
Adriane Boyd	79737adb90	Improved tokenization for UD_Norwegian-Bokmaal	2020-03-25 08:54:02 +01:00
Ines Montani	5f2afa0479	Merge pull request #5185 from adrianeboyd/bugfix/de-punctuation-style Improve German tokenizer settings style	2020-03-24 16:38:32 +01:00
Adriane Boyd	2897a73559	Improve German tokenizer settings style	2020-03-23 19:23:47 +01:00
Baciccin	3b53617a69	Add Ligurian language	2020-03-19 21:37:01 -07:00
Ines Montani	558032017e	Merge pull request #5157 from svlandeg/bugfix/language remove unnecessary itertools call	2020-03-16 15:04:25 +01:00
Ines Montani	c68f20b398	Merge pull request #5146 from adrianeboyd/bugfix/assert-docs-equal-sents Fix sents comparison in test util	2020-03-16 14:59:32 +01:00
svlandeg	fba219f737	remove unnecessary itertools call	2020-03-16 08:31:36 +01:00
svlandeg	59000ee21d	fix serialization of empty doc + unit test	2020-03-13 16:07:56 +01:00
Adriane Boyd	423849f94a	Fix sents comparison in test util Due to changes to `Span` (#5005), spans from different documents are now never equal. Check `Token.is_sent_start` values instead.	2020-03-13 09:25:23 +01:00
Matthew Honnibal	26a90f011b	Set version to v2.2.4	2020-03-12 11:30:41 +01:00
svlandeg	c4d030dbf6	remove accidental commit	2020-03-09 18:10:54 +01:00
svlandeg	1724a4f75b	additional information if doc is empty	2020-03-09 18:08:18 +01:00
Adriane Boyd	1139247532	Revert changes to token_match priority from #4374 * Revert changes to priority of `token_match` so that it has priority over all other tokenizer patterns * Add lookahead and potentially slow lookbehind back to the default URL pattern * Expand character classes in URL pattern to improve matching around lookaheads and lookbehinds related to #4882 * Revert changes to Hungarian tokenizer * Revert (xfail) several URL tests to their status before #4374 * Update `tokenizer.explain()` and docs accordingly	2020-03-09 12:09:41 +01:00
Ines Montani	1d6aec805d	Fix formatting and update docs for v2.2.4	2020-03-09 11:17:20 +01:00
Mark Abraham	0345135167	Tokenizer to_disk and from_disk now ensure paths (#5116 ) * Tokenizer to_disk and from_disk now ensure strings are converted to paths Fixes #5115 * Sign contributor agreement	2020-03-08 13:25:56 +01:00
Sofie Van Landeghem	5847be6022	Tok2Vec: extract-embed-encode (#5102 ) * avoid changing original config * fix elif structure, batch with just int crashes otherwise * tok2vec example with doc2feats, encode and embed architectures * further clean up MultiHashEmbed * further generalize Tok2Vec to work with extract-embed-encode parts * avoid initializing the charembed layer with Docs (for now ?) * small fixes for bilstm config (still does not run) * rename to core layer * move new configs * walk model to set nI instead of using core ref * fix senter overfitting test to be more similar to the training data (avoid flakey behaviour)	2020-03-08 13:23:18 +01:00
adrianeboyd	993758c58f	Remove unnecessary iterator in Language.pipe (#5101 ) Remove iterator over `raw_texts` with `itertools.tee()` in `Language.pipe` that is never consumed and consumes memory unnecessarily.	2020-03-08 13:22:25 +01:00
Sofie Van Landeghem	1a2b8fc264	set vector of merged entity (#5085 ) * merge_entities sets the vector in the vocab for the merged token * add unit test * import unicode_literals * move code to _merge function * only set vector if vocab has non-zero vectors	2020-03-06 14:45:28 +01:00
adrianeboyd	c95ce96c44	Update sentence recognizer (#5109 ) * Update sentence recognizer * rename `sentrec` to `senter` * use `spacy.HashEmbedCNN.v1` by default * update to follow `Tagger` modifications * remove component methods that can be inherited from `Tagger` * add simple initialization and overfitting pipeline tests * Update serialization test for senter	2020-03-06 14:45:02 +01:00
Sofie Van Landeghem	6ac9fc0619	Unit test for NEL functionality (#5114 ) * empty begin_training for sentencizer * overfitting unit test for entity linker * fixed NEL IO by storing the entity_vector_length in the cfg	2020-03-06 14:42:23 +01:00
Ines Montani	b0cfab317f	Merge branch 'develop' into refactor/simplify-warnings	2020-03-04 16:38:55 +01:00
Muhammad Irfan	224a7f8e94	examples	2020-03-04 15:49:06 +05:00
Muhammad Irfan	03376c9d9b	Basque language added and tested.	2020-03-04 11:58:56 +05:00
adrianeboyd	9be90dbca3	Improve token head verification (#5079 ) * Improve token head verification Improve the verification for valid token heads when heads are set: * in `Token.head`: heads come from the same document * in `Doc.from_array()`: head indices are within the bounds of the document * Improve error message	2020-03-03 21:44:51 +01:00
adrianeboyd	8c20dae6f7	Fix model-final/model-best meta from train CLI (#5093 ) * Fix model-final/model-best meta * include speed and accuracy from final iteration * combine with speeds from base model if necessary * Include token_acc metric for all components	2020-03-03 21:43:25 +01:00
Sofie Van Landeghem	a0998868ff	prevent updating cfg if the Model was already defined (#5078 )	2020-03-03 13:58:56 +01:00
Sofie Van Landeghem	d307e9ca58	take care of global vectors in multiprocessing (#5081 ) * restore load_nlp.VECTORS in the child process * add unit test * fix test * remove unnecessary import * add utf8 encoding * import unicode_literals	2020-03-03 13:58:22 +01:00
adrianeboyd	d078b47c81	Break out of infinite loop as intended (#5077 )	2020-03-03 12:29:05 +01:00
adrianeboyd	697bec764d	Normalize IS_SENT_START to SENT_START for Matcher (#5080 )	2020-03-03 12:22:39 +01:00
adrianeboyd	2281c4708c	Restore empty tokenizer properties (#5026 ) * Restore empty tokenizer properties * Check for types in tokenizer.from_bytes() * Add test for setting empty tokenizer rules	2020-03-02 11:55:02 +01:00
Sofie Van Landeghem	c6b12ab02a	Bugfix/get doc (#5049 ) * new (broken) unit test * fixing get_doc method	2020-03-02 11:49:28 +01:00
Ines Montani	648f61d077	Tidy up compiler flags and imports (#5071 )	2020-03-02 11:48:10 +01:00
Ines Montani	7efaa76168	Update errors.py	2020-02-28 12:23:31 +01:00
Ines Montani	37691e6d5d	Simplify warnings	2020-02-28 12:20:23 +01:00
Ines Montani	5da3ad682a	Tidy up and auto-format	2020-02-28 11:57:41 +01:00
adrianeboyd	65d7bab10f	Initialize all values in a2b/b2a in new align (#5063 )	2020-02-27 18:43:00 +01:00
Sofie Van Landeghem	06f0a8daa0	Default settings to configurations (#4995 ) * fix grad_clip naming * cleaning up pretrained_vectors out of cfg * further refactoring Model init's * move Model building out of pipes * further refactor to require a model config when creating a pipe * small fixes * making cfg in nn_parser more consistent * fixing nr_class for parser * fixing nn_parser's nO * fix printing of loss * architectures in own file per type, consistent naming * convenience methods default_tagger_config and default_tok2vec_config * let create_pipe access default config if available for that component * default_parser_config * move defaults to separate folder * allow reading nlp from package or dir with argument 'name' * architecture spacy.VocabVectors.v1 to read static vectors from file * cleanup * default configs for nel, textcat, morphologizer, tensorizer * fix imports * fixing unit tests * fixes and clean up * fixing defaults, nO, fix unit tests * restore parser IO * fix IO * 'fix' serialization test * add .cfg to manifest fix example configs with additional arguments * replace Morphologizer with Tagger * add IO bit when testing overfitting of tagger (currently failing) * fix IO - don't initialize when reading from disk * expand overfitting tests to also check IO goes OK * remove dropout from HashEmbed to fix Tagger performance * add defaults for sentrec * update thinc * always pass a Model instance to a Pipe * fix piped_added statement * remove obsolete W029 * remove obsolete errors * restore byte checking tests (work again) * clean up test * further test cleanup * convert from config to Model in create_pipe * bring back error when component is not initialized * cleanup * remove calls for nlp2.begin_training * use thinc.api in imports * allow setting charembed's nM and nC * fix for hardcoded nM/nC + unit test * formatting fixes * trigger build	2020-02-27 18:42:27 +01:00
Adriane Boyd	9f740a9891	Add a few more Danish tokenizer exceptions	2020-02-26 14:59:03 +01:00
Ines Montani	1c212215cd	Merge pull request #5064 from adrianeboyd/feature/german-tokenization Improve German tokenization	2020-02-26 13:41:44 +01:00
Adriane Boyd	d1f703d78d	Improve German tokenization Improve German tokenization with respect to Tiger.	2020-02-26 13:06:52 +01:00
Ines Montani	ed9358420e	Merge branch 'master' into pr/5060	2020-02-26 12:51:29 +01:00
adrianeboyd	ff184b7a9c	Add tag_map argument to CLI debug-data and train (#4750 ) (#5038 ) Add an argument for a path to a JSON-formatted tag map, which is used to update and extend the default language tag map.	2020-02-26 12:10:38 +01:00
svlandeg	18ff97589d	update spacy to 2.2.4.dev0	2020-02-26 10:50:05 +01:00
svlandeg	fc6e34c3a1	fix bugs from porting master to develop	2020-02-26 08:44:22 +01:00
Ines Montani	c1a5ece65f	Tidy up setup and update requirements tests	2020-02-25 15:46:39 +01:00
Ines Montani	5d21d3e8b9	Merge branch 'develop' into pr/5008	2020-02-25 15:24:47 +01:00
Ines Montani	d50152b917	Merge pull request #5019 from questoph/master Optimizing tokenization for Luxembourgish (dealing with apostrophe infixes)	2020-02-25 14:48:50 +01:00
Ines Montani	4440a072d2	Merge pull request #5006 from svlandeg/bugfix/multiproc-underscore load Underscore state when multiprocessing	2020-02-25 14:46:02 +01:00
svlandeg	d821c95eb0	debugging prints	2020-02-23 17:38:33 +01:00
svlandeg	58568bd0cd	fix	2020-02-23 16:45:37 +01:00
svlandeg	0f55e51704	assert we found the root_dir	2020-02-23 16:33:58 +01:00
svlandeg	783da088ea	avoid try except	2020-02-23 16:21:21 +01:00
svlandeg	b49a3afd0c	use clean_underscore fixture	2020-02-23 15:49:20 +01:00
Tom Keefe	ddf63b97a8	make idx available via to_array (#5030 )	2020-02-22 14:13:06 +01:00
Sofie Van Landeghem	44f4142ce4	add two abbreviations and some additional unit tests (#5040 )	2020-02-22 14:12:32 +01:00
Sofie Van Landeghem	479bd8d09f	add lemma option to displacy 'dep' visualiser (#5041 ) * add lemma option to displacy 'dep' visualiser * more compact list comprehension * add option to doc * fix test and add lemmas to util.get_doc * fix capital * remove lemma from get_doc * cleanup	2020-02-22 14:11:51 +01:00
adrianeboyd	2164e71ea8	Improved Romanian tokenization for UD RRT (#5036 ) Modifications to Romanian tokenization to improve tokenization for UD_Romanian-RRT.	2020-02-19 16:15:59 +01:00
svlandeg	9f1447bf71	where areth thou, file ?	2020-02-19 17:09:29 +02:00
svlandeg	9834527f2c	hack to switch between CLI folder setup and local setup	2020-02-19 16:22:48 +02:00
svlandeg	5c2f645470	root dir one level up	2020-02-19 16:15:56 +02:00
svlandeg	b20351792a	assert prints for more clarity	2020-02-19 15:51:53 +02:00
Ines Montani	a3335d36b8	Merge branch 'develop' into refactor/remove-symlinks	2020-02-18 17:22:20 +01:00
Ines Montani	09cbeaef27	Remove symlinks, data dir and related stuff	2020-02-18 17:20:17 +01:00
Ines Montani	e3f40a6a0f	Tidy up and auto-format	2020-02-18 15:38:18 +01:00
Ines Montani	1278161f47	Tidy up and fix issues	2020-02-18 15:17:03 +01:00
Ines Montani	de11ea753a	Merge branch 'master' into develop	2020-02-18 14:47:23 +01:00
Ines Montani	80e95d02b1	Allow spacy attr in token pattern	2020-02-18 14:32:53 +01:00
Jan Jessewitsch	c7e4fe9c5c	Fix/Improve German stop words (#5024 ) * Fix German stop words Two stop words ("einige" and "einigen") are sticking together. Remove three nouns that may serve as stop words in a specific context (e.g. religious or news) but are not applicable for general use. * Create Jan-711.md	2020-02-17 18:59:22 +01:00
Kabir Khan	f6ed07b85c	Use nlp.pipe in EntityRuler for phrase patterns in add_patterns (#4931 ) * Fix ent_ids and labels properties when id attribute used in patterns * use set for labels * sort end_ids for comparison in entity_ruler tests * fixing entity_ruler ent_ids test * add to set * Run make_doc optimistically if using phrase matcher patterns. * remove unused coveragerc I was testing with * format * Refactor EntityRuler.add_patterns to use nlp.pipe for phrase patterns. Improves speed substantially. * Removing old add_patterns function * Fixing spacing * Make sure token_patterns loaded as well, before generator was being emptied in from_disk	2020-02-16 18:17:47 +01:00
Sofie Van Landeghem	72c964bcf4	define pretrained_dims which is used by build_text_classifier (#5004 )	2020-02-16 17:21:17 +01:00
adrianeboyd	3b22eb651b	Sync Span __eq__ and __hash__ (#5005 ) * Sync Span __eq__ and __hash__ Use the same tuple for `__eq__` and `__hash__`, including all attributes except `vector` and `vector_norm`. * Update entity comparison in tests Update `assert_docs_equal()` test util to compare `Span` properties for ents rather than `Span` objects.	2020-02-16 17:20:36 +01:00
adrianeboyd	0c47a53b5e	Use int only in key2row for better performance (#4990 ) Cast all keys and rows to `int` in `vectors.key2row` for more efficient access and serialization.	2020-02-16 17:19:41 +01:00
adrianeboyd	5b102963bf	Require HEAD for is_parsed in Doc.from_array() (#5011 ) Modify flag settings so that `DEP` is not sufficient to set `is_parsed` and only run `set_children_from_heads()` if `HEAD` is provided. Then the combination `[SENT_START, DEP]` will set deps and not clobber sent starts with a lot of one-word sentences.	2020-02-16 17:17:09 +01:00
Sofie Van Landeghem	2572460175	add tok2vec parameters to train script to facilitate init_tok2vec (#5021 )	2020-02-16 17:16:41 +01:00
Sofie Van Landeghem	a27c77ce62	add message when cli train script throws exception (#5009 ) * add message when cli train script throws exception * fix formatting	2020-02-15 15:50:17 +01:00
questoph	5352fc8fc3	Update tokenizer_exceptions.py	2020-02-14 12:02:15 +01:00
questoph	d1f0b397b5	Update punctuation.py	2020-02-13 22:18:51 +01:00
svlandeg	2729d9164d	cleanup	2020-02-12 22:59:37 +01:00
svlandeg	6bbd816569	formatting	2020-02-12 22:50:27 +01:00
svlandeg	34986c7bfd	test versions of required libs across different places	2020-02-12 22:49:50 +01:00
svlandeg	6e717c62ed	avoid the tests interacting with each other through the global Underscore variable	2020-02-12 13:21:31 +01:00
svlandeg	7939c63886	use English instead of model	2020-02-12 12:26:27 +01:00
svlandeg	46628d8890	add some asserts	2020-02-12 12:12:52 +01:00
svlandeg	51d37033c8	remove old comment	2020-02-12 12:10:05 +01:00
svlandeg	65f5b48b5d	add comment	2020-02-12 12:06:27 +01:00
svlandeg	05dedaa2cf	add unit test	2020-02-12 12:00:13 +01:00
svlandeg	ecbb9c4b9f	load Underscore state when multiprocessing	2020-02-12 11:50:42 +01:00
Ines Montani	2ed49404e3	Improve setup.py and call into Cython directly (#4952 ) * Improve setup.py and call into Cython directly * Add numpy to setup_requires * Improve clean helper * Update setup.cfg * Try if it builds without pyproject.toml * Update MANIFEST.in	2020-02-11 17:46:18 -05:00
adrianeboyd	99a543367d	Set GPU before loading any models in train CLI (#4989 ) Set the GPU before loading any existing models in the train CLI so that you can start with a base model and train on GPU.	2020-02-11 17:45:41 -05:00
adrianeboyd	842dfddbb9	Standardize Greek tag map setup (#4997 ) * Rename `tag_map.py` to `tag_map_fine.py` to indicate that it's not the default tag map * Remove duplicate generic UD tag map and load `../tag_map.py` instead	2020-02-11 17:44:56 -05:00
Sofie Van Landeghem	9b84f987bd	fix grad_clip naming (#4967 )	2020-02-10 20:33:16 -05:00
Antti Ajanki	e1f777b151	Improvements for Finnish tokenizer (#4985 ) * don't split on a colon. Colon is used to attach suffixes for abbreviations * tokenize on any of LIST_HYPHENS (except a single hyphen), not just on -- * simplify infix rules by merging similar rules	2020-02-10 20:32:43 -05:00
Sofie Van Landeghem	781e95cf53	Ensure doc.similarity returns a float (on develop) (#4969 )	2020-02-10 20:31:49 -05:00
Filip Bednárik	d4f4060bf3	Add Slovak language tools implementation (#4943 ) * Add correct stopwords for Slovak language * Add SNK Tags * Disable formatting lint for TAGS * Add example sentences for Slovak language * Add Slovak numerals in base form * Add lex_attrs to sk init * Add contributor agreement	2020-02-03 13:03:59 +01:00
Sofie Van Landeghem	cabd60fa1e	Small fixes to as_example (#4957 ) * label in span not writable anymore * Revert "label in span not writable anymore" This reverts commit `ab442338c8`. * fixing yield - remove redundant list	2020-02-03 13:02:12 +01:00
Tyler Couto	9fa9d7f2cb	Fix for Issue 4665 - conllu2json (#4953 ) * Fix for Issue 4665 - conllu2json - Allowing HEAD to be an underscore * Added contributor agreement	2020-02-03 13:01:48 +01:00
Matthew Honnibal	71b93f33bb	Set dev version	2020-01-30 15:41:45 +01:00
Matthew Honnibal	ba6d78132d	Fix dev version	2020-01-30 10:35:09 +01:00
Ines Montani	ccef9f2f44	Update version	2020-01-29 17:52:22 +01:00
adrianeboyd	5ee9d8c9b8	Add MORPH attr, add support in retokenizer (#4947 ) * Add MORPH attr / symbol for token attrs * Update retokenizer for MORPH	2020-01-29 17:45:46 +01:00
adrianeboyd	a365359b36	Add convert CLI option to merge CoNLL-U subtokens (#4722 ) * Add convert CLI option to merge CoNLL-U subtokens Add `-T` option to convert CLI that merges CoNLL-U subtokens into one token in the converted data. Each CoNLL-U sentence is read into a `Doc` and the `Retokenizer` is used to merge subtokens with features as follows: * `orth` is the merged token orth (should correspond to raw text and `# text`) * `tag` is all subtoken tags concatenated with `_`, e.g. `ADP_DET` * `pos` is the POS of the syntactic root of the span (as determined by the Retokenizer) * `morph` is all morphological features merged * `lemma` is all subtoken lemmas concatenated with ` `, e.g. `de o` * with `-m` all morphological features are combined with the tag using the separator `__`, e.g. `ADP_DET__Definite=Def\|Gender=Masc\|Number=Sing\|PronType=Art` * `dep` is the dependency relation for the syntactic root of the span (as determined by the Retokenizer) Concatenated tags will be mapped to the UD POS of the syntactic root (e.g., `ADP`) and the morphological features will be the combined features. In many cases, the original UD subtokens can be reconstructed from the available features given a language-specific lookup table, e.g., Portuguese `do / ADP_DET / Definite=Def\|Gender=Masc\|Number=Sing\|PronType=Art` is `de / ADP`, `o / DET / Definite=Def\|Gender=Masc\|Number=Sing\|PronType=Art` or lookup rules for forms containing open class words like Spanish `hablarlo / VERB_PRON / Case=Acc\|Gender=Masc\|Number=Sing\|Person=3\|PrepCase=Npr\|PronType=Prs\|VerbForm=Inf`. * Clean up imports	2020-01-29 17:44:25 +01:00
Sofie Van Landeghem	569cc98982	Update spaCy for thinc 8.0.0 (#4920 ) * Add load_from_config function * Add train_from_config script * Merge configs and expose via spacy.config * Fix script * Suggest create_evaluation_callback * Hard-code for NER * Fix errors * Register command * Add TODO * Update train-from-config todos * Fix imports * Allow delayed setting of parser model nr_class * Get train-from-config working * Tidy up and fix scores and printing * Hide traceback if cancelled * Fix weighted score formatting * Fix score formatting * Make output_path optional * Add Tok2Vec component * Tidy up and add tok2vec_tensors * Add option to copy docs in nlp.update * Copy docs in nlp.update * Adjust nlp.update() for set_annotations * Don't shuffle pipes in nlp.update, decruft * Support set_annotations arg in component update * Support set_annotations in parser update * Add get_gradients method * Add get_gradients to parser * Update errors.py * Fix problems caused by merge * Add _link_components method in nlp * Add concept of 'listeners' and ControlledModel * Support optional attributes arg in ControlledModel * Try having tok2vec component in pipeline * Fix tok2vec component * Fix config * Fix tok2vec * Update for Example * Update for Example * Update config * Add eg2doc util * Update and add schemas/types * Update schemas * Fix nlp.update * Fix tagger * Remove hacks from train-from-config * Remove hard-coded config str * Calculate loss in tok2vec component * Tidy up and use function signatures instead of models * Support union types for registry models * Minor cleaning in Language.update * Make ControlledModel specifically Tok2VecListener * Fix train_from_config * Fix tok2vec * Tidy up * Add function for bilstm tok2vec * Fix type * Fix syntax * Fix pytorch optimizer * Add example configs * Update for thinc describe changes * Update for Thinc changes * Update for dropout/sgd changes * Update for dropout/sgd changes * Unhack gradient update * Work on refactoring _ml * Remove 
_ml.py module * WIP upgrade cli scripts for thinc * Move some _ml stuff to util * Import link_vectors from util * Update train_from_config * Import from util * Import from util * Temporarily add ml.component_models module * Move ml methods * Move typedefs * Update load vectors * Update gitignore * Move imports * Add PrecomputableAffine * Fix imports * Fix imports * Fix imports * Fix missing imports * Update CLI scripts * Update spacy.language * Add stubs for building the models * Update model definition * Update create_default_optimizer * Fix import * Fix comment * Update imports in tests * Update imports in spacy.cli * Fix import * fix obsolete thinc imports * update srsly pin * from thinc to ml_datasets for example data such as imdb * update ml_datasets pin * using STATE.vectors * small fix * fix Sentencizer.pipe * black formatting * rename Affine to Linear as in thinc * set validate explicitly to True * rename with_square_sequences to with_list2padded * rename with_flatten to with_list2array * chaining layernorm * small fixes * revert Optimizer import * build_nel_encoder with new thinc style * fixes using model's get and set methods * Tok2Vec in component models, various fixes * fix up legacy tok2vec code * add model initialize calls * add in build_tagger_model * small fixes * setting model dims * fixes for ParserModel * various small fixes * initialize thinc Models * fixes * consistent naming of window_size * fixes, removing set_dropout * work around Iterable issue * remove legacy tok2vec * util fix * fix forward function of tok2vec listener * more fixes * trying to fix PrecomputableAffine (not successful yet) * alloc instead of allocate * add morphologizer * rename residual * rename fixes * Fix predict function * Update parser and parser model * fixing few more tests * Fix precomputable affine * Update component model * Update parser model * Move backprop padding to own function, for test * Update test * Fix p. 
affine * Update NEL * build_bow_text_classifier and extract_ngrams * Fix parser init * Fix test add label * add build_simple_cnn_text_classifier * Fix parser init * Set gpu off by default in example * Fix tok2vec listener * Fix parser model * Small fixes * small fix for PyTorchLSTM parameters * revert my_compounding hack (iterable fixed now) * fix biLSTM * Fix uniqued * PyTorchRNNWrapper fix * small fixes * use helper function to calculate cosine loss * small fixes for build_simple_cnn_text_classifier * putting dropout default at 0.0 to ensure the layer gets built * using thinc util's set_dropout_rate * moving layer normalization inside of maxout definition to optimize dropout * temp debugging in NEL * fixed NEL model by using init defaults ! * fixing after set_dropout_rate refactor * proper fix * fix test_update_doc after refactoring optimizers in thinc * Add CharacterEmbed layer * Construct tagger Model * Add missing import * Remove unused stuff * Work on textcat * fix test (again :)) after optimizer refactor * fixes to allow reading Tagger from_disk without overwriting dimensions * don't build the tok2vec prematurely * fix CharacterEmbed init * CharacterEmbed fixes * Fix CharacterEmbed architecture * fix imports * renames from latest thinc update * one more rename * add initialize calls where appropriate * fix parser initialization * Update Thinc version * Fix errors, auto-format and tidy up imports * Fix validation * fix if bias is cupy array * revert for now * ensure it's a numpy array before running bp in ParserStepModel * no reason to call require_gpu twice * use CupyOps.to_numpy instead of cupy directly * fix initialize of ParserModel * remove unnecessary import * fixes for CosineDistance * fix device renaming * use refactored loss functions (Thinc PR 251) * overfitting test for tagger * experimental settings for the tagger: avoid zero-init and subword normalization * clean up tagger overfitting test * use previous default value for nP * remove toy config 
* bringing layernorm back (had a bug - fixed in thinc) * revert setting nP explicitly * remove setting default in constructor * restore values as they used to be * add overfitting test for NER * add overfitting test for dep parser * add overfitting test for textcat * fixing init for linear (previously affine) * larger eps window for textcat * ensure doc is not None * Require newer thinc * Make float check vaguer * Slop the textcat overfit test more * Fix textcat test * Fix exclusive classes for textcat * fix after renaming of alloc methods * fixing renames and mandatory arguments (staticvectors WIP) * upgrade to thinc==8.0.0.dev3 * refer to vocab.vectors directly instead of its name * rename alpha to learn_rate * adding hashembed and staticvectors dropout * upgrade to thinc 8.0.0.dev4 * add name back to avoid warning W020 * thinc dev4 * update srsly * using thinc 8.0.0a0 ! Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> Co-authored-by: Ines Montani <ines@ines.io>	2020-01-29 17:06:46 +01:00
adrianeboyd	a938566b62	Fix Sentencizer.pipe() for empty doc (#4940 )	2020-01-28 11:36:49 +01:00
adrianeboyd	06b251dd1e	Add support for pos/morphs/lemmas in training data (#4941 ) Add support for pos/morphs/lemmas throughout `GoldParse`, `Example`, and `docs_to_json()`.	2020-01-28 11:36:29 +01:00
adrianeboyd	adc9745718	Modify morphology to support arbitrary features (#4932 ) * Restructure tag maps for MorphAnalysis changes Prepare tag maps for upcoming MorphAnalysis changes that allow arbitrary features. * Use default tag map rather than duplicating for ca / uk / vi * Import tag map into defaults for ga * Modify tag maps so all morphological fields and features are strings * Move features from `"Other"` to the top level * Rewrite tuples as strings separated by `","` * Rewrite morph symbols for fr lemmatizer as strings * Export MorphAnalysis under spacy.tokens * Modify morphology to support arbitrary features Modify `Morphology` and `MorphAnalysis` so that arbitrary features are supported. * Modify `MorphAnalysisC` so that it can support arbitrary features and multiple values per field. `MorphAnalysisC` is redesigned to contain: * key: hash of UD FEATS string of morphological features * array of `MorphFeatureC` structs that each contain a hash of `Field` and `Field=Value` for a given morphological feature, which makes it possible to: * find features by field * represent multiple values for a given field * `get_field()` is renamed to `get_by_field()` and is no longer `nogil`. Instead a new helper function `get_n_by_field()` is `nogil` and returns `n` features by field. * `MorphAnalysis.get()` returns all possible values for a field as a list of individual features such as `["Tense=Pres", "Tense=Past"]`. * `MorphAnalysis`'s `str()` and `repr()` are the UD FEATS string. * `Morphology.feats_to_dict()` converts a UD FEATS string to a dict where: * Each field has one entry in the dict * Multiple values remain separated by a separator in the value string * `Token.morph_` returns the UD FEATS string and you can set `Token.morph_` with a UD FEATS string or with a tag map dict. * Modify get_by_field to use np.ndarray Modify `get_by_field()` to use np.ndarray. Remove `max_results` from `get_n_by_field()` and always iterate over all the fields. 
* Rewrite without MorphFeatureC * Add shortcut for existing feats strings as keys Add shortcut for existing feats strings as keys in `Morphology.add()`. * Check for '_' as empty analysis when adding morphs * Extend helper converters in Morphology Add and extend helper converters that convert and normalize between: * UD FEATS strings (`"Case=dat,gen\|Number=sing"`) * per-field dict of feats (`{"Case": "dat,gen", "Number": "sing"}`) * list of individual features (`["Case=dat", "Case=gen", "Number=sing"]`) All converters sort fields and values where applicable.	2020-01-23 22:01:54 +01:00
Sofie Van Landeghem	0a0de85409	Fix gold training (#4938 ) * label in span not writable anymore * Revert "label in span not writable anymore" This reverts commit `ab442338c8`. * ensure doc is not None	2020-01-23 22:00:24 +01:00
adrianeboyd	199d89943e	Add as_example to Sentencizer pipe() (#4933 )	2020-01-22 15:40:31 +01:00
Yohei Tamura	708a4d27eb	fix nlp.evaluate (#4924 ) (#4925 ) * new file: test_issue4924.py * modified: spacy/gold.pyx * modified: test_issue4924.py for python2	2020-01-20 12:17:46 +01:00
Kabir Khan	b9afcd56e3	Fix ent_ids and labels properties when id attribute used in patterns (#4900 ) * Fix ent_ids and labels properties when id attribute used in patterns * use set for labels * sort ent_ids for comparison in entity_ruler tests * fixing entity_ruler ent_ids test * add to set	2020-01-16 02:01:31 +01:00
adrianeboyd	90c52128dc	Improve train CLI with base model (#4911 ) Improve train CLI with a provided base model so that you can: * add a new component * extend an existing component * replace an existing component When the final model and best model are saved, reenable any disabled components and merge the meta information to include the full pipeline and accuracy information for all components in the base model plus the newly added components if needed.	2020-01-16 01:58:51 +01:00
svlandeg	ee828d5a9a	bugfix typo conv_window	2020-01-14 09:02:58 +01:00
adrianeboyd	d2f3a44b42	Improve train CLI sentrec scoring (#4892 ) * reorder metrics to prioritize F over P/R * add sentrec to model metrics	2020-01-08 16:52:14 +01:00
adrianeboyd	e55fa1899a	Report length of dev dataset correctly (#4891 )	2020-01-08 16:51:51 +01:00
adrianeboyd	e1b493ae85	Add sentrec shortcut to Language (#4890 )	2020-01-08 16:51:24 +01:00
adrianeboyd	d24bca62f6	Add CJK to character classes (#4884 ) * Add CJK character class as uncased * Incorporate Chinese URL test case Un-xfail Chinese URL test instance	2020-01-08 16:50:19 +01:00
adrianeboyd	aef83e8070	Mark most Hungarian tokenizer test cases as slow (#4883 ) * Mark most Hungarian tokenizer test cases as slow Mark most Hungarian tokenizer test cases as slow to reduce the runtime of the test suite in ordinary usage: * for normal tests: run default tests plus 10% of the detailed tests * for slow tests: run all tests * Rework to mark individual tests as slow	2020-01-08 12:34:06 +01:00
Sofie Van Landeghem	7b96a5e10f	Reduce mem usage in training Entity Linker (#4811 ) * move nlp processing for el pipe to batch training instead of preprocessing * adding dev eval back in, and limit in articles instead of entities * use pipe whenever possible * few more small doc changes * access dev data through generator * tqdm description * small fixes * update documentation	2020-01-06 14:59:50 +01:00
Sofie Van Landeghem	6e9b61b49d	add warning in debug_data for punctuation in entities (#4853 )	2020-01-06 14:59:28 +01:00
adrianeboyd	d652ff215d	Add trailing whitespace to multiline test text (#4877 )	2020-01-06 14:58:59 +01:00
adrianeboyd	de69bc6509	Fix and improve URL pattern (#4882 ) * match domains longer than `hostname.domain.tld` like `www.foo.co.uk` * expand allowed characters in domain names while only matching lowercase TLDs so that "this.That" isn't matched as a URL and can be split on the period as an infix (relevant for at least English, German, and Tatar)	2020-01-06 14:58:30 +01:00
Sofie Van Landeghem	a1b22e90cd	serialize ENT_ID (#4852 ) * expand serialization test for custom token attribute * add failing test for issue 4849 * define ENT_ID as attr and use in doc serialization * fix few typos	2020-01-06 14:57:34 +01:00
Sofie Van Landeghem	581eeed98b	Warning goldparse (#4851 ) * label in span not writable anymore * Revert "label in span not writable anymore" This reverts commit `ab442338c8`. * provide more friendly error msg for parsing file	2020-01-01 13:16:48 +01:00
Ines Montani	83e0a6f3e3	Modernize plac commands for Python 3 (#4836 )	2020-01-01 13:15:46 +01:00
Al Johri	1aa2d4dac9	stop rendering mathjax by default in displacy (#4840 ) * stop rendering mathjax by default in displacy * Replace f-string and add comment Co-authored-by: Ines Montani <ines@ines.io>	2020-01-01 13:15:05 +01:00
Anastasiia Iurshina	1830a12578	Fixes typos (#4843 ) * Fixes typos * Fixes typo * Contributor agreement	2019-12-29 14:24:13 +01:00
Ivan Echevarria	ef13e0c038	Add n_process to Language.pipe documentation (#4842 ) [ci skip] * Add n_process to documentation * Auto-format and add default [ci skip] Co-authored-by: Ines Montani <ines@ines.io>	2019-12-29 14:23:33 +01:00
Ines Montani	401946d480	Un-xfail passing tests	2019-12-25 18:02:20 +01:00
Ines Montani	a892821c51	More formatting changes	2019-12-25 17:59:52 +01:00
Ines Montani	33a2682d60	Add better schemas and validation using Pydantic (#4831 ) * Remove unicode declarations * Remove Python 3.5 and 2.7 from CI * Don't require pathlib * Replace compat helpers * Remove OrderedDict * Use f-strings * Set Cython compiler language level * Fix typo * Re-add OrderedDict for Table * Update setup.cfg * Revert CONTRIBUTING.md * Add better schemas and validation using Pydantic * Revert lookups.md * Remove unused import * Update spacy/schemas.py Co-Authored-By: Sebastián Ramírez <tiangolo@gmail.com> * Various small fixes * Fix docstring Co-authored-by: Sebastián Ramírez <tiangolo@gmail.com>	2019-12-25 12:39:49 +01:00
Ines Montani	db55577c45	Drop Python 2.7 and 3.5 (#4828 ) * Remove unicode declarations * Remove Python 3.5 and 2.7 from CI * Don't require pathlib * Replace compat helpers * Remove OrderedDict * Use f-strings * Set Cython compiler language level * Fix typo * Re-add OrderedDict for Table * Update setup.cfg * Revert CONTRIBUTING.md * Revert lookups.md * Revert top-level.md * Small adjustments and docs [ci skip]	2019-12-22 01:53:56 +01:00
Ines Montani	3431ac42de	Fix typo	2019-12-21 21:17:45 +01:00
Ines Montani	21b6d6e0a8	Fix typo	2019-12-21 21:17:31 +01:00
Ines Montani	de33b6d566	Merge branch 'master' into develop	2019-12-21 21:15:46 +01:00
Ines Montani	7c69d30de5	Tidy up and expect warning	2019-12-21 21:14:52 +01:00
Sofie Van Landeghem	732142bf28	facilitate larger training files (#4827 ) * add warning for large file and change start var to long * type for file_length	2019-12-21 21:12:19 +01:00
Ines Montani	d17e7dca9e	Fix problems caused by merge conflict	2019-12-21 19:57:41 +01:00
Ines Montani	947dba7141	Merge branch 'master' into develop	2019-12-21 19:04:43 +01:00
Ines Montani	cb4145adc7	Tidy up and auto-format	2019-12-21 19:04:17 +01:00
Ines Montani	158b98a3ef	Merge branch 'master' into develop	2019-12-21 18:55:03 +01:00
Olamilekan Wahab	a741de7cf6	Adding support for Yoruba Language (#4614 ) * Adding Support for Yoruba * test text * Updated test string. * Fixing encoding declaration. * Adding encoding to stop_words.py * Added contributor agreement and removed iranlowo. * Added removed test files and removed iranlowo to keep project bare. * Returned CONTRIBUTING.md to default state. * Added deleted conftest entries * Tidy up and auto-format * Revert CONTRIBUTING.md Co-authored-by: Ines Montani <ines@ines.io>	2019-12-21 14:11:50 +01:00
Ines Montani	0750d59e5a	Allow setting ner_missing_tag on docs_to_json	2019-12-21 13:47:21 +01:00
Sofie Van Landeghem	8ebbb85117	Documentation for PhraseMatcher constructor (#4826 ) * add max_length as argument for init PhraseMatcher * improve error message too	2019-12-20 23:00:04 +01:00
Sofie Van Landeghem	12158c1e3a	Restore tqdm imports (#4804 ) * set 4.38.0 to minimal version with color bug fix * set imports back to proper place * add upper range for tqdm	2019-12-16 13:12:19 +01:00
Sofie Van Landeghem	557dcf5659	NEL requires sentences to be set (#4801 )	2019-12-13 15:55:18 +01:00
tamuhey	1707e77c5e	add char_span to Span (#4793 )	2019-12-13 15:54:58 +01:00
adrianeboyd	a4cacd3402	Add tag_map argument to CLI debug-data and train (#4750 ) Add an argument for a path to a JSON-formatted tag map, which is used to update and extend the default language tag map.	2019-12-13 10:46:18 +01:00
Sofie Van Landeghem	f9b541f9ef	More robust set entities method in KB (#4794 ) * add unit test for setting entities with duplicate identifiers * count the number of actual unique identifiers and throw duplicate warning	2019-12-13 10:45:29 +01:00
adrianeboyd	eb9b1858c4	Add NER map option to convert CLI (#4763 ) Instead of a hard-coded NER tag simplification function that was only intended for NorNE, map NER tags in CoNLL-U converter using a dict provided as JSON as a command-line option. Map NER entity types to a new tag or to "" for 'O', e.g.: ``` {"PER": "PERSON", "BAD": ""} => B-PER -> B-PERSON B-BAD -> O ```	2019-12-11 18:20:49 +01:00
Sofie Van Landeghem	5355b0038f	Update EL example (#4789 ) * update EL example script after sentence-central refactor * version bump * set incl_prior to False for quick demo purposes * clean up	2019-12-11 18:19:42 +01:00
adrianeboyd	38e1bc19f4	Add destructors for states in TransitionSystem (#4686 )	2019-12-10 13:23:27 +01:00
adrianeboyd	c208eb6e4d	Fix int value handling in Matcher (#4749 ) Add `int` values (for `LENGTH`) in _get_attr_values() instead of treating `int` like `dict`.	2019-12-06 19:22:57 +01:00
Sofie Van Landeghem	780d43aac7	fix bug in EL predict (#4779 )	2019-12-06 19:18:14 +01:00
adrianeboyd	676e75838f	Include Doc.cats in serialization of Doc and DocBin (#4774 ) * Include Doc.cats in to_bytes() * Include Doc.cats in DocBin serialization * Add tests for serialization of cats Test serialization of cats for Doc and DocBin.	2019-12-06 14:07:39 +01:00
Antti Ajanki	e626a011cc	Improvements to the Finnish language data (#4738 ) * Enable lex_attrs on Finnish * Copy the Danish tokenizer rules to Finnish Specifically, don't break hyphenated compound words * Contributor agreement * A new file for Finnish tokenizer rules instead of including the Danish ones	2019-12-03 12:55:28 +01:00
Christoph Purschke	a7ee4b6f17	new tests & tokenization fixes (#4734 ) - added some tests for tokenization issues - fixed some issues with tokenization of words with hyphen infix - rewrote the "tokenizer_exceptions.py" file (stemming from the German version)	2019-12-01 23:08:21 +01:00
adrianeboyd	68f711b409	Fix conllu2json n_sents and raw text (#4728 ) Update conllu2json converter to include raw text in final batch.	2019-11-29 10:22:03 +01:00
adrianeboyd	79ba1a3b92	Add lemmas to GoldParse / Example / docs_to_json (#4726 )	2019-11-28 14:53:44 +01:00
adrianeboyd	b841d3fe75	Add a tagger-based SentenceRecognizer (#4713 ) * Add sent_starts to GoldParse * Add SentTagger pipeline component Add `SentTagger` pipeline component as a subclass of `Tagger`. * Model reduces default parameters from `Tagger` to be small and fast * Hard-coded set of two labels: * S (1): token at beginning of sentence * I (0): all other sentence positions * Sets `token.sent_start` values * Add sentence segmentation to Scorer Report `sent_p/r/f` for sentence boundaries, which may be provided by various pipeline components. * Add sentence segmentation to CLI evaluate * Add senttagger metrics/scoring to train CLI * Rename SentTagger to SentenceRecognizer * Add SentenceRecognizer to spacy.pipes imports * Add SentenceRecognizer serialization test * Shorten component name to sentrec * Remove duplicates from train CLI output metrics	2019-11-28 11:10:07 +01:00
adrianeboyd	48ea2e8d0f	Restructure Sentencizer to follow Pipe API (#4721 ) * Restructure Sentencizer to follow Pipe API Restructure Sentencizer to follow Pipe API so that it can be scored with `nlp.evaluate()`. * Add Sentencizer pipe() test	2019-11-27 16:33:34 +01:00
adrianeboyd	9efd3ccbef	Update conllu2json MISC column handling (#4715 ) Update converter to handle various things in MISC column: * `SpaceAfter=No` and set raw text accordingly * plain NER tag * name=NER (for NorNE)	2019-11-26 16:10:08 +01:00
adrianeboyd	9aab0a55e1	Fix conllu2json converter to output all sentences (#4716 ) Make sure that the last batch of sentences is output if n_sents > 1.	2019-11-26 16:05:17 +01:00
adrianeboyd	0c9640ced3	Replace old gold alignment with new gold alignment (#4710 ) Replace old gold alignment that allowed for some noise in the alignment between raw and orth with the new simpler alignment that requires that the raw and orth strings are identical except for whitespace and capitalization. * Replace old alignment with new alignment, removing `_align.pyx` and its tests * Remove all quote normalizations * Enable test for new align * Modify test case for quote normalization	2019-11-25 23:13:26 +01:00
Jari Bakken	16cb19e960	update nb tag_map (#4711 )	2019-11-25 21:26:26 +01:00
adrianeboyd	392c4880d9	Restructure Example with merged sents as default (#4632 ) * Switch to train_dataset() function in train CLI * Fixes for pipe() methods in pipeline components * Don't clobber `examples` variable with `as_example` in pipe() methods * Remove unnecessary traversals of `examples` * Update Parser.pipe() for Examples * Add `as_examples` kwarg to `pipe()` with implementation to return `Example`s * Accept `Doc` or `Example` in `pipe()` with `_get_doc()` (copied from `Pipe`) * Fixes to Example implementation in spacy.gold * Move `make_projective` from an attribute of Example to an argument of `Example.get_gold_parses()` * Head of 0 are not treated as unset * Unset heads are set to self rather than `None` (which causes problems while projectivizing) * Check for `Doc` (not just not `None`) when creating GoldParses for pre-merged example * Don't clobber `examples` variable in `iter_gold_docs()` * Add/modify gold tests for handling projectivity * In JSON roundtrip compare results from `dev_dataset` rather than `train_dataset` to avoid projectivization (and other potential modifications) * Add test for projective train vs. nonprojective dev versions of the same `Doc` * Handle ignore_misaligned as arg rather than attr Move `ignore_misaligned` from an attribute of `Example` to an argument to `Example.get_gold_parses()`, which makes it parallel to `make_projective`. Add test with old and new align that checks whether `ignore_misaligned` errors are raised as expected (only for new align). * Remove unused attrs from gold.pxd Remove `ignore_misaligned` and `make_projective` from `gold.pxd` * Restructure Example with merged sents as default An `Example` now includes a single `TokenAnnotation` that includes all the information from one `Doc` (=JSON `paragraph`). If required, the individual sentences can be returned as a list of examples with `Example.split_sents()` with no raw text available. 
* Input/output a single `Example.token_annotation` * Add `sent_starts` to `TokenAnnotation` to handle sentence boundaries * Replace `Example.merge_sents()` with `Example.split_sents()` * Modify components to use a single `Example.token_annotation` * Pipeline components * conllu2json converter * Rework/rename `add_token_annotation()` and `add_doc_annotation()` to `set_token_annotation()` and `set_doc_annotation()`, functions that set rather than appending/extending. * Rename `morphology` to `morphs` in `TokenAnnotation` and `GoldParse` * Add getters to `TokenAnnotation` to supply default values when a given attribute is not available * `Example.get_gold_parses()` in `spacy.gold._make_golds()` is only applied on single examples, so the `GoldParse` is returned saved in the provided `Example` rather than creating a new `Example` with no other internal annotation * Update tests for API changes and `merge_sents()` vs. `split_sents()` * Refer to Example.goldparse in iter_gold_docs() Use `Example.goldparse` in `iter_gold_docs()` instead of `Example.gold` because a `None` `GoldParse` is generated with ignore_misaligned and generating it on-the-fly can raise an unwanted AlignmentError * Fix make_orth_variants() Fix bug in make_orth_variants() related to conversion from multiple to one TokenAnnotation per Example. * Add basic test for make_orth_variants() * Replace try/except with conditionals * Replace default morph value with set	2019-11-25 16:03:28 +01:00
Ines Montani	5b36dec7eb	Auto-exclude disabled when calling from_disk during load (#4708 )	2019-11-25 16:01:22 +01:00
Ines Montani	2160ecfc92	Fix typo [ci skip]	2019-11-25 13:08:19 +01:00
adrianeboyd	2d8c6e1124	Iterate over lr_edges until sents are correct (#4702 ) Iterate over lr_edges until all heads are within the current sentence. Instead of iterating over them for a fixed number of iterations, check whether the sentence boundaries are correct for the heads and stop when all are correct. Stop after a maximum of 10 iterations, providing a warning in this case since the sentence boundaries may not be correct.	2019-11-25 13:06:36 +01:00
Matt Maybeno	c9f1e99787	Agnostic vocab array fix (#4680 ) * Use get_array_module instead of numpy * add contributor agreement	2019-11-23 14:59:52 +01:00
adrianeboyd	46250f60ac	Add missing tags to el/es/pt tag maps (#4696 ) * Add missing tags to pt tag map * Add missing tags to es tag map * Add missing tags to el tag map * Add missing symbol in el tag map	2019-11-23 14:57:21 +01:00
adrianeboyd	44829950ba	Fix Example details for train CLI / pipeline components (#4624 ) * Switch to train_dataset() function in train CLI * Fixes for pipe() methods in pipeline components * Don't clobber `examples` variable with `as_example` in pipe() methods * Remove unnecessary traversals of `examples` * Update Parser.pipe() for Examples * Add `as_examples` kwarg to `pipe()` with implementation to return `Example`s * Accept `Doc` or `Example` in `pipe()` with `_get_doc()` (copied from `Pipe`) * Fixes to Example implementation in spacy.gold * Move `make_projective` from an attribute of Example to an argument of `Example.get_gold_parses()` * Head of 0 are not treated as unset * Unset heads are set to self rather than `None` (which causes problems while projectivizing) * Check for `Doc` (not just not `None`) when creating GoldParses for pre-merged example * Don't clobber `examples` variable in `iter_gold_docs()` * Add/modify gold tests for handling projectivity * In JSON roundtrip compare results from `dev_dataset` rather than `train_dataset` to avoid projectivization (and other potential modifications) * Add test for projective train vs. nonprojective dev versions of the same `Doc` * Handle ignore_misaligned as arg rather than attr Move `ignore_misaligned` from an attribute of `Example` to an argument to `Example.get_gold_parses()`, which makes it parallel to `make_projective`. Add test with old and new align that checks whether `ignore_misaligned` errors are raised as expected (only for new align). * Remove unused attrs from gold.pxd Remove `ignore_misaligned` and `make_projective` from `gold.pxd` * Refer to Example.goldparse in iter_gold_docs() Use `Example.goldparse` in `iter_gold_docs()` instead of `Example.gold` because a `None` `GoldParse` is generated with ignore_misaligned and generating it on-the-fly can raise an unwanted AlignmentError * Update test for ignore_misaligned	2019-11-23 14:32:15 +01:00
Paul O'Leary McCann	f0e3e606a6	Replace mecab-python3 with fugashi for Japanese (#4621 ) * Switch from mecab-python3 to fugashi mecab-python3 has been the best MeCab binding for a long time but it's not very actively maintained, and since it's based on old SWIG code distributed with MeCab there's a limit to how effectively it can be maintained. Fugashi is a new Cython-based MeCab wrapper I wrote. Since it's not based on the old SWIG code it's easier to keep it current and make small deviations from the MeCab C/C++ API where that makes sense. * Change mecab-python3 to fugashi in setup.cfg * Change "mecab tags" to "unidic tags" The tags come from MeCab, but the tag schema is specified by Unidic, so it's more proper to refer to it that way. * Update conftest * Add fugashi link to external deps list for Japanese	2019-11-23 14:31:04 +01:00
Ines Montani	a0fb1acb10	Update version [ci skip]	2019-11-21 18:19:37 +01:00
Ines Montani	b570d5d2ed	Increment version [ci skip]	2019-11-21 17:02:32 +01:00
Matthew Honnibal	50f89cb85d	Make vectors.find() return keys in correct order (#4691 ) * Make vectors.find() return keys in correct order * Update spacy/vectors.pyx	2019-11-21 16:58:32 +01:00
Ines Montani	5d4eede1e4	Fix test util imports	2019-11-21 16:28:29 +01:00
GuiGel	8f7ab70870	Bugfix/fix entity ruler from disk (#4670 ) * fix EntityRuler from_disk bug * add contributor file * Test EntityRuler PhraseMatcher deserialization (#4651) * newline at end of file * fix copy paste error * serializing the EntityRuler by itself * Add unicode declarations for Python 2 and auto-format	2019-11-21 16:26:37 +01:00
adrianeboyd	054df5d90a	Add error for non-string labels (#4690 ) Add error when attempting to add non-string labels to `Tagger` or `TextCategorizer`.	2019-11-21 16:24:10 +01:00
adrianeboyd	d7f32b285c	Detect more empty matches in tokenizer.explain() (#4675 ) * Detect more empty matches in tokenizer.explain() * Include a few languages in explain non-slow tests Mark a few languages in tokenizer.explain() tests as not slow so they're run by default.	2019-11-20 16:31:29 +01:00
Ines Montani	5bf9ab5b03	Tidy up and auto-format	2019-11-20 13:16:33 +01:00
Ines Montani	7f3b00164a	Re-add slow marker	2019-11-20 13:15:59 +01:00
Ines Montani	6e303de717	Auto-format	2019-11-20 13:15:24 +01:00
Ines Montani	2e7c896fe5	Update Tokenizer.explain tests	2019-11-20 13:14:11 +01:00
adrianeboyd	2c876eb672	Add tokenizer explain() debugging method (#4596 ) * Expose tokenizer rules as a property Expose the tokenizer rules property in the same way as the other core properties. (The cache resetting is overkill, but consistent with `from_bytes` for now.) Add tests and update Tokenizer API docs. * Update Hungarian punctuation to remove empty string Update Hungarian punctuation definitions so that `_units` does not match an empty string. * Use _load_special_tokenization consistently Use `_load_special_tokenization()` and have it handle `None` checks. * Fix precedence of `token_match` vs. special cases Remove `token_match` check from `_split_affixes()` so that special cases have precedence over `token_match`. `token_match` is checked only before infixes are split. * Add `make_debug_doc()` to the Tokenizer Add `make_debug_doc()` to the Tokenizer as a working implementation of the pseudo-code in the docs. Add a test (marked as slow) that checks that `nlp.tokenizer()` and `nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens for all languages that have `examples.sentences` that can be imported. * Update tokenization usage docs Update pseudo-code and algorithm description to correspond to `nlp.tokenizer.make_debug_doc()` with example debugging usage. Add more examples for customizing tokenizers while preserving the existing defaults. Minor edits / clarifications. * Revert "Update Hungarian punctuation to remove empty string" This reverts commit `f0a577f7a5`. * Rework `make_debug_doc()` as `explain()` Rework `make_debug_doc()` as `explain()`, which returns a list of `(pattern_string, token_string)` tuples rather than a non-standard `Doc`. Update docs and tests accordingly, leaving the visualization for future work. * Handle cases with bad tokenizer patterns Detect when tokenizer patterns match empty prefixes and suffixes so that `explain()` does not hang on bad patterns.
* Remove unused displacy image * Add tokenizer.explain() to usage docs	2019-11-20 13:07:25 +01:00
Matthew Honnibal	a3c43a1692	Support no hidden layer in parser and NER (#4672 ) * Support no hidden layers for parser * Fix parser model for depth 1 * Fix parser for hidden depth=0 * Add option of non-blocking to CUDA stream	2019-11-19 15:54:34 +01:00
Matthew Honnibal	4b123952aa	Add option for improved NER feature extraction (#4671 ) * Support option of three NER features * Expose nr_feature parser model setting * Give feature tokens better name * Test nr_feature=3 for NER * Format	2019-11-19 15:03:14 +01:00
Elijah Rippeth	5ad5c4b44a	Add initial Korean support (#4660 ) * add hangul and jamo char classes. * add initial Korean lexical attributes. * add contributor agreement	2019-11-18 12:56:07 +01:00
Ines Montani	74b951fe61	Fix xpassing tests (#4657 ) * Ignore internal warnings * Un-xfail passing tests * Skip instead of xfail	2019-11-16 20:20:53 +01:00
Ines Montani	3bd15055ce	Fix bug in Language.evaluate for components without .pipe (#4662 )	2019-11-16 20:20:37 +01:00
adrianeboyd	bdfb696677	Fix conllu2json converter to output all sentences (#4656 ) Make sure that the last batch of sentences is output if n_sents > 1.	2019-11-15 17:08:32 +01:00
Ines Montani	d64cfce546	Remove unnecessary newline replace	2019-11-15 16:19:01 +01:00
Christoph Purschke	433748e867	Fix basic language support for Luxembourgish (by adding punctuation.py) (#4648 ) * Update __init__.py * Create punctuation.py * Update tokenizer_exceptions.py * Create questoph.md * Update questoph.md * Update test_text.py * Update test_text.py * Update test_text.py * Update test_text.py	2019-11-15 16:16:47 +01:00
adrianeboyd	faaa832518	Generalize handling of tokenizer special cases (#4259 ) * Generalize handling of tokenizer special cases Handle tokenizer special cases more generally by using the Matcher internally to match special cases after the affix/token_match tokenization is complete. Instead of only matching special cases while processing balanced or nearly balanced prefixes and suffixes, this recognizes special cases in a wider range of contexts: * Allows arbitrary numbers of prefixes/affixes around special cases * Allows special cases separated by infixes Existing tests/settings that couldn't be preserved as before: * The emoticon '")' is no longer a supported special case * The emoticon ':)' in "example:)" is a false positive again When merged with #4258 (or the relevant cache bugfix), the affix and token_match properties should be modified to flush and reload all special cases to use the updated internal tokenization with the Matcher. * Remove accidentally added test case * Really remove accidentally added test * Reload special cases when necessary Reload special cases when affixes or token_match are modified. Skip reloading during initialization. 
* Update error code number * Fix offset and whitespace in Matcher special cases * Fix offset bugs when merging and splitting tokens * Set final whitespace on final token in inserted special case * Improve cache flushing in tokenizer * Separate cache and specials memory (temporarily) * Flush cache when adding special cases * Repeated `self._cache = PreshMap()` and `self._specials = PreshMap()` are necessary due to this bug: https://github.com/explosion/preshed/issues/21 * Remove reinitialized PreshMaps on cache flush * Update UD bin scripts * Update imports for `bin/` * Add all currently supported languages * Update subtok merger for new Matcher validation * Modify blinded check to look at tokens instead of lemmas (for corpora with tokens but not lemmas like Telugu) * Use special Matcher only for cases with affixes * Reinsert specials cache checks during normal tokenization for special cases as much as possible * Additionally include specials cache checks while splitting on infixes * Since the special Matcher needs consistent affix-only tokenization for the special cases themselves, introduce the argument `with_special_cases` in order to do tokenization with or without specials cache checks * After normal tokenization, postprocess with special cases Matcher for special cases containing affixes * Replace PhraseMatcher with Aho-Corasick Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays of the hash values for the relevant attribute. The implementation is based on FlashText. The speed should be similar to the previous PhraseMatcher. It is now possible to easily remove match IDs and matches don't go missing with large keyword lists / vocabularies. Fixes #4308. 
* Restore support for pickling * Fix internal keyword add/remove for numpy arrays * Add test for #4248, clean up test * Improve efficiency of special cases handling * Use PhraseMatcher instead of Matcher * Improve efficiency of merging/splitting special cases in document * Process merge/splits in one pass without repeated token shifting * Merge in place if no splits * Update error message number * Remove UD script modifications Only used for timing/testing, should be a separate PR * Remove final traces of UD script modifications * Update UD bin scripts * Update imports for `bin/` * Add all currently supported languages * Update subtok merger for new Matcher validation * Modify blinded check to look at tokens instead of lemmas (for corpora with tokens but not lemmas like Telugu) * Add missing loop for match ID set in search loop * Remove cruft in matching loop for partial matches There was a bit of unnecessary code left over from FlashText in the matching loop to handle partial token matches, which we don't have with PhraseMatcher. * Replace dict trie with MapStruct trie * Fix how match ID hash is stored/added * Update fix for match ID vocab * Switch from map_get_unless_missing to map_get * Switch from numpy array to Token.get_struct_attr Access token attributes directly in Doc instead of making a copy of the relevant values in a numpy array. Add unsatisfactory warning for hash collision with reserved terminal hash key. (Ideally it would change the reserved terminal hash and redo the whole trie, but for now, I'm hoping there won't be collisions.) * Restructure imports to export find_matches * Implement full remove() Remove unnecessary trie paths and free unused maps. Parallel to Matcher, raise KeyError when attempting to remove a match ID that has not been added. 
* Switch to PhraseMatcher.find_matches * Switch to local cdef functions for span filtering * Switch special case reload threshold to variable Refer to variable instead of hard-coded threshold * Move more of special case retokenize to cdef nogil Move as much of the special case retokenization to nogil as possible. * Rewrap sort as stdsort for OS X * Rewrap stdsort with specific types * Switch to qsort * Fix merge * Improve cmp functions * Fix realloc * Fix realloc again * Initialize span struct while retokenizing * Temporarily skip retokenizing * Revert "Move more of special case retokenize to cdef nogil" This reverts commit `0b7e52c797`. * Revert "Switch to qsort" This reverts commit `a98d71a942`. * Fix specials check while caching * Modify URL test with emoticons The multiple suffix tests result in the emoticon `:>`, which is now retokenized into one token as a special case after the suffixes are split off. * Refactor _apply_special_cases() * Use cdef ints for span info used in multiple spots * Modify _filter_special_spans() to prefer earlier Parallel to #4414, modify _filter_special_spans() so that the earlier span is preferred for overlapping spans of the same length. * Replace MatchStruct with Entity Replace MatchStruct with Entity since the existing Entity struct is nearly identical. * Replace Entity with more general SpanC * Replace MatchStruct with SpanC * Add error in debug-data if no dev docs are available (see #4575) * Update azure-pipelines.yml * Revert "Update azure-pipelines.yml" This reverts commit `ed1060cf59`. 
* Use latest wasabi * Reorganise install_requires * add dframcy to universe.json (#4580) * Update universe.json [ci skip] * Fix multiprocessing for as_tuples=True (#4582) * Fix conllu script (#4579) * force extensions to avoid clash between example scripts * fix arg order and default file encoding * add example config for conllu script * newline * move extension definitions to main function * few more encodings fixes * Add load_from_docbin example [ci skip] TODO: upload the file somewhere * Update README.md * Add warnings about 3.8 (resolves #4593) [ci skip] * Fixed typo: Added space between "recognize" and "various" (#4600) * Fix DocBin.merge() example (#4599) * Replace function registries with catalogue (#4584) * Replace functions registries with catalogue * Update __init__.py * Fix test * Revert unrelated flag [ci skip] * Bugfix/dep matcher issue 4590 (#4601) * add contributor agreement for prilopes * add test for issue #4590 * fix on_match params for DependencyMatcher (#4590) * Minor updates to language example sentences (#4608) * Add punctuation to Spanish example sentences * Combine multilanguage examples for lang xx * Add punctuation to nb examples * Always realloc to a larger size Avoid potential (unlikely) edge case and cymem error seen in #4604. * Add error in debug-data if no dev docs are available (see #4575) * Update debug-data for GoldCorpus / Example * Ignore None label in misaligned NER data	2019-11-13 21:24:35 +01:00
adrianeboyd	d67b0f196a	Fix initialization of token mappings in new align (#4640) Initialize all values in `a2b` and `b2a`, since `numpy.empty()` otherwise leaves them as unspecified integers.	2019-11-13 21:22:18 +01:00
adrianeboyd	3ac4e8eb7a	Fix minor issues in debug-data (#4636 ) * Add error in debug-data if no dev docs are available (see #4575) * Update debug-data for GoldCorpus / Example * Ignore None label in misaligned NER data	2019-11-13 15:25:03 +01:00
Sofie Van Landeghem	e48a09df4e	Example class for training data (#4543 ) * OrigAnnot class instead of gold.orig_annot list of zipped tuples * from_orig to replace from_annot_tuples * rename to RawAnnot * some unit tests for GoldParse creation and internal format * removing orig_annot and switching to lists instead of tuple * rewriting tuples to use RawAnnot (+ debug statements, WIP) * fix pop() changing the data * small fixes * pop-append fixes * return RawAnnot for existing GoldParse to have uniform interface * clean up imports * fix merge_sents * add unit test for 4402 with new structure (not working yet) * introduce DocAnnot * typo fixes * add unit test for merge_sents * rename from_orig to from_raw * fixing unit tests * fix nn parser * read_annots to produce text, doc_annot pairs * _make_golds fix * rename golds_to_gold_annots * small fixes * fix encoding * have golds_to_gold_annots use DocAnnot * missed a spot * merge_sents as function in DocAnnot * allow specifying only part of the token-level annotations * refactor with Example class + underlying dicts * pipeline components to work with Example objects (wip) * input checking * fix yielding * fix calls to update * small fixes * fix scorer unit test with new format * fix kwargs order * fixes for ud and conllu scripts * fix reading data for conllu script * add in proper errors (not fixed numbering yet to avoid merge conflicts) * fixing few more small bugs * fix EL script	2019-11-11 17:35:27 +01:00
adrianeboyd	91f89f9693	Fix realloc in retokenizer.split() (#4606 ) Always realloc to a size larger than `doc.max_length` in `retokenizer.split()` (or cymem will throw errors).	2019-11-11 16:26:46 +01:00
adrianeboyd	0b9a5f4074	Rework Chinese language initialization and tokenization (#4619 ) * Rework Chinese language initialization * Create a `ChineseTokenizer` class * Modify jieba post-processing to handle whitespace correctly * Modify non-jieba character tokenization to handle whitespace correctly * Add a `create_tokenizer()` method to `ChineseDefaults` * Load lexical attributes * Update Chinese tag_map for UD v2 * Add very basic Chinese tests * Test tokenization with and without jieba * Test `like_num` attribute * Fix try_jieba_import() * Fix zh code formatting	2019-11-11 14:23:21 +01:00
adrianeboyd	4d85f67eee	Minor updates to language example sentences (#4608 ) * Add punctuation to Spanish example sentences * Combine multilanguage examples for lang xx * Add punctuation to nb examples	2019-11-07 22:34:58 +01:00
Priscilla de Abreu Lopes	39e79fcc86	Bugfix/dep matcher issue 4590 (#4601) * add contributor agreement for prilopes * add test for issue #4590 * fix on_match params for DependencyMatcher (#4590)	2019-11-07 12:01:06 +01:00
Ines Montani	09cec3e41b	Replace function registries with catalogue (#4584 ) * Replace functions registries with catalogue * Update __init__.py * Fix test * Revert unrelated flag [ci skip]	2019-11-07 11:45:22 +01:00
Matthew Honnibal	4e43c0ba93	Fix multiprocessing for as_tuples=True (#4582 )	2019-11-04 20:29:03 +01:00
Ines Montani	cf4ec88b38	Use latest wasabi	2019-11-04 02:38:45 +01:00
Ines Montani	6ec119d976	Add error in debug-data if no dev docs are available (see #4575 )	2019-11-02 16:08:11 +01:00
adrianeboyd	56ad3a3988	Add LAS per dependency to Scorer (#4560 )	2019-10-31 21:18:16 +01:00
Matthew Honnibal	de98d66f87	Set version to v2.2.2	2019-10-31 15:53:31 +01:00
Matthw Honnibal	55f2241d72	Merge branch 'master' of https://github.com/explosion/spaCy	2019-10-31 15:37:52 +01:00
Ines Montani	df4c9ae3dc	Fix formatting [ci skip]	2019-10-31 15:10:25 +01:00
Ines Montani	59358d9b71	Remove box-decoration-break from entities in displacy (#4564 )	2019-10-31 15:09:43 +01:00
Matthw Honnibal	8b9954d1b7	Set version to v2.2.2.dev5	2019-10-31 15:06:19 +01:00
Ines Montani	2c107f02a4	Auto-format [ci skip]	2019-10-31 15:01:56 +01:00
Matthew Honnibal	e82306937e	Put Tok2Vec refactor behind feature flag (#4563 ) * Add back pre-2.2.2 tok2vec * Add simple tok2vec tests * Add simple tok2vec tests * Reformat * Fix CharacterEmbed in new tok2vec * Fix legacy tok2vec * Resolve circular imports * Fix test for Python 2	2019-10-31 15:01:15 +01:00
Ines Montani	5e9849b60f	Auto-format [ci skip]	2019-10-30 19:27:18 +01:00
Ines Montani	afe4a428f7	Fix pipeline analysis on remove pipe (#4557 ) Validate after component is removed, not before	2019-10-30 19:04:17 +01:00
Matthew Honnibal	6b874ef096	Set version to v2.2.2.dev4	2019-10-30 17:36:20 +01:00
Ines Montani	85f2b04c45	Support span._. in component decorator attrs (#4555 ) * Support span._. in component decorator attrs * Adjust error [ci skip]	2019-10-30 17:19:36 +01:00
Matthew Honnibal	c2f5f9f572	Set version to v2.2.2.dev3	2019-10-29 16:37:58 +01:00
Sofie Van Landeghem	33ba9ff464	set encodings explicitly to utf8 (#4551 )	2019-10-29 13:16:55 +01:00
Matthew Honnibal	9e210fa7fd	Fix tok2vec structure after model registry refactor (#4549) The model registry refactor of the Tok2Vec function broke loading models trained with the previous function, because the model tree was slightly different. Specifically, the new function wrote: concatenate(norm, prefix, suffix, shape) to build the embedding layer. In the previous implementation, I had used the operator overloading shortcut: (norm | prefix | suffix | shape) This actually gets mapped to a binary association, giving something like: concatenate(norm, concatenate(prefix, concatenate(suffix, shape))) This is a different tree, so the layers iterate differently and we loaded the weights wrongly.	2019-10-28 23:59:03 +01:00
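The tree difference described in 9e210fa7fd can be illustrated with a toy sketch. The `Layer` class and `concatenate` helper below are hypothetical stand-ins, not Thinc's actual classes, and plain Python's `|` associates to the left, so the nesting direction here may differ from the one quoted in the message. The point is only that chaining a binary operator yields a deeper tree than a single variadic call, so a depth-first walk visits a different sequence of nodes.

```python
class Layer:
    """Hypothetical stand-in for a model layer; not a Thinc class."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def __or__(self, other):
        # Each use of | builds a two-child concatenate node.
        return Layer("concatenate", [self, other])

    def walk(self):
        # Depth-first traversal, the order in which weights would be visited.
        yield self.name
        for child in self.children:
            yield from child.walk()


def concatenate(*layers):
    # Variadic form: one node with all layers as direct children.
    return Layer("concatenate", layers)


norm, prefix, suffix, shape = (Layer(n) for n in ("norm", "prefix", "suffix", "shape"))

flat = concatenate(norm, prefix, suffix, shape)
nested = norm | prefix | suffix | shape

flat_order = list(flat.walk())
nested_order = list(nested.walk())
print(flat_order)
print(nested_order)
```

The flat tree has one `concatenate` node and the chained form has three, so weights stored in traversal order for one layout would not line up with the other.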
Matthew Honnibal	bade60fe64	Set version to v2.2.2.dev1	2019-10-28 19:09:34 +01:00
Matthew Honnibal	b1505380ff	Fix training with vectors	2019-10-28 18:06:38 +01:00
Matthew Honnibal	a927b3a21e	Put new alignment behind flag for v2.2.2 release (#4541 ) * Xfail new tokenization test * Put new alignment behind feature flag * Move USE_ALIGN to top of the file [ci skip] Co-authored-by: Ines Montani <ines@ines.io>	2019-10-28 16:12:32 +01:00
Ines Montani	a90025b277	Fix serialization of extension attr values in DocBin (#4540 )	2019-10-28 16:02:13 +01:00
tamuhey	df293f3894	modified gold.align to handle space tokens (#4537 ) Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2019-10-28 15:44:28 +01:00
adrianeboyd	f2bfaa1b38	Filter subtoken matches in merge_subtokens() (#4539 ) The `Matcher` in `merge_subtokens()` returns all possible subsequences of `subtok`, so for sequences of two or more subtoks it's necessary to filter the matches so that the retokenizer is only merging the longest matches with no overlapping spans.	2019-10-28 15:40:28 +01:00
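A minimal sketch of the filtering idea in f2bfaa1b38, assuming matches are plain `(start, end)` token offsets with an exclusive end. The function name `filter_longest` is hypothetical and this is not spaCy's implementation; it only shows how "longest first, no overlaps" can be applied to the set of all subsequence matches.

```python
def filter_longest(spans):
    """Keep the longest spans and drop any span overlapping one already kept.

    spans: iterable of (start, end) token offsets, end exclusive.
    """
    # Longest spans first; ties are broken in favour of the earlier start.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    seen = set()  # token indices already covered by a kept span
    for start, end in ordered:
        if not any(i in seen for i in range(start, end)):
            kept.append((start, end))
            seen.update(range(start, end))
    return sorted(kept)


# All subsequences of a three-token run, plus one unrelated single-token match.
matches = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (5, 6)]
filtered = filter_longest(matches)
print(filtered)
```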
Matthew Honnibal	d5509e0989	Support Mish activation (requires Thinc 7.3) (#4536 ) * Add arch for MishWindowEncoder * Support mish in tok2vec and conv window >=2 * Pass new tok2vec settings from parser * Syntax error * Fix tok2vec setting * Fix registration of MishWindowEncoder * Fix receptive field setting * Fix mish arch * Pass more options from parser * Support more tok2vec options in pretrain * Require thinc 7.3 * Add docs [ci skip] * Require thinc 7.3.0.dev0 to run CI * Run black * Fix typo * Update Thinc version Co-authored-by: Ines Montani <ines@ines.io>	2019-10-28 15:16:33 +01:00
Ines Montani	96bb8f2187	Add regression test for #4528 [ci skip]	2019-10-28 14:36:03 +01:00
Ines Montani	c5e41247e8	Tidy up and auto-format	2019-10-28 12:43:55 +01:00
Ines Montani	92018b9cd4	Tidy up and auto-format	2019-10-28 12:36:23 +01:00
Matthew Honnibal	f0ec7bcb79	Flag to ignore examples with mismatched raw/gold text (#4534 ) * Flag to ignore examples with mismatched raw/gold text After #4525, we're seeing some alignment failures on our OntoNotes data. I think we actually have fixes for most of these cases. In general it's better to fix the data, but it seems good to allow the GoldCorpus class to just skip cases where the raw text doesn't match up to the gold words. I think previously we were silently ignoring these cases. * Try to fix test on Python 2.7	2019-10-28 11:40:12 +01:00
Matthew Honnibal	795699015c	Clarify parser model CPU/GPU code (#4535 ) The previous version worked with previous thinc, but only because some thinc ops happened to have gpu/cpu compatible implementations. It's better to call the right Ops instance.	2019-10-27 23:43:09 +01:00
Matthw Honnibal	46eecdcb70	Remove print	2019-10-27 22:24:19 +01:00
Matthw Honnibal	426b745640	Fix tests for gpu	2019-10-27 22:19:18 +01:00
Matthw Honnibal	165e378082	Fix tok2vec arch after refactor	2019-10-27 22:19:10 +01:00
Matthew Honnibal	f8d740bfb1	Fix --gold-preproc train cli command (#4392 ) * Fix get labels for textcat * Fix char_embed for gpu * Revert "Fix char_embed for gpu" This reverts commit `055b9a9e85`. * Fix passing of cats in gold.pyx * Revert "Match pop with append for training format (#4516)" This reverts commit `8e7414dace`. * Fix popping gold parses * Fix handling of cats in gold tuples * Fix name * Fix ner_multitask_objective script * Add test for 4402	2019-10-27 21:58:50 +01:00
Sofie Van Landeghem	8e7414dace	Match pop with append for training format (#4516) * trying to fix script - not successful yet * match pop() with extend() to avoid changing the data * few more pop-extend fixes * reinsert deleted print statement * fix print statement * add last tested version * append instead of extend * add in few comments * quick fix for 4402 + unit test * fixing number of docs (not counting cats) * more fixes * fix len * print tmp file instead of using data from examples dir * print tmp file instead of using data from examples dir (2)	2019-10-27 16:01:32 +01:00
tamuhey	fcd25db033	[#4529 ] fix: gold pyx (#4530 ) * fix: gold pyx * remove print * skip test in python2 * Add unicode declarations and don't skip test on Python 2	2019-10-27 13:50:07 +01:00
Matthew Honnibal	bddfbc7e1b	Restore missing normalization from gold align PR #4526 missed extra lower-casing and spacing normalization.	2019-10-27 13:47:08 +01:00
tamuhey	554850206c	[#4525 ] fix gold.align (#4526 ) * fix: gold.align * fix align * remove old align	2019-10-27 13:38:04 +01:00
Ines Montani	a9c6104047	Component decorator and component analysis (#4517 ) * Add work in progress * Update analysis helpers and component decorator * Fix porting of docstrings for Python 2 * Fix docstring stuff on Python 2 * Support meta factories when loading model * Put auto pipeline analysis behind flag for now * Analyse pipes on remove_pipe and replace_pipe * Move analysis to root for now Try to find a better place for it, but it needs to go for now to avoid circular imports * Simplify decorator Don't return a wrapped class and instead just write to the object * Update existing components and factories * Add condition in factory for classes vs. functions * Add missing from_nlp classmethods * Add "retokenizes" to printed overview * Update assigns/requires declarations of builtins * Only return data if no_print is enabled * Use multiline table for overview * Don't support Span * Rewrite errors/warnings and move them to spacy.errors	2019-10-27 13:35:49 +01:00
Matthew Honnibal	406eb95a47	Refactor Tok2Vec to use architecture registry (#4518 ) * Add refactored tok2vec, using register_architecture * Refactor Tok2Vec * Fix ml * Fix new tok2vec * Move make_layer to util * Add wire * Fix missing import	2019-10-25 22:28:20 +02:00
Sofie Van Landeghem	99e309bb19	fix nn parser sample construction (#4524 )	2019-10-25 22:26:42 +02:00
Ines Montani	cfffdba7b1	Implement new API for {Phrase}Matcher.add (backwards-compatible) (#4522 ) * Implement new API for {Phrase}Matcher.add (backwards-compatible) * Update docs * Also update DependencyMatcher.add * Update internals * Rewrite tests to use new API * Add basic check for common mistake Raise error with suggestion if user likely passed in a pattern instead of a list of patterns * Fix typo [ci skip]	2019-10-25 22:21:08 +02:00
Ines Montani	d2da117114	Also support passing list to Language.disable_pipes (#4521 ) * Also support passing list to Language.disable_pipes * Adjust internals	2019-10-25 16:19:08 +02:00
Ines Montani	e91366a216	Adjust formatting [ci skip]	2019-10-25 11:25:44 +02:00
Ines Montani	f31876154d	Adjust formatting [ci skip]	2019-10-25 11:19:46 +02:00
Kabir Khan	93640373c7	Make entity_ruler ent_id resolution 2x faster and add docs for… (#4513 ) * Update entityruler.py * Making ent_id resolution 2x faster and adding docs * Fixing newlines in docstrings * Fixing newlines in docstrings	2019-10-25 11:16:42 +02:00
Ines Montani	cc05d9dad6	Auto-format [ci skip]	2019-10-24 16:21:08 +02:00
Ines Montani	73dc63d3bf	Tidy up and auto-format [ci skip]	2019-10-24 16:20:48 +02:00
Ines Montani	6dd2832438	Use numpy.frombuffer instead of fromstring Deprecation warning says we should do this	2019-10-24 16:18:41 +02:00
Ines Montani	9a849fe54e	Explicitly catch warning in test	2019-10-24 16:16:27 +02:00
adrianeboyd	1b0bbe4b76	Update tag maps and docs for English and German (#4501 ) * Update English tag_map Update English tag_map based on this conversion table: https://universaldependencies.org/tagset-conversion/en-penn-uposf.html * Update German tag_map Update German tag_map based on this conversion table: https://universaldependencies.org/tagset-conversion/de-stts-uposf.html * Add missing Tiger dependencies to glossary * Add quotes to definition of TO * Update POS/TAG tables in docs Update POS/TAG tables for English and German docs using current information generated from the tag_maps and GLOSSARY. * Update warning that -PRON- is specific to English * Revert docs to default JSON output with convert * Revert "Revert docs to default JSON output with convert" This reverts commit `6b78c048f1`.	2019-10-24 12:56:05 +02:00
Zhuoru Lin	10d88b09bb	Bugfix/fix wikidata train entity linker (#4509 ) * Fix labels_discard Nonetype iteration error * Contributor agreement for Zhuoru Lin * Enhance EntityLinker.predict() to handle labels_discard is None case.	2019-10-24 12:52:59 +02:00
adrianeboyd	8516e9d53b	Support train dict format as JSONL (#4471 ) * Support train dict format as JSONL * Add (overly simple) check for dict vs. tuple to read JSONL lines as either train dicts or train tuples * Extend JSON/JSONL roundtrip conversion tests using `docs_to_json()` and `GoldCorpus.train_tuples` * Revert docs to default JSON output with convert	2019-10-23 16:01:44 +02:00
Matthew Honnibal	ca7f0e669e	Set version to v2.2.2.dev1	2019-10-22 20:11:25 +02:00
Matthew Honnibal	9489c5f6b2	Clip most_similar to range [-1, 1] (fixes #4506 ) (#4507 ) * Clip most_similar to range [-1, 1] * Add/fix vectors tests * Fix test	2019-10-22 20:10:42 +02:00
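Why clipping matters for 9489c5f6b2 can be sketched in plain Python: floating-point rounding can push a cosine similarity slightly outside [-1, 1], so the result is clamped. The `cosine` helper below is a hypothetical illustration, not the code in `Vectors.most_similar`, and the zero-norm guard is an added assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors, clamped to [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        # Assumption for this sketch: treat a zero vector as similarity 0.
        return 0.0
    # Clamp to absorb rounding error in the division.
    return max(-1.0, min(1.0, dot / norm))


print(cosine([1.0, 0.0], [1.0, 0.0]))
print(cosine([1.0, 0.0], [-1.0, 0.0]))
print(cosine([0.0, 0.0], [1.0, 2.0]))
```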
Ines Montani	74a19aeb1c	Add xfailing test [ci skip]	2019-10-22 18:18:43 +02:00
Matthew Honnibal	3f6cb618a9	Set version to v2.2.2.dev0	2019-10-22 17:47:36 +02:00
Sofie Van Landeghem	48886afc78	prevent zero-length mem alloc (#4429 ) * raise specific error when removing a matcher rule that doesn't exist * rephrasing * goldparse init: allocate fields only if doc is not empty * avoid zero length alloc in saving tokenizer cache * avoid allocating zero length mem in matcher * asserts to avoid allocating zero length mem * fix zero-length allocation in matcher * bump cymem version * revert cymem version bump	2019-10-22 16:54:33 +02:00
adrianeboyd	3dfc764577	Free pointers in parser activations (#4486 ) * Free pointers in ActivationsC * Restructure alloc/free for parser activations * Rewrite/restructure to have allocation and free in parallel functions in `_parser_model` rather than partially in `_parseC()` in `Parser`. * Remove `resize_activations` from `_parser_model.pxd`.	2019-10-22 15:06:44 +02:00
tamuhey	fb89f6792b	refactor: remove unused variable (#4499 )	2019-10-22 14:38:17 +02:00
gustavengstrom	050e2445a8	Adding noun_chunks to the Swedish language model (sv) (#4422 ) * Create syntax_iterators.py Replica of spacy/lang/fr/syntax_iterators.py * Added import statements for SYNTAX_ITERATORS * Create gustavengstrom.md * Added "dobj" to list of labels in noun_chunks method and a test_noun_chunks method to the Swedish language model. * Delete README-checkpoint.md Co-authored-by: Gustav <gustav@davcon.se> Co-authored-by: Ines Montani <ines@ines.io>	2019-10-21 12:57:06 +02:00
adrianeboyd	f5c551a43a	Checks/errors related to ill-formed IOB input in CLI convert and debug-data (#4487 ) * Error for ill-formed input to iob_to_biluo() Check for empty label in iob_to_biluo(), which can result from ill-formed input. * Check for empty NER label in debug-data	2019-10-21 12:20:28 +02:00
Sofie Van Landeghem	d5d55312b2	prevent division by zero in most_similar method (#4488 )	2019-10-21 12:04:46 +02:00
Pepe Berba	7772d5d3c5	Update `vocab.get_vector` docs to include features on Fasttext ngram (#4464 ) * Update `vocab.get_vector` * Added contrib agreement	2019-10-20 01:28:18 +02:00
Ines Montani	2c96a5e5b0	Remove lemma attrs on BaseDefaults (#4468 )	2019-10-19 23:18:09 +02:00
adrianeboyd	8d3de90bc4	Suppress convert output if writing to stdout (#4472 )	2019-10-18 18:12:59 +02:00
Ines Montani	692d7f4291	Fix formatting [ci skip]	2019-10-18 11:33:38 +02:00
Ines Montani	181c01f629	Tidy up and auto-format	2019-10-18 11:27:38 +02:00
Ines Montani	fb11852750	Remove unused imports	2019-10-18 11:06:41 +02:00
adrianeboyd	d359da9687	Replace Entity/MatchStruct with SpanC (#4459 ) * Replace MatchStruct with Entity Replace MatchStruct with Entity since the existing Entity struct is nearly identical. * Replace Entity with more general SpanC	2019-10-18 11:01:47 +02:00
adrianeboyd	29e3da6493	Add missing cats to gold annot_tuples in Scorer (#4466 ) Add missing `cats` in `Scorer` call to `GoldParse.from_annot_tuples()` when the `doc` and `gold` have differing lengths.	2019-10-18 11:00:02 +02:00
adrianeboyd	135e3de531	Check for docs with 2+ sentences in debug-data (#4467 )	2019-10-18 10:59:16 +02:00
Daniel King	e646956176	Most similar bug (#4446 ) * Add batch size indexing * Don't sort if n == 1 * Add test for most similar vectors issue * Change > to >=	2019-10-16 23:18:55 +02:00
Anastassia	4a77d03ff7	Fix documentation for the docs_to_json function (#4456 )	2019-10-16 23:17:58 +02:00
adrianeboyd	275c9ad872	Allow int values in token patterns (#4444 ) * Add missing int value option to top-level pattern validation in Matcher * Adjust existing tests accordingly * Add new test for valid pattern `{"LENGTH": int}`	2019-10-16 13:40:18 +02:00
Sofie Van Landeghem	7d1efac4eb	Fix remove pattern from matcher (#4454 ) * raise specific error when removing a matcher rule that doesn't exist * rephrasing * bugfix in remove matcher + extended unit test	2019-10-16 13:34:58 +02:00
Sofie Van Landeghem	2d249a9502	KB extensions and better parsing of WikiData (#4375 ) * fix overflow error on windows * more documentation & logging fixes * md fix * 3 different limit parameters to play with execution time * bug fixes directory locations * small fixes * exclude dev test articles from prior probabilities stats * small fixes * filtering wikidata entities, removing numeric and meta items * adding aliases from wikidata also to the KB * fix adding WD aliases * adding also new aliases to previously added entities * fixing comma's * small doc fixes * adding subclassof filtering * append alias functionality in KB * prevent appending the same entity-alias pair * fix for appending WD aliases * remove date filter * remove unnecessary import * small corrections and reformatting * remove WD aliases for now (too slow) * removing numeric entities from training and evaluation * small fixes * shortcut during prediction if there is only one candidate * add counts and fscore logging, remove FP NER from evaluation * fix entity_linker.predict to take docs instead of single sentences * remove enumeration sentences from the WP dataset * entity_linker.update to process full doc instead of single sentence * spelling corrections and dump locations in readme * NLP IO fix * reading KB is unnecessary at the end of the pipeline * small logging fix * remove empty files	2019-10-14 12:28:53 +02:00
Peter Gilles	428887b8f2	Initial commit: New language Luxembourgish (lb) (#4424 ) * new language: Luxembourgish (lb) * update * update * Update and rename .github/CONTRIBUTOR_AGREEMENT.md to .github/contributors/PeterGilles.md * Update and rename .github/contributors/PeterGilles.md to .github/CONTRIBUTOR_AGREEMENT.md * Update norm_exceptions.py * Delete README.md * moved test_lemma.py * deactivated 'lemma_lookup = LOOKUP' * update * Update conftest.py * update * tests updated * import unicode_literals * Update spacy/tests/lang/lb/test_text.py Co-Authored-By: Ines Montani <ines@ines.io> * Create PeterGilles.md	2019-10-14 12:27:50 +02:00
adrianeboyd	98a961a60e	Fix PhraseMatcher.remove for overlapping patterns (#4437 )	2019-10-14 12:19:51 +02:00
Ines Montani	f8f68bb062	Auto-format [ci skip]	2019-10-10 17:08:39 +02:00
adrianeboyd	6f54e59fe7	Fix util.filter_spans() to prefer first span in overlapping sam… (#4414 ) * Update util.filter_spans() to prefer earlier spans * Add filter_spans test for first same-length span * Update entity relation example to refer to util.filter_spans()	2019-10-10 17:00:03 +02:00
Sofie Van Landeghem	da6e0de34f	fix attrs field in the matcher (#4423 ) * raise specific error when removing a matcher rule that doesn't exist * rephrasing * ensure attrs is NULL when nr_attr == 0 + several fixes to prevent OOB	2019-10-10 15:20:59 +02:00
Sofie Van Landeghem	5efae495f1	Error when removing a matcher rule that doesn't exist (#4420 ) * raise specific error when removing a matcher rule that doesn't exist * rephrasing	2019-10-10 14:01:53 +02:00
Matthew Honnibal	fa95c030a5	Unify matcher get_ent_id and get_pattern_key (#4415 ) This is basically stabbing blindly at the ghost match problem, but it at least seems like there was a bug previously here --- so this should hopefully be an improvement, even if it doesn't fix the ghost match problem.	2019-10-09 15:26:31 +02:00
Ines Montani	c4f95c1569	Update formatting and docstrings [ci skip]	2019-10-08 12:25:23 +02:00
Matthew Honnibal	ddd6fda59c	Add registry for model creation functions ('architectures') (#4395 ) * Add architecture registry * Add test for arch registry * Add error for model architectures	2019-10-08 12:21:03 +02:00
tamuhey	650cbfe82d	multiprocessing pipe (#1303) (#4371) * refactor: separate formatting docs and golds in Language.update * fix return typo * add pipe test * unpickleable object cannot be assigned to p.map * passed test pipe * passed test! * pipe terminate * try pipe * passed test * fix ch * add comments * fix len(texts) * add comment * add comment * fix: multiprocessing of pipe is not supported in 2 * test: use assert_docs_equal * fix: is_python3 -> is_python2 * fix: change _pipe arg to use functools.partial * test: add vector modification test * test: add sample ner_pipe and user_data pipe * add warnings test * test: fix user warnings * test: fix warnings capture * fix: remove islice import * test: remove warnings test * test: add stream test * test: rename * fix: multiproc stream * fix: stream pipe * add comment * mp.Pipe seems to be usable with relatively small data * test: skip stream test in python2 * sort imports * test: add reason to skiptest * fix: use pipe for docs communication * add comments * add comment	2019-10-08 12:20:55 +02:00
adrianeboyd	14841d0aa6	Fix PhraseMatcher callback and add tests (#4399 ) * Fix callback lookup in PhraseMatcher (string key rather than hash key) * Add callback tests for Matcher and PhraseMatcher	2019-10-08 12:07:02 +02:00
Matthew Honnibal	fd4a5341b0	Fix ner_jsonl2json converter (fix #4389 ) (#4394 )	2019-10-08 00:52:45 +02:00
Matthew Honnibal	29f9fec267	Improve spacy pretrain (#4393 ) * Support bilstm_depth arg in spacy pretrain * Add option to ignore zero vectors in get_cossim_loss * Use cosine loss in Cloze multitask	2019-10-07 23:34:58 +02:00
Ines Montani	9cd6ca3e4d	Improve usage of pkg_resources and handling of entry points (#4387 ) * Only import pkg_resources where it's needed Apparently it's really slow * Use importlib_metadata for entry points * Revert "Only import pkg_resources where it's needed" This reverts commit `5ed8c03afa`. * Revert "Revert "Only import pkg_resources where it's needed"" This reverts commit `8b30b57957`. * Revert "Use importlib_metadata for entry points" This reverts commit `9f071f5c40`. * Revert "Revert "Use importlib_metadata for entry points"" This reverts commit `02e12a17ec`. * Skip test that weirdly hangs * Fix hanging test by using global	2019-10-07 17:22:09 +02:00
adrianeboyd	d53a8d9313	Consider batch_size when sorting similar vectors (#4388 )	2019-10-07 13:38:35 +02:00
adrianeboyd	a3509f67d4	Extend unicode character block for Sinhala (#4378 ) * Extend unicode character block for Sinhala * Add sentencizer tests for more languages	2019-10-07 13:17:03 +02:00
Ines Montani	573e543e4a	Alphanumeric -> alphabetic [ci skip] see ines/spacy-course#38	2019-10-06 13:30:01 +02:00
adrianeboyd	cbc2cee2c8	Improve URL_PATTERN and handling in tokenizer (#4374 ) * Move prefix and suffix detection for URL_PATTERN Move prefix and suffix detection for `URL_PATTERN` into the tokenizer. Remove associated lookahead and lookbehind from `URL_PATTERN`. Fix tokenization for Hungarian given new modified handling of prefixes and suffixes. * Match a wider range of URI schemes	2019-10-05 13:00:09 +02:00
Ines Montani	fec9433044	Make PhraseMatcher.vocab consistent with Matcher.vocab (closes #4373 )	2019-10-04 12:18:41 +02:00
Matthew Honnibal	37ef874d8b	Set version to v2.2.1	2019-10-03 14:50:39 +02:00
Sofie Van Landeghem	4e7259c6cf	Bugfix initializing DocBin with attributes (#4368 ) * docbin init fix + documentation fix + unit tests * newline * try with zlib instead of gzip (python 2 incompatibilities)	2019-10-03 14:48:45 +02:00
Ben Taylor	1db79a33cb	most_similar() return the k most similar vectors (#4364 ) * most_similar return n-most similar vectors * updated most_similar comment * add bintay contributor agreement * sign bintay contributor agreement * fix most_similar documentation typo * fixed error in prune_vectors * updated prune_vectors test	2019-10-03 14:09:44 +02:00
Matthew Honnibal	2eb31012e7	Set version to v2.2.0	2019-10-02 14:40:06 +02:00
Matthew Honnibal	796072e560	Set version to v2.2.0.dev19	2019-10-02 12:51:29 +02:00
Sofie Van Landeghem	9d3ce7cba2	Ensure training doesn't crash with empty batches (#4360 ) * unit test for previously resolved unflatten issue * prevent batch of empty docs to cause problems	2019-10-02 12:50:47 +02:00
adrianeboyd	dda86118bd	Update Ukrainian lemmatizer with new lookups (#4359 ) * Update Ukrainian lemmatizer with new lookups * Add missing import Co-authored-by: Ines Montani <ines@ines.io>	2019-10-02 12:04:06 +02:00
Ines Montani	b6670bf0c2	Use consistent spelling	2019-10-02 10:37:39 +02:00
Matthew Honnibal	38b6e69389	Merge branch 'master' of https://github.com/explosion/spaCy	2019-10-01 22:28:25 +02:00
Matthew Honnibal	d4b63bb6dd	Set version to v2.2.0	2019-10-01 22:28:13 +02:00
Ines Montani	475e3188ce	Add docs on filtering overlapping spans for merging (resolves #4352 ) [ci skip]	2019-10-01 21:59:50 +02:00
Matthew Honnibal	64a9577d43	Set version to v2.2.0.dev17	2019-10-01 21:36:59 +02:00
Ines Montani	cf65a80f36	Refactor lemmatizer and data table integration (#4353 ) * Move test * Allow default in Lookups.get_table * Start with blank tables in Lookups.from_bytes * Refactor lemmatizer to hold instance of Lookups * Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk) * Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency * Remove old and unsupported Lemmatizer.load classmethod * Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need * Update tests and docs * Fix more tests * Fix lemmatizer * Upgrade pytest to try and fix weird CI errors * Try pytest 4.6.5	2019-10-01 21:36:03 +02:00
Ines Montani	3297a19545	Warn in Tagger.begin_training if no lemma tables are available (#4351 )	2019-10-01 15:13:55 +02:00
Matthew Honnibal	2fb05482dd	Set version to v2.2.0	2019-10-01 03:50:13 +02:00
Matthew Honnibal	dc22ec0aad	Set version to v2.2.0.dev17	2019-10-01 03:26:53 +02:00
Matthew Honnibal	aedfba867a	Set version to v2.2.0.dev16	2019-10-01 00:31:00 +02:00
Ines Montani	e0cf4796a5	Move lookup tables out of the core library (#4346 ) * Add default to util.get_entry_point * Tidy up entry points * Read lookups from entry points * Remove lookup tables and related tests * Add lookups install option * Remove lemmatizer tests * Remove logic to process language data files * Update setup.cfg	2019-10-01 00:01:27 +02:00
Rahul Soni	ed620daa5c	Fix example sentences in Hindi for grammatical errors (#4343 ) * Fix grammar for hindi * Fix grammar for hindi * Submit contributor agreement	2019-09-30 23:32:49 +02:00
Ines Montani	ba186299e1	Tidy up and modernize setup and config (#4344 ) * Tidy up and modernize setup and config * Update setup.cfg * Re-add pyproject.toml * Delete .flake8 * Move static meta from about to setup.cfg * Update setup.cfg Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>	2019-09-30 20:10:55 +02:00
Ines Montani	4f905ac9e6	Add test for ASCII filenames (#4345 )	2019-09-30 18:45:30 +02:00
Matthew Honnibal	b5c775dd42	Set version to v2.2.0	2019-09-30 12:47:08 +02:00
Ines Montani	f7d1736241	Skip duplicate spans in Doc.retokenize (#4339 )	2019-09-30 12:43:48 +02:00
Ines Montani	0226b3bf0e	Fix test imports	2019-09-29 17:34:56 +02:00
Ines Montani	3d8fd4b461	Revert #4334	2019-09-29 17:32:12 +02:00
adrianeboyd	ba5595c764	Fix PhraseMatcher to remember attr on pickling (#4336 ) * Fix PhraseMatcher to remember attr on pickling * Check for attr as int or long	2019-09-29 17:12:33 +02:00
Ines Montani	75514b5970	Fix Korean	2019-09-29 17:10:56 +02:00
Ines Montani	499c39acba	Remove unnecessary namedtuple/dataclass	2019-09-29 15:05:28 +02:00
Matthew Honnibal	eba708404d	Set version to v2.2.0.dev15	2019-09-28 22:23:53 +02:00
Matthew Honnibal	6189959adb	Set version to v2.2.0.dev14	2019-09-28 22:09:46 +02:00
Matthew Honnibal	0df2a599b7	Set version to v2.2.0.dev13	2019-09-28 21:26:05 +02:00
Ines Montani	c9cd516d96	Move tests out of package (#4334 ) * Move tests out of package * Fix typo	2019-09-28 18:05:00 +02:00
Matthew Honnibal	d05eb56ce2	Set version to v2.2.0.dev12	2019-09-28 16:35:56 +02:00
Ines Montani	5fe61539c4	Fix unicode "e" in filename	2019-09-28 15:45:16 +02:00
Ines Montani	811c4c97c9	Correct lookup lemma of "lenses" (see #4332 )	2019-09-28 14:04:07 +02:00
Ines Montani	f8d1e2f214	Update CLI docs [ci skip]	2019-09-28 13:12:30 +02:00
Sofie Van Landeghem	22b9e12159	Ensure the NER remains consistent after resizing (#4330 ) * test and fix for second bug of issue 4042 * fix for first bug in 4042 * crashing test for Issue 4313 * forgot one instance of resize * remove prints * undo uncomment * delete test for 4313 (uses third party lib) * add fix for Issue 4313 * unit test for 4313	2019-09-27 20:57:13 +02:00
adrianeboyd	3906785b49	Initialize low data warning for debug-data parser (#4331 )	2019-09-27 20:56:49 +02:00
Ines Montani	206e8a5ac7	Also apply hotfix to Ukrainian lemmatizer	2019-09-27 18:03:26 +02:00
Ines Montani	acd5bcb0b3	Tidy up fixtures	2019-09-27 17:57:59 +02:00
Ines Montani	b21b2e27e5	Hotfix Russian lemmatizer	2019-09-27 17:56:12 +02:00
Matthew Honnibal	a4d4c4bfa4	Set version to v2.2.0.dev11	2019-09-27 16:40:26 +02:00
Ines Montani	aad66d9bb9	Document PhraseMatcher.remove [ci skip]	2019-09-27 16:34:53 +02:00
adrianeboyd	c23edf302b	Replace PhraseMatcher with trie-based search (#4309 ) * Replace PhraseMatcher with Aho-Corasick Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays of the hash values for the relevant attribute. The implementation is based on FlashText. The speed should be similar to the previous PhraseMatcher. It is now possible to easily remove match IDs and matches don't go missing with large keyword lists / vocabularies. Fixes #4308. * Restore support for pickling * Fix internal keyword add/remove for numpy arrays * Add missing loop for match ID set in search loop * Remove cruft in matching loop for partial matches There was a bit of unnecessary code left over from FlashText in the matching loop to handle partial token matches, which we don't have with PhraseMatcher. * Replace dict trie with MapStruct trie * Fix how match ID hash is stored/added * Update fix for match ID vocab * Switch from map_get_unless_missing to map_get * Switch from numpy array to Token.get_struct_attr Access token attributes directly in Doc instead of making a copy of the relevant values in a numpy array. Add unsatisfactory warning for hash collision with reserved terminal hash key. (Ideally it would change the reserved terminal hash and redo the whole trie, but for now, I'm hoping there won't be collisions.) * Restructure imports to export find_matches * Implement full remove() Remove unnecessary trie paths and free unused maps. Parallel to Matcher, raise KeyError when attempting to remove a match ID that has not been added. * Store docs internally only as attr lists * Reduces size for pickle * Remove duplicate keywords store Now that docs are stored as lists of attr hashes, there's no need to have the duplicate _keywords store.	2019-09-27 16:22:34 +02:00
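The trie idea behind c23edf302b can be sketched with nested dicts. This is a simplified illustration under stated assumptions: `PhraseTrie`, `TERMINAL`, and the string tokens are hypothetical, the scan restarts at every token instead of using Aho-Corasick failure links, and the real implementation works on attribute hashes in Cython structs rather than Python dicts.

```python
TERMINAL = object()  # reserved key marking the end of a stored phrase


class PhraseTrie:
    """Toy token-level trie; not spaCy's PhraseMatcher."""

    def __init__(self):
        self.root = {}

    def add(self, key, tokens):
        # Walk or create one trie node per token, then record the match ID.
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node.setdefault(TERMINAL, set()).add(key)

    def find_matches(self, tokens):
        # Return (key, start, end) triples, end exclusive.
        matches = []
        for start in range(len(tokens)):
            node = self.root
            for end in range(start, len(tokens)):
                node = node.get(tokens[end])
                if node is None:
                    break
                for key in sorted(node.get(TERMINAL, ())):
                    matches.append((key, start, end + 1))
        return matches


trie = PhraseTrie()
trie.add("CITY", ["new", "york"])
trie.add("CITY_FULL", ["new", "york", "city"])
found = trie.find_matches("i love new york city".split())
print(found)
```

Because phrases share prefixes in the trie, adding more keywords grows the structure without changing how a document is scanned.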
tamuhey	b408b5b29e	Refactor language update (#4316) * refactor: separate formatting docs and golds in Language.update * fix return typo	2019-09-27 16:20:21 +02:00
Jaydeep Borkar	6a06a3fa6a	Update stop_words.py and add name in contributors (#4325 ) * Update stop_words.py and add name in contributors * add jaydeepborkar.md in contributors directory * Reset template [ci skip] Co-authored-by: Ines Montani <ines@ines.io>	2019-09-27 11:57:27 +02:00
Ines Montani	da9a869d3f	Update vectors name docs [ci skip]	2019-09-26 16:21:32 +02:00
Matthew Honnibal	58533f01bf	Set version to v2.2.0.dev10	2019-09-26 03:03:50 +02:00
Matthew Honnibal	27ace84f4a	Support model name in init-model	2019-09-26 03:01:32 +02:00
Matthew Honnibal	eced2f3211	Set version to v2.2.0.dev9	2019-09-25 21:14:07 +02:00
Matthew Honnibal	1251b57dbb	Fix vectors name arg to init-model	2019-09-25 14:21:27 +02:00
Matthew Honnibal	92ed4dc5e0	Allow vectors name to be set in init-model (#4321 ) * Allow vectors name to be specified in init-model * Document --vectors-name argument to init-model * Update website/docs/api/cli.md Co-Authored-By: Ines Montani <ines@ines.io>	2019-09-25 13:11:00 +02:00
Ines Montani	52904b7270	Raise if on_match is not callable or None	2019-09-24 23:06:24 +02:00
Ines Montani	16aa092fb5	Improve Morphology errors (#4314) * Improve Morphology errors * Also clean up some other errors * Update errors.py	2019-09-21 14:37:06 +02:00
Ines Montani	9bf69bfbb2	Remove test	2019-09-19 17:38:41 +02:00
Ines Montani	8cd3763678	Update about.py [ci skip]	2019-09-19 01:02:25 +02:00
Matthew Honnibal	f52b857953	Update version	2019-09-19 00:56:35 +02:00
Matthew Honnibal	e34b4a38b0	Fix set labels meta	2019-09-19 00:56:07 +02:00
Matthew Honnibal	9d399fe63a	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2019-09-19 00:04:06 +02:00
Matthew Honnibal	7d510c833e	Fix orth replacement	2019-09-19 00:03:24 +02:00
Ines Montani	89d1dc4afa	Merge branch 'master' into develop	2019-09-18 22:12:24 +02:00
Sean Löfgren	31c683d87d	add return_matches and as_tuples back to Matcher.pipe (#4303 ) * add contributor agreement [ci skip] * add return_matches and as_tuples back to Matcher.pipe	2019-09-18 22:00:33 +02:00
Matthew Honnibal	42df49133d	Also lower-case in orth variants	2019-09-18 21:54:51 +02:00
Matthew Honnibal	19d99fc9e7	Set version to v2.2.0.dev7	2019-09-18 21:43:59 +02:00
Matthew Honnibal	46c02d25b1	Merge changes to test_ner	2019-09-18 21:41:24 +02:00
Sofie Van Landeghem	de5a9ecdf3	Distinction between outside, missing and blocked NER annotations (#4307 ) * remove duplicate unit test * unit test (currently failing) for issue 4267 * bugfix: ensure doc.ents preserves kb_id annotations * fix in setting doc.ents with empty label * rename * test for presetting an entity to a certain type * allow overwriting Outside + blocking presets * fix actions when previous label needs to be kept * fix default ent_iob in set entities * cleaner solution with U- action * remove debugging print statements * unit tests with explicit transitions and is_valid testing * remove U- from move_names explicitly * remove unit tests with pre-trained models that don't work * remove (working) unit tests with pre-trained models * clean up unit tests * move unit tests * small fixes * remove two TODO's from doc.ents comments	2019-09-18 21:37:17 +02:00
Moshe Hazoom	72463b062f	Improve speed of _merge method (#4300 ) * make merge more efficient * fix offsets * merge works with relative indices * remove printing * Add the SCA * fix SCA date * more cythonize _retokenize.pyx * more cythonize _retokenize.pyx * fix only declaration in _retokenize.pyx * switch back to absolute head * switch back to absolute head * fix comment * merge from origin repo	2019-09-18 21:34:34 +02:00
tamuhey	875f3e5d8c	remove redundant __call__ method in pipes.TextCategorizer (#4305 ) * remove redundant __call__ method in pipes.TextCategorizer Because the parent __call__ method behaves in the same way. * fix: Pipe.__call__ arg * fix: invalid arg in Pipe.__call__ * modified: spacy/tests/regression/test_issue4278.py (#4278) * deleted: Pipfile	2019-09-18 21:31:27 +02:00
Ines Montani	00a8cbc306	Tidy up and auto-format	2019-09-18 20:27:03 +02:00
Ines Montani	f2c8b1e362	Simplify lookup hashing Just use get_string_id, which already does everything ensure_hash was supposed to do	2019-09-18 20:24:41 +02:00
Ines Montani	dd1810f05a	Update DocBin and add docs	2019-09-18 20:23:21 +02:00
Ines Montani	7e810cced6	Add references to docs pages	2019-09-18 19:57:21 +02:00
Ines Montani	2e5ab5b59c	Make except more explicit	2019-09-18 19:57:08 +02:00
Ines Montani	1f648ecb76	Auto-format	2019-09-18 19:56:55 +02:00
Ines Montani	0f7fe5e7a7	Auto-format and fix typo and consistency	2019-09-18 19:18:30 +02:00
Matthew Honnibal	e53b86751f	DocPallet -> DocBin	2019-09-18 15:15:37 +02:00
Matthew Honnibal	fa9a283128	Fix name	2019-09-18 13:40:03 +02:00
Matthew Honnibal	88a23cf49a	Fix name	2019-09-18 13:38:29 +02:00
Matthew Honnibal	3507943b15	Add docstring for DocPallet	2019-09-18 13:25:47 +02:00
Matthew Honnibal	1c8de6b2e5	Rename DocBox->DocPallet	2019-09-18 13:13:51 +02:00
Ines Montani	691e0088cf	Remove duplicate tok2vec property (closes #4302)	2019-09-17 11:22:03 +02:00
Ines Montani	a84025d70b	Remove --no-deps from default pip args on download Add warning if user is executing spaCy without having it installed and add --no-deps to prevent the package from being redownloaded	2019-09-16 23:32:41 +02:00
Matthew Honnibal	84c65f9455	Merge branch 'master' into develop	2019-09-16 22:12:20 +02:00
Matthew Honnibal	47055d5988	Fix type declarations in _merge method	2019-09-16 22:10:13 +02:00
Sofie Van Landeghem	03ac29f437	Ensure that doc.ents preserves kb_id annotations (#4294 ) * bugfix: ensure doc.ents preserves kb_id annotations * fix backward compatibility * additional test	2019-09-16 15:18:37 +02:00
Ines Montani	139428c20f	Set unique vector names in tests	2019-09-16 15:16:54 +02:00
Ines Montani	bf06d9d537	Allow passing vectors_name to Vocab	2019-09-16 15:16:41 +02:00
Ines Montani	cb6c68a573	Pass vectors name correctly in prune_vectors	2019-09-16 15:16:29 +02:00
Ines Montani	3ba5238282	Make "unnamed vectors" warning a real warning	2019-09-16 15:16:12 +02:00
adrianeboyd	b5d999e510	Add textcat to train CLI (#4226) * Add doc.cats to spacy.gold at the paragraph level Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in the spacy JSON training format at the paragraph level. * `spacy.gold.docs_to_json()` writes `docs.cats` * `GoldCorpus` reads in cats in each `GoldParse` * Update instances of gold_tuples to handle cats Update iteration over gold_tuples / gold_parses to handle addition of cats at the paragraph level. * Add textcat to train CLI * Add textcat options to train CLI * Add textcat labels in `TextCategorizer.begin_training()` * Add textcat evaluation to `Scorer`: * For binary exclusive classes with provided label: F1 for label * For 2+ exclusive classes: F1 macro average * For multilabel (not exclusive): ROC AUC macro average (currently relying on sklearn) * Provide user info on textcat evaluation settings, potential incompatibilities * Provide pipeline to Scorer in `Language.evaluate` for textcat config * Customize train CLI output to include only metrics relevant to current pipeline * Add textcat evaluation to evaluate CLI * Fix handling of unset arguments and config params Fix handling of unset arguments and model config parameters in Scorer initialization. * Temporarily add sklearn requirement * Remove sklearn version number * Improve Scorer handling of models without textcats * Fixing Scorer handling of models without textcats * Update Scorer output for python 2.7 * Modify inf in Scorer for python 2.7 * Auto-format Also make small adjustments to make auto-formatting with black easier and produce nicer results * Move error message to Errors * Update documentation * Add cats to annotation JSON format [ci skip] * Fix tpl flag and docs [ci skip] * Switch to internal roc_auc_score Switch to internal `roc_auc_score()` adapted from scikit-learn. 
* Add AUCROCScore tests and improve errors/warnings * Add tests for AUCROCScore and roc_auc_score * Add missing error for only positive/negative values * Remove unnecessary warnings and errors * Make reduced roc_auc_score functions private Because most of the checks and warnings have been stripped for the internal functions and access is only intended through `ROCAUCScore`, make the functions for roc_auc_score adapted from scikit-learn private. * Check that data corresponds with multilabel flag Check that the training instances correspond with the multilabel flag, adding the multilabel flag if required. * Add textcat score to early stopping check * Add more checks to debug-data for textcat * Add example training data for textcat * Add more checks to textcat train CLI * Check configuration when extending base model * Fix typos * Update textcat example data * Provide licensing details and licenses for data * Remove two labels with no positive instances from jigsaw-toxic-comment data. Co-authored-by: Ines Montani <ines@ines.io>	2019-09-15 22:31:31 +02:00
Ines Montani	bab9976d9a	💫 Adjust Table API and add docs (#4289 ) * Adjust Table API and add docs * Add attributes and update description [ci skip] * Use strings.get_string_id instead of hash_string * Fix table method calls * Make orth arg in Lemmatizer.lookup optional Fall back to string, which is now handled by Table.__contains__ out-of-the-box * Fix method name * Auto-format	2019-09-15 22:08:13 +02:00
Ines Montani	88a9d87f6f	Fix test	2019-09-15 18:04:44 +02:00
Ines Montani	23e28e2844	Merge branch 'master' into develop	2019-09-15 17:57:09 +02:00
Ines Montani	c7e4ea7154	Update examples and languages.json [ci skip]	2019-09-15 17:56:40 +02:00
Ines Montani	aa3c59a2f3	Include Norwegian NER entity types in glossary [ci skip] See https://github.com/ltgoslo/norne	2019-09-15 17:16:21 +02:00
Ines Montani	7194845234	Skip tests properly instead of xfailing them	2019-09-15 17:00:17 +02:00
Ines Montani	16c2522791	Merge branch 'master' into develop	2019-09-14 16:42:01 +02:00
adrianeboyd	6942a6a69b	Extend default punct for sentencizer (#4290 ) Most of these characters are for languages / writing systems that aren't supported by spacy, but I don't think it causes problems to include them. In the UD evals, Hindi and Urdu improve a lot as expected (from 0-10% to 70-80%) and Persian improves a little (90% to 96%). Tamil improves in combination with #4288. The punctuation list is converted to a set internally because of its increased length. Sentence final punctuation generated with: ``` unichars -gas '[\p{Sentence_Break=STerm}\p{Sentence_Break=ATerm}]' '\p{Terminal_Punctuation}' ``` See: https://stackoverflow.com/a/9508766/461847 Fixes #4269.	2019-09-14 15:25:48 +02:00
adrianeboyd	bee7961927	Add Kannada, Tamil, and Telugu unicode blocks (#4288 ) Add Kannada, Tamil, and Telugu unicode blocks to uncased character classes so that period is recognized as a suffix during tokenization. (I'm sure a few symbols in the code blocks should not be ALPHA, but this is mainly relevant for suffix detection and seems to be an improvement in practice.)	2019-09-14 14:23:06 +02:00
Ines Montani	3126dd0904	Tidy up and auto-format [ci skip]	2019-09-14 12:58:06 +02:00
Ines Montani	27106d6528	Merge branch 'master' into develop	2019-09-13 17:07:17 +02:00
Sofie Van Landeghem	2ae5db580e	dim bugfix when incl_prior is False (#4285)	2019-09-13 16:30:05 +02:00
Paul O'Leary McCann	29a9e636eb	Fix half-width space handling in JA (#4284) (closes #4262) Before this patch, half-width spaces between words were simply lost in Japanese text. This wasn't immediately noticeable because much Japanese text never uses spaces at all.	2019-09-13 16:28:12 +02:00
Ines Montani	3c3658ef9f	Merge branch 'master' into develop	2019-09-12 18:03:01 +02:00
Ines Montani	228bbf506d	Improve label properties on pipes	2019-09-12 18:02:44 +02:00
Paul O'Leary McCann	7d8df69158	Bloom-filter backed Lookup Tables (#4268 ) * Improve load_language_data helper * WIP: Add Lookups implementation * Start moving lemma data over to JSON * WIP: move data over for more languages * Convert more languages * Fix lemmatizer fixtures in tests * Finish conversion * Auto-format JSON files * Fix test for now * Make sure tables are stored on instance * Update docstrings * Update docstrings and errors * Update test * Add Lookups.__len__ * Add serialization methods * Add Lookups.remove_table * Use msgpack for serialization to disk * Fix file exists check * Try using OrderedDict for everything * Update .flake8 [ci skip] * Try fixing serialization * Update test_lookups.py * Update test_serialize_vocab_strings.py * Lookups / Tables now work This implements the stubs in the Lookups/Table classes. Currently this is in Cython but with no type declarations, so that could be improved. * Add lookups to setup.py * Actually add lookups pyx The previous commit added the old py file... * Lookups work-in-progress * Move from pyx back to py * Add string based lookups, fix serialization * Update tests, language/lemmatizer to work with string lookups There are some outstanding issues here: - a pickling-related test fails due to the bloom filter - some custom lemmatizers (fr/nl at least) have issues More generally, there's a question of how to deal with the case where you have a string but want to use the lookup table. Currently the table allows access by string or id, but that's getting pretty awkward. * Change lemmatizer lookup method to pass (orth, string) * Fix token lookup * Fix French lookup * Fix lt lemmatizer test * Fix Dutch lemmatizer * Fix lemmatizer lookup test This was using a normal dict instead of a Table, so checks for the string instead of an integer key failed. * Make uk/nl/ru lemmatizer lookup methods consistent The mentioned tokenizers all have their own implementation of the `lookup` method, which accesses a `Lookups` table. 
The way that was called in `token.pyx` was changed so this should be updated to have the same arguments as `lookup` in `lemmatizer.py` (specifically (orth/id, string)). Prior to this change tests weren't failing, but there would probably be issues with normal use of a model. More tests should probably be added. Additionally, the language-specific `lookup` implementations seem like they might not be needed, since they handle things like lower-casing that aren't actually language specific. * Make recently added Greek method compatible * Remove redundant class/method Leftovers from a merge not cleaned up adequately.	2019-09-12 17:26:11 +02:00
Sofie Van Landeghem	9be4d1c105	Allow copying of user_data in as_doc (#4282 ) * Allow copying the user_data with as_doc + unit test * add option to docs * add typing * import fix * workaround to avoid bool clashing ... * bint instead of bool	2019-09-12 17:08:14 +02:00
Matthew Honnibal	7d782aa97b	Add more docstrings for MorphAnalysis	2019-09-12 16:48:30 +02:00
Ines Montani	b544dcb3c5	Document debug-data [ci skip]	2019-09-12 15:26:20 +02:00
Ines Montani	05a2df6616	Remove not implemented file validation [ci skip]	2019-09-12 15:26:02 +02:00
Ines Montani	10257f3131	Document Lookups [ci skip]	2019-09-12 14:00:14 +02:00
Ines Montani	32404e613c	Create directory if it doesn't exist	2019-09-12 14:00:01 +02:00
Ines Montani	625ce2db8e	Update Language docs [ci skip]	2019-09-12 13:03:38 +02:00
Ines Montani	655b434553	Merge branch 'master' into develop	2019-09-12 11:39:18 +02:00
Sofie Van Landeghem	0b4b4f1819	Documentation for Entity Linking (#4065) * document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bool's instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustments to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * typo fix * add candidate API to kb documentation * update API sidebar with EntityLinker and KnowledgeBase * remove EL from 101 docs * remove entity linker from 101 pipelines / rephrase * custom el model instead of existing model * set version to 2.2 for EL functionality * update documentation for 2 CLI scripts	2019-09-12 11:38:34 +02:00
Ines Montani	4d4b3b0783	Add "labels" to Language.meta	2019-09-12 11:34:25 +02:00
Ines Montani	ac0e27a825	💫 Add Language.pipe_labels (#4276 ) * Add Language.pipe_labels * Update spacy/language.py Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>	2019-09-12 10:56:28 +02:00
tamuhey	71909cdf22	Fix iss4278 (#4279) * fix: len(tuple) == 2 * (#4278) add fail test * add contributor's agreement	2019-09-12 10:44:49 +02:00
Ines Montani	8ebc3711dc	Fix bug in Parser.labels and add test (#4275)	2019-09-11 18:29:35 +02:00
Matthew Honnibal	7fbb559045	Set version to v2.2.0.dev6	2019-09-11 18:07:20 +02:00
Matthew Honnibal	f7a096b462	Update morphology	2019-09-11 18:06:43 +02:00
Matthew Honnibal	f8ce9dde0f	Set version to v2.2.0.dev5	2019-09-11 17:41:21 +02:00
Matthew Honnibal	c47c0269b1	Update morphology features	2019-09-11 15:16:53 +02:00
Ines Montani	af25323653	Tidy up and auto-format	2019-09-11 14:00:36 +02:00
Matthew Honnibal	af93997993	Fix conllu converter	2019-09-11 13:28:07 +02:00
Matthew Honnibal	178d010b25	Set version to 2.2.0.dev4	2019-09-11 12:28:37 +02:00
Ines Montani	e82a8d0d7a	Merge branch 'master' into develop	2019-09-11 11:52:38 +02:00
Ines Montani	8f9f48b04c	Add GreekLemmatizer.lookup (resolves #4272)	2019-09-11 11:44:40 +02:00
Ines Montani	6279d74c65	Tidy up and auto-format	2019-09-11 11:38:22 +02:00
Matthew Honnibal	7b858ba606	Update from master	2019-09-10 20:14:08 +02:00
Ines Montani	669a7d37ce	Exclude vocab when testing to_bytes	2019-09-10 19:45:16 +02:00
adrianeboyd	e367864e59	Update Ukrainian create_lemmatizer kwargs (#4266) Allow Ukrainian create_lemmatizer to accept lookups kwarg.	2019-09-10 11:14:46 +02:00
adrianeboyd	c32126359a	Allow period as suffix following punctuation (#4248) Addresses rare cases (such as `_MATH_.`, see #1061) where the final period was not recognized as a suffix following punctuation.	2019-09-09 19:19:22 +02:00
Ines Montani	3e8f136ba7	💫 WIP: Basic lookup class scaffolding and JSON for all lemmatizer data (#4178 ) * Improve load_language_data helper * WIP: Add Lookups implementation * Start moving lemma data over to JSON * WIP: move data over for more languages * Convert more languages * Fix lemmatizer fixtures in tests * Finish conversion * Auto-format JSON files * Fix test for now * Make sure tables are stored on instance * Update docstrings * Update docstrings and errors * Update test * Add Lookups.__len__ * Add serialization methods * Add Lookups.remove_table * Use msgpack for serialization to disk * Fix file exists check * Try using OrderedDict for everything * Update .flake8 [ci skip] * Try fixing serialization * Update test_lookups.py * Update test_serialize_vocab_strings.py * Fix serialization for lookups * Fix lookups * Fix lookups * Fix lookups * Try to fix serialization * Try to fix serialization * Try to fix serialization * Try to fix serialization * Give up on serialization test * Xfail more serialization tests for 3.5 * Fix lookups for 2.7	2019-09-09 19:17:55 +02:00
Sofie Van Landeghem	482c7cd1b9	pulling tqdm imports in functions to avoid bug (tmp fix) (#4263)	2019-09-09 16:32:11 +02:00
Mihai Gliga	25aecd504f	adding Romanian tag_map (#4257) * adding Romanian tag_map * added SCA file * forgotten import	2019-09-09 11:53:09 +02:00
Matthew Honnibal	1653b818c5	Update Lithuanian tag map	2019-09-08 20:57:58 +02:00
adrianeboyd	3780e2ff50	Flush tokenizer cache when necessary (#4258) Flush tokenizer cache when affixes, token_match, or special cases are modified. Fixes #4238, same issue as in #1250.	2019-09-08 20:52:46 +02:00
Matthew Honnibal	da8830d909	Set version to v2.2.0.dev3	2019-09-08 18:22:03 +02:00
Matthew Honnibal	1a65c5b7af	Update develop from master	2019-09-08 18:21:41 +02:00
Matthew Honnibal	aec6174ae6	Fix lemmatizer	2019-09-08 18:09:53 +02:00
Matthew Honnibal	fde4f8ac8e	Create lookups if not passed in	2019-09-08 18:08:09 +02:00
Pavle Vidanović	d03401f532	Lemmatizer lookup dictionary for Serbian and basic tag set adde… (#4251 ) * Serbian stopwords added. (cyrillic alphabet) * spaCy Contribution agreement included. * Test initialize updated * Serbian language code update. --bugfix * Tokenizer exceptions added. Init file updated. * Norm exceptions and lexical attributes added. * Examples added. * Tests added. * sr_lang examples update. * Tokenizer exceptions updated. (Serbian) * Lemmatizer created. Licence included. * Test updated. * Tag map basic added. * tag_map.py file removed since it uses default spacy tags.	2019-09-08 14:19:15 +02:00
Ivan Šarić	b01025dd06	adds Croatian lemma_lookup.json, license file and corresponding tests (#4252)	2019-09-08 13:40:45 +02:00
adrianeboyd	aec755d3a3	Modify retokenizer to use span root attributes (#4219 ) * Modify retokenizer to use span root attributes * tag/pos/morph are set to root tag/pos/morph * lemma and norm are reset and end up as orth (not ideal, but better than orth of first token) * Also handle individual merge case * Add test * Attempt to handle ent_iob and ent_type in merges * Fix check for whether B-ENT should become I-ENT * Move IOB consistency check to after attrs Move all IOB consistency checks after attrs are set and simplify to check entire document, modifying I to B at the beginning of the document or if the entity type of the previous token isn't the same. * Move IOB consistency check for single merge Move IOB consistency check after the token array is compressed for the single merge case. * Update spacy/tokens/_retokenize.pyx Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com> * Remove single vs. multiple merge distinction Remove original single-instance `_merge()` and use `_bulk_merge()` (now renamed `_merge()`) for all merges. * Add out-of-bound check in previous entity check	2019-09-08 13:04:49 +02:00
Bae Yong-Ju	a55f5a744f	Fix ValueError exception on empty Korean text. (#4245)	2019-09-06 10:29:40 +02:00
Adriane Boyd	0f28418446	Add regression test for #1061 back to test suite	2019-09-04 20:42:24 +02:00
Adriane Boyd	c39c13f26b	Add guillemets/chevrons to German orth variants Add guillemets/chevrons to German orth variants for both German/Austrian and Swiss conventions.	2019-09-04 20:05:08 +02:00
Adriane Boyd	6b0fec76fd	Fix handling of preset entities in NER * Fix check of valid ent_type for B * Add valid L as preset-I followed by not-I	2019-09-04 13:42:42 +02:00
Ines Montani	419ae59c79	Make flaky test test_issue_1971_4 more explicit	2019-08-31 14:08:05 +02:00
Ines Montani	cd90752193	Tidy up and auto-format [ci skip]	2019-08-31 13:39:06 +02:00
Matthew Honnibal	67c3d03905	Revert morphology serialisation	2019-08-30 13:13:07 +02:00
Adriane Boyd	893f11a9e3	Serialize tag_map directly Fix Aspect_prof typo	2019-08-30 11:30:03 +02:00
Adriane Boyd	02babf9317	English tag map without unsupported features/values	2019-08-30 11:29:19 +02:00
Matthew Honnibal	516650f58f	Merge pull request #4207 from svlandeg/bugfix/serialize-tok-exc Bugfix for serializing tokenizer rules/exceptions	2019-08-30 11:04:58 +02:00
Matthew Honnibal	f3c3ce7f1e	Update vocab	2019-08-29 21:19:54 +02:00
Matthew Honnibal	fc0a3c8c38	Add morphology serialization	2019-08-29 21:17:34 +02:00
Matthew Honnibal	c94fc9edb9	Fix noise addition	2019-08-29 15:39:32 +02:00
Matthew Honnibal	32842a3cd4	Disable whitespace corruption	2019-08-29 15:01:58 +02:00
Matthew Honnibal	3c1c0ec18e	Add tests for NER oracle with whitespace	2019-08-29 14:33:39 +02:00
Matthew Honnibal	6511e1d8d3	Fix NER gold-standard around whitespace	2019-08-29 14:33:07 +02:00
adrianeboyd	82159b5c19	Updates/bugfixes for NER/IOB converters (#4186) * Updates/bugfixes for NER/IOB converters * Converter formats `ner` and `iob` use autodetect to choose a converter if possible * `iob2json` is reverted to handle sentence-per-line data like `word1|pos1|ent1 word2|pos2|ent2` * Fix bug in `merge_sentences()` so the second sentence in each batch isn't skipped * `conll_ner2json` is made more general so it can handle more formats with whitespace-separated columns * Supports all formats where the first column is the token and the final column is the IOB tag; if present, the second column is the POS tag * As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O` separates documents * Add option for segmenting sentences (new flag `-s`) * Parser-based sentence segmentation with a provided model, otherwise with sentencizer (new option `-b` to specify model) * Can group sentences into documents with `n_sents` as long as sentence segmentation is available * Only applies automatic segmentation when there are no existing delimiters in the data * Provide info about settings applied during conversion with warnings and suggestions if settings conflict or might not be optimal. * Add tests for common formats * Add '(default)' back to docs for -c auto * Add document count back to output * Revert changes to converter output message * Use explicit tabs in convert CLI test data * Adjust/add messages for n_sents=1 default * Add sample NER data to training examples * Update README * Add links in docs to example NER data * Define msg within converters	2019-08-29 12:04:01 +02:00
adrianeboyd	5feb342f5e	Add more token attributes to token pattern schema (#4210) Add token attributes with tests to token pattern schema.	2019-08-29 12:02:26 +02:00
Adriane Boyd	f3906950d3	Add separate noise vs orth level to train CLI	2019-08-29 09:10:35 +02:00
Matthew Honnibal	7d6d438566	Set version to v2.2.0.dev2	2019-08-28 18:30:43 +02:00
Matthew Honnibal	bc5ce49859	Fix 'noise_level' in train cmd	2019-08-28 17:55:38 +02:00
Matthew Honnibal	782056d117	Fix morph rules	2019-08-28 16:59:45 +02:00
Matthew Honnibal	6b2ea883ed	Merge pull request #4205 from adrianeboyd/feature/gold-train-orth-variants Add train_docs() option to add orth variants	2019-08-28 16:54:06 +02:00
svlandeg	c54aabc3cd	fix loading custom tokenizer rules/exceptions from file	2019-08-28 14:17:44 +02:00
svlandeg	7bec0ebbcb	failing unit test for Issue 4190	2019-08-28 14:16:34 +02:00
Adriane Boyd	0a26e94d02	Modify raw to match orth variant annotation tuples If raw is available, attempt to modify raw to match the orth variants. If raw/words can't be aligned, abort and return unmodified raw/annotation.	2019-08-28 13:38:54 +02:00
Adriane Boyd	47af3f676e	Single and paired orth variants for German	2019-08-28 09:19:18 +02:00
Adriane Boyd	56c38484a1	Single and paired orth variants for English	2019-08-28 09:19:18 +02:00
Adriane Boyd	aae05ff16b	Add train_docs() option to add orth variants Filtering by orth and tag, create variants of training docs with alternate orth variants, e.g., unicode quotes, dashes, and ellipses. The variants can be single tokens (dashes) or paired tokens (quotes) with left and right versions. Currently restricted to only add variants to training documents without raw text provided, where only gold.words needs to be modified.	2019-08-28 09:18:36 +02:00
Matthew Honnibal	af7fad2c6d	Set version to v2.2.0.dev1	2019-08-25 22:05:47 +02:00
Matthew Honnibal	71c0321ecf	Fix test	2019-08-25 22:03:37 +02:00
Matthew Honnibal	188a1cf297	Fix morphology for | features	2019-08-25 21:57:02 +02:00
Matthew Honnibal	095c63c6b8	Avoid making prepositions get the tag SCONJ	2019-08-25 21:56:47 +02:00
Matthew Honnibal	22250cf6b7	Make regression test less sensitive to tag-map stuff	2019-08-25 21:54:26 +02:00
Matthew Honnibal	4e2f07a655	Merge branch 'develop' into feature/lemmatizer	2019-08-25 21:03:25 +02:00
Matthew Honnibal	c308cf3e3e	Merge branch 'master' into feature/lemmatizer	2019-08-25 13:52:27 +02:00
Matthew Honnibal	08e8267a59	Set version to 2.2.0.dev0	2019-08-25 13:50:00 +02:00
Matthew Honnibal	bb911e5f4e	Fix #3830: 'subtok' label being added even if learn_tokens=False (#4188) * Prevent subtok label if not learning tokens The parser introduces the subtok label to mark tokens that should be merged during post-processing. Previously this happened even if we did not have the --learn-tokens flag set. This patch passes the config through to the parser, to prevent the problem. * Make merge_subtokens a parser post-process if learn_subtokens * Fix train script * Add test for 3830: subtok problem * Fix handling of non-subtok in parser training	2019-08-23 17:54:00 +02:00
Sofie Van Landeghem	c417c380e3	Matcher ID fixes (#4179) * allow phrasematcher to link one match to multiple original patterns * small fix for defining ent_id in the matcher (anti-ghost prevention) * cleanup * formatting	2019-08-22 17:17:07 +02:00
Ines Montani	f5d3afb1a3	Fix typo in docstrings [ci skip]	2019-08-22 16:24:15 +02:00
Ines Montani	5ca7dd0f94	💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167 ) * Improve load_language_data helper * WIP: Add Lookups implementation * Start moving lemma data over to JSON * WIP: move data over for more languages * Convert more languages * Fix lemmatizer fixtures in tests * Finish conversion * Auto-format JSON files * Fix test for now * Make sure tables are stored on instance	2019-08-22 14:21:32 +02:00
Sofie Van Landeghem	73b38c33e4	Small retokenizer fix (#4174)	2019-08-22 12:23:54 +02:00
Ines Montani	a8752a569d	Auto-format [ci skip]	2019-08-22 11:44:39 +02:00
Pavle Vidanović	60e10a9f93	Serbian language improvement (#4169 ) * Serbian stopwords added. (cyrillic alphabet) * spaCy Contribution agreement included. * Test initialize updated * Serbian language code update. --bugfix * Tokenizer exceptions added. Init file updated. * Norm exceptions and lexical attributes added. * Examples added. * Tests added. * sr_lang examples update. * Tokenizer exceptions updated. (Serbian)	2019-08-22 11:43:07 +02:00
Sofie Van Landeghem	de272f8b82	adding double match for optional operator at the end (#4166)	2019-08-21 22:46:56 +02:00
Sofie Van Landeghem	01c5980187	Serialize POS attribute when doc.is_tagged (#4092) * fix and unit test for issue 3959 * additional unit test for manifestation of the same (resolved) bug	2019-08-21 21:59:30 +02:00
Sofie Van Landeghem	7539a4f3a8	use states[q] in while retry loop (#4162)	2019-08-21 21:58:04 +02:00
adrianeboyd	2d17b047e2	Check for is_tagged/is_parsed for Matcher attrs (#4163 ) Check for relevant components in the pipeline when Matcher is called, similar to the checks for PhraseMatcher in #4105. * keep track of attributes seen in patterns * when Matcher is called on a Doc, check for is_tagged for LEMMA, TAG, POS and for is_parsed for DEP	2019-08-21 20:52:36 +02:00
Pavle Vidanović	4fe9329bfb	Serbian language code update "rs" -> "sr" (#4159 ) * Serbian stopwords added. (cyrillic alphabet) * spaCy Contribution agreement included. * Test initialize updated * Serbian language code update. --bugfix	2019-08-21 19:57:37 +02:00
Matthew Honnibal	bcd08f20af	Merge changes from master	2019-08-21 14:18:52 +02:00
adrianeboyd	8fe7bdd0fa	Improve token pattern checking without validation (#4105 ) * Fix typo in rule-based matching docs * Improve token pattern checking without validation Add more detailed token pattern checks without full JSON pattern validation and provide more detailed error messages. Addresses #4070 (also related: #4063, #4100). * Check whether top-level attributes in patterns and attr for PhraseMatcher are in token pattern schema * Check whether attribute value types are supported in general (as opposed to per attribute with full validation) * Report various internal error types (OverflowError, AttributeError, KeyError) as ValueError with standard error messages * Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS, LEMMA, and DEP * Add error messages with relevant details on how to use validate=True or nlp() instead of nlp.make_doc() * Support attr=TEXT for PhraseMatcher * Add NORM to schema * Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler * Remove unnecessary .keys() * Rephrase error messages * Add another type check to Matcher Add another type check to Matcher for more understandable error messages in some rare cases. * Support phrase_matcher_attr=TEXT for EntityRuler * Don't use spacy.errors in examples and bin scripts * Fix error code * Auto-format Also try get Azure pipelines to finally start a build :( * Update errors.py Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2019-08-21 14:00:37 +02:00
Ines Montani	f580302673	Tidy up and auto-format	2019-08-20 17:36:34 +02:00
Ines Montani	364aaf5bc2	Simplify test	2019-08-20 16:41:58 +02:00
Sofie Van Landeghem	68ee0384fd	Unit test for Issue 3879 (#4153) * failing unit test for Issue #3879 * mark test as failing	2019-08-20 16:40:25 +02:00
Ines Montani	86cd7f0efd	Add regression test for #4120	2019-08-20 16:33:09 +02:00
Ines Montani	104125edd2	Tidy up errors	2019-08-20 16:03:45 +02:00
Ines Montani	cc76a26fe8	Raise error for negative arc indices (closes #3917)	2019-08-20 15:51:37 +02:00
Ines Montani	69e70ffae1	Merge branch 'master' of https://github.com/explosion/spaCy	2019-08-20 15:09:52 +02:00
Ines Montani	f65e36925d	Fix absolute imports and avoid importing from cli	2019-08-20 15:08:59 +02:00
Ines Montani	7e8be44218	Auto-format	2019-08-20 15:06:31 +02:00
Paul O'Leary McCann	756b66b7c0	Reduce size of language data (#4141 ) * Move Turkish lemmas to a json file Rather than a large dict in Python source, the data is now a big json file. This includes a method for loading the json file, falling back to a compressed file, and an update to MANIFEST.in that excludes json in the spacy/lang directory. This focuses on Turkish specifically because it has the most language data in core. * Transition all lemmatizer.py files to json This covers all lemmatizer.py files of a significant size (>500k or so). Small files were left alone. None of the affected files have logic, so this was pretty straightforward. One unusual thing is that the lemma data for Urdu doesn't seem to be used anywhere. That may require further investigation. * Move large lang data to json for fr/nb/nl/sv These are the languages that use a lemmatizer directory (rather than a single file) and are larger than English. For most of these languages there were many language data files, in which case only the large ones (>500k or so) were converted to json. It may or may not be a good idea to migrate the remaining Python files to json in the future. * Fix id lemmas.json The contents of this file were originally just copied from the Python source, but that used single quotes, so it had to be properly converted to json first. * Add .json.gz to gitignore This covers the json.gz files built as part of distribution. * Add language data gzip to build process Currently this gzip data on every build; it works, but it should be changed to only gzip when the source file has been updated. * Remove Danish lemmatizer.py Missed this when I added the json. * Update to match latest explosion/srsly#9 The way gzipped json is loaded/saved in srsly changed a bit. * Only compress language data if necessary If a .json.gz file exists and is newer than the corresponding json file, it's not recompressed. 
* Move en/el language data to json This only affected files >500kb, which was nouns for both languages and the generic lookup table for English. * Remove empty files in Norwegian tokenizer It's unclear why, but the Norwegian (nb) tokenizer had empty files for adj/adv/noun/verb lemmas. This may have been a result of copying the structure of the English lemmatizer. This removed the files, but still creates the empty sets in the lemmatizer. That may not actually be necessary. * Remove dubious entries in English lookup.json " furthest" and " skilled" - both prefixed with a space - were in the English lookup table. That seems obviously wrong so I have removed them. * Fix small issues with en/fr lemmatizers The en tokenizer was including the removed _nouns.py file, so that's removed. The fr tokenizer is unusual in that it has a lemmatizer directory with both __init__.py and lemmatizer.py. lemmatizer.py had not been converted to load the json language data, so that was fixed. * Auto-format * Auto-format * Update srsly pin * Consistently use pathlib paths	2019-08-20 14:54:11 +02:00
Ivan Šarić	434f6fa6c1	Issue #1107 - adds examples.py for Croatian language (#4143) * adds contributor agreement for isaric * adds examples.py for croatian language	2019-08-18 23:04:41 +02:00
Paul O'Leary McCann	7f82a1fe1b	Make the emoticon list a raw string (#4139) While working on an unrelated task I got warnings about an unsupported escape sequence (`"\("`) in the tokenizer exceptions. Making the tokenizer exceptions a raw string makes this warning go away. The specific string that triggered this is `¯\(ツ)/¯`.	2019-08-18 15:17:13 +02:00
Ines Montani	009280fbc5	Tidy up and auto-format	2019-08-18 15:09:16 +02:00
Ines Montani	89f2b87266	Open file as utf-8 (closes #4138)	2019-08-18 13:55:34 +02:00
Ines Montani	f35a8221d8	Move generation of parses out of with blocks	2019-08-18 13:54:26 +02:00
yanaiela	ec0beccaf1	Custom entity render (#4117 ) * customizable template for entities display, allowing to pass additional parameters along each entity * contributor agreement * simpler naming for the additional parameters given to the span entities renderer Co-Authored-By: Ines Montani <ines@ines.io> * change of default parameter, as suggested Co-Authored-By: Ines Montani <ines@ines.io>	2019-08-16 18:39:25 +02:00
Ines Montani	e5c7e19e82	Fix typo and auto-format [ci skip]	2019-08-16 10:53:38 +02:00
adrianeboyd	a58cb023d7	WIP: Extending debug-data (#4114 ) * Extending debug-data with dependency checks, etc. * Modify debug-data to load with GoldCorpus to iterate over .json/.jsonl files within directories * Add GoldCorpus iterator train_docs_without_preprocessing to load original train docs without shuffling and projectivizing * Report number of misaligned tokens * Add more dependency checks and messages * Update spacy/cli/debug_data.py Co-Authored-By: Ines Montani <ines@ines.io> * Fixed conflict * Move counts to _compile_gold() * Move all dependency nonproj/sent/head/cycle counting to _compile_gold() * Unclobber previous merges * Update variable names * Update more variable names, fix misspelling * Don't clobber loading error messages * Only warn about misaligned tokens if present	2019-08-16 10:52:46 +02:00
Ziming He	eea7d4f4a8	biluo_tags_from_offsets throw exception for overlapping entities (#4021 ) * Check whether two entities overlap - biluo_gold_biluo_overlap now throw exception when entities passed in have overlaps - added unit test * SCA agreement	2019-08-15 18:13:32 +02:00
adrianeboyd	2f9b28c218	Provide more info in cycle error message E069 (#4123) Provide the tokens in the cycle and the first 50 tokens from document in the error message so it's easier to track down the location of the cycle in the data. Addresses feature request in #3698.	2019-08-15 18:08:28 +02:00
AJ Rader	2f3648700c	Correction of default lemmatizer lookup in English (Issue # 4104) (#4110 ) * pytest file for issue4104 established * edited default lookup english lemmatizer for spun; fixes issue 4102 * eliminated parameterization and sorted dictionary dependnency in issue 4104 test * added contributor agreement	2019-08-15 11:39:10 +02:00
Ines Montani	1711b5eb62	💫 Support displaCy user colors via entry point (#4113)	2019-08-13 15:59:55 +02:00
Sofie Van Landeghem	0ba1b5eebc	CLI scripts for entity linking (wikipedia & generic) (#4091 ) * document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bool's instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustements to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * turn kb_creator into CLI script (wip) * proper parameters for training entity vectors * wikidata pipeline split up into two executable scripts * remove context_width * move wikidata scripts in bin directory, remove old dummy script * refine KB script with logs and preprocessing options * small edits * small improvements to logging of EL CLI script	2019-08-13 15:38:59 +02:00
黎谢鹏	250a54414b	update lang/zh (#4103) * update lang/zh * update lang/zh	2019-08-12 10:37:48 +02:00
Sofie Van Landeghem	963ea5e8d0	Update lemma and vector information after splitting a token (#4097) * fixing vector and lemma attributes after retokenizer.split * fixing unit test with mockup tensor * xp instead of numpy	2019-08-08 15:09:44 +02:00
Matthew Honnibal	04113a844d	Set version to v2.1.8	2019-08-07 13:53:58 +02:00
Ines Montani	6bec24cdd0	Require downloaded model in pkg_resources (#4090)	2019-08-07 13:18:11 +02:00
adrianeboyd	69aca7d839	Add validate option to EntityRuler (#4089 ) * Add validate option to EntityRuler * Add validate to EntityRuler, passed to Matcher and PhraseMatcher * Add validate to usage and API docs * Update website/docs/usage/rule-based-matching.md Co-Authored-By: Ines Montani <ines@ines.io> * Update website/docs/usage/rule-based-matching.md Co-Authored-By: Ines Montani <ines@ines.io>	2019-08-07 00:40:53 +02:00
Jeno	15be09ceb0	Raise error if annotation dict in simple training style has unexpected keys #4074 (#4079 ) * adding enhancement #4074. * modified behavior to strictly require top level dictionary keys - issue #4074 * pass expected keys to error message and add links as expected top level key	2019-08-06 11:01:25 +02:00
Sofie Van Landeghem	ad09b0d6f3	fetch norm from lex if necessary for matching (#4080)	2019-08-05 23:51:04 +02:00
Pavle Vidanović	e1a935d71c	Stopwords for Serbian language. (#4078) * Serbian stopwords added. (cyrillic alphabet) * spaCy Contribution agreement included. * Test initialize updated	2019-08-05 10:22:27 +02:00
veer-bains	874bd8c8dd	Fixed syntax error in lang/ko when using python 2 (#4082 ) (closes #4068 ) * fixed syntax error in declaring variables with python 2.7 in spacy/lang/ko/__init__.py * fixed syntax error in declaring variables with python 2.7 in spacy/lang/ko/__init__.py * Update __init__.py * Create veer-bains.md * Update __init__.py fixed syntax errors in variable datatype assignment when calling spacy.blank("ko") with python 2.7	2019-08-05 10:19:32 +02:00
Ines Montani	87ddbdc33e	Fix handling of kwargs in Language.evaluate Makes it consistent with other methods	2019-08-04 13:44:21 +02:00
Muhammad Irfan	d1d30b0442	added missing punctuation following conventions. (#4066)	2019-08-04 13:41:18 +02:00
Anastassia	33b14724a5	Update gold corpus code to properly ingest a directory of jsonl… (#4067 ) * Update gold corpus code to properly ingest a directory of jsonlines files In response to: https://github.com/explosion/spaCy/issues/3975 * Update spacy/gold.pyx Co-Authored-By: Ines Montani <ines@ines.io>	2019-08-02 09:58:51 +02:00
Matthew Honnibal	944a66c326	Add span.tensor and token.tensor attributes	2019-08-01 18:30:50 +02:00
Matthew Honnibal	d3071ecdbc	Set version to v2.1.7	2019-08-01 18:09:19 +02:00
Matthew Honnibal	97c51ef93b	Set version to v2.1.7.dev1	2019-08-01 17:29:25 +02:00
Matthew Honnibal	4632c597e7	Fix Pipe base class	2019-08-01 17:29:01 +02:00
Ines Montani	8718ca8b1f	Fix init_model if there's no vocab (closes #4048) (#4049)	2019-08-01 17:26:09 +02:00
adrianeboyd	925a852bb6	Improve NER per type scoring (#4052 ) * Improve NER per type scoring * include all gold labels in per type scoring, not only when recall > 0 * improve efficiency of per type scoring * Create Scorer tests, initially with NER tests * move regression test #3968 (per type NER scoring) to Scorer tests * add new test for per type NER scoring with imperfect P/R/F and per type P/R/F including a case where R == 0.0	2019-08-01 17:15:36 +02:00
Sofie Van Landeghem	f7d950de6d	ensure the lang of vocab and nlp stay consistent (#4057) * ensure the language of vocab and nlp stay consistent across serialization * equality with =	2019-08-01 17:13:01 +02:00
Sofie Van Landeghem	7de3b129ab	Resolve edge case when calling textcat.predict with empty doc (#4035) * resolve edge case where no doc has tokens when calling textcat.predict * more explicit value test	2019-07-30 14:58:01 +02:00
Matthew Honnibal	89c92c65fb	Update version	2019-07-28 17:56:38 +02:00
Matthew Honnibal	06eb428ed1	Make pipe base class a bit less presumptuous	2019-07-28 17:56:11 +02:00
Matthew Honnibal	16b5144095	Don't raise NotImplemented in Pipe.update	2019-07-28 17:54:11 +02:00
Ines Montani	fc69da0acb	💫 Support simple training format in nlp.evaluate and add tests (#4033) * Support simple training format in nlp.evaluate and add tests * Update docs [ci skip]	2019-07-27 17:30:18 +02:00
Ines Montani	a3723f439c	Fix formatting [ci skip]	2019-07-27 16:35:42 +02:00
Ines Montani	d5bce35fb1	Fix bug in Span.similarity when called via hook	2019-07-27 15:33:27 +02:00
Ines Montani	109b5e1798	Fix bug in Token.similarity when called via hook	2019-07-27 15:26:01 +02:00
Ines Montani	e000b5ed82	Also support "requirements" in model.json	2019-07-27 13:34:57 +02:00
Ines Montani	307ffe472d	Support custom language factory setting in meta.json (#4031)	2019-07-27 13:17:43 +02:00
Bae Yong-Ju	05fbf5d976	Fix error when Korean text contains regexp special characters. (#4022)	2019-07-25 17:53:33 +02:00
Matthew Honnibal	73e095923f	💫 Improve error message when model.from_bytes() dies (#4014 ) * Improve error message when model.from_bytes() dies When Thinc's model.from_bytes() is called with a mismatched model, often we get a particularly ungraceful error, e.g. "AttributeError: FunctionLayer has no attribute G" This is because we're trying to load the parameters for something like a LayerNorm layer, and the model architecture has some other layer there instead. This is obviously terrible, especially since the error type is wrong. I've changed it to raise a ValueError. The error message is still probably a bit terse, but it's hard to be sure exactly what's gone wrong. * Update spacy/pipeline/pipes.pyx * Update spacy/pipeline/pipes.pyx * Update spacy/pipeline/pipes.pyx * Update spacy/syntax/nn_parser.pyx * Update spacy/syntax/nn_parser.pyx * Update spacy/pipeline/pipes.pyx Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com> * Update spacy/pipeline/pipes.pyx Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com> Co-authored-by: Ines Montani <ines@ines.io>	2019-07-24 11:27:34 +02:00
Ines Montani	87fcf3141c	Merge pull request #4003 from svlandeg/feature/nel-fixes API changes for Entity linking functionality	2019-07-23 23:17:07 +02:00
Paul O'Leary McCann	c8949ce88a	Remove old comment (#4012) Norwegian used to borrow from French but that doesn't appear to have been true for a while now, so the comment that was here is no longer relevant.	2019-07-23 23:10:06 +02:00
Sofie Van Landeghem	ba02957c80	Fix dependency copy for as_doc (#3969 ) * failing unit test for issue 3962 * attempt to fix Issue #3962 * create artificial unit test example * using length instead of self.length * sp * reformat with black * find better ancestor within span and use generic 'dep' * attach to span.root if there is no appropriate ancestor * comment span text * clean up ancestor code * reconstruct dep tree to keep same number of sentences	2019-07-23 18:28:54 +02:00
svlandeg	4e7ec1ed31	return fix	2019-07-23 14:23:58 +02:00
svlandeg	400ff342cf	replace assert's with custom error messages	2019-07-23 11:52:48 +02:00
svlandeg	20389e4553	format and bugfix	2019-07-22 15:08:17 +02:00
svlandeg	b1911f7105	Errors.E146 for IO error when FP is null	2019-07-22 14:56:13 +02:00
svlandeg	5d544f89ba	Errors.E145 for IO errors when reading KB	2019-07-22 14:36:07 +02:00
Ines Montani	a32b033b8c	Add regression test for #4002 Test that the PhraseMatcher can match on overwritten NORM attributes.	2019-07-22 14:18:24 +02:00
svlandeg	ad65171837	Merge remote-tracking branch 'upstream/master' into feature/nel-fixes	2019-07-22 13:41:28 +02:00
svlandeg	76184374e2	test corner cases	2019-07-22 13:39:32 +02:00
svlandeg	9f8c1e71a2	fix for Issue #4000	2019-07-22 13:34:12 +02:00
svlandeg	dae8a21282	rename entity frequency	2019-07-19 17:40:28 +02:00
svlandeg	41fb5204ba	output tensors as part of predict	2019-07-19 14:47:36 +02:00
svlandeg	21176517a7	have gold.links correspond exactly to doc.ents	2019-07-19 12:36:15 +02:00
BreakBB	3e370cf2ba	Add 'Prof.' to Englisch tokenizer_exceptions	2019-07-19 10:00:45 +02:00
svlandeg	e1213eaf6a	use original gold object in get_loss function	2019-07-18 13:35:10 +02:00
svlandeg	ec55d2fccd	filter training data beforehand (+black formatting)	2019-07-18 10:22:24 +02:00
Falak Asad	ff1e73e35c	Bugfix/issue 3968 (#3982) * Fix for issue-3968 * Added contributor agreement * Made suggested changes	2019-07-18 00:20:32 +02:00
svlandeg	d833d4c358	fixes in kb and gold	2019-07-17 17:18:26 +02:00
Ines Montani	73565c6d9d	Rename function arguments	2019-07-17 14:29:52 +02:00
Matthew Honnibal	394e4d8058	Add docstring for spacy.gold.align	2019-07-17 13:59:17 +02:00
Ines Montani	073013f129	Auto-format [ci skip]	2019-07-17 12:34:13 +02:00
svlandeg	4086c6ff60	get vector functionality + unit test	2019-07-17 12:17:02 +02:00
Ines Montani	62ff128888	Add regression test for #3951	2019-07-16 14:00:00 +02:00
Ines Montani	7f551050b1	Add regression test for #3972	2019-07-16 13:07:35 +02:00
svlandeg	a63d15a142	code cleanup	2019-07-15 17:36:43 +02:00
svlandeg	cdc589d344	small fix	2019-07-15 12:04:45 +02:00
svlandeg	60f299374f	set default context width	2019-07-15 12:03:09 +02:00
svlandeg	6e809e9b8b	proper error for missing cfg arguments	2019-07-15 11:42:50 +02:00
svlandeg	6026958957	tokenizer doc fix	2019-07-15 11:19:34 +02:00
Ines Montani	c0e29f7029	Merge pull request #3957 from sorenlind/danish-tokenizer-slash Make Danish tokenizer split on forward slash	2019-07-12 18:19:22 +02:00
Matthew Honnibal	ef666656b3	Fix attrs alignment	2019-07-12 17:59:47 +02:00
Matthew Honnibal	c345c042b0	Fix symbol alignment	2019-07-12 17:48:38 +02:00
Ines Montani	7281026879	Increment version [ci skip]	2019-07-12 17:40:00 +02:00
Søren Lind Kristiansen	26aee70d95	Make Danish tokenizer split on forward slash	2019-07-12 15:20:42 +02:00
Matthew Honnibal	3bc4d618f9	Set version to v2.1.5	2019-07-12 13:26:12 +02:00
Sofie Van Landeghem	ed774cb953	Fixing ngram bug (#3953) * minimal failing example for Issue #3661 * referenced Issue #3661 instead of Issue #3611 * cleanup	2019-07-12 10:01:35 +02:00
Matthew Honnibal	09dc01a426	Fix #3853, and add warning	2019-07-11 14:46:47 +02:00
Matthew Honnibal	7369949d2e	Add warning for #3853	2019-07-11 14:46:47 +02:00
Ines Montani	673c864a06	Fix doc.count_by functionality (#3950) Fix doc.count_by functionality	2019-07-11 13:44:00 +02:00
Ines Montani	2426f4d44c	Fix default punctuation rules for splitting Hindi text (#3948) Fix default punctuation rules for splitting Hindi text Co-authored-by: yash <patadiayash@gmail.com> Co-authored-by: Ines Montani <ines@ines.io>	2019-07-11 13:36:28 +02:00
svlandeg	349107daa3	cleanup	2019-07-11 13:09:22 +02:00

7607 Commits