Commit Graph

6856 Commits

Author SHA1 Message Date
adrianeboyd
9be90dbca3
Improve token head verification (#5079)
* Improve token head verification

Improve the verification for valid token heads when heads are set:

* in `Token.head`: heads come from the same document
* in `Doc.from_array()`: head indices are within the bounds of the
document

* Improve error message
2020-03-03 21:44:51 +01:00
adrianeboyd
8c20dae6f7
Fix model-final/model-best meta from train CLI (#5093)
* Fix model-final/model-best meta

* include speed and accuracy from final iteration
* combine with speeds from base model if necessary

* Include token_acc metric for all components
2020-03-03 21:43:25 +01:00
Sofie Van Landeghem
a0998868ff
prevent updating cfg if the Model was already defined (#5078) 2020-03-03 13:58:56 +01:00
Sofie Van Landeghem
d307e9ca58
take care of global vectors in multiprocessing (#5081)
* restore load_nlp.VECTORS in the child process

* add unit test

* fix test

* remove unnecessary import

* add utf8 encoding

* import unicode_literals
2020-03-03 13:58:22 +01:00
adrianeboyd
d078b47c81
Break out of infinite loop as intended (#5077) 2020-03-03 12:29:05 +01:00
adrianeboyd
697bec764d
Normalize IS_SENT_START to SENT_START for Matcher (#5080) 2020-03-03 12:22:39 +01:00
adrianeboyd
2281c4708c
Restore empty tokenizer properties (#5026)
* Restore empty tokenizer properties

* Check for types in tokenizer.from_bytes()

* Add test for setting empty tokenizer rules
2020-03-02 11:55:02 +01:00
Sofie Van Landeghem
c6b12ab02a
Bugfix/get doc (#5049)
* new (broken) unit test

* fixing get_doc method
2020-03-02 11:49:28 +01:00
adrianeboyd
65d7bab10f
Initialize all values in a2b/b2a in new align (#5063) 2020-02-27 18:43:00 +01:00
Adriane Boyd
9f740a9891 Add a few more Danish tokenizer exceptions 2020-02-26 14:59:03 +01:00
Ines Montani
1c212215cd
Merge pull request #5064 from adrianeboyd/feature/german-tokenization
Improve German tokenization
2020-02-26 13:41:44 +01:00
Adriane Boyd
d1f703d78d Improve German tokenization
Improve German tokenization with respect to Tiger.
2020-02-26 13:06:52 +01:00
Ines Montani
ed9358420e Merge branch 'master' into pr/5060 2020-02-26 12:51:29 +01:00
adrianeboyd
ff184b7a9c
Add tag_map argument to CLI debug-data and train (#4750) (#5038)
Add an argument for a path to a JSON-formatted tag map, which is used to
update and extend the default language tag map.
2020-02-26 12:10:38 +01:00
svlandeg
18ff97589d update spacy to 2.2.4.dev0 2020-02-26 10:50:05 +01:00
Ines Montani
d50152b917
Merge pull request #5019 from questoph/master
Optimizing tokenization for Luxembourgish (dealing with apostrophe infixes)
2020-02-25 14:48:50 +01:00
Ines Montani
4440a072d2
Merge pull request #5006 from svlandeg/bugfix/multiproc-underscore
load Underscore state when multiprocessing
2020-02-25 14:46:02 +01:00
svlandeg
b49a3afd0c use clean_underscore fixture 2020-02-23 15:49:20 +01:00
Tom Keefe
ddf63b97a8
make idx available via to_array (#5030) 2020-02-22 14:13:06 +01:00
Sofie Van Landeghem
44f4142ce4
add two abbreviations and some additional unit tests (#5040) 2020-02-22 14:12:32 +01:00
Sofie Van Landeghem
479bd8d09f
add lemma option to displacy 'dep' visualiser (#5041)
* add lemma option to displacy 'dep' visualiser

* more compact list comprehension

* add option to doc

* fix test and add lemmas to util.get_doc

* fix capital

* remove lemma from get_doc

* cleanup
2020-02-22 14:11:51 +01:00
adrianeboyd
2164e71ea8
Improved Romanian tokenization for UD RRT (#5036)
Modifications to Romanian tokenization to improve tokenization for
UD_Romanian-RRT.
2020-02-19 16:15:59 +01:00
Jan Jessewitsch
c7e4fe9c5c
Fix/Improve german stop words (#5024)
* Fix german stop words

Two stop words ("einige" and  "einigen") are sticking together.
Remove three nouns that may serve as stop words in a specific context (e.g. religious or news) but are not applicable for general use.

* Create Jan-711.md
2020-02-17 18:59:22 +01:00
Kabir Khan
f6ed07b85c
Use nlp.pipe in EntityRuler for phrase patterns in add_patterns (#4931)
* Fix ent_ids and labels properties when id attribute used in patterns

* use set for labels

* sort end_ids for comparison in entity_ruler tests

* fixing entity_ruler ent_ids test

* add to set

* Run make_doc optimistically if using phrase matcher patterns.

* remove unused coveragerc I was testing with

* format

* Refactor EntityRuler.add_patterns to use nlp.pipe for phrase patterns. Improves speed substantially.

* Removing old add_patterns function

* Fixing spacing

* Make sure token_patterns loaded as well, before generator was being emptied in from_disk
2020-02-16 18:17:47 +01:00
Sofie Van Landeghem
72c964bcf4
define pretrained_dims which is used by build_text_classifier (#5004) 2020-02-16 17:21:17 +01:00
adrianeboyd
3b22eb651b
Sync Span __eq__ and __hash__ (#5005)
* Sync Span __eq__ and __hash__

Use the same tuple for `__eq__` and `__hash__`, including all attributes
except `vector` and `vector_norm`.

* Update entity comparison in tests

Update `assert_docs_equal()` test util to compare `Span` properties for
ents rather than `Span` objects.
2020-02-16 17:20:36 +01:00
adrianeboyd
0c47a53b5e
Use int only in key2row for better performance (#4990)
Cast all keys and rows to `int` in `vectors.key2row` for more efficient
access and serialization.
2020-02-16 17:19:41 +01:00
adrianeboyd
5b102963bf
Require HEAD for is_parsed in Doc.from_array() (#5011)
Modify flag settings so that `DEP` is not sufficient to set `is_parsed`
and only run `set_children_from_heads()` if `HEAD` is provided.

Then the combination `[SENT_START, DEP]` will set deps and not clobber
sent starts with a lot of one-word sentences.
2020-02-16 17:17:09 +01:00
Sofie Van Landeghem
2572460175
add tok2vec parameters to train script to facilitate init_tok2vec (#5021) 2020-02-16 17:16:41 +01:00
Sofie Van Landeghem
a27c77ce62
add message when cli train script throws exception (#5009)
* add message when cli train script throws exception

* fix formatting
2020-02-15 15:50:17 +01:00
questoph
5352fc8fc3 Update tokenizer_exceptions.py 2020-02-14 12:02:15 +01:00
questoph
d1f0b397b5 Update punctuation.py 2020-02-13 22:18:51 +01:00
svlandeg
6e717c62ed avoid the tests interacting with eachother through the global Underscore variable 2020-02-12 13:21:31 +01:00
svlandeg
7939c63886 use English instead of model 2020-02-12 12:26:27 +01:00
svlandeg
46628d8890 add some asserts 2020-02-12 12:12:52 +01:00
svlandeg
51d37033c8 remove old comment 2020-02-12 12:10:05 +01:00
svlandeg
65f5b48b5d add comment 2020-02-12 12:06:27 +01:00
svlandeg
05dedaa2cf add unit test 2020-02-12 12:00:13 +01:00
svlandeg
ecbb9c4b9f load Underscore state when multiprocessing 2020-02-12 11:50:42 +01:00
adrianeboyd
99a543367d
Set GPU before loading any models in train CLI (#4989)
Set the GPU before loading any existing models in the train CLI so that
you can start with a base model and train on GPU.
2020-02-11 17:45:41 -05:00
adrianeboyd
842dfddbb9
Standardize Greek tag map setup (#4997)
* Rename `tag_map.py` to `tag_map_fine.py` to indicate that it's not the
default tag map
* Remove duplicate generic UD tag map and load `../tag_map.py` instead
2020-02-11 17:44:56 -05:00
Antti Ajanki
e1f777b151
Improvements for Finnish tokenizer (#4985)
* don't split on a colon. Colon is used to attach suffixes for abbreviations
* tokenize on any of LIST_HYPHENS (except a single hyphen), not just on --
* simplify infix rules by merging similar rules
2020-02-10 20:32:43 -05:00
Filip Bednárik
d4f4060bf3
Add Slovak language tools implementation (#4943)
* Add correct stopwords for Slovak language

* Add SNK Tags

* Disable formatting lint for TAGS

* Add example sentences for Slovak language

* Add slovak numerals in base form

* Add lex_attrs to sk init

* Add contributor agreement
2020-02-03 13:03:59 +01:00
Tyler Couto
9fa9d7f2cb
Fix for Issue 4665 - conllu2json (#4953)
* Fix for Issue 4665 - conllu2json

- Allowing HEAD to be an underscore

* Added contributor agreement
2020-02-03 13:01:48 +01:00
adrianeboyd
a938566b62 Fix Sentencizer.pipe() for empty doc (#4940) 2020-01-28 11:36:49 +01:00
Yohei Tamura
708a4d27eb fix nlp.evaluate (#4924) (#4925)
* new file:   test_issue4924.py

* modified:   spacy/gold.pyx

* modified:   test_issue4924.py for python2
2020-01-20 12:17:46 +01:00
Kabir Khan
b9afcd56e3 Fix ent_ids and labels properties when id attribute used in patterns (#4900)
* Fix ent_ids and labels properties when id attribute used in patterns

* use set for labels

* sort end_ids for comparison in entity_ruler tests

* fixing entity_ruler ent_ids test

* add to set
2020-01-16 02:01:31 +01:00
adrianeboyd
90c52128dc Improve train CLI with base model (#4911)
Improve train CLI with a provided base model so that you can:

* add a new component
* extend an existing component
* replace an existing component

When the final model and best model are saved, reenable any disabled
components and merge the meta information to include the full pipeline
and accuracy information for all components in the base model plus the
newly added components if needed.
2020-01-16 01:58:51 +01:00
svlandeg
ee828d5a9a bugfix typo conv_window 2020-01-14 09:02:58 +01:00
adrianeboyd
d24bca62f6 Add CJK to character classes (#4884)
* Add CJK character class as uncased

* Incorporate Chinese URL test case

Un-xfail Chinese URL test instance
2020-01-08 16:50:19 +01:00
adrianeboyd
aef83e8070 Mark most Hungarian tokenizer test cases as slow (#4883)
* Mark most Hungarian tokenizer test cases as slow

Mark most Hungarian tokenizer test cases as slow to reduce the runtime
of the test suite in ordinary usage:

* for normal tests: run default tests plus 10% of the detailed tests
* for slow tests: run all tests

* Rework to mark individual tests as slow
2020-01-08 12:34:06 +01:00
Sofie Van Landeghem
7b96a5e10f Reduce mem usage in training Entity Linker (#4811)
* move nlp processing for el pipe to batch training instead of preprocessing

* adding dev eval back in, and limit in articles instead of entities

* use pipe whenever possible

* few more small doc changes

* access dev data through generator

* tqdm description

* small fixes

* update documentation
2020-01-06 14:59:50 +01:00
Sofie Van Landeghem
6e9b61b49d add warning in debug_data for punctuation in entities (#4853) 2020-01-06 14:59:28 +01:00
adrianeboyd
d652ff215d Add trailing whitespace to multiline test text (#4877) 2020-01-06 14:58:59 +01:00
adrianeboyd
de69bc6509 Fix and improve URL pattern (#4882)
* match domains longer than `hostname.domain.tld` like `www.foo.co.uk`
* expand allowed characters in domain names while only matching
lowercase TLDs so that "this.That" isn't matched as a URL and can be
split on the period as an infix (relevant for at least English, German,
and Tatar)
2020-01-06 14:58:30 +01:00
Sofie Van Landeghem
a1b22e90cd serialize ENT_ID (#4852)
* expand serialization test for custom token attribute

* add failing test for issue 4849

* define ENT_ID as attr and use in doc serialization

* fix few typos
2020-01-06 14:57:34 +01:00
Al Johri
1aa2d4dac9 stop rendering mathjax by default in displacy (#4840)
* stop rendering mathjax by default in displacy

* Replace f-string and add comment

Co-authored-by: Ines Montani <ines@ines.io>
2020-01-01 13:15:05 +01:00
Anastasiia Iurshina
1830a12578 Fixes typos (#4843)
* Fixes typos

* Fixes typo

* Contributor agreement
2019-12-29 14:24:13 +01:00
Ivan Echevarria
ef13e0c038 Add n_process to Language.pipe documentation (#4842) [ci skip]
* Add n_process to documentation

* Auto-format and add default [ci skip]

Co-authored-by: Ines Montani <ines@ines.io>
2019-12-29 14:23:33 +01:00
Ines Montani
3431ac42de Fix typo 2019-12-21 21:17:45 +01:00
Ines Montani
7c69d30de5 Tidy up and expect warning 2019-12-21 21:14:52 +01:00
Sofie Van Landeghem
732142bf28 facilitate larger training files (#4827)
* add warning for large file and change start var to long

* type for file_length
2019-12-21 21:12:19 +01:00
Ines Montani
cb4145adc7 Tidy up and auto-format 2019-12-21 19:04:17 +01:00
Olamilekan Wahab
a741de7cf6 Adding support for Yoruba Language (#4614)
* Adding Support for Yoruba

* test text

* Updated test string.

* Fixing encoding declaration.

* Adding encoding to stop_words.py

* Added contributor agreement and removed iranlowo.

* Added removed test files and removed iranlowo to keep project bare.

* Returned CONTRIBUTING.md to default state.

* Added delted conftest entries

* Tidy up and auto-format

* Revert CONTRIBUTING.md

Co-authored-by: Ines Montani <ines@ines.io>
2019-12-21 14:11:50 +01:00
Ines Montani
0750d59e5a Allow setting ner_missing_tag on docs_to_json 2019-12-21 13:47:21 +01:00
Sofie Van Landeghem
8ebbb85117 Documentation for PhraseMatcher constructor (#4826)
* add max_length as argument for init PhraseMatcher

* improve error message too
2019-12-20 23:00:04 +01:00
Sofie Van Landeghem
12158c1e3a Restore tqdm imports (#4804)
* set 4.38.0 to minimal version with color bug fix

* set imports back to proper place

* add upper range for tqdm
2019-12-16 13:12:19 +01:00
Sofie Van Landeghem
557dcf5659 NEL requires sentences to be set (#4801) 2019-12-13 15:55:18 +01:00
tamuhey
1707e77c5e add char_span to Span (#4793) 2019-12-13 15:54:58 +01:00
Sofie Van Landeghem
f9b541f9ef More robust set entities method in KB (#4794)
* add unit test for setting entities with duplicate identifiers

* count the number of actual unique identifiers and throw duplicate warning
2019-12-13 10:45:29 +01:00
Sofie Van Landeghem
5355b0038f Update EL example (#4789)
* update EL example script after sentence-central refactor

* version bump

* set incl_prior to False for quick demo purposes

* clean up
2019-12-11 18:19:42 +01:00
adrianeboyd
38e1bc19f4 Add destructors for states in TransitionSystem (#4686) 2019-12-10 13:23:27 +01:00
adrianeboyd
c208eb6e4d Fix int value handling in Matcher (#4749)
Add `int` values (for `LENGTH`) in _get_attr_values() instead of
treating `int` like `dict`.
2019-12-06 19:22:57 +01:00
Sofie Van Landeghem
780d43aac7 fix bug in EL predict (#4779) 2019-12-06 19:18:14 +01:00
adrianeboyd
676e75838f Include Doc.cats in serialization of Doc and DocBin (#4774)
* Include Doc.cats in to_bytes()

* Include Doc.cats in DocBin serialization

* Add tests for serialization of cats

Test serialization of cats for Doc and DocBin.
2019-12-06 14:07:39 +01:00
Antti Ajanki
e626a011cc Improvements to the Finnish language data (#4738)
* Enable lex_attrs on Finnish

* Copy the Danish tokenizer rules to Finnish

Specifically, don't break hyphenated compound words

* Contributor agreement

* A new file for Finnish tokenizer rules instead of including the Danish ones
2019-12-03 12:55:28 +01:00
Christoph Purschke
a7ee4b6f17 new tests & tokenization fixes (#4734)
- added some tests for tokenization issues
- fixed some issues with tokenization of words with hyphen infix
- rewrote the "tokenizer_exceptions.py" file (stemming from the German version)
2019-12-01 23:08:21 +01:00
adrianeboyd
48ea2e8d0f Restructure Sentencizer to follow Pipe API (#4721)
* Restructure Sentencizer to follow Pipe API

Restructure Sentencizer to follow Pipe API so that it can be scored with
`nlp.evaluate()`.

* Add Sentencizer pipe() test
2019-11-27 16:33:34 +01:00
Jari Bakken
16cb19e960 update nb tag_map (#4711) 2019-11-25 21:26:26 +01:00
Ines Montani
5b36dec7eb Auto-exclude disabled when calling from_disk during load (#4708) 2019-11-25 16:01:22 +01:00
Ines Montani
2160ecfc92 Fix typo [ci skip] 2019-11-25 13:08:19 +01:00
adrianeboyd
2d8c6e1124 Iterate over lr_edges until sents are correct (#4702)
Iterate over lr_edges until all heads are within the current sentence.
Instead of iterating over them for a fixed number of iterations, check
whether the sentence boundaries are correct for the heads and stop when
all are correct. Stop after a maximum of 10 iterations, providing a
warning in this case since the sentence boundaries may not be correct.
2019-11-25 13:06:36 +01:00
Matt Maybeno
c9f1e99787 Agnostic vocab array fix (#4680)
* Use get_array_module instead of numpy

* add contributor agreement
2019-11-23 14:59:52 +01:00
adrianeboyd
46250f60ac Add missing tags to el/es/pt tag maps (#4696)
* Add missing tags to pt tag map

* Add missing tags to es tag map

* Add missing tags to el tag map

* Add missing symbol in el tag map
2019-11-23 14:57:21 +01:00
Paul O'Leary McCann
f0e3e606a6 Replace python-mecab3 with fugashi for Japanese (#4621)
* Switch from mecab-python3 to fugashi

mecab-python3 has been the best MeCab binding for a long time but it's
not very actively maintained, and since it's based on old SWIG code
distributed with MeCab there's a limit to how effectively it can be
maintained.

Fugashi is a new Cython-based MeCab wrapper I wrote. Since it's not
based on the old SWIG code it's easier to keep it current and make small
deviations from the MeCab C/C++ API where that makes sense.

* Change mecab-python3 to fugashi in setup.cfg

* Change "mecab tags" to "unidic tags"

The tags come from MeCab, but the tag schema is specified by Unidic, so
it's more proper to refer to it that way.

* Update conftest

* Add fugashi link to external deps list for Japanese
2019-11-23 14:31:04 +01:00
Ines Montani
a0fb1acb10 Update version [ci skip] 2019-11-21 18:19:37 +01:00
Ines Montani
b570d5d2ed Increment version [ci skip] 2019-11-21 17:02:32 +01:00
Matthew Honnibal
50f89cb85d Make vectors.find() return keys in correct order (#4691)
* Make vectors.find() return keys in correct order

* Update spacy/vectors.pyx
2019-11-21 16:58:32 +01:00
Ines Montani
5d4eede1e4 Fix test util imports 2019-11-21 16:28:29 +01:00
GuiGel
8f7ab70870 Bugfix/fix entity ruler from disk (#4670)
* fix EntityRuler from_disk bug

* add contributor file

* Test EntityRuler PhraseMatcher deserialization (#4651)

* newline at end of file

* fix copy paste error

* serializing the EntityRuler by itself

* Add unicode declarations for Python 2 and auto-format
2019-11-21 16:26:37 +01:00
adrianeboyd
054df5d90a Add error for non-string labels (#4690)
Add error when attempting to add non-string labels to `Tagger` or
`TextCategorizer`.
2019-11-21 16:24:10 +01:00
adrianeboyd
d7f32b285c Detect more empty matches in tokenizer.explain() (#4675)
* Detect more empty matches in tokenizer.explain()

* Include a few languages in explain non-slow tests

Mark a few languages in tokenizer.explain() tests as not slow so they're
run by default.
2019-11-20 16:31:29 +01:00
Ines Montani
5bf9ab5b03 Tidy up and auto-format 2019-11-20 13:16:33 +01:00
Ines Montani
7f3b00164a Re-add slow marker 2019-11-20 13:15:59 +01:00
Ines Montani
6e303de717 Auto-format 2019-11-20 13:15:24 +01:00
Ines Montani
2e7c896fe5 Update Tokenizer.explain tests 2019-11-20 13:14:11 +01:00
adrianeboyd
2c876eb672 Add tokenizer explain() debugging method (#4596)
* Expose tokenizer rules as a property

Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)

Add tests and update Tokenizer API docs.

* Update Hungarian punctuation to remove empty string

Update Hungarian punctuation definitions so that `_units` does not match
an empty string.

* Use _load_special_tokenization consistently

Use `_load_special_tokenization()` and have it to handle `None` checks.

* Fix precedence of `token_match` vs. special cases

Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.

* Add `make_debug_doc()` to the Tokenizer

Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.

Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.

* Update tokenization usage docs

Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.

Add more examples for customizing tokenizers while preserving the
existing defaults.

Minor edits / clarifications.

* Revert "Update Hungarian punctuation to remove empty string"

This reverts commit f0a577f7a5.

* Rework `make_debug_doc()` as `explain()`

Rework `make_debug_doc()` as `explain()`, which returns a list of
`(pattern_string, token_string)` tuples rather than a non-standard
`Doc`. Update docs and tests accordingly, leaving the visualization for
future work.

* Handle cases with bad tokenizer patterns

Detect when tokenizer patterns match empty prefixes and suffixes so that
`explain()` does not hang on bad patterns.

* Remove unused displacy image

* Add tokenizer.explain() to usage docs
2019-11-20 13:07:25 +01:00
Matthew Honnibal
a3c43a1692
Support no hidden layer in parser and NER (#4672)
* Support no hidden layers for parser

* Fix parser model for depth 1

* Fix parser for hidden depth=0

* Add option of non-blocking to CUDA stream
2019-11-19 15:54:34 +01:00
Matthew Honnibal
4b123952aa
Add option for improved NER feature extraction (#4671)
* Support option of three NER features

* Expose nr_feature parser model setting

* Give feature tokens better name

* Test nr_feature=3 for NER

* Format
2019-11-19 15:03:14 +01:00
Elijah Rippeth
5ad5c4b44a Add initial Korean support (#4660)
* add hangul and jamo char classes.

* add initial Korean lexical attributes.

* add contributor agreement
2019-11-18 12:56:07 +01:00
Ines Montani
74b951fe61
Fix xpassing tests (#4657)
* Ignore internal warnings

* Un-xfail passing tests

* Skip instead of xfail
2019-11-16 20:20:53 +01:00
Ines Montani
3bd15055ce
Fix bug in Language.evaluate for components without .pipe (#4662) 2019-11-16 20:20:37 +01:00
adrianeboyd
bdfb696677 Fix conllu2json converter to output all sentences (#4656)
Make sure that the last batch of sentences is output if n_sents > 1.
2019-11-15 17:08:32 +01:00
Ines Montani
d64cfce546 Remove unnecessary newline replace 2019-11-15 16:19:01 +01:00
Christoph Purschke
433748e867 Fix basic language support for Luxembourgish (by adding punctuation.py) (#4648)
* Update __init__.py

* Create punctuation.py

* Update tokenizer_exceptions.py

* Create questoph.md

* Update questoph.md

* Update test_text.py

* Update test_text.py

* Update test_text.py

* Update test_text.py
2019-11-15 16:16:47 +01:00
adrianeboyd
91f89f9693 Fix realloc in retokenizer.split() (#4606)
Always realloc to a size larger than `doc.max_length` in
`retokenizer.split()` (or cymem will throw errors).
2019-11-11 16:26:46 +01:00
adrianeboyd
0b9a5f4074 Rework Chinese language initialization and tokenization (#4619)
* Rework Chinese language initialization

* Create a `ChineseTokenizer` class
  * Modify jieba post-processing to handle whitespace correctly
  * Modify non-jieba character tokenization to handle whitespace correctly

* Add a `create_tokenizer()` method to `ChineseDefaults`

* Load lexical attributes

* Update Chinese tag_map for UD v2

* Add very basic Chinese tests

* Test tokenization with and without jieba

* Test `like_num` attribute

* Fix try_jieba_import()

* Fix zh code formatting
2019-11-11 14:23:21 +01:00
adrianeboyd
4d85f67eee Minor updates to language example sentences (#4608)
* Add punctuation to Spanish example sentences

* Combine multilanguage examples for lang xx

* Add punctuation to nb examples
2019-11-07 22:34:58 +01:00
Priscilla de Abreu Lopes
39e79fcc86 Bugfix/dep matcher issue 4590 (#4601)
* add contributor agreement for prilopes

* add test for issue #4590

* fix on_match params for DependencyMacther (#4590)
2019-11-07 12:01:06 +01:00
Ines Montani
09cec3e41b
Replace function registries with catalogue (#4584)
* Replace functions registries with catalogue

* Update __init__.py

* Fix test

* Revert unrelated flag [ci skip]
2019-11-07 11:45:22 +01:00
Matthew Honnibal
4e43c0ba93 Fix multiprocessing for as_tuples=True (#4582) 2019-11-04 20:29:03 +01:00
Ines Montani
cf4ec88b38 Use latest wasabi 2019-11-04 02:38:45 +01:00
Ines Montani
6ec119d976 Add error in debug-data if no dev docs are available (see #4575) 2019-11-02 16:08:11 +01:00
adrianeboyd
56ad3a3988 Add LAS per dependency to Scorer (#4560) 2019-10-31 21:18:16 +01:00
Matthew Honnibal
de98d66f87 Set version to v2.2.2 2019-10-31 15:53:31 +01:00
Matthw Honnibal
55f2241d72 Merge branch 'master' of https://github.com/explosion/spaCy 2019-10-31 15:37:52 +01:00
Ines Montani
df4c9ae3dc Fix formatting [ci skip] 2019-10-31 15:10:25 +01:00
Ines Montani
59358d9b71
Remove box-decoration-break from entities in displacy (#4564) 2019-10-31 15:09:43 +01:00
Matthw Honnibal
8b9954d1b7 Set version to v2.2.2.dev5 2019-10-31 15:06:19 +01:00
Ines Montani
2c107f02a4 Auto-format [ci skip] 2019-10-31 15:01:56 +01:00
Matthew Honnibal
e82306937e Put Tok2Vec refactor behind feature flag (#4563)
* Add back pre-2.2.2 tok2vec

* Add simple tok2vec tests

* Add simple tok2vec tests

* Reformat

* Fix CharacterEmbed in new tok2vec

* Fix legacy tok2vec

* Resolve circular imports

* Fix test for Python 2
2019-10-31 15:01:15 +01:00
Ines Montani
5e9849b60f Auto-format [ci skip] 2019-10-30 19:27:18 +01:00
Ines Montani
afe4a428f7
Fix pipeline analysis on remove pipe (#4557)
Validate *after* component is removed, not before
2019-10-30 19:04:17 +01:00
Matthew Honnibal
6b874ef096 Set version to v2.2.2.dev4 2019-10-30 17:36:20 +01:00
Ines Montani
85f2b04c45
Support span._. in component decorator attrs (#4555)
* Support span._. in component decorator attrs

* Adjust error [ci skip]
2019-10-30 17:19:36 +01:00
Matthew Honnibal
c2f5f9f572 Set version to v2.2.2.dev3 2019-10-29 16:37:58 +01:00
Sofie Van Landeghem
33ba9ff464 set encodings explicitly to utf8 (#4551) 2019-10-29 13:16:55 +01:00
Matthew Honnibal
9e210fa7fd
Fix tok2vec structure after model registry refactor (#4549)
The model registry refactor of the Tok2Vec function broke loading models
trained with the previous function, because the model tree was slightly
different. Specifically, the new function wrote:

    concatenate(norm, prefix, suffix, shape)

To build the embedding layer. In the previous implementation, I had used
the operator overloading shortcut:

    ( norm | prefix | suffix | shape )

This actually gets mapped to a binary association, giving something
like:

    concatenate(norm, concatenate(prefix, concatenate(suffix, shape)))

This is a different tree, so the layers iterate differently and we
loaded the weights wrongly.
2019-10-28 23:59:03 +01:00
Matthew Honnibal
bade60fe64 Set version to v2.2.2.dev1 2019-10-28 19:09:34 +01:00
Matthew Honnibal
b1505380ff Fix training with vectors 2019-10-28 18:06:38 +01:00
Matthew Honnibal
a927b3a21e Put new alignment behind flag for v2.2.2 release (#4541)
* Xfail new tokenization test

* Put new alignment behind feature flag

* Move USE_ALIGN to top of the file [ci skip]


Co-authored-by: Ines Montani <ines@ines.io>
2019-10-28 16:12:32 +01:00
Ines Montani
a90025b277
Fix serialization of extension attr values in DocBin (#4540) 2019-10-28 16:02:13 +01:00
tamuhey
df293f3894 modified gold.align to handle space tokens (#4537)
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2019-10-28 15:44:28 +01:00
adrianeboyd
f2bfaa1b38 Filter subtoken matches in merge_subtokens() (#4539)
The `Matcher` in `merge_subtokens()` returns all possible subsequences
of `subtok`, so for sequences of two or more subtoks it's necessary to
filter the matches so that the retokenizer is only merging the longest
matches with no overlapping spans.
2019-10-28 15:40:28 +01:00
Matthew Honnibal
d5509e0989 Support Mish activation (requires Thinc 7.3) (#4536)
* Add arch for MishWindowEncoder

* Support mish in tok2vec and conv window >=2

* Pass new tok2vec settings from parser

* Syntax error

* Fix tok2vec setting

* Fix registration of MishWindowEncoder

* Fix receptive field setting

* Fix mish arch

* Pass more options from parser

* Support more tok2vec options in pretrain

* Require thinc 7.3

* Add docs [ci skip]

* Require thinc 7.3.0.dev0 to run CI

* Run black

* Fix typo

* Update Thinc version


Co-authored-by: Ines Montani <ines@ines.io>
2019-10-28 15:16:33 +01:00
Ines Montani
96bb8f2187 Add regression test for #4528 [ci skip] 2019-10-28 14:36:03 +01:00
Ines Montani
c5e41247e8 Tidy up and auto-format 2019-10-28 12:43:55 +01:00
Ines Montani
92018b9cd4 Tidy up and auto-format 2019-10-28 12:36:23 +01:00
Matthew Honnibal
f0ec7bcb79
Flag to ignore examples with mismatched raw/gold text (#4534)
* Flag to ignore examples with mismatched raw/gold text

After #4525, we're seeing some alignment failures on our OntoNotes data. I think we actually have fixes for most of these cases.

In general it's better to fix the data, but it seems good to allow the GoldCorpus class to just skip cases where the raw text doesn't
match up to the gold words. I think previously we were silently ignoring these cases.

* Try to fix test on Python 2.7
2019-10-28 11:40:12 +01:00
Matthew Honnibal
795699015c
Clarify parser model CPU/GPU code (#4535)
The previous version worked with previous thinc, but only
because some thinc ops happened to have gpu/cpu compatible
implementations. It's better to call the right Ops instance.
2019-10-27 23:43:09 +01:00
Matthw Honnibal
46eecdcb70 Remove print 2019-10-27 22:24:19 +01:00
Matthw Honnibal
426b745640 Fix tests for gpu 2019-10-27 22:19:18 +01:00
Matthw Honnibal
165e378082 Fix tok2vec arch after refactor 2019-10-27 22:19:10 +01:00
Matthew Honnibal
f8d740bfb1
Fix --gold-preproc train cli command (#4392)
* Fix get labels for textcat

* Fix char_embed for gpu

* Revert "Fix char_embed for gpu"

This reverts commit 055b9a9e85.

* Fix passing of cats in gold.pyx

* Revert "Match pop with append for training format (#4516)"

This reverts commit 8e7414dace.

* Fix popping gold parses

* Fix handling of cats in gold tuples

* Fix name

* Fix ner_multitask_objective script

* Add test for 4402
2019-10-27 21:58:50 +01:00
Sofie Van Landeghem
8e7414dace Match pop with append for training format (#4516)
* trying to fix script - not succesful yet

* match pop() with extend() to avoid changing the data

* few more pop-extend fixes

* reinsert deleted print statement

* fix print statement

* add last tested version

* append instead of extend

* add in few comments

* quick fix for 4402 + unit test

* fixing number of docs (not counting cats)

* more fixes

* fix len

* print tmp file instead of using data from examples dir

* print tmp file instead of using data from examples dir (2)
2019-10-27 16:01:32 +01:00
tamuhey
fcd25db033 [#4529] fix: gold pyx (#4530)
* fix: gold pyx

* remove print

* skip test in python2

* Add unicode declarations and don't skip test on Python 2
2019-10-27 13:50:07 +01:00
Matthew Honnibal
bddfbc7e1b Restore missing normalization from gold align
PR #4526 missed extra lower-casing and spacing normalization.
2019-10-27 13:47:08 +01:00
tamuhey
554850206c [#4525] fix gold.align (#4526)
* fix: gold.align

* fix align

* remove old align
2019-10-27 13:38:04 +01:00
Ines Montani
a9c6104047 Component decorator and component analysis (#4517)
* Add work in progress

* Update analysis helpers and component decorator

* Fix porting of docstrings for Python 2

* Fix docstring stuff on Python 2

* Support meta factories when loading model

* Put auto pipeline analysis behind flag for now

* Analyse pipes on remove_pipe and replace_pipe

* Move analysis to root for now

Try to find a better place for it, but it needs to go for now to avoid circular imports

* Simplify decorator

Don't return a wrapped class and instead just write to the object

* Update existing components and factories

* Add condition in factory for classes vs. functions

* Add missing from_nlp classmethods

* Add "retokenizes" to printed overview

* Update assigns/requires declarations of builtins

* Only return data if no_print is enabled

* Use multiline table for overview

* Don't support Span

* Rewrite errors/warnings and move them to spacy.errors
2019-10-27 13:35:49 +01:00
Matthew Honnibal
406eb95a47
Refactor Tok2Vec to use architecture registry (#4518)
* Add refactored tok2vec, using register_architecture

* Refactor Tok2Vec

* Fix ml

* Fix new tok2vec

* Move make_layer to util

* Add wire

* Fix missing import
2019-10-25 22:28:20 +02:00
Sofie Van Landeghem
99e309bb19 fix nn parser sample construction (#4524) 2019-10-25 22:26:42 +02:00
Ines Montani
cfffdba7b1 Implement new API for {Phrase}Matcher.add (backwards-compatible) (#4522)
* Implement new API for {Phrase}Matcher.add (backwards-compatible)

* Update docs

* Also update DependencyMatcher.add

* Update internals

* Rewrite tests to use new API

* Add basic check for common mistake

Raise error with suggestion if user likely passed in a pattern instead of a list of patterns

* Fix typo [ci skip]
2019-10-25 22:21:08 +02:00
Ines Montani
d2da117114 Also support passing list to Language.disable_pipes (#4521)
* Also support passing list to Language.disable_pipes

* Adjust internals
2019-10-25 16:19:08 +02:00
Ines Montani
e91366a216 Adjust formatting [ci skip] 2019-10-25 11:25:44 +02:00
Ines Montani
f31876154d Adjust formatting [ci skip] 2019-10-25 11:19:46 +02:00
Kabir Khan
93640373c7 Make entity_ruler ent_id resolution 2x faster and add docs for… (#4513)
* Update entityruler.py

* Making ent_id resolution 2x faster and adding docs

* Fixing newlines in docstrings

* Fixing newlines in docstrings
2019-10-25 11:16:42 +02:00
Ines Montani
cc05d9dad6 Auto-format [ci skip] 2019-10-24 16:21:08 +02:00
Ines Montani
73dc63d3bf Tidy up and auto-format [ci skip] 2019-10-24 16:20:48 +02:00
Ines Montani
6dd2832438 Use numpy.frombuffer instead of fromstring
Deprecation warning says we should do this
2019-10-24 16:18:41 +02:00
Ines Montani
9a849fe54e Explicitly catch warning in test 2019-10-24 16:16:27 +02:00
adrianeboyd
1b0bbe4b76 Update tag maps and docs for English and German (#4501)
* Update English tag_map

Update English tag_map based on this conversion table:
https://universaldependencies.org/tagset-conversion/en-penn-uposf.html

* Update German tag_map

Update German tag_map based on this conversion table:
https://universaldependencies.org/tagset-conversion/de-stts-uposf.html

* Add missing Tiger dependencies to glossary

* Add quotes to definition of TO

* Update POS/TAG tables in docs

Update POS/TAG tables for English and German docs using current
information generated from the tag_maps and GLOSSARY.

* Update warning that -PRON- is specific to English

* Revert docs to default JSON output with convert

* Revert "Revert docs to default JSON output with convert"

This reverts commit 6b78c048f1.
2019-10-24 12:56:05 +02:00
Zhuoru Lin
10d88b09bb Bugfix/fix wikidata train entity linker (#4509)
* Fix labels_discard Nonetype iteration error

* Contributor agreement for Zhuoru Lin

* Enhance EntityLinker.predict() to handle labels_discard is None case.
2019-10-24 12:52:59 +02:00
adrianeboyd
8516e9d53b Support train dict format as JSONL (#4471)
* Support train dict format as JSONL

* Add (overly simple) check for dict vs. tuple to read JSONL lines as
either train dicts or train tuples

* Extend JSON/JSONL roundtrip conversion tests using `docs_to_json()`
and `GoldCorpus.train_tuples`

* Revert docs to default JSON output with convert
2019-10-23 16:01:44 +02:00
Matthew Honnibal
ca7f0e669e Set version to v2.2.2.dev1 2019-10-22 20:11:25 +02:00
Matthew Honnibal
9489c5f6b2 Clip most_similar to range [-1, 1] (fixes #4506) (#4507)
* Clip most_similar to range [-1, 1]

* Add/fix vectors tests

* Fix test
2019-10-22 20:10:42 +02:00
Ines Montani
74a19aeb1c Add xfailing test [ci skip] 2019-10-22 18:18:43 +02:00
Matthew Honnibal
3f6cb618a9 Set version to v2.2.2.dev0 2019-10-22 17:47:36 +02:00
Sofie Van Landeghem
48886afc78 prevent zero-length mem alloc (#4429)
* raise specific error when removing a matcher rule that doesn't exist

* rephrasing

* goldparse init: allocate fields only if doc is not empty

* avoid zero length alloc in saving tokenizer cache

* avoid allocating zero length mem in matcher

* asserts to avoid allocating zero length mem

* fix zero-length allocation in matcher

* bump cymem version

* revert cymem version bump
2019-10-22 16:54:33 +02:00
adrianeboyd
3dfc764577 Free pointers in parser activations (#4486)
* Free pointers in ActivationsC

* Restructure alloc/free for parser activations

* Rewrite/restructure to have allocation and free in parallel functions
in `_parser_model` rather than partially in `_parseC()` in `Parser`.

* Remove `resize_activations` from `_parser_model.pxd`.
2019-10-22 15:06:44 +02:00
tamuhey
fb89f6792b refactor: remove unused variable (#4499) 2019-10-22 14:38:17 +02:00
gustavengstrom
050e2445a8 Adding noun_chunks to the Swedish language model (sv) (#4422)
* Create syntax_iterators.py

Replica of spacy/lang/fr/syntax_iterators.py

* Added import statements for SYNTAX_ITERATORS

* Create gustavengstrom.md

* Added "dobj" to list of labels in noun_chunks method and a test_noun_chunks method to the  Swedish language model.

* Delete README-checkpoint.md


Co-authored-by: Gustav <gustav@davcon.se>
Co-authored-by: Ines Montani <ines@ines.io>
2019-10-21 12:57:06 +02:00
adrianeboyd
f5c551a43a Checks/errors related to ill-formed IOB input in CLI convert and debug-data (#4487)
* Error for ill-formed input to iob_to_biluo()

Check for empty label in iob_to_biluo(), which can result from
ill-formed input.

* Check for empty NER label in debug-data
2019-10-21 12:20:28 +02:00
Sofie Van Landeghem
d5d55312b2 prevent division by zero in most_similar method (#4488) 2019-10-21 12:04:46 +02:00
Pepe Berba
7772d5d3c5 Update vocab.get_vector docs to include features on Fasttext ngram (#4464)
* Update `vocab.get_vector`

* Added contrib agreement
2019-10-20 01:28:18 +02:00
Ines Montani
2c96a5e5b0
Remove lemma attrs on BaseDefaults (#4468) 2019-10-19 23:18:09 +02:00
adrianeboyd
8d3de90bc4 Suppress convert output if writing to stdout (#4472) 2019-10-18 18:12:59 +02:00
Ines Montani
692d7f4291 Fix formatting [ci skip] 2019-10-18 11:33:38 +02:00
Ines Montani
181c01f629 Tidy up and auto-format 2019-10-18 11:27:38 +02:00
Ines Montani
fb11852750 Remove unused imports 2019-10-18 11:06:41 +02:00
adrianeboyd
d359da9687 Replace Entity/MatchStruct with SpanC (#4459)
* Replace MatchStruct with Entity

Replace MatchStruct with Entity since the existing Entity struct is
nearly identical.

* Replace Entity with more general SpanC
2019-10-18 11:01:47 +02:00
adrianeboyd
29e3da6493 Add missing cats to gold annot_tuples in Scorer (#4466)
Add missing `cats` in `Scorer` call to `GoldParse.from_annot_tuples()`
when the `doc` and `gold` have differing lengths.
2019-10-18 11:00:02 +02:00
adrianeboyd
135e3de531 Check for docs with 2+ sentences in debug-data (#4467) 2019-10-18 10:59:16 +02:00
Daniel King
e646956176 Most similar bug (#4446)
* Add batch size indexing

* Don't sort if n == 1

* Add test for most similar vectors issue

* Change > to >=
2019-10-16 23:18:55 +02:00
Anastassia
4a77d03ff7 Fix documentation for the docs_to_json function (#4456) 2019-10-16 23:17:58 +02:00
adrianeboyd
275c9ad872 Allow int values in token patterns (#4444)
* Add missing int value option to top-level pattern validation in Matcher

* Adjust existing tests accordingly

* Add new test for valid pattern `{"LENGTH": int}`
2019-10-16 13:40:18 +02:00
Sofie Van Landeghem
7d1efac4eb Fix remove pattern from matcher (#4454)
* raise specific error when removing a matcher rule that doesn't exist

* rephrasing

* bugfix in remove matcher + extended unit test
2019-10-16 13:34:58 +02:00
Sofie Van Landeghem
2d249a9502 KB extensions and better parsing of WikiData (#4375)
* fix overflow error on windows

* more documentation & logging fixes

* md fix

* 3 different limit parameters to play with execution time

* bug fixes directory locations

* small fixes

* exclude dev test articles from prior probabilities stats

* small fixes

* filtering wikidata entities, removing numeric and meta items

* adding aliases from wikidata also to the KB

* fix adding WD aliases

* adding also new aliases to previously added entities

* fixing comma's

* small doc fixes

* adding subclassof filtering

* append alias functionality in KB

* prevent appending the same entity-alias pair

* fix for appending WD aliases

* remove date filter

* remove unnecessary import

* small corrections and reformatting

* remove WD aliases for now (too slow)

* removing numeric entities from training and evaluation

* small fixes

* shortcut during prediction if there is only one candidate

* add counts and fscore logging, remove FP NER from evaluation

* fix entity_linker.predict to take docs instead of single sentences

* remove enumeration sentences from the WP dataset

* entity_linker.update to process full doc instead of single sentence

* spelling corrections and dump locations in readme

* NLP IO fix

* reading KB is unnecessary at the end of the pipeline

* small logging fix

* remove empty files
2019-10-14 12:28:53 +02:00
Peter Gilles
428887b8f2 Initial commit: New language Luxembourgish (lb) (#4424)
* new language: Luxembourgish (lb)

* update

* update

* Update and rename .github/CONTRIBUTOR_AGREEMENT.md to .github/contributors/PeterGilles.md

* Update and rename .github/contributors/PeterGilles.md to .github/CONTRIBUTOR_AGREEMENT.md

* Update norm_exceptions.py

* Delete README.md

* moved test_lemma.py

* deactivated 'lemma_lookup = LOOKUP'

* update

* Update conftest.py

* update

* tests updated

* import unicode_literals

* Update spacy/tests/lang/lb/test_text.py

Co-Authored-By: Ines Montani <ines@ines.io>

* Create PeterGilles.md
2019-10-14 12:27:50 +02:00
adrianeboyd
98a961a60e Fix PhraseMatcher.remove for overlapping patterns (#4437) 2019-10-14 12:19:51 +02:00
Ines Montani
f8f68bb062 Auto-format [ci skip] 2019-10-10 17:08:39 +02:00
adrianeboyd
6f54e59fe7 Fix util.filter_spans() to prefer first span in overlapping sam… (#4414)
* Update util.filter_spans() to prefer earlier spans

* Add filter_spans test for first same-length span

* Update entity relation example to refer to util.filter_spans()
2019-10-10 17:00:03 +02:00
Sofie Van Landeghem
da6e0de34f fix attrs field in the matcher (#4423)
* raise specific error when removing a matcher rule that doesn't exist

* rephrasing

* ensure attrs is NULL when nr_attr == 0 + several fixes to prevent OOB
2019-10-10 15:20:59 +02:00
Sofie Van Landeghem
5efae495f1 Error when removing a matcher rule that doesn't exist (#4420)
* raise specific error when removing a matcher rule that doesn't exist

* rephrasing
2019-10-10 14:01:53 +02:00
Matthew Honnibal
fa95c030a5
Unify matcher get_ent_id and get_pattern_key (#4415)
This is basically stabbing blindly at the ghost match problem, but it at
least seems like there was a bug previously here --- so this should
hopefully be an improvement, even if it doesn't fix the ghost match
problem.
2019-10-09 15:26:31 +02:00
Ines Montani
c4f95c1569 Update formatting and docstrings [ci skip] 2019-10-08 12:25:23 +02:00
Matthew Honnibal
ddd6fda59c Add registry for model creation functions ('architectures') (#4395)
* Add architecture registry

* Add test for arch registry

* Add error for model architectures
2019-10-08 12:21:03 +02:00
tamuhey
650cbfe82d multiprocessing pipe (#1303) (#4371)
* refactor: separate formatting docs and golds in Language.update

* fix return typo

* add pipe test

* unpickleable object cannot be assigned to p.map

* passed test pipe

* passed test!

* pipe terminate

* try pipe

* passed test

* fix ch

* add comments

* fix len(texts)

* add comment

* add comment

* fix: multiprocessing of pipe is not supported in 2

* test: use assert_docs_equal

* fix: is_python3 -> is_python2

* fix: change _pipe arg to use functools.partial

* test: add vector modification test

* test: add sample ner_pipe and user_data pipe

* add warnings test

* test: fix user warnings

* test: fix warnings capture

* fix: remove islice import

* test: remove warnings test

* test: add stream test

* test: rename

* fix: multiproc stream

* fix: stream pipe

* add comment

* mp.Pipe seems to be able to use with relative small data

* test: skip stream test in python2

* sort imports

* test: add reason to skiptest

* fix: use pipe for docs communucation

* add comments

* add comment
2019-10-08 12:20:55 +02:00
adrianeboyd
14841d0aa6 Fix PhraseMatcher callback and add tests (#4399)
* Fix callback lookup in PhraseMatcher (string key rather than hash key)
* Add callback tests for Matcher and PhraseMatcher
2019-10-08 12:07:02 +02:00
Matthew Honnibal
fd4a5341b0 Fix ner_jsonl2json converter (fix #4389) (#4394) 2019-10-08 00:52:45 +02:00
Matthew Honnibal
29f9fec267
Improve spacy pretrain (#4393)
* Support bilstm_depth arg in spacy pretrain

* Add option to ignore zero vectors in get_cossim_loss

* Use cosine loss in Cloze multitask
2019-10-07 23:34:58 +02:00
Ines Montani
9cd6ca3e4d Improve usage of pkg_resources and handling of entry points (#4387)
* Only import pkg_resources where it's needed

Apparently it's really slow

* Use importlib_metadata for entry points

* Revert "Only import pkg_resources where it's needed"

This reverts commit 5ed8c03afa.

* Revert "Revert "Only import pkg_resources where it's needed""

This reverts commit 8b30b57957.

* Revert "Use importlib_metadata for entry points"

This reverts commit 9f071f5c40.

* Revert "Revert "Use importlib_metadata for entry points""

This reverts commit 02e12a17ec.

* Skip test that weirdly hangs

* Fix hanging test by using global
2019-10-07 17:22:09 +02:00
adrianeboyd
d53a8d9313 Consider batch_size when sorting similar vectors (#4388) 2019-10-07 13:38:35 +02:00
adrianeboyd
a3509f67d4 Extend unicode character block for Sinhala (#4378)
* Extend unicode character block for Sinhala

* Add sentencizer tests for more languages
2019-10-07 13:17:03 +02:00
Ines Montani
573e543e4a Alphanumeric -> alphabetic [ci skip]
see ines/spacy-course#38
2019-10-06 13:30:01 +02:00
adrianeboyd
cbc2cee2c8 Improve URL_PATTERN and handling in tokenizer (#4374)
* Move prefix and suffix detection for URL_PATTERN

Move prefix and suffix detection for `URL_PATTERN` into the tokenizer.
Remove associated lookahead and lookbehind from `URL_PATTERN`.

Fix tokenization for Hungarian given new modified handling of prefixes
and suffixes.

* Match a wider range of URI schemes
2019-10-05 13:00:09 +02:00
Ines Montani
fec9433044 Make PhraseMatcher.vocab consistent with Matcher.vocab (closes #4373) 2019-10-04 12:18:41 +02:00
Matthew Honnibal
37ef874d8b Set version to v2.2.1 2019-10-03 14:50:39 +02:00
Sofie Van Landeghem
4e7259c6cf Bugfix initializing DocBin with attributes (#4368)
* docbin init fix + documentation fix + unit tests

* newline

* try with zlib instead of gzip (python 2 incompatibilities)
2019-10-03 14:48:45 +02:00
Ben Taylor
1db79a33cb most_similar() return the k most similar vectors (#4364)
* most_similar return n-most similar vectors

* updated most_similar comment

* add bintay contributor agreement

* sign bintay contributor agreement

* fix most_similar documentation typo

* fixed error in prune_vectors

* updated prune_vectors test
2019-10-03 14:09:44 +02:00
Matthew Honnibal
2eb31012e7 Set version to v2.2.0 2019-10-02 14:40:06 +02:00
Matthew Honnibal
796072e560 Set version to v2.2.0.dev19 2019-10-02 12:51:29 +02:00
Sofie Van Landeghem
9d3ce7cba2 Ensure training doesn't crash with empty batches (#4360)
* unit test for previously resolved unflatten issue

* prevent batch of empty docs to cause problems
2019-10-02 12:50:47 +02:00
adrianeboyd
dda86118bd Update Ukrainian lemmatizer with new lookups (#4359)
* Update Ukrainian lemmatizer with new lookups

* Add missing import


Co-authored-by: Ines Montani <ines@ines.io>
2019-10-02 12:04:06 +02:00
Ines Montani
b6670bf0c2 Use consistent spelling 2019-10-02 10:37:39 +02:00
Matthew Honnibal
38b6e69389 Merge branch 'master' of https://github.com/explosion/spaCy 2019-10-01 22:28:25 +02:00
Matthew Honnibal
d4b63bb6dd Set version to v2.2.0 2019-10-01 22:28:13 +02:00
Ines Montani
475e3188ce Add docs on filtering overlapping spans for merging (resolves #4352) [ci skip] 2019-10-01 21:59:50 +02:00
Matthew Honnibal
64a9577d43 Set version to v2.2.0.dev17 2019-10-01 21:36:59 +02:00
Ines Montani
cf65a80f36 Refactor lemmatizer and data table integration (#4353)
* Move test

* Allow default in Lookups.get_table

* Start with blank tables in Lookups.from_bytes

* Refactor lemmatizer to hold instance of Lookups

* Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk)
* Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency
* Remove old and unsupported Lemmatizer.load classmethod
* Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need

* Update tests and docs

* Fix more tests

* Fix lemmatizer

* Upgrade pytest to try and fix weird CI errors

* Try pytest 4.6.5
2019-10-01 21:36:03 +02:00
Ines Montani
3297a19545 Warn in Tagger.begin_training if no lemma tables are available (#4351) 2019-10-01 15:13:55 +02:00
Matthew Honnibal
2fb05482dd Set version to v2.2.0 2019-10-01 03:50:13 +02:00
Matthew Honnibal
dc22ec0aad Set version to v2.2.0.dev17 2019-10-01 03:26:53 +02:00
Matthew Honnibal
aedfba867a Set version to v2.2.0.dev16 2019-10-01 00:31:00 +02:00
Ines Montani
e0cf4796a5 Move lookup tables out of the core library (#4346)
* Add default to util.get_entry_point

* Tidy up entry points

* Read lookups from entry points

* Remove lookup tables and related tests

* Add lookups install option

* Remove lemmatizer tests

* Remove logic to process language data files

* Update setup.cfg
2019-10-01 00:01:27 +02:00
Rahul Soni
ed620daa5c Fix example sentences in Hindi for grammatical errors (#4343)
* Fix grammar for hindi

* Fix grammar for hindi

* Submit contributor agreement
2019-09-30 23:32:49 +02:00
Ines Montani
ba186299e1 Tidy up and modernize setup and config (#4344)
* Tidy up and modernize setup and config

* Update setup.cfg

* Re-add pyproject.toml

* Delete .flake8

* Move static meta from about to setup.cfg

* Update setup.cfg

Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>
2019-09-30 20:10:55 +02:00
Ines Montani
4f905ac9e6 Add test for ASCII filenames (#4345) 2019-09-30 18:45:30 +02:00
Matthew Honnibal
b5c775dd42 Set version to v2.2.0 2019-09-30 12:47:08 +02:00
Ines Montani
f7d1736241 Skip duplicate spans in Doc.retokenize (#4339) 2019-09-30 12:43:48 +02:00
Ines Montani
0226b3bf0e Fix test imports 2019-09-29 17:34:56 +02:00
Ines Montani
3d8fd4b461 Revert #4334 2019-09-29 17:32:12 +02:00
adrianeboyd
ba5595c764 Fix PhraseMatcher to remember attr on pickling (#4336)
* Fix PhraseMatcher to remember attr on pickling

* Check for attr as int or long
2019-09-29 17:12:33 +02:00
Ines Montani
75514b5970 Fix Korean 2019-09-29 17:10:56 +02:00
Ines Montani
499c39acba Remove unnecessary namedtuple/dataclass 2019-09-29 15:05:28 +02:00
Matthew Honnibal
eba708404d Set version to v2.2.0.dev15 2019-09-28 22:23:53 +02:00
Matthew Honnibal
6189959adb Set version to v2.2.0.dev14 2019-09-28 22:09:46 +02:00
Matthew Honnibal
0df2a599b7 Set version to v2.2.0.dev13 2019-09-28 21:26:05 +02:00
Ines Montani
c9cd516d96 Move tests out of package (#4334)
* Move tests out of package

* Fix typo
2019-09-28 18:05:00 +02:00
Matthew Honnibal
d05eb56ce2 Set version to v2.2.0.dev12 2019-09-28 16:35:56 +02:00
Ines Montani
5fe61539c4 Fix unicode "e" in filename 2019-09-28 15:45:16 +02:00
Ines Montani
811c4c97c9 Correct lookup lemma of "lenses" (see #4332) 2019-09-28 14:04:07 +02:00
Ines Montani
f8d1e2f214 Update CLI docs [ci skip] 2019-09-28 13:12:30 +02:00
Sofie Van Landeghem
22b9e12159 Ensure the NER remains consistent after resizing (#4330)
* test and fix for second bug of issue 4042

* fix for first bug in 4042

* crashing test for Issue 4313

* forgot one instance of resize

* remove prints

* undo uncomment

* delete test for 4313 (uses third party lib)

* add fix for Issue 4313

* unit test for 4313
2019-09-27 20:57:13 +02:00
adrianeboyd
3906785b49 Initialize low data warning for debug-data parser (#4331) 2019-09-27 20:56:49 +02:00
Ines Montani
206e8a5ac7 Also apply hotfix to Ukrainian lemmaitzer 2019-09-27 18:03:26 +02:00
Ines Montani
acd5bcb0b3 Tidy up fixtures 2019-09-27 17:57:59 +02:00
Ines Montani
b21b2e27e5 Hotfix Russian lemmatizer 2019-09-27 17:56:12 +02:00
Matthew Honnibal
a4d4c4bfa4 Set version to v2.2.0.dev11 2019-09-27 16:40:26 +02:00
Ines Montani
aad66d9bb9 Document PhraseMatcher.remove [ci skip] 2019-09-27 16:34:53 +02:00
adrianeboyd
c23edf302b Replace PhraseMatcher with trie-based search (#4309)
* Replace PhraseMatcher with Aho-Corasick

Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays
of the hash values for the relevant attribute. The implementation is
based on FlashText.

The speed should be similar to the previous PhraseMatcher. It is now
possible to easily remove match IDs and matches don't go missing with
large keyword lists / vocabularies.

Fixes #4308.

* Restore support for pickling

* Fix internal keyword add/remove for numpy arrays

* Add missing loop for match ID set in search loop

* Remove cruft in matching loop for partial matches

There was a bit of unnecessary code left over from FlashText in the
matching loop to handle partial token matches, which we don't have with
PhraseMatcher.

* Replace dict trie with MapStruct trie

* Fix how match ID hash is stored/added

* Update fix for match ID vocab

* Switch from map_get_unless_missing to map_get

* Switch from numpy array to Token.get_struct_attr

Access token attributes directly in Doc instead of making a copy of the
relevant values in a numpy array.

Add unsatisfactory warning for hash collision with reserved terminal
hash key. (Ideally it would change the reserved terminal hash and redo
the whole trie, but for now, I'm hoping there won't be collisions.)

* Restructure imports to export find_matches

* Implement full remove()

Remove unnecessary trie paths and free unused maps.

Parallel to Matcher, raise KeyError when attempting to remove a match ID
that has not been added.

* Store docs internally only as attr lists

* Reduces size for pickle

* Remove duplicate keywords store

Now that docs are stored as lists of attr hashes, there's no need to
have the duplicate _keywords store.
2019-09-27 16:22:34 +02:00