Commit Graph

10841 Commits

Author SHA1 Message Date
Ines Montani
75514b5970 Fix Korean 2019-09-29 17:10:56 +02:00
Ines Montani
159b72ed4c Delete main.yml 2019-09-29 15:58:59 +02:00
Ines Montani
539a7b53cd Update main.yml 2019-09-29 15:55:26 +02:00
Ines Montani
b7913c8eca Update main.yml 2019-09-29 15:40:07 +02:00
Ines Montani
eb2b60069e Update main.yml 2019-09-29 15:33:53 +02:00
Ines Montani
70295f9e59 Update main.yml 2019-09-29 15:32:11 +02:00
Ines Montani
b503270b09 Update main.yml 2019-09-29 15:30:31 +02:00
Ines Montani
52ea244830 Fix workflows 2019-09-29 15:30:13 +02:00
Ines Montani
e9acfaec52 Revert "Revert "Rename workflows to _workflows""
This reverts commit 051fac51ee.
2019-09-29 15:29:02 +02:00
Ines Montani
051fac51ee Revert "Rename workflows to _workflows"
This reverts commit ba0027c936.
2019-09-29 15:28:59 +02:00
Ines Montani
7164c687e9 Revert "Merge branch 'master' of https://github.com/explosion/spaCy"
This reverts commit 41aab59dbf, reversing
changes made to ba0027c936.
2019-09-29 15:28:31 +02:00
Ines Montani
41aab59dbf Merge branch 'master' of https://github.com/explosion/spaCy 2019-09-29 15:26:32 +02:00
Ines Montani
ba0027c936 Rename workflows to _workflows 2019-09-29 15:26:23 +02:00
Ines Montani
80f67f6065 Update build.yml 2019-09-29 15:24:28 +02:00
Ines Montani
e787e6d47f Update build.yml 2019-09-29 15:15:34 +02:00
Ines Montani
b2f41e2a9b Update build.yml 2019-09-29 15:06:19 +02:00
Ines Montani
499c39acba Remove unnecessary namedtuple/dataclass 2019-09-29 15:05:28 +02:00
Ines Montani
8b02fff097 Update build.yml 2019-09-29 14:55:43 +02:00
Ines Montani
ace0d5c580 Update build.yml 2019-09-29 14:52:01 +02:00
Ines Montani
d32fb03401 Update build.yml 2019-09-29 14:48:21 +02:00
Ines Montani
a5c0130b50 Update and rename pythonpackage.yml to build.yml 2019-09-29 14:43:48 +02:00
Matthew Honnibal
eba708404d Set version to v2.2.0.dev15 2019-09-28 22:23:53 +02:00
Matthew Honnibal
b6ec291bde Require preshed 3.0.2 2019-09-28 22:23:24 +02:00
Matthew Honnibal
6189959adb Set version to v2.2.0.dev14 2019-09-28 22:09:46 +02:00
Matthew Honnibal
4c383ab77e Require newer preshed 2019-09-28 22:08:05 +02:00
Matthew Honnibal
0df2a599b7 Set version to v2.2.0.dev13 2019-09-28 21:26:05 +02:00
Ines Montani
7cca6b57a7 Install from sdist in CI (#4335)
* Update azure-pipelines.yml

* Update azure-pipelines.yml

* Update azure-pipelines.yml

* Update azure-pipelines.yml

* Update azure-pipelines.yml

* Update azure-pipelines.yml
2019-09-28 21:18:52 +02:00
Ines Montani
48e697dd00 Auto-format [ci skip] 2019-09-28 18:29:57 +02:00
Ines Montani
c9cd516d96 Move tests out of package (#4334)
* Move tests out of package

* Fix typo
2019-09-28 18:05:00 +02:00
Matthew Honnibal
d05eb56ce2 Set version to v2.2.0.dev12 2019-09-28 16:35:56 +02:00
Matthew Honnibal
96dd143a18 Install json.gz files 2019-09-28 16:35:39 +02:00
Ines Montani
10742d3219 Update v2 docs [ci skip] 2019-09-28 15:57:22 +02:00
Ines Montani
5fe61539c4 Fix unicode "e" in filename 2019-09-28 15:45:16 +02:00
Ines Montani
a2815f6643 Fix model table display [ci skip] 2019-09-28 14:23:03 +02:00
Ines Montani
129670283e Pass meta labels through correctly [ci skip] 2019-09-28 14:08:33 +02:00
Ines Montani
811c4c97c9 Correct lookup lemma of "lenses" (see #4332) 2019-09-28 14:04:07 +02:00
Ines Montani
f8d1e2f214 Update CLI docs [ci skip] 2019-09-28 13:12:30 +02:00
Sofie Van Landeghem
22b9e12159 Ensure the NER remains consistent after resizing (#4330)
* test and fix for second bug of issue 4042

* fix for first bug in 4042

* crashing test for Issue 4313

* forgot one instance of resize

* remove prints

* undo uncomment

* delete test for 4313 (uses third party lib)

* add fix for Issue 4313

* unit test for 4313
2019-09-27 20:57:13 +02:00
adrianeboyd
3906785b49 Initialize low data warning for debug-data parser (#4331) 2019-09-27 20:56:49 +02:00
Ines Montani
59beab8405 Update v2-2.md [ci skip] 2019-09-27 18:10:43 +02:00
Ines Montani
206e8a5ac7 Also apply hotfix to Ukrainian lemmatizer 2019-09-27 18:03:26 +02:00
Ines Montani
acd5bcb0b3 Tidy up fixtures 2019-09-27 17:57:59 +02:00
Ines Montani
b21b2e27e5 Hotfix Russian lemmatizer 2019-09-27 17:56:12 +02:00
Matthew Honnibal
a4d4c4bfa4 Set version to v2.2.0.dev11 2019-09-27 16:40:26 +02:00
Ines Montani
685e4b2554 Update v2-2.md [ci skip] 2019-09-27 16:35:01 +02:00
Ines Montani
aad66d9bb9 Document PhraseMatcher.remove [ci skip] 2019-09-27 16:34:53 +02:00
adrianeboyd
c23edf302b Replace PhraseMatcher with trie-based search (#4309)
* Replace PhraseMatcher with Aho-Corasick

Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays
of the hash values for the relevant attribute. The implementation is
based on FlashText.

The speed should be similar to the previous PhraseMatcher. It is now
possible to easily remove match IDs and matches don't go missing with
large keyword lists / vocabularies.

Fixes #4308.

* Restore support for pickling

* Fix internal keyword add/remove for numpy arrays

* Add missing loop for match ID set in search loop

* Remove cruft in matching loop for partial matches

There was a bit of unnecessary code left over from FlashText in the
matching loop to handle partial token matches, which we don't have with
PhraseMatcher.

* Replace dict trie with MapStruct trie

* Fix how match ID hash is stored/added

* Update fix for match ID vocab

* Switch from map_get_unless_missing to map_get

* Switch from numpy array to Token.get_struct_attr

Access token attributes directly in Doc instead of making a copy of the
relevant values in a numpy array.

Add unsatisfactory warning for hash collision with reserved terminal
hash key. (Ideally it would change the reserved terminal hash and redo
the whole trie, but for now, I'm hoping there won't be collisions.)

* Restructure imports to export find_matches

* Implement full remove()

Remove unnecessary trie paths and free unused maps.

Parallel to Matcher, raise KeyError when attempting to remove a match ID
that has not been added.

* Store docs internally only as attr lists

* Reduces size for pickle

* Remove duplicate keywords store

Now that docs are stored as lists of attr hashes, there's no need to
have the duplicate _keywords store.
2019-09-27 16:22:34 +02:00
adrianeboyd
d844030fd8 Update UD bin scripts (#4315)
* Update imports for `bin/`
* Add all currently supported languages
* Update subtok merger for new Matcher validation
* Modify blinded check to look at tokens instead of lemmas (for corpora
with tokens but not lemmas like Telugu)
2019-09-27 16:20:38 +02:00
tamuhey
b408b5b29e Refactor language update (#4316)
* refactor: separate formatting docs and golds in Language.update

* fix return typo
2019-09-27 16:20:21 +02:00
Matthew Honnibal
105a91975b Fix sdist command 2019-09-27 15:52:26 +02:00