Commit Graph

6512 Commits

Author SHA1 Message Date
Ines Montani
c4f95c1569 Update formatting and docstrings [ci skip] 2019-10-08 12:25:23 +02:00
Matthew Honnibal
ddd6fda59c Add registry for model creation functions ('architectures') (#4395)
* Add architecture registry

* Add test for arch registry

* Add error for model architectures
2019-10-08 12:21:03 +02:00
tamuhey
650cbfe82d multiprocessing pipe (#1303) (#4371)
* refactor: separate formatting docs and golds in Language.update

* fix return typo

* add pipe test

* unpickleable object cannot be assigned to p.map

* passed test pipe

* passed test!

* pipe terminate

* try pipe

* passed test

* fix ch

* add comments

* fix len(texts)

* add comment

* add comment

* fix: multiprocessing of pipe is not supported in 2

* test: use assert_docs_equal

* fix: is_python3 -> is_python2

* fix: change _pipe arg to use functools.partial

* test: add vector modification test

* test: add sample ner_pipe and user_data pipe

* add warnings test

* test: fix user warnings

* test: fix warnings capture

* fix: remove islice import

* test: remove warnings test

* test: add stream test

* test: rename

* fix: multiproc stream

* fix: stream pipe

* add comment

* mp.Pipe seems to be usable with relatively small data

* test: skip stream test in python2

* sort imports

* test: add reason to skiptest

* fix: use pipe for docs communication

* add comments

* add comment
2019-10-08 12:20:55 +02:00
adrianeboyd
14841d0aa6 Fix PhraseMatcher callback and add tests (#4399)
* Fix callback lookup in PhraseMatcher (string key rather than hash key)
* Add callback tests for Matcher and PhraseMatcher
2019-10-08 12:07:02 +02:00
Matthew Honnibal
fd4a5341b0 Fix ner_jsonl2json converter (fix #4389) (#4394) 2019-10-08 00:52:45 +02:00
Matthew Honnibal
29f9fec267 Improve spacy pretrain (#4393)
* Support bilstm_depth arg in spacy pretrain

* Add option to ignore zero vectors in get_cossim_loss

* Use cosine loss in Cloze multitask
2019-10-07 23:34:58 +02:00
Ines Montani
9cd6ca3e4d Improve usage of pkg_resources and handling of entry points (#4387)
* Only import pkg_resources where it's needed

Apparently it's really slow

* Use importlib_metadata for entry points

* Revert "Only import pkg_resources where it's needed"

This reverts commit 5ed8c03afa.

* Revert "Revert "Only import pkg_resources where it's needed""

This reverts commit 8b30b57957.

* Revert "Use importlib_metadata for entry points"

This reverts commit 9f071f5c40.

* Revert "Revert "Use importlib_metadata for entry points""

This reverts commit 02e12a17ec.

* Skip test that weirdly hangs

* Fix hanging test by using global
2019-10-07 17:22:09 +02:00
adrianeboyd
d53a8d9313 Consider batch_size when sorting similar vectors (#4388) 2019-10-07 13:38:35 +02:00
adrianeboyd
a3509f67d4 Extend unicode character block for Sinhala (#4378)
* Extend unicode character block for Sinhala

* Add sentencizer tests for more languages
2019-10-07 13:17:03 +02:00
Ines Montani
573e543e4a Alphanumeric -> alphabetic [ci skip]
see ines/spacy-course#38
2019-10-06 13:30:01 +02:00
adrianeboyd
cbc2cee2c8 Improve URL_PATTERN and handling in tokenizer (#4374)
* Move prefix and suffix detection for URL_PATTERN

Move prefix and suffix detection for `URL_PATTERN` into the tokenizer.
Remove associated lookahead and lookbehind from `URL_PATTERN`.

Fix tokenization for Hungarian given new modified handling of prefixes
and suffixes.

* Match a wider range of URI schemes
2019-10-05 13:00:09 +02:00
Ines Montani
fec9433044 Make PhraseMatcher.vocab consistent with Matcher.vocab (closes #4373) 2019-10-04 12:18:41 +02:00
Matthew Honnibal
37ef874d8b Set version to v2.2.1 2019-10-03 14:50:39 +02:00
Sofie Van Landeghem
4e7259c6cf Bugfix initializing DocBin with attributes (#4368)
* docbin init fix + documentation fix + unit tests

* newline

* try with zlib instead of gzip (python 2 incompatibilities)
2019-10-03 14:48:45 +02:00
Ben Taylor
1db79a33cb most_similar() return the k most similar vectors (#4364)
* most_similar return n-most similar vectors

* updated most_similar comment

* add bintay contributor agreement

* sign bintay contributor agreement

* fix most_similar documentation typo

* fixed error in prune_vectors

* updated prune_vectors test
2019-10-03 14:09:44 +02:00
Matthew Honnibal
2eb31012e7 Set version to v2.2.0 2019-10-02 14:40:06 +02:00
Matthew Honnibal
796072e560 Set version to v2.2.0.dev19 2019-10-02 12:51:29 +02:00
Sofie Van Landeghem
9d3ce7cba2 Ensure training doesn't crash with empty batches (#4360)
* unit test for previously resolved unflatten issue

* prevent batch of empty docs to cause problems
2019-10-02 12:50:47 +02:00
adrianeboyd
dda86118bd Update Ukrainian lemmatizer with new lookups (#4359)
* Update Ukrainian lemmatizer with new lookups

* Add missing import


Co-authored-by: Ines Montani <ines@ines.io>
2019-10-02 12:04:06 +02:00
Ines Montani
b6670bf0c2 Use consistent spelling 2019-10-02 10:37:39 +02:00
Matthew Honnibal
38b6e69389 Merge branch 'master' of https://github.com/explosion/spaCy 2019-10-01 22:28:25 +02:00
Matthew Honnibal
d4b63bb6dd Set version to v2.2.0 2019-10-01 22:28:13 +02:00
Ines Montani
475e3188ce Add docs on filtering overlapping spans for merging (resolves #4352) [ci skip] 2019-10-01 21:59:50 +02:00
Matthew Honnibal
64a9577d43 Set version to v2.2.0.dev17 2019-10-01 21:36:59 +02:00
Ines Montani
cf65a80f36 Refactor lemmatizer and data table integration (#4353)
* Move test

* Allow default in Lookups.get_table

* Start with blank tables in Lookups.from_bytes

* Refactor lemmatizer to hold instance of Lookups

* Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk)
* Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency
* Remove old and unsupported Lemmatizer.load classmethod
* Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need

* Update tests and docs

* Fix more tests

* Fix lemmatizer

* Upgrade pytest to try and fix weird CI errors

* Try pytest 4.6.5
2019-10-01 21:36:03 +02:00
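The point in the commit above about fetching the lookups table inside the lemmatization methods can be illustrated with a small sketch. The classes below are hypothetical stand-ins written for this note, not spaCy's `Lookups` or `Lemmatizer`, and the table name `lemma_lookup` and the sample entries are assumptions. The sketch only shows why a reference cached at construction time goes stale when a table is replaced, while a per-call lookup does not.

```python
class ToyLookups:
    """Hypothetical container of named tables (not spaCy's Lookups)."""

    def __init__(self):
        self._tables = {}

    def set_table(self, name, data):
        # Replacing a table stores a brand-new dict object.
        self._tables[name] = dict(data)

    def get_table(self, name, default=None):
        return self._tables.get(name, default)


class EagerLemmatizer:
    """Caches the table once, so later replacements are not seen."""

    def __init__(self, lookups):
        self.table = lookups.get_table("lemma_lookup", {})

    def lookup(self, word):
        return self.table.get(word, word)


class LazyLemmatizer:
    """Holds the container and fetches the table on every call."""

    def __init__(self, lookups):
        self.lookups = lookups

    def lookup(self, word):
        table = self.lookups.get_table("lemma_lookup", {})
        return table.get(word, word)


lookups = ToyLookups()
lookups.set_table("lemma_lookup", {"mice": "mouse"})
eager = EagerLemmatizer(lookups)
lazy = LazyLemmatizer(lookups)

# Simulate a table being replaced, e.g. after loading data from disk.
lookups.set_table("lemma_lookup", {"mice": "mouse", "geese": "goose"})

eager_result = eager.lookup("geese")  # "geese": still reads the old table
lazy_result = lazy.lookup("geese")    # "goose": reads the current table
```

The lazy variant pays one extra dictionary access per call in exchange for always reading the current table.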
Ines Montani
3297a19545 Warn in Tagger.begin_training if no lemma tables are available (#4351) 2019-10-01 15:13:55 +02:00
Matthew Honnibal
2fb05482dd Set version to v2.2.0 2019-10-01 03:50:13 +02:00
Matthew Honnibal
dc22ec0aad Set version to v2.2.0.dev17 2019-10-01 03:26:53 +02:00
Matthew Honnibal
aedfba867a Set version to v2.2.0.dev16 2019-10-01 00:31:00 +02:00
Ines Montani
e0cf4796a5 Move lookup tables out of the core library (#4346)
* Add default to util.get_entry_point

* Tidy up entry points

* Read lookups from entry points

* Remove lookup tables and related tests

* Add lookups install option

* Remove lemmatizer tests

* Remove logic to process language data files

* Update setup.cfg
2019-10-01 00:01:27 +02:00
Rahul Soni
ed620daa5c Fix example sentences in Hindi for grammatical errors (#4343)
* Fix grammar for hindi

* Fix grammar for hindi

* Submit contributor agreement
2019-09-30 23:32:49 +02:00
Ines Montani
ba186299e1 Tidy up and modernize setup and config (#4344)
* Tidy up and modernize setup and config

* Update setup.cfg

* Re-add pyproject.toml

* Delete .flake8

* Move static meta from about to setup.cfg

* Update setup.cfg

Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>
2019-09-30 20:10:55 +02:00
Ines Montani
4f905ac9e6 Add test for ASCII filenames (#4345) 2019-09-30 18:45:30 +02:00
Matthew Honnibal
b5c775dd42 Set version to v2.2.0 2019-09-30 12:47:08 +02:00
Ines Montani
f7d1736241 Skip duplicate spans in Doc.retokenize (#4339) 2019-09-30 12:43:48 +02:00
Ines Montani
0226b3bf0e Fix test imports 2019-09-29 17:34:56 +02:00
Ines Montani
3d8fd4b461 Revert #4334 2019-09-29 17:32:12 +02:00
adrianeboyd
ba5595c764 Fix PhraseMatcher to remember attr on pickling (#4336)
* Fix PhraseMatcher to remember attr on pickling

* Check for attr as int or long
2019-09-29 17:12:33 +02:00
Ines Montani
75514b5970 Fix Korean 2019-09-29 17:10:56 +02:00
Ines Montani
499c39acba Remove unnecessary namedtuple/dataclass 2019-09-29 15:05:28 +02:00
Matthew Honnibal
eba708404d Set version to v2.2.0.dev15 2019-09-28 22:23:53 +02:00
Matthew Honnibal
6189959adb Set version to v2.2.0.dev14 2019-09-28 22:09:46 +02:00
Matthew Honnibal
0df2a599b7 Set version to v2.2.0.dev13 2019-09-28 21:26:05 +02:00
Ines Montani
c9cd516d96 Move tests out of package (#4334)
* Move tests out of package

* Fix typo
2019-09-28 18:05:00 +02:00
Matthew Honnibal
d05eb56ce2 Set version to v2.2.0.dev12 2019-09-28 16:35:56 +02:00
Ines Montani
5fe61539c4 Fix unicode "e" in filename 2019-09-28 15:45:16 +02:00
Ines Montani
811c4c97c9 Correct lookup lemma of "lenses" (see #4332) 2019-09-28 14:04:07 +02:00
Ines Montani
f8d1e2f214 Update CLI docs [ci skip] 2019-09-28 13:12:30 +02:00
Sofie Van Landeghem
22b9e12159 Ensure the NER remains consistent after resizing (#4330)
* test and fix for second bug of issue 4042

* fix for first bug in 4042

* crashing test for Issue 4313

* forgot one instance of resize

* remove prints

* undo uncomment

* delete test for 4313 (uses third party lib)

* add fix for Issue 4313

* unit test for 4313
2019-09-27 20:57:13 +02:00
adrianeboyd
3906785b49 Initialize low data warning for debug-data parser (#4331) 2019-09-27 20:56:49 +02:00
Ines Montani
206e8a5ac7 Also apply hotfix to Ukrainian lemmatizer 2019-09-27 18:03:26 +02:00
Ines Montani
acd5bcb0b3 Tidy up fixtures 2019-09-27 17:57:59 +02:00
Ines Montani
b21b2e27e5 Hotfix Russian lemmatizer 2019-09-27 17:56:12 +02:00
Matthew Honnibal
a4d4c4bfa4 Set version to v2.2.0.dev11 2019-09-27 16:40:26 +02:00
Ines Montani
aad66d9bb9 Document PhraseMatcher.remove [ci skip] 2019-09-27 16:34:53 +02:00
adrianeboyd
c23edf302b Replace PhraseMatcher with trie-based search (#4309)
* Replace PhraseMatcher with Aho-Corasick

Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays
of the hash values for the relevant attribute. The implementation is
based on FlashText.

The speed should be similar to the previous PhraseMatcher. It is now
possible to easily remove match IDs and matches don't go missing with
large keyword lists / vocabularies.

Fixes #4308.

* Restore support for pickling

* Fix internal keyword add/remove for numpy arrays

* Add missing loop for match ID set in search loop

* Remove cruft in matching loop for partial matches

There was a bit of unnecessary code left over from FlashText in the
matching loop to handle partial token matches, which we don't have with
PhraseMatcher.

* Replace dict trie with MapStruct trie

* Fix how match ID hash is stored/added

* Update fix for match ID vocab

* Switch from map_get_unless_missing to map_get

* Switch from numpy array to Token.get_struct_attr

Access token attributes directly in Doc instead of making a copy of the
relevant values in a numpy array.

Add unsatisfactory warning for hash collision with reserved terminal
hash key. (Ideally it would change the reserved terminal hash and redo
the whole trie, but for now, I'm hoping there won't be collisions.)

* Restructure imports to export find_matches

* Implement full remove()

Remove unnecessary trie paths and free unused maps.

Parallel to Matcher, raise KeyError when attempting to remove a match ID
that has not been added.

* Store docs internally only as attr lists

* Reduces size for pickle

* Remove duplicate keywords store

Now that docs are stored as lists of attr hashes, there's no need to
have the duplicate _keywords store.
2019-09-27 16:22:34 +02:00
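The commit above describes a trie-based phrase search with removable match IDs. A minimal sketch of that general idea follows, under several assumptions: `TrieMatcher` and its methods are hypothetical names invented here, it works on plain Python token lists rather than `Doc` objects or attribute hashes, it uses nested dicts rather than a MapStruct trie, and it does a simple restart-at-every-position scan rather than Aho-Corasick with failure links. It is not the PhraseMatcher implementation.

```python
class TrieMatcher:
    """Toy phrase matcher over token sequences (hypothetical sketch)."""

    _TERMINAL = object()  # reserved key marking the end of a phrase

    def __init__(self):
        self._trie = {}
        self._phrases = {}  # match ID -> set of token tuples

    def add(self, match_id, phrases):
        for phrase in phrases:
            tokens = tuple(phrase)
            node = self._trie
            for tok in tokens:
                node = node.setdefault(tok, {})
            node.setdefault(self._TERMINAL, set()).add(match_id)
            self._phrases.setdefault(match_id, set()).add(tokens)

    def remove(self, match_id):
        # Raise KeyError for an unknown match ID, as the commit describes.
        if match_id not in self._phrases:
            raise KeyError(match_id)
        for tokens in self._phrases.pop(match_id):
            self._remove_path(tokens, match_id)

    def _remove_path(self, tokens, match_id):
        path = [self._trie]
        for tok in tokens:
            path.append(path[-1][tok])
        ids = path[-1][self._TERMINAL]
        ids.discard(match_id)
        if not ids:
            del path[-1][self._TERMINAL]
        # Prune nodes that no longer lead anywhere, bottom-up.
        for i in range(len(tokens) - 1, -1, -1):
            if path[i + 1]:
                break
            del path[i][tokens[i]]

    def find(self, tokens):
        """Return (match_id, start, end) triples, end exclusive."""
        matches = []
        for start in range(len(tokens)):
            node = self._trie
            for end in range(start, len(tokens)):
                node = node.get(tokens[end])
                if node is None:
                    break
                for match_id in sorted(node.get(self._TERMINAL, ())):
                    matches.append((match_id, start, end + 1))
        return matches


matcher = TrieMatcher()
matcher.add("CITY", [["New", "York"], ["New", "York", "City"]])
matcher.add("COLOR", [["red"]])
words = "I love New York City red".split()
found = matcher.find(words)
# found == [("CITY", 2, 4), ("CITY", 2, 5), ("COLOR", 5, 6)]
matcher.remove("CITY")
found_after = matcher.find(words)
# found_after == [("COLOR", 5, 6)]
```

Keeping each added phrase as a token tuple per match ID is what makes `remove` possible: the matcher can walk exactly the paths that ID created and free any branches left empty.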
tamuhey
b408b5b29e Refactor language update (#4316)
* refactor: separate formatting docs and golds in Language.update

* fix return typo
2019-09-27 16:20:21 +02:00
Jaydeep Borkar
6a06a3fa6a Update stop_words.py and add name in contributors (#4325)
* Update stop_words.py and add name in contributors

* add jaydeepborkar.md in contributors directory

* Reset template [ci skip]


Co-authored-by: Ines Montani <ines@ines.io>
2019-09-27 11:57:27 +02:00
Ines Montani
da9a869d3f Update vectors name docs [ci skip] 2019-09-26 16:21:32 +02:00
Matthew Honnibal
58533f01bf Set version to v2.2.0.dev10 2019-09-26 03:03:50 +02:00
Matthew Honnibal
27ace84f4a Support model name in init-model 2019-09-26 03:01:32 +02:00
Matthew Honnibal
eced2f3211 Set version to v2.2.0.dev9 2019-09-25 21:14:07 +02:00
Matthew Honnibal
1251b57dbb Fix vectors name arg to init-model 2019-09-25 14:21:27 +02:00
Matthew Honnibal
92ed4dc5e0 Allow vectors name to be set in init-model (#4321)
* Allow vectors name to be specified in init-model

* Document --vectors-name argument to init-model

* Update website/docs/api/cli.md

Co-Authored-By: Ines Montani <ines@ines.io>
2019-09-25 13:11:00 +02:00
Ines Montani
52904b7270 Raise if on_match is not callable or None 2019-09-24 23:06:24 +02:00
Ines Montani
16aa092fb5 Improve Morphology errors (#4314)
* Improve Morphology errors

* Also clean up some other errors

* Update errors.py
2019-09-21 14:37:06 +02:00
Ines Montani
9bf69bfbb2 Remove test 2019-09-19 17:38:41 +02:00
Ines Montani
8cd3763678 Update about.py [ci skip] 2019-09-19 01:02:25 +02:00
Matthew Honnibal
f52b857953 Update version 2019-09-19 00:56:35 +02:00
Matthew Honnibal
e34b4a38b0 Fix set labels meta 2019-09-19 00:56:07 +02:00
Matthew Honnibal
9d399fe63a Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-09-19 00:04:06 +02:00
Matthew Honnibal
7d510c833e Fix orth replacement 2019-09-19 00:03:24 +02:00
Ines Montani
89d1dc4afa Merge branch 'master' into develop 2019-09-18 22:12:24 +02:00
Sean Löfgren
31c683d87d add return_matches and as_tuples back to Matcher.pipe (#4303)
* add contributor agreement [ci skip]

* add return_matches and as_tuples back to Matcher.pipe
2019-09-18 22:00:33 +02:00
Matthew Honnibal
42df49133d Also lower-case in orth variants 2019-09-18 21:54:51 +02:00
Matthew Honnibal
19d99fc9e7 Set version to v2.2.0.dev7 2019-09-18 21:43:59 +02:00
Matthew Honnibal
46c02d25b1 Merge changes to test_ner 2019-09-18 21:41:24 +02:00
Sofie Van Landeghem
de5a9ecdf3 Distinction between outside, missing and blocked NER annotations (#4307)
* remove duplicate unit test

* unit test (currently failing) for issue 4267

* bugfix: ensure doc.ents preserves kb_id annotations

* fix in setting doc.ents with empty label

* rename

* test for presetting an entity to a certain type

* allow overwriting Outside + blocking presets

* fix actions when previous label needs to be kept

* fix default ent_iob in set entities

* cleaner solution with U- action

* remove debugging print statements

* unit tests with explicit transitions and is_valid testing

* remove U- from move_names explicitly

* remove unit tests with pre-trained models that don't work

* remove (working) unit tests with pre-trained models

* clean up unit tests

* move unit tests

* small fixes

* remove two TODO's from doc.ents comments
2019-09-18 21:37:17 +02:00
Moshe Hazoom
72463b062f Improve speed of _merge method (#4300)
* make merge more efficient

* fix offsets

* merge works with relative indices

* remove printing

* Add the SCA

* fix SCA date

* more cythonize _retokenize.pyx

* more cythonize _retokenize.pyx

* fix only declaration in _retokenize.pyx

* switch back to absolute head

* switch back to absolute head

* fix comment

* merge from origin repo
2019-09-18 21:34:34 +02:00
tamuhey
875f3e5d8c remove redundant __call__ method in pipes.TextCategorizer (#4305)
* remove redundant __call__ method in pipes.TextCategorizer

Because the parent __call__ method behaves in the same way.

* fix: Pipe.__call__ arg

* fix: invalid arg in Pipe.__call__

* modified:   spacy/tests/regression/test_issue4278.py (#4278)

* deleted:    Pipfile
2019-09-18 21:31:27 +02:00
Ines Montani
00a8cbc306 Tidy up and auto-format 2019-09-18 20:27:03 +02:00
Ines Montani
f2c8b1e362 Simplify lookup hashing
Just use get_string_id, which already does everything ensure_hash was supposed to do
2019-09-18 20:24:41 +02:00
Ines Montani
dd1810f05a Update DocBin and add docs 2019-09-18 20:23:21 +02:00
Ines Montani
7e810cced6 Add references to docs pages 2019-09-18 19:57:21 +02:00
Ines Montani
2e5ab5b59c Make except more explicit 2019-09-18 19:57:08 +02:00
Ines Montani
1f648ecb76 Auto-format 2019-09-18 19:56:55 +02:00
Ines Montani
0f7fe5e7a7 Auto-format and fix typo and consistency 2019-09-18 19:18:30 +02:00
Matthew Honnibal
e53b86751f DocPallet -> DocBin 2019-09-18 15:15:37 +02:00
Matthew Honnibal
fa9a283128 Fix name 2019-09-18 13:40:03 +02:00
Matthew Honnibal
88a23cf49a Fix name 2019-09-18 13:38:29 +02:00
Matthew Honnibal
3507943b15 Add docstring for DocPallet 2019-09-18 13:25:47 +02:00
Matthew Honnibal
1c8de6b2e5 Rename DocBox->DocPallet 2019-09-18 13:13:51 +02:00
Ines Montani
691e0088cf Remove duplicate tok2vec property (closes #4302) 2019-09-17 11:22:03 +02:00
Ines Montani
a84025d70b Remove --no-deps from default pip args on download
Add warning if user is executing spaCy without having it installed and add --no-deps to prevent the package from being redownloaded
2019-09-16 23:32:41 +02:00
Matthew Honnibal
84c65f9455 Merge branch 'master' into develop 2019-09-16 22:12:20 +02:00
Matthew Honnibal
47055d5988 Fix type declarations in _merge method 2019-09-16 22:10:13 +02:00
Sofie Van Landeghem
03ac29f437 Ensure that doc.ents preserves kb_id annotations (#4294)
* bugfix: ensure doc.ents preserves kb_id annotations

* fix backward compatibility

* additional test
2019-09-16 15:18:37 +02:00
Ines Montani
139428c20f Set unique vector names in tests 2019-09-16 15:16:54 +02:00
Ines Montani
bf06d9d537 Allow passing vectors_name to Vocab 2019-09-16 15:16:41 +02:00
Ines Montani
cb6c68a573 Pass vectors name correctly in prune_vectors 2019-09-16 15:16:29 +02:00