Sofie Van Landeghem
5efae495f1
Error when removing a matcher rule that doesn't exist ( #4420 )
...
* raise specific error when removing a matcher rule that doesn't exist
* rephrasing
2019-10-10 14:01:53 +02:00
Matthew Honnibal
fa95c030a5
Unify matcher get_ent_id and get_pattern_key ( #4415 )
...
This is basically stabbing blindly at the ghost match problem, but it at
least seems like there was a bug previously here --- so this should
hopefully be an improvement, even if it doesn't fix the ghost match
problem.
2019-10-09 15:26:31 +02:00
Ines Montani
c4f95c1569
Update formatting and docstrings [ci skip]
2019-10-08 12:25:23 +02:00
Matthew Honnibal
ddd6fda59c
Add registry for model creation functions ('architectures') ( #4395 )
...
* Add architecture registry
* Add test for arch registry
* Add error for model architectures
2019-10-08 12:21:03 +02:00
tamuhey
650cbfe82d
multiprocessing pipe ( #1303 ) ( #4371 )
...
* refactor: separate formatting docs and golds in Language.update
* fix return typo
* add pipe test
* unpickleable object cannot be assigned to p.map
* passed test pipe
* passed test!
* pipe terminate
* try pipe
* passed test
* fix ch
* add comments
* fix len(texts)
* add comment
* add comment
* fix: multiprocessing of pipe is not supported in Python 2
* test: use assert_docs_equal
* fix: is_python3 -> is_python2
* fix: change _pipe arg to use functools.partial
* test: add vector modification test
* test: add sample ner_pipe and user_data pipe
* add warnings test
* test: fix user warnings
* test: fix warnings capture
* fix: remove islice import
* test: remove warnings test
* test: add stream test
* test: rename
* fix: multiproc stream
* fix: stream pipe
* add comment
* mp.Pipe seems to be usable with relatively small data
* test: skip stream test in python2
* sort imports
* test: add reason to skiptest
* fix: use pipe for docs communication
* add comments
* add comment
2019-10-08 12:20:55 +02:00
adrianeboyd
14841d0aa6
Fix PhraseMatcher callback and add tests ( #4399 )
...
* Fix callback lookup in PhraseMatcher (string key rather than hash key)
* Add callback tests for Matcher and PhraseMatcher
2019-10-08 12:07:02 +02:00
Matthew Honnibal
fd4a5341b0
Fix ner_jsonl2json converter ( fix #4389 ) ( #4394 )
2019-10-08 00:52:45 +02:00
Matthew Honnibal
29f9fec267
Improve spacy pretrain ( #4393 )
...
* Support bilstm_depth arg in spacy pretrain
* Add option to ignore zero vectors in get_cossim_loss
* Use cosine loss in Cloze multitask
2019-10-07 23:34:58 +02:00
Ines Montani
9cd6ca3e4d
Improve usage of pkg_resources and handling of entry points ( #4387 )
...
* Only import pkg_resources where it's needed
Apparently it's really slow
* Use importlib_metadata for entry points
* Revert "Only import pkg_resources where it's needed"
This reverts commit 5ed8c03afa.
* Revert "Revert "Only import pkg_resources where it's needed""
This reverts commit 8b30b57957.
* Revert "Use importlib_metadata for entry points"
This reverts commit 9f071f5c40.
* Revert "Revert "Use importlib_metadata for entry points""
This reverts commit 02e12a17ec.
* Skip test that weirdly hangs
* Fix hanging test by using global
2019-10-07 17:22:09 +02:00
adrianeboyd
d53a8d9313
Consider batch_size when sorting similar vectors ( #4388 )
2019-10-07 13:38:35 +02:00
adrianeboyd
a3509f67d4
Extend unicode character block for Sinhala ( #4378 )
...
* Extend unicode character block for Sinhala
* Add sentencizer tests for more languages
2019-10-07 13:17:03 +02:00
Ines Montani
573e543e4a
Alphanumeric -> alphabetic [ci skip]
...
see ines/spacy-course#38
2019-10-06 13:30:01 +02:00
adrianeboyd
cbc2cee2c8
Improve URL_PATTERN and handling in tokenizer ( #4374 )
...
* Move prefix and suffix detection for URL_PATTERN
Move prefix and suffix detection for `URL_PATTERN` into the tokenizer.
Remove associated lookahead and lookbehind from `URL_PATTERN`.
Fix tokenization for Hungarian given new modified handling of prefixes
and suffixes.
* Match a wider range of URI schemes
2019-10-05 13:00:09 +02:00
Ines Montani
fec9433044
Make PhraseMatcher.vocab consistent with Matcher.vocab ( closes #4373 )
2019-10-04 12:18:41 +02:00
Matthew Honnibal
37ef874d8b
Set version to v2.2.1
2019-10-03 14:50:39 +02:00
Sofie Van Landeghem
4e7259c6cf
Bugfix initializing DocBin with attributes ( #4368 )
...
* docbin init fix + documentation fix + unit tests
* newline
* try with zlib instead of gzip (python 2 incompatibilities)
2019-10-03 14:48:45 +02:00
Ben Taylor
1db79a33cb
most_similar() return the k most similar vectors ( #4364 )
...
* most_similar return n-most similar vectors
* updated most_similar comment
* add bintay contributor agreement
* sign bintay contributor agreement
* fix most_similar documentation typo
* fixed error in prune_vectors
* updated prune_vectors test
2019-10-03 14:09:44 +02:00
Matthew Honnibal
2eb31012e7
Set version to v2.2.0
2019-10-02 14:40:06 +02:00
Matthew Honnibal
796072e560
Set version to v2.2.0.dev19
2019-10-02 12:51:29 +02:00
Sofie Van Landeghem
9d3ce7cba2
Ensure training doesn't crash with empty batches ( #4360 )
...
* unit test for previously resolved unflatten issue
* prevent a batch of empty docs from causing problems
2019-10-02 12:50:47 +02:00
adrianeboyd
dda86118bd
Update Ukrainian lemmatizer with new lookups ( #4359 )
...
* Update Ukrainian lemmatizer with new lookups
* Add missing import
Co-authored-by: Ines Montani <ines@ines.io>
2019-10-02 12:04:06 +02:00
Ines Montani
b6670bf0c2
Use consistent spelling
2019-10-02 10:37:39 +02:00
Matthew Honnibal
38b6e69389
Merge branch 'master' of https://github.com/explosion/spaCy
2019-10-01 22:28:25 +02:00
Matthew Honnibal
d4b63bb6dd
Set version to v2.2.0
2019-10-01 22:28:13 +02:00
Ines Montani
475e3188ce
Add docs on filtering overlapping spans for merging ( resolves #4352 ) [ci skip]
2019-10-01 21:59:50 +02:00
Matthew Honnibal
64a9577d43
Set version to v2.2.0.dev17
2019-10-01 21:36:59 +02:00
Ines Montani
cf65a80f36
Refactor lemmatizer and data table integration ( #4353 )
...
* Move test
* Allow default in Lookups.get_table
* Start with blank tables in Lookups.from_bytes
* Refactor lemmatizer to hold instance of Lookups
* Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk)
* Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency
* Remove old and unsupported Lemmatizer.load classmethod
* Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need
* Update tests and docs
* Fix more tests
* Fix lemmatizer
* Upgrade pytest to try and fix weird CI errors
* Try pytest 4.6.5
2019-10-01 21:36:03 +02:00
Ines Montani
3297a19545
Warn in Tagger.begin_training if no lemma tables are available ( #4351 )
2019-10-01 15:13:55 +02:00
Matthew Honnibal
2fb05482dd
Set version to v2.2.0
2019-10-01 03:50:13 +02:00
Matthew Honnibal
dc22ec0aad
Set version to v2.2.0.dev17
2019-10-01 03:26:53 +02:00
Matthew Honnibal
aedfba867a
Set version to v2.2.0.dev16
2019-10-01 00:31:00 +02:00
Ines Montani
e0cf4796a5
Move lookup tables out of the core library ( #4346 )
...
* Add default to util.get_entry_point
* Tidy up entry points
* Read lookups from entry points
* Remove lookup tables and related tests
* Add lookups install option
* Remove lemmatizer tests
* Remove logic to process language data files
* Update setup.cfg
2019-10-01 00:01:27 +02:00
Rahul Soni
ed620daa5c
Fix example sentences in Hindi for grammatical errors ( #4343 )
...
* Fix grammar for hindi
* Fix grammar for hindi
* Submit contributor agreement
2019-09-30 23:32:49 +02:00
Ines Montani
ba186299e1
Tidy up and modernize setup and config ( #4344 )
...
* Tidy up and modernize setup and config
* Update setup.cfg
* Re-add pyproject.toml
* Delete .flake8
* Move static meta from about to setup.cfg
* Update setup.cfg
Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>
2019-09-30 20:10:55 +02:00
Ines Montani
4f905ac9e6
Add test for ASCII filenames ( #4345 )
2019-09-30 18:45:30 +02:00
Matthew Honnibal
b5c775dd42
Set version to v2.2.0
2019-09-30 12:47:08 +02:00
Ines Montani
f7d1736241
Skip duplicate spans in Doc.retokenize ( #4339 )
2019-09-30 12:43:48 +02:00
Ines Montani
0226b3bf0e
Fix test imports
2019-09-29 17:34:56 +02:00
Ines Montani
3d8fd4b461
Revert #4334
2019-09-29 17:32:12 +02:00
adrianeboyd
ba5595c764
Fix PhraseMatcher to remember attr on pickling ( #4336 )
...
* Fix PhraseMatcher to remember attr on pickling
* Check for attr as int or long
2019-09-29 17:12:33 +02:00
Ines Montani
75514b5970
Fix Korean
2019-09-29 17:10:56 +02:00
Ines Montani
499c39acba
Remove unnecessary namedtuple/dataclass
2019-09-29 15:05:28 +02:00
Matthew Honnibal
eba708404d
Set version to v2.2.0.dev15
2019-09-28 22:23:53 +02:00
Matthew Honnibal
6189959adb
Set version to v2.2.0.dev14
2019-09-28 22:09:46 +02:00
Matthew Honnibal
0df2a599b7
Set version to v2.2.0.dev13
2019-09-28 21:26:05 +02:00
Ines Montani
c9cd516d96
Move tests out of package ( #4334 )
...
* Move tests out of package
* Fix typo
2019-09-28 18:05:00 +02:00
Matthew Honnibal
d05eb56ce2
Set version to v2.2.0.dev12
2019-09-28 16:35:56 +02:00
Ines Montani
5fe61539c4
Fix unicode "e" in filename
2019-09-28 15:45:16 +02:00
Ines Montani
811c4c97c9
Correct lookup lemma of "lenses" (see #4332 )
2019-09-28 14:04:07 +02:00
Ines Montani
f8d1e2f214
Update CLI docs [ci skip]
2019-09-28 13:12:30 +02:00