Commit Graph

6462 Commits

Author SHA1 Message Date
Matthew Honnibal
e63f28079a Try 3 NER features 2019-10-07 16:51:03 +02:00
Matthew Honnibal
2d55ccdd27 Support option of three NER features 2019-10-07 16:50:44 +02:00
Matthew Honnibal
c8857181f8 Fix get labels for textcat 2019-10-07 16:50:15 +02:00
Matthew Honnibal
a6a2ff217f Fix char_embed for gpu 2019-10-07 16:49:32 +02:00
Matthew Honnibal
f4040a98f0 Fix passing of cats in gold.pyx 2019-10-07 16:49:00 +02:00
Matthew Honnibal
a132da1558 Fix gold-preproc training mode 2019-10-07 02:07:03 +02:00
Matthew Honnibal
63ff233ba2 Enable GPU in pytorch n use_gpu functon 2019-10-06 19:24:21 +02:00
Matthew Honnibal
9dbaea1ab4 Use cosine loss in Cloze multitask 2019-10-06 19:23:46 +02:00
Matthew Honnibal
157d3d769b Support bilstm_depth arg in spacy pretrain 2019-10-06 19:22:26 +02:00
Matthew Honnibal
615ebe584f Add option to ignore zero vectors in get_cossim_loss 2019-10-06 19:20:54 +02:00
adrianeboyd
cbc2cee2c8 Improve URL_PATTERN and handling in tokenizer (#4374)
* Move prefix and suffix detection for URL_PATTERN

Move prefix and suffix detection for `URL_PATTERN` into the tokenizer.
Remove associated lookahead and lookbehind from `URL_PATTERN`.

Fix tokenization for Hungarian given new modified handling of prefixes
and suffixes.

* Match a wider range of URI schemes
2019-10-05 13:00:09 +02:00
Ines Montani
fec9433044 Make PhraseMatcher.vocab consistent with Matcher.vocab (closes #4373) 2019-10-04 12:18:41 +02:00
Matthew Honnibal
37ef874d8b Set version to v2.2.1 2019-10-03 14:50:39 +02:00
Sofie Van Landeghem
4e7259c6cf Bugfix initializing DocBin with attributes (#4368)
* docbin init fix + documentation fix + unit tests

* newline

* try with zlib instead of gzip (python 2 incompatibilities)
2019-10-03 14:48:45 +02:00
Ben Taylor
1db79a33cb most_similar() return the k most similar vectors (#4364)
* most_similar return n-most similar vectors

* updated most_similar comment

* add bintay contributor agreement

* sign bintay contributor agreement

* fix most_similar documentation typo

* fixed error in prune_vectors

* updated prune_vectors test
2019-10-03 14:09:44 +02:00
Matthew Honnibal
2eb31012e7 Set version to v2.2.0 2019-10-02 14:40:06 +02:00
Matthew Honnibal
796072e560 Set version to v2.2.0.dev19 2019-10-02 12:51:29 +02:00
Sofie Van Landeghem
9d3ce7cba2 Ensure training doesn't crash with empty batches (#4360)
* unit test for previously resolved unflatten issue

* prevent batch of empty docs to cause problems
2019-10-02 12:50:47 +02:00
adrianeboyd
dda86118bd Update Ukrainian lemmatizer with new lookups (#4359)
* Update Ukrainian lemmatizer with new lookups

* Add missing import


Co-authored-by: Ines Montani <ines@ines.io>
2019-10-02 12:04:06 +02:00
Ines Montani
b6670bf0c2 Use consistent spelling 2019-10-02 10:37:39 +02:00
Matthew Honnibal
38b6e69389 Merge branch 'master' of https://github.com/explosion/spaCy 2019-10-01 22:28:25 +02:00
Matthew Honnibal
d4b63bb6dd Set version to v2.2.0 2019-10-01 22:28:13 +02:00
Ines Montani
475e3188ce Add docs on filtering overlapping spans for merging (resolves #4352) [ci skip] 2019-10-01 21:59:50 +02:00
Matthew Honnibal
64a9577d43 Set version to v2.2.0.dev17 2019-10-01 21:36:59 +02:00
Ines Montani
cf65a80f36 Refactor lemmatizer and data table integration (#4353)
* Move test

* Allow default in Lookups.get_table

* Start with blank tables in Lookups.from_bytes

* Refactor lemmatizer to hold instance of Lookups

* Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk)
* Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency
* Remove old and unsupported Lemmatizer.load classmethod
* Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need

* Update tests and docs

* Fix more tests

* Fix lemmatizer

* Upgrade pytest to try and fix weird CI errors

* Try pytest 4.6.5
2019-10-01 21:36:03 +02:00
Ines Montani
3297a19545 Warn in Tagger.begin_training if no lemma tables are available (#4351) 2019-10-01 15:13:55 +02:00
Matthew Honnibal
2fb05482dd Set version to v2.2.0 2019-10-01 03:50:13 +02:00
Matthew Honnibal
dc22ec0aad Set version to v2.2.0.dev17 2019-10-01 03:26:53 +02:00
Matthew Honnibal
aedfba867a Set version to v2.2.0.dev16 2019-10-01 00:31:00 +02:00
Ines Montani
e0cf4796a5 Move lookup tables out of the core library (#4346)
* Add default to util.get_entry_point

* Tidy up entry points

* Read lookups from entry points

* Remove lookup tables and related tests

* Add lookups install option

* Remove lemmatizer tests

* Remove logic to process language data files

* Update setup.cfg
2019-10-01 00:01:27 +02:00
Rahul Soni
ed620daa5c Fix example sentences in Hindi for grammatical errors (#4343)
* Fix grammar for hindi

* Fix grammar for hindi

* Submit contributor agreement
2019-09-30 23:32:49 +02:00
Ines Montani
ba186299e1 Tidy up and modernize setup and config (#4344)
* Tidy up and modernize setup and config

* Update setup.cfg

* Re-add pyproject.toml

* Delete .flake8

* Move static meta from about to setup.cfg

* Update setup.cfg

Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>
2019-09-30 20:10:55 +02:00
Ines Montani
4f905ac9e6 Add test for ASCII filenames (#4345) 2019-09-30 18:45:30 +02:00
Matthew Honnibal
b5c775dd42 Set version to v2.2.0 2019-09-30 12:47:08 +02:00
Ines Montani
f7d1736241 Skip duplicate spans in Doc.retokenize (#4339) 2019-09-30 12:43:48 +02:00
Ines Montani
0226b3bf0e Fix test imports 2019-09-29 17:34:56 +02:00
Ines Montani
3d8fd4b461 Revert #4334 2019-09-29 17:32:12 +02:00
adrianeboyd
ba5595c764 Fix PhraseMatcher to remember attr on pickling (#4336)
* Fix PhraseMatcher to remember attr on pickling

* Check for attr as int or long
2019-09-29 17:12:33 +02:00
Ines Montani
75514b5970 Fix Korean 2019-09-29 17:10:56 +02:00
Ines Montani
499c39acba Remove unnecessary namedtuple/dataclass 2019-09-29 15:05:28 +02:00
Matthew Honnibal
eba708404d Set version to v2.2.0.dev15 2019-09-28 22:23:53 +02:00
Matthew Honnibal
6189959adb Set version to v2.2.0.dev14 2019-09-28 22:09:46 +02:00
Matthew Honnibal
0df2a599b7 Set version to v2.2.0.dev13 2019-09-28 21:26:05 +02:00
Ines Montani
c9cd516d96 Move tests out of package (#4334)
* Move tests out of package

* Fix typo
2019-09-28 18:05:00 +02:00
Matthew Honnibal
d05eb56ce2 Set version to v2.2.0.dev12 2019-09-28 16:35:56 +02:00
Ines Montani
5fe61539c4 Fix unicode "e" in filename 2019-09-28 15:45:16 +02:00
Ines Montani
811c4c97c9 Correct lookup lemma of "lenses" (see #4332) 2019-09-28 14:04:07 +02:00
Ines Montani
f8d1e2f214 Update CLI docs [ci skip] 2019-09-28 13:12:30 +02:00
Sofie Van Landeghem
22b9e12159 Ensure the NER remains consistent after resizing (#4330)
* test and fix for second bug of issue 4042

* fix for first bug in 4042

* crashing test for Issue 4313

* forgot one instance of resize

* remove prints

* undo uncomment

* delete test for 4313 (uses third party lib)

* add fix for Issue 4313

* unit test for 4313
2019-09-27 20:57:13 +02:00
adrianeboyd
3906785b49 Initialize low data warning for debug-data parser (#4331) 2019-09-27 20:56:49 +02:00