Commit Graph

11468 Commits

Author SHA1 Message Date
adrianeboyd
90c52128dc Improve train CLI with base model (#4911)
Improve train CLI with a provided base model so that you can:

* add a new component
* extend an existing component
* replace an existing component

When the final model and best model are saved, reenable any disabled
components and merge the meta information to include the full pipeline
and accuracy information for all components in the base model plus the
newly added components if needed.
2020-01-16 01:58:51 +01:00
Bram Vanroy
718704022a Changes to spacy_conll in universe (#4914)
* Update information on spacy_conll

* Typo fix
2020-01-16 01:56:39 +01:00
Matthew Honnibal
1785eebfe0
Merge pull request #4909 from svlandeg/bugfix/cnn_window
bugfix typo conv_window
2020-01-14 11:23:14 +01:00
svlandeg
ee828d5a9a bugfix typo conv_window 2020-01-14 09:02:58 +01:00
Sofie Van Landeghem
c70ccd543d Friendly error warning for NEL example script (#4881)
* make model positional arg and raise error if no vectors

* small doc fixes
2020-01-14 01:51:14 +01:00
adrianeboyd
d2f3a44b42 Improve train CLI sentrec scoring (#4892)
* reorder to metrics to prioritize F over P/R
* add sentrec to model metrics
2020-01-08 16:52:14 +01:00
adrianeboyd
e55fa1899a Report length of dev dataset correctly (#4891) 2020-01-08 16:51:51 +01:00
adrianeboyd
e1b493ae85 Add sentrec shortcut to Language (#4890) 2020-01-08 16:51:24 +01:00
adrianeboyd
d24bca62f6 Add CJK to character classes (#4884)
* Add CJK character class as uncased

* Incorporate Chinese URL test case

Un-xfail Chinese URL test instance
2020-01-08 16:50:19 +01:00
Preston Badeer
b216ff43c9 Update vectors-similarity.md (#4889)
These links are broken on the website, due to quotes around the URLs.
2020-01-08 16:49:40 +01:00
adrianeboyd
aef83e8070 Mark most Hungarian tokenizer test cases as slow (#4883)
* Mark most Hungarian tokenizer test cases as slow

Mark most Hungarian tokenizer test cases as slow to reduce the runtime
of the test suite in ordinary usage:

* for normal tests: run default tests plus 10% of the detailed tests
* for slow tests: run all tests

* Rework to mark individual tests as slow
2020-01-08 12:34:06 +01:00
Sofie Van Landeghem
7b96a5e10f Reduce mem usage in training Entity Linker (#4811)
* move nlp processing for el pipe to batch training instead of preprocessing

* adding dev eval back in, and limit in articles instead of entities

* use pipe whenever possible

* few more small doc changes

* access dev data through generator

* tqdm description

* small fixes

* update documentation
2020-01-06 14:59:50 +01:00
Sofie Van Landeghem
6e9b61b49d add warning in debug_data for punctuation in entities (#4853) 2020-01-06 14:59:28 +01:00
adrianeboyd
d652ff215d Add trailing whitespace to multiline test text (#4877) 2020-01-06 14:58:59 +01:00
adrianeboyd
de69bc6509 Fix and improve URL pattern (#4882)
* match domains longer than `hostname.domain.tld` like `www.foo.co.uk`
* expand allowed characters in domain names while only matching
lowercase TLDs so that "this.That" isn't matched as a URL and can be
split on the period as an infix (relevant for at least English, German,
and Tatar)
2020-01-06 14:58:30 +01:00
Sofie Van Landeghem
a1b22e90cd serialize ENT_ID (#4852)
* expand serialization test for custom token attribute

* add failing test for issue 4849

* define ENT_ID as attr and use in doc serialization

* fix few typos
2020-01-06 14:57:34 +01:00
Geoffrey Gordon Ashbrook
53929138d7 remove extra word typo (#4875)
"let you find you"
2020-01-06 12:37:42 +01:00
Ines Montani
400257a802 Update index.md [ci skip] 2020-01-04 01:52:18 +01:00
Sofie Van Landeghem
581eeed98b Warning goldparse (#4851)
* label in span not writable anymore

* Revert "label in span not writable anymore"

This reverts commit ab442338c8.

* provide more friendly error msg for parsing file
2020-01-01 13:16:48 +01:00
Ines Montani
83e0a6f3e3
Modernize plac commands for Python 3 (#4836) 2020-01-01 13:15:46 +01:00
Al Johri
1aa2d4dac9 stop rendering mathjax by default in displacy (#4840)
* stop rendering mathjax by default in displacy

* Replace f-string and add comment

Co-authored-by: Ines Montani <ines@ines.io>
2020-01-01 13:15:05 +01:00
Anastasiia Iurshina
db9257559c Adds script shebang (#4846) 2019-12-29 14:25:05 +01:00
Anastasiia Iurshina
1830a12578 Fixes typos (#4843)
* Fixes typos

* Fixes typo

* Contributor agreement
2019-12-29 14:24:13 +01:00
Ivan Echevarria
ef13e0c038 Add n_process to Language.pipe documentation (#4842) [ci skip]
* Add n_process to documentation

* Auto-format and add default [ci skip]

Co-authored-by: Ines Montani <ines@ines.io>
2019-12-29 14:23:33 +01:00
Al Johri
fd4a7bd2b7 sign contributor agreement for AlJohri (#4839) [ci skip] 2019-12-29 14:17:28 +01:00
Ines Montani
401946d480 Un-xfail passing tests 2019-12-25 18:02:20 +01:00
Ines Montani
a892821c51 More formatting changes 2019-12-25 17:59:52 +01:00
Ines Montani
c22f075509 Update pydantic version pin [ci skip] 2019-12-25 17:29:53 +01:00
Ines Montani
33a2682d60
Add better schemas and validation using Pydantic (#4831)
* Remove unicode declarations

* Remove Python 3.5 and 2.7 from CI

* Don't require pathlib

* Replace compat helpers

* Remove OrderedDict

* Use f-strings

* Set Cython compiler language level

* Fix typo

* Re-add OrderedDict for Table

* Update setup.cfg

* Revert CONTRIBUTING.md

* Add better schemas and validation using Pydantic

* Revert lookups.md

* Remove unused import

* Update spacy/schemas.py

Co-Authored-By: Sebastián Ramírez <tiangolo@gmail.com>

* Various small fixes

* Fix docstring

Co-authored-by: Sebastián Ramírez <tiangolo@gmail.com>
2019-12-25 12:39:49 +01:00
Ines Montani
db55577c45
Drop Python 2.7 and 3.5 (#4828)
* Remove unicode declarations

* Remove Python 3.5 and 2.7 from CI

* Don't require pathlib

* Replace compat helpers

* Remove OrderedDict

* Use f-strings

* Set Cython compiler language level

* Fix typo

* Re-add OrderedDict for Table

* Update setup.cfg

* Revert CONTRIBUTING.md

* Revert lookups.md

* Revert top-level.md

* Small adjustments and docs [ci skip]
2019-12-22 01:53:56 +01:00
Ines Montani
3431ac42de Fix typo 2019-12-21 21:17:45 +01:00
Ines Montani
21b6d6e0a8 Fix typo 2019-12-21 21:17:31 +01:00
Ines Montani
de33b6d566 Merge branch 'master' into develop 2019-12-21 21:15:46 +01:00
Ines Montani
7c69d30de5 Tidy up and expect warning 2019-12-21 21:14:52 +01:00
Sofie Van Landeghem
732142bf28 facilitate larger training files (#4827)
* add warning for large file and change start var to long

* type for file_length
2019-12-21 21:12:19 +01:00
Ines Montani
d17e7dca9e Fix problems caused by merge conflict 2019-12-21 19:57:41 +01:00
Ines Montani
947dba7141 Merge branch 'master' into develop 2019-12-21 19:04:43 +01:00
Ines Montani
cb4145adc7 Tidy up and auto-format 2019-12-21 19:04:17 +01:00
Ines Montani
158b98a3ef Merge branch 'master' into develop 2019-12-21 18:55:03 +01:00
Olamilekan Wahab
a741de7cf6 Adding support for Yoruba Language (#4614)
* Adding Support for Yoruba

* test text

* Updated test string.

* Fixing encoding declaration.

* Adding encoding to stop_words.py

* Added contributor agreement and removed iranlowo.

* Added removed test files and removed iranlowo to keep project bare.

* Returned CONTRIBUTING.md to default state.

* Added delted conftest entries

* Tidy up and auto-format

* Revert CONTRIBUTING.md

Co-authored-by: Ines Montani <ines@ines.io>
2019-12-21 14:11:50 +01:00
Ines Montani
1b838d1313 Divide models into core and starters [ci skip] 2019-12-21 14:10:22 +01:00
Ines Montani
0750d59e5a Allow setting ner_missing_tag on docs_to_json 2019-12-21 13:47:21 +01:00
Sofie Van Landeghem
8ebbb85117 Documentation for PhraseMatcher constructor (#4826)
* add max_length as argument for init PhraseMatcher

* improve error message too
2019-12-20 23:00:04 +01:00
Sofie Van Landeghem
12158c1e3a Restore tqdm imports (#4804)
* set 4.38.0 to minimal version with color bug fix

* set imports back to proper place

* add upper range for tqdm
2019-12-16 13:12:19 +01:00
Ines Montani
c466e02466 Update universe [ci skip] 2019-12-13 15:57:39 +01:00
Sofie Van Landeghem
557dcf5659 NEL requires sentences to be set (#4801) 2019-12-13 15:55:18 +01:00
tamuhey
1707e77c5e add char_span to Span (#4793) 2019-12-13 15:54:58 +01:00
adrianeboyd
a4cacd3402 Add tag_map argument to CLI debug-data and train (#4750)
Add an argument for a path to a JSON-formatted tag map, which is used to
update and extend the default language tag map.
2019-12-13 10:46:18 +01:00
Sofie Van Landeghem
f9b541f9ef More robust set entities method in KB (#4794)
* add unit test for setting entities with duplicate identifiers

* count the number of actual unique identifiers and throw duplicate warning
2019-12-13 10:45:29 +01:00
Thiago Lages de Alencar
a067ded495 Update doc.md (#4796) 2019-12-11 18:21:40 +01:00