adrianeboyd
90c52128dc
Improve train CLI with base model ( #4911 )
...
Improve train CLI with a provided base model so that you can:
* add a new component
* extend an existing component
* replace an existing component
When the final model and best model are saved, reenable any disabled
components and merge the meta information to include the full pipeline
and accuracy information for all components in the base model plus the
newly added components if needed.
2020-01-16 01:58:51 +01:00
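As a rough illustration of the meta merge described in #4911, here is a minimal sketch using plain dictionaries. It assumes the meta has `pipeline` and `accuracy` keys; the helper name `merge_meta` and the merge order are hypothetical and are not taken from the actual train CLI code.

```python
def merge_meta(base_meta, new_meta):
    """Combine base-model meta with meta from newly trained components.

    Hypothetical helper: a sketch of the idea, not spaCy's implementation.
    """
    merged = dict(new_meta)

    # Keep the base pipeline order and append any newly added components.
    pipeline = list(base_meta.get("pipeline", []))
    for name in new_meta.get("pipeline", []):
        if name not in pipeline:
            pipeline.append(name)
    merged["pipeline"] = pipeline

    # Keep base accuracy figures; newly computed scores take precedence.
    accuracy = dict(base_meta.get("accuracy", {}))
    accuracy.update(new_meta.get("accuracy", {}))
    merged["accuracy"] = accuracy
    return merged


base = {"pipeline": ["tagger", "parser"], "accuracy": {"tags_acc": 97.0, "uas": 91.0}}
new = {"pipeline": ["ner"], "accuracy": {"ents_f": 85.0}}
merged = merge_meta(base, new)
```

In this sketch, a component that exists in both metas is listed once, and its freshly computed scores replace the base scores.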
Bram Vanroy
718704022a
Changes to spacy_conll in universe ( #4914 )
...
* Update information on spacy_conll
* Typo fix
2020-01-16 01:56:39 +01:00
Matthew Honnibal
1785eebfe0
Merge pull request #4909 from svlandeg/bugfix/cnn_window
...
bugfix typo conv_window
2020-01-14 11:23:14 +01:00
svlandeg
ee828d5a9a
bugfix typo conv_window
2020-01-14 09:02:58 +01:00
Sofie Van Landeghem
c70ccd543d
Friendly error warning for NEL example script ( #4881 )
...
* make model positional arg and raise error if no vectors
* small doc fixes
2020-01-14 01:51:14 +01:00
adrianeboyd
d2f3a44b42
Improve train CLI sentrec scoring ( #4892 )
...
* reorder metrics to prioritize F over P/R
* add sentrec to model metrics
2020-01-08 16:52:14 +01:00
adrianeboyd
e55fa1899a
Report length of dev dataset correctly ( #4891 )
2020-01-08 16:51:51 +01:00
adrianeboyd
e1b493ae85
Add sentrec shortcut to Language ( #4890 )
2020-01-08 16:51:24 +01:00
adrianeboyd
d24bca62f6
Add CJK to character classes ( #4884 )
...
* Add CJK character class as uncased
* Incorporate Chinese URL test case
Un-xfail Chinese URL test instance
2020-01-08 16:50:19 +01:00
Preston Badeer
b216ff43c9
Update vectors-similarity.md ( #4889 )
...
These links are broken on the website, due to quotes around the URLs.
2020-01-08 16:49:40 +01:00
adrianeboyd
aef83e8070
Mark most Hungarian tokenizer test cases as slow ( #4883 )
...
* Mark most Hungarian tokenizer test cases as slow
Mark most Hungarian tokenizer test cases as slow to reduce the runtime
of the test suite in ordinary usage:
* for normal tests: run default tests plus 10% of the detailed tests
* for slow tests: run all tests
* Rework to mark individual tests as slow
2020-01-08 12:34:06 +01:00
Sofie Van Landeghem
7b96a5e10f
Reduce mem usage in training Entity Linker ( #4811 )
...
* move nlp processing for el pipe to batch training instead of preprocessing
* adding dev eval back in, and limit in articles instead of entities
* use pipe whenever possible
* few more small doc changes
* access dev data through generator
* tqdm description
* small fixes
* update documentation
2020-01-06 14:59:50 +01:00
Sofie Van Landeghem
6e9b61b49d
add warning in debug_data for punctuation in entities ( #4853 )
2020-01-06 14:59:28 +01:00
adrianeboyd
d652ff215d
Add trailing whitespace to multiline test text ( #4877 )
2020-01-06 14:58:59 +01:00
adrianeboyd
de69bc6509
Fix and improve URL pattern ( #4882 )
...
* match domains longer than `hostname.domain.tld` like `www.foo.co.uk`
* expand allowed characters in domain names while only matching
lowercase TLDs so that "this.That" isn't matched as a URL and can be
split on the period as an infix (relevant for at least English, German,
and Tatar)
2020-01-06 14:58:30 +01:00
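The behaviour described in #4882 can be illustrated with a deliberately simplified regular expression. This is a sketch under stated assumptions, not spaCy's actual URL pattern: `SIMPLE_URL` and `looks_like_url` are hypothetical names, and the pattern only covers an optional `http(s)` scheme, dot-separated labels, a lowercase TLD, and an optional path.

```python
import re

# Hypothetical, simplified pattern for illustration only.
SIMPLE_URL = re.compile(
    r"^(?:https?://)?"        # optional scheme
    r"(?:[A-Za-z0-9-]+\.)+"   # one or more labels, each followed by a dot
    r"[a-z]{2,}"              # TLD: lowercase letters only
    r"(?:/\S*)?$"             # optional path
)


def looks_like_url(token: str) -> bool:
    """Return True if the token matches the simplified URL pattern."""
    return SIMPLE_URL.match(token) is not None


examples = ["www.foo.co.uk", "this.That", "Hostname.Domain.com"]
results = {text: looks_like_url(text) for text in examples}
```

Because the final label must be lowercase, `this.That` is rejected and stays available for infix splitting on the period, while multi-label hosts such as `www.foo.co.uk` still match.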
Sofie Van Landeghem
a1b22e90cd
serialize ENT_ID ( #4852 )
...
* expand serialization test for custom token attribute
* add failing test for issue 4849
* define ENT_ID as attr and use in doc serialization
* fix few typos
2020-01-06 14:57:34 +01:00
Geoffrey Gordon Ashbrook
53929138d7
remove extra word typo ( #4875 )
...
"let you find you"
2020-01-06 12:37:42 +01:00
Ines Montani
400257a802
Update index.md [ci skip]
2020-01-04 01:52:18 +01:00
Sofie Van Landeghem
581eeed98b
Warning goldparse ( #4851 )
...
* label in span not writable anymore
* Revert "label in span not writable anymore"
This reverts commit ab442338c8.
* provide more friendly error msg for parsing file
2020-01-01 13:16:48 +01:00
Ines Montani
83e0a6f3e3
Modernize plac commands for Python 3 ( #4836 )
2020-01-01 13:15:46 +01:00
Al Johri
1aa2d4dac9
stop rendering mathjax by default in displacy ( #4840 )
...
* stop rendering mathjax by default in displacy
* Replace f-string and add comment
Co-authored-by: Ines Montani <ines@ines.io>
2020-01-01 13:15:05 +01:00
Anastasiia Iurshina
db9257559c
Adds script shebang ( #4846 )
2019-12-29 14:25:05 +01:00
Anastasiia Iurshina
1830a12578
Fixes typos ( #4843 )
...
* Fixes typos
* Fixes typo
* Contributor agreement
2019-12-29 14:24:13 +01:00
Ivan Echevarria
ef13e0c038
Add n_process to Language.pipe documentation ( #4842 ) [ci skip]
...
* Add n_process to documentation
* Auto-format and add default [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2019-12-29 14:23:33 +01:00
Al Johri
fd4a7bd2b7
sign contributor agreement for AlJohri ( #4839 ) [ci skip]
2019-12-29 14:17:28 +01:00
Ines Montani
401946d480
Un-xfail passing tests
2019-12-25 18:02:20 +01:00
Ines Montani
a892821c51
More formatting changes
2019-12-25 17:59:52 +01:00
Ines Montani
c22f075509
Update pydantic version pin [ci skip]
2019-12-25 17:29:53 +01:00
Ines Montani
33a2682d60
Add better schemas and validation using Pydantic ( #4831 )
...
* Remove unicode declarations
* Remove Python 3.5 and 2.7 from CI
* Don't require pathlib
* Replace compat helpers
* Remove OrderedDict
* Use f-strings
* Set Cython compiler language level
* Fix typo
* Re-add OrderedDict for Table
* Update setup.cfg
* Revert CONTRIBUTING.md
* Add better schemas and validation using Pydantic
* Revert lookups.md
* Remove unused import
* Update spacy/schemas.py
Co-Authored-By: Sebastián Ramírez <tiangolo@gmail.com>
* Various small fixes
* Fix docstring
Co-authored-by: Sebastián Ramírez <tiangolo@gmail.com>
2019-12-25 12:39:49 +01:00
Ines Montani
db55577c45
Drop Python 2.7 and 3.5 ( #4828 )
...
* Remove unicode declarations
* Remove Python 3.5 and 2.7 from CI
* Don't require pathlib
* Replace compat helpers
* Remove OrderedDict
* Use f-strings
* Set Cython compiler language level
* Fix typo
* Re-add OrderedDict for Table
* Update setup.cfg
* Revert CONTRIBUTING.md
* Revert lookups.md
* Revert top-level.md
* Small adjustments and docs [ci skip]
2019-12-22 01:53:56 +01:00
Ines Montani
3431ac42de
Fix typo
2019-12-21 21:17:45 +01:00
Ines Montani
21b6d6e0a8
Fix typo
2019-12-21 21:17:31 +01:00
Ines Montani
de33b6d566
Merge branch 'master' into develop
2019-12-21 21:15:46 +01:00
Ines Montani
7c69d30de5
Tidy up and expect warning
2019-12-21 21:14:52 +01:00
Sofie Van Landeghem
732142bf28
facilitate larger training files ( #4827 )
...
* add warning for large file and change start var to long
* type for file_length
2019-12-21 21:12:19 +01:00
Ines Montani
d17e7dca9e
Fix problems caused by merge conflict
2019-12-21 19:57:41 +01:00
Ines Montani
947dba7141
Merge branch 'master' into develop
2019-12-21 19:04:43 +01:00
Ines Montani
cb4145adc7
Tidy up and auto-format
2019-12-21 19:04:17 +01:00
Ines Montani
158b98a3ef
Merge branch 'master' into develop
2019-12-21 18:55:03 +01:00
Olamilekan Wahab
a741de7cf6
Adding support for Yoruba Language ( #4614 )
...
* Adding Support for Yoruba
* test text
* Updated test string.
* Fixing encoding declaration.
* Adding encoding to stop_words.py
* Added contributor agreement and removed iranlowo.
* Added removed test files and removed iranlowo to keep project bare.
* Returned CONTRIBUTING.md to default state.
* Added deleted conftest entries
* Tidy up and auto-format
* Revert CONTRIBUTING.md
Co-authored-by: Ines Montani <ines@ines.io>
2019-12-21 14:11:50 +01:00
Ines Montani
1b838d1313
Divide models into core and starters [ci skip]
2019-12-21 14:10:22 +01:00
Ines Montani
0750d59e5a
Allow setting ner_missing_tag on docs_to_json
2019-12-21 13:47:21 +01:00
Sofie Van Landeghem
8ebbb85117
Documentation for PhraseMatcher constructor ( #4826 )
...
* add max_length as argument for init PhraseMatcher
* improve error message too
2019-12-20 23:00:04 +01:00
Sofie Van Landeghem
12158c1e3a
Restore tqdm imports ( #4804 )
...
* set 4.38.0 as minimal version with color bug fix
* set imports back to proper place
* add upper range for tqdm
2019-12-16 13:12:19 +01:00
Ines Montani
c466e02466
Update universe [ci skip]
2019-12-13 15:57:39 +01:00
Sofie Van Landeghem
557dcf5659
NEL requires sentences to be set ( #4801 )
2019-12-13 15:55:18 +01:00
tamuhey
1707e77c5e
add char_span to Span ( #4793 )
2019-12-13 15:54:58 +01:00
adrianeboyd
a4cacd3402
Add tag_map argument to CLI debug-data and train ( #4750 )
...
Add an argument for a path to a JSON-formatted tag map, which is used to
update and extend the default language tag map.
2019-12-13 10:46:18 +01:00
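The "update and extend" behaviour from #4750 can be sketched with plain dictionaries. The helper `merge_tag_map` and the example tags below are hypothetical; the sketch assumes the JSON file maps tag names to attribute dictionaries and that custom entries override defaults with the same tag name.

```python
import json


def merge_tag_map(default_tag_map, tag_map_json):
    """Return the default tag map updated and extended by a JSON tag map.

    Hypothetical helper: custom entries replace defaults with the same
    tag and add any tags that are missing from the defaults.
    """
    merged = dict(default_tag_map)
    merged.update(json.loads(tag_map_json))
    return merged


default_tag_map = {"NN": {"pos": "NOUN"}, "VB": {"pos": "VERB"}}
custom_json = '{"NN": {"pos": "PROPN"}, "XX": {"pos": "X"}}'
tag_map = merge_tag_map(default_tag_map, custom_json)
```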
Sofie Van Landeghem
f9b541f9ef
More robust set entities method in KB ( #4794 )
...
* add unit test for setting entities with duplicate identifiers
* count the number of actual unique identifiers and throw duplicate warning
2019-12-13 10:45:29 +01:00
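The duplicate handling described in #4794 might look roughly like the following. This is a minimal sketch with a hypothetical function name and warning text; it does not reproduce the knowledge base's actual method or warning code.

```python
import warnings


def unique_entities(entity_ids):
    """Return entity identifiers in first-seen order, warning on duplicates.

    Hypothetical helper: counts only the unique identifiers and emits a
    warning for each repeated one instead of failing.
    """
    seen = set()
    unique = []
    for entity_id in entity_ids:
        if entity_id in seen:
            warnings.warn(f"Duplicate entity identifier ignored: {entity_id}")
            continue
        seen.add(entity_id)
        unique.append(entity_id)
    return unique


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    entities = unique_entities(["Q1", "Q2", "Q1"])
    n_warnings = len(caught)
```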
Thiago Lages de Alencar
a067ded495
Update doc.md ( #4796 )
2019-12-11 18:21:40 +01:00