adrianeboyd
e55fa1899a
Report length of dev dataset correctly ( #4891 )
2020-01-08 16:51:51 +01:00
adrianeboyd
e1b493ae85
Add sentrec shortcut to Language ( #4890 )
2020-01-08 16:51:24 +01:00
adrianeboyd
d24bca62f6
Add CJK to character classes ( #4884 )
...
* Add CJK character class as uncased
* Incorporate Chinese URL test case
Un-xfail Chinese URL test instance
2020-01-08 16:50:19 +01:00
Preston Badeer
b216ff43c9
Update vectors-similarity.md ( #4889 )
...
These links are broken on the website, due to quotes around the URLs.
2020-01-08 16:49:40 +01:00
adrianeboyd
aef83e8070
Mark most Hungarian tokenizer test cases as slow ( #4883 )
...
* Mark most Hungarian tokenizer test cases as slow
Mark most Hungarian tokenizer test cases as slow to reduce the runtime
of the test suite in ordinary usage:
* for normal tests: run default tests plus 10% of the detailed tests
* for slow tests: run all tests
* Rework to mark individual tests as slow
2020-01-08 12:34:06 +01:00
Sofie Van Landeghem
7b96a5e10f
Reduce mem usage in training Entity Linker ( #4811 )
...
* move nlp processing for el pipe to batch training instead of preprocessing
* adding dev eval back in, and limit in articles instead of entities
* use pipe whenever possible
* few more small doc changes
* access dev data through generator
* tqdm description
* small fixes
* update documentation
2020-01-06 14:59:50 +01:00
Sofie Van Landeghem
6e9b61b49d
add warning in debug_data for punctuation in entities ( #4853 )
2020-01-06 14:59:28 +01:00
adrianeboyd
d652ff215d
Add trailing whitespace to multiline test text ( #4877 )
2020-01-06 14:58:59 +01:00
adrianeboyd
de69bc6509
Fix and improve URL pattern ( #4882 )
...
* match domains longer than `hostname.domain.tld` like `www.foo.co.uk`
* expand allowed characters in domain names while only matching
lowercase TLDs so that "this.That" isn't matched as a URL and can be
split on the period as an infix (relevant for at least English, German,
and Tatar)
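
As a rough illustration of the lowercase-TLD rule described above, here is a much-simplified sketch. The name `URL_SKETCH`, the helper `looks_like_url`, and the pattern itself are assumptions made for this example; this is not the pattern used in the tokenizer, which is considerably more involved.

```python
import re

# Hypothetical, simplified pattern (NOT the real tokenizer pattern):
# one or more dot-terminated labels with mixed-case characters allowed,
# followed by a TLD that must be lowercase letters only.
URL_SKETCH = re.compile(r"^(?:[A-Za-z0-9\-]+\.)+[a-z]{2,}$")


def looks_like_url(text):
    """Return True if the text matches the simplified domain sketch."""
    return URL_SKETCH.match(text) is not None


print(looks_like_url("www.foo.co.uk"))  # longer than hostname.domain.tld
print(looks_like_url("this.That"))      # capitalized final part, so no match
```

Because "this.That" is not treated as a URL here, a tokenizer would remain free to split it on the period as an infix.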
2020-01-06 14:58:30 +01:00
Sofie Van Landeghem
a1b22e90cd
serialize ENT_ID ( #4852 )
...
* expand serialization test for custom token attribute
* add failing test for issue 4849
* define ENT_ID as attr and use in doc serialization
* fix few typos
2020-01-06 14:57:34 +01:00
Geoffrey Gordon Ashbrook
53929138d7
remove extra word typo ( #4875 )
...
"let you find you"
2020-01-06 12:37:42 +01:00
Ines Montani
400257a802
Update index.md [ci skip]
2020-01-04 01:52:18 +01:00
Sofie Van Landeghem
581eeed98b
Warning goldparse ( #4851 )
...
* label in span not writable anymore
* Revert "label in span not writable anymore"
This reverts commit ab442338c8.
* provide more friendly error msg for parsing file
2020-01-01 13:16:48 +01:00
Ines Montani
83e0a6f3e3
Modernize plac commands for Python 3 ( #4836 )
2020-01-01 13:15:46 +01:00
Al Johri
1aa2d4dac9
stop rendering mathjax by default in displacy ( #4840 )
...
* stop rendering mathjax by default in displacy
* Replace f-string and add comment
Co-authored-by: Ines Montani <ines@ines.io>
2020-01-01 13:15:05 +01:00
Anastasiia Iurshina
db9257559c
Adds script shebang ( #4846 )
2019-12-29 14:25:05 +01:00
Anastasiia Iurshina
1830a12578
Fixes typos ( #4843 )
...
* Fixes typos
* Fixes typo
* Contributor agreement
2019-12-29 14:24:13 +01:00
Ivan Echevarria
ef13e0c038
Add n_process to Language.pipe documentation ( #4842 ) [ci skip]
...
* Add n_process to documentation
* Auto-format and add default [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2019-12-29 14:23:33 +01:00
Al Johri
fd4a7bd2b7
sign contributor agreement for AlJohri ( #4839 ) [ci skip]
2019-12-29 14:17:28 +01:00
Ines Montani
401946d480
Un-xfail passing tests
2019-12-25 18:02:20 +01:00
Ines Montani
a892821c51
More formatting changes
2019-12-25 17:59:52 +01:00
Ines Montani
c22f075509
Update pydantic version pin [ci skip]
2019-12-25 17:29:53 +01:00
Ines Montani
33a2682d60
Add better schemas and validation using Pydantic ( #4831 )
...
* Remove unicode declarations
* Remove Python 3.5 and 2.7 from CI
* Don't require pathlib
* Replace compat helpers
* Remove OrderedDict
* Use f-strings
* Set Cython compiler language level
* Fix typo
* Re-add OrderedDict for Table
* Update setup.cfg
* Revert CONTRIBUTING.md
* Add better schemas and validation using Pydantic
* Revert lookups.md
* Remove unused import
* Update spacy/schemas.py
Co-Authored-By: Sebastián Ramírez <tiangolo@gmail.com>
* Various small fixes
* Fix docstring
Co-authored-by: Sebastián Ramírez <tiangolo@gmail.com>
2019-12-25 12:39:49 +01:00
Ines Montani
db55577c45
Drop Python 2.7 and 3.5 ( #4828 )
...
* Remove unicode declarations
* Remove Python 3.5 and 2.7 from CI
* Don't require pathlib
* Replace compat helpers
* Remove OrderedDict
* Use f-strings
* Set Cython compiler language level
* Fix typo
* Re-add OrderedDict for Table
* Update setup.cfg
* Revert CONTRIBUTING.md
* Revert lookups.md
* Revert top-level.md
* Small adjustments and docs [ci skip]
2019-12-22 01:53:56 +01:00
Ines Montani
3431ac42de
Fix typo
2019-12-21 21:17:45 +01:00
Ines Montani
21b6d6e0a8
Fix typo
2019-12-21 21:17:31 +01:00
Ines Montani
de33b6d566
Merge branch 'master' into develop
2019-12-21 21:15:46 +01:00
Ines Montani
7c69d30de5
Tidy up and expect warning
2019-12-21 21:14:52 +01:00
Sofie Van Landeghem
732142bf28
facilitate larger training files ( #4827 )
...
* add warning for large file and change start var to long
* type for file_length
2019-12-21 21:12:19 +01:00
Ines Montani
d17e7dca9e
Fix problems caused by merge conflict
2019-12-21 19:57:41 +01:00
Ines Montani
947dba7141
Merge branch 'master' into develop
2019-12-21 19:04:43 +01:00
Ines Montani
cb4145adc7
Tidy up and auto-format
2019-12-21 19:04:17 +01:00
Ines Montani
158b98a3ef
Merge branch 'master' into develop
2019-12-21 18:55:03 +01:00
Olamilekan Wahab
a741de7cf6
Adding support for Yoruba Language ( #4614 )
...
* Adding Support for Yoruba
* test text
* Updated test string.
* Fixing encoding declaration.
* Adding encoding to stop_words.py
* Added contributor agreement and removed iranlowo.
* Added removed test files and removed iranlowo to keep project bare.
* Returned CONTRIBUTING.md to default state.
* Added deleted conftest entries
* Tidy up and auto-format
* Revert CONTRIBUTING.md
Co-authored-by: Ines Montani <ines@ines.io>
2019-12-21 14:11:50 +01:00
Ines Montani
1b838d1313
Divide models into core and starters [ci skip]
2019-12-21 14:10:22 +01:00
Ines Montani
0750d59e5a
Allow setting ner_missing_tag on docs_to_json
2019-12-21 13:47:21 +01:00
Sofie Van Landeghem
8ebbb85117
Documentation for PhraseMatcher constructor ( #4826 )
...
* add max_length as argument for init PhraseMatcher
* improve error message too
2019-12-20 23:00:04 +01:00
Sofie Van Landeghem
12158c1e3a
Restore tqdm imports ( #4804 )
...
* set 4.38.0 to minimal version with color bug fix
* set imports back to proper place
* add upper range for tqdm
2019-12-16 13:12:19 +01:00
Ines Montani
c466e02466
Update universe [ci skip]
2019-12-13 15:57:39 +01:00
Sofie Van Landeghem
557dcf5659
NEL requires sentences to be set ( #4801 )
2019-12-13 15:55:18 +01:00
tamuhey
1707e77c5e
add char_span to Span ( #4793 )
2019-12-13 15:54:58 +01:00
adrianeboyd
a4cacd3402
Add tag_map argument to CLI debug-data and train ( #4750 )
...
Add an argument for a path to a JSON-formatted tag map, which is used to
update and extend the default language tag map.
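
The "update and extend" behavior can be pictured as a plain dictionary merge. The sketch below is an assumption about the general idea only: the function name `merge_tag_map` and the example tags are hypothetical, and it takes a JSON string rather than a path to stay self-contained.

```python
import json


def merge_tag_map(default_tag_map, tag_map_json):
    """Return a copy of the default tag map, updated with user-supplied entries.

    Hypothetical helper: existing tags are overridden, new tags are added.
    """
    merged = dict(default_tag_map)           # leave the defaults untouched
    merged.update(json.loads(tag_map_json))  # update existing, extend with new
    return merged


# Made-up example data, for illustration only.
defaults = {"NN": {"pos": "NOUN"}, "VB": {"pos": "VERB"}}
custom = '{"NN": {"pos": "PROPN"}, "XX": {"pos": "X"}}'
print(merge_tag_map(defaults, custom))
```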
2019-12-13 10:46:18 +01:00
Sofie Van Landeghem
f9b541f9ef
More robust set entities method in KB ( #4794 )
...
* add unit test for setting entities with duplicate identifiers
* count the number of actual unique identifiers and throw duplicate warning
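
A minimal sketch of the duplicate check described above, assuming entity identifiers are plain strings. The function name `count_unique_entities` and the warning text are invented for this example and do not reflect the actual KB code.

```python
import warnings


def count_unique_entities(entity_ids):
    """Count distinct identifiers and warn if the input contains duplicates.

    Hypothetical helper illustrating the idea, not the KB implementation.
    """
    unique = set(entity_ids)
    if len(unique) < len(entity_ids):
        warnings.warn("Duplicate entity identifiers found; each is stored once.")
    return len(unique)


print(count_unique_entities(["Q1", "Q2", "Q1"]))
```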
2019-12-13 10:45:29 +01:00
Thiago Lages de Alencar
a067ded495
Update doc.md ( #4796 )
2019-12-11 18:21:40 +01:00
adrianeboyd
eb9b1858c4
Add NER map option to convert CLI ( #4763 )
...
Instead of a hard-coded NER tag simplification function that was only
intended for NorNE, map NER tags in CoNLL-U converter using a dict
provided as JSON as a command-line option.
Map NER entity types to a new tag or to "" for 'O', e.g.:
```
{"PER": "PERSON", "BAD": ""}
=>
B-PER -> B-PERSON
B-BAD -> O
```
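
The mapping above could be applied to individual tags roughly as follows. This is a sketch under assumptions: `map_ner_tag` is a hypothetical helper, not the converter's function, and leaving unmapped entity types unchanged is an assumption rather than something the commit message states.

```python
import json


def map_ner_tag(tag, ner_map):
    """Apply an entity-type mapping to a single BIO-style NER tag.

    Hypothetical helper. An empty string in the map means the tag becomes 'O'.
    Entity types missing from the map are kept as-is (an assumption).
    """
    if "-" not in tag:
        return tag  # 'O' or any tag without a prefix is left alone
    prefix, label = tag.split("-", 1)
    new_label = ner_map.get(label, label)
    if new_label == "":
        return "O"
    return f"{prefix}-{new_label}"


ner_map = json.loads('{"PER": "PERSON", "BAD": ""}')
print(map_ner_tag("B-PER", ner_map))
print(map_ner_tag("B-BAD", ner_map))
```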
2019-12-11 18:20:49 +01:00
Sofie Van Landeghem
5355b0038f
Update EL example ( #4789 )
...
* update EL example script after sentence-central refactor
* version bump
* set incl_prior to False for quick demo purposes
* clean up
2019-12-11 18:19:42 +01:00
adrianeboyd
38e1bc19f4
Add destructors for states in TransitionSystem ( #4686 )
2019-12-10 13:23:27 +01:00
Matthew Honnibal
45efdb1ef7
Merge branch 'master' of https://github.com/explosion/spaCy
2019-12-10 00:54:18 +01:00
Matthew Honnibal
0a3175d46f
Require thinc v7.4.0.dev0
2019-12-10 00:47:51 +01:00
adrianeboyd
c208eb6e4d
Fix int value handling in Matcher ( #4749 )
...
Add `int` values (for `LENGTH`) in _get_attr_values() instead of
treating `int` like `dict`.
2019-12-06 19:22:57 +01:00