Sofie Van Landeghem
6e9b61b49d
add warning in debug_data for punctuation in entities ( #4853 )
2020-01-06 14:59:28 +01:00
adrianeboyd
d652ff215d
Add trailing whitespace to multiline test text ( #4877 )
2020-01-06 14:58:59 +01:00
adrianeboyd
de69bc6509
Fix and improve URL pattern ( #4882 )
...
* match domains longer than `hostname.domain.tld` like `www.foo.co.uk`
* expand allowed characters in domain names while only matching
lowercase TLDs so that "this.That" isn't matched as a URL and can be
split on the period as an infix (relevant for at least English, German,
and Tatar)
2020-01-06 14:58:30 +01:00
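Note on #4882: the behaviour described in that commit message can be illustrated with a deliberately simplified regex. This is a sketch only and not the pattern spaCy ships; the names `URL_SKETCH` and `looks_like_url` are hypothetical. It allows mixed-case domain labels and any number of them, but requires the final label (the TLD) to be lowercase.

```python
import re

# Simplified illustration only; NOT the actual spaCy URL pattern.
URL_SKETCH = re.compile(
    r"^(?:[A-Za-z0-9-]+\.)+"  # one or more dot-terminated labels, mixed case allowed
    r"[a-z]{2,}$"             # final label (TLD) must be lowercase letters
)


def looks_like_url(text):
    """Return True if `text` has the shape label(.label)*.tld with a lowercase TLD."""
    return URL_SKETCH.match(text) is not None


print(looks_like_url("www.foo.co.uk"))  # more labels than hostname.domain.tld
print(looks_like_url("this.That"))      # capitalised final label, left for infix splitting
```

Under this sketch `www.foo.co.uk` is accepted while `this.That` is rejected, so a tokenizer would remain free to split the latter on the period.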
Sofie Van Landeghem
a1b22e90cd
serialize ENT_ID ( #4852 )
...
* expand serialization test for custom token attribute
* add failing test for issue 4849
* define ENT_ID as attr and use in doc serialization
* fix few typos
2020-01-06 14:57:34 +01:00
Geoffrey Gordon Ashbrook
53929138d7
remove extra word typo ( #4875 )
...
"let you find you"
2020-01-06 12:37:42 +01:00
Ines Montani
400257a802
Update index.md [ci skip]
2020-01-04 01:52:18 +01:00
Sofie Van Landeghem
581eeed98b
Warning goldparse ( #4851 )
...
* label in span not writable anymore
* Revert "label in span not writable anymore"
This reverts commit ab442338c8.
* provide more friendly error msg for parsing file
2020-01-01 13:16:48 +01:00
Ines Montani
83e0a6f3e3
Modernize plac commands for Python 3 ( #4836 )
2020-01-01 13:15:46 +01:00
Al Johri
1aa2d4dac9
stop rendering mathjax by default in displacy ( #4840 )
...
* stop rendering mathjax by default in displacy
* Replace f-string and add comment
Co-authored-by: Ines Montani <ines@ines.io>
2020-01-01 13:15:05 +01:00
Anastasiia Iurshina
db9257559c
Adds script shebang ( #4846 )
2019-12-29 14:25:05 +01:00
Anastasiia Iurshina
1830a12578
Fixes typos ( #4843 )
...
* Fixes typos
* Fixes typo
* Contributor agreement
2019-12-29 14:24:13 +01:00
Ivan Echevarria
ef13e0c038
Add n_process to Language.pipe documentation ( #4842 ) [ci skip]
...
* Add n_process to documentation
* Auto-format and add default [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
2019-12-29 14:23:33 +01:00
Al Johri
fd4a7bd2b7
sign contributor agreement for AlJohri ( #4839 ) [ci skip]
2019-12-29 14:17:28 +01:00
Ines Montani
401946d480
Un-xfail passing tests
2019-12-25 18:02:20 +01:00
Ines Montani
a892821c51
More formatting changes
2019-12-25 17:59:52 +01:00
Ines Montani
c22f075509
Update pydantic version pin [ci skip]
2019-12-25 17:29:53 +01:00
Ines Montani
33a2682d60
Add better schemas and validation using Pydantic ( #4831 )
...
* Remove unicode declarations
* Remove Python 3.5 and 2.7 from CI
* Don't require pathlib
* Replace compat helpers
* Remove OrderedDict
* Use f-strings
* Set Cython compiler language level
* Fix typo
* Re-add OrderedDict for Table
* Update setup.cfg
* Revert CONTRIBUTING.md
* Add better schemas and validation using Pydantic
* Revert lookups.md
* Remove unused import
* Update spacy/schemas.py
Co-Authored-By: Sebastián Ramírez <tiangolo@gmail.com>
* Various small fixes
* Fix docstring
Co-authored-by: Sebastián Ramírez <tiangolo@gmail.com>
2019-12-25 12:39:49 +01:00
Ines Montani
db55577c45
Drop Python 2.7 and 3.5 ( #4828 )
...
* Remove unicode declarations
* Remove Python 3.5 and 2.7 from CI
* Don't require pathlib
* Replace compat helpers
* Remove OrderedDict
* Use f-strings
* Set Cython compiler language level
* Fix typo
* Re-add OrderedDict for Table
* Update setup.cfg
* Revert CONTRIBUTING.md
* Revert lookups.md
* Revert top-level.md
* Small adjustments and docs [ci skip]
2019-12-22 01:53:56 +01:00
Ines Montani
3431ac42de
Fix typo
2019-12-21 21:17:45 +01:00
Ines Montani
21b6d6e0a8
Fix typo
2019-12-21 21:17:31 +01:00
Ines Montani
de33b6d566
Merge branch 'master' into develop
2019-12-21 21:15:46 +01:00
Ines Montani
7c69d30de5
Tidy up and expect warning
2019-12-21 21:14:52 +01:00
Sofie Van Landeghem
732142bf28
facilitate larger training files ( #4827 )
...
* add warning for large file and change start var to long
* type for file_length
2019-12-21 21:12:19 +01:00
Ines Montani
d17e7dca9e
Fix problems caused by merge conflict
2019-12-21 19:57:41 +01:00
Ines Montani
947dba7141
Merge branch 'master' into develop
2019-12-21 19:04:43 +01:00
Ines Montani
cb4145adc7
Tidy up and auto-format
2019-12-21 19:04:17 +01:00
Ines Montani
158b98a3ef
Merge branch 'master' into develop
2019-12-21 18:55:03 +01:00
Olamilekan Wahab
a741de7cf6
Adding support for Yoruba Language ( #4614 )
...
* Adding Support for Yoruba
* test text
* Updated test string.
* Fixing encoding declaration.
* Adding encoding to stop_words.py
* Added contributor agreement and removed iranlowo.
* Added removed test files and removed iranlowo to keep project bare.
* Returned CONTRIBUTING.md to default state.
* Added deleted conftest entries
* Tidy up and auto-format
* Revert CONTRIBUTING.md
Co-authored-by: Ines Montani <ines@ines.io>
2019-12-21 14:11:50 +01:00
Ines Montani
1b838d1313
Divide models into core and starters [ci skip]
2019-12-21 14:10:22 +01:00
Ines Montani
0750d59e5a
Allow setting ner_missing_tag on docs_to_json
2019-12-21 13:47:21 +01:00
Sofie Van Landeghem
8ebbb85117
Documentation for PhraseMatcher constructor ( #4826 )
...
* add max_length as argument for init PhraseMatcher
* improve error message too
2019-12-20 23:00:04 +01:00
Sofie Van Landeghem
12158c1e3a
Restore tqdm imports ( #4804 )
...
* set 4.38.0 to minimal version with color bug fix
* set imports back to proper place
* add upper range for tqdm
2019-12-16 13:12:19 +01:00
Ines Montani
c466e02466
Update universe [ci skip]
2019-12-13 15:57:39 +01:00
Sofie Van Landeghem
557dcf5659
NEL requires sentences to be set ( #4801 )
2019-12-13 15:55:18 +01:00
tamuhey
1707e77c5e
add char_span to Span ( #4793 )
2019-12-13 15:54:58 +01:00
adrianeboyd
a4cacd3402
Add tag_map argument to CLI debug-data and train ( #4750 )
...
Add an argument for a path to a JSON-formatted tag map, which is used to
update and extend the default language tag map.
2019-12-13 10:46:18 +01:00
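Note on #4750: "update and extend the default language tag map" can be read as a plain dictionary update. The sketch below assumes that reading; `merge_tag_map` and the example tags are hypothetical and are not taken from the spaCy CLI code. It takes the JSON as a string so the example stays self-contained, whereas the CLI argument is a path to a JSON file.

```python
import json


def merge_tag_map(default_tag_map, tag_map_json):
    """Return a copy of the default tag map, updated and extended from JSON text."""
    merged = dict(default_tag_map)           # leave the language default untouched
    merged.update(json.loads(tag_map_json))  # override existing tags, add new ones
    return merged


# Hypothetical tag maps, for illustration only.
default = {"NN": {"pos": "NOUN"}, "VB": {"pos": "VERB"}}
custom = '{"NN": {"pos": "PROPN"}, "XX": {"pos": "X"}}'
print(merge_tag_map(default, custom))
```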
Sofie Van Landeghem
f9b541f9ef
More robust set entities method in KB ( #4794 )
...
* add unit test for setting entities with duplicate identifiers
* count the number of actual unique identifiers and throw duplicate warning
2019-12-13 10:45:29 +01:00
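Note on #4794: the second bullet of that commit (count the unique identifiers, warn on duplicates) could look roughly like the following. This is a minimal sketch using only the standard library; `count_unique_entities` and the warning text are hypothetical and do not reproduce the KB implementation.

```python
import warnings


def count_unique_entities(entity_ids):
    """Count distinct identifiers and warn if the input contained duplicates."""
    unique = set(entity_ids)
    if len(unique) < len(entity_ids):
        # Hypothetical message; the real warning text is not shown in the log.
        warnings.warn("Duplicate entity identifiers found; keeping one entry per ID.")
    return len(unique)


print(count_unique_entities(["Q1", "Q2", "Q3"]))
```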
Thiago Lages de Alencar
a067ded495
Update doc.md ( #4796 )
2019-12-11 18:21:40 +01:00
adrianeboyd
eb9b1858c4
Add NER map option to convert CLI ( #4763 )
...
Instead of a hard-coded NER tag simplification function that was only
intended for NorNE, map NER tags in CoNLL-U converter using a dict
provided as JSON as a command-line option.
Map NER entity types to a new tag, or to "" for 'O', e.g.:
```
{"PER": "PERSON", "BAD": ""}
=>
B-PER -> B-PERSON
B-BAD -> O
```
2019-12-11 18:20:49 +01:00
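Note on #4763: the mapping shown in that commit message can be sketched as a small function over BIO-style tags. This is an illustration under assumptions, not the converter's code; `map_ner_tag` is a hypothetical name, and tags whose entity type is absent from the map are assumed to pass through unchanged.

```python
def map_ner_tag(tag, ner_map):
    """Rewrite the entity type of a BIO-style tag using `ner_map`.

    A type mapped to "" collapses the whole tag to "O".
    """
    if tag == "O" or "-" not in tag:
        return tag
    prefix, label = tag.split("-", 1)
    if label not in ner_map:
        return tag  # assumption: unmapped types are left as they are
    new_label = ner_map[label]
    if new_label == "":
        return "O"
    return f"{prefix}-{new_label}"


ner_map = {"PER": "PERSON", "BAD": ""}
print([map_ner_tag(t, ner_map) for t in ["B-PER", "B-BAD", "I-LOC", "O"]])
```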
Sofie Van Landeghem
5355b0038f
Update EL example ( #4789 )
...
* update EL example script after sentence-central refactor
* version bump
* set incl_prior to False for quick demo purposes
* clean up
2019-12-11 18:19:42 +01:00
adrianeboyd
38e1bc19f4
Add destructors for states in TransitionSystem ( #4686 )
2019-12-10 13:23:27 +01:00
Matthew Honnibal
45efdb1ef7
Merge branch 'master' of https://github.com/explosion/spaCy
2019-12-10 00:54:18 +01:00
Matthew Honnibal
0a3175d46f
Require thinc v7.4.0.dev0
2019-12-10 00:47:51 +01:00
adrianeboyd
c208eb6e4d
Fix int value handling in Matcher ( #4749 )
...
Add `int` values (for `LENGTH`) in _get_attr_values() instead of
treating `int` like `dict`.
2019-12-06 19:22:57 +01:00
Tclack88
ab8dc2732c
Update token.md ( #4767 )
...
* Update token.md
The documentation is confusing: a '?' is a right punct, but '¿' is a left punct
* Update token.md
Add quotation marks around the parentheses in `is_left_punct` and `is_right_punct` for clarification, ensuring the question mark that follows is not perceived as an example of left and right punctuation
* Move quotes into code block [ci skip]
2019-12-06 19:22:02 +01:00
Sofie Van Landeghem
780d43aac7
fix bug in EL predict ( #4779 )
2019-12-06 19:18:14 +01:00
Ines Montani
bf611ebca7
Document jsonl option on converter [ci skip]
2019-12-06 19:17:45 +01:00
Nicolai Bjerre Pedersen
de5453cdcb
Fix link to user hooks in docs ( #4778 )
...
* Fix link to user hooks in docs
* Update mr_bjerre.md
Mistake in contributor agreement
* Apparently hard to get it right (wrong name of sca)
2019-12-06 19:17:12 +01:00
adrianeboyd
676e75838f
Include Doc.cats in serialization of Doc and DocBin ( #4774 )
...
* Include Doc.cats in to_bytes()
* Include Doc.cats in DocBin serialization
* Add tests for serialization of cats
Test serialization of cats for Doc and DocBin.
2019-12-06 14:07:39 +01:00
Antti Ajanki
e626a011cc
Improvements to the Finnish language data ( #4738 )
...
* Enable lex_attrs on Finnish
* Copy the Danish tokenizer rules to Finnish
Specifically, don't break hyphenated compound words
* Contributor agreement
* A new file for Finnish tokenizer rules instead of including the Danish ones
2019-12-03 12:55:28 +01:00