Commit Graph

11255 Commits

Author SHA1 Message Date
adrianeboyd
d652ff215d Add trailing whitespace to multiline test text (#4877) 2020-01-06 14:58:59 +01:00
adrianeboyd
de69bc6509 Fix and improve URL pattern (#4882)
* match domains longer than `hostname.domain.tld` like `www.foo.co.uk`
* expand allowed characters in domain names while only matching
lowercase TLDs so that "this.That" isn't matched as a URL and can be
split on the period as an infix (relevant for at least English, German,
and Tatar)
2020-01-06 14:58:30 +01:00
Sofie Van Landeghem
a1b22e90cd serialize ENT_ID (#4852)
* expand serialization test for custom token attribute

* add failing test for issue 4849

* define ENT_ID as attr and use in doc serialization

* fix few typos
2020-01-06 14:57:34 +01:00
Geoffrey Gordon Ashbrook
53929138d7 remove extra word typo (#4875)
"let you find you"
2020-01-06 12:37:42 +01:00
Ines Montani
400257a802 Update index.md [ci skip] 2020-01-04 01:52:18 +01:00
Sofie Van Landeghem
581eeed98b Warning goldparse (#4851)
* label in span not writable anymore

* Revert "label in span not writable anymore"

This reverts commit ab442338c8.

* provide more friendly error msg for parsing file
2020-01-01 13:16:48 +01:00
Ines Montani
83e0a6f3e3
Modernize plac commands for Python 3 (#4836) 2020-01-01 13:15:46 +01:00
Al Johri
1aa2d4dac9 stop rendering mathjax by default in displacy (#4840)
* stop rendering mathjax by default in displacy

* Replace f-string and add comment

Co-authored-by: Ines Montani <ines@ines.io>
2020-01-01 13:15:05 +01:00
Anastasiia Iurshina
db9257559c Adds script shebang (#4846) 2019-12-29 14:25:05 +01:00
Anastasiia Iurshina
1830a12578 Fixes typos (#4843)
* Fixes typos

* Fixes typo

* Contributor agreement
2019-12-29 14:24:13 +01:00
Ivan Echevarria
ef13e0c038 Add n_process to Language.pipe documentation (#4842) [ci skip]
* Add n_process to documentation

* Auto-format and add default [ci skip]

Co-authored-by: Ines Montani <ines@ines.io>
2019-12-29 14:23:33 +01:00
Al Johri
fd4a7bd2b7 sign contributor agreement for AlJohri (#4839) [ci skip] 2019-12-29 14:17:28 +01:00
Ines Montani
401946d480 Un-xfail passing tests 2019-12-25 18:02:20 +01:00
Ines Montani
a892821c51 More formatting changes 2019-12-25 17:59:52 +01:00
Ines Montani
c22f075509 Update pydantic version pin [ci skip] 2019-12-25 17:29:53 +01:00
Ines Montani
33a2682d60
Add better schemas and validation using Pydantic (#4831)
* Remove unicode declarations

* Remove Python 3.5 and 2.7 from CI

* Don't require pathlib

* Replace compat helpers

* Remove OrderedDict

* Use f-strings

* Set Cython compiler language level

* Fix typo

* Re-add OrderedDict for Table

* Update setup.cfg

* Revert CONTRIBUTING.md

* Add better schemas and validation using Pydantic

* Revert lookups.md

* Remove unused import

* Update spacy/schemas.py

Co-Authored-By: Sebastián Ramírez <tiangolo@gmail.com>

* Various small fixes

* Fix docstring

Co-authored-by: Sebastián Ramírez <tiangolo@gmail.com>
2019-12-25 12:39:49 +01:00
Ines Montani
db55577c45
Drop Python 2.7 and 3.5 (#4828)
* Remove unicode declarations

* Remove Python 3.5 and 2.7 from CI

* Don't require pathlib

* Replace compat helpers

* Remove OrderedDict

* Use f-strings

* Set Cython compiler language level

* Fix typo

* Re-add OrderedDict for Table

* Update setup.cfg

* Revert CONTRIBUTING.md

* Revert lookups.md

* Revert top-level.md

* Small adjustments and docs [ci skip]
2019-12-22 01:53:56 +01:00
Ines Montani
3431ac42de Fix typo 2019-12-21 21:17:45 +01:00
Ines Montani
21b6d6e0a8 Fix typo 2019-12-21 21:17:31 +01:00
Ines Montani
de33b6d566 Merge branch 'master' into develop 2019-12-21 21:15:46 +01:00
Ines Montani
7c69d30de5 Tidy up and expect warning 2019-12-21 21:14:52 +01:00
Sofie Van Landeghem
732142bf28 facilitate larger training files (#4827)
* add warning for large file and change start var to long

* type for file_length
2019-12-21 21:12:19 +01:00
Ines Montani
d17e7dca9e Fix problems caused by merge conflict 2019-12-21 19:57:41 +01:00
Ines Montani
947dba7141 Merge branch 'master' into develop 2019-12-21 19:04:43 +01:00
Ines Montani
cb4145adc7 Tidy up and auto-format 2019-12-21 19:04:17 +01:00
Ines Montani
158b98a3ef Merge branch 'master' into develop 2019-12-21 18:55:03 +01:00
Olamilekan Wahab
a741de7cf6 Adding support for Yoruba Language (#4614)
* Adding Support for Yoruba

* test text

* Updated test string.

* Fixing encoding declaration.

* Adding encoding to stop_words.py

* Added contributor agreement and removed iranlowo.

* Added removed test files and removed iranlowo to keep project bare.

* Returned CONTRIBUTING.md to default state.

* Added delted conftest entries

* Tidy up and auto-format

* Revert CONTRIBUTING.md

Co-authored-by: Ines Montani <ines@ines.io>
2019-12-21 14:11:50 +01:00
Ines Montani
1b838d1313 Divide models into core and starters [ci skip] 2019-12-21 14:10:22 +01:00
Ines Montani
0750d59e5a Allow setting ner_missing_tag on docs_to_json 2019-12-21 13:47:21 +01:00
Sofie Van Landeghem
8ebbb85117 Documentation for PhraseMatcher constructor (#4826)
* add max_length as argument for init PhraseMatcher

* improve error message too
2019-12-20 23:00:04 +01:00
Sofie Van Landeghem
12158c1e3a Restore tqdm imports (#4804)
* set 4.38.0 to minimal version with color bug fix

* set imports back to proper place

* add upper range for tqdm
2019-12-16 13:12:19 +01:00
Ines Montani
c466e02466 Update universe [ci skip] 2019-12-13 15:57:39 +01:00
Sofie Van Landeghem
557dcf5659 NEL requires sentences to be set (#4801) 2019-12-13 15:55:18 +01:00
tamuhey
1707e77c5e add char_span to Span (#4793) 2019-12-13 15:54:58 +01:00
adrianeboyd
a4cacd3402 Add tag_map argument to CLI debug-data and train (#4750)
Add an argument for a path to a JSON-formatted tag map, which is used to
update and extend the default language tag map.
2019-12-13 10:46:18 +01:00
Sofie Van Landeghem
f9b541f9ef More robust set entities method in KB (#4794)
* add unit test for setting entities with duplicate identifiers

* count the number of actual unique identifiers and throw duplicate warning
2019-12-13 10:45:29 +01:00
Thiago Lages de Alencar
a067ded495 Update doc.md (#4796) 2019-12-11 18:21:40 +01:00
adrianeboyd
eb9b1858c4 Add NER map option to convert CLI (#4763)
Instead of a hard-coded NER tag simplification function that was only
intended for NorNE, map NER tags in CoNLL-U converter using a dict
provided as JSON as a command-line option.

Map NER entity types or new tag or to "" for 'O', e.g.:

```
{"PER": "PERSON", "BAD": ""}

=>

B-PER -> B-PERSON
B-BAD -> O
```
2019-12-11 18:20:49 +01:00
Sofie Van Landeghem
5355b0038f Update EL example (#4789)
* update EL example script after sentence-central refactor

* version bump

* set incl_prior to False for quick demo purposes

* clean up
2019-12-11 18:19:42 +01:00
adrianeboyd
38e1bc19f4 Add destructors for states in TransitionSystem (#4686) 2019-12-10 13:23:27 +01:00
Matthew Honnibal
45efdb1ef7 Merge branch 'master' of https://github.com/explosion/spaCy 2019-12-10 00:54:18 +01:00
Matthew Honnibal
0a3175d46f Require thinc v7.4.0.dev0 2019-12-10 00:47:51 +01:00
adrianeboyd
c208eb6e4d Fix int value handling in Matcher (#4749)
Add `int` values (for `LENGTH`) in _get_attr_values() instead of
treating `int` like `dict`.
2019-12-06 19:22:57 +01:00
Tclack88
ab8dc2732c Update token.md (#4767)
* Update token.md

documentation is confusing: A '?' is a right punct, but '¿' is a left punct

* Update token.md

add quotations around parentheses in `is_left_punct` and `is_right_punct` for clarrification, ensuring the question mark that follows is not percieved as an example of left and right punctuation

* Move quotes into code block [ci skip]
2019-12-06 19:22:02 +01:00
Sofie Van Landeghem
780d43aac7 fix bug in EL predict (#4779) 2019-12-06 19:18:14 +01:00
Ines Montani
bf611ebca7 Document jsonl option on converter [ci skip] 2019-12-06 19:17:45 +01:00
Nicolai Bjerre Pedersen
de5453cdcb Fix link to user hooks in docs (#4778)
* Fix link to user hooks in docs

* Update mr_bjerre.md

Mistake in contributor agreement

* Apparently hard to get it right (wrong name of sca)
2019-12-06 19:17:12 +01:00
adrianeboyd
676e75838f Include Doc.cats in serialization of Doc and DocBin (#4774)
* Include Doc.cats in to_bytes()

* Include Doc.cats in DocBin serialization

* Add tests for serialization of cats

Test serialization of cats for Doc and DocBin.
2019-12-06 14:07:39 +01:00
Antti Ajanki
e626a011cc Improvements to the Finnish language data (#4738)
* Enable lex_attrs on Finnish

* Copy the Danish tokenizer rules to Finnish

Specifically, don't break hyphenated compound words

* Contributor agreement

* A new file for Finnish tokenizer rules instead of including the Danish ones
2019-12-03 12:55:28 +01:00
Christoph Purschke
a7ee4b6f17 new tests & tokenization fixes (#4734)
- added some tests for tokenization issues
- fixed some issues with tokenization of words with hyphen infix
- rewrote the "tokenizer_exceptions.py" file (stemming from the German version)
2019-12-01 23:08:21 +01:00