6607 Commits

Author SHA1 Message Date
adrianeboyd
d24bca62f6 Add CJK to character classes (#4884)
* Add CJK character class as uncased

* Incorporate Chinese URL test case

Un-xfail Chinese URL test instance
2020-01-08 16:50:19 +01:00
adrianeboyd
aef83e8070 Mark most Hungarian tokenizer test cases as slow (#4883)
* Mark most Hungarian tokenizer test cases as slow

Mark most Hungarian tokenizer test cases as slow to reduce the runtime
of the test suite in ordinary usage:

* for normal tests: run default tests plus 10% of the detailed tests
* for slow tests: run all tests

* Rework to mark individual tests as slow
2020-01-08 12:34:06 +01:00
Sofie Van Landeghem
7b96a5e10f Reduce mem usage in training Entity Linker (#4811)
* move nlp processing for el pipe to batch training instead of preprocessing

* adding dev eval back in, and limit in articles instead of entities

* use pipe whenever possible

* few more small doc changes

* access dev data through generator

* tqdm description

* small fixes

* update documentation
2020-01-06 14:59:50 +01:00
Sofie Van Landeghem
6e9b61b49d add warning in debug_data for punctuation in entities (#4853) 2020-01-06 14:59:28 +01:00
adrianeboyd
d652ff215d Add trailing whitespace to multiline test text (#4877) 2020-01-06 14:58:59 +01:00
adrianeboyd
de69bc6509 Fix and improve URL pattern (#4882)
* match domains longer than `hostname.domain.tld` like `www.foo.co.uk`
* expand allowed characters in domain names while only matching
lowercase TLDs so that "this.That" isn't matched as a URL and can be
split on the period as an infix (relevant for at least English, German,
and Tatar)
2020-01-06 14:58:30 +01:00
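The two rules in the commit above (allow more than three dot-separated labels, but accept only lowercase TLDs) can be illustrated with a deliberately simplified regular expression. This is a sketch only, not the pattern the project ships: the name `SIMPLE_URL`, the helper `looks_like_url`, and the character classes are assumptions chosen for illustration.

```python
import re

# Hypothetical, simplified pattern (NOT the project's real URL pattern):
# one or more dot-terminated labels in which mixed case is allowed,
# followed by a lowercase-only TLD of at least two letters.
SIMPLE_URL = re.compile(r"^(?:[A-Za-z0-9-]+\.)+[a-z]{2,}$")


def looks_like_url(text):
    """Return True if the whole string matches the simplified pattern."""
    return SIMPLE_URL.match(text) is not None


# A domain longer than hostname.domain.tld is accepted.
print(looks_like_url("www.foo.co.uk"))
# A capitalised final label is rejected, so a tokenizer would remain free
# to split "this.That" on the period as an infix.
print(looks_like_url("this.That"))
```

Because the pattern is compiled without `re.IGNORECASE`, the final `[a-z]{2,}` group is what keeps sentence-internal periods such as "this.That" from being treated as URLs.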
Sofie Van Landeghem
a1b22e90cd serialize ENT_ID (#4852)
* expand serialization test for custom token attribute

* add failing test for issue 4849

* define ENT_ID as attr and use in doc serialization

* fix few typos
2020-01-06 14:57:34 +01:00
Al Johri
1aa2d4dac9 stop rendering mathjax by default in displacy (#4840)
* stop rendering mathjax by default in displacy

* Replace f-string and add comment

Co-authored-by: Ines Montani <ines@ines.io>
2020-01-01 13:15:05 +01:00
Anastasiia Iurshina
1830a12578 Fixes typos (#4843)
* Fixes typos

* Fixes typo

* Contributor agreement
2019-12-29 14:24:13 +01:00
Ivan Echevarria
ef13e0c038 Add n_process to Language.pipe documentation (#4842) [ci skip]
* Add n_process to documentation

* Auto-format and add default [ci skip]

Co-authored-by: Ines Montani <ines@ines.io>
2019-12-29 14:23:33 +01:00
Ines Montani
3431ac42de Fix typo 2019-12-21 21:17:45 +01:00
Ines Montani
7c69d30de5 Tidy up and expect warning 2019-12-21 21:14:52 +01:00
Sofie Van Landeghem
732142bf28 facilitate larger training files (#4827)
* add warning for large file and change start var to long

* type for file_length
2019-12-21 21:12:19 +01:00
Ines Montani
cb4145adc7 Tidy up and auto-format 2019-12-21 19:04:17 +01:00
Olamilekan Wahab
a741de7cf6 Adding support for Yoruba Language (#4614)
* Adding Support for Yoruba

* test text

* Updated test string.

* Fixing encoding declaration.

* Adding encoding to stop_words.py

* Added contributor agreement and removed iranlowo.

* Added removed test files and removed iranlowo to keep project bare.

* Returned CONTRIBUTING.md to default state.

* Added deleted conftest entries

* Tidy up and auto-format

* Revert CONTRIBUTING.md

Co-authored-by: Ines Montani <ines@ines.io>
2019-12-21 14:11:50 +01:00
Ines Montani
0750d59e5a Allow setting ner_missing_tag on docs_to_json 2019-12-21 13:47:21 +01:00
Sofie Van Landeghem
8ebbb85117 Documentation for PhraseMatcher constructor (#4826)
* add max_length as argument for init PhraseMatcher

* improve error message too
2019-12-20 23:00:04 +01:00
Sofie Van Landeghem
12158c1e3a Restore tqdm imports (#4804)
* set 4.38.0 to minimal version with color bug fix

* set imports back to proper place

* add upper range for tqdm
2019-12-16 13:12:19 +01:00
Sofie Van Landeghem
557dcf5659 NEL requires sentences to be set (#4801) 2019-12-13 15:55:18 +01:00
tamuhey
1707e77c5e add char_span to Span (#4793) 2019-12-13 15:54:58 +01:00
Sofie Van Landeghem
f9b541f9ef More robust set entities method in KB (#4794)
* add unit test for setting entities with duplicate identifiers

* count the number of actual unique identifiers and throw duplicate warning
2019-12-13 10:45:29 +01:00
Sofie Van Landeghem
5355b0038f Update EL example (#4789)
* update EL example script after sentence-central refactor

* version bump

* set incl_prior to False for quick demo purposes

* clean up
2019-12-11 18:19:42 +01:00
adrianeboyd
38e1bc19f4 Add destructors for states in TransitionSystem (#4686) 2019-12-10 13:23:27 +01:00
adrianeboyd
c208eb6e4d Fix int value handling in Matcher (#4749)
Add `int` values (for `LENGTH`) in _get_attr_values() instead of
treating `int` like `dict`.
2019-12-06 19:22:57 +01:00
Sofie Van Landeghem
780d43aac7 fix bug in EL predict (#4779) 2019-12-06 19:18:14 +01:00
adrianeboyd
676e75838f Include Doc.cats in serialization of Doc and DocBin (#4774)
* Include Doc.cats in to_bytes()

* Include Doc.cats in DocBin serialization

* Add tests for serialization of cats

Test serialization of cats for Doc and DocBin.
2019-12-06 14:07:39 +01:00
Antti Ajanki
e626a011cc Improvements to the Finnish language data (#4738)
* Enable lex_attrs on Finnish

* Copy the Danish tokenizer rules to Finnish

Specifically, don't break hyphenated compound words

* Contributor agreement

* A new file for Finnish tokenizer rules instead of including the Danish ones
2019-12-03 12:55:28 +01:00
Christoph Purschke
a7ee4b6f17 new tests & tokenization fixes (#4734)
- added some tests for tokenization issues
- fixed some issues with tokenization of words with hyphen infix
- rewrote the "tokenizer_exceptions.py" file (stemming from the German version)
2019-12-01 23:08:21 +01:00
adrianeboyd
48ea2e8d0f Restructure Sentencizer to follow Pipe API (#4721)
* Restructure Sentencizer to follow Pipe API

Restructure Sentencizer to follow Pipe API so that it can be scored with
`nlp.evaluate()`.

* Add Sentencizer pipe() test
2019-11-27 16:33:34 +01:00
Jari Bakken
16cb19e960 update nb tag_map (#4711) 2019-11-25 21:26:26 +01:00
Ines Montani
5b36dec7eb Auto-exclude disabled when calling from_disk during load (#4708) 2019-11-25 16:01:22 +01:00
Ines Montani
2160ecfc92 Fix typo [ci skip] 2019-11-25 13:08:19 +01:00
adrianeboyd
2d8c6e1124 Iterate over lr_edges until sents are correct (#4702)
Iterate over lr_edges until all heads are within the current sentence.
Instead of iterating over them for a fixed number of iterations, check
whether the sentence boundaries are correct for the heads and stop when
all are correct. Stop after a maximum of 10 iterations, providing a
warning in this case since the sentence boundaries may not be correct.
2019-11-25 13:06:36 +01:00
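The control flow described in the commit above (repeat a repair step until a correctness check passes, give up after 10 iterations and warn) can be sketched generically. This is not the project's parser code: `repair_until_stable`, `is_sorted`, and `bubble_pass` are hypothetical names, and a list of integers stands in for the sentence-boundary state.

```python
import warnings

MAX_ITERATIONS = 10  # cap taken from the commit message above


def repair_until_stable(state, is_correct, repair_step, max_iter=MAX_ITERATIONS):
    """Apply repair_step until is_correct(state) holds or max_iter is reached."""
    for _ in range(max_iter):
        if is_correct(state):
            return state
        state = repair_step(state)
    # The cap was reached: check once more and warn if the state is still wrong.
    if not is_correct(state):
        warnings.warn("State may not be correct after %d iterations" % max_iter)
    return state


def is_sorted(values):
    """Toy correctness check standing in for 'all heads are within the sentence'."""
    return all(a <= b for a, b in zip(values, values[1:]))


def bubble_pass(values):
    """Toy repair step: one left-to-right pass of adjacent swaps."""
    values = list(values)
    for i in range(len(values) - 1):
        if values[i] > values[i + 1]:
            values[i], values[i + 1] = values[i + 1], values[i]
    return values


print(repair_until_stable([3, 2, 1], is_sorted, bubble_pass))
```

Checking the condition on every pass, rather than running a fixed number of passes, lets easy inputs exit early while the cap still bounds the worst case.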
Matt Maybeno
c9f1e99787 Agnostic vocab array fix (#4680)
* Use get_array_module instead of numpy

* add contributor agreement
2019-11-23 14:59:52 +01:00
adrianeboyd
46250f60ac Add missing tags to el/es/pt tag maps (#4696)
* Add missing tags to pt tag map

* Add missing tags to es tag map

* Add missing tags to el tag map

* Add missing symbol in el tag map
2019-11-23 14:57:21 +01:00
Paul O'Leary McCann
f0e3e606a6 Replace mecab-python3 with fugashi for Japanese (#4621)
* Switch from mecab-python3 to fugashi

mecab-python3 has been the best MeCab binding for a long time but it's
not very actively maintained, and since it's based on old SWIG code
distributed with MeCab there's a limit to how effectively it can be
maintained.

Fugashi is a new Cython-based MeCab wrapper I wrote. Since it's not
based on the old SWIG code it's easier to keep it current and make small
deviations from the MeCab C/C++ API where that makes sense.

* Change mecab-python3 to fugashi in setup.cfg

* Change "mecab tags" to "unidic tags"

The tags come from MeCab, but the tag schema is specified by Unidic, so
it's more proper to refer to it that way.

* Update conftest

* Add fugashi link to external deps list for Japanese
2019-11-23 14:31:04 +01:00
Ines Montani
a0fb1acb10 Update version [ci skip] 2019-11-21 18:19:37 +01:00
Ines Montani
b570d5d2ed Increment version [ci skip] 2019-11-21 17:02:32 +01:00
Matthew Honnibal
50f89cb85d Make vectors.find() return keys in correct order (#4691)
* Make vectors.find() return keys in correct order

* Update spacy/vectors.pyx
2019-11-21 16:58:32 +01:00
Ines Montani
5d4eede1e4 Fix test util imports 2019-11-21 16:28:29 +01:00
GuiGel
8f7ab70870 Bugfix/fix entity ruler from disk (#4670)
* fix EntityRuler from_disk bug

* add contributor file

* Test EntityRuler PhraseMatcher deserialization (#4651)

* newline at end of file

* fix copy paste error

* serializing the EntityRuler by itself

* Add unicode declarations for Python 2 and auto-format
2019-11-21 16:26:37 +01:00
adrianeboyd
054df5d90a Add error for non-string labels (#4690)
Add error when attempting to add non-string labels to `Tagger` or
`TextCategorizer`.
2019-11-21 16:24:10 +01:00
adrianeboyd
d7f32b285c Detect more empty matches in tokenizer.explain() (#4675)
* Detect more empty matches in tokenizer.explain()

* Include a few languages in explain non-slow tests

Mark a few languages in tokenizer.explain() tests as not slow so they're
run by default.
2019-11-20 16:31:29 +01:00
Ines Montani
5bf9ab5b03 Tidy up and auto-format 2019-11-20 13:16:33 +01:00
Ines Montani
7f3b00164a Re-add slow marker 2019-11-20 13:15:59 +01:00
Ines Montani
6e303de717 Auto-format 2019-11-20 13:15:24 +01:00
Ines Montani
2e7c896fe5 Update Tokenizer.explain tests 2019-11-20 13:14:11 +01:00
adrianeboyd
2c876eb672 Add tokenizer explain() debugging method (#4596)
* Expose tokenizer rules as a property

Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)

Add tests and update Tokenizer API docs.

* Update Hungarian punctuation to remove empty string

Update Hungarian punctuation definitions so that `_units` does not match
an empty string.

* Use _load_special_tokenization consistently

Use `_load_special_tokenization()` and have it handle `None` checks.

* Fix precedence of `token_match` vs. special cases

Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.

* Add `make_debug_doc()` to the Tokenizer

Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.

Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.

* Update tokenization usage docs

Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.

Add more examples for customizing tokenizers while preserving the
existing defaults.

Minor edits / clarifications.

* Revert "Update Hungarian punctuation to remove empty string"

This reverts commit f0a577f7a5.

* Rework `make_debug_doc()` as `explain()`

Rework `make_debug_doc()` as `explain()`, which returns a list of
`(pattern_string, token_string)` tuples rather than a non-standard
`Doc`. Update docs and tests accordingly, leaving the visualization for
future work.

* Handle cases with bad tokenizer patterns

Detect when tokenizer patterns match empty prefixes and suffixes so that
`explain()` does not hang on bad patterns.

* Remove unused displacy image

* Add tokenizer.explain() to usage docs
2019-11-20 13:07:25 +01:00
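Two ideas from the commit above can be shown with a toy, standard-library-only tokenizer: returning a list of `(pattern_string, token_string)` tuples, and stopping when a prefix or suffix pattern matches an empty string so that the loop cannot hang. This is a sketch under stated assumptions, not the library's implementation: `toy_explain`, `TOY_PREFIX`, `TOY_SUFFIX`, and the labels `"PREFIX"`, `"SUFFIX"`, `"TOKEN"` are all invented for illustration.

```python
import re

# Toy rules for illustration only; these are not the library's defaults.
# Prefix patterns are assumed to be anchored at the start of the string,
# suffix patterns at the end.
TOY_PREFIX = re.compile(r"^[(\[{]")
TOY_SUFFIX = re.compile(r"[)\]}.,!?]$")


def toy_explain(text, prefix=TOY_PREFIX, suffix=TOY_SUFFIX):
    """Return (pattern_string, token_string) tuples for whitespace-split text."""
    explained = []
    for chunk in text.split():
        suffixes = []
        while chunk:
            match = prefix.search(chunk)
            # An empty match would never shorten the chunk, so stop here
            # instead of looping forever on a bad pattern.
            if match is None or match.end() == 0:
                break
            explained.append(("PREFIX", chunk[: match.end()]))
            chunk = chunk[match.end():]
        while chunk:
            match = suffix.search(chunk)
            # Same guard for suffixes: an empty match sits at the very end.
            if match is None or match.start() == len(chunk):
                break
            suffixes.append(("SUFFIX", chunk[match.start():]))
            chunk = chunk[: match.start()]
        if chunk:
            explained.append(("TOKEN", chunk))
        # Suffixes were collected from the outside in; restore text order.
        explained.extend(reversed(suffixes))
    return explained


print(toy_explain("(Hello!)"))
```

Passing a pattern that can match nothing, for example `re.compile(r"^\(*")` as `prefix`, exercises the guard: the function still returns instead of spinning on the empty match.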
Matthew Honnibal
a3c43a1692 Support no hidden layer in parser and NER (#4672)
* Support no hidden layers for parser

* Fix parser model for depth 1

* Fix parser for hidden depth=0

* Add option of non-blocking to CUDA stream
2019-11-19 15:54:34 +01:00
Matthew Honnibal
4b123952aa Add option for improved NER feature extraction (#4671)
* Support option of three NER features

* Expose nr_feature parser model setting

* Give feature tokens better name

* Test nr_feature=3 for NER

* Format
2019-11-19 15:03:14 +01:00