Commit Graph

10798 Commits

Author SHA1 Message Date
adrianeboyd
c32126359a Allow period as suffix following punctuation (#4248)
Addresses rare cases (such as `_MATH_.`, see #1061) where the final
period was not recognized as a suffix following punctuation.
2019-09-09 19:19:22 +02:00
Ines Montani
3e8f136ba7 💫 WIP: Basic lookup class scaffolding and JSON for all lemmatizer data (#4178)
* Improve load_language_data helper

* WIP: Add Lookups implementation

* Start moving lemma data over to JSON

* WIP: move data over for more languages

* Convert more languages

* Fix lemmatizer fixtures in tests

* Finish conversion

* Auto-format JSON files

* Fix test for now

* Make sure tables are stored on instance

* Update docstrings

* Update docstrings and errors

* Update test

* Add Lookups.__len__

* Add serialization methods

* Add Lookups.remove_table

* Use msgpack for serialization to disk

* Fix file exists check

* Try using OrderedDict for everything

* Update .flake8 [ci skip]

* Try fixing serialization

* Update test_lookups.py

* Update test_serialize_vocab_strings.py

* Fix serialization for lookups

* Fix lookups

* Fix lookups

* Fix lookups

* Try to fix serialization

* Try to fix serialization

* Try to fix serialization

* Try to fix serialization

* Give up on serialization test

* Xfail more serialization tests for 3.5

* Fix lookups for 2.7
2019-09-09 19:17:55 +02:00
Sofie Van Landeghem
482c7cd1b9 pulling tqdm imports in functions to avoid bug (tmp fix) (#4263) 2019-09-09 16:32:11 +02:00
Mihai Gliga
25aecd504f adding Romanian tag_map (#4257)
* adding Romanian tag_map

* added SCA file

* forgotten import
2019-09-09 11:53:09 +02:00
Matthew Honnibal
1653b818c5 Update Lithuanian tag map 2019-09-08 20:57:58 +02:00
adrianeboyd
3780e2ff50 Flush tokenizer cache when necessary (#4258)
Flush tokenizer cache when affixes, token_match, or special cases are
modified.

Fixes #4238, same issue as in #1250.
2019-09-08 20:52:46 +02:00
Matthew Honnibal
da8830d909 Set version to v2.2.0.dev3 2019-09-08 18:22:03 +02:00
Matthew Honnibal
1a65c5b7af Update develop from master 2019-09-08 18:21:41 +02:00
Matthew Honnibal
aec6174ae6 Fix lemmatizer 2019-09-08 18:09:53 +02:00
Matthew Honnibal
fde4f8ac8e Create lookups if not passed in 2019-09-08 18:08:09 +02:00
Pavle Vidanović
d03401f532 Lemmatizer lookup dictionary for Serbian and basic tag set adde… (#4251)
* Serbian stopwords added. (cyrillic alphabet)

* spaCy Contribution agreement included.

* Test initialize updated

* Serbian language code update. --bugfix

* Tokenizer exceptions added. Init file updated.

* Norm exceptions and lexical attributes added.

* Examples added.

* Tests added.

* sr_lang examples update.

* Tokenizer exceptions updated. (Serbian)

* Lemmatizer created. Licence included.

* Test updated.

* Tag map basic added.

* tag_map.py file removed since it uses default spacy tags.
2019-09-08 14:19:15 +02:00
Ivan Šarić
b01025dd06 adds Croatian lemma_lookup.json, license file and corresponding tests (#4252) 2019-09-08 13:40:45 +02:00
adrianeboyd
aec755d3a3 Modify retokenizer to use span root attributes (#4219)
* Modify retokenizer to use span root attributes

* tag/pos/morph are set to root tag/pos/morph

* lemma and norm are reset and end up as orth (not ideal, but better
than orth of first token)

* Also handle individual merge case

* Add test

* Attempt to handle ent_iob and ent_type in merges

* Fix check for whether B-ENT should become I-ENT

* Move IOB consistency check to after attrs

Move all IOB consistency checks after attrs are set and simplify to
check entire document, modifying I to B at the beginning of the document
or if the entity type of the previous token isn't the same.

* Move IOB consistency check for single merge

Move IOB consistency check after the token array is compressed for the
single merge case.

* Update spacy/tokens/_retokenize.pyx

Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>

* Remove single vs. multiple merge distinction

Remove original single-instance `_merge()` and use `_bulk_merge()` (now
renamed `_merge()`) for all merges.

* Add out-of-bound check in previous entity check
2019-09-08 13:04:49 +02:00
Sofie Van Landeghem
53a9ca45c9 Docs: bufsize instead of buffsize (#4247) 2019-09-06 11:11:54 +02:00
Sofie Van Landeghem
6b012cebff Make pos/tag distinction more clear in docs (#4246)
* make distinction between tag and pos more prominent in docs

* out of the 101
2019-09-06 10:31:21 +02:00
Bae Yong-Ju
a55f5a744f Fix ValueError exception on empty Korean text. (#4245) 2019-09-06 10:29:40 +02:00
Ines Montani
232a029de6 Send referrer for internal links [ci skip] 2019-09-05 10:41:46 +02:00
Matthew Honnibal
d039ed2267
Merge pull request #4237 from adrianeboyd/feature/gold-train-orth-variants
Add guillemets/chevrons to German orth variants
2019-09-04 23:10:49 +02:00
Matthew Honnibal
b94c34ec8f
Merge pull request #4239 from adrianeboyd/bugfix/tokenizer-cache-test-1061
Add regression test for #1061 back to test suite
2019-09-04 23:10:12 +02:00
Adriane Boyd
0f28418446 Add regression test for #1061 back to test suite 2019-09-04 20:42:24 +02:00
Adriane Boyd
c39c13f26b Add guillemets/chevrons to German orth variants
Add guillemets/chevrons to German orth variants for both German/Austrian
and Swiss conventions.
2019-09-04 20:05:08 +02:00
Ines Montani
2f31f96fce Update languages.json [ci skip] 2019-09-04 18:15:42 +02:00
Ines Montani
2245e95e2d Update languages.json [ci skip] 2019-09-04 17:11:40 +02:00
Matthew Honnibal
17c039406b
Merge pull request #4232 from adrianeboyd/bugfix/entityruler-ner-4229
Fix handling of preset entities in NER
2019-09-04 15:02:31 +02:00
Adriane Boyd
6b0fec76fd Fix handling of preset entities in NER
* Fix check of valid ent_type for B
* Add valid L as preset-I followed by not-I
2019-09-04 13:42:42 +02:00
Ines Montani
419ae59c79 Make flaky test test_issue_1971_4 more explicit 2019-08-31 14:08:05 +02:00
Ines Montani
dad5621166 Tidy up and auto-format [ci skip] 2019-08-31 13:39:31 +02:00
Ines Montani
cd90752193 Tidy up and auto-format [ci skip] 2019-08-31 13:39:06 +02:00
Ines Montani
bcd1b12f43 Add contributor agreement [ci skip] 2019-08-30 17:02:43 +02:00
Matthew Honnibal
67c3d03905 Revert morphology serialisation 2019-08-30 13:13:07 +02:00
Matthew Honnibal
efcb51ddc8
Merge pull request #4217 from adrianeboyd/bugfix/morph-en-serialization
Morphology tag_map-related bugfixes
2019-08-30 12:46:29 +02:00
Adriane Boyd
893f11a9e3 Serialize tag_map directly
Fix Aspect_prof typo
2019-08-30 11:30:03 +02:00
Adriane Boyd
02babf9317 English tag map without unsupported features/values 2019-08-30 11:29:19 +02:00
Matthew Honnibal
516650f58f
Merge pull request #4207 from svlandeg/bugfix/serialize-tok-exc
Bugfix for serializing tokenizer rules/exceptions
2019-08-30 11:04:58 +02:00
Matthew Honnibal
f3c3ce7f1e Update vocab 2019-08-29 21:19:54 +02:00
Matthew Honnibal
fc0a3c8c38 Add morphology serialization 2019-08-29 21:17:34 +02:00
Matthew Honnibal
c94fc9edb9 Fix noise addition 2019-08-29 15:39:32 +02:00
Matthew Honnibal
32842a3cd4 Disable whitespace corruption 2019-08-29 15:01:58 +02:00
Matthew Honnibal
3c1c0ec18e Add tests for NER oracle with whitespace 2019-08-29 14:33:39 +02:00
Matthew Honnibal
6511e1d8d3 Fix NER gold-standard around whitespace 2019-08-29 14:33:07 +02:00
adrianeboyd
82159b5c19 Updates/bugfixes for NER/IOB converters (#4186)
* Updates/bugfixes for NER/IOB converters

* Converter formats `ner` and `iob` use autodetect to choose a converter if
  possible

* `iob2json` is reverted to handle sentence-per-line data like
  `word1|pos1|ent1 word2|pos2|ent2`

  * Fix bug in `merge_sentences()` so the second sentence in each batch isn't
    skipped

* `conll_ner2json` is made more general so it can handle more formats with
  whitespace-separated columns

  * Supports all formats where the first column is the token and the final
    column is the IOB tag; if present, the second column is the POS tag

  * As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O`
    separates documents

  * Add option for segmenting sentences (new flag `-s`)

  * Parser-based sentence segmentation with a provided model, otherwise with
    sentencizer (new option `-b` to specify model)

  * Can group sentences into documents with `n_sents` as long as sentence
    segmentation is available

  * Only applies automatic segmentation when there are no existing delimiters
    in the data

* Provide info about settings applied during conversion with warnings and
  suggestions if settings conflict or might not be not optimal.

* Add tests for common formats

* Add '(default)' back to docs for -c auto

* Add document count back to output

* Revert changes to converter output message

* Use explicit tabs in convert CLI test data

* Adjust/add messages for n_sents=1 default

* Add sample NER data to training examples

* Update README

* Add links in docs to example NER data

* Define msg within converters
2019-08-29 12:04:01 +02:00
adrianeboyd
5feb342f5e Add more token attributes to token pattern schema (#4210)
Add token attributes with tests to token pattern schema.
2019-08-29 12:02:26 +02:00
Matthew Honnibal
216f63a987
Merge pull request #4208 from adrianeboyd/bugfix/orth-vs-noise
Add separate noise vs orth level to train CLI
2019-08-29 10:26:42 +02:00
Adriane Boyd
f3906950d3 Add separate noise vs orth level to train CLI 2019-08-29 09:10:35 +02:00
Matthew Honnibal
7d6d438566 Set version to v2.2.0.dev2 2019-08-28 18:30:43 +02:00
Matthew Honnibal
bc5ce49859 Fix 'noise_level' in train cmd 2019-08-28 17:55:38 +02:00
Matthew Honnibal
782056d117 Fix morph rules 2019-08-28 16:59:45 +02:00
Matthew Honnibal
6b2ea883ed
Merge pull request #4205 from adrianeboyd/feature/gold-train-orth-variants
Add train_docs() option to add orth variants
2019-08-28 16:54:06 +02:00
svlandeg
c54aabc3cd fix loading custom tokenizer rules/exceptions from file 2019-08-28 14:17:44 +02:00
svlandeg
7bec0ebbcb failing unit test for Issue 4190 2019-08-28 14:16:34 +02:00