10890 Commits

Author SHA1 Message Date
Matthew Honnibal
aec6174ae6 Fix lemmatizer 2019-09-08 18:09:53 +02:00
Matthew Honnibal
fde4f8ac8e Create lookups if not passed in 2019-09-08 18:08:09 +02:00
Pavle Vidanović
d03401f532 Lemmatizer lookup dictionary for Serbian and basic tag set adde… (#4251)
* Serbian stopwords added. (cyrillic alphabet)

* spaCy Contribution agreement included.

* Test initialize updated

* Serbian language code update. --bugfix

* Tokenizer exceptions added. Init file updated.

* Norm exceptions and lexical attributes added.

* Examples added.

* Tests added.

* sr_lang examples update.

* Tokenizer exceptions updated. (Serbian)

* Lemmatizer created. Licence included.

* Test updated.

* Tag map basic added.

* tag_map.py file removed since it uses default spacy tags.
2019-09-08 14:19:15 +02:00
Ivan Šarić
b01025dd06 adds Croatian lemma_lookup.json, license file and corresponding tests (#4252) 2019-09-08 13:40:45 +02:00
adrianeboyd
aec755d3a3 Modify retokenizer to use span root attributes (#4219)
* Modify retokenizer to use span root attributes

* tag/pos/morph are set to root tag/pos/morph

* lemma and norm are reset and end up as orth (not ideal, but better
than orth of first token)

* Also handle individual merge case

* Add test

* Attempt to handle ent_iob and ent_type in merges

* Fix check for whether B-ENT should become I-ENT

* Move IOB consistency check to after attrs

Move all IOB consistency checks after attrs are set and simplify to
check entire document, modifying I to B at the beginning of the document
or if the entity type of the previous token isn't the same.

* Move IOB consistency check for single merge

Move IOB consistency check after the token array is compressed for the
single merge case.

* Update spacy/tokens/_retokenize.pyx

Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>

* Remove single vs. multiple merge distinction

Remove original single-instance `_merge()` and use `_bulk_merge()` (now
renamed `_merge()`) for all merges.

* Add out-of-bound check in previous entity check
2019-09-08 13:04:49 +02:00
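
Note on aec755d3a3: the IOB consistency rule described in the message above (turn I into B at the start of the document, or when the previous token's entity type differs) can be illustrated roughly as below. This is a minimal sketch in plain Python, not the code from `spacy/tokens/_retokenize.pyx`; the function name `fix_iob` and the `(iob, ent_type)` tuple representation are assumptions made for illustration.

```python
def fix_iob(tags):
    """Return a copy of `tags` with inconsistent I tags rewritten to B.

    `tags` is a list of (iob, ent_type) pairs, where iob is "I", "O" or "B".
    This representation is hypothetical and chosen only for the example.
    """
    fixed = []
    for i, (iob, ent_type) in enumerate(tags):
        if iob == "I":
            # An I tag is only valid if the previous token belongs to an
            # entity of the same type; otherwise it must start a new entity.
            prev_type = fixed[i - 1][1] if i > 0 else None
            if i == 0 or prev_type != ent_type:
                iob = "B"
        fixed.append((iob, ent_type))
    return fixed


example = [("I", "ORG"), ("I", "ORG"), ("O", ""), ("I", "PER"), ("I", "ORG")]
print(fix_iob(example))
```

Checking the whole document after the attributes are set, as the message describes, means one pass like this covers every merge instead of handling each merged span separately.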
Sofie Van Landeghem
53a9ca45c9 Docs: bufsize instead of buffsize (#4247) 2019-09-06 11:11:54 +02:00
Sofie Van Landeghem
6b012cebff Make pos/tag distinction more clear in docs (#4246)
* make distinction between tag and pos more prominent in docs

* out of the 101
2019-09-06 10:31:21 +02:00
Bae Yong-Ju
a55f5a744f Fix ValueError exception on empty Korean text. (#4245) 2019-09-06 10:29:40 +02:00
Ines Montani
232a029de6 Send referrer for internal links [ci skip] 2019-09-05 10:41:46 +02:00
Matthew Honnibal
d039ed2267
Merge pull request #4237 from adrianeboyd/feature/gold-train-orth-variants
Add guillemets/chevrons to German orth variants
2019-09-04 23:10:49 +02:00
Matthew Honnibal
b94c34ec8f
Merge pull request #4239 from adrianeboyd/bugfix/tokenizer-cache-test-1061
Add regression test for #1061 back to test suite
2019-09-04 23:10:12 +02:00
Adriane Boyd
0f28418446 Add regression test for #1061 back to test suite 2019-09-04 20:42:24 +02:00
Adriane Boyd
c39c13f26b Add guillemets/chevrons to German orth variants
Add guillemets/chevrons to German orth variants for both German/Austrian
and Swiss conventions.
2019-09-04 20:05:08 +02:00
Ines Montani
2f31f96fce Update languages.json [ci skip] 2019-09-04 18:15:42 +02:00
Ines Montani
2245e95e2d Update languages.json [ci skip] 2019-09-04 17:11:40 +02:00
Matthew Honnibal
17c039406b
Merge pull request #4232 from adrianeboyd/bugfix/entityruler-ner-4229
Fix handling of preset entities in NER
2019-09-04 15:02:31 +02:00
Adriane Boyd
6b0fec76fd Fix handling of preset entities in NER
* Fix check of valid ent_type for B
* Add valid L as preset-I followed by not-I
2019-09-04 13:42:42 +02:00
Ines Montani
419ae59c79 Make flaky test test_issue_1971_4 more explicit 2019-08-31 14:08:05 +02:00
Ines Montani
dad5621166 Tidy up and auto-format [ci skip] 2019-08-31 13:39:31 +02:00
Ines Montani
cd90752193 Tidy up and auto-format [ci skip] 2019-08-31 13:39:06 +02:00
Ines Montani
bcd1b12f43 Add contributor agreement [ci skip] 2019-08-30 17:02:43 +02:00
Matthew Honnibal
67c3d03905 Revert morphology serialisation 2019-08-30 13:13:07 +02:00
Matthew Honnibal
efcb51ddc8
Merge pull request #4217 from adrianeboyd/bugfix/morph-en-serialization
Morphology tag_map-related bugfixes
2019-08-30 12:46:29 +02:00
Adriane Boyd
893f11a9e3 Serialize tag_map directly
Fix Aspect_prof typo
2019-08-30 11:30:03 +02:00
Adriane Boyd
02babf9317 English tag map without unsupported features/values 2019-08-30 11:29:19 +02:00
Matthew Honnibal
516650f58f
Merge pull request #4207 from svlandeg/bugfix/serialize-tok-exc
Bugfix for serializing tokenizer rules/exceptions
2019-08-30 11:04:58 +02:00
Matthew Honnibal
f3c3ce7f1e Update vocab 2019-08-29 21:19:54 +02:00
Matthew Honnibal
fc0a3c8c38 Add morphology serialization 2019-08-29 21:17:34 +02:00
Matthew Honnibal
c94fc9edb9 Fix noise addition 2019-08-29 15:39:32 +02:00
Matthew Honnibal
32842a3cd4 Disable whitespace corruption 2019-08-29 15:01:58 +02:00
Matthew Honnibal
3c1c0ec18e Add tests for NER oracle with whitespace 2019-08-29 14:33:39 +02:00
Matthew Honnibal
6511e1d8d3 Fix NER gold-standard around whitespace 2019-08-29 14:33:07 +02:00
adrianeboyd
82159b5c19 Updates/bugfixes for NER/IOB converters (#4186)
* Updates/bugfixes for NER/IOB converters

* Converter formats `ner` and `iob` use autodetect to choose a converter if
  possible

* `iob2json` is reverted to handle sentence-per-line data like
  `word1|pos1|ent1 word2|pos2|ent2`

  * Fix bug in `merge_sentences()` so the second sentence in each batch isn't
    skipped

* `conll_ner2json` is made more general so it can handle more formats with
  whitespace-separated columns

  * Supports all formats where the first column is the token and the final
    column is the IOB tag; if present, the second column is the POS tag

  * As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O`
    separates documents

  * Add option for segmenting sentences (new flag `-s`)

  * Parser-based sentence segmentation with a provided model, otherwise with
    sentencizer (new option `-b` to specify model)

  * Can group sentences into documents with `n_sents` as long as sentence
    segmentation is available

  * Only applies automatic segmentation when there are no existing delimiters
    in the data

* Provide info about settings applied during conversion with warnings and
  suggestions if settings conflict or might not be optimal.

* Add tests for common formats

* Add '(default)' back to docs for -c auto

* Add document count back to output

* Revert changes to converter output message

* Use explicit tabs in convert CLI test data

* Adjust/add messages for n_sents=1 default

* Add sample NER data to training examples

* Update README

* Add links in docs to example NER data

* Define msg within converters
2019-08-29 12:04:01 +02:00
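
Note on 82159b5c19: the column layout that the message describes for `conll_ner2json` (first column is the token, last column is the IOB tag, optional second column is the POS tag, blank lines separate sentences, `-DOCSTART-` lines separate documents) can be read with a few lines of plain Python. The sketch below is not the converter itself; `parse_conll_ner` is a hypothetical helper, and treating the second column as a POS tag only when a line has more than two columns is an interpretation of the message.

```python
def parse_conll_ner(text):
    """Parse whitespace-separated NER data into docs -> sentences -> tokens.

    Each token is a (word, pos, iob) tuple; pos is None when the line has
    only two columns. Hypothetical helper, for illustration only.
    """
    docs, sentences, tokens = [], [], []
    for line in text.splitlines():
        cols = line.split()
        if not cols or cols[0] == "-DOCSTART-":
            # Blank lines and document markers both end the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
            # A document marker additionally ends the current document.
            if cols and sentences:
                docs.append(sentences)
                sentences = []
            continue
        word, iob = cols[0], cols[-1]
        pos = cols[1] if len(cols) > 2 else None
        tokens.append((word, pos, iob))
    # Flush whatever is left at the end of the input.
    if tokens:
        sentences.append(tokens)
    if sentences:
        docs.append(sentences)
    return docs


sample = (
    "-DOCSTART- -X- O O\n\n"
    "EU NNP B-NP B-ORG\nrejects VBZ B-VP O\n\n"
    "Peter B-PER\n\n"
    "-DOCSTART- -X- O O\n\n"
    "Berlin NNP B-LOC\n"
)
for doc in parse_conll_ner(sample):
    print(doc)
```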
adrianeboyd
5feb342f5e Add more token attributes to token pattern schema (#4210)
Add token attributes with tests to token pattern schema.
2019-08-29 12:02:26 +02:00
Matthew Honnibal
216f63a987
Merge pull request #4208 from adrianeboyd/bugfix/orth-vs-noise
Add separate noise vs orth level to train CLI
2019-08-29 10:26:42 +02:00
Adriane Boyd
f3906950d3 Add separate noise vs orth level to train CLI 2019-08-29 09:10:35 +02:00
Matthew Honnibal
7d6d438566 Set version to v2.2.0.dev2 2019-08-28 18:30:43 +02:00
Matthew Honnibal
bc5ce49859 Fix 'noise_level' in train cmd 2019-08-28 17:55:38 +02:00
Matthew Honnibal
782056d117 Fix morph rules 2019-08-28 16:59:45 +02:00
Matthew Honnibal
6b2ea883ed
Merge pull request #4205 from adrianeboyd/feature/gold-train-orth-variants
Add train_docs() option to add orth variants
2019-08-28 16:54:06 +02:00
svlandeg
c54aabc3cd fix loading custom tokenizer rules/exceptions from file 2019-08-28 14:17:44 +02:00
svlandeg
7bec0ebbcb failing unit test for Issue 4190 2019-08-28 14:16:34 +02:00
Ines Montani
b91425f803 Update universe.json [ci skip] 2019-08-28 13:45:06 +02:00
Adriane Boyd
0a26e94d02 Modify raw to match orth variant annotation tuples
If raw is available, attempt to modify raw to match the orth variants.
If raw/words can't be aligned, abort and return unmodified
raw/annotation.
2019-08-28 13:38:54 +02:00
Ines Montani
aedae8b4c5 Update universe.json [ci skip] 2019-08-28 11:59:06 +02:00
Adriane Boyd
47af3f676e Single and paired orth variants for German 2019-08-28 09:19:18 +02:00
Adriane Boyd
56c38484a1 Single and paired orth variants for English 2019-08-28 09:19:18 +02:00
Adriane Boyd
aae05ff16b Add train_docs() option to add orth variants
Filtering by orth and tag, create variants of training docs with
alternate orth variants, e.g., unicode quotes, dashes, and ellipses.

The variants can be single tokens (dashes) or paired tokens (quotes)
with left and right versions.

Currently restricted to only add variants to training documents without
raw text provided, where only gold.words needs to be modified.
2019-08-28 09:18:36 +02:00
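
Note on aae05ff16b: the idea of single and paired orth variants, filtered by orth and tag, can be sketched in plain Python as below. This is an illustration under stated assumptions, not the `train_docs()` implementation: the rule tables `SINGLE_VARIANTS` and `PAIRED_VARIANTS`, their contents, and the function `add_orth_variants` are all hypothetical.

```python
import random

# Hypothetical rule tables: a token is only replaced if both its tag and its
# current orth match a rule.
SINGLE_VARIANTS = [
    {"tags": {":"}, "variants": ["-", "--", "---"]},
]
PAIRED_VARIANTS = [
    # (left tag, right tag) and the (left, right) quote forms to choose from.
    {"tags": ("``", "''"), "variants": [("'", "'"), ("\u2018", "\u2019")]},
]


def add_orth_variants(words, tags, rng=random):
    """Return a copy of `words` with matching tokens swapped for one variant."""
    words = list(words)
    for rule in SINGLE_VARIANTS:
        choice = rng.choice(rule["variants"])
        for i, (word, tag) in enumerate(zip(words, tags)):
            if tag in rule["tags"] and word in rule["variants"]:
                words[i] = choice
    for rule in PAIRED_VARIANTS:
        left_tag, right_tag = rule["tags"]
        left_forms = {left for left, _ in rule["variants"]}
        right_forms = {right for _, right in rule["variants"]}
        # Pick one pair per document so left and right quotes stay matched.
        left, right = rng.choice(rule["variants"])
        for i, (word, tag) in enumerate(zip(words, tags)):
            if tag == left_tag and word in left_forms:
                words[i] = left
            elif tag == right_tag and word in right_forms:
                words[i] = right
    return words


words = ["'", "Hi", "'", "-", "there"]
tags = ["``", "UH", "''", ":", "RB"]
print(add_orth_variants(words, tags, random.Random(0)))
```

Only the word list changes and the token count stays the same, which matches the restriction in the message to training documents without raw text, where only `gold.words` needs to be modified.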
Björn Böing
bae0455f91 Fix visualizer options linking for displaCy. (#4202) 2019-08-27 14:04:28 +02:00
Ines Montani
8114933f01 Fix universe.json [ci skip] 2019-08-27 12:13:42 +02:00