Matthew Honnibal
d039ed2267
Merge pull request #4237 from adrianeboyd/feature/gold-train-orth-variants
...
Add guillemets/chevrons to German orth variants
2019-09-04 23:10:49 +02:00
Matthew Honnibal
b94c34ec8f
Merge pull request #4239 from adrianeboyd/bugfix/tokenizer-cache-test-1061
...
Add regression test for #1061 back to test suite
2019-09-04 23:10:12 +02:00
Adriane Boyd
0f28418446
Add regression test for #1061 back to test suite
2019-09-04 20:42:24 +02:00
Adriane Boyd
c39c13f26b
Add guillemets/chevrons to German orth variants
...
Add guillemets/chevrons to German orth variants for both German/Austrian
and Swiss conventions.
2019-09-04 20:05:08 +02:00
Ines Montani
2f31f96fce
Update languages.json [ci skip]
2019-09-04 18:15:42 +02:00
Ines Montani
2245e95e2d
Update languages.json [ci skip]
2019-09-04 17:11:40 +02:00
Matthew Honnibal
17c039406b
Merge pull request #4232 from adrianeboyd/bugfix/entityruler-ner-4229
...
Fix handling of preset entities in NER
2019-09-04 15:02:31 +02:00
Adriane Boyd
6b0fec76fd
Fix handling of preset entities in NER
...
* Fix check of valid ent_type for B
* Add valid L as preset-I followed by not-I
2019-09-04 13:42:42 +02:00
Ines Montani
419ae59c79
Make flaky test test_issue_1971_4 more explicit
2019-08-31 14:08:05 +02:00
Ines Montani
dad5621166
Tidy up and auto-format [ci skip]
2019-08-31 13:39:31 +02:00
Ines Montani
cd90752193
Tidy up and auto-format [ci skip]
2019-08-31 13:39:06 +02:00
Ines Montani
bcd1b12f43
Add contributor agreement [ci skip]
2019-08-30 17:02:43 +02:00
Matthew Honnibal
67c3d03905
Revert morphology serialisation
2019-08-30 13:13:07 +02:00
Matthew Honnibal
efcb51ddc8
Merge pull request #4217 from adrianeboyd/bugfix/morph-en-serialization
...
Morphology tag_map-related bugfixes
2019-08-30 12:46:29 +02:00
Adriane Boyd
893f11a9e3
Serialize tag_map directly
...
Fix Aspect_prof typo
2019-08-30 11:30:03 +02:00
Adriane Boyd
02babf9317
English tag map without unsupported features/values
2019-08-30 11:29:19 +02:00
Matthew Honnibal
516650f58f
Merge pull request #4207 from svlandeg/bugfix/serialize-tok-exc
...
Bugfix for serializing tokenizer rules/exceptions
2019-08-30 11:04:58 +02:00
Matthew Honnibal
f3c3ce7f1e
Update vocab
2019-08-29 21:19:54 +02:00
Matthew Honnibal
fc0a3c8c38
Add morphology serialization
2019-08-29 21:17:34 +02:00
Matthew Honnibal
c94fc9edb9
Fix noise addition
2019-08-29 15:39:32 +02:00
Matthew Honnibal
32842a3cd4
Disable whitespace corruption
2019-08-29 15:01:58 +02:00
Matthew Honnibal
3c1c0ec18e
Add tests for NER oracle with whitespace
2019-08-29 14:33:39 +02:00
Matthew Honnibal
6511e1d8d3
Fix NER gold-standard around whitespace
2019-08-29 14:33:07 +02:00
adrianeboyd
82159b5c19
Updates/bugfixes for NER/IOB converters (#4186)
...
* Updates/bugfixes for NER/IOB converters
* Converter formats `ner` and `iob` use autodetect to choose a converter if
  possible
* `iob2json` is reverted to handle sentence-per-line data like
  `word1|pos1|ent1 word2|pos2|ent2`
* Fix bug in `merge_sentences()` so the second sentence in each batch isn't
  skipped
* `conll_ner2json` is made more general so it can handle more formats with
  whitespace-separated columns
* Supports all formats where the first column is the token and the final
  column is the IOB tag; if present, the second column is the POS tag
* As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O`
  separates documents
* Add option for segmenting sentences (new flag `-s`)
* Parser-based sentence segmentation with a provided model, otherwise with
  sentencizer (new option `-b` to specify model)
* Can group sentences into documents with `n_sents` as long as sentence
  segmentation is available
* Only applies automatic segmentation when there are no existing delimiters
  in the data
* Provide info about settings applied during conversion with warnings and
  suggestions if settings conflict or might not be optimal.
* Add tests for common formats
* Add '(default)' back to docs for -c auto
* Add document count back to output
* Revert changes to converter output message
* Use explicit tabs in convert CLI test data
* Adjust/add messages for n_sents=1 default
* Add sample NER data to training examples
* Update README
* Add links in docs to example NER data
* Define msg within converters
2019-08-29 12:04:01 +02:00
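The two input layouts described in commit 82159b5c19 can be illustrated with a minimal sketch. This is not the code of `iob2json` or `conll_ner2json`; the function names `parse_iob_line` and `parse_conll_ner` and the sample data are hypothetical, and the sketch only follows the rules stated in the commit message: sentence-per-line `word|pos|ent` items, or whitespace-separated columns with the token first, the IOB tag last, an optional POS tag second, blank lines between sentences, and `-DOCSTART-` lines between documents.

```python
from typing import List, Optional, Tuple

# (word, POS tag or None, IOB tag)
Token = Tuple[str, Optional[str], str]


def parse_iob_line(line: str) -> List[Token]:
    """Parse one sentence given as 'word1|pos1|ent1 word2|pos2|ent2'."""
    tokens: List[Token] = []
    for item in line.split():
        # Split from the right so a word may itself contain '|'.
        parts = item.rsplit("|", 2)
        if len(parts) != 3:
            raise ValueError(f"expected word|pos|ent, got {item!r}")
        word, pos, ent = parts
        tokens.append((word, pos, ent))
    return tokens


def parse_conll_ner(text: str) -> List[List[List[Token]]]:
    """Parse column data into documents -> sentences -> tokens."""
    docs: List[List[List[Token]]] = []
    sents: List[List[Token]] = []
    sent: List[Token] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("-DOCSTART-"):
            # Document delimiter: close the open sentence and document.
            if sent:
                sents.append(sent)
                sent = []
            if sents:
                docs.append(sents)
                sents = []
            continue
        if not line:
            # Blank line: sentence boundary.
            if sent:
                sents.append(sent)
                sent = []
            continue
        cols = line.split()
        # First column is the token, last is the IOB tag; the second
        # column is treated as POS only when more than two columns exist.
        pos = cols[1] if len(cols) > 2 else None
        sent.append((cols[0], pos, cols[-1]))
    if sent:
        sents.append(sent)
    if sents:
        docs.append(sents)
    return docs
```

For example, `parse_iob_line("I|PRP|O like|VBP|O London|NNP|B-GPE")` yields three `(word, pos, ent)` tuples, and a CoNLL 2003 style file with four columns yields the first, second, and fourth column of each row.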
adrianeboyd
5feb342f5e
Add more token attributes to token pattern schema (#4210)
...
Add token attributes with tests to token pattern schema.
2019-08-29 12:02:26 +02:00
Matthew Honnibal
216f63a987
Merge pull request #4208 from adrianeboyd/bugfix/orth-vs-noise
...
Add separate noise vs orth level to train CLI
2019-08-29 10:26:42 +02:00
Adriane Boyd
f3906950d3
Add separate noise vs orth level to train CLI
2019-08-29 09:10:35 +02:00
Matthew Honnibal
7d6d438566
Set version to v2.2.0.dev2
2019-08-28 18:30:43 +02:00
Matthew Honnibal
bc5ce49859
Fix 'noise_level' in train cmd
2019-08-28 17:55:38 +02:00
Matthew Honnibal
782056d117
Fix morph rules
2019-08-28 16:59:45 +02:00
Matthew Honnibal
6b2ea883ed
Merge pull request #4205 from adrianeboyd/feature/gold-train-orth-variants
...
Add train_docs() option to add orth variants
2019-08-28 16:54:06 +02:00
svlandeg
c54aabc3cd
fix loading custom tokenizer rules/exceptions from file
2019-08-28 14:17:44 +02:00
svlandeg
7bec0ebbcb
failing unit test for Issue 4190
2019-08-28 14:16:34 +02:00
Ines Montani
b91425f803
Update universe.json [ci skip]
2019-08-28 13:45:06 +02:00
Adriane Boyd
0a26e94d02
Modify raw to match orth variant annotation tuples
...
If raw is available, attempt to modify raw to match the orth variants.
If raw/words can't be aligned, abort and return unmodified
raw/annotation.
2019-08-28 13:38:54 +02:00
Ines Montani
aedae8b4c5
Update universe.json [ci skip]
2019-08-28 11:59:06 +02:00
Adriane Boyd
47af3f676e
Single and paired orth variants for German
2019-08-28 09:19:18 +02:00
Adriane Boyd
56c38484a1
Single and paired orth variants for English
2019-08-28 09:19:18 +02:00
Adriane Boyd
aae05ff16b
Add train_docs() option to add orth variants
...
Filtering by orth and tag, create variants of training docs with
alternate orth variants, e.g., unicode quotes, dashes, and ellipses.
The variants can be single tokens (dashes) or paired tokens (quotes)
with left and right versions.
Currently restricted to only add variants to training documents without
raw text provided, where only gold.words needs to be modified.
2019-08-28 09:18:36 +02:00
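The orth-variant idea in commit aae05ff16b (filter tokens by orth and tag, then swap in an alternate spelling, either as a single token or as a left/right pair) can be sketched roughly as follows. This is an illustration under stated assumptions, not the `train_docs()` implementation: the function name `add_orth_variants`, the table layout, and the example tags and variant lists are all hypothetical.

```python
import random
from typing import List, Sequence

# Hypothetical variant tables: each group lists the tags it applies to
# and the interchangeable spellings.
SINGLE_VARIANTS = [
    {"tags": {"NFP"}, "variants": ["...", "\u2026"]},
    {"tags": {":"}, "variants": ["-", "\u2013", "--"]},
]
PAIRED_VARIANTS = [
    {
        "left_tag": "``",
        "right_tag": "''",
        "variants": [("'", "'"), ("\u2018", "\u2019")],
    },
    {
        "left_tag": "``",
        "right_tag": "''",
        "variants": [('"', '"'), ("\u201c", "\u201d")],
    },
]


def add_orth_variants(
    words: Sequence[str], tags: Sequence[str], rng: random.Random
) -> List[str]:
    """Return a copy of `words` with matching tokens replaced by variants."""
    out = list(words)
    for group in SINGLE_VARIANTS:
        # One variant per group, so a document stays internally consistent.
        choice = rng.choice(group["variants"])
        for i, tag in enumerate(tags):
            if tag in group["tags"] and out[i] in group["variants"]:
                out[i] = choice
    for group in PAIRED_VARIANTS:
        left, right = rng.choice(group["variants"])
        lefts = {pair[0] for pair in group["variants"]}
        rights = {pair[1] for pair in group["variants"]}
        for i, tag in enumerate(tags):
            # The tag decides left vs right, since straight quotes use
            # the same character on both sides.
            if tag == group["left_tag"] and out[i] in lefts:
                out[i] = left
            elif tag == group["right_tag"] and out[i] in rights:
                out[i] = right
    return out
```

Only the word list is rewritten here, which matches the restriction in the commit message to training documents without raw text, where only `gold.words` needs to change.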
Björn Böing
bae0455f91
Fix visualizer options linking for displaCy. (#4202)
2019-08-27 14:04:28 +02:00
Ines Montani
8114933f01
Fix universe.json [ci skip]
2019-08-27 12:13:42 +02:00
Ines Montani
48385552c6
Update languages.json [ci skip]
2019-08-27 11:52:51 +02:00
Ines Montani
f4012ba054
Update README.md [ci skip]
2019-08-26 12:32:52 +02:00
Matthew Honnibal
af7fad2c6d
Set version to v2.2.0.dev1
2019-08-25 22:05:47 +02:00
Matthew Honnibal
71c0321ecf
Fix test
2019-08-25 22:03:37 +02:00
Matthew Honnibal
188a1cf297
Fix morphology for | features
2019-08-25 21:57:02 +02:00
Matthew Honnibal
095c63c6b8
Avoid making prepositions get the tag SCONJ
2019-08-25 21:56:47 +02:00
Matthew Honnibal
22250cf6b7
Make regression test less sensitive to tag-map stuff
2019-08-25 21:54:26 +02:00
Matthew Honnibal
4e2f07a655
Merge branch 'develop' into feature/lemmatizer
2019-08-25 21:03:25 +02:00
yanaiela
5d7bc26735
new universe project - the numeric fused-head (#4192)
...
* new universe project
* Update website/meta/universe.json
Co-Authored-By: Ines Montani <ines@ines.io>
* Update website/meta/universe.json
Co-Authored-By: Ines Montani <ines@ines.io>
2019-08-25 17:25:28 +02:00