Matthew Honnibal
aec6174ae6
Fix lemmatizer
2019-09-08 18:09:53 +02:00
Matthew Honnibal
fde4f8ac8e
Create lookups if not passed in
2019-09-08 18:08:09 +02:00
Matthew Honnibal
d039ed2267
Merge pull request #4237 from adrianeboyd/feature/gold-train-orth-variants
...
Add guillemets/chevrons to German orth variants
2019-09-04 23:10:49 +02:00
Adriane Boyd
c39c13f26b
Add guillemets/chevrons to German orth variants
...
Add guillemets/chevrons to German orth variants for both German/Austrian
and Swiss conventions.
2019-09-04 20:05:08 +02:00
Matthew Honnibal
67c3d03905
Revert morphology serialisation
2019-08-30 13:13:07 +02:00
Matthew Honnibal
efcb51ddc8
Merge pull request #4217 from adrianeboyd/bugfix/morph-en-serialization
...
Morphology tag_map-related bugfixes
2019-08-30 12:46:29 +02:00
Adriane Boyd
893f11a9e3
Serialize tag_map directly
...
Fix Aspect_prof typo
2019-08-30 11:30:03 +02:00
Adriane Boyd
02babf9317
English tag map without unsupported features/values
2019-08-30 11:29:19 +02:00
Matthew Honnibal
f3c3ce7f1e
Update vocab
2019-08-29 21:19:54 +02:00
Matthew Honnibal
fc0a3c8c38
Add morphology serialization
2019-08-29 21:17:34 +02:00
Matthew Honnibal
c94fc9edb9
Fix noise addition
2019-08-29 15:39:32 +02:00
Matthew Honnibal
32842a3cd4
Disable whitespace corruption
2019-08-29 15:01:58 +02:00
Matthew Honnibal
3c1c0ec18e
Add tests for NER oracle with whitespace
2019-08-29 14:33:39 +02:00
Matthew Honnibal
6511e1d8d3
Fix NER gold-standard around whitespace
2019-08-29 14:33:07 +02:00
Matthew Honnibal
216f63a987
Merge pull request #4208 from adrianeboyd/bugfix/orth-vs-noise
...
Add separate noise vs orth level to train CLI
2019-08-29 10:26:42 +02:00
Adriane Boyd
f3906950d3
Add separate noise vs orth level to train CLI
2019-08-29 09:10:35 +02:00
Matthew Honnibal
7d6d438566
Set version to v2.2.0.dev2
2019-08-28 18:30:43 +02:00
Matthew Honnibal
bc5ce49859
Fix 'noise_level' in train cmd
2019-08-28 17:55:38 +02:00
Matthew Honnibal
782056d117
Fix morph rules
2019-08-28 16:59:45 +02:00
Matthew Honnibal
6b2ea883ed
Merge pull request #4205 from adrianeboyd/feature/gold-train-orth-variants
...
Add train_docs() option to add orth variants
2019-08-28 16:54:06 +02:00
Adriane Boyd
0a26e94d02
Modify raw to match orth variant annotation tuples
...
If raw is available, attempt to modify raw to match the orth variants.
If raw/words can't be aligned, abort and return unmodified
raw/annotation.
2019-08-28 13:38:54 +02:00
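Note on 0a26e94d02: the raw-text handling described in this commit can be illustrated with a minimal sketch. This is not the code from the commit; `rewrite_raw` and its parameters are hypothetical names chosen for illustration. It walks the raw text token by token, swaps in the variant form, and returns the raw text unchanged as soon as the tokens stop lining up.

```python
def rewrite_raw(raw, words, variant_words):
    """Substitute variant_words into raw; return raw unchanged on misalignment.

    Hypothetical helper for illustration only, not spaCy's implementation.
    """
    out = []
    pos = 0
    for word, variant in zip(words, variant_words):
        # Copy any whitespace that precedes the next token.
        while pos < len(raw) and raw[pos].isspace():
            out.append(raw[pos])
            pos += 1
        if not raw.startswith(word, pos):
            # raw and words cannot be aligned: abort, keep the original text.
            return raw
        out.append(variant)
        pos += len(word)
    if raw[pos:].strip():
        # Leftover non-whitespace text means the tokens did not cover raw.
        return raw
    out.append(raw[pos:])
    return "".join(out)


original = ['"', "Hi", '"', "-", "there"]
variants = ["\u201c", "Hi", "\u201d", "\u2014", "there"]
print(rewrite_raw('"Hi" - there', original, variants))
```

Returning the input untouched on any mismatch mirrors the "abort and return unmodified" behaviour in the commit message, so a failed alignment never leaves the text and the annotation out of sync.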
Adriane Boyd
47af3f676e
Single and paired orth variants for German
2019-08-28 09:19:18 +02:00
Adriane Boyd
56c38484a1
Single and paired orth variants for English
2019-08-28 09:19:18 +02:00
Adriane Boyd
aae05ff16b
Add train_docs() option to add orth variants
...
Filtering by orth and tag, create variants of training docs with
alternate orth variants, e.g., unicode quotes, dashes, and ellipses.
The variants can be single tokens (dashes) or paired tokens (quotes)
with left and right versions.
Currently restricted to only add variants to training documents without
raw text provided, where only gold.words needs to be modified.
2019-08-28 09:18:36 +02:00
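Note on aae05ff16b: a minimal sketch of the single/paired variant idea described in this commit, assuming the training example is available as parallel lists of words and tags. The variant tables, tag choices, and the function name `add_orth_variants` are hypothetical and are not taken from the commit itself.

```python
import random

# Hypothetical variant tables for illustration; not spaCy's actual data.
SINGLE_VARIANTS = [
    {"tags": {":"}, "variants": ["-", "\u2013", "\u2014"]},  # dashes
    {"tags": {"NFP"}, "variants": ["...", "\u2026"]},        # ellipses
]
PAIRED_VARIANTS = [
    # (left tag, right tag) and the (left form, right form) pairs to pick from
    {"tags": ("``", "''"), "variants": [('"', '"'), ("\u201c", "\u201d")]},
]


def add_orth_variants(words, tags, rng):
    """Return a copy of words with matching punctuation swapped for a variant."""
    words = list(words)
    for rule in SINGLE_VARIANTS:
        forms = rule["variants"]
        choice = rng.choice(forms)  # one form per rule, used consistently
        for i, tag in enumerate(tags):
            # Filter by both tag and orth before replacing.
            if tag in rule["tags"] and words[i] in forms:
                words[i] = choice
    for rule in PAIRED_VARIANTS:
        left_tag, right_tag = rule["tags"]
        lefts = {left for left, _ in rule["variants"]}
        rights = {right for _, right in rule["variants"]}
        left, right = rng.choice(rule["variants"])  # keep the pair matched
        for i, tag in enumerate(tags):
            if tag == left_tag and words[i] in lefts:
                words[i] = left
            elif tag == right_tag and words[i] in rights:
                words[i] = right
    return words


words = ['"', "Hi", '"', "-", "there"]
tags = ["``", "UH", "''", ":", "RB"]
print(add_orth_variants(words, tags, random.Random(0)))
```

Choosing one form per rule, rather than per token, keeps a document internally consistent, and choosing left and right quotes as a pair avoids mixing conventions within one quotation.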
Björn Böing
bae0455f91
Fix visualizer options linking for displaCy. (#4202)
2019-08-27 14:04:28 +02:00
Ines Montani
8114933f01
Fix universe.json [ci skip]
2019-08-27 12:13:42 +02:00
Ines Montani
48385552c6
Update languages.json [ci skip]
2019-08-27 11:52:51 +02:00
Ines Montani
f4012ba054
Update README.md [ci skip]
2019-08-26 12:32:52 +02:00
Matthew Honnibal
af7fad2c6d
Set version to v2.2.0.dev1
2019-08-25 22:05:47 +02:00
Matthew Honnibal
71c0321ecf
Fix test
2019-08-25 22:03:37 +02:00
Matthew Honnibal
188a1cf297
Fix morphology for | features
2019-08-25 21:57:02 +02:00
Matthew Honnibal
095c63c6b8
Avoid making prepositions get the tag SCONJ
2019-08-25 21:56:47 +02:00
Matthew Honnibal
22250cf6b7
Make regression test less sensitive to tag-map stuff
2019-08-25 21:54:26 +02:00
Matthew Honnibal
4e2f07a655
Merge branch 'develop' into feature/lemmatizer
2019-08-25 21:03:25 +02:00
yanaiela
5d7bc26735
new universe project - the numeric fused-head (#4192)
...
* new universe project
* Update website/meta/universe.json
Co-Authored-By: Ines Montani <ines@ines.io>
* Update website/meta/universe.json
Co-Authored-By: Ines Montani <ines@ines.io>
2019-08-25 17:25:28 +02:00
Matthew Honnibal
9b5c94fed9
Add get-version script
2019-08-25 15:12:36 +02:00
Matthew Honnibal
7bc68913e3
Improve pex building in Makefile
2019-08-25 14:54:19 +02:00
Matthew Honnibal
b8edc8dffb
Require thinc 7.1
2019-08-25 14:54:09 +02:00
Matthew Honnibal
c308cf3e3e
Merge branch 'master' into feature/lemmatizer
2019-08-25 13:52:27 +02:00
Matthew Honnibal
f9075a6fd1
Update to blis 0.4 and thinc 7.1
2019-08-25 13:50:47 +02:00
Matthew Honnibal
08e8267a59
Set version to 2.2.0.dev0
2019-08-25 13:50:00 +02:00
Wannaphong Phatthiyaphaibun
d53c3fcbc1
Add Thai Language tokenizers (#4191)
...
Add th (pythainlp)
2019-08-25 11:35:21 +02:00
Christos Aridas
61f5c007a0
DOC Fix pipeline functions examples (#4189)
2019-08-23 19:15:32 +02:00
Matthew Honnibal
bb911e5f4e
Fix #3830: 'subtok' label being added even if learn_tokens=False (#4188)
...
* Prevent subtok label if not learning tokens
The parser introduces the subtok label to mark tokens that should be
merged during post-processing. Previously this happened even if we did
not have the --learn-tokens flag set. This patch passes the config
through to the parser, to prevent the problem.
* Make merge_subtokens a parser post-process if learn_subtokens
* Fix train script
* Add test for 3830: subtok problem
* Fix handling of non-subtok in parser training
2019-08-23 17:54:00 +02:00
Sofie Van Landeghem
c417c380e3
Matcher ID fixes (#4179)
...
* allow phrasematcher to link one match to multiple original patterns
* small fix for defining ent_id in the matcher (anti-ghost prevention)
* cleanup
* formatting
2019-08-22 17:17:07 +02:00
Ines Montani
f5d3afb1a3
Fix typo in docstrings [ci skip]
2019-08-22 16:24:15 +02:00
Ines Montani
5ca7dd0f94
💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167)
...
* Improve load_language_data helper
* WIP: Add Lookups implementation
* Start moving lemma data over to JSON
* WIP: move data over for more languages
* Convert more languages
* Fix lemmatizer fixtures in tests
* Finish conversion
* Auto-format JSON files
* Fix test for now
* Make sure tables are stored on instance
2019-08-22 14:21:32 +02:00
Sofie Van Landeghem
73b38c33e4
Small retokenizer fix (#4174)
2019-08-22 12:23:54 +02:00
Ines Montani
a8752a569d
Auto-format [ci skip]
2019-08-22 11:44:39 +02:00
Pavle Vidanović
60e10a9f93
Serbian language improvement (#4169)
...
* Serbian stopwords added. (Cyrillic alphabet)
* spaCy Contribution agreement included.
* Test initialize updated
* Serbian language code update. --bugfix
* Tokenizer exceptions added. Init file updated.
* Norm exceptions and lexical attributes added.
* Examples added.
* Tests added.
* sr_lang examples update.
* Tokenizer exceptions updated. (Serbian)
2019-08-22 11:43:07 +02:00