11719 Commits

Author SHA1 Message Date
Matthew Honnibal
7d6d438566 Set version to v2.2.0.dev2 2019-08-28 18:30:43 +02:00
Matthew Honnibal
bc5ce49859 Fix 'noise_level' in train cmd 2019-08-28 17:55:38 +02:00
Matthew Honnibal
782056d117 Fix morph rules 2019-08-28 16:59:45 +02:00
Matthew Honnibal
6b2ea883ed Merge pull request #4205 from adrianeboyd/feature/gold-train-orth-variants
Add train_docs() option to add orth variants
2019-08-28 16:54:06 +02:00
svlandeg
c54aabc3cd fix loading custom tokenizer rules/exceptions from file 2019-08-28 14:17:44 +02:00
svlandeg
7bec0ebbcb failing unit test for Issue 4190 2019-08-28 14:16:34 +02:00
Ines Montani
e055977851 Merge branch 'master' into spacy.io 2019-08-28 13:45:35 +02:00
Ines Montani
b91425f803 Update universe.json [ci skip] 2019-08-28 13:45:06 +02:00
Adriane Boyd
0a26e94d02 Modify raw to match orth variant annotation tuples
If raw is available, attempt to modify raw to match the orth variants.
If raw/words can't be aligned, abort and return unmodified
raw/annotation.
2019-08-28 13:38:54 +02:00
Ines Montani
406563964c Merge branch 'master' into spacy.io 2019-08-28 11:59:16 +02:00
Ines Montani
aedae8b4c5 Update universe.json [ci skip] 2019-08-28 11:59:06 +02:00
Adriane Boyd
47af3f676e Single and paired orth variants for German 2019-08-28 09:19:18 +02:00
Adriane Boyd
56c38484a1 Single and paired orth variants for English 2019-08-28 09:19:18 +02:00
Adriane Boyd
aae05ff16b Add train_docs() option to add orth variants
Filtering by orth and tag, create variants of training docs with
alternate orth variants, e.g., unicode quotes, dashes, and ellipses.

The variants can be single tokens (dashes) or paired tokens (quotes)
with left and right versions.

Currently restricted to only add variants to training documents without
raw text provided, where only gold.words needs to be modified.
2019-08-28 09:18:36 +02:00
Ines Montani
ad8d860a37 Merge branch 'master' into spacy.io 2019-08-27 14:05:06 +02:00
Björn Böing
bae0455f91 Fix visualizer options linking for displaCy. (#4202) 2019-08-27 14:04:28 +02:00
Ines Montani
06854202bb Merge branch 'master' into spacy.io 2019-08-27 12:13:55 +02:00
Ines Montani
8114933f01 Fix universe.json [ci skip] 2019-08-27 12:13:42 +02:00
Ines Montani
50242289bf Merge branch 'master' into spacy.io 2019-08-27 11:53:30 +02:00
Ines Montani
48385552c6 Update languages.json [ci skip] 2019-08-27 11:52:51 +02:00
Ines Montani
f4012ba054 Update README.md [ci skip] 2019-08-26 12:32:52 +02:00
Matthew Honnibal
af7fad2c6d Set version to v2.2.0.dev1 2019-08-25 22:05:47 +02:00
Matthew Honnibal
71c0321ecf Fix test 2019-08-25 22:03:37 +02:00
Matthew Honnibal
188a1cf297 Fix morphology for | features 2019-08-25 21:57:02 +02:00
Matthew Honnibal
095c63c6b8 Avoid making prepositions get the tag SCONJ 2019-08-25 21:56:47 +02:00
Matthew Honnibal
22250cf6b7 Make regression test less sensitive to tag-map stuff 2019-08-25 21:54:26 +02:00
Matthew Honnibal
4e2f07a655 Merge branch 'develop' into feature/lemmatizer 2019-08-25 21:03:25 +02:00
Ines Montani
d6b4e6b0dc Merge branch 'master' into spacy.io 2019-08-25 17:25:47 +02:00
yanaiela
5d7bc26735 new universe project - the numeric fused-head (#4192)
* new universe project

* Update website/meta/universe.json

Co-Authored-By: Ines Montani <ines@ines.io>

* Update website/meta/universe.json

Co-Authored-By: Ines Montani <ines@ines.io>
2019-08-25 17:25:28 +02:00
Matthew Honnibal
9b5c94fed9 Add get-version script 2019-08-25 15:12:36 +02:00
Matthew Honnibal
7bc68913e3 Improve pex building in Makefile 2019-08-25 14:54:19 +02:00
Matthew Honnibal
b8edc8dffb Require thinc 7.1 2019-08-25 14:54:09 +02:00
Matthew Honnibal
c308cf3e3e Merge branch 'master' into feature/lemmatizer 2019-08-25 13:52:27 +02:00
Matthew Honnibal
f9075a6fd1 Update to blis 0.4 and thinc 7.1 2019-08-25 13:50:47 +02:00
Matthew Honnibal
08e8267a59 Set version to 2.2.0.dev0 2019-08-25 13:50:00 +02:00
Wannaphong Phatthiyaphaibun
d53c3fcbc1 Add Thai Language tokenizers (#4191)
Add th (pythainlp)
2019-08-25 11:35:21 +02:00
Ines Montani
aa5d78ec5d Merge branch 'master' into spacy.io 2019-08-23 19:16:48 +02:00
Christos Aridas
61f5c007a0 DOC Fix pipeline functions examples (#4189) 2019-08-23 19:15:32 +02:00
Matthew Honnibal
bb911e5f4e Fix #3830: 'subtok' label being added even if learn_tokens=False (#4188)
* Prevent subtok label if not learning tokens

The parser introduces the subtok label to mark tokens that should be
merged during post-processing. Previously this happened even if we did
not have the --learn-tokens flag set. This patch passes the config
through to the parser, to prevent the problem.

* Make merge_subtokens a parser post-process if learn_subtokens

* Fix train script

* Add test for 3830: subtok problem

* Fix handling of non-subtok in parser training
2019-08-23 17:54:00 +02:00
Sofie Van Landeghem
c417c380e3 Matcher ID fixes (#4179)
* allow phrasematcher to link one match to multiple original patterns

* small fix for defining ent_id in the matcher (anti-ghost prevention)

* cleanup

* formatting
2019-08-22 17:17:07 +02:00
Ines Montani
f5d3afb1a3 Fix typo in docstrings [ci skip] 2019-08-22 16:24:15 +02:00
Ines Montani
5ca7dd0f94 💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167)
* Improve load_language_data helper

* WIP: Add Lookups implementation

* Start moving lemma data over to JSON

* WIP: move data over for more languages

* Convert more languages

* Fix lemmatizer fixtures in tests

* Finish conversion

* Auto-format JSON files

* Fix test for now

* Make sure tables are stored on instance
2019-08-22 14:21:32 +02:00
Sofie Van Landeghem
73b38c33e4 Small retokenizer fix (#4174) 2019-08-22 12:23:54 +02:00
Ines Montani
a8752a569d Auto-format [ci skip] 2019-08-22 11:44:39 +02:00
Pavle Vidanović
60e10a9f93 Serbian language improvement (#4169)
* Serbian stopwords added. (cyrillic alphabet)

* spaCy Contribution agreement included.

* Test initialize updated

* Serbian language code update. --bugfix

* Tokenizer exceptions added. Init file updated.

* Norm exceptions and lexical attributes added.

* Examples added.

* Tests added.

* sr_lang examples update.

* Tokenizer exceptions updated. (Serbian)
2019-08-22 11:43:07 +02:00
Sofie Van Landeghem
de272f8b82 adding double match for optional operator at the end (#4166) 2019-08-21 22:46:56 +02:00
Sofie Van Landeghem
01c5980187 Serialize POS attribute when doc.is_tagged (#4092)
* fix and unit test for issue 3959

* additional unit test for manifestation of the same (resolved) bug
2019-08-21 21:59:30 +02:00
Sofie Van Landeghem
7539a4f3a8 use states[q] in while retry loop (#4162) 2019-08-21 21:58:04 +02:00
Ines Montani
073e8d647c Merge branch 'master' into spacy.io 2019-08-21 21:36:10 +02:00
Ines Montani
b072c13017 Update universe with videos [ci skip] 2019-08-21 21:35:37 +02:00
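
Note on commit aae05ff16b (Add train_docs() option to add orth variants): the message describes filtering tokens by orth and tag, then swapping in alternate spellings, either for single tokens (dashes, ellipses) or for paired tokens (left and right quotes). The following is a minimal sketch of that idea only, not the code from the commit. The function name `add_orth_variants`, the two variant tables, and the tag values are all assumptions made for illustration; the commit message does not list the actual tables or signatures.

```python
import random

# Hypothetical variant tables (assumed for illustration, not taken from spaCy).
# Non-ASCII characters are written as escapes:
# \u2013 en dash, \u2014 em dash, \u2026 ellipsis, \u201c/\u201d curly quotes.
SINGLE_VARIANTS = [
    {"tags": {":"}, "variants": ["-", "\u2013", "\u2014"]},
    {"tags": {"NFP"}, "variants": ["...", "\u2026"]},
]
PAIRED_VARIANTS = [
    {"tags": ("``", "''"), "variants": [('"', '"'), ("\u201c", "\u201d")]},
]


def add_orth_variants(words, tags, rng):
    """Return a copy of `words` with matching tokens replaced by one variant.

    A token is replaced only if both its tag and its current orth belong to
    the same variant group. One variant is drawn per group per call, so all
    matching tokens in a document end up consistent with each other.
    """
    out = list(words)  # never modify the caller's list
    for group in SINGLE_VARIANTS:
        choice = rng.choice(group["variants"])
        for i in range(len(out)):
            if tags[i] in group["tags"] and out[i] in group["variants"]:
                out[i] = choice
    for group in PAIRED_VARIANTS:
        left_tag, right_tag = group["tags"]
        left, right = rng.choice(group["variants"])
        lefts = {pair[0] for pair in group["variants"]}
        rights = {pair[1] for pair in group["variants"]}
        for i in range(len(out)):
            if tags[i] == left_tag and out[i] in lefts:
                out[i] = left
            elif tags[i] == right_tag and out[i] in rights:
                out[i] = right
    return out


words = ['"', "Hi", "-", "there", '"']
tags = ["``", "UH", ":", "RB", "''"]
variant = add_orth_variants(words, tags, random.Random(0))
```

Drawing the left and right quote as one pair keeps opening and closing marks matched, which is presumably why the commit treats paired tokens separately from single ones. Like the commit, the sketch only rewrites the word list and makes no attempt to update raw text.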