Commit Graph

11054 Commits

Author SHA1 Message Date
Björn Böing
bae0455f91 Fix visualizer options linking for displaCy. (#4202) 2019-08-27 14:04:28 +02:00
Ines Montani
06854202bb Merge branch 'master' into spacy.io 2019-08-27 12:13:55 +02:00
Ines Montani
8114933f01 Fix universe.json [ci skip] 2019-08-27 12:13:42 +02:00
Ines Montani
50242289bf Merge branch 'master' into spacy.io 2019-08-27 11:53:30 +02:00
Ines Montani
48385552c6 Update languages.json [ci skip] 2019-08-27 11:52:51 +02:00
Ines Montani
f4012ba054 Update README.md [ci skip] 2019-08-26 12:32:52 +02:00
Matthew Honnibal
af7fad2c6d Set version to v2.2.0.dev1 2019-08-25 22:05:47 +02:00
Matthew Honnibal
71c0321ecf Fix test 2019-08-25 22:03:37 +02:00
Matthew Honnibal
188a1cf297 Fix morphology for | features 2019-08-25 21:57:02 +02:00
Matthew Honnibal
095c63c6b8 Avoid making prepositions get the tag SCONJ 2019-08-25 21:56:47 +02:00
Matthew Honnibal
22250cf6b7 Make regression test less sensitive to tag-map stuff 2019-08-25 21:54:26 +02:00
Matthew Honnibal
4e2f07a655 Merge branch 'develop' into feature/lemmatizer 2019-08-25 21:03:25 +02:00
Ines Montani
d6b4e6b0dc Merge branch 'master' into spacy.io 2019-08-25 17:25:47 +02:00
yanaiela
5d7bc26735 new universe project - the numeric fused-head (#4192)
* new universe project

* Update website/meta/universe.json

Co-Authored-By: Ines Montani <ines@ines.io>

* Update website/meta/universe.json

Co-Authored-By: Ines Montani <ines@ines.io>
2019-08-25 17:25:28 +02:00
Matthew Honnibal
9b5c94fed9 Add get-version script 2019-08-25 15:12:36 +02:00
Matthew Honnibal
7bc68913e3 Improve pex building in Makefile 2019-08-25 14:54:19 +02:00
Matthew Honnibal
b8edc8dffb Require thinc 7.1 2019-08-25 14:54:09 +02:00
Matthew Honnibal
c308cf3e3e Merge branch 'master' into feature/lemmatizer 2019-08-25 13:52:27 +02:00
Matthew Honnibal
f9075a6fd1 Update to blis 0.4 and thinc 7.1 2019-08-25 13:50:47 +02:00
Matthew Honnibal
08e8267a59 Set version to 2.2.0.dev0 2019-08-25 13:50:00 +02:00
Wannaphong Phatthiyaphaibun
d53c3fcbc1 Add Thai Language tokenizers (#4191)
Add th (pythainlp)
2019-08-25 11:35:21 +02:00
Ines Montani
aa5d78ec5d Merge branch 'master' into spacy.io 2019-08-23 19:16:48 +02:00
Christos Aridas
61f5c007a0 DOC Fix pipeline functions examples (#4189) 2019-08-23 19:15:32 +02:00
Matthew Honnibal
bb911e5f4e Fix #3830: 'subtok' label being added even if learn_tokens=False (#4188)
* Prevent subtok label if not learning tokens

The parser introduces the subtok label to mark tokens that should be
merged during post-processing. Previously this happened even if we did
not have the --learn-tokens flag set. This patch passes the config
through to the parser, to prevent the problem.

* Make merge_subtokens a parser post-process if learn_subtokens

* Fix train script

* Add test for 3830: subtok problem

* Fix handling of non-subtok in parser training
2019-08-23 17:54:00 +02:00
Sofie Van Landeghem
c417c380e3 Matcher ID fixes (#4179)
* allow phrasematcher to link one match to multiple original patterns

* small fix for defining ent_id in the matcher (anti-ghost prevention)

* cleanup

* formatting
2019-08-22 17:17:07 +02:00
Ines Montani
f5d3afb1a3 Fix typo in docstrings [ci skip] 2019-08-22 16:24:15 +02:00
Ines Montani
5ca7dd0f94 💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167)
* Improve load_language_data helper

* WIP: Add Lookups implementation

* Start moving lemma data over to JSON

* WIP: move data over for more languages

* Convert more languages

* Fix lemmatizer fixtures in tests

* Finish conversion

* Auto-format JSON files

* Fix test for now

* Make sure tables are stored on instance
2019-08-22 14:21:32 +02:00
Sofie Van Landeghem
73b38c33e4 Small retokenizer fix (#4174) 2019-08-22 12:23:54 +02:00
Ines Montani
a8752a569d Auto-format [ci skip] 2019-08-22 11:44:39 +02:00
Pavle Vidanović
60e10a9f93 Serbian language improvement (#4169)
* Serbian stopwords added. (cyrillic alphabet)

* spaCy Contribution agreement included.

* Test initialize updated

* Serbian language code update. --bugfix

* Tokenizer exceptions added. Init file updated.

* Norm exceptions and lexical attributes added.

* Examples added.

* Tests added.

* sr_lang examples update.

* Tokenizer exceptions updated. (Serbian)
2019-08-22 11:43:07 +02:00
Sofie Van Landeghem
de272f8b82 adding double match for optional operator at the end (#4166) 2019-08-21 22:46:56 +02:00
Sofie Van Landeghem
01c5980187 Serialize POS attribute when doc.is_tagged (#4092)
* fix and unit test for issue 3959

* additional unit test for manifestation of the same (resolved) bug
2019-08-21 21:59:30 +02:00
Sofie Van Landeghem
7539a4f3a8 use states[q] in while retry loop (#4162) 2019-08-21 21:58:04 +02:00
Ines Montani
073e8d647c Merge branch 'master' into spacy.io 2019-08-21 21:36:10 +02:00
Ines Montani
b072c13017 Update universe with videos [ci skip] 2019-08-21 21:35:37 +02:00
adrianeboyd
2d17b047e2 Check for is_tagged/is_parsed for Matcher attrs (#4163)
Check for relevant components in the pipeline when Matcher is called,
similar to the checks for PhraseMatcher in #4105.

* keep track of attributes seen in patterns

* when Matcher is called on a Doc, check for is_tagged for LEMMA, TAG,
POS and for is_parsed for DEP
2019-08-21 20:52:36 +02:00
Pavle Vidanović
4fe9329bfb Serbian language code update "rs" -> "sr" (#4159)
* Serbian stopwords added. (cyrillic alphabet)

* spaCy Contribution agreement included.

* Test initialize updated

* Serbian language code update. --bugfix
2019-08-21 19:57:37 +02:00
Matthew Honnibal
bcd08f20af Merge changes from master 2019-08-21 14:18:52 +02:00
adrianeboyd
8fe7bdd0fa Improve token pattern checking without validation (#4105)
* Fix typo in rule-based matching docs

* Improve token pattern checking without validation

Add more detailed token pattern checks without full JSON pattern validation and
provide more detailed error messages.

Addresses #4070 (also related: #4063, #4100).

* Check whether top-level attributes in patterns and attr for PhraseMatcher are
  in token pattern schema

* Check whether attribute value types are supported in general (as opposed to
  per attribute with full validation)

* Report various internal error types (OverflowError, AttributeError, KeyError)
  as ValueError with standard error messages

* Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS,
  LEMMA, and DEP

* Add error messages with relevant details on how to use validate=True or nlp()
  instead of nlp.make_doc()

* Support attr=TEXT for PhraseMatcher

* Add NORM to schema

* Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler

* Remove unnecessary .keys()

* Rephrase error messages

* Add another type check to Matcher

Add another type check to Matcher for more understandable error messages
in some rare cases.

* Support phrase_matcher_attr=TEXT for EntityRuler

* Don't use spacy.errors in examples and bin scripts

* Fix error code

* Auto-format

Also try to get Azure pipelines to finally start a build :(

* Update errors.py


Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2019-08-21 14:00:37 +02:00
Ines Montani
39619be14d Merge branch 'master' into spacy.io 2019-08-21 12:53:51 +02:00
Ines Montani
3134a9b6e0 Add section on expanding regex match to token boundaries (see #4158) [ci skip] 2019-08-21 12:53:31 +02:00
Ines Montani
f580302673 Tidy up and auto-format 2019-08-20 17:36:34 +02:00
Ines Montani
364aaf5bc2 Simplify test 2019-08-20 16:41:58 +02:00
Sofie Van Landeghem
68ee0384fd Unit test for Issue 3879 (#4153)
* failing unit test for Issue #3879

* mark test as failing
2019-08-20 16:40:25 +02:00
Ines Montani
86cd7f0efd Add regression test for #4120 2019-08-20 16:33:09 +02:00
Ines Montani
104125edd2 Tidy up errors 2019-08-20 16:03:45 +02:00
Ines Montani
cc76a26fe8 Raise error for negative arc indices (closes #3917) 2019-08-20 15:51:37 +02:00
Ines Montani
69e70ffae1 Merge branch 'master' of https://github.com/explosion/spaCy 2019-08-20 15:09:52 +02:00
Ines Montani
f65e36925d Fix absolute imports and avoid importing from cli 2019-08-20 15:08:59 +02:00
Ines Montani
7e8be44218 Auto-format 2019-08-20 15:06:31 +02:00