Commit Graph

5706 Commits

Author SHA1 Message Date
Matthew Honnibal
ad068f51be Fix out-of-bounds access in NER training
The helper method state.B(1) gets the index of the first token of the
buffer, or -1 if no such token exists. Normally this is safe because we
pass this to functions like state.safe_get(), which returns an empty
token. Here we used it directly as an array index, which is not okay!

This error may have been the cause of out-of-bounds access errors during
training. Similar errors may still be around, so much be hunted down.
Hunting this one down took a long time...I printed out values across
training runs and diffed, looking for points of divergence between
runs, when no randomness should be allowed.
2018-10-27 00:46:30 +02:00
Grivaz
57f274b693 raise error when setting overlapping entities as doc.ents (#2880) 2018-10-26 23:29:16 +02:00
Ines Montani
48b1bc44d3 Update version to 2.0.16 2018-10-15 14:39:25 +02:00
Ines Montani
a0f6647160 Increment version 2018-10-15 14:20:55 +02:00
Ines Montani
7bc7fa8f1e Increment version 2018-10-15 01:40:44 +02:00
Matthew Honnibal
8612b75890 Set version to 2.0.14 2018-10-15 00:10:04 +02:00
Matthew Honnibal
d6e9cf8b09 Set version to 2.0.14.dev1 2018-10-15 00:09:02 +02:00
Matthew Honnibal
8ccfa52d19 Unhack prefer_gpu 2018-10-14 23:27:09 +02:00
Matthew Honnibal
41adf3572b Set version to v2.0.14 2018-10-14 23:15:34 +02:00
Matthew Honnibal
38aa835ada Workaround bug in thinc require_gpu 2018-10-14 23:15:08 +02:00
Matthew Honnibal
91593b7378 Add tests for prefer_gpu() and require_gpu() 2018-10-14 23:05:22 +02:00
Matthew Honnibal
62c70b3163 Import prefer_gpu and require_gpu functions from Thinc 2018-10-14 23:03:06 +02:00
Ines Montani
295da0f11b Increment version to 2.0.14.dev0 2018-10-14 16:37:46 +02:00
Matthew Honnibal
7de0dcb91f Merge branch 'master' of https://github.com/explosion/spaCy 2018-10-14 16:12:23 +02:00
Keshan
cb075c8e72 Adding "This is a sentence" example to Sinhala (#2846) 2018-10-14 00:06:40 +02:00
Matthew Honnibal
9cfab5933a Set version to 2.0.13 2018-10-13 19:42:16 +02:00
Matthew Honnibal
6a6ae5b0af Merge branch 'master' of https://github.com/explosion/spaCy 2018-10-13 19:41:00 +02:00
mauryaland
36514b5762 Rule-based French Lemmatizer (#2818)
<!--- Provide a general summary of your changes in the title. -->

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

Add a rule-based French Lemmatizer following the english one and the excellent PR for [greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class.

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

- Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version.
- Add several files containing exhaustive list of words for each part of speech 
- Add some lemma rules
- Add POS that are not checked in the standard Lemmatizer, i.e PRON, DET, ADV and AUX
- Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentionned
- Modify the lemmatize function to check in lookup table as a last resort
- Init files are updated so the model can support all the functionalities mentioned above
- Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [X] I have submitted the spaCy Contributor Agreement.
- [X] I ran the tests, and all new and existing tests passed.
- [X] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-10-13 16:38:21 +02:00
Matthew Honnibal
de46286107 Merge branch 'master' of https://github.com/explosion/spaCy 2018-10-13 16:11:16 +02:00
Ines Montani
cb57b35bb8 Also include lowercase norm exceptions 2018-10-13 15:37:30 +02:00
JKhakpour
74a30d883c Add Persian(Farsi) language support (#2797) 2018-10-13 15:31:49 +02:00
Matthew Honnibal
c3ddf98b1e Set version to 2.0.13.dev4 2018-10-13 15:20:59 +02:00
Marina Lysyuk
b76fe08308 Correcting lang/ru/examples.py (#2845)
* Correct some grammatical inaccuracies in lang\ru\examples.py; filled Contributor Agreement

* Correct some grammatical inaccuracies in lang\ru\examples.py

* Move contributor agreement to separate file
2018-10-13 15:19:43 +02:00
Matthew Honnibal
67ddce68d8 Unskip test 2018-10-02 23:47:55 +02:00
Matthew Honnibal
4cf5ce2cc2 Revert "Remove problematic test"
This reverts commit bdebbef455.
2018-10-02 23:47:24 +02:00
Matthew Honnibal
bdebbef455 Remove problematic test 2018-10-02 23:16:29 +02:00
Matthew Honnibal
6afc6ffe56 Skip seemingly problematic test 2018-10-02 22:33:40 +02:00
Matthew Honnibal
9e4079ddb2 Merge branch 'master' of https://github.com/explosion/spaCy 2018-10-02 19:44:43 +02:00
Matthew Honnibal
40f228c2f2 Set version to 2.0.13.dev3 2018-10-02 19:44:25 +02:00
Ines Montani
ea20b72c08 💫 Make like_num work for prefixed numbers (#2808)
* Only split + prefix if not numbers

* Make like_num work for prefixed numbers

* Add test for like_num
2018-10-01 10:49:14 +02:00
Filipe Caixeta
6c498f9ff4 Update Portuguese Language (#2790)
* Add words to portuguese language _num_words

* Add words to portuguese language _num_words

* Portuguese - Add/remove stopwords, fix tokenizer, add currency symbols

* Extended punctuation and norm_exceptions in the Portuguese language
2018-09-29 09:51:45 +02:00
Matthew Honnibal
b39810d692 Fix copy_reg compatibility on _serialize module 2018-09-28 15:23:14 +02:00
Matthew Honnibal
f82f8ba5dd Fix serialization when empty parser model. Closes #2482 2018-09-28 15:18:52 +02:00
Matthew Honnibal
d5a6c63b62 Add regression test for #2482 2018-09-28 15:18:30 +02:00
Matthew Honnibal
e3e9fe18d4 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-09-28 14:27:35 +02:00
Matthew Honnibal
0323f5be0c Fix _serialize module 2018-09-28 14:27:24 +02:00
Ines Montani
5d56eb70d7 Tidy up tests 2018-09-27 16:41:57 +02:00
Ines Montani
1f1bab9264 Remove unused import 2018-09-27 16:41:37 +02:00
Matthew Honnibal
6430b1fe64 Restore encoding arg on msgpack-numpy 2018-09-27 15:58:21 +02:00
Matthew Honnibal
2ac69facc6 Fix Python 2 test failure 2018-09-27 15:34:16 +02:00
Matthew Honnibal
72778375fb Merge branch 'master' of https://github.com/explosion/spaCy 2018-09-27 13:54:49 +02:00
Matthew Honnibal
96fe314d8d Fix bug when too many entity types. Fixes #2800 2018-09-27 13:54:34 +02:00
Suraj Rajan
bbdc6456c6 Set up dependency tree pattern matching skeleton (#2732) 2018-09-27 13:27:18 +02:00
Matthew Honnibal
8809dc4514 Remove deprecated encoding argument to msgpack 2018-09-27 12:56:23 +02:00
Matthew Honnibal
bae6b3e2b3 Merge branch 'master' of https://github.com/explosion/spaCy 2018-09-27 12:50:31 +02:00
Ines Montani
71cdbeada7 Revert "Also include lowercase norm exceptions"
This reverts commit 70f4e8adf3.
2018-09-27 12:29:25 +02:00
darindf
8227566805 Fix error (#2802)
* Fix error
ValueError: cannot resize an array that references or is referenced
by another array in this way.  Use the resize function

* added spaCy Contributor Agreement
2018-09-26 21:31:03 +02:00
Ines Montani
5e0dfb34fa Merge branch 'master' of https://github.com/explosion/spaCy 2018-09-26 11:13:58 +02:00
Ines Montani
70f4e8adf3 Also include lowercase norm exceptions 2018-09-25 12:22:02 +02:00
Keshan
9a016d17c2 Adding basic support for Sinhala language. (#2788)
* adding Sinhala language package, stop words, examples and lex_attrs.

* Adding contributor agreement

* Updating contributor agreement
2018-09-25 12:18:25 +02:00
Matthew Honnibal
b42c123e5d Fix regression introduced by 1759abf1e 2018-09-25 11:08:58 +02:00
Matthew Honnibal
500898907b Fix regression in parser.begin_training() 2018-09-25 11:08:31 +02:00
Ines Montani
3c4e3ade30 Fix typo (closes #2784) 2018-09-21 10:45:11 +02:00
mauryaland
68b3c544d5 Adding French hyphenated first name (#2786) 2018-09-21 10:38:13 +02:00
Matthew Honnibal
1759abf1e5 Fix bug in sentence starts for non-projective parses
The set_children_from_heads function assumed parse trees were
projective. However, non-projective parses may be passed in during
deserialization, or after deprojectivising. This caused incorrect
sentence boundaries to be set for non-projective parses. Close #2772.
2018-09-19 14:50:06 +02:00
Matthew Honnibal
48fd36bf05 Fix test for issue 27772 2018-09-19 14:47:27 +02:00
Matthew Honnibal
6cd920e088 Add xfail test for deprojectivization SBD bug 2018-09-19 14:00:31 +02:00
Matthew Honnibal
99a6011580 Avoid adding empty layer in model, to keep models backwards compatible 2018-09-14 22:51:58 +02:00
Matthew Honnibal
c046392317 Trigger on_data hooks in parser model 2018-09-14 20:51:21 +02:00
Matthew Honnibal
5afd98dff5 Add a stepping function, for changing batch sizes or learning rates 2018-09-14 18:37:16 +02:00
Matthew Honnibal
27c00f4f22 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-09-14 12:30:57 +02:00
Andrew Ongko
81564cc4e8 Update Indonesian model (#2752)
* adding e-KTP in tokenizer exceptions list

* add exception token

* removing lines with containing space as it won't matter since we use .split() method in the end, added new tokens in exception

* add tokenizer exceptions list

* combining base_norms with norm_exceptions

* adding norm_exception

* fix double key in lemmatizer

* remove unused import on punctuation.py

* reformat stop_words to reduce number of lines, improve readibility

* updating tokenizer exception

* implement is_currency for lang/id

* adding orth_first_upper in tokenizer_exceptions

* update the norm_exception list

* remove bunch of abbreviations

* adding contributors file
2018-09-14 12:30:32 +02:00
Filipe Caixeta
fe515085f3 Add words to portuguese language _num_words (#2759)
* Add words to portuguese language _num_words

* Add words to portuguese language _num_words
2018-09-14 12:30:16 +02:00
Matthew Honnibal
f32b52e611 Fix bug that caused deprojectivisation to run multiple times 2018-09-14 12:12:54 +02:00
Matthew Honnibal
8f2a6367e9 Fix usage of PyTorch BiLSTM in ud_train 2018-09-13 22:54:59 +00:00
Matthew Honnibal
afeddfff26 Fix PyTorch BiLSTM 2018-09-13 22:54:34 +00:00
Matthew Honnibal
a26fe8e7bb Small hack in Language.update to make torch work 2018-09-13 22:51:52 +00:00
Matthew Honnibal
445b81ce3f Support bilstm_depth argument in ud-train 2018-09-13 19:30:22 +02:00
Matthew Honnibal
b43643a953 Support bilstm_depth option in parser 2018-09-13 19:29:49 +02:00
Matthew Honnibal
45032fe9e1 Support option of BiLSTM in Tok2Vec (requires pytorch) 2018-09-13 19:28:35 +02:00
Matthew Honnibal
3eb9f3e2b8 Fix defaults for ud-train 2018-09-13 18:05:48 +02:00
Matthew Honnibal
59cf533879 Improve ud-train script. Make config optional 2018-09-13 14:24:08 +02:00
Matthew Honnibal
3e3a309764 Fix tagger 2018-09-13 14:14:38 +02:00
Matthew Honnibal
da7650e84b Fix maximum doc length in ud_train script 2018-09-13 14:10:25 +02:00
Matthew Honnibal
a95eea4c06 Fix multi-task objective for parser 2018-09-13 14:08:55 +02:00
Matthew Honnibal
21321cd6cf Add tok2vec property to parser model 2018-09-13 14:08:43 +02:00
Matthew Honnibal
d6aa60139d Fix tagger training on GPU 2018-09-13 14:05:37 +02:00
Grivaz
aeba99ab0d Introduces a bulk merge function, in order to solve issue #653 (#2696)
* Fix comment

* Introduce bulk merge to increase performance on many span merges

* Sign contributor agreement

* Implement pull request suggestions
2018-09-10 16:41:42 +02:00
tyburam
476472d181 Lex _attrs for polish language (#2750)
* Signed spaCy contributor agreement

* Added polish version of english lex_attrs
2018-09-10 11:53:57 +02:00
Sainath Adapa
77139bc03c Basic support for Telugu language (#2751) 2018-09-10 11:53:18 +02:00
Maxim Kupfer
cebe50b5b8 Remove ')' for clarity (#2737)
Sorry, don't mean to be nitpicky, I just noticed this when going through the CLI and thought it was a quick fix. That said, if this was intention than please let me know.
2018-09-10 11:31:49 +02:00
Matthew Honnibal
b2cb1fc67d Merge matcher tests 2018-09-06 01:39:53 +02:00
Suraj Krishnan Rajan
356af7b0a1 Fix tests 2018-09-06 01:39:36 +02:00
Piotr Żelasko
bdb2165bd1 Less norm computations in token similarity (#2730)
* Less norm computations in token similarity

* Contributor agreement
2018-09-05 21:50:23 +02:00
Aniruddha Adhikary
4530ddcc51 update bengali token rules for hyphen and digits (#2731) 2018-09-05 21:49:00 +02:00
Nathaniel J. Smith
26849874ad When calling getoption() in conftest.py, pass a default option (#2709)
* When calling getoption() in conftest.py, pass a default option

This is necessary to allow testing an installed spacy by running:

  pytest --pyargs spacy

* Add contributor agreement
2018-09-03 09:57:52 +02:00
Matthew Honnibal
4d2d7d5866 Fix new feature flags 2018-08-27 02:12:39 +02:00
Matthew Honnibal
598dbf1ce0 Fix character-based tokenization for Japanese 2018-08-27 01:51:38 +02:00
Matthew Honnibal
3763e20afc Pass subword_features and conv_depth params 2018-08-27 01:51:15 +02:00
Matthew Honnibal
8051136d70 Support subword_features and conv_depth params in Tok2Vec 2018-08-27 01:50:48 +02:00
Matthew Honnibal
9c33d4d1df Add more hyper-parameters to spacy ud-train
* subword_features: Controls whether subword features are used in the
word embeddings. True by default (specifically, prefix, suffix and word
shape). Should be set to False for languages like Chinese and Japanese.

* conv_depth: Depth of the convolutional layers. Defaults to 4.
2018-08-27 01:48:46 +02:00
Ines Montani
e9022f7b33 Remove docstrings for deprecated arguments (see #2703) 2018-08-26 14:23:13 +02:00
Ines Montani
559f4139e3 Add FAC to spacy.explain (resolves #2706) 2018-08-26 14:13:50 +02:00
Matthew Honnibal
51a9efbf3b Add draft Binder class 2018-08-22 13:12:51 +02:00
Matthew Honnibal
5ce459d2ee Fix error in vocab 2018-08-16 17:18:09 +02:00
Matthew Honnibal
00febda2e3 Improve alignment around quotes 2018-08-16 01:04:34 +02:00
Matthew Honnibal
66a3f2ba21 Lower-case text before alignment 2018-08-16 00:42:36 +02:00
Matthew Honnibal
595c893791 Expose noise_level option in train CLI 2018-08-16 00:41:44 +02:00
Matthew Honnibal
8365226bf3 Fix lookup of symbols in vocab. 2018-08-15 23:43:34 +02:00
Matthew Honnibal
b9f0588580 Set version to v2.1.0a1 2018-08-15 17:22:39 +02:00
Matthew Honnibal
e968016417 Note link between issues #2671 and #2675 2018-08-15 17:18:28 +02:00
Matthew Honnibal
63bdc734ba Skip flakey test 2018-08-15 16:56:55 +02:00
Matthew Honnibal
ce512e1d47 Fix #2671: Incorrect match ID on some patterns 2018-08-15 16:19:08 +02:00
Matthew Honnibal
f12b9190f6 Xfail test for issue #2671 2018-08-15 15:55:31 +02:00
Matthew Honnibal
7cfa665ce6 Add failing test for issue 2671: Incorrect rule ID returned from matcher 2018-08-15 15:54:33 +02:00
Matthew Honnibal
1b2a5869ab Set version to v2.1.0a2.dev0 2018-08-15 15:38:52 +02:00
Matthew Honnibal
5080760288 Add extra comment on 'add label' in parser 2018-08-15 15:37:24 +02:00
Matthew Honnibal
6e749d3c70 Skip flakey parser test 2018-08-15 15:37:04 +02:00
Matthew Honnibal
6ea981c839 Add converter for jsonl NER data 2018-08-14 14:04:32 +02:00
Matthew Honnibal
a9fb6d5511 Fix docs2jsonl function 2018-08-14 14:03:48 +02:00
Matthew Honnibal
ea2edd1e2c Merge branch 'feature/docs_to_json' into develop 2018-08-14 13:23:42 +02:00
Matthew Honnibal
6ec236ab08 Fix label-clobber bug in parser.begin_training()
The parser.begin_training() method was rewritten in v2.1. The rewrite
introduced a regression, where if you added labels prior to
begin_training(), these labels were discarded. This patch fixes that.
2018-08-14 13:20:19 +02:00
Matthew Honnibal
02c5c114d0 Fix usage of deprecated freqs.txt in init-model 2018-08-14 13:19:15 +02:00
Matthew Honnibal
2a5a61683e Add function to get train format from Doc objects
Our JSON training format is annoying to work with, and we've wanted to
retire it for some time. In the meantime, we can at least add some
missing functions to make it easier to live with.

This patch adds a function that generates the JSON format from a list
of Doc objects, one per paragraph. This should be a convenient way to handle
a lot of data conversions: whatever format you have the source
information in, you can use it to setup a Doc object. This approach
should offer better future-proofing as well. Hopefully, we can steadily
rewrite code that is sensitive to the current data-format, so that it
instead goes through this function. Then when we change the data format,
we won't have such a problem.
2018-08-14 13:13:10 +02:00
Matthew Honnibal
4336397ecb Update develop from master 2018-08-14 03:04:28 +02:00
Matthew Honnibal
13fa550b36 Merge branch 'master' of https://github.com/explosion/spaCy 2018-08-14 02:32:01 +02:00
Ioannis Daras
fe94e696d3 Optimize Greek language support (#2658) 2018-08-14 02:31:32 +02:00
Matthew Honnibal
85000ea13b Increment version to 2.0.13.dev2 2018-08-10 00:43:55 +02:00
Matthew Honnibal
c4ac981e6d Try again to filter warnings 2018-08-10 00:42:54 +02:00
Matthew Honnibal
ae7fc42a41 Increment version to v2.0.13.dev1 2018-08-10 00:14:31 +02:00
Matthew Honnibal
19f5046934 Undoing warning suppression, as doesnt really work 2018-08-10 00:13:34 +02:00
Matthew Honnibal
3fb828352d Set version to 2.0.13.dev0 2018-08-09 23:49:34 +02:00
Matthew Honnibal
1c0614ecd2 Catch numpy warning 2018-08-09 23:49:24 +02:00
Aashish Gangwani
6eebfc7bf4 Added numbers to ../lang/hi/lex_attrs.py (#2629)
I have added numbers in hindi lex_attrs.py file according to Indian numbering system(https://en.wikipedia.org/wiki/Indian_numbering_system) and here are there english translations:
'शून्य' => zero 
'एक' => one
'दो' => two
'तीन' => three
 'चार' => four
'पांच' => five
'छह' => six
'सात'=>seven 
'आठ' => eight
'नौ' => nine
'दस' => ten
'ग्यारह' => eleven
'बारह' => twelve
 'तेरह' => thirteen
'चौदह' => fourteen
'पंद्रह' => fifteen
'सोलह'=> sixteen
'सत्रह' => seventeen
'अठारह' => eighteen
'उन्नीस' => nineteen
'बीस' => twenty
 'तीस' => thirty
'चालीस' => forty
'पचास' => fifty
'साठ' => sixty
'सत्तर' => seventy
'अस्सी' => eighty
'नब्बे' => ninety
'सौ' => hundred
'हज़ार' => thousand
'लाख' => hundred thousand
'करोड़' => ten million
'अरब' => billion
'खरब' => hundred billion

<!--- Provide a general summary of your changes in the title. -->

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-08-08 16:06:11 +02:00
Emil Stenström
3834f4146d Add abbreviations from UD_Swedish-Talbanken (#2613)
* Add abbreviations from UD_Swedish-Talbanken

* Add contributor agreement.
2018-08-07 13:53:17 +02:00
Ole Henrik Skogstrøm
0473add369 Feature/span ents (#2599)
* Created Span.ents property

* Add tests for span.ents

* Add tests for start and end of sentence
2018-08-07 13:52:32 +02:00
Xiaoquan Kong
87fa847e6e Fix Chinese language related bugs (#2634) 2018-08-07 11:26:31 +02:00
Xiaoquan Kong
f0c9652ed1 New Feature: display more detail when Error E067 (#2639)
* Fix off-by-one error

* Add verbose option

* Update verbose option

* Update documents for verbose option
2018-08-07 10:45:29 +02:00
Emil Stenström
1914c488d3 Swedish: Exceptions for single letter words ending sentence (#2615)
* Exceptions for single letter words ending sentence

Sentences ending in "i." (as in "... peka i."), "m." (as in "...än 2000 m."), should be tokenized as two separate tokens.

* Add test
2018-08-05 14:14:30 +02:00
Matthew Honnibal
860f5bd91f Add test for issue 2626 2018-08-05 13:46:57 +02:00
Kaisa (Katarzyna) Korsak
e531a827db Changed conllu2json to be able to extract NER tags (#2594)
* extract ner tags from conllu file if available

* fixed a bug in regex
2018-07-25 22:21:31 +02:00
Dmitry Bruhanov
07d0cc9de7 Update examples.py (#2597) 2018-07-25 22:20:24 +02:00
Matthew Honnibal
66983d8412
Port BenDerPan's Chinese changes to v2 (finally) (#2591)
* add  template files for Chinese

* add  template files for Chinese, and test directory .
2018-07-25 02:47:23 +02:00
ines
f2e3e039b7 Update French stop words (resolves #2540) 2018-07-24 23:41:51 +02:00
Ines Montani
75f3234404
💫 Refactor test suite (#2568)
## Description

Related issues: #2379 (should be fixed by separating model tests)

* **total execution time down from > 300 seconds to under 60 seconds** 🎉
* removed all model-specific tests that could only really be run manually anyway – those will now live in a separate test suite in the [`spacy-models`](https://github.com/explosion/spacy-models) repository and are already integrated into our new model training infrastructure
* changed all relative imports to absolute imports to prepare for moving the test suite from `/spacy/tests` to `/tests` (it'll now always test against the installed version)
* merged old regression tests into collections, e.g. `test_issue1001-1500.py` (about 90% of the regression tests are very short anyways)
* tidied up and rewrote existing tests wherever possible

### Todo

- [ ] move tests to `/tests` and adjust CI commands accordingly
- [x] move model test suite from internal repo to `spacy-models`
- [x] ~~investigate why `pipeline/test_textcat.py` is flakey~~
- [x] review old regression tests (leftover files) and see if they can be merged, simplified or deleted
- [ ] update documentation on how to run tests


### Types of change
enhancement, tests

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-24 23:38:44 +02:00
Matthew Honnibal
82277f63a3 💫 Small efficiency fixes to tokenizer (#2587)
This patch improves tokenizer speed by about 10%, and reduces memory usage in the `Vocab` by removing a redundant index. The `vocab._by_orth` and `vocab._by_hash` indexed on different data in v1, but in v2 the orth and the hash are identical.

The patch also fixes an uninitialized variable in the tokenizer, the `has_special` flag. This checks whether a chunk we're tokenizing triggers a special-case rule. If it does, then we avoid caching within the chunk. This check led to incorrectly rejecting some chunks from the cache. 

With the `en_core_web_md` model, we now tokenize the IMDB train data at 503,104k words per second. Prior to this patch, we had 465,764k words per second.

Before switching to the regex library and supporting more languages, we had 1.3m words per second for the tokenizer. In order to recover the missing speed, we need to:

* Fix the variable-length lookarounds in the suffix, infix and `token_match` rules
* Improve the performance of the `token_match` regex
* Switch back from the `regex` library to the `re` library.

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-24 23:35:54 +02:00
Matthew Honnibal
6303ce3d0e Try to fix memory error by moving fr_tokenizer to module scope 2018-07-24 20:09:06 +02:00
Matthew Honnibal
afe3fa4449 Merge branch 'master' of https://github.com/explosion/spaCy 2018-07-24 19:44:31 +02:00
Matthew Honnibal
b2e9e958b9 Add session scoping to tokenizers to try to fix oom on Appveyor 2018-07-24 19:44:18 +02:00
Ines Montani
a43ad114c2
Fix typo [ci skip] 2018-07-24 18:45:40 +02:00
Dmitry Bruhanov
27160b1516 added some widespread written jargon & dialectizms (#2584)
This jargon is not offencive but emotionally colored as funny due to its deviation from the norm for various reasons: immitating a dialect, deliberately wrong spelling emphasizing its low colloquial nature, obsolete form, foreign borrowing with native flections, etc.
Dmitry Briukhanov, Linguist & Pythonist
2018-07-24 18:44:29 +02:00
ines
3c30d1763c Merge branch 'master' into develop 2018-07-21 15:34:18 +02:00
Matthew Honnibal
90c269e1a9 Set about to v2.0.12 release 2018-07-21 15:09:42 +02:00
Matthew Honnibal
1a1c7304cf Set version to 2.0.12.dev1 2018-07-21 13:08:01 +02:00
ines
1ea881c80b Allow ignoring warnings and only overwrite if set explicitly 2018-07-20 22:50:19 +02:00
Matthew Honnibal
e0caf3ae8c Fix msgpack for new version 2018-07-20 17:32:00 +02:00
Matthew Honnibal
899f1cf442 Add regression test for issue 2179 2018-07-20 17:15:44 +02:00
Matthew Honnibal
9db77fd914 Fix deserialization for msgpack 2018-07-20 14:11:09 +02:00
katarkor
5ca853bee0 changed tag_map, morph_rules, lemmatizer for Norwegian (#2565)
* changed tag_map, morph_rules, lemmatizer for Norwegian

* Move unicode declaration up

Hopefully fixes test failure on Python 2

* Update CONTRIBUTOR_AGREEMENT.md

* Move unicode declarations

Hopefully fixes test this time

* Revert "Merge remote-tracking branch 'origin/patch-1'"

This reverts commit f5ccd5dd0d, reversing
changes made to dd07e180ea.

* Update contributor agreement [ci skip]
2018-07-19 19:38:24 +02:00
Ines Montani
e7b075565d
💫 Rule-based NER component (#2513)
* Add helper function for reading in JSONL

* Add rule-based NER component

* Fix whitespace

* Add component to factories

* Add tests

* Add option to disable indent on json_dumps compat

Otherwise, reading JSONL back in line by line won't work

* Fix error code
2018-07-18 19:43:16 +02:00
ines
d84b13e02c Merge branch 'master' into develop 2018-07-18 18:57:00 +02:00
Ole Henrik Skogstrøm
6e2930a4a2 Conll(u)-bio converter (#2525)
* Started simple conllxbiluo converter

* Fix missing BIO to BILUO conversion
2018-07-18 18:55:42 +02:00
ines
02aefe7cc0 Merge branch 'master' into develop 2018-07-18 18:52:59 +02:00
Ioannis Daras
6ed18412d0 Greek language optimizations (#2558)
* Greek language optimizations

* Add encoding on files containing greek words

* Add encoding on files containing greek words
2018-07-18 18:51:38 +02:00
ines
80e7485630 Merge branch 'master' into develop 2018-07-18 17:28:47 +02:00
Paul O'Leary McCann
61ef0739b8 Add Japanese stop words. (#2549)
List created by taking the 2000 top words from a Wikipedia dump and
removing everything that wasn't hiragana.

Tried going through kanji words and deciding what to keep but there were
too many obvious non-stopwords (東京 was in the top 500) and many other
words where it wasn't clear if they should be included or not.
2018-07-17 10:12:48 +02:00
Tero K
f35980f865 Enhancement/lang fi examples (#2547)
* Added a file with examples in finnish

* added contributor agreement
2018-07-15 09:50:27 +02:00
Paul O'Leary McCann
1987f3f784 Add Japanese lemmas (#2543)
This info was already available from Mecab, forgot to add it before.
2018-07-13 10:55:14 +02:00
ines
3a321e79ac Merge branch 'master' into develop 2018-07-10 13:49:08 +02:00
Eleni170
6042723535 Add support for Greek language (#2535)
* Add contributor agreement

* Support for Greek language

* Fix missing el_tokenizer
2018-07-10 13:48:38 +02:00
Stefan Schweter
3dfc7f86be lemmatizer: correct lemma for Rang (#2537)
<!--- Provide a general summary of your changes in the title. -->

## Description

This PR corrects the German lemma form for the word "Rang". Initially, the lemma form was "ringen", which is not correct, because it refers to the verb ("ringen") and not to the noun ("Rang").

### Types of change

The lemma form for "Rang" is corrected to "Rang", see also the [Duden](https://www.duden.de/rechtschreibung/Rang) entry.

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-10 13:11:19 +02:00
ines
fd6207426a Merge branch 'master' into develop 2018-07-09 18:05:10 +02:00
Duygu Altinok
00b9a58558 German lemmatizer additions (#2529)
* lemma of was-> was

* added new pairs issue @2486

* added article tests
2018-07-09 11:10:15 +02:00
Ole Henrik Skogstrøm
c21efea9bb Add sent property to token (#2521)
* Add sent property to token

* Refactored and cleaned up copy paste errors.
2018-07-06 15:54:15 +02:00
ines
38e07ade4c Add test for custom tokenizer serialization (resolves #2494) 2018-07-06 12:40:51 +02:00
ines
c2581f9172 Tidy up tokenizer test 2018-07-06 12:40:28 +02:00
Matthew Honnibal
43dcaa473e Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-07-06 12:36:42 +02:00
Matthew Honnibal
6c8d627733 Fix tokenizer deserialization 2018-07-06 12:36:33 +02:00
ines
c001d46153 Tidy up 2018-07-06 12:33:42 +02:00
Matthew Honnibal
63f5651f8d Fix tokenizer serialization 2018-07-06 12:32:11 +02:00
Matthew Honnibal
e1569fda4e Fix compile error in matcher 2018-07-06 12:29:23 +02:00
Matthew Honnibal
f5b2076700 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-07-06 12:23:14 +02:00
Matthew Honnibal
1a2f61725c Fix tokenizer serialization 2018-07-06 12:23:04 +02:00
ines
9e09477b2f Remove unused import 2018-07-06 12:18:17 +02:00
ines
26f04a6ac3 Fix Matcher tests and add test for any token with operator 2018-07-06 12:17:50 +02:00
Matthew Honnibal
f5703b7a91 Clean up unused stuff in matcher 2018-07-06 12:16:44 +02:00
Matthew Honnibal
08c362d541 Suppress compiler warning about unreachable code 2018-07-06 11:31:22 +02:00
Matthew Honnibal
8ae1bec8bf Fix init_model 2018-07-05 14:02:06 +02:00
Matthew Honnibal
7b09a4ca49 Fix lemmatization 2018-07-05 13:56:02 +02:00
Matthew Honnibal
ec41ceb383 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-07-05 13:49:42 +02:00
Matthew Honnibal
4eb3405df7 Fix lemmatizer ordering, re Issue #1387 2018-07-05 13:49:29 +02:00
ines
63666af328 Merge branch 'master' into develop 2018-07-04 14:52:25 +02:00
ines
8feb7cfe2d Remove model dependency from French lemmatizer tests 2018-07-04 14:46:45 +02:00
kleinay
a82c3153ad fix issue #2452 - displacy arrow direction is always forward (#2506) (closes #2452)
<!--- Provide a general summary of your changes in the title. -->
Referring #2452, fixing displacy arrow directions to match the input. 

## Description
The fix is simply replacing `direction is 'left'` with `direction == 'left'` to include the case `direction` is a `str` and not a `unicode`.

### Types of change
bug fix

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [ ] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-04 14:12:08 +02:00
Bùi Trung Chí
9af46b4f1b Fix loading tokenizer with custom prefix search (#2495)
* Add contributor agreement

* Fix loading tokenizer with cutom prefix search
2018-07-04 12:56:07 +02:00
Matthew Honnibal
dee8bdb900 Fix init-model for npz vectors 2018-07-04 02:29:48 +02:00
Matthew Honnibal
59d655e8d0 Fix model init from jsonl 2018-07-04 01:30:40 +02:00
Matthew Honnibal
1e38bea6e9 Save vectors init 2018-07-03 23:55:04 +02:00
Matthew Honnibal
6692833887 Fix init_model 2018-07-03 23:24:11 +02:00
Matthew Honnibal
4a38a26cb5 Fix init_model 2018-07-03 22:57:11 +02:00
Matthew Honnibal
019d09e3c3 Fix init model 2018-07-03 22:16:44 +02:00
Matthew Honnibal
2543f8c93a Support .npz vectors in init-model command 2018-07-03 21:42:16 +02:00
Matthew Honnibal
86aad11939 Fix init_model arg 2018-07-03 17:00:42 +02:00
Matthew Honnibal
eff42d36e3 Fix init model command 2018-07-03 16:32:23 +02:00
Matthew Honnibal
97487122ea Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-07-03 15:44:37 +02:00
Matthew Honnibal
6a89faf12e Add support for jsonl-formatted lexical attributes to init-model command. 2018-07-03 12:22:56 +02:00
Matthew Honnibal
2ec2192000 Revert #1389: Don't overrule rules when lemma exception is present 2018-06-29 19:43:02 +02:00
Matthew Honnibal
01ace9734d Make pipeline work on empty docs 2018-06-29 19:21:38 +02:00
Matthew Honnibal
a1b05048d0 Fix tagger when doc is empty 2018-06-29 16:05:40 +02:00
Matthew Honnibal
3786942ff1 Fix tagger when docs are empty 2018-06-29 15:13:45 +02:00
ines
526be40823 Add test for 46d8a66 2018-06-29 14:33:12 +02:00
ines
f08c871adf Fix typo in Language.from_disk 2018-06-29 14:32:16 +02:00
Matthew Honnibal
46d8a66fef Fix tokenizer serialization if token_match is None 2018-06-29 14:24:46 +02:00
Matthew Honnibal
e0860bcfb3 Fix bug when docs are empty 2018-06-29 13:56:29 +02:00
Matthew Honnibal
a4d2b0c293 Fix bug when docs are empty 2018-06-29 13:44:25 +02:00
Matthew Honnibal
c83fccfe2a Fix output of best model 2018-06-25 23:05:56 +02:00
Matthew Honnibal
5a65418c40 Fix handling of unseen labels in tagger 2018-06-25 22:28:59 +02:00
Matthew Honnibal
5b56aad4c2 Fix handling of unseen labels in tagger 2018-06-25 22:24:54 +02:00
Matthew Honnibal
3aabf621a3 Fix handling of unknown tags in tagger update 2018-06-25 22:01:02 +02:00
Matthew Honnibal
69c900f003 Fix init-model if no vectors provided 2018-06-25 18:26:02 +02:00
Matthew Honnibal
664f89327a Fix init-model if no vectors provided 2018-06-25 17:58:45 +02:00
Matthew Honnibal
c4698f5712 Don't collate model unless training succeeds 2018-06-25 16:36:42 +02:00
Ole Henrik Skogstrøm
d16cb6bee6 Accept Span to displacy render (#2478) (closes #2477)
* Add Span to displacy render

* Fix span support, errors and add tests
2018-06-25 14:55:16 +02:00
Matthew Honnibal
24dfbb8a28 Fix model collation 2018-06-25 14:35:24 +02:00
Matthew Honnibal
62237755a4 Import shutil 2018-06-25 13:40:17 +02:00
Matthew Honnibal
a040fca99e Import json into cli.train 2018-06-25 11:50:37 +02:00
Matthew Honnibal
2c703d99c2 Fix collation of best models 2018-06-25 01:21:34 +02:00
Matthew Honnibal
9d6a1c57f2 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-06-24 23:40:06 +02:00
Matthew Honnibal
2c80b7c013 Collate best model after training 2018-06-24 23:39:52 +02:00
Muhammad Irfan
f33c703066 Add Urdu Language Support (#2430)
* added Urdu language support.

* added Urdu language tests.

* modified conftest.py for Urdu language support.

* added spacy contributor agreement.
2018-06-22 11:14:03 +02:00
himkt
14d9007efd fix wrong indexing (#2416)
* fix wrong indexing

* add agreement
2018-06-19 10:20:57 +02:00
Aliia E
428bae66b5 Add Tatar Language Support (#2444)
* add Tatar lang support

* add Tatar letters

* add Tatar tests

* sign contributor agreement

* sign contributor agreement [x]

* remove comments from Language class

* remove all template comments
2018-06-19 10:17:53 +02:00
Cory Hurst
446f5ec41b Silent keyword in info function in init (#2459)
* Pass through "silent" kwarg to the wrapper in the spacy module init.
reference issue  #2196

* Pass through "silent" kwarg to the wrapper in the spacy module init.
reference issue  #2196

* contributor agreement
2018-06-18 12:24:21 +02:00
ines
778e5f4da3 Merge branch 'master' into develop 2018-06-11 00:38:04 +02:00
himkt
57311d5d47 replace janome with mecab in the documentation and the test (#2415)
* Add links to Reddit data (see #2401)

* replace janome with mecab in the documentation and the test

* add the assignment
2018-06-11 00:33:13 +02:00
Nour Shalabi
a169b79092 Additions to Arabic stop words. (#2422)
* Additions to Arabic stop words.

* Create nourshalabi.md
2018-06-08 02:33:23 +02:00
ines
a0017e4909 Merge branch 'master' into develop 2018-05-30 14:10:47 +02:00
ines
b8ef9c1000 Fix model names in conftest (see #2379) 2018-05-30 14:10:20 +02:00
ines
4a62486340 Merge branch 'master' into develop 2018-05-30 13:01:01 +02:00
Maciej
c7d53348d7 Fix bug in CLI iob and ner converter (#2392) (fixes #2385)
* issue_2385 add tests for iob_to_biluo converter function

* issue_2385 fix and modify iob_to_biluo function to accept either iob or biluo tags in cli.converter

* issue_2385 add test to fix b char bug

* add contributor agreement

* fill contributor agreement
2018-05-30 12:28:44 +02:00
ines
3c3a175018 Merge branch 'master' into develop 2018-05-28 18:37:09 +02:00
ansgar-t
9732988951 escape html in displacy.render (#2378) (closes #2361)
## Description
Fix for issue #2361 :
replace &, <, >, " with &amp;amp; , &amp;lt; , &amp;gt; , &amp;quot; in before rendering svg

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
(As discussed in the comments to #2361)
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-05-28 18:36:41 +02:00
ines
f7103babd9 Only overwrite warnings filter if set explicitly (resolves #2369)
This way, pre-defined warning filters are respected and users are still able to use the fine-grained warning settings if they like.
2018-05-26 18:44:15 +02:00
ines
330c039106 Merge branch 'master' into develop 2018-05-26 18:30:52 +02:00
James Messinger
4515e96e90 Better formatting for spacy train CLI (#2357)
* Better formatting for `spacy train` CLI

Changed to use fixed-spaces rather than tabs to align table headers and data.

### Before:
```
Itn.    P.Loss  N.Loss  UAS     NER P.  NER R.  NER F.  Tag %   Token %
0       4618.857        2910.004        76.172  79.645  67.987  88.732  88.261  100.000 4436.9  6376.4
1       4671.972        3764.812        74.481  78.046  62.374  82.680  88.377  100.000 4672.2  6227.1
2       4742.756        3673.473        71.994  77.380  63.966  84.494  90.620  100.000 4298.0  5983.9
```

### After:
```
Itn.  Dep Loss  NER Loss  UAS     NER P.  NER R.  NER F.  Tag %   Token %  CPU WPS  GPU WPS
0     4618.857  2910.004  76.172  79.645  67.987  88.732  88.261  100.000  4436.9   6376.4
1     4671.972  3764.812  74.481  78.046  62.374  82.680  88.377  100.000  4672.2   6227.1
2     4742.756  3673.473  71.994  77.380  63.966  84.494  90.620  100.000  4298.0   5983.9
```

* Added contributor file
2018-05-25 13:08:45 +02:00
Aristo Rinjuang
432ede04af adding more words and rephrasing (#2351)
* adding more words and rephrasing

* adding a contributor

* tokenizer bugs solved
2018-05-24 11:40:57 +02:00
Jani Monoses
ec62cadf4c Updates to Romanian support (#2354)
* Add back Romanian in conftest

* Romanian lex_attr

* More tokenizer exceptions for Romanian

* Add tests for some Romanian tokenizer exceptions
2018-05-24 11:40:00 +02:00
Matthew Honnibal
5d281cf302 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-05-22 20:50:59 +02:00
Matthew Honnibal
ce458c2428 Fix spacy requirement constraint in package template 2018-05-22 20:50:46 +02:00
Ines Montani
862da5e793 Support pipeline factories via entry points (#2348) 2018-05-22 18:29:45 +02:00
Matthew Honnibal
d5af38f80c Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-05-21 17:42:55 +02:00
Matthew Honnibal
ee33de8652 Fix unpickling of NER parser 2018-05-21 17:42:40 +02:00
ines
f9dbcac8e4 Merge branch 'master' into develop 2018-05-21 02:29:29 +02:00
cclauss
f7dcaa1f6b Simplify is_config() and normalize_string_keys() (#2305)
* Simplify is_config() and normalize_string_keys()

* Use __in__ to avoid the nested _ands_ and _ors_.
* Dict comprehension directly tracks with the doc string

* Keep more basic loop in normalize_string_keys

* Whitespace
2018-05-21 01:54:35 +02:00
Ines Montani
cae4457c38 💫 Add .similarity warnings for no vectors and option to exclude warnings (#2197)
* Add logic to filter out warning IDs via environment variable

Usage: SPACY_WARNING_EXCLUDE=W001,W007

* Add warnings for empty vectors

* Add warning if no word vectors are used in .similarity methods

For example, if only tensors are available in small models – should hopefully clear up some confusion around this

* Capture warnings in tests

* Rename SPACY_WARNING_EXCLUDE to SPACY_WARNING_IGNORE
2018-05-21 01:22:38 +02:00
Matthew Honnibal
b096b22c20
Merge pull request #2247 from skrcode/1480
1480 - Implement Fast-Text vectors with subword features
2018-05-21 01:16:21 +02:00
Matthew Honnibal
f3b4f6a4ec Merge setup.py 2018-05-20 23:21:00 +02:00
Ines Montani
d4cc736b7c 💫 Improve model downloads: check for existing install, customise pip and use requests library again (#2346)
* Go back to using requests instead of urllib (closes #2320)

Fewer dependencies are good, but this one was simply causing too many other problems around SSL verification and Python 2/3 compatibility. requests is a popular enough package that it's okay for spaCy to depend on it – and this will hopefully make model downloads less flakey.

* Only download model if not installed (see #1456)

Use #egg=model==version to allow pip to check for existing installations. The download is only started if no installation matching the package/version is found. Fixes a long-standing inconvenience.

* Pass additional options to pip when installing model (resolves #1456)

Treat all additional arguments passed to the download command as pip options to allow user to customise the command. For example:

python -m spacy download en --user

* Add CLI option to enable installing model package dependencies

* Revert "Add CLI option to enable installing model package dependencies"

This reverts commit 9336ffe695.

* Update documentation
2018-05-20 20:26:56 +02:00
Matthew Honnibal
3eb446e0a5 Require thinc 6.11.1 and prepare for release to spacy-nightly 2018-05-20 19:00:34 +02:00
Matthew Honnibal
bdc23dd8c1 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-05-20 18:59:24 +02:00
ines
5401c55c75 Merge branch 'master' into develop 2018-05-20 16:49:40 +02:00
ines
b59e3b157f Don't require attrs argument in Doc.retokenize and allow both ints and unicode (resolves #2304) 2018-05-20 15:15:37 +02:00
ines
5768df4f09 Add SimpleFrozenDict util to use as default function argument 2018-05-20 15:13:37 +02:00
Matthew Honnibal
7431e9c87f Fix parser for GPU 2018-05-19 17:24:34 +00:00
Matthew Honnibal
401213fb1f Only warn about unnamed vectors if non-zero sized. 2018-05-19 18:51:55 +02:00
Matthew Honnibal
74d5c625b3 Use rising beam update prob 2018-05-16 20:11:59 +02:00
Matthew Honnibal
544ae7f1db Merge branch 'develop' into feature/refactor-parser 2018-05-16 02:06:49 +02:00
Matthew Honnibal
d1b27fe5aa Revert "Improve dynamic oracle when values are missing in parse"
This reverts commit f56bd4736b.
2018-05-16 00:31:52 +02:00
Matthew Honnibal
83acaa0358 Add missing name attribute for parser 2018-05-15 19:01:53 +02:00
Matthew Honnibal
f328c195ca Fix size limits in training data 2018-05-15 19:01:41 +02:00
Matthew Honnibal
8446b35ce0 Fix parser model loading 2018-05-15 18:43:46 +02:00
Matthew Honnibal
dc1a479fbd Merge branch 'develop' into feature/refactor-parser 2018-05-15 18:39:21 +02:00
Matthew Honnibal
546dd99cdf Merge master into develop -- mostly Arabic and website 2018-05-15 18:14:28 +02:00
Matthew Honnibal
5664ab7e6c Revert hacks to tests 2018-05-15 18:00:09 +02:00
Matthew Honnibal
7b9195657b Restore beam_density argument for parser beam 2018-05-15 17:55:11 +02:00
Matthew Honnibal
581d318971 Fix conftest 2018-05-15 00:54:45 +02:00
Tahar Zanouda
00417794d3 Add Arabic language (#2314)
* added support for Arabic lang

* added Arabic language support

* updated conftest
2018-05-15 00:27:19 +02:00
Jani Monoses
0e08e49e87 Lemmatizer ro (#2319)
* Add Romanian lemmatizer lookup table.

Adapted from http://www.lexiconista.com/datasets/lemmatization/
by replacing cedillas with commas (ș and ț).

The original dataset is licensed under the Open Database License.

* Fix one blatant issue in the Romanian lemmatizer

* Romanian examples file

* Add ro_tokenizer in conftest

* Add Romanian lemmatizer test
2018-05-12 15:20:04 +02:00
Matthew Honnibal
887631ca25 Disable some tests to figure out why CI fails 2018-05-10 16:42:01 +02:00
Matthew Honnibal
902a172cb7 Disable some tests to figure out why CI fails 2018-05-10 16:30:07 +02:00
Matthew Honnibal
614d45ea58 Set a more aggressive threshold on the max violn update 2018-05-10 15:38:24 +02:00
Matthew Honnibal
8e8724b55b Default to beam_update_prob 1 2018-05-10 15:38:02 +02:00
Jani Monoses
42b34832e4 Update Romanian stopword list (#2316)
* Contributor agreement for janimo

* Update Romanian stopword list

Include the correct spellings of all the words already in the repo
that are using cedillas (ş and ţ) instead of commas (ș and ț).

Add another unrelated spelling fix.

See https://github.com/stopwords-iso/stopwords-ro/pull/1 and
https://github.com/stopwords-iso/stopwords-ro/pull/2
2018-05-10 12:16:56 +02:00
Lucas Abbade
be7fdc59d1 Update lex_attrs.py (#2307)
* Update lex_attrs.py

Fixed spelling mistakes of some numbers (according to Brazilian Portuguese).

* Update lex_attrs.py

As requested, I've included the correct spelling for both Brazilian Portuguese and Portuguese Portuguese.

I will advise however, that the two are separated in the future. Brazilian Portuguese is a very different language from the original one, although most of the writing is unified, the way people talk in both countries is radically different. Keeping both languages as one may lead to bigger issues in the future, especially when it comes to spell checking.
2018-05-09 20:49:31 +02:00
mauryaland
5368ba028a Update stop_words.py for French language (#2310)
* Add contraction forms of some common stopwords

All the stopwords added contain the apostrophe" ' "or " ’ ".

* Adds contributor agreement mauryaland

* Update mauryaland.md
2018-05-09 12:04:38 +02:00
Matthew Honnibal
a61fd60681 Fix error in beam gradient calculation 2018-05-09 02:44:09 +02:00
Matthew Honnibal
a6ae1ee6f7 Don't modify Token in global scope 2018-05-09 00:43:00 +02:00
Matthew Honnibal
f94f721f40 Avoid importing fused token symbol in ud-run-test, untl that's added 2018-05-09 00:28:03 +02:00
Matthew Honnibal
659ec5b975 Avoid importing fused token symbol in ud-run-test, untl that's added 2018-05-08 19:40:33 +02:00
Matthew Honnibal
4cb0494bef Bug fixes to beam search after refactor 2018-05-08 13:48:50 +02:00
Matthew Honnibal
5ed71973b3 Add a keyword argument sink to GoldParse 2018-05-08 13:48:32 +02:00
Matthew Honnibal
8cfe326f87 Avoid relying on final gold check in beam search 2018-05-08 13:48:19 +02:00
Matthew Honnibal
fc4dd49b77 Support oracle segmentation in ud-train CLI command 2018-05-08 13:47:45 +02:00
Matthew Honnibal
c49e44349a Fix beam parsing 2018-05-08 02:53:24 +02:00
Matthew Honnibal
99649d114d Fix parser 2018-05-08 00:27:26 +02:00
Matthew Honnibal
8a82367a9d Fix beam search after refactor 2018-05-08 00:20:33 +02:00
Matthew Honnibal
5a0f26be0c Readd beam search after refactor 2018-05-08 00:19:52 +02:00
ines
7a3599c21a Fix formatting and consistency 2018-05-07 23:02:11 +02:00
Matthew Honnibal
36b2c9bdd5 Fix refactored parser 2018-05-07 18:58:09 +02:00
Matthew Honnibal
bde3be1ad1 Fix refactored parser 2018-05-07 18:31:04 +02:00
Matthew Honnibal
01c4e13b02 Update test 2018-05-07 16:59:52 +02:00
Matthew Honnibal
f6cdafc00e Fix refactored parser 2018-05-07 16:59:38 +02:00
Matthew Honnibal
f56bd4736b Improve dynamic oracle when values are missing in parse 2018-05-07 15:53:18 +02:00
Matthew Honnibal
eddc0e0c74 Set gold.sent_starts in ud_train 2018-05-07 15:52:47 +02:00
Matthew Honnibal
bf19f22340 Allow gold.sent_starts to be set from Python 2018-05-07 15:51:34 +02:00
Matthew Honnibal
7f163442e6 Work on refactoring greedy parser 2018-05-07 15:45:52 +02:00
Douglas Knox
9b49a40f4e Test and fix for Issue #2219 (#2272)
Test and fix for Issue #2219: Token.similarity() failed if single letter
2018-05-03 18:40:46 +02:00
Paul O'Leary McCann
bd72fbf09c Port Japanese mecab tokenizer from v1 (#2036)
* Port Japanese mecab tokenizer from v1

This brings the Mecab-based Japanese tokenization introduced in #1246 to
spaCy v2. There isn't a JapaneseTagger implementation yet, but POS tag
information from Mecab is stored in a token extension. A tag map is also
included.

As a reminder, Mecab is required because Universal Dependencies are
based on Unidic tags, and Janome doesn't support Unidic.

Things to check:

1. Is this the right way to use a token extension?

2. What's the right way to implement a JapaneseTagger? The approach in
 #1246 relied on `tag_from_strings` which is just gone now. I guess the
best thing is to just try training spaCy's default Tagger?

-POLM

* Add tagging/make_doc and tests
2018-05-03 18:38:26 +02:00
G.Pruvost
cc8e804648 #2211 - Support for ssl certs config on download command (#2212)
* Add support for SSL/Certs customization on download CLI

* Add a note on SSL options for the 'download' CLI in the README

* Add contributor agreement
2018-05-03 18:37:02 +02:00
Jens Dahl Møllerhøj
b9290397fb rename SP to _SP (#2289) 2018-05-03 18:33:49 +02:00
Matthew Honnibal
a8e70a4187 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-05-03 14:02:10 +02:00
Matthew Honnibal
c0e596283b Set version to 2.1.0a0 2018-05-03 14:00:11 +02:00
Matthew Honnibal
8cd06cc763 Try to fix root-outside-sentence bug 2018-05-02 14:39:48 +00:00
Matthew Honnibal
acebd01033 Set cildren from heads in finalize doc 2018-05-02 14:19:22 +00:00
Matthew Honnibal
569440a6db Dont normalize gradient by batch size 2018-05-02 08:42:10 +02:00
Matthew Honnibal
281e29cbcd Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-05-02 01:36:23 +00:00
Matthew Honnibal
2338e8c7fc Update develop from master 2018-05-02 01:36:12 +00:00
Matthew Honnibal
9d147e12c4 Merge remote-tracking branch 'origin/master' into develop 2018-05-01 18:18:51 +02:00
Matthew Honnibal
6d0fe67b72 Constrain subtok label to adjacent tokens 2018-05-01 17:34:27 +02:00
Matthew Honnibal
8f21953fc5 Constrain subtok to adjacent words 2018-05-01 17:29:00 +02:00
Matthew Honnibal
b43bfd3524 Fix arc-eager oracle tests 2018-05-01 16:16:14 +02:00
Matthew Honnibal
31ed64e9b0 Fix textcat test 2018-05-01 15:18:39 +02:00
Matthew Honnibal
548bdff943 Update default Adam settings 2018-05-01 15:18:20 +02:00
Matthew Honnibal
adbb1f7533 Add better arc-eager oracle tests 2018-05-01 15:14:55 +02:00
Matthew Honnibal
697bcaa34f Add some methods to ArcEager that make testing easier 2018-05-01 15:13:14 +02:00
Mr Roboto
6f5ccda19c Addresses Issue #2228 - Deserialization fails when using tensor=False or sentiment=False (#2230)
* Fixes issue #2228

* Adds a new contributor
2018-05-01 13:40:22 +02:00
Matthew Honnibal
d44bb45c72 Fix scoring if tokenization changes 2018-05-01 01:33:20 +02:00
Matthew Honnibal
2b26c007cd Revert "Disable batch size compounding in ud-train"
This reverts commit 8a120fb455.
2018-04-29 14:09:02 +00:00
Matthew Honnibal
723b328062 Add script to run UD test 2018-04-29 15:50:25 +02:00
Matthew Honnibal
17af6aa3a4 Update ud_train script 2018-04-29 15:49:32 +02:00
Matthew Honnibal
5de8a36537 Fix arc_eager is_nonproj_tree 2018-04-29 15:49:11 +02:00
Matthew Honnibal
5260268f70 Fix textcat after merge 2018-04-29 15:48:53 +02:00
Matthew Honnibal
ad3d56c3ba Fix compile error in matcher 2018-04-29 15:48:34 +02:00
Matthew Honnibal
a8bc947fd4 Fix Token.set_extension 2018-04-29 15:48:19 +02:00
Matthew Honnibal
2c4a6d66fa Merge master into develop. Big merge, many conflicts -- need to review 2018-04-29 14:49:26 +02:00
ines
3c80f69ff5 Return data in cli.info and add silent option (resolves #2196) 2018-04-29 01:59:44 +02:00
ines
1c6d77610c Add remove_extension method on Doc, Token and Span (closes #2242) 2018-04-28 23:33:09 +02:00
ines
abdb853ebf Simplify underscore tests 2018-04-28 23:30:33 +02:00
ines
6fb6371670 Add collapse_phrases option to displacy (closes #2266) 2018-04-28 23:06:50 +02:00
Robin Linderborg
1f9904ef12 fixes #2238 (#2241)
* Remove erroneous lemma lookup år > åra in Swedish

* Add contributors agreement

* Add contrib agreement to correct directory

* Revert change to CONTRIBUTOR_AGREEMENT
2018-04-28 14:55:22 +02:00
Robin Linderborg
d01f503b54 Remove incorrect lemma lookup gäng->gänga (#2252)
* Remove incorrect lemma lookup gäng->gänga
In modern Swedish, "gäng" is mostly associated with "gang" or "group of people". The removed lemma lookup lemmatized it to the verb "thread".

* Add contrib agreement to correct directory

* Revert change to CONTRIBUTOR_AGREEMENT
2018-04-28 14:54:41 +02:00
Suraj Krishnan Rajan
69d041148f Implement Fast-Text vectors with subword features 2018-04-21 01:34:14 +05:30
ines
686225eadd Fix Spanish noun_chunks (resolves #2210)
Make sure 'NP' label is added to StringStore and move noun_bounds helper into a closure to allow reusing label sets
2018-04-18 18:44:01 -04:00
ines
9632595fb4 Use correct, non-deprecated merge syntax (resolves #2226) 2018-04-18 18:28:28 -04:00
Suraj Rajan
5957f15227 Fixed typos for #2222,#2223 (#2233) (closes #2222, closes #2223) 2018-04-18 14:55:26 -07:00
Matthew Honnibal
97851d2c4e Increment version to v2.0.12.dev0 2018-04-10 22:20:16 +02:00
Matthew Honnibal
ed39c75a92 Merge branch 'master' of https://github.com/explosion/spaCy 2018-04-10 22:19:40 +02:00
Matthew Honnibal
3836199a83 Fix loading of models when custom vectors are added 2018-04-10 22:19:20 +02:00
ines
0299d5fac8 Update argument annotations and formatting 2018-04-10 21:45:11 +02:00
ines
49b1e48bf5 Fix syntax error 2018-04-10 21:44:59 +02:00
ines
70052e46e9 Fix formatting [ci skip] 2018-04-10 21:42:46 +02:00
Matthew Honnibal
0ddb152be0 Improve error message when reading vectors 2018-04-10 21:26:50 +02:00
Matthew Honnibal
db50ac524e Support zipped vector files in init-model 2018-04-10 21:21:00 +02:00
ines
270fcfd925 Fix typo in package command message (closes #2200) 2018-04-10 19:14:31 +02:00
ines
24d8bf348d Revert "Add support for .zip to init_model"
This reverts commit 7ee880a0ad.
2018-04-10 19:08:06 +02:00
Matthew Honnibal
7ee880a0ad Add support for .zip to init_model 2018-04-10 14:30:04 +00:00
ines
5ecb274764 Fix indentation error and set Doc.is_tagged correctly 2018-04-10 16:14:52 +02:00
ines
987ee27af7 Return Doc if noun chunks merger component if Doc is not parsed 2018-04-09 14:51:02 +02:00
Xiaoquan Kong
e2f13ec722 bugfix: Doc.noun_chunks call Doc.noun_chunks_iterator without checking (closes #2194) 2018-04-08 23:44:05 +02:00
Jens Dahl Møllerhøj
e5055e3cf6 Add Danish lemmatizer (#2184)
* add danish lemmatizer

* fill contributor agreement
2018-04-07 19:07:28 +02:00
ines
bccbf538ef Revert "Check if spaCy has compiled correctly and show error message"
This reverts commit 3463ded7cf.
2018-04-06 15:49:44 +02:00
ines
fb4eda6616 Merge branch 'master' of https://github.com/explosion/spaCy 2018-04-06 00:38:48 +02:00
Matthew Honnibal
0c7fab4443 Set version to 2.0.11 2018-04-04 11:19:11 +02:00
Matthew Honnibal
a350be0601 Fix vector-name loading fix 2018-04-04 01:31:25 +02:00
Matthew Honnibal
21047bde52 Fix syntax error in italian lemmatizer 2018-04-03 23:13:22 +02:00
Matthew Honnibal
81f4005f3d Fix loading models with pretrained vectors 2018-04-03 23:11:48 +02:00
ines
3463ded7cf Check if spaCy has compiled correctly and show error message 2018-04-03 22:18:47 +02:00
Matthew Honnibal
96b612873b Add hyper-parameter to control whether parser makes a beam update 2018-04-03 22:02:56 +02:00
ines
e5f47cd82d Update errors 2018-04-03 21:40:29 +02:00
Matthew Honnibal
f7e6313b43 Increment version to v2.0.11.dev0 2018-04-03 20:58:47 +02:00
ines
10462816bc Fix tests for Python 2 2018-04-03 18:51:31 +02:00
ines
62b4b527d7 Don't raise error if set_extension has getter and setter (closes #2177)
Improve error messages, raise error if setter is specified without a getter and compare against _unset to allow default=None. Also add more tests.
2018-04-03 18:30:17 +02:00
ines
ee3082ad29 Fix whitespace 2018-04-03 18:29:53 +02:00
Ines Montani
3141e04822
💫 New system for error messages and warnings (#2163)
* Add spacy.errors module

* Update deprecation and user warnings

* Replace errors and asserts with new error message system

* Remove redundant asserts

* Fix whitespace

* Add messages for print/util.prints statements

* Fix typo

* Fix typos

* Move CLI messages to spacy.cli._messages

* Add decorator to display error code with message

An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc.

* Remove unused link in spacy.about

* Update errors for invalid pipeline components

* Improve error for unknown factories

* Add displaCy warnings

* Update formatting consistency

* Move error message to spacy.errors

* Update errors and check if doc returned by component is None
2018-04-03 15:50:31 +02:00
Matthew Honnibal
abf8b16d71
Add doc.retokenize() context manager (#2172)
This patch takes a step towards #1487 by introducing the
doc.retokenize() context manager, to handle merging spans, and soon
splitting tokens.

The idea is to do merging and splitting like this:

with doc.retokenize() as retokenizer:
    for start, end, label in matches:
        retokenizer.merge(doc[start : end], attrs={'ent_type': label})

The retokenizer accumulates the merge requests, and applies them
together at the end of the block. This will allow retokenization to be
more efficient, and much less error prone.

A retokenizer.split() function will then be added, to handle splitting a
single token into multiple tokens. These methods take `Span` and `Token`
objects; if the user wants to go directly from offsets, they can append
to the .merges and .splits lists on the retokenizer.

The doc.merge() method's behaviour remains unchanged, so this patch
should be 100% backwards incompatible (modulo bugs). Internally,
doc.merge() fixes up the arguments (to handle the various deprecated styles),
opens the retokenizer, and makes the single merge.

We can later start making deprecation warnings on direct calls to doc.merge(),
to migrate people to use of the retokenize context manager.
2018-04-03 14:10:35 +02:00
Matthew Honnibal
8a120fb455 Disable batch size compounding in ud-train 2018-04-01 08:45:00 +00:00
Matthew Honnibal
98165e43a7 Sometimes update beam with greedy oracle 2018-04-01 08:44:35 +00:00
Suraj Rajan
1cdbb7c97c [2032] - Changed python set to cpp stl set (#2170)
Changed python set to cpp stl set #2032 

## Description

Changed python set to cpp stl set. CPP stl set works better due to the logarithmic run time of its methods. Finding minimum in the cpp set is done in constant time as opposed to the worst case linear runtime of python set. Operations such as find,count,insert,delete are also done in either constant and logarithmic time thus making cpp set a better option to manage vectors.
Reference : http://www.cplusplus.com/reference/set/set/

### Types of change
Enhancement for `Vectors` for faster initialising of word vectors(fasttext)
2018-03-31 13:28:25 +02:00
Matthew Honnibal
f3b7c5e537 Fix syntax error 2018-03-29 21:50:32 +02:00
Matthew Honnibal
23afa6429f Add input length error, to address #1826 2018-03-29 21:45:26 +02:00
Ines Montani
a609a1ca29
Merge pull request #2152 from explosion/feature/tidy-up-dependencies
💫 Tidy up dependencies
2018-03-29 14:35:09 +02:00
Viet Trung Tran
ea2af94cd9 Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer (#2155)
* support for Vietnamese

* Contributor Agreement for adding Vietnamese support on spaCy
2018-03-29 12:19:51 +02:00
ines
e6979bdbbd Merge branch 'feature/tidy-up-dependencies' of https://github.com/explosion/spaCy into feature/tidy-up-dependencies 2018-03-29 00:19:37 +02:00
ines
83146458a2 Fix urllib for Python 3 2018-03-29 00:19:33 +02:00
Matthew Honnibal
8308bbc617 Get msgpack and msgpack_numpy via Thinc, to avoid potential version conflicts 2018-03-29 00:14:55 +02:00
Matthew Honnibal
b5098079d8 Fix error on urllib 2018-03-29 00:08:16 +02:00
Ines Montani
0de599b16b
Merge pull request #2159 from explosion/feature/fix-merged-entity-iob (resolves #1554, resolves #1752)
💫 Fix token.ent_iob after doc.merge(), and ensure consistency in doc.ents
2018-03-28 23:10:00 +02:00
Ines Montani
98e9cda677
Merge pull request #2158 from explosion/feature/fix-multiple-vectors (resolves #1660)
💫 Fix loading of multiple vector models
2018-03-28 23:08:24 +02:00
Matthew Honnibal
a7c5ae2beb Avoid forcing a name on empty vectors, and remove print statement 2018-03-28 21:08:58 +02:00
ines
3eb67bbe4b Allow entity types with dashes (resolves #1967) 2018-03-28 20:51:26 +02:00
Matthew Honnibal
cf5fcf0546 Update serialization test 2018-03-28 20:12:53 +02:00
Matthew Honnibal
4555e3e251 Dont assume pretrained_vectors cfg set in build_tagger 2018-03-28 20:12:45 +02:00
Matthew Honnibal
0b375d50c8 Fix ent_iob tags in doc.merge to avoid inconsistent sequences 2018-03-28 18:39:03 +02:00
Matthew Honnibal
95fa89c4b8 Update doc.ents test 2018-03-28 18:39:03 +02:00
Matthew Honnibal
e807f88410 Resolve merge when cherry-picking ent iob patches from develop 2018-03-28 18:38:13 +02:00
Matthew Honnibal
99fbc7db33 Improve error message when entity sequence is inconsistent 2018-03-28 18:36:53 +02:00
Matthew Honnibal
cbd2794be0 Add test for ent_iob during span merge 2018-03-28 18:36:53 +02:00
Matthew Honnibal
f8dd905a24 Warn and fallback if vectors have no name 2018-03-28 18:24:53 +02:00
Matthew Honnibal
fd9e259414 Add test for #1660 2018-03-28 18:22:51 +02:00
Matthew Honnibal
bc4afa9881 Remove print statement 2018-03-28 17:48:37 +02:00
Matthew Honnibal
79dc241caa Set pretrained_vectors in parser cfg 2018-03-28 17:35:07 +02:00
Matthew Honnibal
17c3e7efa2 Add message noting vectors 2018-03-28 16:33:43 +02:00
Matthew Honnibal
9bf6e93b3e Set pretrained_vectors in begin_training 2018-03-28 16:32:41 +02:00
Matthew Honnibal
95a9615221 Fix loading of multiple pre-trained vectors
This patch addresses #1660, which was caused by keying all pre-trained
vectors with the same ID when telling Thinc how to refer to them. This
meant that if multiple models were loaded that had pre-trained vectors,
errors or incorrect behaviour resulted.

The vectors class now includes a .name attribute, which defaults to:
{nlp.meta['lang']_nlp.meta['name']}.vectors
The vectors name is set in the cfg of the pipeline components under the
key pretrained_vectors. This replaces the previous cfg key
pretrained_dims.

In order to make existing models compatible with this change, we check
for the pretrained_dims key when loading models in from_disk and
from_bytes, and add the cfg key pretrained_vectors if we find it.
2018-03-28 16:02:59 +02:00
ines
7fbc9e5874 Replace requests with urllib 2018-03-28 12:46:07 +02:00
ines
da1f200362 Add compat helpers for urllib 2018-03-28 12:45:53 +02:00
ines
ac88c72c9a Fix ftfy workaround and remove old import 2018-03-28 12:14:28 +02:00
ines
ce6071ca89 Remove ftfy dependency and update docs 2018-03-28 12:09:42 +02:00
Matthew Honnibal
070b6c6495 Remove dependency on ftfy 2018-03-28 12:07:02 +02:00
ines
6d2c85f428 Drop six and related hacks as a dependency 2018-03-28 10:45:25 +02:00