Commit Graph

5499 Commits

Author SHA1 Message Date
Matthew Honnibal
0fdb25b958
Fix msgpack error 2018-11-27 19:35:55 +01:00
Matthew Honnibal
ef0820827a
Update hyper-parameters after NER random search (#2972)
These experiments were completed a few weeks ago, but I didn't make the PR, pending model release.

    Token vector width: 128->96
    Hidden width: 128->64
    Embed size: 5000->2000
    Dropout: 0.2->0.1
    Updated optimizer defaults (unclear how important?)

This should improve speed, model size and load time, while keeping
similar or slightly better accuracy.

The tl;dr is we prefer to prevent over-fitting by reducing model size,
rather than using more dropout.
2018-11-27 18:49:52 +01:00
Matthew Honnibal
c9f6acc564 Set version to 2.1.0a3.dev0 2018-11-27 05:15:27 +01:00
Ines Montani
b6e991440c 💫 Tidy up and auto-format tests (#2967)
* Auto-format tests with black

* Add flake8 config

* Tidy up and remove unused imports

* Fix redefinitions of test functions

* Replace orths_and_spaces with words and spaces

* Fix compatibility with pytest 4.0

* xfail test for now

Test was previously overwritten by following test due to naming conflict, so failure wasn't reported

* Unfail passing test

* Only use fixture via arguments

Fixes pytest 4.0 compatibility
2018-11-27 01:09:36 +01:00
Matthew Honnibal
2c37e0ccf6
💫 Use Blis for matrix multiplications (#2966)
Our epic matrix multiplication odyssey is drawing to a close...

I've now finally got the Blis linear algebra routines in a self-contained Python package, with wheels for Windows, Linux and OSX. The only missing platform at the moment is Windows Python 2.7. The result is at https://github.com/explosion/cython-blis

Thinc v7.0.0 will make the change to Blis. I've put a Thinc v7.0.0.dev0 up on PyPi so that we can test these changes with the CI, and even get them out to spacy-nightly, before Thinc v7.0.0 is released. This PR also updates the other dependencies to be in line with the current versions master is using. I've also resolved the msgpack deprecation problems, and gotten spaCy and Thinc up to date with the latest Cython.

The point of switching to Blis is to have control of how our matrix multiplications are executed across platforms. When we were using numpy for this, a different library would be used on pip and conda, OSX would use Accelerate, etc. This would open up different bugs and performance problems, especially when multi-threading was introduced.

With the change to Blis, we now strictly single-thread the matrix multiplications. This will make it much easier to use multiprocessing to parallelise the runtime, since we won't have nested parallelism problems to deal with.

* Use blis

* Use -2 arg to Cython

* Update dependencies

* Fix requirements

* Update setup dependencies

* Fix requirement typo

* Fix msgpack errors

* Remove Python27 test from Appveyor, until Blis works there

* Auto-format setup.py

* Fix murmurhash version
2018-11-27 00:44:04 +01:00
Ines Montani
41c6002fd8 Tidy up [ci skip] 2018-11-26 18:56:04 +01:00
Ines Montani
c62d06ea5c Port over #2949 2018-11-26 18:54:27 +01:00
Ines Montani
ec5ee9e616 Auto-format 2018-11-26 18:54:20 +01:00
Ines Montani
968aff2f6a
Update tests for pytest 4.x (#2965)
<!--- Provide a general summary of your changes in the title. -->

## Description
- [x] Replace marks in params for pytest 4.0 compat ([see here](https://docs.pytest.org/en/latest/deprecations.html#marks-in-pytest-mark-parametrize))
- [x] Un-xfail passing tests (some fixes in a recent update resolved a bunch of issues, but tests were apparently never updated here)

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-11-26 18:14:57 +01:00
Marc Puig
98fe1ab259 Catalan Language Support (#2940)
* Catalan language Support

* Ddding Catalan to documentation
2018-11-26 15:25:47 +01:00
Ines Montani
048416f265 Fix formatting 2018-11-26 13:27:41 +01:00
Shawn Cicoria
7601ae0cff fixes symbolic link on py3 and windows (#2949)
* fixes symbolic link on py3 and windows
during setup of spacy using command
python -m spacy link en_core_web_sm en
closes #2948

* Update spacy/compat.py

Co-Authored-By: cicorias <cicorias@users.noreply.github.com>
2018-11-24 15:34:23 +01:00
Ines Montani
350c8d25b0 Add EntityRecognizer.label property 2018-11-18 00:06:26 +01:00
Ines Montani
017bc2ef2f Expose TextCategorizer via __all__ 2018-11-18 00:06:13 +01:00
Ines Montani
b4581435f6 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-11-16 13:08:22 +01:00
Ines Montani
e2f75eb492 Fix message formatting 2018-11-16 13:08:20 +01:00
Matthew Honnibal
2874b8efd8 Fix tok2vec loading in spacy train 2018-11-15 23:34:54 +00:00
Matthew Honnibal
2ddd428834 Fix pretrain script 2018-11-15 23:34:35 +00:00
Matthew Honnibal
f8afaa0c1c Fix pretrain 2018-11-15 22:46:53 +00:00
Matthew Honnibal
6af6950e46 Fix pretrain 2018-11-15 22:45:36 +00:00
Matthew Honnibal
3e7b214e57 Make pretrain script work with stream from stdin 2018-11-15 22:44:07 +00:00
Matthew Honnibal
8fdb9bc278
💫 Add experimental ULMFit/BERT/Elmo-like pretraining (#2931)
* Add 'spacy pretrain' command

* Fix pretrain command for Python 2

* Fix pretrain command

* Fix pretrain command
2018-11-15 22:17:16 +01:00
Ines Montani
02fc73ca53
💫 Create random IDs for SVGs to prevent ID clashes (#2927)
Resolves #2924.

## Description
Fixes problem where multiple visualizations in Jupyter notebooks would have clashing arc IDs, resulting in weirdly positioned arc labels. Generating a random ID prefix so even identical parses won't receive the same IDs for consistency (even if effect of ID clash isn't noticable here.)

### Types of change
bug fix

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-11-15 11:40:10 +01:00
Ines Montani
e89708c3eb 💫 Allow matching non-ORTH attributes in PhraseMatcher (#2925)
* Allow matching non-orth attributes in PhraseMatcher (see #1971)

Usage: PhraseMatcher(nlp.vocab, attr='POS')

* Allow attr argument to be int

* Fix formatting

* Fix typo
2018-11-15 03:00:58 +01:00
Ines Montani
0d5b142c78 Fix typos and whitespace 2018-11-14 19:12:34 +01:00
Ines Montani
bd1b0e396a Add deprecation warning for PhraseMatcher max_length 2018-11-14 19:10:46 +01:00
Ines Montani
64257bf3a7 Fix formatting 2018-11-14 19:10:21 +01:00
Ines Montani
b3cadd5b81
Delete _matcher2_notes.py 2018-11-14 16:19:12 +01:00
mauryaland
87ce435aff Check if the word is in one of the regular lists specific to each POS (#2886) 2018-11-14 15:58:43 +01:00
Daniel Hershcovich
d3d419ecc0 Allow input text of length up to max_length, inclusive (#2922) 2018-11-13 16:46:29 +01:00
Matthew Honnibal
5fc98ade04 Set version to 2.1.0a2 2018-11-08 09:56:56 +01:00
Matthew Honnibal
ad44982f01 Fix dropout in tensorizer, update comment 2018-11-03 12:46:58 +00:00
Matthew Honnibal
ba365ae1c9 Normalize gradient by number of words in tensorizer 2018-11-03 10:53:22 +00:00
Matthew Honnibal
dac3f1b280 Improve Tensorizer 2018-11-03 10:52:50 +00:00
Matthew Honnibal
2527ba68e5 Fix tensorizer 2018-11-02 23:29:54 +00:00
Matthew Honnibal
db08b168a3 Set version to 2.0.17 2018-10-29 23:22:18 +01:00
Suraj Rajan
0bf14082a4 Added more constucts for dependency tree matcher (#2836) 2018-10-29 23:21:39 +01:00
Matthew Honnibal
e2ae25d6f5 Try setting older regex version, to align with conda 2018-10-29 13:39:00 +01:00
Matthew Honnibal
d4fa9af56f Set version to 2.0.17.dev0 2018-10-28 16:15:26 +01:00
Matthew Honnibal
b2e2bba8b0
Fix missing comma 2018-10-28 00:09:16 +02:00
Wannaphong Phatthiyaphaibun
2d2765fd8a Change PyThaiNLP Url (#2876) 2018-10-27 14:46:07 +02:00
Matthew Honnibal
817e1fc5e5 Fix out-of-bounds access in NER training
The helper method state.B(1) gets the index of the first token of the
buffer, or -1 if no such token exists. Normally this is safe because we
pass this to functions like state.safe_get(), which returns an empty
token. Here we used it directly as an array index, which is not okay!

This error may have been the cause of out-of-bounds access errors during
training. Similar errors may still be around, so much be hunted down.
Hunting this one down took a long time...I printed out values across
training runs and diffed, looking for points of divergence between
runs, when no randomness should be allowed.
2018-10-27 01:12:50 +02:00
Matthew Honnibal
9447739027 Merge branch 'master' of https://github.com/explosion/spaCy 2018-10-27 00:50:48 +02:00
Matthew Honnibal
ad068f51be Fix out-of-bounds access in NER training
The helper method state.B(1) gets the index of the first token of the
buffer, or -1 if no such token exists. Normally this is safe because we
pass this to functions like state.safe_get(), which returns an empty
token. Here we used it directly as an array index, which is not okay!

This error may have been the cause of out-of-bounds access errors during
training. Similar errors may still be around, so much be hunted down.
Hunting this one down took a long time...I printed out values across
training runs and diffed, looking for points of divergence between
runs, when no randomness should be allowed.
2018-10-27 00:46:30 +02:00
Grivaz
57f274b693 raise error when setting overlapping entities as doc.ents (#2880) 2018-10-26 23:29:16 +02:00
Ines Montani
48b1bc44d3 Update version to 2.0.16 2018-10-15 14:39:25 +02:00
Ines Montani
a0f6647160 Increment version 2018-10-15 14:20:55 +02:00
Ines Montani
7bc7fa8f1e Increment version 2018-10-15 01:40:44 +02:00
Matthew Honnibal
8612b75890 Set version to 2.0.14 2018-10-15 00:10:04 +02:00
Matthew Honnibal
d6e9cf8b09 Set version to 2.0.14.dev1 2018-10-15 00:09:02 +02:00
Matthew Honnibal
8ccfa52d19 Unhack prefer_gpu 2018-10-14 23:27:09 +02:00
Matthew Honnibal
41adf3572b Set version to v2.0.14 2018-10-14 23:15:34 +02:00
Matthew Honnibal
38aa835ada Workaround bug in thinc require_gpu 2018-10-14 23:15:08 +02:00
Matthew Honnibal
91593b7378 Add tests for prefer_gpu() and require_gpu() 2018-10-14 23:05:22 +02:00
Matthew Honnibal
62c70b3163 Import prefer_gpu and require_gpu functions from Thinc 2018-10-14 23:03:06 +02:00
Ines Montani
295da0f11b Increment version to 2.0.14.dev0 2018-10-14 16:37:46 +02:00
Matthew Honnibal
7de0dcb91f Merge branch 'master' of https://github.com/explosion/spaCy 2018-10-14 16:12:23 +02:00
Keshan
cb075c8e72 Adding "This is a sentence" example to Sinhala (#2846) 2018-10-14 00:06:40 +02:00
Matthew Honnibal
9cfab5933a Set version to 2.0.13 2018-10-13 19:42:16 +02:00
Matthew Honnibal
6a6ae5b0af Merge branch 'master' of https://github.com/explosion/spaCy 2018-10-13 19:41:00 +02:00
mauryaland
36514b5762 Rule-based French Lemmatizer (#2818)
<!--- Provide a general summary of your changes in the title. -->

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

Add a rule-based French Lemmatizer following the english one and the excellent PR for [greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class.

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

- Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version.
- Add several files containing exhaustive list of words for each part of speech 
- Add some lemma rules
- Add POS that are not checked in the standard Lemmatizer, i.e PRON, DET, ADV and AUX
- Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentionned
- Modify the lemmatize function to check in lookup table as a last resort
- Init files are updated so the model can support all the functionalities mentioned above
- Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [X] I have submitted the spaCy Contributor Agreement.
- [X] I ran the tests, and all new and existing tests passed.
- [X] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-10-13 16:38:21 +02:00
Matthew Honnibal
de46286107 Merge branch 'master' of https://github.com/explosion/spaCy 2018-10-13 16:11:16 +02:00
Ines Montani
cb57b35bb8 Also include lowercase norm exceptions 2018-10-13 15:37:30 +02:00
JKhakpour
74a30d883c Add Persian(Farsi) language support (#2797) 2018-10-13 15:31:49 +02:00
Matthew Honnibal
c3ddf98b1e Set version to 2.0.13.dev4 2018-10-13 15:20:59 +02:00
Marina Lysyuk
b76fe08308 Correcting lang/ru/examples.py (#2845)
* Correct some grammatical inaccuracies in lang\ru\examples.py; filled Contributor Agreement

* Correct some grammatical inaccuracies in lang\ru\examples.py

* Move contributor agreement to separate file
2018-10-13 15:19:43 +02:00
Matthew Honnibal
67ddce68d8 Unskip test 2018-10-02 23:47:55 +02:00
Matthew Honnibal
4cf5ce2cc2 Revert "Remove problematic test"
This reverts commit bdebbef455.
2018-10-02 23:47:24 +02:00
Matthew Honnibal
bdebbef455 Remove problematic test 2018-10-02 23:16:29 +02:00
Matthew Honnibal
6afc6ffe56 Skip seemingly problematic test 2018-10-02 22:33:40 +02:00
Matthew Honnibal
9e4079ddb2 Merge branch 'master' of https://github.com/explosion/spaCy 2018-10-02 19:44:43 +02:00
Matthew Honnibal
40f228c2f2 Set version to 2.0.13.dev3 2018-10-02 19:44:25 +02:00
Ines Montani
ea20b72c08 💫 Make like_num work for prefixed numbers (#2808)
* Only split + prefix if not numbers

* Make like_num work for prefixed numbers

* Add test for like_num
2018-10-01 10:49:14 +02:00
Filipe Caixeta
6c498f9ff4 Update Portuguese Language (#2790)
* Add words to portuguese language _num_words

* Add words to portuguese language _num_words

* Portuguese - Add/remove stopwords, fix tokenizer, add currency symbols

* Extended punctuation and norm_exceptions in the Portuguese language
2018-09-29 09:51:45 +02:00
Matthew Honnibal
b39810d692 Fix copy_reg compatibility on _serialize module 2018-09-28 15:23:14 +02:00
Matthew Honnibal
f82f8ba5dd Fix serialization when empty parser model. Closes #2482 2018-09-28 15:18:52 +02:00
Matthew Honnibal
d5a6c63b62 Add regression test for #2482 2018-09-28 15:18:30 +02:00
Matthew Honnibal
e3e9fe18d4 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-09-28 14:27:35 +02:00
Matthew Honnibal
0323f5be0c Fix _serialize module 2018-09-28 14:27:24 +02:00
Ines Montani
5d56eb70d7 Tidy up tests 2018-09-27 16:41:57 +02:00
Ines Montani
1f1bab9264 Remove unused import 2018-09-27 16:41:37 +02:00
Matthew Honnibal
6430b1fe64 Restore encoding arg on msgpack-numpy 2018-09-27 15:58:21 +02:00
Matthew Honnibal
2ac69facc6 Fix Python 2 test failure 2018-09-27 15:34:16 +02:00
Matthew Honnibal
72778375fb Merge branch 'master' of https://github.com/explosion/spaCy 2018-09-27 13:54:49 +02:00
Matthew Honnibal
96fe314d8d Fix bug when too many entity types. Fixes #2800 2018-09-27 13:54:34 +02:00
Suraj Rajan
bbdc6456c6 Set up dependency tree pattern matching skeleton (#2732) 2018-09-27 13:27:18 +02:00
Matthew Honnibal
8809dc4514 Remove deprecated encoding argument to msgpack 2018-09-27 12:56:23 +02:00
Matthew Honnibal
bae6b3e2b3 Merge branch 'master' of https://github.com/explosion/spaCy 2018-09-27 12:50:31 +02:00
Ines Montani
71cdbeada7 Revert "Also include lowercase norm exceptions"
This reverts commit 70f4e8adf3.
2018-09-27 12:29:25 +02:00
darindf
8227566805 Fix error (#2802)
* Fix error
ValueError: cannot resize an array that references or is referenced
by another array in this way.  Use the resize function

* added spaCy Contributor Agreement
2018-09-26 21:31:03 +02:00
Ines Montani
5e0dfb34fa Merge branch 'master' of https://github.com/explosion/spaCy 2018-09-26 11:13:58 +02:00
Ines Montani
70f4e8adf3 Also include lowercase norm exceptions 2018-09-25 12:22:02 +02:00
Keshan
9a016d17c2 Adding basic support for Sinhala language. (#2788)
* adding Sinhala language package, stop words, examples and lex_attrs.

* Adding contributor agreement

* Updating contributor agreement
2018-09-25 12:18:25 +02:00
Matthew Honnibal
b42c123e5d Fix regression introduced by 1759abf1e 2018-09-25 11:08:58 +02:00
Matthew Honnibal
500898907b Fix regression in parser.begin_training() 2018-09-25 11:08:31 +02:00
Ines Montani
3c4e3ade30 Fix typo (closes #2784) 2018-09-21 10:45:11 +02:00
mauryaland
68b3c544d5 Adding French hyphenated first name (#2786) 2018-09-21 10:38:13 +02:00
Matthew Honnibal
1759abf1e5 Fix bug in sentence starts for non-projective parses
The set_children_from_heads function assumed parse trees were
projective. However, non-projective parses may be passed in during
deserialization, or after deprojectivising. This caused incorrect
sentence boundaries to be set for non-projective parses. Close #2772.
2018-09-19 14:50:06 +02:00
Matthew Honnibal
48fd36bf05 Fix test for issue 27772 2018-09-19 14:47:27 +02:00
Matthew Honnibal
6cd920e088 Add xfail test for deprojectivization SBD bug 2018-09-19 14:00:31 +02:00
Matthew Honnibal
99a6011580 Avoid adding empty layer in model, to keep models backwards compatible 2018-09-14 22:51:58 +02:00
Matthew Honnibal
c046392317 Trigger on_data hooks in parser model 2018-09-14 20:51:21 +02:00
Matthew Honnibal
5afd98dff5 Add a stepping function, for changing batch sizes or learning rates 2018-09-14 18:37:16 +02:00
Matthew Honnibal
27c00f4f22 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-09-14 12:30:57 +02:00
Andrew Ongko
81564cc4e8 Update Indonesian model (#2752)
* adding e-KTP in tokenizer exceptions list

* add exception token

* removing lines with containing space as it won't matter since we use .split() method in the end, added new tokens in exception

* add tokenizer exceptions list

* combining base_norms with norm_exceptions

* adding norm_exception

* fix double key in lemmatizer

* remove unused import on punctuation.py

* reformat stop_words to reduce number of lines, improve readibility

* updating tokenizer exception

* implement is_currency for lang/id

* adding orth_first_upper in tokenizer_exceptions

* update the norm_exception list

* remove bunch of abbreviations

* adding contributors file
2018-09-14 12:30:32 +02:00
Filipe Caixeta
fe515085f3 Add words to portuguese language _num_words (#2759)
* Add words to portuguese language _num_words

* Add words to portuguese language _num_words
2018-09-14 12:30:16 +02:00
Matthew Honnibal
f32b52e611 Fix bug that caused deprojectivisation to run multiple times 2018-09-14 12:12:54 +02:00
Matthew Honnibal
8f2a6367e9 Fix usage of PyTorch BiLSTM in ud_train 2018-09-13 22:54:59 +00:00
Matthew Honnibal
afeddfff26 Fix PyTorch BiLSTM 2018-09-13 22:54:34 +00:00
Matthew Honnibal
a26fe8e7bb Small hack in Language.update to make torch work 2018-09-13 22:51:52 +00:00
Matthew Honnibal
445b81ce3f Support bilstm_depth argument in ud-train 2018-09-13 19:30:22 +02:00
Matthew Honnibal
b43643a953 Support bilstm_depth option in parser 2018-09-13 19:29:49 +02:00
Matthew Honnibal
45032fe9e1 Support option of BiLSTM in Tok2Vec (requires pytorch) 2018-09-13 19:28:35 +02:00
Matthew Honnibal
3eb9f3e2b8 Fix defaults for ud-train 2018-09-13 18:05:48 +02:00
Matthew Honnibal
59cf533879 Improve ud-train script. Make config optional 2018-09-13 14:24:08 +02:00
Matthew Honnibal
3e3a309764 Fix tagger 2018-09-13 14:14:38 +02:00
Matthew Honnibal
da7650e84b Fix maximum doc length in ud_train script 2018-09-13 14:10:25 +02:00
Matthew Honnibal
a95eea4c06 Fix multi-task objective for parser 2018-09-13 14:08:55 +02:00
Matthew Honnibal
21321cd6cf Add tok2vec property to parser model 2018-09-13 14:08:43 +02:00
Matthew Honnibal
d6aa60139d Fix tagger training on GPU 2018-09-13 14:05:37 +02:00
Grivaz
aeba99ab0d Introduces a bulk merge function, in order to solve issue #653 (#2696)
* Fix comment

* Introduce bulk merge to increase performance on many span merges

* Sign contributor agreement

* Implement pull request suggestions
2018-09-10 16:41:42 +02:00
tyburam
476472d181 Lex _attrs for polish language (#2750)
* Signed spaCy contributor agreement

* Added polish version of english lex_attrs
2018-09-10 11:53:57 +02:00
Sainath Adapa
77139bc03c Basic support for Telugu language (#2751) 2018-09-10 11:53:18 +02:00
Maxim Kupfer
cebe50b5b8 Remove ')' for clarity (#2737)
Sorry, don't mean to be nitpicky, I just noticed this when going through the CLI and thought it was a quick fix. That said, if this was intention than please let me know.
2018-09-10 11:31:49 +02:00
Matthew Honnibal
b2cb1fc67d Merge matcher tests 2018-09-06 01:39:53 +02:00
Suraj Krishnan Rajan
356af7b0a1 Fix tests 2018-09-06 01:39:36 +02:00
Piotr Żelasko
bdb2165bd1 Less norm computations in token similarity (#2730)
* Less norm computations in token similarity

* Contributor agreement
2018-09-05 21:50:23 +02:00
Aniruddha Adhikary
4530ddcc51 update bengali token rules for hyphen and digits (#2731) 2018-09-05 21:49:00 +02:00
Nathaniel J. Smith
26849874ad When calling getoption() in conftest.py, pass a default option (#2709)
* When calling getoption() in conftest.py, pass a default option

This is necessary to allow testing an installed spacy by running:

  pytest --pyargs spacy

* Add contributor agreement
2018-09-03 09:57:52 +02:00
Matthew Honnibal
4d2d7d5866 Fix new feature flags 2018-08-27 02:12:39 +02:00
Matthew Honnibal
598dbf1ce0 Fix character-based tokenization for Japanese 2018-08-27 01:51:38 +02:00
Matthew Honnibal
3763e20afc Pass subword_features and conv_depth params 2018-08-27 01:51:15 +02:00
Matthew Honnibal
8051136d70 Support subword_features and conv_depth params in Tok2Vec 2018-08-27 01:50:48 +02:00
Matthew Honnibal
9c33d4d1df Add more hyper-parameters to spacy ud-train
* subword_features: Controls whether subword features are used in the
word embeddings. True by default (specifically, prefix, suffix and word
shape). Should be set to False for languages like Chinese and Japanese.

* conv_depth: Depth of the convolutional layers. Defaults to 4.
2018-08-27 01:48:46 +02:00
Ines Montani
e9022f7b33 Remove docstrings for deprecated arguments (see #2703) 2018-08-26 14:23:13 +02:00
Ines Montani
559f4139e3 Add FAC to spacy.explain (resolves #2706) 2018-08-26 14:13:50 +02:00
Matthew Honnibal
51a9efbf3b Add draft Binder class 2018-08-22 13:12:51 +02:00
Matthew Honnibal
5ce459d2ee Fix error in vocab 2018-08-16 17:18:09 +02:00
Matthew Honnibal
00febda2e3 Improve alignment around quotes 2018-08-16 01:04:34 +02:00
Matthew Honnibal
66a3f2ba21 Lower-case text before alignment 2018-08-16 00:42:36 +02:00
Matthew Honnibal
595c893791 Expose noise_level option in train CLI 2018-08-16 00:41:44 +02:00
Matthew Honnibal
8365226bf3 Fix lookup of symbols in vocab. 2018-08-15 23:43:34 +02:00
Matthew Honnibal
b9f0588580 Set version to v2.1.0a1 2018-08-15 17:22:39 +02:00
Matthew Honnibal
e968016417 Note link between issues #2671 and #2675 2018-08-15 17:18:28 +02:00
Matthew Honnibal
63bdc734ba Skip flakey test 2018-08-15 16:56:55 +02:00
Matthew Honnibal
ce512e1d47 Fix #2671: Incorrect match ID on some patterns 2018-08-15 16:19:08 +02:00
Matthew Honnibal
f12b9190f6 Xfail test for issue #2671 2018-08-15 15:55:31 +02:00
Matthew Honnibal
7cfa665ce6 Add failing test for issue 2671: Incorrect rule ID returned from matcher 2018-08-15 15:54:33 +02:00
Matthew Honnibal
1b2a5869ab Set version to v2.1.0a2.dev0 2018-08-15 15:38:52 +02:00
Matthew Honnibal
5080760288 Add extra comment on 'add label' in parser 2018-08-15 15:37:24 +02:00