ines
c2581f9172
Tidy up tokenizer test
2018-07-06 12:40:28 +02:00
Matthew Honnibal
43dcaa473e
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-07-06 12:36:42 +02:00
Matthew Honnibal
6c8d627733
Fix tokenizer deserialization
2018-07-06 12:36:33 +02:00
ines
c001d46153
Tidy up
2018-07-06 12:33:42 +02:00
Matthew Honnibal
63f5651f8d
Fix tokenizer serialization
2018-07-06 12:32:11 +02:00
Matthew Honnibal
e1569fda4e
Fix compile error in matcher
2018-07-06 12:29:23 +02:00
Matthew Honnibal
f5b2076700
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-07-06 12:23:14 +02:00
Matthew Honnibal
1a2f61725c
Fix tokenizer serialization
2018-07-06 12:23:04 +02:00
ines
9e09477b2f
Remove unused import
2018-07-06 12:18:17 +02:00
ines
26f04a6ac3
Fix Matcher tests and add test for any token with operator
2018-07-06 12:17:50 +02:00
Matthew Honnibal
f5703b7a91
Clean up unused stuff in matcher
2018-07-06 12:16:44 +02:00
Matthew Honnibal
08c362d541
Suppress compiler warning about unreachable code
2018-07-06 11:31:22 +02:00
Matthew Honnibal
8ae1bec8bf
Fix init_model
2018-07-05 14:02:06 +02:00
Matthew Honnibal
7b09a4ca49
Fix lemmatization
2018-07-05 13:56:02 +02:00
Matthew Honnibal
ec41ceb383
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-07-05 13:49:42 +02:00
Matthew Honnibal
4eb3405df7
Fix lemmatizer ordering, re Issue #1387
2018-07-05 13:49:29 +02:00
ines
63666af328
Merge branch 'master' into develop
2018-07-04 14:52:25 +02:00
ines
8feb7cfe2d
Remove model dependency from French lemmatizer tests
2018-07-04 14:46:45 +02:00
kleinay
a82c3153ad
fix issue #2452 - displacy arrow direction is always forward ( #2506 ) ( closes #2452 )
...
Refers to #2452: fixes displaCy arrow directions to match the input.
## Description
The fix simply replaces `direction is 'left'` with `direction == 'left'`, so that the comparison also works when `direction` is a `str` rather than a `unicode`.
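As a rough illustration of why the change matters (a sketch for Python 3, not the displaCy source; `is_leftward` is a hypothetical helper): `is` tests object identity, so an equal string built at runtime can fail the check, while `==` compares values.

```python
def is_leftward(direction):
    # Compare by value; object identity ("is") is an implementation detail for strings.
    return direction == "left"


# A string assembled at runtime is equal to the literal by value,
# even though it is generally a different object in memory.
runtime_direction = "".join(["le", "ft"])
print(is_leftward(runtime_direction))
```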
### Types of change
bug fix
## Checklist
- [ ] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-04 14:12:08 +02:00
Bùi Trung Chí
9af46b4f1b
Fix loading tokenizer with custom prefix search ( #2495 )
...
* Add contributor agreement
* Fix loading tokenizer with custom prefix search
2018-07-04 12:56:07 +02:00
Matthew Honnibal
dee8bdb900
Fix init-model for npz vectors
2018-07-04 02:29:48 +02:00
Matthew Honnibal
59d655e8d0
Fix model init from jsonl
2018-07-04 01:30:40 +02:00
Matthew Honnibal
1e38bea6e9
Save vectors init
2018-07-03 23:55:04 +02:00
Matthew Honnibal
6692833887
Fix init_model
2018-07-03 23:24:11 +02:00
Matthew Honnibal
4a38a26cb5
Fix init_model
2018-07-03 22:57:11 +02:00
Matthew Honnibal
019d09e3c3
Fix init model
2018-07-03 22:16:44 +02:00
Matthew Honnibal
2543f8c93a
Support .npz vectors in init-model command
2018-07-03 21:42:16 +02:00
Matthew Honnibal
86aad11939
Fix init_model arg
2018-07-03 17:00:42 +02:00
Matthew Honnibal
eff42d36e3
Fix init model command
2018-07-03 16:32:23 +02:00
Matthew Honnibal
97487122ea
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-07-03 15:44:37 +02:00
Matthew Honnibal
6a89faf12e
Add support for jsonl-formatted lexical attributes to init-model command.
2018-07-03 12:22:56 +02:00
Matthew Honnibal
2ec2192000
Revert #1389 : Don't overrule rules when lemma exception is present
2018-06-29 19:43:02 +02:00
Matthew Honnibal
01ace9734d
Make pipeline work on empty docs
2018-06-29 19:21:38 +02:00
Matthew Honnibal
a1b05048d0
Fix tagger when doc is empty
2018-06-29 16:05:40 +02:00
Matthew Honnibal
3786942ff1
Fix tagger when docs are empty
2018-06-29 15:13:45 +02:00
ines
526be40823
Add test for 46d8a66
2018-06-29 14:33:12 +02:00
ines
f08c871adf
Fix typo in Language.from_disk
2018-06-29 14:32:16 +02:00
Matthew Honnibal
46d8a66fef
Fix tokenizer serialization if token_match is None
2018-06-29 14:24:46 +02:00
Matthew Honnibal
e0860bcfb3
Fix bug when docs are empty
2018-06-29 13:56:29 +02:00
Matthew Honnibal
a4d2b0c293
Fix bug when docs are empty
2018-06-29 13:44:25 +02:00
Matthew Honnibal
c83fccfe2a
Fix output of best model
2018-06-25 23:05:56 +02:00
Matthew Honnibal
5a65418c40
Fix handling of unseen labels in tagger
2018-06-25 22:28:59 +02:00
Matthew Honnibal
5b56aad4c2
Fix handling of unseen labels in tagger
2018-06-25 22:24:54 +02:00
Matthew Honnibal
3aabf621a3
Fix handling of unknown tags in tagger update
2018-06-25 22:01:02 +02:00
Matthew Honnibal
69c900f003
Fix init-model if no vectors provided
2018-06-25 18:26:02 +02:00
Matthew Honnibal
664f89327a
Fix init-model if no vectors provided
2018-06-25 17:58:45 +02:00
Matthew Honnibal
c4698f5712
Don't collate model unless training succeeds
2018-06-25 16:36:42 +02:00
Ole Henrik Skogstrøm
d16cb6bee6
Accept Span to displacy render ( #2478 ) ( closes #2477 )
...
* Add Span to displacy render
* Fix span support, errors and add tests
2018-06-25 14:55:16 +02:00
Matthew Honnibal
24dfbb8a28
Fix model collation
2018-06-25 14:35:24 +02:00
Matthew Honnibal
62237755a4
Import shutil
2018-06-25 13:40:17 +02:00
Matthew Honnibal
a040fca99e
Import json into cli.train
2018-06-25 11:50:37 +02:00
Matthew Honnibal
2c703d99c2
Fix collation of best models
2018-06-25 01:21:34 +02:00
Matthew Honnibal
9d6a1c57f2
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-06-24 23:40:06 +02:00
Matthew Honnibal
2c80b7c013
Collate best model after training
2018-06-24 23:39:52 +02:00
Muhammad Irfan
f33c703066
Add Urdu Language Support ( #2430 )
...
* added Urdu language support.
* added Urdu language tests.
* modified conftest.py for Urdu language support.
* added spacy contributor agreement.
2018-06-22 11:14:03 +02:00
himkt
14d9007efd
fix wrong indexing ( #2416 )
...
* fix wrong indexing
* add agreement
2018-06-19 10:20:57 +02:00
Aliia E
428bae66b5
Add Tatar Language Support ( #2444 )
...
* add Tatar lang support
* add Tatar letters
* add Tatar tests
* sign contributor agreement
* sign contributor agreement [x]
* remove comments from Language class
* remove all template comments
2018-06-19 10:17:53 +02:00
Cory Hurst
446f5ec41b
Silent keyword in info function in init ( #2459 )
...
* Pass through "silent" kwarg to the wrapper in the spacy module init.
reference issue #2196
* Pass through "silent" kwarg to the wrapper in the spacy module init.
reference issue #2196
* contributor agreement
2018-06-18 12:24:21 +02:00
ines
778e5f4da3
Merge branch 'master' into develop
2018-06-11 00:38:04 +02:00
himkt
57311d5d47
replace janome with mecab in the documentation and the test ( #2415 )
...
* Add links to Reddit data (see #2401 )
* replace janome with mecab in the documentation and the test
* add the assignment
2018-06-11 00:33:13 +02:00
Nour Shalabi
a169b79092
Additions to Arabic stop words. ( #2422 )
...
* Additions to Arabic stop words.
* Create nourshalabi.md
2018-06-08 02:33:23 +02:00
ines
a0017e4909
Merge branch 'master' into develop
2018-05-30 14:10:47 +02:00
ines
b8ef9c1000
Fix model names in conftest (see #2379 )
2018-05-30 14:10:20 +02:00
ines
4a62486340
Merge branch 'master' into develop
2018-05-30 13:01:01 +02:00
Maciej
c7d53348d7
Fix bug in CLI iob and ner converter ( #2392 ) ( fixes #2385 )
...
* issue_2385 add tests for iob_to_biluo converter function
* issue_2385 fix and modify iob_to_biluo function to accept either iob or biluo tags in cli.converter
* issue_2385 add test to fix b char bug
* add contributor agreement
* fill contributor agreement
2018-05-30 12:28:44 +02:00
ines
3c3a175018
Merge branch 'master' into develop
2018-05-28 18:37:09 +02:00
ansgar-t
9732988951
escape html in displacy.render ( #2378 ) ( closes #2361 )
...
## Description
Fix for issue #2361 :
Replace `&`, `<`, `>`, `"` with `&amp;`, `&lt;`, `&gt;`, `&quot;` before rendering the SVG.
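A minimal sketch of this kind of escaping, assuming plain string replacement (`escape_html` is a hypothetical name, not necessarily the helper used in displaCy):

```python
def escape_html(text):
    # "&" must be replaced first, otherwise the ampersands introduced
    # by the other replacements would be escaped a second time.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace('"', "&quot;")
    return text


print(escape_html('a < b & "c"'))
```

The order of the replacements is the only subtle point; the standard library's `html.escape` offers similar behaviour and additionally escapes single quotes by default.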
## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
(As discussed in the comments to #2361 )
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-05-28 18:36:41 +02:00
ines
f7103babd9
Only overwrite warnings filter if set explicitly ( resolves #2369 )
...
This way, pre-defined warning filters are respected and users are still able to use the fine-grained warning settings if they like.
2018-05-26 18:44:15 +02:00
ines
330c039106
Merge branch 'master' into develop
2018-05-26 18:30:52 +02:00
James Messinger
4515e96e90
Better formatting for spacy train CLI ( #2357 )
...
* Better formatting for `spacy train` CLI
Changed to use fixed-spaces rather than tabs to align table headers and data.
### Before:
```
Itn. P.Loss N.Loss UAS NER P. NER R. NER F. Tag % Token %
0 4618.857 2910.004 76.172 79.645 67.987 88.732 88.261 100.000 4436.9 6376.4
1 4671.972 3764.812 74.481 78.046 62.374 82.680 88.377 100.000 4672.2 6227.1
2 4742.756 3673.473 71.994 77.380 63.966 84.494 90.620 100.000 4298.0 5983.9
```
### After:
```
Itn.  Dep Loss  NER Loss  UAS     NER P.  NER R.  NER F.  Tag %   Token %  CPU WPS  GPU WPS
0     4618.857  2910.004  76.172  79.645  67.987  88.732  88.261  100.000  4436.9   6376.4
1     4671.972  3764.812  74.481  78.046  62.374  82.680  88.377  100.000  4672.2   6227.1
2     4742.756  3673.473  71.994  77.380  63.966  84.494  90.620  100.000  4298.0   5983.9
```
* Added contributor file
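The fixed-width alignment described in this commit can be sketched roughly as follows. This is an illustration only: `format_row` and the column widths are assumptions, not the code in `spacy train`.

```python
def format_row(values, widths):
    # Left-justify every cell to its column width, then trim trailing spaces.
    cells = [str(value).ljust(width) for value, width in zip(values, widths)]
    return "".join(cells).rstrip()


widths = [6, 10, 8]  # illustrative widths, each wider than its longest cell
header = format_row(["Itn.", "Dep Loss", "UAS"], widths)
row = format_row([0, "4618.857", "76.172"], widths)
print(header)
print(row)
```

Because every cell is padded to a fixed width, headers and data start at the same column no matter how a terminal renders tabs.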
2018-05-25 13:08:45 +02:00
Aristo Rinjuang
432ede04af
adding more words and rephrasing ( #2351 )
...
* adding more words and rephrasing
* adding a contributor
* tokenizer bugs solved
2018-05-24 11:40:57 +02:00
Jani Monoses
ec62cadf4c
Updates to Romanian support ( #2354 )
...
* Add back Romanian in conftest
* Romanian lex_attr
* More tokenizer exceptions for Romanian
* Add tests for some Romanian tokenizer exceptions
2018-05-24 11:40:00 +02:00
Matthew Honnibal
5d281cf302
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-05-22 20:50:59 +02:00
Matthew Honnibal
ce458c2428
Fix spacy requirement constraint in package template
2018-05-22 20:50:46 +02:00
Ines Montani
862da5e793
Support pipeline factories via entry points ( #2348 )
2018-05-22 18:29:45 +02:00
Matthew Honnibal
d5af38f80c
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-05-21 17:42:55 +02:00
Matthew Honnibal
ee33de8652
Fix unpickling of NER parser
2018-05-21 17:42:40 +02:00
ines
f9dbcac8e4
Merge branch 'master' into develop
2018-05-21 02:29:29 +02:00
cclauss
f7dcaa1f6b
Simplify is_config() and normalize_string_keys() ( #2305 )
...
* Simplify is_config() and normalize_string_keys()
* Use `in` to avoid the nested `and`s and `or`s.
* Dict comprehension directly tracks with the doc string
* Keep more basic loop in normalize_string_keys
* Whitespace
2018-05-21 01:54:35 +02:00
Ines Montani
cae4457c38
💫 Add .similarity warnings for no vectors and option to exclude warnings ( #2197 )
...
* Add logic to filter out warning IDs via environment variable
Usage: SPACY_WARNING_EXCLUDE=W001,W007
* Add warnings for empty vectors
* Add warning if no word vectors are used in .similarity methods
For example, if only tensors are available in small models – should hopefully clear up some confusion around this
* Capture warnings in tests
* Rename SPACY_WARNING_EXCLUDE to SPACY_WARNING_IGNORE
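A minimal sketch of filtering warning IDs through an environment variable, assuming a comma-separated value as in the usage line above. The helper names are hypothetical and this is not spaCy's actual implementation.

```python
import os


def ignored_warning_ids(environ=None, var_name="SPACY_WARNING_IGNORE"):
    # Read a comma-separated list of warning IDs, e.g. "W001,W007".
    environ = os.environ if environ is None else environ
    raw = environ.get(var_name, "")
    return {part.strip() for part in raw.split(",") if part.strip()}


def should_show(warning_id, environ=None):
    # A warning is shown unless its ID is listed in the variable.
    return warning_id not in ignored_warning_ids(environ)


fake_env = {"SPACY_WARNING_IGNORE": "W001,W007"}
print(should_show("W001", fake_env), should_show("W002", fake_env))
```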
2018-05-21 01:22:38 +02:00
Matthew Honnibal
b096b22c20
Merge pull request #2247 from skrcode/1480
...
1480 - Implement Fast-Text vectors with subword features
2018-05-21 01:16:21 +02:00
Matthew Honnibal
f3b4f6a4ec
Merge setup.py
2018-05-20 23:21:00 +02:00
Ines Montani
d4cc736b7c
💫 Improve model downloads: check for existing install, customise pip and use requests library again ( #2346 )
...
* Go back to using requests instead of urllib (closes #2320 )
Fewer dependencies are good, but this one was simply causing too many other problems around SSL verification and Python 2/3 compatibility. requests is a popular enough package that it's okay for spaCy to depend on it – and this will hopefully make model downloads less flakey.
* Only download model if not installed (see #1456 )
Use #egg=model==version to allow pip to check for existing installations. The download is only started if no installation matching the package/version is found. Fixes a long-standing inconvenience.
* Pass additional options to pip when installing model (resolves #1456 )
Treat all additional arguments passed to the download command as pip options to allow user to customise the command. For example:
python -m spacy download en --user
* Add CLI option to enable installing model package dependencies
* Revert "Add CLI option to enable installing model package dependencies"
This reverts commit 9336ffe695.
* Update documentation
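The commit itself relies on pip and the `#egg=model==version` fragment to detect an existing installation. As a loosely related sketch using a different technique, the same "skip the download if already installed" check could be written with the standard library (Python 3.8+); `is_installed` is a hypothetical helper:

```python
from importlib.metadata import PackageNotFoundError, version


def is_installed(package, wanted_version):
    # True only if the package is installed at exactly the wanted version.
    try:
        return version(package) == wanted_version
    except PackageNotFoundError:
        return False


# A download step would only run when this check returns False.
print(is_installed("surely-not-a-real-package-name-12345", "1.0.0"))
```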
2018-05-20 20:26:56 +02:00
Matthew Honnibal
3eb446e0a5
Require thinc 6.11.1 and prepare for release to spacy-nightly
2018-05-20 19:00:34 +02:00
Matthew Honnibal
bdc23dd8c1
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-05-20 18:59:24 +02:00
ines
5401c55c75
Merge branch 'master' into develop
2018-05-20 16:49:40 +02:00
ines
b59e3b157f
Don't require attrs argument in Doc.retokenize and allow both ints and unicode ( resolves #2304 )
2018-05-20 15:15:37 +02:00
ines
5768df4f09
Add SimpleFrozenDict util to use as default function argument
2018-05-20 15:13:37 +02:00
Matthew Honnibal
7431e9c87f
Fix parser for GPU
2018-05-19 17:24:34 +00:00
Matthew Honnibal
401213fb1f
Only warn about unnamed vectors if non-zero sized.
2018-05-19 18:51:55 +02:00
Matthew Honnibal
74d5c625b3
Use rising beam update prob
2018-05-16 20:11:59 +02:00
Matthew Honnibal
544ae7f1db
Merge branch 'develop' into feature/refactor-parser
2018-05-16 02:06:49 +02:00
Matthew Honnibal
d1b27fe5aa
Revert "Improve dynamic oracle when values are missing in parse"
...
This reverts commit f56bd4736b.
2018-05-16 00:31:52 +02:00
Matthew Honnibal
83acaa0358
Add missing name attribute for parser
2018-05-15 19:01:53 +02:00
Matthew Honnibal
f328c195ca
Fix size limits in training data
2018-05-15 19:01:41 +02:00
Matthew Honnibal
8446b35ce0
Fix parser model loading
2018-05-15 18:43:46 +02:00
Matthew Honnibal
dc1a479fbd
Merge branch 'develop' into feature/refactor-parser
2018-05-15 18:39:21 +02:00
Matthew Honnibal
546dd99cdf
Merge master into develop -- mostly Arabic and website
2018-05-15 18:14:28 +02:00
Matthew Honnibal
5664ab7e6c
Revert hacks to tests
2018-05-15 18:00:09 +02:00
Matthew Honnibal
7b9195657b
Restore beam_density argument for parser beam
2018-05-15 17:55:11 +02:00
Matthew Honnibal
581d318971
Fix conftest
2018-05-15 00:54:45 +02:00
Tahar Zanouda
00417794d3
Add Arabic language ( #2314 )
...
* added support for Arabic lang
* added Arabic language support
* updated conftest
2018-05-15 00:27:19 +02:00
Jani Monoses
0e08e49e87
Lemmatizer ro ( #2319 )
...
* Add Romanian lemmatizer lookup table.
Adapted from http://www.lexiconista.com/datasets/lemmatization/
by replacing cedillas with commas (ș and ț).
The original dataset is licensed under the Open Database License.
* Fix one blatant issue in the Romanian lemmatizer
* Romanian examples file
* Add ro_tokenizer in conftest
* Add Romanian lemmatizer test
2018-05-12 15:20:04 +02:00
Matthew Honnibal
887631ca25
Disable some tests to figure out why CI fails
2018-05-10 16:42:01 +02:00
Matthew Honnibal
902a172cb7
Disable some tests to figure out why CI fails
2018-05-10 16:30:07 +02:00
Matthew Honnibal
614d45ea58
Set a more aggressive threshold on the max violation update
2018-05-10 15:38:24 +02:00
Matthew Honnibal
8e8724b55b
Default to beam_update_prob 1
2018-05-10 15:38:02 +02:00
Jani Monoses
42b34832e4
Update Romanian stopword list ( #2316 )
...
* Contributor agreement for janimo
* Update Romanian stopword list
Include the correct spellings of all the words already in the repo
that are using cedillas (ş and ţ) instead of commas (ș and ț).
Add another unrelated spelling fix.
See https://github.com/stopwords-iso/stopwords-ro/pull/1 and
https://github.com/stopwords-iso/stopwords-ro/pull/2
2018-05-10 12:16:56 +02:00
Lucas Abbade
be7fdc59d1
Update lex_attrs.py ( #2307 )
...
* Update lex_attrs.py
Fixed spelling mistakes of some numbers (according to Brazilian Portuguese).
* Update lex_attrs.py
As requested, I've included the correct spelling for both Brazilian Portuguese and Portuguese Portuguese.
I would advise, however, that the two be separated in the future. Brazilian Portuguese differs considerably from the original language: although most of the writing is unified, the way people talk in the two countries is radically different. Keeping both as one language may lead to bigger issues in the future, especially when it comes to spell checking.
2018-05-09 20:49:31 +02:00
mauryaland
5368ba028a
Update stop_words.py for French language ( #2310 )
...
* Add contraction forms of some common stopwords
All the stopwords added contain an apostrophe, either " ' " or " ’ ".
* Adds contributor agreement mauryaland
* Update mauryaland.md
2018-05-09 12:04:38 +02:00
Matthew Honnibal
a61fd60681
Fix error in beam gradient calculation
2018-05-09 02:44:09 +02:00
Matthew Honnibal
a6ae1ee6f7
Don't modify Token in global scope
2018-05-09 00:43:00 +02:00
Matthew Honnibal
f94f721f40
Avoid importing fused token symbol in ud-run-test, until that's added
2018-05-09 00:28:03 +02:00
Matthew Honnibal
659ec5b975
Avoid importing fused token symbol in ud-run-test, until that's added
2018-05-08 19:40:33 +02:00
Matthew Honnibal
4cb0494bef
Bug fixes to beam search after refactor
2018-05-08 13:48:50 +02:00
Matthew Honnibal
5ed71973b3
Add a keyword argument sink to GoldParse
2018-05-08 13:48:32 +02:00
Matthew Honnibal
8cfe326f87
Avoid relying on final gold check in beam search
2018-05-08 13:48:19 +02:00
Matthew Honnibal
fc4dd49b77
Support oracle segmentation in ud-train CLI command
2018-05-08 13:47:45 +02:00
Matthew Honnibal
c49e44349a
Fix beam parsing
2018-05-08 02:53:24 +02:00
Matthew Honnibal
99649d114d
Fix parser
2018-05-08 00:27:26 +02:00
Matthew Honnibal
8a82367a9d
Fix beam search after refactor
2018-05-08 00:20:33 +02:00
Matthew Honnibal
5a0f26be0c
Readd beam search after refactor
2018-05-08 00:19:52 +02:00
ines
7a3599c21a
Fix formatting and consistency
2018-05-07 23:02:11 +02:00
Matthew Honnibal
36b2c9bdd5
Fix refactored parser
2018-05-07 18:58:09 +02:00
Matthew Honnibal
bde3be1ad1
Fix refactored parser
2018-05-07 18:31:04 +02:00
Matthew Honnibal
01c4e13b02
Update test
2018-05-07 16:59:52 +02:00
Matthew Honnibal
f6cdafc00e
Fix refactored parser
2018-05-07 16:59:38 +02:00
Matthew Honnibal
f56bd4736b
Improve dynamic oracle when values are missing in parse
2018-05-07 15:53:18 +02:00
Matthew Honnibal
eddc0e0c74
Set gold.sent_starts in ud_train
2018-05-07 15:52:47 +02:00
Matthew Honnibal
bf19f22340
Allow gold.sent_starts to be set from Python
2018-05-07 15:51:34 +02:00
Matthew Honnibal
7f163442e6
Work on refactoring greedy parser
2018-05-07 15:45:52 +02:00
Douglas Knox
9b49a40f4e
Test and fix for Issue #2219 ( #2272 )
...
Test and fix for Issue #2219 : Token.similarity() failed if single letter
2018-05-03 18:40:46 +02:00
Paul O'Leary McCann
bd72fbf09c
Port Japanese mecab tokenizer from v1 ( #2036 )
...
* Port Japanese mecab tokenizer from v1
This brings the Mecab-based Japanese tokenization introduced in #1246 to
spaCy v2. There isn't a JapaneseTagger implementation yet, but POS tag
information from Mecab is stored in a token extension. A tag map is also
included.
As a reminder, Mecab is required because Universal Dependencies are
based on Unidic tags, and Janome doesn't support Unidic.
Things to check:
1. Is this the right way to use a token extension?
2. What's the right way to implement a JapaneseTagger? The approach in
#1246 relied on `tag_from_strings` which is just gone now. I guess the
best thing is to just try training spaCy's default Tagger?
-POLM
* Add tagging/make_doc and tests
2018-05-03 18:38:26 +02:00
G.Pruvost
cc8e804648
#2211 - Support for ssl certs config on download command ( #2212 )
...
* Add support for SSL/Certs customization on download CLI
* Add a note on SSL options for the 'download' CLI in the README
* Add contributor agreement
2018-05-03 18:37:02 +02:00
Jens Dahl Møllerhøj
b9290397fb
rename SP to _SP ( #2289 )
2018-05-03 18:33:49 +02:00
Matthew Honnibal
a8e70a4187
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-05-03 14:02:10 +02:00
Matthew Honnibal
c0e596283b
Set version to 2.1.0a0
2018-05-03 14:00:11 +02:00
Matthew Honnibal
8cd06cc763
Try to fix root-outside-sentence bug
2018-05-02 14:39:48 +00:00
Matthew Honnibal
acebd01033
Set children from heads in finalize doc
2018-05-02 14:19:22 +00:00
Matthew Honnibal
569440a6db
Don't normalize gradient by batch size
2018-05-02 08:42:10 +02:00
Matthew Honnibal
281e29cbcd
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-05-02 01:36:23 +00:00
Matthew Honnibal
2338e8c7fc
Update develop from master
2018-05-02 01:36:12 +00:00
Matthew Honnibal
9d147e12c4
Merge remote-tracking branch 'origin/master' into develop
2018-05-01 18:18:51 +02:00
Matthew Honnibal
6d0fe67b72
Constrain subtok label to adjacent tokens
2018-05-01 17:34:27 +02:00
Matthew Honnibal
8f21953fc5
Constrain subtok to adjacent words
2018-05-01 17:29:00 +02:00
Matthew Honnibal
b43bfd3524
Fix arc-eager oracle tests
2018-05-01 16:16:14 +02:00
Matthew Honnibal
31ed64e9b0
Fix textcat test
2018-05-01 15:18:39 +02:00
Matthew Honnibal
548bdff943
Update default Adam settings
2018-05-01 15:18:20 +02:00
Matthew Honnibal
adbb1f7533
Add better arc-eager oracle tests
2018-05-01 15:14:55 +02:00
Matthew Honnibal
697bcaa34f
Add some methods to ArcEager that make testing easier
2018-05-01 15:13:14 +02:00
Mr Roboto
6f5ccda19c
Addresses Issue #2228 - Deserialization fails when using tensor=False or sentiment=False ( #2230 )
...
* Fixes issue #2228
* Adds a new contributor
2018-05-01 13:40:22 +02:00
Matthew Honnibal
d44bb45c72
Fix scoring if tokenization changes
2018-05-01 01:33:20 +02:00
Matthew Honnibal
2b26c007cd
Revert "Disable batch size compounding in ud-train"
...
This reverts commit 8a120fb455.
2018-04-29 14:09:02 +00:00
Matthew Honnibal
723b328062
Add script to run UD test
2018-04-29 15:50:25 +02:00
Matthew Honnibal
17af6aa3a4
Update ud_train script
2018-04-29 15:49:32 +02:00
Matthew Honnibal
5de8a36537
Fix arc_eager is_nonproj_tree
2018-04-29 15:49:11 +02:00
Matthew Honnibal
5260268f70
Fix textcat after merge
2018-04-29 15:48:53 +02:00
Matthew Honnibal
ad3d56c3ba
Fix compile error in matcher
2018-04-29 15:48:34 +02:00
Matthew Honnibal
a8bc947fd4
Fix Token.set_extension
2018-04-29 15:48:19 +02:00
Matthew Honnibal
2c4a6d66fa
Merge master into develop. Big merge, many conflicts -- need to review
2018-04-29 14:49:26 +02:00
ines
3c80f69ff5
Return data in cli.info and add silent option ( resolves #2196 )
2018-04-29 01:59:44 +02:00
ines
1c6d77610c
Add remove_extension method on Doc, Token and Span ( closes #2242 )
2018-04-28 23:33:09 +02:00
ines
abdb853ebf
Simplify underscore tests
2018-04-28 23:30:33 +02:00
ines
6fb6371670
Add collapse_phrases option to displacy ( closes #2266 )
2018-04-28 23:06:50 +02:00
Robin Linderborg
1f9904ef12
fixes #2238 ( #2241 )
...
* Remove erroneous lemma lookup år > åra in Swedish
* Add contributors agreement
* Add contrib agreement to correct directory
* Revert change to CONTRIBUTOR_AGREEMENT
2018-04-28 14:55:22 +02:00
Robin Linderborg
d01f503b54
Remove incorrect lemma lookup gäng->gänga ( #2252 )
...
* Remove incorrect lemma lookup gäng->gänga
In modern Swedish, "gäng" is mostly associated with "gang" or "group of people". The removed lemma lookup lemmatized it to the verb "thread".
* Add contrib agreement to correct directory
* Revert change to CONTRIBUTOR_AGREEMENT
2018-04-28 14:54:41 +02:00
Suraj Krishnan Rajan
69d041148f
Implement Fast-Text vectors with subword features
2018-04-21 01:34:14 +05:30
ines
686225eadd
Fix Spanish noun_chunks ( resolves #2210 )
...
Make sure 'NP' label is added to StringStore and move noun_bounds helper into a closure to allow reusing label sets
2018-04-18 18:44:01 -04:00
ines
9632595fb4
Use correct, non-deprecated merge syntax ( resolves #2226 )
2018-04-18 18:28:28 -04:00
Suraj Rajan
5957f15227
Fixed typos for #2222,#2223 ( #2233 ) ( closes #2222 , closes #2223 )
2018-04-18 14:55:26 -07:00
Matthew Honnibal
97851d2c4e
Increment version to v2.0.12.dev0
2018-04-10 22:20:16 +02:00
Matthew Honnibal
ed39c75a92
Merge branch 'master' of https://github.com/explosion/spaCy
2018-04-10 22:19:40 +02:00
Matthew Honnibal
3836199a83
Fix loading of models when custom vectors are added
2018-04-10 22:19:20 +02:00
ines
0299d5fac8
Update argument annotations and formatting
2018-04-10 21:45:11 +02:00
ines
49b1e48bf5
Fix syntax error
2018-04-10 21:44:59 +02:00
ines
70052e46e9
Fix formatting [ci skip]
2018-04-10 21:42:46 +02:00
Matthew Honnibal
0ddb152be0
Improve error message when reading vectors
2018-04-10 21:26:50 +02:00
Matthew Honnibal
db50ac524e
Support zipped vector files in init-model
2018-04-10 21:21:00 +02:00
ines
270fcfd925
Fix typo in package command message ( closes #2200 )
2018-04-10 19:14:31 +02:00
ines
24d8bf348d
Revert "Add support for .zip to init_model"
...
This reverts commit 7ee880a0ad.
2018-04-10 19:08:06 +02:00
Matthew Honnibal
7ee880a0ad
Add support for .zip to init_model
2018-04-10 14:30:04 +00:00
ines
5ecb274764
Fix indentation error and set Doc.is_tagged correctly
2018-04-10 16:14:52 +02:00
ines
987ee27af7
Return Doc in noun chunks merger component if Doc is not parsed
2018-04-09 14:51:02 +02:00
Xiaoquan Kong
e2f13ec722
bugfix: Doc.noun_chunks call Doc.noun_chunks_iterator without checking ( closes #2194 )
2018-04-08 23:44:05 +02:00
Jens Dahl Møllerhøj
e5055e3cf6
Add Danish lemmatizer ( #2184 )
...
* add danish lemmatizer
* fill contributor agreement
2018-04-07 19:07:28 +02:00
ines
bccbf538ef
Revert "Check if spaCy has compiled correctly and show error message"
...
This reverts commit 3463ded7cf.
2018-04-06 15:49:44 +02:00
ines
fb4eda6616
Merge branch 'master' of https://github.com/explosion/spaCy
2018-04-06 00:38:48 +02:00
Matthew Honnibal
0c7fab4443
Set version to 2.0.11
2018-04-04 11:19:11 +02:00
Matthew Honnibal
a350be0601
Fix vector-name loading fix
2018-04-04 01:31:25 +02:00
Matthew Honnibal
21047bde52
Fix syntax error in italian lemmatizer
2018-04-03 23:13:22 +02:00
Matthew Honnibal
81f4005f3d
Fix loading models with pretrained vectors
2018-04-03 23:11:48 +02:00
ines
3463ded7cf
Check if spaCy has compiled correctly and show error message
2018-04-03 22:18:47 +02:00
Matthew Honnibal
96b612873b
Add hyper-parameter to control whether parser makes a beam update
2018-04-03 22:02:56 +02:00
ines
e5f47cd82d
Update errors
2018-04-03 21:40:29 +02:00
Matthew Honnibal
f7e6313b43
Increment version to v2.0.11.dev0
2018-04-03 20:58:47 +02:00
ines
10462816bc
Fix tests for Python 2
2018-04-03 18:51:31 +02:00
ines
62b4b527d7
Don't raise error if set_extension has getter and setter ( closes #2177 )
...
Improve error messages, raise error if setter is specified without a getter and compare against _unset to allow default=None. Also add more tests.
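The sentinel comparison mentioned here can be sketched as follows. This is a simplified illustration under assumed names (`describe_extension` is hypothetical), not the `set_extension` code itself.

```python
_unset = object()  # sentinel: distinguishes "no default given" from default=None


def describe_extension(default=_unset, getter=None, setter=None):
    # A setter on its own is rejected; getter plus setter is allowed.
    if setter is not None and getter is None:
        raise ValueError("setter given without a getter")
    return {
        # Comparing against the sentinel keeps default=None a valid choice.
        "has_default": default is not _unset,
        "has_getter": getter is not None,
        "has_setter": setter is not None,
    }


print(describe_extension(default=None))
```

Checking `default is not None` instead would make it impossible to register an attribute whose default value is `None`.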
2018-04-03 18:30:17 +02:00
ines
ee3082ad29
Fix whitespace
2018-04-03 18:29:53 +02:00
Ines Montani
3141e04822
💫 New system for error messages and warnings ( #2163 )
...
* Add spacy.errors module
* Update deprecation and user warnings
* Replace errors and asserts with new error message system
* Remove redundant asserts
* Fix whitespace
* Add messages for print/util.prints statements
* Fix typo
* Fix typos
* Move CLI messages to spacy.cli._messages
* Add decorator to display error code with message
An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc.
* Remove unused link in spacy.about
* Update errors for invalid pipeline components
* Improve error for unknown factories
* Add displaCy warnings
* Update formatting consistency
* Move error message to spacy.errors
* Update errors and check if doc returned by component is None
2018-04-03 15:50:31 +02:00
Matthew Honnibal
abf8b16d71
Add doc.retokenize() context manager ( #2172 )
...
This patch takes a step towards #1487 by introducing the
doc.retokenize() context manager, to handle merging spans, and soon
splitting tokens.
The idea is to do merging and splitting like this:
    with doc.retokenize() as retokenizer:
        for start, end, label in matches:
            retokenizer.merge(doc[start : end], attrs={'ent_type': label})
The retokenizer accumulates the merge requests, and applies them
together at the end of the block. This will allow retokenization to be
more efficient, and much less error prone.
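The accumulate-then-apply idea can be sketched with a toy context manager over a plain list of strings. `ToyRetokenizer` is a hypothetical stand-in that assumes non-overlapping spans; it says nothing about how spaCy implements the real retokenizer.

```python
class ToyRetokenizer:
    """Collects merge requests and applies them together on exit."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.merges = []

    def merge(self, start, end):
        # Only record the request; nothing changes inside the block.
        self.merges.append((start, end))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            # Apply right to left so earlier offsets stay valid.
            for start, end in sorted(self.merges, reverse=True):
                self.tokens[start:end] = [" ".join(self.tokens[start:end])]
        return False


tokens = ["I", "like", "New", "York", "and", "Los", "Angeles"]
with ToyRetokenizer(tokens) as retokenizer:
    retokenizer.merge(2, 4)
    retokenizer.merge(5, 7)
print(tokens)
```

Deferring the work to the end of the block means the offsets passed to `merge` all refer to the original tokenization, which is what makes the pattern less error prone.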
A retokenizer.split() function will then be added, to handle splitting a
single token into multiple tokens. These methods take `Span` and `Token`
objects; if the user wants to go directly from offsets, they can append
to the .merges and .splits lists on the retokenizer.
The doc.merge() method's behaviour remains unchanged, so this patch
should be 100% backwards compatible (modulo bugs). Internally,
doc.merge() fixes up the arguments (to handle the various deprecated styles),
opens the retokenizer, and makes the single merge.
We can later start making deprecation warnings on direct calls to doc.merge(),
to migrate people to use of the retokenize context manager.
2018-04-03 14:10:35 +02:00
Matthew Honnibal
8a120fb455
Disable batch size compounding in ud-train
2018-04-01 08:45:00 +00:00
Matthew Honnibal
98165e43a7
Sometimes update beam with greedy oracle
2018-04-01 08:44:35 +00:00
Suraj Rajan
1cdbb7c97c
[2032] - Changed python set to cpp stl set ( #2170 )
...
Changed python set to cpp stl set #2032
## Description
Changed Python set to C++ STL set. The C++ STL set works better due to the logarithmic run time of its methods. Finding the minimum in the C++ set is done in constant time, as opposed to the worst-case linear run time for a Python set. Operations such as find, count, insert and delete are also done in either constant or logarithmic time, which makes the C++ set a better option for managing vectors.
Reference : http://www.cplusplus.com/reference/set/set/
### Types of change
Enhancement for `Vectors`, for faster initialisation of word vectors (fasttext)
2018-03-31 13:28:25 +02:00
Matthew Honnibal
f3b7c5e537
Fix syntax error
2018-03-29 21:50:32 +02:00
Matthew Honnibal
23afa6429f
Add input length error, to address #1826
2018-03-29 21:45:26 +02:00
Ines Montani
a609a1ca29
Merge pull request #2152 from explosion/feature/tidy-up-dependencies
...
💫 Tidy up dependencies
2018-03-29 14:35:09 +02:00
Viet Trung Tran
ea2af94cd9
Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer ( #2155 )
...
* support for Vietnamese
* Contributor Agreement for adding Vietnamese support on spaCy
2018-03-29 12:19:51 +02:00
ines
e6979bdbbd
Merge branch 'feature/tidy-up-dependencies' of https://github.com/explosion/spaCy into feature/tidy-up-dependencies
2018-03-29 00:19:37 +02:00
ines
83146458a2
Fix urllib for Python 3
2018-03-29 00:19:33 +02:00
Matthew Honnibal
8308bbc617
Get msgpack and msgpack_numpy via Thinc, to avoid potential version conflicts
2018-03-29 00:14:55 +02:00
Matthew Honnibal
b5098079d8
Fix error on urllib
2018-03-29 00:08:16 +02:00
Ines Montani
0de599b16b
Merge pull request #2159 from explosion/feature/fix-merged-entity-iob ( resolves #1554 , resolves #1752 )
...
💫 Fix token.ent_iob after doc.merge(), and ensure consistency in doc.ents
2018-03-28 23:10:00 +02:00
Ines Montani
98e9cda677
Merge pull request #2158 from explosion/feature/fix-multiple-vectors ( resolves #1660 )
...
💫 Fix loading of multiple vector models
2018-03-28 23:08:24 +02:00
Matthew Honnibal
a7c5ae2beb
Avoid forcing a name on empty vectors, and remove print statement
2018-03-28 21:08:58 +02:00
ines
3eb67bbe4b
Allow entity types with dashes ( resolves #1967 )
2018-03-28 20:51:26 +02:00
Matthew Honnibal
cf5fcf0546
Update serialization test
2018-03-28 20:12:53 +02:00
Matthew Honnibal
4555e3e251
Don't assume pretrained_vectors cfg set in build_tagger
2018-03-28 20:12:45 +02:00
Matthew Honnibal
0b375d50c8
Fix ent_iob tags in doc.merge to avoid inconsistent sequences
2018-03-28 18:39:03 +02:00
Matthew Honnibal
95fa89c4b8
Update doc.ents test
2018-03-28 18:39:03 +02:00
Matthew Honnibal
e807f88410
Resolve merge when cherry-picking ent iob patches from develop
2018-03-28 18:38:13 +02:00
Matthew Honnibal
99fbc7db33
Improve error message when entity sequence is inconsistent
2018-03-28 18:36:53 +02:00
Matthew Honnibal
cbd2794be0
Add test for ent_iob during span merge
2018-03-28 18:36:53 +02:00
Matthew Honnibal
f8dd905a24
Warn and fallback if vectors have no name
2018-03-28 18:24:53 +02:00
Matthew Honnibal
fd9e259414
Add test for #1660
2018-03-28 18:22:51 +02:00
Matthew Honnibal
bc4afa9881
Remove print statement
2018-03-28 17:48:37 +02:00
Matthew Honnibal
79dc241caa
Set pretrained_vectors in parser cfg
2018-03-28 17:35:07 +02:00
Matthew Honnibal
17c3e7efa2
Add message noting vectors
2018-03-28 16:33:43 +02:00
Matthew Honnibal
9bf6e93b3e
Set pretrained_vectors in begin_training
2018-03-28 16:32:41 +02:00
Matthew Honnibal
95a9615221
Fix loading of multiple pre-trained vectors
...
This patch addresses #1660 , which was caused by keying all pre-trained
vectors with the same ID when telling Thinc how to refer to them. This
meant that if multiple models were loaded that had pre-trained vectors,
errors or incorrect behaviour resulted.
The vectors class now includes a .name attribute, which defaults to:
{nlp.meta['lang']_nlp.meta['name']}.vectors
The vectors name is set in the cfg of the pipeline components under the
key pretrained_vectors. This replaces the previous cfg key
pretrained_dims.
In order to make existing models compatible with this change, we check
for the pretrained_dims key when loading models in from_disk and
from_bytes, and add the cfg key pretrained_vectors if we find it.
2018-03-28 16:02:59 +02:00
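The naming scheme and backwards-compatibility check described in the commit message above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the patch: the helper names `default_vectors_name` and `upgrade_cfg` are hypothetical, and the exact condition used in `from_disk` and `from_bytes` may differ.

```python
def default_vectors_name(meta):
    # Builds "{lang}_{name}.vectors" from the model meta, as the commit describes.
    return "{}_{}.vectors".format(meta["lang"], meta["name"])


def upgrade_cfg(cfg, vectors_name):
    """Hypothetical helper: return a copy of cfg that works with the new key.

    Older models only carry 'pretrained_dims'. If that key is present and
    'pretrained_vectors' is missing, the new key is added (assumed behaviour).
    """
    cfg = dict(cfg)  # copy, so the caller's dict is left untouched
    if "pretrained_dims" in cfg and "pretrained_vectors" not in cfg:
        cfg["pretrained_vectors"] = vectors_name
    return cfg


# The meta values here are made up for the example.
name = default_vectors_name({"lang": "en", "name": "core_web_md"})
old_cfg = {"pretrained_dims": 300}
new_cfg = upgrade_cfg(old_cfg, name)
```

Because each model gets its own vectors name, two models loaded in the same process no longer refer to their vectors under the same ID.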
ines
7fbc9e5874
Replace requests with urllib
2018-03-28 12:46:07 +02:00
ines
da1f200362
Add compat helpers for urllib
2018-03-28 12:45:53 +02:00
ines
ac88c72c9a
Fix ftfy workaround and remove old import
2018-03-28 12:14:28 +02:00
ines
ce6071ca89
Remove ftfy dependency and update docs
2018-03-28 12:09:42 +02:00
Matthew Honnibal
070b6c6495
Remove dependency on ftfy
2018-03-28 12:07:02 +02:00
ines
6d2c85f428
Drop six and related hacks as a dependency
2018-03-28 10:45:25 +02:00
ines
9e83513004
Add position of invalid token to error message
2018-03-27 23:56:59 +02:00
ines
11c4735ccf
Fix issue in Italian lemmatizer data ( resolves #2050 )
2018-03-27 23:55:22 +02:00
Matthew Honnibal
6a961928b2
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-03-27 21:01:48 +00:00
Matthew Honnibal
b7136cb094
Support zipped vector files in init-model
2018-03-27 21:01:18 +00:00
ines
693971dd8f
Improve error message if token text is empty string (see #2101 )
2018-03-27 22:25:40 +02:00
ines
0c829e6605
Fix whitespace
2018-03-27 22:20:59 +02:00
Matthew Honnibal
de9fd091ac
Fix #2014 : token.pos_ not writeable
2018-03-27 21:21:11 +02:00
Matthew Honnibal
18da89e04c
Handle non-callable gold_tuples in parser begin_training
2018-03-27 21:08:41 +02:00
Matthew Honnibal
1f7229f40f
Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
...
This reverts commit c9ba3d3c2d, reversing changes made to 92c26a35d4.
2018-03-27 19:23:02 +02:00
Matthew Honnibal
8b7a74570f
Revert "Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop""
...
This reverts commit f41e626844.
2018-03-27 19:22:52 +02:00
Matthew Honnibal
f41e626844
Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
...
This reverts commit c9ba3d3c2d, reversing changes made to f57bfbccdc.
2018-03-27 19:22:25 +02:00
Matthew Honnibal
c9ba3d3c2d
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-03-27 18:59:08 +02:00
Matthew Honnibal
92c26a35d4
Update get_cuda_stream
2018-03-27 16:42:00 +00:00
Matthew Honnibal
f57bfbccdc
Fix non-projective label filtering
2018-03-27 13:41:33 +02:00
Matthew Honnibal
d2118792e7
Merge changes from master
2018-03-27 13:38:41 +02:00