Matthew Honnibal
ed39c75a92
Merge branch 'master' of https://github.com/explosion/spaCy
2018-04-10 22:19:40 +02:00
Matthew Honnibal
3836199a83
Fix loading of models when custom vectors are added
2018-04-10 22:19:20 +02:00
ines
0299d5fac8
Update argument annotations and formatting
2018-04-10 21:45:11 +02:00
ines
49b1e48bf5
Fix syntax error
2018-04-10 21:44:59 +02:00
ines
70052e46e9
Fix formatting [ci skip]
2018-04-10 21:42:46 +02:00
Matthew Honnibal
0ddb152be0
Improve error message when reading vectors
2018-04-10 21:26:50 +02:00
Matthew Honnibal
db50ac524e
Support zipped vector files in init-model
2018-04-10 21:21:00 +02:00
ines
270fcfd925
Fix typo in package command message (closes #2200)
2018-04-10 19:14:31 +02:00
ines
24d8bf348d
Revert "Add support for .zip to init_model"
...
This reverts commit 7ee880a0ad.
2018-04-10 19:08:06 +02:00
Matthew Honnibal
7ee880a0ad
Add support for .zip to init_model
2018-04-10 14:30:04 +00:00
ines
5ecb274764
Fix indentation error and set Doc.is_tagged correctly
2018-04-10 16:14:52 +02:00
ines
987ee27af7
Return Doc in noun chunks merger component if Doc is not parsed
2018-04-09 14:51:02 +02:00
Xiaoquan Kong
e2f13ec722
bugfix: Doc.noun_chunks call Doc.noun_chunks_iterator without checking (closes #2194)
2018-04-08 23:44:05 +02:00
Jens Dahl Møllerhøj
e5055e3cf6
Add Danish lemmatizer (#2184)
...
* add danish lemmatizer
* fill contributor agreement
2018-04-07 19:07:28 +02:00
ines
bccbf538ef
Revert "Check if spaCy has compiled correctly and show error message"
...
This reverts commit 3463ded7cf.
2018-04-06 15:49:44 +02:00
ines
fb4eda6616
Merge branch 'master' of https://github.com/explosion/spaCy
2018-04-06 00:38:48 +02:00
Matthew Honnibal
0c7fab4443
Set version to 2.0.11
2018-04-04 11:19:11 +02:00
Matthew Honnibal
a350be0601
Fix vector-name loading fix
2018-04-04 01:31:25 +02:00
Matthew Honnibal
21047bde52
Fix syntax error in italian lemmatizer
2018-04-03 23:13:22 +02:00
Matthew Honnibal
81f4005f3d
Fix loading models with pretrained vectors
2018-04-03 23:11:48 +02:00
ines
3463ded7cf
Check if spaCy has compiled correctly and show error message
2018-04-03 22:18:47 +02:00
Matthew Honnibal
96b612873b
Add hyper-parameter to control whether parser makes a beam update
2018-04-03 22:02:56 +02:00
ines
e5f47cd82d
Update errors
2018-04-03 21:40:29 +02:00
Matthew Honnibal
f7e6313b43
Increment version to v2.0.11.dev0
2018-04-03 20:58:47 +02:00
ines
10462816bc
Fix tests for Python 2
2018-04-03 18:51:31 +02:00
ines
62b4b527d7
Don't raise error if set_extension has getter and setter (closes #2177)
...
Improve error messages, raise error if setter is specified without a getter and compare against _unset to allow default=None. Also add more tests.
2018-04-03 18:30:17 +02:00
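The `_unset` comparison mentioned in the commit above can be sketched as follows. This is a minimal illustration and not the code from the commit: `validate_extension` and the error texts are hypothetical names chosen for the example. The sentinel lets the check tell "no default passed" apart from an explicit `default=None`.

```python
# Hypothetical sentinel meaning "this argument was not passed at all".
_unset = object()


def validate_extension(default=_unset, getter=None, setter=None, method=None):
    """Check one extension definition and return it as a tuple.

    Assumed rules, following the commit message: a setter needs a getter,
    and exactly one of default / getter / method must be given.
    """
    if setter is not None and getter is None:
        raise ValueError("A setter was given without a getter.")
    # Compare against the sentinel so that default=None counts as "given".
    given = (default is not _unset, getter is not None, method is not None)
    if sum(given) != 1:
        raise ValueError("Expected exactly one of default, getter or method.")
    return (None if default is _unset else default, method, getter, setter)
```

With this shape, `validate_extension(default=None)` is accepted, and so is a getter combined with a setter, while a setter on its own is rejected.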
ines
ee3082ad29
Fix whitespace
2018-04-03 18:29:53 +02:00
Ines Montani
3141e04822
💫 New system for error messages and warnings (#2163)
...
* Add spacy.errors module
* Update deprecation and user warnings
* Replace errors and asserts with new error message system
* Remove redundant asserts
* Fix whitespace
* Add messages for print/util.prints statements
* Fix typo
* Fix typos
* Move CLI messages to spacy.cli._messages
* Add decorator to display error code with message
An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc.
* Remove unused link in spacy.about
* Update errors for invalid pipeline components
* Improve error for unknown factories
* Add displaCy warnings
* Update formatting consistency
* Move error message to spacy.errors
* Update errors and check if doc returned by component is None
2018-04-03 15:50:31 +02:00
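The decorator idea described in the commit above (the message string is only modified when it is retrieved from the containing class) might look roughly like the sketch below. The names `add_codes` and `Errors` and the message text are assumptions made for illustration; the commit message does not show the implementation.

```python
def add_codes(err_cls):
    """Class decorator: prefix each message with its attribute name on access."""

    class ErrorsWithCodes(object):
        def __getattribute__(self, code):
            # Look the message up on the original class, then add the code.
            msg = getattr(err_cls, code)
            return "[{code}] {msg}".format(code=code, msg=msg)

    # Return an instance, so attribute access goes through __getattribute__.
    return ErrorsWithCodes()


@add_codes
class Errors(object):
    # Hypothetical message text, used only for this example.
    E001 = "No component '{name}' found in pipeline."
```

Because the prefix is added at lookup time, the stored strings stay unchanged and the caller can still fill in placeholders afterwards, for example `Errors.E001.format(name="ner")`.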
Matthew Honnibal
abf8b16d71
Add doc.retokenize() context manager (#2172)
...
This patch takes a step towards #1487 by introducing the
doc.retokenize() context manager, to handle merging spans, and soon
splitting tokens.
The idea is to do merging and splitting like this:
    with doc.retokenize() as retokenizer:
        for start, end, label in matches:
            retokenizer.merge(doc[start : end], attrs={'ent_type': label})
The retokenizer accumulates the merge requests, and applies them
together at the end of the block. This will allow retokenization to be
more efficient, and much less error prone.
A retokenizer.split() function will then be added, to handle splitting a
single token into multiple tokens. These methods take `Span` and `Token`
objects; if the user wants to go directly from offsets, they can append
to the .merges and .splits lists on the retokenizer.
The doc.merge() method's behaviour remains unchanged, so this patch
should be 100% backwards compatible (modulo bugs). Internally,
doc.merge() fixes up the arguments (to handle the various deprecated styles),
opens the retokenizer, and makes the single merge.
We can later start raising deprecation warnings on direct calls to doc.merge(),
to migrate people to the retokenize context manager.
2018-04-03 14:10:35 +02:00
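The accumulate-then-apply behaviour described in the commit above can be illustrated without spaCy. The sketch below works on a plain list of strings; `ToyRetokenizer` and `retokenize` are hypothetical stand-ins, not the library's classes, and the merges are assumed not to overlap.

```python
from contextlib import contextmanager


class ToyRetokenizer(object):
    """Collects merge requests; nothing is changed until the block ends."""

    def __init__(self):
        self.merges = []

    def merge(self, start, end):
        # Record the half-open range [start, end) to be merged later.
        self.merges.append((start, end))


@contextmanager
def retokenize(tokens):
    retokenizer = ToyRetokenizer()
    yield retokenizer
    # Apply all merges together, right to left, so that the offsets
    # recorded for earlier spans remain valid.
    for start, end in sorted(retokenizer.merges, reverse=True):
        tokens[start:end] = [" ".join(tokens[start:end])]


tokens = ["New", "York", "is", "near", "New", "Jersey"]
with retokenize(tokens) as retokenizer:
    retokenizer.merge(0, 2)
    retokenizer.merge(4, 6)
    # Still six tokens here: the merges have only been queued.
    length_inside_block = len(tokens)
```

Deferring the work to the end of the block is what keeps the offsets stable: every request refers to the original token positions, however many merges are queued.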
Matthew Honnibal
8a120fb455
Disable batch size compounding in ud-train
2018-04-01 08:45:00 +00:00
Matthew Honnibal
98165e43a7
Sometimes update beam with greedy oracle
2018-04-01 08:44:35 +00:00
Suraj Rajan
1cdbb7c97c
[2032] - Changed python set to cpp stl set (#2170)
...
Changed python set to cpp stl set #2032
## Description
Changed the Python set to a C++ STL set. The STL set works better here because its methods run in logarithmic time. Finding the minimum of a C++ set takes constant time, as opposed to the linear time needed for a Python set. Operations such as find, count, insert, and delete also run in either constant or logarithmic time, which makes the C++ set a better option for managing vectors.
Reference : http://www.cplusplus.com/reference/set/set/
### Types of change
Enhancement to `Vectors` for faster initialisation of word vectors (fastText)
2018-03-31 13:28:25 +02:00
Matthew Honnibal
f3b7c5e537
Fix syntax error
2018-03-29 21:50:32 +02:00
Matthew Honnibal
23afa6429f
Add input length error, to address #1826
2018-03-29 21:45:26 +02:00
Ines Montani
a609a1ca29
Merge pull request #2152 from explosion/feature/tidy-up-dependencies
...
💫 Tidy up dependencies
2018-03-29 14:35:09 +02:00
Viet Trung Tran
ea2af94cd9
Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer (#2155)
...
* support for Vietnamese
* Contributor Agreement for adding Vietnamese support on spaCy
2018-03-29 12:19:51 +02:00
ines
e6979bdbbd
Merge branch 'feature/tidy-up-dependencies' of https://github.com/explosion/spaCy into feature/tidy-up-dependencies
2018-03-29 00:19:37 +02:00
ines
83146458a2
Fix urllib for Python 3
2018-03-29 00:19:33 +02:00
Matthew Honnibal
8308bbc617
Get msgpack and msgpack_numpy via Thinc, to avoid potential version conflicts
2018-03-29 00:14:55 +02:00
Matthew Honnibal
b5098079d8
Fix error on urllib
2018-03-29 00:08:16 +02:00
Ines Montani
0de599b16b
Merge pull request #2159 from explosion/feature/fix-merged-entity-iob (resolves #1554, resolves #1752)
...
💫 Fix token.ent_iob after doc.merge(), and ensure consistency in doc.ents
2018-03-28 23:10:00 +02:00
Ines Montani
98e9cda677
Merge pull request #2158 from explosion/feature/fix-multiple-vectors (resolves #1660)
...
💫 Fix loading of multiple vector models
2018-03-28 23:08:24 +02:00
Matthew Honnibal
a7c5ae2beb
Avoid forcing a name on empty vectors, and remove print statement
2018-03-28 21:08:58 +02:00
ines
3eb67bbe4b
Allow entity types with dashes (resolves #1967)
2018-03-28 20:51:26 +02:00
Matthew Honnibal
cf5fcf0546
Update serialization test
2018-03-28 20:12:53 +02:00
Matthew Honnibal
4555e3e251
Don't assume pretrained_vectors cfg set in build_tagger
2018-03-28 20:12:45 +02:00
Matthew Honnibal
0b375d50c8
Fix ent_iob tags in doc.merge to avoid inconsistent sequences
2018-03-28 18:39:03 +02:00
Matthew Honnibal
95fa89c4b8
Update doc.ents test
2018-03-28 18:39:03 +02:00
Matthew Honnibal
e807f88410
Resolve merge when cherry-picking ent iob patches from develop
2018-03-28 18:38:13 +02:00
Matthew Honnibal
99fbc7db33
Improve error message when entity sequence is inconsistent
2018-03-28 18:36:53 +02:00
Matthew Honnibal
cbd2794be0
Add test for ent_iob during span merge
2018-03-28 18:36:53 +02:00
Matthew Honnibal
f8dd905a24
Warn and fallback if vectors have no name
2018-03-28 18:24:53 +02:00
Matthew Honnibal
fd9e259414
Add test for #1660
2018-03-28 18:22:51 +02:00
Matthew Honnibal
bc4afa9881
Remove print statement
2018-03-28 17:48:37 +02:00
Matthew Honnibal
79dc241caa
Set pretrained_vectors in parser cfg
2018-03-28 17:35:07 +02:00
Matthew Honnibal
17c3e7efa2
Add message noting vectors
2018-03-28 16:33:43 +02:00
Matthew Honnibal
9bf6e93b3e
Set pretrained_vectors in begin_training
2018-03-28 16:32:41 +02:00
Matthew Honnibal
95a9615221
Fix loading of multiple pre-trained vectors
...
This patch addresses #1660, which was caused by keying all pre-trained
vectors with the same ID when telling Thinc how to refer to them. This
meant that if multiple models were loaded that had pre-trained vectors,
errors or incorrect behaviour resulted.
The vectors class now includes a .name attribute, which defaults to:
{nlp.meta['lang']}_{nlp.meta['name']}.vectors
The vectors name is set in the cfg of the pipeline components under the
key pretrained_vectors. This replaces the previous cfg key
pretrained_dims.
In order to make existing models compatible with this change, we check
for the pretrained_dims key when loading models in from_disk and
from_bytes, and add the cfg key pretrained_vectors if we find it.
2018-03-28 16:02:59 +02:00
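The naming scheme and the compatibility check described in the commit above could be sketched like this. The function names `default_vectors_name` and `upgrade_cfg` are hypothetical, and treating a zero `pretrained_dims` as "no vectors" is an assumption of the sketch, not something the commit message states.

```python
def default_vectors_name(meta):
    """Build the default vectors name from a model's meta dict."""
    return "{lang}_{name}.vectors".format(lang=meta["lang"], name=meta["name"])


def upgrade_cfg(cfg, vectors_name):
    """Return a copy of a component cfg that has the newer key set.

    Older models only carry 'pretrained_dims'; if that key is present
    (and non-zero, by assumption) and 'pretrained_vectors' is missing,
    the vectors name is filled in.
    """
    cfg = dict(cfg)  # do not modify the caller's dict
    if cfg.get("pretrained_dims") and not cfg.get("pretrained_vectors"):
        cfg["pretrained_vectors"] = vectors_name
    return cfg


name = default_vectors_name({"lang": "en", "name": "core_web_md"})
old_cfg = {"pretrained_dims": 300}
new_cfg = upgrade_cfg(old_cfg, name)
```

Since each model gets its own name, two loaded models with pre-trained vectors no longer share one key.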
ines
7fbc9e5874
Replace requests with urllib
2018-03-28 12:46:07 +02:00
ines
da1f200362
Add compat helpers for urllib
2018-03-28 12:45:53 +02:00
ines
ac88c72c9a
Fix ftfy workaround and remove old import
2018-03-28 12:14:28 +02:00
ines
ce6071ca89
Remove ftfy dependency and update docs
2018-03-28 12:09:42 +02:00
Matthew Honnibal
070b6c6495
Remove dependency on ftfy
2018-03-28 12:07:02 +02:00
ines
6d2c85f428
Drop six and related hacks as a dependency
2018-03-28 10:45:25 +02:00
ines
9e83513004
Add position of invalid token to error message
2018-03-27 23:56:59 +02:00
ines
11c4735ccf
Fix issue in Italian lemmatizer data (resolves #2050)
2018-03-27 23:55:22 +02:00
Matthew Honnibal
6a961928b2
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-03-27 21:01:48 +00:00
Matthew Honnibal
b7136cb094
Support zipped vector files in init-model
2018-03-27 21:01:18 +00:00
ines
693971dd8f
Improve error message if token text is empty string (see #2101)
2018-03-27 22:25:40 +02:00
ines
0c829e6605
Fix whitespace
2018-03-27 22:20:59 +02:00
Matthew Honnibal
de9fd091ac
Fix #2014: token.pos_ not writeable
2018-03-27 21:21:11 +02:00
Matthew Honnibal
18da89e04c
Handle non-callable gold_tuples in parser begin_training
2018-03-27 21:08:41 +02:00
Matthew Honnibal
1f7229f40f
Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
...
This reverts commit c9ba3d3c2d, reversing changes made to 92c26a35d4.
2018-03-27 19:23:02 +02:00
Matthew Honnibal
8b7a74570f
Revert "Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop""
...
This reverts commit f41e626844.
2018-03-27 19:22:52 +02:00
Matthew Honnibal
f41e626844
Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
...
This reverts commit c9ba3d3c2d, reversing changes made to f57bfbccdc.
2018-03-27 19:22:25 +02:00
Matthew Honnibal
c9ba3d3c2d
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-03-27 18:59:08 +02:00
Matthew Honnibal
92c26a35d4
Update get_cuda_stream
2018-03-27 16:42:00 +00:00
Matthew Honnibal
f57bfbccdc
Fix non-projective label filtering
2018-03-27 13:41:33 +02:00
Matthew Honnibal
d2118792e7
Merge changes from master
2018-03-27 13:38:41 +02:00
Matthew Honnibal
d4680e4d83
Merge branch 'master' of https://github.com/explosion/spaCy
2018-03-27 13:36:37 +02:00
Matthew Honnibal
63a267b34d
Fix #2073: Token.set_extension not working
2018-03-27 13:36:20 +02:00
Matthew Honnibal
25280b7013
Try to make sum_state_features faster
2018-03-27 10:08:38 +00:00
Matthew Honnibal
987e1533a4
Use 8 features in parser
2018-03-27 10:08:12 +00:00
Matthew Honnibal
8bbd26579c
Support GPU in UD training script
2018-03-27 09:53:35 +00:00
Matthew Honnibal
dd54511c4f
Pass data as a function in begin_training methods
2018-03-27 09:39:59 +00:00
Matthew Honnibal
d9ebd78e11
Change default sizes in parser
2018-03-26 17:22:18 +02:00
Matthew Honnibal
a3d0cb15d3
Fix ent_iob tags in doc.merge to avoid inconsistent sequences
2018-03-26 07:16:06 +02:00
Matthew Honnibal
7d4687162f
Update doc.ents test
2018-03-26 07:14:35 +02:00
Matthew Honnibal
514d89a3ae
Set missing label for non-specified entities when setting doc.ents
2018-03-26 07:14:16 +02:00
Matthew Honnibal
54d7a1c916
Improve error message when entity sequence is inconsistent
2018-03-26 07:13:34 +02:00
Matthew Honnibal
938436455a
Add test for ent_iob during span merge
2018-03-25 22:16:19 +02:00
Matthew Honnibal
8e08c378fe
Fix entity IOB and tag in span merging
2018-03-25 22:16:01 +02:00
Matthew Honnibal
5430c43298
Set about to spacy-nightly
2018-03-25 19:30:14 +02:00
Ines Montani
68226109f4
Merge pull request #2142 from jimregan/polish-more-tokens
...
more exceptions
2018-03-24 19:06:44 +01:00
Matthew Honnibal
d566e673bf
Set version to v2.0.10
2018-03-24 18:09:03 +01:00
Matthew Honnibal
0d3bf0d4eb
Merge branch 'master' of https://github.com/explosion/spaCy
2018-03-24 17:31:49 +01:00
dejanmarich
ccd1c04c63
Update stop_words.py
...
Added more words
2018-03-24 17:31:24 +01:00
ines
f1446b0257
Port over Turkish changes
2018-03-24 17:31:07 +01:00
DuyguA
cd604878a4
quick typo fix
2018-03-24 17:26:35 +01:00
Matthew Honnibal
406548b976
Support .gz and .tar.gz files in spacy init-model
2018-03-24 17:18:32 +01:00