ines
f7103babd9
Only overwrite warnings filter if set explicitly ( resolves #2369 )
...
This way, pre-defined warning filters are respected and users are still able to use the fine-grained warning settings if they like.
2018-05-26 18:44:15 +02:00
ines
330c039106
Merge branch 'master' into develop
2018-05-26 18:30:52 +02:00
James Messinger
4515e96e90
Better formatting for `spacy train` CLI ( #2357 )
...
* Better formatting for `spacy train` CLI
Changed to use fixed-spaces rather than tabs to align table headers and data.
### Before:
```
Itn. P.Loss N.Loss UAS NER P. NER R. NER F. Tag % Token %
0 4618.857 2910.004 76.172 79.645 67.987 88.732 88.261 100.000 4436.9 6376.4
1 4671.972 3764.812 74.481 78.046 62.374 82.680 88.377 100.000 4672.2 6227.1
2 4742.756 3673.473 71.994 77.380 63.966 84.494 90.620 100.000 4298.0 5983.9
```
### After:
```
Itn.  Dep Loss  NER Loss  UAS     NER P.  NER R.  NER F.  Tag %   Token %  CPU WPS  GPU WPS
0     4618.857  2910.004  76.172  79.645  67.987  88.732  88.261  100.000  4436.9   6376.4
1     4671.972  3764.812  74.481  78.046  62.374  82.680  88.377  100.000  4672.2   6227.1
2     4742.756  3673.473  71.994  77.380  63.966  84.494  90.620  100.000  4298.0   5983.9
```
* Added contributor file
2018-05-25 13:08:45 +02:00
Aristo Rinjuang
432ede04af
adding more words and rephrasing ( #2351 )
...
* adding more words and rephrasing
* adding a contributor
* tokenizer bugs solved
2018-05-24 11:40:57 +02:00
Jani Monoses
ec62cadf4c
Updates to Romanian support ( #2354 )
...
* Add back Romanian in conftest
* Romanian lex_attr
* More tokenizer exceptions for Romanian
* Add tests for some Romanian tokenizer exceptions
2018-05-24 11:40:00 +02:00
Matthew Honnibal
5d281cf302
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-05-22 20:50:59 +02:00
Matthew Honnibal
ce458c2428
Fix spacy requirement constraint in package template
2018-05-22 20:50:46 +02:00
Ines Montani
862da5e793
Support pipeline factories via entry points ( #2348 )
2018-05-22 18:29:45 +02:00
Matthew Honnibal
d5af38f80c
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-05-21 17:42:55 +02:00
Matthew Honnibal
ee33de8652
Fix unpickling of NER parser
2018-05-21 17:42:40 +02:00
ines
f9dbcac8e4
Merge branch 'master' into develop
2018-05-21 02:29:29 +02:00
cclauss
f7dcaa1f6b
Simplify is_config() and normalize_string_keys() ( #2305 )
...
* Simplify is_config() and normalize_string_keys()
* Use __in__ to avoid the nested _ands_ and _ors_.
* Dict comprehension directly tracks with the doc string
* Keep more basic loop in normalize_string_keys
* Whitespace
2018-05-21 01:54:35 +02:00
Ines Montani
cae4457c38
💫 Add .similarity warnings for no vectors and option to exclude warnings ( #2197 )
...
* Add logic to filter out warning IDs via environment variable
Usage: SPACY_WARNING_EXCLUDE=W001,W007
* Add warnings for empty vectors
* Add warning if no word vectors are used in .similarity methods
For example, if only tensors are available in small models – should hopefully clear up some confusion around this
* Capture warnings in tests
* Rename SPACY_WARNING_EXCLUDE to SPACY_WARNING_IGNORE
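The environment-variable filter described in this commit could look roughly like the sketch below. It is not spaCy's code: the helper names `ignored_ids` and `warn_unless_ignored` are hypothetical, and the only detail taken from the commit message is that `SPACY_WARNING_IGNORE` holds a comma-separated list of warning IDs such as `W001,W007`.

```python
import os
import warnings


def ignored_ids(env=None, var="SPACY_WARNING_IGNORE"):
    """Parse a comma-separated list of warning IDs from the environment."""
    env = os.environ if env is None else env
    return {code.strip() for code in env.get(var, "").split(",") if code.strip()}


def warn_unless_ignored(code, message, env=None):
    """Emit a UserWarning tagged with its ID unless that ID is ignored.

    Returns True if a warning was emitted, False if it was filtered out.
    """
    if code in ignored_ids(env):
        return False
    warnings.warn("[{}] {}".format(code, message), UserWarning)
    return True
```

Passing a plain dict as `env` keeps the sketch testable without touching the real process environment.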
2018-05-21 01:22:38 +02:00
Matthew Honnibal
b096b22c20
Merge pull request #2247 from skrcode/1480
...
1480 - Implement Fast-Text vectors with subword features
2018-05-21 01:16:21 +02:00
Matthew Honnibal
f3b4f6a4ec
Merge setup.py
2018-05-20 23:21:00 +02:00
Ines Montani
d4cc736b7c
💫 Improve model downloads: check for existing install, customise pip and use requests library again ( #2346 )
...
* Go back to using requests instead of urllib (closes #2320 )
Fewer dependencies are good, but this one was simply causing too many other problems around SSL verification and Python 2/3 compatibility. requests is a popular enough package that it's okay for spaCy to depend on it – and this will hopefully make model downloads less flakey.
* Only download model if not installed (see #1456 )
Use #egg=model==version to allow pip to check for existing installations. The download is only started if no installation matching the package/version is found. Fixes a long-standing inconvenience.
* Pass additional options to pip when installing model (resolves #1456 )
Treat all additional arguments passed to the download command as pip options to allow user to customise the command. For example:
python -m spacy download en --user
* Add CLI option to enable installing model package dependencies
* Revert "Add CLI option to enable installing model package dependencies"
This reverts commit 9336ffe695.
* Update documentation
2018-05-20 20:26:56 +02:00
Matthew Honnibal
3eb446e0a5
Require thinc 6.11.1 and prepare for release to spacy-nightly
2018-05-20 19:00:34 +02:00
Matthew Honnibal
bdc23dd8c1
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-05-20 18:59:24 +02:00
ines
5401c55c75
Merge branch 'master' into develop
2018-05-20 16:49:40 +02:00
ines
b59e3b157f
Don't require attrs argument in Doc.retokenize and allow both ints and unicode ( resolves #2304 )
2018-05-20 15:15:37 +02:00
ines
5768df4f09
Add SimpleFrozenDict util to use as default function argument
2018-05-20 15:13:37 +02:00
Matthew Honnibal
7431e9c87f
Fix parser for GPU
2018-05-19 17:24:34 +00:00
Matthew Honnibal
401213fb1f
Only warn about unnamed vectors if non-zero sized.
2018-05-19 18:51:55 +02:00
Matthew Honnibal
74d5c625b3
Use rising beam update prob
2018-05-16 20:11:59 +02:00
Matthew Honnibal
544ae7f1db
Merge branch 'develop' into feature/refactor-parser
2018-05-16 02:06:49 +02:00
Matthew Honnibal
d1b27fe5aa
Revert "Improve dynamic oracle when values are missing in parse"
...
This reverts commit f56bd4736b.
2018-05-16 00:31:52 +02:00
Matthew Honnibal
83acaa0358
Add missing name attribute for parser
2018-05-15 19:01:53 +02:00
Matthew Honnibal
f328c195ca
Fix size limits in training data
2018-05-15 19:01:41 +02:00
Matthew Honnibal
8446b35ce0
Fix parser model loading
2018-05-15 18:43:46 +02:00
Matthew Honnibal
dc1a479fbd
Merge branch 'develop' into feature/refactor-parser
2018-05-15 18:39:21 +02:00
Matthew Honnibal
546dd99cdf
Merge master into develop -- mostly Arabic and website
2018-05-15 18:14:28 +02:00
Matthew Honnibal
5664ab7e6c
Revert hacks to tests
2018-05-15 18:00:09 +02:00
Matthew Honnibal
7b9195657b
Restore beam_density argument for parser beam
2018-05-15 17:55:11 +02:00
Matthew Honnibal
581d318971
Fix conftest
2018-05-15 00:54:45 +02:00
Tahar Zanouda
00417794d3
Add Arabic language ( #2314 )
...
* added support for Arabic lang
* added Arabic language support
* updated conftest
2018-05-15 00:27:19 +02:00
Jani Monoses
0e08e49e87
Lemmatizer ro ( #2319 )
...
* Add Romanian lemmatizer lookup table.
Adapted from http://www.lexiconista.com/datasets/lemmatization/
by replacing cedillas with commas (ș and ț).
The original dataset is licensed under the Open Database License.
* Fix one blatant issue in the Romanian lemmatizer
* Romanian examples file
* Add ro_tokenizer in conftest
* Add Romanian lemmatizer test
2018-05-12 15:20:04 +02:00
Matthew Honnibal
887631ca25
Disable some tests to figure out why CI fails
2018-05-10 16:42:01 +02:00
Matthew Honnibal
902a172cb7
Disable some tests to figure out why CI fails
2018-05-10 16:30:07 +02:00
Matthew Honnibal
614d45ea58
Set a more aggressive threshold on the max-violation update
2018-05-10 15:38:24 +02:00
Matthew Honnibal
8e8724b55b
Default to beam_update_prob 1
2018-05-10 15:38:02 +02:00
Jani Monoses
42b34832e4
Update Romanian stopword list ( #2316 )
...
* Contributor agreement for janimo
* Update Romanian stopword list
Include the correct spellings of all the words already in the repo
that are using cedillas (ş and ţ) instead of commas (ș and ț).
Add another unrelated spelling fix.
See https://github.com/stopwords-iso/stopwords-ro/pull/1 and
https://github.com/stopwords-iso/stopwords-ro/pull/2
2018-05-10 12:16:56 +02:00
Lucas Abbade
be7fdc59d1
Update lex_attrs.py ( #2307 )
...
* Update lex_attrs.py
Fixed spelling mistakes of some numbers (according to Brazilian Portuguese).
* Update lex_attrs.py
As requested, I've included the correct spelling for both Brazilian Portuguese and Portuguese Portuguese.
I would advise, however, that the two be separated in the future. Although most of the writing is unified, Brazilian Portuguese is very different from the original language, and the way people talk in the two countries is radically different. Keeping both as one language may lead to bigger issues in the future, especially when it comes to spell checking.
2018-05-09 20:49:31 +02:00
mauryaland
5368ba028a
Update stop_words.py for French language ( #2310 )
...
* Add contraction forms of some common stopwords
All the stopwords added contain an apostrophe, either ' or ’.
* Adds contributor agreement mauryaland
* Update mauryaland.md
2018-05-09 12:04:38 +02:00
Matthew Honnibal
a61fd60681
Fix error in beam gradient calculation
2018-05-09 02:44:09 +02:00
Matthew Honnibal
a6ae1ee6f7
Don't modify Token in global scope
2018-05-09 00:43:00 +02:00
Matthew Honnibal
f94f721f40
Avoid importing fused token symbol in ud-run-test, until that's added
2018-05-09 00:28:03 +02:00
Matthew Honnibal
659ec5b975
Avoid importing fused token symbol in ud-run-test, until that's added
2018-05-08 19:40:33 +02:00
Matthew Honnibal
4cb0494bef
Bug fixes to beam search after refactor
2018-05-08 13:48:50 +02:00
Matthew Honnibal
5ed71973b3
Add a keyword argument sink to GoldParse
2018-05-08 13:48:32 +02:00
Matthew Honnibal
8cfe326f87
Avoid relying on final gold check in beam search
2018-05-08 13:48:19 +02:00
Matthew Honnibal
fc4dd49b77
Support oracle segmentation in ud-train CLI command
2018-05-08 13:47:45 +02:00
Matthew Honnibal
c49e44349a
Fix beam parsing
2018-05-08 02:53:24 +02:00
Matthew Honnibal
99649d114d
Fix parser
2018-05-08 00:27:26 +02:00
Matthew Honnibal
8a82367a9d
Fix beam search after refactor
2018-05-08 00:20:33 +02:00
Matthew Honnibal
5a0f26be0c
Readd beam search after refactor
2018-05-08 00:19:52 +02:00
ines
7a3599c21a
Fix formatting and consistency
2018-05-07 23:02:11 +02:00
Matthew Honnibal
36b2c9bdd5
Fix refactored parser
2018-05-07 18:58:09 +02:00
Matthew Honnibal
bde3be1ad1
Fix refactored parser
2018-05-07 18:31:04 +02:00
Matthew Honnibal
01c4e13b02
Update test
2018-05-07 16:59:52 +02:00
Matthew Honnibal
f6cdafc00e
Fix refactored parser
2018-05-07 16:59:38 +02:00
Matthew Honnibal
f56bd4736b
Improve dynamic oracle when values are missing in parse
2018-05-07 15:53:18 +02:00
Matthew Honnibal
eddc0e0c74
Set gold.sent_starts in ud_train
2018-05-07 15:52:47 +02:00
Matthew Honnibal
bf19f22340
Allow gold.sent_starts to be set from Python
2018-05-07 15:51:34 +02:00
Matthew Honnibal
7f163442e6
Work on refactoring greedy parser
2018-05-07 15:45:52 +02:00
Douglas Knox
9b49a40f4e
Test and fix for Issue #2219 ( #2272 )
...
Test and fix for Issue #2219 : Token.similarity() failed if single letter
2018-05-03 18:40:46 +02:00
Paul O'Leary McCann
bd72fbf09c
Port Japanese mecab tokenizer from v1 ( #2036 )
...
* Port Japanese mecab tokenizer from v1
This brings the Mecab-based Japanese tokenization introduced in #1246 to
spaCy v2. There isn't a JapaneseTagger implementation yet, but POS tag
information from Mecab is stored in a token extension. A tag map is also
included.
As a reminder, Mecab is required because Universal Dependencies are
based on Unidic tags, and Janome doesn't support Unidic.
Things to check:
1. Is this the right way to use a token extension?
2. What's the right way to implement a JapaneseTagger? The approach in
#1246 relied on `tag_from_strings` which is just gone now. I guess the
best thing is to just try training spaCy's default Tagger?
-POLM
* Add tagging/make_doc and tests
2018-05-03 18:38:26 +02:00
G.Pruvost
cc8e804648
#2211 - Support for ssl certs config on download command ( #2212 )
...
* Add support for SSL/Certs customization on download CLI
* Add a note on SSL options for the 'download' CLI in the README
* Add contributor agreement
2018-05-03 18:37:02 +02:00
Jens Dahl Møllerhøj
b9290397fb
rename SP to _SP ( #2289 )
2018-05-03 18:33:49 +02:00
Matthew Honnibal
a8e70a4187
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-05-03 14:02:10 +02:00
Matthew Honnibal
c0e596283b
Set version to 2.1.0a0
2018-05-03 14:00:11 +02:00
Matthew Honnibal
8cd06cc763
Try to fix root-outside-sentence bug
2018-05-02 14:39:48 +00:00
Matthew Honnibal
acebd01033
Set children from heads in finalize doc
2018-05-02 14:19:22 +00:00
Matthew Honnibal
569440a6db
Don't normalize gradient by batch size
2018-05-02 08:42:10 +02:00
Matthew Honnibal
281e29cbcd
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-05-02 01:36:23 +00:00
Matthew Honnibal
2338e8c7fc
Update develop from master
2018-05-02 01:36:12 +00:00
Matthew Honnibal
9d147e12c4
Merge remote-tracking branch 'origin/master' into develop
2018-05-01 18:18:51 +02:00
Matthew Honnibal
6d0fe67b72
Constrain subtok label to adjacent tokens
2018-05-01 17:34:27 +02:00
Matthew Honnibal
8f21953fc5
Constrain subtok to adjacent words
2018-05-01 17:29:00 +02:00
Matthew Honnibal
b43bfd3524
Fix arc-eager oracle tests
2018-05-01 16:16:14 +02:00
Matthew Honnibal
31ed64e9b0
Fix textcat test
2018-05-01 15:18:39 +02:00
Matthew Honnibal
548bdff943
Update default Adam settings
2018-05-01 15:18:20 +02:00
Matthew Honnibal
adbb1f7533
Add better arc-eager oracle tests
2018-05-01 15:14:55 +02:00
Matthew Honnibal
697bcaa34f
Add some methods to ArcEager that make testing easier
2018-05-01 15:13:14 +02:00
Mr Roboto
6f5ccda19c
Addresses Issue #2228 - Deserialization fails when using tensor=False or sentiment=False ( #2230 )
...
* Fixes issue #2228
* Adds a new contributor
2018-05-01 13:40:22 +02:00
Matthew Honnibal
d44bb45c72
Fix scoring if tokenization changes
2018-05-01 01:33:20 +02:00
Matthew Honnibal
2b26c007cd
Revert "Disable batch size compounding in ud-train"
...
This reverts commit 8a120fb455.
2018-04-29 14:09:02 +00:00
Matthew Honnibal
723b328062
Add script to run UD test
2018-04-29 15:50:25 +02:00
Matthew Honnibal
17af6aa3a4
Update ud_train script
2018-04-29 15:49:32 +02:00
Matthew Honnibal
5de8a36537
Fix arc_eager is_nonproj_tree
2018-04-29 15:49:11 +02:00
Matthew Honnibal
5260268f70
Fix textcat after merge
2018-04-29 15:48:53 +02:00
Matthew Honnibal
ad3d56c3ba
Fix compile error in matcher
2018-04-29 15:48:34 +02:00
Matthew Honnibal
a8bc947fd4
Fix Token.set_extension
2018-04-29 15:48:19 +02:00
Matthew Honnibal
2c4a6d66fa
Merge master into develop. Big merge, many conflicts -- need to review
2018-04-29 14:49:26 +02:00
ines
3c80f69ff5
Return data in cli.info and add silent option ( resolves #2196 )
2018-04-29 01:59:44 +02:00
ines
1c6d77610c
Add remove_extension method on Doc, Token and Span ( closes #2242 )
2018-04-28 23:33:09 +02:00
ines
abdb853ebf
Simplify underscore tests
2018-04-28 23:30:33 +02:00
ines
6fb6371670
Add collapse_phrases option to displacy ( closes #2266 )
2018-04-28 23:06:50 +02:00
Robin Linderborg
1f9904ef12
fixes #2238 ( #2241 )
...
* Remove erroneous lemma lookup år > åra in Swedish
* Add contributors agreement
* Add contrib agreement to correct directory
* Revert change to CONTRIBUTOR_AGREEMENT
2018-04-28 14:55:22 +02:00
Robin Linderborg
d01f503b54
Remove incorrect lemma lookup gäng->gänga ( #2252 )
...
* Remove incorrect lemma lookup gäng->gänga
In modern Swedish, "gäng" is mostly associated with "gang" or "group of people". The removed lemma lookup lemmatized it to the verb "thread".
* Add contrib agreement to correct directory
* Revert change to CONTRIBUTOR_AGREEMENT
2018-04-28 14:54:41 +02:00
Suraj Krishnan Rajan
69d041148f
Implement Fast-Text vectors with subword features
2018-04-21 01:34:14 +05:30
ines
686225eadd
Fix Spanish noun_chunks ( resolves #2210 )
...
Make sure 'NP' label is added to StringStore and move noun_bounds helper into a closure to allow reusing label sets
2018-04-18 18:44:01 -04:00
ines
9632595fb4
Use correct, non-deprecated merge syntax ( resolves #2226 )
2018-04-18 18:28:28 -04:00
Suraj Rajan
5957f15227
Fixed typos for #2222,#2223 ( #2233 ) ( closes #2222 , closes #2223 )
2018-04-18 14:55:26 -07:00
Matthew Honnibal
97851d2c4e
Increment version to v2.0.12.dev0
2018-04-10 22:20:16 +02:00
Matthew Honnibal
ed39c75a92
Merge branch 'master' of https://github.com/explosion/spaCy
2018-04-10 22:19:40 +02:00
Matthew Honnibal
3836199a83
Fix loading of models when custom vectors are added
2018-04-10 22:19:20 +02:00
ines
0299d5fac8
Update argument annotations and formatting
2018-04-10 21:45:11 +02:00
ines
49b1e48bf5
Fix syntax error
2018-04-10 21:44:59 +02:00
ines
70052e46e9
Fix formatting [ci skip]
2018-04-10 21:42:46 +02:00
Matthew Honnibal
0ddb152be0
Improve error message when reading vectors
2018-04-10 21:26:50 +02:00
Matthew Honnibal
db50ac524e
Support zipped vector files in init-model
2018-04-10 21:21:00 +02:00
ines
270fcfd925
Fix typo in package command message ( closes #2200 )
2018-04-10 19:14:31 +02:00
ines
24d8bf348d
Revert "Add support for .zip to init_model"
...
This reverts commit 7ee880a0ad.
2018-04-10 19:08:06 +02:00
Matthew Honnibal
7ee880a0ad
Add support for .zip to init_model
2018-04-10 14:30:04 +00:00
ines
5ecb274764
Fix indentation error and set Doc.is_tagged correctly
2018-04-10 16:14:52 +02:00
ines
987ee27af7
Return Doc in noun chunks merger component if Doc is not parsed
2018-04-09 14:51:02 +02:00
Xiaoquan Kong
e2f13ec722
bugfix: Doc.noun_chunks call Doc.noun_chunks_iterator without checking ( closes #2194 )
2018-04-08 23:44:05 +02:00
Jens Dahl Møllerhøj
e5055e3cf6
Add Danish lemmatizer ( #2184 )
...
* add danish lemmatizer
* fill contributor agreement
2018-04-07 19:07:28 +02:00
ines
bccbf538ef
Revert "Check if spaCy has compiled correctly and show error message"
...
This reverts commit 3463ded7cf.
2018-04-06 15:49:44 +02:00
ines
fb4eda6616
Merge branch 'master' of https://github.com/explosion/spaCy
2018-04-06 00:38:48 +02:00
Matthew Honnibal
0c7fab4443
Set version to 2.0.11
2018-04-04 11:19:11 +02:00
Matthew Honnibal
a350be0601
Fix vector-name loading fix
2018-04-04 01:31:25 +02:00
Matthew Honnibal
21047bde52
Fix syntax error in italian lemmatizer
2018-04-03 23:13:22 +02:00
Matthew Honnibal
81f4005f3d
Fix loading models with pretrained vectors
2018-04-03 23:11:48 +02:00
ines
3463ded7cf
Check if spaCy has compiled correctly and show error message
2018-04-03 22:18:47 +02:00
Matthew Honnibal
96b612873b
Add hyper-parameter to control whether parser makes a beam update
2018-04-03 22:02:56 +02:00
ines
e5f47cd82d
Update errors
2018-04-03 21:40:29 +02:00
Matthew Honnibal
f7e6313b43
Increment version to v2.0.11.dev0
2018-04-03 20:58:47 +02:00
ines
10462816bc
Fix tests for Python 2
2018-04-03 18:51:31 +02:00
ines
62b4b527d7
Don't raise error if set_extension has getter and setter ( closes #2177 )
...
Improve error messages, raise error if setter is specified without a getter and compare against _unset to allow default=None. Also add more tests.
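The `_unset` comparison mentioned here is the usual sentinel pattern: a private object stands for "argument not passed", so that `None` remains a legal default value. A minimal sketch of the checks listed in the message follows. The function name `check_extension_args` and the error wording are hypothetical, and the sketch assumes that exactly one of `default` or `getter` has to be supplied.

```python
_unset = object()  # sentinel: distinguishes "not passed" from an explicit None


def check_extension_args(default=_unset, getter=None, setter=None):
    """Validate extension arguments and return them as a tuple."""
    if setter is not None and getter is None:
        raise ValueError("A setter was specified without a getter.")
    has_default = default is not _unset
    has_getter = getter is not None
    if has_default == has_getter:
        # Either both were given or neither was.
        raise ValueError("Specify exactly one of default or getter.")
    return (default if has_default else None, getter, setter)
```

With this shape, `check_extension_args(default=None)` is accepted, while a bare `check_extension_args()` is rejected.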
2018-04-03 18:30:17 +02:00
ines
ee3082ad29
Fix whitespace
2018-04-03 18:29:53 +02:00
Ines Montani
3141e04822
💫 New system for error messages and warnings ( #2163 )
...
* Add spacy.errors module
* Update deprecation and user warnings
* Replace errors and asserts with new error message system
* Remove redundant asserts
* Fix whitespace
* Add messages for print/util.prints statements
* Fix typo
* Fix typos
* Move CLI messages to spacy.cli._messages
* Add decorator to display error code with message
An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc.
* Remove unused link in spacy.about
* Update errors for invalid pipeline components
* Improve error for unknown factories
* Add displaCy warnings
* Update formatting consistency
* Move error message to spacy.errors
* Update errors and check if doc returned by component is None
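The decorator idea from the list above, where the code is only added to the string when the message is retrieved from its containing class, can be sketched as follows. This is an illustration under assumptions rather than the contents of `spacy.errors`: the names `add_codes` and `Errors` are chosen for the example, and the message IDs and texts are made up.

```python
def add_codes(err_cls):
    """Wrap a class of message strings so lookups return '[CODE] message'."""

    class ErrorsWithCodes(object):
        def __getattribute__(self, code):
            # Fetch the raw message from the wrapped class, then prefix its ID.
            msg = getattr(err_cls, code)
            return "[{code}] {msg}".format(code=code, msg=msg)

    return ErrorsWithCodes()


@add_codes
class Errors(object):
    # Hypothetical IDs and messages, for illustration only.
    E001 = "No component '{name}' found in pipeline."
    E002 = "Input text is too long."
```

Because the prefix is added at attribute access, the stored strings stay untouched and call sites can still use `Errors.E001.format(name="ner")`.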
2018-04-03 15:50:31 +02:00
Matthew Honnibal
abf8b16d71
Add doc.retokenize() context manager ( #2172 )
...
This patch takes a step towards #1487 by introducing the
doc.retokenize() context manager, to handle merging spans, and soon
splitting tokens.
The idea is to do merging and splitting like this:
    with doc.retokenize() as retokenizer:
        for start, end, label in matches:
            retokenizer.merge(doc[start : end], attrs={'ent_type': label})
The retokenizer accumulates the merge requests, and applies them
together at the end of the block. This will allow retokenization to be
more efficient, and much less error prone.
A retokenizer.split() function will then be added, to handle splitting a
single token into multiple tokens. These methods take `Span` and `Token`
objects; if the user wants to go directly from offsets, they can append
to the .merges and .splits lists on the retokenizer.
The doc.merge() method's behaviour remains unchanged, so this patch
should be 100% backwards compatible (modulo bugs). Internally,
doc.merge() fixes up the arguments (to handle the various deprecated styles),
opens the retokenizer, and makes the single merge.
We can later start making deprecation warnings on direct calls to doc.merge(),
to migrate people to use of the retokenize context manager.
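The accumulate-then-apply behaviour described in this message can be illustrated with a toy context manager over a plain list of strings. It is a sketch and not spaCy's implementation: the `ToyRetokenizer` class, its `(start, end)` offsets and the right-to-left application order are assumptions made for the example, and the requested merges are assumed not to overlap.

```python
class ToyRetokenizer(object):
    """Toy stand-in: collects merge requests and applies them on exit."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.merges = []  # (start, end) offsets, end exclusive

    def merge(self, start, end):
        # Only record the request; nothing changes until the block ends.
        self.merges.append((start, end))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type is not None:
            return False  # leave the tokens untouched if the block raised
        # Apply right to left so that earlier offsets stay valid.
        for start, end in sorted(self.merges, reverse=True):
            self.tokens[start:end] = [" ".join(self.tokens[start:end])]
        return False


tokens = ["I", "like", "New", "York", "and", "San", "Francisco"]
with ToyRetokenizer(tokens) as retokenizer:
    retokenizer.merge(2, 4)
    retokenizer.merge(5, 7)
```

Deferring the work to the end of the block means every offset refers to the original tokenization, which is what makes batching several merges less error prone than applying them one at a time.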
2018-04-03 14:10:35 +02:00
Matthew Honnibal
8a120fb455
Disable batch size compounding in ud-train
2018-04-01 08:45:00 +00:00
Matthew Honnibal
98165e43a7
Sometimes update beam with greedy oracle
2018-04-01 08:44:35 +00:00
Suraj Rajan
1cdbb7c97c
[2032] - Changed python set to cpp stl set ( #2170 )
...
Changed python set to cpp stl set #2032
## Description
Changed Python set to C++ STL set. The C++ STL set works better due to the logarithmic run time of its methods. Finding the minimum in the C++ set is done in constant time, as opposed to the worst-case linear run time of a Python set. Operations such as find, count, insert and delete also run in either constant or logarithmic time, making the C++ set a better option for managing vectors.
Reference : http://www.cplusplus.com/reference/set/set/
### Types of change
Enhancement for `Vectors` for faster initialising of word vectors (fastText)
2018-03-31 13:28:25 +02:00
Matthew Honnibal
f3b7c5e537
Fix syntax error
2018-03-29 21:50:32 +02:00
Matthew Honnibal
23afa6429f
Add input length error, to address #1826
2018-03-29 21:45:26 +02:00
Ines Montani
a609a1ca29
Merge pull request #2152 from explosion/feature/tidy-up-dependencies
...
💫 Tidy up dependencies
2018-03-29 14:35:09 +02:00
Viet Trung Tran
ea2af94cd9
Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer ( #2155 )
...
* support for Vietnamese
* Contributor Agreement for adding Vietnamese support on spaCy
2018-03-29 12:19:51 +02:00
ines
e6979bdbbd
Merge branch 'feature/tidy-up-dependencies' of https://github.com/explosion/spaCy into feature/tidy-up-dependencies
2018-03-29 00:19:37 +02:00
ines
83146458a2
Fix urllib for Python 3
2018-03-29 00:19:33 +02:00
Matthew Honnibal
8308bbc617
Get msgpack and msgpack_numpy via Thinc, to avoid potential version conflicts
2018-03-29 00:14:55 +02:00
Matthew Honnibal
b5098079d8
Fix error on urllib
2018-03-29 00:08:16 +02:00
Ines Montani
0de599b16b
Merge pull request #2159 from explosion/feature/fix-merged-entity-iob ( resolves #1554 , resolves #1752 )
...
💫 Fix token.ent_iob after doc.merge(), and ensure consistency in doc.ents
2018-03-28 23:10:00 +02:00
Ines Montani
98e9cda677
Merge pull request #2158 from explosion/feature/fix-multiple-vectors ( resolves #1660 )
...
💫 Fix loading of multiple vector models
2018-03-28 23:08:24 +02:00
Matthew Honnibal
a7c5ae2beb
Avoid forcing a name on empty vectors, and remove print statement
2018-03-28 21:08:58 +02:00
ines
3eb67bbe4b
Allow entity types with dashes ( resolves #1967 )
2018-03-28 20:51:26 +02:00
Matthew Honnibal
cf5fcf0546
Update serialization test
2018-03-28 20:12:53 +02:00
Matthew Honnibal
4555e3e251
Don't assume pretrained_vectors cfg set in build_tagger
2018-03-28 20:12:45 +02:00
Matthew Honnibal
0b375d50c8
Fix ent_iob tags in doc.merge to avoid inconsistent sequences
2018-03-28 18:39:03 +02:00
Matthew Honnibal
95fa89c4b8
Update doc.ents test
2018-03-28 18:39:03 +02:00
Matthew Honnibal
e807f88410
Resolve merge when cherry-picking ent iob patches from develop
2018-03-28 18:38:13 +02:00
Matthew Honnibal
99fbc7db33
Improve error message when entity sequence is inconsistent
2018-03-28 18:36:53 +02:00
Matthew Honnibal
cbd2794be0
Add test for ent_iob during span merge
2018-03-28 18:36:53 +02:00
Matthew Honnibal
f8dd905a24
Warn and fallback if vectors have no name
2018-03-28 18:24:53 +02:00
Matthew Honnibal
fd9e259414
Add test for #1660
2018-03-28 18:22:51 +02:00
Matthew Honnibal
bc4afa9881
Remove print statement
2018-03-28 17:48:37 +02:00
Matthew Honnibal
79dc241caa
Set pretrained_vectors in parser cfg
2018-03-28 17:35:07 +02:00
Matthew Honnibal
17c3e7efa2
Add message noting vectors
2018-03-28 16:33:43 +02:00
Matthew Honnibal
9bf6e93b3e
Set pretrained_vectors in begin_training
2018-03-28 16:32:41 +02:00
Matthew Honnibal
95a9615221
Fix loading of multiple pre-trained vectors
...
This patch addresses #1660 , which was caused by keying all pre-trained
vectors with the same ID when telling Thinc how to refer to them. This
meant that if multiple models were loaded that had pre-trained vectors,
errors or incorrect behaviour resulted.
The vectors class now includes a .name attribute, which defaults to:
{nlp.meta['lang']}_{nlp.meta['name']}.vectors
The vectors name is set in the cfg of the pipeline components under the
key pretrained_vectors. This replaces the previous cfg key
pretrained_dims.
In order to make existing models compatible with this change, we check
for the pretrained_dims key when loading models in from_disk and
from_bytes, and add the cfg key pretrained_vectors if we find it.
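The compatibility step described in the last paragraph might be sketched like this. Both helper names are hypothetical, the default-name format is one reading of the template given in the message (language code, underscore, model name, then `.vectors`), and the real check lives inside the `from_disk` and `from_bytes` methods of the components rather than in a standalone function.

```python
def default_vectors_name(meta):
    """Build a default vectors name from the model meta (assumed format)."""
    return "{}_{}.vectors".format(meta["lang"], meta["name"])


def upgrade_cfg(cfg, vectors_name):
    """Return a copy of cfg with pretrained_vectors filled in for old models."""
    cfg = dict(cfg)
    # Older models only carry pretrained_dims; newer ones name their vectors.
    if cfg.get("pretrained_dims") and not cfg.get("pretrained_vectors"):
        cfg["pretrained_vectors"] = vectors_name
    return cfg
```

Keying each model's vectors by a distinct name is what lets several models with pre-trained vectors be loaded side by side.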
2018-03-28 16:02:59 +02:00
ines
7fbc9e5874
Replace requests with urllib
2018-03-28 12:46:07 +02:00
ines
da1f200362
Add compat helpers for urllib
2018-03-28 12:45:53 +02:00
ines
ac88c72c9a
Fix ftfy workaround and remove old import
2018-03-28 12:14:28 +02:00
ines
ce6071ca89
Remove ftfy dependency and update docs
2018-03-28 12:09:42 +02:00
Matthew Honnibal
070b6c6495
Remove dependency on ftfy
2018-03-28 12:07:02 +02:00
ines
6d2c85f428
Drop six and related hacks as a dependency
2018-03-28 10:45:25 +02:00
ines
9e83513004
Add position of invalid token to error message
2018-03-27 23:56:59 +02:00
ines
11c4735ccf
Fix issue in Italian lemmatizer data ( resolves #2050 )
2018-03-27 23:55:22 +02:00
Matthew Honnibal
6a961928b2
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-03-27 21:01:48 +00:00
Matthew Honnibal
b7136cb094
Support zipped vector files in init-model
2018-03-27 21:01:18 +00:00
ines
693971dd8f
Improve error message if token text is empty string (see #2101 )
2018-03-27 22:25:40 +02:00
ines
0c829e6605
Fix whitespace
2018-03-27 22:20:59 +02:00
Matthew Honnibal
de9fd091ac
Fix #2014 : token.pos_ not writeable
2018-03-27 21:21:11 +02:00
Matthew Honnibal
18da89e04c
Handle non-callable gold_tuples in parser begin_training
2018-03-27 21:08:41 +02:00
Matthew Honnibal
1f7229f40f
Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
...
This reverts commit c9ba3d3c2d, reversing changes made to 92c26a35d4.
2018-03-27 19:23:02 +02:00
Matthew Honnibal
8b7a74570f
Revert "Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop""
...
This reverts commit f41e626844.
2018-03-27 19:22:52 +02:00
Matthew Honnibal
f41e626844
Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
...
This reverts commit c9ba3d3c2d, reversing changes made to f57bfbccdc.
2018-03-27 19:22:25 +02:00
Matthew Honnibal
c9ba3d3c2d
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-03-27 18:59:08 +02:00
Matthew Honnibal
92c26a35d4
Update get_cuda_stream
2018-03-27 16:42:00 +00:00
Matthew Honnibal
f57bfbccdc
Fix non-projective label filtering
2018-03-27 13:41:33 +02:00
Matthew Honnibal
d2118792e7
Merge changes from master
2018-03-27 13:38:41 +02:00
Matthew Honnibal
d4680e4d83
Merge branch 'master' of https://github.com/explosion/spaCy
2018-03-27 13:36:37 +02:00
Matthew Honnibal
63a267b34d
Fix #2073 : Token.set_extension not working
2018-03-27 13:36:20 +02:00
Matthew Honnibal
25280b7013
Try to make sum_state_features faster
2018-03-27 10:08:38 +00:00
Matthew Honnibal
987e1533a4
Use 8 features in parser
2018-03-27 10:08:12 +00:00
Matthew Honnibal
8bbd26579c
Support GPU in UD training script
2018-03-27 09:53:35 +00:00
Matthew Honnibal
dd54511c4f
Pass data as a function in begin_training methods
2018-03-27 09:39:59 +00:00
Matthew Honnibal
d9ebd78e11
Change default sizes in parser
2018-03-26 17:22:18 +02:00
Matthew Honnibal
a3d0cb15d3
Fix ent_iob tags in doc.merge to avoid inconsistent sequences
2018-03-26 07:16:06 +02:00
Matthew Honnibal
7d4687162f
Update doc.ents test
2018-03-26 07:14:35 +02:00
Matthew Honnibal
514d89a3ae
Set missing label for non-specified entities when setting doc.ents
2018-03-26 07:14:16 +02:00
Matthew Honnibal
54d7a1c916
Improve error message when entity sequence is inconsistent
2018-03-26 07:13:34 +02:00
Matthew Honnibal
938436455a
Add test for ent_iob during span merge
2018-03-25 22:16:19 +02:00
Matthew Honnibal
8e08c378fe
Fix entity IOB and tag in span merging
2018-03-25 22:16:01 +02:00
Matthew Honnibal
5430c43298
Set about to spacy-nightly
2018-03-25 19:30:14 +02:00
Ines Montani
68226109f4
Merge pull request #2142 from jimregan/polish-more-tokens
...
more exceptions
2018-03-24 19:06:44 +01:00
Matthew Honnibal
d566e673bf
Set version to v2.0.10
2018-03-24 18:09:03 +01:00
Matthew Honnibal
0d3bf0d4eb
Merge branch 'master' of https://github.com/explosion/spaCy
2018-03-24 17:31:49 +01:00
dejanmarich
ccd1c04c63
Update stop_words.py
...
Added more words
2018-03-24 17:31:24 +01:00
ines
f1446b0257
Port over Turkish changes
2018-03-24 17:31:07 +01:00
DuyguA
cd604878a4
quick typo fix
2018-03-24 17:26:35 +01:00
Matthew Honnibal
406548b976
Support .gz and .tar.gz files in spacy init-model
2018-03-24 17:18:32 +01:00
Jim O'Regan
efe037e8be
more exceptions
2018-03-24 00:05:27 +00:00
Ines Montani
719037cf20
Update formatting and add missing commas
2018-03-23 22:18:20 +01:00
Otto Sulin
266efc2018
Added Finnish examples
2018-03-23 22:58:52 +02:00
Otto Sulin
1940e54602
Added Finnish numbers
2018-03-23 22:33:08 +02:00
Otto Sulin
4ec3f19e2b
fixed stop words -> to-do lex_attrs.py
2018-03-23 22:18:17 +02:00
Matthew Honnibal
85717f570c
Merge branch 'master' of https://github.com/explosion/spaCy
2018-03-23 20:30:42 +01:00
Matthew Honnibal
8902754f0b
Fix vector loading for ud_train
2018-03-23 20:30:00 +01:00
Xiaoquan Kong
a71b99d7ff
bugfix for global-variable-change-in-runtime related issue ( #2135 )
...
* Bugfix: setting pollution from spacy/cli/ud_train.py to whole package
* Add contributor agreement of howl-anderson
2018-03-23 11:36:38 +01:00
Matthew Honnibal
044397e269
Support .gz and .tar.gz files in spacy init-model
2018-03-21 14:33:23 +01:00
Matthew Honnibal
49fbe2dfee
Use thinc.openblas in spacy.syntax.nn_parser
2018-03-20 02:22:09 +01:00
DuyguA
f708d7443b
added contractions to stopwords #2020
2018-03-19 14:06:39 +01:00
Matthew Honnibal
bede11b67c
Improve label management in parser and NER ( #2108 )
...
This patch does a few smallish things that tighten up the training workflow a little, and allow memory use during training to be reduced by letting the GoldCorpus stream data properly.
Previously, the parser and entity recognizer read and saved labels as lists, with extra labels noted separately. Lists were used because ordering is very important, to ensure that the label-to-class mapping is stable.
We now manage labels as nested dictionaries, first keyed by the action, and then keyed by the label. Values are frequencies. The trick is, how do we save new labels? We need to make sure we iterate over these in the same order they're added. Otherwise, we'll get different class IDs, and the model's predictions won't make sense.
To allow stable sorting, we map the new labels to negative values. If we have two new labels, they'll be noted as having "frequency" -1 and -2. The next new label will then have "frequency" -3. When we sort by (frequency, label), we then get a stable sort.
Storing frequencies then allows us to make the next nice improvement. Previously we had to iterate over the whole training set, to pre-process it for the deprojectivisation. This led to storing the whole training set in memory. This was most of the required memory during training.
To prevent this, we now store the frequencies as we stream in the data, and deprojectivize as we go. Once we've built the frequencies, we can then apply a frequency cut-off when we decide how many classes to make.
Finally, to allow proper data streaming, we also have to have some way of shuffling the iterator. This is awkward if the training files have multiple documents in them. To solve this, the GoldCorpus class now writes the training data to disk in msgpack files, one per document. We can then shuffle the data by shuffling the paths.
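The shuffle-by-path idea can be sketched like this. It is an illustration under stated assumptions rather than the GoldCorpus implementation: it uses JSON from the standard library in place of msgpack, and `write_docs` and `stream_shuffled` are hypothetical helper names.

```python
import json
import random
import tempfile
from pathlib import Path


def write_docs(docs, directory):
    """Write one file per document and return the list of paths."""
    paths = []
    for i, doc in enumerate(docs):
        path = Path(directory) / "{}.json".format(i)
        path.write_text(json.dumps(doc), encoding="utf8")
        paths.append(path)
    return paths


def stream_shuffled(paths, rng=random):
    """Yield documents in random order while holding only one in memory.

    Only the list of paths is shuffled; each document is read from disk
    when the iterator reaches it.
    """
    paths = list(paths)
    rng.shuffle(paths)
    for path in paths:
        yield json.loads(path.read_text(encoding="utf8"))


# Example usage with made-up documents.
with tempfile.TemporaryDirectory() as tmp_dir:
    example_paths = write_docs([{"id": 0}, {"id": 1}, {"id": 2}], tmp_dir)
    example_docs = list(stream_shuffled(example_paths))
```

Each epoch can call `stream_shuffled` again to get a fresh order without reloading the whole training set.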
This is a squash merge, as I made a lot of very small commits. Individual commit messages below.
* Simplify label management for TransitionSystem and its subclasses
* Fix serialization for new label handling format in parser
* Simplify and improve GoldCorpus class. Reduce memory use, write to temp dir
* Set actions in transition system
* Require thinc 6.11.1.dev4
* Fix error in parser init
* Add unicode declaration
* Fix unicode declaration
* Update textcat test
* Try to get model training on less memory
* Print json loc for now
* Try rapidjson to reduce memory use
* Remove rapidjson requirement
* Try rapidjson for reduced mem usage
* Handle None heads when projectivising
* Stream json docs
* Fix train script
* Handle projectivity in GoldParse
* Fix projectivity handling
* Add minibatch_by_words util from ud_train
* Minibatch by number of words in spacy.cli.train
* Move minibatch_by_words util to spacy.util
* Fix label handling
* More hacking at label management in parser
* Fix encoding in msgpack serialization in GoldParse
* Adjust batch sizes in parser training
* Fix minibatch_by_words
* Add merge_subtokens function to pipeline.pyx
* Register merge_subtokens factory
* Restore use of msgpack tmp directory
* Use minibatch-by-words in train
* Handle retokenization in scorer
* Change back-off approach for missing labels. Use 'dep' label
* Update NER for new label management
* Set NER tags for over-segmented words
* Fix label alignment in gold
* Fix label back-off for infrequent labels
* Fix int type in labels dict key
* Fix int type in labels dict key
* Update feature definition for 8 feature set
* Update ud-train script for new label stuff
* Fix json streamer
* Print the line number if conll eval fails
* Update children and sentence boundaries after deprojectivisation
* Export set_children_from_heads from doc.pxd
* Render parses during UD training
* Remove print statement
* Require thinc 6.11.1.dev6. Try adding wheel as install_requires
* Set different dev version, to flush pip cache
* Update thinc version
* Update GoldCorpus docs
* Remove print statements
* Fix formatting and links [ci skip]
2018-03-19 02:58:08 +01:00
Matthew Honnibal
ff42b726c1
Fix unicode declaration on test
2018-03-19 02:04:24 +01:00
Matthew Honnibal
7dc76c6ff6
Add test for textcat
2018-03-16 12:39:45 +01:00
Matthew Honnibal
3cdee79a0c
Add depth argument for text classifier
2018-03-16 12:37:31 +01:00
Matthew Honnibal
13067095a1
Disable broken add-after-train in textcat
2018-03-16 12:33:33 +01:00
Matthew Honnibal
565ef8c4d8
Improve argument passing in textcat
2018-03-16 12:30:51 +01:00
Matthew Honnibal
eb2a3c5971
Remove unused function
2018-03-16 12:30:33 +01:00
Matthew Honnibal
307d6bf6d3
Fix parser for Thinc 6.11
2018-03-16 10:59:31 +01:00
Matthew Honnibal
9a389c4490
Fix parser for Thinc 6.11
2018-03-16 10:38:13 +01:00
Matthew Honnibal
648532d647
Don't assume blas methods are present
2018-03-16 02:48:20 +01:00
Matthew Honnibal
e85dd038fe
Merge remote-tracking branch 'origin/master' into feature/single-thread
2018-03-16 02:41:11 +01:00
Matthew Honnibal
e3be3d65b3
Version as v2.0.10.dev0
2018-03-15 17:31:22 +01:00
ines
f3f8bfc367
Add built-in factories for merge_entities and merge_noun_chunks
...
Allows adding those components to the pipeline out-of-the-box if they're defined in a model's meta.json. Also allows usage as nlp.add_pipe(nlp.create_pipe('merge_entities')).
2018-03-15 17:16:54 +01:00
Ines Montani
0d17377e8b
Merge pull request #2095 from DuyguA/quick-typo-fix ( resolves #2063 )
...
Quick typo fix
2018-03-15 00:29:56 +01:00
ines
d854f69fe3
Add built-in factories for merge_entities and merge_noun_chunks
...
Allows adding those components to the pipeline out-of-the-box if they're defined in a model's meta.json. Also allows usage as nlp.add_pipe(nlp.create_pipe('merge_entities')).
2018-03-15 00:18:51 +01:00
ines
9ad5df41fe
Fix whitespace
2018-03-15 00:11:18 +01:00
Matthew Honnibal
d7ce6527fb
Use increasing batch sizes in ud-train
2018-03-14 20:15:28 +01:00
alldefector
f4e5904fc2
Fix Spanish noun_chunks failure caused by typo
2018-03-14 17:03:17 +01:00
Thomas Opsomer
fbf48b3f9f
lemma property to return hash instead of unicode
2018-03-14 17:03:00 +01:00
Matthew Honnibal
8cefc58abc
Fix Vectors pickling
2018-03-14 16:59:37 +01:00
DuyguA
be4f6da16b
maybe not a good idea to remove also
2018-03-14 14:47:24 +01:00
DuyguA
1a513f71e3
removed also from lookup
2018-03-14 11:57:15 +01:00
DuyguA
cca66abf1e
quick typo fix
2018-03-14 11:34:22 +01:00
Matthew Honnibal
7b755414eb
Update call into thinc
2018-03-13 13:59:59 +01:00
Matthew Honnibal
e101f10ef0
Fix header
2018-03-13 02:12:16 +01:00
Matthew Honnibal
952c87409e
Use openblas.sgemm in parser
2018-03-13 02:12:01 +01:00
Matthew Honnibal
d55620041b
Switch parser to gemm from thinc.openblas
2018-03-13 02:10:58 +01:00
Matthew Honnibal
c2f4759257
Fix test for Python 2
2018-03-12 23:03:05 +01:00
Matthew Honnibal
9aeec9c242
Increment dev version
2018-03-11 01:58:21 +01:00
Matthew Honnibal
f49d71fa7c
Merge branch 'master' of https://github.com/explosion/spaCy
2018-03-11 01:27:17 +01:00
Matthew Honnibal
5dddb30e5b
Fix ud-train script
2018-03-11 01:26:45 +01:00
Matthew Honnibal
e42960bd14
Merge pull request #2012 from alldefector/patch-1
...
Fix Spanish noun_chunks failure caused by typo
2018-03-11 01:05:19 +01:00
Matthew Honnibal
2cab4d6517
Remove use of attr module in ud_train
2018-03-11 00:59:39 +01:00
Matthew Honnibal
fa9fd21620
Increment dev version
2018-03-11 00:41:54 +01:00
Matthew Honnibal
53b3249e06
Add tests for arc eager oracle
2018-03-10 23:42:56 +01:00
Matthew Honnibal
754ea1b2f7
Link in spaCy CoNLL commands
2018-03-10 23:42:15 +01:00
Matthew Honnibal
3478ea76d1
Add ud_train and ud_evaluate CLI commands
2018-03-10 23:41:55 +01:00
Matthew Honnibal
4b72c38556
Fix dropout bug in beam parser
2018-03-10 23:16:40 +01:00
Matthew Honnibal
9cc202d670
Fix Vectors pickling
2018-03-10 22:53:42 +01:00
Matthew Honnibal
3d6487c734
Support dropout in beam parse
2018-03-10 22:41:55 +01:00
Matthew Honnibal
31b156d60b
Fix itershuffle
2018-03-10 22:32:59 +01:00
Matthew Honnibal
b59765ca9f
Stream gold during spacy train
2018-03-10 22:32:45 +01:00
Matthew Honnibal
c3d168509a
Stream the gold data during training, to reduce memory
2018-03-10 22:32:32 +01:00
DuyguA
cba63196f9
fixed typo
2018-03-09 10:54:18 +01:00
DuyguA
7a780476af
added more abbreviations
2018-03-09 10:13:00 +01:00
DuyguA
cca87756d7
added Sti
2018-03-08 18:07:52 +01:00
DuyguA
3c994311c5
added abbrevs
2018-03-08 18:03:27 +01:00
DuyguA
56d6fb180e
added like_num to lex
2018-03-08 15:25:25 +01:00
DuyguA
26ee0590a3
added some commonly used cases
2018-03-08 12:43:58 +01:00
DuyguA
ae6473e4d5
removed some words with negation particle.
2018-03-08 12:20:32 +01:00
DuyguA
6ed59a2198
removed number words to be carried to the lexical
2018-03-08 12:19:23 +01:00
DuyguA
04784a44a6
made alphabetical order for Turkish characters
2018-03-08 12:11:32 +01:00
DuyguA
af33e022a5
added example sentences for Turkish
2018-03-08 12:06:03 +01:00
Matthew Honnibal
a1be01185c
Fix array out of bounds error in Span
2018-02-28 12:27:09 +01:00
Thomas Opsomer
8df9e52829
lemma property to return hash instead of unicode
2018-02-27 19:50:01 +01:00
Ines Montani
35634352fe
Merge pull request #2025 from dejanmarich/patch-1
...
Update stop_words.py for Croatian language
2018-02-26 18:22:32 +01:00
Matthew Honnibal
14f729c72a
Add subtok label to parser
2018-02-26 12:26:35 +01:00
Matthew Honnibal
7137ad8b0b
Make label filtering clearer for projectivisation
2018-02-26 12:02:01 +01:00
Matthew Honnibal
b8d52cb285
Fix inconsistent label freq cutoff for projectivisation
2018-02-26 12:01:44 +01:00
Matthew Honnibal
7b66ec896a
Revert "Revert "Improve parser oracle around sentence breaks.""
...
This reverts commit 36e481c584.
2018-02-26 10:57:37 +01:00
Matthew Honnibal
36e481c584
Revert "Improve parser oracle around sentence breaks."
...
This reverts commit 50817dc9ad.
2018-02-26 10:53:55 +01:00
Matthew Honnibal
5faae803c6
Add option to not use Janome for Japanese tokenization
2018-02-26 09:39:46 +01:00
Matthew Honnibal
9b406181cd
Add Chinese.Defaults.use_jieba setting, for UD
2018-02-25 15:12:38 +01:00
Matthew Honnibal
9ccd0c643b
Add Vietnamese
2018-02-25 15:00:46 +01:00
Matthew Honnibal
d4fdb97c87
Fix alignment for words with spaces
2018-02-25 14:55:00 +01:00
Matthew Honnibal
6d2c1ef52c
Fix SP tag in generic tag map
2018-02-24 16:04:56 +01:00
Matthew Honnibal
5cc3bd1c1d
Update alignment tests
2018-02-24 16:03:58 +01:00
Matthew Honnibal
6138439469
Fix many-to-one alignment
2018-02-24 16:03:50 +01:00
Matthew Honnibal
4890ee1732
Fix scoring of tokenization for punct
2018-02-24 10:32:32 +01:00
Matthew Honnibal
12b39f87da
Move cython declarations in matcher.pyx
2018-02-24 10:32:18 +01:00
Matthew Honnibal
01d1b7abdf
Support many-to-one alignment in GoldParse
2018-02-24 10:17:01 +01:00
Matthew Honnibal
7865746574
Support many-to-one alignment
2018-02-24 02:09:53 +01:00
Matthew Honnibal
458710b831
Poke matcher test for appveyor
2018-02-23 23:53:48 +01:00
Matthew Honnibal
968dabdde4
Fix bug in multi-task objective
2018-02-23 23:48:09 +01:00
Matthew Honnibal
2c9c8b8d72
Try commenting out emoji test in matcher
2018-02-23 23:34:35 +01:00
Matthew Honnibal
980ad68cbe
Try to find test that fails on appveyor
2018-02-23 21:27:53 +01:00
Matthew Honnibal
39de8cd4d3
Try to find test failing on appveyor
2018-02-23 20:59:21 +01:00
Matthew Honnibal
4492a33a9d
Fix sent_start multi-task objective when alignment fails
2018-02-23 16:50:59 +01:00
Matthew Honnibal
5fa44e93f1
Set unicode_literals in matcher
2018-02-23 16:48:54 +01:00
Matthew Honnibal
12264f9296
Add multi-task objective for sentence segmentation
2018-02-23 16:25:57 +01:00
Matthew Honnibal
e7deadb519
Set version to 2.1.0.dev1
2018-02-23 16:22:24 +01:00
Matthew Honnibal
7b575a119e
Try to reduce memory usage of test_matcher
2018-02-23 15:34:37 +01:00
Matthew Honnibal
24563f4026
Fix data typing in align
2018-02-23 15:08:06 +01:00
Matthew Honnibal
7a5ba20692
Fix integer typing in _align
2018-02-23 14:51:24 +01:00
Matthew Honnibal
875411b875
Set unicode types in _align.pyx and test
2018-02-23 14:35:38 +01:00
Matthew Honnibal
51d9679aa3
Fix broken span.as_doc test
2018-02-23 14:22:24 +01:00
dejanmarich
71c261d58b
Update stop_words.py
...
Added more words
2018-02-23 10:31:01 +01:00
Matthew Honnibal
3e6c1111b7
Remove obsolete test
2018-02-23 03:22:07 +01:00
Matthew Honnibal
a4fdec524a
Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-gold
2018-02-22 21:44:28 +01:00
Matthew Honnibal
50817dc9ad
Improve parser oracle around sentence breaks.
2018-02-22 19:22:26 +01:00
Matthew Honnibal
307aefe131
Increment version to v2.0.9
2018-02-22 17:07:53 +01:00
Feng Niu
1c60384bed
return on empty doc
2018-02-21 15:39:04 -08:00
Feng Niu
7eb1cd100b
unbound doc var
2018-02-21 15:05:37 -08:00
Feng Niu
8df75b229c
fix unbound vars in es.syntax_iterators
2018-02-21 13:11:17 -08:00
alldefector
4244e285c2
Fix Spanish noun_chunks failure caused by typo
2018-02-21 12:43:21 -08:00
Matthew Honnibal
661873ee4c
Randomize the rebatch size in parser
2018-02-21 21:02:07 +01:00
Matthew Honnibal
0872cf611d
Don't lower-case lemmas of proper nouns
2018-02-21 16:01:16 +01:00
Matthew Honnibal
a0ddb803fd
Make error when no label found more helpful
2018-02-21 16:00:59 +01:00
Matthew Honnibal
ea2fc5d45f
Improve length and freq cutoffs in parser
2018-02-21 16:00:38 +01:00
Matthew Honnibal
e5757d4bf0
Add labels property to parser
2018-02-21 16:00:00 +01:00
Matthew Honnibal
eff4ae809a
Fix nonproj label filter
2018-02-21 15:59:04 +01:00
Matthew Honnibal
e624405cda
Temporarily remove cutoff when filtering labels in nonproj
2018-02-21 13:53:40 +01:00
Matthew Honnibal
f466f0186e
Use new alignment implementation in GoldParse
2018-02-20 21:16:35 +01:00
Matthew Honnibal
c0734ba526
Make alignment work with strings
2018-02-20 17:51:49 +01:00
Matthew Honnibal
8180c84a98
Add tests for new Levenshtein alignment
2018-02-20 17:32:25 +01:00
Matthew Honnibal
930c980570
Add improved Levenshtein alignment implementation
2018-02-20 17:31:56 +01:00
Ines Montani
14e7e0f12a
Merge pull request #2000 from jimregan/polish-tag-map
...
Polish tag map
2018-02-18 19:05:58 +01:00
Jim O'Regan
664407de5d
missing PrepCase attribute
2018-02-18 14:46:12 +00:00
Jim O'Regan
95f0673fbc
fix typo/missing here too
2018-02-18 14:38:27 +00:00
Matthew Honnibal
2bccad8815
Fix incorrect matcher test
2018-02-18 14:56:12 +01:00
Matthew Honnibal
530172d57a
Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-faster-matcher
2018-02-18 14:40:42 +01:00
Matthew Honnibal
cf0e320f2b
Add doc.is_sentenced attribute, re #1959
2018-02-18 14:16:55 +01:00
Matthew Honnibal
1e5aeb4eec
Merge pull request #1987 from thomasopsomer/span-sent
...
Make span.sent work when only manual / custom sbd
2018-02-18 14:05:37 +01:00
Matthew Honnibal
1cf774bdc1
Add output options return_matches and as_tuples to Matcher
2018-02-18 14:00:45 +01:00
Matthew Honnibal
dd9b0945af
Fix inconsistencies in the symbols table
2018-02-18 13:51:31 +01:00
Matthew Honnibal
66496ac8e1
Set version to v2.1.0.dev0
2018-02-18 13:48:39 +01:00
Matthew Honnibal
eb3040ce46
Merge pull request #1891 from fucking-signup/master
...
Fix issue #1889
2018-02-18 13:47:47 +01:00
Matthew Honnibal
3d7285870b
Update matcher branch with v2.0.8 master
2018-02-18 13:42:58 +01:00
ines
6bba1db4cc
Drop six and related hacks as a dependency
2018-02-18 13:29:56 +01:00
Matthew Honnibal
b30b09192a
Merge pull request #1665 from jimregan/animacy
...
typo in "inan", add "nhum"
2018-02-18 13:26:53 +01:00
Matthew Honnibal
1b3c98e01b
Set version to v2.0.8
2018-02-18 12:16:31 +01:00
Matthew Honnibal
f9f46e5a07
Revert matcher fixes from GregDubbin
2018-02-18 10:59:28 +01:00
Matthew Honnibal
86405e4ad1
Fix CLI for multitask objectives
2018-02-18 10:59:11 +01:00
Matthew Honnibal
a34749b2bf
Add multitask objectives options to train CLI
2018-02-17 22:03:54 +01:00
Matthew Honnibal
8f06903e09
Fix multitask objectives
2018-02-17 18:41:36 +01:00
Matthew Honnibal
d1246c95fb
Fix model loading when using multitask objectives
2018-02-17 18:11:36 +01:00
Matthew Honnibal
262d0a3148
Fix overwriting of lexical attributes when loading vectors during training
2018-02-17 18:11:11 +01:00
Matthew Honnibal
c0caf7cf27
Fix LANG symbol
2018-02-17 18:10:50 +01:00
Matthew Honnibal
0bf2f6be29
Add missing symbol for LANG attr. Fixes inconsistent numeric ID
2018-02-17 17:37:02 +01:00
Matthew Honnibal
97a228a4ce
Increment to v2.0.8.dev0
2018-02-17 16:54:36 +01:00
Matthew Honnibal
f7dc64d2a3
Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-faster-matcher
2018-02-17 16:47:35 +01:00
Aaron Marquez
ea571e8325
Merge branch 'master' into issue-1959
2018-02-16 15:14:09 -08:00
Matthew Honnibal
7d5c720fc3
Fix multitask objective when no pipeline provided
2018-02-15 23:50:21 +01:00
Aaron Marquez
f0d3672e17
Changed loading EN model
2018-02-15 14:28:38 -08:00
Aaron Marquez
3765d84d57
Fix issue #1959
2018-02-15 12:51:49 -08:00
Aaron Marquez
7ba4111554
Add test for issue-1959
2018-02-15 12:46:22 -08:00
Matthew Honnibal
59b7cf9db8
Add get_beam_parse method in ArcEager, for Prodigy
2018-02-15 21:03:16 +01:00
Matthew Honnibal
3e541de440
Merge branch 'master' of https://github.com/explosion/spaCy
2018-02-15 21:02:55 +01:00
Thomas Opsomer
5d24a81c0b
add test for span.sent when doc not parsed
2018-02-15 16:59:16 +01:00
Thomas Opsomer
deab391cbf
correct check on sent_start & raise if no boundaries
2018-02-15 16:58:30 +01:00
Matthew Honnibal
afbd46adfb
Remove length cap in PhraseMatcher
2018-02-15 16:10:54 +01:00
Matthew Honnibal
4533c7408d
Update matcher tests
2018-02-15 15:39:47 +01:00
Matthew Honnibal
1c19605426
Move matcher2.pyx to matcher.pyx
2018-02-15 15:27:03 +01:00
Matthew Honnibal
9ebf2fe7c3
Make helper function to get longest matches
2018-02-15 15:26:15 +01:00
Matthew Honnibal
4cb861e080
Merge pull request #1968 from DuyguA/is_currency
...
New lexical feature is_currency
2018-02-15 12:13:36 +01:00
Thomas Opsomer
b902731313
Find span sentence when only sentence boundaries (no parser)
2018-02-14 22:18:54 +01:00
Matthew Honnibal
d19dc67886
Make get_action nogil, for efficiency
2018-02-14 12:16:36 +01:00
Matthew Honnibal
7885b92b45
Refactor matcher2, hopefully making it faster
2018-02-14 12:11:17 +01:00
Matthew Honnibal
00261eea27
Make tests refer to matcher2
2018-02-14 12:10:51 +01:00
Claudiu-Vlad Ursache
e28de12cbd
Ensure files opened in `from_disk` are closed
...
Fixes [issue 1706](https://github.com/explosion/spaCy/issues/1706).
2018-02-13 20:49:43 +01:00
Matthew Honnibal
262cbe356e
Remove caching, as doesn't seem to help for now.
2018-02-13 17:15:20 +01:00
Matthew Honnibal
f43d53f2c5
Remove print statement
2018-02-13 17:15:07 +01:00
Matthew Honnibal
dcd8d89aef
Update test for 850, making it work with matcher2
2018-02-13 16:35:20 +01:00
Matthew Honnibal
9bdfa5cd4f
Remove re comparisons tests, as matcher behaves differently
2018-02-13 16:28:52 +01:00
Matthew Honnibal
6d7986b0f1
Fix matcher test
2018-02-13 16:28:06 +01:00
Matthew Honnibal
9efda9e9ab
Add PhraseMatcher in matcher2.pyx
2018-02-13 16:27:46 +01:00
Johannes Dollinger
012e874d09
Add contributor agreement for emulbreh
2018-02-13 13:40:33 +01:00
Johannes Dollinger
bf94c13382
Don't fix random seeds on import
2018-02-13 12:42:23 +01:00
Matthew Honnibal
0004331895
Update notes on matcher2
2018-02-13 11:45:45 +01:00
Matthew Honnibal
b4cc39eb74
Fix zero-width quantifiers. Passes test_matcher
2018-02-13 11:45:32 +01:00
Matthew Honnibal
1b01685f47
Fix ZERO_PLUS operator
2018-02-12 12:28:03 +01:00
Matthew Honnibal
9115c3ba0a
Add TODO in notes
2018-02-12 12:06:48 +01:00
Matthew Honnibal
b00326a7fe
Move pattern_id out of TokenPattern
2018-02-12 12:05:54 +01:00
Matthew Honnibal
d34c732635
Add Python notes for rethinking matcher
2018-02-12 10:19:29 +01:00
Matthew Honnibal
d7c9b53120
Pass kwargs into pipeline components during begin_training
2018-02-12 10:18:39 +01:00
Matthew Honnibal
fae5c0dc18
Work on matcher2
2018-02-12 10:17:43 +01:00
4altinok
ca8728035d
added new lex feat to token
2018-02-11 18:55:48 +01:00
4altinok
edd7202a06
added new symbol
2018-02-11 18:55:32 +01:00
4altinok
ed1ac2969e
added new lexical feat to lexeme
2018-02-11 18:51:48 +01:00
4altinok
94fb0b75e3
code for is_currency
2018-02-11 18:51:32 +01:00
4altinok
3deef1497a
removed 18 and replaced 18 with is_currency
2018-02-11 18:51:09 +01:00
4altinok
471d3c9e23
added lex test for is_currency
2018-02-11 18:50:50 +01:00
ines
c63e99da8a
Fix typo in glossary ( resolves #1964 )
...
Co-Authored-By: SThomasP <sthomasp@users.noreply.github.com>
2018-02-10 11:58:41 +01:00
Lyndon White
6ee5dff51c
Make python 3.4 compat module loading ( fix #1733 )
2018-02-09 23:03:35 +08:00
Matthew Honnibal
e361b4f82b
Fix #1929 : Incorrect NER when pre-set sentence boundaries.
2018-02-08 15:25:41 +01:00
Matthew Honnibal
fd9fd275c5
Make test for #1945 more precise
2018-02-07 02:06:11 +01:00
Matthew Honnibal
c087a14380
Merge branch 'master' of https://github.com/explosion/spaCy
2018-02-07 01:29:39 +01:00
Matthew Honnibal
76d89b2180
Add test for #1945 : PhraseMatcher regression
2018-02-07 01:29:23 +01:00
Ines Montani
0954e15dda
Merge pull request #1913 from ohenrik/nb_syntax_iterator
...
Norwegian Language (nb) - Added french syntax iterator with explanation
2018-02-06 04:59:07 +01:00
Ole Henrik Skogstrøm
251a7805fe
Copied French syntax iterator to simplify future changes
2018-02-05 14:45:05 +01:00
Matthew Honnibal
2e7391e627
Merge pull request #1916 from tokestermw/bug/fix-not-passing-in-model-cfg-in-nlp
...
Bug/fix not passing in model cfg in nlp
2018-02-05 01:19:40 +01:00
Ali Zarezade
9df9da34a3
Fix init_model issue
...
Fixing issue #1928
2018-02-03 17:21:34 +03:30
Matthew Honnibal
ebe84e45e5
Increment version to 2.0.7
2018-02-02 03:39:16 +01:00
Matthew Honnibal
e4b1f57599
Increment version
2018-02-02 02:33:23 +01:00
Matthew Honnibal
069531c351
Merge branch 'master' of https://github.com/explosion/spaCy
2018-02-02 02:32:58 +01:00
Matthew Honnibal
f74a802d09
Test and fix #1919 : Error resuming training
2018-02-02 02:32:40 +01:00
ines
f1d3deffac
Add Russian example sentences (see #1107 )
2018-02-01 20:09:40 +01:00
Matthew Honnibal
6b1126c312
Merge branch 'master' of https://github.com/explosion/spaCy
2018-02-01 02:57:52 +01:00
ines
3c1fb9d02d
Make validate command fail more gracefully if version not found
...
Mostly relevant during development when working with .dev versions
2018-01-31 22:06:28 +01:00
Motoki Wu
54062b7326
added tests for issue #1915
2018-01-30 18:30:19 -08:00
Motoki Wu
f4a7d1a423
make sure to pass in **cfg to each component when training
2018-01-30 18:29:54 -08:00
ines
4046823699
Only check component in factories if string (see #1911 )
2018-01-30 16:29:07 +01:00
ines
ce10d320c4
Fix component check in self.factories (see #1911 )
2018-01-30 16:09:37 +01:00
Ole Henrik Skogstrøm
e40465487c
Added French syntax iterator with explanation
2018-01-30 15:44:29 +01:00
ines
8901814248
Improve error handling if pipeline component is not callable ( resolves #1911 )
...
Also add help message if user accidentally calls nlp.add_pipe() with a string of a built-in component name.
2018-01-30 15:43:03 +01:00
Matthew Honnibal
a437ba87a3
Set release=True
2018-01-29 21:26:04 +01:00
Adam Binford
9238749aaf
Removed test to avoid network requests
2018-01-29 14:48:20 -05:00
Adam Binford
1a2c2f7d7f
Fixed auto linking after download and added simple test to check
2018-01-29 14:25:21 -05:00
Matthew Honnibal
cb7110c22e
Merge pull request #1882 from ohenrik/nb_lemma_and_tag_map
...
Add norwegian bokmål ('nb') lemmatizer and tag_map
2018-01-29 18:18:50 +01:00
Matthew Honnibal
0c1e7f0c86
Merge pull request #1893 from azarezade/master
...
Add Persian language
2018-01-29 18:18:33 +01:00
Matthew Honnibal
cbdab75b36
Increment version
2018-01-28 23:46:22 +01:00
Matthew Honnibal
512e6adb08
Merge pull request #1896 from thomasopsomer/fix-sent
...
Fix sentence boundaries serialization (issue #1834 )
2018-01-28 21:18:51 +01:00
Matthew Honnibal
f5b1ad4100
Limit parser model size, to hopefully reduce memory during CI tests
2018-01-28 21:00:32 +01:00
Thomas Opsomer
515e25910e
fix sent_start in serialization
2018-01-28 19:50:42 +01:00
Thomas Opsomer
45d62561f7
add test for the issue
2018-01-28 19:49:56 +01:00
ines
6d978e5c35
Don't use deprecated Doc.merge call in displaCy
...
As reported here: https://stackoverflow.com/a/48464412/6400719
2018-01-27 11:25:05 +01:00
Ali Zarezade
bb6bd3d8ae
add persian language
2018-01-27 13:27:26 +03:30
Ali Zarezade
d195675db5
add persian language
2018-01-27 13:21:38 +03:30
Kit
4b42267ba3
Fix issue #1889
2018-01-25 23:17:22 +01:00
Kit
52ef51f36e
Add test for issue #1889
2018-01-25 22:56:48 +01:00
Ole Henrik Skogstrøm
8e2c9f2475
Cleaned up nb tag_map comments
2018-01-25 11:09:28 +01:00
Ole Henrik Skogstrøm
1107e89fcf
Updated doc string on nb tag_map module
2018-01-25 11:08:28 +01:00
Matthew Honnibal
6a8cb905aa
Merge pull request #1876 from GregDubbin/master
...
Pattern matcher fixes
2018-01-24 16:38:11 +01:00
Matthew Honnibal
38b260e0c3
Merge pull request #1879 from azarezade/master
...
Add Persian character and symbols
2018-01-24 16:34:22 +01:00
Matthew Honnibal
edb71a280e
Add test for #1883 : Unpickling Matcher
2018-01-24 15:42:33 +01:00
Matthew Honnibal
2ad050e668
Fix unpickling of Matcher. Also store correct data in matcher._patterns
2018-01-24 15:42:11 +01:00
Ole Henrik Skogstrøm
4058a7d579
Fix æøå characters in lemmatizer
2018-01-24 14:03:14 +01:00
Ole Henrik Skogstrøm
42248f423f
Updated tag map
2018-01-24 13:50:33 +01:00
Ole Henrik Skogstrøm
74b430b49a
Correct Lemmatizer
2018-01-24 13:26:33 +01:00
Ole Henrik Skogstrøm
b9b3a40c78
Add norwegian lemmatizer and tag_map
2018-01-24 12:28:29 +01:00
Matthew Honnibal
42a18ef903
Add test for #1868 : Vocab.__contains__ with ints
2018-01-23 23:27:05 +01:00
Matthew Honnibal
43f381ce36
Make Vocab.__contains__ work with ints. Fixes #1868
2018-01-23 23:26:47 +01:00
greg
85ab99e692
Correct test examples
2018-01-23 15:00:14 -05:00
greg
f50bb1aafc
Restructure StateC to eliminate dependency on unordered_map
2018-01-23 14:40:03 -05:00
Matthew Honnibal
f3753c2453
Further model deserialization fixes re #1727
2018-01-23 19:16:05 +01:00
Matthew Honnibal
91e916cb67
Add comment to new test
2018-01-23 19:11:53 +01:00
Matthew Honnibal
fd187d71ad
Add test for #1727
2018-01-23 19:11:01 +01:00
Matthew Honnibal
85c942a6e3
Don't overwrite pretrained_dims setting from cfg. Fixes #1727
2018-01-23 19:10:49 +01:00
Ali Zarezade
42349471bc
add ٪ as punctuation
2018-01-23 18:11:33 +03:30
Ali Zarezade
2bda582135
Add Persian character and symbols
...
Add Persian characters and the following:
- ٪ used instead of %
- ؟ used instead of ?
- ﷼ used instead of $
- ، used instead of ,
- ؛ used instead of ;
2018-01-23 13:20:36 +03:30
Matthew Honnibal
7e6dc283db
Fix unicode import in test
2018-01-22 23:55:44 +01:00
greg
686735b94e
Fix matcher import
2018-01-22 16:53:05 -05:00
greg
3a491093ee
Import libcpp.map if libcpp.unordered_map doesn't exist
2018-01-22 16:46:25 -05:00
greg
d55992bdf0
Switch match dictionary to use final state pointer rather than ID
2018-01-22 15:36:47 -05:00
Matthew Honnibal
4ce7d24fd5
Add test for #1799 : Set left and right edges (and thus sentences) in non-projective parses.
2018-01-22 20:18:38 +01:00
Matthew Honnibal
56164ab688
Set l_edge and r_edge correctly for non-projective parses. Fixes #1799
2018-01-22 20:18:04 +01:00
Matthew Honnibal
964aa1b384
Merge branch 'master' of https://github.com/explosion/spaCy
2018-01-22 19:18:46 +01:00
Matthew Honnibal
29897ed1b3
Allow vector loading to work on 1d data files. Fixes #1831
2018-01-22 19:18:26 +01:00
greg
490bc82c27
Add comments clarifying matcher logic for '*'
2018-01-22 10:03:12 -05:00
Matthew Honnibal
fe4748fc38
Merge pull request #1870 from avadhpatel/master
...
Model Load Performance Improvement by more than 5x
2018-01-22 00:05:15 +01:00
Avadh Patel
a517df55c8
Small fix
...
Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-21 15:20:45 -06:00
Avadh Patel
5b5029890d
Merge branch 'perfTuning' into perfTuningMaster
...
Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-21 15:20:00 -06:00
Matthew Honnibal
203d2ea830
Allow multitask objectives to be added to the parser and NER more easily
2018-01-21 19:37:02 +01:00
Matthew Honnibal
4a7d524efb
Merge branch 'master' of https://github.com/explosion/spaCy
2018-01-21 19:22:03 +01:00
Matthew Honnibal
61a051f2c0
Fix MultitaskObjective
2018-01-21 19:21:34 +01:00
Avadh Patel
75903949da
Updated model building after suggestion from Matthew
...
Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-18 06:51:57 -06:00
Avadh Patel
fe879da2a1
Do not train model if it's going to be loaded from disk
...
This saves significant time in loading a model from disk.
Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-17 06:16:07 -06:00
Avadh Patel
2146faffee
Do not train model if it's going to be loaded from disk
...
This saves significant time in loading a model from disk.
Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-17 06:04:22 -06:00
greg
7072b395c9
Add greedy matcher tests
2018-01-16 15:46:13 -05:00
greg
441f490c1c
Merge branch 'master' of github.com:GregDubbin/spaCy
2018-01-16 13:31:10 -05:00
greg
8bea62f26e
Correct bugs for greedy matching and introduce ADVANCE_PLUS action
2018-01-16 13:21:43 -05:00
Matthew Honnibal
ccb51a9f36
Make .similarity() return 1.0 if all orth attrs match
2018-01-15 16:29:48 +01:00
Matthew Honnibal
82135d85b7
Fix test
2018-01-15 15:55:15 +01:00
Matthew Honnibal
4b09616b58
Add test for #1757 : Comparison against None
2018-01-15 15:55:01 +01:00
Matthew Honnibal
b904d81e9a
Fix rich comparison against None objects. Closes #1757
2018-01-15 15:51:25 +01:00
Matthew Honnibal
9e413449f6
Fix unicode error in new test
2018-01-15 15:39:00 +01:00
Matthew Honnibal
ab7c45b12d
Fix error message and handling of doc.sents
2018-01-15 15:21:11 +01:00
Matthew Honnibal
6b215d2dd3
Add test for Issue #1537
2018-01-15 15:20:56 +01:00
ines
5babb7d6f6
Merge branch 'master' of https://github.com/explosion/spaCy
2018-01-14 17:31:09 +01:00
ines
793890cb4d
Remove test for removed deprecation warning
2018-01-14 17:31:06 +01:00
Matthew Honnibal
465a6f6452
Add missing Span.vocab property. Closes #1633
2018-01-14 15:06:30 +01:00
Matthew Honnibal
0cb090e526
Fix infinite recursion in token.sent_start. Closes #1640
2018-01-14 15:02:15 +01:00
Matthew Honnibal
5cbe913b6f
Don't raise deprecation warning in property. Closes #1813 , #1712
2018-01-14 14:55:58 +01:00
Matthew Honnibal
1a1cca6052
Fix vectors.resize() on Py3. Closes #1539
2018-01-14 14:48:51 +01:00
Matthew Honnibal
0153220304
Make set_vector add word to vocab. Fixes #1807
2018-01-14 13:57:57 +01:00
Ines Montani
55754f0cee
Merge pull request #1836 from fucking-signup/master
...
Add tests for issue #1769
2018-01-13 00:23:35 +00:00
Kit
4ee97f20a0
Mark like_num tests as slow
2018-01-13 00:44:15 +01:00
Kit
855531537e
Rewrite tests for issue #1769
2018-01-12 23:49:51 +01:00
Kit
5b541cb5ec
Simplify tests for issue #1769
2018-01-12 23:34:27 +01:00
Kit
7a2adc4633
Remove some tests to see build status changes
2018-01-12 22:49:16 +01:00
Kit
0e62809a43
Rewrite tests for issue #1769
2018-01-12 22:26:06 +01:00
Ines Montani
36f426fe0a
Merge pull request #1808 from fucking-signup/master
...
Fix issue #1769
2018-01-12 21:12:02 +00:00
Kit
76f4eeca44
Remove tests to see build changes on Windows (Python 2.7)
2018-01-12 20:30:51 +01:00
Matthew Honnibal
7ca49c2061
Merge branch 'master' into feature-improve-model-download
2018-01-10 18:21:55 +01:00
Kit
7ec0956e8d
Add regression test (issue #1769 )
2018-01-08 03:42:04 +01:00
Kit
701e7cc6aa
Rename variable to keep code consistent
2018-01-08 03:38:44 +01:00
Kit
ed0db95183
Find lowercased forms of ordinal words, where possible
2018-01-08 03:28:50 +01:00
Kit
9bc524982e
Find lowercased forms of numeric words
2018-01-08 03:25:08 +01:00
Søren Lind Kristiansen
62de5da1ff
Remove unused dummy variable
2018-01-05 09:57:24 +01:00
Søren Lind Kristiansen
10dab8eef8
Remove dummy variable from function calls
2018-01-05 09:37:05 +01:00
Søren Lind Kristiansen
7f0ab145e9
Don't pass CLI command name as dummy argument
2018-01-04 21:33:47 +01:00
Ines Montani
6a008233b5
Merge pull request #1795 from textioHQ/issue1758 ( resolves #1758 )
...
english tokenizer: handle "would've"
2018-01-04 02:43:39 +00:00
Kevin Humphreys
597df5bf83
add test
2018-01-03 13:00:05 -08:00
Kevin Humphreys
7918fa4ef9
handle would've
2018-01-03 12:25:48 -08:00
ines
2c656f90fb
Exit with 1 if incompatible models found (see #1714 )
2018-01-03 21:20:35 +01:00
ines
dacfaa2ca4
Ensure that download command exits properly ( resolves #1714 )
2018-01-03 21:03:36 +01:00
Søren Lind Kristiansen
a9ff6eadc9
Prefix dummy argument names with underscore
2018-01-03 20:48:12 +01:00
ines
1081e08efb
Fix formatting
2018-01-03 20:14:50 +01:00
ines
d8109964d6
Use --no-deps on model install
...
In general, it's nice for models to specify spaCy as a dependency. However, this tends to cause problems in conda environments, as pip will re-install spaCy and its dependencies (especially Thinc).
2018-01-03 17:40:37 +01:00
ines
319d754309
Fix overwriting of existing symlinks
...
Check for is_symlink() to also overwrite invalid and outdated symlinks. Also show better error message if link path exists but is not symlink (i.e. file or directory).
2018-01-03 17:39:36 +01:00
ines
8ba0dfd017
Make message on failed linking more clear
2018-01-03 17:38:09 +01:00
Søren Lind Kristiansen
d6327e8495
Fix handling case when vectors not specified
2018-01-03 12:20:49 +01:00
Søren Lind Kristiansen
bcc51d7d8b
Fix shifted positional arguments
2018-01-03 12:19:47 +01:00
zqhZY
f27859fa99
add ChineseDefaults class for pickling
2017-12-28 17:13:58 +08:00
Ines Montani
ff9fc945ab
Merge pull request #1749 from sorenlind/da_ud_tokenization
...
Tune Danish tokenizer to more closely match Universal Dependencies
2017-12-22 16:00:49 +00:00
ines
26f313dabc
Fix missing import
2017-12-22 16:21:44 +01:00
ines
8dc1c27841
Merge branch 'master' of https://github.com/explosion/spaCy
2017-12-22 16:01:00 +01:00
ines
b10ba848b8
xfail test that causes MemoryError on Python 2 on Windows
...
Need to investigate this further!
2017-12-22 16:00:58 +01:00
Søren Lind Kristiansen
bef735aef7
Fix Danish abbreviation 'm.h.t.'
2017-12-21 09:24:31 +01:00
Ines Montani
a3dd167d7f
Merge branch 'master' into da_ud_tokenization
2017-12-20 21:05:34 +00:00
Ines Montani
97f100f69f
Merge pull request #1742 from kimfalk/master
...
Two corrections in the da lan.
2017-12-20 21:02:00 +00:00
Ines Montani
d682a8803e
Merge pull request #1672 from cbilgili/master
...
Adds Turkish Lemmatization
2017-12-20 21:01:00 +00:00
Benjamin Peterson
9452134cd1
remove no-break spaces from Hindi example (fixes #1750)
2017-12-20 11:35:30 -08:00
Søren Lind Kristiansen
7a2f2f6f94
Fix formatting.
2017-12-20 18:37:37 +01:00
Søren Lind Kristiansen
15d13efafd
Tune Danish tokenizer to more closely match tokenization in Universal Dependencies.
2017-12-20 17:36:52 +01:00
Kim FalkJørgensen
648dc60755
Remove the incorrect exception 'm.h.t'
2017-12-20 10:02:39 +01:00
Kim FalkJørgensen
9c9f4ef84a
Fixing a translation error in examples.py
...
Adding an exception in the tokenizer_exceptions.py
2017-12-19 15:26:50 +01:00
ines
22dc744b48
Fix check for '@' in like_url (see #1715)
2017-12-16 13:48:43 +01:00
Ines Montani
9c1ee65268
Add regression test for #1698
2017-12-12 10:36:11 +01:00
Ines Montani
6455b574fc
Check for email address first
2017-12-12 10:25:13 +01:00
Bri-Will
d77361d76c
Update lex_attrs.py. Fix like_url from matching on e-mail
2017-12-11 14:13:28 -08:00
Søren Lind Kristiansen
5a9d377580
Remove abbreviation for positional plac argument
2017-12-11 11:08:29 +01:00
Isaac Sijaranamual
38021fbb00
Switch from python 3 only TemporaryDirectory to pytest's tmpdir
2017-12-11 00:16:04 +01:00
Isaac Sijaranamual
20ae0c459a
Fixes "Error saving model" #1622
2017-12-10 23:07:13 +01:00
Isaac Sijaranamual
568130ce7c
Adds regression test_issue1622
2017-12-10 23:00:48 +01:00
Isaac Sijaranamual
e188b61960
Make cli/train.py not eat exception
2017-12-10 22:53:08 +01:00
ines
020a7e5d52
Allow 'fine_grained' option in displaCy (see #1703)
...
Shows token.tag_ instead of token.pos_. Disabled by default, to not cause rendering issues for models with long fine-grained tags (e.g. merged morphological features).
2017-12-09 15:11:12 +01:00
Matthew Honnibal
3b17eb7c49
Merge branch 'master' of https://github.com/explosion/spaCy
2017-12-07 10:39:32 +01:00
Matthew Honnibal
a6b43729c6
Set version to v2.0.5
2017-12-07 10:39:14 +01:00
ines
5eaa61c2b8
Fix formatting
2017-12-07 10:23:09 +01:00
ines
24e80c51b8
Document init-model command
2017-12-07 10:14:37 +01:00
Matthew Honnibal
c91f451b0f
Fix imports and CLI in init-model
2017-12-07 10:03:07 +01:00
ines
82e80ff928
Rename model command to init_model and fix formatting
2017-12-07 09:59:23 +01:00
Ines Montani
2feeb428d6
Merge pull request #1646 from GreenRiverRUS/master
...
Added model command to create models from raw data
2017-12-07 08:54:26 +00:00
Matthew Honnibal
6373d2580d
Increment version to v2.0.5.dev0
2017-12-07 09:53:59 +01:00
Matthew Honnibal
36b47e3fa6
Fix (and test) vector pickling
2017-12-07 09:53:30 +01:00
Matthew Honnibal
05f41ff587
Set version to 2.0.4
2017-12-06 13:24:02 +01:00
Matthew Honnibal
04c38f7e87
Merge branch 'master' of https://github.com/explosion/spaCy
2017-12-06 12:15:52 +01:00
Matthew Honnibal
361944e512
If no rules are set, lemmatize by lookup
2017-12-06 12:12:11 +01:00
Matthew Honnibal
2ab0f2d186
Merge pull request #1664 from jimregan/italian-lemmatizer
...
BOM in Italian lemmatiser
2017-12-06 11:09:04 +01:00
Matthew Honnibal
3f247119d3
Merge pull request #1668 from sorenlind/da_morph
...
Add more Danish morph rules and clean up existing ones
2017-12-06 11:08:09 +01:00
Matthew Honnibal
b712de774e
Fix vectors pickling
2017-12-05 12:45:24 +01:00
Matthew Honnibal
04650e38c7
Set version to 2.0.4.dev0
2017-12-05 10:52:31 +01:00
Matthew Honnibal
07acb43a85
Merge branch 'master' of https://github.com/explosion/spaCy
2017-12-04 14:42:52 +01:00
Thomas Werkmeister
94eac75b7c
fix setup.py spacy req string for packaging
...
Requirement should be `spacy>=2.0.2` instead of `spacy2.0.2`
2017-12-03 04:16:28 -06:00
ines
f2ea6d4713
Add Dutch example sentences (see #1107)
2017-12-01 23:36:05 +01:00
Canbey Bilgili
abe098b255
Adds Turkish Lemmatization
2017-12-01 17:04:32 +03:00
Søren Lind Kristiansen
d86b537a38
Enable morph rules for Danish
2017-11-30 15:58:02 +01:00
Søren Lind Kristiansen
13a988adc3
Remove 'Number[psor]'
2017-11-30 15:55:04 +01:00
Søren Lind Kristiansen
dd6fde18a9
Add more Danish morph rules and clean up existing ones
2017-11-30 11:17:19 +01:00
Vadim Mazaev
495eacf470
Merge branch 'model_command'
2017-11-30 12:30:26 +03:00
Vadim Mazaev
4ba7ddf651
Bugfixes
2017-11-30 12:29:38 +03:00
Jim O'Regan
a4ecdeadd4
aha
2017-11-29 23:43:25 +00:00
Jim O'Regan
2c7a9215d7
Merge branch 'master' into animacy
2017-11-29 23:31:12 +00:00
Jim O'Regan
c3e6cee17a
use inan in polimorf tagset conversion
2017-11-29 23:15:47 +00:00
Jim O'Regan
b32575e78c
imports
2017-11-29 23:03:41 +00:00
Jim O'Regan
3696ce6a7b
add UD mapping
2017-11-29 22:59:19 +00:00
Jim O'Regan
f8e7082fe4
typo in "inan", add "nhum"
2017-11-29 22:40:47 +00:00
Matthew Honnibal
6bc0f4d29f
Merge pull request #1611 from fsonntag/master
...
Solving #1494
2017-11-29 23:11:23 +01:00
Matthew Honnibal
f9ed9ea529
Merge pull request #1624 from GreenRiverRUS/russian
...
Add support for Russian
2017-11-29 23:10:01 +01:00
Jim O'Regan
076a6fc60a
symbols
2017-11-29 20:11:20 +00:00
Jim O'Regan
834ba3c69a
(semi generated) Polimorf mapping
2017-11-29 20:08:24 +00:00
Jim O'Regan
ba6a23fd11
BOM in Italian lemmatiser
2017-11-29 17:40:07 +00:00
ines
a31506e060
Fix off-by-one error in nlp.add_pipe(after=name) (fixes #1654)
2017-11-28 20:37:55 +01:00
ines
b62739fbfe
Add regression test for #1654
2017-11-28 20:27:54 +01:00
ines
2e50dbb9d7
Simplify test
2017-11-28 20:27:27 +01:00
Felix Sonntag
724ae7dc55
Fixed issue of infix capturing prefixes
2017-11-28 17:17:12 +01:00
Ines Montani
9052643e2c
Merge pull request #1653 from sorenlind/da_example_typo
...
Fix typo
2017-11-27 14:47:42 +00:00
Søren Lind Kristiansen
5fe58b885b
Fix typo
2017-11-27 15:36:18 +01:00
Ines Montani
d52b1ab245
Add unicode_literals (hopefully fixes test failure on Python 2)
2017-11-27 15:16:54 +01:00
Søren Lind Kristiansen
0ffd27b0f6
Add several Danish alternative spellings
2017-11-27 13:35:41 +01:00
Ines Montani
6362024cf8
Merge pull request #1645 from GreenRiverRUS/fix_default_meta
...
Fixed spaCy version string in default meta
2017-11-27 11:58:02 +00:00
Vadim Mazaev
c332ffdde1
Added model command to create model from raw data:
...
words counts, brown clusters and vectors
2017-11-27 01:21:47 +03:00
Vadim Mazaev
59f03ab1d7
Fixed spacy version string in default meta
2017-11-26 23:02:07 +03:00
Vadim Mazaev
53e7c38637
Fixed tests that depend on pymorphy2
2017-11-26 21:04:44 +03:00
Vadim Mazaev
cacd859dcd
Added tag map, fixed test failures, added more exceptions
2017-11-26 20:54:48 +03:00
Ines Montani
a7bb8f1b42
Merge pull request #1637 from sorenlind/da_tokenization
...
Improve Danish tokenization
2017-11-26 15:41:38 +00:00
ines
c699aec089
Add offsets_from_biluo_tags helper and tests (see #1626)
2017-11-26 16:38:01 +01:00
Søren Lind Kristiansen
ef03e9ea53
Remove unused import.
2017-11-25 13:04:02 +01:00
Søren Lind Kristiansen
6aa241bcec
Add day of month tokenizer exceptions for Danish.
2017-11-24 15:03:24 +01:00
Søren Lind Kristiansen
0c276ed020
Add weekday abbreviations and remove ambiguous month abbreviations for Danish.
2017-11-24 14:43:29 +01:00
Søren Lind Kristiansen
056547e989
Add multiple tokenizer exceptions for Danish.
2017-11-24 11:51:26 +01:00
Søren Lind Kristiansen
8dc265ac0c
Add test for tokenization of 'i.' for Danish.
2017-11-24 11:29:37 +01:00
Søren Lind Kristiansen
ac8116510d
Fix tokenization of 'i.' for Danish.
2017-11-24 11:16:53 +01:00
Matthew Honnibal
79f11d4f85
Pickle vectors with vocab
2017-11-23 17:19:50 +01:00
Matthew Honnibal
f29c3925ee
Fix more efficient nonproj
2017-11-23 12:48:00 +00:00
Matthew Honnibal
e10e9ad2c5
Improve efficiency of Doc.to_array
2017-11-23 12:33:27 +00:00
Matthew Honnibal
2acc907d55
Improve profiling
2017-11-23 12:33:03 +00:00
Matthew Honnibal
fa62427300
Remove lookup-based lemmatization
2017-11-23 12:32:22 +00:00
Matthew Honnibal
fb26b2cb12
Use lookup lemmatizer if lemma unset
2017-11-23 12:31:58 +00:00
Matthew Honnibal
db5c714ad2
Improve efficiency of deprojectivization
2017-11-23 12:31:34 +00:00
Matthew Honnibal
8fec7268eb
Move string cleanup under a setting flag
2017-11-23 12:19:18 +00:00
Matthew Honnibal
5949777b12
Fix misleading multi-threading docstring
2017-11-23 12:18:59 +00:00
Matthew Honnibal
542e6fd4ea
Don't remove entries from specials
2017-11-23 12:17:42 +00:00
Matthew Honnibal
30ba81f881
Merge pull request #1576 from ligser/master
...
Actually reset caches in pipe [wip]
2017-11-23 12:54:48 +01:00
ines
c90fe92e15
Fix displaCy test
2017-11-22 05:04:39 +01:00
ines
a6f33ac27d
Fix displaCy test
2017-11-22 04:19:28 +01:00
ines
93b0be611a
Merge branch 'master' of https://github.com/explosion/spaCy
2017-11-22 00:28:55 +01:00
ines
60b4915569
Use .pos_ instead of .tags_ in displaCy by default (see #1006)
2017-11-22 00:28:52 +01:00
Vadim Mazaev
81314f8659
Fixed tokenizer: added char classes; added first lemmatizer and
...
tokenizer tests
2017-11-21 22:23:59 +03:00
Vadim Mazaev
52ee1f9bf9
Updated Russian Language, added lemmatizer, norm exceptions and lex
...
attrs
2017-11-21 11:44:46 +03:00
Burton DeWilde
a5c6869b2d
Fix bug where span.orth_ != span.text (see #1612)
2017-11-20 12:05:43 -06:00
Burton DeWilde
635792997c
Add regression test for #1612
2017-11-20 12:05:35 -06:00
ines
9a63e32f21
Add noqa to Python 2 compat variables of built-ins (see #1617)
2017-11-20 14:03:42 +01:00
ines
d70a64d78b
Fix syntax error and formatting in test (see #1617)
2017-11-20 14:01:25 +01:00
ines
17849dee4b
Fix French test (see #1617)
2017-11-20 13:59:59 +01:00
Felix Sonntag
33b0f86de3
Changed tokenizer to add infix when infix_start is offset
2017-11-19 16:32:10 +01:00
Felix Sonntag
8be3392302
Added regression test for #1494
2017-11-19 16:30:35 +01:00
Motoki Wu
a52e195a0a
Fixes Issue #1207 where `noun_chunks` of `Span` gives an error.
...
Make sure to reference `self.doc` when getting the noun chunks.
Same fix as 9750a0128c
2017-11-17 17:16:20 -08:00
Motoki Wu
b818afaa0e
Added failing test for Issue #1207.
...
The noun chunk iterator should work for `Doc` but not for `Span`.
2017-11-17 17:04:27 -08:00
Vadim Mazaev
a0739a06d4
Returned russian support from v1.10 branch
2017-11-17 17:06:15 +03:00
yuukos
7401152289
updated Russian tokenizer
...
moved the attempt to import pymorph into __init__
2017-11-17 17:04:50 +03:00
yuukos
3aad66cf00
added russian language support
2017-11-17 17:04:22 +03:00
ines
a3d4dd1a5d
Test adding of lots of pipeline components (see #1585)
...
Just to make sure that there's no error now or in the future with adding a large number of pipeline components.
2017-11-15 17:28:06 +01:00
Roman Domrachev
61d28d03e4
Try again to do selective remove cache
2017-11-15 19:11:12 +03:00
Roman Domrachev
b3311100c7
Merge branch 'master' of github.com:explosion/spaCy
2017-11-15 18:30:04 +03:00
Matthew Honnibal
b60d92aca8
Increment version
2017-11-15 16:14:46 +01:00
Roman Domrachev
505c6a2f2f
Completely cleanup tokenizer cache
...
The tokenizer cache can have keys other than strings.
This modification can slow down the tokenizer and needs to be measured.
2017-11-15 17:55:48 +03:00
Matthew Honnibal
cf0be62096
Increment version
2017-11-15 15:00:18 +01:00
ines
97a4f9362b
Merge branch 'master' of https://github.com/explosion/spaCy
2017-11-15 14:24:00 +01:00
ines
8e65247886
Fix lex.id if vectors is None
2017-11-15 14:23:58 +01:00
Matthew Honnibal
437ad1a852
Merge pull request #1570 from explosion/feature/fix-beam-leak
...
Fix memory leak in beam parser
2017-11-15 14:15:05 +01:00
Matthew Honnibal
2f169fdb0a
Set lex ID correctly for new tokens in Vocab
2017-11-15 13:58:03 +01:00
Matthew Honnibal
fe3c42a06b
Fix caching in tokenizer
2017-11-15 13:55:46 +01:00
Matthew Honnibal
8d692771f6
Improve profiling
2017-11-15 13:51:25 +01:00
Matthew Honnibal
b797dca977
Merge branch 'master' of https://github.com/explosion/spaCy
2017-11-15 13:11:43 +01:00
ines
c9d72de0fb
Add dummy serialization methods for Japanese and missing lang getter (resolves #1557)
2017-11-15 12:44:02 +01:00
Matthew Honnibal
d274d3a3b9
Let beam forward use minibatches
2017-11-15 00:51:42 +01:00
Matthew Honnibal
855872f872
Remove state hashing
2017-11-14 23:36:46 +01:00
Roman Domrachev
3e21680814
Use safer method to get string without hit
2017-11-14 22:58:46 +03:00
Roman Domrachev
a33d5a068d
Try to hold origin data instead of restore it
2017-11-14 22:40:03 +03:00
Roman Domrachev
91e2fa6561
Clean all caches
2017-11-14 21:15:04 +03:00
Roman Domrachev
4e378dc4a4
Remove all obsolete code and test only initial problem
2017-11-14 20:45:04 +03:00
Roman
47ce2347b0
Create test that fails when actual cleanup caused
2017-11-14 20:28:13 +03:00
Roman
caae77f72d
Update strings.pyx
2017-11-14 19:44:40 +03:00
Roman Domrachev
3d247d2bb8
Get back previous testcase
2017-11-14 18:01:37 +03:00
Roman Domrachev
870defa815
Swap keys in proper place
...
Remove unnecessary clear of the hits
2017-11-14 17:56:30 +03:00
Roman Domrachev
86ca434c93
Merge github.com:explosion/spaCy
2017-11-14 17:46:22 +03:00
Roman Domrachev
a2745b0e84
StringStore now actually cleaned
...
Do not lose docs in ref tracking
2017-11-14 17:45:50 +03:00
Matthew Honnibal
2512ea9eeb
Fix memory leak in beam parser
2017-11-14 02:11:40 +01:00
Matthew Honnibal
86ddf692a1
Fix bug in limit calculation on dev data
2017-11-14 01:37:10 +01:00
Ines Montani
ea6c85c67a
Merge pull request #1566 from MathiasDesch/master (resolves #1248)
...
Add exceptions to tokenizer and norm
2017-11-13 19:05:22 +01:00
Matthew Honnibal
1b348389bb
Merge branch 'master' of https://github.com/explosion/spaCy
2017-11-13 18:18:48 +01:00
Matthew Honnibal
ca73d0d8fe
Cleanup states after beam parsing, explicitly
2017-11-13 18:18:26 +01:00
Matthew Honnibal
63ef9a2e73
Remove __dealloc__ from ParserBeam
2017-11-13 18:18:08 +01:00
Mathias Deschamps
c0691b2ab4
Add tokenizer exceptions for ing verbs
...
Extend list of tokenizing exceptions introduced in 123810b
2017-11-13 17:46:05 +01:00
Mathias Deschamps
288298ead9
Add norm exception for ing verbs
...
Some -ing verbs are sometimes written with "in" or "in'". Make the NORM form correct.
2017-11-13 17:46:05 +01:00
Abhinav Sharma
59f5740ede
improved upon the list of included stop_words
2017-11-13 17:13:49 +05:30
Matthew Honnibal
6e641f46d4
Create a preprocess function that gets bigrams
2017-11-12 00:43:41 +01:00
Matthew Honnibal
c9251d79e3
Edit comment
2017-11-11 18:38:32 +01:00
Matthew Honnibal
dd1678eab3
Edit comment
2017-11-11 18:37:08 +01:00
Roman Domrachev
ee60a52ee7
Fix test imports and last batch cleanup
2017-11-11 11:32:16 +03:00
Roman Domrachev
4a6b094e09
Remove unused import
2017-11-11 03:13:05 +03:00
Roman Domrachev
3c600adf23
Try to fix StringStore clean up (see #1506)
2017-11-11 03:11:27 +03:00
ines
ee97fd3cb4
Add regression test for #1547
2017-11-11 00:14:03 +01:00
ines
2df27db671
Add unicode declaration
2017-11-11 00:13:56 +01:00
ines
35653bef3a
Add missing import (fixes #1546)
2017-11-10 19:05:18 +01:00
ines
4c5d2c80d5
Re-add python -m to commands, too brittle :( (see #1536)
2017-11-10 02:30:55 +01:00
ines
123810b6de
Add "lovin'" to tokenizer exceptions (see #1248)
2017-11-09 17:09:30 +01:00
ines
1c218397f6
Ensure path in Doc.to_disk/from_disk (resolves #1521)
...
Also add Doc serialization tests with both Path and string path options
2017-11-09 02:29:03 +01:00
Matthew Honnibal
49fd5a646f
Set version for 2.0.2 release
2017-11-08 22:39:39 +01:00
Matthew Honnibal
fba2dbddf7
Increment version
2017-11-08 22:19:08 +01:00
Matthew Honnibal
a5ea0fdf5a
Fix #1518: vocab.vectors.resize() didn't work
2017-11-08 22:18:37 +01:00
Matthew Honnibal
de45702bbe
Strip dev suffixes from version for compatibility check
2017-11-08 18:40:21 +01:00
Matthew Honnibal
51639214a1
Merge branch 'master' of https://github.com/explosion/spaCy
2017-11-08 18:04:33 +01:00
Matthew Honnibal
a2f980de4e
Exclude .devN versioning from compatibility check
2017-11-08 18:03:52 +01:00
Daniel Hershcovich
d7ae54ff44
Fix typo in message
2017-11-08 16:06:28 +02:00
Matthew Honnibal
4194bc5744
Xfail flakey serialization test
2017-11-08 13:55:13 +01:00
Matthew Honnibal
d5537e5516
Work on Windows test failure
2017-11-08 13:25:18 +01:00
Matthew Honnibal
c27c82d5f9
Fix serialization
2017-11-08 13:08:48 +01:00
Matthew Honnibal
1d5599cd28
Fix dtype
2017-11-08 12:18:32 +01:00
Matthew Honnibal
fa7fdd0d9b
Merge branch 'master' of https://github.com/explosion/spaCy
2017-11-08 12:11:31 +01:00
Matthew Honnibal
072ff38a01
Try to fix python3.5 serialization
2017-11-08 12:10:49 +01:00
Ines Montani
3a0f34d567
Merge pull request #1509 from abhi18av/patch-1
...
Create examples.py for Hindi language
2017-11-08 11:37:19 +01:00
Ines Montani
42b241ccd0
Update language code in usage example in comment
2017-11-08 11:36:38 +01:00
Matthew Honnibal
e262e8d942
Increment version to v2.0.2.dev0
2017-11-08 11:25:47 +01:00
Matthew Honnibal
a8b592783b
Make a dtype more specific, to fix a windows build
2017-11-08 11:24:35 +01:00
Abhinav Sharma
84edade82d
Create examples.py
...
Populated the file with the translations of English example sentences
2017-11-08 13:23:08 +05:30
Matthew Honnibal
d725aee4e2
Increment version to 2.0.1
2017-11-08 02:14:47 +01:00
Matthew Honnibal
8d6f68f1df
Increment version
2017-11-08 01:12:34 +01:00
ines
bcf42b8846
Fix typo
2017-11-08 01:06:37 +01:00
Matthew Honnibal
bbd2a3dee1
Fix title in about.py
2017-11-07 14:02:58 +01:00
Matthew Honnibal
4efaf9306c
Set version to spacy-nightly rc2
2017-11-07 13:27:26 +01:00
Matthew Honnibal
bf1ec2965f
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-11-07 13:20:29 +01:00
Matthew Honnibal
726f689da4
Fix missing import
2017-11-07 13:20:12 +01:00
ines
834f9c1aab
Update about.py
2017-11-07 13:11:33 +01:00
ines
a4662a31a9
Move model package templates to cli.package and update docs
2017-11-07 12:15:35 +01:00
ines
a09c096d3c
Get docs ready for v2.0.0
2017-11-07 12:00:43 +01:00
Matthew Honnibal
9a88e66103
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-11-07 02:00:06 +01:00
Matthew Honnibal
174abe4677
Increment to 2.0.0rc1
2017-11-07 01:59:46 +01:00
ines
42a0fbf291
Fix textcat simple train example
2017-11-07 01:25:54 +01:00
ines
8fb48b9b91
Update and document new util functions
2017-11-07 00:22:43 +01:00
Matthew Honnibal
1cab703bba
Move minibatch function to util
2017-11-06 23:45:36 +01:00
ines
5f43953536
Move test
2017-11-06 23:14:10 +01:00
Matthew Honnibal
dd90fe09f5
Remove extraneous label from textcat class
2017-11-06 22:09:02 +01:00
Matthew Honnibal
45e0617e61
Allow Language.update to take unicode text and dict objects
2017-11-06 22:07:38 +01:00
Matthew Honnibal
1831dbd065
Add test of simple textcat workflow
2017-11-06 22:04:29 +01:00
Matthew Honnibal
ffb9101f3f
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-11-06 19:20:41 +01:00
Matthew Honnibal
8fea512ac8
Don't set tensor in textcat
2017-11-06 19:20:14 +01:00
ines
acb9bdb852
Fix PRON_LEMMA imports
2017-11-06 17:41:53 +01:00
Matthew Honnibal
7d46793dd7
Add PRON_LEMMA to spacy.symbols
2017-11-06 17:38:25 +01:00
Matthew Honnibal
2f7e9f390d
Make test less flakey
2017-11-06 17:34:50 +01:00
Matthew Honnibal
407b08017e
Make test less flakey
2017-11-06 17:31:40 +01:00
Matthew Honnibal
102f797933
Fix lemma ordering in test
2017-11-06 17:02:17 +01:00
Matthew Honnibal
75e1618ec3
Fix lemma clobbering
2017-11-06 16:56:19 +01:00
Matthew Honnibal
6fdffd7246
Merge pull request #1497 from explosion/feature/improve-optimizer-handling
...
💫 Improve optimizer handling
2017-11-06 16:41:15 +01:00
Matthew Honnibal
8e6795437b
Set release=True
2017-11-06 16:39:32 +01:00
Matthew Honnibal
5c85bf3791
Fix missing import
2017-11-06 15:06:27 +01:00
Matthew Honnibal
25859dbb48
Return optimizer from begin_training, creating if necessary
2017-11-06 14:26:49 +01:00
Matthew Honnibal
465adfee94
Remove unused resume_training method, and pass optimizer through
2017-11-06 14:26:00 +01:00
Matthew Honnibal
13336a6197
Fix Adam import
2017-11-06 14:25:37 +01:00
Matthew Honnibal
2eb11d60f2
Add function create_default_optimizer to spacy._ml
2017-11-06 14:11:59 +01:00
Matthew Honnibal
31babe3c3f
Fix non-clobbering lemmatization
2017-11-06 12:36:05 +01:00
Matthew Honnibal
63c6ae4191
Fix lemmatizer test
2017-11-06 11:57:06 +01:00
Matthew Honnibal
a86a0181b5
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-11-05 22:19:10 +01:00
Matthew Honnibal
134d3b8143
Fix morphology
2017-11-05 22:18:22 +01:00
ines
08d1cf850a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-11-05 21:41:58 +01:00
ines
baa231745c
Fix Dutch tag map
2017-11-05 21:41:50 +01:00
Matthew Honnibal
46e62ad747
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-11-05 19:40:00 +01:00
Matthew Honnibal
bb25cb0f76
Avoid clobbering preset lemmas
2017-11-05 19:39:38 +01:00
ines
507ecb67af
Fix Spanish tag map
2017-11-05 19:23:34 +01:00
Matthew Honnibal
320008352b
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-11-05 18:46:15 +01:00
Matthew Honnibal
38109a0e4a
Register SentenceSegmenter in Language.factories
2017-11-05 18:45:57 +01:00
ines
975e1042ff
Fix Italian tag map
2017-11-05 18:34:09 +01:00
ines
6b2d6e4937
Fix Portuguese tag map
2017-11-05 18:31:00 +01:00
ines
fa2687fded
Fix Dutch tag map
2017-11-05 17:57:59 +01:00
ines
fb8990d916
Fix Spanish tag map
2017-11-05 17:48:46 +01:00
ines
9d13288f73
Fix French tag map
2017-11-05 17:47:59 +01:00
ines
54579805c5
Fix French tag map
2017-11-05 17:44:05 +01:00
Matthew Honnibal
2b35bb76ad
Fix tensorizer on GPU
2017-11-05 15:34:40 +01:00
Matthew Honnibal
6e5181bbaa
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-11-05 15:33:56 +01:00
Matthew Honnibal
6f438b17c1
Increment version to v2.0.0a19
2017-11-05 14:43:36 +01:00
Matthew Honnibal
225cc249c9
Pass string path to numpy, to fix #1479
2017-11-05 14:42:46 +01:00
Matthew Honnibal
00435d8f0c
Add extra beam parsing test
2017-11-05 14:39:57 +01:00
Matthew Honnibal
e777ea25bb
Merge pull request #1492 from uwol/develop
...
TextCategorizer return parameter fix
2017-11-05 14:13:04 +01:00
Matthew Honnibal
0d4bd6414e
Fix Italian tag map
2017-11-05 14:11:03 +01:00
ines
ef597622a6
Add Portuguese tag map
2017-11-05 13:58:34 +01:00
ines
793c62dfda
Add Dutch tag map
2017-11-05 13:48:07 +01:00
ines
f7485a09c8
Fix Italian tag map
2017-11-05 13:12:58 +01:00
uwol
a2162b8908
tensorizer return parameter fix
2017-11-05 12:25:10 +01:00
ines
0a27afbf86
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-11-04 23:32:52 +01:00
ines
3cef901834
Add tag map for French and Italian
2017-11-04 23:32:51 +01:00
Matthew Honnibal
cfb83c231c
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-11-04 23:08:19 +01:00
Matthew Honnibal
d185927998
Undo harmful pickling hacks on Language class
2017-11-04 23:07:03 +01:00
ines
6c15aafebd
Fix formatting
2017-11-04 23:07:02 +01:00