Commit Graph

6130 Commits

Author SHA1 Message Date
Matthew Honnibal
a6ae1ee6f7 Don't modify Token in global scope 2018-05-09 00:43:00 +02:00
Matthew Honnibal
f94f721f40 Avoid importing fused token symbol in ud-run-test, untl that's added 2018-05-09 00:28:03 +02:00
Matthew Honnibal
659ec5b975 Avoid importing fused token symbol in ud-run-test, untl that's added 2018-05-08 19:40:33 +02:00
Matthew Honnibal
4cb0494bef Bug fixes to beam search after refactor 2018-05-08 13:48:50 +02:00
Matthew Honnibal
5ed71973b3 Add a keyword argument sink to GoldParse 2018-05-08 13:48:32 +02:00
Matthew Honnibal
8cfe326f87 Avoid relying on final gold check in beam search 2018-05-08 13:48:19 +02:00
Matthew Honnibal
fc4dd49b77 Support oracle segmentation in ud-train CLI command 2018-05-08 13:47:45 +02:00
Matthew Honnibal
c49e44349a Fix beam parsing 2018-05-08 02:53:24 +02:00
Matthew Honnibal
99649d114d Fix parser 2018-05-08 00:27:26 +02:00
Matthew Honnibal
8a82367a9d Fix beam search after refactor 2018-05-08 00:20:33 +02:00
Matthew Honnibal
5a0f26be0c Readd beam search after refactor 2018-05-08 00:19:52 +02:00
ines
7a3599c21a Fix formatting and consistency 2018-05-07 23:02:11 +02:00
Matthew Honnibal
36b2c9bdd5 Fix refactored parser 2018-05-07 18:58:09 +02:00
Matthew Honnibal
bde3be1ad1 Fix refactored parser 2018-05-07 18:31:04 +02:00
Matthew Honnibal
01c4e13b02 Update test 2018-05-07 16:59:52 +02:00
Matthew Honnibal
f6cdafc00e Fix refactored parser 2018-05-07 16:59:38 +02:00
Matthew Honnibal
f56bd4736b Improve dynamic oracle when values are missing in parse 2018-05-07 15:53:18 +02:00
Matthew Honnibal
eddc0e0c74 Set gold.sent_starts in ud_train 2018-05-07 15:52:47 +02:00
Matthew Honnibal
bf19f22340 Allow gold.sent_starts to be set from Python 2018-05-07 15:51:34 +02:00
Matthew Honnibal
7f163442e6 Work on refactoring greedy parser 2018-05-07 15:45:52 +02:00
Douglas Knox
9b49a40f4e Test and fix for Issue #2219 (#2272)
Test and fix for Issue #2219: Token.similarity() failed if single letter
2018-05-03 18:40:46 +02:00
Paul O'Leary McCann
bd72fbf09c Port Japanese mecab tokenizer from v1 (#2036)
* Port Japanese mecab tokenizer from v1

This brings the Mecab-based Japanese tokenization introduced in #1246 to
spaCy v2. There isn't a JapaneseTagger implementation yet, but POS tag
information from Mecab is stored in a token extension. A tag map is also
included.

As a reminder, Mecab is required because Universal Dependencies are
based on Unidic tags, and Janome doesn't support Unidic.

Things to check:

1. Is this the right way to use a token extension?

2. What's the right way to implement a JapaneseTagger? The approach in
 #1246 relied on `tag_from_strings` which is just gone now. I guess the
best thing is to just try training spaCy's default Tagger?

-POLM

* Add tagging/make_doc and tests
2018-05-03 18:38:26 +02:00
G.Pruvost
cc8e804648 #2211 - Support for ssl certs config on download command (#2212)
* Add support for SSL/Certs customization on download CLI

* Add a note on SSL options for the 'download' CLI in the README

* Add contributor agreement
2018-05-03 18:37:02 +02:00
Jens Dahl Møllerhøj
b9290397fb rename SP to _SP (#2289) 2018-05-03 18:33:49 +02:00
Matthew Honnibal
a8e70a4187 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-05-03 14:02:10 +02:00
Matthew Honnibal
c0e596283b Set version to 2.1.0a0 2018-05-03 14:00:11 +02:00
Matthew Honnibal
8cd06cc763 Try to fix root-outside-sentence bug 2018-05-02 14:39:48 +00:00
Matthew Honnibal
acebd01033 Set cildren from heads in finalize doc 2018-05-02 14:19:22 +00:00
Matthew Honnibal
569440a6db Dont normalize gradient by batch size 2018-05-02 08:42:10 +02:00
Matthew Honnibal
281e29cbcd Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-05-02 01:36:23 +00:00
Matthew Honnibal
2338e8c7fc Update develop from master 2018-05-02 01:36:12 +00:00
Matthew Honnibal
9d147e12c4 Merge remote-tracking branch 'origin/master' into develop 2018-05-01 18:18:51 +02:00
Matthew Honnibal
6d0fe67b72 Constrain subtok label to adjacent tokens 2018-05-01 17:34:27 +02:00
Matthew Honnibal
8f21953fc5 Constrain subtok to adjacent words 2018-05-01 17:29:00 +02:00
Matthew Honnibal
b43bfd3524 Fix arc-eager oracle tests 2018-05-01 16:16:14 +02:00
Matthew Honnibal
31ed64e9b0 Fix textcat test 2018-05-01 15:18:39 +02:00
Matthew Honnibal
548bdff943 Update default Adam settings 2018-05-01 15:18:20 +02:00
Matthew Honnibal
adbb1f7533 Add better arc-eager oracle tests 2018-05-01 15:14:55 +02:00
Matthew Honnibal
697bcaa34f Add some methods to ArcEager that make testing easier 2018-05-01 15:13:14 +02:00
Mr Roboto
6f5ccda19c Addresses Issue #2228 - Deserialization fails when using tensor=False or sentiment=False (#2230)
* Fixes issue #2228

* Adds a new contributor
2018-05-01 13:40:22 +02:00
Matthew Honnibal
d44bb45c72 Fix scoring if tokenization changes 2018-05-01 01:33:20 +02:00
Matthew Honnibal
2b26c007cd Revert "Disable batch size compounding in ud-train"
This reverts commit 8a120fb455.
2018-04-29 14:09:02 +00:00
Matthew Honnibal
723b328062 Add script to run UD test 2018-04-29 15:50:25 +02:00
Matthew Honnibal
17af6aa3a4 Update ud_train script 2018-04-29 15:49:32 +02:00
Matthew Honnibal
5de8a36537 Fix arc_eager is_nonproj_tree 2018-04-29 15:49:11 +02:00
Matthew Honnibal
5260268f70 Fix textcat after merge 2018-04-29 15:48:53 +02:00
Matthew Honnibal
ad3d56c3ba Fix compile error in matcher 2018-04-29 15:48:34 +02:00
Matthew Honnibal
a8bc947fd4 Fix Token.set_extension 2018-04-29 15:48:19 +02:00
Matthew Honnibal
2c4a6d66fa Merge master into develop. Big merge, many conflicts -- need to review 2018-04-29 14:49:26 +02:00
ines
3c80f69ff5 Return data in cli.info and add silent option (resolves #2196) 2018-04-29 01:59:44 +02:00
ines
1c6d77610c Add remove_extension method on Doc, Token and Span (closes #2242) 2018-04-28 23:33:09 +02:00
ines
abdb853ebf Simplify underscore tests 2018-04-28 23:30:33 +02:00
ines
6fb6371670 Add collapse_phrases option to displacy (closes #2266) 2018-04-28 23:06:50 +02:00
Robin Linderborg
1f9904ef12 fixes #2238 (#2241)
* Remove erroneous lemma lookup år > åra in Swedish

* Add contributors agreement

* Add contrib agreement to correct directory

* Revert change to CONTRIBUTOR_AGREEMENT
2018-04-28 14:55:22 +02:00
Robin Linderborg
d01f503b54 Remove incorrect lemma lookup gäng->gänga (#2252)
* Remove incorrect lemma lookup gäng->gänga
In modern Swedish, "gäng" is mostly associated with "gang" or "group of people". The removed lemma lookup lemmatized it to the verb "thread".

* Add contrib agreement to correct directory

* Revert change to CONTRIBUTOR_AGREEMENT
2018-04-28 14:54:41 +02:00
Suraj Krishnan Rajan
69d041148f Implement Fast-Text vectors with subword features 2018-04-21 01:34:14 +05:30
ines
686225eadd Fix Spanish noun_chunks (resolves #2210)
Make sure 'NP' label is added to StringStore and move noun_bounds helper into a closure to allow reusing label sets
2018-04-18 18:44:01 -04:00
ines
9632595fb4 Use correct, non-deprecated merge syntax (resolves #2226) 2018-04-18 18:28:28 -04:00
Suraj Rajan
5957f15227 Fixed typos for #2222,#2223 (#2233) (closes #2222, closes #2223) 2018-04-18 14:55:26 -07:00
Matthew Honnibal
97851d2c4e Increment version to v2.0.12.dev0 2018-04-10 22:20:16 +02:00
Matthew Honnibal
ed39c75a92 Merge branch 'master' of https://github.com/explosion/spaCy 2018-04-10 22:19:40 +02:00
Matthew Honnibal
3836199a83 Fix loading of models when custom vectors are added 2018-04-10 22:19:20 +02:00
ines
0299d5fac8 Update argument annotations and formatting 2018-04-10 21:45:11 +02:00
ines
49b1e48bf5 Fix syntax error 2018-04-10 21:44:59 +02:00
ines
70052e46e9 Fix formatting [ci skip] 2018-04-10 21:42:46 +02:00
Matthew Honnibal
0ddb152be0 Improve error message when reading vectors 2018-04-10 21:26:50 +02:00
Matthew Honnibal
db50ac524e Support zipped vector files in init-model 2018-04-10 21:21:00 +02:00
ines
270fcfd925 Fix typo in package command message (closes #2200) 2018-04-10 19:14:31 +02:00
ines
24d8bf348d Revert "Add support for .zip to init_model"
This reverts commit 7ee880a0ad.
2018-04-10 19:08:06 +02:00
Matthew Honnibal
7ee880a0ad Add support for .zip to init_model 2018-04-10 14:30:04 +00:00
ines
5ecb274764 Fix indentation error and set Doc.is_tagged correctly 2018-04-10 16:14:52 +02:00
ines
987ee27af7 Return Doc if noun chunks merger component if Doc is not parsed 2018-04-09 14:51:02 +02:00
Xiaoquan Kong
e2f13ec722 bugfix: Doc.noun_chunks call Doc.noun_chunks_iterator without checking (closes #2194) 2018-04-08 23:44:05 +02:00
Jens Dahl Møllerhøj
e5055e3cf6 Add Danish lemmatizer (#2184)
* add danish lemmatizer

* fill contributor agreement
2018-04-07 19:07:28 +02:00
ines
bccbf538ef Revert "Check if spaCy has compiled correctly and show error message"
This reverts commit 3463ded7cf.
2018-04-06 15:49:44 +02:00
ines
fb4eda6616 Merge branch 'master' of https://github.com/explosion/spaCy 2018-04-06 00:38:48 +02:00
Matthew Honnibal
0c7fab4443 Set version to 2.0.11 2018-04-04 11:19:11 +02:00
Matthew Honnibal
a350be0601 Fix vector-name loading fix 2018-04-04 01:31:25 +02:00
Matthew Honnibal
21047bde52 Fix syntax error in italian lemmatizer 2018-04-03 23:13:22 +02:00
Matthew Honnibal
81f4005f3d Fix loading models with pretrained vectors 2018-04-03 23:11:48 +02:00
ines
3463ded7cf Check if spaCy has compiled correctly and show error message 2018-04-03 22:18:47 +02:00
Matthew Honnibal
96b612873b Add hyper-parameter to control whether parser makes a beam update 2018-04-03 22:02:56 +02:00
ines
e5f47cd82d Update errors 2018-04-03 21:40:29 +02:00
Matthew Honnibal
f7e6313b43 Increment version to v2.0.11.dev0 2018-04-03 20:58:47 +02:00
ines
10462816bc Fix tests for Python 2 2018-04-03 18:51:31 +02:00
ines
62b4b527d7 Don't raise error if set_extension has getter and setter (closes #2177)
Improve error messages, raise error if setter is specified without a getter and compare against _unset to allow default=None. Also add more tests.
2018-04-03 18:30:17 +02:00
ines
ee3082ad29 Fix whitespace 2018-04-03 18:29:53 +02:00
Ines Montani
3141e04822
💫 New system for error messages and warnings (#2163)
* Add spacy.errors module

* Update deprecation and user warnings

* Replace errors and asserts with new error message system

* Remove redundant asserts

* Fix whitespace

* Add messages for print/util.prints statements

* Fix typo

* Fix typos

* Move CLI messages to spacy.cli._messages

* Add decorator to display error code with message

An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc.

* Remove unused link in spacy.about

* Update errors for invalid pipeline components

* Improve error for unknown factories

* Add displaCy warnings

* Update formatting consistency

* Move error message to spacy.errors

* Update errors and check if doc returned by component is None
2018-04-03 15:50:31 +02:00
Matthew Honnibal
abf8b16d71
Add doc.retokenize() context manager (#2172)
This patch takes a step towards #1487 by introducing the
doc.retokenize() context manager, to handle merging spans, and soon
splitting tokens.

The idea is to do merging and splitting like this:

with doc.retokenize() as retokenizer:
    for start, end, label in matches:
        retokenizer.merge(doc[start : end], attrs={'ent_type': label})

The retokenizer accumulates the merge requests, and applies them
together at the end of the block. This will allow retokenization to be
more efficient, and much less error prone.

A retokenizer.split() function will then be added, to handle splitting a
single token into multiple tokens. These methods take `Span` and `Token`
objects; if the user wants to go directly from offsets, they can append
to the .merges and .splits lists on the retokenizer.

The doc.merge() method's behaviour remains unchanged, so this patch
should be 100% backwards incompatible (modulo bugs). Internally,
doc.merge() fixes up the arguments (to handle the various deprecated styles),
opens the retokenizer, and makes the single merge.

We can later start making deprecation warnings on direct calls to doc.merge(),
to migrate people to use of the retokenize context manager.
2018-04-03 14:10:35 +02:00
Matthew Honnibal
8a120fb455 Disable batch size compounding in ud-train 2018-04-01 08:45:00 +00:00
Matthew Honnibal
98165e43a7 Sometimes update beam with greedy oracle 2018-04-01 08:44:35 +00:00
Suraj Rajan
1cdbb7c97c [2032] - Changed python set to cpp stl set (#2170)
Changed python set to cpp stl set #2032 

## Description

Changed python set to cpp stl set. CPP stl set works better due to the logarithmic run time of its methods. Finding minimum in the cpp set is done in constant time as opposed to the worst case linear runtime of python set. Operations such as find,count,insert,delete are also done in either constant and logarithmic time thus making cpp set a better option to manage vectors.
Reference : http://www.cplusplus.com/reference/set/set/

### Types of change
Enhancement for `Vectors` for faster initialising of word vectors(fasttext)
2018-03-31 13:28:25 +02:00
Matthew Honnibal
f3b7c5e537 Fix syntax error 2018-03-29 21:50:32 +02:00
Matthew Honnibal
23afa6429f Add input length error, to address #1826 2018-03-29 21:45:26 +02:00
Ines Montani
a609a1ca29
Merge pull request #2152 from explosion/feature/tidy-up-dependencies
💫 Tidy up dependencies
2018-03-29 14:35:09 +02:00
Viet Trung Tran
ea2af94cd9 Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer (#2155)
* support for Vietnamese

* Contributor Agreement for adding Vietnamese support on spaCy
2018-03-29 12:19:51 +02:00
ines
e6979bdbbd Merge branch 'feature/tidy-up-dependencies' of https://github.com/explosion/spaCy into feature/tidy-up-dependencies 2018-03-29 00:19:37 +02:00
ines
83146458a2 Fix urllib for Python 3 2018-03-29 00:19:33 +02:00
Matthew Honnibal
8308bbc617 Get msgpack and msgpack_numpy via Thinc, to avoid potential version conflicts 2018-03-29 00:14:55 +02:00
Matthew Honnibal
b5098079d8 Fix error on urllib 2018-03-29 00:08:16 +02:00
Ines Montani
0de599b16b
Merge pull request #2159 from explosion/feature/fix-merged-entity-iob (resolves #1554, resolves #1752)
💫 Fix token.ent_iob after doc.merge(), and ensure consistency in doc.ents
2018-03-28 23:10:00 +02:00
Ines Montani
98e9cda677
Merge pull request #2158 from explosion/feature/fix-multiple-vectors (resolves #1660)
💫 Fix loading of multiple vector models
2018-03-28 23:08:24 +02:00
Matthew Honnibal
a7c5ae2beb Avoid forcing a name on empty vectors, and remove print statement 2018-03-28 21:08:58 +02:00
ines
3eb67bbe4b Allow entity types with dashes (resolves #1967) 2018-03-28 20:51:26 +02:00
Matthew Honnibal
cf5fcf0546 Update serialization test 2018-03-28 20:12:53 +02:00
Matthew Honnibal
4555e3e251 Dont assume pretrained_vectors cfg set in build_tagger 2018-03-28 20:12:45 +02:00
Matthew Honnibal
0b375d50c8 Fix ent_iob tags in doc.merge to avoid inconsistent sequences 2018-03-28 18:39:03 +02:00
Matthew Honnibal
95fa89c4b8 Update doc.ents test 2018-03-28 18:39:03 +02:00
Matthew Honnibal
e807f88410 Resolve merge when cherry-picking ent iob patches from develop 2018-03-28 18:38:13 +02:00
Matthew Honnibal
99fbc7db33 Improve error message when entity sequence is inconsistent 2018-03-28 18:36:53 +02:00
Matthew Honnibal
cbd2794be0 Add test for ent_iob during span merge 2018-03-28 18:36:53 +02:00
Matthew Honnibal
f8dd905a24 Warn and fallback if vectors have no name 2018-03-28 18:24:53 +02:00
Matthew Honnibal
fd9e259414 Add test for #1660 2018-03-28 18:22:51 +02:00
Matthew Honnibal
bc4afa9881 Remove print statement 2018-03-28 17:48:37 +02:00
Matthew Honnibal
79dc241caa Set pretrained_vectors in parser cfg 2018-03-28 17:35:07 +02:00
Matthew Honnibal
17c3e7efa2 Add message noting vectors 2018-03-28 16:33:43 +02:00
Matthew Honnibal
9bf6e93b3e Set pretrained_vectors in begin_training 2018-03-28 16:32:41 +02:00
Matthew Honnibal
95a9615221 Fix loading of multiple pre-trained vectors
This patch addresses #1660, which was caused by keying all pre-trained
vectors with the same ID when telling Thinc how to refer to them. This
meant that if multiple models were loaded that had pre-trained vectors,
errors or incorrect behaviour resulted.

The vectors class now includes a .name attribute, which defaults to:
{nlp.meta['lang']_nlp.meta['name']}.vectors
The vectors name is set in the cfg of the pipeline components under the
key pretrained_vectors. This replaces the previous cfg key
pretrained_dims.

In order to make existing models compatible with this change, we check
for the pretrained_dims key when loading models in from_disk and
from_bytes, and add the cfg key pretrained_vectors if we find it.
2018-03-28 16:02:59 +02:00
ines
7fbc9e5874 Replace requests with urllib 2018-03-28 12:46:07 +02:00
ines
da1f200362 Add compat helpers for urllib 2018-03-28 12:45:53 +02:00
ines
ac88c72c9a Fix ftfy workaround and remove old import 2018-03-28 12:14:28 +02:00
ines
ce6071ca89 Remove ftfy dependency and update docs 2018-03-28 12:09:42 +02:00
Matthew Honnibal
070b6c6495 Remove dependency on ftfy 2018-03-28 12:07:02 +02:00
ines
6d2c85f428 Drop six and related hacks as a dependency 2018-03-28 10:45:25 +02:00
ines
9e83513004 Add position of invalid token to error message 2018-03-27 23:56:59 +02:00
ines
11c4735ccf Fix issue in Italian lemmatizer data (resolves #2050) 2018-03-27 23:55:22 +02:00
Matthew Honnibal
6a961928b2 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-03-27 21:01:48 +00:00
Matthew Honnibal
b7136cb094 Support zipped vector files in init-model 2018-03-27 21:01:18 +00:00
ines
693971dd8f Improve error message if token text is empty string (see #2101) 2018-03-27 22:25:40 +02:00
ines
0c829e6605 Fix whitespace 2018-03-27 22:20:59 +02:00
Matthew Honnibal
de9fd091ac Fix #2014: token.pos_ not writeable 2018-03-27 21:21:11 +02:00
Matthew Honnibal
18da89e04c Handle non-callable gold_tuples in parser begin_training 2018-03-27 21:08:41 +02:00
Matthew Honnibal
1f7229f40f Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit c9ba3d3c2d, reversing
changes made to 92c26a35d4.
2018-03-27 19:23:02 +02:00
Matthew Honnibal
8b7a74570f Revert "Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop""
This reverts commit f41e626844.
2018-03-27 19:22:52 +02:00
Matthew Honnibal
f41e626844 Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit c9ba3d3c2d, reversing
changes made to f57bfbccdc.
2018-03-27 19:22:25 +02:00
Matthew Honnibal
c9ba3d3c2d Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-03-27 18:59:08 +02:00
Matthew Honnibal
92c26a35d4 Update get_cuda_stream 2018-03-27 16:42:00 +00:00
Matthew Honnibal
f57bfbccdc Fix non-projective label filtering 2018-03-27 13:41:33 +02:00
Matthew Honnibal
d2118792e7 Merge changes from master 2018-03-27 13:38:41 +02:00
Matthew Honnibal
d4680e4d83 Merge branch 'master' of https://github.com/explosion/spaCy 2018-03-27 13:36:37 +02:00
Matthew Honnibal
63a267b34d Fix #2073: Token.set_extension not working 2018-03-27 13:36:20 +02:00
Matthew Honnibal
25280b7013 Try to make sum_state_features faster 2018-03-27 10:08:38 +00:00
Matthew Honnibal
987e1533a4 Use 8 features in parser 2018-03-27 10:08:12 +00:00
Matthew Honnibal
8bbd26579c Support GPU in UD training script 2018-03-27 09:53:35 +00:00
Matthew Honnibal
dd54511c4f Pass data as a function in begin_training methods 2018-03-27 09:39:59 +00:00
Matthew Honnibal
d9ebd78e11 Change default sizes in parser 2018-03-26 17:22:18 +02:00
Matthew Honnibal
a3d0cb15d3 Fix ent_iob tags in doc.merge to avoid inconsistent sequences 2018-03-26 07:16:06 +02:00
Matthew Honnibal
7d4687162f Update doc.ents test 2018-03-26 07:14:35 +02:00
Matthew Honnibal
514d89a3ae Set missing label for non-specified entities when setting doc.ents 2018-03-26 07:14:16 +02:00
Matthew Honnibal
54d7a1c916 Improve error message when entity sequence is inconsistent 2018-03-26 07:13:34 +02:00
Matthew Honnibal
938436455a Add test for ent_iob during span merge 2018-03-25 22:16:19 +02:00
Matthew Honnibal
8e08c378fe Fix entity IOB and tag in span merging 2018-03-25 22:16:01 +02:00
Matthew Honnibal
5430c43298 Set about to spacy-nightly 2018-03-25 19:30:14 +02:00
Ines Montani
68226109f4
Merge pull request #2142 from jimregan/polish-more-tokens
more exceptions
2018-03-24 19:06:44 +01:00
Matthew Honnibal
d566e673bf Set version to v2.0.10 2018-03-24 18:09:03 +01:00
Matthew Honnibal
0d3bf0d4eb Merge branch 'master' of https://github.com/explosion/spaCy 2018-03-24 17:31:49 +01:00
dejanmarich
ccd1c04c63 Update stop_words.py
Added more words
2018-03-24 17:31:24 +01:00
ines
f1446b0257 Port over Turkish changes 2018-03-24 17:31:07 +01:00
DuyguA
cd604878a4 quick typo fix 2018-03-24 17:26:35 +01:00
Matthew Honnibal
406548b976 Support .gz and .tar.gz files in spacy init-model 2018-03-24 17:18:32 +01:00
Jim O'Regan
efe037e8be more exceptions 2018-03-24 00:05:27 +00:00
Ines Montani
719037cf20
Update formatting and add missing commas 2018-03-23 22:18:20 +01:00
Otto Sulin
266efc2018 Added Finnish examples 2018-03-23 22:58:52 +02:00
Otto Sulin
1940e54602 Added Finnish numbers 2018-03-23 22:33:08 +02:00
Otto Sulin
4ec3f19e2b fixed stop words -> to-do lex_attrs.py 2018-03-23 22:18:17 +02:00
Matthew Honnibal
85717f570c Merge branch 'master' of https://github.com/explosion/spaCy 2018-03-23 20:30:42 +01:00
Matthew Honnibal
8902754f0b Fix vector loading for ud_train 2018-03-23 20:30:00 +01:00
Xiaoquan Kong
a71b99d7ff bugfix for global-variable-change-in-runtime related issue (#2135)
* Bugfix: setting pollution from spacy/cli/ud_train.py to whole package

* Add contributor agreement of howl-anderson
2018-03-23 11:36:38 +01:00
Matthew Honnibal
044397e269 Support .gz and .tar.gz files in spacy init-model 2018-03-21 14:33:23 +01:00
Matthew Honnibal
49fbe2dfee Use thinc.openblas in spacy.syntax.nn_parser 2018-03-20 02:22:09 +01:00
DuyguA
f708d7443b added contractions to stopwords #2020 2018-03-19 14:06:39 +01:00
Matthew Honnibal
bede11b67c
Improve label management in parser and NER (#2108)
This patch does a few smallish things that tighten up the training workflow a little, and allow memory use during training to be reduced by letting the GoldCorpus stream data properly.

Previously, the parser and entity recognizer read and saved labels as lists, with extra labels noted separately. Lists were used becaue ordering is very important, to ensure that the label-to-class mapping is stable.

We now manage labels as nested dictionaries, first keyed by the action, and then keyed by the label. Values are frequencies. The trick is, how do we save new labels? We need to make sure we iterate over these in the same order they're added. Otherwise, we'll get different class IDs, and the model's predictions won't make sense.

To allow stable sorting, we map the new labels to negative values. If we have two new labels, they'll be noted as having "frequency" -1 and -2. The next new label will then have "frequency" -3. When we sort by (frequency, label), we then get a stable sort.

Storing frequencies then allows us to make the next nice improvement. Previously we had to iterate over the whole training set, to pre-process it for the deprojectivisation. This led to storing the whole training set in memory. This was most of the required memory during training.

To prevent this, we now store the frequencies as we stream in the data, and deprojectivize as we go. Once we've built the frequencies, we can then apply a frequency cut-off when we decide how many classes to make.

Finally, to allow proper data streaming, we also have to have some way of shuffling the iterator. This is awkward if the training files have multiple documents in them. To solve this, the GoldCorpus class now writes the training data to disk in msgpack files, one per document. We can then shuffle the data by shuffling the paths.

This is a squash merge, as I made a lot of very small commits. Individual commit messages below.

* Simplify label management for TransitionSystem and its subclasses

* Fix serialization for new label handling format in parser

* Simplify and improve GoldCorpus class. Reduce memory use, write to temp dir

* Set actions in transition system

* Require thinc 6.11.1.dev4

* Fix error in parser init

* Add unicode declaration

* Fix unicode declaration

* Update textcat test

* Try to get model training on less memory

* Print json loc for now

* Try rapidjson to reduce memory use

* Remove rapidjson requirement

* Try rapidjson for reduced mem usage

* Handle None heads when projectivising

* Stream json docs

* Fix train script

* Handle projectivity in GoldParse

* Fix projectivity handling

* Add minibatch_by_words util from ud_train

* Minibatch by number of words in spacy.cli.train

* Move minibatch_by_words util to spacy.util

* Fix label handling

* More hacking at label management in parser

* Fix encoding in msgpack serialization in GoldParse

* Adjust batch sizes in parser training

* Fix minibatch_by_words

* Add merge_subtokens function to pipeline.pyx

* Register merge_subtokens factory

* Restore use of msgpack tmp directory

* Use minibatch-by-words in train

* Handle retokenization in scorer

* Change back-off approach for missing labels. Use 'dep' label

* Update NER for new label management

* Set NER tags for over-segmented words

* Fix label alignment in gold

* Fix label back-off for infrequent labels

* Fix int type in labels dict key

* Fix int type in labels dict key

* Update feature definition for 8 feature set

* Update ud-train script for new label stuff

* Fix json streamer

* Print the line number if conll eval fails

* Update children and sentence boundaries after deprojectivisation

* Export set_children_from_heads from doc.pxd

* Render parses during UD training

* Remove print statement

* Require thinc 6.11.1.dev6. Try adding wheel as install_requires

* Set different dev version, to flush pip cache

* Update thinc version

* Update GoldCorpus docs

* Remove print statements

* Fix formatting and links [ci skip]
2018-03-19 02:58:08 +01:00
Matthew Honnibal
ff42b726c1 Fix unicode declaration on test 2018-03-19 02:04:24 +01:00
Matthew Honnibal
7dc76c6ff6 Add test for textcat 2018-03-16 12:39:45 +01:00
Matthew Honnibal
3cdee79a0c Add depth argument for text classifier 2018-03-16 12:37:31 +01:00
Matthew Honnibal
13067095a1 Disable broken add-after-train in textcat 2018-03-16 12:33:33 +01:00
Matthew Honnibal
565ef8c4d8 Improve argument passing in textcat 2018-03-16 12:30:51 +01:00
Matthew Honnibal
eb2a3c5971 Remove unused function 2018-03-16 12:30:33 +01:00
Matthew Honnibal
307d6bf6d3 Fix parser for Thinc 6.11 2018-03-16 10:59:31 +01:00
Matthew Honnibal
9a389c4490 Fix parser for Thinc 6.11 2018-03-16 10:38:13 +01:00
Matthew Honnibal
648532d647 Don't assume blas methods are present 2018-03-16 02:48:20 +01:00
Matthew Honnibal
e85dd038fe Merge remote-tracking branch 'origin/master' into feature/single-thread 2018-03-16 02:41:11 +01:00
Matthew Honnibal
e3be3d65b3 Version as v2.0.10.dev0 2018-03-15 17:31:22 +01:00
ines
f3f8bfc367 Add built-in factories for merge_entities and merge_noun_chunks
Allows adding those components to the pipeline out-of-the-box if they're defined in a model's meta.json. Also allows usage as nlp.add_pipe(nlp.create_pipe('merge_entities')).
2018-03-15 17:16:54 +01:00
Ines Montani
0d17377e8b
Merge pull request #2095 from DuyguA/quick-typo-fix (resolves #2063)
Quick typo fix
2018-03-15 00:29:56 +01:00
ines
d854f69fe3 Add built-in factories for merge_entities and merge_noun_chunks
Allows adding those components to the pipeline out-of-the-box if they're defined in a model's meta.json. Also allows usage as nlp.add_pipe(nlp.create_pipe('merge_entities')).
2018-03-15 00:18:51 +01:00
ines
9ad5df41fe Fix whitespace 2018-03-15 00:11:18 +01:00
Matthew Honnibal
d7ce6527fb Use increasing batch sizes in ud-train 2018-03-14 20:15:28 +01:00
alldefector
f4e5904fc2 Fix Spanish noun_chunks failure caused by typo 2018-03-14 17:03:17 +01:00
Thomas Opsomer
fbf48b3f9f lemma property to return hash instead of unicode 2018-03-14 17:03:00 +01:00
Matthew Honnibal
8cefc58abc Fix Vectors pickling 2018-03-14 16:59:37 +01:00
DuyguA
be4f6da16b maybe not a good idea to remove also 2018-03-14 14:47:24 +01:00
DuyguA
1a513f71e3 removed also from lookup 2018-03-14 11:57:15 +01:00
DuyguA
cca66abf1e quick typo fix 2018-03-14 11:34:22 +01:00
Matthew Honnibal
7b755414eb Update call into thinc 2018-03-13 13:59:59 +01:00
Matthew Honnibal
e101f10ef0 Fix header 2018-03-13 02:12:16 +01:00
Matthew Honnibal
952c87409e Use openblas.sgemm in parser 2018-03-13 02:12:01 +01:00
Matthew Honnibal
d55620041b Switch parser to gemm from thinc.openblas 2018-03-13 02:10:58 +01:00
Matthew Honnibal
c2f4759257
Fix test for Python 2 2018-03-12 23:03:05 +01:00
Matthew Honnibal
9aeec9c242 Increment dev version 2018-03-11 01:58:21 +01:00
Matthew Honnibal
f49d71fa7c Merge branch 'master' of https://github.com/explosion/spaCy 2018-03-11 01:27:17 +01:00
Matthew Honnibal
5dddb30e5b Fix ud-train script 2018-03-11 01:26:45 +01:00
Matthew Honnibal
e42960bd14
Merge pull request #2012 from alldefector/patch-1
Fix Spanish noun_chunks failure caused by typo
2018-03-11 01:05:19 +01:00
Matthew Honnibal
2cab4d6517 Remove use of attr module in ud_train 2018-03-11 00:59:39 +01:00
Matthew Honnibal
fa9fd21620 Increment dev version 2018-03-11 00:41:54 +01:00
Matthew Honnibal
53b3249e06 Add tests for arc eager oracle 2018-03-10 23:42:56 +01:00
Matthew Honnibal
754ea1b2f7 Link in spaCy CoNLL commands 2018-03-10 23:42:15 +01:00
Matthew Honnibal
3478ea76d1 Add ud_train and ud_evaluate CLI commands 2018-03-10 23:41:55 +01:00
Matthew Honnibal
4b72c38556 Fix dropout bug in beam parser 2018-03-10 23:16:40 +01:00
Matthew Honnibal
9cc202d670 Fix Vectors pickling 2018-03-10 22:53:42 +01:00
Matthew Honnibal
3d6487c734 Support dropout in beam parse 2018-03-10 22:41:55 +01:00
Matthew Honnibal
31b156d60b Fix itershuffle 2018-03-10 22:32:59 +01:00
Matthew Honnibal
b59765ca9f Stream gold during spacy train 2018-03-10 22:32:45 +01:00
Matthew Honnibal
c3d168509a Stream the gold data during training, to reduce memory 2018-03-10 22:32:32 +01:00
DuyguA
cba63196f9 fixed typo 2018-03-09 10:54:18 +01:00
DuyguA
7a780476af added more abbreviations 2018-03-09 10:13:00 +01:00
DuyguA
cca87756d7 added Sti 2018-03-08 18:07:52 +01:00
DuyguA
3c994311c5 added abbrevs 2018-03-08 18:03:27 +01:00
DuyguA
56d6fb180e added like_num to lex 2018-03-08 15:25:25 +01:00
DuyguA
26ee0590a3 added some commonly used cases 2018-03-08 12:43:58 +01:00
DuyguA
ae6473e4d5 removed some words with negation particle. 2018-03-08 12:20:32 +01:00
DuyguA
6ed59a2198 removed number words to be caried to the lexical 2018-03-08 12:19:23 +01:00
DuyguA
04784a44a6 made alphabetical order for Turkish chaaracters 2018-03-08 12:11:32 +01:00
DuyguA
af33e022a5 added example sentences for Turkish 2018-03-08 12:06:03 +01:00
Matthew Honnibal
a1be01185c Fix array out of bounds error in Span 2018-02-28 12:27:09 +01:00
Thomas Opsomer
8df9e52829 lemma property to return hash instead of unicode 2018-02-27 19:50:01 +01:00
Ines Montani
35634352fe
Merge pull request #2025 from dejanmarich/patch-1
Update stop_words.py for Croatian language
2018-02-26 18:22:32 +01:00
Matthew Honnibal
14f729c72a Add subtok label to parser 2018-02-26 12:26:35 +01:00
Matthew Honnibal
7137ad8b0b Make label filtering clearer for projectivisation 2018-02-26 12:02:01 +01:00
Matthew Honnibal
b8d52cb285 Fix inconsistent label freq cutoff for projectivisation 2018-02-26 12:01:44 +01:00
Matthew Honnibal
7b66ec896a Revert "Revert "Improve parser oracle around sentence breaks.""
This reverts commit 36e481c584.
2018-02-26 10:57:37 +01:00
Matthew Honnibal
36e481c584 Revert "Improve parser oracle around sentence breaks."
This reverts commit 50817dc9ad.
2018-02-26 10:53:55 +01:00
Matthew Honnibal
5faae803c6 Add option to not use Janome for Japanese tokenization 2018-02-26 09:39:46 +01:00
Matthew Honnibal
9b406181cd Add Chinese.Defaults.use_jieba setting, for UD 2018-02-25 15:12:38 +01:00
Matthew Honnibal
9ccd0c643b Add Vietnamese 2018-02-25 15:00:46 +01:00
Matthew Honnibal
d4fdb97c87 Fix alignment for words with spaces 2018-02-25 14:55:00 +01:00
Matthew Honnibal
6d2c1ef52c Fix SP tag in generic tag map 2018-02-24 16:04:56 +01:00
Matthew Honnibal
5cc3bd1c1d Update alignment tests 2018-02-24 16:03:58 +01:00
Matthew Honnibal
6138439469 Fix many-to-one alignment 2018-02-24 16:03:50 +01:00
Matthew Honnibal
4890ee1732 Fix scoring of tokenization for punct 2018-02-24 10:32:32 +01:00
Matthew Honnibal
12b39f87da Move cython declarations in matcher.pyx 2018-02-24 10:32:18 +01:00
Matthew Honnibal
01d1b7abdf Support many-to-one alignment in GoldParse 2018-02-24 10:17:01 +01:00
Matthew Honnibal
7865746574 Support many-to-one alignment 2018-02-24 02:09:53 +01:00
Matthew Honnibal
458710b831 Poke matcher test for appveyor 2018-02-23 23:53:48 +01:00
Matthew Honnibal
968dabdde4 Fix bug in multi-task objective 2018-02-23 23:48:09 +01:00
Matthew Honnibal
2c9c8b8d72 Try comming out emoji test in matcher 2018-02-23 23:34:35 +01:00
Matthew Honnibal
980ad68cbe Try to find test that fails on appveyor 2018-02-23 21:27:53 +01:00
Matthew Honnibal
39de8cd4d3 Try to find test failing on appveyor 2018-02-23 20:59:21 +01:00
Matthew Honnibal
4492a33a9d Fix sent_start multi-task objective when alignment fails 2018-02-23 16:50:59 +01:00
Matthew Honnibal
5fa44e93f1 Set unicode_literals in matcher 2018-02-23 16:48:54 +01:00
Matthew Honnibal
12264f9296 Add multi-task objective for sentence segmentation 2018-02-23 16:25:57 +01:00
Matthew Honnibal
e7deadb519 Set version to 2.1.0.dev1 2018-02-23 16:22:24 +01:00
Matthew Honnibal
7b575a119e Try to reduce memory usage of test_matcher 2018-02-23 15:34:37 +01:00
Matthew Honnibal
24563f4026 Fix data typing in align 2018-02-23 15:08:06 +01:00
Matthew Honnibal
7a5ba20692 Fix integer typing in _align 2018-02-23 14:51:24 +01:00
Matthew Honnibal
875411b875 Set unicode types in _align.pyx and test 2018-02-23 14:35:38 +01:00
Matthew Honnibal
51d9679aa3 Fix broken span.as_doc test 2018-02-23 14:22:24 +01:00
dejanmarich
71c261d58b
Update stop_words.py
Added more words
2018-02-23 10:31:01 +01:00
Matthew Honnibal
3e6c1111b7 Remove obsolete test 2018-02-23 03:22:07 +01:00
Matthew Honnibal
a4fdec524a Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-gold 2018-02-22 21:44:28 +01:00
Matthew Honnibal
50817dc9ad Improve parser oracle around sentence breaks. 2018-02-22 19:22:26 +01:00
Matthew Honnibal
307aefe131 Increment version to v2.0.9 2018-02-22 17:07:53 +01:00
Feng Niu
1c60384bed return on empty doc 2018-02-21 15:39:04 -08:00
Feng Niu
7eb1cd100b unbound doc var 2018-02-21 15:05:37 -08:00
Feng Niu
8df75b229c fix unbound vars in es.syntax_iterators 2018-02-21 13:11:17 -08:00
alldefector
4244e285c2
Fix Spanish noun_chunks failure caused by typo 2018-02-21 12:43:21 -08:00
Matthew Honnibal
661873ee4c Randomize the rebatch size in parser 2018-02-21 21:02:07 +01:00
Matthew Honnibal
0872cf611d Don't lower-case lemmas of proper nouns 2018-02-21 16:01:16 +01:00
Matthew Honnibal
a0ddb803fd Make error when no label found more helpful 2018-02-21 16:00:59 +01:00
Matthew Honnibal
ea2fc5d45f Improve length and freq cutoffs in parser 2018-02-21 16:00:38 +01:00
Matthew Honnibal
e5757d4bf0 Add labels property to parser 2018-02-21 16:00:00 +01:00
Matthew Honnibal
eff4ae809a Fix nonproj label filter 2018-02-21 15:59:04 +01:00
Matthew Honnibal
e624405cda Temporarily remove cutoff when filtering labels in nonproj 2018-02-21 13:53:40 +01:00
Matthew Honnibal
f466f0186e Use new alignment implementation in GoldParse 2018-02-20 21:16:35 +01:00
Matthew Honnibal
c0734ba526 Make alignment work with strings 2018-02-20 17:51:49 +01:00
Matthew Honnibal
8180c84a98 Add tests for new Levenshtein alignment 2018-02-20 17:32:25 +01:00
Matthew Honnibal
930c980570 Add improved Levenshtein alignment implementation 2018-02-20 17:31:56 +01:00
Ines Montani
14e7e0f12a
Merge pull request #2000 from jimregan/polish-tag-map
Polish tag map
2018-02-18 19:05:58 +01:00
Jim O'Regan
664407de5d missing PrepCase attribute 2018-02-18 14:46:12 +00:00
Jim O'Regan
95f0673fbc fix typo/missing here too 2018-02-18 14:38:27 +00:00
Matthew Honnibal
2bccad8815 Fix incorrect matcher test 2018-02-18 14:56:12 +01:00
Matthew Honnibal
530172d57a Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-faster-matcher 2018-02-18 14:40:42 +01:00
Matthew Honnibal
cf0e320f2b Add doc.is_sentenced attribute, re #1959 2018-02-18 14:16:55 +01:00
Matthew Honnibal
1e5aeb4eec
Merge pull request #1987 from thomasopsomer/span-sent
Make span.sent work when only manual / custom sbd
2018-02-18 14:05:37 +01:00
Matthew Honnibal
1cf774bdc1 Add output options return_matches and as_tuples to Matcher 2018-02-18 14:00:45 +01:00
Matthew Honnibal
dd9b0945af Fix inconsistencies in the symbols table 2018-02-18 13:51:31 +01:00
Matthew Honnibal
66496ac8e1 Set version to v2.1.0.dev0 2018-02-18 13:48:39 +01:00
Matthew Honnibal
eb3040ce46
Merge pull request #1891 from fucking-signup/master
Fix issue #1889
2018-02-18 13:47:47 +01:00
Matthew Honnibal
3d7285870b Update matcher branch with v2.0.8 master 2018-02-18 13:42:58 +01:00
ines
6bba1db4cc Drop six and related hacks as a dependency 2018-02-18 13:29:56 +01:00
Matthew Honnibal
b30b09192a
Merge pull request #1665 from jimregan/animacy
typo in "inan", add "nhum"
2018-02-18 13:26:53 +01:00
Matthew Honnibal
1b3c98e01b Set version to v2.0.8 2018-02-18 12:16:31 +01:00
Matthew Honnibal
f9f46e5a07 Revert matcher fixes from GregDubbin 2018-02-18 10:59:28 +01:00
Matthew Honnibal
86405e4ad1 Fix CLI for multitask objectives 2018-02-18 10:59:11 +01:00
Matthew Honnibal
a34749b2bf Add multitask objectives options to train CLI 2018-02-17 22:03:54 +01:00
Matthew Honnibal
8f06903e09 Fix multitask objectives 2018-02-17 18:41:36 +01:00
Matthew Honnibal
d1246c95fb Fix model loading when using multitask objectives 2018-02-17 18:11:36 +01:00
Matthew Honnibal
262d0a3148 Fix overwriting of lexical attributes when loading vectors during training 2018-02-17 18:11:11 +01:00
Matthew Honnibal
c0caf7cf27 Fix LANG symbol 2018-02-17 18:10:50 +01:00
Matthew Honnibal
0bf2f6be29 Add missing symbol for LANG attr. Fixes inconsistent numeric ID 2018-02-17 17:37:02 +01:00
Matthew Honnibal
97a228a4ce Increment to v2.0.8.dev0 2018-02-17 16:54:36 +01:00
Matthew Honnibal
f7dc64d2a3 Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-faster-matcher 2018-02-17 16:47:35 +01:00
Aaron Marquez
ea571e8325 Merge branch 'master' into issue-1959 2018-02-16 15:14:09 -08:00
Matthew Honnibal
7d5c720fc3 Fix multitask objective when no pipeline provided 2018-02-15 23:50:21 +01:00
Aaron Marquez
f0d3672e17 Changed loading EN model 2018-02-15 14:28:38 -08:00
Aaron Marquez
3765d84d57 Fix issue #1959 2018-02-15 12:51:49 -08:00
Aaron Marquez
7ba4111554 Add test for issue-1959 2018-02-15 12:46:22 -08:00
Matthew Honnibal
59b7cf9db8 Add get_beam_parse method in ArcEager, for Prodigy 2018-02-15 21:03:16 +01:00
Matthew Honnibal
3e541de440 Merge branch 'master' of https://github.com/explosion/spaCy 2018-02-15 21:02:55 +01:00
Thomas Opsomer
5d24a81c0b add test for span.sent when doc not parsed 2018-02-15 16:59:16 +01:00
Thomas Opsomer
deab391cbf correct check on sent_start & raise if no boundaries 2018-02-15 16:58:30 +01:00
Matthew Honnibal
afbd46adfb Remove length cap in PhraseMatcher 2018-02-15 16:10:54 +01:00
Matthew Honnibal
4533c7408d Update matcher tests 2018-02-15 15:39:47 +01:00
Matthew Honnibal
1c19605426 Move matcher2.pyx to matcher.pyx 2018-02-15 15:27:03 +01:00
Matthew Honnibal
9ebf2fe7c3 Make helper function to get longest matches 2018-02-15 15:26:15 +01:00
Matthew Honnibal
4cb861e080
Merge pull request #1968 from DuyguA/is_currency
New lexical feature is_currency
2018-02-15 12:13:36 +01:00
Thomas Opsomer
b902731313 Find span sentence when only sentence boundaries (no parser) 2018-02-14 22:18:54 +01:00
Matthew Honnibal
d19dc67886 Make get_action nogil, for efficiency 2018-02-14 12:16:36 +01:00
Matthew Honnibal
7885b92b45 Refactor matcher2, hopefully making it faster 2018-02-14 12:11:17 +01:00
Matthew Honnibal
00261eea27 Make tests refer to matcher2 2018-02-14 12:10:51 +01:00
Claudiu-Vlad Ursache
e28de12cbd
Ensure files opened in from_disk are closed
Fixes [issue 1706](https://github.com/explosion/spaCy/issues/1706).
2018-02-13 20:49:43 +01:00
Matthew Honnibal
262cbe356e Remove caching, as doesn't seem to help for now. 2018-02-13 17:15:20 +01:00
Matthew Honnibal
f43d53f2c5 Remove print statement 2018-02-13 17:15:07 +01:00
Matthew Honnibal
dcd8d89aef Update test for 850, making it work with matcher2 2018-02-13 16:35:20 +01:00
Matthew Honnibal
9bdfa5cd4f Remove re comparisons tests, as matcher behaves differently 2018-02-13 16:28:52 +01:00
Matthew Honnibal
6d7986b0f1 Fix matcher test 2018-02-13 16:28:06 +01:00
Matthew Honnibal
9efda9e9ab Add PhraseMatcher in matcher2.pyx 2018-02-13 16:27:46 +01:00
Johannes Dollinger
012e874d09 Add contributor agreement for emulbreh 2018-02-13 13:40:33 +01:00
Johannes Dollinger
bf94c13382 Don't fix random seeds on import 2018-02-13 12:42:23 +01:00
Matthew Honnibal
0004331895 Update notes on matcher2 2018-02-13 11:45:45 +01:00
Matthew Honnibal
b4cc39eb74 Fix zero-width quantifiers. Passes test_matcher 2018-02-13 11:45:32 +01:00
Matthew Honnibal
1b01685f47 Fix ZERO_PLUS operator 2018-02-12 12:28:03 +01:00
Matthew Honnibal
9115c3ba0a Add TODO in notes 2018-02-12 12:06:48 +01:00
Matthew Honnibal
b00326a7fe Move pattern_id out of TokenPattern 2018-02-12 12:05:54 +01:00
Matthew Honnibal
d34c732635 Add Python notes for rethinking matcher 2018-02-12 10:19:29 +01:00
Matthew Honnibal
d7c9b53120 Pass kwargs into pipeline components during begin_training 2018-02-12 10:18:39 +01:00
Matthew Honnibal
fae5c0dc18 Work on matcher2 2018-02-12 10:17:43 +01:00
4altinok
ca8728035d added new lex feat to token 2018-02-11 18:55:48 +01:00
4altinok
edd7202a06 added new symbol 2018-02-11 18:55:32 +01:00
4altinok
ed1ac2969e added new lexical feat to lexeme 2018-02-11 18:51:48 +01:00
4altinok
94fb0b75e3 code for is_currency 2018-02-11 18:51:32 +01:00
4altinok
3deef1497a removed 18 and replaced 18 with is_currency 2018-02-11 18:51:09 +01:00
4altinok
471d3c9e23 added lex test for is_currency 2018-02-11 18:50:50 +01:00
ines
c63e99da8a Fix typo in glossary (resolves #1964)
Co-Authored-By: SThomasP <sthomasp@users.noreply.github.com>
2018-02-10 11:58:41 +01:00
Lyndon White
6ee5dff51c
Make python 3.4 compat module loading (fix #1733) 2018-02-09 23:03:35 +08:00
Matthew Honnibal
e361b4f82b Fix #1929: Incorrect NER when pre-set sentence boundaries. 2018-02-08 15:25:41 +01:00
Matthew Honnibal
fd9fd275c5 Make test for #1945 more precise 2018-02-07 02:06:11 +01:00
Matthew Honnibal
c087a14380 Merge branch 'master' of https://github.com/explosion/spaCy 2018-02-07 01:29:39 +01:00
Matthew Honnibal
76d89b2180 Add test for #1945: PhraseMatcher regression 2018-02-07 01:29:23 +01:00
Ines Montani
0954e15dda
Merge pull request #1913 from ohenrik/nb_syntax_iterator
Norwegian Language (nb) - Added french syntax iterator with explanation
2018-02-06 04:59:07 +01:00
Ole Henrik Skogstrøm
251a7805fe Copied French syntax iterator to simplify future changes 2018-02-05 14:45:05 +01:00
Matthew Honnibal
2e7391e627
Merge pull request #1916 from tokestermw/bug/fix-not-passing-in-model-cfg-in-nlp
Bug/fix not passing in model cfg in nlp
2018-02-05 01:19:40 +01:00
Ali Zarezade
9df9da34a3
Fix init_model issue
Fixing issue #1928
2018-02-03 17:21:34 +03:30
Matthew Honnibal
ebe84e45e5 Increment version to 2.0.7 2018-02-02 03:39:16 +01:00
Matthew Honnibal
e4b1f57599 Increment version 2018-02-02 02:33:23 +01:00
Matthew Honnibal
069531c351 Merge branch 'master' of https://github.com/explosion/spaCy 2018-02-02 02:32:58 +01:00
Matthew Honnibal
f74a802d09 Test and fix #1919: Error resuming training 2018-02-02 02:32:40 +01:00
ines
f1d3deffac Add Russian example sentences (see #1107) 2018-02-01 20:09:40 +01:00
Matthew Honnibal
6b1126c312 Merge branch 'master' of https://github.com/explosion/spaCy 2018-02-01 02:57:52 +01:00
ines
3c1fb9d02d Make validate command fail more gracefully if version not found
Mostly relevant during develoment when working with .dev versions
2018-01-31 22:06:28 +01:00
Motoki Wu
54062b7326 added tests for issue #1915 2018-01-30 18:30:19 -08:00
Motoki Wu
f4a7d1a423 make to sure pass in **cfg to each component when training 2018-01-30 18:29:54 -08:00
ines
4046823699 Only check component in factories if string (see #1911) 2018-01-30 16:29:07 +01:00
ines
ce10d320c4 Fix component check in self.factories (see #1911) 2018-01-30 16:09:37 +01:00
Ole Henrik Skogstrøm
e40465487c Added french syntax iterator with explenation 2018-01-30 15:44:29 +01:00
ines
8901814248 Improve error handling if pipeline component is not callable (resolves #1911)
Also add help message if user accidentally calls nlp.add_pipe() with a string of a built-in component name.
2018-01-30 15:43:03 +01:00
Matthew Honnibal
a437ba87a3 Set release=True 2018-01-29 21:26:04 +01:00
Adam Binford
9238749aaf Removed test to avoid network requests 2018-01-29 14:48:20 -05:00
Adam Binford
1a2c2f7d7f Fixed auto linking after download and added simple test to check 2018-01-29 14:25:21 -05:00
Matthew Honnibal
cb7110c22e
Merge pull request #1882 from ohenrik/nb_lemma_and_tag_map
Add norwegian bokmål ('nb') lemmatizer and tag_map
2018-01-29 18:18:50 +01:00
Matthew Honnibal
0c1e7f0c86
Merge pull request #1893 from azarezade/master
Add Persian language
2018-01-29 18:18:33 +01:00
Matthew Honnibal
cbdab75b36 Increment version 2018-01-28 23:46:22 +01:00
Matthew Honnibal
512e6adb08
Merge pull request #1896 from thomasopsomer/fix-sent
Fix sentence boundaries serialization (issue #1834)
2018-01-28 21:18:51 +01:00
Matthew Honnibal
f5b1ad4100 Limit parser model size, to hopefully reduce memory during CI tests 2018-01-28 21:00:32 +01:00
Thomas Opsomer
515e25910e fix sent_start in serialization 2018-01-28 19:50:42 +01:00
Thomas Opsomer
45d62561f7 add test for the issue 2018-01-28 19:49:56 +01:00
ines
6d978e5c35 Don't use deprecated Doc.merge call in displaCy
As reported here: https://stackoverflow.com/a/48464412/6400719
2018-01-27 11:25:05 +01:00
Ali Zarezade
bb6bd3d8ae add persian language 2018-01-27 13:27:26 +03:30
Ali Zarezade
d195675db5 add persian language 2018-01-27 13:21:38 +03:30
Kit
4b42267ba3
Fix issue #1889 2018-01-25 23:17:22 +01:00
Kit
52ef51f36e
Add test for issue #1889 2018-01-25 22:56:48 +01:00
Ole Henrik Skogstrøm
8e2c9f2475 Cleaned up nb tag_map comments 2018-01-25 11:09:28 +01:00
Ole Henrik Skogstrøm
1107e89fcf Updated doc string on nb tag_map module 2018-01-25 11:08:28 +01:00
Matthew Honnibal
6a8cb905aa
Merge pull request #1876 from GregDubbin/master
Pattern matcher fixes
2018-01-24 16:38:11 +01:00
Matthew Honnibal
38b260e0c3
Merge pull request #1879 from azarezade/master
Add Persian character and symbols
2018-01-24 16:34:22 +01:00
Matthew Honnibal
edb71a280e Add test for #1883: Unpickling Matcher 2018-01-24 15:42:33 +01:00
Matthew Honnibal
2ad050e668 Fix unpickling of Matcher. Also store correct data in matcher._patterns 2018-01-24 15:42:11 +01:00
Ole Henrik Skogstrøm
4058a7d579 Fix æøå characters in lemmatizer 2018-01-24 14:03:14 +01:00
Ole Henrik Skogstrøm
42248f423f Updated tag map 2018-01-24 13:50:33 +01:00
Ole Henrik Skogstrøm
74b430b49a Correct Lemmatizer 2018-01-24 13:26:33 +01:00
Ole Henrik Skogstrøm
b9b3a40c78 Add norwegian lemmatizer and tag_map 2018-01-24 12:28:29 +01:00
Matthew Honnibal
42a18ef903 Add test for #1868: Vocab.__contains__ with ints 2018-01-23 23:27:05 +01:00
Matthew Honnibal
43f381ce36 Make Vocab.__contains__ work with ints. Fixes #1868 2018-01-23 23:26:47 +01:00
greg
85ab99e692 Correct test examples 2018-01-23 15:00:14 -05:00
greg
f50bb1aafc Restructure StateC to eliminate dependency on unordered_map 2018-01-23 14:40:03 -05:00
Matthew Honnibal
f3753c2453 Further model deserialization fixes re #1727 2018-01-23 19:16:05 +01:00
Matthew Honnibal
91e916cb67 Add comment to new test 2018-01-23 19:11:53 +01:00
Matthew Honnibal
fd187d71ad Add test for #1727 2018-01-23 19:11:01 +01:00
Matthew Honnibal
85c942a6e3 Dont overwrite pretrained_dims setting from cfg. Fixes #1727 2018-01-23 19:10:49 +01:00
Ali Zarezade
42349471bc
add ٪ as punctuation 2018-01-23 18:11:33 +03:30
Ali Zarezade
2bda582135
Add Persian character and symbols
Add Persian characters and the following:
- ٪ used instead of %
- ؟ used instead of ?
- ﷼ used instead of $
- ، used instead of ,
- ؛ used instead of ;
2018-01-23 13:20:36 +03:30
Matthew Honnibal
7e6dc283db Fix unicode import in test 2018-01-22 23:55:44 +01:00
greg
686735b94e Fix matcher import 2018-01-22 16:53:05 -05:00
greg
3a491093ee Import libcpp.map if libcpp.unordered_map doesn't exist 2018-01-22 16:46:25 -05:00
greg
d55992bdf0 Switch match dictionary to use final state pointer rather than ID 2018-01-22 15:36:47 -05:00
Matthew Honnibal
4ce7d24fd5 Add test for #1799: Set left and right edges (and thus sentences) in non-projective parses. 2018-01-22 20:18:38 +01:00
Matthew Honnibal
56164ab688 Set l_edge and r_edge correctly for non-projective parses. Fixes #1799 2018-01-22 20:18:04 +01:00
Matthew Honnibal
964aa1b384 Merge branch 'master' of https://github.com/explosion/spaCy 2018-01-22 19:18:46 +01:00
Matthew Honnibal
29897ed1b3 Allow vector loading to work on 1d data files. Fixes #1831 2018-01-22 19:18:26 +01:00
greg
490bc82c27 Add comments clarifying matcher logic for '*' 2018-01-22 10:03:12 -05:00
Matthew Honnibal
fe4748fc38
Merge pull request #1870 from avadhpatel/master
Model Load Performance Improvement by more than 5x
2018-01-22 00:05:15 +01:00
Avadh Patel
a517df55c8 Small fix
Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-21 15:20:45 -06:00
Avadh Patel
5b5029890d Merge branch 'perfTuning' into perfTuningMaster
Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-21 15:20:00 -06:00
Matthew Honnibal
203d2ea830 Allow multitask objectives to be added to the parser and NER more easily 2018-01-21 19:37:02 +01:00
Matthew Honnibal
4a7d524efb Merge branch 'master' of https://github.com/explosion/spaCy 2018-01-21 19:22:03 +01:00
Matthew Honnibal
61a051f2c0 Fix MultitaskObjective 2018-01-21 19:21:34 +01:00
Avadh Patel
75903949da Updated model building after suggestion from Matthew
Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-18 06:51:57 -06:00
Avadh Patel
fe879da2a1 Do not train model if its going to be loaded from disk
This saves significant time in loading a model from disk.

Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-17 06:16:07 -06:00
Avadh Patel
2146faffee Do not train model if its going to be loaded from disk
This saves significant time in loading a model from disk.

Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-17 06:04:22 -06:00
greg
7072b395c9 Add greedy matcher tests 2018-01-16 15:46:13 -05:00
greg
441f490c1c Merge branch 'master' of github.com:GregDubbin/spaCy 2018-01-16 13:31:10 -05:00
greg
8bea62f26e Correct bugs for greedy matching and introduce ADVANCE_PLUS action 2018-01-16 13:21:43 -05:00
Matthew Honnibal
ccb51a9f36 Make .similarity() return 1.0 if all orth attrs match 2018-01-15 16:29:48 +01:00
Matthew Honnibal
82135d85b7 Fix test 2018-01-15 15:55:15 +01:00
Matthew Honnibal
4b09616b58 Add test for #1757: Comparison against None 2018-01-15 15:55:01 +01:00
Matthew Honnibal
b904d81e9a Fix rich comparison against None objects. Closes #1757 2018-01-15 15:51:25 +01:00
Matthew Honnibal
9e413449f6 Fix unicode error in new test 2018-01-15 15:39:00 +01:00
Matthew Honnibal
ab7c45b12d Fix error message and handling of doc.sents 2018-01-15 15:21:11 +01:00
Matthew Honnibal
6b215d2dd3 Add test for Issue #1537 2018-01-15 15:20:56 +01:00
ines
5babb7d6f6 Merge branch 'master' of https://github.com/explosion/spaCy 2018-01-14 17:31:09 +01:00
ines
793890cb4d Remove test for removed deprecation warning 2018-01-14 17:31:06 +01:00
Matthew Honnibal
465a6f6452 Add missing Span.vocab property. Closes #1633 2018-01-14 15:06:30 +01:00
Matthew Honnibal
0cb090e526 Fix infinite recursion in token.sent_start. Closes #1640 2018-01-14 15:02:15 +01:00
Matthew Honnibal
5cbe913b6f Don't raise deprecation warning in property. Closes #1813, #1712 2018-01-14 14:55:58 +01:00
Matthew Honnibal
1a1cca6052 Fix vectors.resize() on Py3. Closes #1539 2018-01-14 14:48:51 +01:00
Matthew Honnibal
0153220304 Make set_vector add word to vocab. Fixes #1807 2018-01-14 13:57:57 +01:00
Ines Montani
55754f0cee
Merge pull request #1836 from fucking-signup/master
Add tests for issue #1769
2018-01-13 00:23:35 +00:00
Kit
4ee97f20a0
Mark like_num tests as slow 2018-01-13 00:44:15 +01:00
Kit
855531537e
Rewrite tests for issue #1769 2018-01-12 23:49:51 +01:00
Kit
5b541cb5ec
Simplify tests for issue #1769 2018-01-12 23:34:27 +01:00
Kit
7a2adc4633
Remove some tests to see build status changes 2018-01-12 22:49:16 +01:00
Kit
0e62809a43
Rewrite tests for issue #1769 2018-01-12 22:26:06 +01:00
Ines Montani
36f426fe0a
Merge pull request #1808 from fucking-signup/master
Fix issue #1769
2018-01-12 21:12:02 +00:00
Kit
76f4eeca44
Remove tests to see build changes on Windows (Python 2.7) 2018-01-12 20:30:51 +01:00
Matthew Honnibal
7ca49c2061
Merge branch 'master' into feature-improve-model-download 2018-01-10 18:21:55 +01:00
Kit
7ec0956e8d
Add regression test (issue #1769) 2018-01-08 03:42:04 +01:00
Kit
701e7cc6aa
Rename variable to keep code consistent 2018-01-08 03:38:44 +01:00
Kit
ed0db95183
Find lowercased forms of ordinal words, where possible 2018-01-08 03:28:50 +01:00
Kit
9bc524982e
Find lowercased forms of numeric words 2018-01-08 03:25:08 +01:00
Søren Lind Kristiansen
62de5da1ff Remove unsused dummy variable 2018-01-05 09:57:24 +01:00
Søren Lind Kristiansen
10dab8eef8 Remove dummy variable from function calls 2018-01-05 09:37:05 +01:00
Søren Lind Kristiansen
7f0ab145e9 Don't pass CLI command name as dummy argument 2018-01-04 21:33:47 +01:00
Ines Montani
6a008233b5
Merge pull request #1795 from textioHQ/issue1758 (resolves #1758)
english tokenizer: handle "would've"
2018-01-04 02:43:39 +00:00
Kevin Humphreys
597df5bf83 add test 2018-01-03 13:00:05 -08:00
Kevin Humphreys
7918fa4ef9 handle would've 2018-01-03 12:25:48 -08:00
ines
2c656f90fb Exit with 1 if incompatible models found (see #1714) 2018-01-03 21:20:35 +01:00
ines
dacfaa2ca4 Ensure that download command exits properly (resolves #1714) 2018-01-03 21:03:36 +01:00
Søren Lind Kristiansen
a9ff6eadc9 Prefix dummy argument names with underscore 2018-01-03 20:48:12 +01:00
ines
1081e08efb Fix formatting 2018-01-03 20:14:50 +01:00
ines
d8109964d6 Use --no-deps on model install
In general, it's nice for models to specify spaCy as a dependency. However, this tends to cause problems in conda environments, as pip will re-install spaCy and its dependencies (especially Thinc)
2018-01-03 17:40:37 +01:00
ines
319d754309 Fix overwriting of existing symlinks
Check for is_symlink() to also overwrite invalid and outdated symlinks. Also show better error message if link path exists but is not symlink (i.e. file or directory).
2018-01-03 17:39:36 +01:00
ines
8ba0dfd017 Make message on failed linking more clear 2018-01-03 17:38:09 +01:00
Søren Lind Kristiansen
d6327e8495 Fix handling case when vectors not specified 2018-01-03 12:20:49 +01:00
Søren Lind Kristiansen
bcc51d7d8b Fix shifted positional arguments 2018-01-03 12:19:47 +01:00
zqhZY
f27859fa99 add ChineseDefaults class for pickling 2017-12-28 17:13:58 +08:00
Ines Montani
ff9fc945ab
Merge pull request #1749 from sorenlind/da_ud_tokenization
Tune Danish tokenizer to more closely match Universal Dependencies
2017-12-22 16:00:49 +00:00
ines
26f313dabc Fix missing import 2017-12-22 16:21:44 +01:00
ines
8dc1c27841 Merge branch 'master' of https://github.com/explosion/spaCy 2017-12-22 16:01:00 +01:00
ines
b10ba848b8 xfail test that causes MemoryError on Python 2 on Windows
Need to investigate this further!
2017-12-22 16:00:58 +01:00
Søren Lind Kristiansen
bef735aef7 Fix Danish abbreviation 'm.h.t.' 2017-12-21 09:24:31 +01:00
Ines Montani
a3dd167d7f
Merge branch 'master' into da_ud_tokenization 2017-12-20 21:05:34 +00:00
Ines Montani
97f100f69f
Merge pull request #1742 from kimfalk/master
Two corrections in the da lan.
2017-12-20 21:02:00 +00:00
Ines Montani
d682a8803e
Merge pull request #1672 from cbilgili/master
Adds Turkish Lemmatization
2017-12-20 21:01:00 +00:00
Benjamin Peterson
9452134cd1 remove no-break spaces from Hindi example (fixes #1750) 2017-12-20 11:35:30 -08:00
Søren Lind Kristiansen
7a2f2f6f94 Fix formatting. 2017-12-20 18:37:37 +01:00
Søren Lind Kristiansen
15d13efafd Tune Danish tokenizer to more closely match tokenization in Universal Dependencies. 2017-12-20 17:36:52 +01:00
Kim FalkJørgensen
648dc60755 Remove the incorrect exception 'm.h.t' 2017-12-20 10:02:39 +01:00
Kim FalkJørgensen
9c9f4ef84a Fixing a translation error in examples.py
Adding an exception in the tokenizer_exceptions.py
2017-12-19 15:26:50 +01:00
ines
22dc744b48 Fix check for '@' in like_url (see #1715) 2017-12-16 13:48:43 +01:00
Ines Montani
9c1ee65268
Add regression test for #1698 2017-12-12 10:36:11 +01:00
Ines Montani
6455b574fc
Check for email address first 2017-12-12 10:25:13 +01:00
Bri-Will
d77361d76c
Update lex_attrs.py. Fix like_url from matching on e-mail 2017-12-11 14:13:28 -08:00
Søren Lind Kristiansen
5a9d377580 Remove abbreviation for positional plac argument 2017-12-11 11:08:29 +01:00
Isaac Sijaranamual
38021fbb00 Switch from python 3 only TemporaryDirectory to pytest's tmpdir 2017-12-11 00:16:04 +01:00
Isaac Sijaranamual
20ae0c459a Fixes "Error saving model" #1622 2017-12-10 23:07:13 +01:00
Isaac Sijaranamual
568130ce7c Adds regression test_issue1622 2017-12-10 23:00:48 +01:00
Isaac Sijaranamual
e188b61960 Make cli/train.py not eat exception 2017-12-10 22:53:08 +01:00
ines
020a7e5d52 Allow 'fine_grained' option in displaCy (see #1703)
Shows token.tag_ instead of token.pos_. Disabled by default, to not cause rendering issues for models with long fine-grained tags (e.g. merged morphological features).
2017-12-09 15:11:12 +01:00
Matthew Honnibal
3b17eb7c49 Merge branch 'master' of https://github.com/explosion/spaCy 2017-12-07 10:39:32 +01:00
Matthew Honnibal
a6b43729c6 Set version to v2.0.5 2017-12-07 10:39:14 +01:00
ines
5eaa61c2b8 Fix formatting 2017-12-07 10:23:09 +01:00
ines
24e80c51b8 Document init-model command 2017-12-07 10:14:37 +01:00
Matthew Honnibal
c91f451b0f Fix imports and CLI in init-model 2017-12-07 10:03:07 +01:00
ines
82e80ff928 Rename model command to init_model and fix formatting 2017-12-07 09:59:23 +01:00
Ines Montani
2feeb428d6
Merge pull request #1646 from GreenRiverRUS/master
Added model command to create models from raw data
2017-12-07 08:54:26 +00:00
Matthew Honnibal
6373d2580d Increment version to v2.0.5.dev0 2017-12-07 09:53:59 +01:00
Matthew Honnibal
36b47e3fa6 Fix (and test) vector pickling 2017-12-07 09:53:30 +01:00
Matthew Honnibal
05f41ff587 Set version to 2.0.4 2017-12-06 13:24:02 +01:00
Matthew Honnibal
04c38f7e87 Merge branch 'master' of https://github.com/explosion/spaCy 2017-12-06 12:15:52 +01:00
Matthew Honnibal
361944e512 If no rules are set, lemmatize by lookup 2017-12-06 12:12:11 +01:00
Matthew Honnibal
2ab0f2d186
Merge pull request #1664 from jimregan/italian-lemmatizer
BOM in Italian lemmatiser
2017-12-06 11:09:04 +01:00
Matthew Honnibal
3f247119d3
Merge pull request #1668 from sorenlind/da_morph
Add more Danish morph rules and clean up existing ones
2017-12-06 11:08:09 +01:00
Matthew Honnibal
b712de774e Fix vectors pickling 2017-12-05 12:45:24 +01:00
Matthew Honnibal
04650e38c7 Set version to 2.0.4.dev0 2017-12-05 10:52:31 +01:00
Matthew Honnibal
07acb43a85 Merge branch 'master' of https://github.com/explosion/spaCy 2017-12-04 14:42:52 +01:00
Thomas Werkmeister
94eac75b7c
fix setup.py spacy req string for packaging
Requirement should be `spacy>=2.0.2` instead of `spacy2.0.2`
2017-12-03 04:16:28 -06:00
ines
f2ea6d4713 Add Dutch example sentences (see #1107) 2017-12-01 23:36:05 +01:00
Canbey Bilgili
abe098b255 Adds Turkish Lemmatization 2017-12-01 17:04:32 +03:00
Søren Lind Kristiansen
d86b537a38 Enable morph rules for Danish 2017-11-30 15:58:02 +01:00
Søren Lind Kristiansen
13a988adc3 Remove 'Number[psor]' 2017-11-30 15:55:04 +01:00
Søren Lind Kristiansen
dd6fde18a9 Add more Danish morph rules and clean up existing ones 2017-11-30 11:17:19 +01:00
Vadim Mazaev
495eacf470 Merge branch 'model_command' 2017-11-30 12:30:26 +03:00
Vadim Mazaev
4ba7ddf651 Bugfixies 2017-11-30 12:29:38 +03:00
Jim O'Regan
a4ecdeadd4 aha 2017-11-29 23:43:25 +00:00
Jim O'Regan
2c7a9215d7 Merge branch 'master' into animacy 2017-11-29 23:31:12 +00:00
Jim O'Regan
c3e6cee17a use inan in polimorf tagset conversion 2017-11-29 23:15:47 +00:00
Jim O'Regan
b32575e78c imports 2017-11-29 23:03:41 +00:00
Jim O'Regan
3696ce6a7b add UD mapping 2017-11-29 22:59:19 +00:00
Jim O'Regan
f8e7082fe4 typo in "inan", add "nhum" 2017-11-29 22:40:47 +00:00
Matthew Honnibal
6bc0f4d29f
Merge pull request #1611 from fsonntag/master
Solving #1494
2017-11-29 23:11:23 +01:00
Matthew Honnibal
f9ed9ea529
Merge pull request #1624 from GreenRiverRUS/russian
Add support for Russian
2017-11-29 23:10:01 +01:00
Jim O'Regan
076a6fc60a symbols 2017-11-29 20:11:20 +00:00
Jim O'Regan
834ba3c69a (semi generated) Polimorf mapping 2017-11-29 20:08:24 +00:00
Jim O'Regan
ba6a23fd11 BOM in Italian lemmatiser 2017-11-29 17:40:07 +00:00
ines
a31506e060 Fix off-by-one error in nlp.add_pipe(after=name) (fixes #1654) 2017-11-28 20:37:55 +01:00
ines
b62739fbfe Add regression test for #1654 2017-11-28 20:27:54 +01:00
ines
2e50dbb9d7 Simplify test 2017-11-28 20:27:27 +01:00
Felix Sonntag
724ae7dc55 Fixed issue of infix capturing prefixes 2017-11-28 17:17:12 +01:00
Ines Montani
9052643e2c
Merge pull request #1653 from sorenlind/da_example_typo
Fix typo
2017-11-27 14:47:42 +00:00
Søren Lind Kristiansen
5fe58b885b Fix typo 2017-11-27 15:36:18 +01:00
Ines Montani
d52b1ab245
Add unicode_literals (hopefully fixes test failure on Python 2) 2017-11-27 15:16:54 +01:00
Søren Lind Kristiansen
0ffd27b0f6 Add several Danish alternative spellings 2017-11-27 13:35:41 +01:00
Ines Montani
6362024cf8
Merge pull request #1645 from GreenRiverRUS/fix_default_meta
Fixed spaCy version string in default meta
2017-11-27 11:58:02 +00:00
Vadim Mazaev
c332ffdde1 Added model command to create model from raw data:
words counts, brown clusters and vectors
2017-11-27 01:21:47 +03:00
Vadim Mazaev
59f03ab1d7 Fixed spacy version string in default meta 2017-11-26 23:02:07 +03:00
Vadim Mazaev
53e7c38637 Fixed tests depends on pymorphy2 2017-11-26 21:04:44 +03:00
Vadim Mazaev
cacd859dcd Added tag map, fixed tests fails, added more exceptions 2017-11-26 20:54:48 +03:00
Ines Montani
a7bb8f1b42
Merge pull request #1637 from sorenlind/da_tokenization
Improve Danish tokenization
2017-11-26 15:41:38 +00:00
ines
c699aec089 Add offsets_from_biluo_tags helper and tests (see #1626) 2017-11-26 16:38:01 +01:00
Søren Lind Kristiansen
ef03e9ea53 Remove unused import. 2017-11-25 13:04:02 +01:00
Søren Lind Kristiansen
6aa241bcec Add day of month tokenizer exceptions for Danish. 2017-11-24 15:03:24 +01:00
Søren Lind Kristiansen
0c276ed020 Add weekday abbreviations and remove abiguous month abbreviations for Danish. 2017-11-24 14:43:29 +01:00
Søren Lind Kristiansen
056547e989 Add multiple tokenizer exceptions for Danish. 2017-11-24 11:51:26 +01:00
Søren Lind Kristiansen
8dc265ac0c Add test for tokenization of 'i.' for Danish. 2017-11-24 11:29:37 +01:00
Søren Lind Kristiansen
ac8116510d Fix tokenization of 'i.' for Danish. 2017-11-24 11:16:53 +01:00
Matthew Honnibal
79f11d4f85 Pickle vectors with vocab 2017-11-23 17:19:50 +01:00
Matthew Honnibal
f29c3925ee Fix more efficient nonproj 2017-11-23 12:48:00 +00:00
Matthew Honnibal
e10e9ad2c5 Improve efficiency of Doc.to_array 2017-11-23 12:33:27 +00:00
Matthew Honnibal
2acc907d55 Improve profiling 2017-11-23 12:33:03 +00:00
Matthew Honnibal
fa62427300 Remove lookup-based lemmatization 2017-11-23 12:32:22 +00:00
Matthew Honnibal
fb26b2cb12 Use lookup lemmatizer if lemma unset 2017-11-23 12:31:58 +00:00
Matthew Honnibal
db5c714ad2 Improve efficiency of deprojectivization 2017-11-23 12:31:34 +00:00
Matthew Honnibal
8fec7268eb Move string cleanup under a setting flag 2017-11-23 12:19:18 +00:00
Matthew Honnibal
5949777b12 Fix misleading multi-threading docstring 2017-11-23 12:18:59 +00:00
Matthew Honnibal
542e6fd4ea Don't remove entries from specials 2017-11-23 12:17:42 +00:00
Matthew Honnibal
30ba81f881
Merge pull request #1576 from ligser/master
Actually reset caches in pipe [wip]
2017-11-23 12:54:48 +01:00
ines
c90fe92e15 Fix displaCy test 2017-11-22 05:04:39 +01:00
ines
a6f33ac27d Fix displaCy test 2017-11-22 04:19:28 +01:00
ines
93b0be611a Merge branch 'master' of https://github.com/explosion/spaCy 2017-11-22 00:28:55 +01:00
ines
60b4915569 Use .pos_ instead of .tags_ in displaCy by default (see #1006) 2017-11-22 00:28:52 +01:00
Vadim Mazaev
81314f8659 Fixed tokenizer: added char classes; added first lemmatizer and
tokenizer tests
2017-11-21 22:23:59 +03:00
Vadim Mazaev
52ee1f9bf9 Updated Russian Language, added lemmatizer, norm exceptions and lex
attrs
2017-11-21 11:44:46 +03:00
Burton DeWilde
a5c6869b2d Fix bug where span.orth_ != span.text (see #1612) 2017-11-20 12:05:43 -06:00
Burton DeWilde
635792997c Add regression test for #1612 2017-11-20 12:05:35 -06:00
ines
9a63e32f21 Add noqa to Python 2 compat variables of built-ins (see #1617) 2017-11-20 14:03:42 +01:00
ines
d70a64d78b Fix syntax error and formatting in test (see #1617) 2017-11-20 14:01:25 +01:00
ines
17849dee4b Fix French test (see #1617) 2017-11-20 13:59:59 +01:00
Felix Sonntag
33b0f86de3 Changed tokenizer to add infix when infix_start is offset 2017-11-19 16:32:10 +01:00
Felix Sonntag
8be3392302 Added regression text for 1494 2017-11-19 16:30:35 +01:00
Motoki Wu
a52e195a0a Fixes Issue #1207 where noun_chunks of Span gives an error.
Make sure to reference `self.doc` when getting the noun chunks.

Same fix as 9750a0128c
2017-11-17 17:16:20 -08:00
Motoki Wu
b818afaa0e Added failing test for Issue #1207.
The noun chunk iterator should work for `Doc` but not for `Span`.
2017-11-17 17:04:27 -08:00
Vadim Mazaev
a0739a06d4 Returned russian support from v1.10 branch 2017-11-17 17:06:15 +03:00
yuukos
7401152289 updated Russian tokenizer
moved the trying to import pymorph into __init__
2017-11-17 17:04:50 +03:00
yuukos
3aad66cf00 added russian language support 2017-11-17 17:04:22 +03:00
ines
a3d4dd1a5d Test adding of lots of pipeline components (see #1585)
Just to make sure that there's no error now or in the future with adding a large number of pipeline components.
2017-11-15 17:28:06 +01:00
Roman Domrachev
61d28d03e4 Try again to do selective remove cache 2017-11-15 19:11:12 +03:00
Roman Domrachev
b3311100c7 Merge branch 'master' of github.com:explosion/spaCy 2017-11-15 18:30:04 +03:00
Matthew Honnibal
b60d92aca8 Increment version 2017-11-15 16:14:46 +01:00
Roman Domrachev
505c6a2f2f Completely cleanup tokenizer cache
Tokenizer cache can have be different keys than string

That modification can slow down tokenizer and need to be measured
2017-11-15 17:55:48 +03:00
Matthew Honnibal
cf0be62096 Increment version 2017-11-15 15:00:18 +01:00
ines
97a4f9362b Merge branch 'master' of https://github.com/explosion/spaCy 2017-11-15 14:24:00 +01:00
ines
8e65247886 Fix lex.id if vectors is None 2017-11-15 14:23:58 +01:00
Matthew Honnibal
437ad1a852
Merge pull request #1570 from explosion/feature/fix-beam-leak
Fix memory leak in beam parser
2017-11-15 14:15:05 +01:00
Matthew Honnibal
2f169fdb0a Set lex ID correctly for new tokens in Vocab 2017-11-15 13:58:03 +01:00
Matthew Honnibal
fe3c42a06b Fix caching in tokenizer 2017-11-15 13:55:46 +01:00
Matthew Honnibal
8d692771f6 Improve profiling 2017-11-15 13:51:25 +01:00
Matthew Honnibal
b797dca977 Merge branch 'master' of https://github.com/explosion/spaCy 2017-11-15 13:11:43 +01:00
ines
c9d72de0fb Add dummy serialization methods for Japanese and missing lang getter (resolves #1557) 2017-11-15 12:44:02 +01:00
Matthew Honnibal
d274d3a3b9 Let beam forward use minibatches 2017-11-15 00:51:42 +01:00
Matthew Honnibal
855872f872 Remove state hashing 2017-11-14 23:36:46 +01:00
Roman Domrachev
3e21680814 Use safer method to get string without hit 2017-11-14 22:58:46 +03:00
Roman Domrachev
a33d5a068d Try to hold origin data instead of restore it 2017-11-14 22:40:03 +03:00
Roman Domrachev
91e2fa6561 Clean all caches 2017-11-14 21:15:04 +03:00
Roman Domrachev
4e378dc4a4 Remove all obsolete code and test only initial problem 2017-11-14 20:45:04 +03:00
Roman
47ce2347b0
Create test that fails when actual cleanup caused 2017-11-14 20:28:13 +03:00
Roman
caae77f72d
Update strings.pyx 2017-11-14 19:44:40 +03:00
Roman Domrachev
3d247d2bb8 Get back previous testcase 2017-11-14 18:01:37 +03:00
Roman Domrachev
870defa815 Swap keys in proper place
Remove unnecessary clear of the hits
2017-11-14 17:56:30 +03:00
Roman Domrachev
86ca434c93 Merge github.com:explosion/spaCy 2017-11-14 17:46:22 +03:00
Roman Domrachev
a2745b0e84 StringStore now actually cleaned
Do not lose docs in ref tracking
2017-11-14 17:45:50 +03:00
Matthew Honnibal
2512ea9eeb Fix memory leak in beam parser 2017-11-14 02:11:40 +01:00
Matthew Honnibal
86ddf692a1 Fix bug in limit calculation on dev data 2017-11-14 01:37:10 +01:00
Ines Montani
ea6c85c67a
Merge pull request #1566 from MathiasDesch/master (resolves #1248)
Add exceptions to tokenizer and norm
2017-11-13 19:05:22 +01:00
Matthew Honnibal
1b348389bb Merge branch 'master' of https://github.com/explosion/spaCy 2017-11-13 18:18:48 +01:00
Matthew Honnibal
ca73d0d8fe Cleanup states after beam parsing, explicitly 2017-11-13 18:18:26 +01:00
Matthew Honnibal
63ef9a2e73 Remove __dealloc__ from ParserBeam 2017-11-13 18:18:08 +01:00
Mathias Deschamps
c0691b2ab4 Add tokenizer exceptions for ing verbs
Extend list of tokenizing exceptions introduced in 123810b
2017-11-13 17:46:05 +01:00
Mathias Deschamps
288298ead9 Add norm exception for ing verbs
Some ing verbs are sometimes written in or in'. Make the NORM form correct
2017-11-13 17:46:05 +01:00
Abhinav Sharma
59f5740ede
improved upon the list of included stop_words 2017-11-13 17:13:49 +05:30
Matthew Honnibal
6e641f46d4 Create a preprocess function that gets bigrams 2017-11-12 00:43:41 +01:00
Matthew Honnibal
c9251d79e3
Edit comment 2017-11-11 18:38:32 +01:00
Matthew Honnibal
dd1678eab3
Edit comment 2017-11-11 18:37:08 +01:00
Roman Domrachev
ee60a52ee7 Fix test imports and last batch cleanup 2017-11-11 11:32:16 +03:00
Roman Domrachev
4a6b094e09 Remove unused import 2017-11-11 03:13:05 +03:00
Roman Domrachev
3c600adf23 Try to fix StringStore clean up (see #1506) 2017-11-11 03:11:27 +03:00
ines
ee97fd3cb4 Add regression test for #1547 2017-11-11 00:14:03 +01:00
ines
2df27db671 Add unicode declaration 2017-11-11 00:13:56 +01:00
ines
35653bef3a Add missing import (fixes #1546) 2017-11-10 19:05:18 +01:00
ines
4c5d2c80d5 Re-add python -m to commands, too brittle :( (see #1536) 2017-11-10 02:30:55 +01:00
ines
123810b6de Add "lovin'" to tokenizer exceptions (see #1248) 2017-11-09 17:09:30 +01:00
ines
1c218397f6 Ensure path in Doc.to_disk/from_disk (resolves ##1521)
Also add Doc serialization tests with both Path and string path options
2017-11-09 02:29:03 +01:00
Matthew Honnibal
49fd5a646f Set version for 2.0.2 release 2017-11-08 22:39:39 +01:00
Matthew Honnibal
fba2dbddf7 Increment version 2017-11-08 22:19:08 +01:00
Matthew Honnibal
a5ea0fdf5a Fix #1518: vocab.vectors.resize() didn't work 2017-11-08 22:18:37 +01:00
Matthew Honnibal
de45702bbe Strip dev suffixes from version for compatibility check 2017-11-08 18:40:21 +01:00
Matthew Honnibal
51639214a1 Merge branch 'master' of https://github.com/explosion/spaCy 2017-11-08 18:04:33 +01:00
Matthew Honnibal
a2f980de4e Exclude .devN versioning from compatibility check 2017-11-08 18:03:52 +01:00
Daniel Hershcovich
d7ae54ff44
Fix typo in message 2017-11-08 16:06:28 +02:00
Matthew Honnibal
4194bc5744 Xfail flakey serialization test 2017-11-08 13:55:13 +01:00
Matthew Honnibal
d5537e5516 Work on Windows test failure 2017-11-08 13:25:18 +01:00
Matthew Honnibal
c27c82d5f9 Fix serialization 2017-11-08 13:08:48 +01:00
Matthew Honnibal
1d5599cd28 Fix dtype 2017-11-08 12:18:32 +01:00
Matthew Honnibal
fa7fdd0d9b Merge branch 'master' of https://github.com/explosion/spaCy 2017-11-08 12:11:31 +01:00
Matthew Honnibal
072ff38a01 Try to fix python3.5 serialization 2017-11-08 12:10:49 +01:00
Ines Montani
3a0f34d567
Merge pull request #1509 from abhi18av/patch-1
Create examples.py for Hindi language
2017-11-08 11:37:19 +01:00
Ines Montani
42b241ccd0
Update language code in usage example in comment 2017-11-08 11:36:38 +01:00
Matthew Honnibal
e262e8d942 Increment version to v2.0.2.dev0 2017-11-08 11:25:47 +01:00
Matthew Honnibal
a8b592783b Make a dtype more specific, to fix a windows build 2017-11-08 11:24:35 +01:00
Abhinav Sharma
84edade82d
Create examples.py
Populated the file with the translations of English example sentences
2017-11-08 13:23:08 +05:30
Matthew Honnibal
d725aee4e2 Increment version to 2.0.1 2017-11-08 02:14:47 +01:00
Matthew Honnibal
8d6f68f1df Increment version 2017-11-08 01:12:34 +01:00
ines
bcf42b8846 Fix typo 2017-11-08 01:06:37 +01:00
Matthew Honnibal
bbd2a3dee1 Fix title in about.py 2017-11-07 14:02:58 +01:00
Matthew Honnibal
4efaf9306c Set version to spacy-nightly rc2 2017-11-07 13:27:26 +01:00
Matthew Honnibal
bf1ec2965f Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-07 13:20:29 +01:00
Matthew Honnibal
726f689da4 Fix missing import 2017-11-07 13:20:12 +01:00
ines
834f9c1aab Update about.py 2017-11-07 13:11:33 +01:00
ines
a4662a31a9 Move model package templates to cli.package and update docs 2017-11-07 12:15:35 +01:00
ines
a09c096d3c Get docs ready for v2.0.0 2017-11-07 12:00:43 +01:00
Matthew Honnibal
9a88e66103 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-07 02:00:06 +01:00
Matthew Honnibal
174abe4677 Increment to 2.0.0rc1 2017-11-07 01:59:46 +01:00
ines
42a0fbf291 Fix textcat simple train example 2017-11-07 01:25:54 +01:00
ines
8fb48b9b91 Update and document new util functions 2017-11-07 00:22:43 +01:00
Matthew Honnibal
1cab703bba Move minibatch function to util 2017-11-06 23:45:36 +01:00
ines
5f43953536 Move test 2017-11-06 23:14:10 +01:00
Matthew Honnibal
dd90fe09f5 Remove extraneous label from textcat class 2017-11-06 22:09:02 +01:00
Matthew Honnibal
45e0617e61 Allow Language.update to take unicode text and dict objects 2017-11-06 22:07:38 +01:00
Matthew Honnibal
1831dbd065 Add test of simple textcat workflow 2017-11-06 22:04:29 +01:00
Matthew Honnibal
ffb9101f3f Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-06 19:20:41 +01:00
Matthew Honnibal
8fea512ac8 Don't set tensor in textcat 2017-11-06 19:20:14 +01:00
ines
acb9bdb852 Fix PRON_LEMMA imports 2017-11-06 17:41:53 +01:00
Matthew Honnibal
7d46793dd7 Add PRON_LEMMA to spacy.symbols 2017-11-06 17:38:25 +01:00
Matthew Honnibal
2f7e9f390d Make test less flakey 2017-11-06 17:34:50 +01:00
Matthew Honnibal
407b08017e Make test less flakey 2017-11-06 17:31:40 +01:00
Matthew Honnibal
102f797933 Fix lemma ordering in test 2017-11-06 17:02:17 +01:00
Matthew Honnibal
75e1618ec3 Fix lemma clobbering 2017-11-06 16:56:19 +01:00
Matthew Honnibal
6fdffd7246
Merge pull request #1497 from explosion/feature/improve-optimizer-handling
💫 Improve optimizer handling
2017-11-06 16:41:15 +01:00
Matthew Honnibal
8e6795437b Set release=True 2017-11-06 16:39:32 +01:00
Matthew Honnibal
5c85bf3791 Fix missing import 2017-11-06 15:06:27 +01:00
Matthew Honnibal
25859dbb48 Return optimizer from begin_training, creating if necessary 2017-11-06 14:26:49 +01:00
Matthew Honnibal
465adfee94 Remove unused resume_training method, and pass optimizer through 2017-11-06 14:26:00 +01:00
Matthew Honnibal
13336a6197 Fix Adam import 2017-11-06 14:25:37 +01:00
Matthew Honnibal
2eb11d60f2 Add function create_default_optimizer to spacy._ml 2017-11-06 14:11:59 +01:00
Matthew Honnibal
31babe3c3f Fix non-clobbering lemmatization 2017-11-06 12:36:05 +01:00
Matthew Honnibal
63c6ae4191 Fix lemmatizer test 2017-11-06 11:57:06 +01:00
Matthew Honnibal
a86a0181b5 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-05 22:19:10 +01:00
Matthew Honnibal
134d3b8143 Fix morphology 2017-11-05 22:18:22 +01:00
ines
08d1cf850a Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-05 21:41:58 +01:00
ines
baa231745c Fix Dutch tag map 2017-11-05 21:41:50 +01:00
Matthew Honnibal
46e62ad747 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-05 19:40:00 +01:00
Matthew Honnibal
bb25cb0f76 Avoid clobbering preset lemmas 2017-11-05 19:39:38 +01:00
ines
507ecb67af Fix Spanish tag map 2017-11-05 19:23:34 +01:00
Matthew Honnibal
320008352b Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-05 18:46:15 +01:00
Matthew Honnibal
38109a0e4a Register SentenceSegmenter in Language.factories 2017-11-05 18:45:57 +01:00
ines
975e1042ff Fix Italian tag map 2017-11-05 18:34:09 +01:00
ines
6b2d6e4937 Fix Portuguese tag map 2017-11-05 18:31:00 +01:00
ines
fa2687fded Fix Dutch tag map 2017-11-05 17:57:59 +01:00
ines
fb8990d916 Fix Spanish tag map 2017-11-05 17:48:46 +01:00
ines
9d13288f73 Fix French tag map 2017-11-05 17:47:59 +01:00
ines
54579805c5 Fix French tag map 2017-11-05 17:44:05 +01:00
Matthew Honnibal
2b35bb76ad Fix tensorizer on GPU 2017-11-05 15:34:40 +01:00
Matthew Honnibal
6e5181bbaa Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-05 15:33:56 +01:00
Matthew Honnibal
6f438b17c1 Increment version to v2.0.0a19 2017-11-05 14:43:36 +01:00
Matthew Honnibal
225cc249c9 Pass string path to numpy, to fix #1479 2017-11-05 14:42:46 +01:00
Matthew Honnibal
00435d8f0c Add extra beam parsing test 2017-11-05 14:39:57 +01:00
Matthew Honnibal
e777ea25bb
Merge pull request #1492 from uwol/develop
TextCategorizer return parameter fix
2017-11-05 14:13:04 +01:00
Matthew Honnibal
0d4bd6414e Fix Italian tag map 2017-11-05 14:11:03 +01:00
ines
ef597622a6 Add Portuguese tag map 2017-11-05 13:58:34 +01:00
ines
793c62dfda Add Dutch tag map 2017-11-05 13:48:07 +01:00
ines
f7485a09c8 Fix Italian tag map 2017-11-05 13:12:58 +01:00
uwol
a2162b8908 tensorizer return parameter fix 2017-11-05 12:25:10 +01:00
ines
0a27afbf86 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-04 23:32:52 +01:00
ines
3cef901834 Add tag map for French and Italian 2017-11-04 23:32:51 +01:00
Matthew Honnibal
cfb83c231c Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-04 23:08:19 +01:00
Matthew Honnibal
d185927998 Undo harmful pickling hacks on Language class 2017-11-04 23:07:03 +01:00
ines
6c15aafebd Fix formatting 2017-11-04 23:07:02 +01:00
Matthew Honnibal
3ca16ddbd4 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-04 00:25:02 +01:00
Matthew Honnibal
e4ec4be948 Fix parser test 2017-11-04 00:23:45 +01:00
Matthew Honnibal
98c29b7912 Add padding vector in parser, to make gradient more correct 2017-11-04 00:23:23 +01:00
ines
5e7d98f72a Remove test for #1491 2017-11-03 22:10:57 +01:00
ines
718f1c50fb Add regression test for #1491 2017-11-03 21:11:20 +01:00
Matthew Honnibal
144a93c2a5 Back-off to tensor for similarity if no vectors 2017-11-03 20:56:33 +01:00
Matthew Honnibal
1e9634691a Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-03 20:21:15 +01:00
Matthew Honnibal
13c8881d2f Expose parser's tok2vec model component 2017-11-03 20:20:59 +01:00
Matthew Honnibal
17c63906f9 Update tensorizer component 2017-11-03 20:20:26 +01:00
Matthew Honnibal
2bf21cbe29 Update model after optimising it instead of waiting 2017-11-03 20:20:01 +01:00
Matthew Honnibal
d6e831bf89 Fix lemmatizer tests 2017-11-03 19:46:34 +01:00
ines
eef930c73e Assert instead of print 2017-11-03 18:50:57 +01:00
ines
f0986df94b Add test for #1488 (passes on v2.0.0a18?) 2017-11-03 14:44:36 +01:00
Matthew Honnibal
711278b667 Make test less flakey 2017-11-03 14:36:08 +01:00
Matthew Honnibal
7fea845374 Remove print statement 2017-11-03 14:04:51 +01:00
Matthew Honnibal
0a534ae96a Fix test for backprop d_pad 2017-11-03 14:04:16 +01:00
Matthew Honnibal
33bd2428db Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-03 13:29:56 +01:00
Matthew Honnibal
6681058abd Fix tensor extending in tagger 2017-11-03 13:29:36 +01:00
Matthew Honnibal
bd2cbdfa85 Make Morphology not fail on unknown tags 2017-11-03 13:29:09 +01:00
Matthew Honnibal
c9b118a7e9 Set softmax attr in tagger model 2017-11-03 11:22:01 +01:00
Matthew Honnibal
a5b05f85f0 Set Doc.tensor attribute in parser 2017-11-03 11:21:00 +01:00
Matthew Honnibal
62ed58935a Add Doc.extend_tensor() method 2017-11-03 11:20:31 +01:00
Matthew Honnibal
d6fc39c8a6 Set Doc.tensor from Tagger 2017-11-03 11:20:05 +01:00
Matthew Honnibal
b3264aa5f0 Expose the softmax layer in the tagger model, to allow setting tensors 2017-11-03 11:19:51 +01:00
Matthew Honnibal
c2bbf076a4 Add document length cap for training 2017-11-03 01:54:54 +01:00
Matthew Honnibal
6771780d3f Fix backprop of padding variable 2017-11-03 01:54:34 +01:00
Matthew Honnibal
54a716f2ec Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-03 00:55:20 +01:00
Matthew Honnibal
260e6ee3fb Improve efficiency of backprop of padding variable 2017-11-03 00:49:11 +01:00
Matthew Honnibal
a22f96c3f1 Add test for backpropagating padding 2017-11-03 00:48:54 +01:00
ines
9baab241b4 Add skeleton language data for Turkish 2017-11-02 16:32:24 +01:00
ines
c6fea3e5f6 Add Romanian and Croatian skeletons (experimental)
Add language data templates to make it easier for others to contribute to the language support
2017-11-01 23:04:28 +01:00
ines
18c859500b Add missing imports 2017-11-01 23:02:51 +01:00
ines
819e30a26e Tidy up tokenizer exceptions 2017-11-01 23:02:45 +01:00
ines
3af281a334 Update test model name 2017-11-01 23:02:00 +01:00
Matthew Honnibal
b30dd36179 Allow Tagger.add_label() before training 2017-11-01 21:49:24 +01:00
Matthew Honnibal
eca41f0cf6 Fix filename conversion for conllu 2017-11-01 21:26:49 +01:00
Matthew Honnibal
e237472cdc Fix tag and filename conversion for conllu 2017-11-01 21:25:33 +01:00
Matthew Honnibal
b84d99b281 Revert tagger.add_label() changes, to fix model 2017-11-01 21:10:45 +01:00
Matthew Honnibal
f5855e539b Fix tagger model loading 2017-11-01 20:42:36 +01:00
Matthew Honnibal
624644adfe Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-01 20:26:41 +01:00
ines
5f661a1b3a Remove tensorizer from pre-set pipe_names 2017-11-01 19:48:33 +01:00
Matthew Honnibal
190522efd3 Fix tagger when some tags aren't in Morphology 2017-11-01 19:27:49 +01:00
Matthew Honnibal
e85e31cfbd Fix backprop of d_pad 2017-11-01 19:27:26 +01:00
Matthew Honnibal
759cc79185 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-01 19:00:19 +01:00
Matthew Honnibal
1ae40b50b4 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-01 17:07:02 +01:00
Matthew Honnibal
7ae1aacdb8 Fix add_label methods 2017-11-01 17:06:43 +01:00
ines
8c2260e18c Move span tests to /doc 2017-11-01 16:56:35 +01:00
Matthew Honnibal
2ef7b59eb0 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-01 16:51:41 +01:00
ines
1d1f91a041 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-01 16:49:44 +01:00
ines
9659391944 Update deprecated methods and add warnings 2017-11-01 16:49:42 +01:00
ines
260cb37224 Catch deprecation warning 2017-11-01 16:49:18 +01:00
ines
5914faafbb Fix .merge tests to not use deprecated API 2017-11-01 16:49:11 +01:00
ines
705a4e3e4a Fix formatting 2017-11-01 16:44:08 +01:00
Matthew Honnibal
d17a12c71d Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-01 16:38:26 +01:00
Matthew Honnibal
9f9439667b Don't create low-data text classifier if no vectors 2017-11-01 16:34:09 +01:00
Matthew Honnibal
e7a9174877 Add add_label methods to Tagger and TextCategorizer 2017-11-01 16:32:44 +01:00
ines
39e0586192 Add deprecated helper
Uses warning to show DeprecationWarning and custom stack trace
2017-11-01 16:32:36 +01:00
Matthew Honnibal
a7bf38bf31 Remove misleading comment on util.get_cuda_stream() 2017-11-01 13:57:25 +01:00
Matthew Honnibal
273e96b63f Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-01 13:27:35 +01:00
Matthew Honnibal
9e0ebee81c Add Token.is_sent_start property, so can deprecate Token.sent_start 2017-11-01 13:27:14 +01:00
Matthew Honnibal
7e7116cdf7 Fix Doc.to_array when only one string attr provided 2017-11-01 13:26:43 +01:00
Matthew Honnibal
301fb2bb60 Implement Span.n_lefts and Span.n_rights 2017-11-01 13:25:12 +01:00
Matthew Honnibal
c047498f87 Fix vectors test 2017-11-01 13:24:47 +01:00
ines
9a5e7c6fe2 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-01 13:14:45 +01:00
ines
bfe17b7df1 Fix begin_training if get_gold_tuples is None 2017-11-01 13:14:31 +01:00
ines
affd3404ab Remove old model command (now "vocab") 2017-11-01 13:14:03 +01:00
Matthew Honnibal
fdb4b8e456 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-01 02:07:17 +01:00
Matthew Honnibal
c48dd0e1d3 Fix vector pruning 2017-11-01 02:06:58 +01:00
ines
37e62ab0e2 Update vector meta in meta.json 2017-11-01 01:25:09 +01:00
ines
96b4aef0bf Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-01 01:10:53 +01:00
Matthew Honnibal
86eba61fae Fix token.vector when vectors are missing 2017-11-01 00:47:35 +01:00
ines
5683fd65ed Update docstrings 2017-11-01 00:42:39 +01:00
Matthew Honnibal
44bce8e53f Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-01 00:35:16 +01:00
Matthew Honnibal
c16310d156 Update vectors with find method 2017-11-01 00:34:55 +01:00
Ines Montani
d11659463b
Merge pull request #1152 from jimregan/develop-irish
[WIP] attempt a port from #1147
2017-11-01 00:23:43 +01:00
ines
2ad2f09d12 Update docstrings and simplify most_similar 2017-11-01 00:18:08 +01:00
Jim O'Regan
08b0bfd153 merge 2017-10-31 22:55:59 +00:00
Jim O'Regan
00ecfa5417 Ó, not O 2017-10-31 22:54:42 +00:00
ines
ba2e6c8c6f Update docstrings and formatting 2017-10-31 23:23:34 +01:00
Matthew Honnibal
0de8d213a3
Merge pull request #1475 from explosion/feature/sm-vectors
Improve and simplify Vectors class
2017-10-31 22:59:50 +01:00
Ines Montani
25b1d6cd91
Fix syntax error 2017-10-31 22:36:03 +01:00
Matthew Honnibal
92dc127569 Fix test for Python 3 2017-10-31 22:21:55 +01:00
Jim O'Regan
fe4b10346a replace example sentence until I get around to adding a punctuation.py 2017-10-31 20:24:53 +00:00
Matthew Honnibal
c5799ecc7b Remove print statement 2017-10-31 21:12:33 +01:00
ines
7e424a1804 Don't copy exception dicts if not necessary and tidy up 2017-10-31 21:05:29 +01:00
Matthew Honnibal
c390f2d745 Make it easier to pass explicit no-pruning to vocab 2017-10-31 20:14:47 +01:00
Ines Montani
06c25a8882
Remove comma that caused list to wrap in tuple!
Also removed extra dict wrappings for performance (we used to have them in there, but they should only really exist if copying the dict is absolutely necessary)
2017-10-31 20:13:16 +01:00
Matthew Honnibal
d90a22afe6 Fix loading previous vectors models 2017-10-31 19:58:35 +01:00
Ines Montani
147448b65b
Add missing symbols 2017-10-31 19:34:45 +01:00
Matthew Honnibal
997a61557a Add vectors.n_keys property 2017-10-31 19:30:52 +01:00
Matthew Honnibal
8075726838 Restore vector usage in models 2017-10-31 19:21:17 +01:00
Matthew Honnibal
3659a807b0 Remove vector pruning arg from train CLI 2017-10-31 19:21:05 +01:00
Ines Montani
9b0de9fb43
Fix import of symbols (now nested one level lower) 2017-10-31 19:17:58 +01:00
Matthew Honnibal
59203a2e8a Move vector pruning command into spacy vocab cli tool 2017-10-31 19:10:01 +01:00
Matthew Honnibal
77d8f5de9a Revise and simplify Vectors class 2017-10-31 18:25:08 +01:00
Jim O'Regan
d4a8160c36 change quotes 2017-10-31 15:15:44 +00:00
Jim O'Regan
34ca59691b no idea what is wrong here 2017-10-31 14:50:13 +00:00
Jim O'Regan
41dd29e48e merge 2017-10-31 14:07:45 +00:00
Matthew Honnibal
cb5217012f Fix vector remapping 2017-10-31 11:40:46 +01:00
Matthew Honnibal
9c11ee4a1c WIP on vectors fixes 2017-10-31 11:22:56 +01:00
Matthew Honnibal
ce876c551e Fix GPU usage 2017-10-31 02:33:34 +01:00
Matthew Honnibal
7698903617 Fix GPU usage 2017-10-31 02:33:16 +01:00
Matthew Honnibal
368fdb389a WIP on refactoring and fixing vectors 2017-10-31 02:00:26 +01:00
Matthew Honnibal
4e3006cec7 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-30 19:44:58 +01:00
Matthew Honnibal
4112a991ec Fix vector pruning 2017-10-30 19:44:40 +01:00
ines
ec657c1ddc Update vocab docs and document Vocab.prune_vectors 2017-10-30 19:35:41 +01:00
ines
803e41bc66 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-30 18:39:51 +01:00
ines
8e02294241 Add vectors to Language.meta 2017-10-30 18:39:48 +01:00
ines
abf8aa05d3 Populate --create-meta defaults from file if available
If meta.json is found in directory and user chooses to overwrite it, show existing data as defaults.
2017-10-30 18:39:38 +01:00
ines
ce98fa7934 Fix formatting 2017-10-30 18:38:55 +01:00
ines
98c35d2585 Fix spacy vocab command 2017-10-30 18:38:41 +01:00
Matthew Honnibal
e98451b5f7 Add -prune-vectors argument to spacy.cly.train 2017-10-30 18:00:10 +01:00
Matthew Honnibal
e026b29ea9 Add prune_vectors method to Vocab 2017-10-30 17:59:43 +01:00
Explosion Bot
d0cf12c8c7 Fix off-by-one error in vectors 2017-10-30 16:22:03 +01:00
Explosion Bot
05a1dd570e Fix vocab script 2017-10-30 16:19:22 +01:00
Explosion Bot
b46bdce8d2 Add missing import 2017-10-30 16:18:10 +01:00
Explosion Bot
2d2cc294b4 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-30 16:15:05 +01:00
Explosion Bot
0fc1209421 Wire up new vocab command 2017-10-30 16:14:50 +01:00
Explosion Bot
aa64031751 Fix clear_vectors() method on Vocab 2017-10-30 16:09:04 +01:00
Explosion Bot
7b56b2f04b Add Vocab.cfg attr, to hold stuff like oov probs 2017-10-30 16:08:50 +01:00
Explosion Bot
ab5d5ed880 Fix vectors.add() 2017-10-30 16:08:09 +01:00
Explosion Bot
41d0f1665a Fix add_attrs for cluster 2017-10-30 16:07:50 +01:00
ines
5453821a9f Update NER annotation scheme
Add note on training data sources and include coarse-grained Wikipedia scheme
2017-10-30 13:53:49 +01:00
Explosion Bot
5ede7cec9b Improve Lexeme.set_attrs method 2017-10-30 11:49:11 +01:00
Explosion Bot
72aea8f105 Update vectors.add() to allow setting keys to rows 2017-10-30 10:03:08 +01:00
Matthew Honnibal
c43cc5361d
Merge pull request #1467 from explosion/feature/better-parser
💫 Bug fixes to parser model (requires retraining)
2017-10-29 02:05:22 +02:00
ines
6c2d8d3b2a Use shortcuts-nightly.json to resolve model shortcuts 2017-10-29 01:28:31 +02:00
Matthew Honnibal
a0c7dabb72 Fix bug in 8-token parser features 2017-10-28 23:01:35 +00:00
Matthew Honnibal
b713d10d97 Switch to 13 features in parser 2017-10-28 23:01:14 +00:00
Matthew Honnibal
3b91097321 Whitespace 2017-10-28 17:05:11 +00:00
Matthew Honnibal
6ef72864fa Improve initialization for hidden layers 2017-10-28 17:05:01 +00:00
Matthew Honnibal
5414e2f14b Use missing features in parser 2017-10-28 16:45:54 +00:00
Matthew Honnibal
df4803cc6d Add learned missing values for parser 2017-10-28 16:45:14 +00:00
Matthew Honnibal
64e4ff7c4b Merge 'tidy-up' changes into branch. Resolve conflicts 2017-10-28 13:16:06 +02:00
Explosion Bot
fb0c96f39a Fix optimizer loading 2017-10-28 11:58:16 +02:00
Explosion Bot
b22e42af7f Merge changes to parser and _ml 2017-10-28 11:52:10 +02:00
ines
d96e72f656 Tidy up rest 2017-10-27 21:07:59 +02:00
ines
a8e10f94e4 Tidy up Lexeme and update docs 2017-10-27 21:07:50 +02:00
ines
ba5e646219 Tidy up pipeline 2017-10-27 20:29:08 +02:00
ines
b4d226a3f1 Tidy up syntax 2017-10-27 19:45:57 +02:00
ines
5167a0cce2 Tidy up Vectors and docs 2017-10-27 19:45:19 +02:00
ines
7946464742 Remove spacy.tagger (now in pipeline) 2017-10-27 19:45:04 +02:00
ines
9c89e2cdef Remove unused syntax iterators (now in language data) 2017-10-27 18:09:53 +02:00
ines
d2df81d907 Fix not implemented Span getters 2017-10-27 18:09:28 +02:00
ines
544a407b93 Tidy up Doc, Token and Span and add missing docs 2017-10-27 17:07:26 +02:00
ines
a6135336f5 Tidy up gold 2017-10-27 17:02:55 +02:00
ines
6a0483b7aa Tidy up and document Doc, Token and Span 2017-10-27 15:41:45 +02:00
ines
1a559d4c95 Remove old, unused file 2017-10-27 15:34:35 +02:00
ines
91899d337b Tidy up language, lemmatizer and scorer 2017-10-27 14:40:14 +02:00
ines
778212efea Tidy up init and main 2017-10-27 14:39:51 +02:00
ines
e33b7e0b3c Tidy up parser and ML 2017-10-27 14:39:30 +02:00
ines
e3265998c0 Tidy up displaCy 2017-10-27 14:39:19 +02:00
ines
ea4a41c8fb Tidy up util and helpers 2017-10-27 14:39:09 +02:00
ines
d941fc3667 Tidy up CLI 2017-10-27 14:38:39 +02:00
Matthew Honnibal
531142a933 Merge remote-tracking branch 'origin/develop' into feature/better-parser 2017-10-27 12:34:48 +00:00
Matthew Honnibal
19a2b9bf27 Fix import of Optimizer 2017-10-27 12:33:42 +00:00
Matthew Honnibal
4d048e94d3 Add compat for thinc.neural.optimizers.Optimizer 2017-10-27 10:23:49 +00:00
Ines Montani
4033e70c71 Merge pull request #1461 from explosion/feature/disable-pipes
💫 Add Language.disable_pipes(), to temporarily edit pipeline and update code examples
2017-10-27 12:21:40 +02:00
Matthew Honnibal
75a637fa43 Remove redundant imports from _ml 2017-10-27 10:19:56 +00:00
Matthew Honnibal
c9987cf131 Avoid use of numpy.tensordot 2017-10-27 10:18:36 +00:00
Matthew Honnibal
f6fef30adc Remove dead code from spacy._ml 2017-10-27 10:16:41 +00:00
Matthew Honnibal
b9616419e1 Add try/except around bz2 import 2017-10-27 01:18:05 +00:00
Matthew Honnibal
783c0c8795 Remove unnecessary bz2 import 2017-10-27 01:17:54 +00:00
Matthew Honnibal
bb25bdcd92 Adjust call to scatter_add for the new version 2017-10-27 01:16:55 +00:00
Ines Montani
287a3ca256 Merge pull request #1466 from explosion/feature/rename-pipeline
💫 Clean up dead linear model code
2017-10-27 02:03:28 +02:00
ines
4eb5bd02e7 Update textcat pre-processing after to_array change 2017-10-27 00:32:12 +02:00
ines
2d6ec99884 Set 'model' as default model name to prevent meta.json errors 2017-10-26 16:12:23 +02:00
ines
9e372913e0 Remove old 'SP' condition in tag map 2017-10-26 16:11:57 +02:00
Matthew Honnibal
c52671420c Remove old cfile import 2017-10-26 13:28:19 +02:00
Matthew Honnibal
ea03f1ef64 Remove obsolete cfile code 2017-10-26 13:23:36 +02:00
Matthew Honnibal
90d1d9b230 Remove obsolete parser code 2017-10-26 13:22:45 +02:00
ines
6f78e29bed Add LAW entity label to glossary 2017-10-26 13:04:35 +02:00
ines
9bf78d5fb3 Update spacy.explain docs 2017-10-26 13:04:25 +02:00
Matthew Honnibal
33f8c58782 Remove obsolete parser.pyx 2017-10-26 12:42:05 +02:00
Matthew Honnibal
a8abc47811 Rename BaseThincComponent --> Pipe 2017-10-26 12:40:40 +02:00
Matthew Honnibal
b0f3ea2200 Fix names of pipeline components
NeuralDependencyParser --> DependencyParser
NeuralEntityRecognizer --> EntityRecognizer
TokenVectorEncoder     --> Tensorizer
NeuralLabeller         --> MultitaskObjective
2017-10-26 12:38:23 +02:00
Matthew Honnibal
b6b4f1aaf7 Merge pull request #1462 from explosion/feature/vector-meta-data
💫 Add vector meta data to model meta.json on train/package and show in docs
2017-10-26 11:39:41 +02:00
Matthew Honnibal
35977bdbb9 Update better-parser branch with develop 2017-10-26 00:55:53 +00:00
Ines Montani
090bd00369 Merge pull request #1464 from mayukh18/develop_bengali_pronouns
added the bengali pronouns for v2.0
2017-10-25 21:55:25 +02:00
mayukh18
1bc07758fa added few bengali pronouns 2017-10-25 22:24:40 +05:30
ines
de1e5f35d5 Merge branch 'develop' into feature/disable-pipes 2017-10-25 16:33:12 +02:00
ines
728b609bf9 Merge branch 'develop' into feature/vector-meta-data 2017-10-25 16:32:22 +02:00
ines
c0b55ebdac Fix PhraseMatcher.__contains__ and add more tests 2017-10-25 16:31:11 +02:00
ines
91beacf5e3 Fix Matcher.__contains__ 2017-10-25 16:19:38 +02:00
ines
11e3f19764 Fix vectors data added after training (see #1457) 2017-10-25 16:08:26 +02:00
ines
057954695b Read pipeline and vector data off model in --generate-meta 2017-10-25 16:03:26 +02:00
ines
273e638183 Add vector data to model meta after training (see #1457) 2017-10-25 16:03:05 +02:00
ines
18aae423fb Remove import of non-existing function 2017-10-25 15:54:10 +02:00
ines
5117a7d24d Fix whitespace 2017-10-25 15:54:02 +02:00
ines
657a4d91bc Merge branch 'develop' into feature/disable-pipes 2017-10-25 15:19:05 +02:00
ines
1a722dac31 Merge branch 'develop' into feature/disable-pipes 2017-10-25 15:18:18 +02:00
ines
6a00de4f77 Fix check of unexpected pipe names in restore() 2017-10-25 14:56:35 +02:00
ines
7f03932477 Return self on __enter__ 2017-10-25 14:56:16 +02:00
Matthew Honnibal
b5de768852 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-25 14:44:16 +02:00
Matthew Honnibal
094512fd47 Fix model-mark on regression test. 2017-10-25 14:44:00 +02:00
Matthew Honnibal
e70f80f29e Add Language.disable_pipes() 2017-10-25 13:46:41 +02:00
Matthew Honnibal
075e8118ea Update from develop 2017-10-25 12:45:21 +02:00
ines
72497c8cb2 Remove comments and add TODO 2017-10-25 12:15:43 +02:00
ines
4d97efc3b5 Add missing docstrings 2017-10-25 12:10:16 +02:00
ines
1262aa0bf9 Implement PhraseMatcher.__contains__ 2017-10-25 12:10:04 +02:00
ines
9c733a8849 Implement PhraseMatcher.__len__ 2017-10-25 12:09:56 +02:00
ines
7eebeeaf85 Fix Matcher.__contains__ 2017-10-25 12:09:47 +02:00
ines
7bcec57462 Remove unused attribute 2017-10-25 12:08:54 +02:00
ines
0b1dcbac14 Remove unused function 2017-10-25 12:08:46 +02:00
ines
3484174e48 Add Language.path 2017-10-25 11:57:43 +02:00
Ines Montani
d3bf488e16 Merge pull request #1171 from mollerhoj/support-danish
Improve basic support for Danish
2017-10-24 20:29:57 +02:00
Matthew Honnibal
d9bb1e5de8 Increment version 2017-10-24 17:06:19 +02:00
Matthew Honnibal
908809d488 Update tests 2017-10-24 17:05:15 +02:00
Matthew Honnibal
66766c1454 Restore SP tag to English tag_map, until models migrate 2017-10-24 17:05:00 +02:00
Matthew Honnibal
30e67fa808 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-24 16:08:23 +02:00
Matthew Honnibal
b0f6fd3f1d Disable tokenizer cache for special-cases. Fixes #1250 2017-10-24 16:08:05 +02:00
Matthew Honnibal
63f0bde749 Add test for #1250: Tokenizer cache clobbered special-case attrs 2017-10-24 16:07:18 +02:00
ines
8492d5be6d Always make lemmatizer return a list of lemmas, not a set 2017-10-24 16:00:56 +02:00
ines
95f866f99f Add lookup argument to Lemmatizer.load 2017-10-24 16:00:56 +02:00
ines
95f6174516 Remove tensorizer from model pipeline example in spacy package 2017-10-24 16:00:56 +02:00
ines
090aed940a Add test for currently failing span.as_doc case 2017-10-24 16:00:56 +02:00
ines
4ef81a9ebc Fix whitespace 2017-10-24 16:00:56 +02:00
Matthew Honnibal
18f1c1d0ba Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-24 14:29:43 +02:00
Matthew Honnibal
4bea65a1a8 Fix Issue #1450: Off-by-1 in * and ? matches
Patterns that end in variable-length operators e.g. * and ? now end on
the correct token. Previously, they were off by 1: the next token was
pulled into the match, even if that's where the pattern failed.
2017-10-24 14:26:27 +02:00
Matthew Honnibal
391d5ef0d1 Normalize imports in regression test 2017-10-24 14:25:49 +02:00
ines
c55db0a4a1 Add example sentences for Japanese and Chinese (see #1107) 2017-10-24 13:02:24 +02:00
ines
66f8f9d4a0 Fix Japanese tokenizer
JapaneseTokenizer now returns a Doc, not individual words
2017-10-24 13:02:19 +02:00
Matthew Honnibal
dd5b2d8fa3 Check for out-of-memory when calling calloc. Closes #1446 2017-10-24 12:40:47 +02:00
Matthew Honnibal
b66b8f028b Fix #1375 -- out-of-bounds on token.nbor() 2017-10-24 12:10:39 +02:00
Matthew Honnibal
a68d89a4f3 Add failing test for bug #1375 -- no out-of-bounds error for token.nbor() 2017-10-24 12:05:25 +02:00
Ines Montani
facf77e541 Merge branch 'develop' into support-danish 2017-10-24 11:53:19 +02:00
Matthew Honnibal
ccd2ab1a62 Merge pull request #1443 from ramananbalakrishnan/develop-get-lca-matrix
Add LCA matrix for spans and docs
2017-10-24 11:22:46 +02:00
Matthew Honnibal
ef3e5a361b Merge pull request #1442 from explosion/feature/fix-sp
💫Fix SP tag, tweak Vectors.__init__, fix Morphology
2017-10-24 10:24:07 +02:00
Matthew Honnibal
fdf25d10ba Merge pull request #1440 from ramananbalakrishnan/develop
Support single value for attribute list in doc.to_array
2017-10-24 10:23:12 +02:00
Matthew Honnibal
e7556ff048 Fix non-maxout parser 2017-10-23 18:16:23 +02:00
ines
a31f048b4d Fix formatting 2017-10-23 10:38:06 +02:00
Matthew Honnibal
490ad3eaf0 Check that empty strings are handled. Closes #1242 2017-10-21 00:52:14 +02:00
Matthew Honnibal
8f8bccecb9 Patch deserialisation for invalid loads, to avoid model failure 2017-10-21 00:51:42 +02:00
Ramanan Balakrishnan
d2fe56a577
Add LCA matrix for spans and docs 2017-10-20 23:58:00 +05:30
Matthew Honnibal
d8391b1c4d Fix #1434: Matcher failed on ending ? if no token 2017-10-20 16:49:36 +02:00
Matthew Honnibal
fec53f09f7 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-20 16:28:34 +02:00
Matthew Honnibal
f111b228e0 Fix re-parsing of previously parsed text
If a Doc object had been previously parsed, it was possible for
invalid parses to be added. There were two problems:

1) The parse was only being partially erased
2) The RightArc action was able to create a 1-cycle.

This patch fixes both errors, and avoids resetting the parse if one is
present. In theory this might allow a better parse to be predicted by
running the parser twice.

Closes #1253.
2017-10-20 16:27:36 +02:00
Matthew Honnibal
1036798155 Make parser consistent if maxout==1 2017-10-20 16:24:16 +02:00
Matthew Honnibal
3faf9189a2 Make parser hidden shape consistent even if maxout==1 2017-10-20 16:23:31 +02:00
Matthew Honnibal
9010a1a060 Create vectors correctly 2017-10-20 14:19:46 +02:00
Matthew Honnibal
33229b1c9e Remove print statement 2017-10-20 14:19:29 +02:00
Matthew Honnibal
cfae54c507 Make change to Vectors.__init__ 2017-10-20 14:19:04 +02:00
Matthew Honnibal
ebecaddb76 Make 'data_or_width' two keyword args in Vectors.__init__
Previously the data and width options were one argument in Vectors,
which meant you couldn't say vectors = Vectors(strings, width=300).
It's better to have two keywords.
2017-10-20 14:17:15 +02:00
Matthew Honnibal
49895fbef6 Rename 'SP' special tag to '_SP'
Renaming the tag with an underscore lets us add it to the tag map
without worrying that we'll change the sequence of tags, which throws
off the tag-to-ID mapping. For instance, if we inserted a 'SP' tag,
the "VERB" tag is pushed to a different class ID, and the model is all
messed up.
2017-10-20 14:01:12 +02:00
Matthew Honnibal
506cf2eb13 Remove cpdef enum, to avoid too much code generation 2017-10-20 14:00:23 +02:00
Matthew Honnibal
6218af0105 Remove cpdef enum, to avoid too much code generation 2017-10-20 13:59:57 +02:00
Matthew Honnibal
92ac9316b5 Fix initialization of vectors, to address serialization problem 2017-10-20 13:59:24 +02:00
Ramanan Balakrishnan
0726946563
cleanup to_array implementation using fixes on master 2017-10-20 17:09:37 +05:30
ines
108f1f786e Update symbols and document missing token attributes (see #1439) 2017-10-20 13:08:44 +02:00
ines
4acab77a8a Add missing symbol for LAW entities (resolves #1427) 2017-10-20 13:07:57 +02:00
Matthew Honnibal
b101736555 Fix precomputed layer 2017-10-20 12:14:52 +02:00
Ramanan Balakrishnan
b3ab124fc5
Support strings for attribute list in doc.to_array 2017-10-20 11:46:57 +05:30
Matthew Honnibal
64658e02e5 Implement fancier initialisation for precomputed layer 2017-10-20 03:07:45 +02:00
Matthew Honnibal
827cd8a883 Fix support of maxout pieces in parser 2017-10-20 03:07:17 +02:00
Matthew Honnibal
a8850b4282 Remove redundant PrecomputableMaxouts class 2017-10-19 20:27:34 +02:00
Matthew Honnibal
a17a1b60c7 Clean up redundant PrecomputableMaxouts class 2017-10-19 20:26:37 +02:00
Matthew Honnibal
b00d0a2c97 Fix bias in parser 2017-10-19 18:42:11 +02:00
Matthew Honnibal
b54b4b8a97 Make parser_maxout_pieces hyper-param work 2017-10-19 13:45:18 +02:00
Matthew Honnibal
03a215c5fd Make PrecomputableAffines work 2017-10-19 13:44:49 +02:00
Ramanan Balakrishnan
7b9b1be44c
Support single value for attribute list in doc.to_array 2017-10-19 17:00:41 +05:30
Matthew Honnibal
61bc203f3f Merge pull request #1438 from explosion/feature/fast-parser
💫 Improve runtime CPU efficiency of parser/NER
2017-10-19 02:42:21 +02:00
Matthew Honnibal
15e5a04a8d Clean up more depth=0 conditional code 2017-10-19 01:48:43 +02:00
Matthew Honnibal
906c50ac59 Fix loop typing, that caused error on windows 2017-10-19 01:48:39 +02:00
ines
24512420b1 Show error if data_path does not exist or is None (see #1102) 2017-10-19 00:53:49 +02:00
ines
bf415fd778 Add test for serializing extension attrs (see #1085) 2017-10-19 00:53:08 +02:00
Matthew Honnibal
960788aaa2 Eliminate dead code in parser, and raise errors for obsolete options 2017-10-19 00:42:34 +02:00
Matthew Honnibal
bbfd7d8d5d Clean up parser multi-threading 2017-10-19 00:25:21 +02:00
Matthew Honnibal
f018f2030c Try optimized parser forward loop 2017-10-18 21:48:00 +02:00
Matthew Honnibal
65bf5e85bd Improve piping in language.pipe 2017-10-18 21:46:12 +02:00
Matthew Honnibal
633a75c7e0 Break parser batches into sub-batches, sorted by length. 2017-10-18 21:45:01 +02:00
Ines Montani
f0d577e460 Merge pull request #1425 from explosion/feature/hindi-tokenizer
💫 Basic Hindi tokenization support
2017-10-18 13:34:52 +02:00
Matthew Honnibal
394633efce Make doc pickling support hooks 2017-10-17 19:44:09 +02:00
Matthew Honnibal
fe844148f6 Test pickling hooks 2017-10-17 19:43:52 +02:00
Matthew Honnibal
cdb0c426d8 Improve deserialization of user_data, esp. for Underscore 2017-10-17 19:29:20 +02:00
Matthew Honnibal
374819edf8 Test user_data deserialization, re #1085 2017-10-17 19:28:54 +02:00
Matthew Honnibal
e35a83d142 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-17 18:22:06 +02:00
Matthew Honnibal
f45973848c Rename 'tokens' variable 'doc' in tokenizer 2017-10-17 18:21:41 +02:00
Matthew Honnibal
839de87ca9 Make lambda func a named function, for pickling 2017-10-17 18:21:20 +02:00
Matthew Honnibal
9baa8fe7ec Convert closure to functools.partial, to promote pickling 2017-10-17 18:20:52 +02:00
Matthew Honnibal
32a8564c79 Fix doc pickling 2017-10-17 18:20:24 +02:00
Matthew Honnibal
8ca97f32a3 Fix doc pickling test 2017-10-17 18:19:57 +02:00
Matthew Honnibal
9ce7d6af87 Make lex attr functions top-level functions, to promote pickling 2017-10-17 18:19:18 +02:00
Matthew Honnibal
1cc85a89ef Allow reasonably efficient pickling of Language class, using to_bytes() and from_bytes(). 2017-10-17 18:18:49 +02:00
Matthew Honnibal
0d57b9748a Serialize lex_attr_getters with dill, for better pickle support 2017-10-17 18:17:45 +02:00
Matthew Honnibal
45d1dd90b1 Add tests for pickling doc 2017-10-17 17:20:58 +02:00
Ines Montani
afa67de7ee Merge pull request #1428 from roanuz/develop
Fix trailing whitespace and Language.from_disk overwrites
2017-10-17 16:29:15 +02:00
Matthew Honnibal
92c1eb2d6f Fix Doc pickling. This also removes need for Binder class 2017-10-17 16:11:13 +02:00
Matthew Honnibal
ed8da9b11f Add missing return statement in SentenceSegmenter 2017-10-17 15:32:56 +02:00
Ines Montani
aab299c8ae Merge pull request #1429 from vishnunekkanti/develop
fix syntax error in zh
2017-10-17 14:45:02 +02:00
Anto Binish Kaspar
534240648e Fix trailing whitespace on morphology features 2017-10-17 17:15:58 +05:30
Anto Binish Kaspar
8f5b60c168 Fix Language.from_disk overwrites the meta.json file. 2017-10-17 17:15:32 +05:30
ines
8ca344712d Add Language.has_pipe method 2017-10-17 11:20:07 +02:00
ines
485c4f6df5 Add Hungarian examples (see #1107) 2017-10-17 02:37:45 +02:00
Matthew Honnibal
19531bad4c Merge branch 'develop' into feature/streaming-data-memory-growth 2017-10-16 21:44:11 +02:00
Matthew Honnibal
df488274b1 Fix deserialization of vectors 2017-10-16 20:55:00 +02:00
Matthew Honnibal
4018486d31 Merge remote-tracking branch 'origin/develop' into feature/streaming-data-memory-growth 2017-10-16 20:49:48 +02:00
Matthew Honnibal
4174477161 Fix equality check in test 2017-10-16 19:50:35 +02:00
Matthew Honnibal
2bc06e4b22 Bump rolling buffer size to 10k 2017-10-16 19:38:29 +02:00
Matthew Honnibal
66e2eb8f39 Clean up remnant of frozen in StringStore 2017-10-16 19:34:41 +02:00
Matthew Honnibal
a002264fec Remove caching of Token in Doc, as caused cycle. 2017-10-16 19:34:21 +02:00
Matthew Honnibal
3e037054c8 Remove obsolete is_frozen functionality from StringStore 2017-10-16 19:23:10 +02:00
Matthew Honnibal
5c14f3f033 Create a rolling buffer for the StringStore in Language.pipe() 2017-10-16 19:22:40 +02:00
Matthew Honnibal
59c216196c Allow weakrefs on Doc objects 2017-10-16 19:22:11 +02:00
ines
d5418553eb Fix whitespace 2017-10-16 18:30:04 +02:00
ines
6ceadcdb5c Make sure from_disk passes string to numpy (see #1421)
If path is a WindowsPath, numpy does not recognise it as a path and as
a result, doesn't open the file.
https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L369
2017-10-16 18:29:56 +02:00
Matthew Honnibal
010a7309ff Merge pull request #1402 from explosion/feature/fix-matcher-operators
💫 Fix Matcher variable-length operators
2017-10-16 17:53:19 +02:00
Matthew Honnibal
c29927d2e7 Fix matcher test 2017-10-16 17:22:18 +02:00
Vishnu Kumar Nekkanti
d3c54cf39a fixed SyntaxError while checking for jieba 2017-10-16 18:51:33 +05:30
Matthew Honnibal
a928ae2f35 Merge branch 'develop' into feature/fix-matcher-operators 2017-10-16 13:38:36 +02:00
Matthew Honnibal
56aa42cc5d Fix and document matcher operator 'shadowing' behaviour 2017-10-16 13:38:20 +02:00
Matthew Honnibal
748d525801 Add more matcher operator tests 2017-10-16 13:38:01 +02:00
Matthew Honnibal
0433181658 Document operator semantics in Matcher docstring 2017-10-16 12:06:33 +02:00
ines
266e7180a7 Add Language class, stop words and basic stemmer that sets NORM 2017-10-14 14:59:52 +02:00
ines
e85e1d571b Update base punctuation 2017-10-14 14:59:23 +02:00
ines
9d6c8eaa49 Update base norm exceptions with more unicode characters
e.g. unicode variations of punctuation used in Chinese
2017-10-14 14:58:52 +02:00
ines
3516aa0cea Port over changes from #1389 2017-10-14 13:32:55 +02:00
ines
cd6a29dce7 Port over changes from #1294 2017-10-14 13:28:46 +02:00
ines
38c756fd85 Port over changes from #1287 2017-10-14 13:16:21 +02:00
ines
612224c10d Port over changes from #1157 2017-10-14 13:11:39 +02:00
ines
9b3f8f9ec3 Fix formatting and add comment on languages 2017-10-14 13:11:18 +02:00
ines
a4d974d97b Port over URL pattern changes from #1411 2017-10-14 12:58:07 +02:00
ines
09aed58140 Port over changes from #1333 and add comments 2017-10-14 12:52:59 +02:00
Matthew Honnibal
cf6da9301a Update lemmatizer test 2017-10-12 22:50:52 +02:00
Matthew Honnibal
9b90d235d1 Fix tag check in lemmatizer 2017-10-12 22:50:43 +02:00
Matthew Honnibal
dc01acd821 Escape encoding in validate function 2017-10-12 22:23:21 +02:00
Matthew Honnibal
27b927259a Add locale_escape compat function 2017-10-12 22:22:04 +02:00
ines
9c6de3dcfa Merge branch 'develop' into feature/cli-validate 2017-10-12 21:44:28 +02:00
Matthew Honnibal
462caf835a Fix SBD test 2017-10-12 21:18:22 +02:00
ines
fff1028391 Add validate CLI command 2017-10-12 20:05:06 +02:00
Matthew Honnibal
908f44c3fe Disable history features by default 2017-10-12 14:56:11 +02:00
Matthew Honnibal
a955843684 Increase default number of epochs 2017-10-12 13:13:01 +02:00
Matthew Honnibal
cecfcc7711 Set default hyper params back to 'slow' settings 2017-10-12 13:12:26 +02:00
Ines Montani
37aa523a8e Merge pull request #1408 from explosion/feature/dot-underscore
💫 Custom attributes via Doc._, Token._ and Span._
2017-10-11 18:35:56 +02:00
ines
8ce6f96180 Don't make copies of language data components 2017-10-11 15:34:55 +02:00
ines
51519251c2 Fix underscore method test 2017-10-11 13:34:19 +02:00
ines
c6ae49e8bf Fix formatting 2017-10-11 13:34:11 +02:00
ines
453c47ca24 Add German lemmatizer tests 2017-10-11 13:27:26 +02:00
ines
15fe0fd82d Fix tests 2017-10-11 13:27:18 +02:00
ines
6dd14dc342 Add lookup lemmas to tokens without POS tags 2017-10-11 13:27:10 +02:00
ines
9620c1a640 Add lemma_lookup to Language defaults 2017-10-11 13:26:05 +02:00
ines
9fd471372a Add lookup lemmatizer to lemmatizer as lookup() method 2017-10-11 13:25:51 +02:00
ines
e0ff145a8b Merge branch 'develop' into feature/dot-underscore 2017-10-11 11:57:05 +02:00
ines
c1d6d43c83 Merge branch 'develop' into feature/lemmatizer 2017-10-11 11:56:35 +02:00
Matthew Honnibal
17c467e0ab Avoid clobbering existing lemmas 2017-10-11 03:33:06 -05:00
Matthew Honnibal
807e109f2b Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-11 02:47:59 -05:00
Matthew Honnibal
6e552c9d83 Prune number of non-projective labels more aggressiely 2017-10-11 02:46:44 -05:00
Matthew Honnibal
76fe24f44d Improve embedding defaults 2017-10-11 09:44:17 +02:00
Matthew Honnibal
188f620046 Improve parser defaults 2017-10-11 09:43:48 +02:00
Matthew Honnibal
acba2e1051 Fix metadata in training 2017-10-11 08:55:52 +02:00
Matthew Honnibal
74c2c6a58c Add default name and lang to meta 2017-10-11 08:49:12 +02:00
Matthew Honnibal
3814a161e6 Avoid clobbering preset lemmas 2017-10-11 08:41:03 +02:00
Matthew Honnibal
fd47f8e89f Fix failing test 2017-10-11 08:38:34 +02:00
Matthew Honnibal
462b2e26b4 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-11 08:23:04 +02:00
Matthew Honnibal
a6ac4699eb Allow Morphology class to setup tokens
Add Morphology.assign_untagged() C-method, and call it from
Doc.push_back() when a token is created. This gives a place
to allow the Morphology class to initialize token data.
2017-10-11 03:24:14 +02:00
Matthew Honnibal
3b527fa52b Call morphology.assign_untagged when pushing token to Doc 2017-10-11 03:23:57 +02:00
Matthew Honnibal
c15d8278cb Avoid lemmatizing inappropriate tags in English lemmatizer 2017-10-11 03:23:23 +02:00
Matthew Honnibal
d528b6e36d Add assign_untagged method in Morphology 2017-10-11 03:22:49 +02:00
Matthew Honnibal
2c118ab3a6 Add tests for Doc creation 2017-10-11 03:21:23 +02:00
ines
820bf85075 Move LookupLemmatizer to spacy.lemmatizer 2017-10-11 02:25:13 +02:00
ines
417d45f5d0 Add lemmatizer data as variable on language data
Don't create lookup lemmatizer within Language class and just pass in
the data so it can be set on Token creation
2017-10-11 02:24:58 +02:00
ines
0c2343d73a Tidy up language data 2017-10-11 02:22:49 +02:00
Matthew Honnibal
d84136b4a9 Update add label test 2017-10-10 22:57:41 +02:00
Matthew Honnibal
3065f12ef2 Make add parser label work for hidden_depth=0 2017-10-10 22:57:31 +02:00
ines
bfd58dd0fc Merge branch 'develop' into feature/dot-underscore 2017-10-10 22:03:51 +02:00
Matthew Honnibal
73bca3d382 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-10 12:51:37 -05:00
Matthew Honnibal
5156074df1 Make loading code more consistent in train command 2017-10-10 12:51:20 -05:00
Matthew Honnibal
d70fba6807 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-10 19:33:10 +02:00
Matthew Honnibal
8143618497 Set prefix length back to 1 2017-10-10 19:32:54 +02:00
Matthew Honnibal
97c9b5db8b Patch spacy.train for new pipeline management 2017-10-09 23:41:16 -05:00
Matthew Honnibal
a635240398 Add conll_ner2json converter 2017-10-09 22:03:26 -05:00
Matthew Honnibal
e0a9b02b67 Merge Span._ and Span.as_doc methods 2017-10-09 22:00:15 -05:00
Matthew Honnibal
dce8afb9cf Set prefix length to 3 2017-10-09 21:55:55 -05:00
Matthew Honnibal
8265b90c83 Update parser defaults 2017-10-09 21:55:20 -05:00
Matthew Honnibal
dd2b0601d1 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-09 21:30:46 -05:00
Matthew Honnibal
09d61ada5e Merge pull request #1396 from explosion/feature/pipeline-management
💫 Improve pipeline and factory management
2017-10-10 04:29:54 +02:00
ines
67350fa496 Use better logic for auto-generating component name
Instances don't have __name__, so we try __class__.__name__ as well,
before giving up and defaulting to repr(component).
2017-10-10 04:23:05 +02:00
ines
3fc4fe61d2 Fix typo 2017-10-10 04:15:14 +02:00
ines
59c4f27499 Add get, set and has methods to Underscore 2017-10-10 04:14:35 +02:00
Matthew Honnibal
19136fd155 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-10 03:58:30 +02:00
Matthew Honnibal
8978212ee5 Patch serialization bug raised in #1105 2017-10-10 03:58:12 +02:00
Matthew Honnibal
f0f2739ae3 Add test for serialization issue raised in #1105 2017-10-10 03:57:58 +02:00
Matthew Honnibal
735d18654d Add NER converter for CoNLL 2003 data 2017-10-09 20:06:28 -05:00
Matthew Honnibal
51d18937af Partially apply doc/span/token into method
We want methods to act like they're "bound" to the object, so that you can make your method conditional on the `doc`, `span` or `token` instance --- like, well, a method. We therefore partially apply the function, which works like this:

```
def partial(unbound_method, constant_arg):
    def bound_method(*args, **kwargs):
        return unbound_method(constant_arg, *args, **kwargs)
    return bound_method
2017-10-10 02:21:28 +02:00
Matthew Honnibal
808d8740d6 Remove print statement 2017-10-09 08:45:20 -05:00
Matthew Honnibal
0f41b25f60 Add speed benchmarks to metadata 2017-10-09 08:05:37 -05:00
ines
de374dc72a Merge branch 'feature/pipeline-management' into feature/dot-underscore 2017-10-09 14:37:51 +02:00
Matthew Honnibal
2534cd57d7 Add bandaid solution to the 'shadowing' problem in #864 2017-10-09 08:59:35 +02:00
Matthew Honnibal
d8a2506023 Merge pull request #1401 from explosion/feature/add-parser-action
💫 Allow labels to be added to pre-trained parser and NER modes
2017-10-09 04:57:51 +02:00
Matthew Honnibal
689349e32f Merge pull request #1400 from explosion/feature/sentence-parsing
💫 Force parser to respect preset sentence boundaries
2017-10-09 04:31:43 +02:00
Matthew Honnibal
e79fc41ff8 Merge pull request #1391 from explosion/feature/multilabel-textcat
💫 Fix multi-label support for text classification
2017-10-09 04:22:31 +02:00
Matthew Honnibal
fad2b8315f Merge branch 'develop' into feature/add-parser-action 2017-10-09 04:13:04 +02:00
Matthew Honnibal
6c79841c0d Fix tests for history features 2017-10-09 04:12:24 +02:00
Matthew Honnibal
dde87e6b0d Add tests for adding parser actions 2017-10-09 03:42:35 +02:00
Matthew Honnibal
b2b8506f2c Remove whitespace 2017-10-09 03:35:57 +02:00