Commit Graph

4795 Commits

Author SHA1 Message Date
Matthew Honnibal
e7556ff048 Fix non-maxout parser 2017-10-23 18:16:23 +02:00
ines
a31f048b4d Fix formatting 2017-10-23 10:38:06 +02:00
Matthew Honnibal
490ad3eaf0 Check that empty strings are handled. Closes #1242 2017-10-21 00:52:14 +02:00
Matthew Honnibal
8f8bccecb9 Patch deserialisation for invalid loads, to avoid model failure 2017-10-21 00:51:42 +02:00
Ramanan Balakrishnan
d2fe56a577 Add LCA matrix for spans and docs 2017-10-20 23:58:00 +05:30
Matthew Honnibal
d8391b1c4d Fix #1434: Matcher failed on ending ? if no token 2017-10-20 16:49:36 +02:00
Matthew Honnibal
fec53f09f7 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-20 16:28:34 +02:00
Matthew Honnibal
f111b228e0 Fix re-parsing of previously parsed text
If a Doc object had been previously parsed, it was possible for
invalid parses to be added. There were two problems:

1) The parse was only being partially erased
2) The RightArc action was able to create a 1-cycle.

This patch fixes both errors, and avoids resetting the parse if one is
present. In theory this might allow a better parse to be predicted by
running the parser twice.

Closes #1253.
2017-10-20 16:27:36 +02:00
Matthew Honnibal
1036798155 Make parser consistent if maxout==1 2017-10-20 16:24:16 +02:00
Matthew Honnibal
3faf9189a2 Make parser hidden shape consistent even if maxout==1 2017-10-20 16:23:31 +02:00
Matthew Honnibal
9010a1a060 Create vectors correctly 2017-10-20 14:19:46 +02:00
Matthew Honnibal
33229b1c9e Remove print statement 2017-10-20 14:19:29 +02:00
Matthew Honnibal
cfae54c507 Make change to Vectors.__init__ 2017-10-20 14:19:04 +02:00
Matthew Honnibal
ebecaddb76 Make 'data_or_width' two keyword args in Vectors.__init__
Previously the data and width options were one argument in Vectors,
which meant you couldn't say vectors = Vectors(strings, width=300).
It's better to have two keywords.
2017-10-20 14:17:15 +02:00
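The keyword split described in ebecaddb76 can be sketched roughly as follows. `ToyVectors` is a hypothetical stand-in written for illustration, not spaCy's `Vectors` class; only the `strings`, `data` and `width` names come from the commit message, and the zero-filled default table is an assumption.

```python
# Hypothetical stand-in for illustration; this is not spaCy's Vectors class.
class ToyVectors:
    def __init__(self, strings, data=None, width=0):
        self.strings = list(strings)
        if data is not None:
            # Explicit data wins; the width is read off the first row.
            self.data = [list(row) for row in data]
            self.width = len(self.data[0]) if self.data else width
        else:
            # No data given: allocate one zero row per string (an assumption).
            self.width = width
            self.data = [[0.0] * width for _ in self.strings]


# With separate keywords, the call from the commit message reads naturally.
vectors = ToyVectors(["apple", "pear"], width=300)
```

Keeping `data` and `width` apart means neither call site has to guess which meaning a single overloaded argument carries.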
Matthew Honnibal
49895fbef6 Rename 'SP' special tag to '_SP'
Renaming the tag with an underscore lets us add it to the tag map
without worrying that we'll change the sequence of tags, which throws
off the tag-to-ID mapping. For instance, if we inserted a 'SP' tag,
the "VERB" tag is pushed to a different class ID, and the model is all
messed up.
2017-10-20 14:01:12 +02:00
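The ID shift that 49895fbef6 avoids can be illustrated with a small sketch. It assumes class IDs are assigned by sorting the tag names, which the commit message implies but does not spell out; `tag_ids` and the three-tag list are hypothetical.

```python
def tag_ids(tags):
    # Assumption: class IDs follow the sorted order of the tag names.
    return {tag: i for i, tag in enumerate(sorted(tags))}


base = ["NOUN", "PUNCT", "VERB"]

print(tag_ids(base)["VERB"])            # 2
print(tag_ids(base + ["SP"])["VERB"])   # 3: 'SP' sorts before 'VERB' and shifts it
print(tag_ids(base + ["_SP"])["VERB"])  # 2: '_' sorts after the uppercase letters
```

Under that assumption, the underscore places the new tag at the end of the sequence, so the IDs a trained model already relies on stay put.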
Matthew Honnibal
506cf2eb13 Remove cpdef enum, to avoid too much code generation 2017-10-20 14:00:23 +02:00
Matthew Honnibal
6218af0105 Remove cpdef enum, to avoid too much code generation 2017-10-20 13:59:57 +02:00
Matthew Honnibal
92ac9316b5 Fix initialization of vectors, to address serialization problem 2017-10-20 13:59:24 +02:00
Ramanan Balakrishnan
0726946563 cleanup to_array implementation using fixes on master 2017-10-20 17:09:37 +05:30
ines
108f1f786e Update symbols and document missing token attributes (see #1439) 2017-10-20 13:08:44 +02:00
ines
4acab77a8a Add missing symbol for LAW entities (resolves #1427) 2017-10-20 13:07:57 +02:00
Matthew Honnibal
b101736555 Fix precomputed layer 2017-10-20 12:14:52 +02:00
Ramanan Balakrishnan
b3ab124fc5 Support strings for attribute list in doc.to_array 2017-10-20 11:46:57 +05:30
Matthew Honnibal
64658e02e5 Implement fancier initialisation for precomputed layer 2017-10-20 03:07:45 +02:00
Matthew Honnibal
827cd8a883 Fix support of maxout pieces in parser 2017-10-20 03:07:17 +02:00
Matthew Honnibal
a8850b4282 Remove redundant PrecomputableMaxouts class 2017-10-19 20:27:34 +02:00
Matthew Honnibal
a17a1b60c7 Clean up redundant PrecomputableMaxouts class 2017-10-19 20:26:37 +02:00
Matthew Honnibal
b00d0a2c97 Fix bias in parser 2017-10-19 18:42:11 +02:00
Matthew Honnibal
b54b4b8a97 Make parser_maxout_pieces hyper-param work 2017-10-19 13:45:18 +02:00
Matthew Honnibal
03a215c5fd Make PrecomputableAffines work 2017-10-19 13:44:49 +02:00
Ramanan Balakrishnan
7b9b1be44c Support single value for attribute list in doc.to_array 2017-10-19 17:00:41 +05:30
Matthew Honnibal
61bc203f3f Merge pull request #1438 from explosion/feature/fast-parser
💫 Improve runtime CPU efficiency of parser/NER
2017-10-19 02:42:21 +02:00
Matthew Honnibal
15e5a04a8d Clean up more depth=0 conditional code 2017-10-19 01:48:43 +02:00
Matthew Honnibal
906c50ac59 Fix loop typing, that caused error on windows 2017-10-19 01:48:39 +02:00
ines
24512420b1 Show error if data_path does not exist or is None (see #1102) 2017-10-19 00:53:49 +02:00
ines
bf415fd778 Add test for serializing extension attrs (see #1085) 2017-10-19 00:53:08 +02:00
Matthew Honnibal
960788aaa2 Eliminate dead code in parser, and raise errors for obsolete options 2017-10-19 00:42:34 +02:00
Matthew Honnibal
bbfd7d8d5d Clean up parser multi-threading 2017-10-19 00:25:21 +02:00
Matthew Honnibal
f018f2030c Try optimized parser forward loop 2017-10-18 21:48:00 +02:00
Matthew Honnibal
65bf5e85bd Improve piping in language.pipe 2017-10-18 21:46:12 +02:00
Matthew Honnibal
633a75c7e0 Break parser batches into sub-batches, sorted by length. 2017-10-18 21:45:01 +02:00
Ines Montani
f0d577e460 Merge pull request #1425 from explosion/feature/hindi-tokenizer
💫 Basic Hindi tokenization support
2017-10-18 13:34:52 +02:00
Matthew Honnibal
394633efce Make doc pickling support hooks 2017-10-17 19:44:09 +02:00
Matthew Honnibal
fe844148f6 Test pickling hooks 2017-10-17 19:43:52 +02:00
Matthew Honnibal
cdb0c426d8 Improve deserialization of user_data, esp. for Underscore 2017-10-17 19:29:20 +02:00
Matthew Honnibal
374819edf8 Test user_data deserialization, re #1085 2017-10-17 19:28:54 +02:00
Matthew Honnibal
e35a83d142 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-17 18:22:06 +02:00
Matthew Honnibal
f45973848c Rename 'tokens' variable 'doc' in tokenizer 2017-10-17 18:21:41 +02:00
Matthew Honnibal
839de87ca9 Make lambda func a named function, for pickling 2017-10-17 18:21:20 +02:00
Matthew Honnibal
9baa8fe7ec Convert closure to functools.partial, to promote pickling 2017-10-17 18:20:52 +02:00
Matthew Honnibal
32a8564c79 Fix doc pickling 2017-10-17 18:20:24 +02:00
Matthew Honnibal
8ca97f32a3 Fix doc pickling test 2017-10-17 18:19:57 +02:00
Matthew Honnibal
9ce7d6af87 Make lex attr functions top-level functions, to promote pickling 2017-10-17 18:19:18 +02:00
Matthew Honnibal
1cc85a89ef Allow reasonably efficient pickling of Language class, using to_bytes() and from_bytes(). 2017-10-17 18:18:49 +02:00
Matthew Honnibal
0d57b9748a Serialize lex_attr_getters with dill, for better pickle support 2017-10-17 18:17:45 +02:00
Matthew Honnibal
45d1dd90b1 Add tests for pickling doc 2017-10-17 17:20:58 +02:00
Ines Montani
afa67de7ee Merge pull request #1428 from roanuz/develop
Fix trailing whitespace and Language.from_disk overwrites
2017-10-17 16:29:15 +02:00
Matthew Honnibal
92c1eb2d6f Fix Doc pickling. This also removes need for Binder class 2017-10-17 16:11:13 +02:00
Matthew Honnibal
ed8da9b11f Add missing return statement in SentenceSegmenter 2017-10-17 15:32:56 +02:00
Ines Montani
aab299c8ae Merge pull request #1429 from vishnunekkanti/develop
fix syntax error in zh
2017-10-17 14:45:02 +02:00
Anto Binish Kaspar
534240648e Fix trailing whitespace on morphology features 2017-10-17 17:15:58 +05:30
Anto Binish Kaspar
8f5b60c168 Fix Language.from_disk overwrites the meta.json file. 2017-10-17 17:15:32 +05:30
ines
8ca344712d Add Language.has_pipe method 2017-10-17 11:20:07 +02:00
ines
485c4f6df5 Add Hungarian examples (see #1107) 2017-10-17 02:37:45 +02:00
Matthew Honnibal
19531bad4c Merge branch 'develop' into feature/streaming-data-memory-growth 2017-10-16 21:44:11 +02:00
Matthew Honnibal
df488274b1 Fix deserialization of vectors 2017-10-16 20:55:00 +02:00
Matthew Honnibal
4018486d31 Merge remote-tracking branch 'origin/develop' into feature/streaming-data-memory-growth 2017-10-16 20:49:48 +02:00
Matthew Honnibal
4174477161 Fix equality check in test 2017-10-16 19:50:35 +02:00
Matthew Honnibal
2bc06e4b22 Bump rolling buffer size to 10k 2017-10-16 19:38:29 +02:00
Matthew Honnibal
66e2eb8f39 Clean up remnant of frozen in StringStore 2017-10-16 19:34:41 +02:00
Matthew Honnibal
a002264fec Remove caching of Token in Doc, as it caused a cycle. 2017-10-16 19:34:21 +02:00
Matthew Honnibal
3e037054c8 Remove obsolete is_frozen functionality from StringStore 2017-10-16 19:23:10 +02:00
Matthew Honnibal
5c14f3f033 Create a rolling buffer for the StringStore in Language.pipe() 2017-10-16 19:22:40 +02:00
Matthew Honnibal
59c216196c Allow weakrefs on Doc objects 2017-10-16 19:22:11 +02:00
ines
d5418553eb Fix whitespace 2017-10-16 18:30:04 +02:00
ines
6ceadcdb5c Make sure from_disk passes string to numpy (see #1421)
If path is a WindowsPath, numpy does not recognise it as a path and as
a result, doesn't open the file.
https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L369
2017-10-16 18:29:56 +02:00
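A minimal sketch of the idea behind 6ceadcdb5c: convert path objects to plain strings before handing them to a library that may only recognise strings as paths. `path_to_str` is a hypothetical helper, not the function changed in the commit.

```python
from pathlib import PurePath, PureWindowsPath


def path_to_str(path):
    # Hypothetical helper: path objects become plain strings,
    # anything else (already a string, a file handle) passes through.
    return str(path) if isinstance(path, PurePath) else path


# PureWindowsPath can be constructed on any platform, which makes it handy here.
location = path_to_str(PureWindowsPath("C:/models/vectors.npy"))
```

The string result can then be passed on to the loading call in place of the original path object.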
Matthew Honnibal
010a7309ff Merge pull request #1402 from explosion/feature/fix-matcher-operators
💫 Fix Matcher variable-length operators
2017-10-16 17:53:19 +02:00
Matthew Honnibal
c29927d2e7 Fix matcher test 2017-10-16 17:22:18 +02:00
Vishnu Kumar Nekkanti
d3c54cf39a fixed SyntaxError while checking for jieba 2017-10-16 18:51:33 +05:30
Matthew Honnibal
a928ae2f35 Merge branch 'develop' into feature/fix-matcher-operators 2017-10-16 13:38:36 +02:00
Matthew Honnibal
56aa42cc5d Fix and document matcher operator 'shadowing' behaviour 2017-10-16 13:38:20 +02:00
Matthew Honnibal
748d525801 Add more matcher operator tests 2017-10-16 13:38:01 +02:00
Matthew Honnibal
0433181658 Document operator semantics in Matcher docstring 2017-10-16 12:06:33 +02:00
ines
266e7180a7 Add Language class, stop words and basic stemmer that sets NORM 2017-10-14 14:59:52 +02:00
ines
e85e1d571b Update base punctuation 2017-10-14 14:59:23 +02:00
ines
9d6c8eaa49 Update base norm exceptions with more unicode characters
e.g. unicode variations of punctuation used in Chinese
2017-10-14 14:58:52 +02:00
ines
3516aa0cea Port over changes from #1389 2017-10-14 13:32:55 +02:00
ines
cd6a29dce7 Port over changes from #1294 2017-10-14 13:28:46 +02:00
ines
38c756fd85 Port over changes from #1287 2017-10-14 13:16:21 +02:00
ines
612224c10d Port over changes from #1157 2017-10-14 13:11:39 +02:00
ines
9b3f8f9ec3 Fix formatting and add comment on languages 2017-10-14 13:11:18 +02:00
ines
a4d974d97b Port over URL pattern changes from #1411 2017-10-14 12:58:07 +02:00
ines
09aed58140 Port over changes from #1333 and add comments 2017-10-14 12:52:59 +02:00
Matthew Honnibal
cf6da9301a Update lemmatizer test 2017-10-12 22:50:52 +02:00
Matthew Honnibal
9b90d235d1 Fix tag check in lemmatizer 2017-10-12 22:50:43 +02:00
Matthew Honnibal
dc01acd821 Escape encoding in validate function 2017-10-12 22:23:21 +02:00
Matthew Honnibal
27b927259a Add locale_escape compat function 2017-10-12 22:22:04 +02:00
ines
9c6de3dcfa Merge branch 'develop' into feature/cli-validate 2017-10-12 21:44:28 +02:00
Matthew Honnibal
462caf835a Fix SBD test 2017-10-12 21:18:22 +02:00
ines
fff1028391 Add validate CLI command 2017-10-12 20:05:06 +02:00
Matthew Honnibal
908f44c3fe Disable history features by default 2017-10-12 14:56:11 +02:00
Matthew Honnibal
a955843684 Increase default number of epochs 2017-10-12 13:13:01 +02:00
Matthew Honnibal
cecfcc7711 Set default hyper params back to 'slow' settings 2017-10-12 13:12:26 +02:00
Ines Montani
37aa523a8e Merge pull request #1408 from explosion/feature/dot-underscore
💫 Custom attributes via Doc._, Token._ and Span._
2017-10-11 18:35:56 +02:00
ines
8ce6f96180 Don't make copies of language data components 2017-10-11 15:34:55 +02:00
ines
51519251c2 Fix underscore method test 2017-10-11 13:34:19 +02:00
ines
c6ae49e8bf Fix formatting 2017-10-11 13:34:11 +02:00
ines
453c47ca24 Add German lemmatizer tests 2017-10-11 13:27:26 +02:00
ines
15fe0fd82d Fix tests 2017-10-11 13:27:18 +02:00
ines
6dd14dc342 Add lookup lemmas to tokens without POS tags 2017-10-11 13:27:10 +02:00
ines
9620c1a640 Add lemma_lookup to Language defaults 2017-10-11 13:26:05 +02:00
ines
9fd471372a Add lookup lemmatizer to lemmatizer as lookup() method 2017-10-11 13:25:51 +02:00
ines
e0ff145a8b Merge branch 'develop' into feature/dot-underscore 2017-10-11 11:57:05 +02:00
ines
c1d6d43c83 Merge branch 'develop' into feature/lemmatizer 2017-10-11 11:56:35 +02:00
Matthew Honnibal
17c467e0ab Avoid clobbering existing lemmas 2017-10-11 03:33:06 -05:00
Matthew Honnibal
807e109f2b Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-11 02:47:59 -05:00
Matthew Honnibal
6e552c9d83 Prune number of non-projective labels more aggressively 2017-10-11 02:46:44 -05:00
Matthew Honnibal
76fe24f44d Improve embedding defaults 2017-10-11 09:44:17 +02:00
Matthew Honnibal
188f620046 Improve parser defaults 2017-10-11 09:43:48 +02:00
Matthew Honnibal
acba2e1051 Fix metadata in training 2017-10-11 08:55:52 +02:00
Matthew Honnibal
74c2c6a58c Add default name and lang to meta 2017-10-11 08:49:12 +02:00
Matthew Honnibal
3814a161e6 Avoid clobbering preset lemmas 2017-10-11 08:41:03 +02:00
Matthew Honnibal
fd47f8e89f Fix failing test 2017-10-11 08:38:34 +02:00
Matthew Honnibal
462b2e26b4 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-11 08:23:04 +02:00
Matthew Honnibal
a6ac4699eb Allow Morphology class to setup tokens
Add Morphology.assign_untagged() C-method, and call it from
Doc.push_back() when a token is created. This gives a place
to allow the Morphology class to initialize token data.
2017-10-11 03:24:14 +02:00
Matthew Honnibal
3b527fa52b Call morphology.assign_untagged when pushing token to Doc 2017-10-11 03:23:57 +02:00
Matthew Honnibal
c15d8278cb Avoid lemmatizing inappropriate tags in English lemmatizer 2017-10-11 03:23:23 +02:00
Matthew Honnibal
d528b6e36d Add assign_untagged method in Morphology 2017-10-11 03:22:49 +02:00
Matthew Honnibal
2c118ab3a6 Add tests for Doc creation 2017-10-11 03:21:23 +02:00
ines
820bf85075 Move LookupLemmatizer to spacy.lemmatizer 2017-10-11 02:25:13 +02:00
ines
417d45f5d0 Add lemmatizer data as variable on language data
Don't create lookup lemmatizer within Language class and just pass in
the data so it can be set on Token creation
2017-10-11 02:24:58 +02:00
ines
0c2343d73a Tidy up language data 2017-10-11 02:22:49 +02:00
Matthew Honnibal
d84136b4a9 Update add label test 2017-10-10 22:57:41 +02:00
Matthew Honnibal
3065f12ef2 Make add parser label work for hidden_depth=0 2017-10-10 22:57:31 +02:00
ines
bfd58dd0fc Merge branch 'develop' into feature/dot-underscore 2017-10-10 22:03:51 +02:00
Matthew Honnibal
73bca3d382 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-10 12:51:37 -05:00
Matthew Honnibal
5156074df1 Make loading code more consistent in train command 2017-10-10 12:51:20 -05:00
Matthew Honnibal
d70fba6807 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-10 19:33:10 +02:00
Matthew Honnibal
8143618497 Set prefix length back to 1 2017-10-10 19:32:54 +02:00
Matthew Honnibal
97c9b5db8b Patch spacy.train for new pipeline management 2017-10-09 23:41:16 -05:00
Matthew Honnibal
a635240398 Add conll_ner2json converter 2017-10-09 22:03:26 -05:00
Matthew Honnibal
e0a9b02b67 Merge Span._ and Span.as_doc methods 2017-10-09 22:00:15 -05:00
Matthew Honnibal
dce8afb9cf Set prefix length to 3 2017-10-09 21:55:55 -05:00
Matthew Honnibal
8265b90c83 Update parser defaults 2017-10-09 21:55:20 -05:00
Matthew Honnibal
dd2b0601d1 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-09 21:30:46 -05:00
Matthew Honnibal
09d61ada5e Merge pull request #1396 from explosion/feature/pipeline-management
💫 Improve pipeline and factory management
2017-10-10 04:29:54 +02:00
ines
67350fa496 Use better logic for auto-generating component name
Instances don't have __name__, so we try __class__.__name__ as well,
before giving up and defaulting to repr(component).
2017-10-10 04:23:05 +02:00
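The fallback order described in 67350fa496 can be sketched like this. `get_component_name`, `my_component` and `Tagger` are hypothetical names chosen for the example; only the `__name__`, `__class__.__name__`, `repr(component)` order comes from the commit message.

```python
def get_component_name(component):
    # Functions and classes carry __name__ directly.
    if hasattr(component, "__name__"):
        return component.__name__
    # Instances do not, so try the name of their class next.
    if hasattr(component, "__class__") and hasattr(component.__class__, "__name__"):
        return component.__class__.__name__
    # Last resort: whatever repr() gives us.
    return repr(component)


def my_component(doc):
    return doc


class Tagger:
    def __call__(self, doc):
        return doc


print(get_component_name(my_component))  # my_component
print(get_component_name(Tagger()))      # Tagger
```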
ines
3fc4fe61d2 Fix typo 2017-10-10 04:15:14 +02:00
ines
59c4f27499 Add get, set and has methods to Underscore 2017-10-10 04:14:35 +02:00
Matthew Honnibal
19136fd155 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-10 03:58:30 +02:00
Matthew Honnibal
8978212ee5 Patch serialization bug raised in #1105 2017-10-10 03:58:12 +02:00
Matthew Honnibal
f0f2739ae3 Add test for serialization issue raised in #1105 2017-10-10 03:57:58 +02:00
Matthew Honnibal
735d18654d Add NER converter for CoNLL 2003 data 2017-10-09 20:06:28 -05:00
Matthew Honnibal
51d18937af Partially apply doc/span/token into method
We want methods to act like they're "bound" to the object, so that you can make your method conditional on the `doc`, `span` or `token` instance --- like, well, a method. We therefore partially apply the function, which works like this:

```
def partial(unbound_method, constant_arg):
    def bound_method(*args, **kwargs):
        return unbound_method(constant_arg, *args, **kwargs)
    return bound_method
```
2017-10-10 02:21:28 +02:00
Matthew Honnibal
808d8740d6 Remove print statement 2017-10-09 08:45:20 -05:00
Matthew Honnibal
0f41b25f60 Add speed benchmarks to metadata 2017-10-09 08:05:37 -05:00
ines
de374dc72a Merge branch 'feature/pipeline-management' into feature/dot-underscore 2017-10-09 14:37:51 +02:00
Matthew Honnibal
2534cd57d7 Add bandaid solution to the 'shadowing' problem in #864 2017-10-09 08:59:35 +02:00
Matthew Honnibal
d8a2506023 Merge pull request #1401 from explosion/feature/add-parser-action
💫 Allow labels to be added to pre-trained parser and NER models
2017-10-09 04:57:51 +02:00
Matthew Honnibal
689349e32f Merge pull request #1400 from explosion/feature/sentence-parsing
💫 Force parser to respect preset sentence boundaries
2017-10-09 04:31:43 +02:00
Matthew Honnibal
e79fc41ff8 Merge pull request #1391 from explosion/feature/multilabel-textcat
💫 Fix multi-label support for text classification
2017-10-09 04:22:31 +02:00
Matthew Honnibal
fad2b8315f Merge branch 'develop' into feature/add-parser-action 2017-10-09 04:13:04 +02:00
Matthew Honnibal
6c79841c0d Fix tests for history features 2017-10-09 04:12:24 +02:00
Matthew Honnibal
dde87e6b0d Add tests for adding parser actions 2017-10-09 03:42:35 +02:00
Matthew Honnibal
b2b8506f2c Remove whitespace 2017-10-09 03:35:57 +02:00
Matthew Honnibal
d43a83e37a Allow parser.add_label for pretrained models 2017-10-09 03:35:40 +02:00
Matthew Honnibal
81a64119db Fix string-to-unicode problem 2017-10-09 00:59:49 +02:00
Matthew Honnibal
02c2af7119 Fix test 2017-10-09 00:29:37 +02:00
Matthew Honnibal
4cc84b0234 Prohibit Break when sent_start < 0 2017-10-09 00:02:45 +02:00
Matthew Honnibal
5a67efeccc Add tests for sentence segmentation presetting 2017-10-09 00:02:23 +02:00
Matthew Honnibal
e938bce320 Adjust parsing transition system to allow preset sentence segments. 2017-10-08 23:53:34 +02:00
Matthew Honnibal
080afd4924 Add ternary value setting to Token.sent_start 2017-10-08 23:51:58 +02:00
Matthew Honnibal
7ae67ec6a1 Add Span.as_doc method 2017-10-08 23:50:20 +02:00
Matthew Honnibal
20309fb9db Make history features default to zero 2017-10-08 20:32:14 +02:00
Matthew Honnibal
e74c8d2fad Merge remote-tracking branch 'origin/develop' into feature/sentence-parsing 2017-10-08 20:20:41 +02:00
Matthew Honnibal
18063803de Make TokenC.sent_start an int, to allow ternary value 2017-10-08 19:58:54 +02:00
Matthew Honnibal
be4f0b6460 Update defaults 2017-10-08 02:08:12 -05:00
Matthew Honnibal
42b401d08b Change default hidden depth to 1 2017-10-07 21:05:21 -05:00
Matthew Honnibal
9d66a915da Update training defaults 2017-10-07 21:02:38 -05:00
Matthew Honnibal
d163115e91 Add non-linearity after history features 2017-10-07 21:00:43 -05:00
Matthew Honnibal
92c5d78b42 Unhack NER.add_action 2017-10-07 19:02:40 +02:00
Matthew Honnibal
f2b590f672 Increment version 2017-10-07 19:01:01 +02:00
Matthew Honnibal
9bd8191739 Add tests for Underscore 2017-10-07 18:56:19 +02:00
Matthew Honnibal
668a0ea640 Pass extensions into Underscore class 2017-10-07 18:56:01 +02:00
Matthew Honnibal
1289129fd9 Add Underscore class 2017-10-07 18:00:14 +02:00
Matthew Honnibal
eb0595bea9 Merge pull request #1392 from explosion/feature/parser-history-model
💫 Parser history features
2017-10-07 15:07:02 +02:00
Matthew Honnibal
3d22ccf495 Update default hyper-parameters 2017-10-07 07:16:41 -05:00
Matthew Honnibal
09442d25ec Merge remote-tracking branch 'origin/develop' into feature/parser-history-model 2017-10-07 07:05:04 -05:00
Matthew Honnibal
3b67eabfea Allow empty dictionaries to match any token in Matcher
Often patterns need to match "any token". A clean way to denote this
is with the empty dict {}: this sets no constraints on the token,
so should always match.

The problem was that having attributes length==0 was used as an
end-of-array signal, so the matcher didn't handle this case correctly.

This patch compiles empty token spec dicts into a constraint
NULL_ATTR==0. The NULL_ATTR attribute, 0, is always set to 0 on the
lexeme -- so this always matches.
2017-10-07 03:36:15 +02:00
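The trick in 3b67eabfea can be sketched in pure Python. This is a toy model, not the Cython matcher: attribute IDs are plain integers, tokens are dicts, and `compile_spec` and `token_matches` are hypothetical names. Only `NULL_ATTR` and its value 0 come from the commit message.

```python
NULL_ATTR = 0  # attribute ID whose value is always 0, per the commit message


def compile_spec(spec):
    # An empty spec becomes one constraint that every token satisfies,
    # so a zero-length constraint list can keep meaning "end of pattern".
    if not spec:
        return [(NULL_ATTR, 0)]
    return sorted(spec.items())


def token_matches(token_attrs, constraints):
    # Attributes a token does not carry read as 0 in this toy model,
    # which is why the NULL_ATTR == 0 constraint always holds.
    return all(token_attrs.get(attr, 0) == value for attr, value in constraints)


token = {1: 5}  # a toy token with attribute 1 set to 5
any_token = compile_spec({})
```

Here `token_matches(token, any_token)` is true for every token, while a non-empty spec such as `{1: 6}` still rejects the toy token above.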
ines
0adadcb3f0 Fix beam parse model test 2017-10-07 02:15:15 +02:00
ines
b38a8f4a94 Fix and update pipe methods tests 2017-10-07 02:06:23 +02:00
Matthew Honnibal
0384f08218 Trigger nonproj.deprojectivize as a postprocess 2017-10-07 02:00:47 +02:00
Matthew Honnibal
3a65a0c970 Start adding tests for new pipeline management 2017-10-07 01:48:23 +02:00
ines
e43530269c Update docstrings 2017-10-07 01:04:50 +02:00
ines
61a503a611 Fix parser test 2017-10-07 00:38:51 +02:00
ines
b39409173e Add disable option and True/False/None values for pipeline 2017-10-07 00:29:08 +02:00
ines
2586b61b15 Fix formatting, tidy up and remove unused imports 2017-10-07 00:26:05 +02:00
ines
212c8f0711 Implement new Language methods and pipeline API 2017-10-07 00:25:54 +02:00
Matthew Honnibal
8be46d766e Remove print statement 2017-10-06 16:19:02 -05:00
Matthew Honnibal
8e731009fe Fix parser config serialization 2017-10-06 13:50:52 -05:00
Matthew Honnibal
f4c9a98166 Fix spacy evaluate command on non-GPU 2017-10-06 13:17:47 -05:00
Matthew Honnibal
16ba6aa8a6 Fix parser config serialization 2017-10-06 13:17:31 -05:00
Matthew Honnibal
c66399d8ae Fix depth definition with history features 2017-10-06 06:20:05 -05:00
Matthew Honnibal
5c750a9c2f Reserve 0 for 'missing' in history features 2017-10-06 06:10:13 -05:00
Matthew Honnibal
fbba7c517e Pass dropout through to embed tables 2017-10-06 06:09:18 -05:00
Matthew Honnibal
21d11936fe Fix significant train/test skew error in history feats 2017-10-06 06:08:50 -05:00
Matthew Honnibal
555d8c8bff Fix beam history features 2017-10-05 22:21:50 -05:00
Matthew Honnibal
3db0a32fd6 Fix dropout for history features 2017-10-05 22:21:30 -05:00
Matthew Honnibal
b0618def8d Add support for 2-token state option 2017-10-05 21:54:12 -05:00
Matthew Honnibal
363aa47b40 Clean up dead parsing code 2017-10-05 21:53:49 -05:00
Matthew Honnibal
ca12764772 Enable history features for beam parser 2017-10-05 21:53:29 -05:00
Matthew Honnibal
fc06b0a333 Fix training when hist_size==0 2017-10-05 21:52:28 -05:00
Matthew Honnibal
e25ffcb11f Move history size under feature flags 2017-10-05 19:38:13 -05:00
Matthew Honnibal
563f46f026 Fix multi-label support for text classification
The TextCategorizer class is supposed to support multi-label
text classification, and allow training data to contain missing
values.

For this to work, the gradient of the loss should be 0 when labels
are missing. Instead, there was no way to actually denote "missing"
in the GoldParse class, and so the TextCategorizer class treated
the label set within gold.cats as complete.

To fix this, we change GoldParse.cats to be a dict instead of a list.
The GoldParse.cats dict should map to floats, with 1. denoting
'present' and 0. denoting 'absent'. Gradients are zeroed for categories
absent from the gold.cats dict. A nice bonus is that you can also set
values between 0 and 1 for partial membership. You can also set numeric
values, if you're using a text classification model that uses an
appropriate loss function.

Unfortunately this is a breaking change, although the functionality
was only recently introduced and hasn't been properly documented
yet. I've updated the example script accordingly.
2017-10-05 18:43:02 -05:00
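The missing-label behaviour described in 563f46f026 can be sketched as below. The commit does not say which loss the model uses, so a squared-error gradient (`score - target`) is assumed here; `textcat_gradient` and the label names are hypothetical.

```python
def textcat_gradient(scores, cats, labels):
    # Gradient of 0.5 * (score - target) ** 2 per label (assumed loss).
    # Labels absent from the gold dict contribute a zero gradient.
    grads = []
    for label, score in zip(labels, scores):
        if label in cats:
            grads.append(score - cats[label])
        else:
            grads.append(0.0)
    return grads


labels = ["SPORTS", "POLITICS", "TECH"]
scores = [0.9, 0.2, 0.6]
# 1.0 means present, 0.0 means absent; TECH is simply not annotated.
cats = {"SPORTS": 1.0, "POLITICS": 0.0}
grads = textcat_gradient(scores, cats, labels)
```

A dict makes the three states distinguishable: present, absent, and unknown. A plain list of labels can only express the first of these and forces everything else to be read as absent.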
Matthew Honnibal
c6cd81f192 Wrap try/except around model saving 2017-10-05 08:14:24 -05:00
Matthew Honnibal
5743b06e36 Wrap model saving in try/except 2017-10-05 08:12:50 -05:00
Matthew Honnibal
fd4baff475 Update tests 2017-10-05 08:12:27 -05:00
Matthew Honnibal
dcdfa071aa Disable LayerNorm hack 2017-10-04 20:06:52 -05:00
Matthew Honnibal
943af4423a Make depth setting in parser work again 2017-10-04 20:06:05 -05:00
Matthew Honnibal
bfabc333be Merge remote-tracking branch 'origin/develop' into feature/parser-history-model 2017-10-04 20:00:36 -05:00
Matthew Honnibal
92066b04d6 Fix Embed and HistoryFeatures 2017-10-04 19:55:34 -05:00
Matthew Honnibal
d903986439 Increment version 2017-10-04 17:14:26 +02:00
Matthew Honnibal
40edb65ee7 Make test work for Python 2.7 2017-10-04 16:36:50 +02:00
Matthew Honnibal
bd8e84998a Add nO attribute to TextCategorizer model 2017-10-04 16:07:30 +02:00
Matthew Honnibal
f8a0614527 Improve textcat model slightly 2017-10-04 15:15:53 +02:00
Matthew Honnibal
39798b0172 Uncomment layernorm adjustment hack 2017-10-04 15:12:09 +02:00
Matthew Honnibal
b3a7082bf8 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-04 14:56:46 +02:00
Matthew Honnibal
db05d4d582 Add test for #1380. Passes without fix? 2017-10-04 14:56:31 +02:00
Matthew Honnibal
774f5732bd Fix dimensionality of textcat when no vectors available 2017-10-04 14:55:15 +02:00
Ines Montani
28ba0b9b51 Merge pull request #1385 from explosion/feature/new-website
💫 New spaCy website
2017-10-04 14:35:52 +02:00
Matthew Honnibal
af75b74208 Unset LayerNorm backwards compat hack 2017-10-03 20:47:10 -05:00
ines
73ac0aa0b5 Update spacy evaluate and add displaCy option 2017-10-04 00:03:15 +02:00
Matthew Honnibal
246612cb53 Merge remote-tracking branch 'origin/develop' into feature/parser-history-model 2017-10-03 16:56:42 -05:00
Matthew Honnibal
f24c2e3a8a Fix evaluate for non-GPU 2017-10-03 22:47:31 +02:00
Matthew Honnibal
5cbefcba17 Set backwards compatibility flag 2017-10-03 20:29:58 +02:00
Matthew Honnibal
5454b20cd7 Update thinc imports for 6.9 2017-10-03 20:07:17 +02:00
Matthew Honnibal
4a59f6358c Fix thinc imports 2017-10-03 19:21:26 +02:00
Matthew Honnibal
e514d6aa0a Import thinc modules more explicitly, to avoid cycles 2017-10-03 18:49:25 +02:00
Matthew Honnibal
338e1fda0e Unbreak merge artefact 2017-10-03 09:41:05 -05:00
Matthew Honnibal
1289187279 Fix circular import 2017-10-03 09:33:21 -05:00
Matthew Honnibal
a44c4c3a5b Add timer to evaluate 2017-10-03 09:15:35 -05:00
Matthew Honnibal
96da86b3e5 Add support for verbose flag to Language 2017-10-03 09:14:57 -05:00
Matthew Honnibal
02586a5243 Add timing to spacy evaluate command 2017-10-03 09:14:34 -05:00
ines
e49cd7aeaf Move import into load to avoid circular imports 2017-10-03 15:22:19 +02:00
ines
b0dfa059db Update docs link in about.py 2017-10-03 15:19:55 +02:00
Matthew Honnibal
dc3c791947 Fix history size option 2017-10-03 13:41:23 +02:00
Matthew Honnibal
278a4c17c6 Fix history features 2017-10-03 13:27:10 +02:00
Matthew Honnibal
b770f4e108 Fix embed class in history features 2017-10-03 13:26:55 +02:00
Matthew Honnibal
b50a359e11 Add support for history features in parsing models 2017-10-03 12:44:01 +02:00
Matthew Honnibal
ee41e4fea7 Support history features in stateclass 2017-10-03 12:43:48 +02:00
Matthew Honnibal
6aa6a5bc25 Add a layer type for history features 2017-10-03 12:43:09 +02:00
Matthew Honnibal
8902df44de Fix component disabling during training 2017-10-02 21:07:23 +02:00
Matthew Honnibal
c617d288d8 Update pipeline component names in spaCy train 2017-10-02 17:20:19 +02:00
Matthew Honnibal
f942903429 Improve sentence merging in iob2json 2017-10-02 17:02:10 +02:00
Matthew Honnibal
31681d20e0 Fix concatenation in iob2json converter 2017-10-02 16:50:26 +02:00
Matthew Honnibal
4896ce3320 Remove misleading comment 2017-10-02 00:09:14 +02:00
Matthew Honnibal
d90cc917fa Merge vectors.pyx doc strings 2017-10-01 17:05:54 -05:00
Matthew Honnibal
b2a8b9be77 Fix inconsistency of Vectors class API 2017-10-01 17:00:34 -05:00
Matthew Honnibal
e38089d598 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-01 22:10:54 +02:00
Matthew Honnibal
97c409b602 Add docstrings for spacy.vectors 2017-10-01 22:10:33 +02:00
ines
b776f48e58 Fix typo 2017-10-01 21:58:45 +02:00
Matthew Honnibal
94df115a81 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-01 14:06:23 -05:00
Matthew Honnibal
2cf0f4622f Fix loading of models with pre-trained vectors 2017-10-01 14:05:32 -05:00
Matthew Honnibal
69c7c642c2 Add spacy evaluate 2017-10-01 14:05:04 -05:00
ines
8dbe49ecb8 Always compare lowercase package names
Otherwise, is_package will return False if model name contains
uppercase characters. See this issue:
https://support.prodi.gy/t/saving-a-trained-ner-model-as-a-loadable-module/46/6
2017-09-29 20:55:17 +02:00
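A minimal sketch of the comparison described in 8dbe49ecb8, assuming the list of installed package names is already available. `is_installed` is a hypothetical helper and does not reproduce the real `is_package` implementation.

```python
def is_installed(name, installed_names):
    # Lowercase both sides so "My_NER_Model" and "my_ner_model" compare equal.
    name = name.lower()
    return any(name == other.lower() for other in installed_names)


installed = ["my_ner_model", "another-package"]
found = is_installed("My_NER_Model", installed)
```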
ines
153c2589d4 Revert "Always compare lowercase package names"
This reverts commit 7d77dc490f.
2017-09-29 20:53:36 +02:00
ines
fd1a9225d8 Handle conversion of pipeline components correctly
Allow both comma and comma + whitespace as separators
2017-09-29 20:52:56 +02:00
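The separator handling described in fd1a9225d8 can be sketched as follows. `parse_pipeline` is a hypothetical helper, and the component names are only examples.

```python
def parse_pipeline(value):
    # Split on commas, then strip whitespace so that "tagger,parser"
    # and "tagger, parser" give the same result. Empty entries are dropped.
    return [name.strip() for name in value.split(",") if name.strip()]


compact = parse_pipeline("tagger,parser,ner")
spaced = parse_pipeline("tagger, parser, ner")
```

Both calls return the same list of three component names.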
ines
7d77dc490f Always compare lowercase package names
Otherwise, is_package will return False if model name contains
uppercase characters. See this issue:
https://support.prodi.gy/t/saving-a-trained-ner-model-as-a-loadable-module/46/6
2017-09-29 20:52:28 +02:00
Matthew Honnibal
cdb2d83e16 Pass dropout in parser 2017-09-28 18:47:13 -05:00
Matthew Honnibal
158e177cae Fix default embed size 2017-09-28 08:25:23 -05:00
Matthew Honnibal
f6330d69e6 Default embed size to 7000 2017-09-28 08:07:41 -05:00
Matthew Honnibal
ac8481a7b0 Print NER loss 2017-09-28 08:05:31 -05:00
Matthew Honnibal
542ebfa498 Improve defaults 2017-09-27 18:54:37 -05:00
Matthew Honnibal
dcb86bdc43 Default batch size to 32 2017-09-27 11:48:19 -05:00
Matthew Honnibal
1a37a2c0a0 Update training defaults 2017-09-27 11:48:07 -05:00
Matthew Honnibal
13d7a97f3a Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-09-27 11:44:37 -05:00
Matthew Honnibal
66c388ee01 Remove unhelpful multitask objectives 2017-09-27 11:44:16 -05:00
Matthew Honnibal
983201a83a Fix hard-coded vector width 2017-09-27 11:43:58 -05:00
Ines Montani
959c46eabe Merge pull request #1365 from wannaphongcom/develop
Add Thai language for spaCy v2
2017-09-26 23:43:05 +02:00
Matthew Honnibal
1ef4236f8e Merge pull request #1343 from explosion/feature/phrasematcher
Update PhraseMatcher for spaCy 2
2017-09-26 20:44:23 +02:00
Wannaphong Phatthiyaphaibun
7b5263ffa4 fix thai test 2017-09-26 23:54:15 +07:00
ines
1ff62eaee7 Fix option shortcut to avoid conflict 2017-09-26 17:59:34 +02:00
Wannaphong Phatthiyaphaibun
3d5046c499 fix import in th 2017-09-26 22:41:20 +07:00
ines
7fdfb78141 Add version option to cli.train 2017-09-26 17:34:52 +02:00
Wannaphong Phatthiyaphaibun
a63f790b8c fix thai tag_map 2017-09-26 22:28:57 +07:00
Wannaphong Phatthiyaphaibun
2ea27d07f4 fix tokenizer_exceptions in thai 2017-09-26 22:14:47 +07:00
Matthew Honnibal
41cc5c4c17 Merge branch 'develop' into feature/phrasematcher 2017-09-26 09:59:17 -05:00
Matthew Honnibal
c2e2f81773 Merge pull request #1355 from explosion/feature/noshare
Make pipeline components independent
2017-09-26 16:58:09 +02:00
Wannaphong Phatthiyaphaibun
a2bf4cc7bf fix newline in file 2017-09-26 21:49:43 +07:00
ines
bb5c631402 Implement like_num getter for French (via #1161) 2017-09-26 16:47:45 +02:00
ines
15479b3bae Add comment to like_num re: future work 2017-09-26 16:43:28 +02:00
ines
adda08fe14 Implement like_num getter for Dutch (via #1177) 2017-09-26 16:39:15 +02:00
ines
5ee10379db Port over changes from #1340 2017-09-26 16:38:08 +02:00
Wannaphong Phatthiyaphaibun
5cba67146c add thai in spacy2 2017-09-26 21:36:27 +07:00
ines
10d291f129 Port over change from #1351 2017-09-26 16:11:41 +02:00
Matthew Honnibal
3274b46a0d Try to fix compile error on Windows 2017-09-26 09:05:53 -05:00
Matthew Honnibal
19c7c09bf7 Fix PhraseMatcher.__contains__ 2017-09-26 08:35:53 -05:00
Matthew Honnibal
d02a41a8c9 Merge remote-tracking branch 'origin/develop' into feature/phrasematcher 2017-09-26 08:32:55 -05:00
Matthew Honnibal
698fc0d016 Remove merge artefact 2017-09-26 08:31:37 -05:00
Matthew Honnibal
defb68e94f Update feature/noshare with recent develop changes 2017-09-26 08:15:14 -05:00
Matthew Honnibal
ca28590ddd Use dep and ent multi-task objectives for parser 2017-09-26 08:13:52 -05:00
Matthew Honnibal
9bfd585a11 Fix parameter name in .pxd file 2017-09-26 07:28:50 -05:00
Matthew Honnibal
74f08e1ad5 Update test 2017-09-26 06:45:56 -05:00
Matthew Honnibal
5aaef3e7b8 Don't link vectors in vocab deserialize 2017-09-26 06:45:47 -05:00
Matthew Honnibal
18a27c7579 Fix typo in tensorizer serialization 2017-09-26 06:45:14 -05:00
Matthew Honnibal
5056743ad5 Fix parser serialization 2017-09-26 06:44:56 -05:00
Ines Montani
7123139b2b Add __contains__ to PhraseMatcher 2017-09-26 13:13:27 +02:00
Ines Montani
50ad50f96a Update matcher.pyx 2017-09-26 13:11:17 +02:00
Matthew Honnibal
e34e70673f Allow tagger models to be built with pre-defined tok2vec layer 2017-09-26 05:51:52 -05:00
Matthew Honnibal
bf917225ab Allow multi-task objectives during training 2017-09-26 05:42:52 -05:00
Matthew Honnibal
4ae9ea7684 Remove unused argument in Language 2017-09-26 05:41:35 -05:00
ines
edf7e4881d Add meta.json option to cli.train and add relevant properties
Add accuracy scores to meta.json instead of accuracy.json and replace
all relevant properties like lang, pipeline, spacy_version in existing
meta.json. If not present, also add name and version placeholders to
make it packagable.
2017-09-25 19:00:47 +02:00
ines
d2d35b63b7 Fix formatting 2017-09-25 18:37:13 +02:00
Matthew Honnibal
8eb0b7b779 Add docstrings for Pipe API 2017-09-25 16:22:07 +02:00
Matthew Honnibal
39f390dba7 Add docstrings for Pipe API 2017-09-25 16:20:49 +02:00
Matthew Honnibal
8716ffe57d Serialize vocab last 2017-09-24 05:01:45 -05:00
Matthew Honnibal
72bbcc0871 Handle lemmatization for unknown string IDs 2017-09-24 05:01:31 -05:00
Matthew Honnibal
204b58c864 Fix evaluation during training 2017-09-24 05:01:03 -05:00
Matthew Honnibal
dc3a623d00 Remove unused update_shared argument 2017-09-24 05:00:37 -05:00
Matthew Honnibal
63bd87508d Don't use iterated convolutions 2017-09-23 04:39:17 -05:00
Matthew Honnibal
5a7fd0fd36 Fix vector linkage 2017-09-22 20:11:52 -05:00
Matthew Honnibal
4348c479fc Merge pre-trained vectors and noshare patches 2017-09-22 20:07:28 -05:00
Matthew Honnibal
7dc61b3f43 Whitespace 2017-09-22 20:00:50 -05:00
Matthew Honnibal
e93d43a43a Fix training with preset vectors 2017-09-22 20:00:40 -05:00
Matthew Honnibal
0795857dcb Fix beam parsing 2017-09-23 02:59:53 +02:00
Matthew Honnibal
4bd6a12b1f Fix Tok2Vec 2017-09-23 02:58:54 +02:00
Matthew Honnibal
386c1a5bd8 Fix tagger training 2017-09-23 02:58:06 +02:00
Matthew Honnibal
a2357cce3f Set random seed in train script 2017-09-23 02:57:31 +02:00
Matthew Honnibal
05596159bf Fix serialization when pre-trained vectors 2017-09-22 15:33:27 -05:00
Matthew Honnibal
980fb6e854 Refactor Tok2Vec 2017-09-22 09:38:36 -05:00
Matthew Honnibal
d9124f1aa3 Add link_vectors_to_models function 2017-09-22 09:38:22 -05:00
Matthew Honnibal
a186596307 Add 'reapply' combinator, for iterated CNN 2017-09-22 09:37:03 -05:00
Matthew Honnibal
40a4873b70 Fix serialization of model options 2017-09-21 13:07:26 -05:00
Matthew Honnibal
0a9016cade Fix serialization during training 2017-09-21 13:06:45 -05:00
Matthew Honnibal
20193371f5 Don't share CNN, to reduce complexities 2017-09-21 14:59:48 +02:00
Matthew Honnibal
1d73dec8b1 Refactor train script 2017-09-20 19:17:10 -05:00
Matthew Honnibal
ffda38356a Add util function to enable GPU 2017-09-20 19:16:35 -05:00
Matthew Honnibal
24e85c2048 Pass values for CNN maxout pieces option 2017-09-20 19:16:12 -05:00
Matthew Honnibal
b832f89ff8 Add resume_training function 2017-09-20 19:15:20 -05:00
Matthew Honnibal
f5144f04be Add argument for CNN maxout pieces 2017-09-20 19:14:41 -05:00
Matthew Honnibal
842e21de9f Fix int type error for Python 2 2017-09-20 23:55:30 +02:00
Matthew Honnibal
0c93c73e49 Add __reduce__ method for PhraseMatcher 2017-09-20 22:26:40 +02:00
Matthew Honnibal
cc408fc189 Make PhraseMatcher API like Matcher API 2017-09-20 22:20:35 +02:00
Matthew Honnibal
43ad250dd5 Update matcher tests 2017-09-20 21:54:49 +02:00
Matthew Honnibal
828cc91545 Fix PhraseMatcher for spaCy 2 2017-09-20 21:54:31 +02:00
Matthew Honnibal
78301b2d29 Avoid comparison to None in Tok2Vec 2017-09-20 00:19:34 +02:00
Matthew Honnibal
b36a38f63d Fix serialization of pretrained_dims property 2017-09-19 23:42:27 +02:00
Matthew Honnibal
2489dcaccf Fix serialization of parser 2017-09-19 23:42:12 +02:00
Matthew Honnibal
40837b275d Fix tensorizer with pretrained vectors 2017-09-18 18:05:38 -05:00
Matthew Honnibal
a0c4b33d03 Support resuming a model during spacy train 2017-09-18 18:04:47 -05:00
Matthew Honnibal
c858927271 Copy vectors to GPU on begin training 2017-09-18 18:04:16 -05:00
Matthew Honnibal
3fa76c17d1 Refactor Tok2Vec 2017-09-18 15:00:05 -05:00
Matthew Honnibal
217e7891cd Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-09-18 11:36:21 -05:00
Matthew Honnibal
7b3f391f80 Try dropping the Affine layer, conditionally 2017-09-18 11:35:59 -05:00
ines
2480f8f521 Add missing return in Doc.from_disk() (closes #1330) 2017-09-18 15:32:00 +02:00
Matthew Honnibal
2148ae605b Don't use iterated convolutions 2017-09-17 17:36:04 -05:00
Matthew Honnibal
c013e5996f Fix parser test 2017-09-17 13:13:20 -05:00
Matthew Honnibal
8f42f8d305 Remove unused 'preprocess' argument in Tok2Vec 2017-09-17 12:30:16 -05:00
Matthew Honnibal
039d609362 Remove hard-coded default vectors width 2017-09-17 12:29:39 -05:00
Matthew Honnibal
4f38a67a89 Make width default to 0 in vectors.pyx 2017-09-17 12:29:14 -05:00
Matthew Honnibal
16122f566e Fix cpdef enum in attrs.pyx 2017-09-17 12:28:53 -05:00
Matthew Honnibal
b159e0eb50 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-09-17 05:47:50 -05:00
Matthew Honnibal
2b0efc77ae Fix wiring of pre-trained vectors in parser loading 2017-09-17 05:47:34 -05:00
Matthew Honnibal
31c2e91c35 Fix wiring of pre-trained vectors in parser loading 2017-09-17 05:46:55 -05:00
Matthew Honnibal
8f913a74ca Fix defaults and args to build_tagger_model 2017-09-17 05:46:36 -05:00
Matthew Honnibal
c003c561c3 Revert NER action loading change, for model compatibility 2017-09-17 05:46:03 -05:00
Matthew Honnibal
43210abacc Resolve fine-tuning conflict 2017-09-17 05:30:04 -05:00
ines
ece30c28a8 Don't split hyphenated words in German
This way, the tokenizer matches the tokenization in German treebanks
2017-09-16 20:40:15 +02:00
ines
68f66aebf8 Use pkg_resources instead of pip for is_package (resolves #1293) 2017-09-16 20:27:59 +02:00
Matthew Honnibal
5ff2491f24 Pass option for pre-trained vectors in parser 2017-09-16 12:47:21 -05:00
Matthew Honnibal
8665a77f48 Fix feature error in NER 2017-09-16 12:46:57 -05:00
Matthew Honnibal
e37a50a436 Pass documents to tensorizer, not 'features' 2017-09-16 12:46:36 -05:00
Matthew Honnibal
84e637e2e6 Pass option for pretrained vectors in pipeline 2017-09-16 12:46:02 -05:00
Matthew Honnibal
2a93404da6 Support optional pre-trained vectors in tensorizer model 2017-09-16 12:45:37 -05:00
Matthew Honnibal
e0a2aa9289 Support having word vectors data on GPU 2017-09-16 12:45:09 -05:00
Matthew Honnibal
ebf8942564 Fix test for Python3 2017-09-16 16:22:38 +02:00
Matthew Honnibal
8c945310fb Excuse emoji failure on narrow unicode builds 2017-09-16 16:21:13 +02:00
Matthew Honnibal
11f2a05ede Fix code explosion from long enum in Python 3, Cython 0.24+ 2017-09-16 12:20:04 +02:00
Matthew Honnibal
3fa5b40b5c Add test for hash consistency 2017-09-16 11:21:35 +02:00
Matthew Honnibal
f730d07e4e Fix prange error for Windows 2017-09-16 00:25:33 +02:00
Matthew Honnibal
4b2065430e Merge branch 'feature/parser-history' into develop 2017-09-15 10:42:20 +02:00
Matthew Honnibal
2f08489694 Remove AddHistory layer -- didn't work as planned 2017-09-15 10:41:40 +02:00
Matthew Honnibal
8b481e0465 Remove redundant brackets 2017-09-15 10:38:08 +02:00
Matthew Honnibal
d84607f6bb Vectorize update in AddHistory 2017-09-14 20:34:40 +02:00
Ines Montani
bd3da3d6fb Port over change from #1323 and tidy up 2017-09-14 19:23:13 +02:00
Matthew Honnibal
18347ab69c Implement AddHistory layer wrapper 2017-09-14 19:07:35 +02:00
Matthew Honnibal
d4ca6cef9e Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-09-14 17:00:07 +02:00
Matthew Honnibal
8c503487af Fix lookup of missing NER actions 2017-09-14 16:59:45 +02:00
Matthew Honnibal
664c5af745 Revert padding in parser 2017-09-14 16:59:25 +02:00
Matthew Honnibal
8496d76224 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-09-14 09:21:20 -05:00
Matthew Honnibal
d1518027a9 Increment version 2017-09-14 16:18:46 +02:00
Matthew Honnibal
70da88a3a7 Update comment on Language.begin_training 2017-09-14 16:18:30 +02:00
Matthew Honnibal
c6395b057a Improve parser feature extraction, for missing values 2017-09-14 16:18:02 +02:00
Matthew Honnibal
daf869ab3b Fix add_action for NER, so labelled 'O' actions aren't added 2017-09-14 16:16:41 +02:00
Matthew Honnibal
9cb2aef587 Remove print statement 2017-09-14 13:38:28 +02:00
Matthew Honnibal
ba23d63c35 Fix minibatch function, for fixed batch size 2017-09-14 13:37:41 +02:00
Jim O'Regan
7de709483b missed adding here 2017-09-11 10:51:21 +01:00
Jim O'Regan
b1b6123867 add ga_tokenizer 2017-09-11 10:31:41 +01:00
Jim O'Regan
9dfd301962 rearrange 2017-09-11 10:14:18 +01:00
Jim O'Regan
187be6d372 copy/paste error 2017-09-11 09:33:17 +01:00
Jim O'Regan
c283e9edfe first stab at test 2017-09-11 08:57:48 +01:00
Jim O'Regan
1ee75ae337 Merge remote-tracking branch 'origin/develop' into develop-irish 2017-09-11 08:40:11 +01:00
Matthew Honnibal
456bb8a74c Unxfail and close #1305 2017-09-06 19:14:17 +02:00
Matthew Honnibal
99e44fbdbb Update regression test 2017-09-06 19:13:51 +02:00
Matthew Honnibal
5c3ff06924 Fix lemmatizer rules 2017-09-06 19:13:24 +02:00
Matthew Honnibal
dd9cab0faf Fix type-check for int/long 2017-09-06 19:03:05 +02:00
Matthew Honnibal
497a9308a8 Xfail new lemmatizer test 2017-09-06 18:41:22 +02:00
Matthew Honnibal
dcbf866970 Merge parser changes 2017-09-06 18:41:05 +02:00
Matthew Honnibal
5384fff5ce Add test for 1305: Incorrect lemmatization of VBZ for English 2017-09-06 18:40:18 +02:00
Matthew Honnibal
24ff6b0ad9 Fix parsing and tok2vec models 2017-09-06 05:50:58 -05:00
Matthew Honnibal
1b65115bc2 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-09-04 20:02:53 -05:00
Matthew Honnibal
33fa91feb7 Restore correctness of parser model 2017-09-04 21:19:30 +02:00
Matthew Honnibal
e88a42e460 Increment version 2017-09-04 21:14:39 +02:00
Matthew Honnibal
9d65d67985 Preserve model compatibility in parser, for now 2017-09-04 16:46:22 +02:00
Matthew Honnibal
d5fbf27335 Fix test 2017-09-04 16:45:11 +02:00
Matthew Honnibal
7fdafcc4c4 Fix config loading in tagger 2017-09-04 16:38:49 +02:00
Matthew Honnibal
058372d120 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-09-04 16:27:53 +02:00
Matthew Honnibal
16e25ce3b5 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-09-04 09:26:53 -05:00
Matthew Honnibal
9f512e657a Fix drop_layer calculation 2017-09-04 09:26:38 -05:00
Matthew Honnibal
cb4839033c Fix loader for EN tests 2017-09-04 15:19:18 +02:00
Matthew Honnibal
382ce566eb Fix deserialization bug 2017-09-04 15:19:01 +02:00
Matthew Honnibal
bfddf50081 Fix #1296: Incorrect lemmatization of base form verbs 2017-09-04 15:18:41 +02:00
Matthew Honnibal
b29e6bff46 Improve lemmatization rule for am|VBP 2017-09-04 15:18:10 +02:00
Matthew Honnibal
644d6c9e1a Improve lemmatization tests, re #1296 2017-09-04 15:17:44 +02:00
Matthew Honnibal
3cf3fa1704 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-09-02 12:46:11 -05:00
Matthew Honnibal
e920885676 Fix pickle during train 2017-09-02 12:46:01 -05:00
Matthew Honnibal
c0eaba8b28 Fix low-data textcat 2017-09-02 15:17:32 +02:00
Matthew Honnibal
9e378bdac5 Fix textcat serialization 2017-09-02 15:17:20 +02:00
Matthew Honnibal
e3ea6ee02b Increment version 2017-09-02 15:17:01 +02:00
Matthew Honnibal
a3b69bcb3d Add low_data mode in textcat 2017-09-02 14:56:30 +02:00
Matthew Honnibal
ead78c7b9b Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-09-02 12:55:25 +02:00
Matthew Honnibal
5e6a9e7dcc Add rule-based SBD 2017-09-02 12:53:38 +02:00
Matthew Honnibal
a824cf8f9a Adjust text classification model 2017-09-02 11:41:00 +02:00
Matthew Honnibal
ac040b99bb Add support for pre-trained vectors in text classifier 2017-09-01 16:39:55 +02:00
Matthew Honnibal
7742a6d559 Add GloVe vectors reader 2017-09-01 16:39:22 +02:00
Matthew Honnibal
789e1a3980 Use 13 parser features, not 8 2017-08-31 14:13:00 -05:00
Matthew Honnibal
30e35d9666 Fix syntax error 2017-08-30 17:35:39 -05:00
Matthew Honnibal
4ceebde523 Fix gradient bug in parser 2017-08-30 17:32:56 -05:00
ines
173089a45a Add more validation for model meta 2017-08-29 11:21:46 +02:00
Matthew Honnibal
2e28982e28 Merge pull request #1288 from geovedi/indonesian
Indonesian language support
2017-08-26 21:31:13 +02:00
ines
7e04b7f89c Fix info text on pipeline in package cli 2017-08-26 18:30:59 +02:00
ines
40afa13a8a Increment version 2017-08-26 18:30:49 +02:00
Matthew Honnibal
876f38c548 Merge pull request #1279 from oroszgy/model_cli_v2
Added vector loading to model cli
2017-08-26 15:57:50 +02:00
Matthew Honnibal
cfc055734e Split % in units, for compatibility with corpus 2017-08-25 20:03:37 -05:00
Matthew Honnibal
4bb6bc3f9e Add support for sent_start to GoldParse 2017-08-25 20:03:14 -05:00
Matthew Honnibal
44589fb38c Fix Break oracle 2017-08-25 19:50:55 -05:00
Matthew Honnibal
6d4e8e14ca Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-08-25 12:37:16 -05:00
Matthew Honnibal
4ce5531389 Use layer norm instead of batch norm 2017-08-25 12:37:10 -05:00
Matthew Honnibal
20dd66ddc2 Constrain sentence boundaries to IS_PUNCT and IS_SPACE tokens 2017-08-25 19:35:47 +02:00
Jim Geovedi
58d8078971 Merge remote-tracking branch 'upstream/develop' into indonesian 2017-08-25 09:21:49 +08:00
Matthew Honnibal
6ceb0f0518 Allow Lexeme.rank to be set 2017-08-24 21:43:00 +02:00
Matthew Honnibal
44a1fa80d3 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-08-23 13:02:16 +02:00
ines
bb1abbeba5 Only link model if download was successful 2017-08-23 12:36:31 +02:00
Matthew Honnibal
bb2541ffd3 Fix PROB attr for OOV words 2017-08-23 12:11:52 +02:00
Matthew Honnibal
1c5c256e58 Fix fine_tune when optimizer is None 2017-08-23 10:51:33 +02:00
Matthew Honnibal
9c580ad28a Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-08-22 17:02:04 -05:00
Matthew Honnibal
a4633fff6f Restore use of batch norm in model 2017-08-22 17:01:58 -05:00
Matthew Honnibal
03b5b9727a Fix Doc.vector for empty doc objects 2017-08-22 19:52:19 +02:00
Matthew Honnibal
0551b7b03a Fix doc.vector 2017-08-22 19:46:52 +02:00
Matthew Honnibal
83f8e98450 Fix retrieval of OOV vectors 2017-08-22 19:46:35 +02:00
Matthew Honnibal
df2745eb08 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-08-22 19:00:43 +02:00
Matthew Honnibal
5b329acbf2 Fix vectors_length property in vocab 2017-08-22 19:00:27 +02:00
Matthew Honnibal
1fe605dfe5 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-08-21 19:18:31 -05:00
Matthew Honnibal
18b64e79ec Fix fine tuning 2017-08-21 19:18:26 -05:00
Matthew Honnibal
682346dd66 Restore optimized hidden_depth=0 for parser 2017-08-21 19:18:04 -05:00
Matthew Honnibal
a21d8f3f0b Add predict paths to _ml models 2017-08-21 23:23:45 +02:00
Matthew Honnibal
cec76801dc Add profile command to CLI 2017-08-21 23:23:05 +02:00
Matthew Honnibal
7be5f30f17 Add profile function 2017-08-21 23:22:49 +02:00
ines
a68dc891ea Port over changes from #1281 2017-08-21 23:19:18 +02:00
Matthew Honnibal
5e50a65252 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-08-21 14:15:46 -05:00
Matthew Honnibal
80acbc5f1f Fix fine-tune weight mixture 2017-08-21 14:15:29 -05:00
ines
d15775c3ad Fix typos and commands in alpha docs 2017-08-21 13:40:11 +02:00
Gyorgy Orosz
b3576bfc86 Added vector loading to model cli 2017-08-20 23:16:12 +02:00
Matthew Honnibal
c10f63bf10 Initialize fine tuning to 0.5 2017-08-20 15:59:48 -05:00
Matthew Honnibal
62878e50db Fix misalignment caused by filtering inputs at wrong point in parser 2017-08-20 15:59:28 -05:00
Matthew Honnibal
78a5f842e9 Fix update when update_shared=False 2017-08-20 15:58:34 -05:00
Matthew Honnibal
7a6edeea68 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-08-20 12:55:39 -05:00
Matthew Honnibal
f2f9229964 Fix name of update_shared flag 2017-08-20 18:19:06 +02:00
Matthew Honnibal
8a59718fd6 Fix fine-tuning 2017-08-20 18:17:35 +02:00
Matthew Honnibal
80a5146ec2 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-08-20 11:07:08 -05:00
Matthew Honnibal
84bb543e4d Add gold_preproc flag to cli/train 2017-08-20 11:07:00 -05:00
Matthew Honnibal
3fe0d76e6d Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-08-20 14:50:01 +02:00
Matthew Honnibal
c1d3ff517a Track loss in tagger 2017-08-20 14:42:23 +02:00
Matthew Honnibal
8875590081 Add optimizer in Language.update if sgd=None 2017-08-20 14:42:07 +02:00
Matthew Honnibal
84b7ed49e4 Ensure updates aren't made if no gold available 2017-08-20 14:41:38 +02:00
Ines Montani
c2bbd393af Merge pull request #1276 from oroszgy/model_cli_v2
Ported model cli from v1
2017-08-20 11:52:59 +02:00
Jim Geovedi
f77443ab68 reworked 2017-08-20 13:43:21 +07:00
Jim Geovedi
fbc62a09c7 added {pre,suf,in}fix tests 2017-08-20 13:43:00 +07:00
Jim Geovedi
713d7c0aa0 added indonesian lang test 2017-08-20 12:17:14 +07:00
Jim Geovedi
b7d83f37c8 indonesian abbr. 2017-08-20 12:16:50 +07:00
Jim Geovedi
7193c47f0b direct lookup 2017-08-20 11:57:52 +07:00
Jim Geovedi
fdf802d505 added examples 2017-08-20 11:57:10 +07:00
Jim Geovedi
fa544e6c9a Merge remote-tracking branch 'upstream/develop' into indonesian 2017-08-20 11:49:40 +07:00
Matthew Honnibal
42fa84075f Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-08-19 22:42:50 +02:00
Matthew Honnibal
aefef6fd28 Prevent strings from being lost during from_disk and from_bytes 2017-08-19 22:42:17 +02:00
ines
281e7e58b3 Don't escape forward slashes on ujson.dumps 2017-08-19 22:32:16 +02:00
ines
2d126a00ae Fix typo 2017-08-19 22:32:07 +02:00
Matthew Honnibal
41c2218c53 Fix test for vectors 2017-08-19 22:09:12 +02:00
Matthew Honnibal
b8e1603cc4 Fix load fail for missing vectors 2017-08-19 22:07:00 +02:00
Matthew Honnibal
a3c51a0355 Fix creation of pipeline 2017-08-19 21:58:57 +02:00
Gyorgy Orosz
e5344b83a3 Ported model cli from v1 2017-08-19 21:45:23 +02:00
Matthew Honnibal
6a94648373 Fix serialization 2017-08-19 21:27:35 +02:00
Matthew Honnibal
1157294434 Improve vector handling 2017-08-19 20:35:33 +02:00
Matthew Honnibal
ef87562741 Restore vectors test utils 2017-08-19 20:35:16 +02:00
Matthew Honnibal
1391f9da37 Restore vectors tests 2017-08-19 20:34:58 +02:00
Matthew Honnibal
8cfeeb4884 Increment version 2017-08-19 19:52:58 +02:00
Matthew Honnibal
93fb8b64e9 Fix vector loading 2017-08-19 19:52:25 +02:00
Matthew Honnibal
49a615e7d9 Create Vectors object in Vocab 2017-08-19 18:50:16 +02:00
Matthew Honnibal
3d049af563 Improve vectors to/from disk 2017-08-19 18:42:11 +02:00
Matthew Honnibal
d55d6e1cfa Fix comparison of Token from different docs. Closes #1257 2017-08-19 16:39:32 +02:00
Matthew Honnibal
9b6a5df15e Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-08-19 16:24:57 +02:00
Matthew Honnibal
4fda02c7e6 Add test for new Span.to_array method 2017-08-19 16:24:38 +02:00
Matthew Honnibal
dea229c634 Fix Span.to_array method 2017-08-19 16:24:28 +02:00
Matthew Honnibal
c606b4a42c Add test for Doc.char_span 2017-08-19 16:18:23 +02:00
Matthew Honnibal
8b7ac77c23 Allow span label to be string in Doc.char_span 2017-08-19 16:18:09 +02:00
Matthew Honnibal
7c47e38c12 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-08-19 09:03:15 -05:00
Matthew Honnibal
ab28f911b4 Fix parser learning rates 2017-08-19 09:02:57 -05:00
ines
1fe5e1a4d1 Add language example sentences (see #1107)
da, de, en, es, fr, he, it, nb, pl, pt, sv
2017-08-19 12:22:29 +02:00
Matthew Honnibal
97aabafb5f Document as_tuples keyword arg of Language.pipe 2017-08-19 12:21:33 +02:00
Matthew Honnibal
80236116a6 Add Doc.char_span method, to get a span by character offset 2017-08-19 12:21:09 +02:00
Matthew Honnibal
482bba1722 Add Span.to_array method 2017-08-19 12:20:45 +02:00
Matthew Honnibal
19c495f451 Fix vectors deserialization 2017-08-19 04:33:03 +02:00
Matthew Honnibal
42d47c1e5c Fix tagger serialization 2017-08-19 04:16:32 +02:00
Matthew Honnibal
2da96a0ec7 Fix beam test 2017-08-19 04:15:46 +02:00
Matthew Honnibal
a7309a217d Update tagger serialization 2017-08-18 23:12:05 +02:00
Matthew Honnibal
bae59bf92f Remove BiLSTM import 2017-08-18 22:46:59 +02:00
Matthew Honnibal
c307a0ffb8 Restore patches from nn-beam-parser to spacy/syntax 2017-08-18 22:38:59 +02:00
Matthew Honnibal
fe90dfc390 Restore changes from nn-beam-parser to spacy/_ml 2017-08-18 22:38:28 +02:00
Matthew Honnibal
de7e8703e3 Restore tests for beam parser 2017-08-18 22:27:42 +02:00
Matthew Honnibal
11c31d285c Restore changes from nn-beam-parser 2017-08-18 22:26:12 +02:00
Matthew Honnibal
ce321b0322 Restore changes from nn-beam-parser to spacy/_ml 2017-08-18 22:24:46 +02:00
Matthew Honnibal
5f81d700ff Restore patches from nn-beam-parser to spacy/syntax 2017-08-18 22:23:03 +02:00
Matthew Honnibal
ec482580b5 Restore changes to pipeline.pyx from nn-beam-parser branch 2017-08-18 22:02:35 +02:00
Matthew Honnibal
931509d96a Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-08-18 21:57:15 +02:00
Matthew Honnibal
ed95009b5c Fix data loading on Python 2 2017-08-18 21:57:06 +02:00
Matthew Honnibal
baf36d0588 Add compat function for importlib.util 2017-08-18 21:56:47 +02:00
Matthew Honnibal
263366729e Don't import BiLSTM 2017-08-18 21:56:31 +02:00
Matthew Honnibal
28162290b3 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-08-18 14:55:40 -05:00
Matthew Honnibal
85794c1167 Restore state of _ml.py 2017-08-18 14:55:23 -05:00
Matthew Honnibal
d456d2efe1 Fix conflicts in nn_parser 2017-08-18 20:55:58 +02:00
Matthew Honnibal
1cec1efca7 Fix merge conflicts in nn_parser from beam stuff 2017-08-18 20:50:49 +02:00
Matthew Honnibal
69bcacdc09 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-08-18 20:47:13 +02:00
Matthew Honnibal
2993b54fff Load vectors in vocab 2017-08-18 20:46:56 +02:00
Matthew Honnibal
a1ec41298c Restore CFile loader 2017-08-18 20:46:16 +02:00
Matthew Honnibal
ed4fb991dc Work on vectors loading 2017-08-18 20:45:48 +02:00
Matthew Honnibal
426f84937f Resolve conflicts when merging new beam parsing stuff 2017-08-18 13:38:32 -05:00
Matthew Honnibal
5181e8bedb Fix merge conflict in _ml 2017-08-18 13:35:51 -05:00
Matthew Honnibal
f75420ae79 Unhack beam parsing, moving it under options instead of global flags 2017-08-18 13:31:15 -05:00
Jim Geovedi
7ae45bffcf Merge remote-tracking branch 'upstream/develop' into indonesian 2017-08-18 10:14:46 +07:00
Dan O'Huiginn
ebf5a3ce59 Allow loading with python < 3.6
Don't rely on recent python features to load models

Fixes Issue #1271
2017-08-17 15:15:47 +00:00
Matthew Honnibal
0209a06b4e Update beam parser 2017-08-16 18:25:49 -05:00
Matthew Honnibal
4b1e7bd6d8 Improve tensorizer model 2017-08-16 18:25:20 -05:00
Matthew Honnibal
a6d8d7c82e Add is_gold_parse method to transition system 2017-08-16 18:24:09 -05:00
Matthew Honnibal
3533bb61cb Add option of 8 feature parse state 2017-08-16 18:23:27 -05:00
Matthew Honnibal
1cb2f15d65 Clean up unused predict_confidences function 2017-08-16 18:22:26 -05:00
Matthew Honnibal
210f6d5175 Fix efficiency error in batch parse 2017-08-15 03:19:03 -05:00
Matthew Honnibal
23537a011d Tweaks to beam parser 2017-08-15 03:15:28 -05:00
Matthew Honnibal
500e92553d Fix memory error when copying scores in beam 2017-08-15 03:15:04 -05:00
Matthew Honnibal
a8e4064dd8 Fix tensor gradient in parser 2017-08-15 03:14:36 -05:00
Matthew Honnibal
e420e0366c Remove use of hash function in beam parser 2017-08-15 03:13:57 -05:00
Matthew Honnibal
6259490347 Fix mixture weights in fine_tune 2017-08-14 17:55:18 -05:00
Matthew Honnibal
335fa8b05c Fix gradient in fine_tune 2017-08-14 14:55:47 -05:00
Matthew Honnibal
d9f82f6b50 Increment version 2017-08-14 14:55:26 +02:00
ines
a29f132ffd Change python -m spacy to spacy
Reflects latest change to entry point or auto-alias
2017-08-14 13:04:48 +02:00
ines
65bf80302c Increment version 2017-08-14 13:04:30 +02:00
Matthew Honnibal
52c180ecf5 Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit ea8de11ad5, reversing
changes made to 08e443e083.
2017-08-14 13:00:23 +02:00
Matthew Honnibal
dbbfe595a5 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-08-14 12:09:28 +02:00
Matthew Honnibal
ac6c25f762 Check SGD is not None in update 2017-08-14 12:09:18 +02:00
Matthew Honnibal
0ae045256d Fix beam training 2017-08-13 18:02:05 -05:00
Matthew Honnibal
6a42cc16ff Fix beam parser, improve efficiency of non-beam 2017-08-13 12:37:26 +02:00
Matthew Honnibal
4363b4aa4a Fix redundant tokvecs updates during update 2017-08-13 12:36:55 +02:00
Matthew Honnibal
12de263813 Bug fixes to beam parsing. Learns small sample 2017-08-13 09:33:39 +02:00
Matthew Honnibal
4ae0d5e1e6 Set defaults for convert command 2017-08-13 09:03:38 +02:00
Matthew Honnibal
92ebab6073 Update beam-update tests 2017-08-13 08:56:02 +02:00
Matthew Honnibal
17874fe491 Disable beam parsing 2017-08-12 19:35:40 -05:00
Matthew Honnibal
69f21867b5 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-08-12 19:25:56 -05:00
Matthew Honnibal
3e30712b62 Improve defaults 2017-08-12 19:24:17 -05:00
Matthew Honnibal
28e930aae0 Fixes for beam parsing. Not working 2017-08-12 19:22:52 -05:00
Matthew Honnibal
c96d769836 Fix beam parse. Not sure if working 2017-08-12 18:21:54 -05:00
Matthew Honnibal
24b45b45c6 Add test for beam update 2017-08-12 17:15:28 -05:00
Matthew Honnibal
4638f4b869 Fix beam update 2017-08-12 17:15:16 -05:00
Matthew Honnibal
d4308d2363 Initialize State offset to 0 2017-08-12 17:14:39 -05:00
Matthew Honnibal
b353e4d843 Work on parser beam training 2017-08-12 14:47:45 -05:00
ines
d4f2baf7dd Add create_meta option to package command
Re-create meta.json in model directory, even if it exists. Especially
useful when updating existing spaCy models or training with Prodigy.
Ensures user won't end up with multiple "en_core_web_sm" models, and
offers easy way to change the model's name and settings without having
to edit the meta.json file.
2017-08-12 21:44:18 +02:00
Matthew Honnibal
4ab0c8c8e9 Try different drop_layer structure in Tok2Vec 2017-08-12 08:56:57 -05:00
Matthew Honnibal
cd5ecedf6a Try drop_layer in parser 2017-08-12 08:56:33 -05:00
Matthew Honnibal
8870d491f1 Remove redundant pickling during training 2017-08-12 08:55:53 -05:00
Matthew Honnibal
680043ebca Improve efficiency of tagger.set_annotations for GPU 2017-08-12 08:54:21 -05:00
Matthew Honnibal
ebe0f7f641 Pass embed size correctly in tagger, and cache embeddings for efficiency 2017-08-12 05:45:20 -05:00
Matthew Honnibal
1a59db1c86 Fix dropout and learn rate in parser 2017-08-12 05:44:39 -05:00
Matthew Honnibal
d01dc3704a Adjust parser model 2017-08-09 20:06:33 -05:00
Matthew Honnibal
f37528ef58 Pass embed size for parser fine-tune. Use SELU 2017-08-09 17:52:53 -05:00
Matthew Honnibal
f93f2bed58 Revert use of layer normalization in Tok2Vec 2017-08-09 17:47:03 -05:00
Matthew Honnibal
20944dd8aa Fix conflict in parser fine-tuning 2017-08-09 16:43:05 -05:00
Matthew Honnibal
ac2de6dced Switch to ReLu layers in Tok2Vec 2017-08-09 16:41:25 -05:00
Matthew Honnibal
bbace204be Gate parser fine-tuning behind feature flag 2017-08-09 16:40:42 -05:00
Matthew Honnibal
a59a1deac4 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-08-09 16:23:19 -05:00
Matthew Honnibal
bcce6f7de0 Fix parser fine tuning 2017-08-09 16:23:12 -05:00
ines
28e2fec23b Fix autolinking failure on fresh model install (resolves #1138)
On fresh install via subprocess, pip.get_installed_distributions()
won't show new model, so is_package check in link command fails.
Solution for now is to get model package path explicitly and pass it to
link command.
2017-08-09 11:52:38 +02:00
Jim Geovedi
c62b49b7cc Merge remote-tracking branch 'upstream/develop' into indonesian 2017-08-09 09:17:46 +07:00
Matthew Honnibal
dbdd8afc4b Fix parser fine-tune training 2017-08-08 15:46:07 -05:00
Matthew Honnibal
88bf1cf87c Update parser for fine tuning 2017-08-08 15:34:17 -05:00
Jim O'Regan
c069b4acb5 fix in UD submitted; map either way 2017-08-08 19:22:14 +01:00
Jim O'Regan
76c22dec4d UD Irish tag mapping 2017-08-08 19:04:52 +01:00
Jim O'Regan
95921d7d4c Merge branch 'develop' into develop-irish 2017-08-08 17:21:27 +01:00
Matthew Honnibal
5d837c3776 Add mix weights on fine_tune 2017-08-07 06:32:59 -05:00
Matthew Honnibal
42bd26f6f3 Give parser its own tok2vec weights 2017-08-06 18:33:46 +02:00
Matthew Honnibal
3ed203de25 Use LayerNorm and SELU in Tok2Vec 2017-08-06 18:33:18 +02:00
Matthew Honnibal
78498a072d Return Transition for missing actions in lookup_action 2017-08-06 14:16:36 +02:00
Matthew Honnibal
4a5cc89138 Fix tagger 'fine_tune', to keep private CNN weights 2017-08-06 14:15:48 +02:00
Matthew Honnibal
3cb8f06881 Fix NeuralLabeller 2017-08-06 14:15:14 +02:00
Matthew Honnibal
0acce0521b Fix Language.update for pipeline 2017-08-06 14:13:03 +02:00
Matthew Honnibal
bfffdeabb2 Fix parser batch-size bug introduced during cleanup 2017-08-06 14:10:48 +02:00
Matthew Honnibal
0eec7c9e9b Fix Language.evaluate 2017-08-06 02:18:31 +02:00
Matthew Honnibal
0a566dc320 Add update_tensors flag to Language.update. Experimental, re #1182 2017-08-06 02:18:12 +02:00
Matthew Honnibal
cc19ea0e7c Add update_tensors flag to Language.update. Experimental, re #1182 2017-08-06 02:17:10 +02:00
Matthew Honnibal
4cfb7a54e7 Fix tagger 2017-08-06 01:53:31 +02:00
Matthew Honnibal
e9ab800e15 Fix tagging model 2017-08-06 01:50:08 +02:00
Matthew Honnibal
468c138ab3 WIP: Add fine-tuning logic to tagger model, re #1182 2017-08-06 01:13:23 +02:00
Matthew Honnibal
7f876a7a82 Clean up some unused code in parser 2017-08-06 00:00:21 +02:00
Matthew Honnibal
ae1ad81069 Increment version 2017-08-05 18:09:32 +02:00
Jim Geovedi
cc4772cac2 reworks 2017-08-03 13:08:38 +07:00
Jim Geovedi
37f19f5ed2 added more currencies based on corpus data 2017-08-03 13:03:25 +07:00
Jim Geovedi
30fd068d42 hashtag prefix should be handled somewhere else 2017-08-03 13:03:02 +07:00
Jim Geovedi
4705ae19ba Merge remote-tracking branch 'upstream/develop' into indonesian 2017-08-03 12:40:19 +07:00
Jim Geovedi
ba07e23c87 added USD in currency rules 2017-08-02 22:42:47 +07:00
Matthew Honnibal
5c323daa1a Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-08-01 22:10:37 +02:00
Matthew Honnibal
2e00361522 Fix update when 0 docs 2017-08-01 22:10:17 +02:00
Matthew Honnibal
8fce187de4 Fix ArcEager for missing values 2017-08-01 22:10:05 +02:00
ines
78e262140f Add workaround for displaCy server on Python 2/3 (resolves #1227)

Make sure status and headers are bytes on Python 2 and strings on
Python 3
2017-08-01 01:11:35 +02:00
Jim Geovedi
2572a9ddf0 Merge remote-tracking branch 'upstream/develop' into indonesian 2017-07-30 21:24:16 +07:00
Jim Geovedi
bb08d696f9 added hashtag rule and fixed currency rules 2017-07-30 21:23:28 +07:00
Jim Geovedi
e9af79a803 added u-\d+ rules (sports team) 2017-07-30 21:23:01 +07:00
Matthew Honnibal
27abc56e98 Add method to get beam entities 2017-07-29 21:59:02 +02:00
Matthew Honnibal
ec63f4fe7b Add option to control how missing entities are handled when getting NER tags 2017-07-29 21:58:37 +02:00
Jim Geovedi
e5adc26c72 simplified rules 2017-07-29 18:21:32 +07:00
Jim Geovedi
783f7d8b86 added test set for Indonesian language 2017-07-29 18:21:07 +07:00
Jim Geovedi
4d04898dea updated regexp 2017-07-29 17:44:57 +07:00
Jim Geovedi
7d96d477ea updated like_num 2017-07-29 17:44:46 +07:00
Jim Geovedi
3cca4ed798 added lex attrs rules 2017-07-29 17:22:21 +07:00
Jim Geovedi
8b814c63f1 more exceptions 2017-07-27 19:46:30 +07:00
Jim Geovedi
6c725e8dcf updated lemma 2017-07-27 19:46:21 +07:00
Jim Geovedi
c194f7ae26 Merge remote-tracking branch 'upstream/develop' into indonesian 2017-07-27 10:55:34 +07:00
Jim Geovedi
547973b92a wip syntax iterators 2017-07-27 10:51:34 +07:00
Jim Geovedi
bbc75da38d enable syntax iterator and lemma lookup 2017-07-27 10:51:15 +07:00
Jim Geovedi
24a8c8bf28 added wip lemma dict 2017-07-26 21:39:54 +07:00
Jim Geovedi
63f14ba46b added hyphen-suffix rules 2017-07-26 19:28:57 +07:00
Jim Geovedi
f288964441 removed -el from suffix rules 2017-07-26 19:28:38 +07:00
Jim Geovedi
6eee7a7411 updated tokenizer exceptions 2017-07-26 19:13:47 +07:00
Jim Geovedi
edec51b1b1 update punctuation rules 2017-07-26 19:13:36 +07:00
Jim Geovedi
62443d495a enable token match 2017-07-26 19:13:14 +07:00
Jim Geovedi
c97f5ae0bb updated tokenizer exceptions 2017-07-26 19:12:52 +07:00
Matthew Honnibal
aff325b7e0 Increment version 2017-07-25 19:41:20 +02:00
Matthew Honnibal
6780132821 Fix tagger loading 2017-07-25 19:41:11 +02:00
Matthew Honnibal
fd20a4af55 Increment version 2017-07-25 18:58:34 +02:00
Matthew Honnibal
523b0df2c9 Update text classification model 2017-07-25 18:57:59 +02:00
Matthew Honnibal
7c7fac9337 Add spacy.blank() loading function 2017-07-25 18:56:37 +02:00
Jim Geovedi
73f6ac9d9b added hyphen 2017-07-24 15:56:31 +07:00
Jim Geovedi
68454c40bf added missing import 2017-07-24 14:12:34 +07:00
Jim Geovedi
eaf9cbd708 cursed of copy & paste 2017-07-24 14:11:51 +07:00
Jim Geovedi
7aad6718bc enable tokenizer exceptions 2017-07-24 14:11:10 +07:00
Jim Geovedi
ad56c9179a added tokenizer exceptions list 2017-07-24 14:10:16 +07:00
Jim Geovedi
c1f3fe99fe updated punctuation rules 2017-07-24 13:57:21 +07:00
Jim Geovedi
37fa2c8c80 punctuation rules 2017-07-24 06:17:18 +07:00
Jim Geovedi
082e94ac1c added infix rules 2017-07-24 06:17:07 +07:00
Jim Geovedi
d0ec484725 reverted 2017-07-24 06:16:29 +07:00
Jim Geovedi
0e590c711f added prefix & suffix rules 2017-07-23 23:46:40 +07:00
Jim Geovedi
ba922e30e8 added ampere hour unit 2017-07-23 23:46:18 +07:00
Jim Geovedi
3b17eba27b added frequency units 2017-07-23 23:10:52 +07:00
Jim Geovedi
d5fd32a572 added known currencies 2017-07-23 22:56:48 +07:00
Jim Geovedi
f6f15678fb added lex_attrs 2017-07-23 22:55:22 +07:00
Jim Geovedi
bed8162d00 added tokenizer_exceptions 2017-07-23 22:55:05 +07:00
Jim Geovedi
b80c35bc9a added norm_exceptions 2017-07-23 22:54:49 +07:00
Jim Geovedi
b5de329ea3 added norm_exceptions 2017-07-23 22:54:19 +07:00
Jim Geovedi
082e9ade46 fixed typo 2017-07-23 21:30:34 +07:00
Jim Geovedi
e2efeb186e added stopwords 2017-07-23 20:52:37 +07:00
Jim Geovedi
da98676839 use template 2017-07-23 20:51:31 +07:00
Jim Geovedi
c2b4dd7809 start working on Indonesian language 2017-07-23 20:50:56 +07:00
Matthew Honnibal
5771bd1ff8 Increment version 2017-07-23 14:18:38 +02:00
Matthew Honnibal
c4a81a47a4 Fix deserialization 2017-07-23 14:11:07 +02:00
Matthew Honnibal
2df563ad24 Remove optimization for textcat that caused loading problem 2017-07-23 14:10:51 +02:00
Matthew Honnibal
4fe77bced2 Add cfg attr to pipeline components 2017-07-23 00:52:47 +02:00
Matthew Honnibal
d8aa721664 Compute Language.meta with a property 2017-07-23 00:50:18 +02:00
Matthew Honnibal
a88a7deffe Fix save/load of textcat config 2017-07-23 00:33:43 +02:00
Matthew Honnibal
9bae0ddc50 Fix minibatching 2017-07-22 20:14:49 +02:00
Matthew Honnibal
ded0df5e2f Expose hyper-param as keyword arg 2017-07-22 20:14:37 +02:00
Matthew Honnibal
f5de8deeec Increment version 2017-07-22 20:04:53 +02:00
Matthew Honnibal
b55714d5d1 Make gold_tuples arg optional in begin_training 2017-07-22 20:04:43 +02:00
Matthew Honnibal
ed6c85fa3c Fix loading of text categories in GoldParse 2017-07-22 20:04:03 +02:00
Matthew Honnibal
6ffec9dfea Update _ml, for textcat model 2017-07-22 20:03:40 +02:00
Matthew Honnibal
d6a5c2c85a Add test for NER 2017-07-22 01:48:58 +02:00
Matthew Honnibal
28244df4da Add test for beam parsing 2017-07-22 01:48:35 +02:00
Matthew Honnibal
c86445bdfd Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-07-22 01:14:28 +02:00
Matthew Honnibal
b3a749610e Fix name of TextCategorizer 2017-07-22 01:14:07 +02:00
Matthew Honnibal
2424493970 Remove unnecessary import of Mock 2017-07-22 01:13:54 +02:00
Matthew Honnibal
baa3d81c35 Add text categorizer to Language 2017-07-22 01:13:36 +02:00
Matthew Honnibal
a6a2159969 Add slot for text categories to Doc 2017-07-22 00:34:15 +02:00
Matthew Honnibal
374ab3ecfb Increment alpha version 2017-07-22 00:32:49 +02:00
Matthew Honnibal
289f23df51 Test beam parsing 2017-07-20 15:03:10 +02:00
Matthew Honnibal
3da1063b36 Add beam decoding to parser, to allow NER uncertainties 2017-07-20 15:02:55 +02:00