Commit Graph

4486 Commits

Author SHA1 Message Date
Ines Montani
147448b65b
Add missing symbols 2017-10-31 19:34:45 +01:00
Matthew Honnibal
997a61557a Add vectors.n_keys property 2017-10-31 19:30:52 +01:00
Matthew Honnibal
8075726838 Restore vector usage in models 2017-10-31 19:21:17 +01:00
Matthew Honnibal
3659a807b0 Remove vector pruning arg from train CLI 2017-10-31 19:21:05 +01:00
Ines Montani
9b0de9fb43
Fix import of symbols (now nested one level lower) 2017-10-31 19:17:58 +01:00
Matthew Honnibal
59203a2e8a Move vector pruning command into spacy vocab cli tool 2017-10-31 19:10:01 +01:00
Matthew Honnibal
77d8f5de9a Revise and simplify Vectors class 2017-10-31 18:25:08 +01:00
Jim O'Regan
d4a8160c36 change quotes 2017-10-31 15:15:44 +00:00
Jim O'Regan
34ca59691b no idea what is wrong here 2017-10-31 14:50:13 +00:00
Jim O'Regan
41dd29e48e merge 2017-10-31 14:07:45 +00:00
Matthew Honnibal
cb5217012f Fix vector remapping 2017-10-31 11:40:46 +01:00
Matthew Honnibal
9c11ee4a1c WIP on vectors fixes 2017-10-31 11:22:56 +01:00
Matthew Honnibal
ce876c551e Fix GPU usage 2017-10-31 02:33:34 +01:00
Matthew Honnibal
7698903617 Fix GPU usage 2017-10-31 02:33:16 +01:00
Matthew Honnibal
368fdb389a WIP on refactoring and fixing vectors 2017-10-31 02:00:26 +01:00
Matthew Honnibal
4e3006cec7 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-30 19:44:58 +01:00
Matthew Honnibal
4112a991ec Fix vector pruning 2017-10-30 19:44:40 +01:00
ines
ec657c1ddc Update vocab docs and document Vocab.prune_vectors 2017-10-30 19:35:41 +01:00
ines
803e41bc66 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-30 18:39:51 +01:00
ines
8e02294241 Add vectors to Language.meta 2017-10-30 18:39:48 +01:00
ines
abf8aa05d3 Populate --create-meta defaults from file if available
If meta.json is found in directory and user chooses to overwrite it, show existing data as defaults.
2017-10-30 18:39:38 +01:00
ines
ce98fa7934 Fix formatting 2017-10-30 18:38:55 +01:00
ines
98c35d2585 Fix spacy vocab command 2017-10-30 18:38:41 +01:00
Matthew Honnibal
e98451b5f7 Add -prune-vectors argument to spacy.cly.train 2017-10-30 18:00:10 +01:00
Matthew Honnibal
e026b29ea9 Add prune_vectors method to Vocab 2017-10-30 17:59:43 +01:00
Explosion Bot
d0cf12c8c7 Fix off-by-one error in vectors 2017-10-30 16:22:03 +01:00
Explosion Bot
05a1dd570e Fix vocab script 2017-10-30 16:19:22 +01:00
Explosion Bot
b46bdce8d2 Add missing import 2017-10-30 16:18:10 +01:00
Explosion Bot
2d2cc294b4 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-30 16:15:05 +01:00
Explosion Bot
0fc1209421 Wire up new vocab command 2017-10-30 16:14:50 +01:00
Explosion Bot
aa64031751 Fix clear_vectors() method on Vocab 2017-10-30 16:09:04 +01:00
Explosion Bot
7b56b2f04b Add Vocab.cfg attr, to hold stuff like oov probs 2017-10-30 16:08:50 +01:00
Explosion Bot
ab5d5ed880 Fix vectors.add() 2017-10-30 16:08:09 +01:00
Explosion Bot
41d0f1665a Fix add_attrs for cluster 2017-10-30 16:07:50 +01:00
ines
5453821a9f Update NER annotation scheme
Add note on training data sources and include coarse-grained Wikipedia scheme
2017-10-30 13:53:49 +01:00
Explosion Bot
5ede7cec9b Improve Lexeme.set_attrs method 2017-10-30 11:49:11 +01:00
Explosion Bot
72aea8f105 Update vectors.add() to allow setting keys to rows 2017-10-30 10:03:08 +01:00
Matthew Honnibal
c43cc5361d
Merge pull request #1467 from explosion/feature/better-parser
💫 Bug fixes to parser model (requires retraining)
2017-10-29 02:05:22 +02:00
ines
6c2d8d3b2a Use shortcuts-nightly.json to resolve model shortcuts 2017-10-29 01:28:31 +02:00
Matthew Honnibal
a0c7dabb72 Fix bug in 8-token parser features 2017-10-28 23:01:35 +00:00
Matthew Honnibal
b713d10d97 Switch to 13 features in parser 2017-10-28 23:01:14 +00:00
Matthew Honnibal
3b91097321 Whitespace 2017-10-28 17:05:11 +00:00
Matthew Honnibal
6ef72864fa Improve initialization for hidden layers 2017-10-28 17:05:01 +00:00
Matthew Honnibal
5414e2f14b Use missing features in parser 2017-10-28 16:45:54 +00:00
Matthew Honnibal
df4803cc6d Add learned missing values for parser 2017-10-28 16:45:14 +00:00
Matthew Honnibal
64e4ff7c4b Merge 'tidy-up' changes into branch. Resolve conflicts 2017-10-28 13:16:06 +02:00
Explosion Bot
fb0c96f39a Fix optimizer loading 2017-10-28 11:58:16 +02:00
Explosion Bot
b22e42af7f Merge changes to parser and _ml 2017-10-28 11:52:10 +02:00
ines
d96e72f656 Tidy up rest 2017-10-27 21:07:59 +02:00
ines
a8e10f94e4 Tidy up Lexeme and update docs 2017-10-27 21:07:50 +02:00
ines
ba5e646219 Tidy up pipeline 2017-10-27 20:29:08 +02:00
ines
b4d226a3f1 Tidy up syntax 2017-10-27 19:45:57 +02:00
ines
5167a0cce2 Tidy up Vectors and docs 2017-10-27 19:45:19 +02:00
ines
7946464742 Remove spacy.tagger (now in pipeline) 2017-10-27 19:45:04 +02:00
ines
9c89e2cdef Remove unused syntax iterators (now in language data) 2017-10-27 18:09:53 +02:00
ines
d2df81d907 Fix not implemented Span getters 2017-10-27 18:09:28 +02:00
ines
544a407b93 Tidy up Doc, Token and Span and add missing docs 2017-10-27 17:07:26 +02:00
ines
a6135336f5 Tidy up gold 2017-10-27 17:02:55 +02:00
ines
6a0483b7aa Tidy up and document Doc, Token and Span 2017-10-27 15:41:45 +02:00
ines
1a559d4c95 Remove old, unused file 2017-10-27 15:34:35 +02:00
ines
91899d337b Tidy up language, lemmatizer and scorer 2017-10-27 14:40:14 +02:00
ines
778212efea Tidy up init and main 2017-10-27 14:39:51 +02:00
ines
e33b7e0b3c Tidy up parser and ML 2017-10-27 14:39:30 +02:00
ines
e3265998c0 Tidy up displaCy 2017-10-27 14:39:19 +02:00
ines
ea4a41c8fb Tidy up util and helpers 2017-10-27 14:39:09 +02:00
ines
d941fc3667 Tidy up CLI 2017-10-27 14:38:39 +02:00
Matthew Honnibal
531142a933 Merge remote-tracking branch 'origin/develop' into feature/better-parser 2017-10-27 12:34:48 +00:00
Matthew Honnibal
19a2b9bf27 Fix import of Optimizer 2017-10-27 12:33:42 +00:00
Matthew Honnibal
4d048e94d3 Add compat for thinc.neural.optimizers.Optimizer 2017-10-27 10:23:49 +00:00
Ines Montani
4033e70c71 Merge pull request #1461 from explosion/feature/disable-pipes
💫 Add Language.disable_pipes(), to temporarily edit pipeline and update code examples
2017-10-27 12:21:40 +02:00
Matthew Honnibal
75a637fa43 Remove redundant imports from _ml 2017-10-27 10:19:56 +00:00
Matthew Honnibal
c9987cf131 Avoid use of numpy.tensordot 2017-10-27 10:18:36 +00:00
Matthew Honnibal
f6fef30adc Remove dead code from spacy._ml 2017-10-27 10:16:41 +00:00
Matthew Honnibal
b9616419e1 Add try/except around bz2 import 2017-10-27 01:18:05 +00:00
Matthew Honnibal
783c0c8795 Remove unnecessary bz2 import 2017-10-27 01:17:54 +00:00
Matthew Honnibal
bb25bdcd92 Adjust call to scatter_add for the new version 2017-10-27 01:16:55 +00:00
Ines Montani
287a3ca256 Merge pull request #1466 from explosion/feature/rename-pipeline
💫 Clean up dead linear model code
2017-10-27 02:03:28 +02:00
ines
4eb5bd02e7 Update textcat pre-processing after to_array change 2017-10-27 00:32:12 +02:00
ines
2d6ec99884 Set 'model' as default model name to prevent meta.json errors 2017-10-26 16:12:23 +02:00
ines
9e372913e0 Remove old 'SP' condition in tag map 2017-10-26 16:11:57 +02:00
Matthew Honnibal
c52671420c Remove old cfile import 2017-10-26 13:28:19 +02:00
Matthew Honnibal
ea03f1ef64 Remove obsolete cfile code 2017-10-26 13:23:36 +02:00
Matthew Honnibal
90d1d9b230 Remove obsolete parser code 2017-10-26 13:22:45 +02:00
ines
6f78e29bed Add LAW entity label to glossary 2017-10-26 13:04:35 +02:00
ines
9bf78d5fb3 Update spacy.explain docs 2017-10-26 13:04:25 +02:00
Matthew Honnibal
33f8c58782 Remove obsolete parser.pyx 2017-10-26 12:42:05 +02:00
Matthew Honnibal
a8abc47811 Rename BaseThincComponent --> Pipe 2017-10-26 12:40:40 +02:00
Matthew Honnibal
b0f3ea2200 Fix names of pipeline components
NeuralDependencyParser --> DependencyParser
NeuralEntityRecognizer --> EntityRecognizer
TokenVectorEncoder     --> Tensorizer
NeuralLabeller         --> MultitaskObjective
2017-10-26 12:38:23 +02:00
Matthew Honnibal
b6b4f1aaf7 Merge pull request #1462 from explosion/feature/vector-meta-data
💫 Add vector meta data to model meta.json on train/package and show in docs
2017-10-26 11:39:41 +02:00
Matthew Honnibal
35977bdbb9 Update better-parser branch with develop 2017-10-26 00:55:53 +00:00
Ines Montani
090bd00369 Merge pull request #1464 from mayukh18/develop_bengali_pronouns
added the bengali pronouns for v2.0
2017-10-25 21:55:25 +02:00
mayukh18
1bc07758fa added few bengali pronouns 2017-10-25 22:24:40 +05:30
ines
de1e5f35d5 Merge branch 'develop' into feature/disable-pipes 2017-10-25 16:33:12 +02:00
ines
728b609bf9 Merge branch 'develop' into feature/vector-meta-data 2017-10-25 16:32:22 +02:00
ines
c0b55ebdac Fix PhraseMatcher.__contains__ and add more tests 2017-10-25 16:31:11 +02:00
ines
91beacf5e3 Fix Matcher.__contains__ 2017-10-25 16:19:38 +02:00
ines
11e3f19764 Fix vectors data added after training (see #1457) 2017-10-25 16:08:26 +02:00
ines
057954695b Read pipeline and vector data off model in --generate-meta 2017-10-25 16:03:26 +02:00
ines
273e638183 Add vector data to model meta after training (see #1457) 2017-10-25 16:03:05 +02:00
ines
18aae423fb Remove import of non-existing function 2017-10-25 15:54:10 +02:00
ines
5117a7d24d Fix whitespace 2017-10-25 15:54:02 +02:00
ines
657a4d91bc Merge branch 'develop' into feature/disable-pipes 2017-10-25 15:19:05 +02:00
ines
1a722dac31 Merge branch 'develop' into feature/disable-pipes 2017-10-25 15:18:18 +02:00
ines
6a00de4f77 Fix check of unexpected pipe names in restore() 2017-10-25 14:56:35 +02:00
ines
7f03932477 Return self on __enter__ 2017-10-25 14:56:16 +02:00
Matthew Honnibal
b5de768852 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-25 14:44:16 +02:00
Matthew Honnibal
094512fd47 Fix model-mark on regression test. 2017-10-25 14:44:00 +02:00
Matthew Honnibal
e70f80f29e Add Language.disable_pipes() 2017-10-25 13:46:41 +02:00
Matthew Honnibal
075e8118ea Update from develop 2017-10-25 12:45:21 +02:00
ines
72497c8cb2 Remove comments and add TODO 2017-10-25 12:15:43 +02:00
ines
4d97efc3b5 Add missing docstrings 2017-10-25 12:10:16 +02:00
ines
1262aa0bf9 Implement PhraseMatcher.__contains__ 2017-10-25 12:10:04 +02:00
ines
9c733a8849 Implement PhraseMatcher.__len__ 2017-10-25 12:09:56 +02:00
ines
7eebeeaf85 Fix Matcher.__contains__ 2017-10-25 12:09:47 +02:00
ines
7bcec57462 Remove unused attribute 2017-10-25 12:08:54 +02:00
ines
0b1dcbac14 Remove unused function 2017-10-25 12:08:46 +02:00
ines
3484174e48 Add Language.path 2017-10-25 11:57:43 +02:00
Ines Montani
d3bf488e16 Merge pull request #1171 from mollerhoj/support-danish
Improve basic support for Danish
2017-10-24 20:29:57 +02:00
Matthew Honnibal
d9bb1e5de8 Increment version 2017-10-24 17:06:19 +02:00
Matthew Honnibal
908809d488 Update tests 2017-10-24 17:05:15 +02:00
Matthew Honnibal
66766c1454 Restore SP tag to English tag_map, until models migrate 2017-10-24 17:05:00 +02:00
Matthew Honnibal
30e67fa808 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-24 16:08:23 +02:00
Matthew Honnibal
b0f6fd3f1d Disable tokenizer cache for special-cases. Fixes #1250 2017-10-24 16:08:05 +02:00
Matthew Honnibal
63f0bde749 Add test for #1250: Tokenizer cache clobbered special-case attrs 2017-10-24 16:07:18 +02:00
ines
8492d5be6d Always make lemmatizer return a list of lemmas, not a set 2017-10-24 16:00:56 +02:00
ines
95f866f99f Add lookup argument to Lemmatizer.load 2017-10-24 16:00:56 +02:00
ines
95f6174516 Remove tensorizer from model pipeline example in spacy package 2017-10-24 16:00:56 +02:00
ines
090aed940a Add test for currently failing span.as_doc case 2017-10-24 16:00:56 +02:00
ines
4ef81a9ebc Fix whitespace 2017-10-24 16:00:56 +02:00
Matthew Honnibal
18f1c1d0ba Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-24 14:29:43 +02:00
Matthew Honnibal
4bea65a1a8 Fix Issue #1450: Off-by-1 in * and ? matches
Patterns that end in variable-length operators e.g. * and ? now end on
the correct token. Previously, they were off by 1: the next token was
pulled into the match, even if that's where the pattern failed.
2017-10-24 14:26:27 +02:00
Matthew Honnibal
391d5ef0d1 Normalize imports in regression test 2017-10-24 14:25:49 +02:00
ines
c55db0a4a1 Add example sentences for Japanese and Chinese (see #1107) 2017-10-24 13:02:24 +02:00
ines
66f8f9d4a0 Fix Japanese tokenizer
JapaneseTokenizer now returns a Doc, not individual words
2017-10-24 13:02:19 +02:00
Matthew Honnibal
dd5b2d8fa3 Check for out-of-memory when calling calloc. Closes #1446 2017-10-24 12:40:47 +02:00
Matthew Honnibal
b66b8f028b Fix #1375 -- out-of-bounds on token.nbor() 2017-10-24 12:10:39 +02:00
Matthew Honnibal
a68d89a4f3 Add failing test for bug #1375 -- no out-of-bounds error for token.nbor() 2017-10-24 12:05:25 +02:00
Ines Montani
facf77e541 Merge branch 'develop' into support-danish 2017-10-24 11:53:19 +02:00
Matthew Honnibal
ccd2ab1a62 Merge pull request #1443 from ramananbalakrishnan/develop-get-lca-matrix
Add LCA matrix for spans and docs
2017-10-24 11:22:46 +02:00
Matthew Honnibal
ef3e5a361b Merge pull request #1442 from explosion/feature/fix-sp
💫Fix SP tag, tweak Vectors.__init__, fix Morphology
2017-10-24 10:24:07 +02:00
Matthew Honnibal
fdf25d10ba Merge pull request #1440 from ramananbalakrishnan/develop
Support single value for attribute list in doc.to_array
2017-10-24 10:23:12 +02:00
Matthew Honnibal
e7556ff048 Fix non-maxout parser 2017-10-23 18:16:23 +02:00
ines
a31f048b4d Fix formatting 2017-10-23 10:38:06 +02:00
Matthew Honnibal
490ad3eaf0 Check that empty strings are handled. Closes #1242 2017-10-21 00:52:14 +02:00
Matthew Honnibal
8f8bccecb9 Patch deserialisation for invalid loads, to avoid model failure 2017-10-21 00:51:42 +02:00
Ramanan Balakrishnan
d2fe56a577
Add LCA matrix for spans and docs 2017-10-20 23:58:00 +05:30
Matthew Honnibal
d8391b1c4d Fix #1434: Matcher failed on ending ? if no token 2017-10-20 16:49:36 +02:00
Matthew Honnibal
fec53f09f7 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-20 16:28:34 +02:00
Matthew Honnibal
f111b228e0 Fix re-parsing of previously parsed text
If a Doc object had been previously parsed, it was possible for
invalid parses to be added. There were two problems:

1) The parse was only being partially erased
2) The RightArc action was able to create a 1-cycle.

This patch fixes both errors, and avoids resetting the parse if one is
present. In theory this might allow a better parse to be predicted by
running the parser twice.

Closes #1253.
2017-10-20 16:27:36 +02:00
Matthew Honnibal
1036798155 Make parser consistent if maxout==1 2017-10-20 16:24:16 +02:00
Matthew Honnibal
3faf9189a2 Make parser hidden shape consistent even if maxout==1 2017-10-20 16:23:31 +02:00
Matthew Honnibal
9010a1a060 Create vectors correctly 2017-10-20 14:19:46 +02:00
Matthew Honnibal
33229b1c9e Remove print statement 2017-10-20 14:19:29 +02:00
Matthew Honnibal
cfae54c507 Make change to Vectors.__init__ 2017-10-20 14:19:04 +02:00
Matthew Honnibal
ebecaddb76 Make 'data_or_width' two keyword args in Vectors.__init__
Previously the data and width options were one argument in Vectors,
which meant you couldn't say vectors = Vectors(strings, width=300).
It's better to have two keywords.
2017-10-20 14:17:15 +02:00
Matthew Honnibal
49895fbef6 Rename 'SP' special tag to '_SP'
Renaming the tag with an underscore lets us add it to the tag map
without worrying that we'll change the sequence of tags, which throws
off the tag-to-ID mapping. For instance, if we inserted a 'SP' tag,
the "VERB" tag is pushed to a different class ID, and the model is all
messed up.
2017-10-20 14:01:12 +02:00
Matthew Honnibal
506cf2eb13 Remove cpdef enum, to avoid too much code generation 2017-10-20 14:00:23 +02:00
Matthew Honnibal
6218af0105 Remove cpdef enum, to avoid too much code generation 2017-10-20 13:59:57 +02:00
Matthew Honnibal
92ac9316b5 Fix initialization of vectors, to address serialization problem 2017-10-20 13:59:24 +02:00
Ramanan Balakrishnan
0726946563
cleanup to_array implementation using fixes on master 2017-10-20 17:09:37 +05:30
ines
108f1f786e Update symbols and document missing token attributes (see #1439) 2017-10-20 13:08:44 +02:00
ines
4acab77a8a Add missing symbol for LAW entities (resolves #1427) 2017-10-20 13:07:57 +02:00
Matthew Honnibal
b101736555 Fix precomputed layer 2017-10-20 12:14:52 +02:00
Ramanan Balakrishnan
b3ab124fc5
Support strings for attribute list in doc.to_array 2017-10-20 11:46:57 +05:30
Matthew Honnibal
64658e02e5 Implement fancier initialisation for precomputed layer 2017-10-20 03:07:45 +02:00
Matthew Honnibal
827cd8a883 Fix support of maxout pieces in parser 2017-10-20 03:07:17 +02:00
Matthew Honnibal
a8850b4282 Remove redundant PrecomputableMaxouts class 2017-10-19 20:27:34 +02:00
Matthew Honnibal
a17a1b60c7 Clean up redundant PrecomputableMaxouts class 2017-10-19 20:26:37 +02:00
Matthew Honnibal
b00d0a2c97 Fix bias in parser 2017-10-19 18:42:11 +02:00
Matthew Honnibal
b54b4b8a97 Make parser_maxout_pieces hyper-param work 2017-10-19 13:45:18 +02:00
Matthew Honnibal
03a215c5fd Make PrecomputableAffines work 2017-10-19 13:44:49 +02:00
Ramanan Balakrishnan
7b9b1be44c
Support single value for attribute list in doc.to_array 2017-10-19 17:00:41 +05:30
Matthew Honnibal
61bc203f3f Merge pull request #1438 from explosion/feature/fast-parser
💫 Improve runtime CPU efficiency of parser/NER
2017-10-19 02:42:21 +02:00
Matthew Honnibal
15e5a04a8d Clean up more depth=0 conditional code 2017-10-19 01:48:43 +02:00
Matthew Honnibal
906c50ac59 Fix loop typing, that caused error on windows 2017-10-19 01:48:39 +02:00
ines
24512420b1 Show error if data_path does not exist or is None (see #1102) 2017-10-19 00:53:49 +02:00
ines
bf415fd778 Add test for serializing extension attrs (see #1085) 2017-10-19 00:53:08 +02:00
Matthew Honnibal
960788aaa2 Eliminate dead code in parser, and raise errors for obsolete options 2017-10-19 00:42:34 +02:00
Matthew Honnibal
bbfd7d8d5d Clean up parser multi-threading 2017-10-19 00:25:21 +02:00
Matthew Honnibal
f018f2030c Try optimized parser forward loop 2017-10-18 21:48:00 +02:00
Matthew Honnibal
65bf5e85bd Improve piping in language.pipe 2017-10-18 21:46:12 +02:00
Matthew Honnibal
633a75c7e0 Break parser batches into sub-batches, sorted by length. 2017-10-18 21:45:01 +02:00
Ines Montani
f0d577e460 Merge pull request #1425 from explosion/feature/hindi-tokenizer
💫 Basic Hindi tokenization support
2017-10-18 13:34:52 +02:00
Matthew Honnibal
394633efce Make doc pickling support hooks 2017-10-17 19:44:09 +02:00
Matthew Honnibal
fe844148f6 Test pickling hooks 2017-10-17 19:43:52 +02:00
Matthew Honnibal
cdb0c426d8 Improve deserialization of user_data, esp. for Underscore 2017-10-17 19:29:20 +02:00
Matthew Honnibal
374819edf8 Test user_data deserialization, re #1085 2017-10-17 19:28:54 +02:00
Matthew Honnibal
e35a83d142 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-17 18:22:06 +02:00
Matthew Honnibal
f45973848c Rename 'tokens' variable 'doc' in tokenizer 2017-10-17 18:21:41 +02:00
Matthew Honnibal
839de87ca9 Make lambda func a named function, for pickling 2017-10-17 18:21:20 +02:00
Matthew Honnibal
9baa8fe7ec Convert closure to functools.partial, to promote pickling 2017-10-17 18:20:52 +02:00
Matthew Honnibal
32a8564c79 Fix doc pickling 2017-10-17 18:20:24 +02:00
Matthew Honnibal
8ca97f32a3 Fix doc pickling test 2017-10-17 18:19:57 +02:00
Matthew Honnibal
9ce7d6af87 Make lex attr functions top-level functions, to promote pickling 2017-10-17 18:19:18 +02:00
Matthew Honnibal
1cc85a89ef Allow reasonably efficient pickling of Language class, using to_bytes() and from_bytes(). 2017-10-17 18:18:49 +02:00
Matthew Honnibal
0d57b9748a Serialize lex_attr_getters with dill, for better pickle support 2017-10-17 18:17:45 +02:00
Matthew Honnibal
45d1dd90b1 Add tests for pickling doc 2017-10-17 17:20:58 +02:00
Ines Montani
afa67de7ee Merge pull request #1428 from roanuz/develop
Fix trailing whitespace and Language.from_disk overwrites
2017-10-17 16:29:15 +02:00
Matthew Honnibal
92c1eb2d6f Fix Doc pickling. This also removes need for Binder class 2017-10-17 16:11:13 +02:00
Matthew Honnibal
ed8da9b11f Add missing return statement in SentenceSegmenter 2017-10-17 15:32:56 +02:00
Ines Montani
aab299c8ae Merge pull request #1429 from vishnunekkanti/develop
fix syntax error in zh
2017-10-17 14:45:02 +02:00
Anto Binish Kaspar
534240648e Fix trailing whitespace on morphology features 2017-10-17 17:15:58 +05:30
Anto Binish Kaspar
8f5b60c168 Fix Language.from_disk overwrites the meta.json file. 2017-10-17 17:15:32 +05:30
ines
8ca344712d Add Language.has_pipe method 2017-10-17 11:20:07 +02:00
ines
485c4f6df5 Add Hungarian examples (see #1107) 2017-10-17 02:37:45 +02:00
Matthew Honnibal
19531bad4c Merge branch 'develop' into feature/streaming-data-memory-growth 2017-10-16 21:44:11 +02:00
Matthew Honnibal
df488274b1 Fix deserialization of vectors 2017-10-16 20:55:00 +02:00
Matthew Honnibal
4018486d31 Merge remote-tracking branch 'origin/develop' into feature/streaming-data-memory-growth 2017-10-16 20:49:48 +02:00
Matthew Honnibal
4174477161 Fix equality check in test 2017-10-16 19:50:35 +02:00
Matthew Honnibal
2bc06e4b22 Bump rolling buffer size to 10k 2017-10-16 19:38:29 +02:00
Matthew Honnibal
66e2eb8f39 Clean up remnant of frozen in StringStore 2017-10-16 19:34:41 +02:00
Matthew Honnibal
a002264fec Remove caching of Token in Doc, as caused cycle. 2017-10-16 19:34:21 +02:00
Matthew Honnibal
3e037054c8 Remove obsolete is_frozen functionality from StringStore 2017-10-16 19:23:10 +02:00
Matthew Honnibal
5c14f3f033 Create a rolling buffer for the StringStore in Language.pipe() 2017-10-16 19:22:40 +02:00
Matthew Honnibal
59c216196c Allow weakrefs on Doc objects 2017-10-16 19:22:11 +02:00
ines
d5418553eb Fix whitespace 2017-10-16 18:30:04 +02:00
ines
6ceadcdb5c Make sure from_disk passes string to numpy (see #1421)
If path is a WindowsPath, numpy does not recognise it as a path and as
a result, doesn't open the file.
https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L369
2017-10-16 18:29:56 +02:00
Matthew Honnibal
010a7309ff Merge pull request #1402 from explosion/feature/fix-matcher-operators
💫 Fix Matcher variable-length operators
2017-10-16 17:53:19 +02:00
Matthew Honnibal
c29927d2e7 Fix matcher test 2017-10-16 17:22:18 +02:00
Vishnu Kumar Nekkanti
d3c54cf39a fixed SyntaxError while checking for jieba 2017-10-16 18:51:33 +05:30
Matthew Honnibal
a928ae2f35 Merge branch 'develop' into feature/fix-matcher-operators 2017-10-16 13:38:36 +02:00
Matthew Honnibal
56aa42cc5d Fix and document matcher operator 'shadowing' behaviour 2017-10-16 13:38:20 +02:00
Matthew Honnibal
748d525801 Add more matcher operator tests 2017-10-16 13:38:01 +02:00
Matthew Honnibal
0433181658 Document operator semantics in Matcher docstring 2017-10-16 12:06:33 +02:00
ines
266e7180a7 Add Language class, stop words and basic stemmer that sets NORM 2017-10-14 14:59:52 +02:00
ines
e85e1d571b Update base punctuation 2017-10-14 14:59:23 +02:00
ines
9d6c8eaa49 Update base norm exceptions with more unicode characters
e.g. unicode variations of punctuation used in Chinese
2017-10-14 14:58:52 +02:00
ines
3516aa0cea Port over changes from #1389 2017-10-14 13:32:55 +02:00
ines
cd6a29dce7 Port over changes from #1294 2017-10-14 13:28:46 +02:00
ines
38c756fd85 Port over changes from #1287 2017-10-14 13:16:21 +02:00
ines
612224c10d Port over changes from #1157 2017-10-14 13:11:39 +02:00
ines
9b3f8f9ec3 Fix formatting and add comment on languages 2017-10-14 13:11:18 +02:00
ines
a4d974d97b Port over URL pattern changes from #1411 2017-10-14 12:58:07 +02:00
ines
09aed58140 Port over changes from #1333 and add comments 2017-10-14 12:52:59 +02:00
Matthew Honnibal
cf6da9301a Update lemmatizer test 2017-10-12 22:50:52 +02:00
Matthew Honnibal
9b90d235d1 Fix tag check in lemmatizer 2017-10-12 22:50:43 +02:00
Matthew Honnibal
dc01acd821 Escape encoding in validate function 2017-10-12 22:23:21 +02:00
Matthew Honnibal
27b927259a Add locale_escape compat function 2017-10-12 22:22:04 +02:00
ines
9c6de3dcfa Merge branch 'develop' into feature/cli-validate 2017-10-12 21:44:28 +02:00
Matthew Honnibal
462caf835a Fix SBD test 2017-10-12 21:18:22 +02:00
ines
fff1028391 Add validate CLI command 2017-10-12 20:05:06 +02:00
Matthew Honnibal
908f44c3fe Disable history features by default 2017-10-12 14:56:11 +02:00
Matthew Honnibal
a955843684 Increase default number of epochs 2017-10-12 13:13:01 +02:00
Matthew Honnibal
cecfcc7711 Set default hyper params back to 'slow' settings 2017-10-12 13:12:26 +02:00
Ines Montani
37aa523a8e Merge pull request #1408 from explosion/feature/dot-underscore
💫 Custom attributes via Doc._, Token._ and Span._
2017-10-11 18:35:56 +02:00
ines
8ce6f96180 Don't make copies of language data components 2017-10-11 15:34:55 +02:00
ines
51519251c2 Fix underscore method test 2017-10-11 13:34:19 +02:00
ines
c6ae49e8bf Fix formatting 2017-10-11 13:34:11 +02:00
ines
453c47ca24 Add German lemmatizer tests 2017-10-11 13:27:26 +02:00
ines
15fe0fd82d Fix tests 2017-10-11 13:27:18 +02:00