Jim O'Regan
08b0bfd153
merge
2017-10-31 22:55:59 +00:00
Jim O'Regan
00ecfa5417
Ó, not O
2017-10-31 22:54:42 +00:00
ines
ba2e6c8c6f
Update docstrings and formatting
2017-10-31 23:23:34 +01:00
Matthew Honnibal
0de8d213a3
Merge pull request #1475 from explosion/feature/sm-vectors
...
Improve and simplify Vectors class
2017-10-31 22:59:50 +01:00
Ines Montani
25b1d6cd91
Fix syntax error
2017-10-31 22:36:03 +01:00
Matthew Honnibal
92dc127569
Fix test for Python 3
2017-10-31 22:21:55 +01:00
Jim O'Regan
fe4b10346a
replace example sentence until I get around to adding a punctuation.py
2017-10-31 20:24:53 +00:00
Matthew Honnibal
c5799ecc7b
Remove print statement
2017-10-31 21:12:33 +01:00
ines
7e424a1804
Don't copy exception dicts if not necessary and tidy up
2017-10-31 21:05:29 +01:00
Matthew Honnibal
c390f2d745
Make it easier to pass explicit no-pruning to vocab
2017-10-31 20:14:47 +01:00
Ines Montani
06c25a8882
Remove comma that caused list to wrap in tuple!
...
Also removed extra dict wrappings for performance (we used to have them in there, but they should only really exist if copying the dict is absolutely necessary)
2017-10-31 20:13:16 +01:00
Matthew Honnibal
d90a22afe6
Fix loading previous vectors models
2017-10-31 19:58:35 +01:00
Ines Montani
147448b65b
Add missing symbols
2017-10-31 19:34:45 +01:00
Matthew Honnibal
997a61557a
Add vectors.n_keys property
2017-10-31 19:30:52 +01:00
Matthew Honnibal
8075726838
Restore vector usage in models
2017-10-31 19:21:17 +01:00
Matthew Honnibal
3659a807b0
Remove vector pruning arg from train CLI
2017-10-31 19:21:05 +01:00
Ines Montani
9b0de9fb43
Fix import of symbols (now nested one level lower)
2017-10-31 19:17:58 +01:00
Matthew Honnibal
59203a2e8a
Move vector pruning command into spacy vocab cli tool
2017-10-31 19:10:01 +01:00
Matthew Honnibal
77d8f5de9a
Revise and simplify Vectors class
2017-10-31 18:25:08 +01:00
Jim O'Regan
d4a8160c36
change quotes
2017-10-31 15:15:44 +00:00
Jim O'Regan
34ca59691b
no idea what is wrong here
2017-10-31 14:50:13 +00:00
Jim O'Regan
41dd29e48e
merge
2017-10-31 14:07:45 +00:00
Matthew Honnibal
cb5217012f
Fix vector remapping
2017-10-31 11:40:46 +01:00
Matthew Honnibal
9c11ee4a1c
WIP on vectors fixes
2017-10-31 11:22:56 +01:00
Matthew Honnibal
ce876c551e
Fix GPU usage
2017-10-31 02:33:34 +01:00
Matthew Honnibal
7698903617
Fix GPU usage
2017-10-31 02:33:16 +01:00
Matthew Honnibal
368fdb389a
WIP on refactoring and fixing vectors
2017-10-31 02:00:26 +01:00
Matthew Honnibal
4e3006cec7
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-10-30 19:44:58 +01:00
Matthew Honnibal
4112a991ec
Fix vector pruning
2017-10-30 19:44:40 +01:00
ines
ec657c1ddc
Update vocab docs and document Vocab.prune_vectors
2017-10-30 19:35:41 +01:00
ines
803e41bc66
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-10-30 18:39:51 +01:00
ines
8e02294241
Add vectors to Language.meta
2017-10-30 18:39:48 +01:00
ines
abf8aa05d3
Populate --create-meta defaults from file if available
...
If meta.json is found in directory and user chooses to overwrite it, show existing data as defaults.
2017-10-30 18:39:38 +01:00
ines
ce98fa7934
Fix formatting
2017-10-30 18:38:55 +01:00
ines
98c35d2585
Fix spacy vocab command
2017-10-30 18:38:41 +01:00
Matthew Honnibal
e98451b5f7
Add -prune-vectors argument to spacy.cli.train
2017-10-30 18:00:10 +01:00
Matthew Honnibal
e026b29ea9
Add prune_vectors method to Vocab
2017-10-30 17:59:43 +01:00
Explosion Bot
d0cf12c8c7
Fix off-by-one error in vectors
2017-10-30 16:22:03 +01:00
Explosion Bot
05a1dd570e
Fix vocab script
2017-10-30 16:19:22 +01:00
Explosion Bot
b46bdce8d2
Add missing import
2017-10-30 16:18:10 +01:00
Explosion Bot
2d2cc294b4
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-10-30 16:15:05 +01:00
Explosion Bot
0fc1209421
Wire up new vocab command
2017-10-30 16:14:50 +01:00
Explosion Bot
aa64031751
Fix clear_vectors() method on Vocab
2017-10-30 16:09:04 +01:00
Explosion Bot
7b56b2f04b
Add Vocab.cfg attr, to hold stuff like oov probs
2017-10-30 16:08:50 +01:00
Explosion Bot
ab5d5ed880
Fix vectors.add()
2017-10-30 16:08:09 +01:00
Explosion Bot
41d0f1665a
Fix add_attrs for cluster
2017-10-30 16:07:50 +01:00
ines
5453821a9f
Update NER annotation scheme
...
Add note on training data sources and include coarse-grained Wikipedia scheme
2017-10-30 13:53:49 +01:00
Explosion Bot
5ede7cec9b
Improve Lexeme.set_attrs method
2017-10-30 11:49:11 +01:00
Explosion Bot
72aea8f105
Update vectors.add() to allow setting keys to rows
2017-10-30 10:03:08 +01:00
Matthew Honnibal
c43cc5361d
Merge pull request #1467 from explosion/feature/better-parser
...
💫 Bug fixes to parser model (requires retraining)
2017-10-29 02:05:22 +02:00
ines
6c2d8d3b2a
Use shortcuts-nightly.json to resolve model shortcuts
2017-10-29 01:28:31 +02:00
Matthew Honnibal
a0c7dabb72
Fix bug in 8-token parser features
2017-10-28 23:01:35 +00:00
Matthew Honnibal
b713d10d97
Switch to 13 features in parser
2017-10-28 23:01:14 +00:00
Matthew Honnibal
3b91097321
Whitespace
2017-10-28 17:05:11 +00:00
Matthew Honnibal
6ef72864fa
Improve initialization for hidden layers
2017-10-28 17:05:01 +00:00
Matthew Honnibal
5414e2f14b
Use missing features in parser
2017-10-28 16:45:54 +00:00
Matthew Honnibal
df4803cc6d
Add learned missing values for parser
2017-10-28 16:45:14 +00:00
Matthew Honnibal
64e4ff7c4b
Merge 'tidy-up' changes into branch. Resolve conflicts
2017-10-28 13:16:06 +02:00
Explosion Bot
fb0c96f39a
Fix optimizer loading
2017-10-28 11:58:16 +02:00
Explosion Bot
b22e42af7f
Merge changes to parser and _ml
2017-10-28 11:52:10 +02:00
ines
d96e72f656
Tidy up rest
2017-10-27 21:07:59 +02:00
ines
a8e10f94e4
Tidy up Lexeme and update docs
2017-10-27 21:07:50 +02:00
ines
ba5e646219
Tidy up pipeline
2017-10-27 20:29:08 +02:00
ines
b4d226a3f1
Tidy up syntax
2017-10-27 19:45:57 +02:00
ines
5167a0cce2
Tidy up Vectors and docs
2017-10-27 19:45:19 +02:00
ines
7946464742
Remove spacy.tagger (now in pipeline)
2017-10-27 19:45:04 +02:00
ines
9c89e2cdef
Remove unused syntax iterators (now in language data)
2017-10-27 18:09:53 +02:00
ines
d2df81d907
Fix not implemented Span getters
2017-10-27 18:09:28 +02:00
ines
544a407b93
Tidy up Doc, Token and Span and add missing docs
2017-10-27 17:07:26 +02:00
ines
a6135336f5
Tidy up gold
2017-10-27 17:02:55 +02:00
ines
6a0483b7aa
Tidy up and document Doc, Token and Span
2017-10-27 15:41:45 +02:00
ines
1a559d4c95
Remove old, unused file
2017-10-27 15:34:35 +02:00
ines
91899d337b
Tidy up language, lemmatizer and scorer
2017-10-27 14:40:14 +02:00
ines
778212efea
Tidy up init and main
2017-10-27 14:39:51 +02:00
ines
e33b7e0b3c
Tidy up parser and ML
2017-10-27 14:39:30 +02:00
ines
e3265998c0
Tidy up displaCy
2017-10-27 14:39:19 +02:00
ines
ea4a41c8fb
Tidy up util and helpers
2017-10-27 14:39:09 +02:00
ines
d941fc3667
Tidy up CLI
2017-10-27 14:38:39 +02:00
Matthew Honnibal
531142a933
Merge remote-tracking branch 'origin/develop' into feature/better-parser
2017-10-27 12:34:48 +00:00
Matthew Honnibal
19a2b9bf27
Fix import of Optimizer
2017-10-27 12:33:42 +00:00
Matthew Honnibal
4d048e94d3
Add compat for thinc.neural.optimizers.Optimizer
2017-10-27 10:23:49 +00:00
Ines Montani
4033e70c71
Merge pull request #1461 from explosion/feature/disable-pipes
...
💫 Add Language.disable_pipes(), to temporarily edit pipeline and update code examples
2017-10-27 12:21:40 +02:00
Matthew Honnibal
75a637fa43
Remove redundant imports from _ml
2017-10-27 10:19:56 +00:00
Matthew Honnibal
c9987cf131
Avoid use of numpy.tensordot
2017-10-27 10:18:36 +00:00
Matthew Honnibal
f6fef30adc
Remove dead code from spacy._ml
2017-10-27 10:16:41 +00:00
Matthew Honnibal
b9616419e1
Add try/except around bz2 import
2017-10-27 01:18:05 +00:00
Matthew Honnibal
783c0c8795
Remove unnecessary bz2 import
2017-10-27 01:17:54 +00:00
Matthew Honnibal
bb25bdcd92
Adjust call to scatter_add for the new version
2017-10-27 01:16:55 +00:00
Ines Montani
287a3ca256
Merge pull request #1466 from explosion/feature/rename-pipeline
...
💫 Clean up dead linear model code
2017-10-27 02:03:28 +02:00
ines
4eb5bd02e7
Update textcat pre-processing after to_array change
2017-10-27 00:32:12 +02:00
ines
2d6ec99884
Set 'model' as default model name to prevent meta.json errors
2017-10-26 16:12:23 +02:00
ines
9e372913e0
Remove old 'SP' condition in tag map
2017-10-26 16:11:57 +02:00
Matthew Honnibal
c52671420c
Remove old cfile import
2017-10-26 13:28:19 +02:00
Matthew Honnibal
ea03f1ef64
Remove obsolete cfile code
2017-10-26 13:23:36 +02:00
Matthew Honnibal
90d1d9b230
Remove obsolete parser code
2017-10-26 13:22:45 +02:00
ines
6f78e29bed
Add LAW entity label to glossary
2017-10-26 13:04:35 +02:00
ines
9bf78d5fb3
Update spacy.explain docs
2017-10-26 13:04:25 +02:00
Matthew Honnibal
33f8c58782
Remove obsolete parser.pyx
2017-10-26 12:42:05 +02:00
Matthew Honnibal
a8abc47811
Rename BaseThincComponent --> Pipe
2017-10-26 12:40:40 +02:00
Matthew Honnibal
b0f3ea2200
Fix names of pipeline components
...
NeuralDependencyParser --> DependencyParser
NeuralEntityRecognizer --> EntityRecognizer
TokenVectorEncoder --> Tensorizer
NeuralLabeller --> MultitaskObjective
2017-10-26 12:38:23 +02:00
Matthew Honnibal
b6b4f1aaf7
Merge pull request #1462 from explosion/feature/vector-meta-data
...
💫 Add vector meta data to model meta.json on train/package and show in docs
2017-10-26 11:39:41 +02:00
Matthew Honnibal
35977bdbb9
Update better-parser branch with develop
2017-10-26 00:55:53 +00:00
Ines Montani
090bd00369
Merge pull request #1464 from mayukh18/develop_bengali_pronouns
...
added the bengali pronouns for v2.0
2017-10-25 21:55:25 +02:00
mayukh18
1bc07758fa
added few bengali pronouns
2017-10-25 22:24:40 +05:30
ines
de1e5f35d5
Merge branch 'develop' into feature/disable-pipes
2017-10-25 16:33:12 +02:00
ines
728b609bf9
Merge branch 'develop' into feature/vector-meta-data
2017-10-25 16:32:22 +02:00
ines
c0b55ebdac
Fix PhraseMatcher.__contains__ and add more tests
2017-10-25 16:31:11 +02:00
ines
91beacf5e3
Fix Matcher.__contains__
2017-10-25 16:19:38 +02:00
ines
11e3f19764
Fix vectors data added after training (see #1457)
2017-10-25 16:08:26 +02:00
ines
057954695b
Read pipeline and vector data off model in --generate-meta
2017-10-25 16:03:26 +02:00
ines
273e638183
Add vector data to model meta after training (see #1457)
2017-10-25 16:03:05 +02:00
ines
18aae423fb
Remove import of non-existing function
2017-10-25 15:54:10 +02:00
ines
5117a7d24d
Fix whitespace
2017-10-25 15:54:02 +02:00
ines
657a4d91bc
Merge branch 'develop' into feature/disable-pipes
2017-10-25 15:19:05 +02:00
ines
1a722dac31
Merge branch 'develop' into feature/disable-pipes
2017-10-25 15:18:18 +02:00
ines
6a00de4f77
Fix check of unexpected pipe names in restore()
2017-10-25 14:56:35 +02:00
ines
7f03932477
Return self on __enter__
2017-10-25 14:56:16 +02:00
Matthew Honnibal
b5de768852
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-10-25 14:44:16 +02:00
Matthew Honnibal
094512fd47
Fix model-mark on regression test.
2017-10-25 14:44:00 +02:00
Matthew Honnibal
e70f80f29e
Add Language.disable_pipes()
2017-10-25 13:46:41 +02:00
Matthew Honnibal
075e8118ea
Update from develop
2017-10-25 12:45:21 +02:00
ines
72497c8cb2
Remove comments and add TODO
2017-10-25 12:15:43 +02:00
ines
4d97efc3b5
Add missing docstrings
2017-10-25 12:10:16 +02:00
ines
1262aa0bf9
Implement PhraseMatcher.__contains__
2017-10-25 12:10:04 +02:00
ines
9c733a8849
Implement PhraseMatcher.__len__
2017-10-25 12:09:56 +02:00
ines
7eebeeaf85
Fix Matcher.__contains__
2017-10-25 12:09:47 +02:00
ines
7bcec57462
Remove unused attribute
2017-10-25 12:08:54 +02:00
ines
0b1dcbac14
Remove unused function
2017-10-25 12:08:46 +02:00
ines
3484174e48
Add Language.path
2017-10-25 11:57:43 +02:00
Ines Montani
d3bf488e16
Merge pull request #1171 from mollerhoj/support-danish
...
Improve basic support for Danish
2017-10-24 20:29:57 +02:00
Matthew Honnibal
d9bb1e5de8
Increment version
2017-10-24 17:06:19 +02:00
Matthew Honnibal
908809d488
Update tests
2017-10-24 17:05:15 +02:00
Matthew Honnibal
66766c1454
Restore SP tag to English tag_map, until models migrate
2017-10-24 17:05:00 +02:00
Matthew Honnibal
30e67fa808
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-10-24 16:08:23 +02:00
Matthew Honnibal
b0f6fd3f1d
Disable tokenizer cache for special-cases. Fixes #1250
2017-10-24 16:08:05 +02:00
Matthew Honnibal
63f0bde749
Add test for #1250: Tokenizer cache clobbered special-case attrs
2017-10-24 16:07:18 +02:00
ines
8492d5be6d
Always make lemmatizer return a list of lemmas, not a set
2017-10-24 16:00:56 +02:00
ines
95f866f99f
Add lookup argument to Lemmatizer.load
2017-10-24 16:00:56 +02:00
ines
95f6174516
Remove tensorizer from model pipeline example in spacy package
2017-10-24 16:00:56 +02:00
ines
090aed940a
Add test for currently failing span.as_doc case
2017-10-24 16:00:56 +02:00
ines
4ef81a9ebc
Fix whitespace
2017-10-24 16:00:56 +02:00
Matthew Honnibal
18f1c1d0ba
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-10-24 14:29:43 +02:00
Matthew Honnibal
4bea65a1a8
Fix Issue #1450: Off-by-1 in * and ? matches
...
Patterns that end in variable-length operators e.g. * and ? now end on
the correct token. Previously, they were off by 1: the next token was
pulled into the match, even if that's where the pattern failed.
2017-10-24 14:26:27 +02:00
Matthew Honnibal
391d5ef0d1
Normalize imports in regression test
2017-10-24 14:25:49 +02:00
ines
c55db0a4a1
Add example sentences for Japanese and Chinese (see #1107)
2017-10-24 13:02:24 +02:00
ines
66f8f9d4a0
Fix Japanese tokenizer
...
JapaneseTokenizer now returns a Doc, not individual words
2017-10-24 13:02:19 +02:00
Matthew Honnibal
dd5b2d8fa3
Check for out-of-memory when calling calloc. Closes #1446
2017-10-24 12:40:47 +02:00
Matthew Honnibal
b66b8f028b
Fix #1375 -- out-of-bounds on token.nbor()
2017-10-24 12:10:39 +02:00
Matthew Honnibal
a68d89a4f3
Add failing test for bug #1375 -- no out-of-bounds error for token.nbor()
2017-10-24 12:05:25 +02:00
Ines Montani
facf77e541
Merge branch 'develop' into support-danish
2017-10-24 11:53:19 +02:00
Matthew Honnibal
ccd2ab1a62
Merge pull request #1443 from ramananbalakrishnan/develop-get-lca-matrix
...
Add LCA matrix for spans and docs
2017-10-24 11:22:46 +02:00
Matthew Honnibal
ef3e5a361b
Merge pull request #1442 from explosion/feature/fix-sp
...
💫 Fix SP tag, tweak Vectors.__init__, fix Morphology
2017-10-24 10:24:07 +02:00
Matthew Honnibal
fdf25d10ba
Merge pull request #1440 from ramananbalakrishnan/develop
...
Support single value for attribute list in doc.to_array
2017-10-24 10:23:12 +02:00
Matthew Honnibal
e7556ff048
Fix non-maxout parser
2017-10-23 18:16:23 +02:00
ines
a31f048b4d
Fix formatting
2017-10-23 10:38:06 +02:00
Matthew Honnibal
490ad3eaf0
Check that empty strings are handled. Closes #1242
2017-10-21 00:52:14 +02:00
Matthew Honnibal
8f8bccecb9
Patch deserialisation for invalid loads, to avoid model failure
2017-10-21 00:51:42 +02:00
Ramanan Balakrishnan
d2fe56a577
Add LCA matrix for spans and docs
2017-10-20 23:58:00 +05:30
Matthew Honnibal
d8391b1c4d
Fix #1434: Matcher failed on ending ? if no token
2017-10-20 16:49:36 +02:00
Matthew Honnibal
fec53f09f7
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-10-20 16:28:34 +02:00
Matthew Honnibal
f111b228e0
Fix re-parsing of previously parsed text
...
If a Doc object had been previously parsed, it was possible for
invalid parses to be added. There were two problems:
1) The parse was only being partially erased
2) The RightArc action was able to create a 1-cycle.
This patch fixes both errors, and avoids resetting the parse if one is
present. In theory this might allow a better parse to be predicted by
running the parser twice.
Closes #1253.
2017-10-20 16:27:36 +02:00
Matthew Honnibal
1036798155
Make parser consistent if maxout==1
2017-10-20 16:24:16 +02:00
Matthew Honnibal
3faf9189a2
Make parser hidden shape consistent even if maxout==1
2017-10-20 16:23:31 +02:00
Matthew Honnibal
9010a1a060
Create vectors correctly
2017-10-20 14:19:46 +02:00
Matthew Honnibal
33229b1c9e
Remove print statement
2017-10-20 14:19:29 +02:00
Matthew Honnibal
cfae54c507
Make change to Vectors.__init__
2017-10-20 14:19:04 +02:00
Matthew Honnibal
ebecaddb76
Make 'data_or_width' two keyword args in Vectors.__init__
...
Previously the data and width options were one argument in Vectors,
which meant you couldn't say vectors = Vectors(strings, width=300).
It's better to have two keywords.
2017-10-20 14:17:15 +02:00
Matthew Honnibal
49895fbef6
Rename 'SP' special tag to '_SP'
...
Renaming the tag with an underscore lets us add it to the tag map
without worrying that we'll change the sequence of tags, which throws
off the tag-to-ID mapping. For instance, if we inserted a 'SP' tag,
the "VERB" tag is pushed to a different class ID, and the model is all
messed up.
2017-10-20 14:01:12 +02:00
Matthew Honnibal
506cf2eb13
Remove cpdef enum, to avoid too much code generation
2017-10-20 14:00:23 +02:00
Matthew Honnibal
6218af0105
Remove cpdef enum, to avoid too much code generation
2017-10-20 13:59:57 +02:00
Matthew Honnibal
92ac9316b5
Fix initialization of vectors, to address serialization problem
2017-10-20 13:59:24 +02:00
Ramanan Balakrishnan
0726946563
cleanup to_array implementation using fixes on master
2017-10-20 17:09:37 +05:30
ines
108f1f786e
Update symbols and document missing token attributes (see #1439)
2017-10-20 13:08:44 +02:00
ines
4acab77a8a
Add missing symbol for LAW entities (resolves #1427)
2017-10-20 13:07:57 +02:00
Matthew Honnibal
b101736555
Fix precomputed layer
2017-10-20 12:14:52 +02:00
Ramanan Balakrishnan
b3ab124fc5
Support strings for attribute list in doc.to_array
2017-10-20 11:46:57 +05:30
Matthew Honnibal
64658e02e5
Implement fancier initialisation for precomputed layer
2017-10-20 03:07:45 +02:00
Matthew Honnibal
827cd8a883
Fix support of maxout pieces in parser
2017-10-20 03:07:17 +02:00
Matthew Honnibal
a8850b4282
Remove redundant PrecomputableMaxouts class
2017-10-19 20:27:34 +02:00
Matthew Honnibal
a17a1b60c7
Clean up redundant PrecomputableMaxouts class
2017-10-19 20:26:37 +02:00
Matthew Honnibal
b00d0a2c97
Fix bias in parser
2017-10-19 18:42:11 +02:00
Matthew Honnibal
b54b4b8a97
Make parser_maxout_pieces hyper-param work
2017-10-19 13:45:18 +02:00
Matthew Honnibal
03a215c5fd
Make PrecomputableAffines work
2017-10-19 13:44:49 +02:00
Ramanan Balakrishnan
7b9b1be44c
Support single value for attribute list in doc.to_array
2017-10-19 17:00:41 +05:30
Matthew Honnibal
61bc203f3f
Merge pull request #1438 from explosion/feature/fast-parser
...
💫 Improve runtime CPU efficiency of parser/NER
2017-10-19 02:42:21 +02:00
Matthew Honnibal
15e5a04a8d
Clean up more depth=0 conditional code
2017-10-19 01:48:43 +02:00
Matthew Honnibal
906c50ac59
Fix loop typing, that caused error on windows
2017-10-19 01:48:39 +02:00
ines
24512420b1
Show error if data_path does not exist or is None (see #1102)
2017-10-19 00:53:49 +02:00
ines
bf415fd778
Add test for serializing extension attrs (see #1085)
2017-10-19 00:53:08 +02:00
Matthew Honnibal
960788aaa2
Eliminate dead code in parser, and raise errors for obsolete options
2017-10-19 00:42:34 +02:00
Matthew Honnibal
bbfd7d8d5d
Clean up parser multi-threading
2017-10-19 00:25:21 +02:00
Matthew Honnibal
f018f2030c
Try optimized parser forward loop
2017-10-18 21:48:00 +02:00
Matthew Honnibal
65bf5e85bd
Improve piping in language.pipe
2017-10-18 21:46:12 +02:00
Matthew Honnibal
633a75c7e0
Break parser batches into sub-batches, sorted by length.
2017-10-18 21:45:01 +02:00
Ines Montani
f0d577e460
Merge pull request #1425 from explosion/feature/hindi-tokenizer
...
💫 Basic Hindi tokenization support
2017-10-18 13:34:52 +02:00
Matthew Honnibal
394633efce
Make doc pickling support hooks
2017-10-17 19:44:09 +02:00
Matthew Honnibal
fe844148f6
Test pickling hooks
2017-10-17 19:43:52 +02:00
Matthew Honnibal
cdb0c426d8
Improve deserialization of user_data, esp. for Underscore
2017-10-17 19:29:20 +02:00
Matthew Honnibal
374819edf8
Test user_data deserialization, re #1085
2017-10-17 19:28:54 +02:00
Matthew Honnibal
e35a83d142
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-10-17 18:22:06 +02:00
Matthew Honnibal
f45973848c
Rename 'tokens' variable 'doc' in tokenizer
2017-10-17 18:21:41 +02:00
Matthew Honnibal
839de87ca9
Make lambda func a named function, for pickling
2017-10-17 18:21:20 +02:00
Matthew Honnibal
9baa8fe7ec
Convert closure to functools.partial, to promote pickling
2017-10-17 18:20:52 +02:00
Matthew Honnibal
32a8564c79
Fix doc pickling
2017-10-17 18:20:24 +02:00
Matthew Honnibal
8ca97f32a3
Fix doc pickling test
2017-10-17 18:19:57 +02:00
Matthew Honnibal
9ce7d6af87
Make lex attr functions top-level functions, to promote pickling
2017-10-17 18:19:18 +02:00
Matthew Honnibal
1cc85a89ef
Allow reasonably efficient pickling of Language class, using to_bytes() and from_bytes().
2017-10-17 18:18:49 +02:00
Matthew Honnibal
0d57b9748a
Serialize lex_attr_getters with dill, for better pickle support
2017-10-17 18:17:45 +02:00
Matthew Honnibal
45d1dd90b1
Add tests for pickling doc
2017-10-17 17:20:58 +02:00
Ines Montani
afa67de7ee
Merge pull request #1428 from roanuz/develop
...
Fix trailing whitespace and Language.from_disk overwrites
2017-10-17 16:29:15 +02:00
Matthew Honnibal
92c1eb2d6f
Fix Doc pickling. This also removes need for Binder class
2017-10-17 16:11:13 +02:00
Matthew Honnibal
ed8da9b11f
Add missing return statement in SentenceSegmenter
2017-10-17 15:32:56 +02:00
Ines Montani
aab299c8ae
Merge pull request #1429 from vishnunekkanti/develop
...
fix syntax error in zh
2017-10-17 14:45:02 +02:00
Anto Binish Kaspar
534240648e
Fix trailing whitespace on morphology features
2017-10-17 17:15:58 +05:30
Anto Binish Kaspar
8f5b60c168
Fix Language.from_disk overwrites the meta.json file.
2017-10-17 17:15:32 +05:30
ines
8ca344712d
Add Language.has_pipe method
2017-10-17 11:20:07 +02:00
ines
485c4f6df5
Add Hungarian examples (see #1107)
2017-10-17 02:37:45 +02:00
Matthew Honnibal
19531bad4c
Merge branch 'develop' into feature/streaming-data-memory-growth
2017-10-16 21:44:11 +02:00
Matthew Honnibal
df488274b1
Fix deserialization of vectors
2017-10-16 20:55:00 +02:00
Matthew Honnibal
4018486d31
Merge remote-tracking branch 'origin/develop' into feature/streaming-data-memory-growth
2017-10-16 20:49:48 +02:00
Matthew Honnibal
4174477161
Fix equality check in test
2017-10-16 19:50:35 +02:00
Matthew Honnibal
2bc06e4b22
Bump rolling buffer size to 10k
2017-10-16 19:38:29 +02:00
Matthew Honnibal
66e2eb8f39
Clean up remnant of frozen in StringStore
2017-10-16 19:34:41 +02:00
Matthew Honnibal
a002264fec
Remove caching of Token in Doc, as caused cycle.
2017-10-16 19:34:21 +02:00
Matthew Honnibal
3e037054c8
Remove obsolete is_frozen functionality from StringStore
2017-10-16 19:23:10 +02:00
Matthew Honnibal
5c14f3f033
Create a rolling buffer for the StringStore in Language.pipe()
2017-10-16 19:22:40 +02:00
Matthew Honnibal
59c216196c
Allow weakrefs on Doc objects
2017-10-16 19:22:11 +02:00
ines
d5418553eb
Fix whitespace
2017-10-16 18:30:04 +02:00
ines
6ceadcdb5c
Make sure from_disk passes string to numpy (see #1421)
...
If path is a WindowsPath, numpy does not recognise it as a path and as
a result, doesn't open the file.
https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L369
2017-10-16 18:29:56 +02:00
Matthew Honnibal
010a7309ff
Merge pull request #1402 from explosion/feature/fix-matcher-operators
...
💫 Fix Matcher variable-length operators
2017-10-16 17:53:19 +02:00
Matthew Honnibal
c29927d2e7
Fix matcher test
2017-10-16 17:22:18 +02:00
Vishnu Kumar Nekkanti
d3c54cf39a
fixed SyntaxError while checking for jieba
2017-10-16 18:51:33 +05:30
Matthew Honnibal
a928ae2f35
Merge branch 'develop' into feature/fix-matcher-operators
2017-10-16 13:38:36 +02:00
Matthew Honnibal
56aa42cc5d
Fix and document matcher operator 'shadowing' behaviour
2017-10-16 13:38:20 +02:00
Matthew Honnibal
748d525801
Add more matcher operator tests
2017-10-16 13:38:01 +02:00
Matthew Honnibal
0433181658
Document operator semantics in Matcher docstring
2017-10-16 12:06:33 +02:00
ines
266e7180a7
Add Language class, stop words and basic stemmer that sets NORM
2017-10-14 14:59:52 +02:00
ines
e85e1d571b
Update base punctuation
2017-10-14 14:59:23 +02:00
ines
9d6c8eaa49
Update base norm exceptions with more unicode characters
...
e.g. unicode variations of punctuation used in Chinese
2017-10-14 14:58:52 +02:00
ines
3516aa0cea
Port over changes from #1389
2017-10-14 13:32:55 +02:00
ines
cd6a29dce7
Port over changes from #1294
2017-10-14 13:28:46 +02:00
ines
38c756fd85
Port over changes from #1287
2017-10-14 13:16:21 +02:00
ines
612224c10d
Port over changes from #1157
2017-10-14 13:11:39 +02:00
ines
9b3f8f9ec3
Fix formatting and add comment on languages
2017-10-14 13:11:18 +02:00
ines
a4d974d97b
Port over URL pattern changes from #1411
2017-10-14 12:58:07 +02:00
ines
09aed58140
Port over changes from #1333 and add comments
2017-10-14 12:52:59 +02:00
Matthew Honnibal
cf6da9301a
Update lemmatizer test
2017-10-12 22:50:52 +02:00
Matthew Honnibal
9b90d235d1
Fix tag check in lemmatizer
2017-10-12 22:50:43 +02:00
Matthew Honnibal
dc01acd821
Escape encoding in validate function
2017-10-12 22:23:21 +02:00
Matthew Honnibal
27b927259a
Add locale_escape compat function
2017-10-12 22:22:04 +02:00