Matthew Honnibal
432b3499b3
Fix memory leak
2017-05-21 13:38:46 -05:00
Matthew Honnibal
4c9202249d
Refactor training, to fix memory leak
2017-05-21 09:07:06 -05:00
ines
d82ae9a585
Change "function" to "callable" in docs
2017-05-21 13:17:40 +02:00
Matthew Honnibal
3b7c108246
Pass tokvecs through as a list, instead of concatenated. Also fix padding
2017-05-20 13:23:32 -05:00
Matthew Honnibal
66ea9aebe7
Remove the state argument from Language
2017-05-19 13:25:42 -05:00
ines
2c8c9dc0c9
Update docstrings and API docs for Language
2017-05-19 18:47:24 +02:00
ines
d42bc16868
Update docstrings and API docs for Language class
2017-05-18 23:57:38 +02:00
Matthew Honnibal
c2c825127a
Fix use_params and pipe methods
2017-05-18 08:30:59 -05:00
Matthew Honnibal
2713041571
Fix GPU usage in Language
2017-05-18 04:25:19 -05:00
Matthew Honnibal
793430aa7a
Get spaCy train command working with neural network
...
* Integrate models into pipeline
* Add basic serialization (maybe incorrect)
* Fix pickle on vocab
2017-05-17 12:04:50 +02:00
Matthew Honnibal
8cf097ca88
Redesign training to integrate NN components
...
* Obsolete .parser, .entity etc names in favour of .pipeline
* Components no longer create models on initialization
* Models created by loading method (from_disk(), from_bytes() etc), or
.begin_training()
* Add .predict(), .set_annotations() methods in components
* Pass state through pipeline, to allow components to share information
more flexibly.
2017-05-16 16:17:30 +02:00
Matthew Honnibal
5211645af3
Get data flowing through pipeline. Needs redesign
2017-05-16 11:21:59 +02:00
Matthew Honnibal
a9edb3aa1d
Improve integration of NN parser, to support unified training API
2017-05-15 21:53:27 +02:00
Matthew Honnibal
9e167b7bb6
Strip serializer from code
2017-05-09 17:28:50 +02:00
ines
ea5fa46475
Import LEX_ATTRS from lang.lex_attrs
2017-05-09 00:58:10 +02:00
ines
6eb6306843
Fix language data imports
2017-05-08 23:58:31 +02:00
Matthew Honnibal
d0e19267e8
Create directory if missing in save_to_directory
2017-04-23 21:24:43 +02:00
Matthew Honnibal
4d2a659c52
Fix json dump for Python3
2017-04-23 17:05:53 +02:00
ines
ddd5194088
Update Language docs and docstrings
2017-04-17 01:52:13 +02:00
ines
f62b740961
Use compat.json_dumps
2017-04-17 01:46:14 +02:00
ines
8e83f8e2fa
Update docstrings
2017-04-17 01:40:26 +02:00
ines
e2299dc389
Ensure path in save_to_directory
2017-04-17 01:40:14 +02:00
Matthew Honnibal
4efd6fb9d6
Fix training
2017-04-16 15:28:27 -05:00
Matthew Honnibal
89a4f262fc
Fix training methods
2017-04-16 13:00:37 -05:00
ines
c05ec4b89a
Add compat functions and remove old workarounds
...
Add ensure_path util function to handle checking instance of path
2017-04-15 12:11:16 +02:00
ines
d24589aa72
Clean up imports, unused code, whitespace, docstrings
2017-04-15 12:05:47 +02:00
ines
561f2a3eb4
Use consistent formatting for docstrings
2017-04-15 11:59:21 +02:00
Matthew Honnibal
33ba5066eb
Refactor Language.end_training, making new save_to_directory method
2017-04-14 23:51:24 +02:00
oeg
010293fb2f
fix(typo): Fixes typo in method calling PseudoProjectivity.deprojectivize, failing with new train cli
2017-04-06 17:33:15 +02:00
Matthew Honnibal
47a3ef06a6
Unhack deprojetivization, moving it into pipeline
...
Previously the deprojectivize() call was attached to the transition
system, and only called for German. Instead it should be a separate
process, called after the parser. This makes it available for any
language. Closes #898 .
2017-03-31 12:31:50 +02:00
Matthew Honnibal
83ba6c247c
Fix init of Language without model
2017-03-26 16:46:00 +02:00
Raphaël Bournhonesque
f332bf05be
Remove unused import statements
2017-03-21 21:08:54 +01:00
ines
9605cf39cc
Handle default path in Language classes
2017-03-18 12:58:45 +01:00
Matthew Honnibal
8843b84bd1
Merge remote-tracking branch 'origin/develop-downloads'
2017-03-16 12:00:42 -05:00
ines
618ce3b425
Add .meta to Language object
...
Allows getting the current model's meta data, e.g.:
nlp = spacy.load('my-model')
print(nlp.meta)
2017-03-16 17:14:56 +01:00
Matthew Honnibal
b382dc902c
Add morph rules in Language
2017-03-15 09:24:40 -05:00
Matthew Honnibal
f70be44746
Use lemmatizer in code, not from downloaded model.
2017-03-15 04:52:50 -05:00
Matthew Honnibal
f71eeef9bb
Pass path argument to end_training
2017-03-09 18:42:40 -06:00
Matthew Honnibal
cd33b39a04
Fix 2/3 problem for json save/load
2017-03-08 01:39:13 +01:00
Ines Montani
aa876884f0
Revert "Revert "Merge remote-tracking branch 'origin/master'""
...
This reverts commit fb9d3bb022
.
2017-01-09 13:28:13 +01:00
Matthew Honnibal
3679fb43a3
Fix loading of lemmatizer
2016-12-18 17:34:09 +01:00
Ines Montani
b11d8cd3db
Merge remote-tracking branch 'origin/organize-language-data' into organize-language-data
2016-12-18 16:57:12 +01:00
Ines Montani
753068f1d5
Use base language data as default
2016-12-18 16:55:25 +01:00
Ines Montani
bcc1d50d09
Remove trailing whitespace
2016-12-18 16:54:52 +01:00
Matthew Honnibal
44f4f008bd
Wire up lemmatizer rules for English
2016-12-18 15:50:09 +01:00
Matthew Honnibal
296d33a4fc
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-11-26 12:36:18 +01:00
Matthew Honnibal
1f6c37c6f5
Fix create_tokenizer when nlp is None
2016-11-26 12:36:04 +01:00
Matthew Honnibal
c7889492f9
Fix model saving error for Python 3
2016-11-25 18:04:30 -06:00
Matthew Honnibal
159e8c46e1
Merge old training fixes with newer state
2016-11-25 09:16:36 -06:00
Matthew Honnibal
a2f55e7015
Pass cfg through loading, for training.
2016-11-25 09:01:20 -06:00
Matthew Honnibal
09f68bc641
Fix Issue #639 : stop words in language class not used. This patch is messy, but it's better not to change too much until the language data loading can be properly refactored.
2016-11-24 00:13:55 +01:00
Matthew Honnibal
48e1dc29d4
Fix default path loading.
2016-11-23 23:48:55 +01:00
ExplodingCabbage
6c4f488e89
Fix syntax mistake
2016-11-23 15:12:45 +00:00
Matthew Honnibal
60eb2343ce
Only try to load vectors if they exist.
2016-11-23 13:50:24 +01:00
Matthew Honnibal
618ac36093
Fix use of path argument in Language.__init__. Needs to be keyword arg, not positional.
2016-11-23 13:26:34 +01:00
Mark Amery
fbe19680a6
Fix another bug related to Language.__init__'s path parameter
2016-11-20 20:31:34 +00:00
Mark Amery
b0a07c21a0
Fix path
param of Language.__init__
always being ignored
...
There was an explicitly-declared `path` keyword argument, so 'path'
would never be present in `**overrides`. This line just overwrote
any manually-specified value the user might've passed to the `path`
parameter.
2016-11-20 16:29:57 +00:00
Matthew Honnibal
22647c2423
Check that patterns aren't null before compiling regex for tokenizer
2016-11-02 20:35:29 +01:00
Matthew Honnibal
f7fee6c24b
Check for class-defined make_docs method before assigning one provided as an argument
2016-11-02 19:57:13 +01:00
Matthew Honnibal
b86f8af0c1
Fix doc strings
2016-11-01 12:25:36 +01:00
Matthew Honnibal
cb49189477
Remove dead code
2016-10-26 13:11:07 +02:00
Matthew Honnibal
150e02d72e
Fix Issue #566
2016-10-23 20:19:01 +02:00
Matthew Honnibal
739213a8af
Fix create_pipeline keyword argument.
2016-10-23 14:24:16 +02:00
Matthew Honnibal
5ec32f5d97
Fix loading of GloVe vectors, to address Issue #541
2016-10-20 18:27:48 +02:00
Matthew Honnibal
d4aaf2752c
Fix issue #535 : Pipeline elements added even when data not installed.
2016-10-19 19:55:19 +02:00
Matthew Honnibal
1b651db9c5
Fix parser creation in Language class.
2016-10-18 19:36:44 +02:00
Matthew Honnibal
45a6f9b9c7
Fix loading of tagger.
2016-10-18 19:33:04 +02:00
Matthew Honnibal
7d5212f131
Refactor defaults
2016-10-18 16:18:25 +02:00
Matthew Honnibal
f787cd29fe
Refactor the pipeline classes to make them more consistent, and remove the redundant blank() constructor.
2016-10-16 21:34:57 +02:00
Matthew Honnibal
ca51f3b77e
Use DependencyParser and EntityRecognizer in the Language class.
2016-10-16 17:58:12 +02:00
Matthew Honnibal
a81c5a7abf
Fix name of labels keyword to 'actions'.
2016-10-16 12:00:27 +02:00
Matthew Honnibal
8a6b35d266
Delay binding in MakeDoc
2016-10-16 11:41:55 +02:00
Matthew Honnibal
08e9134760
Change default value of path to True
2016-10-15 14:12:54 +02:00
Matthew Honnibal
6d8cb515ac
Break the tokenization stage out of the pipeline into a function 'make_doc'. This allows all pipeline methods to have the same signature.
2016-10-14 17:38:29 +02:00
Matthew Honnibal
41f88ce938
Fix dep model loading in parser
2016-10-12 20:26:38 +02:00
Matthew Honnibal
0e2bedc373
Fix default labels for parser and NER
2016-10-12 19:12:40 +02:00
Matthew Honnibal
847a4a4182
Refactor Language, dropping Language.blank() method.
2016-10-12 13:45:58 +02:00
Matthew Honnibal
ea23b64cc8
Refactor training, with new spacy.train module. Defaults still a little awkward.
2016-10-09 12:24:24 +02:00
Matthew Honnibal
eceeaefe53
Fix defaults for Parser and Entity, adding a blank= argument.
2016-09-30 19:56:06 +02:00
Matthew Honnibal
e382e48d9f
Temporarily patch handling of defaul templates for tagger. Need to move these to language_data.
2016-09-27 13:21:28 +02:00
Matthew Honnibal
b14b9b096b
Return None if /deps directory not present, instead of trying to load the parser.
2016-09-26 18:48:03 +02:00
Matthew Honnibal
0b2d7ae9d6
Fix Entity creation
2016-09-26 15:41:22 +02:00
Matthew Honnibal
2debc4e0a2
Add .blank() method to Parser. Start housing default dep labels and entity types within the Defaults class.
2016-09-26 11:57:54 +02:00
Matthew Honnibal
722199acb8
Add spacy.blank() method, that doesn't load data. Don't try to load data if path is falsey
2016-09-26 11:07:46 +02:00
Matthew Honnibal
7db956133e
Move tokenizer data for German into spacy.de.language_data
2016-09-25 15:37:33 +02:00
Matthew Honnibal
95aaea0d3f
Refactor so that the tokenizer data is read from Python data, rather than from disk
2016-09-25 14:49:53 +02:00
Matthew Honnibal
fd58f7655a
Python 3 compatible basestring
2016-09-24 22:16:43 +02:00
Matthew Honnibal
fd65cf6cbb
Finish refactoring data loading
2016-09-24 20:26:17 +02:00
Matthew Honnibal
83e364188c
Mostly finished loading refactoring. Design is in place, but doesn't work yet.
2016-09-24 15:42:01 +02:00
Matthew Honnibal
9dc8043a7e
Refactor Language to use new Defaults class, and work on revised data loading. We're getting rid of sputnik's weird file-system wrapper, and using pathlib.
2016-09-24 14:08:53 +02:00
Matthew Honnibal
4d7f5468bb
* Change Language class to use a .pipeline attribute, instead of having the pipeline hard coded
2016-05-17 16:55:42 +02:00
Matthew Honnibal
0f957dd586
Merge branch 'master' of ssh://github.com/honnibal/spaCy
2016-04-14 10:37:56 +02:00
Matthew Honnibal
61d20de35d
* Fix language.py docstring
2016-04-14 10:36:57 +02:00
Henning Peters
ff690f76ba
fix loading non-german models
2016-04-12 16:00:56 +02:00
Wolfgang Seeker
03fb498dbe
introduce lang field for LexemeC to hold language id
...
put noun_chunk logic into iterators.py for each language separately
2016-03-10 13:01:34 +01:00
Wolfgang Seeker
bc9c62e279
replace Language functions with corresponding orth functions
...
implement punctuation functions in orth
2016-03-09 18:07:37 +01:00
Henning Peters
931c07a609
initial proposal for separate vector package
2016-03-04 11:09:06 +01:00
Matthew Honnibal
a95974ad3f
* Fix oov probability
2016-02-06 15:13:55 +01:00
Matthew Honnibal
1ef84a0557
* Merge master into rethinc2
2016-02-05 12:55:59 +01:00
Matthew Honnibal
249dccbe95
* Fix Language.pipe
2016-02-05 12:47:57 +01:00
Matthew Honnibal
af58f273b3
* Fix spacy.language.pipe
2016-02-05 12:20:29 +01:00
Matthew Honnibal
419edfab50
* Use generic flags for the new attributes until they're added
2016-02-04 15:50:54 +01:00
Matthew Honnibal
e5c96c969f
* Wire up new attributes
2016-02-04 13:04:58 +01:00
Matthew Honnibal
84b247ef83
* Add a .pipe method, that takes a stream of input, operates on it, and streams the output. Internally, the stream may be buffered, to allow multi-threading.
2016-02-03 02:10:58 +01:00
Matthew Honnibal
fcfc17a164
Merge branch 'master' into rethinc2
2016-02-02 23:05:34 +01:00
Matthew Honnibal
59123443e2
* Check for presence/absence of the different models in Language.end_training
2016-02-02 22:49:55 +01:00
Matthew Honnibal
9e9d4c8706
* Fix stupid error in Language.batch
2016-02-01 09:49:32 +01:00
Matthew Honnibal
98fbdf2856
* Add Language.batch() method, to support multi-threaded jobs
2016-02-01 09:01:13 +01:00
Matthew Honnibal
c4a89d56bd
* Automatically register any entity types pre-set on the tokens, so that the NER works with user-given entity types.
2016-01-19 20:09:26 +01:00
Matthew Honnibal
bba0a5e078
* Handle string paths in default_vocab, default_parser, default_entity in Language class
2016-01-18 22:37:24 +01:00
Henning Peters
41ea14a56f
fix pickling
2016-01-16 13:23:11 +01:00
Henning Peters
235f094534
untangle data_path/via
2016-01-16 12:23:45 +01:00
Henning Peters
846fa49b2a
distinct load() and from_package() methods
2016-01-16 10:00:57 +01:00
Henning Peters
211913d689
add about.py, adapt setup.py
2016-01-15 18:57:01 +01:00
Henning Peters
f8a8f97d25
cleanup
2016-01-15 18:13:37 +01:00
Henning Peters
780cb847c9
add default_model to about
2016-01-15 18:07:15 +01:00
Henning Peters
788f734513
refactored data_dir->via, add zip_safe, add spacy.load()
2016-01-15 18:01:02 +01:00
Henning Peters
bc229790ac
integrate with sputnik
2016-01-13 19:46:17 +01:00
Matthew Honnibal
eaf2ad59f1
* Fix use of mock Package object
2015-12-31 04:13:15 +01:00
Matthew Honnibal
a6ba43ecaf
* Fix errors in packaging revision
2015-12-29 18:37:26 +01:00
Matthew Honnibal
aec130af56
Use util.Package class for io
...
Previous Sputnik integration caused API change: Vocab, Tagger, etc
were loaded via a from_package classmethod, that required a
sputnik.Package instance. This forced users to first create a
sputnik.Sputnik() instance, in order to acquire a Package via
sp.pool().
Instead I've created a small file-system shim, util.Package, which
allows classes to have a .load() classmethod, that accepts either
util.Package objects, or strings. We can later gut the internals
of this and make it a proxy for Sputnik if we need more functionality
that should live in the Sputnik library.
Sputnik is now only used to download and install the data, in
spacy.en.download
2015-12-29 18:00:48 +01:00
Matthew Honnibal
f5dea1406d
* Fix silly mistake in Language.__init__
2015-12-28 18:48:57 +01:00
Matthew Honnibal
187960606f
* Fix pickle problems
2015-12-28 16:54:03 +01:00
Matthew Honnibal
8c7e149ec9
* Replace kwargs argument of Language.__init__ with explicit arguments, to fix pickle bug
2015-12-28 15:56:27 +01:00
Henning Peters
d8d348bb55
allow to specify version constraint within model name
2015-12-18 19:12:08 +01:00
Henning Peters
cfa187aaf0
fix tests
2015-12-18 10:58:02 +01:00
Henning Peters
8359bd4d93
strip data/ from package, friendlier Language invocation, make data_dir backward/forward-compatible
2015-12-18 09:52:55 +01:00
Henning Peters
345dda6f53
small fixes, add package build step
2015-12-07 06:50:26 +01:00
Henning Peters
9027cef3bc
access model via sputnik
2015-12-07 06:01:28 +01:00
Matthew Honnibal
3c162dcac3
* Refactor away from the _ml module, to use thinc 4.0. Still some work needs to be done, e.g. to add __reduce__ to the models, more testing, etc.
2015-11-07 03:24:30 +11:00
Matthew Honnibal
adc7bbd6cf
* Fix name of like_num in default_lex_attrs
2015-11-04 22:02:47 +11:00
Matthew Honnibal
e96faf29e7
* Rename like_number to like_num, to fix inconsistency re Issue #166
2015-11-04 22:01:44 +11:00
Matthew Honnibal
f18fd8c659
* Fix language.py for change in StringStore load API
2015-10-23 03:48:12 +11:00
Matthew Honnibal
2348a08481
* Load/dump strings with a json file, instead of the hacky strings file we were using.
2015-10-22 21:13:03 +11:00
Matthew Honnibal
9baf0abd59
* Save vocab after training.
2015-10-22 21:09:14 +11:00
Matthew Honnibal
20fd36a0f7
* Very scrappy, likely buggy first-cut pickle implementation, to work on Issue #125 : allow pickle for Apache Spark. The current implementation sends stuff to temp files, and does almost nothing to ensure all modifiable state is actually preserved. The Language() instance is a deep tree of extension objects, and if pickling during training, some of the C-data state is hard to preserve.
2015-10-13 13:44:41 +11:00
Matthew Honnibal
a6ced80c0c
* Fix Issue #116 : Misleading handling of True value in Language.__init__.
2015-09-29 20:54:12 +10:00
Matthew Honnibal
27f988b167
* Remove the vectors option to Vocab, preferring to either load vectors from disk, or set them on the Lexeme objects.
2015-09-15 14:41:48 +10:00
Matthew Honnibal
e13e47e9e5
* Add English stop words
2015-09-14 17:48:51 +10:00
Matthew Honnibal
d9f1fc2112
* Add deprecation warning for unused load_vectors argument.
2015-09-09 14:31:09 +02:00
Matthew Honnibal
534e3dda3c
* More work on language independent parsing
2015-08-28 03:44:54 +02:00
Matthew Honnibal
c2307fa9ee
* More work on language-generic parsing
2015-08-28 02:02:33 +02:00
Matthew Honnibal
0af139e183
* Tagger training now working. Still need to test load/save of model. Morphology still broken.
2015-08-27 09:16:11 +02:00
Matthew Honnibal
76996f4145
* Hack on generic Language class. Still needs work for morphology, defaults, etc
2015-08-26 19:16:09 +02:00
Matthew Honnibal
f2f699ac18
* Add language base class
2015-08-25 15:37:17 +02:00