Matthew Honnibal
7db956133e
Move tokenizer data for German into spacy.de.language_data
2016-09-25 15:37:33 +02:00
Matthew Honnibal
95aaea0d3f
Refactor so that the tokenizer data is read from Python data, rather than from disk
2016-09-25 14:49:53 +02:00
Matthew Honnibal
fd58f7655a
Python 3 compatible basestring
2016-09-24 22:16:43 +02:00
Matthew Honnibal
fd65cf6cbb
Finish refactoring data loading
2016-09-24 20:26:17 +02:00
Matthew Honnibal
83e364188c
Mostly finished loading refactoring. Design is in place, but doesn't work yet.
2016-09-24 15:42:01 +02:00
Matthew Honnibal
9dc8043a7e
Refactor Language to use new Defaults class, and work on revised data loading. We're getting rid of sputnik's weird file-system wrapper, and using pathlib.
2016-09-24 14:08:53 +02:00
Matthew Honnibal
4d7f5468bb
* Change Language class to use a .pipeline attribute, instead of having the pipeline hard coded
2016-05-17 16:55:42 +02:00
Matthew Honnibal
0f957dd586
Merge branch 'master' of ssh://github.com/honnibal/spaCy
2016-04-14 10:37:56 +02:00
Matthew Honnibal
61d20de35d
* Fix language.py docstring
2016-04-14 10:36:57 +02:00
Henning Peters
ff690f76ba
fix loading non-German models
2016-04-12 16:00:56 +02:00
Wolfgang Seeker
03fb498dbe
introduce lang field for LexemeC to hold language id
put noun_chunk logic into iterators.py for each language separately
2016-03-10 13:01:34 +01:00
Wolfgang Seeker
bc9c62e279
replace Language functions with corresponding orth functions
implement punctuation functions in orth
2016-03-09 18:07:37 +01:00
Henning Peters
931c07a609
initial proposal for separate vector package
2016-03-04 11:09:06 +01:00
Matthew Honnibal
a95974ad3f
* Fix oov probability
2016-02-06 15:13:55 +01:00
Matthew Honnibal
1ef84a0557
* Merge master into rethinc2
2016-02-05 12:55:59 +01:00
Matthew Honnibal
249dccbe95
* Fix Language.pipe
2016-02-05 12:47:57 +01:00
Matthew Honnibal
af58f273b3
* Fix spacy.language.pipe
2016-02-05 12:20:29 +01:00
Matthew Honnibal
419edfab50
* Use generic flags for the new attributes until they're added
2016-02-04 15:50:54 +01:00
Matthew Honnibal
e5c96c969f
* Wire up new attributes
2016-02-04 13:04:58 +01:00
Matthew Honnibal
84b247ef83
* Add a .pipe method, that takes a stream of input, operates on it, and streams the output. Internally, the stream may be buffered, to allow multi-threading.
2016-02-03 02:10:58 +01:00
Matthew Honnibal
fcfc17a164
Merge branch 'master' into rethinc2
2016-02-02 23:05:34 +01:00
Matthew Honnibal
59123443e2
* Check for presence/absence of the different models in Language.end_training
2016-02-02 22:49:55 +01:00
Matthew Honnibal
9e9d4c8706
* Fix stupid error in Language.batch
2016-02-01 09:49:32 +01:00
Matthew Honnibal
98fbdf2856
* Add Language.batch() method, to support multi-threaded jobs
2016-02-01 09:01:13 +01:00
Matthew Honnibal
c4a89d56bd
* Automatically register any entity types pre-set on the tokens, so that the NER works with user-given entity types.
2016-01-19 20:09:26 +01:00
Matthew Honnibal
bba0a5e078
* Handle string paths in default_vocab, default_parser, default_entity in Language class
2016-01-18 22:37:24 +01:00
Henning Peters
41ea14a56f
fix pickling
2016-01-16 13:23:11 +01:00
Henning Peters
235f094534
untangle data_path/via
2016-01-16 12:23:45 +01:00
Henning Peters
846fa49b2a
distinct load() and from_package() methods
2016-01-16 10:00:57 +01:00
Henning Peters
211913d689
add about.py, adapt setup.py
2016-01-15 18:57:01 +01:00
Henning Peters
f8a8f97d25
cleanup
2016-01-15 18:13:37 +01:00
Henning Peters
780cb847c9
add default_model to about
2016-01-15 18:07:15 +01:00
Henning Peters
788f734513
refactored data_dir->via, add zip_safe, add spacy.load()
2016-01-15 18:01:02 +01:00
Henning Peters
bc229790ac
integrate with sputnik
2016-01-13 19:46:17 +01:00
Matthew Honnibal
eaf2ad59f1
* Fix use of mock Package object
2015-12-31 04:13:15 +01:00
Matthew Honnibal
a6ba43ecaf
* Fix errors in packaging revision
2015-12-29 18:37:26 +01:00
Matthew Honnibal
aec130af56
Use util.Package class for io
Previous Sputnik integration caused API change: Vocab, Tagger, etc
were loaded via a from_package classmethod, that required a
sputnik.Package instance. This forced users to first create a
sputnik.Sputnik() instance, in order to acquire a Package via
sp.pool().
Instead I've created a small file-system shim, util.Package, which
allows classes to have a .load() classmethod, that accepts either
util.Package objects, or strings. We can later gut the internals
of this and make it a proxy for Sputnik if we need more functionality
that should live in the Sputnik library.
Sputnik is now only used to download and install the data, in
spacy.en.download
2015-12-29 18:00:48 +01:00
Matthew Honnibal
f5dea1406d
* Fix silly mistake in Language.__init__
2015-12-28 18:48:57 +01:00
Matthew Honnibal
187960606f
* Fix pickle problems
2015-12-28 16:54:03 +01:00
Matthew Honnibal
8c7e149ec9
* Replace kwargs argument of Language.__init__ with explicit arguments, to fix pickle bug
2015-12-28 15:56:27 +01:00
Henning Peters
d8d348bb55
allow specifying a version constraint within the model name
2015-12-18 19:12:08 +01:00
Henning Peters
cfa187aaf0
fix tests
2015-12-18 10:58:02 +01:00
Henning Peters
8359bd4d93
strip data/ from package, friendlier Language invocation, make data_dir backward/forward-compatible
2015-12-18 09:52:55 +01:00
Henning Peters
345dda6f53
small fixes, add package build step
2015-12-07 06:50:26 +01:00
Henning Peters
9027cef3bc
access model via sputnik
2015-12-07 06:01:28 +01:00
Matthew Honnibal
3c162dcac3
* Refactor away from the _ml module, to use thinc 4.0. Still some work needs to be done, e.g. to add __reduce__ to the models, more testing, etc.
2015-11-07 03:24:30 +11:00
Matthew Honnibal
adc7bbd6cf
* Fix name of like_num in default_lex_attrs
2015-11-04 22:02:47 +11:00
Matthew Honnibal
e96faf29e7
* Rename like_number to like_num, to fix inconsistency re Issue #166
2015-11-04 22:01:44 +11:00
Matthew Honnibal
f18fd8c659
* Fix language.py for change in StringStore load API
2015-10-23 03:48:12 +11:00
Matthew Honnibal
2348a08481
* Load/dump strings with a json file, instead of the hacky strings file we were using.
2015-10-22 21:13:03 +11:00
Matthew Honnibal
9baf0abd59
* Save vocab after training.
2015-10-22 21:09:14 +11:00
Matthew Honnibal
20fd36a0f7
* Very scrappy, likely buggy first-cut pickle implementation, to work on Issue #125: allow pickle for Apache Spark. The current implementation sends stuff to temp files, and does almost nothing to ensure all modifiable state is actually preserved. The Language() instance is a deep tree of extension objects, and if pickling during training, some of the C-data state is hard to preserve.
2015-10-13 13:44:41 +11:00
Matthew Honnibal
a6ced80c0c
* Fix Issue #116: Misleading handling of True value in Language.__init__.
2015-09-29 20:54:12 +10:00
Matthew Honnibal
27f988b167
* Remove the vectors option to Vocab, preferring to either load vectors from disk, or set them on the Lexeme objects.
2015-09-15 14:41:48 +10:00
Matthew Honnibal
e13e47e9e5
* Add English stop words
2015-09-14 17:48:51 +10:00
Matthew Honnibal
d9f1fc2112
* Add deprecation warning for unused load_vectors argument.
2015-09-09 14:31:09 +02:00
Matthew Honnibal
534e3dda3c
* More work on language independent parsing
2015-08-28 03:44:54 +02:00
Matthew Honnibal
c2307fa9ee
* More work on language-generic parsing
2015-08-28 02:02:33 +02:00
Matthew Honnibal
0af139e183
* Tagger training now working. Still need to test load/save of model. Morphology still broken.
2015-08-27 09:16:11 +02:00
Matthew Honnibal
76996f4145
* Hack on generic Language class. Still needs work for morphology, defaults, etc
2015-08-26 19:16:09 +02:00
Matthew Honnibal
f2f699ac18
* Add language base class
2015-08-25 15:37:17 +02:00