ines
adbcac6591
Fix spacing
2017-03-20 22:48:21 +01:00
ines
0eafc0f2c6
Add util functions to print data as table or markdown list
2017-03-18 13:00:14 +01:00
Matthew Honnibal
adb0b7e43b
Fix loading when no package found
2017-03-16 18:30:23 -05:00
ines
3d484c3faf
Don't print in parse_package_meta and accept on_error callback instead
TODO: log warning for missing meta data in spacy.link, as this affects
the Language class returned by spacy.load()
2017-03-16 20:34:50 +01:00
ines
5f3f04bd0a
Add util function to load and parse package meta.json
2017-03-16 17:10:05 +01:00
ines
7f920c2f75
Don't break text when rendering print_msg
2017-03-16 17:09:50 +01:00
ines
68c04fa897
Move sys_exit() function to util
2017-03-16 17:08:58 +01:00
ines
7b2eca36e4
Revert "Fix formatting and remove unused code"
This reverts commit d7898d586f.
2017-03-16 09:58:41 +01:00
ines
f5d1a39a5b
Add util functions for printing and wrapping messages
2017-03-15 17:35:57 +01:00
ines
d7898d586f
Fix formatting and remove unused code
2017-03-15 17:35:41 +01:00
ines
66c1f194f9
Use consistent unicode declarations
2017-03-12 13:07:28 +01:00
Matthew Honnibal
0f9b8a00a5
Unbreak data download
2017-01-09 23:40:26 +01:00
Matthew Honnibal
d9a77ddf14
Return None for data path if it doesn't exist
2017-01-09 14:10:05 +01:00
Ines Montani
de5aa92bc2
Handle deprecated tokenizer prefix data
2017-01-08 20:33:28 +01:00
Ines Montani
6a60a61086
Move update_exc to global language data utils
2016-12-17 12:29:02 +01:00
Ines Montani
66c7348cda
Add update_exc util function
2016-12-08 13:58:12 +01:00
Ines Montani
8e977cc71c
Fix formatting
2016-12-08 13:56:17 +01:00
Matthew Honnibal
6b8b05ef83
Specify that spacy.util is encoded in utf8
2016-11-02 19:58:00 +01:00
Matthew Honnibal
9efe568177
Add missing unicode_literals to spacy.util. I think this was messing up the tokenizer regex for non-ascii characters in Python 2. Re Issue #596
2016-11-02 12:31:34 +01:00
Matthew Honnibal
5e923b9bfa
Return None in match_best_version if no path exists.
2016-10-15 14:47:29 +02:00
Matthew Honnibal
ea23b64cc8
Refactor training, with new spacy.train module. Defaults still a little awkward.
2016-10-09 12:24:24 +02:00
Matthew Honnibal
95aaea0d3f
Refactor so that the tokenizer data is read from Python data, rather than from disk
2016-09-25 14:49:53 +02:00
Matthew Honnibal
82b8cc5efb
Whitespace
2016-09-24 22:17:01 +02:00
Matthew Honnibal
f19af6cb2c
Python 3 compatible basestring
2016-09-24 22:08:43 +02:00
Matthew Honnibal
fd65cf6cbb
Finish refactoring data loading
2016-09-24 20:26:17 +02:00
Matthew Honnibal
83e364188c
Mostly finished loading refactoring. Design is in place, but doesn't work yet.
2016-09-24 15:42:01 +02:00
Daylen Yang
5405e7dd73
Fix get_lang_class parsing (take 2)
2016-05-16 16:40:31 -07:00
Matthew Honnibal
b240104f40
Revert "Fix get_lang_class parsing"
2016-05-17 08:04:26 +10:00
Daylen Yang
1692c2df3c
Fix get_lang_class parsing
We want the get_lang_class to return "en" for both "en" and "en_glove_cc_300_1m_vectors". Changed the split rule to "_" so that this happens.
2016-05-16 14:38:20 -07:00
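The split rule described in commit 1692c2df3c above can be sketched as follows. This is a minimal illustration, not the actual spaCy code: the helper name `lang_code_from_name` is hypothetical, and it only assumes that the language code is whatever precedes the first underscore in a model name.

```python
def lang_code_from_name(name):
    """Return the language-code prefix of a model or package name.

    Hypothetical helper, not part of spaCy: it only illustrates the
    "split on underscore" rule from the commit message.
    """
    # Everything before the first "_" is treated as the language code.
    # A name without any underscore is returned unchanged.
    return name.split("_")[0]


print(lang_code_from_name("en"))
print(lang_code_from_name("en_glove_cc_300_1m_vectors"))
```

Both calls print `en`, which is the behaviour the commit message asks for.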
Henning Peters
ff690f76ba
fix loading non-german models
2016-04-12 16:00:56 +02:00
Henning Peters
c90d4a6f17
relative imports in __init__.py
2016-03-26 11:44:53 +01:00
Henning Peters
b8f63071eb
add lang registration facility
2016-03-25 18:54:45 +01:00
Henning Peters
a7d7ea3afa
first idea for supporting multiple langs in download script
2016-03-24 11:19:43 +01:00
Henning Peters
eb7ae61b1c
cleanup api
2016-03-08 12:59:18 +01:00
Henning Peters
9cc4f8d5b3
avoid shadowing __name__
2016-02-15 01:33:39 +01:00
Henning Peters
235f094534
untangle data_path/via
2016-01-16 12:23:45 +01:00
Henning Peters
6d1a3af343
cleanup unused
2016-01-16 10:05:04 +01:00
Henning Peters
846fa49b2a
distinct load() and from_package() methods
2016-01-16 10:00:57 +01:00
Henning Peters
211913d689
add about.py, adapt setup.py
2016-01-15 18:57:01 +01:00
Henning Peters
788f734513
refactored data_dir->via, add zip_safe, add spacy.load()
2016-01-15 18:01:02 +01:00
Henning Peters
d9471f684f
fix typo
2016-01-14 12:14:12 +01:00
Henning Peters
9b75d872b0
fix model download
2016-01-14 12:02:56 +01:00
Henning Peters
bc229790ac
integrate with sputnik
2016-01-13 19:46:17 +01:00
Matthew Honnibal
eaf2ad59f1
* Fix use of mock Package object
2015-12-31 04:13:15 +01:00
Matthew Honnibal
a2dfdec85d
* Clean up spacy.util
2015-12-29 18:06:09 +01:00
Matthew Honnibal
aec130af56
Use util.Package class for io
Previous Sputnik integration caused API change: Vocab, Tagger, etc
were loaded via a from_package classmethod, that required a
sputnik.Package instance. This forced users to first create a
sputnik.Sputnik() instance, in order to acquire a Package via
sp.pool().
Instead I've created a small file-system shim, util.Package, which
allows classes to have a .load() classmethod, that accepts either
util.Package objects, or strings. We can later gut the internals
of this and make it a proxy for Sputnik if we need more functionality
that should live in the Sputnik library.
Sputnik is now only used to download and install the data, in
spacy.en.download
2015-12-29 18:00:48 +01:00
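The design described in commit aec130af56 above, a small file-system shim plus a `.load()` classmethod that accepts either a package object or a string, might look roughly like the sketch below. All names and method bodies here are assumptions made for illustration (`Component` stands in for classes such as Vocab or Tagger, and `file_path` is a hypothetical helper); the real `util.Package` may differ.

```python
import os


class Package(object):
    """Hypothetical file-system shim that wraps a data directory."""

    def __init__(self, data_path):
        self.data_path = data_path

    def file_path(self, *parts):
        # Build a path relative to the package's data directory.
        return os.path.join(self.data_path, *parts)


class Component(object):
    """Illustrative stand-in for a class that loads its data from a package."""

    def __init__(self, package):
        self.package = package

    @classmethod
    def load(cls, pkg_or_path):
        # Accept either a Package instance or a plain path string,
        # wrapping the string in a Package so callers need no extra setup.
        if isinstance(pkg_or_path, Package):
            package = pkg_or_path
        else:
            package = Package(pkg_or_path)
        return cls(package)


component = Component.load("data")
print(component.package.file_path("vocab", "strings.json"))
```

Under this sketch a caller can pass a directory string directly, without first constructing a Sputnik instance to obtain a package.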
Matthew Honnibal
4131e45543
* Add MockPackage class, to see whether we can proxy for Sputnik in a lightweight way
2015-12-29 16:55:03 +01:00
Henning Peters
d8d348bb55
allow to specify version constraint within model name
2015-12-18 19:12:08 +01:00
Henning Peters
cfa187aaf0
fix tests
2015-12-18 10:58:02 +01:00
Henning Peters
8359bd4d93
strip data/ from package, friendlier Language invocation, make data_dir backward/forward-compatible
2015-12-18 09:52:55 +01:00
Henning Peters
9027cef3bc
access model via sputnik
2015-12-07 06:01:28 +01:00
Matthew Honnibal
dc393a5f1d
Merge pull request #126 from tomtung/master
Improve slicing support for both Doc and Span
2015-10-10 14:14:57 +11:00
Matthew Honnibal
83dccf0fd7
* Use io module instead of deprecated codecs module
2015-10-10 14:13:01 +11:00
Yubing (Tom) Dong
3fd3bc79aa
Refactor to remove duplicate slicing logic
2015-10-07 01:25:35 -07:00
alvations
8199012d26
changing deprecated codecs.open to io.open =)
2015-09-30 20:10:15 +02:00
Matthew Honnibal
6ab1696b15
* Remove read_encoding_freqs from util.py
2015-07-23 01:17:32 +02:00
Matthew Honnibal
317cbbc015
* Serialization round trip now working with decent API, but with rough spots in the organisation and requiring vocabulary to be fixed ahead of time.
2015-07-19 15:18:17 +02:00
Jordan Suchow
3a8d9b37a6
Remove trailing whitespace
2015-04-19 13:01:38 -07:00
Jordan Suchow
5f0f940a1f
Remove unused imports
2015-04-19 01:05:22 -07:00
Matthew Honnibal
3f1944d688
* Make PyPy work
2015-01-05 17:54:38 +11:00
Matthew Honnibal
f5d41028b5
* Move around data files for test release
2015-01-03 01:59:22 +11:00
Matthew Honnibal
e1c1a4b868
* Tmp
2014-12-21 05:36:29 +11:00
Matthew Honnibal
b962fe73d7
* Make suffixes file use full-power regex, so that we can handle periods properly
2014-12-09 19:04:27 +11:00
Matthew Honnibal
302e09018b
* Work on fixing special-cases, reading them in as JSON objects so that they can specify lemmas
2014-12-09 14:48:01 +11:00
Matthew Honnibal
ea8f1e7053
* Tighten interfaces
2014-10-30 18:14:42 +11:00
Matthew Honnibal
67c8c8019f
* Update lexeme serialization, using a binary file format
2014-10-30 01:01:00 +11:00
Matthew Honnibal
43d5964e13
* Add function to read detokenization rules
2014-10-22 12:54:59 +11:00
Matthew Honnibal
12742f4f83
* Add detokenize method and test
2014-10-18 18:07:29 +11:00
Matthew Honnibal
6fb42c4919
* Add offsets to Tokens class. Some changes to interfaces, and reorganization of spacy.Lang
2014-10-14 16:17:45 +11:00
Matthew Honnibal
e40caae51f
* Update Lexicon class to expect a list of lexeme dict descriptions
2014-10-09 14:51:35 +11:00
Matthew Honnibal
2e44fa7179
* Add util.py
2014-09-25 18:26:22 +02:00
Matthew Honnibal
e9a62b6eba
* Refactoring with Lexeme as a class now compiles. Basic design seems to work
2014-08-27 17:15:39 +02:00
Matthew Honnibal
d10993f41a
* More docs work
2014-08-21 16:37:13 +02:00
Matthew Honnibal
3379d7a571
* Reforming data model for lexemes
2014-08-19 02:40:37 +02:00
Matthew Honnibal
01469b0888
* Refactor spacy so that chunks return arrays of lexemes, so that there is properly one lexeme per word.
2014-08-18 19:14:00 +02:00
Matthew Honnibal
ff1869ff07
* Fixed major efficiency problem, from not quite grokking pass by reference in cython c++
2014-07-07 07:36:43 +02:00
Matthew Honnibal
25849fc926
* Generalize tokenization rules to capitals
2014-07-07 05:07:21 +02:00
Matthew Honnibal
4e79446dc2
* Reading in tokenization rules correctly. Passing tests.
2014-07-07 00:02:55 +02:00
Matthew Honnibal
556f6a18ca
* Initial commit. Tests passing for punctuation handling. Need contractions, file transport, tokenize function, etc.
2014-07-05 20:51:42 +02:00