37 Commits

Author SHA1 Message Date
Henning Peters
bc229790ac integrate with sputnik 2016-01-13 19:46:17 +01:00
Matthew Honnibal
eaf2ad59f1 * Fix use of mock Package object 2015-12-31 04:13:15 +01:00
Matthew Honnibal
a2dfdec85d * Clean up spacy.util 2015-12-29 18:06:09 +01:00
Matthew Honnibal
aec130af56 Use util.Package class for io
Previous Sputnik integration caused API change: Vocab, Tagger, etc
were loaded via a from_package classmethod, that required a
sputnik.Package instance. This forced users to first create a
sputnik.Sputnik() instance, in order to acquire a Package via
sp.pool().

Instead I've created a small file-system shim, util.Package, which
allows classes to have a .load() classmethod, that accepts either
util.Package objects, or strings. We can later gut the internals
of this and make it a proxy for Sputnik if we need more functionality
that should live in the Sputnik library.

Sputnik is now only used to download and install the data, in
spacy.en.download
2015-12-29 18:00:48 +01:00
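
The design described in the commit message above can be illustrated with a minimal sketch. It is not the code from this commit: the method names `file_path` and `has_file`, the helper `get_package`, the stand-in `Vocab` class, and the `vocab/words.txt` layout are all assumptions made for illustration. It only shows the idea of a small file-system shim plus a `.load()` classmethod that accepts either a package object or a plain path string.

```python
import os


class Package(object):
    """Hypothetical file-system shim: resolves paths under one data directory."""

    def __init__(self, data_dir):
        self.data_dir = data_dir

    def file_path(self, *parts):
        # Join path components relative to the package's data directory.
        return os.path.join(self.data_dir, *parts)

    def has_file(self, *parts):
        return os.path.isfile(self.file_path(*parts))


def get_package(pkg_or_str):
    """Return a Package, wrapping the argument if it is a path string."""
    if isinstance(pkg_or_str, Package):
        return pkg_or_str
    return Package(pkg_or_str)


class Vocab(object):
    """Stand-in for a class that reads its data through a package."""

    def __init__(self, words):
        self.words = words

    @classmethod
    def load(cls, pkg_or_str):
        # Callers may pass a Package or a directory path; both are normalised here.
        package = get_package(pkg_or_str)
        path = package.file_path("vocab", "words.txt")  # assumed file layout
        with open(path, encoding="utf8") as f:
            return cls([line.strip() for line in f if line.strip()])
```

With this shape, `Vocab.load("/path/to/data")` and `Vocab.load(Package("/path/to/data"))` behave the same, so users do not need to construct a package object themselves. The internals of `Package` could later be swapped for a proxy without changing the callers.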
Matthew Honnibal
4131e45543 * Add MockPackage class, to see whether we can proxy for Sputnik in a lightweight way 2015-12-29 16:55:03 +01:00
Henning Peters
d8d348bb55 allow to specify version constraint within model name 2015-12-18 19:12:08 +01:00
Henning Peters
cfa187aaf0 fix tests 2015-12-18 10:58:02 +01:00
Henning Peters
8359bd4d93 strip data/ from package, friendlier Language invocation, make data_dir backward/forward-compatible 2015-12-18 09:52:55 +01:00
Henning Peters
9027cef3bc access model via sputnik 2015-12-07 06:01:28 +01:00
Matthew Honnibal
dc393a5f1d Merge pull request #126 from tomtung/master
Improve slicing support for both Doc and Span
2015-10-10 14:14:57 +11:00
Matthew Honnibal
83dccf0fd7 * Use io module instead of deprecated codecs module 2015-10-10 14:13:01 +11:00
Yubing (Tom) Dong
3fd3bc79aa Refactor to remove duplicate slicing logic 2015-10-07 01:25:35 -07:00
alvations
8199012d26 changing deprecated codecs.open to io.open =) 2015-09-30 20:10:15 +02:00
Matthew Honnibal
6ab1696b15 * Remove read_encoding_freqs from util.py 2015-07-23 01:17:32 +02:00
Matthew Honnibal
317cbbc015 * Serialization round trip now working with decent API, but with rough spots in the organisation and requiring vocabulary to be fixed ahead of time. 2015-07-19 15:18:17 +02:00
Jordan Suchow
3a8d9b37a6 Remove trailing whitespace 2015-04-19 13:01:38 -07:00
Jordan Suchow
5f0f940a1f Remove unused imports 2015-04-19 01:05:22 -07:00
Matthew Honnibal
3f1944d688 * Make PyPy work 2015-01-05 17:54:38 +11:00
Matthew Honnibal
f5d41028b5 * Move around data files for test release 2015-01-03 01:59:22 +11:00
Matthew Honnibal
e1c1a4b868 * Tmp 2014-12-21 05:36:29 +11:00
Matthew Honnibal
b962fe73d7 * Make suffixes file use full-power regex, so that we can handle periods properly 2014-12-09 19:04:27 +11:00
Matthew Honnibal
302e09018b * Work on fixing special-cases, reading them in as JSON objects so that they can specify lemmas 2014-12-09 14:48:01 +11:00
Matthew Honnibal
ea8f1e7053 * Tighten interfaces 2014-10-30 18:14:42 +11:00
Matthew Honnibal
67c8c8019f * Update lexeme serialization, using a binary file format 2014-10-30 01:01:00 +11:00
Matthew Honnibal
43d5964e13 * Add function to read detokenization rules 2014-10-22 12:54:59 +11:00
Matthew Honnibal
12742f4f83 * Add detokenize method and test 2014-10-18 18:07:29 +11:00
Matthew Honnibal
6fb42c4919 * Add offsets to Tokens class. Some changes to interfaces, and reorganization of spacy.Lang 2014-10-14 16:17:45 +11:00
Matthew Honnibal
e40caae51f * Update Lexicon class to expect a list of lexeme dict descriptions 2014-10-09 14:51:35 +11:00
Matthew Honnibal
2e44fa7179 * Add util.py 2014-09-25 18:26:22 +02:00
Matthew Honnibal
e9a62b6eba * Refactoring with Lexeme as a class now compiles. Basic design seems to work 2014-08-27 17:15:39 +02:00
Matthew Honnibal
d10993f41a * More docs work 2014-08-21 16:37:13 +02:00
Matthew Honnibal
3379d7a571 * Reforming data model for lexemes 2014-08-19 02:40:37 +02:00
Matthew Honnibal
01469b0888 * Refactor spacy so that chunks return arrays of lexemes, so that there is properly one lexeme per word. 2014-08-18 19:14:00 +02:00
Matthew Honnibal
ff1869ff07 * Fixed major efficiency problem, from not quite grokking pass by reference in cython c++ 2014-07-07 07:36:43 +02:00
Matthew Honnibal
25849fc926 * Generalize tokenization rules to capitals 2014-07-07 05:07:21 +02:00
Matthew Honnibal
4e79446dc2 * Reading in tokenization rules correctly. Passing tests. 2014-07-07 00:02:55 +02:00
Matthew Honnibal
556f6a18ca * Initial commit. Tests passing for punctuation handling. Need contractions, file transport, tokenize function, etc. 2014-07-05 20:51:42 +02:00