Commit Graph

1060 Commits

Author SHA1 Message Date
Matthew Honnibal
c4d8754385 * Specify LOCAL_DATA_DIR global in spacy.en.__init__.py 2015-08-26 19:15:07 +02:00
Matthew Honnibal
c2d8edd0bd * Add PROB attribute in attrs.pxd 2015-08-26 19:14:19 +02:00
Matthew Honnibal
c5a27d1821 * Move lemmatizer to spacy 2015-08-25 15:47:08 +02:00
Matthew Honnibal
82217c6ec6 * Generalize lemmatizer 2015-08-25 15:46:19 +02:00
Matthew Honnibal
8083a07c3e * Use language base class 2015-08-25 15:37:30 +02:00
Matthew Honnibal
f2f699ac18 * Add language base class 2015-08-25 15:37:17 +02:00
Matthew Honnibal
5dd76be446 * Split EnPosTagger up into base class and subclass 2015-08-24 05:25:55 +02:00
Matthew Honnibal
5d5922dbfa * Begin laying out morphological features 2015-08-24 01:04:30 +02:00
Matthew Honnibal
6f1743692a * Work on language-independent refactoring 2015-08-23 20:49:18 +02:00
Matthew Honnibal
3879d28457 * Fix https for url detection 2015-08-23 02:40:35 +02:00
Matthew Honnibal
cad0cca4e3 * Tmp 2015-08-22 22:04:34 +02:00
Matthew Honnibal
bf38b3b883 * Hack on l/r reversal bug 2015-08-10 05:58:43 +02:00
Matthew Honnibal
6116413b47 * Fix label prediction in StepwiseState 2015-08-10 05:05:31 +02:00
Matthew Honnibal
2c9753eff2 * Whitespace 2015-08-10 00:09:02 +02:00
Matthew Honnibal
9de98f5a6f * Add Parser.stepthrough method, with context manager 2015-08-10 00:08:46 +02:00
Matthew Honnibal
fe43f8cf39 * Whitespace 2015-08-09 02:31:53 +02:00
Matthew Honnibal
9c090945e0 * Add Parser.predict method, and clean up Parser.get_state 2015-08-09 02:29:58 +02:00
Matthew Honnibal
04fccfb984 * Fix get_state for parser prediction 2015-08-09 02:11:22 +02:00
Matthew Honnibal
55fde0e240 * Fix get_state 2015-08-09 01:45:30 +02:00
Matthew Honnibal
f0f4fa9838 * Fix Parser.get_state 2015-08-09 01:40:13 +02:00
Matthew Honnibal
18331dca89 * Add continue_for argument to parser 'partial' function, which is now renamed to get_state 2015-08-09 01:31:54 +02:00
Matthew Honnibal
0653288fa5 * Fix stateclass.queue 2015-08-09 00:39:02 +02:00
Matthew Honnibal
9de218b7ba * Fix Parser.partial function 2015-08-08 23:45:18 +02:00
Matthew Honnibal
01be34d55a * Whitespace 2015-08-08 23:37:44 +02:00
Matthew Honnibal
cc9deae960 * Add is_valid method to transition_system 2015-08-08 23:36:18 +02:00
Matthew Honnibal
2a46c77324 * Whitespace 2015-08-08 23:35:59 +02:00
Matthew Honnibal
7bafc789e7 * Add stack and queue properties to stateclass, for python access 2015-08-08 23:32:42 +02:00
Matthew Honnibal
3af938365f * Add function partial to Parser 2015-08-08 23:32:15 +02:00
Matthew Honnibal
76a1f0481a * Whitespace 2015-08-08 23:31:54 +02:00
Matthew Honnibal
b0f5c39084 * Fix handling of exclusion entities 2015-08-06 17:28:43 +02:00
Matthew Honnibal
9f65879991 * Fix shape attr bug, and fix handling of false positive matches 2015-08-06 17:28:14 +02:00
Matthew Honnibal
10d869d102 * Don't allow conjunction between NPs in base NP chunks 2015-08-06 16:31:53 +02:00
Matthew Honnibal
383dfabd67 * Fix matcher setting of entities 2015-08-06 16:27:01 +02:00
Matthew Honnibal
59c3bf60a6 * Ensure entity recognizer doesn't over-write preset types 2015-08-06 16:09:08 +02:00
Matthew Honnibal
cd7d1682cd * Fix loading of gazetteer.json file 2015-08-06 16:08:25 +02:00
Matthew Honnibal
9c667b7f15 * Set a value in attrs.pxd on the first flag, to reduce bugs 2015-08-06 16:08:04 +02:00
Matthew Honnibal
c263577424 * Fix lower attribute in lexeme.pxd 2015-08-06 16:07:41 +02:00
Matthew Honnibal
5737115e1e * Work on gazetteer matching 2015-08-06 14:33:21 +02:00
Matthew Honnibal
9c1724ecae * Gazetteer stuff working, now need to wire up to API 2015-08-06 00:35:40 +02:00
Matthew Honnibal
5bc0e83f9a * Reimplement matching in Cython, instead of Python. 2015-08-05 01:05:54 +02:00
Matthew Honnibal
4c87a696b3 * Add draft dfa matcher, in Python. Passing tests. 2015-08-04 15:55:28 +02:00
Matthew Honnibal
eb7138c761 * Add attr relation in base NP detection 2015-08-01 00:34:40 +02:00
Matthew Honnibal
4988356cf0 * Fix dependency type bug from merged tokens 2015-08-01 00:33:24 +02:00
Matthew Honnibal
78a9068319 * Fix spacy attr on merged tokens 2015-07-30 04:25:58 +02:00
Matthew Honnibal
430e2edb96 * Fix noun_chunks issue 2015-07-30 03:51:50 +02:00
Matthew Honnibal
9590968fc1 * Fix negative indices in Span 2015-07-30 02:30:24 +02:00
Matthew Honnibal
74d8cb3980 * Add noun_chunks iterator, and fix left/right child setting in Doc.merge 2015-07-30 02:29:49 +02:00
Matthew Honnibal
d153f18969 * Fix negative indices on spans 2015-07-29 22:36:03 +02:00
Matthew Honnibal
b5132bed7d * Set left and right children when loading parse from byte string 2015-07-28 21:03:18 +02:00
Matthew Honnibal
6609fcf4b2 * Make mem and vocab python-visible in Doc 2015-07-28 20:46:59 +02:00
Matthew Honnibal
d42fe2e694 * Add unicode_literals to strings.pyx 2015-07-28 16:15:53 +02:00
Matthew Honnibal
bb910cff92 * Fix Python3 problem in align_raw 2015-07-28 16:06:53 +02:00
Matthew Honnibal
dcafb181b9 * Fix Python3 problem in align_raw 2015-07-28 15:52:10 +02:00
Matthew Honnibal
c609ea18f0 * Increment version in download script 2015-07-28 15:22:17 +02:00
Matthew Honnibal
9c4d0aae62 * Switch to better Python2/3 compatible unicode handling 2015-07-28 14:45:37 +02:00
Matthew Honnibal
7606d9936f * Python3 correction for GoldParse 2015-07-28 14:44:53 +02:00
Matthew Honnibal
ddc1a5cfe5 * Fix training under python3 2015-07-28 14:09:30 +02:00
Matthew Honnibal
a8bbd7312c * Hackishly patch long dependencies problem 2015-07-28 00:14:29 +02:00
Matthew Honnibal
bb583f7f09 * Hackishly patch long dependencies problem 2015-07-27 23:14:33 +02:00
Matthew Honnibal
aa7a964a4f * Add a type declaration for doc.from_array 2015-07-27 22:57:22 +02:00
Matthew Honnibal
25a8774f42 * Fix regression in packer 2015-07-27 21:53:38 +02:00
Matthew Honnibal
1601e488ee * Fix bug in decoding non-ascii characters 2015-07-27 21:43:58 +02:00
Matthew Honnibal
6a95409cd2 * Fix type on bits 2015-07-27 21:16:49 +02:00
Matthew Honnibal
a296d72b54 * Fix en/attrs 2015-07-27 21:16:33 +02:00
Matthew Honnibal
45460f505c * Fix data type on read32 in BitArray 2015-07-27 21:12:13 +02:00
Matthew Honnibal
3d43f49f69 * Revert prev change 2015-07-27 10:58:15 +02:00
Matthew Honnibal
6b586cdad4 * Change lexemes.bin format. Add a header specifying size of LexemeC and number of lexemes, and don't have the redundant orth information. 2015-07-27 08:31:51 +02:00
Matthew Honnibal
af6ed18f2a * Ensure we don't use orth_encode on OOV words. 2015-07-27 02:12:01 +02:00
Matthew Honnibal
8535d872e8 * Set is_oov property in get_flags 2015-07-27 01:51:24 +02:00
Matthew Honnibal
8e4c69ee8c * Add is_oov property, and fix up handling of attributes 2015-07-27 01:50:06 +02:00
Matthew Honnibal
fc268f03eb * Assert against null pointer exceptions in vocab 2015-07-27 01:00:10 +02:00
Matthew Honnibal
0f093fdb30 * Fix get_by_orth for py3 2015-07-26 19:26:41 +02:00
Matthew Honnibal
ceeda5a739 * Fix get_by_orth for py3 2015-07-26 18:39:27 +02:00
Matthew Honnibal
6bb96c122d * Host IS_ flags in attrs.pxd, and add properties for them on Token and Lexeme objects 2015-07-26 16:37:16 +02:00
Matthew Honnibal
eeaea25f0c * Check oov_prob file is present 2015-07-26 16:36:38 +02:00
Matthew Honnibal
7eb2446082 * Return empty lexeme on empty string 2015-07-26 00:18:30 +02:00
Matthew Honnibal
1b5d1da2a7 * Allow an OOV probability to be specified in get_lex_props 2015-07-26 00:03:43 +02:00
Matthew Honnibal
cd6e25132b * Allow an OOV probability to be specified in get_lex_props 2015-07-26 00:01:46 +02:00
Matthew Honnibal
fd525f0675 * Pass OOV probability around 2015-07-25 23:29:51 +02:00
Matthew Honnibal
3fe14b8ed6 * Fix CFile for Python2 2015-07-25 22:55:53 +02:00
Matthew Honnibal
823ef4a00b * Remove profile declarations 2015-07-25 18:13:06 +02:00
Matthew Honnibal
f4809e562f * Allow json to be used as a fallback if ujson is not available 2015-07-25 18:11:36 +02:00
Matthew Honnibal
9da06671cf * Remove unused import 2015-07-25 18:11:16 +02:00
Matthew Honnibal
2060935cdb * Remove explicit bytes type in doc.from_bytes, to accept bytearray 2015-07-24 04:54:13 +02:00
Matthew Honnibal
aa28e2e01d * Release the GIL around parse function 2015-07-24 04:53:27 +02:00
Matthew Honnibal
d62eb34b76 * More Py 2/3 compatibility in bit strings 2015-07-24 04:52:06 +02:00
Matthew Honnibal
0bb839d299 * Fix string coercion for Python 3 2015-07-24 03:49:30 +02:00
Matthew Honnibal
c4ff410fdb * Fix bytes problems for Python3 2015-07-24 03:48:23 +02:00
Matthew Honnibal
1ab25e4dad * Fix python3 type error 2015-07-24 02:45:34 +02:00
Matthew Honnibal
f35ff173b0 * Fix bits.pyx unicode error 2015-07-23 20:37:57 +02:00
Matthew Honnibal
1406e24327 * Fix unicode error for Python3 2015-07-23 19:36:21 +02:00
Matthew Honnibal
dbda6c27fa * Fix python3 error 2015-07-23 14:52:30 +02:00
Matthew Honnibal
99387f9572 * Fix python3 error 2015-07-23 14:30:29 +02:00
Matthew Honnibal
b81ffe9032 * Fix typing on mode string in CFile 2015-07-23 13:24:43 +02:00
Matthew Honnibal
22028602a9 * Add unicode_literals declaration in vocab.pyx 2015-07-23 13:24:20 +02:00
Matthew Honnibal
5b41744270 * Check for directory presence before loading annotators 2015-07-23 09:27:37 +02:00
Matthew Honnibal
df01a88763 Merge branch 'refactor' (and serializaton)
Add Huffman-code serialization, and do a lot of
refactoring. Highlights include:

* Much more efficient StringStore
* Vocab maintains a by-orth mapping of Lexemes
* Avoid manually slicing Py_UNICODE buffers,
  simplifying tokenizer and vocab C APIs
* Remove various bits of dead code
* Work on removing GIL around parser
* Work on bridge to Theano

Conflicts:
	spacy/strings.pxd
	spacy/strings.pyx
	spacy/structs.pxd
2015-07-23 02:18:35 +02:00
Matthew Honnibal
a7c4d72e83 * Add serializer property to Vocab, and lazy-load it. Add get_by_orth method. 2015-07-23 01:18:19 +02:00
Matthew Honnibal
6ab1696b15 * Remove read_encoding_freqs from util.py 2015-07-23 01:17:32 +02:00
Matthew Honnibal
d5255aad77 * Update freqs for missing tags in ner, for serializer 2015-07-23 01:17:11 +02:00