spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-09-21 19:39:13 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	f8f2f4e545	* Temporarily add PUNC name to parts_of_speech dictionary, until better solution	2015-08-26 19:18:19 +02:00
Matthew Honnibal	008b02b035	* Comment out enums in Morphology for now	2015-08-26 19:17:35 +02:00
Matthew Honnibal	378729f81a	* Hack Morphology class towards usability	2015-08-26 19:17:21 +02:00
Matthew Honnibal	430affc347	* Fix missing n_patterns property in Matcher class. Fix from_dir method	2015-08-26 19:17:02 +02:00
Matthew Honnibal	3acf60df06	* Add missing properties in Lexeme class	2015-08-26 19:16:28 +02:00
Matthew Honnibal	76996f4145	* Hack on generic Language class. Still needs work for morphology, defaults, etc	2015-08-26 19:16:09 +02:00
Matthew Honnibal	e2ef78b29c	* Gut pos.pyx module, since functionality moved to spacy/tagger.pyx	2015-08-26 19:15:42 +02:00
Matthew Honnibal	c4d8754385	* Specify LOCAL_DATA_DIR global in spacy.en.__init__.py	2015-08-26 19:15:07 +02:00
Matthew Honnibal	c2d8edd0bd	* Add PROB attribute in attrs.pxd	2015-08-26 19:14:19 +02:00
Matthew Honnibal	c5a27d1821	* Move lemmatizer to spacy	2015-08-25 15:47:08 +02:00
Matthew Honnibal	82217c6ec6	* Generalize lemmatizer	2015-08-25 15:46:19 +02:00
Matthew Honnibal	8083a07c3e	* Use language base class	2015-08-25 15:37:30 +02:00
Matthew Honnibal	f2f699ac18	* Add language base class	2015-08-25 15:37:17 +02:00
Matthew Honnibal	5dd76be446	* Split EnPosTagger up into base class and subclass	2015-08-24 05:25:55 +02:00
Matthew Honnibal	5d5922dbfa	* Begin laying out morphological features	2015-08-24 01:04:30 +02:00
Matthew Honnibal	6f1743692a	* Work on language-independent refactoring	2015-08-23 20:49:18 +02:00
Matthew Honnibal	3879d28457	* Fix https for url detection	2015-08-23 02:40:35 +02:00
Matthew Honnibal	cad0cca4e3	* Tmp	2015-08-22 22:04:34 +02:00
Matthew Honnibal	bf38b3b883	* Hack on l/r reversal bug	2015-08-10 05:58:43 +02:00
Matthew Honnibal	6116413b47	* Fix label prediction in StepwiseState	2015-08-10 05:05:31 +02:00
Matthew Honnibal	2c9753eff2	* Whitespace	2015-08-10 00:09:02 +02:00
Matthew Honnibal	9de98f5a6f	* Add Parser.stepthrough method, with context manager	2015-08-10 00:08:46 +02:00
Matthew Honnibal	fe43f8cf39	* Whitespace	2015-08-09 02:31:53 +02:00
Matthew Honnibal	9c090945e0	* Add Parser.predict method, and clean up Parser.get_state	2015-08-09 02:29:58 +02:00
Matthew Honnibal	04fccfb984	* Fix get_state for parser prediction	2015-08-09 02:11:22 +02:00
Matthew Honnibal	55fde0e240	* Fix get_state	2015-08-09 01:45:30 +02:00
Matthew Honnibal	f0f4fa9838	* Fix Parser.get_state	2015-08-09 01:40:13 +02:00
Matthew Honnibal	18331dca89	* Add continue_for argument to parser 'partial' function, which is now renamed to get_state	2015-08-09 01:31:54 +02:00
Matthew Honnibal	0653288fa5	* Fix stateclass.queue	2015-08-09 00:39:02 +02:00
Matthew Honnibal	9de218b7ba	* Fix Parser.partial function	2015-08-08 23:45:18 +02:00
Matthew Honnibal	01be34d55a	* Whitespace	2015-08-08 23:37:44 +02:00
Matthew Honnibal	cc9deae960	* Add is_valid method to transition_system	2015-08-08 23:36:18 +02:00
Matthew Honnibal	2a46c77324	* Whitespace	2015-08-08 23:35:59 +02:00
Matthew Honnibal	7bafc789e7	* Add stack and queue properties to stateclass, for python access	2015-08-08 23:32:42 +02:00
Matthew Honnibal	3af938365f	* Add function partial to Parser	2015-08-08 23:32:15 +02:00
Matthew Honnibal	76a1f0481a	* Whitespace	2015-08-08 23:31:54 +02:00
Matthew Honnibal	b0f5c39084	* Fix handling of exclusion entities	2015-08-06 17:28:43 +02:00
Matthew Honnibal	9f65879991	* Fix shape attr bug, and fix handling of false positive matches	2015-08-06 17:28:14 +02:00
Matthew Honnibal	10d869d102	* Don't allow conjunction between NPs in base NP chunks	2015-08-06 16:31:53 +02:00
Matthew Honnibal	383dfabd67	* Fix matcher setting of entities	2015-08-06 16:27:01 +02:00
Matthew Honnibal	59c3bf60a6	* Ensure entity recognizer doesn't over-write preset types	2015-08-06 16:09:08 +02:00
Matthew Honnibal	cd7d1682cd	* Fix loading of gazetteer.json file	2015-08-06 16:08:25 +02:00
Matthew Honnibal	9c667b7f15	* Set a value in attrs.pxd on the first flag, to reduce bugs	2015-08-06 16:08:04 +02:00
Matthew Honnibal	c263577424	* Fix lower attribute in lexeme.pxd	2015-08-06 16:07:41 +02:00
Matthew Honnibal	5737115e1e	* Work on gazetteer matching	2015-08-06 14:33:21 +02:00
Matthew Honnibal	9c1724ecae	* Gazetteer stuff working, now need to wire up to API	2015-08-06 00:35:40 +02:00
Matthew Honnibal	5bc0e83f9a	* Reimplement matching in Cython, instead of Python.	2015-08-05 01:05:54 +02:00
Matthew Honnibal	4c87a696b3	* Add draft dfa matcher, in Python. Passing tests.	2015-08-04 15:55:28 +02:00
Matthew Honnibal	eb7138c761	* Add attr relation in base NP detection	2015-08-01 00:34:40 +02:00
Matthew Honnibal	4988356cf0	* Fix dependency type bug from merged tokens	2015-08-01 00:33:24 +02:00
Matthew Honnibal	78a9068319	* Fix spacy attr on merged tokens	2015-07-30 04:25:58 +02:00
Matthew Honnibal	430e2edb96	* Fix noun_chunks issue	2015-07-30 03:51:50 +02:00
Matthew Honnibal	9590968fc1	* Fix negative indices in Span	2015-07-30 02:30:24 +02:00
Matthew Honnibal	74d8cb3980	* Add noun_chunks iterator, and fix left/right child setting in Doc.merge	2015-07-30 02:29:49 +02:00
Matthew Honnibal	d153f18969	* Fix negative indices on spans	2015-07-29 22:36:03 +02:00
Matthew Honnibal	b5132bed7d	* Set left and right children when loading parse from byte string	2015-07-28 21:03:18 +02:00
Matthew Honnibal	6609fcf4b2	* Make mem and vocab python-visible in Doc	2015-07-28 20:46:59 +02:00
Matthew Honnibal	d42fe2e694	* Add unicode_literals to strings.pyx	2015-07-28 16:15:53 +02:00
Matthew Honnibal	bb910cff92	* Fix Python3 problem in align_raw	2015-07-28 16:06:53 +02:00
Matthew Honnibal	dcafb181b9	* Fix Python3 problem in align_raw	2015-07-28 15:52:10 +02:00
Matthew Honnibal	c609ea18f0	* Increment version in download script	2015-07-28 15:22:17 +02:00
Matthew Honnibal	9c4d0aae62	* Switch to better Python2/3 compatible unicode handling	2015-07-28 14:45:37 +02:00
Matthew Honnibal	7606d9936f	* Python3 correction for GoldParse	2015-07-28 14:44:53 +02:00
Matthew Honnibal	ddc1a5cfe5	* Fix training under python3	2015-07-28 14:09:30 +02:00
Matthew Honnibal	a8bbd7312c	* Hackishly patch long dependencies problem	2015-07-28 00:14:29 +02:00
Matthew Honnibal	bb583f7f09	* Hackishly patch long dependencies problem	2015-07-27 23:14:33 +02:00
Matthew Honnibal	aa7a964a4f	* Add a type declaration for doc.from_array	2015-07-27 22:57:22 +02:00
Matthew Honnibal	25a8774f42	* Fix regression in packer	2015-07-27 21:53:38 +02:00
Matthew Honnibal	1601e488ee	* Fix bug in decoding non-ascii characters	2015-07-27 21:43:58 +02:00
Matthew Honnibal	6a95409cd2	* Fix type on bits	2015-07-27 21:16:49 +02:00
Matthew Honnibal	a296d72b54	* Fix en/attrs	2015-07-27 21:16:33 +02:00
Matthew Honnibal	45460f505c	* Fix data type on read32 in BitArray	2015-07-27 21:12:13 +02:00
Matthew Honnibal	3d43f49f69	* Revert prev change	2015-07-27 10:58:15 +02:00
Matthew Honnibal	6b586cdad4	* Change lexemes.bin format. Add a header specifying size of LexemeC and number of lexemes, and don't have the redundant orth information.	2015-07-27 08:31:51 +02:00
Matthew Honnibal	af6ed18f2a	* Ensure we don't use orth_encode on OOV words.	2015-07-27 02:12:01 +02:00
Matthew Honnibal	8535d872e8	* Set is_oov property in get_flags	2015-07-27 01:51:24 +02:00
Matthew Honnibal	8e4c69ee8c	* Add is_oov property, and fix up handling of attributes	2015-07-27 01:50:06 +02:00
Matthew Honnibal	fc268f03eb	* Assert against null pointer exceptions in vocab	2015-07-27 01:00:10 +02:00
Matthew Honnibal	0f093fdb30	* Fix get_by_orth for py3	2015-07-26 19:26:41 +02:00
Matthew Honnibal	ceeda5a739	* Fix get_by_orth for py3	2015-07-26 18:39:27 +02:00
Matthew Honnibal	6bb96c122d	* Host IS_ flags in attrs.pxd, and add properties for them on Token and Lexeme objects	2015-07-26 16:37:16 +02:00
Matthew Honnibal	eeaea25f0c	* Check oov_prob file is present	2015-07-26 16:36:38 +02:00
Matthew Honnibal	7eb2446082	* Return empty lexeme on empty string	2015-07-26 00:18:30 +02:00
Matthew Honnibal	1b5d1da2a7	* Allow an OOV probability to be specified in get_lex_props	2015-07-26 00:03:43 +02:00
Matthew Honnibal	cd6e25132b	* Allow an OOV probability to be specified in get_lex_props	2015-07-26 00:01:46 +02:00
Matthew Honnibal	fd525f0675	* Pass OOV probability around	2015-07-25 23:29:51 +02:00
Matthew Honnibal	3fe14b8ed6	* Fix CFile for Python2	2015-07-25 22:55:53 +02:00
Matthew Honnibal	823ef4a00b	* Remove profile declarations	2015-07-25 18:13:06 +02:00
Matthew Honnibal	f4809e562f	* Allow json to be used as a fallback if ujson is not available	2015-07-25 18:11:36 +02:00
Matthew Honnibal	9da06671cf	* Remove unused import	2015-07-25 18:11:16 +02:00
Matthew Honnibal	2060935cdb	* Remove explicit bytes type in doc.from_bytes, to accept bytearray	2015-07-24 04:54:13 +02:00
Matthew Honnibal	aa28e2e01d	* Release the GIL around parse function	2015-07-24 04:53:27 +02:00
Matthew Honnibal	d62eb34b76	* More Py 2/3 compatibility in bit strings	2015-07-24 04:52:06 +02:00
Matthew Honnibal	0bb839d299	* Fix string coercion for Python 3	2015-07-24 03:49:30 +02:00
Matthew Honnibal	c4ff410fdb	* Fix bytes problems for Python3	2015-07-24 03:48:23 +02:00
Matthew Honnibal	1ab25e4dad	* Fix python3 type error	2015-07-24 02:45:34 +02:00
Matthew Honnibal	f35ff173b0	* Fix bits.pyx unicode error	2015-07-23 20:37:57 +02:00
Matthew Honnibal	1406e24327	* Fix unicode error for Python3	2015-07-23 19:36:21 +02:00
Matthew Honnibal	dbda6c27fa	* Fix python3 error	2015-07-23 14:52:30 +02:00
Matthew Honnibal	99387f9572	* Fix python3 error	2015-07-23 14:30:29 +02:00
Matthew Honnibal	b81ffe9032	* Fix typing on mode string in CFile	2015-07-23 13:24:43 +02:00
Matthew Honnibal	22028602a9	* Add unicode_literals declaration in vocab.pyx	2015-07-23 13:24:20 +02:00
Matthew Honnibal	5b41744270	* Check for directory presence before loading annotators	2015-07-23 09:27:37 +02:00
Matthew Honnibal	df01a88763	Merge branch 'refactor' (and serialization) Add Huffman-code serialization, and do a lot of refactoring. Highlights include: * Much more efficient StringStore * Vocab maintains a by-orth mapping of Lexemes * Avoid manually slicing Py_UNICODE buffers, simplifying tokenizer and vocab C APIs * Remove various bits of dead code * Work on removing GIL around parser * Work on bridge to Theano Conflicts: spacy/strings.pxd spacy/strings.pyx spacy/structs.pxd	2015-07-23 02:18:35 +02:00
Matthew Honnibal	a7c4d72e83	* Add serializer property to Vocab, and lazy-load it. Add get_by_orth method.	2015-07-23 01:18:19 +02:00
Matthew Honnibal	6ab1696b15	* Remove read_encoding_freqs from util.py	2015-07-23 01:17:32 +02:00
Matthew Honnibal	d5255aad77	* Update freqs for missing tags in ner, for serializer	2015-07-23 01:17:11 +02:00
Matthew Honnibal	12699a1152	* Set initial freqs, to avoid missing values in serializer	2015-07-23 01:16:27 +02:00
Matthew Honnibal	680bb47b55	* Write serializer freqs to single file, vocab/serializer.json	2015-07-23 01:15:25 +02:00
Matthew Honnibal	a0e36e8efc	* Add working to/from bytes API to Doc	2015-07-23 01:14:45 +02:00
Matthew Honnibal	1f31d96bf9	* Fix Packer API, so that it reads and writes bytes strings, instead of BitArray. Docs are always byte aligned anyway.	2015-07-23 01:13:02 +02:00
Matthew Honnibal	38ef986b29	* Update spacy/en/attrs.pxd	2015-07-23 01:10:58 +02:00
Matthew Honnibal	06eac32610	* Add cfile.pyx	2015-07-23 01:10:36 +02:00
Matthew Honnibal	0c507bd80a	* Fix tokenizer	2015-07-22 14:10:30 +02:00
Matthew Honnibal	c86dbe4944	* Update English.save_models for new Packer save/load stuff	2015-07-22 13:40:23 +02:00
Matthew Honnibal	bf77bcd6b9	* Add comment explaining hash_string	2015-07-22 13:39:42 +02:00
Matthew Honnibal	815bda201d	* Remove UniStr struct	2015-07-22 13:39:17 +02:00
Matthew Honnibal	2fc66e3723	* Use Py_UNICODE in tokenizer for now, while sort out Py_UCS4 stuff	2015-07-22 13:38:45 +02:00
Matthew Honnibal	4d61239eac	* Reorganize the serialization functions on Doc	2015-07-22 04:53:01 +02:00
Matthew Honnibal	109106a949	* Replace UniStr, using unicode objects instead	2015-07-22 04:52:05 +02:00
Matthew Honnibal	424854028f	* Fix decode_int32	2015-07-21 20:09:59 +00:00
Matthew Honnibal	304d0e2633	* Use decode_int32 in _orth_decode	2015-07-21 20:40:55 +02:00
Matthew Honnibal	9cfa59ec33	* Optimistically try orth encoding, with char as a back-off	2015-07-21 20:22:45 +02:00
Matthew Honnibal	c8b89e37a5	* Bug fix to faster huffman decoding	2015-07-21 20:05:53 +02:00
Matthew Honnibal	b166d1d2a2	* Use encode32 and decode32	2015-07-21 19:59:06 +02:00
Matthew Honnibal	c6cd0ddce8	* Add faster encode_int32 and decode_int32 methods	2015-07-21 19:58:45 +02:00
Matthew Honnibal	dd60594f41	* Fix double encoding error in strings.pyx	2015-07-20 13:52:56 +02:00
Matthew Honnibal	06639dc497	* Add length cap to word shape feature	2015-07-20 12:06:59 +02:00
Matthew Honnibal	128b6d9714	* Move Utf8Str struct to strings module, as that's the only place it's relevant	2015-07-20 12:06:41 +02:00
Matthew Honnibal	01a97b90f3	* Fix header for string store	2015-07-20 12:06:10 +02:00
Matthew Honnibal	52d538ea42	* Fix short string optimization in strings.pyx. StringStore tests now all pass.	2015-07-20 12:05:23 +02:00
Matthew Honnibal	09a3055630	* Work on short string optimization in Utf8Str	2015-07-20 11:26:46 +02:00
Matthew Honnibal	bb0ba1f0cd	* Improve serialization speed	2015-07-20 03:27:59 +02:00
Matthew Honnibal	8743a8c084	* Update Doc serialization for new Packer interface	2015-07-20 01:38:04 +02:00
Matthew Honnibal	1f7170e0e1	* Reinstate the fixed vocabulary --- words are only added to the lexicon in init_model, after that we create LexemeC structs with the Pool given to us.	2015-07-20 01:37:34 +02:00
Matthew Honnibal	5a7d060d9c	* Switch between the orth and char codecs depending on which is shorter for that message. Mostly orth is shorter, except if there are OOV words.	2015-07-20 01:36:22 +02:00
Matthew Honnibal	5a042ee0d3	* Add function to predict number of bits needed to encode message	2015-07-20 01:35:11 +02:00
Matthew Honnibal	b89b489bb4	* Implement both character and orth encoding in Packer, so that we can decide which to use per-text	2015-07-19 22:39:45 +02:00
Matthew Honnibal	ae78c9e3ce	* Implement character-based codec, so that we can do word/char backoff	2015-07-19 22:03:39 +02:00
Matthew Honnibal	cd1d047cb8	* Delete out-dated HuffmanCodec comment	2015-07-19 18:28:14 +02:00
Matthew Honnibal	b8086067d5	* Build Huffman codec from unsorted inputs	2015-07-19 17:58:44 +02:00
Matthew Honnibal	317cbbc015	* Serialization round trip now working with decent API, but with rough spots in the organisation and requiring vocabulary to be fixed ahead of time.	2015-07-19 15:18:17 +02:00
Matthew Honnibal	6b13e7227c	* Remove duplicate get_lex_attr method from doc.pyx	2015-07-18 22:46:07 +02:00
Matthew Honnibal	e49c7f1478	* Update oov check in tokenizer	2015-07-18 22:45:28 +02:00
Matthew Honnibal	cfd842769e	* Allow infix tokens to be variable length	2015-07-18 22:45:00 +02:00
Matthew Honnibal	5b4c78bbb2	* Use an AttributeCodec based on orth for words. Still no oov handling mechanism.	2015-07-18 22:43:18 +02:00
Matthew Honnibal	82d84b0f2b	* Index lexemes by orth, instead of a lexemes vector. Breaks the mechanism for deciding not to own LexemeC structs during parsing. Need to reinstate this.	2015-07-18 22:42:15 +02:00
Matthew Honnibal	4dddc8a69b	* Fix type declarations for attr_t. Remove unused id_t.	2015-07-18 22:39:57 +02:00
Matthew Honnibal	ced59ab9ea	* Make minor efficiency improvement in Doc.__iter__	2015-07-18 04:10:53 +02:00
Matthew Honnibal	cd91914dd8	* Fix hard-coded length	2015-07-18 04:09:56 +02:00
Matthew Honnibal	b1d74ce60d	* Remove unused joint.pyx and joint.pxd files	2015-07-17 23:31:44 +02:00
Matthew Honnibal	c27514512b	* Remove cruft ner/ directory	2015-07-17 23:24:32 +02:00
Matthew Honnibal	f8d6d319f4	* Remove cruft module	2015-07-17 23:23:05 +02:00
Matthew Honnibal	fb0a641a2d	* Don't release the gil around Parser.parse. Does this indicate thread problems?	2015-07-17 23:07:37 +02:00
Matthew Honnibal	e29daea85f	* Fix bint/int typing problem in TransitionSystem. In C++ bint* means bool, but in C it means int. So, type-casting to bint* is unsafe.	2015-07-17 22:37:24 +02:00
Matthew Honnibal	cf0c788892	* Tests passing on round-trip pack/unpack on basic example	2015-07-17 21:20:48 +02:00
Matthew Honnibal	44f39a876f	* Add a blank attrs.pyx	2015-07-17 16:40:42 +02:00
Matthew Honnibal	c2c83120d4	* Remove codec property from Vocab	2015-07-17 16:40:11 +02:00
Matthew Honnibal	dfdf19f6a9	* Draft a from_orth method for Doc	2015-07-17 16:39:54 +02:00
Matthew Honnibal	9e3f17051b	* Move to ORTH instead of ID for encoding lexemes. Basic tests of the codec wrappers now passing	2015-07-17 16:38:29 +02:00
Matthew Honnibal	15ff739996	* Fix passing of ID attribute in string store	2015-07-17 14:49:42 +02:00
Matthew Honnibal	95e57c2780	* Remove unnecessary key and id properties from Utf8String.	2015-07-17 01:40:18 +02:00
Matthew Honnibal	234c7e440a	* Add spacy/serialize/__init__ files	2015-07-17 01:37:33 +02:00
Matthew Honnibal	db9dfd2e23	* Major refactor of serialization. Nearly complete now.	2015-07-17 01:27:54 +02:00
Matthew Honnibal	c8282f9934	* Work on serialization. Needs more reorganisation	2015-07-16 19:56:02 +02:00
Matthew Honnibal	d8458d6a25	* Fix attr_id_t import in Spans	2015-07-16 19:55:21 +02:00
Matthew Honnibal	d1cb30dbc4	* Remove unnecessary key and id properties from Utf8String.	2015-07-16 19:29:02 +02:00
Matthew Honnibal	897de2d438	* Add 'bitter' property for serializer in English class	2015-07-16 17:47:53 +02:00
Matthew Honnibal	fb54052ae0	* Work on serializer design	2015-07-16 17:46:46 +02:00
Matthew Honnibal	a6f401580d	* Add from_array function to Doc.	2015-07-16 17:46:11 +02:00
Matthew Honnibal	2a5d050134	* Give codec loading back to Vocab.	2015-07-16 17:45:42 +02:00
Matthew Honnibal	8bf0f65f1c	* Remove dead code in strings.pyx	2015-07-16 17:35:53 +02:00
Matthew Honnibal	a9c3863665	* Fix inefficiency in StringStore.dump function	2015-07-16 17:34:32 +02:00
Matthew Honnibal	b59d271510	* Move serialization functionality into Serializer class	2015-07-16 11:23:48 +02:00
Matthew Honnibal	30be4f15da	* Import attrs from spacy.attrs, not spacy.typedefs	2015-07-16 11:23:25 +02:00
Matthew Honnibal	6c99e5f4aa	* Move serialization into Serializer class, with __call__ and train() api	2015-07-16 11:22:35 +02:00
Matthew Honnibal	e2133d990e	* Move serialization functionality out into a Serializer object	2015-07-16 11:21:44 +02:00
Matthew Honnibal	a6d040bd11	* Import Lexeme attrs from spacy.attrs, not spacy.typedefs	2015-07-16 11:20:08 +02:00
Matthew Honnibal	45ae1ce428	* Remove unused declaration in parser	2015-07-16 01:27:11 +02:00
Matthew Honnibal	efa80096f1	* Upd attrs id list	2015-07-16 01:26:54 +02:00
Matthew Honnibal	01fab6bb90	* Improve de/serialize functions	2015-07-16 01:26:35 +02:00
Matthew Honnibal	0e07c1ed2a	* draft de/serialization functions in doc.pyx	2015-07-16 01:16:33 +02:00
Matthew Honnibal	9d956b07e9	* Fix import of attrs in doc.pyx, and update the get_token_attr function.	2015-07-16 01:15:34 +02:00
Matthew Honnibal	65251e7625	* Remove redundant attr_id_t from typedefs.pxd	2015-07-16 00:58:51 +02:00
Matthew Honnibal	9a8db9743c	* Remove gil from parser.call	2015-07-14 23:47:33 +02:00
Matthew Honnibal	38ca0c33f5	Merge branch 'neuralnet' into refactor Mostly refactors parser, to use new thinc3.2 Example class. Aim is to remove use of shared memory, so that we can parallelize over documents easily. Conflicts: setup.py spacy/syntax/parser.pxd spacy/syntax/parser.pyx spacy/syntax/stateclass.pyx	2015-07-14 14:13:47 +02:00
Matthew Honnibal	935ac53ee3	* Extend count_by method	2015-07-14 03:20:09 +02:00
Matthew Honnibal	3b5baa660f	* Fix tokenizer	2015-07-14 00:10:51 +02:00
Matthew Honnibal	2ae0b439b2	* Fix space check in gold.pyx	2015-07-14 00:10:27 +02:00
Matthew Honnibal	81aa4e6dcc	* Go back to having token reference doc, instead of complicated gymnastics. Rename the attr 'doc', to expose it in the API	2015-07-14 00:10:11 +02:00
Matthew Honnibal	24d6ce99ec	* Add comment to tokenizer, explaining the spacy attr	2015-07-13 22:29:13 +02:00
Matthew Honnibal	8214b74eec	* Restore _py_tokens cache, to handle orphan tokens.	2015-07-13 22:28:10 +02:00
Matthew Honnibal	67641f3b58	* Refactor tokenizer, to set the 'spacy' field on TokenC instead of passing a string	2015-07-13 21:46:02 +02:00
Matthew Honnibal	6eef0bf9ab	* Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx	2015-07-13 20:20:58 +02:00
Matthew Honnibal	3ea8756c24	* Add spacy/tokens/doc.pyx, for Doc class in its own file	2015-07-13 19:58:26 +02:00
Matthew Honnibal	c99387155f	* Refactor tokens, moving classes into a module instead of a single file	2015-07-13 19:49:55 +02:00
Matthew Honnibal	d27899658e	* Import classes in spacy.tokens.__init__	2015-07-13 19:48:55 +02:00
Matthew Honnibal	aa82caf8f5	* Add TokenC.spacy attr	2015-07-13 19:48:07 +02:00
Matthew Honnibal	dba6b47d4e	* Refactor monster tokens.pyx file, into a tokens/ subpackage. Try to break the cycle between Doc and Token, and remove the need to pass around a unicode string reference	2015-07-13 19:20:48 +02:00
Matthew Honnibal	5b0a7190c9	* Round-trip for serialization finally working. Needs a lot of optimization.	2015-07-13 18:39:38 +02:00
Matthew Honnibal	edd371246c	* Make huffman coder take BitArray in encode/decode. Add __iter__ method to BitArray.	2015-07-13 17:33:33 +02:00
Matthew Honnibal	af5cc926a4	* Add codec property to Vocab, to use the Huffman encoding	2015-07-13 13:55:14 +02:00
Matthew Honnibal	77385d5580	* Make .pxd file for huffman codec	2015-07-13 13:54:51 +02:00
Matthew Honnibal	083b6ea7ae	* Clean up encoder a bit. Now ready for integration into Vocab.	2015-07-13 12:57:22 +02:00
Matthew Honnibal	8d0f1d98da	* Draft docstring for HuffmanCache	2015-07-13 12:01:18 +02:00
Matthew Honnibal	281f1faefb	* Nearly finished huffman coder	2015-07-12 23:48:46 +02:00
Matthew Honnibal	e1a25fba32	* Work on huffman coder	2015-07-12 19:58:05 +02:00
Matthew Honnibal	3fb9de2d13	* Remove vector[bint], in favor of simple Code struct.	2015-07-12 17:58:27 +02:00
Matthew Honnibal	aa7bfd932b	* Work on compressor	2015-07-12 16:03:43 +02:00
Matthew Honnibal	14eafcab15	* Refactor to use vector[bint]	2015-07-12 05:27:47 +02:00
Matthew Honnibal	6a6e852a39	* Refactor huffman coding stuff into class	2015-07-12 05:06:36 +02:00
Matthew Honnibal	aad96fdb5c	* Improve efficiency of huffman coding	2015-07-12 01:31:37 +02:00
Matthew Honnibal	ff9ff6f3fa	* Ensure unseen words are given low log probability	2015-07-12 01:31:09 +02:00
Matthew Honnibal	9d3b0d83de	* Refactor huffman coding	2015-07-11 22:27:43 +02:00
Matthew Honnibal	8d29406cd6	* Rename span.right to span.rights	2015-07-11 22:15:04 +02:00
Matthew Honnibal	da9f358166	* Fix span getting	2015-07-11 21:41:41 +02:00
Matthew Honnibal	11e8f2ffb4	* Huffman codes working	2015-07-11 20:01:10 +02:00
Matthew Honnibal	cb6fc81909	* Work on huffman coding.	2015-07-11 15:23:35 +02:00
Matthew Honnibal	4c9b77fe95	* Begin working on serialization code	2015-07-11 10:57:30 +02:00
Matthew Honnibal	53d1f5b2eb	* Rename Span.head to Span.root.	2015-07-09 17:30:58 +02:00
Matthew Honnibal	c0255ed7d8	* Allow slice indexing in Doc.__getitem__, returning a Span object	2015-07-09 15:15:32 +02:00
Matthew Honnibal	89a91ad726	* Add SPACE part-of-speech tag, and train tagger to assign it. Also train tagger not to make whitespace an entity	2015-07-09 13:30:41 +02:00
Matthew Honnibal	55f1042443	* Improve efficiency of L and R features, correcting the non-linear-in-length problem.	2015-07-09 12:17:26 +02:00
Matthew Honnibal	70d2acb579	* Fix edge features	2015-07-09 12:15:01 +02:00
Matthew Honnibal	adb868bdad	* Add warning for models not found in parser	2015-07-08 20:04:55 +02:00
Matthew Honnibal	05b28ec9eb	* Add warning for models not found in parser	2015-07-08 20:02:13 +02:00
Matthew Honnibal	ef700401a6	* Add warning for models not found in parser	2015-07-08 20:00:46 +02:00
Matthew Honnibal	6218d8b389	* Add warning for models not found in parser	2015-07-08 19:59:16 +02:00
Matthew Honnibal	f6a6c39ce8	* Add warning for models not found in parser	2015-07-08 19:52:30 +02:00
Matthew Honnibal	78db7e32f7	* Remove has_sense method from Lexeme declaration	2015-07-08 19:41:20 +02:00
Matthew Honnibal	6ddb2f5e45	* Restore merge_mwe in English class	2015-07-08 19:35:30 +02:00
Matthew Honnibal	6859f6adac	* Restore merge_mwe in English class	2015-07-08 19:34:55 +02:00
Matthew Honnibal	3c270fc8ff	* Remove has_sense method from Lexeme	2015-07-08 19:28:29 +02:00
Matthew Honnibal	b64c843861	* Remove senses attr	2015-07-08 19:26:24 +02:00
Matthew Honnibal	1d3a592edf	* Remove the senses attr from LexemeC, to keep data compatibility	2015-07-08 19:24:44 +02:00
Matthew Honnibal	0ceb1f71c2	* Update parse features	2015-07-08 19:11:36 +02:00
Matthew Honnibal	2e51b5027a	* Alias Doc to Tokens, for backwards compatibility	2015-07-08 18:59:35 +02:00
Matthew Honnibal	e3c53f5ecd	* Fix mention of Tokens in docstring	2015-07-08 18:56:27 +02:00
Matthew Honnibal	bb522496dd	* Rename Tokens to Doc	2015-07-08 18:53:00 +02:00
Matthew Honnibal	b24e8be2b9	* Whitespace in docstring	2015-07-08 12:37:03 +02:00
Matthew Honnibal	abc43b852d	* Add pos_tags attr to Vocab.	2015-07-08 12:36:38 +02:00
Matthew Honnibal	935bcdf3e5	* Remove redundant tag_names argument to Tokenizer	2015-07-08 12:36:04 +02:00
Matthew Honnibal	ff885e8511	* Add ParserFactory convenience function	2015-07-08 12:35:46 +02:00
Matthew Honnibal	4e4fac452b	* Refactor __init__ for simplicity. Allow parse=True, tag=True etc flags to be passed at top-level. Do not lazy-load parser.	2015-07-08 12:35:29 +02:00
Matthew Honnibal	1d2deb4616	* Work on refactoring default arguments to English.__init__	2015-07-07 15:53:25 +02:00
Matthew Honnibal	2d0e99a096	* Pass pos_tags into Tokenizer.from_dir	2015-07-07 14:23:08 +02:00
Matthew Honnibal	6788c86b2f	* Begin refactor	2015-07-07 14:00:07 +02:00
Matthew Honnibal	52fd80c6c6	* Add experimental supersense features for parsing, based on lookup into wordnet.	2015-07-01 20:12:44 +02:00
Matthew Honnibal	e6d828a9af	* Set up an array POS_SENSES that denotes the set of valid senses for each POS tag. This way, we can do bitwise & between a lexeme's senses and the ones available for its POS tag, to get the allowable senses for the token.	2015-07-01 20:12:13 +02:00
Matthew Honnibal	2b8459d9a8	* Add senses flag to Lexeme	2015-07-01 20:10:41 +02:00

1217 Commits