spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-09-22 03:49:17 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	84c5dfbfc3	* Clean up debugging python list	2016-01-19 20:10:32 +01:00
Matthew Honnibal	04d0686b26	* Make TransitionSystem.add_action idempotent, i.e. ignore duplicate added actions.	2016-01-19 20:10:04 +01:00
Matthew Honnibal	c4a89d56bd	* Automatically register any entity types pre-set on the tokens, so that the NER works with user-given entity types.	2016-01-19 20:09:26 +01:00
Matthew Honnibal	f0f92793f6	* Add test for user NER classes in matcher blocking the NER model. Re Issue #178 and Issue #217	2016-01-19 19:23:16 +01:00
Matthew Honnibal	65c5bc4988	* Add add_label method, to allow users to register new entity types and dependency labels.	2016-01-19 19:11:02 +01:00
Matthew Honnibal	151aa0b0e2	* Allow users to add_label, in order to extend the entity recogniser to new classes. Does not by itself add a class to the model	2016-01-19 19:09:33 +01:00
Matthew Honnibal	c8e0011ebc	* Add iterators to the NER and parser transition systems, to get the action types	2016-01-19 19:07:43 +01:00
Matthew Honnibal	515493c675	* Add xfail test for Issue #225 : tokenization with non-whitespace delimiters	2016-01-19 13:20:14 +01:00
Matthew Honnibal	7abe653223	* Fix imports	2016-01-19 03:36:51 +01:00
Matthew Honnibal	590f38bdb2	* Add hacky solution to Issue #220 . Currently specials.json only supports literal patterns, which doesn't allow us to pre-tag whitespace with the correct token, SP, as a rule. The data-driven approach should be easy but for some reason fails here. Adding a hard code in Morphology isn't a good solution, but we do want to fix the behaviour right away, and don't want to wait for an architecturally better solution.	2016-01-19 03:35:20 +01:00
Matthew Honnibal	445164d5b4	* Restore the LOCAL_DATA_DIR global in spacy/en/__init__.py, although this is now deprecated	2016-01-19 02:54:56 +01:00
Matthew Honnibal	04177debd0	* Unwind limit to sentence boundary detection that prevents it from inserting boundaries on whitespace. Replace it with a check for whitespace in StateClass.fast_forward, so that whitespace is LeftArced when it's on the stack. This should prevent the previous problem of whitespace-only sentences. Should fix Issue #184 , but may cause further problems. Needs testing.	2016-01-19 02:54:15 +01:00
Matthew Honnibal	7893de3203	* Add test for Issue #184 : Whitespace at sentence boundary causes sentence boundary error.	2016-01-18 23:04:38 +01:00
Matthew Honnibal	bba0a5e078	* Handle string paths in default_vocab, default_parser, default_entity in Language class	2016-01-18 22:37:24 +01:00
Matthew Honnibal	e825fd9554	* Make some of the website tests work without models	2016-01-18 18:14:44 +01:00
Matthew Honnibal	334c4b2b57	* Disprefer punctuation and spaces as heads of spans	2016-01-18 18:14:09 +01:00
Matthew Honnibal	bed36ab0ff	* Fix import of HEAD attribute	2016-01-18 17:34:43 +01:00
Matthew Honnibal	28c659c1fe	* Fix import for numpy	2016-01-18 17:25:04 +01:00
Matthew Honnibal	fc36bcf458	* Fix import for English	2016-01-18 17:14:40 +01:00
Matthew Honnibal	cc4c335e14	* Set heads for test_merge_tokens, to make the test run without models	2016-01-18 17:00:11 +01:00
Matthew Honnibal	c107da9738	* Bug fix to _count_words_to_root	2016-01-18 16:59:38 +01:00
Matthew Honnibal	f24833d607	* Fix merge for coordinations	2016-01-18 16:03:19 +01:00
Matthew Honnibal	14534958a9	* Fix bug in Span.root	2016-01-18 15:40:28 +01:00
Matthew Honnibal	714cbc03d5	* Add test for Issue #203 : nested noun chunks.	2016-01-16 18:02:30 +01:00
Matthew Honnibal	4e2253170c	* Move test for doc.merge to tokens_api file, to avoid name conflicts which upset pytest	2016-01-16 18:01:36 +01:00
Matthew Honnibal	34a157511f	* Move test_merge_hang to test_tokens_api	2016-01-16 18:00:26 +01:00
Matthew Honnibal	fc8f26584a	* Don't consider NPs connected to parse via conj relation as noun chunks. Change motivated by the nested noun chunks identified in Issue #203 , but might be problematic. Also allow root NPs to be considered noun chunks.	2016-01-16 17:52:40 +01:00
Matthew Honnibal	4a16dbfeca	* Add test for Issue #203 : noun chunks should be flat, but sometimes are nested	2016-01-16 17:41:25 +01:00
Matthew Honnibal	995b2d18fd	* Route token.string via token.text_with_ws, to deprecate token.string in future	2016-01-16 17:14:34 +01:00
Matthew Honnibal	54a98eaf19	* Fix typo text_wth_ws --> text_with_ws. Reroute .string attribute to text_with_ws, to deprecate .string in future	2016-01-16 17:13:50 +01:00
Matthew Honnibal	3e9961d2c4	* If final token is whitespace, don't mark it as owning a trailing space. Fixes Issue #154	2016-01-16 17:08:59 +01:00
Matthew Honnibal	223d2b3484	* Add test for Issue #154 : Additional whitespace introduced when string ends with a whitespace token.	2016-01-16 17:08:07 +01:00
Matthew Honnibal	3dc398b727	* Fix merge conflict in requirements.txt	2016-01-16 16:20:49 +01:00
Matthew Honnibal	fc5962a77d	* Improve test for root token in Span	2016-01-16 16:19:09 +01:00
Matthew Honnibal	c025a0c64b	* Check for KeyboardInterrupt in parser.__call__	2016-01-16 16:18:44 +01:00
Matthew Honnibal	03e8a4293d	* Add loop guard to Token.lefts and Token.rights properties	2016-01-16 16:18:17 +01:00
Matthew Honnibal	304339985e	* Add a linear scan to Span.root method, to help with long sentences	2016-01-16 16:17:28 +01:00
Matthew Honnibal	aa0dd79f52	* Delete test_token_references, which checked a flakey strategy for preventing orphan tokens from a while ago. Now orphan tokens simply hold a reference to Pool, preventing the memory from being freed underneath them. This means that we don't need to run this slow test.	2016-01-16 16:03:35 +01:00
Matthew Honnibal	8cbcc3a799	* Fix calculation of root token in Span. Now take root to be word with shortest tree path. Avoids parse trees ending up in inconsistent state, as had occurred in Issue #214 .	2016-01-16 15:38:50 +01:00
Matthew Honnibal	c1039fa4b4	* Add test for Issue #214 . Resolved in change to Span.root	2016-01-16 15:37:47 +01:00
Henning Peters	41ea14a56f	fix pickling	2016-01-16 13:23:11 +01:00
Henning Peters	5551052840	fix py2/3 issue	2016-01-16 12:44:53 +01:00
Henning Peters	235f094534	untangle data_path/via	2016-01-16 12:23:45 +01:00
Matthew Honnibal	42a9f29b40	* Add loop guard in Span.root, to raise errors if there is a cycle in the dependency parse, instead of entering an infinite loop. Re Issue #214	2016-01-16 11:53:37 +01:00
Henning Peters	6d1a3af343	cleanup unused	2016-01-16 10:05:04 +01:00
Henning Peters	846fa49b2a	distinct load() and from_package() methods	2016-01-16 10:00:57 +01:00
Henning Peters	211913d689	add about.py, adapt setup.py	2016-01-15 18:57:01 +01:00
Henning Peters	f8a8f97d25	cleanup	2016-01-15 18:13:37 +01:00
Henning Peters	780cb847c9	add default_model to about	2016-01-15 18:07:15 +01:00
Henning Peters	788f734513	refactored data_dir->via, add zip_safe, add spacy.load()	2016-01-15 18:01:02 +01:00
Matthew Honnibal	478a79a3d5	* Add test for Issue #220 : Whitespace being tagged as noun	2016-01-15 16:17:07 +01:00
Henning Peters	d9471f684f	fix typo	2016-01-14 12:14:12 +01:00
Henning Peters	9b75d872b0	fix model download	2016-01-14 12:02:56 +01:00
Henning Peters	bc229790ac	integrate with sputnik	2016-01-13 19:46:17 +01:00
Matthew Honnibal	3fbfba575a	* xfail the contractions test	2015-12-31 13:16:28 +01:00
Matthew Honnibal	3bd910ccad	* Merge therell test	2015-12-31 11:55:18 +01:00
Matthew Honnibal	eaf2ad59f1	* Fix use of mock Package object	2015-12-31 04:13:15 +01:00
Matthew Honnibal	029136a007	* Fix resource loading for Matcher	2015-12-31 02:45:12 +01:00
Matthew Honnibal	55bcdf8bdd	* Fix errors	2015-12-29 22:32:03 +01:00
Matthew Honnibal	a6ba43ecaf	* Fix errors in packaging revision	2015-12-29 18:37:26 +01:00
Matthew Honnibal	4b4eec8b47	* Fix Issue #201 : Tokenization of there'll	2015-12-29 18:09:09 +01:00
Matthew Honnibal	86ee9d046d	* Remove test that belongs to a change for master	2015-12-29 18:07:23 +01:00
Matthew Honnibal	a2dfdec85d	* Clean up spacy.util	2015-12-29 18:06:09 +01:00
Matthew Honnibal	aec130af56	Use util.Package class for io Previous Sputnik integration caused API change: Vocab, Tagger, etc were loaded via a from_package classmethod, that required a sputnik.Package instance. This forced users to first create a sputnik.Sputnik() instance, in order to acquire a Package via sp.pool(). Instead I've created a small file-system shim, util.Package, which allows classes to have a .load() classmethod, that accepts either util.Package objects, or strings. We can later gut the internals of this and make it a proxy for Sputnik if we need more functionality that should live in the Sputnik library. Sputnik is now only used to download and install the data, in spacy.en.download	2015-12-29 18:00:48 +01:00
Matthew Honnibal	0e2498da00	* Replace from_package with load() classmethod in Vocab	2015-12-29 16:56:51 +01:00
Matthew Honnibal	c5902f2b4b	* Upd Lemmatizer to use MockPackage. Replace from_package with load() classmethod	2015-12-29 16:56:02 +01:00
Matthew Honnibal	4131e45543	* Add MockPackage class, to see whether we can proxy for Sputnik in a lightweight way	2015-12-29 16:55:03 +01:00
Matthew Honnibal	f5dea1406d	* Fix silly mistake in Language.__init__	2015-12-28 18:48:57 +01:00
Matthew Honnibal	187960606f	* Fix pickle problems	2015-12-28 16:54:03 +01:00
Matthew Honnibal	8c7e149ec9	* Replace kwargs argument of Language.__init__ with explicit arguments, to fix pickle bug	2015-12-28 15:56:27 +01:00
Henning Peters	32d655b6e1	bump version	2015-12-28 09:34:39 +01:00
Matthew Honnibal	8b61d45ed0	* Fix merge conflicts for headers branch	2015-12-27 17:46:25 +01:00
Matthew Honnibal	6bb9c7f311	Merge pull request #202 from henningpeters/sputnik access model via sputnik	2015-12-28 03:29:53 +11:00
Henning Peters	0e321a7105	get mingw32 to work	2015-12-22 23:25:38 +01:00
Henning Peters	d8d348bb55	allow to specify version constraint within model name	2015-12-18 19:12:08 +01:00
Henning Peters	7f7299cafb	Merge branch 'tmpdir' into headers	2015-12-18 12:25:25 +01:00
Henning Peters	cfa187aaf0	fix tests	2015-12-18 10:58:02 +01:00
Henning Peters	8359bd4d93	strip data/ from package, friendlier Language invocation, make data_dir backward/forward-compatible	2015-12-18 09:52:55 +01:00
Henning Peters	970278a3d6	no need to link data dir anymore	2015-12-18 09:49:45 +01:00
Henning Peters	4f3efb8eaf	avoid writing to /tmp (not cross-platform compatible)	2015-12-16 19:56:40 +01:00
Henning Peters	4ada39f472	avoid writing to /tmp (not cross-platform compatible)	2015-12-16 19:53:06 +01:00
Henning Peters	2d4efe40f9	fix sputnik call	2015-12-13 14:46:08 +01:00
Henning Peters	ac318b568c	new approach to dependency headers	2015-12-13 11:49:17 +01:00
Henning Peters	345dda6f53	small fixes, add package build step	2015-12-07 06:50:26 +01:00
Henning Peters	9027cef3bc	access model via sputnik	2015-12-07 06:01:28 +01:00
Henning Peters	73e5650be5	change index server	2015-11-18 18:09:46 +01:00
Henning Peters	50d15ea5d2	fix	2015-11-18 17:35:21 +01:00
Henning Peters	02a1dcec76	add data dir	2015-11-18 11:48:55 +01:00
Henning Peters	919a4f0b04	change data path, add repository	2015-11-18 11:40:46 +01:00
Henning Peters	12de895e60	fix version	2015-11-15 16:38:16 +01:00
Henning Peters	03d2f98cd5	add sputnik	2015-11-15 15:58:21 +01:00
Matthew Honnibal	ec7d36c3a4	* Add test for matcher end-point problem	2015-11-12 05:00:40 +11:00
Matthew Honnibal	d309622a27	* Add test for matcher end-point problem	2015-11-12 04:59:11 +11:00
Matthew Honnibal	56ea20a886	* Add test for matcher end-point problem	2015-11-12 04:58:53 +11:00
Matthew Honnibal	cfa4062147	* Add test for matcher end-point problem	2015-11-12 04:56:07 +11:00
Matthew Honnibal	5623242b3e	* Adjust NER rules, so that U entries in gazetteer don't become B moves to the model	2015-11-12 04:48:23 +11:00
Matthew Honnibal	d67d7d5a86	* Add test for NER inconsistency bug	2015-11-08 16:19:33 +01:00
Matthew Honnibal	44fbdc7260	* Fix bug in NER transition system, that sometimes left no valid moves	2015-11-08 16:19:12 +01:00
Matthew Honnibal	ab5aac5b2f	* Add .rank property to Token and Lexeme, for frequency rank	2015-11-08 16:18:25 +01:00
Matthew Honnibal	fde9a22ec2	* Add new test for ner	2015-11-08 13:57:15 +01:00
Matthew Honnibal	e92371bb54	* Fix rule that made Last action invalid if there was a preset of O, since if the entity is already open, that ship has sailed.	2015-11-08 22:17:51 +11:00
Matthew Honnibal	3b74739c3e	* Download updated data	2015-11-08 21:24:25 +11:00
Matthew Honnibal	31da42eb27	* Mark tests that require models	2015-11-07 19:27:38 +11:00
Matthew Honnibal	8e26a28616	* Mark tests that require models	2015-11-07 19:10:56 +11:00
Matthew Honnibal	15eab7354f	* Remove extraneous test files	2015-11-07 18:45:13 +11:00
Matthew Honnibal	6f47074214	* Make constructor of ParserModel and TaggerModel the same as AveragedPerceptron, for each pickling.	2015-11-07 18:25:17 +11:00
Matthew Honnibal	1cfa20fb17	* Fix sentence-final whitespace issue	2015-11-07 17:34:46 +11:00
Matthew Honnibal	7663970d5f	* Removed unused i variable from Span, and set attributes to read-only	2015-11-07 17:06:15 +11:00
Matthew Honnibal	4b3c96d76d	* Fix zero-length spans	2015-11-07 17:05:16 +11:00
Matthew Honnibal	888c05a7fa	* Fix variable naming in StepwiseState, for thinc 4.0	2015-11-07 11:02:44 +11:00
Matthew Honnibal	fc2185bfe3	* Fix variable naming in StepwiseState, for thinc 4.0	2015-11-07 10:48:31 +11:00
Matthew Honnibal	954442a807	* Fix variable naming in StepwiseState, for thinc 4.0	2015-11-07 10:30:45 +11:00
Matthew Honnibal	06f26d258e	* Fix test_basic_create	2015-11-07 10:04:37 +11:00
Matthew Honnibal	1d3884c46d	* Fix test_basic_create	2015-11-07 10:03:56 +11:00
Matthew Honnibal	cc8febcbe1	* Fix Span comparison	2015-11-07 09:54:14 +11:00
Matthew Honnibal	af70dc166a	* Fix Last restriction, that was supposed to prevent conflicts with presets, but was incorrect.	2015-11-07 09:52:00 +11:00
Matthew Honnibal	a9b612abdf	* Rework the Span-merge patch, to avoid extending the interface of Doc, and avoid virtualizing the Span.start and Span.end indices, to keep Span usage efficient	2015-11-07 09:01:12 +11:00
Matthew Honnibal	56499d89ef	* Rework the Span-merge patch, to avoid extending the interface of Doc, and avoid virtualizing the Span.start and Span.end indices, to keep Span usage efficient	2015-11-07 08:55:34 +11:00
Andreas Grivas	83ca4e0b93	* use old merge tests - add more	2015-11-07 07:57:04 +11:00
Andreas Grivas	4be7fda453	* span start, end -> properties. autoupdate after merge	2015-11-07 07:57:04 +11:00
Andreas Grivas	562db6d2d0	* merge add lex last - add index finder funcs	2015-11-07 07:57:04 +11:00
Matthew Honnibal	a06e3c8963	* Fix bone-headed mistake in StateClass.E	2015-11-07 07:35:28 +11:00
Matthew Honnibal	d24b8509e4	* Correct screw ups from the previous commits	2015-11-07 06:51:41 +11:00
Matthew Honnibal	5efad178b5	* Set ent tag when close entity	2015-11-07 06:09:25 +11:00
Matthew Honnibal	9285f01d26	* Fix broken StateClass.E tracking	2015-11-07 06:06:39 +11:00
Matthew Honnibal	19136b0e7d	* Add better debug message for illegal move	2015-11-07 05:34:37 +11:00
Matthew Honnibal	2733816b7b	* Fix whitespace	2015-11-07 05:31:06 +11:00
Matthew Honnibal	01ab464383	* Prevent Begin and In moves from applying in NER if we're at the last token of a sentence, as this would mean the entity would span over a sentence boundary. Re Issue #169	2015-11-07 05:30:44 +11:00
Matthew Honnibal	b65633f270	* Fix function that returns nth entity in StateClass. Was only returning the first.	2015-11-07 05:29:11 +11:00
Matthew Honnibal	410b6f9ec1	* Remove deprecated _ml.pyx. We now use the nicer APIs provided by thinc 4.0, and subclass the AveragedPerceptron class.	2015-11-07 05:13:10 +11:00
Matthew Honnibal	3c162dcac3	* Refactor away from the _ml module, to use thinc 4.0. Still some work needs to be done, e.g. to add __reduce__ to the models, more testing, etc.	2015-11-07 03:24:30 +11:00
Matthew Honnibal	9d1b2a103a	* Fix capitalization in lemmatizer	2015-11-06 05:44:35 +11:00
Matthew Honnibal	6ed3aedf79	* Merge vocab changes	2015-11-06 00:48:08 +11:00
Matthew Honnibal	72abbb43fb	* Add type declarations in strings.pyx	2015-11-06 00:47:26 +11:00
Matthew Honnibal	5b2af4864f	* When lemmatizing non-noun, non-verb, non-adj words, output lower-case	2015-11-06 00:45:09 +11:00
Matthew Honnibal	754bf04162	* Remove declaration of Model.update	2015-11-06 00:31:15 +11:00
Matthew Honnibal	e18bdff23a	Merge branch 'master' of ssh://github.com/honnibal/spaCy	2015-11-06 00:26:15 +11:00
Matthew Honnibal	b9991fbd20	* Update to use thinc 3.0	2015-11-06 00:25:59 +11:00
Matthew Honnibal	864a8f45d8	* Use unicode in StringStore.intern, instead of unreliably casting to bytes.	2015-11-05 11:32:19 +00:00
Matthew Honnibal	b18204cd52	* Fix StringStore._realloc, re Issue #155	2015-11-05 11:28:26 +00:00
Matthew Honnibal	f8004c5f65	* Begin upgrading to improved thinc API	2015-11-05 03:53:03 +11:00
Matthew Honnibal	adc7bbd6cf	* Fix name of like_num in default_lex_attrs	2015-11-04 22:02:47 +11:00
Matthew Honnibal	e96faf29e7	* Rename like_number to like_num, to fix inconsistency re Issue #166	2015-11-04 22:01:44 +11:00
Matthew Honnibal	65934b7cd4	* Enforce import of ujson in strings.pyx, because otherwise it's too slow	2015-11-04 00:32:02 +11:00
Matthew Honnibal	1ce5d5602d	* Rename Doc.data to Doc.c	2015-11-04 00:17:13 +11:00
Matthew Honnibal	68f479e821	* Rename Doc.data to Doc.c	2015-11-04 00:15:14 +11:00
Matthew Honnibal	3ddea19b2b	* Rename spans.pyx to span.pyx	2015-11-04 00:14:40 +11:00
Matthew Honnibal	9482d616bc	* Rename spans.pyx to span.pyx	2015-11-03 23:51:05 +11:00
Matthew Honnibal	116da5990a	* Clean up setting of tag in doc.from_bytes	2015-11-03 23:48:57 +11:00
Matthew Honnibal	9ec7b9c454	* Clean up unused Constituent struct.	2015-11-03 23:48:21 +11:00
Matthew Honnibal	1e99fcd413	* Rename .repvec to .vector in C API	2015-11-03 23:47:59 +11:00
Matthew Honnibal	ee3f9ba581	* Fix test of serializer	2015-11-03 19:45:16 +11:00
Matthew Honnibal	d06ba26371	* Fix test of serializer	2015-11-03 19:43:27 +11:00
Matthew Honnibal	4083059650	Merge branch 'master' of https://github.com/honnibal/spaCy	2015-11-03 09:07:19 +01:00
Matthew Honnibal	9e37437ba8	* Fix assign_tag in doc.merge	2015-11-03 19:07:02 +11:00
Matthew Honnibal	dde9e1357c	* Add todo to morphology.lemmatize	2015-11-03 18:54:35 +11:00
Matthew Honnibal	ffedff9e6c	* Remove the archive after download, to save disk space	2015-11-03 18:54:05 +11:00
Matthew Honnibal	85372468e3	* Fix serialize test	2015-11-03 08:51:33 +01:00
Matthew Honnibal	833eb35c57	* Fix tag assignment in doc.from_array	2015-11-03 18:45:54 +11:00
Matthew Honnibal	09664177d7	* Fix tag handling in doc.merge, and assign sent_start when setting heads.	2015-11-03 18:15:52 +11:00
Matthew Honnibal	389a373807	Merge branch 'master' of ssh://github.com/honnibal/spaCy	2015-11-03 18:07:25 +11:00
Matthew Honnibal	3f44b3e43f	* Mark serializer test as requiring models	2015-11-03 18:07:08 +11:00
Matthew Honnibal	25ed7be8f8	Merge branch 'master' of https://github.com/honnibal/spaCy	2015-11-03 07:58:17 +01:00
Matthew Honnibal	604ceac4c6	* Fix morphological assignment in doc.merge()	2015-11-03 17:57:51 +11:00
Matthew Honnibal	5e040855a5	* Ensure morphological features and lemmas are loaded in from_array, re Issue #152	2015-11-03 17:56:50 +11:00
Matthew Honnibal	5668feb235	* Fix pickle test for python3	2015-11-03 04:57:02 +01:00
Matthew Honnibal	6161d2529a	Merge branch 'master' of ssh://github.com/honnibal/spaCy	2015-11-03 13:36:30 +11:00
Matthew Honnibal	5887506f5d	* Don't expect lexemes.bin in Vocab	2015-11-03 13:23:39 +11:00
Matthew Honnibal	f7dd377575	* Adjust conjuncts iterator in Token	2015-11-03 13:23:22 +11:00
Andreas Grivas	d418f00eb1	fixed error when printing unicode	2015-11-02 20:23:18 +02:00
Matthew Honnibal	52fc338001	* Set is_parsed and is_tagged attrs when loading annotations into Doc, re Issue #152	2015-10-28 10:43:22 +11:00
Matthew Honnibal	1c0356e4c2	* Set test file mode to w+t	2015-10-26 22:40:48 +11:00
Matthew Honnibal	0fe98f358b	* Fix mode on text file for Python3 in strings test	2015-10-26 22:25:16 +11:00
Matthew Honnibal	8ba9cf905e	* Fix mode on text file for Python3 in strings test	2015-10-26 21:44:34 +11:00
Matthew Honnibal	a0730699b1	* Fix mode on text file for Python3 in strings test	2015-10-26 21:25:56 +11:00
Matthew Honnibal	725344d349	* Fix tempfile in test	2015-10-26 21:08:18 +11:00
Matthew Honnibal	f11030aadc	* Remove out-dated TODO comment	2015-10-26 12:33:38 +11:00
Matthew Honnibal	a371a1071d	* Save and load word vectors during pickling, re Issue #125	2015-10-26 12:33:04 +11:00
Matthew Honnibal	a824a98312	* Add tests for pickling vectors, re: Issue #125	2015-10-26 12:31:05 +11:00
Matthew Honnibal	314090cc78	* Set vectors length when unpickling vocab, re Issue #125	2015-10-26 12:05:08 +11:00
Matthew Honnibal	4e16f9e435	* Move tests underneath spacy/	2015-10-26 00:07:31 +11:00
Matthew Honnibal	3a6e48e814	Merge pull request #149 from chrisdubois/pickle-patch Add __reduce__ to Tokenizer so that English pickles.	2015-10-25 15:30:31 +11:00
Chris DuBois	dac8fe7bdb	Add __reduce__ to Tokenizer so that English pickles. - Add tests to test_pickle and test_tokenizer that save to tempfiles.	2015-10-23 22:24:03 -07:00
Matthew Honnibal	ff4fe524ee	* Fix exception for python 2	2015-10-23 01:56:13 +02:00
Matthew Honnibal	341a3e85cd	* Upd downloaded data version	2015-10-23 00:56:57 +02:00
Matthew Honnibal	f18fd8c659	* Fix language.py for change in StringStore load API	2015-10-23 03:48:12 +11:00
Matthew Honnibal	23855db3ca	Merge branch 'master' of ssh://github.com/honnibal/spaCy into develop	2015-10-23 03:46:09 +11:00
Matthew Honnibal	4f13849065	Merge pull request #145 from henningpeters/master better error reporting, cleanup	2015-10-23 03:45:47 +11:00
Matthew Honnibal	3be94be0c0	Merge pull request #148 from maxirmx/master Utf8 encoding for lemma_rules.json	2015-10-22 21:46:28 +11:00
Matthew Honnibal	c86bda8d1a	* Fix import of uget	2015-10-22 21:13:56 +11:00
Matthew Honnibal	2348a08481	* Load/dump strings with a json file, instead of the hacky strings file we were using.	2015-10-22 21:13:03 +11:00
Matthew Honnibal	9baf0abd59	* Save vocab after training.	2015-10-22 21:09:14 +11:00
maxirmx	f07e4accd7	Fixing encoding issue #4	2015-10-21 20:45:56 +03:00
maxirmx	fcbfff043f	Fixing encoding issue #3	2015-10-21 15:52:34 +03:00
maxirmx	fe9d2e2c4e	Fixing encode issue #2	2015-10-21 15:36:21 +03:00
maxirmx	e4a1726f77	Fixing encoding issue UTF-8	2015-10-21 14:16:37 +03:00
Andreas Grivas	93ada458e2	added __repr__ that prints text in ipython for doc, token, and span objects	2015-10-21 14:11:46 +03:00
Henning Peters	ccffd2ef53	fixed extract directory	2015-10-21 07:59:34 +02:00
Henning Peters	da4c9cee06	assert filename match	2015-10-20 19:33:59 +02:00
Henning Peters	4f703f0cb4	better error reporting, cleanup	2015-10-20 19:11:29 +02:00
Matthew Honnibal	9cdea6e450	* Import uget correctly	2015-10-19 08:32:41 +02:00
Matthew Honnibal	6727a46bb5	* Fix Issue #118 : Matcher behaves unpredictably when matches overlap.	2015-10-19 16:45:32 +11:00
Matthew Honnibal	135062d23c	* Fix error with merged text when merged region did not have trailing whitespace	2015-10-19 15:47:04 +11:00
Henning Peters	bfde91fa49	add custom download tool (uget), replace wget with uget	2015-10-18 12:35:04 +02:00
Matthew Honnibal	9839cd2c0b	* Fix whitespace_ calculation in Token	2015-10-18 17:21:11 +11:00
Matthew Honnibal	c99285b8b9	* Clean up C++ usage in spacy/matcher.pyx	2015-10-18 17:20:50 +11:00
Matthew Honnibal	a7e6c5ac8f	* Fix Issue #122 : Incorrect calculation of children after Doc.merge()	2015-10-18 17:17:27 +11:00
Matthew Honnibal	3ba66f2dc7	* Add string length cap in Tokenizer.__call__	2015-10-16 04:54:16 +11:00
Matthew Honnibal	6e0f985afc	* Fix token.conjuncts	2015-10-15 03:49:45 +11:00
Matthew Honnibal	2e0104ac81	* Fix token.conjuncts	2015-10-15 03:47:45 +11:00
Matthew Honnibal	b8f3345a82	* Fix token.conjuncts method	2015-10-15 03:36:01 +11:00
Matthew Honnibal	23818f89b8	* Fix token.conjuncts method	2015-10-15 03:34:57 +11:00
Matthew Honnibal	7a15d1b60c	* Add Python 2/3 compatibility fix for copy_reg	2015-10-13 20:04:40 +11:00
Matthew Honnibal	329ae57520	* Fix whitespace attachment thing	2015-10-13 09:46:38 +02:00
Matthew Honnibal	37919eac82	* Fix whitespace attachment in simpler way. Leaves problem with setting left/right children.	2015-10-13 18:23:24 +11:00
Matthew Honnibal	c70eb776ae	* Fix whitespace attachment, so that left/right children are consistent with head.	2015-10-13 15:58:22 +11:00
Matthew Honnibal	531182f937	* Fix Model.__reduce__	2015-10-13 15:14:38 +11:00
Matthew Honnibal	6c227a6c1f	* Fix Model.__reduce__	2015-10-13 15:10:04 +11:00
Matthew Honnibal	358c82595c	* Fix NAMES list in spacy/parts_of_speech.pyx	2015-10-13 14:18:45 +11:00
Matthew Honnibal	c1fdc487bc	Merge branch 'attrs'	2015-10-13 14:03:41 +11:00
Matthew Honnibal	e886e6a406	* Inc version	2015-10-13 13:46:17 +11:00
Matthew Honnibal	20fd36a0f7	* Very scrappy, likely buggy first-cut pickle implementation, to work on Issue #125 : allow pickle for Apache Spark. The current implementation sends stuff to temp files, and does almost nothing to ensure all modifiable state is actually preserved. The Language() instance is a deep tree of extension objects, and if pickling during training, some of the C-data state is hard to preserve.	2015-10-13 13:44:41 +11:00
Matthew Honnibal	f8de403483	* Work on pickling Vocab instances. The current implementation is not correct, but it may serve to see whether this approach is workable. Pickling is necessary to address Issue #125	2015-10-13 13:44:41 +11:00
Matthew Honnibal	85e7944572	* Start trying to pickle Vocab	2015-10-13 13:44:41 +11:00
Matthew Honnibal	5ca57bd859	* Ensure Morphology can be pickled, to address Issue #125 .	2015-10-13 13:44:41 +11:00
Matthew Honnibal	0cee928467	* Allow StringStore to be pickled, to start addressing Issue #125	2015-10-13 13:44:41 +11:00
Matthew Honnibal	41012907a8	* Fix variable name	2015-10-13 13:44:40 +11:00
Matthew Honnibal	e70368d157	* Use lower case strings for dependency label names in symbols enum	2015-10-13 13:44:40 +11:00
Matthew Honnibal	7b4af3d1e7	* Fix parts_of_speech now that symbols list has been reformed	2015-10-13 13:44:40 +11:00
Matthew Honnibal	37b909b6b6	* Use the symbols file in vocab instead of the symbols subfiles like attrs.pxd	2015-10-13 13:44:40 +11:00
Matthew Honnibal	ce65ec698c	* Remove qualified naming in symbols	2015-10-13 13:44:40 +11:00
Matthew Honnibal	9f4be0adcd	* Map NO_TAG to NIL in parts_of_speech.pxd	2015-10-13 13:44:40 +11:00
Matthew Honnibal	278e12f7e8	* Add morphology symbols to morphology. May need to remove these as an enum.	2015-10-13 13:44:40 +11:00
Matthew Honnibal	d80067eda1	* Map empty string to NULL_ATTR in attrs	2015-10-13 13:44:40 +11:00
Matthew Honnibal	d70e8cac2c	* Fix empty values in attributes and parts of speech, so symbols align correctly with the StringStore	2015-10-13 13:44:40 +11:00
Matthew Honnibal	a29c8ee23d	* Add symbols to the vocab before reading the strings, so that they line up correctly	2015-10-13 13:44:39 +11:00
Matthew Honnibal	74c0853471	* Rename ATTR_IDS to attrs.IDS. Rename ATTR_NAMES to attrs.NAMES. Rename UNIV_POS_IDS to parts_of_speech.IDS	2015-10-13 13:44:39 +11:00
Matthew Honnibal	10a4a843ea	* Enumerate all symbols in one file	2015-10-13 13:44:39 +11:00
Matthew Honnibal	85ce36ab11	* Refactor symbols, so that frequency rank can be derived from the orth id of a word.	2015-10-13 13:44:39 +11:00
Matthew Honnibal	dfbcff2ff1	* Revert codecs/io change to strings.pyx, as it seemed to cause an error? Will investigate.	2015-10-10 15:54:55 +11:00
Matthew Honnibal	9dd2f25c74	* Fix Issue #131 : Force whitespace characters to attach syntactically to previous token, and ensure they cannot serve as stand-alone 'sentence' units.	2015-10-10 15:53:30 +11:00
Matthew Honnibal	8b39feefbe	* Add dependency post-process rule to ensure spaces are attached to neighbouring tokens, so that they can't be sentence boundaries	2015-10-10 15:32:13 +11:00
Matthew Honnibal	2153067958	* Fix use of io in strings.pyx	2015-10-10 15:03:12 +11:00
Matthew Honnibal	ec874247b5	Merge branch 'master' of ssh://github.com/honnibal/spaCy	2015-10-10 14:23:51 +11:00
Matthew Honnibal	30de4135c9	* Fix merge problem	2015-10-10 14:22:32 +11:00
Matthew Honnibal	dc393a5f1d	Merge pull request #126 from tomtung/master Improve slicing support for both Doc and Span	2015-10-10 14:14:57 +11:00
Matthew Honnibal	83dccf0fd7	* Use io module instead of deprecated codecs module	2015-10-10 14:13:01 +11:00
Matthew Honnibal	a3dfe2b901	* Increment data version	2015-10-09 13:26:17 +02:00
Matthew Honnibal	2d9e5bf566	* Allow punctuation to be lemmatized	2015-10-09 19:02:42 +11:00
Matthew Honnibal	5332c0b697	* Add support for punctuation lemmatization, to handle unicode characters. This should help in addressing Issue #130	2015-10-09 18:54:40 +11:00
Yubing (Tom) Dong	9a6811acc4	Merge remote-tracking branch 'upstream/master'	2015-10-08 22:53:02 -07:00
Matthew Honnibal	b125289f30	* Fix type declaration in asciied function	2015-10-09 13:46:57 +11:00
Matthew Honnibal	801d55a6d9	* Fix phrase matcher	2015-10-09 02:00:45 +11:00
Matthew Honnibal	b3a70e6375	* Clean up unnecessary try/except block	2015-10-08 14:34:11 +11:00
Yubing (Tom) Dong	0f601b8b75	Update docstring of Doc.__getitem__	2015-10-07 01:27:28 -07:00
Yubing (Tom) Dong	3fd3bc79aa	Refactor to remove duplicate slicing logic	2015-10-07 01:25:35 -07:00
Yubing (Tom) Dong	97685aecb7	Add slicing support to Span	2015-10-06 02:45:49 -07:00
Yubing (Tom) Dong	ef2af20cd3	Make Doc's slicing behavior conform to Python conventions	2015-10-06 02:41:28 -07:00
Yubing (Tom) Dong	2fc33e8024	Allow step=1 when slicing a Doc	2015-10-06 00:57:05 -07:00
Matthew Honnibal	b228a8f4a6	* Remove spacy/en/attrs	2015-10-06 16:20:46 +11:00
Matthew Honnibal	693677fd8d	* Prepare to remove en/attrx file, now that moving to symbols.pyx	2015-10-06 16:20:13 +11:00
Matthew Honnibal	3d9f41c2c9	* Add LookupError for better error reporting in Vocab	2015-10-06 10:34:59 +11:00
Matthew Honnibal	ecc5281b36	* Remove en/pos.pyx, as the tagger code now lives in spacy/tagger.pyx	2015-10-06 10:12:08 +11:00
alvations	8caedba42a	caught more codecs.open -> io.open	2015-09-30 20:20:09 +02:00
alvations	8199012d26	changing deprecated codecs.open to io.open =)	2015-09-30 20:10:15 +02:00
Matthew Honnibal	87e6186828	* Rename _seq to doc attribute in Span	2015-09-29 23:03:55 +10:00
Matthew Honnibal	ab694b0364	* Fix open-bounded slice indices.	2015-09-29 23:03:09 +10:00
Matthew Honnibal	a6ced80c0c	* Fix Issue #116 : Misleading handling of True value in Language.__init__.	2015-09-29 20:54:12 +10:00
Matthew Honnibal	f9d2a5b651	* Fix issue #112 : Replace unidecode with text-unidecode, to avoid license problems.	2015-09-28 23:40:18 +10:00
Matthew Honnibal	2c33a96ac3	Merge pull request #99 from rw/patch-1 Force SSL for downloading English language data.	2015-09-28 17:46:26 +10:00
Matthew Honnibal	abf0d930af	* Fix API for loading word vectors from a file.	2015-09-23 23:51:08 +10:00
Matthew Honnibal	f5c256745b	Merge branch 'master' of ssh://github.com/honnibal/spaCy	2015-09-22 12:26:24 +10:00
Matthew Honnibal	528e26a506	* Add rule to ensure ordinals are preserved as single tokens	2015-09-22 12:26:05 +10:00
Robert	8711b64860	Force SSL for downloading English language data. It would also be nice to have a checksum for this.	2015-09-21 17:26:01 -07:00
Matthew Honnibal	f7283a5067	* Fix vectors bugs for OOV words	2015-09-22 02:10:25 +02:00
Matthew Honnibal	44aecba701	* Fix Token.has_vector and Lexeme.has_vector	2015-09-22 01:43:16 +02:00
Matthew Honnibal	596fde8daa	* Add has_vector attribute to Token and Lexeme	2015-09-21 19:52:43 +10:00
Matthew Honnibal	f32927efbf	* Raise exceptions if attempt to access parse, but data is not installed. This partly but not fully addresses Issue #97 . Still need exceptions on the various Token attributes that access the parse tree, e.g. token.head, token.lefts, token.rights, etc. Exceptions should be centralized, too.	2015-09-21 18:35:40 +10:00
Matthew Honnibal	388062ae01	* Fix repvec_length problem	2015-09-21 18:10:51 +10:00
Matthew Honnibal	ac459278d1	* Fix vector length error reporting, and ensure vec_len is returned	2015-09-21 18:08:32 +10:00
Matthew Honnibal	ba4e563701	* Ensure vectors are same length, and return vector length in load_vectors_bz2	2015-09-21 18:03:08 +10:00
Matthew Honnibal	d00fe2bbc6	* Don't allow Span objects to be written to, as it introduces subtle bugs because they're created afresh from Doc.sents, Doc.ents etc.	2015-09-21 17:59:39 +10:00
Matthew Honnibal	d6945bf880	* Add way to load vectors from bz2 file to vocab	2015-09-17 12:58:23 +10:00
Matthew Honnibal	77856c4fcd	* Try giving Doc and Span objects vector and vector_norm attributes, and .similarity functions. Turns out to be bad idea.	2015-09-17 11:50:11 +10:00
Matthew Honnibal	191d593e03	* Fix vectors bug in lexeme	2015-09-15 19:05:11 +10:00
Matthew Honnibal	3d87519f64	* Remove vectors argument from Vocab object	2015-09-15 14:47:14 +10:00
Matthew Honnibal	362526b592	* Rename vectors_length attribute	2015-09-15 14:43:31 +10:00
Matthew Honnibal	60c26b2dfa	* Fix slicing when start or stop is None	2015-09-15 14:43:10 +10:00
Matthew Honnibal	7ac6cacc26	* Remove const qualifier on LexemeC.repvec	2015-09-15 14:42:51 +10:00
Matthew Honnibal	dd4d64b235	* Support setting of word vectors on Lexeme object.	2015-09-15 14:42:27 +10:00
Matthew Honnibal	27f988b167	* Remove the vectors option to Vocab, preferring to either load vectors from disk, or set them on the Lexeme objects.	2015-09-15 14:41:48 +10:00
Matthew Honnibal	193f127f81	* Fix ugly py_check_flag and py_set_flag functions in Lexeme	2015-09-15 13:06:18 +10:00
Matthew Honnibal	9561d88529	* Add is_stop to Python API	2015-09-14 18:25:40 +10:00
Matthew Honnibal	65dc0d1dfb	* Extend word vectors support, with .similarity() function, vector_norm property, and rename repvec to vector. Keep repvec name as well for now for backwards compatibility.	2015-09-14 17:49:58 +10:00
Matthew Honnibal	e13e47e9e5	* Add English stop words	2015-09-14 17:48:51 +10:00
Matthew Honnibal	24ed3fc25c	* Check file existence before opening in lemmatizer	2015-09-13 10:45:21 +10:00
Matthew Honnibal	dbb48ce49e	* Delete extra wordnets	2015-09-13 10:31:37 +10:00
Matthew Honnibal	e9c59693ea	* Remove assertion from vocab.pyx	2015-09-13 10:30:08 +10:00
Matthew Honnibal	c08f10083c	* Add test and test_with_ws attributes.	2015-09-13 10:27:42 +10:00
Matthew Honnibal	0b7d2a6c62	* Inc version	2015-09-13 01:26:29 +02:00
Matthew Honnibal	e1dfaeed8a	* Check serializer freqs exist before loading	2015-09-12 23:49:38 +02:00
Matthew Honnibal	a412c66c8c	* Check serializer freqs exist before loading	2015-09-12 23:40:01 +02:00
Matthew Honnibal	631c843ed1	* Don't look for index.adv in lemmatizer	2015-09-12 06:03:44 +02:00
Matthew Honnibal	dfdd4f2d60	Merge branch 'develop' of https://github.com/honnibal/spaCy into develop	2015-09-10 15:23:06 +02:00
Matthew Honnibal	e285ca7d6c	* Load serializer freqs in vocab	2015-09-10 15:22:48 +02:00
Matthew Honnibal	f7fdcce1f9	Merge branch 'develop' of https://github.com/honnibal/spaCy into develop	2015-09-10 14:52:47 +02:00
Matthew Honnibal	85c3fec1d1	* Fix morphology loading	2015-09-10 14:52:23 +02:00
Matthew Honnibal	7c660c5efc	* Use dict.get in lemmatizer	2015-09-10 14:51:39 +02:00
Matthew Honnibal	094440f9f5	Merge branch 'develop' of ssh://github.com/honnibal/spaCy into develop	2015-09-10 14:51:17 +02:00
Matthew Honnibal	c3f773cd63	* Fix Lexeme.check_flag	2015-09-10 14:51:05 +02:00
Matthew Honnibal	90da3a695d	* Load lemmatizer from disk in Vocab.from_dir	2015-09-10 14:49:10 +02:00
Matthew Honnibal	e7e529edf4	* Fix Lexeme.check_flag	2015-09-10 14:45:43 +02:00
Matthew Honnibal	9e7bfe8449	* Fix space at end of merged token	2015-09-10 14:45:17 +02:00
Matthew Honnibal	f634191e27	* Fix vocab read/write	2015-09-10 14:44:38 +02:00
Matthew Honnibal	31ccf494e6	Merge branch 'develop' of https://github.com/honnibal/spaCy into develop	2015-09-09 14:33:38 +02:00
Matthew Honnibal	a7f4b26c8c	* Tmp	2015-09-09 14:33:26 +02:00
Matthew Honnibal	07686470a9	* Don't consider a coordinated NP a base chunk	2015-09-09 14:32:28 +02:00
Matthew Honnibal	d9f1fc2112	* Add deprecation warning for unused load_vectors argument.	2015-09-09 14:31:09 +02:00
Matthew Honnibal	0b527fbdc8	* Set POS tag in morphology	2015-09-09 14:30:24 +02:00
Matthew Honnibal	07c09a0e1b	* Fix attribute getters and setters in Lexeme	2015-09-09 14:29:22 +02:00
Matthew Honnibal	d6561988cf	* Fix lexemes.bin	2015-09-09 11:49:51 +02:00
Matthew Honnibal	c301bebd33	Merge branch 'master' of https://github.com/honnibal/spaCy into develop	2015-09-09 10:55:39 +02:00
Matthew Honnibal	0e24d099a1	* Fix L/R edge bug, by ensuring l_edge and r_edge are preset, and fixing the way the edge update in del_arc. Bugs keep arising here because the edges are absolute positions, where everything else is relative. I'm also not 100% convinced that del_arc is handled correctly. Do we need to update the parents?	2015-09-09 03:40:44 +02:00
Matthew Honnibal	2be3620333	* Save morphological analyses in a cache	2015-09-08 15:39:24 +02:00
Matthew Honnibal	1def5a6cbe	* Fix print statements in matcher	2015-09-08 15:38:19 +02:00
Matthew Honnibal	64d71f8893	* Fix lemmatizer	2015-09-08 15:38:03 +02:00
Matthew Honnibal	623329b19a	Merge branch 'master' of ssh://github.com/honnibal/spaCy into develop	2015-09-08 14:27:01 +02:00
Matthew Honnibal	62a01dd41d	* Fix issue #92 : lexemes.bin read error on 32-bit platforms.	2015-09-08 14:23:58 +02:00
Matthew Honnibal	ef58607a99	* Add spacy.it	2015-09-06 22:10:37 +02:00
Matthew Honnibal	2154a54f6b	* Add spacy.de	2015-09-06 21:56:47 +02:00
Matthew Honnibal	f6ec5bf1b0	* Use empty tag map in vocab if none supplied	2015-09-06 20:19:27 +02:00
Matthew Honnibal	4f8e38271d	* Fix merge errors in lexeme.pxd	2015-09-06 20:19:08 +02:00
Matthew Honnibal	86c888667f	* Merge in changes from de branch	2015-09-06 19:49:28 +02:00
Matthew Honnibal	d2fc104a26	* Begin merge of Gazetteer and DE branches	2015-09-06 19:45:15 +02:00
Matthew Honnibal	dbf8dce109	Merge branch 'gaz' of ssh://github.com/honnibal/spaCy into gaz	2015-09-06 18:44:14 +02:00
Matthew Honnibal	9eae9837c4	* Fix morphology look up	2015-09-06 17:53:39 +02:00
Matthew Honnibal	6427a3fcac	* Temporarily import flag attributes in matcher	2015-09-06 17:53:12 +02:00
Matthew Honnibal	7cc56ada6e	* Temporarily add py_set_flag attribute in Lexeme	2015-09-06 17:52:51 +02:00
Matthew Honnibal	e35bb36be7	* Ensure Lexeme.check_flag returns a boolean value	2015-09-06 17:52:32 +02:00
Matthew Honnibal	7e4fea67d3	* Fix bug in token subtree, introduced by duplication of L/R code in Stateclass. Need to consolidate the two methods.	2015-09-06 10:48:36 +02:00
Matthew Honnibal	5edac11225	* Wrap self.parse in nogil, and break if an invalid move is predicted. The invalid break is a work-around that papers over likely bugs, but we can't easily break in the nogil block, and otherwise we'll get an infinite loop. Need to set this as an error flag.	2015-09-06 04:15:00 +02:00
Matthew Honnibal	fd1eeb3102	* Add POS attribute support in get_attr	2015-09-06 04:13:03 +02:00
Matthew Honnibal	534e3dda3c	* More work on language independent parsing	2015-08-28 03:44:54 +02:00
Matthew Honnibal	c2307fa9ee	* More work on language-generic parsing	2015-08-28 02:02:33 +02:00
Matthew Honnibal	86c4a8e3e2	* Work on new morphology organization	2015-08-27 23:11:51 +02:00
Matthew Honnibal	5b89e2454c	* Improve error-reporting in tagger	2015-08-27 10:26:36 +02:00
Matthew Honnibal	f0a7c99554	* Relax rule-requirement in lemmatizer	2015-08-27 10:26:19 +02:00
Matthew Honnibal	0af139e183	* Tagger training now working. Still need to test load/save of model. Morphology still broken.	2015-08-27 09:16:11 +02:00
Matthew Honnibal	1302d35dff	* Rework interfaces in vocab	2015-08-26 19:21:46 +02:00
Matthew Honnibal	2d521768a3	* Store Morphology class in Vocab	2015-08-26 19:21:03 +02:00

1672 Commits