spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-27 18:36:36 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	e825fd9554	* Make some of the website tests work without models	2016-01-18 18:14:44 +01:00
Matthew Honnibal	334c4b2b57	* Disprefer punctuation and spaces as heads of spans	2016-01-18 18:14:09 +01:00
Matthew Honnibal	bed36ab0ff	* Fix import of HEAD attribute	2016-01-18 17:34:43 +01:00
Matthew Honnibal	28c659c1fe	* Fix import for numpy	2016-01-18 17:25:04 +01:00
Matthew Honnibal	fc36bcf458	* Fix import for English	2016-01-18 17:14:40 +01:00
Matthew Honnibal	cc4c335e14	* Set heads for test_merge_tokens, to make the test run without models	2016-01-18 17:00:11 +01:00
Matthew Honnibal	c107da9738	* Bug fix to _count_words_to_root	2016-01-18 16:59:38 +01:00
Matthew Honnibal	f24833d607	* Fix merge for coordinations	2016-01-18 16:03:19 +01:00
Matthew Honnibal	14534958a9	* Fix bug in Span.root	2016-01-18 15:40:28 +01:00
Matthew Honnibal	714cbc03d5	* Add test for Issue #203 : nested noun chunks.	2016-01-16 18:02:30 +01:00
Matthew Honnibal	4e2253170c	* Move test for doc.merge to tokens_api file, to avoid name conflicts which upset pytest	2016-01-16 18:01:36 +01:00
Matthew Honnibal	34a157511f	* Move test_merge_hang to test_tokens_api	2016-01-16 18:00:26 +01:00
Matthew Honnibal	fc8f26584a	* Don't consider NPs connected to parse via conj relation as noun chunks. Change motivated by the nested noun chunks identified in Issue #203 , but might be problematic. Also allow root NPs to be considered noun chunks.	2016-01-16 17:52:40 +01:00
Matthew Honnibal	4a16dbfeca	* Add test for Issue #203 : noun chunks should be flat, but sometimes are nested	2016-01-16 17:41:25 +01:00
Matthew Honnibal	995b2d18fd	* Route token.string via token.text_with_ws, to deprecate token.string in future	2016-01-16 17:14:34 +01:00
Matthew Honnibal	54a98eaf19	* Fix typo text_wth_ws --> text_with_ws. Reroute .string attribute to text_with_ws, to deprecate .string in future	2016-01-16 17:13:50 +01:00
Matthew Honnibal	3e9961d2c4	* If final token is whitespace, don't mark it as owning a trailing space. Fixes Issue #154	2016-01-16 17:08:59 +01:00
Matthew Honnibal	223d2b3484	* Add test for Issue #154 : Additional whitespace introduced when string ends with a whitespace token.	2016-01-16 17:08:07 +01:00
Matthew Honnibal	3dc398b727	* Fix merge conflict in requirements.txt	2016-01-16 16:20:49 +01:00
Matthew Honnibal	fc5962a77d	* Improve test for root token in Span	2016-01-16 16:19:09 +01:00
Matthew Honnibal	c025a0c64b	* Check for KeyboardInterrupt in parser.__call__	2016-01-16 16:18:44 +01:00
Matthew Honnibal	03e8a4293d	* Add loop guard to Token.lefts and Token.rights properties	2016-01-16 16:18:17 +01:00
Matthew Honnibal	304339985e	* Add a linear scan to Span.root method, to help with long sentences	2016-01-16 16:17:28 +01:00
Matthew Honnibal	aa0dd79f52	* Delete test_token_references, which checked a flakey strategy for preventing orphan tokens from a while ago. Now orphan tokens simply hold a reference to Pool, preventing the memory from being freed underneath them. This means that we don't need to run this slow test.	2016-01-16 16:03:35 +01:00
Matthew Honnibal	8cbcc3a799	* Fix calculation of root token in Span. Now take root to be word with shortest tree path. Avoids parse trees ending up in inconsistent state, as had occurred in Issue #214 .	2016-01-16 15:38:50 +01:00
Matthew Honnibal	c1039fa4b4	* Add test for Issue #214 . Resolved in change to Span.root	2016-01-16 15:37:47 +01:00
Henning Peters	41ea14a56f	fix pickling	2016-01-16 13:23:11 +01:00
Henning Peters	5551052840	fix py2/3 issue	2016-01-16 12:44:53 +01:00
Henning Peters	235f094534	untangle data_path/via	2016-01-16 12:23:45 +01:00
Matthew Honnibal	42a9f29b40	* Add loop guard in Span.root, to raise errors if there is a cycle in the dependency parse, instead of entering an infinite loop. Re Issue #214	2016-01-16 11:53:37 +01:00
Henning Peters	6d1a3af343	cleanup unused	2016-01-16 10:05:04 +01:00
Henning Peters	846fa49b2a	distinct load() and from_package() methods	2016-01-16 10:00:57 +01:00
Henning Peters	211913d689	add about.py, adapt setup.py	2016-01-15 18:57:01 +01:00
Henning Peters	f8a8f97d25	cleanup	2016-01-15 18:13:37 +01:00
Henning Peters	780cb847c9	add default_model to about	2016-01-15 18:07:15 +01:00
Henning Peters	788f734513	refactored data_dir->via, add zip_safe, add spacy.load()	2016-01-15 18:01:02 +01:00
Matthew Honnibal	478a79a3d5	* Add test for Issue #220 : Whitespace being tagged as noun	2016-01-15 16:17:07 +01:00
Henning Peters	d9471f684f	fix typo	2016-01-14 12:14:12 +01:00
Henning Peters	9b75d872b0	fix model download	2016-01-14 12:02:56 +01:00
Henning Peters	bc229790ac	integrate with sputnik	2016-01-13 19:46:17 +01:00
Matthew Honnibal	3fbfba575a	* xfail the contractions test	2015-12-31 13:16:28 +01:00
Matthew Honnibal	3bd910ccad	* Merge therell test	2015-12-31 11:55:18 +01:00
Matthew Honnibal	eaf2ad59f1	* Fix use of mock Package object	2015-12-31 04:13:15 +01:00
Matthew Honnibal	029136a007	* Fix resource loading for Matcher	2015-12-31 02:45:12 +01:00
Matthew Honnibal	55bcdf8bdd	* Fix errors	2015-12-29 22:32:03 +01:00
Matthew Honnibal	a6ba43ecaf	* Fix errors in packaging revision	2015-12-29 18:37:26 +01:00
Matthew Honnibal	4b4eec8b47	* Fix Issue #201 : Tokenization of there'll	2015-12-29 18:09:09 +01:00
Matthew Honnibal	86ee9d046d	* Remove test that belongs to a change for master	2015-12-29 18:07:23 +01:00
Matthew Honnibal	a2dfdec85d	* Clean up spacy.util	2015-12-29 18:06:09 +01:00
Matthew Honnibal	aec130af56	Use util.Package class for io Previous Sputnik integration caused API change: Vocab, Tagger, etc were loaded via a from_package classmethod, that required a sputnik.Package instance. This forced users to first create a sputnik.Sputnik() instance, in order to acquire a Package via sp.pool(). Instead I've created a small file-system shim, util.Package, which allows classes to have a .load() classmethod, that accepts either util.Package objects, or strings. We can later gut the internals of this and make it a proxy for Sputnik if we need more functionality that should live in the Sputnik library. Sputnik is now only used to download and install the data, in spacy.en.download	2015-12-29 18:00:48 +01:00
Matthew Honnibal	0e2498da00	* Replace from_package with load() classmethod in Vocab	2015-12-29 16:56:51 +01:00
Matthew Honnibal	c5902f2b4b	* Upd Lemmatizer to use MockPackage. Replace from_package with load() classmethod	2015-12-29 16:56:02 +01:00
Matthew Honnibal	4131e45543	* Add MockPackage class, to see whether we can proxy for Sputnik in a lightweight way	2015-12-29 16:55:03 +01:00
Matthew Honnibal	f5dea1406d	* Fix silly mistake in Language.__init__	2015-12-28 18:48:57 +01:00
Matthew Honnibal	187960606f	* Fix pickle problems	2015-12-28 16:54:03 +01:00
Matthew Honnibal	8c7e149ec9	* Replace kwargs argument of Language.__init__ with explicit arguments, to fix pickle bug	2015-12-28 15:56:27 +01:00
Henning Peters	32d655b6e1	bump version	2015-12-28 09:34:39 +01:00
Matthew Honnibal	8b61d45ed0	* Fix merge conflicts for headers branch	2015-12-27 17:46:25 +01:00
Matthew Honnibal	6bb9c7f311	Merge pull request #202 from henningpeters/sputnik access model via sputnik	2015-12-28 03:29:53 +11:00
Henning Peters	0e321a7105	get mingw32 to work	2015-12-22 23:25:38 +01:00
Henning Peters	d8d348bb55	allow to specify version constraint within model name	2015-12-18 19:12:08 +01:00
Henning Peters	7f7299cafb	Merge branch 'tmpdir' into headers	2015-12-18 12:25:25 +01:00
Henning Peters	cfa187aaf0	fix tests	2015-12-18 10:58:02 +01:00
Henning Peters	8359bd4d93	strip data/ from package, friendlier Language invocation, make data_dir backward/forward-compatible	2015-12-18 09:52:55 +01:00
Henning Peters	970278a3d6	no need to link data dir anymore	2015-12-18 09:49:45 +01:00
Henning Peters	4f3efb8eaf	avoid writing to /tmp (not cross-platform compatible)	2015-12-16 19:56:40 +01:00
Henning Peters	4ada39f472	avoid writing to /tmp (not cross-platform compatible)	2015-12-16 19:53:06 +01:00
Henning Peters	2d4efe40f9	fix sputnik call	2015-12-13 14:46:08 +01:00
Henning Peters	ac318b568c	new approach to dependency headers	2015-12-13 11:49:17 +01:00
Henning Peters	345dda6f53	small fixes, add package build step	2015-12-07 06:50:26 +01:00
Henning Peters	9027cef3bc	access model via sputnik	2015-12-07 06:01:28 +01:00
Henning Peters	73e5650be5	change index server	2015-11-18 18:09:46 +01:00
Henning Peters	50d15ea5d2	fix	2015-11-18 17:35:21 +01:00
Henning Peters	02a1dcec76	add data dir	2015-11-18 11:48:55 +01:00
Henning Peters	919a4f0b04	change data path, add repository	2015-11-18 11:40:46 +01:00
Henning Peters	12de895e60	fix version	2015-11-15 16:38:16 +01:00
Henning Peters	03d2f98cd5	add sputnik	2015-11-15 15:58:21 +01:00
Matthew Honnibal	ec7d36c3a4	* Add test for matcher end-point problem	2015-11-12 05:00:40 +11:00
Matthew Honnibal	d309622a27	* Add test for matcher end-point problem	2015-11-12 04:59:11 +11:00
Matthew Honnibal	56ea20a886	* Add test for matcher end-point problem	2015-11-12 04:58:53 +11:00
Matthew Honnibal	cfa4062147	* Add test for matcher end-point problem	2015-11-12 04:56:07 +11:00
Matthew Honnibal	5623242b3e	* Adjust NER rules, so that U entries in gazetteer don't become B moves to the model	2015-11-12 04:48:23 +11:00
Matthew Honnibal	d67d7d5a86	* Add test for NER inconsistency bug	2015-11-08 16:19:33 +01:00
Matthew Honnibal	44fbdc7260	* Fix bug in NER transition system, that sometimes left no valid moves	2015-11-08 16:19:12 +01:00
Matthew Honnibal	ab5aac5b2f	* Add .rank property to Token and Lexeme, for frequency rank	2015-11-08 16:18:25 +01:00
Matthew Honnibal	fde9a22ec2	* Add new test for ner	2015-11-08 13:57:15 +01:00
Matthew Honnibal	e92371bb54	* Fix rule that made Last action invalid if there was a preset of O, since if the entity is already open, that ship has sailed.	2015-11-08 22:17:51 +11:00
Matthew Honnibal	3b74739c3e	* Download updated data	2015-11-08 21:24:25 +11:00
Matthew Honnibal	31da42eb27	* Mark tests that require models	2015-11-07 19:27:38 +11:00
Matthew Honnibal	8e26a28616	* Mark tests that require models	2015-11-07 19:10:56 +11:00
Matthew Honnibal	15eab7354f	* Remove extraneous test files	2015-11-07 18:45:13 +11:00
Matthew Honnibal	6f47074214	* Make constructor of ParserModel and TaggerModel the same as AveragedPerceptron, for each pickling.	2015-11-07 18:25:17 +11:00
Matthew Honnibal	1cfa20fb17	* Fix sentence-final whitespace issue	2015-11-07 17:34:46 +11:00
Matthew Honnibal	7663970d5f	* Removed unused i variable from Span, and set attributes to read-only	2015-11-07 17:06:15 +11:00
Matthew Honnibal	4b3c96d76d	* Fix zero-length spans	2015-11-07 17:05:16 +11:00
Matthew Honnibal	888c05a7fa	* Fix variable naming in StepwiseState, for thinc 4.0	2015-11-07 11:02:44 +11:00
Matthew Honnibal	fc2185bfe3	* Fix variable naming in StepwiseState, for thinc 4.0	2015-11-07 10:48:31 +11:00
Matthew Honnibal	954442a807	* Fix variable naming in StepwiseState, for thinc 4.0	2015-11-07 10:30:45 +11:00
Matthew Honnibal	06f26d258e	* Fix test_basic_create	2015-11-07 10:04:37 +11:00
Matthew Honnibal	1d3884c46d	* Fix test_basic_create	2015-11-07 10:03:56 +11:00
Matthew Honnibal	cc8febcbe1	* Fix Span comparison	2015-11-07 09:54:14 +11:00
Matthew Honnibal	af70dc166a	* Fix Last restriction, that was supposed to prevent conflicts with presets, but was incorrect.	2015-11-07 09:52:00 +11:00
Matthew Honnibal	a9b612abdf	* Rework the Span-merge patch, to avoid extending the interface of Doc, and avoid virtualizing the Span.start and Span.end indices, to keep Span usage efficient	2015-11-07 09:01:12 +11:00
Matthew Honnibal	56499d89ef	* Rework the Span-merge patch, to avoid extending the interface of Doc, and avoid virtualizing the Span.start and Span.end indices, to keep Span usage efficient	2015-11-07 08:55:34 +11:00
Andreas Grivas	83ca4e0b93	* use old merge tests - add more	2015-11-07 07:57:04 +11:00
Andreas Grivas	4be7fda453	* span start, end -> properties. autoupdate after merge	2015-11-07 07:57:04 +11:00
Andreas Grivas	562db6d2d0	* merge add lex last - add index finder funcs	2015-11-07 07:57:04 +11:00
Matthew Honnibal	a06e3c8963	* Fix bone-headed mistake in StateClass.E	2015-11-07 07:35:28 +11:00
Matthew Honnibal	d24b8509e4	* Correct screw ups from the previous commits	2015-11-07 06:51:41 +11:00
Matthew Honnibal	5efad178b5	* Set ent tag when close entity	2015-11-07 06:09:25 +11:00
Matthew Honnibal	9285f01d26	* Fix broken StateClass.E tracking	2015-11-07 06:06:39 +11:00
Matthew Honnibal	19136b0e7d	* Add better debug message for illegal move	2015-11-07 05:34:37 +11:00
Matthew Honnibal	2733816b7b	* Fix whitespace	2015-11-07 05:31:06 +11:00
Matthew Honnibal	01ab464383	* Prevent Begin and In moves from applying in NER if we're at the last token of a sentence, as this would mean the entity would span over a sentence boundary. Re Issue #169	2015-11-07 05:30:44 +11:00
Matthew Honnibal	b65633f270	* Fix function that returns nth entity in StateClass. Was only returning the first.	2015-11-07 05:29:11 +11:00
Matthew Honnibal	410b6f9ec1	* Remove deprecated _ml.pyx. We now use the nicer APIs provided by thinc 4.0, and subclass the AveragedPerceptron class.	2015-11-07 05:13:10 +11:00
Matthew Honnibal	3c162dcac3	* Refactor away from the _ml module, to use thinc 4.0. Still some work needs to be done, e.g. to add __reduce__ to the models, more testing, etc.	2015-11-07 03:24:30 +11:00
Matthew Honnibal	9d1b2a103a	* Fix capitalization in lemmatizer	2015-11-06 05:44:35 +11:00
Matthew Honnibal	6ed3aedf79	* Merge vocab changes	2015-11-06 00:48:08 +11:00
Matthew Honnibal	72abbb43fb	* Add type declarations in strings.pyx	2015-11-06 00:47:26 +11:00
Matthew Honnibal	5b2af4864f	* When lemmatizing non-noun, non-verb, non-adj words, output lower-case	2015-11-06 00:45:09 +11:00
Matthew Honnibal	754bf04162	* Remove declaration of Model.update	2015-11-06 00:31:15 +11:00
Matthew Honnibal	e18bdff23a	Merge branch 'master' of ssh://github.com/honnibal/spaCy	2015-11-06 00:26:15 +11:00
Matthew Honnibal	b9991fbd20	* Update to use thinc 3.0	2015-11-06 00:25:59 +11:00
Matthew Honnibal	864a8f45d8	* Use unicode in StringStore.intern, instead of unreliably casting to bytes.	2015-11-05 11:32:19 +00:00
Matthew Honnibal	b18204cd52	* Fix StringStore._realloc, re Issue #155	2015-11-05 11:28:26 +00:00
Matthew Honnibal	f8004c5f65	* Begin upgrading to improved thinc API	2015-11-05 03:53:03 +11:00
Matthew Honnibal	adc7bbd6cf	* Fix name of like_num in default_lex_attrs	2015-11-04 22:02:47 +11:00
Matthew Honnibal	e96faf29e7	* Rename like_number to like_num, to fix inconsistency re Issue #166	2015-11-04 22:01:44 +11:00
Matthew Honnibal	65934b7cd4	* Enforce import of ujson in strings.pyx, because otherwise it's too slow	2015-11-04 00:32:02 +11:00
Matthew Honnibal	1ce5d5602d	* Rename Doc.data to Doc.c	2015-11-04 00:17:13 +11:00
Matthew Honnibal	68f479e821	* Rename Doc.data to Doc.c	2015-11-04 00:15:14 +11:00
Matthew Honnibal	3ddea19b2b	* Rename spans.pyx to span.pyx	2015-11-04 00:14:40 +11:00
Matthew Honnibal	9482d616bc	* Rename spans.pyx to span.pyx	2015-11-03 23:51:05 +11:00
Matthew Honnibal	116da5990a	* Clean up setting of tag in doc.from_bytes	2015-11-03 23:48:57 +11:00
Matthew Honnibal	9ec7b9c454	* Clean up unused Constituent struct.	2015-11-03 23:48:21 +11:00
Matthew Honnibal	1e99fcd413	* Rename .repvec to .vector in C API	2015-11-03 23:47:59 +11:00
Matthew Honnibal	ee3f9ba581	* Fix test of serializer	2015-11-03 19:45:16 +11:00
Matthew Honnibal	d06ba26371	* Fix test of serializer	2015-11-03 19:43:27 +11:00
Matthew Honnibal	4083059650	Merge branch 'master' of https://github.com/honnibal/spaCy	2015-11-03 09:07:19 +01:00
Matthew Honnibal	9e37437ba8	* Fix assign_tag in doc.merge	2015-11-03 19:07:02 +11:00
Matthew Honnibal	dde9e1357c	* Add todo to morphology.lemmatize	2015-11-03 18:54:35 +11:00
Matthew Honnibal	ffedff9e6c	* Remove the archive after download, to save disk space	2015-11-03 18:54:05 +11:00
Matthew Honnibal	85372468e3	* Fix serialize test	2015-11-03 08:51:33 +01:00
Matthew Honnibal	833eb35c57	* Fix tag assignment in doc.from_array	2015-11-03 18:45:54 +11:00
Matthew Honnibal	09664177d7	* Fix tag handling in doc.merge, and assign sent_start when setting heads.	2015-11-03 18:15:52 +11:00
Matthew Honnibal	389a373807	Merge branch 'master' of ssh://github.com/honnibal/spaCy	2015-11-03 18:07:25 +11:00
Matthew Honnibal	3f44b3e43f	* Mark serializer test as requiring models	2015-11-03 18:07:08 +11:00
Matthew Honnibal	25ed7be8f8	Merge branch 'master' of https://github.com/honnibal/spaCy	2015-11-03 07:58:17 +01:00
Matthew Honnibal	604ceac4c6	* Fix morphological assignment in doc.merge()	2015-11-03 17:57:51 +11:00
Matthew Honnibal	5e040855a5	* Ensure morphological features and lemmas are loaded in from_array, re Issue #152	2015-11-03 17:56:50 +11:00
Matthew Honnibal	5668feb235	* Fix pickle test for python3	2015-11-03 04:57:02 +01:00
Matthew Honnibal	6161d2529a	Merge branch 'master' of ssh://github.com/honnibal/spaCy	2015-11-03 13:36:30 +11:00
Matthew Honnibal	5887506f5d	* Don't expect lexemes.bin in Vocab	2015-11-03 13:23:39 +11:00
Matthew Honnibal	f7dd377575	* Adjust conjuncts iterator in Token	2015-11-03 13:23:22 +11:00
Andreas Grivas	d418f00eb1	fixed error when printing unicode	2015-11-02 20:23:18 +02:00
Matthew Honnibal	52fc338001	* Set is_parsed and is_tagged attrs when loading annotations into Doc, re Issue #152	2015-10-28 10:43:22 +11:00
Matthew Honnibal	1c0356e4c2	* Set test file mode to w+t	2015-10-26 22:40:48 +11:00
Matthew Honnibal	0fe98f358b	* Fix mode on text file for Python3 in strings test	2015-10-26 22:25:16 +11:00
Matthew Honnibal	8ba9cf905e	* Fix mode on text file for Python3 in strings test	2015-10-26 21:44:34 +11:00
Matthew Honnibal	a0730699b1	* Fix mode on text file for Python3 in strings test	2015-10-26 21:25:56 +11:00
Matthew Honnibal	725344d349	* Fix tempfile in test	2015-10-26 21:08:18 +11:00
Matthew Honnibal	f11030aadc	* Remove out-dated TODO comment	2015-10-26 12:33:38 +11:00
Matthew Honnibal	a371a1071d	* Save and load word vectors during pickling, re Issue #125	2015-10-26 12:33:04 +11:00
Matthew Honnibal	a824a98312	* Add tests for pickling vectors, re: Issue #125	2015-10-26 12:31:05 +11:00
Matthew Honnibal	314090cc78	* Set vectors length when unpickling vocab, re Issue #125	2015-10-26 12:05:08 +11:00
Matthew Honnibal	4e16f9e435	* Move tests underneath spacy/	2015-10-26 00:07:31 +11:00
Matthew Honnibal	3a6e48e814	Merge pull request #149 from chrisdubois/pickle-patch Add __reduce__ to Tokenizer so that English pickles.	2015-10-25 15:30:31 +11:00
Chris DuBois	dac8fe7bdb	Add __reduce__ to Tokenizer so that English pickles. - Add tests to test_pickle and test_tokenizer that save to tempfiles.	2015-10-23 22:24:03 -07:00
Matthew Honnibal	ff4fe524ee	* Fix exception for python 2	2015-10-23 01:56:13 +02:00
Matthew Honnibal	341a3e85cd	* Upd downloaded data version	2015-10-23 00:56:57 +02:00
Matthew Honnibal	f18fd8c659	* Fix language.py for change in StringStore load API	2015-10-23 03:48:12 +11:00
Matthew Honnibal	23855db3ca	Merge branch 'master' of ssh://github.com/honnibal/spaCy into develop	2015-10-23 03:46:09 +11:00
Matthew Honnibal	4f13849065	Merge pull request #145 from henningpeters/master better error reporting, cleanup	2015-10-23 03:45:47 +11:00
Matthew Honnibal	3be94be0c0	Merge pull request #148 from maxirmx/master Utf8 encoding for lemma_rules.json	2015-10-22 21:46:28 +11:00
Matthew Honnibal	c86bda8d1a	* Fix import of uget	2015-10-22 21:13:56 +11:00
Matthew Honnibal	2348a08481	* Load/dump strings with a json file, instead of the hacky strings file we were using.	2015-10-22 21:13:03 +11:00
Matthew Honnibal	9baf0abd59	* Save vocab after training.	2015-10-22 21:09:14 +11:00
maxirmx	f07e4accd7	Fixing encoding issue #4	2015-10-21 20:45:56 +03:00
maxirmx	fcbfff043f	Fixing encoding issue #3	2015-10-21 15:52:34 +03:00
maxirmx	fe9d2e2c4e	Fixing encode issue #2	2015-10-21 15:36:21 +03:00
maxirmx	e4a1726f77	Fixing encoding issue UTF-8	2015-10-21 14:16:37 +03:00
Andreas Grivas	93ada458e2	added __repr__ that prints text in ipython for doc, token, and span objects	2015-10-21 14:11:46 +03:00
Henning Peters	ccffd2ef53	fixed extract directory	2015-10-21 07:59:34 +02:00
Henning Peters	da4c9cee06	assert filename match	2015-10-20 19:33:59 +02:00
Henning Peters	4f703f0cb4	better error reporting, cleanup	2015-10-20 19:11:29 +02:00
Matthew Honnibal	9cdea6e450	* Import uget correctly	2015-10-19 08:32:41 +02:00
Matthew Honnibal	6727a46bb5	* Fix Issue #118 : Matcher behaves unpredictably when matches overlap.	2015-10-19 16:45:32 +11:00
Matthew Honnibal	135062d23c	* Fix error with merged text when merged region did not have trailing whitespace	2015-10-19 15:47:04 +11:00
Henning Peters	bfde91fa49	add custom download tool (uget), replace wget with uget	2015-10-18 12:35:04 +02:00
Matthew Honnibal	9839cd2c0b	* Fix whitespace_ calculation in Token	2015-10-18 17:21:11 +11:00
Matthew Honnibal	c99285b8b9	* Clean up C++ usage in spacy/matcher.pyx	2015-10-18 17:20:50 +11:00
Matthew Honnibal	a7e6c5ac8f	* Fix Issue #122 : Incorrect calculation of children after Doc.merge()	2015-10-18 17:17:27 +11:00
Matthew Honnibal	3ba66f2dc7	* Add string length cap in Tokenizer.__call__	2015-10-16 04:54:16 +11:00
Matthew Honnibal	6e0f985afc	* Fix token.conjuncts	2015-10-15 03:49:45 +11:00
Matthew Honnibal	2e0104ac81	* Fix token.conjuncts	2015-10-15 03:47:45 +11:00
Matthew Honnibal	b8f3345a82	* Fix token.conjuncts method	2015-10-15 03:36:01 +11:00
Matthew Honnibal	23818f89b8	* Fix token.conjuncts method	2015-10-15 03:34:57 +11:00
Matthew Honnibal	7a15d1b60c	* Add Python 2/3 compatibility fix for copy_reg	2015-10-13 20:04:40 +11:00
Matthew Honnibal	329ae57520	* Fix whitespace attachment thing	2015-10-13 09:46:38 +02:00
Matthew Honnibal	37919eac82	* Fix whitespace attachment in simpler way. Leaves problem with setting left/right children.	2015-10-13 18:23:24 +11:00
Matthew Honnibal	c70eb776ae	* Fix whitespace attachment, so that left/right children are consistent with head.	2015-10-13 15:58:22 +11:00
Matthew Honnibal	531182f937	* Fix Model.__reduce__	2015-10-13 15:14:38 +11:00
Matthew Honnibal	6c227a6c1f	* Fix Model.__reduce__	2015-10-13 15:10:04 +11:00
Matthew Honnibal	358c82595c	* Fix NAMES list in spacy/parts_of_speech.pyx	2015-10-13 14:18:45 +11:00
Matthew Honnibal	c1fdc487bc	Merge branch 'attrs'	2015-10-13 14:03:41 +11:00
Matthew Honnibal	e886e6a406	* Inc version	2015-10-13 13:46:17 +11:00
Matthew Honnibal	20fd36a0f7	* Very scrappy, likely buggy first-cut pickle implementation, to work on Issue #125 : allow pickle for Apache Spark. The current implementation sends stuff to temp files, and does almost nothing to ensure all modifiable state is actually preserved. The Language() instance is a deep tree of extension objects, and if pickling during training, some of the C-data state is hard to preserve.	2015-10-13 13:44:41 +11:00
Matthew Honnibal	f8de403483	* Work on pickling Vocab instances. The current implementation is not correct, but it may serve to see whether this approach is workable. Pickling is necessary to address Issue #125	2015-10-13 13:44:41 +11:00
Matthew Honnibal	85e7944572	* Start trying to pickle Vocab	2015-10-13 13:44:41 +11:00
Matthew Honnibal	5ca57bd859	* Ensure Morphology can be pickled, to address Issue #125 .	2015-10-13 13:44:41 +11:00
Matthew Honnibal	0cee928467	* Allow StringStore to be pickled, to start addressing Issue #125	2015-10-13 13:44:41 +11:00
Matthew Honnibal	41012907a8	* Fix variable name	2015-10-13 13:44:40 +11:00
Matthew Honnibal	e70368d157	* Use lower case strings for dependency label names in symbols enum	2015-10-13 13:44:40 +11:00
Matthew Honnibal	7b4af3d1e7	* Fix parts_of_speech now that symbols list has been reformed	2015-10-13 13:44:40 +11:00
Matthew Honnibal	37b909b6b6	* Use the symbols file in vocab instead of the symbols subfiles like attrs.pxd	2015-10-13 13:44:40 +11:00
Matthew Honnibal	ce65ec698c	* Remove qualified naming in symbols	2015-10-13 13:44:40 +11:00
Matthew Honnibal	9f4be0adcd	* Map NO_TAG to NIL in parts_of_speech.pxd	2015-10-13 13:44:40 +11:00
Matthew Honnibal	278e12f7e8	* Add morphology symbols to morphology. May need to remove these as an enum.	2015-10-13 13:44:40 +11:00
Matthew Honnibal	d80067eda1	* Map empty string to NULL_ATTR in attrs	2015-10-13 13:44:40 +11:00
Matthew Honnibal	d70e8cac2c	* Fix empty values in attributes and parts of speech, so symbols align correctly with the StringStore	2015-10-13 13:44:40 +11:00
Matthew Honnibal	a29c8ee23d	* Add symbols to the vocab before reading the strings, so that they line up correctly	2015-10-13 13:44:39 +11:00
Matthew Honnibal	74c0853471	* Rename ATTR_IDS to attrs.IDS. Rename ATTR_NAMES to attrs.NAMES. Rename UNIV_POS_IDS to parts_of_speech.IDS	2015-10-13 13:44:39 +11:00
Matthew Honnibal	10a4a843ea	* Enumerate all symbols in one file	2015-10-13 13:44:39 +11:00
Matthew Honnibal	85ce36ab11	* Refactor symbols, so that frequency rank can be derived from the orth id of a word.	2015-10-13 13:44:39 +11:00
Matthew Honnibal	dfbcff2ff1	* Revert codecs/io change to strings.pyx, as it seemed to cause an error? Will investigate.	2015-10-10 15:54:55 +11:00
Matthew Honnibal	9dd2f25c74	* Fix Issue #131 : Force whitespace characters to attach syntactically to previous token, and ensure they cannot serve as stand-alone 'sentence' units.	2015-10-10 15:53:30 +11:00
Matthew Honnibal	8b39feefbe	* Add dependency post-process rule to ensure spaces are attached to neighbouring tokens, so that they can't be sentence boundaries	2015-10-10 15:32:13 +11:00
Matthew Honnibal	2153067958	* Fix use of io in strings.pyx	2015-10-10 15:03:12 +11:00
Matthew Honnibal	ec874247b5	Merge branch 'master' of ssh://github.com/honnibal/spaCy	2015-10-10 14:23:51 +11:00
Matthew Honnibal	30de4135c9	* Fix merge problem	2015-10-10 14:22:32 +11:00
Matthew Honnibal	dc393a5f1d	Merge pull request #126 from tomtung/master Improve slicing support for both Doc and Span	2015-10-10 14:14:57 +11:00
Matthew Honnibal	83dccf0fd7	* Use io module instead of deprecated codecs module	2015-10-10 14:13:01 +11:00
Matthew Honnibal	a3dfe2b901	* Increment data version	2015-10-09 13:26:17 +02:00
Matthew Honnibal	2d9e5bf566	* Allow punctuation to be lemmatized	2015-10-09 19:02:42 +11:00
Matthew Honnibal	5332c0b697	* Add support for punctuation lemmatization, to handle unicode characters. This should help in addressing Issue #130	2015-10-09 18:54:40 +11:00
Yubing (Tom) Dong	9a6811acc4	Merge remote-tracking branch 'upstream/master'	2015-10-08 22:53:02 -07:00
Matthew Honnibal	b125289f30	* Fix type declaration in asciied function	2015-10-09 13:46:57 +11:00
Matthew Honnibal	801d55a6d9	* Fix phrase matcher	2015-10-09 02:00:45 +11:00
Matthew Honnibal	b3a70e6375	* Clean up unnecessary try/except block	2015-10-08 14:34:11 +11:00
Yubing (Tom) Dong	0f601b8b75	Update docstring of Doc.__getitem__	2015-10-07 01:27:28 -07:00
Yubing (Tom) Dong	3fd3bc79aa	Refactor to remove duplicate slicing logic	2015-10-07 01:25:35 -07:00
Yubing (Tom) Dong	97685aecb7	Add slicing support to Span	2015-10-06 02:45:49 -07:00
Yubing (Tom) Dong	ef2af20cd3	Make Doc's slicing behavior conform to Python conventions	2015-10-06 02:41:28 -07:00
Yubing (Tom) Dong	2fc33e8024	Allow step=1 when slicing a Doc	2015-10-06 00:57:05 -07:00
Matthew Honnibal	b228a8f4a6	* Remove spacy/en/attrs	2015-10-06 16:20:46 +11:00
Matthew Honnibal	693677fd8d	* Prepare to remove en/attrs file, now that moving to symbols.pyx	2015-10-06 16:20:13 +11:00
Matthew Honnibal	3d9f41c2c9	* Add LookupError for better error reporting in Vocab	2015-10-06 10:34:59 +11:00
Matthew Honnibal	ecc5281b36	* Remove en/pos.pyx, as the tagger code now lives in spacy/tagger.pyx	2015-10-06 10:12:08 +11:00
alvations	8caedba42a	caught more codecs.open -> io.open	2015-09-30 20:20:09 +02:00
alvations	8199012d26	changing deprecated codecs.open to io.open =)	2015-09-30 20:10:15 +02:00
Matthew Honnibal	87e6186828	* Rename _seq to doc attribute in Span	2015-09-29 23:03:55 +10:00
Matthew Honnibal	ab694b0364	* Fix open-bounded slice indices.	2015-09-29 23:03:09 +10:00
Matthew Honnibal	a6ced80c0c	* Fix Issue #116 : Misleading handling of True value in Language.__init__.	2015-09-29 20:54:12 +10:00
Matthew Honnibal	f9d2a5b651	* Fix issue #112 : Replace unidecode with text-unidecode, to avoid license problems.	2015-09-28 23:40:18 +10:00
Matthew Honnibal	2c33a96ac3	Merge pull request #99 from rw/patch-1 Force SSL for downloading English language data.	2015-09-28 17:46:26 +10:00
Matthew Honnibal	abf0d930af	* Fix API for loading word vectors from a file.	2015-09-23 23:51:08 +10:00
Matthew Honnibal	f5c256745b	Merge branch 'master' of ssh://github.com/honnibal/spaCy	2015-09-22 12:26:24 +10:00
Matthew Honnibal	528e26a506	* Add rule to ensure ordinals are preserved as single tokens	2015-09-22 12:26:05 +10:00
Robert	8711b64860	Force SSL for downloading English language data. It would also be nice to have a checksum for this.	2015-09-21 17:26:01 -07:00
Matthew Honnibal	f7283a5067	* Fix vectors bugs for OOV words	2015-09-22 02:10:25 +02:00
Matthew Honnibal	44aecba701	* Fix Token.has_vector and Lexeme.has_vector	2015-09-22 01:43:16 +02:00
Matthew Honnibal	596fde8daa	* Add has_vector attribute to Token and Lexeme	2015-09-21 19:52:43 +10:00
Matthew Honnibal	f32927efbf	* Raise exceptions if attempt to access parse, but data is not installed. This partly but not fully addresses Issue #97 . Still need exceptions on the various Token attributes that access the parse tree, e.g. token.head, token.lefts, token.rights, etc. Exceptions should be centralized, too.	2015-09-21 18:35:40 +10:00
Matthew Honnibal	388062ae01	* Fix repvec_length problem	2015-09-21 18:10:51 +10:00
Matthew Honnibal	ac459278d1	* Fix vector length error reporting, and ensure vec_len is returned	2015-09-21 18:08:32 +10:00
Matthew Honnibal	ba4e563701	* Ensure vectors are same length, and return vector length in load_vectors_bz2	2015-09-21 18:03:08 +10:00
Matthew Honnibal	d00fe2bbc6	* Don't allow Span objects to be written to, as it introduces subtle bugs because they're created afresh from Doc.sents, Doc.ents etc.	2015-09-21 17:59:39 +10:00
Matthew Honnibal	d6945bf880	* Add way to load vectors from bz2 file to vocab	2015-09-17 12:58:23 +10:00
Matthew Honnibal	77856c4fcd	* Try giving Doc and Span objects vector and vector_norm attributes, and .similarity functions. Turns out to be bad idea.	2015-09-17 11:50:11 +10:00
Matthew Honnibal	191d593e03	* Fix vectors bug in lexeme	2015-09-15 19:05:11 +10:00
Matthew Honnibal	3d87519f64	* Remove vectors argument from Vocab object	2015-09-15 14:47:14 +10:00
Matthew Honnibal	362526b592	* Rename vectors_length attribute	2015-09-15 14:43:31 +10:00
Matthew Honnibal	60c26b2dfa	* Fix slicing when start or stop is None	2015-09-15 14:43:10 +10:00
Matthew Honnibal	7ac6cacc26	* Remove const qualifier on LexemeC.repvec	2015-09-15 14:42:51 +10:00
Matthew Honnibal	dd4d64b235	* Support setting of word vectors on Lexeme object.	2015-09-15 14:42:27 +10:00
Matthew Honnibal	27f988b167	* Remove the vectors option to Vocab, preferring to either load vectors from disk, or set them on the Lexeme objects.	2015-09-15 14:41:48 +10:00
Matthew Honnibal	193f127f81	* Fix ugly py_check_flag and py_set_flag functions in Lexeme	2015-09-15 13:06:18 +10:00
Matthew Honnibal	9561d88529	* Add is_stop to Python API	2015-09-14 18:25:40 +10:00
Matthew Honnibal	65dc0d1dfb	* Extend word vectors support, with .similarity() function, vector_norm property, and rename repvec to vector. Keep repvec name as well for now for backwards compatibility.	2015-09-14 17:49:58 +10:00
Matthew Honnibal	e13e47e9e5	* Add English stop words	2015-09-14 17:48:51 +10:00
Matthew Honnibal	24ed3fc25c	* Check file existence before opening in lemmatizer	2015-09-13 10:45:21 +10:00
Matthew Honnibal	dbb48ce49e	* Delete extra wordnets	2015-09-13 10:31:37 +10:00
Matthew Honnibal	e9c59693ea	* Remove assertion from vocab.pyx	2015-09-13 10:30:08 +10:00
Matthew Honnibal	c08f10083c	* Add text and text_with_ws attributes.	2015-09-13 10:27:42 +10:00
Matthew Honnibal	0b7d2a6c62	* Inc version	2015-09-13 01:26:29 +02:00
Matthew Honnibal	e1dfaeed8a	* Check serializer freqs exist before loading	2015-09-12 23:49:38 +02:00
Matthew Honnibal	a412c66c8c	* Check serializer freqs exist before loading	2015-09-12 23:40:01 +02:00
Matthew Honnibal	631c843ed1	* Don't look for index.adv in lemmatizer	2015-09-12 06:03:44 +02:00
Matthew Honnibal	dfdd4f2d60	Merge branch 'develop' of https://github.com/honnibal/spaCy into develop	2015-09-10 15:23:06 +02:00
Matthew Honnibal	e285ca7d6c	* Load serializer freqs in vocab	2015-09-10 15:22:48 +02:00
Matthew Honnibal	f7fdcce1f9	Merge branch 'develop' of https://github.com/honnibal/spaCy into develop	2015-09-10 14:52:47 +02:00
Matthew Honnibal	85c3fec1d1	* Fix morphology loading	2015-09-10 14:52:23 +02:00
Matthew Honnibal	7c660c5efc	* Use dict.get in lemmatizer	2015-09-10 14:51:39 +02:00
Matthew Honnibal	094440f9f5	Merge branch 'develop' of ssh://github.com/honnibal/spaCy into develop	2015-09-10 14:51:17 +02:00
Matthew Honnibal	c3f773cd63	* Fix Lexeme.check_flag	2015-09-10 14:51:05 +02:00
Matthew Honnibal	90da3a695d	* Load lemmatizer from disk in Vocab.from_dir	2015-09-10 14:49:10 +02:00
Matthew Honnibal	e7e529edf4	* Fix Lexeme.check_flag	2015-09-10 14:45:43 +02:00
Matthew Honnibal	9e7bfe8449	* Fix space at end of merged token	2015-09-10 14:45:17 +02:00
Matthew Honnibal	f634191e27	* Fix vocab read/write	2015-09-10 14:44:38 +02:00
1608 Commits