spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-13 13:17:06 +03:00

Author	SHA1	Message	Date
Ines Montani	99d66d613a	Modernise tests for merging spans and don't depend on models	2017-01-12 12:26:26 +01:00
Ines Montani	fa8f67596d	Remove unused old test	2017-01-12 12:26:08 +01:00
Ines Montani	359f73a96b	Move test for #54 to regression tests	2017-01-12 12:25:51 +01:00
Ines Montani	3f3a46722c	Remove unused conftest	2017-01-12 12:25:24 +01:00
Ines Montani	c2406e92bc	Allow setting ents in get_doc	2017-01-12 12:25:10 +01:00
Ines Montani	c5914c6fe5	Fix and pass regression test for #736	2017-01-12 11:48:56 +01:00
Matthew Honnibal	4e48862fa8	Remove print statement	2017-01-12 11:25:39 +01:00
Matthew Honnibal	d1d8214767	Increment version	2017-01-12 11:21:57 +01:00
Matthew Honnibal	fba67fa342	Fix Issue #736: Times were being tokenized with incorrect string values.	2017-01-12 11:21:01 +01:00
Ines Montani	a6790b6694	Rename tags to pos in get_doc and allow adding tags to tokens	2017-01-12 11:18:36 +01:00
Ines Montani	1add8ace67	Merge lemmatizer tests	2017-01-12 11:16:53 +01:00
Ines Montani	3bc082abdf	Modernise morph exceptions test and don't depend on models	2017-01-12 11:14:29 +01:00
Ines Montani	ec7739b76e	Add regression test for #736	2017-01-12 11:12:44 +01:00
Ines Montani	6c1c564891	Move language-specific tests out of redundant tokenizer directories	2017-01-12 02:17:18 +01:00
Ines Montani	8fecedac3a	Tidy up	2017-01-12 02:16:37 +01:00
Ines Montani	ae7edd30e7	Move text file back to tokenizer tests directory	2017-01-12 02:10:23 +01:00
Ines Montani	ffcaba9017	Remove old and/or redundant tests	2017-01-12 02:10:18 +01:00
Ines Montani	19c4132097	Modernise space attachment parser tests and don't depend on models	2017-01-12 01:54:44 +01:00
Ines Montani	69778924c8	Modernise and merge parser tests and don't depend on models	2017-01-12 01:07:29 +01:00
Ines Montani	178c147612	Modernise nonprojectivity tests and don't depend on models	2017-01-12 01:06:36 +01:00
Ines Montani	1a3984742c	Modernise sentence boundary detection tests and don't depend on models (where possible)	2017-01-11 23:53:08 +01:00
Ines Montani	0cdb6ea61d	Remove old unused pickle test	2017-01-11 23:52:28 +01:00
Ines Montani	c9671329dc	Move test for #309 to regression tests	2017-01-11 23:52:13 +01:00
Ines Montani	d0e37b5670	Modernise parser tests and don't depend on models	2017-01-11 21:30:27 +01:00
Ines Montani	342cb41782	Add apply_transition_sequence util function to utils	2017-01-11 21:30:14 +01:00
Ines Montani	09807addff	Add en_parser fixture	2017-01-11 21:29:59 +01:00
Ines Montani	55d151aa61	Modernise Doc parse tree navigation tests and don't depend on models	2017-01-11 21:14:15 +01:00
Ines Montani	7262421bb2	Use consistent test names	2017-01-11 19:00:52 +01:00
Ines Montani	33800c9367	Rename "tokens" tests to "doc"	2017-01-11 18:59:01 +01:00
Ines Montani	3a9c6a9563	Remove old unused files	2017-01-11 18:58:38 +01:00
Ines Montani	8e962de39f	Remove old word vector tests	2017-01-11 18:55:08 +01:00
Ines Montani	e027936920	Modernise Doc noun chunks tests	2017-01-11 18:54:56 +01:00
Ines Montani	439f396acd	Modernise Doc array tests and don't depend on models	2017-01-11 18:54:46 +01:00
Ines Montani	05447be884	Modernise test for adding entities	2017-01-11 18:54:24 +01:00
Ines Montani	6e883f4c00	Modernise Doc API tests and don't depend on models	2017-01-11 18:05:36 +01:00
Ines Montani	8bf3bb5c44	Make words optional for get_doc	2017-01-11 18:05:10 +01:00
Ines Montani	928db7e419	Fix StringIO import for Python 3	2017-01-11 14:07:48 +01:00
Ines Montani	69998f216b	Rename test_tokens_api.py to test_doc_api.py	2017-01-11 13:58:56 +01:00
Ines Montani	d94dea1b18	Merge token tests into token API tests	2017-01-11 13:57:02 +01:00
Ines Montani	eb23424ab0	Modernise token API tests and don't depend on loading models	2017-01-11 13:56:54 +01:00
Ines Montani	c682b8ca90	Merge conftests into one cohesive file	2017-01-11 13:56:32 +01:00
Ines Montani	909f24d7df	Add test utils and get_doc helper function Create Doc object from given vocab, words and annotations to allow tests not to depend on loading the models.	2017-01-11 13:55:33 +01:00
Matthew Honnibal	e12c90e03f	Merge branch 'master' of ssh://github.com/explosion/spaCy	2017-01-11 13:03:51 +01:00
Matthew Honnibal	12cd27b821	Amend 8ae8b443f: Handle comparison with None tokens.	2017-01-11 13:03:32 +01:00
Daniel Hershcovich	8e603cc917	Avoid "True if ... else False"	2017-01-11 11:18:22 +02:00
Matthew Honnibal	44e2b0100d	Support TAG attribute in doc.from_array	2017-01-10 22:47:07 +01:00
Ines Montani	3e6e1f0251	Tidy up regression tests	2017-01-10 19:24:10 +01:00
Magnus Burton	aad23ab0b4	Supplemented with capitalized Swedish exceptions	2017-01-10 16:07:20 +01:00
Ines Montani	869963c3c4	Mark extensive prefix/suffix tests as slow	2017-01-10 15:57:35 +01:00
Ines Montani	487e020ebe	Add simple test for surrounding brackets	2017-01-10 15:57:26 +01:00
Ines Montani	0ba5cf51d2	Assert length first	2017-01-10 15:57:00 +01:00
Ines Montani	2185d31907	Adjust names and formatting	2017-01-10 15:56:35 +01:00
Ines Montani	e10d4ca964	Remove semi-redundant URLs and punctuation for faster testing	2017-01-10 15:54:25 +01:00
Ines Montani	3a3cb2c90c	Add unicode declaration	2017-01-10 15:53:15 +01:00
Matthew Honnibal	0f9b8a00a5	Unbreak data download	2017-01-09 23:40:26 +01:00
Matthew Honnibal	8ae8b443f1	Add richcmp method to Token. Closes #631	2017-01-09 19:30:31 +01:00
Matthew Honnibal	64f747cb65	Token comparison test	2017-01-09 19:12:00 +01:00
Matthew Honnibal	18c3c2d05c	Add tests for token comparison, re Issue #631	2017-01-09 19:09:59 +01:00
Matthew Honnibal	97a1286129	Revert changes to tagger and parser for thinc 6	2017-01-09 10:08:34 -06:00
Matthew Honnibal	95a52005df	Revert "Fix Issue #683: Add 'SP' to tag_map, if it's not there already, within the Morphology class." This reverts commit `40e71586d6`.	2017-01-09 09:55:55 -06:00
Ines Montani	363f09e68c	Merge pull request #726 from magnusburton/master Added Swedish abbreviations as token exceptions	2017-01-09 14:58:15 +01:00
Matthew Honnibal	42cd598f57	Use correct fixtures in URL tokenizer	2017-01-09 14:10:40 +01:00
Matthew Honnibal	d9a77ddf14	Return None for data path if it doesn't exist	2017-01-09 14:10:05 +01:00
Matthew Honnibal	e4862d1dab	Merge branch 'develop'	2017-01-09 13:36:01 +01:00
Ines Montani	aa876884f0	Revert "Revert "Merge remote-tracking branch 'origin/master'"" This reverts commit `fb9d3bb022`.	2017-01-09 13:28:13 +01:00
Ines Montani	d5c72c40eb	Remove old tests for old website example code	2017-01-08 22:28:53 +01:00
Ines Montani	eef94e3ee2	Split off period after two or more uppercase letters (fixes #483)	2017-01-08 22:28:25 +01:00
Ines Montani	a89a6000e5	Remove unused import	2017-01-08 22:17:37 +01:00
Ines Montani	5d28664fc5	Don't test Hungarian for numbers and hyphens for now Reinvestigate behaviour of case affixes given reorganised tokenizer patterns.	2017-01-08 20:45:40 +01:00
Ines Montani	53362b6b93	Reorganise Hungarian prefixes/suffixes/infixes Use global prefixes and suffixes for non-language-specific rules, import list of alpha unicode characters and adjust regexes.	2017-01-08 20:40:33 +01:00
Ines Montani	347c4a2d06	Reorganise and reformat global tokenizer prefixes, suffixes and infixes	2017-01-08 20:37:39 +01:00
Ines Montani	0dec90e9f7	Use global abbreviation data languages and remove duplicates	2017-01-08 20:36:00 +01:00
Ines Montani	7c3cb2a652	Add global abbreviations data	2017-01-08 20:34:03 +01:00
Ines Montani	de5aa92bc2	Handle deprecated tokenizer prefix data	2017-01-08 20:33:28 +01:00
Ines Montani	abb09782f9	Move sun.txt to original location and fix path to not break parser tests	2017-01-08 20:32:54 +01:00
Ines Montani	cab39c59c5	Add missing contractions to English tokenizer exceptions Inspired by https://github.com/kootenpv/contractions/blob/master/contractions/__init__.py	2017-01-05 19:59:06 +01:00
Ines Montani	a23504fe07	Move abbreviations below other exceptions	2017-01-05 19:58:07 +01:00
Ines Montani	7d2cf934b9	Generate he/she/it correctly with 's instead of 've	2017-01-05 19:57:00 +01:00
Ines Montani	8328925e1f	Add newlines to long German text	2017-01-05 18:13:30 +01:00
Ines Montani	55b46d7cf6	Add tokenizer tests for German	2017-01-05 18:11:25 +01:00
Ines Montani	5bb4081f52	Remove redundant test_tokenizer.py for English	2017-01-05 18:11:11 +01:00
Ines Montani	8216ba599b	Add tests for longer and mixed English texts	2017-01-05 18:11:04 +01:00
Ines Montani	65f937d5c6	Move basic contraction tests to test_contractions.py	2017-01-05 18:09:53 +01:00
Ines Montani	bbe7cab3a1	Move non-English-specific tests back to general tokenizer tests	2017-01-05 18:09:29 +01:00
Ines Montani	038002d616	Reformat HU tokenizer tests and adapt to general style Improve readability of test cases and add conftest.py with fixture	2017-01-05 18:06:44 +01:00
Ines Montani	bc911322b3	Move ") to emoticons (see Tweebo challenge test)	2017-01-05 18:05:38 +01:00
Ines Montani	637f785036	Add general sanity tests for all tokenizers	2017-01-05 16:25:38 +01:00
Ines Montani	c5f2dc15de	Move English tokenizer tests to directory /en	2017-01-05 16:25:04 +01:00
Ines Montani	8b45363b4d	Modernize and merge general tokenizer tests	2017-01-05 13:17:05 +01:00
Ines Montani	02cfda48c9	Modernize and merge tokenizer tests for string loading	2017-01-05 13:16:55 +01:00
Ines Montani	a11f684822	Modernize and merge tokenizer tests for whitespace	2017-01-05 13:16:33 +01:00
Ines Montani	8b284fc6f1	Modernize and merge tokenizer tests for text from file	2017-01-05 13:15:52 +01:00
Ines Montani	2c2e878653	Modernize and merge tokenizer tests for punctuation	2017-01-05 13:14:16 +01:00
Ines Montani	8a74129cdf	Modernize and merge tokenizer tests for prefixes/suffixes/infixes	2017-01-05 13:13:12 +01:00
Ines Montani	0e65dca9a5	Modernize and merge tokenizer tests for exception and emoticons	2017-01-05 13:11:31 +01:00
Ines Montani	34c47bb20d	Fix formatting	2017-01-05 13:10:51 +01:00
Ines Montani	2e72683baa	Add missing docstrings	2017-01-05 13:10:21 +01:00
Ines Montani	da10a049a6	Add unicode declarations	2017-01-05 13:09:48 +01:00
Ines Montani	58adae8774	Remove unused file	2017-01-05 13:09:22 +01:00
Ines Montani	c6e5a5349d	Move regression test for #360 into own file	2017-01-04 00:49:31 +01:00
Ines Montani	8279993a6f	Modernize and merge tokenizer tests for punctuation	2017-01-04 00:49:20 +01:00
Ines Montani	550630df73	Update tokenizer tests for contractions	2017-01-04 00:48:42 +01:00
Ines Montani	109f202e8f	Update conftest fixture	2017-01-04 00:48:21 +01:00
Ines Montani	ee6b49b293	Modernize tokenizer tests for emoticons	2017-01-04 00:47:59 +01:00
Ines Montani	f09b5a5dfd	Modernize tokenizer tests for infixes	2017-01-04 00:47:42 +01:00
Ines Montani	59059fed27	Move regression test for #351 to own file	2017-01-04 00:47:11 +01:00
Ines Montani	667051375d	Modernize tokenizer tests for whitespace	2017-01-04 00:46:35 +01:00
Ines Montani	aafc894285	Modernize tokenizer tests for contractions Use @pytest.mark.parametrize.	2017-01-03 23:02:21 +01:00
Ines Montani	1d237664af	Add lowercase lemma to tokenizer exceptions	2017-01-03 23:02:21 +01:00
Ines Montani	84a87951eb	Fix typos	2017-01-03 18:27:43 +01:00
Ines Montani	35b39f53c3	Reorganise English tokenizer exceptions (as discussed in #718) Add logic to generate exceptions that follow a consistent pattern (like verbs and pronouns) and allow certain tokens to be excluded explicitly.	2017-01-03 18:26:09 +01:00
Ines Montani	fb9d3bb022	Revert "Merge remote-tracking branch 'origin/master'" This reverts commit `d3b181cdf1`, reversing changes made to `b19cfcc144`.	2017-01-03 18:21:36 +01:00
Ines Montani	461cbb99d8	Revert "Reorganise English tokenizer exceptions (as discussed in #718)" This reverts commit `b19cfcc144`.	2017-01-03 18:21:29 +01:00
Ines Montani	d3b181cdf1	Merge remote-tracking branch 'origin/master' # Conflicts: # spacy/en/tokenizer_exceptions.py	2017-01-03 18:20:01 +01:00
Ines Montani	b19cfcc144	Reorganise English tokenizer exceptions (as discussed in #718) Add logic to generate exceptions that follow a consistent pattern (like verbs and pronouns) and allow certain tokens to be excluded explicitly.	2017-01-03 18:17:57 +01:00
Ines Montani	1bd53bbf89	Fix typos (resolves #718)	2017-01-03 11:26:21 +01:00
Matthew Honnibal	fde53be3b4	Move whole token match inside _split_affixes.	2016-12-30 17:11:50 -06:00
Matthew Honnibal	3ba7c167a8	Fix URL tests	2016-12-30 17:10:08 -06:00
Matthew Honnibal	9936a1b9b5	Merge branch 'tokenization_w_exception_patterns' of https://github.com/oroszgy/spaCy.hu into oroszgy-tokenization_w_exception_patterns	2016-12-30 14:53:40 -06:00
Magnus Burton	56e2219b65	Added Swedish city abbreviations	2016-12-30 21:17:34 +01:00
Magnus Burton	e935c950d8	Added months and days as abbreviations for Swedish	2016-12-30 21:08:44 +01:00
kengz	73a38bd4d1	Merge remote-tracking branch 'upstream/master'	2016-12-30 12:19:59 -05:00
kengz	da44183ae1	move parse_tree logic to a new tokens/printers.py file	2016-12-30 12:19:18 -05:00
Matthew Honnibal	3e8d9c772e	Test interaction of token_match and punctuation Check that the new token_match function applies after punctuation is split off.	2016-12-31 00:52:17 +11:00
Matthew Honnibal	74b921f394	Merge branch 'master' of ssh://github.com/explosion/spaCy into develop	2016-12-30 14:38:27 +01:00
Matthew Honnibal	623d94e14f	Whitespace	2016-12-31 00:30:28 +11:00
Matthew Honnibal	af81ac8bb0	Use thinc 6.0	2016-12-29 11:58:42 +01:00
Petter Hohle	f112e7754e	Add PART to tag map 16 of the 17 PoS tags in the UD tag set is added; PART is missing.	2016-12-28 18:39:01 +01:00
Matthew Honnibal	f62db78dc3	Increment version	2016-12-27 21:11:22 +01:00
Matthew Honnibal	cade536d1e	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-12-27 21:04:10 +01:00
Matthew Honnibal	ce4539dafd	Allow the vocabulary to grow to 10,000, to prevent cold-start problem.	2016-12-27 21:03:45 +01:00
Ines Montani	ad3669cef5	Merge pull request #703 from magnusburton/master Added Swedish abbreviations	2016-12-27 01:01:49 +01:00
Ines Montani	78f754dd9a	Merge pull request #705 from oroszgy/hu_tokenizer Initial support for Hungarian	2016-12-27 00:48:13 +01:00
Ines Montani	8785706039	Reformat stop words for better readability	2016-12-24 00:58:40 +01:00
Gyorgy Orosz	45e045a87b	Unicode/UTF8 compatibility for Python2	2016-12-24 00:21:00 +01:00
Gyorgy Orosz	72b61b6d03	Typo fix.	2016-12-24 00:10:29 +01:00
Gyorgy Orosz	3a9be4d485	Updated token exception handling mechanism to allow the usage of arbitrary functions as token exception matchers.	2016-12-23 23:49:34 +01:00
Ines Montani	1436b9f15a	Fix formatting and consistency	2016-12-23 21:36:01 +01:00
Ines Montani	1d64527727	Update Spanish tokenizer Remove reflexive pronouns as they're part of an open class, fix mistakes and add exceptions	2016-12-23 21:36:01 +01:00
Ines Montani	7f411fd01c	Remove exceptions containing whitespace / no special chars	2016-12-23 14:30:06 +01:00
Magnus Burton	fdf4776262	Added Swedish abbreviations	2016-12-22 22:45:18 +01:00
Gyorgy Orosz	d9c59c4751	Maintaining backward compatibility.	2016-12-21 23:30:49 +01:00
Gyorgy Orosz	1748549aeb	Added exception pattern mechanism to the tokenizer.	2016-12-21 23:16:19 +01:00
Gyorgy Orosz	35aa54765d	Hungarian module is exposed in spacy.	2016-12-21 20:45:36 +01:00
Gyorgy Orosz	ab2f6ea46c	Removed data files from tests.	2016-12-21 20:22:09 +01:00
Ines Montani	3c87c71d43	Add tokenizer exceptions for a.m. and p.m. in Spanish	2016-12-21 18:19:10 +01:00
Ines Montani	78e63dc7d0	Update tokenizer exceptions for English	2016-12-21 18:06:34 +01:00
Ines Montani	702d1eed93	Update tokenizer exceptions for German	2016-12-21 18:06:27 +01:00
Ines Montani	d60380418e	Update tokenizer exceptions for Spanish	2016-12-21 18:06:17 +01:00
Ines Montani	920fa0fed2	Add DET_LEMMA constant	2016-12-21 18:05:41 +01:00
Ines Montani	8978806ea6	Allow Vocab to load without serializer_freqs	2016-12-21 18:05:23 +01:00
Ines Montani	be8ed811f6	Remove trailing whitespace	2016-12-21 18:04:41 +01:00
Ines Montani	926e19184a	Merge pull request #695 from magnusburton/master Added Swedish morph rules	2016-12-21 01:06:00 +01:00
Gyorgy Orosz	3d5306acb9	Added further testcases.	2016-12-20 23:49:35 +01:00
Gyorgy Orosz	23956e72ff	Improved partial support for tokenizing Hungarian numbers	2016-12-20 23:36:59 +01:00
Gyorgy Orosz	6add156075	Refactored language data structure	2016-12-20 22:28:20 +01:00
Gyorgy Orosz	366b3f8685	Merge branch 'master' into hu_tokenizer	2016-12-20 20:53:31 +01:00
Gyorgy Orosz	c035928156	Partial Hungarian number tokenization is added.	2016-12-20 20:46:20 +01:00
JM	70ff0639b5	Fixed missing vec_path declaration that was failing if 'add_vectors' was set Added vec_path variable declaration to avoid accessing it before assignment in case 'add_vectors' is in overrides.	2016-12-20 18:21:05 +01:00
Magnus Burton	48dcc9f647	Added morph rules	2016-12-20 13:18:41 +01:00
Magnus Burton	db5a077d2b	Initial commit for Swedish	2016-12-20 11:05:06 +01:00
Matthew Honnibal	3f5747a9b2	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-12-18 23:44:22 +01:00
Matthew Honnibal	40e71586d6	Fix Issue #683: Add 'SP' to tag_map, if it's not there already, within the Morphology class.	2016-12-18 23:44:05 +01:00
Matthew Honnibal	fa1d23e10d	Merge branch 'master' of https://github.com/explosion/spaCy	2016-12-18 23:32:03 +01:00
Matthew Honnibal	f38eb25fe1	Fix test for word vector	2016-12-18 23:31:55 +01:00
Matthew Honnibal	4e68abebc4	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-12-18 23:19:45 +01:00
Matthew Honnibal	5a6328a5a4	Increment version	2016-12-18 23:19:19 +01:00
Matthew Honnibal	13a0b31279	Another tweak to GloVe path hackery.	2016-12-18 23:12:49 +01:00
Matthew Honnibal	2c6228565e	Fix vector loading re glove hack	2016-12-18 23:06:44 +01:00
Matthew Honnibal	618b50a064	Fix issue #684: GloVe vectors not loaded in spacy.en.English.	2016-12-18 22:46:31 +01:00
Matthew Honnibal	404019ad2f	Fix issue #672: ent_iob_ was a string, not unicode, due to missing unicode_literals statement.	2016-12-18 22:33:53 +01:00
Matthew Honnibal	2ef9d53117	Untested fix for issue #684: GloVe vectors hack should be inserted in English, not in spacy.load.	2016-12-18 22:29:31 +01:00
Matthew Honnibal	c065359459	Fix path-override bug in spacy.load	2016-12-18 22:15:29 +01:00
Matthew Honnibal	813249f826	Work on morphology class. Still not fully consistent with rest of library.	2016-12-18 17:35:22 +01:00
Matthew Honnibal	3679fb43a3	Fix loading of lemmatizer	2016-12-18 17:34:09 +01:00
Matthew Honnibal	3980f1b0cb	Ignore more morphology attributes in deprecated mode of intify_attrs	2016-12-18 17:33:46 +01:00
Matthew Honnibal	7a98ee5e5a	Merge language data change	2016-12-18 17:03:52 +01:00
Matthew Honnibal	e4c951c153	Merge branch 'organize-language-data' of ssh://github.com/explosion/spaCy into organize-language-data	2016-12-18 17:01:08 +01:00
Ines Montani	b99d683a93	Fix formatting	2016-12-18 16:58:28 +01:00
Ines Montani	b11d8cd3db	Merge remote-tracking branch 'origin/organize-language-data' into organize-language-data	2016-12-18 16:57:12 +01:00
Ines Montani	d1c1d3f9cd	Fix tokenizer test	2016-12-18 16:55:32 +01:00
Ines Montani	753068f1d5	Use base language data as default	2016-12-18 16:55:25 +01:00
Ines Montani	bcc1d50d09	Remove trailing whitespace	2016-12-18 16:54:52 +01:00
Ines Montani	4e95737c6c	Add base tag map	2016-12-18 16:54:28 +01:00
Ines Montani	2b2ea8ca11	Reorganise language data	2016-12-18 16:54:19 +01:00
Matthew Honnibal	1b31c05bf8	Whitespace	2016-12-18 16:51:40 +01:00
Matthew Honnibal	bdcecb3c96	Add import in regression test	2016-12-18 16:51:31 +01:00
Matthew Honnibal	6ee1df93c5	Set tag_map to None if it's not seen in the data by vocab	2016-12-18 16:51:10 +01:00
Matthew Honnibal	33996e770b	Update header for morphology class	2016-12-18 16:50:42 +01:00
Matthew Honnibal	d58187ffa7	Filter out morphology keys in deprecated attrs	2016-12-18 16:50:26 +01:00
Matthew Honnibal	837a5d4100	Update morphology class so that exceptions can be added one-by-one, and so that arbitrary attributes can be referenced.	2016-12-18 16:49:46 +01:00
Matthew Honnibal	44f4f008bd	Wire up lemmatizer rules for English	2016-12-18 15:50:09 +01:00
Matthew Honnibal	e6fc4afb04	Whitespace	2016-12-18 15:48:00 +01:00
Ines Montani	32b36c3882	Break language data components into their own files	2016-12-18 15:40:22 +01:00
Ines Montani	1bff59a8db	Update English language data	2016-12-18 15:36:53 +01:00
Ines Montani	2eb163c5dd	Add lemma rules	2016-12-18 15:36:53 +01:00
Ines Montani	29ad8143d8	Add morph rules	2016-12-18 15:36:53 +01:00
Ines Montani	bc40dad7d9	Add entity rules	2016-12-18 15:36:53 +01:00
Ines Montani	eaa3b1319d	Fix formatting	2016-12-18 15:36:53 +01:00
Ines Montani	704c7442e0	Break language data components into their own files	2016-12-18 15:36:53 +01:00
Ines Montani	62655fd36f	Add ENT_ID constant	2016-12-18 15:36:53 +01:00
Matthew Honnibal	fa272fdf12	Merge branch 'organize-language-data' of ssh://github.com/explosion/spaCy into organize-language-data	2016-12-18 15:00:21 +01:00
Matthew Honnibal	57c4341453	Refactor loading of morphology exceptions, adding a method add_special_case.	2016-12-18 14:59:44 +01:00
Ines Montani	77cf2fb0f6	Remove unnecessary argument in test	2016-12-18 14:06:27 +01:00
Ines Montani	121c310566	Remove trailing whitespace	2016-12-18 14:06:27 +01:00
Ines Montani	0fc4e45cb3	Fix tag map for German	2016-12-18 13:30:03 +01:00
Ines Montani	28326649f3	Fix typo	2016-12-18 13:30:03 +01:00
Matthew Honnibal	0595cc0635	Change test595 to mock data, instead of requiring model.	2016-12-18 13:28:51 +01:00
Matthew Honnibal	a4eb5c2bff	Check POS key in lemmatizer, to update it for new data format	2016-12-18 13:28:20 +01:00
Matthew Honnibal	28d63ec58e	Restore missing '' character in tokenizer exceptions.	2016-12-18 05:34:51 +01:00
Ines Montani	a9421652c9	Remove duplicates in tag map	2016-12-17 22:44:31 +01:00
Ines Montani	69baf1c9a8	Fix tag map	2016-12-17 22:44:22 +01:00
Ines Montani	577adad945	Fix formatting	2016-12-17 14:00:52 +01:00
Ines Montani	fc4ad17136	Fix typo	2016-12-17 14:00:47 +01:00
Ines Montani	bb94e784dc	Fix typo	2016-12-17 13:59:30 +01:00
Ines Montani	afda532595	Use symbols in tag map	2016-12-17 13:56:24 +01:00
Ines Montani	07249145c9	Fix formatting	2016-12-17 13:34:46 +01:00
Ines Montani	dd55d085b6	Reformat dutch language data to match new style	2016-12-17 13:26:01 +01:00
Ines Montani	f2c48ef504	Resolve stopwords conflict to merge Dutch	2016-12-17 13:08:16 +01:00
Matthew Honnibal	ff03ade08f	Merge pull request #688 from nlesc-sherlock/dutch Support for Dutch in SpaCy	2016-12-17 22:44:58 +11:00
Ines Montani	a22322187f	Add missing lemmas to tokenizer exceptions (fixes #674)	2016-12-17 12:42:41 +01:00
Ines Montani	5445074cbd	Expand tokenizer exceptions with unicode apostrophe (fixes #685)	2016-12-17 12:34:08 +01:00
Ines Montani	e0a7b5c612	Fix formatting	2016-12-17 12:33:09 +01:00
Ines Montani	08162dce67	Move shared functions and constants to global language data	2016-12-17 12:32:48 +01:00
Ines Montani	6a60a61086	Move update_exc to global language data utils	2016-12-17 12:29:02 +01:00
Ines Montani	f324311249	Add global language data utils	2016-12-17 12:27:41 +01:00
Ines Montani	487ce1e20a	Add encoding declaration	2016-12-17 12:25:44 +01:00
Ines Montani	d8d50a0334	Add tokenizer exception for "gonna" (fixes #691)	2016-12-17 11:59:28 +01:00
Ines Montani	c69b77d8aa	Revert "Add exception for "gonna"" This reverts commit `280c03f67b`.	2016-12-17 11:56:44 +01:00
Ines Montani	280c03f67b	Add exception for "gonna"	2016-12-17 11:54:59 +01:00
Ines Montani	5031a015e2	Fix typo in stopwords (fixes #689)	2016-12-15 17:57:06 +01:00
Janneke van der Zwaan	4a3fdcce8a	Merge github.com:explosion/spaCy into dutch	2016-12-13 09:25:23 +01:00
Matthew Honnibal	5965d3c2a7	Revert "Add acl to symbols.pyx"	2016-12-12 10:10:28 +11:00
Matthew Honnibal	6dee76dfed	Update symbols.pxd	2016-12-12 10:09:58 +11:00
Pokey Rule	18a15c0777	Add acl to symbols.pyx	2016-12-11 20:00:07 +00:00
Gyorgy Orosz	0cf2144d24	Adding partial hyphen and quote handling support.	2016-12-11 00:14:36 +01:00
Gyorgy Orosz	2051726fd3	Passing Hungarian abbrev tests.	2016-12-10 23:37:58 +01:00
Ines Montani	63024466a9	Add Portuguese stopwords	2016-12-08 20:45:07 +01:00
Ines Montani	7bfe2d4abc	Update Portuguese language data	2016-12-08 20:41:41 +01:00
Ines Montani	c0c5f31950	Remove unused data and download script	2016-12-08 20:39:49 +01:00
Ines Montani	0a6d529104	Remove unused data	2016-12-08 20:36:56 +01:00
Ines Montani	1b3b043660	Add French stopwords	2016-12-08 20:12:43 +01:00
Ines Montani	8863e504eb	Update French language data	2016-12-08 20:07:14 +01:00
Ines Montani	7cb9f51be6	Add Italian stopwords	2016-12-08 20:05:25 +01:00
Ines Montani	470a0e0bea	Update Italian language data	2016-12-08 19:52:18 +01:00
Ines Montani	1a284d342e	Add Spanish language data	2016-12-08 19:47:03 +01:00
Ines Montani	0c39654786	Remove unused import	2016-12-08 19:46:53 +01:00
Ines Montani	e47ee94761	Split punctuation into its own file	2016-12-08 19:46:43 +01:00
Ines Montani	70b51ed7c8	Remove time from German language data	2016-12-08 19:45:50 +01:00
Ines Montani	e8ae588be9	Add emoticons	2016-12-08 19:45:18 +01:00
Ines Montani	5908c0ed9f	Fix formatting	2016-12-08 19:45:11 +01:00
Ines Montani	311b30ab35	Reorganize exceptions for English and German	2016-12-08 13:58:32 +01:00
Ines Montani	66c7348cda	Add update_exc util function	2016-12-08 13:58:12 +01:00
Ines Montani	1256232fad	Fix formatting	2016-12-08 13:56:40 +01:00
Ines Montani	8e977cc71c	Fix formatting	2016-12-08 13:56:17 +01:00
Ines Montani	0176b99004	Fix formatting	2016-12-08 12:48:02 +01:00
Ines Montani	877f09218b	Add more custom rules for abbreviations	2016-12-08 12:47:01 +01:00
Gyorgy Orosz	0289b8ceaa	Additional abbreviation tests.	2016-12-08 12:17:44 +01:00
Gyorgy Orosz	90d22db023	Added Hungarian resource files.	2016-12-08 12:06:36 +01:00
Ines Montani	bfaa42636c	Update language data for German	2016-12-08 12:01:09 +01:00
Ines Montani	ec44bee321	Fix capitalization on morphological features	2016-12-08 12:00:54 +01:00
Gyorgy Orosz	5b00039955	First steps towards the Hungarian tokenizer code.	2016-12-07 23:07:43 +01:00
Ines Montani	ce979553df	Resolve conflict	2016-12-07 21:16:52 +01:00
Ines Montani	8350d65695	Change morphology and lemmatizer API Take morphology features as object instead of keyword arguments	2016-12-07 21:12:49 +01:00
Ines Montani	52e7d634df	Remove trailing whitespace	2016-12-07 21:12:19 +01:00
Ines Montani	0d07d7fc80	Apply emoticon exceptions to tokenizer	2016-12-07 21:11:59 +01:00
Ines Montani	71f0f34cb3	Fix formatting	2016-12-07 21:11:29 +01:00
Ines Montani	9413bcd9ee	Declare encoding and unicode literals	2016-12-07 21:10:34 +01:00
Ines Montani	a280ff2657	Fix __all__	2016-12-07 21:10:12 +01:00
Ines Montani	ba8721953c	Add missing emoticons	2016-12-07 21:09:44 +01:00
Ines Montani	1285c4ba93	Update English language data	2016-12-07 20:33:28 +01:00
Ines Montani	79dce0aabe	Add emoticons	2016-12-07 20:33:28 +01:00
Ines Montani	a662a95294	Add line breaks	2016-12-07 20:33:28 +01:00
Ines Montani	07f0efb102	Add test for tokenizer regular expressions	2016-12-07 20:33:28 +01:00
Ines Montani	e0712d1b32	Reformat language data	2016-12-07 20:33:28 +01:00
Matthew Honnibal	0c0f4c965d	Increment version	2016-12-03 11:16:52 +01:00
Matthew Honnibal	f6e356aada	Add (and test) Span.sentiment attribute. By default we average token.span, but can override with custom hook. Re Issue #667	2016-12-02 11:05:50 +01:00
Janneke van der Zwaan	88869e0e07	Merge github.com:explosion/spaCy into dutch	2016-11-30 17:13:39 +01:00
Janneke van der Zwaan	51ade86b86	Update language data with tag map from UD_Dutch	2016-11-30 14:41:23 +01:00
Janneke van der Zwaan	90f6ff12c9	Update Dutch language data - Use Dutch tag map - remove tokenizer exceptions	2016-11-30 11:59:39 +01:00
dafnevk	7b8f4c49f2	Added language Dutch to init file	2016-11-29 16:42:05 +01:00
Matthew Honnibal	296d33a4fc	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-11-26 12:36:18 +01:00
Matthew Honnibal	1f6c37c6f5	Fix create_tokenizer when nlp is None	2016-11-26 12:36:04 +01:00
Matthew Honnibal	c7889492f9	Fix model saving error for Python 3	2016-11-25 18:04:30 -06:00
Matthew Honnibal	bc0a202c9c	Fix unicode problem in nonproj module	2016-11-25 17:29:17 -06:00
Matthew Honnibal	6dd3b94fa6	Filter out deprecated attributes when reading special-case tokenization rules.	2016-11-25 09:57:18 -06:00
Matthew Honnibal	e879c79b8c	Merge branch 'master' of https://github.com/explosion/spaCy	2016-11-25 09:18:28 -06:00
Matthew Honnibal	a335c6dcc2	Exclude morphs from deprecated token attributes for now	2016-11-25 16:17:32 +01:00
Matthew Honnibal	f799a07f25	Merge branch 'master' of https://github.com/explosion/spaCy	2016-11-25 09:16:43 -06:00
Matthew Honnibal	159e8c46e1	Merge old training fixes with newer state	2016-11-25 09:16:36 -06:00
Matthew Honnibal	846e80f2f4	Exclude morphs from deprecated token attributes for now	2016-11-25 16:14:54 +01:00
Matthew Honnibal	664f2dd1c0	Allow dep to be None in scorer, for missing labels.	2016-11-25 09:02:49 -06:00
Matthew Honnibal	39341598bb	Fix NER label calculation	2016-11-25 09:02:22 -06:00
Matthew Honnibal	ca773a1f53	Tweak arc_eager n_gold to deal with negative costs, and improve error message.	2016-11-25 09:01:52 -06:00
Matthew Honnibal	a2f55e7015	Pass cfg through loading, for training.	2016-11-25 09:01:20 -06:00
Matthew Honnibal	608d8f5421	Pass cfg through parser, and have is_valid default to 1, not 0 when resetting state	2016-11-25 09:00:21 -06:00
Matthew Honnibal	cc7e607a8a	Fix gold.pyx for 1.0	2016-11-25 08:57:59 -06:00
root	080d29e092	Fix train.py for 1.0	2016-11-25 08:55:33 -06:00
Matthew Honnibal	6652f2a135	Test #656, #624: special case rules for tokenizer with attributes.	2016-11-25 12:44:13 +01:00
Matthew Honnibal	1e0f566d95	Fix #656, #624: Support arbitrary token attributes when adding special-case rules.	2016-11-25 12:43:24 +01:00
Matthew Honnibal	87613edf8f	Add set_struct_attr staticmethod to token	2016-11-25 12:41:47 +01:00
Matthew Honnibal	fb69aa648f	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-11-25 11:35:44 +01:00
Matthew Honnibal	9a03a3f85e	Add get_struct_attr staticmethod to Token, to match Lexeme.get_struct_attr.	2016-11-25 11:35:17 +01:00
Matthew Honnibal	53d8ca8f51	Add spacy.attrs.intify_attrs function, to normalize strings in token attribute dictionaries.	2016-11-25 11:34:30 +01:00
Ines Montani	d21ad01840	Add emoticons	2016-11-24 19:13:00 +01:00
dafnevk	d8c7ac203a	Added nl module for dutch	2016-11-24 16:39:49 +01:00
dafnevk	3db8b0d322	Added language class and some language data (with some TODOs) for Dutch	2016-11-24 15:56:38 +01:00
Ines Montani	4dcfafde02	Add line breaks	2016-11-24 14:57:37 +01:00
Ines Montani	6247c005a2	Add test for tokenizer regular expressions	2016-11-24 13:51:59 +01:00
Ines Montani	de747e39e7	Reformat language data	2016-11-24 13:51:32 +01:00
Matthew Honnibal	b8c4f5ea76	Allow German noun chunks to work on Span Update the German noun chunks iterator, so that it also works on Span objects.	2016-11-24 23:30:15 +11:00
Pokey Rule	3e3bda142d	Add noun_chunks to Span	2016-11-24 10:47:20 +00:00
Janneke van der Zwaan	83daade0e4	Add directory and initial (empty) files for language Dutch	2016-11-24 09:45:41 +01:00
Matthew Honnibal	09f68bc641	Fix Issue #639 : stop words in language class not used. This patch is messy, but it's better not to change too much until the language data loading can be properly refactored.	2016-11-24 00:13:55 +01:00
Matthew Honnibal	48e1dc29d4	Fix default path loading.	2016-11-23 23:48:55 +01:00
Matthew Honnibal	e01c1875ee	Work on test for #615	2016-11-23 23:48:41 +01:00
ExplodingCabbage	6c4f488e89	Fix syntax mistake	2016-11-23 15:12:45 +00:00
Matthew Honnibal	60eb2343ce	Only try to load vectors if they exist.	2016-11-23 13:50:24 +01:00
Matthew Honnibal	618ac36093	Fix use of path argument in Language.__init__. Needs to be keyword arg, not positional.	2016-11-23 13:26:34 +01:00
Mark Amery	fbe19680a6	Fix another bug related to Language.__init__'s path parameter	2016-11-20 20:31:34 +00:00
Mark Amery	b0a07c21a0	Fix `path` param of `Language.__init__` always being ignored There was an explicitly-declared `path` keyword argument, so 'path' would never be present in `**overrides`. This line just overwrote any manually-specified value the user might've passed to the `path` parameter.	2016-11-20 16:29:57 +00:00
Mark Amery	1988fce389	Merge remote-tracking branch 'origin/master' into specify-data-path	2016-11-20 16:07:14 +00:00
Mark Amery	3871007c72	Let --data-path be specified when running download.py scripts Resolves https://github.com/explosion/spaCy/issues/637	2016-11-20 15:48:04 +00:00
Ines Montani	dad2c6cae9	Strip trailing whitespace	2016-11-20 16:45:51 +01:00
Ines Montani	3082e49326	Update and reformat German stopwords	2016-11-20 16:45:26 +01:00
Sourav Singh	6745eac309	Update language_data.py	2016-11-20 19:52:02 +05:30
Sourav Singh	4d9aae7d6a	Add German Stopwords	2016-11-19 22:47:53 +05:30
Matthew Honnibal	7afb2544a7	Merge pull request #627 from sadovnychyi/patch-1 Remove duplicated line of vocab declaration	2016-11-16 06:09:18 +11:00
Yanhao	762169da29	Fixed bug: eg.guess is a tag id, rather than tag	2016-11-15 14:11:22 +08:00
Dmytro Sadovnychyi	e70a7050e1	Remove duplicated line of vocab declaration As already declared on line 211.	2016-11-13 18:52:49 +08:00
Matthew Honnibal	f123f92e0c	Fix #617 : Vocab.load() required Path. Should work with string as well.	2016-11-10 22:48:48 +01:00
Matthew Honnibal	e86f440ca6	Fix test for issue 617	2016-11-10 22:48:10 +01:00
Matthew Honnibal	faa7610c56	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-11-10 22:46:38 +01:00
Matthew Honnibal	a2c7de8329	spacy/tests/regression/test_issue617.py Test Issue #617	2016-11-10 22:46:23 +01:00
tiago	2a3e342c1f	Added a test case to cover the span.merge returning values	2016-11-09 18:57:50 +00:00
tiago	b38cfd0ef9	now span.merge returns token like it says on documentation	2016-11-09 14:58:19 +00:00
Dmitry Sadovnychyi	9488222e79	Fix PhraseMatcher to work with updated Matcher #613	2016-11-09 00:14:26 +08:00
Dmitry Sadovnychyi	86c056ba64	Add basic test for PhraseMatcher #613	2016-11-09 00:10:32 +08:00
Matthew Honnibal	3ea15b257f	Fix test for 605	2016-11-06 11:59:26 +01:00
Matthew Honnibal	efe7790439	Test #590 : Order dependence in Matcher rules.	2016-11-06 11:21:36 +01:00
Matthew Honnibal	5cd3acb265	Fix #605 : Acceptor now rejects matches as expected.	2016-11-06 10:50:42 +01:00
Matthew Honnibal	75805397dd	Test Issue #605	2016-11-06 10:42:32 +01:00
Matthew Honnibal	014b6936ac	Fix #608 -- __version__ should be available at the base of the package.	2016-11-04 21:21:02 +01:00
Matthew Honnibal	42b0736db7	Increment version	2016-11-04 20:04:21 +01:00
Matthew Honnibal	9f93386994	Update version	2016-11-04 19:28:16 +01:00
Matthew Honnibal	1fb09c3dc1	Fix morphology tagger	2016-11-04 19:19:09 +01:00
Matthew Honnibal	a36353df47	Temporarily put back the tokenize_from_strings method, while tests aren't updated yet.	2016-11-04 19:18:07 +01:00
Matthew Honnibal	f0917b6808	Fix Issue #376 : and/or was tagged as a noun.	2016-11-04 15:21:28 +01:00
Matthew Honnibal	737816e86e	Fix #368 : Tokenizer handled pattern 'unicode close quote, period' incorrectly.	2016-11-04 15:16:20 +01:00
Matthew Honnibal	ab952b4756	Fix #578 -- Sputnik had been purging all files on --force, not just the relevant one.	2016-11-04 10:44:11 +01:00
Matthew Honnibal	6e37ba1d82	Fix #602 , #603 --- Broken build	2016-11-04 09:54:24 +01:00
Matthew Honnibal	293c79c09a	Fix #595 : Lemmatization was incorrect for base forms, because morphological analyser wasn't adding morphology properly.	2016-11-04 00:29:07 +01:00
Matthew Honnibal	e30348b331	Prefer to import from symbols instead of parts_of_speech	2016-11-04 00:27:55 +01:00
Matthew Honnibal	4a8a2b6001	Test #595 -- Bug in lemmatization of base forms.	2016-11-04 00:27:32 +01:00
Matthew Honnibal	f1605df2ec	Fix #588 : Matcher should reject empty pattern.	2016-11-03 00:16:44 +01:00
Matthew Honnibal	72b9bd57ec	Test Issue #588 : Matcher accepts invalid, empty patterns.	2016-11-03 00:09:35 +01:00
Matthew Honnibal	41a90a7fbb	Add tokenizer exception for 'Ph.D.', to fix 592.	2016-11-03 00:03:34 +01:00
Matthew Honnibal	532318e80b	Import Jieba inside zh.make_doc	2016-11-02 23:49:19 +01:00
Matthew Honnibal	f292f7f0e6	Fix Issue #599 , by considering empty documents to be parsed and tagged. Implementation is a bit dodgy.	2016-11-02 23:48:43 +01:00
Matthew Honnibal	b6b01d4680	Remove deprecated tokens_from_list test.	2016-11-02 23:47:21 +01:00
Matthew Honnibal	3d6c79e595	Test Issue #599 : .is_tagged and .is_parsed attributes not reflected after deserialization for empty documents.	2016-11-02 23:40:11 +01:00
Matthew Honnibal	05a8b752a2	Fix Issue #600 : Missing setters for Token attribute.	2016-11-02 23:28:59 +01:00
Matthew Honnibal	125c910a8d	Test Issue #600	2016-11-02 23:24:13 +01:00
Matthew Honnibal	e0c9695615	Fix doc strings for tokenizer	2016-11-02 23:15:39 +01:00
Matthew Honnibal	80824f6d29	Fix test	2016-11-02 20:48:40 +01:00
Matthew Honnibal	dbe47902bc	Add import fr	2016-11-02 20:48:29 +01:00
Matthew Honnibal	8f24dc1982	Fix infixes in Italian	2016-11-02 20:43:52 +01:00
Matthew Honnibal	41a4766c1c	Fix infixes in Spanish and Portuguese	2016-11-02 20:43:12 +01:00
Matthew Honnibal	3d4bd96e8a	Fix infixes in French	2016-11-02 20:41:43 +01:00
Matthew Honnibal	c09a8ce5bb	Add test for French tokenizer	2016-11-02 20:40:31 +01:00
Matthew Honnibal	b012ae3044	Add test for loading languages	2016-11-02 20:38:48 +01:00
Matthew Honnibal	ad1c747c6b	Fix stray POS in language stubs	2016-11-02 20:37:55 +01:00
Matthew Honnibal	e9e6fce576	Handle null prefix/suffix/infix search in tokenizer	2016-11-02 20:35:48 +01:00
Matthew Honnibal	22647c2423	Check that patterns aren't null before compiling regex for tokenizer	2016-11-02 20:35:29 +01:00
Matthew Honnibal	5ac735df33	Link languages in __init__.py	2016-11-02 20:05:14 +01:00
Matthew Honnibal	c68dfe2965	Stub out support for Italian	2016-11-02 20:03:24 +01:00
Matthew Honnibal	6dbf4f7ad7	Stub out support for French, Spanish, Italian and Portuguese	2016-11-02 20:02:41 +01:00
Matthew Honnibal	6b8b05ef83	Specify that spacy.util is encoded in utf8	2016-11-02 19:58:00 +01:00
Matthew Honnibal	5363224395	Add draft Jieba tokenizer for Chinese	2016-11-02 19:57:38 +01:00
Matthew Honnibal	f7fee6c24b	Check for class-defined make_docs method before assigning one provided as an argument	2016-11-02 19:57:13 +01:00
Matthew Honnibal	19c1e83d3d	Work on draft Italian tokenizer	2016-11-02 19:56:32 +01:00
Matthew Honnibal	9efe568177	Add missing unicode_literals to spacy.util. I think this was messing up the tokenizer regex for non-ascii characters in Python 2. Re Issue #596	2016-11-02 12:31:34 +01:00
Matthew Honnibal	d8db648ebf	Add __init__.py file for regression tests	2016-11-01 13:45:06 +01:00
Matthew Honnibal	11664b9f20	Fix variable error in token	2016-11-01 13:28:00 +01:00
Matthew Honnibal	8c4d1b46ce	Fix variable error in Span	2016-11-01 13:27:44 +01:00
Matthew Honnibal	e7af6b937f	Fix syntax error while fixing doc strings	2016-11-01 13:27:32 +01:00
Matthew Honnibal	62fc6b1afa	Use 32 bit hashes for OOV, re Issue #589 , Issue #285	2016-11-01 13:27:13 +01:00
Matthew Honnibal	6977a2b8cd	Add test for Issue #589	2016-11-01 12:33:36 +01:00
Matthew Honnibal	b86f8af0c1	Fix doc strings	2016-11-01 12:25:36 +01:00
Matthew Honnibal	d563f1eadb	Fix Issue #587 : Segfault in Matcher, due to simple error in the state machine.	2016-10-28 17:42:00 +02:00
Matthew Honnibal	7e5f63a595	Improve test slightly	2016-10-28 17:41:16 +02:00
Matthew Honnibal	782e4814f4	Test Issue #587 : Matcher segfaults on particular input	2016-10-28 16:38:32 +02:00
Matthew Honnibal	708ea22208	Infer types in transition_system.pyx	2016-10-27 18:08:13 +02:00
Matthew Honnibal	18590eba94	Fix training evaluate method	2016-10-27 18:02:19 +02:00
Matthew Honnibal	301f3cc898	Fix Issue #429 . Add an initialize_state method to the named entity recogniser that adds missing entity types. This is a messy place to add this, because it's strange to have the method mutate state. A better home for this logic could be found.	2016-10-27 18:01:55 +02:00
Matthew Honnibal	afea6505f3	Test Issue 429: No valid actions for NER after matcher adds a new entity label.	2016-10-27 18:01:34 +02:00
Matthew Honnibal	03a520ec4f	Change signature of Parser.parseC, so that nr_class is read from the transition system. This allows the transition system to modify the number of actions in initialize_state.	2016-10-27 17:58:56 +02:00
Matthew Honnibal	6c47048912	Fix test, after IOB tweak.	2016-10-26 17:22:03 +02:00
Matthew Honnibal	4ca31b4d87	Fix clobbering of 'missing' named ent values after assigning ents.	2016-10-26 13:13:56 +02:00
Matthew Honnibal	cb49189477	Remove dead code	2016-10-26 13:11:07 +02:00
Matthew Honnibal	a209b10579	Improve error message when oracle fails for non-projective trees, re Issue #571 .	2016-10-24 20:31:30 +02:00
Matthew Honnibal	b2d43b93d2	Fix Python 3 basestring error	2016-10-24 14:22:51 +02:00
Matthew Honnibal	276478fe0f	Update strings.pxd	2016-10-24 14:00:35 +02:00
Matthew Honnibal	d8134817ff	Workaround Issue #285 : Allow the StringStore to be 'frozen', in which case strings will be pushed into an OOV map. We can then flush this OOV map, freeing all of the OOV strings.	2016-10-24 13:49:03 +02:00
Matthew Honnibal	d3a617aa99	Test workaround for Issue #285 : Streaming data memory growth	2016-10-24 13:48:06 +02:00
Matthew Honnibal	64e5f02cf7	Update test	2016-10-23 21:08:07 +02:00
Matthew Honnibal	66d7a6eca2	Update test	2016-10-23 21:02:05 +02:00
Matthew Honnibal	90bf797125	Update test	2016-10-23 20:54:17 +02:00
Matthew Honnibal	5e76320ffe	Update test	2016-10-23 20:44:54 +02:00
Matthew Honnibal	aa105927f3	Update test	2016-10-23 20:31:25 +02:00
Matthew Honnibal	6b9237aa83	Increment version	2016-10-23 20:22:53 +02:00
Matthew Honnibal	150e02d72e	Fix Issue #566	2016-10-23 20:19:01 +02:00
Matthew Honnibal	e120561294	Fix vector_norm test.	2016-10-23 19:56:16 +02:00
Matthew Honnibal	fefde8aef8	Make installation print data path.	2016-10-23 19:46:44 +02:00
Matthew Honnibal	e7414cd064	Try to fix weird install glitch.	2016-10-23 19:46:28 +02:00
Matthew Honnibal	90f7544edd	Increment version	2016-10-23 19:43:06 +02:00
Matthew Honnibal	6036ec7c77	Fix vector norm when loading lexemes.	2016-10-23 19:40:18 +02:00
Matthew Honnibal	c05cd2356e	Fix similarity test for Python 3	2016-10-23 18:16:56 +02:00
Matthew Honnibal	3e688e6d4b	Fix issue #514 -- serializer fails when new entity type has been added. The fix here is quite ugly. It's best to add the entities ASAP after loading the NLP pipeline, to mitigate the brittleness.	2016-10-23 17:45:44 +02:00
Matthew Honnibal	79aa03fe98	Test Issue #514 : Serializer fails when new entity type has been added.	2016-10-23 17:41:44 +02:00
Matthew Honnibal	f97548c6f1	Fix broken test, re Issue #461	2016-10-23 17:02:23 +02:00
Matthew Honnibal	4de30a8e38	Test Issue #514 : Serialization fails after adding a new entity label.	2016-10-23 16:40:27 +02:00
Matthew Honnibal	936e6246aa	Fix Issue #459 -- failed to deserialize empty doc.	2016-10-23 16:31:05 +02:00
Matthew Honnibal	e99b3f5322	Test Issue #459 : Fail to deserialize empty doc	2016-10-23 16:30:22 +02:00
Matthew Honnibal	49c117960c	Fix bug where huffman codec died if given empty freqs dict.	2016-10-23 16:28:05 +02:00
Matthew Honnibal	99ff8b902f	Test that huffman codec works with empty freqs dict	2016-10-23 16:27:45 +02:00
Matthew Honnibal	15c9b59f0e	Fix Issue #461 : O tag was being clobbered by doc.ents.__set__	2016-10-23 15:50:26 +02:00
Matthew Honnibal	e5627134d9	Test Issue #461 : ent_iob tag incorrect after setting entities.	2016-10-23 15:50:04 +02:00
Matthew Honnibal	f62088d646	Fix compile error	2016-10-23 14:50:50 +02:00
Matthew Honnibal	2c3a67b693	Fix calculation of vector norm, re Issue #522 . Need to consolidate the calculations into a helper function.	2016-10-23 14:49:31 +02:00
Matthew Honnibal	a0a4ada42a	Fix calculation of L2-norm for Lexeme	2016-10-23 14:44:45 +02:00
Matthew Honnibal	2989072aac	Add tests to verify that Issue #442 is fixed in 1.1	2016-10-23 14:33:13 +02:00
Matthew Honnibal	739213a8af	Fix create_pipeline keyword argument.	2016-10-23 14:24:16 +02:00
Matthew Honnibal	bea44bd3c4	Fix vector_norm when vector is assigned to Lexeme.	2016-10-23 14:23:56 +02:00
Matthew Honnibal	e838b6d53f	Add tests for using the new Entity ID tracking in the rule matcher	2016-10-23 14:04:01 +02:00
Matthew Honnibal	e7af75e0a9	Add test for vector resizing, re Issue #544	2016-10-21 17:07:21 +02:00
Matthew Honnibal	ca8ea33abc	Bump version to 1.1.0	2016-10-21 16:30:57 +02:00
Matthew Honnibal	7ab03050d4	Add resize_vectors method to Vocab	2016-10-21 01:44:50 +02:00
Matthew Honnibal	8ce8803824	Fix JSON in tokenizer	2016-10-21 01:44:20 +02:00
Matthew Honnibal	6eb73a095f	Fix JSON in tagger	2016-10-21 01:44:10 +02:00
Matthew Honnibal	e16e78a737	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-10-21 00:00:15 +02:00
Matthew Honnibal	147373c807	Increment version	2016-10-21 00:00:03 +02:00
Matthew Honnibal	e80944276f	Fix Span.vector_norm	2016-10-20 21:58:56 +02:00
Matthew Honnibal	f5fe4f595b	Fix json loading, for Python 3.	2016-10-20 21:23:26 +02:00
Matthew Honnibal	2e92c6fb3a	Fix JSON encoding issue on load	2016-10-20 21:06:48 +02:00
Matthew Honnibal	4ad7bb96c9	Increment version.	2016-10-20 20:48:30 +02:00
Matthew Honnibal	5ec32f5d97	Fix loading of GloVe vectors, to address Issue #541	2016-10-20 18:27:48 +02:00
Matthew Honnibal	ddeabd76c4	Fix mistake loading GloVe vectors. GloVe vectors now loaded by default if present, as promised.	2016-10-20 16:57:53 +02:00
Matthew Honnibal	bfe5cb1244	Increment version.	2016-10-20 14:52:00 +02:00
Matthew Honnibal	f189a3cb00	Fix encoding when opening files in Python 2.7, re Issue #539	2016-10-20 14:42:56 +02:00
Matthew Honnibal	c353a5214d	Increment version	2016-10-19 23:51:01 +02:00
Matthew Honnibal	d10c17f2a4	Fix Issue #536 : oov_prob was 0 for OOV words.	2016-10-19 23:38:47 +02:00
Matthew Honnibal	dfa752d064	Increment version	2016-10-19 23:19:13 +02:00
Matthew Honnibal	3588a18fb8	Fix hook names in doc	2016-10-19 21:15:16 +02:00
Matthew Honnibal	5d5742b773	Add sentiment field to doc, rename getters_for_tokens and getters_for_spans, add user_hooks field to Doc.	2016-10-19 20:54:22 +02:00
Matthew Honnibal	ed5e178817	Add sentiment property on lexeme object	2016-10-19 20:52:52 +02:00
Matthew Honnibal	d4aaf2752c	Fix issue #535 : Pipeline elements added even when data not installed.	2016-10-19 19:55:19 +02:00
Matthew Honnibal	04d1c959da	Fix version	2016-10-19 03:45:37 +02:00
Matthew Honnibal	d35aa7344e	Change version ID to make PyPi happy	2016-10-19 03:24:39 +02:00
Matthew Honnibal	89d2a5c8b3	Increment build version.	2016-10-19 03:05:17 +02:00
Matthew Honnibal	622b0a9674	Tweak download script	2016-10-19 00:52:16 +02:00
Matthew Honnibal	5a5c7192a5	Fix download.py for GloVe vectors.	2016-10-19 00:47:44 +02:00
Matthew Honnibal	edc45c19d6	Update download script	2016-10-19 00:41:14 +02:00
Matthew Honnibal	2bbb050500	Fix default of serializer_freqs	2016-10-18 19:55:41 +02:00
Matthew Honnibal	1b651db9c5	Fix parser creation in Language class.	2016-10-18 19:36:44 +02:00
Matthew Honnibal	45a6f9b9c7	Fix loading of tagger.	2016-10-18 19:33:04 +02:00
Matthew Honnibal	76c815f40d	Fix spacy.load	2016-10-18 19:23:31 +02:00
Matthew Honnibal	8c8f5c62c6	Add LANG attribute to English and German	2016-10-18 18:52:48 +02:00
Matthew Honnibal	05e2a589a4	Fix None label in matcher	2016-10-18 18:05:21 +02:00
Matthew Honnibal	c3a8a1cf51	Update serializer test.	2016-10-18 16:18:46 +02:00
Matthew Honnibal	7d5212f131	Refactor defaults	2016-10-18 16:18:25 +02:00
Matthew Honnibal	a45a9d5092	Remove stray .tensor attribute from Lexeme	2016-10-18 01:16:32 +02:00
Matthew Honnibal	9258db788a	Revert "Have the matcher return character offsets, to handle the match better." This reverts commit `049c937540`.	2016-10-17 16:49:51 +02:00
Matthew Honnibal	7d446e5094	Revert "Update matcher test, to reflect character offset return instead of token offset." This reverts commit `f8d3e3bcfe`.	2016-10-17 16:49:49 +02:00
Matthew Honnibal	4bf2c53c13	Revert "Hack on matcher tests, for new implementation." This reverts commit `dbe60644ab`.	2016-10-17 16:49:48 +02:00
Matthew Honnibal	2fd97c71cc	Revert "Don't try to pickle matcher." This reverts commit `97bd0c9d00`.	2016-10-17 16:49:43 +02:00
Matthew Honnibal	97bd0c9d00	Don't try to pickle matcher.	2016-10-17 16:38:40 +02:00
Matthew Honnibal	dbe60644ab	Hack on matcher tests, for new implementation.	2016-10-17 16:12:22 +02:00
Matthew Honnibal	f8d3e3bcfe	Update matcher test, to reflect character offset return instead of token offset.	2016-10-17 16:00:10 +02:00
Matthew Honnibal	049c937540	Have the matcher return character offsets, to handle the match better.	2016-10-17 15:58:57 +02:00
Matthew Honnibal	9b60186266	Fix doc class	2016-10-17 15:23:47 +02:00
Matthew Honnibal	6cbdc94959	Lots of updates to Matcher, to make entity handling sane.	2016-10-17 15:23:31 +02:00
Matthew Honnibal	7fd98fc91c	Remove deprecation shim around str/bytes in Token.	2016-10-17 14:02:47 +02:00
Matthew Honnibal	b67697a97b	Improve API for doc.merge() and span.merge(), to use keyword arguments.	2016-10-17 14:02:13 +02:00
Matthew Honnibal	fbb7f3f15c	Add user_data attribute to Doc object.	2016-10-17 11:43:22 +02:00
Matthew Honnibal	c1abc8f6ed	Fix deprecation stuff in Token: Remove the shim for the str/unicode semantics, and raise for has_repvec and repvec	2016-10-17 11:18:41 +02:00
Matthew Honnibal	4ba9eadf3d	Merge branch 'v1.0.0-rc1' of ssh://github.com/explosion/spaCy into v1.0.0-rc1	2016-10-17 02:45:44 +02:00
Matthew Honnibal	09ab447a18	Remove tensor property from token.	2016-10-17 02:45:09 +02:00
Matthew Honnibal	5d10e2005c	Defer some attributes to Doc, via getters_for_tokens attribute.	2016-10-17 02:44:49 +02:00
Matthew Honnibal	8829984efb	Remove tensor attribute from Span and Token.	2016-10-17 02:44:04 +02:00
Matthew Honnibal	d15a88c66a	Defer some attributes to Doc via getters_for_spans	2016-10-17 02:43:35 +02:00
Matthew Honnibal	62230dd13a	Add getters_for_spans and getters_for_tokens attributes to Doc. Fix docstring	2016-10-17 02:42:51 +02:00
Matthew Honnibal	ae11ea8240	Add getters_for_tokens and getters_for_spans attributes to Doc object.	2016-10-17 02:42:05 +02:00
Matthew Honnibal	be48a7b4f3	Fix conftest for website tests.	2016-10-17 01:54:26 +02:00
Matthew Honnibal	8951bf6989	Update matcher tests	2016-10-17 01:53:24 +02:00
Matthew Honnibal	0cf4aff470	Set default path in EN/DE tests.	2016-10-17 01:52:49 +02:00
Matthew Honnibal	cd71b6b0a9	Remove test of parser pickle	2016-10-17 01:52:10 +02:00
Matthew Honnibal	5bc101006e	Add cfg field to Tagger	2016-10-17 01:03:41 +02:00
Matthew Honnibal	517f090cbf	Use GoldParse in tagger.update	2016-10-17 00:55:15 +02:00
Matthew Honnibal	59038f7efa	Restore support for prior data format -- specifically, the labels field of the config.	2016-10-17 00:53:26 +02:00
Matthew Honnibal	7887ab3b36	Fix default use of feature_templates in parser	2016-10-16 21:41:56 +02:00
Matthew Honnibal	f787cd29fe	Refactor the pipeline classes to make them more consistent, and remove the redundant blank() constructor.	2016-10-16 21:34:57 +02:00
kengz	fb92e2d061	activate parse_tree test, use from_array, test for root correctness	2016-10-16 15:12:08 -04:00
kengz	17b7832419	mark test as needing models	2016-10-16 14:39:07 -04:00
kengz	f046e0d7c8	add parse_tree method to language, separate from __call__ for efficiency, but will use __call__ to get the doc	2016-10-16 14:20:23 -04:00
Matthew Honnibal	311a985fe0	Add input error handling in Doc	2016-10-16 18:16:42 +02:00
Matthew Honnibal	06322ba99d	Add words and spaces keyword arguments to Doc.	2016-10-16 18:13:03 +02:00
Matthew Honnibal	ca51f3b77e	Use DependencyParser and EntityRecognizer in the Language class.	2016-10-16 17:58:12 +02:00
Matthew Honnibal	195d998a12	Fix GoldParse argument to tagger.update	2016-10-16 17:05:09 +02:00
Matthew Honnibal	274a4d4272	Fix queue Python property in StateClass	2016-10-16 17:04:41 +02:00
Matthew Honnibal	e8c8aa08ce	Make action_name optional in StepwiseState	2016-10-16 17:04:16 +02:00
Matthew Honnibal	4bb73b1a93	Fix parser labels in pipeline	2016-10-16 17:03:22 +02:00
Matthew Honnibal	a81c5a7abf	Fix name of labels keyword to 'actions'.	2016-10-16 12:00:27 +02:00
Matthew Honnibal	a079677984	Fix omission of O action when creating blank entity recognizer	2016-10-16 11:43:25 +02:00
Matthew Honnibal	5444d38cc6	Update test for biluo tags	2016-10-16 11:42:45 +02:00
Matthew Honnibal	4fc56d4a31	Rename 'labels' to 'actions' in parser options	2016-10-16 11:42:26 +02:00
Matthew Honnibal	8a6b35d266	Delay binding in MakeDoc	2016-10-16 11:41:55 +02:00
Matthew Honnibal	52b48b415e	Fix GoldParse class	2016-10-16 11:41:36 +02:00
Matthew Honnibal	3259a63779	Whitespace	2016-10-16 01:47:28 +02:00
Matthew Honnibal	509b30834f	Add a pipeline module, to collect and wrap processes for annotation	2016-10-16 01:47:12 +02:00
Matthew Honnibal	0317cea0ad	Fix GoldParse	2016-10-15 23:55:07 +02:00
Matthew Honnibal	1c62573a41	Fix spacy.train	2016-10-15 23:53:46 +02:00
Matthew Honnibal	a48aa15384	Improve the API for the GoldParse class.	2016-10-15 23:53:29 +02:00
Matthew Honnibal	e07fe92b27	Draft a refactored init for the GoldParse class	2016-10-15 22:09:52 +02:00
Matthew Honnibal	47afef7d6b	Add init.py for gold tests	2016-10-15 21:51:28 +02:00
Matthew Honnibal	86ae665c78	Add function for entity->biluo transformation	2016-10-15 21:51:04 +02:00
Matthew Honnibal	2163fd238f	Add tests for entity->biluo transformation	2016-10-15 21:50:43 +02:00
Matthew Honnibal	5e923b9bfa	Return None in match_best_version if no path exists.	2016-10-15 14:47:29 +02:00
Matthew Honnibal	2516382106	Fix loading of English in span test	2016-10-15 14:44:37 +02:00
Matthew Honnibal	dda2fc6bef	Add empty data directory	2016-10-15 14:25:25 +02:00
Matthew Honnibal	049197e0ae	Update tests, somewhat messily.	2016-10-15 14:14:04 +02:00
Matthew Honnibal	1e1a1d9517	Update matcher test	2016-10-15 14:13:41 +02:00
Matthew Honnibal	9cc9ce0f14	Load with default path=False in tests.	2016-10-15 14:13:23 +02:00
Matthew Honnibal	08e9134760	Change default value of path to True	2016-10-15 14:12:54 +02:00
Matthew Honnibal	788657f062	Ensure words are added to vocab before test, so that the lexicon is updated correctly.	2016-10-15 14:12:18 +02:00
Matthew Honnibal	4a1a2bce68	Update version in about.py	2016-10-15 13:44:27 +02:00
Matthew Honnibal	6d8cb515ac	Break the tokenization stage out of the pipeline into a function 'make_doc'. This allows all pipeline methods to have the same signature.	2016-10-14 17:38:29 +02:00
Matthew Honnibal	2cc515b2ed	Add add_flag method to Vocab, re Issue #504 .	2016-10-14 12:15:38 +02:00
Matthew Honnibal	f3be9d0a9a	Add tensor field to Lexeme, Token, Doc and Span, so that users have a place to hang neural network outputs	2016-10-14 03:24:13 +02:00
Matthew Honnibal	9b55d97a8f	Update train method	2016-10-13 03:24:53 +02:00
Matthew Honnibal	645d99523a	Move merge_sents method into spacy.gold	2016-10-13 03:24:29 +02:00
Matthew Honnibal	41f88ce938	Fix dep model loading in parser	2016-10-12 20:26:38 +02:00
Matthew Honnibal	d9ae2d68af	Load features by string-name for backwards compatibility.	2016-10-12 20:15:11 +02:00
Matthew Honnibal	a42fbcf946	Require model for test_is_properties	2016-10-12 19:35:18 +02:00
Matthew Honnibal	20c948361b	Use local path in test_lemmatizer	2016-10-12 19:35:00 +02:00
Matthew Honnibal	1318d0bc65	Test with the non-loaded versions of the English and German pipelines.	2016-10-12 19:13:31 +02:00
Matthew Honnibal	0e2bedc373	Fix default labels for parser and NER	2016-10-12 19:12:40 +02:00
Matthew Honnibal	3a03c668c3	Fix message in ParserStateError	2016-10-12 14:44:31 +02:00
Matthew Honnibal	6bf505e865	Fix error on ParserStateError	2016-10-12 14:35:55 +02:00
Matthew Honnibal	ba5e048502	Add docstring for Trainer class.	2016-10-12 14:26:02 +02:00
Matthew Honnibal	847a4a4182	Refactor Language, dropping Language.blank() method.	2016-10-12 13:45:58 +02:00
Matthew Honnibal	ea23b64cc8	Refactor training, with new spacy.train module. Defaults still a little awkward.	2016-10-09 12:24:24 +02:00
Matthew Honnibal	ca32a1ab01	Revert "Work on Issue #285 : intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good." This reverts commit `8423e8627f`.	2016-09-30 20:20:22 +02:00
Matthew Honnibal	90baa9c7e6	Revert "Changes to matcher.pyx for new StringStore scheme" This reverts commit `3ff09614e0`.	2016-09-30 20:20:13 +02:00
Matthew Honnibal	1b6b129c04	Revert "Changes to morphology.pyx for new StringStore scheme" This reverts commit `95f8cfd745`.	2016-09-30 20:20:02 +02:00
Matthew Honnibal	1d70db58aa	Revert "Changes to iterators.pyx for new StringStore scheme" This reverts commit `4f794b215a`.	2016-09-30 20:19:53 +02:00
Matthew Honnibal	de01e427fd	Revert "Changes to strings.pyx for new StringStore scheme" This reverts commit `22d4752d64`.	2016-09-30 20:19:42 +02:00
Matthew Honnibal	9e09b39b9f	Revert "Changes to transition systems for new StringStore scheme" This reverts commit `0442e0ab1e`.	2016-09-30 20:11:49 +02:00
Matthew Honnibal	e3285f6f30	Revert "Fix report of ParserStateError" This reverts commit `78f19baafa`.	2016-09-30 20:11:33 +02:00
Matthew Honnibal	6736977d82	Revert "Changes to Doc and Token for new string store scheme" This reverts commit `99de44d864`.	2016-09-30 20:11:15 +02:00
Matthew Honnibal	bd7fe6420c	Revert "Changes to test for new string-store" This reverts commit `21e90d7d0b`.	2016-09-30 20:11:01 +02:00
Matthew Honnibal	1f1cd5013f	Revert "Changes to vocab for new stringstore scheme" This reverts commit `a51149a717`.	2016-09-30 20:10:30 +02:00
Matthew Honnibal	1e7d0af127	Revert "Changes to Lexeme for new string store scheme" This reverts commit `717741b6cf`.	2016-09-30 20:10:13 +02:00
Matthew Honnibal	ba51cb8325	Revert "Changes to tagger for new string store scheme" This reverts commit `f5a6aac906`.	2016-09-30 20:09:53 +02:00
Matthew Honnibal	23b7244842	Make sure symbols are unicode strings	2016-09-30 20:02:19 +02:00
Matthew Honnibal	f5a6aac906	Changes to tagger for new string store scheme	2016-09-30 20:01:51 +02:00
Matthew Honnibal	717741b6cf	Changes to Lexeme for new string store scheme	2016-09-30 20:01:36 +02:00
Matthew Honnibal	a51149a717	Changes to vocab for new stringstore scheme	2016-09-30 20:01:19 +02:00
Matthew Honnibal	21e90d7d0b	Changes to test for new string-store	2016-09-30 20:00:58 +02:00
Matthew Honnibal	99de44d864	Changes to Doc and Token for new string store scheme	2016-09-30 20:00:21 +02:00
Matthew Honnibal	78f19baafa	Fix report of ParserStateError	2016-09-30 19:59:22 +02:00
Matthew Honnibal	0442e0ab1e	Changes to transition systems for new StringStore scheme	2016-09-30 19:58:51 +02:00
Matthew Honnibal	22d4752d64	Changes to strings.pyx for new StringStore scheme	2016-09-30 19:58:09 +02:00
Matthew Honnibal	4f794b215a	Changes to iterators.pyx for new StringStore scheme	2016-09-30 19:57:49 +02:00
Matthew Honnibal	95f8cfd745	Changes to morphology.pyx for new StringStore scheme	2016-09-30 19:57:10 +02:00
Matthew Honnibal	3ff09614e0	Changes to matcher.pyx for new StringStore scheme	2016-09-30 19:56:48 +02:00
Matthew Honnibal	eceeaefe53	Fix defaults for Parser and Entity, adding a blank= argument.	2016-09-30 19:56:06 +02:00
Matthew Honnibal	8423e8627f	Work on Issue #285 : intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good.	2016-09-30 10:14:47 +02:00
Matthew Honnibal	d3dc5718b2	Fix syntax error in Doc	2016-09-28 11:39:49 +02:00
Matthew Honnibal	1b520e7bab	Improve docstrings for Doc object	2016-09-28 11:15:13 +02:00
Matthew Honnibal	81a47c01d8	Fix test for empty sentence string.	2016-09-27 19:21:22 +02:00
Matthew Honnibal	4cbf0d3bb6	Handle errors when no valid actions are available, pointing users to the issue tracker.	2016-09-27 19:19:53 +02:00
Matthew Honnibal	430473bd98	Raise errors when no actions are available, re Issue #429	2016-09-27 19:09:37 +02:00
Matthew Honnibal	fc4a7ad794	Test and fix Issue #411 : IndexError when .sents property is used on empty string.	2016-09-27 18:49:14 +02:00
Matthew Honnibal	3d370b7d45	Add test for Issue #445 , fixed in `3cb4d455d`, with improved lemmatizer logic	2016-09-27 18:39:46 +02:00
Matthew Honnibal	a2f3510d6d	Fix lemmatizer	2016-09-27 17:47:05 +02:00
Matthew Honnibal	07776d8096	Fix pos name conflict in lemmatize	2016-09-27 17:35:58 +02:00
Matthew Honnibal	35cd953f9e	Fix pos name conflict with morphology	2016-09-27 14:16:22 +02:00
Matthew Honnibal	8e7df3c4ca	Expect the parser data, if parser.load() is called.	2016-09-27 14:02:12 +02:00
Matthew Honnibal	bb4f201ad2	Pass morphological features from tag map into the lemmatizer.	2016-09-27 14:01:43 +02:00
Matthew Honnibal	40509e8bca	Tweak the new is_base_form logic, because we can expect the 'pos' key in the morphology we're passed.	2016-09-27 14:01:16 +02:00
Matthew Honnibal	9c8ac91d72	Add test for Issue #435	2016-09-27 13:52:38 +02:00
Matthew Honnibal	3cb4d455d2	Pass lemmatizer morphological features, so that rules are sensitive to base/inflected distinction, which is how the WordNet data is designed. See Issue #435	2016-09-27 13:52:11 +02:00
Matthew Honnibal	e233328d38	Fix Issue #371 : Lexeme objects were unhashable.	2016-09-27 13:22:30 +02:00
Matthew Honnibal	e382e48d9f	Temporarily patch handling of default templates for tagger. Need to move these to language_data.	2016-09-27 13:21:28 +02:00
Matthew Honnibal	a44763af0e	Fix Issue #469 : Incorrectly cased root label in noun chunk iterator	2016-09-27 13:13:01 +02:00
Matthew Honnibal	b14b9b096b	Return None if /deps directory not present, instead of trying to load the parser.	2016-09-26 18:48:03 +02:00
Matthew Honnibal	e07b9665f7	Don't expect parser model	2016-09-26 18:09:33 +02:00
Matthew Honnibal	ee6fa106da	Fix parser features	2016-09-26 17:57:32 +02:00
Matthew Honnibal	e607e4b598	Fix parser loading	2016-09-26 17:51:11 +02:00
Matthew Honnibal	0b2d7ae9d6	Fix Entity creation	2016-09-26 15:41:22 +02:00
Matthew Honnibal	2debc4e0a2	Add .blank() method to Parser. Start housing default dep labels and entity types within the Defaults class.	2016-09-26 11:57:54 +02:00
Matthew Honnibal	722199acb8	Add spacy.blank() method, that doesn't load data. Don't try to load data if path is falsey	2016-09-26 11:07:46 +02:00
Matthew Honnibal	e56653f848	Add language data for German	2016-09-25 15:44:45 +02:00
Matthew Honnibal	7db956133e	Move tokenizer data for German into spacy.de.language_data	2016-09-25 15:37:33 +02:00
Matthew Honnibal	95aaea0d3f	Refactor so that the tokenizer data is read from Python data, rather than from disk	2016-09-25 14:49:53 +02:00
Matthew Honnibal	d7e9acdcdf	Add English language data, so that the tokenizer doesn't require the data download	2016-09-25 14:49:00 +02:00
Matthew Honnibal	82b8cc5efb	Whitespace	2016-09-24 22:17:01 +02:00
Matthew Honnibal	fd58f7655a	Python 3 compatible basestring	2016-09-24 22:16:43 +02:00
Matthew Honnibal	082e95b19e	Python 3 compatible basestring	2016-09-24 22:09:21 +02:00
Matthew Honnibal	f19af6cb2c	Python 3 compatible basestring	2016-09-24 22:08:43 +02:00
Matthew Honnibal	3ed4cdfe32	Handle pathlib.Path objects in CFile	2016-09-24 22:01:46 +02:00
Matthew Honnibal	df88690177	Fix encoding of path variable	2016-09-24 21:13:15 +02:00
Matthew Honnibal	af847e07fc	Fix usage of pathlib for Python3 -- turning paths to strings.	2016-09-24 21:05:27 +02:00
Matthew Honnibal	453683aaf0	Fix spacy/vocab.pyx	2016-09-24 20:50:31 +02:00
Matthew Honnibal	fd65cf6cbb	Finish refactoring data loading	2016-09-24 20:26:17 +02:00
Matthew Honnibal	83e364188c	Mostly finished loading refactoring. Design is in place, but doesn't work yet.	2016-09-24 15:42:01 +02:00
Matthew Honnibal	9dc8043a7e	Refactor Language to use new Defaults class, and work on revised data loading. We're getting rid of sputnik's weird file-system wrapper, and using pathlib.	2016-09-24 14:08:53 +02:00
Matthew Honnibal	b00f683a0c	Fix matcher test	2016-09-24 11:20:58 +02:00
Matthew Honnibal	eaf4065480	Expose the _patterns private member	2016-09-24 11:20:42 +02:00
Matthew Honnibal	15e42a1ba9	Allow entities to be set by Span, or by 4-tuple (with entity ID)	2016-09-24 01:17:43 +02:00
Matthew Honnibal	60fdf4d5f1	Remove commented out debuggng code	2016-09-24 01:17:18 +02:00
Matthew Honnibal	939a791a52	Update tests	2016-09-24 01:17:03 +02:00
Matthew Honnibal	55f1f7edaf	Don't automatically write new entities into the Doc in the Matcher. This fixes a long-standing wart, but introduces a backwards incompatibility.	2016-09-24 01:16:45 +02:00
Matthew Honnibal	e48df859b5	Fix typedef import in span.pyx	2016-09-23 16:02:28 +02:00
Matthew Honnibal	4de13606fd	Fix token.pyx	2016-09-23 15:07:07 +02:00
Matthew Honnibal	b4de419e19	Import hash_t typedef in token.pyx	2016-09-23 14:22:06 +02:00
Matthew Honnibal	c1a2e96604	Clean up notes at end of token.pyx	2016-09-21 20:45:51 +02:00
Matthew Honnibal	f6e587b1c7	Fix matcher tests	2016-09-21 20:45:20 +02:00
Matthew Honnibal	58e83fe34b	Initial, limited support for quantified patterns in Matcher, and tracking of ent_id attribute in Token and Span. The quantifiers need a lot more testing, and there are some known problems. The main known problem is that the zero-plus and one-plus quantifiers won't work if a token can match both the quantified pattern expression AND the tail of the match.	2016-09-21 14:54:55 +02:00
Matthew Honnibal	2735b6247b	Fix orths_and_spaces in Doc.__init__	2016-09-21 14:52:05 +02:00
Matthew Honnibal	070af4af9d	Revert "* Working neural net, but features hacky. Switching to extractor." This reverts commit `7c2f1a673b`.	2016-09-21 12:26:14 +02:00
Matthew Honnibal	6b202ec43f	Merge branch 'master' of ssh://github.com/spacy-io/spaCy	2016-09-21 12:08:25 +02:00
Mahmoud Lababidi	4c9ccc3b8b	Add parameter to download() for application to not exit if a Model exists. The default behavior is unchanged.	2016-09-14 10:04:09 -04:00
Adam Ever Hadani	f1c0762443	exit code 0 for when downloading a model that already was downloaded	2016-07-13 16:22:14 -07:00
Matthew Honnibal	7c2f1a673b	* Working neural net, but features hacky. Switching to extractor.	2016-05-26 19:06:10 +02:00
Matthew Honnibal	cdc10e9a1c	* Fix Issue #375 : noun phrase iteration results in index error if noun phrases are merged during the loop. Fix by accumulating the spans inside the noun_chunks property, allowing the Span index tricks to work.	2016-05-20 10:14:06 +02:00
Matthew Honnibal	13fad36e49	* Cosmetic change to english noun chunks iterator -- use enumerate instead of range loop	2016-05-20 10:11:05 +02:00
Matthew Honnibal	02276cc444	Merge branch 'master' of ssh://github.com/spacy-io/spaCy	2016-05-17 16:56:22 +02:00
Matthew Honnibal	4d7f5468bb	* Change Language class to use a .pipeline attribute, instead of having the pipeline hard coded	2016-05-17 16:55:42 +02:00
Daylen Yang	5405e7dd73	Fix get_lang_class parsing (take 2)	2016-05-16 16:40:31 -07:00
Matthew Honnibal	b240104f40	Revert "Fix get_lang_class parsing"	2016-05-17 08:04:26 +10:00
Daylen Yang	1692c2df3c	Fix get_lang_class parsing. We want the get_lang_class to return "en" for both "en" and "en_glove_cc_300_1m_vectors". Changed the split rule to "_" so that this happens.	2016-05-16 14:38:20 -07:00
Matthew Honnibal	17137f5c0c	* Fix issue #372 : mistake in Lexeme rich comparison	2016-05-12 12:58:57 +02:00
Matthew Honnibal	cc8bf62208	* Fix Issue #360 : Tokenizer failed when the infix regex matched the start of the string while trying to tokenize multi-infix tokens.	2016-05-09 13:23:47 +02:00
Matthew Honnibal	c61ee8f9fa	* Increment version	2016-05-09 13:20:00 +02:00
Matthew Honnibal	5d86c30f0b	* Fix Issue #367 : Missing has_vector property on Doc and Span objects	2016-05-09 12:36:14 +02:00
Wolfgang Seeker	7b78239436	add fix for German noun chunk iterator (issue #365 )	2016-05-06 01:41:26 +02:00
Matthew Honnibal	8c0888d6cb	* Fix error in span.sent	2016-05-06 00:28:05 +02:00
Matthew Honnibal	bb94022975	* Fix Issue #365 : Error introduced during noun phrase chunking, due to use of corrected PRON/PROPN/etc tags.	2016-05-06 00:21:05 +02:00
Matthew Honnibal	41342ca79b	Merge branch 'master' of ssh://github.com/spacy-io/spaCy	2016-05-06 00:17:58 +02:00
Matthew Honnibal	26095f9722	* Add span.sent property, re Issue #366	2016-05-06 00:17:38 +02:00
Wolfgang Seeker	dbf8f5f3ec	fix bug in StateC.set_break()	2016-05-05 15:15:34 +02:00
Wolfgang Seeker	3c44b5dc1a	call deprojectivization after parsing	2016-05-05 15:10:36 +02:00
Matthew Honnibal	472f576b82	* Deprojectivize German parses	2016-05-05 15:01:10 +02:00
Matthew Honnibal	9bbd6cf031	* Work on Chinese support	2016-05-05 11:39:12 +02:00
Matthew Honnibal	a6a25166ba	* Remove print from test	2016-05-05 11:10:59 +02:00
Matthew Honnibal	e31df66d26	* Fix Issue #361 : Lexemes didn't have rich comparison.	2016-05-05 01:32:26 +02:00
Matthew Honnibal	7441ca30ee	* Add tests for Issue #361 : Lexeme rich comparison	2016-05-05 01:31:58 +02:00
Matthew Honnibal	72564213e3	* Add test for Issue #309	2016-05-04 16:00:28 +02:00
Matthew Honnibal	76f1d871da	Merge branch 'master' of ssh://github.com/spacy-io/spaCy	2016-05-04 15:54:00 +02:00
Matthew Honnibal	519366f677	* Fix Issue #351 : Indices off when leading whitespace	2016-05-04 15:53:36 +02:00
Matthew Honnibal	b4bfc6ae55	* Add test for Issue #351 : Indices off when leading whitespace	2016-05-04 15:53:17 +02:00
Matthew Honnibal	76021cb853	* Fix bug in Doc.text, introduced by `a862edc`	2016-05-04 11:02:16 +02:00
Wolfgang Seeker	e4ea2bea01	fix whitespace	2016-05-04 07:40:38 +02:00
Wolfgang Seeker	5bf2fd1f78	make the code less cryptic	2016-05-03 17:19:05 +02:00
Wolfgang Seeker	a06fca9fdf	German noun chunk iterator now doesn't return tokens more than once	2016-05-03 16:58:59 +02:00
Wolfgang Seeker	7825b75548	add tests for German noun chunker	2016-05-03 15:01:28 +02:00
Wolfgang Seeker	7b246c13cb	reformulate noun chunk tests for English	2016-05-03 14:24:35 +02:00
Wolfgang Seeker	1786331cd8	add model sanity test	2016-05-03 12:51:47 +02:00
Matthew Honnibal	1f1532142f	* Fix cost calculation on non-monotonic oracle	2016-05-03 00:21:08 +02:00
Matthew Honnibal	377a624046	Merge pull request #358 from wbwseeker/german_lemmatizer_dummy German lemmatizer dummy	2016-05-03 07:38:26 +10:00
Wolfgang Seeker	92bfbebeec	remove unnecessary imports	2016-05-02 17:33:22 +02:00
Wolfgang Seeker	857454ffa0	fix indentation -.-	2016-05-02 17:10:41 +02:00
Matthew Honnibal	308a28c26c	* Whitespace	2016-05-02 16:08:11 +02:00
Matthew Honnibal	29a114e645	* Don't assign 0-valued tags in Doc.from_array	2016-05-02 16:07:50 +02:00
Matthew Honnibal	c1c11a8ae0	* Fix formatting on serializer tests	2016-05-02 16:07:21 +02:00
Wolfgang Seeker	dae6bc05eb	define German dummy lemmatizer until morphology is done	2016-05-02 16:04:53 +02:00
Matthew Honnibal	6e1f1c4b9e	Merge pull request #357 from wbwseeker/german_ner German ner	2016-05-02 23:39:34 +10:00
Wolfgang Seeker	b6b96b233c	don't require read_json_file to expect particular annotations	2016-05-02 15:29:30 +02:00
Matthew Honnibal	902a389d85	* Fix merge conflict in test_parse	2016-05-02 15:28:07 +02:00
Matthew Honnibal	276fbe9996	* Fix assignment of iterator on Doc object	2016-05-02 15:26:24 +02:00
Matthew Honnibal	02c23cc1d0	* Fix sentence boundary test	2016-05-02 15:26:07 +02:00
Matthew Honnibal	d2f469b809	* Fix parsing tests, so that labels are added if they're missing, and so that the branching test values are correct	2016-05-02 15:25:27 +02:00
Wolfgang Seeker	b11cbb06c6	remove old tests for sentence boundary detection	2016-05-02 14:36:35 +02:00
Matthew Honnibal	508fd1f6dc	* Refactor noun chunk iterators, so that they're simple functions. Install the iterator when the Doc is created, but allow users to write to the noun_chunk_iterator attribute. The iterator functions accept an object and yield (int start, int end, int label) triples.	2016-05-02 14:25:10 +02:00
Matthew Honnibal	e526be5602	Merge branch 'master' of ssh://github.com/spacy-io/spaCy	2016-05-02 13:08:08 +02:00
Wolfgang Seeker	fa961ea694	add tests for serialization bug	2016-05-02 11:01:56 +02:00
Matthew Honnibal	97b2bba249	* Merge updated/simplified Break approach	2016-04-25 19:44:42 +00:00
Matthew Honnibal	77609588b6	* Fix assignment of root label to words left as root implicitly, after parsing ends.	2016-04-25 19:41:59 +00:00
Matthew Honnibal	7c2d2deaa7	* Revise transition system so that the Break transition retains sole responsibility for setting sentence boundaries. Re Issue #322	2016-04-25 19:41:59 +00:00
Wolfgang Seeker	c2f76a4024	Merge branch 'master' into german_ner	2016-04-25 13:21:23 +02:00
Wolfgang Seeker	1003e7ccec	remove debug output from tests	2016-04-25 12:12:40 +02:00
Wolfgang Seeker	f57f843e85	fix bug in updating tree structure when introducing additional roots	2016-04-25 12:01:19 +02:00
Matthew Honnibal	478a8d1829	* Register Chinese language in spacy/__init__.py	2016-04-24 18:45:16 +02:00
Matthew Honnibal	8569dbc2d0	* Add initial stuff for Chinese parsing	2016-04-24 18:44:24 +02:00
Wolfgang Seeker	4d7f393fae	don't require json-files to have syntactic annotation	2016-04-22 16:32:27 +02:00
Wolfgang Seeker	b6477fc4f4	adjusted tests to Travis Setup	2016-04-21 17:15:10 +02:00
Wolfgang Seeker	736ffcb9a2	remove whitespace	2016-04-21 16:55:55 +02:00
Wolfgang Seeker	6c7301cc6d	the parser now introduces sentence boundaries properly when predicting dependents with root labels	2016-04-21 16:50:53 +02:00

2878 Commits