spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-15 22:27:12 +03:00

Author	SHA1	Message	Date
Ines Montani	f324311249	Add global language data utils	2016-12-17 12:27:41 +01:00
Ines Montani	487ce1e20a	Add encoding declaration	2016-12-17 12:25:44 +01:00
Ines Montani	d8d50a0334	Add tokenizer exception for "gonna" (fixes #691)	2016-12-17 11:59:28 +01:00
Ines Montani	c69b77d8aa	Revert "Add exception for "gonna"" This reverts commit `280c03f67b`.	2016-12-17 11:56:44 +01:00
Ines Montani	280c03f67b	Add exception for "gonna"	2016-12-17 11:54:59 +01:00
Ines Montani	5031a015e2	Fix typo in stopwords (fixes #689)	2016-12-15 17:57:06 +01:00
Janneke van der Zwaan	4a3fdcce8a	Merge github.com:explosion/spaCy into dutch	2016-12-13 09:25:23 +01:00
Matthew Honnibal	5965d3c2a7	Revert "Add acl to symbols.pyx"	2016-12-12 10:10:28 +11:00
Matthew Honnibal	6dee76dfed	Update symbols.pxd	2016-12-12 10:09:58 +11:00
Pokey Rule	18a15c0777	Add acl to symbols.pyx	2016-12-11 20:00:07 +00:00
Gyorgy Orosz	0cf2144d24	Adding partial hyphen and quote handling support.	2016-12-11 00:14:36 +01:00
Gyorgy Orosz	2051726fd3	Passing Hungarian abbrev tests.	2016-12-10 23:37:58 +01:00
Ines Montani	63024466a9	Add Portuguese stopwords	2016-12-08 20:45:07 +01:00
Ines Montani	7bfe2d4abc	Update Portuguese language data	2016-12-08 20:41:41 +01:00
Ines Montani	c0c5f31950	Remove unused data and download script	2016-12-08 20:39:49 +01:00
Ines Montani	0a6d529104	Remove unused data	2016-12-08 20:36:56 +01:00
Ines Montani	1b3b043660	Add French stopwords	2016-12-08 20:12:43 +01:00
Ines Montani	8863e504eb	Update French language data	2016-12-08 20:07:14 +01:00
Ines Montani	7cb9f51be6	Add Italian stopwords	2016-12-08 20:05:25 +01:00
Ines Montani	470a0e0bea	Update Italian language data	2016-12-08 19:52:18 +01:00
Ines Montani	1a284d342e	Add Spanish language data	2016-12-08 19:47:03 +01:00
Ines Montani	0c39654786	Remove unused import	2016-12-08 19:46:53 +01:00
Ines Montani	e47ee94761	Split punctuation into its own file	2016-12-08 19:46:43 +01:00
Ines Montani	70b51ed7c8	Remove time from German language data	2016-12-08 19:45:50 +01:00
Ines Montani	e8ae588be9	Add emoticons	2016-12-08 19:45:18 +01:00
Ines Montani	5908c0ed9f	Fix formatting	2016-12-08 19:45:11 +01:00
Ines Montani	311b30ab35	Reorganize exceptions for English and German	2016-12-08 13:58:32 +01:00
Ines Montani	66c7348cda	Add update_exc util function	2016-12-08 13:58:12 +01:00
Ines Montani	1256232fad	Fix formatting	2016-12-08 13:56:40 +01:00
Ines Montani	8e977cc71c	Fix formatting	2016-12-08 13:56:17 +01:00
Ines Montani	0176b99004	Fix formatting	2016-12-08 12:48:02 +01:00
Ines Montani	877f09218b	Add more custom rules for abbreviations	2016-12-08 12:47:01 +01:00
Gyorgy Orosz	0289b8ceaa	Additional abbreviation tests.	2016-12-08 12:17:44 +01:00
Gyorgy Orosz	90d22db023	Added Hungarian resource files.	2016-12-08 12:06:36 +01:00
Ines Montani	bfaa42636c	Update language data for German	2016-12-08 12:01:09 +01:00
Ines Montani	ec44bee321	Fix capitalization on morphological features	2016-12-08 12:00:54 +01:00
Gyorgy Orosz	5b00039955	First steps towards the Hungarian tokenizer code.	2016-12-07 23:07:43 +01:00
Ines Montani	ce979553df	Resolve conflict	2016-12-07 21:16:52 +01:00
Ines Montani	8350d65695	Change morphology and lemmatizer API Take morphology features as object instead of keyword arguments	2016-12-07 21:12:49 +01:00
Ines Montani	52e7d634df	Remove trailing whitespace	2016-12-07 21:12:19 +01:00
Ines Montani	0d07d7fc80	Apply emoticon exceptions to tokenizer	2016-12-07 21:11:59 +01:00
Ines Montani	71f0f34cb3	Fix formatting	2016-12-07 21:11:29 +01:00
Ines Montani	9413bcd9ee	Declare encoding and unicode literals	2016-12-07 21:10:34 +01:00
Ines Montani	a280ff2657	Fix __all__	2016-12-07 21:10:12 +01:00
Ines Montani	ba8721953c	Add missing emoticons	2016-12-07 21:09:44 +01:00
Ines Montani	1285c4ba93	Update English language data	2016-12-07 20:33:28 +01:00
Ines Montani	79dce0aabe	Add emoticons	2016-12-07 20:33:28 +01:00
Ines Montani	a662a95294	Add line breaks	2016-12-07 20:33:28 +01:00
Ines Montani	07f0efb102	Add test for tokenizer regular expressions	2016-12-07 20:33:28 +01:00
Ines Montani	e0712d1b32	Reformat language data	2016-12-07 20:33:28 +01:00
Matthew Honnibal	0c0f4c965d	Increment version	2016-12-03 11:16:52 +01:00
Matthew Honnibal	f6e356aada	Add (and test) Span.sentiment attribute. By default we average token.sentiment, but can override with custom hook. Re Issue #667	2016-12-02 11:05:50 +01:00
Janneke van der Zwaan	88869e0e07	Merge github.com:explosion/spaCy into dutch	2016-11-30 17:13:39 +01:00
Janneke van der Zwaan	51ade86b86	Update language data with tag map from UD_Dutch	2016-11-30 14:41:23 +01:00
Janneke van der Zwaan	90f6ff12c9	Update Dutch language data - Use Dutch tag map - remove tokenizer exceptions	2016-11-30 11:59:39 +01:00
dafnevk	7b8f4c49f2	Added language Dutch to init file	2016-11-29 16:42:05 +01:00
Matthew Honnibal	296d33a4fc	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-11-26 12:36:18 +01:00
Matthew Honnibal	1f6c37c6f5	Fix create_tokenizer when nlp is None	2016-11-26 12:36:04 +01:00
Matthew Honnibal	c7889492f9	Fix model saving error for Python 3	2016-11-25 18:04:30 -06:00
Matthew Honnibal	bc0a202c9c	Fix unicode problem in nonproj module	2016-11-25 17:29:17 -06:00
Matthew Honnibal	6dd3b94fa6	Filter out deprecated attributes when reading special-case tokenization rules.	2016-11-25 09:57:18 -06:00
Matthew Honnibal	e879c79b8c	Merge branch 'master' of https://github.com/explosion/spaCy	2016-11-25 09:18:28 -06:00
Matthew Honnibal	a335c6dcc2	Exclude morphs from deprecated token attributes for now	2016-11-25 16:17:32 +01:00
Matthew Honnibal	f799a07f25	Merge branch 'master' of https://github.com/explosion/spaCy	2016-11-25 09:16:43 -06:00
Matthew Honnibal	159e8c46e1	Merge old training fixes with newer state	2016-11-25 09:16:36 -06:00
Matthew Honnibal	846e80f2f4	Exclude morphs from deprecated token attributes for now	2016-11-25 16:14:54 +01:00
Matthew Honnibal	664f2dd1c0	Allow dep to be None in scorer, for missing labels.	2016-11-25 09:02:49 -06:00
Matthew Honnibal	39341598bb	Fix NER label calculation	2016-11-25 09:02:22 -06:00
Matthew Honnibal	ca773a1f53	Tweak arc_eager n_gold to deal with negative costs, and improve error message.	2016-11-25 09:01:52 -06:00
Matthew Honnibal	a2f55e7015	Pass cfg through loading, for training.	2016-11-25 09:01:20 -06:00
Matthew Honnibal	608d8f5421	Pass cfg through parser, and have is_valid default to 1, not 0 when resetting state	2016-11-25 09:00:21 -06:00
Matthew Honnibal	cc7e607a8a	Fix gold.pyx for 1.0	2016-11-25 08:57:59 -06:00
root	080d29e092	Fix train.py for 1.0	2016-11-25 08:55:33 -06:00
Matthew Honnibal	6652f2a135	Test #656, #624: special case rules for tokenizer with attributes.	2016-11-25 12:44:13 +01:00
Matthew Honnibal	1e0f566d95	Fix #656, #624: Support arbitrary token attributes when adding special-case rules.	2016-11-25 12:43:24 +01:00
Matthew Honnibal	87613edf8f	Add set_struct_attr staticmethod to token	2016-11-25 12:41:47 +01:00
Matthew Honnibal	fb69aa648f	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-11-25 11:35:44 +01:00
Matthew Honnibal	9a03a3f85e	Add get_struct_attr staticmethod to Token, to match Lexeme.get_struct_attr.	2016-11-25 11:35:17 +01:00
Matthew Honnibal	53d8ca8f51	Add spacy.attrs.intify_attrs function, to normalize strings in token attribute dictionaries.	2016-11-25 11:34:30 +01:00
Ines Montani	d21ad01840	Add emoticons	2016-11-24 19:13:00 +01:00
dafnevk	d8c7ac203a	Added nl module for dutch	2016-11-24 16:39:49 +01:00
dafnevk	3db8b0d322	Added language class and some language data (with some TODOs) for Dutch	2016-11-24 15:56:38 +01:00
Ines Montani	4dcfafde02	Add line breaks	2016-11-24 14:57:37 +01:00
Ines Montani	6247c005a2	Add test for tokenizer regular expressions	2016-11-24 13:51:59 +01:00
Ines Montani	de747e39e7	Reformat language data	2016-11-24 13:51:32 +01:00
Matthew Honnibal	b8c4f5ea76	Allow German noun chunks to work on Span Update the German noun chunks iterator, so that it also works on Span objects.	2016-11-24 23:30:15 +11:00
Pokey Rule	3e3bda142d	Add noun_chunks to Span	2016-11-24 10:47:20 +00:00
Janneke van der Zwaan	83daade0e4	Add directory and initial (empty) files for language Dutch	2016-11-24 09:45:41 +01:00
Matthew Honnibal	09f68bc641	Fix Issue #639: stop words in language class not used. This patch is messy, but it's better not to change too much until the language data loading can be properly refactored.	2016-11-24 00:13:55 +01:00
Matthew Honnibal	48e1dc29d4	Fix default path loading.	2016-11-23 23:48:55 +01:00
Matthew Honnibal	e01c1875ee	Work on test for #615	2016-11-23 23:48:41 +01:00
ExplodingCabbage	6c4f488e89	Fix syntax mistake	2016-11-23 15:12:45 +00:00
Matthew Honnibal	60eb2343ce	Only try to load vectors if they exist.	2016-11-23 13:50:24 +01:00
Matthew Honnibal	618ac36093	Fix use of path argument in Language.__init__. Needs to be keyword arg, not positional.	2016-11-23 13:26:34 +01:00
Mark Amery	fbe19680a6	Fix another bug related to Language.__init__'s path parameter	2016-11-20 20:31:34 +00:00
Mark Amery	b0a07c21a0	Fix `path` param of `Language.__init__` always being ignored There was an explicitly-declared `path` keyword argument, so 'path' would never be present in `**overrides`. This line just overwrote any manually-specified value the user might've passed to the `path` parameter.	2016-11-20 16:29:57 +00:00
Mark Amery	1988fce389	Merge remote-tracking branch 'origin/master' into specify-data-path	2016-11-20 16:07:14 +00:00
Mark Amery	3871007c72	Let --data-path be specified when running download.py scripts Resolves https://github.com/explosion/spaCy/issues/637	2016-11-20 15:48:04 +00:00
Ines Montani	dad2c6cae9	Strip trailing whitespace	2016-11-20 16:45:51 +01:00
Ines Montani	3082e49326	Update and reformat German stopwords	2016-11-20 16:45:26 +01:00
Sourav Singh	6745eac309	Update language_data.py	2016-11-20 19:52:02 +05:30
Sourav Singh	4d9aae7d6a	Add German Stopwords	2016-11-19 22:47:53 +05:30
Matthew Honnibal	7afb2544a7	Merge pull request #627 from sadovnychyi/patch-1 Remove duplicated line of vocab declaration	2016-11-16 06:09:18 +11:00
Yanhao	762169da29	Fixed bug: eg.guess is a tag id, rather than tag	2016-11-15 14:11:22 +08:00
Dmytro Sadovnychyi	e70a7050e1	Remove duplicated line of vocab declaration As already declared on line 211.	2016-11-13 18:52:49 +08:00
Matthew Honnibal	f123f92e0c	Fix #617: Vocab.load() required Path. Should work with string as well.	2016-11-10 22:48:48 +01:00
Matthew Honnibal	e86f440ca6	Fix test for issue 617	2016-11-10 22:48:10 +01:00
Matthew Honnibal	faa7610c56	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-11-10 22:46:38 +01:00
Matthew Honnibal	a2c7de8329	spacy/tests/regression/test_issue617.py Test Issue #617	2016-11-10 22:46:23 +01:00
tiago	2a3e342c1f	Added a test case to cover the span.merge returning values	2016-11-09 18:57:50 +00:00
tiago	b38cfd0ef9	now span.merge returns token like it says on documentation	2016-11-09 14:58:19 +00:00
Dmitry Sadovnychyi	9488222e79	Fix PhraseMatcher to work with updated Matcher #613	2016-11-09 00:14:26 +08:00
Dmitry Sadovnychyi	86c056ba64	Add basic test for PhraseMatcher #613	2016-11-09 00:10:32 +08:00
Matthew Honnibal	3ea15b257f	Fix test for 605	2016-11-06 11:59:26 +01:00
Matthew Honnibal	efe7790439	Test #590: Order dependence in Matcher rules.	2016-11-06 11:21:36 +01:00
Matthew Honnibal	5cd3acb265	Fix #605: Acceptor now rejects matches as expected.	2016-11-06 10:50:42 +01:00
Matthew Honnibal	75805397dd	Test Issue #605	2016-11-06 10:42:32 +01:00
Matthew Honnibal	014b6936ac	Fix #608 -- __version__ should be available at the base of the package.	2016-11-04 21:21:02 +01:00
Matthew Honnibal	42b0736db7	Increment version	2016-11-04 20:04:21 +01:00
Matthew Honnibal	9f93386994	Update version	2016-11-04 19:28:16 +01:00
Matthew Honnibal	1fb09c3dc1	Fix morphology tagger	2016-11-04 19:19:09 +01:00
Matthew Honnibal	a36353df47	Temporarily put back the tokenize_from_strings method, while tests aren't updated yet.	2016-11-04 19:18:07 +01:00
Matthew Honnibal	f0917b6808	Fix Issue #376: and/or was tagged as a noun.	2016-11-04 15:21:28 +01:00
Matthew Honnibal	737816e86e	Fix #368: Tokenizer handled pattern 'unicode close quote, period' incorrectly.	2016-11-04 15:16:20 +01:00
Matthew Honnibal	ab952b4756	Fix #578 -- Sputnik had been purging all files on --force, not just the relevant one.	2016-11-04 10:44:11 +01:00
Matthew Honnibal	6e37ba1d82	Fix #602, #603 --- Broken build	2016-11-04 09:54:24 +01:00
Matthew Honnibal	293c79c09a	Fix #595: Lemmatization was incorrect for base forms, because morphological analyser wasn't adding morphology properly.	2016-11-04 00:29:07 +01:00
Matthew Honnibal	e30348b331	Prefer to import from symbols instead of parts_of_speech	2016-11-04 00:27:55 +01:00
Matthew Honnibal	4a8a2b6001	Test #595 -- Bug in lemmatization of base forms.	2016-11-04 00:27:32 +01:00
Matthew Honnibal	f1605df2ec	Fix #588: Matcher should reject empty pattern.	2016-11-03 00:16:44 +01:00
Matthew Honnibal	72b9bd57ec	Test Issue #588: Matcher accepts invalid, empty patterns.	2016-11-03 00:09:35 +01:00
Matthew Honnibal	41a90a7fbb	Add tokenizer exception for 'Ph.D.', to fix 592.	2016-11-03 00:03:34 +01:00
Matthew Honnibal	532318e80b	Import Jieba inside zh.make_doc	2016-11-02 23:49:19 +01:00
Matthew Honnibal	f292f7f0e6	Fix Issue #599, by considering empty documents to be parsed and tagged. Implementation is a bit dodgy.	2016-11-02 23:48:43 +01:00
Matthew Honnibal	b6b01d4680	Remove deprecated tokens_from_list test.	2016-11-02 23:47:21 +01:00
Matthew Honnibal	3d6c79e595	Test Issue #599: .is_tagged and .is_parsed attributes not reflected after deserialization for empty documents.	2016-11-02 23:40:11 +01:00
Matthew Honnibal	05a8b752a2	Fix Issue #600: Missing setters for Token attribute.	2016-11-02 23:28:59 +01:00
Matthew Honnibal	125c910a8d	Test Issue #600	2016-11-02 23:24:13 +01:00
Matthew Honnibal	e0c9695615	Fix doc strings for tokenizer	2016-11-02 23:15:39 +01:00
Matthew Honnibal	80824f6d29	Fix test	2016-11-02 20:48:40 +01:00
Matthew Honnibal	dbe47902bc	Add import fr	2016-11-02 20:48:29 +01:00
Matthew Honnibal	8f24dc1982	Fix infixes in Italian	2016-11-02 20:43:52 +01:00
Matthew Honnibal	41a4766c1c	Fix infixes in spanish and portuguese	2016-11-02 20:43:12 +01:00
Matthew Honnibal	3d4bd96e8a	Fix infixes in french	2016-11-02 20:41:43 +01:00
Matthew Honnibal	c09a8ce5bb	Add test for french tokenizer	2016-11-02 20:40:31 +01:00
Matthew Honnibal	b012ae3044	Add test for loading languages	2016-11-02 20:38:48 +01:00
Matthew Honnibal	ad1c747c6b	Fix stray POS in language stubs	2016-11-02 20:37:55 +01:00
Matthew Honnibal	e9e6fce576	Handle null prefix/suffix/infix search in tokenizer	2016-11-02 20:35:48 +01:00
Matthew Honnibal	22647c2423	Check that patterns aren't null before compiling regex for tokenizer	2016-11-02 20:35:29 +01:00
Matthew Honnibal	5ac735df33	Link languages in __init__.py	2016-11-02 20:05:14 +01:00
Matthew Honnibal	c68dfe2965	Stub out support for Italian	2016-11-02 20:03:24 +01:00
Matthew Honnibal	6dbf4f7ad7	Stub out support for French, Spanish, Italian and Portuguese	2016-11-02 20:02:41 +01:00
Matthew Honnibal	6b8b05ef83	Specify that spacy.util is encoded in utf8	2016-11-02 19:58:00 +01:00
Matthew Honnibal	5363224395	Add draft Jieba tokenizer for Chinese	2016-11-02 19:57:38 +01:00
Matthew Honnibal	f7fee6c24b	Check for class-defined make_docs method before assigning one provided as an argument	2016-11-02 19:57:13 +01:00
Matthew Honnibal	19c1e83d3d	Work on draft Italian tokenizer	2016-11-02 19:56:32 +01:00
Matthew Honnibal	9efe568177	Add missing unicode_literals to spacy.util. I think this was messing up the tokenizer regex for non-ascii characters in Python 2. Re Issue #596	2016-11-02 12:31:34 +01:00
Matthew Honnibal	d8db648ebf	Add __init__.py file for regression tests	2016-11-01 13:45:06 +01:00
Matthew Honnibal	11664b9f20	Fix variable error in token	2016-11-01 13:28:00 +01:00
Matthew Honnibal	8c4d1b46ce	Fix variable error in Span	2016-11-01 13:27:44 +01:00
Matthew Honnibal	e7af6b937f	Fix syntax error while fixing doc strings	2016-11-01 13:27:32 +01:00
Matthew Honnibal	62fc6b1afa	Use 32 bit hashes for OOV, re Issue #589, Issue #285	2016-11-01 13:27:13 +01:00
Matthew Honnibal	6977a2b8cd	Add test for Issue #589	2016-11-01 12:33:36 +01:00
Matthew Honnibal	b86f8af0c1	Fix doc strings	2016-11-01 12:25:36 +01:00
Matthew Honnibal	d563f1eadb	Fix Issue #587: Segfault in Matcher, due to simple error in the state machine.	2016-10-28 17:42:00 +02:00
Matthew Honnibal	7e5f63a595	Improve test slightly	2016-10-28 17:41:16 +02:00
Matthew Honnibal	782e4814f4	Test Issue #587: Matcher segfaults on particular input	2016-10-28 16:38:32 +02:00
Matthew Honnibal	708ea22208	Infer types in transition_system.pyx	2016-10-27 18:08:13 +02:00
Matthew Honnibal	18590eba94	Fix training evaluate method	2016-10-27 18:02:19 +02:00
Matthew Honnibal	301f3cc898	Fix Issue #429. Add an initialize_state method to the named entity recogniser that adds missing entity types. This is a messy place to add this, because it's strange to have the method mutate state. A better home for this logic could be found.	2016-10-27 18:01:55 +02:00
Matthew Honnibal	afea6505f3	Test Issue 429: No valid actions for NER after matcher adds a new entity label.	2016-10-27 18:01:34 +02:00
Matthew Honnibal	03a520ec4f	Change signature of Parser.parseC, so that nr_class is read from the transition system. This allows the transition system to modify the number of actions in initialize_state.	2016-10-27 17:58:56 +02:00
Matthew Honnibal	6c47048912	Fix test, after IOB tweak.	2016-10-26 17:22:03 +02:00
Matthew Honnibal	4ca31b4d87	Fix clobbering of 'missing' named ent values after assigning ents.	2016-10-26 13:13:56 +02:00
Matthew Honnibal	cb49189477	Remove dead code	2016-10-26 13:11:07 +02:00
Matthew Honnibal	a209b10579	Improve error message when oracle fails for non-projective trees, re Issue #571.	2016-10-24 20:31:30 +02:00
Matthew Honnibal	b2d43b93d2	Fix Python 3 basestring error	2016-10-24 14:22:51 +02:00
Matthew Honnibal	276478fe0f	Update strings.pxd	2016-10-24 14:00:35 +02:00
Matthew Honnibal	d8134817ff	Workaround Issue #285: Allow the StringStore to be 'frozen', in which case strings will be pushed into an OOV map. We can then flush this OOV map, freeing all of the OOV strings.	2016-10-24 13:49:03 +02:00
Matthew Honnibal	d3a617aa99	Test workaround for Issue #285: Streaming data memory growth	2016-10-24 13:48:06 +02:00
Matthew Honnibal	64e5f02cf7	Update test	2016-10-23 21:08:07 +02:00
Matthew Honnibal	66d7a6eca2	Update test	2016-10-23 21:02:05 +02:00
Matthew Honnibal	90bf797125	Update test	2016-10-23 20:54:17 +02:00
Matthew Honnibal	5e76320ffe	Update test	2016-10-23 20:44:54 +02:00
Matthew Honnibal	aa105927f3	Update test	2016-10-23 20:31:25 +02:00
Matthew Honnibal	6b9237aa83	Increment version	2016-10-23 20:22:53 +02:00
Matthew Honnibal	150e02d72e	Fix Issue #566	2016-10-23 20:19:01 +02:00
Matthew Honnibal	e120561294	Fix vector_norm test.	2016-10-23 19:56:16 +02:00
Matthew Honnibal	fefde8aef8	Make installation print data path.	2016-10-23 19:46:44 +02:00
Matthew Honnibal	e7414cd064	Try to fix weird install glitch.	2016-10-23 19:46:28 +02:00
Matthew Honnibal	90f7544edd	Increment version	2016-10-23 19:43:06 +02:00
Matthew Honnibal	6036ec7c77	Fix vector norm when loading lexemes.	2016-10-23 19:40:18 +02:00
Matthew Honnibal	c05cd2356e	Fix similarity test for Python 3	2016-10-23 18:16:56 +02:00
Matthew Honnibal	3e688e6d4b	Fix issue #514 -- serializer fails when new entity type has been added. The fix here is quite ugly. It's best to add the entities ASAP after loading the NLP pipeline, to mitigate the brittleness.	2016-10-23 17:45:44 +02:00
Matthew Honnibal	79aa03fe98	Test Issue #514: Serializer fails when new entity type has been added.	2016-10-23 17:41:44 +02:00
Matthew Honnibal	f97548c6f1	Fix broken test, re Issue #461	2016-10-23 17:02:23 +02:00
Matthew Honnibal	4de30a8e38	Test Issue #514: Serialization fails after adding a new entity label.	2016-10-23 16:40:27 +02:00
Matthew Honnibal	936e6246aa	Fix Issue #459 -- failed to deserialize empty doc.	2016-10-23 16:31:05 +02:00
Matthew Honnibal	e99b3f5322	Test Issue #459: Fail to deserialize empty doc	2016-10-23 16:30:22 +02:00
Matthew Honnibal	49c117960c	Fix bug where huffman codec died if given empty freqs dict.	2016-10-23 16:28:05 +02:00
Matthew Honnibal	99ff8b902f	Test that huffman codec works with empty freqs dict	2016-10-23 16:27:45 +02:00
Matthew Honnibal	15c9b59f0e	Fix Issue #461: O tag was being clobbered by doc.ents.__set__	2016-10-23 15:50:26 +02:00
Matthew Honnibal	e5627134d9	Test Issue #461: ent_iob tag incorrect after setting entities.	2016-10-23 15:50:04 +02:00
Matthew Honnibal	f62088d646	Fix compile error	2016-10-23 14:50:50 +02:00
Matthew Honnibal	2c3a67b693	Fix calculation of vector norm, re Issue #522. Need to consolidate the calculations into a helper function.	2016-10-23 14:49:31 +02:00
Matthew Honnibal	a0a4ada42a	Fix calculation of L2-norm for Lexeme	2016-10-23 14:44:45 +02:00
Matthew Honnibal	2989072aac	Add tests to verify that Issue #442 is fixed in 1.1	2016-10-23 14:33:13 +02:00
Matthew Honnibal	739213a8af	Fix create_pipeline keyword argument.	2016-10-23 14:24:16 +02:00
Matthew Honnibal	bea44bd3c4	Fix vector_norm when vector is assigned to Lexeme.	2016-10-23 14:23:56 +02:00
Matthew Honnibal	e838b6d53f	Add tests for using the new Entity ID tracking in the rule matcher	2016-10-23 14:04:01 +02:00
Matthew Honnibal	e7af75e0a9	Add test for vector resizing, re Issue #544	2016-10-21 17:07:21 +02:00
Matthew Honnibal	ca8ea33abc	Bump version to 1.1.0	2016-10-21 16:30:57 +02:00
Matthew Honnibal	7ab03050d4	Add resize_vectors method to Vocab	2016-10-21 01:44:50 +02:00
Matthew Honnibal	8ce8803824	Fix JSON in tokenizer	2016-10-21 01:44:20 +02:00
Matthew Honnibal	6eb73a095f	Fix JSON in tagger	2016-10-21 01:44:10 +02:00
Matthew Honnibal	e16e78a737	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-10-21 00:00:15 +02:00
Matthew Honnibal	147373c807	Increment version	2016-10-21 00:00:03 +02:00
Matthew Honnibal	e80944276f	Fix Span.vector_norm	2016-10-20 21:58:56 +02:00
Matthew Honnibal	f5fe4f595b	Fix json loading, for Python 3.	2016-10-20 21:23:26 +02:00
Matthew Honnibal	2e92c6fb3a	Fix JSON encoding issue on load	2016-10-20 21:06:48 +02:00
Matthew Honnibal	4ad7bb96c9	Increment version.	2016-10-20 20:48:30 +02:00
Matthew Honnibal	5ec32f5d97	Fix loading of GloVe vectors, to address Issue #541	2016-10-20 18:27:48 +02:00
Matthew Honnibal	ddeabd76c4	Fix mistake loading GloVe vectors. GloVe vectors now loaded by default if present, as promised.	2016-10-20 16:57:53 +02:00
Matthew Honnibal	bfe5cb1244	Increment version.	2016-10-20 14:52:00 +02:00
Matthew Honnibal	f189a3cb00	Fix encoding when opening files in Python 2.7, re Issue #539	2016-10-20 14:42:56 +02:00
Matthew Honnibal	c353a5214d	Increment version	2016-10-19 23:51:01 +02:00
Matthew Honnibal	d10c17f2a4	Fix Issue #536: oov_prob was 0 for OOV words.	2016-10-19 23:38:47 +02:00
Matthew Honnibal	dfa752d064	Increment version	2016-10-19 23:19:13 +02:00
Matthew Honnibal	3588a18fb8	Fix hook names in doc	2016-10-19 21:15:16 +02:00
Matthew Honnibal	5d5742b773	Add sentiment field to doc, rename getters_for_tokens and getters_for_spans, add user_hooks field to Doc.	2016-10-19 20:54:22 +02:00
Matthew Honnibal	ed5e178817	Add sentiment property on lexeme object	2016-10-19 20:52:52 +02:00
Matthew Honnibal	d4aaf2752c	Fix issue #535: Pipeline elements added even when data not installed.	2016-10-19 19:55:19 +02:00
Matthew Honnibal	04d1c959da	Fix version	2016-10-19 03:45:37 +02:00
Matthew Honnibal	d35aa7344e	Change version ID to make PyPi happy	2016-10-19 03:24:39 +02:00
Matthew Honnibal	89d2a5c8b3	Increment build version.	2016-10-19 03:05:17 +02:00
Matthew Honnibal	622b0a9674	Tweak download script	2016-10-19 00:52:16 +02:00
Matthew Honnibal	5a5c7192a5	Fix download.py for GloVe vectors.	2016-10-19 00:47:44 +02:00
Matthew Honnibal	edc45c19d6	Update download script	2016-10-19 00:41:14 +02:00
Matthew Honnibal	2bbb050500	Fix default of serializer_freqs	2016-10-18 19:55:41 +02:00
Matthew Honnibal	1b651db9c5	Fix parser creation in Language class.	2016-10-18 19:36:44 +02:00
Matthew Honnibal	45a6f9b9c7	Fix loading of tagger.	2016-10-18 19:33:04 +02:00
Matthew Honnibal	76c815f40d	Fix spacy.load	2016-10-18 19:23:31 +02:00
Matthew Honnibal	8c8f5c62c6	Add LANG attribute to English and German	2016-10-18 18:52:48 +02:00
Matthew Honnibal	05e2a589a4	Fix None label in matcher	2016-10-18 18:05:21 +02:00
Matthew Honnibal	c3a8a1cf51	Update serializer test.	2016-10-18 16:18:46 +02:00
Matthew Honnibal	7d5212f131	Refactor defaults	2016-10-18 16:18:25 +02:00
Matthew Honnibal	a45a9d5092	Remove stray .tensor attribute from Lexeme	2016-10-18 01:16:32 +02:00
Matthew Honnibal	9258db788a	Revert "Have the matcher return character offsets, to handle the match better." This reverts commit `049c937540`.	2016-10-17 16:49:51 +02:00
Matthew Honnibal	7d446e5094	Revert "Update matcher test, to reflect character offset return instead of token offset." This reverts commit `f8d3e3bcfe`.	2016-10-17 16:49:49 +02:00
Matthew Honnibal	4bf2c53c13	Revert "Hack on matcher tests, for new implementation." This reverts commit `dbe60644ab`.	2016-10-17 16:49:48 +02:00
Matthew Honnibal	2fd97c71cc	Revert "Don't try to pickle matcher." This reverts commit `97bd0c9d00`.	2016-10-17 16:49:43 +02:00
Matthew Honnibal	97bd0c9d00	Don't try to pickle matcher.	2016-10-17 16:38:40 +02:00
Matthew Honnibal	dbe60644ab	Hack on matcher tests, for new implementation.	2016-10-17 16:12:22 +02:00
Matthew Honnibal	f8d3e3bcfe	Update matcher test, to reflect character offset return instead of token offset.	2016-10-17 16:00:10 +02:00
Matthew Honnibal	049c937540	Have the matcher return character offsets, to handle the match better.	2016-10-17 15:58:57 +02:00
Matthew Honnibal	9b60186266	Fix doc class	2016-10-17 15:23:47 +02:00
Matthew Honnibal	6cbdc94959	Lots of updates to Matcher, to make entity handling sane.	2016-10-17 15:23:31 +02:00
Matthew Honnibal	7fd98fc91c	Remove deprecation shim around str/bytes in Token.	2016-10-17 14:02:47 +02:00
Matthew Honnibal	b67697a97b	Improve API for doc.merge() and span.merge(), to use keyword arguments.	2016-10-17 14:02:13 +02:00
Matthew Honnibal	fbb7f3f15c	Add user_data attribute to Doc object.	2016-10-17 11:43:22 +02:00
Matthew Honnibal	c1abc8f6ed	Fix deprecation stuff in Token: Remove the shim for the str/unicode semantics, and raise for has_repvec and repvec	2016-10-17 11:18:41 +02:00
Matthew Honnibal	4ba9eadf3d	Merge branch 'v1.0.0-rc1' of ssh://github.com/explosion/spaCy into v1.0.0-rc1	2016-10-17 02:45:44 +02:00
Matthew Honnibal	09ab447a18	Remove tensor property from token.	2016-10-17 02:45:09 +02:00
Matthew Honnibal	5d10e2005c	Defer some attributes to Doc, via getters_for_tokens attribute.	2016-10-17 02:44:49 +02:00
Matthew Honnibal	8829984efb	Remove tensor attribute from Span and Token.	2016-10-17 02:44:04 +02:00
Matthew Honnibal	d15a88c66a	Defer some attributes to Doc via getters_for_spans	2016-10-17 02:43:35 +02:00
Matthew Honnibal	62230dd13a	Add getters_for_spans and getters_for_tokens attributes to Doc. Fix docstring	2016-10-17 02:42:51 +02:00
Matthew Honnibal	ae11ea8240	Add getters_for_tokens and getters_for_spans attributes to Doc object.	2016-10-17 02:42:05 +02:00
Matthew Honnibal	be48a7b4f3	Fix conftest for website tests.	2016-10-17 01:54:26 +02:00
Matthew Honnibal	8951bf6989	Update matcher tests	2016-10-17 01:53:24 +02:00
Matthew Honnibal	0cf4aff470	Set default path in EN/DE tests.	2016-10-17 01:52:49 +02:00
Matthew Honnibal	cd71b6b0a9	Remove test of parser pickle	2016-10-17 01:52:10 +02:00
Matthew Honnibal	5bc101006e	Add cfg field to Tagger	2016-10-17 01:03:41 +02:00
Matthew Honnibal	517f090cbf	Use GoldParse in tagger.update	2016-10-17 00:55:15 +02:00
Matthew Honnibal	59038f7efa	Restore support for prior data format -- specifically, the labels field of the config.	2016-10-17 00:53:26 +02:00
Matthew Honnibal	7887ab3b36	Fix default use of feature_templates in parser	2016-10-16 21:41:56 +02:00
Matthew Honnibal	f787cd29fe	Refactor the pipeline classes to make them more consistent, and remove the redundant blank() constructor.	2016-10-16 21:34:57 +02:00
Matthew Honnibal	311a985fe0	Add input error handling in Doc	2016-10-16 18:16:42 +02:00
Matthew Honnibal	06322ba99d	Add words and spaces keyword arguments to Doc.	2016-10-16 18:13:03 +02:00
Matthew Honnibal	ca51f3b77e	Use DependencyParser and EntityRecognizer in the Language class.	2016-10-16 17:58:12 +02:00
Matthew Honnibal	195d998a12	Fix GoldParse argument to tagger.update	2016-10-16 17:05:09 +02:00
Matthew Honnibal	274a4d4272	Fix queue Python property in StateClass	2016-10-16 17:04:41 +02:00
Matthew Honnibal	e8c8aa08ce	Make action_name optional in StepwiseState	2016-10-16 17:04:16 +02:00
Matthew Honnibal	4bb73b1a93	Fix parser labels in pipeline	2016-10-16 17:03:22 +02:00
Matthew Honnibal	a81c5a7abf	Fix name of labels keyword to 'actions'.	2016-10-16 12:00:27 +02:00
Matthew Honnibal	a079677984	Fix omission of O action when creating blank entity recognizer	2016-10-16 11:43:25 +02:00
Matthew Honnibal	5444d38cc6	Update test for biluo tags	2016-10-16 11:42:45 +02:00
Matthew Honnibal	4fc56d4a31	Rename 'labels' to 'actions' in parser options	2016-10-16 11:42:26 +02:00
Matthew Honnibal	8a6b35d266	Delay binding in MakeDoc	2016-10-16 11:41:55 +02:00
Matthew Honnibal	52b48b415e	Fix GoldParse class	2016-10-16 11:41:36 +02:00
Matthew Honnibal	3259a63779	Whitespace	2016-10-16 01:47:28 +02:00
Matthew Honnibal	509b30834f	Add a pipeline module, to collect and wrap processes for annotation	2016-10-16 01:47:12 +02:00
Matthew Honnibal	0317cea0ad	Fix GoldParse	2016-10-15 23:55:07 +02:00
Matthew Honnibal	1c62573a41	Fix spacy.train	2016-10-15 23:53:46 +02:00
Matthew Honnibal	a48aa15384	Improve the API for the GoldParse class.	2016-10-15 23:53:29 +02:00
Matthew Honnibal	e07fe92b27	Draft a refactored init for the GoldParse class	2016-10-15 22:09:52 +02:00
Matthew Honnibal	47afef7d6b	Add init.py for gold tests	2016-10-15 21:51:28 +02:00
Matthew Honnibal	86ae665c78	Add function for entity->biluo transformation	2016-10-15 21:51:04 +02:00
Matthew Honnibal	2163fd238f	Add tests for entity->biluo transformation	2016-10-15 21:50:43 +02:00
Matthew Honnibal	5e923b9bfa	Return None in match_best_version if not path exists.	2016-10-15 14:47:29 +02:00
Matthew Honnibal	2516382106	Fix loading of English in span test	2016-10-15 14:44:37 +02:00
Matthew Honnibal	dda2fc6bef	Add empty data directory	2016-10-15 14:25:25 +02:00
Matthew Honnibal	049197e0ae	Update tests, somewhat messily.	2016-10-15 14:14:04 +02:00
Matthew Honnibal	1e1a1d9517	Update matcher test	2016-10-15 14:13:41 +02:00
Matthew Honnibal	9cc9ce0f14	Load with default path=False in tests.	2016-10-15 14:13:23 +02:00
Matthew Honnibal	08e9134760	Change default value of path to True	2016-10-15 14:12:54 +02:00
Matthew Honnibal	788657f062	Ensure words are added to vocab before test, so that the lexicon is updated correctly.	2016-10-15 14:12:18 +02:00
Matthew Honnibal	4a1a2bce68	Update version in about.py	2016-10-15 13:44:27 +02:00
Matthew Honnibal	6d8cb515ac	Break the tokenization stage out of the pipeline into a function 'make_doc'. This allows all pipeline methods to have the same signature.	2016-10-14 17:38:29 +02:00
Matthew Honnibal	2cc515b2ed	Add add_flag method to Vocab, re Issue #504.	2016-10-14 12:15:38 +02:00
Matthew Honnibal	f3be9d0a9a	Add tensor field to Lexeme, Token, Doc and Span, so that users have a place to hang neural network outputs	2016-10-14 03:24:13 +02:00
Matthew Honnibal	9b55d97a8f	Update train method	2016-10-13 03:24:53 +02:00
Matthew Honnibal	645d99523a	Move merge_sents method into spacy.gold	2016-10-13 03:24:29 +02:00
Matthew Honnibal	41f88ce938	Fix dep model loading in parser	2016-10-12 20:26:38 +02:00
Matthew Honnibal	d9ae2d68af	Load features by string-name for backwards compatibility.	2016-10-12 20:15:11 +02:00
Matthew Honnibal	a42fbcf946	Require model for test_is_properties	2016-10-12 19:35:18 +02:00
Matthew Honnibal	20c948361b	Use local path in test_lemmatizer	2016-10-12 19:35:00 +02:00
Matthew Honnibal	1318d0bc65	Test with the non-loaded versions of the English and German pipelines.	2016-10-12 19:13:31 +02:00
Matthew Honnibal	0e2bedc373	Fix default labels for parser and NER	2016-10-12 19:12:40 +02:00
Matthew Honnibal	3a03c668c3	Fix message in ParserStateError	2016-10-12 14:44:31 +02:00
Matthew Honnibal	6bf505e865	Fix error on ParserStateError	2016-10-12 14:35:55 +02:00
Matthew Honnibal	ba5e048502	Add docstring for Trainer class.	2016-10-12 14:26:02 +02:00
Matthew Honnibal	847a4a4182	Refactor Language, dropping Language.blank() method.	2016-10-12 13:45:58 +02:00
Matthew Honnibal	ea23b64cc8	Refactor training, with new spacy.train module. Defaults still a little awkward.	2016-10-09 12:24:24 +02:00
Matthew Honnibal	ca32a1ab01	Revert "Work on Issue #285 : intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good." This reverts commit `8423e8627f`.	2016-09-30 20:20:22 +02:00
Matthew Honnibal	90baa9c7e6	Revert "Changes to matcher.pyx for new StringStore scheme" This reverts commit `3ff09614e0`.	2016-09-30 20:20:13 +02:00
Matthew Honnibal	1b6b129c04	Revert "Changes to morphology.pyx for new StringStore scheme" This reverts commit `95f8cfd745`.	2016-09-30 20:20:02 +02:00
Matthew Honnibal	1d70db58aa	Revert "Changes to iterators.pyx for new StringStore scheme" This reverts commit `4f794b215a`.	2016-09-30 20:19:53 +02:00
Matthew Honnibal	de01e427fd	Revert "Changes to strings.pyx for new StringStore scheme" This reverts commit `22d4752d64`.	2016-09-30 20:19:42 +02:00
Matthew Honnibal	9e09b39b9f	Revert "Changes to transition systems for new StringStore scheme" This reverts commit `0442e0ab1e`.	2016-09-30 20:11:49 +02:00
Matthew Honnibal	e3285f6f30	Revert "Fix report of ParserStateError" This reverts commit `78f19baafa`.	2016-09-30 20:11:33 +02:00
Matthew Honnibal	6736977d82	Revert "Changes to Doc and Token for new string store scheme" This reverts commit `99de44d864`.	2016-09-30 20:11:15 +02:00
Matthew Honnibal	bd7fe6420c	Revert "Changes to test for new string-store" This reverts commit `21e90d7d0b`.	2016-09-30 20:11:01 +02:00
Matthew Honnibal	1f1cd5013f	Revert "Changes to vocab for new stringstore scheme" This reverts commit `a51149a717`.	2016-09-30 20:10:30 +02:00
Matthew Honnibal	1e7d0af127	Revert "Changes to Lexeme for new string store scheme" This reverts commit `717741b6cf`.	2016-09-30 20:10:13 +02:00
Matthew Honnibal	ba51cb8325	Revert "Changes to tagger for new string store scheme" This reverts commit `f5a6aac906`.	2016-09-30 20:09:53 +02:00
Matthew Honnibal	23b7244842	Make sure symbols are unicode strings	2016-09-30 20:02:19 +02:00
Matthew Honnibal	f5a6aac906	Changes to tagger for new string store scheme	2016-09-30 20:01:51 +02:00
Matthew Honnibal	717741b6cf	Changes to Lexeme for new string store scheme	2016-09-30 20:01:36 +02:00
Matthew Honnibal	a51149a717	Changes to vocab for new stringstore scheme	2016-09-30 20:01:19 +02:00
Matthew Honnibal	21e90d7d0b	Changes to test for new string-store	2016-09-30 20:00:58 +02:00
Matthew Honnibal	99de44d864	Changes to Doc and Token for new string store scheme	2016-09-30 20:00:21 +02:00
Matthew Honnibal	78f19baafa	Fix report of ParserStateError	2016-09-30 19:59:22 +02:00
Matthew Honnibal	0442e0ab1e	Changes to transition systems for new StringStore scheme	2016-09-30 19:58:51 +02:00
Matthew Honnibal	22d4752d64	Changes to strings.pyx for new StringStore scheme	2016-09-30 19:58:09 +02:00
Matthew Honnibal	4f794b215a	Changes to iterators.pyx for new StringStore scheme	2016-09-30 19:57:49 +02:00
Matthew Honnibal	95f8cfd745	Changes to morphology.pyx for new StringStore scheme	2016-09-30 19:57:10 +02:00
Matthew Honnibal	3ff09614e0	Changes to matcher.pyx for new StringStore scheme	2016-09-30 19:56:48 +02:00
Matthew Honnibal	eceeaefe53	Fix defaults for Parser and Entity, adding a blank= argument.	2016-09-30 19:56:06 +02:00
Matthew Honnibal	8423e8627f	Work on Issue #285 : intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good.	2016-09-30 10:14:47 +02:00
Matthew Honnibal	d3dc5718b2	Fix syntax error in Doc	2016-09-28 11:39:49 +02:00
Matthew Honnibal	1b520e7bab	Improve docstrings for Doc object	2016-09-28 11:15:13 +02:00
Matthew Honnibal	81a47c01d8	Fix test for empty sentence string.	2016-09-27 19:21:22 +02:00
Matthew Honnibal	4cbf0d3bb6	Handle errors when no valid actions are available, pointing users to the issue tracker.	2016-09-27 19:19:53 +02:00
Matthew Honnibal	430473bd98	Raise errors when no actions are available, re Issue #429	2016-09-27 19:09:37 +02:00
Matthew Honnibal	fc4a7ad794	Test and fix Issue #411 : IndexError when .sents property is used on empty string.	2016-09-27 18:49:14 +02:00
Matthew Honnibal	3d370b7d45	Add test for Issue #445 , fixed in `3cb4d455d`, with improved lemmatizer logic	2016-09-27 18:39:46 +02:00
Matthew Honnibal	a2f3510d6d	Fix lemmatizer	2016-09-27 17:47:05 +02:00
Matthew Honnibal	07776d8096	Fix pos name conflict in lemmatize	2016-09-27 17:35:58 +02:00
Matthew Honnibal	35cd953f9e	Fix pos name conflict with morphology	2016-09-27 14:16:22 +02:00
Matthew Honnibal	8e7df3c4ca	Expect the parser data, if parser.load() is called.	2016-09-27 14:02:12 +02:00
Matthew Honnibal	bb4f201ad2	Pass morphological features from tag map into the lemmatizer.	2016-09-27 14:01:43 +02:00
Matthew Honnibal	40509e8bca	Tweak the new is_base_form logic, because we can expect the 'pos' key in the morphology we're passed.	2016-09-27 14:01:16 +02:00
Matthew Honnibal	9c8ac91d72	Add test for Issue #435	2016-09-27 13:52:38 +02:00
Matthew Honnibal	3cb4d455d2	Pass lemmatizer morphological features, so that rules are sensitive to base/inflected distinction, which is how the WordNet data is designed. See Issue #435	2016-09-27 13:52:11 +02:00
Matthew Honnibal	e233328d38	Fix Issue #371 : Lexeme objects were unhashable.	2016-09-27 13:22:30 +02:00
Matthew Honnibal	e382e48d9f	Temporarily patch handling of default templates for tagger. Need to move these to language_data.	2016-09-27 13:21:28 +02:00
Matthew Honnibal	a44763af0e	Fix Issue #469 : Incorrectly cased root label in noun chunk iterator	2016-09-27 13:13:01 +02:00
Matthew Honnibal	b14b9b096b	Return None if /deps directory not present, instead of trying to load the parser.	2016-09-26 18:48:03 +02:00
Matthew Honnibal	e07b9665f7	Don't expect parser model	2016-09-26 18:09:33 +02:00
Matthew Honnibal	ee6fa106da	Fix parser features	2016-09-26 17:57:32 +02:00
Matthew Honnibal	e607e4b598	Fix parser loading	2016-09-26 17:51:11 +02:00
Matthew Honnibal	0b2d7ae9d6	Fix Entity creation	2016-09-26 15:41:22 +02:00
Matthew Honnibal	2debc4e0a2	Add .blank() method to Parser. Start housing default dep labels and entity types within the Defaults class.	2016-09-26 11:57:54 +02:00
Matthew Honnibal	722199acb8	Add spacy.blank() method, that doesn't load data. Don't try to load data if path is falsey	2016-09-26 11:07:46 +02:00
Matthew Honnibal	e56653f848	Add language data for German	2016-09-25 15:44:45 +02:00
Matthew Honnibal	7db956133e	Move tokenizer data for German into spacy.de.language_data	2016-09-25 15:37:33 +02:00
Matthew Honnibal	95aaea0d3f	Refactor so that the tokenizer data is read from Python data, rather than from disk	2016-09-25 14:49:53 +02:00
Matthew Honnibal	d7e9acdcdf	Add English language data, so that the tokenizer doesn't require the data download	2016-09-25 14:49:00 +02:00
Matthew Honnibal	82b8cc5efb	Whitespace	2016-09-24 22:17:01 +02:00
Matthew Honnibal	fd58f7655a	Python 3 compatible basestring	2016-09-24 22:16:43 +02:00
Matthew Honnibal	082e95b19e	Python 3 compatible basestring	2016-09-24 22:09:21 +02:00
Matthew Honnibal	f19af6cb2c	Python 3 compatible basestring	2016-09-24 22:08:43 +02:00
Matthew Honnibal	3ed4cdfe32	Handle pathlib.Path objects in CFile	2016-09-24 22:01:46 +02:00
Matthew Honnibal	df88690177	Fix encoding of path variable	2016-09-24 21:13:15 +02:00
Matthew Honnibal	af847e07fc	Fix usage of pathlib for Python3 -- turning paths to strings.	2016-09-24 21:05:27 +02:00
Matthew Honnibal	453683aaf0	Fix spacy/vocab.pyx	2016-09-24 20:50:31 +02:00
Matthew Honnibal	fd65cf6cbb	Finish refactoring data loading	2016-09-24 20:26:17 +02:00
Matthew Honnibal	83e364188c	Mostly finished loading refactoring. Design is in place, but doesn't work yet.	2016-09-24 15:42:01 +02:00
Matthew Honnibal	9dc8043a7e	Refactor Language to use new Defaults class, and work on revised data loading. We're getting rid of sputnik's weird file-system wrapper, and using pathlib.	2016-09-24 14:08:53 +02:00
Matthew Honnibal	b00f683a0c	Fix matcher test	2016-09-24 11:20:58 +02:00
Matthew Honnibal	eaf4065480	Expose the _patterns private member	2016-09-24 11:20:42 +02:00
Matthew Honnibal	15e42a1ba9	Allow entities to be set by Span, or by 4-tuple (with entity ID)	2016-09-24 01:17:43 +02:00
Matthew Honnibal	60fdf4d5f1	Remove commented out debugging code	2016-09-24 01:17:18 +02:00
Matthew Honnibal	939a791a52	Update tests	2016-09-24 01:17:03 +02:00
Matthew Honnibal	55f1f7edaf	Don't automatically write new entities into the Doc in the Matcher. This fixes a long-standing wart, but introduces a backwards incompatibility.	2016-09-24 01:16:45 +02:00
Matthew Honnibal	e48df859b5	Fix typedef import in span.pyx	2016-09-23 16:02:28 +02:00
Matthew Honnibal	4de13606fd	Fix token.pyx	2016-09-23 15:07:07 +02:00
Matthew Honnibal	b4de419e19	Import hash_t typedef in token.pyx	2016-09-23 14:22:06 +02:00
Matthew Honnibal	c1a2e96604	Clean up notes at end of token.pyx	2016-09-21 20:45:51 +02:00
Matthew Honnibal	f6e587b1c7	Fix matcher tests	2016-09-21 20:45:20 +02:00
Matthew Honnibal	58e83fe34b	Initial, limited support for quantified patterns in Matcher, and tracking of ent_id attribute in Token and Span. The quantifiers need a lot more testing, and there are some known problems. The main known problem is that the zero-plus and one-plus quantifiers won't work if a token can match both the quantified pattern expression AND the tail of the match.	2016-09-21 14:54:55 +02:00
Matthew Honnibal	2735b6247b	Fix orths_and_spaces in Doc.__init__	2016-09-21 14:52:05 +02:00
Matthew Honnibal	070af4af9d	Revert "* Working neural net, but features hacky. Switching to extractor." This reverts commit `7c2f1a673b`.	2016-09-21 12:26:14 +02:00
Matthew Honnibal	6b202ec43f	Merge branch 'master' of ssh://github.com/spacy-io/spaCy	2016-09-21 12:08:25 +02:00
Mahmoud Lababidi	4c9ccc3b8b	Add parameter to download() for application to not exit if a Model exists. The default behavior is unchanged.	2016-09-14 10:04:09 -04:00
Adam Ever Hadani	f1c0762443	exit code 0 for when downloading a model that already was downloaded	2016-07-13 16:22:14 -07:00
Matthew Honnibal	7c2f1a673b	* Working neural net, but features hacky. Switching to extractor.	2016-05-26 19:06:10 +02:00
Matthew Honnibal	cdc10e9a1c	* Fix Issue #375 : noun phrase iteration results in index error if noun phrases are merged during the loop. Fix by accumulating the spans inside the noun_chunks property, allowing the Span index tricks to work.	2016-05-20 10:14:06 +02:00
Matthew Honnibal	13fad36e49	* Cosmetic change to english noun chunks iterator -- use enumerate instead of range loop	2016-05-20 10:11:05 +02:00
Matthew Honnibal	02276cc444	Merge branch 'master' of ssh://github.com/spacy-io/spaCy	2016-05-17 16:56:22 +02:00
Matthew Honnibal	4d7f5468bb	* Change Language class to use a .pipeline attribute, instead of having the pipeline hard coded	2016-05-17 16:55:42 +02:00
Daylen Yang	5405e7dd73	Fix get_lang_class parsing (take 2)	2016-05-16 16:40:31 -07:00
Matthew Honnibal	b240104f40	Revert "Fix get_lang_class parsing"	2016-05-17 08:04:26 +10:00
Daylen Yang	1692c2df3c	Fix get_lang_class parsing We want the get_lang_class to return "en" for both "en" and "en_glove_cc_300_1m_vectors". Changed the split rule to "_" so that this happens.	2016-05-16 14:38:20 -07:00
Matthew Honnibal	17137f5c0c	* Fix issue #372 : mistake in Lexeme rich comparison	2016-05-12 12:58:57 +02:00
Matthew Honnibal	cc8bf62208	* Fix Issue #360 : Tokenizer failed when the infix regex matched the start of the string while trying to tokenize multi-infix tokens.	2016-05-09 13:23:47 +02:00
Matthew Honnibal	c61ee8f9fa	* Increment version	2016-05-09 13:20:00 +02:00
Matthew Honnibal	5d86c30f0b	* Fix Issue #367 : Missing has_vector property on Doc and Span objects	2016-05-09 12:36:14 +02:00
Wolfgang Seeker	7b78239436	add fix for German noun chunk iterator (issue #365 )	2016-05-06 01:41:26 +02:00
Matthew Honnibal	8c0888d6cb	* Fix error in span.sent	2016-05-06 00:28:05 +02:00
Matthew Honnibal	bb94022975	* Fix Issue #365 : Error introduced during noun phrase chunking, due to use of corrected PRON/PROPN/etc tags.	2016-05-06 00:21:05 +02:00
Matthew Honnibal	41342ca79b	Merge branch 'master' of ssh://github.com/spacy-io/spaCy	2016-05-06 00:17:58 +02:00
Matthew Honnibal	26095f9722	* Add span.sent property, re Issue #366	2016-05-06 00:17:38 +02:00
Wolfgang Seeker	dbf8f5f3ec	fix bug in StateC.set_break()	2016-05-05 15:15:34 +02:00
Wolfgang Seeker	3c44b5dc1a	call deprojectivization after parsing	2016-05-05 15:10:36 +02:00
Matthew Honnibal	472f576b82	* Deprojectivize German parses	2016-05-05 15:01:10 +02:00
Matthew Honnibal	9bbd6cf031	* Work on Chinese support	2016-05-05 11:39:12 +02:00
Matthew Honnibal	a6a25166ba	* Remove print from test	2016-05-05 11:10:59 +02:00
Matthew Honnibal	e31df66d26	* Fix Issue #361 : Lexemes didn't have rich comparison.	2016-05-05 01:32:26 +02:00
Matthew Honnibal	7441ca30ee	* Add tests for Issue #361 : Lexeme rich comparison	2016-05-05 01:31:58 +02:00
Matthew Honnibal	72564213e3	* Add test for Issue #309	2016-05-04 16:00:28 +02:00
Matthew Honnibal	76f1d871da	Merge branch 'master' of ssh://github.com/spacy-io/spaCy	2016-05-04 15:54:00 +02:00
Matthew Honnibal	519366f677	* Fix Issue #351 : Indices off when leading whitespace	2016-05-04 15:53:36 +02:00
Matthew Honnibal	b4bfc6ae55	* Add test for Issue #351 : Indices off when leading whitespace	2016-05-04 15:53:17 +02:00
Matthew Honnibal	76021cb853	* Fix bug in Doc.text, introduced by `a862edc`	2016-05-04 11:02:16 +02:00
Wolfgang Seeker	e4ea2bea01	fix whitespace	2016-05-04 07:40:38 +02:00
Wolfgang Seeker	5bf2fd1f78	make the code less cryptic	2016-05-03 17:19:05 +02:00
Wolfgang Seeker	a06fca9fdf	German noun chunk iterator now doesn't return tokens more than once	2016-05-03 16:58:59 +02:00
Wolfgang Seeker	7825b75548	add tests for German noun chunker	2016-05-03 15:01:28 +02:00
Wolfgang Seeker	7b246c13cb	reformulate noun chunk tests for English	2016-05-03 14:24:35 +02:00
Wolfgang Seeker	1786331cd8	add model sanity test	2016-05-03 12:51:47 +02:00
Matthew Honnibal	1f1532142f	* Fix cost calculation on non-monotonic oracle	2016-05-03 00:21:08 +02:00
Matthew Honnibal	377a624046	Merge pull request #358 from wbwseeker/german_lemmatizer_dummy German lemmatizer dummy	2016-05-03 07:38:26 +10:00
Wolfgang Seeker	92bfbebeec	remove unnecessary imports	2016-05-02 17:33:22 +02:00
Wolfgang Seeker	857454ffa0	fix indentation -.-	2016-05-02 17:10:41 +02:00
Matthew Honnibal	308a28c26c	* Whitespace	2016-05-02 16:08:11 +02:00
Matthew Honnibal	29a114e645	* Don't assign 0-valued tags in Doc.from_array	2016-05-02 16:07:50 +02:00
Matthew Honnibal	c1c11a8ae0	* Fix formatting on serializer tests	2016-05-02 16:07:21 +02:00
Wolfgang Seeker	dae6bc05eb	define German dummy lemmatizer until morphology is done	2016-05-02 16:04:53 +02:00
Matthew Honnibal	6e1f1c4b9e	Merge pull request #357 from wbwseeker/german_ner German ner	2016-05-02 23:39:34 +10:00
Wolfgang Seeker	b6b96b233c	don't require read_json_file to expect particular annotations	2016-05-02 15:29:30 +02:00
Matthew Honnibal	902a389d85	* Fix merge conflict in test_parse	2016-05-02 15:28:07 +02:00
Matthew Honnibal	276fbe9996	* Fix assignment of iterator on Doc object	2016-05-02 15:26:24 +02:00
Matthew Honnibal	02c23cc1d0	* Fix sentence boundary test	2016-05-02 15:26:07 +02:00
Matthew Honnibal	d2f469b809	* Fix parsing tests, so that labels are added if they're missing, and so that the branching test values are correct	2016-05-02 15:25:27 +02:00
Wolfgang Seeker	b11cbb06c6	remove old tests for sentence boundary detection	2016-05-02 14:36:35 +02:00
Matthew Honnibal	508fd1f6dc	* Refactor noun chunk iterators, so that they're simple functions. Install the iterator when the Doc is created, but allow users to write to the noun_chunk_iterator attribute. The iterator functions accept an object and yield (int start, int end, int label) triples.	2016-05-02 14:25:10 +02:00
Matthew Honnibal	e526be5602	Merge branch 'master' of ssh://github.com/spacy-io/spaCy	2016-05-02 13:08:08 +02:00
Wolfgang Seeker	fa961ea694	add tests for serialization bug	2016-05-02 11:01:56 +02:00
Matthew Honnibal	97b2bba249	* Merge updated/simplified Break approach	2016-04-25 19:44:42 +00:00
Matthew Honnibal	77609588b6	* Fix assignment of root label to words left as root implicitly, after parsing ends.	2016-04-25 19:41:59 +00:00
Matthew Honnibal	7c2d2deaa7	* Revise transition system so that the Break transition retains sole responsibility for setting sentence boundaries. Re Issue #322	2016-04-25 19:41:59 +00:00
Wolfgang Seeker	c2f76a4024	Merge branch 'master' into german_ner	2016-04-25 13:21:23 +02:00
Wolfgang Seeker	1003e7ccec	remove debug output from tests	2016-04-25 12:12:40 +02:00
Wolfgang Seeker	f57f843e85	fix bug in updating tree structure when introducing additional roots	2016-04-25 12:01:19 +02:00
Matthew Honnibal	478a8d1829	* Register Chinese language in spacy/__init__.py	2016-04-24 18:45:16 +02:00
Matthew Honnibal	8569dbc2d0	* Add initial stuff for Chinese parsing	2016-04-24 18:44:24 +02:00
Wolfgang Seeker	4d7f393fae	don't require json-files to have syntactic annotation	2016-04-22 16:32:27 +02:00
Wolfgang Seeker	b6477fc4f4	adjusted tests to Travis Setup	2016-04-21 17:15:10 +02:00
Wolfgang Seeker	736ffcb9a2	remove whitespace	2016-04-21 16:55:55 +02:00
Wolfgang Seeker	6c7301cc6d	the parser now introduces sentence boundaries properly when predicting dependents with root labels	2016-04-21 16:50:53 +02:00
Wolfgang Seeker	12024b0b0a	bugfix: introducing multiple roots now updates original head's properties adjust tests to rely less on statistical model	2016-04-20 16:42:41 +02:00
Matthew Honnibal	67ce96c9c9	* Make patterns argument to Matcher class optional	2016-04-17 21:32:24 +02:00
Matthew Honnibal	8b4677d34d	* Add missing keyword arguments to spacy.load() function	2016-04-17 21:31:50 +02:00
Matthew Honnibal	2add5206aa	* Fix description of matcher test	2016-04-17 15:40:21 +02:00
Matthew Honnibal	2b419d5b8c	* Update test for Issue #242	2016-04-17 15:34:23 +02:00
Matthew Honnibal	f12b043308	* Add test for Issue #242 : Overlapping matches not well recognised.	2016-04-17 15:19:17 +02:00
Wolfgang Seeker	b98cc3266d	bugfix: iterators now reset properly when called a second time	2016-04-15 17:49:16 +02:00
Wolfgang Seeker	e6945c4d0e	bugfix: uppercase attr values before looking them up	2016-04-15 15:46:31 +02:00
Matthew Honnibal	c0909afe22	Merge pull request #312 from wbwseeker/space_head_bug add restrictions to L-arc and R-arc to prevent space heads	2016-04-15 20:36:03 +10:00
Wolfgang Seeker	289b10f441	remove some comments	2016-04-14 15:37:51 +02:00
Matthew Honnibal	6f82065761	* Fix infixed commas in tokenizer, re Issue #326 . Need to benchmark on empirical data, to make sure this doesn't break other cases.	2016-04-14 11:36:03 +02:00
Matthew Honnibal	0f957dd586	Merge branch 'master' of ssh://github.com/honnibal/spaCy	2016-04-14 10:37:56 +02:00
Matthew Honnibal	108aca0e50	* Make Matcher use attrs from the attrs.pyx file, rather than having an incomplete function doing the mapping.	2016-04-14 10:37:39 +02:00
Matthew Honnibal	61d20de35d	* Fix language.py docstring	2016-04-14 10:36:57 +02:00
Wolfgang Seeker	d99a9cbce9	different handling of space tokens space tokens are now always attached to the previous non-space token there are two exceptions: leading space tokens are attached to the first following non-space token in input that consists exclusively of space tokens, the last space token is the head of all others.	2016-04-13 15:28:28 +02:00
Matthew Honnibal	04d0209be9	* Recognise multiple infixes in a token.	2016-04-13 18:38:26 +10:00
Henning Peters	a473d6e937	fix tests (use english model)	2016-04-12 16:41:57 +02:00
Henning Peters	f2d011c034	avoid polluting spacy namespace with lang classes	2016-04-12 16:31:16 +02:00
Henning Peters	ff690f76ba	fix loading non-german models	2016-04-12 16:00:56 +02:00
Henning Peters	6215272786	remove ujson as default non-dev dependency (still works as fallback if installed), because ujson doesn't ship wheels	2016-04-12 11:28:07 +02:00
Matthew Honnibal	6df3858dbc	* Fix Issue #323 : Incorrect semantics of Token.__str__ built-in. Add flag to allow users to switch the old semantics back on, to ease transition.	2016-04-12 13:17:59 +10:00
Wolfgang Seeker	d328e0b4a8	Merge branch 'master' into space_head_bug	2016-04-11 12:11:01 +02:00
Wolfgang Seeker	80bea62842	bugfix in unit test	2016-04-08 16:46:44 +02:00
Wolfgang Seeker	be4903a1b2	update version numbers	2016-04-08 13:54:05 +02:00
Wolfgang Seeker	1fe911cdb0	bigfix	2016-04-07 18:19:51 +02:00
Matthew Honnibal	872695759d	Merge pull request #306 from wbwseeker/german_noun_chunks add German noun chunk functionality	2016-04-08 00:54:24 +10:00
Henning Peters	470cdf5bf9	remove deprecated LOCAL_DATA_DIR	2016-04-05 11:25:54 +02:00
Matthew Honnibal	26622f0ffc	Merge branch 'master' of ssh://github.com/honnibal/spaCy	2016-03-29 14:31:52 +11:00

2450 Commits