spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-14 13:47:13 +03:00

Author	SHA1	Message	Date
Ines Montani	8279993a6f	Modernize and merge tokenizer tests for punctuation	2017-01-04 00:49:20 +01:00
Ines Montani	550630df73	Update tokenizer tests for contractions	2017-01-04 00:48:42 +01:00
Ines Montani	109f202e8f	Update conftest fixture	2017-01-04 00:48:21 +01:00
Ines Montani	ee6b49b293	Modernize tokenizer tests for emoticons	2017-01-04 00:47:59 +01:00
Ines Montani	f09b5a5dfd	Modernize tokenizer tests for infixes	2017-01-04 00:47:42 +01:00
Ines Montani	59059fed27	Move regression test for #351 to own file	2017-01-04 00:47:11 +01:00
Ines Montani	667051375d	Modernize tokenizer tests for whitespace	2017-01-04 00:46:35 +01:00
Ines Montani	aafc894285	Modernize tokenizer tests for contractions Use @pytest.mark.parametrize.	2017-01-03 23:02:21 +01:00
Ines Montani	1d237664af	Add lowercase lemma to tokenizer exceptions	2017-01-03 23:02:21 +01:00
Ines Montani	84a87951eb	Fix typos	2017-01-03 18:27:43 +01:00
Ines Montani	35b39f53c3	Reorganise English tokenizer exceptions (as discussed in #718) Add logic to generate exceptions that follow a consistent pattern (like verbs and pronouns) and allow certain tokens to be excluded explicitly.	2017-01-03 18:26:09 +01:00
Ines Montani	fb9d3bb022	Revert "Merge remote-tracking branch 'origin/master'" This reverts commit `d3b181cdf1`, reversing changes made to `b19cfcc144`.	2017-01-03 18:21:36 +01:00
Ines Montani	461cbb99d8	Revert "Reorganise English tokenizer exceptions (as discussed in #718)" This reverts commit `b19cfcc144`.	2017-01-03 18:21:29 +01:00
Ines Montani	d3b181cdf1	Merge remote-tracking branch 'origin/master' # Conflicts: # spacy/en/tokenizer_exceptions.py	2017-01-03 18:20:01 +01:00
Ines Montani	b19cfcc144	Reorganise English tokenizer exceptions (as discussed in #718) Add logic to generate exceptions that follow a consistent pattern (like verbs and pronouns) and allow certain tokens to be excluded explicitly.	2017-01-03 18:17:57 +01:00
Ines Montani	1bd53bbf89	Fix typos (resolves #718)	2017-01-03 11:26:21 +01:00
Matthew Honnibal	fde53be3b4	Move whole token match inside _split_affixes.	2016-12-30 17:11:50 -06:00
Matthew Honnibal	3ba7c167a8	Fix URL tests	2016-12-30 17:10:08 -06:00
Matthew Honnibal	9936a1b9b5	Merge branch 'tokenization_w_exception_patterns' of https://github.com/oroszgy/spaCy.hu into oroszgy-tokenization_w_exception_patterns	2016-12-30 14:53:40 -06:00
Magnus Burton	56e2219b65	Added Swedish city abbreviations	2016-12-30 21:17:34 +01:00
Magnus Burton	e935c950d8	Added months and days as abbreviations for Swedish	2016-12-30 21:08:44 +01:00
kengz	73a38bd4d1	Merge remote-tracking branch 'upstream/master'	2016-12-30 12:19:59 -05:00
kengz	da44183ae1	move parse_tree logic to a new tokens/printers.py file	2016-12-30 12:19:18 -05:00
Matthew Honnibal	3e8d9c772e	Test interaction of token_match and punctuation Check that the new token_match function applies after punctuation is split off.	2016-12-31 00:52:17 +11:00
Matthew Honnibal	74b921f394	Merge branch 'master' of ssh://github.com/explosion/spaCy into develop	2016-12-30 14:38:27 +01:00
Matthew Honnibal	623d94e14f	Whitespace	2016-12-31 00:30:28 +11:00
Matthew Honnibal	af81ac8bb0	Use thinc 6.0	2016-12-29 11:58:42 +01:00
Petter Hohle	f112e7754e	Add PART to tag map 16 of the 17 PoS tags in the UD tag set are added; PART is missing.	2016-12-28 18:39:01 +01:00
Matthew Honnibal	f62db78dc3	Increment version	2016-12-27 21:11:22 +01:00
Matthew Honnibal	cade536d1e	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-12-27 21:04:10 +01:00
Matthew Honnibal	ce4539dafd	Allow the vocabulary to grow to 10,000, to prevent cold-start problem.	2016-12-27 21:03:45 +01:00
Ines Montani	ad3669cef5	Merge pull request #703 from magnusburton/master Added Swedish abbreviations	2016-12-27 01:01:49 +01:00
Ines Montani	78f754dd9a	Merge pull request #705 from oroszgy/hu_tokenizer Initial support for Hungarian	2016-12-27 00:48:13 +01:00
Ines Montani	8785706039	Reformat stop words for better readability	2016-12-24 00:58:40 +01:00
Gyorgy Orosz	45e045a87b	Unicode/UTF8 compatibility for Python2	2016-12-24 00:21:00 +01:00
Gyorgy Orosz	72b61b6d03	Typo fix.	2016-12-24 00:10:29 +01:00
Gyorgy Orosz	3a9be4d485	Updated token exception handling mechanism to allow the usage of arbitrary functions as token exception matchers.	2016-12-23 23:49:34 +01:00
Ines Montani	1436b9f15a	Fix formatting and consistency	2016-12-23 21:36:01 +01:00
Ines Montani	1d64527727	Update Spanish tokenizer Remove reflexive pronouns as they're part of an open class, fix mistakes and add exceptions	2016-12-23 21:36:01 +01:00
Ines Montani	7f411fd01c	Remove exceptions containing whitespace / no special chars	2016-12-23 14:30:06 +01:00
Magnus Burton	fdf4776262	Added Swedish abbreviations	2016-12-22 22:45:18 +01:00
Gyorgy Orosz	d9c59c4751	Maintaining backward compatibility.	2016-12-21 23:30:49 +01:00
Gyorgy Orosz	1748549aeb	Added exception pattern mechanism to the tokenizer.	2016-12-21 23:16:19 +01:00
Gyorgy Orosz	35aa54765d	Hungarian module is exposed in spacy.	2016-12-21 20:45:36 +01:00
Gyorgy Orosz	ab2f6ea46c	Removed data files from tests.	2016-12-21 20:22:09 +01:00
Ines Montani	3c87c71d43	Add tokenizer exceptions for a.m. and p.m. in Spanish	2016-12-21 18:19:10 +01:00
Ines Montani	78e63dc7d0	Update tokenizer exceptions for English	2016-12-21 18:06:34 +01:00
Ines Montani	702d1eed93	Update tokenizer exceptions for German	2016-12-21 18:06:27 +01:00
Ines Montani	d60380418e	Update tokenizer exceptions for Spanish	2016-12-21 18:06:17 +01:00
Ines Montani	920fa0fed2	Add DET_LEMMA constant	2016-12-21 18:05:41 +01:00
Ines Montani	8978806ea6	Allow Vocab to load without serializer_freqs	2016-12-21 18:05:23 +01:00
Ines Montani	be8ed811f6	Remove trailing whitespace	2016-12-21 18:04:41 +01:00
Ines Montani	926e19184a	Merge pull request #695 from magnusburton/master Added Swedish morph rules	2016-12-21 01:06:00 +01:00
Gyorgy Orosz	3d5306acb9	Added further testcases.	2016-12-20 23:49:35 +01:00
Gyorgy Orosz	23956e72ff	Improved partial support for tokenizing Hungarian numbers	2016-12-20 23:36:59 +01:00
Gyorgy Orosz	6add156075	Refactored language data structure	2016-12-20 22:28:20 +01:00
Gyorgy Orosz	366b3f8685	Merge branch 'master' into hu_tokenizer	2016-12-20 20:53:31 +01:00
Gyorgy Orosz	c035928156	Partial Hungarian number tokenization is added.	2016-12-20 20:46:20 +01:00
JM	70ff0639b5	Fixed missing vec_path declaration that was failing if 'add_vectors' was set Added vec_path variable declaration to avoid accessing it before assignment in case 'add_vectors' is in overrides.	2016-12-20 18:21:05 +01:00
Magnus Burton	48dcc9f647	Added morph rules	2016-12-20 13:18:41 +01:00
Magnus Burton	db5a077d2b	Initial commit for Swedish	2016-12-20 11:05:06 +01:00
Matthew Honnibal	3f5747a9b2	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-12-18 23:44:22 +01:00
Matthew Honnibal	40e71586d6	Fix Issue #683: Add 'SP' to tag_map, if it's not there already, within the Morphology class.	2016-12-18 23:44:05 +01:00
Matthew Honnibal	fa1d23e10d	Merge branch 'master' of https://github.com/explosion/spaCy	2016-12-18 23:32:03 +01:00
Matthew Honnibal	f38eb25fe1	Fix test for word vector	2016-12-18 23:31:55 +01:00
Matthew Honnibal	4e68abebc4	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-12-18 23:19:45 +01:00
Matthew Honnibal	5a6328a5a4	Increment version	2016-12-18 23:19:19 +01:00
Matthew Honnibal	13a0b31279	Another tweak to GloVe path hackery.	2016-12-18 23:12:49 +01:00
Matthew Honnibal	2c6228565e	Fix vector loading re glove hack	2016-12-18 23:06:44 +01:00
Matthew Honnibal	618b50a064	Fix issue #684: GloVe vectors not loaded in spacy.en.English.	2016-12-18 22:46:31 +01:00
Matthew Honnibal	404019ad2f	Fix issue #672: ent_iob_ was a string, not unicode, due to missing unicode_literals statement.	2016-12-18 22:33:53 +01:00
Matthew Honnibal	2ef9d53117	Untested fix for issue #684: GloVe vectors hack should be inserted in English, not in spacy.load.	2016-12-18 22:29:31 +01:00
Matthew Honnibal	c065359459	Fix path-override bug in spacy.load	2016-12-18 22:15:29 +01:00
Matthew Honnibal	813249f826	Work on morphology class. Still not fully consistent with rest of library.	2016-12-18 17:35:22 +01:00
Matthew Honnibal	3679fb43a3	Fix loading of lemmatizer	2016-12-18 17:34:09 +01:00
Matthew Honnibal	3980f1b0cb	Ignore more morphology attributes in deprecated mode of intify_attrs	2016-12-18 17:33:46 +01:00
Matthew Honnibal	7a98ee5e5a	Merge language data change	2016-12-18 17:03:52 +01:00
Matthew Honnibal	e4c951c153	Merge branch 'organize-language-data' of ssh://github.com/explosion/spaCy into organize-language-data	2016-12-18 17:01:08 +01:00
Ines Montani	b99d683a93	Fix formatting	2016-12-18 16:58:28 +01:00
Ines Montani	b11d8cd3db	Merge remote-tracking branch 'origin/organize-language-data' into organize-language-data	2016-12-18 16:57:12 +01:00
Ines Montani	d1c1d3f9cd	Fix tokenizer test	2016-12-18 16:55:32 +01:00
Ines Montani	753068f1d5	Use base language data as default	2016-12-18 16:55:25 +01:00
Ines Montani	bcc1d50d09	Remove trailing whitespace	2016-12-18 16:54:52 +01:00
Ines Montani	4e95737c6c	Add base tag map	2016-12-18 16:54:28 +01:00
Ines Montani	2b2ea8ca11	Reorganise language data	2016-12-18 16:54:19 +01:00
Matthew Honnibal	1b31c05bf8	Whitespace	2016-12-18 16:51:40 +01:00
Matthew Honnibal	bdcecb3c96	Add import in regression test	2016-12-18 16:51:31 +01:00
Matthew Honnibal	6ee1df93c5	Set tag_map to None if it's not seen in the data by vocab	2016-12-18 16:51:10 +01:00
Matthew Honnibal	33996e770b	Update header for morphology class	2016-12-18 16:50:42 +01:00
Matthew Honnibal	d58187ffa7	Filter out morphology keys in deprecated attrs	2016-12-18 16:50:26 +01:00
Matthew Honnibal	837a5d4100	Update morphology class so that exceptions can be added one-by-one, and so that arbitrary attributes can be referenced.	2016-12-18 16:49:46 +01:00
Matthew Honnibal	44f4f008bd	Wire up lemmatizer rules for English	2016-12-18 15:50:09 +01:00
Matthew Honnibal	e6fc4afb04	Whitespace	2016-12-18 15:48:00 +01:00
Ines Montani	32b36c3882	Break language data components into their own files	2016-12-18 15:40:22 +01:00
Ines Montani	1bff59a8db	Update English language data	2016-12-18 15:36:53 +01:00
Ines Montani	2eb163c5dd	Add lemma rules	2016-12-18 15:36:53 +01:00
Ines Montani	29ad8143d8	Add morph rules	2016-12-18 15:36:53 +01:00
Ines Montani	bc40dad7d9	Add entity rules	2016-12-18 15:36:53 +01:00
Ines Montani	eaa3b1319d	Fix formatting	2016-12-18 15:36:53 +01:00
Ines Montani	704c7442e0	Break language data components into their own files	2016-12-18 15:36:53 +01:00
Ines Montani	62655fd36f	Add ENT_ID constant	2016-12-18 15:36:53 +01:00
Matthew Honnibal	fa272fdf12	Merge branch 'organize-language-data' of ssh://github.com/explosion/spaCy into organize-language-data	2016-12-18 15:00:21 +01:00
Matthew Honnibal	57c4341453	Refactor loading of morphology exceptions, adding a method add_special_case.	2016-12-18 14:59:44 +01:00
Ines Montani	77cf2fb0f6	Remove unnecessary argument in test	2016-12-18 14:06:27 +01:00
Ines Montani	121c310566	Remove trailing whitespace	2016-12-18 14:06:27 +01:00
Ines Montani	0fc4e45cb3	Fix tag map for German	2016-12-18 13:30:03 +01:00
Ines Montani	28326649f3	Fix typo	2016-12-18 13:30:03 +01:00
Matthew Honnibal	0595cc0635	Change test595 to mock data, instead of requiring model.	2016-12-18 13:28:51 +01:00
Matthew Honnibal	a4eb5c2bff	Check POS key in lemmatizer, to update it for new data format	2016-12-18 13:28:20 +01:00
Matthew Honnibal	28d63ec58e	Restore missing '' character in tokenizer exceptions.	2016-12-18 05:34:51 +01:00
Ines Montani	a9421652c9	Remove duplicates in tag map	2016-12-17 22:44:31 +01:00
Ines Montani	69baf1c9a8	Fix tag map	2016-12-17 22:44:22 +01:00
Ines Montani	577adad945	Fix formatting	2016-12-17 14:00:52 +01:00
Ines Montani	fc4ad17136	Fix typo	2016-12-17 14:00:47 +01:00
Ines Montani	bb94e784dc	Fix typo	2016-12-17 13:59:30 +01:00
Ines Montani	afda532595	Use symbols in tag map	2016-12-17 13:56:24 +01:00
Ines Montani	07249145c9	Fix formatting	2016-12-17 13:34:46 +01:00
Ines Montani	dd55d085b6	Reformat dutch language data to match new style	2016-12-17 13:26:01 +01:00
Ines Montani	f2c48ef504	Resolve stopwords conflict to merge Dutch	2016-12-17 13:08:16 +01:00
Matthew Honnibal	ff03ade08f	Merge pull request #688 from nlesc-sherlock/dutch Support for Dutch in SpaCy	2016-12-17 22:44:58 +11:00
Ines Montani	a22322187f	Add missing lemmas to tokenizer exceptions (fixes #674)	2016-12-17 12:42:41 +01:00
Ines Montani	5445074cbd	Expand tokenizer exceptions with unicode apostrophe (fixes #685)	2016-12-17 12:34:08 +01:00
Ines Montani	e0a7b5c612	Fix formatting	2016-12-17 12:33:09 +01:00
Ines Montani	08162dce67	Move shared functions and constants to global language data	2016-12-17 12:32:48 +01:00
Ines Montani	6a60a61086	Move update_exc to global language data utils	2016-12-17 12:29:02 +01:00
Ines Montani	f324311249	Add global language data utils	2016-12-17 12:27:41 +01:00
Ines Montani	487ce1e20a	Add encoding declaration	2016-12-17 12:25:44 +01:00
Ines Montani	d8d50a0334	Add tokenizer exception for "gonna" (fixes #691)	2016-12-17 11:59:28 +01:00
Ines Montani	c69b77d8aa	Revert "Add exception for "gonna"" This reverts commit `280c03f67b`.	2016-12-17 11:56:44 +01:00
Ines Montani	280c03f67b	Add exception for "gonna"	2016-12-17 11:54:59 +01:00
Ines Montani	5031a015e2	Fix typo in stopwords (fixes #689)	2016-12-15 17:57:06 +01:00
Janneke van der Zwaan	4a3fdcce8a	Merge github.com:explosion/spaCy into dutch	2016-12-13 09:25:23 +01:00
Matthew Honnibal	5965d3c2a7	Revert "Add acl to symbols.pyx"	2016-12-12 10:10:28 +11:00
Matthew Honnibal	6dee76dfed	Update symbols.pxd	2016-12-12 10:09:58 +11:00
Pokey Rule	18a15c0777	Add acl to symbols.pyx	2016-12-11 20:00:07 +00:00
Gyorgy Orosz	0cf2144d24	Adding partial hyphen and quote handling support.	2016-12-11 00:14:36 +01:00
Gyorgy Orosz	2051726fd3	Passing Hungarian abbrev tests.	2016-12-10 23:37:58 +01:00
Ines Montani	63024466a9	Add Portuguese stopwords	2016-12-08 20:45:07 +01:00
Ines Montani	7bfe2d4abc	Update Portuguese language data	2016-12-08 20:41:41 +01:00
Ines Montani	c0c5f31950	Remove unused data and download script	2016-12-08 20:39:49 +01:00
Ines Montani	0a6d529104	Remove unused data	2016-12-08 20:36:56 +01:00
Ines Montani	1b3b043660	Add French stopwords	2016-12-08 20:12:43 +01:00
Ines Montani	8863e504eb	Update French language data	2016-12-08 20:07:14 +01:00
Ines Montani	7cb9f51be6	Add Italian stopwords	2016-12-08 20:05:25 +01:00
Ines Montani	470a0e0bea	Update Italian language data	2016-12-08 19:52:18 +01:00
Ines Montani	1a284d342e	Add Spanish language data	2016-12-08 19:47:03 +01:00
Ines Montani	0c39654786	Remove unused import	2016-12-08 19:46:53 +01:00
Ines Montani	e47ee94761	Split punctuation into its own file	2016-12-08 19:46:43 +01:00
Ines Montani	70b51ed7c8	Remove time from German language data	2016-12-08 19:45:50 +01:00
Ines Montani	e8ae588be9	Add emoticons	2016-12-08 19:45:18 +01:00
Ines Montani	5908c0ed9f	Fix formatting	2016-12-08 19:45:11 +01:00
Ines Montani	311b30ab35	Reorganize exceptions for English and German	2016-12-08 13:58:32 +01:00
Ines Montani	66c7348cda	Add update_exc util function	2016-12-08 13:58:12 +01:00
Ines Montani	1256232fad	Fix formatting	2016-12-08 13:56:40 +01:00
Ines Montani	8e977cc71c	Fix formatting	2016-12-08 13:56:17 +01:00
Ines Montani	0176b99004	Fix formatting	2016-12-08 12:48:02 +01:00
Ines Montani	877f09218b	Add more custom rules for abbreviations	2016-12-08 12:47:01 +01:00
Gyorgy Orosz	0289b8ceaa	Additional abbreviation tests.	2016-12-08 12:17:44 +01:00
Gyorgy Orosz	90d22db023	Added Hungarian resource files.	2016-12-08 12:06:36 +01:00
Ines Montani	bfaa42636c	Update language data for German	2016-12-08 12:01:09 +01:00
Ines Montani	ec44bee321	Fix capitalization on morphological features	2016-12-08 12:00:54 +01:00
Gyorgy Orosz	5b00039955	First steps towards the Hungarian tokenizer code.	2016-12-07 23:07:43 +01:00
Ines Montani	ce979553df	Resolve conflict	2016-12-07 21:16:52 +01:00
Ines Montani	8350d65695	Change morphology and lemmatizer API Take morphology features as object instead of keyword arguments	2016-12-07 21:12:49 +01:00
Ines Montani	52e7d634df	Remove trailing whitespace	2016-12-07 21:12:19 +01:00
Ines Montani	0d07d7fc80	Apply emoticon exceptions to tokenizer	2016-12-07 21:11:59 +01:00
Ines Montani	71f0f34cb3	Fix formatting	2016-12-07 21:11:29 +01:00
Ines Montani	9413bcd9ee	Declare encoding and unicode literals	2016-12-07 21:10:34 +01:00
Ines Montani	a280ff2657	Fix __all__	2016-12-07 21:10:12 +01:00
Ines Montani	ba8721953c	Add missing emoticons	2016-12-07 21:09:44 +01:00
Ines Montani	1285c4ba93	Update English language data	2016-12-07 20:33:28 +01:00
Ines Montani	79dce0aabe	Add emoticons	2016-12-07 20:33:28 +01:00
Ines Montani	a662a95294	Add line breaks	2016-12-07 20:33:28 +01:00
Ines Montani	07f0efb102	Add test for tokenizer regular expressions	2016-12-07 20:33:28 +01:00
Ines Montani	e0712d1b32	Reformat language data	2016-12-07 20:33:28 +01:00
Matthew Honnibal	0c0f4c965d	Increment version	2016-12-03 11:16:52 +01:00
Matthew Honnibal	f6e356aada	Add (and test) Span.sentiment attribute. By default we average token.span, but can override with custom hook. Re Issue #667	2016-12-02 11:05:50 +01:00
Janneke van der Zwaan	88869e0e07	Merge github.com:explosion/spaCy into dutch	2016-11-30 17:13:39 +01:00
Janneke van der Zwaan	51ade86b86	Update language data with tag map from UD_Dutch	2016-11-30 14:41:23 +01:00
Janneke van der Zwaan	90f6ff12c9	Update Dutch language data - Use Dutch tag map - remove tokenizer exceptions	2016-11-30 11:59:39 +01:00
dafnevk	7b8f4c49f2	Added language Dutch to init file	2016-11-29 16:42:05 +01:00
Matthew Honnibal	296d33a4fc	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-11-26 12:36:18 +01:00
Matthew Honnibal	1f6c37c6f5	Fix create_tokenizer when nlp is None	2016-11-26 12:36:04 +01:00
Matthew Honnibal	c7889492f9	Fix model saving error for Python 3	2016-11-25 18:04:30 -06:00
Matthew Honnibal	bc0a202c9c	Fix unicode problem in nonproj module	2016-11-25 17:29:17 -06:00
Matthew Honnibal	6dd3b94fa6	Filter out deprecated attributes when reading special-case tokenization rules.	2016-11-25 09:57:18 -06:00
Matthew Honnibal	e879c79b8c	Merge branch 'master' of https://github.com/explosion/spaCy	2016-11-25 09:18:28 -06:00
Matthew Honnibal	a335c6dcc2	Exclude morphs from deprecated token attributes for now	2016-11-25 16:17:32 +01:00
Matthew Honnibal	f799a07f25	Merge branch 'master' of https://github.com/explosion/spaCy	2016-11-25 09:16:43 -06:00
Matthew Honnibal	159e8c46e1	Merge old training fixes with newer state	2016-11-25 09:16:36 -06:00
Matthew Honnibal	846e80f2f4	Exclude morphs from deprecated token attributes for now	2016-11-25 16:14:54 +01:00
Matthew Honnibal	664f2dd1c0	Allow dep to be None in scorer, for missing labels.	2016-11-25 09:02:49 -06:00
Matthew Honnibal	39341598bb	Fix NER label calculation	2016-11-25 09:02:22 -06:00
Matthew Honnibal	ca773a1f53	Tweak arc_eager n_gold to deal with negative costs, and improve error message.	2016-11-25 09:01:52 -06:00
Matthew Honnibal	a2f55e7015	Pass cfg through loading, for training.	2016-11-25 09:01:20 -06:00
Matthew Honnibal	608d8f5421	Pass cfg through parser, and have is_valid default to 1, not 0 when resetting state	2016-11-25 09:00:21 -06:00
Matthew Honnibal	cc7e607a8a	Fix gold.pyx for 1.0	2016-11-25 08:57:59 -06:00
root	080d29e092	Fix train.py for 1.0	2016-11-25 08:55:33 -06:00
Matthew Honnibal	6652f2a135	Test #656, #624: special case rules for tokenizer with attributes.	2016-11-25 12:44:13 +01:00
Matthew Honnibal	1e0f566d95	Fix #656, #624: Support arbitrary token attributes when adding special-case rules.	2016-11-25 12:43:24 +01:00
Matthew Honnibal	87613edf8f	Add set_struct_attr staticmethod to token	2016-11-25 12:41:47 +01:00
Matthew Honnibal	fb69aa648f	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-11-25 11:35:44 +01:00
Matthew Honnibal	9a03a3f85e	Add get_struct_attr staticmethod to Token, to match Lexeme.get_struct_attr.	2016-11-25 11:35:17 +01:00
Matthew Honnibal	53d8ca8f51	Add spacy.attrs.intify_attrs function, to normalize strings in token attribute dictionaries.	2016-11-25 11:34:30 +01:00
Ines Montani	d21ad01840	Add emoticons	2016-11-24 19:13:00 +01:00
dafnevk	d8c7ac203a	Added nl module for dutch	2016-11-24 16:39:49 +01:00
dafnevk	3db8b0d322	Added language class and some language data (with some TODOs) for Dutch	2016-11-24 15:56:38 +01:00
Ines Montani	4dcfafde02	Add line breaks	2016-11-24 14:57:37 +01:00
Ines Montani	6247c005a2	Add test for tokenizer regular expressions	2016-11-24 13:51:59 +01:00
Ines Montani	de747e39e7	Reformat language data	2016-11-24 13:51:32 +01:00
Matthew Honnibal	b8c4f5ea76	Allow German noun chunks to work on Span Update the German noun chunks iterator, so that it also works on Span objects.	2016-11-24 23:30:15 +11:00
Pokey Rule	3e3bda142d	Add noun_chunks to Span	2016-11-24 10:47:20 +00:00
Janneke van der Zwaan	83daade0e4	Add directory and initial (empty) files for language Dutch	2016-11-24 09:45:41 +01:00
Matthew Honnibal	09f68bc641	Fix Issue #639: stop words in language class not used. This patch is messy, but it's better not to change too much until the language data loading can be properly refactored.	2016-11-24 00:13:55 +01:00
Matthew Honnibal	48e1dc29d4	Fix default path loading.	2016-11-23 23:48:55 +01:00
Matthew Honnibal	e01c1875ee	Work on test for #615	2016-11-23 23:48:41 +01:00
ExplodingCabbage	6c4f488e89	Fix syntax mistake	2016-11-23 15:12:45 +00:00
Matthew Honnibal	60eb2343ce	Only try to load vectors if they exist.	2016-11-23 13:50:24 +01:00
Matthew Honnibal	618ac36093	Fix use of path argument in Language.__init__. Needs to be keyword arg, not positional.	2016-11-23 13:26:34 +01:00
Mark Amery	fbe19680a6	Fix another bug related to Language.__init__'s path parameter	2016-11-20 20:31:34 +00:00
Mark Amery	b0a07c21a0	Fix `path` param of `Language.__init__` always being ignored There was an explicitly-declared `path` keyword argument, so 'path' would never be present in `**overrides`. This line just overwrote any manually-specified value the user might've passed to the `path` parameter.	2016-11-20 16:29:57 +00:00
Mark Amery	1988fce389	Merge remote-tracking branch 'origin/master' into specify-data-path	2016-11-20 16:07:14 +00:00
Mark Amery	3871007c72	Let --data-path be specified when running download.py scripts Resolves https://github.com/explosion/spaCy/issues/637	2016-11-20 15:48:04 +00:00
Ines Montani	dad2c6cae9	Strip trailing whitespace	2016-11-20 16:45:51 +01:00
Ines Montani	3082e49326	Update and reformat German stopwords	2016-11-20 16:45:26 +01:00
Sourav Singh	6745eac309	Update language_data.py	2016-11-20 19:52:02 +05:30
Sourav Singh	4d9aae7d6a	Add German Stopwords	2016-11-19 22:47:53 +05:30
Matthew Honnibal	7afb2544a7	Merge pull request #627 from sadovnychyi/patch-1 Remove duplicated line of vocab declaration	2016-11-16 06:09:18 +11:00
Yanhao	762169da29	Fixed bug: eg.guess is a tag id, rather than tag	2016-11-15 14:11:22 +08:00
Dmytro Sadovnychyi	e70a7050e1	Remove duplicated line of vocab declaration As already declared on line 211.	2016-11-13 18:52:49 +08:00
Matthew Honnibal	f123f92e0c	Fix #617: Vocab.load() required Path. Should work with string as well.	2016-11-10 22:48:48 +01:00
Matthew Honnibal	e86f440ca6	Fix test for issue 617	2016-11-10 22:48:10 +01:00
Matthew Honnibal	faa7610c56	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-11-10 22:46:38 +01:00
Matthew Honnibal	a2c7de8329	spacy/tests/regression/test_issue617.py Test Issue #617	2016-11-10 22:46:23 +01:00
tiago	2a3e342c1f	Added a test case to cover the span.merge returning values	2016-11-09 18:57:50 +00:00
tiago	b38cfd0ef9	now span.merge returns token like it says on documentation	2016-11-09 14:58:19 +00:00
Dmitry Sadovnychyi	9488222e79	Fix PhraseMatcher to work with updated Matcher #613	2016-11-09 00:14:26 +08:00
Dmitry Sadovnychyi	86c056ba64	Add basic test for PhraseMatcher #613	2016-11-09 00:10:32 +08:00
Matthew Honnibal	3ea15b257f	Fix test for 605	2016-11-06 11:59:26 +01:00
Matthew Honnibal	efe7790439	Test #590: Order dependence in Matcher rules.	2016-11-06 11:21:36 +01:00
Matthew Honnibal	5cd3acb265	Fix #605: Acceptor now rejects matches as expected.	2016-11-06 10:50:42 +01:00
Matthew Honnibal	75805397dd	Test Issue #605	2016-11-06 10:42:32 +01:00
Matthew Honnibal	014b6936ac	Fix #608 -- __version__ should be available at the base of the package.	2016-11-04 21:21:02 +01:00
Matthew Honnibal	42b0736db7	Increment version	2016-11-04 20:04:21 +01:00
Matthew Honnibal	9f93386994	Update version	2016-11-04 19:28:16 +01:00
Matthew Honnibal	1fb09c3dc1	Fix morphology tagger	2016-11-04 19:19:09 +01:00
Matthew Honnibal	a36353df47	Temporarily put back the tokenize_from_strings method, while tests aren't updated yet.	2016-11-04 19:18:07 +01:00
Matthew Honnibal	f0917b6808	Fix Issue #376: and/or was tagged as a noun.	2016-11-04 15:21:28 +01:00
Matthew Honnibal	737816e86e	Fix #368: Tokenizer handled pattern 'unicode close quote, period' incorrectly.	2016-11-04 15:16:20 +01:00
Matthew Honnibal	ab952b4756	Fix #578 -- Sputnik had been purging all files on --force, not just the relevant one.	2016-11-04 10:44:11 +01:00
Matthew Honnibal	6e37ba1d82	Fix #602, #603 --- Broken build	2016-11-04 09:54:24 +01:00
Matthew Honnibal	293c79c09a	Fix #595: Lemmatization was incorrect for base forms, because morphological analyser wasn't adding morphology properly.	2016-11-04 00:29:07 +01:00
Matthew Honnibal	e30348b331	Prefer to import from symbols instead of parts_of_speech	2016-11-04 00:27:55 +01:00
Matthew Honnibal	4a8a2b6001	Test #595 -- Bug in lemmatization of base forms.	2016-11-04 00:27:32 +01:00
Matthew Honnibal	f1605df2ec	Fix #588: Matcher should reject empty pattern.	2016-11-03 00:16:44 +01:00
Matthew Honnibal	72b9bd57ec	Test Issue #588: Matcher accepts invalid, empty patterns.	2016-11-03 00:09:35 +01:00
Matthew Honnibal	41a90a7fbb	Add tokenizer exception for 'Ph.D.', to fix 592.	2016-11-03 00:03:34 +01:00
Matthew Honnibal	532318e80b	Import Jieba inside zh.make_doc	2016-11-02 23:49:19 +01:00
Matthew Honnibal	f292f7f0e6	Fix Issue #599, by considering empty documents to be parsed and tagged. Implementation is a bit dodgy.	2016-11-02 23:48:43 +01:00
Matthew Honnibal	b6b01d4680	Remove deprecated tokens_from_list test.	2016-11-02 23:47:21 +01:00
Matthew Honnibal	3d6c79e595	Test Issue #599: .is_tagged and .is_parsed attributes not reflected after deserialization for empty documents.	2016-11-02 23:40:11 +01:00
Matthew Honnibal	05a8b752a2	Fix Issue #600: Missing setters for Token attribute.	2016-11-02 23:28:59 +01:00
Matthew Honnibal	125c910a8d	Test Issue #600	2016-11-02 23:24:13 +01:00
Matthew Honnibal	e0c9695615	Fix doc strings for tokenizer	2016-11-02 23:15:39 +01:00
Matthew Honnibal	80824f6d29	Fix test	2016-11-02 20:48:40 +01:00
Matthew Honnibal	dbe47902bc	Add import fr	2016-11-02 20:48:29 +01:00
Matthew Honnibal	8f24dc1982	Fix infixes in Italian	2016-11-02 20:43:52 +01:00
Matthew Honnibal	41a4766c1c	Fix infixes in Spanish and Portuguese	2016-11-02 20:43:12 +01:00
Matthew Honnibal	3d4bd96e8a	Fix infixes in French	2016-11-02 20:41:43 +01:00
Matthew Honnibal	c09a8ce5bb	Add test for french tokenizer	2016-11-02 20:40:31 +01:00
Matthew Honnibal	b012ae3044	Add test for loading languages	2016-11-02 20:38:48 +01:00
Matthew Honnibal	ad1c747c6b	Fix stray POS in language stubs	2016-11-02 20:37:55 +01:00
Matthew Honnibal	e9e6fce576	Handle null prefix/suffix/infix search in tokenizer	2016-11-02 20:35:48 +01:00
Matthew Honnibal	22647c2423	Check that patterns aren't null before compiling regex for tokenizer	2016-11-02 20:35:29 +01:00
Matthew Honnibal	5ac735df33	Link languages in __init__.py	2016-11-02 20:05:14 +01:00
Matthew Honnibal	c68dfe2965	Stub out support for Italian	2016-11-02 20:03:24 +01:00
Matthew Honnibal	6dbf4f7ad7	Stub out support for French, Spanish, Italian and Portuguese	2016-11-02 20:02:41 +01:00
Matthew Honnibal	6b8b05ef83	Specify that spacy.util is encoded in utf8	2016-11-02 19:58:00 +01:00
Matthew Honnibal	5363224395	Add draft Jieba tokenizer for Chinese	2016-11-02 19:57:38 +01:00
Matthew Honnibal	f7fee6c24b	Check for class-defined make_docs method before assigning one provided as an argument	2016-11-02 19:57:13 +01:00
Matthew Honnibal	19c1e83d3d	Work on draft Italian tokenizer	2016-11-02 19:56:32 +01:00
Matthew Honnibal	9efe568177	Add missing unicode_literals to spacy.util. I think this was messing up the tokenizer regex for non-ascii characters in Python 2. Re Issue #596	2016-11-02 12:31:34 +01:00
Matthew Honnibal	d8db648ebf	Add __init__.py file for regression tests	2016-11-01 13:45:06 +01:00
Matthew Honnibal	11664b9f20	Fix variable error in token	2016-11-01 13:28:00 +01:00
Matthew Honnibal	8c4d1b46ce	Fix variable error in Span	2016-11-01 13:27:44 +01:00
Matthew Honnibal	e7af6b937f	Fix syntax error while fixing doc strings	2016-11-01 13:27:32 +01:00
Matthew Honnibal	62fc6b1afa	Use 32 bit hashes for OOV, re Issue #589, Issue #285	2016-11-01 13:27:13 +01:00
Matthew Honnibal	6977a2b8cd	Add test for Issue #589	2016-11-01 12:33:36 +01:00
Matthew Honnibal	b86f8af0c1	Fix doc strings	2016-11-01 12:25:36 +01:00
Matthew Honnibal	d563f1eadb	Fix Issue #587: Segfault in Matcher, due to simple error in the state machine.	2016-10-28 17:42:00 +02:00
Matthew Honnibal	7e5f63a595	Improve test slightly	2016-10-28 17:41:16 +02:00
Matthew Honnibal	782e4814f4	Test Issue #587: Matcher segfaults on particular input	2016-10-28 16:38:32 +02:00
Matthew Honnibal	708ea22208	Infer types in transition_system.pyx	2016-10-27 18:08:13 +02:00
Matthew Honnibal	18590eba94	Fix training evaluate method	2016-10-27 18:02:19 +02:00
Matthew Honnibal	301f3cc898	Fix Issue #429. Add an initialize_state method to the named entity recogniser that adds missing entity types. This is a messy place to add this, because it's strange to have the method mutate state. A better home for this logic could be found.	2016-10-27 18:01:55 +02:00
Matthew Honnibal	afea6505f3	Test Issue 429: No valid actions for NER after matcher adds a new entity label.	2016-10-27 18:01:34 +02:00
Matthew Honnibal	03a520ec4f	Change signature of Parser.parseC, so that nr_class is read from the transition system. This allows the transition system to modify the number of actions in initialize_state.	2016-10-27 17:58:56 +02:00
Matthew Honnibal	6c47048912	Fix test, after IOB tweak.	2016-10-26 17:22:03 +02:00
Matthew Honnibal	4ca31b4d87	Fix clobbering of 'missing' named ent values after assigning ents.	2016-10-26 13:13:56 +02:00
Matthew Honnibal	cb49189477	Remove dead code	2016-10-26 13:11:07 +02:00
Matthew Honnibal	a209b10579	Improve error message when oracle fails for non-projective trees, re Issue #571.	2016-10-24 20:31:30 +02:00
Matthew Honnibal	b2d43b93d2	Fix Python 3 basestring error	2016-10-24 14:22:51 +02:00
Matthew Honnibal	276478fe0f	Update strings.pxd	2016-10-24 14:00:35 +02:00
Matthew Honnibal	d8134817ff	Workaround Issue #285: Allow the StringStore to be 'frozen', in which case strings will be pushed into an OOV map. We can then flush this OOV map, freeing all of the OOV strings.	2016-10-24 13:49:03 +02:00
Matthew Honnibal	d3a617aa99	Test workaround for Issue #285: Streaming data memory growth	2016-10-24 13:48:06 +02:00
Matthew Honnibal	64e5f02cf7	Update test	2016-10-23 21:08:07 +02:00
Matthew Honnibal	66d7a6eca2	Update test	2016-10-23 21:02:05 +02:00
Matthew Honnibal	90bf797125	Update test	2016-10-23 20:54:17 +02:00
Matthew Honnibal	5e76320ffe	Update test	2016-10-23 20:44:54 +02:00
Matthew Honnibal	aa105927f3	Update test	2016-10-23 20:31:25 +02:00
Matthew Honnibal	6b9237aa83	Increment version	2016-10-23 20:22:53 +02:00
Matthew Honnibal	150e02d72e	Fix Issue #566	2016-10-23 20:19:01 +02:00
Matthew Honnibal	e120561294	Fix vector_norm test.	2016-10-23 19:56:16 +02:00
Matthew Honnibal	fefde8aef8	Make installation print data path.	2016-10-23 19:46:44 +02:00
Matthew Honnibal	e7414cd064	Try to fix weird install glitch.	2016-10-23 19:46:28 +02:00
Matthew Honnibal	90f7544edd	Increment version	2016-10-23 19:43:06 +02:00
Matthew Honnibal	6036ec7c77	Fix vector norm when loading lexemes.	2016-10-23 19:40:18 +02:00
Matthew Honnibal	c05cd2356e	Fix similarity test for Python 3	2016-10-23 18:16:56 +02:00
Matthew Honnibal	3e688e6d4b	Fix issue #514 -- serializer fails when new entity type has been added. The fix here is quite ugly. It's best to add the entities ASAP after loading the NLP pipeline, to mitigate the brittleness.	2016-10-23 17:45:44 +02:00
Matthew Honnibal	79aa03fe98	Test Issue #514: Serializer fails when new entity type has been added.	2016-10-23 17:41:44 +02:00
Matthew Honnibal	f97548c6f1	Fix broken test, re Issue #461	2016-10-23 17:02:23 +02:00
Matthew Honnibal	4de30a8e38	Test Issue #514: Serialization fails after adding a new entity label.	2016-10-23 16:40:27 +02:00
Matthew Honnibal	936e6246aa	Fix Issue #459 -- failed to deserialize empty doc.	2016-10-23 16:31:05 +02:00
Matthew Honnibal	e99b3f5322	Test Issue #459: Fail to deserialize empty doc	2016-10-23 16:30:22 +02:00
Matthew Honnibal	49c117960c	Fix bug where huffman codec died if given empty freqs dict.	2016-10-23 16:28:05 +02:00
Matthew Honnibal	99ff8b902f	Test that huffman codec works with empty freqs dict	2016-10-23 16:27:45 +02:00
Matthew Honnibal	15c9b59f0e	Fix Issue #461: O tag was being clobbered by doc.ents.__set__	2016-10-23 15:50:26 +02:00
Matthew Honnibal	e5627134d9	Test Issue #461: ent_iob tag incorrect after setting entities.	2016-10-23 15:50:04 +02:00
Matthew Honnibal	f62088d646	Fix compile error	2016-10-23 14:50:50 +02:00
Matthew Honnibal	2c3a67b693	Fix calculation of vector norm, re Issue #522. Need to consolidate the calculations into a helper function.	2016-10-23 14:49:31 +02:00
Matthew Honnibal	a0a4ada42a	Fix calculation of L2-norm for Lexeme	2016-10-23 14:44:45 +02:00
Matthew Honnibal	2989072aac	Add tests to verify that Issue #442 is fixed in 1.1	2016-10-23 14:33:13 +02:00
Matthew Honnibal	739213a8af	Fix create_pipeline keyword argument.	2016-10-23 14:24:16 +02:00
Matthew Honnibal	bea44bd3c4	Fix vector_norm when vector is assigned to Lexeme.	2016-10-23 14:23:56 +02:00
Matthew Honnibal	e838b6d53f	Add tests for using the new Entity ID tracking in the rule matcher	2016-10-23 14:04:01 +02:00
Matthew Honnibal	e7af75e0a9	Add test for vector resizing, re Issue #544	2016-10-21 17:07:21 +02:00
Matthew Honnibal	ca8ea33abc	Bump version to 1.1.0	2016-10-21 16:30:57 +02:00
Matthew Honnibal	7ab03050d4	Add resize_vectors method to Vocab	2016-10-21 01:44:50 +02:00
Matthew Honnibal	8ce8803824	Fix JSON in tokenizer	2016-10-21 01:44:20 +02:00
Matthew Honnibal	6eb73a095f	Fix JSON in tagger	2016-10-21 01:44:10 +02:00
Matthew Honnibal	e16e78a737	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-10-21 00:00:15 +02:00
Matthew Honnibal	147373c807	Increment version	2016-10-21 00:00:03 +02:00
Matthew Honnibal	e80944276f	Fix Span.vector_norm	2016-10-20 21:58:56 +02:00
Matthew Honnibal	f5fe4f595b	Fix json loading, for Python 3.	2016-10-20 21:23:26 +02:00
Matthew Honnibal	2e92c6fb3a	Fix JSON encoding issue on load	2016-10-20 21:06:48 +02:00
Matthew Honnibal	4ad7bb96c9	Increment version.	2016-10-20 20:48:30 +02:00
Matthew Honnibal	5ec32f5d97	Fix loading of GloVe vectors, to address Issue #541	2016-10-20 18:27:48 +02:00
Matthew Honnibal	ddeabd76c4	Fix mistake loading GloVe vectors. GloVe vectors now loaded by default if present, as promised.	2016-10-20 16:57:53 +02:00
Matthew Honnibal	bfe5cb1244	Increment version.	2016-10-20 14:52:00 +02:00
Matthew Honnibal	f189a3cb00	Fix encoding when opening files in Python 2.7, re Issue #539	2016-10-20 14:42:56 +02:00
Matthew Honnibal	c353a5214d	Increment version	2016-10-19 23:51:01 +02:00
Matthew Honnibal	d10c17f2a4	Fix Issue #536: oov_prob was 0 for OOV words.	2016-10-19 23:38:47 +02:00
Matthew Honnibal	dfa752d064	Increment version	2016-10-19 23:19:13 +02:00
Matthew Honnibal	3588a18fb8	Fix hook names in doc	2016-10-19 21:15:16 +02:00
Matthew Honnibal	5d5742b773	Add sentiment field to doc, rename getters_for_tokens and getters_for_spans, add user_hooks field to Doc.	2016-10-19 20:54:22 +02:00
Matthew Honnibal	ed5e178817	Add sentiment property on lexeme object	2016-10-19 20:52:52 +02:00
Matthew Honnibal	d4aaf2752c	Fix issue #535: Pipeline elements added even when data not installed.	2016-10-19 19:55:19 +02:00
Matthew Honnibal	04d1c959da	Fix version	2016-10-19 03:45:37 +02:00
Matthew Honnibal	d35aa7344e	Change version ID to make PyPi happy	2016-10-19 03:24:39 +02:00
Matthew Honnibal	89d2a5c8b3	Increment build version.	2016-10-19 03:05:17 +02:00
Matthew Honnibal	622b0a9674	Tweak download script	2016-10-19 00:52:16 +02:00
Matthew Honnibal	5a5c7192a5	Fix download.py for GloVe vectors.	2016-10-19 00:47:44 +02:00
Matthew Honnibal	edc45c19d6	Update download script	2016-10-19 00:41:14 +02:00
Matthew Honnibal	2bbb050500	Fix default of serializer_freqs	2016-10-18 19:55:41 +02:00
Matthew Honnibal	1b651db9c5	Fix parser creation in Language class.	2016-10-18 19:36:44 +02:00
Matthew Honnibal	45a6f9b9c7	Fix loading of tagger.	2016-10-18 19:33:04 +02:00
Matthew Honnibal	76c815f40d	Fix spacy.load	2016-10-18 19:23:31 +02:00
Matthew Honnibal	8c8f5c62c6	Add LANG attribute to English and German	2016-10-18 18:52:48 +02:00
Matthew Honnibal	05e2a589a4	Fix None label in matcher	2016-10-18 18:05:21 +02:00
Matthew Honnibal	c3a8a1cf51	Update serializer test.	2016-10-18 16:18:46 +02:00
Matthew Honnibal	7d5212f131	Refactor defaults	2016-10-18 16:18:25 +02:00
Matthew Honnibal	a45a9d5092	Remove stray .tensor attribute from Lexeme	2016-10-18 01:16:32 +02:00
Matthew Honnibal	9258db788a	Revert "Have the matcher return character offsets, to handle the match better." This reverts commit `049c937540`.	2016-10-17 16:49:51 +02:00
Matthew Honnibal	7d446e5094	Revert "Update matcher test, to reflect character offset return instead of token offset." This reverts commit `f8d3e3bcfe`.	2016-10-17 16:49:49 +02:00
Matthew Honnibal	4bf2c53c13	Revert "Hack on matcher tests, for new implementation." This reverts commit `dbe60644ab`.	2016-10-17 16:49:48 +02:00
Matthew Honnibal	2fd97c71cc	Revert "Don't try to pickle matcher." This reverts commit `97bd0c9d00`.	2016-10-17 16:49:43 +02:00
Matthew Honnibal	97bd0c9d00	Don't try to pickle matcher.	2016-10-17 16:38:40 +02:00
Matthew Honnibal	dbe60644ab	Hack on matcher tests, for new implementation.	2016-10-17 16:12:22 +02:00
Matthew Honnibal	f8d3e3bcfe	Update matcher test, to reflect character offset return instead of token offset.	2016-10-17 16:00:10 +02:00
Matthew Honnibal	049c937540	Have the matcher return character offsets, to handle the match better.	2016-10-17 15:58:57 +02:00
Matthew Honnibal	9b60186266	Fix doc class	2016-10-17 15:23:47 +02:00
Matthew Honnibal	6cbdc94959	Lots of updates to Matcher, to make entity handling sane.	2016-10-17 15:23:31 +02:00
Matthew Honnibal	7fd98fc91c	Remove deprecation shim around str/bytes in Token.	2016-10-17 14:02:47 +02:00
Matthew Honnibal	b67697a97b	Improve API for doc.merge() and span.merge(), to use keyword arguments.	2016-10-17 14:02:13 +02:00
Matthew Honnibal	fbb7f3f15c	Add user_data attribute to Doc object.	2016-10-17 11:43:22 +02:00
Matthew Honnibal	c1abc8f6ed	Fix deprecation stuff in Token: Remove the shim for the str/unicode semantics, and raise for has_repvec and repvec	2016-10-17 11:18:41 +02:00
Matthew Honnibal	4ba9eadf3d	Merge branch 'v1.0.0-rc1' of ssh://github.com/explosion/spaCy into v1.0.0-rc1	2016-10-17 02:45:44 +02:00
Matthew Honnibal	09ab447a18	Remove tensor property from token.	2016-10-17 02:45:09 +02:00
Matthew Honnibal	5d10e2005c	Defer some attributes to Doc, via getters_for_tokens attribute.	2016-10-17 02:44:49 +02:00
Matthew Honnibal	8829984efb	Remove tensor attribute from Span and Token.	2016-10-17 02:44:04 +02:00
Matthew Honnibal	d15a88c66a	Defer some attributes to Doc via getters_for_spans	2016-10-17 02:43:35 +02:00
Matthew Honnibal	62230dd13a	Add getters_for_spans and getters_for_tokens attributes to Doc. Fix docstring	2016-10-17 02:42:51 +02:00
Matthew Honnibal	ae11ea8240	Add getters_for_tokens and getters_for_spans attributes to Doc object.	2016-10-17 02:42:05 +02:00
Matthew Honnibal	be48a7b4f3	Fix conftest for website tests.	2016-10-17 01:54:26 +02:00
Matthew Honnibal	8951bf6989	Update matcher tests	2016-10-17 01:53:24 +02:00
Matthew Honnibal	0cf4aff470	Set default path in EN/DE tests.	2016-10-17 01:52:49 +02:00
Matthew Honnibal	cd71b6b0a9	Remove test of parser pickle	2016-10-17 01:52:10 +02:00
Matthew Honnibal	5bc101006e	Add cfg field to Tagger	2016-10-17 01:03:41 +02:00
Matthew Honnibal	517f090cbf	Use GoldParse in tagger.update	2016-10-17 00:55:15 +02:00
Matthew Honnibal	59038f7efa	Restore support for prior data format -- specifically, the labels field of the config.	2016-10-17 00:53:26 +02:00
Matthew Honnibal	7887ab3b36	Fix default use of feature_templates in parser	2016-10-16 21:41:56 +02:00
Matthew Honnibal	f787cd29fe	Refactor the pipeline classes to make them more consistent, and remove the redundant blank() constructor.	2016-10-16 21:34:57 +02:00
kengz	fb92e2d061	activate parse_tree test, use from_array, test for root correctness	2016-10-16 15:12:08 -04:00
kengz	17b7832419	mark test as needing models	2016-10-16 14:39:07 -04:00
kengz	f046e0d7c8	add parse_tree method to language, separate from __call__ for efficiency, but will use __call__ to get the doc	2016-10-16 14:20:23 -04:00
Matthew Honnibal	311a985fe0	Add input error handling in Doc	2016-10-16 18:16:42 +02:00
Matthew Honnibal	06322ba99d	Add words and spaces keyword arguments to Doc.	2016-10-16 18:13:03 +02:00
Matthew Honnibal	ca51f3b77e	Use DependencyParser and EntityRecognizer in the Language class.	2016-10-16 17:58:12 +02:00
Matthew Honnibal	195d998a12	Fix GoldParse argument to tagger.update	2016-10-16 17:05:09 +02:00
Matthew Honnibal	274a4d4272	Fix queue Python property in StateClass	2016-10-16 17:04:41 +02:00
Matthew Honnibal	e8c8aa08ce	Make action_name optional in StepwiseState	2016-10-16 17:04:16 +02:00
Matthew Honnibal	4bb73b1a93	Fix parser labels in pipeline	2016-10-16 17:03:22 +02:00
Matthew Honnibal	a81c5a7abf	Fix name of labels keyword to 'actions'.	2016-10-16 12:00:27 +02:00
Matthew Honnibal	a079677984	Fix omission of O action when creating blank entity recognizer	2016-10-16 11:43:25 +02:00
Matthew Honnibal	5444d38cc6	Update test for biluo tags	2016-10-16 11:42:45 +02:00
Matthew Honnibal	4fc56d4a31	Rename 'labels' to 'actions' in parser options	2016-10-16 11:42:26 +02:00
Matthew Honnibal	8a6b35d266	Delay binding in MakeDoc	2016-10-16 11:41:55 +02:00
Matthew Honnibal	52b48b415e	Fix GoldParse class	2016-10-16 11:41:36 +02:00
Matthew Honnibal	3259a63779	Whitespace	2016-10-16 01:47:28 +02:00
Matthew Honnibal	509b30834f	Add a pipeline module, to collect and wrap processes for annotation	2016-10-16 01:47:12 +02:00
Matthew Honnibal	0317cea0ad	Fix GoldParse	2016-10-15 23:55:07 +02:00
Matthew Honnibal	1c62573a41	Fix spacy.train	2016-10-15 23:53:46 +02:00
Matthew Honnibal	a48aa15384	Improve the API for the GoldParse class.	2016-10-15 23:53:29 +02:00
Matthew Honnibal	e07fe92b27	Draft a refactored init for the GoldParse class	2016-10-15 22:09:52 +02:00
Matthew Honnibal	47afef7d6b	Add init.py for gold tests	2016-10-15 21:51:28 +02:00
Matthew Honnibal	86ae665c78	Add function for entity->biluo transformation	2016-10-15 21:51:04 +02:00
Matthew Honnibal	2163fd238f	Add tests for entity->biluo transformation	2016-10-15 21:50:43 +02:00
Matthew Honnibal	5e923b9bfa	Return None in match_best_version if no path exists.	2016-10-15 14:47:29 +02:00
Matthew Honnibal	2516382106	Fix loading of English in span test	2016-10-15 14:44:37 +02:00
Matthew Honnibal	dda2fc6bef	Add empty data directory	2016-10-15 14:25:25 +02:00
Matthew Honnibal	049197e0ae	Update tests, somewhat messily.	2016-10-15 14:14:04 +02:00
Matthew Honnibal	1e1a1d9517	Update matcher test	2016-10-15 14:13:41 +02:00
Matthew Honnibal	9cc9ce0f14	Load with default path=False in tests.	2016-10-15 14:13:23 +02:00
Matthew Honnibal	08e9134760	Change default value of path to True	2016-10-15 14:12:54 +02:00
Matthew Honnibal	788657f062	Ensure words are added to vocab before test, so that the lexicon is updated correctly.	2016-10-15 14:12:18 +02:00
Matthew Honnibal	4a1a2bce68	Update version in about.py	2016-10-15 13:44:27 +02:00
Matthew Honnibal	6d8cb515ac	Break the tokenization stage out of the pipeline into a function 'make_doc'. This allows all pipeline methods to have the same signature.	2016-10-14 17:38:29 +02:00
Matthew Honnibal	2cc515b2ed	Add add_flag method to Vocab, re Issue #504.	2016-10-14 12:15:38 +02:00
Matthew Honnibal	f3be9d0a9a	Add tensor field to Lexeme, Token, Doc and Span, so that users have a place to hang neural network outputs	2016-10-14 03:24:13 +02:00
Matthew Honnibal	9b55d97a8f	Update train method	2016-10-13 03:24:53 +02:00
Matthew Honnibal	645d99523a	Move merge_sents method into spacy.gold	2016-10-13 03:24:29 +02:00
Matthew Honnibal	41f88ce938	Fix dep model loading in parser	2016-10-12 20:26:38 +02:00
Matthew Honnibal	d9ae2d68af	Load features by string-name for backwards compatibility.	2016-10-12 20:15:11 +02:00
Matthew Honnibal	a42fbcf946	Require model for test_is_properties	2016-10-12 19:35:18 +02:00
Matthew Honnibal	20c948361b	Use local path in test_lemmatizer	2016-10-12 19:35:00 +02:00
Matthew Honnibal	1318d0bc65	Test with the non-loaded versions of the English and German pipelines.	2016-10-12 19:13:31 +02:00
Matthew Honnibal	0e2bedc373	Fix default labels for parser and NER	2016-10-12 19:12:40 +02:00
Matthew Honnibal	3a03c668c3	Fix message in ParserStateError	2016-10-12 14:44:31 +02:00
Matthew Honnibal	6bf505e865	Fix error on ParserStateError	2016-10-12 14:35:55 +02:00
Matthew Honnibal	ba5e048502	Add docstring for Trainer class.	2016-10-12 14:26:02 +02:00
Matthew Honnibal	847a4a4182	Refactor Language, dropping Language.blank() method.	2016-10-12 13:45:58 +02:00
Matthew Honnibal	ea23b64cc8	Refactor training, with new spacy.train module. Defaults still a little awkward.	2016-10-09 12:24:24 +02:00
Matthew Honnibal	ca32a1ab01	Revert "Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good." This reverts commit `8423e8627f`.	2016-09-30 20:20:22 +02:00
Matthew Honnibal	90baa9c7e6	Revert "Changes to matcher.pyx for new StringStore scheme" This reverts commit `3ff09614e0`.	2016-09-30 20:20:13 +02:00
Matthew Honnibal	1b6b129c04	Revert "Changes to morphology.pyx for new StringStore scheme" This reverts commit `95f8cfd745`.	2016-09-30 20:20:02 +02:00
Matthew Honnibal	1d70db58aa	Revert "Changes to iterators.pyx for new StringStore scheme" This reverts commit `4f794b215a`.	2016-09-30 20:19:53 +02:00
Matthew Honnibal	de01e427fd	Revert "Changes to strings.pyx for new StringStore scheme" This reverts commit `22d4752d64`.	2016-09-30 20:19:42 +02:00
Matthew Honnibal	9e09b39b9f	Revert "Changes to transition systems for new StringStore scheme" This reverts commit `0442e0ab1e`.	2016-09-30 20:11:49 +02:00
Matthew Honnibal	e3285f6f30	Revert "Fix report of ParserStateError" This reverts commit `78f19baafa`.	2016-09-30 20:11:33 +02:00
Matthew Honnibal	6736977d82	Revert "Changes to Doc and Token for new string store scheme" This reverts commit `99de44d864`.	2016-09-30 20:11:15 +02:00
Matthew Honnibal	bd7fe6420c	Revert "Changes to test for new string-store" This reverts commit `21e90d7d0b`.	2016-09-30 20:11:01 +02:00
Matthew Honnibal	1f1cd5013f	Revert "Changes to vocab for new stringstore scheme" This reverts commit `a51149a717`.	2016-09-30 20:10:30 +02:00
Matthew Honnibal	1e7d0af127	Revert "Changes to Lexeme for new string store scheme" This reverts commit `717741b6cf`.	2016-09-30 20:10:13 +02:00
Matthew Honnibal	ba51cb8325	Revert "Changes to tagger for new string store scheme" This reverts commit `f5a6aac906`.	2016-09-30 20:09:53 +02:00
Matthew Honnibal	23b7244842	Make sure symbols are unicode strings	2016-09-30 20:02:19 +02:00
Matthew Honnibal	f5a6aac906	Changes to tagger for new string store scheme	2016-09-30 20:01:51 +02:00
Matthew Honnibal	717741b6cf	Changes to Lexeme for new string store scheme	2016-09-30 20:01:36 +02:00
Matthew Honnibal	a51149a717	Changes to vocab for new stringstore scheme	2016-09-30 20:01:19 +02:00
Matthew Honnibal	21e90d7d0b	Changes to test for new string-store	2016-09-30 20:00:58 +02:00
Matthew Honnibal	99de44d864	Changes to Doc and Token for new string store scheme	2016-09-30 20:00:21 +02:00
Matthew Honnibal	78f19baafa	Fix report of ParserStateError	2016-09-30 19:59:22 +02:00
Matthew Honnibal	0442e0ab1e	Changes to transition systems for new StringStore scheme	2016-09-30 19:58:51 +02:00
Matthew Honnibal	22d4752d64	Changes to strings.pyx for new StringStore scheme	2016-09-30 19:58:09 +02:00
Matthew Honnibal	4f794b215a	Changes to iterators.pyx for new StringStore scheme	2016-09-30 19:57:49 +02:00
Matthew Honnibal	95f8cfd745	Changes to morphology.pyx for new StringStore scheme	2016-09-30 19:57:10 +02:00
Matthew Honnibal	3ff09614e0	Changes to matcher.pyx for new StringStore scheme	2016-09-30 19:56:48 +02:00
Matthew Honnibal	eceeaefe53	Fix defaults for Parser and Entity, adding a blank= argument.	2016-09-30 19:56:06 +02:00
Matthew Honnibal	8423e8627f	Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good.	2016-09-30 10:14:47 +02:00
Matthew Honnibal	d3dc5718b2	Fix syntax error in Doc	2016-09-28 11:39:49 +02:00
Matthew Honnibal	1b520e7bab	Improve docstrings for Doc object	2016-09-28 11:15:13 +02:00
Matthew Honnibal	81a47c01d8	Fix test for empty sentence string.	2016-09-27 19:21:22 +02:00
Matthew Honnibal	4cbf0d3bb6	Handle errors when no valid actions are available, pointing users to the issue tracker.	2016-09-27 19:19:53 +02:00
Matthew Honnibal	430473bd98	Raise errors when no actions are available, re Issue #429	2016-09-27 19:09:37 +02:00
Matthew Honnibal	fc4a7ad794	Test and fix Issue #411: IndexError when .sents property is used on empty string.	2016-09-27 18:49:14 +02:00
Matthew Honnibal	3d370b7d45	Add test for Issue #445, fixed in `3cb4d455d`, with improved lemmatizer logic	2016-09-27 18:39:46 +02:00
Matthew Honnibal	a2f3510d6d	Fix lemmatizer	2016-09-27 17:47:05 +02:00
Matthew Honnibal	07776d8096	Fix pos name conflict in lemmatize	2016-09-27 17:35:58 +02:00
Matthew Honnibal	35cd953f9e	Fix pos name conflict with morphology	2016-09-27 14:16:22 +02:00
Matthew Honnibal	8e7df3c4ca	Expect the parser data, if parser.load() is called.	2016-09-27 14:02:12 +02:00
Matthew Honnibal	bb4f201ad2	Pass morphological features from tag map into the lemmatizer.	2016-09-27 14:01:43 +02:00
Matthew Honnibal	40509e8bca	Tweak the new is_base_form logic, because we can expect the 'pos' key in the morphology we're passed.	2016-09-27 14:01:16 +02:00
Matthew Honnibal	9c8ac91d72	Add test for Issue #435	2016-09-27 13:52:38 +02:00
Matthew Honnibal	3cb4d455d2	Pass lemmatizer morphological features, so that rules are sensitive to base/inflected distinction, which is how the WordNet data is designed. See Issue #435	2016-09-27 13:52:11 +02:00
Matthew Honnibal	e233328d38	Fix Issue #371: Lexeme objects were unhashable.	2016-09-27 13:22:30 +02:00
Matthew Honnibal	e382e48d9f	Temporarily patch handling of default templates for tagger. Need to move these to language_data.	2016-09-27 13:21:28 +02:00
Matthew Honnibal	a44763af0e	Fix Issue #469: Incorrectly cased root label in noun chunk iterator	2016-09-27 13:13:01 +02:00
Matthew Honnibal	b14b9b096b	Return None if /deps directory not present, instead of trying to load the parser.	2016-09-26 18:48:03 +02:00
Matthew Honnibal	e07b9665f7	Don't expect parser model	2016-09-26 18:09:33 +02:00
Matthew Honnibal	ee6fa106da	Fix parser features	2016-09-26 17:57:32 +02:00
Matthew Honnibal	e607e4b598	Fix parser loading	2016-09-26 17:51:11 +02:00
Matthew Honnibal	0b2d7ae9d6	Fix Entity creation	2016-09-26 15:41:22 +02:00
Matthew Honnibal	2debc4e0a2	Add .blank() method to Parser. Start housing default dep labels and entity types within the Defaults class.	2016-09-26 11:57:54 +02:00
Matthew Honnibal	722199acb8	Add spacy.blank() method, that doesn't load data. Don't try to load data if path is falsey	2016-09-26 11:07:46 +02:00
Matthew Honnibal	e56653f848	Add language data for German	2016-09-25 15:44:45 +02:00
Matthew Honnibal	7db956133e	Move tokenizer data for German into spacy.de.language_data	2016-09-25 15:37:33 +02:00
Matthew Honnibal	95aaea0d3f	Refactor so that the tokenizer data is read from Python data, rather than from disk	2016-09-25 14:49:53 +02:00
Matthew Honnibal	d7e9acdcdf	Add English language data, so that the tokenizer doesn't require the data download	2016-09-25 14:49:00 +02:00
Matthew Honnibal	82b8cc5efb	Whitespace	2016-09-24 22:17:01 +02:00
Matthew Honnibal	fd58f7655a	Python 3 compatible basestring	2016-09-24 22:16:43 +02:00
Matthew Honnibal	082e95b19e	Python 3 compatible basestring	2016-09-24 22:09:21 +02:00
Matthew Honnibal	f19af6cb2c	Python 3 compatible basestring	2016-09-24 22:08:43 +02:00
Matthew Honnibal	3ed4cdfe32	Handle pathlib.Path objects in CFile	2016-09-24 22:01:46 +02:00
Matthew Honnibal	df88690177	Fix encoding of path variable	2016-09-24 21:13:15 +02:00
Matthew Honnibal	af847e07fc	Fix usage of pathlib for Python3 -- turning paths to strings.	2016-09-24 21:05:27 +02:00
Matthew Honnibal	453683aaf0	Fix spacy/vocab.pyx	2016-09-24 20:50:31 +02:00
Matthew Honnibal	fd65cf6cbb	Finish refactoring data loading	2016-09-24 20:26:17 +02:00
Matthew Honnibal	83e364188c	Mostly finished loading refactoring. Design is in place, but doesn't work yet.	2016-09-24 15:42:01 +02:00
Matthew Honnibal	9dc8043a7e	Refactor Language to use new Defaults class, and work on revised data loading. We're getting rid of sputnik's weird file-system wrapper, and using pathlib.	2016-09-24 14:08:53 +02:00
Matthew Honnibal	b00f683a0c	Fix matcher test	2016-09-24 11:20:58 +02:00
Matthew Honnibal	eaf4065480	Expose the _patterns private member	2016-09-24 11:20:42 +02:00
Matthew Honnibal	15e42a1ba9	Allow entities to be set by Span, or by 4-tuple (with entity ID)	2016-09-24 01:17:43 +02:00
Matthew Honnibal	60fdf4d5f1	Remove commented out debugging code	2016-09-24 01:17:18 +02:00
Matthew Honnibal	939a791a52	Update tests	2016-09-24 01:17:03 +02:00
Matthew Honnibal	55f1f7edaf	Don't automatically write new entities into the Doc in the Matcher. This fixes a long-standing wart, but introduces a backwards incompatibility.	2016-09-24 01:16:45 +02:00
Matthew Honnibal	e48df859b5	Fix typedef import in span.pyx	2016-09-23 16:02:28 +02:00
Matthew Honnibal	4de13606fd	Fix token.pyx	2016-09-23 15:07:07 +02:00
Matthew Honnibal	b4de419e19	Import hash_t typedef in token.pyx	2016-09-23 14:22:06 +02:00
Matthew Honnibal	c1a2e96604	Clean up notes at end of token.pyx	2016-09-21 20:45:51 +02:00
Matthew Honnibal	f6e587b1c7	Fix matcher tests	2016-09-21 20:45:20 +02:00
Matthew Honnibal	58e83fe34b	Initial, limited support for quantified patterns in Matcher, and tracking of ent_id attribute in Token and Span. The quantifiers need a lot more testing, and there are some known problems. The main known problem is that the zero-plus and one-plus quantifiers won't work if a token can match both the quantified pattern expression AND the tail of the match.	2016-09-21 14:54:55 +02:00
Matthew Honnibal	2735b6247b	Fix orths_and_spaces in Doc.__init__	2016-09-21 14:52:05 +02:00
Matthew Honnibal	070af4af9d	Revert "* Working neural net, but features hacky. Switching to extractor." This reverts commit `7c2f1a673b`.	2016-09-21 12:26:14 +02:00
Matthew Honnibal	6b202ec43f	Merge branch 'master' of ssh://github.com/spacy-io/spaCy	2016-09-21 12:08:25 +02:00
Mahmoud Lababidi	4c9ccc3b8b	Add parameter to download() for application to not exit if a Model exists. The default behavior is unchanged.	2016-09-14 10:04:09 -04:00
Adam Ever Hadani	f1c0762443	exit code 0 for when downloading a model that already was downloaded	2016-07-13 16:22:14 -07:00
Matthew Honnibal	7c2f1a673b	* Working neural net, but features hacky. Switching to extractor.	2016-05-26 19:06:10 +02:00
Matthew Honnibal	cdc10e9a1c	* Fix Issue #375: noun phrase iteration results in index error if noun phrases are merged during the loop. Fix by accumulating the spans inside the noun_chunks property, allowing the Span index tricks to work.	2016-05-20 10:14:06 +02:00
Matthew Honnibal	13fad36e49	* Cosmetic change to english noun chunks iterator -- use enumerate instead of range loop	2016-05-20 10:11:05 +02:00
Matthew Honnibal	02276cc444	Merge branch 'master' of ssh://github.com/spacy-io/spaCy	2016-05-17 16:56:22 +02:00
Matthew Honnibal	4d7f5468bb	* Change Language class to use a .pipeline attribute, instead of having the pipeline hard coded	2016-05-17 16:55:42 +02:00
Daylen Yang	5405e7dd73	Fix get_lang_class parsing (take 2)	2016-05-16 16:40:31 -07:00
Matthew Honnibal	b240104f40	Revert "Fix get_lang_class parsing"	2016-05-17 08:04:26 +10:00
Daylen Yang	1692c2df3c	Fix get_lang_class parsing. We want get_lang_class to return "en" for both "en" and "en_glove_cc_300_1m_vectors". Changed the split rule to "_" so that this happens.	2016-05-16 14:38:20 -07:00
Matthew Honnibal	17137f5c0c	* Fix issue #372: mistake in Lexeme rich comparison	2016-05-12 12:58:57 +02:00
Matthew Honnibal	cc8bf62208	* Fix Issue #360: Tokenizer failed when the infix regex matched the start of the string while trying to tokenize multi-infix tokens.	2016-05-09 13:23:47 +02:00
Matthew Honnibal	c61ee8f9fa	* Increment version	2016-05-09 13:20:00 +02:00
Matthew Honnibal	5d86c30f0b	* Fix Issue #367: Missing has_vector property on Doc and Span objects	2016-05-09 12:36:14 +02:00
Wolfgang Seeker	7b78239436	add fix for German noun chunk iterator (issue #365)	2016-05-06 01:41:26 +02:00
Matthew Honnibal	8c0888d6cb	* Fix error in span.sent	2016-05-06 00:28:05 +02:00
Matthew Honnibal	bb94022975	* Fix Issue #365: Error introduced during noun phrase chunking, due to use of corrected PRON/PROPN/etc tags.	2016-05-06 00:21:05 +02:00
Matthew Honnibal	41342ca79b	Merge branch 'master' of ssh://github.com/spacy-io/spaCy	2016-05-06 00:17:58 +02:00
Matthew Honnibal	26095f9722	* Add span.sent property, re Issue #366	2016-05-06 00:17:38 +02:00
Wolfgang Seeker	dbf8f5f3ec	fix bug in StateC.set_break()	2016-05-05 15:15:34 +02:00
Wolfgang Seeker	3c44b5dc1a	call deprojectivization after parsing	2016-05-05 15:10:36 +02:00
Matthew Honnibal	472f576b82	* Deprojectivize German parses	2016-05-05 15:01:10 +02:00
Matthew Honnibal	9bbd6cf031	* Work on Chinese support	2016-05-05 11:39:12 +02:00
Matthew Honnibal	a6a25166ba	* Remove print from test	2016-05-05 11:10:59 +02:00
Matthew Honnibal	e31df66d26	* Fix Issue #361: Lexemes didn't have rich comparison.	2016-05-05 01:32:26 +02:00
Matthew Honnibal	7441ca30ee	* Add tests for Issue #361: Lexeme rich comparison	2016-05-05 01:31:58 +02:00
Matthew Honnibal	72564213e3	* Add test for Issue #309	2016-05-04 16:00:28 +02:00
Matthew Honnibal	76f1d871da	Merge branch 'master' of ssh://github.com/spacy-io/spaCy	2016-05-04 15:54:00 +02:00
Matthew Honnibal	519366f677	* Fix Issue #351: Indices off when leading whitespace	2016-05-04 15:53:36 +02:00
Matthew Honnibal	b4bfc6ae55	* Add test for Issue #351: Indices off when leading whitespace	2016-05-04 15:53:17 +02:00
Matthew Honnibal	76021cb853	* Fix bug in Doc.text, introduced by `a862edc`	2016-05-04 11:02:16 +02:00
Wolfgang Seeker	e4ea2bea01	fix whitespace	2016-05-04 07:40:38 +02:00
Wolfgang Seeker	5bf2fd1f78	make the code less cryptic	2016-05-03 17:19:05 +02:00
Wolfgang Seeker	a06fca9fdf	German noun chunk iterator now doesn't return tokens more than once	2016-05-03 16:58:59 +02:00
Wolfgang Seeker	7825b75548	add tests for German noun chunker	2016-05-03 15:01:28 +02:00
Wolfgang Seeker	7b246c13cb	reformulate noun chunk tests for English	2016-05-03 14:24:35 +02:00
Wolfgang Seeker	1786331cd8	add model sanity test	2016-05-03 12:51:47 +02:00
Matthew Honnibal	1f1532142f	* Fix cost calculation on non-monotonic oracle	2016-05-03 00:21:08 +02:00
Matthew Honnibal	377a624046	Merge pull request #358 from wbwseeker/german_lemmatizer_dummy: German lemmatizer dummy	2016-05-03 07:38:26 +10:00
Wolfgang Seeker	92bfbebeec	remove unnecessary imports	2016-05-02 17:33:22 +02:00
Wolfgang Seeker	857454ffa0	fix indentation -.-	2016-05-02 17:10:41 +02:00
Matthew Honnibal	308a28c26c	* Whitespace	2016-05-02 16:08:11 +02:00
Matthew Honnibal	29a114e645	* Don't assign 0-valued tags in Doc.from_array	2016-05-02 16:07:50 +02:00
Matthew Honnibal	c1c11a8ae0	* Fix formatting on serializer tests	2016-05-02 16:07:21 +02:00
Wolfgang Seeker	dae6bc05eb	define German dummy lemmatizer until morphology is done	2016-05-02 16:04:53 +02:00
Matthew Honnibal	6e1f1c4b9e	Merge pull request #357 from wbwseeker/german_ner: German ner	2016-05-02 23:39:34 +10:00
Wolfgang Seeker	b6b96b233c	don't require read_json_file to expect particular annotations	2016-05-02 15:29:30 +02:00
Matthew Honnibal	902a389d85	* Fix merge conflict in test_parse	2016-05-02 15:28:07 +02:00
Matthew Honnibal	276fbe9996	* Fix assignment of iterator on Doc object	2016-05-02 15:26:24 +02:00
Matthew Honnibal	02c23cc1d0	* Fix sentence boundary test	2016-05-02 15:26:07 +02:00
Matthew Honnibal	d2f469b809	* Fix parsing tests, so that labels are added if they're missing, and so that the branching test values are correct	2016-05-02 15:25:27 +02:00
Wolfgang Seeker	b11cbb06c6	remove old tests for sentence boundary detection	2016-05-02 14:36:35 +02:00
Matthew Honnibal	508fd1f6dc	* Refactor noun chunk iterators, so that they're simple functions. Install the iterator when the Doc is created, but allow users to write to the noun_chunk_iterator attribute. The iterator functions accept an object and yield (int start, int end, int label) triples.	2016-05-02 14:25:10 +02:00
Matthew Honnibal	e526be5602	Merge branch 'master' of ssh://github.com/spacy-io/spaCy	2016-05-02 13:08:08 +02:00
Wolfgang Seeker	fa961ea694	add tests for serialization bug	2016-05-02 11:01:56 +02:00
Matthew Honnibal	97b2bba249	* Merge updated/simplified Break approach	2016-04-25 19:44:42 +00:00
Matthew Honnibal	77609588b6	* Fix assignment of root label to words left as root implicitly, after parsing ends.	2016-04-25 19:41:59 +00:00
Matthew Honnibal	7c2d2deaa7	* Revise transition system so that the Break transition retains sole responsibility for setting sentence boundaries. Re Issue #322.	2016-04-25 19:41:59 +00:00
Wolfgang Seeker	c2f76a4024	Merge branch 'master' into german_ner	2016-04-25 13:21:23 +02:00
Wolfgang Seeker	1003e7ccec	remove debug output from tests	2016-04-25 12:12:40 +02:00
Wolfgang Seeker	f57f843e85	fix bug in updating tree structure when introducing additional roots	2016-04-25 12:01:19 +02:00
Matthew Honnibal	478a8d1829	* Register Chinese language in spacy/__init__.py	2016-04-24 18:45:16 +02:00
Matthew Honnibal	8569dbc2d0	* Add initial stuff for Chinese parsing	2016-04-24 18:44:24 +02:00
Wolfgang Seeker	4d7f393fae	don't require json-files to have syntactic annotation	2016-04-22 16:32:27 +02:00
Wolfgang Seeker	b6477fc4f4	adjusted tests to Travis Setup	2016-04-21 17:15:10 +02:00
Wolfgang Seeker	736ffcb9a2	remove whitespace	2016-04-21 16:55:55 +02:00
Wolfgang Seeker	6c7301cc6d	the parser now introduces sentence boundaries properly when predicting dependents with root labels	2016-04-21 16:50:53 +02:00
Wolfgang Seeker	12024b0b0a	bugfix: introducing multiple roots now updates original head's properties. Adjust tests to rely less on statistical model.	2016-04-20 16:42:41 +02:00
Matthew Honnibal	67ce96c9c9	* Make patterns argument to Matcher class optional	2016-04-17 21:32:24 +02:00
Matthew Honnibal	8b4677d34d	* Add missing keyword arguments to spacy.load() function	2016-04-17 21:31:50 +02:00
Matthew Honnibal	2add5206aa	* Fix description of matcher test	2016-04-17 15:40:21 +02:00
Matthew Honnibal	2b419d5b8c	* Update test for Issue #242	2016-04-17 15:34:23 +02:00
Matthew Honnibal	f12b043308	* Add test for Issue #242 : Overlapping matches not well recognised.	2016-04-17 15:19:17 +02:00
Wolfgang Seeker	b98cc3266d	bugfix: iterators now reset properly when called a second time	2016-04-15 17:49:16 +02:00
Wolfgang Seeker	e6945c4d0e	bugfix: uppercase attr values before looking them up	2016-04-15 15:46:31 +02:00
Matthew Honnibal	c0909afe22	Merge pull request #312 from wbwseeker/space_head_bug add restrictions to L-arc and R-arc to prevent space heads	2016-04-15 20:36:03 +10:00
Wolfgang Seeker	289b10f441	remove some comments	2016-04-14 15:37:51 +02:00
Matthew Honnibal	6f82065761	* Fix infixed commas in tokenizer, re Issue #326 . Need to benchmark on empirical data, to make sure this doesn't break other cases.	2016-04-14 11:36:03 +02:00
Matthew Honnibal	0f957dd586	Merge branch 'master' of ssh://github.com/honnibal/spaCy	2016-04-14 10:37:56 +02:00
Matthew Honnibal	108aca0e50	* Make Matcher use attrs from the attrs.pyx file, rather than having an incomplete function doing the mapping.	2016-04-14 10:37:39 +02:00
Matthew Honnibal	61d20de35d	* Fix language.py docstring	2016-04-14 10:36:57 +02:00
Wolfgang Seeker	d99a9cbce9	different handling of space tokens: space tokens are now always attached to the previous non-space token. There are two exceptions: leading space tokens are attached to the first following non-space token; in input that consists exclusively of space tokens, the last space token is the head of all others.	2016-04-13 15:28:28 +02:00
Matthew Honnibal	04d0209be9	* Recognise multiple infixes in a token.	2016-04-13 18:38:26 +10:00
Henning Peters	a473d6e937	fix tests (use english model)	2016-04-12 16:41:57 +02:00
Henning Peters	f2d011c034	avoid polluting spacy namespace with lang classes	2016-04-12 16:31:16 +02:00
Henning Peters	ff690f76ba	fix loading non-german models	2016-04-12 16:00:56 +02:00
Henning Peters	6215272786	remove ujson as default non-dev dependency (still works as fallback if installed), because ujson doesn't ship wheels	2016-04-12 11:28:07 +02:00
Matthew Honnibal	6df3858dbc	* Fix Issue #323 : Incorrect semantics of Token.__str__ built-in. Add flag to allow users to switch the old semantics back on, to ease transition.	2016-04-12 13:17:59 +10:00
Wolfgang Seeker	d328e0b4a8	Merge branch 'master' into space_head_bug	2016-04-11 12:11:01 +02:00
Wolfgang Seeker	80bea62842	bugfix in unit test	2016-04-08 16:46:44 +02:00
Wolfgang Seeker	be4903a1b2	update version numbers	2016-04-08 13:54:05 +02:00
Wolfgang Seeker	1fe911cdb0	bigfix	2016-04-07 18:19:51 +02:00
Matthew Honnibal	872695759d	Merge pull request #306 from wbwseeker/german_noun_chunks add German noun chunk functionality	2016-04-08 00:54:24 +10:00
Henning Peters	470cdf5bf9	remove deprecated LOCAL_DATA_DIR	2016-04-05 11:25:54 +02:00
Matthew Honnibal	26622f0ffc	Merge branch 'master' of ssh://github.com/honnibal/spaCy	2016-03-29 14:31:52 +11:00
Matthew Honnibal	b1fe41b45d	* Extend infix test, commenting on limitation of tokenizer w.r.t. infixes at the moment.	2016-03-29 14:31:05 +11:00
Matthew Honnibal	9c73983bdd	* Add test for hyphenation problem in Issue #302	2016-03-29 14:27:13 +11:00
Matthew Honnibal	ad119c074f	* Fix incorrect whitespacing in Doc.text. This change is potentially breaking, to anyone who was relying on the previous incorrect semantics.	2016-03-29 13:02:42 +11:00
Matthew Honnibal	8c7a1908ee	Merge pull request #307 from scoder/faster_string_store remove internal redundancy and overhead from StringStore	2016-03-29 12:59:52 +11:00
Wolfgang Seeker	7195b6742d	add restrictions to L-arc and R-arc to prevent space heads	2016-03-28 10:40:52 +02:00
Matthew Honnibal	8c77a994c6	Merge pull request #305 from henningpeters/master multiple langs in download script	2016-03-26 21:54:59 +11:00
Henning Peters	c90d4a6f17	relative imports in __init__.py	2016-03-26 11:44:53 +01:00
Henning Peters	db095a162c	fix	2016-03-25 18:59:47 +01:00
Henning Peters	b8f63071eb	add lang registration facility	2016-03-25 18:54:45 +01:00
Matthew Honnibal	4a37fdcee1	Merge pull request #287 from wbwseeker/deproj_sentbnd_bug add function to Token for setting head and dep (and dep_)	2016-03-25 09:47:45 +11:00
Stefan Behnel	f18805ee1c	make StringStore.__contains__() return True for the empty string (which is also contained in iteration)	2016-03-24 15:42:12 +01:00
Stefan Behnel	f2cfbfc412	remove internal redundancy and overhead from StringStore	2016-03-24 15:25:27 +01:00
Wolfgang Seeker	d65ef41d08	make error messages language independent	2016-03-24 11:47:09 +01:00
Henning Peters	963570aa49	Merge branch 'master' of github.com:spacy-io/spaCy	2016-03-24 11:19:47 +01:00
Henning Peters	a7d7ea3afa	first idea for supporting multiple langs in download script	2016-03-24 11:19:43 +01:00
Wolfgang Seeker	5080077097	revert init_model.py back to pre-german state (because it makes more sense) simplify token.n_rights and token.n_lefts	2016-03-21 16:10:25 +01:00
Wolfgang Seeker	5e2e8e951a	add baseclass DocIterator for iterators over documents add classes for English and German noun chunks the respective iterators are set for the document when created by the parser as they depend on the annotation scheme of the parsing model	2016-03-16 15:53:35 +01:00
Matthew Honnibal	80134eb12d	Merge branch 'master' of https://github.com/spacy-io/spaCy	2016-03-15 19:14:50 +00:00
Wolfgang Seeker	2ae253ef5b	changed head.__set__ to make it simpler	2016-03-14 13:43:48 +01:00
Henning Peters	c12d3dd200	add __init__.py to empty package dirs	2016-03-14 11:28:03 +01:00
Henning Peters	54f3447b5f	cleanup	2016-03-14 01:46:33 +01:00
Wolfgang Seeker	46e3f979f1	add function for setting head and label to token change PseudoProjectivity.deprojectivize to use these functions	2016-03-11 17:31:06 +01:00
Wolfgang Seeker	03fb498dbe	introduce lang field for LexemeC to hold language id put noun_chunk logic into iterators.py for each language separately	2016-03-10 13:01:34 +01:00
Wolfgang Seeker	bc9c62e279	replace Language functions with corresponding orth functions implement punctuation functions in orth	2016-03-09 18:07:37 +01:00
Wolfgang Seeker	d9312bc9ea	add new files npchunks.{pyx,pxd} to hold noun phrase chunk generators	2016-03-09 16:18:48 +01:00
Matthew Honnibal	1508528c8c	* Increment version	2016-03-08 15:58:45 +00:00
Matthew Honnibal	963fe5258e	* Add missing __contains__ method to vocab	2016-03-08 15:49:10 +00:00
Matthew Honnibal	478aa21cb0	* Remove broken __reduce__ method on vocab	2016-03-08 15:48:21 +00:00
Matthew Honnibal	20235bde00	Merge pull request #282 from henningpeters/switch_vectors initial proposal for ability to switch vectors	2016-03-09 01:39:41 +11:00
Henning Peters	eb7ae61b1c	cleanup api	2016-03-08 12:59:18 +01:00
Henning Peters	b740f20191	hash_string() should not depend on python's internal unicode representation, also fixes https://github.com/spacy-io/sense2vec/issues/5 for py2	2016-03-06 09:19:27 +01:00
Henning Peters	aa4d964c14	cleanup api	2016-03-05 17:51:32 +01:00
Henning Peters	931c07a609	initial proposal for separate vector package	2016-03-04 11:09:06 +01:00
Wolfgang Seeker	7adbd7a785	replace Counter with normal dict	2016-03-03 21:36:27 +01:00
Wolfgang Seeker	1ae487a4f6	add backwards compatibility with python 2.6	2016-03-03 21:18:12 +01:00
Wolfgang Seeker	9d1e6de4a0	make a proper list from zip iterator	2016-03-03 19:51:01 +01:00
Wolfgang Seeker	49f9d1c085	change test_nonproj.py to not use zip inside numpy.asarray	2016-03-03 19:42:09 +01:00
Wolfgang Seeker	72b8df0684	turned PseudoProjectivity into a normal python class	2016-03-03 19:05:08 +01:00
Matthew Honnibal	fcaa0ad7ce	Merge pull request #280 from wbwseeker/german_parser German parser	2016-03-04 03:27:42 +11:00
Wolfgang Seeker	690c5acabf	adjust train.py to train both english and german models	2016-03-03 15:21:00 +01:00
Wolfgang Seeker	3448cb40a4	integrated pseudo-projective parsing into parser - nonproj.pyx holds a class PseudoProjectivity which currently holds all functionality to implement Nivre & Nilsson 2005's pseudo-projective parsing using the HEAD decoration scheme - changed lefts/rights in Token to account for possible non-projective structures	2016-03-01 10:09:08 +01:00
Wolfgang Seeker	56b7210e82	moved nonproj.py to syntax/nonproj.pyx	2016-02-25 15:08:49 +01:00
Henning Peters	f3df736e0a	remove unidecode-related test	2016-02-24 18:22:22 +01:00
Wolfgang Seeker	4b2297d5d4	add class PseudoProjective for pseudo-projective parsing PseudoProjective() implements the algorithm from Nivre & Nilsson 2005 using their HEAD decoration scheme.	2016-02-24 11:26:25 +01:00
Henning Peters	12d58a7099	remove text-unidecode dependency	2016-02-24 08:01:59 +01:00
Wolfgang Seeker	8d531c958b	replace tests for non-projectivity - add functions to find non-projective edges - add test file for non-projectivity functions	2016-02-22 14:40:40 +01:00
Matthew Honnibal	141639ea3a	* Fix bug in tokenizer that caused new tokens to be added for affixes	2016-02-21 23:17:47 +00:00
Wolfgang Seeker	eae35e9b27	add tokenizer files for German, add/change code to train German pos tagger - add files to specify rules for German tokenization - change generate_specials.py to generate from an external file (abbrev.de.tab) - copy gazetteer.json from lang_data/en/ - init_model.py - change doc freq threshold to 0 - add train_german_tagger.py - expects conll09-formatted input	2016-02-18 13:24:20 +01:00
Henning Peters	9cc4f8d5b3	avoid shadowing __name__	2016-02-15 01:33:39 +01:00
Henning Peters	4c9e3c7911	upgrade spuntik, enforce data api via model version constraints	2016-02-14 16:03:17 +01:00
Henning Peters	9d8966a2c0	Update test_tokenizer.py	2016-02-10 19:24:37 +01:00
Henning Peters	3b5f1e753b	py26 compatibility	2016-02-10 14:32:54 +01:00
Henning Peters	ee1f1ac300	mark test_sentence_space() as model test	2016-02-10 07:49:11 +01:00
Matthew Honnibal	5d96b3ef4f	* Increment version	2016-02-07 13:48:58 +01:00
Matthew Honnibal	1b83cb9dfa	* Fix Issue #251 : Incorrect right edge calculation on left-clobber low in the tree	2016-02-07 00:00:42 +01:00
Matthew Honnibal	c6623889c1	* Add test for Issue #251 : Incorrect right edges, caused by bad update to r_edge in del_arc, triggered from non-monotonic left-arc	2016-02-06 23:47:51 +01:00
Matthew Honnibal	a95974ad3f	* Fix oov probability	2016-02-06 15:13:55 +01:00
Matthew Honnibal	af8514cb0c	* Refine the way the is_parsed attribute is set by from_array	2016-02-06 14:44:35 +01:00
Matthew Honnibal	161b01d4c0	* Tweak usage example for multi-processing	2016-02-06 14:44:11 +01:00
Matthew Honnibal	7f24229f10	* Don't try to pickle the tokenizer	2016-02-06 14:09:05 +01:00
Matthew Honnibal	dcb401f3e1	* Remove broken Vocab pickling	2016-02-06 14:08:47 +01:00
Matthew Honnibal	e66d45bf66	* Restore previous patch to Span.root, as it seems it wasn't the cause of the problem.	2016-02-06 13:37:41 +01:00
Matthew Honnibal	4412a70dc5	* Initialize StateC._empty_token to 0, to avoid undefined behaviour.	2016-02-06 13:34:38 +01:00
Matthew Honnibal	1b41f868d2	* Check for errors in parser, and parallelise the left-over batch	2016-02-06 10:06:30 +01:00
Matthew Honnibal	031b00cb91	* Fix Span.root calculation	2016-02-05 20:12:09 +01:00
Matthew Honnibal	165ca28b80	* Set is_parsed flag in Parser.pipe	2016-02-05 19:51:44 +01:00
Matthew Honnibal	bdd579db0a	* Set is_parsed flag in Parser.pipe	2016-02-05 19:50:11 +01:00
Matthew Honnibal	7119e77fb6	* Fix Matcher.pipe	2016-02-05 19:46:02 +01:00
Matthew Honnibal	1cf0100bf6	* Add test for multithreading	2016-02-05 19:38:22 +01:00
Matthew Honnibal	b04c9aad71	* Fix off-by-one in Parser.pipe	2016-02-05 19:37:50 +01:00
Matthew Honnibal	e5c447e237	* Questionable fix to problem in Span.root	2016-02-05 19:18:35 +01:00
Matthew Honnibal	1ef84a0557	* Merge master into rethinc2	2016-02-05 12:55:59 +01:00
Matthew Honnibal	4cf34fc170	Merge branch 'rethinc2' of ssh://github.com/honnibal/spaCy into rethinc2	2016-02-05 12:48:28 +01:00
Matthew Honnibal	249dccbe95	* Fix Language.pipe	2016-02-05 12:47:57 +01:00
Matthew Honnibal	c0e63feccc	* xfail pickle tests	2016-02-05 12:46:58 +01:00
Matthew Honnibal	6aa92b70f1	* Fix merge problem in span	2016-02-05 12:46:11 +01:00
Matthew Honnibal	048dfe35aa	* cimport cython.parallel	2016-02-05 12:20:42 +01:00
Matthew Honnibal	af58f273b3	* Fix spacy.language.pipe	2016-02-05 12:20:29 +01:00
Matthew Honnibal	8a13cebdcc	* Update for modified thinc interface	2016-02-05 11:44:39 +01:00
Matthew Honnibal	48ce09687d	* Skip pickling the vocab in the tests	2016-02-04 15:51:19 +01:00
Matthew Honnibal	419edfab50	* Use generic flags for the new attributes until they're added	2016-02-04 15:50:54 +01:00
Matthew Honnibal	c4017a06d9	* Add placeholders for the new flags in attrs and symbols	2016-02-04 15:49:45 +01:00
Matthew Honnibal	e5c96c969f	* Wire up new attributes	2016-02-04 13:04:58 +01:00
Matthew Honnibal	9703ccc3de	* Remove unused import	2016-02-04 13:04:33 +01:00
Matthew Honnibal	11810be33e	* Add Python hooks for is_bracket/is_quote/is_left_punct/is_right_punct	2016-02-04 13:04:16 +01:00
Matthew Honnibal	fe611132f0	* Add stubs for is_bracket/is_quote/is_left_punct/is_right_punct functions	2016-02-04 13:03:04 +01:00
Matthew Honnibal	ee975d36d0	* Add stubs to test is_bracket/is_quote/is_left_punct/is_right_punct functions	2016-02-04 13:02:25 +01:00
Matthew Honnibal	f9e765cae7	* Add pipe() method to tokenizer	2016-02-03 02:32:37 +01:00
Matthew Honnibal	4cbad510ff	* Fix calculation of head for spans with punctuation.	2016-02-03 02:32:21 +01:00
Matthew Honnibal	84b247ef83	* Add a .pipe method, that takes a stream of input, operates on it, and streams the output. Internally, the stream may be buffered, to allow multi-threading.	2016-02-03 02:10:58 +01:00
Matthew Honnibal	fcfc17a164	Merge branch 'master' into rethinc2	2016-02-02 23:05:34 +01:00
Matthew Honnibal	f204daf27b	* Add error warning that a gold tag is unrecognised	2016-02-02 22:59:59 +01:00
Matthew Honnibal	99b8906100	* Accept punct_labels as an argument to the scorer	2016-02-02 22:59:06 +01:00
Matthew Honnibal	59123443e2	* Check for presence/absence of the different models in Language.end_training	2016-02-02 22:49:55 +01:00
Matthew Honnibal	9e9d4c8706	* Fix stupid error in Language.batch	2016-02-01 09:49:32 +01:00
Matthew Honnibal	e3db39dd21	* Fix compiler warning about signed/unsigned comparison	2016-02-01 09:08:07 +01:00
Matthew Honnibal	98fbdf2856	* Add Language.batch() method, to support multi-threaded jobs	2016-02-01 09:01:13 +01:00
Matthew Honnibal	b3802562d6	Merge branch 'rethinc2' of https://github.com/honnibal/spaCy into rethinc2	2016-02-01 08:59:24 +01:00
Matthew Honnibal	4b08a3fafd	* Fix merge conflict	2016-02-01 08:58:18 +01:00
Matthew Honnibal	5188f6d9d8	* Fix parseC function	2016-02-01 08:48:48 +01:00
Matthew Honnibal	bcf8f7ba40	* Add a parse_batch method to Parser, that releases the GIL around a batch of documents.	2016-02-01 08:34:55 +01:00
Matthew Honnibal	d5579cd0d8	Merge branch 'rethinc2' of https://github.com/honnibal/spaCy into rethinc2	2016-02-01 03:08:49 +01:00
Matthew Honnibal	490ba65398	* Use openmp in parser	2016-02-01 03:08:42 +01:00
Matthew Honnibal	cb78d91ec5	* Fix ArcEager.set_valid	2016-02-01 03:07:37 +01:00
Matthew Honnibal	28e5ad62bc	* Pass a StateC pointer into the transition and validation methods in the parser, so that the GIL can be released over a batch of documents	2016-02-01 03:00:15 +01:00
Matthew Honnibal	a47f00901b	* Pass a StateC pointer into the transition and validation methods in the parser, so that the GIL can be released over a batch of documents	2016-02-01 02:58:14 +01:00
Matthew Honnibal	daaad66448	* Now fully proxied	2016-02-01 02:37:08 +01:00
Matthew Honnibal	7a0e3bb9c1	* Continue proxying. Some problem currently	2016-02-01 02:22:21 +01:00
Matthew Honnibal	2169bbb7ea	* Shadow StateClass with StateC, to start proxying	2016-02-01 01:16:14 +01:00
Matthew Honnibal	2fa228458e	* Add _state file, which StateClass will proxy to	2016-02-01 01:09:21 +01:00
Matthew Honnibal	6bb007d16e	* Make set_parse nogil	2016-01-30 20:27:52 +01:00
Matthew Honnibal	9410e74c92	* Switch parser to use nogil functions	2016-01-30 20:27:07 +01:00
Matthew Honnibal	10877a7791	* Update for thinc 5.0, including changing cost from int to weight_t, and updating the tagger and parser	2016-01-30 14:31:36 +01:00
Matthew Honnibal	ea4ff94cde	* Whitespace	2016-01-29 03:59:22 +01:00
Matthew Honnibal	b0718b6ee1	* Move to thinc 5.0	2016-01-29 03:58:55 +01:00
Matthew Honnibal	9721502c81	* Update version	2016-01-25 15:52:59 +01:00
Matthew Honnibal	907e8cf07d	* Add u prefix to string in web example	2016-01-25 15:51:38 +01:00
Matthew Honnibal	eba03695ef	* Comment out pickle tests	2016-01-25 15:51:13 +01:00
Matthew Honnibal	de94e6c525	* Mark pickle tests as xfail, due to temp files problem	2016-01-25 15:24:17 +01:00
Matthew Honnibal	87172a15c6	* Fix runtime error bug that arose from updated Span.root function.	2016-01-25 15:22:42 +01:00
Matthew Honnibal	2c8dd91785	* Fix first code example on the website	2016-01-23 18:09:19 +01:00
Matthew Honnibal	3af84cfd6e	* Increment version	2016-01-21 17:49:27 +01:00
Henning Peters	65aeac24cb	remove package version constraint	2016-01-21 17:40:51 +01:00
Matthew Honnibal	792c98a438	* Increment version for OSX-fixed release of v0.100	2016-01-21 00:23:04 +01:00
Matthew Honnibal	82d011ac43	* Fix test for whitespace	2016-01-19 20:38:26 +01:00
Matthew Honnibal	e89069dcae	* Fix matcher test	2016-01-19 20:24:01 +01:00
Matthew Honnibal	63e3d4e27f	* Add comment on Vocab.__reduce__	2016-01-19 20:11:25 +01:00
Matthew Honnibal	e1282b7f2f	* Require user-custom NER classes to work without adding the label.	2016-01-19 20:11:03 +01:00
Matthew Honnibal	84c5dfbfc3	* Clean up debugging python list	2016-01-19 20:10:32 +01:00
Matthew Honnibal	04d0686b26	* Make TransitionSystem.add_action idempotent, i.e. ignore duplicate added actions.	2016-01-19 20:10:04 +01:00
Matthew Honnibal	c4a89d56bd	* Automatically register any entity types pre-set on the tokens, so that the NER works with user-given entity types.	2016-01-19 20:09:26 +01:00
Matthew Honnibal	f0f92793f6	* Add test for user NER classes in matcher blocking the NER model. Re Issue #178 and Issue #217	2016-01-19 19:23:16 +01:00
Matthew Honnibal	65c5bc4988	* Add add_label method, to allow users to register new entity types and dependency labels.	2016-01-19 19:11:02 +01:00
Matthew Honnibal	151aa0b0e2	* Allow users to add_label, in order to extend the entity recogniser to new classes. Does not by itself add a class to the model	2016-01-19 19:09:33 +01:00
Matthew Honnibal	c8e0011ebc	* Add iterators to the NER and parser transition systems, to get the action types	2016-01-19 19:07:43 +01:00
Matthew Honnibal	515493c675	* Add xfail test for Issue #225 : tokenization with non-whitespace delimiters	2016-01-19 13:20:14 +01:00
Matthew Honnibal	7abe653223	* Fix imports	2016-01-19 03:36:51 +01:00
Matthew Honnibal	590f38bdb2	* Add hacky solution to Issue #220 . Currently specials.json only supports literal patterns, which doesn't allow us to pre-tag whitespace with the correct token, SP, as a rule. The data-driven approach should be easy but for some reason fails here. Adding a hard code in Morphology isn't a good solution, but we do want to fix the behaviour right away, and don't want to wait for an architecturally better solution.	2016-01-19 03:35:20 +01:00
Matthew Honnibal	445164d5b4	* Restore the LOCAL_DATA_DIR global in spacy/en/__init__.py, although this is now deprecated	2016-01-19 02:54:56 +01:00
Matthew Honnibal	04177debd0	* Unwind limit to sentence boundary detection that prevents it from inserting boundaries on whitespace. Replace it with a check for whitespace in StateClass.fast_forward, so that whitespace is LeftArced when it's on the stack. This should prevent the previous problem of whitespace-only sentences. Should fix Issue #184 , but may cause further problems. Needs testing.	2016-01-19 02:54:15 +01:00
Matthew Honnibal	7893de3203	* Add test for Issue #184 : Whitespace at sentence boundary causes sentence boundary error.	2016-01-18 23:04:38 +01:00
Matthew Honnibal	bba0a5e078	* Handle string paths in default_vocab, default_parser, default_entity in Language class	2016-01-18 22:37:24 +01:00
Matthew Honnibal	e825fd9554	* Make some of the website tests work without models	2016-01-18 18:14:44 +01:00
Matthew Honnibal	334c4b2b57	* Disprefer punctuation and spaces as heads of spans	2016-01-18 18:14:09 +01:00
Matthew Honnibal	bed36ab0ff	* Fix import of HEAD attribute	2016-01-18 17:34:43 +01:00
Matthew Honnibal	28c659c1fe	* Fix import for numpy	2016-01-18 17:25:04 +01:00
Matthew Honnibal	fc36bcf458	* Fix import for English	2016-01-18 17:14:40 +01:00
Matthew Honnibal	cc4c335e14	* Set heads for test_merge_tokens, to make the test run without models	2016-01-18 17:00:11 +01:00
Matthew Honnibal	c107da9738	* Bug fix to _count_words_to_root	2016-01-18 16:59:38 +01:00
Matthew Honnibal	f24833d607	* Fix merge for coordinations	2016-01-18 16:03:19 +01:00
Matthew Honnibal	14534958a9	* Fix bug in Span.root	2016-01-18 15:40:28 +01:00
Matthew Honnibal	714cbc03d5	* Add test for Issue #203 : nested noun chunks.	2016-01-16 18:02:30 +01:00
Matthew Honnibal	4e2253170c	* Move test for doc.merge to tokens_api file, to avoid name conflicts which upset pytest	2016-01-16 18:01:36 +01:00
Matthew Honnibal	34a157511f	* Move test_merge_hang to test_tokens_api	2016-01-16 18:00:26 +01:00
Matthew Honnibal	fc8f26584a	* Don't consider NPs connected to parse via conj relation as noun chunks. Change motivated by the nested noun chunks identified in Issue #203 , but might be problematic. Also allow root NPs to be considered noun chunks.	2016-01-16 17:52:40 +01:00
Matthew Honnibal	4a16dbfeca	* Add test for Issue #203 : noun chunks should be flat, but sometimes are nested	2016-01-16 17:41:25 +01:00
Matthew Honnibal	995b2d18fd	* Route token.string via token.txt_with_ws, to deprecate token.string in future	2016-01-16 17:14:34 +01:00
Matthew Honnibal	54a98eaf19	* Fix typo text_wth_ws --> text_with_ws. Reroute .string attribute to text_with_ws, to deprecate .string in future	2016-01-16 17:13:50 +01:00
Matthew Honnibal	3e9961d2c4	* If final token is whitespace, don't mark it as owning a trailing space. Fixes Issue #154	2016-01-16 17:08:59 +01:00
Matthew Honnibal	223d2b3484	* Add test for Issue #154 : Additional whitespace introduced when string ends with a whitespace token.	2016-01-16 17:08:07 +01:00
Matthew Honnibal	3dc398b727	* Fix merge conflict in requirements.txt	2016-01-16 16:20:49 +01:00
Matthew Honnibal	fc5962a77d	* Improve test for root token in Span	2016-01-16 16:19:09 +01:00
Matthew Honnibal	c025a0c64b	* Check for KeyboardInerrupt in parser.__call__	2016-01-16 16:18:44 +01:00
Matthew Honnibal	03e8a4293d	* Add loop guard to Token.lefts and Token.rights properties	2016-01-16 16:18:17 +01:00
Matthew Honnibal	304339985e	* Add a linear scan to Span.root method, to help with long sentences	2016-01-16 16:17:28 +01:00
Matthew Honnibal	aa0dd79f52	* Delete test_token_references, which checked a flakey strategy for preventing orphan tokens from a while ago. Now orphan tokens simply hold a reference to Pool, preventing the memory from being freed underneath them. This means that we don't need to run this slow test.	2016-01-16 16:03:35 +01:00
Matthew Honnibal	8cbcc3a799	* Fix calculation of root token in Span. Now take root to be word with shortest tree path. Avoids parse trees ending up in inconsistent state, as had occurred in Issue #214 .	2016-01-16 15:38:50 +01:00
Matthew Honnibal	c1039fa4b4	* Add test for Issue #214 . Resolved in change to Span.root	2016-01-16 15:37:47 +01:00
Henning Peters	41ea14a56f	fix pickling	2016-01-16 13:23:11 +01:00
Henning Peters	5551052840	fix py2/3 issue	2016-01-16 12:44:53 +01:00
Henning Peters	235f094534	untangle data_path/via	2016-01-16 12:23:45 +01:00
Matthew Honnibal	42a9f29b40	* Add loop guard in Span.root, to raise errors if there is a cycle in the dependency parse, instead of entering an infinite loop. Re Issue #214	2016-01-16 11:53:37 +01:00


2878 Commits