spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-07-10 00:02:19 +03:00

Author	SHA1	Message	Date
Adriane Boyd	33946d2ef8	Use special Matcher only for cases with affixes * Reinsert specials cache checks during normal tokenization for special cases as much as possible * Additionally include specials cache checks while splitting on infixes * Since the special Matcher needs consistent affix-only tokenization for the special cases themselves, introduce the argument `with_special_cases` in order to do tokenization with or without specials cache checks * After normal tokenization, postprocess with special cases Matcher for special cases containing affixes	2019-09-16 14:16:30 +02:00
Adriane Boyd	5eeaffe14f	Reload special cases when necessary Reload special cases when affixes or token_match are modified. Skip reloading during initialization.	2019-09-08 22:40:08 +02:00
Adriane Boyd	64f86b7e97	Merge remote-tracking branch 'upstream/master' into bugfix/tokenizer-special-cases-matcher	2019-09-08 21:30:01 +02:00
adrianeboyd	3780e2ff50	Flush tokenizer cache when necessary (#4258) Flush tokenizer cache when affixes, token_match, or special cases are modified. Fixes #4238, same issue as in #1250.	2019-09-08 20:52:46 +02:00
Adriane Boyd	5861308910	Generalize handling of tokenizer special cases Handle tokenizer special cases more generally by using the Matcher internally to match special cases after the affix/token_match tokenization is complete. Instead of only matching special cases while processing balanced or nearly balanced prefixes and suffixes, this recognizes special cases in a wider range of contexts: * Allows arbitrary numbers of prefixes/affixes around special cases * Allows special cases separated by infixes Existing tests/settings that couldn't be preserved as before: * The emoticon '")' is no longer a supported special case * The emoticon ':)' in "example:)" is a false positive again When merged with #4258 (or the relevant cache bugfix), the affix and token_match properties should be modified to flush and reload all special cases to use the updated internal tokenization with the Matcher.	2019-09-08 20:35:16 +02:00
Matthew Honnibal	b0f6fd3f1d	Disable tokenizer cache for special-cases. Fixes #1250	2017-10-24 16:08:05 +02:00
Ines Montani	aa876884f0	Revert "Revert "Merge remote-tracking branch 'origin/master'"" This reverts commit `fb9d3bb022`.	2017-01-09 13:28:13 +01:00
Matthew Honnibal	fd65cf6cbb	Finish refactoring data loading	2016-09-24 20:26:17 +02:00
Matthew Honnibal	141639ea3a	* Fix bug in tokenizer that caused new tokens to be added for affixes	2016-02-21 23:17:47 +00:00
Chris DuBois	dac8fe7bdb	Add __reduce__ to Tokenizer so that English pickles. - Add tests to test_pickle and test_tokenizer that save to tempfiles.	2015-10-23 22:24:03 -07:00
Matthew Honnibal	c2307fa9ee	* More work on language-generic parsing	2015-08-28 02:02:33 +02:00
Matthew Honnibal	119c0f8c3f	* Hack out morphology stuff from tokenizer, while morphology being reimplemented.	2015-08-26 19:20:11 +02:00
Matthew Honnibal	109106a949	* Replace UniStr, using unicode objects instead	2015-07-22 04:52:05 +02:00
Matthew Honnibal	cfd842769e	* Allow infix tokens to be variable length	2015-07-18 22:45:00 +02:00
Matthew Honnibal	67641f3b58	* Refactor tokenizer, to set the 'spacy' field on TokenC instead of passing a string	2015-07-13 21:46:02 +02:00
Matthew Honnibal	6eef0bf9ab	* Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx	2015-07-13 20:20:58 +02:00
Matthew Honnibal	bb522496dd	* Rename Tokens to Doc	2015-07-08 18:53:00 +02:00
Matthew Honnibal	6c7e44140b	* Work on word vectors, and other stuff	2015-01-17 16:21:17 +11:00
Matthew Honnibal	ce2edd6312	* Tmp commit. Refactoring to create a Python Lexeme class.	2015-01-12 10:26:22 +11:00
Matthew Honnibal	a60ae261ae	* Move tokenizer to its own file, and refactor	2014-12-20 07:29:16 +11:00

20 Commits