Adriane Boyd
e4a1b5dab1
Rename to url_match
...
Rename to `url_match` and update docs.
2020-05-22 12:41:03 +02:00
Adriane Boyd
565e0eef73
Add tokenizer option for token match with affixes
...
To fix the slow tokenizer URL (#4374) and allow `token_match` to take
priority over prefixes and suffixes by default, introduce a new
tokenizer option for a token match pattern that's applied after prefixes
and suffixes but before infixes.
2020-05-05 10:35:33 +02:00
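
The ordering described in the commit above (a token match that wins over prefixes and suffixes, and a second match applied after affix stripping but before infix splitting) can be illustrated with a small pure-Python sketch. This is not the tokenizer's actual implementation: the function `tokenize_chunk`, its parameters, and the toy regular expressions are all hypothetical and exist only to show the order of the checks.

```python
import re


def tokenize_chunk(chunk, prefix_re, suffix_re, infix_re,
                   token_match=None, url_match=None):
    """Split one whitespace-delimited chunk (illustrative sketch only)."""
    prefixes, suffixes = [], []
    while chunk:
        # token_match is checked first, so it takes priority over affixes.
        if token_match and token_match(chunk):
            break
        m = prefix_re.match(chunk)
        if m:
            prefixes.append(m.group())
            chunk = chunk[m.end():]
            continue
        m = suffix_re.search(chunk)
        if m:
            suffixes.append(m.group())
            chunk = chunk[:m.start()]
            continue
        break  # no more affixes to strip

    middle = []
    if chunk:
        if (token_match and token_match(chunk)) or (url_match and url_match(chunk)):
            # Matched after affix stripping: keep whole, skip infix splitting.
            middle = [chunk]
        else:
            pos = 0
            for m in infix_re.finditer(chunk):
                if m.start() > pos:
                    middle.append(chunk[pos:m.start()])
                middle.append(m.group())
                pos = m.end()
            if pos < len(chunk):
                middle.append(chunk[pos:])
    # Suffixes were collected from the outside in, so reverse them.
    return prefixes + middle + suffixes[::-1]


# Toy patterns, assumed for this example only.
PREFIX_RE = re.compile(r'^[\("]')
SUFFIX_RE = re.compile(r'[\)",.]$')
INFIX_RE = re.compile(r'[-/:]')
URL_MATCH = re.compile(r'^https?://\S+$').match

with_url = tokenize_chunk('(http://example.com/a-b),', PREFIX_RE, SUFFIX_RE,
                          INFIX_RE, url_match=URL_MATCH)
without_url = tokenize_chunk('(http://example.com/a-b),', PREFIX_RE, SUFFIX_RE,
                             INFIX_RE)
```

With the URL pattern supplied, the surrounding punctuation is split off while the URL stays in one piece; without it, the same chunk is broken apart at every infix character.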
adrianeboyd
3780e2ff50
Flush tokenizer cache when necessary (#4258)
...
Flush tokenizer cache when affixes, token_match, or special cases are
modified.
Fixes #4238, same issue as in #1250.
2019-09-08 20:52:46 +02:00
Matthew Honnibal
b0f6fd3f1d
Disable tokenizer cache for special-cases. Fixes #1250
2017-10-24 16:08:05 +02:00
Ines Montani
aa876884f0
Revert "Revert "Merge remote-tracking branch 'origin/master'""
...
This reverts commit fb9d3bb022.
2017-01-09 13:28:13 +01:00
Matthew Honnibal
fd65cf6cbb
Finish refactoring data loading
2016-09-24 20:26:17 +02:00
Matthew Honnibal
141639ea3a
* Fix bug in tokenizer that caused new tokens to be added for affixes
2016-02-21 23:17:47 +00:00
Chris DuBois
dac8fe7bdb
Add __reduce__ to Tokenizer so that English pickles.
...
- Add tests to test_pickle and test_tokenizer that save to tempfiles.
2015-10-23 22:24:03 -07:00
Matthew Honnibal
c2307fa9ee
* More work on language-generic parsing
2015-08-28 02:02:33 +02:00
Matthew Honnibal
119c0f8c3f
* Hack out morphology stuff from tokenizer, while morphology being reimplemented.
2015-08-26 19:20:11 +02:00
Matthew Honnibal
109106a949
* Replace UniStr, using unicode objects instead
2015-07-22 04:52:05 +02:00
Matthew Honnibal
cfd842769e
* Allow infix tokens to be variable length
2015-07-18 22:45:00 +02:00
Matthew Honnibal
67641f3b58
* Refactor tokenizer, to set the 'spacy' field on TokenC instead of passing a string
2015-07-13 21:46:02 +02:00
Matthew Honnibal
6eef0bf9ab
* Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx
2015-07-13 20:20:58 +02:00
Matthew Honnibal
bb522496dd
* Rename Tokens to Doc
2015-07-08 18:53:00 +02:00
Matthew Honnibal
6c7e44140b
* Work on word vectors, and other stuff
2015-01-17 16:21:17 +11:00
Matthew Honnibal
ce2edd6312
* Tmp commit. Refactoring to create a Python Lexeme class.
2015-01-12 10:26:22 +11:00
Matthew Honnibal
a60ae261ae
* Move tokenizer to its own file, and refactor
2014-12-20 07:29:16 +11:00