spaCy/spacy
Adriane Boyd 0d9740e826 Replace PhraseMatcher with Aho-Corasick
Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays
of the hash values for the relevant attribute. The implementation is
based on FlashText.

The speed should be similar to the previous PhraseMatcher. It is now
possible to easily remove match IDs, and matches no longer go missing
with large keyword lists / vocabularies.
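
To illustrate the idea only, here is a minimal sketch of a FlashText-style
trie over token hash values. It is not spaCy's implementation: the class
`TokenTrieMatcher` and the helper `to_keys` are hypothetical names, plain
Python dicts stand in for the numpy arrays, Python's built-in `hash` stands
in for spaCy's attribute hashes, and the scan restarts at every token instead
of following Aho-Corasick failure links.

```python
from typing import Dict, List, Sequence, Tuple

_END = object()  # sentinel key marking the end of a stored pattern


def to_keys(text: str) -> List[int]:
    """Hypothetical helper: whitespace-tokenize, lowercase, hash each token."""
    return [hash(word) for word in text.lower().split()]


class TokenTrieMatcher:
    """Toy phrase matcher over sequences of token hash values (assumption:
    a simplified stand-in, not the matcher described in this commit)."""

    def __init__(self) -> None:
        self._root: Dict = {}

    def add(self, match_id: str, pattern: Sequence[int]) -> None:
        # Walk or create one trie node per token key, then record the ID.
        node = self._root
        for key in pattern:
            node = node.setdefault(key, {})
        node.setdefault(_END, set()).add(match_id)

    def remove(self, match_id: str) -> None:
        # Drop the ID everywhere and prune branches that become empty.
        self._prune(self._root, match_id)

    def _prune(self, node: Dict, match_id: str) -> None:
        ids = node.get(_END)
        if ids is not None:
            ids.discard(match_id)
            if not ids:
                del node[_END]
        for key in [k for k in node if k is not _END]:
            child = node[key]
            self._prune(child, match_id)
            if not child:
                del node[key]

    def find(self, tokens: Sequence[int]) -> List[Tuple[str, int, int]]:
        # Report (match_id, start, end) for every stored pattern in tokens.
        matches = []
        for start in range(len(tokens)):
            node = self._root
            for end in range(start, len(tokens)):
                node = node.get(tokens[end])
                if node is None:
                    break
                for match_id in sorted(node.get(_END, ())):
                    matches.append((match_id, start, end + 1))
        return matches


matcher = TokenTrieMatcher()
matcher.add("CITY", to_keys("new york"))
matcher.add("CITY", to_keys("new york city"))
matcher.add("ORG", to_keys("york city council"))

doc = to_keys("I love New York City Council")
before = matcher.find(doc)
matcher.remove("CITY")
after = matcher.find(doc)
```

Because every pattern is stored as a path in the trie, removing a match ID
amounts to deleting its end markers and pruning the branches left empty,
which is what makes removal straightforward in this kind of structure.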

Fixes #4308.
2019-09-19 16:49:05 +02:00
..
cli Fix conllu converter 2019-09-11 13:28:07 +02:00
data Make spacy/data a package 2017-03-18 20:04:22 +01:00
displacy Improve token pattern checking without validation (#4105) 2019-08-21 14:00:37 +02:00
lang Update examples and languages.json [ci skip] 2019-09-15 17:56:40 +02:00
matcher Replace PhraseMatcher with Aho-Corasick 2019-09-19 16:49:05 +02:00
pipeline remove redundant __call__ method in pipes.TextCategorizer (#4305) 2019-09-18 21:31:27 +02:00
syntax Distinction between outside, missing and blocked NER annotations (#4307) 2019-09-18 21:37:17 +02:00
tests Replace PhraseMatcher with Aho-Corasick 2019-09-19 16:49:05 +02:00
tokens Distinction between outside, missing and blocked NER annotations (#4307) 2019-09-18 21:37:17 +02:00
__init__.pxd * Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags. 2014-10-24 02:23:42 +11:00
__init__.py Fix formatting (hopefully also restarts build properly) 2019-03-20 09:55:45 +01:00
__main__.py Update __main__.py 2019-03-20 09:43:26 +01:00
_align.pyx Improve alignment around quotes 2018-08-16 01:04:34 +02:00
_ml.py Fix absolute imports and avoid importing from cli 2019-08-20 15:08:59 +02:00
about.py Set version to v2.1.8 2019-08-07 13:53:58 +02:00
attrs.pxd Fix attrs alignment 2019-07-12 17:59:47 +02:00
attrs.pyx ensure Span.as_doc keeps the entity links + unit test 2019-06-25 15:28:51 +02:00
compat.py Fix symlink creation to show error message on failure (#3589) (resolves #3307)) 2019-04-16 11:58:31 +02:00
errors.py Distinction between outside, missing and blocked NER annotations (#4307) 2019-09-18 21:37:17 +02:00
glossary.py Update glossary.py to match information found in documentation (#3704) (closes ##3679) 2019-05-10 14:23:20 +02:00
gold.pxd fixes in kb and gold 2019-07-17 17:18:26 +02:00
gold.pyx WIP: Extending debug-data (#4114) 2019-08-16 10:52:46 +02:00
kb.pxd rename entity frequency 2019-07-19 17:40:28 +02:00
kb.pyx CLI scripts for entity linking (wikipedia & generic) (#4091) 2019-08-13 15:38:59 +02:00
language.py Add "labels" to Language.meta 2019-09-12 11:34:25 +02:00
lemmatizer.py Fix inconsistant lemmatizer issue #3484 (#3646) 2019-05-04 18:16:03 +02:00
lexeme.pxd 💫 Support lexical attributes in retokenizer attrs (closes #2390) (#3325) 2019-02-24 21:13:51 +01:00
lexeme.pyx Tidy up property code style (#3391) 2019-03-11 15:59:09 +01:00
lookups.py 💫 WIP: Basic lookup class scaffolding and JSON for all lemmatizer data (#4178) 2019-09-09 19:17:55 +02:00
morphology.pxd annotate kb_id through ents in doc 2019-03-22 11:36:44 +01:00
morphology.pyx Fix issue #3551: Upper case lemmas 2019-04-16 12:27:15 +02:00
parts_of_speech.pxd Add support for Universal Dependencies v2.0 2017-03-03 13:17:34 +01:00
parts_of_speech.pyx Tidy up rest 2017-10-27 21:07:59 +02:00
scorer.py Tidy up and auto-format 2019-08-18 15:09:16 +02:00
strings.pxd Try to fix StringStore clean up (see #1506) 2017-11-11 03:11:27 +03:00
strings.pyx 💫 Make serialization methods consistent (#3385) 2019-03-10 19:16:45 +01:00
structs.pxd rename entity frequency 2019-07-19 17:40:28 +02:00
symbols.pxd Fix symbol alignment 2019-07-12 17:48:38 +02:00
symbols.pyx ensure Span.as_doc keeps the entity links + unit test 2019-06-25 15:28:51 +02:00
tokenizer.pxd Flush tokenizer cache when necessary (#4258) 2019-09-08 20:52:46 +02:00
tokenizer.pyx Flush tokenizer cache when necessary (#4258) 2019-09-08 20:52:46 +02:00
typedefs.pxd Work on changing StringStore to return hashes. 2017-05-28 12:36:27 +02:00
typedefs.pyx Tidy up rest 2017-10-27 21:07:59 +02:00
util.py 💫 WIP: Basic lookup class scaffolding and JSON for all lemmatizer data (#4178) 2019-09-09 19:17:55 +02:00
vectors.pyx Update Vectors.find docs [ci skip] 2019-03-16 17:10:57 +01:00
vocab.pxd 💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167) 2019-08-22 14:21:32 +02:00
vocab.pyx 💫 WIP: Basic lookup class scaffolding and JSON for all lemmatizer data (#4178) 2019-09-09 19:17:55 +02:00