spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-03-24 03:44:26 +03:00

Author	SHA1	Message	Date
adrianeboyd	faaa832518	Generalize handling of tokenizer special cases (#4259 ) * Generalize handling of tokenizer special cases Handle tokenizer special cases more generally by using the Matcher internally to match special cases after the affix/token_match tokenization is complete. Instead of only matching special cases while processing balanced or nearly balanced prefixes and suffixes, this recognizes special cases in a wider range of contexts: * Allows arbitrary numbers of prefixes/affixes around special cases * Allows special cases separated by infixes Existing tests/settings that couldn't be preserved as before: * The emoticon '")' is no longer a supported special case * The emoticon ':)' in "example:)" is a false positive again When merged with #4258 (or the relevant cache bugfix), the affix and token_match properties should be modified to flush and reload all special cases to use the updated internal tokenization with the Matcher. * Remove accidentally added test case * Really remove accidentally added test * Reload special cases when necessary Reload special cases when affixes or token_match are modified. Skip reloading during initialization. * Update error code number * Fix offset and whitespace in Matcher special cases * Fix offset bugs when merging and splitting tokens * Set final whitespace on final token in inserted special case * Improve cache flushing in tokenizer * Separate cache and specials memory (temporarily) * Flush cache when adding special cases * Repeated `self._cache = PreshMap()` and `self._specials = PreshMap()` are necessary due to this bug: https://github.com/explosion/preshed/issues/21 * Remove reinitialized PreshMaps on cache flush * Update UD bin scripts * Update imports for `bin/` * Add all currently supported languages * Update subtok merger for new Matcher validation * Modify blinded check to look at tokens instead of lemmas (for corpora with tokens but not lemmas like Telugu) * Use special Matcher only for cases with affixes * Reinsert specials cache checks during normal tokenization for special cases as much as possible * Additionally include specials cache checks while splitting on infixes * Since the special Matcher needs consistent affix-only tokenization for the special cases themselves, introduce the argument `with_special_cases` in order to do tokenization with or without specials cache checks * After normal tokenization, postprocess with special cases Matcher for special cases containing affixes * Replace PhraseMatcher with Aho-Corasick Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays of the hash values for the relevant attribute. The implementation is based on FlashText. The speed should be similar to the previous PhraseMatcher. It is now possible to easily remove match IDs and matches don't go missing with large keyword lists / vocabularies. Fixes #4308. * Restore support for pickling * Fix internal keyword add/remove for numpy arrays * Add test for #4248, clean up test * Improve efficiency of special cases handling * Use PhraseMatcher instead of Matcher * Improve efficiency of merging/splitting special cases in document * Process merge/splits in one pass without repeated token shifting * Merge in place if no splits * Update error message number * Remove UD script modifications Only used for timing/testing, should be a separate PR * Remove final traces of UD script modifications * Update UD bin scripts * Update imports for `bin/` * Add all currently supported languages * Update subtok merger for new Matcher validation * Modify blinded check to look at tokens instead of lemmas (for corpora with tokens but not lemmas like Telugu) * Add missing loop for match ID set in search loop * Remove cruft in matching loop for partial matches There was a bit of unnecessary code left over from FlashText in the matching loop to handle partial token matches, which we don't have with PhraseMatcher. * Replace dict trie with MapStruct trie * Fix how match ID hash is stored/added * Update fix for match ID vocab * Switch from map_get_unless_missing to map_get * Switch from numpy array to Token.get_struct_attr Access token attributes directly in Doc instead of making a copy of the relevant values in a numpy array. Add unsatisfactory warning for hash collision with reserved terminal hash key. (Ideally it would change the reserved terminal hash and redo the whole trie, but for now, I'm hoping there won't be collisions.) * Restructure imports to export find_matches * Implement full remove() Remove unnecessary trie paths and free unused maps. Parallel to Matcher, raise KeyError when attempting to remove a match ID that has not been added. * Switch to PhraseMatcher.find_matches * Switch to local cdef functions for span filtering * Switch special case reload threshold to variable Refer to variable instead of hard-coded threshold * Move more of special case retokenize to cdef nogil Move as much of the special case retokenization to nogil as possible. * Rewrap sort as stdsort for OS X * Rewrap stdsort with specific types * Switch to qsort * Fix merge * Improve cmp functions * Fix realloc * Fix realloc again * Initialize span struct while retokenizing * Temporarily skip retokenizing * Revert "Move more of special case retokenize to cdef nogil" This reverts commit `0b7e52c797`. * Revert "Switch to qsort" This reverts commit `a98d71a942`. * Fix specials check while caching * Modify URL test with emoticons The multiple suffix tests result in the emoticon `:>`, which is now retokenized into one token as a special case after the suffixes are split off. * Refactor _apply_special_cases() * Use cdef ints for span info used in multiple spots * Modify _filter_special_spans() to prefer earlier Parallel to #4414, modify _filter_special_spans() so that the earlier span is preferred for overlapping spans of the same length. * Replace MatchStruct with Entity Replace MatchStruct with Entity since the existing Entity struct is nearly identical. * Replace Entity with more general SpanC * Replace MatchStruct with SpanC * Add error in debug-data if no dev docs are available (see #4575) * Update azure-pipelines.yml * Revert "Update azure-pipelines.yml" This reverts commit `ed1060cf59`. * Use latest wasabi * Reorganise install_requires * add dframcy to universe.json (#4580) * Update universe.json [ci skip] * Fix multiprocessing for as_tuples=True (#4582) * Fix conllu script (#4579) * force extensions to avoid clash between example scripts * fix arg order and default file encoding * add example config for conllu script * newline * move extension definitions to main function * few more encodings fixes * Add load_from_docbin example [ci skip] TODO: upload the file somewhere * Update README.md * Add warnings about 3.8 (resolves #4593) [ci skip] * Fixed typo: Added space between "recognize" and "various" (#4600) * Fix DocBin.merge() example (#4599) * Replace function registries with catalogue (#4584) * Replace functions registries with catalogue * Update __init__.py * Fix test * Revert unrelated flag [ci skip] * Bugfix/dep matcher issue 4590 (#4601) * add contributor agreement for prilopes * add test for issue #4590 * fix on_match params for DependencyMacther (#4590) * Minor updates to language example sentences (#4608) * Add punctuation to Spanish example sentences * Combine multilanguage examples for lang xx * Add punctuation to nb examples * Always realloc to a larger size Avoid potential (unlikely) edge case and cymem error seen in #4604. * Add error in debug-data if no dev docs are available (see #4575) * Update debug-data for GoldCorpus / Example * Ignore None label in misaligned NER data	2019-11-13 21:24:35 +01:00
Ines Montani	e0cf4796a5	Move lookup tables out of the core library (#4346 ) * Add default to util.get_entry_point * Tidy up entry points * Read lookups from entry points * Remove lookup tables and related tests * Add lookups install option * Remove lemmatizer tests * Remove logic to process language data files * Update setup.cfg	2019-10-01 00:01:27 +02:00
Ines Montani	5ca7dd0f94	💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167 ) * Improve load_language_data helper * WIP: Add Lookups implementation * Start moving lemma data over to JSON * WIP: move data over for more languages * Convert more languages * Fix lemmatizer fixtures in tests * Finish conversion * Auto-format JSON files * Fix test for now * Make sure tables are stored on instance	2019-08-22 14:21:32 +02:00
Ines Montani	f580302673	Tidy up and auto-format	2019-08-20 17:36:34 +02:00
Paul O'Leary McCann	756b66b7c0	Reduce size of language data (#4141 ) * Move Turkish lemmas to a json file Rather than a large dict in Python source, the data is now a big json file. This includes a method for loading the json file, falling back to a compressed file, and an update to MANIFEST.in that excludes json in the spacy/lang directory. This focuses on Turkish specifically because it has the most language data in core. * Transition all lemmatizer.py files to json This covers all lemmatizer.py files of a significant size (>500k or so). Small files were left alone. None of the affected files have logic, so this was pretty straightforward. One unusual thing is that the lemma data for Urdu doesn't seem to be used anywhere. That may require further investigation. * Move large lang data to json for fr/nb/nl/sv These are the languages that use a lemmatizer directory (rather than a single file) and are larger than English. For most of these languages there were many language data files, in which case only the large ones (>500k or so) were converted to json. It may or may not be a good idea to migrate the remaining Python files to json in the future. * Fix id lemmas.json The contents of this file were originally just copied from the Python source, but that used single quotes, so it had to be properly converted to json first. * Add .json.gz to gitignore This covers the json.gz files built as part of distribution. * Add language data gzip to build process Currently this gzip data on every build; it works, but it should be changed to only gzip when the source file has been updated. * Remove Danish lemmatizer.py Missed this when I added the json. * Update to match latest explosion/srsly#9 The way gzipped json is loaded/saved in srsly changed a bit. * Only compress language data if necessary If a .json.gz file exists and is newer than the corresponding json file, it's not recompressed. * Move en/el language data to json This only affected files >500kb, which was nouns for both languages and the generic lookup table for English. * Remove empty files in Norwegian tokenizer It's unclear why, but the Norwegian (nb) tokenizer had empty files for adj/adv/noun/verb lemmas. This may have been a result of copying the structure of the English lemmatizer. This removed the files, but still creates the empty sets in the lemmatizer. That may not actually be necessary. * Remove dubious entries in English lookup.json " furthest" and " skilled" - both prefixed with a space - were in the English lookup table. That seems obviously wrong so I have removed them. * Fix small issues with en/fr lemmatizers The en tokenizer was including the removed _nouns.py file, so that's removed. The fr tokenizer is unusual in that it has a lemmatizer directory with both __init__.py and lemmatizer.py. lemmatizer.py had not been converted to load the json language data, so that was fixed. * Auto-format * Auto-format * Update srsly pin * Consistently use pathlib paths	2019-08-20 14:54:11 +02:00
Paul O'Leary McCann	c8949ce88a	Remove old comment (#4012 ) Norwegian used to borrow from French but that doesn't appear to have been true for a while now, so the comment that was here is no longer relevant.	2019-07-23 23:10:06 +02:00
Knut O. Hellan	a54f0cfc2b	Norwegian tweaks (#3894 ) * Norwegian fix Add support for alternative past tense verb form (vaska). * Norwegian months Add all Norwegian months to tokenizer excpetions. * More Norwegian abbreviations Add more Norwegian abbreviations to tokenizer_exceptions. * Contributor agreement khellan Add signed contributor agreement for khellan (Knut O. Hellan).	2019-07-08 10:28:47 +02:00
Sofie	46dfe773e1	Replacing regex library with re to increase tokenization speed (#3218 ) * replace unicode categories with raw list of code points * simplifying ranges * fixing variable length quotes * removing redundant regular expression * small cleanup of regexp notations * quotes and alpha as ranges instead of alterations * removed most regexp dependencies and features * exponential backtracking - unit tests * rewrote expression with pathological backtracking * disabling double hyphen tests for now * test additional variants of repeating punctuation * remove regex and redundant backslashes from load_reddit script * small typo fixes * disable double punctuation test for russian * clean up old comments * format block code * final cleanup * naming consistency * french strings as unicode for python 2 support * french regular expression case insensitive	2019-02-01 18:05:22 +11:00
Jari Bakken	e172f2478e	Add three missing tags from the `nb` tag map (#3085 ) * Contributors agreement for jarib * Add tags from the UD/NORNE dataset that is missing in the nb tag map. Relates to #3082.	2018-12-27 14:48:40 +01:00
Ines Montani	61d09c481b	Merge branch 'master' into develop	2018-12-18 13:48:10 +01:00
Beate Sildnes	f0d7e206ec	Updated wordforms for Norwegian lemmatizer (#3007 ) * Updated wordforms for Norwegian lemmatizer Upload of updated lists of wordforms for the Norwegian lemmatizer (nouns, verbs, adverbs, adjectives and lookup). * Add spaCy contributor agreement for user beatesi * Updated wordforms for Norwegian lemmatizer	2018-12-06 15:46:18 +01:00
Ines Montani	eddeb36c96	💫 Tidy up and auto-format .py files (#2983 ) <!--- Provide a general summary of your changes in the title. --> ## Description - [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files. - [x] Update flake8 config to exclude very large files (lemmatization tables etc.) - [x] Update code to be compatible with flake8 rules - [x] Fix various small bugs, inconsistencies and messy stuff in the language data - [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means) Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` actually get meaningful results. At the moment, the code style and linting isn't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information. ### Types of change enhancement, code style ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-11-30 17:03:03 +01:00
katarkor	5ca853bee0	changed tag_map, morph_rules, lemmatizer for Norwegian (#2565 ) * changed tag_map, morph_rules, lemmatizer for Norwegian * Move unicode declaration up Hopefully fixes test failure on Python 2 * Update CONTRIBUTOR_AGREEMENT.md * Move unicode declarations Hopefully fixes test this time * Revert "Merge remote-tracking branch 'origin/patch-1'" This reverts commit `f5ccd5dd0d`, reversing changes made to `dd07e180ea`. * Update contributor agreement [ci skip]	2018-07-19 19:38:24 +02:00
Ole Henrik Skogstrøm	251a7805fe	Copied French syntax iterator to simplify future changes	2018-02-05 14:45:05 +01:00
Ole Henrik Skogstrøm	e40465487c	Added french syntax iterator with explenation	2018-01-30 15:44:29 +01:00
Ole Henrik Skogstrøm	8e2c9f2475	Cleaned up nb tag_map comments	2018-01-25 11:09:28 +01:00
Ole Henrik Skogstrøm	1107e89fcf	Updated doc string on nb tag_map module	2018-01-25 11:08:28 +01:00
Ole Henrik Skogstrøm	4058a7d579	Fix æøå characters in lemmatizer	2018-01-24 14:03:14 +01:00
Ole Henrik Skogstrøm	42248f423f	Updated tag map	2018-01-24 13:50:33 +01:00
Ole Henrik Skogstrøm	74b430b49a	Correct Lemmatizer	2018-01-24 13:26:33 +01:00
Ole Henrik Skogstrøm	b9b3a40c78	Add norwegian lemmatizer and tag_map	2018-01-24 12:28:29 +01:00
ines	acb9bdb852	Fix PRON_LEMMA imports	2017-11-06 17:41:53 +01:00
ines	819e30a26e	Tidy up tokenizer exceptions	2017-11-01 23:02:45 +01:00
ines	7e424a1804	Don't copy exception dicts if not necessary and tidy up	2017-10-31 21:05:29 +01:00
ines	8ce6f96180	Don't make copies of language data components	2017-10-11 15:34:55 +02:00
ines	0c2343d73a	Tidy up language data	2017-10-11 02:22:49 +02:00
ines	1fe5e1a4d1	Add language example sentences (see #1107 ) da, de, en, es, fr, he, it, nb, pl, pt, sv	2017-08-19 12:22:29 +02:00
ines	4c643d74c5	Add norm exceptions to other Language classes	2017-06-03 22:29:21 +02:00
ines	924e8506de	Move Defaults subclass to module scope (necessary for pickling)	2017-05-20 19:02:27 +02:00
ines	f6f5d78cb9	Fix formatting	2017-05-08 23:59:17 +02:00
ines	73b577cb01	Fix relative imports	2017-05-08 22:29:04 +02:00
ines	f46ffe3e89	Move language data to /lang module	2017-05-08 20:00:40 +02:00

32 Commits