mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-12 02:06:31 +03:00
Generalize handling of tokenizer special cases (#4259)
* Generalize handling of tokenizer special cases

  Handle tokenizer special cases more generally by using the Matcher
  internally to match special cases after the affix/token_match
  tokenization is complete.

  Instead of only matching special cases while processing balanced or
  nearly balanced prefixes and suffixes, this recognizes special cases in
  a wider range of contexts:

  * Allows arbitrary numbers of prefixes/affixes around special cases
  * Allows special cases separated by infixes

  Existing tests/settings that couldn't be preserved as before:

  * The emoticon '")' is no longer a supported special case
  * The emoticon ':)' in "example:)" is a false positive again

  When merged with #4258 (or the relevant cache bugfix), the affix and
  token_match properties should be modified to flush and reload all
  special cases to use the updated internal tokenization with the Matcher.

* Remove accidentally added test case
* Really remove accidentally added test
* Reload special cases when necessary

  Reload special cases when affixes or token_match are modified. Skip
  reloading during initialization.
* Update error code number
* Fix offset and whitespace in Matcher special cases
* Fix offset bugs when merging and splitting tokens
* Set final whitespace on final token in inserted special case
* Improve cache flushing in tokenizer
* Separate cache and specials memory (temporarily)
* Flush cache when adding special cases
* Repeated `self._cache = PreshMap()` and `self._specials = PreshMap()`
  are necessary due to this bug:
  https://github.com/explosion/preshed/issues/21
* Remove reinitialized PreshMaps on cache flush
* Update UD bin scripts
* Update imports for `bin/`
* Add all currently supported languages
* Update subtok merger for new Matcher validation
* Modify blinded check to look at tokens instead of lemmas (for corpora
  with tokens but not lemmas like Telugu)
* Use special Matcher only for cases with affixes
* Reinsert specials cache checks during normal tokenization for special
  cases as much as possible
* Additionally include specials cache checks while splitting on infixes
* Since the special Matcher needs consistent affix-only tokenization
  for the special cases themselves, introduce the argument
  `with_special_cases` in order to do tokenization with or without
  specials cache checks
* After normal tokenization, postprocess with special cases Matcher for
  special cases containing affixes
* Replace PhraseMatcher with Aho-Corasick

  Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays
  of the hash values for the relevant attribute. The implementation is
  based on FlashText.

  The speed should be similar to the previous PhraseMatcher. It is now
  possible to easily remove match IDs and matches don't go missing with
  large keyword lists / vocabularies.

  Fixes #4308.
* Restore support for pickling
* Fix internal keyword add/remove for numpy arrays
* Add test for #4248, clean up test
* Improve efficiency of special cases handling
* Use PhraseMatcher instead of Matcher
* Improve efficiency of merging/splitting special cases in document
* Process merge/splits in one pass without repeated token shifting
* Merge in place if no splits
* Update error message number
* Remove UD script modifications

  Only used for timing/testing, should be a separate PR

* Remove final traces of UD script modifications
* Update UD bin scripts
* Update imports for `bin/`
* Add all currently supported languages
* Update subtok merger for new Matcher validation
* Modify blinded check to look at tokens instead of lemmas (for corpora
  with tokens but not lemmas like Telugu)
* Add missing loop for match ID set in search loop
* Remove cruft in matching loop for partial matches

  There was a bit of unnecessary code left over from FlashText in the
  matching loop to handle partial token matches, which we don't have with
  PhraseMatcher.

* Replace dict trie with MapStruct trie
* Fix how match ID hash is stored/added
* Update fix for match ID vocab
* Switch from map_get_unless_missing to map_get
* Switch from numpy array to Token.get_struct_attr

  Access token attributes directly in Doc instead of making a copy of the
  relevant values in a numpy array.

  Add unsatisfactory warning for hash collision with reserved terminal
  hash key. (Ideally it would change the reserved terminal hash and redo
  the whole trie, but for now, I'm hoping there won't be collisions.)

* Restructure imports to export find_matches
* Implement full remove()

  Remove unnecessary trie paths and free unused maps.

  Parallel to Matcher, raise KeyError when attempting to remove a match ID
  that has not been added.
* Switch to PhraseMatcher.find_matches
* Switch to local cdef functions for span filtering
* Switch special case reload threshold to variable

  Refer to variable instead of hard-coded threshold

* Move more of special case retokenize to cdef nogil

  Move as much of the special case retokenization to nogil as possible.

* Rewrap sort as stdsort for OS X
* Rewrap stdsort with specific types
* Switch to qsort
* Fix merge
* Improve cmp functions
* Fix realloc
* Fix realloc again
* Initialize span struct while retokenizing
* Temporarily skip retokenizing
* Revert "Move more of special case retokenize to cdef nogil"

  This reverts commit 0b7e52c797.

* Revert "Switch to qsort"

  This reverts commit a98d71a942.

* Fix specials check while caching
* Modify URL test with emoticons

  The multiple suffix tests result in the emoticon `:>`, which is now
  retokenized into one token as a special case after the suffixes are
  split off.

* Refactor _apply_special_cases()
* Use cdef ints for span info used in multiple spots
* Modify _filter_special_spans() to prefer earlier

  Parallel to #4414, modify _filter_special_spans() so that the earlier
  span is preferred for overlapping spans of the same length.

* Replace MatchStruct with Entity

  Replace MatchStruct with Entity since the existing Entity struct is
  nearly identical.

* Replace Entity with more general SpanC
* Replace MatchStruct with SpanC
* Add error in debug-data if no dev docs are available (see #4575)
* Update azure-pipelines.yml
* Revert "Update azure-pipelines.yml"

  This reverts commit ed1060cf59.

* Use latest wasabi
* Reorganise install_requires
* add dframcy to universe.json (#4580)
* Update universe.json [ci skip]
* Fix multiprocessing for as_tuples=True (#4582)
* Fix conllu script (#4579)
* force extensions to avoid clash between example scripts
* fix arg order and default file encoding
* add example config for conllu script
* newline
* move extension definitions to main function
* few more encodings fixes
* Add load_from_docbin example [ci skip]

  TODO: upload the file somewhere

* Update README.md
* Add warnings about 3.8 (resolves #4593) [ci skip]
* Fixed typo: Added space between "recognize" and "various" (#4600)
* Fix DocBin.merge() example (#4599)
* Replace function registries with catalogue (#4584)
* Replace functions registries with catalogue
* Update __init__.py
* Fix test
* Revert unrelated flag [ci skip]
* Bugfix/dep matcher issue 4590 (#4601)
* add contributor agreement for prilopes
* add test for issue #4590
* fix on_match params for DependencyMacther (#4590)
* Minor updates to language example sentences (#4608)
* Add punctuation to Spanish example sentences
* Combine multilanguage examples for lang xx
* Add punctuation to nb examples
* Always realloc to a larger size

  Avoid potential (unlikely) edge case and cymem error seen in #4604.

* Add error in debug-data if no dev docs are available (see #4575)
* Update debug-data for GoldCorpus / Example
* Ignore None label in misaligned NER data
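The span-filtering rule described in the commit message (prefer longer special-case spans, and for overlapping spans of the same length prefer the earlier one) can be illustrated with a small pure-Python sketch. This is not the Cython code from the commit: the function names mirror `_filter_special_spans()` and `_apply_special_cases()` only loosely, spans are assumed to be `(start, end)` token offsets, and the sketch handles only merging, not splitting.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def filter_special_spans(spans: List[Span]) -> List[Span]:
    """Keep non-overlapping spans: longer first, then earlier start."""
    # Sort by length (descending), then by start offset (ascending).
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    seen = set()  # token indices already covered by a kept span
    kept = []
    for start, end in ordered:
        if not any(i in seen for i in range(start, end)):
            kept.append((start, end))
            seen.update(range(start, end))
    return sorted(kept)


def apply_special_cases(tokens: List[str], spans: List[Span]) -> List[str]:
    """Merge each kept span of affix-level tokens into a single token."""
    ends = dict(filter_special_spans(spans))  # start -> end
    out = []
    i = 0
    while i < len(tokens):
        if i in ends:
            out.append("".join(tokens[i:ends[i]]))
            i = ends[i]
        else:
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical input: suffixes were split off first, then ":" and ">"
# are recognized as one special case and merged back together.
print(apply_special_cases(["example", ":", ">", ")"], [(1, 3)]))
```

Running the filter after ordinary affix tokenization is what lets a special case be recognized regardless of how many prefixes or suffixes surround it.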
parent d67b0f196a
commit faaa832518
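The commit message describes a trie-backed PhraseMatcher in which match IDs can be removed, unused trie paths are pruned, and removing an unknown ID raises `KeyError`. The sketch below shows that behaviour in plain Python under simplifying assumptions: the class name `TriePhraseMatcher` is hypothetical, tokens are stored directly rather than as attribute hashes, and matching is a plain trie scan from every start position rather than the full Aho-Corasick automaton with failure links.

```python
class TriePhraseMatcher:
    """Toy token-level phrase matcher backed by a nested-dict trie."""

    _TERMINAL = object()  # reserved key marking the end of a phrase

    def __init__(self):
        self._trie = {}
        self._phrases = {}  # match ID -> set of token tuples

    def add(self, key, phrases):
        for phrase in phrases:
            tokens = tuple(phrase)
            node = self._trie
            for tok in tokens:
                node = node.setdefault(tok, {})
            node.setdefault(self._TERMINAL, set()).add(key)
            self._phrases.setdefault(key, set()).add(tokens)

    def remove(self, key):
        # Parallel to the commit message: unknown match IDs raise KeyError.
        if key not in self._phrases:
            raise KeyError(key)
        for tokens in self._phrases.pop(key):
            path = []
            node = self._trie
            for tok in tokens:
                path.append((node, tok))
                node = node[tok]
            node[self._TERMINAL].discard(key)
            if not node[self._TERMINAL]:
                del node[self._TERMINAL]
            # Prune trie nodes that no longer lead to any phrase.
            for parent, tok in reversed(path):
                if parent[tok]:
                    break
                del parent[tok]

    def find_matches(self, doc):
        """Return (key, start, end) triples for every phrase found in doc."""
        matches = []
        for start in range(len(doc)):
            node = self._trie
            for end in range(start, len(doc)):
                node = node.get(doc[end])
                if node is None:
                    break
                for key in node.get(self._TERMINAL, ()):
                    matches.append((key, start, end + 1))
        return matches


matcher = TriePhraseMatcher()
matcher.add("CITY", [["New", "York"], ["New", "York", "City"]])
matcher.add("ORG", [["York", "City"]])
doc = "I love New York City".split()
print(sorted(matcher.find_matches(doc)))
matcher.remove("CITY")
print(matcher.find_matches(doc))
```

Keeping a per-ID record of the added phrases is one simple way to make removal possible; whether the real implementation tracks phrases the same way is not stated in the commit message.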
106
.github/contributors/prilopes.md
vendored
Normal file
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Priscilla Lopes      |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 2019-11-06           |
+| GitHub username                | prilopes             |
+| Website (optional)             |                      |
10
README.md
@@ -104,6 +104,13 @@ For detailed installation instructions, see the
 [pip]: https://pypi.org/project/spacy/
 [conda]: https://anaconda.org/conda-forge/spacy
 
+> ⚠️ **Important note for Python 3.8:** We can't yet ship pre-compiled binary
+> wheels for spaCy that work on Python 3.8, as we're still waiting for our CI
+> providers and other tooling to support it. This means that in order to run
+> spaCy on Python 3.8, you'll need [a compiler installed](#source) and compile
+> the library and its Cython dependencies locally. If this is causing problems
+> for you, the easiest solution is to **use Python 3.7** in the meantime.
+
 ### pip
 
 Using pip, spaCy releases are available as source packages and binary wheels (as
@@ -180,9 +187,6 @@ pointing pip to a path or URL.
 # download best-matching version of specific model for your spaCy installation
 python -m spacy download en_core_web_sm
 
-# out-of-the-box: download best-matching default model
-python -m spacy download en
-
 # pip install .tar.gz archive from path or URL
 pip install /Users/you/en_core_web_sm-2.2.0.tar.gz
 pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz

@@ -323,11 +323,6 @@ def get_token_conllu(token, i):
     return "\n".join(lines)
 
 
-Token.set_extension("get_conllu_lines", method=get_token_conllu, force=True)
-Token.set_extension("begins_fused", default=False, force=True)
-Token.set_extension("inside_fused", default=False, force=True)
-
-
 ##################
 # Initialization #
 ##################
@@ -460,13 +455,13 @@ class TreebankPaths(object):
 
 @plac.annotations(
     ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
+    parses_dir=("Directory to write the development parses", "positional", None, Path),
     corpus=(
-        "UD corpus to train and evaluate on, e.g. en, es_ancora, etc",
+        "UD corpus to train and evaluate on, e.g. UD_Spanish-AnCora",
         "positional",
         None,
         str,
     ),
-    parses_dir=("Directory to write the development parses", "positional", None, Path),
     config=("Path to json formatted config file", "option", "C", Path),
     limit=("Size limit", "option", "n", int),
     gpu_device=("Use GPU", "option", "g", int),
@@ -491,6 +486,10 @@ def main(
     # temp fix to avoid import issues cf https://github.com/explosion/spaCy/issues/4200
     import tqdm
 
+    Token.set_extension("get_conllu_lines", method=get_token_conllu)
+    Token.set_extension("begins_fused", default=False)
+    Token.set_extension("inside_fused", default=False)
+
     spacy.util.fix_random_seed()
     lang.zh.Chinese.Defaults.use_jieba = False
     lang.ja.Japanese.Defaults.use_janome = False
45
examples/load_from_docbin.py
Normal file
@@ -0,0 +1,45 @@
+# coding: utf-8
+"""
+Example of loading previously parsed text using spaCy's DocBin class. The example
+performs an entity count to show that the annotations are available.
+For more details, see https://spacy.io/usage/saving-loading#docs
+Installation:
+python -m spacy download en_core_web_lg
+Usage:
+python examples/load_from_docbin.py en_core_web_lg RC_2015-03-9.spacy
+"""
+from __future__ import unicode_literals
+
+import spacy
+from spacy.tokens import DocBin
+from timeit import default_timer as timer
+from collections import Counter
+
+EXAMPLE_PARSES_PATH = "RC_2015-03-9.spacy"
+
+
+def main(model="en_core_web_lg", docbin_path=EXAMPLE_PARSES_PATH):
+    nlp = spacy.load(model)
+    print("Reading data from {}".format(docbin_path))
+    with open(docbin_path, "rb") as file_:
+        bytes_data = file_.read()
+    nr_word = 0
+    start_time = timer()
+    entities = Counter()
+    docbin = DocBin().from_bytes(bytes_data)
+    for doc in docbin.get_docs(nlp.vocab):
+        nr_word += len(doc)
+        entities.update((e.label_, e.text) for e in doc.ents)
+    end_time = timer()
+    msg = "Loaded {nr_word} words in {seconds} seconds ({wps} words per second)"
+    wps = nr_word / (end_time - start_time)
+    print(msg.format(nr_word=nr_word, seconds=end_time - start_time, wps=wps))
+    print("Most common entities:")
+    for (label, entity), freq in entities.most_common(30):
+        print(freq, entity, label)
+
+
+if __name__ == "__main__":
+    import plac
+
+    plac.call(main)
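The entity count in `main()` relies on a `Counter` keyed by `(label, text)` tuples. A minimal standard-library sketch of just that counting step, with made-up tuples standing in for `doc.ents` (the data below is hypothetical and no spaCy model is loaded):

```python
from collections import Counter

# Hypothetical (label, text) pairs, one inner list per document.
docs_ents = [
    [("ORG", "Apple"), ("GPE", "U.K.")],
    [("ORG", "Apple"), ("MONEY", "$1 billion")],
]

entities = Counter()
for ents in docs_ents:
    # Each distinct (label, text) pair is counted separately.
    entities.update(ents)

# Same unpacking pattern as the example script.
for (label, entity), freq in entities.most_common(1):
    print(freq, entity, label)
```

Counting on the pair rather than on the text alone keeps the same surface string under two different labels apart.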
1
examples/training/conllu-config.json
Normal file
@@ -0,0 +1 @@
+{"nr_epoch": 3, "batch_size": 24, "dropout": 0.001, "vectors": 0, "multitask_tag": 0, "multitask_sent": 0}
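As a rough illustration of how a script might consume this config file, the sketch below parses the same JSON and checks that the expected keys are present. The helper `load_config` and the key check are assumptions made for illustration; the conllu script itself passes the path to its own `Config.load`, whose behaviour is not shown in this diff.

```python
import json

# The single line added in examples/training/conllu-config.json.
CONFIG_TEXT = (
    '{"nr_epoch": 3, "batch_size": 24, "dropout": 0.001, '
    '"vectors": 0, "multitask_tag": 0, "multitask_sent": 0}'
)

EXPECTED_KEYS = {
    "nr_epoch", "batch_size", "dropout",
    "vectors", "multitask_tag", "multitask_sent",
}


def load_config(text):
    """Parse the JSON config and fail early if a key is missing (hypothetical helper)."""
    cfg = json.loads(text)
    missing = EXPECTED_KEYS - set(cfg)
    if missing:
        raise ValueError("Missing config keys: {}".format(sorted(missing)))
    return cfg


cfg = load_config(CONFIG_TEXT)
print(cfg["nr_epoch"], cfg["batch_size"], cfg["dropout"])
```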

@@ -383,20 +383,24 @@ class TreebankPaths(object):
 
 @plac.annotations(
     ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
+    parses_dir=("Directory to write the development parses", "positional", None, Path),
+    config=("Path to json formatted config file", "positional", None, Config.load),
     corpus=(
-        "UD corpus to train and evaluate on, e.g. en, es_ancora, etc",
+        "UD corpus to train and evaluate on, e.g. UD_Spanish-AnCora",
         "positional",
         None,
         str,
     ),
-    parses_dir=("Directory to write the development parses", "positional", None, Path),
-    config=("Path to json formatted config file", "positional", None, Config.load),
     limit=("Size limit", "option", "n", int),
 )
 def main(ud_dir, parses_dir, config, corpus, limit=0):
     # temp fix to avoid import issues cf https://github.com/explosion/spaCy/issues/4200
     import tqdm
 
+    Token.set_extension("get_conllu_lines", method=get_token_conllu)
+    Token.set_extension("begins_fused", default=False)
+    Token.set_extension("inside_fused", default=False)
+
     paths = TreebankPaths(ud_dir, corpus)
     if not (parses_dir / corpus).exists():
         (parses_dir / corpus).mkdir()

@@ -4,14 +4,14 @@ preshed>=3.0.2,<3.1.0
 thinc>=7.3.0,<7.4.0
 blis>=0.4.0,<0.5.0
 murmurhash>=0.28.0,<1.1.0
-wasabi>=0.3.0,<1.1.0
+wasabi>=0.4.0,<1.1.0
 srsly>=0.1.0,<1.1.0
+catalogue>=0.0.7,<1.1.0
 # Third party dependencies
 numpy>=1.15.0
 requests>=2.13.0,<3.0.0
 plac>=0.9.6,<1.2.0
 pathlib==1.0.1; python_version < "3.4"
-importlib_metadata>=0.20; python_version < "3.8"
 # Optional dependencies
 jsonschema>=2.6.0,<3.1.0
 # Development dependencies
12
setup.cfg
@@ -40,19 +40,21 @@ setup_requires =
     murmurhash>=0.28.0,<1.1.0
     thinc>=7.3.0,<7.4.0
 install_requires =
-    setuptools
-    numpy>=1.15.0
+    # Our libraries
     murmurhash>=0.28.0,<1.1.0
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
     thinc>=7.3.0,<7.4.0
     blis>=0.4.0,<0.5.0
+    wasabi>=0.4.0,<1.1.0
+    srsly>=0.1.0,<1.1.0
+    catalogue>=0.0.7,<1.1.0
+    # Third-party dependencies
+    setuptools
+    numpy>=1.15.0
     plac>=0.9.6,<1.2.0
     requests>=2.13.0,<3.0.0
-    wasabi>=0.3.0,<1.1.0
-    srsly>=0.1.0,<1.1.0
     pathlib==1.0.1; python_version < "3.4"
-    importlib_metadata>=0.20; python_version < "3.8"
 
 [options.extras_require]
 lookups =

@@ -15,7 +15,7 @@ from .glossary import explain
 from .about import __version__
 from .errors import Errors, Warnings, deprecation_warning
 from . import util
-from .util import register_architecture, get_architecture
+from .util import registry
 from .language import component
 
 
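The import change above swaps two standalone helper functions for a single `registry` object. The general pattern of a named function registry with a decorator can be sketched in plain Python as follows. The `Registry` class, its method names, and the registered name `"my_model.v1"` are illustrative assumptions, not the API of `catalogue` or of `spacy.util.registry`.

```python
class Registry:
    """Minimal name -> function registry with a decorator (illustrative only)."""

    def __init__(self, namespace):
        self.namespace = namespace
        self._funcs = {}

    def register(self, name):
        # Returns a decorator that stores the function under `name`.
        def decorator(func):
            self._funcs[name] = func
            return func
        return decorator

    def get(self, name):
        if name not in self._funcs:
            raise KeyError(
                "No function '{}' in registry '{}'".format(name, self.namespace)
            )
        return self._funcs[name]


# Hypothetical registry and registered function.
architectures = Registry("architectures")


@architectures.register("my_model.v1")
def build_model(width):
    return {"width": width}


print(architectures.get("my_model.v1")(64))
```

Grouping lookups behind one object means new kinds of registered functions can be added without growing the list of top-level imports.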

@@ -7,12 +7,10 @@ from __future__ import print_function
 if __name__ == "__main__":
     import plac
     import sys
-    from wasabi import Printer
+    from wasabi import msg
     from spacy.cli import download, link, info, package, train, pretrain, convert
     from spacy.cli import init_model, profile, evaluate, validate, debug_data
 
-    msg = Printer()
-
     commands = {
         "download": download,
         "link": link,

@@ -6,16 +6,13 @@ import requests
 import os
 import subprocess
 import sys
-from wasabi import Printer
+from wasabi import msg
 
 from .link import link
 from ..util import get_package_path
 from .. import about
 
 
-msg = Printer()
-
-
 @plac.annotations(
     model=("Model to download (shortcut or name)", "positional", None, str),
     direct=("Force direct download of name + version", "flag", "d", bool),

@@ -3,7 +3,7 @@ from __future__ import unicode_literals, division, print_function
 
 import plac
 from timeit import default_timer as timer
-from wasabi import Printer
+from wasabi import msg
 
 from ..gold import GoldCorpus
 from .. import util
@@ -32,7 +32,6 @@ def evaluate(
     Evaluate a model. To render a sample of parses in a HTML file, set an
     output directory as the displacy_path argument.
     """
-    msg = Printer()
     util.fix_random_seed()
     if gpu_id >= 0:
         util.use_gpu(gpu_id)

@@ -4,7 +4,7 @@ from __future__ import unicode_literals
 import plac
 import platform
 from pathlib import Path
-from wasabi import Printer
+from wasabi import msg
 import srsly
 
 from ..compat import path2str, basestring_, unicode_
@@ -23,7 +23,6 @@ def info(model=None, markdown=False, silent=False):
     speficied as an argument, print model information. Flag --markdown
     prints details in Markdown for easy copy-pasting to GitHub issues.
     """
-    msg = Printer()
     if model:
         if util.is_package(model):
             model_path = util.get_package_path(model)

@@ -11,7 +11,7 @@ import tarfile
 import gzip
 import zipfile
 import srsly
-from wasabi import Printer
+from wasabi import msg
 
 from ..vectors import Vectors
 from ..errors import Errors, Warnings, user_warning
@@ -24,7 +24,6 @@ except ImportError:
 
 
 DEFAULT_OOV_PROB = -20
-msg = Printer()
 
 
 @plac.annotations(

@@ -3,7 +3,7 @@ from __future__ import unicode_literals
 
 import plac
 from pathlib import Path
-from wasabi import Printer
+from wasabi import msg
 
 from ..compat import symlink_to, path2str
 from .. import util
@@ -20,7 +20,6 @@ def link(origin, link_name, force=False, model_path=None):
     either the name of a pip package, or the local path to the model data
     directory. Linking models allows loading them via spacy.load(link_name).
     """
-    msg = Printer()
     if util.is_package(origin):
         model_path = util.get_package_path(origin)
     else:

@@ -4,7 +4,7 @@ from __future__ import unicode_literals
 import plac
 import shutil
 from pathlib import Path
-from wasabi import Printer, get_raw_input
+from wasabi import msg, get_raw_input
 import srsly
 
 from ..compat import path2str
@@ -27,7 +27,6 @@ def package(input_dir, output_dir, meta_path=None, create_meta=False, force=Fals
     set and a meta.json already exists in the output directory, the existing
     values will be used as the defaults in the command-line prompt.
     """
-    msg = Printer()
     input_path = util.ensure_path(input_dir)
     output_path = util.ensure_path(output_dir)
     meta_path = util.ensure_path(meta_path)

@@ -11,7 +11,7 @@ from pathlib import Path
 from thinc.v2v import Affine, Maxout
 from thinc.misc import LayerNorm as LN
 from thinc.neural.util import prefer_gpu
-from wasabi import Printer
+from wasabi import msg
 import srsly
 
 from spacy.gold import Example
@@ -123,7 +123,6 @@ def pretrain(
     for key in config:
         if isinstance(config[key], Path):
             config[key] = str(config[key])
-    msg = Printer()
     util.fix_random_seed(seed)
 
     has_gpu = prefer_gpu()
@ -9,7 +9,7 @@ import pstats
|
||||||
import sys
|
import sys
|
||||||
import itertools
|
import itertools
|
||||||
import thinc.extra.datasets
|
import thinc.extra.datasets
|
||||||
from wasabi import Printer
|
from wasabi import msg
|
||||||
|
|
||||||
from ..util import load_model
|
from ..util import load_model
|
||||||
|
|
||||||
|
@ -26,7 +26,6 @@ def profile(model, inputs=None, n_texts=10000):
|
||||||
It can either be provided as a JSONL file, or be read from sys.stdin.
|
It can either be provided as a JSONL file, or be read from sys.stdin.
|
||||||
If no input file is specified, the IMDB dataset is loaded via Thinc.
|
If no input file is specified, the IMDB dataset is loaded via Thinc.
|
||||||
"""
|
"""
|
||||||
msg = Printer()
|
|
||||||
if inputs is not None:
|
if inputs is not None:
|
||||||
inputs = _read_inputs(inputs, msg)
|
inputs = _read_inputs(inputs, msg)
|
||||||
if inputs is None:
|
if inputs is None:
|
||||||
|
|
|
@ -8,7 +8,7 @@ from thinc.neural._classes.model import Model
|
||||||
from timeit import default_timer as timer
|
from timeit import default_timer as timer
|
||||||
import shutil
|
import shutil
|
||||||
import srsly
|
import srsly
|
||||||
from wasabi import Printer
|
from wasabi import msg
|
||||||
import contextlib
|
import contextlib
|
||||||
import random
|
import random
|
||||||
|
|
||||||
|
@ -89,7 +89,6 @@ def train(
|
||||||
# temp fix to avoid import issues cf https://github.com/explosion/spaCy/issues/4200
|
# temp fix to avoid import issues cf https://github.com/explosion/spaCy/issues/4200
|
||||||
import tqdm
|
import tqdm
|
||||||
|
|
||||||
msg = Printer()
|
|
||||||
util.fix_random_seed()
|
util.fix_random_seed()
|
||||||
util.set_env_log(verbose)
|
util.set_env_log(verbose)
|
||||||
|
|
||||||
|
|
|
@ -5,7 +5,7 @@ from pathlib import Path
|
||||||
import sys
|
import sys
|
||||||
import requests
|
import requests
|
||||||
import srsly
|
import srsly
|
||||||
from wasabi import Printer
|
from wasabi import msg
|
||||||
|
|
||||||
from ..compat import path2str
|
from ..compat import path2str
|
||||||
from ..util import get_data_path
|
from ..util import get_data_path
|
||||||
|
@ -17,7 +17,6 @@ def validate():
|
||||||
Validate that the currently installed version of spaCy is compatible
|
Validate that the currently installed version of spaCy is compatible
|
||||||
with the installed models. Should be run after `pip install -U spacy`.
|
with the installed models. Should be run after `pip install -U spacy`.
|
||||||
"""
|
"""
|
||||||
msg = Printer()
|
|
||||||
with msg.loading("Loading compatibility table..."):
|
with msg.loading("Loading compatibility table..."):
|
||||||
r = requests.get(about.__compatibility__)
|
r = requests.get(about.__compatibility__)
|
||||||
if r.status_code != 200:
|
if r.status_code != 200:
|
||||||
|
|
|
@ -36,11 +36,6 @@ try:
|
||||||
except ImportError:
|
except ImportError:
|
||||||
cupy = None
|
cupy = None
|
||||||
|
|
||||||
try: # Python 3.8
|
|
||||||
import importlib.metadata as importlib_metadata
|
|
||||||
except ImportError:
|
|
||||||
import importlib_metadata # noqa: F401
|
|
||||||
|
|
||||||
try:
|
try:
|
||||||
from thinc.neural.optimizers import Optimizer # noqa: F401
|
from thinc.neural.optimizers import Optimizer # noqa: F401
|
||||||
except ImportError:
|
except ImportError:
|
||||||
|
|
|
@ -5,7 +5,7 @@ import uuid
|
||||||
|
|
||||||
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
|
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
|
||||||
from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
|
from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
|
||||||
from ..util import minify_html, escape_html, get_entry_points, ENTRY_POINTS
|
from ..util import minify_html, escape_html, registry
|
||||||
from ..errors import Errors
|
from ..errors import Errors
|
||||||
|
|
||||||
|
|
||||||
|
@ -242,7 +242,7 @@ class EntityRenderer(object):
|
||||||
"CARDINAL": "#e4e7d2",
|
"CARDINAL": "#e4e7d2",
|
||||||
"PERCENT": "#e4e7d2",
|
"PERCENT": "#e4e7d2",
|
||||||
}
|
}
|
||||||
user_colors = get_entry_points(ENTRY_POINTS.displacy_colors)
|
user_colors = registry.displacy_colors.get_all()
|
||||||
for user_color in user_colors.values():
|
for user_color in user_colors.values():
|
||||||
colors.update(user_color)
|
colors.update(user_color)
|
||||||
colors.update(options.get("colors", {}))
|
colors.update(options.get("colors", {}))
|
||||||
|
|
|
@ -529,6 +529,9 @@ class Errors(object):
|
||||||
E185 = ("Received invalid attribute in component attribute declaration: "
|
E185 = ("Received invalid attribute in component attribute declaration: "
|
||||||
"{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.")
|
"{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.")
|
||||||
E186 = ("'{tok_a}' and '{tok_b}' are different texts.")
|
E186 = ("'{tok_a}' and '{tok_b}' are different texts.")
|
||||||
|
E187 = ("Tokenizer special cases are not allowed to modify the text. "
|
||||||
|
"This would map '{chunk}' to '{orth}' given token attributes "
|
||||||
|
"'{token_attrs}'.")
|
||||||
|
|
||||||
# TODO: fix numbering after merging develop into master
|
# TODO: fix numbering after merging develop into master
|
||||||
E998 = ("Can only create GoldParse's from Example's without a Doc, "
|
E998 = ("Can only create GoldParse's from Example's without a Doc, "
|
||||||
|
|
|
@ -11,12 +11,12 @@ Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
||||||
sentences = [
|
sentences = [
|
||||||
"Apple está buscando comprar una startup del Reino Unido por mil millones de dólares",
|
"Apple está buscando comprar una startup del Reino Unido por mil millones de dólares.",
|
||||||
"Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes",
|
"Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes.",
|
||||||
"San Francisco analiza prohibir los robots delivery",
|
"San Francisco analiza prohibir los robots delivery.",
|
||||||
"Londres es una gran ciudad del Reino Unido",
|
"Londres es una gran ciudad del Reino Unido.",
|
||||||
"El gato come pescado",
|
"El gato come pescado.",
|
||||||
"Veo al hombre con el telescopio",
|
"Veo al hombre con el telescopio.",
|
||||||
"La araña come moscas",
|
"La araña come moscas.",
|
||||||
"El pingüino incuba en su nido",
|
"El pingüino incuba en su nido.",
|
||||||
]
|
]
|
||||||
|
|
|
@ -11,8 +11,8 @@ Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
|
||||||
sentences = [
|
sentences = [
|
||||||
"Apple vurderer å kjøpe britisk oppstartfirma for en milliard dollar",
|
"Apple vurderer å kjøpe britisk oppstartfirma for en milliard dollar.",
|
||||||
"Selvkjørende biler flytter forsikringsansvaret over på produsentene ",
|
"Selvkjørende biler flytter forsikringsansvaret over på produsentene.",
|
||||||
"San Francisco vurderer å forby robotbud på fortauene",
|
"San Francisco vurderer å forby robotbud på fortauene.",
|
||||||
"London er en stor by i Storbritannia.",
|
"London er en stor by i Storbritannia.",
|
||||||
]
|
]
|
||||||
|
|
|
@ -114,7 +114,6 @@ emoticons = set(
|
||||||
(-:
|
(-:
|
||||||
=)
|
=)
|
||||||
(=
|
(=
|
||||||
")
|
|
||||||
:]
|
:]
|
||||||
:-]
|
:-]
|
||||||
[:
|
[:
|
||||||
|
|
99
spacy/lang/xx/examples.py
Normal file
|
@ -0,0 +1,99 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
|
||||||
|
"""
|
||||||
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
>>> from spacy.lang.xx.examples import sentences
|
||||||
|
>>> docs = nlp.pipe(sentences)
|
||||||
|
"""
|
||||||
|
|
||||||
|
# combined examples from de/en/es/fr/it/nl/pl/pt/ru
|
||||||
|
|
||||||
|
sentences = [
|
||||||
|
"Die ganze Stadt ist ein Startup: Shenzhen ist das Silicon Valley für Hardware-Firmen",
|
||||||
|
"Wie deutsche Startups die Technologie vorantreiben wollen: Künstliche Intelligenz",
|
||||||
|
"Trend zum Urlaub in Deutschland beschert Gastwirten mehr Umsatz",
|
||||||
|
"Bundesanwaltschaft erhebt Anklage gegen mutmaßlichen Schweizer Spion",
|
||||||
|
"San Francisco erwägt Verbot von Lieferrobotern",
|
||||||
|
"Autonome Fahrzeuge verlagern Haftpflicht auf Hersteller",
|
||||||
|
"Wo bist du?",
|
||||||
|
"Was ist die Hauptstadt von Deutschland?",
|
||||||
|
"Apple is looking at buying U.K. startup for $1 billion",
|
||||||
|
"Autonomous cars shift insurance liability toward manufacturers",
|
||||||
|
"San Francisco considers banning sidewalk delivery robots",
|
||||||
|
"London is a big city in the United Kingdom.",
|
||||||
|
"Where are you?",
|
||||||
|
"Who is the president of France?",
|
||||||
|
"What is the capital of the United States?",
|
||||||
|
"When was Barack Obama born?",
|
||||||
|
"Apple está buscando comprar una startup del Reino Unido por mil millones de dólares.",
|
||||||
|
"Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes.",
|
||||||
|
"San Francisco analiza prohibir los robots delivery.",
|
||||||
|
"Londres es una gran ciudad del Reino Unido.",
|
||||||
|
"El gato come pescado.",
|
||||||
|
"Veo al hombre con el telescopio.",
|
||||||
|
"La araña come moscas.",
|
||||||
|
"El pingüino incuba en su nido.",
|
||||||
|
"Apple cherche à acheter une start-up anglaise pour 1 milliard de dollars",
|
||||||
|
"Les voitures autonomes déplacent la responsabilité de l'assurance vers les constructeurs",
|
||||||
|
"San Francisco envisage d'interdire les robots coursiers sur les trottoirs",
|
||||||
|
"Londres est une grande ville du Royaume-Uni",
|
||||||
|
"L’Italie choisit ArcelorMittal pour reprendre la plus grande aciérie d’Europe",
|
||||||
|
"Apple lance HomePod parce qu'il se sent menacé par l'Echo d'Amazon",
|
||||||
|
"La France ne devrait pas manquer d'électricité cet été, même en cas de canicule",
|
||||||
|
"Nouvelles attaques de Trump contre le maire de Londres",
|
||||||
|
"Où es-tu ?",
|
||||||
|
"Qui est le président de la France ?",
|
||||||
|
"Où est la capitale des États-Unis ?",
|
||||||
|
"Quand est né Barack Obama ?",
|
||||||
|
"Apple vuole comprare una startup del Regno Unito per un miliardo di dollari",
|
||||||
|
"Le automobili a guida autonoma spostano la responsabilità assicurativa verso i produttori",
|
||||||
|
"San Francisco prevede di bandire i robot di consegna porta a porta",
|
||||||
|
"Londra è una grande città del Regno Unito.",
|
||||||
|
"Apple overweegt om voor 1 miljard een U.K. startup te kopen",
|
||||||
|
"Autonome auto's verschuiven de verzekeringverantwoordelijkheid naar producenten",
|
||||||
|
"San Francisco overweegt robots op voetpaden te verbieden",
|
||||||
|
"Londen is een grote stad in het Verenigd Koninkrijk",
|
||||||
|
"Poczuł przyjemną woń mocnej kawy.",
|
||||||
|
"Istnieje wiele dróg oddziaływania substancji psychoaktywnej na układ nerwowy.",
|
||||||
|
"Powitał mnie biało-czarny kot, płosząc siedzące na płocie trzy dorodne dudki.",
|
||||||
|
"Nowy abonament pod lupą Komisji Europejskiej",
|
||||||
|
"Czy w ciągu ostatnich 48 godzin spożyłeś leki zawierające paracetamol?",
|
||||||
|
"Kto ma ochotę zapoznać się z innymi niż w książkach przygodami Muminków i ich przyjaciół, temu polecam komiks Tove Jansson „Muminki i morze”.",
|
||||||
|
"Apple está querendo comprar uma startup do Reino Unido por 100 milhões de dólares.",
|
||||||
|
"Carros autônomos empurram a responsabilidade do seguro para os fabricantes.",
|
||||||
|
"São Francisco considera banir os robôs de entrega que andam pelas calçadas.",
|
||||||
|
"Londres é a maior cidade do Reino Unido.",
|
||||||
|
# Translations from English:
|
||||||
|
"Apple рассматривает возможность покупки стартапа из Соединённого Королевства за $1 млрд",
|
||||||
|
"Беспилотные автомобили перекладывают страховую ответственность на производителя",
|
||||||
|
"В Сан-Франциско рассматривается возможность запрета роботов-курьеров, которые перемещаются по тротуару",
|
||||||
|
"Лондон — это большой город в Соединённом Королевстве",
|
||||||
|
# Native Russian sentences:
|
||||||
|
# Colloquial:
|
||||||
|
"Да, нет, наверное!", # Typical polite refusal
|
||||||
|
"Обратите внимание на необыкновенную красоту этого города-героя Москвы, столицы нашей Родины!", # From a tour guide speech
|
||||||
|
# Examples of Bookish Russian:
|
||||||
|
# Quote from "The Golden Calf"
|
||||||
|
"Рио-де-Жанейро — это моя мечта, и не смейте касаться её своими грязными лапами!",
|
||||||
|
# Quotes from "Ivan Vasilievich changes his occupation"
|
||||||
|
"Ты пошто боярыню обидел, смерд?!!",
|
||||||
|
"Оставь меня, старушка, я в печали!",
|
||||||
|
# Quotes from Dostoevsky:
|
||||||
|
"Уж коли я, такой же, как и ты, человек грешный, над тобой умилился и пожалел тебя, кольми паче бог",
|
||||||
|
"В мечтах я нередко, говорит, доходил до страстных помыслов о служении человечеству и может быть действительно пошел бы на крест за людей, если б это вдруг как-нибудь потребовалось, а между тем я двух дней не в состоянии прожить ни с кем в одной комнате, о чем знаю из опыта",
|
||||||
|
"Зато всегда так происходило, что чем более я ненавидел людей в частности, тем пламеннее становилась любовь моя к человечеству вообще",
|
||||||
|
# Quotes from Chekhov:
|
||||||
|
"Ненужные дела и разговоры всё об одном отхватывают на свою долю лучшую часть времени, лучшие силы, и в конце концов остается какая-то куцая, бескрылая жизнь, какая-то чепуха, и уйти и бежать нельзя, точно сидишь в сумасшедшем доме или в арестантских ротах!",
|
||||||
|
# Quotes from Turgenev:
|
||||||
|
"Нравится тебе женщина, старайся добиться толку; а нельзя — ну, не надо, отвернись — земля не клином сошлась",
|
||||||
|
"Узенькое местечко, которое я занимаю, до того крохотно в сравнении с остальным пространством, где меня нет и где дела до меня нет; и часть времени, которую мне удастся прожить, так ничтожна перед вечностью, где меня не было и не будет...",
|
||||||
|
# Quotes from newspapers:
|
||||||
|
# Komsomolskaya Pravda:
|
||||||
|
"На заседании президиума правительства Москвы принято решение присвоить статус инвестиционного приоритетного проекта города Москвы киностудии Союзмультфильм",
|
||||||
|
"Глава Минобороны Сергей Шойгу заявил, что обстановка на этом стратегическом направлении требует непрерывного совершенствования боевого состава войск",
|
||||||
|
# Argumenty i Facty:
|
||||||
|
"На реплику лже-Говина — дескать, он (Волков) будет лучшим революционером — Стамп с энтузиазмом ответил: Непременно!",
|
||||||
|
]
|
|
@ -53,8 +53,8 @@ class BaseDefaults(object):
|
||||||
filenames = {name: root / filename for name, filename in cls.resources}
|
filenames = {name: root / filename for name, filename in cls.resources}
|
||||||
if LANG in cls.lex_attr_getters:
|
if LANG in cls.lex_attr_getters:
|
||||||
lang = cls.lex_attr_getters[LANG](None)
|
lang = cls.lex_attr_getters[LANG](None)
|
||||||
user_lookups = util.get_entry_point(util.ENTRY_POINTS.lookups, lang, {})
|
if lang in util.registry.lookups:
|
||||||
filenames.update(user_lookups)
|
filenames.update(util.registry.lookups.get(lang))
|
||||||
lookups = Lookups()
|
lookups = Lookups()
|
||||||
for name, filename in filenames.items():
|
for name, filename in filenames.items():
|
||||||
data = util.load_language_data(filename)
|
data = util.load_language_data(filename)
|
||||||
|
@ -157,7 +157,7 @@ class Language(object):
|
||||||
100,000 characters in one text.
|
100,000 characters in one text.
|
||||||
RETURNS (Language): The newly constructed object.
|
RETURNS (Language): The newly constructed object.
|
||||||
"""
|
"""
|
||||||
user_factories = util.get_entry_points(util.ENTRY_POINTS.factories)
|
user_factories = util.registry.factories.get_all()
|
||||||
self.factories.update(user_factories)
|
self.factories.update(user_factories)
|
||||||
self._meta = dict(meta)
|
self._meta = dict(meta)
|
||||||
self._path = None
|
self._path = None
|
||||||
|
@ -741,6 +741,7 @@ class Language(object):
|
||||||
texts,
|
texts,
|
||||||
batch_size=batch_size,
|
batch_size=batch_size,
|
||||||
disable=disable,
|
disable=disable,
|
||||||
|
n_process=n_process,
|
||||||
component_cfg=component_cfg,
|
component_cfg=component_cfg,
|
||||||
as_example=False
|
as_example=False
|
||||||
)
|
)
|
||||||
|
|
|
@ -240,7 +240,7 @@ cdef class DependencyMatcher:
|
||||||
for i, (ent_id, nodes) in enumerate(matched_key_trees):
|
for i, (ent_id, nodes) in enumerate(matched_key_trees):
|
||||||
on_match = self._callbacks.get(ent_id)
|
on_match = self._callbacks.get(ent_id)
|
||||||
if on_match is not None:
|
if on_match is not None:
|
||||||
on_match(self, doc, i, matches)
|
on_match(self, doc, i, matched_key_trees)
|
||||||
return matched_key_trees
|
return matched_key_trees
|
||||||
|
|
||||||
def recurse(self,tree,id_to_position,_node_operator_map,int patternLength,visitedNodes,matched_trees):
|
def recurse(self,tree,id_to_position,_node_operator_map,int patternLength,visitedNodes,matched_trees):
|
||||||
|
|
|
@ -3,10 +3,10 @@ from __future__ import unicode_literals
|
||||||
from thinc.api import chain
|
from thinc.api import chain
|
||||||
from thinc.v2v import Maxout
|
from thinc.v2v import Maxout
|
||||||
from thinc.misc import LayerNorm
|
from thinc.misc import LayerNorm
|
||||||
from ..util import register_architecture, make_layer
|
from ..util import registry, make_layer
|
||||||
|
|
||||||
|
|
||||||
@register_architecture("thinc.FeedForward.v1")
|
@registry.architectures.register("thinc.FeedForward.v1")
|
||||||
def FeedForward(config):
|
def FeedForward(config):
|
||||||
layers = [make_layer(layer_cfg) for layer_cfg in config["layers"]]
|
layers = [make_layer(layer_cfg) for layer_cfg in config["layers"]]
|
||||||
model = chain(*layers)
|
model = chain(*layers)
|
||||||
|
@ -14,7 +14,7 @@ def FeedForward(config):
|
||||||
return model
|
return model
|
||||||
|
|
||||||
|
|
||||||
@register_architecture("spacy.LayerNormalizedMaxout.v1")
|
@registry.architectures.register("spacy.LayerNormalizedMaxout.v1")
|
||||||
def LayerNormalizedMaxout(config):
|
def LayerNormalizedMaxout(config):
|
||||||
width = config["width"]
|
width = config["width"]
|
||||||
pieces = config["pieces"]
|
pieces = config["pieces"]
|
||||||
|
|
|
@ -6,11 +6,11 @@ from thinc.v2v import Maxout, Model
|
||||||
from thinc.i2v import HashEmbed, StaticVectors
|
from thinc.i2v import HashEmbed, StaticVectors
|
||||||
from thinc.t2t import ExtractWindow
|
from thinc.t2t import ExtractWindow
|
||||||
from thinc.misc import Residual, LayerNorm, FeatureExtracter
|
from thinc.misc import Residual, LayerNorm, FeatureExtracter
|
||||||
from ..util import make_layer, register_architecture
|
from ..util import make_layer, registry
|
||||||
from ._wire import concatenate_lists
|
from ._wire import concatenate_lists
|
||||||
|
|
||||||
|
|
||||||
@register_architecture("spacy.Tok2Vec.v1")
|
@registry.architectures.register("spacy.Tok2Vec.v1")
|
||||||
def Tok2Vec(config):
|
def Tok2Vec(config):
|
||||||
doc2feats = make_layer(config["@doc2feats"])
|
doc2feats = make_layer(config["@doc2feats"])
|
||||||
embed = make_layer(config["@embed"])
|
embed = make_layer(config["@embed"])
|
||||||
|
@ -24,13 +24,13 @@ def Tok2Vec(config):
|
||||||
return tok2vec
|
return tok2vec
|
||||||
|
|
||||||
|
|
||||||
@register_architecture("spacy.Doc2Feats.v1")
|
@registry.architectures.register("spacy.Doc2Feats.v1")
|
||||||
def Doc2Feats(config):
|
def Doc2Feats(config):
|
||||||
columns = config["columns"]
|
columns = config["columns"]
|
||||||
return FeatureExtracter(columns)
|
return FeatureExtracter(columns)
|
||||||
|
|
||||||
|
|
||||||
@register_architecture("spacy.MultiHashEmbed.v1")
|
@registry.architectures.register("spacy.MultiHashEmbed.v1")
|
||||||
def MultiHashEmbed(config):
|
def MultiHashEmbed(config):
|
||||||
# For backwards compatibility with models before the architecture registry,
|
# For backwards compatibility with models before the architecture registry,
|
||||||
# we have to be careful to get exactly the same model structure. One subtle
|
# we have to be careful to get exactly the same model structure. One subtle
|
||||||
|
@ -78,7 +78,7 @@ def MultiHashEmbed(config):
|
||||||
return layer
|
return layer
|
||||||
|
|
||||||
|
|
||||||
@register_architecture("spacy.CharacterEmbed.v1")
|
@registry.architectures.register("spacy.CharacterEmbed.v1")
|
||||||
def CharacterEmbed(config):
|
def CharacterEmbed(config):
|
||||||
from .. import _ml
|
from .. import _ml
|
||||||
|
|
||||||
|
@ -94,7 +94,7 @@ def CharacterEmbed(config):
|
||||||
return model
|
return model
|
||||||
|
|
||||||
|
|
||||||
@register_architecture("spacy.MaxoutWindowEncoder.v1")
|
@registry.architectures.register("spacy.MaxoutWindowEncoder.v1")
|
||||||
def MaxoutWindowEncoder(config):
|
def MaxoutWindowEncoder(config):
|
||||||
nO = config["width"]
|
nO = config["width"]
|
||||||
nW = config["window_size"]
|
nW = config["window_size"]
|
||||||
|
@ -110,7 +110,7 @@ def MaxoutWindowEncoder(config):
|
||||||
return model
|
return model
|
||||||
|
|
||||||
|
|
||||||
@register_architecture("spacy.MishWindowEncoder.v1")
|
@registry.architectures.register("spacy.MishWindowEncoder.v1")
|
||||||
def MishWindowEncoder(config):
|
def MishWindowEncoder(config):
|
||||||
from thinc.v2v import Mish
|
from thinc.v2v import Mish
|
||||||
|
|
||||||
|
@ -124,12 +124,12 @@ def MishWindowEncoder(config):
|
||||||
return model
|
return model
|
||||||
|
|
||||||
|
|
||||||
@register_architecture("spacy.PretrainedVectors.v1")
|
@registry.architectures.register("spacy.PretrainedVectors.v1")
|
||||||
def PretrainedVectors(config):
|
def PretrainedVectors(config):
|
||||||
return StaticVectors(config["vectors_name"], config["width"], config["column"])
|
return StaticVectors(config["vectors_name"], config["width"], config["column"])
|
||||||
|
|
||||||
|
|
||||||
@register_architecture("spacy.TorchBiLSTMEncoder.v1")
|
@registry.architectures.register("spacy.TorchBiLSTMEncoder.v1")
|
||||||
def TorchBiLSTMEncoder(config):
|
def TorchBiLSTMEncoder(config):
|
||||||
import torch.nn
|
import torch.nn
|
||||||
from thinc.extra.wrappers import PyTorchWrapperRNN
|
from thinc.extra.wrappers import PyTorchWrapperRNN
|
||||||
|
|
34
spacy/tests/regression/test_issue4590.py
Normal file
|
@ -0,0 +1,34 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
from mock import Mock
|
||||||
|
from spacy.matcher import DependencyMatcher
|
||||||
|
from ..util import get_doc
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue4590(en_vocab):
|
||||||
|
"""Test that matches param in on_match method are the same as matches run with no on_match method"""
|
||||||
|
pattern = [
|
||||||
|
{"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}},
|
||||||
|
{"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
|
||||||
|
{"SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
|
||||||
|
]
|
||||||
|
|
||||||
|
on_match = Mock()
|
||||||
|
|
||||||
|
matcher = DependencyMatcher(en_vocab)
|
||||||
|
matcher.add("pattern", on_match, pattern)
|
||||||
|
|
||||||
|
text = "The quick brown fox jumped over the lazy fox"
|
||||||
|
heads = [3, 2, 1, 1, 0, -1, 2, 1, -3]
|
||||||
|
deps = ["det", "amod", "amod", "nsubj", "prep", "pobj", "det", "amod"]
|
||||||
|
|
||||||
|
doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps)
|
||||||
|
|
||||||
|
matches = matcher(doc)
|
||||||
|
|
||||||
|
on_match_args = on_match.call_args
|
||||||
|
|
||||||
|
assert on_match_args[0][3] == matches
|
||||||
|
|
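The regression test for issue 4590 above checks that the `on_match` callback receives the same match list that the matcher returns. A minimal stand-alone sketch of that contract follows. `ToyMatcher` is a hypothetical stand-in written for illustration, not spaCy's `DependencyMatcher`, and it uses the standard-library `unittest.mock` rather than the `mock` package imported in the test.

```python
from unittest.mock import Mock


class ToyMatcher:
    """Hypothetical matcher used only to illustrate the callback contract."""

    def __init__(self):
        self._callbacks = {}

    def add(self, key, on_match):
        self._callbacks[key] = on_match

    def __call__(self, doc):
        # Treat every token equal to a registered key as a match.
        matches = [
            (key, [i])
            for i, tok in enumerate(doc)
            for key in self._callbacks
            if tok == key
        ]
        for i, (key, _) in enumerate(matches):
            on_match = self._callbacks.get(key)
            if on_match is not None:
                # Pass the same list that is returned to the caller below.
                on_match(self, doc, i, matches)
        return matches


on_match = Mock()
matcher = ToyMatcher()
matcher.add("fox", on_match)
doc = "The quick brown fox jumped over the lazy fox".split()
matches = matcher(doc)
```

The point of the sketch is only that the fourth positional argument of the callback and the return value are the same list, which is what the assertion on `on_match.call_args` in the test relies on.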
19
spacy/tests/test_architectures.py
Normal file
|
@ -0,0 +1,19 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
from spacy import registry
|
||||||
|
from thinc.v2v import Affine
|
||||||
|
from catalogue import RegistryError
|
||||||
|
|
||||||
|
|
||||||
|
@registry.architectures.register("my_test_function")
|
||||||
|
def create_model(nr_in, nr_out):
|
||||||
|
return Affine(nr_in, nr_out)
|
||||||
|
|
||||||
|
|
||||||
|
def test_get_architecture():
|
||||||
|
arch = registry.architectures.get("my_test_function")
|
||||||
|
assert arch is create_model
|
||||||
|
with pytest.raises(RegistryError):
|
||||||
|
registry.architectures.get("not_an_existing_key")
|
|
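The new `test_architectures.py` above exercises a decorator-based registry: `register(name)` stores a function under a name, `get(name)` returns it, and an unknown name raises a dedicated error. The sketch below shows that pattern in plain Python. `ToyRegistry` and this local `RegistryError` are assumptions made for illustration and are not the `catalogue` implementation that the diff imports.

```python
class RegistryError(Exception):
    """Hypothetical error raised when a name is not registered."""


class ToyRegistry:
    """Hypothetical name-to-function registry with a decorator interface."""

    def __init__(self, namespace):
        self.namespace = namespace
        self._funcs = {}

    def register(self, name):
        def decorator(func):
            self._funcs[name] = func
            return func  # hand the function back unchanged

        return decorator

    def get(self, name):
        if name not in self._funcs:
            raise RegistryError(
                "'{}' not found in registry '{}'".format(name, self.namespace)
            )
        return self._funcs[name]

    def get_all(self):
        # Return a copy so callers cannot mutate the registry by accident.
        return dict(self._funcs)

    def __contains__(self, name):
        return name in self._funcs


architectures = ToyRegistry("architectures")


@architectures.register("my_test_function")
def create_model(nr_in, nr_out):
    # Placeholder body; the real test returns a thinc Affine layer.
    return (nr_in, nr_out)
```

Because the decorator returns the function unchanged, an identity check such as `arch is create_model` holds, mirroring the assertion in the test.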
@ -1,19 +0,0 @@
|
||||||
# coding: utf8
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
from spacy import register_architecture
|
|
||||||
from spacy import get_architecture
|
|
||||||
from thinc.v2v import Affine
|
|
||||||
|
|
||||||
|
|
||||||
@register_architecture("my_test_function")
|
|
||||||
def create_model(nr_in, nr_out):
|
|
||||||
return Affine(nr_in, nr_out)
|
|
||||||
|
|
||||||
|
|
||||||
def test_get_architecture():
|
|
||||||
arch = get_architecture("my_test_function")
|
|
||||||
assert arch is create_model
|
|
||||||
with pytest.raises(KeyError):
|
|
||||||
get_architecture("not_an_existing_key")
|
|
|
@ -7,7 +7,7 @@ import pytest
|
||||||
|
|
||||||
def test_tokenizer_handles_emoticons(tokenizer):
|
def test_tokenizer_handles_emoticons(tokenizer):
|
||||||
# Tweebo challenge (CMU)
|
# Tweebo challenge (CMU)
|
||||||
text = """:o :/ :'( >:o (: :) >.< XD -__- o.O ;D :-) @_@ :P 8D :1 >:( :D =| ") :> ...."""
|
text = """:o :/ :'( >:o (: :) >.< XD -__- o.O ;D :-) @_@ :P 8D :1 >:( :D =| :> ...."""
|
||||||
tokens = tokenizer(text)
|
tokens = tokenizer(text)
|
||||||
assert tokens[0].text == ":o"
|
assert tokens[0].text == ":o"
|
||||||
assert tokens[1].text == ":/"
|
assert tokens[1].text == ":/"
|
||||||
|
@ -28,12 +28,11 @@ def test_tokenizer_handles_emoticons(tokenizer):
|
||||||
assert tokens[16].text == ">:("
|
assert tokens[16].text == ">:("
|
||||||
assert tokens[17].text == ":D"
|
assert tokens[17].text == ":D"
|
||||||
assert tokens[18].text == "=|"
|
assert tokens[18].text == "=|"
|
||||||
assert tokens[19].text == '")'
|
assert tokens[19].text == ":>"
|
||||||
assert tokens[20].text == ":>"
|
assert tokens[20].text == "...."
|
||||||
assert tokens[21].text == "...."
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize("text,length", [("example:)", 3), ("108)", 2), ("XDN", 1)])
|
@pytest.mark.parametrize("text,length", [("108)", 2), ("XDN", 1)])
|
||||||
def test_tokenizer_excludes_false_pos_emoticons(tokenizer, text, length):
|
def test_tokenizer_excludes_false_pos_emoticons(tokenizer, text, length):
|
||||||
tokens = tokenizer(text)
|
tokens = tokenizer(text)
|
||||||
assert len(tokens) == length
|
assert len(tokens) == length
|
||||||
|
|
|
@ -108,6 +108,12 @@ def test_tokenizer_add_special_case(tokenizer, text, tokens):
|
||||||
assert doc[1].text == tokens[1]["orth"]
|
assert doc[1].text == tokens[1]["orth"]
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("text,tokens", [("lorem", [{"orth": "lo"}, {"orth": "re"}])])
|
||||||
|
def test_tokenizer_validate_special_case(tokenizer, text, tokens):
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
tokenizer.add_special_case(text, tokens)
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize(
|
@pytest.mark.parametrize(
|
||||||
"text,tokens", [("lorem", [{"orth": "lo", "tag": "NN"}, {"orth": "rem"}])]
|
"text,tokens", [("lorem", [{"orth": "lo", "tag": "NN"}, {"orth": "rem"}])]
|
||||||
)
|
)
|
||||||
|
@ -120,3 +126,18 @@ def test_tokenizer_add_special_case_tag(text, tokens):
|
||||||
assert doc[0].tag_ == tokens[0]["tag"]
|
assert doc[0].tag_ == tokens[0]["tag"]
|
||||||
assert doc[0].pos_ == "NOUN"
|
assert doc[0].pos_ == "NOUN"
|
||||||
assert doc[1].text == tokens[1]["orth"]
|
assert doc[1].text == tokens[1]["orth"]
|
||||||
|
|
||||||
|
|
||||||
|
def test_tokenizer_special_cases_with_affixes(tokenizer):
|
||||||
|
text = '(((_SPECIAL_ A/B, A/B-A/B")'
|
||||||
|
tokenizer.add_special_case("_SPECIAL_", [{"orth": "_SPECIAL_"}])
|
||||||
|
tokenizer.add_special_case("A/B", [{"orth": "A/B"}])
|
||||||
|
doc = tokenizer(text)
|
||||||
|
assert [token.text for token in doc] == ["(", "(", "(", "_SPECIAL_", "A/B", ",", "A/B", "-", "A/B", '"', ")"]
|
||||||
|
|
||||||
|
|
||||||
|
def test_tokenizer_special_cases_with_period(tokenizer):
|
||||||
|
text = "_SPECIAL_."
|
||||||
|
tokenizer.add_special_case("_SPECIAL_", [{"orth": "_SPECIAL_"}])
|
||||||
|
doc = tokenizer(text)
|
||||||
|
assert [token.text for token in doc] == ["_SPECIAL_", "."]
|
||||||
|
|
|
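`test_tokenizer_validate_special_case` above expects a `ValueError` when a special case would change the text, which matches the wording of the new E187 message. A minimal sketch of such a check, assuming the rule is simply that the concatenated `orth` values must equal the original chunk, could look like this. `validate_special_case` is a hypothetical helper, not the tokenizer's actual method.

```python
def validate_special_case(chunk, substrings):
    """Raise ValueError if the special-case tokens do not spell out `chunk`.

    Hypothetical helper: `substrings` is a list of dicts with an "orth" key,
    in the same shape the tests pass to `add_special_case`.
    """
    orth = "".join(spec["orth"] for spec in substrings)
    if chunk != orth:
        raise ValueError(
            "Tokenizer special cases are not allowed to modify the text. "
            "This would map '{}' to '{}' given token attributes '{}'.".format(
                chunk, orth, substrings
            )
        )


# "lo" + "rem" spells "lorem", so this passes silently.
validate_special_case("lorem", [{"orth": "lo", "tag": "NN"}, {"orth": "rem"}])

# "lo" + "re" drops the final "m", so this is rejected.
try:
    validate_special_case("lorem", [{"orth": "lo"}, {"orth": "re"}])
    rejected = False
except ValueError:
    rejected = True
```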
@ -3,6 +3,8 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
from spacy.lang.tokenizer_exceptions import BASE_EXCEPTIONS
|
||||||
|
|
||||||
|
|
||||||
URLS_BASIC = [
|
URLS_BASIC = [
|
||||||
"http://www.nytimes.com/2016/04/20/us/politics/new-york-primary-preview.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=a-lede-package-region&region=top-news&WT.nav=top-news&_r=0",
|
"http://www.nytimes.com/2016/04/20/us/politics/new-york-primary-preview.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=a-lede-package-region&region=top-news&WT.nav=top-news&_r=0",
|
||||||
|
@ -194,7 +196,12 @@ def test_tokenizer_handles_two_prefix_url(tokenizer, prefix1, prefix2, url):
|
||||||
@pytest.mark.parametrize("url", URLS_FULL)
|
@pytest.mark.parametrize("url", URLS_FULL)
|
||||||
def test_tokenizer_handles_two_suffix_url(tokenizer, suffix1, suffix2, url):
|
def test_tokenizer_handles_two_suffix_url(tokenizer, suffix1, suffix2, url):
|
||||||
tokens = tokenizer(url + suffix1 + suffix2)
|
tokens = tokenizer(url + suffix1 + suffix2)
|
||||||
assert len(tokens) == 3
|
if suffix1 + suffix2 in BASE_EXCEPTIONS:
|
||||||
assert tokens[0].text == url
|
assert len(tokens) == 2
|
||||||
assert tokens[1].text == suffix1
|
assert tokens[0].text == url
|
||||||
assert tokens[2].text == suffix2
|
assert tokens[1].text == suffix1 + suffix2
|
||||||
|
else:
|
||||||
|
assert len(tokens) == 3
|
||||||
|
assert tokens[0].text == url
|
||||||
|
assert tokens[1].text == suffix1
|
||||||
|
assert tokens[2].text == suffix2
|
||||||
|
|
|
@ -4,10 +4,11 @@ from preshed.maps cimport PreshMap
|
||||||
from cymem.cymem cimport Pool
|
from cymem.cymem cimport Pool
|
||||||
|
|
||||||
from .typedefs cimport hash_t
|
from .typedefs cimport hash_t
|
||||||
from .structs cimport LexemeC, TokenC
|
from .structs cimport LexemeC, SpanC, TokenC
|
||||||
from .strings cimport StringStore
|
from .strings cimport StringStore
|
||||||
from .tokens.doc cimport Doc
|
from .tokens.doc cimport Doc
|
||||||
from .vocab cimport Vocab, LexemesOrTokens, _Cached
|
from .vocab cimport Vocab, LexemesOrTokens, _Cached
|
||||||
|
from .matcher.phrasematcher cimport PhraseMatcher
|
||||||
|
|
||||||
|
|
||||||
cdef class Tokenizer:
|
cdef class Tokenizer:
|
||||||
|
@ -21,15 +22,32 @@ cdef class Tokenizer:
|
||||||
cdef object _suffix_search
|
cdef object _suffix_search
|
||||||
cdef object _infix_finditer
|
cdef object _infix_finditer
|
||||||
cdef object _rules
|
cdef object _rules
|
||||||
|
cdef PhraseMatcher _special_matcher
|
||||||
|
cdef int _property_init_count
|
||||||
|
cdef int _property_init_max
|
||||||
|
|
||||||
cpdef Doc tokens_from_list(self, list strings)
|
cpdef Doc tokens_from_list(self, list strings)
|
||||||
|
|
||||||
|
cdef Doc _tokenize_affixes(self, unicode string, bint with_special_cases)
|
||||||
|
cdef int _apply_special_cases(self, Doc doc) except -1
|
||||||
|
cdef void _filter_special_spans(self, vector[SpanC] &original,
|
||||||
|
vector[SpanC] &filtered, int doc_len) nogil
|
||||||
|
cdef object _prepare_special_spans(self, Doc doc,
|
||||||
|
vector[SpanC] &filtered)
|
||||||
|
cdef int _retokenize_special_spans(self, Doc doc, TokenC* tokens,
|
||||||
|
object span_data)
|
||||||
cdef int _try_cache(self, hash_t key, Doc tokens) except -1
|
cdef int _try_cache(self, hash_t key, Doc tokens) except -1
|
||||||
cdef int _tokenize(self, Doc tokens, unicode span, hash_t key) except -1
|
cdef int _try_specials(self, hash_t key, Doc tokens,
|
||||||
cdef unicode _split_affixes(self, Pool mem, unicode string, vector[LexemeC*] *prefixes,
|
int* has_special) except -1
|
||||||
vector[LexemeC*] *suffixes, int* has_special)
|
cdef int _tokenize(self, Doc tokens, unicode span, hash_t key,
|
||||||
|
int* has_special, bint with_special_cases) except -1
|
||||||
|
cdef unicode _split_affixes(self, Pool mem, unicode string,
|
||||||
|
vector[LexemeC*] *prefixes,
|
||||||
|
vector[LexemeC*] *suffixes, int* has_special,
|
||||||
|
bint with_special_cases)
|
||||||
cdef int _attach_tokens(self, Doc tokens, unicode string,
|
cdef int _attach_tokens(self, Doc tokens, unicode string,
|
||||||
vector[LexemeC*] *prefixes, vector[LexemeC*] *suffixes) except -1
|
vector[LexemeC*] *prefixes,
|
||||||
|
vector[LexemeC*] *suffixes, int* has_special,
|
||||||
cdef int _save_cached(self, const TokenC* tokens, hash_t key, int has_special,
|
bint with_special_cases) except -1
|
||||||
int n) except -1
|
cdef int _save_cached(self, const TokenC* tokens, hash_t key,
|
||||||
|
int* has_special, int n) except -1
|
||||||
|
|
|
@ -5,6 +5,8 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
from cython.operator cimport dereference as deref
|
from cython.operator cimport dereference as deref
|
||||||
from cython.operator cimport preincrement as preinc
|
from cython.operator cimport preincrement as preinc
|
||||||
|
from libc.string cimport memcpy, memset
|
||||||
|
from libcpp.set cimport set as stdset
|
||||||
from cymem.cymem cimport Pool
|
from cymem.cymem cimport Pool
|
||||||
from preshed.maps cimport PreshMap
|
from preshed.maps cimport PreshMap
|
||||||
cimport cython
|
cimport cython
|
||||||
|
@ -19,6 +21,9 @@ from .compat import unescape_unicode
|
||||||
from .errors import Errors, Warnings, deprecation_warning
|
from .errors import Errors, Warnings, deprecation_warning
|
||||||
from . import util
|
from . import util
|
||||||
|
|
||||||
|
from .attrs import intify_attrs
|
||||||
|
from .lexeme cimport EMPTY_LEXEME
|
||||||
|
from .symbols import ORTH
|
||||||
|
|
||||||
cdef class Tokenizer:
|
cdef class Tokenizer:
|
||||||
"""Segment text, and create Doc objects with the discovered segment
|
"""Segment text, and create Doc objects with the discovered segment
|
||||||
|
@ -57,9 +62,10 @@ cdef class Tokenizer:
|
||||||
self.infix_finditer = infix_finditer
|
self.infix_finditer = infix_finditer
|
||||||
self.vocab = vocab
|
self.vocab = vocab
|
||||||
self._rules = {}
|
self._rules = {}
|
||||||
if rules is not None:
|
self._special_matcher = PhraseMatcher(self.vocab)
|
||||||
for chunk, substrings in sorted(rules.items()):
|
self._load_special_cases(rules)
|
||||||
self.add_special_case(chunk, substrings)
|
self._property_init_count = 0
|
||||||
|
self._property_init_max = 4
|
||||||
|
|
||||||
property token_match:
|
property token_match:
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
|
@ -67,7 +73,9 @@ cdef class Tokenizer:
|
||||||
|
|
||||||
def __set__(self, token_match):
|
def __set__(self, token_match):
|
||||||
self._token_match = token_match
|
self._token_match = token_match
|
||||||
self._flush_cache()
|
self._reload_special_cases()
|
||||||
|
if self._property_init_count <= self._property_init_max:
|
||||||
|
self._property_init_count += 1
|
||||||
|
|
||||||
property prefix_search:
|
property prefix_search:
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
|
@ -75,7 +83,9 @@ cdef class Tokenizer:
|
||||||
|
|
||||||
def __set__(self, prefix_search):
|
def __set__(self, prefix_search):
|
||||||
self._prefix_search = prefix_search
|
self._prefix_search = prefix_search
|
||||||
self._flush_cache()
|
self._reload_special_cases()
|
||||||
|
if self._property_init_count <= self._property_init_max:
|
||||||
|
self._property_init_count += 1
|
||||||
|
|
||||||
property suffix_search:
|
property suffix_search:
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
|
@ -83,7 +93,9 @@ cdef class Tokenizer:
|
||||||
|
|
||||||
def __set__(self, suffix_search):
|
def __set__(self, suffix_search):
|
||||||
self._suffix_search = suffix_search
|
self._suffix_search = suffix_search
|
||||||
self._flush_cache()
|
self._reload_special_cases()
|
||||||
|
if self._property_init_count <= self._property_init_max:
|
||||||
|
self._property_init_count += 1
|
||||||
|
|
||||||
property infix_finditer:
|
property infix_finditer:
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
|
@ -91,7 +103,9 @@ cdef class Tokenizer:
|
||||||
|
|
||||||
def __set__(self, infix_finditer):
|
def __set__(self, infix_finditer):
|
||||||
self._infix_finditer = infix_finditer
|
self._infix_finditer = infix_finditer
|
||||||
self._flush_cache()
|
self._reload_special_cases()
|
||||||
|
if self._property_init_count <= self._property_init_max:
|
||||||
|
self._property_init_count += 1
|
||||||
|
|
||||||
def __reduce__(self):
|
def __reduce__(self):
|
||||||
args = (self.vocab,
|
args = (self.vocab,
|
||||||
|
@ -106,7 +120,6 @@ cdef class Tokenizer:
|
||||||
deprecation_warning(Warnings.W002)
|
deprecation_warning(Warnings.W002)
|
||||||
return Doc(self.vocab, words=strings)
|
return Doc(self.vocab, words=strings)
|
||||||
|
|
||||||
@cython.boundscheck(False)
|
|
||||||
def __call__(self, unicode string):
|
def __call__(self, unicode string):
|
||||||
"""Tokenize a string.
|
"""Tokenize a string.
|
||||||
|
|
||||||
|
@ -115,6 +128,17 @@ cdef class Tokenizer:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/tokenizer#call
|
DOCS: https://spacy.io/api/tokenizer#call
|
||||||
"""
|
"""
|
||||||
|
doc = self._tokenize_affixes(string, True)
|
||||||
|
self._apply_special_cases(doc)
|
||||||
|
return doc
|
||||||
|
|
||||||
|
@cython.boundscheck(False)
|
||||||
|
cdef Doc _tokenize_affixes(self, unicode string, bint with_special_cases):
|
||||||
|
"""Tokenize according to affix and token_match settings.
|
||||||
|
|
||||||
|
string (unicode): The string to tokenize.
|
||||||
|
RETURNS (Doc): A container for linguistic annotations.
|
||||||
|
"""
|
||||||
if len(string) >= (2 ** 30):
|
if len(string) >= (2 ** 30):
|
||||||
raise ValueError(Errors.E025.format(length=len(string)))
|
raise ValueError(Errors.E025.format(length=len(string)))
|
||||||
cdef int length = len(string)
|
cdef int length = len(string)
|
||||||
|
@ -123,7 +147,9 @@ cdef class Tokenizer:
|
||||||
return doc
|
return doc
|
||||||
cdef int i = 0
|
cdef int i = 0
|
||||||
cdef int start = 0
|
cdef int start = 0
|
||||||
cdef bint cache_hit
|
cdef int has_special = 0
|
||||||
|
cdef bint specials_hit = 0
|
||||||
|
cdef bint cache_hit = 0
|
||||||
cdef bint in_ws = string[0].isspace()
|
cdef bint in_ws = string[0].isspace()
|
||||||
cdef unicode span
|
cdef unicode span
|
||||||
# The task here is much like string.split, but not quite
|
# The task here is much like string.split, but not quite
|
||||||
|
@ -139,9 +165,14 @@ cdef class Tokenizer:
|
||||||
# we don't have to create the slice when we hit the cache.
|
# we don't have to create the slice when we hit the cache.
|
||||||
span = string[start:i]
|
span = string[start:i]
|
||||||
key = hash_string(span)
|
key = hash_string(span)
|
||||||
cache_hit = self._try_cache(key, doc)
|
specials_hit = 0
|
||||||
if not cache_hit:
|
cache_hit = 0
|
||||||
self._tokenize(doc, span, key)
|
if with_special_cases:
|
||||||
|
specials_hit = self._try_specials(key, doc, &has_special)
|
||||||
|
if not specials_hit:
|
||||||
|
cache_hit = self._try_cache(key, doc)
|
||||||
|
if not specials_hit and not cache_hit:
|
||||||
|
self._tokenize(doc, span, key, &has_special, with_special_cases)
|
||||||
if uc == ' ':
|
if uc == ' ':
|
||||||
doc.c[doc.length - 1].spacy = True
|
doc.c[doc.length - 1].spacy = True
|
||||||
start = i + 1
|
start = i + 1
|
||||||
|
@ -152,9 +183,14 @@ cdef class Tokenizer:
|
||||||
if start < i:
|
if start < i:
|
||||||
span = string[start:]
|
span = string[start:]
|
||||||
key = hash_string(span)
|
key = hash_string(span)
|
||||||
cache_hit = self._try_cache(key, doc)
|
specials_hit = 0
|
||||||
if not cache_hit:
|
cache_hit = 0
|
||||||
self._tokenize(doc, span, key)
|
if with_special_cases:
|
||||||
|
specials_hit = self._try_specials(key, doc, &has_special)
|
||||||
|
if not specials_hit:
|
||||||
|
cache_hit = self._try_cache(key, doc)
|
||||||
|
if not specials_hit and not cache_hit:
|
||||||
|
self._tokenize(doc, span, key, &has_special, with_special_cases)
|
||||||
doc.c[doc.length - 1].spacy = string[-1] == " " and not in_ws
|
doc.c[doc.length - 1].spacy = string[-1] == " " and not in_ws
|
||||||
return doc
|
return doc
|
||||||
|
|
||||||
|
@ -174,23 +210,141 @@ cdef class Tokenizer:
|
||||||
yield self(text)
|
yield self(text)
|
||||||
|
|
||||||
def _flush_cache(self):
|
def _flush_cache(self):
|
||||||
self._reset_cache([key for key in self._cache if not key in self._specials])
|
self._reset_cache([key for key in self._cache])
|
||||||
|
|
||||||
def _reset_cache(self, keys):
|
def _reset_cache(self, keys):
|
||||||
for k in keys:
|
for k in keys:
|
||||||
|
cached = <_Cached*>self._cache.get(k)
|
||||||
del self._cache[k]
|
del self._cache[k]
|
||||||
if not k in self._specials:
|
if cached is not NULL:
|
||||||
cached = <_Cached*>self._cache.get(k)
|
self.mem.free(cached)
|
||||||
if cached is not NULL:
|
|
||||||
self.mem.free(cached)
|
|
||||||
|
|
||||||
def _reset_specials(self):
|
def _flush_specials(self):
|
||||||
for k in self._specials:
|
for k in self._specials:
|
||||||
cached = <_Cached*>self._specials.get(k)
|
cached = <_Cached*>self._specials.get(k)
|
||||||
del self._specials[k]
|
del self._specials[k]
|
||||||
if cached is not NULL:
|
if cached is not NULL:
|
||||||
self.mem.free(cached)
|
self.mem.free(cached)
|
||||||
|
|
||||||
|
cdef int _apply_special_cases(self, Doc doc) except -1:
|
||||||
|
"""Retokenize doc according to special cases.
|
||||||
|
|
||||||
|
doc (Doc): Document.
|
||||||
|
"""
|
||||||
|
cdef int i
|
||||||
|
cdef int max_length = 0
|
||||||
|
cdef bint modify_in_place
|
||||||
|
cdef Pool mem = Pool()
|
||||||
|
cdef vector[SpanC] c_matches
|
||||||
|
cdef vector[SpanC] c_filtered
|
||||||
|
cdef int offset
|
||||||
|
cdef int modified_doc_length
|
||||||
|
# Find matches for special cases
|
||||||
|
self._special_matcher.find_matches(doc, &c_matches)
|
||||||
|
# Skip processing if no matches
|
||||||
|
if c_matches.size() == 0:
|
||||||
|
return True
|
||||||
|
self._filter_special_spans(c_matches, c_filtered, doc.length)
|
||||||
|
# Put span info in span.start-indexed dict and calculate maximum
|
||||||
|
# intermediate document size
|
||||||
|
(span_data, max_length, modify_in_place) = self._prepare_special_spans(doc, c_filtered)
|
||||||
|
# If modifications never increase doc length, can modify in place
|
||||||
|
if modify_in_place:
|
||||||
|
tokens = doc.c
|
||||||
|
# Otherwise create a separate array to store modified tokens
|
||||||
|
else:
|
||||||
|
tokens = <TokenC*>mem.alloc(max_length, sizeof(TokenC))
|
||||||
|
# Modify tokenization according to filtered special cases
|
||||||
|
offset = self._retokenize_special_spans(doc, tokens, span_data)
|
||||||
|
# Allocate more memory for doc if needed
|
||||||
|
modified_doc_length = doc.length + offset
|
||||||
|
while modified_doc_length >= doc.max_length:
|
||||||
|
doc._realloc(doc.max_length * 2)
|
||||||
|
# If not modified in place, copy tokens back to doc
|
||||||
|
if not modify_in_place:
|
||||||
|
memcpy(doc.c, tokens, max_length * sizeof(TokenC))
|
||||||
|
for i in range(doc.length + offset, doc.length):
|
||||||
|
memset(&doc.c[i], 0, sizeof(TokenC))
|
||||||
|
doc.c[i].lex = &EMPTY_LEXEME
|
||||||
|
doc.length = doc.length + offset
|
||||||
|
return True
|
||||||
|
|
||||||
|
cdef void _filter_special_spans(self, vector[SpanC] &original, vector[SpanC] &filtered, int doc_len) nogil:
|
||||||
|
|
||||||
|
cdef int seen_i
|
||||||
|
cdef SpanC span
|
||||||
|
cdef stdset[int] seen_tokens
|
||||||
|
stdsort(original.begin(), original.end(), len_start_cmp)
|
||||||
|
cdef int orig_i = original.size() - 1
|
||||||
|
while orig_i >= 0:
|
||||||
|
span = original[orig_i]
|
||||||
|
if not seen_tokens.count(span.start) and not seen_tokens.count(span.end - 1):
|
||||||
|
filtered.push_back(span)
|
||||||
|
for seen_i in range(span.start, span.end):
|
||||||
|
seen_tokens.insert(seen_i)
|
||||||
|
orig_i -= 1
|
||||||
|
stdsort(filtered.begin(), filtered.end(), start_cmp)
|
||||||
|
|
||||||
|
cdef object _prepare_special_spans(self, Doc doc, vector[SpanC] &filtered):
|
||||||
|
spans = [doc[match.start:match.end] for match in filtered]
|
||||||
|
cdef bint modify_in_place = True
|
||||||
|
cdef int curr_length = doc.length
|
||||||
|
cdef int max_length = 0
|
||||||
|
cdef int span_length_diff = 0
|
||||||
|
span_data = {}
|
||||||
|
for span in spans:
|
||||||
|
rule = self._rules.get(span.text, None)
|
||||||
|
span_length_diff = 0
|
||||||
|
if rule:
|
||||||
|
span_length_diff = len(rule) - (span.end - span.start)
|
||||||
|
if span_length_diff > 0:
|
||||||
|
modify_in_place = False
|
||||||
|
curr_length += span_length_diff
|
||||||
|
if curr_length > max_length:
|
||||||
|
max_length = curr_length
|
||||||
|
span_data[span.start] = (span.text, span.start, span.end, span_length_diff)
|
||||||
|
return (span_data, max_length, modify_in_place)
|
||||||
|
|
||||||
|
cdef int _retokenize_special_spans(self, Doc doc, TokenC* tokens, object span_data):
|
||||||
|
cdef int i = 0
|
||||||
|
cdef int j = 0
|
||||||
|
cdef int offset = 0
|
||||||
|
cdef _Cached* cached
|
||||||
|
cdef int idx_offset = 0
|
||||||
|
cdef int orig_final_spacy
|
||||||
|
cdef int orig_idx
|
||||||
|
cdef int span_start
|
||||||
|
cdef int span_end
|
||||||
|
while i < doc.length:
|
||||||
|
if not i in span_data:
|
||||||
|
tokens[i + offset] = doc.c[i]
|
||||||
|
i += 1
|
||||||
|
else:
|
||||||
|
span = span_data[i]
|
||||||
|
span_start = span[1]
|
||||||
|
span_end = span[2]
|
||||||
|
cached = <_Cached*>self._specials.get(hash_string(span[0]))
|
||||||
|
if cached == NULL:
|
||||||
|
# Copy original tokens if no rule found
|
||||||
|
for j in range(span_end - span_start):
|
||||||
|
tokens[i + offset + j] = doc.c[i + j]
|
||||||
|
i += span_end - span_start
|
||||||
|
else:
|
||||||
|
# Copy special case tokens into doc and adjust token and
|
||||||
|
# character offsets
|
||||||
|
idx_offset = 0
|
||||||
|
orig_final_spacy = doc.c[span_end + offset - 1].spacy
|
||||||
|
orig_idx = doc.c[i].idx
|
||||||
|
for j in range(cached.length):
|
||||||
|
tokens[i + offset + j] = cached.data.tokens[j]
|
||||||
|
tokens[i + offset + j].idx = orig_idx + idx_offset
|
||||||
|
idx_offset += cached.data.tokens[j].lex.length + \
|
||||||
|
(1 if cached.data.tokens[j].spacy else 0)
|
||||||
|
tokens[i + offset + cached.length - 1].spacy = orig_final_spacy
|
||||||
|
i += span_end - span_start
|
||||||
|
offset += span[3]
|
||||||
|
return offset
|
||||||
|
|
||||||
cdef int _try_cache(self, hash_t key, Doc tokens) except -1:
|
cdef int _try_cache(self, hash_t key, Doc tokens) except -1:
|
||||||
cached = <_Cached*>self._cache.get(key)
|
cached = <_Cached*>self._cache.get(key)
|
||||||
if cached == NULL:
|
if cached == NULL:
|
||||||
|
@ -204,22 +358,33 @@ cdef class Tokenizer:
|
||||||
tokens.push_back(&cached.data.tokens[i], False)
|
tokens.push_back(&cached.data.tokens[i], False)
|
||||||
return True
|
return True
|
||||||
|
|
||||||
cdef int _tokenize(self, Doc tokens, unicode span, hash_t orig_key) except -1:
|
cdef int _try_specials(self, hash_t key, Doc tokens, int* has_special) except -1:
|
||||||
|
cached = <_Cached*>self._specials.get(key)
|
||||||
|
if cached == NULL:
|
||||||
|
return False
|
||||||
|
cdef int i
|
||||||
|
for i in range(cached.length):
|
||||||
|
tokens.push_back(&cached.data.tokens[i], False)
|
||||||
|
has_special[0] = 1
|
||||||
|
return True
|
||||||
|
|
||||||
|
cdef int _tokenize(self, Doc tokens, unicode span, hash_t orig_key, int* has_special, bint with_special_cases) except -1:
|
||||||
cdef vector[LexemeC*] prefixes
|
cdef vector[LexemeC*] prefixes
|
||||||
cdef vector[LexemeC*] suffixes
|
cdef vector[LexemeC*] suffixes
|
||||||
cdef int orig_size
|
cdef int orig_size
|
||||||
cdef int has_special = 0
|
|
||||||
orig_size = tokens.length
|
orig_size = tokens.length
|
||||||
span = self._split_affixes(tokens.mem, span, &prefixes, &suffixes,
|
span = self._split_affixes(tokens.mem, span, &prefixes, &suffixes,
|
||||||
&has_special)
|
has_special, with_special_cases)
|
||||||
self._attach_tokens(tokens, span, &prefixes, &suffixes)
|
self._attach_tokens(tokens, span, &prefixes, &suffixes, has_special,
|
||||||
|
with_special_cases)
|
||||||
self._save_cached(&tokens.c[orig_size], orig_key, has_special,
|
self._save_cached(&tokens.c[orig_size], orig_key, has_special,
|
||||||
tokens.length - orig_size)
|
tokens.length - orig_size)
|
||||||
|
|
||||||
cdef unicode _split_affixes(self, Pool mem, unicode string,
|
cdef unicode _split_affixes(self, Pool mem, unicode string,
|
||||||
vector[const LexemeC*] *prefixes,
|
vector[const LexemeC*] *prefixes,
|
||||||
vector[const LexemeC*] *suffixes,
|
vector[const LexemeC*] *suffixes,
|
||||||
int* has_special):
|
int* has_special,
|
||||||
|
bint with_special_cases):
|
||||||
cdef size_t i
|
cdef size_t i
|
||||||
cdef unicode prefix
|
cdef unicode prefix
|
||||||
cdef unicode suffix
|
cdef unicode suffix
|
||||||
|
@ -231,29 +396,24 @@ cdef class Tokenizer:
|
||||||
and not self.find_prefix(string) \
|
and not self.find_prefix(string) \
|
||||||
and not self.find_suffix(string):
|
and not self.find_suffix(string):
|
||||||
break
|
break
|
||||||
if self._specials.get(hash_string(string)) != NULL:
|
if with_special_cases and self._specials.get(hash_string(string)) != NULL:
|
||||||
has_special[0] = 1
|
|
||||||
break
|
break
|
||||||
last_size = len(string)
|
last_size = len(string)
|
||||||
pre_len = self.find_prefix(string)
|
pre_len = self.find_prefix(string)
|
||||||
if pre_len != 0:
|
if pre_len != 0:
|
||||||
prefix = string[:pre_len]
|
prefix = string[:pre_len]
|
||||||
minus_pre = string[pre_len:]
|
minus_pre = string[pre_len:]
|
||||||
# Check whether we've hit a special-case
|
if minus_pre and with_special_cases and self._specials.get(hash_string(minus_pre)) != NULL:
|
||||||
if minus_pre and self._specials.get(hash_string(minus_pre)) != NULL:
|
|
||||||
string = minus_pre
|
string = minus_pre
|
||||||
prefixes.push_back(self.vocab.get(mem, prefix))
|
prefixes.push_back(self.vocab.get(mem, prefix))
|
||||||
has_special[0] = 1
|
|
||||||
break
|
break
|
||||||
suf_len = self.find_suffix(string)
|
suf_len = self.find_suffix(string)
|
||||||
if suf_len != 0:
|
if suf_len != 0:
|
||||||
suffix = string[-suf_len:]
|
suffix = string[-suf_len:]
|
||||||
minus_suf = string[:-suf_len]
|
minus_suf = string[:-suf_len]
|
||||||
# Check whether we've hit a special-case
|
if minus_suf and with_special_cases and self._specials.get(hash_string(minus_suf)) != NULL:
|
||||||
if minus_suf and (self._specials.get(hash_string(minus_suf)) != NULL):
|
|
||||||
string = minus_suf
|
string = minus_suf
|
||||||
suffixes.push_back(self.vocab.get(mem, suffix))
|
suffixes.push_back(self.vocab.get(mem, suffix))
|
||||||
has_special[0] = 1
|
|
||||||
break
|
break
|
||||||
if pre_len and suf_len and (pre_len + suf_len) <= len(string):
|
if pre_len and suf_len and (pre_len + suf_len) <= len(string):
|
||||||
string = string[pre_len:-suf_len]
|
string = string[pre_len:-suf_len]
|
||||||
|
@ -265,15 +425,15 @@ cdef class Tokenizer:
|
||||||
elif suf_len:
|
elif suf_len:
|
||||||
string = minus_suf
|
string = minus_suf
|
||||||
suffixes.push_back(self.vocab.get(mem, suffix))
|
suffixes.push_back(self.vocab.get(mem, suffix))
|
||||||
if string and (self._specials.get(hash_string(string)) != NULL):
|
|
||||||
has_special[0] = 1
|
|
||||||
break
|
|
||||||
return string
|
return string
|
||||||
|
|
||||||
cdef int _attach_tokens(self, Doc tokens, unicode string,
|
cdef int _attach_tokens(self, Doc tokens, unicode string,
|
||||||
vector[const LexemeC*] *prefixes,
|
vector[const LexemeC*] *prefixes,
|
||||||
vector[const LexemeC*] *suffixes) except -1:
|
vector[const LexemeC*] *suffixes,
|
||||||
cdef bint cache_hit
|
int* has_special,
|
||||||
|
bint with_special_cases) except -1:
|
||||||
|
cdef bint specials_hit = 0
|
||||||
|
cdef bint cache_hit = 0
|
||||||
cdef int split, end
|
cdef int split, end
|
||||||
cdef const LexemeC* const* lexemes
|
cdef const LexemeC* const* lexemes
|
||||||
cdef const LexemeC* lexeme
|
cdef const LexemeC* lexeme
|
||||||
|
@ -283,8 +443,12 @@ cdef class Tokenizer:
|
||||||
for i in range(prefixes.size()):
|
for i in range(prefixes.size()):
|
||||||
tokens.push_back(prefixes[0][i], False)
|
tokens.push_back(prefixes[0][i], False)
|
||||||
if string:
|
if string:
|
||||||
cache_hit = self._try_cache(hash_string(string), tokens)
|
if with_special_cases:
|
||||||
if cache_hit:
|
specials_hit = self._try_specials(hash_string(string), tokens,
|
||||||
|
has_special)
|
||||||
|
if not specials_hit:
|
||||||
|
cache_hit = self._try_cache(hash_string(string), tokens)
|
||||||
|
if specials_hit or cache_hit:
|
||||||
pass
|
pass
|
||||||
elif self.token_match and self.token_match(string):
|
elif self.token_match and self.token_match(string):
|
||||||
# We're always saying 'no' to spaces here -- the caller will
|
# We're always saying 'no' to spaces here -- the caller will
|
||||||
|
@ -329,7 +493,7 @@ cdef class Tokenizer:
|
||||||
tokens.push_back(lexeme, False)
|
tokens.push_back(lexeme, False)
|
||||||
|
|
||||||
cdef int _save_cached(self, const TokenC* tokens, hash_t key,
|
cdef int _save_cached(self, const TokenC* tokens, hash_t key,
|
||||||
int has_special, int n) except -1:
|
int* has_special, int n) except -1:
|
||||||
cdef int i
|
cdef int i
|
||||||
if n <= 0:
|
if n <= 0:
|
||||||
# avoid mem alloc of zero length
|
# avoid mem alloc of zero length
|
||||||
|
@ -338,7 +502,7 @@ cdef class Tokenizer:
|
||||||
if self.vocab._by_orth.get(tokens[i].lex.orth) == NULL:
|
if self.vocab._by_orth.get(tokens[i].lex.orth) == NULL:
|
||||||
return 0
|
return 0
|
||||||
# See #1250
|
# See #1250
|
||||||
if has_special:
|
if has_special[0]:
|
||||||
return 0
|
return 0
|
||||||
cached = <_Cached*>self.mem.alloc(1, sizeof(_Cached))
|
cached = <_Cached*>self.mem.alloc(1, sizeof(_Cached))
|
||||||
cached.length = n
|
cached.length = n
|
||||||
|
@ -391,10 +555,24 @@ cdef class Tokenizer:
|
||||||
match = self.suffix_search(string)
|
match = self.suffix_search(string)
|
||||||
return (match.end() - match.start()) if match is not None else 0
|
return (match.end() - match.start()) if match is not None else 0
|
||||||
|
|
||||||
def _load_special_tokenization(self, special_cases):
|
def _load_special_cases(self, special_cases):
|
||||||
"""Add special-case tokenization rules."""
|
"""Add special-case tokenization rules."""
|
||||||
for chunk, substrings in sorted(special_cases.items()):
|
if special_cases is not None:
|
||||||
self.add_special_case(chunk, substrings)
|
for chunk, substrings in sorted(special_cases.items()):
|
||||||
|
self._validate_special_case(chunk, substrings)
|
||||||
|
self.add_special_case(chunk, substrings)
|
||||||
|
|
||||||
|
def _validate_special_case(self, chunk, substrings):
|
||||||
|
"""Check whether the `ORTH` fields match the string.
|
||||||
|
|
||||||
|
chunk (unicode): The string to specially tokenize.
|
||||||
|
substrings (iterable): A sequence of dicts, where each dict describes
|
||||||
|
a token and its attributes.
|
||||||
|
"""
|
||||||
|
attrs = [intify_attrs(spec, _do_deprecated=True) for spec in substrings]
|
||||||
|
orth = "".join([spec[ORTH] for spec in attrs])
|
||||||
|
if chunk != orth:
|
||||||
|
raise ValueError(Errors.E187.format(chunk=chunk, orth=orth, token_attrs=substrings))
|
||||||
|
|
||||||
def add_special_case(self, unicode string, substrings):
|
def add_special_case(self, unicode string, substrings):
|
||||||
"""Add a special-case tokenization rule.
|
"""Add a special-case tokenization rule.
|
||||||
|
@ -406,6 +584,7 @@ cdef class Tokenizer:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/tokenizer#add_special_case
|
DOCS: https://spacy.io/api/tokenizer#add_special_case
|
||||||
"""
|
"""
|
||||||
|
self._validate_special_case(string, substrings)
|
||||||
substrings = list(substrings)
|
substrings = list(substrings)
|
||||||
cached = <_Cached*>self.mem.alloc(1, sizeof(_Cached))
|
cached = <_Cached*>self.mem.alloc(1, sizeof(_Cached))
|
||||||
cached.length = len(substrings)
|
cached.length = len(substrings)
|
||||||
|
@ -413,15 +592,25 @@ cdef class Tokenizer:
|
||||||
cached.data.tokens = self.vocab.make_fused_token(substrings)
|
cached.data.tokens = self.vocab.make_fused_token(substrings)
|
||||||
key = hash_string(string)
|
key = hash_string(string)
|
||||||
stale_special = <_Cached*>self._specials.get(key)
|
stale_special = <_Cached*>self._specials.get(key)
|
||||||
stale_cached = <_Cached*>self._cache.get(key)
|
|
||||||
self._flush_cache()
|
|
||||||
self._specials.set(key, cached)
|
self._specials.set(key, cached)
|
||||||
self._cache.set(key, cached)
|
|
||||||
if stale_special is not NULL:
|
if stale_special is not NULL:
|
||||||
self.mem.free(stale_special)
|
self.mem.free(stale_special)
|
||||||
if stale_special != stale_cached and stale_cached is not NULL:
|
|
||||||
self.mem.free(stale_cached)
|
|
||||||
self._rules[string] = substrings
|
self._rules[string] = substrings
|
||||||
|
self._flush_cache()
|
||||||
|
if self.find_prefix(string) or self.find_infix(string) or self.find_suffix(string):
|
||||||
|
self._special_matcher.add(string, None, self._tokenize_affixes(string, False))
|
||||||
|
|
||||||
|
def _reload_special_cases(self):
|
||||||
|
try:
|
||||||
|
self._property_init_count
|
||||||
|
except AttributeError:
|
||||||
|
return
|
||||||
|
# only reload if all 4 of prefix, suffix, infix, token_match have
|
||||||
|
# been initialized
|
||||||
|
if self.vocab is not None and self._property_init_count >= self._property_init_max:
|
||||||
|
self._flush_cache()
|
||||||
|
self._flush_specials()
|
||||||
|
self._load_special_cases(self._rules)
|
||||||
|
|
||||||
def to_disk(self, path, **kwargs):
|
def to_disk(self, path, **kwargs):
|
||||||
"""Save the current state to a directory.
|
"""Save the current state to a directory.
|
||||||
|
@ -503,12 +692,9 @@ cdef class Tokenizer:
|
||||||
if data.get("rules"):
|
if data.get("rules"):
|
||||||
# make sure to hard reset the cache to remove data from the default exceptions
|
# make sure to hard reset the cache to remove data from the default exceptions
|
||||||
self._rules = {}
|
self._rules = {}
|
||||||
self._reset_cache([key for key in self._cache])
|
self._flush_cache()
|
||||||
self._reset_specials()
|
self._flush_specials()
|
||||||
self._cache = PreshMap()
|
self._load_special_cases(data.get("rules", {}))
|
||||||
self._specials = PreshMap()
|
|
||||||
for string, substrings in data.get("rules", {}).items():
|
|
||||||
self.add_special_case(string, substrings)
|
|
||||||
|
|
||||||
return self
|
return self
|
||||||
|
|
||||||
|
@ -516,3 +702,19 @@ cdef class Tokenizer:
|
||||||
def _get_regex_pattern(regex):
|
def _get_regex_pattern(regex):
|
||||||
"""Get a pattern string for a regex, or None if the pattern is None."""
|
"""Get a pattern string for a regex, or None if the pattern is None."""
|
||||||
return None if regex is None else regex.__self__.pattern
|
return None if regex is None else regex.__self__.pattern
|
||||||
|
|
||||||
|
|
||||||
|
cdef extern from "<algorithm>" namespace "std" nogil:
|
||||||
|
void stdsort "sort"(vector[SpanC].iterator,
|
||||||
|
vector[SpanC].iterator,
|
||||||
|
bint (*)(SpanC, SpanC))
|
||||||
|
|
||||||
|
|
||||||
|
cdef bint len_start_cmp(SpanC a, SpanC b) nogil:
|
||||||
|
if a.end - a.start == b.end - b.start:
|
||||||
|
return b.start < a.start
|
||||||
|
return a.end - a.start < b.end - b.start
|
||||||
|
|
||||||
|
|
||||||
|
cdef bint start_cmp(SpanC a, SpanC b) nogil:
|
||||||
|
return a.start < b.start
|
||||||
|
|
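The span filtering done by `_filter_special_spans` with `len_start_cmp` and `start_cmp` can be read as "prefer the longest match, break ties by earliest start, drop anything overlapping an already kept match". A minimal pure-Python sketch of that reading follows; the function name `filter_special_spans` and the use of `(start, end)` tuples in place of `SpanC` are assumptions made for illustration and are not part of the commit.

```python
def filter_special_spans(spans):
    """Keep non-overlapping (start, end) token spans, longest first.

    Illustrative stand-in for the Cython method: plain tuples replace SpanC.
    """
    # Ascending by length; among equal lengths, larger start first,
    # mirroring len_start_cmp. Walking the list backwards therefore visits
    # the longest spans first and, on ties, the earliest start first.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]))
    seen_tokens = set()
    kept = []
    for start, end in reversed(ordered):
        # Spans are visited in non-increasing length, so checking the first
        # and last token is enough to detect overlap with a kept span.
        if start not in seen_tokens and (end - 1) not in seen_tokens:
            kept.append((start, end))
            seen_tokens.update(range(start, end))
    # Return the kept spans in document order, as start_cmp does.
    return sorted(kept)


if __name__ == "__main__":
    print(filter_special_spans([(0, 2), (1, 3), (1, 4), (5, 6)]))
```

Under these assumptions, the three-token span `(1, 4)` wins over both two-token spans that overlap it, and the unrelated span `(5, 6)` is kept.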
113
spacy/util.py
|
@ -13,6 +13,7 @@ import functools
|
||||||
import itertools
|
import itertools
|
||||||
import numpy.random
|
import numpy.random
|
||||||
import srsly
|
import srsly
|
||||||
|
import catalogue
|
||||||
import sys
|
import sys
|
||||||
|
|
||||||
try:
|
try:
|
||||||
|
@ -27,29 +28,20 @@ except ImportError:
|
||||||
|
|
||||||
from .symbols import ORTH
|
from .symbols import ORTH
|
||||||
from .compat import cupy, CudaStream, path2str, basestring_, unicode_
|
from .compat import cupy, CudaStream, path2str, basestring_, unicode_
|
||||||
from .compat import import_file, importlib_metadata
|
from .compat import import_file
|
||||||
from .errors import Errors, Warnings, deprecation_warning
|
from .errors import Errors, Warnings, deprecation_warning
|
||||||
|
|
||||||
|
|
||||||
LANGUAGES = {}
|
|
||||||
ARCHITECTURES = {}
|
|
||||||
_data_path = Path(__file__).parent / "data"
|
_data_path = Path(__file__).parent / "data"
|
||||||
_PRINT_ENV = False
|
_PRINT_ENV = False
|
||||||
|
|
||||||
|
|
||||||
# NB: Only ever call this once! If called more than once within the
|
class registry(object):
|
||||||
# function, test_issue1506 hangs and it's not 100% clear why.
|
languages = catalogue.create("spacy", "languages", entry_points=True)
|
||||||
AVAILABLE_ENTRY_POINTS = importlib_metadata.entry_points()
|
architectures = catalogue.create("spacy", "architectures", entry_points=True)
|
||||||
|
lookups = catalogue.create("spacy", "lookups", entry_points=True)
|
||||||
|
factories = catalogue.create("spacy", "factories", entry_points=True)
|
||||||
class ENTRY_POINTS(object):
|
displacy_colors = catalogue.create("spacy", "displacy_colors", entry_points=True)
|
||||||
"""Available entry points to register extensions."""
|
|
||||||
|
|
||||||
factories = "spacy_factories"
|
|
||||||
languages = "spacy_languages"
|
|
||||||
displacy_colors = "spacy_displacy_colors"
|
|
||||||
lookups = "spacy_lookups"
|
|
||||||
architectures = "spacy_architectures"
|
|
||||||
|
|
||||||
|
|
||||||
def set_env_log(value):
|
def set_env_log(value):
|
||||||
|
@ -65,8 +57,7 @@ def lang_class_is_loaded(lang):
|
||||||
lang (unicode): Two-letter language code, e.g. 'en'.
|
lang (unicode): Two-letter language code, e.g. 'en'.
|
||||||
RETURNS (bool): Whether a Language class has been loaded.
|
RETURNS (bool): Whether a Language class has been loaded.
|
||||||
"""
|
"""
|
||||||
global LANGUAGES
|
return lang in registry.languages
|
||||||
return lang in LANGUAGES
|
|
||||||
|
|
||||||
|
|
||||||
def get_lang_class(lang):
|
def get_lang_class(lang):
|
||||||
|
@ -75,19 +66,16 @@ def get_lang_class(lang):
|
||||||
lang (unicode): Two-letter language code, e.g. 'en'.
|
lang (unicode): Two-letter language code, e.g. 'en'.
|
||||||
RETURNS (Language): Language class.
|
RETURNS (Language): Language class.
|
||||||
"""
|
"""
|
||||||
global LANGUAGES
|
# Check if language is registered / entry point is available
|
||||||
# Check if an entry point is exposed for the language code
|
if lang in registry.languages:
|
||||||
entry_point = get_entry_point(ENTRY_POINTS.languages, lang)
|
return registry.languages.get(lang)
|
||||||
if entry_point is not None:
|
else:
|
||||||
LANGUAGES[lang] = entry_point
|
|
||||||
return entry_point
|
|
||||||
if lang not in LANGUAGES:
|
|
||||||
try:
|
try:
|
||||||
module = importlib.import_module(".lang.%s" % lang, "spacy")
|
module = importlib.import_module(".lang.%s" % lang, "spacy")
|
||||||
except ImportError as err:
|
except ImportError as err:
|
||||||
raise ImportError(Errors.E048.format(lang=lang, err=err))
|
raise ImportError(Errors.E048.format(lang=lang, err=err))
|
||||||
LANGUAGES[lang] = getattr(module, module.__all__[0])
|
set_lang_class(lang, getattr(module, module.__all__[0]))
|
||||||
return LANGUAGES[lang]
|
return registry.languages.get(lang)
|
||||||
|
|
||||||
|
|
||||||
def set_lang_class(name, cls):
|
def set_lang_class(name, cls):
|
||||||
|
@ -96,8 +84,7 @@ def set_lang_class(name, cls):
|
||||||
name (unicode): Name of Language class.
|
name (unicode): Name of Language class.
|
||||||
cls (Language): Language class.
|
cls (Language): Language class.
|
||||||
"""
|
"""
|
||||||
global LANGUAGES
|
registry.languages.register(name, func=cls)
|
||||||
LANGUAGES[name] = cls
|
|
||||||
|
|
||||||
|
|
||||||
def get_data_path(require_exists=True):
|
def get_data_path(require_exists=True):
|
||||||
|
@ -121,49 +108,11 @@ def set_data_path(path):
|
||||||
_data_path = ensure_path(path)
|
_data_path = ensure_path(path)
|
||||||
|
|
||||||
|
|
||||||
def register_architecture(name, arch=None):
|
|
||||||
"""Decorator to register an architecture. An architecture is a function
|
|
||||||
that returns a Thinc Model object.
|
|
||||||
|
|
||||||
name (unicode): The name of the architecture to register.
|
|
||||||
arch (Model): Optional architecture if function is called directly and
|
|
||||||
not used as a decorator.
|
|
||||||
RETURNS (callable): Function to register architecture.
|
|
||||||
"""
|
|
||||||
global ARCHITECTURES
|
|
||||||
if arch is not None:
|
|
||||||
ARCHITECTURES[name] = arch
|
|
||||||
return arch
|
|
||||||
|
|
||||||
def do_registration(arch):
|
|
||||||
ARCHITECTURES[name] = arch
|
|
||||||
return arch
|
|
||||||
|
|
||||||
return do_registration
|
|
||||||
|
|
||||||
|
|
||||||
def make_layer(arch_config):
|
def make_layer(arch_config):
|
||||||
arch_func = get_architecture(arch_config["arch"])
|
arch_func = registry.architectures.get(arch_config["arch"])
|
||||||
return arch_func(arch_config["config"])
|
return arch_func(arch_config["config"])
|
||||||
|
|
||||||
|
|
||||||
def get_architecture(name):
|
|
||||||
"""Get a model architecture function by name. Raises a KeyError if the
|
|
||||||
architecture is not found.
|
|
||||||
|
|
||||||
name (unicode): The name of the architecture.
|
|
||||||
RETURNS (Model): The architecture.
|
|
||||||
"""
|
|
||||||
# Check if an entry point is exposed for the architecture code
|
|
||||||
entry_point = get_entry_point(ENTRY_POINTS.architectures, name)
|
|
||||||
if entry_point is not None:
|
|
||||||
ARCHITECTURES[name] = entry_point
|
|
||||||
if name not in ARCHITECTURES:
|
|
||||||
names = ", ".join(sorted(ARCHITECTURES.keys()))
|
|
||||||
raise KeyError(Errors.E174.format(name=name, names=names))
|
|
||||||
return ARCHITECTURES[name]
|
|
||||||
|
|
||||||
|
|
||||||
def ensure_path(path):
|
def ensure_path(path):
|
||||||
"""Ensure string is converted to a Path.
|
"""Ensure string is converted to a Path.
|
||||||
|
|
||||||
|
@ -327,34 +276,6 @@ def get_package_path(name):
|
||||||
return Path(pkg.__file__).parent
|
return Path(pkg.__file__).parent
|
||||||
|
|
||||||
|
|
||||||
def get_entry_points(key):
|
|
||||||
"""Get registered entry points from other packages for a given key, e.g.
|
|
||||||
'spacy_factories' and return them as a dictionary, keyed by name.
|
|
||||||
|
|
||||||
key (unicode): Entry point name.
|
|
||||||
RETURNS (dict): Entry points, keyed by name.
|
|
||||||
"""
|
|
||||||
result = {}
|
|
||||||
for entry_point in AVAILABLE_ENTRY_POINTS.get(key, []):
|
|
||||||
result[entry_point.name] = entry_point.load()
|
|
||||||
return result
|
|
||||||
|
|
||||||
|
|
||||||
def get_entry_point(key, value, default=None):
|
|
||||||
"""Check if registered entry point is available for a given name and
|
|
||||||
load it. Otherwise, return None.
|
|
||||||
|
|
||||||
key (unicode): Entry point name.
|
|
||||||
value (unicode): Name of entry point to load.
|
|
||||||
default: Optional default value to return.
|
|
||||||
RETURNS: The loaded entry point or None.
|
|
||||||
"""
|
|
||||||
for entry_point in AVAILABLE_ENTRY_POINTS.get(key, []):
|
|
||||||
if entry_point.name == value:
|
|
||||||
return entry_point.load()
|
|
||||||
return default
|
|
||||||
|
|
||||||
|
|
||||||
def is_in_jupyter():
|
def is_in_jupyter():
|
||||||
"""Check if user is running spaCy from a Jupyter notebook by detecting the
|
"""Check if user is running spaCy from a Jupyter notebook by detecting the
|
||||||
IPython kernel. Mainly used for the displaCy visualizer.
|
IPython kernel. Mainly used for the displaCy visualizer.
|
||||||
|
|
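Note on the hunks above: the module-level `LANGUAGES` and `ARCHITECTURES` dicts and the `get_entry_point` helpers are replaced by lookups on `registry.languages` and `registry.architectures`. The following is only a minimal sketch of that pattern, not catalogue's actual implementation. The `Registry` class and the `loader` argument are hypothetical stand-ins introduced for illustration; only the call shapes `register(name, func=cls)`, `get(name)` and `name in registry` are taken from the diff.

```python
class Registry(object):
    """Hypothetical stand-in for a catalogue-style registry (not the real API)."""

    def __init__(self, name):
        self.name = name
        self._items = {}

    def __contains__(self, key):
        return key in self._items

    def register(self, key, func=None):
        # Usable as a direct call (func given) or as a decorator (func omitted).
        def do_registration(obj):
            self._items[key] = obj
            return obj

        if func is not None:
            return do_registration(func)
        return do_registration

    def get(self, key):
        if key not in self._items:
            names = ", ".join(sorted(self._items))
            raise KeyError(
                "Can't find '%s' in registry '%s'. Available: %s"
                % (key, self.name, names)
            )
        return self._items[key]


languages = Registry("languages")


def get_lang_class(lang, loader):
    """Return the registered class, loading and registering it on first use.

    `loader` is an assumed callable standing in for the importlib-based
    lookup in the diff; it takes a language code and returns a class.
    """
    if lang in languages:
        return languages.get(lang)
    languages.register(lang, func=loader(lang))
    return languages.get(lang)
```

In this sketch the first call for a given code goes through `loader` and registers the result, and later calls are served from the registry, which mirrors the control flow of the new `get_lang_class` without any `global` state in the calling module.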
|
@ -109,8 +109,8 @@ raise an error if the pre-defined attrs of the two `DocBin`s don't match.
|
||||||
> doc_bin1.add(nlp("Hello world"))
|
> doc_bin1.add(nlp("Hello world"))
|
||||||
> doc_bin2 = DocBin(attrs=["LEMMA", "POS"])
|
> doc_bin2 = DocBin(attrs=["LEMMA", "POS"])
|
||||||
> doc_bin2.add(nlp("This is a sentence"))
|
> doc_bin2.add(nlp("This is a sentence"))
|
||||||
> merged_bins = doc_bin1.merge(doc_bin2)
|
> doc_bin1.merge(doc_bin2)
|
||||||
> assert len(merged_bins) == 2
|
> assert len(doc_bin1) == 2
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Argument | Type | Description |
|
| Argument | Type | Description |
|
||||||
|
|
|
@ -1,5 +1,5 @@
|
||||||
A named entity is a "real-world object" that's assigned a name – for example, a
|
A named entity is a "real-world object" that's assigned a name – for example, a
|
||||||
person, a country, a product or a book title. spaCy can **recognize**
|
person, a country, a product or a book title. spaCy can **recognize**
|
||||||
[various types](/api/annotation#named-entities) of named entities in a document,
|
[various types](/api/annotation#named-entities) of named entities in a document,
|
||||||
by asking the model for a **prediction**. Because models are statistical and
|
by asking the model for a **prediction**. Because models are statistical and
|
||||||
strongly depend on the examples they were trained on, this doesn't always work
|
strongly depend on the examples they were trained on, this doesn't always work
|
||||||
|
|
|
@ -20,6 +20,17 @@ available over [pip](https://pypi.python.org/pypi/spacy) and
|
||||||
> possible, the new docs also include notes on features that have changed in
|
> possible, the new docs also include notes on features that have changed in
|
||||||
> v2.0, and features that were introduced in the new version.
|
> v2.0, and features that were introduced in the new version.
|
||||||
|
|
||||||
|
<Infobox variant="warning" title="Important note for Python 3.8">
|
||||||
|
|
||||||
|
We can't yet ship pre-compiled binary wheels for spaCy that work on Python 3.8,
|
||||||
|
as we're still waiting for our CI providers and other tooling to support it.
|
||||||
|
This means that in order to run spaCy on Python 3.8, you'll need
|
||||||
|
[a compiler installed](#source) and compile the library and its Cython
|
||||||
|
dependencies locally. If this is causing problems for you, the easiest solution
|
||||||
|
is to **use Python 3.7** in the meantime.
|
||||||
|
|
||||||
|
</Infobox>
|
||||||
|
|
||||||
## Quickstart {hidden="true"}
|
## Quickstart {hidden="true"}
|
||||||
|
|
||||||
import QuickstartInstall from 'widgets/quickstart-install.js'
|
import QuickstartInstall from 'widgets/quickstart-install.js'
|
||||||
|
|
|
@ -1861,6 +1861,30 @@
|
||||||
"author_links": {
|
"author_links": {
|
||||||
"github": "microsoft"
|
"github": "microsoft"
|
||||||
}
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": "dframcy",
|
||||||
|
"title": "Dframcy",
|
||||||
|
"slogan": "Dataframe Integration with spaCy NLP",
|
||||||
|
"github": "yash1994/dframcy",
|
||||||
|
"description": "DframCy is a light-weight utility module to integrate Pandas Dataframe to spaCy's linguistic annotation and training tasks.",
|
||||||
|
"pip": "dframcy",
|
||||||
|
"category": ["pipeline", "training"],
|
||||||
|
"tags": ["pandas"],
|
||||||
|
"code_example": [
|
||||||
|
"import spacy",
|
||||||
|
"from dframcy import DframCy",
|
||||||
|
"",
|
||||||
|
"nlp = spacy.load('en_core_web_sm')",
|
||||||
|
"dframcy = DframCy(nlp)",
|
||||||
|
"doc = dframcy.nlp(u'Apple is looking at buying U.K. startup for $1 billion')",
|
||||||
|
"annotation_dataframe = dframcy.to_dataframe(doc)"
|
||||||
|
],
|
||||||
|
"author": "Yash Patadia",
|
||||||
|
"author_links": {
|
||||||
|
"twitter": "PatadiaYash",
|
||||||
|
"github": "yash1994"
|
||||||
|
}
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
|
|
||||||
|
|