Generalize handling of tokenizer special cases (#4259)

* Generalize handling of tokenizer special cases

Handle tokenizer special cases more generally by using the Matcher
internally to match special cases after the affix/token_match
tokenization is complete.

Instead of only matching special cases while processing balanced or
nearly balanced prefixes and suffixes, this recognizes special cases in
a wider range of contexts:

* Allows arbitrary numbers of prefixes/affixes around special cases
* Allows special cases separated by infixes

Existing tests/settings that couldn't be preserved as before:

* The emoticon '")' is no longer a supported special case
* The emoticon ':)' in "example:)" is a false positive again

When merged with #4258 (or the relevant cache bugfix), the affix and
token_match properties should be modified to flush and reload all
special cases to use the updated internal tokenization with the Matcher.

* Remove accidentally added test case

* Really remove accidentally added test

* Reload special cases when necessary

Reload special cases when affixes or token_match are modified. Skip
reloading during initialization.

* Update error code number

* Fix offset and whitespace in Matcher special cases

* Fix offset bugs when merging and splitting tokens
* Set final whitespace on final token in inserted special case

* Improve cache flushing in tokenizer

* Separate cache and specials memory (temporarily)
* Flush cache when adding special cases
* Repeated `self._cache = PreshMap()` and `self._specials = PreshMap()`
are necessary due to this bug:
https://github.com/explosion/preshed/issues/21

* Remove reinitialized PreshMaps on cache flush

* Update UD bin scripts

* Update imports for `bin/`
* Add all currently supported languages
* Update subtok merger for new Matcher validation
* Modify blinded check to look at tokens instead of lemmas (for corpora
with tokens but not lemmas like Telugu)

* Use special Matcher only for cases with affixes

* Reinsert specials cache checks during normal tokenization for special
cases as much as possible
  * Additionally include specials cache checks while splitting on infixes
  * Since the special Matcher needs consistent affix-only tokenization
    for the special cases themselves, introduce the argument
    `with_special_cases` in order to do tokenization with or without
    specials cache checks
* After normal tokenization, postprocess with special cases Matcher for
special cases containing affixes

* Replace PhraseMatcher with Aho-Corasick

Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays
of the hash values for the relevant attribute. The implementation is
based on FlashText.

The speed should be similar to the previous PhraseMatcher. It is now
possible to easily remove match IDs and matches don't go missing with
large keyword lists / vocabularies.

Fixes #4308.

* Restore support for pickling

* Fix internal keyword add/remove for numpy arrays

* Add test for #4248, clean up test

* Improve efficiency of special cases handling

* Use PhraseMatcher instead of Matcher
* Improve efficiency of merging/splitting special cases in document
  * Process merge/splits in one pass without repeated token shifting
  * Merge in place if no splits

* Update error message number

* Remove UD script modifications

Only used for timing/testing, should be a separate PR

* Remove final traces of UD script modifications

* Update UD bin scripts

* Update imports for `bin/`
* Add all currently supported languages
* Update subtok merger for new Matcher validation
* Modify blinded check to look at tokens instead of lemmas (for corpora
with tokens but not lemmas like Telugu)

* Add missing loop for match ID set in search loop

* Remove cruft in matching loop for partial matches

There was a bit of unnecessary code left over from FlashText in the
matching loop to handle partial token matches, which we don't have with
PhraseMatcher.

* Replace dict trie with MapStruct trie

* Fix how match ID hash is stored/added

* Update fix for match ID vocab

* Switch from map_get_unless_missing to map_get

* Switch from numpy array to Token.get_struct_attr

Access token attributes directly in Doc instead of making a copy of the
relevant values in a numpy array.

Add unsatisfactory warning for hash collision with reserved terminal
hash key. (Ideally it would change the reserved terminal hash and redo
the whole trie, but for now, I'm hoping there won't be collisions.)

* Restructure imports to export find_matches

* Implement full remove()

Remove unnecessary trie paths and free unused maps.

Parallel to Matcher, raise KeyError when attempting to remove a match ID
that has not been added.

* Switch to PhraseMatcher.find_matches

* Switch to local cdef functions for span filtering

* Switch special case reload threshold to variable

Refer to variable instead of hard-coded threshold

* Move more of special case retokenize to cdef nogil

Move as much of the special case retokenization to nogil as possible.

* Rewrap sort as stdsort for OS X

* Rewrap stdsort with specific types

* Switch to qsort

* Fix merge

* Improve cmp functions

* Fix realloc

* Fix realloc again

* Initialize span struct while retokenizing

* Temporarily skip retokenizing

* Revert "Move more of special case retokenize to cdef nogil"

This reverts commit 0b7e52c797.

* Revert "Switch to qsort"

This reverts commit a98d71a942.

* Fix specials check while caching

* Modify URL test with emoticons

The multiple suffix tests result in the emoticon `:>`, which is now
retokenized into one token as a special case after the suffixes are
split off.

* Refactor _apply_special_cases()

* Use cdef ints for span info used in multiple spots

* Modify _filter_special_spans() to prefer earlier

Parallel to #4414, modify _filter_special_spans() so that the earlier
span is preferred for overlapping spans of the same length.

* Replace MatchStruct with Entity

Replace MatchStruct with Entity since the existing Entity struct is
nearly identical.

* Replace Entity with more general SpanC

* Replace MatchStruct with SpanC

* Add error in debug-data if no dev docs are available (see #4575)

* Update azure-pipelines.yml

* Revert "Update azure-pipelines.yml"

This reverts commit ed1060cf59.

* Use latest wasabi

* Reorganise install_requires

* add dframcy to universe.json (#4580)

* Update universe.json [ci skip]

* Fix multiprocessing for as_tuples=True (#4582)

* Fix conllu script (#4579)

* force extensions to avoid clash between example scripts

* fix arg order and default file encoding

* add example config for conllu script

* newline

* move extension definitions to main function

* few more encodings fixes

* Add load_from_docbin example [ci skip]

TODO: upload the file somewhere

* Update README.md

* Add warnings about 3.8 (resolves #4593) [ci skip]

* Fixed typo: Added space between "recognize" and "various" (#4600)

* Fix DocBin.merge() example (#4599)

* Replace function registries with catalogue (#4584)

* Replace functions registries with catalogue

* Update __init__.py

* Fix test

* Revert unrelated flag [ci skip]

* Bugfix/dep matcher issue 4590 (#4601)

* add contributor agreement for prilopes

* add test for issue #4590

* fix on_match params for DependencyMacther (#4590)

* Minor updates to language example sentences (#4608)

* Add punctuation to Spanish example sentences

* Combine multilanguage examples for lang xx

* Add punctuation to nb examples

* Always realloc to a larger size

Avoid potential (unlikely) edge case and cymem error seen in #4604.

* Add error in debug-data if no dev docs are available (see #4575)

* Update debug-data for GoldCorpus / Example

* Ignore None label in misaligned NER data
This commit is contained in:
adrianeboyd 2019-11-13 21:24:35 +01:00 committed by Matthew Honnibal
parent d67b0f196a
commit faaa832518
44 changed files with 754 additions and 273 deletions

106
.github/contributors/prilopes.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Priscilla Lopes |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-11-06 |
| GitHub username | prilopes |
| Website (optional) | |

View File

@ -104,6 +104,13 @@ For detailed installation instructions, see the
[pip]: https://pypi.org/project/spacy/
[conda]: https://anaconda.org/conda-forge/spacy
> ⚠️ **Important note for Python 3.8:** We can't yet ship pre-compiled binary
> wheels for spaCy that work on Python 3.8, as we're still waiting for our CI
> providers and other tooling to support it. This means that in order to run
> spaCy on Python 3.8, you'll need [a compiler installed](#source) and compile
> the library and its Cython dependencies locally. If this is causing problems
> for you, the easiest solution is to **use Python 3.7** in the meantime.
### pip
Using pip, spaCy releases are available as source packages and binary wheels (as
@ -180,9 +187,6 @@ pointing pip to a path or URL.
# download best-matching version of specific model for your spaCy installation
python -m spacy download en_core_web_sm
# out-of-the-box: download best-matching default model
python -m spacy download en
# pip install .tar.gz archive from path or URL
pip install /Users/you/en_core_web_sm-2.2.0.tar.gz
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz

View File

@ -323,11 +323,6 @@ def get_token_conllu(token, i):
return "\n".join(lines)
Token.set_extension("get_conllu_lines", method=get_token_conllu, force=True)
Token.set_extension("begins_fused", default=False, force=True)
Token.set_extension("inside_fused", default=False, force=True)
##################
# Initialization #
##################
@ -460,13 +455,13 @@ class TreebankPaths(object):
@plac.annotations(
ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
parses_dir=("Directory to write the development parses", "positional", None, Path),
corpus=(
"UD corpus to train and evaluate on, e.g. en, es_ancora, etc",
"UD corpus to train and evaluate on, e.g. UD_Spanish-AnCora",
"positional",
None,
str,
),
parses_dir=("Directory to write the development parses", "positional", None, Path),
config=("Path to json formatted config file", "option", "C", Path),
limit=("Size limit", "option", "n", int),
gpu_device=("Use GPU", "option", "g", int),
@ -491,6 +486,10 @@ def main(
# temp fix to avoid import issues cf https://github.com/explosion/spaCy/issues/4200
import tqdm
Token.set_extension("get_conllu_lines", method=get_token_conllu)
Token.set_extension("begins_fused", default=False)
Token.set_extension("inside_fused", default=False)
spacy.util.fix_random_seed()
lang.zh.Chinese.Defaults.use_jieba = False
lang.ja.Japanese.Defaults.use_janome = False

View File

@ -0,0 +1,45 @@
# coding: utf-8
"""
Example of loading previously parsed text using spaCy's DocBin class. The example
performs an entity count to show that the annotations are available.
For more details, see https://spacy.io/usage/saving-loading#docs
Installation:
python -m spacy download en_core_web_lg
Usage:
python examples/load_from_docbin.py en_core_web_lg RC_2015-03-9.spacy
"""
from __future__ import unicode_literals
import spacy
from spacy.tokens import DocBin
from timeit import default_timer as timer
from collections import Counter
EXAMPLE_PARSES_PATH = "RC_2015-03-9.spacy"
def main(model="en_core_web_lg", docbin_path=EXAMPLE_PARSES_PATH):
nlp = spacy.load(model)
print("Reading data from {}".format(docbin_path))
with open(docbin_path, "rb") as file_:
bytes_data = file_.read()
nr_word = 0
start_time = timer()
entities = Counter()
docbin = DocBin().from_bytes(bytes_data)
for doc in docbin.get_docs(nlp.vocab):
nr_word += len(doc)
entities.update((e.label_, e.text) for e in doc.ents)
end_time = timer()
msg = "Loaded {nr_word} words in {seconds} seconds ({wps} words per second)"
wps = nr_word / (end_time - start_time)
print(msg.format(nr_word=nr_word, seconds=end_time - start_time, wps=wps))
print("Most common entities:")
for (label, entity), freq in entities.most_common(30):
print(freq, entity, label)
if __name__ == "__main__":
import plac
plac.call(main)

View File

@ -0,0 +1 @@
{"nr_epoch": 3, "batch_size": 24, "dropout": 0.001, "vectors": 0, "multitask_tag": 0, "multitask_sent": 0}

View File

@ -383,20 +383,24 @@ class TreebankPaths(object):
@plac.annotations(
ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
parses_dir=("Directory to write the development parses", "positional", None, Path),
config=("Path to json formatted config file", "positional", None, Config.load),
corpus=(
"UD corpus to train and evaluate on, e.g. en, es_ancora, etc",
"UD corpus to train and evaluate on, e.g. UD_Spanish-AnCora",
"positional",
None,
str,
),
parses_dir=("Directory to write the development parses", "positional", None, Path),
config=("Path to json formatted config file", "positional", None, Config.load),
limit=("Size limit", "option", "n", int),
)
def main(ud_dir, parses_dir, config, corpus, limit=0):
# temp fix to avoid import issues cf https://github.com/explosion/spaCy/issues/4200
import tqdm
Token.set_extension("get_conllu_lines", method=get_token_conllu)
Token.set_extension("begins_fused", default=False)
Token.set_extension("inside_fused", default=False)
paths = TreebankPaths(ud_dir, corpus)
if not (parses_dir / corpus).exists():
(parses_dir / corpus).mkdir()

View File

@ -4,14 +4,14 @@ preshed>=3.0.2,<3.1.0
thinc>=7.3.0,<7.4.0
blis>=0.4.0,<0.5.0
murmurhash>=0.28.0,<1.1.0
wasabi>=0.3.0,<1.1.0
wasabi>=0.4.0,<1.1.0
srsly>=0.1.0,<1.1.0
catalogue>=0.0.7,<1.1.0
# Third party dependencies
numpy>=1.15.0
requests>=2.13.0,<3.0.0
plac>=0.9.6,<1.2.0
pathlib==1.0.1; python_version < "3.4"
importlib_metadata>=0.20; python_version < "3.8"
# Optional dependencies
jsonschema>=2.6.0,<3.1.0
# Development dependencies

View File

@ -40,19 +40,21 @@ setup_requires =
murmurhash>=0.28.0,<1.1.0
thinc>=7.3.0,<7.4.0
install_requires =
setuptools
numpy>=1.15.0
# Our libraries
murmurhash>=0.28.0,<1.1.0
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc>=7.3.0,<7.4.0
blis>=0.4.0,<0.5.0
wasabi>=0.4.0,<1.1.0
srsly>=0.1.0,<1.1.0
catalogue>=0.0.7,<1.1.0
# Third-party dependencies
setuptools
numpy>=1.15.0
plac>=0.9.6,<1.2.0
requests>=2.13.0,<3.0.0
wasabi>=0.3.0,<1.1.0
srsly>=0.1.0,<1.1.0
pathlib==1.0.1; python_version < "3.4"
importlib_metadata>=0.20; python_version < "3.8"
[options.extras_require]
lookups =

View File

@ -15,7 +15,7 @@ from .glossary import explain
from .about import __version__
from .errors import Errors, Warnings, deprecation_warning
from . import util
from .util import register_architecture, get_architecture
from .util import registry
from .language import component

View File

@ -7,12 +7,10 @@ from __future__ import print_function
if __name__ == "__main__":
import plac
import sys
from wasabi import Printer
from wasabi import msg
from spacy.cli import download, link, info, package, train, pretrain, convert
from spacy.cli import init_model, profile, evaluate, validate, debug_data
msg = Printer()
commands = {
"download": download,
"link": link,

View File

@ -6,16 +6,13 @@ import requests
import os
import subprocess
import sys
from wasabi import Printer
from wasabi import msg
from .link import link
from ..util import get_package_path
from .. import about
msg = Printer()
@plac.annotations(
model=("Model to download (shortcut or name)", "positional", None, str),
direct=("Force direct download of name + version", "flag", "d", bool),

View File

@ -3,7 +3,7 @@ from __future__ import unicode_literals, division, print_function
import plac
from timeit import default_timer as timer
from wasabi import Printer
from wasabi import msg
from ..gold import GoldCorpus
from .. import util
@ -32,7 +32,6 @@ def evaluate(
Evaluate a model. To render a sample of parses in a HTML file, set an
output directory as the displacy_path argument.
"""
msg = Printer()
util.fix_random_seed()
if gpu_id >= 0:
util.use_gpu(gpu_id)

View File

@ -4,7 +4,7 @@ from __future__ import unicode_literals
import plac
import platform
from pathlib import Path
from wasabi import Printer
from wasabi import msg
import srsly
from ..compat import path2str, basestring_, unicode_
@ -23,7 +23,6 @@ def info(model=None, markdown=False, silent=False):
speficied as an argument, print model information. Flag --markdown
prints details in Markdown for easy copy-pasting to GitHub issues.
"""
msg = Printer()
if model:
if util.is_package(model):
model_path = util.get_package_path(model)

View File

@ -11,7 +11,7 @@ import tarfile
import gzip
import zipfile
import srsly
from wasabi import Printer
from wasabi import msg
from ..vectors import Vectors
from ..errors import Errors, Warnings, user_warning
@ -24,7 +24,6 @@ except ImportError:
DEFAULT_OOV_PROB = -20
msg = Printer()
@plac.annotations(

View File

@ -3,7 +3,7 @@ from __future__ import unicode_literals
import plac
from pathlib import Path
from wasabi import Printer
from wasabi import msg
from ..compat import symlink_to, path2str
from .. import util
@ -20,7 +20,6 @@ def link(origin, link_name, force=False, model_path=None):
either the name of a pip package, or the local path to the model data
directory. Linking models allows loading them via spacy.load(link_name).
"""
msg = Printer()
if util.is_package(origin):
model_path = util.get_package_path(origin)
else:

View File

@ -4,7 +4,7 @@ from __future__ import unicode_literals
import plac
import shutil
from pathlib import Path
from wasabi import Printer, get_raw_input
from wasabi import msg, get_raw_input
import srsly
from ..compat import path2str
@ -27,7 +27,6 @@ def package(input_dir, output_dir, meta_path=None, create_meta=False, force=Fals
set and a meta.json already exists in the output directory, the existing
values will be used as the defaults in the command-line prompt.
"""
msg = Printer()
input_path = util.ensure_path(input_dir)
output_path = util.ensure_path(output_dir)
meta_path = util.ensure_path(meta_path)

View File

@ -11,7 +11,7 @@ from pathlib import Path
from thinc.v2v import Affine, Maxout
from thinc.misc import LayerNorm as LN
from thinc.neural.util import prefer_gpu
from wasabi import Printer
from wasabi import msg
import srsly
from spacy.gold import Example
@ -123,7 +123,6 @@ def pretrain(
for key in config:
if isinstance(config[key], Path):
config[key] = str(config[key])
msg = Printer()
util.fix_random_seed(seed)
has_gpu = prefer_gpu()

View File

@ -9,7 +9,7 @@ import pstats
import sys
import itertools
import thinc.extra.datasets
from wasabi import Printer
from wasabi import msg
from ..util import load_model
@ -26,7 +26,6 @@ def profile(model, inputs=None, n_texts=10000):
It can either be provided as a JSONL file, or be read from sys.sytdin.
If no input file is specified, the IMDB dataset is loaded via Thinc.
"""
msg = Printer()
if inputs is not None:
inputs = _read_inputs(inputs, msg)
if inputs is None:

View File

@ -8,7 +8,7 @@ from thinc.neural._classes.model import Model
from timeit import default_timer as timer
import shutil
import srsly
from wasabi import Printer
from wasabi import msg
import contextlib
import random
@ -89,7 +89,6 @@ def train(
# temp fix to avoid import issues cf https://github.com/explosion/spaCy/issues/4200
import tqdm
msg = Printer()
util.fix_random_seed()
util.set_env_log(verbose)

View File

@ -5,7 +5,7 @@ from pathlib import Path
import sys
import requests
import srsly
from wasabi import Printer
from wasabi import msg
from ..compat import path2str
from ..util import get_data_path
@ -17,7 +17,6 @@ def validate():
Validate that the currently installed version of spaCy is compatible
with the installed models. Should be run after `pip install -U spacy`.
"""
msg = Printer()
with msg.loading("Loading compatibility table..."):
r = requests.get(about.__compatibility__)
if r.status_code != 200:

View File

@ -36,11 +36,6 @@ try:
except ImportError:
cupy = None
try: # Python 3.8
import importlib.metadata as importlib_metadata
except ImportError:
import importlib_metadata # noqa: F401
try:
from thinc.neural.optimizers import Optimizer # noqa: F401
except ImportError:

View File

@ -5,7 +5,7 @@ import uuid
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
from ..util import minify_html, escape_html, get_entry_points, ENTRY_POINTS
from ..util import minify_html, escape_html, registry
from ..errors import Errors
@ -242,7 +242,7 @@ class EntityRenderer(object):
"CARDINAL": "#e4e7d2",
"PERCENT": "#e4e7d2",
}
user_colors = get_entry_points(ENTRY_POINTS.displacy_colors)
user_colors = registry.displacy_colors.get_all()
for user_color in user_colors.values():
colors.update(user_color)
colors.update(options.get("colors", {}))

View File

@ -529,6 +529,9 @@ class Errors(object):
E185 = ("Received invalid attribute in component attribute declaration: "
"{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.")
E186 = ("'{tok_a}' and '{tok_b}' are different texts.")
E187 = ("Tokenizer special cases are not allowed to modify the text. "
"This would map '{chunk}' to '{orth}' given token attributes "
"'{token_attrs}'.")
# TODO: fix numbering after merging develop into master
E998 = ("Can only create GoldParse's from Example's without a Doc, "

View File

@ -11,12 +11,12 @@ Example sentences to test spaCy and its language models.
sentences = [
"Apple está buscando comprar una startup del Reino Unido por mil millones de dólares",
"Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes",
"San Francisco analiza prohibir los robots delivery",
"Londres es una gran ciudad del Reino Unido",
"El gato come pescado",
"Veo al hombre con el telescopio",
"La araña come moscas",
"El pingüino incuba en su nido",
"Apple está buscando comprar una startup del Reino Unido por mil millones de dólares.",
"Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes.",
"San Francisco analiza prohibir los robots delivery.",
"Londres es una gran ciudad del Reino Unido.",
"El gato come pescado.",
"Veo al hombre con el telescopio.",
"La araña come moscas.",
"El pingüino incuba en su nido.",
]

View File

@ -11,8 +11,8 @@ Example sentences to test spaCy and its language models.
sentences = [
"Apple vurderer å kjøpe britisk oppstartfirma for en milliard dollar",
"Selvkjørende biler flytter forsikringsansvaret over på produsentene ",
"San Francisco vurderer å forby robotbud på fortauene",
"Apple vurderer å kjøpe britisk oppstartfirma for en milliard dollar.",
"Selvkjørende biler flytter forsikringsansvaret over på produsentene.",
"San Francisco vurderer å forby robotbud på fortauene.",
"London er en stor by i Storbritannia.",
]

View File

@ -114,7 +114,6 @@ emoticons = set(
(-:
=)
(=
")
:]
:-]
[:

99
spacy/lang/xx/examples.py Normal file
View File

@ -0,0 +1,99 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.de.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
# combined examples from de/en/es/fr/it/nl/pl/pt/ru
sentences = [
"Die ganze Stadt ist ein Startup: Shenzhen ist das Silicon Valley für Hardware-Firmen",
"Wie deutsche Startups die Technologie vorantreiben wollen: Künstliche Intelligenz",
"Trend zum Urlaub in Deutschland beschert Gastwirten mehr Umsatz",
"Bundesanwaltschaft erhebt Anklage gegen mutmaßlichen Schweizer Spion",
"San Francisco erwägt Verbot von Lieferrobotern",
"Autonome Fahrzeuge verlagern Haftpflicht auf Hersteller",
"Wo bist du?",
"Was ist die Hauptstadt von Deutschland?",
"Apple is looking at buying U.K. startup for $1 billion",
"Autonomous cars shift insurance liability toward manufacturers",
"San Francisco considers banning sidewalk delivery robots",
"London is a big city in the United Kingdom.",
"Where are you?",
"Who is the president of France?",
"What is the capital of the United States?",
"When was Barack Obama born?",
"Apple está buscando comprar una startup del Reino Unido por mil millones de dólares.",
"Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes.",
"San Francisco analiza prohibir los robots delivery.",
"Londres es una gran ciudad del Reino Unido.",
"El gato come pescado.",
"Veo al hombre con el telescopio.",
"La araña come moscas.",
"El pingüino incuba en su nido.",
"Apple cherche à acheter une start-up anglaise pour 1 milliard de dollars",
"Les voitures autonomes déplacent la responsabilité de l'assurance vers les constructeurs",
"San Francisco envisage d'interdire les robots coursiers sur les trottoirs",
"Londres est une grande ville du Royaume-Uni",
"LItalie choisit ArcelorMittal pour reprendre la plus grande aciérie dEurope",
"Apple lance HomePod parce qu'il se sent menacé par l'Echo d'Amazon",
"La France ne devrait pas manquer d'électricité cet été, même en cas de canicule",
"Nouvelles attaques de Trump contre le maire de Londres",
"Où es-tu ?",
"Qui est le président de la France ?",
"Où est la capitale des États-Unis ?",
"Quand est né Barack Obama ?",
"Apple vuole comprare una startup del Regno Unito per un miliardo di dollari",
"Le automobili a guida autonoma spostano la responsabilità assicurativa verso i produttori",
"San Francisco prevede di bandire i robot di consegna porta a porta",
"Londra è una grande città del Regno Unito.",
"Apple overweegt om voor 1 miljard een U.K. startup te kopen",
"Autonome auto's verschuiven de verzekeringverantwoordelijkheid naar producenten",
"San Francisco overweegt robots op voetpaden te verbieden",
"Londen is een grote stad in het Verenigd Koninkrijk",
"Poczuł przyjemną woń mocnej kawy.",
"Istnieje wiele dróg oddziaływania substancji psychoaktywnej na układ nerwowy.",
"Powitał mnie biało-czarny kot, płosząc siedzące na płocie trzy dorodne dudki.",
"Nowy abonament pod lupą Komisji Europejskiej",
"Czy w ciągu ostatnich 48 godzin spożyłeś leki zawierające paracetamol?",
"Kto ma ochotę zapoznać się z innymi niż w książkach przygodami Muminków i ich przyjaciół, temu polecam komiks Tove Jansson „Muminki i morze”.",
"Apple está querendo comprar uma startup do Reino Unido por 100 milhões de dólares.",
"Carros autônomos empurram a responsabilidade do seguro para os fabricantes.."
"São Francisco considera banir os robôs de entrega que andam pelas calçadas.",
"Londres é a maior cidade do Reino Unido.",
# Translations from English:
"Apple рассматривает возможность покупки стартапа из Соединённого Королевства за $1 млрд",
"Беспилотные автомобили перекладывают страховую ответственность на производителя",
"В Сан-Франциско рассматривается возможность запрета роботов-курьеров, которые перемещаются по тротуару",
"Лондон — это большой город в Соединённом Королевстве",
# Native Russian sentences:
# Colloquial:
"Да, нет, наверное!", # Typical polite refusal
"Обратите внимание на необыкновенную красоту этого города-героя Москвы, столицы нашей Родины!", # From a tour guide speech
# Examples of Bookish Russian:
# Quote from "The Golden Calf"
"Рио-де-Жанейро — это моя мечта, и не смейте касаться её своими грязными лапами!",
# Quotes from "Ivan Vasilievich changes his occupation"
"Ты пошто боярыню обидел, смерд?!!",
"Оставь меня, старушка, я в печали!",
# Quotes from Dostoevsky:
"Уж коли я, такой же, как и ты, человек грешный, над тобой умилился и пожалел тебя, кольми паче бог",
"В мечтах я нередко, говорит, доходил до страстных помыслов о служении человечеству и может быть действительно пошел бы на крест за людей, если б это вдруг как-нибудь потребовалось, а между тем я двух дней не в состоянии прожить ни с кем в одной комнате, о чем знаю из опыта",
"Зато всегда так происходило, что чем более я ненавидел людей в частности, тем пламеннее становилась любовь моя к человечеству вообще",
# Quotes from Chekhov:
"Ненужные дела и разговоры всё об одном отхватывают на свою долю лучшую часть времени, лучшие силы, и в конце концов остается какая-то куцая, бескрылая жизнь, какая-то чепуха, и уйти и бежать нельзя, точно сидишь в сумасшедшем доме или в арестантских ротах!",
# Quotes from Turgenev:
"Нравится тебе женщина, старайся добиться толку; а нельзя — ну, не надо, отвернись — земля не клином сошлась",
"Узенькое местечко, которое я занимаю, до того крохотно в сравнении с остальным пространством, где меня нет и где дела до меня нет; и часть времени, которую мне удастся прожить, так ничтожна перед вечностью, где меня не было и не будет...",
# Quotes from newspapers:
# Komsomolskaya Pravda:
"На заседании президиума правительства Москвы принято решение присвоить статус инвестиционного приоритетного проекта города Москвы киностудии Союзмультфильм",
"Глава Минобороны Сергей Шойгу заявил, что обстановка на этом стратегическом направлении требует непрерывного совершенствования боевого состава войск",
# Argumenty i Facty:
"На реплику лже-Говина — дескать, он (Волков) будет лучшим революционером — Стамп с энтузиазмом ответил: Непременно!",
]

View File

@ -53,8 +53,8 @@ class BaseDefaults(object):
filenames = {name: root / filename for name, filename in cls.resources}
if LANG in cls.lex_attr_getters:
lang = cls.lex_attr_getters[LANG](None)
user_lookups = util.get_entry_point(util.ENTRY_POINTS.lookups, lang, {})
filenames.update(user_lookups)
if lang in util.registry.lookups:
filenames.update(util.registry.lookups.get(lang))
lookups = Lookups()
for name, filename in filenames.items():
data = util.load_language_data(filename)
@ -157,7 +157,7 @@ class Language(object):
100,000 characters in one text.
RETURNS (Language): The newly constructed object.
"""
user_factories = util.get_entry_points(util.ENTRY_POINTS.factories)
user_factories = util.registry.factories.get_all()
self.factories.update(user_factories)
self._meta = dict(meta)
self._path = None
@ -741,6 +741,7 @@ class Language(object):
texts,
batch_size=batch_size,
disable=disable,
n_process=n_process,
component_cfg=component_cfg,
as_example=False
)

View File

@ -240,7 +240,7 @@ cdef class DependencyMatcher:
for i, (ent_id, nodes) in enumerate(matched_key_trees):
on_match = self._callbacks.get(ent_id)
if on_match is not None:
on_match(self, doc, i, matches)
on_match(self, doc, i, matched_key_trees)
return matched_key_trees
def recurse(self,tree,id_to_position,_node_operator_map,int patternLength,visitedNodes,matched_trees):

View File

@ -3,10 +3,10 @@ from __future__ import unicode_literals
from thinc.api import chain
from thinc.v2v import Maxout
from thinc.misc import LayerNorm
from ..util import register_architecture, make_layer
from ..util import registry, make_layer
@register_architecture("thinc.FeedForward.v1")
@registry.architectures.register("thinc.FeedForward.v1")
def FeedForward(config):
layers = [make_layer(layer_cfg) for layer_cfg in config["layers"]]
model = chain(*layers)
@ -14,7 +14,7 @@ def FeedForward(config):
return model
@register_architecture("spacy.LayerNormalizedMaxout.v1")
@registry.architectures.register("spacy.LayerNormalizedMaxout.v1")
def LayerNormalizedMaxout(config):
width = config["width"]
pieces = config["pieces"]

View File

@ -6,11 +6,11 @@ from thinc.v2v import Maxout, Model
from thinc.i2v import HashEmbed, StaticVectors
from thinc.t2t import ExtractWindow
from thinc.misc import Residual, LayerNorm, FeatureExtracter
from ..util import make_layer, register_architecture
from ..util import make_layer, registry
from ._wire import concatenate_lists
@register_architecture("spacy.Tok2Vec.v1")
@registry.architectures.register("spacy.Tok2Vec.v1")
def Tok2Vec(config):
doc2feats = make_layer(config["@doc2feats"])
embed = make_layer(config["@embed"])
@ -24,13 +24,13 @@ def Tok2Vec(config):
return tok2vec
@register_architecture("spacy.Doc2Feats.v1")
@registry.architectures.register("spacy.Doc2Feats.v1")
def Doc2Feats(config):
columns = config["columns"]
return FeatureExtracter(columns)
@register_architecture("spacy.MultiHashEmbed.v1")
@registry.architectures.register("spacy.MultiHashEmbed.v1")
def MultiHashEmbed(config):
# For backwards compatibility with models before the architecture registry,
# we have to be careful to get exactly the same model structure. One subtle
@ -78,7 +78,7 @@ def MultiHashEmbed(config):
return layer
@register_architecture("spacy.CharacterEmbed.v1")
@registry.architectures.register("spacy.CharacterEmbed.v1")
def CharacterEmbed(config):
from .. import _ml
@ -94,7 +94,7 @@ def CharacterEmbed(config):
return model
@register_architecture("spacy.MaxoutWindowEncoder.v1")
@registry.architectures.register("spacy.MaxoutWindowEncoder.v1")
def MaxoutWindowEncoder(config):
nO = config["width"]
nW = config["window_size"]
@ -110,7 +110,7 @@ def MaxoutWindowEncoder(config):
return model
@register_architecture("spacy.MishWindowEncoder.v1")
@registry.architectures.register("spacy.MishWindowEncoder.v1")
def MishWindowEncoder(config):
from thinc.v2v import Mish
@ -124,12 +124,12 @@ def MishWindowEncoder(config):
return model
@register_architecture("spacy.PretrainedVectors.v1")
@registry.architectures.register("spacy.PretrainedVectors.v1")
def PretrainedVectors(config):
return StaticVectors(config["vectors_name"], config["width"], config["column"])
@register_architecture("spacy.TorchBiLSTMEncoder.v1")
@registry.architectures.register("spacy.TorchBiLSTMEncoder.v1")
def TorchBiLSTMEncoder(config):
import torch.nn
from thinc.extra.wrappers import PyTorchWrapperRNN

View File

@ -0,0 +1,34 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from mock import Mock
from spacy.matcher import DependencyMatcher
from ..util import get_doc
def test_issue4590(en_vocab):
"""Test that matches param in on_match method are the same as matches run with no on_match method"""
pattern = [
{"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}},
{"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
{"SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
]
on_match = Mock()
matcher = DependencyMatcher(en_vocab)
matcher.add("pattern", on_match, pattern)
text = "The quick brown fox jumped over the lazy fox"
heads = [3, 2, 1, 1, 0, -1, 2, 1, -3]
deps = ["det", "amod", "amod", "nsubj", "prep", "pobj", "det", "amod"]
doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps)
matches = matcher(doc)
on_match_args = on_match.call_args
assert on_match_args[0][3] == matches

View File

@ -0,0 +1,19 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from spacy import registry
from thinc.v2v import Affine
from catalogue import RegistryError
@registry.architectures.register("my_test_function")
def create_model(nr_in, nr_out):
return Affine(nr_in, nr_out)
def test_get_architecture():
arch = registry.architectures.get("my_test_function")
assert arch is create_model
with pytest.raises(RegistryError):
registry.architectures.get("not_an_existing_key")

View File

@ -1,19 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from spacy import register_architecture
from spacy import get_architecture
from thinc.v2v import Affine
@register_architecture("my_test_function")
def create_model(nr_in, nr_out):
return Affine(nr_in, nr_out)
def test_get_architecture():
arch = get_architecture("my_test_function")
assert arch is create_model
with pytest.raises(KeyError):
get_architecture("not_an_existing_key")

View File

@ -7,7 +7,7 @@ import pytest
def test_tokenizer_handles_emoticons(tokenizer):
# Tweebo challenge (CMU)
text = """:o :/ :'( >:o (: :) >.< XD -__- o.O ;D :-) @_@ :P 8D :1 >:( :D =| ") :> ...."""
text = """:o :/ :'( >:o (: :) >.< XD -__- o.O ;D :-) @_@ :P 8D :1 >:( :D =| :> ...."""
tokens = tokenizer(text)
assert tokens[0].text == ":o"
assert tokens[1].text == ":/"
@ -28,12 +28,11 @@ def test_tokenizer_handles_emoticons(tokenizer):
assert tokens[16].text == ">:("
assert tokens[17].text == ":D"
assert tokens[18].text == "=|"
assert tokens[19].text == '")'
assert tokens[20].text == ":>"
assert tokens[21].text == "...."
assert tokens[19].text == ":>"
assert tokens[20].text == "...."
@pytest.mark.parametrize("text,length", [("example:)", 3), ("108)", 2), ("XDN", 1)])
@pytest.mark.parametrize("text,length", [("108)", 2), ("XDN", 1)])
def test_tokenizer_excludes_false_pos_emoticons(tokenizer, text, length):
tokens = tokenizer(text)
assert len(tokens) == length

View File

@ -108,6 +108,12 @@ def test_tokenizer_add_special_case(tokenizer, text, tokens):
assert doc[1].text == tokens[1]["orth"]
@pytest.mark.parametrize("text,tokens", [("lorem", [{"orth": "lo"}, {"orth": "re"}])])
def test_tokenizer_validate_special_case(tokenizer, text, tokens):
with pytest.raises(ValueError):
tokenizer.add_special_case(text, tokens)
@pytest.mark.parametrize(
"text,tokens", [("lorem", [{"orth": "lo", "tag": "NN"}, {"orth": "rem"}])]
)
@ -120,3 +126,18 @@ def test_tokenizer_add_special_case_tag(text, tokens):
assert doc[0].tag_ == tokens[0]["tag"]
assert doc[0].pos_ == "NOUN"
assert doc[1].text == tokens[1]["orth"]
def test_tokenizer_special_cases_with_affixes(tokenizer):
text = '(((_SPECIAL_ A/B, A/B-A/B")'
tokenizer.add_special_case("_SPECIAL_", [{"orth": "_SPECIAL_"}])
tokenizer.add_special_case("A/B", [{"orth": "A/B"}])
doc = tokenizer(text)
assert [token.text for token in doc] == ["(", "(", "(", "_SPECIAL_", "A/B", ",", "A/B", "-", "A/B", '"', ")"]
def test_tokenizer_special_cases_with_period(tokenizer):
text = "_SPECIAL_."
tokenizer.add_special_case("_SPECIAL_", [{"orth": "_SPECIAL_"}])
doc = tokenizer(text)
assert [token.text for token in doc] == ["_SPECIAL_", "."]

View File

@ -3,6 +3,8 @@ from __future__ import unicode_literals
import pytest
from spacy.lang.tokenizer_exceptions import BASE_EXCEPTIONS
URLS_BASIC = [
"http://www.nytimes.com/2016/04/20/us/politics/new-york-primary-preview.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=a-lede-package-region&region=top-news&WT.nav=top-news&_r=0",
@ -194,7 +196,12 @@ def test_tokenizer_handles_two_prefix_url(tokenizer, prefix1, prefix2, url):
@pytest.mark.parametrize("url", URLS_FULL)
def test_tokenizer_handles_two_suffix_url(tokenizer, suffix1, suffix2, url):
tokens = tokenizer(url + suffix1 + suffix2)
assert len(tokens) == 3
assert tokens[0].text == url
assert tokens[1].text == suffix1
assert tokens[2].text == suffix2
if suffix1 + suffix2 in BASE_EXCEPTIONS:
assert len(tokens) == 2
assert tokens[0].text == url
assert tokens[1].text == suffix1 + suffix2
else:
assert len(tokens) == 3
assert tokens[0].text == url
assert tokens[1].text == suffix1
assert tokens[2].text == suffix2

View File

@ -4,10 +4,11 @@ from preshed.maps cimport PreshMap
from cymem.cymem cimport Pool
from .typedefs cimport hash_t
from .structs cimport LexemeC, TokenC
from .structs cimport LexemeC, SpanC, TokenC
from .strings cimport StringStore
from .tokens.doc cimport Doc
from .vocab cimport Vocab, LexemesOrTokens, _Cached
from .matcher.phrasematcher cimport PhraseMatcher
cdef class Tokenizer:
@ -21,15 +22,32 @@ cdef class Tokenizer:
cdef object _suffix_search
cdef object _infix_finditer
cdef object _rules
cdef PhraseMatcher _special_matcher
cdef int _property_init_count
cdef int _property_init_max
cpdef Doc tokens_from_list(self, list strings)
cdef Doc _tokenize_affixes(self, unicode string, bint with_special_cases)
cdef int _apply_special_cases(self, Doc doc) except -1
cdef void _filter_special_spans(self, vector[SpanC] &original,
vector[SpanC] &filtered, int doc_len) nogil
cdef object _prepare_special_spans(self, Doc doc,
vector[SpanC] &filtered)
cdef int _retokenize_special_spans(self, Doc doc, TokenC* tokens,
object span_data)
cdef int _try_cache(self, hash_t key, Doc tokens) except -1
cdef int _tokenize(self, Doc tokens, unicode span, hash_t key) except -1
cdef unicode _split_affixes(self, Pool mem, unicode string, vector[LexemeC*] *prefixes,
vector[LexemeC*] *suffixes, int* has_special)
cdef int _try_specials(self, hash_t key, Doc tokens,
int* has_special) except -1
cdef int _tokenize(self, Doc tokens, unicode span, hash_t key,
int* has_special, bint with_special_cases) except -1
cdef unicode _split_affixes(self, Pool mem, unicode string,
vector[LexemeC*] *prefixes,
vector[LexemeC*] *suffixes, int* has_special,
bint with_special_cases)
cdef int _attach_tokens(self, Doc tokens, unicode string,
vector[LexemeC*] *prefixes, vector[LexemeC*] *suffixes) except -1
cdef int _save_cached(self, const TokenC* tokens, hash_t key, int has_special,
int n) except -1
vector[LexemeC*] *prefixes,
vector[LexemeC*] *suffixes, int* has_special,
bint with_special_cases) except -1
cdef int _save_cached(self, const TokenC* tokens, hash_t key,
int* has_special, int n) except -1

View File

@ -5,6 +5,8 @@ from __future__ import unicode_literals
from cython.operator cimport dereference as deref
from cython.operator cimport preincrement as preinc
from libc.string cimport memcpy, memset
from libcpp.set cimport set as stdset
from cymem.cymem cimport Pool
from preshed.maps cimport PreshMap
cimport cython
@ -19,6 +21,9 @@ from .compat import unescape_unicode
from .errors import Errors, Warnings, deprecation_warning
from . import util
from .attrs import intify_attrs
from .lexeme cimport EMPTY_LEXEME
from .symbols import ORTH
cdef class Tokenizer:
"""Segment text, and create Doc objects with the discovered segment
@ -57,9 +62,10 @@ cdef class Tokenizer:
self.infix_finditer = infix_finditer
self.vocab = vocab
self._rules = {}
if rules is not None:
for chunk, substrings in sorted(rules.items()):
self.add_special_case(chunk, substrings)
self._special_matcher = PhraseMatcher(self.vocab)
self._load_special_cases(rules)
self._property_init_count = 0
self._property_init_max = 4
property token_match:
def __get__(self):
@ -67,7 +73,9 @@ cdef class Tokenizer:
def __set__(self, token_match):
self._token_match = token_match
self._flush_cache()
self._reload_special_cases()
if self._property_init_count <= self._property_init_max:
self._property_init_count += 1
property prefix_search:
def __get__(self):
@ -75,7 +83,9 @@ cdef class Tokenizer:
def __set__(self, prefix_search):
self._prefix_search = prefix_search
self._flush_cache()
self._reload_special_cases()
if self._property_init_count <= self._property_init_max:
self._property_init_count += 1
property suffix_search:
def __get__(self):
@ -83,7 +93,9 @@ cdef class Tokenizer:
def __set__(self, suffix_search):
self._suffix_search = suffix_search
self._flush_cache()
self._reload_special_cases()
if self._property_init_count <= self._property_init_max:
self._property_init_count += 1
property infix_finditer:
def __get__(self):
@ -91,7 +103,9 @@ cdef class Tokenizer:
def __set__(self, infix_finditer):
self._infix_finditer = infix_finditer
self._flush_cache()
self._reload_special_cases()
if self._property_init_count <= self._property_init_max:
self._property_init_count += 1
def __reduce__(self):
args = (self.vocab,
@ -106,7 +120,6 @@ cdef class Tokenizer:
deprecation_warning(Warnings.W002)
return Doc(self.vocab, words=strings)
@cython.boundscheck(False)
def __call__(self, unicode string):
"""Tokenize a string.
@ -115,6 +128,17 @@ cdef class Tokenizer:
DOCS: https://spacy.io/api/tokenizer#call
"""
doc = self._tokenize_affixes(string, True)
self._apply_special_cases(doc)
return doc
@cython.boundscheck(False)
cdef Doc _tokenize_affixes(self, unicode string, bint with_special_cases):
"""Tokenize according to affix and token_match settings.
string (unicode): The string to tokenize.
RETURNS (Doc): A container for linguistic annotations.
"""
if len(string) >= (2 ** 30):
raise ValueError(Errors.E025.format(length=len(string)))
cdef int length = len(string)
@ -123,7 +147,9 @@ cdef class Tokenizer:
return doc
cdef int i = 0
cdef int start = 0
cdef bint cache_hit
cdef int has_special = 0
cdef bint specials_hit = 0
cdef bint cache_hit = 0
cdef bint in_ws = string[0].isspace()
cdef unicode span
# The task here is much like string.split, but not quite
@ -139,9 +165,14 @@ cdef class Tokenizer:
# we don't have to create the slice when we hit the cache.
span = string[start:i]
key = hash_string(span)
cache_hit = self._try_cache(key, doc)
if not cache_hit:
self._tokenize(doc, span, key)
specials_hit = 0
cache_hit = 0
if with_special_cases:
specials_hit = self._try_specials(key, doc, &has_special)
if not specials_hit:
cache_hit = self._try_cache(key, doc)
if not specials_hit and not cache_hit:
self._tokenize(doc, span, key, &has_special, with_special_cases)
if uc == ' ':
doc.c[doc.length - 1].spacy = True
start = i + 1
@ -152,9 +183,14 @@ cdef class Tokenizer:
if start < i:
span = string[start:]
key = hash_string(span)
cache_hit = self._try_cache(key, doc)
if not cache_hit:
self._tokenize(doc, span, key)
specials_hit = 0
cache_hit = 0
if with_special_cases:
specials_hit = self._try_specials(key, doc, &has_special)
if not specials_hit:
cache_hit = self._try_cache(key, doc)
if not specials_hit and not cache_hit:
self._tokenize(doc, span, key, &has_special, with_special_cases)
doc.c[doc.length - 1].spacy = string[-1] == " " and not in_ws
return doc
@ -174,23 +210,141 @@ cdef class Tokenizer:
yield self(text)
def _flush_cache(self):
self._reset_cache([key for key in self._cache if not key in self._specials])
self._reset_cache([key for key in self._cache])
def _reset_cache(self, keys):
for k in keys:
cached = <_Cached*>self._cache.get(k)
del self._cache[k]
if not k in self._specials:
cached = <_Cached*>self._cache.get(k)
if cached is not NULL:
self.mem.free(cached)
if cached is not NULL:
self.mem.free(cached)
def _reset_specials(self):
def _flush_specials(self):
for k in self._specials:
cached = <_Cached*>self._specials.get(k)
del self._specials[k]
if cached is not NULL:
self.mem.free(cached)
cdef int _apply_special_cases(self, Doc doc) except -1:
"""Retokenize doc according to special cases.
doc (Doc): Document.
"""
cdef int i
cdef int max_length = 0
cdef bint modify_in_place
cdef Pool mem = Pool()
cdef vector[SpanC] c_matches
cdef vector[SpanC] c_filtered
cdef int offset
cdef int modified_doc_length
# Find matches for special cases
self._special_matcher.find_matches(doc, &c_matches)
# Skip processing if no matches
if c_matches.size() == 0:
return True
self._filter_special_spans(c_matches, c_filtered, doc.length)
# Put span info in span.start-indexed dict and calculate maximum
# intermediate document size
(span_data, max_length, modify_in_place) = self._prepare_special_spans(doc, c_filtered)
# If modifications never increase doc length, can modify in place
if modify_in_place:
tokens = doc.c
# Otherwise create a separate array to store modified tokens
else:
tokens = <TokenC*>mem.alloc(max_length, sizeof(TokenC))
# Modify tokenization according to filtered special cases
offset = self._retokenize_special_spans(doc, tokens, span_data)
# Allocate more memory for doc if needed
modified_doc_length = doc.length + offset
while modified_doc_length >= doc.max_length:
doc._realloc(doc.max_length * 2)
# If not modified in place, copy tokens back to doc
if not modify_in_place:
memcpy(doc.c, tokens, max_length * sizeof(TokenC))
for i in range(doc.length + offset, doc.length):
memset(&doc.c[i], 0, sizeof(TokenC))
doc.c[i].lex = &EMPTY_LEXEME
doc.length = doc.length + offset
return True
cdef void _filter_special_spans(self, vector[SpanC] &original, vector[SpanC] &filtered, int doc_len) nogil:
cdef int seen_i
cdef SpanC span
cdef stdset[int] seen_tokens
stdsort(original.begin(), original.end(), len_start_cmp)
cdef int orig_i = original.size() - 1
while orig_i >= 0:
span = original[orig_i]
if not seen_tokens.count(span.start) and not seen_tokens.count(span.end - 1):
filtered.push_back(span)
for seen_i in range(span.start, span.end):
seen_tokens.insert(seen_i)
orig_i -= 1
stdsort(filtered.begin(), filtered.end(), start_cmp)
cdef object _prepare_special_spans(self, Doc doc, vector[SpanC] &filtered):
spans = [doc[match.start:match.end] for match in filtered]
cdef bint modify_in_place = True
cdef int curr_length = doc.length
cdef int max_length
cdef int span_length_diff = 0
span_data = {}
for span in spans:
rule = self._rules.get(span.text, None)
span_length_diff = 0
if rule:
span_length_diff = len(rule) - (span.end - span.start)
if span_length_diff > 0:
modify_in_place = False
curr_length += span_length_diff
if curr_length > max_length:
max_length = curr_length
span_data[span.start] = (span.text, span.start, span.end, span_length_diff)
return (span_data, max_length, modify_in_place)
cdef int _retokenize_special_spans(self, Doc doc, TokenC* tokens, object span_data):
cdef int i = 0
cdef int j = 0
cdef int offset = 0
cdef _Cached* cached
cdef int idx_offset = 0
cdef int orig_final_spacy
cdef int orig_idx
cdef int span_start
cdef int span_end
while i < doc.length:
if not i in span_data:
tokens[i + offset] = doc.c[i]
i += 1
else:
span = span_data[i]
span_start = span[1]
span_end = span[2]
cached = <_Cached*>self._specials.get(hash_string(span[0]))
if cached == NULL:
# Copy original tokens if no rule found
for j in range(span_end - span_start):
tokens[i + offset + j] = doc.c[i + j]
i += span_end - span_start
else:
# Copy special case tokens into doc and adjust token and
# character offsets
idx_offset = 0
orig_final_spacy = doc.c[span_end + offset - 1].spacy
orig_idx = doc.c[i].idx
for j in range(cached.length):
tokens[i + offset + j] = cached.data.tokens[j]
tokens[i + offset + j].idx = orig_idx + idx_offset
idx_offset += cached.data.tokens[j].lex.length + \
1 if cached.data.tokens[j].spacy else 0
tokens[i + offset + cached.length - 1].spacy = orig_final_spacy
i += span_end - span_start
offset += span[3]
return offset
cdef int _try_cache(self, hash_t key, Doc tokens) except -1:
cached = <_Cached*>self._cache.get(key)
if cached == NULL:
@ -204,22 +358,33 @@ cdef class Tokenizer:
tokens.push_back(&cached.data.tokens[i], False)
return True
cdef int _tokenize(self, Doc tokens, unicode span, hash_t orig_key) except -1:
cdef int _try_specials(self, hash_t key, Doc tokens, int* has_special) except -1:
cached = <_Cached*>self._specials.get(key)
if cached == NULL:
return False
cdef int i
for i in range(cached.length):
tokens.push_back(&cached.data.tokens[i], False)
has_special[0] = 1
return True
cdef int _tokenize(self, Doc tokens, unicode span, hash_t orig_key, int* has_special, bint with_special_cases) except -1:
cdef vector[LexemeC*] prefixes
cdef vector[LexemeC*] suffixes
cdef int orig_size
cdef int has_special = 0
orig_size = tokens.length
span = self._split_affixes(tokens.mem, span, &prefixes, &suffixes,
&has_special)
self._attach_tokens(tokens, span, &prefixes, &suffixes)
has_special, with_special_cases)
self._attach_tokens(tokens, span, &prefixes, &suffixes, has_special,
with_special_cases)
self._save_cached(&tokens.c[orig_size], orig_key, has_special,
tokens.length - orig_size)
cdef unicode _split_affixes(self, Pool mem, unicode string,
vector[const LexemeC*] *prefixes,
vector[const LexemeC*] *suffixes,
int* has_special):
int* has_special,
bint with_special_cases):
cdef size_t i
cdef unicode prefix
cdef unicode suffix
@ -231,29 +396,24 @@ cdef class Tokenizer:
and not self.find_prefix(string) \
and not self.find_suffix(string):
break
if self._specials.get(hash_string(string)) != NULL:
has_special[0] = 1
if with_special_cases and self._specials.get(hash_string(string)) != NULL:
break
last_size = len(string)
pre_len = self.find_prefix(string)
if pre_len != 0:
prefix = string[:pre_len]
minus_pre = string[pre_len:]
# Check whether we've hit a special-case
if minus_pre and self._specials.get(hash_string(minus_pre)) != NULL:
if minus_pre and with_special_cases and self._specials.get(hash_string(minus_pre)) != NULL:
string = minus_pre
prefixes.push_back(self.vocab.get(mem, prefix))
has_special[0] = 1
break
suf_len = self.find_suffix(string)
if suf_len != 0:
suffix = string[-suf_len:]
minus_suf = string[:-suf_len]
# Check whether we've hit a special-case
if minus_suf and (self._specials.get(hash_string(minus_suf)) != NULL):
if minus_suf and with_special_cases and self._specials.get(hash_string(minus_suf)) != NULL:
string = minus_suf
suffixes.push_back(self.vocab.get(mem, suffix))
has_special[0] = 1
break
if pre_len and suf_len and (pre_len + suf_len) <= len(string):
string = string[pre_len:-suf_len]
@ -265,15 +425,15 @@ cdef class Tokenizer:
elif suf_len:
string = minus_suf
suffixes.push_back(self.vocab.get(mem, suffix))
if string and (self._specials.get(hash_string(string)) != NULL):
has_special[0] = 1
break
return string
cdef int _attach_tokens(self, Doc tokens, unicode string,
vector[const LexemeC*] *prefixes,
vector[const LexemeC*] *suffixes) except -1:
cdef bint cache_hit
vector[const LexemeC*] *suffixes,
int* has_special,
bint with_special_cases) except -1:
cdef bint specials_hit = 0
cdef bint cache_hit = 0
cdef int split, end
cdef const LexemeC* const* lexemes
cdef const LexemeC* lexeme
@ -283,8 +443,12 @@ cdef class Tokenizer:
for i in range(prefixes.size()):
tokens.push_back(prefixes[0][i], False)
if string:
cache_hit = self._try_cache(hash_string(string), tokens)
if cache_hit:
if with_special_cases:
specials_hit = self._try_specials(hash_string(string), tokens,
has_special)
if not specials_hit:
cache_hit = self._try_cache(hash_string(string), tokens)
if specials_hit or cache_hit:
pass
elif self.token_match and self.token_match(string):
# We're always saying 'no' to spaces here -- the caller will
@ -329,7 +493,7 @@ cdef class Tokenizer:
tokens.push_back(lexeme, False)
cdef int _save_cached(self, const TokenC* tokens, hash_t key,
int has_special, int n) except -1:
int* has_special, int n) except -1:
cdef int i
if n <= 0:
# avoid mem alloc of zero length
@ -338,7 +502,7 @@ cdef class Tokenizer:
if self.vocab._by_orth.get(tokens[i].lex.orth) == NULL:
return 0
# See #1250
if has_special:
if has_special[0]:
return 0
cached = <_Cached*>self.mem.alloc(1, sizeof(_Cached))
cached.length = n
@ -391,10 +555,24 @@ cdef class Tokenizer:
match = self.suffix_search(string)
return (match.end() - match.start()) if match is not None else 0
def _load_special_tokenization(self, special_cases):
def _load_special_cases(self, special_cases):
"""Add special-case tokenization rules."""
for chunk, substrings in sorted(special_cases.items()):
self.add_special_case(chunk, substrings)
if special_cases is not None:
for chunk, substrings in sorted(special_cases.items()):
self._validate_special_case(chunk, substrings)
self.add_special_case(chunk, substrings)
def _validate_special_case(self, chunk, substrings):
"""Check whether the `ORTH` fields match the string.
string (unicode): The string to specially tokenize.
substrings (iterable): A sequence of dicts, where each dict describes
a token and its attributes.
"""
attrs = [intify_attrs(spec, _do_deprecated=True) for spec in substrings]
orth = "".join([spec[ORTH] for spec in attrs])
if chunk != orth:
raise ValueError(Errors.E187.format(chunk=chunk, orth=orth, token_attrs=substrings))
def add_special_case(self, unicode string, substrings):
"""Add a special-case tokenization rule.
@ -406,6 +584,7 @@ cdef class Tokenizer:
DOCS: https://spacy.io/api/tokenizer#add_special_case
"""
self._validate_special_case(string, substrings)
substrings = list(substrings)
cached = <_Cached*>self.mem.alloc(1, sizeof(_Cached))
cached.length = len(substrings)
@ -413,15 +592,25 @@ cdef class Tokenizer:
cached.data.tokens = self.vocab.make_fused_token(substrings)
key = hash_string(string)
stale_special = <_Cached*>self._specials.get(key)
stale_cached = <_Cached*>self._cache.get(key)
self._flush_cache()
self._specials.set(key, cached)
self._cache.set(key, cached)
if stale_special is not NULL:
self.mem.free(stale_special)
if stale_special != stale_cached and stale_cached is not NULL:
self.mem.free(stale_cached)
self._rules[string] = substrings
self._flush_cache()
if self.find_prefix(string) or self.find_infix(string) or self.find_suffix(string):
self._special_matcher.add(string, None, self._tokenize_affixes(string, False))
def _reload_special_cases(self):
try:
self._property_init_count
except AttributeError:
return
# only reload if all 4 of prefix, suffix, infix, token_match have
# have been initialized
if self.vocab is not None and self._property_init_count >= self._property_init_max:
self._flush_cache()
self._flush_specials()
self._load_special_cases(self._rules)
def to_disk(self, path, **kwargs):
"""Save the current state to a directory.
@ -503,12 +692,9 @@ cdef class Tokenizer:
if data.get("rules"):
# make sure to hard reset the cache to remove data from the default exceptions
self._rules = {}
self._reset_cache([key for key in self._cache])
self._reset_specials()
self._cache = PreshMap()
self._specials = PreshMap()
for string, substrings in data.get("rules", {}).items():
self.add_special_case(string, substrings)
self._flush_cache()
self._flush_specials()
self._load_special_cases(data.get("rules", {}))
return self
@ -516,3 +702,19 @@ cdef class Tokenizer:
def _get_regex_pattern(regex):
"""Get a pattern string for a regex, or None if the pattern is None."""
return None if regex is None else regex.__self__.pattern
cdef extern from "<algorithm>" namespace "std" nogil:
void stdsort "sort"(vector[SpanC].iterator,
vector[SpanC].iterator,
bint (*)(SpanC, SpanC))
cdef bint len_start_cmp(SpanC a, SpanC b) nogil:
if a.end - a.start == b.end - b.start:
return b.start < a.start
return a.end - a.start < b.end - b.start
cdef bint start_cmp(SpanC a, SpanC b) nogil:
return a.start < b.start

View File

@ -13,6 +13,7 @@ import functools
import itertools
import numpy.random
import srsly
import catalogue
import sys
try:
@ -27,29 +28,20 @@ except ImportError:
from .symbols import ORTH
from .compat import cupy, CudaStream, path2str, basestring_, unicode_
from .compat import import_file, importlib_metadata
from .compat import import_file
from .errors import Errors, Warnings, deprecation_warning
LANGUAGES = {}
ARCHITECTURES = {}
_data_path = Path(__file__).parent / "data"
_PRINT_ENV = False
# NB: Ony ever call this once! If called more than ince within the
# function, test_issue1506 hangs and it's not 100% clear why.
AVAILABLE_ENTRY_POINTS = importlib_metadata.entry_points()
class ENTRY_POINTS(object):
"""Available entry points to register extensions."""
factories = "spacy_factories"
languages = "spacy_languages"
displacy_colors = "spacy_displacy_colors"
lookups = "spacy_lookups"
architectures = "spacy_architectures"
class registry(object):
languages = catalogue.create("spacy", "languages", entry_points=True)
architectures = catalogue.create("spacy", "architectures", entry_points=True)
lookups = catalogue.create("spacy", "lookups", entry_points=True)
factories = catalogue.create("spacy", "factories", entry_points=True)
displacy_colors = catalogue.create("spacy", "displacy_colors", entry_points=True)
def set_env_log(value):
@ -65,8 +57,7 @@ def lang_class_is_loaded(lang):
lang (unicode): Two-letter language code, e.g. 'en'.
RETURNS (bool): Whether a Language class has been loaded.
"""
global LANGUAGES
return lang in LANGUAGES
return lang in registry.languages
def get_lang_class(lang):
@ -75,19 +66,16 @@ def get_lang_class(lang):
lang (unicode): Two-letter language code, e.g. 'en'.
RETURNS (Language): Language class.
"""
global LANGUAGES
# Check if an entry point is exposed for the language code
entry_point = get_entry_point(ENTRY_POINTS.languages, lang)
if entry_point is not None:
LANGUAGES[lang] = entry_point
return entry_point
if lang not in LANGUAGES:
# Check if language is registered / entry point is available
if lang in registry.languages:
return registry.languages.get(lang)
else:
try:
module = importlib.import_module(".lang.%s" % lang, "spacy")
except ImportError as err:
raise ImportError(Errors.E048.format(lang=lang, err=err))
LANGUAGES[lang] = getattr(module, module.__all__[0])
return LANGUAGES[lang]
set_lang_class(lang, getattr(module, module.__all__[0]))
return registry.languages.get(lang)
def set_lang_class(name, cls):
@ -96,8 +84,7 @@ def set_lang_class(name, cls):
name (unicode): Name of Language class.
cls (Language): Language class.
"""
global LANGUAGES
LANGUAGES[name] = cls
registry.languages.register(name, func=cls)
def get_data_path(require_exists=True):
@ -121,49 +108,11 @@ def set_data_path(path):
_data_path = ensure_path(path)
def register_architecture(name, arch=None):
"""Decorator to register an architecture. An architecture is a function
that returns a Thinc Model object.
name (unicode): The name of the architecture to register.
arch (Model): Optional architecture if function is called directly and
not used as a decorator.
RETURNS (callable): Function to register architecture.
"""
global ARCHITECTURES
if arch is not None:
ARCHITECTURES[name] = arch
return arch
def do_registration(arch):
ARCHITECTURES[name] = arch
return arch
return do_registration
def make_layer(arch_config):
arch_func = get_architecture(arch_config["arch"])
arch_func = registry.architectures.get(arch_config["arch"])
return arch_func(arch_config["config"])
def get_architecture(name):
"""Get a model architecture function by name. Raises a KeyError if the
architecture is not found.
name (unicode): The mame of the architecture.
RETURNS (Model): The architecture.
"""
# Check if an entry point is exposed for the architecture code
entry_point = get_entry_point(ENTRY_POINTS.architectures, name)
if entry_point is not None:
ARCHITECTURES[name] = entry_point
if name not in ARCHITECTURES:
names = ", ".join(sorted(ARCHITECTURES.keys()))
raise KeyError(Errors.E174.format(name=name, names=names))
return ARCHITECTURES[name]
def ensure_path(path):
"""Ensure string is converted to a Path.
@ -327,34 +276,6 @@ def get_package_path(name):
return Path(pkg.__file__).parent
def get_entry_points(key):
"""Get registered entry points from other packages for a given key, e.g.
'spacy_factories' and return them as a dictionary, keyed by name.
key (unicode): Entry point name.
RETURNS (dict): Entry points, keyed by name.
"""
result = {}
for entry_point in AVAILABLE_ENTRY_POINTS.get(key, []):
result[entry_point.name] = entry_point.load()
return result
def get_entry_point(key, value, default=None):
"""Check if registered entry point is available for a given name and
load it. Otherwise, return None.
key (unicode): Entry point name.
value (unicode): Name of entry point to load.
default: Optional default value to return.
RETURNS: The loaded entry point or None.
"""
for entry_point in AVAILABLE_ENTRY_POINTS.get(key, []):
if entry_point.name == value:
return entry_point.load()
return default
def is_in_jupyter():
"""Check if user is running spaCy from a Jupyter notebook by detecting the
IPython kernel. Mainly used for the displaCy visualizer.

View File

@ -109,8 +109,8 @@ raise an error if the pre-defined attrs of the two `DocBin`s don't match.
> doc_bin1.add(nlp("Hello world"))
> doc_bin2 = DocBin(attrs=["LEMMA", "POS"])
> doc_bin2.add(nlp("This is a sentence"))
> merged_bins = doc_bin1.merge(doc_bin2)
> assert len(merged_bins) == 2
> doc_bin1.merge(doc_bin2)
> assert len(doc_bin1) == 2
> ```
| Argument | Type | Description |

View File

@ -1,5 +1,5 @@
A named entity is a "real-world object" that's assigned a name for example, a
person, a country, a product or a book title. spaCy can **recognize**
person, a country, a product or a book title. spaCy can **recognize**
[various types](/api/annotation#named-entities) of named entities in a document,
by asking the model for a **prediction**. Because models are statistical and
strongly depend on the examples they were trained on, this doesn't always work

View File

@ -20,6 +20,17 @@ available over [pip](https://pypi.python.org/pypi/spacy) and
> possible, the new docs also include notes on features that have changed in
> v2.0, and features that were introduced in the new version.
<Infobox variant="warning" title="Important note for Python 3.8">
We can't yet ship pre-compiled binary wheels for spaCy that work on Python 3.8,
as we're still waiting for our CI providers and other tooling to support it.
This means that in order to run spaCy on Python 3.8, you'll need
[a compiler installed](#source) and compile the library and its Cython
dependencies locally. If this is causing problems for you, the easiest solution
is to **use Python 3.7** in the meantime.
</Infobox>
## Quickstart {hidden="true"}
import QuickstartInstall from 'widgets/quickstart-install.js'

View File

@ -1861,6 +1861,30 @@
"author_links": {
"github": "microsoft"
}
},
{
"id": "dframcy",
"title": "Dframcy",
"slogan": "Dataframe Integration with spaCy NLP",
"github": "yash1994/dframcy",
"description": "DframCy is a light-weight utility module to integrate Pandas Dataframe to spaCy's linguistic annotation and training tasks.",
"pip": "dframcy",
"category": ["pipeline", "training"],
"tags": ["pandas"],
"code_example": [
"import spacy",
"from dframcy import DframCy",
"",
"nlp = spacy.load('en_core_web_sm')",
"dframcy = DframCy(nlp)",
"doc = dframcy.nlp(u'Apple is looking at buying U.K. startup for $1 billion')",
"annotation_dataframe = dframcy.to_dataframe(doc)"
],
"author": "Yash Patadia",
"author_links": {
"twitter": "PatadiaYash",
"github": "yash1994"
}
}
],