Merge branch 'master' into spacy.io

Ines Montani 2019-11-11 17:12:19 +01:00
commit 71f5a5daa1
28 changed files with 482 additions and 186 deletions

.github/contributors/prilopes.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Priscilla Lopes |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-11-06 |
| GitHub username | prilopes |
| Website (optional) | |

View File

@@ -221,6 +221,13 @@ def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
def write_conllu(docs, file_):
if not Token.has_extension("get_conllu_lines"):
Token.set_extension("get_conllu_lines", method=get_token_conllu)
if not Token.has_extension("begins_fused"):
Token.set_extension("begins_fused", default=False)
if not Token.has_extension("inside_fused"):
Token.set_extension("inside_fused", default=False)
merger = Matcher(docs[0].vocab)
merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}])
for i, doc in enumerate(docs):
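For reference, the guarded `set_extension` calls and the `SUBTOK` merger above use the standard spaCy v2 extension and `Matcher` APIs. A minimal, self-contained sketch of the same pattern; the `en_core_web_sm` pipeline name is only an assumption for illustration:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

# Guarded registration: set_extension() raises if the attribute already
# exists, so checking has_extension() first keeps the helper re-runnable.
if not Token.has_extension("begins_fused"):
    Token.set_extension("begins_fused", default=False)

nlp = spacy.load("en_core_web_sm")  # any parsed pipeline works; model name is an assumption
doc = nlp("Autonomous cars shift insurance liability toward manufacturers.")

merger = Matcher(doc.vocab)
# spaCy v2 signature: Matcher.add(key, on_match, *patterns)
merger.add("SUBTOK", None, [{"DEP": "subtok", "OP": "+"}])
print(merger(doc))  # list of (match_id, start, end) triples; likely empty here
```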

View File

@@ -6,12 +6,12 @@ blis>=0.4.0,<0.5.0
murmurhash>=0.28.0,<1.1.0
wasabi>=0.4.0,<1.1.0
srsly>=0.1.0,<1.1.0
catalogue>=0.0.7,<1.1.0
# Third party dependencies
numpy>=1.15.0
requests>=2.13.0,<3.0.0
plac>=0.9.6,<1.2.0
pathlib==1.0.1; python_version < "3.4"
importlib_metadata>=0.20; python_version < "3.8"
# Optional dependencies
jsonschema>=2.6.0,<3.1.0
# Development dependencies

View File

@@ -48,13 +48,13 @@ install_requires =
blis>=0.4.0,<0.5.0
wasabi>=0.4.0,<1.1.0
srsly>=0.1.0,<1.1.0
catalogue>=0.0.7,<1.1.0
# Third-party dependencies
setuptools
numpy>=1.15.0
plac>=0.9.6,<1.2.0
requests>=2.13.0,<3.0.0
pathlib==1.0.1; python_version < "3.4"
importlib_metadata>=0.20; python_version < "3.8"
[options.extras_require]
lookups =

View File

@@ -15,7 +15,7 @@ from .glossary import explain
from .about import __version__
from .errors import Errors, Warnings, deprecation_warning
from . import util
from .util import register_architecture, get_architecture
from .util import registry
from .language import component

View File

@@ -36,11 +36,6 @@ try:
except ImportError:
cupy = None
try: # Python 3.8
import importlib.metadata as importlib_metadata
except ImportError:
import importlib_metadata # noqa: F401
try:
from thinc.neural.optimizers import Optimizer # noqa: F401
except ImportError:

View File

@@ -5,7 +5,7 @@ import uuid
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
from ..util import minify_html, escape_html, get_entry_points, ENTRY_POINTS
from ..util import minify_html, escape_html, registry
from ..errors import Errors
@@ -242,7 +242,7 @@ class EntityRenderer(object):
"CARDINAL": "#e4e7d2",
"PERCENT": "#e4e7d2",
}
user_colors = get_entry_points(ENTRY_POINTS.displacy_colors)
user_colors = registry.displacy_colors.get_all()
for user_color in user_colors.values():
colors.update(user_color)
colors.update(options.get("colors", {}))
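The `registry.displacy_colors.get_all()` call replaces the old `get_entry_points` helper: anything registered under that catalogue registry, directly or via a `spacy_displacy_colors` entry point, gets merged into the renderer's color map. A small sketch; the entity label `MY_ENT` and the color are made up:

```python
from spacy import registry

# A third-party package (or user code) registers a dict of label -> color.
registry.displacy_colors.register("my_colors", func={"MY_ENT": "#bfe1d9"})

# What EntityRenderer now does internally before rendering:
user_colors = registry.displacy_colors.get_all()
assert user_colors["my_colors"] == {"MY_ENT": "#bfe1d9"}
```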

View File

@@ -11,12 +11,12 @@ Example sentences to test spaCy and its language models.
sentences = [
"Apple está buscando comprar una startup del Reino Unido por mil millones de dólares",
"Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes",
"San Francisco analiza prohibir los robots delivery",
"Londres es una gran ciudad del Reino Unido",
"El gato come pescado",
"Veo al hombre con el telescopio",
"La araña come moscas",
"El pingüino incuba en su nido",
"Apple está buscando comprar una startup del Reino Unido por mil millones de dólares.",
"Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes.",
"San Francisco analiza prohibir los robots delivery.",
"Londres es una gran ciudad del Reino Unido.",
"El gato come pescado.",
"Veo al hombre con el telescopio.",
"La araña come moscas.",
"El pingüino incuba en su nido.",
]

View File

@@ -11,8 +11,8 @@ Example sentences to test spaCy and its language models.
sentences = [
"Apple vurderer å kjøpe britisk oppstartfirma for en milliard dollar",
"Selvkjørende biler flytter forsikringsansvaret over på produsentene ",
"San Francisco vurderer å forby robotbud på fortauene",
"Apple vurderer å kjøpe britisk oppstartfirma for en milliard dollar.",
"Selvkjørende biler flytter forsikringsansvaret over på produsentene.",
"San Francisco vurderer å forby robotbud på fortauene.",
"London er en stor by i Storbritannia.",
]

spacy/lang/xx/examples.py Normal file
View File

@@ -0,0 +1,99 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.xx.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
# combined examples from de/en/es/fr/it/nl/pl/pt/ru
sentences = [
"Die ganze Stadt ist ein Startup: Shenzhen ist das Silicon Valley für Hardware-Firmen",
"Wie deutsche Startups die Technologie vorantreiben wollen: Künstliche Intelligenz",
"Trend zum Urlaub in Deutschland beschert Gastwirten mehr Umsatz",
"Bundesanwaltschaft erhebt Anklage gegen mutmaßlichen Schweizer Spion",
"San Francisco erwägt Verbot von Lieferrobotern",
"Autonome Fahrzeuge verlagern Haftpflicht auf Hersteller",
"Wo bist du?",
"Was ist die Hauptstadt von Deutschland?",
"Apple is looking at buying U.K. startup for $1 billion",
"Autonomous cars shift insurance liability toward manufacturers",
"San Francisco considers banning sidewalk delivery robots",
"London is a big city in the United Kingdom.",
"Where are you?",
"Who is the president of France?",
"What is the capital of the United States?",
"When was Barack Obama born?",
"Apple está buscando comprar una startup del Reino Unido por mil millones de dólares.",
"Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes.",
"San Francisco analiza prohibir los robots delivery.",
"Londres es una gran ciudad del Reino Unido.",
"El gato come pescado.",
"Veo al hombre con el telescopio.",
"La araña come moscas.",
"El pingüino incuba en su nido.",
"Apple cherche à acheter une start-up anglaise pour 1 milliard de dollars",
"Les voitures autonomes déplacent la responsabilité de l'assurance vers les constructeurs",
"San Francisco envisage d'interdire les robots coursiers sur les trottoirs",
"Londres est une grande ville du Royaume-Uni",
"LItalie choisit ArcelorMittal pour reprendre la plus grande aciérie dEurope",
"Apple lance HomePod parce qu'il se sent menacé par l'Echo d'Amazon",
"La France ne devrait pas manquer d'électricité cet été, même en cas de canicule",
"Nouvelles attaques de Trump contre le maire de Londres",
"Où es-tu ?",
"Qui est le président de la France ?",
"Où est la capitale des États-Unis ?",
"Quand est né Barack Obama ?",
"Apple vuole comprare una startup del Regno Unito per un miliardo di dollari",
"Le automobili a guida autonoma spostano la responsabilità assicurativa verso i produttori",
"San Francisco prevede di bandire i robot di consegna porta a porta",
"Londra è una grande città del Regno Unito.",
"Apple overweegt om voor 1 miljard een U.K. startup te kopen",
"Autonome auto's verschuiven de verzekeringverantwoordelijkheid naar producenten",
"San Francisco overweegt robots op voetpaden te verbieden",
"Londen is een grote stad in het Verenigd Koninkrijk",
"Poczuł przyjemną woń mocnej kawy.",
"Istnieje wiele dróg oddziaływania substancji psychoaktywnej na układ nerwowy.",
"Powitał mnie biało-czarny kot, płosząc siedzące na płocie trzy dorodne dudki.",
"Nowy abonament pod lupą Komisji Europejskiej",
"Czy w ciągu ostatnich 48 godzin spożyłeś leki zawierające paracetamol?",
"Kto ma ochotę zapoznać się z innymi niż w książkach przygodami Muminków i ich przyjaciół, temu polecam komiks Tove Jansson „Muminki i morze”.",
"Apple está querendo comprar uma startup do Reino Unido por 100 milhões de dólares.",
"Carros autônomos empurram a responsabilidade do seguro para os fabricantes.."
"São Francisco considera banir os robôs de entrega que andam pelas calçadas.",
"Londres é a maior cidade do Reino Unido.",
# Translations from English:
"Apple рассматривает возможность покупки стартапа из Соединённого Королевства за $1 млрд",
"Беспилотные автомобили перекладывают страховую ответственность на производителя",
"В Сан-Франциско рассматривается возможность запрета роботов-курьеров, которые перемещаются по тротуару",
"Лондон — это большой город в Соединённом Королевстве",
# Native Russian sentences:
# Colloquial:
"Да, нет, наверное!", # Typical polite refusal
"Обратите внимание на необыкновенную красоту этого города-героя Москвы, столицы нашей Родины!", # From a tour guide speech
# Examples of Bookish Russian:
# Quote from "The Golden Calf"
"Рио-де-Жанейро — это моя мечта, и не смейте касаться её своими грязными лапами!",
# Quotes from "Ivan Vasilievich changes his occupation"
"Ты пошто боярыню обидел, смерд?!!",
"Оставь меня, старушка, я в печали!",
# Quotes from Dostoevsky:
"Уж коли я, такой же, как и ты, человек грешный, над тобой умилился и пожалел тебя, кольми паче бог",
"В мечтах я нередко, говорит, доходил до страстных помыслов о служении человечеству и может быть действительно пошел бы на крест за людей, если б это вдруг как-нибудь потребовалось, а между тем я двух дней не в состоянии прожить ни с кем в одной комнате, о чем знаю из опыта",
"Зато всегда так происходило, что чем более я ненавидел людей в частности, тем пламеннее становилась любовь моя к человечеству вообще",
# Quotes from Chekhov:
"Ненужные дела и разговоры всё об одном отхватывают на свою долю лучшую часть времени, лучшие силы, и в конце концов остается какая-то куцая, бескрылая жизнь, какая-то чепуха, и уйти и бежать нельзя, точно сидишь в сумасшедшем доме или в арестантских ротах!",
# Quotes from Turgenev:
"Нравится тебе женщина, старайся добиться толку; а нельзя — ну, не надо, отвернись — земля не клином сошлась",
"Узенькое местечко, которое я занимаю, до того крохотно в сравнении с остальным пространством, где меня нет и где дела до меня нет; и часть времени, которую мне удастся прожить, так ничтожна перед вечностью, где меня не было и не будет...",
# Quotes from newspapers:
# Komsomolskaya Pravda:
"На заседании президиума правительства Москвы принято решение присвоить статус инвестиционного приоритетного проекта города Москвы киностудии Союзмультфильм",
"Глава Минобороны Сергей Шойгу заявил, что обстановка на этом стратегическом направлении требует непрерывного совершенствования боевого состава войск",
# Argumenty i Facty:
"На реплику лже-Говина — дескать, он (Волков) будет лучшим революционером — Стамп с энтузиазмом ответил: Непременно!",
]

View File

@@ -4,19 +4,92 @@ from __future__ import unicode_literals
from ...attrs import LANG
from ...language import Language
from ...tokens import Doc
from ...util import DummyTokenizer
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from .lex_attrs import LEX_ATTRS
from .stop_words import STOP_WORDS
from .tag_map import TAG_MAP
def try_jieba_import(use_jieba):
try:
import jieba
return jieba
except ImportError:
if use_jieba:
msg = (
"Jieba not installed. Either set Chinese.use_jieba = False, "
"or install it https://github.com/fxsjy/jieba"
)
raise ImportError(msg)
class ChineseTokenizer(DummyTokenizer):
def __init__(self, cls, nlp=None):
self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
self.use_jieba = cls.use_jieba
self.jieba_seg = try_jieba_import(self.use_jieba)
self.tokenizer = Language.Defaults().create_tokenizer(nlp)
def __call__(self, text):
# use jieba
if self.use_jieba:
jieba_words = list([x for x in self.jieba_seg.cut(text, cut_all=False) if x])
words = [jieba_words[0]]
spaces = [False]
for i in range(1, len(jieba_words)):
word = jieba_words[i]
if word.isspace():
# second token in adjacent whitespace following a
# non-space token
if spaces[-1]:
words.append(word)
spaces.append(False)
# first space token following non-space token
elif word == " " and not words[-1].isspace():
spaces[-1] = True
# token is non-space whitespace or any whitespace following
# a whitespace token
else:
# extend previous whitespace token with more whitespace
if words[-1].isspace():
words[-1] += word
# otherwise it's a new whitespace token
else:
words.append(word)
spaces.append(False)
else:
words.append(word)
spaces.append(False)
return Doc(self.vocab, words=words, spaces=spaces)
# split into individual characters
words = []
spaces = []
for token in self.tokenizer(text):
if token.text.isspace():
words.append(token.text)
spaces.append(False)
else:
words.extend(list(token.text))
spaces.extend([False] * len(token.text))
spaces[-1] = bool(token.whitespace_)
return Doc(self.vocab, words=words, spaces=spaces)
class ChineseDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: "zh"
use_jieba = True
tokenizer_exceptions = BASE_EXCEPTIONS
stop_words = STOP_WORDS
tag_map = TAG_MAP
writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
use_jieba = True
@classmethod
def create_tokenizer(cls, nlp=None):
return ChineseTokenizer(cls, nlp)
class Chinese(Language):
@@ -24,26 +97,7 @@ class Chinese(Language):
Defaults = ChineseDefaults # override defaults
def make_doc(self, text):
if self.Defaults.use_jieba:
try:
import jieba
except ImportError:
msg = (
"Jieba not installed. Either set Chinese.use_jieba = False, "
"or install it https://github.com/fxsjy/jieba"
)
raise ImportError(msg)
words = list(jieba.cut(text, cut_all=False))
words = [x for x in words if x]
return Doc(self.vocab, words=words, spaces=[False] * len(words))
else:
words = []
spaces = []
for token in self.tokenizer(text):
words.extend(list(token.text))
spaces.extend([False] * len(token.text))
spaces[-1] = bool(token.whitespace_)
return Doc(self.vocab, words=words, spaces=spaces)
return self.tokenizer(text)
__all__ = ["Chinese"]
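Rough usage sketch of the refactored tokenizer: with `use_jieba` enabled (the default) the words come from `jieba.cut()`, otherwise the text is split into single characters. This assumes jieba is installed, since the import now happens when the tokenizer is created:

```python
from spacy.lang.zh import Chinese

nlp = Chinese()  # raises ImportError here if jieba is missing and use_jieba is True
doc = nlp("作为语言而言，为世界使用人数最多的语言")
print([t.text for t in doc])      # jieba word segmentation

nlp.tokenizer.use_jieba = False   # the same toggle the tests use
doc = nlp("作为语言而言")
print([t.text for t in doc])      # falls back to one token per character
```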

View File

@@ -1,11 +1,12 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import POS, PUNCT, ADJ, CONJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
from ...symbols import NOUN, PART, INTJ, PRON
from ...symbols import POS, PUNCT, ADJ, SCONJ, CCONJ, NUM, DET, ADV, ADP, X
from ...symbols import NOUN, PART, INTJ, PRON, VERB, SPACE
# The Chinese part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set.
# We also map the tags to the simpler Google Universal POS tag set.
# The Chinese part-of-speech tagger uses the OntoNotes 5 version of the Penn
# Treebank tag set. We also map the tags to the simpler Universal Dependencies
# v2 tag set.
TAG_MAP = {
"AS": {POS: PART},
@@ -38,10 +39,11 @@ TAG_MAP = {
"OD": {POS: NUM},
"DT": {POS: DET},
"CC": {POS: CCONJ},
"CS": {POS: CONJ},
"CS": {POS: SCONJ},
"AD": {POS: ADV},
"JJ": {POS: ADJ},
"P": {POS: ADP},
"PN": {POS: PRON},
"PU": {POS: PUNCT},
"_SP": {POS: SPACE},
}
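A quick sanity check of the remapping, as a sketch that assumes this hunk belongs to `spacy/lang/zh/tag_map.py`:

```python
from spacy.lang.zh.tag_map import TAG_MAP
from spacy.symbols import POS, SCONJ, SPACE

assert TAG_MAP["CS"][POS] == SCONJ   # subordinating conjunctions no longer map to CONJ
assert TAG_MAP["_SP"][POS] == SPACE  # whitespace pseudo-tag added
```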

View File

@@ -51,8 +51,8 @@ class BaseDefaults(object):
filenames = {name: root / filename for name, filename in cls.resources}
if LANG in cls.lex_attr_getters:
lang = cls.lex_attr_getters[LANG](None)
user_lookups = util.get_entry_point(util.ENTRY_POINTS.lookups, lang, {})
filenames.update(user_lookups)
if lang in util.registry.lookups:
filenames.update(util.registry.lookups.get(lang))
lookups = Lookups()
for name, filename in filenames.items():
data = util.load_language_data(filename)
@@ -155,7 +155,7 @@ class Language(object):
100,000 characters in one text.
RETURNS (Language): The newly constructed object.
"""
user_factories = util.get_entry_points(util.ENTRY_POINTS.factories)
user_factories = util.registry.factories.get_all()
self.factories.update(user_factories)
self._meta = dict(meta)
self._path = None
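With the catalogue-backed registries, factories registered under `registry.factories`, directly or through a `spacy_factories` entry point, end up in `Language.factories` and become available to `create_pipe()`. A minimal sketch; the `debug_component` name and factory are made up:

```python
from spacy import registry
from spacy.lang.en import English

@registry.factories.register("debug_component")
def create_debug_component(nlp, **cfg):
    def debug_component(doc):
        print("processing:", doc.text)
        return doc
    return debug_component

nlp = English()  # __init__ pulls registry.factories.get_all() into nlp.factories
nlp.add_pipe(nlp.create_pipe("debug_component"))
nlp("Hello world")
```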

View File

@@ -240,7 +240,7 @@ cdef class DependencyMatcher:
for i, (ent_id, nodes) in enumerate(matched_key_trees):
on_match = self._callbacks.get(ent_id)
if on_match is not None:
on_match(self, doc, i, matches)
on_match(self, doc, i, matched_key_trees)
return matched_key_trees
def recurse(self,tree,id_to_position,_node_operator_map,int patternLength,visitedNodes,matched_trees):

View File

@@ -3,10 +3,10 @@ from __future__ import unicode_literals
from thinc.api import chain
from thinc.v2v import Maxout
from thinc.misc import LayerNorm
from ..util import register_architecture, make_layer
from ..util import registry, make_layer
@register_architecture("thinc.FeedForward.v1")
@registry.architectures.register("thinc.FeedForward.v1")
def FeedForward(config):
layers = [make_layer(layer_cfg) for layer_cfg in config["layers"]]
model = chain(*layers)
@@ -14,7 +14,7 @@ def FeedForward(config):
return model
@register_architecture("spacy.LayerNormalizedMaxout.v1")
@registry.architectures.register("spacy.LayerNormalizedMaxout.v1")
def LayerNormalizedMaxout(config):
width = config["width"]
pieces = config["pieces"]

View File

@@ -6,11 +6,11 @@ from thinc.v2v import Maxout, Model
from thinc.i2v import HashEmbed, StaticVectors
from thinc.t2t import ExtractWindow
from thinc.misc import Residual, LayerNorm, FeatureExtracter
from ..util import make_layer, register_architecture
from ..util import make_layer, registry
from ._wire import concatenate_lists
@register_architecture("spacy.Tok2Vec.v1")
@registry.architectures.register("spacy.Tok2Vec.v1")
def Tok2Vec(config):
doc2feats = make_layer(config["@doc2feats"])
embed = make_layer(config["@embed"])
@@ -24,13 +24,13 @@ def Tok2Vec(config):
return tok2vec
@register_architecture("spacy.Doc2Feats.v1")
@registry.architectures.register("spacy.Doc2Feats.v1")
def Doc2Feats(config):
columns = config["columns"]
return FeatureExtracter(columns)
@register_architecture("spacy.MultiHashEmbed.v1")
@registry.architectures.register("spacy.MultiHashEmbed.v1")
def MultiHashEmbed(config):
# For backwards compatibility with models before the architecture registry,
# we have to be careful to get exactly the same model structure. One subtle
@@ -78,7 +78,7 @@ def MultiHashEmbed(config):
return layer
@register_architecture("spacy.CharacterEmbed.v1")
@registry.architectures.register("spacy.CharacterEmbed.v1")
def CharacterEmbed(config):
from .. import _ml
@@ -94,7 +94,7 @@ def CharacterEmbed(config):
return model
@register_architecture("spacy.MaxoutWindowEncoder.v1")
@registry.architectures.register("spacy.MaxoutWindowEncoder.v1")
def MaxoutWindowEncoder(config):
nO = config["width"]
nW = config["window_size"]
@@ -110,7 +110,7 @@ def MaxoutWindowEncoder(config):
return model
@register_architecture("spacy.MishWindowEncoder.v1")
@registry.architectures.register("spacy.MishWindowEncoder.v1")
def MishWindowEncoder(config):
from thinc.v2v import Mish
@@ -124,12 +124,12 @@ def MishWindowEncoder(config):
return model
@register_architecture("spacy.PretrainedVectors.v1")
@registry.architectures.register("spacy.PretrainedVectors.v1")
def PretrainedVectors(config):
return StaticVectors(config["vectors_name"], config["width"], config["column"])
@register_architecture("spacy.TorchBiLSTMEncoder.v1")
@registry.architectures.register("spacy.TorchBiLSTMEncoder.v1")
def TorchBiLSTMEncoder(config):
import torch.nn
from thinc.extra.wrappers import PyTorchWrapperRNN
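All of these decorators now go through `registry.architectures`, and `make_layer()` resolves an `{"arch": ..., "config": ...}` block against that registry. A tiny sketch with a made-up architecture name:

```python
from thinc.v2v import Affine
from spacy import registry
from spacy.util import make_layer

@registry.architectures.register("demo.Affine.v1")
def affine_layer(config):
    # build a plain Affine layer from the config block
    return Affine(config["nO"], config["nI"])

layer = make_layer({"arch": "demo.Affine.v1", "config": {"nO": 4, "nI": 8}})
```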

View File

@@ -218,3 +218,9 @@ def uk_tokenizer():
@pytest.fixture(scope="session")
def ur_tokenizer():
return get_lang_class("ur").Defaults.create_tokenizer()
@pytest.fixture(scope="session")
def zh_tokenizer():
pytest.importorskip("jieba")
return get_lang_class("zh").Defaults.create_tokenizer()

View File

@@ -183,3 +183,18 @@ def test_doc_retokenizer_split_lex_attrs(en_vocab):
retokenizer.split(doc[0], ["Los", "Angeles"], heads, attrs=attrs)
assert doc[0].is_stop
assert not doc[1].is_stop
def test_doc_retokenizer_realloc(en_vocab):
"""#4604: realloc correctly when new tokens outnumber original tokens"""
text = "Hyperglycemic adverse events following antipsychotic drug administration in the"
doc = Doc(en_vocab, words=text.split()[:-1])
with doc.retokenize() as retokenizer:
token = doc[0]
heads = [(token, 0)] * len(token)
retokenizer.split(doc[token.i], list(token.text), heads=heads)
doc = Doc(en_vocab, words=text.split())
with doc.retokenize() as retokenizer:
token = doc[0]
heads = [(token, 0)] * len(token)
retokenizer.split(doc[token.i], list(token.text), heads=heads)

View File

@@ -0,0 +1,25 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
@pytest.mark.parametrize(
"text,match",
[
("10", True),
("1", True),
("999.0", True),
("", True),
("", True),
("", True),
("十一", True),
("", False),
(",", False),
],
)
def test_lex_attrs_like_number(zh_tokenizer, text, match):
tokens = zh_tokenizer(text)
assert len(tokens) == 1
assert tokens[0].like_num == match

View File

@@ -0,0 +1,31 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
# fmt: off
TOKENIZER_TESTS = [
("作为语言而言,为世界使用人数最多的语言,目前世界有五分之一人口做为母语。",
['作为', '语言', '而言', '', '', '世界', '使用', '', '数最多',
'', '语言', '', '目前', '世界', '', '五分之一', '人口', '',
'', '母语', '']),
]
# fmt: on
@pytest.mark.parametrize("text,expected_tokens", TOKENIZER_TESTS)
def test_zh_tokenizer(zh_tokenizer, text, expected_tokens):
zh_tokenizer.use_jieba = False
tokens = [token.text for token in zh_tokenizer(text)]
assert tokens == list(text)
zh_tokenizer.use_jieba = True
tokens = [token.text for token in zh_tokenizer(text)]
assert tokens == expected_tokens
def test_extra_spaces(zh_tokenizer):
# note: three spaces after "I"
tokens = zh_tokenizer("I   like cheese.")
assert tokens[1].orth_ == "  "

View File

@@ -0,0 +1,34 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from mock import Mock
from spacy.matcher import DependencyMatcher
from ..util import get_doc
def test_issue4590(en_vocab):
"""Test that matches param in on_match method are the same as matches run with no on_match method"""
pattern = [
{"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}},
{"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
{"SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
]
on_match = Mock()
matcher = DependencyMatcher(en_vocab)
matcher.add("pattern", on_match, pattern)
text = "The quick brown fox jumped over the lazy fox"
heads = [3, 2, 1, 1, 0, -1, 2, 1, -3]
deps = ["det", "amod", "amod", "nsubj", "prep", "pobj", "det", "amod"]
doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps)
matches = matcher(doc)
on_match_args = on_match.call_args
assert on_match_args[0][3] == matches

View File

@@ -0,0 +1,19 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from spacy import registry
from thinc.v2v import Affine
from catalogue import RegistryError
@registry.architectures.register("my_test_function")
def create_model(nr_in, nr_out):
return Affine(nr_in, nr_out)
def test_get_architecture():
arch = registry.architectures.get("my_test_function")
assert arch is create_model
with pytest.raises(RegistryError):
registry.architectures.get("not_an_existing_key")

View File

@@ -1,19 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from spacy import register_architecture
from spacy import get_architecture
from thinc.v2v import Affine
@register_architecture("my_test_function")
def create_model(nr_in, nr_out):
return Affine(nr_in, nr_out)
def test_get_architecture():
arch = get_architecture("my_test_function")
assert arch is create_model
with pytest.raises(KeyError):
get_architecture("not_an_existing_key")

View File

@@ -329,7 +329,7 @@ def _split(Doc doc, int token_index, orths, heads, attrs):
doc.c[i].head += offset
# Double doc.c max_length if necessary (until big enough for all new tokens)
while doc.length + nb_subtokens - 1 >= doc.max_length:
doc._realloc(doc.length * 2)
doc._realloc(doc.max_length * 2)
# Move tokens after the split to create space for the new tokens
doc.length = len(doc) + nb_subtokens -1
to_process_tensor = (doc.tensor is not None and doc.tensor.size != 0)
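The one-character change matters because `doc.length` does not change inside the `while` loop: doubling it produces the same target on every iteration, so the capacity can never catch up once the subtokens outnumber it, whereas doubling `doc.max_length` grows geometrically until the new tokens fit. A pure-Python illustration of the loop logic, not the actual Cython internals:

```python
def grow_capacity(length, max_length, nb_subtokens):
    # mirrors the fixed loop: keep doubling the capacity until the split fits
    while length + nb_subtokens - 1 >= max_length:
        max_length *= 2  # the old code effectively reset this to length * 2 every time
    return max_length

print(grow_capacity(length=8, max_length=8, nb_subtokens=30))  # -> 64
```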

View File

@@ -13,6 +13,7 @@ import functools
import itertools
import numpy.random
import srsly
import catalogue
import sys
try:
@@ -27,29 +28,20 @@ except ImportError:
from .symbols import ORTH
from .compat import cupy, CudaStream, path2str, basestring_, unicode_
from .compat import import_file, importlib_metadata
from .compat import import_file
from .errors import Errors, Warnings, deprecation_warning
LANGUAGES = {}
ARCHITECTURES = {}
_data_path = Path(__file__).parent / "data"
_PRINT_ENV = False
# NB: Only ever call this once! If called more than once within the
# function, test_issue1506 hangs and it's not 100% clear why.
AVAILABLE_ENTRY_POINTS = importlib_metadata.entry_points()
class ENTRY_POINTS(object):
"""Available entry points to register extensions."""
factories = "spacy_factories"
languages = "spacy_languages"
displacy_colors = "spacy_displacy_colors"
lookups = "spacy_lookups"
architectures = "spacy_architectures"
class registry(object):
languages = catalogue.create("spacy", "languages", entry_points=True)
architectures = catalogue.create("spacy", "architectures", entry_points=True)
lookups = catalogue.create("spacy", "lookups", entry_points=True)
factories = catalogue.create("spacy", "factories", entry_points=True)
displacy_colors = catalogue.create("spacy", "displacy_colors", entry_points=True)
def set_env_log(value):
@@ -65,8 +57,7 @@ def lang_class_is_loaded(lang):
lang (unicode): Two-letter language code, e.g. 'en'.
RETURNS (bool): Whether a Language class has been loaded.
"""
global LANGUAGES
return lang in LANGUAGES
return lang in registry.languages
def get_lang_class(lang):
@@ -75,19 +66,16 @@ def get_lang_class(lang):
lang (unicode): Two-letter language code, e.g. 'en'.
RETURNS (Language): Language class.
"""
global LANGUAGES
# Check if an entry point is exposed for the language code
entry_point = get_entry_point(ENTRY_POINTS.languages, lang)
if entry_point is not None:
LANGUAGES[lang] = entry_point
return entry_point
if lang not in LANGUAGES:
# Check if language is registered / entry point is available
if lang in registry.languages:
return registry.languages.get(lang)
else:
try:
module = importlib.import_module(".lang.%s" % lang, "spacy")
except ImportError as err:
raise ImportError(Errors.E048.format(lang=lang, err=err))
LANGUAGES[lang] = getattr(module, module.__all__[0])
return LANGUAGES[lang]
set_lang_class(lang, getattr(module, module.__all__[0]))
return registry.languages.get(lang)
def set_lang_class(name, cls):
@@ -96,8 +84,7 @@ def set_lang_class(name, cls):
name (unicode): Name of Language class.
cls (Language): Language class.
"""
global LANGUAGES
LANGUAGES[name] = cls
registry.languages.register(name, func=cls)
def get_data_path(require_exists=True):
@@ -121,49 +108,11 @@ def set_data_path(path):
_data_path = ensure_path(path)
def register_architecture(name, arch=None):
"""Decorator to register an architecture. An architecture is a function
that returns a Thinc Model object.
name (unicode): The name of the architecture to register.
arch (Model): Optional architecture if function is called directly and
not used as a decorator.
RETURNS (callable): Function to register architecture.
"""
global ARCHITECTURES
if arch is not None:
ARCHITECTURES[name] = arch
return arch
def do_registration(arch):
ARCHITECTURES[name] = arch
return arch
return do_registration
def make_layer(arch_config):
arch_func = get_architecture(arch_config["arch"])
arch_func = registry.architectures.get(arch_config["arch"])
return arch_func(arch_config["config"])
def get_architecture(name):
"""Get a model architecture function by name. Raises a KeyError if the
architecture is not found.
name (unicode): The name of the architecture.
RETURNS (Model): The architecture.
"""
# Check if an entry point is exposed for the architecture code
entry_point = get_entry_point(ENTRY_POINTS.architectures, name)
if entry_point is not None:
ARCHITECTURES[name] = entry_point
if name not in ARCHITECTURES:
names = ", ".join(sorted(ARCHITECTURES.keys()))
raise KeyError(Errors.E174.format(name=name, names=names))
return ARCHITECTURES[name]
def ensure_path(path):
"""Ensure string is converted to a Path.
@@ -327,34 +276,6 @@ def get_package_path(name):
return Path(pkg.__file__).parent
def get_entry_points(key):
"""Get registered entry points from other packages for a given key, e.g.
'spacy_factories' and return them as a dictionary, keyed by name.
key (unicode): Entry point name.
RETURNS (dict): Entry points, keyed by name.
"""
result = {}
for entry_point in AVAILABLE_ENTRY_POINTS.get(key, []):
result[entry_point.name] = entry_point.load()
return result
def get_entry_point(key, value, default=None):
"""Check if registered entry point is available for a given name and
load it. Otherwise, return None.
key (unicode): Entry point name.
value (unicode): Name of entry point to load.
default: Optional default value to return.
RETURNS: The loaded entry point or None.
"""
for entry_point in AVAILABLE_ENTRY_POINTS.get(key, []):
if entry_point.name == value:
return entry_point.load()
return default
def is_in_jupyter():
"""Check if user is running spaCy from a Jupyter notebook by detecting the
IPython kernel. Mainly used for the displaCy visualizer.
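The module-level `LANGUAGES`/`ARCHITECTURES` dicts and the `get_entry_point(s)` helpers are gone; everything funnels through the catalogue registries, which also pick up entry points (e.g. a package exposing `entry_points={"spacy_languages": [...]}` in its `setup.py`). A minimal sketch of the language path, with a made-up `"xyz"` code:

```python
from spacy.language import Language
from spacy.util import registry, set_lang_class, get_lang_class

class XyzDummy(Language):
    lang = "xyz"

set_lang_class("xyz", XyzDummy)          # now just registry.languages.register(...)
assert "xyz" in registry.languages       # what lang_class_is_loaded() checks
assert get_lang_class("xyz") is XyzDummy
```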

View File

@@ -109,8 +109,8 @@ raise an error if the pre-defined attrs of the two `DocBin`s don't match.
> doc_bin1.add(nlp("Hello world"))
> doc_bin2 = DocBin(attrs=["LEMMA", "POS"])
> doc_bin2.add(nlp("This is a sentence"))
> merged_bins = doc_bin1.merge(doc_bin2)
> assert len(merged_bins) == 2
> doc_bin1.merge(doc_bin2)
> assert len(doc_bin1) == 2
> ```
| Argument | Type | Description |

View File

@@ -1,9 +1,10 @@
A named entity is a "real-world object" that's assigned a name, for example, a
person, a country, a product or a book title. spaCy can **recognize**
[various types](/api/annotation#named-entities) of named entities in a document,
by asking the model for a **prediction**. Because models are statistical and
strongly depend on the examples they were trained on, this doesn't always work
_perfectly_ and might need some tuning later, depending on your use case.
person, a country, a product or a book title. spaCy can **recognize
[various types](/api/annotation#named-entities)** of named entities in a
document, by asking the model for a **prediction**. Because models are
statistical and strongly depend on the examples they were trained on, this
doesn't always work _perfectly_ and might need some tuning later, depending on
your use case.
Named entities are available as the `ents` property of a `Doc`:
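The snippet that usually follows this sentence in the docs looks something like the sketch below; the `en_core_web_sm` model name is an assumption:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```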