mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-03 22:06:37 +03:00
Merge branch 'master' into spacy.io
commit
71f5a5daa1
106
.github/contributors/prilopes.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Priscilla Lopes      |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2019-11-06           |
| GitHub username                | prilopes             |
| Website (optional)             |                      |
|
|
@ -221,6 +221,13 @@ def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
|
|||
|
||||
|
||||
def write_conllu(docs, file_):
|
||||
if not Token.has_extension("get_conllu_lines"):
|
||||
Token.set_extension("get_conllu_lines", method=get_token_conllu)
|
||||
if not Token.has_extension("begins_fused"):
|
||||
Token.set_extension("begins_fused", default=False)
|
||||
if not Token.has_extension("inside_fused"):
|
||||
Token.set_extension("inside_fused", default=False)
|
||||
|
||||
merger = Matcher(docs[0].vocab)
|
||||
merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}])
|
||||
for i, doc in enumerate(docs):
|
||||
|
|
|
@ -6,12 +6,12 @@ blis>=0.4.0,<0.5.0
|
|||
murmurhash>=0.28.0,<1.1.0
|
||||
wasabi>=0.4.0,<1.1.0
|
||||
srsly>=0.1.0,<1.1.0
|
||||
catalogue>=0.0.7,<1.1.0
|
||||
# Third party dependencies
|
||||
numpy>=1.15.0
|
||||
requests>=2.13.0,<3.0.0
|
||||
plac>=0.9.6,<1.2.0
|
||||
pathlib==1.0.1; python_version < "3.4"
|
||||
importlib_metadata>=0.20; python_version < "3.8"
|
||||
# Optional dependencies
|
||||
jsonschema>=2.6.0,<3.1.0
|
||||
# Development dependencies
|
||||
|
|
|
@ -48,13 +48,13 @@ install_requires =
|
|||
blis>=0.4.0,<0.5.0
|
||||
wasabi>=0.4.0,<1.1.0
|
||||
srsly>=0.1.0,<1.1.0
|
||||
catalogue>=0.0.7,<1.1.0
|
||||
# Third-party dependencies
|
||||
setuptools
|
||||
numpy>=1.15.0
|
||||
plac>=0.9.6,<1.2.0
|
||||
requests>=2.13.0,<3.0.0
|
||||
pathlib==1.0.1; python_version < "3.4"
|
||||
importlib_metadata>=0.20; python_version < "3.8"
|
||||
|
||||
[options.extras_require]
|
||||
lookups =
|
||||
|
|
|
@ -15,7 +15,7 @@ from .glossary import explain
|
|||
from .about import __version__
|
||||
from .errors import Errors, Warnings, deprecation_warning
|
||||
from . import util
|
||||
from .util import register_architecture, get_architecture
|
||||
from .util import registry
|
||||
from .language import component
|
||||
|
||||
|
||||
|
|
|
@ -36,11 +36,6 @@ try:
|
|||
except ImportError:
|
||||
cupy = None
|
||||
|
||||
try: # Python 3.8
|
||||
import importlib.metadata as importlib_metadata
|
||||
except ImportError:
|
||||
import importlib_metadata # noqa: F401
|
||||
|
||||
try:
|
||||
from thinc.neural.optimizers import Optimizer # noqa: F401
|
||||
except ImportError:
|
||||
|
|
|
@ -5,7 +5,7 @@ import uuid
|
|||
|
||||
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
|
||||
from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
|
||||
from ..util import minify_html, escape_html, get_entry_points, ENTRY_POINTS
|
||||
from ..util import minify_html, escape_html, registry
|
||||
from ..errors import Errors
|
||||
|
||||
|
||||
|
@ -242,7 +242,7 @@ class EntityRenderer(object):
|
|||
"CARDINAL": "#e4e7d2",
|
||||
"PERCENT": "#e4e7d2",
|
||||
}
|
||||
user_colors = get_entry_points(ENTRY_POINTS.displacy_colors)
|
||||
user_colors = registry.displacy_colors.get_all()
|
||||
for user_color in user_colors.values():
|
||||
colors.update(user_color)
|
||||
colors.update(options.get("colors", {}))
|
||||
|
|
|
@ -11,12 +11,12 @@ Example sentences to test spaCy and its language models.
|
|||
|
||||
|
||||
sentences = [
|
||||
"Apple está buscando comprar una startup del Reino Unido por mil millones de dólares",
|
||||
"Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes",
|
||||
"San Francisco analiza prohibir los robots delivery",
|
||||
"Londres es una gran ciudad del Reino Unido",
|
||||
"El gato come pescado",
|
||||
"Veo al hombre con el telescopio",
|
||||
"La araña come moscas",
|
||||
"El pingüino incuba en su nido",
|
||||
"Apple está buscando comprar una startup del Reino Unido por mil millones de dólares.",
|
||||
"Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes.",
|
||||
"San Francisco analiza prohibir los robots delivery.",
|
||||
"Londres es una gran ciudad del Reino Unido.",
|
||||
"El gato come pescado.",
|
||||
"Veo al hombre con el telescopio.",
|
||||
"La araña come moscas.",
|
||||
"El pingüino incuba en su nido.",
|
||||
]
|
||||
|
|
|
@ -11,8 +11,8 @@ Example sentences to test spaCy and its language models.
|
|||
|
||||
|
||||
sentences = [
|
||||
"Apple vurderer å kjøpe britisk oppstartfirma for en milliard dollar",
|
||||
"Selvkjørende biler flytter forsikringsansvaret over på produsentene ",
|
||||
"San Francisco vurderer å forby robotbud på fortauene",
|
||||
"Apple vurderer å kjøpe britisk oppstartfirma for en milliard dollar.",
|
||||
"Selvkjørende biler flytter forsikringsansvaret over på produsentene.",
|
||||
"San Francisco vurderer å forby robotbud på fortauene.",
|
||||
"London er en stor by i Storbritannia.",
|
||||
]
|
||||
|
|
99
spacy/lang/xx/examples.py
Normal file
|
@ -0,0 +1,99 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
>>> from spacy.lang.xx.examples import sentences
|
||||
>>> docs = nlp.pipe(sentences)
|
||||
"""
|
||||
|
||||
# combined examples from de/en/es/fr/it/nl/pl/pt/ru
|
||||
|
||||
sentences = [
|
||||
"Die ganze Stadt ist ein Startup: Shenzhen ist das Silicon Valley für Hardware-Firmen",
|
||||
"Wie deutsche Startups die Technologie vorantreiben wollen: Künstliche Intelligenz",
|
||||
"Trend zum Urlaub in Deutschland beschert Gastwirten mehr Umsatz",
|
||||
"Bundesanwaltschaft erhebt Anklage gegen mutmaßlichen Schweizer Spion",
|
||||
"San Francisco erwägt Verbot von Lieferrobotern",
|
||||
"Autonome Fahrzeuge verlagern Haftpflicht auf Hersteller",
|
||||
"Wo bist du?",
|
||||
"Was ist die Hauptstadt von Deutschland?",
|
||||
"Apple is looking at buying U.K. startup for $1 billion",
|
||||
"Autonomous cars shift insurance liability toward manufacturers",
|
||||
"San Francisco considers banning sidewalk delivery robots",
|
||||
"London is a big city in the United Kingdom.",
|
||||
"Where are you?",
|
||||
"Who is the president of France?",
|
||||
"What is the capital of the United States?",
|
||||
"When was Barack Obama born?",
|
||||
"Apple está buscando comprar una startup del Reino Unido por mil millones de dólares.",
|
||||
"Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes.",
|
||||
"San Francisco analiza prohibir los robots delivery.",
|
||||
"Londres es una gran ciudad del Reino Unido.",
|
||||
"El gato come pescado.",
|
||||
"Veo al hombre con el telescopio.",
|
||||
"La araña come moscas.",
|
||||
"El pingüino incuba en su nido.",
|
||||
"Apple cherche à acheter une start-up anglaise pour 1 milliard de dollars",
|
||||
"Les voitures autonomes déplacent la responsabilité de l'assurance vers les constructeurs",
|
||||
"San Francisco envisage d'interdire les robots coursiers sur les trottoirs",
|
||||
"Londres est une grande ville du Royaume-Uni",
|
||||
"L’Italie choisit ArcelorMittal pour reprendre la plus grande aciérie d’Europe",
|
||||
"Apple lance HomePod parce qu'il se sent menacé par l'Echo d'Amazon",
|
||||
"La France ne devrait pas manquer d'électricité cet été, même en cas de canicule",
|
||||
"Nouvelles attaques de Trump contre le maire de Londres",
|
||||
"Où es-tu ?",
|
||||
"Qui est le président de la France ?",
|
||||
"Où est la capitale des États-Unis ?",
|
||||
"Quand est né Barack Obama ?",
|
||||
"Apple vuole comprare una startup del Regno Unito per un miliardo di dollari",
|
||||
"Le automobili a guida autonoma spostano la responsabilità assicurativa verso i produttori",
|
||||
"San Francisco prevede di bandire i robot di consegna porta a porta",
|
||||
"Londra è una grande città del Regno Unito.",
|
||||
"Apple overweegt om voor 1 miljard een U.K. startup te kopen",
|
||||
"Autonome auto's verschuiven de verzekeringverantwoordelijkheid naar producenten",
|
||||
"San Francisco overweegt robots op voetpaden te verbieden",
|
||||
"Londen is een grote stad in het Verenigd Koninkrijk",
|
||||
"Poczuł przyjemną woń mocnej kawy.",
|
||||
"Istnieje wiele dróg oddziaływania substancji psychoaktywnej na układ nerwowy.",
|
||||
"Powitał mnie biało-czarny kot, płosząc siedzące na płocie trzy dorodne dudki.",
|
||||
"Nowy abonament pod lupą Komisji Europejskiej",
|
||||
"Czy w ciągu ostatnich 48 godzin spożyłeś leki zawierające paracetamol?",
|
||||
"Kto ma ochotę zapoznać się z innymi niż w książkach przygodami Muminków i ich przyjaciół, temu polecam komiks Tove Jansson „Muminki i morze”.",
|
||||
"Apple está querendo comprar uma startup do Reino Unido por 100 milhões de dólares.",
|
"Carros autônomos empurram a responsabilidade do seguro para os fabricantes.",
|
||||
"São Francisco considera banir os robôs de entrega que andam pelas calçadas.",
|
||||
"Londres é a maior cidade do Reino Unido.",
|
||||
# Translations from English:
|
||||
"Apple рассматривает возможность покупки стартапа из Соединённого Королевства за $1 млрд",
|
||||
"Беспилотные автомобили перекладывают страховую ответственность на производителя",
|
||||
"В Сан-Франциско рассматривается возможность запрета роботов-курьеров, которые перемещаются по тротуару",
|
||||
"Лондон — это большой город в Соединённом Королевстве",
|
||||
# Native Russian sentences:
|
||||
# Colloquial:
|
||||
"Да, нет, наверное!", # Typical polite refusal
|
||||
"Обратите внимание на необыкновенную красоту этого города-героя Москвы, столицы нашей Родины!", # From a tour guide speech
|
||||
# Examples of Bookish Russian:
|
||||
# Quote from "The Golden Calf"
|
||||
"Рио-де-Жанейро — это моя мечта, и не смейте касаться её своими грязными лапами!",
|
||||
# Quotes from "Ivan Vasilievich changes his occupation"
|
||||
"Ты пошто боярыню обидел, смерд?!!",
|
||||
"Оставь меня, старушка, я в печали!",
|
||||
# Quotes from Dostoevsky:
|
||||
"Уж коли я, такой же, как и ты, человек грешный, над тобой умилился и пожалел тебя, кольми паче бог",
|
||||
"В мечтах я нередко, говорит, доходил до страстных помыслов о служении человечеству и может быть действительно пошел бы на крест за людей, если б это вдруг как-нибудь потребовалось, а между тем я двух дней не в состоянии прожить ни с кем в одной комнате, о чем знаю из опыта",
|
||||
"Зато всегда так происходило, что чем более я ненавидел людей в частности, тем пламеннее становилась любовь моя к человечеству вообще",
|
||||
# Quotes from Chekhov:
|
||||
"Ненужные дела и разговоры всё об одном отхватывают на свою долю лучшую часть времени, лучшие силы, и в конце концов остается какая-то куцая, бескрылая жизнь, какая-то чепуха, и уйти и бежать нельзя, точно сидишь в сумасшедшем доме или в арестантских ротах!",
|
||||
# Quotes from Turgenev:
|
||||
"Нравится тебе женщина, старайся добиться толку; а нельзя — ну, не надо, отвернись — земля не клином сошлась",
|
||||
"Узенькое местечко, которое я занимаю, до того крохотно в сравнении с остальным пространством, где меня нет и где дела до меня нет; и часть времени, которую мне удастся прожить, так ничтожна перед вечностью, где меня не было и не будет...",
|
||||
# Quotes from newspapers:
|
||||
# Komsomolskaya Pravda:
|
||||
"На заседании президиума правительства Москвы принято решение присвоить статус инвестиционного приоритетного проекта города Москвы киностудии Союзмультфильм",
|
||||
"Глава Минобороны Сергей Шойгу заявил, что обстановка на этом стратегическом направлении требует непрерывного совершенствования боевого состава войск",
|
||||
# Argumenty i Facty:
|
||||
"На реплику лже-Говина — дескать, он (Волков) будет лучшим революционером — Стамп с энтузиазмом ответил: Непременно!",
|
||||
]
|
|
@ -4,19 +4,92 @@ from __future__ import unicode_literals
|
|||
from ...attrs import LANG
|
||||
from ...language import Language
|
||||
from ...tokens import Doc
|
||||
from ...util import DummyTokenizer
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
from .stop_words import STOP_WORDS
|
||||
from .tag_map import TAG_MAP
|
||||
|
||||
|
||||
def try_jieba_import(use_jieba):
|
||||
try:
|
||||
import jieba
|
||||
return jieba
|
||||
except ImportError:
|
||||
if use_jieba:
|
||||
msg = (
|
||||
"Jieba not installed. Either set Chinese.use_jieba = False, "
|
||||
"or install it https://github.com/fxsjy/jieba"
|
||||
)
|
||||
raise ImportError(msg)
|
||||
|
||||
|
||||
class ChineseTokenizer(DummyTokenizer):
|
||||
def __init__(self, cls, nlp=None):
|
||||
self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
|
||||
self.use_jieba = cls.use_jieba
|
||||
self.jieba_seg = try_jieba_import(self.use_jieba)
|
||||
self.tokenizer = Language.Defaults().create_tokenizer(nlp)
|
||||
|
||||
def __call__(self, text):
|
||||
# use jieba
|
||||
if self.use_jieba:
|
||||
jieba_words = list([x for x in self.jieba_seg.cut(text, cut_all=False) if x])
|
||||
words = [jieba_words[0]]
|
||||
spaces = [False]
|
||||
for i in range(1, len(jieba_words)):
|
||||
word = jieba_words[i]
|
||||
if word.isspace():
|
||||
# second token in adjacent whitespace following a
|
||||
# non-space token
|
||||
if spaces[-1]:
|
||||
words.append(word)
|
||||
spaces.append(False)
|
||||
# first space token following non-space token
|
||||
elif word == " " and not words[-1].isspace():
|
||||
spaces[-1] = True
|
||||
# token is non-space whitespace or any whitespace following
|
||||
# a whitespace token
|
||||
else:
|
||||
# extend previous whitespace token with more whitespace
|
||||
if words[-1].isspace():
|
||||
words[-1] += word
|
||||
# otherwise it's a new whitespace token
|
||||
else:
|
||||
words.append(word)
|
||||
spaces.append(False)
|
||||
else:
|
||||
words.append(word)
|
||||
spaces.append(False)
|
||||
return Doc(self.vocab, words=words, spaces=spaces)
|
||||
|
||||
# split into individual characters
|
||||
words = []
|
||||
spaces = []
|
||||
for token in self.tokenizer(text):
|
||||
if token.text.isspace():
|
||||
words.append(token.text)
|
||||
spaces.append(False)
|
||||
else:
|
||||
words.extend(list(token.text))
|
||||
spaces.extend([False] * len(token.text))
|
||||
spaces[-1] = bool(token.whitespace_)
|
||||
return Doc(self.vocab, words=words, spaces=spaces)
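
The whitespace handling in the jieba branch above can be sketched in plain Python, without spaCy or jieba. The function name `align_spaces` and the hand-written segment list are assumptions for illustration; real jieba output may segment the text differently. The sketch also returns empty lists for empty input, a guard the hunk above does not have.

```python
def align_spaces(segments):
    """Turn a list of segments into parallel (words, spaces) lists.

    A single " " that directly follows a non-space word is recorded as that
    word's trailing space; any other whitespace becomes (or extends) a
    whitespace token of its own.
    """
    if not segments:
        return [], []
    words = [segments[0]]
    spaces = [False]
    for word in segments[1:]:
        if not word.isspace():
            words.append(word)
            spaces.append(False)
        elif spaces[-1]:
            # previous word already owns a trailing space: start a new token
            words.append(word)
            spaces.append(False)
        elif word == " " and not words[-1].isspace():
            # first single space after a non-space word
            spaces[-1] = True
        elif words[-1].isspace():
            # extend the previous whitespace token
            words[-1] += word
        else:
            # non-space whitespace (e.g. "\n") right after a word
            words.append(word)
            spaces.append(False)
    return words, spaces


# Hand-written segments for "I" followed by three spaces (an assumption,
# not actual jieba output).
words, spaces = align_spaces(["I", " ", " ", " ", "like", " ", "cheese", "."])
rebuilt = "".join(w + (" " if s else "") for w, s in zip(words, spaces))
```

Joining each word with its optional trailing space reproduces the input text exactly, which is the property a `Doc` built from `words` and `spaces` relies on.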
|
||||
|
||||
|
||||
class ChineseDefaults(Language.Defaults):
|
||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters.update(LEX_ATTRS)
|
||||
lex_attr_getters[LANG] = lambda text: "zh"
|
||||
use_jieba = True
|
||||
tokenizer_exceptions = BASE_EXCEPTIONS
|
||||
stop_words = STOP_WORDS
|
||||
tag_map = TAG_MAP
|
||||
writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
|
||||
use_jieba = True
|
||||
|
||||
@classmethod
|
||||
def create_tokenizer(cls, nlp=None):
|
||||
return ChineseTokenizer(cls, nlp)
|
||||
|
||||
|
||||
class Chinese(Language):
|
||||
|
@ -24,26 +97,7 @@ class Chinese(Language):
|
|||
Defaults = ChineseDefaults # override defaults
|
||||
|
||||
def make_doc(self, text):
|
||||
if self.Defaults.use_jieba:
|
||||
try:
|
||||
import jieba
|
||||
except ImportError:
|
||||
msg = (
|
||||
"Jieba not installed. Either set Chinese.use_jieba = False, "
|
||||
"or install it https://github.com/fxsjy/jieba"
|
||||
)
|
||||
raise ImportError(msg)
|
||||
words = list(jieba.cut(text, cut_all=False))
|
||||
words = [x for x in words if x]
|
||||
return Doc(self.vocab, words=words, spaces=[False] * len(words))
|
||||
else:
|
||||
words = []
|
||||
spaces = []
|
||||
for token in self.tokenizer(text):
|
||||
words.extend(list(token.text))
|
||||
spaces.extend([False] * len(token.text))
|
||||
spaces[-1] = bool(token.whitespace_)
|
||||
return Doc(self.vocab, words=words, spaces=spaces)
|
||||
return self.tokenizer(text)
|
||||
|
||||
|
||||
__all__ = ["Chinese"]
|
||||
|
|
|
@ -1,11 +1,12 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import POS, PUNCT, ADJ, CONJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
|
||||
from ...symbols import NOUN, PART, INTJ, PRON
|
||||
from ...symbols import POS, PUNCT, ADJ, SCONJ, CCONJ, NUM, DET, ADV, ADP, X
|
||||
from ...symbols import NOUN, PART, INTJ, PRON, VERB, SPACE
|
||||
|
||||
# The Chinese part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set.
|
||||
# We also map the tags to the simpler Google Universal POS tag set.
|
||||
# The Chinese part-of-speech tagger uses the OntoNotes 5 version of the Penn
|
||||
# Treebank tag set. We also map the tags to the simpler Universal Dependencies
|
||||
# v2 tag set.
|
||||
|
||||
TAG_MAP = {
|
||||
"AS": {POS: PART},
|
||||
|
@ -38,10 +39,11 @@ TAG_MAP = {
|
|||
"OD": {POS: NUM},
|
||||
"DT": {POS: DET},
|
||||
"CC": {POS: CCONJ},
|
||||
"CS": {POS: CONJ},
|
||||
"CS": {POS: SCONJ},
|
||||
"AD": {POS: ADV},
|
||||
"JJ": {POS: ADJ},
|
||||
"P": {POS: ADP},
|
||||
"PN": {POS: PRON},
|
||||
"PU": {POS: PUNCT},
|
||||
"_SP": {POS: SPACE},
|
||||
}
|
||||
|
|
|
@ -51,8 +51,8 @@ class BaseDefaults(object):
|
|||
filenames = {name: root / filename for name, filename in cls.resources}
|
||||
if LANG in cls.lex_attr_getters:
|
||||
lang = cls.lex_attr_getters[LANG](None)
|
||||
user_lookups = util.get_entry_point(util.ENTRY_POINTS.lookups, lang, {})
|
||||
filenames.update(user_lookups)
|
||||
if lang in util.registry.lookups:
|
||||
filenames.update(util.registry.lookups.get(lang))
|
||||
lookups = Lookups()
|
||||
for name, filename in filenames.items():
|
||||
data = util.load_language_data(filename)
|
||||
|
@ -155,7 +155,7 @@ class Language(object):
|
|||
100,000 characters in one text.
|
||||
RETURNS (Language): The newly constructed object.
|
||||
"""
|
||||
user_factories = util.get_entry_points(util.ENTRY_POINTS.factories)
|
||||
user_factories = util.registry.factories.get_all()
|
||||
self.factories.update(user_factories)
|
||||
self._meta = dict(meta)
|
||||
self._path = None
|
||||
|
|
|
@ -240,7 +240,7 @@ cdef class DependencyMatcher:
|
|||
for i, (ent_id, nodes) in enumerate(matched_key_trees):
|
||||
on_match = self._callbacks.get(ent_id)
|
||||
if on_match is not None:
|
||||
on_match(self, doc, i, matches)
|
||||
on_match(self, doc, i, matched_key_trees)
|
||||
return matched_key_trees
|
||||
|
||||
def recurse(self,tree,id_to_position,_node_operator_map,int patternLength,visitedNodes,matched_trees):
|
||||
|
|
|
@ -3,10 +3,10 @@ from __future__ import unicode_literals
|
|||
from thinc.api import chain
|
||||
from thinc.v2v import Maxout
|
||||
from thinc.misc import LayerNorm
|
||||
from ..util import register_architecture, make_layer
|
||||
from ..util import registry, make_layer
|
||||
|
||||
|
||||
@register_architecture("thinc.FeedForward.v1")
|
||||
@registry.architectures.register("thinc.FeedForward.v1")
|
||||
def FeedForward(config):
|
||||
layers = [make_layer(layer_cfg) for layer_cfg in config["layers"]]
|
||||
model = chain(*layers)
|
||||
|
@ -14,7 +14,7 @@ def FeedForward(config):
|
|||
return model
|
||||
|
||||
|
||||
@register_architecture("spacy.LayerNormalizedMaxout.v1")
|
||||
@registry.architectures.register("spacy.LayerNormalizedMaxout.v1")
|
||||
def LayerNormalizedMaxout(config):
|
||||
width = config["width"]
|
||||
pieces = config["pieces"]
|
||||
|
|
|
@ -6,11 +6,11 @@ from thinc.v2v import Maxout, Model
|
|||
from thinc.i2v import HashEmbed, StaticVectors
|
||||
from thinc.t2t import ExtractWindow
|
||||
from thinc.misc import Residual, LayerNorm, FeatureExtracter
|
||||
from ..util import make_layer, register_architecture
|
||||
from ..util import make_layer, registry
|
||||
from ._wire import concatenate_lists
|
||||
|
||||
|
||||
@register_architecture("spacy.Tok2Vec.v1")
|
||||
@registry.architectures.register("spacy.Tok2Vec.v1")
|
||||
def Tok2Vec(config):
|
||||
doc2feats = make_layer(config["@doc2feats"])
|
||||
embed = make_layer(config["@embed"])
|
||||
|
@ -24,13 +24,13 @@ def Tok2Vec(config):
|
|||
return tok2vec
|
||||
|
||||
|
||||
@register_architecture("spacy.Doc2Feats.v1")
|
||||
@registry.architectures.register("spacy.Doc2Feats.v1")
|
||||
def Doc2Feats(config):
|
||||
columns = config["columns"]
|
||||
return FeatureExtracter(columns)
|
||||
|
||||
|
||||
@register_architecture("spacy.MultiHashEmbed.v1")
|
||||
@registry.architectures.register("spacy.MultiHashEmbed.v1")
|
||||
def MultiHashEmbed(config):
|
||||
# For backwards compatibility with models before the architecture registry,
|
||||
# we have to be careful to get exactly the same model structure. One subtle
|
||||
|
@ -78,7 +78,7 @@ def MultiHashEmbed(config):
|
|||
return layer
|
||||
|
||||
|
||||
@register_architecture("spacy.CharacterEmbed.v1")
|
||||
@registry.architectures.register("spacy.CharacterEmbed.v1")
|
||||
def CharacterEmbed(config):
|
||||
from .. import _ml
|
||||
|
||||
|
@ -94,7 +94,7 @@ def CharacterEmbed(config):
|
|||
return model
|
||||
|
||||
|
||||
@register_architecture("spacy.MaxoutWindowEncoder.v1")
|
||||
@registry.architectures.register("spacy.MaxoutWindowEncoder.v1")
|
||||
def MaxoutWindowEncoder(config):
|
||||
nO = config["width"]
|
||||
nW = config["window_size"]
|
||||
|
@ -110,7 +110,7 @@ def MaxoutWindowEncoder(config):
|
|||
return model
|
||||
|
||||
|
||||
@register_architecture("spacy.MishWindowEncoder.v1")
|
||||
@registry.architectures.register("spacy.MishWindowEncoder.v1")
|
||||
def MishWindowEncoder(config):
|
||||
from thinc.v2v import Mish
|
||||
|
||||
|
@ -124,12 +124,12 @@ def MishWindowEncoder(config):
|
|||
return model
|
||||
|
||||
|
||||
@register_architecture("spacy.PretrainedVectors.v1")
|
||||
@registry.architectures.register("spacy.PretrainedVectors.v1")
|
||||
def PretrainedVectors(config):
|
||||
return StaticVectors(config["vectors_name"], config["width"], config["column"])
|
||||
|
||||
|
||||
@register_architecture("spacy.TorchBiLSTMEncoder.v1")
|
||||
@registry.architectures.register("spacy.TorchBiLSTMEncoder.v1")
|
||||
def TorchBiLSTMEncoder(config):
|
||||
import torch.nn
|
||||
from thinc.extra.wrappers import PyTorchWrapperRNN
|
||||
|
|
|
@ -218,3 +218,9 @@ def uk_tokenizer():
|
|||
@pytest.fixture(scope="session")
|
||||
def ur_tokenizer():
|
||||
return get_lang_class("ur").Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def zh_tokenizer():
|
||||
pytest.importorskip("jieba")
|
||||
return get_lang_class("zh").Defaults.create_tokenizer()
|
||||
|
|
|
@ -183,3 +183,18 @@ def test_doc_retokenizer_split_lex_attrs(en_vocab):
|
|||
retokenizer.split(doc[0], ["Los", "Angeles"], heads, attrs=attrs)
|
||||
assert doc[0].is_stop
|
||||
assert not doc[1].is_stop
|
||||
|
||||
|
||||
def test_doc_retokenizer_realloc(en_vocab):
|
||||
"""#4604: realloc correctly when new tokens outnumber original tokens"""
|
||||
text = "Hyperglycemic adverse events following antipsychotic drug administration in the"
|
||||
doc = Doc(en_vocab, words=text.split()[:-1])
|
||||
with doc.retokenize() as retokenizer:
|
||||
token = doc[0]
|
||||
heads = [(token, 0)] * len(token)
|
||||
retokenizer.split(doc[token.i], list(token.text), heads=heads)
|
||||
doc = Doc(en_vocab, words=text.split())
|
||||
with doc.retokenize() as retokenizer:
|
||||
token = doc[0]
|
||||
heads = [(token, 0)] * len(token)
|
||||
retokenizer.split(doc[token.i], list(token.text), heads=heads)
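
The reallocation case this test exercises can be sketched as a small capacity calculation. This is only an illustration: `grown_capacity` is a hypothetical helper, the starting capacities are made up, and the doubling step is inferred from the comment in the `_split` hunk of this commit ("Double doc.c max_length if necessary"), whose loop condition is reused here.

```python
def grown_capacity(length, nb_subtokens, max_length):
    """Return the capacity after doubling until the split tokens fit.

    Splitting one token into nb_subtokens pieces adds nb_subtokens - 1
    tokens to a document that currently holds `length` tokens.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    # Same condition as the `while` check in the `_split` hunk; one doubling
    # may not be enough, so keep going until the new length fits.
    while length + nb_subtokens - 1 >= max_length:
        max_length *= 2
    return max_length


# "Hyperglycemic" has 13 characters; the two docs in the test have 8 and 9
# tokens. The starting capacity of 8 is an assumption for illustration.
first = grown_capacity(8, len("Hyperglycemic"), 8)
second = grown_capacity(9, len("Hyperglycemic"), 8)
```

Using a loop instead of a single `if` matters when the number of new tokens exceeds the current capacity, which is the situation described in the test's docstring.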
|
||||
|
|
0
spacy/tests/lang/zh/__init__.py
Normal file
25
spacy/tests/lang/zh/test_text.py
Normal file
|
@ -0,0 +1,25 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"text,match",
|
||||
[
|
||||
("10", True),
|
||||
("1", True),
|
||||
("999.0", True),
|
||||
("一", True),
|
||||
("二", True),
|
||||
("〇", True),
|
||||
("十一", True),
|
||||
("狗", False),
|
||||
(",", False),
|
||||
],
|
||||
)
|
||||
def test_lex_attrs_like_number(zh_tokenizer, text, match):
|
||||
tokens = zh_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
assert tokens[0].like_num == match
|
31
spacy/tests/lang/zh/test_tokenizer.py
Normal file
|
@ -0,0 +1,31 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
# fmt: off
|
||||
TOKENIZER_TESTS = [
|
||||
("作为语言而言,为世界使用人数最多的语言,目前世界有五分之一人口做为母语。",
|
||||
['作为', '语言', '而言', ',', '为', '世界', '使用', '人', '数最多',
|
||||
'的', '语言', ',', '目前', '世界', '有', '五分之一', '人口', '做',
|
||||
'为', '母语', '。']),
|
||||
]
|
||||
# fmt: on
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", TOKENIZER_TESTS)
|
||||
def test_zh_tokenizer(zh_tokenizer, text, expected_tokens):
|
||||
zh_tokenizer.use_jieba = False
|
||||
tokens = [token.text for token in zh_tokenizer(text)]
|
||||
assert tokens == list(text)
|
||||
|
||||
zh_tokenizer.use_jieba = True
|
||||
tokens = [token.text for token in zh_tokenizer(text)]
|
||||
assert tokens == expected_tokens
|
||||
|
||||
|
||||
def test_extra_spaces(zh_tokenizer):
|
||||
# note: three spaces after "I"
|
||||
    tokens = zh_tokenizer("I   like cheese.")
    assert tokens[1].orth_ == "  "
|
34
spacy/tests/regression/test_issue4590.py
Normal file
|
@ -0,0 +1,34 @@
|
|||
# coding: utf-8
from __future__ import unicode_literals

import pytest
from mock import Mock
from spacy.matcher import DependencyMatcher
from ..util import get_doc


def test_issue4590(en_vocab):
    """Test that matches param in on_match method are the same as matches run with no on_match method"""
    pattern = [
        {"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}},
        {"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
        {"SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
    ]

    on_match = Mock()

    matcher = DependencyMatcher(en_vocab)
    matcher.add("pattern", on_match, pattern)

    text = "The quick brown fox jumped over the lazy fox"
    heads = [3, 2, 1, 1, 0, -1, 2, 1, -3]
    deps = ["det", "amod", "amod", "nsubj", "prep", "pobj", "det", "amod"]

    doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps)

    matches = matcher(doc)

    on_match_args = on_match.call_args

    assert on_match_args[0][3] == matches
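The final assertion in this test depends on how `Mock` records calls: `call_args` holds the most recent call as an `(args, kwargs)` pair, so `call_args[0][3]` is the fourth positional argument. The sketch below illustrates that indexing with the standard library's `unittest.mock` (the test itself imports the `mock` backport). The `fake_matcher` helper and the values it passes are hypothetical stand-ins for a matcher invoking its callback, not spaCy's implementation.

```python
from unittest.mock import Mock


def fake_matcher(callback, doc):
    """Hypothetical stand-in: invoke callback(matcher, doc, i, matches)
    once and return the same matches list to the caller."""
    matches = [("pattern", [[4, 3]])]
    callback("matcher", doc, 0, matches)
    return matches


on_match = Mock()
matches = fake_matcher(on_match, "doc")

# call_args unpacks into the positional and keyword arguments
# of the most recent call
args, kwargs = on_match.call_args
assert args[3] == matches
assert kwargs == {}
```

Indexing `call_args[0]` is equivalent to taking `args` here, which is why the test can compare `on_match_args[0][3]` against the return value of `matcher(doc)`.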
|
||||
|
19
spacy/tests/test_architectures.py
Normal file
|
@ -0,0 +1,19 @@
|
|||
# coding: utf8
from __future__ import unicode_literals

import pytest
from spacy import registry
from thinc.v2v import Affine
from catalogue import RegistryError


@registry.architectures.register("my_test_function")
def create_model(nr_in, nr_out):
    return Affine(nr_in, nr_out)


def test_get_architecture():
    arch = registry.architectures.get("my_test_function")
    assert arch is create_model
    with pytest.raises(RegistryError):
        registry.architectures.get("not_an_existing_key")
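The register/get pattern this test exercises can be sketched in plain Python. This is a minimal illustration only: the `Registry` class below is hypothetical and is not how `catalogue` is implemented. It has no entry-point discovery, and it raises `KeyError` where the test above expects `catalogue`'s `RegistryError`. The placeholder return value stands in for a real model object.

```python
class Registry(object):
    """Hypothetical minimal name -> function registry (not catalogue's code)."""

    def __init__(self):
        self._funcs = {}

    def register(self, name, func=None):
        # Direct call: register(name, func=obj)
        if func is not None:
            self._funcs[name] = func
            return func

        # Decorator use: @register(name)
        def do_registration(f):
            self._funcs[name] = f
            return f

        return do_registration

    def __contains__(self, name):
        return name in self._funcs

    def get(self, name):
        if name not in self._funcs:
            names = ", ".join(sorted(self._funcs))
            raise KeyError("Unknown name %r. Available: %s" % (name, names))
        return self._funcs[name]


architectures = Registry()


@architectures.register("my_test_function")
def create_model(nr_in, nr_out):
    return (nr_in, nr_out)  # placeholder for a real model object
```

Because the decorator returns the function unchanged, `architectures.get("my_test_function") is create_model` holds, which is the identity check the test makes.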
|
|
@ -1,19 +0,0 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy import register_architecture
|
||||
from spacy import get_architecture
|
||||
from thinc.v2v import Affine
|
||||
|
||||
|
||||
@register_architecture("my_test_function")
|
||||
def create_model(nr_in, nr_out):
|
||||
return Affine(nr_in, nr_out)
|
||||
|
||||
|
||||
def test_get_architecture():
|
||||
arch = get_architecture("my_test_function")
|
||||
assert arch is create_model
|
||||
with pytest.raises(KeyError):
|
||||
get_architecture("not_an_existing_key")
|
|
@ -329,7 +329,7 @@ def _split(Doc doc, int token_index, orths, heads, attrs):
|
|||
doc.c[i].head += offset
|
||||
# Double doc.c max_length if necessary (until big enough for all new tokens)
|
||||
while doc.length + nb_subtokens - 1 >= doc.max_length:
|
||||
doc._realloc(doc.length * 2)
|
||||
doc._realloc(doc.max_length * 2)
|
||||
# Move tokens after the split to create space for the new tokens
|
||||
doc.length = len(doc) + nb_subtokens -1
|
||||
to_process_tensor = (doc.tensor is not None and doc.tensor.size != 0)
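The one-line change in this hunk replaces `doc._realloc(doc.length * 2)` with `doc._realloc(doc.max_length * 2)`. A plain-Python sketch of the loop shows why that matters. It assumes `_realloc(n)` simply sets the capacity to `n`; the `grow_capacity` helper and its `max_steps` guard are hypothetical and exist only to make the non-terminating case observable.

```python
def grow_capacity(length, max_length, nb_subtokens, use_fix=True, max_steps=64):
    """Simulate the resize loop. Returns the final capacity, or None if the
    loop has not terminated after max_steps iterations."""
    for _ in range(max_steps):
        if length + nb_subtokens - 1 < max_length:
            return max_length
        # Fixed version doubles the current capacity; the old version
        # doubles the (unchanging) length, so the capacity can stall.
        max_length = (max_length if use_fix else length) * 2
    return None


# Splitting one token into 30 subtokens in a doc of length 20, capacity 20:
fixed = grow_capacity(20, 20, 30, use_fix=True)
old = grow_capacity(20, 20, 30, use_fix=False)
```

Under these assumptions the old expression only works when a single doubling of the length is enough. Since `doc.length` does not change inside the loop, repeated calls keep requesting the same size, whereas doubling `max_length` grows the buffer on every iteration.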
|
||||
|
|
113
spacy/util.py
|
@ -13,6 +13,7 @@ import functools
|
|||
import itertools
|
||||
import numpy.random
|
||||
import srsly
|
||||
import catalogue
|
||||
import sys
|
||||
|
||||
try:
|
||||
|
@ -27,29 +28,20 @@ except ImportError:
|
|||
|
||||
from .symbols import ORTH
|
||||
from .compat import cupy, CudaStream, path2str, basestring_, unicode_
|
||||
from .compat import import_file, importlib_metadata
|
||||
from .compat import import_file
|
||||
from .errors import Errors, Warnings, deprecation_warning
|
||||
|
||||
|
||||
LANGUAGES = {}
|
||||
ARCHITECTURES = {}
|
||||
_data_path = Path(__file__).parent / "data"
|
||||
_PRINT_ENV = False
|
||||
|
||||
|
||||
# NB: Only ever call this once! If called more than once within the
|
||||
# function, test_issue1506 hangs and it's not 100% clear why.
|
||||
AVAILABLE_ENTRY_POINTS = importlib_metadata.entry_points()
|
||||
|
||||
|
||||
class ENTRY_POINTS(object):
|
||||
"""Available entry points to register extensions."""
|
||||
|
||||
factories = "spacy_factories"
|
||||
languages = "spacy_languages"
|
||||
displacy_colors = "spacy_displacy_colors"
|
||||
lookups = "spacy_lookups"
|
||||
architectures = "spacy_architectures"
|
||||
class registry(object):
|
||||
languages = catalogue.create("spacy", "languages", entry_points=True)
|
||||
architectures = catalogue.create("spacy", "architectures", entry_points=True)
|
||||
lookups = catalogue.create("spacy", "lookups", entry_points=True)
|
||||
factories = catalogue.create("spacy", "factories", entry_points=True)
|
||||
displacy_colors = catalogue.create("spacy", "displacy_colors", entry_points=True)
|
||||
|
||||
|
||||
def set_env_log(value):
|
||||
|
@ -65,8 +57,7 @@ def lang_class_is_loaded(lang):
|
|||
lang (unicode): Two-letter language code, e.g. 'en'.
|
||||
RETURNS (bool): Whether a Language class has been loaded.
|
||||
"""
|
||||
global LANGUAGES
|
||||
return lang in LANGUAGES
|
||||
return lang in registry.languages
|
||||
|
||||
|
||||
def get_lang_class(lang):
|
||||
|
@ -75,19 +66,16 @@ def get_lang_class(lang):
|
|||
lang (unicode): Two-letter language code, e.g. 'en'.
|
||||
RETURNS (Language): Language class.
|
||||
"""
|
||||
global LANGUAGES
|
||||
# Check if an entry point is exposed for the language code
|
||||
entry_point = get_entry_point(ENTRY_POINTS.languages, lang)
|
||||
if entry_point is not None:
|
||||
LANGUAGES[lang] = entry_point
|
||||
return entry_point
|
||||
if lang not in LANGUAGES:
|
||||
# Check if language is registered / entry point is available
|
||||
if lang in registry.languages:
|
||||
return registry.languages.get(lang)
|
||||
else:
|
||||
try:
|
||||
module = importlib.import_module(".lang.%s" % lang, "spacy")
|
||||
except ImportError as err:
|
||||
raise ImportError(Errors.E048.format(lang=lang, err=err))
|
||||
LANGUAGES[lang] = getattr(module, module.__all__[0])
|
||||
return LANGUAGES[lang]
|
||||
set_lang_class(lang, getattr(module, module.__all__[0]))
|
||||
return registry.languages.get(lang)
|
||||
|
||||
|
||||
def set_lang_class(name, cls):
|
||||
|
@ -96,8 +84,7 @@ def set_lang_class(name, cls):
|
|||
name (unicode): Name of Language class.
|
||||
cls (Language): Language class.
|
||||
"""
|
||||
global LANGUAGES
|
||||
LANGUAGES[name] = cls
|
||||
registry.languages.register(name, func=cls)
|
||||
|
||||
|
||||
def get_data_path(require_exists=True):
|
||||
|
@ -121,49 +108,11 @@ def set_data_path(path):
|
|||
_data_path = ensure_path(path)
|
||||
|
||||
|
||||
def register_architecture(name, arch=None):
|
||||
"""Decorator to register an architecture. An architecture is a function
|
||||
that returns a Thinc Model object.
|
||||
|
||||
name (unicode): The name of the architecture to register.
|
||||
arch (Model): Optional architecture if function is called directly and
|
||||
not used as a decorator.
|
||||
RETURNS (callable): Function to register architecture.
|
||||
"""
|
||||
global ARCHITECTURES
|
||||
if arch is not None:
|
||||
ARCHITECTURES[name] = arch
|
||||
return arch
|
||||
|
||||
def do_registration(arch):
|
||||
ARCHITECTURES[name] = arch
|
||||
return arch
|
||||
|
||||
return do_registration
|
||||
|
||||
|
||||
def make_layer(arch_config):
|
||||
arch_func = get_architecture(arch_config["arch"])
|
||||
arch_func = registry.architectures.get(arch_config["arch"])
|
||||
return arch_func(arch_config["config"])
|
||||
|
||||
|
||||
def get_architecture(name):
|
||||
"""Get a model architecture function by name. Raises a KeyError if the
|
||||
architecture is not found.
|
||||
|
||||
name (unicode): The name of the architecture.
|
||||
RETURNS (Model): The architecture.
|
||||
"""
|
||||
# Check if an entry point is exposed for the architecture code
|
||||
entry_point = get_entry_point(ENTRY_POINTS.architectures, name)
|
||||
if entry_point is not None:
|
||||
ARCHITECTURES[name] = entry_point
|
||||
if name not in ARCHITECTURES:
|
||||
names = ", ".join(sorted(ARCHITECTURES.keys()))
|
||||
raise KeyError(Errors.E174.format(name=name, names=names))
|
||||
return ARCHITECTURES[name]
|
||||
|
||||
|
||||
def ensure_path(path):
|
||||
"""Ensure string is converted to a Path.
|
||||
|
||||
|
@ -327,34 +276,6 @@ def get_package_path(name):
|
|||
return Path(pkg.__file__).parent
|
||||
|
||||
|
||||
def get_entry_points(key):
|
||||
"""Get registered entry points from other packages for a given key, e.g.
|
||||
'spacy_factories' and return them as a dictionary, keyed by name.
|
||||
|
||||
key (unicode): Entry point name.
|
||||
RETURNS (dict): Entry points, keyed by name.
|
||||
"""
|
||||
result = {}
|
||||
for entry_point in AVAILABLE_ENTRY_POINTS.get(key, []):
|
||||
result[entry_point.name] = entry_point.load()
|
||||
return result
|
||||
|
||||
|
||||
def get_entry_point(key, value, default=None):
|
||||
"""Check if registered entry point is available for a given name and
|
||||
load it. Otherwise, return None.
|
||||
|
||||
key (unicode): Entry point name.
|
||||
value (unicode): Name of entry point to load.
|
||||
default: Optional default value to return.
|
||||
RETURNS: The loaded entry point or None.
|
||||
"""
|
||||
for entry_point in AVAILABLE_ENTRY_POINTS.get(key, []):
|
||||
if entry_point.name == value:
|
||||
return entry_point.load()
|
||||
return default
|
||||
|
||||
|
||||
def is_in_jupyter():
|
||||
"""Check if user is running spaCy from a Jupyter notebook by detecting the
|
||||
IPython kernel. Mainly used for the displaCy visualizer.
|
||||
|
|
|
@ -109,8 +109,8 @@ raise an error if the pre-defined attrs of the two `DocBin`s don't match.
|
|||
> doc_bin1.add(nlp("Hello world"))
|
||||
> doc_bin2 = DocBin(attrs=["LEMMA", "POS"])
|
||||
> doc_bin2.add(nlp("This is a sentence"))
|
||||
> merged_bins = doc_bin1.merge(doc_bin2)
|
||||
> assert len(merged_bins) == 2
|
||||
> doc_bin1.merge(doc_bin2)
|
||||
> assert len(doc_bin1) == 2
|
||||
> ```
|
||||
|
||||
| Argument | Type | Description |
|
||||
|
|
|
@ -1,9 +1,10 @@
|
|||
A named entity is a "real-world object" that's assigned a name – for example, a
|
||||
person, a country, a product or a book title. spaCy can **recognize**
|
||||
[various types](/api/annotation#named-entities) of named entities in a document,
|
||||
by asking the model for a **prediction**. Because models are statistical and
|
||||
strongly depend on the examples they were trained on, this doesn't always work
|
||||
_perfectly_ and might need some tuning later, depending on your use case.
|
||||
person, a country, a product or a book title. spaCy can **recognize
|
||||
[various types](/api/annotation#named-entities)** of named entities in a
|
||||
document, by asking the model for a **prediction**. Because models are
|
||||
statistical and strongly depend on the examples they were trained on, this
|
||||
doesn't always work _perfectly_ and might need some tuning later, depending on
|
||||
your use case.
|
||||
|
||||
Named entities are available as the `ents` property of a `Doc`:
|
||||
|
||||
|
|