Mirror of https://github.com/explosion/spaCy.git (synced 2025-10-24 12:41:23 +03:00)
Merge branch 'master' into spacy.io
commit 71f5a5daa1
106
.github/contributors/prilopes.md
vendored
Normal file
@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Priscilla Lopes      |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 2019-11-06           |
+| GitHub username                | prilopes             |
+| Website (optional)             |                      |
@ -221,6 +221,13 @@ def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):


 def write_conllu(docs, file_):
+    if not Token.has_extension("get_conllu_lines"):
+        Token.set_extension("get_conllu_lines", method=get_token_conllu)
+    if not Token.has_extension("begins_fused"):
+        Token.set_extension("begins_fused", default=False)
+    if not Token.has_extension("inside_fused"):
+        Token.set_extension("inside_fused", default=False)
+
     merger = Matcher(docs[0].vocab)
     merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}])
     for i, doc in enumerate(docs):
@ -6,12 +6,12 @@ blis>=0.4.0,<0.5.0
 murmurhash>=0.28.0,<1.1.0
 wasabi>=0.4.0,<1.1.0
 srsly>=0.1.0,<1.1.0
+catalogue>=0.0.7,<1.1.0
 # Third party dependencies
 numpy>=1.15.0
 requests>=2.13.0,<3.0.0
 plac>=0.9.6,<1.2.0
 pathlib==1.0.1; python_version < "3.4"
-importlib_metadata>=0.20; python_version < "3.8"
 # Optional dependencies
 jsonschema>=2.6.0,<3.1.0
 # Development dependencies
@ -48,13 +48,13 @@ install_requires =
     blis>=0.4.0,<0.5.0
     wasabi>=0.4.0,<1.1.0
     srsly>=0.1.0,<1.1.0
+    catalogue>=0.0.7,<1.1.0
     # Third-party dependencies
     setuptools
     numpy>=1.15.0
     plac>=0.9.6,<1.2.0
     requests>=2.13.0,<3.0.0
     pathlib==1.0.1; python_version < "3.4"
-    importlib_metadata>=0.20; python_version < "3.8"

 [options.extras_require]
 lookups =
@ -15,7 +15,7 @@ from .glossary import explain
 from .about import __version__
 from .errors import Errors, Warnings, deprecation_warning
 from . import util
-from .util import register_architecture, get_architecture
+from .util import registry
 from .language import component


@ -36,11 +36,6 @@ try:
 except ImportError:
     cupy = None

-try:  # Python 3.8
-    import importlib.metadata as importlib_metadata
-except ImportError:
-    import importlib_metadata  # noqa: F401
-
 try:
     from thinc.neural.optimizers import Optimizer  # noqa: F401
 except ImportError:
@ -5,7 +5,7 @@ import uuid

 from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
 from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
-from ..util import minify_html, escape_html, get_entry_points, ENTRY_POINTS
+from ..util import minify_html, escape_html, registry
 from ..errors import Errors


@ -242,7 +242,7 @@ class EntityRenderer(object):
             "CARDINAL": "#e4e7d2",
             "PERCENT": "#e4e7d2",
         }
-        user_colors = get_entry_points(ENTRY_POINTS.displacy_colors)
+        user_colors = registry.displacy_colors.get_all()
         for user_color in user_colors.values():
             colors.update(user_color)
         colors.update(options.get("colors", {}))
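Review note: the hunk above keeps the same three-layer merge order for entity colors and only changes where the user colors come from. A minimal sketch of that precedence, assuming plain dictionaries stand in for the registry contents; `resolve_colors`, the `fruit_plugin` key and the `FRUIT` label are hypothetical names used for illustration, not part of spaCy:

```python
def resolve_colors(default_colors, user_colors, options):
    """Merge colors so that later layers override earlier ones.

    default_colors: built-in label -> color mapping
    user_colors: name -> {label: color} mappings, as a registry's
        get_all() might return them (assumption for this sketch)
    options: renderer options, optionally holding a "colors" dict
    """
    colors = dict(default_colors)  # copy, so the defaults stay untouched
    for user_color in user_colors.values():
        colors.update(user_color)
    colors.update(options.get("colors", {}))
    return colors


defaults = {"CARDINAL": "#e4e7d2", "PERCENT": "#e4e7d2"}
plugins = {"fruit_plugin": {"FRUIT": "#ffcc00", "PERCENT": "#dddddd"}}
options = {"colors": {"FRUIT": "#ff0000"}}
merged = resolve_colors(defaults, plugins, options)
print(merged)
```

Colors passed through `options` win over registered user colors, which in turn win over the defaults.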
@ -11,12 +11,12 @@ Example sentences to test spaCy and its language models.


 sentences = [
-    "Apple está buscando comprar una startup del Reino Unido por mil millones de dólares",
-    "Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes",
-    "San Francisco analiza prohibir los robots delivery",
-    "Londres es una gran ciudad del Reino Unido",
-    "El gato come pescado",
-    "Veo al hombre con el telescopio",
-    "La araña come moscas",
-    "El pingüino incuba en su nido",
+    "Apple está buscando comprar una startup del Reino Unido por mil millones de dólares.",
+    "Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes.",
+    "San Francisco analiza prohibir los robots delivery.",
+    "Londres es una gran ciudad del Reino Unido.",
+    "El gato come pescado.",
+    "Veo al hombre con el telescopio.",
+    "La araña come moscas.",
+    "El pingüino incuba en su nido.",
 ]
@ -11,8 +11,8 @@ Example sentences to test spaCy and its language models.


 sentences = [
-    "Apple vurderer å kjøpe britisk oppstartfirma for en milliard dollar",
-    "Selvkjørende biler flytter forsikringsansvaret over på produsentene ",
-    "San Francisco vurderer å forby robotbud på fortauene",
+    "Apple vurderer å kjøpe britisk oppstartfirma for en milliard dollar.",
+    "Selvkjørende biler flytter forsikringsansvaret over på produsentene.",
+    "San Francisco vurderer å forby robotbud på fortauene.",
     "London er en stor by i Storbritannia.",
 ]
99
spacy/lang/xx/examples.py
Normal file
@ -0,0 +1,99 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+
+"""
+Example sentences to test spaCy and its language models.
+
+>>> from spacy.lang.xx.examples import sentences
+>>> docs = nlp.pipe(sentences)
+"""
+
+# combined examples from de/en/es/fr/it/nl/pl/pt/ru
+
+sentences = [
+    "Die ganze Stadt ist ein Startup: Shenzhen ist das Silicon Valley für Hardware-Firmen",
+    "Wie deutsche Startups die Technologie vorantreiben wollen: Künstliche Intelligenz",
+    "Trend zum Urlaub in Deutschland beschert Gastwirten mehr Umsatz",
+    "Bundesanwaltschaft erhebt Anklage gegen mutmaßlichen Schweizer Spion",
+    "San Francisco erwägt Verbot von Lieferrobotern",
+    "Autonome Fahrzeuge verlagern Haftpflicht auf Hersteller",
+    "Wo bist du?",
+    "Was ist die Hauptstadt von Deutschland?",
+    "Apple is looking at buying U.K. startup for $1 billion",
+    "Autonomous cars shift insurance liability toward manufacturers",
+    "San Francisco considers banning sidewalk delivery robots",
+    "London is a big city in the United Kingdom.",
+    "Where are you?",
+    "Who is the president of France?",
+    "What is the capital of the United States?",
+    "When was Barack Obama born?",
+    "Apple está buscando comprar una startup del Reino Unido por mil millones de dólares.",
+    "Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes.",
+    "San Francisco analiza prohibir los robots delivery.",
+    "Londres es una gran ciudad del Reino Unido.",
+    "El gato come pescado.",
+    "Veo al hombre con el telescopio.",
+    "La araña come moscas.",
+    "El pingüino incuba en su nido.",
+    "Apple cherche à acheter une start-up anglaise pour 1 milliard de dollars",
+    "Les voitures autonomes déplacent la responsabilité de l'assurance vers les constructeurs",
+    "San Francisco envisage d'interdire les robots coursiers sur les trottoirs",
+    "Londres est une grande ville du Royaume-Uni",
+    "L’Italie choisit ArcelorMittal pour reprendre la plus grande aciérie d’Europe",
+    "Apple lance HomePod parce qu'il se sent menacé par l'Echo d'Amazon",
+    "La France ne devrait pas manquer d'électricité cet été, même en cas de canicule",
+    "Nouvelles attaques de Trump contre le maire de Londres",
+    "Où es-tu ?",
+    "Qui est le président de la France ?",
+    "Où est la capitale des États-Unis ?",
+    "Quand est né Barack Obama ?",
+    "Apple vuole comprare una startup del Regno Unito per un miliardo di dollari",
+    "Le automobili a guida autonoma spostano la responsabilità assicurativa verso i produttori",
+    "San Francisco prevede di bandire i robot di consegna porta a porta",
+    "Londra è una grande città del Regno Unito.",
+    "Apple overweegt om voor 1 miljard een U.K. startup te kopen",
+    "Autonome auto's verschuiven de verzekeringverantwoordelijkheid naar producenten",
+    "San Francisco overweegt robots op voetpaden te verbieden",
+    "Londen is een grote stad in het Verenigd Koninkrijk",
+    "Poczuł przyjemną woń mocnej kawy.",
+    "Istnieje wiele dróg oddziaływania substancji psychoaktywnej na układ nerwowy.",
+    "Powitał mnie biało-czarny kot, płosząc siedzące na płocie trzy dorodne dudki.",
+    "Nowy abonament pod lupą Komisji Europejskiej",
+    "Czy w ciągu ostatnich 48 godzin spożyłeś leki zawierające paracetamol?",
+    "Kto ma ochotę zapoznać się z innymi niż w książkach przygodami Muminków i ich przyjaciół, temu polecam komiks Tove Jansson „Muminki i morze”.",
+    "Apple está querendo comprar uma startup do Reino Unido por 100 milhões de dólares.",
+    "Carros autônomos empurram a responsabilidade do seguro para os fabricantes.",
+    "São Francisco considera banir os robôs de entrega que andam pelas calçadas.",
+    "Londres é a maior cidade do Reino Unido.",
+    # Translations from English:
+    "Apple рассматривает возможность покупки стартапа из Соединённого Королевства за $1 млрд",
+    "Беспилотные автомобили перекладывают страховую ответственность на производителя",
+    "В Сан-Франциско рассматривается возможность запрета роботов-курьеров, которые перемещаются по тротуару",
+    "Лондон — это большой город в Соединённом Королевстве",
+    # Native Russian sentences:
+    # Colloquial:
+    "Да, нет, наверное!",  # Typical polite refusal
+    "Обратите внимание на необыкновенную красоту этого города-героя Москвы, столицы нашей Родины!",  # From a tour guide speech
+    # Examples of Bookish Russian:
+    # Quote from "The Golden Calf"
+    "Рио-де-Жанейро — это моя мечта, и не смейте касаться её своими грязными лапами!",
+    # Quotes from "Ivan Vasilievich changes his occupation"
+    "Ты пошто боярыню обидел, смерд?!!",
+    "Оставь меня, старушка, я в печали!",
+    # Quotes from Dostoevsky:
+    "Уж коли я, такой же, как и ты, человек грешный, над тобой умилился и пожалел тебя, кольми паче бог",
+    "В мечтах я нередко, говорит, доходил до страстных помыслов о служении человечеству и может быть действительно пошел бы на крест за людей, если б это вдруг как-нибудь потребовалось, а между тем я двух дней не в состоянии прожить ни с кем в одной комнате, о чем знаю из опыта",
+    "Зато всегда так происходило, что чем более я ненавидел людей в частности, тем пламеннее становилась любовь моя к человечеству вообще",
+    # Quotes from Chekhov:
+    "Ненужные дела и разговоры всё об одном отхватывают на свою долю лучшую часть времени, лучшие силы, и в конце концов остается какая-то куцая, бескрылая жизнь, какая-то чепуха, и уйти и бежать нельзя, точно сидишь в сумасшедшем доме или в арестантских ротах!",
+    # Quotes from Turgenev:
+    "Нравится тебе женщина, старайся добиться толку; а нельзя — ну, не надо, отвернись — земля не клином сошлась",
+    "Узенькое местечко, которое я занимаю, до того крохотно в сравнении с остальным пространством, где меня нет и где дела до меня нет; и часть времени, которую мне удастся прожить, так ничтожна перед вечностью, где меня не было и не будет...",
+    # Quotes from newspapers:
+    # Komsomolskaya Pravda:
+    "На заседании президиума правительства Москвы принято решение присвоить статус инвестиционного приоритетного проекта города Москвы киностудии Союзмультфильм",
+    "Глава Минобороны Сергей Шойгу заявил, что обстановка на этом стратегическом направлении требует непрерывного совершенствования боевого состава войск",
+    # Argumenty i Facty:
+    "На реплику лже-Говина — дескать, он (Волков) будет лучшим революционером — Стамп с энтузиазмом ответил: Непременно!",
+]
@ -4,19 +4,92 @@ from __future__ import unicode_literals
 from ...attrs import LANG
 from ...language import Language
 from ...tokens import Doc
+from ...util import DummyTokenizer
 from ..tokenizer_exceptions import BASE_EXCEPTIONS
+from .lex_attrs import LEX_ATTRS
 from .stop_words import STOP_WORDS
 from .tag_map import TAG_MAP


+def try_jieba_import(use_jieba):
+    try:
+        import jieba
+        return jieba
+    except ImportError:
+        if use_jieba:
+            msg = (
+                "Jieba not installed. Either set Chinese.use_jieba = False, "
+                "or install it https://github.com/fxsjy/jieba"
+            )
+            raise ImportError(msg)
+
+
+class ChineseTokenizer(DummyTokenizer):
+    def __init__(self, cls, nlp=None):
+        self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
+        self.use_jieba = cls.use_jieba
+        self.jieba_seg = try_jieba_import(self.use_jieba)
+        self.tokenizer = Language.Defaults().create_tokenizer(nlp)
+
+    def __call__(self, text):
+        # use jieba
+        if self.use_jieba:
+            jieba_words = list([x for x in self.jieba_seg.cut(text, cut_all=False) if x])
+            words = [jieba_words[0]]
+            spaces = [False]
+            for i in range(1, len(jieba_words)):
+                word = jieba_words[i]
+                if word.isspace():
+                    # second token in adjacent whitespace following a
+                    # non-space token
+                    if spaces[-1]:
+                        words.append(word)
+                        spaces.append(False)
+                    # first space token following non-space token
+                    elif word == " " and not words[-1].isspace():
+                        spaces[-1] = True
+                    # token is non-space whitespace or any whitespace following
+                    # a whitespace token
+                    else:
+                        # extend previous whitespace token with more whitespace
+                        if words[-1].isspace():
+                            words[-1] += word
+                        # otherwise it's a new whitespace token
+                        else:
+                            words.append(word)
+                            spaces.append(False)
+                else:
+                    words.append(word)
+                    spaces.append(False)
+            return Doc(self.vocab, words=words, spaces=spaces)
+
+        # split into individual characters
+        words = []
+        spaces = []
+        for token in self.tokenizer(text):
+            if token.text.isspace():
+                words.append(token.text)
+                spaces.append(False)
+            else:
+                words.extend(list(token.text))
+                spaces.extend([False] * len(token.text))
+                spaces[-1] = bool(token.whitespace_)
+        return Doc(self.vocab, words=words, spaces=spaces)
+
+
 class ChineseDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
+    lex_attr_getters.update(LEX_ATTRS)
     lex_attr_getters[LANG] = lambda text: "zh"
-    use_jieba = True
     tokenizer_exceptions = BASE_EXCEPTIONS
     stop_words = STOP_WORDS
     tag_map = TAG_MAP
     writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
+    use_jieba = True
+
+    @classmethod
+    def create_tokenizer(cls, nlp=None):
+        return ChineseTokenizer(cls, nlp)


 class Chinese(Language):
@ -24,26 +97,7 @@ class Chinese(Language):
     Defaults = ChineseDefaults  # override defaults

     def make_doc(self, text):
-        if self.Defaults.use_jieba:
-            try:
-                import jieba
-            except ImportError:
-                msg = (
-                    "Jieba not installed. Either set Chinese.use_jieba = False, "
-                    "or install it https://github.com/fxsjy/jieba"
-                )
-                raise ImportError(msg)
-            words = list(jieba.cut(text, cut_all=False))
-            words = [x for x in words if x]
-            return Doc(self.vocab, words=words, spaces=[False] * len(words))
-        else:
-            words = []
-            spaces = []
-            for token in self.tokenizer(text):
-                words.extend(list(token.text))
-                spaces.extend([False] * len(token.text))
-                spaces[-1] = bool(token.whitespace_)
-            return Doc(self.vocab, words=words, spaces=spaces)
+        return self.tokenizer(text)


 __all__ = ["Chinese"]
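Review note: the whitespace handling in the new jieba branch of `ChineseTokenizer.__call__` is easier to follow in isolation. The sketch below reproduces the same branching on a plain list of segments, so it runs without jieba or spaCy. The function name `words_and_spaces` is hypothetical, and the early return for empty input is an assumption added here; the code in the diff indexes `jieba_words[0]` directly.

```python
def words_and_spaces(segments):
    """Turn segmenter output into parallel (words, spaces) lists.

    A single " " directly after a non-space token becomes that token's
    trailing-space flag; any other whitespace is kept as a whitespace
    token, and adjacent whitespace tokens are merged into one.
    """
    segments = [s for s in segments if s]
    if not segments:
        # Assumption for this sketch: empty input gives empty lists.
        return [], []
    words = [segments[0]]
    spaces = [False]
    for word in segments[1:]:
        if word.isspace():
            if spaces[-1]:
                # The previous token already owns a trailing space, so
                # this whitespace starts a token of its own.
                words.append(word)
                spaces.append(False)
            elif word == " " and not words[-1].isspace():
                # First single space after a non-space token.
                spaces[-1] = True
            elif words[-1].isspace():
                # Extend the previous whitespace token.
                words[-1] += word
            else:
                # Non-space whitespace (such as "\n") after a word.
                words.append(word)
                spaces.append(False)
        else:
            words.append(word)
            spaces.append(False)
    return words, spaces


segments = ["我", " ", "爱", " ", " ", "你", "\n", "\n"]
words, spaces = words_and_spaces(segments)
rebuilt = "".join(w + (" " if s else "") for w, s in zip(words, spaces))
print(words, spaces)
```

The point of the extra branches is that the original text can be rebuilt from `words` and `spaces`, which the removed `make_doc` code could not guarantee because it set every space flag to `False`.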
@ -1,11 +1,12 @@
 # coding: utf8
 from __future__ import unicode_literals

-from ...symbols import POS, PUNCT, ADJ, CONJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
-from ...symbols import NOUN, PART, INTJ, PRON
+from ...symbols import POS, PUNCT, ADJ, SCONJ, CCONJ, NUM, DET, ADV, ADP, X
+from ...symbols import NOUN, PART, INTJ, PRON, VERB, SPACE

-# The Chinese part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set.
-# We also map the tags to the simpler Google Universal POS tag set.
+# The Chinese part-of-speech tagger uses the OntoNotes 5 version of the Penn
+# Treebank tag set. We also map the tags to the simpler Universal Dependencies
+# v2 tag set.

 TAG_MAP = {
     "AS": {POS: PART},
@ -38,10 +39,11 @@ TAG_MAP = {
     "OD": {POS: NUM},
     "DT": {POS: DET},
     "CC": {POS: CCONJ},
-    "CS": {POS: CONJ},
+    "CS": {POS: SCONJ},
     "AD": {POS: ADV},
     "JJ": {POS: ADJ},
     "P": {POS: ADP},
     "PN": {POS: PRON},
     "PU": {POS: PUNCT},
+    "_SP": {POS: SPACE},
 }
@ -51,8 +51,8 @@ class BaseDefaults(object):
         filenames = {name: root / filename for name, filename in cls.resources}
         if LANG in cls.lex_attr_getters:
             lang = cls.lex_attr_getters[LANG](None)
-            user_lookups = util.get_entry_point(util.ENTRY_POINTS.lookups, lang, {})
-            filenames.update(user_lookups)
+            if lang in util.registry.lookups:
+                filenames.update(util.registry.lookups.get(lang))
         lookups = Lookups()
         for name, filename in filenames.items():
             data = util.load_language_data(filename)
@ -155,7 +155,7 @@ class Language(object):
             100,000 characters in one text.
         RETURNS (Language): The newly constructed object.
         """
-        user_factories = util.get_entry_points(util.ENTRY_POINTS.factories)
+        user_factories = util.registry.factories.get_all()
         self.factories.update(user_factories)
         self._meta = dict(meta)
         self._path = None
@ -240,7 +240,7 @@ cdef class DependencyMatcher:
         for i, (ent_id, nodes) in enumerate(matched_key_trees):
             on_match = self._callbacks.get(ent_id)
             if on_match is not None:
-                on_match(self, doc, i, matches)
+                on_match(self, doc, i, matched_key_trees)
         return matched_key_trees

     def recurse(self,tree,id_to_position,_node_operator_map,int patternLength,visitedNodes,matched_trees):
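Review note: the one-line change above passes the list that is being enumerated to the callback, so the index `i` refers to an entry of the same list. Whether the old name `matches` was defined at all in that scope is not visible from this hunk. A minimal pure-Python sketch of the callback loop, with a hypothetical `run_callbacks` helper and placeholder values standing in for the matcher and the `Doc`:

```python
def run_callbacks(matcher, doc, matched_key_trees, callbacks):
    """Call the registered callback, if any, for each matched tree."""
    for i, (ent_id, nodes) in enumerate(matched_key_trees):
        on_match = callbacks.get(ent_id)
        if on_match is not None:
            # Pass the list being iterated, so index i is valid inside it.
            on_match(matcher, doc, i, matched_key_trees)
    return matched_key_trees


seen = []


def record(matcher, doc, i, matches):
    # A callback can look up its own match through the index.
    seen.append(matches[i])


trees = [("FOUNDED", [0, 2]), ("OTHER", [1]), ("FOUNDED", [3, 4])]
result = run_callbacks(None, "doc placeholder", trees, {"FOUNDED": record})
print(seen)
```

Only keys with a registered callback trigger a call; the matches themselves are returned unchanged.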
@ -3,10 +3,10 @@ from __future__ import unicode_literals
 from thinc.api import chain
 from thinc.v2v import Maxout
 from thinc.misc import LayerNorm
-from ..util import register_architecture, make_layer
+from ..util import registry, make_layer


-@register_architecture("thinc.FeedForward.v1")
+@registry.architectures.register("thinc.FeedForward.v1")
 def FeedForward(config):
     layers = [make_layer(layer_cfg) for layer_cfg in config["layers"]]
     model = chain(*layers)
@ -14,7 +14,7 @@ def FeedForward(config):
     return model


-@register_architecture("spacy.LayerNormalizedMaxout.v1")
+@registry.architectures.register("spacy.LayerNormalizedMaxout.v1")
 def LayerNormalizedMaxout(config):
     width = config["width"]
     pieces = config["pieces"]
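Review note: the hunks in this commit replace `register_architecture(...)` and the entry-point helpers with methods on a shared `registry` object (`register`, `get`, `get_all`, and a membership test). The sketch below shows that general pattern with a small hand-written class. It is an illustration only and not the implementation spaCy or the `catalogue` package uses; the `Registry` class, the `demo.Scale.v1` name and the `Scale` function are hypothetical.

```python
class Registry(object):
    """Hypothetical name -> function registry, for illustration only."""

    def __init__(self, name):
        self.name = name
        self._funcs = {}

    def register(self, key):
        """Return a decorator that stores the function under `key`."""
        def decorator(func):
            self._funcs[key] = func
            return func  # the function itself is left unchanged
        return decorator

    def get(self, key):
        if key not in self._funcs:
            raise KeyError("Unknown {} entry: {}".format(self.name, key))
        return self._funcs[key]

    def get_all(self):
        return dict(self._funcs)  # copy, so callers cannot mutate the store

    def __contains__(self, key):
        return key in self._funcs


architectures = Registry("architectures")


@architectures.register("demo.Scale.v1")
def Scale(config):
    # Build a "model" from a config dict, like the functions in the diff.
    factor = config["factor"]
    return lambda x: x * factor


model = architectures.get("demo.Scale.v1")({"factor": 3})
print(model(2), "demo.Scale.v1" in architectures)
```

Keeping lookup, listing and membership on one object is what lets call sites such as `if lang in util.registry.lookups:` and `registry.factories.get_all()` share one interface.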
@ -6,11 +6,11 @@ from thinc.v2v import Maxout, Model
 from thinc.i2v import HashEmbed, StaticVectors
 from thinc.t2t import ExtractWindow
 from thinc.misc import Residual, LayerNorm, FeatureExtracter
-from ..util import make_layer, register_architecture
+from ..util import make_layer, registry
 from ._wire import concatenate_lists


-@register_architecture("spacy.Tok2Vec.v1")
+@registry.architectures.register("spacy.Tok2Vec.v1")
 def Tok2Vec(config):
     doc2feats = make_layer(config["@doc2feats"])
     embed = make_layer(config["@embed"])
@ -24,13 +24,13 @@ def Tok2Vec(config):
     return tok2vec


-@register_architecture("spacy.Doc2Feats.v1")
+@registry.architectures.register("spacy.Doc2Feats.v1")
 def Doc2Feats(config):
     columns = config["columns"]
     return FeatureExtracter(columns)


-@register_architecture("spacy.MultiHashEmbed.v1")
+@registry.architectures.register("spacy.MultiHashEmbed.v1")
 def MultiHashEmbed(config):
     # For backwards compatibility with models before the architecture registry,
     # we have to be careful to get exactly the same model structure. One subtle
@ -78,7 +78,7 @@ def MultiHashEmbed(config):
     return layer


-@register_architecture("spacy.CharacterEmbed.v1")
+@registry.architectures.register("spacy.CharacterEmbed.v1")
 def CharacterEmbed(config):
     from .. import _ml

||||||
|
@ -94,7 +94,7 @@ def CharacterEmbed(config):
|
||||||
return model
|
return model
|
||||||
|
|
||||||
|
|
||||||
@register_architecture("spacy.MaxoutWindowEncoder.v1")
|
@registry.architectures.register("spacy.MaxoutWindowEncoder.v1")
|
||||||
def MaxoutWindowEncoder(config):
|
def MaxoutWindowEncoder(config):
|
||||||
nO = config["width"]
|
nO = config["width"]
|
||||||
nW = config["window_size"]
|
nW = config["window_size"]
|
||||||
|
@ -110,7 +110,7 @@ def MaxoutWindowEncoder(config):
|
||||||
return model
|
return model
|
||||||
|
|
||||||
|
|
||||||
@register_architecture("spacy.MishWindowEncoder.v1")
|
@registry.architectures.register("spacy.MishWindowEncoder.v1")
|
||||||
def MishWindowEncoder(config):
|
def MishWindowEncoder(config):
|
||||||
from thinc.v2v import Mish
|
from thinc.v2v import Mish
|
||||||
|
|
||||||
|
@ -124,12 +124,12 @@ def MishWindowEncoder(config):
|
||||||
return model
|
return model
|
||||||
|
|
||||||
|
|
||||||
@register_architecture("spacy.PretrainedVectors.v1")
|
@registry.architectures.register("spacy.PretrainedVectors.v1")
|
||||||
def PretrainedVectors(config):
|
def PretrainedVectors(config):
|
||||||
return StaticVectors(config["vectors_name"], config["width"], config["column"])
|
return StaticVectors(config["vectors_name"], config["width"], config["column"])
|
||||||
|
|
||||||
|
|
||||||
@register_architecture("spacy.TorchBiLSTMEncoder.v1")
|
@registry.architectures.register("spacy.TorchBiLSTMEncoder.v1")
|
||||||
def TorchBiLSTMEncoder(config):
|
def TorchBiLSTMEncoder(config):
|
||||||
import torch.nn
|
import torch.nn
|
||||||
from thinc.extra.wrappers import PyTorchWrapperRNN
|
from thinc.extra.wrappers import PyTorchWrapperRNN
|
||||||
|
|
|
@ -218,3 +218,9 @@ def uk_tokenizer():
|
||||||
@pytest.fixture(scope="session")
|
@pytest.fixture(scope="session")
|
||||||
def ur_tokenizer():
|
def ur_tokenizer():
|
||||||
return get_lang_class("ur").Defaults.create_tokenizer()
|
return get_lang_class("ur").Defaults.create_tokenizer()
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture(scope="session")
|
||||||
|
def zh_tokenizer():
|
||||||
|
pytest.importorskip("jieba")
|
||||||
|
return get_lang_class("zh").Defaults.create_tokenizer()
|
||||||
|
|
|
@ -183,3 +183,18 @@ def test_doc_retokenizer_split_lex_attrs(en_vocab):
|
||||||
retokenizer.split(doc[0], ["Los", "Angeles"], heads, attrs=attrs)
|
retokenizer.split(doc[0], ["Los", "Angeles"], heads, attrs=attrs)
|
||||||
assert doc[0].is_stop
|
assert doc[0].is_stop
|
||||||
assert not doc[1].is_stop
|
assert not doc[1].is_stop
|
||||||
|
|
||||||
|
|
||||||
|
def test_doc_retokenizer_realloc(en_vocab):
|
||||||
|
"""#4604: realloc correctly when new tokens outnumber original tokens"""
|
||||||
|
text = "Hyperglycemic adverse events following antipsychotic drug administration in the"
|
||||||
|
doc = Doc(en_vocab, words=text.split()[:-1])
|
||||||
|
with doc.retokenize() as retokenizer:
|
||||||
|
token = doc[0]
|
||||||
|
heads = [(token, 0)] * len(token)
|
||||||
|
retokenizer.split(doc[token.i], list(token.text), heads=heads)
|
||||||
|
doc = Doc(en_vocab, words=text.split())
|
||||||
|
with doc.retokenize() as retokenizer:
|
||||||
|
token = doc[0]
|
||||||
|
heads = [(token, 0)] * len(token)
|
||||||
|
retokenizer.split(doc[token.i], list(token.text), heads=heads)
|
||||||
|
|
0
spacy/tests/lang/zh/__init__.py
Normal file
25
spacy/tests/lang/zh/test_text.py
Normal file
|
@ -0,0 +1,25 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize(
|
||||||
|
"text,match",
|
||||||
|
[
|
||||||
|
("10", True),
|
||||||
|
("1", True),
|
||||||
|
("999.0", True),
|
||||||
|
("一", True),
|
||||||
|
("二", True),
|
||||||
|
("〇", True),
|
||||||
|
("十一", True),
|
||||||
|
("狗", False),
|
||||||
|
(",", False),
|
||||||
|
],
|
||||||
|
)
|
||||||
|
def test_lex_attrs_like_number(zh_tokenizer, text, match):
|
||||||
|
tokens = zh_tokenizer(text)
|
||||||
|
assert len(tokens) == 1
|
||||||
|
assert tokens[0].like_num == match
|
31
spacy/tests/lang/zh/test_tokenizer.py
Normal file
|
@ -0,0 +1,31 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
|
# fmt: off
|
||||||
|
TOKENIZER_TESTS = [
|
||||||
|
("作为语言而言,为世界使用人数最多的语言,目前世界有五分之一人口做为母语。",
|
||||||
|
['作为', '语言', '而言', ',', '为', '世界', '使用', '人', '数最多',
|
||||||
|
'的', '语言', ',', '目前', '世界', '有', '五分之一', '人口', '做',
|
||||||
|
'为', '母语', '。']),
|
||||||
|
]
|
||||||
|
# fmt: on
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("text,expected_tokens", TOKENIZER_TESTS)
|
||||||
|
def test_zh_tokenizer(zh_tokenizer, text, expected_tokens):
|
||||||
|
zh_tokenizer.use_jieba = False
|
||||||
|
tokens = [token.text for token in zh_tokenizer(text)]
|
||||||
|
assert tokens == list(text)
|
||||||
|
|
||||||
|
zh_tokenizer.use_jieba = True
|
||||||
|
tokens = [token.text for token in zh_tokenizer(text)]
|
||||||
|
assert tokens == expected_tokens
|
||||||
|
|
||||||
|
|
||||||
|
def test_extra_spaces(zh_tokenizer):
|
||||||
|
# note: three spaces after "I"
|
||||||
|
tokens = zh_tokenizer("I like cheese.")
|
||||||
|
assert tokens[1].orth_ == " "
|
34
spacy/tests/regression/test_issue4590.py
Normal file
|
@ -0,0 +1,34 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
from mock import Mock
|
||||||
|
from spacy.matcher import DependencyMatcher
|
||||||
|
from ..util import get_doc
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue4590(en_vocab):
|
||||||
|
"""Test that matches param in on_match method are the same as matches run with no on_match method"""
|
||||||
|
pattern = [
|
||||||
|
{"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}},
|
||||||
|
{"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
|
||||||
|
{"SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
|
||||||
|
]
|
||||||
|
|
||||||
|
on_match = Mock()
|
||||||
|
|
||||||
|
matcher = DependencyMatcher(en_vocab)
|
||||||
|
matcher.add("pattern", on_match, pattern)
|
||||||
|
|
||||||
|
text = "The quick brown fox jumped over the lazy fox"
|
||||||
|
heads = [3, 2, 1, 1, 0, -1, 2, 1, -3]
|
||||||
|
deps = ["det", "amod", "amod", "nsubj", "prep", "pobj", "det", "amod"]
|
||||||
|
|
||||||
|
doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps)
|
||||||
|
|
||||||
|
matches = matcher(doc)
|
||||||
|
|
||||||
|
on_match_args = on_match.call_args
|
||||||
|
|
||||||
|
assert on_match_args[0][3] == matches
|
||||||
|
|
19
spacy/tests/test_architectures.py
Normal file
|
@ -0,0 +1,19 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
from spacy import registry
|
||||||
|
from thinc.v2v import Affine
|
||||||
|
from catalogue import RegistryError
|
||||||
|
|
||||||
|
|
||||||
|
@registry.architectures.register("my_test_function")
|
||||||
|
def create_model(nr_in, nr_out):
|
||||||
|
return Affine(nr_in, nr_out)
|
||||||
|
|
||||||
|
|
||||||
|
def test_get_architecture():
|
||||||
|
arch = registry.architectures.get("my_test_function")
|
||||||
|
assert arch is create_model
|
||||||
|
with pytest.raises(RegistryError):
|
||||||
|
registry.architectures.get("not_an_existing_key")
|
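The registration pattern exercised by `spacy/tests/test_architectures.py` can be illustrated with a minimal, self-contained sketch. This is not `catalogue`'s implementation: the `Registry` class and the `RegistryError` defined here are hypothetical stand-ins written for illustration, and only the call shapes (`register(name)` as a decorator, `register(name, func=...)` as a direct call, `get(name)`, and `name in registry`) are taken from the diff.

```python
class RegistryError(Exception):
    """Hypothetical stand-in for the error raised when a name is missing."""


class Registry(object):
    """Toy name -> function registry (illustrative only, not catalogue)."""

    def __init__(self, namespace):
        self.namespace = namespace
        self._funcs = {}

    def register(self, name, func=None):
        # Works both as a decorator and as a direct call with func=...
        def do_registration(f):
            self._funcs[name] = f
            return f

        if func is not None:
            return do_registration(func)
        return do_registration

    def get(self, name):
        if name not in self._funcs:
            available = ", ".join(sorted(self._funcs))
            raise RegistryError(
                "Can't find '%s' in '%s'. Available: %s"
                % (name, self.namespace, available)
            )
        return self._funcs[name]

    def __contains__(self, name):
        return name in self._funcs


architectures = Registry("architectures")
languages = Registry("languages")


# Decorator form, as used for the architecture functions in the diff.
@architectures.register("my_test_function")
def create_model(nr_in, nr_out):
    # Placeholder return value; the real test returns a thinc Affine layer.
    return (nr_in, nr_out)


# Direct form, mirroring set_lang_class(name, cls) in the diff.
languages.register("xx", func=dict)
```

Under these assumptions, a missing name raises a dedicated error type instead of `KeyError`, which is the behavioral change the updated test checks for with `pytest.raises(RegistryError)`.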
|
@ -1,19 +0,0 @@
|
||||||
# coding: utf8
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
from spacy import register_architecture
|
|
||||||
from spacy import get_architecture
|
|
||||||
from thinc.v2v import Affine
|
|
||||||
|
|
||||||
|
|
||||||
@register_architecture("my_test_function")
|
|
||||||
def create_model(nr_in, nr_out):
|
|
||||||
return Affine(nr_in, nr_out)
|
|
||||||
|
|
||||||
|
|
||||||
def test_get_architecture():
|
|
||||||
arch = get_architecture("my_test_function")
|
|
||||||
assert arch is create_model
|
|
||||||
with pytest.raises(KeyError):
|
|
||||||
get_architecture("not_an_existing_key")
|
|
|
@ -329,7 +329,7 @@ def _split(Doc doc, int token_index, orths, heads, attrs):
|
||||||
doc.c[i].head += offset
|
doc.c[i].head += offset
|
||||||
# Double doc.c max_length if necessary (until big enough for all new tokens)
|
# Double doc.c max_length if necessary (until big enough for all new tokens)
|
||||||
while doc.length + nb_subtokens - 1 >= doc.max_length:
|
while doc.length + nb_subtokens - 1 >= doc.max_length:
|
||||||
doc._realloc(doc.length * 2)
|
doc._realloc(doc.max_length * 2)
|
||||||
# Move tokens after the split to create space for the new tokens
|
# Move tokens after the split to create space for the new tokens
|
||||||
doc.length = len(doc) + nb_subtokens -1
|
doc.length = len(doc) + nb_subtokens -1
|
||||||
to_process_tensor = (doc.tensor is not None and doc.tensor.size != 0)
|
to_process_tensor = (doc.tensor is not None and doc.tensor.size != 0)
|
||||||
|
|
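The one-line change in `_split` can be pictured with a small pure-Python model of the growth loop. This is a sketch under simplifying assumptions: `grow_capacity` is a hypothetical helper, plain integers stand in for `doc.length` and `doc.max_length`, and no memory is actually reallocated.

```python
def grow_capacity(length, max_length, nb_subtokens):
    """Return a capacity large enough to hold the tokens after a split.

    Mirrors the corrected loop: the capacity is doubled from its current
    value (max_length), not from the current token count (length).
    Assumes max_length is positive.
    """
    needed = length + nb_subtokens - 1
    while needed >= max_length:
        max_length *= 2
    return max_length


# 9 tokens, capacity 16, first token split into 13 pieces: 21 slots needed.
capacity = grow_capacity(9, 16, 13)
```

In this model, doubling `length` instead would request a size of 18 on every pass, which never reaches the 21 slots required. That matches the situation the new `test_doc_retokenizer_realloc` test describes, where the new tokens outnumber the original ones.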
113
spacy/util.py
|
@ -13,6 +13,7 @@ import functools
|
||||||
import itertools
|
import itertools
|
||||||
import numpy.random
|
import numpy.random
|
||||||
import srsly
|
import srsly
|
||||||
|
import catalogue
|
||||||
import sys
|
import sys
|
||||||
|
|
||||||
try:
|
try:
|
||||||
|
@ -27,29 +28,20 @@ except ImportError:
|
||||||
|
|
||||||
from .symbols import ORTH
|
from .symbols import ORTH
|
||||||
from .compat import cupy, CudaStream, path2str, basestring_, unicode_
|
from .compat import cupy, CudaStream, path2str, basestring_, unicode_
|
||||||
from .compat import import_file, importlib_metadata
|
from .compat import import_file
|
||||||
from .errors import Errors, Warnings, deprecation_warning
|
from .errors import Errors, Warnings, deprecation_warning
|
||||||
|
|
||||||
|
|
||||||
LANGUAGES = {}
|
|
||||||
ARCHITECTURES = {}
|
|
||||||
_data_path = Path(__file__).parent / "data"
|
_data_path = Path(__file__).parent / "data"
|
||||||
_PRINT_ENV = False
|
_PRINT_ENV = False
|
||||||
|
|
||||||
|
|
||||||
# NB: Ony ever call this once! If called more than ince within the
|
class registry(object):
|
||||||
# function, test_issue1506 hangs and it's not 100% clear why.
|
languages = catalogue.create("spacy", "languages", entry_points=True)
|
||||||
AVAILABLE_ENTRY_POINTS = importlib_metadata.entry_points()
|
architectures = catalogue.create("spacy", "architectures", entry_points=True)
|
||||||
|
lookups = catalogue.create("spacy", "lookups", entry_points=True)
|
||||||
|
factories = catalogue.create("spacy", "factories", entry_points=True)
|
||||||
class ENTRY_POINTS(object):
|
displacy_colors = catalogue.create("spacy", "displacy_colors", entry_points=True)
|
||||||
"""Available entry points to register extensions."""
|
|
||||||
|
|
||||||
factories = "spacy_factories"
|
|
||||||
languages = "spacy_languages"
|
|
||||||
displacy_colors = "spacy_displacy_colors"
|
|
||||||
lookups = "spacy_lookups"
|
|
||||||
architectures = "spacy_architectures"
|
|
||||||
|
|
||||||
|
|
||||||
def set_env_log(value):
|
def set_env_log(value):
|
||||||
|
@ -65,8 +57,7 @@ def lang_class_is_loaded(lang):
|
||||||
lang (unicode): Two-letter language code, e.g. 'en'.
|
lang (unicode): Two-letter language code, e.g. 'en'.
|
||||||
RETURNS (bool): Whether a Language class has been loaded.
|
RETURNS (bool): Whether a Language class has been loaded.
|
||||||
"""
|
"""
|
||||||
global LANGUAGES
|
return lang in registry.languages
|
||||||
return lang in LANGUAGES
|
|
||||||
|
|
||||||
|
|
||||||
def get_lang_class(lang):
|
def get_lang_class(lang):
|
||||||
|
@ -75,19 +66,16 @@ def get_lang_class(lang):
|
||||||
lang (unicode): Two-letter language code, e.g. 'en'.
|
lang (unicode): Two-letter language code, e.g. 'en'.
|
||||||
RETURNS (Language): Language class.
|
RETURNS (Language): Language class.
|
||||||
"""
|
"""
|
||||||
global LANGUAGES
|
# Check if language is registered / entry point is available
|
||||||
# Check if an entry point is exposed for the language code
|
if lang in registry.languages:
|
||||||
entry_point = get_entry_point(ENTRY_POINTS.languages, lang)
|
return registry.languages.get(lang)
|
||||||
if entry_point is not None:
|
else:
|
||||||
LANGUAGES[lang] = entry_point
|
|
||||||
return entry_point
|
|
||||||
if lang not in LANGUAGES:
|
|
||||||
try:
|
try:
|
||||||
module = importlib.import_module(".lang.%s" % lang, "spacy")
|
module = importlib.import_module(".lang.%s" % lang, "spacy")
|
||||||
except ImportError as err:
|
except ImportError as err:
|
||||||
raise ImportError(Errors.E048.format(lang=lang, err=err))
|
raise ImportError(Errors.E048.format(lang=lang, err=err))
|
||||||
LANGUAGES[lang] = getattr(module, module.__all__[0])
|
set_lang_class(lang, getattr(module, module.__all__[0]))
|
||||||
return LANGUAGES[lang]
|
return registry.languages.get(lang)
|
||||||
|
|
||||||
|
|
||||||
def set_lang_class(name, cls):
|
def set_lang_class(name, cls):
|
||||||
|
@ -96,8 +84,7 @@ def set_lang_class(name, cls):
|
||||||
name (unicode): Name of Language class.
|
name (unicode): Name of Language class.
|
||||||
cls (Language): Language class.
|
cls (Language): Language class.
|
||||||
"""
|
"""
|
||||||
global LANGUAGES
|
registry.languages.register(name, func=cls)
|
||||||
LANGUAGES[name] = cls
|
|
||||||
|
|
||||||
|
|
||||||
def get_data_path(require_exists=True):
|
def get_data_path(require_exists=True):
|
||||||
|
@ -121,49 +108,11 @@ def set_data_path(path):
|
||||||
_data_path = ensure_path(path)
|
_data_path = ensure_path(path)
|
||||||
|
|
||||||
|
|
||||||
def register_architecture(name, arch=None):
|
|
||||||
"""Decorator to register an architecture. An architecture is a function
|
|
||||||
that returns a Thinc Model object.
|
|
||||||
|
|
||||||
name (unicode): The name of the architecture to register.
|
|
||||||
arch (Model): Optional architecture if function is called directly and
|
|
||||||
not used as a decorator.
|
|
||||||
RETURNS (callable): Function to register architecture.
|
|
||||||
"""
|
|
||||||
global ARCHITECTURES
|
|
||||||
if arch is not None:
|
|
||||||
ARCHITECTURES[name] = arch
|
|
||||||
return arch
|
|
||||||
|
|
||||||
def do_registration(arch):
|
|
||||||
ARCHITECTURES[name] = arch
|
|
||||||
return arch
|
|
||||||
|
|
||||||
return do_registration
|
|
||||||
|
|
||||||
|
|
||||||
def make_layer(arch_config):
|
def make_layer(arch_config):
|
||||||
arch_func = get_architecture(arch_config["arch"])
|
arch_func = registry.architectures.get(arch_config["arch"])
|
||||||
return arch_func(arch_config["config"])
|
return arch_func(arch_config["config"])
|
||||||
|
|
||||||
|
|
||||||
def get_architecture(name):
|
|
||||||
"""Get a model architecture function by name. Raises a KeyError if the
|
|
||||||
architecture is not found.
|
|
||||||
|
|
||||||
name (unicode): The mame of the architecture.
|
|
||||||
RETURNS (Model): The architecture.
|
|
||||||
"""
|
|
||||||
# Check if an entry point is exposed for the architecture code
|
|
||||||
entry_point = get_entry_point(ENTRY_POINTS.architectures, name)
|
|
||||||
if entry_point is not None:
|
|
||||||
ARCHITECTURES[name] = entry_point
|
|
||||||
if name not in ARCHITECTURES:
|
|
||||||
names = ", ".join(sorted(ARCHITECTURES.keys()))
|
|
||||||
raise KeyError(Errors.E174.format(name=name, names=names))
|
|
||||||
return ARCHITECTURES[name]
|
|
||||||
|
|
||||||
|
|
||||||
def ensure_path(path):
|
def ensure_path(path):
|
||||||
"""Ensure string is converted to a Path.
|
"""Ensure string is converted to a Path.
|
||||||
|
|
||||||
|
@ -327,34 +276,6 @@ def get_package_path(name):
|
||||||
return Path(pkg.__file__).parent
|
return Path(pkg.__file__).parent
|
||||||
|
|
||||||
|
|
||||||
def get_entry_points(key):
|
|
||||||
"""Get registered entry points from other packages for a given key, e.g.
|
|
||||||
'spacy_factories' and return them as a dictionary, keyed by name.
|
|
||||||
|
|
||||||
key (unicode): Entry point name.
|
|
||||||
RETURNS (dict): Entry points, keyed by name.
|
|
||||||
"""
|
|
||||||
result = {}
|
|
||||||
for entry_point in AVAILABLE_ENTRY_POINTS.get(key, []):
|
|
||||||
result[entry_point.name] = entry_point.load()
|
|
||||||
return result
|
|
||||||
|
|
||||||
|
|
||||||
def get_entry_point(key, value, default=None):
|
|
||||||
"""Check if registered entry point is available for a given name and
|
|
||||||
load it. Otherwise, return None.
|
|
||||||
|
|
||||||
key (unicode): Entry point name.
|
|
||||||
value (unicode): Name of entry point to load.
|
|
||||||
default: Optional default value to return.
|
|
||||||
RETURNS: The loaded entry point or None.
|
|
||||||
"""
|
|
||||||
for entry_point in AVAILABLE_ENTRY_POINTS.get(key, []):
|
|
||||||
if entry_point.name == value:
|
|
||||||
return entry_point.load()
|
|
||||||
return default
|
|
||||||
|
|
||||||
|
|
||||||
def is_in_jupyter():
|
def is_in_jupyter():
|
||||||
"""Check if user is running spaCy from a Jupyter notebook by detecting the
|
"""Check if user is running spaCy from a Jupyter notebook by detecting the
|
||||||
IPython kernel. Mainly used for the displaCy visualizer.
|
IPython kernel. Mainly used for the displaCy visualizer.
|
||||||
|
|
|
@ -109,8 +109,8 @@ raise an error if the pre-defined attrs of the two `DocBin`s don't match.
|
||||||
> doc_bin1.add(nlp("Hello world"))
|
> doc_bin1.add(nlp("Hello world"))
|
||||||
> doc_bin2 = DocBin(attrs=["LEMMA", "POS"])
|
> doc_bin2 = DocBin(attrs=["LEMMA", "POS"])
|
||||||
> doc_bin2.add(nlp("This is a sentence"))
|
> doc_bin2.add(nlp("This is a sentence"))
|
||||||
> merged_bins = doc_bin1.merge(doc_bin2)
|
> doc_bin1.merge(doc_bin2)
|
||||||
> assert len(merged_bins) == 2
|
> assert len(doc_bin1) == 2
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Argument | Type | Description |
|
| Argument | Type | Description |
|
||||||
|
|
|
@ -1,9 +1,10 @@
|
||||||
A named entity is a "real-world object" that's assigned a name – for example, a
|
A named entity is a "real-world object" that's assigned a name – for example, a
|
||||||
person, a country, a product or a book title. spaCy can **recognize**
|
person, a country, a product or a book title. spaCy can **recognize
|
||||||
[various types](/api/annotation#named-entities) of named entities in a document,
|
[various types](/api/annotation#named-entities)** of named entities in a
|
||||||
by asking the model for a **prediction**. Because models are statistical and
|
document, by asking the model for a **prediction**. Because models are
|
||||||
strongly depend on the examples they were trained on, this doesn't always work
|
statistical and strongly depend on the examples they were trained on, this
|
||||||
_perfectly_ and might need some tuning later, depending on your use case.
|
doesn't always work _perfectly_ and might need some tuning later, depending on
|
||||||
|
your use case.
|
||||||
|
|
||||||
Named entities are available as the `ents` property of a `Doc`:
|
Named entities are available as the `ents` property of a `Doc`:
|
||||||
|
|
||||||
|
|