diff --git a/.github/contributors/prilopes.md b/.github/contributors/prilopes.md new file mode 100644 index 000000000..ad111d4de --- /dev/null +++ b/.github/contributors/prilopes.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. 
The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Priscilla Lopes | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 2019-11-06 | +| GitHub username | prilopes | +| Website (optional) | | diff --git a/bin/ud/ud_train.py b/bin/ud/ud_train.py index 2784d7c3c..ddd87a31c 100644 --- a/bin/ud/ud_train.py +++ b/bin/ud/ud_train.py @@ -221,6 +221,13 @@ def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None): def write_conllu(docs, file_): + if not Token.has_extension("get_conllu_lines"): + Token.set_extension("get_conllu_lines", method=get_token_conllu) + if not Token.has_extension("begins_fused"): + Token.set_extension("begins_fused", default=False) + if not Token.has_extension("inside_fused"): + Token.set_extension("inside_fused", default=False) + merger = Matcher(docs[0].vocab) merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}]) for i, doc in enumerate(docs): diff --git a/requirements.txt b/requirements.txt index 89118b970..12f19bb88 100644 --- a/requirements.txt +++ b/requirements.txt @@ -6,12 +6,12 @@ blis>=0.4.0,<0.5.0 murmurhash>=0.28.0,<1.1.0 wasabi>=0.4.0,<1.1.0 srsly>=0.1.0,<1.1.0 +catalogue>=0.0.7,<1.1.0 # Third party dependencies numpy>=1.15.0 requests>=2.13.0,<3.0.0 plac>=0.9.6,<1.2.0 pathlib==1.0.1; python_version < "3.4" -importlib_metadata>=0.20; python_version < "3.8" # Optional dependencies jsonschema>=2.6.0,<3.1.0 # Development dependencies diff --git a/setup.cfg b/setup.cfg index 60a24dc58..940066a9e 100644 --- a/setup.cfg +++ b/setup.cfg @@ -48,13 +48,13 @@ install_requires = blis>=0.4.0,<0.5.0 wasabi>=0.4.0,<1.1.0 srsly>=0.1.0,<1.1.0 + catalogue>=0.0.7,<1.1.0 # Third-party dependencies setuptools numpy>=1.15.0 plac>=0.9.6,<1.2.0 requests>=2.13.0,<3.0.0 pathlib==1.0.1; python_version < "3.4" - importlib_metadata>=0.20; python_version < "3.8" [options.extras_require] lookups = diff --git a/spacy/__init__.py b/spacy/__init__.py index 57701179f..4a0d16a49 100644 --- a/spacy/__init__.py +++ 
b/spacy/__init__.py @@ -15,7 +15,7 @@ from .glossary import explain from .about import __version__ from .errors import Errors, Warnings, deprecation_warning from . import util -from .util import register_architecture, get_architecture +from .util import registry from .language import component diff --git a/spacy/compat.py b/spacy/compat.py index 5bff28815..0ea31c6b3 100644 --- a/spacy/compat.py +++ b/spacy/compat.py @@ -36,11 +36,6 @@ try: except ImportError: cupy = None -try: # Python 3.8 - import importlib.metadata as importlib_metadata -except ImportError: - import importlib_metadata # noqa: F401 - try: from thinc.neural.optimizers import Optimizer # noqa: F401 except ImportError: diff --git a/spacy/displacy/render.py b/spacy/displacy/render.py index 17b67940a..d6e33437b 100644 --- a/spacy/displacy/render.py +++ b/spacy/displacy/render.py @@ -5,7 +5,7 @@ import uuid from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE -from ..util import minify_html, escape_html, get_entry_points, ENTRY_POINTS +from ..util import minify_html, escape_html, registry from ..errors import Errors @@ -242,7 +242,7 @@ class EntityRenderer(object): "CARDINAL": "#e4e7d2", "PERCENT": "#e4e7d2", } - user_colors = get_entry_points(ENTRY_POINTS.displacy_colors) + user_colors = registry.displacy_colors.get_all() for user_color in user_colors.values(): colors.update(user_color) colors.update(options.get("colors", {})) diff --git a/spacy/lang/es/examples.py b/spacy/lang/es/examples.py index 96ff9c1ed..0e31b56af 100644 --- a/spacy/lang/es/examples.py +++ b/spacy/lang/es/examples.py @@ -11,12 +11,12 @@ Example sentences to test spaCy and its language models. sentences = [ - "Apple está buscando comprar una startup del Reino Unido por mil millones de dólares", - "Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes", - "San Francisco analiza prohibir los robots delivery", - "Londres es una gran ciudad del Reino Unido", - "El gato come pescado", - "Veo al hombre con el telescopio", - "La araña come moscas", - "El pingüino incuba en su nido", + "Apple está buscando comprar una startup del Reino Unido por mil millones de dólares.", + "Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes.", + "San Francisco analiza prohibir los robots delivery.", + "Londres es una gran ciudad del Reino Unido.", + "El gato come pescado.", + "Veo al hombre con el telescopio.", + "La araña come moscas.", + "El pingüino incuba en su nido.", ] diff --git a/spacy/lang/nb/examples.py b/spacy/lang/nb/examples.py index 72d6b5a71..c15426ded 100644 --- a/spacy/lang/nb/examples.py +++ b/spacy/lang/nb/examples.py @@ -11,8 +11,8 @@ Example sentences to test spaCy and its language models. 
sentences = [ - "Apple vurderer å kjøpe britisk oppstartfirma for en milliard dollar", - "Selvkjørende biler flytter forsikringsansvaret over på produsentene ", - "San Francisco vurderer å forby robotbud på fortauene", + "Apple vurderer å kjøpe britisk oppstartfirma for en milliard dollar.", + "Selvkjørende biler flytter forsikringsansvaret over på produsentene.", + "San Francisco vurderer å forby robotbud på fortauene.", "London er en stor by i Storbritannia.", ] diff --git a/spacy/lang/xx/examples.py b/spacy/lang/xx/examples.py new file mode 100644 index 000000000..38cd5e0cd --- /dev/null +++ b/spacy/lang/xx/examples.py @@ -0,0 +1,99 @@ +# coding: utf8 +from __future__ import unicode_literals + + +""" +Example sentences to test spaCy and its language models. + +>>> from spacy.lang.de.examples import sentences +>>> docs = nlp.pipe(sentences) +""" + +# combined examples from de/en/es/fr/it/nl/pl/pt/ru + +sentences = [ + "Die ganze Stadt ist ein Startup: Shenzhen ist das Silicon Valley für Hardware-Firmen", + "Wie deutsche Startups die Technologie vorantreiben wollen: Künstliche Intelligenz", + "Trend zum Urlaub in Deutschland beschert Gastwirten mehr Umsatz", + "Bundesanwaltschaft erhebt Anklage gegen mutmaßlichen Schweizer Spion", + "San Francisco erwägt Verbot von Lieferrobotern", + "Autonome Fahrzeuge verlagern Haftpflicht auf Hersteller", + "Wo bist du?", + "Was ist die Hauptstadt von Deutschland?", + "Apple is looking at buying U.K. startup for $1 billion", + "Autonomous cars shift insurance liability toward manufacturers", + "San Francisco considers banning sidewalk delivery robots", + "London is a big city in the United Kingdom.", + "Where are you?", + "Who is the president of France?", + "What is the capital of the United States?", + "When was Barack Obama born?", + "Apple está buscando comprar una startup del Reino Unido por mil millones de dólares.", + "Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes.", + "San Francisco analiza prohibir los robots delivery.", + "Londres es una gran ciudad del Reino Unido.", + "El gato come pescado.", + "Veo al hombre con el telescopio.", + "La araña come moscas.", + "El pingüino incuba en su nido.", + "Apple cherche à acheter une start-up anglaise pour 1 milliard de dollars", + "Les voitures autonomes déplacent la responsabilité de l'assurance vers les constructeurs", + "San Francisco envisage d'interdire les robots coursiers sur les trottoirs", + "Londres est une grande ville du Royaume-Uni", + "L’Italie choisit ArcelorMittal pour reprendre la plus grande aciérie d’Europe", + "Apple lance HomePod parce qu'il se sent menacé par l'Echo d'Amazon", + "La France ne devrait pas manquer d'électricité cet été, même en cas de canicule", + "Nouvelles attaques de Trump contre le maire de Londres", + "Où es-tu ?", + "Qui est le président de la France ?", + "Où est la capitale des États-Unis ?", + "Quand est né Barack Obama ?", + "Apple vuole comprare una startup del Regno Unito per un miliardo di dollari", + "Le automobili a guida autonoma spostano la responsabilità assicurativa verso i produttori", + "San Francisco prevede di bandire i robot di consegna porta a porta", + "Londra è una grande città del Regno Unito.", + "Apple overweegt om voor 1 miljard een U.K. 
startup te kopen", + "Autonome auto's verschuiven de verzekeringverantwoordelijkheid naar producenten", + "San Francisco overweegt robots op voetpaden te verbieden", + "Londen is een grote stad in het Verenigd Koninkrijk", + "Poczuł przyjemną woń mocnej kawy.", + "Istnieje wiele dróg oddziaływania substancji psychoaktywnej na układ nerwowy.", + "Powitał mnie biało-czarny kot, płosząc siedzące na płocie trzy dorodne dudki.", + "Nowy abonament pod lupą Komisji Europejskiej", + "Czy w ciągu ostatnich 48 godzin spożyłeś leki zawierające paracetamol?", + "Kto ma ochotę zapoznać się z innymi niż w książkach przygodami Muminków i ich przyjaciół, temu polecam komiks Tove Jansson „Muminki i morze”.", + "Apple está querendo comprar uma startup do Reino Unido por 100 milhões de dólares.", + "Carros autônomos empurram a responsabilidade do seguro para os fabricantes.." + "São Francisco considera banir os robôs de entrega que andam pelas calçadas.", + "Londres é a maior cidade do Reino Unido.", + # Translations from English: + "Apple рассматривает возможность покупки стартапа из Соединённого Королевства за $1 млрд", + "Беспилотные автомобили перекладывают страховую ответственность на производителя", + "В Сан-Франциско рассматривается возможность запрета роботов-курьеров, которые перемещаются по тротуару", + "Лондон — это большой город в Соединённом Королевстве", + # Native Russian sentences: + # Colloquial: + "Да, нет, наверное!", # Typical polite refusal + "Обратите внимание на необыкновенную красоту этого города-героя Москвы, столицы нашей Родины!", # From a tour guide speech + # Examples of Bookish Russian: + # Quote from "The Golden Calf" + "Рио-де-Жанейро — это моя мечта, и не смейте касаться её своими грязными лапами!", + # Quotes from "Ivan Vasilievich changes his occupation" + "Ты пошто боярыню обидел, смерд?!!", + "Оставь меня, старушка, я в печали!", + # Quotes from Dostoevsky: + "Уж коли я, такой же, как и ты, человек грешный, над тобой умилился и пожалел тебя, кольми паче бог", + "В мечтах я нередко, говорит, доходил до страстных помыслов о служении человечеству и может быть действительно пошел бы на крест за людей, если б это вдруг как-нибудь потребовалось, а между тем я двух дней не в состоянии прожить ни с кем в одной комнате, о чем знаю из опыта", + "Зато всегда так происходило, что чем более я ненавидел людей в частности, тем пламеннее становилась любовь моя к человечеству вообще", + # Quotes from Chekhov: + "Ненужные дела и разговоры всё об одном отхватывают на свою долю лучшую часть времени, лучшие силы, и в конце концов остается какая-то куцая, бескрылая жизнь, какая-то чепуха, и уйти и бежать нельзя, точно сидишь в сумасшедшем доме или в арестантских ротах!", + # Quotes from Turgenev: + "Нравится тебе женщина, старайся добиться толку; а нельзя — ну, не надо, отвернись — земля не клином сошлась", + "Узенькое местечко, которое я занимаю, до того крохотно в сравнении с остальным пространством, где меня нет и где дела до меня нет; и часть времени, которую мне удастся прожить, так ничтожна перед вечностью, где меня не было и не будет...", + # Quotes from newspapers: + # Komsomolskaya Pravda: + "На заседании президиума правительства Москвы принято решение присвоить статус инвестиционного приоритетного проекта города Москвы киностудии Союзмультфильм", + "Глава Минобороны Сергей Шойгу заявил, что обстановка на этом стратегическом направлении требует непрерывного совершенствования боевого состава войск", + # Argumenty i Facty: + "На реплику лже-Говина — дескать, он (Волков) будет лучшим 
революционером — Стамп с энтузиазмом ответил: Непременно!", +] diff --git a/spacy/lang/zh/__init__.py b/spacy/lang/zh/__init__.py index 91daea099..5bd7b7335 100644 --- a/spacy/lang/zh/__init__.py +++ b/spacy/lang/zh/__init__.py @@ -4,19 +4,92 @@ from __future__ import unicode_literals from ...attrs import LANG from ...language import Language from ...tokens import Doc +from ...util import DummyTokenizer from ..tokenizer_exceptions import BASE_EXCEPTIONS +from .lex_attrs import LEX_ATTRS from .stop_words import STOP_WORDS from .tag_map import TAG_MAP +def try_jieba_import(use_jieba): + try: + import jieba + return jieba + except ImportError: + if use_jieba: + msg = ( + "Jieba not installed. Either set Chinese.use_jieba = False, " + "or install it https://github.com/fxsjy/jieba" + ) + raise ImportError(msg) + + +class ChineseTokenizer(DummyTokenizer): + def __init__(self, cls, nlp=None): + self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp) + self.use_jieba = cls.use_jieba + self.jieba_seg = try_jieba_import(self.use_jieba) + self.tokenizer = Language.Defaults().create_tokenizer(nlp) + + def __call__(self, text): + # use jieba + if self.use_jieba: + jieba_words = list([x for x in self.jieba_seg.cut(text, cut_all=False) if x]) + words = [jieba_words[0]] + spaces = [False] + for i in range(1, len(jieba_words)): + word = jieba_words[i] + if word.isspace(): + # second token in adjacent whitespace following a + # non-space token + if spaces[-1]: + words.append(word) + spaces.append(False) + # first space token following non-space token + elif word == " " and not words[-1].isspace(): + spaces[-1] = True + # token is non-space whitespace or any whitespace following + # a whitespace token + else: + # extend previous whitespace token with more whitespace + if words[-1].isspace(): + words[-1] += word + # otherwise it's a new whitespace token + else: + words.append(word) + spaces.append(False) + else: + words.append(word) + spaces.append(False) + return Doc(self.vocab, words=words, spaces=spaces) + + # split into individual characters + words = [] + spaces = [] + for token in self.tokenizer(text): + if token.text.isspace(): + words.append(token.text) + spaces.append(False) + else: + words.extend(list(token.text)) + spaces.extend([False] * len(token.text)) + spaces[-1] = bool(token.whitespace_) + return Doc(self.vocab, words=words, spaces=spaces) + + class ChineseDefaults(Language.Defaults): lex_attr_getters = dict(Language.Defaults.lex_attr_getters) + lex_attr_getters.update(LEX_ATTRS) lex_attr_getters[LANG] = lambda text: "zh" - use_jieba = True tokenizer_exceptions = BASE_EXCEPTIONS stop_words = STOP_WORDS tag_map = TAG_MAP writing_system = {"direction": "ltr", "has_case": False, "has_letters": False} + use_jieba = True + + @classmethod + def create_tokenizer(cls, nlp=None): + return ChineseTokenizer(cls, nlp) class Chinese(Language): @@ -24,26 +97,7 @@ class Chinese(Language): Defaults = ChineseDefaults # override defaults def make_doc(self, text): - if self.Defaults.use_jieba: - try: - import jieba - except ImportError: - msg = ( - "Jieba not installed. 
Either set Chinese.use_jieba = False, " - "or install it https://github.com/fxsjy/jieba" - ) - raise ImportError(msg) - words = list(jieba.cut(text, cut_all=False)) - words = [x for x in words if x] - return Doc(self.vocab, words=words, spaces=[False] * len(words)) - else: - words = [] - spaces = [] - for token in self.tokenizer(text): - words.extend(list(token.text)) - spaces.extend([False] * len(token.text)) - spaces[-1] = bool(token.whitespace_) - return Doc(self.vocab, words=words, spaces=spaces) + return self.tokenizer(text) __all__ = ["Chinese"] diff --git a/spacy/lang/zh/tag_map.py b/spacy/lang/zh/tag_map.py index 8d2f99d01..41e2d2158 100644 --- a/spacy/lang/zh/tag_map.py +++ b/spacy/lang/zh/tag_map.py @@ -1,11 +1,12 @@ # coding: utf8 from __future__ import unicode_literals -from ...symbols import POS, PUNCT, ADJ, CONJ, CCONJ, NUM, DET, ADV, ADP, X, VERB -from ...symbols import NOUN, PART, INTJ, PRON +from ...symbols import POS, PUNCT, ADJ, SCONJ, CCONJ, NUM, DET, ADV, ADP, X +from ...symbols import NOUN, PART, INTJ, PRON, VERB, SPACE -# The Chinese part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. -# We also map the tags to the simpler Google Universal POS tag set. +# The Chinese part-of-speech tagger uses the OntoNotes 5 version of the Penn +# Treebank tag set. We also map the tags to the simpler Universal Dependencies +# v2 tag set. TAG_MAP = { "AS": {POS: PART}, @@ -38,10 +39,11 @@ TAG_MAP = { "OD": {POS: NUM}, "DT": {POS: DET}, "CC": {POS: CCONJ}, - "CS": {POS: CONJ}, + "CS": {POS: SCONJ}, "AD": {POS: ADV}, "JJ": {POS: ADJ}, "P": {POS: ADP}, "PN": {POS: PRON}, "PU": {POS: PUNCT}, + "_SP": {POS: SPACE}, } diff --git a/spacy/language.py b/spacy/language.py index 97d6515c5..72044a0c5 100644 --- a/spacy/language.py +++ b/spacy/language.py @@ -51,8 +51,8 @@ class BaseDefaults(object): filenames = {name: root / filename for name, filename in cls.resources} if LANG in cls.lex_attr_getters: lang = cls.lex_attr_getters[LANG](None) - user_lookups = util.get_entry_point(util.ENTRY_POINTS.lookups, lang, {}) - filenames.update(user_lookups) + if lang in util.registry.lookups: + filenames.update(util.registry.lookups.get(lang)) lookups = Lookups() for name, filename in filenames.items(): data = util.load_language_data(filename) @@ -155,7 +155,7 @@ class Language(object): 100,000 characters in one text. RETURNS (Language): The newly constructed object. 
""" - user_factories = util.get_entry_points(util.ENTRY_POINTS.factories) + user_factories = util.registry.factories.get_all() self.factories.update(user_factories) self._meta = dict(meta) self._path = None diff --git a/spacy/matcher/dependencymatcher.pyx b/spacy/matcher/dependencymatcher.pyx index ae2ad3ca6..56d27024d 100644 --- a/spacy/matcher/dependencymatcher.pyx +++ b/spacy/matcher/dependencymatcher.pyx @@ -240,7 +240,7 @@ cdef class DependencyMatcher: for i, (ent_id, nodes) in enumerate(matched_key_trees): on_match = self._callbacks.get(ent_id) if on_match is not None: - on_match(self, doc, i, matches) + on_match(self, doc, i, matched_key_trees) return matched_key_trees def recurse(self,tree,id_to_position,_node_operator_map,int patternLength,visitedNodes,matched_trees): diff --git a/spacy/ml/common.py b/spacy/ml/common.py index 963d4dc35..f90b53a15 100644 --- a/spacy/ml/common.py +++ b/spacy/ml/common.py @@ -3,10 +3,10 @@ from __future__ import unicode_literals from thinc.api import chain from thinc.v2v import Maxout from thinc.misc import LayerNorm -from ..util import register_architecture, make_layer +from ..util import registry, make_layer -@register_architecture("thinc.FeedForward.v1") +@registry.architectures.register("thinc.FeedForward.v1") def FeedForward(config): layers = [make_layer(layer_cfg) for layer_cfg in config["layers"]] model = chain(*layers) @@ -14,7 +14,7 @@ def FeedForward(config): return model -@register_architecture("spacy.LayerNormalizedMaxout.v1") +@registry.architectures.register("spacy.LayerNormalizedMaxout.v1") def LayerNormalizedMaxout(config): width = config["width"] pieces = config["pieces"] diff --git a/spacy/ml/tok2vec.py b/spacy/ml/tok2vec.py index 0b30551b5..8f86475ef 100644 --- a/spacy/ml/tok2vec.py +++ b/spacy/ml/tok2vec.py @@ -6,11 +6,11 @@ from thinc.v2v import Maxout, Model from thinc.i2v import HashEmbed, StaticVectors from thinc.t2t import ExtractWindow from thinc.misc import Residual, LayerNorm, FeatureExtracter -from ..util import make_layer, register_architecture +from ..util import make_layer, registry from ._wire import concatenate_lists -@register_architecture("spacy.Tok2Vec.v1") +@registry.architectures.register("spacy.Tok2Vec.v1") def Tok2Vec(config): doc2feats = make_layer(config["@doc2feats"]) embed = make_layer(config["@embed"]) @@ -24,13 +24,13 @@ def Tok2Vec(config): return tok2vec -@register_architecture("spacy.Doc2Feats.v1") +@registry.architectures.register("spacy.Doc2Feats.v1") def Doc2Feats(config): columns = config["columns"] return FeatureExtracter(columns) -@register_architecture("spacy.MultiHashEmbed.v1") +@registry.architectures.register("spacy.MultiHashEmbed.v1") def MultiHashEmbed(config): # For backwards compatibility with models before the architecture registry, # we have to be careful to get exactly the same model structure. One subtle @@ -78,7 +78,7 @@ def MultiHashEmbed(config): return layer -@register_architecture("spacy.CharacterEmbed.v1") +@registry.architectures.register("spacy.CharacterEmbed.v1") def CharacterEmbed(config): from .. 
import _ml @@ -94,7 +94,7 @@ def CharacterEmbed(config): return model -@register_architecture("spacy.MaxoutWindowEncoder.v1") +@registry.architectures.register("spacy.MaxoutWindowEncoder.v1") def MaxoutWindowEncoder(config): nO = config["width"] nW = config["window_size"] @@ -110,7 +110,7 @@ def MaxoutWindowEncoder(config): return model -@register_architecture("spacy.MishWindowEncoder.v1") +@registry.architectures.register("spacy.MishWindowEncoder.v1") def MishWindowEncoder(config): from thinc.v2v import Mish @@ -124,12 +124,12 @@ def MishWindowEncoder(config): return model -@register_architecture("spacy.PretrainedVectors.v1") +@registry.architectures.register("spacy.PretrainedVectors.v1") def PretrainedVectors(config): return StaticVectors(config["vectors_name"], config["width"], config["column"]) -@register_architecture("spacy.TorchBiLSTMEncoder.v1") +@registry.architectures.register("spacy.TorchBiLSTMEncoder.v1") def TorchBiLSTMEncoder(config): import torch.nn from thinc.extra.wrappers import PyTorchWrapperRNN diff --git a/spacy/tests/conftest.py b/spacy/tests/conftest.py index b0d373c42..d6b9ba11f 100644 --- a/spacy/tests/conftest.py +++ b/spacy/tests/conftest.py @@ -218,3 +218,9 @@ def uk_tokenizer(): @pytest.fixture(scope="session") def ur_tokenizer(): return get_lang_class("ur").Defaults.create_tokenizer() + + +@pytest.fixture(scope="session") +def zh_tokenizer(): + pytest.importorskip("jieba") + return get_lang_class("zh").Defaults.create_tokenizer() diff --git a/spacy/tests/doc/test_retokenize_split.py b/spacy/tests/doc/test_retokenize_split.py index 6c41a59be..d074fddc6 100644 --- a/spacy/tests/doc/test_retokenize_split.py +++ b/spacy/tests/doc/test_retokenize_split.py @@ -183,3 +183,18 @@ def test_doc_retokenizer_split_lex_attrs(en_vocab): retokenizer.split(doc[0], ["Los", "Angeles"], heads, attrs=attrs) assert doc[0].is_stop assert not doc[1].is_stop + + +def test_doc_retokenizer_realloc(en_vocab): + """#4604: realloc correctly when new tokens outnumber original tokens""" + text = "Hyperglycemic adverse events following antipsychotic drug administration in the" + doc = Doc(en_vocab, words=text.split()[:-1]) + with doc.retokenize() as retokenizer: + token = doc[0] + heads = [(token, 0)] * len(token) + retokenizer.split(doc[token.i], list(token.text), heads=heads) + doc = Doc(en_vocab, words=text.split()) + with doc.retokenize() as retokenizer: + token = doc[0] + heads = [(token, 0)] * len(token) + retokenizer.split(doc[token.i], list(token.text), heads=heads) diff --git a/spacy/tests/lang/zh/__init__.py b/spacy/tests/lang/zh/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/spacy/tests/lang/zh/test_text.py b/spacy/tests/lang/zh/test_text.py new file mode 100644 index 000000000..235f597a5 --- /dev/null +++ b/spacy/tests/lang/zh/test_text.py @@ -0,0 +1,25 @@ +# coding: utf-8 +from __future__ import unicode_literals + + +import pytest + + +@pytest.mark.parametrize( + "text,match", + [ + ("10", True), + ("1", True), + ("999.0", True), + ("一", True), + ("二", True), + ("〇", True), + ("十一", True), + ("狗", False), + (",", False), + ], +) +def test_lex_attrs_like_number(zh_tokenizer, text, match): + tokens = zh_tokenizer(text) + assert len(tokens) == 1 + assert tokens[0].like_num == match diff --git a/spacy/tests/lang/zh/test_tokenizer.py b/spacy/tests/lang/zh/test_tokenizer.py new file mode 100644 index 000000000..36d94beb5 --- /dev/null +++ b/spacy/tests/lang/zh/test_tokenizer.py @@ -0,0 +1,31 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import 
pytest + + +# fmt: off +TOKENIZER_TESTS = [ + ("作为语言而言,为世界使用人数最多的语言,目前世界有五分之一人口做为母语。", + ['作为', '语言', '而言', ',', '为', '世界', '使用', '人', '数最多', + '的', '语言', ',', '目前', '世界', '有', '五分之一', '人口', '做', + '为', '母语', '。']), +] +# fmt: on + + +@pytest.mark.parametrize("text,expected_tokens", TOKENIZER_TESTS) +def test_zh_tokenizer(zh_tokenizer, text, expected_tokens): + zh_tokenizer.use_jieba = False + tokens = [token.text for token in zh_tokenizer(text)] + assert tokens == list(text) + + zh_tokenizer.use_jieba = True + tokens = [token.text for token in zh_tokenizer(text)] + assert tokens == expected_tokens + + +def test_extra_spaces(zh_tokenizer): + # note: three spaces after "I" + tokens = zh_tokenizer("I like cheese.") + assert tokens[1].orth_ == " " diff --git a/spacy/tests/regression/test_issue4590.py b/spacy/tests/regression/test_issue4590.py new file mode 100644 index 000000000..6a43dfea9 --- /dev/null +++ b/spacy/tests/regression/test_issue4590.py @@ -0,0 +1,34 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest +from mock import Mock +from spacy.matcher import DependencyMatcher +from ..util import get_doc + + +def test_issue4590(en_vocab): + """Test that matches param in on_match method are the same as matches run with no on_match method""" + pattern = [ + {"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}}, + {"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}}, + {"SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}}, + ] + + on_match = Mock() + + matcher = DependencyMatcher(en_vocab) + matcher.add("pattern", on_match, pattern) + + text = "The quick brown fox jumped over the lazy fox" + heads = [3, 2, 1, 1, 0, -1, 2, 1, -3] + deps = ["det", "amod", "amod", "nsubj", "prep", "pobj", "det", "amod"] + + doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps) + + matches = matcher(doc) + + on_match_args = on_match.call_args + + assert on_match_args[0][3] == matches + diff --git a/spacy/tests/test_architectures.py b/spacy/tests/test_architectures.py new file mode 100644 index 000000000..77f1af020 --- /dev/null +++ b/spacy/tests/test_architectures.py @@ -0,0 +1,19 @@ +# coding: utf8 +from __future__ import unicode_literals + +import pytest +from spacy import registry +from thinc.v2v import Affine +from catalogue import RegistryError + + +@registry.architectures.register("my_test_function") +def create_model(nr_in, nr_out): + return Affine(nr_in, nr_out) + + +def test_get_architecture(): + arch = registry.architectures.get("my_test_function") + assert arch is create_model + with pytest.raises(RegistryError): + registry.architectures.get("not_an_existing_key") diff --git a/spacy/tests/test_register_architecture.py b/spacy/tests/test_register_architecture.py deleted file mode 100644 index 0c1b5b16f..000000000 --- a/spacy/tests/test_register_architecture.py +++ /dev/null @@ -1,19 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest -from spacy import register_architecture -from spacy import get_architecture -from thinc.v2v import Affine - - -@register_architecture("my_test_function") -def create_model(nr_in, nr_out): - return Affine(nr_in, nr_out) - - -def test_get_architecture(): - arch = get_architecture("my_test_function") - assert arch is create_model - with pytest.raises(KeyError): - get_architecture("not_an_existing_key") diff --git a/spacy/tokens/_retokenize.pyx b/spacy/tokens/_retokenize.pyx index 5f890de45..a5d06491a 
100644 --- a/spacy/tokens/_retokenize.pyx +++ b/spacy/tokens/_retokenize.pyx @@ -329,7 +329,7 @@ def _split(Doc doc, int token_index, orths, heads, attrs): doc.c[i].head += offset # Double doc.c max_length if necessary (until big enough for all new tokens) while doc.length + nb_subtokens - 1 >= doc.max_length: - doc._realloc(doc.length * 2) + doc._realloc(doc.max_length * 2) # Move tokens after the split to create space for the new tokens doc.length = len(doc) + nb_subtokens -1 to_process_tensor = (doc.tensor is not None and doc.tensor.size != 0) diff --git a/spacy/util.py b/spacy/util.py index 74e4cc1c6..2d5a56806 100644 --- a/spacy/util.py +++ b/spacy/util.py @@ -13,6 +13,7 @@ import functools import itertools import numpy.random import srsly +import catalogue import sys try: @@ -27,29 +28,20 @@ except ImportError: from .symbols import ORTH from .compat import cupy, CudaStream, path2str, basestring_, unicode_ -from .compat import import_file, importlib_metadata +from .compat import import_file from .errors import Errors, Warnings, deprecation_warning -LANGUAGES = {} -ARCHITECTURES = {} _data_path = Path(__file__).parent / "data" _PRINT_ENV = False -# NB: Ony ever call this once! If called more than ince within the -# function, test_issue1506 hangs and it's not 100% clear why. -AVAILABLE_ENTRY_POINTS = importlib_metadata.entry_points() - - -class ENTRY_POINTS(object): - """Available entry points to register extensions.""" - - factories = "spacy_factories" - languages = "spacy_languages" - displacy_colors = "spacy_displacy_colors" - lookups = "spacy_lookups" - architectures = "spacy_architectures" +class registry(object): + languages = catalogue.create("spacy", "languages", entry_points=True) + architectures = catalogue.create("spacy", "architectures", entry_points=True) + lookups = catalogue.create("spacy", "lookups", entry_points=True) + factories = catalogue.create("spacy", "factories", entry_points=True) + displacy_colors = catalogue.create("spacy", "displacy_colors", entry_points=True) def set_env_log(value): @@ -65,8 +57,7 @@ def lang_class_is_loaded(lang): lang (unicode): Two-letter language code, e.g. 'en'. RETURNS (bool): Whether a Language class has been loaded. """ - global LANGUAGES - return lang in LANGUAGES + return lang in registry.languages def get_lang_class(lang): @@ -75,19 +66,16 @@ def get_lang_class(lang): lang (unicode): Two-letter language code, e.g. 'en'. RETURNS (Language): Language class. """ - global LANGUAGES - # Check if an entry point is exposed for the language code - entry_point = get_entry_point(ENTRY_POINTS.languages, lang) - if entry_point is not None: - LANGUAGES[lang] = entry_point - return entry_point - if lang not in LANGUAGES: + # Check if language is registered / entry point is available + if lang in registry.languages: + return registry.languages.get(lang) + else: try: module = importlib.import_module(".lang.%s" % lang, "spacy") except ImportError as err: raise ImportError(Errors.E048.format(lang=lang, err=err)) - LANGUAGES[lang] = getattr(module, module.__all__[0]) - return LANGUAGES[lang] + set_lang_class(lang, getattr(module, module.__all__[0])) + return registry.languages.get(lang) def set_lang_class(name, cls): @@ -96,8 +84,7 @@ def set_lang_class(name, cls): name (unicode): Name of Language class. cls (Language): Language class. 
""" - global LANGUAGES - LANGUAGES[name] = cls + registry.languages.register(name, func=cls) def get_data_path(require_exists=True): @@ -121,49 +108,11 @@ def set_data_path(path): _data_path = ensure_path(path) -def register_architecture(name, arch=None): - """Decorator to register an architecture. An architecture is a function - that returns a Thinc Model object. - - name (unicode): The name of the architecture to register. - arch (Model): Optional architecture if function is called directly and - not used as a decorator. - RETURNS (callable): Function to register architecture. - """ - global ARCHITECTURES - if arch is not None: - ARCHITECTURES[name] = arch - return arch - - def do_registration(arch): - ARCHITECTURES[name] = arch - return arch - - return do_registration - - def make_layer(arch_config): - arch_func = get_architecture(arch_config["arch"]) + arch_func = registry.architectures.get(arch_config["arch"]) return arch_func(arch_config["config"]) -def get_architecture(name): - """Get a model architecture function by name. Raises a KeyError if the - architecture is not found. - - name (unicode): The mame of the architecture. - RETURNS (Model): The architecture. - """ - # Check if an entry point is exposed for the architecture code - entry_point = get_entry_point(ENTRY_POINTS.architectures, name) - if entry_point is not None: - ARCHITECTURES[name] = entry_point - if name not in ARCHITECTURES: - names = ", ".join(sorted(ARCHITECTURES.keys())) - raise KeyError(Errors.E174.format(name=name, names=names)) - return ARCHITECTURES[name] - - def ensure_path(path): """Ensure string is converted to a Path. @@ -327,34 +276,6 @@ def get_package_path(name): return Path(pkg.__file__).parent -def get_entry_points(key): - """Get registered entry points from other packages for a given key, e.g. - 'spacy_factories' and return them as a dictionary, keyed by name. - - key (unicode): Entry point name. - RETURNS (dict): Entry points, keyed by name. - """ - result = {} - for entry_point in AVAILABLE_ENTRY_POINTS.get(key, []): - result[entry_point.name] = entry_point.load() - return result - - -def get_entry_point(key, value, default=None): - """Check if registered entry point is available for a given name and - load it. Otherwise, return None. - - key (unicode): Entry point name. - value (unicode): Name of entry point to load. - default: Optional default value to return. - RETURNS: The loaded entry point or None. - """ - for entry_point in AVAILABLE_ENTRY_POINTS.get(key, []): - if entry_point.name == value: - return entry_point.load() - return default - - def is_in_jupyter(): """Check if user is running spaCy from a Jupyter notebook by detecting the IPython kernel. Mainly used for the displaCy visualizer. diff --git a/website/docs/api/docbin.md b/website/docs/api/docbin.md index 41ebb6075..9f12a07e6 100644 --- a/website/docs/api/docbin.md +++ b/website/docs/api/docbin.md @@ -109,8 +109,8 @@ raise an error if the pre-defined attrs of the two `DocBin`s don't match. 
> doc_bin1.add(nlp("Hello world")) > doc_bin2 = DocBin(attrs=["LEMMA", "POS"]) > doc_bin2.add(nlp("This is a sentence")) -> merged_bins = doc_bin1.merge(doc_bin2) -> assert len(merged_bins) == 2 +> doc_bin1.merge(doc_bin2) +> assert len(doc_bin1) == 2 > ``` | Argument | Type | Description | diff --git a/website/docs/usage/101/_named-entities.md b/website/docs/usage/101/_named-entities.md index 0e8784187..0dfee8636 100644 --- a/website/docs/usage/101/_named-entities.md +++ b/website/docs/usage/101/_named-entities.md @@ -1,9 +1,10 @@ A named entity is a "real-world object" that's assigned a name – for example, a -person, a country, a product or a book title. spaCy can **recognize** -[various types](/api/annotation#named-entities) of named entities in a document, -by asking the model for a **prediction**. Because models are statistical and -strongly depend on the examples they were trained on, this doesn't always work -_perfectly_ and might need some tuning later, depending on your use case. +person, a country, a product or a book title. spaCy can **recognize +[various types](/api/annotation#named-entities)** of named entities in a +document, by asking the model for a **prediction**. Because models are +statistical and strongly depend on the examples they were trained on, this +doesn't always work _perfectly_ and might need some tuning later, depending on +your use case. Named entities are available as the `ents` property of a `Doc`:
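
The `_named-entities.md` hunk above ends on the sentence introducing `Doc.ents`. A minimal sketch of reading that property follows; the model name `en_core_web_sm` is only an illustration and is assumed to be installed separately, any pipeline with an NER component works the same way:

```python
import spacy

# Assumes a small English model with an NER component is installed;
# "en_core_web_sm" is used here purely as an illustration.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Each entity is a Span exposing its surface text, character offsets and label.
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```

Similarly, the `spacy/util.py` changes in this diff replace the old `register_architecture`/`get_entry_point` helpers and the module-level `LANGUAGES`/`ARCHITECTURES` dicts with `catalogue`-backed registries. A minimal sketch of the new registration and lookup pattern, mirroring the added `spacy/tests/test_architectures.py`:

```python
from spacy import registry
from thinc.v2v import Affine

# Register an architecture under a name; catalogue also discovers entry points.
@registry.architectures.register("my_test_function")
def create_model(nr_in, nr_out):
    return Affine(nr_in, nr_out)

# Retrieve it again by name (unknown names raise catalogue.RegistryError).
arch = registry.architectures.get("my_test_function")
assert arch is create_model
```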