diff --git a/.github/contributors/magnusburton.md b/.github/contributors/magnusburton.md new file mode 100644 index 000000000..9c9e2964e --- /dev/null +++ b/.github/contributors/magnusburton.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. 
This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made) will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. 
You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statements below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------------------- | +| Name | Magnus Burton | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 17-12-2016 | +| GitHub username | magnusburton | +| Website (optional) | | diff --git a/.github/contributors/oroszgy.md b/.github/contributors/oroszgy.md new file mode 100644 index 000000000..8e69b407e --- /dev/null +++ b/.github/contributors/oroszgy.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). 
+The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. 
This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made) will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. 
You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statements below. Please do NOT +mark both statements: + + * [X] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. 
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | György Orosz | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 2016-12-26 | +| GitHub username | oroszgy | +| Website (optional) | gyorgy.orosz.link | diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md index 2dc715fae..869113bd3 100644 --- a/CONTRIBUTORS.md +++ b/CONTRIBUTORS.md @@ -8,6 +8,7 @@ This is a list of everyone who has made significant contributions to spaCy, in a * Christoph Schwienheer, [@chssch](https://github.com/chssch) * Dafne van Kuppevelt, [@dafnevk](https://github.com/dafnevk) * Dmytro Sadovnychyi, [@sadovnychyi](https://github.com/sadovnychyi) +* György Orosz, [@oroszgy](https://github.com/oroszgy) * Henning Peters, [@henningpeters](https://github.com/henningpeters) * Ines Montani, [@ines](https://github.com/ines) * J Nicolas Schrading, [@NSchrading](https://github.com/NSchrading) @@ -16,6 +17,7 @@ This is a list of everyone who has made significant contributions to spaCy, in a * Kendrick Tan, [@kendricktan](https://github.com/kendricktan) * Kyle P. 
Johnson, [@kylepjohnson](https://github.com/kylepjohnson) * Liling Tan, [@alvations](https://github.com/alvations) +* Magnus Burton, [@magnusburton](https://github.com/magnusburton) * Mark Amery, [@ExplodingCabbage](https://github.com/ExplodingCabbage) * Matthew Honnibal, [@honnibal](https://github.com/honnibal) * Maxim Samsonov, [@maxirmx](https://github.com/maxirmx) diff --git a/LICENSE b/LICENSE index 45fcde806..ffce33b2a 100644 --- a/LICENSE +++ b/LICENSE @@ -1,8 +1,6 @@ The MIT License (MIT) -Copyright (C) 2015 Matthew Honnibal - 2016 spaCy GmbH - 2016 ExplosionAI UG (haftungsbeschränkt) +Copyright (C) 2016 ExplosionAI UG (haftungsbeschränkt), 2016 spaCy GmbH, 2015 Matthew Honnibal Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal diff --git a/README.rst b/README.rst index 2f4acd540..4cf6fc5c1 100644 --- a/README.rst +++ b/README.rst @@ -78,7 +78,7 @@ Features See `facts, figures and benchmarks `_. -Top Peformance +Top Performance ============== * Fastest in the world: <50ms per document. 
No faster system has ever been diff --git a/examples/chainer_sentiment.py b/examples/chainer_sentiment.py index ac3881e75..747ef508a 100644 --- a/examples/chainer_sentiment.py +++ b/examples/chainer_sentiment.py @@ -3,6 +3,9 @@ import plac import random import six +import cProfile +import pstats + import pathlib import cPickle as pickle from itertools import izip @@ -81,7 +84,7 @@ class SentimentModel(Chain): def __init__(self, nlp, shape, **settings): Chain.__init__(self, embed=_Embed(shape['nr_vector'], shape['nr_dim'], shape['nr_hidden'], - initialW=lambda arr: set_vectors(arr, nlp.vocab)), + set_vectors=lambda arr: set_vectors(arr, nlp.vocab)), encode=_Encode(shape['nr_hidden'], shape['nr_hidden']), attend=_Attend(shape['nr_hidden'], shape['nr_hidden']), predict=_Predict(shape['nr_hidden'], shape['nr_class'])) @@ -95,11 +98,11 @@ class SentimentModel(Chain): class _Embed(Chain): - def __init__(self, nr_vector, nr_dim, nr_out): + def __init__(self, nr_vector, nr_dim, nr_out, set_vectors=None): Chain.__init__(self, - embed=L.EmbedID(nr_vector, nr_dim), + embed=L.EmbedID(nr_vector, nr_dim, initialW=set_vectors), project=L.Linear(None, nr_out, nobias=True)) - #self.embed.unchain_backward() + self.embed.W.volatile = False def __call__(self, sentence): return [self.project(self.embed(ts)) for ts in F.transpose(sentence)] @@ -214,7 +217,6 @@ def set_vectors(vectors, vocab): vectors[lex.rank + 1] = lex.vector else: lex.norm = 0 - vectors.unchain_backwards() return vectors @@ -223,7 +225,9 @@ def train(train_texts, train_labels, dev_texts, dev_labels, by_sentence=True): nlp = spacy.load('en', entity=False) if 'nr_vector' not in lstm_shape: - lstm_shape['nr_vector'] = max(lex.rank+1 for lex in vocab if lex.has_vector) + lstm_shape['nr_vector'] = max(lex.rank+1 for lex in nlp.vocab if lex.has_vector) + if 'nr_dim' not in lstm_shape: + lstm_shape['nr_dim'] = nlp.vocab.vectors_length print("Make model") model = Classifier(SentimentModel(nlp, lstm_shape, **lstm_settings)) 
print("Parsing texts...") @@ -240,7 +244,7 @@ def train(train_texts, train_labels, dev_texts, dev_labels, optimizer = chainer.optimizers.Adam() optimizer.setup(model) updater = chainer.training.StandardUpdater(train_iter, optimizer, device=0) - trainer = chainer.training.Trainer(updater, (20, 'epoch'), out='result') + trainer = chainer.training.Trainer(updater, (1, 'epoch'), out='result') trainer.extend(extensions.Evaluator(dev_iter, model, device=0)) trainer.extend(extensions.LogReport()) @@ -305,11 +309,14 @@ def main(model_dir, train_dir, dev_dir, dev_labels = xp.asarray(dev_labels, dtype='i') lstm = train(train_texts, train_labels, dev_texts, dev_labels, {'nr_hidden': nr_hidden, 'max_length': max_length, 'nr_class': 2, - 'nr_vector': 2000, 'nr_dim': 32}, + 'nr_vector': 5000}, {'dropout': 0.5, 'lr': learn_rate}, {}, nb_epoch=nb_epoch, batch_size=batch_size) if __name__ == '__main__': + #cProfile.runctx("plac.call(main)", globals(), locals(), "Profile.prof") + #s = pstats.Stats("Profile.prof") + #s.strip_dirs().sort_stats("time").print_stats() plac.call(main) diff --git a/examples/deep_learning_keras.py b/examples/deep_learning_keras.py index c9ea7ff84..0ce0581bd 100644 --- a/examples/deep_learning_keras.py +++ b/examples/deep_learning_keras.py @@ -111,10 +111,9 @@ def compile_lstm(embeddings, shape, settings): mask_zero=True ) ) - model.add(TimeDistributed(Dense(shape['nr_hidden'] * 2, bias=False))) - model.add(Dropout(settings['dropout'])) - model.add(Bidirectional(LSTM(shape['nr_hidden']))) - model.add(Dropout(settings['dropout'])) + model.add(TimeDistributed(Dense(shape['nr_hidden'], bias=False))) + model.add(Bidirectional(LSTM(shape['nr_hidden'], dropout_U=settings['dropout'], + dropout_W=settings['dropout']))) model.add(Dense(shape['nr_class'], activation='sigmoid')) model.compile(optimizer=Adam(lr=settings['lr']), loss='binary_crossentropy', metrics=['accuracy']) @@ -195,7 +194,7 @@ def main(model_dir, train_dir, dev_dir, dev_labels = 
numpy.asarray(dev_labels, dtype='int32') lstm = train(train_texts, train_labels, dev_texts, dev_labels, {'nr_hidden': nr_hidden, 'max_length': max_length, 'nr_class': 1}, - {'dropout': 0.5, 'lr': learn_rate}, + {'dropout': dropout, 'lr': learn_rate}, {}, nb_epoch=nb_epoch, batch_size=batch_size) weights = lstm.get_weights() diff --git a/setup.py b/setup.py index 3127b6abb..2a1d56a5e 100644 --- a/setup.py +++ b/setup.py @@ -27,8 +27,10 @@ PACKAGES = [ 'spacy.es', 'spacy.fr', 'spacy.it', + 'spacy.hu', 'spacy.pt', 'spacy.nl', + 'spacy.sv', 'spacy.language_data', 'spacy.serialize', 'spacy.syntax', @@ -95,7 +97,7 @@ LINK_OPTIONS = { 'other' : [] } - + # I don't understand this very well yet. See Issue #267 # Fingers crossed! #if os.environ.get('USE_OPENMP') == '1': diff --git a/spacy/__init__.py b/spacy/__init__.py index 05e732a50..21e0f7db4 100644 --- a/spacy/__init__.py +++ b/spacy/__init__.py @@ -8,9 +8,11 @@ from . import de from . import zh from . import es from . import it +from . import hu from . import fr from . import pt from . import nl +from . 
import sv try: @@ -25,8 +27,10 @@ set_lang_class(es.Spanish.lang, es.Spanish) set_lang_class(pt.Portuguese.lang, pt.Portuguese) set_lang_class(fr.French.lang, fr.French) set_lang_class(it.Italian.lang, it.Italian) +set_lang_class(hu.Hungarian.lang, hu.Hungarian) set_lang_class(zh.Chinese.lang, zh.Chinese) set_lang_class(nl.Dutch.lang, nl.Dutch) +set_lang_class(sv.Swedish.lang, sv.Swedish) def load(name, **overrides): diff --git a/spacy/de/tokenizer_exceptions.py b/spacy/de/tokenizer_exceptions.py index d7d9a2f3a..b0561a223 100644 --- a/spacy/de/tokenizer_exceptions.py +++ b/spacy/de/tokenizer_exceptions.py @@ -2,7 +2,7 @@ from __future__ import unicode_literals from ..symbols import * -from ..language_data import PRON_LEMMA +from ..language_data import PRON_LEMMA, DET_LEMMA TOKENIZER_EXCEPTIONS = { @@ -15,23 +15,27 @@ TOKENIZER_EXCEPTIONS = { ], "'S": [ - {ORTH: "'S", LEMMA: PRON_LEMMA} + {ORTH: "'S", LEMMA: PRON_LEMMA, TAG: "PPER"} ], "'n": [ - {ORTH: "'n", LEMMA: "ein"} + {ORTH: "'n", LEMMA: DET_LEMMA, NORM: "ein"} ], "'ne": [ - {ORTH: "'ne", LEMMA: "eine"} + {ORTH: "'ne", LEMMA: DET_LEMMA, NORM: "eine"} ], "'nen": [ - {ORTH: "'nen", LEMMA: "einen"} + {ORTH: "'nen", LEMMA: DET_LEMMA, NORM: "einen"} + ], + + "'nem": [ + {ORTH: "'nem", LEMMA: DET_LEMMA, NORM: "einem"} ], "'s": [ - {ORTH: "'s", LEMMA: PRON_LEMMA} + {ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER"} ], "Abb.": [ @@ -195,7 +199,7 @@ TOKENIZER_EXCEPTIONS = { ], "S'": [ - {ORTH: "S'", LEMMA: PRON_LEMMA} + {ORTH: "S'", LEMMA: PRON_LEMMA, TAG: "PPER"} ], "Sa.": [ @@ -244,7 +248,7 @@ TOKENIZER_EXCEPTIONS = { "auf'm": [ {ORTH: "auf", LEMMA: "auf"}, - {ORTH: "'m", LEMMA: PRON_LEMMA} + {ORTH: "'m", LEMMA: DET_LEMMA, NORM: "dem" } ], "bspw.": [ @@ -268,8 +272,8 @@ TOKENIZER_EXCEPTIONS = { ], "du's": [ - {ORTH: "du", LEMMA: PRON_LEMMA}, - {ORTH: "'s", LEMMA: PRON_LEMMA} + {ORTH: "du", LEMMA: PRON_LEMMA, TAG: "PPER"}, + {ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"} ], "ebd.": [ @@ -285,8 +289,8 @@ 
TOKENIZER_EXCEPTIONS = { ], "er's": [ - {ORTH: "er", LEMMA: PRON_LEMMA}, - {ORTH: "'s", LEMMA: PRON_LEMMA} + {ORTH: "er", LEMMA: PRON_LEMMA, TAG: "PPER"}, + {ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"} ], "evtl.": [ @@ -315,7 +319,7 @@ TOKENIZER_EXCEPTIONS = { "hinter'm": [ {ORTH: "hinter", LEMMA: "hinter"}, - {ORTH: "'m", LEMMA: PRON_LEMMA} + {ORTH: "'m", LEMMA: DET_LEMMA, NORM: "dem"} ], "i.O.": [ @@ -327,13 +331,13 @@ TOKENIZER_EXCEPTIONS = { ], "ich's": [ - {ORTH: "ich", LEMMA: PRON_LEMMA}, - {ORTH: "'s", LEMMA: PRON_LEMMA} + {ORTH: "ich", LEMMA: PRON_LEMMA, TAG: "PPER"}, + {ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"} ], "ihr's": [ - {ORTH: "ihr", LEMMA: PRON_LEMMA}, - {ORTH: "'s", LEMMA: PRON_LEMMA} + {ORTH: "ihr", LEMMA: PRON_LEMMA, TAG: "PPER"}, + {ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"} ], "incl.": [ @@ -385,7 +389,7 @@ TOKENIZER_EXCEPTIONS = { ], "s'": [ - {ORTH: "s'", LEMMA: PRON_LEMMA} + {ORTH: "s'", LEMMA: PRON_LEMMA, TAG: "PPER"} ], "s.o.": [ @@ -393,8 +397,8 @@ TOKENIZER_EXCEPTIONS = { ], "sie's": [ - {ORTH: "sie", LEMMA: PRON_LEMMA}, - {ORTH: "'s", LEMMA: PRON_LEMMA} + {ORTH: "sie", LEMMA: PRON_LEMMA, TAG: "PPER"}, + {ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"} ], "sog.": [ @@ -423,7 +427,7 @@ TOKENIZER_EXCEPTIONS = { "unter'm": [ {ORTH: "unter", LEMMA: "unter"}, - {ORTH: "'m", LEMMA: PRON_LEMMA} + {ORTH: "'m", LEMMA: DET_LEMMA, NORM: "dem"} ], "usf.": [ @@ -464,12 +468,12 @@ TOKENIZER_EXCEPTIONS = { "vor'm": [ {ORTH: "vor", LEMMA: "vor"}, - {ORTH: "'m", LEMMA: PRON_LEMMA} + {ORTH: "'m", LEMMA: DET_LEMMA, NORM: "dem"} ], "wir's": [ - {ORTH: "wir", LEMMA: PRON_LEMMA}, - {ORTH: "'s", LEMMA: PRON_LEMMA} + {ORTH: "wir", LEMMA: PRON_LEMMA, TAG: "PPER"}, + {ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"} ], "z.B.": [ @@ -506,7 +510,7 @@ TOKENIZER_EXCEPTIONS = { "über'm": [ {ORTH: "über", LEMMA: "über"}, - {ORTH: "'m", LEMMA: PRON_LEMMA} + {ORTH: "'m", LEMMA: DET_LEMMA, NORM: "dem"} ] } @@ -625,5 
+629,5 @@ ORTH_ONLY = [ "wiss.", "x.", "y.", - "z.", + "z." ] diff --git a/spacy/en/__init__.py b/spacy/en/__init__.py index d4a371b2b..56cf4d184 100644 --- a/spacy/en/__init__.py +++ b/spacy/en/__init__.py @@ -44,6 +44,7 @@ def _fix_deprecated_glove_vectors_loading(overrides): else: path = overrides['path'] data_path = path.parent + vec_path = None if 'add_vectors' not in overrides: if 'vectors' in overrides: vec_path = match_best_version(overrides['vectors'], None, data_path) diff --git a/spacy/en/tokenizer_exceptions.py b/spacy/en/tokenizer_exceptions.py index 56cc1d7fa..398ae486b 100644 --- a/spacy/en/tokenizer_exceptions.py +++ b/spacy/en/tokenizer_exceptions.py @@ -11,7 +11,7 @@ TOKENIZER_EXCEPTIONS = { ], "Theydve": [ - {ORTH: "They", LEMMA: PRON_LEMMA}, + {ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "d", LEMMA: "would", TAG: "MD"}, {ORTH: "ve", LEMMA: "have", TAG: "VB"} ], @@ -68,7 +68,7 @@ TOKENIZER_EXCEPTIONS = { ], "itll": [ - {ORTH: "it", LEMMA: PRON_LEMMA}, + {ORTH: "it", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "ll", LEMMA: "will", TAG: "MD"} ], @@ -113,7 +113,7 @@ TOKENIZER_EXCEPTIONS = { ], "Idve": [ - {ORTH: "I", LEMMA: PRON_LEMMA}, + {ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "d", LEMMA: "would", TAG: "MD"}, {ORTH: "ve", LEMMA: "have", TAG: "VB"} ], @@ -124,23 +124,23 @@ TOKENIZER_EXCEPTIONS = { ], "Ive": [ - {ORTH: "I", LEMMA: PRON_LEMMA}, + {ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "ve", LEMMA: "have", TAG: "VB"} ], "they'd": [ - {ORTH: "they", LEMMA: PRON_LEMMA}, + {ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"} ], "Youdve": [ - {ORTH: "You", LEMMA: PRON_LEMMA}, + {ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "d", LEMMA: "would", TAG: "MD"}, {ORTH: "ve", LEMMA: "have", TAG: "VB"} ], "theyve": [ - {ORTH: "they", LEMMA: PRON_LEMMA}, + {ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "ve", LEMMA: "have", TAG: "VB"} ], @@ -160,12 +160,12 @@ TOKENIZER_EXCEPTIONS = { ], "I'm": 
[ - {ORTH: "I", LEMMA: PRON_LEMMA}, + {ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'m", TAG: "VBP", "tenspect": 1, "number": 1, LEMMA: "be"} ], "She'd've": [ - {ORTH: "She", LEMMA: PRON_LEMMA}, + {ORTH: "She", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"}, {ORTH: "'ve", LEMMA: "have", TAG: "VB"} ], @@ -191,7 +191,7 @@ TOKENIZER_EXCEPTIONS = { ], "they've": [ - {ORTH: "they", LEMMA: PRON_LEMMA}, + {ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'ve", LEMMA: "have", TAG: "VB"} ], @@ -226,12 +226,12 @@ TOKENIZER_EXCEPTIONS = { ], "i'll": [ - {ORTH: "i", LEMMA: PRON_LEMMA}, + {ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'ll", LEMMA: "will", TAG: "MD"} ], "you'd": [ - {ORTH: "you", LEMMA: PRON_LEMMA}, + {ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"} ], @@ -287,7 +287,7 @@ TOKENIZER_EXCEPTIONS = { ], "youll": [ - {ORTH: "you", LEMMA: PRON_LEMMA}, + {ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "ll", LEMMA: "will", TAG: "MD"} ], @@ -307,7 +307,7 @@ TOKENIZER_EXCEPTIONS = { ], "Youre": [ - {ORTH: "You", LEMMA: PRON_LEMMA}, + {ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "re", LEMMA: "be"} ], @@ -369,7 +369,7 @@ TOKENIZER_EXCEPTIONS = { ], "You'll": [ - {ORTH: "You", LEMMA: PRON_LEMMA}, + {ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'ll", LEMMA: "will", TAG: "MD"} ], @@ -379,7 +379,7 @@ TOKENIZER_EXCEPTIONS = { ], "i'd": [ - {ORTH: "i", LEMMA: PRON_LEMMA}, + {ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"} ], @@ -394,7 +394,7 @@ TOKENIZER_EXCEPTIONS = { ], "i'm": [ - {ORTH: "i", LEMMA: PRON_LEMMA}, + {ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'m", TAG: "VBP", "tenspect": 1, "number": 1, LEMMA: "be"} ], @@ -425,7 +425,7 @@ TOKENIZER_EXCEPTIONS = { ], "Hes": [ - {ORTH: "He", LEMMA: PRON_LEMMA}, + {ORTH: "He", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "s"} ], @@ -435,7 +435,7 @@ TOKENIZER_EXCEPTIONS = { ], "It's": [ - {ORTH: 
"It", LEMMA: PRON_LEMMA}, + {ORTH: "It", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'s"} ], @@ -445,7 +445,7 @@ TOKENIZER_EXCEPTIONS = { ], "Hed": [ - {ORTH: "He", LEMMA: PRON_LEMMA}, + {ORTH: "He", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "d", LEMMA: "would", TAG: "MD"} ], @@ -464,12 +464,12 @@ TOKENIZER_EXCEPTIONS = { ], "It'd": [ - {ORTH: "It", LEMMA: PRON_LEMMA}, + {ORTH: "It", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"} ], "theydve": [ - {ORTH: "they", LEMMA: PRON_LEMMA}, + {ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "d", LEMMA: "would", TAG: "MD"}, {ORTH: "ve", LEMMA: "have", TAG: "VB"} ], @@ -489,7 +489,7 @@ TOKENIZER_EXCEPTIONS = { ], "I've": [ - {ORTH: "I", LEMMA: PRON_LEMMA}, + {ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'ve", LEMMA: "have", TAG: "VB"} ], @@ -499,13 +499,13 @@ TOKENIZER_EXCEPTIONS = { ], "Itdve": [ - {ORTH: "It", LEMMA: PRON_LEMMA}, + {ORTH: "It", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "d", LEMMA: "would", TAG: "MD"}, {ORTH: "ve", LEMMA: "have", TAG: "VB"} ], "I'ma": [ - {ORTH: "I", LEMMA: PRON_LEMMA}, + {ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'ma"} ], @@ -515,7 +515,7 @@ TOKENIZER_EXCEPTIONS = { ], "They'd": [ - {ORTH: "They", LEMMA: PRON_LEMMA}, + {ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"} ], @@ -525,7 +525,7 @@ TOKENIZER_EXCEPTIONS = { ], "You've": [ - {ORTH: "You", LEMMA: PRON_LEMMA}, + {ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'ve", LEMMA: "have", TAG: "VB"} ], @@ -546,7 +546,7 @@ TOKENIZER_EXCEPTIONS = { ], "I'd've": [ - {ORTH: "I", LEMMA: PRON_LEMMA}, + {ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"}, {ORTH: "'ve", LEMMA: "have", TAG: "VB"} ], @@ -557,13 +557,13 @@ TOKENIZER_EXCEPTIONS = { ], "it'd": [ - {ORTH: "it", LEMMA: PRON_LEMMA}, + {ORTH: "it", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"} ], "what're": [ {ORTH: "what"}, - {ORTH: "'re", LEMMA: "be"} + 
{ORTH: "'re", LEMMA: "be", NORM: "are"} ], "Wasn't": [ @@ -577,18 +577,18 @@ TOKENIZER_EXCEPTIONS = { ], "he'd've": [ - {ORTH: "he", LEMMA: PRON_LEMMA}, + {ORTH: "he", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"}, {ORTH: "'ve", LEMMA: "have", TAG: "VB"} ], "She'd": [ - {ORTH: "She", LEMMA: PRON_LEMMA}, + {ORTH: "She", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"} ], "shedve": [ - {ORTH: "she", LEMMA: PRON_LEMMA}, + {ORTH: "she", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "d", LEMMA: "would", TAG: "MD"}, {ORTH: "ve", LEMMA: "have", TAG: "VB"} ], @@ -599,12 +599,12 @@ TOKENIZER_EXCEPTIONS = { ], "She's": [ - {ORTH: "i", LEMMA: PRON_LEMMA}, + {ORTH: "She", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'s"} ], "i'd've": [ - {ORTH: "i", LEMMA: PRON_LEMMA}, + {ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"}, {ORTH: "'ve", LEMMA: "have", TAG: "VB"} ], @@ -631,7 +631,7 @@ TOKENIZER_EXCEPTIONS = { ], "you'd've": [ - {ORTH: "you", LEMMA: PRON_LEMMA}, + {ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"}, {ORTH: "'ve", LEMMA: "have", TAG: "VB"} ], @@ -647,7 +647,7 @@ TOKENIZER_EXCEPTIONS = { ], "Youd": [ - {ORTH: "You", LEMMA: PRON_LEMMA}, + {ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "d", LEMMA: "would", TAG: "MD"} ], @@ -678,12 +678,12 @@ TOKENIZER_EXCEPTIONS = { ], "ive": [ - {ORTH: "i", LEMMA: PRON_LEMMA}, + {ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "ve", LEMMA: "have", TAG: "VB"} ], "It'd've": [ - {ORTH: "It", LEMMA: PRON_LEMMA}, + {ORTH: "It", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"}, {ORTH: "'ve", LEMMA: "have", TAG: "VB"} ], @@ -693,7 +693,7 @@ TOKENIZER_EXCEPTIONS = { ], "Itll": [ - {ORTH: "It", LEMMA: PRON_LEMMA}, + {ORTH: "It", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "ll", LEMMA: "will", TAG: "MD"} ], @@ -708,12 +708,12 @@ TOKENIZER_EXCEPTIONS = { ], "im": [ - {ORTH: "i", LEMMA: PRON_LEMMA}, + {ORTH: "i", 
LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "m", TAG: "VBP", "tenspect": 1, "number": 1, LEMMA: "be"} ], "they'd've": [ - {ORTH: "they", LEMMA: PRON_LEMMA}, + {ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"}, {ORTH: "'ve", LEMMA: "have", TAG: "VB"} ], @@ -735,19 +735,19 @@ TOKENIZER_EXCEPTIONS = { ], "youdve": [ - {ORTH: "you", LEMMA: PRON_LEMMA}, + {ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "d", LEMMA: "would", TAG: "MD"}, {ORTH: "ve", LEMMA: "have", TAG: "VB"} ], "Shedve": [ - {ORTH: "i", LEMMA: PRON_LEMMA}, + {ORTH: "She", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "d", LEMMA: "would", TAG: "MD"}, {ORTH: "ve", LEMMA: "have", TAG: "VB"} ], "theyd": [ - {ORTH: "they", LEMMA: PRON_LEMMA}, + {ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "d", LEMMA: "would", TAG: "MD"} ], @@ -763,11 +763,11 @@ TOKENIZER_EXCEPTIONS = { "What're": [ {ORTH: "What"}, - {ORTH: "'re", LEMMA: "be"} + {ORTH: "'re", LEMMA: "be", NORM: "are"} ], "He'll": [ - {ORTH: "He", LEMMA: PRON_LEMMA}, + {ORTH: "He", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'ll", LEMMA: "will", TAG: "MD"} ], @@ -777,8 +777,8 @@ TOKENIZER_EXCEPTIONS = { ], "They're": [ - {ORTH: "They", LEMMA: PRON_LEMMA}, - {ORTH: "'re", LEMMA: "be"} + {ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"}, + {ORTH: "'re", LEMMA: "be", NORM: "are"} ], "shouldnt": [ @@ -796,7 +796,7 @@ TOKENIZER_EXCEPTIONS = { ], "youve": [ - {ORTH: "you", LEMMA: PRON_LEMMA}, + {ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "ve", LEMMA: "have", TAG: "VB"} ], @@ -816,7 +816,7 @@ TOKENIZER_EXCEPTIONS = { ], "Youve": [ - {ORTH: "You", LEMMA: PRON_LEMMA}, + {ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "ve", LEMMA: "have", TAG: "VB"} ], @@ -841,12 +841,12 @@ TOKENIZER_EXCEPTIONS = { ], "they're": [ - {ORTH: "they", LEMMA: PRON_LEMMA}, - {ORTH: "'re", LEMMA: "be"} + {ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"}, + {ORTH: "'re", LEMMA: "be", NORM: "are"} ], "idve": [ - {ORTH: "i", LEMMA: PRON_LEMMA}, + {ORTH: "i", 
LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "d", LEMMA: "would", TAG: "MD"}, {ORTH: "ve", LEMMA: "have", TAG: "VB"} ], @@ -857,8 +857,8 @@ TOKENIZER_EXCEPTIONS = { ], "youre": [ - {ORTH: "you", LEMMA: PRON_LEMMA}, - {ORTH: "re"} + {ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"}, + {ORTH: "re", LEMMA: "be", NORM: "are"} ], "Didn't": [ @@ -877,8 +877,8 @@ TOKENIZER_EXCEPTIONS = { ], "Im": [ - {ORTH: "I", LEMMA: PRON_LEMMA}, - {ORTH: "m", TAG: "VBP", "tenspect": 1, "number": 1, LEMMA: "be"} + {ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"}, + {ORTH: "m", TAG: "VBP", "tenspect": 1, "number": 1, LEMMA: "be", NORM: "am"} ], "howd": [ @@ -887,22 +887,22 @@ TOKENIZER_EXCEPTIONS = { ], "you've": [ - {ORTH: "you", LEMMA: PRON_LEMMA}, + {ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'ve", LEMMA: "have", TAG: "VB"} ], "You're": [ - {ORTH: "You", LEMMA: PRON_LEMMA}, - {ORTH: "'re", LEMMA: "be"} + {ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"}, + {ORTH: "'re", LEMMA: "be", NORM: "are"} ], "she'll": [ - {ORTH: "she", LEMMA: PRON_LEMMA}, + {ORTH: "she", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'ll", LEMMA: "will", TAG: "MD"} ], "Theyll": [ - {ORTH: "They", LEMMA: PRON_LEMMA}, + {ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "ll", LEMMA: "will", TAG: "MD"} ], @@ -912,12 +912,12 @@ TOKENIZER_EXCEPTIONS = { ], "itd": [ - {ORTH: "it", LEMMA: PRON_LEMMA}, + {ORTH: "it", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "d", LEMMA: "would", TAG: "MD"} ], "Hedve": [ - {ORTH: "He", LEMMA: PRON_LEMMA}, + {ORTH: "He", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "d", LEMMA: "would", TAG: "MD"}, {ORTH: "ve", LEMMA: "have", TAG: "VB"} ], @@ -933,8 +933,8 @@ TOKENIZER_EXCEPTIONS = { ], "We're": [ - {ORTH: "We", LEMMA: PRON_LEMMA}, - {ORTH: "'re", LEMMA: "be"} + {ORTH: "We", LEMMA: PRON_LEMMA, TAG: "PRP"}, + {ORTH: "'re", LEMMA: "be", NORM: "are"} ], "\u2018S": [ @@ -951,7 +951,7 @@ TOKENIZER_EXCEPTIONS = { ], "ima": [ - {ORTH: "i", LEMMA: PRON_LEMMA}, + {ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "ma"} ], 
@@ -961,7 +961,7 @@ TOKENIZER_EXCEPTIONS = { ], "he's": [ - {ORTH: "he", LEMMA: PRON_LEMMA}, + {ORTH: "he", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'s"} ], @@ -981,13 +981,13 @@ TOKENIZER_EXCEPTIONS = { ], "hedve": [ - {ORTH: "he", LEMMA: PRON_LEMMA}, + {ORTH: "he", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "d", LEMMA: "would", TAG: "MD"}, {ORTH: "ve", LEMMA: "have", TAG: "VB"} ], "he'd": [ - {ORTH: "he", LEMMA: PRON_LEMMA}, + {ORTH: "he", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"} ], @@ -1029,7 +1029,7 @@ TOKENIZER_EXCEPTIONS = { ], "You'd've": [ - {ORTH: "You", LEMMA: PRON_LEMMA}, + {ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"}, {ORTH: "'ve", LEMMA: "have", TAG: "VB"} ], @@ -1072,12 +1072,12 @@ TOKENIZER_EXCEPTIONS = { ], "wont": [ - {ORTH: "wo"}, + {ORTH: "wo", LEMMA: "will"}, {ORTH: "nt", LEMMA: "not", TAG: "RB"} ], "she'd've": [ - {ORTH: "she", LEMMA: PRON_LEMMA}, + {ORTH: "she", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"}, {ORTH: "'ve", LEMMA: "have", TAG: "VB"} ], @@ -1088,7 +1088,7 @@ TOKENIZER_EXCEPTIONS = { ], "theyre": [ - {ORTH: "they", LEMMA: PRON_LEMMA}, + {ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "re"} ], @@ -1129,7 +1129,7 @@ TOKENIZER_EXCEPTIONS = { ], "They'll": [ - {ORTH: "They", LEMMA: PRON_LEMMA}, + {ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'ll", LEMMA: "will", TAG: "MD"} ], @@ -1139,7 +1139,7 @@ TOKENIZER_EXCEPTIONS = { ], "Wedve": [ - {ORTH: "We"}, + {ORTH: "We", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "d", LEMMA: "would", TAG: "MD"}, {ORTH: "ve", LEMMA: "have", TAG: "VB"} ], @@ -1156,7 +1156,7 @@ TOKENIZER_EXCEPTIONS = { ], "we'd": [ - {ORTH: "we"}, + {ORTH: "we", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"} ], @@ -1193,7 +1193,7 @@ TOKENIZER_EXCEPTIONS = { "why're": [ {ORTH: "why"}, - {ORTH: "'re", LEMMA: "be"} + {ORTH: "'re", LEMMA: "be", NORM: "are"} ], "Doesnt": [ @@ -1207,12 +1207,12 @@ 
TOKENIZER_EXCEPTIONS = { ], "they'll": [ - {ORTH: "they", LEMMA: PRON_LEMMA}, + {ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'ll", LEMMA: "will", TAG: "MD"} ], "I'd": [ - {ORTH: "I", LEMMA: PRON_LEMMA}, + {ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"} ], @@ -1237,12 +1237,12 @@ TOKENIZER_EXCEPTIONS = { ], "you're": [ - {ORTH: "you", LEMMA: PRON_LEMMA}, - {ORTH: "'re", LEMMA: "be"} + {ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"}, + {ORTH: "'re", LEMMA: "be", NORM: "are"} ], "They've": [ - {ORTH: "They", LEMMA: PRON_LEMMA}, + {ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'ve", LEMMA: "have", TAG: "VB"} ], @@ -1272,12 +1272,12 @@ TOKENIZER_EXCEPTIONS = { ], "She'll": [ - {ORTH: "i", LEMMA: PRON_LEMMA}, + {ORTH: "She", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'ll", LEMMA: "will", TAG: "MD"} ], "You'd": [ - {ORTH: "You", LEMMA: PRON_LEMMA}, + {ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"} ], @@ -1297,8 +1297,8 @@ TOKENIZER_EXCEPTIONS = { ], "Theyre": [ - {ORTH: "They", LEMMA: PRON_LEMMA}, - {ORTH: "re"} + {ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"}, + {ORTH: "re", LEMMA: "be", NORM: "are"} ], "Won't": [ @@ -1312,33 +1312,33 @@ TOKENIZER_EXCEPTIONS = { ], "it's": [ - {ORTH: "it", LEMMA: PRON_LEMMA}, + {ORTH: "it", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'s"} ], "it'll": [ - {ORTH: "it", LEMMA: PRON_LEMMA}, + {ORTH: "it", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'ll", LEMMA: "will", TAG: "MD"} ], "They'd've": [ - {ORTH: "They", LEMMA: PRON_LEMMA}, + {ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"}, {ORTH: "'ve", LEMMA: "have", TAG: "VB"} ], "Ima": [ - {ORTH: "I", LEMMA: PRON_LEMMA}, + {ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "ma"} ], "gonna": [ - {ORTH: "gon", LEMMA: "go"}, + {ORTH: "gon", LEMMA: "go", NORM: "going"}, {ORTH: "na", LEMMA: "to"} ], "Gonna": [ - {ORTH: "Gon", LEMMA: "go"}, + {ORTH: "Gon", LEMMA: "go", NORM: "going"}, 
{ORTH: "na", LEMMA: "to"} ], @@ -1359,7 +1359,7 @@ TOKENIZER_EXCEPTIONS = { ], "youd": [ - {ORTH: "you", LEMMA: PRON_LEMMA}, + {ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "d", LEMMA: "would", TAG: "MD"} ], @@ -1390,7 +1390,7 @@ TOKENIZER_EXCEPTIONS = { ], "He'd've": [ - {ORTH: "He", LEMMA: PRON_LEMMA}, + {ORTH: "He", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"}, {ORTH: "'ve", LEMMA: "have", TAG: "VB"} ], @@ -1427,17 +1427,17 @@ TOKENIZER_EXCEPTIONS = { ], "hes": [ - {ORTH: "he", LEMMA: PRON_LEMMA}, + {ORTH: "he", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "s"} ], "he'll": [ - {ORTH: "he", LEMMA: PRON_LEMMA}, + {ORTH: "he", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'ll", LEMMA: "will", TAG: "MD"} ], "hed": [ - {ORTH: "he", LEMMA: PRON_LEMMA}, + {ORTH: "he", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "d", LEMMA: "would", TAG: "MD"} ], @@ -1447,8 +1447,8 @@ TOKENIZER_EXCEPTIONS = { ], "we're": [ - {ORTH: "we", LEMMA: PRON_LEMMA}, - {ORTH: "'re", LEMMA: "be"} + {ORTH: "we", LEMMA: PRON_LEMMA, TAG: "PRP"}, + {ORTH: "'re", LEMMA: "be", NORM: "are"} ], "Hadnt": [ @@ -1457,12 +1457,12 @@ TOKENIZER_EXCEPTIONS = { ], "Shant": [ - {ORTH: "Sha"}, + {ORTH: "Sha", LEMMA: "shall"}, {ORTH: "nt", LEMMA: "not", TAG: "RB"} ], "Theyve": [ - {ORTH: "They", LEMMA: PRON_LEMMA}, + {ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "ve", LEMMA: "have", TAG: "VB"} ], @@ -1477,7 +1477,7 @@ TOKENIZER_EXCEPTIONS = { ], "i've": [ - {ORTH: "i", LEMMA: PRON_LEMMA}, + {ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'ve", LEMMA: "have", TAG: "VB"} ], @@ -1487,7 +1487,7 @@ TOKENIZER_EXCEPTIONS = { ], "i'ma": [ - {ORTH: "i", LEMMA: PRON_LEMMA}, + {ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'ma"} ], @@ -1502,7 +1502,7 @@ TOKENIZER_EXCEPTIONS = { ], "shant": [ - {ORTH: "sha"}, + {ORTH: "sha", LEMMA: "shall"}, {ORTH: "nt", LEMMA: "not", TAG: "RB"} ], @@ -1513,7 +1513,7 @@ TOKENIZER_EXCEPTIONS = { ], "I'll": [ - {ORTH: "I", LEMMA: PRON_LEMMA}, + {ORTH: "I", LEMMA:
PRON_LEMMA, TAG: "PRP"}, {ORTH: "'ll", LEMMA: "will", TAG: "MD"} ], @@ -1571,7 +1571,7 @@ TOKENIZER_EXCEPTIONS = { ], "shes": [ - {ORTH: "she", LEMMA: PRON_LEMMA}, + {ORTH: "she", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "s"} ], @@ -1586,12 +1586,12 @@ TOKENIZER_EXCEPTIONS = { ], "Hasnt": [ - {ORTH: "Has"}, + {ORTH: "Has", LEMMA: "have"}, {ORTH: "nt", LEMMA: "not", TAG: "RB"} ], "He's": [ - {ORTH: "He", LEMMA: PRON_LEMMA}, + {ORTH: "He", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'s"} ], @@ -1611,12 +1611,12 @@ TOKENIZER_EXCEPTIONS = { ], "He'd": [ - {ORTH: "He", LEMMA: PRON_LEMMA}, + {ORTH: "He", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"} ], "Shes": [ - {ORTH: "i", LEMMA: PRON_LEMMA}, + {ORTH: "She", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "s"} ], @@ -1626,7 +1626,7 @@ TOKENIZER_EXCEPTIONS = { ], "Youll": [ - {ORTH: "You", LEMMA: PRON_LEMMA}, + {ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "ll", LEMMA: "will", TAG: "MD"} ], @@ -1636,18 +1636,18 @@ TOKENIZER_EXCEPTIONS = { ], "theyll": [ - {ORTH: "they", LEMMA: PRON_LEMMA}, + {ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "ll", LEMMA: "will", TAG: "MD"} ], "it'd've": [ - {ORTH: "it", LEMMA: PRON_LEMMA}, + {ORTH: "it", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"}, {ORTH: "'ve", LEMMA: "have", TAG: "VB"} ], "itdve": [ - {ORTH: "it", LEMMA: PRON_LEMMA}, + {ORTH: "it", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "d", LEMMA: "would", TAG: "MD"}, {ORTH: "ve", LEMMA: "have", TAG: "VB"} ], @@ -1674,7 +1674,7 @@ TOKENIZER_EXCEPTIONS = { ], "Wont": [ - {ORTH: "Wo"}, + {ORTH: "Wo", LEMMA: "will"}, {ORTH: "nt", LEMMA: "not", TAG: "RB"} ], @@ -1691,7 +1691,7 @@ TOKENIZER_EXCEPTIONS = { "Whatre": [ {ORTH: "What"}, - {ORTH: "re"} + {ORTH: "re", LEMMA: "be", NORM: "are"} ], "'s": [ @@ -1719,12 +1719,12 @@ TOKENIZER_EXCEPTIONS = { ], "It'll": [ - {ORTH: "It", LEMMA: PRON_LEMMA}, + {ORTH: "It", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'ll", LEMMA: "will", TAG: "MD"} ], 
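A pattern running through this file is that most contractions appear twice, once with the apostrophe ("They'll") and once without ("Theyll"), with otherwise identical attributes. A hypothetical helper (not part of this patch, and not spaCy's API) could derive the apostrophe-less variants mechanically, which keeps the pairs in sync:

```python
ORTH = "ORTH"  # stand-in for the symbolic attribute ID from spacy.symbols

def add_apostrophe_less_variants(exc):
    """Return a copy of `exc` extended with apostrophe-stripped duplicates."""
    out = dict(exc)
    for key, tokens in exc.items():
        bare = key.replace("'", "")
        if "'" in key and bare not in out:
            # Strip the apostrophe from both the key and each ORTH piece so
            # the concatenation invariant still holds for the new entry.
            out[bare] = [
                {**tok, ORTH: tok[ORTH].replace("'", "")} for tok in tokens
            ]
    return out

base = {
    "They'll": [{ORTH: "They"}, {ORTH: "'ll"}],
}
full = add_apostrophe_less_variants(base)
# full now also contains "Theyll" -> [{"ORTH": "They"}, {"ORTH": "ll"}]
```

Generating one variant from the other would also rule out the kind of drift this patch is correcting, where one member of a pair gained TAG/NORM attributes and the other did not.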
"We'd": [ - {ORTH: "We"}, + {ORTH: "We", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"} ], @@ -1738,12 +1738,12 @@ TOKENIZER_EXCEPTIONS = { ], "Itd": [ - {ORTH: "It", LEMMA: PRON_LEMMA}, + {ORTH: "It", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "d", LEMMA: "would", TAG: "MD"} ], "she'd": [ - {ORTH: "she", LEMMA: PRON_LEMMA}, + {ORTH: "she", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'d", LEMMA: "would", TAG: "MD"} ], @@ -1758,17 +1758,17 @@ TOKENIZER_EXCEPTIONS = { ], "you'll": [ - {ORTH: "you", LEMMA: PRON_LEMMA}, + {ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'ll", LEMMA: "will", TAG: "MD"} ], "Theyd": [ - {ORTH: "They", LEMMA: PRON_LEMMA}, + {ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "d", LEMMA: "would", TAG: "MD"} ], "she's": [ - {ORTH: "she", LEMMA: PRON_LEMMA}, + {ORTH: "she", LEMMA: PRON_LEMMA, TAG: "PRP"}, {ORTH: "'s"} ], @@ -1783,7 +1783,7 @@ TOKENIZER_EXCEPTIONS = { ], "'em": [ - {ORTH: "'em", LEMMA: PRON_LEMMA} + {ORTH: "'em", LEMMA: PRON_LEMMA, NORM: "them"} ], "ol'": [ diff --git a/spacy/es/language_data.py b/spacy/es/language_data.py index 90595be82..3357c9ac8 100644 --- a/spacy/es/language_data.py +++ b/spacy/es/language_data.py @@ -3,17 +3,48 @@ from __future__ import unicode_literals from .. import language_data as base from ..language_data import update_exc, strings_to_exc +from ..symbols import ORTH, LEMMA from .stop_words import STOP_WORDS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, ORTH_ONLY +def get_time_exc(hours): + exc = { + "12m.": [ + {ORTH: "12"}, + {ORTH: "m.", LEMMA: "p.m."} + ] + } + + for hour in hours: + exc["%da.m." % hour] = [ + {ORTH: hour}, + {ORTH: "a.m."} + ] + + exc["%dp.m." 
% hour] = [ + {ORTH: hour}, + {ORTH: "p.m."} + ] + + exc["%dam" % hour] = [ + {ORTH: hour}, + {ORTH: "am", LEMMA: "a.m."} + ] + + exc["%dpm" % hour] = [ + {ORTH: hour}, + {ORTH: "pm", LEMMA: "p.m."} + ] + return exc + + TOKENIZER_EXCEPTIONS = dict(TOKENIZER_EXCEPTIONS) STOP_WORDS = set(STOP_WORDS) - update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(ORTH_ONLY)) +update_exc(TOKENIZER_EXCEPTIONS, get_time_exc(range(1, 12 + 1))) update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(base.EMOTICONS)) - __all__ = ["TOKENIZER_EXCEPTIONS", "STOP_WORDS"] diff --git a/spacy/es/tokenizer_exceptions.py b/spacy/es/tokenizer_exceptions.py index 36a2a8d23..f9259ce93 100644 --- a/spacy/es/tokenizer_exceptions.py +++ b/spacy/es/tokenizer_exceptions.py @@ -2,317 +2,138 @@ from __future__ import unicode_literals from ..symbols import * -from ..language_data import PRON_LEMMA +from ..language_data import PRON_LEMMA, DET_LEMMA TOKENIZER_EXCEPTIONS = { - "accidentarse": [ - {ORTH: "accidentar", LEMMA: "accidentar", POS: AUX}, - {ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "aceptarlo": [ - {ORTH: "aceptar", LEMMA: "aceptar", POS: AUX}, - {ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "acompañarla": [ - {ORTH: "acompañar", LEMMA: "acompañar", POS: AUX}, - {ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "advertirle": [ - {ORTH: "advertir", LEMMA: "advertir", POS: AUX}, - {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON} - ], - "al": [ - {ORTH: "a", LEMMA: "a", POS: ADP}, - {ORTH: "el", LEMMA: "el", POS: DET} + {ORTH: "a", LEMMA: "a", TAG: ADP}, + {ORTH: "el", LEMMA: "el", TAG: DET} ], - "anunciarnos": [ - {ORTH: "anunciar", LEMMA: "anunciar", POS: AUX}, - {ORTH: "nos", LEMMA: PRON_LEMMA, POS: PRON} + "consigo": [ + {ORTH: "con", LEMMA: "con"}, + {ORTH: "sigo", LEMMA: PRON_LEMMA, NORM: "sí"} ], - "asegurándole": [ - {ORTH: "asegurando", LEMMA: "asegurar", POS: AUX}, - {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON} + "conmigo": [ + {ORTH: "con", LEMMA: "con"}, + {ORTH: "migo", LEMMA: PRON_LEMMA, 
NORM: "mí"} ], - "considerarle": [ - {ORTH: "considerar", LEMMA: "considerar", POS: AUX}, - {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "decirle": [ - {ORTH: "decir", LEMMA: "decir", POS: AUX}, - {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "decirles": [ - {ORTH: "decir", LEMMA: "decir", POS: AUX}, - {ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "decirte": [ - {ORTH: "Decir", LEMMA: "decir", POS: AUX}, - {ORTH: "te", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "dejarla": [ - {ORTH: "dejar", LEMMA: "dejar", POS: AUX}, - {ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "dejarnos": [ - {ORTH: "dejar", LEMMA: "dejar", POS: AUX}, - {ORTH: "nos", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "dejándole": [ - {ORTH: "dejando", LEMMA: "dejar", POS: AUX}, - {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON} + "contigo": [ + {ORTH: "con", LEMMA: "con"}, + {ORTH: "tigo", LEMMA: PRON_LEMMA, NORM: "ti"} ], "del": [ - {ORTH: "de", LEMMA: "de", POS: ADP}, - {ORTH: "el", LEMMA: "el", POS: DET} - ], - - "demostrarles": [ - {ORTH: "demostrar", LEMMA: "demostrar", POS: AUX}, - {ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "diciéndole": [ - {ORTH: "diciendo", LEMMA: "decir", POS: AUX}, - {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "diciéndoles": [ - {ORTH: "diciendo", LEMMA: "decir", POS: AUX}, - {ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "diferenciarse": [ - {ORTH: "diferenciar", LEMMA: "diferenciar", POS: AUX}, - {ORTH: "se", LEMMA: "él", POS: PRON} - ], - - "divirtiéndome": [ - {ORTH: "divirtiendo", LEMMA: "divertir", POS: AUX}, - {ORTH: "me", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "ensanchándose": [ - {ORTH: "ensanchando", LEMMA: "ensanchar", POS: AUX}, - {ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "explicarles": [ - {ORTH: "explicar", LEMMA: "explicar", POS: AUX}, - {ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "haberla": [ - {ORTH: "haber", LEMMA: "haber", POS: AUX}, - {ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "haberlas": [ 
- {ORTH: "haber", LEMMA: "haber", POS: AUX}, - {ORTH: "las", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "haberlo": [ - {ORTH: "haber", LEMMA: "haber", POS: AUX}, - {ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "haberlos": [ - {ORTH: "haber", LEMMA: "haber", POS: AUX}, - {ORTH: "los", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "haberme": [ - {ORTH: "haber", LEMMA: "haber", POS: AUX}, - {ORTH: "me", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "haberse": [ - {ORTH: "haber", LEMMA: "haber", POS: AUX}, - {ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "hacerle": [ - {ORTH: "hacer", LEMMA: "hacer", POS: AUX}, - {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "hacerles": [ - {ORTH: "hacer", LEMMA: "hacer", POS: AUX}, - {ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "hallarse": [ - {ORTH: "hallar", LEMMA: "hallar", POS: AUX}, - {ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "imaginaros": [ - {ORTH: "imaginar", LEMMA: "imaginar", POS: AUX}, - {ORTH: "os", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "insinuarle": [ - {ORTH: "insinuar", LEMMA: "insinuar", POS: AUX}, - {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "justificarla": [ - {ORTH: "justificar", LEMMA: "justificar", POS: AUX}, - {ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "mantenerlas": [ - {ORTH: "mantener", LEMMA: "mantener", POS: AUX}, - {ORTH: "las", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "mantenerlos": [ - {ORTH: "mantener", LEMMA: "mantener", POS: AUX}, - {ORTH: "los", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "mantenerme": [ - {ORTH: "mantener", LEMMA: "mantener", POS: AUX}, - {ORTH: "me", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "pasarte": [ - {ORTH: "pasar", LEMMA: "pasar", POS: AUX}, - {ORTH: "te", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "pedirle": [ - {ORTH: "pedir", LEMMA: "pedir", POS: AUX}, - {ORTH: "le", LEMMA: "él", POS: PRON} + {ORTH: "de", LEMMA: "de", TAG: ADP}, + {ORTH: "l", LEMMA: "el", TAG: DET} ], "pel": [ - {ORTH: "per", LEMMA: "per", POS: ADP}, - {ORTH: "el", LEMMA: "el", POS: 
DET} + {ORTH: "pe", LEMMA: "per", TAG: ADP}, + {ORTH: "l", LEMMA: "el", TAG: DET} ], - "pidiéndonos": [ - {ORTH: "pidiendo", LEMMA: "pedir", POS: AUX}, - {ORTH: "nos", LEMMA: PRON_LEMMA, POS: PRON} + "pal": [ + {ORTH: "pa", LEMMA: "para"}, + {ORTH: "l", LEMMA: DET_LEMMA, NORM: "el"} ], - "poderle": [ - {ORTH: "poder", LEMMA: "poder", POS: AUX}, - {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON} + "pala": [ + {ORTH: "pa", LEMMA: "para"}, + {ORTH: "la", LEMMA: DET_LEMMA} ], - "preguntarse": [ - {ORTH: "preguntar", LEMMA: "preguntar", POS: AUX}, - {ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON} + "aprox.": [ + {ORTH: "aprox.", LEMMA: "aproximadamente"} ], - "preguntándose": [ - {ORTH: "preguntando", LEMMA: "preguntar", POS: AUX}, - {ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON} + "dna.": [ + {ORTH: "dna.", LEMMA: "docena"} ], - "presentarla": [ - {ORTH: "presentar", LEMMA: "presentar", POS: AUX}, - {ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON} + "esq.": [ + {ORTH: "esq.", LEMMA: "esquina"} ], - "pudiéndolo": [ - {ORTH: "pudiendo", LEMMA: "poder", POS: AUX}, - {ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON} + "pág.": [ + {ORTH: "pág.", LEMMA: "página"} ], - "pudiéndose": [ - {ORTH: "pudiendo", LEMMA: "poder", POS: AUX}, - {ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON} + "p.ej.": [ + {ORTH: "p.ej.", LEMMA: "por ejemplo"} ], - "quererle": [ - {ORTH: "querer", LEMMA: "querer", POS: AUX}, - {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON} + "Ud.": [ + {ORTH: "Ud.", LEMMA: PRON_LEMMA, NORM: "usted"} ], - "rasgarse": [ - {ORTH: "Rasgar", LEMMA: "rasgar", POS: AUX}, - {ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON} + "Vd.": [ + {ORTH: "Vd.", LEMMA: PRON_LEMMA, NORM: "usted"} ], - "repetirlo": [ - {ORTH: "repetir", LEMMA: "repetir", POS: AUX}, - {ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON} + "Uds.": [ + {ORTH: "Uds.", LEMMA: PRON_LEMMA, NORM: "ustedes"} ], - "robarle": [ - {ORTH: "robar", LEMMA: "robar", POS: AUX}, - {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "seguirlos": [ - {ORTH: "seguir", LEMMA: 
"seguir", POS: AUX}, - {ORTH: "los", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "serle": [ - {ORTH: "ser", LEMMA: "ser", POS: AUX}, - {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "serlo": [ - {ORTH: "ser", LEMMA: "ser", POS: AUX}, - {ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "señalándole": [ - {ORTH: "señalando", LEMMA: "señalar", POS: AUX}, - {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "suplicarle": [ - {ORTH: "suplicar", LEMMA: "suplicar", POS: AUX}, - {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "tenerlos": [ - {ORTH: "tener", LEMMA: "tener", POS: AUX}, - {ORTH: "los", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "vengarse": [ - {ORTH: "vengar", LEMMA: "vengar", POS: AUX}, - {ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "verla": [ - {ORTH: "ver", LEMMA: "ver", POS: AUX}, - {ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "verle": [ - {ORTH: "ver", LEMMA: "ver", POS: AUX}, - {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON} - ], - - "volverlo": [ - {ORTH: "volver", LEMMA: "volver", POS: AUX}, - {ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON} + "Vds.": [ + {ORTH: "Vds.", LEMMA: PRON_LEMMA, NORM: "ustedes"} ] } ORTH_ONLY = [ - + "a.", + "a.C.", + "a.J.C.", + "apdo.", + "Av.", + "Avda.", + "b.", + "c.", + "Cía.", + "d.", + "e.", + "etc.", + "f.", + "g.", + "Gob.", + "Gral.", + "h.", + "i.", + "Ing.", + "j.", + "J.C.", + "k.", + "l.", + "Lic.", + "m.", + "m.n.", + "n.", + "no.", + "núm.", + "o.", + "p.", + "P.D.", + "Prof.", + "Profa.", + "q.", + "q.e.p.d.", + "r.", + "s.", + "S.A.", + "S.L.", + "s.s.s.", + "Sr.", + "Sra.", + "Srta.", + "t.", + "u.", + "v.", + "w.", + "x.", + "y.", + "z."
] diff --git a/spacy/hu/__init__.py b/spacy/hu/__init__.py new file mode 100644 index 000000000..2343b4606 --- /dev/null +++ b/spacy/hu/__init__.py @@ -0,0 +1,23 @@ +# encoding: utf8 +from __future__ import unicode_literals, print_function + +from .language_data import * +from ..attrs import LANG +from ..language import Language + + +class Hungarian(Language): + lang = 'hu' + + class Defaults(Language.Defaults): + tokenizer_exceptions = dict(TOKENIZER_EXCEPTIONS) + lex_attr_getters = dict(Language.Defaults.lex_attr_getters) + lex_attr_getters[LANG] = lambda text: 'hu' + + prefixes = tuple(TOKENIZER_PREFIXES) + + suffixes = tuple(TOKENIZER_SUFFIXES) + + infixes = tuple(TOKENIZER_INFIXES) + + stop_words = set(STOP_WORDS) diff --git a/spacy/hu/language_data.py b/spacy/hu/language_data.py new file mode 100644 index 000000000..94eeb6f4d --- /dev/null +++ b/spacy/hu/language_data.py @@ -0,0 +1,24 @@ +# encoding: utf8 +from __future__ import unicode_literals + +import six + +from spacy.language_data import strings_to_exc, update_exc +from .punctuations import * +from .stop_words import STOP_WORDS +from .tokenizer_exceptions import ABBREVIATIONS +from .tokenizer_exceptions import OTHER_EXC +from .. 
import language_data as base + +STOP_WORDS = set(STOP_WORDS) +TOKENIZER_EXCEPTIONS = strings_to_exc(base.EMOTICONS) +TOKENIZER_PREFIXES = base.TOKENIZER_PREFIXES + TOKENIZER_PREFIXES +TOKENIZER_SUFFIXES = TOKENIZER_SUFFIXES +TOKENIZER_INFIXES = TOKENIZER_INFIXES + +# HYPHENS = [six.unichr(cp) for cp in [173, 8211, 8212, 8213, 8722, 9472]] + +update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(OTHER_EXC)) +update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(ABBREVIATIONS)) + +__all__ = ["TOKENIZER_EXCEPTIONS", "STOP_WORDS", "TOKENIZER_PREFIXES", "TOKENIZER_SUFFIXES", "TOKENIZER_INFIXES"] diff --git a/spacy/hu/punctuations.py b/spacy/hu/punctuations.py new file mode 100644 index 000000000..3681a2fbe --- /dev/null +++ b/spacy/hu/punctuations.py @@ -0,0 +1,89 @@ +# encoding: utf8 +from __future__ import unicode_literals + +TOKENIZER_PREFIXES = r''' ++ +'''.strip().split('\n') + +TOKENIZER_SUFFIXES = r''' +, +\" +\) +\] +\} +\* +\! +\? +\$ +> +: +; +' +” +“ +« +_ +'' +’ +‘ +€ +\.\. +\.\.\. +\.\.\.\. +(?<=[a-züóőúéáűí)\]"'´«‘’%\)²“”+-])\. +(?<=[a-züóőúéáűí)])-e +\-\- +´ +(?<=[0-9])\+ +(?<=[a-z0-9üóőúéáűí][\)\]”"'%\)§/])\. +(?<=[0-9])km² +(?<=[0-9])m² +(?<=[0-9])cm² +(?<=[0-9])mm² +(?<=[0-9])km³ +(?<=[0-9])m³ +(?<=[0-9])cm³ +(?<=[0-9])mm³ +(?<=[0-9])ha +(?<=[0-9])km +(?<=[0-9])m +(?<=[0-9])cm +(?<=[0-9])mm +(?<=[0-9])µm +(?<=[0-9])nm +(?<=[0-9])yd +(?<=[0-9])in +(?<=[0-9])ft +(?<=[0-9])kg +(?<=[0-9])g +(?<=[0-9])mg +(?<=[0-9])µg +(?<=[0-9])t +(?<=[0-9])lb +(?<=[0-9])oz +(?<=[0-9])m/s +(?<=[0-9])km/h +(?<=[0-9])mph +(?<=°[FCK])\. 
+(?<=[0-9])hPa +(?<=[0-9])Pa +(?<=[0-9])mbar +(?<=[0-9])mb +(?<=[0-9])T +(?<=[0-9])G +(?<=[0-9])M +(?<=[0-9])K +(?<=[0-9])kb +'''.strip().split('\n') + +TOKENIZER_INFIXES = r''' +… +\.\.+ +(?<=[a-züóőúéáűí])\.(?=[A-ZÜÓŐÚÉÁŰÍ]) +(?<=[a-zA-ZüóőúéáűíÜÓŐÚÉÁŰÍ0-9])"(?=[\-a-zA-ZüóőúéáűíÜÓŐÚÉÁŰÍ]) +(?<=[a-zA-ZüóőúéáűíÜÓŐÚÉÁŰÍ])--(?=[a-zA-ZüóőúéáűíÜÓŐÚÉÁŰÍ]) +(?<=[0-9])[+\-\*/^](?=[0-9]) +(?<=[a-zA-ZüóőúéáűíÜÓŐÚÉÁŰÍ]),(?=[a-zA-ZüóőúéáűíÜÓŐÚÉÁŰÍ]) +'''.strip().split('\n') + +__all__ = ["TOKENIZER_PREFIXES", "TOKENIZER_SUFFIXES", "TOKENIZER_INFIXES"] diff --git a/spacy/hu/stop_words.py b/spacy/hu/stop_words.py new file mode 100644 index 000000000..aad992e6e --- /dev/null +++ b/spacy/hu/stop_words.py @@ -0,0 +1,64 @@ +# encoding: utf8 +from __future__ import unicode_literals + + +STOP_WORDS = set(""" +a abban ahhoz ahogy ahol aki akik akkor akár alatt amely amelyek amelyekben +amelyeket amelyet amelynek ami amikor amit amolyan amíg annak arra arról az +azok azon azonban azt aztán azután azzal azért + +be belül benne bár + +cikk cikkek cikkeket csak + +de + +e ebben eddig egy egyes egyetlen egyik egyre egyéb egész ehhez ekkor el ellen +elo eloször elott elso elég előtt emilyen ennek erre ez ezek ezen ezt ezzel +ezért + +fel felé + +ha hanem hiszen hogy hogyan hát + +ide igen ill ill. illetve ilyen ilyenkor inkább is ismét ison itt + +jobban jó jól + +kell kellett keressünk keresztül ki kívül között közül + +le legalább legyen lehet lehetett lenne lenni lesz lett + +ma maga magát majd meg mellett mely melyek mert mi miatt mikor milyen minden +mindenki mindent mindig mint mintha mit mivel miért mondta most már más másik +még míg + +nagy nagyobb nagyon ne nekem neki nem nincs néha néhány nélkül + +o oda ok oket olyan ott + +pedig persze például + +rá + +s saját sem semmi sok sokat sokkal stb. 
szemben szerint szinte számára szét + +talán te tehát teljes ti tovább továbbá több túl ugyanis + +utolsó után utána + +vagy vagyis vagyok valaki valami valamint való van vannak vele vissza viszont +volna volt voltak voltam voltunk + +által általában át + +én éppen és + +így + +ön össze + +úgy új újabb újra + +ő őket +""".split()) diff --git a/spacy/hu/tokenizer_exceptions.py b/spacy/hu/tokenizer_exceptions.py new file mode 100644 index 000000000..627035bb8 --- /dev/null +++ b/spacy/hu/tokenizer_exceptions.py @@ -0,0 +1,549 @@ +# encoding: utf8 +from __future__ import unicode_literals + +ABBREVIATIONS = """ +AkH. +Aö. +B.CS. +B.S. +B.Sc. +B.ú.é.k. +BE. +BEK. +BSC. +BSc. +BTK. +Be. +Bek. +Bfok. +Bk. +Bp. +Btk. +Btke. +Btét. +CSC. +Cal. +Co. +Colo. +Comp. +Copr. +Cs. +Csc. +Csop. +Ctv. +D. +DR. +Dipl. +Dr. +Dsz. +Dzs. +Fla. +Főszerk. +GM. +Gy. +HKsz. +Hmvh. +Inform. +K.m.f. +KER. +KFT. +KRT. +Ker. +Kft. +Kong. +Korm. +Kr. +Kr.e. +Kr.u. +Krt. +M.A. +M.S. +M.SC. +M.Sc. +MA. +MSC. +MSc. +Mass. +Mlle. +Mme. +Mo. +Mr. +Mrs. +Ms. +Mt. +N.N. +NB. +NBr. +Nat. +Nr. +Ny. +Nyh. +Nyr. +Op. +P.H. +P.S. +PH.D. +PHD. +PROF. +Ph.D +PhD. +Pp. +Proc. +Prof. +Ptk. +Rer. +S.B. +SZOLG. +Salg. +St. +Sz. +Szfv. +Szjt. +Szolg. +Szt. +Sztv. +TEL. +Tel. +Ty. +Tyr. +Ui. +Vcs. +Vhr. +X.Y. +Zs. +a. +a.C. +ac. +adj. +adm. +ag. +agit. +alez. +alk. +altbgy. +an. +ang. +arch. +at. +aug. +b. +b.a. +b.s. +b.sc. +bek. +belker. +berend. +biz. +bizt. +bo. +bp. +br. +bsc. +bt. +btk. +c. +ca. +cc. +cca. +cf. +cif. +co. +corp. +cos. +cs. +csc. +csüt. +cső. +ctv. +d. +dbj. +dd. +ddr. +de. +dec. +dikt. +dipl. +dj. +dk. +dny. +dolg. +dr. +du. +dzs. +e. +ea. +ed. +eff. +egyh. +ell. +elv. +elvt. +em. +eng. +eny. +et. +etc. +ev. +ezr. +eü. +f. +f.h. +f.é. +fam. +febr. +fej. +felv. +felügy. +ff. +ffi. +fhdgy. +fil. +fiz. +fm. +foglalk. +ford. +fp. +fr. +frsz. +fszla. +fszt. +ft. +fuv. +főig. +főisk. +főtörm. +főv. +g. +gazd. +gimn. +gk. +gkv. +gondn. +gr. +grav. +gy. +gyak. +gyártm. +gör. +h. +hads. +hallg. +hdm. 
+hdp. +hds. +hg. +hiv. +hk. +hm. +ho. +honv. +hp. +hr. +hrsz. +hsz. +ht. +htb. +hv. +hőm. +i.e. +i.sz. +id. +ifj. +ig. +igh. +ill. +imp. +inc. +ind. +inform. +inic. +int. +io. +ip. +ir. +irod. +isk. +ism. +izr. +iá. +j. +jan. +jav. +jegyz. +jjv. +jkv. +jogh. +jogt. +jr. +jvb. +júl. +jún. +k. +karb. +kat. +kb. +kcs. +kd. +ker. +kf. +kft. +kht. +kir. +kirend. +kisip. +kiv. +kk. +kkt. +klin. +kp. +krt. +kt. +ktsg. +kult. +kv. +kve. +képv. +kísérl. +kóth. +könyvt. +körz. +köv. +közj. +közl. +közp. +közt. +kü. +l. +lat. +ld. +legs. +lg. +lgv. +loc. +lt. +ltd. +ltp. +luth. +m. +m.a. +m.s. +m.sc. +ma. +mat. +mb. +med. +megh. +met. +mf. +mfszt. +min. +miss. +mjr. +mjv. +mk. +mlle. +mme. +mn. +mozg. +mr. +mrs. +ms. +msc. +má. +máj. +márc. +mé. +mélt. +mü. +műh. +műsz. +műv. +művez. +n. +nagyker. +nagys. +nat. +nb. +neg. +nk. +nov. +nu. +ny. +nyilv. +nyrt. +nyug. +o. +obj. +okl. +okt. +olv. +orsz. +ort. +ov. +ovh. +p. +pf. +pg. +ph.d +ph.d. +phd. +pk. +pl. +plb. +plc. +pld. +plur. +pol. +polg. +poz. +pp. +proc. +prof. +prot. +pság. +ptk. +pu. +pü. +q. +r. +r.k. +rac. +rad. +red. +ref. +reg. +rer. +rev. +rf. +rkp. +rkt. +rt. +rtg. +röv. +s. +s.b. +s.k. +sa. +sel. +sgt. +sm. +st. +stat. +stb. +strat. +sz. +szakm. +szaksz. +szakszerv. +szd. +szds. +szept. +szerk. +szf. +szimf. +szjt. +szkv. +szla. +szn. +szolg. +szt. +szubj. +szöv. +szül. +t. +tanm. +tb. +tbk. +tc. +techn. +tek. +tel. +tf. +tgk. +ti. +tip. +tisztv. +titks. +tk. +tkp. +tny. +tp. +tszf. +tszk. +tszkv. +tv. +tvr. +ty. +törv. +tü. +u. +ua. +ui. +unit. +uo. +uv. +v. +vas. +vb. +vegy. +vh. +vhol. +vill. +vizsg. +vk. +vkf. +vkny. +vm. +vol. +vs. +vsz. +vv. +vál. +vízv. +vö. +w. +y. +z. +zrt. +zs. +Ész. +Új-Z. +ÚjZ. +á. +ált. +ápr. +ásv. +é. +ék. +ény. +érk. +évf. +í. +ó. +ö. +össz. +ötk. +özv. +ú. +úm. +ún. +út. +ü. +üag. +üd. +üdv. +üe. +ümk. +ütk. +üv. +ő. +ű. +őrgy. +őrpk. +őrv. 
+""".strip().split() + +OTHER_EXC = """ +'' +-e +""".strip().split() diff --git a/spacy/language_data/util.py b/spacy/language_data/util.py index 71365d879..f21448da2 100644 --- a/spacy/language_data/util.py +++ b/spacy/language_data/util.py @@ -5,6 +5,7 @@ from ..symbols import * PRON_LEMMA = "-PRON-" +DET_LEMMA = "-DET-" ENT_ID = "ent_id" diff --git a/spacy/sv/__init__.py b/spacy/sv/__init__.py new file mode 100644 index 000000000..25930386a --- /dev/null +++ b/spacy/sv/__init__.py @@ -0,0 +1,19 @@ +# encoding: utf8 +from __future__ import unicode_literals, print_function + +from os import path + +from ..language import Language +from ..attrs import LANG +from .language_data import * + + +class Swedish(Language): + lang = 'sv' + + class Defaults(Language.Defaults): + lex_attr_getters = dict(Language.Defaults.lex_attr_getters) + lex_attr_getters[LANG] = lambda text: 'sv' + + tokenizer_exceptions = TOKENIZER_EXCEPTIONS + stop_words = STOP_WORDS diff --git a/spacy/sv/language_data.py b/spacy/sv/language_data.py new file mode 100644 index 000000000..8683f83ac --- /dev/null +++ b/spacy/sv/language_data.py @@ -0,0 +1,14 @@ +# encoding: utf8 +from __future__ import unicode_literals + +from .. 
import language_data as base +from ..language_data import update_exc, strings_to_exc + +from .stop_words import STOP_WORDS + + +TOKENIZER_EXCEPTIONS = strings_to_exc(base.EMOTICONS) +STOP_WORDS = set(STOP_WORDS) + + +__all__ = ["TOKENIZER_EXCEPTIONS", "STOP_WORDS"] diff --git a/spacy/sv/morph_rules.py b/spacy/sv/morph_rules.py new file mode 100644 index 000000000..da6bcbf20 --- /dev/null +++ b/spacy/sv/morph_rules.py @@ -0,0 +1,68 @@ +# encoding: utf8 +from __future__ import unicode_literals + +from ..symbols import * +from ..language_data import PRON_LEMMA + +# Used the table of pronouns at https://sv.wiktionary.org/wiki/deras + +MORPH_RULES = { + "PRP": { + "jag": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Nom"}, + "mig": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc"}, + "mej": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc"}, + "du": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Case": "Nom"}, + "han": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Nom"}, + "honom": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Acc"}, + "hon": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Nom"}, + "henne": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Acc"}, + "det": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut"}, + "vi": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Nom"}, + "oss": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Acc"}, + "ni": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Case": "Nom"}, + "er": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": 
"Two", "Number": "Plur", "Case": "Acc"}, + "de": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Nom"}, + "dom": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Nom"}, + "dem": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Acc"}, + "dom": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Acc"}, + + "min": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Reflex": "Yes"}, + "mitt": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Reflex": "Yes"}, + "mina": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"}, + "din": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Poss": "Yes", "Reflex": "Yes"}, + "ditt": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Poss": "Yes", "Reflex": "Yes"}, + "dina": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"}, + "hans": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Gender": "Masc", "Poss": "Yes", "Reflex": "Yes"}, + "hans": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Gender": "Masc", "Poss": "Yes", "Reflex": "Yes"}, + "hennes": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Gender": "Fem", "Poss": "Yes", "Reflex": "Yes"}, + "hennes": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Gender": "Fem", "Poss": "Yes", "Reflex": "Yes"}, + "dess": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Poss": "Yes", "Reflex": "Yes"}, + "dess": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"}, + "vår": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": 
"Plur", "Poss": "Yes", "Reflex": "Yes"}, + "våran": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"}, + "vårt": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"}, + "vårat": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"}, + "våra": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"}, + "er": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"}, + "eran": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"}, + "ert": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"}, + "erat": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"}, + "era": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"}, + "deras": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"} + }, + + "VBZ": { + "är": {LEMMA: "be", "VerbForm": "Fin", "Person": "One", "Tense": "Pres", "Mood": "Ind"}, + "är": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"}, + "är": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"}, + }, + + "VBP": { + "är": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Pres", "Mood": "Ind"} + }, + + "VBD": { + "var": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Past", "Number": "Sing"}, + "vart": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Past", "Number": "Plur"} + } +} diff --git a/spacy/sv/stop_words.py b/spacy/sv/stop_words.py new file mode 100644 index 000000000..e64feb3f6 --- /dev/null +++ b/spacy/sv/stop_words.py @@ -0,0 +1,47 @@ +# encoding: utf8 +from __future__ import 
unicode_literals + + +STOP_WORDS = set(""" +aderton adertonde adjö aldrig alla allas allt alltid alltså än andra andras annan annat ännu artonde arton åtminstone att åtta åttio åttionde åttonde av även + +båda bådas bakom bara bäst bättre behöva behövas behövde behövt beslut beslutat beslutit bland blev bli blir blivit bort borta bra + +då dag dagar dagarna dagen där därför de del delen dem den deras dess det detta dig din dina dit ditt dock du + +efter eftersom elfte eller elva en enkel enkelt enkla enligt er era ert ett ettusen + +få fanns får fått fem femte femtio femtionde femton femtonde fick fin finnas finns fjärde fjorton fjortonde fler flera flesta följande för före förlåt förra första fram framför från fyra fyrtio fyrtionde + +gå gälla gäller gällt går gärna gått genast genom gick gjorde gjort god goda godare godast gör göra gott + +ha hade haft han hans har här heller hellre helst helt henne hennes hit hög höger högre högst hon honom hundra hundraen hundraett hur + +i ibland idag igår igen imorgon in inför inga ingen ingenting inget innan inne inom inte inuti + +ja jag jämfört + +kan kanske knappast kom komma kommer kommit kr kunde kunna kunnat kvar + +länge längre långsam långsammare långsammast långsamt längst långt lätt lättare lättast legat ligga ligger lika likställd likställda lilla lite liten litet + +man många måste med mellan men mer mera mest mig min mina mindre minst mitt mittemot möjlig möjligen möjligt möjligtvis mot mycket + +någon någonting något några när nästa ned nederst nedersta nedre nej ner ni nio nionde nittio nittionde nitton nittonde nödvändig nödvändiga nödvändigt nödvändigtvis nog noll nr nu nummer + +och också ofta oftast olika olikt om oss + +över övermorgon överst övre + +på + +rakt rätt redan + +så sade säga säger sagt samma sämre sämst sedan senare senast sent sex sextio sextionde sexton sextonde sig sin sina sist sista siste sitt sjätte sju sjunde sjuttio sjuttionde sjutton sjuttonde ska skall skulle slutligen små smått 
snart som stor stora större störst stort + +tack tidig tidigare tidigast tidigt till tills tillsammans tio tionde tjugo tjugoen tjugoett tjugonde tjugotre tjugotvå tjungo tolfte tolv tre tredje trettio trettionde tretton trettonde två tvåhundra + +under upp ur ursäkt ut utan utanför ute + +vad vänster vänstra var vår vara våra varför varifrån varit varken värre varsågod vart vårt vem vems verkligen vi vid vidare viktig viktigare viktigast viktigt vilka vilken vilket vill +""".split()) diff --git a/spacy/sv/tokenizer_exceptions.py b/spacy/sv/tokenizer_exceptions.py new file mode 100644 index 000000000..6cf144b44 --- /dev/null +++ b/spacy/sv/tokenizer_exceptions.py @@ -0,0 +1,58 @@ +# encoding: utf8 +from __future__ import unicode_literals + +from ..symbols import * +from ..language_data import PRON_LEMMA + + +TOKENIZER_EXCEPTIONS = { + +} + + +ORTH_ONLY = [ + "ang.", + "anm.", + "bil.", + "bl.a.", + "dvs.", + "e.Kr.", + "el.", + "e.d.", + "eng.", + "etc.", + "exkl.", + "f.d.", + "fid.", + "f.Kr.", + "forts.", + "fr.o.m.", + "f.ö.", + "förf.", + "inkl.", + "jur.", + "kl.", + "kr.", + "lat.", + "m.a.o.", + "max.", + "m.fl.", + "min.", + "m.m.", + "obs.", + "o.d.", + "osv.", + "p.g.a.", + "ref.", + "resp.", + "s.", + "s.a.s.", + "s.k.", + "st.", + "s:t", + "t.ex.", + "t.o.m.", + "ung.", + "äv.", + "övers." +] diff --git a/spacy/tests/hu/__init__.py b/spacy/tests/hu/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/spacy/tests/hu/tokenizer/__init__.py b/spacy/tests/hu/tokenizer/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/spacy/tests/hu/tokenizer/test_tokenizer.py b/spacy/tests/hu/tokenizer/test_tokenizer.py new file mode 100644 index 000000000..2bfbfdf36 --- /dev/null +++ b/spacy/tests/hu/tokenizer/test_tokenizer.py @@ -0,0 +1,233 @@ +# encoding: utf8 +from __future__ import unicode_literals + +import pytest +from spacy.hu import Hungarian + +_DEFAULT_TESTS = [('N. 
kormányzósági\nszékhely.', ['N.', 'kormányzósági', 'székhely', '.']), + ('A .hu egy tld.', ['A', '.hu', 'egy', 'tld', '.']), + ('Az egy.ketto pelda.', ['Az', 'egy.ketto', 'pelda', '.']), + ('A pl. rovidites.', ['A', 'pl.', 'rovidites', '.']), + ('A S.M.A.R.T. szo.', ['A', 'S.M.A.R.T.', 'szo', '.']), + ('A .hu.', ['A', '.hu', '.']), + ('Az egy.ketto.', ['Az', 'egy.ketto', '.']), + ('A pl.', ['A', 'pl.']), + ('A S.M.A.R.T.', ['A', 'S.M.A.R.T.']), + ('Egy..ket.', ['Egy', '..', 'ket', '.']), + ('Valami... van.', ['Valami', '...', 'van', '.']), + ('Valami ...van...', ['Valami', '...', 'van', '...']), + ('Valami...', ['Valami', '...']), + ('Valami ...', ['Valami', '...']), + ('Valami ... más.', ['Valami', '...', 'más', '.'])] + +_HYPHEN_TESTS = [ + ('Egy -nak, -jaiért, -magyar, bel- van.', ['Egy', '-nak', ',', '-jaiért', ',', '-magyar', ',', 'bel-', 'van', '.']), + ('Egy -nak.', ['Egy', '-nak', '.']), + ('Egy bel-.', ['Egy', 'bel-', '.']), + ('Dinnye-domb-.', ['Dinnye-domb-', '.']), + ('Ezen -e elcsatangolt.', ['Ezen', '-e', 'elcsatangolt', '.']), + ('Lakik-e', ['Lakik', '-e']), + ('Lakik-e?', ['Lakik', '-e', '?']), + ('Lakik-e.', ['Lakik', '-e', '.']), + ('Lakik-e...', ['Lakik', '-e', '...']), + ('Lakik-e... 
van.', ['Lakik', '-e', '...', 'van', '.']), + ('Lakik-e van?', ['Lakik', '-e', 'van', '?']), + ('Lakik-elem van?', ['Lakik-elem', 'van', '?']), + ('Van lakik-elem.', ['Van', 'lakik-elem', '.']), + ('A 7-es busz?', ['A', '7-es', 'busz', '?']), + ('A 7-es?', ['A', '7-es', '?']), + ('A 7-es.', ['A', '7-es', '.']), + ('Ez (lakik)-e?', ['Ez', '(', 'lakik', ')', '-e', '?']), + ('A %-sal.', ['A', '%-sal', '.']), + ('A CD-ROM-okrol.', ['A', 'CD-ROM-okrol', '.'])] + +_NUMBER_TESTS = [('A 2b van.', ['A', '2b', 'van', '.']), + ('A 2b-ben van.', ['A', '2b-ben', 'van', '.']), + ('A 2b.', ['A', '2b', '.']), + ('A 2b-ben.', ['A', '2b-ben', '.']), + ('A 3.b van.', ['A', '3.b', 'van', '.']), + ('A 3.b-ben van.', ['A', '3.b-ben', 'van', '.']), + ('A 3.b.', ['A', '3.b', '.']), + ('A 3.b-ben.', ['A', '3.b-ben', '.']), + ('A 1:20:36.7 van.', ['A', '1:20:36.7', 'van', '.']), + ('A 1:20:36.7-ben van.', ['A', '1:20:36.7-ben', 'van', '.']), + ('A 1:20:36.7-ben.', ['A', '1:20:36.7-ben', '.']), + ('A 1:35 van.', ['A', '1:35', 'van', '.']), + ('A 1:35-ben van.', ['A', '1:35-ben', 'van', '.']), + ('A 1:35-ben.', ['A', '1:35-ben', '.']), + ('A 1.35 van.', ['A', '1.35', 'van', '.']), + ('A 1.35-ben van.', ['A', '1.35-ben', 'van', '.']), + ('A 1.35-ben.', ['A', '1.35-ben', '.']), + ('A 4:01,95 van.', ['A', '4:01,95', 'van', '.']), + ('A 4:01,95-ben van.', ['A', '4:01,95-ben', 'van', '.']), + ('A 4:01,95-ben.', ['A', '4:01,95-ben', '.']), + ('A 10--12 van.', ['A', '10--12', 'van', '.']), + ('A 10--12-ben van.', ['A', '10--12-ben', 'van', '.']), + ('A 10--12-ben.', ['A', '10--12-ben', '.']), + ('A 10‐12 van.', ['A', '10‐12', 'van', '.']), + ('A 10‐12-ben van.', ['A', '10‐12-ben', 'van', '.']), + ('A 10‐12-ben.', ['A', '10‐12-ben', '.']), + ('A 10‑12 van.', ['A', '10‑12', 'van', '.']), + ('A 10‑12-ben van.', ['A', '10‑12-ben', 'van', '.']), + ('A 10‑12-ben.', ['A', '10‑12-ben', '.']), + ('A 10‒12 van.', ['A', '10‒12', 'van', '.']), + ('A 10‒12-ben van.', ['A', '10‒12-ben', 'van', '.']), + ('A 
10‒12-ben.', ['A', '10‒12-ben', '.']), + ('A 10–12 van.', ['A', '10–12', 'van', '.']), + ('A 10–12-ben van.', ['A', '10–12-ben', 'van', '.']), + ('A 10–12-ben.', ['A', '10–12-ben', '.']), + ('A 10—12 van.', ['A', '10—12', 'van', '.']), + ('A 10—12-ben van.', ['A', '10—12-ben', 'van', '.']), + ('A 10—12-ben.', ['A', '10—12-ben', '.']), + ('A 10―12 van.', ['A', '10―12', 'van', '.']), + ('A 10―12-ben van.', ['A', '10―12-ben', 'van', '.']), + ('A 10―12-ben.', ['A', '10―12-ben', '.']), + ('A -23,12 van.', ['A', '-23,12', 'van', '.']), + ('A -23,12-ben van.', ['A', '-23,12-ben', 'van', '.']), + ('A -23,12-ben.', ['A', '-23,12-ben', '.']), + ('A 2+3 van.', ['A', '2', '+', '3', 'van', '.']), + ('A 2 +3 van.', ['A', '2', '+', '3', 'van', '.']), + ('A 2+ 3 van.', ['A', '2', '+', '3', 'van', '.']), + ('A 2 + 3 van.', ['A', '2', '+', '3', 'van', '.']), + ('A 2*3 van.', ['A', '2', '*', '3', 'van', '.']), + ('A 2 *3 van.', ['A', '2', '*', '3', 'van', '.']), + ('A 2* 3 van.', ['A', '2', '*', '3', 'van', '.']), + ('A 2 * 3 van.', ['A', '2', '*', '3', 'van', '.']), + ('A C++ van.', ['A', 'C++', 'van', '.']), + ('A C++-ben van.', ['A', 'C++-ben', 'van', '.']), + ('A C++.', ['A', 'C++', '.']), + ('A C++-ben.', ['A', 'C++-ben', '.']), + ('A 2003. I. 06. van.', ['A', '2003.', 'I.', '06.', 'van', '.']), + ('A 2003. I. 06-ben van.', ['A', '2003.', 'I.', '06-ben', 'van', '.']), + ('A 2003. I. 06.', ['A', '2003.', 'I.', '06.']), + ('A 2003. I. 06-ben.', ['A', '2003.', 'I.', '06-ben', '.']), + ('A 2003. 01. 06. van.', ['A', '2003.', '01.', '06.', 'van', '.']), + ('A 2003. 01. 06-ben van.', ['A', '2003.', '01.', '06-ben', 'van', '.']), + ('A 2003. 01. 06.', ['A', '2003.', '01.', '06.']), + ('A 2003. 01. 06-ben.', ['A', '2003.', '01.', '06-ben', '.']), + ('A IV. 12. van.', ['A', 'IV.', '12.', 'van', '.']), + ('A IV. 12-ben van.', ['A', 'IV.', '12-ben', 'van', '.']), + ('A IV. 12.', ['A', 'IV.', '12.']), + ('A IV. 12-ben.', ['A', 'IV.', '12-ben', '.']), + ('A 2003.01.06. 
van.', ['A', '2003.01.06.', 'van', '.']), + ('A 2003.01.06-ben van.', ['A', '2003.01.06-ben', 'van', '.']), + ('A 2003.01.06.', ['A', '2003.01.06.']), + ('A 2003.01.06-ben.', ['A', '2003.01.06-ben', '.']), + ('A IV.12. van.', ['A', 'IV.12.', 'van', '.']), + ('A IV.12-ben van.', ['A', 'IV.12-ben', 'van', '.']), + ('A IV.12.', ['A', 'IV.12.']), + ('A IV.12-ben.', ['A', 'IV.12-ben', '.']), + ('A 1.1.2. van.', ['A', '1.1.2.', 'van', '.']), + ('A 1.1.2-ben van.', ['A', '1.1.2-ben', 'van', '.']), + ('A 1.1.2.', ['A', '1.1.2.']), + ('A 1.1.2-ben.', ['A', '1.1.2-ben', '.']), + ('A 1,5--2,5 van.', ['A', '1,5--2,5', 'van', '.']), + ('A 1,5--2,5-ben van.', ['A', '1,5--2,5-ben', 'van', '.']), + ('A 1,5--2,5-ben.', ['A', '1,5--2,5-ben', '.']), + ('A 3,14 van.', ['A', '3,14', 'van', '.']), + ('A 3,14-ben van.', ['A', '3,14-ben', 'van', '.']), + ('A 3,14-ben.', ['A', '3,14-ben', '.']), + ('A 3.14 van.', ['A', '3.14', 'van', '.']), + ('A 3.14-ben van.', ['A', '3.14-ben', 'van', '.']), + ('A 3.14-ben.', ['A', '3.14-ben', '.']), + ('A 15. van.', ['A', '15.', 'van', '.']), + ('A 15-ben van.', ['A', '15-ben', 'van', '.']), + ('A 15-ben.', ['A', '15-ben', '.']), + ('A 15.-ben van.', ['A', '15.-ben', 'van', '.']), + ('A 15.-ben.', ['A', '15.-ben', '.']), + ('A 2002--2003. 
van.', ['A', '2002--2003.', 'van', '.']), + ('A 2002--2003-ben van.', ['A', '2002--2003-ben', 'van', '.']), + ('A 2002--2003-ben.', ['A', '2002--2003-ben', '.']), + ('A -0,99% van.', ['A', '-0,99%', 'van', '.']), + ('A -0,99%-ben van.', ['A', '-0,99%-ben', 'van', '.']), + ('A -0,99%.', ['A', '-0,99%', '.']), + ('A -0,99%-ben.', ['A', '-0,99%-ben', '.']), + ('A 10--20% van.', ['A', '10--20%', 'van', '.']), + ('A 10--20%-ben van.', ['A', '10--20%-ben', 'van', '.']), + ('A 10--20%.', ['A', '10--20%', '.']), + ('A 10--20%-ben.', ['A', '10--20%-ben', '.']), + ('A 99§ van.', ['A', '99§', 'van', '.']), + ('A 99§-ben van.', ['A', '99§-ben', 'van', '.']), + ('A 99§-ben.', ['A', '99§-ben', '.']), + ('A 10--20§ van.', ['A', '10--20§', 'van', '.']), + ('A 10--20§-ben van.', ['A', '10--20§-ben', 'van', '.']), + ('A 10--20§-ben.', ['A', '10--20§-ben', '.']), + ('A 99° van.', ['A', '99°', 'van', '.']), + ('A 99°-ben van.', ['A', '99°-ben', 'van', '.']), + ('A 99°-ben.', ['A', '99°-ben', '.']), + ('A 10--20° van.', ['A', '10--20°', 'van', '.']), + ('A 10--20°-ben van.', ['A', '10--20°-ben', 'van', '.']), + ('A 10--20°-ben.', ['A', '10--20°-ben', '.']), + ('A °C van.', ['A', '°C', 'van', '.']), + ('A °C-ben van.', ['A', '°C-ben', 'van', '.']), + ('A °C.', ['A', '°C', '.']), + ('A °C-ben.', ['A', '°C-ben', '.']), + ('A 100°C van.', ['A', '100°C', 'van', '.']), + ('A 100°C-ben van.', ['A', '100°C-ben', 'van', '.']), + ('A 100°C.', ['A', '100°C', '.']), + ('A 100°C-ben.', ['A', '100°C-ben', '.']), + ('A 800x600 van.', ['A', '800x600', 'van', '.']), + ('A 800x600-ben van.', ['A', '800x600-ben', 'van', '.']), + ('A 800x600-ben.', ['A', '800x600-ben', '.']), + ('A 1x2x3x4 van.', ['A', '1x2x3x4', 'van', '.']), + ('A 1x2x3x4-ben van.', ['A', '1x2x3x4-ben', 'van', '.']), + ('A 1x2x3x4-ben.', ['A', '1x2x3x4-ben', '.']), + ('A 5/J van.', ['A', '5/J', 'van', '.']), + ('A 5/J-ben van.', ['A', '5/J-ben', 'van', '.']), + ('A 5/J-ben.', ['A', '5/J-ben', '.']), + ('A 5/J. 
van.', ['A', '5/J.', 'van', '.']), + ('A 5/J.-ben van.', ['A', '5/J.-ben', 'van', '.']), + ('A 5/J.-ben.', ['A', '5/J.-ben', '.']), + ('A III/1 van.', ['A', 'III/1', 'van', '.']), + ('A III/1-ben van.', ['A', 'III/1-ben', 'van', '.']), + ('A III/1-ben.', ['A', 'III/1-ben', '.']), + ('A III/1. van.', ['A', 'III/1.', 'van', '.']), + ('A III/1.-ben van.', ['A', 'III/1.-ben', 'van', '.']), + ('A III/1.-ben.', ['A', 'III/1.-ben', '.']), + ('A III/c van.', ['A', 'III/c', 'van', '.']), + ('A III/c-ben van.', ['A', 'III/c-ben', 'van', '.']), + ('A III/c.', ['A', 'III/c', '.']), + ('A III/c-ben.', ['A', 'III/c-ben', '.']), + ('A TU–154 van.', ['A', 'TU–154', 'van', '.']), + ('A TU–154-ben van.', ['A', 'TU–154-ben', 'van', '.']), + ('A TU–154-ben.', ['A', 'TU–154-ben', '.'])] + +_QUOTE_TESTS = [('Az "Ime, hat"-ban irja.', ['Az', '"', 'Ime', ',', 'hat', '"', '-ban', 'irja', '.']), + ('"Ime, hat"-ban irja.', ['"', 'Ime', ',', 'hat', '"', '-ban', 'irja', '.']), + ('Az "Ime, hat".', ['Az', '"', 'Ime', ',', 'hat', '"', '.']), + ('Egy 24"-os monitor.', ['Egy', '24', '"', '-os', 'monitor', '.']), + ("A don't van.", ['A', "don't", 'van', '.'])] + +_DOT_TESTS = [('N. kormányzósági\nszékhely.', ['N.', 'kormányzósági', 'székhely', '.']), + ('A .hu egy tld.', ['A', '.hu', 'egy', 'tld', '.']), + ('Az egy.ketto pelda.', ['Az', 'egy.ketto', 'pelda', '.']), + ('A pl. rovidites.', ['A', 'pl.', 'rovidites', '.']), + ('A S.M.A.R.T. szo.', ['A', 'S.M.A.R.T.', 'szo', '.']), + ('A .hu.', ['A', '.hu', '.']), + ('Az egy.ketto.', ['Az', 'egy.ketto', '.']), + ('A pl.', ['A', 'pl.']), + ('A S.M.A.R.T.', ['A', 'S.M.A.R.T.']), + ('Egy..ket.', ['Egy', '..', 'ket', '.']), + ('Valami... van.', ['Valami', '...', 'van', '.']), + ('Valami ...van...', ['Valami', '...', 'van', '...']), + ('Valami...', ['Valami', '...']), + ('Valami ...', ['Valami', '...']), + ('Valami ... 
más.', ['Valami', '...', 'más', '.'])] + + +@pytest.fixture(scope="session") +def HU(): + return Hungarian() + + +@pytest.fixture(scope="module") +def hu_tokenizer(HU): + return HU.tokenizer + + +@pytest.mark.parametrize(("input", "expected_tokens"), + _DEFAULT_TESTS + _HYPHEN_TESTS + _NUMBER_TESTS + _DOT_TESTS + _QUOTE_TESTS) +def test_testcases(hu_tokenizer, input, expected_tokens): + tokens = hu_tokenizer(input) + token_list = [token.orth_ for token in tokens if not token.is_space] + assert expected_tokens == token_list diff --git a/spacy/vocab.pyx b/spacy/vocab.pyx index 8792d4f0d..cce85e095 100644 --- a/spacy/vocab.pyx +++ b/spacy/vocab.pyx @@ -53,7 +53,7 @@ cdef class Vocab: ''' @classmethod def load(cls, path, lex_attr_getters=None, lemmatizer=True, - tag_map=True, serializer_freqs=True, oov_prob=True, **deprecated_kwargs): + tag_map=True, serializer_freqs=True, oov_prob=True, **deprecated_kwargs): """ Load the vocabulary from a path. @@ -96,6 +96,8 @@ cdef class Vocab: if serializer_freqs is True and (path / 'vocab' / 'serializer.json').exists(): with (path / 'vocab' / 'serializer.json').open('r', encoding='utf8') as file_: serializer_freqs = json.load(file_) + else: + serializer_freqs = None cdef Vocab self = cls(lex_attr_getters=lex_attr_getters, tag_map=tag_map, lemmatizer=lemmatizer, serializer_freqs=serializer_freqs) @@ -124,7 +126,7 @@ cdef class Vocab: Vocab: The newly constructed vocab object. 
''' util.check_renamed_kwargs({'get_lex_attr': 'lex_attr_getters'}, deprecated_kwargs) - + lex_attr_getters = lex_attr_getters if lex_attr_getters is not None else {} tag_map = tag_map if tag_map is not None else {} if lemmatizer in (None, True, False): @@ -149,10 +151,10 @@ cdef class Vocab: self.lex_attr_getters = lex_attr_getters self.morphology = Morphology(self.strings, tag_map, lemmatizer) self.serializer_freqs = serializer_freqs - + self.length = 1 self._serializer = None - + property serializer: # Having the serializer live here is super messy :( def __get__(self): @@ -177,7 +179,7 @@ cdef class Vocab: vectors if necessary. The memory will be zeroed. Arguments: - new_size (int): The new size of the vectors. + new_size (int): The new size of the vectors. ''' cdef hash_t key cdef size_t addr @@ -190,11 +192,11 @@ cdef class Vocab: def add_flag(self, flag_getter, int flag_id=-1): '''Set a new boolean flag to words in the vocabulary. - + The flag_setter function will be called over the words currently in the vocab, and then applied to new words as they occur. You'll then be able to access the flag value on each token, using token.check_flag(flag_id). - + See also: Lexeme.set_flag, Lexeme.check_flag, Token.set_flag, Token.check_flag. @@ -204,7 +206,7 @@ cdef class Vocab: flag_id (int): An integer between 1 and 63 (inclusive), specifying the bit at which the - flag will be stored. If -1, the lowest available bit will be + flag will be stored. If -1, the lowest available bit will be chosen. Returns: @@ -322,7 +324,7 @@ cdef class Vocab: Arguments: id_or_string (int or unicode): The integer ID of a word, or its unicode string. - + If an int >= Lexicon.size, IndexError is raised. If id_or_string is neither an int nor a unicode string, ValueError is raised. @@ -349,7 +351,7 @@ cdef class Vocab: for attr_id, value in props.items(): Token.set_struct_attr(token, attr_id, value) return tokens - + def dump(self, loc): """Save the lexemes binary data to the given location. 
@@ -443,7 +445,7 @@ cdef class Vocab: cdef int32_t word_len cdef bytes word_str cdef char* chars - + cdef Lexeme lexeme cdef CFile out_file = CFile(out_loc, 'wb') for lexeme in self: @@ -460,7 +462,7 @@ cdef class Vocab: out_file.close() def load_vectors(self, file_): - """Load vectors from a text-based file. + """Load vectors from a text-based file. Arguments: file_ (buffer): The file to read from. Entries should be separated by newlines, diff --git a/website/README.md b/website/README.md index 48e2a1bc4..e0b7d53fc 100644 --- a/website/README.md +++ b/website/README.md @@ -12,7 +12,7 @@ writing. You can read more about our approach in our blog post, ["Rebuilding a W ```bash sudo npm install --global harp git clone https://github.com/explosion/spaCy -cd website +cd spaCy/website harp server ``` diff --git a/website/_includes/_footer.jade b/website/_includes/_footer.jade index e49ba309f..c83fd2988 100644 --- a/website/_includes/_footer.jade +++ b/website/_includes/_footer.jade @@ -11,7 +11,7 @@ footer.o-footer.u-text.u-border-dotted each url, item in group li - +a(url)(target=url.includes("http") ? "_blank" : false)=item + +a(url)=item if SECTION != "docs" +grid-col("quarter") diff --git a/website/_includes/_mixins.jade b/website/_includes/_mixins.jade index 3a15e4518..8fe24b11b 100644 --- a/website/_includes/_mixins.jade +++ b/website/_includes/_mixins.jade @@ -20,7 +20,8 @@ mixin h(level, id) info: https://mathiasbynens.github.io/rel-noopener/ mixin a(url, trusted) - a(href=url target="_blank" rel=!trusted ? "noopener nofollow" : false)&attributes(attributes) + - external = url.includes("http") + a(href=url target=external ? "_blank" : null rel=external && !trusted ? 
"noopener nofollow" : null)&attributes(attributes) block @@ -33,7 +34,7 @@ mixin src(url) +a(url) block - | #[+icon("code", 16).u-color-subtle] + | #[+icon("code", 16).o-icon--inline.u-color-subtle] //- API link (with added tag and automatically generated path) @@ -43,7 +44,7 @@ mixin api(path) +a("/docs/api/" + path, true)(target="_self").u-no-border.u-inline-block block - | #[+icon("book", 18).o-help-icon.u-color-subtle] + | #[+icon("book", 18).o-icon--inline.u-help.u-color-subtle] //- Aside for text @@ -74,7 +75,8 @@ mixin aside-code(label, language) see assets/css/_components/_buttons.sass mixin button(url, trusted, ...style) - a.c-button.u-text-label(href=url class=prefixArgs(style, "c-button") role="button" target="_blank" rel=!trusted ? "noopener nofollow" : false)&attributes(attributes) + - external = url.includes("http") + a.c-button.u-text-label(href=url class=prefixArgs(style, "c-button") role="button" target=external ? "_blank" : null rel=external && !trusted ? "noopener nofollow" : null)&attributes(attributes) block @@ -148,7 +150,7 @@ mixin tag() mixin list(type, start) if type - ol.c-list.o-block.u-text(class="c-list--#{type}" style=(start === 0 || start) ? "counter-reset: li #{(start - 1)}" : false)&attributes(attributes) + ol.c-list.o-block.u-text(class="c-list--#{type}" style=(start === 0 || start) ? "counter-reset: li #{(start - 1)}" : null)&attributes(attributes) block else diff --git a/website/_includes/_navigation.jade b/website/_includes/_navigation.jade index 881a5db56..beb33be4b 100644 --- a/website/_includes/_navigation.jade +++ b/website/_includes/_navigation.jade @@ -2,7 +2,7 @@ include _mixins -nav.c-nav.u-text.js-nav(class=landing ? "c-nav--theme" : false) +nav.c-nav.u-text.js-nav(class=landing ? "c-nav--theme" : null) a(href='/') #[+logo] if SUBSECTION != "index" @@ -11,7 +11,7 @@ nav.c-nav.u-text.js-nav(class=landing ? 
"c-nav--theme" : false) ul.c-nav__menu each url, item in NAVIGATION li.c-nav__menu__item - a(href=url target=url.includes("http") ? "_blank" : false)=item + +a(url)=item li.c-nav__menu__item +a(gh("spaCy"))(aria-label="GitHub").u-hidden-xs #[+icon("github", 20)] diff --git a/website/_includes/_page-docs.jade b/website/_includes/_page-docs.jade index 44a5698f0..09cbfa6a5 100644 --- a/website/_includes/_page-docs.jade +++ b/website/_includes/_page-docs.jade @@ -18,7 +18,7 @@ main.o-main.o-main--sidebar.o-main--aside - data = public.docs[SUBSECTION]._data[next] .o-inline-list - span #[strong.u-text-label Read next:] #[a(href=next).u-link=data.title] + span #[strong.u-text-label Read next:] #[+a(next).u-link=data.title] +grid-col("half").u-text-right .o-inline-list diff --git a/website/_includes/_sidebar.jade b/website/_includes/_sidebar.jade index a0d4d4cd3..241a77132 100644 --- a/website/_includes/_sidebar.jade +++ b/website/_includes/_sidebar.jade @@ -9,5 +9,5 @@ menu.c-sidebar.js-sidebar.u-text li.u-text-label.u-color-subtle=menu each url, item in items - li(class=(CURRENT == url || (CURRENT == "index" && url == "./")) ? "is-active" : false) - +a(url)(target=url.includes("http") ? "_blank" : false)=item + li(class=(CURRENT == url || (CURRENT == "index" && url == "./")) ? 
"is-active" : null) + +a(url)=item diff --git a/website/assets/css/_base/_objects.sass b/website/assets/css/_base/_objects.sass index f0efe94a0..2b037dca7 100644 --- a/website/assets/css/_base/_objects.sass +++ b/website/assets/css/_base/_objects.sass @@ -67,9 +67,8 @@ .o-icon vertical-align: middle -.o-help-icon - cursor: help - margin: 0 0.5rem 0 0.25rem + &.o-icon--inline + margin: 0 0.5rem 0 0.25rem //- Inline List diff --git a/website/assets/css/_base/_utilities.sass b/website/assets/css/_base/_utilities.sass index 2c40858a8..95be81bcd 100644 --- a/website/assets/css/_base/_utilities.sass +++ b/website/assets/css/_base/_utilities.sass @@ -141,6 +141,12 @@ background: $pattern +//- Cursors + +.u-help + cursor: help + + //- Hidden elements .u-hidden diff --git a/website/docs/api/annotation.jade b/website/docs/api/annotation.jade index de678b472..93511899b 100644 --- a/website/docs/api/annotation.jade +++ b/website/docs/api/annotation.jade @@ -50,6 +50,13 @@ p A "lemma" is the uninflected form of a word. In English, this means: +item #[strong Nouns]: The form like "dog", not "dogs"; like "child", not "children" +item #[strong Verbs]: The form like "write", not "writes", "writing", "wrote" or "written" ++aside("About spaCy's custom pronoun lemma") + | Unlike verbs and common nouns, there's no clear base form of a personal + | pronoun. Should the lemma of "me" be "I", or should we normalize person + | as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a + | novel symbol, #[code.u-nowrap -PRON-], which is used as the lemma for + | all personal pronouns. + p | The lemmatization data is taken from | #[+a("https://wordnet.princeton.edu") WordNet]. However, we also add a @@ -58,11 +65,16 @@ p +h(2, "dependency-parsing") Syntactic Dependency Parsing -p - | The parser is trained on data produced by the - | #[+a("http://www.clearnlp.com") ClearNLP] converter. 
Details of the - | annotation scheme can be found - | #[+a("http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf") here]. ++table(["Language", "Converter", "Scheme"]) + +row + +cell English + +cell #[+a("http://www.clearnlp.com") ClearNLP] + +cell #[+a("http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf") CLEAR Style] + + +row + +cell German + +cell #[+a("https://github.com/wbwseeker/tiger2dep") TIGER] + +cell #[+a("http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/index.html") TIGER] +h(2, "named-entities") Named Entity Recognition diff --git a/website/docs/api/language-models.jade b/website/docs/api/language-models.jade index 1b63e7702..8e0f1b6b8 100644 --- a/website/docs/api/language-models.jade +++ b/website/docs/api/language-models.jade @@ -2,7 +2,7 @@ include ../../_includes/_mixins -p You can download data packs that add the following capabilities to spaCy. +p spaCy currently supports the following languages and capabilities: +aside-code("Download language models", "bash"). python -m spacy.en.download all @@ -24,8 +24,27 @@ p You can download data packs that add the following capabilities to spaCy. each icon in [ "pro", "con", "con", "con", "con", "con", "con", "con" ] +cell.u-text-center #[+procon(icon)] + +row + +cell Spanish #[code es] + each icon in [ "pro", "con", "con", "con", "con", "con", "con", "con" ] + +cell.u-text-center #[+procon(icon)] + p | Chinese tokenization requires the | #[+a("https://github.com/fxsjy/jieba") Jieba] library. Statistical - | models are coming soon. Tokenizers for Spanish, French, Italian and - | Portuguese are now under development. + | models are coming soon. + + + ++h(2, "alpha-support") Alpha support + +p + | Work has started on the following languages. You can help by improving + | the existing language data and extending the tokenization patterns. 
+ ++table([ "Language", "Source" ]) + each language, code in { it: "Italian", fr: "French", pt: "Portuguese", nl: "Dutch", sv: "Swedish" } + +row + +cell #{language} #[code=code] + +cell + +src(gh("spaCy", "spacy/" + code)) spacy/#{code} diff --git a/website/docs/usage/_data.json b/website/docs/usage/_data.json index eb85c683d..932abc99e 100644 --- a/website/docs/usage/_data.json +++ b/website/docs/usage/_data.json @@ -2,7 +2,8 @@ "sidebar": { "Get started": { "Installation": "./", - "Lightning tour": "lightning-tour" + "Lightning tour": "lightning-tour", + "Resources": "resources" }, "Workflows": { "Loading the pipeline": "language-processing-pipeline", @@ -31,7 +32,12 @@ }, "lightning-tour": { - "title": "Lightning tour" + "title": "Lightning tour", + "next": "resources" + }, + + "resources": { + "title": "Resources" }, "language-processing-pipeline": { diff --git a/website/docs/usage/adding-languages.jade b/website/docs/usage/adding-languages.jade index 349ab3b45..e1631102a 100644 --- a/website/docs/usage/adding-languages.jade +++ b/website/docs/usage/adding-languages.jade @@ -23,7 +23,7 @@ p +item | #[strong Build the vocabulary] including - | #[a(href="#word-probabilities") word probabilities], + | #[a(href="#word-frequencies") word frequencies], | #[a(href="#brown-clusters") Brown clusters] and | #[a(href="#word-vectors") word vectors]. @@ -245,6 +245,12 @@ p +cell | Special value for pronoun lemmas (#[code "-PRON-"]). + +row + +cell #[code DET_LEMMA] + +cell + | Special value for determiner lemmas, used in languages with + | inflected determiners (#[code "-DET-"]). + +row +cell #[code ENT_ID] +cell @@ -392,7 +398,7 @@ p | vectors files, you can use the | #[+src(gh("spacy-dev-resources", "training/init.py")) init.py] | script from our - | #[+a(gh("spacy-developer-resources")) developer resources] to create a + | #[+a(gh("spacy-dev-resources")) developer resources] to create a | spaCy data directory: +code(false, "bash"). 
@@ -424,16 +430,22 @@ p +h(3, "word-frequencies") Word frequencies p - | The #[code init.py] script expects a tab-separated word frequencies file - | with three columns: the number of times the word occurred in your language - | sample, the number of distinct documents the word occurred in, and the - | word itself. You should make sure you use the spaCy tokenizer for your + | The #[+src(gh("spacy-dev-resources", "training/init.py")) init.py] + | script expects a tab-separated word frequencies file with three columns: + ++list("numbers") + +item The number of times the word occurred in your language sample. + +item The number of distinct documents the word occurred in. + +item The word itself. + +p + | You should make sure you use the spaCy tokenizer for your | language to segment the text for your word frequencies. This will ensure | that the frequencies refer to the same segmentation standards you'll be - | using at run-time. For instance, spaCy's English tokenizer segments "can't" - | into two tokens. If we segmented the text by whitespace to produce the - | frequency counts, we'll have incorrect frequency counts for the tokens - | "ca" and "n't". + | using at run-time. For instance, spaCy's English tokenizer segments + | "can't" into two tokens. If we segmented the text by whitespace to + | produce the frequency counts, we'll have incorrect frequency counts for + | the tokens "ca" and "n't". +h(3, "brown-clusters") Training the Brown clusters diff --git a/website/docs/usage/lightning-tour.jade b/website/docs/usage/lightning-tour.jade index 58186d5d4..cb08bc045 100644 --- a/website/docs/usage/lightning-tour.jade +++ b/website/docs/usage/lightning-tour.jade @@ -3,7 +3,7 @@ include ../../_includes/_mixins p - | The following examples code snippets give you an overview of spaCy's + | The following examples and code snippets give you an overview of spaCy's | functionality and its usage. 
+h(2, "examples-resources") Load resources and process text diff --git a/website/docs/usage/resources.jade b/website/docs/usage/resources.jade new file mode 100644 index 000000000..a09c7358d --- /dev/null +++ b/website/docs/usage/resources.jade @@ -0,0 +1,118 @@ +//- 💫 DOCS > USAGE > RESOURCES + +include ../../_includes/_mixins + +p Many of the associated tools and resources that we're developing alongside spaCy can be found in their own repositories. + ++h(2, "developer") Developer tools + ++table(["Name", "Description"]) + +row + +cell + +src(gh("spacy-dev-resources")) spaCy Dev Resources + + +cell + | Scripts, tools and resources for developing spaCy, adding new + | languages and training new models. + + +row + +cell + +src(gh("spacy-benchmarks")) spaCy Benchmarks + + +cell + | Runtime performance comparison of spaCy against other NLP + | libraries. + + +row + +cell + +src(gh("spacy-services")) spaCy Services + + +cell + | REST microservices for spaCy demos and visualisers. + ++h(2, "libraries") Libraries and projects ++table(["Name", "Description"]) + +row + +cell + +src(gh("sense2vec")) sense2vec + + +cell + | Use spaCy to go beyond vanilla + | #[+a("https://en.wikipedia.org/wiki/Word2vec") Word2vec]. + ++h(2, "utility") Utility libraries and dependencies + ++table(["Name", "Description"]) + +row + +cell + +src(gh("thinc")) Thinc + + +cell + | Super sparse multi-class machine learning with Cython. + + +row + +cell + +src(gh("sputnik")) Sputnik + + +cell + | Data package manager library for spaCy. + + +row + +cell + +src(gh("sputnik-server")) Sputnik Server + + +cell + | Index service for the Sputnik data package manager for spaCy. + + +row + +cell + +src(gh("cymem")) Cymem + + +cell + | Gate Cython calls to malloc/free behind Python ref-counted + | objects.
+ + +row + +cell + +src(gh("preshed")) Preshed + + +cell + | Cython hash tables that assume keys are pre-hashed. + + +row + +cell + +src(gh("murmurhash")) MurmurHash + + +cell + | Cython bindings for + | #[+a("https://en.wikipedia.org/wiki/MurmurHash") MurmurHash2]. + ++h(2, "visualizers") Visualisers and demos + ++table(["Name", "Description"]) + +row + +cell + +src(gh("displacy")) displaCy.js + + +cell + | A lightweight dependency visualisation library for the modern + | web, built with JavaScript, CSS and SVG. + | #[+a(DEMOS_URL + "/displacy") Demo here]. + + +row + +cell + +src(gh("displacy-ent")) displaCy#[sup ENT] + + +cell + | A lightweight and modern named entity visualisation library + | built with JavaScript and CSS. + | #[+a(DEMOS_URL + "/displacy-ent") Demo here]. + + +row + +cell + +src(gh("sense2vec-demo")) sense2vec Demo + + +cell + | Source of our Semantic Analysis of the Reddit Hivemind + | #[+a(DEMOS_URL + "/sense2vec") demo] using + | #[+a(gh("sense2vec")) sense2vec]. diff --git a/website/docs/usage/training.jade b/website/docs/usage/training.jade index 98afef36b..6963730ab 100644 --- a/website/docs/usage/training.jade +++ b/website/docs/usage/training.jade @@ -13,14 +13,17 @@ p +code. from spacy.vocab import Vocab - from spacy.pipeline import Tagger + from spacy.tagger import Tagger from spacy.tokens import Doc + from spacy.gold import GoldParse + vocab = Vocab(tag_map={'N': {'pos': 'NOUN'}, 'V': {'pos': 'VERB'}}) tagger = Tagger(vocab) doc = Doc(vocab, words=['I', 'like', 'stuff']) - tagger.update(doc, ['N', 'V', 'N']) + gold = GoldParse(doc, tags=['N', 'V', 'N']) + tagger.update(doc, gold) tagger.model.end_training() diff --git a/website/index.jade b/website/index.jade index 62c54d94d..9d53432fc 100644 --- a/website/index.jade +++ b/website/index.jade @@ -22,7 +22,7 @@ include _includes/_mixins | process entire web dumps, spaCy is the library you want to | be using.
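The Tagger/GoldParse training snippet updated earlier in this diff pairs each token with a gold-standard tag inside `update()`. As a rough, self-contained illustration of that update-with-gold pattern — emphatically not spaCy's actual statistical Tagger, just a toy with the same API shape — a tagger that memorises the most frequent gold tag per word:

```python
from collections import Counter, defaultdict

# Toy illustration of the update-with-gold training pattern: memorise the
# most frequent gold tag seen for each word during update(), then look it
# up at predict() time. spaCy's real Tagger is a statistical model; only
# the API shape (train via update, then predict) is mirrored here.
class ToyTagger:
    def __init__(self):
        self.counts = defaultdict(Counter)  # word -> Counter of gold tags

    def update(self, words, gold_tags):
        for word, tag in zip(words, gold_tags):
            self.counts[word][tag] += 1

    def predict(self, words, default='N'):
        # Fall back to a default tag for words never seen in training.
        return [self.counts[w].most_common(1)[0][0] if self.counts[w] else default
                for w in words]

tagger = ToyTagger()
tagger.update(['I', 'like', 'stuff'], ['N', 'V', 'N'])
print(tagger.predict(['I', 'like', 'stuff']))  # ['N', 'V', 'N']
```

The point of the diff's change is the same pairing: wrapping the tag list in a `GoldParse` binds the gold annotations to the `Doc` before the model is updated.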
- +button("/docs/api", true, "primary")(target="_self") + +button("/docs/api", true, "primary") | Facts & figures +grid-col("third").o-card @@ -35,7 +35,7 @@ include _includes/_mixins | think of spaCy as the Ruby on Rails of Natural Language | Processing. - +button("/docs/usage", true, "primary")(target="_self") + +button("/docs/usage", true, "primary") | Get started +grid-col("third").o-card @@ -51,7 +51,7 @@ include _includes/_mixins | connect the statistical models trained by these libraries | to the rest of your application. - +button("/docs/usage/deep-learning", true, "primary")(target="_self") + +button("/docs/usage/deep-learning", true, "primary") | Read more .o-inline-list.o-block.u-border-bottom.u-text-small.u-text-center.u-padding-small @@ -105,7 +105,7 @@ include _includes/_mixins +item Robust, rigorously evaluated accuracy .o-inline-list - +button("/docs/usage/lightning-tour", true, "secondary")(target="_self") + +button("/docs/usage/lightning-tour", true, "secondary") | See examples .o-block.u-text-center.u-padding @@ -138,7 +138,7 @@ include _includes/_mixins | all others. p - | spaCy's #[a(href="/docs/api/philosophy") mission] is to make + | spaCy's #[+a("/docs/api/philosophy") mission] is to make | cutting-edge NLP practical and commonly available. That's | why I left academia in 2014, to build a production-quality | open-source NLP library. It's why