//- 💫 DOCS > USAGE > ADDING LANGUAGES > LANGUAGE DATA

p
    | The individual components #[strong expose variables] that can be imported
    | within a language module and added to the language's #[code Defaults].
    | Some components, like the punctuation rules, usually don't need much
    | customisation and can be imported from the global rules. Others,
    | like the tokenizer and norm exceptions, are very specific and will make
    | a big difference to spaCy's performance on the particular language and
    | to the training of a language model.

+table(["Variable", "Type", "Description"])
    +row
        +cell #[code STOP_WORDS]
        +cell set
        +cell Individual words.

    +row
        +cell #[code TOKENIZER_EXCEPTIONS]
        +cell dict
        +cell
            | Keyed by strings, mapped to a list of dicts (one per token)
            | with token attributes.

    +row
        +cell #[code TOKEN_MATCH]
        +cell regex
        +cell Regexes to match complex tokens, e.g. URLs.

    +row
        +cell #[code NORM_EXCEPTIONS]
        +cell dict
        +cell Keyed by strings, mapped to their norms.

    +row
        +cell #[code TOKENIZER_PREFIXES]
        +cell list
        +cell Strings or regexes, usually not customised.

    +row
        +cell #[code TOKENIZER_SUFFIXES]
        +cell list
        +cell Strings or regexes, usually not customised.

    +row
        +cell #[code TOKENIZER_INFIXES]
        +cell list
        +cell Strings or regexes, usually not customised.

    +row
        +cell #[code LEX_ATTRS]
        +cell dict
        +cell Attribute ID mapped to function.

    +row
        +cell #[code SYNTAX_ITERATORS]
        +cell dict
        +cell
            | Iterator ID mapped to function. Currently only supports
            | #[code 'noun_chunks'].

    +row
        +cell #[code LOOKUP]
        +cell dict
        +cell Keyed by strings, mapped to their lemmas.

    +row
        +cell #[code LEMMA_RULES], #[code LEMMA_INDEX], #[code LEMMA_EXC]
        +cell dict
        +cell Lemmatization rules, keyed by part of speech.

    +row
        +cell #[code TAG_MAP]
        +cell dict
        +cell
            | Keyed by strings, mapped to
            | #[+a("http://universaldependencies.org/u/pos/all.html") Universal Dependencies]
            | tags.

    +row
        +cell #[code MORPH_RULES]
        +cell dict
        +cell Keyed by strings, mapped to a dict of their morphological features.

+aside("Should I ever update the global data?")
    | Reusable language data is collected as atomic pieces in the root of the
    | #[+src(gh("spaCy", "lang")) #[code spacy.lang]] package. Often, when a new
    | language is added, you'll find a pattern or symbol that's missing. Even
    | if it isn't common in other languages, it might be best to add it to the
    | shared language data, unless it has some conflicting interpretation. For
    | instance, we don't expect to see guillemet quotation symbols
    | (#[code »] and #[code «]) in English text. But if we do see
    | them, we'd probably prefer the tokenizer to split them off.

+infobox("For languages with non-Latin characters")
    | In order for the tokenizer to split suffixes, prefixes and infixes, spaCy
    | needs to know the language's character set. If the language you're adding
    | uses non-Latin characters, you might need to add the required character
    | classes to the global
    | #[+src(gh("spaCy", "spacy/lang/char_classes.py")) #[code char_classes.py]].
    | spaCy uses the #[+a("https://pypi.python.org/pypi/regex/") #[code regex] library]
    | to keep this simple and readable. If the language requires very specific
    | punctuation rules, you should consider overwriting the default regular
    | expressions with your own in the language's #[code Defaults].

+h(3, "language-subclass") Creating a #[code Language] subclass

p
    | Language-specific code and resources should be organised into a
    | subpackage of spaCy, named according to the language's
    | #[+a("https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes") ISO code].
    | For instance, code and resources specific to Spanish are placed into a
    | directory #[code spacy/lang/es], which can be imported as
    | #[code spacy.lang.es].

p
    | To get started, you can use our
    | #[+src(gh("spacy-dev-resources", "templates/new_language")) templates]
    | for the most important files. Here's what the class template looks like:

+code("__init__.py (excerpt)").
    # import language-specific data
    from .stop_words import STOP_WORDS
    from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
    from .lex_attrs import LEX_ATTRS

    from ..tokenizer_exceptions import BASE_EXCEPTIONS
    from ...language import Language
    from ...attrs import LANG
    from ...util import update_exc

    # create Defaults class in the module scope (necessary for pickling!)
    class XxxxxDefaults(Language.Defaults):
        lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
        lex_attr_getters[LANG] = lambda text: 'xx' # language ISO code

        # optional: replace flags with custom functions, e.g. like_num()
        lex_attr_getters.update(LEX_ATTRS)

        # merge base exceptions and custom tokenizer exceptions
        tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
        stop_words = STOP_WORDS

    # create actual Language class
    class Xxxxx(Language):
        lang = 'xx' # language ISO code
        Defaults = XxxxxDefaults # override defaults

    # set default export – this allows the language class to be lazy-loaded
    __all__ = ['Xxxxx']

+infobox("Why lazy-loading?")
    | Some languages contain large volumes of custom data, like lemmatizer
    | lookup tables, or complex regular expressions that are expensive to
    | compute. As of spaCy v2.0, #[code Language] classes are not imported on
    | initialisation and are only loaded when you import them directly, or load
    | a model that requires a language to be loaded. To lazy-load languages in
    | your application, you can use the
    | #[+api("util#get_lang_class") #[code util.get_lang_class()]] helper
    | function with the two-letter language code as its argument.

+h(3, "stop-words") Stop words

p
    | A #[+a("https://en.wikipedia.org/wiki/Stop_words") "stop list"] is a
    | classic trick from the early days of information retrieval when search
    | was largely about keyword presence and absence. It is still sometimes
    | useful today to filter out common words from a bag-of-words model. To
    | improve readability, #[code STOP_WORDS] are separated by spaces and
    | newlines, and added as a multiline string.

+aside("What does spaCy consider a stop word?")
    | There's no particularly principled logic behind what words should be
    | added to the stop list. Make a list that you think might be useful
    | to people and is likely to be unsurprising. As a rule of thumb, words
    | that are very rare are unlikely to be useful stop words.

+code("Example").
    STOP_WORDS = set("""
    a about above across after afterwards again against all almost alone along
    already also although always am among amongst amount an and another any anyhow
    anyone anything anyway anywhere are around as at

    back be became because become becomes becoming been before beforehand behind
    being below beside besides between beyond both bottom but by
    """.split())

+infobox("Important note")
    | When adding stop words from an online source, always #[strong include the link]
    | in a comment. Make sure to #[strong proofread] and double-check the words
    | carefully. A lot of the lists available online have been passed around
    | for years and often contain mistakes, like unicode errors or random words
    | that were once added for a specific use case, but don't actually
    | qualify as stop words.

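p
    | Part of the proofreading can be automated. The sketch below is only
    | an illustration: #[code check_stop_words] is a hypothetical helper, not
    | part of spaCy, and the three checks (duplicates, non-lowercase entries,
    | the unicode replacement character) are assumptions about what is worth
    | flagging. It does not replace reading the list yourself.

```python
def check_stop_words(raw):
    """Split a multiline stop list and collect entries that look suspicious."""
    tokens = raw.split()
    problems = []
    seen = set()
    for token in tokens:
        if token in seen:
            problems.append("duplicate: %s" % token)
        seen.add(token)
        if token != token.lower():
            problems.append("not lowercase: %s" % token)
        if "\ufffd" in token:
            # U+FFFD usually means the list was decoded with the wrong encoding
            problems.append("replacement character: %r" % token)
    return set(tokens), problems


RAW = """
a about above across
after About after
"""

STOP_WORDS, problems = check_stop_words(RAW)
print(sorted(STOP_WORDS))
print(problems)
```
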
+h(3, "tokenizer-exceptions") Tokenizer exceptions

p
    | spaCy's #[+a("/usage/linguistic-features#how-tokenizer-works") tokenization algorithm]
    | lets you deal with whitespace-delimited chunks separately. This makes it
    | easy to define special-case rules, without worrying about how they
    | interact with the rest of the tokenizer. Whenever the key string is
    | matched, the special-case rule is applied, giving the defined sequence of
    | tokens. You can also attach attributes to the subtokens covered by your
    | special case, such as the subtokens' #[code LEMMA] or #[code TAG].

p
    | Tokenizer exceptions can be added in the following format:

+code("tokenizer_exceptions.py (excerpt)").
    TOKENIZER_EXCEPTIONS = {
        "don't": [
            {ORTH: "do", LEMMA: "do"},
            {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}]
    }

+infobox("Important note")
    | If an exception consists of more than one token, the #[code ORTH] values
    | combined always need to #[strong match the original string]. The way the
    | original string is split up can be pretty arbitrary sometimes – for
    | example "gonna" is split into "gon" (lemma "go") and "na" (lemma "to").
    | Because of how the tokenizer works, it's currently not possible to split
    | single-letter strings into multiple tokens.

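p
    | The rule that the combined #[code ORTH] values must rebuild the key can
    | be checked mechanically. The following is a minimal sketch under stated
    | assumptions: plain strings stand in for spaCy's #[code ORTH] and
    | #[code LEMMA] symbols so it runs without spaCy, and
    | #[code find_orth_mismatches] is a hypothetical helper. The second entry
    | is deliberately wrong to show what gets reported.

```python
# String stand-ins for spaCy's ORTH and LEMMA symbols (assumption).
ORTH, LEMMA = "ORTH", "LEMMA"


def find_orth_mismatches(exceptions):
    """Return (key, rebuilt) pairs where the ORTH values don't rebuild the key."""
    mismatches = []
    for key, subtokens in exceptions.items():
        rebuilt = "".join(token[ORTH] for token in subtokens)
        if rebuilt != key:
            mismatches.append((key, rebuilt))
    return mismatches


exceptions = {
    # "gon" + "na" == "gonna", so this entry is fine
    "gonna": [{ORTH: "gon", LEMMA: "go"}, {ORTH: "na", LEMMA: "to"}],
    # "do" + "not" != "don't", so this entry is reported
    "don't": [{ORTH: "do", LEMMA: "do"}, {ORTH: "not", LEMMA: "not"}],
}
print(find_orth_mismatches(exceptions))
```
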
p
    | Unambiguous abbreviations, like month names or locations in English,
    | should be added to exceptions with a lemma assigned, for example
    | #[code {ORTH: "Jan.", LEMMA: "January"}]. Since the exceptions are
    | added in Python, you can use custom logic to generate them more
    | efficiently and make your data less verbose. How you do this ultimately
    | depends on the language. Here's an example of how exceptions for time
    | formats like "1a.m." and "1am" are generated in the English
    | #[+src(gh("spaCy", "spacy/lang/en/tokenizer_exceptions.py")) #[code tokenizer_exceptions.py]]:

+code("tokenizer_exceptions.py (excerpt)").
    # use short, internal variable for readability
    _exc = {}

    for h in range(1, 12 + 1):
        for period in ["a.m.", "am"]:
            # always keep an eye on string interpolation!
            _exc["%d%s" % (h, period)] = [
                {ORTH: "%d" % h},
                {ORTH: period, LEMMA: "a.m."}]
        for period in ["p.m.", "pm"]:
            _exc["%d%s" % (h, period)] = [
                {ORTH: "%d" % h},
                {ORTH: period, LEMMA: "p.m."}]

    # only declare this at the bottom
    TOKENIZER_EXCEPTIONS = _exc

+aside("Generating tokenizer exceptions")
    | Keep in mind that generating exceptions only makes sense if there's a
    | clearly defined and #[strong finite number] of them, like common
    | contractions in English. This is not always the case – in Spanish, for
    | instance, infinitive or imperative reflexive verbs and pronouns are one
    | token (e.g. "vestirme"). In cases like this, spaCy shouldn't be
    | generating exceptions for #[em all verbs]. Instead, this will be handled
    | at a later stage during lemmatization.

p
    | When adding the tokenizer exceptions to the #[code Defaults], you can use
    | the #[+api("util#update_exc") #[code update_exc()]] helper function to merge
    | them with the global base exceptions (including one-letter abbreviations
    | and emoticons). The function performs a basic check to make sure
    | exceptions are provided in the correct format. It can take any number of
    | exceptions dicts as its arguments, and will update and overwrite the
    | exceptions in this order. For example, if your language's tokenizer
    | exceptions include a custom tokenization pattern for "a.", it will
    | overwrite the base exceptions with the language's custom one.

+code("Example").
    from ...util import update_exc

    BASE_EXCEPTIONS = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
    TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", LEMMA: "all"}]}

    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    # {"a.": [{ORTH: "a.", LEMMA: "all"}], ":)": [{ORTH: ":)"}]}

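p
    | To make the overwrite order concrete without importing spaCy, here is a
    | simplified stand-in for the merge step. It is a sketch, not the library
    | function: #[code merge_exceptions] is a hypothetical name, the format
    | check is left out, and strings replace the #[code ORTH] and
    | #[code LEMMA] symbols.

```python
# String stand-ins for spaCy's ORTH and LEMMA symbols (assumption).
ORTH, LEMMA = "ORTH", "LEMMA"


def merge_exceptions(base, *additions):
    """Merge exception dicts left to right; later dicts win on shared keys."""
    merged = dict(base)  # copy, so the shared base data isn't mutated
    for addition in additions:
        merged.update(addition)
    return merged


base = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
custom = {"a.": [{ORTH: "a.", LEMMA: "all"}]}

merged = merge_exceptions(base, custom)
print(merged["a."])  # the language-specific entry
print(base["a."])    # the base entry is left untouched
```

p
    | Copying the base dict first matters because the base exceptions are
    | shared by every language that merges them.
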
+infobox("About spaCy's custom pronoun lemma")
    | Unlike verbs and common nouns, there's no clear base form of a personal
    | pronoun. Should the lemma of "me" be "I", or should we normalize person
    | as well, giving "it" – or maybe "he"? spaCy's solution is to introduce a
    | novel symbol, #[code.u-nowrap -PRON-], which is used as the lemma for
    | all personal pronouns.

+h(3, "norm-exceptions") Norm exceptions
    +tag-new(2)

p
    | In addition to #[code ORTH] or #[code LEMMA], tokenizer exceptions can
    | also set a #[code NORM] attribute. This is useful to specify a normalised
    | version of the token – for example, the norm of "n't" is "not". By default,
    | a token's norm equals its lowercase text. If the lowercase spelling of a
    | word exists, norms should always be in lowercase.

+aside-code("Norms vs. lemmas").
    doc = nlp(u"I'm gonna realise")
    norms = [token.norm_ for token in doc]
    lemmas = [token.lemma_ for token in doc]
    assert norms == ['i', 'am', 'going', 'to', 'realize']
    assert lemmas == ['i', 'be', 'go', 'to', 'realise']

p
    | spaCy usually tries to normalise words with different spellings to a single,
    | common spelling. This has no effect on any other token attributes, or
    | tokenization in general, but it ensures that
    | #[strong equivalent tokens receive similar representations]. This can
    | improve the model's predictions on words that weren't common in the
    | training data, but are equivalent to other words – for example, "realize"
    | and "realise", or "thx" and "thanks".

p
    | Similarly, spaCy also includes
    | #[+src(gh("spaCy", "spacy/lang/norm_exceptions.py")) global base norms]
    | for normalising different styles of quotation marks and currency
    | symbols. Even though #[code $] and #[code €] are very different, spaCy
    | normalises them both to #[code $]. This way, they'll always be seen as
    | similar, no matter how common they were in the training data.

p
    | Norm exceptions can be provided as a simple dictionary. For more examples,
    | see the English
    | #[+src(gh("spaCy", "spacy/lang/en/norm_exceptions.py")) #[code norm_exceptions.py]].

+code("Example").
    NORM_EXCEPTIONS = {
        "cos": "because",
        "fav": "favorite",
        "accessorise": "accessorize",
        "accessorised": "accessorized"
    }

p
    | To add the custom norm exceptions lookup table, you can use the
    | #[code add_lookups()] helper function. It takes the default attribute
    | getter function as its first argument, plus a variable list of
    | dictionaries. If a string's norm is found in one of the dictionaries,
    | that value is used – otherwise, the default function is called and the
    | token is assigned its default norm.

+code.
    lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM],
                                         NORM_EXCEPTIONS, BASE_NORMS)

p
    | The order of the dictionaries is also the lookup order – so if your
    | language's norm exceptions overwrite any of the global exceptions, they
    | should be added first. Also note that the tokenizer exceptions will
    | always have priority over the attribute getters.

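p
    | The lookup order can be illustrated with a small stand-in for the
    | helper. This is a sketch of the behaviour described above, not the
    | library code: #[code add_lookups_sketch] is a hypothetical name, the
    | default getter is assumed to be plain lowercasing, and the
    | #[code "€"] entry in the language-specific table is invented to show a
    | conflict with the base norms.

```python
def add_lookups_sketch(default_func, *lookups):
    """Return a getter that tries each lookup dict in order, then the default."""
    def get_attr(string):
        for lookup in lookups:
            if string in lookup:
                return lookup[string]
        return default_func(string)
    return get_attr


NORM_EXCEPTIONS = {"cos": "because", "€": "euro"}  # "€" entry is hypothetical
BASE_NORMS = {"€": "$"}

# Language-specific table first, so it wins over the base norms.
get_norm = add_lookups_sketch(lambda s: s.lower(), NORM_EXCEPTIONS, BASE_NORMS)

print(get_norm("cos"))    # found in NORM_EXCEPTIONS
print(get_norm("€"))      # in both tables; the first table is used
print(get_norm("Hello"))  # in neither table; falls back to the default
```
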
+h(3, "lex-attrs") Lexical attributes
    +tag-new(2)

p
    | spaCy provides a range of #[+api("token#attributes") #[code Token] attributes]
    | that return useful information on that token – for example, whether it's
    | uppercase or lowercase, a left or right punctuation mark, or whether it
    | resembles a number or email address. Most of these functions, like
    | #[code is_lower] or #[code like_url], should be language-independent.
    | Others, like #[code like_num] (which includes both digits and number
    | words), require some customisation.

+aside("Best practices")
    | Keep in mind that these functions are only intended to be an approximation.
    | It's always better to prioritise simplicity and performance over covering
    | very specific edge cases.#[br]#[br]
    | English number words are pretty simple, because even large numbers
    | consist of individual tokens, and we can get away with splitting and
    | matching strings against a list. In other languages, like German, "two
    | hundred and thirty-four" is one word, and thus one token. Here, it's best
    | to match a string against a list of number word fragments (instead of a
    | technically almost infinite list of possible number words).

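p
    | As a rough sketch of the fragment-based approach mentioned in the aside,
    | a compound number word can be accepted if it splits completely into
    | known fragments. The fragment list, the connector #[code "und"] and the
    | function name #[code like_num_compound] are illustrative assumptions,
    | not spaCy's German language data.

```python
import re

# Hypothetical, deliberately incomplete list of German number fragments.
_num_fragments = ["ein", "eins", "zwei", "drei", "vier", "fünf", "sechs",
                  "sieben", "acht", "neun", "zehn", "elf", "zwölf",
                  "zwanzig", "dreißig", "vierzig", "hundert", "tausend"]

# Longer fragments go first so that e.g. "eins" is tried before "ein".
_fragment = "(?:" + "|".join(sorted(_num_fragments, key=len, reverse=True)) + ")"

# One fragment, followed by any number of further fragments that may each
# be joined with the connector "und".
_num_word = re.compile(_fragment + "(?:(?:und)?" + _fragment + ")*")


def like_num_compound(text):
    if text.isdigit():
        return True
    # The whole string has to be covered by fragments, not just a prefix.
    return _num_word.fullmatch(text.lower()) is not None


print(like_num_compound("zweihundertvierunddreißig"))
print(like_num_compound("Hundertwasser"))
```

p
    | Requiring a full match keeps words that merely start with a number
    | fragment from being treated as numbers.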
p
    | Here's an example from the English
    | #[+src(gh("spaCy", "spacy/lang/en/lex_attrs.py")) #[code lex_attrs.py]]:

+code("lex_attrs.py").
    from ...attrs import LIKE_NUM

    _num_words = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
                  'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
                  'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty',
                  'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety',
                  'hundred', 'thousand', 'million', 'billion', 'trillion', 'quadrillion',
                  'gajillion', 'bazillion']

    def like_num(text):
        # Drop separators so that "10,000" and "3.14" count as digits
        text = text.replace(',', '').replace('.', '')
        if text.isdigit():
            return True
        # Simple fractions like "1/2"
        if text.count('/') == 1:
            num, denom = text.split('/')
            if num.isdigit() and denom.isdigit():
                return True
        # Number words, matched case-insensitively
        if text.lower() in _num_words:
            return True
        return False

    LEX_ATTRS = {
        LIKE_NUM: like_num
    }

p
    | The custom #[code LEX_ATTRS] dictionary is merged into the default
    | lexical attributes in the language's defaults via
    | #[code lex_attr_getters.update(LEX_ATTRS)], so only the functions you
    | define are overwritten and all other defaults are kept.

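p
    | The effect of the update can be shown with plain dictionaries. The
    | attribute IDs and default getters below are hypothetical stand-ins, not
    | spaCy's real symbols or defaults.

```python
# Hypothetical stand-ins for the attribute IDs.
LIKE_NUM, IS_LOWER, LIKE_URL = "like_num", "is_lower", "like_url"


def default_like_num(text):
    return text.isdigit()


def custom_like_num(text):
    return text.isdigit() or text.lower() in ("one", "two", "three")


# Stand-in for the shared, language-independent getters.
default_getters = {
    LIKE_NUM: default_like_num,
    IS_LOWER: lambda text: text.islower(),
    LIKE_URL: lambda text: text.startswith("http"),
}

LEX_ATTRS = {LIKE_NUM: custom_like_num}

# Copy first so the shared defaults are not mutated, then merge.
lex_attr_getters = dict(default_getters)
lex_attr_getters.update(LEX_ATTRS)

print(lex_attr_getters[LIKE_NUM]("two"))    # uses the custom function
print(lex_attr_getters[IS_LOWER]("hello"))  # still the default getter
```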
+h(3, "syntax-iterators") Syntax iterators

p
    | Syntax iterators are functions that compute views of a #[code Doc]
    | object based on its syntax. At the moment, this data is only used for
    | extracting
    | #[+a("/usage/linguistic-features#noun-chunks") noun chunks], which
    | are available as the #[+api("doc#noun_chunks") #[code Doc.noun_chunks]]
    | property. Because base noun phrases work differently across languages,
    | the rules to compute them are part of the individual language's data. If
    | a language does not include a noun chunks iterator, the property won't
    | be available. For examples, see the existing syntax iterators:

+aside-code("Noun chunks example").
    doc = nlp(u'A phrase with another phrase occurs.')
    chunks = list(doc.noun_chunks)
    assert chunks[0].text == "A phrase"
    assert chunks[1].text == "another phrase"

+table(["Language", "Code", "Source"])
    for lang in ["en", "de", "fr", "es"]
        +row
            +cell=LANGUAGES[lang]
            +cell #[code=lang]
            +cell
                +src(gh("spaCy", "spacy/lang/" + lang + "/syntax_iterators.py"))
                    code lang/#{lang}/syntax_iterators.py

+h(3, "lemmatizer") Lemmatizer
    +tag-new(2)

p
    | As of v2.0, spaCy supports simple lookup-based lemmatization. This is
    | usually the quickest and easiest way to get started. The data is stored
    | in a dictionary mapping a string to its lemma. To determine a token's
    | lemma, spaCy simply looks it up in the table. Here's an example from
    | the Spanish language data:

+code("lang/es/lemmatizer.py (excerpt)").
    LOOKUP = {
        "aba": "abar",
        "ababa": "abar",
        "ababais": "abar",
        "ababan": "abar",
        "ababanes": "ababán",
        "ababas": "abar",
        "ababoles": "ababol",
        "ababábites": "ababábite"
    }

p
    | To provide a lookup lemmatizer for your language, import the lookup table
    | and add it to the #[code Language] class as #[code lemma_lookup]:

+code.
    lemma_lookup = dict(LOOKUP)

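p
    | Conceptually, a lookup lemmatizer is a single dictionary access. The
    | sketch below uses a toy table and a hypothetical helper
    | #[code lookup_lemma]. It assumes that strings without an entry are
    | returned unchanged, which is an assumption of this example rather than
    | something stated above.

```python
# Toy table in the same shape as the LOOKUP excerpt.
LOOKUP = {
    "ababa": "abar",
    "ababanes": "ababán",
    "ababoles": "ababol",
}


def lookup_lemma(string, table):
    # Assumption: fall back to the string itself if there is no entry.
    return table.get(string, string)


print(lookup_lemma("ababoles", LOOKUP))
print(lookup_lemma("gato", LOOKUP))
```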
+h(3, "tag-map") Tag map

p
    | Most treebanks define a custom part-of-speech tag scheme, striking a
    | balance between level of detail and ease of prediction. While it's
    | useful to have custom tagging schemes, it's also useful to have a common
    | scheme, to which the more specific tags can be related. The tagger can
    | learn a tag scheme with any arbitrary symbols. However, you need to
    | define how those symbols map down to the
    | #[+a("http://universaldependencies.org/u/pos/all.html") Universal Dependencies tag set].
    | This is done by providing a tag map.

p
    | The keys of the tag map should be #[strong strings in your tag set]. Each
    | value should be a dictionary with an entry #[code POS] whose value is
    | one of the
    | #[+a("http://universaldependencies.org/u/pos/all.html") Universal Dependencies]
    | tags. Optionally, you can also include morphological features or other
    | token attributes in the tag map. This allows you to do simple
    | #[+a("/usage/linguistic-features#rule-based-morphology") rule-based morphological analysis].

+code("Example").
    from ..symbols import POS, NOUN, VERB, DET

    TAG_MAP = {
        "NNS": {POS: NOUN, "Number": "plur"},
        "VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
        "DT": {POS: DET}
    }

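p
    | When writing a tag map by hand, a small check can catch entries that
    | lack a #[code POS] value. This is a sketch only: #[code check_tag_map]
    | is a hypothetical helper, the symbols are replaced by plain strings so
    | the example runs without spaCy, and the tag set is the list of 17
    | universal part-of-speech tags, which may differ from the symbols spaCy
    | itself accepts.

```python
POS = "pos"  # hypothetical stand-in for the POS symbol

# The 17 Universal Dependencies part-of-speech tags.
UD_POS_TAGS = {"ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN",
               "NUM", "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM",
               "VERB", "X"}

TAG_MAP = {
    "NNS": {POS: "NOUN", "Number": "plur"},
    "VBG": {POS: "VERB", "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
    "DT": {POS: "DET"},
}


def check_tag_map(tag_map):
    """Return a list of (tag, problem) tuples; empty if nothing was found."""
    problems = []
    for tag, attrs in tag_map.items():
        if POS not in attrs:
            problems.append((tag, "missing POS"))
        elif attrs[POS] not in UD_POS_TAGS:
            problems.append((tag, "unknown POS: %s" % attrs[POS]))
    return problems


print(check_tag_map(TAG_MAP))
print(check_tag_map({"XX": {"Number": "sing"}}))
```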
+h(3, "morph-rules") Morph rules

p
    | The morphology rules let you set token attributes such as lemmas, keyed
    | by the extended part-of-speech tag and token text. The morphological
    | features and their possible values are language-specific and based on the
    | #[+a("http://universaldependencies.org") Universal Dependencies scheme].

+code("Example").
    from ..symbols import LEMMA

    MORPH_RULES = {
        "VBZ": {
            "am": {LEMMA: "be", "VerbForm": "Fin", "Person": "One", "Tense": "Pres", "Mood": "Ind"},
            "are": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
            "is": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"},
            "'re": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
            "'s": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"}
        }
    }

p
    | In the example of #[code "am"], the attributes look like this:

+table(["Attribute", "Description"])
    +row
        +cell #[code LEMMA: "be"]
        +cell Base form, e.g. "to be".

    +row
        +cell #[code "VerbForm": "Fin"]
        +cell
            | Finite verb. Finite verbs have a subject and can be the root of
            | an independent clause – "I am." is a valid, complete
            | sentence.

    +row
        +cell #[code "Person": "One"]
        +cell First person, i.e. "#[strong I] am".

    +row
        +cell #[code "Tense": "Pres"]
        +cell
            | Present tense, i.e. actions that are happening right now or
            | actions that usually happen.

    +row
        +cell #[code "Mood": "Ind"]
        +cell
            | Indicative, i.e. something happens, has happened or will happen
            | (as opposed to imperative or conditional).

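p
    | Since the rules are keyed first by tag and then by token text, finding
    | the attributes for a token amounts to two nested dictionary lookups.
    | The helper #[code get_morph_exception] and the string stand-in for
    | #[code LEMMA] are assumptions for this sketch. Whether spaCy normalises
    | the token text before the lookup is not covered here, so the text is
    | matched exactly.

```python
LEMMA = "lemma"  # hypothetical stand-in for the LEMMA symbol

MORPH_RULES = {
    "VBZ": {
        "is": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three",
               "Tense": "Pres", "Mood": "Ind"},
        "'s": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three",
               "Tense": "Pres", "Mood": "Ind"},
    }
}


def get_morph_exception(rules, tag, text):
    # First level: the fine-grained tag. Second level: the exact token text.
    # Returns None if either level has no entry.
    return rules.get(tag, {}).get(text)


attrs = get_morph_exception(MORPH_RULES, "VBZ", "'s")
print(attrs[LEMMA] if attrs else None)
print(get_morph_exception(MORPH_RULES, "VBZ", "runs"))
```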
+infobox("Important note", "⚠️")
    | The morphological attributes are currently #[strong not all used by spaCy].
    | Full integration is still being developed. In the meantime, it can still
    | be useful to add them, especially if the language you're adding includes
    | important distinctions and special cases. This ensures that as soon as
    | full support is introduced, your language will be able to assign all
    | possible attributes.