Merge 250f6c7b35 into 41e07772dc

commit be18d1c864
@@ -194,7 +194,7 @@ model = chain(
 )
 ```
 
-but the standalone `Tok2VecTransformer` has an additional `split_trf_batch` chained inbetween the model
+but the standalone `Tok2VecTransformer` has an additional `split_trf_batch` chained in between the model
 and `trfs2arrays`:
 
 ```
@@ -6,7 +6,7 @@ This is a list of all the active repos relevant to spaCy besides the main one, w
 
 These packages are always pulled in when you install spaCy. Most of them are direct dependencies, but some are transitive dependencies through other packages.
 
-- [spacy-legacy](https://github.com/explosion/spacy-legacy): When an architecture in spaCy changes enough to get a new version, the old version is frozen and moved to spacy-legacy. This allows us to keep the core library slim while also preserving backwards compatability.
+- [spacy-legacy](https://github.com/explosion/spacy-legacy): When an architecture in spaCy changes enough to get a new version, the old version is frozen and moved to spacy-legacy. This allows us to keep the core library slim while also preserving backwards compatibility.
 - [thinc](https://github.com/explosion/thinc): Thinc is the machine learning library that powers trainable components in spaCy. It wraps backends like Numpy, PyTorch, and Tensorflow to provide a functional interface for specifying architectures.
 - [catalogue](https://github.com/explosion/catalogue): Small library for adding function registries, like those used for model architectures in spaCy.
 - [confection](https://github.com/explosion/confection): This library contains the functionality for config parsing that was formerly contained directly in Thinc.
@@ -67,7 +67,7 @@ These repos are used to support the spaCy docs or otherwise present information
 
 These repos are used for organizing data around spaCy, but are not something an end user would need to install as part of using the library.
 
-- [spacy-models](https://github.com/explosion/spacy-models): This repo contains metadata (but not training data) for all the spaCy models. This includes information about where their training data came from, version compatability, and performance information. It also includes tests for the model packages, and the built models are hosted as releases of this repo.
+- [spacy-models](https://github.com/explosion/spacy-models): This repo contains metadata (but not training data) for all the spaCy models. This includes information about where their training data came from, version compatibility, and performance information. It also includes tests for the model packages, and the built models are hosted as releases of this repo.
 - [wheelwright](https://github.com/explosion/wheelwright): A tool for automating our PyPI builds and releases.
 - [ec2buildwheel](https://github.com/explosion/ec2buildwheel): A small project that allows you to build Python packages in the manner of cibuildwheel, but on any EC2 image. Used by wheelwright.
 
@@ -145,7 +145,7 @@ These are things stored in the vocab:
 - `get_noun_chunks`: a syntax iterator
 - lex attribute getters: functions like `is_punct`, set in language defaults
 - `cfg`: **not** the pipeline config, this is mostly unused
-- `_unused_object`: Formerly an unused object, kept around until v4 for compatability
+- `_unused_object`: Formerly an unused object, kept around until v4 for compatibility
 
 Some of these, like the Morphology and Vectors, are complex enough that they
 need their own explanations. Here we'll just look at Vocab-specific items.
@@ -34,7 +34,7 @@ CONDITIONS.
 Collection will not be considered an Adaptation for the purpose of
 this License. For the avoidance of doubt, where the Work is a musical
 work, performance or phonogram, the synchronization of the Work in
-timed-relation with a moving image ("synching") will be considered an
+timed-relation with a moving image ("syncing") will be considered an
 Adaptation for the purpose of this License.
 b. "Collection" means a collection of literary or artistic works, such as
 encyclopedias and anthologies, or performances, phonograms or
@@ -264,7 +264,7 @@ subject to and limited by the following restrictions:
 UNLESS OTHERWISE MUTUALLY AGREED TO BY THE PARTIES IN WRITING, LICENSOR
 OFFERS THE WORK AS-IS AND MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY
 KIND CONCERNING THE WORK, EXPRESS, IMPLIED, STATUTORY OR OTHERWISE,
-INCLUDING, WITHOUT LIMITATION, WARRANTIES OF TITLE, MERCHANTIBILITY,
+INCLUDING, WITHOUT LIMITATION, WARRANTIES OF TITLE, MERCHANTABILITY,
 FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF
 LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OF ABSENCE OF ERRORS,
 WHETHER OR NOT DISCOVERABLE. SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION
@@ -99,7 +99,7 @@ def parse_config_overrides(
     RETURNS (Dict[str, Any]): The parsed dict, keyed by nested config setting.
     """
     env_string = os.environ.get(env_var, "") if env_var else ""
-    env_overrides = _parse_overrides(split_arg_string(env_string))
+    env_overrides = _parse_overrides(split_arg_string(env_string)) # type: ignore[operator]
     cli_overrides = _parse_overrides(args, is_cli=True)
     if cli_overrides:
         keys = [k for k in cli_overrides if k not in env_overrides]
@@ -84,7 +84,7 @@ def info(
 
 
 def info_spacy() -> Dict[str, Any]:
-    """Generate info about the current spaCy intallation.
+    """Generate info about the current spaCy installation.
 
     RETURNS (dict): The spaCy info.
     """
@@ -354,7 +354,7 @@ GLOSSARY = {
     # https://github.com/ltgoslo/norne
     "EVT": "Festivals, cultural events, sports events, weather phenomena, wars, etc.",
     "PROD": "Product, i.e. artificially produced entities including speeches, radio shows, programming languages, contracts, laws and ideas",
-    "DRV": "Words (and phrases?) that are dervied from a name, but not a name in themselves, e.g. 'Oslo-mannen' ('the man from Oslo')",
+    "DRV": "Words (and phrases?) that are derived from a name, but not a name in themselves, e.g. 'Oslo-mannen' ('the man from Oslo')",
     "GPE_LOC": "Geo-political entity, with a locative sense, e.g. 'John lives in Spain'",
     "GPE_ORG": "Geo-political entity, with an organisation sense, e.g. 'Spain declined to meet with Belgium'",
 }
@@ -5,11 +5,11 @@ from thinc.api import Model
 from ...language import BaseDefaults, Language
 from .lemmatizer import HaitianCreoleLemmatizer
 from .lex_attrs import LEX_ATTRS
-from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
+from .punctuation import TOKENIZER_INFIXES, TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
 from .stop_words import STOP_WORDS
 from .syntax_iterators import SYNTAX_ITERATORS
-from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .tag_map import TAG_MAP
+from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 
 
 class HaitianCreoleDefaults(BaseDefaults):
@@ -22,10 +22,12 @@ class HaitianCreoleDefaults(BaseDefaults):
     stop_words = STOP_WORDS
     tag_map = TAG_MAP
 
+
 class HaitianCreole(Language):
     lang = "ht"
     Defaults = HaitianCreoleDefaults
 
+
 @HaitianCreole.factory(
     "lemmatizer",
     assigns=["token.lemma"],
@@ -49,4 +51,5 @@ def make_lemmatizer(
         nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
     )
 
+
 __all__ = ["HaitianCreole"]
@@ -1,8 +1,8 @@
 from typing import List, Tuple
 
+from ...lookups import Lookups
 from ...pipeline import Lemmatizer
 from ...tokens import Token
-from ...lookups import Lookups
 
 
 class HaitianCreoleLemmatizer(Lemmatizer):
@@ -49,6 +49,7 @@ NORM_MAP = {
     "P": "Pa",
 }
 
+
 def like_num(text):
     text = text.strip().lower()
     if text.startswith(("+", "-", "±", "~")):
@@ -69,9 +70,11 @@ def like_num(text):
         return True
     return False
 
+
 def norm_custom(text):
     return NORM_MAP.get(text, text.lower())
 
+
 LEX_ATTRS = {
     LIKE_NUM: like_num,
     NORM: norm_custom,
@@ -4,10 +4,10 @@ from ..char_classes import (
     ALPHA_UPPER,
     CONCAT_QUOTES,
     HYPHENS,
-    LIST_PUNCT,
-    LIST_QUOTES,
     LIST_ELLIPSES,
     LIST_ICONS,
+    LIST_PUNCT,
+    LIST_QUOTES,
     merge_chars,
 )
 
@@ -16,28 +16,43 @@ ELISION = "'’".replace(" ", "")
 _prefixes_elision = "m n l y t k w"
 _prefixes_elision += " " + _prefixes_elision.upper()
 
-TOKENIZER_PREFIXES = LIST_PUNCT + LIST_QUOTES + [
-    r"(?:({pe})[{el}])(?=[{a}])".format(
-        a=ALPHA, el=ELISION, pe=merge_chars(_prefixes_elision)
-    )
-]
+TOKENIZER_PREFIXES = (
+    LIST_PUNCT
+    + LIST_QUOTES
+    + [
+        r"(?:({pe})[{el}])(?=[{a}])".format(
+            a=ALPHA, el=ELISION, pe=merge_chars(_prefixes_elision)
+        )
+    ]
+)
 
-TOKENIZER_SUFFIXES = LIST_PUNCT + LIST_QUOTES + LIST_ELLIPSES + [
-    r"(?<=[0-9])%", # numbers like 10%
-    r"(?<=[0-9])(?:{h})".format(h=HYPHENS), # hyphens after numbers
-    r"(?<=[{a}])['’]".format(a=ALPHA), # apostrophes after letters
-    r"(?<=[{a}])['’][mwlnytk](?=\s|$)".format(a=ALPHA), # contractions
-    r"(?<=[{a}0-9])\)", # right parenthesis after letter/number
-    r"(?<=[{a}])\.(?=\s|$)".format(a=ALPHA), # period after letter if space or end of string
-    r"(?<=\))[\.\?!]", # punctuation immediately after right parenthesis
-]
+TOKENIZER_SUFFIXES = (
+    LIST_PUNCT
+    + LIST_QUOTES
+    + LIST_ELLIPSES
+    + [
+        r"(?<=[0-9])%", # numbers like 10%
+        r"(?<=[0-9])(?:{h})".format(h=HYPHENS), # hyphens after numbers
+        r"(?<=[{a}])['’]".format(a=ALPHA), # apostrophes after letters
+        r"(?<=[{a}])['’][mwlnytk](?=\s|$)".format(a=ALPHA), # contractions
+        r"(?<=[{a}0-9])\)", # right parenthesis after letter/number
+        r"(?<=[{a}])\.(?=\s|$)".format(
+            a=ALPHA
+        ), # period after letter if space or end of string
+        r"(?<=\))[\.\?!]", # punctuation immediately after right parenthesis
+    ]
+)
 
-TOKENIZER_INFIXES = LIST_ELLIPSES + LIST_ICONS + [
-    r"(?<=[0-9])[+\-\*^](?=[0-9-])",
-    r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
-        al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
-    ),
-    r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
-    r"(?<=[{a}0-9])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
-    r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION),
-]
+TOKENIZER_INFIXES = (
+    LIST_ELLIPSES
+    + LIST_ICONS
+    + [
+        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
+        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
+            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
+        ),
+        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
+        r"(?<=[{a}0-9])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
+        r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION),
+    ]
+)
@@ -39,8 +39,7 @@ sa san si swa si
 
 men mèsi oswa osinon
 
-"""
-.split()
+""".split()
 )
 
 # Add common contractions, with and without apostrophe variants
@@ -1,4 +1,22 @@
-from spacy.symbols import NOUN, VERB, AUX, ADJ, ADV, PRON, DET, ADP, SCONJ, CCONJ, PART, INTJ, NUM, PROPN, PUNCT, SYM, X
+from spacy.symbols import (
+    ADJ,
+    ADP,
+    ADV,
+    AUX,
+    CCONJ,
+    DET,
+    INTJ,
+    NOUN,
+    NUM,
+    PART,
+    PRON,
+    PROPN,
+    PUNCT,
+    SCONJ,
+    SYM,
+    VERB,
+    X,
+)
 
 TAG_MAP = {
     "NOUN": {"pos": NOUN},
@@ -1,4 +1,5 @@
-from spacy.symbols import ORTH, NORM
+from spacy.symbols import NORM, ORTH
 
+
 def make_variants(base, first_norm, second_orth, second_norm):
     return {
@@ -7,14 +8,16 @@ def make_variants(base, first_norm, second_orth, second_norm):
             {ORTH: second_orth, NORM: second_norm},
         ],
         base.capitalize(): [
-            {ORTH: base.split("'")[0].capitalize() + "'", NORM: first_norm.capitalize()},
+            {
+                ORTH: base.split("'")[0].capitalize() + "'",
+                NORM: first_norm.capitalize(),
+            },
             {ORTH: second_orth, NORM: second_norm},
-        ]
+        ],
     }
 
-TOKENIZER_EXCEPTIONS = {
-    "Dr.": [{ORTH: "Dr."}]
-}
+
+TOKENIZER_EXCEPTIONS = {"Dr.": [{ORTH: "Dr."}]}
 
 # Apostrophe forms
 TOKENIZER_EXCEPTIONS.update(make_variants("m'ap", "mwen", "ap", "ap"))
@@ -29,93 +32,95 @@ TOKENIZER_EXCEPTIONS.update(make_variants("p'ap", "pa", "ap", "ap"))
 TOKENIZER_EXCEPTIONS.update(make_variants("t'ap", "te", "ap", "ap"))
 
 # Non-apostrophe contractions (with capitalized variants)
-TOKENIZER_EXCEPTIONS.update({
-    "map": [
-        {ORTH: "m", NORM: "mwen"},
-        {ORTH: "ap", NORM: "ap"},
-    ],
-    "Map": [
-        {ORTH: "M", NORM: "Mwen"},
-        {ORTH: "ap", NORM: "ap"},
-    ],
-    "lem": [
-        {ORTH: "le", NORM: "le"},
-        {ORTH: "m", NORM: "mwen"},
-    ],
-    "Lem": [
-        {ORTH: "Le", NORM: "Le"},
-        {ORTH: "m", NORM: "mwen"},
-    ],
-    "lew": [
-        {ORTH: "le", NORM: "le"},
-        {ORTH: "w", NORM: "ou"},
-    ],
-    "Lew": [
-        {ORTH: "Le", NORM: "Le"},
-        {ORTH: "w", NORM: "ou"},
-    ],
-    "nap": [
-        {ORTH: "n", NORM: "nou"},
-        {ORTH: "ap", NORM: "ap"},
-    ],
-    "Nap": [
-        {ORTH: "N", NORM: "Nou"},
-        {ORTH: "ap", NORM: "ap"},
-    ],
-    "lap": [
-        {ORTH: "l", NORM: "li"},
-        {ORTH: "ap", NORM: "ap"},
-    ],
-    "Lap": [
-        {ORTH: "L", NORM: "Li"},
-        {ORTH: "ap", NORM: "ap"},
-    ],
-    "yap": [
-        {ORTH: "y", NORM: "yo"},
-        {ORTH: "ap", NORM: "ap"},
-    ],
-    "Yap": [
-        {ORTH: "Y", NORM: "Yo"},
-        {ORTH: "ap", NORM: "ap"},
-    ],
-    "mte": [
-        {ORTH: "m", NORM: "mwen"},
-        {ORTH: "te", NORM: "te"},
-    ],
-    "Mte": [
-        {ORTH: "M", NORM: "Mwen"},
-        {ORTH: "te", NORM: "te"},
-    ],
-    "mpral": [
-        {ORTH: "m", NORM: "mwen"},
-        {ORTH: "pral", NORM: "pral"},
-    ],
-    "Mpral": [
-        {ORTH: "M", NORM: "Mwen"},
-        {ORTH: "pral", NORM: "pral"},
-    ],
-    "wap": [
-        {ORTH: "w", NORM: "ou"},
-        {ORTH: "ap", NORM: "ap"},
-    ],
-    "Wap": [
-        {ORTH: "W", NORM: "Ou"},
-        {ORTH: "ap", NORM: "ap"},
-    ],
-    "kap": [
-        {ORTH: "k", NORM: "ki"},
-        {ORTH: "ap", NORM: "ap"},
-    ],
-    "Kap": [
-        {ORTH: "K", NORM: "Ki"},
-        {ORTH: "ap", NORM: "ap"},
-    ],
-    "tap": [
-        {ORTH: "t", NORM: "te"},
-        {ORTH: "ap", NORM: "ap"},
-    ],
-    "Tap": [
-        {ORTH: "T", NORM: "Te"},
-        {ORTH: "ap", NORM: "ap"},
-    ],
-})
+TOKENIZER_EXCEPTIONS.update(
+    {
+        "map": [
+            {ORTH: "m", NORM: "mwen"},
+            {ORTH: "ap", NORM: "ap"},
+        ],
+        "Map": [
+            {ORTH: "M", NORM: "Mwen"},
+            {ORTH: "ap", NORM: "ap"},
+        ],
+        "lem": [
+            {ORTH: "le", NORM: "le"},
+            {ORTH: "m", NORM: "mwen"},
+        ],
+        "Lem": [
+            {ORTH: "Le", NORM: "Le"},
+            {ORTH: "m", NORM: "mwen"},
+        ],
+        "lew": [
+            {ORTH: "le", NORM: "le"},
+            {ORTH: "w", NORM: "ou"},
+        ],
+        "Lew": [
+            {ORTH: "Le", NORM: "Le"},
+            {ORTH: "w", NORM: "ou"},
+        ],
+        "nap": [
+            {ORTH: "n", NORM: "nou"},
+            {ORTH: "ap", NORM: "ap"},
+        ],
+        "Nap": [
+            {ORTH: "N", NORM: "Nou"},
+            {ORTH: "ap", NORM: "ap"},
+        ],
+        "lap": [
+            {ORTH: "l", NORM: "li"},
+            {ORTH: "ap", NORM: "ap"},
+        ],
+        "Lap": [
+            {ORTH: "L", NORM: "Li"},
+            {ORTH: "ap", NORM: "ap"},
+        ],
+        "yap": [
+            {ORTH: "y", NORM: "yo"},
+            {ORTH: "ap", NORM: "ap"},
+        ],
+        "Yap": [
+            {ORTH: "Y", NORM: "Yo"},
+            {ORTH: "ap", NORM: "ap"},
+        ],
+        "mte": [
+            {ORTH: "m", NORM: "mwen"},
+            {ORTH: "te", NORM: "te"},
+        ],
+        "Mte": [
+            {ORTH: "M", NORM: "Mwen"},
+            {ORTH: "te", NORM: "te"},
+        ],
+        "mpral": [
+            {ORTH: "m", NORM: "mwen"},
+            {ORTH: "pral", NORM: "pral"},
+        ],
+        "Mpral": [
+            {ORTH: "M", NORM: "Mwen"},
+            {ORTH: "pral", NORM: "pral"},
+        ],
+        "wap": [
+            {ORTH: "w", NORM: "ou"},
+            {ORTH: "ap", NORM: "ap"},
+        ],
+        "Wap": [
+            {ORTH: "W", NORM: "Ou"},
+            {ORTH: "ap", NORM: "ap"},
+        ],
+        "kap": [
+            {ORTH: "k", NORM: "ki"},
+            {ORTH: "ap", NORM: "ap"},
+        ],
+        "Kap": [
+            {ORTH: "K", NORM: "Ki"},
+            {ORTH: "ap", NORM: "ap"},
+        ],
+        "tap": [
+            {ORTH: "t", NORM: "te"},
+            {ORTH: "ap", NORM: "ap"},
+        ],
+        "Tap": [
+            {ORTH: "T", NORM: "Te"},
+            {ORTH: "ap", NORM: "ap"},
+        ],
+    }
+)
@@ -106,7 +106,7 @@ class BaseDefaults:
 
 def create_tokenizer() -> Callable[["Language"], Tokenizer]:
     """Registered function to create a tokenizer. Returns a factory that takes
-    the nlp object and returns a Tokenizer instance using the language detaults.
+    the nlp object and returns a Tokenizer instance using the language defaults.
     """
 
     def tokenizer_factory(nlp: "Language") -> Tokenizer:
@@ -173,7 +173,7 @@ class Language:
         current models may run out memory on extremely long texts, due to
         large internal allocations. You should segment these texts into
         meaningful units, e.g. paragraphs, subsections etc, before passing
-        them to spaCy. Default maximum length is 1,000,000 charas (1mb). As
+        them to spaCy. Default maximum length is 1,000,000 chars (1mb). As
         a rule of thumb, if all pipeline components are enabled, spaCy's
         default models currently requires roughly 1GB of temporary memory per
         100,000 characters in one text.
@@ -2448,7 +2448,7 @@ class _Sender:
             q.put(item)
 
     def step(self) -> None:
-        """Tell sender that comsumed one item. Data is sent to the workers after
+        """Tell sender that consumed one item. Data is sent to the workers after
         every chunk_size calls.
         """
         self.count += 1
@@ -12,7 +12,7 @@ cdef extern from "<algorithm>" namespace "std" nogil:
 # An edit tree (Müller et al., 2015) is a tree structure that consists of
 # edit operations. The two types of operations are string matches
 # and string substitutions. Given an input string s and an output string t,
-# subsitution and match nodes should be interpreted as follows:
+# substitution and match nodes should be interpreted as follows:
 #
 # * Substitution node: consists of an original string and substitute string.
 #   If s matches the original string, then t is the substitute. Otherwise,
@@ -1,5 +1,5 @@
 # This file is present to provide a prior version of the EntityLinker component
-# for backwards compatability. For details see #9669.
+# for backwards compatibility. For details see #9669.
 
 import random
 import warnings
@@ -187,7 +187,7 @@ class Lemmatizer(Pipe):
         if univ_pos == "":
             warnings.warn(Warnings.W108)
             return [string.lower()]
-        # See Issue #435 for example of where this logic is requied.
+        # See Issue #435 for example of where this logic is required.
         if self.is_base_form(token):
             return [string.lower()]
         index_table = self.lookups.get_table("lemma_index", {})
@@ -210,7 +210,7 @@ class Lemmatizer(Pipe):
         rules = rules_table.get(univ_pos, {})
         orig = string
         string = string.lower()
-        forms = []
+        forms: List[str] = []
         oov_forms = []
         for old, new in rules:
             if string.endswith(old):
@@ -247,7 +247,7 @@ def test_issue13769():
         (1, 4, "This is"), # Overlapping with 2 sentences
         (0, 2, "This is"), # Beginning of the Doc. Full sentence
         (0, 1, "This is"), # Beginning of the Doc. Part of a sentence
-        (10, 14, "And a"), # End of the Doc. Overlapping with 2 senteces
+        (10, 14, "And a"), # End of the Doc. Overlapping with 2 sentences
         (12, 14, "third."), # End of the Doc. Full sentence
         (1, 1, "This is"), # Empty Span
     ],
@@ -676,7 +676,7 @@ def test_span_comparison(doc):
         (3, 6, 2, 2), # Overlapping with 2 sentences
         (0, 4, 1, 2), # Beginning of the Doc. Full sentence
         (0, 3, 1, 2), # Beginning of the Doc. Part of a sentence
-        (9, 14, 2, 3), # End of the Doc. Overlapping with 2 senteces
+        (9, 14, 2, 3), # End of the Doc. Overlapping with 2 sentences
         (10, 14, 1, 2), # End of the Doc. Full sentence
         (11, 14, 1, 2), # End of the Doc. Partial sentence
         (0, 0, 1, 1), # Empty Span
@@ -29,4 +29,16 @@ def test_ht_tokenizer_handles_basic_abbreviation(ht_tokenizer, text):
 def test_ht_tokenizer_full_sentence(ht_tokenizer):
     text = "Si'm ka vini, m'ap pale ak li."
     tokens = [t.text for t in ht_tokenizer(text)]
-    assert tokens == ["Si", "'m", "ka", "vini", ",", "m'", "ap", "pale", "ak", "li", "."]
+    assert tokens == [
+        "Si",
+        "'m",
+        "ka",
+        "vini",
+        ",",
+        "m'",
+        "ap",
+        "pale",
+        "ak",
+        "li",
+        ".",
+    ]
@@ -1,4 +1,5 @@
 import pytest
+
 from spacy.tokens import Doc
 
 
@@ -37,7 +37,9 @@ def test_ht_tokenizer_splits_uneven_wrap(ht_tokenizer, text):
     assert len(tokens) == 5
 
 
-@pytest.mark.parametrize("text,length", [("Ozetazini.", 2), ("Frans.", 2), ("(Ozetazini.", 3)])
+@pytest.mark.parametrize(
+    "text,length", [("Ozetazini.", 2), ("Frans.", 2), ("(Ozetazini.", 3)]
+)
 def test_ht_tokenizer_splits_prefix_interact(ht_tokenizer, text, length):
     tokens = ht_tokenizer(text)
     assert len(tokens) == length
@@ -16,7 +16,6 @@ Nan Washington, Depatman Deta Etazini pibliye yon deklarasyon ki eksprime "regre
     assert len(tokens) == 84
 
 
-
 @pytest.mark.parametrize(
     "text,length",
     [
@@ -66,14 +65,14 @@ def test_ht_lex_attrs_capitals(word):
 
 
 @pytest.mark.parametrize(
-    "word, expected", [
+    "word, expected",
+    [
         ("'m", "mwen"),
         ("'n", "nou"),
         ("'l", "li"),
         ("'y", "yo"),
         ("'w", "ou"),
-    ]
+    ],
 )
 def test_ht_lex_attrs_norm_custom(word, expected):
     assert norm_custom(word) == expected
 
@@ -670,7 +670,7 @@ def test_matcher_remove():
     # removing once should work
     matcher.remove("Rule")
 
-    # should not return any maches anymore
+    # should not return any matches anymore
     results2 = matcher(nlp(text))
     assert len(results2) == 0
 
@@ -351,7 +351,7 @@ def test_oracle_moves_whitespace(en_vocab):
 
 
 def test_accept_blocked_token():
-    """Test succesful blocking of tokens to be in an entity."""
+    """Test successful blocking of tokens to be in an entity."""
     # 1. test normal behaviour
     nlp1 = English()
     doc1 = nlp1("I live in New York")
@@ -1288,7 +1288,7 @@ def test_threshold(meet_threshold: bool, config: Dict[str, Any]):
     entity_linker.set_kb(create_kb) # type: ignore
     nlp.initialize(get_examples=lambda: train_examples)
 
-    # Add a custom rule-based component to mimick NER
+    # Add a custom rule-based component to mimic NER
     ruler = nlp.add_pipe("entity_ruler", before="entity_linker")
     ruler.add_patterns([{"label": "PERSON", "pattern": [{"LOWER": "mahler"}]}]) # type: ignore
     doc = nlp(text)
@@ -47,7 +47,7 @@ def test_issue1506():
     nlp = English()
     for i, d in enumerate(nlp.pipe(string_generator())):
         # We should run cleanup more than one time to actually cleanup data.
-        # In first run — clean up only mark strings as «not hitted».
+        # In first run — clean up only mark strings as «not hit».
         if i == 10000 or i == 20000 or i == 30000:
             gc.collect()
         for t in d:
@@ -34,7 +34,7 @@ def test_issue2728(en_vocab):
 @pytest.mark.issue(3288)
 def test_issue3288(en_vocab):
     """Test that retokenization works correctly via displaCy when punctuation
-    is merged onto the preceeding token and tensor is resized."""
+    is merged onto the preceding token and tensor is resized."""
     words = ["Hello", "World", "!", "When", "is", "this", "breaking", "?"]
     heads = [1, 1, 1, 4, 4, 6, 4, 4]
     deps = ["intj", "ROOT", "punct", "advmod", "ROOT", "det", "nsubj", "punct"]
@@ -410,7 +410,7 @@ attribute.
 
 | Name | Description |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `all_outputs` | List of `Ragged` tensors that correspends to outputs of the different transformer layers. Each tensor element corresponds to a piece identifier's representation. ~~List[Ragged]~~ |
+| `all_outputs` | List of `Ragged` tensors that corresponds to outputs of the different transformer layers. Each tensor element corresponds to a piece identifier's representation. ~~List[Ragged]~~ |
 | `last_layer_only` | If only the last transformer layer's outputs are preserved. ~~bool~~ |
 
 ### DocTransformerOutput.embedding_layer {id="doctransformeroutput-embeddinglayer",tag="property"}
@@ -1116,7 +1116,7 @@ customize the default language data:
 | --------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `stop_words` | List of stop words, used for `Token.is_stop`.<br />**Example:** [`stop_words.py`](%%GITHUB_SPACY/spacy/lang/en/stop_words.py) ~~Set[str]~~ |
 | `tokenizer_exceptions` | Tokenizer exception rules, string mapped to list of token attributes.<br />**Example:** [`de/tokenizer_exceptions.py`](%%GITHUB_SPACY/spacy/lang/de/tokenizer_exceptions.py) ~~Dict[str, List[dict]]~~ |
-| `prefixes`, `suffixes`, `infixes` | Prefix, suffix and infix rules for the default tokenizer.<br />**Example:** [`puncutation.py`](%%GITHUB_SPACY/spacy/lang/punctuation.py) ~~Optional[Sequence[Union[str, Pattern]]]~~ |
+| `prefixes`, `suffixes`, `infixes` | Prefix, suffix and infix rules for the default tokenizer.<br />**Example:** [`punctuation.py`](%%GITHUB_SPACY/spacy/lang/punctuation.py) ~~Optional[Sequence[Union[str, Pattern]]]~~ |
 | `token_match` | Optional regex for matching strings that should never be split, overriding the infix rules.<br />**Example:** [`fr/tokenizer_exceptions.py`](%%GITHUB_SPACY/spacy/lang/fr/tokenizer_exceptions.py) ~~Optional[Callable]~~ |
 | `url_match` | Regular expression for matching URLs. Prefixes and suffixes are removed before applying the match.<br />**Example:** [`tokenizer_exceptions.py`](%%GITHUB_SPACY/spacy/lang/tokenizer_exceptions.py) ~~Optional[Callable]~~ |
 | `lex_attr_getters` | Custom functions for setting lexical attributes on tokens, e.g. `like_num`.<br />**Example:** [`lex_attrs.py`](%%GITHUB_SPACY/spacy/lang/en/lex_attrs.py) ~~Dict[int, Callable[[str], Any]]~~ |
@@ -590,7 +590,7 @@ candidate.
 protocol required by [`spacy.EntityLinker.v1`](#el-v1). The built-in candidate
 selector method allows loading existing knowledge bases in several ways, e. g.
 loading from a spaCy pipeline with a (not necessarily trained) entity linking
-component, and loading from a file describing the knowlege base as a .yaml file.
+component, and loading from a file describing the knowledge base as a .yaml file.
 Either way the loaded data will be converted to a spaCy `InMemoryLookupKB`
 instance. The KB's selection capabilities are used to select the most likely
 entity candidates for the specified mentions.
@@ -1103,7 +1103,7 @@ prompting.
 
 | Argument | Description |
 | --------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `examples` | Optional function that generates examples for few-shot learning. Deafults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
+| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
 | `parse_responses` (NEW) | Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. ~~Optional[TaskResponseParser[SpanCatTask]]~~ |
 | `prompt_example_type` (NEW) | Type to use for fewshot examples. Defaults to `TextCatExample`. ~~Optional[Type[FewshotExample]]~~ |
 | `scorer` (NEW) | Scorer function that evaluates the task performance on provided examples. Defaults to the metric used by spaCy. ~~Optional[Scorer]~~ |
@@ -1624,7 +1624,7 @@ the same documents at each run that keeps batches of documents stored on disk.
 | Argument | Description |
 | -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
 | `path` | Cache directory. If `None`, no caching is performed, and this component will act as a NoOp. Defaults to `None`. ~~Optional[Union[str, Path]]~~ |
-| `batch_size` | Number of docs in one batch (file). Once a batch is full, it will be peristed to disk. Defaults to 64. ~~int~~ |
+| `batch_size` | Number of docs in one batch (file). Once a batch is full, it will be persisted to disk. Defaults to 64. ~~int~~ |
 | `max_batches_in_mem` | Max. number of batches to hold in memory. Allows you to limit the effect on your memory if you're handling a lot of docs. Defaults to 4. ~~int~~ |
 
 When retrieving a document, the `BatchCache` will first figure out what batch
@@ -1,6 +1,6 @@
 ---
 title: Tokenizer
-teaser: Segment text into words, punctuations marks, etc.
+teaser: Segment text into words, punctuation marks, etc.
 tag: class
 source: spacy/tokenizer.pyx
 ---
@@ -152,7 +152,7 @@ For faster processing, you may only want to run a subset of the components in a
 trained pipeline. The `disable` and `exclude` arguments to
 [`spacy.load`](/api/top-level#spacy.load) let you control which components are
 loaded and run. Disabled components are loaded in the background so it's
-possible to reenable them in the same pipeline in the future with
+possible to re-enable them in the same pipeline in the future with
 [`nlp.enable_pipe`](/api/language/#enable_pipe). To skip loading a component
 completely, use `exclude` instead of `disable`.
 
@@ -960,7 +960,7 @@ print(doc._.acronyms)
 Many stateful components depend on **data resources** like dictionaries and
 lookup tables that should ideally be **configurable**. For example, it makes
 sense to make the `DICTIONARY` in the above example an argument of the
-registered function, so the `AcronymComponent` can be re-used with different
+registered function, so the `AcronymComponent` can be reused with different
 data. One logical solution would be to make it an argument of the component
 factory, and allow it to be initialized with different dictionaries.
 
@@ -1316,7 +1316,7 @@ means that the config can express very complex, nested trees of objects – but
 the objects don't have to pass the model settings all the way down to the
 components. It also makes the components more **modular** and lets you
 [swap](/usage/layers-architectures#swap-architectures) different architectures
-in your config, and re-use model definitions.
+in your config, and reuse model definitions.
 
 ```ini {title="config.cfg (excerpt)"}
 [components]
@@ -389,7 +389,7 @@ Each command defined in the `project.yml` can optionally define a list of
 dependencies and outputs. These are the files the command requires and creates.
 For example, a command for training a pipeline may depend on a
 [`config.cfg`](/usage/training#config) and the training and evaluation data, and
-it will export a directory `model-best`, which you can then re-use in other
+it will export a directory `model-best`, which you can then reuse in other
 commands.
 
 {/* prettier-ignore */}
@@ -703,7 +703,7 @@ def collect_sents(matcher, doc, i, matches):
     span = doc[start:end] # Matched span
     sent = span.sent # Sentence containing matched span
     # Append mock entity for match in displaCy style to matched_sents
-    # get the match span by ofsetting the start and end of the span with the
+    # get the match span by offsetting the start and end of the span with the
     # start and end of the sentence in the doc
     match_ents = [{
         "start": span.start_char - sent.start_char,
@@ -117,7 +117,7 @@ related to more general machine learning functionality.
 
 | Name | Description |
 | ------------------------------------- | ------------------------------------------------------------------------------------------------------------------ |
-| **Tokenization** | Segmenting text into words, punctuations marks etc. |
+| **Tokenization** | Segmenting text into words, punctuation marks etc. |
 | **Part-of-speech** (POS) **Tagging** | Assigning word types to tokens, like verb or noun. |
 | **Dependency Parsing** | Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object. |
 | **Lemmatization** | Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat". |
@@ -191,8 +191,8 @@ sections of a config file are:
 | ------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `nlp` | Definition of the `nlp` object, its tokenizer and [processing pipeline](/usage/processing-pipelines) component names. |
 | `components` | Definitions of the [pipeline components](/usage/processing-pipelines) and their models. |
-| `paths` | Paths to data and other assets. Re-used across the config as variables, e.g. `${paths.train}`, and can be [overwritten](#config-overrides) on the CLI. |
-| `system` | Settings related to system and hardware. Re-used across the config as variables, e.g. `${system.seed}`, and can be [overwritten](#config-overrides) on the CLI. |
+| `paths` | Paths to data and other assets. Reused across the config as variables, e.g. `${paths.train}`, and can be [overwritten](#config-overrides) on the CLI. |
+| `system` | Settings related to system and hardware. Reused across the config as variables, e.g. `${system.seed}`, and can be [overwritten](#config-overrides) on the CLI. |
 | `training` | Settings and controls for the training and evaluation process. |
 | `pretraining` | Optional settings and controls for the [language model pretraining](/usage/embeddings-transformers#pretraining). |
 | `initialize` | Data resources and arguments passed to components when [`nlp.initialize`](/api/language#initialize) is called before training (but not at runtime). |
@@ -242,7 +242,7 @@ class JinjaToJS(object):
         )
 
         # It is assumed that this will be the absolute path to the template. It is used to work out
-        # related paths for inclues.
+        # related paths for includes.
         self.template_path = template_path
 
         if self.js_module_format not in JS_MODULE_FORMATS.keys():
@@ -283,7 +283,7 @@ class JinjaToJS(object):
         not yet been registered.
 
         Args:
-            dependency (str): Thet dependency that needs to be imported.
+            dependency (str): The dependency that needs to be imported.
 
         Returns:
             str or None
@@ -88,7 +88,7 @@ export default class Juniper extends React.Component {
     }
 
     /**
-     * Request kernel and estabish a server connection via the JupyerLab service
+     * Request kernel and establish a server connection via the JupyerLab service
      * @param {object} settings - The server settings.
      * @returns {Promise} - A promise that's resolved with the kernel.
      */
@@ -86,7 +86,7 @@ export const remarkComponents = {
   IntegrationLogo,
 
   /**
-   * This is readded as `Image` it can be explicitly used in MDX files.
+   * This is re-added as `Image` it can be explicitly used in MDX files.
    * For regular img elements it is not possible to pass properties
    */
   Image,
@@ -15,7 +15,7 @@ $breakpoints: ( sm: 768px, md: 992px, lg: 1200px )
     @media(max-width: #{map-get($breakpoints-max, $size)})
         @content
 
-// Scroll shadows for reponsive tables
+// Scroll shadows for responsive tables
 // adapted from David Bushell, http://codepen.io/dbushell/pen/wGaamR
 // $scroll-shadow-color - color of shadow
 // $scroll-shadow-side - side to cover shadow (left or right)