Christian Clauss, 2025-08-31 13:29:39 +00:00 (committed by GitHub)
commit be18d1c864
42 changed files with 232 additions and 175 deletions

View File

@ -6,7 +6,7 @@ This is a list of all the active repos relevant to spaCy besides the main one, w
These packages are always pulled in when you install spaCy. Most of them are direct dependencies, but some are transitive dependencies through other packages.
- [spacy-legacy](https://github.com/explosion/spacy-legacy): When an architecture in spaCy changes enough to get a new version, the old version is frozen and moved to spacy-legacy. This allows us to keep the core library slim while also preserving backwards compatability.
- [spacy-legacy](https://github.com/explosion/spacy-legacy): When an architecture in spaCy changes enough to get a new version, the old version is frozen and moved to spacy-legacy. This allows us to keep the core library slim while also preserving backwards compatibility.
- [thinc](https://github.com/explosion/thinc): Thinc is the machine learning library that powers trainable components in spaCy. It wraps backends like Numpy, PyTorch, and Tensorflow to provide a functional interface for specifying architectures.
- [catalogue](https://github.com/explosion/catalogue): Small library for adding function registries, like those used for model architectures in spaCy.
- [confection](https://github.com/explosion/confection): This library contains the functionality for config parsing that was formerly contained directly in Thinc.
@ -67,7 +67,7 @@ These repos are used to support the spaCy docs or otherwise present information
These repos are used for organizing data around spaCy, but are not something an end user would need to install as part of using the library.
- [spacy-models](https://github.com/explosion/spacy-models): This repo contains metadata (but not training data) for all the spaCy models. This includes information about where their training data came from, version compatability, and performance information. It also includes tests for the model packages, and the built models are hosted as releases of this repo.
- [spacy-models](https://github.com/explosion/spacy-models): This repo contains metadata (but not training data) for all the spaCy models. This includes information about where their training data came from, version compatibility, and performance information. It also includes tests for the model packages, and the built models are hosted as releases of this repo.
- [wheelwright](https://github.com/explosion/wheelwright): A tool for automating our PyPI builds and releases.
- [ec2buildwheel](https://github.com/explosion/ec2buildwheel): A small project that allows you to build Python packages in the manner of cibuildwheel, but on any EC2 image. Used by wheelwright.

View File

@ -145,7 +145,7 @@ These are things stored in the vocab:
- `get_noun_chunks`: a syntax iterator
- lex attribute getters: functions like `is_punct`, set in language defaults
- `cfg`: **not** the pipeline config, this is mostly unused
- `_unused_object`: Formerly an unused object, kept around until v4 for compatability
- `_unused_object`: Formerly an unused object, kept around until v4 for compatibility
Some of these, like the Morphology and Vectors, are complex enough that they
need their own explanations. Here we'll just look at Vocab-specific items.
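As a side note (not from this document), the Vocab-specific items listed above can be inspected directly on a pipeline; a rough sketch, assuming a blank English pipeline and the attribute names given in the list:

```python
import spacy

nlp = spacy.blank("en")
vocab = nlp.vocab
print(vocab.get_noun_chunks)   # syntax iterator taken from the language defaults
print(vocab.lex_attr_getters)  # lex attribute getters such as is_punct
print(vocab.cfg)               # not the pipeline config; mostly unused
```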

View File

@ -34,7 +34,7 @@ CONDITIONS.
Collection will not be considered an Adaptation for the purpose of
this License. For the avoidance of doubt, where the Work is a musical
work, performance or phonogram, the synchronization of the Work in
timed-relation with a moving image ("synching") will be considered an
timed-relation with a moving image ("syncing") will be considered an
Adaptation for the purpose of this License.
b. "Collection" means a collection of literary or artistic works, such as
encyclopedias and anthologies, or performances, phonograms or
@ -264,7 +264,7 @@ subject to and limited by the following restrictions:
UNLESS OTHERWISE MUTUALLY AGREED TO BY THE PARTIES IN WRITING, LICENSOR
OFFERS THE WORK AS-IS AND MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY
KIND CONCERNING THE WORK, EXPRESS, IMPLIED, STATUTORY OR OTHERWISE,
INCLUDING, WITHOUT LIMITATION, WARRANTIES OF TITLE, MERCHANTIBILITY,
INCLUDING, WITHOUT LIMITATION, WARRANTIES OF TITLE, MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF
LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OF ABSENCE OF ERRORS,
WHETHER OR NOT DISCOVERABLE. SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION

View File

@ -99,7 +99,7 @@ def parse_config_overrides(
RETURNS (Dict[str, Any]): The parsed dict, keyed by nested config setting.
"""
env_string = os.environ.get(env_var, "") if env_var else ""
env_overrides = _parse_overrides(split_arg_string(env_string))
env_overrides = _parse_overrides(split_arg_string(env_string)) # type: ignore[operator]
cli_overrides = _parse_overrides(args, is_cli=True)
if cli_overrides:
keys = [k for k in cli_overrides if k not in env_overrides]

View File

@ -84,7 +84,7 @@ def info(
def info_spacy() -> Dict[str, Any]:
"""Generate info about the current spaCy intallation.
"""Generate info about the current spaCy installation.
RETURNS (dict): The spaCy info.
"""

View File

@ -354,7 +354,7 @@ GLOSSARY = {
# https://github.com/ltgoslo/norne
"EVT": "Festivals, cultural events, sports events, weather phenomena, wars, etc.",
"PROD": "Product, i.e. artificially produced entities including speeches, radio shows, programming languages, contracts, laws and ideas",
"DRV": "Words (and phrases?) that are dervied from a name, but not a name in themselves, e.g. 'Oslo-mannen' ('the man from Oslo')",
"DRV": "Words (and phrases?) that are derived from a name, but not a name in themselves, e.g. 'Oslo-mannen' ('the man from Oslo')",
"GPE_LOC": "Geo-political entity, with a locative sense, e.g. 'John lives in Spain'",
"GPE_ORG": "Geo-political entity, with an organisation sense, e.g. 'Spain declined to meet with Belgium'",
}

View File

@ -5,11 +5,11 @@ from thinc.api import Model
from ...language import BaseDefaults, Language
from .lemmatizer import HaitianCreoleLemmatizer
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
from .punctuation import TOKENIZER_INFIXES, TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
from .stop_words import STOP_WORDS
from .syntax_iterators import SYNTAX_ITERATORS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .tag_map import TAG_MAP
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
class HaitianCreoleDefaults(BaseDefaults):
@ -22,10 +22,12 @@ class HaitianCreoleDefaults(BaseDefaults):
stop_words = STOP_WORDS
tag_map = TAG_MAP
class HaitianCreole(Language):
lang = "ht"
Defaults = HaitianCreoleDefaults
@HaitianCreole.factory(
"lemmatizer",
assigns=["token.lemma"],
@ -49,4 +51,5 @@ def make_lemmatizer(
nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
)
__all__ = ["HaitianCreole"]
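Not part of the diff, but as a quick smoke test the new language behaves like any other spaCy language; the expected tokens below mirror the ht tokenizer test shown further down in this commit:

```python
import spacy

nlp = spacy.blank("ht")  # picks up the HaitianCreoleDefaults registered above
doc = nlp("Si'm ka vini, m'ap pale ak li.")
print([t.text for t in doc])
# ["Si", "'m", "ka", "vini", ",", "m'", "ap", "pale", "ak", "li", "."]
```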

View File

@ -1,8 +1,8 @@
from typing import List, Tuple
from ...lookups import Lookups
from ...pipeline import Lemmatizer
from ...tokens import Token
from ...lookups import Lookups
class HaitianCreoleLemmatizer(Lemmatizer):

View File

@ -49,6 +49,7 @@ NORM_MAP = {
"P": "Pa",
}
def like_num(text):
text = text.strip().lower()
if text.startswith(("+", "-", "±", "~")):
@ -69,9 +70,11 @@ def like_num(text):
return True
return False
def norm_custom(text):
return NORM_MAP.get(text, text.lower())
LEX_ATTRS = {
LIKE_NUM: like_num,
NORM: norm_custom,

View File

@ -4,10 +4,10 @@ from ..char_classes import (
ALPHA_UPPER,
CONCAT_QUOTES,
HYPHENS,
LIST_PUNCT,
LIST_QUOTES,
LIST_ELLIPSES,
LIST_ICONS,
LIST_PUNCT,
LIST_QUOTES,
merge_chars,
)
@ -16,23 +16,37 @@ ELISION = "'".replace(" ", "")
_prefixes_elision = "m n l y t k w"
_prefixes_elision += " " + _prefixes_elision.upper()
TOKENIZER_PREFIXES = LIST_PUNCT + LIST_QUOTES + [
TOKENIZER_PREFIXES = (
LIST_PUNCT
+ LIST_QUOTES
+ [
r"(?:({pe})[{el}])(?=[{a}])".format(
a=ALPHA, el=ELISION, pe=merge_chars(_prefixes_elision)
)
]
)
TOKENIZER_SUFFIXES = LIST_PUNCT + LIST_QUOTES + LIST_ELLIPSES + [
TOKENIZER_SUFFIXES = (
LIST_PUNCT
+ LIST_QUOTES
+ LIST_ELLIPSES
+ [
r"(?<=[0-9])%", # numbers like 10%
r"(?<=[0-9])(?:{h})".format(h=HYPHENS), # hyphens after numbers
r"(?<=[{a}])[']".format(a=ALPHA), # apostrophes after letters
r"(?<=[{a}])['][mwlnytk](?=\s|$)".format(a=ALPHA), # contractions
r"(?<=[{a}0-9])\)", # right parenthesis after letter/number
r"(?<=[{a}])\.(?=\s|$)".format(a=ALPHA), # period after letter if space or end of string
r"(?<=[{a}])\.(?=\s|$)".format(
a=ALPHA
), # period after letter if space or end of string
r"(?<=\))[\.\?!]", # punctuation immediately after right parenthesis
]
)
TOKENIZER_INFIXES = LIST_ELLIPSES + LIST_ICONS + [
TOKENIZER_INFIXES = (
LIST_ELLIPSES
+ LIST_ICONS
+ [
r"(?<=[0-9])[+\-\*^](?=[0-9-])",
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
@ -41,3 +55,4 @@ TOKENIZER_INFIXES = LIST_ELLIPSES + LIST_ICONS + [
r"(?<=[{a}0-9])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION),
]
)
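As an aside (not from this commit), the reformatted lists can be sanity-checked by compiling them the same way spaCy's tokenizer does, using the standard `spacy.util` helpers:

```python
from spacy.lang.ht.punctuation import (
    TOKENIZER_INFIXES,
    TOKENIZER_PREFIXES,
    TOKENIZER_SUFFIXES,
)
from spacy.util import compile_infix_regex, compile_prefix_regex, compile_suffix_regex

prefix_re = compile_prefix_regex(TOKENIZER_PREFIXES)
suffix_re = compile_suffix_regex(TOKENIZER_SUFFIXES)
infix_re = compile_infix_regex(TOKENIZER_INFIXES)
print(suffix_re.search("10%"))  # the "% after a number" suffix rule should match
```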

View File

@ -39,8 +39,7 @@ sa san si swa si
men mèsi oswa osinon
"""
.split()
""".split()
)
# Add common contractions, with and without apostrophe variants

View File

@ -1,4 +1,22 @@
from spacy.symbols import NOUN, VERB, AUX, ADJ, ADV, PRON, DET, ADP, SCONJ, CCONJ, PART, INTJ, NUM, PROPN, PUNCT, SYM, X
from spacy.symbols import (
ADJ,
ADP,
ADV,
AUX,
CCONJ,
DET,
INTJ,
NOUN,
NUM,
PART,
PRON,
PROPN,
PUNCT,
SCONJ,
SYM,
VERB,
X,
)
TAG_MAP = {
"NOUN": {"pos": NOUN},

View File

@ -1,4 +1,5 @@
from spacy.symbols import ORTH, NORM
from spacy.symbols import NORM, ORTH
def make_variants(base, first_norm, second_orth, second_norm):
return {
@ -7,14 +8,16 @@ def make_variants(base, first_norm, second_orth, second_norm):
{ORTH: second_orth, NORM: second_norm},
],
base.capitalize(): [
{ORTH: base.split("'")[0].capitalize() + "'", NORM: first_norm.capitalize()},
{
ORTH: base.split("'")[0].capitalize() + "'",
NORM: first_norm.capitalize(),
},
{ORTH: second_orth, NORM: second_norm},
]
],
}
TOKENIZER_EXCEPTIONS = {
"Dr.": [{ORTH: "Dr."}]
}
TOKENIZER_EXCEPTIONS = {"Dr.": [{ORTH: "Dr."}]}
# Apostrophe forms
TOKENIZER_EXCEPTIONS.update(make_variants("m'ap", "mwen", "ap", "ap"))
@ -29,7 +32,8 @@ TOKENIZER_EXCEPTIONS.update(make_variants("p'ap", "pa", "ap", "ap"))
TOKENIZER_EXCEPTIONS.update(make_variants("t'ap", "te", "ap", "ap"))
# Non-apostrophe contractions (with capitalized variants)
TOKENIZER_EXCEPTIONS.update({
TOKENIZER_EXCEPTIONS.update(
{
"map": [
{ORTH: "m", NORM: "mwen"},
{ORTH: "ap", NORM: "ap"},
@ -118,4 +122,5 @@ TOKENIZER_EXCEPTIONS.update({
{ORTH: "T", NORM: "Te"},
{ORTH: "ap", NORM: "ap"},
],
})
}
)

View File

@ -106,7 +106,7 @@ class BaseDefaults:
def create_tokenizer() -> Callable[["Language"], Tokenizer]:
"""Registered function to create a tokenizer. Returns a factory that takes
the nlp object and returns a Tokenizer instance using the language detaults.
the nlp object and returns a Tokenizer instance using the language defaults.
"""
def tokenizer_factory(nlp: "Language") -> Tokenizer:
@ -173,7 +173,7 @@ class Language:
current models may run out memory on extremely long texts, due to
large internal allocations. You should segment these texts into
meaningful units, e.g. paragraphs, subsections etc, before passing
them to spaCy. Default maximum length is 1,000,000 charas (1mb). As
them to spaCy. Default maximum length is 1,000,000 chars (1mb). As
a rule of thumb, if all pipeline components are enabled, spaCy's
default models currently requires roughly 1GB of temporary memory per
100,000 characters in one text.
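A minimal sketch of the workaround described here (file name and pipeline name are placeholders, not from the source): segment long inputs before handing them to spaCy, or raise `nlp.max_length` if memory allows:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # placeholder; any installed pipeline
long_text = open("very_long_document.txt").read()  # hypothetical input
# Process meaningful units (e.g. paragraphs) instead of one huge Doc:
docs = list(nlp.pipe(long_text.split("\n\n")))
# Or, if memory allows, raise the limit explicitly:
nlp.max_length = 2_000_000
```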
@ -2448,7 +2448,7 @@ class _Sender:
q.put(item)
def step(self) -> None:
"""Tell sender that comsumed one item. Data is sent to the workers after
"""Tell sender that consumed one item. Data is sent to the workers after
every chunk_size calls.
"""
self.count += 1

View File

@ -12,7 +12,7 @@ cdef extern from "<algorithm>" namespace "std" nogil:
# An edit tree (Müller et al., 2015) is a tree structure that consists of
# edit operations. The two types of operations are string matches
# and string substitutions. Given an input string s and an output string t,
# subsitution and match nodes should be interpreted as follows:
# substitution and match nodes should be interpreted as follows:
#
# * Substitution node: consists of an original string and substitute string.
# If s matches the original string, then t is the substitute. Otherwise,

View File

@ -1,5 +1,5 @@
# This file is present to provide a prior version of the EntityLinker component
# for backwards compatability. For details see #9669.
# for backwards compatibility. For details see #9669.
import random
import warnings

View File

@ -187,7 +187,7 @@ class Lemmatizer(Pipe):
if univ_pos == "":
warnings.warn(Warnings.W108)
return [string.lower()]
# See Issue #435 for example of where this logic is requied.
# See Issue #435 for example of where this logic is required.
if self.is_base_form(token):
return [string.lower()]
index_table = self.lookups.get_table("lemma_index", {})
@ -210,7 +210,7 @@ class Lemmatizer(Pipe):
rules = rules_table.get(univ_pos, {})
orig = string
string = string.lower()
forms = []
forms: List[str] = []
oov_forms = []
for old, new in rules:
if string.endswith(old):

View File

@ -247,7 +247,7 @@ def test_issue13769():
(1, 4, "This is"), # Overlapping with 2 sentences
(0, 2, "This is"), # Beginning of the Doc. Full sentence
(0, 1, "This is"), # Beginning of the Doc. Part of a sentence
(10, 14, "And a"), # End of the Doc. Overlapping with 2 senteces
(10, 14, "And a"), # End of the Doc. Overlapping with 2 sentences
(12, 14, "third."), # End of the Doc. Full sentence
(1, 1, "This is"), # Empty Span
],
@ -676,7 +676,7 @@ def test_span_comparison(doc):
(3, 6, 2, 2), # Overlapping with 2 sentences
(0, 4, 1, 2), # Beginning of the Doc. Full sentence
(0, 3, 1, 2), # Beginning of the Doc. Part of a sentence
(9, 14, 2, 3), # End of the Doc. Overlapping with 2 senteces
(9, 14, 2, 3), # End of the Doc. Overlapping with 2 sentences
(10, 14, 1, 2), # End of the Doc. Full sentence
(11, 14, 1, 2), # End of the Doc. Partial sentence
(0, 0, 1, 1), # Empty Span

View File

@ -29,4 +29,16 @@ def test_ht_tokenizer_handles_basic_abbreviation(ht_tokenizer, text):
def test_ht_tokenizer_full_sentence(ht_tokenizer):
text = "Si'm ka vini, m'ap pale ak li."
tokens = [t.text for t in ht_tokenizer(text)]
assert tokens == ["Si", "'m", "ka", "vini", ",", "m'", "ap", "pale", "ak", "li", "."]
assert tokens == [
"Si",
"'m",
"ka",
"vini",
",",
"m'",
"ap",
"pale",
"ak",
"li",
".",
]

View File

@ -1,4 +1,5 @@
import pytest
from spacy.tokens import Doc

View File

@ -37,7 +37,9 @@ def test_ht_tokenizer_splits_uneven_wrap(ht_tokenizer, text):
assert len(tokens) == 5
@pytest.mark.parametrize("text,length", [("Ozetazini.", 2), ("Frans.", 2), ("(Ozetazini.", 3)])
@pytest.mark.parametrize(
"text,length", [("Ozetazini.", 2), ("Frans.", 2), ("(Ozetazini.", 3)]
)
def test_ht_tokenizer_splits_prefix_interact(ht_tokenizer, text, length):
tokens = ht_tokenizer(text)
assert len(tokens) == length

View File

@ -16,7 +16,6 @@ Nan Washington, Depatman Deta Etazini pibliye yon deklarasyon ki eksprime "regre
assert len(tokens) == 84
@pytest.mark.parametrize(
"text,length",
[
@ -66,14 +65,14 @@ def test_ht_lex_attrs_capitals(word):
@pytest.mark.parametrize(
"word, expected", [
"word, expected",
[
("'m", "mwen"),
("'n", "nou"),
("'l", "li"),
("'y", "yo"),
("'w", "ou"),
]
],
)
def test_ht_lex_attrs_norm_custom(word, expected):
assert norm_custom(word) == expected

View File

@ -670,7 +670,7 @@ def test_matcher_remove():
# removing once should work
matcher.remove("Rule")
# should not return any maches anymore
# should not return any matches anymore
results2 = matcher(nlp(text))
assert len(results2) == 0

View File

@ -351,7 +351,7 @@ def test_oracle_moves_whitespace(en_vocab):
def test_accept_blocked_token():
"""Test succesful blocking of tokens to be in an entity."""
"""Test successful blocking of tokens to be in an entity."""
# 1. test normal behaviour
nlp1 = English()
doc1 = nlp1("I live in New York")

View File

@ -1288,7 +1288,7 @@ def test_threshold(meet_threshold: bool, config: Dict[str, Any]):
entity_linker.set_kb(create_kb) # type: ignore
nlp.initialize(get_examples=lambda: train_examples)
# Add a custom rule-based component to mimick NER
# Add a custom rule-based component to mimic NER
ruler = nlp.add_pipe("entity_ruler", before="entity_linker")
ruler.add_patterns([{"label": "PERSON", "pattern": [{"LOWER": "mahler"}]}]) # type: ignore
doc = nlp(text)

View File

@ -47,7 +47,7 @@ def test_issue1506():
nlp = English()
for i, d in enumerate(nlp.pipe(string_generator())):
# We should run cleanup more than one time to actually cleanup data.
# In first run — clean up only mark strings as «not hitted».
# In first run — clean up only mark strings as «not hit».
if i == 10000 or i == 20000 or i == 30000:
gc.collect()
for t in d:

View File

@ -34,7 +34,7 @@ def test_issue2728(en_vocab):
@pytest.mark.issue(3288)
def test_issue3288(en_vocab):
"""Test that retokenization works correctly via displaCy when punctuation
is merged onto the preceeding token and tensor is resized."""
is merged onto the preceding token and tensor is resized."""
words = ["Hello", "World", "!", "When", "is", "this", "breaking", "?"]
heads = [1, 1, 1, 4, 4, 6, 4, 4]
deps = ["intj", "ROOT", "punct", "advmod", "ROOT", "det", "nsubj", "punct"]

View File

@ -410,7 +410,7 @@ attribute.
| Name | Description |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `all_outputs` | List of `Ragged` tensors that correspends to outputs of the different transformer layers. Each tensor element corresponds to a piece identifier's representation. ~~List[Ragged]~~ |
| `all_outputs` | List of `Ragged` tensors that corresponds to outputs of the different transformer layers. Each tensor element corresponds to a piece identifier's representation. ~~List[Ragged]~~ |
| `last_layer_only` | If only the last transformer layer's outputs are preserved. ~~bool~~ |
### DocTransformerOutput.embedding_layer {id="doctransformeroutput-embeddinglayer",tag="property"}

View File

@ -1116,7 +1116,7 @@ customize the default language data:
| --------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `stop_words` | List of stop words, used for `Token.is_stop`.<br />**Example:** [`stop_words.py`](%%GITHUB_SPACY/spacy/lang/en/stop_words.py) ~~Set[str]~~ |
| `tokenizer_exceptions` | Tokenizer exception rules, string mapped to list of token attributes.<br />**Example:** [`de/tokenizer_exceptions.py`](%%GITHUB_SPACY/spacy/lang/de/tokenizer_exceptions.py) ~~Dict[str, List[dict]]~~ |
| `prefixes`, `suffixes`, `infixes` | Prefix, suffix and infix rules for the default tokenizer.<br />**Example:** [`puncutation.py`](%%GITHUB_SPACY/spacy/lang/punctuation.py) ~~Optional[Sequence[Union[str, Pattern]]]~~ |
| `prefixes`, `suffixes`, `infixes` | Prefix, suffix and infix rules for the default tokenizer.<br />**Example:** [`punctuation.py`](%%GITHUB_SPACY/spacy/lang/punctuation.py) ~~Optional[Sequence[Union[str, Pattern]]]~~ |
| `token_match` | Optional regex for matching strings that should never be split, overriding the infix rules.<br />**Example:** [`fr/tokenizer_exceptions.py`](%%GITHUB_SPACY/spacy/lang/fr/tokenizer_exceptions.py) ~~Optional[Callable]~~ |
| `url_match` | Regular expression for matching URLs. Prefixes and suffixes are removed before applying the match.<br />**Example:** [`tokenizer_exceptions.py`](%%GITHUB_SPACY/spacy/lang/tokenizer_exceptions.py) ~~Optional[Callable]~~ |
| `lex_attr_getters` | Custom functions for setting lexical attributes on tokens, e.g. `like_num`.<br />**Example:** [`lex_attrs.py`](%%GITHUB_SPACY/spacy/lang/en/lex_attrs.py) ~~Dict[int, Callable[[str], Any]]~~ |
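A hedged sketch of how these defaults are typically customized, mirroring the `HaitianCreoleDefaults` pattern earlier in this diff (the subclass names and extra stop words are invented for illustration):

```python
from spacy.lang.en import English

class CustomEnglishDefaults(English.Defaults):
    stop_words = English.Defaults.stop_words | {"btw", "imho"}

class CustomEnglish(English):
    Defaults = CustomEnglishDefaults

nlp = CustomEnglish()
print(nlp.vocab["btw"].is_stop)  # stop words from the defaults feed Token.is_stop
```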

View File

@ -590,7 +590,7 @@ candidate.
protocol required by [`spacy.EntityLinker.v1`](#el-v1). The built-in candidate
selector method allows loading existing knowledge bases in several ways, e. g.
loading from a spaCy pipeline with a (not necessarily trained) entity linking
component, and loading from a file describing the knowlege base as a .yaml file.
component, and loading from a file describing the knowledge base as a .yaml file.
Either way the loaded data will be converted to a spaCy `InMemoryLookupKB`
instance. The KB's selection capabilities are used to select the most likely
entity candidates for the specified mentions.
@ -1103,7 +1103,7 @@ prompting.
| Argument | Description |
| --------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | Optional function that generates examples for few-shot learning. Deafults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
| `parse_responses` (NEW) | Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. ~~Optional[TaskResponseParser[SpanCatTask]]~~ |
| `prompt_example_type` (NEW) | Type to use for fewshot examples. Defaults to `TextCatExample`. ~~Optional[Type[FewshotExample]]~~ |
| `scorer` (NEW) | Scorer function that evaluates the task performance on provided examples. Defaults to the metric used by spaCy. ~~Optional[Scorer]~~ |
@ -1624,7 +1624,7 @@ the same documents at each run that keeps batches of documents stored on disk.
| Argument | Description |
| -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
| `path` | Cache directory. If `None`, no caching is performed, and this component will act as a NoOp. Defaults to `None`. ~~Optional[Union[str, Path]]~~ |
| `batch_size` | Number of docs in one batch (file). Once a batch is full, it will be peristed to disk. Defaults to 64. ~~int~~ |
| `batch_size` | Number of docs in one batch (file). Once a batch is full, it will be persisted to disk. Defaults to 64. ~~int~~ |
| `max_batches_in_mem` | Max. number of batches to hold in memory. Allows you to limit the effect on your memory if you're handling a lot of docs. Defaults to 4. ~~int~~ |
When retrieving a document, the `BatchCache` will first figure out what batch

View File

@ -1,6 +1,6 @@
---
title: Tokenizer
teaser: Segment text into words, punctuations marks, etc.
teaser: Segment text into words, punctuation marks, etc.
tag: class
source: spacy/tokenizer.pyx
---

View File

@ -152,7 +152,7 @@ For faster processing, you may only want to run a subset of the components in a
trained pipeline. The `disable` and `exclude` arguments to
[`spacy.load`](/api/top-level#spacy.load) let you control which components are
loaded and run. Disabled components are loaded in the background so it's
possible to reenable them in the same pipeline in the future with
possible to re-enable them in the same pipeline in the future with
[`nlp.enable_pipe`](/api/language/#enable_pipe). To skip loading a component
completely, use `exclude` instead of `disable`.
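For illustration (the pipeline name is a placeholder), the pattern described here looks roughly like this:

```python
import spacy

# "ner" is loaded but skipped when the pipeline runs:
nlp = spacy.load("en_core_web_sm", disable=["ner"])
doc = nlp("The entity recognizer did not run on this text.")
nlp.enable_pipe("ner")  # re-enable it later in the same pipeline

# To avoid loading the component at all, use exclude instead:
nlp = spacy.load("en_core_web_sm", exclude=["ner"])
```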

View File

@ -960,7 +960,7 @@ print(doc._.acronyms)
Many stateful components depend on **data resources** like dictionaries and
lookup tables that should ideally be **configurable**. For example, it makes
sense to make the `DICTIONARY` in the above example an argument of the
registered function, so the `AcronymComponent` can be re-used with different
registered function, so the `AcronymComponent` can be reused with different
data. One logical solution would be to make it an argument of the component
factory, and allow it to be initialized with different dictionaries.
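A rough sketch of that suggestion (parameter name and dictionary contents are guesses, not the docs' exact example): pass the data in through the factory's config so the component can be reused with different dictionaries:

```python
from spacy.language import Language
from spacy.tokens import Doc

DICTIONARY = {"lol": "laughing out loud", "brb": "be right back"}  # hypothetical data

class AcronymComponent:
    def __init__(self, dictionary: dict):
        self.dictionary = dictionary

    def __call__(self, doc: Doc) -> Doc:
        # look up acronyms here; the data arrived via the factory config
        return doc

@Language.factory("acronym_component", default_config={"dictionary": DICTIONARY})
def create_acronym_component(nlp: Language, name: str, dictionary: dict):
    return AcronymComponent(dictionary)
```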
@ -1316,7 +1316,7 @@ means that the config can express very complex, nested trees of objects but
the objects don't have to pass the model settings all the way down to the
components. It also makes the components more **modular** and lets you
[swap](/usage/layers-architectures#swap-architectures) different architectures
in your config, and re-use model definitions.
in your config, and reuse model definitions.
```ini {title="config.cfg (excerpt)"}
[components]

View File

@ -389,7 +389,7 @@ Each command defined in the `project.yml` can optionally define a list of
dependencies and outputs. These are the files the command requires and creates.
For example, a command for training a pipeline may depend on a
[`config.cfg`](/usage/training#config) and the training and evaluation data, and
it will export a directory `model-best`, which you can then re-use in other
it will export a directory `model-best`, which you can then reuse in other
commands.
{/* prettier-ignore */}

View File

@ -703,7 +703,7 @@ def collect_sents(matcher, doc, i, matches):
span = doc[start:end] # Matched span
sent = span.sent # Sentence containing matched span
# Append mock entity for match in displaCy style to matched_sents
# get the match span by ofsetting the start and end of the span with the
# get the match span by offsetting the start and end of the span with the
# start and end of the sentence in the doc
match_ents = [{
"start": span.start_char - sent.start_char,

View File

@ -117,7 +117,7 @@ related to more general machine learning functionality.
| Name | Description |
| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------ |
| **Tokenization** | Segmenting text into words, punctuations marks etc. |
| **Tokenization** | Segmenting text into words, punctuation marks etc. |
| **Part-of-speech** (POS) **Tagging** | Assigning word types to tokens, like verb or noun. |
| **Dependency Parsing** | Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object. |
| **Lemmatization** | Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat". |

View File

@ -191,8 +191,8 @@ sections of a config file are:
| ------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `nlp` | Definition of the `nlp` object, its tokenizer and [processing pipeline](/usage/processing-pipelines) component names. |
| `components` | Definitions of the [pipeline components](/usage/processing-pipelines) and their models. |
| `paths` | Paths to data and other assets. Re-used across the config as variables, e.g. `${paths.train}`, and can be [overwritten](#config-overrides) on the CLI. |
| `system` | Settings related to system and hardware. Re-used across the config as variables, e.g. `${system.seed}`, and can be [overwritten](#config-overrides) on the CLI. |
| `paths` | Paths to data and other assets. Reused across the config as variables, e.g. `${paths.train}`, and can be [overwritten](#config-overrides) on the CLI. |
| `system` | Settings related to system and hardware. Reused across the config as variables, e.g. `${system.seed}`, and can be [overwritten](#config-overrides) on the CLI. |
| `training` | Settings and controls for the training and evaluation process. |
| `pretraining` | Optional settings and controls for the [language model pretraining](/usage/embeddings-transformers#pretraining). |
| `initialize` | Data resources and arguments passed to components when [`nlp.initialize`](/api/language#initialize) is called before training (but not at runtime). |
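As a non-authoritative illustration of how the `paths` and `system` variables get used, a config can be loaded with overrides from Python (the file name and values are placeholders); the same dotted keys can be overridden on the CLI:

```python
from spacy import util

config = util.load_config(
    "config.cfg",
    overrides={"paths.train": "corpus/train.spacy", "system.seed": 0},
    interpolate=True,
)
print(config["training"])
```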

View File

@ -242,7 +242,7 @@ class JinjaToJS(object):
)
# It is assumed that this will be the absolute path to the template. It is used to work out
# related paths for inclues.
# related paths for includes.
self.template_path = template_path
if self.js_module_format not in JS_MODULE_FORMATS.keys():
@ -283,7 +283,7 @@ class JinjaToJS(object):
not yet been registered.
Args:
dependency (str): Thet dependency that needs to be imported.
dependency (str): The dependency that needs to be imported.
Returns:
str or None

View File

@ -88,7 +88,7 @@ export default class Juniper extends React.Component {
}
/**
* Request kernel and estabish a server connection via the JupyerLab service
* Request kernel and establish a server connection via the JupyerLab service
* @param {object} settings - The server settings.
* @returns {Promise} - A promise that's resolved with the kernel.
*/

View File

@ -86,7 +86,7 @@ export const remarkComponents = {
IntegrationLogo,
/**
* This is readded as `Image` it can be explicitly used in MDX files.
* This is re-added as `Image` it can be explicitly used in MDX files.
* For regular img elements it is not possible to pass properties
*/
Image,

View File

@ -15,7 +15,7 @@ $breakpoints: ( sm: 768px, md: 992px, lg: 1200px )
@media(max-width: #{map-get($breakpoints-max, $size)})
@content
// Scroll shadows for reponsive tables
// Scroll shadows for responsive tables
// adapted from David Bushell, http://codepen.io/dbushell/pen/wGaamR
// $scroll-shadow-color - color of shadow
// $scroll-shadow-side - side to cover shadow (left or right)