Merge pull request #10863 from danieldk/refactor-parser-master-20220527

Merge `master` into `feature/refactor-parser`
Daniël de Kok 2022-05-27 18:33:54 +02:00, committed by GitHub
commit 65c770c368
93 changed files with 3103 additions and 780 deletions

.github/contributors/fonfonx.md

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Xavier Fontaine |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2022-04-13 |
| GitHub username | fonfonx |
| Website (optional) | |


@@ -10,6 +10,7 @@ jobs:
 fail-fast: false
 matrix:
 branch: [master, v4]
+if: github.repository_owner == 'explosion'
 runs-on: ubuntu-latest
 steps:
 - name: Trigger buildkite build


@@ -10,6 +10,7 @@ jobs:
 fail-fast: false
 matrix:
 branch: [master, v4]
+if: github.repository_owner == 'explosion'
 runs-on: ubuntu-latest
 steps:
 - name: Checkout


@@ -1,9 +1,10 @@
 repos:
 - repo: https://github.com/ambv/black
-rev: 21.6b0
+rev: 22.3.0
 hooks:
 - id: black
 language_version: python3.7
+additional_dependencies: ['click==8.0.4']
 - repo: https://gitlab.com/pycqa/flake8
 rev: 3.9.2
 hooks:


@@ -144,7 +144,7 @@ Changes to `.py` files will be effective immediately.
 When fixing a bug, first create an
 [issue](https://github.com/explosion/spaCy/issues) if one does not already
 exist. The description text can be very short – we don't want to make this too
 bureaucratic.
 Next, add a test to the relevant file in the
@@ -233,7 +233,7 @@ also want to keep an eye on unused declared variables or repeated
 (i.e. overwritten) dictionary keys. If your code was formatted with `black`
 (see above), you shouldn't see any formatting-related warnings.
-The [`.flake8`](.flake8) config defines the configuration we use for this
+The `flake8` section in [`setup.cfg`](setup.cfg) defines the configuration we use for this
 codebase. For example, we're not super strict about the line length, and we're
 excluding very large files like lemmatization and tokenizer exception tables.
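As a rough illustration (not part of this diff), a `[flake8]` section in `setup.cfg` generally has this shape; the specific options and paths below are illustrative, not necessarily the ones spaCy ships:

```
[flake8]
ignore = E203, E266, E501, W503
max-line-length = 80
exclude =
    spacy/lang/*/lemmatizer
    spacy/lang/*/tokenizer_exceptions.py
```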


@@ -0,0 +1,36 @@
# Explosion-bot
Explosion-bot is a robot that can be invoked to help with running particular test commands.
## Permissions
Only maintainers have permissions to summon explosion-bot. Each of the open source repos that use explosion-bot has its own team(s) of maintainers, and only github users who are members of those teams can successfully run bot commands.
## Running robot commands
To summon the robot, write a github comment on the issue/PR you wish to test. The comment must be in the following format:
```
@explosion-bot please test_gpu
```
Some things to note:
* The `@explosion-bot please` must be the beginning of the command - you cannot add anything in front of this or else the robot won't know how to parse it. Adding anything at the end aside from the test name will also confuse the robot, so keep it simple!
* The command name (such as `test_gpu`) must be one of the tests that the bot knows how to run. The available commands are documented in the bot's [workflow config](https://github.com/explosion/spaCy/blob/master/.github/workflows/explosionbot.yml#L26) and must match exactly one of the commands listed there.
* The robot can't do multiple things at once, so if you want it to run multiple tests, you'll have to summon it with one comment per test.
* For the `test_gpu` command, you can specify an optional Thinc branch (when summoning the bot from the spaCy repo) or an optional spaCy branch (when summoning it from the thinc repo) with the `--thinc-branch` or `--spacy-branch` flag, respectively. By default, the bot pulls in the PR branch from the repo where the command was issued, and the main branch of the other repository. If you need to run against another branch, you can say (for example):
```
@explosion-bot please test_gpu --thinc-branch develop
```
You can also specify a branch from an unmerged PR:
```
@explosion-bot please test_gpu --thinc-branch refs/pull/633/head
```
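Because the bot handles one test per summons, kicking off two different tests means posting two separate comments, for example (the second command name below is only illustrative; the authoritative list of commands lives in the workflow config linked above):
```
@explosion-bot please test_gpu
```
```
@explosion-bot please test_slow
```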
## Troubleshooting
If the robot isn't responding to commands as expected, you can check its logs in the [Github Action](https://github.com/explosion/spaCy/actions/workflows/explosionbot.yml).
For each command sent to the bot, there should be a run of the `explosion-bot` workflow. In the `Install and run explosion-bot` step, towards the end of the logs you should see info about the configuration that the bot was run with, as well as any errors that the bot encountered.


@@ -5,8 +5,7 @@ requires = [
 "cymem>=2.0.2,<2.1.0",
 "preshed>=3.0.2,<3.1.0",
 "murmurhash>=0.28.0,<1.1.0",
-"thinc>=8.0.14,<8.1.0",
-"blis>=0.4.0,<0.8.0",
+"thinc>=8.1.0.dev0,<8.2.0",
 "pathy",
 "numpy>=1.15.0",
 ]


@@ -3,12 +3,11 @@ spacy-legacy>=3.0.9,<3.1.0
 spacy-loggers>=1.0.0,<2.0.0
 cymem>=2.0.2,<2.1.0
 preshed>=3.0.2,<3.1.0
-thinc>=8.0.14,<8.1.0
-blis>=0.4.0,<0.8.0
+thinc>=8.1.0.dev0,<8.2.0
 ml_datasets>=0.2.0,<0.3.0
 murmurhash>=0.28.0,<1.1.0
-wasabi>=0.8.1,<1.1.0
+wasabi>=0.9.1,<1.1.0
-srsly>=2.4.1,<3.0.0
+srsly>=2.4.3,<3.0.0
 catalogue>=2.0.6,<2.1.0
 typer>=0.3.0,<0.5.0
 pathy>=0.3.5
@@ -16,7 +15,7 @@ pathy>=0.3.5
 numpy>=1.15.0
 requests>=2.13.0,<3.0.0
 tqdm>=4.38.0,<5.0.0
-pydantic>=1.7.4,!=1.8,!=1.8.1,<1.9.0
+pydantic>=1.7.4,!=1.8,!=1.8.1,<1.10.0
 jinja2
 langcodes>=3.2.0,<4.0.0
 # Official Python utilities
@@ -31,7 +30,7 @@ pytest-timeout>=1.3.0,<2.0.0
 mock>=2.0.0,<3.0.0
 flake8>=3.8.0,<3.10.0
 hypothesis>=3.27.0,<7.0.0
-mypy==0.910
+mypy>=0.910,<=0.960
 types-dataclasses>=0.1.3; python_version < "3.7"
 types-mock>=0.1.1
 types-requests


@@ -38,7 +38,7 @@ setup_requires =
 cymem>=2.0.2,<2.1.0
 preshed>=3.0.2,<3.1.0
 murmurhash>=0.28.0,<1.1.0
-thinc>=8.0.14,<8.1.0
+thinc>=8.1.0.dev0,<8.2.0
 install_requires =
 # Our libraries
 spacy-legacy>=3.0.9,<3.1.0
@@ -46,10 +46,9 @@ install_requires =
 murmurhash>=0.28.0,<1.1.0
 cymem>=2.0.2,<2.1.0
 preshed>=3.0.2,<3.1.0
-thinc>=8.0.14,<8.1.0
-blis>=0.4.0,<0.8.0
-wasabi>=0.8.1,<1.1.0
-srsly>=2.4.1,<3.0.0
+thinc>=8.1.0.dev0,<8.2.0
+wasabi>=0.9.1,<1.1.0
+srsly>=2.4.3,<3.0.0
 catalogue>=2.0.6,<2.1.0
 typer>=0.3.0,<0.5.0
 pathy>=0.3.5
@@ -57,7 +56,7 @@ install_requires =
 tqdm>=4.38.0,<5.0.0
 numpy>=1.15.0
 requests>=2.13.0,<3.0.0
-pydantic>=1.7.4,!=1.8,!=1.8.1,<1.9.0
+pydantic>=1.7.4,!=1.8,!=1.8.1,<1.10.0
 jinja2
 # Official Python utilities
 setuptools


@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "3.2.2"
+__version__ = "3.3.0"
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
 __projects__ = "https://github.com/explosion/projects"


@@ -14,6 +14,7 @@ from .pretrain import pretrain # noqa: F401
 from .debug_data import debug_data # noqa: F401
 from .debug_config import debug_config # noqa: F401
 from .debug_model import debug_model # noqa: F401
+from .debug_diff import debug_diff # noqa: F401
 from .evaluate import evaluate # noqa: F401
 from .convert import convert # noqa: F401
 from .init_pipeline import init_pipeline_cli # noqa: F401


@@ -6,6 +6,7 @@ import sys
 import srsly
 from wasabi import Printer, MESSAGES, msg
 import typer
+import math
 from ._util import app, Arg, Opt, show_validation_error, parse_config_overrides
 from ._util import import_code, debug_cli
@@ -30,6 +31,12 @@ DEP_LABEL_THRESHOLD = 20
 # Minimum number of expected examples to train a new pipeline
 BLANK_MODEL_MIN_THRESHOLD = 100
 BLANK_MODEL_THRESHOLD = 2000
+# Arbitrary threshold where SpanCat performs well
+SPAN_DISTINCT_THRESHOLD = 1
+# Arbitrary threshold where SpanCat performs well
+BOUNDARY_DISTINCT_THRESHOLD = 1
+# Arbitrary threshold for filtering span lengths during reporting (percentage)
+SPAN_LENGTH_THRESHOLD_PERCENTAGE = 90
 @debug_cli.command(
@@ -247,6 +254,69 @@ def debug_data(
 msg.warn(f"No examples for texts WITHOUT new label '{label}'")
 has_no_neg_warning = True
with msg.loading("Obtaining span characteristics..."):
span_characteristics = _get_span_characteristics(
train_dataset, gold_train_data, spans_key
)
msg.info(f"Span characteristics for spans_key '{spans_key}'")
msg.info("SD = Span Distinctiveness, BD = Boundary Distinctiveness")
_print_span_characteristics(span_characteristics)
_span_freqs = _get_spans_length_freq_dist(
gold_train_data["spans_length"][spans_key]
)
_filtered_span_freqs = _filter_spans_length_freq_dist(
_span_freqs, threshold=SPAN_LENGTH_THRESHOLD_PERCENTAGE
)
msg.info(
f"Over {SPAN_LENGTH_THRESHOLD_PERCENTAGE}% of spans have lengths of 1 -- "
f"{max(_filtered_span_freqs.keys())} "
f"(min={span_characteristics['min_length']}, max={span_characteristics['max_length']}). "
f"The most common span lengths are: {_format_freqs(_filtered_span_freqs)}. "
"If you are using the n-gram suggester, note that omitting "
"infrequent n-gram lengths can greatly improve speed and "
"memory usage."
)
msg.text(
f"Full distribution of span lengths: {_format_freqs(_span_freqs)}",
show=verbose,
)
# Add report regarding span characteristics
if span_characteristics["avg_sd"] < SPAN_DISTINCT_THRESHOLD:
msg.warn("Spans may not be distinct from the rest of the corpus")
else:
msg.good("Spans are distinct from the rest of the corpus")
p_spans = span_characteristics["p_spans"].values()
all_span_tokens: Counter = sum(p_spans, Counter())
most_common_spans = [w for w, _ in all_span_tokens.most_common(10)]
msg.text(
"10 most common span tokens: {}".format(
_format_labels(most_common_spans)
),
show=verbose,
)
# Add report regarding span boundary characteristics
if span_characteristics["avg_bd"] < BOUNDARY_DISTINCT_THRESHOLD:
msg.warn("Boundary tokens are not distinct from the rest of the corpus")
else:
msg.good("Boundary tokens are distinct from the rest of the corpus")
p_bounds = span_characteristics["p_bounds"].values()
all_span_bound_tokens: Counter = sum(p_bounds, Counter())
most_common_bounds = [w for w, _ in all_span_bound_tokens.most_common(10)]
msg.text(
"10 most common span boundary tokens: {}".format(
_format_labels(most_common_bounds)
),
show=verbose,
)
 if has_low_data_warning:
 msg.text(
 f"To train a new span type, your data should include at "
@@ -647,6 +717,9 @@ def _compile_gold(
 "words": Counter(),
 "roots": Counter(),
 "spancat": dict(),
+"spans_length": dict(),
+"spans_per_type": dict(),
+"sb_per_type": dict(),
 "ws_ents": 0,
 "boundary_cross_ents": 0,
 "n_words": 0,
@@ -692,14 +765,59 @@ def _compile_gold(
 elif label == "-":
 data["ner"]["-"] += 1
 if "spancat" in factory_names:
-for span_key in list(eg.reference.spans.keys()):
-if span_key not in data["spancat"]:
-data["spancat"][span_key] = Counter()
-for i, span in enumerate(eg.reference.spans[span_key]):
+for spans_key in list(eg.reference.spans.keys()):
+# Obtain the span frequency
+if spans_key not in data["spancat"]:
+data["spancat"][spans_key] = Counter()
+for i, span in enumerate(eg.reference.spans[spans_key]):
 if span.label_ is None:
 continue
 else:
-data["spancat"][span_key][span.label_] += 1
+data["spancat"][spans_key][span.label_] += 1
# Obtain the span length
if spans_key not in data["spans_length"]:
data["spans_length"][spans_key] = dict()
for span in gold.spans[spans_key]:
if span.label_ is None:
continue
if span.label_ not in data["spans_length"][spans_key]:
data["spans_length"][spans_key][span.label_] = []
data["spans_length"][spans_key][span.label_].append(len(span))
# Obtain spans per span type
if spans_key not in data["spans_per_type"]:
data["spans_per_type"][spans_key] = dict()
for span in gold.spans[spans_key]:
if span.label_ not in data["spans_per_type"][spans_key]:
data["spans_per_type"][spans_key][span.label_] = []
data["spans_per_type"][spans_key][span.label_].append(span)
# Obtain boundary tokens per span type
window_size = 1
if spans_key not in data["sb_per_type"]:
data["sb_per_type"][spans_key] = dict()
for span in gold.spans[spans_key]:
if span.label_ not in data["sb_per_type"][spans_key]:
# Creating a data structure that holds the start and
# end tokens for each span type
data["sb_per_type"][spans_key][span.label_] = {
"start": [],
"end": [],
}
for offset in range(window_size):
sb_start_idx = span.start - (offset + 1)
if sb_start_idx >= 0:
data["sb_per_type"][spans_key][span.label_]["start"].append(
gold[sb_start_idx : sb_start_idx + 1]
)
sb_end_idx = span.end + (offset + 1)
if sb_end_idx <= len(gold):
data["sb_per_type"][spans_key][span.label_]["end"].append(
gold[sb_end_idx - 1 : sb_end_idx]
)
 if "textcat" in factory_names or "textcat_multilabel" in factory_names:
 data["cats"].update(gold.cats)
 if any(val not in (0, 1) for val in gold.cats.values()):
@@ -770,6 +888,16 @@ def _format_labels(
 return ", ".join([f"'{l}'" for l in cast(Iterable[str], labels)])
def _format_freqs(freqs: Dict[int, float], sort: bool = True) -> str:
if sort:
freqs = dict(sorted(freqs.items()))
_freqs = [(str(k), v) for k, v in freqs.items()]
return ", ".join(
[f"{l} ({c}%)" for l, c in cast(Iterable[Tuple[str, float]], _freqs)]
)
 def _get_examples_without_label(
 data: Sequence[Example],
 label: str,
@@ -824,3 +952,158 @@ def _get_labels_from_spancat(nlp: Language) -> Dict[str, Set[str]]:
 labels[pipe.key] = set()
 labels[pipe.key].update(pipe.labels)
 return labels
def _gmean(l: List) -> float:
"""Compute geometric mean of a list"""
return math.exp(math.fsum(math.log(i) for i in l) / len(l))
def _wgt_average(metric: Dict[str, float], frequencies: Counter) -> float:
total = sum(value * frequencies[span_type] for span_type, value in metric.items())
return total / sum(frequencies.values())
def _get_distribution(docs, normalize: bool = True) -> Counter:
"""Get the frequency distribution given a set of Docs"""
word_counts: Counter = Counter()
for doc in docs:
for token in doc:
# Normalize the text
t = token.text.lower().replace("``", '"').replace("''", '"')
word_counts[t] += 1
if normalize:
total = sum(word_counts.values(), 0.0)
word_counts = Counter({k: v / total for k, v in word_counts.items()})
return word_counts
def _get_kl_divergence(p: Counter, q: Counter) -> float:
"""Compute the Kullback-Leibler divergence from two frequency distributions"""
total = 0.0
for word, p_word in p.items():
total += p_word * math.log(p_word / q[word])
return total
def _format_span_row(span_data: List[Dict], labels: List[str]) -> List[Any]:
"""Compile into one list for easier reporting"""
d = {
label: [label] + list(round(d[label], 2) for d in span_data) for label in labels
}
return list(d.values())
def _get_span_characteristics(
examples: List[Example], compiled_gold: Dict[str, Any], spans_key: str
) -> Dict[str, Any]:
"""Obtain all span characteristics"""
data_labels = compiled_gold["spancat"][spans_key]
# Get lengths
span_length = {
label: _gmean(l)
for label, l in compiled_gold["spans_length"][spans_key].items()
}
min_lengths = [min(l) for l in compiled_gold["spans_length"][spans_key].values()]
max_lengths = [max(l) for l in compiled_gold["spans_length"][spans_key].values()]
# Get relevant distributions: corpus, spans, span boundaries
p_corpus = _get_distribution([eg.reference for eg in examples], normalize=True)
p_spans = {
label: _get_distribution(spans, normalize=True)
for label, spans in compiled_gold["spans_per_type"][spans_key].items()
}
p_bounds = {
label: _get_distribution(sb["start"] + sb["end"], normalize=True)
for label, sb in compiled_gold["sb_per_type"][spans_key].items()
}
# Compute for actual span characteristics
span_distinctiveness = {
label: _get_kl_divergence(freq_dist, p_corpus)
for label, freq_dist in p_spans.items()
}
sb_distinctiveness = {
label: _get_kl_divergence(freq_dist, p_corpus)
for label, freq_dist in p_bounds.items()
}
return {
"sd": span_distinctiveness,
"bd": sb_distinctiveness,
"lengths": span_length,
"min_length": min(min_lengths),
"max_length": max(max_lengths),
"avg_sd": _wgt_average(span_distinctiveness, data_labels),
"avg_bd": _wgt_average(sb_distinctiveness, data_labels),
"avg_length": _wgt_average(span_length, data_labels),
"labels": list(data_labels.keys()),
"p_spans": p_spans,
"p_bounds": p_bounds,
}
def _print_span_characteristics(span_characteristics: Dict[str, Any]):
"""Print all span characteristics into a table"""
headers = ("Span Type", "Length", "SD", "BD")
# Prepare table data with all span characteristics
table_data = [
span_characteristics["lengths"],
span_characteristics["sd"],
span_characteristics["bd"],
]
table = _format_span_row(
span_data=table_data, labels=span_characteristics["labels"]
)
# Prepare table footer with weighted averages
footer_data = [
span_characteristics["avg_length"],
span_characteristics["avg_sd"],
span_characteristics["avg_bd"],
]
footer = ["Wgt. Average"] + [str(round(f, 2)) for f in footer_data]
msg.table(table, footer=footer, header=headers, divider=True)
def _get_spans_length_freq_dist(
length_dict: Dict, threshold=SPAN_LENGTH_THRESHOLD_PERCENTAGE
) -> Dict[int, float]:
"""Get frequency distribution of spans length under a certain threshold"""
all_span_lengths = []
for _, lengths in length_dict.items():
all_span_lengths.extend(lengths)
freq_dist: Counter = Counter()
for i in all_span_lengths:
if freq_dist.get(i):
freq_dist[i] += 1
else:
freq_dist[i] = 1
# We will be working with percentages instead of raw counts
freq_dist_percentage = {}
for span_length, count in freq_dist.most_common():
percentage = (count / len(all_span_lengths)) * 100.0
percentage = round(percentage, 2)
freq_dist_percentage[span_length] = percentage
return freq_dist_percentage
def _filter_spans_length_freq_dist(
freq_dist: Dict[int, float], threshold: int
) -> Dict[int, float]:
"""Filter frequency distribution with respect to a threshold
We're going to filter all the span lengths that fall
around a percentage threshold when summed.
"""
total = 0.0
filtered_freq_dist = {}
for span_length, dist in freq_dist.items():
if total >= threshold:
break
else:
filtered_freq_dist[span_length] = dist
total += dist
return filtered_freq_dist

spacy/cli/debug_diff.py

@@ -0,0 +1,89 @@
from typing import Optional
import typer
from wasabi import Printer, diff_strings, MarkdownRenderer
from pathlib import Path
from thinc.api import Config
from ._util import debug_cli, Arg, Opt, show_validation_error, parse_config_overrides
from ..util import load_config
from .init_config import init_config, Optimizations
@debug_cli.command(
"diff-config",
context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
)
def debug_diff_cli(
# fmt: off
ctx: typer.Context,
config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
compare_to: Optional[Path] = Opt(None, help="Path to a config file to diff against, or `None` to compare against default settings", exists=True, allow_dash=True),
optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether the user config was optimized for efficiency or accuracy. Only relevant when comparing against the default config."),
gpu: bool = Opt(False, "--gpu", "-G", help="Whether the original config can run on a GPU. Only relevant when comparing against the default config."),
pretraining: bool = Opt(False, "--pretraining", "--pt", help="Whether to compare on a config with pretraining involved. Only relevant when comparing against the default config."),
markdown: bool = Opt(False, "--markdown", "-md", help="Generate Markdown for GitHub issues")
# fmt: on
):
"""Show a diff of a config file with respect to spaCy's defaults or another config file. If
additional settings were used in the creation of the config file, then you
must supply these as extra parameters to the command when comparing to the default settings. The generated diff
can also be used when posting to the discussion forum to provide more
information for the maintainers.
The `optimize`, `gpu`, and `pretraining` options are only relevant when
comparing against the default configuration (or specifically when `compare_to` is None).
DOCS: https://spacy.io/api/cli#debug-diff
"""
debug_diff(
config_path=config_path,
compare_to=compare_to,
gpu=gpu,
optimize=optimize,
pretraining=pretraining,
markdown=markdown,
)
def debug_diff(
config_path: Path,
compare_to: Optional[Path],
gpu: bool,
optimize: Optimizations,
pretraining: bool,
markdown: bool,
):
msg = Printer()
with show_validation_error(hint_fill=False):
user_config = load_config(config_path)
if compare_to:
other_config = load_config(compare_to)
else:
# Recreate a default config based from user's config
lang = user_config["nlp"]["lang"]
pipeline = list(user_config["nlp"]["pipeline"])
msg.info(f"Found user-defined language: '{lang}'")
msg.info(f"Found user-defined pipelines: {pipeline}")
other_config = init_config(
lang=lang,
pipeline=pipeline,
optimize=optimize.value,
gpu=gpu,
pretraining=pretraining,
silent=True,
)
user = user_config.to_str()
other = other_config.to_str()
if user == other:
msg.warn("No diff to show: configs are identical")
else:
diff_text = diff_strings(other, user, add_symbols=markdown)
if markdown:
md = MarkdownRenderer()
md.add(md.code_block(diff_text, "diff"))
print(md.text)
else:
print(diff_text)


@@ -12,6 +12,9 @@ from .._util import project_cli, Arg, Opt, PROJECT_FILE, load_project_config
 from .._util import get_checksum, download_file, git_checkout, get_git_version
 from .._util import SimpleFrozenDict, parse_config_overrides
+# Whether assets are extra if `extra` is not set.
+EXTRA_DEFAULT = False
 @project_cli.command(
 "assets",
@@ -21,7 +24,8 @@ def project_assets_cli(
 # fmt: off
 ctx: typer.Context, # This is only used to read additional arguments
 project_dir: Path = Arg(Path.cwd(), help="Path to cloned project. Defaults to current working directory.", exists=True, file_okay=False),
-sparse_checkout: bool = Opt(False, "--sparse", "-S", help="Use sparse checkout for assets provided via Git, to only check out and clone the files needed. Requires Git v22.2+.")
+sparse_checkout: bool = Opt(False, "--sparse", "-S", help="Use sparse checkout for assets provided via Git, to only check out and clone the files needed. Requires Git v22.2+."),
+extra: bool = Opt(False, "--extra", "-e", help="Download all assets, including those marked as 'extra'.")
 # fmt: on
 ):
 """Fetch project assets like datasets and pretrained weights. Assets are
@@ -32,7 +36,12 @@ def project_assets_cli(
 DOCS: https://spacy.io/api/cli#project-assets
 """
 overrides = parse_config_overrides(ctx.args)
-project_assets(project_dir, overrides=overrides, sparse_checkout=sparse_checkout)
+project_assets(
+project_dir,
+overrides=overrides,
+sparse_checkout=sparse_checkout,
+extra=extra,
+)
 def project_assets(
@@ -40,17 +49,29 @@ def project_assets(
 *,
 overrides: Dict[str, Any] = SimpleFrozenDict(),
 sparse_checkout: bool = False,
+extra: bool = False,
 ) -> None:
 """Fetch assets for a project using DVC if possible.
 project_dir (Path): Path to project directory.
+sparse_checkout (bool): Use sparse checkout for assets provided via Git, to only check out and clone the files
+needed.
+extra (bool): Whether to download all assets, including those marked as 'extra'.
 """
 project_path = ensure_path(project_dir)
 config = load_project_config(project_path, overrides=overrides)
-assets = config.get("assets", {})
+assets = [
+asset
+for asset in config.get("assets", [])
+if extra or not asset.get("extra", EXTRA_DEFAULT)
+]
 if not assets:
-msg.warn(f"No assets specified in {PROJECT_FILE}", exits=0)
+msg.warn(
+f"No assets specified in {PROJECT_FILE} (if assets are marked as extra, download them with --extra)",
+exits=0,
+)
 msg.info(f"Fetching {len(assets)} asset(s)")
 for asset in assets:
 dest = (project_dir / asset["dest"]).resolve()
 checksum = asset.get("checksum")


@@ -4,7 +4,7 @@ spaCy's built in visualization suite for dependencies and named entities.
 DOCS: https://spacy.io/api/top-level#displacy
 USAGE: https://spacy.io/usage/visualizers
 """
-from typing import List, Union, Iterable, Optional, Dict, Any, Callable
+from typing import Union, Iterable, Optional, Dict, Any, Callable
 import warnings
 from .render import DependencyRenderer, EntityRenderer, SpanRenderer
@@ -56,6 +56,10 @@ def render(
 renderer_func, converter = factories[style]
 renderer = renderer_func(options=options)
 parsed = [converter(doc, options) for doc in docs] if not manual else docs # type: ignore
+if manual:
+for doc in docs:
+if isinstance(doc, dict) and "ents" in doc:
+doc["ents"] = sorted(doc["ents"], key=lambda x: (x["start"], x["end"]))
 _html["parsed"] = renderer.render(parsed, page=page, minify=minify).strip() # type: ignore
 html = _html["parsed"]
 if RENDER_WRAPPER is not None:


@@ -1,4 +1,4 @@
-from typing import Any, Dict, List, Optional, Union
+from typing import Any, Dict, List, Optional, Tuple, Union
 import uuid
 import itertools
@@ -270,7 +270,7 @@ class DependencyRenderer:
 RETURNS (str): Rendered SVG markup.
 """
 self.levels = self.get_levels(arcs)
-self.highest_level = len(self.levels)
+self.highest_level = max(self.levels.values(), default=0)
 self.offset_y = self.distance / 2 * self.highest_level + self.arrow_stroke
 self.width = self.offset_x + len(words) * self.distance
 self.height = self.offset_y + 3 * self.word_spacing
@@ -330,7 +330,7 @@ class DependencyRenderer:
 if start < 0 or end < 0:
 error_args = dict(start=start, end=end, label=label, dir=direction)
 raise ValueError(Errors.E157.format(**error_args))
-level = self.levels.index(end - start) + 1
+level = self.levels[(start, end, label)]
 x_start = self.offset_x + start * self.distance + self.arrow_spacing
 if self.direction == "rtl":
 x_start = self.width - x_start
@@ -346,7 +346,7 @@ class DependencyRenderer:
 y_curve = self.offset_y - level * self.distance / 2
 if self.compact:
 y_curve = self.offset_y - level * self.distance / 6
-if y_curve == 0 and len(self.levels) > 5:
+if y_curve == 0 and max(self.levels.values(), default=0) > 5:
 y_curve = -self.distance
 arrowhead = self.get_arrowhead(direction, x_start, y, x_end)
 arc = self.get_arc(x_start, y, y_curve, x_end)
@@ -390,15 +390,23 @@ class DependencyRenderer:
 p1, p2, p3 = (end, end + self.arrow_width - 2, end - self.arrow_width + 2)
 return f"M{p1},{y + 2} L{p2},{y - self.arrow_width} {p3},{y - self.arrow_width}"
-def get_levels(self, arcs: List[Dict[str, Any]]) -> List[int]:
+def get_levels(self, arcs: List[Dict[str, Any]]) -> Dict[Tuple[int, int, str], int]:
 """Calculate available arc height "levels".
 Used to calculate arrow heights dynamically and without wasting space.
 args (list): Individual arcs and their start, end, direction and label.
-RETURNS (list): Arc levels sorted from lowest to highest.
+RETURNS (dict): Arc levels keyed by (start, end, label).
 """
-levels = set(map(lambda arc: arc["end"] - arc["start"], arcs))
-return sorted(list(levels))
+arcs = [dict(t) for t in {tuple(sorted(arc.items())) for arc in arcs}]
+length = max([arc["end"] for arc in arcs], default=0)
+max_level = [0] * length
+levels = {}
+for arc in sorted(arcs, key=lambda arc: arc["end"] - arc["start"]):
+level = max(max_level[arc["start"] : arc["end"]]) + 1
+for i in range(arc["start"], arc["end"]):
+max_level[i] = level
+levels[(arc["start"], arc["end"], arc["label"])] = level
+return levels
 class EntityRenderer:


@@ -1,4 +1,5 @@
 import warnings
+from .compat import Literal
 class ErrorsWithCodes(type):
@@ -26,7 +27,10 @@ def setup_default_warnings():
 filter_warning("once", error_msg="[W114]")
-def filter_warning(action: str, error_msg: str):
+def filter_warning(
+action: Literal["default", "error", "ignore", "always", "module", "once"],
+error_msg: str,
+):
 """Customize how spaCy should handle a certain warning.
 error_msg (str): e.g. "W006", or a full error message
@@ -196,6 +200,10 @@ class Warnings(metaclass=ErrorsWithCodes):
 "surprising to you, make sure the Doc was processed using a model "
 "that supports span categorization, and check the `doc.spans[spans_key]` "
 "property manually if necessary.")
+W118 = ("Term '{term}' not found in glossary. It may however be explained in documentation "
+"for the corpora used to train the language. Please check "
+"`nlp.meta[\"sources\"]` for any relevant links.")
+W119 = ("Overriding pipe name in `config` is not supported. Ignoring override '{name_in_config}'.")
 class Errors(metaclass=ErrorsWithCodes):
@@ -441,10 +449,10 @@ class Errors(metaclass=ErrorsWithCodes):
 "same, but found '{nlp}' and '{vocab}' respectively.")
 E152 = ("The attribute {attr} is not supported for token patterns. "
 "Please use the option `validate=True` with the Matcher, PhraseMatcher, "
-"or EntityRuler for more details.")
+"EntityRuler or AttributeRuler for more details.")
 E153 = ("The value type {vtype} is not supported for token patterns. "
 "Please use the option validate=True with Matcher, PhraseMatcher, "
-"or EntityRuler for more details.")
+"EntityRuler or AttributeRuler for more details.")
 E154 = ("One of the attributes or values is not supported for token "
 "patterns. Please use the option `validate=True` with the Matcher, "
 "PhraseMatcher, or EntityRuler for more details.")
@@ -524,6 +532,9 @@ class Errors(metaclass=ErrorsWithCodes):
 E202 = ("Unsupported {name} mode '{mode}'. Supported modes: {modes}.")
 # New errors added in v3.x
+E855 = ("Invalid {obj}: {obj} is not from the same doc.")
+E856 = ("Error accessing span at position {i}: out of bounds in span group "
+"of length {length}.")
 E857 = ("Entry '{name}' not found in edit tree lemmatizer labels.")
 E858 = ("The {mode} vector table does not support this operation. "
 "{alternative}")
@@ -899,7 +910,16 @@ class Errors(metaclass=ErrorsWithCodes):
 E1026 = ("Edit tree has an invalid format:\n{errors}")
 E1027 = ("AlignmentArray only supports slicing with a step of 1.")
 E1028 = ("AlignmentArray only supports indexing using an int or a slice.")
+E1029 = ("Edit tree cannot be applied to form.")
+E1030 = ("Edit tree identifier out of range.")
+E1031 = ("Could not find gold transition - see logs above.")
+E1032 = ("`{var}` should not be {forbidden}, but received {value}.")
+E1033 = ("Dimension {name} invalid -- only nO, nF, nP")
+E1034 = ("Node index {i} out of bounds ({length})")
+E1035 = ("Token index {i} out of bounds ({length})")
+E1036 = ("Cannot index into NoneNode")
+E1037 = ("Invalid attribute value '{attr}'.")
 # Deprecated model shortcuts, only used in errors and warnings
 OLD_MODEL_SHORTCUTS = {


@@ -1,3 +1,7 @@
+import warnings
+from .errors import Warnings
 def explain(term):
 """Get a description for a given POS tag, dependency label or entity type.
@@ -11,6 +15,8 @@ def explain(term):
 """
 if term in GLOSSARY:
 return GLOSSARY[term]
+else:
+warnings.warn(Warnings.W118.format(term=term))
 GLOSSARY = {
@@ -267,6 +273,7 @@ GLOSSARY = {
 "relcl": "relative clause modifier",
 "reparandum": "overridden disfluency",
 "root": "root",
+"ROOT": "root",
 "vocative": "vocative",
 "xcomp": "open clausal complement",
 # Dependency labels (German)


@@ -9,14 +9,14 @@ Example sentences to test spaCy and its language models.
 sentences = [
 "Apple está buscando comprar una startup del Reino Unido por mil millones de dólares.",
 "Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes.",
-"San Francisco analiza prohibir los robots delivery.",
+"San Francisco analiza prohibir los robots de reparto.",
 "Londres es una gran ciudad del Reino Unido.",
 "El gato come pescado.",
 "Veo al hombre con el telescopio.",
 "La araña come moscas.",
 "El pingüino incuba en su nido sobre el hielo.",
-"¿Dónde estais?",
+"¿Dónde estáis?",
-"¿Quién es el presidente Francés?",
+"¿Quién es el presidente francés?",
-"¿Dónde está encuentra la capital de Argentina?",
+"¿Dónde se encuentra la capital de Argentina?",
 "¿Cuándo nació José de San Martín?",
 ]


@@ -1,82 +1,80 @@
 STOP_WORDS = set(
 """
actualmente acuerdo adelante ademas además adrede afirmó agregó ahi ahora ahí a acuerdo adelante ademas además afirmó agregó ahi ahora ahí al algo alguna
al algo alguna algunas alguno algunos algún alli allí alrededor ambos ampleamos algunas alguno algunos algún alli allí alrededor ambos ante anterior antes
antano antaño ante anterior antes apenas aproximadamente aquel aquella aquellas apenas aproximadamente aquel aquella aquellas aquello aquellos aqui aquél
aquello aquellos aqui aquél aquélla aquéllas aquéllos aquí arriba arribaabajo aquélla aquéllas aquéllos aquí arriba aseguró asi así atras aun aunque añadió
aseguró asi así atras aun aunque ayer añadió aún aún
bajo bastante bien breve buen buena buenas bueno buenos bajo bastante bien breve buen buena buenas bueno buenos
cada casi cerca cierta ciertas cierto ciertos cinco claro comentó como con cada casi cierta ciertas cierto ciertos cinco claro comentó como con conmigo
conmigo conocer conseguimos conseguir considera consideró consigo consigue conocer conseguimos conseguir considera consideró consigo consigue consiguen
consiguen consigues contigo contra cosas creo cual cuales cualquier cuando consigues contigo contra creo cual cuales cualquier cuando cuanta cuantas
cuanta cuantas cuanto cuantos cuatro cuenta cuál cuáles cuándo cuánta cuántas cuanto cuantos cuatro cuenta cuál cuáles cuándo cuánta cuántas cuánto cuántos
cuánto cuántos cómo cómo
da dado dan dar de debajo debe deben debido decir dejó del delante demasiado da dado dan dar de debajo debe deben debido decir dejó del delante demasiado
demás dentro deprisa desde despacio despues después detras detrás dia dias dice demás dentro deprisa desde despacio despues después detras detrás dia dias dice
dicen dicho dieron diferente diferentes dijeron dijo dio donde dos durante día dicen dicho dieron diez diferente diferentes dijeron dijo dio doce donde dos
días dónde durante día días dónde
ejemplo el ella ellas ello ellos embargo empleais emplean emplear empleas e el ella ellas ello ellos embargo en encima encuentra enfrente enseguida
empleo en encima encuentra enfrente enseguida entonces entre era eramos eran entonces entre era eramos eran eras eres es esa esas ese eso esos esta estaba
eras eres es esa esas ese eso esos esta estaba estaban estado estados estais estaban estado estados estais estamos estan estar estará estas este esto estos
estamos estan estar estará estas este esto estos estoy estuvo está están ex estoy estuvo está están excepto existe existen explicó expresó él ésa ésas ése
excepto existe existen explicó expresó él ésa ésas ése ésos ésta éstas éste ésos ésta éstas éste éstos
éstos
fin final fue fuera fueron fui fuimos fin final fue fuera fueron fui fuimos
general gran grandes gueno gran grande grandes
ha haber habia habla hablan habrá había habían hace haceis hacemos hacen hacer ha haber habia habla hablan habrá había habían hace haceis hacemos hacen hacer
hacerlo haces hacia haciendo hago han hasta hay haya he hecho hemos hicieron hacerlo haces hacia haciendo hago han hasta hay haya he hecho hemos hicieron
hizo horas hoy hubo hizo hoy hubo
igual incluso indicó informo informó intenta intentais intentamos intentan igual incluso indicó informo informó ir
intentar intentas intento ir
junto junto
la lado largo las le lejos les llegó lleva llevar lo los luego lugar la lado largo las le les llegó lleva llevar lo los luego
mal manera manifestó mas mayor me mediante medio mejor mencionó menos menudo mi mal manera manifestó mas mayor me mediante medio mejor mencionó menos menudo mi
mia mias mientras mio mios mis misma mismas mismo mismos modo momento mucha mia mias mientras mio mios mis misma mismas mismo mismos modo mucha muchas
muchas mucho muchos muy más mía mías mío míos mucho muchos muy más mía mías mío míos
nada nadie ni ninguna ningunas ninguno ningunos ningún no nos nosotras nosotros nada nadie ni ninguna ningunas ninguno ningunos ningún no nos nosotras nosotros
nuestra nuestras nuestro nuestros nueva nuevas nuevo nuevos nunca nuestra nuestras nuestro nuestros nueva nuevas nueve nuevo nuevos nunca
ocho os otra otras otro otros o ocho once os otra otras otro otros
pais para parece parte partir pasada pasado paìs peor pero pesar poca pocas para parece parte partir pasada pasado paìs peor pero pesar poca pocas poco
poco pocos podeis podemos poder podria podriais podriamos podrian podrias podrá pocos podeis podemos poder podria podriais podriamos podrian podrias podrá
podrán podría podrían poner por porque posible primer primera primero primeros podrán podría podrían poner por porque posible primer primera primero primeros
principalmente pronto propia propias propio propios proximo próximo próximos pronto propia propias propio propios proximo próximo próximos pudo pueda puede
pudo pueda puede pueden puedo pues pueden puedo pues
qeu que quedó queremos quien quienes quiere quiza quizas quizá quizás quién quiénes qué qeu que quedó queremos quien quienes quiere quiza quizas quizá quizás quién
quiénes qué
raras realizado realizar realizó repente respecto realizado realizar realizó repente respecto
sabe sabeis sabemos saben saber sabes salvo se sea sean segun segunda segundo sabe sabeis sabemos saben saber sabes salvo se sea sean segun segunda segundo
según seis ser sera será serán sería señaló si sido siempre siendo siete sigue según seis ser sera será serán sería señaló si sido siempre siendo siete sigue
siguiente sin sino sobre sois sola solamente solas solo solos somos son soy siguiente sin sino sobre sois sola solamente solas solo solos somos son soy su
soyos su supuesto sus suya suyas suyo sólo supuesto sus suya suyas suyo suyos sólo
tal tambien también tampoco tan tanto tarde te temprano tendrá tendrán teneis tal tambien también tampoco tan tanto tarde te temprano tendrá tendrán teneis
tenemos tener tenga tengo tenido tenía tercera ti tiempo tiene tienen toda tenemos tener tenga tengo tenido tenía tercera tercero ti tiene tienen toda
todas todavia todavía todo todos total trabaja trabajais trabajamos trabajan todas todavia todavía todo todos total tras trata través tres tu tus tuvo tuya
trabajar trabajas trabajo tras trata través tres tu tus tuvo tuya tuyas tuyo tuyas tuyo tuyos
tuyos
ultimo un una unas uno unos usa usais usamos usan usar usas uso usted ustedes u ultimo un una unas uno unos usa usais usamos usan usar usas uso usted ustedes
última últimas último últimos última últimas último últimos
va vais valor vamos van varias varios vaya veces ver verdad verdadera verdadero va vais vamos van varias varios vaya veces ver verdad verdadera verdadero vez
vez vosotras vosotros voy vuestra vuestras vuestro vuestros vosotras vosotros voy vuestra vuestras vuestro vuestros
ya yo y ya yo
""".split() """.split()
) )


@@ -3,7 +3,7 @@ from ...attrs import LIKE_NUM
 _num_words = set(
 """
-zero un deux trois quatre cinq six sept huit neuf dix
+zero un une deux trois quatre cinq six sept huit neuf dix
 onze douze treize quatorze quinze seize dix-sept dix-huit dix-neuf
 vingt trente quarante cinquante soixante soixante-dix septante quatre-vingt huitante quatre-vingt-dix nonante
 cent mille mil million milliard billion quadrillion quintillion
@@ -13,7 +13,7 @@ sextillion septillion octillion nonillion decillion
 _ordinal_words = set(
 """
-premier deuxième second troisième quatrième cinquième sixième septième huitième neuvième dixième
+premier première deuxième second seconde troisième quatrième cinquième sixième septième huitième neuvième dixième
 onzième douzième treizième quatorzième quinzième seizième dix-septième dix-huitième dix-neuvième
 vingtième trentième quarantième cinquantième soixantième soixante-dixième septantième quatre-vingtième huitantième quatre-vingt-dixième nonantième
 centième millième millionnième milliardième billionnième quadrillionnième quintillionnième


@@ -2,22 +2,29 @@ from ...attrs import LIKE_NUM
 _num_words = [
-"không",
-"một",
-"hai",
-"ba",
-"bốn",
-"năm",
-"sáu",
-"bảy",
-"bẩy",
-"tám",
-"chín",
-"mười",
-"chục",
-"trăm",
-"nghìn",
-"tỷ",
+"không", # Zero
+"một", # One
+"mốt", # Also one, irreplacable in niché cases for unit digit such as "51"="năm mươi mốt"
+"hai", # Two
+"ba", # Three
+"bốn", # Four
+"tư", # Also four, used in certain cases for unit digit such as "54"="năm mươi tư"
+"năm", # Five
+"lăm", # Also five, irreplacable in niché cases for unit digit such as "55"="năm mươi lăm"
+"sáu", # Six
+"bảy", # Seven
+"bẩy", # Also seven, old fashioned
+"tám", # Eight
+"chín", # Nine
+"mười", # Ten
+"chục", # Also ten, used for counting in tens such as "20 eggs"="hai chục trứng"
+"trăm", # Hundred
+"nghìn", # Thousand
+"ngàn", # Also thousand, used in the south
+"vạn", # Ten thousand
+"triệu", # Million
+"tỷ", # Billion
+"tỉ", # Also billion, used in combinatorics such as "tỉ_phú"="billionaire"
 ]


@@ -774,6 +774,9 @@ class Language:
 name = name if name is not None else factory_name
 if name in self.component_names:
 raise ValueError(Errors.E007.format(name=name, opts=self.component_names))
+# Overriding pipe name in the config is not supported and will be ignored.
+if "name" in config:
+warnings.warn(Warnings.W119.format(name_in_config=config.pop("name")))
 if source is not None:
 # We're loading the component from a model. After loading the
 # component, we know its real factory name


@@ -85,7 +85,7 @@ class Table(OrderedDict):
 value: The value to set.
 """
 key = get_string_id(key)
-OrderedDict.__setitem__(self, key, value)
+OrderedDict.__setitem__(self, key, value) # type:ignore[assignment]
 self.bloom.add(key)
 def set(self, key: Union[str, int], value: Any) -> None:
@@ -104,7 +104,7 @@ class Table(OrderedDict):
 RETURNS: The value.
 """
 key = get_string_id(key)
-return OrderedDict.__getitem__(self, key)
+return OrderedDict.__getitem__(self, key) # type:ignore[index]
 def get(self, key: Union[str, int], default: Optional[Any] = None) -> Any:
 """Get the value for a given key. String keys will be hashed.
@@ -114,7 +114,7 @@ class Table(OrderedDict):
 RETURNS: The value.
 """
 key = get_string_id(key)
-return OrderedDict.get(self, key, default)
+return OrderedDict.get(self, key, default) # type:ignore[arg-type]
 def __contains__(self, key: Union[str, int]) -> bool: # type: ignore[override]
 """Check whether a key is in the table. String keys will be hashed.


@@ -690,18 +690,14 @@ cdef int8_t get_is_match(PatternStateC state,
 return True
-cdef int8_t get_is_final(PatternStateC state) nogil:
+cdef inline int8_t get_is_final(PatternStateC state) nogil:
 if state.pattern[1].quantifier == FINAL_ID:
-id_attr = state.pattern[1].attrs[0]
-if id_attr.attr != ID:
-with gil:
-raise ValueError(Errors.E074.format(attr=ID, bad_attr=id_attr.attr))
 return 1
 else:
 return 0
-cdef int8_t get_quantifier(PatternStateC state) nogil:
+cdef inline int8_t get_quantifier(PatternStateC state) nogil:
 return state.pattern.quantifier
@@ -790,6 +786,7 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
 def _get_attr_values(spec, string_store):
 attr_values = []
 for attr, value in spec.items():
+input_attr = attr
 if isinstance(attr, str):
 attr = attr.upper()
 if attr == '_':
@@ -818,7 +815,7 @@ def _get_attr_values(spec, string_store):
 attr_values.append((attr, value))
 else:
 # should be caught in validation
-raise ValueError(Errors.E152.format(attr=attr))
+raise ValueError(Errors.E152.format(attr=input_attr))
 return attr_values

View File

@ -118,6 +118,8 @@ cdef class PhraseMatcher:
# if token is not found, break out of the loop # if token is not found, break out of the loop
current_node = NULL current_node = NULL
break break
path_nodes.push_back(current_node)
path_keys.push_back(self._terminal_hash)
# remove the tokens from trie node if there are no other # remove the tokens from trie node if there are no other
# keywords with them # keywords with them
result = map_get(current_node, self._terminal_hash) result = map_get(current_node, self._terminal_hash)
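For reference, a minimal sketch of what this fix restores (the overlapping-terms case from issue #10643 tested further below): removing one phrase pattern must leave an overlapping pattern matchable.

    from spacy.lang.en import English
    from spacy.matcher import PhraseMatcher

    nlp = English()
    matcher = PhraseMatcher(nlp.vocab)
    matcher.add("BIN", [nlp.make_doc("binary")])
    matcher.add("BIN_DATA", [nlp.make_doc("binary data")])
    matcher.remove("BIN")  # must not corrupt trie nodes shared with "binary data"
    doc = nlp.make_doc("save the binary data")
    assert [nlp.vocab.strings[m_id] for m_id, _, _ in matcher(doc)] == ["BIN_DATA"]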

View File

@ -23,7 +23,7 @@ def build_nel_encoder(
((tok2vec >> list2ragged()) & build_span_maker()) ((tok2vec >> list2ragged()) & build_span_maker())
>> extract_spans() >> extract_spans()
>> reduce_mean() >> reduce_mean()
>> residual(Maxout(nO=token_width, nI=token_width, nP=2, dropout=0.0)) # type: ignore[arg-type] >> residual(Maxout(nO=token_width, nI=token_width, nP=2, dropout=0.0))
>> output_layer >> output_layer
) )
model.set_ref("output_layer", output_layer) model.set_ref("output_layer", output_layer)

View File

@ -1,5 +1,5 @@
from typing import Optional, List, cast
from functools import partial from functools import partial
from typing import Optional, List
from thinc.types import Floats2d from thinc.types import Floats2d
from thinc.api import Model, reduce_mean, Linear, list2ragged, Logistic from thinc.api import Model, reduce_mean, Linear, list2ragged, Logistic
@ -59,7 +59,8 @@ def build_simple_cnn_text_classifier(
resizable_layer=resizable_layer, resizable_layer=resizable_layer,
) )
model.set_ref("tok2vec", tok2vec) model.set_ref("tok2vec", tok2vec)
model.set_dim("nO", nO) # type: ignore # TODO: remove type ignore once Thinc has been updated if nO is not None:
model.set_dim("nO", cast(int, nO))
model.attrs["multi_label"] = not exclusive_classes model.attrs["multi_label"] = not exclusive_classes
return model return model
@ -85,7 +86,7 @@ def build_bow_text_classifier(
if not no_output_layer: if not no_output_layer:
fill_defaults["b"] = NEG_VALUE fill_defaults["b"] = NEG_VALUE
output_layer = softmax_activation() if exclusive_classes else Logistic() output_layer = softmax_activation() if exclusive_classes else Logistic()
resizable_layer = resizable( # type: ignore[var-annotated] resizable_layer: Model[Floats2d, Floats2d] = resizable(
sparse_linear, sparse_linear,
resize_layer=partial(resize_linear_weighted, fill_defaults=fill_defaults), resize_layer=partial(resize_linear_weighted, fill_defaults=fill_defaults),
) )
@ -93,7 +94,8 @@ def build_bow_text_classifier(
model = with_cpu(model, model.ops) model = with_cpu(model, model.ops)
if output_layer: if output_layer:
model = model >> with_cpu(output_layer, output_layer.ops) model = model >> with_cpu(output_layer, output_layer.ops)
model.set_dim("nO", nO) # type: ignore[arg-type] if nO is not None:
model.set_dim("nO", cast(int, nO))
model.set_ref("output_layer", sparse_linear) model.set_ref("output_layer", sparse_linear)
model.attrs["multi_label"] = not exclusive_classes model.attrs["multi_label"] = not exclusive_classes
model.attrs["resize_output"] = partial( model.attrs["resize_output"] = partial(
@ -129,8 +131,8 @@ def build_text_classifier_v2(
output_layer = Linear(nO=nO, nI=nO_double) >> Logistic() output_layer = Linear(nO=nO, nI=nO_double) >> Logistic()
model = (linear_model | cnn_model) >> output_layer model = (linear_model | cnn_model) >> output_layer
model.set_ref("tok2vec", tok2vec) model.set_ref("tok2vec", tok2vec)
if model.has_dim("nO") is not False: if model.has_dim("nO") is not False and nO is not None:
model.set_dim("nO", nO) # type: ignore[arg-type] model.set_dim("nO", cast(int, nO))
model.set_ref("output_layer", linear_model.get_ref("output_layer")) model.set_ref("output_layer", linear_model.get_ref("output_layer"))
model.set_ref("attention_layer", attention_layer) model.set_ref("attention_layer", attention_layer)
model.set_ref("maxout_layer", maxout_layer) model.set_ref("maxout_layer", maxout_layer)
@ -164,7 +166,7 @@ def build_text_classifier_lowdata(
>> list2ragged() >> list2ragged()
>> ParametricAttention(width) >> ParametricAttention(width)
>> reduce_sum() >> reduce_sum()
>> residual(Relu(width, width)) ** 2 # type: ignore[arg-type] >> residual(Relu(width, width)) ** 2
>> Linear(nO, width) >> Linear(nO, width)
) )
if dropout: if dropout:

View File

@ -1,5 +1,5 @@
from typing import Optional, List, Union, cast from typing import Optional, List, Union, cast
from thinc.types import Floats2d, Ints2d, Ragged from thinc.types import Floats2d, Ints2d, Ragged, Ints1d
from thinc.api import chain, clone, concatenate, with_array, with_padded from thinc.api import chain, clone, concatenate, with_array, with_padded
from thinc.api import Model, noop, list2ragged, ragged2list, HashEmbed from thinc.api import Model, noop, list2ragged, ragged2list, HashEmbed
from thinc.api import expand_window, residual, Maxout, Mish, PyTorchLSTM from thinc.api import expand_window, residual, Maxout, Mish, PyTorchLSTM
@ -159,7 +159,7 @@ def MultiHashEmbed(
embeddings = [make_hash_embed(i) for i in range(len(attrs))] embeddings = [make_hash_embed(i) for i in range(len(attrs))]
concat_size = width * (len(embeddings) + include_static_vectors) concat_size = width * (len(embeddings) + include_static_vectors)
max_out: Model[Ragged, Ragged] = with_array( max_out: Model[Ragged, Ragged] = with_array(
Maxout(width, concat_size, nP=3, dropout=0.0, normalize=True) # type: ignore Maxout(width, concat_size, nP=3, dropout=0.0, normalize=True)
) )
if include_static_vectors: if include_static_vectors:
feature_extractor: Model[List[Doc], Ragged] = chain( feature_extractor: Model[List[Doc], Ragged] = chain(
@ -173,7 +173,7 @@ def MultiHashEmbed(
StaticVectors(width, dropout=0.0), StaticVectors(width, dropout=0.0),
), ),
max_out, max_out,
cast(Model[Ragged, List[Floats2d]], ragged2list()), ragged2list(),
) )
else: else:
model = chain( model = chain(
@ -181,7 +181,7 @@ def MultiHashEmbed(
cast(Model[List[Ints2d], Ragged], list2ragged()), cast(Model[List[Ints2d], Ragged], list2ragged()),
with_array(concatenate(*embeddings)), with_array(concatenate(*embeddings)),
max_out, max_out,
cast(Model[Ragged, List[Floats2d]], ragged2list()), ragged2list(),
) )
return model return model
@ -232,12 +232,12 @@ def CharacterEmbed(
feature_extractor: Model[List[Doc], Ragged] = chain( feature_extractor: Model[List[Doc], Ragged] = chain(
FeatureExtractor([feature]), FeatureExtractor([feature]),
cast(Model[List[Ints2d], Ragged], list2ragged()), cast(Model[List[Ints2d], Ragged], list2ragged()),
with_array(HashEmbed(nO=width, nV=rows, column=0, seed=5)), # type: ignore with_array(HashEmbed(nO=width, nV=rows, column=0, seed=5)), # type: ignore[misc]
) )
max_out: Model[Ragged, Ragged] max_out: Model[Ragged, Ragged]
if include_static_vectors: if include_static_vectors:
max_out = with_array( max_out = with_array(
Maxout(width, nM * nC + (2 * width), nP=3, normalize=True, dropout=0.0) # type: ignore Maxout(width, nM * nC + (2 * width), nP=3, normalize=True, dropout=0.0)
) )
model = chain( model = chain(
concatenate( concatenate(
@ -246,11 +246,11 @@ def CharacterEmbed(
StaticVectors(width, dropout=0.0), StaticVectors(width, dropout=0.0),
), ),
max_out, max_out,
cast(Model[Ragged, List[Floats2d]], ragged2list()), ragged2list(),
) )
else: else:
max_out = with_array( max_out = with_array(
Maxout(width, nM * nC + width, nP=3, normalize=True, dropout=0.0) # type: ignore Maxout(width, nM * nC + width, nP=3, normalize=True, dropout=0.0)
) )
model = chain( model = chain(
concatenate( concatenate(
@ -258,7 +258,7 @@ def CharacterEmbed(
feature_extractor, feature_extractor,
), ),
max_out, max_out,
cast(Model[Ragged, List[Floats2d]], ragged2list()), ragged2list(),
) )
return model return model
@ -289,10 +289,10 @@ def MaxoutWindowEncoder(
normalize=True, normalize=True,
), ),
) )
model = clone(residual(cnn), depth) # type: ignore[arg-type] model = clone(residual(cnn), depth)
model.set_dim("nO", width) model.set_dim("nO", width)
receptive_field = window_size * depth receptive_field = window_size * depth
return with_array(model, pad=receptive_field) # type: ignore[arg-type] return with_array(model, pad=receptive_field)
@registry.architectures("spacy.MishWindowEncoder.v2") @registry.architectures("spacy.MishWindowEncoder.v2")
@ -313,9 +313,9 @@ def MishWindowEncoder(
expand_window(window_size=window_size), expand_window(window_size=window_size),
Mish(nO=width, nI=width * ((window_size * 2) + 1), dropout=0.0, normalize=True), Mish(nO=width, nI=width * ((window_size * 2) + 1), dropout=0.0, normalize=True),
) )
model = clone(residual(cnn), depth) # type: ignore[arg-type] model = clone(residual(cnn), depth)
model.set_dim("nO", width) model.set_dim("nO", width)
return with_array(model) # type: ignore[arg-type] return with_array(model)
@registry.architectures("spacy.TorchBiLSTMEncoder.v1") @registry.architectures("spacy.TorchBiLSTMEncoder.v1")

View File

@ -40,17 +40,15 @@ def forward(
if not token_count: if not token_count:
return _handle_empty(model.ops, model.get_dim("nO")) return _handle_empty(model.ops, model.get_dim("nO"))
key_attr: int = model.attrs["key_attr"] key_attr: int = model.attrs["key_attr"]
keys: Ints1d = model.ops.flatten( keys = model.ops.flatten([cast(Ints1d, doc.to_array(key_attr)) for doc in docs])
cast(Sequence, [doc.to_array(key_attr) for doc in docs])
)
vocab: Vocab = docs[0].vocab vocab: Vocab = docs[0].vocab
W = cast(Floats2d, model.ops.as_contig(model.get_param("W"))) W = cast(Floats2d, model.ops.as_contig(model.get_param("W")))
if vocab.vectors.mode == Mode.default: if vocab.vectors.mode == Mode.default:
V = cast(Floats2d, model.ops.asarray(vocab.vectors.data)) V = model.ops.asarray(vocab.vectors.data)
rows = vocab.vectors.find(keys=keys) rows = vocab.vectors.find(keys=keys)
V = model.ops.as_contig(V[rows]) V = model.ops.as_contig(V[rows])
elif vocab.vectors.mode == Mode.floret: elif vocab.vectors.mode == Mode.floret:
V = cast(Floats2d, vocab.vectors.get_batch(keys)) V = vocab.vectors.get_batch(keys)
V = model.ops.as_contig(V) V = model.ops.as_contig(V)
else: else:
raise RuntimeError(Errors.E896) raise RuntimeError(Errors.E896)
@ -62,9 +60,7 @@ def forward(
# Convert negative indices to 0-vectors # Convert negative indices to 0-vectors
# TODO: more options for UNK tokens # TODO: more options for UNK tokens
vectors_data[rows < 0] = 0 vectors_data[rows < 0] = 0
output = Ragged( output = Ragged(vectors_data, model.ops.asarray1i([len(doc) for doc in docs]))
vectors_data, model.ops.asarray([len(doc) for doc in docs], dtype="i") # type: ignore
)
mask = None mask = None
if is_train: if is_train:
mask = _get_drop_mask(model.ops, W.shape[0], model.attrs.get("dropout_rate")) mask = _get_drop_mask(model.ops, W.shape[0], model.attrs.get("dropout_rate"))
@ -77,7 +73,9 @@ def forward(
model.inc_grad( model.inc_grad(
"W", "W",
model.ops.gemm( model.ops.gemm(
cast(Floats2d, d_output.data), model.ops.as_contig(V), trans1=True cast(Floats2d, d_output.data),
cast(Floats2d, model.ops.as_contig(V)),
trans1=True,
), ),
) )
return [] return []

View File

@ -132,7 +132,7 @@ cdef class EditTrees:
could not be applied to the form. could not be applied to the form.
""" """
if tree_id >= self.trees.size(): if tree_id >= self.trees.size():
raise IndexError("Edit tree identifier out of range") raise IndexError(Errors.E1030)
lemma_pieces = [] lemma_pieces = []
try: try:
@ -154,7 +154,7 @@ cdef class EditTrees:
match_node = tree.inner.match_node match_node = tree.inner.match_node
if match_node.prefix_len + match_node.suffix_len > len(form_part): if match_node.prefix_len + match_node.suffix_len > len(form_part):
raise ValueError("Edit tree cannot be applied to form") raise ValueError(Errors.E1029)
suffix_start = len(form_part) - match_node.suffix_len suffix_start = len(form_part) - match_node.suffix_len
@ -169,7 +169,7 @@ cdef class EditTrees:
if form_part == self.strings[tree.inner.subst_node.orig]: if form_part == self.strings[tree.inner.subst_node.orig]:
lemma_pieces.append(self.strings[tree.inner.subst_node.subst]) lemma_pieces.append(self.strings[tree.inner.subst_node.subst])
else: else:
raise ValueError("Edit tree cannot be applied to form") raise ValueError(Errors.E1029)
cpdef unicode tree_to_str(self, uint32_t tree_id): cpdef unicode tree_to_str(self, uint32_t tree_id):
"""Return the tree as a string. The tree tree string is formatted """Return the tree as a string. The tree tree string is formatted
@ -187,7 +187,7 @@ cdef class EditTrees:
""" """
if tree_id >= self.trees.size(): if tree_id >= self.trees.size():
raise IndexError("Edit tree identifier out of range") raise IndexError(Errors.E1030)
cdef EditTreeC tree = self.trees[tree_id] cdef EditTreeC tree = self.trees[tree_id]
cdef SubstNodeC subst_node cdef SubstNodeC subst_node

View File

@ -826,7 +826,7 @@ cdef class ArcEager(TransitionSystem):
for i in range(self.n_moves): for i in range(self.n_moves):
print(self.get_class_name(i), is_valid[i], costs[i]) print(self.get_class_name(i), is_valid[i], costs[i])
print("Gold sent starts?", is_sent_start(&gold_state, state.B(0)), is_sent_start(&gold_state, state.B(1))) print("Gold sent starts?", is_sent_start(&gold_state, state.B(0)), is_sent_start(&gold_state, state.B(1)))
raise ValueError("Could not find gold transition - see logs above.") raise ValueError(Errors.E1031)
def get_oracle_sequence_from_state(self, StateClass state, ArcEagerGold gold, _debug=None): def get_oracle_sequence_from_state(self, StateClass state, ArcEagerGold gold, _debug=None):
cdef int i cdef int i

View File

@ -138,7 +138,7 @@ class EditTreeLemmatizer(TrainablePipe):
truths.append(eg_truths) truths.append(eg_truths)
d_scores, loss = loss_func(scores, truths) # type: ignore d_scores, loss = loss_func(scores, truths)
if self.model.ops.xp.isnan(loss): if self.model.ops.xp.isnan(loss):
raise ValueError(Errors.E910.format(name=self.name)) raise ValueError(Errors.E910.format(name=self.name))

View File

@ -234,10 +234,11 @@ class EntityLinker(TrainablePipe):
nO = self.kb.entity_vector_length nO = self.kb.entity_vector_length
doc_sample = [] doc_sample = []
vector_sample = [] vector_sample = []
for example in islice(get_examples(), 10): for eg in islice(get_examples(), 10):
doc = example.x doc = eg.x
if self.use_gold_ents: if self.use_gold_ents:
doc.ents = example.y.ents ents, _ = eg.get_aligned_ents_and_ner()
doc.ents = ents
doc_sample.append(doc) doc_sample.append(doc)
vector_sample.append(self.model.ops.alloc1f(nO)) vector_sample.append(self.model.ops.alloc1f(nO))
assert len(doc_sample) > 0, Errors.E923.format(name=self.name) assert len(doc_sample) > 0, Errors.E923.format(name=self.name)
@ -312,7 +313,8 @@ class EntityLinker(TrainablePipe):
for doc, ex in zip(docs, examples): for doc, ex in zip(docs, examples):
if self.use_gold_ents: if self.use_gold_ents:
doc.ents = ex.reference.ents ents, _ = ex.get_aligned_ents_and_ner()
doc.ents = ents
else: else:
# only keep matching ents # only keep matching ents
doc.ents = ex.get_matching_ents() doc.ents = ex.get_matching_ents()
@ -345,7 +347,7 @@ class EntityLinker(TrainablePipe):
for eg in examples: for eg in examples:
kb_ids = eg.get_aligned("ENT_KB_ID", as_string=True) kb_ids = eg.get_aligned("ENT_KB_ID", as_string=True)
for ent in eg.reference.ents: for ent in eg.get_matching_ents():
kb_id = kb_ids[ent.start] kb_id = kb_ids[ent.start]
if kb_id: if kb_id:
entity_encoding = self.kb.get_vector(kb_id) entity_encoding = self.kb.get_vector(kb_id)
@ -356,7 +358,11 @@ class EntityLinker(TrainablePipe):
entity_encodings = self.model.ops.asarray(entity_encodings, dtype="float32") entity_encodings = self.model.ops.asarray(entity_encodings, dtype="float32")
selected_encodings = sentence_encodings[keep_ents] selected_encodings = sentence_encodings[keep_ents]
# If the entity encodings list is empty, then # if there are no matches, short circuit
if not keep_ents:
out = self.model.ops.alloc2f(*sentence_encodings.shape)
return 0, out
if selected_encodings.shape != entity_encodings.shape: if selected_encodings.shape != entity_encodings.shape:
err = Errors.E147.format( err = Errors.E147.format(
method="get_loss", msg="gold entities do not match up" method="get_loss", msg="gold entities do not match up"

View File

@ -159,10 +159,8 @@ class EntityRuler(Pipe):
self._require_patterns() self._require_patterns()
with warnings.catch_warnings(): with warnings.catch_warnings():
warnings.filterwarnings("ignore", message="\\[W036") warnings.filterwarnings("ignore", message="\\[W036")
matches = cast( matches = list(self.matcher(doc)) + list(self.phrase_matcher(doc))
List[Tuple[int, int, int]],
list(self.matcher(doc)) + list(self.phrase_matcher(doc)),
)
final_matches = set( final_matches = set(
[(m_id, start, end) for m_id, start, end in matches if start != end] [(m_id, start, end) for m_id, start, end in matches if start != end]
) )

View File

@ -213,15 +213,14 @@ class EntityLinker_v1(TrainablePipe):
if kb_id: if kb_id:
entity_encoding = self.kb.get_vector(kb_id) entity_encoding = self.kb.get_vector(kb_id)
entity_encodings.append(entity_encoding) entity_encodings.append(entity_encoding)
entity_encodings = self.model.ops.asarray(entity_encodings, dtype="float32") entity_encodings = self.model.ops.asarray2f(entity_encodings)
if sentence_encodings.shape != entity_encodings.shape: if sentence_encodings.shape != entity_encodings.shape:
err = Errors.E147.format( err = Errors.E147.format(
method="get_loss", msg="gold entities do not match up" method="get_loss", msg="gold entities do not match up"
) )
raise RuntimeError(err) raise RuntimeError(err)
# TODO: fix typing issue here gradients = self.distance.get_grad(sentence_encodings, entity_encodings)
gradients = self.distance.get_grad(sentence_encodings, entity_encodings) # type: ignore loss = self.distance.get_loss(sentence_encodings, entity_encodings)
loss = self.distance.get_loss(sentence_encodings, entity_encodings) # type: ignore
loss = loss / len(entity_encodings) loss = loss / len(entity_encodings)
return float(loss), gradients return float(loss), gradients

View File

@ -75,7 +75,7 @@ def build_ngram_suggester(sizes: List[int]) -> Suggester:
if spans: if spans:
assert spans[-1].ndim == 2, spans[-1].shape assert spans[-1].ndim == 2, spans[-1].shape
lengths.append(length) lengths.append(length)
lengths_array = cast(Ints1d, ops.asarray(lengths, dtype="i")) lengths_array = ops.asarray1i(lengths)
if len(spans) > 0: if len(spans) > 0:
output = Ragged(ops.xp.vstack(spans), lengths_array) output = Ragged(ops.xp.vstack(spans), lengths_array)
else: else:

View File

@ -104,7 +104,7 @@ def get_arg_model(
sig_args[param.name] = (annotation, default) sig_args[param.name] = (annotation, default)
is_strict = strict and not has_variable is_strict = strict and not has_variable
sig_args["__config__"] = ArgSchemaConfig if is_strict else ArgSchemaConfigExtra # type: ignore[assignment] sig_args["__config__"] = ArgSchemaConfig if is_strict else ArgSchemaConfigExtra # type: ignore[assignment]
return create_model(name, **sig_args) # type: ignore[arg-type, return-value] return create_model(name, **sig_args) # type: ignore[call-overload, arg-type, return-value]
def validate_init_settings( def validate_init_settings(

View File

@ -1,4 +1,4 @@
from typing import Optional, Iterable, Iterator, Union, Any from typing import Optional, Iterable, Iterator, Union, Any, overload
from pathlib import Path from pathlib import Path
def get_string_id(key: Union[str, int]) -> int: ... def get_string_id(key: Union[str, int]) -> int: ...
@ -7,7 +7,10 @@ class StringStore:
def __init__( def __init__(
self, strings: Optional[Iterable[str]] = ..., freeze: bool = ... self, strings: Optional[Iterable[str]] = ..., freeze: bool = ...
) -> None: ... ) -> None: ...
def __getitem__(self, string_or_id: Union[bytes, str, int]) -> Union[str, int]: ... @overload
def __getitem__(self, string_or_id: Union[bytes, str]) -> int: ...
@overload
def __getitem__(self, string_or_id: int) -> str: ...
def as_int(self, key: Union[bytes, str, int]) -> int: ... def as_int(self, key: Union[bytes, str, int]) -> int: ...
def as_string(self, key: Union[bytes, str, int]) -> str: ... def as_string(self, key: Union[bytes, str, int]) -> str: ...
def add(self, string: str) -> int: ... def add(self, string: str) -> int: ...
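For reference, a minimal sketch of the asymmetry the new overloads describe: indexing the store with a string returns its hash, indexing with a hash returns the string.

    from spacy.strings import StringStore

    strings = StringStore(["apple"])
    apple_hash = strings["apple"]            # str -> int (hash)
    assert strings[apple_hash] == "apple"    # int -> str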

View File

@ -357,6 +357,11 @@ def sv_tokenizer():
return get_lang_class("sv")().tokenizer return get_lang_class("sv")().tokenizer
@pytest.fixture(scope="session")
def ta_tokenizer():
return get_lang_class("ta")().tokenizer
@pytest.fixture(scope="session") @pytest.fixture(scope="session")
def th_tokenizer(): def th_tokenizer():
pytest.importorskip("pythainlp") pytest.importorskip("pythainlp")

View File

@ -1,6 +1,7 @@
import weakref import weakref
import numpy import numpy
from numpy.testing import assert_array_equal
import pytest import pytest
from thinc.api import NumpyOps, get_current_ops from thinc.api import NumpyOps, get_current_ops
@ -10,7 +11,7 @@ from spacy.lang.en import English
from spacy.lang.xx import MultiLanguage from spacy.lang.xx import MultiLanguage
from spacy.language import Language from spacy.language import Language
from spacy.lexeme import Lexeme from spacy.lexeme import Lexeme
from spacy.tokens import Doc, Span, Token from spacy.tokens import Doc, Span, SpanGroup, Token
from spacy.vocab import Vocab from spacy.vocab import Vocab
from .test_underscore import clean_underscore # noqa: F401 from .test_underscore import clean_underscore # noqa: F401
@ -634,6 +635,14 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
assert "group" in m_doc.spans assert "group" in m_doc.spans
assert span_group_texts == sorted([s.text for s in m_doc.spans["group"]]) assert span_group_texts == sorted([s.text for s in m_doc.spans["group"]])
# can exclude spans
m_doc = Doc.from_docs(en_docs, exclude=["spans"])
assert "group" not in m_doc.spans
# can exclude user_data
m_doc = Doc.from_docs(en_docs, exclude=["user_data"])
assert m_doc.user_data == {}
# can merge empty docs # can merge empty docs
doc = Doc.from_docs([en_tokenizer("")] * 10) doc = Doc.from_docs([en_tokenizer("")] * 10)
@ -647,6 +656,20 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
assert "group" in m_doc.spans assert "group" in m_doc.spans
assert len(m_doc.spans["group"]) == 0 assert len(m_doc.spans["group"]) == 0
# with tensor
ops = get_current_ops()
for doc in en_docs:
doc.tensor = ops.asarray([[len(t.text), 0.0] for t in doc])
m_doc = Doc.from_docs(en_docs)
assert_array_equal(
ops.to_numpy(m_doc.tensor),
ops.to_numpy(ops.xp.vstack([doc.tensor for doc in en_docs if len(doc)])),
)
# can exclude tensor
m_doc = Doc.from_docs(en_docs, exclude=["tensor"])
assert m_doc.tensor.shape == (0,)
def test_doc_api_from_docs_ents(en_tokenizer): def test_doc_api_from_docs_ents(en_tokenizer):
texts = ["Merging the docs is fun.", "They don't think alike."] texts = ["Merging the docs is fun.", "They don't think alike."]
@ -941,3 +964,13 @@ def test_doc_spans_copy(en_tokenizer):
assert weakref.ref(doc1) == doc1.spans.doc_ref assert weakref.ref(doc1) == doc1.spans.doc_ref
doc2 = doc1.copy() doc2 = doc1.copy()
assert weakref.ref(doc2) == doc2.spans.doc_ref assert weakref.ref(doc2) == doc2.spans.doc_ref
def test_doc_spans_setdefault(en_tokenizer):
doc = en_tokenizer("Some text about Colombia and the Czech Republic")
doc.spans.setdefault("key1")
assert len(doc.spans["key1"]) == 0
doc.spans.setdefault("key2", default=[doc[0:1]])
assert len(doc.spans["key2"]) == 1
doc.spans.setdefault("key3", default=SpanGroup(doc, spans=[doc[0:1], doc[1:2]]))
assert len(doc.spans["key3"]) == 2

View File

@ -0,0 +1,242 @@
import pytest
from random import Random
from spacy.matcher import Matcher
from spacy.tokens import Span, SpanGroup
@pytest.fixture
def doc(en_tokenizer):
doc = en_tokenizer("0 1 2 3 4 5 6")
matcher = Matcher(en_tokenizer.vocab, validate=True)
# fmt: off
matcher.add("4", [[{}, {}, {}, {}]])
matcher.add("2", [[{}, {}, ]])
matcher.add("1", [[{}, ]])
# fmt: on
matches = matcher(doc)
spans = []
for match in matches:
spans.append(
Span(doc, match[1], match[2], en_tokenizer.vocab.strings[match[0]])
)
Random(42).shuffle(spans)
doc.spans["SPANS"] = SpanGroup(
doc, name="SPANS", attrs={"key": "value"}, spans=spans
)
return doc
@pytest.fixture
def other_doc(en_tokenizer):
doc = en_tokenizer("0 1 2 3 4 5 6")
matcher = Matcher(en_tokenizer.vocab, validate=True)
# fmt: off
matcher.add("4", [[{}, {}, {}, {}]])
matcher.add("2", [[{}, {}, ]])
matcher.add("1", [[{}, ]])
# fmt: on
matches = matcher(doc)
spans = []
for match in matches:
spans.append(
Span(doc, match[1], match[2], en_tokenizer.vocab.strings[match[0]])
)
Random(42).shuffle(spans)
doc.spans["SPANS"] = SpanGroup(
doc, name="SPANS", attrs={"key": "value"}, spans=spans
)
return doc
@pytest.fixture
def span_group(en_tokenizer):
doc = en_tokenizer("0 1 2 3 4 5 6")
matcher = Matcher(en_tokenizer.vocab, validate=True)
# fmt: off
matcher.add("4", [[{}, {}, {}, {}]])
matcher.add("2", [[{}, {}, ]])
matcher.add("1", [[{}, ]])
# fmt: on
matches = matcher(doc)
spans = []
for match in matches:
spans.append(
Span(doc, match[1], match[2], en_tokenizer.vocab.strings[match[0]])
)
Random(42).shuffle(spans)
doc.spans["SPANS"] = SpanGroup(
doc, name="SPANS", attrs={"key": "value"}, spans=spans
)
def test_span_group_copy(doc):
span_group = doc.spans["SPANS"]
clone = span_group.copy()
assert clone != span_group
assert clone.name == span_group.name
assert clone.attrs == span_group.attrs
assert len(clone) == len(span_group)
assert list(span_group) == list(clone)
clone.name = "new_name"
clone.attrs["key"] = "new_value"
clone.append(Span(doc, 0, 6, "LABEL"))
assert clone.name != span_group.name
assert clone.attrs != span_group.attrs
assert span_group.attrs["key"] == "value"
assert list(span_group) != list(clone)
def test_span_group_set_item(doc, other_doc):
span_group = doc.spans["SPANS"]
index = 5
span = span_group[index]
span.label_ = "NEW LABEL"
span.kb_id = doc.vocab.strings["KB_ID"]
assert span_group[index].label != span.label
assert span_group[index].kb_id != span.kb_id
span_group[index] = span
assert span_group[index].start == span.start
assert span_group[index].end == span.end
assert span_group[index].label == span.label
assert span_group[index].kb_id == span.kb_id
assert span_group[index] == span
with pytest.raises(IndexError):
span_group[-100] = span
with pytest.raises(IndexError):
span_group[100] = span
span = Span(other_doc, 0, 2)
with pytest.raises(ValueError):
span_group[index] = span
def test_span_group_has_overlap(doc):
span_group = doc.spans["SPANS"]
assert span_group.has_overlap
def test_span_group_concat(doc, other_doc):
span_group_1 = doc.spans["SPANS"]
spans = [doc[0:5], doc[0:6]]
span_group_2 = SpanGroup(
doc,
name="MORE_SPANS",
attrs={"key": "new_value", "new_key": "new_value"},
spans=spans,
)
span_group_3 = span_group_1._concat(span_group_2)
assert span_group_3.name == span_group_1.name
assert span_group_3.attrs == {"key": "value", "new_key": "new_value"}
span_list_expected = list(span_group_1) + list(span_group_2)
assert list(span_group_3) == list(span_list_expected)
# Inplace
span_list_expected = list(span_group_1) + list(span_group_2)
span_group_3 = span_group_1._concat(span_group_2, inplace=True)
assert span_group_3 == span_group_1
assert span_group_3.name == span_group_1.name
assert span_group_3.attrs == {"key": "value", "new_key": "new_value"}
assert list(span_group_3) == list(span_list_expected)
span_group_2 = other_doc.spans["SPANS"]
with pytest.raises(ValueError):
span_group_1._concat(span_group_2)
def test_span_doc_delitem(doc):
span_group = doc.spans["SPANS"]
length = len(span_group)
index = 5
span = span_group[index]
next_span = span_group[index + 1]
del span_group[index]
assert len(span_group) == length - 1
assert span_group[index] != span
assert span_group[index] == next_span
with pytest.raises(IndexError):
del span_group[-100]
with pytest.raises(IndexError):
del span_group[100]
def test_span_group_add(doc):
span_group_1 = doc.spans["SPANS"]
spans = [doc[0:5], doc[0:6]]
span_group_2 = SpanGroup(
doc,
name="MORE_SPANS",
attrs={"key": "new_value", "new_key": "new_value"},
spans=spans,
)
span_group_3_expected = span_group_1._concat(span_group_2)
span_group_3 = span_group_1 + span_group_2
assert len(span_group_3) == len(span_group_3_expected)
assert span_group_3.attrs == {"key": "value", "new_key": "new_value"}
assert list(span_group_3) == list(span_group_3_expected)
def test_span_group_iadd(doc):
span_group_1 = doc.spans["SPANS"].copy()
spans = [doc[0:5], doc[0:6]]
span_group_2 = SpanGroup(
doc,
name="MORE_SPANS",
attrs={"key": "new_value", "new_key": "new_value"},
spans=spans,
)
span_group_1_expected = span_group_1._concat(span_group_2)
span_group_1 += span_group_2
assert len(span_group_1) == len(span_group_1_expected)
assert span_group_1.attrs == {"key": "value", "new_key": "new_value"}
assert list(span_group_1) == list(span_group_1_expected)
span_group_1 = doc.spans["SPANS"].copy()
span_group_1 += spans
assert len(span_group_1) == len(span_group_1_expected)
assert span_group_1.attrs == {
"key": "value",
}
assert list(span_group_1) == list(span_group_1_expected)
def test_span_group_extend(doc):
span_group_1 = doc.spans["SPANS"].copy()
spans = [doc[0:5], doc[0:6]]
span_group_2 = SpanGroup(
doc,
name="MORE_SPANS",
attrs={"key": "new_value", "new_key": "new_value"},
spans=spans,
)
span_group_1_expected = span_group_1._concat(span_group_2)
span_group_1.extend(span_group_2)
assert len(span_group_1) == len(span_group_1_expected)
assert span_group_1.attrs == {"key": "value", "new_key": "new_value"}
assert list(span_group_1) == list(span_group_1_expected)
span_group_1 = doc.spans["SPANS"]
span_group_1.extend(spans)
assert len(span_group_1) == len(span_group_1_expected)
assert span_group_1.attrs == {"key": "value"}
assert list(span_group_1) == list(span_group_1_expected)
def test_span_group_dealloc(span_group):
with pytest.raises(AttributeError):
print(span_group.doc)
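For reference, a minimal sketch (on an assumed blank English pipeline) of the SpanGroup operations these tests cover: copying, in-place concatenation with `+=`, and `extend` with either another group or a plain list of spans.

    from spacy.lang.en import English
    from spacy.tokens import SpanGroup

    nlp = English()
    doc = nlp.make_doc("0 1 2 3 4 5 6")
    group = SpanGroup(doc, name="demo", attrs={"key": "value"}, spans=[doc[0:2]])
    clone = group.copy()                                     # independent copy of the group
    group += SpanGroup(doc, name="more", spans=[doc[2:4]])   # concatenate another group in place
    group.extend([doc[4:6]])                                 # or extend with a list of spans
    assert len(group) == 3 and len(clone) == 1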

View File

View File

@ -0,0 +1,25 @@
import pytest
from spacy.lang.ta import Tamil
# Wikipedia excerpt: https://en.wikipedia.org/wiki/Chennai (Tamil Language)
TAMIL_BASIC_TOKENIZER_SENTENCIZER_TEST_TEXT = """சென்னை (Chennai) தமிழ்நாட்டின் தலைநகரமும், இந்தியாவின் நான்காவது பெரிய நகரமும் ஆகும். 1996 ஆம் ஆண்டுக்கு முன்னர் இந்நகரம், மதராசு பட்டினம், மெட்ராஸ் (Madras) மற்றும் சென்னப்பட்டினம் என்றும் அழைக்கப்பட்டு வந்தது. சென்னை, வங்காள விரிகுடாவின் கரையில் அமைந்த துறைமுக நகரங்களுள் ஒன்று. சுமார் 10 மில்லியன் (ஒரு கோடி) மக்கள் வாழும் இந்நகரம், உலகின் 35 பெரிய மாநகரங்களுள் ஒன்று. 17ஆம் நூற்றாண்டில் ஆங்கிலேயர் சென்னையில் கால் பதித்தது முதல், சென்னை நகரம் ஒரு முக்கிய நகரமாக வளர்ந்து வந்திருக்கிறது. சென்னை தென்னிந்தியாவின் வாசலாகக் கருதப்படுகிறது. சென்னை நகரில் உள்ள மெரினா கடற்கரை உலகின் நீளமான கடற்கரைகளுள் ஒன்று. சென்னை கோலிவுட் (Kollywood) என அறியப்படும் தமிழ்த் திரைப்படத் துறையின் தாயகம் ஆகும். பல விளையாட்டு அரங்கங்கள் உள்ள சென்னையில் பல விளையாட்டுப் போட்டிகளும் நடைபெறுகின்றன."""
@pytest.mark.parametrize(
"text, num_tokens",
[(TAMIL_BASIC_TOKENIZER_SENTENCIZER_TEST_TEXT, 23 + 90)], # Punctuation + rest
)
def test_long_text(ta_tokenizer, text, num_tokens):
tokens = ta_tokenizer(text)
assert len(tokens) == num_tokens
@pytest.mark.parametrize(
"text, num_sents", [(TAMIL_BASIC_TOKENIZER_SENTENCIZER_TEST_TEXT, 9)]
)
def test_ta_sentencizer(text, num_sents):
nlp = Tamil()
nlp.add_pipe("sentencizer")
doc = nlp(text)
assert len(list(doc.sents)) == num_sents

View File

@ -0,0 +1,188 @@
import pytest
from spacy.symbols import ORTH
from spacy.lang.ta import Tamil
TA_BASIC_TOKENIZATION_TESTS = [
(
"கிறிஸ்துமஸ் மற்றும் இனிய புத்தாண்டு வாழ்த்துக்கள்",
["கிறிஸ்துமஸ்", "மற்றும்", "இனிய", "புத்தாண்டு", "வாழ்த்துக்கள்"],
),
(
"எனக்கு என் குழந்தைப் பருவம் நினைவிருக்கிறது",
["எனக்கு", "என்", "குழந்தைப்", "பருவம்", "நினைவிருக்கிறது"],
),
("உங்கள் பெயர் என்ன?", ["உங்கள்", "பெயர்", "என்ன", "?"]),
(
"ஏறத்தாழ இலங்கைத் தமிழரில் மூன்றிலொரு பங்கினர் இலங்கையை விட்டு வெளியேறிப் பிற நாடுகளில் வாழ்கின்றனர்",
[
"ஏறத்தாழ",
"இலங்கைத்",
"தமிழரில்",
"மூன்றிலொரு",
"பங்கினர்",
"இலங்கையை",
"விட்டு",
"வெளியேறிப்",
"பிற",
"நாடுகளில்",
"வாழ்கின்றனர்",
],
),
(
"இந்த ஃபோனுடன் சுமார் ரூ.2,990 மதிப்புள்ள போட் ராக்கர்ஸ் நிறுவனத்தின் ஸ்போர்ட் புளூடூத் ஹெட்போன்ஸ் இலவசமாக வழங்கப்படவுள்ளது.",
[
"இந்த",
"ஃபோனுடன்",
"சுமார்",
"ரூ.2,990",
"மதிப்புள்ள",
"போட்",
"ராக்கர்ஸ்",
"நிறுவனத்தின்",
"ஸ்போர்ட்",
"புளூடூத்",
"ஹெட்போன்ஸ்",
"இலவசமாக",
"வழங்கப்படவுள்ளது",
".",
],
),
(
"மட்டக்களப்பில் பல இடங்களில் வீட்டுத் திட்டங்களுக்கு இன்று அடிக்கல் நாட்டல்",
[
"மட்டக்களப்பில்",
"பல",
"இடங்களில்",
"வீட்டுத்",
"திட்டங்களுக்கு",
"இன்று",
"அடிக்கல்",
"நாட்டல்",
],
),
(
"ஐ போன்க்கு முகத்தை வைத்து அன்லாக் செய்யும் முறை மற்றும் விரலால் தொட்டு அன்லாக் செய்யும் முறையை வாட்ஸ் ஆப் நிறுவனம் இதற்கு முன் கண்டுபிடித்தது",
[
"",
"போன்க்கு",
"முகத்தை",
"வைத்து",
"அன்லாக்",
"செய்யும்",
"முறை",
"மற்றும்",
"விரலால்",
"தொட்டு",
"அன்லாக்",
"செய்யும்",
"முறையை",
"வாட்ஸ்",
"ஆப்",
"நிறுவனம்",
"இதற்கு",
"முன்",
"கண்டுபிடித்தது",
],
),
(
"இது ஒரு வாக்கியம்.",
[
"இது",
"ஒரு",
"வாக்கியம்",
".",
],
),
(
"தன்னாட்சி கார்கள் காப்பீட்டு பொறுப்பை உற்பத்தியாளரிடம் மாற்றுகின்றன",
[
"தன்னாட்சி",
"கார்கள்",
"காப்பீட்டு",
"பொறுப்பை",
"உற்பத்தியாளரிடம்",
"மாற்றுகின்றன",
],
),
(
"நடைபாதை விநியோக ரோபோக்களை தடை செய்வதை சான் பிரான்சிஸ்கோ கருதுகிறது",
[
"நடைபாதை",
"விநியோக",
"ரோபோக்களை",
"தடை",
"செய்வதை",
"சான்",
"பிரான்சிஸ்கோ",
"கருதுகிறது",
],
),
(
"லண்டன் ஐக்கிய இராச்சியத்தில் ஒரு பெரிய நகரம்.",
[
"லண்டன்",
"ஐக்கிய",
"இராச்சியத்தில்",
"ஒரு",
"பெரிய",
"நகரம்",
".",
],
),
(
"என்ன வேலை செய்கிறீர்கள்?",
[
"என்ன",
"வேலை",
"செய்கிறீர்கள்",
"?",
],
),
(
"எந்த கல்லூரியில் படிக்கிறாய்?",
[
"எந்த",
"கல்லூரியில்",
"படிக்கிறாய்",
"?",
],
),
]
@pytest.mark.parametrize("text,expected_tokens", TA_BASIC_TOKENIZATION_TESTS)
def test_ta_tokenizer_basic(ta_tokenizer, text, expected_tokens):
tokens = ta_tokenizer(text)
token_list = [token.text for token in tokens]
assert expected_tokens == token_list
@pytest.mark.parametrize(
"text,expected_tokens",
[
(
"ஆப்பிள் நிறுவனம் யு.கே. தொடக்க நிறுவனத்தை ஒரு லட்சம் கோடிக்கு வாங்கப் பார்க்கிறது",
[
"ஆப்பிள்",
"நிறுவனம்",
"யு.கே.",
"தொடக்க",
"நிறுவனத்தை",
"ஒரு",
"லட்சம்",
"கோடிக்கு",
"வாங்கப்",
"பார்க்கிறது",
],
)
],
)
def test_ta_tokenizer_special_case(text, expected_tokens):
# Add a special rule to tokenize the initialism "யு.கே." (U.K., as
# in the country) as a single token.
nlp = Tamil()
nlp.tokenizer.add_special_case("யு.கே.", [{ORTH: "யு.கே."}])
tokens = nlp(text)
token_list = [token.text for token in tokens]
assert expected_tokens == token_list

View File

@ -694,5 +694,4 @@ TESTS = ABBREV_TESTS + URL_TESTS + NUMBER_TESTS + PUNCT_TESTS + GENERAL_TESTS
def test_tr_tokenizer_handles_allcases(tr_tokenizer, text, expected_tokens): def test_tr_tokenizer_handles_allcases(tr_tokenizer, text, expected_tokens):
tokens = tr_tokenizer(text) tokens = tr_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space] token_list = [token.text for token in tokens if not token.is_space]
print(token_list)
assert expected_tokens == token_list assert expected_tokens == token_list

View File

@ -122,6 +122,36 @@ def test_issue6839(en_vocab):
assert matches assert matches
@pytest.mark.issue(10643)
def test_issue10643(en_vocab):
"""Ensure overlapping terms can be removed from PhraseMatcher"""
# fmt: off
words = ["Only", "save", "out", "the", "binary", "data", "for", "the", "individual", "components", "."]
# fmt: on
doc = Doc(en_vocab, words=words)
terms = {
"0": Doc(en_vocab, words=["binary"]),
"1": Doc(en_vocab, words=["binary", "data"]),
}
matcher = PhraseMatcher(en_vocab)
for match_id, term in terms.items():
matcher.add(match_id, [term])
matches = matcher(doc)
assert matches == [(en_vocab.strings["0"], 4, 5), (en_vocab.strings["1"], 4, 6)]
matcher.remove("0")
assert len(matcher) == 1
new_matches = matcher(doc)
assert new_matches == [(en_vocab.strings["1"], 4, 6)]
matcher.remove("1")
assert len(matcher) == 0
no_matches = matcher(doc)
assert not no_matches
def test_matcher_phrase_matcher(en_vocab): def test_matcher_phrase_matcher(en_vocab):
doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"]) doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"])
# intermediate phrase # intermediate phrase

View File

@ -14,6 +14,7 @@ from thinc.api import fix_random_seed
from ...pipeline import DependencyParser from ...pipeline import DependencyParser
from ...pipeline.dep_parser import DEFAULT_PARSER_MODEL from ...pipeline.dep_parser import DEFAULT_PARSER_MODEL
from ..util import apply_transition_sequence, make_tempdir from ..util import apply_transition_sequence, make_tempdir
from ...pipeline.tok2vec import DEFAULT_TOK2VEC_MODEL
TRAIN_DATA = [ TRAIN_DATA = [
( (
@ -401,6 +402,34 @@ def test_overfitting_IO(pipe_name):
assert_equal(batch_deps_1, no_batch_deps) assert_equal(batch_deps_1, no_batch_deps)
# fmt: off
@pytest.mark.slow
@pytest.mark.parametrize("pipe_name", ["parser", "beam_parser"])
@pytest.mark.parametrize(
"parser_config",
[
# TransitionBasedParser V1
({"@architectures": "spacy.TransitionBasedParser.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "state_type": "parser", "extra_state_tokens": False, "hidden_width": 64, "maxout_pieces": 2, "use_upper": True}),
# TransitionBasedParser V2
({"@architectures": "spacy.TransitionBasedParser.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "state_type": "parser", "extra_state_tokens": False, "hidden_width": 64, "maxout_pieces": 2, "use_upper": True}),
],
)
# fmt: on
def test_parser_configs(pipe_name, parser_config):
pipe_config = {"model": parser_config}
nlp = English()
parser = nlp.add_pipe(pipe_name, config=pipe_config)
train_examples = []
for text, annotations in TRAIN_DATA:
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
for dep in annotations.get("deps", []):
parser.add_label(dep)
optimizer = nlp.initialize()
for i in range(5):
losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses)
def test_beam_parser_scores(): def test_beam_parser_scores():
# Test that we can get confidence values out of the beam_parser pipe # Test that we can get confidence values out of the beam_parser pipe
beam_width = 16 beam_width = 16

View File

@ -14,7 +14,7 @@ from spacy.pipeline.legacy import EntityLinker_v1
from spacy.pipeline.tok2vec import DEFAULT_TOK2VEC_MODEL from spacy.pipeline.tok2vec import DEFAULT_TOK2VEC_MODEL
from spacy.scorer import Scorer from spacy.scorer import Scorer
from spacy.tests.util import make_tempdir from spacy.tests.util import make_tempdir
from spacy.tokens import Span from spacy.tokens import Span, Doc
from spacy.training import Example from spacy.training import Example
from spacy.util import ensure_path from spacy.util import ensure_path
from spacy.vocab import Vocab from spacy.vocab import Vocab
@ -1075,3 +1075,43 @@ def test_no_gold_ents(patterns):
# this will run the pipeline on the examples and shouldn't crash # this will run the pipeline on the examples and shouldn't crash
results = nlp.evaluate(train_examples) results = nlp.evaluate(train_examples)
@pytest.mark.issue(9575)
def test_tokenization_mismatch():
nlp = English()
# include a matching entity so that update isn't skipped
doc1 = Doc(
nlp.vocab,
words=["Kirby", "123456"],
spaces=[True, False],
ents=["B-CHARACTER", "B-CARDINAL"],
)
doc2 = Doc(
nlp.vocab,
words=["Kirby", "123", "456"],
spaces=[True, False, False],
ents=["B-CHARACTER", "B-CARDINAL", "B-CARDINAL"],
)
eg = Example(doc1, doc2)
train_examples = [eg]
vector_length = 3
def create_kb(vocab):
# create placeholder KB
mykb = KnowledgeBase(vocab, entity_vector_length=vector_length)
mykb.add_entity(entity="Q613241", freq=12, entity_vector=[6, -4, 3])
mykb.add_alias("Kirby", ["Q613241"], [0.9])
return mykb
entity_linker = nlp.add_pipe("entity_linker", last=True)
entity_linker.set_kb(create_kb)
optimizer = nlp.initialize(get_examples=lambda: train_examples)
for i in range(2):
losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses)
nlp.add_pipe("sentencizer", first=True)
results = nlp.evaluate(train_examples)

View File

@ -184,7 +184,7 @@ def test_overfitting_IO():
token.pos_ = "" token.pos_ = ""
token.set_morph(None) token.set_morph(None)
optimizer = nlp.initialize(get_examples=lambda: train_examples) optimizer = nlp.initialize(get_examples=lambda: train_examples)
print(nlp.get_pipe("morphologizer").labels) assert nlp.get_pipe("morphologizer").labels is not None
for i in range(50): for i in range(50):
losses = {} losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses) nlp.update(train_examples, sgd=optimizer, losses=losses)

View File

@ -119,6 +119,7 @@ def test_pipe_class_component_config():
self.value1 = value1 self.value1 = value1
self.value2 = value2 self.value2 = value2
self.is_base = True self.is_base = True
self.name = name
def __call__(self, doc: Doc) -> Doc: def __call__(self, doc: Doc) -> Doc:
return doc return doc
@ -141,12 +142,16 @@ def test_pipe_class_component_config():
nlp.add_pipe(name) nlp.add_pipe(name)
with pytest.raises(ConfigValidationError): # invalid config with pytest.raises(ConfigValidationError): # invalid config
nlp.add_pipe(name, config={"value1": "10", "value2": "hello"}) nlp.add_pipe(name, config={"value1": "10", "value2": "hello"})
nlp.add_pipe(name, config={"value1": 10, "value2": "hello"}) with pytest.warns(UserWarning):
nlp.add_pipe(
name, config={"value1": 10, "value2": "hello", "name": "wrong_name"}
)
pipe = nlp.get_pipe(name) pipe = nlp.get_pipe(name)
assert isinstance(pipe.nlp, Language) assert isinstance(pipe.nlp, Language)
assert pipe.value1 == 10 assert pipe.value1 == 10
assert pipe.value2 == "hello" assert pipe.value2 == "hello"
assert pipe.is_base is True assert pipe.is_base is True
assert pipe.name == name
nlp_en = English() nlp_en = English()
with pytest.raises(ConfigValidationError): # invalid config with pytest.raises(ConfigValidationError): # invalid config

View File

@ -382,6 +382,7 @@ def test_implicit_label(name, get_examples):
# fmt: off # fmt: off
@pytest.mark.slow
@pytest.mark.parametrize( @pytest.mark.parametrize(
"name,textcat_config", "name,textcat_config",
[ [
@ -390,7 +391,10 @@ def test_implicit_label(name, get_examples):
("textcat", {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "no_output_layer": True, "ngram_size": 3}), ("textcat", {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "no_output_layer": True, "ngram_size": 3}),
("textcat_multilabel", {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "no_output_layer": False, "ngram_size": 3}), ("textcat_multilabel", {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "no_output_layer": False, "ngram_size": 3}),
("textcat_multilabel", {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "no_output_layer": True, "ngram_size": 3}), ("textcat_multilabel", {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "no_output_layer": True, "ngram_size": 3}),
# ENSEMBLE # ENSEMBLE V1
("textcat", {"@architectures": "spacy.TextCatEnsemble.v1", "exclusive_classes": False, "pretrained_vectors": None, "width": 64, "embed_size": 2000, "conv_depth": 2, "window_size": 1, "ngram_size": 1, "dropout": None}),
("textcat_multilabel", {"@architectures": "spacy.TextCatEnsemble.v1", "exclusive_classes": False, "pretrained_vectors": None, "width": 64, "embed_size": 2000, "conv_depth": 2, "window_size": 1, "ngram_size": 1, "dropout": None}),
# ENSEMBLE V2
("textcat", {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "no_output_layer": False, "ngram_size": 3}}), ("textcat", {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "no_output_layer": False, "ngram_size": 3}}),
("textcat", {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "no_output_layer": True, "ngram_size": 3}}), ("textcat", {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "no_output_layer": True, "ngram_size": 3}}),
("textcat_multilabel", {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "no_output_layer": False, "ngram_size": 3}}), ("textcat_multilabel", {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "no_output_layer": False, "ngram_size": 3}}),
@ -643,15 +647,28 @@ def test_overfitting_IO_multi():
# fmt: off # fmt: off
@pytest.mark.slow
@pytest.mark.parametrize( @pytest.mark.parametrize(
"name,train_data,textcat_config", "name,train_data,textcat_config",
[ [
# BOW V1
("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}),
("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 4, "no_output_layer": False}),
# ENSEMBLE V1
("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatEnsemble.v1", "exclusive_classes": False, "pretrained_vectors": None, "width": 64, "embed_size": 2000, "conv_depth": 2, "window_size": 1, "ngram_size": 1, "dropout": None}),
("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatEnsemble.v1", "exclusive_classes": False, "pretrained_vectors": None, "width": 64, "embed_size": 2000, "conv_depth": 2, "window_size": 1, "ngram_size": 1, "dropout": None}),
# CNN V1
("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatCNN.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": True}),
("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatCNN.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": False}),
# BOW V2
("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}), ("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}),
("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": True, "ngram_size": 4, "no_output_layer": False}), ("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": True, "ngram_size": 4, "no_output_layer": False}),
("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": False, "ngram_size": 3, "no_output_layer": True}), ("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": False, "ngram_size": 3, "no_output_layer": True}),
("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": True, "ngram_size": 2, "no_output_layer": True}), ("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": True, "ngram_size": 2, "no_output_layer": True}),
# ENSEMBLE V2
("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}}), ("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}}),
("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": True, "ngram_size": 5, "no_output_layer": False}}), ("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v2", "exclusive_classes": True, "ngram_size": 5, "no_output_layer": False}}),
# CNN V2
("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatCNN.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": True}), ("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatCNN.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": True}),
("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatCNN.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": False}), ("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatCNN.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": False}),
], ],

View File

@ -1,13 +1,13 @@
import pytest import pytest
from spacy.ml.models.tok2vec import build_Tok2Vec_model from spacy.ml.models.tok2vec import build_Tok2Vec_model
from spacy.ml.models.tok2vec import MultiHashEmbed, CharacterEmbed from spacy.ml.models.tok2vec import MultiHashEmbed, MaxoutWindowEncoder
from spacy.ml.models.tok2vec import MishWindowEncoder, MaxoutWindowEncoder
from spacy.pipeline.tok2vec import Tok2Vec, Tok2VecListener from spacy.pipeline.tok2vec import Tok2Vec, Tok2VecListener
from spacy.vocab import Vocab from spacy.vocab import Vocab
from spacy.tokens import Doc from spacy.tokens import Doc
from spacy.training import Example from spacy.training import Example
from spacy import util from spacy import util
from spacy.lang.en import English from spacy.lang.en import English
from spacy.util import registry
from thinc.api import Config, get_current_ops from thinc.api import Config, get_current_ops
from numpy.testing import assert_array_equal from numpy.testing import assert_array_equal
@ -55,24 +55,41 @@ def test_tok2vec_batch_sizes(batch_size, width, embed_size):
assert doc_vec.shape == (len(doc), width) assert doc_vec.shape == (len(doc), width)
@pytest.mark.slow
@pytest.mark.parametrize("width", [8])
@pytest.mark.parametrize( @pytest.mark.parametrize(
"width,embed_arch,embed_config,encode_arch,encode_config", "embed_arch,embed_config",
# fmt: off # fmt: off
[ [
(8, MultiHashEmbed, {"rows": [100, 100], "attrs": ["SHAPE", "LOWER"], "include_static_vectors": False}, MaxoutWindowEncoder, {"window_size": 1, "maxout_pieces": 3, "depth": 2}), ("spacy.MultiHashEmbed.v1", {"rows": [100, 100], "attrs": ["SHAPE", "LOWER"], "include_static_vectors": False}),
(8, MultiHashEmbed, {"rows": [100, 20], "attrs": ["ORTH", "PREFIX"], "include_static_vectors": False}, MishWindowEncoder, {"window_size": 1, "depth": 6}), ("spacy.MultiHashEmbed.v1", {"rows": [100, 20], "attrs": ["ORTH", "PREFIX"], "include_static_vectors": False}),
(8, CharacterEmbed, {"rows": 100, "nM": 64, "nC": 8, "include_static_vectors": False}, MaxoutWindowEncoder, {"window_size": 1, "maxout_pieces": 3, "depth": 3}), ("spacy.CharacterEmbed.v1", {"rows": 100, "nM": 64, "nC": 8, "include_static_vectors": False}),
(8, CharacterEmbed, {"rows": 100, "nM": 16, "nC": 2, "include_static_vectors": False}, MishWindowEncoder, {"window_size": 1, "depth": 3}), ("spacy.CharacterEmbed.v1", {"rows": 100, "nM": 16, "nC": 2, "include_static_vectors": False}),
], ],
# fmt: on # fmt: on
) )
def test_tok2vec_configs(width, embed_arch, embed_config, encode_arch, encode_config): @pytest.mark.parametrize(
"tok2vec_arch,encode_arch,encode_config",
# fmt: off
[
("spacy.Tok2Vec.v1", "spacy.MaxoutWindowEncoder.v1", {"window_size": 1, "maxout_pieces": 3, "depth": 2}),
("spacy.Tok2Vec.v2", "spacy.MaxoutWindowEncoder.v2", {"window_size": 1, "maxout_pieces": 3, "depth": 2}),
("spacy.Tok2Vec.v1", "spacy.MishWindowEncoder.v1", {"window_size": 1, "depth": 6}),
("spacy.Tok2Vec.v2", "spacy.MishWindowEncoder.v2", {"window_size": 1, "depth": 6}),
],
# fmt: on
)
def test_tok2vec_configs(
width, tok2vec_arch, embed_arch, embed_config, encode_arch, encode_config
):
embed = registry.get("architectures", embed_arch)
encode = registry.get("architectures", encode_arch)
tok2vec_model = registry.get("architectures", tok2vec_arch)
embed_config["width"] = width embed_config["width"] = width
encode_config["width"] = width encode_config["width"] = width
docs = get_batch(3) docs = get_batch(3)
tok2vec = build_Tok2Vec_model( tok2vec = tok2vec_model(embed(**embed_config), encode(**encode_config))
embed_arch(**embed_config), encode_arch(**encode_config)
)
tok2vec.initialize(docs) tok2vec.initialize(docs)
vectors, backprop = tok2vec.begin_update(docs) vectors, backprop = tok2vec.begin_update(docs)
assert len(vectors) == len(docs) assert len(vectors) == len(docs)

View File

@ -1,4 +1,7 @@
import os import os
import math
from random import sample
from typing import Counter
import pytest import pytest
import srsly import srsly
@ -14,6 +17,10 @@ from spacy.cli._util import substitute_project_variables
from spacy.cli._util import validate_project_commands from spacy.cli._util import validate_project_commands
from spacy.cli.debug_data import _compile_gold, _get_labels_from_model from spacy.cli.debug_data import _compile_gold, _get_labels_from_model
from spacy.cli.debug_data import _get_labels_from_spancat from spacy.cli.debug_data import _get_labels_from_spancat
from spacy.cli.debug_data import _get_distribution, _get_kl_divergence
from spacy.cli.debug_data import _get_span_characteristics
from spacy.cli.debug_data import _print_span_characteristics
from spacy.cli.debug_data import _get_spans_length_freq_dist
from spacy.cli.download import get_compatibility, get_version from spacy.cli.download import get_compatibility, get_version
from spacy.cli.init_config import RECOMMENDATIONS, init_config, fill_config from spacy.cli.init_config import RECOMMENDATIONS, init_config, fill_config
from spacy.cli.package import get_third_party_dependencies from spacy.cli.package import get_third_party_dependencies
@ -24,6 +31,7 @@ from spacy.lang.nl import Dutch
from spacy.language import Language from spacy.language import Language
from spacy.schemas import ProjectConfigSchema, RecommendationSchema, validate from spacy.schemas import ProjectConfigSchema, RecommendationSchema, validate
from spacy.tokens import Doc from spacy.tokens import Doc
from spacy.tokens.span import Span
from spacy.training import Example, docs_to_json, offsets_to_biluo_tags from spacy.training import Example, docs_to_json, offsets_to_biluo_tags
from spacy.training.converters import conll_ner_to_docs, conllu_to_docs from spacy.training.converters import conll_ner_to_docs, conllu_to_docs
from spacy.training.converters import iob_to_docs from spacy.training.converters import iob_to_docs
@ -217,7 +225,6 @@ def test_cli_converters_conllu_to_docs_subtokens():
sent = converted[0]["paragraphs"][0]["sentences"][0] sent = converted[0]["paragraphs"][0]["sentences"][0]
assert len(sent["tokens"]) == 4 assert len(sent["tokens"]) == 4
tokens = sent["tokens"] tokens = sent["tokens"]
print(tokens)
assert [t["orth"] for t in tokens] == ["Dommer", "FE", "avstår", "."] assert [t["orth"] for t in tokens] == ["Dommer", "FE", "avstår", "."]
assert [t["tag"] for t in tokens] == [ assert [t["tag"] for t in tokens] == [
"NOUN__Definite=Ind|Gender=Masc|Number=Sing", "NOUN__Definite=Ind|Gender=Masc|Number=Sing",
@ -342,6 +349,7 @@ def test_project_config_validation_full():
"assets": [ "assets": [
{ {
"dest": "x", "dest": "x",
"extra": True,
"url": "https://example.com", "url": "https://example.com",
"checksum": "63373dd656daa1fd3043ce166a59474c", "checksum": "63373dd656daa1fd3043ce166a59474c",
}, },
@ -353,6 +361,12 @@ def test_project_config_validation_full():
"path": "y", "path": "y",
}, },
}, },
{
"dest": "z",
"extra": False,
"url": "https://example.com",
"checksum": "63373dd656daa1fd3043ce166a59474c",
},
], ],
"commands": [ "commands": [
{ {
@ -734,3 +748,110 @@ def test_debug_data_compile_gold():
eg = Example(pred, ref) eg = Example(pred, ref)
data = _compile_gold([eg], ["ner"], nlp, True) data = _compile_gold([eg], ["ner"], nlp, True)
assert data["boundary_cross_ents"] == 1 assert data["boundary_cross_ents"] == 1
def test_debug_data_compile_gold_for_spans():
nlp = English()
spans_key = "sc"
pred = Doc(nlp.vocab, words=["Welcome", "to", "the", "Bank", "of", "China", "."])
pred.spans[spans_key] = [Span(pred, 3, 6, "ORG"), Span(pred, 5, 6, "GPE")]
ref = Doc(nlp.vocab, words=["Welcome", "to", "the", "Bank", "of", "China", "."])
ref.spans[spans_key] = [Span(ref, 3, 6, "ORG"), Span(ref, 5, 6, "GPE")]
eg = Example(pred, ref)
data = _compile_gold([eg], ["spancat"], nlp, True)
assert data["spancat"][spans_key] == Counter({"ORG": 1, "GPE": 1})
assert data["spans_length"][spans_key] == {"ORG": [3], "GPE": [1]}
assert data["spans_per_type"][spans_key] == {
"ORG": [Span(ref, 3, 6, "ORG")],
"GPE": [Span(ref, 5, 6, "GPE")],
}
assert data["sb_per_type"][spans_key] == {
"ORG": {"start": [ref[2:3]], "end": [ref[6:7]]},
"GPE": {"start": [ref[4:5]], "end": [ref[6:7]]},
}
def test_frequency_distribution_is_correct():
nlp = English()
docs = [
Doc(nlp.vocab, words=["Bank", "of", "China"]),
Doc(nlp.vocab, words=["China"]),
]
expected = Counter({"china": 0.5, "bank": 0.25, "of": 0.25})
freq_distribution = _get_distribution(docs, normalize=True)
assert freq_distribution == expected
def test_kl_divergence_computation_is_correct():
p = Counter({"a": 0.5, "b": 0.25})
q = Counter({"a": 0.25, "b": 0.50, "c": 0.15, "d": 0.10})
result = _get_kl_divergence(p, q)
expected = 0.1733
assert math.isclose(result, expected, rel_tol=1e-3)
def test_get_span_characteristics_return_value():
nlp = English()
spans_key = "sc"
pred = Doc(nlp.vocab, words=["Welcome", "to", "the", "Bank", "of", "China", "."])
pred.spans[spans_key] = [Span(pred, 3, 6, "ORG"), Span(pred, 5, 6, "GPE")]
ref = Doc(nlp.vocab, words=["Welcome", "to", "the", "Bank", "of", "China", "."])
ref.spans[spans_key] = [Span(ref, 3, 6, "ORG"), Span(ref, 5, 6, "GPE")]
eg = Example(pred, ref)
examples = [eg]
data = _compile_gold(examples, ["spancat"], nlp, True)
span_characteristics = _get_span_characteristics(
examples=examples, compiled_gold=data, spans_key=spans_key
)
assert {"sd", "bd", "lengths"}.issubset(span_characteristics.keys())
assert span_characteristics["min_length"] == 1
assert span_characteristics["max_length"] == 3
def test_ensure_print_span_characteristics_wont_fail():
"""Test if interface between two methods aren't destroyed if refactored"""
nlp = English()
spans_key = "sc"
pred = Doc(nlp.vocab, words=["Welcome", "to", "the", "Bank", "of", "China", "."])
pred.spans[spans_key] = [Span(pred, 3, 6, "ORG"), Span(pred, 5, 6, "GPE")]
ref = Doc(nlp.vocab, words=["Welcome", "to", "the", "Bank", "of", "China", "."])
ref.spans[spans_key] = [Span(ref, 3, 6, "ORG"), Span(ref, 5, 6, "GPE")]
eg = Example(pred, ref)
examples = [eg]
data = _compile_gold(examples, ["spancat"], nlp, True)
span_characteristics = _get_span_characteristics(
examples=examples, compiled_gold=data, spans_key=spans_key
)
_print_span_characteristics(span_characteristics)
@pytest.mark.parametrize("threshold", [70, 80, 85, 90, 95])
def test_span_length_freq_dist_threshold_must_be_correct(threshold):
sample_span_lengths = {
"span_type_1": [1, 4, 4, 5],
"span_type_2": [5, 3, 3, 2],
"span_type_3": [3, 1, 3, 3],
}
span_freqs = _get_spans_length_freq_dist(sample_span_lengths, threshold)
assert sum(span_freqs.values()) >= threshold
def test_span_length_freq_dist_output_must_be_correct():
sample_span_lengths = {
"span_type_1": [1, 4, 4, 5],
"span_type_2": [5, 3, 3, 2],
"span_type_3": [3, 1, 3, 3],
}
threshold = 90
span_freqs = _get_spans_length_freq_dist(sample_span_lengths, threshold)
assert sum(span_freqs.values()) >= threshold
assert list(span_freqs.keys()) == [3, 1, 4, 5, 2]

View File

@ -83,6 +83,27 @@ def test_issue3882(en_vocab):
displacy.parse_deps(doc) displacy.parse_deps(doc)
@pytest.mark.issue(5447)
def test_issue5447():
"""Test that overlapping arcs get separate levels, unless they're identical."""
renderer = DependencyRenderer()
words = [
{"text": "This", "tag": "DT"},
{"text": "is", "tag": "VBZ"},
{"text": "a", "tag": "DT"},
{"text": "sentence.", "tag": "NN"},
]
arcs = [
{"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
{"start": 2, "end": 3, "label": "det", "dir": "left"},
{"start": 2, "end": 3, "label": "overlap", "dir": "left"},
{"end": 3, "label": "overlap", "start": 2, "dir": "left"},
{"start": 1, "end": 3, "label": "attr", "dir": "left"},
]
renderer.render([{"words": words, "arcs": arcs}])
assert renderer.highest_level == 3
@pytest.mark.issue(5838) @pytest.mark.issue(5838)
def test_issue5838(): def test_issue5838():
# Displacy's EntityRenderer break line # Displacy's EntityRenderer break line
@ -317,3 +338,18 @@ def test_displacy_options_case():
assert "green" in result[1] and "bar" in result[1] assert "green" in result[1] and "bar" in result[1]
assert "red" in result[2] and "FOO" in result[2] assert "red" in result[2] and "FOO" in result[2]
assert "green" in result[3] and "BAR" in result[3] assert "green" in result[3] and "BAR" in result[3]
@pytest.mark.issue(10672)
def test_displacy_manual_sorted_entities():
doc = {
"text": "But Google is starting from behind.",
"ents": [
{"start": 14, "end": 22, "label": "SECOND"},
{"start": 4, "end": 10, "label": "FIRST"},
],
"title": None,
}
html = displacy.render(doc, style="ent", manual=True)
assert html.find("FIRST") < html.find("SECOND")

View File

@ -1,7 +1,13 @@
- import pytest
import re
- from spacy.util import get_lang_class
+ import string
+ import hypothesis
+ import hypothesis.strategies
+ import pytest
+ import spacy
from spacy.tokenizer import Tokenizer
+ from spacy.util import get_lang_class
# Only include languages with no external dependencies # Only include languages with no external dependencies
# "is" seems to confuse importlib, so we're also excluding it for now # "is" seems to confuse importlib, so we're also excluding it for now
@ -77,3 +83,46 @@ def test_tokenizer_explain_special_matcher(en_vocab):
tokens = [t.text for t in tokenizer("a/a.")] tokens = [t.text for t in tokenizer("a/a.")]
explain_tokens = [t[1] for t in tokenizer.explain("a/a.")] explain_tokens = [t[1] for t in tokenizer.explain("a/a.")]
assert tokens == explain_tokens assert tokens == explain_tokens
@hypothesis.strategies.composite
def sentence_strategy(draw: hypothesis.strategies.DrawFn, max_n_words: int = 4) -> str:
"""
Composite strategy for fuzzily generating sentence with varying interpunctation.
draw (hypothesis.strategies.DrawFn): Protocol for drawing function allowing to fuzzily pick from hypothesis'
strategies.
max_n_words (int): Max. number of words in generated sentence.
RETURNS (str): Fuzzily generated sentence.
"""
punctuation_and_space_regex = "|".join(
[*[re.escape(p) for p in string.punctuation], r"\s"]
)
sentence = [
[
draw(hypothesis.strategies.text(min_size=1)),
draw(hypothesis.strategies.from_regex(punctuation_and_space_regex)),
]
for _ in range(
draw(hypothesis.strategies.integers(min_value=2, max_value=max_n_words))
)
]
return " ".join([token for token_pair in sentence for token in token_pair])
@pytest.mark.xfail
@pytest.mark.parametrize("lang", LANGUAGES)
@hypothesis.given(sentence=sentence_strategy())
def test_tokenizer_explain_fuzzy(lang: str, sentence: str) -> None:
"""
Tests whether output of tokenizer.explain() matches tokenizer output. Input generated by hypothesis.
lang (str): Language to test.
text (str): Fuzzily generated sentence to tokenize.
"""
tokenizer: Tokenizer = spacy.blank(lang).tokenizer
tokens = [t.text for t in tokenizer(sentence) if not t.is_space]
debug_tokens = [t[1] for t in tokenizer.explain(sentence)]
assert tokens == debug_tokens, f"{tokens}, {debug_tokens}, {sentence}"

View File

@ -181,7 +181,7 @@ def _optimize(nlp, component: str, data: List, rehearse: bool):
elif component == "tagger": elif component == "tagger":
_add_tagger_label(pipe, data) _add_tagger_label(pipe, data)
elif component == "parser": elif component == "parser":
_add_tagger_label(pipe, data) _add_parser_label(pipe, data)
elif component == "textcat_multilabel": elif component == "textcat_multilabel":
_add_textcat_label(pipe, data) _add_textcat_label(pipe, data)
else: else:

View File

@ -43,6 +43,15 @@ class SpanGroups(UserDict):
        doc = self._ensure_doc()
        return SpanGroups(doc).from_bytes(self.to_bytes())
def setdefault(self, key, default=None):
if not isinstance(default, SpanGroup):
if default is None:
spans = []
else:
spans = default
default = self._make_span_group(key, spans)
return super().setdefault(key, default=default)
    def to_bytes(self) -> bytes:
        # We don't need to serialize this as a dict, because the groups
        # know their names.
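As a quick illustration of the new `setdefault` (a sketch against this commit's behavior; the `"sc"` key and example sentence are ours, not from the diff):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Welcome to the Bank of China.")

# First call: the key is missing, so the plain span list is wrapped in a
# SpanGroup and stored under "sc" before being returned.
doc.spans.setdefault("sc", [doc[3:6]])
assert len(doc.spans["sc"]) == 1

# Second call: the key already exists, so the default is ignored.
doc.spans.setdefault("sc", [])
assert len(doc.spans["sc"]) == 1
```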

View File

@ -11,7 +11,7 @@ from enum import Enum
import itertools
import numpy
import srsly
- from thinc.api import get_array_module
+ from thinc.api import get_array_module, get_current_ops
from thinc.util import copy_array
import warnings
@ -414,6 +414,7 @@ cdef class Doc:
""" """
# empty docs are always annotated # empty docs are always annotated
input_attr = attr
if self.length == 0: if self.length == 0:
return True return True
cdef int i cdef int i
@ -423,6 +424,10 @@ cdef class Doc:
elif attr == "IS_SENT_END" or attr == self.vocab.strings["IS_SENT_END"]: elif attr == "IS_SENT_END" or attr == self.vocab.strings["IS_SENT_END"]:
attr = SENT_START attr = SENT_START
attr = intify_attr(attr) attr = intify_attr(attr)
if attr is None:
raise ValueError(
Errors.E1037.format(attr=input_attr)
)
# adjust attributes # adjust attributes
if attr == HEAD: if attr == HEAD:
# HEAD does not have an unset state, so rely on DEP # HEAD does not have an unset state, so rely on DEP
@ -1108,14 +1113,19 @@ cdef class Doc:
        return self
    @staticmethod
-   def from_docs(docs, ensure_whitespace=True, attrs=None):
+   def from_docs(docs, ensure_whitespace=True, attrs=None, *, exclude=tuple()):
        """Concatenate multiple Doc objects to form a new one. Raises an error
        if the `Doc` objects do not all share the same `Vocab`.
        docs (list): A list of Doc objects.
-       ensure_whitespace (bool): Insert a space between two adjacent docs whenever the first doc does not end in whitespace.
-       attrs (list): Optional list of attribute ID ints or attribute name strings.
-       RETURNS (Doc): A doc that contains the concatenated docs, or None if no docs were given.
+       ensure_whitespace (bool): Insert a space between two adjacent docs
+           whenever the first doc does not end in whitespace.
+       attrs (list): Optional list of attribute ID ints or attribute name
+           strings.
+       exclude (Iterable[str]): Doc attributes to exclude. Supported
+           attributes: `spans`, `tensor`, `user_data`.
+       RETURNS (Doc): A doc that contains the concatenated docs, or None if no
+           docs were given.
        DOCS: https://spacy.io/api/doc#from_docs
        """
@ -1145,31 +1155,33 @@ cdef class Doc:
            concat_words.extend(t.text for t in doc)
            concat_spaces.extend(bool(t.whitespace_) for t in doc)
+           if "user_data" not in exclude:
                for key, value in doc.user_data.items():
                    if isinstance(key, tuple) and len(key) == 4 and key[0] == "._.":
                        data_type, name, start, end = key
                        if start is not None or end is not None:
                            start += char_offset
                            if end is not None:
                                end += char_offset
                            concat_user_data[(data_type, name, start, end)] = copy.copy(value)
                        else:
                            warnings.warn(Warnings.W101.format(name=name))
                    else:
                        warnings.warn(Warnings.W102.format(key=key, value=value))
+           if "spans" not in exclude:
                for key in doc.spans:
                    # if a spans key is in any doc, include it in the merged doc
                    # even if it is empty
                    if key not in concat_spans:
                        concat_spans[key] = []
                    for span in doc.spans[key]:
                        concat_spans[key].append((
                            span.start_char + char_offset,
                            span.end_char + char_offset,
                            span.label,
                            span.kb_id,
                            span.text,  # included as a check
                        ))
            char_offset += len(doc.text)
            if len(doc) > 0 and ensure_whitespace and not doc[-1].is_space and not bool(doc[-1].whitespace_):
                char_offset += 1
@ -1210,6 +1222,10 @@ cdef class Doc:
        else:
            raise ValueError(Errors.E873.format(key=key, text=text))
+       if "tensor" not in exclude and any(len(doc) for doc in docs):
+           ops = get_current_ops()
+           concat_doc.tensor = ops.xp.vstack([ops.asarray(doc.tensor) for doc in docs if len(doc)])
        return concat_doc
    def get_lca_matrix(self):

View File

@ -9,6 +9,8 @@ cimport cython
import weakref import weakref
from preshed.maps cimport map_get_unless_missing from preshed.maps cimport map_get_unless_missing
from murmurhash.mrmr cimport hash64 from murmurhash.mrmr cimport hash64
from .. import Errors
from ..typedefs cimport hash_t from ..typedefs cimport hash_t
from ..strings import get_string_id from ..strings import get_string_id
from ..structs cimport EdgeC, GraphC from ..structs cimport EdgeC, GraphC
@ -68,7 +70,7 @@ cdef class Node:
""" """
        cdef int length = graph.c.nodes.size()
        if i >= length or -i >= length:
-           raise IndexError(f"Node index {i} out of bounds ({length})")
+           raise IndexError(Errors.E1034.format(i=i, length=length))
        if i < 0:
            i += length
        self.graph = graph
@ -88,7 +90,7 @@ cdef class Node:
"""Get a token index from the node's set of tokens.""" """Get a token index from the node's set of tokens."""
length = self.graph.c.nodes[self.i].size() length = self.graph.c.nodes[self.i].size()
if i >= length or -i >= length: if i >= length or -i >= length:
raise IndexError(f"Token index {i} out of bounds ({length})") raise IndexError(Errors.E1035.format(i=i, length=length))
if i < 0: if i < 0:
i += length i += length
return self.graph.c.nodes[self.i][i] return self.graph.c.nodes[self.i][i]
@ -306,7 +308,7 @@ cdef class NoneNode(Node):
        self.i = -1
    def __getitem__(self, int i):
-       raise IndexError("Cannot index into NoneNode.")
+       raise IndexError(Errors.E1036)
    def __len__(self):
        return 0
@ -484,7 +486,6 @@ cdef class Graph:
for idx in indices: for idx in indices:
node.push_back(idx) node.push_back(idx)
i = add_node(&self.c, node) i = add_node(&self.c, node)
print("Add node", indices, i)
return Node(self, i) return Node(self, i)
def get_node(self, indices) -> Node: def get_node(self, indices) -> Node:
@ -501,7 +502,6 @@ cdef class Graph:
if node_index < 0: if node_index < 0:
return NoneNode(self) return NoneNode(self)
else: else:
print("Get node", indices, node_index)
return Node(self, node_index) return Node(self, node_index)
def has_node(self, tuple indices) -> bool: def has_node(self, tuple indices) -> bool:
@ -661,8 +661,6 @@ cdef int walk_head_nodes(vector[int]& output, const GraphC* graph, int node) nog
seen.insert(node) seen.insert(node)
i = 0 i = 0
while i < output.size(): while i < output.size():
with gil:
print("Walk up from", output[i])
if seen.find(output[i]) == seen.end(): if seen.find(output[i]) == seen.end():
seen.insert(output[i]) seen.insert(output[i])
get_head_nodes(output, graph, output[i]) get_head_nodes(output, graph, output[i])

View File

@ -730,7 +730,7 @@ cdef class Span:
        def __set__(self, int start):
            if start < 0:
-               raise IndexError("TODO")
+               raise IndexError(Errors.E1032.format(var="start", forbidden="< 0", value=start))
            self.c.start = start
    property end:
@ -739,7 +739,7 @@ cdef class Span:
        def __set__(self, int end):
            if end < 0:
-               raise IndexError("TODO")
+               raise IndexError(Errors.E1032.format(var="end", forbidden="< 0", value=end))
            self.c.end = end
    property start_char:
@ -748,7 +748,7 @@ cdef class Span:
        def __set__(self, int start_char):
            if start_char < 0:
-               raise IndexError("TODO")
+               raise IndexError(Errors.E1032.format(var="start_char", forbidden="< 0", value=start_char))
            self.c.start_char = start_char
    property end_char:
@ -757,7 +757,7 @@ cdef class Span:
        def __set__(self, int end_char):
            if end_char < 0:
-               raise IndexError("TODO")
+               raise IndexError(Errors.E1032.format(var="end_char", forbidden="< 0", value=end_char))
            self.c.end_char = end_char
    property label:

View File

@ -1,10 +1,11 @@
from typing import Iterable, Tuple, Union, Optional, TYPE_CHECKING
import weakref import weakref
import struct import struct
from copy import deepcopy
import srsly import srsly
from spacy.errors import Errors from spacy.errors import Errors
from .span cimport Span from .span cimport Span
from libc.stdint cimport uint64_t, uint32_t, int32_t
cdef class SpanGroup: cdef class SpanGroup:
@ -20,13 +21,13 @@ cdef class SpanGroup:
>>> doc.spans["errors"] = SpanGroup( >>> doc.spans["errors"] = SpanGroup(
doc, doc,
name="errors", name="errors",
spans=[doc[0:1], doc[2:4]], spans=[doc[0:1], doc[1:3]],
attrs={"annotator": "matt"} attrs={"annotator": "matt"}
) )
Construction 2 Construction 2
>>> doc = nlp("Their goi ng home") >>> doc = nlp("Their goi ng home")
>>> doc.spans["errors"] = [doc[0:1], doc[2:4]] >>> doc.spans["errors"] = [doc[0:1], doc[1:3]]
>>> assert isinstance(doc.spans["errors"], SpanGroup) >>> assert isinstance(doc.spans["errors"], SpanGroup)
DOCS: https://spacy.io/api/spangroup DOCS: https://spacy.io/api/spangroup
@ -48,6 +49,8 @@ cdef class SpanGroup:
        self.name = name
        self.attrs = dict(attrs) if attrs is not None else {}
        cdef Span span
+       if len(spans) :
+           self.c.reserve(len(spans))
        for span in spans:
            self.push_back(span.c)
@ -89,6 +92,72 @@ cdef class SpanGroup:
""" """
return self.c.size() return self.c.size()
def __getitem__(self, int i) -> Span:
"""Get a span from the group. Note that a copy of the span is returned,
so if any changes are made to this span, they are not reflected in the
corresponding member of the span group.
i (int): The item index.
RETURNS (Span): The span at the given index.
DOCS: https://spacy.io/api/spangroup#getitem
"""
i = self._normalize_index(i)
return Span.cinit(self.doc, self.c[i])
def __delitem__(self, int i):
"""Delete a span from the span group at index i.
i (int): The item index.
DOCS: https://spacy.io/api/spangroup#delitem
"""
i = self._normalize_index(i)
self.c.erase(self.c.begin() + i - 1)
def __setitem__(self, int i, Span span):
"""Set a span in the span group.
i (int): The item index.
span (Span): The span.
DOCS: https://spacy.io/api/spangroup#setitem
"""
if span.doc is not self.doc:
raise ValueError(Errors.E855.format(obj="span"))
i = self._normalize_index(i)
self.c[i] = span.c
def __iadd__(self, other: Union[SpanGroup, Iterable["Span"]]) -> SpanGroup:
"""Operator +=. Append a span group or spans to this group and return
the current span group.
other (Union[SpanGroup, Iterable["Span"]]): The SpanGroup or spans to
add.
RETURNS (SpanGroup): The current span group.
DOCS: https://spacy.io/api/spangroup#iadd
"""
return self._concat(other, inplace=True)
def __add__(self, other: SpanGroup) -> SpanGroup:
"""Operator +. Concatenate a span group with this group and return a
new span group.
other (SpanGroup): The SpanGroup to add.
RETURNS (SpanGroup): The concatenated SpanGroup.
DOCS: https://spacy.io/api/spangroup#add
"""
# For Cython 0.x and __add__, you cannot rely on `self` as being `self`
# or being the right type, so both types need to be checked explicitly.
if isinstance(self, SpanGroup) and isinstance(other, SpanGroup):
return self._concat(other)
return NotImplemented
    def append(self, Span span):
        """Add a span to the group. The span must refer to the same Doc
        object as the span group.
@ -98,35 +167,18 @@ cdef class SpanGroup:
        DOCS: https://spacy.io/api/spangroup#append
        """
        if span.doc is not self.doc:
-           raise ValueError("Cannot add span to group: refers to different Doc.")
+           raise ValueError(Errors.E855.format(obj="span"))
        self.push_back(span.c)
-   def extend(self, spans):
-       """Add multiple spans to the group. All spans must refer to the same
-       Doc object as the span group.
-       spans (Iterable[Span]): The spans to add.
+   def extend(self, spans_or_span_group: Union[SpanGroup, Iterable["Span"]]):
+       """Add multiple spans or contents of another SpanGroup to the group.
+       All spans must refer to the same Doc object as the span group.
+       spans (Union[SpanGroup, Iterable["Span"]]): The spans to add.
        DOCS: https://spacy.io/api/spangroup#extend
        """
-       cdef Span span
-       for span in spans:
-           self.append(span)
+       self._concat(spans_or_span_group, inplace=True)
-   def __getitem__(self, int i):
-       """Get a span from the group.
-       i (int): The item index.
-       RETURNS (Span): The span at the given index.
-       DOCS: https://spacy.io/api/spangroup#getitem
-       """
-       cdef int size = self.c.size()
-       if i < -size or i >= size:
-           raise IndexError(f"list index {i} out of range")
-       if i < 0:
-           i += size
-       return Span.cinit(self.doc, self.c[i])
    def to_bytes(self):
"""Serialize the SpanGroup's contents to a byte string. """Serialize the SpanGroup's contents to a byte string.
@ -136,6 +188,7 @@ cdef class SpanGroup:
DOCS: https://spacy.io/api/spangroup#to_bytes DOCS: https://spacy.io/api/spangroup#to_bytes
""" """
output = {"name": self.name, "attrs": self.attrs, "spans": []} output = {"name": self.name, "attrs": self.attrs, "spans": []}
cdef int i
for i in range(self.c.size()): for i in range(self.c.size()):
span = self.c[i] span = self.c[i]
# The struct.pack here is probably overkill, but it might help if # The struct.pack here is probably overkill, but it might help if
@ -187,3 +240,74 @@ cdef class SpanGroup:
cdef void push_back(self, SpanC span) nogil: cdef void push_back(self, SpanC span) nogil:
self.c.push_back(span) self.c.push_back(span)
def copy(self) -> SpanGroup:
"""Clones the span group.
RETURNS (SpanGroup): A copy of the span group.
DOCS: https://spacy.io/api/spangroup#copy
"""
return SpanGroup(
self.doc,
name=self.name,
attrs=deepcopy(self.attrs),
spans=list(self),
)
def _concat(
self,
other: Union[SpanGroup, Iterable["Span"]],
*,
inplace: bool = False,
) -> SpanGroup:
"""Concatenates the current span group with the provided span group or
spans, either in place or creating a copy. Preserves the name of self,
updates attrs only with values that are not in self.
other (Union[SpanGroup, Iterable[Span]]): The spans to append.
inplace (bool): Indicates whether the operation should be performed in
place on the current span group.
RETURNS (SpanGroup): Either a new SpanGroup or the current SpanGroup
depending on the value of inplace.
"""
cdef SpanGroup span_group = self if inplace else self.copy()
cdef SpanGroup other_group
cdef Span span
if isinstance(other, SpanGroup):
other_group = other
if other_group.doc is not self.doc:
raise ValueError(Errors.E855.format(obj="span group"))
other_attrs = deepcopy(other_group.attrs)
span_group.attrs.update({
key: value for key, value in other_attrs.items() \
if key not in span_group.attrs
})
if len(other_group):
span_group.c.reserve(span_group.c.size() + other_group.c.size())
span_group.c.insert(span_group.c.end(), other_group.c.begin(), other_group.c.end())
else:
if len(other):
span_group.c.reserve(self.c.size() + len(other))
for span in other:
if span.doc is not self.doc:
raise ValueError(Errors.E855.format(obj="span"))
span_group.c.push_back(span.c)
return span_group
def _normalize_index(self, int i) -> int:
"""Checks list index boundaries and adjusts the index if negative.
i (int): The index.
RETURNS (int): The adjusted index.
"""
cdef int length = self.c.size()
if i < -length or i >= length:
raise IndexError(Errors.E856.format(i=i, length=length))
if i < 0:
i += length
return i
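Taken together, the new dunder methods give `SpanGroup` list-like ergonomics. A small usage sketch (assuming a blank English pipeline; group names and spans are ours, not from the diff):

```python
import spacy
from spacy.tokens import SpanGroup

nlp = spacy.blank("en")
doc = nlp("Their goi ng home")

group_a = SpanGroup(doc, name="errors", spans=[doc[0:1]])
group_b = SpanGroup(doc, name="typos", spans=[doc[1:3]])

# __add__ returns a new group; the left-hand name is kept, and attrs are
# merged with the left-hand side taking precedence.
combined = group_a + group_b
assert len(combined) == 2 and combined.name == "errors"

# __iadd__ (and extend) append in place.
group_a += group_b
assert len(group_a) == 2

# copy() clones the group; __getitem__/__setitem__ work with (negative) indices.
clone = group_a.copy()
clone[0] = doc[3:4]
assert clone[0].text == "home" and group_a[0].text == "Their"
```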

View File

@ -1,4 +1,4 @@
- from typing import List, Mapping, NoReturn, Union, Dict, Any, Set
+ from typing import List, Mapping, NoReturn, Union, Dict, Any, Set, cast
from typing import Optional, Iterable, Callable, Tuple, Type
from typing import Iterator, Type, Pattern, Generator, TYPE_CHECKING
from types import ModuleType
@ -294,7 +294,7 @@ def find_matching_language(lang: str) -> Optional[str]:
    # Find out which language modules we have
    possible_languages = []
-   for modinfo in pkgutil.iter_modules(spacy.lang.__path__):  # type: ignore
+   for modinfo in pkgutil.iter_modules(spacy.lang.__path__):  # type: ignore[attr-defined]
        code = modinfo.name
        if code == "xx":
            # Temporarily make 'xx' into a valid language code
@ -391,7 +391,8 @@ def get_module_path(module: ModuleType) -> Path:
    """
    if not hasattr(module, "__module__"):
        raise ValueError(Errors.E169.format(module=repr(module)))
-   return Path(sys.modules[module.__module__].__file__).parent
+   file_path = Path(cast(os.PathLike, sys.modules[module.__module__].__file__))
+   return file_path.parent
def load_model(
@ -878,7 +879,7 @@ def get_package_path(name: str) -> Path:
    # Here we're importing the module just to find it. This is worryingly
    # indirect, but it's otherwise very difficult to find the package.
    pkg = importlib.import_module(name)
-   return Path(pkg.__file__).parent
+   return Path(cast(Union[str, os.PathLike], pkg.__file__)).parent
def replace_model_node(model: Model, target: Model, replacement: Model) -> None:
@ -1675,7 +1676,7 @@ def packages_distributions() -> Dict[str, List[str]]:
    it's not available in the builtin importlib.metadata.
    """
    pkg_to_dist = defaultdict(list)
-   for dist in importlib_metadata.distributions():  # type: ignore[attr-defined]
+   for dist in importlib_metadata.distributions():
        for pkg in (dist.read_text("top_level.txt") or "").split():
            pkg_to_dist[pkg].append(dist.metadata["Name"])
    return dict(pkg_to_dist)

View File

@ -466,6 +466,18 @@ takes the same arguments as `train` and reads settings off the
</Infobox> </Infobox>
<Infobox title="Notes on span characteristics" emoji="💡">
If your pipeline contains a `spancat` component, then this command will also
report span characteristics such as the average span length and the span (or
span boundary) distinctiveness. The distinctiveness measure shows how different
the tokens are with respect to the rest of the corpus using the KL-divergence of
the token distributions. To learn more, you can check out Papay et al.'s work on
[*Dissecting Span Identification Tasks with Performance Prediction* (EMNLP
2020)](https://aclanthology.org/2020.emnlp-main.396/).
</Infobox>
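For reference, the distinctiveness score is a plain KL divergence between two unigram `Counter` distributions; a minimal sketch (not spaCy's exact smoothing), using the same numbers as the new `_get_kl_divergence` test:

```python
import math
from collections import Counter

def kl_divergence(p: Counter, q: Counter) -> float:
    # D_KL(p || q) = sum_x p(x) * log(p(x) / q(x)); q is assumed to cover
    # the support of p, and zero-probability tokens contribute nothing.
    return sum(prob * math.log(prob / q[token]) for token, prob in p.items() if prob > 0)

p = Counter({"a": 0.5, "b": 0.25})   # e.g. token distribution inside spans
q = Counter({"a": 0.25, "b": 0.50, "c": 0.15, "d": 0.10})  # corpus distribution
print(round(kl_divergence(p, q), 4))  # 0.1733
```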
```cli ```cli
$ python -m spacy debug data [config_path] [--code] [--ignore-warnings] [--verbose] [--no-format] [overrides] $ python -m spacy debug data [config_path] [--code] [--ignore-warnings] [--verbose] [--no-format] [overrides]
``` ```
@ -626,6 +638,235 @@ will not be available.
| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ | | overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ |
| **PRINTS** | Debugging information. | | **PRINTS** | Debugging information. |
### debug diff-config {#debug-diff tag="command"}
Show a diff of a config file with respect to spaCy's defaults or another config
file. If additional settings were used in the creation of the config file, then
you must supply these as extra parameters to the command when comparing to the
default settings. The generated diff can also be used when posting to the
discussion forum to provide more information for the maintainers.
```cli
$ python -m spacy debug diff-config [config_path] [--compare-to] [--optimize] [--gpu] [--pretraining] [--markdown]
```
> #### Example
>
> ```cli
> $ python -m spacy debug diff-config ./config.cfg
> ```
<Accordion title="Example output" spaced>
```
Found user-defined language: 'en'
Found user-defined pipelines: ['tok2vec', 'tagger', 'parser',
'ner']
[paths]
+ train = "./data/train.spacy"
+ dev = "./data/dev.spacy"
- train = null
- dev = null
vectors = null
init_tok2vec = null
[system]
gpu_allocator = null
+ seed = 42
- seed = 0
[nlp]
lang = "en"
pipeline = ["tok2vec","tagger","parser","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
[components]
[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
- hidden_width = 64
+ hidden_width = 36
maxout_pieces = 2
use_upper = true
nO = null
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"
[components.parser]
factory = "parser"
learn_tokens = false
min_action_freq = 30
moves = null
scorer = {"@scorers":"spacy.parser_scorer.v1"}
update_with_oracle_cut_size = 100
[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 128
maxout_pieces = 3
use_upper = true
nO = null
[components.parser.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"
[components.tagger]
factory = "tagger"
neg_prefix = "!"
overwrite = false
scorer = {"@scorers":"spacy.tagger_scorer.v1"}
[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null
[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = false
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3
[corpora]
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null
[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0
[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001
[training.score_weights]
tag_acc = 0.33
dep_uas = 0.17
dep_las = 0.17
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = 0.0
ents_f = 0.33
ents_p = 0.0
ents_r = 0.0
ents_per_type = null
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.components]
[initialize.tokenizer]
```
</Accordion>
| Name | Description |
| -------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Union[Path, str] \(positional)~~ |
| `compare_to` | Path to another config file to diff against, or `None` to compare against default settings. ~~Optional[Union[Path, str] \(option)~~ |
| `optimize`, `-o` | `"efficiency"` or `"accuracy"`. Whether the config was optimized for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). Only relevant when comparing against a default config. Defaults to `"efficiency"`. ~~str (option)~~ |
| `gpu`, `-G` | Whether the config was made to run on a GPU. Only relevant when comparing against a default config. ~~bool (flag)~~ |
| `pretraining`, `-pt` | Include config for pretraining (with [`spacy pretrain`](/api/cli#pretrain)). Only relevant when comparing against a default config. Defaults to `False`. ~~bool (flag)~~ |
| `markdown`, `-md` | Generate Markdown for Github issues. Defaults to `False`. ~~bool (flag)~~ |
| **PRINTS** | Diff between the two config files. |
### debug profile {#debug-profile tag="command"} ### debug profile {#debug-profile tag="command"}
Profile which functions take the most time in a spaCy pipeline. Input should be Profile which functions take the most time in a spaCy pipeline. Input should be
@ -1094,7 +1335,7 @@ $ python -m spacy project run [subcommand] [project_dir] [--force] [--dry]
| `subcommand` | Name of the command or workflow to run. ~~str (positional)~~ |
| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ |
| `--force`, `-F` | Force re-running steps, even if nothing changed. ~~bool (flag)~~ |
| `--dry`, `-D` | Perform a dry run and don't execute scripts. ~~bool (flag)~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
| **EXECUTES** | The command defined in the `project.yml`. |
@ -1212,12 +1453,12 @@ For more examples, see the templates in our
</Accordion> </Accordion>
| Name | Description |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ |
| `--output`, `-o` | Path to output file or `-` for stdout (default). If a file is specified and it already exists and contains auto-generated docs, only the auto-generated docs section is replaced. ~~Path (positional)~~ |
| `--no-emoji`, `-NE` | Don't use emoji in the titles. ~~bool (flag)~~ |
| **CREATES** | The Markdown-formatted project documentation. |
### project dvc {#project-dvc tag="command"} ### project dvc {#project-dvc tag="command"}
@ -1256,7 +1497,7 @@ $ python -m spacy project dvc [project_dir] [workflow] [--force] [--verbose]
| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ |
| `workflow` | Name of workflow defined in `project.yml`. Defaults to first workflow if not set. ~~Optional[str] \(option)~~ |
| `--force`, `-F` | Force-updating config file. ~~bool (flag)~~ |
| `--verbose`, `-V` | Print more output generated by DVC. ~~bool (flag)~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
| **CREATES** | A `dvc.yaml` file in the project directory, based on the steps defined in the given workflow. |
@ -1347,5 +1588,5 @@ $ python -m spacy huggingface-hub push [whl_path] [--org] [--msg] [--local-repo]
| `--org`, `-o` | Optional name of organization to which the pipeline should be uploaded. ~~str (option)~~ |
| `--msg`, `-m` | Commit message to use for update. Defaults to `"Update spaCy pipeline"`. ~~str (option)~~ |
| `--local-repo`, `-l` | Local path to the model repository (will be created if it doesn't exist). Defaults to `hub` in the current working directory. ~~Path (option)~~ |
| `--verbose`, `-V` | Output additional info for debugging, e.g. the full generated hub metadata. ~~bool (flag)~~ |
| **UPLOADS** | The pipeline to the hub. |

View File

@ -37,13 +37,13 @@ streaming.
> augmenter = null > augmenter = null
> ``` > ```
| Name | Description |
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `path` | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). ~~Path~~ |
| `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. ~~bool~~ |
| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ |
| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ |
| `augmenter` | Apply some simple data augmentation, where we replace tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don't have smart quotes, or only have smart quotes, etc. Defaults to `None`. ~~Optional[Callable]~~ |
```python ```python
%%GITHUB_SPACY/spacy/training/corpus.py %%GITHUB_SPACY/spacy/training/corpus.py
@ -71,15 +71,15 @@ train/test skew.
> corpus = Corpus("./data", limit=10) > corpus = Corpus("./data", limit=10)
> ``` > ```
| Name | Description |
| -------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| `path` | The directory or filename to read from. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to `False`. ~~bool~~ |
| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ |
| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ |
| `augmenter` | Optional data augmentation callback. ~~Callable[[Language, Example], Iterable[Example]]~~ |
| `shuffle` | Whether to shuffle the examples. Defaults to `False`. ~~bool~~ |
## Corpus.\_\_call\_\_ {#call tag="method"} ## Corpus.\_\_call\_\_ {#call tag="method"}

View File

@ -34,7 +34,7 @@ Construct a `Doc` object. The most common way to get a `Doc` object is via the
| Name | Description |
| ---------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | A storage container for lexical types. ~~Vocab~~ |
| `words` | A list of strings or integer hash values to add to the document as words. ~~Optional[List[Union[str,int]]]~~ |
| `spaces` | A list of boolean values indicating whether each word has a subsequent space. Must have the same length as `words`, if specified. Defaults to a sequence of `True`. ~~Optional[List[bool]]~~ |
| _keyword-only_ | |
| `user\_data` | Optional extra data to attach to the Doc. ~~Dict~~ |
@ -304,7 +304,8 @@ ancestor is found, e.g. if span excludes a necessary ancestor.
## Doc.has_annotation {#has_annotation tag="method"}
Check whether the doc contains annotation on a
[`Token` attribute](/api/token#attributes).
<Infobox title="Changed in v3.0" variant="warning">
@ -398,12 +399,14 @@ Concatenate multiple `Doc` objects to form a new one. Raises an error if the
> [str(ent) for doc in docs for ent in doc.ents] > [str(ent) for doc in docs for ent in doc.ents]
> ``` > ```
| Name | Description |
| -------------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
| `docs` | A list of `Doc` objects. ~~List[Doc]~~ |
| `ensure_whitespace` | Insert a space between two adjacent docs whenever the first doc does not end in whitespace. ~~bool~~ |
| `attrs` | Optional list of attribute ID ints or attribute name strings. ~~Optional[List[Union[str, int]]]~~ |
| _keyword-only_ | |
| `exclude` <Tag variant="new">3.3</Tag> | String names of Doc attributes to exclude. Supported: `spans`, `tensor`, `user_data`. ~~Iterable[str]~~ |
| **RETURNS** | The new `Doc` object containing the other docs, or `None` if `docs` is empty or `None`. ~~Optional[Doc]~~ |
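For example, to merge docs while leaving out per-doc tensors and extension data (a short sketch using the new `exclude` argument; the sentences are ours):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
docs = [nlp("Berlin is nice."), nlp("So is Oslo.")]
merged = Doc.from_docs(docs, exclude=["tensor", "user_data"])
assert len(merged) == len(docs[0]) + len(docs[1])
```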
## Doc.to_disk {#to_disk tag="method" new="2"} ## Doc.to_disk {#to_disk tag="method" new="2"}
@ -585,7 +588,7 @@ objects or a [`SpanGroup`](/api/spangroup) to a given key.
>
> ```python
> doc = nlp("Their goi ng home")
> doc.spans["errors"] = [doc[0:1], doc[1:3]]
> ```
| Name | Description | | Name | Description |
@ -618,7 +621,7 @@ relative clauses.
To customize the noun chunk iterator in a loaded pipeline, modify
[`nlp.vocab.get_noun_chunks`](/api/vocab#attributes). If the `noun_chunk`
[syntax iterator](/usage/linguistic-features#language-data) has not been
implemented for the given language, a `NotImplementedError` is raised.
> #### Example > #### Example

View File

@ -1123,7 +1123,7 @@ instance and factory instance.
| `factory` | The name of the registered component factory. ~~str~~ |
| `default_config` | The default config, describing the default values of the factory arguments. ~~Dict[str, Any]~~ |
| `assigns` | `Doc` or `Token` attributes assigned by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ |
| `requires` | `Doc` or `Token` attributes required by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ |
| `retokenizes` | Whether the component changes tokenization. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~bool~~ |
| `default_score_weights` | The scores to report during training, and their default weight towards the final score used to select the best model. Weights should sum to `1.0` per component and will be combined and normalized for the whole pipeline. If a weight is set to `None`, the score will not be logged or weighted. ~~Dict[str, Optional[float]]~~ |
| `scores` | All scores set by the components if it's trainable, e.g. `["ents_f", "ents_r", "ents_p"]`. Based on the `default_score_weights` and used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ |

View File

@ -103,11 +103,22 @@ and residual connections.
| `depth` | The number of convolutional layers. Recommended value is `4`. ~~int~~ |
| **CREATES** | The model using the architecture. ~~Model[Floats2d, Floats2d]~~ |
### spacy.HashEmbedCNN.v1 {#HashEmbedCNN_v1}
Identical to [`spacy.HashEmbedCNN.v2`](/api/architectures#HashEmbedCNN) except
using [`spacy.StaticVectors.v1`](#StaticVectors_v1) if vectors are included.
### spacy.MultiHashEmbed.v1 {#MultiHashEmbed_v1}
Identical to [`spacy.MultiHashEmbed.v2`](/api/architectures#MultiHashEmbed)
except with [`spacy.StaticVectors.v1`](#StaticVectors_v1) if vectors are
included.
### spacy.CharacterEmbed.v1 {#CharacterEmbed_v1}
Identical to [`spacy.CharacterEmbed.v2`](/api/architectures#CharacterEmbed)
except using [`spacy.StaticVectors.v1`](#StaticVectors_v1) if vectors are
included.
### spacy.TextCatEnsemble.v1 {#TextCatEnsemble_v1} ### spacy.TextCatEnsemble.v1 {#TextCatEnsemble_v1}
@ -147,41 +158,6 @@ network has an internal CNN Tok2Vec layer and uses attention.
| `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `initialize` is called. ~~Optional[int]~~ | | `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `initialize` is called. ~~Optional[int]~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ | | **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ |
### spacy.HashEmbedCNN.v1 {#HashEmbedCNN_v1}
Identical to [`spacy.HashEmbedCNN.v2`](/api/architectures#HashEmbedCNN) except
using [`spacy.StaticVectors.v1`](#StaticVectors_v1) if vectors are included.
### spacy.MultiHashEmbed.v1 {#MultiHashEmbed_v1}
Identical to [`spacy.MultiHashEmbed.v2`](/api/architectures#MultiHashEmbed)
except with [`spacy.StaticVectors.v1`](#StaticVectors_v1) if vectors are
included.
### spacy.CharacterEmbed.v1 {#CharacterEmbed_v1}
Identical to [`spacy.CharacterEmbed.v2`](/api/architectures#CharacterEmbed)
except using [`spacy.StaticVectors.v1`](#StaticVectors_v1) if vectors are
included.
## Layers {#layers}
These functions are available from `@spacy.registry.layers`.
### spacy.StaticVectors.v1 {#StaticVectors_v1}
Identical to [`spacy.StaticVectors.v2`](/api/architectures#StaticVectors) except
for the handling of tokens without vectors.
<Infobox title="Bugs for tokens without vectors" variant="warning">
`spacy.StaticVectors.v1` maps tokens without vectors to the final row in the
vectors table, which causes the model predictions to change if new vectors are
added to an existing vectors table. See more details in
[issue #7662](https://github.com/explosion/spaCy/issues/7662#issuecomment-813925655).
</Infobox>
### spacy.TextCatCNN.v1 {#TextCatCNN_v1} ### spacy.TextCatCNN.v1 {#TextCatCNN_v1}
Since `spacy.TextCatCNN.v2`, this architecture has become resizable, which means Since `spacy.TextCatCNN.v2`, this architecture has become resizable, which means
@ -246,8 +222,35 @@ the others, but may not be as accurate, especially if texts are short.
| `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `initialize` is called. ~~Optional[int]~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ |
### spacy.TransitionBasedParser.v1 {#TransitionBasedParser_v1}
Identical to
[`spacy.TransitionBasedParser.v2`](/api/architectures#TransitionBasedParser)
except that `use_upper` was set to `True` by default.
## Layers {#layers}
These functions are available from `@spacy.registry.layers`.
### spacy.StaticVectors.v1 {#StaticVectors_v1}
Identical to [`spacy.StaticVectors.v2`](/api/architectures#StaticVectors) except
for the handling of tokens without vectors.
<Infobox title="Bugs for tokens without vectors" variant="warning">
`spacy.StaticVectors.v1` maps tokens without vectors to the final row in the
vectors table, which causes the model predictions to change if new vectors are
added to an existing vectors table. See more details in
[issue #7662](https://github.com/explosion/spaCy/issues/7662#issuecomment-813925655).
</Infobox>
## Loggers {#loggers}

Logging utilities for spaCy are implemented in the
[`spacy-loggers`](https://github.com/explosion/spacy-loggers) repo, and the
functions are typically available from `@spacy.registry.loggers`.

More documentation can be found in that repo's
[readme](https://github.com/explosion/spacy-loggers/blob/main/README.md) file.
@ -30,26 +30,26 @@ pattern keys correspond to a number of
[`Token` attributes](/api/token#attributes). The supported attributes for
rule-based matching are:

| Attribute | Description |
| ---------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------- |
| `ORTH` | The exact verbatim text of a token. ~~str~~ |
| `TEXT` <Tag variant="new">2.1</Tag> | The exact verbatim text of a token. ~~str~~ |
| `NORM` | The normalized form of the token text. ~~str~~ |
| `LOWER` | The lowercase form of the token text. ~~str~~ |
| `LENGTH` | The length of the token text. ~~int~~ |
| `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~ |
| `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | Token text is in lowercase, uppercase, titlecase. ~~bool~~ |
| `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | Token is punctuation, whitespace, stop word. ~~bool~~ |
| `IS_SENT_START` | Token is start of sentence. ~~bool~~ |
| `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | Token text resembles a number, URL, email. ~~bool~~ |
| `SPACY` | Token has a trailing space. ~~bool~~ |
| `POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. ~~str~~ |
| `ENT_TYPE` | The token's entity label. ~~str~~ |
| `ENT_IOB` | The IOB part of the token's entity tag. ~~str~~ |
| `ENT_ID` | The token's entity ID (`ent_id`). ~~str~~ |
| `ENT_KB_ID` | The token's entity knowledge base ID (`ent_kb_id`). ~~str~~ |
| `_` <Tag variant="new">2.1</Tag> | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). ~~Dict[str, Any]~~ |
| `OP` | Operator or quantifier to determine how often to match a token pattern. ~~str~~ |
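As a brief sketch of how a few of these attributes combine in practice (the pattern and example text below are made up for illustration):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# "hello", an optional punctuation token, then "world" (case-insensitive)
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]
matcher.add("HELLO_WORLD", [pattern])

doc = nlp("Hello, world! Hello world!")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```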
Operators and quantifiers define **how often** a token pattern should be
matched:
@ -283,8 +283,9 @@ objects, if the document has been syntactically parsed. A base noun phrase, or
it so no NP-level coordination, no prepositional phrases, and no relative
clauses.
If the `noun_chunk` [syntax iterator](/usage/linguistic-features#language-data)
has not been implemented for the given language, a `NotImplementedError` is
raised.
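A minimal sketch of guarding against this (the pipeline and text are assumptions for illustration; noun chunks also require a dependency parse):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this pipeline is installed
doc = nlp("Autonomous cars shift insurance liability toward manufacturers.")
span = doc[0:5]
try:
    print([chunk.text for chunk in span.noun_chunks])
except NotImplementedError:
    print("No noun_chunks syntax iterator implemented for this language")
```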
> #### Example
>
@ -520,12 +521,13 @@ sent = doc[sent.start : max(sent.end, span.end)]
## Span.sents {#sents tag="property" model="sentences" new="3.2.1"}
Returns a generator over the sentences the span belongs to. This property is
only available when [sentence boundaries](/usage/linguistic-features#sbd) have
been set on the document by the `parser`, `senter`, `sentencizer` or some custom
function. It will raise an error otherwise.

If the span happens to cross sentence boundaries, all sentences the span
overlaps with will be returned.
> #### Example
>
@ -56,7 +56,7 @@ architectures and their arguments and hyperparameters.
| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `suggester` | A function that [suggests spans](#suggesters). Spans are returned as a ragged array with two integer columns, for the start and end positions. Defaults to [`ngram_suggester`](#ngram_suggester). ~~Callable[[Iterable[Doc], Optional[Ops]], Ragged]~~ |
| `model` | A model instance that is given a list of documents and `(start, end)` indices representing candidate span offsets. The model predicts a probability for each category for each span. Defaults to [SpanCategorizer](/api/architectures#SpanCategorizer). ~~Model[Tuple[List[Doc], Ragged], Floats2d]~~ |
| `spans_key` | Key of the [`Doc.spans`](/api/doc#spans) dict to save the spans under. During initialization and training, the component will look for spans on the reference document under the same key. Defaults to `"sc"`. ~~str~~ |
| `threshold` | Minimum probability to consider a prediction positive. Spans with a positive prediction will be saved on the Doc. Defaults to `0.5`. ~~float~~ |
| `max_positive` | Maximum number of labels to consider positive per span. Defaults to `None`, indicating no limit. ~~Optional[int]~~ |
| `scorer` | The scoring method. Defaults to [`Scorer.score_spans`](/api/scorer#score_spans) for `Doc.spans[spans_key]` with overlapping spans allowed. ~~Optional[Callable]~~ |
@ -93,7 +93,7 @@ shortcut for this and instantiate the component using its string name and
| `suggester` | A function that [suggests spans](#suggesters). Spans are returned as a ragged array with two integer columns, for the start and end positions. ~~Callable[[Iterable[Doc], Optional[Ops]], Ragged]~~ |
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
| _keyword-only_ | |
| `spans_key` | Key of the [`Doc.spans`](/api/doc#spans) dict to save the spans under. During initialization and training, the component will look for spans on the reference document under the same key. Defaults to `"sc"`. ~~str~~ |
| `threshold` | Minimum probability to consider a prediction positive. Spans with a positive prediction will be saved on the Doc. Defaults to `0.5`. ~~float~~ |
| `max_positive` | Maximum number of labels to consider positive per span. Defaults to `None`, indicating no limit. ~~Optional[int]~~ |
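A minimal sketch of creating the component with these settings (the label is made up, and the component still needs training data before it produces predictions):

```python
import spacy

nlp = spacy.blank("en")
# Settings correspond to the keyword arguments documented above
spancat = nlp.add_pipe("spancat", config={"spans_key": "sc", "threshold": 0.5})
spancat.add_label("MY_LABEL")
```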
@ -21,7 +21,7 @@ Create a `SpanGroup`.
>
> ```python
> doc = nlp("Their goi ng home")
> spans = [doc[0:1], doc[1:3]]
>
> # Construction 1
> from spacy.tokens import SpanGroup
@ -60,7 +60,7 @@ the scope of your function.
>
> ```python
> doc = nlp("Their goi ng home")
> doc.spans["errors"] = [doc[0:1], doc[1:3]]
> assert doc.spans["errors"].doc == doc
> ```
@ -76,9 +76,9 @@ Check whether the span group contains overlapping spans.
>
> ```python
> doc = nlp("Their goi ng home")
> doc.spans["errors"] = [doc[0:1], doc[1:3]]
> assert not doc.spans["errors"].has_overlap
> doc.spans["errors"].append(doc[2:4])
> assert doc.spans["errors"].has_overlap
> ```
@ -94,7 +94,7 @@ Get the number of spans in the group.
>
> ```python
> doc = nlp("Their goi ng home")
> doc.spans["errors"] = [doc[0:1], doc[1:3]]
> assert len(doc.spans["errors"]) == 2
> ```
@ -104,15 +104,20 @@ Get the number of spans in the group.
## SpanGroup.\_\_getitem\_\_ {#getitem tag="method"}

Get a span from the group. Note that a copy of the span is returned, so if any
changes are made to this span, they are not reflected in the corresponding
member of the span group. The item or group will need to be reassigned for
changes to be reflected in the span group.

> #### Example
>
> ```python
> doc = nlp("Their goi ng home")
> doc.spans["errors"] = [doc[0:1], doc[1:3]]
> span = doc.spans["errors"][1]
> assert span.text == "goi ng"
> span.label_ = 'LABEL'
> assert doc.spans["errors"][1].label_ != 'LABEL' # The span within the group was not updated
> ```

| Name | Description |
@ -120,6 +125,83 @@ Get a span from the group.
| `i` | The item index. ~~int~~ |
| **RETURNS** | The span at the given index. ~~Span~~ |
## SpanGroup.\_\_setitem\_\_ {#setitem tag="method", new="3.3"}
Set a span in the span group.
> #### Example
>
> ```python
> doc = nlp("Their goi ng home")
> doc.spans["errors"] = [doc[0:1], doc[1:3]]
> span = doc[0:2]
> doc.spans["errors"][0] = span
> assert doc.spans["errors"][0].text == "Their goi"
> ```
| Name | Description |
| ------ | ----------------------- |
| `i` | The item index. ~~int~~ |
| `span` | The new value. ~~Span~~ |
## SpanGroup.\_\_delitem\_\_ {#delitem tag="method", new="3.3"}
Delete a span from the span group.
> #### Example
>
> ```python
> doc = nlp("Their goi ng home")
> doc.spans["errors"] = [doc[0:1], doc[1:3]]
> del doc.spans["errors"][0]
> assert len(doc.spans["errors"]) == 1
> ```
| Name | Description |
| ---- | ----------------------- |
| `i` | The item index. ~~int~~ |
## SpanGroup.\_\_add\_\_ {#add tag="method", new="3.3"}
Concatenate the current span group with another span group and return the result
in a new span group. Any `attrs` from the first span group will have precedence
over `attrs` in the second.
> #### Example
>
> ```python
> doc = nlp("Their goi ng home")
> doc.spans["errors"] = [doc[0:1], doc[1:3]]
> doc.spans["other"] = [doc[0:2], doc[2:4]]
> span_group = doc.spans["errors"] + doc.spans["other"]
> assert len(span_group) == 4
> ```
| Name | Description |
| ----------- | ---------------------------------------------------------------------------- |
| `other` | The span group or spans to concatenate. ~~Union[SpanGroup, Iterable[Span]]~~ |
| **RETURNS** | The new span group. ~~SpanGroup~~ |
## SpanGroup.\_\_iadd\_\_ {#iadd tag="method", new="3.3"}
Append an iterable of spans or the content of a span group to the current span
group. Any `attrs` in the other span group will be added for keys that are not
already present in the current span group.
> #### Example
>
> ```python
> doc = nlp("Their goi ng home")
> doc.spans["errors"] = [doc[0:1], doc[1:3]]
> doc.spans["errors"] += [doc[3:4], doc[2:3]]
> assert len(doc.spans["errors"]) == 4
> ```
| Name | Description |
| ----------- | ----------------------------------------------------------------------- |
| `other` | The span group or spans to append. ~~Union[SpanGroup, Iterable[Span]]~~ |
| **RETURNS** | The span group. ~~SpanGroup~~ |
## SpanGroup.append {#append tag="method"}

Add a [`Span`](/api/span) object to the group. The span must refer to the same
@ -130,7 +212,7 @@ Add a [`Span`](/api/span) object to the group. The span must refer to the same
> ```python
> doc = nlp("Their goi ng home")
> doc.spans["errors"] = [doc[0:1]]
> doc.spans["errors"].append(doc[1:3])
> assert len(doc.spans["errors"]) == 2
> ```
@ -140,21 +222,42 @@ Add a [`Span`](/api/span) object to the group. The span must refer to the same
## SpanGroup.extend {#extend tag="method"}

Add multiple [`Span`](/api/span) objects or contents of another `SpanGroup` to
the group. All spans must refer to the same [`Doc`](/api/doc) object as the span
group.

> #### Example
>
> ```python
> doc = nlp("Their goi ng home")
> doc.spans["errors"] = []
> doc.spans["errors"].extend([doc[1:3], doc[0:1]])
> assert len(doc.spans["errors"]) == 2
> span_group = SpanGroup(doc, spans=[doc[1:4], doc[0:3]])
> doc.spans["errors"].extend(span_group)
> ```

| Name | Description |
| ------- | -------------------------------------------------------- |
| `spans` | The spans to add. ~~Union[SpanGroup, Iterable["Span"]]~~ |
## SpanGroup.copy {#copy tag="method", new="3.3"}
Return a copy of the span group.
> #### Example
>
> ```python
> from spacy.tokens import SpanGroup
>
> doc = nlp("Their goi ng home")
> doc.spans["errors"] = [doc[1:3], doc[0:3]]
> new_group = doc.spans["errors"].copy()
> ```
| Name | Description |
| ----------- | ----------------------------------------------- |
| **RETURNS** | A copy of the `SpanGroup` object. ~~SpanGroup~~ |
## SpanGroup.to_bytes {#to_bytes tag="method"}
@ -164,7 +267,7 @@ Serialize the span group to a bytestring.
>
> ```python
> doc = nlp("Their goi ng home")
> doc.spans["errors"] = [doc[0:1], doc[1:3]]
> group_bytes = doc.spans["errors"].to_bytes()
> ```
@ -183,7 +286,7 @@ it.
> from spacy.tokens import SpanGroup
>
> doc = nlp("Their goi ng home")
> doc.spans["errors"] = [doc[0:1], doc[1:3]]
> group_bytes = doc.spans["errors"].to_bytes()
> new_group = SpanGroup(doc)
> new_group.from_bytes(group_bytes)
@ -263,7 +263,7 @@ Render a dependency parse tree or named entity visualization.
| Name | Description |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `docs` | Document(s) or span(s) to visualize. ~~Union[Iterable[Union[Doc, Span, dict]], Doc, Span, dict]~~ |
| `style` | Visualization style, `"dep"` or `"ent"`. Defaults to `"dep"`. ~~str~~ |
| `page` | Render markup as full HTML page. Defaults to `True`. ~~bool~~ |
| `minify` | Minify HTML markup. Defaults to `False`. ~~bool~~ |
@ -320,7 +320,6 @@ If a setting is not present in the options, the default value will be used.
| `template` <Tag variant="new">2.2</Tag> | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. See [`templates.py`](%%GITHUB_SPACY/spacy/displacy/templates.py) for examples. ~~Optional[str]~~ |
| `kb_url_template` <Tag variant="new">3.2.1</Tag> | Optional template to construct the KB url for the entity to link to. Expects a python f-string format with single field to fill in. ~~Optional[str]~~ |

#### Span Visualizer options {#displacy_options-span}

> #### Example
@ -330,21 +329,19 @@ If a setting is not present in the options, the default value will be used.
> displacy.serve(doc, style="span", options=options)
> ```

| Name | Description |
| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `spans_key` | Which spans key to render spans from. Default is `"sc"`. ~~str~~ |
| `templates` | Dictionary containing the keys `"span"`, `"slice"`, and `"start"`. These dictate how the overall span, a span slice, and the starting token will be rendered. ~~Optional[Dict[str, str]]~~ |
| `kb_url_template` | Optional template to construct the KB url for the entity to link to. Expects a python f-string format with single field to fill in. ~~Optional[str]~~ |
| `colors` | Color overrides. Entity types should be mapped to color names or values. ~~Dict[str, str]~~ |
By default, displaCy comes with colors for all entity types used by
[spaCy's trained pipelines](/models) for both entity and span visualizer. If
you're using custom entity types, you can use the `colors` setting to add your
own colors for them. Your application or pipeline package can also expose a
[`spacy_displacy_colors` entry point](/usage/saving-loading#entry-points-displacy)
to add custom labels and their colors automatically.
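A short sketch of the `colors` setting (the color value is arbitrary and `en_core_web_sm` is assumed to be installed):

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Ada Lovelace wrote the first published algorithm.")
# Override the color used for PERSON entities when rendering
options = {"colors": {"PERSON": "#a6e22d"}}
html = displacy.render(doc, style="ent", options=options, page=True)
```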
By default, displaCy links to `#` for entities without a `kb_id` set on their
span. If you wish to link an entity to their URL then consider using the
@ -354,7 +351,6 @@ span. If you wish to link an entity to their URL then consider using the
should redirect you to their Wikidata page, in this case
`https://www.wikidata.org/wiki/Q95`.

## registry {#registry source="spacy/util.py" new="3"}

spaCy's function registry extends
@ -443,8 +439,8 @@ and the accuracy scores on the development set.
The built-in, default logger is the ConsoleLogger, which prints results to the
console in tabular format. The
[spacy-loggers](https://github.com/explosion/spacy-loggers) package, included as
a dependency of spaCy, enables other loggers, such as one that sends results to
a [Weights & Biases](https://www.wandb.com/) dashboard.

Instead of using one of the built-in loggers, you can
[implement your own](/usage/training#custom-logging).
@ -583,14 +579,14 @@ the [`Corpus`](/api/corpus) class.
> limit = 0
> ```

| Name | Description |
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `path` | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). ~~Union[str, Path]~~ |
| `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. ~~bool~~ |
| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ |
| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ |
| `augmenter` | Apply some simple data augmentation, where we replace tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don't have smart-quotes, or only have smart quotes, etc. Defaults to `None`. ~~Optional[Callable]~~ |
| **CREATES** | The corpus reader. ~~Corpus~~ |

#### spacy.JsonlCorpus.v1 {#jsonlcorpus tag="registered function"}
@ -347,14 +347,14 @@ supported for `floret` mode.
> most_similar = nlp.vocab.vectors.most_similar(queries, n=10)
> ```

| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `queries` | An array with one or more vectors. ~~numpy.ndarray~~ |
| _keyword-only_ | |
| `batch_size` | The batch size to use. Defaults to `1024`. ~~int~~ |
| `n` | The number of entries to return for each query. Defaults to `1`. ~~int~~ |
| `sort` | Whether to sort the entries returned by score. Defaults to `True`. ~~bool~~ |
| **RETURNS** | The most similar entries as a `(keys, best_rows, scores)` tuple. ~~Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]~~ |

## Vectors.get_batch {#get_batch tag="method" new="3.2"}
@ -30,10 +30,16 @@ into three components:
   tagging, parsing, lemmatization and named entity recognition, or `dep` for
   only tagging, parsing and lemmatization).
2. **Genre:** Type of text the pipeline is trained on, e.g. `web` or `news`.
3. **Size:** Package size indicator, `sm`, `md`, `lg` or `trf`.

   `sm` and `trf` pipelines have no static word vectors.

   For pipelines with default vectors, `md` has a reduced word vector table with
   20k unique vectors for ~500k words and `lg` has a large word vector table
   with ~500k entries.

   For pipelines with floret vectors, `md` vector tables have 50k entries and
   `lg` vector tables have 200k entries.

For example, [`en_core_web_sm`](/models/en#en_core_web_sm) is a small English
pipeline trained on written web text (blogs, news, comments), that includes
@ -90,19 +96,42 @@ Main changes from spaCy v2 models:
In the `sm`/`md`/`lg` models:

- The `tagger`, `morphologizer` and `parser` components listen to the `tok2vec`
  component. If the lemmatizer is trainable (v3.3+), `lemmatizer` also listens
  to `tok2vec`.
- The `attribute_ruler` maps `token.tag` to `token.pos` if there is no
  `morphologizer`. The `attribute_ruler` additionally makes sure whitespace is
  tagged consistently and copies `token.pos` to `token.tag` if there is no
  tagger. For English, the attribute ruler can improve its mapping from
  `token.tag` to `token.pos` if dependency parses from a `parser` are present,
  but the parser is not required.
- The `lemmatizer` component for many languages requires `token.pos` annotation
  from either `tagger`+`attribute_ruler` or `morphologizer`.
- The `ner` component is independent with its own internal tok2vec layer.
#### CNN/CPU pipelines with floret vectors
The Finnish, Korean and Swedish `md` and `lg` pipelines use
[floret vectors](/usage/v3-2#vectors) instead of default vectors. If you're
running a trained pipeline on texts and working with [`Doc`](/api/doc) objects,
you shouldn't notice any difference with floret vectors. With floret vectors no
tokens are out-of-vocabulary, so [`Token.is_oov`](/api/token#attributes) will
return `False` for all tokens.
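As a quick sketch (assuming the floret-based Finnish `fi_core_news_md` pipeline is installed):

```python
import spacy

nlp = spacy.load("fi_core_news_md")
doc = nlp("Tämä on esimerkkilause.")
# With floret vectors every token gets a subword-based vector,
# so nothing is treated as out-of-vocabulary.
assert not any(token.is_oov for token in doc)
print(doc[-2].vector[:5])
```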
If you access vectors directly for similarity comparisons, there are a few
differences because floret vectors don't include a fixed word list like the
vector keys for default vectors.
- If your workflow iterates over the vector keys, you need to use an external
word list instead:
```diff
- lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
+ lexemes = [nlp.vocab[word] for word in external_word_list]
```
- [`Vectors.most_similar`](/api/vectors#most_similar) is not supported because
there's no fixed list of vectors to compare your vectors to.
### Transformer pipeline design {#design-trf}

In the transformer (`trf`) models, the `tagger`, `parser` and `ner` (if present)
@ -133,10 +162,14 @@ nlp = spacy.load("en_core_web_trf", disable=["tagger", "attribute_ruler", "lemma
<Infobox variant="warning" title="Rule-based and POS-lookup lemmatizers require
Token.pos">

The lemmatizer depends on `tagger`+`attribute_ruler` or `morphologizer` for a
number of languages. If you disable any of these components, you'll see
lemmatizer warnings unless the lemmatizer is also disabled.

**v3.3**: Catalan, English, French, Russian and Spanish

**v3.0-v3.2**: Catalan, Dutch, English, French, Greek, Italian, Macedonian,
Norwegian, Polish, Russian and Spanish

</Infobox>
@ -154,10 +187,34 @@ nlp.enable_pipe("senter")
The `senter` component is ~10&times; faster than the parser and more accurate
than the rule-based `sentencizer`.
#### Switch from trainable lemmatizer to default lemmatizer
Since v3.3, a number of pipelines use a trainable lemmatizer. You can check whether
the lemmatizer is trainable:
```python
nlp = spacy.load("de_core_web_sm")
assert nlp.get_pipe("lemmatizer").is_trainable
```
If you'd like to switch to a non-trainable lemmatizer that's similar to v3.2 or
earlier, you can replace the trainable lemmatizer with the default non-trainable
lemmatizer:
```python
# Requirements: pip install spacy-lookups-data
nlp = spacy.load("de_core_web_sm")
# Remove existing lemmatizer
nlp.remove_pipe("lemmatizer")
# Add non-trainable lemmatizer from language defaults
# and load lemmatizer tables from spacy-lookups-data
nlp.add_pipe("lemmatizer").initialize()
```
#### Switch from rule-based to lookup lemmatization

For the Dutch, English, French, Greek, Macedonian, Norwegian and Spanish
pipelines, you can swap out a trainable or rule-based lemmatizer for a lookup
lemmatizer:

```python
@ -530,7 +530,8 @@ models, which can **improve the accuracy** of your components.
Word vectors in spaCy are "static" in the sense that they are not learned
parameters of the statistical models, and spaCy itself does not feature any
algorithms for learning word vector tables. You can train a word vectors table
using tools such as [floret](https://github.com/explosion/floret),
[Gensim](https://radimrehurek.com/gensim/),
[FastText](https://fasttext.cc/) or
[GloVe](https://nlp.stanford.edu/projects/glove/), or download existing
pretrained vectors. The [`init vectors`](/api/cli#init-vectors) command lets you
@ -129,15 +129,14 @@ machine learning library, [Thinc](https://thinc.ai). For GPU support, we've been
grateful to use the work of Chainer's [CuPy](https://cupy.chainer.org) module,
which provides a numpy-compatible interface for GPU arrays.

spaCy can be installed for a CUDA-compatible GPU by specifying `spacy[cuda]`,
`spacy[cuda102]`, `spacy[cuda112]`, `spacy[cuda113]`, etc. If you know your
CUDA version, using the more explicit specifier allows CuPy to be installed via
wheel, saving some compilation time. The specifiers should install
[`cupy`](https://cupy.chainer.org).

```bash
$ pip install -U %%SPACY_PKG_NAME[cuda113]%%SPACY_PKG_FLAGS
```

Once you have a GPU-enabled installation, the best way to activate it is to call
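A minimal sketch of that activation step (`en_core_web_sm` is only an assumed example pipeline):

```python
import spacy

# prefer_gpu() uses the GPU if one is available and falls back to CPU,
# while require_gpu() raises an error instead. Call it before loading pipelines.
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")
```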
@ -48,7 +48,7 @@ but do not change its part-of-speech. We say that a **lemma** (root form) is
**inflected** (modified/combined) with one or more **morphological features** to
create a surface form. Here are some examples:

| Context | Surface | Lemma | POS | Morphological Features |
| ---------------------------------------- | ------- | ----- | ------ | ---------------------------------------- |
| I was reading the paper | reading | read | `VERB` | `VerbForm=Ger` |
| I don't watch the news, I read the paper | read | read | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Pres` |
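A small sketch of inspecting these attributes on a processed text (assumes `en_core_web_sm` is installed; the exact feature values depend on the model):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I was reading the paper")
for token in doc:
    # token.morph exposes the morphological features assigned to the token
    print(token.text, token.lemma_, token.pos_, token.morph)
```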
@ -430,7 +430,7 @@ for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
```

| Text | POS | Dep | Head text |
| ----------------------------------- | ------ | ------- | --------- |
| Credit and mortgage account holders | `NOUN` | `nsubj` | submit |
| must | `VERB` | `aux` | submit |
@ -27,6 +27,35 @@ import QuickstartModels from 'widgets/quickstart-models.js'
<QuickstartModels title="Quickstart" id="quickstart" description="Install a default trained pipeline package, get the code to load it from within spaCy and an example to test it. For more options, see the section on available packages below." />
### Usage note
> If lemmatization rules are available for your language, make sure to install
> spaCy with the `lookups` option, or install
> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
> separately in the same environment:
>
> ```bash
> $ pip install -U %%SPACY_PKG_NAME[lookups]%%SPACY_PKG_FLAGS
> ```
If a trained pipeline is available for a language, you can download it using the
[`spacy download`](/api/cli#download) command as shown above. In order to use
languages that don't yet come with a trained pipeline, you have to import them
directly, or use [`spacy.blank`](/api/top-level#spacy.blank):
```python
import spacy
from spacy.lang.yo import Yoruba
nlp = Yoruba() # use directly
nlp = spacy.blank("yo") # blank instance
```
A blank pipeline is typically just a tokenizer. You might want to create a blank
pipeline when you only need a tokenizer, when you want to add more components
from scratch, or for testing purposes. Initializing the language object directly
yields the same result as generating it using `spacy.blank()`. In both cases the
default configuration for the chosen language is loaded, and no pretrained
components will be available.
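A small sketch of that workflow, using the rule-based `sentencizer`, which needs no training:

```python
import spacy

nlp = spacy.blank("en")      # tokenizer only
nlp.add_pipe("sentencizer")  # add a component from scratch
doc = nlp("This is one sentence. This is another one.")
assert len(list(doc.sents)) == 2
```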
## Language support {#languages}

spaCy currently provides support for the following languages. You can help by
@ -37,28 +66,6 @@ contribute to development. Also see the
[training documentation](/usage/training) for how to train your own pipelines on
your data.
import Languages from 'widgets/languages.js'

<Languages />
@ -94,9 +94,8 @@ also use any private repo you have access to with Git.
Assets are data files your project needs, for example the training and
evaluation data or pretrained vectors and embeddings to initialize your model
with. Each project template comes with a `project.yml` that defines the assets
to download and where to put them. The [`spacy project assets`](/api/cli#run)
command will fetch the project assets for you:

```cli
$ cd some_example_project
@ -108,6 +107,11 @@ even cloud storage such as GCS and S3. You can also fetch assets using git, by
replacing the `url` string with a `git` block. spaCy will use Git's "sparse
checkout" feature to avoid downloading the whole repository.
Sometimes your project configuration may include large assets that you don't
necessarily want to download when you run `spacy project assets`. That's why
assets can be marked as [`extra`](#data-assets-url) - by default, these assets
are not downloaded. If they should be, run `spacy project assets --extra`.
### 3. Run a command {#run}

> #### project.yml
@ -215,19 +219,27 @@ pipelines.
> #### Tip: Multi-line YAML syntax for long values
>
> YAML has [multi-line syntax](https://yaml-multiline.info/) that can be helpful
> for readability with longer values such as project descriptions or commands
> that take several arguments.

```yaml
%%GITHUB_PROJECTS/pipelines/tagger_parser_ud/project.yml
```
> #### Tip: Overriding variables on the CLI
>
> If you want to override one or more variables on the CLI and are not already specifying a
> project directory, you need to add `.` as a placeholder:
>
> ```
> python -m spacy project run test . --vars.foo bar
> ```
| Section | Description | | Section | Description |
| --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `title` | An optional project title used in `--help` message and [auto-generated docs](#custom-docs). | | `title` | An optional project title used in `--help` message and [auto-generated docs](#custom-docs). |
| `description` | An optional project description used in [auto-generated docs](#custom-docs). | | `description` | An optional project description used in [auto-generated docs](#custom-docs). |
| `vars` | A dictionary of variables that can be referenced in paths, URLs and scripts, just like [`config.cfg` variables](/usage/training#config-interpolation). For example, `${vars.name}` will use the value of the variable `name`. Variables need to be defined in the section `vars`, but can be a nested dict, so you're able to reference `${vars.model.name}`. | | `vars` | A dictionary of variables that can be referenced in paths, URLs and scripts and overriden on the CLI, just like [`config.cfg` variables](/usage/training#config-interpolation). For example, `${vars.name}` will use the value of the variable `name`. Variables need to be defined in the section `vars`, but can be a nested dict, so you're able to reference `${vars.model.name}`. |
| `env` | A dictionary of variables, mapped to the names of environment variables that will be read in when running the project. For example, `${env.name}` will use the value of the environment variable defined as `name`. | | `env` | A dictionary of variables, mapped to the names of environment variables that will be read in when running the project. For example, `${env.name}` will use the value of the environment variable defined as `name`. |
| `directories` | An optional list of [directories](#project-files) that should be created in the project for assets, training outputs, metrics etc. spaCy will make sure that these directories always exist. | | `directories` | An optional list of [directories](#project-files) that should be created in the project for assets, training outputs, metrics etc. spaCy will make sure that these directories always exist. |
| `assets` | A list of assets that can be fetched with the [`project assets`](/api/cli#project-assets) command. `url` defines a URL or local path, `dest` is the destination file relative to the project directory, and an optional `checksum` ensures that an error is raised if the file's checksum doesn't match. Instead of `url`, you can also provide a `git` block with the keys `repo`, `branch` and `path`, to download from a Git repo. | | `assets` | A list of assets that can be fetched with the [`project assets`](/api/cli#project-assets) command. `url` defines a URL or local path, `dest` is the destination file relative to the project directory, and an optional `checksum` ensures that an error is raised if the file's checksum doesn't match. Instead of `url`, you can also provide a `git` block with the keys `repo`, `branch` and `path`, to download from a Git repo. |
@ -261,8 +273,9 @@ dependencies to use certain protocols.
> - dest: 'assets/training.spacy'
>   url: 'https://example.com/data.spacy'
>   checksum: '63373dd656daa1fd3043ce166a59474c'
> # Optional download from Google Cloud Storage bucket
> - dest: 'assets/development.spacy'
>   extra: True
>   url: 'gs://your-bucket/corpora'
>   checksum: '5113dc04e03f079525edd8df3f4f39e3'
> ```
@ -270,6 +283,7 @@ dependencies to use certain protocols.
| Name | Description |
| ------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `dest` | The destination path to save the downloaded asset to (relative to the project directory), including the file name. |
| `extra` | Optional flag determining whether this asset is downloaded only if `spacy project assets` is run with `--extra`. `False` by default. |
| `url` | The URL to download from, using the respective protocol. |
| `checksum` | Optional checksum of the file. If provided, it will be used to verify that the file matches and downloads will be skipped if a local file with the same checksum already exists. |
| `description` | Optional asset description, used in [auto-generated docs](#custom-docs). |
@ -294,12 +308,12 @@ files you need and not the whole repo.
> description: 'The training data (5000 examples)'
> ```

| Name | Description |
| ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `dest` | The destination path to save the downloaded asset to (relative to the project directory), including the file name. |
| `git` | `repo`: The URL of the repo to download from.<br />`path`: Path of the file or directory to download, relative to the repo root. "" specifies the root directory.<br />`branch`: The branch to download from. Defaults to `"master"`. |
| `checksum` | Optional checksum of the file. If provided, it will be used to verify that the file matches and downloads will be skipped if a local file with the same checksum already exists. |
| `description` | Optional asset description, used in [auto-generated docs](#custom-docs). |

#### Working with private assets {#data-asets-private}

View File

@ -158,23 +158,23 @@ The available token pattern keys correspond to a number of
[`Token` attributes](/api/token#attributes). The supported attributes for
rule-based matching are:
| Attribute                                      | Description |
| ---------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `ORTH`                                         | The exact verbatim text of a token. ~~str~~ |
| `TEXT` <Tag variant="new">2.1</Tag>            | The exact verbatim text of a token. ~~str~~ |
| `NORM`                                         | The normalized form of the token text. ~~str~~ |
| `LOWER`                                        | The lowercase form of the token text. ~~str~~ |
| `LENGTH`                                       | The length of the token text. ~~int~~ |
| `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`             | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~ |
| `IS_LOWER`, `IS_UPPER`, `IS_TITLE`             | Token text is in lowercase, uppercase, titlecase. ~~bool~~ |
| `IS_PUNCT`, `IS_SPACE`, `IS_STOP`              | Token is punctuation, whitespace, stop word. ~~bool~~ |
| `IS_SENT_START`                                | Token is start of sentence. ~~bool~~ |
| `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`           | Token text resembles a number, URL, email. ~~bool~~ |
| `SPACY`                                        | Token has a trailing space. ~~bool~~ |
| `POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. Note that the values of these attributes are case-sensitive. For a list of available part-of-speech tags and dependency labels, see the [Annotation Specifications](/api/annotation). ~~str~~ |
| `ENT_TYPE`                                     | The token's entity label. ~~str~~ |
| `_` <Tag variant="new">2.1</Tag>               | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). ~~Dict[str, Any]~~ |
| `OP`                                           | [Operator or quantifier](#quantifiers) to determine how often to match a token pattern. ~~str~~ |
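For illustration, here's a minimal sketch of a token pattern built from a couple of the attributes above (the pattern, example text and `en_core_web_sm` model are placeholders, not part of the table):

```python
import spacy
from spacy.matcher import Matcher

# Match a proper noun followed by a token whose lowercase form is "street".
nlp = spacy.load("en_core_web_sm")  # provides the POS tags the pattern relies on
matcher = Matcher(nlp.vocab)
matcher.add("STREET_NAME", [[{"POS": "PROPN"}, {"LOWER": "street"}]])

doc = nlp("They met on Baker Street yesterday.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # expected: "Baker Street"
```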
<Accordion title="Does it matter if the attribute names are uppercase or lowercase?">
@ -949,7 +949,7 @@ for match_id, start, end in matcher(doc):
The examples here use [`nlp.make_doc`](/api/language#make_doc) to create `Doc`
object patterns as efficiently as possible and without running any of the other
pipeline components. If the token attribute you want to match on is set by a
pipeline component, **make sure that the pipeline component runs** when you
create the pattern. For example, to match on `POS` or `LEMMA`, the pattern `Doc`
objects need to have part-of-speech tags set by the `tagger` or `morphologizer`.
@ -960,9 +960,9 @@ disable components selectively.
</Infobox>
Another possible use case is matching number tokens like IP addresses based on
their shape. This means that you won't have to worry about how those strings
will be tokenized and you'll be able to find tokens and combinations of tokens
based on a few examples. Here, we're matching on the shapes `ddd.d.d.d` and
`ddd.ddd.d.d`:
```python
@ -1433,7 +1433,7 @@ of `"phrase_matcher_attr": "POS"` for the entity ruler.
Running the full language pipeline across every pattern in a large list scales
linearly and can therefore take a long time on large amounts of phrase patterns.
As of spaCy v2.2.4 the `add_patterns` function has been refactored to use
`nlp.pipe` on all phrase patterns resulting in about a 10x-20x speed up with
5,000-100,000 phrase patterns respectively. Even with this speedup (but
especially if you're using an older version) the `add_patterns` function can
still take a long time. An easy workaround to make this function run faster is
@ -247,7 +247,7 @@ a consistent format. There are no command-line arguments that need to be set,
and no hidden defaults. However, there can still be scenarios where you may want
to override config settings when you run [`spacy train`](/api/cli#train). This
includes **file paths** to vectors or other resources that shouldn't be
hard-coded in a config file, or **system-dependent settings**.
For cases like this, you can set additional command-line options starting with
`--` that correspond to the config section and value to override. For example,
@ -730,7 +730,7 @@ with the name of the respective [registry](/api/top-level#registry), e.g.
`@spacy.registry.architectures`, and a string name to assign to your function.
Registering custom functions allows you to **plug in models** defined in PyTorch
or TensorFlow, make **custom modifications** to the `nlp` object, create custom
optimizers or schedules, or **stream in data** and preprocess it on the fly
while training.
Each custom function can have any number of arguments that are passed in via the
@ -132,13 +132,13 @@ your own.
> contributions for Catalan and to Kenneth Enevoldsen for Danish. For additional
> Danish pipelines, check out [DaCy](https://github.com/KennethEnevoldsen/DaCy).
| Package                                            | Language | UPOS | Parser LAS | NER F |
| -------------------------------------------------- | -------- | ---: | ---------: | ----: |
| [`ca_core_news_sm`](/models/ca#ca_core_news_sm)    | Catalan  | 98.2 |       87.4 |  79.8 |
| [`ca_core_news_md`](/models/ca#ca_core_news_md)    | Catalan  | 98.3 |       88.2 |  84.0 |
| [`ca_core_news_lg`](/models/ca#ca_core_news_lg)    | Catalan  | 98.5 |       88.4 |  84.2 |
| [`ca_core_news_trf`](/models/ca#ca_core_news_trf)  | Catalan  | 98.9 |       93.0 |  91.2 |
| [`da_core_news_trf`](/models/da#da_core_news_trf)  | Danish   | 98.0 |       85.0 |  82.9 |
### Resizable text classification architectures {#resizable-textcat}
website/docs/usage/v3-3.md Normal file
@ -0,0 +1,247 @@
---
title: What's New in v3.3
teaser: New features and how to upgrade
menu:
- ['New Features', 'features']
- ['Upgrading Notes', 'upgrading']
---
## New features {#features hidden="true"}
spaCy v3.3 improves the speed of core pipeline components, adds a new trainable
lemmatizer, and introduces trained pipelines for Finnish, Korean and Swedish.
### Speed improvements {#speed}
v3.3 includes a slew of speed improvements:
- Speed up parser and NER by using constant-time head lookups.
- Support unnormalized softmax probabilities in `spacy.Tagger.v2` to speed up
inference for tagger, morphologizer, senter and trainable lemmatizer.
- Speed up parser projectivization functions.
- Replace `Ragged` with faster `AlignmentArray` in `Example` for training.
- Improve `Matcher` speed.
- Improve serialization speed for empty `Doc.spans`.
For longer texts, the trained pipeline speeds improve **15%** or more in
prediction. We benchmarked `en_core_web_md` (same components as in v3.2) and
`de_core_news_md` (with the new trainable lemmatizer) across a range of text
sizes on Linux (Intel Xeon W-2265) and OS X (M1) to compare spaCy v3.2 vs. v3.3:
**Intel Xeon W-2265**
| Model | Avg. Words/Doc | v3.2 Words/Sec | v3.3 Words/Sec | Diff |
| :----------------------------------------------- | -------------: | -------------: | -------------: | -----: |
| [`en_core_web_md`](/models/en#en_core_web_md) | 100 | 17292 | 17441 | 0.86% |
| (=same components) | 1000 | 15408 | 16024 | 4.00% |
| | 10000 | 12798 | 15346 | 19.91% |
| [`de_core_news_md`](/models/de/#de_core_news_md) | 100 | 20221 | 19321 | -4.45% |
| (+v3.3 trainable lemmatizer) | 1000 | 17480 | 17345 | -0.77% |
| | 10000 | 14513 | 17036 | 17.38% |
**Apple M1**
| Model | Avg. Words/Doc | v3.2 Words/Sec | v3.3 Words/Sec | Diff |
| ------------------------------------------------ | -------------: | -------------: | -------------: | -----: |
| [`en_core_web_md`](/models/en#en_core_web_md) | 100 | 18272 | 18408 | 0.74% |
| (=same components) | 1000 | 18794 | 19248 | 2.42% |
| | 10000 | 15144 | 17513 | 15.64% |
| [`de_core_news_md`](/models/de/#de_core_news_md) | 100 | 19227 | 19591 | 1.89% |
| (+v3.3 trainable lemmatizer) | 1000 | 20047 | 20628 | 2.90% |
| | 10000 | 15921 | 18546 | 16.49% |
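For a rough sense of how these numbers translate to your own hardware, you can measure words per second directly; the model name and synthetic texts below are placeholders rather than the exact benchmark setup:

```python
import time
import spacy

nlp = spacy.load("en_core_web_md")
# Synthetic documents of roughly 1,000 words each, just for a quick measurement.
texts = [" ".join(["This is a sentence."] * 250)] * 100
start = time.perf_counter()
n_words = sum(len(doc) for doc in nlp.pipe(texts))
print(f"{n_words / (time.perf_counter() - start):.0f} words/sec")
```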
### Trainable lemmatizer {#trainable-lemmatizer}
The new [trainable lemmatizer](/api/edittreelemmatizer) component uses
[edit trees](https://explosion.ai/blog/edit-tree-lemmatizer) to transform tokens
into lemmas. Try out the trainable lemmatizer with the
[training quickstart](/usage/training#quickstart)!
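As a minimal sketch (using a blank English pipeline rather than one of the trained pipelines), the component is added via the `trainable_lemmatizer` factory and still has to be trained on annotated data before it can predict lemmas:

```python
import spacy

nlp = spacy.blank("en")
# Adds an untrained edit tree lemmatizer; train it before calling nlp() on text.
nlp.add_pipe("trainable_lemmatizer")
print(nlp.pipe_names)  # ['trainable_lemmatizer']
```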
### displaCy support for overlapping spans and arcs {#displacy}
displaCy now supports overlapping spans with a new
[`span`](/usage/visualizers#span) style and multiple arcs with different labels
between the same tokens for [`dep`](/usage/visualizers#dep) visualizations.
Overlapping spans can be visualized for any spans key in `doc.spans`:
```python
import spacy
from spacy import displacy
from spacy.tokens import Span
nlp = spacy.blank("en")
text = "Welcome to the Bank of China."
doc = nlp(text)
doc.spans["custom"] = [Span(doc, 3, 6, "ORG"), Span(doc, 5, 6, "GPE")]
displacy.serve(doc, style="span", options={"spans_key": "custom"})
```
import DisplacySpanHtml from 'images/displacy-span.html'
<Iframe title="displaCy visualizer for overlapping spans" html={DisplacySpanHtml} height={180} />
## Additional features and improvements
- Config comparisons with [`spacy debug diff-config`](/api/cli#debug-diff).
- Span suggester debugging with
[`SpanCategorizer.set_candidates`](/api/spancategorizer#set_candidates).
- Big endian support with
[`thinc-bigendian-ops`](https://github.com/andrewsi-z/thinc-bigendian-ops) and
updates to make `floret`, `murmurhash`, Thinc and spaCy endian neutral.
- Initial support for Lower Sorbian and Upper Sorbian.
- Language updates for English, French, Italian, Japanese, Korean, Norwegian,
Russian, Slovenian, Spanish, Turkish, Ukrainian and Vietnamese.
- New noun chunks for Finnish.
## Trained pipelines {#pipelines}
### New trained pipelines {#new-pipelines}
v3.3 introduces new CPU/CNN pipelines for Finnish, Korean and Swedish, which use
the new trainable lemmatizer and
[floret vectors](https://github.com/explosion/floret). Due to the use of
[Bloom embeddings](https://explosion.ai/blog/bloom-embeddings) and subwords, the
pipelines have compact vectors with no out-of-vocabulary words.
| Package | Language | UPOS | Parser LAS | NER F |
| ----------------------------------------------- | -------- | ---: | ---------: | ----: |
| [`fi_core_news_sm`](/models/fi#fi_core_news_sm) | Finnish | 92.5 | 71.9 | 75.9 |
| [`fi_core_news_md`](/models/fi#fi_core_news_md) | Finnish | 95.9 | 78.6 | 80.6 |
| [`fi_core_news_lg`](/models/fi#fi_core_news_lg) | Finnish | 96.2 | 79.4 | 82.4 |
| [`ko_core_news_sm`](/models/ko#ko_core_news_sm) | Korean | 86.1 | 65.6 | 71.3 |
| [`ko_core_news_md`](/models/ko#ko_core_news_md) | Korean | 94.7 | 80.9 | 83.1 |
| [`ko_core_news_lg`](/models/ko#ko_core_news_lg) | Korean | 94.7 | 81.3 | 85.3 |
| [`sv_core_news_sm`](/models/sv#sv_core_news_sm) | Swedish | 95.0 | 75.9 | 74.7 |
| [`sv_core_news_md`](/models/sv#sv_core_news_md) | Swedish | 96.3 | 78.5 | 79.3 |
| [`sv_core_news_lg`](/models/sv#sv_core_news_lg) | Swedish | 96.3 | 79.1 | 81.1 |
### Pipeline updates {#pipeline-updates}
The following languages switch from lookup or rule-based lemmatizers to the new
trainable lemmatizer: Danish, Dutch, German, Greek, Italian, Lithuanian,
Norwegian, Polish, Portuguese and Romanian. The overall lemmatizer accuracy
improves for all of these pipelines, but be aware that the types of errors may
look quite different from the lookup-based lemmatizers. If you'd prefer to
continue using the previous lemmatizer, you can
[switch from the trainable lemmatizer to a non-trainable lemmatizer](/models#design-modify).
<figure>
| Model | v3.2 Lemma Acc | v3.3 Lemma Acc |
| ----------------------------------------------- | -------------: | -------------: |
| [`da_core_news_md`](/models/da#da_core_news_md) | 84.9 | 94.8 |
| [`de_core_news_md`](/models/de#de_core_news_md) | 73.4 | 97.7 |
| [`el_core_news_md`](/models/el#el_core_news_md) | 56.5 | 88.9 |
| [`fi_core_news_md`](/models/fi#fi_core_news_md) | - | 86.2 |
| [`it_core_news_md`](/models/it#it_core_news_md) | 86.6 | 97.2 |
| [`ko_core_news_md`](/models/ko#ko_core_news_md) | - | 90.0 |
| [`lt_core_news_md`](/models/lt#lt_core_news_md) | 71.1 | 84.8 |
| [`nb_core_news_md`](/models/nb#nb_core_news_md) | 76.7 | 97.1 |
| [`nl_core_news_md`](/models/nl#nl_core_news_md) | 81.5 | 94.0 |
| [`pl_core_news_md`](/models/pl#pl_core_news_md) | 87.1 | 93.7 |
| [`pt_core_news_md`](/models/pt#pt_core_news_md) | 76.7 | 96.9 |
| [`ro_core_news_md`](/models/ro#ro_core_news_md) | 81.8 | 95.5 |
| [`sv_core_news_md`](/models/sv#sv_core_news_md) | - | 95.5 |
</figure>
In addition, the vectors in the English pipelines are deduplicated to improve
the pruned vectors in the `md` models and reduce the `lg` model size.
## Notes about upgrading from v3.2 {#upgrading}
### Span comparisons
Span comparisons involving ordering (`<`, `<=`, `>`, `>=`) now take all span
attributes into account (start, end, label, and KB ID) so spans may be sorted in
a slightly different order.
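For illustration (this snippet is not part of the release notes), two spans with identical boundaries but different labels no longer compare equal, so sorting may interleave them differently than in v3.2:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Bank of China")
# Same start and end, different labels: the label now participates in ordering.
spans = [Span(doc, 0, 3, label="ORG"), Span(doc, 0, 3, label="FAC")]
for span in sorted(spans):
    print(span.text, span.label_)
```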
### Whitespace annotation
During training, annotation on whitespace tokens is handled in the same way as
annotation on non-whitespace tokens in order to allow custom whitespace
annotation.
### Doc.from_docs
[`Doc.from_docs`](/api/doc#from_docs) now includes `Doc.tensor` by default and
supports excludes with an `exclude` argument in the same format as
`Doc.to_bytes`. The supported exclude fields are `spans`, `tensor` and
`user_data`.
Docs including `Doc.tensor` may be quite a bit larger in RAM, so to exclude
`Doc.tensor` as in v3.2:
```diff
-merged_doc = Doc.from_docs(docs)
+merged_doc = Doc.from_docs(docs, exclude=["tensor"])
```
### Using trained pipelines with floret vectors
If you're running a new trained pipeline for Finnish, Korean or Swedish on new
texts and working with `Doc` objects, you shouldn't notice any difference with
floret vectors vs. default vectors.
If you use vectors for similarity comparisons, there are a few differences,
mainly because a floret pipeline doesn't include any kind of frequency-based
word list similar to the list of in-vocabulary vector keys with default vectors.
- If your workflow iterates over the vector keys, you should use an external
word list instead:
```diff
- lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
+ lexemes = [nlp.vocab[word] for word in external_word_list]
```
- `Vectors.most_similar` is not supported because there's no fixed list of
vectors to compare your vectors to.
### Pipeline package version compatibility {#version-compat}
> #### Using legacy implementations
>
> In spaCy v3, you'll still be able to load and reference legacy implementations
> via [`spacy-legacy`](https://github.com/explosion/spacy-legacy), even if the
> components or architectures change and newer versions are available in the
> core library.
When you're loading a pipeline package trained with an earlier version of spaCy
v3, you will see a warning telling you that the pipeline may be incompatible.
This doesn't necessarily have to be true, but we recommend running your
pipelines against your test suite or evaluation data to make sure there are no
unexpected results.
If you're using one of the [trained pipelines](/models) we provide, you should
run [`spacy download`](/api/cli#download) to update to the latest version. To
see an overview of all installed packages and their compatibility, you can run
[`spacy validate`](/api/cli#validate).
If you've trained your own custom pipeline and you've confirmed that it's still
working as expected, you can update the spaCy version requirements in the
[`meta.json`](/api/data-formats#meta):
```diff
- "spacy_version": ">=3.2.0,<3.3.0",
+ "spacy_version": ">=3.2.0,<3.4.0",
```
### Updating v3.2 configs
To update a config from spaCy v3.2 with the new v3.3 settings, run
[`init fill-config`](/api/cli#init-fill-config):
```cli
$ python -m spacy init fill-config config-v3.2.cfg config-v3.3.cfg
```
In many cases ([`spacy train`](/api/cli#train),
[`spacy.load`](/api/top-level#spacy.load)), the new defaults will be filled in
automatically, but you'll need to fill in the new settings to run
[`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data).
To see the speed improvements for the
[`Tagger` architecture](/api/architectures#Tagger), edit your config to switch
from `spacy.Tagger.v1` to `spacy.Tagger.v2` and then run `init fill-config`.
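If you'd rather make that change programmatically, a sketch like the following works, assuming your tagger uses the default `[components.tagger.model]` section layout (file names are placeholders):

```python
from pathlib import Path
import spacy

config = spacy.util.load_config(Path("config-v3.2.cfg"))
# Point the tagger at the faster v2 architecture, then run `init fill-config`.
config["components"]["tagger"]["model"]["@architectures"] = "spacy.Tagger.v2"
config.to_disk(Path("config-v3.2-tagger-v2.cfg"))
```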

@ -116,7 +116,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md'
> corpus that had both syntactic and entity annotations, so the transformer
> models for those languages do not include NER.
| Package                                          | Language | Transformer                                                                 | Tagger | Parser |  NER |
| ------------------------------------------------ | -------- | --------------------------------------------------------------------------- | -----: | -----: | ---: |
| [`en_core_web_trf`](/models/en#en_core_web_trf)  | English  | [`roberta-base`](https://huggingface.co/roberta-base)                       |   97.8 |   95.2 | 89.9 |
| [`de_dep_news_trf`](/models/de#de_dep_news_trf)  | German   | [`bert-base-german-cased`](https://huggingface.co/bert-base-german-cased)   |   99.0 |   95.8 |    - |
@ -856,9 +856,9 @@ attribute ruler before training using the `[initialize]` block of your config.
### Using Lexeme Tables
To use tables like `lexeme_prob` when training a model from scratch, you need to
add an entry to the `initialize` block in your config. Here's what that looks
like for the existing trained pipelines:
```ini
[initialize.lookups]
@ -5,6 +5,7 @@ new: 2
menu:
- ['Dependencies', 'dep']
- ['Named Entities', 'ent']
- ['Spans', 'span']
- ['Jupyter Notebooks', 'jupyter']
- ['Rendering HTML', 'html']
- ['Web app usage', 'webapp']
@ -192,7 +193,7 @@ displacy.serve(doc, style="span")
import DisplacySpanHtml from 'images/displacy-span.html'
<Iframe title="displaCy visualizer for overlapping spans" html={DisplacySpanHtml} height={180} />
The span visualizer lets you customize the following `options`:
@ -342,9 +343,7 @@ want to visualize output from other libraries, like [NLTK](http://www.nltk.org)
or
[SyntaxNet](https://github.com/tensorflow/models/tree/master/research/syntaxnet).
If you set `manual=True` on either `render()` or `serve()`, you can pass in data
in displaCy's format as a dictionary (instead of `Doc` objects).
> #### Example
>
@ -62,6 +62,11 @@
"example": "Dies ist ein Satz.", "example": "Dies ist ein Satz.",
"has_examples": true "has_examples": true
}, },
{
"code": "dsb",
"name": "Lower Sorbian",
"has_examples": true
},
{
"code": "el",
"name": "Greek",
@ -159,6 +164,11 @@
"name": "Croatian", "name": "Croatian",
"has_examples": true "has_examples": true
}, },
{
"code": "hsb",
"name": "Upper Sorbian",
"has_examples": true
},
{
"code": "hu",
"name": "Hungarian",
@ -11,7 +11,8 @@
{ "text": "spaCy 101", "url": "/usage/spacy-101" }, { "text": "spaCy 101", "url": "/usage/spacy-101" },
{ "text": "New in v3.0", "url": "/usage/v3" }, { "text": "New in v3.0", "url": "/usage/v3" },
{ "text": "New in v3.1", "url": "/usage/v3-1" }, { "text": "New in v3.1", "url": "/usage/v3-1" },
{ "text": "New in v3.2", "url": "/usage/v3-2" } { "text": "New in v3.2", "url": "/usage/v3-2" },
{ "text": "New in v3.3", "url": "/usage/v3-3" }
]
},
{
@ -19,7 +19,7 @@
"newsletter": { "newsletter": {
"user": "spacy.us12", "user": "spacy.us12",
"id": "83b0498b1e7fa3c91ce68c3f1", "id": "83b0498b1e7fa3c91ce68c3f1",
"list": "89ad33e698" "list": "ecc82e0493"
}, },
"docSearch": { "docSearch": {
"appId": "Y1LB128RON", "appId": "Y1LB128RON",

View File

@ -1,5 +1,69 @@
{
"resources": [
{
"id": "scrubadub_spacy",
"title": "scrubadub_spacy",
"category": ["pipeline"],
"slogan": "Remove personally identifiable information from text using spaCy.",
"description": "scrubadub removes personally identifiable information from text. scrubadub_spacy is an extension that uses spaCy NLP models to remove personal information from text.",
"github": "LeapBeyond/scrubadub_spacy",
"pip": "scrubadub-spacy",
"url": "https://github.com/LeapBeyond/scrubadub_spacy",
"code_language": "python",
"author": "Leap Beyond",
"author_links": {
"github": "https://github.com/LeapBeyond",
"website": "https://leapbeyond.ai"
},
"code_example": [
"import scrubadub, scrubadub_spacy",
"scrubber = scrubadub.Scrubber()",
"scrubber.add_detector(scrubadub_spacy.detectors.SpacyEntityDetector)",
"print(scrubber.clean(\"My name is Alex, I work at LifeGuard in London, and my eMail is alex@lifeguard.com btw. my super secret twitter login is username: alex_2000 password: g-dragon180888\"))",
"# My name is {{NAME}}, I work at {{ORGANIZATION}} in {{LOCATION}}, and my eMail is {{EMAIL}} btw. my super secret twitter login is username: {{USERNAME}} password: {{PASSWORD}}"
]
},
{
"id": "spacy-setfit-textcat",
"title": "spacy-setfit-textcat",
"category": ["research"],
"tags": ["SetFit", "Few-Shot"],
"slogan": "spaCy Project: Experiments with SetFit & Few-Shot Classification",
"description": "This project is an experiment with spaCy and few-shot text classification using SetFit",
"github": "pmbaumgartner/spacy-setfit-textcat",
"url": "https://github.com/pmbaumgartner/spacy-setfit-textcat",
"code_language": "python",
"author": "Peter Baumgartner",
"author_links": {
"twitter" : "https://twitter.com/pmbaumgartner",
"github": "https://github.com/pmbaumgartner",
"website": "https://www.peterbaumgartner.com/"
},
"code_example": [
"https://colab.research.google.com/drive/1CvGEZC0I9_v8gWrBxSJQ4Z8JGPJz-HYb?usp=sharing"
]
},
{
"id": "spacy-experimental",
"title": "spacy-experimental",
"category": ["extension"],
"slogan": "Cutting-edge experimental spaCy components and features",
"description": "This package includes experimental components and features for spaCy v3.x, for example model architectures, pipeline components and utilities.",
"github": "explosion/spacy-experimental",
"pip": "spacy-experimental",
"url": "https://github.com/explosion/spacy-experimental",
"code_language": "python",
"author": "Explosion",
"author_links": {
"twitter" : "https://twitter.com/explosion_ai",
"github": "https://github.com/explosion",
"website": "https://explosion.ai/"
},
"code_example": [
"python -m pip install -U pip setuptools wheel",
"python -m pip install spacy-experimental"
]
},
{
"id": "spacypdfreader",
"title": "spadypdfreader",
@ -234,6 +298,10 @@
"github": "SamEdwardes/spacytextblob", "github": "SamEdwardes/spacytextblob",
"pip": "spacytextblob", "pip": "spacytextblob",
"code_example": [ "code_example": [
"# the following installations are required",
"# python -m textblob.download_corpora",
"# python -m spacy download en_core_web_sm",
"",
"import spacy", "import spacy",
"from spacytextblob.spacytextblob import SpacyTextBlob", "from spacytextblob.spacytextblob import SpacyTextBlob",
"", "",
@ -327,15 +395,20 @@
"pip": "spaczz", "pip": "spaczz",
"code_example": [ "code_example": [
"import spacy", "import spacy",
"from spaczz.pipeline import SpaczzRuler", "from spaczz.matcher import FuzzyMatcher",
"", "",
"nlp = spacy.blank('en')", "nlp = spacy.blank(\"en\")",
"ruler = SpaczzRuler(nlp)", "text = \"\"\"Grint Anderson created spaczz in his home at 555 Fake St,",
"ruler.add_patterns([{'label': 'PERSON', 'pattern': 'Bill Gates', 'type': 'fuzzy'}])", "Apt 5 in Nashv1le, TN 55555-1234 in the US.\"\"\" # Spelling errors intentional.",
"nlp.add_pipe(ruler)", "doc = nlp(text)",
"", "",
"doc = nlp('Oops, I spelled Bill Gatez wrong.')", "matcher = FuzzyMatcher(nlp.vocab)",
"print([(ent.text, ent.start, ent.end, ent.label_) for ent in doc.ents])" "matcher.add(\"NAME\", [nlp(\"Grant Andersen\")])",
"matcher.add(\"GPE\", [nlp(\"Nashville\")])",
"matches = matcher(doc)",
"",
"for match_id, start, end, ratio in matches:",
" print(match_id, doc[start:end], ratio)"
],
"code_language": "python",
"url": "https://spaczz.readthedocs.io/en/latest/",
@ -442,6 +515,84 @@
"website": "https://koaning.io" "website": "https://koaning.io"
} }
}, },
{
"id": "Klayers",
"title": "Klayers",
"category": ["pipeline"],
"tags": ["AWS"],
"slogan": "spaCy as a AWS Lambda Layer",
"description": "A collection of Python Packages as AWS Lambda(λ) Layers",
"github": "keithrozario/Klayers",
"pip": "",
"url": "https://github.com/keithrozario/Klayers",
"code_language": "python",
"author": "Keith Rozario",
"author_links": {
"twitter" : "https://twitter.com/keithrozario",
"github": "https://github.com/keithrozario",
"website": "https://www.keithrozario.com"
},
"code_example": [
"# SAM Template",
"MyLambdaFunction:",
" Type: AWS::Serverless::Function",
" Handler: 02_pipeline/spaCy.main",
" Description: Name Entity Extraction",
" Runtime: python3.8",
" Layers:",
" - arn:aws:lambda:${self:provider.region}:113088814899:layer:Klayers-python37-spacy:18"
]
},
{
"type": "education",
"id": "video-spacys-ner-model-alt",
"title": "Named Entity Recognition (NER) using spaCy",
"slogan": "",
"description": "In this video, I show you how to do named entity recognition using the spaCy library for Python.",
"youtube": "Gn_PjruUtrc",
"author": "Applied Language Technology",
"author_links": {
"twitter": "HelsinkiNLP",
"github": "Applied-Language-Technology",
"website": "https://applied-language-technology.mooc.fi/"
},
"category": ["videos"]
},
{
"id": "HuSpaCy",
"title": "HuSpaCy",
"category": ["models"],
"tags": ["Hungarian"],
"slogan": "HuSpaCy: industrial-strength Hungarian natural language processing",
"description": "HuSpaCy is a spaCy model and a library providing industrial-strength Hungarian language processing facilities.",
"github": "huspacy/huspacy",
"pip": "huspacy",
"url": "https://github.com/huspacy/huspacy",
"code_language": "python",
"author": "SzegedAI",
"author_links": {
"github": "https://szegedai.github.io/",
"website": "https://u-szeged.hu/english"
},
"code_example": [
"# Load the model using huspacy",
"import huspacy",
"",
"nlp = huspacy.load()",
"",
"# Load the mode using spacy.load()",
"import spacy",
"",
"nlp = spacy.load(\"hu_core_news_lg\")",
"",
"# Load the model directly as a module",
"import hu_core_news_lg",
"",
"nlp = hu_core_news_lg.load()\n",
"# Either way you get the same model and can start processing texts.",
"doc = nlp(\"Csiribiri csiribiri zabszalma - négy csillag közt alszom ma.\")"
]
},
{
"id": "spacy-stanza",
"title": "spacy-stanza",
@ -591,23 +742,6 @@
"category": ["conversational", "standalone"], "category": ["conversational", "standalone"],
"tags": ["chatbots"] "tags": ["chatbots"]
}, },
{
"id": "saber",
"title": "saber",
"slogan": "Deep-learning based tool for information extraction in the biomedical domain",
"github": "BaderLab/saber",
"pip": "saber",
"thumb": "https://raw.githubusercontent.com/BaderLab/saber/master/docs/img/saber_logo.png",
"code_example": [
"from saber.saber import Saber",
"saber = Saber()",
"saber.load('PRGE')",
"saber.annotate('The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53.')"
],
"author": "Bader Lab, University of Toronto",
"category": ["scientific"],
"tags": ["keras", "biomedical"]
},
{
"id": "alibi",
"title": "alibi",
@ -637,18 +771,17 @@
"import spacy", "import spacy",
"from spacymoji import Emoji", "from spacymoji import Emoji",
"", "",
"nlp = spacy.load('en')", "nlp = spacy.load(\"en_core_web_sm\")",
"emoji = Emoji(nlp)", "nlp.add_pipe(\"emoji\", first=True)",
"nlp.add_pipe(emoji, first=True)", "doc = nlp(\"This is a test 😻 👍🏿\")",
"", "",
"doc = nlp('This is a test 😻 👍🏿')", "assert doc._.has_emoji is True",
"assert doc._.has_emoji == True", "assert doc[2:5]._.has_emoji is True",
"assert doc[2:5]._.has_emoji == True", "assert doc[0]._.is_emoji is False",
"assert doc[0]._.is_emoji == False", "assert doc[4]._.is_emoji is True",
"assert doc[4]._.is_emoji == True", "assert doc[5]._.emoji_desc == \"thumbs up dark skin tone\"",
"assert doc[5]._.emoji_desc == 'thumbs up dark skin tone'",
"assert len(doc._.emoji) == 2", "assert len(doc._.emoji) == 2",
"assert doc._.emoji[1] == ('👍🏿', 5, 'thumbs up dark skin tone')" "assert doc._.emoji[1] == (\"👍🏿\", 5, \"thumbs up dark skin tone\")"
], ],
"author": "Ines Montani", "author": "Ines Montani",
"author_links": { "author_links": {
@ -885,9 +1018,8 @@
"import spacy", "import spacy",
"from spacy_sentiws import spaCySentiWS", "from spacy_sentiws import spaCySentiWS",
"", "",
"nlp = spacy.load('de')", "nlp = spacy.load('de_core_news_sm')",
"sentiws = spaCySentiWS(sentiws_path='data/sentiws/')", "nlp.add_pipe('sentiws', config={'sentiws_path': 'data/sentiws'})",
"nlp.add_pipe(sentiws)",
"doc = nlp('Die Dummheit der Unterwerfung blüht in hübschen Farben.')", "doc = nlp('Die Dummheit der Unterwerfung blüht in hübschen Farben.')",
"", "",
"for token in doc:", "for token in doc:",
@ -1076,29 +1208,6 @@
"category": ["pipeline"], "category": ["pipeline"],
"tags": ["pipeline", "readability", "syntactic complexity", "descriptive statistics"] "tags": ["pipeline", "readability", "syntactic complexity", "descriptive statistics"]
}, },
{
"id": "wmd-relax",
"slogan": "Calculates word mover's distance insanely fast",
"description": "Calculates Word Mover's Distance as described in [From Word Embeddings To Document Distances](http://www.cs.cornell.edu/~kilian/papers/wmd_metric.pdf) by Matt Kusner, Yu Sun, Nicholas Kolkin and Kilian Weinberger.\n\n⚠ **This package is currently only compatible with spaCy v.1x.**",
"github": "src-d/wmd-relax",
"thumb": "https://i.imgur.com/f91C3Lf.jpg",
"code_example": [
"import spacy",
"import wmd",
"",
"nlp = spacy.load('en', create_pipeline=wmd.WMD.create_spacy_pipeline)",
"doc1 = nlp(\"Politician speaks to the media in Illinois.\")",
"doc2 = nlp(\"The president greets the press in Chicago.\")",
"print(doc1.similarity(doc2))"
],
"author": "source{d}",
"author_links": {
"github": "src-d",
"twitter": "sourcedtech",
"website": "https://sourced.tech"
},
"category": ["pipeline"]
},
{
"id": "neuralcoref",
"slogan": "State-of-the-art coreference resolution based on neural nets and spaCy",
@ -1525,17 +1634,6 @@
},
"category": ["nonpython"]
},
{
"id": "spaCy.jl",
"slogan": "Julia interface for spaCy (work in progress)",
"github": "jekbradbury/SpaCy.jl",
"author": "James Bradbury",
"author_links": {
"github": "jekbradbury",
"twitter": "jekbradbury"
},
"category": ["nonpython"]
},
{
"id": "ruby-spacy",
"title": "ruby-spacy",
@ -1605,21 +1703,6 @@
},
"category": ["apis"]
},
{
"id": "languagecrunch",
"slogan": "NLP server for spaCy, WordNet and NeuralCoref as a Docker image",
"github": "artpar/languagecrunch",
"code_example": [
"docker run -it -p 8080:8080 artpar/languagecrunch",
"curl http://localhost:8080/nlp/parse?`echo -n \"The new twitter is so weird. Seriously. Why is there a new twitter? What was wrong with the old one? Fix it now.\" | python -c \"import urllib, sys; print(urllib.urlencode({'sentence': sys.stdin.read()}))\"`"
],
"code_language": "bash",
"author": "Parth Mudgal",
"author_links": {
"github": "artpar"
},
"category": ["apis"]
},
{
"id": "spacy-nlp",
"slogan": " Expose spaCy NLP text parsing to Node.js (and other languages) via Socket.IO",
@ -2008,6 +2091,20 @@
"youtube": "f4sqeLRzkPg", "youtube": "f4sqeLRzkPg",
"category": ["videos"] "category": ["videos"]
}, },
{
"type": "education",
"id": "video-intro-to-nlp-episode-6",
"title": "Intro to NLP with spaCy (6)",
"slogan": "Episode 6: Moving to spaCy v3",
"description": "In this new video series, data science instructor Vincent Warmerdam gets started with spaCy, an open-source library for Natural Language Processing in Python. His mission: building a system to automatically detect programming languages in large volumes of text. Follow his process from the first idea to a prototype all the way to data collection and training a statistical named entity recogntion model from scratch.",
"author": "Vincent Warmerdam",
"author_links": {
"twitter": "fishnets88",
"github": "koaning"
},
"youtube": "k77RrmMaKEI",
"category": ["videos"]
},
{
"type": "education",
"id": "video-spacy-irl-entity-linking",
@ -2194,43 +2291,6 @@
"category": ["standalone"], "category": ["standalone"],
"tags": ["question-answering", "elasticsearch"] "tags": ["question-answering", "elasticsearch"]
}, },
{
"id": "epitator",
"title": "EpiTator",
"thumb": "https://i.imgur.com/NYFY1Km.jpg",
"slogan": "Extracts case counts, resolved location/species/disease names, date ranges and more",
"description": "EcoHealth Alliance uses EpiTator to catalog the what, where and when of infectious disease case counts reported in online news. Each of these aspects is extracted using independent annotators than can be applied to other domains. EpiTator organizes annotations by creating \"AnnoTiers\" for each type. AnnoTiers have methods for manipulating, combining and searching annotations. For instance, the `with_following_spans_from()` method can be used to create a new tier that combines a tier of one type (such as numbers), with another (say, kitchenware). The resulting tier will contain all the phrases in the document that match that pattern, like \"5 plates\" or \"2 cups.\"\n\nAnother commonly used method is `group_spans_by_containing_span()` which can be used to do things like find all the spaCy tokens in all the GeoNames a document mentions. spaCy tokens, named entities, sentences and noun chunks are exposed through the spaCy annotator which will create a AnnoTier for each. These are basis of many of the other annotators. EpiTator also includes an annotator for extracting tables embedded in free text articles. Another neat feature is that the lexicons used for entity resolution are all stored in an embedded sqlite database so there is no need to run any external services in order to use EpiTator.",
"url": "https://github.com/ecohealthalliance/EpiTator",
"github": "ecohealthalliance/EpiTator",
"pip": "EpiTator",
"code_example": [
"from epitator.annotator import AnnoDoc",
"from epitator.geoname_annotator import GeonameAnnotator",
"",
"doc = AnnoDoc('Where is Chiang Mai?')",
"geoname_annotier = doc.require_tiers('geonames', via=GeonameAnnotator)",
"geoname = geoname_annotier.spans[0].metadata['geoname']",
"geoname['name']",
"# = 'Chiang Mai'",
"geoname['geonameid']",
"# = '1153671'",
"geoname['latitude']",
"# = 18.79038",
"geoname['longitude']",
"# = 98.98468",
"",
"from epitator.spacy_annotator import SpacyAnnotator",
"spacy_token_tier = doc.require_tiers('spacy.tokens', via=SpacyAnnotator)",
"list(geoname_annotier.group_spans_by_containing_span(spacy_token_tier))",
"# = [(AnnoSpan(9-19, Chiang Mai), [AnnoSpan(9-15, Chiang), AnnoSpan(16-19, Mai)])]"
],
"author": "EcoHealth Alliance",
"author_links": {
"github": "ecohealthalliance",
"website": " https://ecohealthalliance.org/"
},
"category": ["scientific", "standalone"]
},
{
"id": "self-attentive-parser",
"title": "Berkeley Neural Parser",
@ -2259,30 +2319,6 @@
},
"category": ["research", "pipeline"]
},
{
"id": "excelcy",
"title": "ExcelCy",
"slogan": "Excel Integration with spaCy. Training NER using XLSX from PDF, DOCX, PPT, PNG or JPG.",
"description": "ExcelCy is a toolkit to integrate Excel to spaCy NLP training experiences. Training NER using XLSX from PDF, DOCX, PPT, PNG or JPG. ExcelCy has pipeline to match Entity with PhraseMatcher or Matcher in regular expression.",
"url": "https://github.com/kororo/excelcy",
"github": "kororo/excelcy",
"pip": "excelcy",
"code_example": [
"from excelcy import ExcelCy",
"# collect sentences, annotate Entities and train NER using spaCy",
"excelcy = ExcelCy.execute(file_path='https://github.com/kororo/excelcy/raw/master/tests/data/test_data_01.xlsx')",
"# use the nlp object as per spaCy API",
"doc = excelcy.nlp('Google rebrands its business apps')",
"# or save it for faster bootstrap for application",
"excelcy.nlp.to_disk('/model')"
],
"author": "Robertus Johansyah",
"author_links": {
"github": "kororo"
},
"category": ["training"],
"tags": ["excel"]
},
{
"id": "spacy-graphql",
"title": "spacy-graphql",
@ -2405,16 +2441,15 @@
{
"id": "spacy-conll",
"title": "spacy_conll",
"slogan": "Parsing from and to CoNLL-U format with `spacy`, `spacy-stanza` and `spacy-udpipe`",
"description": "This module allows you to parse text into CoNLL-U format or read ConLL-U into a spaCy `Doc`. You can use it as a command line tool, or embed it in your own scripts by adding it as a custom pipeline component to a `spacy`, `spacy-stanza` or `spacy-udpipe` pipeline. It also provides an easy-to-use function to quickly initialize any spaCy-wrapped parser. CoNLL-related properties are added to `Doc` elements, `Span` sentences, and `Token` objects.",
"code_example": [
"from spacy_conll import init_parser",
"",
"",
"# Initialise English parser, already including the ConllFormatter as a pipeline component.",
"# Indicate that we want to get the CoNLL headers in the string output.",
"# `use_gpu` and `verbose` are specific to stanza. These keywords arguments are passed onto their Pipeline() initialisation",
"nlp = init_parser(\"en\",",
" \"stanza\",",
" parser_opts={\"use_gpu\": True, \"verbose\": False},",
@ -2435,7 +2470,7 @@
},
"github": "BramVanroy/spacy_conll",
"category": ["standalone", "pipeline"],
"tags": ["linguistics", "computational linguistics", "conll", "conll-u"]
},
{
"id": "spacy-langdetect",
@ -2497,41 +2532,6 @@
},
"category": ["standalone", "conversational"]
},
{
"id": "gracyql",
"title": "gracyql",
"slogan": "A thin GraphQL wrapper around spacy",
"github": "oterrier/gracyql",
"description": "An example of a basic [Starlette](https://github.com/encode/starlette) app using [Spacy](https://github.com/explosion/spaCy) and [Graphene](https://github.com/graphql-python/graphene). The main goal is to be able to use the amazing power of spaCy from other languages and retrieving only the information you need thanks to the GraphQL query definition. The GraphQL schema tries to mimic as much as possible the original Spacy API with classes Doc, Span and Token.",
"thumb": "https://i.imgur.com/xC7zpTO.png",
"category": ["apis"],
"tags": ["graphql"],
"code_example": [
"query ParserDisabledQuery {",
" nlp(model: \"en\", disable: [\"parser\", \"ner\"]) {",
" doc(text: \"I live in Grenoble, France\") {",
" text",
" tokens {",
" id",
" pos",
" lemma",
" dep",
" }",
" ents {",
" start",
" end",
" label",
" }",
" }",
" }",
"}"
],
"code_language": "json",
"author": "Olivier Terrier",
"author_links": {
"github": "oterrier"
}
},
{
"id": "pyInflect",
"slogan": "A Python module for word inflections",
@ -2705,6 +2705,66 @@
],
"spacy_version": 3
},
{
"id": "crosslingualcoreference",
"title": "Crosslingual Coreference",
"slogan": "One multi-lingual coreference model to rule them all!",
"description": "Coreference is amazing but the data required for training a model is very scarce. In our case, the available training for non-English languages also data proved to be poorly annotated. Crosslingual Coreference therefore uses the assumption a trained model with English data and cross-lingual embeddings should work for other languages with similar sentence structure. Verified to work quite well for at least (EN, NL, DK, FR, DE).",
"github": "pandora-intelligence/crosslingual-coreference",
"pip": "crosslingual-coreference",
"thumb": "https://raw.githubusercontent.com/Pandora-Intelligence/crosslingual-coreference/master/img/logo.png",
"image": "https://raw.githubusercontent.com/Pandora-Intelligence/crosslingual-coreference/master/img/example_total.png",
"code_example": [
"import spacy",
"import crosslingual_coreference",
"",
"text = \"\"\"",
" Do not forget about Momofuku Ando!",
" He created instant noodles in Osaka.",
" At that location, Nissin was founded.",
" Many students survived by eating these noodles, but they don't even know him.\"\"\"",
"",
"# use any model that has internal spacy embeddings",
"nlp = spacy.load('en_core_web_sm')",
"nlp.add_pipe(",
" \"xx_coref\", config={\"chunk_size\": 2500, \"chunk_overlap\": 2, \"device\": 0})",
")",
"",
"doc = nlp(text)",
"",
"print(doc._.coref_clusters)",
"# Output",
"#",
"# [[[4, 5], [7, 7], [27, 27], [36, 36]],",
"# [[12, 12], [15, 16]],",
"# [[9, 10], [27, 28]],",
"# [[22, 23], [31, 31]]]",
"print(doc._.resolved_text)",
"# Output",
"#",
"# Do not forget about Momofuku Ando!",
"# Momofuku Ando created instant noodles in Osaka.",
"# At Osaka, Nissin was founded.",
"# Many students survived by eating instant noodles,",
"# but Many students don't even know Momofuku Ando."
],
"author": "David Berenstein",
"author_links": {
"github": "davidberenstein1957",
"website": "https://www.linkedin.com/in/david-berenstein-1bab11105/"
},
"category": [
"pipeline",
"standalone"
],
"tags": [
"coreference",
"multi-lingual",
"cross-lingual",
"allennlp"
],
"spacy_version": 3
},
{
"id": "blackstone",
"title": "Blackstone",
@ -2761,9 +2821,9 @@
"id": "coreferee", "id": "coreferee",
"title": "Coreferee", "title": "Coreferee",
"slogan": "Coreference resolution for multiple languages", "slogan": "Coreference resolution for multiple languages",
"github": "msg-systems/coreferee", "github": "explosion/coreferee",
"url": "https://github.com/msg-systems/coreferee", "url": "https://github.com/explosion/coreferee",
"description": "Coreferee is a pipeline plugin that performs coreference resolution for English, German and Polish. It is designed so that it is easy to add support for new languages and optimised for limited training data. It uses a mixture of neural networks and programmed rules. Please note you will need to [install models](https://github.com/msg-systems/coreferee#getting-started) before running the code example.", "description": "Coreferee is a pipeline plugin that performs coreference resolution for English, French, German and Polish. It is designed so that it is easy to add support for new languages and optimised for limited training data. It uses a mixture of neural networks and programmed rules. Please note you will need to [install models](https://github.com/explosion/coreferee#getting-started) before running the code example.",
"pip": "coreferee", "pip": "coreferee",
"category": ["pipeline", "models", "standalone"], "category": ["pipeline", "models", "standalone"],
"tags": ["coreference-resolution", "anaphora"], "tags": ["coreference-resolution", "anaphora"],
@ -3121,18 +3181,25 @@
"import spacy", "import spacy",
"import pytextrank", "import pytextrank",
"", "",
"nlp = spacy.load('en_core_web_sm')", "# example text",
"text = \"\"\"Compatibility of systems of linear constraints over the set of natural numbers.",
"Criteria of compatibility of a system of linear Diophantine equations, strict inequations,",
"and nonstrict inequations are considered. Upper bounds for components of a minimal set of",
"solutions and algorithms of construction of minimal generating sets of solutions for all types",
"of systems are given. These criteria and the corresponding algorithms for constructing a minimal",
"supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.\"\"\"",
"", "",
"tr = pytextrank.TextRank()", "# load a spaCy model, depending on language, scale, etc.",
"nlp.add_pipe(tr.PipelineComponent, name='textrank', last=True)", "nlp = spacy.load(\"en_core_web_sm\")",
"# add PyTextRank to the spaCy pipeline",
"nlp.add_pipe(\"textrank\")",
"", "",
"text = 'Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered.'",
"doc = nlp(text)", "doc = nlp(text)",
"",
"# examine the top-ranked phrases in the document", "# examine the top-ranked phrases in the document",
"for p in doc._.phrases:", "for phrase in doc._.phrases:",
" print('{:.4f} {:5d} {}'.format(p.rank, p.count, p.text))", " print(phrase.text)",
" print(p.chunks)" " print(phrase.rank, phrase.count)",
" print(phrase.chunks)"
], ],
"code_language": "python", "code_language": "python",
"url": "https://github.com/DerwenAI/pytextrank/wiki", "url": "https://github.com/DerwenAI/pytextrank/wiki",
@ -3158,21 +3225,13 @@
"import spacy", "import spacy",
"from spacy_syllables import SpacySyllables", "from spacy_syllables import SpacySyllables",
"", "",
"nlp = spacy.load('en_core_web_sm')", "nlp = spacy.load(\"en_core_web_sm\")",
"syllables = SpacySyllables(nlp)", "nlp.add_pipe(\"syllables\", after=\"tagger\")",
"nlp.add_pipe(syllables, after='tagger')",
"", "",
"doc = nlp('terribly long')", "assert nlp.pipe_names == [\"tok2vec\", \"tagger\", \"syllables\", \"parser\", \"attribute_ruler\", \"lemmatizer\", \"ner\"]",
"", "doc = nlp(\"terribly long\")",
"data = [", "data = [(token.text, token._.syllables, token._.syllables_count) for token in doc]",
" (token.text, token._.syllables, token._.syllables_count)", "assert data == [(\"terribly\", [\"ter\", \"ri\", \"bly\"], 3), (\"long\", [\"long\"], 1)]"
" for token in doc",
"]",
"",
"assert data == [",
" ('terribly', ['ter', 'ri', 'bly'], 3),",
" ('long', ['long'], 1)",
"]"
], ],
"thumb": "https://raw.githubusercontent.com/sloev/spacy-syllables/master/logo.png", "thumb": "https://raw.githubusercontent.com/sloev/spacy-syllables/master/logo.png",
"author": "Johannes Valbjørn", "author": "Johannes Valbjørn",

View File

@ -120,8 +120,8 @@ const AlertSpace = ({ nightly, legacy }) => {
}
const navAlert = (
<Link to="/usage/v3-3" hidden>
<strong>💥 Out now:</strong> spaCy v3.3
</Link>
)
@ -23,6 +23,8 @@ const CUDA = {
'11.2': 'cuda112',
'11.3': 'cuda113',
'11.4': 'cuda114',
'11.5': 'cuda115',
'11.6': 'cuda116',
}
const LANG_EXTRAS = ['ja'] // only for languages with models
@ -48,7 +50,7 @@ const QuickstartInstall = ({ id, title }) => {
const modelExtras = train ? selectedModels.filter(m => LANG_EXTRAS.includes(m)) : []
const apple = os === 'mac' && platform === 'arm'
const pipExtras = [
(hardware === 'gpu' && (platform !== 'arm' || os === 'linux')) && cuda,
train && 'transformers',
train && 'lookups',
apple && 'apple',