Merge branch 'master' into copy/develop

Sofie Van Landeghem 2022-02-16 14:04:59 +01:00 committed by GitHub
commit a16b14e591
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
32 changed files with 458 additions and 81 deletions

21
.github/workflows/gputests.yml vendored Normal file

@ -0,0 +1,21 @@
name: Weekly GPU tests
on:
schedule:
- cron: '0 1 * * MON'
jobs:
weekly-gputests:
strategy:
fail-fast: false
matrix:
branch: [master, develop, v4]
runs-on: ubuntu-latest
steps:
- name: Trigger buildkite build
uses: buildkite/trigger-pipeline-action@v1.2.0
env:
PIPELINE: explosion-ai/spacy-slow-gpu-tests
BRANCH: ${{ matrix.branch }}
MESSAGE: ":github: Weekly GPU + slow tests - triggered from a GitHub Action"
BUILDKITE_API_ACCESS_TOKEN: ${{ secrets.BUILDKITE_SECRET }}

35
.github/workflows/slowtests.yml vendored Normal file

@ -0,0 +1,35 @@
name: Daily slow tests
on:
schedule:
- cron: '0 0 * * *'
jobs:
daily-slowtests:
strategy:
fail-fast: false
matrix:
branch: [master, develop, v4]
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v1
- name: Get commits from past 24 hours
id: check_commits
run: |
today=$(date '+%Y-%m-%d %H:%M:%S')
yesterday=$(date -d "yesterday" '+%Y-%m-%d %H:%M:%S')
if git log --after="$yesterday" --before="$today" | grep commit ; then
echo "::set-output name=run_tests::true"
else
echo "::set-output name=run_tests::false"
fi
- name: Trigger buildkite build
if: steps.check_commits.outputs.run_tests == 'true'
uses: buildkite/trigger-pipeline-action@v1.2.0
env:
PIPELINE: explosion-ai/spacy-slow-tests
BRANCH: ${{ matrix.branch }}
MESSAGE: ":github: Daily slow tests - triggered from a GitHub Action"
BUILDKITE_API_ACCESS_TOKEN: ${{ secrets.BUILDKITE_SECRET }}


@ -32,19 +32,20 @@ open-source software, released under the MIT license.
## 📖 Documentation ## 📖 Documentation
| Documentation | | | Documentation | |
| -------------------------- | -------------------------------------------------------------- | | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| ⭐️ **[spaCy 101]** | New to spaCy? Here's everything you need to know! | | ⭐️ **[spaCy 101]** | New to spaCy? Here's everything you need to know! |
| 📚 **[Usage Guides]** | How to use spaCy and its features. | | 📚 **[Usage Guides]** | How to use spaCy and its features. |
| 🚀 **[New in v3.0]** | New features, backwards incompatibilities and migration guide. | | 🚀 **[New in v3.0]** | New features, backwards incompatibilities and migration guide. |
| 🪐 **[Project Templates]** | End-to-end workflows you can clone, modify and run. | | 🪐 **[Project Templates]** | End-to-end workflows you can clone, modify and run. |
| 🎛 **[API Reference]** | The detailed reference for spaCy's API. | | 🎛 **[API Reference]** | The detailed reference for spaCy's API. |
| 📦 **[Models]** | Download trained pipelines for spaCy. | | 📦 **[Models]** | Download trained pipelines for spaCy. |
| 🌌 **[Universe]** | Plugins, extensions, demos and books from the spaCy ecosystem. | | 🌌 **[Universe]** | Plugins, extensions, demos and books from the spaCy ecosystem. |
| 👩‍🏫 **[Online Course]** | Learn spaCy in this free and interactive online course. | | 👩‍🏫 **[Online Course]** | Learn spaCy in this free and interactive online course. |
| 📺 **[Videos]** | Our YouTube channel with video tutorials, talks and more. | | 📺 **[Videos]** | Our YouTube channel with video tutorials, talks and more. |
| 🛠 **[Changelog]** | Changes and version history. | | 🛠 **[Changelog]** | Changes and version history. |
| 💝 **[Contribute]** | How to contribute to the spaCy project and code base. | | 💝 **[Contribute]** | How to contribute to the spaCy project and code base. |
| <a href="https://explosion.ai/spacy-tailored-pipelines"><img src="https://user-images.githubusercontent.com/13643239/152853098-1c761611-ccb0-4ec6-9066-b234552831fe.png" width="125" alt="spaCy Tailored Pipelines"/></a> | Get a custom spaCy pipeline, tailor-made for your NLP problem by spaCy's core developers. Streamlined, production-ready, predictable and maintainable. Start by completing our 5-minute questionnaire to tell us what you need and we'll be in touch! **[Learn more &rarr;](https://explosion.ai/spacy-tailored-pipelines)** |
[spacy 101]: https://spacy.io/usage/spacy-101 [spacy 101]: https://spacy.io/usage/spacy-101
[new in v3.0]: https://spacy.io/usage/v3 [new in v3.0]: https://spacy.io/usage/v3
@ -60,9 +61,7 @@ open-source software, released under the MIT license.
## 💬 Where to ask questions ## 💬 Where to ask questions
The spaCy project is maintained by **[@honnibal](https://github.com/honnibal)**, The spaCy project is maintained by the [spaCy team](https://explosion.ai/about).
**[@ines](https://github.com/ines)**, **[@svlandeg](https://github.com/svlandeg)**,
**[@adrianeboyd](https://github.com/adrianeboyd)** and **[@polm](https://github.com/polm)**.
Please understand that we won't be able to provide individual support via email. Please understand that we won't be able to provide individual support via email.
We also believe that help is much more valuable if it's shared publicly, so that We also believe that help is much more valuable if it's shared publicly, so that
more people can benefit from it. more people can benefit from it.


@ -11,12 +11,14 @@ trigger:
exclude: exclude:
- "website/*" - "website/*"
- "*.md" - "*.md"
- ".github/workflows/*"
pr: pr:
paths: paths:
exclude: exclude:
- "*.md" - "*.md"
- "website/docs/*" - "website/docs/*"
- "website/src/*" - "website/src/*"
- ".github/workflows/*"
jobs: jobs:
# Perform basic checks for most important errors (syntax etc.) Uses the config # Perform basic checks for most important errors (syntax etc.) Uses the config


@ -35,3 +35,4 @@ mypy==0.910
types-dataclasses>=0.1.3; python_version < "3.7" types-dataclasses>=0.1.3; python_version < "3.7"
types-mock>=0.1.1 types-mock>=0.1.1
types-requests types-requests
black>=22.0,<23.0


@ -1,6 +1,6 @@
# fmt: off # fmt: off
__title__ = "spacy" __title__ = "spacy"
__version__ = "3.2.1" __version__ = "3.2.2"
__download_url__ = "https://github.com/explosion/spacy-models/releases/download" __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json" __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
__projects__ = "https://github.com/explosion/projects" __projects__ = "https://github.com/explosion/projects"


@ -193,6 +193,70 @@ def debug_data(
else: else:
msg.info("No word vectors present in the package") msg.info("No word vectors present in the package")
if "spancat" in factory_names:
model_labels_spancat = _get_labels_from_spancat(nlp)
has_low_data_warning = False
has_no_neg_warning = False
msg.divider("Span Categorization")
msg.table(model_labels_spancat, header=["Spans Key", "Labels"], divider=True)
msg.text("Label counts in train data: ", show=verbose)
for spans_key, data_labels in gold_train_data["spancat"].items():
msg.text(
f"Key: {spans_key}, {_format_labels(data_labels.items(), counts=True)}",
show=verbose,
)
# Data checks: only take the spans keys in the actual spancat components
data_labels_in_component = {
spans_key: gold_train_data["spancat"][spans_key]
for spans_key in model_labels_spancat.keys()
}
for spans_key, data_labels in data_labels_in_component.items():
for label, count in data_labels.items():
# Check for missing labels
spans_key_in_model = spans_key in model_labels_spancat.keys()
if (spans_key_in_model) and (
label not in model_labels_spancat[spans_key]
):
msg.warn(
f"Label '{label}' is not present in the model labels of key '{spans_key}'. "
"Performance may degrade after training."
)
# Check for low number of examples per label
if count <= NEW_LABEL_THRESHOLD:
msg.warn(
f"Low number of examples for label '{label}' in key '{spans_key}' ({count})"
)
has_low_data_warning = True
# Check for negative examples
with msg.loading("Analyzing label distribution..."):
neg_docs = _get_examples_without_label(
train_dataset, label, "spancat", spans_key
)
if neg_docs == 0:
msg.warn(f"No examples for texts WITHOUT new label '{label}'")
has_no_neg_warning = True
if has_low_data_warning:
msg.text(
f"To train a new span type, your data should include at "
f"least {NEW_LABEL_THRESHOLD} instances of the new label",
show=verbose,
)
else:
msg.good("Good amount of examples for all labels")
if has_no_neg_warning:
msg.text(
"Training data should always include examples of spans "
"in context, as well as examples without a given span "
"type.",
show=verbose,
)
else:
msg.good("Examples without occurrences available for all labels")
if "ner" in factory_names: if "ner" in factory_names:
# Get all unique NER labels present in the data # Get all unique NER labels present in the data
labels = set( labels = set(
@ -238,7 +302,7 @@ def debug_data(
has_low_data_warning = True has_low_data_warning = True
with msg.loading("Analyzing label distribution..."): with msg.loading("Analyzing label distribution..."):
neg_docs = _get_examples_without_label(train_dataset, label) neg_docs = _get_examples_without_label(train_dataset, label, "ner")
if neg_docs == 0: if neg_docs == 0:
msg.warn(f"No examples for texts WITHOUT new label '{label}'") msg.warn(f"No examples for texts WITHOUT new label '{label}'")
has_no_neg_warning = True has_no_neg_warning = True
@ -573,6 +637,7 @@ def _compile_gold(
"deps": Counter(), "deps": Counter(),
"words": Counter(), "words": Counter(),
"roots": Counter(), "roots": Counter(),
"spancat": dict(),
"ws_ents": 0, "ws_ents": 0,
"boundary_cross_ents": 0, "boundary_cross_ents": 0,
"n_words": 0, "n_words": 0,
@ -603,6 +668,7 @@ def _compile_gold(
if nlp.vocab.strings[word] not in nlp.vocab.vectors: if nlp.vocab.strings[word] not in nlp.vocab.vectors:
data["words_missing_vectors"].update([word]) data["words_missing_vectors"].update([word])
if "ner" in factory_names: if "ner" in factory_names:
sent_starts = eg.get_aligned_sent_starts()
for i, label in enumerate(eg.get_aligned_ner()): for i, label in enumerate(eg.get_aligned_ner()):
if label is None: if label is None:
continue continue
@ -612,10 +678,19 @@ def _compile_gold(
if label.startswith(("B-", "U-")): if label.startswith(("B-", "U-")):
combined_label = label.split("-")[1] combined_label = label.split("-")[1]
data["ner"][combined_label] += 1 data["ner"][combined_label] += 1
if gold[i].is_sent_start and label.startswith(("I-", "L-")): if sent_starts[i] == True and label.startswith(("I-", "L-")):
data["boundary_cross_ents"] += 1 data["boundary_cross_ents"] += 1
elif label == "-": elif label == "-":
data["ner"]["-"] += 1 data["ner"]["-"] += 1
if "spancat" in factory_names:
for span_key in list(eg.reference.spans.keys()):
if span_key not in data["spancat"]:
data["spancat"][span_key] = Counter()
for i, span in enumerate(eg.reference.spans[span_key]):
if span.label_ is None:
continue
else:
data["spancat"][span_key][span.label_] += 1
if "textcat" in factory_names or "textcat_multilabel" in factory_names: if "textcat" in factory_names or "textcat_multilabel" in factory_names:
data["cats"].update(gold.cats) data["cats"].update(gold.cats)
if any(val not in (0, 1) for val in gold.cats.values()): if any(val not in (0, 1) for val in gold.cats.values()):
@ -686,14 +761,28 @@ def _format_labels(
return ", ".join([f"'{l}'" for l in cast(Iterable[str], labels)]) return ", ".join([f"'{l}'" for l in cast(Iterable[str], labels)])
def _get_examples_without_label(data: Sequence[Example], label: str) -> int: def _get_examples_without_label(
data: Sequence[Example],
label: str,
component: Literal["ner", "spancat"] = "ner",
spans_key: Optional[str] = "sc",
) -> int:
count = 0 count = 0
for eg in data: for eg in data:
labels = [ if component == "ner":
label.split("-")[1] labels = [
for label in eg.get_aligned_ner() label.split("-")[1]
if label not in ("O", "-", None) for label in eg.get_aligned_ner()
] if label not in ("O", "-", None)
]
if component == "spancat":
labels = (
[span.label_ for span in eg.reference.spans[spans_key]]
if spans_key in eg.reference.spans
else []
)
if label not in labels: if label not in labels:
count += 1 count += 1
return count return count
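
The spancat branch of `_get_examples_without_label` above can be illustrated without loading spaCy. The following is a minimal sketch, assuming each example is reduced to a plain dict that maps a spans key to the list of span labels stored under it (a stand-in for `eg.reference.spans`). The name `count_examples_without_label` and the sample data are hypothetical and not part of spaCy.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for eg.reference.spans: spans key -> span labels.
SpanLabels = Dict[str, List[str]]


def count_examples_without_label(
    data: Sequence[SpanLabels], label: str, spans_key: str = "sc"
) -> int:
    """Count the examples that contain no span with the given label."""
    count = 0
    for spans in data:
        # A missing spans key is treated as "no spans", as in the diff.
        labels = spans.get(spans_key, [])
        if label not in labels:
            count += 1
    return count


examples = [
    {"sc": ["PERSON", "ORG"]},
    {"sc": ["ORG"]},
    {"other": ["PERSON"]},
]
```

A result of zero for some label is the situation that triggers the "No examples for texts WITHOUT new label" warning in `debug_data`.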


@ -7,6 +7,7 @@ from collections import defaultdict
from catalogue import RegistryError from catalogue import RegistryError
import srsly import srsly
import sys import sys
import re
from ._util import app, Arg, Opt, string_to_list, WHEEL_SUFFIX, SDIST_SUFFIX from ._util import app, Arg, Opt, string_to_list, WHEEL_SUFFIX, SDIST_SUFFIX
from ..schemas import validate, ModelMetaSchema from ..schemas import validate, ModelMetaSchema
@ -109,6 +110,24 @@ def package(
", ".join(meta["requirements"]), ", ".join(meta["requirements"]),
) )
if name is not None: if name is not None:
if not name.isidentifier():
msg.fail(
f"Model name ('{name}') is not a valid module name. "
"This is required so it can be imported as a module.",
"We recommend names that use ASCII A-Z, a-z, _ (underscore), "
"and 0-9. "
"For specific details see: https://docs.python.org/3/reference/lexical_analysis.html#identifiers",
exits=1,
)
if not _is_permitted_package_name(name):
msg.fail(
f"Model name ('{name}') is not a permitted package name. "
"This is required to correctly load the model with spacy.load.",
"We recommend names that use ASCII A-Z, a-z, _ (underscore), "
"and 0-9. "
"For specific details see: https://www.python.org/dev/peps/pep-0426/#name",
exits=1,
)
meta["name"] = name meta["name"] = name
if version is not None: if version is not None:
meta["version"] = version meta["version"] = version
@ -162,7 +181,7 @@ def package(
imports="\n".join(f"from . import {m}" for m in imports) imports="\n".join(f"from . import {m}" for m in imports)
) )
create_file(package_path / "__init__.py", init_py) create_file(package_path / "__init__.py", init_py)
msg.good(f"Successfully created package '{model_name_v}'", main_path) msg.good(f"Successfully created package directory '{model_name_v}'", main_path)
if create_sdist: if create_sdist:
with util.working_dir(main_path): with util.working_dir(main_path):
util.run_command([sys.executable, "setup.py", "sdist"], capture=False) util.run_command([sys.executable, "setup.py", "sdist"], capture=False)
@ -171,8 +190,14 @@ def package(
if create_wheel: if create_wheel:
with util.working_dir(main_path): with util.working_dir(main_path):
util.run_command([sys.executable, "setup.py", "bdist_wheel"], capture=False) util.run_command([sys.executable, "setup.py", "bdist_wheel"], capture=False)
wheel = main_path / "dist" / f"{model_name_v}{WHEEL_SUFFIX}" wheel_name_squashed = re.sub("_+", "_", model_name_v)
wheel = main_path / "dist" / f"{wheel_name_squashed}{WHEEL_SUFFIX}"
msg.good(f"Successfully created binary wheel", wheel) msg.good(f"Successfully created binary wheel", wheel)
if "__" in model_name:
msg.warn(
f"Model name ('{model_name}') contains a run of underscores. "
"Runs of underscores are not significant in installed package names.",
)
def has_wheel() -> bool: def has_wheel() -> bool:
@ -422,6 +447,14 @@ def _format_label_scheme(data: Dict[str, Any]) -> str:
return md.text return md.text
def _is_permitted_package_name(package_name: str) -> bool:
# regex from: https://www.python.org/dev/peps/pep-0426/#name
permitted_match = re.search(
r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$", package_name, re.IGNORECASE
)
return permitted_match is not None
TEMPLATE_SETUP = """ TEMPLATE_SETUP = """
#!/usr/bin/env python #!/usr/bin/env python
import io import io
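
The three name checks added to `package` above (valid module name, permitted package name, runs of underscores) can be tried out in isolation. This is a sketch under stated assumptions: the regular expression and the `re.sub("_+", "_", ...)` call are copied from the diff, while `check_model_name` and `squash_underscores` are hypothetical helper names introduced here for illustration. In the diff the first two conditions abort with `msg.fail`, whereas the underscore run only produces a warning.

```python
import re
from typing import List

# Pattern copied from the diff above (the PEP 426 name rule).
_PERMITTED = re.compile(
    r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$", re.IGNORECASE
)


def check_model_name(name: str) -> List[str]:
    """Return the problems found with a candidate model name (hypothetical helper)."""
    problems = []
    if not name.isidentifier():
        problems.append("not a valid module name")
    if _PERMITTED.search(name) is None:
        problems.append("not a permitted package name")
    if "__" in name:
        problems.append("contains a run of underscores")
    return problems


def squash_underscores(name: str) -> str:
    # Same substitution the diff applies to the wheel file name.
    return re.sub("_+", "_", name)
```

The two main checks are independent: a name like `my-model` passes the package-name pattern but is not importable as a module, while `_package` is a valid identifier but not a permitted package name.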


@ -483,7 +483,7 @@ class Errors(metaclass=ErrorsWithCodes):
"components, since spans are only views of the Doc. Use Doc and " "components, since spans are only views of the Doc. Use Doc and "
"Token attributes (or custom extension attributes) only and remove " "Token attributes (or custom extension attributes) only and remove "
"the following: {attrs}") "the following: {attrs}")
E181 = ("Received invalid attributes for unkown object {obj}: {attrs}. " E181 = ("Received invalid attributes for unknown object {obj}: {attrs}. "
"Only Doc and Token attributes are supported.") "Only Doc and Token attributes are supported.")
E182 = ("Received invalid attribute declaration: {attr}\nDid you forget " E182 = ("Received invalid attribute declaration: {attr}\nDid you forget "
"to define the attribute? For example: `{attr}.???`") "to define the attribute? For example: `{attr}.???`")


@ -310,7 +310,6 @@ GLOSSARY = {
"re": "repeated element", "re": "repeated element",
"rs": "reported speech", "rs": "reported speech",
"sb": "subject", "sb": "subject",
"sb": "subject",
"sbp": "passivized subject (PP)", "sbp": "passivized subject (PP)",
"sp": "subject or predicate", "sp": "subject or predicate",
"svp": "separable verb prefix", "svp": "separable verb prefix",


@ -131,7 +131,7 @@ class Language:
self, self,
vocab: Union[Vocab, bool] = True, vocab: Union[Vocab, bool] = True,
*, *,
max_length: int = 10 ** 6, max_length: int = 10**6,
meta: Dict[str, Any] = {}, meta: Dict[str, Any] = {},
create_tokenizer: Optional[Callable[["Language"], Callable[[str], Doc]]] = None, create_tokenizer: Optional[Callable[["Language"], Callable[[str], Doc]]] = None,
batch_size: int = 1000, batch_size: int = 1000,


@ -14,7 +14,7 @@ class PhraseMatcher:
def add( def add(
self, self,
key: str, key: str,
docs: List[List[Dict[str, Any]]], docs: List[Doc],
*, *,
on_match: Optional[ on_match: Optional[
Callable[[Matcher, Doc, int, List[Tuple[Any, ...]]], Any] Callable[[Matcher, Doc, int, List[Tuple[Any, ...]]], Any]


@ -85,7 +85,7 @@ def get_characters_loss(ops, docs, prediction, nr_char):
target = ops.asarray(to_categorical(target_ids, n_classes=256), dtype="f") target = ops.asarray(to_categorical(target_ids, n_classes=256), dtype="f")
target = target.reshape((-1, 256 * nr_char)) target = target.reshape((-1, 256 * nr_char))
diff = prediction - target diff = prediction - target
loss = (diff ** 2).sum() loss = (diff**2).sum()
d_target = diff / float(prediction.shape[0]) d_target = diff / float(prediction.shape[0])
return loss, d_target return loss, d_target


@ -378,7 +378,7 @@ class SpanCategorizer(TrainablePipe):
# If the prediction is 0.9 and it's false, the gradient will be # If the prediction is 0.9 and it's false, the gradient will be
# 0.9 (0.9 - 0.0) # 0.9 (0.9 - 0.0)
d_scores = scores - target d_scores = scores - target
loss = float((d_scores ** 2).sum()) loss = float((d_scores**2).sum())
return loss, d_scores return loss, d_scores
def initialize( def initialize(


@ -288,7 +288,7 @@ class TextCategorizer(TrainablePipe):
bp_scores(gradient) bp_scores(gradient)
if sgd is not None: if sgd is not None:
self.finish_update(sgd) self.finish_update(sgd)
losses[self.name] += (gradient ** 2).sum() losses[self.name] += (gradient**2).sum()
return losses return losses
def _examples_to_truth( def _examples_to_truth(
@ -322,7 +322,7 @@ class TextCategorizer(TrainablePipe):
not_missing = self.model.ops.asarray(not_missing) # type: ignore not_missing = self.model.ops.asarray(not_missing) # type: ignore
d_scores = (scores - truths) d_scores = (scores - truths)
d_scores *= not_missing d_scores *= not_missing
mean_square_error = (d_scores ** 2).mean() mean_square_error = (d_scores**2).mean()
return float(mean_square_error), d_scores return float(mean_square_error), d_scores
def add_label(self, label: str) -> int: def add_label(self, label: str) -> int:


@ -684,6 +684,7 @@ def test_has_annotation(en_vocab):
attrs = ("TAG", "POS", "MORPH", "LEMMA", "DEP", "HEAD", "ENT_IOB", "ENT_TYPE") attrs = ("TAG", "POS", "MORPH", "LEMMA", "DEP", "HEAD", "ENT_IOB", "ENT_TYPE")
for attr in attrs: for attr in attrs:
assert not doc.has_annotation(attr) assert not doc.has_annotation(attr)
assert not doc.has_annotation(attr, require_complete=True)
doc[0].tag_ = "A" doc[0].tag_ = "A"
doc[0].pos_ = "X" doc[0].pos_ = "X"
@ -709,6 +710,27 @@ def test_has_annotation(en_vocab):
assert doc.has_annotation(attr, require_complete=True) assert doc.has_annotation(attr, require_complete=True)
def test_has_annotation_sents(en_vocab):
doc = Doc(en_vocab, words=["Hello", "beautiful", "world"])
attrs = ("SENT_START", "IS_SENT_START", "IS_SENT_END")
for attr in attrs:
assert not doc.has_annotation(attr)
assert not doc.has_annotation(attr, require_complete=True)
# The first token (index 0) is always assumed to be a sentence start,
# and ignored by the check in doc.has_annotation
doc[1].is_sent_start = False
for attr in attrs:
assert doc.has_annotation(attr)
assert not doc.has_annotation(attr, require_complete=True)
doc[2].is_sent_start = False
for attr in attrs:
assert doc.has_annotation(attr)
assert doc.has_annotation(attr, require_complete=True)
def test_is_flags_deprecated(en_tokenizer): def test_is_flags_deprecated(en_tokenizer):
doc = en_tokenizer("test") doc = en_tokenizer("test")
with pytest.deprecated_call(): with pytest.deprecated_call():
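
The behaviour exercised by `test_has_annotation_sents` above can be modelled in plain Python. This is only a sketch of the semantics the test describes, not the Cython implementation in `Doc.has_annotation`: sentence-start values are represented as a list of `True`, `False` or `None` (unknown), and the helper name `has_sent_annotation` is hypothetical. How a single-token input is handled under `require_complete=True` is a choice made for this sketch and is not covered by the test.

```python
from typing import List, Optional


def has_sent_annotation(
    sent_starts: List[Optional[bool]], require_complete: bool = False
) -> bool:
    """Model the sentence-boundary check on a list of per-token values."""
    # Skip index 0: the first token is always assumed to start a sentence.
    rest = sent_starts[1:]
    known = [value is not None for value in rest]
    if require_complete:
        # Sketch-specific choice: nothing to check counts as not annotated.
        return bool(known) and all(known)
    return any(known)
```

With three tokens, setting a value on the second token is enough for the partial check, and the complete check only passes once the third token is set as well, which matches the assertions in the test.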


@ -12,6 +12,7 @@ def test_build_dependencies():
"flake8", "flake8",
"hypothesis", "hypothesis",
"pre-commit", "pre-commit",
"black",
"mypy", "mypy",
"types-dataclasses", "types-dataclasses",
"types-mock", "types-mock",


@ -12,16 +12,18 @@ from spacy.cli._util import is_subpath_of, load_project_config
from spacy.cli._util import parse_config_overrides, string_to_list from spacy.cli._util import parse_config_overrides, string_to_list
from spacy.cli._util import substitute_project_variables from spacy.cli._util import substitute_project_variables
from spacy.cli._util import validate_project_commands from spacy.cli._util import validate_project_commands
from spacy.cli.debug_data import _get_labels_from_model from spacy.cli.debug_data import _compile_gold, _get_labels_from_model
from spacy.cli.debug_data import _get_labels_from_spancat from spacy.cli.debug_data import _get_labels_from_spancat
from spacy.cli.download import get_compatibility, get_version from spacy.cli.download import get_compatibility, get_version
from spacy.cli.init_config import RECOMMENDATIONS, init_config, fill_config from spacy.cli.init_config import RECOMMENDATIONS, init_config, fill_config
from spacy.cli.package import get_third_party_dependencies from spacy.cli.package import get_third_party_dependencies
from spacy.cli.package import _is_permitted_package_name
from spacy.cli.validate import get_model_pkgs from spacy.cli.validate import get_model_pkgs
from spacy.lang.en import English from spacy.lang.en import English
from spacy.lang.nl import Dutch from spacy.lang.nl import Dutch
from spacy.language import Language from spacy.language import Language
from spacy.schemas import ProjectConfigSchema, RecommendationSchema, validate from spacy.schemas import ProjectConfigSchema, RecommendationSchema, validate
from spacy.tokens import Doc
from spacy.training import Example, docs_to_json, offsets_to_biluo_tags from spacy.training import Example, docs_to_json, offsets_to_biluo_tags
from spacy.training.converters import conll_ner_to_docs, conllu_to_docs from spacy.training.converters import conll_ner_to_docs, conllu_to_docs
from spacy.training.converters import iob_to_docs from spacy.training.converters import iob_to_docs
@ -692,3 +694,39 @@ def test_get_labels_from_model(factory_name, pipe_name):
assert _get_labels_from_spancat(nlp)[pipe.key] == set(labels) assert _get_labels_from_spancat(nlp)[pipe.key] == set(labels)
else: else:
assert _get_labels_from_model(nlp, factory_name) == set(labels) assert _get_labels_from_model(nlp, factory_name) == set(labels)
def test_permitted_package_names():
# https://www.python.org/dev/peps/pep-0426/#name
assert _is_permitted_package_name("Meine_Bäume") == False
assert _is_permitted_package_name("_package") == False
assert _is_permitted_package_name("package_") == False
assert _is_permitted_package_name(".package") == False
assert _is_permitted_package_name("package.") == False
assert _is_permitted_package_name("-package") == False
assert _is_permitted_package_name("package-") == False
def test_debug_data_compile_gold():
nlp = English()
pred = Doc(nlp.vocab, words=["Token", ".", "New", "York", "City"])
ref = Doc(
nlp.vocab,
words=["Token", ".", "New York City"],
sent_starts=[True, False, True],
ents=["O", "O", "B-ENT"],
)
eg = Example(pred, ref)
data = _compile_gold([eg], ["ner"], nlp, True)
assert data["boundary_cross_ents"] == 0
pred = Doc(nlp.vocab, words=["Token", ".", "New", "York", "City"])
ref = Doc(
nlp.vocab,
words=["Token", ".", "New York City"],
sent_starts=[True, False, True],
ents=["O", "B-ENT", "I-ENT"],
)
eg = Example(pred, ref)
data = _compile_gold([eg], ["ner"], nlp, True)
assert data["boundary_cross_ents"] == 1
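
The `boundary_cross_ents` counter checked in `test_debug_data_compile_gold` above follows a simple rule: an entity crosses a sentence boundary when a token tagged `I-` or `L-` is also a sentence start. Below is a minimal sketch of that rule, assuming the BILUO tags and sentence-start flags are already aligned token by token (in spaCy this alignment comes from `eg.get_aligned_ner()` and `eg.get_aligned_sent_starts()`). The function name `count_boundary_crossings` is hypothetical.

```python
from typing import List, Optional


def count_boundary_crossings(
    tags: List[Optional[str]], sent_starts: List[Optional[bool]]
) -> int:
    """Count tokens that continue an entity while starting a new sentence."""
    crossings = 0
    for tag, is_start in zip(tags, sent_starts):
        if tag is None:
            # Missing annotation, skipped as in _compile_gold.
            continue
        if is_start is True and tag.startswith(("I-", "L-")):
            crossings += 1
    return crossings
```

Comparing against `True` explicitly matters because an unknown sentence start (`None`) should not count as a boundary.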


@ -420,6 +420,8 @@ cdef class Doc:
cdef int range_start = 0 cdef int range_start = 0
if attr == "IS_SENT_START" or attr == self.vocab.strings["IS_SENT_START"]: if attr == "IS_SENT_START" or attr == self.vocab.strings["IS_SENT_START"]:
attr = SENT_START attr = SENT_START
elif attr == "IS_SENT_END" or attr == self.vocab.strings["IS_SENT_END"]:
attr = SENT_START
attr = intify_attr(attr) attr = intify_attr(attr)
# adjust attributes # adjust attributes
if attr == HEAD: if attr == HEAD:


@ -487,8 +487,6 @@ cdef class Token:
RETURNS (bool / None): Whether the token starts a sentence. RETURNS (bool / None): Whether the token starts a sentence.
None if unknown. None if unknown.
DOCS: https://spacy.io/api/token#is_sent_start
""" """
def __get__(self): def __get__(self):
if self.c.sent_start == 0: if self.c.sent_start == 0:


@ -871,7 +871,6 @@ def get_package_path(name: str) -> Path:
name (str): Package name. name (str): Package name.
RETURNS (Path): Path to installed package. RETURNS (Path): Path to installed package.
""" """
name = name.lower() # use lowercase version to be safe
# Here we're importing the module just to find it. This is worryingly # Here we're importing the module just to find it. This is worryingly
# indirect, but it's otherwise very difficult to find the package. # indirect, but it's otherwise very difficult to find the package.
pkg = importlib.import_module(name) pkg = importlib.import_module(name)


@ -79,6 +79,7 @@ train/test skew.
| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ | | `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ |
| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ | | `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ |
| `augmenter` | Optional data augmentation callback. ~~Callable[[Language, Example], Iterable[Example]]~~ | | `augmenter` | Optional data augmentation callback. ~~Callable[[Language, Example], Iterable[Example]]~~ |
| `shuffle` | Whether to shuffle the examples. Defaults to `False`. ~~bool~~ |
## Corpus.\_\_call\_\_ {#call tag="method"} ## Corpus.\_\_call\_\_ {#call tag="method"}


@ -304,7 +304,7 @@ ancestor is found, e.g. if span excludes a necessary ancestor.
## Doc.has_annotation {#has_annotation tag="method"} ## Doc.has_annotation {#has_annotation tag="method"}
Check whether the doc contains annotation on a token attribute. Check whether the doc contains annotation on a [`Token` attribute](/api/token#attributes).
<Infobox title="Changed in v3.0" variant="warning"> <Infobox title="Changed in v3.0" variant="warning">


@ -349,23 +349,6 @@ A sequence containing the token and all the token's syntactic descendants.
| ---------- | ------------------------------------------------------------------------------------ | | ---------- | ------------------------------------------------------------------------------------ |
| **YIELDS** | A descendant token such that `self.is_ancestor(token)` or `token == self`. ~~Token~~ | | **YIELDS** | A descendant token such that `self.is_ancestor(token)` or `token == self`. ~~Token~~ |
## Token.is_sent_start {#is_sent_start tag="property" new="2"}
A boolean value indicating whether the token starts a sentence. `None` if
unknown. Defaults to `True` for the first token in the `Doc`.
> #### Example
>
> ```python
> doc = nlp("Give it back! He pleaded.")
> assert doc[4].is_sent_start
> assert not doc[5].is_sent_start
> ```
| Name | Description |
| ----------- | ------------------------------------------------------- |
| **RETURNS** | Whether the token starts a sentence. ~~Optional[bool]~~ |
## Token.has_vector {#has_vector tag="property" model="vectors"} ## Token.has_vector {#has_vector tag="property" model="vectors"}
A boolean value indicating whether a word vector is associated with the token. A boolean value indicating whether a word vector is associated with the token.
@ -465,6 +448,8 @@ The L2 norm of the token's vector representation.
| `is_punct` | Is the token punctuation? ~~bool~~ | | `is_punct` | Is the token punctuation? ~~bool~~ |
| `is_left_punct` | Is the token a left punctuation mark, e.g. `"("` ? ~~bool~~ | | `is_left_punct` | Is the token a left punctuation mark, e.g. `"("` ? ~~bool~~ |
| `is_right_punct` | Is the token a right punctuation mark, e.g. `")"` ? ~~bool~~ | | `is_right_punct` | Is the token a right punctuation mark, e.g. `")"` ? ~~bool~~ |
| `is_sent_start` | Does the token start a sentence? ~~bool~~ or `None` if unknown. Defaults to `True` for the first token in the `Doc`. |
| `is_sent_end` | Does the token end a sentence? ~~bool~~ or `None` if unknown. |
| `is_space` | Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`. ~~bool~~ | | `is_space` | Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`. ~~bool~~ |
| `is_bracket` | Is the token a bracket? ~~bool~~ | | `is_bracket` | Is the token a bracket? ~~bool~~ |
| `is_quote` | Is the token a quotation mark? ~~bool~~ | | `is_quote` | Is the token a quotation mark? ~~bool~~ |

Binary file not shown. Size: 44 KiB


@ -213,6 +213,12 @@ format, train a pipeline, evaluate it and export metrics, package it and spin up
a quick web demo. It looks pretty similar to a config file used to define CI a quick web demo. It looks pretty similar to a config file used to define CI
pipelines. pipelines.
> #### Tip: Multi-line YAML syntax for long values
>
> YAML has [multi-line syntax](https://yaml-multiline.info/) that can be
> helpful for readability with longer values such as project descriptions or
> commands that take several arguments.
```yaml ```yaml
%%GITHUB_PROJECTS/pipelines/tagger_parser_ud/project.yml %%GITHUB_PROJECTS/pipelines/tagger_parser_ud/project.yml
``` ```


@ -40,7 +40,11 @@
"label": "Resources", "label": "Resources",
"items": [ "items": [
{ "text": "Project Templates", "url": "https://github.com/explosion/projects" }, { "text": "Project Templates", "url": "https://github.com/explosion/projects" },
{ "text": "v2.x Documentation", "url": "https://v2.spacy.io" } { "text": "v2.x Documentation", "url": "https://v2.spacy.io" },
{
"text": "Custom Solutions",
"url": "https://explosion.ai/spacy-tailored-pipelines"
}
] ]
} }
] ]


@ -48,7 +48,11 @@
{ "text": "Usage", "url": "/usage" }, { "text": "Usage", "url": "/usage" },
{ "text": "Models", "url": "/models" }, { "text": "Models", "url": "/models" },
{ "text": "API Reference", "url": "/api" }, { "text": "API Reference", "url": "/api" },
{ "text": "Online Course", "url": "https://course.spacy.io" } { "text": "Online Course", "url": "https://course.spacy.io" },
{
"text": "Custom Solutions",
"url": "https://explosion.ai/spacy-tailored-pipelines"
}
] ]
}, },
{ {

View File

@ -953,6 +953,37 @@
"category": ["pipeline"], "category": ["pipeline"],
"tags": ["lemmatizer", "danish"] "tags": ["lemmatizer", "danish"]
}, },
{
"id": "augmenty",
"title": "Augmenty",
"slogan": "The cherry on top of your NLP pipeline",
"description": "Augmenty is an augmentation library based on spaCy for augmenting texts. Augmenty differs from other augmentation libraries in that it corrects (as far as possible) the token, sentence and document labels under the augmentation.",
"github": "kennethenevoldsen/augmenty",
"pip": "augmenty",
"code_example": [
"import spacy",
"import augmenty",
"",
"nlp = spacy.load('en_core_web_md')",
"",
"docs = nlp.pipe(['Augmenty is a great tool for text augmentation'])",
"",
"ent_dict = {'ORG': [['spaCy'], ['spaCy', 'Universe']]}",
"entity_augmenter = augmenty.load('ents_replace.v1',",
" ent_dict = ent_dict, level=1)",
"",
"for doc in augmenty.docs(docs, augmenter=entity_augmenter, nlp=nlp):",
" print(doc)"
],
"thumb": "https://github.com/KennethEnevoldsen/augmenty/blob/master/img/icon.png?raw=true",
"author": "Kenneth Enevoldsen",
"author_links": {
"github": "kennethenevoldsen",
"website": "https://www.kennethenevoldsen.com"
},
"category": ["training", "research"],
"tags": ["training", "research", "augmentation"]
},
{ {
"id": "dacy", "id": "dacy",
"title": "DaCy", "title": "DaCy",
@ -3738,6 +3769,65 @@
}, },
"category": ["pipeline"], "category": ["pipeline"],
"tags": ["pipeline", "nlp", "sentiment"] "tags": ["pipeline", "nlp", "sentiment"]
},
{
"id": "textnets",
"slogan": "Text analysis with networks",
"description": "textnets represents collections of texts as networks of documents and words. This provides novel possibilities for the visualization and analysis of texts.",
"github": "jboynyc/textnets",
"image": "https://user-images.githubusercontent.com/2187261/152641425-6c0fb41c-b8e0-44fb-a52a-7c1ba24eba1e.png",
"code_example": [
"import textnets as tn",
"",
"corpus = tn.Corpus(tn.examples.moon_landing)",
"t = tn.Textnet(corpus.tokenized(), min_docs=1)",
"t.plot(label_nodes=True,",
" show_clusters=True,",
" scale_nodes_by=\"birank\",",
" scale_edges_by=\"weight\")"
],
"author": "John Boy",
"author_links": {
"github": "jboynyc",
"twitter": "jboy"
},
"category": ["visualizers", "standalone"]
},
{
"id": "tmtoolkit",
"slogan": "Text mining and topic modeling toolkit",
"description": "tmtoolkit is a set of tools for text mining and topic modeling with Python developed especially for the use in the social sciences, in journalism or related disciplines. It aims for easy installation, extensive documentation and a clear programming interface while offering good performance on large datasets by the means of vectorized operations (via NumPy) and parallel computation (using Python's multiprocessing module and the loky package).",
"github": "WZBSocialScienceCenter/tmtoolkit",
"code_example": [
"# Note: This requires these setup steps:",
"# pip install tmtoolkit[recommended]",
"# python -m tmtoolkit setup en",
"from tmtoolkit.corpus import Corpus, tokens_table, lemmatize, to_lowercase, dtm",
"from tmtoolkit.bow.bow_stats import tfidf, sorted_terms_table",
"# load built-in sample dataset and use 4 worker processes",
"corp = Corpus.from_builtin_corpus('en-News100', max_workers=4)",
"# investigate corpus as dataframe",
"toktbl = tokens_table(corp)",
"print(toktbl)",
"# apply some text normalization",
"lemmatize(corp)",
"to_lowercase(corp)",
"# build sparse document-token matrix (DTM)",
"# document labels identify rows, vocabulary tokens identify columns",
"mat, doc_labels, vocab = dtm(corp, return_doc_labels=True, return_vocab=True)",
"# apply tf-idf transformation to DTM",
"# operation is applied on sparse matrix and uses little memory",
"tfidf_mat = tfidf(mat)",
"# show top 5 tokens per document ranked by tf-idf",
"top_tokens = sorted_terms_table(tfidf_mat, vocab, doc_labels, top_n=5)",
"print(top_tokens)"
],
"author": "Markus Konrad / WZB Social Science Center",
"author_links": {
"github": "internaut",
"twitter": "_knrd"
},
"category": ["scientific", "standalone"]
} }
], ],

View File

@ -6,11 +6,14 @@ import { replaceEmoji } from './icon'
export const Ol = props => <ol className={classes.ol} {...props} /> export const Ol = props => <ol className={classes.ol} {...props} />
export const Ul = props => <ul className={classes.ul} {...props} /> export const Ul = props => <ul className={classes.ul} {...props} />
export const Li = ({ children, ...props }) => { export const Li = ({ children, emoji, ...props }) => {
const { hasIcon, content } = replaceEmoji(children) const { hasIcon, content } = replaceEmoji(children)
const liClassNames = classNames(classes.li, { [classes.liIcon]: hasIcon }) const liClassNames = classNames(classes.li, {
[classes.liIcon]: hasIcon,
[classes.emoji]: emoji,
})
return ( return (
<li className={liClassNames} {...props}> <li data-emoji={emoji} className={liClassNames} {...props}>
{content} {content}
</li> </li>
) )

View File

@ -36,6 +36,16 @@
box-sizing: content-box box-sizing: content-box
vertical-align: top vertical-align: top
.emoji:before
content: attr(data-emoji)
padding-right: 0.75em
padding-top: 0
margin-left: -2.5em
width: 1.75em
text-align: right
font-size: 1em
position: static
.li-icon .li-icon
text-indent: calc(-20px - 0.55em) text-indent: calc(-20px - 0.55em)

View File

@ -15,9 +15,9 @@ import {
} from '../components/landing' } from '../components/landing'
import { H2 } from '../components/typography' import { H2 } from '../components/typography'
import { InlineCode } from '../components/code' import { InlineCode } from '../components/code'
import { Ul, Li } from '../components/list'
import Button from '../components/button' import Button from '../components/button'
import Link from '../components/link' import Link from '../components/link'
import { YouTube } from '../components/embed'
import QuickstartTraining from './quickstart-training' import QuickstartTraining from './quickstart-training'
import Project from './project' import Project from './project'
@ -25,6 +25,7 @@ import Features from './features'
import courseImage from '../../docs/images/course.jpg' import courseImage from '../../docs/images/course.jpg'
import prodigyImage from '../../docs/images/prodigy_overview.jpg' import prodigyImage from '../../docs/images/prodigy_overview.jpg'
import projectsImage from '../../docs/images/projects.png' import projectsImage from '../../docs/images/projects.png'
import tailoredPipelinesImage from '../../docs/images/spacy-tailored-pipelines_wide.png'
import Benchmarks from 'usage/_benchmarks-models.md' import Benchmarks from 'usage/_benchmarks-models.md'
@ -104,23 +105,45 @@ const Landing = ({ data }) => {
<LandingBannerGrid> <LandingBannerGrid>
<LandingBanner <LandingBanner
label="New in v3.0" to="https://explosion.ai/spacy-tailored-pipelines"
title="Transformer-based pipelines, new training system, project templates &amp; more" button="Learn more"
to="/usage/v3" background="#E4F4F9"
button="See what's new" color="#1e1935"
small small
> >
spaCy v3.0 features all new <strong>transformer-based pipelines</strong> that <Link to="https://explosion.ai/spacy-tailored-pipelines" hidden>
bring spaCy's accuracy right up to the current <strong>state-of-the-art</strong> <img src={tailoredPipelinesImage} alt="spaCy Tailored Pipelines" />
. You can use any pretrained transformer to train your own pipelines, and even </Link>
share one transformer between multiple components with{' '} <strong>
<strong>multi-task learning</strong>. Training is now fully configurable and Get a custom spaCy pipeline, tailor-made for your NLP problem by spaCy's
extensible, and you can define your own custom models using{' '} core developers.
<strong>PyTorch</strong>, <strong>TensorFlow</strong> and other frameworks. The </strong>
new spaCy projects system lets you describe whole{' '} <br />
<strong>end-to-end workflows</strong> in a single file, giving you an easy path <br />
from prototype to production, and making it easy to clone and adapt <Ul>
best-practice projects for your own use cases. <Li emoji="🔥">
<strong>Streamlined.</strong> Nobody knows spaCy better than we do. Send
us your pipeline requirements and we'll be ready to start producing your
solution in no time at all.
</Li>
<Li emoji="🐿 ">
<strong>Production ready.</strong> spaCy pipelines are robust and easy
to deploy. You'll get a complete spaCy project folder which is ready to{' '}
<InlineCode>spacy project run</InlineCode>.
</Li>
<Li emoji="🔮">
<strong>Predictable.</strong> You'll know exactly what you're going to
get and what it's going to cost. We quote fees up-front, let you try
before you buy, and don't charge for over-runs at our end: all the risk
is on us.
</Li>
<Li emoji="🛠">
<strong>Maintainable.</strong> spaCy is an industry standard, and we'll
deliver your pipeline with full code, data, tests and documentation, so
your team can retrain, update and extend the solution as your
requirements change.
</Li>
</Ul>
</LandingBanner> </LandingBanner>
<LandingBanner <LandingBanner
@ -206,8 +229,20 @@ const Landing = ({ data }) => {
</LandingGrid> </LandingGrid>
<LandingBannerGrid> <LandingBannerGrid>
<LandingBanner background="#0099dd" color="#ffffff" small> <LandingBanner
<YouTube id="9k_EfV7Cns0" /> label="New in v3.0"
title="Transformer-based pipelines, new training system, project templates &amp; more"
to="/usage/v3"
button="See what's new"
small
>
spaCy v3.0 features all new <strong>transformer-based pipelines</strong> that
bring spaCy's accuracy right up to the current <strong>state-of-the-art</strong>
. You can use any pretrained transformer to train your own pipelines, and even
share one transformer between multiple components with{' '}
<strong>multi-task learning</strong>. Training is now fully configurable and
extensible, and you can define your own custom models using{' '}
<strong>PyTorch</strong>, <strong>TensorFlow</strong> and other frameworks.
</LandingBanner> </LandingBanner>
<LandingBanner <LandingBanner
to="https://course.spacy.io" to="https://course.spacy.io"