mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-25 00:34:20 +03:00
Merge branch 'master' into copy/develop
commit
a16b14e591
21
.github/workflows/gputests.yml
vendored
Normal file
@ -0,0 +1,21 @@
name: Weekly GPU tests

on:
  schedule:
    - cron: '0 1 * * MON'

jobs:
  weekly-gputests:
    strategy:
      fail-fast: false
      matrix:
        branch: [master, develop, v4]
    runs-on: ubuntu-latest
    steps:
      - name: Trigger buildkite build
        uses: buildkite/trigger-pipeline-action@v1.2.0
        env:
          PIPELINE: explosion-ai/spacy-slow-gpu-tests
          BRANCH: ${{ matrix.branch }}
          MESSAGE: ":github: Weekly GPU + slow tests - triggered from a GitHub Action"
          BUILDKITE_API_ACCESS_TOKEN: ${{ secrets.BUILDKITE_SECRET }}
35
.github/workflows/slowtests.yml
vendored
Normal file
@ -0,0 +1,35 @@
name: Daily slow tests

on:
  schedule:
    - cron: '0 0 * * *'

jobs:
  daily-slowtests:
    strategy:
      fail-fast: false
      matrix:
        branch: [master, develop, v4]
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v1
      - name: Get commits from past 24 hours
        id: check_commits
        run: |
          today=$(date '+%Y-%m-%d %H:%M:%S')
          yesterday=$(date -d "yesterday" '+%Y-%m-%d %H:%M:%S')
          if git log --after="$yesterday" --before="$today" | grep commit ; then
            echo "::set-output name=run_tests::true"
          else
            echo "::set-output name=run_tests::false"
          fi

      - name: Trigger buildkite build
        if: steps.check_commits.outputs.run_tests == 'true'
        uses: buildkite/trigger-pipeline-action@v1.2.0
        env:
          PIPELINE: explosion-ai/spacy-slow-tests
          BRANCH: ${{ matrix.branch }}
          MESSAGE: ":github: Daily slow tests - triggered from a GitHub Action"
          BUILDKITE_API_ACCESS_TOKEN: ${{ secrets.BUILDKITE_SECRET }}
31
README.md
|
@ -32,19 +32,20 @@ open-source software, released under the MIT license.
|
|||
|
||||
## 📖 Documentation
|
||||
|
||||
| Documentation | |
|
||||
| -------------------------- | -------------------------------------------------------------- |
|
||||
| ⭐️ **[spaCy 101]** | New to spaCy? Here's everything you need to know! |
|
||||
| 📚 **[Usage Guides]** | How to use spaCy and its features. |
|
||||
| 🚀 **[New in v3.0]** | New features, backwards incompatibilities and migration guide. |
|
||||
| 🪐 **[Project Templates]** | End-to-end workflows you can clone, modify and run. |
|
||||
| 🎛 **[API Reference]** | The detailed reference for spaCy's API. |
|
||||
| 📦 **[Models]** | Download trained pipelines for spaCy. |
|
||||
| 🌌 **[Universe]** | Plugins, extensions, demos and books from the spaCy ecosystem. |
|
||||
| 👩‍🏫 **[Online Course]** | Learn spaCy in this free and interactive online course. |
|
||||
| 📺 **[Videos]** | Our YouTube channel with video tutorials, talks and more. |
|
||||
| 🛠 **[Changelog]** | Changes and version history. |
|
||||
| 💝 **[Contribute]** | How to contribute to the spaCy project and code base. |
|
||||
| Documentation | |
|
||||
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| ⭐️ **[spaCy 101]** | New to spaCy? Here's everything you need to know! |
|
||||
| 📚 **[Usage Guides]** | How to use spaCy and its features. |
|
||||
| 🚀 **[New in v3.0]** | New features, backwards incompatibilities and migration guide. |
|
||||
| 🪐 **[Project Templates]** | End-to-end workflows you can clone, modify and run. |
|
||||
| 🎛 **[API Reference]** | The detailed reference for spaCy's API. |
|
||||
| 📦 **[Models]** | Download trained pipelines for spaCy. |
|
||||
| 🌌 **[Universe]** | Plugins, extensions, demos and books from the spaCy ecosystem. |
|
||||
| 👩‍🏫 **[Online Course]** | Learn spaCy in this free and interactive online course. |
|
||||
| 📺 **[Videos]** | Our YouTube channel with video tutorials, talks and more. |
|
||||
| 🛠 **[Changelog]** | Changes and version history. |
|
||||
| 💝 **[Contribute]** | How to contribute to the spaCy project and code base. |
|
||||
| <a href="https://explosion.ai/spacy-tailored-pipelines"><img src="https://user-images.githubusercontent.com/13643239/152853098-1c761611-ccb0-4ec6-9066-b234552831fe.png" width="125" alt="spaCy Tailored Pipelines"/></a> | Get a custom spaCy pipeline, tailor-made for your NLP problem by spaCy's core developers. Streamlined, production-ready, predictable and maintainable. Start by completing our 5-minute questionnaire to tell us what you need and we'll be in touch! **[Learn more →](https://explosion.ai/spacy-tailored-pipelines)** |
|
||||
|
||||
[spacy 101]: https://spacy.io/usage/spacy-101
|
||||
[new in v3.0]: https://spacy.io/usage/v3
|
||||
|
@ -60,9 +61,7 @@ open-source software, released under the MIT license.
|
|||
|
||||
## 💬 Where to ask questions
|
||||
|
||||
The spaCy project is maintained by **[@honnibal](https://github.com/honnibal)**,
|
||||
**[@ines](https://github.com/ines)**, **[@svlandeg](https://github.com/svlandeg)**,
|
||||
**[@adrianeboyd](https://github.com/adrianeboyd)** and **[@polm](https://github.com/polm)**.
|
||||
The spaCy project is maintained by the [spaCy team](https://explosion.ai/about).
|
||||
Please understand that we won't be able to provide individual support via email.
|
||||
We also believe that help is much more valuable if it's shared publicly, so that
|
||||
more people can benefit from it.
|
||||
|
|
|
@ -11,12 +11,14 @@ trigger:
|
|||
exclude:
|
||||
- "website/*"
|
||||
- "*.md"
|
||||
- ".github/workflows/*"
|
||||
pr:
|
||||
paths:
|
||||
paths:
|
||||
exclude:
|
||||
- "*.md"
|
||||
- "website/docs/*"
|
||||
- "website/src/*"
|
||||
- ".github/workflows/*"
|
||||
|
||||
jobs:
|
||||
# Perform basic checks for most important errors (syntax etc.) Uses the config
|
||||
|
|
|
@ -35,3 +35,4 @@ mypy==0.910
|
|||
types-dataclasses>=0.1.3; python_version < "3.7"
|
||||
types-mock>=0.1.1
|
||||
types-requests
|
||||
black>=22.0,<23.0
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
# fmt: off
|
||||
__title__ = "spacy"
|
||||
__version__ = "3.2.1"
|
||||
__version__ = "3.2.2"
|
||||
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
|
||||
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
|
||||
__projects__ = "https://github.com/explosion/projects"
|
||||
|
|
|
@ -193,6 +193,70 @@ def debug_data(
|
|||
else:
|
||||
msg.info("No word vectors present in the package")
|
||||
|
||||
if "spancat" in factory_names:
|
||||
model_labels_spancat = _get_labels_from_spancat(nlp)
|
||||
has_low_data_warning = False
|
||||
has_no_neg_warning = False
|
||||
|
||||
msg.divider("Span Categorization")
|
||||
msg.table(model_labels_spancat, header=["Spans Key", "Labels"], divider=True)
|
||||
|
||||
msg.text("Label counts in train data: ", show=verbose)
|
||||
for spans_key, data_labels in gold_train_data["spancat"].items():
|
||||
msg.text(
|
||||
f"Key: {spans_key}, {_format_labels(data_labels.items(), counts=True)}",
|
||||
show=verbose,
|
||||
)
|
||||
# Data checks: only take the spans keys in the actual spancat components
|
||||
data_labels_in_component = {
|
||||
spans_key: gold_train_data["spancat"][spans_key]
|
||||
for spans_key in model_labels_spancat.keys()
|
||||
}
|
||||
for spans_key, data_labels in data_labels_in_component.items():
|
||||
for label, count in data_labels.items():
|
||||
# Check for missing labels
|
||||
spans_key_in_model = spans_key in model_labels_spancat.keys()
|
||||
if (spans_key_in_model) and (
|
||||
label not in model_labels_spancat[spans_key]
|
||||
):
|
||||
msg.warn(
|
||||
f"Label '{label}' is not present in the model labels of key '{spans_key}'. "
|
||||
"Performance may degrade after training."
|
||||
)
|
||||
# Check for low number of examples per label
|
||||
if count <= NEW_LABEL_THRESHOLD:
|
||||
msg.warn(
|
||||
f"Low number of examples for label '{label}' in key '{spans_key}' ({count})"
|
||||
)
|
||||
has_low_data_warning = True
|
||||
# Check for negative examples
|
||||
with msg.loading("Analyzing label distribution..."):
|
||||
neg_docs = _get_examples_without_label(
|
||||
train_dataset, label, "spancat", spans_key
|
||||
)
|
||||
if neg_docs == 0:
|
||||
msg.warn(f"No examples for texts WITHOUT new label '{label}'")
|
||||
has_no_neg_warning = True
|
||||
|
||||
if has_low_data_warning:
|
||||
msg.text(
|
||||
f"To train a new span type, your data should include at "
|
||||
f"least {NEW_LABEL_THRESHOLD} instances of the new label",
|
||||
show=verbose,
|
||||
)
|
||||
else:
|
||||
msg.good("Good amount of examples for all labels")
|
||||
|
||||
if has_no_neg_warning:
|
||||
msg.text(
|
||||
"Training data should always include examples of spans "
|
||||
"in context, as well as examples without a given span "
|
||||
"type.",
|
||||
show=verbose,
|
||||
)
|
||||
else:
|
||||
msg.good("Examples without occurrences available for all labels")
|
||||
|
||||
if "ner" in factory_names:
|
||||
# Get all unique NER labels present in the data
|
||||
labels = set(
|
||||
|
@ -238,7 +302,7 @@ def debug_data(
|
|||
has_low_data_warning = True
|
||||
|
||||
with msg.loading("Analyzing label distribution..."):
|
||||
neg_docs = _get_examples_without_label(train_dataset, label)
|
||||
neg_docs = _get_examples_without_label(train_dataset, label, "ner")
|
||||
if neg_docs == 0:
|
||||
msg.warn(f"No examples for texts WITHOUT new label '{label}'")
|
||||
has_no_neg_warning = True
|
||||
|
@ -573,6 +637,7 @@ def _compile_gold(
|
|||
"deps": Counter(),
|
||||
"words": Counter(),
|
||||
"roots": Counter(),
|
||||
"spancat": dict(),
|
||||
"ws_ents": 0,
|
||||
"boundary_cross_ents": 0,
|
||||
"n_words": 0,
|
||||
|
@ -603,6 +668,7 @@ def _compile_gold(
|
|||
if nlp.vocab.strings[word] not in nlp.vocab.vectors:
|
||||
data["words_missing_vectors"].update([word])
|
||||
if "ner" in factory_names:
|
||||
sent_starts = eg.get_aligned_sent_starts()
|
||||
for i, label in enumerate(eg.get_aligned_ner()):
|
||||
if label is None:
|
||||
continue
|
||||
|
@ -612,10 +678,19 @@ def _compile_gold(
|
|||
if label.startswith(("B-", "U-")):
|
||||
combined_label = label.split("-")[1]
|
||||
data["ner"][combined_label] += 1
|
||||
if gold[i].is_sent_start and label.startswith(("I-", "L-")):
|
||||
if sent_starts[i] == True and label.startswith(("I-", "L-")):
|
||||
data["boundary_cross_ents"] += 1
|
||||
elif label == "-":
|
||||
data["ner"]["-"] += 1
|
||||
if "spancat" in factory_names:
|
||||
for span_key in list(eg.reference.spans.keys()):
|
||||
if span_key not in data["spancat"]:
|
||||
data["spancat"][span_key] = Counter()
|
||||
for i, span in enumerate(eg.reference.spans[span_key]):
|
||||
if span.label_ is None:
|
||||
continue
|
||||
else:
|
||||
data["spancat"][span_key][span.label_] += 1
|
||||
if "textcat" in factory_names or "textcat_multilabel" in factory_names:
|
||||
data["cats"].update(gold.cats)
|
||||
if any(val not in (0, 1) for val in gold.cats.values()):
|
||||
|
@ -686,14 +761,28 @@ def _format_labels(
|
|||
return ", ".join([f"'{l}'" for l in cast(Iterable[str], labels)])
|
||||
|
||||
|
||||
def _get_examples_without_label(data: Sequence[Example], label: str) -> int:
|
||||
def _get_examples_without_label(
|
||||
data: Sequence[Example],
|
||||
label: str,
|
||||
component: Literal["ner", "spancat"] = "ner",
|
||||
spans_key: Optional[str] = "sc",
|
||||
) -> int:
|
||||
count = 0
|
||||
for eg in data:
|
||||
labels = [
|
||||
label.split("-")[1]
|
||||
for label in eg.get_aligned_ner()
|
||||
if label not in ("O", "-", None)
|
||||
]
|
||||
if component == "ner":
|
||||
labels = [
|
||||
label.split("-")[1]
|
||||
for label in eg.get_aligned_ner()
|
||||
if label not in ("O", "-", None)
|
||||
]
|
||||
|
||||
if component == "spancat":
|
||||
labels = (
|
||||
[span.label_ for span in eg.reference.spans[spans_key]]
|
||||
if spans_key in eg.reference.spans
|
||||
else []
|
||||
)
|
||||
|
||||
if label not in labels:
|
||||
count += 1
|
||||
return count
|
||||
|
|
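Reviewer note on the `_get_examples_without_label` change above: the function now counts the examples in which a label never occurs, either among the aligned NER tags or within one span group. The following is a minimal sketch of that counting logic without spaCy. It assumes each example is modelled as a plain dict; the `ner` and `spans` keys and the name `count_examples_without_label` are hypothetical and exist only for illustration.

```python
from typing import Dict, List, Sequence


def count_examples_without_label(
    data: Sequence[Dict],
    label: str,
    component: str = "ner",
    spans_key: str = "sc",
) -> int:
    """Count the examples in which `label` never occurs."""
    count = 0
    for eg in data:
        labels: List[str] = []
        if component == "ner":
            # Strip the BILUO prefix ("B-ORG" -> "ORG"); skip "O", "-" and missing tags.
            labels = [
                tag.split("-", 1)[1]
                for tag in eg.get("ner", [])
                if tag not in ("O", "-", None)
            ]
        elif component == "spancat":
            # A missing spans key means the example has no spans for that group.
            labels = list(eg.get("spans", {}).get(spans_key, []))
        if label not in labels:
            count += 1
    return count


# Hypothetical toy data: three examples with NER tags and span-group labels.
examples = [
    {"ner": ["B-ORG", "L-ORG", "O"], "spans": {"sc": ["ORG"]}},
    {"ner": ["O", "O", None], "spans": {}},
    {"ner": ["U-PERSON", "-"], "spans": {"sc": ["PERSON"]}},
]
ner_negatives = count_examples_without_label(examples, "ORG", "ner")
span_negatives = count_examples_without_label(examples, "ORG", "spancat", "sc")
```

A count of zero is what triggers the "No examples for texts WITHOUT new label" warning in `debug_data`.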
|
@ -7,6 +7,7 @@ from collections import defaultdict
|
|||
from catalogue import RegistryError
|
||||
import srsly
|
||||
import sys
|
||||
import re
|
||||
|
||||
from ._util import app, Arg, Opt, string_to_list, WHEEL_SUFFIX, SDIST_SUFFIX
|
||||
from ..schemas import validate, ModelMetaSchema
|
||||
|
@ -109,6 +110,24 @@ def package(
|
|||
", ".join(meta["requirements"]),
|
||||
)
|
||||
if name is not None:
|
||||
if not name.isidentifier():
|
||||
msg.fail(
|
||||
f"Model name ('{name}') is not a valid module name. "
|
||||
"This is required so it can be imported as a module.",
|
||||
"We recommend names that use ASCII A-Z, a-z, _ (underscore), "
|
||||
"and 0-9. "
|
||||
"For specific details see: https://docs.python.org/3/reference/lexical_analysis.html#identifiers",
|
||||
exits=1,
|
||||
)
|
||||
if not _is_permitted_package_name(name):
|
||||
msg.fail(
|
||||
f"Model name ('{name}') is not a permitted package name. "
|
||||
"This is required to correctly load the model with spacy.load.",
|
||||
"We recommend names that use ASCII A-Z, a-z, _ (underscore), "
|
||||
"and 0-9. "
|
||||
"For specific details see: https://www.python.org/dev/peps/pep-0426/#name",
|
||||
exits=1,
|
||||
)
|
||||
meta["name"] = name
|
||||
if version is not None:
|
||||
meta["version"] = version
|
||||
|
@ -162,7 +181,7 @@ def package(
|
|||
imports="\n".join(f"from . import {m}" for m in imports)
|
||||
)
|
||||
create_file(package_path / "__init__.py", init_py)
|
||||
msg.good(f"Successfully created package '{model_name_v}'", main_path)
|
||||
msg.good(f"Successfully created package directory '{model_name_v}'", main_path)
|
||||
if create_sdist:
|
||||
with util.working_dir(main_path):
|
||||
util.run_command([sys.executable, "setup.py", "sdist"], capture=False)
|
||||
|
@ -171,8 +190,14 @@ def package(
|
|||
if create_wheel:
|
||||
with util.working_dir(main_path):
|
||||
util.run_command([sys.executable, "setup.py", "bdist_wheel"], capture=False)
|
||||
wheel = main_path / "dist" / f"{model_name_v}{WHEEL_SUFFIX}"
|
||||
wheel_name_squashed = re.sub("_+", "_", model_name_v)
|
||||
wheel = main_path / "dist" / f"{wheel_name_squashed}{WHEEL_SUFFIX}"
|
||||
msg.good(f"Successfully created binary wheel", wheel)
|
||||
if "__" in model_name:
|
||||
msg.warn(
|
||||
f"Model name ('{model_name}') contains a run of underscores. "
|
||||
"Runs of underscores are not significant in installed package names.",
|
||||
)
|
||||
|
||||
|
||||
def has_wheel() -> bool:
|
||||
|
@ -422,6 +447,14 @@ def _format_label_scheme(data: Dict[str, Any]) -> str:
|
|||
return md.text
|
||||
|
||||
|
||||
def _is_permitted_package_name(package_name: str) -> bool:
|
||||
# regex from: https://www.python.org/dev/peps/pep-0426/#name
|
||||
permitted_match = re.search(
|
||||
r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$", package_name, re.IGNORECASE
|
||||
)
|
||||
return permitted_match is not None
|
||||
|
||||
|
||||
TEMPLATE_SETUP = """
|
||||
#!/usr/bin/env python
|
||||
import io
|
||||
|
|
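Reviewer note on the `spacy package` changes above: a model name now has to pass two independent checks, and runs of underscores are squashed when locating the built wheel. The sketch below reproduces those three steps with the standard library only. The regular expression and the `re.sub` call are copied from the diff; the helper names `check_model_name` and `squash_underscores` are hypothetical.

```python
import re
from typing import List


def is_permitted_package_name(package_name: str) -> bool:
    # Pattern from the diff (PEP 426): letters, digits, ".", "_" and "-",
    # starting and ending with a letter or digit.
    match = re.search(
        r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$", package_name, re.IGNORECASE
    )
    return match is not None


def check_model_name(name: str) -> List[str]:
    """Return the list of problems found; an empty list means the name is fine."""
    problems = []
    if not name.isidentifier():
        problems.append("not a valid module name")
    if not is_permitted_package_name(name):
        problems.append("not a permitted package name")
    return problems


def squash_underscores(name: str) -> str:
    # Runs of underscores are not significant in installed package names.
    return re.sub("_+", "_", name)


ok = check_model_name("my_model")
unicode_name = check_model_name("Meine_Bäume")
hyphenated = check_model_name("my-model")
squashed = squash_underscores("en__core___web")
```

The two checks catch different names: `Meine_Bäume` is a valid Python identifier but not a permitted package name, while `my-model` is a permitted package name but cannot be imported as a module.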
|
@ -483,7 +483,7 @@ class Errors(metaclass=ErrorsWithCodes):
|
|||
"components, since spans are only views of the Doc. Use Doc and "
|
||||
"Token attributes (or custom extension attributes) only and remove "
|
||||
"the following: {attrs}")
|
||||
E181 = ("Received invalid attributes for unkown object {obj}: {attrs}. "
|
||||
E181 = ("Received invalid attributes for unknown object {obj}: {attrs}. "
|
||||
"Only Doc and Token attributes are supported.")
|
||||
E182 = ("Received invalid attribute declaration: {attr}\nDid you forget "
|
||||
"to define the attribute? For example: `{attr}.???`")
|
||||
|
|
|
@ -310,7 +310,6 @@ GLOSSARY = {
|
|||
"re": "repeated element",
|
||||
"rs": "reported speech",
|
||||
"sb": "subject",
|
||||
"sb": "subject",
|
||||
"sbp": "passivized subject (PP)",
|
||||
"sp": "subject or predicate",
|
||||
"svp": "separable verb prefix",
|
||||
|
|
|
@ -131,7 +131,7 @@ class Language:
|
|||
self,
|
||||
vocab: Union[Vocab, bool] = True,
|
||||
*,
|
||||
max_length: int = 10 ** 6,
|
||||
max_length: int = 10**6,
|
||||
meta: Dict[str, Any] = {},
|
||||
create_tokenizer: Optional[Callable[["Language"], Callable[[str], Doc]]] = None,
|
||||
batch_size: int = 1000,
|
||||
|
|
|
@ -14,7 +14,7 @@ class PhraseMatcher:
|
|||
def add(
|
||||
self,
|
||||
key: str,
|
||||
docs: List[List[Dict[str, Any]]],
|
||||
docs: List[Doc],
|
||||
*,
|
||||
on_match: Optional[
|
||||
Callable[[Matcher, Doc, int, List[Tuple[Any, ...]]], Any]
|
||||
|
|
|
@ -85,7 +85,7 @@ def get_characters_loss(ops, docs, prediction, nr_char):
|
|||
target = ops.asarray(to_categorical(target_ids, n_classes=256), dtype="f")
|
||||
target = target.reshape((-1, 256 * nr_char))
|
||||
diff = prediction - target
|
||||
loss = (diff ** 2).sum()
|
||||
loss = (diff**2).sum()
|
||||
d_target = diff / float(prediction.shape[0])
|
||||
return loss, d_target
|
||||
|
||||
|
|
|
@ -378,7 +378,7 @@ class SpanCategorizer(TrainablePipe):
|
|||
# If the prediction is 0.9 and it's false, the gradient will be
|
||||
# 0.9 (0.9 - 0.0)
|
||||
d_scores = scores - target
|
||||
loss = float((d_scores ** 2).sum())
|
||||
loss = float((d_scores**2).sum())
|
||||
return loss, d_scores
|
||||
|
||||
def initialize(
|
||||
|
|
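Reviewer note on the loss hunks above: the `black` reformatting of `d_scores**2` does not change behaviour. For reference, the comment in `SpanCategorizer` ("if the prediction is 0.9 and it's false, the gradient will be 0.9") can be checked with a small pure-Python sketch. It uses nested lists in place of the real arrays, and the function name `squared_error` is hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def squared_error(scores: Matrix, target: Matrix) -> Tuple[float, Matrix]:
    """Return the summed squared error and the per-score gradient (scores - target)."""
    d_scores = [
        [s - t for s, t in zip(score_row, target_row)]
        for score_row, target_row in zip(scores, target)
    ]
    loss = float(sum(d**2 for row in d_scores for d in row))
    return loss, d_scores


# Two spans, two labels. The first score (0.9) is a false positive.
scores = [[0.9, 0.25], [0.5, 1.0]]
target = [[0.0, 0.0], [1.0, 1.0]]
loss, d_scores = squared_error(scores, target)
```

Here `d_scores[0][0]` is `0.9 - 0.0`, matching the comment, and a correct prediction such as `scores[1][1]` contributes a gradient of zero.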
|
@ -288,7 +288,7 @@ class TextCategorizer(TrainablePipe):
|
|||
bp_scores(gradient)
|
||||
if sgd is not None:
|
||||
self.finish_update(sgd)
|
||||
losses[self.name] += (gradient ** 2).sum()
|
||||
losses[self.name] += (gradient**2).sum()
|
||||
return losses
|
||||
|
||||
def _examples_to_truth(
|
||||
|
@ -322,7 +322,7 @@ class TextCategorizer(TrainablePipe):
|
|||
not_missing = self.model.ops.asarray(not_missing) # type: ignore
|
||||
d_scores = (scores - truths)
|
||||
d_scores *= not_missing
|
||||
mean_square_error = (d_scores ** 2).mean()
|
||||
mean_square_error = (d_scores**2).mean()
|
||||
return float(mean_square_error), d_scores
|
||||
|
||||
def add_label(self, label: str) -> int:
|
||||
|
|
|
@ -684,6 +684,7 @@ def test_has_annotation(en_vocab):
|
|||
attrs = ("TAG", "POS", "MORPH", "LEMMA", "DEP", "HEAD", "ENT_IOB", "ENT_TYPE")
|
||||
for attr in attrs:
|
||||
assert not doc.has_annotation(attr)
|
||||
assert not doc.has_annotation(attr, require_complete=True)
|
||||
|
||||
doc[0].tag_ = "A"
|
||||
doc[0].pos_ = "X"
|
||||
|
@ -709,6 +710,27 @@ def test_has_annotation(en_vocab):
|
|||
assert doc.has_annotation(attr, require_complete=True)
|
||||
|
||||
|
||||
def test_has_annotation_sents(en_vocab):
|
||||
doc = Doc(en_vocab, words=["Hello", "beautiful", "world"])
|
||||
attrs = ("SENT_START", "IS_SENT_START", "IS_SENT_END")
|
||||
for attr in attrs:
|
||||
assert not doc.has_annotation(attr)
|
||||
assert not doc.has_annotation(attr, require_complete=True)
|
||||
|
||||
# The first token (index 0) is always assumed to be a sentence start,
|
||||
# and ignored by the check in doc.has_annotation
|
||||
|
||||
doc[1].is_sent_start = False
|
||||
for attr in attrs:
|
||||
assert doc.has_annotation(attr)
|
||||
assert not doc.has_annotation(attr, require_complete=True)
|
||||
|
||||
doc[2].is_sent_start = False
|
||||
for attr in attrs:
|
||||
assert doc.has_annotation(attr)
|
||||
assert doc.has_annotation(attr, require_complete=True)
|
||||
|
||||
|
||||
def test_is_flags_deprecated(en_tokenizer):
|
||||
doc = en_tokenizer("test")
|
||||
with pytest.deprecated_call():
|
||||
|
|
|
@ -12,6 +12,7 @@ def test_build_dependencies():
|
|||
"flake8",
|
||||
"hypothesis",
|
||||
"pre-commit",
|
||||
"black",
|
||||
"mypy",
|
||||
"types-dataclasses",
|
||||
"types-mock",
|
||||
|
|
|
@ -12,16 +12,18 @@ from spacy.cli._util import is_subpath_of, load_project_config
|
|||
from spacy.cli._util import parse_config_overrides, string_to_list
|
||||
from spacy.cli._util import substitute_project_variables
|
||||
from spacy.cli._util import validate_project_commands
|
||||
from spacy.cli.debug_data import _get_labels_from_model
|
||||
from spacy.cli.debug_data import _compile_gold, _get_labels_from_model
|
||||
from spacy.cli.debug_data import _get_labels_from_spancat
|
||||
from spacy.cli.download import get_compatibility, get_version
|
||||
from spacy.cli.init_config import RECOMMENDATIONS, init_config, fill_config
|
||||
from spacy.cli.package import get_third_party_dependencies
|
||||
from spacy.cli.package import _is_permitted_package_name
|
||||
from spacy.cli.validate import get_model_pkgs
|
||||
from spacy.lang.en import English
|
||||
from spacy.lang.nl import Dutch
|
||||
from spacy.language import Language
|
||||
from spacy.schemas import ProjectConfigSchema, RecommendationSchema, validate
|
||||
from spacy.tokens import Doc
|
||||
from spacy.training import Example, docs_to_json, offsets_to_biluo_tags
|
||||
from spacy.training.converters import conll_ner_to_docs, conllu_to_docs
|
||||
from spacy.training.converters import iob_to_docs
|
||||
|
@ -692,3 +694,39 @@ def test_get_labels_from_model(factory_name, pipe_name):
|
|||
assert _get_labels_from_spancat(nlp)[pipe.key] == set(labels)
|
||||
else:
|
||||
assert _get_labels_from_model(nlp, factory_name) == set(labels)
|
||||
|
||||
|
||||
def test_permitted_package_names():
|
||||
# https://www.python.org/dev/peps/pep-0426/#name
|
||||
assert _is_permitted_package_name("Meine_Bäume") == False
|
||||
assert _is_permitted_package_name("_package") == False
|
||||
assert _is_permitted_package_name("package_") == False
|
||||
assert _is_permitted_package_name(".package") == False
|
||||
assert _is_permitted_package_name("package.") == False
|
||||
assert _is_permitted_package_name("-package") == False
|
||||
assert _is_permitted_package_name("package-") == False
|
||||
|
||||
|
||||
def test_debug_data_compile_gold():
|
||||
nlp = English()
|
||||
pred = Doc(nlp.vocab, words=["Token", ".", "New", "York", "City"])
|
||||
ref = Doc(
|
||||
nlp.vocab,
|
||||
words=["Token", ".", "New York City"],
|
||||
sent_starts=[True, False, True],
|
||||
ents=["O", "O", "B-ENT"],
|
||||
)
|
||||
eg = Example(pred, ref)
|
||||
data = _compile_gold([eg], ["ner"], nlp, True)
|
||||
assert data["boundary_cross_ents"] == 0
|
||||
|
||||
pred = Doc(nlp.vocab, words=["Token", ".", "New", "York", "City"])
|
||||
ref = Doc(
|
||||
nlp.vocab,
|
||||
words=["Token", ".", "New York City"],
|
||||
sent_starts=[True, False, True],
|
||||
ents=["O", "B-ENT", "I-ENT"],
|
||||
)
|
||||
eg = Example(pred, ref)
|
||||
data = _compile_gold([eg], ["ner"], nlp, True)
|
||||
assert data["boundary_cross_ents"] == 1
|
||||
|
|
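Reviewer note on `test_debug_data_compile_gold` above: the fix in `_compile_gold` compares each aligned BILUO tag against the aligned sentence starts, so an entity counts as crossing a boundary when an `I-` or `L-` tag sits on a token that starts a sentence. The following is a minimal sketch of that rule on plain lists. The tag and sentence-start sequences are hand-written stand-ins for aligned values, not output captured from spaCy, and `count_boundary_cross_ents` is a hypothetical name.

```python
from typing import List, Optional


def count_boundary_cross_ents(
    labels: List[Optional[str]], sent_starts: List[Optional[bool]]
) -> int:
    """Count tokens that continue an entity (I-/L-) while starting a sentence."""
    count = 0
    for label, starts_sentence in zip(labels, sent_starts):
        if label is None:
            # Misaligned tokens carry no tag and are skipped.
            continue
        if starts_sentence is True and label.startswith(("I-", "L-")):
            count += 1
    return count


# Hand-written stand-ins: a sentence starts at "Token" and at "New".
sent_starts = [True, False, True, False, False]
entity_inside_sentence = ["O", "O", "B-ENT", "I-ENT", "L-ENT"]
entity_across_boundary = ["O", "B-ENT", "I-ENT", "I-ENT", "L-ENT"]

clean = count_boundary_cross_ents(entity_inside_sentence, sent_starts)
crossing = count_boundary_cross_ents(entity_across_boundary, sent_starts)
```

The two cases mirror the two assertions in the test: an entity that begins at a sentence start is fine, while one that continues across it is counted once.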
|
@ -420,6 +420,8 @@ cdef class Doc:
|
|||
cdef int range_start = 0
|
||||
if attr == "IS_SENT_START" or attr == self.vocab.strings["IS_SENT_START"]:
|
||||
attr = SENT_START
|
||||
elif attr == "IS_SENT_END" or attr == self.vocab.strings["IS_SENT_END"]:
|
||||
attr = SENT_START
|
||||
attr = intify_attr(attr)
|
||||
# adjust attributes
|
||||
if attr == HEAD:
|
||||
|
|
|
@ -487,8 +487,6 @@ cdef class Token:
|
|||
|
||||
RETURNS (bool / None): Whether the token starts a sentence.
|
||||
None if unknown.
|
||||
|
||||
DOCS: https://spacy.io/api/token#is_sent_start
|
||||
"""
|
||||
def __get__(self):
|
||||
if self.c.sent_start == 0:
|
||||
|
|
|
@ -871,7 +871,6 @@ def get_package_path(name: str) -> Path:
|
|||
name (str): Package name.
|
||||
RETURNS (Path): Path to installed package.
|
||||
"""
|
||||
name = name.lower() # use lowercase version to be safe
|
||||
# Here we're importing the module just to find it. This is worryingly
|
||||
# indirect, but it's otherwise very difficult to find the package.
|
||||
pkg = importlib.import_module(name)
|
||||
|
|
|
@ -79,6 +79,7 @@ train/test skew.
|
|||
| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ |
|
||||
| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ |
|
||||
| `augmenter` | Optional data augmentation callback. ~~Callable[[Language, Example], Iterable[Example]]~~ |
|
||||
| `shuffle` | Whether to shuffle the examples. Defaults to `False`. ~~bool~~ |
|
||||
|
||||
## Corpus.\_\_call\_\_ {#call tag="method"}
|
||||
|
||||
|
|
|
@ -304,7 +304,7 @@ ancestor is found, e.g. if span excludes a necessary ancestor.
|
|||
|
||||
## Doc.has_annotation {#has_annotation tag="method"}
|
||||
|
||||
Check whether the doc contains annotation on a token attribute.
|
||||
Check whether the doc contains annotation on a [`Token` attribute](/api/token#attributes).
|
||||
|
||||
<Infobox title="Changed in v3.0" variant="warning">
|
||||
|
||||
|
|
|
@ -349,23 +349,6 @@ A sequence containing the token and all the token's syntactic descendants.
|
|||
| ---------- | ------------------------------------------------------------------------------------ |
|
||||
| **YIELDS** | A descendant token such that `self.is_ancestor(token)` or `token == self`. ~~Token~~ |
|
||||
|
||||
## Token.is_sent_start {#is_sent_start tag="property" new="2"}
|
||||
|
||||
A boolean value indicating whether the token starts a sentence. `None` if
|
||||
unknown. Defaults to `True` for the first token in the `Doc`.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> doc = nlp("Give it back! He pleaded.")
|
||||
> assert doc[4].is_sent_start
|
||||
> assert not doc[5].is_sent_start
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ----------- | ------------------------------------------------------- |
|
||||
| **RETURNS** | Whether the token starts a sentence. ~~Optional[bool]~~ |
|
||||
|
||||
## Token.has_vector {#has_vector tag="property" model="vectors"}
|
||||
|
||||
A boolean value indicating whether a word vector is associated with the token.
|
||||
|
@ -465,6 +448,8 @@ The L2 norm of the token's vector representation.
|
|||
| `is_punct` | Is the token punctuation? ~~bool~~ |
|
||||
| `is_left_punct` | Is the token a left punctuation mark, e.g. `"("` ? ~~bool~~ |
|
||||
| `is_right_punct` | Is the token a right punctuation mark, e.g. `")"` ? ~~bool~~ |
|
||||
| `is_sent_start` | Does the token start a sentence? ~~bool~~ or `None` if unknown. Defaults to `True` for the first token in the `Doc`. |
|
||||
| `is_sent_end` | Does the token end a sentence? ~~bool~~ or `None` if unknown. |
|
||||
| `is_space` | Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`. ~~bool~~ |
|
||||
| `is_bracket` | Is the token a bracket? ~~bool~~ |
|
||||
| `is_quote` | Is the token a quotation mark? ~~bool~~ |
|
||||
|
|
BIN
website/docs/images/spacy-tailored-pipelines_wide.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 44 KiB |
|
@ -213,6 +213,12 @@ format, train a pipeline, evaluate it and export metrics, package it and spin up
|
|||
a quick web demo. It looks pretty similar to a config file used to define CI
|
||||
pipelines.
|
||||
|
||||
> #### Tip: Multi-line YAML syntax for long values
|
||||
>
|
||||
> YAML has [multi-line syntax](https://yaml-multiline.info/) that can be
|
||||
> helpful for readability with longer values such as project descriptions or
|
||||
> commands that take several arguments.
|
||||
|
||||
```yaml
|
||||
%%GITHUB_PROJECTS/pipelines/tagger_parser_ud/project.yml
|
||||
```
|
||||
|
|
|
@ -40,7 +40,11 @@
|
|||
"label": "Resources",
|
||||
"items": [
|
||||
{ "text": "Project Templates", "url": "https://github.com/explosion/projects" },
|
||||
{ "text": "v2.x Documentation", "url": "https://v2.spacy.io" }
|
||||
{ "text": "v2.x Documentation", "url": "https://v2.spacy.io" },
|
||||
{
|
||||
"text": "Custom Solutions",
|
||||
"url": "https://explosion.ai/spacy-tailored-pipelines"
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
|
|
|
@ -48,7 +48,11 @@
|
|||
{ "text": "Usage", "url": "/usage" },
|
||||
{ "text": "Models", "url": "/models" },
|
||||
{ "text": "API Reference", "url": "/api" },
|
||||
{ "text": "Online Course", "url": "https://course.spacy.io" }
|
||||
{ "text": "Online Course", "url": "https://course.spacy.io" },
|
||||
{
|
||||
"text": "Custom Solutions",
|
||||
"url": "https://explosion.ai/spacy-tailored-pipelines"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
|
|
|
@ -953,6 +953,37 @@
|
|||
"category": ["pipeline"],
|
||||
"tags": ["lemmatizer", "danish"]
|
||||
},
|
||||
{
|
||||
"id": "augmenty",
|
||||
"title": "Augmenty",
|
||||
"slogan": "The cherry on top of your NLP pipeline",
|
||||
"description": "Augmenty is an augmentation library based on spaCy for augmenting texts. Augmenty differs from other augmentation libraries in that it corrects (as far as possible) the token, sentence and document labels under the augmentation.",
|
||||
"github": "kennethenevoldsen/augmenty",
|
||||
"pip": "augmenty",
|
||||
"code_example": [
|
||||
"import spacy",
|
||||
"import augmenty",
|
||||
"",
|
||||
"nlp = spacy.load('en_core_web_md')",
|
||||
"",
|
||||
"docs = nlp.pipe(['Augmenty is a great tool for text augmentation'])",
|
||||
"",
|
||||
"ent_dict = {'ORG': [['spaCy'], ['spaCy', 'Universe']]}",
|
||||
"entity_augmenter = augmenty.load('ents_replace.v1',",
|
||||
" ent_dict = ent_dict, level=1)",
|
||||
"",
|
||||
"for doc in augmenty.docs(docs, augmenter=entity_augmenter, nlp=nlp):",
|
||||
" print(doc)"
|
||||
],
|
||||
"thumb": "https://github.com/KennethEnevoldsen/augmenty/blob/master/img/icon.png?raw=true",
|
||||
"author": "Kenneth Enevoldsen",
|
||||
"author_links": {
|
||||
"github": "kennethenevoldsen",
|
||||
"website": "https://www.kennethenevoldsen.com"
|
||||
},
|
||||
"category": ["training", "research"],
|
||||
"tags": ["training", "research", "augmentation"]
|
||||
},
|
||||
{
|
||||
"id": "dacy",
|
||||
"title": "DaCy",
|
||||
|
@ -3738,6 +3769,65 @@
|
|||
},
|
||||
"category": ["pipeline"],
|
||||
"tags": ["pipeline", "nlp", "sentiment"]
|
||||
},
|
||||
{
|
||||
"id": "textnets",
|
||||
"slogan": "Text analysis with networks",
|
||||
"description": "textnets represents collections of texts as networks of documents and words. This provides novel possibilities for the visualization and analysis of texts.",
|
||||
"github": "jboynyc/textnets",
|
||||
"image": "https://user-images.githubusercontent.com/2187261/152641425-6c0fb41c-b8e0-44fb-a52a-7c1ba24eba1e.png",
|
||||
"code_example": [
|
||||
"import textnets as tn",
|
||||
"",
|
||||
"corpus = tn.Corpus(tn.examples.moon_landing)",
|
||||
"t = tn.Textnet(corpus.tokenized(), min_docs=1)",
|
||||
"t.plot(label_nodes=True,",
|
||||
" show_clusters=True,",
|
||||
" scale_nodes_by=\"birank\",",
|
||||
" scale_edges_by=\"weight\")"
|
||||
],
|
||||
"author": "John Boy",
|
||||
"author_links": {
|
||||
"github": "jboynyc",
|
||||
"twitter": "jboy"
|
||||
},
|
||||
"category": ["visualizers", "standalone"]
|
||||
},
|
||||
+        {
+            "id": "tmtoolkit",
+            "slogan": "Text mining and topic modeling toolkit",
+            "description": "tmtoolkit is a set of tools for text mining and topic modeling with Python developed especially for the use in the social sciences, in journalism or related disciplines. It aims for easy installation, extensive documentation and a clear programming interface while offering good performance on large datasets by the means of vectorized operations (via NumPy) and parallel computation (using Python’s multiprocessing module and the loky package).",
+            "github": "WZBSocialScienceCenter/tmtoolkit",
+            "code_example": [
+                "# Note: This requires these setup steps:",
+                "# pip install tmtoolkit[recommended]",
+                "# python -m tmtoolkit setup en",
+                "from tmtoolkit.corpus import Corpus, tokens_table, lemmatize, to_lowercase, dtm",
+                "from tmtoolkit.bow.bow_stats import tfidf, sorted_terms_table",
+                "# load built-in sample dataset and use 4 worker processes",
+                "corp = Corpus.from_builtin_corpus('en-News100', max_workers=4)",
+                "# investigate corpus as dataframe",
+                "toktbl = tokens_table(corp)",
+                "print(toktbl)",
+                "# apply some text normalization",
+                "lemmatize(corp)",
+                "to_lowercase(corp)",
+                "# build sparse document-token matrix (DTM)",
+                "# document labels identify rows, vocabulary tokens identify columns",
+                "mat, doc_labels, vocab = dtm(corp, return_doc_labels=True, return_vocab=True)",
+                "# apply tf-idf transformation to DTM",
+                "# operation is applied on sparse matrix and uses little memory",
+                "tfidf_mat = tfidf(mat)",
+                "# show top 5 tokens per document ranked by tf-idf",
+                "top_tokens = sorted_terms_table(tfidf_mat, vocab, doc_labels, top_n=5)",
+                "print(top_tokens)"
+            ],
+            "author": "Markus Konrad / WZB Social Science Center",
+            "author_links": {
+                "github": "internaut",
+                "twitter": "_knrd"
+            },
+            "category": ["scientific", "standalone"]
         }
     ],

@ -6,11 +6,14 @@ import { replaceEmoji } from './icon'

 export const Ol = props => <ol className={classes.ol} {...props} />
 export const Ul = props => <ul className={classes.ul} {...props} />
-export const Li = ({ children, ...props }) => {
+export const Li = ({ children, emoji, ...props }) => {
     const { hasIcon, content } = replaceEmoji(children)
-    const liClassNames = classNames(classes.li, { [classes.liIcon]: hasIcon })
+    const liClassNames = classNames(classes.li, {
+        [classes.liIcon]: hasIcon,
+        [classes.emoji]: emoji,
+    })
     return (
-        <li className={liClassNames} {...props}>
+        <li data-emoji={emoji} className={liClassNames} {...props}>
             {content}
         </li>
     )

@ -36,6 +36,16 @@
     box-sizing: content-box
     vertical-align: top

+.emoji:before
+    content: attr(data-emoji)
+    padding-right: 0.75em
+    padding-top: 0
+    margin-left: -2.5em
+    width: 1.75em
+    text-align: right
+    font-size: 1em
+    position: static
+
 .li-icon
     text-indent: calc(-20px - 0.55em)

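Read together, the `Li` change and the `.emoji:before` rule above suggest this flow: the `emoji` prop switches on an extra class and is copied into a `data-emoji` attribute, which the stylesheet then renders through `attr(data-emoji)`. The following is a minimal sketch of that attribute logic in plain JavaScript, without React. The `classNames` function is a simplified stand-in for the real `classnames` package, and the `classes` mapping and the `liAttributes` helper are hypothetical names used only for illustration.

```javascript
// Simplified stand-in for the `classnames` package (assumption): joins plain
// strings and the keys of objects whose values are truthy.
function classNames(...args) {
    const names = []
    for (const arg of args) {
        if (!arg) continue
        if (typeof arg === 'string') {
            names.push(arg)
        } else {
            for (const [name, enabled] of Object.entries(arg)) {
                if (enabled) names.push(name)
            }
        }
    }
    return names.join(' ')
}

// Assumed CSS-module mapping; the generated class names are not shown in the diff.
const classes = { li: 'li', liIcon: 'li-icon', emoji: 'emoji' }

// Hypothetical helper mirroring the attributes the updated `Li` puts on <li>.
function liAttributes({ emoji, hasIcon = false } = {}) {
    return {
        'data-emoji': emoji,
        className: classNames(classes.li, {
            [classes.liIcon]: hasIcon,
            [classes.emoji]: emoji,
        }),
    }
}

console.log(liAttributes({ emoji: '🔥' }))
console.log(liAttributes())
```

Without an `emoji` prop the helper adds no `emoji` class and leaves `data-emoji` undefined, which would leave existing list items without the new class or attribute value.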
@ -15,9 +15,9 @@ import {
 } from '../components/landing'
 import { H2 } from '../components/typography'
 import { InlineCode } from '../components/code'
+import { Ul, Li } from '../components/list'
 import Button from '../components/button'
 import Link from '../components/link'
-import { YouTube } from '../components/embed'

 import QuickstartTraining from './quickstart-training'
 import Project from './project'
@ -25,6 +25,7 @@ import Features from './features'
 import courseImage from '../../docs/images/course.jpg'
 import prodigyImage from '../../docs/images/prodigy_overview.jpg'
 import projectsImage from '../../docs/images/projects.png'
+import tailoredPipelinesImage from '../../docs/images/spacy-tailored-pipelines_wide.png'

 import Benchmarks from 'usage/_benchmarks-models.md'

@ -104,23 +105,45 @@ const Landing = ({ data }) => {

             <LandingBannerGrid>
                 <LandingBanner
-                    label="New in v3.0"
-                    title="Transformer-based pipelines, new training system, project templates & more"
-                    to="/usage/v3"
-                    button="See what's new"
+                    to="https://explosion.ai/spacy-tailored-pipelines"
+                    button="Learn more"
+                    background="#E4F4F9"
+                    color="#1e1935"
                     small
                 >
-                    spaCy v3.0 features all new <strong>transformer-based pipelines</strong> that
-                    bring spaCy's accuracy right up to the current <strong>state-of-the-art</strong>
-                    . You can use any pretrained transformer to train your own pipelines, and even
-                    share one transformer between multiple components with{' '}
-                    <strong>multi-task learning</strong>. Training is now fully configurable and
-                    extensible, and you can define your own custom models using{' '}
-                    <strong>PyTorch</strong>, <strong>TensorFlow</strong> and other frameworks. The
-                    new spaCy projects system lets you describe whole{' '}
-                    <strong>end-to-end workflows</strong> in a single file, giving you an easy path
-                    from prototype to production, and making it easy to clone and adapt
-                    best-practice projects for your own use cases.
+                    <Link to="https://explosion.ai/spacy-tailored-pipelines" hidden>
+                        <img src={tailoredPipelinesImage} alt="spaCy Tailored Pipelines" />
+                    </Link>
+                    <strong>
+                        Get a custom spaCy pipeline, tailor-made for your NLP problem by spaCy's
+                        core developers.
+                    </strong>
+                    <br />
+                    <br />
+                    <Ul>
+                        <Li emoji="🔥">
+                            <strong>Streamlined.</strong> Nobody knows spaCy better than we do. Send
+                            us your pipeline requirements and we'll be ready to start producing your
+                            solution in no time at all.
+                        </Li>
+                        <Li emoji="🐿 ">
+                            <strong>Production ready.</strong> spaCy pipelines are robust and easy
+                            to deploy. You'll get a complete spaCy project folder which is ready to{' '}
+                            <InlineCode>spacy project run</InlineCode>.
+                        </Li>
+                        <Li emoji="🔮">
+                            <strong>Predictable.</strong> You'll know exactly what you're going to
+                            get and what it's going to cost. We quote fees up-front, let you try
+                            before you buy, and don't charge for over-runs at our end — all the risk
+                            is on us.
+                        </Li>
+                        <Li emoji="🛠">
+                            <strong>Maintainable.</strong> spaCy is an industry standard, and we'll
+                            deliver your pipeline with full code, data, tests and documentation, so
+                            your team can retrain, update and extend the solution as your
+                            requirements change.
+                        </Li>
+                    </Ul>
                 </LandingBanner>

                 <LandingBanner
@ -206,8 +229,20 @@ const Landing = ({ data }) => {
             </LandingGrid>

             <LandingBannerGrid>
-                <LandingBanner background="#0099dd" color="#ffffff" small>
-                    <YouTube id="9k_EfV7Cns0" />
+                <LandingBanner
+                    label="New in v3.0"
+                    title="Transformer-based pipelines, new training system, project templates & more"
+                    to="/usage/v3"
+                    button="See what's new"
+                    small
+                >
+                    spaCy v3.0 features all new <strong>transformer-based pipelines</strong> that
+                    bring spaCy's accuracy right up to the current <strong>state-of-the-art</strong>
+                    . You can use any pretrained transformer to train your own pipelines, and even
+                    share one transformer between multiple components with{' '}
+                    <strong>multi-task learning</strong>. Training is now fully configurable and
+                    extensible, and you can define your own custom models using{' '}
+                    <strong>PyTorch</strong>, <strong>TensorFlow</strong> and other frameworks.
                 </LandingBanner>
                 <LandingBanner
                     to="https://course.spacy.io"