Merge branch 'master' into spacy.io

Ines Montani 2021-03-19 12:09:03 +11:00
commit 6db5414668
21 changed files with 254 additions and 58 deletions

View File

@@ -3,6 +3,7 @@ recursive-include spacy *.pyx *.pxd *.txt *.cfg *.jinja
 include LICENSE
 include README.md
 include pyproject.toml
+include spacy/py.typed
 recursive-exclude spacy/lang *.json
 recursive-include spacy/lang *.json.gz
 recursive-include spacy/cli *.json *.yml

examples/README.md (new file, 130 lines)
View File

@@ -0,0 +1,130 @@
<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>
# spaCy examples
For spaCy v3 we've converted many of the [v2 example
scripts](https://github.com/explosion/spaCy/tree/v2.3.x/examples/) into
end-to-end [spaCy projects](https://spacy.io/usage/projects) workflows. The
workflows include all the steps to go from data to packaged spaCy models.
## 🪐 Pipeline component demos
The simplest demos for training a single pipeline component are in the
[`pipelines`](https://github.com/explosion/projects/blob/v3/pipelines) category,
including:
- [`pipelines/ner_demo`](https://github.com/explosion/projects/blob/v3/pipelines/ner_demo):
Train a named entity recognizer
- [`pipelines/textcat_demo`](https://github.com/explosion/projects/blob/v3/pipelines/textcat_demo):
Train a text classifier
- [`pipelines/parser_intent_demo`](https://github.com/explosion/projects/blob/v3/pipelines/parser_intent_demo):
Train a dependency parser for custom semantics
## 🪐 Tutorials
The [`tutorials`](https://github.com/explosion/projects/blob/v3/tutorials)
category includes examples that work through specific NLP use cases end-to-end:
- [`tutorials/textcat_goemotions`](https://github.com/explosion/projects/blob/v3/tutorials/textcat_goemotions):
Train a text classifier to categorize emotions in Reddit posts
- [`tutorials/nel_emerson`](https://github.com/explosion/projects/blob/v3/tutorials/nel_emerson):
Use an entity linker to disambiguate mentions of the same name
Check out the [projects documentation](https://spacy.io/usage/projects) and
browse through the [available
projects](https://github.com/explosion/projects/)!
## 🚀 Get started with a demo project
The
[`pipelines/ner_demo`](https://github.com/explosion/projects/blob/v3/pipelines/ner_demo)
project converts the spaCy v2
[`train_ner.py`](https://github.com/explosion/spaCy/blob/v2.3.x/examples/training/train_ner.py)
demo script into a spaCy v3 project.
1. Clone the project:
```bash
python -m spacy project clone pipelines/ner_demo
```
2. Install requirements and download any data assets:
```bash
cd ner_demo
python -m pip install -r requirements.txt
python -m spacy project assets
```
3. Run the default workflow to convert, train and evaluate:
```bash
python -m spacy project run all
```
Sample output:
```none
Running workflow 'all'
================================== convert ==================================
Running command: /home/user/venv/bin/python scripts/convert.py en assets/train.json corpus/train.spacy
Running command: /home/user/venv/bin/python scripts/convert.py en assets/dev.json corpus/dev.spacy
=============================== create-config ===============================
Running command: /home/user/venv/bin/python -m spacy init config --lang en --pipeline ner configs/config.cfg --force
Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
configs/config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
=================================== train ===================================
Running command: /home/user/venv/bin/python -m spacy train configs/config.cfg --output training/ --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy --training.eval_frequency 10 --training.max_steps 100 --gpu-id -1
Using CPU
=========================== Initializing pipeline ===========================
[2021-03-11 19:34:59,101] [INFO] Set up nlp object from config
[2021-03-11 19:34:59,109] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-03-11 19:34:59,113] [INFO] Created vocabulary
[2021-03-11 19:34:59,113] [INFO] Finished initializing nlp object
[2021-03-11 19:34:59,265] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
============================= Training pipeline =============================
Pipeline: ['tok2vec', 'ner']
Initial learn rate: 0.001
  E    #  LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R   SCORE
---  ---  ------------  --------  ------  ------  ------  ------
  0    0          0.00      7.90    0.00    0.00    0.00    0.00
 10   10          0.11     71.07    0.00    0.00    0.00    0.00
 20   20          0.65     22.44   50.00   50.00   50.00    0.50
 30   30          0.22      6.38   80.00   66.67  100.00    0.80
 40   40          0.00      0.00   80.00   66.67  100.00    0.80
 50   50          0.00      0.00   80.00   66.67  100.00    0.80
 60   60          0.00      0.00  100.00  100.00  100.00    1.00
 70   70          0.00      0.00  100.00  100.00  100.00    1.00
 80   80          0.00      0.00  100.00  100.00  100.00    1.00
 90   90          0.00      0.00  100.00  100.00  100.00    1.00
100  100          0.00      0.00  100.00  100.00  100.00    1.00
✔ Saved pipeline to output directory
training/model-last
```
4. Package the model:
```bash
python -m spacy project run package
```
5. Visualize the model's output with [Streamlit](https://streamlit.io):
```bash
python -m spacy project run visualize-model
```
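After training you can also load the saved pipeline directly from the output
directory and inspect its predictions. A minimal sketch, assuming you're still
in the `ner_demo` directory and using the toy example data from the original
`train_ner.py` demo:
```python
import spacy

# Load the pipeline written by the "train" step (see the sample output above)
nlp = spacy.load("training/model-last")

doc = nlp("I like London and Berlin.")
for ent in doc.ents:
    # The labels you see depend on the demo's toy training data
    print(ent.text, ent.label_)
```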

View File

@@ -0,0 +1,5 @@
<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>
# spaCy examples
See [examples/README.md](../README.md)

View File

@@ -1,7 +1,25 @@
 ## Examples of NER/IOB data that can be converted with `spacy convert`

-spacy JSON training files were generated with:
+To convert an IOB file to `.spacy` ([`DocBin`](https://spacy.io/api/docbin))
+for spaCy v3:
+
+```bash
+python -m spacy convert -c iob -s -n 10 -b en_core_web_sm file.iob .
+```
+
+See all the `spacy convert` options: https://spacy.io/api/cli#convert
+
+---
+
+The spaCy v2 JSON training files were generated using **spaCy v2** with:

 ```bash
 python -m spacy convert -c iob -s -n 10 -b en file.iob
 ```
+
+To convert an existing JSON training file to `.spacy` for spaCy v3, convert
+with **spaCy v3**:
+
+```bash
+python -m spacy convert file.json .
+```

View File

@@ -5,7 +5,7 @@ requires = [
     "cymem>=2.0.2,<2.1.0",
     "preshed>=3.0.2,<3.1.0",
     "murmurhash>=0.28.0,<1.1.0",
-    "thinc>=8.0.0,<8.1.0",
+    "thinc>=8.0.2,<8.1.0",
     "blis>=0.4.0,<0.8.0",
     "pathy",
     "numpy>=1.15.0",

View File

@@ -2,7 +2,7 @@
 spacy-legacy>=3.0.0,<3.1.0
 cymem>=2.0.2,<2.1.0
 preshed>=3.0.2,<3.1.0
-thinc>=8.0.0,<8.1.0
+thinc>=8.0.2,<8.1.0
 blis>=0.4.0,<0.8.0
 ml_datasets>=0.2.0,<0.3.0
 murmurhash>=0.28.0,<1.1.0

View File

@@ -34,14 +34,14 @@ setup_requires =
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
     murmurhash>=0.28.0,<1.1.0
-    thinc>=8.0.0,<8.1.0
+    thinc>=8.0.2,<8.1.0
 install_requires =
     # Our libraries
     spacy-legacy>=3.0.0,<3.1.0
     murmurhash>=0.28.0,<1.1.0
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
-    thinc>=8.0.0,<8.1.0
+    thinc>=8.0.2,<8.1.0
     blis>=0.4.0,<0.8.0
     wasabi>=0.8.1,<1.1.0
     srsly>=2.4.0,<3.0.0

View File

@@ -28,6 +28,8 @@ if sys.maxunicode == 65535:

 def load(
     name: Union[str, Path],
+    *,
+    vocab: Union[Vocab, bool] = True,
     disable: Iterable[str] = util.SimpleFrozenList(),
     exclude: Iterable[str] = util.SimpleFrozenList(),
     config: Union[Dict[str, Any], Config] = util.SimpleFrozenDict(),
@@ -35,6 +37,7 @@ def load(
     """Load a spaCy model from an installed package or a local path.

     name (str): Package name or model path.
+    vocab (Vocab): A Vocab object. If True, a vocab is created.
     disable (Iterable[str]): Names of pipeline components to disable. Disabled
         pipes will be loaded but they won't be run unless you explicitly
         enable them by calling nlp.enable_pipe.
@@ -44,7 +47,9 @@ def load(
         keyed by section values in dot notation.
     RETURNS (Language): The loaded nlp object.
     """
-    return util.load_model(name, disable=disable, exclude=exclude, config=config)
+    return util.load_model(
+        name, vocab=vocab, disable=disable, exclude=exclude, config=config
+    )


 def blank(
@@ -52,7 +57,7 @@ def blank(
     *,
     vocab: Union[Vocab, bool] = True,
     config: Union[Dict[str, Any], Config] = util.SimpleFrozenDict(),
-    meta: Dict[str, Any] = util.SimpleFrozenDict()
+    meta: Dict[str, Any] = util.SimpleFrozenDict(),
 ) -> Language:
     """Create a blank nlp object for a given language code.

View File

@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "3.0.4"
+__version__ = "3.0.5"
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
 __projects__ = "https://github.com/explosion/projects"

View File

@@ -20,7 +20,7 @@ def debug_config_cli(
     # fmt: off
     ctx: typer.Context,  # This is only used to read additional arguments
     config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
-    code_path: Optional[Path] = Opt(None, "--code-path", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
+    code_path: Optional[Path] = Opt(None, "--code-path", "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
     show_funcs: bool = Opt(False, "--show-functions", "-F", help="Show an overview of all registered functions used in the config and where they come from (modules, files etc.)"),
     show_vars: bool = Opt(False, "--show-variables", "-V", help="Show an overview of all variables referenced in the config and their values. This will also reflect variables overwritten on the CLI.")
     # fmt: on

View File

@@ -39,7 +39,7 @@ def debug_data_cli(
     # fmt: off
     ctx: typer.Context,  # This is only used to read additional arguments
     config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
-    code_path: Optional[Path] = Opt(None, "--code-path", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
+    code_path: Optional[Path] = Opt(None, "--code-path", "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
     ignore_warnings: bool = Opt(False, "--ignore-warnings", "-IW", help="Ignore warnings, only show stats and errors"),
     verbose: bool = Opt(False, "--verbose", "-V", help="Print additional information and explanations"),
     no_format: bool = Opt(False, "--no-format", "-NF", help="Don't pretty-print the results"),

View File

@@ -10,7 +10,8 @@ from jinja2 import Template

 from .. import util
 from ..language import DEFAULT_CONFIG_PRETRAIN_PATH
 from ..schemas import RecommendationSchema
-from ._util import init_cli, Arg, Opt, show_validation_error, COMMAND, string_to_list
+from ._util import init_cli, Arg, Opt, show_validation_error, COMMAND
+from ._util import string_to_list, import_code

 ROOT = Path(__file__).parent / "templates"

@@ -70,7 +71,8 @@ def init_fill_config_cli(
     base_path: Path = Arg(..., help="Base config to fill", exists=True, dir_okay=False),
     output_file: Path = Arg("-", help="File to save config.cfg to (or - for stdout)", allow_dash=True),
     pretraining: bool = Opt(False, "--pretraining", "-pt", help="Include config for pretraining (with 'spacy pretrain')"),
-    diff: bool = Opt(False, "--diff", "-D", help="Print a visual diff highlighting the changes")
+    diff: bool = Opt(False, "--diff", "-D", help="Print a visual diff highlighting the changes"),
+    code_path: Optional[Path] = Opt(None, "--code-path", "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
     # fmt: on
 ):
     """
@@ -82,6 +84,7 @@ def init_fill_config_cli(
     DOCS: https://spacy.io/api/cli#init-fill-config
     """
+    import_code(code_path)
     fill_config(output_file, base_path, pretraining=pretraining, diff=diff)
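With the new `--code` option, functions registered in your own module become
importable before the config is filled, e.g.
`python -m spacy init fill-config base_config.cfg config.cfg --code functions.py`.
A sketch of a hypothetical `functions.py`:
```python
# functions.py -- hypothetical example; any registrations made here are
# resolvable while the config is being validated and filled
import spacy

@spacy.registry.misc("my_patterns.v1")
def create_patterns():
    # A value that a config section could reference via @misc
    return [{"label": "ORG", "pattern": "Explosion"}]
```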

View File

@@ -120,7 +120,7 @@ def parse_deps(orig_doc: Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]:
     doc (Doc): Document do parse.
     RETURNS (dict): Generated dependency parse keyed by words and arcs.
     """
-    doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes(exclude=["user_data"]))
+    doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes(exclude=["user_data", "user_hooks"]))
     if not doc.has_annotation("DEP"):
         warnings.warn(Warnings.W005)
     if options.get("collapse_phrases", False):
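A sketch of the case this fixes: custom user hooks are plain Python functions
and can't be serialized, so the doc copy inside `parse_deps` now leaves them
out. Assuming `en_core_web_sm` is installed:
```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")

# A custom user hook (not serializable) no longer breaks rendering
doc.user_hooks["similarity"] = lambda doc1, doc2: 0.0

html = displacy.render(doc, style="dep", jupyter=False)
```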

View File

@@ -90,12 +90,12 @@ class RussianLemmatizer(Lemmatizer):
             return [string.lower()]
         return list(set([analysis.normal_form for analysis in filtered_analyses]))

-    def lookup_lemmatize(self, token: Token) -> List[str]:
+    def pymorphy2_lookup_lemmatize(self, token: Token) -> List[str]:
         string = token.text
         analyses = self._morph.parse(string)
         if len(analyses) == 1:
-            return analyses[0].normal_form
-        return string
+            return [analyses[0].normal_form]
+        return [string]


 def oc2ud(oc_tag: str) -> Tuple[str, Dict[str, str]]:
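The rename matters because `Lemmatizer` dispatches on its `mode` setting by
looking up a `{mode}_lemmatize` method, and lookup-style modes must return a
list of lemma strings. A hedged sketch of selecting the lookup mode, assuming
`pymorphy2` is installed:
```python
import spacy

nlp = spacy.blank("ru")
# Resolves to RussianLemmatizer.pymorphy2_lookup_lemmatize via the mode name
nlp.add_pipe("lemmatizer", config={"mode": "pymorphy2_lookup"})
nlp.initialize()

doc = nlp("мама мыла раму")
print([token.lemma_ for token in doc])
```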

spacy/py.typed (new file, empty)
View File
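Shipping the `py.typed` marker (PEP 561) tells type checkers that spaCy's
inline annotations may be used. A small sketch of what this enables:
```python
# check_types.py -- run "mypy check_types.py"; with py.typed installed,
# mypy resolves spaCy's own annotations instead of treating it as untyped
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc: Doc = nlp("hello world")  # checked: calling the pipeline returns a Doc
```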

View File

@@ -6,14 +6,15 @@ def test_build_dependencies():
     # Check that library requirements are pinned exactly the same across different setup files.
     # TODO: correct checks for numpy rather than ignoring
     libs_ignore_requirements = [
-        "numpy",
         "pytest",
         "pytest-timeout",
         "mock",
         "flake8",
+        "hypothesis",
     ]
     # ignore language-specific packages that shouldn't be installed by all
     libs_ignore_setup = [
+        "numpy",
         "fugashi",
         "natto-py",
         "pythainlp",

View File

@@ -142,7 +142,7 @@ def create_pretraining_model(nlp, pretrain_config):
     # If the config referred to a Tok2VecListener, grab the original model instead
     if type(tok2vec).__name__ == "Tok2VecListener":
         original_tok2vec = (
-            tok2vec.upstream_name if tok2vec.upstream_name is not "*" else "tok2vec"
+            tok2vec.upstream_name if tok2vec.upstream_name != "*" else "tok2vec"
         )
         tok2vec = nlp.get_pipe(original_tok2vec).model
     try:
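The fix replaces an identity check with an equality check: `is not` compares
object identity, so string comparisons with it only work when CPython happens
to reuse the same object, and Python 3.8+ emits a `SyntaxWarning` for `is`
with a literal. A quick illustration:
```python
a = "tok2vec"
b = "".join(["tok2", "vec"])  # equal value, built at runtime

print(a == b)  # True: the values match
print(a is b)  # False in CPython: two distinct string objects
```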

View File

@@ -88,7 +88,7 @@ class registry(thinc.registry):
     displacy_colors = catalogue.create("spacy", "displacy_colors", entry_points=True)
     misc = catalogue.create("spacy", "misc", entry_points=True)
     # Callback functions used to manipulate nlp object etc.
-    callbacks = catalogue.create("spacy", "callbacks")
+    callbacks = catalogue.create("spacy", "callbacks", entry_points=True)
     batchers = catalogue.create("spacy", "batchers", entry_points=True)
     readers = catalogue.create("spacy", "readers", entry_points=True)
     augmenters = catalogue.create("spacy", "augmenters", entry_points=True)
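With `entry_points=True`, callbacks registered by installed packages are
discovered through the `spacy_callbacks` entry-point group without an explicit
import. A sketch of registering one (the name is hypothetical):
```python
import spacy

@spacy.registry.callbacks("log_pipeline.v1")
def create_log_callback():
    # Could be referenced from a config block such as [initialize.before_init]
    def log_pipeline(nlp):
        print("components:", nlp.pipe_names)
        return nlp
    return log_pipeline
```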

View File

@@ -170,14 +170,15 @@ validation error with more details.

 $ python -m spacy init fill-config [base_path] [output_file] [--diff]
 ```

 | Name                   | Description |
 | ---------------------- | ----------- |
 | `base_path`            | Path to base config to fill, e.g. generated by the [quickstart widget](/usage/training#quickstart). ~~Path (positional)~~ |
 | `output_file`          | Path to output `.cfg` file. If not set, the config is written to stdout so you can pipe it forward to a file. ~~Path (positional)~~ |
+| `--code`, `-c`         | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ |
 | `--pretraining`, `-pt` | Include config for pretraining (with [`spacy pretrain`](/api/cli#pretrain)). Defaults to `False`. ~~bool (flag)~~ |
 | `--diff`, `-D`         | Print a visual diff highlighting the changes. ~~bool (flag)~~ |
 | `--help`, `-h`         | Show help message and available arguments. ~~bool (flag)~~ |
 | **CREATES**            | Complete and auto-filled config file for training. |

 ### init vectors {#init-vectors new="3" tag="command"}

@@ -261,24 +262,24 @@ $ python -m spacy convert [input_file] [output_dir] [--converter] [--file-type]

 | `output_dir` | Output directory for converted file. Defaults to `"-"`, meaning data will be written to `stdout`. ~~Optional[Path] \(positional)~~ |
 | `--converter`, `-c` <Tag variant="new">2</Tag> | Name of converter to use (see below). ~~str (option)~~ |
 | `--file-type`, `-t` <Tag variant="new">2.1</Tag> | Type of file to create. Either `spacy` (default) for binary [`DocBin`](/api/docbin) data or `json` for v2.x JSON format. ~~str (option)~~ |
-| `--n-sents`, `-n` | Number of sentences per document. ~~int (option)~~ |
+| `--n-sents`, `-n` | Number of sentences per document. Supported for: `conll`, `conllu`, `iob`, `ner` ~~int (option)~~ |
-| `--seg-sents`, `-s` <Tag variant="new">2.2</Tag> | Segment sentences (for `--converter ner`). ~~bool (flag)~~ |
+| `--seg-sents`, `-s` <Tag variant="new">2.2</Tag> | Segment sentences. Supported for: `conll`, `ner` ~~bool (flag)~~ |
 | `--base`, `-b` | Trained spaCy pipeline for sentence segmentation to use as base (for `--seg-sents`). ~~Optional[str](option)~~ |
-| `--morphology`, `-m` | Enable appending morphology to tags. ~~bool (flag)~~ |
+| `--morphology`, `-m` | Enable appending morphology to tags. Supported for: `conllu` ~~bool (flag)~~ |
-| `--ner-map`, `-nm` | NER tag mapping (as JSON-encoded dict of entity types). ~~Optional[Path](option)~~ |
+| `--ner-map`, `-nm` | NER tag mapping (as JSON-encoded dict of entity types). Supported for: `conllu` ~~Optional[Path](option)~~ |
 | `--lang`, `-l` <Tag variant="new">2.1</Tag> | Language code (if tokenizer required). ~~Optional[str] \(option)~~ |
 | `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
 | **CREATES** | Binary [`DocBin`](/api/docbin) training data that can be used with [`spacy train`](/api/cli#train). |

 ### Converters {#converters}

 | ID | Description |
 | --------------- | ----------- |
 | `auto` | Automatically pick converter based on file extension and file content (default). |
 | `json` | JSON-formatted training data used in spaCy v2.x. |
-| `conll` | Universal Dependencies `.conllu` or `.conll` format. |
+| `conllu` | Universal Dependencies `.conllu` format. |
-| `ner` | NER with IOB/IOB2 tags, one token per line with columns separated by whitespace. The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](%%GITHUB_SPACY/extra/example_data/ner_example_data). |
+| `ner` / `conll` | NER with IOB/IOB2/BILUO tags, one token per line with columns separated by whitespace. The first column is the token and the final column is the NER tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](%%GITHUB_SPACY/extra/example_data/ner_example_data). |
-| `iob` | NER with IOB/IOB2 tags, one sentence per line with tokens separated by whitespace and annotation separated by `\|`, either `word\|B-ENT` or `word\|POS\|B-ENT`. See [sample data](%%GITHUB_SPACY/extra/example_data/ner_example_data). |
+| `iob` | NER with IOB/IOB2/BILUO tags, one sentence per line with tokens separated by whitespace and annotation separated by `\|`, either `word\|B-ENT` or `word\|POS\|B-ENT`. See [sample data](%%GITHUB_SPACY/extra/example_data/ner_example_data). |

 ## debug {#debug new="3"}

@@ -805,7 +806,7 @@ $ python -m spacy train [config_path] [--output] [--code] [--verbose] [--gpu-id]

 | Name | Description |
 | ----------------- | ----------- |
 | `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. If `-`, the data will be [read from stdin](/usage/training#config-stdin). ~~Union[Path, str] \(positional)~~ |
-| `--output`, `-o` | Directory to store trained pipeline in. Will be created if it doesn't exist. ~~Optional[Path] \(positional)~~ |
+| `--output`, `-o` | Directory to store trained pipeline in. Will be created if it doesn't exist. ~~Optional[Path] \(option)~~ |
 | `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ |
 | `--verbose`, `-V` | Show more detailed messages during training. ~~bool (flag)~~ |
 | `--gpu-id`, `-g` | GPU ID or `-1` for CPU. Defaults to `-1`. ~~int (option)~~ |

View File

@@ -48,6 +48,7 @@ specified separately using the new `exclude` keyword argument.

 | ------------------------------------ | ----------- |
 | `name` | Pipeline to load, i.e. package name or path. ~~Union[str, Path]~~ |
 | _keyword-only_ | |
+| `vocab` | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~ |
 | `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling [nlp.enable_pipe](/api/language#enable_pipe). ~~List[str]~~ |
 | `exclude` <Tag variant="new">3</Tag> | Names of pipeline components to [exclude](/usage/processing-pipelines#disabling). Excluded components won't be loaded. ~~List[str]~~ |
 | `config` <Tag variant="new">3</Tag> | Optional config overrides, either as nested dict or dict keyed by section value in dot notation, e.g. `"components.name.value"`. ~~Union[Dict[str, Any], Config]~~ |
@@ -83,9 +84,9 @@ Create a blank pipeline of a given language class. This function is the twin of

 | ----------------------------------- | ----------- |
 | `name` | [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) of the language class to load. ~~str~~ |
 | _keyword-only_ | |
-| `vocab` <Tag variant="new">3</Tag> | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~. |
+| `vocab` | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~ |
 | `config` <Tag variant="new">3</Tag> | Optional config overrides, either as nested dict or dict keyed by section value in dot notation, e.g. `"components.name.value"`. ~~Union[Dict[str, Any], Config]~~ |
-| `meta` <Tag variant="new">3</Tag> | Optional meta overrides for [`nlp.meta`](/api/language#meta). ~~Dict[str, Any]~~ |
+| `meta` | Optional meta overrides for [`nlp.meta`](/api/language#meta). ~~Dict[str, Any]~~ |
 | **RETURNS** | An empty `Language` object of the appropriate subclass. ~~Language~~ |
 ### spacy.info {#spacy.info tag="function"}

@@ -140,9 +141,9 @@ pipelines.

 <Infobox variant="warning" title="Jupyter notebook usage">

-In a Jupyter notebook, run `prefer_gpu()` in the same cell as `spacy.load()`
-to ensure that the model is loaded on the correct device. See [more
-details](/usage/v3#jupyter-notebook-gpu).
+In a Jupyter notebook, run `prefer_gpu()` in the same cell as `spacy.load()` to
+ensure that the model is loaded on the correct device. See
+[more details](/usage/v3#jupyter-notebook-gpu).

 </Infobox>

@@ -168,9 +169,9 @@ and _before_ loading any pipelines.

 <Infobox variant="warning" title="Jupyter notebook usage">

-In a Jupyter notebook, run `require_gpu()` in the same cell as `spacy.load()`
-to ensure that the model is loaded on the correct device. See [more
-details](/usage/v3#jupyter-notebook-gpu).
+In a Jupyter notebook, run `require_gpu()` in the same cell as `spacy.load()` to
+ensure that the model is loaded on the correct device. See
+[more details](/usage/v3#jupyter-notebook-gpu).

 </Infobox>

@@ -195,9 +196,9 @@ after importing spaCy and _before_ loading any pipelines.

 <Infobox variant="warning" title="Jupyter notebook usage">

-In a Jupyter notebook, run `require_cpu()` in the same cell as `spacy.load()`
-to ensure that the model is loaded on the correct device. See [more
-details](/usage/v3#jupyter-notebook-gpu).
+In a Jupyter notebook, run `require_cpu()` in the same cell as `spacy.load()` to
+ensure that the model is loaded on the correct device. See
+[more details](/usage/v3#jupyter-notebook-gpu).

 </Infobox>
@@ -945,7 +946,8 @@ and create a `Language` object. The model data will then be loaded in via

 | Name | Description |
 | ------------------------------------ | ----------- |
 | `name` | Package name or path. ~~str~~ |
-| `vocab` <Tag variant="new">3</Tag> | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~. |
+| _keyword-only_ | |
+| `vocab` | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~ |
 | `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling [`nlp.enable_pipe`](/api/language#enable_pipe). ~~List[str]~~ |
 | `exclude` <Tag variant="new">3</Tag> | Names of pipeline components to [exclude](/usage/processing-pipelines#disabling). Excluded components won't be loaded. ~~List[str]~~ |
 | `config` <Tag variant="new">3</Tag> | Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. `"nlp.pipeline"`. ~~Union[Dict[str, Any], Config]~~ |
@@ -968,7 +970,8 @@ A helper function to use in the `load()` method of a pipeline package's

 | Name | Description |
 | ------------------------------------ | ----------- |
 | `init_file` | Path to package's `__init__.py`, i.e. `__file__`. ~~Union[str, Path]~~ |
-| `vocab` <Tag variant="new">3</Tag> | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~. |
+| _keyword-only_ | |
+| `vocab` <Tag variant="new">3</Tag> | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~ |
 | `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling [nlp.enable_pipe](/api/language#enable_pipe). ~~List[str]~~ |
 | `exclude` <Tag variant="new">3</Tag> | Names of pipeline components to [exclude](/usage/processing-pipelines#disabling). Excluded components won't be loaded. ~~List[str]~~ |
 | `config` <Tag variant="new">3</Tag> | Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. `"nlp.pipeline"`. ~~Union[Dict[str, Any], Config]~~ |
@@ -1147,11 +1150,11 @@ vary on each step.

 > nlp.update(batch)
 > ```

 | Name | Description |
 | ---------- | ------------------------------------------------ |
 | `items` | The items to batch up. ~~Iterable[Any]~~ |
-| `size` | int / iterable | The batch size(s). ~~Union[int, Sequence[int]]~~ |
+| `size` | The batch size(s). ~~Union[int, Sequence[int]]~~ |
 | **YIELDS** | The batches. |

 ### util.filter_spans {#util.filter_spans tag="function" new="2.1.4"}
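A quick sketch of the corrected `util.minibatch` signature in use:
```python
from spacy.util import minibatch

# size can be a fixed int or a sequence/schedule of sizes
for batch in minibatch(range(10), size=4):
    print(batch)  # [0, 1, 2, 3] then [4, 5, 6, 7] then [8, 9]
```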

View File

@@ -1,5 +1,36 @@
 {
     "resources": [
+        {
+            "id": "spikex",
+            "title": "SpikeX - SpaCy Pipes for Knowledge Extraction",
+            "slogan": "Use SpikeX to build knowledge extraction tools with almost-zero effort",
+            "description": "SpikeX is a collection of pipes ready to be plugged in a spaCy pipeline. It aims to help in building knowledge extraction tools with almost-zero effort.",
+            "github": "erre-quadro/spikex",
+            "pip": "spikex",
+            "code_example": [
+                "from spacy import load as spacy_load",
+                "from spikex.wikigraph import load as wg_load",
+                "from spikex.pipes import WikiPageX",
+                "",
+                "# load a spacy model and get a doc",
+                "nlp = spacy_load('en_core_web_sm')",
+                "doc = nlp('An apple a day keeps the doctor away')",
+                "# load a WikiGraph",
+                "wg = wg_load('simplewiki_core')",
+                "# get a WikiPageX and extract all pages",
+                "wikipagex = WikiPageX(wg)",
+                "doc = wikipagex(doc)",
+                "# see all pages extracted from the doc",
+                "for span in doc._.wiki_spans:",
+                "    print(span._.wiki_pages)"
+            ],
+            "category": ["pipeline", "standalone"],
+            "author": "Erre Quadro",
+            "author_links": {
+                "github": "erre-quadro",
+                "website": "https://www.errequadrosrl.com"
+            }
+        },
         {
             "id": "spacy-dbpedia-spotlight",
             "title": "DBpedia Spotlight for SpaCy",