diff --git a/.github/azure-steps.yml b/.github/azure-steps.yml new file mode 100644 index 000000000..750e096d0 --- /dev/null +++ b/.github/azure-steps.yml @@ -0,0 +1,57 @@ +parameters: + python_version: '' + architecture: '' + prefix: '' + gpu: false + num_build_jobs: 1 + +steps: + - task: UsePythonVersion@0 + inputs: + versionSpec: ${{ parameters.python_version }} + architecture: ${{ parameters.architecture }} + + - script: | + ${{ parameters.prefix }} python -m pip install -U pip setuptools + ${{ parameters.prefix }} python -m pip install -U -r requirements.txt + displayName: "Install dependencies" + + - script: | + ${{ parameters.prefix }} python setup.py build_ext --inplace -j ${{ parameters.num_build_jobs }} + ${{ parameters.prefix }} python setup.py sdist --formats=gztar + displayName: "Compile and build sdist" + + - task: DeleteFiles@1 + inputs: + contents: "spacy" + displayName: "Delete source directory" + + - script: | + ${{ parameters.prefix }} python -m pip freeze --exclude torch --exclude cupy-cuda110 > installed.txt + ${{ parameters.prefix }} python -m pip uninstall -y -r installed.txt + displayName: "Uninstall all packages" + + - bash: | + ${{ parameters.prefix }} SDIST=$(python -c "import os;print(os.listdir('./dist')[-1])" 2>&1) + ${{ parameters.prefix }} python -m pip install dist/$SDIST + displayName: "Install from sdist" + + - script: | + ${{ parameters.prefix }} python -m pip install -U -r requirements.txt + displayName: "Install test requirements" + + - script: | + ${{ parameters.prefix }} python -m pip install -U cupy-cuda110 + ${{ parameters.prefix }} python -m pip install "torch==1.7.1+cu110" -f https://download.pytorch.org/whl/torch_stable.html + displayName: "Install GPU requirements" + condition: eq(${{ parameters.gpu }}, true) + + - script: | + ${{ parameters.prefix }} python -m pytest --pyargs spacy + displayName: "Run CPU tests" + condition: eq(${{ parameters.gpu }}, false) + + - script: | + ${{ parameters.prefix }} python -m pytest --pyargs spacy -p spacy.tests.enable_gpu + displayName: "Run GPU tests" + condition: eq(${{ parameters.gpu }}, true) diff --git a/.github/contributors/AyushExel.md b/.github/contributors/AyushExel.md new file mode 100644 index 000000000..281fd0cd0 --- /dev/null +++ b/.github/contributors/AyushExel.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. 
The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. 
Please do NOT +mark both statements: + + * [X] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Ayush Chaurasia | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 2021-03-12 | +| GitHub username | AyushExel | +| Website (optional) | | diff --git a/.github/contributors/broaddeep.md b/.github/contributors/broaddeep.md new file mode 100644 index 000000000..d6c4b3cf3 --- /dev/null +++ b/.github/contributors/broaddeep.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. 
With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. 
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Dongjun Park | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 2021-03-06 | +| GitHub username | broaddeep | +| Website (optional) | | diff --git a/azure-pipelines.yml b/azure-pipelines.yml index bb259dded..bea65cae2 100644 --- a/azure-pipelines.yml +++ b/azure-pipelines.yml @@ -76,39 +76,24 @@ jobs: maxParallel: 4 pool: vmImage: $(imageName) - steps: - - task: UsePythonVersion@0 - inputs: - versionSpec: "$(python.version)" - architecture: "x64" + - template: .github/azure-steps.yml + parameters: + python_version: '$(python.version)' + architecture: 'x64' - - script: | - python -m pip install -U setuptools - pip install -r requirements.txt - displayName: "Install dependencies" - - - script: | - python setup.py build_ext --inplace - python setup.py sdist --formats=gztar - displayName: "Compile and build sdist" - - - task: DeleteFiles@1 - inputs: - contents: "spacy" - displayName: "Delete source directory" - - - script: | - pip freeze > installed.txt - pip uninstall -y -r installed.txt - displayName: "Uninstall all packages" - - - bash: | - SDIST=$(python -c "import os;print(os.listdir('./dist')[-1])" 2>&1) - pip install dist/$SDIST - displayName: "Install from sdist" - - - script: | - pip install -r requirements.txt - python -m pytest --pyargs spacy - displayName: "Run tests" + - job: "TestGPU" + dependsOn: "Validate" + strategy: + matrix: + Python38LinuxX64_GPU: + python.version: '3.8' + pool: + name: "LinuxX64_GPU" + steps: + - template: .github/azure-steps.yml + parameters: + python_version: '$(python.version)' + architecture: 'x64' + gpu: true + num_build_jobs: 24 diff --git a/pyproject.toml b/pyproject.toml index f00fdc9f4..3e34a0b2d 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -5,7 +5,7 @@ requires = [ "cymem>=2.0.2,<2.1.0", "preshed>=3.0.2,<3.1.0", "murmurhash>=0.28.0,<1.1.0", - "thinc>=8.0.2,<8.1.0", + "thinc>=8.0.3,<8.1.0", "blis>=0.4.0,<0.8.0", "pathy", "numpy>=1.15.0", diff --git a/requirements.txt b/requirements.txt index e09a5b221..1947dd2de 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,14 +1,14 @@ # Our libraries -spacy-legacy>=3.0.0,<3.1.0 +spacy-legacy>=3.0.4,<3.1.0 cymem>=2.0.2,<2.1.0 preshed>=3.0.2,<3.1.0 -thinc>=8.0.2,<8.1.0 +thinc>=8.0.3,<8.1.0 blis>=0.4.0,<0.8.0 ml_datasets>=0.2.0,<0.3.0 murmurhash>=0.28.0,<1.1.0 wasabi>=0.8.1,<1.1.0 -srsly>=2.4.0,<3.0.0 -catalogue>=2.0.1,<2.1.0 +srsly>=2.4.1,<3.0.0 +catalogue>=2.0.3,<2.1.0 typer>=0.3.0,<0.4.0 pathy>=0.3.5 # Third party dependencies @@ -20,7 +20,6 @@ jinja2 # Official Python utilities setuptools packaging>=20.0 -importlib_metadata>=0.20; python_version < "3.8" typing_extensions>=3.7.4.1,<4.0.0.0; python_version < "3.8" # Development dependencies cython>=0.25 diff --git a/setup.cfg b/setup.cfg index 09f989c54..9e1293335 100644 --- a/setup.cfg +++ b/setup.cfg @@ -34,18 +34,18 @@ setup_requires = cymem>=2.0.2,<2.1.0 preshed>=3.0.2,<3.1.0 murmurhash>=0.28.0,<1.1.0 - thinc>=8.0.2,<8.1.0 + thinc>=8.0.3,<8.1.0 install_requires = # Our libraries - spacy-legacy>=3.0.0,<3.1.0 + spacy-legacy>=3.0.4,<3.1.0 murmurhash>=0.28.0,<1.1.0 cymem>=2.0.2,<2.1.0 preshed>=3.0.2,<3.1.0 - thinc>=8.0.2,<8.1.0 + thinc>=8.0.3,<8.1.0 blis>=0.4.0,<0.8.0 wasabi>=0.8.1,<1.1.0 - srsly>=2.4.0,<3.0.0 - catalogue>=2.0.1,<2.1.0 + srsly>=2.4.1,<3.0.0 + catalogue>=2.0.3,<2.1.0 typer>=0.3.0,<0.4.0 pathy>=0.3.5 # Third-party dependencies @@ -57,7 +57,6 @@ install_requires = # Official Python 
utilities setuptools packaging>=20.0 - importlib_metadata>=0.20; python_version < "3.8" typing_extensions>=3.7.4,<4.0.0.0; python_version < "3.8" [options.entry_points] @@ -91,6 +90,8 @@ cuda110 = cupy-cuda110>=5.0.0b4,<9.0.0 cuda111 = cupy-cuda111>=5.0.0b4,<9.0.0 +cuda112 = + cupy-cuda112>=5.0.0b4,<9.0.0 # Language tokenizers with external dependencies ja = sudachipy>=0.4.9 diff --git a/spacy/about.py b/spacy/about.py index 2987f3c53..c351076c5 100644 --- a/spacy/about.py +++ b/spacy/about.py @@ -1,6 +1,6 @@ # fmt: off __title__ = "spacy" -__version__ = "3.0.5" +__version__ = "3.0.6" __download_url__ = "https://github.com/explosion/spacy-models/releases/download" __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json" __projects__ = "https://github.com/explosion/projects" diff --git a/spacy/cli/__init__.py b/spacy/cli/__init__.py index 7368bcef3..fd8da262e 100644 --- a/spacy/cli/__init__.py +++ b/spacy/cli/__init__.py @@ -9,6 +9,7 @@ from .info import info # noqa: F401 from .package import package # noqa: F401 from .profile import profile # noqa: F401 from .train import train_cli # noqa: F401 +from .assemble import assemble_cli # noqa: F401 from .pretrain import pretrain # noqa: F401 from .debug_data import debug_data # noqa: F401 from .debug_config import debug_config # noqa: F401 @@ -29,9 +30,9 @@ from .project.document import project_document # noqa: F401 @app.command("link", no_args_is_help=True, deprecated=True, hidden=True) def link(*args, **kwargs): - """As of spaCy v3.0, symlinks like "en" are deprecated. You can load trained + """As of spaCy v3.0, symlinks like "en" are not supported anymore. You can load trained pipeline packages using their full names or from a directory path.""" msg.warn( - "As of spaCy v3.0, model symlinks are deprecated. You can load trained " + "As of spaCy v3.0, model symlinks are not supported anymore. You can load trained " "pipeline packages using their full names or from a directory path." ) diff --git a/spacy/cli/assemble.py b/spacy/cli/assemble.py new file mode 100644 index 000000000..f63c51857 --- /dev/null +++ b/spacy/cli/assemble.py @@ -0,0 +1,58 @@ +from typing import Optional +from pathlib import Path +from wasabi import msg +import typer +import logging + +from ._util import app, Arg, Opt, parse_config_overrides, show_validation_error +from ._util import import_code +from ..training.initialize import init_nlp +from .. import util +from ..util import get_sourced_components, load_model_from_config + + +@app.command( + "assemble", + context_settings={"allow_extra_args": True, "ignore_unknown_options": True}, +) +def assemble_cli( + # fmt: off + ctx: typer.Context, # This is only used to read additional arguments + config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True), + output_path: Path = Arg(..., help="Output directory to store assembled pipeline in"), + code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"), + verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"), + # fmt: on +): + """ + Assemble a spaCy pipeline from a config file. The config file includes + all settings for initializing the pipeline. To override settings in the + config, e.g. settings that point to local paths or that you want to + experiment with, you can override them as command line options. 
The + --code argument lets you pass in a Python file that can be used to + register custom functions that are referenced in the config. + + DOCS: https://spacy.io/api/cli#assemble + """ + util.logger.setLevel(logging.DEBUG if verbose else logging.INFO) + # Make sure all files and paths exists if they are needed + if not config_path or (str(config_path) != "-" and not config_path.exists()): + msg.fail("Config file not found", config_path, exits=1) + overrides = parse_config_overrides(ctx.args) + import_code(code_path) + with show_validation_error(config_path): + config = util.load_config(config_path, overrides=overrides, interpolate=False) + msg.divider("Initializing pipeline") + nlp = load_model_from_config(config, auto_fill=True) + config = config.interpolate() + sourced = get_sourced_components(config) + # Make sure that listeners are defined before initializing further + nlp._link_components() + with nlp.select_pipes(disable=[*sourced]): + nlp.initialize() + msg.good("Initialized pipeline") + msg.divider("Serializing to disk") + if output_path is not None and not output_path.exists(): + output_path.mkdir(parents=True) + msg.good(f"Created output directory: {output_path}") + nlp.to_disk(output_path) diff --git a/spacy/cli/debug_data.py b/spacy/cli/debug_data.py index be11f8d1c..3351e53fe 100644 --- a/spacy/cli/debug_data.py +++ b/spacy/cli/debug_data.py @@ -1,4 +1,4 @@ -from typing import List, Sequence, Dict, Any, Tuple, Optional +from typing import List, Sequence, Dict, Any, Tuple, Optional, Set from pathlib import Path from collections import Counter import sys @@ -13,6 +13,8 @@ from ..training.initialize import get_sourced_components from ..schemas import ConfigSchemaTraining from ..pipeline._parser_internals import nonproj from ..pipeline._parser_internals.nonproj import DELIMITER +from ..pipeline import Morphologizer +from ..morphology import Morphology from ..language import Language from ..util import registry, resolve_dot_names from .. import util @@ -194,32 +196,32 @@ def debug_data( ) label_counts = gold_train_data["ner"] model_labels = _get_labels_from_model(nlp, "ner") - new_labels = [l for l in labels if l not in model_labels] - existing_labels = [l for l in labels if l in model_labels] has_low_data_warning = False has_no_neg_warning = False has_ws_ents_error = False has_punct_ents_warning = False msg.divider("Named Entity Recognition") - msg.info( - f"{len(new_labels)} new label(s), {len(existing_labels)} existing label(s)" - ) + msg.info(f"{len(model_labels)} label(s)") missing_values = label_counts["-"] msg.text(f"{missing_values} missing value(s) (tokens with '-' label)") - for label in new_labels: + for label in labels: if len(label) == 0: - msg.fail("Empty label found in new labels") - if new_labels: - labels_with_counts = [ - (label, count) - for label, count in label_counts.most_common() - if label != "-" - ] - labels_with_counts = _format_labels(labels_with_counts, counts=True) - msg.text(f"New: {labels_with_counts}", show=verbose) - if existing_labels: - msg.text(f"Existing: {_format_labels(existing_labels)}", show=verbose) + msg.fail("Empty label found in train data") + labels_with_counts = [ + (label, count) + for label, count in label_counts.most_common() + if label != "-" + ] + labels_with_counts = _format_labels(labels_with_counts, counts=True) + msg.text(f"Labels in train data: {_format_labels(labels)}", show=verbose) + missing_labels = model_labels - labels + if missing_labels: + msg.warn( + "Some model labels are not present in the train data. 
The " + "model performance may be degraded for these labels after " + f"training: {_format_labels(missing_labels)}." + ) if gold_train_data["ws_ents"]: msg.fail(f"{gold_train_data['ws_ents']} invalid whitespace entity spans") has_ws_ents_error = True @@ -228,10 +230,10 @@ def debug_data( msg.warn(f"{gold_train_data['punct_ents']} entity span(s) with punctuation") has_punct_ents_warning = True - for label in new_labels: + for label in labels: if label_counts[label] <= NEW_LABEL_THRESHOLD: msg.warn( - f"Low number of examples for new label '{label}' ({label_counts[label]})" + f"Low number of examples for label '{label}' ({label_counts[label]})" ) has_low_data_warning = True @@ -276,22 +278,52 @@ def debug_data( ) if "textcat" in factory_names: - msg.divider("Text Classification") - labels = [label for label in gold_train_data["cats"]] - model_labels = _get_labels_from_model(nlp, "textcat") - new_labels = [l for l in labels if l not in model_labels] - existing_labels = [l for l in labels if l in model_labels] - msg.info( - f"Text Classification: {len(new_labels)} new label(s), " - f"{len(existing_labels)} existing label(s)" + msg.divider("Text Classification (Exclusive Classes)") + labels = _get_labels_from_model(nlp, "textcat") + msg.info(f"Text Classification: {len(labels)} label(s)") + msg.text(f"Labels: {_format_labels(labels)}", show=verbose) + labels_with_counts = _format_labels( + gold_train_data["cats"].most_common(), counts=True ) - if new_labels: - labels_with_counts = _format_labels( - gold_train_data["cats"].most_common(), counts=True + msg.text(f"Labels in train data: {labels_with_counts}", show=verbose) + missing_labels = labels - set(gold_train_data["cats"].keys()) + if missing_labels: + msg.warn( + "Some model labels are not present in the train data. The " + "model performance may be degraded for these labels after " + f"training: {_format_labels(missing_labels)}." + ) + if gold_train_data["n_cats_multilabel"] > 0: + # Note: you should never get here because you run into E895 on + # initialization first. + msg.warn( + "The train data contains instances without " + "mutually-exclusive classes. Use the component " + "'textcat_multilabel' instead of 'textcat'." + ) + if gold_dev_data["n_cats_multilabel"] > 0: + msg.fail( + "Train/dev mismatch: the dev data contains instances " + "without mutually-exclusive classes while the train data " + "contains only instances with mutually-exclusive classes." + ) + + if "textcat_multilabel" in factory_names: + msg.divider("Text Classification (Multilabel)") + labels = _get_labels_from_model(nlp, "textcat_multilabel") + msg.info(f"Text Classification: {len(labels)} label(s)") + msg.text(f"Labels: {_format_labels(labels)}", show=verbose) + labels_with_counts = _format_labels( + gold_train_data["cats"].most_common(), counts=True + ) + msg.text(f"Labels in train data: {labels_with_counts}", show=verbose) + missing_labels = labels - set(gold_train_data["cats"].keys()) + if missing_labels: + msg.warn( + "Some model labels are not present in the train data. The " + "model performance may be degraded for these labels after " + f"training: {_format_labels(missing_labels)}." ) - msg.text(f"New: {labels_with_counts}", show=verbose) - if existing_labels: - msg.text(f"Existing: {_format_labels(existing_labels)}", show=verbose) if set(gold_train_data["cats"]) != set(gold_dev_data["cats"]): msg.fail( f"The train and dev labels are not the same. " @@ -299,11 +331,6 @@ def debug_data( f"Dev labels: {_format_labels(gold_dev_data['cats'])}." 
) if gold_train_data["n_cats_multilabel"] > 0: - msg.info( - "The train data contains instances without " - "mutually-exclusive classes. Use '--textcat-multilabel' " - "when training." - ) if gold_dev_data["n_cats_multilabel"] == 0: msg.warn( "Potential train/dev mismatch: the train data contains " @@ -311,9 +338,10 @@ def debug_data( "dev data does not." ) else: - msg.info( + msg.warn( "The train data contains only instances with " - "mutually-exclusive classes." + "mutually-exclusive classes. You can potentially use the " + "component 'textcat' instead of 'textcat_multilabel'." ) if gold_dev_data["n_cats_multilabel"] > 0: msg.fail( @@ -325,13 +353,37 @@ def debug_data( if "tagger" in factory_names: msg.divider("Part-of-speech Tagging") labels = [label for label in gold_train_data["tags"]] - # TODO: does this need to be updated? - msg.info(f"{len(labels)} label(s) in data") + model_labels = _get_labels_from_model(nlp, "tagger") + msg.info(f"{len(labels)} label(s) in train data") + missing_labels = model_labels - set(labels) + if missing_labels: + msg.warn( + "Some model labels are not present in the train data. The " + "model performance may be degraded for these labels after " + f"training: {_format_labels(missing_labels)}." + ) labels_with_counts = _format_labels( gold_train_data["tags"].most_common(), counts=True ) msg.text(labels_with_counts, show=verbose) + if "morphologizer" in factory_names: + msg.divider("Morphologizer (POS+Morph)") + labels = [label for label in gold_train_data["morphs"]] + model_labels = _get_labels_from_model(nlp, "morphologizer") + msg.info(f"{len(labels)} label(s) in train data") + missing_labels = model_labels - set(labels) + if missing_labels: + msg.warn( + "Some model labels are not present in the train data. The " + "model performance may be degraded for these labels after " + f"training: {_format_labels(missing_labels)}." 
+ ) + labels_with_counts = _format_labels( + gold_train_data["morphs"].most_common(), counts=True + ) + msg.text(labels_with_counts, show=verbose) + if "parser" in factory_names: has_low_data_warning = False msg.divider("Dependency Parsing") @@ -491,6 +543,7 @@ def _compile_gold( "ner": Counter(), "cats": Counter(), "tags": Counter(), + "morphs": Counter(), "deps": Counter(), "words": Counter(), "roots": Counter(), @@ -544,13 +597,36 @@ def _compile_gold( data["ner"][combined_label] += 1 elif label == "-": data["ner"]["-"] += 1 - if "textcat" in factory_names: + if "textcat" in factory_names or "textcat_multilabel" in factory_names: data["cats"].update(gold.cats) if list(gold.cats.values()).count(1.0) != 1: data["n_cats_multilabel"] += 1 if "tagger" in factory_names: tags = eg.get_aligned("TAG", as_string=True) data["tags"].update([x for x in tags if x is not None]) + if "morphologizer" in factory_names: + pos_tags = eg.get_aligned("POS", as_string=True) + morphs = eg.get_aligned("MORPH", as_string=True) + for pos, morph in zip(pos_tags, morphs): + # POS may align (same value for multiple tokens) when morph + # doesn't, so if either is misaligned (None), treat the + # annotation as missing so that truths doesn't end up with an + # unknown morph+POS combination + if pos is None or morph is None: + pass + # If both are unset, the annotation is missing (empty morph + # converted from int is "_" rather than "") + elif pos == "" and morph == "": + pass + # Otherwise, generate the combined label + else: + label_dict = Morphology.feats_to_dict(morph) + if pos: + label_dict[Morphologizer.POS_FEAT] = pos + label = eg.reference.vocab.strings[ + eg.reference.vocab.morphology.add(label_dict) + ] + data["morphs"].update([label]) if "parser" in factory_names: aligned_heads, aligned_deps = eg.get_aligned_parse(projectivize=make_proj) data["deps"].update([x for x in aligned_deps if x is not None]) @@ -584,8 +660,8 @@ def _get_examples_without_label(data: Sequence[Example], label: str) -> int: return count -def _get_labels_from_model(nlp: Language, pipe_name: str) -> Sequence[str]: +def _get_labels_from_model(nlp: Language, pipe_name: str) -> Set[str]: if pipe_name not in nlp.pipe_names: return set() pipe = nlp.get_pipe(pipe_name) - return pipe.labels + return set(pipe.labels) diff --git a/spacy/cli/templates/quickstart_training.jinja b/spacy/cli/templates/quickstart_training.jinja index 38fc23272..e43c21bbd 100644 --- a/spacy/cli/templates/quickstart_training.jinja +++ b/spacy/cli/templates/quickstart_training.jinja @@ -206,7 +206,7 @@ factory = "tok2vec" @architectures = "spacy.Tok2Vec.v2" [components.tok2vec.model.embed] -@architectures = "spacy.MultiHashEmbed.v1" +@architectures = "spacy.MultiHashEmbed.v2" width = ${components.tok2vec.model.encode.width} {% if has_letters -%} attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"] diff --git a/spacy/default_config.cfg b/spacy/default_config.cfg index 0f7226083..7f092d5f5 100644 --- a/spacy/default_config.cfg +++ b/spacy/default_config.cfg @@ -68,8 +68,11 @@ seed = ${system.seed} gpu_allocator = ${system.gpu_allocator} dropout = 0.1 accumulate_gradient = 1 -# Controls early-stopping. 0 or -1 mean unlimited. +# Controls early-stopping. 0 disables early stopping. patience = 1600 +# Number of epochs. 0 means unlimited. If >= 0, train corpus is loaded once in +# memory and shuffled within the training loop. -1 means stream train corpus +# rather than loading in memory with no shuffling within the training loop. 
max_epochs = 0 max_steps = 20000 eval_frequency = 200 diff --git a/spacy/errors.py b/spacy/errors.py index d8c5cc3a8..7cf9e54e4 100644 --- a/spacy/errors.py +++ b/spacy/errors.py @@ -157,6 +157,10 @@ class Warnings: "`spacy.load()` to ensure that the model is loaded on the correct " "device. More information: " "http://spacy.io/usage/v3#jupyter-notebook-gpu") + W112 = ("The model specified to use for initial vectors ({name}) has no " + "vectors. This is almost certainly a mistake.") + W113 = ("Sourced component '{name}' may not work as expected: source " + "vectors are not identical to current pipeline vectors.") @add_codes @@ -497,6 +501,12 @@ class Errors: E202 = ("Unsupported alignment mode '{mode}'. Supported modes: {modes}.") # New errors added in v3.x + E872 = ("Unable to copy tokenizer from base model due to different " + 'tokenizer settings: current tokenizer config "{curr_config}" ' + 'vs. base model "{base_config}"') + E873 = ("Unable to merge a span from doc.spans with key '{key}' and text " + "'{text}'. This is likely a bug in spaCy, so feel free to open an " + "issue: https://github.com/explosion/spaCy/issues") E874 = ("Could not initialize the tok2vec model from component " "'{component}' and layer '{layer}'.") E875 = ("To use the PretrainVectors objective, make sure that static vectors are loaded. " @@ -631,7 +641,7 @@ class Errors: "method, make sure it's overwritten on the subclass.") E940 = ("Found NaN values in scores.") E941 = ("Can't find model '{name}'. It looks like you're trying to load a " - "model from a shortcut, which is deprecated as of spaCy v3.0. To " + "model from a shortcut, which is obsolete as of spaCy v3.0. To " "load the model, use its full name instead:\n\n" "nlp = spacy.load(\"{full}\")\n\nFor more details on the available " "models, see the models directory: https://spacy.io/models. If you " @@ -646,8 +656,8 @@ class Errors: "returned the initialized nlp object instead?") E944 = ("Can't copy pipeline component '{name}' from source '{model}': " "not found in pipeline. Available components: {opts}") - E945 = ("Can't copy pipeline component '{name}' from source. Expected loaded " - "nlp object, but got: {source}") + E945 = ("Can't copy pipeline component '{name}' from source. Expected " + "loaded nlp object, but got: {source}") E947 = ("`Matcher.add` received invalid `greedy` argument: expected " "a string value from {expected} but got: '{arg}'") E948 = ("`Matcher.add` received invalid 'patterns' argument: expected " diff --git a/spacy/lang/it/tokenizer_exceptions.py b/spacy/lang/it/tokenizer_exceptions.py index 0c9968bc6..87c2929bf 100644 --- a/spacy/lang/it/tokenizer_exceptions.py +++ b/spacy/lang/it/tokenizer_exceptions.py @@ -17,14 +17,19 @@ _exc = { for orth in [ "..", "....", + "a.C.", "al.", "all-path", "art.", "Art.", "artt.", "att.", + "avv.", + "Avv." "by-pass", "c.d.", + "c/c", + "C.so", "centro-sinistra", "check-up", "Civ.", @@ -48,6 +53,8 @@ for orth in [ "prof.", "sett.", "s.p.a.", + "s.n.c", + "s.r.l", "ss.", "St.", "tel.", diff --git a/spacy/language.py b/spacy/language.py index 04a5e843e..6f6470533 100644 --- a/spacy/language.py +++ b/spacy/language.py @@ -682,9 +682,14 @@ class Language: name (str): Optional alternative name to use in current pipeline. RETURNS (Tuple[Callable, str]): The component and its factory name. """ - # TODO: handle errors and mismatches (vectors etc.) 
- if not isinstance(source, self.__class__): + # Check source type + if not isinstance(source, Language): raise ValueError(Errors.E945.format(name=source_name, source=type(source))) + # Check vectors, with faster checks first + if self.vocab.vectors.shape != source.vocab.vectors.shape or \ + self.vocab.vectors.key2row != source.vocab.vectors.key2row or \ + self.vocab.vectors.to_bytes() != source.vocab.vectors.to_bytes(): + util.logger.warning(Warnings.W113.format(name=source_name)) if not source_name in source.component_names: raise KeyError( Errors.E944.format( @@ -1673,7 +1678,16 @@ class Language: # model with the same vocab as the current nlp object source_nlps[model] = util.load_model(model, vocab=nlp.vocab) source_name = pipe_cfg.get("component", pipe_name) + listeners_replaced = False + if "replace_listeners" in pipe_cfg: + for name, proc in source_nlps[model].pipeline: + if source_name in getattr(proc, "listening_components", []): + source_nlps[model].replace_listeners(name, source_name, pipe_cfg["replace_listeners"]) + listeners_replaced = True nlp.add_pipe(source_name, source=source_nlps[model], name=pipe_name) + # Delete from cache if listeners were replaced + if listeners_replaced: + del source_nlps[model] disabled_pipes = [*config["nlp"]["disabled"], *disable] nlp._disabled = set(p for p in disabled_pipes if p not in exclude) nlp.batch_size = config["nlp"]["batch_size"] diff --git a/spacy/matcher/dependencymatcher.pyx b/spacy/matcher/dependencymatcher.pyx index 4124696b3..0e601281a 100644 --- a/spacy/matcher/dependencymatcher.pyx +++ b/spacy/matcher/dependencymatcher.pyx @@ -299,7 +299,7 @@ cdef class DependencyMatcher: if isinstance(doclike, Doc): doc = doclike elif isinstance(doclike, Span): - doc = doclike.as_doc() + doc = doclike.as_doc(copy_user_data=True) else: raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__)) diff --git a/spacy/matcher/matcher.pxd b/spacy/matcher/matcher.pxd index 52a30d94c..455f978cc 100644 --- a/spacy/matcher/matcher.pxd +++ b/spacy/matcher/matcher.pxd @@ -46,6 +46,12 @@ cdef struct TokenPatternC: int32_t nr_py quantifier_t quantifier hash_t key + int32_t token_idx + + +cdef struct MatchAlignmentC: + int32_t token_idx + int32_t length cdef struct PatternStateC: diff --git a/spacy/matcher/matcher.pyx b/spacy/matcher/matcher.pyx index 26dca05eb..dae12c3f6 100644 --- a/spacy/matcher/matcher.pyx +++ b/spacy/matcher/matcher.pyx @@ -196,7 +196,7 @@ cdef class Matcher: else: yield doc - def __call__(self, object doclike, *, as_spans=False, allow_missing=False): + def __call__(self, object doclike, *, as_spans=False, allow_missing=False, with_alignments=False): """Find all token sequences matching the supplied pattern. doclike (Doc or Span): The document to match over. @@ -204,10 +204,16 @@ cdef class Matcher: start, end) tuples. allow_missing (bool): Whether to skip checks for missing annotation for attributes included in patterns. Defaults to False. + with_alignments (bool): Return match alignment information, which is + `List[int]` with length of matched span. Each entry denotes the + corresponding index of token pattern. If as_spans is set to True, + this setting is ignored. RETURNS (list): A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is an integer. If as_spans is set to True, a list of Span objects is returned. 
+ If with_alignments is set to True and as_spans is set to False, + A list of `(match_id, start, end, alignments)` tuples is returned. """ if isinstance(doclike, Doc): doc = doclike @@ -217,6 +223,9 @@ cdef class Matcher: length = doclike.end - doclike.start else: raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__)) + # Skip alignments calculations if as_spans is set + if as_spans: + with_alignments = False cdef Pool tmp_pool = Pool() if not allow_missing: for attr in (TAG, POS, MORPH, LEMMA, DEP): @@ -232,18 +241,20 @@ cdef class Matcher: error_msg = Errors.E155.format(pipe=pipe, attr=self.vocab.strings.as_string(attr)) raise ValueError(error_msg) matches = find_matches(&self.patterns[0], self.patterns.size(), doclike, length, - extensions=self._extensions, predicates=self._extra_predicates) + extensions=self._extensions, predicates=self._extra_predicates, with_alignments=with_alignments) final_matches = [] pairs_by_id = {} - # For each key, either add all matches, or only the filtered, non-overlapping ones - for (key, start, end) in matches: + # For each key, either add all matches, or only the filtered, + # non-overlapping ones this `match` can be either (start, end) or + # (start, end, alignments) depending on `with_alignments=` option. + for key, *match in matches: span_filter = self._filter.get(key) if span_filter is not None: pairs = pairs_by_id.get(key, []) - pairs.append((start,end)) + pairs.append(match) pairs_by_id[key] = pairs else: - final_matches.append((key, start, end)) + final_matches.append((key, *match)) matched = tmp_pool.alloc(length, sizeof(char)) empty = tmp_pool.alloc(length, sizeof(char)) for key, pairs in pairs_by_id.items(): @@ -255,14 +266,18 @@ cdef class Matcher: sorted_pairs = sorted(pairs, key=lambda x: (x[1]-x[0], -x[0]), reverse=True) # reverse sort by length else: raise ValueError(Errors.E947.format(expected=["FIRST", "LONGEST"], arg=span_filter)) - for (start, end) in sorted_pairs: + for match in sorted_pairs: + start, end = match[:2] assert 0 <= start < end # Defend against segfaults span_len = end-start # If no tokens in the span have matched if memcmp(&matched[start], &empty[start], span_len * sizeof(matched[0])) == 0: - final_matches.append((key, start, end)) + final_matches.append((key, *match)) # Mark tokens that have matched memset(&matched[start], 1, span_len * sizeof(matched[0])) + if with_alignments: + final_matches_with_alignments = final_matches + final_matches = [(key, start, end) for key, start, end, alignments in final_matches] # perform the callbacks on the filtered set of results for i, (key, start, end) in enumerate(final_matches): on_match = self._callbacks.get(key, None) @@ -270,6 +285,22 @@ cdef class Matcher: on_match(self, doc, i, final_matches) if as_spans: return [Span(doc, start, end, label=key) for key, start, end in final_matches] + elif with_alignments: + # convert alignments List[Dict[str, int]] --> List[int] + final_matches = [] + # when multiple alignment (belongs to the same length) is found, + # keeps the alignment that has largest token_idx + for key, start, end, alignments in final_matches_with_alignments: + sorted_alignments = sorted(alignments, key=lambda x: (x['length'], x['token_idx']), reverse=False) + alignments = [0] * (end-start) + for align in sorted_alignments: + if align['length'] >= end-start: + continue + # Since alignments are sorted in order of (length, token_idx) + # this overwrites smaller token_idx when they have same length. 
+ alignments[align['length']] = align['token_idx'] + final_matches.append((key, start, end, alignments)) + return final_matches else: return final_matches @@ -288,9 +319,9 @@ def unpickle_matcher(vocab, patterns, callbacks): return matcher -cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, extensions=None, predicates=tuple()): +cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, extensions=None, predicates=tuple(), bint with_alignments=0): """Find matches in a doc, with a compiled array of patterns. Matches are - returned as a list of (id, start, end) tuples. + returned as a list of (id, start, end) tuples or (id, start, end, alignments) tuples (if with_alignments != 0) To augment the compiled patterns, we optionally also take two Python lists. @@ -302,6 +333,8 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e """ cdef vector[PatternStateC] states cdef vector[MatchC] matches + cdef vector[vector[MatchAlignmentC]] align_states + cdef vector[vector[MatchAlignmentC]] align_matches cdef PatternStateC state cdef int i, j, nr_extra_attr cdef Pool mem = Pool() @@ -328,12 +361,14 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e for i in range(length): for j in range(n): states.push_back(PatternStateC(patterns[j], i, 0)) - transition_states(states, matches, predicate_cache, - doclike[i], extra_attr_values, predicates) + if with_alignments != 0: + align_states.resize(states.size()) + transition_states(states, matches, align_states, align_matches, predicate_cache, + doclike[i], extra_attr_values, predicates, with_alignments) extra_attr_values += nr_extra_attr predicate_cache += len(predicates) # Handle matches that end in 0-width patterns - finish_states(matches, states) + finish_states(matches, states, align_matches, align_states, with_alignments) seen = set() for i in range(matches.size()): match = ( @@ -346,16 +381,22 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e # first .?, or the second .? -- it doesn't matter, it's just one match. # Skip 0-length matches. (TODO: fix algorithm) if match not in seen and matches[i].length > 0: - output.append(match) + if with_alignments != 0: + # since the length of align_matches equals to that of match, we can share same 'i' + output.append(match + (align_matches[i],)) + else: + output.append(match) seen.add(match) return output cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& matches, + vector[vector[MatchAlignmentC]]& align_states, vector[vector[MatchAlignmentC]]& align_matches, int8_t* cached_py_predicates, - Token token, const attr_t* extra_attrs, py_predicates) except *: + Token token, const attr_t* extra_attrs, py_predicates, bint with_alignments) except *: cdef int q = 0 cdef vector[PatternStateC] new_states + cdef vector[vector[MatchAlignmentC]] align_new_states cdef int nr_predicate = len(py_predicates) for i in range(states.size()): if states[i].pattern.nr_py >= 1: @@ -370,23 +411,39 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match # it in the states list, because q doesn't advance. state = states[i] states[q] = state + # Separate from states, performance is guaranteed for users who only need basic options (without alignments). + # `align_states` always corresponds to `states` 1:1. 
+ if with_alignments != 0: + align_state = align_states[i] + align_states[q] = align_state while action in (RETRY, RETRY_ADVANCE, RETRY_EXTEND): + # Update alignment before the transition of current state + # 'MatchAlignmentC' maps 'original token index of current pattern' to 'current matching length' + if with_alignments != 0: + align_states[q].push_back(MatchAlignmentC(states[q].pattern.token_idx, states[q].length)) if action == RETRY_EXTEND: # This handles the 'extend' new_states.push_back( PatternStateC(pattern=states[q].pattern, start=state.start, length=state.length+1)) + if with_alignments != 0: + align_new_states.push_back(align_states[q]) if action == RETRY_ADVANCE: # This handles the 'advance' new_states.push_back( PatternStateC(pattern=states[q].pattern+1, start=state.start, length=state.length+1)) + if with_alignments != 0: + align_new_states.push_back(align_states[q]) states[q].pattern += 1 if states[q].pattern.nr_py != 0: update_predicate_cache(cached_py_predicates, states[q].pattern, token, py_predicates) action = get_action(states[q], token.c, extra_attrs, cached_py_predicates) + # Update alignment before the transition of current state + if with_alignments != 0: + align_states[q].push_back(MatchAlignmentC(states[q].pattern.token_idx, states[q].length)) if action == REJECT: pass elif action == ADVANCE: @@ -399,29 +456,50 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match matches.push_back( MatchC(pattern_id=ent_id, start=state.start, length=state.length+1)) + # `align_matches` always corresponds to `matches` 1:1 + if with_alignments != 0: + align_matches.push_back(align_states[q]) elif action == MATCH_DOUBLE: # push match without last token if length > 0 if state.length > 0: matches.push_back( MatchC(pattern_id=ent_id, start=state.start, length=state.length)) + # MATCH_DOUBLE emits matches twice, + # add one more to align_matches in order to keep 1:1 relationship + if with_alignments != 0: + align_matches.push_back(align_states[q]) # push match with last token matches.push_back( MatchC(pattern_id=ent_id, start=state.start, length=state.length+1)) + # `align_matches` always corresponds to `matches` 1:1 + if with_alignments != 0: + align_matches.push_back(align_states[q]) elif action == MATCH_REJECT: matches.push_back( MatchC(pattern_id=ent_id, start=state.start, length=state.length)) + # `align_matches` always corresponds to `matches` 1:1 + if with_alignments != 0: + align_matches.push_back(align_states[q]) elif action == MATCH_EXTEND: matches.push_back( MatchC(pattern_id=ent_id, start=state.start, length=state.length)) + # `align_matches` always corresponds to `matches` 1:1 + if with_alignments != 0: + align_matches.push_back(align_states[q]) states[q].length += 1 q += 1 states.resize(q) for i in range(new_states.size()): states.push_back(new_states[i]) + # `align_states` always corresponds to `states` 1:1 + if with_alignments != 0: + align_states.resize(q) + for i in range(align_new_states.size()): + align_states.push_back(align_new_states[i]) cdef int update_predicate_cache(int8_t* cache, @@ -444,15 +522,27 @@ cdef int update_predicate_cache(int8_t* cache, raise ValueError(Errors.E125.format(value=result)) -cdef void finish_states(vector[MatchC]& matches, vector[PatternStateC]& states) except *: +cdef void finish_states(vector[MatchC]& matches, vector[PatternStateC]& states, + vector[vector[MatchAlignmentC]]& align_matches, + vector[vector[MatchAlignmentC]]& align_states, + bint with_alignments) except *: """Handle states that end in 
zero-width patterns.""" cdef PatternStateC state + cdef vector[MatchAlignmentC] align_state for i in range(states.size()): state = states[i] + if with_alignments != 0: + align_state = align_states[i] while get_quantifier(state) in (ZERO_PLUS, ZERO_ONE): + # Update alignment before the transition of current state + if with_alignments != 0: + align_state.push_back(MatchAlignmentC(state.pattern.token_idx, state.length)) is_final = get_is_final(state) if is_final: ent_id = get_ent_id(state.pattern) + # `align_matches` always corresponds to `matches` 1:1 + if with_alignments != 0: + align_matches.push_back(align_state) matches.push_back( MatchC(pattern_id=ent_id, start=state.start, length=state.length)) break @@ -607,7 +697,7 @@ cdef int8_t get_quantifier(PatternStateC state) nogil: cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs) except NULL: pattern = mem.alloc(len(token_specs) + 1, sizeof(TokenPatternC)) cdef int i, index - for i, (quantifier, spec, extensions, predicates) in enumerate(token_specs): + for i, (quantifier, spec, extensions, predicates, token_idx) in enumerate(token_specs): pattern[i].quantifier = quantifier # Ensure attrs refers to a null pointer if nr_attr == 0 if len(spec) > 0: @@ -628,6 +718,7 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs) pattern[i].py_predicates[j] = index pattern[i].nr_py = len(predicates) pattern[i].key = hash64(pattern[i].attrs, pattern[i].nr_attr * sizeof(AttrValueC), 0) + pattern[i].token_idx = token_idx i = len(token_specs) # Use quantifier to identify final ID pattern node (rather than previous # uninitialized quantifier == 0/ZERO + nr_attr == 0 + non-zero-length attrs) @@ -638,6 +729,7 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs) pattern[i].nr_attr = 1 pattern[i].nr_extra_attr = 0 pattern[i].nr_py = 0 + pattern[i].token_idx = -1 return pattern @@ -655,7 +747,7 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates): """This function interprets the pattern, converting the various bits of syntactic sugar before we compile it into a struct with init_pattern. - We need to split the pattern up into three parts: + We need to split the pattern up into four parts: * Normal attribute/value pairs, which are stored on either the token or lexeme, can be handled directly. * Extension attributes are handled specially, as we need to prefetch the @@ -664,13 +756,14 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates): functions and store them. So we store these specially as well. * Extension attributes that have extra predicates are stored within the extra_predicates. + * Token index that this pattern belongs to. 
""" tokens = [] string_store = vocab.strings - for spec in token_specs: + for token_idx, spec in enumerate(token_specs): if not spec: # Signifier for 'any token' - tokens.append((ONE, [(NULL_ATTR, 0)], [], [])) + tokens.append((ONE, [(NULL_ATTR, 0)], [], [], token_idx)) continue if not isinstance(spec, dict): raise ValueError(Errors.E154.format()) @@ -679,7 +772,7 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates): extensions = _get_extensions(spec, string_store, extensions_table) predicates = _get_extra_predicates(spec, extra_predicates, vocab) for op in ops: - tokens.append((op, list(attr_values), list(extensions), list(predicates))) + tokens.append((op, list(attr_values), list(extensions), list(predicates), token_idx)) return tokens diff --git a/spacy/ml/_character_embed.py b/spacy/ml/_character_embed.py index f5c539c42..0ed28b859 100644 --- a/spacy/ml/_character_embed.py +++ b/spacy/ml/_character_embed.py @@ -3,8 +3,10 @@ from thinc.api import Model from thinc.types import Floats2d from ..tokens import Doc +from ..util import registry +@registry.layers("spacy.CharEmbed.v1") def CharacterEmbed(nM: int, nC: int) -> Model[List[Doc], List[Floats2d]]: # nM: Number of dimensions per character. nC: Number of characters. return Model( diff --git a/spacy/ml/models/tok2vec.py b/spacy/ml/models/tok2vec.py index 5790af631..76ec87054 100644 --- a/spacy/ml/models/tok2vec.py +++ b/spacy/ml/models/tok2vec.py @@ -31,7 +31,7 @@ def get_tok2vec_width(model: Model): return nO -@registry.architectures("spacy.HashEmbedCNN.v1") +@registry.architectures("spacy.HashEmbedCNN.v2") def build_hash_embed_cnn_tok2vec( *, width: int, @@ -108,7 +108,7 @@ def build_Tok2Vec_model( return tok2vec -@registry.architectures("spacy.MultiHashEmbed.v1") +@registry.architectures("spacy.MultiHashEmbed.v2") def MultiHashEmbed( width: int, attrs: List[Union[str, int]], @@ -182,7 +182,7 @@ def MultiHashEmbed( return model -@registry.architectures("spacy.CharacterEmbed.v1") +@registry.architectures("spacy.CharacterEmbed.v2") def CharacterEmbed( width: int, rows: int, diff --git a/spacy/ml/staticvectors.py b/spacy/ml/staticvectors.py index ea4c7fb77..4e7262e7d 100644 --- a/spacy/ml/staticvectors.py +++ b/spacy/ml/staticvectors.py @@ -8,7 +8,7 @@ from ..tokens import Doc from ..errors import Errors -@registry.layers("spacy.StaticVectors.v1") +@registry.layers("spacy.StaticVectors.v2") def StaticVectors( nO: Optional[int] = None, nM: Optional[int] = None, @@ -38,7 +38,7 @@ def forward( return _handle_empty(model.ops, model.get_dim("nO")) key_attr = model.attrs["key_attr"] W = cast(Floats2d, model.ops.as_contig(model.get_param("W"))) - V = cast(Floats2d, docs[0].vocab.vectors.data) + V = cast(Floats2d, model.ops.asarray(docs[0].vocab.vectors.data)) rows = model.ops.flatten( [doc.vocab.vectors.find(keys=doc.to_array(key_attr)) for doc in docs] ) @@ -46,6 +46,8 @@ def forward( vectors_data = model.ops.gemm(model.ops.as_contig(V[rows]), W, trans2=True) except ValueError: raise RuntimeError(Errors.E896) + # Convert negative indices to 0-vectors (TODO: more options for UNK tokens) + vectors_data[rows < 0] = 0 output = Ragged( vectors_data, model.ops.asarray([len(doc) for doc in docs], dtype="i") ) diff --git a/spacy/pipeline/dep_parser.pyx b/spacy/pipeline/dep_parser.pyx index 7290c4637..37f09ce3a 100644 --- a/spacy/pipeline/dep_parser.pyx +++ b/spacy/pipeline/dep_parser.pyx @@ -24,7 +24,7 @@ maxout_pieces = 2 use_upper = true [model.tok2vec] -@architectures = "spacy.HashEmbedCNN.v1" +@architectures = 
"spacy.HashEmbedCNN.v2" pretrained_vectors = null width = 96 depth = 4 diff --git a/spacy/pipeline/entity_linker.py b/spacy/pipeline/entity_linker.py index 630057c3f..66070916e 100644 --- a/spacy/pipeline/entity_linker.py +++ b/spacy/pipeline/entity_linker.py @@ -26,7 +26,7 @@ default_model_config = """ @architectures = "spacy.EntityLinker.v1" [model.tok2vec] -@architectures = "spacy.HashEmbedCNN.v1" +@architectures = "spacy.HashEmbedCNN.v2" pretrained_vectors = null width = 96 depth = 2 @@ -300,77 +300,77 @@ class EntityLinker(TrainablePipe): for i, doc in enumerate(docs): sentences = [s for s in doc.sents] if len(doc) > 0: - # Looping through each sentence and each entity - # This may go wrong if there are entities across sentences - which shouldn't happen normally. - for sent_index, sent in enumerate(sentences): - if sent.ents: - # get n_neightbour sentences, clipped to the length of the document - start_sentence = max(0, sent_index - self.n_sents) - end_sentence = min( - len(sentences) - 1, sent_index + self.n_sents - ) - start_token = sentences[start_sentence].start - end_token = sentences[end_sentence].end - sent_doc = doc[start_token:end_token].as_doc() - # currently, the context is the same for each entity in a sentence (should be refined) - xp = self.model.ops.xp - if self.incl_context: - sentence_encoding = self.model.predict([sent_doc])[0] - sentence_encoding_t = sentence_encoding.T - sentence_norm = xp.linalg.norm(sentence_encoding_t) - for ent in sent.ents: - entity_count += 1 - if ent.label_ in self.labels_discard: - # ignoring this entity - setting to NIL - final_kb_ids.append(self.NIL) - else: - candidates = self.get_candidates(self.kb, ent) - if not candidates: - # no prediction possible for this entity - setting to NIL - final_kb_ids.append(self.NIL) - elif len(candidates) == 1: - # shortcut for efficiency reasons: take the 1 candidate - # TODO: thresholding - final_kb_ids.append(candidates[0].entity_) - else: - random.shuffle(candidates) - # set all prior probabilities to 0 if incl_prior=False - prior_probs = xp.asarray( - [c.prior_prob for c in candidates] + # Looping through each entity (TODO: rewrite) + for ent in doc.ents: + sent = ent.sent + sent_index = sentences.index(sent) + assert sent_index >= 0 + # get n_neightbour sentences, clipped to the length of the document + start_sentence = max(0, sent_index - self.n_sents) + end_sentence = min( + len(sentences) - 1, sent_index + self.n_sents + ) + start_token = sentences[start_sentence].start + end_token = sentences[end_sentence].end + sent_doc = doc[start_token:end_token].as_doc() + # currently, the context is the same for each entity in a sentence (should be refined) + xp = self.model.ops.xp + if self.incl_context: + sentence_encoding = self.model.predict([sent_doc])[0] + sentence_encoding_t = sentence_encoding.T + sentence_norm = xp.linalg.norm(sentence_encoding_t) + entity_count += 1 + if ent.label_ in self.labels_discard: + # ignoring this entity - setting to NIL + final_kb_ids.append(self.NIL) + else: + candidates = self.get_candidates(self.kb, ent) + if not candidates: + # no prediction possible for this entity - setting to NIL + final_kb_ids.append(self.NIL) + elif len(candidates) == 1: + # shortcut for efficiency reasons: take the 1 candidate + # TODO: thresholding + final_kb_ids.append(candidates[0].entity_) + else: + random.shuffle(candidates) + # set all prior probabilities to 0 if incl_prior=False + prior_probs = xp.asarray( + [c.prior_prob for c in candidates] + ) + if not self.incl_prior: + prior_probs 
= xp.asarray( + [0.0 for _ in candidates] + ) + scores = prior_probs + # add in similarity from the context + if self.incl_context: + entity_encodings = xp.asarray( + [c.entity_vector for c in candidates] + ) + entity_norm = xp.linalg.norm( + entity_encodings, axis=1 + ) + if len(entity_encodings) != len(prior_probs): + raise RuntimeError( + Errors.E147.format( + method="predict", + msg="vectors not of equal length", + ) ) - if not self.incl_prior: - prior_probs = xp.asarray( - [0.0 for _ in candidates] - ) - scores = prior_probs - # add in similarity from the context - if self.incl_context: - entity_encodings = xp.asarray( - [c.entity_vector for c in candidates] - ) - entity_norm = xp.linalg.norm( - entity_encodings, axis=1 - ) - if len(entity_encodings) != len(prior_probs): - raise RuntimeError( - Errors.E147.format( - method="predict", - msg="vectors not of equal length", - ) - ) - # cosine similarity - sims = xp.dot( - entity_encodings, sentence_encoding_t - ) / (sentence_norm * entity_norm) - if sims.shape != prior_probs.shape: - raise ValueError(Errors.E161) - scores = ( - prior_probs + sims - (prior_probs * sims) - ) - # TODO: thresholding - best_index = scores.argmax().item() - best_candidate = candidates[best_index] - final_kb_ids.append(best_candidate.entity_) + # cosine similarity + sims = xp.dot( + entity_encodings, sentence_encoding_t + ) / (sentence_norm * entity_norm) + if sims.shape != prior_probs.shape: + raise ValueError(Errors.E161) + scores = ( + prior_probs + sims - (prior_probs * sims) + ) + # TODO: thresholding + best_index = scores.argmax().item() + best_candidate = candidates[best_index] + final_kb_ids.append(best_candidate.entity_) if not (len(final_kb_ids) == entity_count): err = Errors.E147.format( method="predict", msg="result variables not of equal length" diff --git a/spacy/pipeline/lemmatizer.py b/spacy/pipeline/lemmatizer.py index 21f1a8a8b..cfe405efa 100644 --- a/spacy/pipeline/lemmatizer.py +++ b/spacy/pipeline/lemmatizer.py @@ -175,7 +175,7 @@ class Lemmatizer(Pipe): DOCS: https://spacy.io/api/lemmatizer#rule_lemmatize """ - cache_key = (token.orth, token.pos, token.morph) + cache_key = (token.orth, token.pos, token.morph.key) if cache_key in self.cache: return self.cache[cache_key] string = token.text diff --git a/spacy/pipeline/morphologizer.pyx b/spacy/pipeline/morphologizer.pyx index cd0081346..3ba05e616 100644 --- a/spacy/pipeline/morphologizer.pyx +++ b/spacy/pipeline/morphologizer.pyx @@ -27,7 +27,7 @@ default_model_config = """ @architectures = "spacy.Tok2Vec.v2" [model.tok2vec.embed] -@architectures = "spacy.CharacterEmbed.v1" +@architectures = "spacy.CharacterEmbed.v2" width = 128 rows = 7000 nM = 64 diff --git a/spacy/pipeline/multitask.pyx b/spacy/pipeline/multitask.pyx index 990b6a1de..8c44061e2 100644 --- a/spacy/pipeline/multitask.pyx +++ b/spacy/pipeline/multitask.pyx @@ -22,7 +22,7 @@ maxout_pieces = 3 token_vector_width = 96 [model.tok2vec] -@architectures = "spacy.HashEmbedCNN.v1" +@architectures = "spacy.HashEmbedCNN.v2" pretrained_vectors = null width = 96 depth = 4 diff --git a/spacy/pipeline/ner.pyx b/spacy/pipeline/ner.pyx index 3a2151b01..0b9b0d324 100644 --- a/spacy/pipeline/ner.pyx +++ b/spacy/pipeline/ner.pyx @@ -21,7 +21,7 @@ maxout_pieces = 2 use_upper = true [model.tok2vec] -@architectures = "spacy.HashEmbedCNN.v1" +@architectures = "spacy.HashEmbedCNN.v2" pretrained_vectors = null width = 96 depth = 4 diff --git a/spacy/pipeline/senter.pyx b/spacy/pipeline/senter.pyx index 83cd06739..f9472abf5 100644 --- 
a/spacy/pipeline/senter.pyx +++ b/spacy/pipeline/senter.pyx @@ -19,7 +19,7 @@ default_model_config = """ @architectures = "spacy.Tagger.v1" [model.tok2vec] -@architectures = "spacy.HashEmbedCNN.v1" +@architectures = "spacy.HashEmbedCNN.v2" pretrained_vectors = null width = 12 depth = 1 diff --git a/spacy/pipeline/tagger.pyx b/spacy/pipeline/tagger.pyx index 9af5245c1..938131f6f 100644 --- a/spacy/pipeline/tagger.pyx +++ b/spacy/pipeline/tagger.pyx @@ -26,7 +26,7 @@ default_model_config = """ @architectures = "spacy.Tagger.v1" [model.tok2vec] -@architectures = "spacy.HashEmbedCNN.v1" +@architectures = "spacy.HashEmbedCNN.v2" pretrained_vectors = null width = 96 depth = 4 diff --git a/spacy/pipeline/textcat.py b/spacy/pipeline/textcat.py index 174ffd273..1d652a483 100644 --- a/spacy/pipeline/textcat.py +++ b/spacy/pipeline/textcat.py @@ -21,7 +21,7 @@ single_label_default_config = """ @architectures = "spacy.Tok2Vec.v2" [model.tok2vec.embed] -@architectures = "spacy.MultiHashEmbed.v1" +@architectures = "spacy.MultiHashEmbed.v2" width = 64 rows = [2000, 2000, 1000, 1000, 1000, 1000] attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"] @@ -56,7 +56,7 @@ single_label_cnn_config = """ exclusive_classes = true [model.tok2vec] -@architectures = "spacy.HashEmbedCNN.v1" +@architectures = "spacy.HashEmbedCNN.v2" pretrained_vectors = null width = 96 depth = 4 diff --git a/spacy/pipeline/textcat_multilabel.py b/spacy/pipeline/textcat_multilabel.py index 036bc8dc5..7267735b4 100644 --- a/spacy/pipeline/textcat_multilabel.py +++ b/spacy/pipeline/textcat_multilabel.py @@ -21,7 +21,7 @@ multi_label_default_config = """ @architectures = "spacy.Tok2Vec.v1" [model.tok2vec.embed] -@architectures = "spacy.MultiHashEmbed.v1" +@architectures = "spacy.MultiHashEmbed.v2" width = 64 rows = [2000, 2000, 1000, 1000, 1000, 1000] attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"] @@ -56,7 +56,7 @@ multi_label_cnn_config = """ exclusive_classes = false [model.tok2vec] -@architectures = "spacy.HashEmbedCNN.v1" +@architectures = "spacy.HashEmbedCNN.v2" pretrained_vectors = null width = 96 depth = 4 diff --git a/spacy/pipeline/tok2vec.py b/spacy/pipeline/tok2vec.py index 26a4c998c..3ee324d50 100644 --- a/spacy/pipeline/tok2vec.py +++ b/spacy/pipeline/tok2vec.py @@ -11,7 +11,7 @@ from ..errors import Errors default_model_config = """ [model] -@architectures = "spacy.HashEmbedCNN.v1" +@architectures = "spacy.HashEmbedCNN.v2" pretrained_vectors = null width = 96 depth = 4 diff --git a/spacy/scorer.py b/spacy/scorer.py index f28cb5639..25df44f14 100644 --- a/spacy/scorer.py +++ b/spacy/scorer.py @@ -20,10 +20,16 @@ MISSING_VALUES = frozenset([None, 0, ""]) class PRFScore: """A precision / recall / F score.""" - def __init__(self) -> None: - self.tp = 0 - self.fp = 0 - self.fn = 0 + def __init__( + self, + *, + tp: int = 0, + fp: int = 0, + fn: int = 0, + ) -> None: + self.tp = tp + self.fp = fp + self.fn = fn def __len__(self) -> int: return self.tp + self.fp + self.fn @@ -305,6 +311,8 @@ class Scorer: *, getter: Callable[[Doc, str], Iterable[Span]] = getattr, has_annotation: Optional[Callable[[Doc], bool]] = None, + labeled: bool = True, + allow_overlap: bool = False, **cfg, ) -> Dict[str, Any]: """Returns PRF scores for labeled spans. @@ -317,6 +325,11 @@ class Scorer: has_annotation (Optional[Callable[[Doc], bool]]) should return whether a `Doc` has annotation for this `attr`. Docs without annotation are skipped for scoring purposes. 
+ labeled (bool): Whether or not to include label information in + the evaluation. If set to 'False', two spans will be considered + equal if their start and end match, irrespective of their label. + allow_overlap (bool): Whether or not to allow overlapping spans. + If set to 'False', the alignment will automatically resolve conflicts. RETURNS (Dict[str, Any]): A dictionary containing the PRF scores under the keys attr_p/r/f and the per-type PRF scores under attr_per_type. @@ -345,33 +358,42 @@ class Scorer: gold_spans = set() pred_spans = set() for span in getter(gold_doc, attr): - gold_span = (span.label_, span.start, span.end - 1) + if labeled: + gold_span = (span.label_, span.start, span.end - 1) + else: + gold_span = (span.start, span.end - 1) gold_spans.add(gold_span) - gold_per_type[span.label_].add((span.label_, span.start, span.end - 1)) + gold_per_type[span.label_].add(gold_span) pred_per_type = {label: set() for label in labels} - for span in example.get_aligned_spans_x2y(getter(pred_doc, attr)): - pred_spans.add((span.label_, span.start, span.end - 1)) - pred_per_type[span.label_].add((span.label_, span.start, span.end - 1)) + for span in example.get_aligned_spans_x2y(getter(pred_doc, attr), allow_overlap): + if labeled: + pred_span = (span.label_, span.start, span.end - 1) + else: + pred_span = (span.start, span.end - 1) + pred_spans.add(pred_span) + pred_per_type[span.label_].add(pred_span) # Scores per label - for k, v in score_per_type.items(): - if k in pred_per_type: - v.score_set(pred_per_type[k], gold_per_type[k]) + if labeled: + for k, v in score_per_type.items(): + if k in pred_per_type: + v.score_set(pred_per_type[k], gold_per_type[k]) # Score for all labels score.score_set(pred_spans, gold_spans) - if len(score) > 0: - return { - f"{attr}_p": score.precision, - f"{attr}_r": score.recall, - f"{attr}_f": score.fscore, - f"{attr}_per_type": {k: v.to_dict() for k, v in score_per_type.items()}, - } - else: - return { + # Assemble final result + final_scores = { f"{attr}_p": None, f"{attr}_r": None, f"{attr}_f": None, - f"{attr}_per_type": None, } + if labeled: + final_scores[f"{attr}_per_type"] = None + if len(score) > 0: + final_scores[f"{attr}_p"] = score.precision + final_scores[f"{attr}_r"] = score.recall + final_scores[f"{attr}_f"] = score.fscore + if labeled: + final_scores[f"{attr}_per_type"] = {k: v.to_dict() for k, v in score_per_type.items()} + return final_scores @staticmethod def score_cats( diff --git a/spacy/strings.pyx b/spacy/strings.pyx index 6a1d68221..4a20cb8af 100644 --- a/spacy/strings.pyx +++ b/spacy/strings.pyx @@ -223,7 +223,7 @@ cdef class StringStore: it doesn't exist. Paths may be either strings or Path-like objects. """ path = util.ensure_path(path) - strings = list(self) + strings = sorted(self) srsly.write_json(path, strings) def from_disk(self, path): @@ -247,7 +247,7 @@ cdef class StringStore: RETURNS (bytes): The serialized form of the `StringStore` object. """ - return srsly.json_dumps(list(self)) + return srsly.json_dumps(sorted(self)) def from_bytes(self, bytes_data, **kwargs): """Load state from a binary string. 
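Usage note on the `Scorer.score_spans` changes above: the new `labeled` and `allow_overlap` keyword arguments make it possible to score span annotations stored under a custom key in `doc.spans`, either ignoring labels or keeping overlapping spans instead of dropping them during alignment. The snippet below is a minimal usage sketch, not part of the patch itself; it mirrors the `test_score_spans` test added further down in this diff, and the span key "my_spans" and example sentence are illustrative only.

from spacy.lang.en import English
from spacy.scorer import Scorer
from spacy.training import Example

nlp = English()
text = "This is just a random sentence."
key = "my_spans"
gold = nlp.make_doc(text)
pred = nlp.make_doc(text)
# overlapping, labeled spans stored under a custom spans key
spans = [
    gold.char_span(0, 4, label="PERSON"),  # "This"
    gold.char_span(0, 7, label="ORG"),     # "This is" (overlaps the first span)
    gold.char_span(8, 12, label="ORG"),    # "just"
]
gold.spans[key] = spans
pred.spans[key] = spans

def span_getter(doc, span_key):
    return doc.spans[span_key]

example = Example(pred, gold)
# default behaviour: labeled spans, overlaps resolved during alignment,
# so even identical predictions score precision 1.0 but recall < 1.0
scores = Scorer.score_spans([example], attr=key, getter=span_getter)
# keep overlapping spans and ignore labels: identical predictions now score
# P = R = 1.0, and no "my_spans_per_type" entry is included in the result
scores = Scorer.score_spans(
    [example], attr=key, getter=span_getter, labeled=False, allow_overlap=True
)
print(scores[f"{key}_p"], scores[f"{key}_r"], scores[f"{key}_f"])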
diff --git a/spacy/tests/doc/test_doc_api.py b/spacy/tests/doc/test_doc_api.py index c27139d2f..d7452a802 100644 --- a/spacy/tests/doc/test_doc_api.py +++ b/spacy/tests/doc/test_doc_api.py @@ -6,12 +6,14 @@ import logging import mock from spacy.lang.xx import MultiLanguage -from spacy.tokens import Doc, Span +from spacy.tokens import Doc, Span, Token from spacy.vocab import Vocab from spacy.lexeme import Lexeme from spacy.lang.en import English from spacy.attrs import ENT_TYPE, ENT_IOB, SENT_START, HEAD, DEP, MORPH +from .test_underscore import clean_underscore # noqa: F401 + def test_doc_api_init(en_vocab): words = ["a", "b", "c", "d"] @@ -347,15 +349,19 @@ def test_doc_from_array_morph(en_vocab): assert [str(t.morph) for t in doc] == [str(t.morph) for t in new_doc] +@pytest.mark.usefixtures("clean_underscore") def test_doc_api_from_docs(en_tokenizer, de_tokenizer): en_texts = ["Merging the docs is fun.", "", "They don't think alike."] en_texts_without_empty = [t for t in en_texts if len(t)] de_text = "Wie war die Frage?" en_docs = [en_tokenizer(text) for text in en_texts] - docs_idx = en_texts[0].index("docs") + en_docs[0].spans["group"] = [en_docs[0][1:4]] + en_docs[2].spans["group"] = [en_docs[2][1:4]] + span_group_texts = sorted([en_docs[0][1:4].text, en_docs[2][1:4].text]) de_doc = de_tokenizer(de_text) - expected = (True, None, None, None) - en_docs[0].user_data[("._.", "is_ambiguous", docs_idx, None)] = expected + Token.set_extension("is_ambiguous", default=False) + en_docs[0][2]._.is_ambiguous = True # docs + en_docs[2][3]._.is_ambiguous = True # think assert Doc.from_docs([]) is None assert de_doc is not Doc.from_docs([de_doc]) assert str(de_doc) == str(Doc.from_docs([de_doc])) @@ -372,11 +378,12 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer): en_docs_tokens = [t for doc in en_docs for t in doc] assert len(m_doc) == len(en_docs_tokens) think_idx = len(en_texts[0]) + 1 + en_texts[2].index("think") + assert m_doc[2]._.is_ambiguous == True assert m_doc[9].idx == think_idx - with pytest.raises(AttributeError): - # not callable, because it was not set via set_extension - m_doc[2]._.is_ambiguous - assert len(m_doc.user_data) == len(en_docs[0].user_data) # but it's there + assert m_doc[9]._.is_ambiguous == True + assert not any([t._.is_ambiguous for t in m_doc[3:8]]) + assert "group" in m_doc.spans + assert span_group_texts == sorted([s.text for s in m_doc.spans["group"]]) m_doc = Doc.from_docs(en_docs, ensure_whitespace=False) assert len(en_texts_without_empty) == len(list(m_doc.sents)) @@ -388,6 +395,8 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer): assert len(m_doc) == len(en_docs_tokens) think_idx = len(en_texts[0]) + 0 + en_texts[2].index("think") assert m_doc[9].idx == think_idx + assert "group" in m_doc.spans + assert span_group_texts == sorted([s.text for s in m_doc.spans["group"]]) m_doc = Doc.from_docs(en_docs, attrs=["lemma", "length", "pos"]) assert len(str(m_doc)) > len(en_texts[0]) + len(en_texts[1]) @@ -399,6 +408,8 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer): assert len(m_doc) == len(en_docs_tokens) think_idx = len(en_texts[0]) + 1 + en_texts[2].index("think") assert m_doc[9].idx == think_idx + assert "group" in m_doc.spans + assert span_group_texts == sorted([s.text for s in m_doc.spans["group"]]) def test_doc_api_from_docs_ents(en_tokenizer): diff --git a/spacy/tests/doc/test_retokenize_merge.py b/spacy/tests/doc/test_retokenize_merge.py index 48cd33890..36fa3c15d 100644 --- a/spacy/tests/doc/test_retokenize_merge.py +++ 
b/spacy/tests/doc/test_retokenize_merge.py @@ -452,3 +452,30 @@ def test_retokenize_disallow_zero_length(en_vocab): with pytest.raises(ValueError): with doc.retokenize() as retokenizer: retokenizer.merge(doc[1:1]) + + +def test_doc_retokenize_merge_without_parse_keeps_sents(en_tokenizer): + text = "displaCy is a parse tool built with Javascript" + sent_starts = [1, 0, 0, 0, 1, 0, 0, 0] + tokens = en_tokenizer(text) + + # merging within a sentence keeps all sentence boundaries + doc = Doc(tokens.vocab, words=[t.text for t in tokens], sent_starts=sent_starts) + assert len(list(doc.sents)) == 2 + with doc.retokenize() as retokenizer: + retokenizer.merge(doc[1:3]) + assert len(list(doc.sents)) == 2 + + # merging over a sentence boundary unsets it by default + doc = Doc(tokens.vocab, words=[t.text for t in tokens], sent_starts=sent_starts) + assert len(list(doc.sents)) == 2 + with doc.retokenize() as retokenizer: + retokenizer.merge(doc[3:6]) + assert doc[3].is_sent_start == None + + # merging over a sentence boundary and setting sent_start + doc = Doc(tokens.vocab, words=[t.text for t in tokens], sent_starts=sent_starts) + assert len(list(doc.sents)) == 2 + with doc.retokenize() as retokenizer: + retokenizer.merge(doc[3:6], attrs={"sent_start": True}) + assert len(list(doc.sents)) == 2 diff --git a/spacy/tests/doc/test_span.py b/spacy/tests/doc/test_span.py index 078cc81b1..6a5689971 100644 --- a/spacy/tests/doc/test_span.py +++ b/spacy/tests/doc/test_span.py @@ -1,9 +1,11 @@ import pytest from spacy.attrs import ORTH, LENGTH -from spacy.tokens import Doc, Span +from spacy.tokens import Doc, Span, Token from spacy.vocab import Vocab from spacy.util import filter_spans +from .test_underscore import clean_underscore # noqa: F401 + @pytest.fixture def doc(en_tokenizer): @@ -219,11 +221,14 @@ def test_span_as_doc(doc): assert span_doc[0].idx == 0 +@pytest.mark.usefixtures("clean_underscore") def test_span_as_doc_user_data(doc): """Test that the user_data can be preserved (but not by default). 
""" my_key = "my_info" my_value = 342 doc.user_data[my_key] = my_value + Token.set_extension("is_x", default=False) + doc[7]._.is_x = True span = doc[4:10] span_doc_with = span.as_doc(copy_user_data=True) @@ -232,6 +237,12 @@ def test_span_as_doc_user_data(doc): assert doc.user_data.get(my_key, None) is my_value assert span_doc_with.user_data.get(my_key, None) is my_value assert span_doc_without.user_data.get(my_key, None) is None + for i in range(len(span_doc_with)): + if i != 3: + assert span_doc_with[i]._.is_x is False + else: + assert span_doc_with[i]._.is_x is True + assert not any([t._.is_x for t in span_doc_without]) def test_span_string_label_kb_id(doc): diff --git a/spacy/tests/enable_gpu.py b/spacy/tests/enable_gpu.py new file mode 100644 index 000000000..3d4fded10 --- /dev/null +++ b/spacy/tests/enable_gpu.py @@ -0,0 +1,3 @@ +from spacy import require_gpu + +require_gpu() diff --git a/spacy/tests/matcher/test_dependency_matcher.py b/spacy/tests/matcher/test_dependency_matcher.py index a563ddaa2..fb9222aaa 100644 --- a/spacy/tests/matcher/test_dependency_matcher.py +++ b/spacy/tests/matcher/test_dependency_matcher.py @@ -4,7 +4,9 @@ import re import copy from mock import Mock from spacy.matcher import DependencyMatcher -from spacy.tokens import Doc +from spacy.tokens import Doc, Token + +from ..doc.test_underscore import clean_underscore # noqa: F401 @pytest.fixture @@ -344,3 +346,26 @@ def test_dependency_matcher_long_matches(en_vocab, doc): matcher = DependencyMatcher(en_vocab) with pytest.raises(ValueError): matcher.add("pattern", [pattern]) + + +@pytest.mark.usefixtures("clean_underscore") +def test_dependency_matcher_span_user_data(en_tokenizer): + doc = en_tokenizer("a b c d e") + for token in doc: + token.head = doc[0] + token.dep_ = "a" + get_is_c = lambda token: token.text in ("c",) + Token.set_extension("is_c", default=False) + doc[2]._.is_c = True + pattern = [ + {"RIGHT_ID": "c", "RIGHT_ATTRS": {"_": {"is_c": True}}}, + ] + matcher = DependencyMatcher(en_tokenizer.vocab) + matcher.add("C", [pattern]) + doc_matches = matcher(doc) + offset = 1 + span_matches = matcher(doc[offset:]) + for doc_match, span_match in zip(sorted(doc_matches), sorted(span_matches)): + assert doc_match[0] == span_match[0] + for doc_t_i, span_t_i in zip(doc_match[1], span_match[1]): + assert doc_t_i == span_t_i + offset diff --git a/spacy/tests/matcher/test_matcher_logic.py b/spacy/tests/matcher/test_matcher_logic.py index 5f4c2991a..9f575fe05 100644 --- a/spacy/tests/matcher/test_matcher_logic.py +++ b/spacy/tests/matcher/test_matcher_logic.py @@ -204,3 +204,90 @@ def test_matcher_remove(): # removing again should throw an error with pytest.raises(ValueError): matcher.remove("Rule") + + +def test_matcher_with_alignments_greedy_longest(en_vocab): + cases = [ + ("aaab", "a* b", [0, 0, 0, 1]), + ("baab", "b a* b", [0, 1, 1, 2]), + ("aaab", "a a a b", [0, 1, 2, 3]), + ("aaab", "a+ b", [0, 0, 0, 1]), + ("aaba", "a+ b a+", [0, 0, 1, 2]), + ("aabaa", "a+ b a+", [0, 0, 1, 2, 2]), + ("aaba", "a+ b a*", [0, 0, 1, 2]), + ("aaaa", "a*", [0, 0, 0, 0]), + ("baab", "b a* b b*", [0, 1, 1, 2]), + ("aabb", "a* b* a*", [0, 0, 1, 1]), + ("aaab", "a+ a+ a b", [0, 1, 2, 3]), + ("aaab", "a+ a+ a+ b", [0, 1, 2, 3]), + ("aaab", "a+ a a b", [0, 1, 2, 3]), + ("aaab", "a+ a a", [0, 1, 2]), + ("aaab", "a+ a a?", [0, 1, 2]), + ("aaaa", "a a a a a?", [0, 1, 2, 3]), + ("aaab", "a+ a b", [0, 0, 1, 2]), + ("aaab", "a+ a+ b", [0, 0, 1, 2]), + ] + for string, pattern_str, result in cases: + matcher = Matcher(en_vocab) + doc = 
Doc(matcher.vocab, words=list(string)) + pattern = [] + for part in pattern_str.split(): + if part.endswith("+"): + pattern.append({"ORTH": part[0], "OP": "+"}) + elif part.endswith("*"): + pattern.append({"ORTH": part[0], "OP": "*"}) + elif part.endswith("?"): + pattern.append({"ORTH": part[0], "OP": "?"}) + else: + pattern.append({"ORTH": part}) + matcher.add("PATTERN", [pattern], greedy="LONGEST") + matches = matcher(doc, with_alignments=True) + n_matches = len(matches) + + _, s, e, expected = matches[0] + + assert expected == result, (string, pattern_str, s, e, n_matches) + + +def test_matcher_with_alignments_nongreedy(en_vocab): + cases = [ + (0, "aaab", "a* b", [[0, 1], [0, 0, 1], [0, 0, 0, 1], [1]]), + (1, "baab", "b a* b", [[0, 1, 1, 2]]), + (2, "aaab", "a a a b", [[0, 1, 2, 3]]), + (3, "aaab", "a+ b", [[0, 1], [0, 0, 1], [0, 0, 0, 1]]), + (4, "aaba", "a+ b a+", [[0, 1, 2], [0, 0, 1, 2]]), + (5, "aabaa", "a+ b a+", [[0, 1, 2], [0, 0, 1, 2], [0, 0, 1, 2, 2], [0, 1, 2, 2] ]), + (6, "aaba", "a+ b a*", [[0, 1], [0, 0, 1], [0, 0, 1, 2], [0, 1, 2]]), + (7, "aaaa", "a*", [[0], [0, 0], [0, 0, 0], [0, 0, 0, 0]]), + (8, "baab", "b a* b b*", [[0, 1, 1, 2]]), + (9, "aabb", "a* b* a*", [[1], [2], [2, 2], [0, 1], [0, 0, 1], [0, 0, 1, 1], [0, 1, 1], [1, 1]]), + (10, "aaab", "a+ a+ a b", [[0, 1, 2, 3]]), + (11, "aaab", "a+ a+ a+ b", [[0, 1, 2, 3]]), + (12, "aaab", "a+ a a b", [[0, 1, 2, 3]]), + (13, "aaab", "a+ a a", [[0, 1, 2]]), + (14, "aaab", "a+ a a?", [[0, 1], [0, 1, 2]]), + (15, "aaaa", "a a a a a?", [[0, 1, 2, 3]]), + (16, "aaab", "a+ a b", [[0, 1, 2], [0, 0, 1, 2]]), + (17, "aaab", "a+ a+ b", [[0, 1, 2], [0, 0, 1, 2]]), + ] + for case_id, string, pattern_str, results in cases: + matcher = Matcher(en_vocab) + doc = Doc(matcher.vocab, words=list(string)) + pattern = [] + for part in pattern_str.split(): + if part.endswith("+"): + pattern.append({"ORTH": part[0], "OP": "+"}) + elif part.endswith("*"): + pattern.append({"ORTH": part[0], "OP": "*"}) + elif part.endswith("?"): + pattern.append({"ORTH": part[0], "OP": "?"}) + else: + pattern.append({"ORTH": part}) + + matcher.add("PATTERN", [pattern]) + matches = matcher(doc, with_alignments=True) + n_matches = len(matches) + + for _, s, e, expected in matches: + assert expected in results, (case_id, string, pattern_str, s, e, n_matches) + assert len(expected) == e - s diff --git a/spacy/tests/pipeline/test_entity_ruler.py b/spacy/tests/pipeline/test_entity_ruler.py index 3f998d78d..2f6da79d6 100644 --- a/spacy/tests/pipeline/test_entity_ruler.py +++ b/spacy/tests/pipeline/test_entity_ruler.py @@ -5,6 +5,7 @@ from spacy.tokens import Span from spacy.language import Language from spacy.pipeline import EntityRuler from spacy.errors import MatchPatternError +from thinc.api import NumpyOps, get_current_ops @pytest.fixture @@ -201,13 +202,14 @@ def test_entity_ruler_overlapping_spans(nlp): @pytest.mark.parametrize("n_process", [1, 2]) def test_entity_ruler_multiprocessing(nlp, n_process): - texts = ["I enjoy eating Pizza Hut pizza."] + if isinstance(get_current_ops, NumpyOps) or n_process < 2: + texts = ["I enjoy eating Pizza Hut pizza."] - patterns = [{"label": "FASTFOOD", "pattern": "Pizza Hut", "id": "1234"}] + patterns = [{"label": "FASTFOOD", "pattern": "Pizza Hut", "id": "1234"}] - ruler = nlp.add_pipe("entity_ruler") - ruler.add_patterns(patterns) + ruler = nlp.add_pipe("entity_ruler") + ruler.add_patterns(patterns) - for doc in nlp.pipe(texts, n_process=2): - for ent in doc.ents: - assert ent.ent_id_ == "1234" + for doc in nlp.pipe(texts, 
n_process=2): + for ent in doc.ents: + assert ent.ent_id_ == "1234" diff --git a/spacy/tests/pipeline/test_lemmatizer.py b/spacy/tests/pipeline/test_lemmatizer.py index 1943d3dd7..3c16d3bcb 100644 --- a/spacy/tests/pipeline/test_lemmatizer.py +++ b/spacy/tests/pipeline/test_lemmatizer.py @@ -1,6 +1,7 @@ import pytest import logging import mock +import pickle from spacy import util, registry from spacy.lang.en import English from spacy.lookups import Lookups @@ -106,6 +107,9 @@ def test_lemmatizer_serialize(nlp): doc2 = nlp2.make_doc("coping") doc2[0].pos_ = "VERB" assert doc2[0].lemma_ == "" - doc2 = lemmatizer(doc2) + doc2 = lemmatizer2(doc2) assert doc2[0].text == "coping" assert doc2[0].lemma_ == "cope" + + # Make sure that lemmatizer cache can be pickled + b = pickle.dumps(lemmatizer2) diff --git a/spacy/tests/pipeline/test_models.py b/spacy/tests/pipeline/test_models.py index d04ac9cd4..302c307e2 100644 --- a/spacy/tests/pipeline/test_models.py +++ b/spacy/tests/pipeline/test_models.py @@ -4,7 +4,7 @@ import numpy import pytest from numpy.testing import assert_almost_equal from spacy.vocab import Vocab -from thinc.api import NumpyOps, Model, data_validation +from thinc.api import Model, data_validation, get_current_ops from thinc.types import Array2d, Ragged from spacy.lang.en import English @@ -13,7 +13,7 @@ from spacy.ml._character_embed import CharacterEmbed from spacy.tokens import Doc -OPS = NumpyOps() +OPS = get_current_ops() texts = ["These are 4 words", "Here just three"] l0 = [[1, 2], [3, 4], [5, 6], [7, 8]] @@ -82,7 +82,7 @@ def util_batch_unbatch_docs_list( Y_batched = model.predict(in_data) Y_not_batched = [model.predict([u])[0] for u in in_data] for i in range(len(Y_batched)): - assert_almost_equal(Y_batched[i], Y_not_batched[i], decimal=4) + assert_almost_equal(OPS.to_numpy(Y_batched[i]), OPS.to_numpy(Y_not_batched[i]), decimal=4) def util_batch_unbatch_docs_array( @@ -91,7 +91,7 @@ def util_batch_unbatch_docs_array( with data_validation(True): model.initialize(in_data, out_data) Y_batched = model.predict(in_data).tolist() - Y_not_batched = [model.predict([u])[0] for u in in_data] + Y_not_batched = [model.predict([u])[0].tolist() for u in in_data] assert_almost_equal(Y_batched, Y_not_batched, decimal=4) @@ -100,8 +100,8 @@ def util_batch_unbatch_docs_ragged( ): with data_validation(True): model.initialize(in_data, out_data) - Y_batched = model.predict(in_data) + Y_batched = model.predict(in_data).data.tolist() Y_not_batched = [] for u in in_data: Y_not_batched.extend(model.predict([u]).data.tolist()) - assert_almost_equal(Y_batched.data, Y_not_batched, decimal=4) + assert_almost_equal(Y_batched, Y_not_batched, decimal=4) diff --git a/spacy/tests/pipeline/test_pipe_factories.py b/spacy/tests/pipeline/test_pipe_factories.py index e1706ffb1..a7071abfd 100644 --- a/spacy/tests/pipeline/test_pipe_factories.py +++ b/spacy/tests/pipeline/test_pipe_factories.py @@ -1,4 +1,6 @@ import pytest +import mock +import logging from spacy.language import Language from spacy.lang.en import English from spacy.lang.de import German @@ -402,6 +404,38 @@ def test_pipe_factories_from_source(): nlp.add_pipe("custom", source=source_nlp) +def test_pipe_factories_from_source_language_subclass(): + class CustomEnglishDefaults(English.Defaults): + stop_words = set(["custom", "stop"]) + + @registry.languages("custom_en") + class CustomEnglish(English): + lang = "custom_en" + Defaults = CustomEnglishDefaults + + source_nlp = English() + source_nlp.add_pipe("tagger") + + # custom subclass + nlp = 
CustomEnglish() + nlp.add_pipe("tagger", source=source_nlp) + assert "tagger" in nlp.pipe_names + + # non-subclass + nlp = German() + nlp.add_pipe("tagger", source=source_nlp) + assert "tagger" in nlp.pipe_names + + # mismatched vectors + nlp = English() + nlp.vocab.vectors.resize((1, 4)) + nlp.vocab.vectors.add("cat", vector=[1, 2, 3, 4]) + logger = logging.getLogger("spacy") + with mock.patch.object(logger, "warning") as mock_warning: + nlp.add_pipe("tagger", source=source_nlp) + mock_warning.assert_called() + + def test_pipe_factories_from_source_custom(): """Test adding components from a source model with custom components.""" name = "test_pipe_factories_from_source_custom" diff --git a/spacy/tests/pipeline/test_textcat.py b/spacy/tests/pipeline/test_textcat.py index 61af16eb5..43dfff147 100644 --- a/spacy/tests/pipeline/test_textcat.py +++ b/spacy/tests/pipeline/test_textcat.py @@ -1,7 +1,7 @@ import pytest import random import numpy.random -from numpy.testing import assert_equal +from numpy.testing import assert_almost_equal from thinc.api import fix_random_seed from spacy import util from spacy.lang.en import English @@ -222,8 +222,12 @@ def test_overfitting_IO(): batch_cats_1 = [doc.cats for doc in nlp.pipe(texts)] batch_cats_2 = [doc.cats for doc in nlp.pipe(texts)] no_batch_cats = [doc.cats for doc in [nlp(text) for text in texts]] - assert_equal(batch_cats_1, batch_cats_2) - assert_equal(batch_cats_1, no_batch_cats) + for cats_1, cats_2 in zip(batch_cats_1, batch_cats_2): + for cat in cats_1: + assert_almost_equal(cats_1[cat], cats_2[cat], decimal=5) + for cats_1, cats_2 in zip(batch_cats_1, no_batch_cats): + for cat in cats_1: + assert_almost_equal(cats_1[cat], cats_2[cat], decimal=5) def test_overfitting_IO_multi(): @@ -270,8 +274,12 @@ def test_overfitting_IO_multi(): batch_deps_1 = [doc.cats for doc in nlp.pipe(texts)] batch_deps_2 = [doc.cats for doc in nlp.pipe(texts)] no_batch_deps = [doc.cats for doc in [nlp(text) for text in texts]] - assert_equal(batch_deps_1, batch_deps_2) - assert_equal(batch_deps_1, no_batch_deps) + for cats_1, cats_2 in zip(batch_deps_1, batch_deps_2): + for cat in cats_1: + assert_almost_equal(cats_1[cat], cats_2[cat], decimal=5) + for cats_1, cats_2 in zip(batch_deps_1, no_batch_deps): + for cat in cats_1: + assert_almost_equal(cats_1[cat], cats_2[cat], decimal=5) # fmt: off diff --git a/spacy/tests/pipeline/test_tok2vec.py b/spacy/tests/pipeline/test_tok2vec.py index ac5428de6..e3b71c502 100644 --- a/spacy/tests/pipeline/test_tok2vec.py +++ b/spacy/tests/pipeline/test_tok2vec.py @@ -8,8 +8,8 @@ from spacy.tokens import Doc from spacy.training import Example from spacy import util from spacy.lang.en import English -from thinc.api import Config -from numpy.testing import assert_equal +from thinc.api import Config, get_current_ops +from numpy.testing import assert_array_equal from ..util import get_batch, make_tempdir @@ -160,7 +160,8 @@ def test_tok2vec_listener(): doc = nlp("Running the pipeline as a whole.") doc_tensor = tagger_tok2vec.predict([doc])[0] - assert_equal(doc.tensor, doc_tensor) + ops = get_current_ops() + assert_array_equal(ops.to_numpy(doc.tensor), ops.to_numpy(doc_tensor)) # TODO: should this warn or error? 
nlp.select_pipes(disable="tok2vec") diff --git a/spacy/tests/regression/test_issue4501-5000.py b/spacy/tests/regression/test_issue4501-5000.py index 6dbbc233b..f5fcb53fd 100644 --- a/spacy/tests/regression/test_issue4501-5000.py +++ b/spacy/tests/regression/test_issue4501-5000.py @@ -9,6 +9,7 @@ from spacy.language import Language from spacy.util import ensure_path, load_model_from_path import numpy import pickle +from thinc.api import NumpyOps, get_current_ops from ..util import make_tempdir @@ -169,21 +170,22 @@ def test_issue4725_1(): def test_issue4725_2(): - # ensures that this runs correctly and doesn't hang or crash because of the global vectors - # if it does crash, it's usually because of calling 'spawn' for multiprocessing (e.g. on Windows), - # or because of issues with pickling the NER (cf test_issue4725_1) - vocab = Vocab(vectors_name="test_vocab_add_vector") - data = numpy.ndarray((5, 3), dtype="f") - data[0] = 1.0 - data[1] = 2.0 - vocab.set_vector("cat", data[0]) - vocab.set_vector("dog", data[1]) - nlp = English(vocab=vocab) - nlp.add_pipe("ner") - nlp.initialize() - docs = ["Kurt is in London."] * 10 - for _ in nlp.pipe(docs, batch_size=2, n_process=2): - pass + if isinstance(get_current_ops, NumpyOps): + # ensures that this runs correctly and doesn't hang or crash because of the global vectors + # if it does crash, it's usually because of calling 'spawn' for multiprocessing (e.g. on Windows), + # or because of issues with pickling the NER (cf test_issue4725_1) + vocab = Vocab(vectors_name="test_vocab_add_vector") + data = numpy.ndarray((5, 3), dtype="f") + data[0] = 1.0 + data[1] = 2.0 + vocab.set_vector("cat", data[0]) + vocab.set_vector("dog", data[1]) + nlp = English(vocab=vocab) + nlp.add_pipe("ner") + nlp.initialize() + docs = ["Kurt is in London."] * 10 + for _ in nlp.pipe(docs, batch_size=2, n_process=2): + pass def test_issue4849(): @@ -204,10 +206,11 @@ def test_issue4849(): count_ents += len([ent for ent in doc.ents if ent.ent_id > 0]) assert count_ents == 2 # USING 2 PROCESSES - count_ents = 0 - for doc in nlp.pipe([text], n_process=2): - count_ents += len([ent for ent in doc.ents if ent.ent_id > 0]) - assert count_ents == 2 + if isinstance(get_current_ops, NumpyOps): + count_ents = 0 + for doc in nlp.pipe([text], n_process=2): + count_ents += len([ent for ent in doc.ents if ent.ent_id > 0]) + assert count_ents == 2 @Language.factory("my_pipe") @@ -239,10 +242,11 @@ def test_issue4903(): nlp.add_pipe("sentencizer") nlp.add_pipe("my_pipe", after="sentencizer") text = ["I like bananas.", "Do you like them?", "No, I prefer wasabi."] - docs = list(nlp.pipe(text, n_process=2)) - assert docs[0].text == "I like bananas." - assert docs[1].text == "Do you like them?" - assert docs[2].text == "No, I prefer wasabi." + if isinstance(get_current_ops(), NumpyOps): + docs = list(nlp.pipe(text, n_process=2)) + assert docs[0].text == "I like bananas." + assert docs[1].text == "Do you like them?" + assert docs[2].text == "No, I prefer wasabi." 
def test_issue4924(): diff --git a/spacy/tests/regression/test_issue5001-5500.py b/spacy/tests/regression/test_issue5001-5500.py index dbfe78679..0575c8270 100644 --- a/spacy/tests/regression/test_issue5001-5500.py +++ b/spacy/tests/regression/test_issue5001-5500.py @@ -6,6 +6,7 @@ from spacy.language import Language from spacy.lang.en.syntax_iterators import noun_chunks from spacy.vocab import Vocab import spacy +from thinc.api import get_current_ops import pytest from ...util import make_tempdir @@ -54,16 +55,17 @@ def test_issue5082(): ruler.add_patterns(patterns) parsed_vectors_1 = [t.vector for t in nlp(text)] assert len(parsed_vectors_1) == 4 - numpy.testing.assert_array_equal(parsed_vectors_1[0], array1) - numpy.testing.assert_array_equal(parsed_vectors_1[1], array2) - numpy.testing.assert_array_equal(parsed_vectors_1[2], array3) - numpy.testing.assert_array_equal(parsed_vectors_1[3], array4) + ops = get_current_ops() + numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_1[0]), array1) + numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_1[1]), array2) + numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_1[2]), array3) + numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_1[3]), array4) nlp.add_pipe("merge_entities") parsed_vectors_2 = [t.vector for t in nlp(text)] assert len(parsed_vectors_2) == 3 - numpy.testing.assert_array_equal(parsed_vectors_2[0], array1) - numpy.testing.assert_array_equal(parsed_vectors_2[1], array2) - numpy.testing.assert_array_equal(parsed_vectors_2[2], array34) + numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_2[0]), array1) + numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_2[1]), array2) + numpy.testing.assert_array_equal(ops.to_numpy(parsed_vectors_2[2]), array34) def test_issue5137(): diff --git a/spacy/tests/regression/test_issue5501-6000.py b/spacy/tests/regression/test_issue5501-6000.py index 8d1199e98..a35de92fa 100644 --- a/spacy/tests/regression/test_issue5501-6000.py +++ b/spacy/tests/regression/test_issue5501-6000.py @@ -1,5 +1,6 @@ import pytest -from thinc.api import Config, fix_random_seed +from numpy.testing import assert_almost_equal +from thinc.api import Config, fix_random_seed, get_current_ops from spacy.lang.en import English from spacy.pipeline.textcat import single_label_default_config, single_label_bow_config @@ -44,11 +45,12 @@ def test_issue5551(textcat_config): nlp.update([Example.from_dict(doc, annots)]) # Store the result of each iteration result = pipe.model.predict([doc]) - results.append(list(result[0])) + results.append(result[0]) # All results should be the same because of the fixed seed assert len(results) == 3 - assert results[0] == results[1] - assert results[0] == results[2] + ops = get_current_ops() + assert_almost_equal(ops.to_numpy(results[0]), ops.to_numpy(results[1])) + assert_almost_equal(ops.to_numpy(results[0]), ops.to_numpy(results[2])) def test_issue5838(): diff --git a/spacy/tests/regression/test_issue7065.py b/spacy/tests/regression/test_issue7065.py index 897687d19..63d36552a 100644 --- a/spacy/tests/regression/test_issue7065.py +++ b/spacy/tests/regression/test_issue7065.py @@ -1,4 +1,6 @@ +from spacy.kb import KnowledgeBase from spacy.lang.en import English +from spacy.training import Example def test_issue7065(): @@ -16,3 +18,58 @@ def test_issue7065(): ent = doc.ents[0] assert ent.start < sent0.end < ent.end assert sentences.index(ent.sent) == 0 + + +def test_issue7065_b(): + # Test that the NEL doesn't crash when an entity crosses a 
sentence boundary + nlp = English() + vector_length = 3 + nlp.add_pipe("sentencizer") + + text = "Mahler 's Symphony No. 8 was beautiful." + entities = [(0, 6, "PERSON"), (10, 24, "WORK")] + links = {(0, 6): {"Q7304": 1.0, "Q270853": 0.0}, + (10, 24): {"Q7304": 0.0, "Q270853": 1.0}} + sent_starts = [1, -1, 0, 0, 0, 0, 0, 0, 0] + doc = nlp(text) + example = Example.from_dict(doc, {"entities": entities, "links": links, "sent_starts": sent_starts}) + train_examples = [example] + + def create_kb(vocab): + # create artificial KB + mykb = KnowledgeBase(vocab, entity_vector_length=vector_length) + mykb.add_entity(entity="Q270853", freq=12, entity_vector=[9, 1, -7]) + mykb.add_alias( + alias="No. 8", + entities=["Q270853"], + probabilities=[1.0], + ) + mykb.add_entity(entity="Q7304", freq=12, entity_vector=[6, -4, 3]) + mykb.add_alias( + alias="Mahler", + entities=["Q7304"], + probabilities=[1.0], + ) + return mykb + + # Create the Entity Linker component and add it to the pipeline + entity_linker = nlp.add_pipe("entity_linker", last=True) + entity_linker.set_kb(create_kb) + + # train the NEL pipe + optimizer = nlp.initialize(get_examples=lambda: train_examples) + for i in range(2): + losses = {} + nlp.update(train_examples, sgd=optimizer, losses=losses) + + # Add a custom rule-based component to mimick NER + patterns = [ + {"label": "PERSON", "pattern": [{"LOWER": "mahler"}]}, + {"label": "WORK", "pattern": [{"LOWER": "symphony"}, {"LOWER": "no"}, {"LOWER": "."}, {"LOWER": "8"}]} + ] + ruler = nlp.add_pipe("entity_ruler", before="entity_linker") + ruler.add_patterns(patterns) + + # test the trained model - this should not throw E148 + doc = nlp(text) + assert doc diff --git a/spacy/tests/serialize/test_serialize_config.py b/spacy/tests/serialize/test_serialize_config.py index 66b66b744..2cd0e4ab6 100644 --- a/spacy/tests/serialize/test_serialize_config.py +++ b/spacy/tests/serialize/test_serialize_config.py @@ -4,7 +4,7 @@ import spacy from spacy.lang.en import English from spacy.lang.de import German from spacy.language import Language, DEFAULT_CONFIG, DEFAULT_CONFIG_PRETRAIN_PATH -from spacy.util import registry, load_model_from_config, load_config +from spacy.util import registry, load_model_from_config, load_config, load_config_from_str from spacy.ml.models import build_Tok2Vec_model, build_tb_parser_model from spacy.ml.models import MultiHashEmbed, MaxoutWindowEncoder from spacy.schemas import ConfigSchema, ConfigSchemaPretrain @@ -465,3 +465,32 @@ def test_config_only_resolve_relevant_blocks(): nlp.initialize() nlp.config["initialize"]["lookups"] = None nlp.initialize() + + +def test_hyphen_in_config(): + hyphen_config_str = """ + [nlp] + lang = "en" + pipeline = ["my_punctual_component"] + + [components] + + [components.my_punctual_component] + factory = "my_punctual_component" + punctuation = ["?","-"] + """ + + @spacy.Language.factory("my_punctual_component") + class MyPunctualComponent(object): + name = "my_punctual_component" + + def __init__( + self, + nlp, + name, + punctuation, + ): + self.punctuation = punctuation + + nlp = English.from_config(load_config_from_str(hyphen_config_str)) + assert nlp.get_pipe("my_punctual_component").punctuation == ['?', '-'] diff --git a/spacy/tests/serialize/test_serialize_tokenizer.py b/spacy/tests/serialize/test_serialize_tokenizer.py index ae612114a..a9450cd04 100644 --- a/spacy/tests/serialize/test_serialize_tokenizer.py +++ b/spacy/tests/serialize/test_serialize_tokenizer.py @@ -26,10 +26,14 @@ def test_serialize_custom_tokenizer(en_vocab, 
en_tokenizer): assert tokenizer.rules != {} assert tokenizer.token_match is not None assert tokenizer.url_match is not None + assert tokenizer.prefix_search is not None + assert tokenizer.infix_finditer is not None tokenizer.from_bytes(tokenizer_bytes) assert tokenizer.rules == {} assert tokenizer.token_match is None assert tokenizer.url_match is None + assert tokenizer.prefix_search is None + assert tokenizer.infix_finditer is None tokenizer = Tokenizer(en_vocab, rules={"ABC.": [{"ORTH": "ABC"}, {"ORTH": "."}]}) tokenizer.rules = {} diff --git a/spacy/tests/serialize/test_serialize_vocab_strings.py b/spacy/tests/serialize/test_serialize_vocab_strings.py index 45a546203..3fe9363bf 100644 --- a/spacy/tests/serialize/test_serialize_vocab_strings.py +++ b/spacy/tests/serialize/test_serialize_vocab_strings.py @@ -49,9 +49,9 @@ def test_serialize_vocab_roundtrip_disk(strings1, strings2): vocab1_d = Vocab().from_disk(file_path1) vocab2_d = Vocab().from_disk(file_path2) # check strings rather than lexemes, which are only reloaded on demand - assert strings1 == [s for s in vocab1_d.strings] - assert strings2 == [s for s in vocab2_d.strings] - if strings1 == strings2: + assert set(strings1) == set([s for s in vocab1_d.strings]) + assert set(strings2) == set([s for s in vocab2_d.strings]) + if set(strings1) == set(strings2): assert [s for s in vocab1_d.strings] == [s for s in vocab2_d.strings] else: assert [s for s in vocab1_d.strings] != [s for s in vocab2_d.strings] @@ -96,7 +96,7 @@ def test_serialize_stringstore_roundtrip_bytes(strings1, strings2): sstore2 = StringStore(strings=strings2) sstore1_b = sstore1.to_bytes() sstore2_b = sstore2.to_bytes() - if strings1 == strings2: + if set(strings1) == set(strings2): assert sstore1_b == sstore2_b else: assert sstore1_b != sstore2_b @@ -104,7 +104,7 @@ def test_serialize_stringstore_roundtrip_bytes(strings1, strings2): assert sstore1.to_bytes() == sstore1_b new_sstore1 = StringStore().from_bytes(sstore1_b) assert new_sstore1.to_bytes() == sstore1_b - assert list(new_sstore1) == strings1 + assert set(new_sstore1) == set(strings1) @pytest.mark.parametrize("strings1,strings2", test_strings) @@ -118,12 +118,12 @@ def test_serialize_stringstore_roundtrip_disk(strings1, strings2): sstore2.to_disk(file_path2) sstore1_d = StringStore().from_disk(file_path1) sstore2_d = StringStore().from_disk(file_path2) - assert list(sstore1_d) == list(sstore1) - assert list(sstore2_d) == list(sstore2) - if strings1 == strings2: - assert list(sstore1_d) == list(sstore2_d) + assert set(sstore1_d) == set(sstore1) + assert set(sstore2_d) == set(sstore2) + if set(strings1) == set(strings2): + assert set(sstore1_d) == set(sstore2_d) else: - assert list(sstore1_d) != list(sstore2_d) + assert set(sstore1_d) != set(sstore2_d) @pytest.mark.parametrize("strings,lex_attr", test_strings_attrs) diff --git a/spacy/tests/test_cli.py b/spacy/tests/test_cli.py index c36be9c57..2013ceac4 100644 --- a/spacy/tests/test_cli.py +++ b/spacy/tests/test_cli.py @@ -307,8 +307,11 @@ def test_project_config_validation2(config, n_errors): assert len(errors) == n_errors -def test_project_config_interpolation(): - variables = {"a": 10, "b": {"c": "foo", "d": True}} +@pytest.mark.parametrize( + "int_value", [10, pytest.param("10", marks=pytest.mark.xfail)], +) +def test_project_config_interpolation(int_value): + variables = {"a": int_value, "b": {"c": "foo", "d": True}} commands = [ {"name": "x", "script": ["hello ${vars.a} ${vars.b.c}"]}, {"name": "y", "script": ["${vars.b.c} ${vars.b.d}"]}, @@ -317,6 
+320,8 @@ def test_project_config_interpolation(): with make_tempdir() as d: srsly.write_yaml(d / "project.yml", project) cfg = load_project_config(d) + assert type(cfg) == dict + assert type(cfg["commands"]) == list assert cfg["commands"][0]["script"][0] == "hello 10 foo" assert cfg["commands"][1]["script"][0] == "foo true" commands = [{"name": "x", "script": ["hello ${vars.a} ${vars.b.e}"]}] @@ -325,6 +330,24 @@ def test_project_config_interpolation(): substitute_project_variables(project) +@pytest.mark.parametrize( + "greeting", [342, "everyone", "tout le monde", pytest.param("42", marks=pytest.mark.xfail)], +) +def test_project_config_interpolation_override(greeting): + variables = {"a": "world"} + commands = [ + {"name": "x", "script": ["hello ${vars.a}"]}, + ] + overrides = {"vars.a": greeting} + project = {"commands": commands, "vars": variables} + with make_tempdir() as d: + srsly.write_yaml(d / "project.yml", project) + cfg = load_project_config(d, overrides=overrides) + assert type(cfg) == dict + assert type(cfg["commands"]) == list + assert cfg["commands"][0]["script"][0] == f"hello {greeting}" + + def test_project_config_interpolation_env(): variables = {"a": 10} env_var = "SPACY_TEST_FOO" diff --git a/spacy/tests/test_language.py b/spacy/tests/test_language.py index bec85a1a2..7fb03da0c 100644 --- a/spacy/tests/test_language.py +++ b/spacy/tests/test_language.py @@ -10,6 +10,7 @@ from spacy.lang.en import English from spacy.lang.de import German from spacy.util import registry, ignore_error, raise_error import spacy +from thinc.api import NumpyOps, get_current_ops from .util import add_vecs_to_vocab, assert_docs_equal @@ -142,25 +143,29 @@ def texts(): @pytest.mark.parametrize("n_process", [1, 2]) def test_language_pipe(nlp2, n_process, texts): - texts = texts * 10 - expecteds = [nlp2(text) for text in texts] - docs = nlp2.pipe(texts, n_process=n_process, batch_size=2) + ops = get_current_ops() + if isinstance(ops, NumpyOps) or n_process < 2: + texts = texts * 10 + expecteds = [nlp2(text) for text in texts] + docs = nlp2.pipe(texts, n_process=n_process, batch_size=2) - for doc, expected_doc in zip(docs, expecteds): - assert_docs_equal(doc, expected_doc) + for doc, expected_doc in zip(docs, expecteds): + assert_docs_equal(doc, expected_doc) @pytest.mark.parametrize("n_process", [1, 2]) def test_language_pipe_stream(nlp2, n_process, texts): - # check if nlp.pipe can handle infinite length iterator properly. - stream_texts = itertools.cycle(texts) - texts0, texts1 = itertools.tee(stream_texts) - expecteds = (nlp2(text) for text in texts0) - docs = nlp2.pipe(texts1, n_process=n_process, batch_size=2) + ops = get_current_ops() + if isinstance(ops, NumpyOps) or n_process < 2: + # check if nlp.pipe can handle infinite length iterator properly. 
+ stream_texts = itertools.cycle(texts) + texts0, texts1 = itertools.tee(stream_texts) + expecteds = (nlp2(text) for text in texts0) + docs = nlp2.pipe(texts1, n_process=n_process, batch_size=2) - n_fetch = 20 - for doc, expected_doc in itertools.islice(zip(docs, expecteds), n_fetch): - assert_docs_equal(doc, expected_doc) + n_fetch = 20 + for doc, expected_doc in itertools.islice(zip(docs, expecteds), n_fetch): + assert_docs_equal(doc, expected_doc) def test_language_pipe_error_handler(): diff --git a/spacy/tests/test_misc.py b/spacy/tests/test_misc.py index 58bebc4ca..0d09999a9 100644 --- a/spacy/tests/test_misc.py +++ b/spacy/tests/test_misc.py @@ -8,7 +8,8 @@ from spacy import prefer_gpu, require_gpu, require_cpu from spacy.ml._precomputable_affine import PrecomputableAffine from spacy.ml._precomputable_affine import _backprop_precomputable_affine_padding from spacy.util import dot_to_object, SimpleFrozenList, import_file -from thinc.api import Config, Optimizer, ConfigValidationError +from thinc.api import Config, Optimizer, ConfigValidationError, get_current_ops +from thinc.api import set_current_ops from spacy.training.batchers import minibatch_by_words from spacy.lang.en import English from spacy.lang.nl import Dutch @@ -81,6 +82,7 @@ def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2): def test_prefer_gpu(): + current_ops = get_current_ops() try: import cupy # noqa: F401 @@ -88,9 +90,11 @@ def test_prefer_gpu(): assert isinstance(get_current_ops(), CupyOps) except ImportError: assert not prefer_gpu() + set_current_ops(current_ops) def test_require_gpu(): + current_ops = get_current_ops() try: import cupy # noqa: F401 @@ -99,9 +103,11 @@ def test_require_gpu(): except ImportError: with pytest.raises(ValueError): require_gpu() + set_current_ops(current_ops) def test_require_cpu(): + current_ops = get_current_ops() require_cpu() assert isinstance(get_current_ops(), NumpyOps) try: @@ -113,6 +119,7 @@ def test_require_cpu(): pass require_cpu() assert isinstance(get_current_ops(), NumpyOps) + set_current_ops(current_ops) def test_ascii_filenames(): diff --git a/spacy/tests/test_models.py b/spacy/tests/test_models.py index 200d7dcfd..45cee13ea 100644 --- a/spacy/tests/test_models.py +++ b/spacy/tests/test_models.py @@ -1,7 +1,7 @@ from typing import List import pytest from thinc.api import fix_random_seed, Adam, set_dropout_rate -from numpy.testing import assert_array_equal +from numpy.testing import assert_array_equal, assert_array_almost_equal import numpy from spacy.ml.models import build_Tok2Vec_model, MultiHashEmbed, MaxoutWindowEncoder from spacy.ml.models import build_bow_text_classifier, build_simple_cnn_text_classifier @@ -109,7 +109,7 @@ def test_models_initialize_consistently(seed, model_func, kwargs): model2.initialize() params1 = get_all_params(model1) params2 = get_all_params(model2) - assert_array_equal(params1, params2) + assert_array_equal(model1.ops.to_numpy(params1), model2.ops.to_numpy(params2)) @pytest.mark.parametrize( @@ -134,14 +134,25 @@ def test_models_predict_consistently(seed, model_func, kwargs, get_X): for i in range(len(tok2vec1)): for j in range(len(tok2vec1[i])): assert_array_equal( - numpy.asarray(tok2vec1[i][j]), numpy.asarray(tok2vec2[i][j]) + numpy.asarray(model1.ops.to_numpy(tok2vec1[i][j])), + numpy.asarray(model2.ops.to_numpy(tok2vec2[i][j])), ) + try: + Y1 = model1.ops.to_numpy(Y1) + Y2 = model2.ops.to_numpy(Y2) + except Exception: + pass if isinstance(Y1, numpy.ndarray): assert_array_equal(Y1, Y2) elif isinstance(Y1, List): assert len(Y1) == 
len(Y2) for y1, y2 in zip(Y1, Y2): + try: + y1 = model1.ops.to_numpy(y1) + y2 = model2.ops.to_numpy(y2) + except Exception: + pass assert_array_equal(y1, y2) else: raise ValueError(f"Could not compare type {type(Y1)}") @@ -169,12 +180,17 @@ def test_models_update_consistently(seed, dropout, model_func, kwargs, get_X): model.finish_update(optimizer) updated_params = get_all_params(model) with pytest.raises(AssertionError): - assert_array_equal(initial_params, updated_params) + assert_array_equal( + model.ops.to_numpy(initial_params), model.ops.to_numpy(updated_params) + ) return model model1 = get_updated_model() model2 = get_updated_model() - assert_array_equal(get_all_params(model1), get_all_params(model2)) + assert_array_almost_equal( + model1.ops.to_numpy(get_all_params(model1)), + model2.ops.to_numpy(get_all_params(model2)), + ) @pytest.mark.parametrize("model_func,kwargs", [(StaticVectors, {"nO": 128, "nM": 300})]) diff --git a/spacy/tests/test_scorer.py b/spacy/tests/test_scorer.py index 4dddca404..c044d8afe 100644 --- a/spacy/tests/test_scorer.py +++ b/spacy/tests/test_scorer.py @@ -3,10 +3,10 @@ import pytest from pytest import approx from spacy.training import Example from spacy.training.iob_utils import offsets_to_biluo_tags -from spacy.scorer import Scorer, ROCAUCScore +from spacy.scorer import Scorer, ROCAUCScore, PRFScore from spacy.scorer import _roc_auc_score, _roc_curve from spacy.lang.en import English -from spacy.tokens import Doc +from spacy.tokens import Doc, Span test_las_apple = [ @@ -403,3 +403,68 @@ def test_roc_auc_score(): score.score_set(0.75, 1) with pytest.raises(ValueError): _ = score.score # noqa: F841 + + +def test_score_spans(): + nlp = English() + text = "This is just a random sentence." + key = "my_spans" + gold = nlp.make_doc(text) + pred = nlp.make_doc(text) + spans = [] + spans.append(gold.char_span(0, 4, label="PERSON")) + spans.append(gold.char_span(0, 7, label="ORG")) + spans.append(gold.char_span(8, 12, label="ORG")) + gold.spans[key] = spans + + def span_getter(doc, span_key): + return doc.spans[span_key] + + # Predict exactly the same, but overlapping spans will be discarded + pred.spans[key] = spans + eg = Example(pred, gold) + scores = Scorer.score_spans([eg], attr=key, getter=span_getter) + assert scores[f"{key}_p"] == 1.0 + assert scores[f"{key}_r"] < 1.0 + + # Allow overlapping, now both precision and recall should be 100% + pred.spans[key] = spans + eg = Example(pred, gold) + scores = Scorer.score_spans([eg], attr=key, getter=span_getter, allow_overlap=True) + assert scores[f"{key}_p"] == 1.0 + assert scores[f"{key}_r"] == 1.0 + + # Change the predicted labels + new_spans = [Span(pred, span.start, span.end, label="WRONG") for span in spans] + pred.spans[key] = new_spans + eg = Example(pred, gold) + scores = Scorer.score_spans([eg], attr=key, getter=span_getter, allow_overlap=True) + assert scores[f"{key}_p"] == 0.0 + assert scores[f"{key}_r"] == 0.0 + assert f"{key}_per_type" in scores + + # Discard labels from the evaluation + scores = Scorer.score_spans([eg], attr=key, getter=span_getter, allow_overlap=True, labeled=False) + assert scores[f"{key}_p"] == 1.0 + assert scores[f"{key}_r"] == 1.0 + assert f"{key}_per_type" not in scores + + +def test_prf_score(): + cand = {"hi", "ho"} + gold1 = {"yo", "hi"} + gold2 = set() + + a = PRFScore() + a.score_set(cand=cand, gold=gold1) + assert (a.precision, a.recall, a.fscore) == approx((0.5, 0.5, 0.5)) + + b = PRFScore() + b.score_set(cand=cand, gold=gold2) + assert (b.precision, b.recall, b.fscore) 
== approx((0.0, 0.0, 0.0)) + + c = a + b + assert (c.precision, c.recall, c.fscore) == approx((0.25, 0.5, 0.33333333)) + + a += b + assert (a.precision, a.recall, a.fscore) == approx((c.precision, c.recall, c.fscore)) \ No newline at end of file diff --git a/spacy/tests/tokenizer/test_explain.py b/spacy/tests/tokenizer/test_explain.py index ea6cf91be..0a10ae67d 100644 --- a/spacy/tests/tokenizer/test_explain.py +++ b/spacy/tests/tokenizer/test_explain.py @@ -1,5 +1,7 @@ import pytest +import re from spacy.util import get_lang_class +from spacy.tokenizer import Tokenizer # Only include languages with no external dependencies # "is" seems to confuse importlib, so we're also excluding it for now @@ -60,3 +62,18 @@ def test_tokenizer_explain(lang): tokens = [t.text for t in tokenizer(sentence) if not t.is_space] debug_tokens = [t[1] for t in tokenizer.explain(sentence)] assert tokens == debug_tokens + + +def test_tokenizer_explain_special_matcher(en_vocab): + suffix_re = re.compile(r"[\.]$") + infix_re = re.compile(r"[/]") + rules = {"a.": [{"ORTH": "a."}]} + tokenizer = Tokenizer( + en_vocab, + rules=rules, + suffix_search=suffix_re.search, + infix_finditer=infix_re.finditer, + ) + tokens = [t.text for t in tokenizer("a/a.")] + explain_tokens = [t[1] for t in tokenizer.explain("a/a.")] + assert tokens == explain_tokens diff --git a/spacy/tests/tokenizer/test_tokenizer.py b/spacy/tests/tokenizer/test_tokenizer.py index 4f5eddb95..6cfeaf014 100644 --- a/spacy/tests/tokenizer/test_tokenizer.py +++ b/spacy/tests/tokenizer/test_tokenizer.py @@ -1,4 +1,5 @@ import pytest +import re from spacy.vocab import Vocab from spacy.tokenizer import Tokenizer from spacy.util import ensure_path @@ -186,3 +187,31 @@ def test_tokenizer_special_cases_spaces(tokenizer): assert [t.text for t in tokenizer("a b c")] == ["a", "b", "c"] tokenizer.add_special_case("a b c", [{"ORTH": "a b c"}]) assert [t.text for t in tokenizer("a b c")] == ["a b c"] + + +def test_tokenizer_flush_cache(en_vocab): + suffix_re = re.compile(r"[\.]$") + tokenizer = Tokenizer( + en_vocab, + suffix_search=suffix_re.search, + ) + assert [t.text for t in tokenizer("a.")] == ["a", "."] + tokenizer.suffix_search = None + assert [t.text for t in tokenizer("a.")] == ["a."] + + +def test_tokenizer_flush_specials(en_vocab): + suffix_re = re.compile(r"[\.]$") + rules = {"a a": [{"ORTH": "a a"}]} + tokenizer1 = Tokenizer( + en_vocab, + suffix_search=suffix_re.search, + rules=rules, + ) + tokenizer2 = Tokenizer( + en_vocab, + suffix_search=suffix_re.search, + ) + assert [t.text for t in tokenizer1("a a.")] == ["a a", "."] + tokenizer1.rules = {} + assert [t.text for t in tokenizer1("a a.")] == ["a", "a", "."] diff --git a/spacy/tests/training/test_new_example.py b/spacy/tests/training/test_new_example.py index b8fbaf606..ba58ea96d 100644 --- a/spacy/tests/training/test_new_example.py +++ b/spacy/tests/training/test_new_example.py @@ -2,6 +2,7 @@ import pytest from spacy.training.example import Example from spacy.tokens import Doc from spacy.vocab import Vocab +from spacy.util import to_ternary_int def test_Example_init_requires_doc_objects(): @@ -121,7 +122,7 @@ def test_Example_from_dict_with_morphology(annots): [ { "words": ["This", "is", "one", "sentence", "this", "is", "another"], - "sent_starts": [1, 0, 0, 0, 1, 0, 0], + "sent_starts": [1, False, 0, None, True, -1, -5.7], } ], ) @@ -131,7 +132,12 @@ def test_Example_from_dict_with_sent_start(annots): example = Example.from_dict(predicted, annots) assert len(list(example.reference.sents)) == 2 for i, 
token in enumerate(example.reference): - assert bool(token.is_sent_start) == bool(annots["sent_starts"][i]) + if to_ternary_int(annots["sent_starts"][i]) == 1: + assert token.is_sent_start is True + elif to_ternary_int(annots["sent_starts"][i]) == 0: + assert token.is_sent_start is None + else: + assert token.is_sent_start is False @pytest.mark.parametrize( diff --git a/spacy/tests/training/test_training.py b/spacy/tests/training/test_training.py index c7a85bf87..321c08c1e 100644 --- a/spacy/tests/training/test_training.py +++ b/spacy/tests/training/test_training.py @@ -426,6 +426,29 @@ def test_aligned_spans_x2y(en_vocab, en_tokenizer): assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2), (4, 6)] +def test_aligned_spans_y2x_overlap(en_vocab, en_tokenizer): + text = "I flew to San Francisco Valley" + nlp = English() + doc = nlp(text) + # the reference doc has overlapping spans + gold_doc = nlp.make_doc(text) + spans = [] + prefix = "I flew to " + spans.append(gold_doc.char_span(len(prefix), len(prefix + "San Francisco"), label="CITY")) + spans.append(gold_doc.char_span(len(prefix), len(prefix + "San Francisco Valley"), label="VALLEY")) + spans_key = "overlap_ents" + gold_doc.spans[spans_key] = spans + example = Example(doc, gold_doc) + spans_gold = example.reference.spans[spans_key] + assert [(ent.start, ent.end) for ent in spans_gold] == [(3, 5), (3, 6)] + + # Ensure that 'get_aligned_spans_y2x' has the aligned entities correct + spans_y2x_no_overlap = example.get_aligned_spans_y2x(spans_gold, allow_overlap=False) + assert [(ent.start, ent.end) for ent in spans_y2x_no_overlap] == [(3, 5)] + spans_y2x_overlap = example.get_aligned_spans_y2x(spans_gold, allow_overlap=True) + assert [(ent.start, ent.end) for ent in spans_y2x_overlap] == [(3, 5), (3, 6)] + + def test_gold_ner_missing_tags(en_tokenizer): doc = en_tokenizer("I flew to Silicon Valley via London.") biluo_tags = [None, "O", "O", "B-LOC", "L-LOC", "O", "U-GPE", "O"] diff --git a/spacy/tests/util.py b/spacy/tests/util.py index ef7b4d00d..365ea4349 100644 --- a/spacy/tests/util.py +++ b/spacy/tests/util.py @@ -5,6 +5,7 @@ import srsly from spacy.tokens import Doc from spacy.vocab import Vocab from spacy.util import make_tempdir # noqa: F401 +from thinc.api import get_current_ops @contextlib.contextmanager @@ -58,7 +59,10 @@ def add_vecs_to_vocab(vocab, vectors): def get_cosine(vec1, vec2): """Get cosine for two given vectors""" - return numpy.dot(vec1, vec2) / (numpy.linalg.norm(vec1) * numpy.linalg.norm(vec2)) + OPS = get_current_ops() + v1 = OPS.to_numpy(OPS.asarray(vec1)) + v2 = OPS.to_numpy(OPS.asarray(vec2)) + return numpy.dot(v1, v2) / (numpy.linalg.norm(v1) * numpy.linalg.norm(v2)) def assert_docs_equal(doc1, doc2): diff --git a/spacy/tests/vocab_vectors/test_vectors.py b/spacy/tests/vocab_vectors/test_vectors.py index 4257022ea..37d48ad0f 100644 --- a/spacy/tests/vocab_vectors/test_vectors.py +++ b/spacy/tests/vocab_vectors/test_vectors.py @@ -1,6 +1,7 @@ import pytest import numpy from numpy.testing import assert_allclose, assert_equal +from thinc.api import get_current_ops from spacy.vocab import Vocab from spacy.vectors import Vectors from spacy.tokenizer import Tokenizer @@ -9,6 +10,7 @@ from spacy.tokens import Doc from ..util import add_vecs_to_vocab, get_cosine, make_tempdir +OPS = get_current_ops() @pytest.fixture def strings(): @@ -18,21 +20,21 @@ def strings(): @pytest.fixture def vectors(): return [ - ("apple", [1, 2, 3]), - ("orange", [-1, -2, -3]), - ("and", [-1, -1, -1]), - ("juice", [5, 5, 10]), - 
("pie", [7, 6.3, 8.9]), + ("apple", OPS.asarray([1, 2, 3])), + ("orange", OPS.asarray([-1, -2, -3])), + ("and", OPS.asarray([-1, -1, -1])), + ("juice", OPS.asarray([5, 5, 10])), + ("pie", OPS.asarray([7, 6.3, 8.9])), ] @pytest.fixture def ngrams_vectors(): return [ - ("apple", [1, 2, 3]), - ("app", [-0.1, -0.2, -0.3]), - ("ppl", [-0.2, -0.3, -0.4]), - ("pl", [0.7, 0.8, 0.9]), + ("apple", OPS.asarray([1, 2, 3])), + ("app", OPS.asarray([-0.1, -0.2, -0.3])), + ("ppl", OPS.asarray([-0.2, -0.3, -0.4])), + ("pl", OPS.asarray([0.7, 0.8, 0.9])), ] @@ -171,8 +173,10 @@ def test_vectors_most_similar_identical(): @pytest.mark.parametrize("text", ["apple and orange"]) def test_vectors_token_vector(tokenizer_v, vectors, text): doc = tokenizer_v(text) - assert vectors[0] == (doc[0].text, list(doc[0].vector)) - assert vectors[1] == (doc[2].text, list(doc[2].vector)) + assert vectors[0][0] == doc[0].text + assert all([a == b for a, b in zip(vectors[0][1], doc[0].vector)]) + assert vectors[1][0] == doc[2].text + assert all([a == b for a, b in zip(vectors[1][1], doc[2].vector)]) @pytest.mark.parametrize("text", ["apple"]) @@ -301,7 +305,7 @@ def test_vectors_doc_doc_similarity(vocab, text1, text2): def test_vocab_add_vector(): vocab = Vocab(vectors_name="test_vocab_add_vector") - data = numpy.ndarray((5, 3), dtype="f") + data = OPS.xp.ndarray((5, 3), dtype="f") data[0] = 1.0 data[1] = 2.0 vocab.set_vector("cat", data[0]) @@ -320,10 +324,10 @@ def test_vocab_prune_vectors(): _ = vocab["cat"] # noqa: F841 _ = vocab["dog"] # noqa: F841 _ = vocab["kitten"] # noqa: F841 - data = numpy.ndarray((5, 3), dtype="f") - data[0] = [1.0, 1.2, 1.1] - data[1] = [0.3, 1.3, 1.0] - data[2] = [0.9, 1.22, 1.05] + data = OPS.xp.ndarray((5, 3), dtype="f") + data[0] = OPS.asarray([1.0, 1.2, 1.1]) + data[1] = OPS.asarray([0.3, 1.3, 1.0]) + data[2] = OPS.asarray([0.9, 1.22, 1.05]) vocab.set_vector("cat", data[0]) vocab.set_vector("dog", data[1]) vocab.set_vector("kitten", data[2]) @@ -332,40 +336,41 @@ def test_vocab_prune_vectors(): assert list(remap.keys()) == ["kitten"] neighbour, similarity = list(remap.values())[0] assert neighbour == "cat", remap - assert_allclose(similarity, get_cosine(data[0], data[2]), atol=1e-4, rtol=1e-3) + cosine = get_cosine(data[0], data[2]) + assert_allclose(float(similarity), cosine, atol=1e-4, rtol=1e-3) def test_vectors_serialize(): - data = numpy.asarray([[4, 2, 2, 2], [4, 2, 2, 2], [1, 1, 1, 1]], dtype="f") + data = OPS.asarray([[4, 2, 2, 2], [4, 2, 2, 2], [1, 1, 1, 1]], dtype="f") v = Vectors(data=data, keys=["A", "B", "C"]) b = v.to_bytes() v_r = Vectors() v_r.from_bytes(b) - assert_equal(v.data, v_r.data) + assert_equal(OPS.to_numpy(v.data), OPS.to_numpy(v_r.data)) assert v.key2row == v_r.key2row v.resize((5, 4)) v_r.resize((5, 4)) - row = v.add("D", vector=numpy.asarray([1, 2, 3, 4], dtype="f")) - row_r = v_r.add("D", vector=numpy.asarray([1, 2, 3, 4], dtype="f")) + row = v.add("D", vector=OPS.asarray([1, 2, 3, 4], dtype="f")) + row_r = v_r.add("D", vector=OPS.asarray([1, 2, 3, 4], dtype="f")) assert row == row_r - assert_equal(v.data, v_r.data) + assert_equal(OPS.to_numpy(v.data), OPS.to_numpy(v_r.data)) assert v.is_full == v_r.is_full with make_tempdir() as d: v.to_disk(d) v_r.from_disk(d) - assert_equal(v.data, v_r.data) + assert_equal(OPS.to_numpy(v.data), OPS.to_numpy(v_r.data)) assert v.key2row == v_r.key2row v.resize((5, 4)) v_r.resize((5, 4)) - row = v.add("D", vector=numpy.asarray([10, 20, 30, 40], dtype="f")) - row_r = v_r.add("D", vector=numpy.asarray([10, 20, 30, 40], dtype="f")) 
+ row = v.add("D", vector=OPS.asarray([10, 20, 30, 40], dtype="f")) + row_r = v_r.add("D", vector=OPS.asarray([10, 20, 30, 40], dtype="f")) assert row == row_r - assert_equal(v.data, v_r.data) + assert_equal(OPS.to_numpy(v.data), OPS.to_numpy(v_r.data)) def test_vector_is_oov(): vocab = Vocab(vectors_name="test_vocab_is_oov") - data = numpy.ndarray((5, 3), dtype="f") + data = OPS.xp.ndarray((5, 3), dtype="f") data[0] = 1.0 data[1] = 2.0 vocab.set_vector("cat", data[0]) diff --git a/spacy/tokenizer.pxd b/spacy/tokenizer.pxd index 9c1398a17..2a44d7729 100644 --- a/spacy/tokenizer.pxd +++ b/spacy/tokenizer.pxd @@ -23,8 +23,8 @@ cdef class Tokenizer: cdef object _infix_finditer cdef object _rules cdef PhraseMatcher _special_matcher - cdef int _property_init_count - cdef int _property_init_max + cdef int _property_init_count # TODO: unused, remove in v3.1 + cdef int _property_init_max # TODO: unused, remove in v3.1 cdef Doc _tokenize_affixes(self, unicode string, bint with_special_cases) cdef int _apply_special_cases(self, Doc doc) except -1 diff --git a/spacy/tokenizer.pyx b/spacy/tokenizer.pyx index 5bd6e7aa3..61a7582b1 100644 --- a/spacy/tokenizer.pyx +++ b/spacy/tokenizer.pyx @@ -20,11 +20,12 @@ from .attrs import intify_attrs from .symbols import ORTH, NORM from .errors import Errors, Warnings from . import util -from .util import registry +from .util import registry, get_words_and_spaces from .attrs import intify_attrs from .symbols import ORTH from .scorer import Scorer from .training import validate_examples +from .tokens import Span cdef class Tokenizer: @@ -68,8 +69,6 @@ cdef class Tokenizer: self._rules = {} self._special_matcher = PhraseMatcher(self.vocab) self._load_special_cases(rules) - self._property_init_count = 0 - self._property_init_max = 4 property token_match: def __get__(self): @@ -78,8 +77,6 @@ cdef class Tokenizer: def __set__(self, token_match): self._token_match = token_match self._reload_special_cases() - if self._property_init_count <= self._property_init_max: - self._property_init_count += 1 property url_match: def __get__(self): @@ -87,7 +84,7 @@ cdef class Tokenizer: def __set__(self, url_match): self._url_match = url_match - self._flush_cache() + self._reload_special_cases() property prefix_search: def __get__(self): @@ -96,8 +93,6 @@ cdef class Tokenizer: def __set__(self, prefix_search): self._prefix_search = prefix_search self._reload_special_cases() - if self._property_init_count <= self._property_init_max: - self._property_init_count += 1 property suffix_search: def __get__(self): @@ -106,8 +101,6 @@ cdef class Tokenizer: def __set__(self, suffix_search): self._suffix_search = suffix_search self._reload_special_cases() - if self._property_init_count <= self._property_init_max: - self._property_init_count += 1 property infix_finditer: def __get__(self): @@ -116,8 +109,6 @@ cdef class Tokenizer: def __set__(self, infix_finditer): self._infix_finditer = infix_finditer self._reload_special_cases() - if self._property_init_count <= self._property_init_max: - self._property_init_count += 1 property rules: def __get__(self): @@ -125,7 +116,7 @@ cdef class Tokenizer: def __set__(self, rules): self._rules = {} - self._reset_cache([key for key in self._cache]) + self._flush_cache() self._flush_specials() self._cache = PreshMap() self._specials = PreshMap() @@ -225,6 +216,7 @@ cdef class Tokenizer: self.mem.free(cached) def _flush_specials(self): + self._special_matcher = PhraseMatcher(self.vocab) for k in self._specials: cached = <_Cached*>self._specials.get(k) del 
self._specials[k] @@ -567,7 +559,6 @@ cdef class Tokenizer: """Add special-case tokenization rules.""" if special_cases is not None: for chunk, substrings in sorted(special_cases.items()): - self._validate_special_case(chunk, substrings) self.add_special_case(chunk, substrings) def _validate_special_case(self, chunk, substrings): @@ -615,16 +606,9 @@ cdef class Tokenizer: self._special_matcher.add(string, None, self._tokenize_affixes(string, False)) def _reload_special_cases(self): - try: - self._property_init_count - except AttributeError: - return - # only reload if all 4 of prefix, suffix, infix, token_match have - # have been initialized - if self.vocab is not None and self._property_init_count >= self._property_init_max: - self._flush_cache() - self._flush_specials() - self._load_special_cases(self._rules) + self._flush_cache() + self._flush_specials() + self._load_special_cases(self._rules) def explain(self, text): """A debugging tokenizer that provides information about which @@ -638,8 +622,14 @@ cdef class Tokenizer: DOCS: https://spacy.io/api/tokenizer#explain """ prefix_search = self.prefix_search + if prefix_search is None: + prefix_search = re.compile("a^").search suffix_search = self.suffix_search + if suffix_search is None: + suffix_search = re.compile("a^").search infix_finditer = self.infix_finditer + if infix_finditer is None: + infix_finditer = re.compile("a^").finditer token_match = self.token_match if token_match is None: token_match = re.compile("a^").match @@ -687,7 +677,7 @@ cdef class Tokenizer: tokens.append(("URL_MATCH", substring)) substring = '' elif substring in special_cases: - tokens.extend(("SPECIAL-" + str(i + 1), self.vocab.strings[e[ORTH]]) for i, e in enumerate(special_cases[substring])) + tokens.extend((f"SPECIAL-{i + 1}", self.vocab.strings[e[ORTH]]) for i, e in enumerate(special_cases[substring])) substring = '' elif list(infix_finditer(substring)): infixes = infix_finditer(substring) @@ -705,7 +695,33 @@ cdef class Tokenizer: tokens.append(("TOKEN", substring)) substring = '' tokens.extend(reversed(suffixes)) - return tokens + # Find matches for special cases handled by special matcher + words, spaces = get_words_and_spaces([t[1] for t in tokens], text) + t_words = [] + t_spaces = [] + for word, space in zip(words, spaces): + if not word.isspace(): + t_words.append(word) + t_spaces.append(space) + doc = Doc(self.vocab, words=t_words, spaces=t_spaces) + matches = self._special_matcher(doc) + spans = [Span(doc, s, e, label=m_id) for m_id, s, e in matches] + spans = util.filter_spans(spans) + # Replace matched tokens with their exceptions + i = 0 + final_tokens = [] + spans_by_start = {s.start: s for s in spans} + while i < len(tokens): + if i in spans_by_start: + span = spans_by_start[i] + exc = [d[ORTH] for d in special_cases[span.label_]] + for j, orth in enumerate(exc): + final_tokens.append((f"SPECIAL-{j + 1}", self.vocab.strings[orth])) + i += len(span) + else: + final_tokens.append(tokens[i]) + i += 1 + return final_tokens def score(self, examples, **kwargs): validate_examples(examples, "Tokenizer.score") @@ -778,6 +794,15 @@ cdef class Tokenizer: "url_match": lambda b: data.setdefault("url_match", b), "exceptions": lambda b: data.setdefault("rules", b) } + # reset all properties and flush all caches (through rules), + # reset rules first so that _reload_special_cases is trivial/fast as + # the other properties are reset + self.rules = {} + self.prefix_search = None + self.suffix_search = None + self.infix_finditer = None + self.token_match = 
None + self.url_match = None msg = util.from_bytes(bytes_data, deserializers, exclude) if "prefix_search" in data and isinstance(data["prefix_search"], str): self.prefix_search = re.compile(data["prefix_search"]).search @@ -785,22 +810,12 @@ cdef class Tokenizer: self.suffix_search = re.compile(data["suffix_search"]).search if "infix_finditer" in data and isinstance(data["infix_finditer"], str): self.infix_finditer = re.compile(data["infix_finditer"]).finditer - # for token_match and url_match, set to None to override the language - # defaults if no regex is provided if "token_match" in data and isinstance(data["token_match"], str): self.token_match = re.compile(data["token_match"]).match - else: - self.token_match = None if "url_match" in data and isinstance(data["url_match"], str): self.url_match = re.compile(data["url_match"]).match - else: - self.url_match = None if "rules" in data and isinstance(data["rules"], dict): - # make sure to hard reset the cache to remove data from the default exceptions - self._rules = {} - self._flush_cache() - self._flush_specials() - self._load_special_cases(data["rules"]) + self.rules = data["rules"] return self diff --git a/spacy/tokens/_retokenize.pyx b/spacy/tokens/_retokenize.pyx index 3cb2965a9..43e6d4aa7 100644 --- a/spacy/tokens/_retokenize.pyx +++ b/spacy/tokens/_retokenize.pyx @@ -281,7 +281,8 @@ def _merge(Doc doc, merges): for i in range(doc.length): doc.c[i].head -= i # Set the left/right children, left/right edges - set_children_from_heads(doc.c, 0, doc.length) + if doc.has_annotation("DEP"): + set_children_from_heads(doc.c, 0, doc.length) # Make sure ent_iob remains consistent make_iob_consistent(doc.c, doc.length) # Return the merged Python object @@ -294,7 +295,19 @@ def _resize_tensor(tensor, ranges): for i in range(start, end-1): delete.append(i) xp = get_array_module(tensor) - return xp.delete(tensor, delete, axis=0) + if xp is numpy: + return xp.delete(tensor, delete, axis=0) + else: + offset = 0 + copy_start = 0 + resized_shape = (tensor.shape[0] - len(delete), tensor.shape[1]) + for start, end in ranges: + if copy_start > 0: + tensor[copy_start - offset:start - offset] = tensor[copy_start: start] + offset += end - start - 1 + copy_start = end - 1 + tensor[copy_start - offset:resized_shape[0]] = tensor[copy_start:] + return xp.asarray(tensor[:resized_shape[0]]) def _split(Doc doc, int token_index, orths, heads, attrs): @@ -331,7 +344,13 @@ def _split(Doc doc, int token_index, orths, heads, attrs): to_process_tensor = (doc.tensor is not None and doc.tensor.size != 0) if to_process_tensor: xp = get_array_module(doc.tensor) - doc.tensor = xp.append(doc.tensor, xp.zeros((nb_subtokens,doc.tensor.shape[1]), dtype="float32"), axis=0) + if xp is numpy: + doc.tensor = xp.append(doc.tensor, xp.zeros((nb_subtokens,doc.tensor.shape[1]), dtype="float32"), axis=0) + else: + shape = (doc.tensor.shape[0] + nb_subtokens, doc.tensor.shape[1]) + resized_array = xp.zeros(shape, dtype="float32") + resized_array[:doc.tensor.shape[0]] = doc.tensor[:doc.tensor.shape[0]] + doc.tensor = resized_array for token_to_move in range(orig_length - 1, token_index, -1): doc.c[token_to_move + nb_subtokens - 1] = doc.c[token_to_move] if to_process_tensor: @@ -348,7 +367,7 @@ def _split(Doc doc, int token_index, orths, heads, attrs): token.norm = 0 # reset norm if to_process_tensor: # setting the tensors of the split tokens to array of zeros - doc.tensor[token_index + i] = xp.zeros((1,doc.tensor.shape[1]), dtype="float32") + doc.tensor[token_index + i:token_index + i + 
1] = xp.zeros((1,doc.tensor.shape[1]), dtype="float32") # Update the character offset of the subtokens if i != 0: token.idx = orig_token.idx + idx_offset @@ -392,7 +411,8 @@ def _split(Doc doc, int token_index, orths, heads, attrs): for i in range(doc.length): doc.c[i].head -= i # set children from head - set_children_from_heads(doc.c, 0, doc.length) + if doc.has_annotation("DEP"): + set_children_from_heads(doc.c, 0, doc.length) def _validate_extensions(extensions): diff --git a/spacy/tokens/doc.pyx b/spacy/tokens/doc.pyx index 850036483..aae0ff374 100644 --- a/spacy/tokens/doc.pyx +++ b/spacy/tokens/doc.pyx @@ -6,7 +6,7 @@ from libc.math cimport sqrt from libc.stdint cimport int32_t, uint64_t import copy -from collections import Counter +from collections import Counter, defaultdict from enum import Enum import itertools import numpy @@ -1120,13 +1120,14 @@ cdef class Doc: concat_words = [] concat_spaces = [] concat_user_data = {} + concat_spans = defaultdict(list) char_offset = 0 for doc in docs: concat_words.extend(t.text for t in doc) concat_spaces.extend(bool(t.whitespace_) for t in doc) for key, value in doc.user_data.items(): - if isinstance(key, tuple) and len(key) == 4: + if isinstance(key, tuple) and len(key) == 4 and key[0] == "._.": data_type, name, start, end = key if start is not None or end is not None: start += char_offset @@ -1137,8 +1138,17 @@ cdef class Doc: warnings.warn(Warnings.W101.format(name=name)) else: warnings.warn(Warnings.W102.format(key=key, value=value)) + for key in doc.spans: + for span in doc.spans[key]: + concat_spans[key].append(( + span.start_char + char_offset, + span.end_char + char_offset, + span.label, + span.kb_id, + span.text, # included as a check + )) char_offset += len(doc.text) - if ensure_whitespace and not (len(doc) > 0 and doc[-1].is_space): + if len(doc) > 0 and ensure_whitespace and not doc[-1].is_space: char_offset += 1 arrays = [doc.to_array(attrs) for doc in docs] @@ -1160,6 +1170,22 @@ cdef class Doc: concat_doc.from_array(attrs, concat_array) + for key in concat_spans: + if key not in concat_doc.spans: + concat_doc.spans[key] = [] + for span_tuple in concat_spans[key]: + span = concat_doc.char_span( + span_tuple[0], + span_tuple[1], + label=span_tuple[2], + kb_id=span_tuple[3], + ) + text = span_tuple[4] + if span is not None and span.text == text: + concat_doc.spans[key].append(span) + else: + raise ValueError(Errors.E873.format(key=key, text=text)) + return concat_doc def get_lca_matrix(self): diff --git a/spacy/tokens/span.pyx b/spacy/tokens/span.pyx index 06d86d2ac..614d8fda5 100644 --- a/spacy/tokens/span.pyx +++ b/spacy/tokens/span.pyx @@ -6,6 +6,7 @@ from libc.math cimport sqrt import numpy from thinc.api import get_array_module import warnings +import copy from .doc cimport token_by_start, token_by_end, get_token_attr, _get_lca_matrix from ..structs cimport TokenC, LexemeC @@ -241,7 +242,19 @@ cdef class Span: if cat_start == self.start_char and cat_end == self.end_char: doc.cats[cat_label] = value if copy_user_data: - doc.user_data = self.doc.user_data + user_data = {} + char_offset = self.start_char + for key, value in self.doc.user_data.items(): + if isinstance(key, tuple) and len(key) == 4 and key[0] == "._.": + data_type, name, start, end = key + if start is not None or end is not None: + start -= char_offset + if end is not None: + end -= char_offset + user_data[(data_type, name, start, end)] = copy.copy(value) + else: + user_data[key] = copy.copy(value) + doc.user_data = user_data return doc def _fix_dep_copy(self, 
attrs, array): diff --git a/spacy/training/__init__.py b/spacy/training/__init__.py index 5111b80dc..055f30f42 100644 --- a/spacy/training/__init__.py +++ b/spacy/training/__init__.py @@ -8,3 +8,4 @@ from .iob_utils import biluo_tags_to_spans, tags_to_entities # noqa: F401 from .gold_io import docs_to_json, read_json_file # noqa: F401 from .batchers import minibatch_by_padded_size, minibatch_by_words # noqa: F401 from .loggers import console_logger, wandb_logger # noqa: F401 +from .callbacks import create_copy_from_base_model # noqa: F401 diff --git a/spacy/training/callbacks.py b/spacy/training/callbacks.py new file mode 100644 index 000000000..2a21be98c --- /dev/null +++ b/spacy/training/callbacks.py @@ -0,0 +1,32 @@ +from typing import Optional +from ..errors import Errors +from ..language import Language +from ..util import load_model, registry, logger + + +@registry.callbacks("spacy.copy_from_base_model.v1") +def create_copy_from_base_model( + tokenizer: Optional[str] = None, + vocab: Optional[str] = None, +) -> Language: + def copy_from_base_model(nlp): + if tokenizer: + logger.info(f"Copying tokenizer from: {tokenizer}") + base_nlp = load_model(tokenizer) + if nlp.config["nlp"]["tokenizer"] == base_nlp.config["nlp"]["tokenizer"]: + nlp.tokenizer.from_bytes(base_nlp.tokenizer.to_bytes(exclude=["vocab"])) + else: + raise ValueError( + Errors.E872.format( + curr_config=nlp.config["nlp"]["tokenizer"], + base_config=base_nlp.config["nlp"]["tokenizer"], + ) + ) + if vocab: + logger.info(f"Copying vocab from: {vocab}") + # only reload if the vocab is from a different model + if tokenizer != vocab: + base_nlp = load_model(vocab) + nlp.vocab.from_bytes(base_nlp.vocab.to_bytes()) + + return copy_from_base_model diff --git a/spacy/training/converters/conll_ner_to_docs.py b/spacy/training/converters/conll_ner_to_docs.py index 8c1bad9ea..28b21c5f0 100644 --- a/spacy/training/converters/conll_ner_to_docs.py +++ b/spacy/training/converters/conll_ner_to_docs.py @@ -124,6 +124,9 @@ def segment_sents_and_docs(doc, n_sents, doc_delimiter, model=None, msg=None): nlp = load_model(model) if "parser" in nlp.pipe_names: msg.info(f"Segmenting sentences with parser from model '{model}'.") + for name, proc in nlp.pipeline: + if "parser" in getattr(proc, "listening_components", []): + nlp.replace_listeners(name, "parser", ["model.tok2vec"]) sentencizer = nlp.get_pipe("parser") if not sentencizer: msg.info( diff --git a/spacy/training/corpus.py b/spacy/training/corpus.py index 079b872d6..063d80a95 100644 --- a/spacy/training/corpus.py +++ b/spacy/training/corpus.py @@ -2,6 +2,7 @@ import warnings from typing import Union, List, Iterable, Iterator, TYPE_CHECKING, Callable from typing import Optional from pathlib import Path +import random import srsly from .. import util @@ -96,6 +97,7 @@ class Corpus: Defaults to 0, which indicates no limit. augment (Callable[Example, Iterable[Example]]): Optional data augmentation function, to extrapolate additional examples from your annotations. + shuffle (bool): Whether to shuffle the examples. 
DOCS: https://spacy.io/api/corpus """ @@ -108,12 +110,14 @@ class Corpus: gold_preproc: bool = False, max_length: int = 0, augmenter: Optional[Callable] = None, + shuffle: bool = False, ) -> None: self.path = util.ensure_path(path) self.gold_preproc = gold_preproc self.max_length = max_length self.limit = limit self.augmenter = augmenter if augmenter is not None else dont_augment + self.shuffle = shuffle def __call__(self, nlp: "Language") -> Iterator[Example]: """Yield examples from the data. @@ -124,6 +128,10 @@ class Corpus: DOCS: https://spacy.io/api/corpus#call """ ref_docs = self.read_docbin(nlp.vocab, walk_corpus(self.path, FILE_TYPE)) + if self.shuffle: + ref_docs = list(ref_docs) + random.shuffle(ref_docs) + if self.gold_preproc: examples = self.make_examples_gold_preproc(nlp, ref_docs) else: diff --git a/spacy/training/example.pyx b/spacy/training/example.pyx index 9cf825bf9..07a83bfec 100644 --- a/spacy/training/example.pyx +++ b/spacy/training/example.pyx @@ -13,7 +13,7 @@ from .iob_utils import biluo_tags_to_spans from ..errors import Errors, Warnings from ..pipeline._parser_internals import nonproj from ..tokens.token cimport MISSING_DEP -from ..util import logger +from ..util import logger, to_ternary_int cpdef Doc annotations_to_doc(vocab, tok_annot, doc_annot): @@ -213,18 +213,19 @@ cdef class Example: else: return [None] * len(self.x) - def get_aligned_spans_x2y(self, x_spans): - return self._get_aligned_spans(self.y, x_spans, self.alignment.x2y) + def get_aligned_spans_x2y(self, x_spans, allow_overlap=False): + return self._get_aligned_spans(self.y, x_spans, self.alignment.x2y, allow_overlap) - def get_aligned_spans_y2x(self, y_spans): - return self._get_aligned_spans(self.x, y_spans, self.alignment.y2x) + def get_aligned_spans_y2x(self, y_spans, allow_overlap=False): + return self._get_aligned_spans(self.x, y_spans, self.alignment.y2x, allow_overlap) - def _get_aligned_spans(self, doc, spans, align): + def _get_aligned_spans(self, doc, spans, align, allow_overlap): seen = set() output = [] for span in spans: indices = align[span.start : span.end].data.ravel() - indices = [idx for idx in indices if idx not in seen] + if not allow_overlap: + indices = [idx for idx in indices if idx not in seen] if len(indices) >= 1: aligned_span = Span(doc, indices[0], indices[-1] + 1, label=span.label) target_text = span.text.lower().strip().replace(" ", "") @@ -237,7 +238,7 @@ cdef class Example: def get_aligned_ner(self): if not self.y.has_annotation("ENT_IOB"): return [None] * len(self.x) # should this be 'missing' instead of 'None' ? 
- x_ents = self.get_aligned_spans_y2x(self.y.ents) + x_ents = self.get_aligned_spans_y2x(self.y.ents, allow_overlap=False) # Default to 'None' for missing values x_tags = offsets_to_biluo_tags( self.x, @@ -337,7 +338,7 @@ def _annot2array(vocab, tok_annot, doc_annot): values.append([vocab.strings.add(h) if h is not None else MISSING_DEP for h in value]) elif key == "SENT_START": attrs.append(key) - values.append(value) + values.append([to_ternary_int(v) for v in value]) elif key == "MORPH": attrs.append(key) values.append([vocab.morphology.add(v) for v in value]) diff --git a/spacy/training/gold_io.pyx b/spacy/training/gold_io.pyx index 327748d01..69654e2c7 100644 --- a/spacy/training/gold_io.pyx +++ b/spacy/training/gold_io.pyx @@ -121,7 +121,7 @@ def json_to_annotations(doc): if i == 0: sent_starts.append(1) else: - sent_starts.append(0) + sent_starts.append(-1) if "brackets" in sent: brackets.extend((b["first"] + sent_start_i, b["last"] + sent_start_i, b["label"]) diff --git a/spacy/training/initialize.py b/spacy/training/initialize.py index f623627eb..36384d67b 100644 --- a/spacy/training/initialize.py +++ b/spacy/training/initialize.py @@ -8,6 +8,7 @@ import tarfile import gzip import zipfile import tqdm +from itertools import islice from .pretrain import get_tok2vec_ref from ..lookups import Lookups @@ -68,7 +69,11 @@ def init_nlp(config: Config, *, use_gpu: int = -1) -> "Language": # Make sure that listeners are defined before initializing further nlp._link_components() with nlp.select_pipes(disable=[*frozen_components, *resume_components]): - nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer) + if T["max_epochs"] == -1: + logger.debug("Due to streamed train corpus, using only first 100 examples for initialization. If necessary, provide all labels in [initialize]. More info: https://spacy.io/api/cli#init_labels") + nlp.initialize(lambda: islice(train_corpus(nlp), 100), sgd=optimizer) + else: + nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer) logger.info(f"Initialized pipeline components: {nlp.pipe_names}") # Detect components with listeners that are not frozen consistently for name, proc in nlp.pipeline: @@ -133,6 +138,10 @@ def load_vectors_into_model( ) err = ConfigValidationError.from_error(e, title=title, desc=desc) raise err from None + + if len(vectors_nlp.vocab.vectors.keys()) == 0: + logger.warning(Warnings.W112.format(name=name)) + nlp.vocab.vectors = vectors_nlp.vocab.vectors if add_strings: # I guess we should add the strings from the vectors_nlp model? 
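The `_get_aligned_spans` change above adds an `allow_overlap` flag to `Example.get_aligned_spans_y2x`/`get_aligned_spans_x2y`, matching the `test_aligned_spans_y2x_overlap` test added earlier in this diff. A minimal usage sketch, not part of the patch itself; the blank English pipeline and the `"cities"` span key are illustrative assumptions:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
text = "I flew to San Francisco Valley"
predicted = nlp.make_doc(text)
reference = nlp.make_doc(text)
prefix = "I flew to "
# Two overlapping gold spans over the same tokens
reference.spans["cities"] = [
    reference.char_span(len(prefix), len(prefix) + len("San Francisco"), label="CITY"),
    reference.char_span(len(prefix), len(prefix) + len("San Francisco Valley"), label="VALLEY"),
]
example = Example(predicted, reference)
gold_spans = example.reference.spans["cities"]
# Overlapping aligned spans are dropped by default ...
print([(s.start, s.end) for s in example.get_aligned_spans_y2x(gold_spans)])  # [(3, 5)]
# ... and kept when explicitly allowed
print([(s.start, s.end) for s in example.get_aligned_spans_y2x(gold_spans, allow_overlap=True)])  # [(3, 5), (3, 6)]
```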
diff --git a/spacy/training/loggers.py b/spacy/training/loggers.py index 8acf2783c..ef6c86044 100644 --- a/spacy/training/loggers.py +++ b/spacy/training/loggers.py @@ -101,8 +101,13 @@ def console_logger(progress_bar: bool = False): return setup_printer -@registry.loggers("spacy.WandbLogger.v1") -def wandb_logger(project_name: str, remove_config_values: List[str] = []): +@registry.loggers("spacy.WandbLogger.v2") +def wandb_logger( + project_name: str, + remove_config_values: List[str] = [], + model_log_interval: Optional[int] = None, + log_dataset_dir: Optional[str] = None, +): try: import wandb from wandb import init, log, join # test that these are available @@ -119,9 +124,23 @@ def wandb_logger(project_name: str, remove_config_values: List[str] = []): for field in remove_config_values: del config_dot[field] config = util.dot_to_dict(config_dot) - wandb.init(project=project_name, config=config, reinit=True) + run = wandb.init(project=project_name, config=config, reinit=True) console_log_step, console_finalize = console(nlp, stdout, stderr) + def log_dir_artifact( + path: str, + name: str, + type: str, + metadata: Optional[Dict[str, Any]] = {}, + aliases: Optional[List[str]] = [], + ): + dataset_artifact = wandb.Artifact(name, type=type, metadata=metadata) + dataset_artifact.add_dir(path, name=name) + wandb.log_artifact(dataset_artifact, aliases=aliases) + + if log_dataset_dir: + log_dir_artifact(path=log_dataset_dir, name="dataset", type="dataset") + def log_step(info: Optional[Dict[str, Any]]): console_log_step(info) if info is not None: @@ -133,6 +152,21 @@ def wandb_logger(project_name: str, remove_config_values: List[str] = []): wandb.log({f"loss_{k}": v for k, v in losses.items()}) if isinstance(other_scores, dict): wandb.log(other_scores) + if model_log_interval and info.get("output_path"): + if info["step"] % model_log_interval == 0 and info["step"] != 0: + log_dir_artifact( + path=info["output_path"], + name="pipeline_" + run.id, + type="checkpoint", + metadata=info, + aliases=[ + f"epoch {info['epoch']} step {info['step']}", + "latest", + "best" + if info["score"] == max(info["checkpoints"])[0] + else "", + ], + ) def finalize() -> None: console_finalize() diff --git a/spacy/training/loop.py b/spacy/training/loop.py index 55919014b..ecfa12fdb 100644 --- a/spacy/training/loop.py +++ b/spacy/training/loop.py @@ -78,7 +78,7 @@ def train( training_step_iterator = train_while_improving( nlp, optimizer, - create_train_batches(train_corpus(nlp), batcher, T["max_epochs"]), + create_train_batches(nlp, train_corpus, batcher, T["max_epochs"]), create_evaluation_callback(nlp, dev_corpus, score_weights), dropout=T["dropout"], accumulate_gradient=T["accumulate_gradient"], @@ -96,12 +96,13 @@ def train( log_step, finalize_logger = train_logger(nlp, stdout, stderr) try: for batch, info, is_best_checkpoint in training_step_iterator: - log_step(info if is_best_checkpoint is not None else None) if is_best_checkpoint is not None: with nlp.select_pipes(disable=frozen_components): update_meta(T, nlp, info) if output_path is not None: save_checkpoint(is_best_checkpoint) + info["output_path"] = str(output_path / DIR_MODEL_LAST) + log_step(info if is_best_checkpoint is not None else None) except Exception as e: if output_path is not None: stdout.write( @@ -289,17 +290,22 @@ def create_evaluation_callback( def create_train_batches( - iterator: Iterator[Example], + nlp: "Language", + corpus: Callable[["Language"], Iterable[Example]], batcher: Callable[[Iterable[Example]], Iterable[Example]], max_epochs: 
int, ): epoch = 0 - examples = list(iterator) - if not examples: - # Raise error if no data - raise ValueError(Errors.E986) + if max_epochs >= 0: + examples = list(corpus(nlp)) + if not examples: + # Raise error if no data + raise ValueError(Errors.E986) while max_epochs < 1 or epoch != max_epochs: - random.shuffle(examples) + if max_epochs >= 0: + random.shuffle(examples) + else: + examples = corpus(nlp) for batch in batcher(examples): yield epoch, batch epoch += 1 diff --git a/spacy/util.py b/spacy/util.py index 9915de935..512c6b742 100644 --- a/spacy/util.py +++ b/spacy/util.py @@ -36,7 +36,7 @@ except ImportError: try: # Python 3.8 import importlib.metadata as importlib_metadata except ImportError: - import importlib_metadata + from catalogue import _importlib_metadata as importlib_metadata # These are functions that were previously (v2.x) available from spacy.util # and have since moved to Thinc. We're importing them here so people's code @@ -1526,3 +1526,18 @@ def check_lexeme_norms(vocab, component_name): if len(lexeme_norms) == 0 and vocab.lang in LEXEME_NORM_LANGS: langs = ", ".join(LEXEME_NORM_LANGS) logger.debug(Warnings.W033.format(model=component_name, langs=langs)) + + +def to_ternary_int(val) -> int: + """Convert a value to the ternary 1/0/-1 int used for True/None/False in + attributes such as SENT_START: True/1/1.0 is 1 (True), None/0/0.0 is 0 + (None), any other values are -1 (False). + """ + if isinstance(val, float): + val = int(val) + if val is True or val is 1: + return 1 + elif val is None or val is 0: + return 0 + else: + return -1 diff --git a/spacy/vectors.pyx b/spacy/vectors.pyx index bcea87e67..7cb3322c2 100644 --- a/spacy/vectors.pyx +++ b/spacy/vectors.pyx @@ -55,7 +55,7 @@ cdef class Vectors: """Create a new vector store. shape (tuple): Size of the table, as (# entries, # columns) - data (numpy.ndarray): The vector data. + data (numpy.ndarray or cupy.ndarray): The vector data. keys (iterable): A sequence of keys, aligned with the data. name (str): A name to identify the vectors table. @@ -65,7 +65,8 @@ cdef class Vectors: if data is None: if shape is None: shape = (0,0) - data = numpy.zeros(shape, dtype="f") + ops = get_current_ops() + data = ops.xp.zeros(shape, dtype="f") self.data = data self.key2row = {} if self.data is not None: @@ -300,6 +301,8 @@ cdef class Vectors: else: raise ValueError(Errors.E197.format(row=row, key=key)) if vector is not None: + xp = get_array_module(self.data) + vector = xp.asarray(vector) self.data[row] = vector if self._unset.count(row): self._unset.erase(self._unset.find(row)) @@ -321,10 +324,11 @@ cdef class Vectors: RETURNS (tuple): The most similar entries as a `(keys, best_rows, scores)` tuple. 
""" + xp = get_array_module(self.data) filled = sorted(list({row for row in self.key2row.values()})) if len(filled) < n: raise ValueError(Errors.E198.format(n=n, n_rows=len(filled))) - xp = get_array_module(self.data) + filled = xp.asarray(filled) norms = xp.linalg.norm(self.data[filled], axis=1, keepdims=True) norms[norms == 0] = 1 @@ -357,8 +361,10 @@ cdef class Vectors: # Account for numerical error we want to return in range -1, 1 scores = xp.clip(scores, a_min=-1, a_max=1, out=scores) row2key = {row: key for key, row in self.key2row.items()} + + numpy_rows = get_current_ops().to_numpy(best_rows) keys = xp.asarray( - [[row2key[row] for row in best_rows[i] if row in row2key] + [[row2key[row] for row in numpy_rows[i] if row in row2key] for i in range(len(queries)) ], dtype="uint64") return (keys, best_rows, scores) @@ -459,7 +465,8 @@ cdef class Vectors: if hasattr(self.data, "from_bytes"): self.data.from_bytes() else: - self.data = srsly.msgpack_loads(b) + xp = get_array_module(self.data) + self.data = xp.asarray(srsly.msgpack_loads(b)) deserializers = { "key2row": lambda b: self.key2row.update(srsly.msgpack_loads(b)), diff --git a/spacy/vocab.pyx b/spacy/vocab.pyx index 1008797b3..ee440898a 100644 --- a/spacy/vocab.pyx +++ b/spacy/vocab.pyx @@ -2,7 +2,7 @@ from libc.string cimport memcpy import srsly -from thinc.api import get_array_module +from thinc.api import get_array_module, get_current_ops import functools from .lexeme cimport EMPTY_LEXEME, OOV_RANK @@ -293,7 +293,7 @@ cdef class Vocab: among those remaining. For example, suppose the original table had vectors for the words: - ['sat', 'cat', 'feline', 'reclined']. If we prune the vector table to, + ['sat', 'cat', 'feline', 'reclined']. If we prune the vector table to two rows, we would discard the vectors for 'feline' and 'reclined'. These words would then be remapped to the closest remaining vector -- so "feline" would have the same vector as "cat", and "reclined" @@ -314,6 +314,7 @@ cdef class Vocab: DOCS: https://spacy.io/api/vocab#prune_vectors """ + ops = get_current_ops() xp = get_array_module(self.vectors.data) # Make sure all vectors are in the vocab for orth in self.vectors: @@ -329,8 +330,9 @@ cdef class Vocab: toss = xp.ascontiguousarray(self.vectors.data[indices[nr_row:]]) self.vectors = Vectors(data=keep, keys=keys[:nr_row], name=self.vectors.name) syn_keys, syn_rows, scores = self.vectors.most_similar(toss, batch_size=batch_size) + syn_keys = ops.to_numpy(syn_keys) remap = {} - for i, key in enumerate(keys[nr_row:]): + for i, key in enumerate(ops.to_numpy(keys[nr_row:])): self.vectors.add(key, row=syn_rows[i][0]) word = self.strings[key] synonym = self.strings[syn_keys[i][0]] @@ -351,7 +353,7 @@ cdef class Vocab: Defaults to the length of `orth`. maxn (int): Maximum n-gram length used for Fasttext's ngram computation. Defaults to the length of `orth`. - RETURNS (numpy.ndarray): A word vector. Size + RETURNS (numpy.ndarray or cupy.ndarray): A word vector. Size and shape determined by the `vocab.vectors` instance. Usually, a numpy ndarray of shape (300,) and dtype float32. @@ -400,7 +402,7 @@ cdef class Vocab: by string or int ID. orth (int / unicode): The word. - vector (numpy.ndarray[ndim=1, dtype='float32']): The vector to set. + vector (numpy.ndarray or cupy.nadarry[ndim=1, dtype='float32']): The vector to set. 
DOCS: https://spacy.io/api/vocab#set_vector """ diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md index 4c4bf73f4..e09352ec9 100644 --- a/website/docs/api/architectures.md +++ b/website/docs/api/architectures.md @@ -35,7 +35,7 @@ usage documentation on > @architectures = "spacy.Tok2Vec.v2" > > [model.embed] -> @architectures = "spacy.CharacterEmbed.v1" +> @architectures = "spacy.CharacterEmbed.v2" > # ... > > [model.encode] @@ -54,13 +54,13 @@ blog post for background. | `encode` | Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder](/api/architectures#MaxoutWindowEncoder). ~~Model[List[Floats2d], List[Floats2d]]~~ | | **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | -### spacy.HashEmbedCNN.v1 {#HashEmbedCNN} +### spacy.HashEmbedCNN.v2 {#HashEmbedCNN} > #### Example Config > > ```ini > [model] -> @architectures = "spacy.HashEmbedCNN.v1" +> @architectures = "spacy.HashEmbedCNN.v2" > pretrained_vectors = null > width = 96 > depth = 4 @@ -96,7 +96,7 @@ consisting of a CNN and a layer-normalized maxout activation function. > factory = "tok2vec" > > [components.tok2vec.model] -> @architectures = "spacy.HashEmbedCNN.v1" +> @architectures = "spacy.HashEmbedCNN.v2" > width = 342 > > [components.tagger] @@ -129,13 +129,13 @@ argument that connects to the shared `tok2vec` component in the pipeline. | `upstream` | A string to identify the "upstream" `Tok2Vec` component to communicate with. By default, the upstream name is the wildcard string `"*"`, but you could also specify the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. ~~str~~ | | **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | -### spacy.MultiHashEmbed.v1 {#MultiHashEmbed} +### spacy.MultiHashEmbed.v2 {#MultiHashEmbed} > #### Example config > > ```ini > [model] -> @architectures = "spacy.MultiHashEmbed.v1" +> @architectures = "spacy.MultiHashEmbed.v2" > width = 64 > attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"] > rows = [2000, 1000, 1000, 1000] @@ -160,13 +160,13 @@ not updated). | `include_static_vectors` | Whether to also use static word vectors. Requires a vectors table to be loaded in the [`Doc`](/api/doc) objects' vocab. ~~bool~~ | | **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | -### spacy.CharacterEmbed.v1 {#CharacterEmbed} +### spacy.CharacterEmbed.v2 {#CharacterEmbed} > #### Example config > > ```ini > [model] -> @architectures = "spacy.CharacterEmbed.v1" +> @architectures = "spacy.CharacterEmbed.v2" > width = 128 > rows = 7000 > nM = 64 @@ -266,13 +266,13 @@ Encode context using bidirectional LSTM layers. Requires | `dropout` | Creates a Dropout layer on the outputs of each LSTM layer except the last layer. Set to 0.0 to disable this functionality. ~~float~~ | | **CREATES** | The model using the architecture. ~~Model[List[Floats2d], List[Floats2d]]~~ | -### spacy.StaticVectors.v1 {#StaticVectors} +### spacy.StaticVectors.v2 {#StaticVectors} > #### Example config > > ```ini > [model] -> @architectures = "spacy.StaticVectors.v1" +> @architectures = "spacy.StaticVectors.v2" > nO = null > nM = null > dropout = 0.2 @@ -283,8 +283,9 @@ Encode context using bidirectional LSTM layers. 
Requires > ``` Embed [`Doc`](/api/doc) objects with their vocab's vectors table, applying a -learned linear projection to control the dimensionality. See the documentation -on [static vectors](/usage/embeddings-transformers#static-vectors) for details. +learned linear projection to control the dimensionality. Unknown tokens are +mapped to a zero vector. See the documentation on [static +vectors](/usage/embeddings-transformers#static-vectors) for details. | Name |  Description | | ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -513,7 +514,7 @@ for a Tok2Vec layer. > use_upper = true > > [model.tok2vec] -> @architectures = "spacy.HashEmbedCNN.v1" +> @architectures = "spacy.HashEmbedCNN.v2" > pretrained_vectors = null > width = 96 > depth = 4 @@ -619,7 +620,7 @@ single-label use-cases where `exclusive_classes = true`, while the > @architectures = "spacy.Tok2Vec.v2" > > [model.tok2vec.embed] -> @architectures = "spacy.MultiHashEmbed.v1" +> @architectures = "spacy.MultiHashEmbed.v2" > width = 64 > rows = [2000, 2000, 1000, 1000, 1000, 1000] > attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"] @@ -676,7 +677,7 @@ taking it as argument: > nO = null > > [model.tok2vec] -> @architectures = "spacy.HashEmbedCNN.v1" +> @architectures = "spacy.HashEmbedCNN.v2" > pretrained_vectors = null > width = 96 > depth = 4 @@ -744,7 +745,7 @@ into the "real world". This requires 3 main components: > nO = null > > [model.tok2vec] -> @architectures = "spacy.HashEmbedCNN.v1" +> @architectures = "spacy.HashEmbedCNN.v2" > pretrained_vectors = null > width = 96 > depth = 2 diff --git a/website/docs/api/cli.md b/website/docs/api/cli.md index 73a03cba8..196e47543 100644 --- a/website/docs/api/cli.md +++ b/website/docs/api/cli.md @@ -12,6 +12,7 @@ menu: - ['train', 'train'] - ['pretrain', 'pretrain'] - ['evaluate', 'evaluate'] + - ['assemble', 'assemble'] - ['package', 'package'] - ['project', 'project'] - ['ray', 'ray'] @@ -892,6 +893,34 @@ $ python -m spacy evaluate [model] [data_path] [--output] [--code] [--gold-prepr | `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | | **CREATES** | Training results and optional metrics and visualizations. | +## assemble {#assemble tag="command"} + +Assemble a pipeline from a config file without additional training. Expects a +[config file](/api/data-formats#config) with all settings and hyperparameters. +The `--code` argument can be used to import a Python file that lets you register +[custom functions](/usage/training#custom-functions) and refer to them in your +config. + +> #### Example +> +> ```cli +> $ python -m spacy assemble config.cfg ./output +> ``` + +```cli +$ python -m spacy assemble [config_path] [output_dir] [--code] [--verbose] [overrides] +``` + +| Name | Description | +| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `config_path` | Path to the [config](/api/data-formats#config) file containing all settings and hyperparameters. If `-`, the data will be [read from stdin](/usage/training#config-stdin). ~~Union[Path, str] \(positional)~~ | +| `output_dir` | Directory to store the final pipeline in. Will be created if it doesn't exist. 
~~Optional[Path] \(option)~~ | +| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions). ~~Optional[Path] \(option)~~ | +| `--verbose`, `-V` | Show more detailed messages during processing. ~~bool (flag)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.data ./data`. ~~Any (option/flag)~~ | +| **CREATES** | The final assembled pipeline. | + ## package {#package tag="command"} Generate an installable [Python package](/usage/training#models-generating) from diff --git a/website/docs/api/data-formats.md b/website/docs/api/data-formats.md index de893611a..0c2a4c9f3 100644 --- a/website/docs/api/data-formats.md +++ b/website/docs/api/data-formats.md @@ -29,8 +29,8 @@ recommended settings for your use case, check out the > > The `@` syntax lets you refer to function names registered in the > [function registry](/api/top-level#registry). For example, -> `@architectures = "spacy.HashEmbedCNN.v1"` refers to a registered function of -> the name [spacy.HashEmbedCNN.v1](/api/architectures#HashEmbedCNN) and all +> `@architectures = "spacy.HashEmbedCNN.v2"` refers to a registered function of +> the name [spacy.HashEmbedCNN.v2](/api/architectures#HashEmbedCNN) and all > other values defined in its block will be passed into that function as > arguments. Those arguments depend on the registered function. See the usage > guide on [registered functions](/usage/training#config-functions) for details. @@ -193,10 +193,10 @@ process that are used when you run [`spacy train`](/api/cli#train). | `frozen_components` | Pipeline component names that are "frozen" and shouldn't be initialized or updated during training. See [here](/usage/training#config-components) for details. Defaults to `[]`. ~~List[str]~~ | | `gpu_allocator` | Library for cupy to route GPU memory allocation to. Can be `"pytorch"` or `"tensorflow"`. Defaults to variable `${system.gpu_allocator}`. ~~str~~ | | `logger` | Callable that takes the `nlp` and stdout and stderr `IO` objects, sets up the logger, and returns two new callables to log a training step and to finalize the logger. Defaults to [`ConsoleLogger`](/api/top-level#ConsoleLogger). ~~Callable[[Language, IO, IO], [Tuple[Callable[[Dict[str, Any]], None], Callable[[], None]]]]~~ | -| `max_epochs` | Maximum number of epochs to train for. Defaults to `0`. ~~int~~ | -| `max_steps` | Maximum number of update steps to train for. Defaults to `20000`. ~~int~~ | +| `max_epochs` | Maximum number of epochs to train for. `0` means an unlimited number of epochs. `-1` means that the train corpus should be streamed rather than loaded into memory with no shuffling within the training loop. Defaults to `0`. ~~int~~ | +| `max_steps` | Maximum number of update steps to train for. `0` means an unlimited number of steps. Defaults to `20000`. ~~int~~ | | `optimizer` | The optimizer. The learning rate schedule and other settings can be configured as part of the optimizer. Defaults to [`Adam`](https://thinc.ai/docs/api-optimizers#adam). ~~Optimizer~~ | -| `patience` | How many steps to continue without improvement in evaluation score. Defaults to `1600`. ~~int~~ | +| `patience` | How many steps to continue without improvement in evaluation score. `0` disables early stopping. Defaults to `1600`. 
~~int~~ | | `score_weights` | Score names shown in metrics mapped to their weight towards the final weighted score. See [here](/usage/training#metrics) for details. Defaults to `{}`. ~~Dict[str, float]~~ | | `seed` | The random seed. Defaults to variable `${system.seed}`. ~~int~~ | | `train_corpus` | Dot notation of the config location defining the train corpus. Defaults to `corpora.train`. ~~str~~ | @@ -390,7 +390,7 @@ file to keep track of your settings and hyperparameters and your own > "tags": List[str], > "pos": List[str], > "morphs": List[str], -> "sent_starts": List[bool], +> "sent_starts": List[Optional[bool]], > "deps": List[string], > "heads": List[int], > "entities": List[str], diff --git a/website/docs/api/doc.md b/website/docs/api/doc.md index c8917efa1..9358507dc 100644 --- a/website/docs/api/doc.md +++ b/website/docs/api/doc.md @@ -44,7 +44,7 @@ Construct a `Doc` object. The most common way to get a `Doc` object is via the | `lemmas` 3 | A list of strings, of the same length as `words`, to assign as `token.lemma` for each word. Defaults to `None`. ~~Optional[List[str]]~~ | | `heads` 3 | A list of values, of the same length as `words`, to assign as the head for each word. Head indices are the absolute position of the head in the `Doc`. Defaults to `None`. ~~Optional[List[int]]~~ | | `deps` 3 | A list of strings, of the same length as `words`, to assign as `token.dep` for each word. Defaults to `None`. ~~Optional[List[str]]~~ | -| `sent_starts` 3 | A list of values, of the same length as `words`, to assign as `token.is_sent_start`. Will be overridden by heads if `heads` is provided. Defaults to `None`. ~~Optional[List[Union[bool, None]]~~ | +| `sent_starts` 3 | A list of values, of the same length as `words`, to assign as `token.is_sent_start`. Will be overridden by heads if `heads` is provided. Defaults to `None`. ~~Optional[List[Optional[bool]]]~~ | | `ents` 3 | A list of strings, of the same length of `words`, to assign the token-based IOB tag. Defaults to `None`. ~~Optional[List[str]]~~ | ## Doc.\_\_getitem\_\_ {#getitem tag="method"} diff --git a/website/docs/api/example.md b/website/docs/api/example.md index 2811f4d91..ca9d3c056 100644 --- a/website/docs/api/example.md +++ b/website/docs/api/example.md @@ -33,8 +33,8 @@ both documents. | Name | Description | | -------------- | ------------------------------------------------------------------------------------------------------------------------ | -| `predicted` | The document containing (partial) predictions. Cannot be `None`. ~~Doc~~ | -| `reference` | The document containing gold-standard annotations. Cannot be `None`. ~~Doc~~ | +| `predicted` | The document containing (partial) predictions. Cannot be `None`. ~~Doc~~ | +| `reference` | The document containing gold-standard annotations. Cannot be `None`. ~~Doc~~ | | _keyword-only_ | | | `alignment` | An object holding the alignment between the tokens of the `predicted` and `reference` documents. ~~Optional[Alignment]~~ | @@ -56,11 +56,11 @@ see the [training format documentation](/api/data-formats#dict-input). > example = Example.from_dict(predicted, {"words": token_ref, "tags": tags_ref}) > ``` -| Name | Description | -| -------------- | ------------------------------------------------------------------------- | -| `predicted` | The document containing (partial) predictions. Cannot be `None`. ~~Doc~~ | -| `example_dict` | `Dict[str, obj]` | The gold-standard annotations as a dictionary. Cannot be `None`. 
~~Dict[str, Any]~~ | -| **RETURNS** | The newly constructed object. ~~Example~~ | +| Name | Description | +| -------------- | ----------------------------------------------------------------------------------- | +| `predicted` | The document containing (partial) predictions. Cannot be `None`. ~~Doc~~ | +| `example_dict` | The gold-standard annotations as a dictionary. Cannot be `None`. ~~Dict[str, Any]~~ | +| **RETURNS** | The newly constructed object. ~~Example~~ | ## Example.text {#text tag="property"} @@ -211,10 +211,11 @@ align to the tokenization in [`Example.predicted`](/api/example#predicted). > assert [(ent.start, ent.end) for ent in ents_y2x] == [(0, 1)] > ``` -| Name | Description | -| ----------- | ----------------------------------------------------------------------------- | -| `y_spans` | `Span` objects aligned to the tokenization of `reference`. ~~Iterable[Span]~~ | -| **RETURNS** | `Span` objects aligned to the tokenization of `predicted`. ~~List[Span]~~ | +| Name | Description | +| --------------- | -------------------------------------------------------------------------------------------- | +| `y_spans` | `Span` objects aligned to the tokenization of `reference`. ~~Iterable[Span]~~ | +| `allow_overlap` | Whether the resulting `Span` objects may overlap or not. Set to `False` by default. ~~bool~~ | +| **RETURNS** | `Span` objects aligned to the tokenization of `predicted`. ~~List[Span]~~ | ## Example.get_aligned_spans_x2y {#get_aligned_spans_x2y tag="method"} @@ -238,10 +239,11 @@ against the original gold-standard annotation. > assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2)] > ``` -| Name | Description | -| ----------- | ----------------------------------------------------------------------------- | -| `x_spans` | `Span` objects aligned to the tokenization of `predicted`. ~~Iterable[Span]~~ | -| **RETURNS** | `Span` objects aligned to the tokenization of `reference`. ~~List[Span]~~ | +| Name | Description | +| --------------- | -------------------------------------------------------------------------------------------- | +| `x_spans` | `Span` objects aligned to the tokenization of `predicted`. ~~Iterable[Span]~~ | +| `allow_overlap` | Whether the resulting `Span` objects may overlap or not. Set to `False` by default. ~~bool~~ | +| **RETURNS** | `Span` objects aligned to the tokenization of `reference`. ~~List[Span]~~ | ## Example.to_dict {#to_dict tag="method"} diff --git a/website/docs/api/legacy.md b/website/docs/api/legacy.md index 4b5e8df3a..96bc199bf 100644 --- a/website/docs/api/legacy.md +++ b/website/docs/api/legacy.md @@ -4,12 +4,13 @@ teaser: Archived implementations available through spacy-legacy source: spacy/legacy --- -The [`spacy-legacy`](https://github.com/explosion/spacy-legacy) package includes -outdated registered functions and architectures. It is installed automatically as -a dependency of spaCy, and provides backwards compatibility for archived functions -that may still be used in projects. +The [`spacy-legacy`](https://github.com/explosion/spacy-legacy) package includes +outdated registered functions and architectures. It is installed automatically +as a dependency of spaCy, and provides backwards compatibility for archived +functions that may still be used in projects. -You can find the detailed documentation of each such legacy function on this page. +You can find the detailed documentation of each such legacy function on this +page. 
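The `sent_starts` values described in the training data format above (`List[Optional[bool]]`, also accepted by `Example.from_dict`) are normalized through `to_ternary_int`, the helper added to `spacy/util.py` earlier in this diff. A short sketch of the mapping, assuming a spaCy build that includes that helper; the value list mirrors the updated `test_Example_from_dict_with_sent_start` test:

```python
# Ternary SENT_START mapping: True/1/1.0 -> 1 (sentence start),
# None/0/0.0 -> 0 (missing), anything else (including False, -1) -> -1.
from spacy.util import to_ternary_int

values = [1, False, 0, None, True, -1, -5.7]
print([to_ternary_int(v) for v in values])
# [1, -1, 0, 0, 1, -1, -1]
```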
## Architectures {#architectures} @@ -17,8 +18,8 @@ These functions are available from `@spacy.registry.architectures`. ### spacy.Tok2Vec.v1 {#Tok2Vec_v1} -The `spacy.Tok2Vec.v1` architecture was expecting an `encode` model of type -`Model[Floats2D, Floats2D]` such as `spacy.MaxoutWindowEncoder.v1` or +The `spacy.Tok2Vec.v1` architecture was expecting an `encode` model of type +`Model[Floats2D, Floats2D]` such as `spacy.MaxoutWindowEncoder.v1` or `spacy.MishWindowEncoder.v1`. > #### Example config @@ -44,15 +45,14 @@ blog post for background. | Name | Description | | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `embed` | Embed tokens into context-independent word vector representations. For example, [CharacterEmbed](/api/architectures#CharacterEmbed) or [MultiHashEmbed](/api/architectures#MultiHashEmbed). ~~Model[List[Doc], List[Floats2d]]~~ | -| `encode` | Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder.v1](/api/legacy#MaxoutWindowEncoder_v1). ~~Model[Floats2d, Floats2d]~~ | +| `encode` | Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder.v1](/api/legacy#MaxoutWindowEncoder_v1). ~~Model[Floats2d, Floats2d]~~ | | **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | ### spacy.MaxoutWindowEncoder.v1 {#MaxoutWindowEncoder_v1} -The `spacy.MaxoutWindowEncoder.v1` architecture was producing a model of type -`Model[Floats2D, Floats2D]`. Since `spacy.MaxoutWindowEncoder.v2`, this has been changed to output -type `Model[List[Floats2d], List[Floats2d]]`. - +The `spacy.MaxoutWindowEncoder.v1` architecture was producing a model of type +`Model[Floats2D, Floats2D]`. Since `spacy.MaxoutWindowEncoder.v2`, this has been +changed to output type `Model[List[Floats2d], List[Floats2d]]`. > #### Example config > @@ -78,9 +78,9 @@ and residual connections. ### spacy.MishWindowEncoder.v1 {#MishWindowEncoder_v1} -The `spacy.MishWindowEncoder.v1` architecture was producing a model of type -`Model[Floats2D, Floats2D]`. Since `spacy.MishWindowEncoder.v2`, this has been changed to output -type `Model[List[Floats2d], List[Floats2d]]`. +The `spacy.MishWindowEncoder.v1` architecture was producing a model of type +`Model[Floats2D, Floats2D]`. Since `spacy.MishWindowEncoder.v2`, this has been +changed to output type `Model[List[Floats2d], List[Floats2d]]`. > #### Example config > @@ -103,12 +103,11 @@ and residual connections. | `depth` | The number of convolutional layers. Recommended value is `4`. ~~int~~ | | **CREATES** | The model using the architecture. ~~Model[Floats2d, Floats2d]~~ | - ### spacy.TextCatEnsemble.v1 {#TextCatEnsemble_v1} -The `spacy.TextCatEnsemble.v1` architecture built an internal `tok2vec` and `linear_model`. -Since `spacy.TextCatEnsemble.v2`, this has been refactored so that the `TextCatEnsemble` takes these -two sublayers as input. +The `spacy.TextCatEnsemble.v1` architecture built an internal `tok2vec` and +`linear_model`. Since `spacy.TextCatEnsemble.v2`, this has been refactored so +that the `TextCatEnsemble` takes these two sublayers as input. > #### Example Config > @@ -140,4 +139,62 @@ network has an internal CNN Tok2Vec layer and uses attention. 
| `ngram_size` | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3`would give unigram, trigram and bigram features. ~~int~~ | | `dropout` | The dropout rate. ~~float~~ | | `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `initialize` is called. ~~Optional[int]~~ | -| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ | \ No newline at end of file +| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ | + +### spacy.HashEmbedCNN.v1 {#HashEmbedCNN_v1} + +Identical to [`spacy.HashEmbedCNN.v2`](/api/architectures#HashEmbedCNN) except +using [`spacy.StaticVectors.v1`](#StaticVectors_v1) if vectors are included. + +### spacy.MultiHashEmbed.v1 {#MultiHashEmbed_v1} + +Identical to [`spacy.MultiHashEmbed.v2`](/api/architectures#MultiHashEmbed) +except with [`spacy.StaticVectors.v1`](#StaticVectors_v1) if vectors are +included. + +### spacy.CharacterEmbed.v1 {#CharacterEmbed_v1} + +Identical to [`spacy.CharacterEmbed.v2`](/api/architectures#CharacterEmbed) +except using [`spacy.StaticVectors.v1`](#StaticVectors_v1) if vectors are +included. + +## Layers {#layers} + +These functions are available from `@spacy.registry.layers`. + +### spacy.StaticVectors.v1 {#StaticVectors_v1} + +Identical to [`spacy.StaticVectors.v2`](/api/architectures#StaticVectors) except +for the handling of tokens without vectors. + + + +`spacy.StaticVectors.v1` maps tokens without vectors to the final row in the +vectors table, which causes the model predictions to change if new vectors are +added to an existing vectors table. See more details in +[issue #7662](https://github.com/explosion/spaCy/issues/7662#issuecomment-813925655). + + + +## Loggers {#loggers} + +These functions are available from `@spacy.registry.loggers`. + +### spacy.WandbLogger.v1 {#WandbLogger_v1} + +The first version of the [`WandbLogger`](/api/top-level#WandbLogger) did not yet +support the `log_dataset_dir` and `model_log_interval` arguments. + +> #### Example config +> +> ```ini +> [training.logger] +> @loggers = "spacy.WandbLogger.v1" +> project_name = "monitor_spacy_training" +> remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"] +> ``` +> +> | Name | Description | +> | ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | +> | `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ | +> | `remove_config_values` | A list of values to include from the config before it is uploaded to W&B (default: empty). ~~List[str]~~ | diff --git a/website/docs/api/matcher.md b/website/docs/api/matcher.md index 95a76586a..c15ee7a47 100644 --- a/website/docs/api/matcher.md +++ b/website/docs/api/matcher.md @@ -120,13 +120,14 @@ Find all token sequences matching the supplied patterns on the `Doc` or `Span`. 
> matches = matcher(doc) > ``` -| Name | Description | -| ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ | -| _keyword-only_ | | -| `as_spans` 3 | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ | -| `allow_missing` 3 | Whether to skip checks for missing annotation for attributes included in patterns. Defaults to `False`. ~~bool~~ | -| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ | +| Name | Description | +| ---------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ | +| _keyword-only_ | | +| `as_spans` 3 | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ | +| `allow_missing` 3 | Whether to skip checks for missing annotation for attributes included in patterns. Defaults to `False`. ~~bool~~ | +| `with_alignments` 3.1 | Return match alignment information as part of the match tuple as `List[int]` with the same length as the matched span. Each entry denotes the corresponding index of the token pattern. If `as_spans` is set to `True`, this setting is ignored. Defaults to `False`. ~~bool~~ | +| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ | ## Matcher.\_\_len\_\_ {#len tag="method" new="2"} diff --git a/website/docs/api/scorer.md b/website/docs/api/scorer.md index cf1a1ca1f..7398bae81 100644 --- a/website/docs/api/scorer.md +++ b/website/docs/api/scorer.md @@ -137,14 +137,16 @@ Returns PRF scores for labeled or unlabeled spans. > print(scores["ents_f"]) > ``` -| Name | Description | -| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ | -| `attr` | The attribute to score. ~~str~~ | -| _keyword-only_ | | -| `getter` | Defaults to `getattr`. If provided, `getter(doc, attr)` should return the `Span` objects for an individual `Doc`. ~~Callable[[Doc, str], Iterable[Span]]~~ | -| `has_annotation` | Defaults to `None`. 
If provided, `has_annotation(doc)` should return whether a `Doc` has annotation for this `attr`. Docs without annotation are skipped for scoring purposes. ~~Optional[Callable[[Doc], bool]]~~ | -| **RETURNS** | A dictionary containing the PRF scores under the keys `{attr}_p`, `{attr}_r`, `{attr}_f` and the per-type PRF scores under `{attr}_per_type`. ~~Dict[str, Union[float, Dict[str, float]]]~~ | +| Name | Description | +| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ | +| `attr` | The attribute to score. ~~str~~ | +| _keyword-only_ | | +| `getter` | Defaults to `getattr`. If provided, `getter(doc, attr)` should return the `Span` objects for an individual `Doc`. ~~Callable[[Doc, str], Iterable[Span]]~~ | +| `has_annotation` | Defaults to `None`. If provided, `has_annotation(doc)` should return whether a `Doc` has annotation for this `attr`. Docs without annotation are skipped for scoring purposes. ~~str~~ | +| `labeled` | Defaults to `True`. If set to `False`, two spans will be considered equal if their start and end match, irrespective of their label. ~~bool~~ | +| `allow_overlap` | Defaults to `False`. Whether or not to allow overlapping spans. If set to `False`, the alignment will automatically resolve conflicts. ~~bool~~ | +| **RETURNS** | A dictionary containing the PRF scores under the keys `{attr}_p`, `{attr}_r`, `{attr}_f` and the per-type PRF scores under `{attr}_per_type`. ~~Dict[str, Union[float, Dict[str, float]]]~~ | ## Scorer.score_deps {#score_deps tag="staticmethod" new="3"} diff --git a/website/docs/api/token.md b/website/docs/api/token.md index 687705524..ecf7bcc8e 100644 --- a/website/docs/api/token.md +++ b/website/docs/api/token.md @@ -364,7 +364,7 @@ unknown. Defaults to `True` for the first token in the `Doc`. | Name | Description | | ----------- | --------------------------------------------- | -| **RETURNS** | Whether the token starts a sentence. ~~bool~~ | +| **RETURNS** | Whether the token starts a sentence. ~~Optional[bool]~~ | ## Token.has_vector {#has_vector tag="property" model="vectors"} diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index eef8958cf..cfaa75bff 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -8,6 +8,7 @@ menu: - ['Readers', 'readers'] - ['Batchers', 'batchers'] - ['Augmenters', 'augmenters'] + - ['Callbacks', 'callbacks'] - ['Training & Alignment', 'gold'] - ['Utility Functions', 'util'] --- @@ -461,7 +462,7 @@ start decreasing across epochs. -#### spacy.WandbLogger.v1 {#WandbLogger tag="registered function"} +#### spacy.WandbLogger.v2 {#WandbLogger tag="registered function"} > #### Installation > @@ -493,15 +494,19 @@ remain in the config file stored on your local system. 
>
> ```ini
> [training.logger]
-> @loggers = "spacy.WandbLogger.v1"
+> @loggers = "spacy.WandbLogger.v2"
> project_name = "monitor_spacy_training"
> remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]
+> log_dataset_dir = "corpus"
+> model_log_interval = 1000
> ```

| Name | Description |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ |
| `remove_config_values` | A list of values to exclude from the config before it is uploaded to W&B (default: empty). ~~List[str]~~ |
+| `model_log_interval` | Steps to wait between logging model checkpoints to the W&B dashboard (default: None). ~~Optional[int]~~ |
+| `log_dataset_dir` | Directory containing the dataset to be logged and versioned as a W&B artifact (default: None). ~~Optional[str]~~ |

@@ -781,6 +786,35 @@ useful for making the model less sensitive to capitalization.
| `level` | The percentage of texts that will be augmented. ~~float~~ |
| **CREATES** | A function that takes the current `nlp` object and an [`Example`](/api/example) and yields augmented `Example` objects. ~~Callable[[Language, Example], Iterator[Example]]~~ |

+## Callbacks {#callbacks source="spacy/training/callbacks.py" new="3"}
+
+The config supports [callbacks](/usage/training#custom-code-nlp-callbacks) at
+several points in the lifecycle that can be used to modify the `nlp` object.
+
+### spacy.copy_from_base_model.v1 {#copy_from_base_model tag="registered function"}
+
+> #### Example config
+>
+> ```ini
+> [initialize.before_init]
+> @callbacks = "spacy.copy_from_base_model.v1"
+> tokenizer = "en_core_sci_md"
+> vocab = "en_core_sci_md"
+> ```
+
+Copy the tokenizer and/or vocab from the specified models. It's similar to the
+v2 [base model](https://v2.spacy.io/api/cli#train) option and useful in
+combination with
+[sourced components](/usage/processing-pipelines#sourced-components) when
+fine-tuning an existing pipeline. The vocab includes the lookups and the vectors
+from the specified model. Intended for use in `[initialize.before_init]`.
+
+| Name | Description |
+| ----------- | ----------------------------------------------------------------------------------------------------------------------- |
+| `tokenizer` | The pipeline to copy the tokenizer from. Defaults to `None`. ~~Optional[str]~~ |
+| `vocab` | The pipeline to copy the vocab from. The vocab includes the lookups and vectors. Defaults to `None`. ~~Optional[str]~~ |
+| **CREATES** | A function that takes the current `nlp` object and modifies its `tokenizer` and `vocab`. 
~~Callable[[Language], None]~~ | + ## Training data and alignment {#gold source="spacy/training"} ### training.offsets_to_biluo_tags {#offsets_to_biluo_tags tag="function"} diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index e71336e84..4113e9394 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -132,7 +132,7 @@ factory = "tok2vec" @architectures = "spacy.Tok2Vec.v2" [components.tok2vec.model.embed] -@architectures = "spacy.MultiHashEmbed.v1" +@architectures = "spacy.MultiHashEmbed.v2" [components.tok2vec.model.encode] @architectures = "spacy.MaxoutWindowEncoder.v2" @@ -164,7 +164,7 @@ factory = "ner" @architectures = "spacy.Tok2Vec.v2" [components.ner.model.tok2vec.embed] -@architectures = "spacy.MultiHashEmbed.v1" +@architectures = "spacy.MultiHashEmbed.v2" [components.ner.model.tok2vec.encode] @architectures = "spacy.MaxoutWindowEncoder.v2" @@ -541,7 +541,7 @@ word vector tables using the `include_static_vectors` flag. ```ini [tagger.model.tok2vec.embed] -@architectures = "spacy.MultiHashEmbed.v1" +@architectures = "spacy.MultiHashEmbed.v2" width = 128 attrs = ["LOWER","PREFIX","SUFFIX","SHAPE"] rows = [5000,2500,2500,2500] @@ -550,7 +550,7 @@ include_static_vectors = true -The configuration system will look up the string `"spacy.MultiHashEmbed.v1"` in +The configuration system will look up the string `"spacy.MultiHashEmbed.v2"` in the `architectures` [registry](/api/top-level#registry), and call the returned object with the rest of the arguments from the block. This will result in a call to the diff --git a/website/docs/usage/index.md b/website/docs/usage/index.md index cbbda2e4f..665d334f8 100644 --- a/website/docs/usage/index.md +++ b/website/docs/usage/index.md @@ -130,9 +130,9 @@ which provides a numpy-compatible interface for GPU arrays. spaCy can be installed on GPU by specifying `spacy[cuda]`, `spacy[cuda90]`, `spacy[cuda91]`, `spacy[cuda92]`, `spacy[cuda100]`, `spacy[cuda101]`, -`spacy[cuda102]`, `spacy[cuda110]` or `spacy[cuda111]`. If you know your cuda -version, using the more explicit specifier allows cupy to be installed via -wheel, saving some compilation time. The specifiers should install +`spacy[cuda102]`, `spacy[cuda110]`, `spacy[cuda111]` or `spacy[cuda112]`. If you +know your cuda version, using the more explicit specifier allows cupy to be +installed via wheel, saving some compilation time. The specifiers should install [`cupy`](https://cupy.chainer.org). ```bash diff --git a/website/docs/usage/layers-architectures.md b/website/docs/usage/layers-architectures.md index 0bc935d51..8fe2cf489 100644 --- a/website/docs/usage/layers-architectures.md +++ b/website/docs/usage/layers-architectures.md @@ -137,7 +137,7 @@ nO = null @architectures = "spacy.Tok2Vec.v2" [components.textcat.model.tok2vec.embed] -@architectures = "spacy.MultiHashEmbed.v1" +@architectures = "spacy.MultiHashEmbed.v2" width = 64 rows = [2000, 2000, 1000, 1000, 1000, 1000] attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"] @@ -204,7 +204,7 @@ factory = "tok2vec" @architectures = "spacy.Tok2Vec.v2" [components.tok2vec.model.embed] -@architectures = "spacy.MultiHashEmbed.v1" +@architectures = "spacy.MultiHashEmbed.v2" # ... [components.tok2vec.model.encode] @@ -220,7 +220,7 @@ architecture: ```ini ### config.cfg (excerpt) [components.tok2vec.model.embed] -@architectures = "spacy.CharacterEmbed.v1" +@architectures = "spacy.CharacterEmbed.v2" # ... 
[components.tok2vec.model.encode] @@ -638,7 +638,7 @@ that has the full implementation. > @architectures = "rel_instance_tensor.v1" > > [model.create_instance_tensor.tok2vec] -> @architectures = "spacy.HashEmbedCNN.v1" +> @architectures = "spacy.HashEmbedCNN.v2" > # ... > > [model.create_instance_tensor.pooling] diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md index 352c4c9dd..5a1293c2e 100644 --- a/website/docs/usage/linguistic-features.md +++ b/website/docs/usage/linguistic-features.md @@ -787,6 +787,7 @@ rather than performance: ```python def tokenizer_pseudo_code( + text, special_cases, prefix_search, suffix_search, @@ -840,12 +841,14 @@ def tokenizer_pseudo_code( tokens.append(substring) substring = "" tokens.extend(reversed(suffixes)) + for match in matcher(special_cases, text): + tokens.replace(match, special_cases[match]) return tokens ``` The algorithm can be summarized as follows: -1. Iterate over whitespace-separated substrings. +1. Iterate over space-separated substrings. 2. Look for a token match. If there is a match, stop processing and keep this token. 3. Check whether we have an explicitly defined special case for this substring. @@ -859,6 +862,8 @@ The algorithm can be summarized as follows: 8. Look for "infixes" – stuff like hyphens etc. and split the substring into tokens on all infixes. 9. Once we can't consume any more of the string, handle it as a single token. +10. Make a final pass over the text to check for special cases that include + spaces or that were missed due to the incremental processing of affixes. diff --git a/website/docs/usage/projects.md b/website/docs/usage/projects.md index 97b5b9f28..fc191824a 100644 --- a/website/docs/usage/projects.md +++ b/website/docs/usage/projects.md @@ -995,7 +995,7 @@ your results. > > ```ini > [training.logger] -> @loggers = "spacy.WandbLogger.v1" +> @loggers = "spacy.WandbLogger.v2" > project_name = "monitor_spacy_training" > remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"] > ``` diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 5e9d3303c..9f929fe19 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -1130,8 +1130,8 @@ any other custom workflows. `corpora.train` and `corpora.dev` are used as conventions within spaCy's default configs, but you can also define any other custom blocks. Each section in the corpora config should resolve to a [`Corpus`](/api/corpus) – for example, using spaCy's built-in -[corpus reader](/api/top-level#readers) that takes a path to a binary `.spacy` -file. The `train_corpus` and `dev_corpus` fields in the +[corpus reader](/api/top-level#corpus-readers) that takes a path to a binary +`.spacy` file. The `train_corpus` and `dev_corpus` fields in the [`[training]`](/api/data-formats#config-training) block specify where to find the corpus in your config. This makes it easy to **swap out** different corpora by only changing a single config setting. @@ -1142,21 +1142,23 @@ corpora, keyed by corpus name, e.g. `"train"` and `"dev"`. This can be especially useful if you need to split a single file into corpora for training and evaluation, without loading the same file twice. +By default, the training data is loaded into memory and shuffled before each +epoch. If the corpus is **too large to fit into memory** during training, stream +the corpus using a custom reader as described in the next section. 
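For reference, the built-in corpus reader mentioned above is just a callable that
takes the `nlp` object and yields [`Example`](/api/example) objects, which is the
same interface a custom registered reader has to provide. A minimal sketch of
using it directly (the `.spacy` file path here is hypothetical):

```python
import spacy
from spacy.training import Corpus

# Hypothetical path to a binary .spacy file, e.g. one produced by `spacy convert`.
corpus = Corpus("./corpus/dev.spacy")

# The reader is called with the nlp object and yields Example objects.
nlp = spacy.blank("en")
examples = list(corpus(nlp))
print(f"Loaded {len(examples)} examples")
```

Swapping out a corpus then only means pointing `train_corpus` or `dev_corpus` at
a different `[corpora.*]` block.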
+ ### Custom data reading and batching {#custom-code-readers-batchers} Some use-cases require **streaming in data** or manipulating datasets on the -fly, rather than generating all data beforehand and storing it to file. Instead +fly, rather than generating all data beforehand and storing it to disk. Instead of using the built-in [`Corpus`](/api/corpus) reader, which uses static file paths, you can create and register a custom function that generates -[`Example`](/api/example) objects. The resulting generator can be infinite. When -using this dataset for training, stopping criteria such as maximum number of -steps, or stopping when the loss does not decrease further, can be used. +[`Example`](/api/example) objects. -In this example we assume a custom function `read_custom_data` which loads or -generates texts with relevant text classification annotations. Then, small -lexical variations of the input text are created before generating the final -[`Example`](/api/example) objects. The `@spacy.registry.readers` decorator lets -you register the function creating the custom reader in the `readers` +In the following example we assume a custom function `read_custom_data` which +loads or generates texts with relevant text classification annotations. Then, +small lexical variations of the input text are created before generating the +final [`Example`](/api/example) objects. The `@spacy.registry.readers` decorator +lets you register the function creating the custom reader in the `readers` [registry](/api/top-level#registry) and assign it a string name, so it can be used in your config. All arguments on the registered function become available as **config settings** – in this case, `source`. @@ -1199,6 +1201,80 @@ Remember that a registered function should always be a function that spaCy +If the corpus is **too large to load into memory** or the corpus reader is an +**infinite generator**, use the setting `max_epochs = -1` to indicate that the +train corpus should be streamed. With this setting the train corpus is merely +streamed and batched, not shuffled, so any shuffling needs to be implemented in +the corpus reader itself. In the example below, a corpus reader that generates +sentences containing even or odd numbers is used with an unlimited number of +examples for the train corpus and a limited number of examples for the dev +corpus. The dev corpus should always be finite and fit in memory during the +evaluation step. `max_steps` and/or `patience` are used to determine when the +training should stop. 
+ +> #### config.cfg +> +> ```ini +> [corpora.dev] +> @readers = "even_odd.v1" +> limit = 100 +> +> [corpora.train] +> @readers = "even_odd.v1" +> limit = -1 +> +> [training] +> max_epochs = -1 +> patience = 500 +> max_steps = 2000 +> ``` + +```python +### functions.py +from typing import Callable, Iterable, Iterator +from spacy import util +import random +from spacy.training import Example +from spacy import Language + + +@util.registry.readers("even_odd.v1") +def create_even_odd_corpus(limit: int = -1) -> Callable[[Language], Iterable[Example]]: + return EvenOddCorpus(limit) + + +class EvenOddCorpus: + def __init__(self, limit): + self.limit = limit + + def __call__(self, nlp: Language) -> Iterator[Example]: + i = 0 + while i < self.limit or self.limit < 0: + r = random.randint(0, 1000) + cat = r % 2 == 0 + text = "This is sentence " + str(r) + yield Example.from_dict( + nlp.make_doc(text), {"cats": {"EVEN": cat, "ODD": not cat}} + ) + i += 1 +``` + +> #### config.cfg +> +> ```ini +> [initialize.components.textcat.labels] +> @readers = "spacy.read_labels.v1" +> path = "labels/textcat.json" +> require = true +> ``` + +If the train corpus is streamed, the initialize step peeks at the first 100 +examples in the corpus to find the labels for each component. If this isn't +sufficient, you'll need to [provide the labels](#initialization-labels) for each +component in the `[initialize]` block. [`init labels`](/api/cli#init-labels) can +be used to generate JSON files in the correct format, which you can extend with +the full label set. + We can also customize the **batching strategy** by registering a new batcher function in the `batchers` [registry](/api/top-level#registry). A batcher turns a stream of items into a stream of batches. spaCy has several useful built-in diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md index 847d4a327..8b4d2de7c 100644 --- a/website/docs/usage/v3.md +++ b/website/docs/usage/v3.md @@ -616,11 +616,11 @@ Note that spaCy v3.0 now requires **Python 3.6+**. | `spacy profile` | [`spacy debug profile`](/api/cli#debug-profile) | | `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, symlinks are deprecated | -The following deprecated methods, attributes and arguments were removed in v3.0. -Most of them have been **deprecated for a while** and many would previously -raise errors. Many of them were also mostly internals. If you've been working -with more recent versions of spaCy v2.x, it's **unlikely** that your code relied -on them. +The following methods, attributes and arguments were removed in v3.0. Most of +them have been **deprecated for a while** and many would previously raise +errors. Many of them were also mostly internals. If you've been working with +more recent versions of spaCy v2.x, it's **unlikely** that your code relied on +them. | Removed | Replacement | | ----------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -637,10 +637,10 @@ on them. ### Downloading and loading trained pipelines {#migrating-downloading-models} -Symlinks and shortcuts like `en` are now officially deprecated. There are -[many different trained pipelines](/models) with different capabilities and not -just one "English model". 
In order to download and load a package, you should -always use its full name – for instance, +Symlinks and shortcuts like `en` have been deprecated for a while, and are now +not supported anymore. There are [many different trained pipelines](/models) +with different capabilities and not just one "English model". In order to +download and load a package, you should always use its full name – for instance, [`en_core_web_sm`](/models/en#en_core_web_sm). ```diff @@ -1185,9 +1185,10 @@ package isn't imported. In Jupyter notebooks, run [`prefer_gpu`](/api/top-level#spacy.prefer_gpu), [`require_gpu`](/api/top-level#spacy.require_gpu) or [`require_cpu`](/api/top-level#spacy.require_cpu) in the same cell as -[`spacy.load`](/api/top-level#spacy.load) to ensure that the model is loaded on the correct device. +[`spacy.load`](/api/top-level#spacy.load) to ensure that the model is loaded on +the correct device. -Due to a bug related to `contextvars` (see the [bug -report](https://github.com/ipython/ipython/issues/11565)), the GPU settings may -not be preserved correctly across cells, resulting in models being loaded on +Due to a bug related to `contextvars` (see the +[bug report](https://github.com/ipython/ipython/issues/11565)), the GPU settings +may not be preserved correctly across cells, resulting in models being loaded on the wrong device or only partially on GPU.
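To make the workaround above concrete, here is a minimal sketch for a single
notebook cell, assuming a GPU-enabled installation and that the
`en_core_web_sm` package is installed:

```python
import spacy

# Keep the device call and spacy.load in the same notebook cell so the pipeline
# is allocated on the intended device despite the contextvars issue above.
activated = spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")
print("GPU activated:", activated)
```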