Mirror of https://github.com/explosion/spaCy.git (synced 2025-08-03 11:50:19 +03:00)

Merge branch 'explosion:master' into feature/visualisation

Commit c8fd577ba4
21
.github/workflows/gputests.yml
vendored
Normal file
@ -0,0 +1,21 @@
name: Weekly GPU tests

on:
  schedule:
    - cron: '0 1 * * MON'

jobs:
  weekly-gputests:
    strategy:
      fail-fast: false
      matrix:
        branch: [master, v4]
    runs-on: ubuntu-latest
    steps:
      - name: Trigger buildkite build
        uses: buildkite/trigger-pipeline-action@v1.2.0
        env:
          PIPELINE: explosion-ai/spacy-slow-gpu-tests
          BRANCH: ${{ matrix.branch }}
          MESSAGE: ":github: Weekly GPU + slow tests - triggered from a GitHub Action"
          BUILDKITE_API_ACCESS_TOKEN: ${{ secrets.BUILDKITE_SECRET }}
37
.github/workflows/slowtests.yml
vendored
Normal file
@ -0,0 +1,37 @@
name: Daily slow tests

on:
  schedule:
    - cron: '0 0 * * *'

jobs:
  daily-slowtests:
    strategy:
      fail-fast: false
      matrix:
        branch: [master, v4]
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v1
        with:
          ref: ${{ matrix.branch }}
      - name: Get commits from past 24 hours
        id: check_commits
        run: |
          today=$(date '+%Y-%m-%d %H:%M:%S')
          yesterday=$(date -d "yesterday" '+%Y-%m-%d %H:%M:%S')
          if git log --after="$yesterday" --before="$today" | grep commit ; then
            echo "::set-output name=run_tests::true"
          else
            echo "::set-output name=run_tests::false"
          fi

      - name: Trigger buildkite build
        if: steps.check_commits.outputs.run_tests == 'true'
        uses: buildkite/trigger-pipeline-action@v1.2.0
        env:
          PIPELINE: explosion-ai/spacy-slow-tests
          BRANCH: ${{ matrix.branch }}
          MESSAGE: ":github: Daily slow tests - triggered from a GitHub Action"
          BUILDKITE_API_ACCESS_TOKEN: ${{ secrets.BUILDKITE_SECRET }}
|
1
.gitignore
vendored
|
@ -9,7 +9,6 @@ keys/
|
|||
spacy/tests/package/setup.cfg
|
||||
spacy/tests/package/pyproject.toml
|
||||
spacy/tests/package/requirements.txt
|
||||
spacy/tests/universe/universe.json
|
||||
|
||||
# Website
|
||||
website/.cache/
|
||||
|
|
|
@ -143,15 +143,25 @@ Changes to `.py` files will be effective immediately.
|
|||
### Fixing bugs
|
||||
|
||||
When fixing a bug, first create an
|
||||
[issue](https://github.com/explosion/spaCy/issues) if one does not already exist.
|
||||
The description text can be very short – we don't want to make this too
|
||||
[issue](https://github.com/explosion/spaCy/issues) if one does not already
|
||||
exist. The description text can be very short – we don't want to make this too
|
||||
bureaucratic.
|
||||
|
||||
Next, create a test file named `test_issue[ISSUE NUMBER].py` in the
|
||||
[`spacy/tests/regression`](spacy/tests/regression) folder. Test for the bug
|
||||
you're fixing, and make sure the test fails. Next, add and commit your test file
|
||||
referencing the issue number in the commit message. Finally, fix the bug, make
|
||||
sure your test passes and reference the issue in your commit message.
|
||||
Next, add a test to the relevant file in the
|
||||
[`spacy/tests`](spacy/tests) folder. Then add a [pytest
|
||||
mark](https://docs.pytest.org/en/6.2.x/example/markers.html#working-with-custom-markers),
|
||||
`@pytest.mark.issue(NUMBER)`, to reference the issue number.
|
||||
|
||||
```python
|
||||
# Assume you're fixing Issue #1234
|
||||
@pytest.mark.issue(1234)
|
||||
def test_issue1234():
|
||||
...
|
||||
```
|
||||
|
||||
Test for the bug you're fixing, and make sure the test fails. Next, add and
|
||||
commit your test file. Finally, fix the bug, make sure your test passes and
|
||||
reference the issue number in your pull request description.
|
||||
|
||||
📖 **For more information on how to add tests, check out the [tests README](spacy/tests/README.md).**
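As a quick illustration (a hypothetical sketch, not part of this diff: the file path, issue number and `en_tokenizer` fixture usage are made up for the example), a complete minimal regression test added under this workflow could look like this:

```python
# spacy/tests/tokenizer/test_issue1234.py (hypothetical location and issue number)
import pytest


@pytest.mark.issue(1234)
def test_issue1234(en_tokenizer):
    # en_tokenizer is provided automatically by spacy/tests/conftest.py
    doc = en_tokenizer("Hello, world!")
    assert [t.text for t in doc] == ["Hello", ",", "world", "!"]
```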
|
||||
|
||||
|
|
2
LICENSE
|
@ -1,6 +1,6 @@
|
|||
The MIT License (MIT)
|
||||
|
||||
Copyright (C) 2016-2021 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
|
||||
Copyright (C) 2016-2022 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
|
||||
|
||||
Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||
of this software and associated documentation files (the "Software"), to deal
|
||||
|
|
|
@ -1,11 +1,8 @@
|
|||
recursive-include include *.h
|
||||
recursive-include spacy *.pyi *.pyx *.pxd *.txt *.cfg *.jinja *.toml
|
||||
include LICENSE
|
||||
include README.md
|
||||
include pyproject.toml
|
||||
include spacy/py.typed
|
||||
recursive-exclude spacy/lang *.json
|
||||
recursive-include spacy/lang *.json.gz
|
||||
recursive-include spacy/cli *.json *.yml
|
||||
recursive-include spacy/cli *.yml
|
||||
recursive-include licenses *
|
||||
recursive-exclude spacy *.cpp
|
||||
|
|
31
README.md
|
@ -32,19 +32,20 @@ open-source software, released under the MIT license.
|
|||
|
||||
## 📖 Documentation
|
||||
|
||||
| Documentation | |
|
||||
| -------------------------- | -------------------------------------------------------------- |
|
||||
| ⭐️ **[spaCy 101]** | New to spaCy? Here's everything you need to know! |
|
||||
| 📚 **[Usage Guides]** | How to use spaCy and its features. |
|
||||
| 🚀 **[New in v3.0]** | New features, backwards incompatibilities and migration guide. |
|
||||
| 🪐 **[Project Templates]** | End-to-end workflows you can clone, modify and run. |
|
||||
| 🎛 **[API Reference]** | The detailed reference for spaCy's API. |
|
||||
| 📦 **[Models]** | Download trained pipelines for spaCy. |
|
||||
| 🌌 **[Universe]** | Plugins, extensions, demos and books from the spaCy ecosystem. |
|
||||
| 👩🏫 **[Online Course]** | Learn spaCy in this free and interactive online course. |
|
||||
| 📺 **[Videos]** | Our YouTube channel with video tutorials, talks and more. |
|
||||
| 🛠 **[Changelog]** | Changes and version history. |
|
||||
| 💝 **[Contribute]** | How to contribute to the spaCy project and code base. |
|
||||
| Documentation | |
|
||||
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| ⭐️ **[spaCy 101]** | New to spaCy? Here's everything you need to know! |
|
||||
| 📚 **[Usage Guides]** | How to use spaCy and its features. |
|
||||
| 🚀 **[New in v3.0]** | New features, backwards incompatibilities and migration guide. |
|
||||
| 🪐 **[Project Templates]** | End-to-end workflows you can clone, modify and run. |
|
||||
| 🎛 **[API Reference]** | The detailed reference for spaCy's API. |
|
||||
| 📦 **[Models]** | Download trained pipelines for spaCy. |
|
||||
| 🌌 **[Universe]** | Plugins, extensions, demos and books from the spaCy ecosystem. |
|
||||
| 👩🏫 **[Online Course]** | Learn spaCy in this free and interactive online course. |
|
||||
| 📺 **[Videos]** | Our YouTube channel with video tutorials, talks and more. |
|
||||
| 🛠 **[Changelog]** | Changes and version history. |
|
||||
| 💝 **[Contribute]** | How to contribute to the spaCy project and code base. |
|
||||
| <a href="https://explosion.ai/spacy-tailored-pipelines"><img src="https://user-images.githubusercontent.com/13643239/152853098-1c761611-ccb0-4ec6-9066-b234552831fe.png" width="125" alt="spaCy Tailored Pipelines"/></a> | Get a custom spaCy pipeline, tailor-made for your NLP problem by spaCy's core developers. Streamlined, production-ready, predictable and maintainable. Start by completing our 5-minute questionnaire to tell us what you need and we'll be in touch! **[Learn more →](https://explosion.ai/spacy-tailored-pipelines)** |
|
||||
|
||||
[spacy 101]: https://spacy.io/usage/spacy-101
|
||||
[new in v3.0]: https://spacy.io/usage/v3
|
||||
|
@ -60,9 +61,7 @@ open-source software, released under the MIT license.
|
|||
|
||||
## 💬 Where to ask questions
|
||||
|
||||
The spaCy project is maintained by **[@honnibal](https://github.com/honnibal)**,
|
||||
**[@ines](https://github.com/ines)**, **[@svlandeg](https://github.com/svlandeg)**,
|
||||
**[@adrianeboyd](https://github.com/adrianeboyd)** and **[@polm](https://github.com/polm)**.
|
||||
The spaCy project is maintained by the [spaCy team](https://explosion.ai/about).
|
||||
Please understand that we won't be able to provide individual support via email.
|
||||
We also believe that help is much more valuable if it's shared publicly, so that
|
||||
more people can benefit from it.
|
||||
|
|
|
@ -11,19 +11,21 @@ trigger:
|
|||
exclude:
|
||||
- "website/*"
|
||||
- "*.md"
|
||||
- ".github/workflows/*"
|
||||
pr:
|
||||
paths:
|
||||
paths:
|
||||
exclude:
|
||||
- "*.md"
|
||||
- "website/docs/*"
|
||||
- "website/src/*"
|
||||
- ".github/workflows/*"
|
||||
|
||||
jobs:
|
||||
# Perform basic checks for most important errors (syntax etc.) Uses the config
|
||||
# defined in .flake8 and overwrites the selected codes.
|
||||
- job: "Validate"
|
||||
pool:
|
||||
vmImage: "ubuntu-18.04"
|
||||
vmImage: "ubuntu-latest"
|
||||
steps:
|
||||
- task: UsePythonVersion@0
|
||||
inputs:
|
||||
|
@ -39,49 +41,49 @@ jobs:
|
|||
matrix:
|
||||
# We're only running one platform per Python version to speed up builds
|
||||
Python36Linux:
|
||||
imageName: "ubuntu-18.04"
|
||||
imageName: "ubuntu-latest"
|
||||
python.version: "3.6"
|
||||
# Python36Windows:
|
||||
# imageName: "windows-2019"
|
||||
# imageName: "windows-latest"
|
||||
# python.version: "3.6"
|
||||
# Python36Mac:
|
||||
# imageName: "macos-10.14"
|
||||
# imageName: "macos-latest"
|
||||
# python.version: "3.6"
|
||||
# Python37Linux:
|
||||
# imageName: "ubuntu-18.04"
|
||||
# imageName: "ubuntu-latest"
|
||||
# python.version: "3.7"
|
||||
Python37Windows:
|
||||
imageName: "windows-2019"
|
||||
imageName: "windows-latest"
|
||||
python.version: "3.7"
|
||||
# Python37Mac:
|
||||
# imageName: "macos-10.14"
|
||||
# imageName: "macos-latest"
|
||||
# python.version: "3.7"
|
||||
# Python38Linux:
|
||||
# imageName: "ubuntu-18.04"
|
||||
# imageName: "ubuntu-latest"
|
||||
# python.version: "3.8"
|
||||
# Python38Windows:
|
||||
# imageName: "windows-2019"
|
||||
# imageName: "windows-latest"
|
||||
# python.version: "3.8"
|
||||
Python38Mac:
|
||||
imageName: "macos-10.14"
|
||||
imageName: "macos-latest"
|
||||
python.version: "3.8"
|
||||
Python39Linux:
|
||||
imageName: "ubuntu-18.04"
|
||||
imageName: "ubuntu-latest"
|
||||
python.version: "3.9"
|
||||
# Python39Windows:
|
||||
# imageName: "windows-2019"
|
||||
# imageName: "windows-latest"
|
||||
# python.version: "3.9"
|
||||
# Python39Mac:
|
||||
# imageName: "macos-10.14"
|
||||
# imageName: "macos-latest"
|
||||
# python.version: "3.9"
|
||||
Python310Linux:
|
||||
imageName: "ubuntu-20.04"
|
||||
imageName: "ubuntu-latest"
|
||||
python.version: "3.10"
|
||||
Python310Windows:
|
||||
imageName: "windows-2019"
|
||||
imageName: "windows-latest"
|
||||
python.version: "3.10"
|
||||
Python310Mac:
|
||||
imageName: "macos-10.15"
|
||||
imageName: "macos-latest"
|
||||
python.version: "3.10"
|
||||
maxParallel: 4
|
||||
pool:
|
||||
|
|
|
@ -444,7 +444,7 @@ spaCy uses the [`pytest`](http://doc.pytest.org/) framework for testing. Tests f
|
|||
|
||||
When adding tests, make sure to use descriptive names and only test for one behavior at a time. Tests should be grouped into modules dedicated to the same type of functionality and some test modules are organized as directories of test files related to the same larger area of the library, e.g. `matcher` or `tokenizer`.
|
||||
|
||||
Regression tests are tests that refer to bugs reported in specific issues. They should live in the `regression` module and are named according to the issue number (e.g. `test_issue1234.py`). This system allows us to relate tests for specific bugs back to the original reported issue, which is especially useful if we introduce a regression and a previously passing regression tests suddenly fails again. When fixing a bug, it's often useful to create a regression test for it first. Every once in a while, we go through the `regression` module and group tests together into larger files by issue number, in groups of 500 to 1000 numbers. This prevents us from ending up with too many individual files over time.
|
||||
Regression tests are tests that refer to bugs reported in specific issues. They should live in the relevant module of the test suite, named according to the issue number (e.g., `test_issue1234.py`), and [marked](https://docs.pytest.org/en/6.2.x/example/markers.html#working-with-custom-markers) appropriately (e.g. `@pytest.mark.issue(1234)`). This system allows us to relate tests for specific bugs back to the original reported issue, which is especially useful if we introduce a regression and a previously passing regression tests suddenly fails again. When fixing a bug, it's often useful to create a regression test for it first.
|
||||
|
||||
The test suite also provides [fixtures](https://github.com/explosion/spaCy/blob/master/spacy/tests/conftest.py) for different language tokenizers that can be used as function arguments of the same name and will be passed in automatically. Those should only be used for tests related to those specific languages. We also have [test utility functions](https://github.com/explosion/spaCy/blob/master/spacy/tests/util.py) for common operations, like creating a temporary file.
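For example (a hedged sketch, not taken from the test suite; the issue number and assertion are illustrative), a regression test for a language-specific tokenizer bug combines the issue marker with the matching tokenizer fixture:

```python
import pytest


@pytest.mark.issue(9999)  # hypothetical issue number
def test_issue9999(fi_tokenizer):
    # fi_tokenizer is one of the per-language fixtures from conftest.py
    tokens = fi_tokenizer("Hyvää huomenta!")
    assert [t.text for t in tokens] == ["Hyvää", "huomenta", "!"]
```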
|
||||
|
||||
|
|
|
@ -31,7 +31,8 @@ pytest-timeout>=1.3.0,<2.0.0
|
|||
mock>=2.0.0,<3.0.0
|
||||
flake8>=3.8.0,<3.10.0
|
||||
hypothesis>=3.27.0,<7.0.0
|
||||
mypy>=0.910
|
||||
mypy==0.910
|
||||
types-dataclasses>=0.1.3; python_version < "3.7"
|
||||
types-mock>=0.1.1
|
||||
types-requests
|
||||
black>=22.0,<23.0
|
||||
|
|
32
setup.cfg
|
@ -77,37 +77,39 @@ transformers =
|
|||
ray =
|
||||
spacy_ray>=0.1.0,<1.0.0
|
||||
cuda =
|
||||
cupy>=5.0.0b4,<10.0.0
|
||||
cupy>=5.0.0b4,<11.0.0
|
||||
cuda80 =
|
||||
cupy-cuda80>=5.0.0b4,<10.0.0
|
||||
cupy-cuda80>=5.0.0b4,<11.0.0
|
||||
cuda90 =
|
||||
cupy-cuda90>=5.0.0b4,<10.0.0
|
||||
cupy-cuda90>=5.0.0b4,<11.0.0
|
||||
cuda91 =
|
||||
cupy-cuda91>=5.0.0b4,<10.0.0
|
||||
cupy-cuda91>=5.0.0b4,<11.0.0
|
||||
cuda92 =
|
||||
cupy-cuda92>=5.0.0b4,<10.0.0
|
||||
cupy-cuda92>=5.0.0b4,<11.0.0
|
||||
cuda100 =
|
||||
cupy-cuda100>=5.0.0b4,<10.0.0
|
||||
cupy-cuda100>=5.0.0b4,<11.0.0
|
||||
cuda101 =
|
||||
cupy-cuda101>=5.0.0b4,<10.0.0
|
||||
cupy-cuda101>=5.0.0b4,<11.0.0
|
||||
cuda102 =
|
||||
cupy-cuda102>=5.0.0b4,<10.0.0
|
||||
cupy-cuda102>=5.0.0b4,<11.0.0
|
||||
cuda110 =
|
||||
cupy-cuda110>=5.0.0b4,<10.0.0
|
||||
cupy-cuda110>=5.0.0b4,<11.0.0
|
||||
cuda111 =
|
||||
cupy-cuda111>=5.0.0b4,<10.0.0
|
||||
cupy-cuda111>=5.0.0b4,<11.0.0
|
||||
cuda112 =
|
||||
cupy-cuda112>=5.0.0b4,<10.0.0
|
||||
cupy-cuda112>=5.0.0b4,<11.0.0
|
||||
cuda113 =
|
||||
cupy-cuda113>=5.0.0b4,<10.0.0
|
||||
cupy-cuda113>=5.0.0b4,<11.0.0
|
||||
cuda114 =
|
||||
cupy-cuda114>=5.0.0b4,<10.0.0
|
||||
cupy-cuda114>=5.0.0b4,<11.0.0
|
||||
cuda115 =
|
||||
cupy-cuda115>=5.0.0b4,<11.0.0
|
||||
apple =
|
||||
thinc-apple-ops>=0.0.4,<1.0.0
|
||||
# Language tokenizers with external dependencies
|
||||
ja =
|
||||
sudachipy>=0.4.9
|
||||
sudachidict_core>=20200330
|
||||
sudachipy>=0.5.2,!=0.6.1
|
||||
sudachidict_core>=20211220
|
||||
ko =
|
||||
natto-py==0.9.0
|
||||
th =
|
||||
|
|
1
setup.py
|
@ -81,7 +81,6 @@ COPY_FILES = {
|
|||
ROOT / "setup.cfg": PACKAGE_ROOT / "tests" / "package",
|
||||
ROOT / "pyproject.toml": PACKAGE_ROOT / "tests" / "package",
|
||||
ROOT / "requirements.txt": PACKAGE_ROOT / "tests" / "package",
|
||||
ROOT / "website" / "meta" / "universe.json": PACKAGE_ROOT / "tests" / "universe",
|
||||
}
|
||||
|
||||
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
# fmt: off
|
||||
__title__ = "spacy"
|
||||
__version__ = "3.2.0"
|
||||
__version__ = "3.2.2"
|
||||
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
|
||||
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
|
||||
__projects__ = "https://github.com/explosion/projects"
|
||||
|
|
|
@ -1,3 +1,6 @@
|
|||
from .errors import Errors
|
||||
|
||||
IOB_STRINGS = ("", "I", "O", "B")
|
||||
|
||||
IDS = {
|
||||
"": NULL_ATTR,
|
||||
|
@ -64,7 +67,6 @@ IDS = {
|
|||
"FLAG61": FLAG61,
|
||||
"FLAG62": FLAG62,
|
||||
"FLAG63": FLAG63,
|
||||
|
||||
"ID": ID,
|
||||
"ORTH": ORTH,
|
||||
"LOWER": LOWER,
|
||||
|
@ -72,7 +74,6 @@ IDS = {
|
|||
"SHAPE": SHAPE,
|
||||
"PREFIX": PREFIX,
|
||||
"SUFFIX": SUFFIX,
|
||||
|
||||
"LENGTH": LENGTH,
|
||||
"LEMMA": LEMMA,
|
||||
"POS": POS,
|
||||
|
@ -87,7 +88,7 @@ IDS = {
|
|||
"SPACY": SPACY,
|
||||
"LANG": LANG,
|
||||
"MORPH": MORPH,
|
||||
"IDX": IDX
|
||||
"IDX": IDX,
|
||||
}
|
||||
|
||||
|
||||
|
@ -109,28 +110,66 @@ def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False):
|
|||
"""
|
||||
inty_attrs = {}
|
||||
if _do_deprecated:
|
||||
if 'F' in stringy_attrs:
|
||||
if "F" in stringy_attrs:
|
||||
stringy_attrs["ORTH"] = stringy_attrs.pop("F")
|
||||
if 'L' in stringy_attrs:
|
||||
if "L" in stringy_attrs:
|
||||
stringy_attrs["LEMMA"] = stringy_attrs.pop("L")
|
||||
if 'pos' in stringy_attrs:
|
||||
if "pos" in stringy_attrs:
|
||||
stringy_attrs["TAG"] = stringy_attrs.pop("pos")
|
||||
if 'morph' in stringy_attrs:
|
||||
morphs = stringy_attrs.pop('morph')
|
||||
if 'number' in stringy_attrs:
|
||||
stringy_attrs.pop('number')
|
||||
if 'tenspect' in stringy_attrs:
|
||||
stringy_attrs.pop('tenspect')
|
||||
if "morph" in stringy_attrs:
|
||||
morphs = stringy_attrs.pop("morph")
|
||||
if "number" in stringy_attrs:
|
||||
stringy_attrs.pop("number")
|
||||
if "tenspect" in stringy_attrs:
|
||||
stringy_attrs.pop("tenspect")
|
||||
morph_keys = [
|
||||
'PunctType', 'PunctSide', 'Other', 'Degree', 'AdvType', 'Number',
|
||||
'VerbForm', 'PronType', 'Aspect', 'Tense', 'PartType', 'Poss',
|
||||
'Hyph', 'ConjType', 'NumType', 'Foreign', 'VerbType', 'NounType',
|
||||
'Gender', 'Mood', 'Negative', 'Tense', 'Voice', 'Abbr',
|
||||
'Derivation', 'Echo', 'Foreign', 'NameType', 'NounType', 'NumForm',
|
||||
'NumValue', 'PartType', 'Polite', 'StyleVariant',
|
||||
'PronType', 'AdjType', 'Person', 'Variant', 'AdpType',
|
||||
'Reflex', 'Negative', 'Mood', 'Aspect', 'Case',
|
||||
'Polarity', 'PrepCase', 'Animacy' # U20
|
||||
"PunctType",
|
||||
"PunctSide",
|
||||
"Other",
|
||||
"Degree",
|
||||
"AdvType",
|
||||
"Number",
|
||||
"VerbForm",
|
||||
"PronType",
|
||||
"Aspect",
|
||||
"Tense",
|
||||
"PartType",
|
||||
"Poss",
|
||||
"Hyph",
|
||||
"ConjType",
|
||||
"NumType",
|
||||
"Foreign",
|
||||
"VerbType",
|
||||
"NounType",
|
||||
"Gender",
|
||||
"Mood",
|
||||
"Negative",
|
||||
"Tense",
|
||||
"Voice",
|
||||
"Abbr",
|
||||
"Derivation",
|
||||
"Echo",
|
||||
"Foreign",
|
||||
"NameType",
|
||||
"NounType",
|
||||
"NumForm",
|
||||
"NumValue",
|
||||
"PartType",
|
||||
"Polite",
|
||||
"StyleVariant",
|
||||
"PronType",
|
||||
"AdjType",
|
||||
"Person",
|
||||
"Variant",
|
||||
"AdpType",
|
||||
"Reflex",
|
||||
"Negative",
|
||||
"Mood",
|
||||
"Aspect",
|
||||
"Case",
|
||||
"Polarity",
|
||||
"PrepCase",
|
||||
"Animacy", # U20
|
||||
]
|
||||
for key in morph_keys:
|
||||
if key in stringy_attrs:
|
||||
|
@ -142,8 +181,13 @@ def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False):
|
|||
for name, value in stringy_attrs.items():
|
||||
int_key = intify_attr(name)
|
||||
if int_key is not None:
|
||||
if int_key == ENT_IOB:
|
||||
if value in IOB_STRINGS:
|
||||
value = IOB_STRINGS.index(value)
|
||||
elif isinstance(value, str):
|
||||
raise ValueError(Errors.E1025.format(value=value))
|
||||
if strings_map is not None and isinstance(value, str):
|
||||
if hasattr(strings_map, 'add'):
|
||||
if hasattr(strings_map, "add"):
|
||||
value = strings_map.add(value)
|
||||
else:
|
||||
value = strings_map[value]
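A short behavioral sketch of the change above, assuming the public `spacy.attrs.intify_attrs` helper is called directly (the attribute values are illustrative):

```python
from spacy.attrs import ENT_IOB, intify_attrs

# IOB strings are now converted to their index in IOB_STRINGS ("", "I", "O", "B")
assert intify_attrs({"ENT_IOB": "B"}) == {ENT_IOB: 3}

# Any other string passed for ENT_IOB raises the new E1025 error
try:
    intify_attrs({"ENT_IOB": "X"})
except ValueError as err:
    print(err)  # E1025: Cannot intify the value 'X' as an IOB string ...
```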
|
||||
|
|
|
@ -25,7 +25,7 @@ def debug_config_cli(
|
|||
show_vars: bool = Opt(False, "--show-variables", "-V", help="Show an overview of all variables referenced in the config and their values. This will also reflect variables overwritten on the CLI.")
|
||||
# fmt: on
|
||||
):
|
||||
"""Debug a config.cfg file and show validation errors. The command will
|
||||
"""Debug a config file and show validation errors. The command will
|
||||
create all objects in the tree and validate them. Note that some config
|
||||
validation errors are blocking and will prevent the rest of the config from
|
||||
being resolved. This means that you may not see all validation errors at
|
||||
|
|
|
@ -14,7 +14,7 @@ from ..training.initialize import get_sourced_components
|
|||
from ..schemas import ConfigSchemaTraining
|
||||
from ..pipeline._parser_internals import nonproj
|
||||
from ..pipeline._parser_internals.nonproj import DELIMITER
|
||||
from ..pipeline import Morphologizer
|
||||
from ..pipeline import Morphologizer, SpanCategorizer
|
||||
from ..morphology import Morphology
|
||||
from ..language import Language
|
||||
from ..util import registry, resolve_dot_names
|
||||
|
@ -193,6 +193,70 @@ def debug_data(
|
|||
else:
|
||||
msg.info("No word vectors present in the package")
|
||||
|
||||
if "spancat" in factory_names:
|
||||
model_labels_spancat = _get_labels_from_spancat(nlp)
|
||||
has_low_data_warning = False
|
||||
has_no_neg_warning = False
|
||||
|
||||
msg.divider("Span Categorization")
|
||||
msg.table(model_labels_spancat, header=["Spans Key", "Labels"], divider=True)
|
||||
|
||||
msg.text("Label counts in train data: ", show=verbose)
|
||||
for spans_key, data_labels in gold_train_data["spancat"].items():
|
||||
msg.text(
|
||||
f"Key: {spans_key}, {_format_labels(data_labels.items(), counts=True)}",
|
||||
show=verbose,
|
||||
)
|
||||
# Data checks: only take the spans keys in the actual spancat components
|
||||
data_labels_in_component = {
|
||||
spans_key: gold_train_data["spancat"][spans_key]
|
||||
for spans_key in model_labels_spancat.keys()
|
||||
}
|
||||
for spans_key, data_labels in data_labels_in_component.items():
|
||||
for label, count in data_labels.items():
|
||||
# Check for missing labels
|
||||
spans_key_in_model = spans_key in model_labels_spancat.keys()
|
||||
if (spans_key_in_model) and (
|
||||
label not in model_labels_spancat[spans_key]
|
||||
):
|
||||
msg.warn(
|
||||
f"Label '{label}' is not present in the model labels of key '{spans_key}'. "
|
||||
"Performance may degrade after training."
|
||||
)
|
||||
# Check for low number of examples per label
|
||||
if count <= NEW_LABEL_THRESHOLD:
|
||||
msg.warn(
|
||||
f"Low number of examples for label '{label}' in key '{spans_key}' ({count})"
|
||||
)
|
||||
has_low_data_warning = True
|
||||
# Check for negative examples
|
||||
with msg.loading("Analyzing label distribution..."):
|
||||
neg_docs = _get_examples_without_label(
|
||||
train_dataset, label, "spancat", spans_key
|
||||
)
|
||||
if neg_docs == 0:
|
||||
msg.warn(f"No examples for texts WITHOUT new label '{label}'")
|
||||
has_no_neg_warning = True
|
||||
|
||||
if has_low_data_warning:
|
||||
msg.text(
|
||||
f"To train a new span type, your data should include at "
|
||||
f"least {NEW_LABEL_THRESHOLD} instances of the new label",
|
||||
show=verbose,
|
||||
)
|
||||
else:
|
||||
msg.good("Good amount of examples for all labels")
|
||||
|
||||
if has_no_neg_warning:
|
||||
msg.text(
|
||||
"Training data should always include examples of spans "
|
||||
"in context, as well as examples without a given span "
|
||||
"type.",
|
||||
show=verbose,
|
||||
)
|
||||
else:
|
||||
msg.good("Examples without ocurrences available for all labels")
|
||||
|
||||
if "ner" in factory_names:
|
||||
# Get all unique NER labels present in the data
|
||||
labels = set(
|
||||
|
@ -203,6 +267,7 @@ def debug_data(
|
|||
has_low_data_warning = False
|
||||
has_no_neg_warning = False
|
||||
has_ws_ents_error = False
|
||||
has_boundary_cross_ents_warning = False
|
||||
|
||||
msg.divider("Named Entity Recognition")
|
||||
msg.info(f"{len(model_labels)} label(s)")
|
||||
|
@ -237,17 +302,25 @@ def debug_data(
|
|||
has_low_data_warning = True
|
||||
|
||||
with msg.loading("Analyzing label distribution..."):
|
||||
neg_docs = _get_examples_without_label(train_dataset, label)
|
||||
neg_docs = _get_examples_without_label(train_dataset, label, "ner")
|
||||
if neg_docs == 0:
|
||||
msg.warn(f"No examples for texts WITHOUT new label '{label}'")
|
||||
has_no_neg_warning = True
|
||||
|
||||
if gold_train_data["boundary_cross_ents"]:
|
||||
msg.warn(
|
||||
f"{gold_train_data['boundary_cross_ents']} entity span(s) crossing sentence boundaries"
|
||||
)
|
||||
has_boundary_cross_ents_warning = True
|
||||
|
||||
if not has_low_data_warning:
|
||||
msg.good("Good amount of examples for all labels")
|
||||
if not has_no_neg_warning:
|
||||
msg.good("Examples without occurrences available for all labels")
|
||||
if not has_ws_ents_error:
|
||||
msg.good("No entities consisting of or starting/ending with whitespace")
|
||||
if not has_boundary_cross_ents_warning:
|
||||
msg.good("No entities crossing sentence boundaries")
|
||||
|
||||
if has_low_data_warning:
|
||||
msg.text(
|
||||
|
@ -564,7 +637,9 @@ def _compile_gold(
|
|||
"deps": Counter(),
|
||||
"words": Counter(),
|
||||
"roots": Counter(),
|
||||
"spancat": dict(),
|
||||
"ws_ents": 0,
|
||||
"boundary_cross_ents": 0,
|
||||
"n_words": 0,
|
||||
"n_misaligned_words": 0,
|
||||
"words_missing_vectors": Counter(),
|
||||
|
@ -593,6 +668,7 @@ def _compile_gold(
|
|||
if nlp.vocab.strings[word] not in nlp.vocab.vectors:
|
||||
data["words_missing_vectors"].update([word])
|
||||
if "ner" in factory_names:
|
||||
sent_starts = eg.get_aligned_sent_starts()
|
||||
for i, label in enumerate(eg.get_aligned_ner()):
|
||||
if label is None:
|
||||
continue
|
||||
|
@ -602,8 +678,19 @@ def _compile_gold(
|
|||
if label.startswith(("B-", "U-")):
|
||||
combined_label = label.split("-")[1]
|
||||
data["ner"][combined_label] += 1
|
||||
if sent_starts[i] == True and label.startswith(("I-", "L-")):
|
||||
data["boundary_cross_ents"] += 1
|
||||
elif label == "-":
|
||||
data["ner"]["-"] += 1
|
||||
if "spancat" in factory_names:
|
||||
for span_key in list(eg.reference.spans.keys()):
|
||||
if span_key not in data["spancat"]:
|
||||
data["spancat"][span_key] = Counter()
|
||||
for i, span in enumerate(eg.reference.spans[span_key]):
|
||||
if span.label_ is None:
|
||||
continue
|
||||
else:
|
||||
data["spancat"][span_key][span.label_] += 1
|
||||
if "textcat" in factory_names or "textcat_multilabel" in factory_names:
|
||||
data["cats"].update(gold.cats)
|
||||
if any(val not in (0, 1) for val in gold.cats.values()):
|
||||
|
@ -674,21 +761,57 @@ def _format_labels(
|
|||
return ", ".join([f"'{l}'" for l in cast(Iterable[str], labels)])
|
||||
|
||||
|
||||
def _get_examples_without_label(data: Sequence[Example], label: str) -> int:
|
||||
def _get_examples_without_label(
|
||||
data: Sequence[Example],
|
||||
label: str,
|
||||
component: Literal["ner", "spancat"] = "ner",
|
||||
spans_key: Optional[str] = "sc",
|
||||
) -> int:
|
||||
count = 0
|
||||
for eg in data:
|
||||
labels = [
|
||||
label.split("-")[1]
|
||||
for label in eg.get_aligned_ner()
|
||||
if label not in ("O", "-", None)
|
||||
]
|
||||
if component == "ner":
|
||||
labels = [
|
||||
label.split("-")[1]
|
||||
for label in eg.get_aligned_ner()
|
||||
if label not in ("O", "-", None)
|
||||
]
|
||||
|
||||
if component == "spancat":
|
||||
labels = (
|
||||
[span.label_ for span in eg.reference.spans[spans_key]]
|
||||
if spans_key in eg.reference.spans
|
||||
else []
|
||||
)
|
||||
|
||||
if label not in labels:
|
||||
count += 1
|
||||
return count
|
||||
|
||||
|
||||
def _get_labels_from_model(nlp: Language, pipe_name: str) -> Set[str]:
|
||||
if pipe_name not in nlp.pipe_names:
|
||||
return set()
|
||||
pipe = nlp.get_pipe(pipe_name)
|
||||
return set(pipe.labels)
|
||||
def _get_labels_from_model(nlp: Language, factory_name: str) -> Set[str]:
|
||||
pipe_names = [
|
||||
pipe_name
|
||||
for pipe_name in nlp.pipe_names
|
||||
if nlp.get_pipe_meta(pipe_name).factory == factory_name
|
||||
]
|
||||
labels: Set[str] = set()
|
||||
for pipe_name in pipe_names:
|
||||
pipe = nlp.get_pipe(pipe_name)
|
||||
labels.update(pipe.labels)
|
||||
return labels
|
||||
|
||||
|
||||
def _get_labels_from_spancat(nlp: Language) -> Dict[str, Set[str]]:
|
||||
pipe_names = [
|
||||
pipe_name
|
||||
for pipe_name in nlp.pipe_names
|
||||
if nlp.get_pipe_meta(pipe_name).factory == "spancat"
|
||||
]
|
||||
labels: Dict[str, Set[str]] = {}
|
||||
for pipe_name in pipe_names:
|
||||
pipe = nlp.get_pipe(pipe_name)
|
||||
assert isinstance(pipe, SpanCategorizer)
|
||||
if pipe.key not in labels:
|
||||
labels[pipe.key] = set()
|
||||
labels[pipe.key].update(pipe.labels)
|
||||
return labels
|
||||
|
|
|
@ -27,7 +27,7 @@ class Optimizations(str, Enum):
|
|||
@init_cli.command("config")
|
||||
def init_config_cli(
|
||||
# fmt: off
|
||||
output_file: Path = Arg(..., help="File to save config.cfg to or - for stdout (will only output config and no additional logging info)", allow_dash=True),
|
||||
output_file: Path = Arg(..., help="File to save the config to or - for stdout (will only output config and no additional logging info)", allow_dash=True),
|
||||
lang: str = Opt("en", "--lang", "-l", help="Two-letter code of the language to use"),
|
||||
pipeline: str = Opt("tagger,parser,ner", "--pipeline", "-p", help="Comma-separated names of trainable pipeline components to include (without 'tok2vec' or 'transformer')"),
|
||||
optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters."),
|
||||
|
@ -37,7 +37,7 @@ def init_config_cli(
|
|||
# fmt: on
|
||||
):
|
||||
"""
|
||||
Generate a starter config.cfg for training. Based on your requirements
|
||||
Generate a starter config file for training. Based on your requirements
|
||||
specified via the CLI arguments, this command generates a config with the
|
||||
optimal settings for your use case. This includes the choice of architecture,
|
||||
pretrained weights and related hyperparameters.
|
||||
|
@ -66,15 +66,15 @@ def init_config_cli(
|
|||
@init_cli.command("fill-config")
|
||||
def init_fill_config_cli(
|
||||
# fmt: off
|
||||
base_path: Path = Arg(..., help="Base config to fill", exists=True, dir_okay=False),
|
||||
output_file: Path = Arg("-", help="File to save config.cfg to (or - for stdout)", allow_dash=True),
|
||||
base_path: Path = Arg(..., help="Path to base config to fill", exists=True, dir_okay=False),
|
||||
output_file: Path = Arg("-", help="Path to output .cfg file (or - for stdout)", allow_dash=True),
|
||||
pretraining: bool = Opt(False, "--pretraining", "-pt", help="Include config for pretraining (with 'spacy pretrain')"),
|
||||
diff: bool = Opt(False, "--diff", "-D", help="Print a visual diff highlighting the changes"),
|
||||
code_path: Optional[Path] = Opt(None, "--code-path", "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
|
||||
# fmt: on
|
||||
):
|
||||
"""
|
||||
Fill partial config.cfg with default values. Will add all missing settings
|
||||
Fill partial config file with default values. Will add all missing settings
|
||||
from the default config and will create all objects, check the registered
|
||||
functions for their default values and update the base config. This command
|
||||
can be used with a config generated via the training quickstart widget:
|
||||
|
|
|
@ -4,8 +4,10 @@ from pathlib import Path
|
|||
from wasabi import Printer, MarkdownRenderer, get_raw_input
|
||||
from thinc.api import Config
|
||||
from collections import defaultdict
|
||||
from catalogue import RegistryError
|
||||
import srsly
|
||||
import sys
|
||||
import re
|
||||
|
||||
from ._util import app, Arg, Opt, string_to_list, WHEEL_SUFFIX, SDIST_SUFFIX
|
||||
from ..schemas import validate, ModelMetaSchema
|
||||
|
@ -108,6 +110,24 @@ def package(
|
|||
", ".join(meta["requirements"]),
|
||||
)
|
||||
if name is not None:
|
||||
if not name.isidentifier():
|
||||
msg.fail(
|
||||
f"Model name ('{name}') is not a valid module name. "
|
||||
"This is required so it can be imported as a module.",
|
||||
"We recommend names that use ASCII A-Z, a-z, _ (underscore), "
|
||||
"and 0-9. "
|
||||
"For specific details see: https://docs.python.org/3/reference/lexical_analysis.html#identifiers",
|
||||
exits=1,
|
||||
)
|
||||
if not _is_permitted_package_name(name):
|
||||
msg.fail(
|
||||
f"Model name ('{name}') is not a permitted package name. "
|
||||
"This is required to correctly load the model with spacy.load.",
|
||||
"We recommend names that use ASCII A-Z, a-z, _ (underscore), "
|
||||
"and 0-9. "
|
||||
"For specific details see: https://www.python.org/dev/peps/pep-0426/#name",
|
||||
exits=1,
|
||||
)
|
||||
meta["name"] = name
|
||||
if version is not None:
|
||||
meta["version"] = version
|
||||
|
@ -161,7 +181,7 @@ def package(
|
|||
imports="\n".join(f"from . import {m}" for m in imports)
|
||||
)
|
||||
create_file(package_path / "__init__.py", init_py)
|
||||
msg.good(f"Successfully created package '{model_name_v}'", main_path)
|
||||
msg.good(f"Successfully created package directory '{model_name_v}'", main_path)
|
||||
if create_sdist:
|
||||
with util.working_dir(main_path):
|
||||
util.run_command([sys.executable, "setup.py", "sdist"], capture=False)
|
||||
|
@ -170,8 +190,14 @@ def package(
|
|||
if create_wheel:
|
||||
with util.working_dir(main_path):
|
||||
util.run_command([sys.executable, "setup.py", "bdist_wheel"], capture=False)
|
||||
wheel = main_path / "dist" / f"{model_name_v}{WHEEL_SUFFIX}"
|
||||
wheel_name_squashed = re.sub("_+", "_", model_name_v)
|
||||
wheel = main_path / "dist" / f"{wheel_name_squashed}{WHEEL_SUFFIX}"
|
||||
msg.good(f"Successfully created binary wheel", wheel)
|
||||
if "__" in model_name:
|
||||
msg.warn(
|
||||
f"Model name ('{model_name}') contains a run of underscores. "
|
||||
"Runs of underscores are not significant in installed package names.",
|
||||
)
|
||||
|
||||
|
||||
def has_wheel() -> bool:
|
||||
|
@ -212,9 +238,18 @@ def get_third_party_dependencies(
|
|||
if "factory" in component:
|
||||
funcs["factories"].add(component["factory"])
|
||||
modules = set()
|
||||
lang = config["nlp"]["lang"]
|
||||
for reg_name, func_names in funcs.items():
|
||||
for func_name in func_names:
|
||||
func_info = util.registry.find(reg_name, func_name)
|
||||
# Try the lang-specific version and fall back
|
||||
try:
|
||||
func_info = util.registry.find(reg_name, lang + "." + func_name)
|
||||
except RegistryError:
|
||||
try:
|
||||
func_info = util.registry.find(reg_name, func_name)
|
||||
except RegistryError as regerr:
|
||||
# lang-specific version being absent is not actually an issue
|
||||
raise regerr from None
|
||||
module_name = func_info.get("module") # type: ignore[attr-defined]
|
||||
if module_name: # the code is part of a module, not a --code file
|
||||
modules.add(func_info["module"].split(".")[0]) # type: ignore[index]
|
||||
|
@ -412,6 +447,14 @@ def _format_label_scheme(data: Dict[str, Any]) -> str:
|
|||
return md.text
|
||||
|
||||
|
||||
def _is_permitted_package_name(package_name: str) -> bool:
|
||||
# regex from: https://www.python.org/dev/peps/pep-0426/#name
|
||||
permitted_match = re.search(
|
||||
r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$", package_name, re.IGNORECASE
|
||||
)
|
||||
return permitted_match is not None
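As a quick sanity check of the PEP 426 pattern above (a hedged sketch; the package names are made up and the private helper is imported directly only for illustration):

```python
from spacy.cli.package import _is_permitted_package_name

assert _is_permitted_package_name("en_core_web_sm")
assert not _is_permitted_package_name("_hidden")    # may not start with . _ or -
assert not _is_permitted_package_name("my_model-")  # may not end with . _ or -
```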
|
||||
|
||||
|
||||
TEMPLATE_SETUP = """
|
||||
#!/usr/bin/env python
|
||||
import io
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
from typing import Any, Dict, Optional
|
||||
from pathlib import Path
|
||||
from wasabi import msg
|
||||
import os
|
||||
import re
|
||||
import shutil
|
||||
import requests
|
||||
|
@ -129,10 +130,17 @@ def fetch_asset(
|
|||
the asset failed.
|
||||
"""
|
||||
dest_path = (project_path / dest).resolve()
|
||||
if dest_path.exists() and checksum:
|
||||
if dest_path.exists():
|
||||
# If there's already a file, check for checksum
|
||||
if checksum == get_checksum(dest_path):
|
||||
msg.good(f"Skipping download with matching checksum: {dest}")
|
||||
if checksum:
|
||||
if checksum == get_checksum(dest_path):
|
||||
msg.good(f"Skipping download with matching checksum: {dest}")
|
||||
return
|
||||
else:
|
||||
# If there's not a checksum, make sure the file is a possibly valid size
|
||||
if os.path.getsize(dest_path) == 0:
|
||||
msg.warn(f"Asset exists but with size of 0 bytes, deleting: {dest}")
|
||||
os.remove(dest_path)
|
||||
# We might as well support the user here and create parent directories in
|
||||
# case the asset dir isn't listed as a dir to create in the project.yml
|
||||
if not dest_path.parent.exists():
|
||||
|
|
|
@ -6,6 +6,11 @@ can help generate the best possible configuration, given a user's requirements.
|
|||
[paths]
|
||||
train = null
|
||||
dev = null
|
||||
{% if use_transformer or optimize == "efficiency" or not word_vectors -%}
|
||||
vectors = null
|
||||
{% else -%}
|
||||
vectors = "{{ word_vectors }}"
|
||||
{% endif -%}
|
||||
|
||||
[system]
|
||||
{% if use_transformer -%}
|
||||
|
@ -421,8 +426,4 @@ compound = 1.001
|
|||
{% endif %}
|
||||
|
||||
[initialize]
|
||||
{% if use_transformer or optimize == "efficiency" or not word_vectors -%}
|
||||
vectors = ${paths.vectors}
|
||||
{% else -%}
|
||||
vectors = "{{ word_vectors }}"
|
||||
{% endif -%}
|
||||
|
|
|
@ -68,12 +68,14 @@ seed = ${system.seed}
|
|||
gpu_allocator = ${system.gpu_allocator}
|
||||
dropout = 0.1
|
||||
accumulate_gradient = 1
|
||||
# Controls early-stopping. 0 disables early stopping.
|
||||
# Controls early-stopping, i.e., the number of steps to continue without
|
||||
# improvement before stopping. 0 disables early stopping.
|
||||
patience = 1600
|
||||
# Number of epochs. 0 means unlimited. If >= 0, train corpus is loaded once in
|
||||
# memory and shuffled within the training loop. -1 means stream train corpus
|
||||
# rather than loading in memory with no shuffling within the training loop.
|
||||
max_epochs = 0
|
||||
# Maximum number of update steps to train for. 0 means an unlimited number of steps.
|
||||
max_steps = 20000
|
||||
eval_frequency = 200
|
||||
# Control how scores are printed and checkpoints are evaluated.
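To make the early-stopping comments concrete, here is a minimal sketch using thinc's `Config` API (only the three settings discussed are shown; this is not a complete training config):

```python
from thinc.api import Config

CFG_STRING = """
[training]
patience = 1600
max_epochs = 0
max_steps = 20000
"""

cfg = Config().from_str(CFG_STRING)
cfg["training"]["patience"] = 0  # 0 disables early stopping, per the comment above
print(cfg["training"])
```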
|
||||
|
|
|
@ -181,11 +181,19 @@ def parse_deps(orig_doc: Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]:
|
|||
def parse_ents(doc: Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]:
|
||||
"""Generate named entities in [{start: i, end: i, label: 'label'}] format.
|
||||
|
||||
doc (Doc): Document do parse.
|
||||
doc (Doc): Document to parse.
|
||||
options (Dict[str, Any]): NER-specific visualisation options.
|
||||
RETURNS (dict): Generated entities keyed by text (original text) and ents.
|
||||
"""
|
||||
kb_url_template = options.get("kb_url_template", None)
|
||||
ents = [
|
||||
{"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
|
||||
{
|
||||
"start": ent.start_char,
|
||||
"end": ent.end_char,
|
||||
"label": ent.label_,
|
||||
"kb_id": ent.kb_id_ if ent.kb_id_ else "",
|
||||
"kb_url": kb_url_template.format(ent.kb_id_) if kb_url_template else "#",
|
||||
}
|
||||
for ent in doc.ents
|
||||
]
|
||||
if not ents:
|
||||
|
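The `kb_url_template` option introduced here can be exercised end to end. A hedged sketch follows; the Wikidata URL pattern, the example text and the manually attached `kb_id` are illustrative and not part of the diff:

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Sebastian Thrun worked at Google.")
# Attach an entity with a KB ID by hand, standing in for an entity_linker.
doc.ents = [Span(doc, 4, 5, label="ORG", kb_id="Q95")]

html = displacy.render(
    doc,
    style="ent",
    options={"kb_url_template": "https://www.wikidata.org/wiki/{}"},
)
# The rendered markup links the ORG entity to https://www.wikidata.org/wiki/Q95
```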
|
|
@ -18,7 +18,7 @@ DEFAULT_LABEL_COLORS = {
|
|||
"LOC": "#ff9561",
|
||||
"PERSON": "#aa9cfc",
|
||||
"NORP": "#c887fb",
|
||||
"FACILITY": "#9cc9cc",
|
||||
"FAC": "#9cc9cc",
|
||||
"EVENT": "#ffeb80",
|
||||
"LAW": "#ff8197",
|
||||
"LANGUAGE": "#ff8197",
|
||||
|
|
|
@ -483,7 +483,7 @@ class Errors(metaclass=ErrorsWithCodes):
|
|||
"components, since spans are only views of the Doc. Use Doc and "
|
||||
"Token attributes (or custom extension attributes) only and remove "
|
||||
"the following: {attrs}")
|
||||
E181 = ("Received invalid attributes for unkown object {obj}: {attrs}. "
|
||||
E181 = ("Received invalid attributes for unknown object {obj}: {attrs}. "
|
||||
"Only Doc and Token attributes are supported.")
|
||||
E182 = ("Received invalid attribute declaration: {attr}\nDid you forget "
|
||||
"to define the attribute? For example: `{attr}.???`")
|
||||
|
@ -566,9 +566,6 @@ class Errors(metaclass=ErrorsWithCodes):
|
|||
E879 = ("Unexpected type for 'spans' data. Provide a dictionary mapping keys to "
|
||||
"a list of spans, with each span represented by a tuple (start_char, end_char). "
|
||||
"The tuple can be optionally extended with a label and a KB ID.")
|
||||
E880 = ("The 'wandb' library could not be found - did you install it? "
|
||||
"Alternatively, specify the 'ConsoleLogger' in the 'training.logger' "
|
||||
"config section, instead of the 'WandbLogger'.")
|
||||
E884 = ("The pipeline could not be initialized because the vectors "
|
||||
"could not be found at '{vectors}'. If your pipeline was already "
|
||||
"initialized/trained before, call 'resume_training' instead of 'initialize', "
|
||||
|
@ -642,7 +639,7 @@ class Errors(metaclass=ErrorsWithCodes):
|
|||
E912 = ("Failed to initialize lemmatizer. Missing lemmatizer table(s) found "
|
||||
"for mode '{mode}'. Required tables: {tables}. Found: {found}.")
|
||||
E913 = ("Corpus path can't be None. Maybe you forgot to define it in your "
|
||||
"config.cfg or override it on the CLI?")
|
||||
".cfg file or override it on the CLI?")
|
||||
E914 = ("Executing {name} callback failed. Expected the function to "
|
||||
"return the nlp object but got: {value}. Maybe you forgot to return "
|
||||
"the modified object in your function?")
|
||||
|
@ -888,8 +885,13 @@ class Errors(metaclass=ErrorsWithCodes):
|
|||
E1021 = ("`pos` value \"{pp}\" is not a valid Universal Dependencies tag. "
|
||||
"Non-UD tags should use the `tag` property.")
|
||||
E1022 = ("Words must be of type str or int, but input is of type '{wtype}'")
|
||||
E1023 = ("Couldn't read EntityRuler from the {path}. This file doesn't exist.")
|
||||
|
||||
E1023 = ("Couldn't read EntityRuler from the {path}. This file doesn't "
|
||||
"exist.")
|
||||
E1024 = ("A pattern with ID \"{ent_id}\" is not present in EntityRuler "
|
||||
"patterns.")
|
||||
E1025 = ("Cannot intify the value '{value}' as an IOB string. The only "
|
||||
"supported values are: 'I', 'O', 'B' and ''")
|
||||
|
||||
|
||||
# Deprecated model shortcuts, only used in errors and warnings
|
||||
OLD_MODEL_SHORTCUTS = {
|
||||
|
|
|
@ -310,7 +310,6 @@ GLOSSARY = {
|
|||
"re": "repeated element",
|
||||
"rs": "reported speech",
|
||||
"sb": "subject",
|
||||
"sb": "subject",
|
||||
"sbp": "passivized subject (PP)",
|
||||
"sp": "subject or predicate",
|
||||
"svp": "separable verb prefix",
|
||||
|
|
|
@ -45,6 +45,10 @@ _hangul_syllables = r"\uAC00-\uD7AF"
|
|||
_hangul_jamo = r"\u1100-\u11FF"
|
||||
_hangul = _hangul_syllables + _hangul_jamo
|
||||
|
||||
_hiragana = r"\u3040-\u309F"
|
||||
_katakana = r"\u30A0-\u30FFー"
|
||||
_kana = _hiragana + _katakana
|
||||
|
||||
# letters with diacritics - Catalan, Czech, Latin, Latvian, Lithuanian, Polish, Slovak, Turkish, Welsh
|
||||
_latin_u_extendedA = (
|
||||
r"\u0100\u0102\u0104\u0106\u0108\u010A\u010C\u010E\u0110\u0112\u0114\u0116\u0118\u011A\u011C"
|
||||
|
@ -244,6 +248,7 @@ _uncased = (
|
|||
+ _tamil
|
||||
+ _telugu
|
||||
+ _hangul
|
||||
+ _kana
|
||||
+ _cjk
|
||||
)
|
||||
|
||||
|
|
|
@ -47,6 +47,41 @@ _num_words = [
|
|||
]
|
||||
|
||||
|
||||
_ordinal_words = [
|
||||
"primero",
|
||||
"segundo",
|
||||
"tercero",
|
||||
"cuarto",
|
||||
"quinto",
|
||||
"sexto",
|
||||
"séptimo",
|
||||
"octavo",
|
||||
"noveno",
|
||||
"décimo",
|
||||
"undécimo",
|
||||
"duodécimo",
|
||||
"decimotercero",
|
||||
"decimocuarto",
|
||||
"decimoquinto",
|
||||
"decimosexto",
|
||||
"decimoséptimo",
|
||||
"decimoctavo",
|
||||
"decimonoveno",
|
||||
"vigésimo",
|
||||
"trigésimo",
|
||||
"cuadragésimo",
|
||||
"quincuagésimo",
|
||||
"sexagésimo",
|
||||
"septuagésimo",
|
||||
"octogésima",
|
||||
"nonagésima",
|
||||
"centésima",
|
||||
"milésima",
|
||||
"millonésima",
|
||||
"billonésima",
|
||||
]
|
||||
|
||||
|
||||
def like_num(text):
|
||||
if text.startswith(("+", "-", "±", "~")):
|
||||
text = text[1:]
|
||||
|
@ -57,7 +92,11 @@ def like_num(text):
|
|||
num, denom = text.split("/")
|
||||
if num.isdigit() and denom.isdigit():
|
||||
return True
|
||||
if text.lower() in _num_words:
|
||||
text_lower = text.lower()
|
||||
if text_lower in _num_words:
|
||||
return True
|
||||
# Check ordinal number
|
||||
if text_lower in _ordinal_words:
|
||||
return True
|
||||
return False
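A hedged usage sketch of the new ordinal handling; it assumes a blank Spanish pipeline is enough because `like_num` is a lexical attribute and needs no trained model, and the sentence is made up:

```python
import spacy

nlp = spacy.blank("es")
doc = nlp("Llegó en décimo lugar")
print([(token.text, token.like_num) for token in doc])
# "décimo" should now report like_num == True thanks to _ordinal_words
```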
|
||||
|
||||
|
|
|
@ -2,6 +2,7 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
|||
from .stop_words import STOP_WORDS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
|
||||
from .syntax_iterators import SYNTAX_ITERATORS
|
||||
from ...language import Language, BaseDefaults
|
||||
|
||||
|
||||
|
@ -11,6 +12,7 @@ class FinnishDefaults(BaseDefaults):
|
|||
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
|
||||
lex_attr_getters = LEX_ATTRS
|
||||
stop_words = STOP_WORDS
|
||||
syntax_iterators = SYNTAX_ITERATORS
|
||||
|
||||
|
||||
class Finnish(Language):
|
||||
|
|
79
spacy/lang/fi/syntax_iterators.py
Normal file
|
@ -0,0 +1,79 @@
|
|||
from typing import Iterator, Tuple, Union
|
||||
from ...tokens import Doc, Span
|
||||
from ...symbols import NOUN, PROPN, PRON
|
||||
from ...errors import Errors
|
||||
|
||||
|
||||
def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
|
||||
"""Detect base noun phrases from a dependency parse. Works on both Doc and Span."""
|
||||
labels = [
|
||||
"appos",
|
||||
"nsubj",
|
||||
"nsubj:cop",
|
||||
"obj",
|
||||
"obl",
|
||||
"ROOT",
|
||||
]
|
||||
extend_labels = [
|
||||
"amod",
|
||||
"compound",
|
||||
"compound:nn",
|
||||
"flat:name",
|
||||
"nmod",
|
||||
"nmod:gobj",
|
||||
"nmod:gsubj",
|
||||
"nmod:poss",
|
||||
"nummod",
|
||||
]
|
||||
|
||||
def potential_np_head(word):
|
||||
return word.pos in (NOUN, PROPN) and (
|
||||
word.dep in np_deps or word.head.pos == PRON
|
||||
)
|
||||
|
||||
doc = doclike.doc # Ensure works on both Doc and Span.
|
||||
if not doc.has_annotation("DEP"):
|
||||
raise ValueError(Errors.E029)
|
||||
|
||||
np_deps = [doc.vocab.strings[label] for label in labels]
|
||||
extend_deps = [doc.vocab.strings[label] for label in extend_labels]
|
||||
np_label = doc.vocab.strings.add("NP")
|
||||
conj_label = doc.vocab.strings.add("conj")
|
||||
|
||||
rbracket = 0
|
||||
prev_end = -1
|
||||
for i, word in enumerate(doclike):
|
||||
if i < rbracket:
|
||||
continue
|
||||
|
||||
# Is this a potential independent NP head or coordinated with
|
||||
# a NOUN that is itself an independent NP head?
|
||||
#
|
||||
# e.g. "Terveyden ja hyvinvoinnin laitos"
|
||||
if potential_np_head(word) or (
|
||||
word.dep == conj_label and potential_np_head(word.head)
|
||||
):
|
||||
# Try to extend to the left to include adjective/num
|
||||
# modifiers, compound words etc.
|
||||
lbracket = word.i
|
||||
for ldep in word.lefts:
|
||||
if ldep.dep in extend_deps:
|
||||
lbracket = ldep.left_edge.i
|
||||
break
|
||||
|
||||
# Prevent nested chunks from being produced
|
||||
if lbracket <= prev_end:
|
||||
continue
|
||||
|
||||
rbracket = word.i
|
||||
# Try to extend the span to the right to capture
|
||||
# appositions and noun modifiers
|
||||
for rdep in word.rights:
|
||||
if rdep.dep in extend_deps:
|
||||
rbracket = rdep.i
|
||||
prev_end = rbracket
|
||||
|
||||
yield lbracket, rbracket + 1, np_label
|
||||
|
||||
|
||||
SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}
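A small end-to-end sketch of the new iterator. Assumptions: a spaCy build that includes this change wires `SYNTAX_ITERATORS` into `spacy.blank("fi")`, and the dependency parse below is hand-annotated for the example rather than produced by a trained parser:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("fi")
words = ["Terveyden", "ja", "hyvinvoinnin", "laitos", "sijaitsee", "Helsingissä"]
doc = Doc(
    nlp.vocab,
    words=words,
    pos=["NOUN", "CCONJ", "NOUN", "NOUN", "VERB", "PROPN"],
    heads=[3, 2, 0, 4, 4, 4],
    deps=["nmod:poss", "cc", "conj", "nsubj", "ROOT", "obl"],
)
print([chunk.text for chunk in doc.noun_chunks])
# Expected: ['Terveyden ja hyvinvoinnin laitos', 'Helsingissä']
```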
|
|
@ -6,16 +6,35 @@ from ...tokens import Doc, Span
|
|||
|
||||
|
||||
def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
|
||||
"""Detect base noun phrases from a dependency parse. Works on Doc and Span."""
|
||||
# fmt: off
|
||||
labels = ["nsubj", "nsubj:pass", "obj", "iobj", "ROOT", "appos", "nmod", "nmod:poss"]
|
||||
# fmt: on
|
||||
"""
|
||||
Detect base noun phrases from a dependency parse. Works on both Doc and Span.
|
||||
"""
|
||||
labels = [
|
||||
"nsubj",
|
||||
"nsubj:pass",
|
||||
"obj",
|
||||
"obl",
|
||||
"obl:agent",
|
||||
"obl:arg",
|
||||
"obl:mod",
|
||||
"nmod",
|
||||
"pcomp",
|
||||
"appos",
|
||||
"ROOT",
|
||||
]
|
||||
post_modifiers = ["flat", "flat:name", "flat:foreign", "fixed", "compound"]
|
||||
doc = doclike.doc # Ensure works on both Doc and Span.
|
||||
if not doc.has_annotation("DEP"):
|
||||
raise ValueError(Errors.E029)
|
||||
np_deps = [doc.vocab.strings[label] for label in labels]
|
||||
conj = doc.vocab.strings.add("conj")
|
||||
np_deps = {doc.vocab.strings.add(label) for label in labels}
|
||||
np_modifs = {doc.vocab.strings.add(modifier) for modifier in post_modifiers}
|
||||
np_label = doc.vocab.strings.add("NP")
|
||||
adj_label = doc.vocab.strings.add("amod")
|
||||
det_label = doc.vocab.strings.add("det")
|
||||
det_pos = doc.vocab.strings.add("DET")
|
||||
adp_pos = doc.vocab.strings.add("ADP")
|
||||
conj_label = doc.vocab.strings.add("conj")
|
||||
conj_pos = doc.vocab.strings.add("CCONJ")
|
||||
prev_end = -1
|
||||
for i, word in enumerate(doclike):
|
||||
if word.pos not in (NOUN, PROPN, PRON):
|
||||
|
@ -24,16 +43,43 @@ def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
|
|||
if word.left_edge.i <= prev_end:
|
||||
continue
|
||||
if word.dep in np_deps:
|
||||
prev_end = word.right_edge.i
|
||||
yield word.left_edge.i, word.right_edge.i + 1, np_label
|
||||
elif word.dep == conj:
|
||||
right_childs = list(word.rights)
|
||||
right_child = right_childs[0] if right_childs else None
|
||||
|
||||
if right_child:
|
||||
if (
|
||||
right_child.dep == adj_label
|
||||
): # allow chain of adjectives by expanding to right
|
||||
right_end = right_child.right_edge
|
||||
elif (
|
||||
right_child.dep == det_label and right_child.pos == det_pos
|
||||
): # cut relative pronouns here
|
||||
right_end = right_child
|
||||
elif right_child.dep in np_modifs: # Check if we can expand to right
|
||||
right_end = word.right_edge
|
||||
else:
|
||||
right_end = word
|
||||
else:
|
||||
right_end = word
|
||||
prev_end = right_end.i
|
||||
|
||||
left_index = word.left_edge.i
|
||||
left_index = left_index + 1 if word.left_edge.pos == adp_pos else left_index
|
||||
|
||||
yield left_index, right_end.i + 1, np_label
|
||||
elif word.dep == conj_label:
|
||||
head = word.head
|
||||
while head.dep == conj and head.head.i < head.i:
|
||||
while head.dep == conj_label and head.head.i < head.i:
|
||||
head = head.head
|
||||
# If the head is an NP, and we're coordinated to it, we're an NP
|
||||
if head.dep in np_deps:
|
||||
prev_end = word.right_edge.i
|
||||
yield word.left_edge.i, word.right_edge.i + 1, np_label
|
||||
prev_end = word.i
|
||||
|
||||
left_index = word.left_edge.i # eliminate left attached conjunction
|
||||
left_index = (
|
||||
left_index + 1 if word.left_edge.pos == conj_pos else left_index
|
||||
)
|
||||
yield left_index, word.i + 1, np_label
|
||||
|
||||
|
||||
SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}
|
||||
|
|
|
@ -90,7 +90,7 @@ _eleven_to_beyond = [
|
|||
"अड़सठ",
|
||||
"उनहत्तर",
|
||||
"सत्तर",
|
||||
"इकहत्तर"
|
||||
"इकहत्तर",
|
||||
"बहत्तर",
|
||||
"तिहत्तर",
|
||||
"चौहत्तर",
|
||||
|
|
|
@ -6,13 +6,15 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
|||
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
|
||||
from ...language import Language, BaseDefaults
|
||||
from .lemmatizer import ItalianLemmatizer
|
||||
from .syntax_iterators import SYNTAX_ITERATORS
|
||||
|
||||
|
||||
class ItalianDefaults(BaseDefaults):
|
||||
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
|
||||
stop_words = STOP_WORDS
|
||||
prefixes = TOKENIZER_PREFIXES
|
||||
infixes = TOKENIZER_INFIXES
|
||||
stop_words = STOP_WORDS
|
||||
syntax_iterators = SYNTAX_ITERATORS
|
||||
|
||||
|
||||
class Italian(Language):
|
||||
|
|
|
@ -10,18 +10,18 @@ avresti avrete avrà avrò avuta avute avuti avuto
|
|||
|
||||
basta bene benissimo brava bravo
|
||||
|
||||
casa caso cento certa certe certi certo che chi chicchessia chiunque ci
|
||||
casa caso cento certa certe certi certo che chi chicchessia chiunque ci c'
|
||||
ciascuna ciascuno cima cio cioe circa citta città co codesta codesti codesto
|
||||
cogli coi col colei coll coloro colui come cominci comunque con concernente
|
||||
conciliarsi conclusione consiglio contro cortesia cos cosa cosi così cui
|
||||
|
||||
da dagl dagli dai dal dall dalla dalle dallo dappertutto davanti degl degli
|
||||
dei del dell della delle dello dentro detto deve di dice dietro dire
|
||||
d' da dagl dagli dai dal dall dall' dalla dalle dallo dappertutto davanti degl degli
|
||||
dei del dell dell' della delle dello dentro detto deve di dice dietro dire
|
||||
dirimpetto diventa diventare diventato dopo dov dove dovra dovrà dovunque due
|
||||
dunque durante
|
||||
|
||||
ebbe ebbero ebbi ecc ecco ed effettivamente egli ella entrambi eppure era
|
||||
erano eravamo eravate eri ero esempio esse essendo esser essere essi ex
|
||||
e ebbe ebbero ebbi ecc ecco ed effettivamente egli ella entrambi eppure era
|
||||
erano eravamo eravate eri ero esempio esse essendo esser essere essi ex è
|
||||
|
||||
fa faccia facciamo facciano facciate faccio facemmo facendo facesse facessero
|
||||
facessi facessimo faceste facesti faceva facevamo facevano facevate facevi
|
||||
|
@ -30,21 +30,21 @@ fareste faresti farete farà farò fatto favore fece fecero feci fin finalmente
|
|||
finche fine fino forse forza fosse fossero fossi fossimo foste fosti fra
|
||||
frattempo fu fui fummo fuori furono futuro generale
|
||||
|
||||
gia già giacche giorni giorno gli gliela gliele glieli glielo gliene governo
|
||||
gia già giacche giorni giorno gli gl' gliela gliele glieli glielo gliene governo
|
||||
grande grazie gruppo
|
||||
|
||||
ha haha hai hanno ho
|
||||
|
||||
ieri il improvviso in inc infatti inoltre insieme intanto intorno invece io
|
||||
|
||||
la là lasciato lato lavoro le lei li lo lontano loro lui lungo luogo
|
||||
l' la là lasciato lato lavoro le lei li lo lontano loro lui lungo luogo
|
||||
|
||||
ma macche magari maggior mai male malgrado malissimo mancanza marche me
|
||||
m' ma macche magari maggior mai male malgrado malissimo mancanza marche me
|
||||
medesimo mediante meglio meno mentre mesi mezzo mi mia mie miei mila miliardi
|
||||
milioni minimi ministro mio modo molti moltissimo molto momento mondo mosto
|
||||
|
||||
nazionale ne negl negli nei nel nell nella nelle nello nemmeno neppure nessun
|
||||
nessuna nessuno niente no noi non nondimeno nonostante nonsia nostra nostre
|
||||
nazionale ne negl negli nei nel nell nella nelle nello nemmeno neppure nessun nessun'
|
||||
nessuna nessuno nient' niente no noi non nondimeno nonostante nonsia nostra nostre
|
||||
nostri nostro novanta nove nulla nuovo
|
||||
|
||||
od oggi ogni ognuna ognuno oltre oppure ora ore osi ossia ottanta otto
|
||||
|
@ -56,12 +56,12 @@ potrebbe preferibilmente presa press prima primo principalmente probabilmente
|
|||
proprio puo può pure purtroppo
|
||||
|
||||
qualche qualcosa qualcuna qualcuno quale quali qualunque quando quanta quante
|
||||
quanti quanto quantunque quasi quattro quel quella quelle quelli quello quest
|
||||
quanti quanto quantunque quasi quattro quel quel' quella quelle quelli quello quest quest'
|
||||
questa queste questi questo qui quindi
|
||||
|
||||
realmente recente recentemente registrazione relativo riecco salvo
|
||||
|
||||
sara sarà sarai saranno sarebbe sarebbero sarei saremmo saremo sareste
|
||||
s' sara sarà sarai saranno sarebbe sarebbero sarei saremmo saremo sareste
|
||||
saresti sarete saro sarò scola scopo scorso se secondo seguente seguito sei
|
||||
sembra sembrare sembrato sembri sempre senza sette si sia siamo siano siate
|
||||
siete sig solito solo soltanto sono sopra sotto spesso srl sta stai stando
|
||||
|
@ -72,12 +72,12 @@ steste stesti stette stettero stetti stia stiamo stiano stiate sto su sua
|
|||
subito successivamente successivo sue sugl sugli sui sul sull sulla sulle
|
||||
sullo suo suoi
|
||||
|
||||
tale tali talvolta tanto te tempo ti titolo tra tranne tre trenta
|
||||
t' tale tali talvolta tanto te tempo ti titolo tra tranne tre trenta
|
||||
troppo trovato tu tua tue tuo tuoi tutta tuttavia tutte tutti tutto
|
||||
|
||||
uguali ulteriore ultimo un una uno uomo
|
||||
uguali ulteriore ultimo un un' una uno uomo
|
||||
|
||||
va vale vari varia varie vario verso vi via vicino visto vita voi volta volte
|
||||
v' va vale vari varia varie vario verso vi via vicino visto vita voi volta volte
|
||||
vostra vostre vostri vostro
|
||||
""".split()
|
||||
)
|
||||
|
|
86
spacy/lang/it/syntax_iterators.py
Normal file
|
@ -0,0 +1,86 @@
|
|||
from typing import Union, Iterator, Tuple
|
||||
|
||||
from ...symbols import NOUN, PROPN, PRON
|
||||
from ...errors import Errors
|
||||
from ...tokens import Doc, Span
|
||||
|
||||
|
||||
def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
|
||||
"""
|
||||
Detect base noun phrases from a dependency parse. Works on both Doc and Span.
|
||||
"""
|
||||
labels = [
|
||||
"nsubj",
|
||||
"nsubj:pass",
|
||||
"obj",
|
||||
"obl",
|
||||
"obl:agent",
|
||||
"nmod",
|
||||
"pcomp",
|
||||
"appos",
|
||||
"ROOT",
|
||||
]
|
||||
post_modifiers = ["flat", "flat:name", "fixed", "compound"]
|
||||
dets = ["det", "det:poss"]
|
||||
doc = doclike.doc # Ensure it works on both Doc and Span.
|
||||
if not doc.has_annotation("DEP"):
|
||||
raise ValueError(Errors.E029)
|
||||
np_deps = {doc.vocab.strings.add(label) for label in labels}
|
||||
np_modifs = {doc.vocab.strings.add(modifier) for modifier in post_modifiers}
|
||||
np_label = doc.vocab.strings.add("NP")
|
||||
adj_label = doc.vocab.strings.add("amod")
|
||||
det_labels = {doc.vocab.strings.add(det) for det in dets}
|
||||
det_pos = doc.vocab.strings.add("DET")
|
||||
adp_label = doc.vocab.strings.add("ADP")
|
||||
conj = doc.vocab.strings.add("conj")
|
||||
conj_pos = doc.vocab.strings.add("CCONJ")
|
||||
prev_end = -1
|
||||
for i, word in enumerate(doclike):
|
||||
if word.pos not in (NOUN, PROPN, PRON):
|
||||
continue
|
||||
# Prevent nested chunks from being produced
|
||||
if word.left_edge.i <= prev_end:
|
||||
continue
|
||||
if word.dep in np_deps:
|
||||
right_childs = list(word.rights)
|
||||
right_child = right_childs[0] if right_childs else None
|
||||
|
||||
if right_child:
|
||||
if (
|
||||
right_child.dep == adj_label
|
||||
): # allow chain of adjectives by expanding to right
|
||||
right_end = right_child.right_edge
|
||||
elif (
|
||||
right_child.dep in det_labels and right_child.pos == det_pos
|
||||
): # cut relative pronouns here
|
||||
right_end = right_child
|
||||
elif right_child.dep in np_modifs: # Check if we can expand to right
|
||||
right_end = word.right_edge
|
||||
else:
|
||||
right_end = word
|
||||
else:
|
||||
right_end = word
|
||||
prev_end = right_end.i
|
||||
|
||||
left_index = word.left_edge.i
|
||||
left_index = (
|
||||
left_index + 1 if word.left_edge.pos == adp_label else left_index
|
||||
)
|
||||
|
||||
yield left_index, right_end.i + 1, np_label
|
||||
elif word.dep == conj:
|
||||
head = word.head
|
||||
while head.dep == conj and head.head.i < head.i:
|
||||
head = head.head
|
||||
# If the head is an NP, and we're coordinated to it, we're an NP
|
||||
if head.dep in np_deps:
|
||||
prev_end = word.i
|
||||
|
||||
left_index = word.left_edge.i # eliminate left attached conjunction
|
||||
left_index = (
|
||||
left_index + 1 if word.left_edge.pos == conj_pos else left_index
|
||||
)
|
||||
yield left_index, word.i + 1, np_label
|
||||
|
||||
|
||||
SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}
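A minimal usage sketch for the iterator registered above. It assumes a trained Italian pipeline with a dependency parser is installed; it_core_news_sm is only an example name and is not part of this change:

import spacy

nlp = spacy.load("it_core_news_sm")  # any Italian pipeline with a parser should do
doc = nlp("La nuova sede della società si trova a Milano.")
for chunk in doc.noun_chunks:
    # Doc.noun_chunks delegates to the "noun_chunks" entry registered in SYNTAX_ITERATORS.
    print(chunk.text, "| root:", chunk.root.text, "| dep:", chunk.root.dep_)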
|
|
@ -1,5 +1,6 @@
|
|||
from typing import Iterator, Any, Dict
|
||||
|
||||
from .punctuation import TOKENIZER_INFIXES
|
||||
from .stop_words import STOP_WORDS
|
||||
from .tag_map import TAG_MAP
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
|
@ -31,15 +32,24 @@ def create_tokenizer():
|
|||
class KoreanTokenizer(DummyTokenizer):
|
||||
def __init__(self, vocab: Vocab):
|
||||
self.vocab = vocab
|
||||
MeCab = try_mecab_import() # type: ignore[func-returns-value]
|
||||
self.mecab_tokenizer = MeCab("-F%f[0],%f[7]")
|
||||
self._mecab = try_mecab_import() # type: ignore[func-returns-value]
|
||||
self._mecab_tokenizer = None
|
||||
|
||||
@property
|
||||
def mecab_tokenizer(self):
|
||||
# This is a property so that initializing a pipeline with blank:ko is
|
||||
# possible without actually requiring mecab-ko, e.g. to run
|
||||
# `spacy init vectors ko` for a pipeline that will have a different
|
||||
# tokenizer in the end. The languages need to match for the vectors
|
||||
# to be imported and there's no way to pass a custom config to
|
||||
# `init vectors`.
|
||||
if self._mecab_tokenizer is None:
|
||||
self._mecab_tokenizer = self._mecab("-F%f[0],%f[7]")
|
||||
return self._mecab_tokenizer
|
||||
|
||||
def __reduce__(self):
|
||||
return KoreanTokenizer, (self.vocab,)
|
||||
|
||||
def __del__(self):
|
||||
self.mecab_tokenizer.__del__()
|
||||
|
||||
def __call__(self, text: str) -> Doc:
|
||||
dtokens = list(self.detailed_tokens(text))
|
||||
surfaces = [dt["surface"] for dt in dtokens]
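The lazy mecab_tokenizer property above is what makes a blank Korean pipeline usable without a working mecab-ko install. A short sketch of the behaviour it enables; the sample sentence is illustrative, and the last two lines still require mecab-ko, mecab-ko-dic and natto-py:

import spacy

# natto-py is still imported here, but the MeCab instance itself (which needs the
# mecab-ko binary and mecab-ko-dic) is only constructed on first use, so e.g.
# `spacy init vectors ko ...` can run without a working mecab-ko install.
nlp = spacy.blank("ko")

# Tokenizing text triggers the lazy initialization and therefore needs mecab-ko.
doc = nlp("영등포구에 있는 맛집 좀 알려주세요.")
print([t.text for t in doc])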
|
||||
|
@ -76,6 +86,7 @@ class KoreanDefaults(BaseDefaults):
|
|||
lex_attr_getters = LEX_ATTRS
|
||||
stop_words = STOP_WORDS
|
||||
writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
|
||||
infixes = TOKENIZER_INFIXES
|
||||
|
||||
|
||||
class Korean(Language):
|
||||
|
@ -90,7 +101,8 @@ def try_mecab_import() -> None:
|
|||
return MeCab
|
||||
except ImportError:
|
||||
raise ImportError(
|
||||
"Korean support requires [mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md), "
|
||||
'The Korean tokenizer ("spacy.ko.KoreanTokenizer") requires '
|
||||
"[mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md), "
|
||||
"[mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic), "
|
||||
"and [natto-py](https://github.com/buruzaemon/natto-py)"
|
||||
) from None
|
||||
|
|
12
spacy/lang/ko/punctuation.py
Normal file
|
@ -0,0 +1,12 @@
|
|||
from ..char_classes import LIST_QUOTES
|
||||
from ..punctuation import TOKENIZER_INFIXES as BASE_TOKENIZER_INFIXES
|
||||
|
||||
|
||||
_infixes = (
|
||||
["·", "ㆍ", "\(", "\)"]
|
||||
+ [r"(?<=[0-9])~(?=[0-9-])"]
|
||||
+ LIST_QUOTES
|
||||
+ BASE_TOKENIZER_INFIXES
|
||||
)
|
||||
|
||||
TOKENIZER_INFIXES = _infixes
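A small check of the new infix rules that does not need mecab-ko; compile_infix_regex is spaCy's standard helper, and the example strings are made up:

from spacy.util import compile_infix_regex
from spacy.lang.ko.punctuation import TOKENIZER_INFIXES

infix_re = compile_infix_regex(TOKENIZER_INFIXES)
# The middle dot and the digit-range tilde should now be treated as infixes.
print(infix_re.search("서울·부산") is not None)  # True
print(infix_re.search("1~10") is not None)       # True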
|
|
@ -4,46 +4,42 @@ alle allerede alt and andre annen annet at av
|
|||
|
||||
bak bare bedre beste blant ble bli blir blitt bris by både
|
||||
|
||||
da dag de del dem den denne der dermed det dette disse drept du
|
||||
da dag de del dem den denne der dermed det dette disse du
|
||||
|
||||
eller en enn er et ett etter
|
||||
|
||||
fem fikk fire fjor flere folk for fortsatt fotball fra fram frankrike fredag
|
||||
fem fikk fire fjor flere folk for fortsatt fra fram
|
||||
funnet få får fått før først første
|
||||
|
||||
gang gi gikk gjennom gjorde gjort gjør gjøre god godt grunn gå går
|
||||
|
||||
ha hadde ham han hans har hele helt henne hennes her hun hva hvor hvordan
|
||||
hvorfor
|
||||
ha hadde ham han hans har hele helt henne hennes her hun
|
||||
|
||||
i ifølge igjen ikke ingen inn
|
||||
|
||||
ja jeg
|
||||
|
||||
kamp kampen kan kl klart kom komme kommer kontakt kort kroner kunne kveld
|
||||
kvinner
|
||||
|
||||
la laget land landet langt leder ligger like litt løpet lørdag
|
||||
la laget land landet langt leder ligger like litt løpet
|
||||
|
||||
man mandag mange mannen mars med meg mellom men mener menn mennesker mens mer
|
||||
millioner minutter mot msci mye må mål måtte
|
||||
man mange med meg mellom men mener mennesker mens mer mot mye må mål måtte
|
||||
|
||||
ned neste noe noen nok norge norsk norske ntb ny nye nå når
|
||||
ned neste noe noen nok ny nye nå når
|
||||
|
||||
og også om onsdag opp opplyser oslo oss over
|
||||
og også om opp opplyser oss over
|
||||
|
||||
personer plass poeng politidistrikt politiet president prosent på
|
||||
personer plass poeng på
|
||||
|
||||
regjeringen runde rundt russland
|
||||
runde rundt
|
||||
|
||||
sa saken samme sammen samtidig satt se seg seks selv senere september ser sett
|
||||
sa saken samme sammen samtidig satt se seg seks selv senere ser sett
|
||||
siden sier sin sine siste sitt skal skriver skulle slik som sted stedet stor
|
||||
store står sverige svært så søndag
|
||||
store står svært så
|
||||
|
||||
ta tatt tid tidligere til tilbake tillegg tirsdag to tok torsdag tre tror
|
||||
tyskland
|
||||
ta tatt tid tidligere til tilbake tillegg tok tror
|
||||
|
||||
under usa ut uten utenfor
|
||||
under ut uten utenfor
|
||||
|
||||
vant var ved veldig vi videre viktig vil ville viser vår være vært
|
||||
|
||||
|
|
|
@ -1,56 +1,119 @@
|
|||
from ...attrs import LIKE_NUM
|
||||
|
||||
|
||||
_num_words = [
|
||||
"ноль",
|
||||
"один",
|
||||
"два",
|
||||
"три",
|
||||
"четыре",
|
||||
"пять",
|
||||
"шесть",
|
||||
"семь",
|
||||
"восемь",
|
||||
"девять",
|
||||
"десять",
|
||||
"одиннадцать",
|
||||
"двенадцать",
|
||||
"тринадцать",
|
||||
"четырнадцать",
|
||||
"пятнадцать",
|
||||
"шестнадцать",
|
||||
"семнадцать",
|
||||
"восемнадцать",
|
||||
"девятнадцать",
|
||||
"двадцать",
|
||||
"тридцать",
|
||||
"сорок",
|
||||
"пятьдесят",
|
||||
"шестьдесят",
|
||||
"семьдесят",
|
||||
"восемьдесят",
|
||||
"девяносто",
|
||||
"сто",
|
||||
"двести",
|
||||
"триста",
|
||||
"четыреста",
|
||||
"пятьсот",
|
||||
"шестьсот",
|
||||
"семьсот",
|
||||
"восемьсот",
|
||||
"девятьсот",
|
||||
"тысяча",
|
||||
"миллион",
|
||||
"миллиард",
|
||||
"триллион",
|
||||
"квадриллион",
|
||||
"квинтиллион",
|
||||
]
|
||||
_num_words = list(
|
||||
set(
|
||||
"""
|
||||
ноль ноля нолю нолём ноле нулевой нулевого нулевому нулевым нулевом нулевая нулевую нулевое нулевые нулевых нулевыми
|
||||
|
||||
один первого первому единица одного одному первой первом первый первым одним одном во-первых
|
||||
|
||||
два второго второму второй втором вторым двойка двумя двум двух во-вторых двое две двоих оба обе обеим обеими
|
||||
обеих обоим обоими обоих
|
||||
|
||||
полтора полторы полутора
|
||||
|
||||
три третьего третьему третьем третьим третий тройка трешка трёшка трояк трёха треха тремя трем трех трое троих трёх
|
||||
|
||||
четыре четвертого четвертому четвертом четвертый четвертым четверка четырьмя четырем четырех четверо четырёх четверым
|
||||
четверых
|
||||
|
||||
пять пятерочка пятерка пятого пятому пятом пятый пятым пятью пяти пятеро пятерых пятерыми
|
||||
|
||||
шесть шестерка шестого шестому шестой шестом шестым шестью шести шестеро шестерых
|
||||
|
||||
семь семерка седьмого седьмому седьмой седьмом седьмым семью семи семеро
|
||||
|
||||
восемь восьмерка восьмого восьмому восемью восьмой восьмом восьмым восеми восьмером восьми восьмью
|
||||
|
||||
девять девятого девятому девятка девятом девятый девятым девятью девяти девятером вдевятером девятерых
|
||||
|
||||
десять десятого десятому десятка десятом десятый десятым десятью десяти десятером вдесятером
|
||||
|
||||
одиннадцать одиннадцатого одиннадцатому одиннадцатом одиннадцатый одиннадцатым одиннадцатью одиннадцати
|
||||
|
||||
двенадцать двенадцатого двенадцатому двенадцатом двенадцатый двенадцатым двенадцатью двенадцати
|
||||
|
||||
тринадцать тринадцатого тринадцатому тринадцатом тринадцатый тринадцатым тринадцатью тринадцати
|
||||
|
||||
четырнадцать четырнадцатого четырнадцатому четырнадцатом четырнадцатый четырнадцатым четырнадцатью четырнадцати
|
||||
|
||||
пятнадцать пятнадцатого пятнадцатому пятнадцатом пятнадцатый пятнадцатым пятнадцатью пятнадцати
|
||||
|
||||
шестнадцать шестнадцатого шестнадцатому шестнадцатом шестнадцатый шестнадцатым шестнадцатью шестнадцати
|
||||
|
||||
семнадцать семнадцатого семнадцатому семнадцатом семнадцатый семнадцатым семнадцатью семнадцати
|
||||
|
||||
восемнадцать восемнадцатого восемнадцатому восемнадцатом восемнадцатый восемнадцатым восемнадцатью восемнадцати
|
||||
|
||||
девятнадцать девятнадцатого девятнадцатому девятнадцатом девятнадцатый девятнадцатым девятнадцатью девятнадцати
|
||||
|
||||
двадцать двадцатого двадцатому двадцатом двадцатый двадцатым двадцатью двадцати
|
||||
|
||||
тридцать тридцатого тридцатому тридцатом тридцатый тридцатым тридцатью тридцати
|
||||
|
||||
тридевять
|
||||
|
||||
сорок сорокового сороковому сороковом сороковым сороковой
|
||||
|
||||
пятьдесят пятьдесятого пятьдесятому пятьюдесятью пятьдесятом пятьдесятый пятьдесятым пятидесяти полтинник
|
||||
|
||||
шестьдесят шестьдесятого шестьдесятому шестьюдесятью шестьдесятом шестьдесятый шестьдесятым шестидесятые шестидесяти
|
||||
|
||||
семьдесят семьдесятого семьдесятому семьюдесятью семьдесятом семьдесятый семьдесятым семидесяти
|
||||
|
||||
восемьдесят восемьдесятого восемьдесятому восемьюдесятью восемьдесятом восемьдесятый восемьдесятым восемидесяти
|
||||
восьмидесяти
|
||||
|
||||
девяносто девяностого девяностому девяностом девяностый девяностым девяноста
|
||||
|
||||
сто сотого сотому сотка сотня сотом сотен сотый сотым ста
|
||||
|
||||
двести двумястами двухсотого двухсотому двухсотом двухсотый двухсотым двумстам двухстах двухсот
|
||||
|
||||
триста тремястами трехсотого трехсотому трехсотом трехсотый трехсотым тремстам трехстах трехсот
|
||||
|
||||
четыреста четырехсотого четырехсотому четырьмястами четырехсотом четырехсотый четырехсотым четыремстам четырехстах
|
||||
четырехсот
|
||||
|
||||
пятьсот пятисотого пятисотому пятьюстами пятисотом пятисотый пятисотым пятистам пятистах пятисот
|
||||
|
||||
шестьсот шестисотого шестисотому шестьюстами шестисотом шестисотый шестисотым шестистам шестистах шестисот
|
||||
|
||||
семьсот семисотого семисотому семьюстами семисотом семисотый семисотым семистам семистах семисот
|
||||
|
||||
восемьсот восемисотого восемисотому восемисотом восемисотый восемисотым восьмистами восьмистам восьмистах восьмисот
|
||||
|
||||
девятьсот девятисотого девятисотому девятьюстами девятисотом девятисотый девятисотым девятистам девятистах девятисот
|
||||
|
||||
тысяча тысячного тысячному тысячном тысячный тысячным тысячам тысячах тысячей тысяч тысячи тыс
|
||||
|
||||
миллион миллионного миллионов миллионному миллионном миллионный миллионным миллионом миллиона миллионе миллиону
|
||||
миллионов лям млн
|
||||
|
||||
миллиард миллиардного миллиардному миллиардном миллиардный миллиардным миллиардом миллиарда миллиарде миллиарду
|
||||
миллиардов лярд млрд
|
||||
|
||||
триллион триллионного триллионному триллионном триллионный триллионным триллионом триллиона триллионе триллиону
|
||||
триллионов трлн
|
||||
|
||||
квадриллион квадриллионного квадриллионному квадриллионный квадриллионным квадриллионом квадриллиона квадриллионе
|
||||
квадриллиону квадриллионов квадрлн
|
||||
|
||||
квинтиллион квинтиллионного квинтиллионному квинтиллионный квинтиллионным квинтиллионом квинтиллиона квинтиллионе
|
||||
квинтиллиону квинтиллионов квинтлн
|
||||
|
||||
i ii iii iv vi vii viii ix xi xii xiii xiv xv xvi xvii xviii xix xx xxi xxii xxiii xxiv xxv xxvi xxvii xxvii xxix
|
||||
""".split()
|
||||
)
|
||||
)
|
||||
|
||||
|
||||
def like_num(text):
|
||||
if text.startswith(("+", "-", "±", "~")):
|
||||
text = text[1:]
|
||||
if text.endswith("%"):
|
||||
text = text[:-1]
|
||||
text = text.replace(",", "").replace(".", "")
|
||||
if text.isdigit():
|
||||
return True
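A sketch of the effect on the LIKE_NUM attribute, assuming the rest of like_num (cut off above) falls back to a membership check against _num_words as in spaCy's other languages:

import spacy

nlp = spacy.blank("ru")
doc = nlp("Он заплатил пятьсот рублей, то есть 2,5 процента от суммы.")
# "пятьсот" (and its inflected forms) and "2,5" are expected to be flagged as number-like.
print([(t.text, t.like_num) for t in doc])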
|
||||
|
|
|
@ -1,52 +1,111 @@
|
|||
STOP_WORDS = set(
|
||||
"""
|
||||
а
|
||||
а авось ага агу аж ай али алло ау ах ая
|
||||
|
||||
будем будет будете будешь буду будут будучи будь будьте бы был была были было
|
||||
быть
|
||||
б будем будет будете будешь буду будут будучи будь будьте бы был была были было
|
||||
быть бац без безусловно бишь благо благодаря ближайшие близко более больше
|
||||
будто бывает бывала бывали бываю бывают бытует
|
||||
|
||||
в вам вами вас весь во вот все всё всего всей всем всём всеми всему всех всею
|
||||
всея всю вся вы
|
||||
всея всю вся вы ваш ваша ваше ваши вдали вдобавок вдруг ведь везде вернее
|
||||
взаимно взаправду видно вишь включая вместо внакладе вначале вне вниз внизу
|
||||
вновь вовсе возможно воистину вокруг вон вообще вопреки вперекор вплоть
|
||||
вполне вправду вправе впрочем впрямь вресноту вроде вряд всегда всюду
|
||||
всякий всякого всякой всячески вчеред
|
||||
|
||||
да для до
|
||||
г го где гораздо гав
|
||||
|
||||
его едим едят ее её ей ел ела ем ему емъ если ест есть ешь еще ещё ею
|
||||
д да для до дабы давайте давно давным даже далее далеко дальше данная
|
||||
данного данное данной данном данному данные данный данных дану данунах
|
||||
даром де действительно довольно доколе доколь долго должен должна
|
||||
должно должны должный дополнительно другая другие другим другими
|
||||
других другое другой
|
||||
|
||||
же
|
||||
е его едим едят ее её ей ел ела ем ему емъ если ест есть ешь еще ещё ею едва
|
||||
ежели еле
|
||||
|
||||
за
|
||||
ж же
|
||||
|
||||
и из или им ими имъ их
|
||||
з за затем зато зачем здесь значит зря
|
||||
|
||||
и из или им ими имъ их ибо иль имеет имел имела имело именно иметь иначе
|
||||
иногда иным иными итак ишь
|
||||
|
||||
й
|
||||
|
||||
к как кем ко когда кого ком кому комья которая которого которое которой котором
|
||||
которому которою которую которые который которым которыми которых кто
|
||||
которому которою которую которые который которым которыми которых кто ка кабы
|
||||
каждая каждое каждые каждый кажется казалась казались казалось казался казаться
|
||||
какая какие каким какими каков какого какой какому какою касательно кой коли
|
||||
коль конечно короче кроме кстати ку куда
|
||||
|
||||
меня мне мной мною мог моги могите могла могли могло могу могут мое моё моего
|
||||
л ли либо лишь любая любого любое любой любом любую любыми любых
|
||||
|
||||
м меня мне мной мною мог моги могите могла могли могло могу могут мое моё моего
|
||||
моей моем моём моему моею можем может можете можешь мои мой моим моими моих
|
||||
мочь мою моя мы
|
||||
мочь мою моя мы мало меж между менее меньше мимо многие много многого многое
|
||||
многом многому можно мол му
|
||||
|
||||
на нам нами нас наса наш наша наше нашего нашей нашем нашему нашею наши нашим
|
||||
н на нам нами нас наса наш наша наше нашего нашей нашем нашему нашею наши нашим
|
||||
нашими наших нашу не него нее неё ней нем нём нему нет нею ним ними них но
|
||||
наверняка наверху навряд навыворот над надо назад наиболее наизворот
|
||||
наизнанку наипаче накануне наконец наоборот наперед наперекор наподобие
|
||||
например напротив напрямую насилу настоящая настоящее настоящие настоящий
|
||||
насчет нате находиться начала начале неважно негде недавно недалеко незачем
|
||||
некем некогда некому некоторая некоторые некоторый некоторых некто некуда
|
||||
нельзя немногие немногим немного необходимо необходимости необходимые
|
||||
необходимым неоткуда непрерывно нередко несколько нету неужели нечего
|
||||
нечем нечему нечто нешто нибудь нигде ниже низко никак никакой никем
|
||||
никогда никого никому никто никуда ниоткуда нипочем ничего ничем ничему
|
||||
ничто ну нужная нужно нужного нужные нужный нужных ныне нынешнее нынешней
|
||||
нынешних нынче
|
||||
|
||||
о об один одна одни одним одними одних одно одного одной одном одному одною
|
||||
одну он она оне они оно от
|
||||
одну он она оне они оно от оба общую обычно ого однажды однако ой около оный
|
||||
оп опять особенно особо особую особые откуда отнелижа отнелиже отовсюду
|
||||
отсюда оттого оттот оттуда отчего отчему ох очевидно очень ом
|
||||
|
||||
по при
|
||||
п по при паче перед под подавно поди подобная подобно подобного подобные
|
||||
подобный подобным подобных поелику пожалуй пожалуйста позже поистине
|
||||
пока покамест поколе поколь покуда покудова помимо понеже поприще пор
|
||||
пора посему поскольку после посреди посредством потом потому потомушта
|
||||
похожем почему почти поэтому прежде притом причем про просто прочего
|
||||
прочее прочему прочими проще прям пусть
|
||||
|
||||
р ради разве ранее рано раньше рядом
|
||||
|
||||
с сам сама сами самим самими самих само самого самом самому саму свое своё
|
||||
своего своей своем своём своему своею свои свой своим своими своих свою своя
|
||||
себе себя собой собою
|
||||
себе себя собой собою самая самое самой самый самых сверх свыше се сего сей
|
||||
сейчас сие сих сквозь сколько скорее скоро следует слишком смогут сможет
|
||||
сначала снова со собственно совсем сперва спокону спустя сразу среди сродни
|
||||
стал стала стали стало стать суть сызнова
|
||||
|
||||
та так такая такие таким такими таких такого такое такой таком такому такою
|
||||
такую те тебе тебя тем теми тех то тобой тобою того той только том томах тому
|
||||
тот тою ту ты
|
||||
та то ту ты ти так такая такие таким такими таких такого такое такой таком такому такою
|
||||
такую те тебе тебя тем теми тех тобой тобою того той только том томах тому
|
||||
тот тою также таки таков такова там твои твоим твоих твой твоя твоё
|
||||
теперь тогда тоже тотчас точно туда тут тьфу тая
|
||||
|
||||
у уже
|
||||
у уже увы уж ура ух ую
|
||||
|
||||
чего чем чём чему что чтобы
|
||||
ф фу
|
||||
|
||||
эта эти этим этими этих это этого этой этом этому этот этою эту
|
||||
х ха хе хорошо хотел хотела хотелось хотеть хоть хотя хочешь хочу хуже
|
||||
|
||||
я
|
||||
ч чего чем чём чему что чтобы часто чаще чей через чтоб чуть чхать чьим
|
||||
чьих чьё чё
|
||||
|
||||
ш ша
|
||||
|
||||
щ ща щас
|
||||
|
||||
ы ых ые ый
|
||||
|
||||
э эта эти этим этими этих это этого этой этом этому этот этою эту эдак эдакий
|
||||
эй эка экий этак этакий эх
|
||||
|
||||
ю
|
||||
|
||||
я явно явных яко якобы якоже
|
||||
""".split()
|
||||
)
|
||||
|
|
|
@ -2,7 +2,6 @@ from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
|||
from ...symbols import ORTH, NORM
|
||||
from ...util import update_exc
|
||||
|
||||
|
||||
_exc = {}
|
||||
|
||||
_abbrev_exc = [
|
||||
|
@ -42,7 +41,6 @@ _abbrev_exc = [
|
|||
{ORTH: "дек", NORM: "декабрь"},
|
||||
]
|
||||
|
||||
|
||||
for abbrev_desc in _abbrev_exc:
|
||||
abbrev = abbrev_desc[ORTH]
|
||||
for orth in (abbrev, abbrev.capitalize(), abbrev.upper()):
|
||||
|
@ -50,17 +48,354 @@ for abbrev_desc in _abbrev_exc:
|
|||
_exc[orth + "."] = [{ORTH: orth + ".", NORM: abbrev_desc[NORM]}]
|
||||
|
||||
|
||||
_slang_exc = [
|
||||
for abbr in [
|
||||
# Year slang abbreviations
|
||||
{ORTH: "2к15", NORM: "2015"},
|
||||
{ORTH: "2к16", NORM: "2016"},
|
||||
{ORTH: "2к17", NORM: "2017"},
|
||||
{ORTH: "2к18", NORM: "2018"},
|
||||
{ORTH: "2к19", NORM: "2019"},
|
||||
{ORTH: "2к20", NORM: "2020"},
|
||||
]
|
||||
{ORTH: "2к21", NORM: "2021"},
|
||||
{ORTH: "2к22", NORM: "2022"},
|
||||
{ORTH: "2к23", NORM: "2023"},
|
||||
{ORTH: "2к24", NORM: "2024"},
|
||||
{ORTH: "2к25", NORM: "2025"},
|
||||
]:
|
||||
_exc[abbr[ORTH]] = [abbr]
|
||||
|
||||
for slang_desc in _slang_exc:
|
||||
_exc[slang_desc[ORTH]] = [slang_desc]
|
||||
for abbr in [
|
||||
# Profession and academic titles abbreviations
|
||||
{ORTH: "ак.", NORM: "академик"},
|
||||
{ORTH: "акад.", NORM: "академик"},
|
||||
{ORTH: "д-р архитектуры", NORM: "доктор архитектуры"},
|
||||
{ORTH: "д-р биол. наук", NORM: "доктор биологических наук"},
|
||||
{ORTH: "д-р ветеринар. наук", NORM: "доктор ветеринарных наук"},
|
||||
{ORTH: "д-р воен. наук", NORM: "доктор военных наук"},
|
||||
{ORTH: "д-р геогр. наук", NORM: "доктор географических наук"},
|
||||
{ORTH: "д-р геол.-минерал. наук", NORM: "доктор геолого-минералогических наук"},
|
||||
{ORTH: "д-р искусствоведения", NORM: "доктор искусствоведения"},
|
||||
{ORTH: "д-р ист. наук", NORM: "доктор исторических наук"},
|
||||
{ORTH: "д-р культурологии", NORM: "доктор культурологии"},
|
||||
{ORTH: "д-р мед. наук", NORM: "доктор медицинских наук"},
|
||||
{ORTH: "д-р пед. наук", NORM: "доктор педагогических наук"},
|
||||
{ORTH: "д-р полит. наук", NORM: "доктор политических наук"},
|
||||
{ORTH: "д-р психол. наук", NORM: "доктор психологических наук"},
|
||||
{ORTH: "д-р с.-х. наук", NORM: "доктор сельскохозяйственных наук"},
|
||||
{ORTH: "д-р социол. наук", NORM: "доктор социологических наук"},
|
||||
{ORTH: "д-р техн. наук", NORM: "доктор технических наук"},
|
||||
{ORTH: "д-р фармацевт. наук", NORM: "доктор фармацевтических наук"},
|
||||
{ORTH: "д-р физ.-мат. наук", NORM: "доктор физико-математических наук"},
|
||||
{ORTH: "д-р филол. наук", NORM: "доктор филологических наук"},
|
||||
{ORTH: "д-р филос. наук", NORM: "доктор философских наук"},
|
||||
{ORTH: "д-р хим. наук", NORM: "доктор химических наук"},
|
||||
{ORTH: "д-р экон. наук", NORM: "доктор экономических наук"},
|
||||
{ORTH: "д-р юрид. наук", NORM: "доктор юридических наук"},
|
||||
{ORTH: "д-р", NORM: "доктор"},
|
||||
{ORTH: "д.б.н.", NORM: "доктор биологических наук"},
|
||||
{ORTH: "д.г.-м.н.", NORM: "доктор геолого-минералогических наук"},
|
||||
{ORTH: "д.г.н.", NORM: "доктор географических наук"},
|
||||
{ORTH: "д.и.н.", NORM: "доктор исторических наук"},
|
||||
{ORTH: "д.иск.", NORM: "доктор искусствоведения"},
|
||||
{ORTH: "д.м.н.", NORM: "доктор медицинских наук"},
|
||||
{ORTH: "д.п.н.", NORM: "доктор психологических наук"},
|
||||
{ORTH: "д.пед.н.", NORM: "доктор педагогических наук"},
|
||||
{ORTH: "д.полит.н.", NORM: "доктор политических наук"},
|
||||
{ORTH: "д.с.-х.н.", NORM: "доктор сельскохозяйственных наук"},
|
||||
{ORTH: "д.социол.н.", NORM: "доктор социологических наук"},
|
||||
{ORTH: "д.т.н.", NORM: "доктор технических наук"},
|
||||
{ORTH: "д.т.н", NORM: "доктор технических наук"},
|
||||
{ORTH: "д.ф.-м.н.", NORM: "доктор физико-математических наук"},
|
||||
{ORTH: "д.ф.н.", NORM: "доктор филологических наук"},
|
||||
{ORTH: "д.филос.н.", NORM: "доктор философских наук"},
|
||||
{ORTH: "д.фил.н.", NORM: "доктор филологических наук"},
|
||||
{ORTH: "д.х.н.", NORM: "доктор химических наук"},
|
||||
{ORTH: "д.э.н.", NORM: "доктор экономических наук"},
|
||||
{ORTH: "д.э.н", NORM: "доктор экономических наук"},
|
||||
{ORTH: "д.ю.н.", NORM: "доктор юридических наук"},
|
||||
{ORTH: "доц.", NORM: "доцент"},
|
||||
{ORTH: "и.о.", NORM: "исполняющий обязанности"},
|
||||
{ORTH: "к.б.н.", NORM: "кандидат биологических наук"},
|
||||
{ORTH: "к.воен.н.", NORM: "кандидат военных наук"},
|
||||
{ORTH: "к.г.-м.н.", NORM: "кандидат геолого-минералогических наук"},
|
||||
{ORTH: "к.г.н.", NORM: "кандидат географических наук"},
|
||||
{ORTH: "к.геогр.н", NORM: "кандидат географических наук"},
|
||||
{ORTH: "к.геогр.наук", NORM: "кандидат географических наук"},
|
||||
{ORTH: "к.и.н.", NORM: "кандидат исторических наук"},
|
||||
{ORTH: "к.иск.", NORM: "кандидат искусствоведения"},
|
||||
{ORTH: "к.м.н.", NORM: "кандидат медицинских наук"},
|
||||
{ORTH: "к.п.н.", NORM: "кандидат психологических наук"},
|
||||
{ORTH: "к.псх.н.", NORM: "кандидат психологических наук"},
|
||||
{ORTH: "к.пед.н.", NORM: "кандидат педагогических наук"},
|
||||
{ORTH: "канд.пед.наук", NORM: "кандидат педагогических наук"},
|
||||
{ORTH: "к.полит.н.", NORM: "кандидат политических наук"},
|
||||
{ORTH: "к.с.-х.н.", NORM: "кандидат сельскохозяйственных наук"},
|
||||
{ORTH: "к.социол.н.", NORM: "кандидат социологических наук"},
|
||||
{ORTH: "к.с.н.", NORM: "кандидат социологических наук"},
|
||||
{ORTH: "к.т.н.", NORM: "кандидат технических наук"},
|
||||
{ORTH: "к.ф.-м.н.", NORM: "кандидат физико-математических наук"},
|
||||
{ORTH: "к.ф.н.", NORM: "кандидат филологических наук"},
|
||||
{ORTH: "к.фил.н.", NORM: "кандидат филологических наук"},
|
||||
{ORTH: "к.филол.н", NORM: "кандидат филологических наук"},
|
||||
{ORTH: "к.фарм.наук", NORM: "кандидат фармакологических наук"},
|
||||
{ORTH: "к.фарм.н.", NORM: "кандидат фармакологических наук"},
|
||||
{ORTH: "к.фарм.н", NORM: "кандидат фармакологических наук"},
|
||||
{ORTH: "к.филос.наук", NORM: "кандидат философских наук"},
|
||||
{ORTH: "к.филос.н.", NORM: "кандидат философских наук"},
|
||||
{ORTH: "к.филос.н", NORM: "кандидат философских наук"},
|
||||
{ORTH: "к.х.н.", NORM: "кандидат химических наук"},
|
||||
{ORTH: "к.х.н", NORM: "кандидат химических наук"},
|
||||
{ORTH: "к.э.н.", NORM: "кандидат экономических наук"},
|
||||
{ORTH: "к.э.н", NORM: "кандидат экономических наук"},
|
||||
{ORTH: "к.ю.н.", NORM: "кандидат юридических наук"},
|
||||
{ORTH: "к.ю.н", NORM: "кандидат юридических наук"},
|
||||
{ORTH: "канд. архитектуры", NORM: "кандидат архитектуры"},
|
||||
{ORTH: "канд. биол. наук", NORM: "кандидат биологических наук"},
|
||||
{ORTH: "канд. ветеринар. наук", NORM: "кандидат ветеринарных наук"},
|
||||
{ORTH: "канд. воен. наук", NORM: "кандидат военных наук"},
|
||||
{ORTH: "канд. геогр. наук", NORM: "кандидат географических наук"},
|
||||
{ORTH: "канд. геол.-минерал. наук", NORM: "кандидат геолого-минералогических наук"},
|
||||
{ORTH: "канд. искусствоведения", NORM: "кандидат искусствоведения"},
|
||||
{ORTH: "канд. ист. наук", NORM: "кандидат исторических наук"},
|
||||
{ORTH: "к.ист.н.", NORM: "кандидат исторических наук"},
|
||||
{ORTH: "канд. культурологии", NORM: "кандидат культурологии"},
|
||||
{ORTH: "канд. мед. наук", NORM: "кандидат медицинских наук"},
|
||||
{ORTH: "канд. пед. наук", NORM: "кандидат педагогических наук"},
|
||||
{ORTH: "канд. полит. наук", NORM: "кандидат политических наук"},
|
||||
{ORTH: "канд. психол. наук", NORM: "кандидат психологических наук"},
|
||||
{ORTH: "канд. с.-х. наук", NORM: "кандидат сельскохозяйственных наук"},
|
||||
{ORTH: "канд. социол. наук", NORM: "кандидат социологических наук"},
|
||||
{ORTH: "к.соц.наук", NORM: "кандидат социологических наук"},
|
||||
{ORTH: "к.соц.н.", NORM: "кандидат социологических наук"},
|
||||
{ORTH: "к.соц.н", NORM: "кандидат социологических наук"},
|
||||
{ORTH: "канд. техн. наук", NORM: "кандидат технических наук"},
|
||||
{ORTH: "канд. фармацевт. наук", NORM: "кандидат фармацевтических наук"},
|
||||
{ORTH: "канд. физ.-мат. наук", NORM: "кандидат физико-математических наук"},
|
||||
{ORTH: "канд. филол. наук", NORM: "кандидат филологических наук"},
|
||||
{ORTH: "канд. филос. наук", NORM: "кандидат философских наук"},
|
||||
{ORTH: "канд. хим. наук", NORM: "кандидат химических наук"},
|
||||
{ORTH: "канд. экон. наук", NORM: "кандидат экономических наук"},
|
||||
{ORTH: "канд. юрид. наук", NORM: "кандидат юридических наук"},
|
||||
{ORTH: "в.н.с.", NORM: "ведущий научный сотрудник"},
|
||||
{ORTH: "мл. науч. сотр.", NORM: "младший научный сотрудник"},
|
||||
{ORTH: "м.н.с.", NORM: "младший научный сотрудник"},
|
||||
{ORTH: "проф.", NORM: "профессор"},
|
||||
{ORTH: "профессор.кафедры", NORM: "профессор кафедры"},
|
||||
{ORTH: "ст. науч. сотр.", NORM: "старший научный сотрудник"},
|
||||
{ORTH: "чл.-к.", NORM: "член корреспондент"},
|
||||
{ORTH: "чл.-корр.", NORM: "член-корреспондент"},
|
||||
{ORTH: "чл.-кор.", NORM: "член-корреспондент"},
|
||||
{ORTH: "дир.", NORM: "директор"},
|
||||
{ORTH: "зам. дир.", NORM: "заместитель директора"},
|
||||
{ORTH: "зав. каф.", NORM: "заведующий кафедрой"},
|
||||
{ORTH: "зав.кафедрой", NORM: "заведующий кафедрой"},
|
||||
{ORTH: "зав. кафедрой", NORM: "заведующий кафедрой"},
|
||||
{ORTH: "асп.", NORM: "аспирант"},
|
||||
{ORTH: "гл. науч. сотр.", NORM: "главный научный сотрудник"},
|
||||
{ORTH: "вед. науч. сотр.", NORM: "ведущий научный сотрудник"},
|
||||
{ORTH: "науч. сотр.", NORM: "научный сотрудник"},
|
||||
{ORTH: "к.м.с.", NORM: "кандидат в мастера спорта"},
|
||||
]:
|
||||
_exc[abbr[ORTH]] = [abbr]
|
||||
|
||||
|
||||
for abbr in [
|
||||
# Literary phrases abbreviations
|
||||
{ORTH: "и т.д.", NORM: "и так далее"},
|
||||
{ORTH: "и т.п.", NORM: "и тому подобное"},
|
||||
{ORTH: "т.д.", NORM: "так далее"},
|
||||
{ORTH: "т.п.", NORM: "тому подобное"},
|
||||
{ORTH: "т.е.", NORM: "то есть"},
|
||||
{ORTH: "т.к.", NORM: "так как"},
|
||||
{ORTH: "в т.ч.", NORM: "в том числе"},
|
||||
{ORTH: "и пр.", NORM: "и прочие"},
|
||||
{ORTH: "и др.", NORM: "и другие"},
|
||||
{ORTH: "т.н.", NORM: "так называемый"},
|
||||
]:
|
||||
_exc[abbr[ORTH]] = [abbr]
|
||||
|
||||
|
||||
for abbr in [
|
||||
# Appeal to a person abbreviations
|
||||
{ORTH: "г-н", NORM: "господин"},
|
||||
{ORTH: "г-да", NORM: "господа"},
|
||||
{ORTH: "г-жа", NORM: "госпожа"},
|
||||
{ORTH: "тов.", NORM: "товарищ"},
|
||||
]:
|
||||
_exc[abbr[ORTH]] = [abbr]
|
||||
|
||||
|
||||
for abbr in [
|
||||
# Time periods abbreviations
|
||||
{ORTH: "до н.э.", NORM: "до нашей эры"},
|
||||
{ORTH: "по н.в.", NORM: "по настоящее время"},
|
||||
{ORTH: "в н.в.", NORM: "в настоящее время"},
|
||||
{ORTH: "наст.", NORM: "настоящий"},
|
||||
{ORTH: "наст. время", NORM: "настоящее время"},
|
||||
{ORTH: "г.г.", NORM: "годы"},
|
||||
{ORTH: "гг.", NORM: "годы"},
|
||||
{ORTH: "т.г.", NORM: "текущий год"},
|
||||
]:
|
||||
_exc[abbr[ORTH]] = [abbr]
|
||||
|
||||
|
||||
for abbr in [
|
||||
# Address forming elements abbreviations
|
||||
{ORTH: "респ.", NORM: "республика"},
|
||||
{ORTH: "обл.", NORM: "область"},
|
||||
{ORTH: "г.ф.з.", NORM: "город федерального значения"},
|
||||
{ORTH: "а.обл.", NORM: "автономная область"},
|
||||
{ORTH: "а.окр.", NORM: "автономный округ"},
|
||||
{ORTH: "м.р-н", NORM: "муниципальный район"},
|
||||
{ORTH: "г.о.", NORM: "городской округ"},
|
||||
{ORTH: "г.п.", NORM: "городское поселение"},
|
||||
{ORTH: "с.п.", NORM: "сельское поселение"},
|
||||
{ORTH: "вн.р-н", NORM: "внутригородской район"},
|
||||
{ORTH: "вн.тер.г.", NORM: "внутригородская территория города"},
|
||||
{ORTH: "пос.", NORM: "поселение"},
|
||||
{ORTH: "р-н", NORM: "район"},
|
||||
{ORTH: "с/с", NORM: "сельсовет"},
|
||||
{ORTH: "г.", NORM: "город"},
|
||||
{ORTH: "п.г.т.", NORM: "поселок городского типа"},
|
||||
{ORTH: "пгт.", NORM: "поселок городского типа"},
|
||||
{ORTH: "р.п.", NORM: "рабочий поселок"},
|
||||
{ORTH: "рп.", NORM: "рабочий поселок"},
|
||||
{ORTH: "кп.", NORM: "курортный поселок"},
|
||||
{ORTH: "гп.", NORM: "городской поселок"},
|
||||
{ORTH: "п.", NORM: "поселок"},
|
||||
{ORTH: "в-ки", NORM: "выселки"},
|
||||
{ORTH: "г-к", NORM: "городок"},
|
||||
{ORTH: "з-ка", NORM: "заимка"},
|
||||
{ORTH: "п-к", NORM: "починок"},
|
||||
{ORTH: "киш.", NORM: "кишлак"},
|
||||
{ORTH: "п. ст. ", NORM: "поселок станция"},
|
||||
{ORTH: "п. ж/д ст. ", NORM: "поселок при железнодорожной станции"},
|
||||
{ORTH: "ж/д бл-ст", NORM: "железнодорожный блокпост"},
|
||||
{ORTH: "ж/д б-ка", NORM: "железнодорожная будка"},
|
||||
{ORTH: "ж/д в-ка", NORM: "железнодорожная ветка"},
|
||||
{ORTH: "ж/д к-ма", NORM: "железнодорожная казарма"},
|
||||
{ORTH: "ж/д к-т", NORM: "железнодорожный комбинат"},
|
||||
{ORTH: "ж/д пл-ма", NORM: "железнодорожная платформа"},
|
||||
{ORTH: "ж/д пл-ка", NORM: "железнодорожная площадка"},
|
||||
{ORTH: "ж/д п.п.", NORM: "железнодорожный путевой пост"},
|
||||
{ORTH: "ж/д о.п.", NORM: "железнодорожный остановочный пункт"},
|
||||
{ORTH: "ж/д рзд.", NORM: "железнодорожный разъезд"},
|
||||
{ORTH: "ж/д ст. ", NORM: "железнодорожная станция"},
|
||||
{ORTH: "м-ко", NORM: "местечко"},
|
||||
{ORTH: "д.", NORM: "деревня"},
|
||||
{ORTH: "с.", NORM: "село"},
|
||||
{ORTH: "сл.", NORM: "слобода"},
|
||||
{ORTH: "ст. ", NORM: "станция"},
|
||||
{ORTH: "ст-ца", NORM: "станица"},
|
||||
{ORTH: "у.", NORM: "улус"},
|
||||
{ORTH: "х.", NORM: "хутор"},
|
||||
{ORTH: "рзд.", NORM: "разъезд"},
|
||||
{ORTH: "зим.", NORM: "зимовье"},
|
||||
{ORTH: "б-г", NORM: "берег"},
|
||||
{ORTH: "ж/р", NORM: "жилой район"},
|
||||
{ORTH: "кв-л", NORM: "квартал"},
|
||||
{ORTH: "мкр.", NORM: "микрорайон"},
|
||||
{ORTH: "ост-в", NORM: "остров"},
|
||||
{ORTH: "платф.", NORM: "платформа"},
|
||||
{ORTH: "п/р", NORM: "промышленный район"},
|
||||
{ORTH: "р-н", NORM: "район"},
|
||||
{ORTH: "тер.", NORM: "территория"},
|
||||
{
|
||||
ORTH: "тер. СНО",
|
||||
NORM: "территория садоводческих некоммерческих объединений граждан",
|
||||
},
|
||||
{
|
||||
ORTH: "тер. ОНО",
|
||||
NORM: "территория огороднических некоммерческих объединений граждан",
|
||||
},
|
||||
{ORTH: "тер. ДНО", NORM: "территория дачных некоммерческих объединений граждан"},
|
||||
{ORTH: "тер. СНТ", NORM: "территория садоводческих некоммерческих товариществ"},
|
||||
{ORTH: "тер. ОНТ", NORM: "территория огороднических некоммерческих товариществ"},
|
||||
{ORTH: "тер. ДНТ", NORM: "территория дачных некоммерческих товариществ"},
|
||||
{ORTH: "тер. СПК", NORM: "территория садоводческих потребительских кооперативов"},
|
||||
{ORTH: "тер. ОПК", NORM: "территория огороднических потребительских кооперативов"},
|
||||
{ORTH: "тер. ДПК", NORM: "территория дачных потребительских кооперативов"},
|
||||
{ORTH: "тер. СНП", NORM: "территория садоводческих некоммерческих партнерств"},
|
||||
{ORTH: "тер. ОНП", NORM: "территория огороднических некоммерческих партнерств"},
|
||||
{ORTH: "тер. ДНП", NORM: "территория дачных некоммерческих партнерств"},
|
||||
{ORTH: "тер. ТСН", NORM: "территория товарищества собственников недвижимости"},
|
||||
{ORTH: "тер. ГСК", NORM: "территория гаражно-строительного кооператива"},
|
||||
{ORTH: "ус.", NORM: "усадьба"},
|
||||
{ORTH: "тер.ф.х.", NORM: "территория фермерского хозяйства"},
|
||||
{ORTH: "ю.", NORM: "юрты"},
|
||||
{ORTH: "ал.", NORM: "аллея"},
|
||||
{ORTH: "б-р", NORM: "бульвар"},
|
||||
{ORTH: "взв.", NORM: "взвоз"},
|
||||
{ORTH: "взд.", NORM: "въезд"},
|
||||
{ORTH: "дор.", NORM: "дорога"},
|
||||
{ORTH: "ззд.", NORM: "заезд"},
|
||||
{ORTH: "км", NORM: "километр"},
|
||||
{ORTH: "к-цо", NORM: "кольцо"},
|
||||
{ORTH: "лн.", NORM: "линия"},
|
||||
{ORTH: "мгстр.", NORM: "магистраль"},
|
||||
{ORTH: "наб.", NORM: "набережная"},
|
||||
{ORTH: "пер-д", NORM: "переезд"},
|
||||
{ORTH: "пер.", NORM: "переулок"},
|
||||
{ORTH: "пл-ка", NORM: "площадка"},
|
||||
{ORTH: "пл.", NORM: "площадь"},
|
||||
{ORTH: "пр-д", NORM: "проезд"},
|
||||
{ORTH: "пр-к", NORM: "просек"},
|
||||
{ORTH: "пр-ка", NORM: "просека"},
|
||||
{ORTH: "пр-лок", NORM: "проселок"},
|
||||
{ORTH: "пр-кт", NORM: "проспект"},
|
||||
{ORTH: "проул.", NORM: "проулок"},
|
||||
{ORTH: "рзд.", NORM: "разъезд"},
|
||||
{ORTH: "ряд", NORM: "ряд(ы)"},
|
||||
{ORTH: "с-р", NORM: "сквер"},
|
||||
{ORTH: "с-к", NORM: "спуск"},
|
||||
{ORTH: "сзд.", NORM: "съезд"},
|
||||
{ORTH: "туп.", NORM: "тупик"},
|
||||
{ORTH: "ул.", NORM: "улица"},
|
||||
{ORTH: "ш.", NORM: "шоссе"},
|
||||
{ORTH: "влд.", NORM: "владение"},
|
||||
{ORTH: "г-ж", NORM: "гараж"},
|
||||
{ORTH: "д.", NORM: "дом"},
|
||||
{ORTH: "двлд.", NORM: "домовладение"},
|
||||
{ORTH: "зд.", NORM: "здание"},
|
||||
{ORTH: "з/у", NORM: "земельный участок"},
|
||||
{ORTH: "кв.", NORM: "квартира"},
|
||||
{ORTH: "ком.", NORM: "комната"},
|
||||
{ORTH: "подв.", NORM: "подвал"},
|
||||
{ORTH: "кот.", NORM: "котельная"},
|
||||
{ORTH: "п-б", NORM: "погреб"},
|
||||
{ORTH: "к.", NORM: "корпус"},
|
||||
{ORTH: "ОНС", NORM: "объект незавершенного строительства"},
|
||||
{ORTH: "оф.", NORM: "офис"},
|
||||
{ORTH: "пав.", NORM: "павильон"},
|
||||
{ORTH: "помещ.", NORM: "помещение"},
|
||||
{ORTH: "раб.уч.", NORM: "рабочий участок"},
|
||||
{ORTH: "скл.", NORM: "склад"},
|
||||
{ORTH: "coop.", NORM: "сооружение"},
|
||||
{ORTH: "стр.", NORM: "строение"},
|
||||
{ORTH: "торг.зал", NORM: "торговый зал"},
|
||||
{ORTH: "а/п", NORM: "аэропорт"},
|
||||
{ORTH: "им.", NORM: "имени"},
|
||||
]:
|
||||
_exc[abbr[ORTH]] = [abbr]
|
||||
|
||||
|
||||
for abbr in [
|
||||
# Others abbreviations
|
||||
{ORTH: "тыс.руб.", NORM: "тысяч рублей"},
|
||||
{ORTH: "тыс.", NORM: "тысяч"},
|
||||
{ORTH: "руб.", NORM: "рубль"},
|
||||
{ORTH: "долл.", NORM: "доллар"},
|
||||
{ORTH: "прим.", NORM: "примечание"},
|
||||
{ORTH: "прим.ред.", NORM: "примечание редакции"},
|
||||
{ORTH: "см. также", NORM: "смотри также"},
|
||||
{ORTH: "кв.м.", NORM: "квадрантный метр"},
|
||||
{ORTH: "м2", NORM: "квадрантный метр"},
|
||||
{ORTH: "б/у", NORM: "бывший в употреблении"},
|
||||
{ORTH: "сокр.", NORM: "сокращение"},
|
||||
{ORTH: "чел.", NORM: "человек"},
|
||||
{ORTH: "б.п.", NORM: "базисный пункт"},
|
||||
]:
|
||||
_exc[abbr[ORTH]] = [abbr]
|
||||
|
||||
|
||||
TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
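A sketch of how the new exceptions surface in tokenization; the sample sentence and the expected output are illustrative, not taken from the test suite:

import spacy

nlp = spacy.blank("ru")
doc = nlp("Он живёт в п.г.т. Сиверский, ул. Ленина, д. 5.")
# "п.г.т.", "ул." and "д." are expected to stay single tokens, with norms
# "поселок городского типа", "улица" and "дом" from the tables above.
print([t.text for t in doc])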
|
||||
|
|
|
@ -1,13 +1,10 @@
|
|||
# Source: https://github.com/stopwords-iso/stopwords-sl
|
||||
# TODO: probably needs to be tidied up – the list seems to have month names in
|
||||
# it, which shouldn't be considered stop words.
|
||||
# Removed various words that are not normally considered stop words, such as months.
|
||||
|
||||
STOP_WORDS = set(
|
||||
"""
|
||||
a
|
||||
ali
|
||||
april
|
||||
avgust
|
||||
b
|
||||
bi
|
||||
bil
|
||||
|
@ -19,7 +16,6 @@ biti
|
|||
blizu
|
||||
bo
|
||||
bodo
|
||||
bojo
|
||||
bolj
|
||||
bom
|
||||
bomo
|
||||
|
@ -37,16 +33,6 @@ da
|
|||
daleč
|
||||
dan
|
||||
danes
|
||||
datum
|
||||
december
|
||||
deset
|
||||
deseta
|
||||
deseti
|
||||
deseto
|
||||
devet
|
||||
deveta
|
||||
deveti
|
||||
deveto
|
||||
do
|
||||
dober
|
||||
dobra
|
||||
|
@ -54,16 +40,7 @@ dobri
|
|||
dobro
|
||||
dokler
|
||||
dol
|
||||
dolg
|
||||
dolga
|
||||
dolgi
|
||||
dovolj
|
||||
drug
|
||||
druga
|
||||
drugi
|
||||
drugo
|
||||
dva
|
||||
dve
|
||||
e
|
||||
eden
|
||||
en
|
||||
|
@ -74,7 +51,6 @@ enkrat
|
|||
eno
|
||||
etc.
|
||||
f
|
||||
februar
|
||||
g
|
||||
g.
|
||||
ga
|
||||
|
@ -93,16 +69,12 @@ iv
|
|||
ix
|
||||
iz
|
||||
j
|
||||
januar
|
||||
jaz
|
||||
je
|
||||
ji
|
||||
jih
|
||||
jim
|
||||
jo
|
||||
julij
|
||||
junij
|
||||
jutri
|
||||
k
|
||||
kadarkoli
|
||||
kaj
|
||||
|
@ -123,41 +95,23 @@ kje
|
|||
kjer
|
||||
kjerkoli
|
||||
ko
|
||||
koder
|
||||
koderkoli
|
||||
koga
|
||||
komu
|
||||
kot
|
||||
kratek
|
||||
kratka
|
||||
kratke
|
||||
kratki
|
||||
l
|
||||
lahka
|
||||
lahke
|
||||
lahki
|
||||
lahko
|
||||
le
|
||||
lep
|
||||
lepa
|
||||
lepe
|
||||
lepi
|
||||
lepo
|
||||
leto
|
||||
m
|
||||
maj
|
||||
majhen
|
||||
majhna
|
||||
majhni
|
||||
malce
|
||||
malo
|
||||
manj
|
||||
marec
|
||||
me
|
||||
med
|
||||
medtem
|
||||
mene
|
||||
mesec
|
||||
mi
|
||||
midva
|
||||
midve
|
||||
|
@ -183,7 +137,6 @@ najmanj
|
|||
naju
|
||||
največ
|
||||
nam
|
||||
narobe
|
||||
nas
|
||||
nato
|
||||
nazaj
|
||||
|
@ -192,7 +145,6 @@ naša
|
|||
naše
|
||||
ne
|
||||
nedavno
|
||||
nedelja
|
||||
nek
|
||||
neka
|
||||
nekaj
|
||||
|
@ -236,7 +188,6 @@ njuna
|
|||
njuno
|
||||
no
|
||||
nocoj
|
||||
november
|
||||
npr.
|
||||
o
|
||||
ob
|
||||
|
@ -244,51 +195,23 @@ oba
|
|||
obe
|
||||
oboje
|
||||
od
|
||||
odprt
|
||||
odprta
|
||||
odprti
|
||||
okoli
|
||||
oktober
|
||||
on
|
||||
onadva
|
||||
one
|
||||
oni
|
||||
onidve
|
||||
osem
|
||||
osma
|
||||
osmi
|
||||
osmo
|
||||
oz.
|
||||
p
|
||||
pa
|
||||
pet
|
||||
peta
|
||||
petek
|
||||
peti
|
||||
peto
|
||||
po
|
||||
pod
|
||||
pogosto
|
||||
poleg
|
||||
poln
|
||||
polna
|
||||
polni
|
||||
polno
|
||||
ponavadi
|
||||
ponedeljek
|
||||
ponovno
|
||||
potem
|
||||
povsod
|
||||
pozdravljen
|
||||
pozdravljeni
|
||||
prav
|
||||
prava
|
||||
prave
|
||||
pravi
|
||||
pravo
|
||||
prazen
|
||||
prazna
|
||||
prazno
|
||||
prbl.
|
||||
precej
|
||||
pred
|
||||
|
@ -297,19 +220,10 @@ preko
|
|||
pri
|
||||
pribl.
|
||||
približno
|
||||
primer
|
||||
pripravljen
|
||||
pripravljena
|
||||
pripravljeni
|
||||
proti
|
||||
prva
|
||||
prvi
|
||||
prvo
|
||||
r
|
||||
ravno
|
||||
redko
|
||||
res
|
||||
reč
|
||||
s
|
||||
saj
|
||||
sam
|
||||
|
@ -321,29 +235,17 @@ se
|
|||
sebe
|
||||
sebi
|
||||
sedaj
|
||||
sedem
|
||||
sedma
|
||||
sedmi
|
||||
sedmo
|
||||
sem
|
||||
september
|
||||
seveda
|
||||
si
|
||||
sicer
|
||||
skoraj
|
||||
skozi
|
||||
slab
|
||||
smo
|
||||
so
|
||||
sobota
|
||||
spet
|
||||
sreda
|
||||
srednja
|
||||
srednji
|
||||
sta
|
||||
ste
|
||||
stran
|
||||
stvar
|
||||
sva
|
||||
t
|
||||
ta
|
||||
|
@ -358,10 +260,6 @@ te
|
|||
tebe
|
||||
tebi
|
||||
tega
|
||||
težak
|
||||
težka
|
||||
težki
|
||||
težko
|
||||
ti
|
||||
tista
|
||||
tiste
|
||||
|
@ -371,11 +269,6 @@ tj.
|
|||
tja
|
||||
to
|
||||
toda
|
||||
torek
|
||||
tretja
|
||||
tretje
|
||||
tretji
|
||||
tri
|
||||
tu
|
||||
tudi
|
||||
tukaj
|
||||
|
@ -392,10 +285,6 @@ vaša
|
|||
vaše
|
||||
ve
|
||||
vedno
|
||||
velik
|
||||
velika
|
||||
veliki
|
||||
veliko
|
||||
vendar
|
||||
ves
|
||||
več
|
||||
|
@ -403,10 +292,6 @@ vi
|
|||
vidva
|
||||
vii
|
||||
viii
|
||||
visok
|
||||
visoka
|
||||
visoke
|
||||
visoki
|
||||
vsa
|
||||
vsaj
|
||||
vsak
|
||||
|
@ -420,34 +305,21 @@ vsega
|
|||
vsi
|
||||
vso
|
||||
včasih
|
||||
včeraj
|
||||
x
|
||||
z
|
||||
za
|
||||
zadaj
|
||||
zadnji
|
||||
zakaj
|
||||
zaprta
|
||||
zaprti
|
||||
zaprto
|
||||
zdaj
|
||||
zelo
|
||||
zunaj
|
||||
č
|
||||
če
|
||||
često
|
||||
četrta
|
||||
četrtek
|
||||
četrti
|
||||
četrto
|
||||
čez
|
||||
čigav
|
||||
š
|
||||
šest
|
||||
šesta
|
||||
šesti
|
||||
šesto
|
||||
štiri
|
||||
ž
|
||||
že
|
||||
""".split()
|
||||
|
|
|
@ -6,19 +6,30 @@ from ...util import update_exc
|
|||
_exc = {}
|
||||
|
||||
for exc_data in [
|
||||
{ORTH: "обл.", NORM: "область"},
|
||||
{ORTH: "р-н.", NORM: "район"},
|
||||
{ORTH: "р-н", NORM: "район"},
|
||||
{ORTH: "м.", NORM: "місто"},
|
||||
{ORTH: "вул.", NORM: "вулиця"},
|
||||
{ORTH: "ім.", NORM: "імені"},
|
||||
{ORTH: "просп.", NORM: "проспект"},
|
||||
{ORTH: "пр-кт", NORM: "проспект"},
|
||||
{ORTH: "бул.", NORM: "бульвар"},
|
||||
{ORTH: "пров.", NORM: "провулок"},
|
||||
{ORTH: "пл.", NORM: "площа"},
|
||||
{ORTH: "майд.", NORM: "майдан"},
|
||||
{ORTH: "мкр.", NORM: "мікрорайон"},
|
||||
{ORTH: "ст.", NORM: "станція"},
|
||||
{ORTH: "ж/м", NORM: "житловий масив"},
|
||||
{ORTH: "наб.", NORM: "набережна"},
|
||||
{ORTH: "в/ч", NORM: "військова частина"},
|
||||
{ORTH: "в/м", NORM: "військове містечко"},
|
||||
{ORTH: "оз.", NORM: "озеро"},
|
||||
{ORTH: "ім.", NORM: "імені"},
|
||||
{ORTH: "г.", NORM: "гора"},
|
||||
{ORTH: "п.", NORM: "пан"},
|
||||
{ORTH: "м.", NORM: "місто"},
|
||||
{ORTH: "проф.", NORM: "професор"},
|
||||
{ORTH: "акад.", NORM: "академік"},
|
||||
{ORTH: "доц.", NORM: "доцент"},
|
||||
{ORTH: "оз.", NORM: "озеро"},
|
||||
]:
|
||||
_exc[exc_data[ORTH]] = [exc_data]
|
||||
|
||||
|
|
|
@ -59,7 +59,7 @@ sentences = [
|
|||
"Czy w ciągu ostatnich 48 godzin spożyłeś leki zawierające paracetamol?",
|
||||
"Kto ma ochotę zapoznać się z innymi niż w książkach przygodami Muminków i ich przyjaciół, temu polecam komiks Tove Jansson „Muminki i morze”.",
|
||||
"Apple está querendo comprar uma startup do Reino Unido por 100 milhões de dólares.",
|
||||
"Carros autônomos empurram a responsabilidade do seguro para os fabricantes.."
|
||||
"Carros autônomos empurram a responsabilidade do seguro para os fabricantes..",
|
||||
"São Francisco considera banir os robôs de entrega que andam pelas calçadas.",
|
||||
"Londres é a maior cidade do Reino Unido.",
|
||||
# Translations from English:
|
||||
|
|
|
@ -131,7 +131,7 @@ class Language:
|
|||
self,
|
||||
vocab: Union[Vocab, bool] = True,
|
||||
*,
|
||||
max_length: int = 10 ** 6,
|
||||
max_length: int = 10**6,
|
||||
meta: Dict[str, Any] = {},
|
||||
create_tokenizer: Optional[Callable[["Language"], Callable[[str], Doc]]] = None,
|
||||
batch_size: int = 1000,
|
||||
|
@ -354,12 +354,15 @@ class Language:
|
|||
@property
|
||||
def pipe_labels(self) -> Dict[str, List[str]]:
|
||||
"""Get the labels set by the pipeline components, if available (if
|
||||
the component exposes a labels property).
|
||||
the component exposes a labels property and the labels are not
|
||||
hidden).
|
||||
|
||||
RETURNS (Dict[str, List[str]]): Labels keyed by component name.
|
||||
"""
|
||||
labels = {}
|
||||
for name, pipe in self._components:
|
||||
if hasattr(pipe, "hide_labels") and pipe.hide_labels is True:
|
||||
continue
|
||||
if hasattr(pipe, "labels"):
|
||||
labels[name] = list(pipe.labels)
|
||||
return SimpleFrozenDict(labels)
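A sketch of the difference this makes, assuming the senter component still sets hide_labels (its internal "I"/"S" labels are not meant to be user-facing):

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("textcat")
nlp.add_pipe("senter")
# The senter hides its labels, so only the textcat's (still empty) label set
# is expected to appear here.
print(nlp.pipe_labels)  # e.g. {'textcat': []}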
|
||||
|
@ -522,7 +525,7 @@ class Language:
|
|||
requires: Iterable[str] = SimpleFrozenList(),
|
||||
retokenizes: bool = False,
|
||||
func: Optional["Pipe"] = None,
|
||||
) -> Callable:
|
||||
) -> Callable[..., Any]:
|
||||
"""Register a new pipeline component. Can be used for stateless function
|
||||
components that don't require a separate factory. Can be used as a
|
||||
decorator on a function or classmethod, or called as a function with the
|
||||
|
@ -1219,8 +1222,9 @@ class Language:
|
|||
component_cfg = {}
|
||||
grads = {}
|
||||
|
||||
def get_grads(W, dW, key=None):
|
||||
def get_grads(key, W, dW):
|
||||
grads[key] = (W, dW)
|
||||
return W, dW
|
||||
|
||||
get_grads.learn_rate = sgd.learn_rate # type: ignore[attr-defined, union-attr]
|
||||
get_grads.b1 = sgd.b1 # type: ignore[attr-defined, union-attr]
|
||||
|
@ -1233,7 +1237,7 @@ class Language:
|
|||
examples, sgd=get_grads, losses=losses, **component_cfg.get(name, {})
|
||||
)
|
||||
for key, (W, dW) in grads.items():
|
||||
sgd(W, dW, key=key) # type: ignore[call-arg, misc]
|
||||
sgd(key, W, dW) # type: ignore[call-arg, misc]
|
||||
return losses
|
||||
|
||||
def begin_training(
|
||||
|
@ -1285,9 +1289,9 @@ class Language:
|
|||
)
|
||||
except IOError:
|
||||
raise IOError(Errors.E884.format(vectors=I["vectors"]))
|
||||
if self.vocab.vectors.data.shape[1] >= 1:
|
||||
if self.vocab.vectors.shape[1] >= 1:
|
||||
ops = get_current_ops()
|
||||
self.vocab.vectors.data = ops.asarray(self.vocab.vectors.data)
|
||||
self.vocab.vectors.to_ops(ops)
|
||||
if hasattr(self.tokenizer, "initialize"):
|
||||
tok_settings = validate_init_settings(
|
||||
self.tokenizer.initialize, # type: ignore[union-attr]
|
||||
|
@ -1332,8 +1336,8 @@ class Language:
|
|||
DOCS: https://spacy.io/api/language#resume_training
|
||||
"""
|
||||
ops = get_current_ops()
|
||||
if self.vocab.vectors.data.shape[1] >= 1:
|
||||
self.vocab.vectors.data = ops.asarray(self.vocab.vectors.data)
|
||||
if self.vocab.vectors.shape[1] >= 1:
|
||||
self.vocab.vectors.to_ops(ops)
|
||||
for name, proc in self.pipeline:
|
||||
if hasattr(proc, "_rehearsal_model"):
|
||||
proc._rehearsal_model = deepcopy(proc.model) # type: ignore[attr-defined]
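Vectors.to_ops, used above instead of reassigning vectors.data, can also be called directly. A sketch assuming a pipeline that ships static vectors, such as en_core_web_md, is installed:

import spacy
from thinc.api import get_current_ops

nlp = spacy.load("en_core_web_md")
ops = get_current_ops()
# Move the whole vectors table to the active backend (CPU or GPU) in one call.
nlp.vocab.vectors.to_ops(ops)
print(nlp.vocab.vectors.shape)  # (n_vectors, width), unchanged by the move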
|
||||
|
@ -1404,20 +1408,13 @@ class Language:
|
|||
for eg in examples:
|
||||
self.make_doc(eg.reference.text)
|
||||
# apply all pipeline components
|
||||
for name, pipe in self.pipeline:
|
||||
kwargs = component_cfg.get(name, {})
|
||||
kwargs.setdefault("batch_size", batch_size)
|
||||
for doc, eg in zip(
|
||||
_pipe(
|
||||
(eg.predicted for eg in examples),
|
||||
proc=pipe,
|
||||
name=name,
|
||||
default_error_handler=self.default_error_handler,
|
||||
kwargs=kwargs,
|
||||
),
|
||||
examples,
|
||||
):
|
||||
eg.predicted = doc
|
||||
docs = self.pipe(
|
||||
(eg.predicted for eg in examples),
|
||||
batch_size=batch_size,
|
||||
component_cfg=component_cfg,
|
||||
)
|
||||
for eg, doc in zip(examples, docs):
|
||||
eg.predicted = doc
|
||||
end_time = timer()
|
||||
results = scorer.score(examples)
|
||||
n_words = sum(len(eg.predicted) for eg in examples)
|
||||
|
|
|
@ -19,7 +19,7 @@ class Lexeme:
|
|||
@property
|
||||
def vector_norm(self) -> float: ...
|
||||
vector: Floats1d
|
||||
rank: str
|
||||
rank: int
|
||||
sentiment: float
|
||||
@property
|
||||
def orth_(self) -> str: ...
|
||||
|
|
|
@ -130,8 +130,10 @@ cdef class Lexeme:
|
|||
return 0.0
|
||||
vector = self.vector
|
||||
xp = get_array_module(vector)
|
||||
return (xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm))
|
||||
|
||||
result = xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm)
|
||||
# ensure we get a scalar back (numpy does this automatically but cupy doesn't)
|
||||
return result.item()
|
||||
|
||||
@property
|
||||
def has_vector(self):
|
||||
"""RETURNS (bool): Whether a word vector is associated with the object.
|
||||
|
|
66
spacy/matcher/dependencymatcher.pyi
Normal file
|
@ -0,0 +1,66 @@
|
|||
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
|
||||
from .matcher import Matcher
|
||||
from ..vocab import Vocab
|
||||
from ..tokens.doc import Doc
|
||||
from ..tokens.span import Span
|
||||
|
||||
class DependencyMatcher:
|
||||
"""Match dependency parse tree based on pattern rules."""
|
||||
|
||||
_patterns: Dict[str, List[Any]]
|
||||
_raw_patterns: Dict[str, List[Any]]
|
||||
_tokens_to_key: Dict[str, List[Any]]
|
||||
_root: Dict[str, List[Any]]
|
||||
_tree: Dict[str, List[Any]]
|
||||
_callbacks: Dict[
|
||||
Any, Callable[[DependencyMatcher, Doc, int, List[Tuple[int, List[int]]]], Any]
|
||||
]
|
||||
_ops: Dict[str, Any]
|
||||
vocab: Vocab
|
||||
_matcher: Matcher
|
||||
def __init__(self, vocab: Vocab, *, validate: bool = ...) -> None: ...
|
||||
def __reduce__(
|
||||
self,
|
||||
) -> Tuple[
|
||||
Callable[
|
||||
[Vocab, Dict[str, Any], Dict[str, Callable[..., Any]]], DependencyMatcher
|
||||
],
|
||||
Tuple[
|
||||
Vocab,
|
||||
Dict[str, List[Any]],
|
||||
Dict[
|
||||
str,
|
||||
Callable[
|
||||
[DependencyMatcher, Doc, int, List[Tuple[int, List[int]]]], Any
|
||||
],
|
||||
],
|
||||
],
|
||||
None,
|
||||
None,
|
||||
]: ...
|
||||
def __len__(self) -> int: ...
|
||||
def __contains__(self, key: Union[str, int]) -> bool: ...
|
||||
def add(
|
||||
self,
|
||||
key: Union[str, int],
|
||||
patterns: List[List[Dict[str, Any]]],
|
||||
*,
|
||||
on_match: Optional[
|
||||
Callable[[DependencyMatcher, Doc, int, List[Tuple[int, List[int]]]], Any]
|
||||
] = ...
|
||||
) -> None: ...
|
||||
def has_key(self, key: Union[str, int]) -> bool: ...
|
||||
def get(
|
||||
self, key: Union[str, int], default: Optional[Any] = ...
|
||||
) -> Tuple[
|
||||
Optional[
|
||||
Callable[[DependencyMatcher, Doc, int, List[Tuple[int, List[int]]]], Any]
|
||||
],
|
||||
List[List[Dict[str, Any]]],
|
||||
]: ...
|
||||
def remove(self, key: Union[str, int]) -> None: ...
|
||||
def __call__(self, doclike: Union[Doc, Span]) -> List[Tuple[int, List[int]]]: ...
|
||||
|
||||
def unpickle_matcher(
|
||||
vocab: Vocab, patterns: Dict[str, Any], callbacks: Dict[str, Callable[..., Any]]
|
||||
) -> DependencyMatcher: ...
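For reference, a small usage sketch matching the stub above; it assumes an English pipeline with a parser (en_core_web_sm is just an example) is available:

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    {
        "LEFT_ID": "verb",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
]
matcher.add("SVO", [pattern])
doc = nlp("The dog chased the cat.")
# Matches come back as (match_id, [token_ids]) pairs, as the stub's return type says.
print(matcher(doc))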
|
|
@ -1,4 +1,6 @@
|
|||
from typing import Any, List, Dict, Tuple, Optional, Callable, Union, Iterator, Iterable
|
||||
from typing import Any, List, Dict, Tuple, Optional, Callable, Union
|
||||
from typing import Iterator, Iterable, overload
|
||||
from ..compat import Literal
|
||||
from ..vocab import Vocab
|
||||
from ..tokens import Doc, Span
|
||||
|
||||
|
@ -31,12 +33,22 @@ class Matcher:
|
|||
) -> Union[
|
||||
Iterator[Tuple[Tuple[Doc, Any], Any]], Iterator[Tuple[Doc, Any]], Iterator[Doc]
|
||||
]: ...
|
||||
@overload
|
||||
def __call__(
|
||||
self,
|
||||
doclike: Union[Doc, Span],
|
||||
*,
|
||||
as_spans: bool = ...,
|
||||
as_spans: Literal[False] = ...,
|
||||
allow_missing: bool = ...,
|
||||
with_alignments: bool = ...
|
||||
) -> Union[List[Tuple[int, int, int]], List[Span]]: ...
|
||||
) -> List[Tuple[int, int, int]]: ...
|
||||
@overload
|
||||
def __call__(
|
||||
self,
|
||||
doclike: Union[Doc, Span],
|
||||
*,
|
||||
as_spans: Literal[True],
|
||||
allow_missing: bool = ...,
|
||||
with_alignments: bool = ...
|
||||
) -> List[Span]: ...
|
||||
def _normalize_key(self, key: Any) -> Any: ...
|
||||
|
|
|
@ -18,7 +18,7 @@ from ..tokens.doc cimport Doc, get_token_attr_for_matcher
|
|||
from ..tokens.span cimport Span
|
||||
from ..tokens.token cimport Token
|
||||
from ..tokens.morphanalysis cimport MorphAnalysis
|
||||
from ..attrs cimport ID, attr_id_t, NULL_ATTR, ORTH, POS, TAG, DEP, LEMMA, MORPH
|
||||
from ..attrs cimport ID, attr_id_t, NULL_ATTR, ORTH, POS, TAG, DEP, LEMMA, MORPH, ENT_IOB
|
||||
|
||||
from ..schemas import validate_token_pattern
|
||||
from ..errors import Errors, MatchPatternError, Warnings
|
||||
|
@ -798,7 +798,10 @@ def _get_attr_values(spec, string_store):
|
|||
attr = "SENT_START"
|
||||
attr = IDS.get(attr)
|
||||
if isinstance(value, str):
|
||||
value = string_store.add(value)
|
||||
if attr == ENT_IOB and value in Token.iob_strings():
|
||||
value = Token.iob_strings().index(value)
|
||||
else:
|
||||
value = string_store.add(value)
|
||||
elif isinstance(value, bool):
|
||||
value = int(value)
|
||||
elif isinstance(value, int):
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
from typing import List, Tuple, Union, Optional, Callable, Any, Dict
|
||||
|
||||
from . import Matcher
|
||||
from typing import List, Tuple, Union, Optional, Callable, Any, Dict, overload
|
||||
from ..compat import Literal
|
||||
from .matcher import Matcher
|
||||
from ..vocab import Vocab
|
||||
from ..tokens import Doc, Span
|
||||
|
||||
|
@ -8,18 +8,30 @@ class PhraseMatcher:
|
|||
def __init__(
|
||||
self, vocab: Vocab, attr: Optional[Union[int, str]], validate: bool = ...
|
||||
) -> None: ...
|
||||
def __call__(
|
||||
self,
|
||||
doclike: Union[Doc, Span],
|
||||
*,
|
||||
as_spans: bool = ...,
|
||||
) -> Union[List[Tuple[int, int, int]], List[Span]]: ...
|
||||
def __reduce__(self) -> Any: ...
|
||||
def __len__(self) -> int: ...
|
||||
def __contains__(self, key: str) -> bool: ...
|
||||
def add(
|
||||
self,
|
||||
key: str,
|
||||
docs: List[List[Dict[str, Any]]],
|
||||
docs: List[Doc],
|
||||
*,
|
||||
on_match: Optional[
|
||||
Callable[[Matcher, Doc, int, List[Tuple[Any, ...]]], Any]
|
||||
] = ...,
|
||||
) -> None: ...
|
||||
def remove(self, key: str) -> None: ...
|
||||
@overload
|
||||
def __call__(
|
||||
self,
|
||||
doclike: Union[Doc, Span],
|
||||
*,
|
||||
as_spans: Literal[False] = ...,
|
||||
) -> List[Tuple[int, int, int]]: ...
|
||||
@overload
|
||||
def __call__(
|
||||
self,
|
||||
doclike: Union[Doc, Span],
|
||||
*,
|
||||
as_spans: Literal[True],
|
||||
) -> List[Span]: ...
|
||||
|
|
|
@ -23,7 +23,7 @@ def create_pretrain_vectors(
|
|||
maxout_pieces: int, hidden_size: int, loss: str
|
||||
) -> Callable[["Vocab", Model], Model]:
|
||||
def create_vectors_objective(vocab: "Vocab", tok2vec: Model) -> Model:
|
||||
if vocab.vectors.data.shape[1] == 0:
|
||||
if vocab.vectors.shape[1] == 0:
|
||||
raise ValueError(Errors.E875)
|
||||
model = build_cloze_multi_task_model(
|
||||
vocab, tok2vec, hidden_size=hidden_size, maxout_pieces=maxout_pieces
|
||||
|
@ -85,7 +85,7 @@ def get_characters_loss(ops, docs, prediction, nr_char):
|
|||
target = ops.asarray(to_categorical(target_ids, n_classes=256), dtype="f")
|
||||
target = target.reshape((-1, 256 * nr_char))
|
||||
diff = prediction - target
|
||||
loss = (diff ** 2).sum()
|
||||
loss = (diff**2).sum()
|
||||
d_target = diff / float(prediction.shape[0])
|
||||
return loss, d_target
|
||||
|
||||
|
@ -116,7 +116,7 @@ def build_multi_task_model(
|
|||
def build_cloze_multi_task_model(
|
||||
vocab: "Vocab", tok2vec: Model, maxout_pieces: int, hidden_size: int
|
||||
) -> Model:
|
||||
nO = vocab.vectors.data.shape[1]
|
||||
nO = vocab.vectors.shape[1]
|
||||
output_layer = chain(
|
||||
cast(Model[List["Floats2d"], Floats2d], list2array()),
|
||||
Maxout(
|
||||
|
|
|
@@ -123,7 +123,7 @@ def MultiHashEmbed(
attributes are NORM, PREFIX, SUFFIX and SHAPE. This lets the model take into
account some subword information, without constructing a fully character-based
representation. If pretrained vectors are available, they can be included in
the representation as well, with the vectors table will be kept static
the representation as well, with the vectors table kept static
(i.e. it's not updated).

The `width` parameter specifies the output width of the layer and the widths

@@ -94,7 +94,7 @@ def init(
nM = model.get_dim("nM") if model.has_dim("nM") else None
nO = model.get_dim("nO") if model.has_dim("nO") else None
if X is not None and len(X):
nM = X[0].vocab.vectors.data.shape[1]
nM = X[0].vocab.vectors.shape[1]
if Y is not None:
nO = Y.data.shape[1]

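
Both hunks above move from `vocab.vectors.data.shape` to the `Vectors.shape` property, which also works when no vector data has been loaded. A quick sketch of the kind of check involved:

    import spacy

    nlp = spacy.blank("en")
    n_rows, width = nlp.vocab.vectors.shape  # (number of vectors, vector width)
    if width == 0:
        print("no static vectors available; the pretraining objective would raise E875")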
@ -1,7 +1,9 @@
|
|||
from cython.operator cimport dereference as deref, preincrement as incr
|
||||
from libc.string cimport memcpy, memset
|
||||
from libc.stdlib cimport calloc, free
|
||||
from libc.stdint cimport uint32_t, uint64_t
|
||||
cimport libcpp
|
||||
from libcpp.unordered_map cimport unordered_map
|
||||
from libcpp.vector cimport vector
|
||||
from libcpp.set cimport set
|
||||
from cpython.exc cimport PyErr_CheckSignals, PyErr_SetFromErrno
|
||||
|
@ -29,8 +31,8 @@ cdef cppclass StateC:
|
|||
vector[int] _stack
|
||||
vector[int] _rebuffer
|
||||
vector[SpanC] _ents
|
||||
vector[ArcC] _left_arcs
|
||||
vector[ArcC] _right_arcs
|
||||
unordered_map[int, vector[ArcC]] _left_arcs
|
||||
unordered_map[int, vector[ArcC]] _right_arcs
|
||||
vector[libcpp.bool] _unshiftable
|
||||
set[int] _sent_starts
|
||||
TokenC _empty_token
|
||||
|
@ -159,15 +161,22 @@ cdef cppclass StateC:
|
|||
else:
|
||||
return &this._sent[i]
|
||||
|
||||
void get_arcs(vector[ArcC]* arcs) nogil const:
|
||||
for i in range(this._left_arcs.size()):
|
||||
arc = this._left_arcs.at(i)
|
||||
if arc.head != -1 and arc.child != -1:
|
||||
arcs.push_back(arc)
|
||||
for i in range(this._right_arcs.size()):
|
||||
arc = this._right_arcs.at(i)
|
||||
if arc.head != -1 and arc.child != -1:
|
||||
arcs.push_back(arc)
|
||||
void map_get_arcs(const unordered_map[int, vector[ArcC]] &heads_arcs, vector[ArcC]* out) nogil const:
|
||||
cdef const vector[ArcC]* arcs
|
||||
head_arcs_it = heads_arcs.const_begin()
|
||||
while head_arcs_it != heads_arcs.const_end():
|
||||
arcs = &deref(head_arcs_it).second
|
||||
arcs_it = arcs.const_begin()
|
||||
while arcs_it != arcs.const_end():
|
||||
arc = deref(arcs_it)
|
||||
if arc.head != -1 and arc.child != -1:
|
||||
out.push_back(arc)
|
||||
incr(arcs_it)
|
||||
incr(head_arcs_it)
|
||||
|
||||
void get_arcs(vector[ArcC]* out) nogil const:
|
||||
this.map_get_arcs(this._left_arcs, out)
|
||||
this.map_get_arcs(this._right_arcs, out)
|
||||
|
||||
int H(int child) nogil const:
|
||||
if child >= this.length or child < 0:
|
||||
|
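
The parser-state changes above replace the flat `vector[ArcC]` containers with `unordered_map`s keyed by head, so queries like `L(head, idx)` and `n_L(head)` only have to walk the arcs of that one head. A rough pure-Python illustration of the data layout (not the Cython code itself):

    from collections import defaultdict

    # One list of arcs per head, mirroring the unordered_map[int, vector[ArcC]] layout.
    left_arcs = defaultdict(list)    # head -> [(head, child, label), ...] with child < head
    right_arcs = defaultdict(list)   # head -> [(head, child, label), ...] with child > head

    def add_arc(head, child, label):
        arcs = left_arcs if head > child else right_arcs
        arcs[head].append((head, child, label))

    def get_arcs():
        out = []
        for heads_arcs in (left_arcs, right_arcs):
            for arcs in heads_arcs.values():
                # -1 marks deleted/placeholder arcs, as in del_arc above.
                out.extend(a for a in arcs if a[0] != -1 and a[1] != -1)
        return out

    add_arc(2, 0, "det")
    add_arc(2, 5, "dobj")
    print(get_arcs())   # [(2, 0, 'det'), (2, 5, 'dobj')]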
@ -181,33 +190,35 @@ cdef cppclass StateC:
|
|||
else:
|
||||
return this._ents.back().start
|
||||
|
||||
int nth_child(const unordered_map[int, vector[ArcC]]& heads_arcs, int head, int idx) nogil const:
|
||||
if idx < 1:
|
||||
return -1
|
||||
|
||||
head_arcs_it = heads_arcs.const_find(head)
|
||||
if head_arcs_it == heads_arcs.const_end():
|
||||
return -1
|
||||
|
||||
cdef const vector[ArcC]* arcs = &deref(head_arcs_it).second
|
||||
|
||||
# Work backwards through arcs to find the arc at the
|
||||
# requested index more quickly.
|
||||
cdef size_t child_index = 0
|
||||
arcs_it = arcs.const_rbegin()
|
||||
while arcs_it != arcs.const_rend() and child_index != idx:
|
||||
arc = deref(arcs_it)
|
||||
if arc.child != -1:
|
||||
child_index += 1
|
||||
if child_index == idx:
|
||||
return arc.child
|
||||
incr(arcs_it)
|
||||
|
||||
return -1
|
||||
|
||||
int L(int head, int idx) nogil const:
|
||||
if idx < 1 or this._left_arcs.size() == 0:
|
||||
return -1
|
||||
cdef vector[int] lefts
|
||||
for i in range(this._left_arcs.size()):
|
||||
arc = this._left_arcs.at(i)
|
||||
if arc.head == head and arc.child != -1 and arc.child < head:
|
||||
lefts.push_back(arc.child)
|
||||
idx = (<int>lefts.size()) - idx
|
||||
if idx < 0:
|
||||
return -1
|
||||
else:
|
||||
return lefts.at(idx)
|
||||
return this.nth_child(this._left_arcs, head, idx)
|
||||
|
||||
int R(int head, int idx) nogil const:
|
||||
if idx < 1 or this._right_arcs.size() == 0:
|
||||
return -1
|
||||
cdef vector[int] rights
|
||||
for i in range(this._right_arcs.size()):
|
||||
arc = this._right_arcs.at(i)
|
||||
if arc.head == head and arc.child != -1 and arc.child > head:
|
||||
rights.push_back(arc.child)
|
||||
idx = (<int>rights.size()) - idx
|
||||
if idx < 0:
|
||||
return -1
|
||||
else:
|
||||
return rights.at(idx)
|
||||
return this.nth_child(this._right_arcs, head, idx)
|
||||
|
||||
bint empty() nogil const:
|
||||
return this._stack.size() == 0
|
||||
|
@ -248,22 +259,29 @@ cdef cppclass StateC:
|
|||
|
||||
int r_edge(int word) nogil const:
|
||||
return word
|
||||
|
||||
int n_L(int head) nogil const:
|
||||
|
||||
int n_arcs(const unordered_map[int, vector[ArcC]] &heads_arcs, int head) nogil const:
|
||||
cdef int n = 0
|
||||
for i in range(this._left_arcs.size()):
|
||||
arc = this._left_arcs.at(i)
|
||||
if arc.head == head and arc.child != -1 and arc.child < arc.head:
|
||||
head_arcs_it = heads_arcs.const_find(head)
|
||||
if head_arcs_it == heads_arcs.const_end():
|
||||
return n
|
||||
|
||||
cdef const vector[ArcC]* arcs = &deref(head_arcs_it).second
|
||||
arcs_it = arcs.const_begin()
|
||||
while arcs_it != arcs.end():
|
||||
arc = deref(arcs_it)
|
||||
if arc.child != -1:
|
||||
n += 1
|
||||
incr(arcs_it)
|
||||
|
||||
return n
|
||||
|
||||
|
||||
int n_L(int head) nogil const:
|
||||
return n_arcs(this._left_arcs, head)
|
||||
|
||||
int n_R(int head) nogil const:
|
||||
cdef int n = 0
|
||||
for i in range(this._right_arcs.size()):
|
||||
arc = this._right_arcs.at(i)
|
||||
if arc.head == head and arc.child != -1 and arc.child > arc.head:
|
||||
n += 1
|
||||
return n
|
||||
return n_arcs(this._right_arcs, head)
|
||||
|
||||
bint stack_is_connected() nogil const:
|
||||
return False
|
||||
|
@ -323,19 +341,20 @@ cdef cppclass StateC:
|
|||
arc.child = child
|
||||
arc.label = label
|
||||
if head > child:
|
||||
this._left_arcs.push_back(arc)
|
||||
this._left_arcs[arc.head].push_back(arc)
|
||||
else:
|
||||
this._right_arcs.push_back(arc)
|
||||
this._right_arcs[arc.head].push_back(arc)
|
||||
this._heads[child] = head
|
||||
|
||||
void del_arc(int h_i, int c_i) nogil:
|
||||
cdef vector[ArcC]* arcs
|
||||
if h_i > c_i:
|
||||
arcs = &this._left_arcs
|
||||
else:
|
||||
arcs = &this._right_arcs
|
||||
void map_del_arc(unordered_map[int, vector[ArcC]]* heads_arcs, int h_i, int c_i) nogil:
|
||||
arcs_it = heads_arcs.find(h_i)
|
||||
if arcs_it == heads_arcs.end():
|
||||
return
|
||||
|
||||
arcs = &deref(arcs_it).second
|
||||
if arcs.size() == 0:
|
||||
return
|
||||
|
||||
arc = arcs.back()
|
||||
if arc.head == h_i and arc.child == c_i:
|
||||
arcs.pop_back()
|
||||
|
@ -348,6 +367,12 @@ cdef cppclass StateC:
|
|||
arc.label = 0
|
||||
break
|
||||
|
||||
void del_arc(int h_i, int c_i) nogil:
|
||||
if h_i > c_i:
|
||||
this.map_del_arc(&this._left_arcs, h_i, c_i)
|
||||
else:
|
||||
this.map_del_arc(&this._right_arcs, h_i, c_i)
|
||||
|
||||
SpanC get_ent() nogil const:
|
||||
cdef SpanC ent
|
||||
if this._ents.size() == 0:
|
||||
|
|
|
@@ -604,7 +604,7 @@ cdef class ArcEager(TransitionSystem):
actions[SHIFT][''] += 1
if min_freq is not None:
for action, label_freqs in actions.items():
for label, freq in list(label_freqs.items()):
for label, freq in label_freqs.copy().items():
if freq < min_freq:
label_freqs.pop(label)
# Ensure these actions are present

@ -4,6 +4,10 @@ for doing pseudo-projective parsing implementation uses the HEAD decoration
|
|||
scheme.
|
||||
"""
|
||||
from copy import copy
|
||||
from libc.limits cimport INT_MAX
|
||||
from libc.stdlib cimport abs
|
||||
from libcpp cimport bool
|
||||
from libcpp.vector cimport vector
|
||||
|
||||
from ...tokens.doc cimport Doc, set_children_from_heads
|
||||
|
||||
|
@ -41,13 +45,18 @@ def contains_cycle(heads):
|
|||
|
||||
|
||||
def is_nonproj_arc(tokenid, heads):
|
||||
cdef vector[int] c_heads = _heads_to_c(heads)
|
||||
return _is_nonproj_arc(tokenid, c_heads)
|
||||
|
||||
|
||||
cdef bool _is_nonproj_arc(int tokenid, const vector[int]& heads) nogil:
|
||||
# definition (e.g. Havelka 2007): an arc h -> d, h < d is non-projective
|
||||
# if there is a token k, h < k < d such that h is not
|
||||
# an ancestor of k. Same for h -> d, h > d
|
||||
head = heads[tokenid]
|
||||
if head == tokenid: # root arcs cannot be non-projective
|
||||
return False
|
||||
elif head is None: # unattached tokens cannot be non-projective
|
||||
elif head < 0: # unattached tokens cannot be non-projective
|
||||
return False
|
||||
|
||||
cdef int start, end
|
||||
|
@ -56,19 +65,29 @@ def is_nonproj_arc(tokenid, heads):
|
|||
else:
|
||||
start, end = (tokenid+1, head)
|
||||
for k in range(start, end):
|
||||
for ancestor in ancestors(k, heads):
|
||||
if ancestor is None: # for unattached tokens/subtrees
|
||||
break
|
||||
elif ancestor == head: # normal case: k dominated by h
|
||||
break
|
||||
if _has_head_as_ancestor(k, head, heads):
|
||||
continue
|
||||
else: # head not in ancestors: d -> h is non-projective
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
cdef bool _has_head_as_ancestor(int tokenid, int head, const vector[int]& heads) nogil:
|
||||
ancestor = tokenid
|
||||
cnt = 0
|
||||
while cnt < heads.size():
|
||||
if heads[ancestor] == head or heads[ancestor] < 0:
|
||||
return True
|
||||
ancestor = heads[ancestor]
|
||||
cnt += 1
|
||||
|
||||
return False
|
||||
|
||||
|
||||
def is_nonproj_tree(heads):
|
||||
cdef vector[int] c_heads = _heads_to_c(heads)
|
||||
# a tree is non-projective if at least one arc is non-projective
|
||||
return any(is_nonproj_arc(word, heads) for word in range(len(heads)))
|
||||
return any(_is_nonproj_arc(word, c_heads) for word in range(len(heads)))
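
For reference, the non-projectivity test above (Havelka 2007: an arc h -> d is non-projective if some token strictly between h and d is not dominated by h) can be written in plain Python like this; it mirrors the Cython helpers, with -1 standing for an unattached head:

    def has_head_as_ancestor(tokenid, head, heads):
        # Follow the head chain; -1 (unattached) is treated as reachable.
        ancestor = tokenid
        for _ in range(len(heads)):
            if heads[ancestor] == head or heads[ancestor] < 0:
                return True
            ancestor = heads[ancestor]
        return False

    def is_nonproj_arc(tokenid, heads):
        head = heads[tokenid]
        if head == tokenid or head < 0:   # roots and unattached tokens are never non-projective
            return False
        start, end = (head + 1, tokenid) if head < tokenid else (tokenid + 1, head)
        return any(not has_head_as_ancestor(k, head, heads) for k in range(start, end))

    print([is_nonproj_arc(i, [1, 1, 4, 1, 1]) for i in range(5)])  # token 2 is non-projective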
|
||||
|
||||
|
||||
def decompose(label):
|
||||
|
@ -98,16 +117,31 @@ def projectivize(heads, labels):
|
|||
# tree, i.e. connected and cycle-free. Returns a new pair (heads, labels)
|
||||
# which encode a projective and decorated tree.
|
||||
proj_heads = copy(heads)
|
||||
smallest_np_arc = _get_smallest_nonproj_arc(proj_heads)
|
||||
if smallest_np_arc is None: # this sentence is already projective
|
||||
|
||||
cdef int new_head
|
||||
cdef vector[int] c_proj_heads = _heads_to_c(proj_heads)
|
||||
cdef int smallest_np_arc = _get_smallest_nonproj_arc(c_proj_heads)
|
||||
if smallest_np_arc == -1: # this sentence is already projective
|
||||
return proj_heads, copy(labels)
|
||||
while smallest_np_arc is not None:
|
||||
_lift(smallest_np_arc, proj_heads)
|
||||
smallest_np_arc = _get_smallest_nonproj_arc(proj_heads)
|
||||
while smallest_np_arc != -1:
|
||||
new_head = _lift(smallest_np_arc, proj_heads)
|
||||
c_proj_heads[smallest_np_arc] = new_head
|
||||
smallest_np_arc = _get_smallest_nonproj_arc(c_proj_heads)
|
||||
deco_labels = _decorate(heads, proj_heads, labels)
|
||||
return proj_heads, deco_labels
|
||||
|
||||
|
||||
cdef vector[int] _heads_to_c(heads):
|
||||
cdef vector[int] c_heads;
|
||||
for head in heads:
|
||||
if head == None:
|
||||
c_heads.push_back(-1)
|
||||
else:
|
||||
assert head < len(heads)
|
||||
c_heads.push_back(head)
|
||||
return c_heads
|
||||
|
||||
|
||||
cpdef deprojectivize(Doc doc):
|
||||
# Reattach arcs with decorated labels (following HEAD scheme). For each
|
||||
# decorated arc X||Y, search top-down, left-to-right, breadth-first until
|
||||
|
@ -137,27 +171,38 @@ def _decorate(heads, proj_heads, labels):
|
|||
deco_labels.append(labels[tokenid])
|
||||
return deco_labels
|
||||
|
||||
def get_smallest_nonproj_arc_slow(heads):
|
||||
cdef vector[int] c_heads = _heads_to_c(heads)
|
||||
return _get_smallest_nonproj_arc(c_heads)
|
||||
|
||||
def _get_smallest_nonproj_arc(heads):
|
||||
|
||||
cdef int _get_smallest_nonproj_arc(const vector[int]& heads) nogil:
|
||||
# return the smallest non-proj arc or None
|
||||
# where size is defined as the distance between dep and head
|
||||
# and ties are broken left to right
|
||||
smallest_size = float('inf')
|
||||
smallest_np_arc = None
|
||||
for tokenid, head in enumerate(heads):
|
||||
cdef int smallest_size = INT_MAX
|
||||
cdef int smallest_np_arc = -1
|
||||
cdef int size
|
||||
cdef int tokenid
|
||||
cdef int head
|
||||
|
||||
for tokenid in range(heads.size()):
|
||||
head = heads[tokenid]
|
||||
size = abs(tokenid-head)
|
||||
if size < smallest_size and is_nonproj_arc(tokenid, heads):
|
||||
if size < smallest_size and _is_nonproj_arc(tokenid, heads):
|
||||
smallest_size = size
|
||||
smallest_np_arc = tokenid
|
||||
return smallest_np_arc
|
||||
|
||||
|
||||
def _lift(tokenid, heads):
|
||||
cpdef int _lift(tokenid, heads):
|
||||
# reattaches a word to it's grandfather
|
||||
head = heads[tokenid]
|
||||
ghead = heads[head]
|
||||
cdef int new_head = ghead if head != ghead else tokenid
|
||||
# attach to ghead if head isn't attached to root else attach to root
|
||||
heads[tokenid] = ghead if head != ghead else tokenid
|
||||
heads[tokenid] = new_head
|
||||
return new_head
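
The lifting step itself is tiny: a non-projective dependent is reattached to its grandparent, or becomes a root if its head already is one. In plain Python (illustration only, mirroring the hunk above):

    def lift(tokenid, heads):
        head = heads[tokenid]
        ghead = heads[head]
        # Attach to the grandparent unless the head is already a root,
        # in which case the token is made a root itself.
        new_head = ghead if head != ghead else tokenid
        heads[tokenid] = new_head
        return new_head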
|
||||
|
||||
|
||||
def _find_new_head(token, headlabel):
|
||||
|
|
|
@ -348,6 +348,46 @@ class EntityRuler(Pipe):
|
|||
self.nlp.vocab, attr=self.phrase_matcher_attr, validate=self._validate
|
||||
)
|
||||
|
||||
def remove(self, ent_id: str) -> None:
|
||||
"""Remove a pattern by its ent_id if a pattern with this ent_id was added before
|
||||
|
||||
ent_id (str): id of the pattern to be removed
|
||||
RETURNS: None
|
||||
DOCS: https://spacy.io/api/entityruler#remove
|
||||
"""
|
||||
label_id_pairs = [
|
||||
(label, eid) for (label, eid) in self._ent_ids.values() if eid == ent_id
|
||||
]
|
||||
if not label_id_pairs:
|
||||
raise ValueError(Errors.E1024.format(ent_id=ent_id))
|
||||
created_labels = [
|
||||
self._create_label(label, eid) for (label, eid) in label_id_pairs
|
||||
]
|
||||
# remove the patterns from self.phrase_patterns
|
||||
self.phrase_patterns = defaultdict(
|
||||
list,
|
||||
{
|
||||
label: val
|
||||
for (label, val) in self.phrase_patterns.items()
|
||||
if label not in created_labels
|
||||
},
|
||||
)
|
||||
# remove the patterns from self.token_pattern
|
||||
self.token_patterns = defaultdict(
|
||||
list,
|
||||
{
|
||||
label: val
|
||||
for (label, val) in self.token_patterns.items()
|
||||
if label not in created_labels
|
||||
},
|
||||
)
|
||||
# remove the patterns from self.token_pattern
|
||||
for label in created_labels:
|
||||
if label in self.phrase_matcher:
|
||||
self.phrase_matcher.remove(label)
|
||||
else:
|
||||
self.matcher.remove(label)
|
||||
|
||||
def _require_patterns(self) -> None:
|
||||
"""Raise a warning if this component has no patterns defined."""
|
||||
if len(self) == 0:
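
A usage sketch for the new `EntityRuler.remove` method, assuming patterns were added with an `id` so they can later be removed by `ent_id`:

    import spacy

    nlp = spacy.blank("en")
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns([
        {"label": "ORG", "pattern": "Apple", "id": "apple"},
        {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}], "id": "sf"},
    ])
    ruler.remove("apple")   # removes every pattern that was added with this ent_id
    print(nlp("Apple opened an office in San Francisco").ents)  # only the GPE span remains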
|
||||
|
|
|
@ -231,12 +231,13 @@ class Morphologizer(Tagger):
|
|||
cdef Vocab vocab = self.vocab
|
||||
cdef bint overwrite = self.cfg["overwrite"]
|
||||
cdef bint extend = self.cfg["extend"]
|
||||
labels = self.labels
|
||||
for i, doc in enumerate(docs):
|
||||
doc_tag_ids = batch_tag_ids[i]
|
||||
if hasattr(doc_tag_ids, "get"):
|
||||
doc_tag_ids = doc_tag_ids.get()
|
||||
for j, tag_id in enumerate(doc_tag_ids):
|
||||
morph = self.labels[tag_id]
|
||||
morph = labels[tag_id]
|
||||
# set morph
|
||||
if doc.c[j].morph == 0 or overwrite or extend:
|
||||
if overwrite and extend:
|
||||
|
|
|
@@ -26,6 +26,8 @@ class Pipe:
@property
def labels(self) -> Tuple[str, ...]: ...
@property
def hide_labels(self) -> bool: ...
@property
def label_data(self) -> Any: ...
def _require_labels(self) -> None: ...
def set_error_handler(

@ -102,6 +102,10 @@ cdef class Pipe:
|
|||
def labels(self) -> Tuple[str, ...]:
|
||||
return tuple()
|
||||
|
||||
@property
|
||||
def hide_labels(self) -> bool:
|
||||
return False
|
||||
|
||||
@property
|
||||
def label_data(self):
|
||||
"""Optional JSON-serializable data that would be sufficient to recreate
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
# cython: infer_types=True, profile=True, binding=True
|
||||
from itertools import islice
|
||||
from typing import Optional, Callable
|
||||
from itertools import islice
|
||||
|
||||
import srsly
|
||||
from thinc.api import Model, SequenceCategoricalCrossentropy, Config
|
||||
|
@ -99,6 +99,10 @@ class SentenceRecognizer(Tagger):
|
|||
# are 0
|
||||
return tuple(["I", "S"])
|
||||
|
||||
@property
|
||||
def hide_labels(self):
|
||||
return True
|
||||
|
||||
@property
|
||||
def label_data(self):
|
||||
return None
|
||||
|
|
|
@ -1,9 +1,10 @@
|
|||
import numpy
|
||||
from typing import List, Dict, Callable, Tuple, Optional, Iterable, Any, cast
|
||||
from thinc.api import Config, Model, get_current_ops, set_dropout_rate, Ops
|
||||
from thinc.api import Optimizer
|
||||
from thinc.types import Ragged, Ints2d, Floats2d, Ints1d
|
||||
|
||||
import numpy
|
||||
|
||||
from ..compat import Protocol, runtime_checkable
|
||||
from ..scorer import Scorer
|
||||
from ..language import Language
|
||||
|
@ -377,7 +378,7 @@ class SpanCategorizer(TrainablePipe):
|
|||
# If the prediction is 0.9 and it's false, the gradient will be
|
||||
# 0.9 (0.9 - 0.0)
|
||||
d_scores = scores - target
|
||||
loss = float((d_scores ** 2).sum())
|
||||
loss = float((d_scores**2).sum())
|
||||
return loss, d_scores
|
||||
|
||||
def initialize(
|
||||
|
@ -412,7 +413,7 @@ class SpanCategorizer(TrainablePipe):
|
|||
self._require_labels()
|
||||
if subbatch:
|
||||
docs = [eg.x for eg in subbatch]
|
||||
spans = self.suggester(docs)
|
||||
spans = build_ngram_suggester(sizes=[1])(docs)
|
||||
Y = self.model.ops.alloc2f(spans.dataXd.shape[0], len(self.labels))
|
||||
self.model.initialize(X=(docs, spans), Y=Y)
|
||||
else:
|
||||
|
|
|
@ -45,7 +45,7 @@ DEFAULT_TAGGER_MODEL = Config().from_str(default_model_config)["model"]
|
|||
@Language.factory(
|
||||
"tagger",
|
||||
assigns=["token.tag"],
|
||||
default_config={"model": DEFAULT_TAGGER_MODEL, "overwrite": False, "scorer": {"@scorers": "spacy.tagger_scorer.v1"}},
|
||||
default_config={"model": DEFAULT_TAGGER_MODEL, "overwrite": False, "scorer": {"@scorers": "spacy.tagger_scorer.v1"}, "neg_prefix": "!"},
|
||||
default_score_weights={"tag_acc": 1.0},
|
||||
)
|
||||
def make_tagger(
|
||||
|
@ -54,6 +54,7 @@ def make_tagger(
|
|||
model: Model,
|
||||
overwrite: bool,
|
||||
scorer: Optional[Callable],
|
||||
neg_prefix: str,
|
||||
):
|
||||
"""Construct a part-of-speech tagger component.
|
||||
|
||||
|
@ -62,7 +63,7 @@ def make_tagger(
|
|||
in size, and be normalized as probabilities (all scores between 0 and 1,
|
||||
with the rows summing to 1).
|
||||
"""
|
||||
return Tagger(nlp.vocab, model, name, overwrite=overwrite, scorer=scorer)
|
||||
return Tagger(nlp.vocab, model, name, overwrite=overwrite, scorer=scorer, neg_prefix=neg_prefix)
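
The new `neg_prefix` setting is threaded from the factory config into `SequenceCategoricalCrossentropy`, so a gold tag like `!VERB` can mean "any tag except VERB". A small config sketch on this branch (the `!` default is simply made explicit here):

    import spacy

    nlp = spacy.blank("en")
    tagger = nlp.add_pipe("tagger", config={"neg_prefix": "!"})
    print(tagger.cfg["neg_prefix"])  # "!"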
|
||||
|
||||
|
||||
def tagger_score(examples, **kwargs):
|
||||
|
@ -87,6 +88,7 @@ class Tagger(TrainablePipe):
|
|||
*,
|
||||
overwrite=BACKWARD_OVERWRITE,
|
||||
scorer=tagger_score,
|
||||
neg_prefix="!",
|
||||
):
|
||||
"""Initialize a part-of-speech tagger.
|
||||
|
||||
|
@ -103,7 +105,7 @@ class Tagger(TrainablePipe):
|
|||
self.model = model
|
||||
self.name = name
|
||||
self._rehearsal_model = None
|
||||
cfg = {"labels": [], "overwrite": overwrite}
|
||||
cfg = {"labels": [], "overwrite": overwrite, "neg_prefix": neg_prefix}
|
||||
self.cfg = dict(sorted(cfg.items()))
|
||||
self.scorer = scorer
|
||||
|
||||
|
@ -166,13 +168,14 @@ class Tagger(TrainablePipe):
|
|||
cdef Doc doc
|
||||
cdef Vocab vocab = self.vocab
|
||||
cdef bint overwrite = self.cfg["overwrite"]
|
||||
labels = self.labels
|
||||
for i, doc in enumerate(docs):
|
||||
doc_tag_ids = batch_tag_ids[i]
|
||||
if hasattr(doc_tag_ids, "get"):
|
||||
doc_tag_ids = doc_tag_ids.get()
|
||||
for j, tag_id in enumerate(doc_tag_ids):
|
||||
if doc.c[j].tag == 0 or overwrite:
|
||||
doc.c[j].tag = self.vocab.strings[self.labels[tag_id]]
|
||||
doc.c[j].tag = self.vocab.strings[labels[tag_id]]
|
||||
|
||||
def update(self, examples, *, drop=0., sgd=None, losses=None):
|
||||
"""Learn from a batch of documents and gold-standard information,
|
||||
|
@ -222,6 +225,7 @@ class Tagger(TrainablePipe):
|
|||
|
||||
DOCS: https://spacy.io/api/tagger#rehearse
|
||||
"""
|
||||
loss_func = SequenceCategoricalCrossentropy()
|
||||
if losses is None:
|
||||
losses = {}
|
||||
losses.setdefault(self.name, 0.0)
|
||||
|
@ -233,12 +237,12 @@ class Tagger(TrainablePipe):
|
|||
# Handle cases where there are no tokens in any docs.
|
||||
return losses
|
||||
set_dropout_rate(self.model, drop)
|
||||
guesses, backprop = self.model.begin_update(docs)
|
||||
target = self._rehearsal_model(examples)
|
||||
gradient = guesses - target
|
||||
backprop(gradient)
|
||||
tag_scores, bp_tag_scores = self.model.begin_update(docs)
|
||||
tutor_tag_scores, _ = self._rehearsal_model.begin_update(docs)
|
||||
grads, loss = loss_func(tag_scores, tutor_tag_scores)
|
||||
bp_tag_scores(grads)
|
||||
self.finish_update(sgd)
|
||||
losses[self.name] += (gradient**2).sum()
|
||||
losses[self.name] += loss
|
||||
return losses
|
||||
|
||||
def get_loss(self, examples, scores):
|
||||
|
@ -252,7 +256,7 @@ class Tagger(TrainablePipe):
|
|||
DOCS: https://spacy.io/api/tagger#get_loss
|
||||
"""
|
||||
validate_examples(examples, "Tagger.get_loss")
|
||||
loss_func = SequenceCategoricalCrossentropy(names=self.labels, normalize=False, neg_prefix="!")
|
||||
loss_func = SequenceCategoricalCrossentropy(names=self.labels, normalize=False, neg_prefix=self.cfg["neg_prefix"])
|
||||
# Convert empty tag "" to missing value None so that both misaligned
|
||||
# tokens and tokens with missing annotation have the default missing
|
||||
# value None.
|
||||
|
|
|
@ -1,8 +1,8 @@
|
|||
from itertools import islice
|
||||
from typing import Iterable, Tuple, Optional, Dict, List, Callable, Any
|
||||
from thinc.api import get_array_module, Model, Optimizer, set_dropout_rate, Config
|
||||
from thinc.types import Floats2d
|
||||
import numpy
|
||||
from itertools import islice
|
||||
|
||||
from .trainable_pipe import TrainablePipe
|
||||
from ..language import Language
|
||||
|
@ -158,6 +158,13 @@ class TextCategorizer(TrainablePipe):
|
|||
self.cfg = dict(cfg)
|
||||
self.scorer = scorer
|
||||
|
||||
@property
|
||||
def support_missing_values(self):
|
||||
# There are no missing values as the textcat should always
|
||||
# predict exactly one label. All other labels are 0.0
|
||||
# Subclasses may override this property to change internal behaviour.
|
||||
return False
|
||||
|
||||
@property
|
||||
def labels(self) -> Tuple[str]:
|
||||
"""RETURNS (Tuple[str]): The labels currently added to the component.
|
||||
|
@ -276,12 +283,12 @@ class TextCategorizer(TrainablePipe):
|
|||
return losses
|
||||
set_dropout_rate(self.model, drop)
|
||||
scores, bp_scores = self.model.begin_update(docs)
|
||||
target = self._rehearsal_model(examples)
|
||||
target, _ = self._rehearsal_model.begin_update(docs)
|
||||
gradient = scores - target
|
||||
bp_scores(gradient)
|
||||
if sgd is not None:
|
||||
self.finish_update(sgd)
|
||||
losses[self.name] += (gradient ** 2).sum()
|
||||
losses[self.name] += (gradient**2).sum()
|
||||
return losses
|
||||
|
||||
def _examples_to_truth(
|
||||
|
@ -294,7 +301,7 @@ class TextCategorizer(TrainablePipe):
|
|||
for j, label in enumerate(self.labels):
|
||||
if label in eg.reference.cats:
|
||||
truths[i, j] = eg.reference.cats[label]
|
||||
else:
|
||||
elif self.support_missing_values:
|
||||
not_missing[i, j] = 0.0
|
||||
truths = self.model.ops.asarray(truths) # type: ignore
|
||||
return truths, not_missing # type: ignore
|
||||
|
@ -313,9 +320,9 @@ class TextCategorizer(TrainablePipe):
|
|||
self._validate_categories(examples)
|
||||
truths, not_missing = self._examples_to_truth(examples)
|
||||
not_missing = self.model.ops.asarray(not_missing) # type: ignore
|
||||
d_scores = (scores - truths) / scores.shape[0]
|
||||
d_scores = scores - truths
|
||||
d_scores *= not_missing
|
||||
mean_square_error = (d_scores ** 2).sum(axis=1).mean()
|
||||
mean_square_error = (d_scores**2).mean()
|
||||
return float(mean_square_error), d_scores
|
||||
|
||||
def add_label(self, label: str) -> int:
|
||||
|
|
|
@ -1,8 +1,8 @@
|
|||
from itertools import islice
|
||||
from typing import Iterable, Optional, Dict, List, Callable, Any
|
||||
|
||||
from thinc.api import Model, Config
|
||||
from thinc.types import Floats2d
|
||||
from thinc.api import Model, Config
|
||||
|
||||
from itertools import islice
|
||||
|
||||
from ..language import Language
|
||||
from ..training import Example, validate_get_examples
|
||||
|
@ -158,6 +158,10 @@ class MultiLabel_TextCategorizer(TextCategorizer):
|
|||
self.cfg = dict(cfg)
|
||||
self.scorer = scorer
|
||||
|
||||
@property
|
||||
def support_missing_values(self):
|
||||
return True
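
The `support_missing_values` property splits behaviour between the two text categorizers: with exclusive labels a category absent from the gold annotation counts as an explicit 0.0, while the multilabel component masks it out of the loss. A conceptual sketch of how that shapes the training targets (simplified, not the actual `_examples_to_truth` code):

    labels = ["A", "B", "C"]
    gold_cats = {"A": 1.0}   # "B" and "C" were never annotated

    def truth_and_mask(support_missing_values):
        truths = [gold_cats.get(label, 0.0) for label in labels]
        # not_missing masks out labels with no gold value when missing values are supported
        mask = [1.0 if (label in gold_cats or not support_missing_values) else 0.0
                for label in labels]
        return truths, mask

    print(truth_and_mask(False))  # exclusive textcat: missing labels become 0.0 targets
    print(truth_and_mask(True))   # multilabel: missing labels are excluded from the loss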
|
||||
|
||||
def initialize( # type: ignore[override]
|
||||
self,
|
||||
get_examples: Callable[[], Iterable[Example]],
|
||||
|
|
|
@ -118,6 +118,10 @@ class Tok2Vec(TrainablePipe):
|
|||
|
||||
DOCS: https://spacy.io/api/tok2vec#predict
|
||||
"""
|
||||
if not any(len(doc) for doc in docs):
|
||||
# Handle cases where there are no tokens in any docs.
|
||||
width = self.model.get_dim("nO")
|
||||
return [self.model.ops.alloc((0, width)) for doc in docs]
|
||||
tokvecs = self.model.predict(docs)
|
||||
batch_id = Tok2VecListener.get_batch_id(docs)
|
||||
for listener in self.listeners:
|
||||
|
|
|
@@ -1,5 +1,6 @@
from typing import Dict, List, Union, Optional, Any, Callable, Type, Tuple
from typing import Iterable, TypeVar, TYPE_CHECKING
from .compat import Literal
from enum import Enum
from pydantic import BaseModel, Field, ValidationError, validator, create_model
from pydantic import StrictStr, StrictInt, StrictFloat, StrictBool

@@ -209,6 +210,7 @@ NumberValue = Union[TokenPatternNumber, StrictInt, StrictFloat]
UnderscoreValue = Union[
TokenPatternString, TokenPatternNumber, str, int, float, list, bool
]
IobValue = Literal["", "I", "O", "B", 0, 1, 2, 3]


class TokenPattern(BaseModel):

@@ -222,6 +224,7 @@ class TokenPattern(BaseModel):
lemma: Optional[StringValue] = None
shape: Optional[StringValue] = None
ent_type: Optional[StringValue] = None
ent_iob: Optional[IobValue] = None
ent_id: Optional[StringValue] = None
ent_kb_id: Optional[StringValue] = None
norm: Optional[StringValue] = None
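
Because `ent_iob` is now part of the token pattern schema, invalid IOB values are rejected by pattern validation before they reach the matcher. A quick check (the exact wording of the returned error strings may differ):

    from spacy.schemas import validate_token_pattern

    print(validate_token_pattern([{"ENT_IOB": "B"}]))  # [] -> valid pattern
    print(validate_token_pattern([{"ENT_IOB": "Q"}]))  # non-empty list of validation errors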
|
||||
|
|
|
@ -445,7 +445,8 @@ class Scorer:
|
|||
getter(doc, attr) should return the values for the individual doc.
|
||||
labels (Iterable[str]): The set of possible labels. Defaults to [].
|
||||
multi_label (bool): Whether the attribute allows multiple labels.
|
||||
Defaults to True.
|
||||
Defaults to True. When set to False (exclusive labels), missing
|
||||
gold labels are interpreted as 0.0.
|
||||
positive_label (str): The positive label for a binary task with
|
||||
exclusive classes. Defaults to None.
|
||||
threshold (float): Cutoff to consider a prediction "positive". Defaults
|
||||
|
@ -484,13 +485,15 @@ class Scorer:
|
|||
|
||||
for label in labels:
|
||||
pred_score = pred_cats.get(label, 0.0)
|
||||
gold_score = gold_cats.get(label, 0.0)
|
||||
gold_score = gold_cats.get(label)
|
||||
if not gold_score and not multi_label:
|
||||
gold_score = 0.0
|
||||
if gold_score is not None:
|
||||
auc_per_type[label].score_set(pred_score, gold_score)
|
||||
if multi_label:
|
||||
for label in labels:
|
||||
pred_score = pred_cats.get(label, 0.0)
|
||||
gold_score = gold_cats.get(label, 0.0)
|
||||
gold_score = gold_cats.get(label)
|
||||
if gold_score is not None:
|
||||
if pred_score >= threshold and gold_score > 0:
|
||||
f_per_type[label].tp += 1
|
||||
|
@ -502,16 +505,15 @@ class Scorer:
|
|||
# Get the highest-scoring for each.
|
||||
pred_label, pred_score = max(pred_cats.items(), key=lambda it: it[1])
|
||||
gold_label, gold_score = max(gold_cats.items(), key=lambda it: it[1])
|
||||
if gold_score is not None:
|
||||
if pred_label == gold_label and pred_score >= threshold:
|
||||
f_per_type[pred_label].tp += 1
|
||||
else:
|
||||
f_per_type[gold_label].fn += 1
|
||||
if pred_score >= threshold:
|
||||
f_per_type[pred_label].fp += 1
|
||||
if pred_label == gold_label and pred_score >= threshold:
|
||||
f_per_type[pred_label].tp += 1
|
||||
else:
|
||||
f_per_type[gold_label].fn += 1
|
||||
if pred_score >= threshold:
|
||||
f_per_type[pred_label].fp += 1
|
||||
elif gold_cats:
|
||||
gold_label, gold_score = max(gold_cats, key=lambda it: it[1])
|
||||
if gold_score is not None and gold_score > 0:
|
||||
if gold_score > 0:
|
||||
f_per_type[gold_label].fn += 1
|
||||
elif pred_cats:
|
||||
pred_label, pred_score = max(pred_cats.items(), key=lambda it: it[1])
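
The scorer change above distinguishes "not annotated" (`gold_cats.get(label)` returning None) from an explicit 0.0, and only falls back to 0.0 for exclusive labels. In isolation the decision looks like this:

    gold_cats = {"POSITIVE": 1.0}   # "NEGATIVE" was never annotated
    label = "NEGATIVE"
    multi_label = False

    gold_score = gold_cats.get(label)        # None: no gold annotation for this label
    if not gold_score and not multi_label:
        gold_score = 0.0                     # exclusive classes: missing gold counts as 0.0
    if gold_score is not None:
        print("scored:", label, gold_score)  # scored: NEGATIVE 0.0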
|
||||
|
|
|
@ -51,6 +51,11 @@ def tokenizer():
|
|||
return get_lang_class("xx")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def af_tokenizer():
|
||||
return get_lang_class("af")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def am_tokenizer():
|
||||
return get_lang_class("am")().tokenizer
|
||||
|
@ -127,6 +132,11 @@ def es_vocab():
|
|||
return get_lang_class("es")().vocab
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def et_tokenizer():
|
||||
return get_lang_class("et")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def eu_tokenizer():
|
||||
return get_lang_class("eu")().tokenizer
|
||||
|
@ -147,6 +157,11 @@ def fr_tokenizer():
|
|||
return get_lang_class("fr")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def fr_vocab():
|
||||
return get_lang_class("fr")().vocab
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def ga_tokenizer():
|
||||
return get_lang_class("ga")().tokenizer
|
||||
|
@ -187,11 +202,21 @@ def id_tokenizer():
|
|||
return get_lang_class("id")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def is_tokenizer():
|
||||
return get_lang_class("is")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def it_tokenizer():
|
||||
return get_lang_class("it")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def it_vocab():
|
||||
return get_lang_class("it")().vocab
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def ja_tokenizer():
|
||||
pytest.importorskip("sudachipy")
|
||||
|
@ -204,6 +229,19 @@ def ko_tokenizer():
|
|||
return get_lang_class("ko")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def ko_tokenizer_tokenizer():
|
||||
config = {
|
||||
"nlp": {
|
||||
"tokenizer": {
|
||||
"@tokenizers": "spacy.Tokenizer.v1",
|
||||
}
|
||||
}
|
||||
}
|
||||
nlp = get_lang_class("ko").from_config(config)
|
||||
return nlp.tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def lb_tokenizer():
|
||||
return get_lang_class("lb")().tokenizer
|
||||
|
@ -214,6 +252,11 @@ def lt_tokenizer():
|
|||
return get_lang_class("lt")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def lv_tokenizer():
|
||||
return get_lang_class("lv")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def mk_tokenizer():
|
||||
return get_lang_class("mk")().tokenizer
|
||||
|
@ -281,11 +324,26 @@ def sa_tokenizer():
|
|||
return get_lang_class("sa")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def sk_tokenizer():
|
||||
return get_lang_class("sk")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def sl_tokenizer():
|
||||
return get_lang_class("sl")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def sr_tokenizer():
|
||||
return get_lang_class("sr")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def sq_tokenizer():
|
||||
return get_lang_class("sq")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def sv_tokenizer():
|
||||
return get_lang_class("sv")().tokenizer
|
||||
|
@ -346,6 +404,11 @@ def vi_tokenizer():
|
|||
return get_lang_class("vi")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def xx_tokenizer():
|
||||
return get_lang_class("xx")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def yo_tokenizer():
|
||||
return get_lang_class("yo")().tokenizer
|
||||
|
|
|
@ -1,8 +1,31 @@
|
|||
import numpy
|
||||
import pytest
|
||||
|
||||
from spacy.tokens import Doc
|
||||
from spacy.attrs import ORTH, SHAPE, POS, DEP, MORPH
|
||||
|
||||
|
||||
@pytest.mark.issue(2203)
|
||||
def test_issue2203(en_vocab):
|
||||
"""Test that lemmas are set correctly in doc.from_array."""
|
||||
words = ["I", "'ll", "survive"]
|
||||
tags = ["PRP", "MD", "VB"]
|
||||
lemmas = ["-PRON-", "will", "survive"]
|
||||
tag_ids = [en_vocab.strings.add(tag) for tag in tags]
|
||||
lemma_ids = [en_vocab.strings.add(lemma) for lemma in lemmas]
|
||||
doc = Doc(en_vocab, words=words)
|
||||
# Work around lemma corruption problem and set lemmas after tags
|
||||
doc.from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
|
||||
doc.from_array("LEMMA", numpy.array(lemma_ids, dtype="uint64"))
|
||||
assert [t.tag_ for t in doc] == tags
|
||||
assert [t.lemma_ for t in doc] == lemmas
|
||||
# We need to serialize both tag and lemma, since this is what causes the bug
|
||||
doc_array = doc.to_array(["TAG", "LEMMA"])
|
||||
new_doc = Doc(doc.vocab, words=words).from_array(["TAG", "LEMMA"], doc_array)
|
||||
assert [t.tag_ for t in new_doc] == tags
|
||||
assert [t.lemma_ for t in new_doc] == lemmas
|
||||
|
||||
|
||||
def test_doc_array_attr_of_token(en_vocab):
|
||||
doc = Doc(en_vocab, words=["An", "example", "sentence"])
|
||||
example = doc.vocab["example"]
|
||||
|
|
|
@ -1,14 +1,17 @@
|
|||
import weakref
|
||||
|
||||
import pytest
|
||||
import numpy
|
||||
import pytest
|
||||
from thinc.api import NumpyOps, get_current_ops
|
||||
|
||||
from spacy.attrs import DEP, ENT_IOB, ENT_TYPE, HEAD, IS_ALPHA, MORPH, POS
|
||||
from spacy.attrs import SENT_START, TAG
|
||||
from spacy.lang.en import English
|
||||
from spacy.lang.xx import MultiLanguage
|
||||
from spacy.language import Language
|
||||
from spacy.lexeme import Lexeme
|
||||
from spacy.tokens import Doc, Span, Token
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.lexeme import Lexeme
|
||||
from spacy.lang.en import English
|
||||
from spacy.attrs import ENT_TYPE, ENT_IOB, SENT_START, HEAD, DEP, MORPH
|
||||
|
||||
from .test_underscore import clean_underscore # noqa: F401
|
||||
|
||||
|
@ -30,6 +33,220 @@ def test_doc_api_init(en_vocab):
|
|||
assert [t.is_sent_start for t in doc] == [True, False, True, False]
|
||||
|
||||
|
||||
@pytest.mark.issue(1547)
|
||||
def test_issue1547():
|
||||
"""Test that entity labels still match after merging tokens."""
|
||||
words = ["\n", "worda", ".", "\n", "wordb", "-", "Biosphere", "2", "-", " \n"]
|
||||
doc = Doc(Vocab(), words=words)
|
||||
doc.ents = [Span(doc, 6, 8, label=doc.vocab.strings["PRODUCT"])]
|
||||
with doc.retokenize() as retokenizer:
|
||||
retokenizer.merge(doc[5:7])
|
||||
assert [ent.text for ent in doc.ents]
|
||||
|
||||
|
||||
@pytest.mark.issue(1757)
|
||||
def test_issue1757():
|
||||
"""Test comparison against None doesn't cause segfault."""
|
||||
doc = Doc(Vocab(), words=["a", "b", "c"])
|
||||
assert not doc[0] < None
|
||||
assert not doc[0] is None
|
||||
assert doc[0] >= None
|
||||
assert not doc[:2] < None
|
||||
assert not doc[:2] is None
|
||||
assert doc[:2] >= None
|
||||
assert not doc.vocab["a"] is None
|
||||
assert not doc.vocab["a"] < None
|
||||
|
||||
|
||||
@pytest.mark.issue(2396)
|
||||
def test_issue2396(en_vocab):
|
||||
words = ["She", "created", "a", "test", "for", "spacy"]
|
||||
heads = [1, 1, 3, 1, 3, 4]
|
||||
deps = ["dep"] * len(heads)
|
||||
matrix = numpy.array(
|
||||
[
|
||||
[0, 1, 1, 1, 1, 1],
|
||||
[1, 1, 1, 1, 1, 1],
|
||||
[1, 1, 2, 3, 3, 3],
|
||||
[1, 1, 3, 3, 3, 3],
|
||||
[1, 1, 3, 3, 4, 4],
|
||||
[1, 1, 3, 3, 4, 5],
|
||||
],
|
||||
dtype=numpy.int32,
|
||||
)
|
||||
doc = Doc(en_vocab, words=words, heads=heads, deps=deps)
|
||||
span = doc[:]
|
||||
assert (doc.get_lca_matrix() == matrix).all()
|
||||
assert (span.get_lca_matrix() == matrix).all()
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text", ["-0.23", "+123,456", "±1"])
|
||||
@pytest.mark.parametrize("lang_cls", [English, MultiLanguage])
|
||||
@pytest.mark.issue(2782)
|
||||
def test_issue2782(text, lang_cls):
|
||||
"""Check that like_num handles + and - before number."""
|
||||
nlp = lang_cls()
|
||||
doc = nlp(text)
|
||||
assert len(doc) == 1
|
||||
assert doc[0].like_num
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"sentence",
|
||||
[
|
||||
"The story was to the effect that a young American student recently called on Professor Christlieb with a letter of introduction.",
|
||||
"The next month Barry Siddall joined Stoke City on a free transfer, after Chris Pearce had established himself as the Vale's #1.",
|
||||
"The next month Barry Siddall joined Stoke City on a free transfer, after Chris Pearce had established himself as the Vale's number one",
|
||||
"Indeed, making the one who remains do all the work has installed him into a position of such insolent tyranny, it will take a month at least to reduce him to his proper proportions.",
|
||||
"It was a missed assignment, but it shouldn't have resulted in a turnover ...",
|
||||
],
|
||||
)
|
||||
@pytest.mark.issue(3869)
|
||||
def test_issue3869(sentence):
|
||||
"""Test that the Doc's count_by function works consistently"""
|
||||
nlp = English()
|
||||
doc = nlp(sentence)
|
||||
count = 0
|
||||
for token in doc:
|
||||
count += token.is_alpha
|
||||
assert count == doc.count_by(IS_ALPHA).get(1, 0)
|
||||
|
||||
|
||||
@pytest.mark.issue(3962)
|
||||
def test_issue3962(en_vocab):
|
||||
"""Ensure that as_doc does not result in out-of-bound access of tokens.
|
||||
This is achieved by setting the head to itself if it would lie out of the span otherwise."""
|
||||
# fmt: off
|
||||
words = ["He", "jests", "at", "scars", ",", "that", "never", "felt", "a", "wound", "."]
|
||||
heads = [1, 7, 1, 2, 7, 7, 7, 7, 9, 7, 7]
|
||||
deps = ["nsubj", "ccomp", "prep", "pobj", "punct", "nsubj", "neg", "ROOT", "det", "dobj", "punct"]
|
||||
# fmt: on
|
||||
doc = Doc(en_vocab, words=words, heads=heads, deps=deps)
|
||||
span2 = doc[1:5] # "jests at scars ,"
|
||||
doc2 = span2.as_doc()
|
||||
doc2_json = doc2.to_json()
|
||||
assert doc2_json
|
||||
# head set to itself, being the new artificial root
|
||||
assert doc2[0].head.text == "jests"
|
||||
assert doc2[0].dep_ == "dep"
|
||||
assert doc2[1].head.text == "jests"
|
||||
assert doc2[1].dep_ == "prep"
|
||||
assert doc2[2].head.text == "at"
|
||||
assert doc2[2].dep_ == "pobj"
|
||||
assert doc2[3].head.text == "jests" # head set to the new artificial root
|
||||
assert doc2[3].dep_ == "dep"
|
||||
# We should still have 1 sentence
|
||||
assert len(list(doc2.sents)) == 1
|
||||
span3 = doc[6:9] # "never felt a"
|
||||
doc3 = span3.as_doc()
|
||||
doc3_json = doc3.to_json()
|
||||
assert doc3_json
|
||||
assert doc3[0].head.text == "felt"
|
||||
assert doc3[0].dep_ == "neg"
|
||||
assert doc3[1].head.text == "felt"
|
||||
assert doc3[1].dep_ == "ROOT"
|
||||
assert doc3[2].head.text == "felt" # head set to ancestor
|
||||
assert doc3[2].dep_ == "dep"
|
||||
# We should still have 1 sentence as "a" can be attached to "felt" instead of "wound"
|
||||
assert len(list(doc3.sents)) == 1
|
||||
|
||||
|
||||
@pytest.mark.issue(3962)
|
||||
def test_issue3962_long(en_vocab):
|
||||
"""Ensure that as_doc does not result in out-of-bound access of tokens.
|
||||
This is achieved by setting the head to itself if it would lie out of the span otherwise."""
|
||||
# fmt: off
|
||||
words = ["He", "jests", "at", "scars", ".", "They", "never", "felt", "a", "wound", "."]
|
||||
heads = [1, 1, 1, 2, 1, 7, 7, 7, 9, 7, 7]
|
||||
deps = ["nsubj", "ROOT", "prep", "pobj", "punct", "nsubj", "neg", "ROOT", "det", "dobj", "punct"]
|
||||
# fmt: on
|
||||
two_sent_doc = Doc(en_vocab, words=words, heads=heads, deps=deps)
|
||||
span2 = two_sent_doc[1:7] # "jests at scars. They never"
|
||||
doc2 = span2.as_doc()
|
||||
doc2_json = doc2.to_json()
|
||||
assert doc2_json
|
||||
# head set to itself, being the new artificial root (in sentence 1)
|
||||
assert doc2[0].head.text == "jests"
|
||||
assert doc2[0].dep_ == "ROOT"
|
||||
assert doc2[1].head.text == "jests"
|
||||
assert doc2[1].dep_ == "prep"
|
||||
assert doc2[2].head.text == "at"
|
||||
assert doc2[2].dep_ == "pobj"
|
||||
assert doc2[3].head.text == "jests"
|
||||
assert doc2[3].dep_ == "punct"
|
||||
# head set to itself, being the new artificial root (in sentence 2)
|
||||
assert doc2[4].head.text == "They"
|
||||
assert doc2[4].dep_ == "dep"
|
||||
# head set to the new artificial head (in sentence 2)
|
||||
assert doc2[4].head.text == "They"
|
||||
assert doc2[4].dep_ == "dep"
|
||||
# We should still have 2 sentences
|
||||
sents = list(doc2.sents)
|
||||
assert len(sents) == 2
|
||||
assert sents[0].text == "jests at scars ."
|
||||
assert sents[1].text == "They never"
|
||||
|
||||
|
||||
@Language.factory("my_pipe")
|
||||
class CustomPipe:
|
||||
def __init__(self, nlp, name="my_pipe"):
|
||||
self.name = name
|
||||
Span.set_extension("my_ext", getter=self._get_my_ext)
|
||||
Doc.set_extension("my_ext", default=None)
|
||||
|
||||
def __call__(self, doc):
|
||||
gathered_ext = []
|
||||
for sent in doc.sents:
|
||||
sent_ext = self._get_my_ext(sent)
|
||||
sent._.set("my_ext", sent_ext)
|
||||
gathered_ext.append(sent_ext)
|
||||
|
||||
doc._.set("my_ext", "\n".join(gathered_ext))
|
||||
return doc
|
||||
|
||||
@staticmethod
|
||||
def _get_my_ext(span):
|
||||
return str(span.end)
|
||||
|
||||
|
||||
@pytest.mark.issue(4903)
|
||||
def test_issue4903():
|
||||
"""Ensure that this runs correctly and doesn't hang or crash on Windows /
|
||||
macOS."""
|
||||
nlp = English()
|
||||
nlp.add_pipe("sentencizer")
|
||||
nlp.add_pipe("my_pipe", after="sentencizer")
|
||||
text = ["I like bananas.", "Do you like them?", "No, I prefer wasabi."]
|
||||
if isinstance(get_current_ops(), NumpyOps):
|
||||
docs = list(nlp.pipe(text, n_process=2))
|
||||
assert docs[0].text == "I like bananas."
|
||||
assert docs[1].text == "Do you like them?"
|
||||
assert docs[2].text == "No, I prefer wasabi."
|
||||
|
||||
|
||||
@pytest.mark.issue(5048)
|
||||
def test_issue5048(en_vocab):
|
||||
words = ["This", "is", "a", "sentence"]
|
||||
pos_s = ["DET", "VERB", "DET", "NOUN"]
|
||||
spaces = [" ", " ", " ", ""]
|
||||
deps_s = ["dep", "adj", "nn", "atm"]
|
||||
tags_s = ["DT", "VBZ", "DT", "NN"]
|
||||
strings = en_vocab.strings
|
||||
for w in words:
|
||||
strings.add(w)
|
||||
deps = [strings.add(d) for d in deps_s]
|
||||
pos = [strings.add(p) for p in pos_s]
|
||||
tags = [strings.add(t) for t in tags_s]
|
||||
attrs = [POS, DEP, TAG]
|
||||
array = numpy.array(list(zip(pos, deps, tags)), dtype="uint64")
|
||||
doc = Doc(en_vocab, words=words, spaces=spaces)
|
||||
doc.from_array(attrs, array)
|
||||
v1 = [(token.text, token.pos_, token.tag_) for token in doc]
|
||||
doc2 = Doc(en_vocab, words=words, pos=pos_s, deps=deps_s, tags=tags_s)
|
||||
v2 = [(token.text, token.pos_, token.tag_) for token in doc2]
|
||||
assert v1 == v2
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text", [["one", "two", "three"]])
|
||||
def test_doc_api_compare_by_string_position(en_vocab, text):
|
||||
doc = Doc(en_vocab, words=text)
|
||||
|
@ -350,6 +567,7 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
|
|||
"Merging the docs is fun.",
|
||||
"",
|
||||
"They don't think alike. ",
|
||||
"",
|
||||
"Another doc.",
|
||||
]
|
||||
en_texts_without_empty = [t for t in en_texts if len(t)]
|
||||
|
@ -357,9 +575,9 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
|
|||
en_docs = [en_tokenizer(text) for text in en_texts]
|
||||
en_docs[0].spans["group"] = [en_docs[0][1:4]]
|
||||
en_docs[2].spans["group"] = [en_docs[2][1:4]]
|
||||
en_docs[3].spans["group"] = [en_docs[3][0:1]]
|
||||
en_docs[4].spans["group"] = [en_docs[4][0:1]]
|
||||
span_group_texts = sorted(
|
||||
[en_docs[0][1:4].text, en_docs[2][1:4].text, en_docs[3][0:1].text]
|
||||
[en_docs[0][1:4].text, en_docs[2][1:4].text, en_docs[4][0:1].text]
|
||||
)
|
||||
de_doc = de_tokenizer(de_text)
|
||||
Token.set_extension("is_ambiguous", default=False)
|
||||
|
@ -466,6 +684,7 @@ def test_has_annotation(en_vocab):
|
|||
attrs = ("TAG", "POS", "MORPH", "LEMMA", "DEP", "HEAD", "ENT_IOB", "ENT_TYPE")
|
||||
for attr in attrs:
|
||||
assert not doc.has_annotation(attr)
|
||||
assert not doc.has_annotation(attr, require_complete=True)
|
||||
|
||||
doc[0].tag_ = "A"
|
||||
doc[0].pos_ = "X"
|
||||
|
@ -491,6 +710,27 @@ def test_has_annotation(en_vocab):
|
|||
assert doc.has_annotation(attr, require_complete=True)
|
||||
|
||||
|
||||
def test_has_annotation_sents(en_vocab):
|
||||
doc = Doc(en_vocab, words=["Hello", "beautiful", "world"])
|
||||
attrs = ("SENT_START", "IS_SENT_START", "IS_SENT_END")
|
||||
for attr in attrs:
|
||||
assert not doc.has_annotation(attr)
|
||||
assert not doc.has_annotation(attr, require_complete=True)
|
||||
|
||||
# The first token (index 0) is always assumed to be a sentence start,
|
||||
# and ignored by the check in doc.has_annotation
|
||||
|
||||
doc[1].is_sent_start = False
|
||||
for attr in attrs:
|
||||
assert doc.has_annotation(attr)
|
||||
assert not doc.has_annotation(attr, require_complete=True)
|
||||
|
||||
doc[2].is_sent_start = False
|
||||
for attr in attrs:
|
||||
assert doc.has_annotation(attr)
|
||||
assert doc.has_annotation(attr, require_complete=True)
|
||||
|
||||
|
||||
def test_is_flags_deprecated(en_tokenizer):
|
||||
doc = en_tokenizer("test")
|
||||
with pytest.deprecated_call():
|
||||
|
|
|
@ -1,8 +1,50 @@
|
|||
import numpy
|
||||
import pytest
|
||||
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.tokens import Doc, Token
|
||||
|
||||
|
||||
@pytest.mark.issue(3540)
|
||||
def test_issue3540(en_vocab):
|
||||
words = ["I", "live", "in", "NewYork", "right", "now"]
|
||||
tensor = numpy.asarray(
|
||||
[[1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1], [5.0, 5.1], [6.0, 6.1]],
|
||||
dtype="f",
|
||||
)
|
||||
doc = Doc(en_vocab, words=words)
|
||||
doc.tensor = tensor
|
||||
gold_text = ["I", "live", "in", "NewYork", "right", "now"]
|
||||
assert [token.text for token in doc] == gold_text
|
||||
gold_lemma = ["I", "live", "in", "NewYork", "right", "now"]
|
||||
for i, lemma in enumerate(gold_lemma):
|
||||
doc[i].lemma_ = lemma
|
||||
assert [token.lemma_ for token in doc] == gold_lemma
|
||||
vectors_1 = [token.vector for token in doc]
|
||||
assert len(vectors_1) == len(doc)
|
||||
|
||||
with doc.retokenize() as retokenizer:
|
||||
heads = [(doc[3], 1), doc[2]]
|
||||
attrs = {
|
||||
"POS": ["PROPN", "PROPN"],
|
||||
"LEMMA": ["New", "York"],
|
||||
"DEP": ["pobj", "compound"],
|
||||
}
|
||||
retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)
|
||||
|
||||
gold_text = ["I", "live", "in", "New", "York", "right", "now"]
|
||||
assert [token.text for token in doc] == gold_text
|
||||
gold_lemma = ["I", "live", "in", "New", "York", "right", "now"]
|
||||
assert [token.lemma_ for token in doc] == gold_lemma
|
||||
vectors_2 = [token.vector for token in doc]
|
||||
assert len(vectors_2) == len(doc)
|
||||
assert vectors_1[0].tolist() == vectors_2[0].tolist()
|
||||
assert vectors_1[1].tolist() == vectors_2[1].tolist()
|
||||
assert vectors_1[2].tolist() == vectors_2[2].tolist()
|
||||
assert vectors_1[4].tolist() == vectors_2[5].tolist()
|
||||
assert vectors_1[5].tolist() == vectors_2[6].tolist()
|
||||
|
||||
|
||||
def test_doc_retokenize_split(en_vocab):
|
||||
words = ["LosAngeles", "start", "."]
|
||||
heads = [1, 2, 2]
|
||||
|
|
|
@ -1,7 +1,9 @@
|
|||
import pytest
|
||||
import numpy
|
||||
from numpy.testing import assert_array_equal
|
||||
|
||||
from spacy.attrs import ORTH, LENGTH
|
||||
from spacy.lang.en import English
|
||||
from spacy.tokens import Doc, Span, Token
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.util import filter_spans
|
||||
|
@ -43,6 +45,106 @@ def doc_not_parsed(en_tokenizer):
|
|||
return doc
|
||||
|
||||
|
||||
@pytest.mark.issue(1537)
|
||||
def test_issue1537():
|
||||
"""Test that Span.as_doc() doesn't segfault."""
|
||||
string = "The sky is blue . The man is pink . The dog is purple ."
|
||||
doc = Doc(Vocab(), words=string.split())
|
||||
doc[0].sent_start = True
|
||||
for word in doc[1:]:
|
||||
if word.nbor(-1).text == ".":
|
||||
word.sent_start = True
|
||||
else:
|
||||
word.sent_start = False
|
||||
sents = list(doc.sents)
|
||||
sent0 = sents[0].as_doc()
|
||||
sent1 = sents[1].as_doc()
|
||||
assert isinstance(sent0, Doc)
|
||||
assert isinstance(sent1, Doc)
|
||||
|
||||
|
||||
@pytest.mark.issue(1612)
|
||||
def test_issue1612(en_tokenizer):
|
||||
"""Test that span.orth_ is identical to span.text"""
|
||||
doc = en_tokenizer("The black cat purrs.")
|
||||
span = doc[1:3]
|
||||
assert span.orth_ == span.text
|
||||
|
||||
|
||||
@pytest.mark.issue(3199)
|
||||
def test_issue3199():
|
||||
"""Test that Span.noun_chunks works correctly if no noun chunks iterator
|
||||
is available. To make this test future-proof, we're constructing a Doc
|
||||
with a new Vocab here and a parse tree to make sure the noun chunks run.
|
||||
"""
|
||||
words = ["This", "is", "a", "sentence"]
|
||||
doc = Doc(Vocab(), words=words, heads=[0] * len(words), deps=["dep"] * len(words))
|
||||
with pytest.raises(NotImplementedError):
|
||||
list(doc[0:3].noun_chunks)
|
||||
|
||||
|
||||
@pytest.mark.issue(5152)
|
||||
def test_issue5152():
|
||||
# Test that the comparison between a Span and a Token goes well
|
||||
# There was a bug when the number of tokens in the span equaled the number of characters in the token (!)
|
||||
nlp = English()
|
||||
text = nlp("Talk about being boring!")
|
||||
text_var = nlp("Talk of being boring!")
|
||||
y = nlp("Let")
|
||||
span = text[0:3] # Talk about being
|
||||
span_2 = text[0:3] # Talk about being
|
||||
span_3 = text_var[0:3] # Talk of being
|
||||
token = y[0] # Let
|
||||
with pytest.warns(UserWarning):
|
||||
assert span.similarity(token) == 0.0
|
||||
assert span.similarity(span_2) == 1.0
|
||||
with pytest.warns(UserWarning):
|
||||
assert span_2.similarity(span_3) < 1.0
|
||||
|
||||
|
||||
@pytest.mark.issue(6755)
|
||||
def test_issue6755(en_tokenizer):
|
||||
doc = en_tokenizer("This is a magnificent sentence.")
|
||||
span = doc[:0]
|
||||
assert span.text_with_ws == ""
|
||||
assert span.text == ""
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"sentence, start_idx,end_idx,label",
|
||||
[("Welcome to Mumbai, my friend", 11, 17, "GPE")],
|
||||
)
|
||||
@pytest.mark.issue(6815)
|
||||
def test_issue6815_1(sentence, start_idx, end_idx, label):
|
||||
nlp = English()
|
||||
doc = nlp(sentence)
|
||||
span = doc[:].char_span(start_idx, end_idx, label=label)
|
||||
assert span.label_ == label
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"sentence, start_idx,end_idx,kb_id", [("Welcome to Mumbai, my friend", 11, 17, 5)]
|
||||
)
|
||||
@pytest.mark.issue(6815)
|
||||
def test_issue6815_2(sentence, start_idx, end_idx, kb_id):
|
||||
nlp = English()
|
||||
doc = nlp(sentence)
|
||||
span = doc[:].char_span(start_idx, end_idx, kb_id=kb_id)
|
||||
assert span.kb_id == kb_id
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"sentence, start_idx,end_idx,vector",
|
||||
[("Welcome to Mumbai, my friend", 11, 17, numpy.array([0.1, 0.2, 0.3]))],
|
||||
)
|
||||
@pytest.mark.issue(6815)
|
||||
def test_issue6815_3(sentence, start_idx, end_idx, vector):
|
||||
nlp = English()
|
||||
doc = nlp(sentence)
|
||||
span = doc[:].char_span(start_idx, end_idx, vector=vector)
|
||||
assert (span.vector == vector).all()
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"i_sent,i,j,text",
|
||||
[
|
||||
|
@ -98,6 +200,12 @@ def test_spans_span_sent(doc, doc_not_parsed):
|
|||
assert doc[:2].sent.root.text == "is"
|
||||
assert doc[:2].sent.text == "This is a sentence."
|
||||
assert doc[6:7].sent.root.left_edge.text == "This"
|
||||
assert doc[0 : len(doc)].sent == list(doc.sents)[0]
|
||||
assert list(doc[0 : len(doc)].sents) == list(doc.sents)
|
||||
|
||||
with pytest.raises(ValueError):
|
||||
doc_not_parsed[:2].sent
|
||||
|
||||
# test on manual sbd
|
||||
doc_not_parsed[0].is_sent_start = True
|
||||
doc_not_parsed[5].is_sent_start = True
|
||||
|
@ -105,6 +213,35 @@ def test_spans_span_sent(doc, doc_not_parsed):
|
|||
assert doc_not_parsed[10:14].sent == doc_not_parsed[5:]
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"start,end,expected_sentence",
|
||||
[
|
||||
(0, 14, "This is"), # Entire doc
|
||||
(1, 4, "This is"), # Overlapping with 2 sentences
|
||||
(0, 2, "This is"), # Beginning of the Doc. Full sentence
|
||||
(0, 1, "This is"), # Beginning of the Doc. Part of a sentence
|
||||
(10, 14, "And a"), # End of the Doc. Overlapping with 2 senteces
|
||||
(12, 14, "third."), # End of the Doc. Full sentence
|
||||
(1, 1, "This is"), # Empty Span
|
||||
],
|
||||
)
|
||||
def test_spans_span_sent_user_hooks(doc, start, end, expected_sentence):
|
||||
|
||||
# Doc-level sents hook
|
||||
def user_hook(doc):
|
||||
return [doc[ii : ii + 2] for ii in range(0, len(doc), 2)]
|
||||
|
||||
doc.user_hooks["sents"] = user_hook
|
||||
|
||||
# Make sure doc-level sents hook works
|
||||
assert doc[start:end].sent.text == expected_sentence
|
||||
|
||||
# Span-level sent hook
|
||||
doc.user_span_hooks["sent"] = lambda x: x
|
||||
# Now, the span-level sent hook overrides the doc-level sents hook
|
||||
assert doc[start:end].sent == doc[start:end]
|
||||
|
||||
|
||||
def test_spans_lca_matrix(en_tokenizer):
|
||||
"""Test span's lca matrix generation"""
|
||||
tokens = en_tokenizer("the lazy dog slept")
|
||||
|
@ -434,3 +571,100 @@ def test_span_with_vectors(doc):
|
|||
# single-token span with vector
|
||||
assert_array_equal(ops.to_numpy(doc[10:11].vector), [-1, -1, -1])
|
||||
doc.vocab.vectors = prev_vectors
|
||||
|
||||
|
||||
# fmt: off
|
||||
def test_span_comparison(doc):
|
||||
|
||||
# Identical start, end, only differ in label and kb_id
|
||||
assert Span(doc, 0, 3) == Span(doc, 0, 3)
|
||||
assert Span(doc, 0, 3, "LABEL") == Span(doc, 0, 3, "LABEL")
|
||||
assert Span(doc, 0, 3, "LABEL", kb_id="KB_ID") == Span(doc, 0, 3, "LABEL", kb_id="KB_ID")
|
||||
|
||||
assert Span(doc, 0, 3) != Span(doc, 0, 3, "LABEL")
|
||||
assert Span(doc, 0, 3) != Span(doc, 0, 3, "LABEL", kb_id="KB_ID")
|
||||
assert Span(doc, 0, 3, "LABEL") != Span(doc, 0, 3, "LABEL", kb_id="KB_ID")
|
||||
|
||||
assert Span(doc, 0, 3) <= Span(doc, 0, 3) and Span(doc, 0, 3) >= Span(doc, 0, 3)
|
||||
assert Span(doc, 0, 3, "LABEL") <= Span(doc, 0, 3, "LABEL") and Span(doc, 0, 3, "LABEL") >= Span(doc, 0, 3, "LABEL")
|
||||
assert Span(doc, 0, 3, "LABEL", kb_id="KB_ID") <= Span(doc, 0, 3, "LABEL", kb_id="KB_ID")
|
||||
assert Span(doc, 0, 3, "LABEL", kb_id="KB_ID") >= Span(doc, 0, 3, "LABEL", kb_id="KB_ID")
|
||||
|
||||
assert (Span(doc, 0, 3) < Span(doc, 0, 3, "", kb_id="KB_ID") < Span(doc, 0, 3, "LABEL") < Span(doc, 0, 3, "LABEL", kb_id="KB_ID"))
|
||||
assert (Span(doc, 0, 3) <= Span(doc, 0, 3, "", kb_id="KB_ID") <= Span(doc, 0, 3, "LABEL") <= Span(doc, 0, 3, "LABEL", kb_id="KB_ID"))
|
||||
|
||||
assert (Span(doc, 0, 3, "LABEL", kb_id="KB_ID") > Span(doc, 0, 3, "LABEL") > Span(doc, 0, 3, "", kb_id="KB_ID") > Span(doc, 0, 3))
|
||||
assert (Span(doc, 0, 3, "LABEL", kb_id="KB_ID") >= Span(doc, 0, 3, "LABEL") >= Span(doc, 0, 3, "", kb_id="KB_ID") >= Span(doc, 0, 3))
|
||||
|
||||
# Different end
|
||||
assert Span(doc, 0, 3, "LABEL", kb_id="KB_ID") < Span(doc, 0, 4, "LABEL", kb_id="KB_ID")
|
||||
|
||||
assert Span(doc, 0, 3, "LABEL", kb_id="KB_ID") < Span(doc, 0, 4)
|
||||
assert Span(doc, 0, 3, "LABEL", kb_id="KB_ID") <= Span(doc, 0, 4)
|
||||
assert Span(doc, 0, 4) > Span(doc, 0, 3, "LABEL", kb_id="KB_ID")
|
||||
assert Span(doc, 0, 4) >= Span(doc, 0, 3, "LABEL", kb_id="KB_ID")
|
||||
|
||||
# Different start
|
||||
assert Span(doc, 0, 3, "LABEL", kb_id="KB_ID") != Span(doc, 1, 3, "LABEL", kb_id="KB_ID")
|
||||
|
||||
assert Span(doc, 0, 3, "LABEL", kb_id="KB_ID") < Span(doc, 1, 3)
|
||||
assert Span(doc, 0, 3, "LABEL", kb_id="KB_ID") <= Span(doc, 1, 3)
|
||||
assert Span(doc, 1, 3) > Span(doc, 0, 3, "LABEL", kb_id="KB_ID")
|
||||
assert Span(doc, 1, 3) >= Span(doc, 0, 3, "LABEL", kb_id="KB_ID")
|
||||
|
||||
# Different start & different end
|
||||
assert Span(doc, 0, 4, "LABEL", kb_id="KB_ID") != Span(doc, 1, 3, "LABEL", kb_id="KB_ID")
|
||||
|
||||
assert Span(doc, 0, 4, "LABEL", kb_id="KB_ID") < Span(doc, 1, 3)
|
||||
assert Span(doc, 0, 4, "LABEL", kb_id="KB_ID") <= Span(doc, 1, 3)
|
||||
assert Span(doc, 1, 3) > Span(doc, 0, 4, "LABEL", kb_id="KB_ID")
|
||||
assert Span(doc, 1, 3) >= Span(doc, 0, 4, "LABEL", kb_id="KB_ID")
|
||||
# fmt: on
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"start,end,expected_sentences,expected_sentences_with_hook",
|
||||
[
|
||||
(0, 14, 3, 7), # Entire doc
|
||||
(3, 6, 2, 2), # Overlapping with 2 sentences
|
||||
(0, 4, 1, 2), # Beginning of the Doc. Full sentence
|
||||
(0, 3, 1, 2), # Beginning of the Doc. Part of a sentence
|
||||
(9, 14, 2, 3), # End of the Doc. Overlapping with 2 sentences
|
||||
(10, 14, 1, 2), # End of the Doc. Full sentence
|
||||
(11, 14, 1, 2), # End of the Doc. Partial sentence
|
||||
(0, 0, 1, 1), # Empty Span
|
||||
],
|
||||
)
|
||||
def test_span_sents(doc, start, end, expected_sentences, expected_sentences_with_hook):
|
||||
|
||||
assert len(list(doc[start:end].sents)) == expected_sentences
|
||||
|
||||
def user_hook(doc):
|
||||
return [doc[ii : ii + 2] for ii in range(0, len(doc), 2)]
|
||||
|
||||
doc.user_hooks["sents"] = user_hook
|
||||
|
||||
assert len(list(doc[start:end].sents)) == expected_sentences_with_hook
|
||||
|
||||
doc.user_span_hooks["sents"] = lambda x: [x]
|
||||
|
||||
assert list(doc[start:end].sents)[0] == doc[start:end]
|
||||
assert len(list(doc[start:end].sents)) == 1
|
||||
|
||||
|
||||
def test_span_sents_not_parsed(doc_not_parsed):
|
||||
with pytest.raises(ValueError):
|
||||
list(Span(doc_not_parsed, 0, 3).sents)
|
||||
|
||||
|
||||
def test_span_group_copy(doc):
|
||||
doc.spans["test"] = [doc[0:1], doc[2:4]]
|
||||
assert len(doc.spans["test"]) == 2
|
||||
doc_copy = doc.copy()
|
||||
# check that the spans were indeed copied
|
||||
assert len(doc_copy.spans["test"]) == 2
|
||||
# add a new span to the original doc
|
||||
doc.spans["test"].append(doc[3:4])
|
||||
assert len(doc.spans["test"]) == 3
|
||||
# check that the copy spans were not modified and this is an isolated doc
|
||||
assert len(doc_copy.spans["test"]) == 2
|
||||
|
|
22
spacy/tests/lang/af/test_text.py
Normal file
|
@ -0,0 +1,22 @@
|
|||
import pytest
|
||||
|
||||
|
||||
def test_long_text(af_tokenizer):
|
||||
# Excerpt: Universal Declaration of Human Rights; “'n” changed to “die” in first sentence
|
||||
text = """
|
||||
Hierdie Universele Verklaring van Menseregte as die algemene standaard vir die verwesenliking deur alle mense en nasies,
|
||||
om te verseker dat elke individu en elke deel van die gemeenskap hierdie Verklaring in ag sal neem en deur opvoeding,
|
||||
respek vir hierdie regte en vryhede te bevorder, op nasionale en internasionale vlak, daarna sal strewe om die universele
|
||||
en effektiewe erkenning en agting van hierdie regte te verseker, nie net vir die mense van die Lidstate nie, maar ook vir
|
||||
die mense in die gebiede onder hul jurisdiksie.
|
||||
|
||||
"""
|
||||
tokens = af_tokenizer(text)
|
||||
assert len(tokens) == 100
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
def test_indefinite_article(af_tokenizer):
|
||||
text = "as 'n algemene standaard"
|
||||
tokens = af_tokenizer(text)
|
||||
assert len(tokens) == 4
|
29
spacy/tests/lang/af/test_tokenizer.py
Normal file
|
@ -0,0 +1,29 @@
|
|||
import pytest
|
||||
|
||||
AF_BASIC_TOKENIZATION_TESTS = [
|
||||
(
|
||||
"Elkeen het die reg tot lewe, vryheid en sekuriteit van persoon.",
|
||||
[
|
||||
"Elkeen",
|
||||
"het",
|
||||
"die",
|
||||
"reg",
|
||||
"tot",
|
||||
"lewe",
|
||||
",",
|
||||
"vryheid",
|
||||
"en",
|
||||
"sekuriteit",
|
||||
"van",
|
||||
"persoon",
|
||||
".",
|
||||
],
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", AF_BASIC_TOKENIZATION_TESTS)
|
||||
def test_af_tokenizer_basic(af_tokenizer, text, expected_tokens):
|
||||
tokens = af_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
|
@ -4,6 +4,15 @@ from spacy.tokens import Doc
|
|||
from ...util import apply_transition_sequence
|
||||
|
||||
|
||||
@pytest.mark.issue(309)
|
||||
def test_issue309(en_vocab):
|
||||
"""Test Issue #309: SBD fails on empty string"""
|
||||
doc = Doc(en_vocab, words=[" "], heads=[0], deps=["ROOT"])
|
||||
assert len(doc) == 1
|
||||
sents = list(doc.sents)
|
||||
assert len(sents) == 1
|
||||
|
||||
|
||||
@pytest.mark.parametrize("words", [["A", "test", "sentence"]])
|
||||
@pytest.mark.parametrize("punct", [".", "!", "?", ""])
|
||||
def test_en_sbd_single_punct(en_vocab, words, punct):
|
||||
|
|
169
spacy/tests/lang/en/test_tokenizer.py
Normal file
|
@ -0,0 +1,169 @@
|
|||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.issue(351)
|
||||
def test_issue351(en_tokenizer):
|
||||
doc = en_tokenizer(" This is a cat.")
|
||||
assert doc[0].idx == 0
|
||||
assert len(doc[0]) == 3
|
||||
assert doc[1].idx == 3
|
||||
|
||||
|
||||
@pytest.mark.issue(360)
|
||||
def test_issue360(en_tokenizer):
|
||||
"""Test tokenization of big ellipsis"""
|
||||
tokens = en_tokenizer("$45...............Asking")
|
||||
assert len(tokens) > 2
|
||||
|
||||
|
||||
@pytest.mark.issue(736)
|
||||
@pytest.mark.parametrize("text,number", [("7am", "7"), ("11p.m.", "11")])
|
||||
def test_issue736(en_tokenizer, text, number):
|
||||
"""Test that times like "7am" are tokenized correctly and that numbers are
|
||||
converted to strings."""
|
||||
tokens = en_tokenizer(text)
|
||||
assert len(tokens) == 2
|
||||
assert tokens[0].text == number
|
||||
|
||||
|
||||
@pytest.mark.issue(740)
|
||||
@pytest.mark.parametrize("text", ["3/4/2012", "01/12/1900"])
|
||||
def test_issue740(en_tokenizer, text):
|
||||
"""Test that dates are not split and kept as one token. This behaviour is
|
||||
currently inconsistent, since dates separated by hyphens are still split.
|
||||
This will be hard to prevent without causing clashes with numeric ranges."""
|
||||
tokens = en_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
|
||||
|
||||
@pytest.mark.issue(744)
|
||||
@pytest.mark.parametrize("text", ["We were scared", "We Were Scared"])
|
||||
def test_issue744(en_tokenizer, text):
|
||||
"""Test that 'were' and 'Were' are excluded from the contractions
|
||||
generated by the English tokenizer exceptions."""
|
||||
tokens = en_tokenizer(text)
|
||||
assert len(tokens) == 3
|
||||
assert tokens[1].text.lower() == "were"
|
||||
|
||||
|
||||
@pytest.mark.issue(759)
|
||||
@pytest.mark.parametrize(
|
||||
"text,is_num", [("one", True), ("ten", True), ("teneleven", False)]
|
||||
)
|
||||
def test_issue759(en_tokenizer, text, is_num):
|
||||
tokens = en_tokenizer(text)
|
||||
assert tokens[0].like_num == is_num
|
||||
|
||||
|
||||
@pytest.mark.issue(775)
|
||||
@pytest.mark.parametrize("text", ["Shell", "shell", "Shed", "shed"])
|
||||
def test_issue775(en_tokenizer, text):
|
||||
"""Test that 'Shell' and 'shell' are excluded from the contractions
|
||||
generated by the English tokenizer exceptions."""
|
||||
tokens = en_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
assert tokens[0].text == text
|
||||
|
||||
|
||||
@pytest.mark.issue(792)
|
||||
@pytest.mark.parametrize("text", ["This is a string ", "This is a string\u0020"])
|
||||
def test_issue792(en_tokenizer, text):
|
||||
"""Test for Issue #792: Trailing whitespace is removed after tokenization."""
|
||||
doc = en_tokenizer(text)
|
||||
assert "".join([token.text_with_ws for token in doc]) == text
|
||||
|
||||
|
||||
@pytest.mark.issue(792)
|
||||
@pytest.mark.parametrize("text", ["This is a string", "This is a string\n"])
|
||||
def test_control_issue792(en_tokenizer, text):
|
||||
"""Test base case for Issue #792: Non-trailing whitespace"""
|
||||
doc = en_tokenizer(text)
|
||||
assert "".join([token.text_with_ws for token in doc]) == text
|
||||
|
||||
|
||||
@pytest.mark.issue(859)
|
||||
@pytest.mark.parametrize(
|
||||
"text", ["aaabbb@ccc.com\nThank you!", "aaabbb@ccc.com \nThank you!"]
|
||||
)
|
||||
def test_issue859(en_tokenizer, text):
|
||||
"""Test that no extra space is added in doc.text method."""
|
||||
doc = en_tokenizer(text)
|
||||
assert doc.text == text
|
||||
|
||||
|
||||
@pytest.mark.issue(886)
|
||||
@pytest.mark.parametrize("text", ["Datum:2014-06-02\nDokument:76467"])
|
||||
def test_issue886(en_tokenizer, text):
|
||||
"""Test that token.idx matches the original text index for texts with newlines."""
|
||||
doc = en_tokenizer(text)
|
||||
for token in doc:
|
||||
assert len(token.text) == len(token.text_with_ws)
|
||||
assert text[token.idx] == token.text[0]
|
||||
|
||||
|
||||
@pytest.mark.issue(891)
|
||||
@pytest.mark.parametrize("text", ["want/need"])
|
||||
def test_issue891(en_tokenizer, text):
|
||||
"""Test that / infixes are split correctly."""
|
||||
tokens = en_tokenizer(text)
|
||||
assert len(tokens) == 3
|
||||
assert tokens[1].text == "/"
|
||||
|
||||
|
||||
@pytest.mark.issue(957)
|
||||
@pytest.mark.slow
|
||||
def test_issue957(en_tokenizer):
|
||||
"""Test that spaCy doesn't hang on many punctuation characters.
|
||||
If this test hangs, check (new) regular expressions for conflicting greedy operators
|
||||
"""
|
||||
# Skip test if pytest-timeout is not installed
|
||||
pytest.importorskip("pytest_timeout")
|
||||
for punct in [".", ",", "'", '"', ":", "?", "!", ";", "-"]:
|
||||
string = "0"
|
||||
for i in range(1, 100):
|
||||
string += punct + str(i)
|
||||
doc = en_tokenizer(string)
|
||||
assert doc
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text", ["test@example.com", "john.doe@example.co.uk"])
|
||||
@pytest.mark.issue(1698)
|
||||
def test_issue1698(en_tokenizer, text):
|
||||
"""Test that doc doesn't identify email-addresses as URLs"""
|
||||
doc = en_tokenizer(text)
|
||||
assert len(doc) == 1
|
||||
assert not doc[0].like_url
|
||||
|
||||
|
||||
@pytest.mark.issue(1758)
|
||||
def test_issue1758(en_tokenizer):
|
||||
"""Test that "would've" is handled by the English tokenizer exceptions."""
|
||||
tokens = en_tokenizer("would've")
|
||||
assert len(tokens) == 2
|
||||
|
||||
|
||||
@pytest.mark.issue(1773)
|
||||
def test_issue1773(en_tokenizer):
|
||||
"""Test that spaces don't receive a POS but no TAG. This is the root cause
|
||||
of the serialization issue reported in #1773."""
|
||||
doc = en_tokenizer("\n")
|
||||
if doc[0].pos_ == "SPACE":
|
||||
assert doc[0].tag_ != ""
|
||||
|
||||
|
||||
@pytest.mark.issue(3277)
|
||||
def test_issue3277(es_tokenizer):
|
||||
"""Test that hyphens are split correctly as prefixes."""
|
||||
doc = es_tokenizer("—Yo me llamo... –murmuró el niño– Emilio Sánchez Pérez.")
|
||||
assert len(doc) == 14
|
||||
assert doc[0].text == "\u2014"
|
||||
assert doc[5].text == "\u2013"
|
||||
assert doc[9].text == "\u2013"
|
||||
|
||||
|
||||
@pytest.mark.parametrize("word", ["don't", "don’t", "I'd", "I’d"])
|
||||
@pytest.mark.issue(3521)
|
||||
def test_issue3521(en_tokenizer, word):
|
||||
tok = en_tokenizer(word)[1]
|
||||
# 'not' and 'would' should be stopwords, also in their abbreviated forms
|
||||
assert tok.is_stop
|
|
@ -1,5 +1,16 @@
|
|||
import pytest
|
||||
from spacy.lang.es.lex_attrs import like_num
|
||||
from spacy.lang.es import Spanish
|
||||
|
||||
|
||||
@pytest.mark.issue(3803)
|
||||
def test_issue3803():
|
||||
"""Test that spanish num-like tokens have True for like_num attribute."""
|
||||
nlp = Spanish()
|
||||
text = "2 dos 1000 mil 12 doce"
|
||||
doc = nlp(text)
|
||||
|
||||
assert [t.like_num for t in doc] == [True, True, True, True, True, True]
|
||||
|
||||
|
||||
def test_es_tokenizer_handles_long_text(es_tokenizer):
|
||||
|
|
0
spacy/tests/lang/et/__init__.py
Normal file
26
spacy/tests/lang/et/test_text.py
Normal file
|
@ -0,0 +1,26 @@
|
|||
import pytest
|
||||
|
||||
|
||||
def test_long_text(et_tokenizer):
|
||||
# Excerpt: European Convention on Human Rights
|
||||
text = """
|
||||
arvestades, et nimetatud deklaratsiooni eesmärk on tagada selles
|
||||
kuulutatud õiguste üldine ja tõhus tunnustamine ning järgimine;
|
||||
arvestades, et Euroopa Nõukogu eesmärk on saavutada tema
|
||||
liikmete suurem ühtsus ning et üheks selle eesmärgi saavutamise
|
||||
vahendiks on inimõiguste ja põhivabaduste järgimine ning
|
||||
elluviimine;
|
||||
taaskinnitades oma sügavat usku neisse põhivabadustesse, mis
|
||||
on õigluse ja rahu aluseks maailmas ning mida kõige paremini
|
||||
tagab ühelt poolt tõhus poliitiline demokraatia ning teiselt poolt
|
||||
inimõiguste, millest nad sõltuvad, üldine mõistmine ja järgimine;
|
||||
"""
|
||||
tokens = et_tokenizer(text)
|
||||
assert len(tokens) == 94
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
def test_ordinal_number(et_tokenizer):
|
||||
text = "10. detsembril 1948"
|
||||
tokens = et_tokenizer(text)
|
||||
assert len(tokens) == 3
|
29
spacy/tests/lang/et/test_tokenizer.py
Normal file
|
@ -0,0 +1,29 @@
|
|||
import pytest
|
||||
|
||||
ET_BASIC_TOKENIZATION_TESTS = [
|
||||
(
|
||||
"Kedagi ei või piinata ega ebainimlikult või alandavalt kohelda "
|
||||
"ega karistada.",
|
||||
[
|
||||
"Kedagi",
|
||||
"ei",
|
||||
"või",
|
||||
"piinata",
|
||||
"ega",
|
||||
"ebainimlikult",
|
||||
"või",
|
||||
"alandavalt",
|
||||
"kohelda",
|
||||
"ega",
|
||||
"karistada",
|
||||
".",
|
||||
],
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", ET_BASIC_TOKENIZATION_TESTS)
|
||||
def test_et_tokenizer_basic(et_tokenizer, text, expected_tokens):
|
||||
tokens = et_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
189
spacy/tests/lang/fi/test_noun_chunks.py
Normal file
|
@ -0,0 +1,189 @@
|
|||
import pytest
|
||||
from spacy.tokens import Doc
|
||||
|
||||
|
||||
FI_NP_TEST_EXAMPLES = [
|
||||
(
|
||||
"Kaksi tyttöä potkii punaista palloa",
|
||||
["NUM", "NOUN", "VERB", "ADJ", "NOUN"],
|
||||
["nummod", "nsubj", "ROOT", "amod", "obj"],
|
||||
[1, 1, 0, 1, -2],
|
||||
["Kaksi tyttöä", "punaista palloa"],
|
||||
),
|
||||
(
|
||||
"Erittäin vaarallinen leijona karkasi kiertävän sirkuksen eläintenkesyttäjältä",
|
||||
["ADV", "ADJ", "NOUN", "VERB", "ADJ", "NOUN", "NOUN"],
|
||||
["advmod", "amod", "nsubj", "ROOT", "amod", "nmod:poss", "obl"],
|
||||
[1, 1, 1, 0, 1, 1, -3],
|
||||
["Erittäin vaarallinen leijona", "kiertävän sirkuksen eläintenkesyttäjältä"],
|
||||
),
|
||||
(
|
||||
"Leijona raidallisine tassuineen piileksii Porin kaupungin lähellä",
|
||||
["NOUN", "ADJ", "NOUN", "VERB", "PROPN", "NOUN", "ADP"],
|
||||
["nsubj", "amod", "nmod", "ROOT", "nmod:poss", "obl", "case"],
|
||||
[3, 1, -2, 0, 1, -2, -1],
|
||||
["Leijona raidallisine tassuineen", "Porin kaupungin"],
|
||||
),
|
||||
(
|
||||
"Lounaalla nautittiin salaattia, maukasta kanaa ja raikasta vettä",
|
||||
["NOUN", "VERB", "NOUN", "PUNCT", "ADJ", "NOUN", "CCONJ", "ADJ", "NOUN"],
|
||||
["obl", "ROOT", "obj", "punct", "amod", "conj", "cc", "amod", "conj"],
|
||||
[1, 0, -1, 2, 1, -3, 2, 1, -6],
|
||||
["Lounaalla", "salaattia", "maukasta kanaa", "raikasta vettä"],
|
||||
),
|
||||
(
|
||||
"Minua houkuttaa maalle muuttaminen talven jälkeen",
|
||||
["PRON", "VERB", "NOUN", "NOUN", "NOUN", "ADP"],
|
||||
["obj", "ROOT", "nmod", "nsubj", "obl", "case"],
|
||||
[1, 0, 1, -2, -3, -1],
|
||||
["maalle muuttaminen", "talven"],
|
||||
),
|
||||
(
|
||||
"Päivän kohokohta oli vierailu museossa kummilasten kanssa",
|
||||
["NOUN", "NOUN", "AUX", "NOUN", "NOUN", "NOUN", "ADP"],
|
||||
["nmod:poss", "nsubj:cop", "cop", "ROOT", "nmod", "obl", "case"],
|
||||
[1, 2, 1, 0, -1, -2, -1],
|
||||
["Päivän kohokohta", "vierailu museossa", "kummilasten"],
|
||||
),
|
||||
(
|
||||
"Yrittäjät maksoivat tuomioistuimen määräämät korvaukset",
|
||||
["NOUN", "VERB", "NOUN", "VERB", "NOUN"],
|
||||
["nsubj", "ROOT", "nsubj", "acl", "obj"],
|
||||
[1, 0, 1, 1, -3],
|
||||
["Yrittäjät", "tuomioistuimen", "korvaukset"],
|
||||
),
|
||||
(
|
||||
"Julkisoikeudelliset tai niihin rinnastettavat saatavat ovat suoraan ulosottokelpoisia",
|
||||
["ADJ", "CCONJ", "PRON", "VERB", "NOUN", "AUX", "ADV", "NOUN"],
|
||||
["amod", "cc", "obl", "acl", "nsubj:cop", "cop", "advmod", "ROOT"],
|
||||
[4, 3, 1, 1, 3, 2, 1, 0],
|
||||
["Julkisoikeudelliset tai niihin rinnastettavat saatavat", "ulosottokelpoisia"],
|
||||
),
|
||||
(
|
||||
"Se oli ala-arvoista käytöstä kaikilta oppilailta, myös valvojaoppilailta",
|
||||
["PRON", "AUX", "ADJ", "NOUN", "PRON", "NOUN", "PUNCT", "ADV", "NOUN"],
|
||||
["nsubj:cop", "cop", "amod", "ROOT", "det", "nmod", "punct", "advmod", "appos"],
|
||||
[3, 2, 1, 0, 1, -2, 2, 1, -3],
|
||||
["ala-arvoista käytöstä kaikilta oppilailta", "valvojaoppilailta"],
|
||||
),
|
||||
(
|
||||
"Isä souti veneellä, jonka hän oli vuokrannut",
|
||||
["NOUN", "VERB", "NOUN", "PUNCT", "PRON", "PRON", "AUX", "VERB"],
|
||||
["nsubj", "ROOT", "obl", "punct", "obj", "nsubj", "aux", "acl:relcl"],
|
||||
[1, 0, -1, 4, 3, 2, 1, -5],
|
||||
["Isä", "veneellä"],
|
||||
),
|
||||
(
|
||||
"Kirja, jonka poimin hyllystä, kertoo norsuista",
|
||||
["NOUN", "PUNCT", "PRON", "VERB", "NOUN", "PUNCT", "VERB", "NOUN"],
|
||||
["nsubj", "punct", "obj", "acl:relcl", "obl", "punct", "ROOT", "obl"],
|
||||
[6, 2, 1, -3, -1, 1, 0, -1],
|
||||
["Kirja", "hyllystä", "norsuista"],
|
||||
),
|
||||
(
|
||||
"Huomenna on päivä, jota olemme odottaneet",
|
||||
["NOUN", "AUX", "NOUN", "PUNCT", "PRON", "AUX", "VERB"],
|
||||
["ROOT", "cop", "nsubj:cop", "punct", "obj", "aux", "acl:relcl"],
|
||||
[0, -1, -2, 3, 2, 1, -4],
|
||||
["Huomenna", "päivä"],
|
||||
),
|
||||
(
|
||||
"Liikkuvuuden lisääminen on yksi korkeakoulutuksen keskeisistä kehittämiskohteista",
|
||||
["NOUN", "NOUN", "AUX", "PRON", "NOUN", "ADJ", "NOUN"],
|
||||
["nmod:gobj", "nsubj:cop", "cop", "ROOT", "nmod:poss", "amod", "nmod"],
|
||||
[1, 2, 1, 0, 2, 1, -3],
|
||||
[
|
||||
"Liikkuvuuden lisääminen",
|
||||
"korkeakoulutuksen keskeisistä kehittämiskohteista",
|
||||
],
|
||||
),
|
||||
(
|
||||
"Kaupalliset palvelut jätetään yksityisten palveluntarjoajien tarjottavaksi",
|
||||
["ADJ", "NOUN", "VERB", "ADJ", "NOUN", "NOUN"],
|
||||
["amod", "obj", "ROOT", "amod", "nmod:gsubj", "obl"],
|
||||
[1, 1, 0, 1, 1, -3],
|
||||
["Kaupalliset palvelut", "yksityisten palveluntarjoajien tarjottavaksi"],
|
||||
),
|
||||
(
|
||||
"New York tunnetaan kaupunkina, joka ei koskaan nuku",
|
||||
["PROPN", "PROPN", "VERB", "NOUN", "PUNCT", "PRON", "AUX", "ADV", "VERB"],
|
||||
[
|
||||
"obj",
|
||||
"flat:name",
|
||||
"ROOT",
|
||||
"obl",
|
||||
"punct",
|
||||
"nsubj",
|
||||
"aux",
|
||||
"advmod",
|
||||
"acl:relcl",
|
||||
],
|
||||
[2, -1, 0, -1, 4, 3, 2, 1, -5],
|
||||
["New York", "kaupunkina"],
|
||||
),
|
||||
(
|
||||
"Loput vihjeet saat herra Möttöseltä",
|
||||
["NOUN", "NOUN", "VERB", "NOUN", "PROPN"],
|
||||
["compound:nn", "obj", "ROOT", "compound:nn", "obj"],
|
||||
[1, 1, 0, 1, -2],
|
||||
["Loput vihjeet", "herra Möttöseltä"],
|
||||
),
|
||||
(
|
||||
"mahdollisuus tukea muita päivystysyksiköitä",
|
||||
["NOUN", "VERB", "PRON", "NOUN"],
|
||||
["ROOT", "acl", "det", "obj"],
|
||||
[0, -1, 1, -2],
|
||||
["mahdollisuus", "päivystysyksiköitä"],
|
||||
),
|
||||
(
|
||||
"sairaanhoitopiirit harjoittavat leikkaustoimintaa alueellaan useammassa sairaalassa",
|
||||
["NOUN", "VERB", "NOUN", "NOUN", "ADJ", "NOUN"],
|
||||
["nsubj", "ROOT", "obj", "obl", "amod", "obl"],
|
||||
[1, 0, -1, -1, 1, -3],
|
||||
[
|
||||
"sairaanhoitopiirit",
|
||||
"leikkaustoimintaa",
|
||||
"alueellaan",
|
||||
"useammassa sairaalassa",
|
||||
],
|
||||
),
|
||||
(
|
||||
"Lain mukaan varhaiskasvatus on suunnitelmallista toimintaa",
|
||||
["NOUN", "ADP", "NOUN", "AUX", "ADJ", "NOUN"],
|
||||
["obl", "case", "nsubj:cop", "cop", "amod", "ROOT"],
|
||||
[5, -1, 3, 2, 1, 0],
|
||||
["Lain", "varhaiskasvatus", "suunnitelmallista toimintaa"],
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
def test_noun_chunks_is_parsed(fi_tokenizer):
|
||||
"""Test that noun_chunks raises Value Error for 'fi' language if Doc is not parsed.
|
||||
To check this test, we're constructing a Doc
|
||||
with a new Vocab here and forcing is_parsed to 'False'
|
||||
to make sure the noun chunks don't run.
|
||||
"""
|
||||
doc = fi_tokenizer("Tämä on testi")
|
||||
with pytest.raises(ValueError):
|
||||
list(doc.noun_chunks)
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"text,pos,deps,heads,expected_noun_chunks", FI_NP_TEST_EXAMPLES
|
||||
)
|
||||
def test_fi_noun_chunks(fi_tokenizer, text, pos, deps, heads, expected_noun_chunks):
|
||||
tokens = fi_tokenizer(text)
|
||||
|
||||
assert len(heads) == len(pos)
|
||||
doc = Doc(
|
||||
tokens.vocab,
|
||||
words=[t.text for t in tokens],
|
||||
heads=[head + i for i, head in enumerate(heads)],
|
||||
deps=deps,
|
||||
pos=pos,
|
||||
)
|
||||
|
||||
noun_chunks = list(doc.noun_chunks)
|
||||
assert len(noun_chunks) == len(expected_noun_chunks)
|
||||
for i, np in enumerate(noun_chunks):
|
||||
assert np.text == expected_noun_chunks[i]
|
|
@ -1,8 +1,230 @@
|
|||
from spacy.tokens import Doc
|
||||
import pytest
|
||||
|
||||
|
||||
# fmt: off
|
||||
@pytest.mark.parametrize(
|
||||
"words,heads,deps,pos,chunk_offsets",
|
||||
[
|
||||
# determiner + noun
|
||||
# un nom -> un nom
|
||||
(
|
||||
["un", "nom"],
|
||||
[1, 1],
|
||||
["det", "ROOT"],
|
||||
["DET", "NOUN"],
|
||||
[(0, 2)],
|
||||
),
|
||||
# determiner + noun starting with vowel
|
||||
# l'heure -> l'heure
|
||||
(
|
||||
["l'", "heure"],
|
||||
[1, 1],
|
||||
["det", "ROOT"],
|
||||
["DET", "NOUN"],
|
||||
[(0, 2)],
|
||||
),
|
||||
# determiner + plural noun
|
||||
# les romans -> les romans
|
||||
(
|
||||
["les", "romans"],
|
||||
[1, 1],
|
||||
["det", "ROOT"],
|
||||
["DET", "NOUN"],
|
||||
[(0, 2)],
|
||||
),
|
||||
# det + adj + noun
|
||||
# Le vieux Londres -> Le vieux Londres
|
||||
(
|
||||
['Les', 'vieux', 'Londres'],
|
||||
[2, 2, 2],
|
||||
["det", "amod", "ROOT"],
|
||||
["DET", "ADJ", "NOUN"],
|
||||
[(0,3)]
|
||||
),
|
||||
# det + noun + adj
|
||||
# le nom propre (a proper noun) -> le nom propre
|
||||
(
|
||||
["le", "nom", "propre"],
|
||||
[1, 1, 1],
|
||||
["det", "ROOT", "amod"],
|
||||
["DET", "NOUN", "ADJ"],
|
||||
[(0, 3)],
|
||||
),
|
||||
# det + noun + adj plural
|
||||
# Les chiens bruns -> les chiens bruns
|
||||
(
|
||||
["Les", "chiens", "bruns"],
|
||||
[1, 1, 1],
|
||||
["det", "ROOT", "amod"],
|
||||
["DET", "NOUN", "ADJ"],
|
||||
[(0, 3)],
|
||||
),
|
||||
# multiple adjectives: one adj before the noun, one adj after the noun
|
||||
# un nouveau film intéressant -> un nouveau film intéressant
|
||||
(
|
||||
["un", "nouveau", "film", "intéressant"],
|
||||
[2, 2, 2, 2],
|
||||
["det", "amod", "ROOT", "amod"],
|
||||
["DET", "ADJ", "NOUN", "ADJ"],
|
||||
[(0,4)]
|
||||
),
|
||||
# multiple adjectives, both adjs after the noun
|
||||
# une personne intelligente et drôle -> une personne intelligente et drôle
|
||||
(
|
||||
["une", "personne", "intelligente", "et", "drôle"],
|
||||
[1, 1, 1, 4, 2],
|
||||
["det", "ROOT", "amod", "cc", "conj"],
|
||||
["DET", "NOUN", "ADJ", "CCONJ", "ADJ"],
|
||||
[(0,5)]
|
||||
),
|
||||
# relative pronoun
|
||||
# un bus qui va au ville -> un bus, qui, ville
|
||||
(
|
||||
['un', 'bus', 'qui', 'va', 'au', 'ville'],
|
||||
[1, 1, 3, 1, 5, 3],
|
||||
['det', 'ROOT', 'nsubj', 'acl:relcl', 'case', 'obl:arg'],
|
||||
['DET', 'NOUN', 'PRON', 'VERB', 'ADP', 'NOUN'],
|
||||
[(0,2), (2,3), (5,6)]
|
||||
),
|
||||
# relative subclause
|
||||
# Voilà la maison que nous voulons acheter (That's the house that we want to buy) -> la maison, nous
|
||||
(
|
||||
['Voilà', 'la', 'maison', 'que', 'nous', 'voulons', 'acheter'],
|
||||
[0, 2, 0, 5, 5, 2, 5],
|
||||
['ROOT', 'det', 'obj', 'mark', 'nsubj', 'acl:relcl', 'xcomp'],
|
||||
['VERB', 'DET', 'NOUN', 'SCONJ', 'PRON', 'VERB', 'VERB'],
|
||||
[(1,3), (4,5)]
|
||||
),
|
||||
# Person name and title by flat
|
||||
# Louis XIV -> Louis XIV
|
||||
(
|
||||
["Louis", "XIV"],
|
||||
[0, 0],
|
||||
["ROOT", "flat:name"],
|
||||
["PROPN", "PROPN"],
|
||||
[(0,2)]
|
||||
),
|
||||
# Organization name by flat
|
||||
# Nations Unies -> Nations Unies
|
||||
(
|
||||
["Nations", "Unies"],
|
||||
[0, 0],
|
||||
["ROOT", "flat:name"],
|
||||
["PROPN", "PROPN"],
|
||||
[(0,2)]
|
||||
),
|
||||
# Noun compound, person name created by two flats
|
||||
# Louise de Bratagne -> Louise de Bratagne
|
||||
(
|
||||
["Louise", "de", "Bratagne"],
|
||||
[0, 0, 0],
|
||||
["ROOT", "flat:name", "flat:name"],
|
||||
["PROPN", "PROPN", "PROPN"],
|
||||
[(0,3)]
|
||||
),
|
||||
# Noun compound, person name created by two flats
|
||||
# Louis François Joseph -> Louis François Joseph
|
||||
(
|
||||
["Louis", "François", "Joseph"],
|
||||
[0, 0, 0],
|
||||
["ROOT", "flat:name", "flat:name"],
|
||||
["PROPN", "PROPN", "PROPN"],
|
||||
[(0,3)]
|
||||
),
|
||||
# one determiner + one noun + one adjective qualified by an adverb
|
||||
# quelques agriculteurs très riches -> quelques agriculteurs très riches
|
||||
(
|
||||
["quelques", "agriculteurs", "très", "riches"],
|
||||
[1, 1, 3, 1],
|
||||
['det', 'ROOT', 'advmod', 'amod'],
|
||||
['DET', 'NOUN', 'ADV', 'ADJ'],
|
||||
[(0,4)]
|
||||
),
|
||||
# Two NPs conjuncted
|
||||
# Il a un chien et un chat -> Il, un chien, un chat
|
||||
(
|
||||
['Il', 'a', 'un', 'chien', 'et', 'un', 'chat'],
|
||||
[1, 1, 3, 1, 6, 6, 3],
|
||||
['nsubj', 'ROOT', 'det', 'obj', 'cc', 'det', 'conj'],
|
||||
['PRON', 'VERB', 'DET', 'NOUN', 'CCONJ', 'DET', 'NOUN'],
|
||||
[(0,1), (2,4), (5,7)]
|
||||
|
||||
),
|
||||
# Two NPs together
|
||||
# l'écrivain brésilien Aníbal Machado -> l'écrivain brésilien, Aníbal Machado
|
||||
(
|
||||
["l'", 'écrivain', 'brésilien', 'Aníbal', 'Machado'],
|
||||
[1, 1, 1, 1, 3],
|
||||
['det', 'ROOT', 'amod', 'appos', 'flat:name'],
|
||||
['DET', 'NOUN', 'ADJ', 'PROPN', 'PROPN'],
|
||||
[(0, 3), (3, 5)]
|
||||
),
|
||||
# nmod relation between NPs
|
||||
# la destruction de la ville -> la destruction, la ville
|
||||
(
|
||||
['la', 'destruction', 'de', 'la', 'ville'],
|
||||
[1, 1, 4, 4, 1],
|
||||
['det', 'ROOT', 'case', 'det', 'nmod'],
|
||||
['DET', 'NOUN', 'ADP', 'DET', 'NOUN'],
|
||||
[(0,2), (3,5)]
|
||||
),
|
||||
# nmod relation between NPs
|
||||
# Archiduchesse d’Autriche -> Archiduchesse, Autriche
|
||||
(
|
||||
['Archiduchesse', 'd’', 'Autriche'],
|
||||
[0, 2, 0],
|
||||
['ROOT', 'case', 'nmod'],
|
||||
['NOUN', 'ADP', 'PROPN'],
|
||||
[(0,1), (2,3)]
|
||||
),
|
||||
# Compounding by nmod, several NPs chained together
|
||||
# la première usine de drogue du gouvernement -> la première usine, drogue, gouvernement
|
||||
(
|
||||
["la", "première", "usine", "de", "drogue", "du", "gouvernement"],
|
||||
[2, 2, 2, 4, 2, 6, 2],
|
||||
['det', 'amod', 'ROOT', 'case', 'nmod', 'case', 'nmod'],
|
||||
['DET', 'ADJ', 'NOUN', 'ADP', 'NOUN', 'ADP', 'NOUN'],
|
||||
[(0, 3), (4, 5), (6, 7)]
|
||||
),
|
||||
# several NPs
|
||||
# Traduction du rapport de Susana -> Traduction, rapport, Susana
|
||||
(
|
||||
['Traduction', 'du', 'raport', 'de', 'Susana'],
|
||||
[0, 2, 0, 4, 2],
|
||||
['ROOT', 'case', 'nmod', 'case', 'nmod'],
|
||||
['NOUN', 'ADP', 'NOUN', 'ADP', 'PROPN'],
|
||||
[(0,1), (2,3), (4,5)]
|
||||
|
||||
),
|
||||
# Several NPs
|
||||
# Le gros chat de Susana et son amie -> Le gros chat, Susana, son amie
|
||||
(
|
||||
['Le', 'gros', 'chat', 'de', 'Susana', 'et', 'son', 'amie'],
|
||||
[2, 2, 2, 4, 2, 7, 7, 2],
|
||||
['det', 'amod', 'ROOT', 'case', 'nmod', 'cc', 'det', 'conj'],
|
||||
['DET', 'ADJ', 'NOUN', 'ADP', 'PROPN', 'CCONJ', 'DET', 'NOUN'],
|
||||
[(0,3), (4,5), (6,8)]
|
||||
),
|
||||
# Passive subject
|
||||
# Les nouvelles dépenses sont alimentées par le grand compte bancaire de Clinton -> Les nouvelles dépenses, le grand compte bancaire, Clinton
|
||||
(
|
||||
['Les', 'nouvelles', 'dépenses', 'sont', 'alimentées', 'par', 'le', 'grand', 'compte', 'bancaire', 'de', 'Clinton'],
|
||||
[2, 2, 4, 4, 4, 8, 8, 8, 4, 8, 11, 8],
|
||||
['det', 'amod', 'nsubj:pass', 'aux:pass', 'ROOT', 'case', 'det', 'amod', 'obl:agent', 'amod', 'case', 'nmod'],
|
||||
['DET', 'ADJ', 'NOUN', 'AUX', 'VERB', 'ADP', 'DET', 'ADJ', 'NOUN', 'ADJ', 'ADP', 'PROPN'],
|
||||
[(0, 3), (6, 10), (11, 12)]
|
||||
)
|
||||
],
|
||||
)
|
||||
# fmt: on
|
||||
def test_fr_noun_chunks(fr_vocab, words, heads, deps, pos, chunk_offsets):
|
||||
doc = Doc(fr_vocab, words=words, heads=heads, deps=deps, pos=pos)
|
||||
assert [(c.start, c.end) for c in doc.noun_chunks] == chunk_offsets
|
||||
|
||||
|
||||
def test_noun_chunks_is_parsed_fr(fr_tokenizer):
|
||||
"""Test that noun_chunks raises Value Error for 'fr' language if Doc is not parsed."""
|
||||
doc = fr_tokenizer("trouver des travaux antérieurs")
|
||||
doc = fr_tokenizer("Je suis allé à l'école")
|
||||
with pytest.raises(ValueError):
|
||||
list(doc.noun_chunks)
|
||||
|
|
11
spacy/tests/lang/hi/test_text.py
Normal file
|
@ -0,0 +1,11 @@
|
|||
import pytest
|
||||
from spacy.lang.hi import Hindi
|
||||
|
||||
|
||||
@pytest.mark.issue(3625)
|
||||
def test_issue3625():
|
||||
"""Test that default punctuation rules applies to hindi unicode characters"""
|
||||
nlp = Hindi()
|
||||
doc = nlp("hi. how हुए. होटल, होटल")
|
||||
expected = ["hi", ".", "how", "हुए", ".", "होटल", ",", "होटल"]
|
||||
assert [token.text for token in doc] == expected
|
0
spacy/tests/lang/hr/__init__.py
Normal file
26
spacy/tests/lang/hr/test_text.py
Normal file
|
@ -0,0 +1,26 @@
|
|||
import pytest
|
||||
|
||||
|
||||
def test_long_text(hr_tokenizer):
|
||||
# Excerpt: European Convention on Human Rights
|
||||
text = """
|
||||
uzimajući u obzir da ta deklaracija nastoji osigurati opće i djelotvorno
|
||||
priznanje i poštovanje u njoj proglašenih prava;
|
||||
uzimajući u obzir da je cilj Vijeća Europe postizanje većeg jedinstva
|
||||
njegovih članica, i da je jedan od načina postizanja toga cilja
|
||||
očuvanje i daljnje ostvarivanje ljudskih prava i temeljnih sloboda;
|
||||
potvrđujući svoju duboku privrženost tim temeljnim slobodama
|
||||
koje su osnova pravde i mira u svijetu i koje su najbolje zaštićene
|
||||
istinskom političkom demokracijom s jedne strane te zajedničkim
|
||||
razumijevanjem i poštovanjem ljudskih prava o kojima te slobode
|
||||
ovise s druge strane;
|
||||
"""
|
||||
tokens = hr_tokenizer(text)
|
||||
assert len(tokens) == 105
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
def test_ordinal_number(hr_tokenizer):
|
||||
text = "10. prosinca 1948"
|
||||
tokens = hr_tokenizer(text)
|
||||
assert len(tokens) == 3
|
31
spacy/tests/lang/hr/test_tokenizer.py
Normal file
|
@ -0,0 +1,31 @@
|
|||
import pytest
|
||||
|
||||
HR_BASIC_TOKENIZATION_TESTS = [
|
||||
(
|
||||
"Nitko se ne smije podvrgnuti mučenju ni nečovječnom ili "
|
||||
"ponižavajućem postupanju ili kazni.",
|
||||
[
|
||||
"Nitko",
|
||||
"se",
|
||||
"ne",
|
||||
"smije",
|
||||
"podvrgnuti",
|
||||
"mučenju",
|
||||
"ni",
|
||||
"nečovječnom",
|
||||
"ili",
|
||||
"ponižavajućem",
|
||||
"postupanju",
|
||||
"ili",
|
||||
"kazni",
|
||||
".",
|
||||
],
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", HR_BASIC_TOKENIZATION_TESTS)
|
||||
def test_hr_tokenizer_basic(hr_tokenizer, text, expected_tokens):
|
||||
tokens = hr_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
0
spacy/tests/lang/is/__init__.py
Normal file
26
spacy/tests/lang/is/test_text.py
Normal file
|
@ -0,0 +1,26 @@
|
|||
import pytest
|
||||
|
||||
|
||||
def test_long_text(is_tokenizer):
|
||||
# Excerpt: European Convention on Human Rights
|
||||
text = """
|
||||
hafa í huga, að yfirlýsing þessi hefur það markmið að tryggja
|
||||
almenna og raunhæfa viðurkenningu og vernd þeirra réttinda,
|
||||
sem þar er lýst;
|
||||
hafa í huga, að markmið Evrópuráðs er að koma á nánari einingu
|
||||
aðildarríkjanna og að ein af leiðunum að því marki er sú, að
|
||||
mannréttindi og mannfrelsi séu í heiðri höfð og efld;
|
||||
lýsa á ný eindreginni trú sinni á það mannfrelsi, sem er undirstaða
|
||||
réttlætis og friðar í heiminum og best er tryggt, annars vegar með
|
||||
virku, lýðræðislegu stjórnarfari og, hins vegar, almennum skilningi
|
||||
og varðveislu þeirra mannréttinda, sem eru grundvöllur frelsisins;
|
||||
"""
|
||||
tokens = is_tokenizer(text)
|
||||
assert len(tokens) == 120
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
def test_ordinal_number(is_tokenizer):
|
||||
text = "10. desember 1948"
|
||||
tokens = is_tokenizer(text)
|
||||
assert len(tokens) == 3
|
30
spacy/tests/lang/is/test_tokenizer.py
Normal file
|
@ -0,0 +1,30 @@
|
|||
import pytest
|
||||
|
||||
IS_BASIC_TOKENIZATION_TESTS = [
|
||||
(
|
||||
"Enginn maður skal sæta pyndingum eða ómannlegri eða "
|
||||
"vanvirðandi meðferð eða refsingu. ",
|
||||
[
|
||||
"Enginn",
|
||||
"maður",
|
||||
"skal",
|
||||
"sæta",
|
||||
"pyndingum",
|
||||
"eða",
|
||||
"ómannlegri",
|
||||
"eða",
|
||||
"vanvirðandi",
|
||||
"meðferð",
|
||||
"eða",
|
||||
"refsingu",
|
||||
".",
|
||||
],
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", IS_BASIC_TOKENIZATION_TESTS)
|
||||
def test_is_tokenizer_basic(is_tokenizer, text, expected_tokens):
|
||||
tokens = is_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
221
spacy/tests/lang/it/test_noun_chunks.py
Normal file
|
@ -0,0 +1,221 @@
|
|||
from spacy.tokens import Doc
|
||||
import pytest
|
||||
|
||||
|
||||
# fmt: off
|
||||
@pytest.mark.parametrize(
|
||||
"words,heads,deps,pos,chunk_offsets",
|
||||
[
|
||||
# determiner + noun
|
||||
# un pollo -> un pollo
|
||||
(
|
||||
["un", "pollo"],
|
||||
[1, 1],
|
||||
["det", "ROOT"],
|
||||
["DET", "NOUN"],
|
||||
[(0,2)],
|
||||
),
|
||||
# two determiners + noun
|
||||
# il mio cane -> il mio cane
|
||||
(
|
||||
["il", "mio", "cane"],
|
||||
[2, 2, 2],
|
||||
["det", "det:poss", "ROOT"],
|
||||
["DET", "DET", "NOUN"],
|
||||
[(0,3)],
|
||||
),
|
||||
# two determiners, one after the noun; rare usage, but still worth testing
|
||||
# il cane mio -> il cane mio
|
||||
(
|
||||
["il", "cane", "mio"],
|
||||
[1, 1, 1],
|
||||
["det", "ROOT", "det:poss"],
|
||||
["DET", "NOUN", "DET"],
|
||||
[(0,3)],
|
||||
),
|
||||
# relative pronoun
|
||||
# È molto bello il vestito che hai acquistato (the dress that you bought is very pretty) -> il vestito, che
|
||||
(
|
||||
["È", "molto", "bello", "il", "vestito", "che", "hai", "acquistato"],
|
||||
[2, 2, 2, 4, 2, 7, 7, 4],
|
||||
['cop', 'advmod', 'ROOT', 'det', 'nsubj', 'obj', 'aux', 'acl:relcl'],
|
||||
['AUX', 'ADV', 'ADJ', 'DET', 'NOUN', 'PRON', 'AUX', 'VERB'],
|
||||
[(3,5), (5,6)]
|
||||
),
|
||||
# relative subclause
|
||||
# il computer che hai comprato (the computer that you bought) -> il computer, che
|
||||
(
|
||||
['il', 'computer', 'che', 'hai', 'comprato'],
|
||||
[1, 1, 4, 4, 1],
|
||||
['det', 'ROOT', 'nsubj', 'aux', 'acl:relcl'],
|
||||
['DET', 'NOUN', 'PRON', 'AUX', 'VERB'],
|
||||
[(0,2), (2,3)]
|
||||
),
|
||||
# det + noun + adj
|
||||
# Una macchina grande -> Una macchina grande
|
||||
(
|
||||
["Una", "macchina", "grande"],
|
||||
[1, 1, 1],
|
||||
["det", "ROOT", "amod"],
|
||||
["DET", "NOUN", "ADJ"],
|
||||
[(0,3)],
|
||||
),
|
||||
# noun + adj plural
|
||||
# mucche bianche
|
||||
(
|
||||
["mucche", "bianche"],
|
||||
[0, 0],
|
||||
["ROOT", "amod"],
|
||||
["NOUN", "ADJ"],
|
||||
[(0,2)],
|
||||
),
|
||||
# det + adj + noun
|
||||
# Una grande macchina -> Una grande macchina
|
||||
(
|
||||
['Una', 'grande', 'macchina'],
|
||||
[2, 2, 2],
|
||||
["det", "amod", "ROOT"],
|
||||
["DET", "ADJ", "NOUN"],
|
||||
[(0,3)]
|
||||
),
|
||||
# det + adj + noun, det with apostrophe
|
||||
# un'importante associazione -> un'importante associazione
|
||||
(
|
||||
["Un'", 'importante', 'associazione'],
|
||||
[2, 2, 2],
|
||||
["det", "amod", "ROOT"],
|
||||
["DET", "ADJ", "NOUN"],
|
||||
[(0,3)]
|
||||
),
|
||||
# multiple adjectives
|
||||
# Un cane piccolo e marrone -> Un cane piccolo e marrone
|
||||
(
|
||||
["Un", "cane", "piccolo", "e", "marrone"],
|
||||
[1, 1, 1, 4, 2],
|
||||
["det", "ROOT", "amod", "cc", "conj"],
|
||||
["DET", "NOUN", "ADJ", "CCONJ", "ADJ"],
|
||||
[(0,5)]
|
||||
),
|
||||
# determiner, adjective, compound created by flat
|
||||
# le Nazioni Unite -> le Nazioni Unite
|
||||
(
|
||||
["le", "Nazioni", "Unite"],
|
||||
[1, 1, 1],
|
||||
["det", "ROOT", "flat:name"],
|
||||
["DET", "PROPN", "PROPN"],
|
||||
[(0,3)]
|
||||
),
|
||||
# one determiner + one noun + one adjective qualified by an adverb
|
||||
# alcuni contadini molto ricchi (some very rich farmers) -> alcuni contadini molto ricchi
|
||||
(
|
||||
['alcuni', 'contadini', 'molto', 'ricchi'],
|
||||
[1, 1, 3, 1],
|
||||
['det', 'ROOT', 'advmod', 'amod'],
|
||||
['DET', 'NOUN', 'ADV', 'ADJ'],
|
||||
[(0,4)]
|
||||
),
|
||||
# Two NPs conjuncted
|
||||
# Ho un cane e un gatto -> un cane, un gatto
|
||||
(
|
||||
['Ho', 'un', 'cane', 'e', 'un', 'gatto'],
|
||||
[0, 2, 0, 5, 5, 0],
|
||||
['ROOT', 'det', 'obj', 'cc', 'det', 'conj'],
|
||||
['VERB', 'DET', 'NOUN', 'CCONJ', 'DET', 'NOUN'],
|
||||
[(1,3), (4,6)]
|
||||
|
||||
),
|
||||
# Two NPs together
|
||||
# lo scrittore brasiliano Aníbal Machado -> lo scrittore brasiliano, Aníbal Machado
|
||||
(
|
||||
['lo', 'scrittore', 'brasiliano', 'Aníbal', 'Machado'],
|
||||
[1, 1, 1, 1, 3],
|
||||
['det', 'ROOT', 'amod', 'nmod', 'flat:name'],
|
||||
['DET', 'NOUN', 'ADJ', 'PROPN', 'PROPN'],
|
||||
[(0, 3), (3, 5)]
|
||||
),
|
||||
# Noun compound, person name and titles
|
||||
# Dom Pedro II -> Dom Pedro II
|
||||
(
|
||||
["Dom", "Pedro", "II"],
|
||||
[0, 0, 0],
|
||||
["ROOT", "flat:name", "flat:name"],
|
||||
["PROPN", "PROPN", "PROPN"],
|
||||
[(0,3)]
|
||||
),
|
||||
# Noun compound created by flat
|
||||
# gli Stati Uniti
|
||||
(
|
||||
["gli", "Stati", "Uniti"],
|
||||
[1, 1, 1],
|
||||
["det", "ROOT", "flat:name"],
|
||||
["DET", "PROPN", "PROPN"],
|
||||
[(0,3)]
|
||||
),
|
||||
# nmod relation between NPs
|
||||
# la distruzione della città -> la distruzione, città
|
||||
(
|
||||
['la', 'distruzione', 'della', 'città'],
|
||||
[1, 1, 3, 1],
|
||||
['det', 'ROOT', 'case', 'nmod'],
|
||||
['DET', 'NOUN', 'ADP', 'NOUN'],
|
||||
[(0,2), (3,4)]
|
||||
),
|
||||
# Compounding by nmod, several NPs chained together
|
||||
# la prima fabbrica di droga del governo -> la prima fabbrica, droga, governo
|
||||
(
|
||||
["la", "prima", "fabbrica", "di", "droga", "del", "governo"],
|
||||
[2, 2, 2, 4, 2, 6, 2],
|
||||
['det', 'amod', 'ROOT', 'case', 'nmod', 'case', 'nmod'],
|
||||
['DET', 'ADJ', 'NOUN', 'ADP', 'NOUN', 'ADP', 'NOUN'],
|
||||
[(0, 3), (4, 5), (6, 7)]
|
||||
),
|
||||
# several NPs
|
||||
# Traduzione del rapporto di Susana -> Traduzione, rapporto, Susana
|
||||
(
|
||||
['Traduzione', 'del', 'rapporto', 'di', 'Susana'],
|
||||
[0, 2, 0, 4, 2],
|
||||
['ROOT', 'case', 'nmod', 'case', 'nmod'],
|
||||
['NOUN', 'ADP', 'NOUN', 'ADP', 'PROPN'],
|
||||
[(0,1), (2,3), (4,5)]
|
||||
|
||||
),
|
||||
# Several NPs
|
||||
# Il gatto grasso di Susana e la sua amica -> Il gatto grasso, Susana, sua amica
|
||||
(
|
||||
['Il', 'gatto', 'grasso', 'di', 'Susana', 'e', 'la', 'sua', 'amica'],
|
||||
[1, 1, 1, 4, 1, 8, 8, 8, 1],
|
||||
['det', 'ROOT', 'amod', 'case', 'nmod', 'cc', 'det', 'det:poss', 'conj'],
|
||||
['DET', 'NOUN', 'ADJ', 'ADP', 'PROPN', 'CCONJ', 'DET', 'DET', 'NOUN'],
|
||||
[(0,3), (4,5), (6,9)]
|
||||
),
|
||||
# Passive subject
|
||||
# La nuova spesa è alimentata dal grande conto in banca di Clinton -> La nuova spesa, grande conto, banca, Clinton
|
||||
(
|
||||
['La', 'nuova', 'spesa', 'è', 'alimentata', 'dal', 'grande', 'conto', 'in', 'banca', 'di', 'Clinton'],
|
||||
[2, 2, 4, 4, 4, 7, 7, 4, 9, 7, 11, 9],
|
||||
['det', 'amod', 'nsubj:pass', 'aux:pass', 'ROOT', 'case', 'amod', 'obl:agent', 'case', 'nmod', 'case', 'nmod'],
|
||||
['DET', 'ADJ', 'NOUN', 'AUX', 'VERB', 'ADP', 'ADJ', 'NOUN', 'ADP', 'NOUN', 'ADP', 'PROPN'],
|
||||
[(0, 3), (6, 8), (9, 10), (11,12)]
|
||||
),
|
||||
# Misc
|
||||
# Ma mentre questo prestito possa ora sembrare gestibile, un improvviso cambiamento delle circostanze potrebbe portare a problemi di debiti -> questo prestito, un improvviso cambiamento, circostanze, problemi, debiti
|
||||
(
|
||||
['Ma', 'mentre', 'questo', 'prestito', 'possa', 'ora', 'sembrare', 'gestibile', ',', 'un', 'improvviso', 'cambiamento', 'delle', 'circostanze', 'potrebbe', 'portare', 'a', 'problemi', 'di', 'debitii'],
|
||||
[15, 6, 3, 6, 6, 6, 15, 6, 6, 11, 11, 15, 13, 11, 15, 15, 17, 15, 19, 17],
|
||||
['cc', 'mark', 'det', 'nsubj', 'aux', 'advmod', 'advcl', 'xcomp', 'punct', 'det', 'amod', 'nsubj', 'case', 'nmod', 'aux', 'ROOT', 'case', 'obl', 'case', 'nmod'],
|
||||
['CCONJ', 'SCONJ', 'DET', 'NOUN', 'AUX', 'ADV', 'VERB', 'ADJ', 'PUNCT', 'DET', 'ADJ', 'NOUN', 'ADP', 'NOUN', 'AUX', 'VERB', 'ADP', 'NOUN', 'ADP', 'NOUN'],
|
||||
[(2,4), (9,12), (13,14), (17,18), (19,20)]
|
||||
)
|
||||
],
|
||||
)
|
||||
# fmt: on
|
||||
def test_it_noun_chunks(it_vocab, words, heads, deps, pos, chunk_offsets):
|
||||
doc = Doc(it_vocab, words=words, heads=heads, deps=deps, pos=pos)
|
||||
assert [(c.start, c.end) for c in doc.noun_chunks] == chunk_offsets
|
||||
|
||||
|
||||
def test_noun_chunks_is_parsed_it(it_tokenizer):
|
||||
"""Test that noun_chunks raises Value Error for 'it' language if Doc is not parsed."""
|
||||
doc = it_tokenizer("Sei andato a Oxford")
|
||||
with pytest.raises(ValueError):
|
||||
list(doc.noun_chunks)
|
17
spacy/tests/lang/it/test_stopwords.py
Normal file
|
@ -0,0 +1,17 @@
|
|||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"word", ["un", "lo", "dell", "dall", "si", "ti", "mi", "quest", "quel", "quello"]
|
||||
)
|
||||
def test_stopwords_basic(it_tokenizer, word):
|
||||
tok = it_tokenizer(word)[0]
|
||||
assert tok.is_stop
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"word", ["quest'uomo", "l'ho", "un'amica", "dell'olio", "s'arrende", "m'ascolti"]
|
||||
)
|
||||
def test_stopwords_elided(it_tokenizer, word):
|
||||
tok = it_tokenizer(word)[0]
|
||||
assert tok.is_stop
|
14
spacy/tests/lang/it/test_text.py
Normal file
|
@ -0,0 +1,14 @@
|
|||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.issue(2822)
|
||||
def test_issue2822(it_tokenizer):
|
||||
"""Test that the abbreviation of poco is kept as one word."""
|
||||
doc = it_tokenizer("Vuoi un po' di zucchero?")
|
||||
assert len(doc) == 6
|
||||
assert doc[0].text == "Vuoi"
|
||||
assert doc[1].text == "un"
|
||||
assert doc[2].text == "po'"
|
||||
assert doc[3].text == "di"
|
||||
assert doc[4].text == "zucchero"
|
||||
assert doc[5].text == "?"
|
|
@ -54,6 +54,18 @@ SUB_TOKEN_TESTS = [
|
|||
# fmt: on
|
||||
|
||||
|
||||
@pytest.mark.issue(2901)
|
||||
def test_issue2901():
|
||||
"""Test that `nlp` doesn't fail."""
|
||||
try:
|
||||
nlp = Japanese()
|
||||
except ImportError:
|
||||
pytest.skip()
|
||||
|
||||
doc = nlp("pythonが大好きです")
|
||||
assert doc
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", TOKENIZER_TESTS)
|
||||
def test_ja_tokenizer(ja_tokenizer, text, expected_tokens):
|
||||
tokens = [token.text for token in ja_tokenizer(text)]
|
||||
|
|
|
@ -47,3 +47,23 @@ def test_ko_tokenizer_pos(ko_tokenizer, text, expected_pos):
|
|||
def test_ko_empty_doc(ko_tokenizer):
|
||||
tokens = ko_tokenizer("")
|
||||
assert len(tokens) == 0
|
||||
|
||||
|
||||
# fmt: off
|
||||
SPACY_TOKENIZER_TESTS = [
|
||||
("있다.", "있다 ."),
|
||||
("'예'는", "' 예 ' 는"),
|
||||
("부 (富) 는", "부 ( 富 ) 는"),
|
||||
("부(富)는", "부 ( 富 ) 는"),
|
||||
("1982~1983.", "1982 ~ 1983 ."),
|
||||
("사과·배·복숭아·수박은 모두 과일이다.", "사과 · 배 · 복숭아 · 수박은 모두 과일이다 ."),
|
||||
("그렇구나~", "그렇구나~"),
|
||||
("『9시 반의 당구』,", "『 9시 반의 당구 』 ,"),
|
||||
]
|
||||
# fmt: on
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", SPACY_TOKENIZER_TESTS)
|
||||
def test_ko_spacy_tokenizer(ko_tokenizer_tokenizer, text, expected_tokens):
|
||||
tokens = [token.text for token in ko_tokenizer_tokenizer(text)]
|
||||
assert tokens == expected_tokens.split()
|
||||
|
|
0
spacy/tests/lang/lv/__init__.py
Normal file
27
spacy/tests/lang/lv/test_text.py
Normal file
|
@ -0,0 +1,27 @@
|
|||
import pytest
|
||||
|
||||
|
||||
def test_long_text(lv_tokenizer):
|
||||
# Excerpt: European Convention on Human Rights
|
||||
text = """
|
||||
Ievērodamas, ka šī deklarācija paredz nodrošināt vispārēju un
|
||||
efektīvu tajā pasludināto tiesību atzīšanu un ievērošanu;
|
||||
Ievērodamas, ka Eiropas Padomes mērķis ir panākt lielāku vienotību
|
||||
tās dalībvalstu starpā un ka viens no līdzekļiem, kā šo mērķi
|
||||
sasniegt, ir cilvēka tiesību un pamatbrīvību ievērošana un turpmāka
|
||||
īstenošana;
|
||||
No jauna apliecinādamas patiesu pārliecību, ka šīs pamatbrīvības
|
||||
ir taisnīguma un miera pamats visā pasaulē un ka tās vislabāk var
|
||||
nodrošināt patiess demokrātisks politisks režīms no vienas puses un
|
||||
vispārējo cilvēktiesību, uz kurām tās pamatojas, kopīga izpratne un
|
||||
ievērošana no otras puses;
|
||||
"""
|
||||
tokens = lv_tokenizer(text)
|
||||
assert len(tokens) == 109
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
def test_ordinal_number(lv_tokenizer):
|
||||
text = "10. decembrī"
|
||||
tokens = lv_tokenizer(text)
|
||||
assert len(tokens) == 2
|
30
spacy/tests/lang/lv/test_tokenizer.py
Normal file
|
@ -0,0 +1,30 @@
|
|||
import pytest
|
||||
|
||||
LV_BASIC_TOKENIZATION_TESTS = [
|
||||
(
|
||||
"Nevienu nedrīkst spīdzināt vai cietsirdīgi vai pazemojoši ar viņu "
|
||||
"apieties vai sodīt.",
|
||||
[
|
||||
"Nevienu",
|
||||
"nedrīkst",
|
||||
"spīdzināt",
|
||||
"vai",
|
||||
"cietsirdīgi",
|
||||
"vai",
|
||||
"pazemojoši",
|
||||
"ar",
|
||||
"viņu",
|
||||
"apieties",
|
||||
"vai",
|
||||
"sodīt",
|
||||
".",
|
||||
],
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", LV_BASIC_TOKENIZATION_TESTS)
|
||||
def test_lv_tokenizer_basic(lv_tokenizer, text, expected_tokens):
|
||||
tokens = lv_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
0
spacy/tests/lang/sk/__init__.py
Normal file
Some files were not shown because too many files have changed in this diff.