Merge branch 'master' of ssh://github.com/explosion/spaCy

This commit is contained in:
Matthew Honnibal 2017-01-16 13:18:06 +01:00
commit 48c712f1c1
78 changed files with 1615 additions and 1971 deletions

View File

@ -7,10 +7,11 @@ Following the v1.0 release, it's time to welcome more contributors into the spaC
## Table of contents
1. [Issues and bug reports](#issues-and-bug-reports)
2. [Contributing to the code base](#contributing-to-the-code-base)
3. [Updating the website](#updating-the-website)
4. [Submitting a tutorial](#submitting-a-tutorial)
5. [Submitting a project to the showcase](#submitting-a-project-to-the-showcase)
6. [Code of conduct](#code-of-conduct)
3. [Adding tests](#adding-tests)
4. [Updating the website](#updating-the-website)
5. [Submitting a tutorial](#submitting-a-tutorial)
6. [Submitting a project to the showcase](#submitting-a-project-to-the-showcase)
7. [Code of conduct](#code-of-conduct)
## Issues and bug reports
@ -33,6 +34,7 @@ We use the following system to tag our issues:
| [`install`](https://github.com/explosion/spaCy/labels/install) | Installation problems |
| [`performance`](https://github.com/explosion/spaCy/labels/performance) | Accuracy, speed and memory use problems |
| [`tests`](https://github.com/explosion/spaCy/labels/tests) | Missing or incorrect [tests](spacy/tests) |
| [`examples`](https://github.com/explosion/spaCy/labels/examples) | Issues related to the [examples](spacy/examples) |
| [`english`](https://github.com/explosion/spaCy/labels/english), [`german`](https://github.com/explosion/spaCy/labels/german) | Issues related to the specific languages, models and data |
| [`linux`](https://github.com/explosion/spaCy/labels/linux), [`osx`](https://github.com/explosion/spaCy/labels/osx), [`windows`](https://github.com/explosion/spaCy/labels/windows) | Issues related to the specific operating systems |
| [`pip`](https://github.com/explosion/spaCy/labels/pip), [`conda`](https://github.com/explosion/spaCy/labels/conda) | Issues related to the specific package managers |
@ -66,11 +68,21 @@ example_user would create the file `.github/contributors/example_user.md`.
### Fixing bugs
When fixing a bug, first create an [issue](https://github.com/explosion/spaCy/issues) if one does not already exist. The description text can be very short, as we don't want to make this too bureaucratic. Next, create a test file named `test_issue[ISSUE NUMBER].py` in the [`spacy/tests/regression`](spacy/tests/regression) folder.
When fixing a bug, first create an [issue](https://github.com/explosion/spaCy/issues) if one does not already exist. The description text can be very short, as we don't want to make this too bureaucratic.
Next, create a test file named `test_issue[ISSUE NUMBER].py` in the [`spacy/tests/regression`](spacy/tests/regression) folder. Test for the bug you're fixing, and make sure the test fails. If the test requires the models to be loaded, mark it with the `pytest.mark.models` decorator.
Next, create a test file named `test_issue[ISSUE NUMBER].py` in the [`spacy/tests/regression`](spacy/tests/regression) folder. Test for the bug you're fixing, and make sure the test fails. Next, add and commit your test file referencing the issue number in the commit message. Finally, fix the bug, make sure your test passes and reference the issue in your commit message.
Next, add and commit your test file referencing the issue number in the commit message. Finally, fix the bug, make sure your test passes and reference the issue in your commit message.
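As a rough sketch of such a regression test (the issue number, example text and expected token count below are made up, and `en_tokenizer` is one of the shared fixtures documented in the tests README):

```python
# coding: utf-8
# hypothetical file: spacy/tests/regression/test_issue1234.py
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize('text', ["This text triggers the reported bug."])
def test_issue1234(en_tokenizer, text):
    # should fail before the fix and pass once the bug is fixed
    tokens = en_tokenizer(text)
    assert len(tokens) == 7
```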
📖 **For more information on how to add tests, check out the [tests README](spacy/tests/README.md).**
## Adding tests
spaCy uses the [pytest](http://doc.pytest.org/) framework for testing. For more info on this, see the [pytest documentation](http://docs.pytest.org/en/latest/contents.html). Tests for spaCy modules and classes live in their own directories of the same name. For example, tests for the `Tokenizer` can be found in [`/spacy/tests/tokenizer`](spacy/tests/tokenizer). To be interpreted and run, all test files and test functions need to be prefixed with `test_`.
When adding tests, make sure to use descriptive names, keep the code short and concise and only test for one behaviour at a time. Try to `parametrize` test cases wherever possible, use our pre-defined fixtures for spaCy components and avoid unnecessary imports.
Extensive tests that take a long time should be marked with `@pytest.mark.slow`. Tests that require the model to be loaded should be marked with `@pytest.mark.models`. Loading the models is expensive and not necessary if you're not actually testing the model performance. If all you need is a `Doc` object with annotations like heads, POS tags or the dependency parse, you can use the `get_doc()` utility function to construct it manually.
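For instance, marking expensive tests could look roughly like this (both tests are made-up examples; `EN` and `en_tokenizer` are fixtures from the tests README):

```python
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.models
def test_en_tagger_example(EN):
    # hypothetical: only runs when py.test is invoked with --models
    doc = EN(u'This is a sentence.')
    assert doc[0].tag_ != ''


@pytest.mark.slow
def test_tokenizer_example_long_text(en_tokenizer):
    # hypothetical: an extensive test that only runs with --slow
    tokens = en_tokenizer(u'This is a sentence. ' * 1000)
    assert len(tokens) >= 1000
```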
📖 **For more guidelines and information on how to add tests, check out the [tests README](spacy/tests/README.md).**
## Updating the website
@ -86,7 +98,9 @@ harp server
The docs can always use another example or more detail, and they should always be up to date and not misleading. To quickly find the correct file to edit, simply click on the "Suggest edits" button at the bottom of a page.
To make it easy to add content components, we use a [collection of custom mixins](_includes/_mixins.jade), like `+table`, `+list` or `+code`. For more info and troubleshooting guides, check out the [website README](website).
To make it easy to add content components, we use a [collection of custom mixins](_includes/_mixins.jade), like `+table`, `+list` or `+code`.
📖 **For more info and troubleshooting guides, check out the [website README](website).**
### Resources to get you started

View File

@ -10,10 +10,6 @@ released under the MIT license.
💫 **Version 1.5 out now!** `Read the release notes here. <https://github.com/explosion/spaCy/releases/>`_
.. image:: http://i.imgur.com/wFvLZyJ.png
:target: https://travis-ci.org/explosion/spaCy
:alt: spaCy on Travis CI
.. image:: https://travis-ci.org/explosion/spaCy.svg?branch=master
:target: https://travis-ci.org/explosion/spaCy
:alt: Build Status
@ -26,9 +22,13 @@ released under the MIT license.
:target: https://pypi.python.org/pypi/spacy
:alt: pypi Version
.. image:: https://badges.gitter.im/explosion.png
.. image:: https://img.shields.io/badge/gitter-join%20chat%20%E2%86%92-09a3d5.svg
:target: https://gitter.im/explosion/spaCy
:alt: spaCy on Gitter
.. image:: https://img.shields.io/twitter/follow/spacy_io.svg?style=social&label=Follow
:target: https://twitter.com/spacy_io
:alt: spaCy on Twitter
📖 Documentation
================

View File

@ -47,11 +47,16 @@ First, install [Keras](https://keras.io/), [spaCy](https://spacy.io) and the spa
English models (about 1GB of data):
```bash
pip install keras spacy
pip install https://github.com/fchollet/keras/archive/master.zip
pip install spacy
python -m spacy.en.download
```
You'll also want to get keras working on your GPU. This will depend on your
⚠️ **Important:** In order for the example to run, you'll need to install Keras from
the master branch (and not via `pip install keras`). For more info on this, see
[#727](https://github.com/explosion/spaCy/issues/727).
You'll also want to get Keras working on your GPU. This will depend on your
setup, so you're mostly on your own for this step. If you're using AWS, try the
[NVidia AMI](https://aws.amazon.com/marketplace/pp/B00FYCDDTE). It made things pretty easy.

20
fabfile.py vendored
View File

@ -1,22 +1,20 @@
from __future__ import print_function
# coding: utf-8
from __future__ import unicode_literals, print_function
from fabric.api import local, lcd, env, settings, prefix
from os.path import exists as file_exists
from fabtools.python import virtualenv
from os import path
import os
import shutil
from pathlib import Path
from os import path, environ
PWD = path.dirname(__file__)
VENV_DIR = path.join(PWD, '.env')
ENV = environ['VENV_DIR'] if 'VENV_DIR' in environ else '.env'
VENV_DIR = path.join(PWD, ENV)
def env(lang="python2.7"):
if file_exists('.env'):
local('rm -rf .env')
local('virtualenv -p %s .env' % lang)
def env(lang='python2.7'):
if path.exists(VENV_DIR):
local('rm -rf {env}'.format(env=VENV_DIR))
local('virtualenv -p {lang} {env}'.format(lang=lang, env=VENV_DIR))
def install():

View File

@ -37,7 +37,6 @@ PACKAGES = [
'spacy.munge',
'spacy.tests',
'spacy.tests.matcher',
'spacy.tests.munge',
'spacy.tests.parser',
'spacy.tests.serialize',
'spacy.tests.spans',

View File

@ -7,7 +7,7 @@ from ..language_data import PRON_LEMMA
EXC = {}
EXCLUDE_EXC = ["Ill", "ill", "Its", "its", "Hell", "hell", "Well", "well", "Whore", "whore"]
EXCLUDE_EXC = ["Ill", "ill", "Its", "its", "Hell", "hell", "were", "Were", "Well", "well", "Whore", "whore"]
# Pronouns

View File

@ -1,6 +1,7 @@
# encoding: utf8
from __future__ import unicode_literals, print_function
from spacy.hu.tokenizer_exceptions import TOKEN_MATCH
from .language_data import *
from ..attrs import LANG
from ..language import Language
@ -21,3 +22,5 @@ class Hungarian(Language):
infixes = tuple(TOKENIZER_INFIXES)
stop_words = set(STOP_WORDS)
token_match = TOKEN_MATCH

View File

@ -1,8 +1,6 @@
# encoding: utf8
from __future__ import unicode_literals
import six
from spacy.language_data import strings_to_exc, update_exc
from .punctuation import *
from .stop_words import STOP_WORDS
@ -10,19 +8,15 @@ from .tokenizer_exceptions import ABBREVIATIONS
from .tokenizer_exceptions import OTHER_EXC
from .. import language_data as base
STOP_WORDS = set(STOP_WORDS)
TOKENIZER_EXCEPTIONS = strings_to_exc(base.EMOTICONS)
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(base.ABBREVIATIONS))
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(OTHER_EXC))
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(ABBREVIATIONS))
TOKENIZER_PREFIXES = base.TOKENIZER_PREFIXES
TOKENIZER_SUFFIXES = base.TOKENIZER_SUFFIXES + TOKENIZER_SUFFIXES
TOKENIZER_PREFIXES = TOKENIZER_PREFIXES
TOKENIZER_SUFFIXES = TOKENIZER_SUFFIXES
TOKENIZER_INFIXES = TOKENIZER_INFIXES
__all__ = ["TOKENIZER_EXCEPTIONS", "STOP_WORDS", "TOKENIZER_PREFIXES", "TOKENIZER_SUFFIXES", "TOKENIZER_INFIXES"]

View File

@ -1,25 +1,41 @@
# encoding: utf8
from __future__ import unicode_literals
from ..language_data.punctuation import ALPHA, ALPHA_LOWER, ALPHA_UPPER, LIST_ELLIPSES
from ..language_data.punctuation import ALPHA_LOWER, LIST_ELLIPSES, QUOTES, ALPHA_UPPER, LIST_QUOTES, UNITS, \
CURRENCY, LIST_PUNCT, ALPHA, _QUOTES
CURRENCY_SYMBOLS = r"\$ ¢ £ € ¥ ฿"
TOKENIZER_SUFFIXES = [
r'(?<=[{al})])-e'.format(al=ALPHA_LOWER)
]
TOKENIZER_PREFIXES = (
[r'\+'] +
LIST_PUNCT +
LIST_ELLIPSES +
LIST_QUOTES
)
TOKENIZER_INFIXES = [
r'(?<=[0-9])-(?=[0-9])',
r'(?<=[0-9])[+\-\*/^](?=[0-9])',
r'(?<=[{a}])--(?=[{a}])',
r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
r'(?<=[{al}])\.(?=[{au}])'.format(al=ALPHA_LOWER, au=ALPHA_UPPER),
r'(?<=[0-9{a}])"(?=[\-{a}])'.format(a=ALPHA),
r'(?<=[{a}"])[:<>=](?=[{a}])'.format(a=ALPHA)
]
TOKENIZER_SUFFIXES = (
LIST_PUNCT +
LIST_ELLIPSES +
LIST_QUOTES +
[
r'(?<=[0-9])\+',
r'(?<=°[FfCcKk])\.',
r'(?<=[0-9])(?:{c})'.format(c=CURRENCY),
r'(?<=[0-9])(?:{u})'.format(u=UNITS),
r'(?<=[{al}{p}{c}(?:{q})])\.'.format(al=ALPHA_LOWER, p=r'%²\-\)\]\+', q=QUOTES, c=CURRENCY_SYMBOLS),
r'(?<=[{al})])-e'.format(al=ALPHA_LOWER)
]
)
TOKENIZER_INFIXES += LIST_ELLIPSES
__all__ = ["TOKENIZER_SUFFIXES", "TOKENIZER_INFIXES"]
TOKENIZER_INFIXES = (
LIST_ELLIPSES +
[
r'(?<=[{al}])\.(?=[{au}])'.format(al=ALPHA_LOWER, au=ALPHA_UPPER),
r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}"])[:<>=](?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}])--(?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}])([{q}\)\]\(\[])(?=[\-{a}])'.format(a=ALPHA, q=_QUOTES.replace("'", "").strip().replace(" ", "")),
]
)
__all__ = ["TOKENIZER_PREFIXES", "TOKENIZER_SUFFIXES", "TOKENIZER_INFIXES"]

View File

@ -1,9 +1,17 @@
# encoding: utf8
from __future__ import unicode_literals
import re
from spacy.language_data.punctuation import ALPHA_LOWER, CURRENCY
from ..language_data.tokenizer_exceptions import _URL_PATTERN
ABBREVIATIONS = """
A.
AG.
AkH.
.
B.
B.CS.
B.S.
B.Sc.
@ -13,57 +21,103 @@ BEK.
BSC.
BSc.
BTK.
Bat.
Be.
Bek.
Bfok.
Bk.
Bp.
Bros.
Bt.
Btk.
Btke.
Btét.
C.
CSC.
Cal.
Cg.
Cgf.
Cgt.
Cia.
Co.
Colo.
Comp.
Copr.
Corp.
Cos.
Cs.
Csc.
Csop.
Cstv.
Ctv.
Ctvr.
D.
DR.
Dipl.
Dr.
Dsz.
Dzs.
E.
EK.
EU.
F.
Fla.
Folyt.
Fpk.
Főszerk.
G.
GK.
GM.
Gfv.
Gmk.
Gr.
Group.
Gt.
Gy.
H.
HKsz.
Hmvh.
I.
Ifj.
Inc.
Inform.
Int.
J.
Jr.
Jv.
K.
K.m.f.
KB.
KER.
KFT.
KRT.
Kb.
Ker.
Kft.
Kg.
Kht.
Kkt.
Kong.
Korm.
Kr.
Kr.e.
Kr.u.
Krt.
L.
LB.
Llc.
Ltd.
M.
M.A.
M.S.
M.SC.
M.Sc.
MA.
MH.
MSC.
MSc.
Mass.
Max.
Mlle.
Mme.
Mo.
@ -71,45 +125,77 @@ Mr.
Mrs.
Ms.
Mt.
N.
N.N.
NB.
NBr.
Nat.
No.
Nr.
Ny.
Nyh.
Nyr.
Nyrt.
O.
OJ.
Op.
P.
P.H.
P.S.
PH.D.
PHD.
PROF.
Pf.
Ph.D
PhD.
Pk.
Pl.
Plc.
Pp.
Proc.
Prof.
Ptk.
R.
RT.
Rer.
Rt.
S.
S.B.
SZOLG.
Salg.
Sch.
Spa.
St.
Sz.
SzRt.
Szerk.
Szfv.
Szjt.
Szolg.
Szt.
Sztv.
Szvt.
Számv.
T.
TEL.
Tel.
Ty.
Tyr.
U.
Ui.
Ut.
V.
VB.
Vcs.
Vhr.
Vht.
Várm.
W.
X.
X.Y.
Y.
Z.
Zrt.
Zs.
a.C.
ac.
@ -119,11 +205,13 @@ ag.
agit.
alez.
alk.
all.
altbgy.
an.
ang.
arch.
at.
atc.
aug.
b.a.
b.s.
@ -161,6 +249,7 @@ dikt.
dipl.
dj.
dk.
dl.
dny.
dolg.
dr.
@ -184,6 +273,7 @@ eü.
f.h.
f.é.
fam.
fb.
febr.
fej.
felv.
@ -211,6 +301,7 @@ gazd.
gimn.
gk.
gkv.
gmk.
gondn.
gr.
grav.
@ -240,6 +331,7 @@ hőm.
i.e.
i.sz.
id.
ie.
ifj.
ig.
igh.
@ -254,6 +346,7 @@ io.
ip.
ir.
irod.
irod.
isk.
ism.
izr.
@ -261,6 +354,7 @@ iá.
jan.
jav.
jegyz.
jgmk.
jjv.
jkv.
jogh.
@ -271,6 +365,7 @@ júl.
jún.
karb.
kat.
kath.
kb.
kcs.
kd.
@ -285,6 +380,8 @@ kiv.
kk.
kkt.
klin.
km.
korm.
kp.
krt.
kt.
@ -318,6 +415,7 @@ m.s.
m.sc.
ma.
mat.
max.
mb.
med.
megh.
@ -353,6 +451,7 @@ nat.
nb.
neg.
nk.
no.
nov.
nu.
ny.
@ -362,6 +461,7 @@ nyug.
obj.
okl.
okt.
old.
olv.
orsz.
ort.
@ -372,6 +472,8 @@ pg.
ph.d
ph.d.
phd.
phil.
pjt.
pk.
pl.
plb.
@ -406,6 +508,7 @@ röv.
s.b.
s.k.
sa.
sb.
sel.
sgt.
sm.
@ -413,6 +516,7 @@ st.
stat.
stb.
strat.
stud.
sz.
szakm.
szaksz.
@ -467,6 +571,7 @@ vb.
vegy.
vh.
vhol.
vhr.
vill.
vizsg.
vk.
@ -478,13 +583,20 @@ vs.
vsz.
vv.
vál.
várm.
vízv.
.
zrt.
zs.
Á.
Áe.
Áht.
É.
Épt.
Ész.
Új-Z.
ÚjZ.
Ún.
á.
ált.
ápr.
@ -500,6 +612,7 @@ zs.
ötk.
özv.
ú.
ú.n.
úm.
ún.
út.
@ -510,7 +623,6 @@ zs.
ümk.
ütk.
üv.
ő.
ű.
őrgy.
őrpk.
@ -520,3 +632,17 @@ zs.
OTHER_EXC = """
-e
""".strip().split()
ORD_NUM_OR_DATE = "([A-Z0-9]+[./-])*(\d+\.?)"
_NUM = "[+\-]?\d+([,.]\d+)*"
_OPS = "[=<>+\-\*/^()÷%²]"
_SUFFIXES = "-[{a}]+".format(a=ALPHA_LOWER)
NUMERIC_EXP = "({n})(({o})({n}))*[%]?".format(n=_NUM, o=_OPS)
TIME_EXP = "\d+(:\d+)*(\.\d+)?"
NUMS = "(({ne})|({t})|({on})|({c}))({s})?".format(
ne=NUMERIC_EXP, t=TIME_EXP, on=ORD_NUM_OR_DATE,
c=CURRENCY, s=_SUFFIXES
)
TOKEN_MATCH = re.compile("^({u})|({n})$".format(u=_URL_PATTERN, n=NUMS)).match

View File

@ -57,14 +57,14 @@ LIST_PUNCT = list(_PUNCT.strip().split())
LIST_HYPHENS = list(_HYPHENS.strip().split())
ALPHA_LOWER = _ALPHA_LOWER.strip().replace(' ', '')
ALPHA_UPPER = _ALPHA_UPPER.strip().replace(' ', '')
ALPHA_LOWER = _ALPHA_LOWER.strip().replace(' ', '').replace('\n', '')
ALPHA_UPPER = _ALPHA_UPPER.strip().replace(' ', '').replace('\n', '')
ALPHA = ALPHA_LOWER + ALPHA_UPPER
QUOTES = _QUOTES.strip().replace(' ', '|')
CURRENCY = _CURRENCY.strip().replace(' ', '|')
UNITS = _UNITS.strip().replace(' ', '|')
UNITS = _UNITS.strip().replace(' ', '|').replace('\n', '|')
HYPHENS = _HYPHENS.strip().replace(' ', '|')
@ -103,7 +103,7 @@ TOKENIZER_SUFFIXES = (
TOKENIZER_INFIXES = (
LIST_ELLIPSES +
[
r'(?<=[0-9])[+\-\*/^](?=[0-9])',
r'(?<=[0-9])[+\-\*^](?=[0-9-])',
r'(?<=[{al}])\.(?=[{au}])'.format(al=ALPHA_LOWER, au=ALPHA_UPPER),
r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}])(?:{h})(?=[{a}])'.format(a=ALPHA, h=HYPHENS),

169
spacy/tests/README.md Normal file
View File

@ -0,0 +1,169 @@
<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>
# spaCy tests
spaCy uses the [pytest](http://doc.pytest.org/) framework for testing. For more info on this, see the [pytest documentation](http://docs.pytest.org/en/latest/contents.html).
Tests for spaCy modules and classes live in their own directories of the same name. For example, tests for the `Tokenizer` can be found in [`/tests/tokenizer`](tokenizer). All test modules (i.e. directories) also need to be listed in spaCy's [`setup.py`](../setup.py). To be interpreted and run, all test files and test functions need to be prefixed with `test_`.
## Table of contents
1. [Running the tests](#running-the-tests)
2. [Dos and don'ts](#dos-and-donts)
3. [Parameters](#parameters)
4. [Fixtures](#fixtures)
5. [Helpers and utilities](#helpers-and-utilities)
6. [Contributing to the tests](#contributing-to-the-tests)
## Running the tests
```bash
py.test spacy # run basic tests
py.test spacy --models # run basic and model tests
py.test spacy --slow # run basic and slow tests
py.test spacy --models --slow # run all tests
```
To show print statements, run the tests with `py.test -s`. To abort after the first failure, run them with `py.test -x`.
## Dos and don'ts
To keep the behaviour of the tests consistent and predictable, we try to follow a few basic conventions:
* **Test names** should follow a pattern of `test_[module]_[tested behaviour]`. For example: `test_tokenizer_keeps_email` or `test_spans_override_sentiment`.
* If you're testing for a bug reported in a specific issue, always create a **regression test**. Regression tests should be named `test_issue[ISSUE NUMBER]` and live in the [`regression`](regression) directory.
* Only use `@pytest.mark.xfail` for tests that **should pass, but currently fail**. To test for desired negative behaviour, use `assert not` in your test (see the sketch after this list).
* Very **extensive tests** that take a long time to run should be marked with `@pytest.mark.slow`. If your slow test is testing important behaviour, consider adding an additional simpler version.
* Tests that require **loading the models** should be marked with `@pytest.mark.models`.
* Before requiring the models, always make sure there is no other way to test the particular behaviour. In a lot of cases, it's sufficient to simply create a `Doc` object manually. See the section on [helpers and utility functions](#helpers-and-utilities) for more info on this.
* **Avoid unnecessary imports.** There should never be a need to explicitly import spaCy at the top of a file, and most components are available as [fixtures](#fixtures). You should also avoid wildcard imports (`from module import *`).
* If you're importing from spaCy, **always use relative imports**. Otherwise, you might accidentally be running the tests over a different copy of spaCy, e.g. one you have installed on your system.
* Don't forget the **unicode declarations** at the top of each file. This way, unicode strings won't have to be prefixed with `u`.
* Try to keep the tests **readable and concise**. Use clear and descriptive variable names (`doc`, `tokens` and `text` are great), keep it short and only test for one behaviour at a time.
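To make the `xfail` convention concrete, here's a sketch (the first test reuses the 6,000-year example marked as `xfail` in the English tokenizer tests; the second reuses the URL example from the [Parameters](#parameters) section):

```python
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.xfail
def test_tokenizer_splits_year_hyphen(en_tokenizer):
    # this behaviour is desired, but the tokenizer doesn't handle it yet,
    # so the test is expected to fail for now
    tokens = en_tokenizer("But then the 6,000-year ice age came...")
    assert len(tokens) == 10


def test_tokenizer_keeps_urls(en_tokenizer):
    # desired negative behaviour: assert that the URL is NOT split
    tokens = en_tokenizer("google.com")
    assert not len(tokens) > 1
```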
## Parameters
If the test cases can be extracted from the test, always `parametrize` them instead of hard-coding them into the test:
```python
@pytest.mark.parametrize('text', ["google.com", "spacy.io"])
def test_tokenizer_keep_urls(tokenizer, text):
tokens = tokenizer(text)
assert len(tokens) == 1
```
This will run the test once for each `text` value. Even if you're only testing one example, it's usually best to specify it as a parameter. This will later make it easier for others to quickly add additional test cases without having to modify the test.
You can also specify parameters as tuples to test with multiple values per test:
```python
@pytest.mark.parametrize('text,length', [("U.S.", 1), ("us.", 2), ("(U.S.", 2)])
```
To test for combinations of parameters, you can add several `parametrize` markers:
```python
@pytest.mark.parametrize('text', ["A test sentence", "Another sentence"])
@pytest.mark.parametrize('punct', ['.', '!', '?'])
```
This will run the test with all combinations of the two parameters `text` and `punct`. **Use this feature sparingly**, though, as it can easily cause unnecessary or undesired test bloat.
## Fixtures
Fixtures to create instances of spaCy objects and other components should only be defined once in the global [`conftest.py`](conftest.py). We avoid having per-directory conftest files, as this can easily lead to confusion.
These are the main fixtures that are currently available:
| Fixture | Description |
| --- | --- |
| `tokenizer` | Creates **all available** language tokenizers and runs the test for **each of them**. |
| `en_tokenizer` | Creates an English `Tokenizer` object. |
| `de_tokenizer` | Creates a German `Tokenizer` object. |
| `hu_tokenizer` | Creates a Hungarian `Tokenizer` object. |
| `en_vocab` | Creates an English `Vocab` object. |
| `en_entityrecognizer` | Creates an English `EntityRecognizer` object. |
| `lemmatizer` | Creates a `Lemmatizer` object from the installed language data (`None` if no data is found). |
| `EN` | Creates an instance of `English`. Only use for tests that require the models. |
| `DE` | Creates an instance of `German`. Only use for tests that require the models. |
| `text_file` | Creates an instance of `StringIO` to simulate reading from and writing to files. |
| `text_file_b` | Creates an instance of `ByteIO` to simulate reading from and writing to files. |
The fixtures can be used in all tests by simply setting them as an argument, like this:
```python
def test_module_do_something(en_tokenizer):
tokens = en_tokenizer("Some text here")
```
If all tests in a file require a specific configuration, or use the same complex example, it can be helpful to create a separate fixture. This fixture should be added at the top of each file. Make sure to use descriptive names for these fixtures and don't override any of the global fixtures listed above. **From looking at a test, it should immediately be clear which fixtures are used, and where they are coming from.**
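A file-specific fixture could look roughly like this (the fixture name and example text are made up; the example builds on the global `en_tokenizer` fixture):

```python
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.fixture
def example_tokens(en_tokenizer):
    # hypothetical file-level fixture: one shared example for all tests
    # in this file, built on top of the global en_tokenizer fixture
    return en_tokenizer("Give it back! He pleaded.")


def test_module_example_length(example_tokens):
    assert len(example_tokens) == 7
```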
## Helpers and utilities
Our new test setup comes with a few handy utility functions that can be imported from [`util.py`](util.py).
### Constructing a `Doc` object manually with `get_doc()`
Loading the models is expensive and not necessary if you're not actually testing the model performance. If all you need is a `Doc` object with annotations like heads, POS tags or the dependency parse, you can use `get_doc()` to construct it manually.
```python
def test_doc_token_api_strings(en_tokenizer):
text = "Give it back! He pleaded."
pos = ['VERB', 'PRON', 'PART', 'PUNCT', 'PRON', 'VERB', 'PUNCT']
heads = [0, -1, -2, -3, 1, 0, -1]
deps = ['ROOT', 'dobj', 'prt', 'punct', 'nsubj', 'ROOT', 'punct']
tokens = en_tokenizer(text)
doc = get_doc(tokens.vocab, [t.text for t in tokens], pos=pos, heads=heads, deps=deps)
assert doc[0].text == 'Give'
assert doc[0].lower_ == 'give'
assert doc[0].pos_ == 'VERB'
assert doc[0].dep_ == 'ROOT'
```
You can construct a `Doc` with the following arguments:
| Argument | Description |
| --- | --- |
| `vocab` | `Vocab` instance to use. If you're tokenizing before creating a `Doc`, make sure to use the tokenizer's vocab. Otherwise, you can also use the `en_vocab` fixture. **(required)** |
| `words` | List of words, for example `[t.text for t in tokens]`. **(required)** |
| `heads` | List of heads as integers. |
| `pos` | List of POS tags as text values. |
| `tag` | List of tag names as text values. |
| `dep` | List of dependencies as text values. |
| `ents` | List of entity tuples with `ent_id`, `label`, `start`, `end` (for example `('Stewart Lee', 'PERSON', 0, 2)`). The `label` will be looked up in `vocab.strings[label]`. |
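For example, a `Doc` with an entity could be constructed like this (hypothetical test, reusing the entity tuple from the table above; `get_doc()` is imported from [`util.py`](util.py)):

```python
def test_doc_entity_example(en_tokenizer):
    # hypothetical: construct a Doc with a PERSON entity via get_doc()
    tokens = en_tokenizer("Stewart Lee is a comedian")
    doc = get_doc(tokens.vocab, [t.text for t in tokens],
                  ents=[('Stewart Lee', 'PERSON', 0, 2)])
    assert doc.ents[0].text == 'Stewart Lee'
    assert doc.ents[0].label_ == 'PERSON'
```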
Here's how to quickly get these values from within spaCy:
```python
doc = nlp(u'Some text here')
print([token.head.i - token.i for token in doc])
print([token.tag_ for token in doc])
print([token.pos_ for token in doc])
print([token.dep_ for token in doc])
```
**Note:** There's currently no way of setting the serializer data for the parser without loading the models. If this is relevant to your test, constructing the `Doc` via `get_doc()` won't work.
### Other utilities
| Name | Description |
| --- | --- |
| `apply_transition_sequence(parser, doc, sequence)` | Perform a series of pre-specified transitions, to put the parser in a desired state. |
| `add_vecs_to_vocab(vocab, vectors)` | Add list of vector tuples (`[("text", [1, 2, 3])]`) to given vocab. All vectors need to have the same length. |
| `get_cosine(vec1, vec2)` | Get cosine for two given vectors. |
| `assert_docs_equal(doc1, doc2)` | Compare two `Doc` objects and `assert` that they're equal. Tests for tokens, tags, dependencies and entities. |
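As a rough illustration of the vector helpers (the vector values are arbitrary; `add_vecs_to_vocab()` and `get_cosine()` are imported from [`util.py`](util.py)):

```python
import numpy


def test_vocab_vector_helpers_example(en_vocab):
    # hypothetical: add toy vectors of equal length to the vocab, then
    # compare two raw vectors with get_cosine()
    add_vecs_to_vocab(en_vocab, [("apple", [1, 2, 3]), ("orange", [-1, -2, -3])])
    assert numpy.isclose(get_cosine([1, 2, 3], [-1, -2, -3]), -1.0)
```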
## Contributing to the tests
There's still a long way to go to finally reach **100% test coverage** and we'd appreciate your help! 🙌 You can open an issue on our [issue tracker](https://github.com/explosion/spaCy/issues) and label it `tests`, or make a [pull request](https://github.com/explosion/spaCy/pulls) to this repository.
📖 **For more information on contributing to spaCy in general, check out our [contribution guidelines](https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md).**

View File

@ -12,9 +12,13 @@ from ..sv import Swedish
from ..hu import Hungarian
from ..tokens import Doc
from ..strings import StringStore
from ..lemmatizer import Lemmatizer
from ..attrs import ORTH, TAG, HEAD, DEP
from ..util import match_best_version, get_data_path
from io import StringIO
from io import StringIO, BytesIO
from pathlib import Path
import os
import pytest
@ -52,22 +56,49 @@ def de_tokenizer():
def hu_tokenizer():
return Hungarian.Defaults.create_tokenizer()
@pytest.fixture
def stringstore():
return StringStore()
@pytest.fixture
def en_entityrecognizer():
return English.Defaults.create_entity()
@pytest.fixture
def lemmatizer(path):
if path is not None:
return Lemmatizer.load(path)
else:
return None
@pytest.fixture
def text_file():
return StringIO()
@pytest.fixture
def text_file_b():
return BytesIO()
# deprecated, to be replaced with more specific instances
@pytest.fixture
def path():
if 'SPACY_DATA' in os.environ:
return Path(os.environ['SPACY_DATA'])
else:
return match_best_version('en', None, get_data_path())
# only used for tests that require loading the models
# in all other cases, use specific instances
@pytest.fixture(scope="session")
def EN():
return English()
# deprecated, to be replaced with more specific instances
@pytest.fixture(scope="session")
def DE():
return German()

View File

@ -31,7 +31,7 @@ def test_doc_api_getitem(en_tokenizer):
tokens[len(tokens)]
def to_str(span):
return '/'.join(token.text for token in span)
return '/'.join(token.text for token in span)
span = tokens[1:1]
assert not to_str(span)
@ -193,7 +193,7 @@ def test_doc_api_runtime_error(en_tokenizer):
def test_doc_api_right_edge(en_tokenizer):
# Test for bug occurring from Unshift action, causing incorrect right edge
"""Test for bug occurring from Unshift action, causing incorrect right edge"""
text = "I have proposed to myself, for the sake of such as live under the government of the Romans, to translate those books into the Greek tongue."
heads = [2, 1, 0, -1, -1, -3, 15, 1, -2, -1, 1, -3, -1, -1, 1, -2, -1, 1,
-2, -7, 1, -19, 1, -2, -3, 2, 1, -3, -26]
@ -202,7 +202,8 @@ def test_doc_api_right_edge(en_tokenizer):
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
assert doc[6].text == 'for'
subtree = [w.text for w in doc[6].subtree]
assert subtree == ['for' , 'the', 'sake', 'of', 'such', 'as', 'live', 'under', 'the', 'government', 'of', 'the', 'Romans', ',']
assert subtree == ['for', 'the', 'sake', 'of', 'such', 'as',
'live', 'under', 'the', 'government', 'of', 'the', 'Romans', ',']
assert doc[6].right_edge.text == ','

View File

@ -85,8 +85,8 @@ def test_doc_token_api_vectors(en_tokenizer, text_file, text, vectors):
assert tokens[0].similarity(tokens[1]) == tokens[1].similarity(tokens[0])
assert sum(tokens[0].vector) != sum(tokens[1].vector)
assert numpy.isclose(
tokens[0].vector_norm,
numpy.sqrt(numpy.dot(tokens[0].vector, tokens[0].vector)))
tokens[0].vector_norm,
numpy.sqrt(numpy.dot(tokens[0].vector, tokens[0].vector)))
def test_doc_token_api_ancestors(en_tokenizer):

View File

@ -10,9 +10,6 @@ from ...util import compile_prefix_regex
from ...language_data import TOKENIZER_PREFIXES
en_search_prefixes = compile_prefix_regex(TOKENIZER_PREFIXES).search
PUNCT_OPEN = ['(', '[', '{', '*']
PUNCT_CLOSE = [')', ']', '}', '*']
PUNCT_PAIRED = [('(', ')'), ('[', ']'), ('{', '}'), ('*', '*')]
@ -99,7 +96,8 @@ def test_tokenizer_splits_double_end_quote(en_tokenizer, text):
@pytest.mark.parametrize('punct_open,punct_close', PUNCT_PAIRED)
@pytest.mark.parametrize('text', ["Hello"])
def test_tokenizer_splits_open_close_punct(en_tokenizer, punct_open, punct_close, text):
def test_tokenizer_splits_open_close_punct(en_tokenizer, punct_open,
punct_close, text):
tokens = en_tokenizer(punct_open + text + punct_close)
assert len(tokens) == 3
assert tokens[0].text == punct_open
@ -108,20 +106,22 @@ def test_tokenizer_splits_open_close_punct(en_tokenizer, punct_open, punct_close
@pytest.mark.parametrize('punct_open,punct_close', PUNCT_PAIRED)
@pytest.mark.parametrize('punct_open_add,punct_close_add', [("`", "'")])
@pytest.mark.parametrize('punct_open2,punct_close2', [("`", "'")])
@pytest.mark.parametrize('text', ["Hello"])
def test_two_different(en_tokenizer, punct_open, punct_close, punct_open_add, punct_close_add, text):
tokens = en_tokenizer(punct_open_add + punct_open + text + punct_close + punct_close_add)
def test_tokenizer_two_diff_punct(en_tokenizer, punct_open, punct_close,
punct_open2, punct_close2, text):
tokens = en_tokenizer(punct_open2 + punct_open + text + punct_close + punct_close2)
assert len(tokens) == 5
assert tokens[0].text == punct_open_add
assert tokens[0].text == punct_open2
assert tokens[1].text == punct_open
assert tokens[2].text == text
assert tokens[3].text == punct_close
assert tokens[4].text == punct_close_add
assert tokens[4].text == punct_close2
@pytest.mark.parametrize('text,punct', [("(can't", "(")])
def test_tokenizer_splits_pre_punct_regex(text, punct):
en_search_prefixes = compile_prefix_regex(TOKENIZER_PREFIXES).search
match = en_search_prefixes(text)
assert match.group() == punct

View File

@ -29,8 +29,7 @@ untimely death" of the rapier-tongued Scottish barrister and parliamentarian.
("""Yes! "I'd rather have a walk", Ms. Comble sighed. """, 15),
("""'Me too!', Mr. P. Delaware cried. """, 11),
("They ran about 10km.", 6),
# ("But then the 6,000-year ice age came...", 10)
])
pytest.mark.xfail(("But then the 6,000-year ice age came...", 10))])
def test_tokenizer_handles_cnts(en_tokenizer, text, length):
tokens = en_tokenizer(text)
assert len(tokens) == length

View File

@ -1,48 +1,43 @@
# coding: utf-8
from __future__ import unicode_literals
from ...gold import biluo_tags_from_offsets
from ...vocab import Vocab
from ...tokens.doc import Doc
import pytest
@pytest.fixture
def vocab():
return Vocab()
def test_U(vocab):
orths_and_spaces = [('I', True), ('flew', True), ('to', True), ('London', False),
('.', True)]
doc = Doc(vocab, orths_and_spaces=orths_and_spaces)
def test_gold_biluo_U(en_vocab):
orths_and_spaces = [('I', True), ('flew', True), ('to', True),
('London', False), ('.', True)]
doc = Doc(en_vocab, orths_and_spaces=orths_and_spaces)
entities = [(len("I flew to "), len("I flew to London"), 'LOC')]
tags = biluo_tags_from_offsets(doc, entities)
assert tags == ['O', 'O', 'O', 'U-LOC', 'O']
def test_BL(vocab):
def test_gold_biluo_BL(en_vocab):
orths_and_spaces = [('I', True), ('flew', True), ('to', True), ('San', True),
('Francisco', False), ('.', True)]
doc = Doc(vocab, orths_and_spaces=orths_and_spaces)
doc = Doc(en_vocab, orths_and_spaces=orths_and_spaces)
entities = [(len("I flew to "), len("I flew to San Francisco"), 'LOC')]
tags = biluo_tags_from_offsets(doc, entities)
assert tags == ['O', 'O', 'O', 'B-LOC', 'L-LOC', 'O']
def test_BIL(vocab):
def test_gold_biluo_BIL(en_vocab):
orths_and_spaces = [('I', True), ('flew', True), ('to', True), ('San', True),
('Francisco', True), ('Valley', False), ('.', True)]
doc = Doc(vocab, orths_and_spaces=orths_and_spaces)
doc = Doc(en_vocab, orths_and_spaces=orths_and_spaces)
entities = [(len("I flew to "), len("I flew to San Francisco Valley"), 'LOC')]
tags = biluo_tags_from_offsets(doc, entities)
assert tags == ['O', 'O', 'O', 'B-LOC', 'I-LOC', 'L-LOC', 'O']
def test_misalign(vocab):
def test_gold_biluo_misalign(en_vocab):
orths_and_spaces = [('I', True), ('flew', True), ('to', True), ('San', True),
('Francisco', True), ('Valley.', False)]
doc = Doc(vocab, orths_and_spaces=orths_and_spaces)
doc = Doc(en_vocab, orths_and_spaces=orths_and_spaces)
entities = [(len("I flew to "), len("I flew to San Francisco Valley"), 'LOC')]
tags = biluo_tags_from_offsets(doc, entities)
assert tags == ['O', 'O', 'O', '-', '-', '-']

View File

@ -0,0 +1,36 @@
# coding: utf-8
"""Find the min-cost alignment between two tokenizations"""
from __future__ import unicode_literals
from ...gold import _min_edit_path as min_edit_path
from ...gold import align
import pytest
@pytest.mark.parametrize('cand,gold,path', [
(["U.S", ".", "policy"], ["U.S.", "policy"], (0, 'MDM')),
(["U.N", ".", "policy"], ["U.S.", "policy"], (1, 'SDM')),
(["The", "cat", "sat", "down"], ["The", "cat", "sat", "down"], (0, 'MMMM')),
(["cat", "sat", "down"], ["The", "cat", "sat", "down"], (1, 'IMMM')),
(["The", "cat", "down"], ["The", "cat", "sat", "down"], (1, 'MMIM')),
(["The", "cat", "sag", "down"], ["The", "cat", "sat", "down"], (1, 'MMSM'))])
def test_gold_lev_align_edit_path(cand, gold, path):
assert min_edit_path(cand, gold) == path
def test_gold_lev_align_edit_path2():
cand = ["your", "stuff"]
gold = ["you", "r", "stuff"]
assert min_edit_path(cand, gold) in [(2, 'ISM'), (2, 'SIM')]
@pytest.mark.parametrize('cand,gold,result', [
(["U.S", ".", "policy"], ["U.S.", "policy"], [0, None, 1]),
(["your", "stuff"], ["you", "r", "stuff"], [None, 2]),
(["i", "like", "2", "guys", " ", "well", "id", "just", "come", "straight", "out"],
["i", "like", "2", "guys", "well", "i", "d", "just", "come", "straight", "out"],
[0, 1, 2, 3, None, 4, None, 7, 8, 9, 10])])
def test_gold_lev_align(cand, gold, result):
assert align(cand, gold) == result

View File

@ -3,7 +3,6 @@ from __future__ import unicode_literals
import pytest
DEFAULT_TESTS = [
('N. kormányzósági\nszékhely.', ['N.', 'kormányzósági', 'székhely', '.']),
('A .hu egy tld.', ['A', '.hu', 'egy', 'tld', '.']),
@ -24,23 +23,27 @@ DEFAULT_TESTS = [
HYPHEN_TESTS = [
('Egy -nak, -jaiért, -magyar, bel- van.', ['Egy', '-nak', ',', '-jaiért', ',', '-magyar', ',', 'bel-', 'van', '.']),
('Szabolcs-Szatmár-Bereg megye', ['Szabolcs-Szatmár-Bereg', 'megye']),
('Egy -nak.', ['Egy', '-nak', '.']),
('Egy bel-.', ['Egy', 'bel-', '.']),
('Dinnye-domb-.', ['Dinnye-domb-', '.']),
('Ezen -e elcsatangolt.', ['Ezen', '-e', 'elcsatangolt', '.']),
('Lakik-e', ['Lakik', '-e']),
('A--B', ['A', '--', 'B']),
('Lakik-e?', ['Lakik', '-e', '?']),
('Lakik-e.', ['Lakik', '-e', '.']),
('Lakik-e...', ['Lakik', '-e', '...']),
('Lakik-e... van.', ['Lakik', '-e', '...', 'van', '.']),
('Lakik-e van?', ['Lakik', '-e', 'van', '?']),
('Lakik-elem van?', ['Lakik-elem', 'van', '?']),
('Az életbiztosításáról- egy.', ['Az', 'életbiztosításáról-', 'egy', '.']),
('Van lakik-elem.', ['Van', 'lakik-elem', '.']),
('A 7-es busz?', ['A', '7-es', 'busz', '?']),
('A 7-es?', ['A', '7-es', '?']),
('A 7-es.', ['A', '7-es', '.']),
('Ez (lakik)-e?', ['Ez', '(', 'lakik', ')', '-e', '?']),
('A %-sal.', ['A', '%-sal', '.']),
('A $-sal.', ['A', '$-sal', '.']),
('A CD-ROM-okrol.', ['A', 'CD-ROM-okrol', '.'])
]
@ -89,11 +92,15 @@ NUMBER_TESTS = [
('A -23,12 van.', ['A', '-23,12', 'van', '.']),
('A -23,12-ben van.', ['A', '-23,12-ben', 'van', '.']),
('A -23,12-ben.', ['A', '-23,12-ben', '.']),
('A 2+3 van.', ['A', '2', '+', '3', 'van', '.']),
('A 2 +3 van.', ['A', '2', '+', '3', 'van', '.']),
('A 2+3 van.', ['A', '2+3', 'van', '.']),
('A 2<3 van.', ['A', '2<3', 'van', '.']),
('A 2=3 van.', ['A', '2=3', 'van', '.']),
('A 2÷3 van.', ['A', '2÷3', 'van', '.']),
('A 1=(2÷3)-2/5 van.', ['A', '1=(2÷3)-2/5', 'van', '.']),
('A 2 +3 van.', ['A', '2', '+3', 'van', '.']),
('A 2+ 3 van.', ['A', '2', '+', '3', 'van', '.']),
('A 2 + 3 van.', ['A', '2', '+', '3', 'van', '.']),
('A 2*3 van.', ['A', '2', '*', '3', 'van', '.']),
('A 2*3 van.', ['A', '2*3', 'van', '.']),
('A 2 *3 van.', ['A', '2', '*', '3', 'van', '.']),
('A 2* 3 van.', ['A', '2', '*', '3', 'van', '.']),
('A 2 * 3 van.', ['A', '2', '*', '3', 'van', '.']),
@ -141,7 +148,8 @@ NUMBER_TESTS = [
('A 15.-ben.', ['A', '15.-ben', '.']),
('A 2002--2003. van.', ['A', '2002--2003.', 'van', '.']),
('A 2002--2003-ben van.', ['A', '2002--2003-ben', 'van', '.']),
('A 2002--2003-ben.', ['A', '2002--2003-ben', '.']),
('A 2002-2003-ben.', ['A', '2002-2003-ben', '.']),
('A +0,99% van.', ['A', '+0,99%', 'van', '.']),
('A -0,99% van.', ['A', '-0,99%', 'van', '.']),
('A -0,99%-ben van.', ['A', '-0,99%-ben', 'van', '.']),
('A -0,99%.', ['A', '-0,99%', '.']),
@ -194,23 +202,33 @@ NUMBER_TESTS = [
('A III/c-ben.', ['A', 'III/c-ben', '.']),
('A TU154 van.', ['A', 'TU154', 'van', '.']),
('A TU154-ben van.', ['A', 'TU154-ben', 'van', '.']),
('A TU154-ben.', ['A', 'TU154-ben', '.'])
('A TU154-ben.', ['A', 'TU154-ben', '.']),
('A 5cm³', ['A', '5', 'cm³']),
('A 5 $-ban', ['A', '5', '$-ban']),
('A 5$-ban', ['A', '5$-ban']),
('A 5$.', ['A', '5', '$', '.']),
('A 5$', ['A', '5', '$']),
('A $5', ['A', '$5']),
('A 5km/h', ['A', '5', 'km/h']),
('A 75%+1-100%-ig', ['A', '75%+1-100%-ig']),
('A 5km/h.', ['A', '5', 'km/h', '.']),
('3434/1992. évi elszámolás', ['3434/1992.', 'évi', 'elszámolás']),
]
QUOTE_TESTS = [
('Az "Ime, hat"-ban irja.', ['Az', '"', 'Ime', ',', 'hat', '"', '-ban', 'irja', '.']),
('"Ime, hat"-ban irja.', ['"', 'Ime', ',', 'hat', '"', '-ban', 'irja', '.']),
('Az "Ime, hat".', ['Az', '"', 'Ime', ',', 'hat', '"', '.']),
('Egy 24"-os monitor.', ['Egy', '24', '"', '-os', 'monitor', '.']),
("A don't van.", ['A', "don't", 'van', '.'])
('Egy 24"-os monitor.', ['Egy', '24"-os', 'monitor', '.']),
("A McDonald's van.", ['A', "McDonald's", 'van', '.'])
]
DOT_TESTS = [
('N. kormányzósági\nszékhely.', ['N.', 'kormányzósági', 'székhely', '.']),
('A .hu egy tld.', ['A', '.hu', 'egy', 'tld', '.']),
('Az egy.ketto pelda.', ['Az', 'egy.ketto', 'pelda', '.']),
('A pl. rovidites.', ['A', 'pl.', 'rovidites', '.']),
('A S.M.A.R.T. szo.', ['A', 'S.M.A.R.T.', 'szo', '.']),
('A pl. rövidítés.', ['A', 'pl.', 'rövidítés', '.']),
('A S.M.A.R.T. szó.', ['A', 'S.M.A.R.T.', 'szó', '.']),
('A .hu.', ['A', '.hu', '.']),
('Az egy.ketto.', ['Az', 'egy.ketto', '.']),
('A pl.', ['A', 'pl.']),
@ -223,8 +241,19 @@ DOT_TESTS = [
('Valami ... más.', ['Valami', '...', 'más', '.'])
]
WIKI_TESTS = [
('!"', ['!', '"']),
('lány"a', ['lány', '"', 'a']),
('lány"a', ['lány', '"', 'a']),
('!"-lel', ['!', '"', '-lel']),
('""-sorozat ', ['"', '"', '-sorozat']),
('"(Köszönöm', ['"', '(', 'Köszönöm']),
('(törvénykönyv)-ben ', ['(', 'törvénykönyv', ')', '-ben']),
('"(...)"sokkal ', ['"', '(', '...', ')', '"', 'sokkal']),
('cérium(IV)-oxid', ['cérium', '(', 'IV', ')', '-oxid'])
]
TESTCASES = DEFAULT_TESTS + DOT_TESTS + QUOTE_TESTS # + NUMBER_TESTS + HYPHEN_TESTS
TESTCASES = DEFAULT_TESTS + DOT_TESTS + QUOTE_TESTS + NUMBER_TESTS + HYPHEN_TESTS + WIKI_TESTS
@pytest.mark.parametrize('text,expected_tokens', TESTCASES)

View File

@ -1,21 +0,0 @@
# encoding: utf8
from __future__ import unicode_literals
from ...fr import French
from ...nl import Dutch
def test_load_french():
nlp = French()
doc = nlp(u'Parlez-vous français?')
assert doc[0].text == u'Parlez'
assert doc[1].text == u'-'
assert doc[2].text == u'vous'
assert doc[3].text == u'français'
assert doc[4].text == u'?'
def test_load_dutch():
nlp = Dutch()
doc = nlp(u'Is dit Nederlands?')
assert doc[0].text == u'Is'
assert doc[1].text == u'dit'
assert doc[2].text == u'Nederlands'
assert doc[3].text == u'?'

View File

@ -1,4 +1,5 @@
# -*- coding: utf-8 -*-
# coding: utf-8
import pytest
import numpy
@ -47,7 +48,7 @@ class TestModelSanity:
def test_vectors(self, example):
# if vectors are available, they should differ on different words
# this isn't a perfect test since this could in principle fail
# this isn't a perfect test since this could in principle fail
# in a sane model as well,
# but that's very unlikely and a good indicator if something is wrong
vector0 = example[0].vector
@ -58,9 +59,9 @@ class TestModelSanity:
assert not numpy.array_equal(vector1,vector2)
def test_probs(self, example):
# if frequencies/probabilities are okay, they should differ for
# if frequencies/probabilities are okay, they should differ for
# different words
# this isn't a perfect test since this could in principle fail
# this isn't a perfect test since this could in principle fail
# in a sane model as well,
# but that's very unlikely and a good indicator if something is wrong
prob0 = example[0].prob

View File

@ -1,59 +1,53 @@
# coding: utf-8
from __future__ import unicode_literals
import spacy
from spacy.vocab import Vocab
from spacy.matcher import Matcher
from spacy.tokens.doc import Doc
from spacy.attrs import *
from ...matcher import Matcher
from ...attrs import ORTH
from ..util import get_doc
import pytest
@pytest.fixture
def en_vocab():
return spacy.get_lang_class('en').Defaults.create_vocab()
def test_init_matcher(en_vocab):
@pytest.mark.parametrize('words,entity', [
(["Test", "Entity"], "TestEntity")])
def test_matcher_add_empty_entity(en_vocab, words, entity):
matcher = Matcher(en_vocab)
matcher.add_entity(entity)
doc = get_doc(en_vocab, words)
assert matcher.n_patterns == 0
assert matcher(Doc(en_vocab, words=[u'Some', u'words'])) == []
assert matcher(doc) == []
def test_add_empty_entity(en_vocab):
@pytest.mark.parametrize('entity1,entity2,attrs', [
("TestEntity", "TestEntity2", {"Hello": "World"})])
def test_matcher_get_entity_attrs(en_vocab, entity1, entity2, attrs):
matcher = Matcher(en_vocab)
matcher.add_entity('TestEntity')
matcher.add_entity(entity1)
assert matcher.get_entity(entity1) == {}
matcher.add_entity(entity2, attrs=attrs)
assert matcher.get_entity(entity2) == attrs
assert matcher.get_entity(entity1) == {}
@pytest.mark.parametrize('words,entity,attrs',
[(["Test", "Entity"], "TestEntity", {"Hello": "World"})])
def test_matcher_get_entity_via_match(en_vocab, words, entity, attrs):
matcher = Matcher(en_vocab)
matcher.add_entity(entity, attrs=attrs)
doc = get_doc(en_vocab, words)
assert matcher.n_patterns == 0
assert matcher(Doc(en_vocab, words=[u'Test', u'Entity'])) == []
assert matcher(doc) == []
def test_get_entity_attrs(en_vocab):
matcher = Matcher(en_vocab)
matcher.add_entity('TestEntity')
entity = matcher.get_entity('TestEntity')
assert entity == {}
matcher.add_entity('TestEntity2', attrs={'Hello': 'World'})
entity = matcher.get_entity('TestEntity2')
assert entity == {'Hello': 'World'}
assert matcher.get_entity('TestEntity') == {}
def test_get_entity_via_match(en_vocab):
matcher = Matcher(en_vocab)
matcher.add_entity('TestEntity', attrs={u'Hello': u'World'})
assert matcher.n_patterns == 0
assert matcher(Doc(en_vocab, words=[u'Test', u'Entity'])) == []
matcher.add_pattern(u'TestEntity', [{ORTH: u'Test'}, {ORTH: u'Entity'}])
matcher.add_pattern(entity, [{ORTH: words[0]}, {ORTH: words[1]}])
assert matcher.n_patterns == 1
matches = matcher(Doc(en_vocab, words=[u'Test', u'Entity']))
matches = matcher(doc)
assert len(matches) == 1
assert len(matches[0]) == 4
ent_id, label, start, end = matches[0]
assert ent_id == matcher.vocab.strings[u'TestEntity']
assert ent_id == matcher.vocab.strings[entity]
assert label == 0
assert start == 0
assert end == 2
attrs = matcher.get_entity(ent_id)
assert attrs == {u'Hello': u'World'}
assert matcher.get_entity(ent_id) == attrs

View File

@ -0,0 +1,107 @@
# coding: utf-8
from __future__ import unicode_literals
from ...matcher import Matcher, PhraseMatcher
from ..util import get_doc
import pytest
@pytest.fixture
def matcher(en_vocab):
patterns = {
'JS': ['PRODUCT', {}, [[{'ORTH': 'JavaScript'}]]],
'GoogleNow': ['PRODUCT', {}, [[{'ORTH': 'Google'}, {'ORTH': 'Now'}]]],
'Java': ['PRODUCT', {}, [[{'LOWER': 'java'}]]]
}
return Matcher(en_vocab, patterns)
@pytest.mark.parametrize('words', [["Some", "words"]])
def test_matcher_init(en_vocab, words):
matcher = Matcher(en_vocab)
doc = get_doc(en_vocab, words)
assert matcher.n_patterns == 0
assert matcher(doc) == []
def test_matcher_no_match(matcher):
words = ["I", "like", "cheese", "."]
doc = get_doc(matcher.vocab, words)
assert matcher(doc) == []
def test_matcher_compile(matcher):
assert matcher.n_patterns == 3
def test_matcher_match_start(matcher):
words = ["JavaScript", "is", "good"]
doc = get_doc(matcher.vocab, words)
assert matcher(doc) == [(matcher.vocab.strings['JS'],
matcher.vocab.strings['PRODUCT'], 0, 1)]
def test_matcher_match_end(matcher):
words = ["I", "like", "java"]
doc = get_doc(matcher.vocab, words)
assert matcher(doc) == [(doc.vocab.strings['Java'],
doc.vocab.strings['PRODUCT'], 2, 3)]
def test_matcher_match_middle(matcher):
words = ["I", "like", "Google", "Now", "best"]
doc = get_doc(matcher.vocab, words)
assert matcher(doc) == [(doc.vocab.strings['GoogleNow'],
doc.vocab.strings['PRODUCT'], 2, 4)]
def test_matcher_match_multi(matcher):
words = ["I", "like", "Google", "Now", "and", "java", "best"]
doc = get_doc(matcher.vocab, words)
assert matcher(doc) == [(doc.vocab.strings['GoogleNow'],
doc.vocab.strings['PRODUCT'], 2, 4),
(doc.vocab.strings['Java'],
doc.vocab.strings['PRODUCT'], 5, 6)]
def test_matcher_phrase_matcher(en_vocab):
words = ["Google", "Now"]
doc = get_doc(en_vocab, words)
matcher = PhraseMatcher(en_vocab, [doc])
words = ["I", "like", "Google", "Now", "best"]
doc = get_doc(en_vocab, words)
assert len(matcher(doc)) == 1
def test_matcher_match_zero(matcher):
words1 = 'He said , " some words " ...'.split()
words2 = 'He said , " some three words " ...'.split()
pattern1 = [{'ORTH': '"'},
{'OP': '!', 'IS_PUNCT': True},
{'OP': '!', 'IS_PUNCT': True},
{'ORTH': '"'}]
pattern2 = [{'ORTH': '"'},
{'IS_PUNCT': True},
{'IS_PUNCT': True},
{'IS_PUNCT': True},
{'ORTH': '"'}]
matcher.add('Quote', '', {}, [pattern1])
doc = get_doc(matcher.vocab, words1)
assert len(matcher(doc)) == 1
doc = get_doc(matcher.vocab, words2)
assert len(matcher(doc)) == 0
matcher.add('Quote', '', {}, [pattern2])
assert len(matcher(doc)) == 0
def test_matcher_match_zero_plus(matcher):
words = 'He said , " some words " ...'.split()
pattern = [{'ORTH': '"'},
{'OP': '*', 'IS_PUNCT': False},
{'ORTH': '"'}]
matcher.add('Quote', '', {}, [pattern])
doc = get_doc(matcher.vocab, words)
assert len(matcher(doc)) == 1

View File

@ -1,200 +0,0 @@
import pytest
import numpy
import os
import spacy
from spacy.matcher import Matcher
from spacy.attrs import ORTH, LOWER, ENT_IOB, ENT_TYPE
from spacy.attrs import ORTH, TAG, LOWER, IS_ALPHA, FLAG63
from spacy.symbols import DATE, LOC
def test_overlap_issue118(EN):
'''Test a bug that arose from having overlapping matches'''
doc = EN.tokenizer(u'how many points did lebron james score against the boston celtics last night')
ORG = doc.vocab.strings['ORG']
matcher = Matcher(EN.vocab,
{'BostonCeltics':
('ORG', {},
[
[{LOWER: 'celtics'}],
[{LOWER: 'boston'}, {LOWER: 'celtics'}],
]
)
}
)
assert len(list(doc.ents)) == 0
matches = [(ent_type, start, end) for ent_id, ent_type, start, end in matcher(doc)]
assert matches == [(ORG, 9, 11), (ORG, 10, 11)]
doc.ents = matches[:1]
ents = list(doc.ents)
assert len(ents) == 1
assert ents[0].label == ORG
assert ents[0].start == 9
assert ents[0].end == 11
def test_overlap_issue242():
'''Test overlapping multi-word phrases.'''
patterns = [
[{LOWER: 'food'}, {LOWER: 'safety'}],
[{LOWER: 'safety'}, {LOWER: 'standards'}],
]
if os.environ.get('SPACY_DATA'):
data_dir = os.environ.get('SPACY_DATA')
else:
data_dir = None
nlp = spacy.en.English(path=data_dir, tagger=False, parser=False, entity=False)
nlp.matcher = Matcher(nlp.vocab)
nlp.matcher.add('FOOD', 'FOOD', {}, patterns)
doc = nlp.tokenizer(u'There are different food safety standards in different countries.')
matches = [(ent_type, start, end) for ent_id, ent_type, start, end in nlp.matcher(doc)]
doc.ents += tuple(matches)
food_safety, safety_standards = matches
assert food_safety[1] == 3
assert food_safety[2] == 5
assert safety_standards[1] == 4
assert safety_standards[2] == 6
def test_overlap_reorder(EN):
'''Test order dependence'''
doc = EN.tokenizer(u'how many points did lebron james score against the boston celtics last night')
ORG = doc.vocab.strings['ORG']
matcher = Matcher(EN.vocab,
{'BostonCeltics':
('ORG', {},
[
[{LOWER: 'boston'}, {LOWER: 'celtics'}],
[{LOWER: 'celtics'}],
]
)
}
)
assert len(list(doc.ents)) == 0
matches = [(ent_type, start, end) for ent_id, ent_type, start, end in matcher(doc)]
assert matches == [(ORG, 9, 11), (ORG, 10, 11)]
doc.ents = matches[:1]
ents = list(doc.ents)
assert len(ents) == 1
assert ents[0].label == ORG
assert ents[0].start == 9
assert ents[0].end == 11
def test_overlap_prefix(EN):
'''Test order dependence'''
doc = EN.tokenizer(u'how many points did lebron james score against the boston celtics last night')
ORG = doc.vocab.strings['ORG']
matcher = Matcher(EN.vocab,
{'BostonCeltics':
('ORG', {},
[
[{LOWER: 'boston'}],
[{LOWER: 'boston'}, {LOWER: 'celtics'}],
]
)
}
)
assert len(list(doc.ents)) == 0
matches = [(ent_type, start, end) for ent_id, ent_type, start, end in matcher(doc)]
doc.ents = matches[1:]
assert matches == [(ORG, 9, 10), (ORG, 9, 11)]
ents = list(doc.ents)
assert len(ents) == 1
assert ents[0].label == ORG
assert ents[0].start == 9
assert ents[0].end == 11
def test_overlap_prefix_reorder(EN):
'''Test order dependence'''
doc = EN.tokenizer(u'how many points did lebron james score against the boston celtics last night')
ORG = doc.vocab.strings['ORG']
matcher = Matcher(EN.vocab,
{'BostonCeltics':
('ORG', {},
[
[{LOWER: 'boston'}, {LOWER: 'celtics'}],
[{LOWER: 'boston'}],
]
)
}
)
assert len(list(doc.ents)) == 0
matches = [(ent_type, start, end) for ent_id, ent_type, start, end in matcher(doc)]
doc.ents += tuple(matches)[1:]
assert matches == [(ORG, 9, 10), (ORG, 9, 11)]
ents = doc.ents
assert len(ents) == 1
assert ents[0].label == ORG
assert ents[0].start == 9
assert ents[0].end == 11
# @pytest.mark.models
# def test_ner_interaction(EN):
# EN.matcher.add('LAX_Airport', 'AIRPORT', {}, [[{ORTH: 'LAX'}]])
# EN.matcher.add('SFO_Airport', 'AIRPORT', {}, [[{ORTH: 'SFO'}]])
# doc = EN(u'get me a flight from SFO to LAX leaving 20 December and arriving on January 5th')
# ents = [(ent.label_, ent.text) for ent in doc.ents]
# assert ents[0] == ('AIRPORT', 'SFO')
# assert ents[1] == ('AIRPORT', 'LAX')
# assert ents[2] == ('DATE', '20 December')
# assert ents[3] == ('DATE', 'January 5th')
# @pytest.mark.models
# def test_ner_interaction(EN):
# # ensure that matcher doesn't overwrite annotations set by the NER model
# doc = EN.tokenizer.tokens_from_list(u'get me a flight from SFO to LAX leaving 20 December and arriving on January 5th'.split(' '))
# EN.tagger(doc)
# columns = [ENT_IOB, ENT_TYPE]
# values = numpy.ndarray(shape=(len(doc),len(columns)), dtype='int32')
# # IOB values are 0=missing, 1=I, 2=O, 3=B
# iobs = [2,2,2,2,2,3,2,3,2,3,1,2,2,2,3,1]
# types = [0,0,0,0,0,LOC,0,LOC,0,DATE,DATE,0,0,0,DATE,DATE]
# values[:] = zip(iobs,types)
# doc.from_array(columns,values)
# assert doc[5].ent_type_ == 'LOC'
# assert doc[7].ent_type_ == 'LOC'
# assert doc[9].ent_type_ == 'DATE'
# assert doc[10].ent_type_ == 'DATE'
# assert doc[14].ent_type_ == 'DATE'
# assert doc[15].ent_type_ == 'DATE'
# EN.matcher.add('LAX_Airport', 'AIRPORT', {}, [[{ORTH: 'LAX'}]])
# EN.matcher.add('SFO_Airport', 'AIRPORT', {}, [[{ORTH: 'SFO'}]])
# EN.matcher(doc)
# assert doc[5].ent_type_ != 'AIRPORT'
# assert doc[7].ent_type_ != 'AIRPORT'
# assert doc[5].ent_type_ == 'LOC'
# assert doc[7].ent_type_ == 'LOC'
# assert doc[9].ent_type_ == 'DATE'
# assert doc[10].ent_type_ == 'DATE'
# assert doc[14].ent_type_ == 'DATE'
# assert doc[15].ent_type_ == 'DATE'

View File

@ -1,32 +0,0 @@
from spacy.deprecated import align_tokens
def test_perfect_align():
ref = ['I', 'align', 'perfectly']
indices = []
i = 0
for token in ref:
indices.append((i, i + len(token)))
i += len(token)
aligned = list(align_tokens(ref, indices))
assert aligned[0] == ('I', [(0, 1)])
assert aligned[1] == ('align', [(1, 6)])
assert aligned[2] == ('perfectly', [(6, 15)])
def test_hyphen_align():
ref = ['I', 'must', 're-align']
indices = [(0, 1), (1, 5), (5, 7), (7, 8), (8, 13)]
aligned = list(align_tokens(ref, indices))
assert aligned[0] == ('I', [(0, 1)])
assert aligned[1] == ('must', [(1, 5)])
assert aligned[2] == ('re-align', [(5, 7), (7, 8), (8, 13)])
def test_align_continue():
ref = ['I', 'must', 're-align', 'and', 'continue']
indices = [(0, 1), (1, 5), (5, 7), (7, 8), (8, 13), (13, 16), (16, 24)]
aligned = list(align_tokens(ref, indices))
assert aligned[2] == ('re-align', [(5, 7), (7, 8), (8, 13)])
assert aligned[3] == ('and', [(13, 16)])
assert aligned[4] == ('continue', [(16, 24)])

View File

@ -1,59 +0,0 @@
import spacy.munge.read_conll
hongbin_example = """
1 2. 0. LS _ 24 meta _ _ _
2 . . . _ 1 punct _ _ _
3 Wang wang NNP _ 4 compound _ _ _
4 Hongbin hongbin NNP _ 16 nsubj _ _ _
5 , , , _ 4 punct _ _ _
6 the the DT _ 11 det _ _ _
7 " " `` _ 11 punct _ _ _
8 communist communist JJ _ 11 amod _ _ _
9 trail trail NN _ 11 compound _ _ _
10 - - HYPH _ 11 punct _ _ _
11 blazer blazer NN _ 4 appos _ _ _
12 , , , _ 16 punct _ _ _
13 " " '' _ 16 punct _ _ _
14 has have VBZ _ 16 aux _ _ _
15 not not RB _ 16 neg _ _ _
16 turned turn VBN _ 24 ccomp _ _ _
17 into into IN syn=CLR 16 prep _ _ _
18 a a DT _ 19 det _ _ _
19 capitalist capitalist NN _ 17 pobj _ _ _
20 ( ( -LRB- _ 24 punct _ _ _
21 he he PRP _ 24 nsubj _ _ _
22 does do VBZ _ 24 aux _ _ _
23 n't not RB _ 24 neg _ _ _
24 have have VB _ 0 root _ _ _
25 any any DT _ 26 det _ _ _
26 shares share NNS _ 24 dobj _ _ _
27 , , , _ 24 punct _ _ _
28 does do VBZ _ 30 aux _ _ _
29 n't not RB _ 30 neg _ _ _
30 have have VB _ 24 conj _ _ _
31 any any DT _ 32 det _ _ _
32 savings saving NNS _ 30 dobj _ _ _
33 , , , _ 30 punct _ _ _
34 does do VBZ _ 36 aux _ _ _
35 n't not RB _ 36 neg _ _ _
36 have have VB _ 30 conj _ _ _
37 his his PRP$ _ 39 poss _ _ _
38 own own JJ _ 39 amod _ _ _
39 car car NN _ 36 dobj _ _ _
40 , , , _ 36 punct _ _ _
41 and and CC _ 36 cc _ _ _
42 does do VBZ _ 44 aux _ _ _
43 n't not RB _ 44 neg _ _ _
44 have have VB _ 36 conj _ _ _
45 a a DT _ 46 det _ _ _
46 mansion mansion NN _ 44 dobj _ _ _
47 ; ; . _ 24 punct _ _ _
""".strip()
def test_hongbin():
words, annot = spacy.munge.read_conll.parse(hongbin_example, strip_bad_periods=True)
assert words[annot[0]['head']] == 'have'
assert words[annot[1]['head']] == 'Hongbin'

View File

@ -1,21 +0,0 @@
from spacy.deprecated import detokenize
def test_punct():
tokens = 'Pierre Vinken , 61 years old .'.split()
detoks = [(0,), (1, 2), (3,), (4,), (5, 6)]
token_rules = ('<SEP>,', '<SEP>.')
assert detokenize(token_rules, tokens) == detoks
def test_contractions():
tokens = "I ca n't even".split()
detoks = [(0,), (1, 2), (3,)]
token_rules = ("ca<SEP>n't",)
assert detokenize(token_rules, tokens) == detoks
def test_contractions_punct():
tokens = "I ca n't !".split()
detoks = [(0,), (1, 2, 3)]
token_rules = ("ca<SEP>n't", '<SEP>!')
assert detokenize(token_rules, tokens) == detoks

View File

@ -1,42 +0,0 @@
"""Find the min-cost alignment between two tokenizations"""
from spacy.gold import _min_edit_path as min_edit_path
from spacy.gold import align
def test_edit_path():
cand = ["U.S", ".", "policy"]
gold = ["U.S.", "policy"]
assert min_edit_path(cand, gold) == (0, 'MDM')
cand = ["U.N", ".", "policy"]
gold = ["U.S.", "policy"]
assert min_edit_path(cand, gold) == (1, 'SDM')
cand = ["The", "cat", "sat", "down"]
gold = ["The", "cat", "sat", "down"]
assert min_edit_path(cand, gold) == (0, 'MMMM')
cand = ["cat", "sat", "down"]
gold = ["The", "cat", "sat", "down"]
assert min_edit_path(cand, gold) == (1, 'IMMM')
cand = ["The", "cat", "down"]
gold = ["The", "cat", "sat", "down"]
assert min_edit_path(cand, gold) == (1, 'MMIM')
cand = ["The", "cat", "sag", "down"]
gold = ["The", "cat", "sat", "down"]
assert min_edit_path(cand, gold) == (1, 'MMSM')
cand = ["your", "stuff"]
gold = ["you", "r", "stuff"]
assert min_edit_path(cand, gold) in [(2, 'ISM'), (2, 'SIM')]
def test_align():
cand = ["U.S", ".", "policy"]
gold = ["U.S.", "policy"]
assert align(cand, gold) == [0, None, 1]
cand = ["your", "stuff"]
gold = ["you", "r", "stuff"]
assert align(cand, gold) == [None, 2]
cand = [u'i', u'like', u'2', u'guys', u' ', u'well', u'id', u'just',
u'come', u'straight', u'out']
gold = [u'i', u'like', u'2', u'guys', u'well', u'i', u'd', u'just', u'come',
u'straight', u'out']
assert align(cand, gold) == [0, 1, 2, 3, None, 4, None, 7, 8, 9, 10]

View File

@ -1,16 +0,0 @@
from spacy.munge.read_ner import _get_text, _get_tag
def test_get_text():
assert _get_text('asbestos') == 'asbestos'
assert _get_text('<ENAMEX TYPE="ORG">Lorillard</ENAMEX>') == 'Lorillard'
assert _get_text('<ENAMEX TYPE="DATE">more') == 'more'
assert _get_text('ago</ENAMEX>') == 'ago'
def test_get_tag():
assert _get_tag('asbestos', None) == ('O', None)
assert _get_tag('asbestos', 'PER') == ('I-PER', 'PER')
assert _get_tag('<ENAMEX TYPE="ORG">Lorillard</ENAMEX>', None) == ('U-ORG', None)
assert _get_tag('<ENAMEX TYPE="DATE">more', None) == ('B-DATE', 'DATE')
assert _get_tag('ago</ENAMEX>', 'DATE') == ('L-DATE', None)

View File

@ -1,247 +0,0 @@
# encoding: utf-8
# SBD tests from "Pragmatic Segmenter"
from __future__ import unicode_literals
from spacy.en import English
EN = English()
def get_sent_strings(text):
tokens = EN(text)
sents = []
for sent in tokens.sents:
sents.append(''.join(tokens[i].string
for i in range(sent.start, sent.end)).strip())
return sents
def test_gr1():
sents = get_sent_strings("Hello World. My name is Jonas.")
assert sents == ["Hello World.", "My name is Jonas."]
def test_gr2():
sents = get_sent_strings("What is your name? My name is Jonas.")
assert sents == ["What is your name?", "My name is Jonas."]
def test_gr3():
sents = get_sent_strings("There it is! I found it.")
assert sents == ["There it is!", "I found it."]
def test_gr4():
sents = get_sent_strings("My name is Jonas E. Smith.")
assert sents == ["My name is Jonas E. Smith."]
def test_gr5():
sents = get_sent_strings("Please turn to p. 55.")
assert sents == ["Please turn to p. 55."]
def test_gr6():
sents = get_sent_strings("Were Jane and co. at the party?")
assert sents == ["Were Jane and co. at the party?"]
def test_gr7():
sents = get_sent_strings("They closed the deal with Pitt, Briggs & Co. at noon.")
assert sents == ["They closed the deal with Pitt, Briggs & Co. at noon."]
def test_gr8():
sents = get_sent_strings("Let's ask Jane and co. They should know.")
assert sents == ["Let's ask Jane and co.", "They should know."]
def test_gr9():
sents = get_sent_strings("They closed the deal with Pitt, Briggs & Co. It closed yesterday.")
assert sents == ["They closed the deal with Pitt, Briggs & Co.", "It closed yesterday."]
def test_gr10():
sents = get_sent_strings("I can see Mt. Fuji from here.")
assert sents == ["I can see Mt. Fuji from here."]
def test_gr11():
sents = get_sent_strings("St. Michael's Church is on 5th st. near the light.")
assert sents == ["St. Michael's Church is on 5th st. near the light."]
def test_gr12():
sents = get_sent_strings("That is JFK Jr.'s book.")
assert sents == ["That is JFK Jr.'s book."]
def test_gr13():
sents = get_sent_strings("I visited the U.S.A. last year.")
assert sents == ["I visited the U.S.A. last year."]
def test_gr14():
sents = get_sent_strings("I live in the E.U. How about you?")
assert sents == ["I live in the E.U.", "How about you?"]
def test_gr15():
sents = get_sent_strings("I live in the U.S. How about you?")
assert sents == ["I live in the U.S.", "How about you?"]
def test_gr16():
sents = get_sent_strings("I work for the U.S. Government in Virginia.")
assert sents == ["I work for the U.S. Government in Virginia."]
def test_gr17():
sents = get_sent_strings("I have lived in the U.S. for 20 years.")
assert sents == ["I have lived in the U.S. for 20 years."]
def test_gr18():
sents = get_sent_strings("At 5 a.m. Mr. Smith went to the bank. He left the bank at 6 P.M. Mr. Smith then went to the store.")
assert sents == ["At 5 a.m. Mr. Smith went to the bank.", "He left the bank at 6 P.M.", "Mr. Smith then went to the store."]
def test_gr19():
sents = get_sent_strings("She has $100.00 in her bag.")
assert sents == ["She has $100.00 in her bag."]
def test_gr20():
sents = get_sent_strings("She has $100.00. It is in her bag.")
assert sents == ["She has $100.00.", "It is in her bag."]
def test_gr21():
sents = get_sent_strings("He teaches science (He previously worked for 5 years as an engineer.) at the local University.")
assert sents == ["He teaches science (He previously worked for 5 years as an engineer.) at the local University."]
def test_gr22():
sents = get_sent_strings("Her email is Jane.Doe@example.com. I sent her an email.")
assert sents == ["Her email is Jane.Doe@example.com.", "I sent her an email."]
def test_gr23():
sents = get_sent_strings("The site is: https://www.example.50.com/new-site/awesome_content.html. Please check it out.")
assert sents == ["The site is: https://www.example.50.com/new-site/awesome_content.html.", "Please check it out."]
def test_gr24():
sents = get_sent_strings("She turned to him, 'This is great.' she said.")
assert sents == ["She turned to him, 'This is great.' she said."]
def test_gr25():
sents = get_sent_strings('She turned to him, "This is great." she said.')
assert sents == ['She turned to him, "This is great." she said.']
def test_gr26():
sents = get_sent_strings('She turned to him, "This is great." She held the book out to show him.')
assert sents == ['She turned to him, "This is great."', "She held the book out to show him."]
def test_gr27():
sents = get_sent_strings("Hello!! Long time no see.")
assert sents == ["Hello!!", "Long time no see."]
def test_gr28():
sents = get_sent_strings("Hello?? Who is there?")
assert sents == ["Hello??", "Who is there?"]
def test_gr29():
sents = get_sent_strings("Hello!? Is that you?")
assert sents == ["Hello!?", "Is that you?"]
def test_gr30():
sents = get_sent_strings("Hello?! Is that you?")
assert sents == ["Hello?!", "Is that you?"]
def test_gr31():
sents = get_sent_strings("1.) The first item 2.) The second item")
assert sents == ["1.) The first item", "2.) The second item"]
def test_gr32():
sents = get_sent_strings("1.) The first item. 2.) The second item.")
assert sents == ["1.) The first item.", "2.) The second item."]
def test_gr33():
sents = get_sent_strings("1) The first item 2) The second item")
assert sents == ["1) The first item", "2) The second item"]
def test_gr34():
sents = get_sent_strings("1) The first item. 2) The second item.")
assert sents == ["1) The first item.", "2) The second item."]
def test_gr35():
sents = get_sent_strings("1. The first item 2. The second item")
assert sents == ["1. The first item", "2. The second item"]
def test_gr36():
sents = get_sent_strings("1. The first item. 2. The second item.")
assert sents == ["1. The first item.", "2. The second item."]
def test_gr37():
sents = get_sent_strings("• 9. The first item • 10. The second item")
assert sents == ["• 9. The first item", "• 10. The second item"]
def test_gr38():
sents = get_sent_strings("9. The first item 10. The second item")
assert sents == ["9. The first item", "10. The second item"]
def test_gr39():
sents = get_sent_strings("a. The first item b. The second item c. The third list item")
assert sents == ["a. The first item", "b. The second item", "c. The third list item"]
def test_gr40():
sents = get_sent_strings("This is a sentence\ncut off in the middle because pdf.")
assert sents == ["This is a sentence\ncut off in the middle because pdf."]
def test_gr41():
sents = get_sent_strings("It was a cold \nnight in the city.")
assert sents == ["It was a cold \nnight in the city."]
def test_gr42():
sents = get_sent_strings("features\ncontact manager\nevents, activities\n")
assert sents == ["features", "contact manager", "events, activities"]
def test_gr43():
sents = get_sent_strings("You can find it at N°. 1026.253.553. That is where the treasure is.")
assert sents == ["You can find it at N°. 1026.253.553.", "That is where the treasure is."]
def test_gr44():
sents = get_sent_strings("She works at Yahoo! in the accounting department.")
assert sents == ["She works at Yahoo! in the accounting department."]
def test_gr45():
sents = get_sent_strings("We make a good team, you and I. Did you see Albert I. Jones yesterday?")
assert sents == ["We make a good team, you and I.", "Did you see Albert I. Jones yesterday?"]
def test_gr46():
sents = get_sent_strings("Thoreau argues that by simplifying one's life, “the laws of the universe will appear less complex. . . .”")
assert sents == ["Thoreau argues that by simplifying one's life, “the laws of the universe will appear less complex. . . .”"]
def test_gr47():
sents = get_sent_strings(""""Bohr [...] used the analogy of parallel stairways [...]" (Smith 55).""")
assert sents == ['"Bohr [...] used the analogy of parallel stairways [...]" (Smith 55).']
def test_gr48():
sents = get_sent_strings("If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . . Next sentence.")
assert sents == ["If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . .", "Next sentence."]
def test_gr49():
sents = get_sent_strings("I never meant that.... She left the store.")
assert sents == ["I never meant that....", "She left the store."]
def test_gr50():
sents = get_sent_strings("I wasn't really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn't mean it.")
assert sents == ["I wasn't really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn't mean it."]
def test_gr51():
sents = get_sent_strings("One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds. . . . The practice was not abandoned. . . .")
assert sents == ["One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds.", ". . . The practice was not abandoned. . . ."]
def test_gr52():
sents = get_sent_strings("Hello world.Today is Tuesday.Mr. Smith went to the store and bought 1,000.That is a lot.",)
assert sents == ["Hello world.", "Today is Tuesday.", "Mr. Smith went to the store and bought 1,000.", "That is a lot."]

View File

@ -0,0 +1,76 @@
# coding: utf-8
from __future__ import unicode_literals
from ..util import get_doc
import pytest
def test_parser_noun_chunks_standard(en_tokenizer):
text = "A base phrase should be recognized."
heads = [2, 1, 3, 2, 1, 0, -1]
tags = ['DT', 'JJ', 'NN', 'MD', 'VB', 'VBN', '.']
deps = ['det', 'amod', 'nsubjpass', 'aux', 'auxpass', 'ROOT', 'punct']
tokens = en_tokenizer(text)
doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
chunks = list(doc.noun_chunks)
assert len(chunks) == 1
assert chunks[0].text_with_ws == "A base phrase "
def test_parser_noun_chunks_coordinated(en_tokenizer):
text = "A base phrase and a good phrase are often the same."
heads = [2, 1, 5, -1, 2, 1, -4, 0, -1, 1, -3, -4]
tags = ['DT', 'NN', 'NN', 'CC', 'DT', 'JJ', 'NN', 'VBP', 'RB', 'DT', 'JJ', '.']
deps = ['det', 'compound', 'nsubj', 'cc', 'det', 'amod', 'conj', 'ROOT', 'advmod', 'det', 'attr', 'punct']
tokens = en_tokenizer(text)
doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
chunks = list(doc.noun_chunks)
assert len(chunks) == 2
assert chunks[0].text_with_ws == "A base phrase "
assert chunks[1].text_with_ws == "a good phrase "
def test_parser_noun_chunks_pp_chunks(en_tokenizer):
text = "A phrase with another phrase occurs."
heads = [1, 4, -1, 1, -2, 0, -1]
tags = ['DT', 'NN', 'IN', 'DT', 'NN', 'VBZ', '.']
deps = ['det', 'nsubj', 'prep', 'det', 'pobj', 'ROOT', 'punct']
tokens = en_tokenizer(text)
doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
chunks = list(doc.noun_chunks)
assert len(chunks) == 2
assert chunks[0].text_with_ws == "A phrase "
assert chunks[1].text_with_ws == "another phrase "
def test_parser_noun_chunks_standard_de(de_tokenizer):
text = "Eine Tasse steht auf dem Tisch."
heads = [1, 1, 0, -1, 1, -2, -4]
tags = ['ART', 'NN', 'VVFIN', 'APPR', 'ART', 'NN', '$.']
deps = ['nk', 'sb', 'ROOT', 'mo', 'nk', 'nk', 'punct']
tokens = de_tokenizer(text)
doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
chunks = list(doc.noun_chunks)
assert len(chunks) == 2
assert chunks[0].text_with_ws == "Eine Tasse "
assert chunks[1].text_with_ws == "dem Tisch "
def test_de_extended_chunk(de_tokenizer):
text = "Die Sängerin singt mit einer Tasse Kaffee Arien."
heads = [1, 1, 0, -1, 1, -2, -1, -5, -6]
tags = ['ART', 'NN', 'VVFIN', 'APPR', 'ART', 'NN', 'NN', 'NN', '$.']
deps = ['nk', 'sb', 'ROOT', 'mo', 'nk', 'nk', 'nk', 'oa', 'punct']
tokens = de_tokenizer(text)
doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
chunks = list(doc.noun_chunks)
assert len(chunks) == 3
assert chunks[0].text_with_ws == "Die Sängerin "
assert chunks[1].text_with_ws == "einer Tasse Kaffee "
assert chunks[2].text_with_ws == "Arien "
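A convention in these fixtures that is easy to misread: the heads lists passed to get_doc are offsets relative to each token rather than absolute indices, with 0 marking a token as its own head (the root); the same convention is checked later in this diff as [(t.head.i - t.i) for t in doc]. A small self-contained sketch resolving the first English example above:

# Resolve the relative head offsets used by the get_doc test helper to absolute indices.
words = ["A", "base", "phrase", "should", "be", "recognized", "."]
heads = [2, 1, 3, 2, 1, 0, -1]
absolute = [i + offset for i, offset in enumerate(heads)]
assert absolute == [2, 2, 5, 5, 5, 5, 5]  # "A" and "base" attach to "phrase"; everything else to "recognized"
print([(word, words[head]) for word, head in zip(words, absolute)])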

View File

@ -0,0 +1,71 @@
# encoding: utf-8
from __future__ import unicode_literals
import pytest
TEST_CASES = [
("Hello World. My name is Jonas.", ["Hello World.", "My name is Jonas."]),
("What is your name? My name is Jonas.", ["What is your name?", "My name is Jonas."]),
("There it is! I found it.", ["There it is!", "I found it."]),
("My name is Jonas E. Smith.", ["My name is Jonas E. Smith."]),
("Please turn to p. 55.", ["Please turn to p. 55."]),
("Were Jane and co. at the party?", ["Were Jane and co. at the party?"]),
("They closed the deal with Pitt, Briggs & Co. at noon.", ["They closed the deal with Pitt, Briggs & Co. at noon."]),
pytest.mark.xfail(("Let's ask Jane and co. They should know.", ["Let's ask Jane and co.", "They should know."])),
("They closed the deal with Pitt, Briggs & Co. It closed yesterday.", ["They closed the deal with Pitt, Briggs & Co.", "It closed yesterday."]),
("I can see Mt. Fuji from here.", ["I can see Mt. Fuji from here."]),
pytest.mark.xfail(("St. Michael's Church is on 5th st. near the light.", ["St. Michael's Church is on 5th st. near the light."])),
("That is JFK Jr.'s book.", ["That is JFK Jr.'s book."]),
("I visited the U.S.A. last year.", ["I visited the U.S.A. last year."]),
pytest.mark.xfail(("I live in the E.U. How about you?", ["I live in the E.U.", "How about you?"])),
pytest.mark.xfail(("I live in the U.S. How about you?", ["I live in the U.S.", "How about you?"])),
("I work for the U.S. Government in Virginia.", ["I work for the U.S. Government in Virginia."]),
("I have lived in the U.S. for 20 years.", ["I have lived in the U.S. for 20 years."]),
pytest.mark.xfail(("At 5 a.m. Mr. Smith went to the bank. He left the bank at 6 P.M. Mr. Smith then went to the store.", ["At 5 a.m. Mr. Smith went to the bank.", "He left the bank at 6 P.M.", "Mr. Smith then went to the store."])),
("She has $100.00 in her bag.", ["She has $100.00 in her bag."]),
("She has $100.00. It is in her bag.", ["She has $100.00.", "It is in her bag."]),
("He teaches science (He previously worked for 5 years as an engineer.) at the local University.", ["He teaches science (He previously worked for 5 years as an engineer.) at the local University."]),
pytest.mark.xfail(("Her email is Jane.Doe@example.com. I sent her an email.", ["Her email is Jane.Doe@example.com.", "I sent her an email."])),
pytest.mark.xfail(("The site is: https://www.example.50.com/new-site/awesome_content.html. Please check it out.", ["The site is: https://www.example.50.com/new-site/awesome_content.html.", "Please check it out."])),
pytest.mark.xfail(("She turned to him, 'This is great.' she said.", ["She turned to him, 'This is great.' she said."])),
('She turned to him, "This is great." she said.', ['She turned to him, "This is great." she said.']),
('She turned to him, "This is great." She held the book out to show him.', ['She turned to him, "This is great."', "She held the book out to show him."]),
("Hello!! Long time no see.", ["Hello!!", "Long time no see."]),
("Hello?? Who is there?", ["Hello??", "Who is there?"]),
("Hello!? Is that you?", ["Hello!?", "Is that you?"]),
("Hello?! Is that you?", ["Hello?!", "Is that you?"]),
pytest.mark.xfail(("1.) The first item 2.) The second item", ["1.) The first item", "2.) The second item"])),
pytest.mark.xfail(("1.) The first item. 2.) The second item.", ["1.) The first item.", "2.) The second item."])),
pytest.mark.xfail(("1) The first item 2) The second item", ["1) The first item", "2) The second item"])),
pytest.mark.xfail(("1) The first item. 2) The second item.", ["1) The first item.", "2) The second item."])),
pytest.mark.xfail(("1. The first item 2. The second item", ["1. The first item", "2. The second item"])),
pytest.mark.xfail(("1. The first item. 2. The second item.", ["1. The first item.", "2. The second item."])),
pytest.mark.xfail(("• 9. The first item • 10. The second item", ["• 9. The first item", "• 10. The second item"])),
pytest.mark.xfail(("9. The first item 10. The second item", ["9. The first item", "10. The second item"])),
pytest.mark.xfail(("a. The first item b. The second item c. The third list item", ["a. The first item", "b. The second item", "c. The third list item"])),
("This is a sentence\ncut off in the middle because pdf.", ["This is a sentence\ncut off in the middle because pdf."]),
("It was a cold \nnight in the city.", ["It was a cold \nnight in the city."]),
pytest.mark.xfail(("features\ncontact manager\nevents, activities\n", ["features", "contact manager", "events, activities"])),
("You can find it at N°. 1026.253.553. That is where the treasure is.", ["You can find it at N°. 1026.253.553.", "That is where the treasure is."]),
("She works at Yahoo! in the accounting department.", ["She works at Yahoo! in the accounting department."]),
pytest.mark.xfail(("We make a good team, you and I. Did you see Albert I. Jones yesterday?", ["We make a good team, you and I.", "Did you see Albert I. Jones yesterday?"])),
pytest.mark.xfail(("Thoreau argues that by simplifying one's life, “the laws of the universe will appear less complex. . . .”", ["Thoreau argues that by simplifying one's life, “the laws of the universe will appear less complex. . . .”"])),
(""""Bohr [...] used the analogy of parallel stairways [...]" (Smith 55).""", ['"Bohr [...] used the analogy of parallel stairways [...]" (Smith 55).']),
pytest.mark.xfail(("If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . . Next sentence.", ["If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . .", "Next sentence."])),
("I never meant that.... She left the store.", ["I never meant that....", "She left the store."]),
pytest.mark.xfail(("I wasn't really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn't mean it.", ["I wasn't really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn't mean it."])),
pytest.mark.xfail(("One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds. . . . The practice was not abandoned. . . .", ["One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds.", ". . . The practice was not abandoned. . . ."])),
pytest.mark.xfail(("Hello world.Today is Tuesday.Mr. Smith went to the store and bought 1,000.That is a lot.", ["Hello world.", "Today is Tuesday.", "Mr. Smith went to the store and bought 1,000.", "That is a lot."]))
]
@pytest.mark.slow
@pytest.mark.models
@pytest.mark.parametrize('text,expected_sents', TEST_CASES)
def test_parser_sbd_prag(EN, text, expected_sents):
"""SBD tests from Pragmatic Segmenter"""
doc = EN(text)
sents = []
for sent in doc.sents:
sents.append(''.join(doc[i].string for i in range(sent.start, sent.end)).strip())
assert sents == expected_sents
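On the shape of TEST_CASES above: wrapping an individual (text, expected_sents) tuple in pytest.mark.xfail marks just that parameter set as an expected failure while the remaining cases run normally; this was the supported spelling in the pytest versions current at the time, and later releases express the same thing through pytest.param(..., marks=...). A minimal sketch of the same idea in the pytest.param form (toy_segment and CASES are hypothetical names, not part of the diff):

import pytest

def toy_segment(text):
    # Naive stand-in for the real sentence segmenter: split on full stops only.
    return [s.strip() + "." for s in text.rstrip(".").split(".")]

CASES = [
    ("Hello. World.", ["Hello.", "World."]),
    pytest.param("Hello!? World.", ["Hello!?", "World."],
                 marks=pytest.mark.xfail),  # known limitation of the toy splitter
]

@pytest.mark.parametrize('text,expected', CASES)
def test_toy_segment(text, expected):
    assert toy_segment(text) == expected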

View File

@ -0,0 +1,54 @@
# coding: utf-8
from __future__ import unicode_literals
from ...matcher import Matcher
from ...attrs import ORTH, LOWER
import pytest
pattern1 = [[{LOWER: 'celtics'}], [{LOWER: 'boston'}, {LOWER: 'celtics'}]]
pattern2 = [[{LOWER: 'boston'}, {LOWER: 'celtics'}], [{LOWER: 'celtics'}]]
pattern3 = [[{LOWER: 'boston'}], [{LOWER: 'boston'}, {LOWER: 'celtics'}]]
pattern4 = [[{LOWER: 'boston'}, {LOWER: 'celtics'}], [{LOWER: 'boston'}]]
@pytest.fixture
def doc(en_tokenizer):
text = "how many points did lebron james score against the boston celtics last night"
doc = en_tokenizer(text)
return doc
@pytest.mark.parametrize('pattern', [pattern1, pattern2])
def test_issue118(doc, pattern):
"""Test a bug that arose from having overlapping matches"""
ORG = doc.vocab.strings['ORG']
matcher = Matcher(doc.vocab, {'BostonCeltics': ('ORG', {}, pattern)})
assert len(list(doc.ents)) == 0
matches = [(ent_type, start, end) for ent_id, ent_type, start, end in matcher(doc)]
assert matches == [(ORG, 9, 11), (ORG, 10, 11)]
doc.ents = matches[:1]
ents = list(doc.ents)
assert len(ents) == 1
assert ents[0].label == ORG
assert ents[0].start == 9
assert ents[0].end == 11
@pytest.mark.parametrize('pattern', [pattern3, pattern4])
def test_issue118_prefix_reorder(doc, pattern):
"""Test a bug that arose from having overlapping matches"""
ORG = doc.vocab.strings['ORG']
matcher = Matcher(doc.vocab, {'BostonCeltics': ('ORG', {}, pattern)})
assert len(list(doc.ents)) == 0
matches = [(ent_type, start, end) for ent_id, ent_type, start, end in matcher(doc)]
doc.ents += tuple(matches)[1:]
assert matches == [(ORG, 9, 10), (ORG, 9, 11)]
ents = doc.ents
assert len(ents) == 1
assert ents[0].label == ORG
assert ents[0].start == 9
assert ents[0].end == 11

View File

@ -0,0 +1,26 @@
# coding: utf-8
from __future__ import unicode_literals
from ...matcher import Matcher
from ...attrs import LOWER
import pytest
def test_issue242(en_tokenizer):
"""Test overlapping multi-word phrases."""
text = "There are different food safety standards in different countries."
patterns = [[{LOWER: 'food'}, {LOWER: 'safety'}],
[{LOWER: 'safety'}, {LOWER: 'standards'}]]
doc = en_tokenizer(text)
matcher = Matcher(doc.vocab)
matcher.add('FOOD', 'FOOD', {}, patterns)
matches = [(ent_type, start, end) for ent_id, ent_type, start, end in matcher(doc)]
doc.ents += tuple(matches)
match1, match2 = matches
assert match1[1] == 3
assert match1[2] == 5
assert match2[1] == 4
assert match2[2] == 6

View File

@ -4,7 +4,7 @@ from __future__ import unicode_literals
from ..util import get_doc
def test_sbd_empty_string(en_tokenizer):
def test_issue309(en_tokenizer):
"""Test Issue #309: SBD fails on empty string"""
tokens = en_tokenizer(" ")
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=[0], deps=['ROOT'])

View File

@ -1,16 +1,9 @@
# coding: utf-8
from __future__ import unicode_literals
from ...en import English
import pytest
@pytest.fixture
def en_tokenizer():
return English.Defaults.create_tokenizer()
def test_issue351(en_tokenizer):
doc = en_tokenizer(" This is a cat.")
assert doc[0].idx == 0

View File

@ -1,16 +1,10 @@
# coding: utf-8
from __future__ import unicode_literals
from ...en import English
import pytest
@pytest.fixture
def en_tokenizer():
return English.Defaults.create_tokenizer()
def test_big_ellipsis(en_tokenizer):
def test_issue360(en_tokenizer):
"""Test tokenization of big ellipsis"""
tokens = en_tokenizer('$45...............Asking')
assert len(tokens) > 2

View File

@ -0,0 +1,11 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
@pytest.mark.parametrize('text1,text2', [("cat", "dog")])
def test_issue361(en_vocab, text1, text2):
"""Test Issue #361: Equality of lexemes"""
assert en_vocab[text1] == en_vocab[text1]
assert en_vocab[text1] != en_vocab[text2]

View File

@ -1,31 +1,25 @@
# coding: utf-8
from __future__ import unicode_literals
import spacy
from spacy.attrs import ORTH
from ...attrs import ORTH
from ...matcher import Matcher
import pytest
@pytest.mark.models
def test_issue429():
nlp = spacy.load('en', parser=False)
def test_issue429(EN):
def merge_phrases(matcher, doc, i, matches):
if i != len(matches) - 1:
return None
spans = [(ent_id, label, doc[start:end]) for ent_id, label, start, end in matches]
for ent_id, label, span in spans:
span.merge('NNP' if label else span.root.tag_, span.text, nlp.vocab.strings[label])
span.merge('NNP' if label else span.root.tag_, span.text, EN.vocab.strings[label])
doc = nlp('a')
nlp.matcher.add('key', label='TEST', attrs={}, specs=[[{ORTH: 'a'}]], on_match=merge_phrases)
doc = nlp.tokenizer('a b c')
nlp.tagger(doc)
nlp.matcher(doc)
for word in doc:
print(word.text, word.ent_iob_, word.ent_type_)
nlp.entity(doc)
doc = EN('a')
matcher = Matcher(EN.vocab)
matcher.add('key', label='TEST', attrs={}, specs=[[{ORTH: 'a'}]], on_match=merge_phrases)
doc = EN.tokenizer('a b c')
EN.tagger(doc)
matcher(doc)
EN.entity(doc)

View File

@ -0,0 +1,21 @@
# coding: utf-8
from __future__ import unicode_literals
from ..util import get_doc
import pytest
@pytest.mark.models
def test_issue514(EN):
"""Test serializing after adding entity"""
text = ["This", "is", "a", "sentence", "about", "pasta", "."]
vocab = EN.entity.vocab
doc = get_doc(vocab, text)
EN.entity.add_label("Food")
EN.entity(doc)
label_id = vocab.strings[u'Food']
doc.ents = [(label_id, 5,6)]
assert [(ent.label_, ent.text) for ent in doc.ents] == [("Food", "pasta")]
doc2 = get_doc(EN.entity.vocab).from_bytes(doc.to_bytes())
assert [(ent.label_, ent.text) for ent in doc2.ents] == [("Food", "pasta")]

View File

@ -6,5 +6,5 @@ import pytest
@pytest.mark.models
def test_issue54(EN):
text = u'Talks given by women had a slightly higher number of questions asked (3.2$\pm$0.2) than talks given by men (2.6$\pm$0.1).'
text = "Talks given by women had a slightly higher number of questions asked (3.2$\pm$0.2) than talks given by men (2.6$\pm$0.1)."
tokens = EN(text)

View File

@ -1,21 +1,20 @@
# coding: utf-8
from __future__ import unicode_literals
import spacy
import spacy.matcher
from spacy.attrs import IS_PUNCT, ORTH
from ...matcher import Matcher
from ...attrs import IS_PUNCT, ORTH
import pytest
@pytest.mark.models
def test_matcher_segfault():
nlp = spacy.load('en', parser=False, entity=False)
matcher = spacy.matcher.Matcher(nlp.vocab)
def test_issue587(EN):
"""Test that Matcher doesn't segfault on particular input"""
matcher = Matcher(EN.vocab)
content = '''a b; c'''
matcher.add(entity_key='1', label='TEST', attrs={}, specs=[[{ORTH: 'a'}, {ORTH: 'b'}]])
matcher(nlp(content))
matcher(EN(content))
matcher.add(entity_key='2', label='TEST', attrs={}, specs=[[{ORTH: 'a'}, {ORTH: 'b'}, {IS_PUNCT: True}, {ORTH: 'c'}]])
matcher(nlp(content))
matcher(EN(content))
matcher.add(entity_key='3', label='TEST', attrs={}, specs=[[{ORTH: 'a'}, {ORTH: 'b'}, {IS_PUNCT: True}, {ORTH: 'd'}]])
matcher(nlp(content))
matcher(EN(content))

View File

@ -1,14 +1,12 @@
# coding: utf-8
from __future__ import unicode_literals
from ...vocab import Vocab
from ...tokens import Doc
from ...matcher import Matcher
import pytest
def test_issue588():
matcher = Matcher(Vocab())
def test_issue588(en_vocab):
matcher = Matcher(en_vocab)
with pytest.raises(ValueError):
matcher.add(entity_key='1', label='TEST', attrs={}, specs=[[]])

View File

@ -2,7 +2,7 @@
from __future__ import unicode_literals
from ...vocab import Vocab
from ...tokens import Doc
from ..util import get_doc
import pytest
@ -10,4 +10,4 @@ import pytest
def test_issue589():
vocab = Vocab()
vocab.strings.set_frozen(True)
doc = Doc(vocab, words=['whata'])
doc = get_doc(vocab, ['whata'])

View File

@ -1,37 +1,22 @@
# coding: utf-8
from __future__ import unicode_literals
from ...attrs import *
from ...attrs import ORTH, IS_ALPHA, LIKE_NUM
from ...matcher import Matcher
from ...tokens import Doc
from ...en import English
from ..util import get_doc
def test_overlapping_matches():
vocab = English.Defaults.create_vocab()
doc = Doc(vocab, words=['n', '=', '1', ';', 'a', ':', '5', '%'])
matcher = Matcher(vocab)
matcher.add_entity(
"ab",
acceptor=None,
on_match=None
)
matcher.add_pattern(
'ab',
[
{IS_ALPHA: True},
{ORTH: ':'},
{LIKE_NUM: True},
{ORTH: '%'}
], label='a')
matcher.add_pattern(
'ab',
[
{IS_ALPHA: True},
{ORTH: '='},
{LIKE_NUM: True},
], label='b')
def test_issue590(en_vocab):
"""Test overlapping matches"""
doc = get_doc(en_vocab, ['n', '=', '1', ';', 'a', ':', '5', '%'])
matcher = Matcher(en_vocab)
matcher.add_entity("ab", acceptor=None, on_match=None)
matcher.add_pattern('ab', [{IS_ALPHA: True}, {ORTH: ':'},
{LIKE_NUM: True}, {ORTH: '%'}],
label='a')
matcher.add_pattern('ab', [{IS_ALPHA: True}, {ORTH: '='},
{LIKE_NUM: True}],
label='b')
matches = matcher(doc)
assert len(matches) == 2

View File

@ -2,43 +2,23 @@
from __future__ import unicode_literals
from ...symbols import POS, VERB, VerbForm_inf
from ...tokens import Doc
from ...vocab import Vocab
from ...lemmatizer import Lemmatizer
from ..util import get_doc
import pytest
@pytest.fixture
def index():
return {'verb': {}}
def test_issue595():
"""Test lemmatization of base forms"""
words = ["Do", "n't", "feed", "the", "dog"]
tag_map = {'VB': {POS: VERB, 'morph': VerbForm_inf}}
rules = {"verb": [["ed", "e"]]}
@pytest.fixture
def exceptions():
return {'verb': {}}
lemmatizer = Lemmatizer({'verb': {}}, {'verb': {}}, rules)
vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map)
doc = get_doc(vocab, words)
@pytest.fixture
def rules():
return {"verb": [["ed", "e"]]}
@pytest.fixture
def lemmatizer(index, exceptions, rules):
return Lemmatizer(index, exceptions, rules)
@pytest.fixture
def tag_map():
return {'VB': {POS: VERB, 'morph': VerbForm_inf}}
@pytest.fixture
def vocab(lemmatizer, tag_map):
return Vocab(lemmatizer=lemmatizer, tag_map=tag_map)
def test_not_lemmatize_base_forms(vocab):
doc = Doc(vocab, words=["Do", "n't", "feed", "the", "dog"])
feed = doc[2]
feed.tag_ = 'VB'
assert feed.text == 'feed'
assert feed.lemma_ == 'feed'
doc[2].tag_ = 'VB'
assert doc[2].text == 'feed'
assert doc[2].lemma_ == 'feed'

View File

@ -1,15 +1,13 @@
# coding: utf-8
from __future__ import unicode_literals
from ...tokens import Doc
from ...vocab import Vocab
from ..util import get_doc
def test_issue599():
doc = Doc(Vocab())
def test_issue599(en_vocab):
doc = get_doc(en_vocab)
doc.is_tagged = True
doc.is_parsed = True
bytes_ = doc.to_bytes()
doc2 = Doc(doc.vocab)
doc2.from_bytes(bytes_)
doc2 = get_doc(doc.vocab)
doc2.from_bytes(doc.to_bytes())
assert doc2.is_parsed

View File

@ -1,11 +1,11 @@
# coding: utf-8
from __future__ import unicode_literals
from ...tokens import Doc
from ...vocab import Vocab
from ...attrs import POS
from ..util import get_doc
def test_issue600():
doc = Doc(Vocab(tag_map={'NN': {'pos': 'NOUN'}}), words=['hello'])
vocab = Vocab(tag_map={'NN': {'pos': 'NOUN'}})
doc = get_doc(vocab, ["hello"])
doc[0].tag_ = 'NN'

View File

@ -1,27 +1,21 @@
# coding: utf-8
from __future__ import unicode_literals
from ...attrs import LOWER, ORTH
from ...tokens import Doc
from ...vocab import Vocab
from ...attrs import ORTH
from ...matcher import Matcher
from ..util import get_doc
def return_false(doc, ent_id, label, start, end):
return False
def test_issue605(en_vocab):
def return_false(doc, ent_id, label, start, end):
return False
def test_matcher_accept():
doc = Doc(Vocab(), words=['The', 'golf', 'club', 'is', 'broken'])
golf_pattern = [
{ ORTH: "golf"},
{ ORTH: "club"}
]
words = ["The", "golf", "club", "is", "broken"]
pattern = [{ORTH: "golf"}, {ORTH: "club"}]
label = "Sport_Equipment"
doc = get_doc(en_vocab, words)
matcher = Matcher(doc.vocab)
matcher.add_entity('Sport_Equipment', acceptor=return_false)
matcher.add_pattern("Sport_Equipment", golf_pattern)
matcher.add_entity(label, acceptor=return_false)
matcher.add_pattern(label, pattern)
match = matcher(doc)
assert match == []

View File

@ -1,35 +1,31 @@
# coding: utf-8
from __future__ import unicode_literals
import spacy
from spacy.attrs import ORTH
from ...matcher import Matcher
from ...attrs import ORTH
def merge_phrases(matcher, doc, i, matches):
'''
Merge a phrase. We have to be careful here because we'll change the token indices.
To avoid problems, merge all the phrases once we're called on the last match.
'''
if i != len(matches)-1:
return None
# Get Span objects
spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]
for ent_id, label, span in spans:
span.merge('NNP' if label else span.root.tag_, span.text, doc.vocab.strings[label])
def test_issue615(en_tokenizer):
def merge_phrases(matcher, doc, i, matches):
"""Merge a phrase. We have to be careful here because we'll change the
token indices. To avoid problems, merge all the phrases once we're called
on the last match."""
def test_entity_ID_assignment():
nlp = spacy.en.English()
text = """The golf club is broken"""
doc = nlp(text)
if i != len(matches)-1:
return None
# Get Span objects
spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]
for ent_id, label, span in spans:
span.merge('NNP' if label else span.root.tag_, span.text, doc.vocab.strings[label])
golf_pattern = [
{ ORTH: "golf"},
{ ORTH: "club"}
]
text = "The golf club is broken"
pattern = [{ORTH: "golf"}, {ORTH: "club"}]
label = "Sport_Equipment"
matcher = spacy.matcher.Matcher(nlp.vocab)
matcher.add_entity('Sport_Equipment', on_match = merge_phrases)
matcher.add_pattern("Sport_Equipment", golf_pattern, label = 'Sport_Equipment')
doc = en_tokenizer(text)
matcher = Matcher(doc.vocab)
matcher.add_entity(label, on_match=merge_phrases)
matcher.add_pattern(label, pattern, label=label)
match = matcher(doc)
entities = list(doc.ents)

View File

@ -4,7 +4,8 @@ from __future__ import unicode_literals
from ...vocab import Vocab
def test_load_vocab_with_string():
def test_issue617():
"""Test loading Vocab with string"""
try:
vocab = Vocab.load('/tmp/vocab')
except IOError:

View File

@ -1,7 +1,4 @@
# coding: utf-8
"""Test that times like "7am" are tokenized correctly and that numbers are converted to string."""
from __future__ import unicode_literals
import pytest
@ -9,6 +6,7 @@ import pytest
@pytest.mark.parametrize('text,number', [("7am", "7"), ("11p.m.", "11")])
def test_issue736(en_tokenizer, text, number):
"""Test that times like "7am" are tokenized correctly and that numbers are converted to string."""
tokens = en_tokenizer(text)
assert len(tokens) == 2
assert tokens[0].text == number

View File

@ -0,0 +1,12 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
@pytest.mark.parametrize('text', ["3/4/2012", "01/12/1900"])
def test_issue740(en_tokenizer, text):
"""Test that dates are not split and are kept as one token. This behaviour is currently inconsistent, since dates separated by hyphens are still split.
This will be hard to prevent without causing clashes with numeric ranges."""
tokens = en_tokenizer(text)
assert len(tokens) == 1
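The docstring above records an asymmetry worth seeing side by side: slash-separated dates come out as a single token while hyphen-separated ones are still split. A minimal sketch, assuming the data-free tokenizer constructor that appears elsewhere in this diff (English.Defaults.create_tokenizer()); the hyphen example reflects the behaviour described in the docstring, not an assert in the test:

from spacy.en import English

tokenizer = English.Defaults.create_tokenizer()
print([t.text for t in tokenizer(u"3/4/2012")])  # one token, as the test asserts
print([t.text for t in tokenizer(u"3-4-2012")])  # split into several tokens, per the docstring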

View File

@ -0,0 +1,13 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
@pytest.mark.parametrize('text', ["We were scared", "We Were Scared"])
def test_issue744(en_tokenizer, text):
"""Test that 'were' and 'Were' are excluded from the contractions
generated by the English tokenizer exceptions."""
tokens = en_tokenizer(text)
assert len(tokens) == 3
assert tokens[1].text.lower() == "were"

View File

@ -1,60 +1,51 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from ...serialize.packer import _BinaryCodec
from ...serialize.huffman import HuffmanCodec
from ...serialize.bits import BitArray
import numpy
from spacy.vocab import Vocab
from spacy.serialize.packer import _BinaryCodec
from spacy.serialize.huffman import HuffmanCodec
from spacy.serialize.bits import BitArray
import pytest
def test_binary():
def test_serialize_codecs_binary():
codec = _BinaryCodec()
bits = BitArray()
msg = numpy.array([0, 1, 0, 1, 1], numpy.int32)
codec.encode(msg, bits)
array = numpy.array([0, 1, 0, 1, 1], numpy.int32)
codec.encode(array, bits)
result = numpy.array([0, 0, 0, 0, 0], numpy.int32)
bits.seek(0)
codec.decode(bits, result)
assert list(msg) == list(result)
assert list(array) == list(result)
def test_attribute():
freqs = {'the': 10, 'quick': 3, 'brown': 4, 'fox': 1, 'jumped': 5, 'over': 8,
'lazy': 1, 'dog': 2, '.': 9}
int_map = {'the': 0, 'quick': 1, 'brown': 2, 'fox': 3, 'jumped': 4, 'over': 5,
'lazy': 6, 'dog': 7, '.': 8}
def test_serialize_codecs_attribute():
freqs = {'the': 10, 'quick': 3, 'brown': 4, 'fox': 1, 'jumped': 5,
'over': 8, 'lazy': 1, 'dog': 2, '.': 9}
int_map = {'the': 0, 'quick': 1, 'brown': 2, 'fox': 3, 'jumped': 4,
'over': 5, 'lazy': 6, 'dog': 7, '.': 8}
codec = HuffmanCodec([(int_map[string], freq) for string, freq in freqs.items()])
bits = BitArray()
msg = numpy.array([1, 7], dtype=numpy.int32)
msg_list = list(msg)
codec.encode(msg, bits)
array = numpy.array([1, 7], dtype=numpy.int32)
codec.encode(array, bits)
result = numpy.array([0, 0], dtype=numpy.int32)
bits.seek(0)
codec.decode(bits, result)
assert msg_list == list(result)
assert list(array) == list(result)
def test_vocab_codec():
vocab = Vocab()
lex = vocab['dog']
lex = vocab['the']
lex = vocab['jumped']
codec = HuffmanCodec([(lex.orth, lex.prob) for lex in vocab])
def test_serialize_codecs_vocab(en_vocab):
words = ["the", "dog", "jumped"]
for word in words:
_ = en_vocab[word]
codec = HuffmanCodec([(lex.orth, lex.prob) for lex in en_vocab])
bits = BitArray()
ids = [vocab[s].orth for s in ('the', 'dog', 'jumped')]
msg = numpy.array(ids, dtype=numpy.int32)
msg_list = list(msg)
codec.encode(msg, bits)
result = numpy.array(range(len(msg)), dtype=numpy.int32)
ids = [en_vocab[s].orth for s in words]
array = numpy.array(ids, dtype=numpy.int32)
codec.encode(array, bits)
result = numpy.array(range(len(array)), dtype=numpy.int32)
bits.seek(0)
codec.decode(bits, result)
assert msg_list == list(result)
assert list(array) == list(result)

View File

@ -1,15 +1,15 @@
# coding: utf-8
from __future__ import unicode_literals
from __future__ import division
import pytest
from ...serialize.huffman import HuffmanCodec
from ...serialize.bits import BitArray
from spacy.serialize.huffman import HuffmanCodec
from spacy.serialize.bits import BitArray
import numpy
import math
from heapq import heappush, heappop, heapify
from collections import defaultdict
import numpy
import pytest
def py_encode(symb2freq):
@ -29,7 +29,7 @@ def py_encode(symb2freq):
return dict(heappop(heap)[1:])
def test1():
def test_serialize_huffman_1():
probs = numpy.zeros(shape=(10,), dtype=numpy.float32)
probs[0] = 0.3
probs[1] = 0.2
@ -41,45 +41,44 @@ def test1():
probs[7] = 0.005
probs[8] = 0.0001
probs[9] = 0.000001
codec = HuffmanCodec(list(enumerate(probs)))
py_codes = py_encode(dict(enumerate(probs)))
py_codes = list(py_codes.items())
py_codes.sort()
assert codec.strings == [c for i, c in py_codes]
def test_empty():
def test_serialize_huffman_empty():
codec = HuffmanCodec({})
assert codec.strings == []
def test_round_trip():
freqs = {'the': 10, 'quick': 3, 'brown': 4, 'fox': 1, 'jumped': 5, 'over': 8,
'lazy': 1, 'dog': 2, '.': 9}
def test_serialize_huffman_round_trip():
words = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'the',
'lazy', 'dog', '.']
freqs = {'the': 10, 'quick': 3, 'brown': 4, 'fox': 1, 'jumped': 5,
'over': 8, 'lazy': 1, 'dog': 2, '.': 9}
codec = HuffmanCodec(freqs.items())
message = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the',
'the', 'lazy', 'dog', '.']
strings = list(codec.strings)
codes = dict([(codec.leaves[i], strings[i]) for i in range(len(codec.leaves))])
bits = codec.encode(message)
bits = codec.encode(words)
string = ''.join('{0:b}'.format(c).rjust(8, '0')[::-1] for c in bits.as_bytes())
for word in message:
for word in words:
code = codes[word]
assert string[:len(code)] == code
string = string[len(code):]
unpacked = [0] * len(message)
unpacked = [0] * len(words)
bits.seek(0)
codec.decode(bits, unpacked)
assert message == unpacked
assert words == unpacked
def test_rosetta():
txt = u"this is an example for huffman encoding"
def test_serialize_huffman_rosetta():
text = "this is an example for huffman encoding"
symb2freq = defaultdict(int)
for ch in txt:
for ch in text:
symb2freq[ch] += 1
by_freq = list(symb2freq.items())
by_freq.sort(reverse=True, key=lambda item: item[1])
@ -101,7 +100,7 @@ def test_rosetta():
assert my_exp_len == py_exp_len
@pytest.mark.slow
@pytest.mark.models
def test_vocab(EN):
codec = HuffmanCodec([(w.orth, numpy.exp(w.prob)) for w in EN.vocab])
expected_length = 0

View File

@ -1,58 +1,48 @@
# coding: utf-8
from __future__ import unicode_literals
from ...tokens import Doc
from ..util import get_doc
import pytest
from spacy.serialize.packer import Packer
from spacy.attrs import ORTH, SPACY
from spacy.tokens import Doc
import math
import tempfile
import shutil
import os
def test_serialize_io_read_write(en_vocab, text_file_b):
text1 = ["This", "is", "a", "simple", "test", ".", "With", "a", "couple", "of", "sentences", "."]
text2 = ["This", "is", "another", "test", "document", "."]
doc1 = get_doc(en_vocab, text1)
doc2 = get_doc(en_vocab, text2)
text_file_b.write(doc1.to_bytes())
text_file_b.write(doc2.to_bytes())
text_file_b.seek(0)
bytes1, bytes2 = Doc.read_bytes(text_file_b)
result1 = get_doc(en_vocab).from_bytes(bytes1)
result2 = get_doc(en_vocab).from_bytes(bytes2)
assert result1.text_with_ws == doc1.text_with_ws
assert result2.text_with_ws == doc2.text_with_ws
@pytest.mark.models
def test_read_write(EN):
doc1 = EN(u'This is a simple test. With a couple of sentences.')
doc2 = EN(u'This is another test document.')
def test_serialize_io_left_right(en_vocab):
text = ["This", "is", "a", "simple", "test", ".", "With", "a", "couple", "of", "sentences", "."]
doc = get_doc(en_vocab, text)
result = Doc(en_vocab).from_bytes(doc.to_bytes())
try:
tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, 'spacy_docs.bin'), 'wb') as file_:
file_.write(doc1.to_bytes())
file_.write(doc2.to_bytes())
with open(os.path.join(tmp_dir, 'spacy_docs.bin'), 'rb') as file_:
bytes1, bytes2 = Doc.read_bytes(file_)
r1 = Doc(EN.vocab).from_bytes(bytes1)
r2 = Doc(EN.vocab).from_bytes(bytes2)
assert r1.string == doc1.string
assert r2.string == doc2.string
finally:
shutil.rmtree(tmp_dir)
@pytest.mark.models
def test_left_right(EN):
orig = EN(u'This is a simple test. With a couple of sentences.')
result = Doc(orig.vocab).from_bytes(orig.to_bytes())
for word in result:
assert word.head.i == orig[word.i].head.i
if word.head is not word:
assert word.i in [w.i for w in word.head.children]
for child in word.lefts:
assert child.head.i == word.i
for child in word.rights:
assert child.head.i == word.i
for token in result:
assert token.head.i == doc[token.i].head.i
if token.head is not token:
assert token.i in [w.i for w in token.head.children]
for child in token.lefts:
assert child.head.i == token.i
for child in token.rights:
assert child.head.i == token.i
@pytest.mark.models
def test_lemmas(EN):
orig = EN(u'The geese are flying')
result = Doc(orig.vocab).from_bytes(orig.to_bytes())
the, geese, are, flying = result
assert geese.lemma_ == 'goose'
assert are.lemma_ == 'be'
assert flying.lemma_ == 'fly'
text = "The geese are flying"
doc = EN(text)
result = Doc(doc.vocab).from_bytes(doc.to_bytes())
assert result[1].lemma_ == 'goose'
assert result[2].lemma_ == 'be'
assert result[3].lemma_ == 'fly'

View File

@ -1,55 +1,30 @@
# coding: utf-8
from __future__ import unicode_literals
import re
from ...attrs import TAG, DEP, HEAD
from ...serialize.packer import Packer
from ...serialize.bits import BitArray
from ..util import get_doc
import pytest
import numpy
from spacy.language import Language
from spacy.en import English
from spacy.vocab import Vocab
from spacy.tokens.doc import Doc
from spacy.tokenizer import Tokenizer
from os import path
import os
from spacy import util
from spacy.attrs import ORTH, SPACY, TAG, DEP, HEAD
from spacy.serialize.packer import Packer
from spacy.serialize.bits import BitArray
@pytest.fixture
def vocab():
path = os.environ.get('SPACY_DATA')
if path is None:
path = util.match_best_version('en', None, util.get_data_path())
else:
path = util.match_best_version('en', None, path)
vocab = English.Defaults.create_vocab()
lex = vocab['dog']
assert vocab[vocab.strings['dog']].orth_ == 'dog'
lex = vocab['the']
lex = vocab['quick']
lex = vocab['jumped']
return vocab
def text():
return "the dog jumped"
@pytest.fixture
def tokenizer(vocab):
null_re = re.compile(r'!!!!!!!!!')
tokenizer = Tokenizer(vocab, {}, null_re.search, null_re.search, null_re.finditer)
return tokenizer
def text_b():
return b"the dog jumped"
def test_char_packer(vocab):
packer = Packer(vocab, [])
def test_serialize_char_packer(en_vocab, text_b):
packer = Packer(en_vocab, [])
bits = BitArray()
bits.seek(0)
byte_str = bytearray(b'the dog jumped')
byte_str = bytearray(text_b)
packer.char_codec.encode(byte_str, bits)
bits.seek(0)
result = [b''] * len(byte_str)
@ -57,79 +32,67 @@ def test_char_packer(vocab):
assert bytearray(result) == byte_str
def test_packer_unannotated(tokenizer):
packer = Packer(tokenizer.vocab, [])
msg = tokenizer(u'the dog jumped')
assert msg.string == 'the dog jumped'
bits = packer.pack(msg)
def test_serialize_packer_unannotated(en_tokenizer, text):
packer = Packer(en_tokenizer.vocab, [])
tokens = en_tokenizer(text)
assert tokens.text_with_ws == text
bits = packer.pack(tokens)
result = packer.unpack(bits)
assert result.string == 'the dog jumped'
assert result.text_with_ws == text
@pytest.mark.models
def test_packer_annotated(tokenizer):
vocab = tokenizer.vocab
nn = vocab.strings['NN']
dt = vocab.strings['DT']
vbd = vocab.strings['VBD']
jj = vocab.strings['JJ']
det = vocab.strings['det']
nsubj = vocab.strings['nsubj']
adj = vocab.strings['adj']
root = vocab.strings['ROOT']
def test_packer_annotated(en_vocab, text):
heads = [1, 1, 0]
deps = ['det', 'nsubj', 'ROOT']
tags = ['DT', 'NN', 'VBD']
attr_freqs = [
(TAG, [(nn, 0.1), (dt, 0.2), (jj, 0.01), (vbd, 0.05)]),
(DEP, {det: 0.2, nsubj: 0.1, adj: 0.05, root: 0.1}.items()),
(TAG, [(en_vocab.strings['NN'], 0.1),
(en_vocab.strings['DT'], 0.2),
(en_vocab.strings['JJ'], 0.01),
(en_vocab.strings['VBD'], 0.05)]),
(DEP, {en_vocab.strings['det']: 0.2,
en_vocab.strings['nsubj']: 0.1,
en_vocab.strings['adj']: 0.05,
en_vocab.strings['ROOT']: 0.1}.items()),
(HEAD, {0: 0.05, 1: 0.2, -1: 0.2, -2: 0.1, 2: 0.1}.items())
]
packer = Packer(vocab, attr_freqs)
packer = Packer(en_vocab, attr_freqs)
doc = get_doc(en_vocab, [t for t in text.split()], tags=tags, deps=deps, heads=heads)
msg = tokenizer(u'the dog jumped')
# assert doc.text_with_ws == text
assert [t.tag_ for t in doc] == tags
assert [t.dep_ for t in doc] == deps
assert [(t.head.i-t.i) for t in doc] == heads
msg.from_array(
[TAG, DEP, HEAD],
numpy.array([
[dt, det, 1],
[nn, nsubj, 1],
[vbd, root, 0]
], dtype=numpy.int32))
assert msg.string == 'the dog jumped'
assert [t.tag_ for t in msg] == ['DT', 'NN', 'VBD']
assert [t.dep_ for t in msg] == ['det', 'nsubj', 'ROOT']
assert [(t.head.i - t.i) for t in msg] == [1, 1, 0]
bits = packer.pack(msg)
bits = packer.pack(doc)
result = packer.unpack(bits)
assert result.string == 'the dog jumped'
assert [t.tag_ for t in result] == ['DT', 'NN', 'VBD']
assert [t.dep_ for t in result] == ['det', 'nsubj', 'ROOT']
assert [(t.head.i - t.i) for t in result] == [1, 1, 0]
# assert result.text_with_ws == text
assert [t.tag_ for t in result] == tags
assert [t.dep_ for t in result] == deps
assert [(t.head.i-t.i) for t in result] == heads
def test_packer_bad_chars(tokenizer):
string = u'naja gut, is eher bl\xf6d und nicht mit reddit.com/digg.com vergleichbar; vielleicht auf dem weg dahin'
packer = Packer(tokenizer.vocab, [])
def test_packer_bad_chars(en_tokenizer):
text = "naja gut, is eher bl\xf6d und nicht mit reddit.com/digg.com vergleichbar; vielleicht auf dem weg dahin"
packer = Packer(en_tokenizer.vocab, [])
doc = tokenizer(string)
doc = en_tokenizer(text)
bits = packer.pack(doc)
result = packer.unpack(bits)
assert result.string == doc.string
@pytest.mark.models
def test_packer_bad_chars(EN):
string = u'naja gut, is eher bl\xf6d und nicht mit reddit.com/digg.com vergleichbar; vielleicht auf dem weg dahin'
doc = EN(string)
def test_packer_bad_chars_tags(EN):
text = "naja gut, is eher bl\xf6d und nicht mit reddit.com/digg.com vergleichbar; vielleicht auf dem weg dahin"
tags = ['JJ', 'NN', ',', 'VBZ', 'DT', 'NN', 'JJ', 'NN', 'NN',
'ADD', 'NN', ':', 'NN', 'NN', 'NN', 'NN', 'NN']
tokens = EN.tokenizer(text)
doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags)
byte_string = doc.to_bytes()
result = Doc(EN.vocab).from_bytes(byte_string)
result = get_doc(tokens.vocab).from_bytes(byte_string)
assert [t.tag_ for t in result] == [t.tag_ for t in doc]

View File

@ -1,127 +1,40 @@
# coding: utf-8
from __future__ import unicode_literals
from ...serialize.packer import Packer
from ..util import get_doc, assert_docs_equal
import pytest
from spacy.tokens import Doc
import spacy.en
from spacy.serialize.packer import Packer
TEXT = ["This", "is", "a", "test", "sentence", "."]
TAGS = ['DT', 'VBZ', 'DT', 'NN', 'NN', '.']
DEPS = ['nsubj', 'ROOT', 'det', 'compound', 'attr', 'punct']
ENTS = [('hi', 'PERSON', 0, 1)]
def equal(doc1, doc2):
# tokens
assert [ t.orth for t in doc1 ] == [ t.orth for t in doc2 ]
# tags
assert [ t.pos for t in doc1 ] == [ t.pos for t in doc2 ]
assert [ t.tag for t in doc1 ] == [ t.tag for t in doc2 ]
# parse
assert [ t.head.i for t in doc1 ] == [ t.head.i for t in doc2 ]
assert [ t.dep for t in doc1 ] == [ t.dep for t in doc2 ]
if doc1.is_parsed and doc2.is_parsed:
assert [ s for s in doc1.sents ] == [ s for s in doc2.sents ]
# entities
assert [ t.ent_type for t in doc1 ] == [ t.ent_type for t in doc2 ]
assert [ t.ent_iob for t in doc1 ] == [ t.ent_iob for t in doc2 ]
assert [ ent for ent in doc1.ents ] == [ ent for ent in doc2.ents ]
@pytest.mark.models
def test_serialize_tokens(EN):
doc1 = EN(u'This is a test sentence.',tag=False, parse=False, entity=False)
doc2 = Doc(EN.vocab).from_bytes(doc1.to_bytes())
equal(doc1, doc2)
@pytest.mark.models
def test_serialize_tokens_tags(EN):
doc1 = EN(u'This is a test sentence.',tag=True, parse=False, entity=False)
doc2 = Doc(EN.vocab).from_bytes(doc1.to_bytes())
equal(doc1, doc2)
@pytest.mark.models
def test_serialize_tokens_parse(EN):
doc1 = EN(u'This is a test sentence.',tag=False, parse=True, entity=False)
doc2 = Doc(EN.vocab).from_bytes(doc1.to_bytes())
equal(doc1, doc2)
@pytest.mark.models
def test_serialize_tokens_ner(EN):
doc1 = EN(u'This is a test sentence.', tag=False, parse=False, entity=True)
doc2 = Doc(EN.vocab).from_bytes(doc1.to_bytes())
equal(doc1, doc2)
@pytest.mark.models
def test_serialize_tokens_tags_parse(EN):
doc1 = EN(u'This is a test sentence.', tag=True, parse=True, entity=False)
doc2 = Doc(EN.vocab).from_bytes(doc1.to_bytes())
equal(doc1, doc2)
@pytest.mark.models
def test_serialize_tokens_tags_ner(EN):
doc1 = EN(u'This is a test sentence.', tag=True, parse=False, entity=True)
doc2 = Doc(EN.vocab).from_bytes(doc1.to_bytes())
equal(doc1, doc2)
@pytest.mark.models
def test_serialize_tokens_ner_parse(EN):
doc1 = EN(u'This is a test sentence.', tag=False, parse=True, entity=True)
doc2 = Doc(EN.vocab).from_bytes(doc1.to_bytes())
equal(doc1, doc2)
@pytest.mark.models
def test_serialize_tokens_tags_parse_ner(EN):
doc1 = EN(u'This is a test sentence.', tag=True, parse=True, entity=True)
doc2 = Doc(EN.vocab).from_bytes(doc1.to_bytes())
equal(doc1, doc2)
def test_serialize_empty_doc():
vocab = spacy.en.English.Defaults.create_vocab()
doc = Doc(vocab)
packer = Packer(vocab, {})
def test_serialize_empty_doc(en_vocab):
doc = get_doc(en_vocab)
packer = Packer(en_vocab, {})
b = packer.pack(doc)
assert b == b''
loaded = Doc(vocab).from_bytes(b)
loaded = get_doc(en_vocab).from_bytes(b)
assert len(loaded) == 0
def test_serialize_after_adding_entity():
# Re issue #514
vocab = spacy.en.English.Defaults.create_vocab()
entity_recognizer = spacy.en.English.Defaults.create_entity()
doc = Doc(vocab, words=u'This is a sentence about pasta .'.split())
entity_recognizer.add_label('Food')
entity_recognizer(doc)
label_id = vocab.strings[u'Food']
doc.ents = [(label_id, 5,6)]
assert [(ent.label_, ent.text) for ent in doc.ents] == [(u'Food', u'pasta')]
byte_string = doc.to_bytes()
@pytest.mark.parametrize('text', [TEXT])
def test_serialize_tokens(en_vocab, text):
doc1 = get_doc(en_vocab, [t for t in text])
doc2 = get_doc(en_vocab).from_bytes(doc1.to_bytes())
assert_docs_equal(doc1, doc2)
@pytest.mark.models
def test_serialize_after_adding_entity(EN):
EN.entity.add_label(u'Food')
doc = EN(u'This is a sentence about pasta.')
label_id = EN.vocab.strings[u'Food']
doc.ents = [(label_id, 5,6)]
byte_string = doc.to_bytes()
doc2 = Doc(EN.vocab).from_bytes(byte_string)
assert [(ent.label_, ent.text) for ent in doc2.ents] == [(u'Food', u'pasta')]
@pytest.mark.parametrize('text', [TEXT])
@pytest.mark.parametrize('tags', [TAGS, []])
@pytest.mark.parametrize('deps', [DEPS, []])
@pytest.mark.parametrize('ents', [ENTS, []])
def test_serialize_tokens_ner(EN, text, tags, deps, ents):
doc1 = get_doc(EN.vocab, [t for t in text], tags=tags, deps=deps, ents=ents)
doc2 = get_doc(EN.vocab).from_bytes(doc1.to_bytes())
assert_docs_equal(doc1, doc2)

View File

@ -1,33 +1,48 @@
# -*- coding: utf-8 -*-
# coding: utf-8
from __future__ import unicode_literals
import os
import io
import pickle
import pathlib
from spacy.lemmatizer import Lemmatizer, read_index, read_exc
from spacy import util
from ...lemmatizer import read_index, read_exc
import pytest
@pytest.fixture
def path():
if 'SPACY_DATA' in os.environ:
return pathlib.Path(os.environ['SPACY_DATA'])
else:
return util.match_best_version('en', None, util.get_data_path())
@pytest.fixture
def lemmatizer(path):
if path is not None:
return Lemmatizer.load(path)
else:
@pytest.mark.models
@pytest.mark.parametrize('text,lemmas', [("aardwolves", ["aardwolf"]),
("aardwolf", ["aardwolf"]),
("planets", ["planet"]),
("ring", ["ring"]),
("axes", ["axis", "axe", "ax"])])
def test_tagger_lemmatizer_noun_lemmas(lemmatizer, text, lemmas):
if lemmatizer is None:
return None
assert lemmatizer.noun(text) == set(lemmas)
def test_read_index(path):
@pytest.mark.models
def test_tagger_lemmatizer_base_forms(lemmatizer):
if lemmatizer is None:
return None
assert lemmatizer.noun('dive', {'number': 'sing'}) == set(['dive'])
assert lemmatizer.noun('dive', {'number': 'plur'}) == set(['diva'])
@pytest.mark.models
def test_tagger_lemmatizer_base_form_verb(lemmatizer):
if lemmatizer is None:
return None
assert lemmatizer.verb('saw', {'verbform': 'past'}) == set(['see'])
@pytest.mark.models
def test_tagger_lemmatizer_punct(lemmatizer):
if lemmatizer is None:
return None
assert lemmatizer.punct('“') == set(['"'])
assert lemmatizer.punct('”') == set(['"'])
@pytest.mark.models
def test_tagger_lemmatizer_read_index(path):
if path is not None:
with (path / 'wordnet' / 'index.noun').open() as file_:
index = read_index(file_)
@ -36,67 +51,19 @@ def test_read_index(path):
assert 'plant' in index
def test_read_exc(path):
@pytest.mark.models
@pytest.mark.parametrize('text,lemma', [("was", "be")])
def test_tagger_lemmatizer_read_exc(path, text, lemma):
if path is not None:
with (path / 'wordnet' / 'verb.exc').open() as file_:
exc = read_exc(file_)
assert exc['was'] == ('be',)
def test_noun_lemmas(lemmatizer):
if lemmatizer is None:
return None
do = lemmatizer.noun
assert do('aardwolves') == set(['aardwolf'])
assert do('aardwolf') == set(['aardwolf'])
assert do('planets') == set(['planet'])
assert do('ring') == set(['ring'])
assert do('axes') == set(['axis', 'axe', 'ax'])
def test_base_form_dive(lemmatizer):
if lemmatizer is None:
return None
do = lemmatizer.noun
assert do('dive', {'number': 'sing'}) == set(['dive'])
assert do('dive', {'number': 'plur'}) == set(['diva'])
def test_base_form_saw(lemmatizer):
if lemmatizer is None:
return None
do = lemmatizer.verb
assert do('saw', {'verbform': 'past'}) == set(['see'])
def test_smart_quotes(lemmatizer):
if lemmatizer is None:
return None
do = lemmatizer.punct
assert do('“') == set(['"'])
assert do('”') == set(['"'])
def test_pickle_lemmatizer(lemmatizer):
if lemmatizer is None:
return None
file_ = io.BytesIO()
pickle.dump(lemmatizer, file_)
file_.seek(0)
loaded = pickle.load(file_)
assert exc[text] == (lemma,)
@pytest.mark.models
def test_lemma_assignment(EN):
tokens = u'Bananas in pyjamas are geese .'.split(' ')
doc = EN.tokenizer.tokens_from_list(tokens)
assert all( t.lemma_ == u'' for t in doc )
def test_tagger_lemmatizer_lemma_assignment(EN):
text = "Bananas in pyjamas are geese."
doc = EN.tokenizer(text)
assert all(t.lemma_ == '' for t in doc)
EN.tagger(doc)
assert all( t.lemma_ != u'' for t in doc )
assert all(t.lemma_ != '' for t in doc)

View File

@ -1,37 +1,32 @@
# coding: utf-8
"""Ensure spaces are assigned the POS tag SPACE"""
from __future__ import unicode_literals
from spacy.parts_of_speech import SPACE
from ...parts_of_speech import SPACE
import pytest
@pytest.mark.models
def test_tagger_spaces(EN):
text = "Some\nspaces are\tnecessary."
doc = EN(text, tag=True, parse=False)
assert doc[0].pos != SPACE
assert doc[0].pos_ != 'SPACE'
assert doc[1].pos == SPACE
assert doc[1].pos_ == 'SPACE'
assert doc[1].tag_ == 'SP'
assert doc[2].pos != SPACE
assert doc[3].pos != SPACE
assert doc[4].pos == SPACE
@pytest.fixture
def tagged(EN):
string = u'Some\nspaces are\tnecessary.'
tokens = EN(string, tag=True, parse=False)
return tokens
@pytest.mark.models
def test_spaces(tagged):
assert tagged[0].pos != SPACE
assert tagged[0].pos_ != 'SPACE'
assert tagged[1].pos == SPACE
assert tagged[1].pos_ == 'SPACE'
assert tagged[1].tag_ == 'SP'
assert tagged[2].pos != SPACE
assert tagged[3].pos != SPACE
assert tagged[4].pos == SPACE
@pytest.mark.xfail
@pytest.mark.models
def test_return_char(EN):
string = ('hi Aaron,\r\n\r\nHow is your schedule today, I was wondering if '
def test_tagger_return_char(EN):
text = ('hi Aaron,\r\n\r\nHow is your schedule today, I was wondering if '
'you had time for a phone\r\ncall this afternoon?\r\n\r\n\r\n')
tokens = EN(string)
tokens = EN(text)
for token in tokens:
if token.is_space:
assert token.pos == SPACE

View File

@@ -1,14 +1,16 @@
from spacy.en import English
# coding: utf-8
from __future__ import unicode_literals
import six
import pytest
@pytest.mark.models
def test_tag_names(EN):
tokens = EN(u'I ate pizzas with anchovies.', parse=False, tag=True)
pizza = tokens[2]
assert type(pizza.pos) == int
assert isinstance(pizza.pos_, six.text_type)
assert type(pizza.dep) == int
assert isinstance(pizza.dep_, six.text_type)
assert pizza.tag_ == u'NNS'
text = "I ate pizzas with anchovies."
doc = EN(text, parse=False, tag=True)
assert type(doc[2].pos) == int
assert isinstance(doc[2].pos_, six.text_type)
assert type(doc[2].dep) == int
assert isinstance(doc[2].dep_, six.text_type)
assert doc[2].tag_ == u'NNS'

27
spacy/tests/test_attrs.py Normal file
View File

@@ -0,0 +1,27 @@
# coding: utf-8
from __future__ import unicode_literals
from ..attrs import intify_attrs, ORTH, NORM, LEMMA, IS_ALPHA
import pytest
@pytest.mark.parametrize('text', ["dog"])
def test_attrs_key(text):
assert intify_attrs({"ORTH": text}) == {ORTH: text}
assert intify_attrs({"NORM": text}) == {NORM: text}
assert intify_attrs({"lemma": text}, strings_map={text: 10}) == {LEMMA: 10}
@pytest.mark.parametrize('text', ["dog"])
def test_attrs_idempotence(text):
int_attrs = intify_attrs({"lemma": text, 'is_alpha': True}, strings_map={text: 10})
assert intify_attrs(int_attrs) == {LEMMA: 10, IS_ALPHA: True}
@pytest.mark.parametrize('text', ["dog"])
def test_attrs_do_deprecated(text):
int_attrs = intify_attrs({"F": text, 'is_alpha': True},
strings_map={text: 10},
_do_deprecated=True)
assert int_attrs == {ORTH: 10, IS_ALPHA: True}

View File

@@ -1,94 +0,0 @@
from __future__ import unicode_literals
import pytest
from spacy.strings import StringStore
from spacy.matcher import *
from spacy.attrs import LOWER
from spacy.tokens.doc import Doc
from spacy.vocab import Vocab
from spacy.en import English
@pytest.fixture
def matcher():
patterns = {
'JS': ['PRODUCT', {}, [[{'ORTH': 'JavaScript'}]]],
'GoogleNow': ['PRODUCT', {}, [[{'ORTH': 'Google'}, {'ORTH': 'Now'}]]],
'Java': ['PRODUCT', {}, [[{'LOWER': 'java'}]]],
}
return Matcher(Vocab(lex_attr_getters=English.Defaults.lex_attr_getters), patterns)
def test_compile(matcher):
assert matcher.n_patterns == 3
def test_no_match(matcher):
doc = Doc(matcher.vocab, words=['I', 'like', 'cheese', '.'])
assert matcher(doc) == []
def test_match_start(matcher):
doc = Doc(matcher.vocab, words=['JavaScript', 'is', 'good'])
assert matcher(doc) == [(matcher.vocab.strings['JS'],
matcher.vocab.strings['PRODUCT'], 0, 1)]
def test_match_end(matcher):
doc = Doc(matcher.vocab, words=['I', 'like', 'java'])
assert matcher(doc) == [(doc.vocab.strings['Java'],
doc.vocab.strings['PRODUCT'], 2, 3)]
def test_match_middle(matcher):
doc = Doc(matcher.vocab, words=['I', 'like', 'Google', 'Now', 'best'])
assert matcher(doc) == [(doc.vocab.strings['GoogleNow'],
doc.vocab.strings['PRODUCT'], 2, 4)]
def test_match_multi(matcher):
doc = Doc(matcher.vocab, words='I like Google Now and java best'.split())
assert matcher(doc) == [(doc.vocab.strings['GoogleNow'],
doc.vocab.strings['PRODUCT'], 2, 4),
(doc.vocab.strings['Java'],
doc.vocab.strings['PRODUCT'], 5, 6)]
def test_match_zero(matcher):
matcher.add('Quote', '', {}, [
[
{'ORTH': '"'},
{'OP': '!', 'IS_PUNCT': True},
{'OP': '!', 'IS_PUNCT': True},
{'ORTH': '"'}
]])
doc = Doc(matcher.vocab, words='He said , " some words " ...'.split())
assert len(matcher(doc)) == 1
doc = Doc(matcher.vocab, words='He said , " some three words " ...'.split())
assert len(matcher(doc)) == 0
matcher.add('Quote', '', {}, [
[
{'ORTH': '"'},
{'IS_PUNCT': True},
{'IS_PUNCT': True},
{'IS_PUNCT': True},
{'ORTH': '"'}
]])
assert len(matcher(doc)) == 0
def test_match_zero_plus(matcher):
matcher.add('Quote', '', {}, [
[
{'ORTH': '"'},
{'OP': '*', 'IS_PUNCT': False},
{'ORTH': '"'}
]])
doc = Doc(matcher.vocab, words='He said , " some words " ...'.split())
assert len(matcher(doc)) == 1
def test_phrase_matcher():
vocab = Vocab(lex_attr_getters=English.Defaults.lex_attr_getters)
matcher = PhraseMatcher(vocab, [Doc(vocab, words='Google Now'.split())])
doc = Doc(vocab, words=['I', 'like', 'Google', 'Now', 'best'])
assert len(matcher(doc)) == 1

View File

@@ -1,11 +1,13 @@
# coding: utf-8
from __future__ import unicode_literals
from os import path
import pytest
from ...vocab import Vocab
from ...tokenizer import Tokenizer
from ...util import utf8open
from os import path
import pytest
def test_tokenizer_handles_no_word(tokenizer):
tokens = tokenizer("")
@@ -81,3 +83,25 @@ def test_tokenizer_suspected_freeing_strings(tokenizer):
tokens2 = tokenizer(text2)
assert tokens1[0].text == "Lorem"
assert tokens2[0].text == "Lorem"
@pytest.mark.parametrize('text,tokens', [
("lorem", [{'orth': 'lo'}, {'orth': 'rem'}])])
def test_tokenizer_add_special_case(tokenizer, text, tokens):
tokenizer.add_special_case(text, tokens)
doc = tokenizer(text)
assert doc[0].text == tokens[0]['orth']
assert doc[1].text == tokens[1]['orth']
@pytest.mark.parametrize('text,tokens', [
("lorem", [{'orth': 'lo', 'tag': 'NN'}, {'orth': 'rem'}])])
def test_tokenizer_add_special_case_tag(text, tokens):
vocab = Vocab(tag_map={'NN': {'pos': 'NOUN'}})
tokenizer = Tokenizer(vocab, {}, None, None, None)
tokenizer.add_special_case(text, tokens)
doc = tokenizer(text)
assert doc[0].text == tokens[0]['orth']
assert doc[0].tag_ == tokens[0]['tag']
assert doc[0].pos_ == 'NOUN'
assert doc[1].text == tokens[1]['orth']

View File

@@ -4,13 +4,17 @@ from __future__ import unicode_literals
import pytest
URLS = [
URLS_BASIC = [
"http://www.nytimes.com/2016/04/20/us/politics/new-york-primary-preview.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=a-lede-package-region&region=top-news&WT.nav=top-news&_r=0",
"www.google.com?q=google",
"www.red-stars.com",
"http://foo.com/blah_(wikipedia)#cite-1",
"mailto:foo.bar@baz.com",
"mailto:foo-bar@baz-co.com"
]
URLS_FULL = URLS_BASIC + [
"mailto:foo-bar@baz-co.com",
"www.google.com?q=google",
"http://foo.com/blah_(wikipedia)#cite-1"
]
@@ -25,32 +29,14 @@ SUFFIXES = [
'"', ":", ">"]
@pytest.mark.parametrize("url", URLS)
@pytest.mark.parametrize("url", URLS_BASIC)
def test_tokenizer_handles_simple_url(tokenizer, url):
tokens = tokenizer(url)
assert len(tokens) == 1
assert tokens[0].text == url
@pytest.mark.parametrize("prefix", PREFIXES)
@pytest.mark.parametrize("url", URLS)
def test_tokenizer_handles_prefixed_url(tokenizer, prefix, url):
tokens = tokenizer(prefix + url)
assert len(tokens) == 2
assert tokens[0].text == prefix
assert tokens[1].text == url
@pytest.mark.parametrize("suffix", SUFFIXES)
@pytest.mark.parametrize("url", URLS)
def test_tokenizer_handles_suffixed_url(tokenizer, url, suffix):
tokens = tokenizer(url + suffix)
assert len(tokens) == 2
assert tokens[0].text == url
assert tokens[1].text == suffix
@pytest.mark.parametrize("url", URLS)
@pytest.mark.parametrize("url", URLS_BASIC)
def test_tokenizer_handles_simple_surround_url(tokenizer, url):
tokens = tokenizer("(" + url + ")")
assert len(tokens) == 3
@@ -61,8 +47,28 @@ def test_tokenizer_handles_simple_surround_url(tokenizer, url):
@pytest.mark.slow
@pytest.mark.parametrize("prefix", PREFIXES)
@pytest.mark.parametrize("url", URLS_FULL)
def test_tokenizer_handles_prefixed_url(tokenizer, prefix, url):
tokens = tokenizer(prefix + url)
assert len(tokens) == 2
assert tokens[0].text == prefix
assert tokens[1].text == url
@pytest.mark.slow
@pytest.mark.parametrize("suffix", SUFFIXES)
@pytest.mark.parametrize("url", URLS)
@pytest.mark.parametrize("url", URLS_FULL)
def test_tokenizer_handles_suffixed_url(tokenizer, url, suffix):
tokens = tokenizer(url + suffix)
assert len(tokens) == 2
assert tokens[0].text == url
assert tokens[1].text == suffix
@pytest.mark.slow
@pytest.mark.parametrize("prefix", PREFIXES)
@pytest.mark.parametrize("suffix", SUFFIXES)
@pytest.mark.parametrize("url", URLS_FULL)
def test_tokenizer_handles_surround_url(tokenizer, prefix, suffix, url):
tokens = tokenizer(prefix + url + suffix)
assert len(tokens) == 3
@@ -74,7 +80,7 @@ def test_tokenizer_handles_surround_url(tokenizer, prefix, suffix, url):
@pytest.mark.slow
@pytest.mark.parametrize("prefix1", PREFIXES)
@pytest.mark.parametrize("prefix2", PREFIXES)
@pytest.mark.parametrize("url", URLS)
@pytest.mark.parametrize("url", URLS_FULL)
def test_tokenizer_handles_two_prefix_url(tokenizer, prefix1, prefix2, url):
tokens = tokenizer(prefix1 + prefix2 + url)
assert len(tokens) == 3
@@ -86,7 +92,7 @@ def test_tokenizer_handles_two_prefix_url(tokenizer, prefix1, prefix2, url):
@pytest.mark.slow
@pytest.mark.parametrize("suffix1", SUFFIXES)
@pytest.mark.parametrize("suffix2", SUFFIXES)
@pytest.mark.parametrize("url", URLS)
@pytest.mark.parametrize("url", URLS_FULL)
def test_tokenizer_handles_two_suffix_url(tokenizer, suffix1, suffix2, url):
tokens = tokenizer(url + suffix1 + suffix2)
assert len(tokens) == 3

View File

@@ -1,35 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals
from ...attrs import *
def test_key_no_value():
int_attrs = intify_attrs({"ORTH": "dog"})
assert int_attrs == {ORTH: "dog"}
def test_lower_key():
int_attrs = intify_attrs({"norm": "dog"})
assert int_attrs == {NORM: "dog"}
def test_lower_key_value():
vals = {'dog': 10}
int_attrs = intify_attrs({"lemma": "dog"}, strings_map=vals)
assert int_attrs == {LEMMA: 10}
def test_idempotence():
vals = {'dog': 10}
int_attrs = intify_attrs({"lemma": "dog", 'is_alpha': True}, strings_map=vals)
int_attrs = intify_attrs(int_attrs)
assert int_attrs == {LEMMA: 10, IS_ALPHA: True}
def test_do_deprecated():
vals = {'dog': 10}
int_attrs = intify_attrs({"F": "dog", 'is_alpha': True}, strings_map=vals,
_do_deprecated=True)
assert int_attrs == {ORTH: 10, IS_ALPHA: True}

View File

@@ -1,137 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
import numpy
from ...attrs import HEAD, DEP
@pytest.mark.models
class TestNounChunks:
@pytest.fixture(scope="class")
def ex1_en(self, EN):
example = EN.tokenizer.tokens_from_list('A base phrase should be recognized .'.split(' '))
EN.tagger.tag_from_strings(example, 'DT NN NN MD VB VBN .'.split(' '))
det,compound,nsubjpass,aux,auxpass,root,punct = tuple( EN.vocab.strings[l] for l in ['det','compound','nsubjpass','aux','auxpass','root','punct'] )
example.from_array([HEAD, DEP],
numpy.asarray(
[
[2, det],
[1, compound],
[3, nsubjpass],
[2, aux],
[1, auxpass],
[0, root],
[-1, punct]
], dtype='int32'))
return example
@pytest.fixture(scope="class")
def ex2_en(self, EN):
example = EN.tokenizer.tokens_from_list('A base phrase and a good phrase are often the same .'.split(' '))
EN.tagger.tag_from_strings(example, 'DT NN NN CC DT JJ NN VBP RB DT JJ .'.split(' '))
det,compound,nsubj,cc,amod,conj,root,advmod,attr,punct = tuple( EN.vocab.strings[l] for l in ['det','compound','nsubj','cc','amod','conj','root','advmod','attr','punct'] )
example.from_array([HEAD, DEP],
numpy.asarray(
[
[2, det],
[1, compound],
[5, nsubj],
[-1, cc],
[1, det],
[1, amod],
[-4, conj],
[0, root],
[-1, advmod],
[1, det],
[-3, attr],
[-4, punct]
], dtype='int32'))
return example
@pytest.fixture(scope="class")
def ex3_en(self, EN):
example = EN.tokenizer.tokens_from_list('A phrase with another phrase occurs .'.split(' '))
EN.tagger.tag_from_strings(example, 'DT NN IN DT NN VBZ .'.split(' '))
det,nsubj,prep,pobj,root,punct = tuple( EN.vocab.strings[l] for l in ['det','nsubj','prep','pobj','root','punct'] )
example.from_array([HEAD, DEP],
numpy.asarray(
[
[1, det],
[4, nsubj],
[-1, prep],
[1, det],
[-2, pobj],
[0, root],
[-1, punct]
], dtype='int32'))
return example
@pytest.fixture(scope="class")
def ex1_de(self, DE):
example = DE.tokenizer.tokens_from_list('Eine Tasse steht auf dem Tisch .'.split(' '))
DE.tagger.tag_from_strings(example, 'ART NN VVFIN APPR ART NN $.'.split(' '))
nk,sb,root,mo,punct = tuple( DE.vocab.strings[l] for l in ['nk','sb','root','mo','punct'])
example.from_array([HEAD, DEP],
numpy.asarray(
[
[1, nk],
[1, sb],
[0, root],
[-1, mo],
[1, nk],
[-2, nk],
[-3, punct]
], dtype='int32'))
return example
@pytest.fixture(scope="class")
def ex2_de(self, DE):
example = DE.tokenizer.tokens_from_list('Die Sängerin singt mit einer Tasse Kaffee Arien .'.split(' '))
DE.tagger.tag_from_strings(example, 'ART NN VVFIN APPR ART NN NN NN $.'.split(' '))
nk,sb,root,mo,punct,oa = tuple( DE.vocab.strings[l] for l in ['nk','sb','root','mo','punct','oa'])
example.from_array([HEAD, DEP],
numpy.asarray(
[
[1, nk],
[1, sb],
[0, root],
[-1, mo],
[1, nk],
[-2, nk],
[-1, nk],
[-5, oa],
[-6, punct]
], dtype='int32'))
return example
def test_en_standard_chunk(self, ex1_en):
chunks = list(ex1_en.noun_chunks)
assert len(chunks) == 1
assert chunks[0].string == 'A base phrase '
def test_en_coordinated_chunks(self, ex2_en):
chunks = list(ex2_en.noun_chunks)
assert len(chunks) == 2
assert chunks[0].string == 'A base phrase '
assert chunks[1].string == 'a good phrase '
def test_en_pp_chunks(self, ex3_en):
chunks = list(ex3_en.noun_chunks)
assert len(chunks) == 2
assert chunks[0].string == 'A phrase '
assert chunks[1].string == 'another phrase '
def test_de_standard_chunk(self, ex1_de):
chunks = list(ex1_de.noun_chunks)
assert len(chunks) == 2
assert chunks[0].string == 'Eine Tasse '
assert chunks[1].string == 'dem Tisch '
def test_de_extended_chunk(self, ex2_de):
chunks = list(ex2_de.noun_chunks)
assert len(chunks) == 3
assert chunks[0].string == 'Die Sängerin '
assert chunks[1].string == 'einer Tasse Kaffee '
assert chunks[2].string == 'Arien '

View File

@@ -1,50 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals
from ...vocab import Vocab
from ...tokenizer import Tokenizer
import re
import pytest
@pytest.fixture
def vocab():
return Vocab(tag_map={'NN': {'pos': 'NOUN'}})
@pytest.fixture
def rules():
return {}
@pytest.fixture
def prefix_search():
return None
@pytest.fixture
def suffix_search():
return None
@pytest.fixture
def infix_finditer():
return None
@pytest.fixture
def tokenizer(vocab, rules, prefix_search, suffix_search, infix_finditer):
return Tokenizer(vocab, rules, prefix_search, suffix_search, infix_finditer)
def test_add_special_case(tokenizer):
tokenizer.add_special_case('dog', [{'orth': 'd'}, {'orth': 'og'}])
doc = tokenizer('dog')
assert doc[0].text == 'd'
assert doc[1].text == 'og'
def test_special_case_tag(tokenizer):
tokenizer.add_special_case('dog', [{'orth': 'd', 'tag': 'NN'}, {'orth': 'og'}])
doc = tokenizer('dog')
assert doc[0].text == 'd'
assert doc[0].tag_ == 'NN'
assert doc[0].pos_ == 'NOUN'
assert doc[1].text == 'og'

View File

@@ -4,6 +4,8 @@ from __future__ import unicode_literals
from ..tokens import Doc
from ..attrs import ORTH, POS, HEAD, DEP
import numpy
def get_doc(vocab, words=[], pos=None, heads=None, deps=None, tags=None, ents=None):
"""Create Doc object from given vocab, words and annotations."""
@@ -36,3 +38,36 @@ def apply_transition_sequence(parser, doc, sequence):
with parser.step_through(doc) as stepwise:
for transition in sequence:
stepwise.transition(transition)
def add_vecs_to_vocab(vocab, vectors):
"""Add list of vector tuples to given vocab. All vectors need to have the
same length. Format: [("text", [1, 2, 3])]"""
length = len(vectors[0][1])
vocab.resize_vectors(length)
for word, vec in vectors:
vocab[word].vector = vec
return vocab
def get_cosine(vec1, vec2):
"""Get cosine for two given vectors"""
return numpy.dot(vec1, vec2) / (numpy.linalg.norm(vec1) * numpy.linalg.norm(vec2))
def assert_docs_equal(doc1, doc2):
"""Compare two Doc objects and assert that they're equal. Tests for tokens,
tags, dependencies and entities."""
assert [ t.orth for t in doc1 ] == [ t.orth for t in doc2 ]
assert [ t.pos for t in doc1 ] == [ t.pos for t in doc2 ]
assert [ t.tag for t in doc1 ] == [ t.tag for t in doc2 ]
assert [ t.head.i for t in doc1 ] == [ t.head.i for t in doc2 ]
assert [ t.dep for t in doc1 ] == [ t.dep for t in doc2 ]
if doc1.is_parsed and doc2.is_parsed:
assert [ s for s in doc1.sents ] == [ s for s in doc2.sents ]
assert [ t.ent_type for t in doc1 ] == [ t.ent_type for t in doc2 ]
assert [ t.ent_iob for t in doc1 ] == [ t.ent_iob for t in doc2 ]
assert [ ent for ent in doc1.ents ] == [ ent for ent in doc2.ents ]

View File

@@ -1,96 +1,60 @@
# coding: utf-8
from __future__ import unicode_literals
import spacy
from spacy.vocab import Vocab
from spacy.tokens.doc import Doc
import numpy
import numpy.linalg
from ..util import get_doc, get_cosine, add_vecs_to_vocab
import numpy
import pytest
def get_vector(letters):
return numpy.asarray([ord(letter) for letter in letters], dtype='float32')
def get_cosine(vec1, vec2):
return numpy.dot(vec1, vec2) / (numpy.linalg.norm(vec1) * numpy.linalg.norm(vec2))
@pytest.fixture(scope='module')
def en_vocab():
vocab = spacy.get_lang_class('en').Defaults.create_vocab()
vocab.resize_vectors(2)
apple_ = vocab[u'apple']
orange_ = vocab[u'orange']
apple_.vector = get_vector('ap')
orange_.vector = get_vector('or')
return vocab
@pytest.fixture
def appleL(en_vocab):
return en_vocab['apple']
def vectors():
return [("apple", [1, 2, 3]), ("orange", [-1, -2, -3])]
@pytest.fixture
def orangeL(en_vocab):
return en_vocab['orange']
@pytest.fixture()
def vocab(en_vocab, vectors):
return add_vecs_to_vocab(en_vocab, vectors)
@pytest.fixture(scope='module')
def apple_orange(en_vocab):
return Doc(en_vocab, words=[u'apple', u'orange'])
def test_vectors_similarity_LL(vocab, vectors):
[(word1, vec1), (word2, vec2)] = vectors
lex1 = vocab[word1]
lex2 = vocab[word2]
assert lex1.has_vector
assert lex2.has_vector
assert lex1.vector_norm != 0
assert lex2.vector_norm != 0
assert lex1.vector[0] != lex2.vector[0] and lex1.vector[1] != lex2.vector[1]
assert numpy.isclose(lex1.similarity(lex2), get_cosine(vec1, vec2))
assert numpy.isclose(lex2.similarity(lex2), lex1.similarity(lex1))
@pytest.fixture
def appleT(apple_orange):
return apple_orange[0]
def test_vectors_similarity_TT(vocab, vectors):
[(word1, vec1), (word2, vec2)] = vectors
doc = get_doc(vocab, words=[word1, word2])
assert doc[0].has_vector
assert doc[1].has_vector
assert doc[0].vector_norm != 0
assert doc[1].vector_norm != 0
assert doc[0].vector[0] != doc[1].vector[0] and doc[0].vector[1] != doc[1].vector[1]
assert numpy.isclose(doc[0].similarity(doc[1]), get_cosine(vec1, vec2))
assert numpy.isclose(doc[1].similarity(doc[0]), doc[0].similarity(doc[1]))
@pytest.fixture
def orangeT(apple_orange):
return apple_orange[1]
def test_vectors_similarity_TD(vocab, vectors):
[(word1, vec1), (word2, vec2)] = vectors
doc = get_doc(vocab, words=[word1, word2])
assert doc.similarity(doc[0]) == doc[0].similarity(doc)
def test_LL_sim(appleL, orangeL):
assert appleL.has_vector
assert orangeL.has_vector
assert appleL.vector_norm != 0
assert orangeL.vector_norm != 0
assert appleL.vector[0] != orangeL.vector[0] and appleL.vector[1] != orangeL.vector[1]
assert numpy.isclose(
appleL.similarity(orangeL),
get_cosine(get_vector('ap'), get_vector('or')))
assert numpy.isclose(
orangeL.similarity(appleL),
appleL.similarity(orangeL))
def test_TT_sim(appleT, orangeT):
assert appleT.has_vector
assert orangeT.has_vector
assert appleT.vector_norm != 0
assert orangeT.vector_norm != 0
assert appleT.vector[0] != orangeT.vector[0] and appleT.vector[1] != orangeT.vector[1]
assert numpy.isclose(
appleT.similarity(orangeT),
get_cosine(get_vector('ap'), get_vector('or')))
assert numpy.isclose(
orangeT.similarity(appleT),
appleT.similarity(orangeT))
def test_TD_sim(apple_orange, appleT):
assert apple_orange.similarity(appleT) == appleT.similarity(apple_orange)
def test_DS_sim(apple_orange, appleT):
span = apple_orange[:2]
assert apple_orange.similarity(span) == 1.0
assert span.similarity(apple_orange) == 1.0
def test_TS_sim(apple_orange, appleT):
span = apple_orange[:2]
assert span.similarity(appleT) == appleT.similarity(span)
def test_vectors_similarity_DS(vocab, vectors):
[(word1, vec1), (word2, vec2)] = vectors
doc = get_doc(vocab, words=[word1, word2])
assert doc.similarity(doc[:2]) == doc[:2].similarity(doc)
def test_vectors_similarity_TS(vocab, vectors):
[(word1, vec1), (word2, vec2)] = vectors
doc = get_doc(vocab, words=[word1, word2])
assert doc[:2].similarity(doc[0]) == doc[0].similarity(doc[:2])

View File

@@ -1,109 +1,126 @@
# coding: utf-8
from __future__ import unicode_literals
from ...tokenizer import Tokenizer
from ..util import get_doc, add_vecs_to_vocab
import pytest
@pytest.mark.models
def test_token_vector(EN):
token = EN(u'Apples and oranges')[0]
token.vector
token.vector_norm
@pytest.mark.models
def test_lexeme_vector(EN):
lexeme = EN.vocab[u'apples']
lexeme.vector
lexeme.vector_norm
@pytest.fixture
def vectors():
return [("apple", [0.0, 1.0, 2.0]), ("orange", [3.0, -2.0, 4.0])]
@pytest.mark.models
def test_doc_vector(EN):
doc = EN(u'Apples and oranges')
doc.vector
doc.vector_norm
@pytest.mark.models
def test_span_vector(EN):
span = EN(u'Apples and oranges')[0:2]
span.vector
span.vector_norm
@pytest.mark.models
def test_token_token_similarity(EN):
apples, oranges = EN(u'apples oranges')
assert apples.similarity(oranges) == oranges.similarity(apples)
assert 0.0 < apples.similarity(oranges) < 1.0
@pytest.mark.models
def test_token_lexeme_similarity(EN):
apples = EN(u'apples')
oranges = EN.vocab[u'oranges']
assert apples.similarity(oranges) == oranges.similarity(apples)
assert 0.0 < apples.similarity(oranges) < 1.0
@pytest.mark.models
def test_token_span_similarity(EN):
doc = EN(u'apples orange juice')
apples = doc[0]
oranges = doc[1:3]
assert apples.similarity(oranges) == oranges.similarity(apples)
assert 0.0 < apples.similarity(oranges) < 1.0
@pytest.mark.models
def test_token_doc_similarity(EN):
doc = EN(u'apples orange juice')
apples = doc[0]
assert apples.similarity(doc) == doc.similarity(apples)
assert 0.0 < apples.similarity(doc) < 1.0
@pytest.mark.models
def test_lexeme_span_similarity(EN):
doc = EN(u'apples orange juice')
apples = EN.vocab[u'apples']
span = doc[1:3]
assert apples.similarity(span) == span.similarity(apples)
assert 0.0 < apples.similarity(span) < 1.0
@pytest.fixture()
def vocab(en_vocab, vectors):
return add_vecs_to_vocab(en_vocab, vectors)
@pytest.mark.models
def test_lexeme_lexeme_similarity(EN):
apples = EN.vocab[u'apples']
oranges = EN.vocab[u'oranges']
assert apples.similarity(oranges) == oranges.similarity(apples)
assert 0.0 < apples.similarity(oranges) < 1.0
@pytest.fixture()
def tokenizer_v(vocab):
return Tokenizer(vocab, {}, None, None, None)
@pytest.mark.models
def test_lexeme_doc_similarity(EN):
doc = EN(u'apples orange juice')
apples = EN.vocab[u'apples']
assert apples.similarity(doc) == doc.similarity(apples)
assert 0.0 < apples.similarity(doc) < 1.0
@pytest.mark.models
def test_span_span_similarity(EN):
doc = EN(u'apples orange juice')
apples = doc[0:2]
oj = doc[1:3]
assert apples.similarity(oj) == oj.similarity(apples)
assert 0.0 < apples.similarity(oj) < 1.0
@pytest.mark.parametrize('text', ["apple and orange"])
def test_vectors_token_vector(tokenizer_v, vectors, text):
doc = tokenizer_v(text)
assert vectors[0] == (doc[0].text, list(doc[0].vector))
assert vectors[1] == (doc[2].text, list(doc[2].vector))
@pytest.mark.models
def test_span_doc_similarity(EN):
doc = EN(u'apples orange juice')
apples = doc[0:2]
oj = doc[1:3]
assert apples.similarity(doc) == doc.similarity(apples)
assert 0.0 < apples.similarity(doc) < 1.0
@pytest.mark.models
def test_doc_doc_similarity(EN):
apples = EN(u'apples and apple pie')
oranges = EN(u'orange juice')
assert apples.similarity(oranges) == apples.similarity(oranges)
assert 0.0 < apples.similarity(oranges) < 1.0
@pytest.mark.parametrize('text', ["apple", "orange"])
def test_vectors_lexeme_vector(vocab, text):
lex = vocab[text]
assert list(lex.vector)
assert lex.vector_norm
@pytest.mark.parametrize('text', [["apple", "and", "orange"]])
def test_vectors_doc_vector(vocab, text):
doc = get_doc(vocab, text)
assert list(doc.vector)
assert doc.vector_norm
@pytest.mark.parametrize('text', [["apple", "and", "orange"]])
def test_vectors_span_vector(vocab, text):
span = get_doc(vocab, text)[0:2]
assert list(span.vector)
assert span.vector_norm
@pytest.mark.parametrize('text', ["apple orange"])
def test_vectors_token_token_similarity(tokenizer_v, text):
doc = tokenizer_v(text)
assert doc[0].similarity(doc[1]) == doc[1].similarity(doc[0])
assert 0.0 < doc[0].similarity(doc[1]) < 1.0
@pytest.mark.parametrize('text1,text2', [("apple", "orange")])
def test_vectors_token_lexeme_similarity(tokenizer_v, vocab, text1, text2):
token = tokenizer_v(text1)
lex = vocab[text2]
assert token.similarity(lex) == lex.similarity(token)
assert 0.0 < token.similarity(lex) < 1.0
@pytest.mark.parametrize('text', [["apple", "orange", "juice"]])
def test_vectors_token_span_similarity(vocab, text):
doc = get_doc(vocab, text)
assert doc[0].similarity(doc[1:3]) == doc[1:3].similarity(doc[0])
assert 0.0 < doc[0].similarity(doc[1:3]) < 1.0
@pytest.mark.parametrize('text', [["apple", "orange", "juice"]])
def test_vectors_token_doc_similarity(vocab, text):
doc = get_doc(vocab, text)
assert doc[0].similarity(doc) == doc.similarity(doc[0])
assert 0.0 < doc[0].similarity(doc) < 1.0
@pytest.mark.parametrize('text', [["apple", "orange", "juice"]])
def test_vectors_lexeme_span_similarity(vocab, text):
doc = get_doc(vocab, text)
lex = vocab[text[0]]
assert lex.similarity(doc[1:3]) == doc[1:3].similarity(lex)
assert 0.0 < doc.similarity(doc[1:3]) < 1.0
@pytest.mark.parametrize('text1,text2', [("apple", "orange")])
def test_vectors_lexeme_lexeme_similarity(vocab, text1, text2):
lex1 = vocab[text1]
lex2 = vocab[text2]
assert lex1.similarity(lex2) == lex2.similarity(lex1)
assert 0.0 < lex1.similarity(lex2) < 1.0
@pytest.mark.parametrize('text', [["apple", "orange", "juice"]])
def test_vectors_lexeme_doc_similarity(vocab, text):
doc = get_doc(vocab, text)
lex = vocab[text[0]]
assert lex.similarity(doc) == doc.similarity(lex)
assert 0.0 < lex.similarity(doc) < 1.0
@pytest.mark.parametrize('text', [["apple", "orange", "juice"]])
def test_vectors_span_span_similarity(vocab, text):
doc = get_doc(vocab, text)
assert doc[0:2].similarity(doc[1:3]) == doc[1:3].similarity(doc[0:2])
assert 0.0 < doc[0:2].similarity(doc[1:3]) < 1.0
@pytest.mark.parametrize('text', [["apple", "orange", "juice"]])
def test_vectors_span_doc_similarity(vocab, text):
doc = get_doc(vocab, text)
assert doc[0:2].similarity(doc) == doc.similarity(doc[0:2])
assert 0.0 < doc[0:2].similarity(doc) < 1.0
@pytest.mark.parametrize('text1,text2', [
(["apple", "and", "apple", "pie"], ["orange", "juice"])])
def test_vectors_doc_doc_similarity(vocab, text1, text2):
doc1 = get_doc(vocab, text1)
doc2 = get_doc(vocab, text2)
assert doc1.similarity(doc2) == doc2.similarity(doc1)
assert 0.0 < doc1.similarity(doc2) < 1.0

View File

@@ -1,42 +1,58 @@
# coding: utf-8
from __future__ import unicode_literals
from ...attrs import *
import pytest
from spacy.attrs import *
@pytest.mark.parametrize('text1,prob1,text2,prob2', [("NOUN", -1, "opera", -2)])
def test_vocab_lexeme_lt(en_vocab, text1, text2, prob1, prob2):
"""More frequent is l.t. less frequent"""
lex1 = en_vocab[text1]
lex1.prob = prob1
lex2 = en_vocab[text2]
lex2.prob = prob2
assert lex1 < lex2
assert lex2 > lex1
def test_lexeme_eq(en_vocab):
'''Test Issue #361: Equality of lexemes'''
cat1 = en_vocab['cat']
cat2 = en_vocab['cat']
assert cat1 == cat2
def test_lexeme_neq(en_vocab):
'''Inequality of lexemes'''
cat = en_vocab['cat']
dog = en_vocab['dog']
assert cat != dog
def test_lexeme_lt(en_vocab):
'''More frequent is l.t. less frequent'''
noun = en_vocab['NOUN']
opera = en_vocab['opera']
assert noun < opera
assert opera > noun
@pytest.mark.parametrize('text1,text2', [("phantom", "opera")])
def test_vocab_lexeme_hash(en_vocab, text1, text2):
"""Test that lexemes are hashable."""
lex1 = en_vocab[text1]
lex2 = en_vocab[text2]
lexes = {lex1: lex1, lex2: lex2}
assert lexes[lex1].orth_ == text1
assert lexes[lex2].orth_ == text2
def test_lexeme_hash(en_vocab):
'''Test that lexemes are hashable.'''
phantom = en_vocab['phantom']
def test_vocab_lexeme_is_alpha(en_vocab):
assert en_vocab['the'].flags & (1 << IS_ALPHA)
assert not en_vocab['1999'].flags & (1 << IS_ALPHA)
assert not en_vocab['hello1'].flags & (1 << IS_ALPHA)
opera = en_vocab['opera']
lexes = {phantom: phantom, opera: opera}
assert lexes[phantom].orth_ == 'phantom'
assert lexes[opera].orth_ == 'opera'
def test_vocab_lexeme_is_digit(en_vocab):
assert not en_vocab['the'].flags & (1 << IS_DIGIT)
assert en_vocab['1999'].flags & (1 << IS_DIGIT)
assert not en_vocab['hello1'].flags & (1 << IS_DIGIT)
def test_vocab_lexeme_add_flag_auto_id(en_vocab):
is_len4 = en_vocab.add_flag(lambda string: len(string) == 4)
assert en_vocab['1999'].check_flag(is_len4) == True
assert en_vocab['1999'].check_flag(IS_DIGIT) == True
assert en_vocab['199'].check_flag(is_len4) == False
assert en_vocab['199'].check_flag(IS_DIGIT) == True
assert en_vocab['the'].check_flag(is_len4) == False
assert en_vocab['dogs'].check_flag(is_len4) == True
def test_vocab_lexeme_add_flag_provided_id(en_vocab):
is_len4 = en_vocab.add_flag(lambda string: len(string) == 4, flag_id=IS_DIGIT)
assert en_vocab['1999'].check_flag(is_len4) == True
assert en_vocab['199'].check_flag(is_len4) == False
assert en_vocab['199'].check_flag(IS_DIGIT) == False
assert en_vocab['the'].check_flag(is_len4) == False
assert en_vocab['dogs'].check_flag(is_len4) == True

View File

@@ -1,42 +0,0 @@
from __future__ import unicode_literals
import pytest
from spacy.attrs import *
def test_is_alpha(en_vocab):
the = en_vocab['the']
assert the.flags & (1 << IS_ALPHA)
year = en_vocab['1999']
assert not year.flags & (1 << IS_ALPHA)
mixed = en_vocab['hello1']
assert not mixed.flags & (1 << IS_ALPHA)
def test_is_digit(en_vocab):
the = en_vocab['the']
assert not the.flags & (1 << IS_DIGIT)
year = en_vocab['1999']
assert year.flags & (1 << IS_DIGIT)
mixed = en_vocab['hello1']
assert not mixed.flags & (1 << IS_DIGIT)
def test_add_flag_auto_id(en_vocab):
is_len4 = en_vocab.add_flag(lambda string: len(string) == 4)
assert en_vocab['1999'].check_flag(is_len4) == True
assert en_vocab['1999'].check_flag(IS_DIGIT) == True
assert en_vocab['199'].check_flag(is_len4) == False
assert en_vocab['199'].check_flag(IS_DIGIT) == True
assert en_vocab['the'].check_flag(is_len4) == False
assert en_vocab['dogs'].check_flag(is_len4) == True
def test_add_flag_provided_id(en_vocab):
is_len4 = en_vocab.add_flag(lambda string: len(string) == 4, flag_id=IS_DIGIT)
assert en_vocab['1999'].check_flag(is_len4) == True
assert en_vocab['199'].check_flag(is_len4) == False
assert en_vocab['199'].check_flag(IS_DIGIT) == False
assert en_vocab['the'].check_flag(is_len4) == False
assert en_vocab['dogs'].check_flag(is_len4) == True

View File

@@ -33,7 +33,7 @@ def test_vocab_api_symbols(en_vocab, string, symbol):
@pytest.mark.parametrize('text', "Hello")
def test_contains(en_vocab, text):
def test_vocab_api_contains(en_vocab, text):
_ = en_vocab[text]
assert text in en_vocab
assert "LKsdjvlsakdvlaksdvlkasjdvljasdlkfvm" not in en_vocab