Merge branch 'master' of github.com:GregDubbin/spaCy

greg 2018-01-16 13:31:10 -05:00
commit 441f490c1c
38 changed files with 709 additions and 87 deletions

.github/contributors/fucking-signup.md

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owner, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” next to one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Kit |
| Company name (if applicable) | - |
| Title or role (if applicable) | - |
| Date | 2018/01/08 |
| GitHub username | fucking-signup |
| Website (optional) | - |

.github/contributors/pbnsilva.md

@@ -0,0 +1,106 @@
# spaCy contributor agreement

(The agreement text is identical to the SCA reproduced in full above.)
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Pedro Silva |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-01-11 |
| GitHub username | pbnsilva |
| Website (optional) | |

.github/contributors/savkov.md

@@ -0,0 +1,106 @@
# spaCy contributor agreement

(The agreement text is identical to the SCA reproduced in full above.)
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Aleksandar Savkov |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 11.01.2018 |
| GitHub username | savkov |
| Website (optional) | sasho.io |

@@ -46,9 +46,8 @@ MOD_NAMES = [
COMPILE_OPTIONS = {
'msvc': ['/Ox', '/EHsc'],
'mingw32' : ['-O3', '-Wno-strict-prototypes', '-Wno-unused-function'],
'other' : ['-O3', '-Wno-strict-prototypes', '-Wno-unused-function',
'-march=native']
'mingw32' : ['-O2', '-Wno-strict-prototypes', '-Wno-unused-function'],
'other' : ['-O2', '-Wno-strict-prototypes', '-Wno-unused-function']
}

@@ -31,24 +31,28 @@ def download(model, direct=False):
version = get_version(model_name, compatibility)
dl = download_model('{m}-{v}/{m}-{v}.tar.gz'.format(m=model_name,
v=version))
if dl == 0:
try:
# Get package path here because link uses
# pip.get_installed_distributions() to check if model is a
# package, which fails if model was just installed via
# subprocess
package_path = get_package_path(model_name)
link(model_name, model, force=True, model_path=package_path)
except:
# Dirty, but since spacy.download and the auto-linking is
# mostly a convenience wrapper, it's best to show a success
# message and loading instructions, even if linking fails.
prints(
"Creating a shortcut link for 'en' didn't work (maybe "
"you don't have admin permissions?), but you can still "
"load the model via its full package name:",
"nlp = spacy.load('%s')" % model_name,
title="Download successful")
if dl != 0:
# if download subprocess doesn't return 0, exit with the respective
# exit code before doing anything else
sys.exit(dl)
try:
# Get package path here because link uses
# pip.get_installed_distributions() to check if model is a
# package, which fails if model was just installed via
# subprocess
package_path = get_package_path(model_name)
link(None, model_name, model, force=True,
model_path=package_path)
except:
# Dirty, but since spacy.download and the auto-linking is
# mostly a convenience wrapper, it's best to show a success
# message and loading instructions, even if linking fails.
prints(
"Creating a shortcut link for 'en' didn't work (maybe "
"you don't have admin permissions?), but you can still "
"load the model via its full package name:",
"nlp = spacy.load('%s')" % model_name,
title="Download successful but linking failed")
def get_json(url, desc):
@@ -84,5 +88,5 @@ def get_version(model, comp):
def download_model(filename):
download_url = about.__download_url__ + '/' + filename
return subprocess.call(
[sys.executable, '-m', 'pip', 'install', '--no-cache-dir',
[sys.executable, '-m', 'pip', 'install', '--no-cache-dir', '--no-deps',
download_url], env=os.environ.copy())
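
With --no-deps added, pip installs just the model archive and skips dependency resolution. A minimal sketch of what this call amounts to (the model URL is an assumed example following the spacy-models release naming, not taken from this commit):

# Hedged sketch of download_model(): shell out to pip for a model archive.
import os
import subprocess
import sys

download_url = ('https://github.com/explosion/spacy-models/releases/download/'
                'en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz')
exit_code = subprocess.call(
    [sys.executable, '-m', 'pip', 'install', '--no-cache-dir', '--no-deps',
     download_url], env=os.environ.copy())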

@@ -34,11 +34,18 @@ def link(origin, link_name, force=False, model_path=None):
"located here:", path2str(spacy_loc), exits=1,
title="Can't find the spaCy data path to create model symlink")
link_path = util.get_data_path() / link_name
if link_path.exists() and not force:
if link_path.is_symlink() and not force:
prints("To overwrite an existing link, use the --force flag.",
title="Link %s already exists" % link_name, exits=1)
elif link_path.exists():
elif link_path.is_symlink(): # does a symlink exist?
# NB: It's important to check for is_symlink here and not for exists,
# because invalid/outdated symlinks would return False otherwise.
link_path.unlink()
elif link_path.exists(): # does it exist otherwise?
# NB: Check this last because valid symlinks also "exist".
prints("This can happen if your data directory contains a directory "
"or file of the same name.", link_path,
title="Can't overwrite symlink %s" % link_name, exits=1)
try:
symlink_to(link_path, model_path)
except:
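
The is_symlink()/exists() distinction this hunk relies on: Path.exists() follows the link to its target, so a dangling symlink reports False, while is_symlink() checks the link itself. A minimal pathlib sketch (POSIX; creating symlinks on Windows may need extra privileges):

from pathlib import Path

link = Path('model-link')
link.symlink_to('no-such-target')  # dangling symlink: target doesn't exist
assert link.is_symlink()           # True: the link itself is there
assert not link.exists()           # False: exists() follows the missing target
link.unlink()                      # clean up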

@@ -4,6 +4,7 @@ from __future__ import unicode_literals, print_function
import requests
import pkg_resources
from pathlib import Path
import sys
from ..compat import path2str, locale_escape
from ..util import prints, get_data_path, read_json
@@ -62,6 +63,9 @@ def validate():
"them from the data directory. Data path: {}"
.format(path2str(get_data_path())))
if incompat_models or incompat_links:
sys.exit(1)
def get_model_links(compat):
links = {}

@@ -41,9 +41,9 @@ def like_num(text):
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
if text.lower() in _num_words:
return True
if text in _ordinal_words:
if text.lower() in _ordinal_words:
return True
return False
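
The same one-line change recurs across the language modules below: lowercase the token before the membership test so capitalized and all-caps numerals are recognized. A self-contained sketch with a stand-in word list:

# Stand-in sketch of the fixed like_num(); _num_words here is a tiny
# example list, not the full set from the language data.
_num_words = ['ten', 'eleven', 'twelve']

def like_num(text):
    if text.isdigit():
        return True
    if text.lower() in _num_words:  # the fix: case-insensitive lookup
        return True
    return False

assert like_num('Eleven') and like_num('ELEVEN') and not like_num('cat')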

@@ -20,7 +20,7 @@ def like_num(text):
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
if text.lower() in _num_words:
return True
return False

@@ -31,7 +31,9 @@ def like_num(text):
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
if text.lower() in _num_words:
return True
if text.lower() in _ordinal_words:
return True
return False

@@ -27,7 +27,7 @@ def like_num(text):
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
if text.lower() in _num_words:
return True
if text.count('-') == 1:
_, num = text.split('-')

@@ -30,7 +30,9 @@ def like_num(text):
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
if text.lower() in _num_words:
return True
if text.lower() in _ordinal_words:
return True
return False

@@ -11,13 +11,13 @@ _num_words = ['zero', 'um', 'dois', 'três', 'quatro', 'cinco', 'seis', 'sete',
'oitenta', 'noventa', 'cem', 'mil', 'milhão', 'bilião', 'trilião',
'quadrilião']
_ord_words = ['primeiro', 'segundo', 'terceiro', 'quarto', 'quinto', 'sexto',
'sétimo', 'oitavo', 'nono', 'décimo', 'vigésimo', 'trigésimo',
'quadragésimo', 'quinquagésimo', 'sexagésimo', 'septuagésimo',
'octogésimo', 'nonagésimo', 'centésimo', 'ducentésimo',
'trecentésimo', 'quadringentésimo', 'quingentésimo', 'sexcentésimo',
'septingentésimo', 'octingentésimo', 'nongentésimo', 'milésimo',
'milionésimo', 'bilionésimo']
_ordinal_words = ['primeiro', 'segundo', 'terceiro', 'quarto', 'quinto', 'sexto',
'sétimo', 'oitavo', 'nono', 'décimo', 'vigésimo', 'trigésimo',
'quadragésimo', 'quinquagésimo', 'sexagésimo', 'septuagésimo',
'octogésimo', 'nonagésimo', 'centésimo', 'ducentésimo',
'trecentésimo', 'quadringentésimo', 'quingentésimo', 'sexcentésimo',
'septingentésimo', 'octingentésimo', 'nongentésimo', 'milésimo',
'milionésimo', 'bilionésimo']
def like_num(text):
@@ -28,7 +28,9 @@ def like_num(text):
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
if text.lower() in _num_words:
return True
if text.lower() in _ordinal_words:
return True
return False

@@ -25,7 +25,7 @@ def like_num(text):
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
if text.lower() in _num_words:
return True
return False

@@ -40,6 +40,11 @@ cdef class Lexeme:
assert self.c.orth == orth
def __richcmp__(self, other, int op):
if other is None:
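# Cython's __richcmp__ op codes: 0 <, 1 <=, 2 ==, 3 !=, 4 >, 5 >=.
# Comparing against None answers False for <, <=, == and True for the
# rest, instead of crashing on the missing operand.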
if op == 0 or op == 1 or op == 2:
return False
else:
return True
if isinstance(other, Lexeme):
a = self.orth
b = other.orth
@@ -107,6 +112,14 @@ cdef class Lexeme:
`Span`, `Token` and `Lexeme` objects.
RETURNS (float): A scalar similarity score. Higher is more similar.
"""
# Return 1.0 similarity for matches
if hasattr(other, 'orth'):
if self.c.orth == other.orth:
return 1.0
elif hasattr(other, '__len__') and len(other) == 1 \
and hasattr(other[0], 'orth'):
if self.c.orth == other[0].orth:
return 1.0
if self.vector_norm == 0 or other.vector_norm == 0:
return 0.0
return (numpy.dot(self.vector, other.vector) /

@@ -217,6 +217,16 @@ def test_doc_api_has_vector():
doc = Doc(vocab, words=['kitten'])
assert doc.has_vector
def test_doc_api_similarity_match():
doc = Doc(Vocab(), words=['a'])
assert doc.similarity(doc[0]) == 1.0
assert doc.similarity(doc.vocab['a']) == 1.0
doc2 = Doc(doc.vocab, words=['a', 'b', 'c'])
assert doc.similarity(doc2[:1]) == 1.0
assert doc.similarity(doc2) == 0.0
def test_lowest_common_ancestor(en_tokenizer):
tokens = en_tokenizer('the lazy dog slept')
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=[2, 1, 1, 0])
@@ -225,6 +235,7 @@ def test_lowest_common_ancestor(en_tokenizer):
assert(lca[0, 1] == 2)
assert(lca[1, 2] == 2)
def test_parse_tree(en_tokenizer):
"""Tests doc.print_tree() method."""
text = 'I like New York in Autumn.'

@@ -3,6 +3,8 @@ from __future__ import unicode_literals
from ..util import get_doc
from ...attrs import ORTH, LENGTH
from ...tokens import Doc
from ...vocab import Vocab
import pytest
@@ -66,6 +68,15 @@ def test_spans_lca_matrix(en_tokenizer):
assert(lca[1, 1] == 1)
def test_span_similarity_match():
doc = Doc(Vocab(), words=['a', 'b', 'a', 'b'])
span1 = doc[:2]
span2 = doc[2:]
assert span1.similarity(span2) == 1.0
assert span1.similarity(doc) == 0.0
assert span1[:1].similarity(doc.vocab['a']) == 1.0
def test_spans_default_sentiment(en_tokenizer):
"""Test span.sentiment property's default averaging behaviour"""
text = "good stuff bad stuff"

@@ -160,8 +160,5 @@ def test_is_sent_start(en_tokenizer):
assert doc[5].is_sent_start is None
doc[5].is_sent_start = True
assert doc[5].is_sent_start is True
# Backwards compatibility
with pytest.warns(DeprecationWarning):
assert doc[0].sent_start is False
doc.is_parsed = True
assert len(list(doc.sents)) == 2

@@ -0,0 +1,31 @@
'''Test Span.as_doc() doesn't segfault'''
from __future__ import unicode_literals
from ...tokens import Doc
from ...vocab import Vocab
from ... import load as load_spacy
def test_issue1537():
string = 'The sky is blue . The man is pink . The dog is purple .'
doc = Doc(Vocab(), words=string.split())
doc[0].sent_start = True
for word in doc[1:]:
if word.nbor(-1).text == '.':
word.sent_start = True
else:
word.sent_start = False
sents = list(doc.sents)
sent0 = sents[0].as_doc()
sent1 = sents[1].as_doc()
assert isinstance(sent0, Doc)
assert isinstance(sent1, Doc)
# Currently segfaulting, due to l_edge and r_edge misalignment
#def test_issue1537_model():
# nlp = load_spacy('en')
# doc = nlp(u'The sky is blue. The man is pink. The dog is purple.')
# sents = [s.as_doc() for s in doc.sents]
# print(list(sents[0].noun_chunks))
# print(list(sents[1].noun_chunks))

@@ -0,0 +1,10 @@
'''Ensure vectors.resize() doesn't try to modify dictionary during iteration.'''
from __future__ import unicode_literals
from ...vectors import Vectors
def test_issue1539():
v = Vectors(shape=(10, 10), keys=[5,3,98,100])
v.resize((100,100))

@@ -0,0 +1,18 @@
'''Test comparison against None doesn't cause segfault'''
from __future__ import unicode_literals
from ...tokens import Doc
from ...vocab import Vocab
def test_issue1757():
doc = Doc(Vocab(), words=['a', 'b', 'c'])
assert not doc[0] < None
assert not doc[0] == None
assert doc[0] >= None
span = doc[:2]
assert not span < None
assert not span == None
assert span >= None
lex = doc.vocab['a']
assert not lex == None
assert not lex < None

@@ -0,0 +1,61 @@
# coding: utf-8
from __future__ import unicode_literals
from ...util import get_lang_class
from ...attrs import LIKE_NUM
import pytest
@pytest.mark.parametrize('word', ['eleven'])
def test_en_lex_attrs(word):
lang = get_lang_class('en')
like_num = lang.Defaults.lex_attr_getters[LIKE_NUM]
assert like_num(word) == like_num(word.upper())
@pytest.mark.slow
@pytest.mark.parametrize('word', ['elleve', 'første'])
def test_da_lex_attrs(word):
lang = get_lang_class('da')
like_num = lang.Defaults.lex_attr_getters[LIKE_NUM]
assert like_num(word) == like_num(word.upper())
@pytest.mark.slow
@pytest.mark.parametrize('word', ['onze', 'onzième'])
def test_fr_lex_attrs(word):
lang = get_lang_class('fr')
like_num = lang.Defaults.lex_attr_getters[LIKE_NUM]
assert like_num(word) == like_num(word.upper())
@pytest.mark.slow
@pytest.mark.parametrize('word', ['sebelas'])
def test_id_lex_attrs(word):
lang = get_lang_class('id')
like_num = lang.Defaults.lex_attr_getters[LIKE_NUM]
assert like_num(word) == like_num(word.upper())
@pytest.mark.slow
@pytest.mark.parametrize('word', ['elf', 'elfde'])
def test_nl_lex_attrs(word):
lang = get_lang_class('nl')
like_num = lang.Defaults.lex_attr_getters[LIKE_NUM]
assert like_num(word) == like_num(word.upper())
@pytest.mark.slow
@pytest.mark.parametrize('word', ['onze', 'quadragésimo'])
def test_pt_lex_attrs(word):
lang = get_lang_class('pt')
like_num = lang.Defaults.lex_attr_getters[LIKE_NUM]
assert like_num(word) == like_num(word.upper())
@pytest.mark.slow
@pytest.mark.parametrize('word', ['одиннадцать'])
def test_ru_lex_attrs(word):
lang = get_lang_class('ru')
like_num = lang.Defaults.lex_attr_getters[LIKE_NUM]
assert like_num(word) == like_num(word.upper())

@@ -0,0 +1,14 @@
'''Test vocab.set_vector also adds the word to the vocab.'''
from __future__ import unicode_literals
from ...vocab import Vocab
import numpy
def test_issue1807():
vocab = Vocab()
arr = numpy.ones((50,), dtype='f')
assert 'hello' not in vocab
vocab.set_vector('hello', arr)
assert 'hello' in vocab

@@ -295,6 +295,17 @@ cdef class Doc:
"""
if 'similarity' in self.user_hooks:
return self.user_hooks['similarity'](self, other)
if isinstance(other, (Lexeme, Token)) and self.length == 1:
if self.c[0].lex.orth == other.orth:
return 1.0
elif isinstance(other, (Span, Doc)):
if len(self) == len(other):
for i in range(self.length):
if self[i].orth != other[i].orth:
break
else:
return 1.0
if self.vector_norm == 0 or other.vector_norm == 0:
return 0.0
return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)
@@ -508,13 +519,18 @@ cdef class Doc:
yield from self.user_hooks['sents'](self)
return
if not self.is_parsed:
raise ValueError(
"Sentence boundary detection requires the dependency "
"parse, which requires a statistical model to be "
"installed and loaded. For more info, see the "
"documentation: \n%s\n" % about.__docs_models__)
cdef int i
if not self.is_parsed:
for i in range(1, self.length):
if self.c[i].sent_start != 0:
break
else:
raise ValueError(
"Sentence boundaries unset. You can add the 'sentencizer' "
"component to the pipeline with: "
"nlp.add_pipe(nlp.create_pipe('sentencizer')) "
"Alternatively, add the dependency parser, or set "
"sentence boundaries by setting doc[i].sent_start")
start = 0
for i in range(1, self.length):
if self.c[i].sent_start == 1:
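
The new error message above names the fix directly; a minimal sketch of that suggested setup in the spaCy 2.x API this diff targets (assuming no statistical model is installed):

import spacy

nlp = spacy.blank('en')                       # tokenizer only, no parser
nlp.add_pipe(nlp.create_pipe('sentencizer'))  # rule-based sentence boundaries
doc = nlp(u'This is a sentence. This is another one.')
assert len(list(doc.sents)) == 2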

@@ -64,6 +64,11 @@ cdef class Span:
self._vector_norm = vector_norm
def __richcmp__(self, Span other, int op):
if other is None:
if op == 0 or op == 1 or op == 2:
return False
else:
return True
# Eq
if op == 0:
return self.start_char < other.start_char
@@ -179,6 +184,15 @@ cdef class Span:
"""
if 'similarity' in self.doc.user_span_hooks:
return self.doc.user_span_hooks['similarity'](self, other)
if len(self) == 1 and hasattr(other, 'orth'):
if self[0].orth == other.orth:
return 1.0
elif hasattr(other, '__len__') and len(self) == len(other):
for i in range(len(self)):
if self[i].orth != getattr(other[i], 'orth', None):
break
else:
return 1.0
if self.vector_norm == 0.0 or other.vector_norm == 0.0:
return 0.0
return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)
@@ -261,6 +275,11 @@ cdef class Span:
self.start = start
self.end = end + 1
property vocab:
"""RETURNS (Vocab): The Span's Doc's vocab."""
def __get__(self):
return self.doc.vocab
property sent:
"""RETURNS (Span): The sentence span that the span is a part of."""
def __get__(self):

@@ -78,10 +78,15 @@ cdef class Token:
def __richcmp__(self, Token other, int op):
# http://cython.readthedocs.io/en/latest/src/userguide/special_methods.html
if other is None:
if op in (0, 1, 2):
return False
else:
return True
cdef Doc my_doc = self.doc
cdef Doc other_doc = other.doc
my = self.idx
their = other.idx if other is not None else None
their = other.idx
if op == 0:
return my < their
elif op == 2:
@@ -144,6 +149,12 @@ cdef class Token:
"""
if 'similarity' in self.doc.user_token_hooks:
return self.doc.user_token_hooks['similarity'](self)
if hasattr(other, '__len__') and len(other) == 1:
if self.c.lex.orth == getattr(other[0], 'orth', None):
return 1.0
elif hasattr(other, 'orth'):
if self.c.lex.orth == other.orth:
return 1.0
if self.vector_norm == 0 or other.vector_norm == 0:
return 0.0
return (numpy.dot(self.vector, other.vector) /
@@ -341,19 +352,20 @@ cdef class Token:
property sent_start:
def __get__(self):
util.deprecated(
"Token.sent_start is now deprecated. Use Token.is_sent_start "
"instead, which returns a boolean value or None if the answer "
"is unknown instead of a misleading 0 for False and 1 for "
"True. It also fixes a quirk in the old logic that would "
"always set the property to 0 for the first word of the "
"document.")
# Raising a deprecation warning causes errors for autocomplete
#util.deprecated(
# "Token.sent_start is now deprecated. Use Token.is_sent_start "
# "instead, which returns a boolean value or None if the answer "
# "is unknown instead of a misleading 0 for False and 1 for "
# "True. It also fixes a quirk in the old logic that would "
# "always set the property to 0 for the first word of the "
# "document.")
# Handle broken backwards compatibility case: doc[0].sent_start
# was False.
if self.i == 0:
return False
else:
return self.sent_start
return self.c.sent_start
def __set__(self, value):
self.is_sent_start = value

@@ -151,7 +151,7 @@ cdef class Vectors:
filled = {row for row in self.key2row.values()}
self._unset = {row for row in range(shape[0]) if row not in filled}
removed_items = []
for key, row in self.key2row.items():
for key, row in list(self.key2row.items()):
if row >= shape[0]:
self.key2row.pop(key)
removed_items.append((key, row))
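
The list(...) wrapper is the entire fix: in Python 3, popping keys from a dict while iterating over it raises a RuntimeError. A minimal repro of the bug and the fix:

d = {5: 0, 3: 1, 98: 2, 100: 3}
try:
    for key in d:
        d.pop(key)          # mutating the dict mid-iteration
except RuntimeError:
    pass                    # "dictionary changed size during iteration"

for key in list(d):         # iterate over a snapshot instead, as above
    d.pop(key)
assert not d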

@@ -335,6 +335,7 @@ cdef class Vocab:
else:
width = self.vectors.shape[1]
self.vectors.resize((new_rows, width))
lex = self[orth] # Adds word to vocab
self.vectors.add(orth, vector=vector)

@@ -11,5 +11,6 @@ form.o-grid#mc-embedded-subscribe-form(action="//#{MAILCHIMP.user}.list-manage.c
input(type="text" name="b_#{MAILCHIMP.id}_#{MAILCHIMP.list}" tabindex="-1" value="")
.o-grid-col.o-grid.o-grid--nowrap.o-field.u-padding-small
input#mce-EMAIL.o-field__input.u-text(type="email" name="EMAIL" placeholder="Your email" aria-label="Your email")
div
input#mce-EMAIL.o-field__input.u-text(type="email" name="EMAIL" placeholder="Your email" aria-label="Your email")
button#mc-embedded-subscribe.o-field__button.u-text-label.u-color-theme.u-nowrap(type="submit" name="subscribe") Sign up

@@ -46,7 +46,7 @@ p
+table(["Tag", "POS", "Morphology", "Description"])
+pos-row("-LRB-", "PUNCT", "PunctType=brck PunctSide=ini", "left round bracket")
+pos-row("-PRB-", "PUNCT", "PunctType=brck PunctSide=fin", "right round bracket")
+pos-row("-RRB-", "PUNCT", "PunctType=brck PunctSide=fin", "right round bracket")
+pos-row(",", "PUNCT", "PunctType=comm", "punctuation mark, comma")
+pos-row(":", "PUNCT", "", "punctuation mark, colon or ellipsis")
+pos-row(".", "PUNCT", "PunctType=peri", "punctuation mark, sentence closer")
@@ -86,7 +86,7 @@ p
+pos-row("RBR", "ADV", "Degree=comp", "adverb, comparative")
+pos-row("RBS", "ADV", "Degree=sup", "adverb, superlative")
+pos-row("RP", "PART", "", "adverb, particle")
+pos-row("SP", "SPACE", "", "space")
+pos-row("_SP", "SPACE", "", "space")
+pos-row("SYM", "SYM", "", "symbol")
+pos-row("TO", "PART", "PartType=inf VerbForm=inf", "infinitival to")
+pos-row("UH", "INTJ", "", "interjection")

@@ -17,6 +17,17 @@ p
| Direct downloads don't perform any compatibility checks and require the
| model name to be specified with its version (e.g., #[code en_core_web_sm-1.2.0]).
+aside("Downloading best practices")
| The #[code download] command is mostly intended as a convenient,
| interactive wrapper: it performs compatibility checks and prints
| detailed messages in case things go wrong. It's #[strong not recommended]
| to use this command as part of an automated process. If you know which
| model your project needs, you should consider a
| #[+a("/usage/models#download-pip") direct download via pip], or
| uploading the model to a local PyPI installation and fetching it straight
| from there. This will also allow you to add it as a versioned package
| dependency to your project.
+code(false, "bash", "$").
python -m spacy download [model] [--direct]
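
p
| As a sketch of the direct download recommended in the aside above (the
| model URL is an assumed example following the spacy-models release
| naming, not taken from this page):
+code(false, "bash", "$").
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz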
@@ -43,17 +54,6 @@ p
| The installed model package in your #[code site-packages]
| directory and a shortcut link as a symlink in #[code spacy/data].
+aside("Downloading best practices")
| The #[code download] command is mostly intended as a convenient,
| interactive wrapper: it performs compatibility checks and prints
| detailed messages in case things go wrong. It's #[strong not recommended]
| to use this command as part of an automated process. If you know which
| model your project needs, you should consider a
| #[+a("/usage/models#download-pip") direct download via pip], or
| uploading the model to a local PyPI installation and fetching it straight
| from there. This will also allow you to add it as a versioned package
| dependency to your project.
+h(3, "link") Link
p
@@ -144,8 +144,14 @@ p
| #[code pip install -U spacy] to ensure that all installed models
| can be used with the new version. The command is also useful to detect
| out-of-sync model links resulting from links created in different virtual
| environments. Prints a list of models, the installed versions, the latest
| compatible version (if out of date) and the commands for updating.
| environments. It will print a list of models, the installed versions, the
| latest compatible version (if out of date) and the commands for updating.
+aside("Automated validation")
| You can also use the #[code validate] command as part of your build
| process or test suite, to ensure all models are up to date before
| proceeding. If incompatible models or shortcut links are found, it will
| return #[code 1].
+code(false, "bash", "$").
python -m spacy validate
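
p
| As a hedged sketch, a build script can key off that exit code directly:
+code(false, "bash", "$").
python -m spacy validate || exit 1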
@@ -335,8 +341,12 @@ p
| for your custom #[code train] command while still being able to easily
| tweak the hyperparameters. For example:
+code(false, "bash").
parser_hidden_depth=2 parser_maxout_pieces=1 train-parser
+code(false, "bash", "$").
parser_hidden_depth=2 parser_maxout_pieces=1 spacy train [...]
+code("Usage with alias", "bash", "$").
alias train-parser="spacy train en /output /data /train /dev -n 1000"
parser_maxout_pieces=1 train-parser
+table(["Name", "Description", "Default"])
+row

@@ -28,7 +28,7 @@ p Create the rule-based #[code PhraseMatcher].
+row
+cell #[code max_length]
+cell int
+cell Mamimum length of a phrase pattern to add.
+cell Maximum length of a phrase pattern to add.
+row("foot")
+cell returns

@@ -394,7 +394,7 @@ p
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
if text.lower() in _num_words:
return True
return False

@@ -148,7 +148,7 @@ p
+cell Negate the pattern, by requiring it to match exactly 0 times.
+row
+cell #[code *]
+cell #[code ?]
+cell Make the pattern optional, by allowing it to match 0 or 1 times.
+row
@@ -156,8 +156,8 @@ p
+cell Require the pattern to match 1 or more times.
+row
+cell #[code ?]
+cell Allow the pattern to zero or more times.
+cell #[code *]
+cell Allow the pattern to match zero or more times.
p
| The #[code +] and #[code *] operators are usually interpreted
@@ -305,6 +305,54 @@ p
| A list of #[code (match_id, start, end)] tuples, describing the
| matches. A match tuple describes a span #[code doc[start:end]].
+h(3, "regex") Using regular expressions
p
| In some cases, only matching tokens and token attributes isn't enough:
| for example, you might want to match different spellings of a word,
| without having to add a new pattern for each spelling. A simple solution
| is to match a regular expression on the #[code Doc]'s #[code text] and
| use the #[+api("doc#char_span") #[code Doc.char_span]] method to
| create a #[code Span] from the character indices of the match:
+code.
import spacy
import re
nlp = spacy.load('en')
doc = nlp(u'The spelling is "definitely", not "definately" or "deffinitely".')
DEFINITELY_PATTERN = re.compile(r'deff?in[ia]tely')
for match in re.finditer(DEFINITELY_PATTERN, doc.text):
start, end = match.span() # get matched indices
span = doc.char_span(start, end) # create Span from indices
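# NB (not in the original snippet): char_span returns None when the
# character indices don't align with token boundaries, so check the
# result before using it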
p
| You can also use the regular expression with spaCy's #[code Matcher] by
| converting it to a token flag. To ensure efficiency, the
| #[code Matcher] can only access the C-level data. This means that it can
| either use built-in token attributes or #[strong binary flags].
| #[+api("vocab#add_flag") #[code Vocab.add_flag]] returns a flag ID which
| you can use as a key of a token match pattern. Tokens that match the
| regular expression will return #[code True] for the #[code IS_DEFINITELY]
| flag.
+code.
IS_DEFINITELY = nlp.vocab.add_flag(re.compile(r'deff?in[ia]tely').match)
matcher = Matcher(nlp.vocab)
matcher.add('DEFINITELY', None, [{IS_DEFINITELY: True}])
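# A hedged usage sketch (assumes nlp from above); the matcher returns a
# list of (match_id, start, end) tuples
doc = nlp(u'I deffinately think so.')
matches = matcher(doc)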
p
| Providing the regular expressions as binary flags also lets you use them
| in combination with other token patterns: for example, to match the
| word "definitely" in various spellings, followed by a case-insensitive
| "not" and an adjective:
+code.
[{IS_DEFINITELY: True}, {'LOWER': 'not'}, {'POS': 'ADJ'}]
+h(3, "example1") Example: Using linguistic annotations
p
@@ -354,7 +402,7 @@ p
# append mock entity for match in displaCy style to matched_sents
# get the match span by offsetting the start and end of the span with the
# start and end of the sentence in the doc
match_ents = [{'start': span.start_char - sent.start_char,
match_ents = [{'start': span.start_char - sent.start_char,
'end': span.end_char - sent.start_char,
'label': 'MATCH'}]
matched_sents.append({'text': sent.text, 'ents': match_ents })

@@ -48,9 +48,9 @@ p
| those IDs back to strings.
+code.
moby_dick = open('moby_dick.txt', 'r') # open a large document
doc = nlp(moby_dick) # process it
doc.to_disk('/moby_dick.bin') # save the processed Doc
text = open('customer_feedback_627.txt', 'r').read() # open a document
doc = nlp(text) # process it
doc.to_disk('/customer_feedback_627.bin') # save the processed Doc
p
| If you need it again later, you can load it back into an empty #[code Doc]
@@ -61,4 +61,4 @@ p
from spacy.tokens import Doc # to create empty Doc
from spacy.vocab import Vocab # to create empty Vocab
doc = Doc(Vocab()).from_disk('/moby_dick.bin') # load processed Doc
doc = Doc(Vocab()).from_disk('/customer_feedback_627.bin') # load processed Doc

@@ -8,7 +8,7 @@ p
| Collecting training data may sound incredibly painful, and it can be,
| if you're planning a large-scale annotation project. However, if your main
| goal is to update an existing model's predictions (for example, spaCy's
| named entity recognition), the hard is part usually not creating the
| named entity recognition), the hard part is usually not creating the
| actual annotations. It's finding representative examples and
| #[strong extracting potential candidates]. The good news is, if you've
| been noticing bad performance on your data, you likely

@@ -106,6 +106,10 @@ p
| #[+api("tagger#from_disk") #[code Tagger.from_disk]]
| #[+api("tagger#from_bytes") #[code Tagger.from_bytes]]
+row
+cell #[code Tagger.tag_names]
+cell #[code Tagger.labels]
+row
+cell #[code DependencyParser.load]
+cell

@@ -37,6 +37,9 @@ include ../_includes/_mixins
+card("spacy-api-docker", "https://github.com/jgontrum/spacy-api-docker", "Johannes Gontrum", "github")
| spaCy accessed by a REST API, wrapped in a Docker container.
+card("languagecrunch", "https://github.com/artpar/languagecrunch", "Parth Mudgal", "github")
| NLP server for spaCy, WordNet and NeuralCoref as a Docker image.
+card("spacy-nlp-zeromq", "https://github.com/pasupulaphani/spacy-nlp-docker", "Phaninder Pasupula", "github")
| Docker image exposing spaCy with ZeroMQ bindings.
@@ -69,6 +72,10 @@ include ../_includes/_mixins
| Add language detection to your spaCy pipeline using Compact
| Language Detector 2 via PYCLD2.
+card("spacy-lookup", "https://github.com/mpuig/spacy-lookup", "Marc Puig", "github")
| A powerful entity matcher for very large dictionaries, using the
| FlashText module.
.u-text-right
+button("https://github.com/topics/spacy-extension?o=desc&s=stars", false, "primary", "small") See more extensions on GitHub