mirror of https://github.com/explosion/spaCy.git
synced 2024-12-26 09:56:28 +03:00

Merge branch 'master' of github.com:GregDubbin/spaCy

Commit 441f490c1c
.github/contributors/fucking-signup.md (vendored, new file, 106 lines)

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Kit                  |
| Company name (if applicable)   | -                    |
| Title or role (if applicable)  | -                    |
| Date                           | 2018/01/08           |
| GitHub username                | fucking-signup       |
| Website (optional)             | -                    |
.github/contributors/pbnsilva.md (vendored, new file, 106 lines)

# spaCy contributor agreement

[The agreement text is identical to `.github/contributors/fucking-signup.md` above.]

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Pedro Silva          |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2018-01-11           |
| GitHub username                | pbnsilva             |
| Website (optional)             |                      |
.github/contributors/savkov.md (vendored, new file, 106 lines)

# spaCy contributor agreement

[The agreement text is identical to `.github/contributors/fucking-signup.md` above.]

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Aleksandar Savkov    |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 11.01.2018           |
| GitHub username                | savkov               |
| Website (optional)             | sasho.io             |
setup.py (5 changed lines)

@@ -46,9 +46,8 @@ MOD_NAMES = [
 COMPILE_OPTIONS = {
     'msvc': ['/Ox', '/EHsc'],
-    'mingw32' : ['-O3', '-Wno-strict-prototypes', '-Wno-unused-function'],
-    'other' : ['-O3', '-Wno-strict-prototypes', '-Wno-unused-function',
-               '-march=native']
+    'mingw32' : ['-O2', '-Wno-strict-prototypes', '-Wno-unused-function'],
+    'other' : ['-O2', '-Wno-strict-prototypes', '-Wno-unused-function']
 }
@@ -31,14 +31,18 @@ def download(model, direct=False):
     version = get_version(model_name, compatibility)
     dl = download_model('{m}-{v}/{m}-{v}.tar.gz'.format(m=model_name,
                                                         v=version))
-    if dl == 0:
-        try:
-            # Get package path here because link uses
-            # pip.get_installed_distributions() to check if model is a
-            # package, which fails if model was just installed via
-            # subprocess
-            package_path = get_package_path(model_name)
-            link(model_name, model, force=True, model_path=package_path)
-        except:
+    if dl != 0:
+        # if download subprocess doesn't return 0, exit with the respective
+        # exit code before doing anything else
+        sys.exit(dl)
+    try:
+        # Get package path here because link uses
+        # pip.get_installed_distributions() to check if model is a
+        # package, which fails if model was just installed via
+        # subprocess
+        package_path = get_package_path(model_name)
+        link(None, model_name, model, force=True,
+             model_path=package_path)
+    except:
         # Dirty, but since spacy.download and the auto-linking is
         # mostly a convenience wrapper, it's best to show a success

@@ -48,7 +52,7 @@ def download(model, direct=False):
             "you don't have admin permissions?), but you can still "
             "load the model via its full package name:",
             "nlp = spacy.load('%s')" % model_name,
-            title="Download successful")
+            title="Download successful but linking failed")


 def get_json(url, desc):

@@ -84,5 +88,5 @@ def get_version(model, comp):
 def download_model(filename):
     download_url = about.__download_url__ + '/' + filename
     return subprocess.call(
-        [sys.executable, '-m', 'pip', 'install', '--no-cache-dir',
+        [sys.executable, '-m', 'pip', 'install', '--no-cache-dir', '--no-deps',
         download_url], env=os.environ.copy())
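The first hunk makes the download command stop when pip fails instead of pressing on to the linking step. A minimal sketch of the same exit-code pattern outside spaCy (the package name is a placeholder):

    import subprocess
    import sys


    def install_or_die(package):
        # pip returns 0 on success; anything else is a failure we should
        # surface instead of continuing with a package that never arrived.
        ret = subprocess.call([sys.executable, '-m', 'pip', 'install', package])
        if ret != 0:
            sys.exit(ret)  # propagate the subprocess's exit code


    # install_or_die('some-package')  # exits the process if pip fails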
@@ -34,11 +34,18 @@ def link(origin, link_name, force=False, model_path=None):
                "located here:", path2str(spacy_loc), exits=1,
                title="Can't find the spaCy data path to create model symlink")
     link_path = util.get_data_path() / link_name
-    if link_path.exists() and not force:
+    if link_path.is_symlink() and not force:
         prints("To overwrite an existing link, use the --force flag.",
                title="Link %s already exists" % link_name, exits=1)
-    elif link_path.exists():
+    elif link_path.is_symlink():  # does a symlink exist?
+        # NB: It's important to check for is_symlink here and not for exists,
+        # because invalid/outdated symlinks would return False otherwise.
+        link_path.unlink()
+    elif link_path.exists():  # does it exist otherwise?
+        # NB: Check this last because valid symlinks also "exist".
         prints("This can happen if your data directory contains a directory "
                "or file of the same name.", link_path,
                title="Can't overwrite symlink %s" % link_name, exits=1)
     try:
         symlink_to(link_path, model_path)
     except:
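The reordering works because of how pathlib treats dangling symlinks: Path.exists() follows the link and returns False once the target is gone, while Path.is_symlink() reports on the link itself. A quick demonstration of the difference (POSIX only; the temporary paths are illustrative):

    import tempfile
    from pathlib import Path

    tmp = Path(tempfile.mkdtemp())
    target = tmp / 'target'
    target.touch()
    link = tmp / 'link'
    link.symlink_to(target)

    target.unlink()           # break the symlink

    print(link.exists())      # False: exists() follows the dangling link
    print(link.is_symlink())  # True: the link itself is still there
    link.unlink()             # which is why is_symlink() must be checked first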
@@ -4,6 +4,7 @@ from __future__ import unicode_literals, print_function
 import requests
 import pkg_resources
 from pathlib import Path
+import sys

 from ..compat import path2str, locale_escape
 from ..util import prints, get_data_path, read_json

@@ -62,6 +63,9 @@ def validate():
            "them from the data directory. Data path: {}"
            .format(path2str(get_data_path())))

+    if incompat_models or incompat_links:
+        sys.exit(1)
+

 def get_model_links(compat):
     links = {}
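Exiting with a non-zero status makes the command usable as a gate in automated builds, which the docs change further down also advertises. A hedged sketch of such a gate (assumes spaCy is installed in the environment):

    import subprocess
    import sys

    # Fail the build early if `spacy validate` finds incompatible models or links.
    ret = subprocess.call([sys.executable, '-m', 'spacy', 'validate'])
    if ret != 0:
        sys.exit('spacy validate reported incompatible models or shortcut links')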
@@ -41,9 +41,9 @@ def like_num(text):
     num, denom = text.split('/')
     if num.isdigit() and denom.isdigit():
         return True
-    if text in _num_words:
+    if text.lower() in _num_words:
         return True
-    if text in _ordinal_words:
+    if text.lower() in _ordinal_words:
         return True
     return False

@@ -20,7 +20,7 @@ def like_num(text):
     num, denom = text.split('/')
     if num.isdigit() and denom.isdigit():
         return True
-    if text in _num_words:
+    if text.lower() in _num_words:
         return True
     return False

@@ -31,7 +31,9 @@ def like_num(text):
     num, denom = text.split('/')
     if num.isdigit() and denom.isdigit():
         return True
-    if text in _num_words:
+    if text.lower() in _num_words:
         return True
+    if text.lower() in _ordinal_words:
+        return True
     return False

@@ -27,7 +27,7 @@ def like_num(text):
     num, denom = text.split('/')
     if num.isdigit() and denom.isdigit():
         return True
-    if text in _num_words:
+    if text.lower() in _num_words:
         return True
     if text.count('-') == 1:
         _, num = text.split('-')

@@ -30,7 +30,9 @@ def like_num(text):
     num, denom = text.split('/')
     if num.isdigit() and denom.isdigit():
         return True
-    if text in _num_words:
+    if text.lower() in _num_words:
         return True
+    if text.lower() in _ordinal_words:
+        return True
     return False

@@ -11,7 +11,7 @@ _num_words = ['zero', 'um', 'dois', 'três', 'quatro', 'cinco', 'seis', 'sete',
               'oitenta', 'noventa', 'cem', 'mil', 'milhão', 'bilião', 'trilião',
               'quadrilião']

-_ord_words = ['primeiro', 'segundo', 'terceiro', 'quarto', 'quinto', 'sexto',
+_ordinal_words = ['primeiro', 'segundo', 'terceiro', 'quarto', 'quinto', 'sexto',
               'sétimo', 'oitavo', 'nono', 'décimo', 'vigésimo', 'trigésimo',
               'quadragésimo', 'quinquagésimo', 'sexagésimo', 'septuagésimo',
               'octogésimo', 'nonagésimo', 'centésimo', 'ducentésimo',

@@ -28,7 +28,9 @@ def like_num(text):
     num, denom = text.split('/')
     if num.isdigit() and denom.isdigit():
         return True
-    if text in _num_words:
+    if text.lower() in _num_words:
         return True
+    if text.lower() in _ordinal_words:
+        return True
     return False

@@ -25,7 +25,7 @@ def like_num(text):
     num, denom = text.split('/')
     if num.isdigit() and denom.isdigit():
         return True
-    if text in _num_words:
+    if text.lower() in _num_words:
         return True
     return False
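All of these hunks fix the same bug: the `_num_words` and `_ordinal_words` lists hold lowercase entries, so capitalized tokens never matched. A small before/after illustration (the word list is abbreviated):

    _num_words = ['zero', 'one', 'two', 'ten', 'eleven']  # lowercase entries only

    text = 'Eleven'
    print(text in _num_words)          # False: the old check was case-sensitive
    print(text.lower() in _num_words)  # True: the fixed check normalizes first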
@@ -40,6 +40,11 @@ cdef class Lexeme:
         assert self.c.orth == orth

     def __richcmp__(self, other, int op):
+        if other is None:
+            if op == 0 or op == 1 or op == 2:
+                return False
+            else:
+                return True
         if isinstance(other, Lexeme):
             a = self.orth
             b = other.orth

@@ -107,6 +112,14 @@ cdef class Lexeme:
         `Span`, `Token` and `Lexeme` objects.
         RETURNS (float): A scalar similarity score. Higher is more similar.
         """
+        # Return 1.0 similarity for matches
+        if hasattr(other, 'orth'):
+            if self.c.orth == other.orth:
+                return 1.0
+        elif hasattr(other, '__len__') and len(other) == 1 \
+            and hasattr(other[0], 'orth'):
+            if self.c.orth == other[0].orth:
+                return 1.0
         if self.vector_norm == 0 or other.vector_norm == 0:
             return 0.0
         return (numpy.dot(self.vector, other.vector) /
@@ -217,6 +217,16 @@ def test_doc_api_has_vector():
     doc = Doc(vocab, words=['kitten'])
     assert doc.has_vector


+def test_doc_api_similarity_match():
+    doc = Doc(Vocab(), words=['a'])
+    assert doc.similarity(doc[0]) == 1.0
+    assert doc.similarity(doc.vocab['a']) == 1.0
+    doc2 = Doc(doc.vocab, words=['a', 'b', 'c'])
+    assert doc.similarity(doc2[:1]) == 1.0
+    assert doc.similarity(doc2) == 0.0
+
+
 def test_lowest_common_ancestor(en_tokenizer):
     tokens = en_tokenizer('the lazy dog slept')
     doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=[2, 1, 1, 0])

@@ -225,6 +235,7 @@ def test_lowest_common_ancestor(en_tokenizer):
     assert(lca[0, 1] == 2)
     assert(lca[1, 2] == 2)

+
 def test_parse_tree(en_tokenizer):
     """Tests doc.print_tree() method."""
     text = 'I like New York in Autumn.'
@@ -3,6 +3,8 @@ from __future__ import unicode_literals

 from ..util import get_doc
 from ...attrs import ORTH, LENGTH
+from ...tokens import Doc
+from ...vocab import Vocab

 import pytest

@@ -66,6 +68,15 @@ def test_spans_lca_matrix(en_tokenizer):
     assert(lca[1, 1] == 1)


+def test_span_similarity_match():
+    doc = Doc(Vocab(), words=['a', 'b', 'a', 'b'])
+    span1 = doc[:2]
+    span2 = doc[2:]
+    assert span1.similarity(span2) == 1.0
+    assert span1.similarity(doc) == 0.0
+    assert span1[:1].similarity(doc.vocab['a']) == 1.0
+
+
 def test_spans_default_sentiment(en_tokenizer):
     """Test span.sentiment property's default averaging behaviour"""
     text = "good stuff bad stuff"
@@ -160,8 +160,5 @@ def test_is_sent_start(en_tokenizer):
     assert doc[5].is_sent_start is None
     doc[5].is_sent_start = True
     assert doc[5].is_sent_start is True
-    # Backwards compatibility
-    with pytest.warns(DeprecationWarning):
-        assert doc[0].sent_start is False
     doc.is_parsed = True
     assert len(list(doc.sents)) == 2
spacy/tests/regression/test_issue1537.py (new file, 31 lines)

'''Test Span.as_doc() doesn't segfault'''
from __future__ import unicode_literals
from ...tokens import Doc
from ...vocab import Vocab
from ... import load as load_spacy


def test_issue1537():
    string = 'The sky is blue . The man is pink . The dog is purple .'
    doc = Doc(Vocab(), words=string.split())
    doc[0].sent_start = True
    for word in doc[1:]:
        if word.nbor(-1).text == '.':
            word.sent_start = True
        else:
            word.sent_start = False

    sents = list(doc.sents)
    sent0 = sents[0].as_doc()
    sent1 = sents[1].as_doc()
    assert isinstance(sent0, Doc)
    assert isinstance(sent1, Doc)


# Currently segfaulting, due to l_edge and r_edge misalignment
#def test_issue1537_model():
#    nlp = load_spacy('en')
#    doc = nlp(u'The sky is blue. The man is pink. The dog is purple.')
#    sents = [s.as_doc() for s in doc.sents]
#    print(list(sents[0].noun_chunks))
#    print(list(sents[1].noun_chunks))
spacy/tests/regression/test_issue1539.py (new file, 10 lines)

'''Ensure vectors.resize() doesn't try to modify dictionary during iteration.'''
from __future__ import unicode_literals

from ...vectors import Vectors


def test_issue1539():
    v = Vectors(shape=(10, 10), keys=[5, 3, 98, 100])
    v.resize((100, 100))
spacy/tests/regression/test_issue1757.py (new file, 18 lines)

'''Test comparison against None doesn't cause segfault'''
from __future__ import unicode_literals

from ...tokens import Doc
from ...vocab import Vocab

def test_issue1757():
    doc = Doc(Vocab(), words=['a', 'b', 'c'])
    assert not doc[0] < None
    assert not doc[0] == None
    assert doc[0] >= None
    span = doc[:2]
    assert not span < None
    assert not span == None
    assert span >= None
    lex = doc.vocab['a']
    assert not lex == None
    assert not lex < None
spacy/tests/regression/test_issue1769.py (new file, 61 lines)

# coding: utf-8
from __future__ import unicode_literals
from ...util import get_lang_class
from ...attrs import LIKE_NUM

import pytest


@pytest.mark.parametrize('word', ['eleven'])
def test_en_lex_attrs(word):
    lang = get_lang_class('en')
    like_num = lang.Defaults.lex_attr_getters[LIKE_NUM]
    assert like_num(word) == like_num(word.upper())


@pytest.mark.slow
@pytest.mark.parametrize('word', ['elleve', 'første'])
def test_da_lex_attrs(word):
    lang = get_lang_class('da')
    like_num = lang.Defaults.lex_attr_getters[LIKE_NUM]
    assert like_num(word) == like_num(word.upper())


@pytest.mark.slow
@pytest.mark.parametrize('word', ['onze', 'onzième'])
def test_fr_lex_attrs(word):
    lang = get_lang_class('fr')
    like_num = lang.Defaults.lex_attr_getters[LIKE_NUM]
    assert like_num(word) == like_num(word.upper())


@pytest.mark.slow
@pytest.mark.parametrize('word', ['sebelas'])
def test_id_lex_attrs(word):
    lang = get_lang_class('id')
    like_num = lang.Defaults.lex_attr_getters[LIKE_NUM]
    assert like_num(word) == like_num(word.upper())


@pytest.mark.slow
@pytest.mark.parametrize('word', ['elf', 'elfde'])
def test_nl_lex_attrs(word):
    lang = get_lang_class('nl')
    like_num = lang.Defaults.lex_attr_getters[LIKE_NUM]
    assert like_num(word) == like_num(word.upper())


@pytest.mark.slow
@pytest.mark.parametrize('word', ['onze', 'quadragésimo'])
def test_pt_lex_attrs(word):
    lang = get_lang_class('pt')
    like_num = lang.Defaults.lex_attr_getters[LIKE_NUM]
    assert like_num(word) == like_num(word.upper())


@pytest.mark.slow
@pytest.mark.parametrize('word', ['одиннадцать'])
def test_ru_lex_attrs(word):
    lang = get_lang_class('ru')
    like_num = lang.Defaults.lex_attr_getters[LIKE_NUM]
    assert like_num(word) == like_num(word.upper())
spacy/tests/regression/test_issue1807.py (new file, 14 lines)

'''Test vocab.set_vector also adds the word to the vocab.'''
from __future__ import unicode_literals
from ...vocab import Vocab

import numpy


def test_issue1807():
    vocab = Vocab()
    arr = numpy.ones((50,), dtype='f')
    assert 'hello' not in vocab
    vocab.set_vector('hello', arr)
    assert 'hello' in vocab
@@ -295,6 +295,17 @@ cdef class Doc:
         """
         if 'similarity' in self.user_hooks:
             return self.user_hooks['similarity'](self, other)
+        if isinstance(other, (Lexeme, Token)) and self.length == 1:
+            if self.c[0].lex.orth == other.orth:
+                return 1.0
+        elif isinstance(other, (Span, Doc)):
+            if len(self) == len(other):
+                for i in range(self.length):
+                    if self[i].orth != other[i].orth:
+                        break
+                else:
+                    return 1.0
+
         if self.vector_norm == 0 or other.vector_norm == 0:
             return 0.0
         return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)

@@ -508,13 +519,18 @@ cdef class Doc:
             yield from self.user_hooks['sents'](self)
             return

-        if not self.is_parsed:
-            raise ValueError(
-                "Sentence boundary detection requires the dependency "
-                "parse, which requires a statistical model to be "
-                "installed and loaded. For more info, see the "
-                "documentation: \n%s\n" % about.__docs_models__)
         cdef int i
+        if not self.is_parsed:
+            for i in range(1, self.length):
+                if self.c[i].sent_start != 0:
+                    break
+            else:
+                raise ValueError(
+                    "Sentence boundaries unset. You can add the 'sentencizer' "
+                    "component to the pipeline with: "
+                    "nlp.add_pipe(nlp.create_pipe('sentencizer')) "
+                    "Alternatively, add the dependency parser, or set "
+                    "sentence boundaries by setting doc[i].sent_start")
         start = 0
         for i in range(1, self.length):
             if self.c[i].sent_start == 1:
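The token-by-token comparison in the first hunk uses Python's for/else: the else branch runs only when the loop finishes without hitting break, i.e. when every position matched. The idiom in isolation:

    def same_orths(seq_a, seq_b):
        # Bail out on the first mismatch; report a match only if the loop
        # ran to completion without breaking.
        if len(seq_a) != len(seq_b):
            return False
        for a, b in zip(seq_a, seq_b):
            if a != b:
                break
        else:
            return True
        return False

    print(same_orths(['a', 'b'], ['a', 'b']))  # True
    print(same_orths(['a', 'b'], ['a', 'c']))  # False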
@@ -64,6 +64,11 @@ cdef class Span:
         self._vector_norm = vector_norm

     def __richcmp__(self, Span other, int op):
+        if other is None:
+            if op == 0 or op == 1 or op == 2:
+                return False
+            else:
+                return True
         # Eq
         if op == 0:
             return self.start_char < other.start_char

@@ -179,6 +184,15 @@ cdef class Span:
         """
         if 'similarity' in self.doc.user_span_hooks:
             self.doc.user_span_hooks['similarity'](self, other)
+        if len(self) == 1 and hasattr(other, 'orth'):
+            if self[0].orth == other.orth:
+                return 1.0
+        elif hasattr(other, '__len__') and len(self) == len(other):
+            for i in range(len(self)):
+                if self[i].orth != getattr(other[i], 'orth', None):
+                    break
+            else:
+                return 1.0
         if self.vector_norm == 0.0 or other.vector_norm == 0.0:
             return 0.0
         return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)

@@ -261,6 +275,11 @@ cdef class Span:
         self.start = start
         self.end = end + 1

+    property vocab:
+        """RETURNS (Vocab): The Span's Doc's vocab."""
+        def __get__(self):
+            return self.doc.vocab
+
     property sent:
         """RETURNS (Span): The sentence span that the span is a part of."""
         def __get__(self):
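The new vocab property simply forwards to the parent Doc's vocab, and the similarity shortcut means two spans with identical orth sequences now compare as 1.0 even without word vectors. A brief usage sketch (assumes this branch of spaCy is installed):

    from spacy.tokens import Doc
    from spacy.vocab import Vocab

    doc = Doc(Vocab(), words=['a', 'b', 'a', 'b'])
    span1, span2 = doc[:2], doc[2:]
    assert span1.vocab is doc.vocab         # new property, forwarded from the Doc
    assert span1.similarity(span2) == 1.0   # identical orths short-circuit to 1.0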
@@ -78,10 +78,15 @@ cdef class Token:

     def __richcmp__(self, Token other, int op):
         # http://cython.readthedocs.io/en/latest/src/userguide/special_methods.html
+        if other is None:
+            if op in (0, 1, 2):
+                return False
+            else:
+                return True
         cdef Doc my_doc = self.doc
         cdef Doc other_doc = other.doc
         my = self.idx
-        their = other.idx if other is not None else None
+        their = other.idx
         if op == 0:
             return my < their
         elif op == 2:

@@ -144,6 +149,12 @@ cdef class Token:
         """
         if 'similarity' in self.doc.user_token_hooks:
             return self.doc.user_token_hooks['similarity'](self)
+        if hasattr(other, '__len__') and len(other) == 1:
+            if self.c.lex.orth == getattr(other[0], 'orth', None):
+                return 1.0
+        elif hasattr(other, 'orth'):
+            if self.c.lex.orth == other.orth:
+                return 1.0
         if self.vector_norm == 0 or other.vector_norm == 0:
             return 0.0
         return (numpy.dot(self.vector, other.vector) /

@@ -341,19 +352,20 @@ cdef class Token:

     property sent_start:
         def __get__(self):
-            util.deprecated(
-                "Token.sent_start is now deprecated. Use Token.is_sent_start "
-                "instead, which returns a boolean value or None if the answer "
-                "is unknown – instead of a misleading 0 for False and 1 for "
-                "True. It also fixes a quirk in the old logic that would "
-                "always set the property to 0 for the first word of the "
-                "document.")
+            # Raising a deprecation warning causes errors for autocomplete
+            #util.deprecated(
+            #    "Token.sent_start is now deprecated. Use Token.is_sent_start "
+            #    "instead, which returns a boolean value or None if the answer "
+            #    "is unknown – instead of a misleading 0 for False and 1 for "
+            #    "True. It also fixes a quirk in the old logic that would "
+            #    "always set the property to 0 for the first word of the "
+            #    "document.")
             # Handle broken backwards compatibility case: doc[0].sent_start
             # was False.
             if self.i == 0:
                 return False
             else:
-                return self.sent_start
+                return self.c.sent_start

         def __set__(self, value):
             self.is_sent_start = value
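Besides silencing the deprecation warning, the getter fix matters for correctness: returning self.sent_start from inside the sent_start property would re-enter the same property and recurse forever, so the code must read the underlying struct field self.c.sent_start. The pitfall, reduced to plain Python:

    class Token(object):
        def __init__(self):
            self._sent_start = True

        @property
        def sent_start(self):
            # return self.sent_start  # would re-invoke this property: RecursionError
            return self._sent_start   # read the backing field instead


    print(Token().sent_start)  # True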
@@ -151,7 +151,7 @@ cdef class Vectors:
         filled = {row for row in self.key2row.values()}
         self._unset = {row for row in range(shape[0]) if row not in filled}
         removed_items = []
-        for key, row in self.key2row.items():
+        for key, row in list(self.key2row.items()):
             if row >= shape[0]:
                 self.key2row.pop(key)
                 removed_items.append((key, row))
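The list(...) wrapper is the whole fix: under Python 3, dict.items() is a live view, and a dictionary may not change size while it is being iterated. A minimal reproduction and the repair:

    d = {5: 0, 3: 1, 98: 2, 100: 3}

    try:
        for key, row in d.items():
            if row >= 2:
                d.pop(key)            # mutating the dict mid-iteration...
    except RuntimeError as err:
        print(err)                    # dictionary changed size during iteration

    for key, row in list(d.items()):  # snapshot the items first
        if row >= 2:
            d.pop(key)
    print(d)                          # {5: 0, 3: 1}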
@@ -335,6 +335,7 @@ cdef class Vocab:
         else:
             width = self.vectors.shape[1]
             self.vectors.resize((new_rows, width))
+        lex = self[orth]  # Adds word to vocab
         self.vectors.add(orth, vector=vector)
@@ -11,5 +11,6 @@ form.o-grid#mc-embedded-subscribe-form(action="//#{MAILCHIMP.user}.list-manage.c
         input(type="text" name="b_#{MAILCHIMP.id}_#{MAILCHIMP.list}" tabindex="-1" value="")

     .o-grid-col.o-grid.o-grid--nowrap.o-field.u-padding-small
+        div
             input#mce-EMAIL.o-field__input.u-text(type="email" name="EMAIL" placeholder="Your email" aria-label="Your email")
         button#mc-embedded-subscribe.o-field__button.u-text-label.u-color-theme.u-nowrap(type="submit" name="subscribe") Sign up
@@ -46,7 +46,7 @@ p

 +table(["Tag", "POS", "Morphology", "Description"])
     +pos-row("-LRB-", "PUNCT", "PunctType=brck PunctSide=ini", "left round bracket")
-    +pos-row("-PRB-", "PUNCT", "PunctType=brck PunctSide=fin", "right round bracket")
+    +pos-row("-RRB-", "PUNCT", "PunctType=brck PunctSide=fin", "right round bracket")
     +pos-row(",", "PUNCT", "PunctType=comm", "punctuation mark, comma")
     +pos-row(":", "PUNCT", "", "punctuation mark, colon or ellipsis")
     +pos-row(".", "PUNCT", "PunctType=peri", "punctuation mark, sentence closer")
@@ -86,7 +86,7 @@ p
     +pos-row("RBR", "ADV", "Degree=comp", "adverb, comparative")
     +pos-row("RBS", "ADV", "Degree=sup", "adverb, superlative")
     +pos-row("RP", "PART", "", "adverb, particle")
-    +pos-row("SP", "SPACE", "", "space")
+    +pos-row("_SP", "SPACE", "", "space")
     +pos-row("SYM", "SYM", "", "symbol")
     +pos-row("TO", "PART", "PartType=inf VerbForm=inf", "infinitival to")
     +pos-row("UH", "INTJ", "", "interjection")
@@ -17,6 +17,17 @@ p
     | Direct downloads don't perform any compatibility checks and require the
     | model name to be specified with its version (e.g., #[code en_core_web_sm-1.2.0]).

++aside("Downloading best practices")
+    | The #[code download] command is mostly intended as a convenient,
+    | interactive wrapper – it performs compatibility checks and prints
+    | detailed messages in case things go wrong. It's #[strong not recommended]
+    | to use this command as part of an automated process. If you know which
+    | model your project needs, you should consider a
+    | #[+a("/usage/models#download-pip") direct download via pip], or
+    | uploading the model to a local PyPi installation and fetching it straight
+    | from there. This will also allow you to add it as a versioned package
+    | dependency to your project.
+
 +code(false, "bash", "$").
     python -m spacy download [model] [--direct]

@@ -43,17 +54,6 @@ p
     | The installed model package in your #[code site-packages]
     | directory and a shortcut link as a symlink in #[code spacy/data].

-+aside("Downloading best practices")
-    | The #[code download] command is mostly intended as a convenient,
-    | interactive wrapper – it performs compatibility checks and prints
-    | detailed messages in case things go wrong. It's #[strong not recommended]
-    | to use this command as part of an automated process. If you know which
-    | model your project needs, you should consider a
-    | #[+a("/usage/models#download-pip") direct download via pip], or
-    | uploading the model to a local PyPi installation and fetching it straight
-    | from there. This will also allow you to add it as a versioned package
-    | dependency to your project.
-
 +h(3, "link") Link

 p
@@ -144,8 +144,14 @@ p
     | #[code pip install -U spacy] to ensure that all installed models
     | can be used with the new version. The command is also useful to detect
     | out-of-sync model links resulting from links created in different virtual
-    | environments. Prints a list of models, the installed versions, the latest
-    | compatible version (if out of date) and the commands for updating.
+    | environments. It will show a list of models, the installed versions, the
+    | latest compatible version (if out of date) and the commands for updating.

++aside("Automated validation")
+    | You can also use the #[code validate] command as part of your build
+    | process or test suite, to ensure all models are up to date before
+    | proceeding. If incompatible models or shortcut links are found, it will
+    | return #[code 1].
+
 +code(false, "bash", "$").
     python -m spacy validate
@@ -335,8 +341,12 @@ p
     | for your custom #[code train] command while still being able to easily
     | tweak the hyperparameters. For example:

-+code(false, "bash").
-    parser_hidden_depth=2 parser_maxout_pieces=1 train-parser
++code(false, "bash", "$").
+    parser_hidden_depth=2 parser_maxout_pieces=1 spacy train [...]
+
++code("Usage with alias", "bash", "$").
+    alias train-parser="spacy train en /output /data /train /dev -n 1000"
+    parser_maxout_pieces=1 train-parser

 +table(["Name", "Description", "Default"])
     +row
@@ -28,7 +28,7 @@ p Create the rule-based #[code PhraseMatcher].
     +row
         +cell #[code max_length]
         +cell int
-        +cell Mamimum length of a phrase pattern to add.
+        +cell Maximum length of a phrase pattern to add.

     +row("foot")
         +cell returns
@@ -394,7 +394,7 @@ p
     num, denom = text.split('/')
     if num.isdigit() and denom.isdigit():
         return True
-    if text in _num_words:
+    if text.lower() in _num_words:
         return True
     return False
@@ -148,7 +148,7 @@ p
         +cell Negate the pattern, by requiring it to match exactly 0 times.

     +row
-        +cell #[code *]
+        +cell #[code ?]
         +cell Make the pattern optional, by allowing it to match 0 or 1 times.

     +row

@@ -156,8 +156,8 @@ p
         +cell Require the pattern to match 1 or more times.

     +row
-        +cell #[code ?]
-        +cell Allow the pattern to zero or more times.
+        +cell #[code *]
+        +cell Allow the pattern to match zero or more times.

 p
     | The #[code +] and #[code *] operators are usually interpreted
@@ -305,6 +305,54 @@ p
     | A list of #[code (match_id, start, end)] tuples, describing the
     | matches. A match tuple describes a span #[code doc[start:end]].

++h(3, "regex") Using regular expressions
+
+p
+    | In some cases, only matching tokens and token attributes isn't enough –
+    | for example, you might want to match different spellings of a word,
+    | without having to add a new pattern for each spelling. A simple solution
+    | is to match a regular expression on the #[code Doc]'s #[code text] and
+    | use the #[+api("doc#char_span") #[code Doc.char_span]] method to
+    | create a #[code Span] from the character indices of the match:
+
++code.
+    import spacy
+    import re
+
+    nlp = spacy.load('en')
+    doc = nlp(u'The spelling is "definitely", not "definately" or "deffinitely".')
+
+    DEFINITELY_PATTERN = re.compile(r'deff?in[ia]tely')
+
+    for match in re.finditer(DEFINITELY_PATTERN, doc.text):
+        start, end = match.span()         # get matched indices
+        span = doc.char_span(start, end)  # create Span from indices
+
+p
+    | You can also use the regular expression with spaCy's #[code Matcher] by
+    | converting it to a token flag. To ensure efficiency, the
+    | #[code Matcher] can only access the C-level data. This means that it can
+    | either use built-in token attributes or #[strong binary flags].
+    | #[+api("vocab#add_flag") #[code Vocab.add_flag]] returns a flag ID which
+    | you can use as a key of a token match pattern. Tokens that match the
+    | regular expression will return #[code True] for the #[code IS_DEFINITELY]
+    | flag.
+
++code.
+    IS_DEFINITELY = nlp.vocab.add_flag(re.compile(r'deff?in[ia]tely').match)
+
+    matcher = Matcher(nlp.vocab)
+    matcher.add('DEFINITELY', None, [{IS_DEFINITELY: True}])
+
+p
+    | Providing the regular expressions as binary flags also lets you use them
+    | in combination with other token patterns – for example, to match the
+    | word "definitely" in various spellings, followed by a case-insensitive
+    | "not" and an adjective:
+
++code.
+    [{IS_DEFINITELY: True}, {'LOWER': 'not'}, {'POS': 'ADJ'}]
+
 +h(3, "example1") Example: Using linguistic annotations

 p
@@ -48,9 +48,9 @@ p
     | those IDs back to strings.

 +code.
-    moby_dick = open('moby_dick.txt', 'r')  # open a large document
-    doc = nlp(moby_dick)                    # process it
-    doc.to_disk('/moby_dick.bin')           # save the processed Doc
+    text = open('customer_feedback_627.txt', 'r').read()  # open a document
+    doc = nlp(text)                                       # process it
+    doc.to_disk('/customer_feedback_627.bin')             # save the processed Doc

 p
     | If you need it again later, you can load it back into an empty #[code Doc]

@@ -61,4 +61,4 @@ p
     from spacy.tokens import Doc   # to create empty Doc
     from spacy.vocab import Vocab  # to create empty Vocab

-    doc = Doc(Vocab()).from_disk('/moby_dick.bin')  # load processed Doc
+    doc = Doc(Vocab()).from_disk('/customer_feedback_627.bin')  # load processed Doc
@@ -8,7 +8,7 @@ p
     | Collecting training data may sound incredibly painful – and it can be,
     | if you're planning a large-scale annotation project. However, if your main
     | goal is to update an existing model's predictions – for example, spaCy's
-    | named entity recognition – the hard is part usually not creating the
+    | named entity recognition – the hard part is usually not creating the
     | actual annotations. It's finding representative examples and
     | #[strong extracting potential candidates]. The good news is, if you've
     | been noticing bad performance on your data, you likely
@@ -106,6 +106,10 @@ p
         | #[+api("tagger#from_disk") #[code Tagger.from_disk]]
         | #[+api("tagger#from_bytes") #[code Tagger.from_bytes]]

+    +row
+        +cell #[code Tagger.tag_names]
+        +cell #[code Tagger.labels]
+
     +row
         +cell #[code DependencyParser.load]
         +cell
@@ -37,6 +37,9 @@ include ../_includes/_mixins
     +card("spacy-api-docker", "https://github.com/jgontrum/spacy-api-docker", "Johannes Gontrum", "github")
         | spaCy accessed by a REST API, wrapped in a Docker container.

+    +card("languagecrunch", "https://github.com/artpar/languagecrunch", "Parth Mudgal", "github")
+        | NLP server for spaCy, WordNet and NeuralCoref as a Docker image.
+
     +card("spacy-nlp-zeromq", "https://github.com/pasupulaphani/spacy-nlp-docker", "Phaninder Pasupula", "github")
         | Docker image exposing spaCy with ZeroMQ bindings.

@@ -69,6 +72,10 @@ include ../_includes/_mixins
         | Add language detection to your spaCy pipeline using Compact
         | Language Detector 2 via PYCLD2.

+    +card("spacy-lookup", "https://github.com/mpuig/spacy-lookup", "Marc Puig", "github")
+        | A powerful entity matcher for very large dictionaries, using the
+        | FlashText module.
+
     .u-text-right
         +button("https://github.com/topics/spacy-extension?o=desc&s=stars", false, "primary", "small") See more extensions on GitHub