mirror of
https://github.com/explosion/spaCy.git
synced 2024-11-11 04:08:09 +03:00
Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-faster-matcher
commit
f7dc64d2a3
106
.github/contributors/emulbreh.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Johannes Dollinger |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 2018-02-13 |
|
||||
| GitHub username | emulbreh |
|
||||
| Website (optional) | |
|
106
.github/contributors/enerrio.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Aaron Marquez |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 2/15/2018 |
|
||||
| GitHub username | enerrio |
|
||||
| Website (optional) | |
|
106
.github/contributors/oxinabox.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Lyndon White |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 9/2/2018 |
|
||||
| GitHub username | oxinabox |
|
||||
| Website (optional) | white.ucc.asn.au |
|
106
.github/contributors/ursachec.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | ------------------------- |
|
||||
| Name | Claudiu-Vlad Ursache |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 2018-02-04 |
|
||||
| GitHub username | ursachec |
|
||||
| Website (optional) | https://www.cvursache.com |
|
|
@ -18,9 +18,9 @@ cdef enum attr_id_t:
|
|||
IS_QUOTE
|
||||
IS_LEFT_PUNCT
|
||||
IS_RIGHT_PUNCT
|
||||
IS_CURRENCY
|
||||
|
||||
FLAG18 = 18
|
||||
FLAG19
|
||||
FLAG19 = 19
|
||||
FLAG20
|
||||
FLAG21
|
||||
FLAG22
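Reviewer note: this hunk reads as if `IS_CURRENCY` takes over the slot that `FLAG18` used to occupy, with `FLAG19 = 19` written out explicitly so the remaining flags keep their numeric ids. In a C-style enum (which Cython's `cdef enum` follows), a member without an explicit value is the previous member plus one. The sketch below illustrates that numbering rule in plain Python with `IntEnum` and `auto()`. The class names and the starting value `17` for `IS_RIGHT_PUNCT` are assumptions made for the illustration, not values taken from this diff.

```python
from enum import IntEnum, auto


class Before(IntEnum):
    # Hypothetical starting value; the real id is not shown in the hunk.
    IS_RIGHT_PUNCT = 17
    FLAG18 = 18
    FLAG19 = auto()  # previous + 1 -> 19
    FLAG20 = auto()  # previous + 1 -> 20


class After(IntEnum):
    IS_RIGHT_PUNCT = 17
    IS_CURRENCY = auto()  # previous + 1 -> 18, the slot FLAG18 used to hold
    FLAG19 = 19           # pinned explicitly, independent of what precedes it
    FLAG20 = auto()       # previous + 1 -> 20


# The ids of the surviving flags are unchanged between the two layouts.
unchanged = all(Before[name] == After[name] for name in ("FLAG19", "FLAG20"))
```

Pinning `FLAG19` means a later insertion above it cannot silently renumber the flags that follow.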
|
||||
|
|
|
@ -21,7 +21,7 @@ IDS = {
|
|||
"IS_QUOTE": IS_QUOTE,
|
||||
"IS_LEFT_PUNCT": IS_LEFT_PUNCT,
|
||||
"IS_RIGHT_PUNCT": IS_RIGHT_PUNCT,
|
||||
"FLAG18": FLAG18,
|
||||
"IS_CURRENCY": IS_CURRENCY,
|
||||
"FLAG19": FLAG19,
|
||||
"FLAG20": FLAG20,
|
||||
"FLAG21": FLAG21,
|
||||
|
|
|
@ -3,8 +3,6 @@ from __future__ import unicode_literals, division, print_function
|
|||
|
||||
import plac
|
||||
from timeit import default_timer as timer
|
||||
import random
|
||||
import numpy.random
|
||||
|
||||
from ..gold import GoldCorpus
|
||||
from ..util import prints
|
||||
|
@ -12,10 +10,6 @@ from .. import util
|
|||
from .. import displacy
|
||||
|
||||
|
||||
random.seed(0)
|
||||
numpy.random.seed(0)
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
model=("model name or path", "positional", None, str),
|
||||
data_path=("location of JSON-formatted evaluation data", "positional",
|
||||
|
@ -31,6 +25,8 @@ def evaluate(model, data_path, gpu_id=-1, gold_preproc=False, displacy_path=None
|
|||
Evaluate a model. To render a sample of parses in a HTML file, set an
|
||||
output directory as the displacy_path argument.
|
||||
"""
|
||||
|
||||
util.fix_random_seed()
|
||||
if gpu_id >= 0:
|
||||
util.use_gpu(gpu_id)
|
||||
util.set_env_log(False)
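Reviewer note: the hunks above drop the module-level `random.seed(0)` / `numpy.random.seed(0)` calls and call `util.fix_random_seed()` inside the command instead. The body of that helper is not part of this diff, so the following is only a minimal sketch of what such a helper could look like. The name `fix_random_seed_sketch`, its `seed` parameter and the optional NumPy handling are assumptions.

```python
import random


def fix_random_seed_sketch(seed=0):
    """Seed the stdlib RNG and, if NumPy is installed, NumPy's global RNG."""
    random.seed(seed)
    try:
        import numpy.random
    except ImportError:
        # NumPy is optional in this sketch; nothing more to seed.
        return
    numpy.random.seed(seed)


# Seeding inside the function (rather than at import time) makes each call
# start from the same state, regardless of what ran before.
fix_random_seed_sketch(0)
first = [random.random() for _ in range(3)]
fix_random_seed_sketch(0)
second = [random.random() for _ in range(3)]
```

Moving the seeding into the entry point also avoids the side effect of reseeding the global RNGs whenever the module is merely imported.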
|
||||
|
|
|
@ -6,8 +6,6 @@ from pathlib import Path
|
|||
import tqdm
|
||||
from thinc.neural._classes.model import Model
|
||||
from timeit import default_timer as timer
|
||||
import random
|
||||
import numpy.random
|
||||
|
||||
from ..gold import GoldCorpus, minibatch
|
||||
from ..util import prints
|
||||
|
@ -16,9 +14,6 @@ from .. import about
|
|||
from .. import displacy
|
||||
from ..compat import json_dumps
|
||||
|
||||
random.seed(0)
|
||||
numpy.random.seed(0)
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
lang=("model language", "positional", None, str),
|
||||
|
@ -45,6 +40,7 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
|
|||
"""
|
||||
Train a model. Expects data in spaCy's JSON format.
|
||||
"""
|
||||
util.fix_random_seed()
|
||||
util.set_env_log(True)
|
||||
n_sents = n_sents or None
|
||||
output_path = util.ensure_path(output_dir)
|
||||
|
|
|
@ -43,15 +43,15 @@ fix_text = ftfy.fix_text
|
|||
copy_array = copy_array
|
||||
izip = getattr(itertools, 'izip', zip)
|
||||
|
||||
is_python2 = six.PY2
|
||||
is_python3 = six.PY3
|
||||
is_windows = sys.platform.startswith('win')
|
||||
is_linux = sys.platform.startswith('linux')
|
||||
is_osx = sys.platform == 'darwin'
|
||||
|
||||
is_python2 = six.PY2
|
||||
is_python3 = six.PY3
|
||||
is_python_pre_3_5 = is_python2 or (is_python3 and sys.version_info[1]<5)
|
||||
|
||||
if is_python2:
|
||||
import imp
|
||||
bytes_ = str
|
||||
unicode_ = unicode # noqa: F821
|
||||
basestring_ = basestring # noqa: F821
|
||||
|
@ -60,7 +60,6 @@ if is_python2:
|
|||
path2str = lambda path: str(path).decode('utf8')
|
||||
|
||||
elif is_python3:
|
||||
import importlib.util
|
||||
bytes_ = bytes
|
||||
unicode_ = str
|
||||
basestring_ = str
|
||||
|
@ -111,9 +110,11 @@ def normalize_string_keys(old):
|
|||
|
||||
def import_file(name, loc):
|
||||
loc = str(loc)
|
||||
if is_python2:
|
||||
if is_python_pre_3_5:
|
||||
import imp
|
||||
return imp.load_source(name, loc)
|
||||
else:
|
||||
import importlib.util
|
||||
spec = importlib.util.spec_from_file_location(name, str(loc))
|
||||
module = importlib.util.module_from_spec(spec)
|
||||
spec.loader.exec_module(module)
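Reviewer note: the `else` branch uses the standard `importlib.util` recipe for loading a module from an explicit file path (available from Python 3.5, which matches the new `is_python_pre_3_5` check). A self-contained sketch of that recipe follows. The function name `import_file_sketch`, the temporary file and the `ANSWER` constant are made up for the demonstration.

```python
import importlib.util
import os
import tempfile


def import_file_sketch(name, loc):
    """Load the Python file at `loc` as a module called `name`."""
    spec = importlib.util.spec_from_file_location(name, str(loc))
    module = importlib.util.module_from_spec(spec)
    # Executes the file's top-level code inside the new module object.
    spec.loader.exec_module(module)
    return module


# Write a throwaway module to disk and load it by path.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_mod.py")
    with open(path, "w") as f:
        f.write("ANSWER = 42\n")
    demo = import_file_sketch("demo_mod", path)
```

Note that this recipe does not register the module in `sys.modules`; callers that need that have to add it themselves.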
|
||||
|
|
|
@ -115,7 +115,7 @@ GLOSSARY = {
|
|||
'ADJA': 'adjective, attributive',
|
||||
'ADJD': 'adjective, adverbial or predicative',
|
||||
'APPO': 'postposition',
|
||||
'APRP': 'preposition; circumposition left',
|
||||
'APPR': 'preposition; circumposition left',
|
||||
'APPRART': 'preposition with article',
|
||||
'APZR': 'circumposition right',
|
||||
'ART': 'definite or indefinite article',
|
||||
|
|
|
@ -69,6 +69,14 @@ def is_right_punct(text):
|
|||
return text in right_punct
|
||||
|
||||
|
||||
def is_currency(text):
|
||||
# can be overwritten by lang with list of currency words, e.g. dollar, euro
|
||||
for char in text:
|
||||
if unicodedata.category(char) != 'Sc':
|
||||
return False
|
||||
return True
|
||||
|
||||
|
||||
def like_email(text):
|
||||
return bool(_like_email(text))
|
||||
|
||||
|
@ -164,5 +172,6 @@ LEX_ATTRS = {
|
|||
attrs.IS_QUOTE: is_quote,
|
||||
attrs.IS_LEFT_PUNCT: is_left_punct,
|
||||
attrs.IS_RIGHT_PUNCT: is_right_punct,
|
||||
attrs.IS_CURRENCY: is_currency,
|
||||
attrs.LIKE_URL: like_url
|
||||
}
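Reviewer note: `is_currency` relies on the Unicode general category `Sc` ("Symbol, currency") and returns `True` only when every character of the text is in that category. The sketch below copies that loop under a different name (`is_currency_sketch`, chosen here to avoid confusion with the real function) and tries it on a few inputs.

```python
import unicodedata


def is_currency_sketch(text):
    # Same loop as in the hunk: reject on the first non-currency character.
    for char in text:
        if unicodedata.category(char) != 'Sc':
            return False
    return True


results = {
    "$": is_currency_sketch("$"),
    "€": is_currency_sketch("€"),
    "US$": is_currency_sketch("US$"),      # letters are not category Sc
    "dollar": is_currency_sketch("dollar"),
    "": is_currency_sketch(""),            # loop body never runs
}
```

Two behaviours are worth knowing when reviewing: currency words such as "dollar" are not matched (the in-code comment leaves that to language-specific overrides), and the empty string returns `True` because the loop never executes.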
|
||||
|
|
|
@ -624,7 +624,7 @@ class Language(object):
|
|||
deserializers = OrderedDict((
|
||||
('vocab', lambda p: self.vocab.from_disk(p)),
|
||||
('tokenizer', lambda p: self.tokenizer.from_disk(p, vocab=False)),
|
||||
('meta.json', lambda p: self.meta.update(ujson.load(p.open('r'))))
|
||||
('meta.json', lambda p: self.meta.update(util.read_json(p)))
|
||||
))
|
||||
for name, proc in self.pipeline:
|
||||
if name in disable:
|
||||
|
@ -720,5 +720,5 @@ class DisabledPipes(list):
|
|||
|
||||
def _pipe(func, docs):
|
||||
for doc in docs:
|
||||
func(doc)
|
||||
doc = func(doc)
|
||||
yield doc
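Reviewer note: the change from `func(doc)` to `doc = func(doc)` matters whenever a pipeline component returns a new object instead of mutating its argument in place. The sketch below contrasts the two variants using plain strings as stand-ins for `Doc` objects. The function names and the `upper` component are hypothetical.

```python
def pipe_discarding(func, docs):
    # Old behaviour: the return value of func is thrown away.
    for doc in docs:
        func(doc)
        yield doc


def pipe_rebinding(func, docs):
    # New behaviour: whatever func returns is what gets yielded.
    for doc in docs:
        doc = func(doc)
        yield doc


def upper(text):
    # A component that returns a new object rather than mutating its input.
    return text.upper()


discarded = list(pipe_discarding(upper, ["a", "b"]))
rebound = list(pipe_rebinding(upper, ["a", "b"]))
```

With the old variant the component's result is lost and the inputs come back unchanged; with the new one the transformed objects are yielded.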
|
||||
|
|
|
@ -12,7 +12,7 @@ import numpy
|
|||
from .typedefs cimport attr_t, flags_t
|
||||
from .attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE
|
||||
from .attrs cimport IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL, IS_STOP
|
||||
from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT, IS_OOV
|
||||
from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT, IS_CURRENCY, IS_OOV
|
||||
from .attrs cimport PROB
|
||||
from .attrs import intify_attrs
|
||||
from . import about
|
||||
|
@ -474,6 +474,14 @@ cdef class Lexeme:
|
|||
def __set__(self, bint x):
|
||||
Lexeme.c_set_flag(self.c, IS_RIGHT_PUNCT, x)
|
||||
|
||||
property is_currency:
|
||||
"""RETURNS (bool): Whether the lexeme is a currency symbol, e.g. $, €."""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c, IS_CURRENCY)
|
||||
|
||||
def __set__(self, bint x):
|
||||
Lexeme.c_set_flag(self.c, IS_CURRENCY, x)
|
||||
|
||||
property like_url:
|
||||
"""RETURNS (bool): Whether the lexeme resembles a URL."""
|
||||
def __get__(self):
|
||||
|
|
|
@ -144,7 +144,8 @@ class Pipe(object):
|
|||
return create_default_optimizer(self.model.ops,
|
||||
**self.cfg.get('optimizer', {}))
|
||||
|
||||
def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None):
|
||||
def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None,
|
||||
**kwargs):
|
||||
"""Initialize the pipe for training, using data examples if available.
|
||||
If no model has been initialized yet, the model is added."""
|
||||
if self.model is True:
|
||||
|
@ -214,7 +215,8 @@ class Pipe(object):
|
|||
|
||||
def _load_cfg(path):
|
||||
if path.exists():
|
||||
return ujson.load(path.open())
|
||||
with path.open() as file_:
|
||||
return ujson.load(file_)
|
||||
else:
|
||||
return {}
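Reviewer note: wrapping the `open()` call in a `with` block closes the file handle deterministically instead of leaving it to garbage collection. The same pattern is applied to `load_model` further down. The sketch below reproduces the pattern with the standard-library `json` module swapped in for `ujson`, which is an assumption made to keep the example dependency-free. The function name and the sample config are likewise invented.

```python
import json
import tempfile
from pathlib import Path


def load_cfg_sketch(path):
    if path.exists():
        # The handle is closed as soon as the block exits, even on error.
        with path.open() as file_:
            return json.load(file_)
    else:
        return {}


with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / "cfg"
    cfg_path.write_text('{"width": 128}')
    loaded = load_cfg_sketch(cfg_path)
    missing = load_cfg_sketch(Path(tmp) / "does_not_exist")
```

On platforms that refuse to delete or replace files that are still open, such as Windows, releasing the handle promptly is likely the practical motivation, though the diff itself does not say so.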
|
||||
|
||||
|
@ -344,7 +346,8 @@ class Tensorizer(Pipe):
|
|||
loss = (d_scores**2).sum()
|
||||
return loss, d_scores
|
||||
|
||||
def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None):
|
||||
def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None,
|
||||
**kwargs):
|
||||
"""Allocate models, pre-process training data and acquire an
|
||||
optimizer.
|
||||
|
||||
|
@ -467,7 +470,8 @@ class Tagger(Pipe):
|
|||
d_scores = self.model.ops.unflatten(d_scores, [len(d) for d in docs])
|
||||
return float(loss), d_scores
|
||||
|
||||
def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None):
|
||||
def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None,
|
||||
**kwargs):
|
||||
orig_tag_map = dict(self.vocab.morphology.tag_map)
|
||||
new_tag_map = OrderedDict()
|
||||
for raw_text, annots_brackets in gold_tuples:
|
||||
|
@ -580,7 +584,8 @@ class Tagger(Pipe):
|
|||
def load_model(p):
|
||||
if self.model is True:
|
||||
self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
|
||||
self.model.from_bytes(p.open('rb').read())
|
||||
with p.open('rb') as file_:
|
||||
self.model.from_bytes(file_.read())
|
||||
|
||||
def load_tag_map(p):
|
||||
with p.open('rb') as file_:
|
||||
|
@ -641,7 +646,7 @@ class MultitaskObjective(Tagger):
|
|||
pass
|
||||
|
||||
def begin_training(self, gold_tuples=tuple(), pipeline=None, tok2vec=None,
|
||||
sgd=None):
|
||||
sgd=None, **kwargs):
|
||||
gold_tuples = nonproj.preprocess_training_data(gold_tuples)
|
||||
for raw_text, annots_brackets in gold_tuples:
|
||||
for annots, brackets in annots_brackets:
|
||||
|
@ -766,7 +771,7 @@ class SimilarityHook(Pipe):
|
|||
def update(self, doc1_doc2, golds, sgd=None, drop=0.):
|
||||
sims, bp_sims = self.model.begin_update(doc1_doc2, drop=drop)
|
||||
|
||||
def begin_training(self, _=tuple(), pipeline=None, sgd=None):
|
||||
def begin_training(self, _=tuple(), pipeline=None, sgd=None, **kwargs):
|
||||
"""Allocate model, using width from tensorizer in pipeline.
|
||||
|
||||
gold_tuples (iterable): Gold-standard training data.
|
||||
|
@ -887,6 +892,7 @@ cdef class DependencyParser(Parser):
|
|||
self._multitasks.append(labeller)
|
||||
|
||||
def init_multitask_objectives(self, gold_tuples, pipeline, sgd=None, **cfg):
|
||||
self.add_multitask_objective('tag')
|
||||
for labeller in self._multitasks:
|
||||
tok2vec = self.model[0]
|
||||
labeller.begin_training(gold_tuples, pipeline=pipeline,
|
||||
|
|
|
@ -17,9 +17,9 @@ cdef enum symbol_t:
|
|||
IS_QUOTE
|
||||
IS_LEFT_PUNCT
|
||||
IS_RIGHT_PUNCT
|
||||
IS_CURRENCY
|
||||
|
||||
FLAG18 = 18
|
||||
FLAG19
|
||||
FLAG19 = 19
|
||||
FLAG20
|
||||
FLAG21
|
||||
FLAG22
|
||||
|
|
|
@ -22,8 +22,8 @@ IDS = {
|
|||
"IS_QUOTE": IS_QUOTE,
|
||||
"IS_LEFT_PUNCT": IS_LEFT_PUNCT,
|
||||
"IS_RIGHT_PUNCT": IS_RIGHT_PUNCT,
|
||||
"IS_CURRENCY": IS_CURRENCY,
|
||||
|
||||
"FLAG18": FLAG18,
|
||||
"FLAG19": FLAG19,
|
||||
"FLAG20": FLAG20,
|
||||
"FLAG21": FLAG21,
|
||||
|
|
|
@ -390,6 +390,22 @@ cdef class ArcEager(TransitionSystem):
|
|||
gold.c.labels[i] = self.strings.add(label)
|
||||
return gold
|
||||
|
||||
def get_beam_parses(self, Beam beam):
|
||||
parses = []
|
||||
probs = beam.probs
|
||||
for i in range(beam.size):
|
||||
state = <StateC*>beam.at(i)
|
||||
if state.is_final():
|
||||
self.finalize_state(state)
|
||||
prob = probs[i]
|
||||
parse = []
|
||||
for j in range(state.length):
|
||||
head = state.H(j)
|
||||
label = self.strings[state._sent[j].dep]
|
||||
parse.append((head, j, label))
|
||||
parses.append((prob, parse))
|
||||
return parses
|
||||
|
||||
cdef Transition lookup_transition(self, object name) except *:
|
||||
if '-' in name:
|
||||
move_str, label_str = name.split('-', 1)
|
||||
|
|
|
@ -835,6 +835,7 @@ cdef class Parser:
|
|||
sgd = self.create_optimizer()
|
||||
self.model[1].begin_training(
|
||||
self.model[1].ops.allocate((5, cfg['token_vector_width'])))
|
||||
if pipeline is not None:
|
||||
self.init_multitask_objectives(gold_tuples, pipeline, sgd=sgd, **cfg)
|
||||
link_vectors_to_models(self.vocab)
|
||||
else:
|
||||
|
@ -887,7 +888,7 @@ cdef class Parser:
|
|||
deserializers = {
|
||||
'vocab': lambda p: self.vocab.from_disk(p),
|
||||
'moves': lambda p: self.moves.from_disk(p, strings=False),
|
||||
'cfg': lambda p: self.cfg.update(ujson.load(p.open())),
|
||||
'cfg': lambda p: self.cfg.update(util.read_json(p)),
|
||||
'model': lambda p: None
|
||||
}
|
||||
util.from_disk(path, deserializers, exclude)
|
||||
|
|
|
@ -2,7 +2,7 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
from ...attrs import intify_attrs, ORTH, NORM, LEMMA, IS_ALPHA
|
||||
from ...lang.lex_attrs import is_punct, is_ascii, like_url, word_shape
|
||||
from ...lang.lex_attrs import is_punct, is_ascii, is_currency, like_url, word_shape
|
||||
|
||||
import pytest
|
||||
|
||||
|
@ -37,6 +37,13 @@ def test_lex_attrs_is_ascii(text, match):
|
|||
assert is_ascii(text) == match
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,match', [('$', True), ('£', True), ('♥', False),
|
||||
('€', True), ('¥', True), ('¢', True),
|
||||
('a', False), ('www.google.com', False), ('dog', False)])
|
||||
def test_lex_attrs_is_currency(text, match):
|
||||
assert is_currency(text) == match
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,match', [
|
||||
('www.google.com', True), ('google.com', True), ('sydney.com', True),
|
||||
('2girls1cup.org', True), ('http://stupid', True), ('www.hi', True),
|
||||
|
|
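The parametrized cases in `test_lex_attrs_is_currency` accept `$`, `£`, `€`, `¥` and `¢` and reject `♥`, letters, and URLs. The diff does not show the body of `is_currency`, so the function below is a hypothetical stand-in, not spaCy's implementation. It is a minimal sketch that assumes the check can be based on the Unicode `Sc` (Symbol, currency) category, and it reproduces the expected values from the test:

```python
import unicodedata


def is_currency_sketch(text):
    """Hypothetical helper: True if every character is a Unicode currency symbol."""
    # Assumption: "currency" means general category "Sc" (Symbol, currency).
    # An empty string is treated as not being a currency symbol.
    return bool(text) and all(unicodedata.category(char) == "Sc" for char in text)


# The same (text, match) pairs that the parametrized test uses.
CASES = [
    ("$", True), ("£", True), ("♥", False),
    ("€", True), ("¥", True), ("¢", True),
    ("a", False), ("www.google.com", False), ("dog", False),
]

results = [(text, is_currency_sketch(text)) for text, _ in CASES]
```

Checking the general category rather than a hard-coded list means symbols outside the test set are handled without further changes, at the cost of depending on the Unicode data shipped with the interpreter.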
23
spacy/tests/regression/test_issue1959.py
Normal file
|
@ -0,0 +1,23 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.models('en')
|
||||
def test_issue1959(EN):
|
||||
texts = ['Apple is looking at buying U.K. startup for $1 billion.']
|
||||
# nlp = load_test_model('en_core_web_sm')
|
||||
EN.add_pipe(clean_component, name='cleaner', after='ner')
|
||||
doc = EN(texts[0])
|
||||
doc_pipe = [doc_pipe for doc_pipe in EN.pipe(texts)]
|
||||
assert doc == doc_pipe[0]
|
||||
|
||||
|
||||
def clean_component(doc):
|
||||
""" Clean up text. Make lowercase and remove punctuation and stopwords """
|
||||
# Remove punctuation, symbols (#) and stopwords
|
||||
doc = [tok.text.lower() for tok in doc if (not tok.is_stop
|
||||
and tok.pos_ != 'PUNCT' and
|
||||
tok.pos_ != 'SYM')]
|
||||
doc = ' '.join(doc)
|
||||
return doc
|
28
spacy/tests/serialize/test_serialize_language.py
Normal file
|
@ -0,0 +1,28 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..util import make_tempdir
|
||||
from ...language import Language
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def meta_data():
|
||||
return {
|
||||
'name': 'name-in-fixture',
|
||||
'version': 'version-in-fixture',
|
||||
'description': 'description-in-fixture',
|
||||
'author': 'author-in-fixture',
|
||||
'email': 'email-in-fixture',
|
||||
'url': 'url-in-fixture',
|
||||
'license': 'license-in-fixture',
|
||||
}
|
||||
|
||||
|
||||
def test_serialize_language_meta_disk(meta_data):
|
||||
language = Language(meta=meta_data)
|
||||
with make_tempdir() as d:
|
||||
language.to_disk(d)
|
||||
new_language = Language().from_disk(d)
|
||||
assert new_language.meta == language.meta
|
|
@ -15,7 +15,7 @@ from ..lexeme cimport Lexeme
|
|||
from .. import parts_of_speech
|
||||
from ..attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE
|
||||
from ..attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT
|
||||
from ..attrs cimport IS_OOV, IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL
|
||||
from ..attrs cimport IS_OOV, IS_TITLE, IS_UPPER, IS_CURRENCY, LIKE_URL, LIKE_NUM, LIKE_EMAIL
|
||||
from ..attrs cimport IS_STOP, ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX
|
||||
from ..attrs cimport LENGTH, CLUSTER, LEMMA, POS, TAG, DEP
|
||||
from ..compat import is_config
|
||||
|
@ -855,6 +855,11 @@ cdef class Token:
|
|||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_RIGHT_PUNCT)
|
||||
|
||||
property is_currency:
|
||||
"""RETURNS (bool): Whether the token is a currency symbol."""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_CURRENCY)
|
||||
|
||||
property like_url:
|
||||
"""RETURNS (bool): Whether the token resembles a URL."""
|
||||
def __get__(self):
|
||||
|
|
|
@ -17,6 +17,7 @@ from thinc.neural._classes.model import Model
|
|||
import functools
|
||||
import cytoolz
|
||||
import itertools
|
||||
import numpy.random
|
||||
|
||||
from .symbols import ORTH
|
||||
from .compat import cupy, CudaStream, path2str, basestring_, input_, unicode_
|
||||
|
@ -623,3 +624,8 @@ def use_gpu(gpu_id):
|
|||
Model.ops = CupyOps()
|
||||
Model.Ops = CupyOps
|
||||
return device
|
||||
|
||||
|
||||
def fix_random_seed(seed=0):
|
||||
random.seed(seed)
|
||||
numpy.random.seed(seed)
|
||||
|
|
|
@ -347,7 +347,8 @@ cdef class Vectors:
|
|||
"""
|
||||
def load_key2row(path):
|
||||
if path.exists():
|
||||
self.key2row = msgpack.load(path.open('rb'))
|
||||
with path.open('rb') as file_:
|
||||
self.key2row = msgpack.load(file_)
|
||||
for key, row in self.key2row.items():
|
||||
if row in self._unset:
|
||||
self._unset.remove(row)
|
||||
|
|
|
@ -10,6 +10,9 @@ nav.c-nav.u-text.js-nav(class=landing ? "c-nav--theme" : null)
|
|||
li.c-nav__menu__item(class=is_active ? "is-active" : null)
|
||||
+a(url)(tabindex=is_active ? "-1" : null)=item
|
||||
|
||||
li.c-nav__menu__item.u-hidden-xs
|
||||
+a("https://survey.spacy.io", true) User Survey 2018
|
||||
|
||||
li.c-nav__menu__item.u-hidden-xs
|
||||
+a(gh("spaCy"))(aria-label="GitHub") #[+icon("github", 20)]
|
||||
|
||||
|
|
|
@ -13,7 +13,7 @@ p
|
|||
| Their results and subsequent discussions helped us develop a novel
|
||||
| psychologically-motivated technique to improve spaCy's accuracy, which
|
||||
| we published in joint work with Macquarie University
|
||||
| #[+a("https://aclweb.org/anthology/D/D15/D15-1162.pdf") (Honnibal and Johnson, 2015)].
|
||||
| #[+a("https://www.aclweb.org/anthology/D/D15/D15-1162.pdf") (Honnibal and Johnson, 2015)].
|
||||
|
||||
include _benchmarks-choi-2015
|
||||
|
||||
|
|
|
@ -38,9 +38,10 @@ p
|
|||
| #[code spacy/data] directory. This means your user needs permission to do
|
||||
| this. The above error mostly occurs when doing a system-wide installation,
|
||||
| which will create the symlinks in a system directory. Run the
|
||||
| #[code download] or #[code link] command as administrator, or use a
|
||||
| #[code virtualenv] to install spaCy in a user directory, instead
|
||||
| of doing a system-wide installation.
|
||||
| #[code download] or #[code link] command as administrator (on Windows,
|
||||
| simply right-click on your terminal or shell and select "Run as
|
||||
| Administrator"), or use a #[code virtualenv] to install spaCy in a user
|
||||
| directory, instead of doing a system-wide installation.
|
||||
|
||||
+h(3, "no-cache-dir") No such option: --no-cache-dir
|
||||
|
||||
|
|
|
@ -65,9 +65,9 @@ p
|
|||
- var style = [0, 1, 0, 1, 0]
|
||||
+annotation-row(["Autonomous", "amod", "cars", "NOUN", ""], style)
|
||||
+annotation-row(["cars", "nsubj", "shift", "VERB", "Autonomous"], style)
|
||||
+annotation-row(["shift", "ROOT", "shift", "VERB", "cars, liability"], style)
|
||||
+annotation-row(["shift", "ROOT", "shift", "VERB", "cars, liability, toward"], style)
|
||||
+annotation-row(["insurance", "compound", "liability", "NOUN", ""], style)
|
||||
+annotation-row(["liability", "dobj", "shift", "VERB", "insurance, toward"], style)
|
||||
+annotation-row(["liability", "dobj", "shift", "VERB", "insurance"], style)
|
||||
+annotation-row(["toward", "prep", "liability", "NOUN", "manufacturers"], style)
|
||||
+annotation-row(["manufacturers", "pobj", "toward", "ADP", ""], style)
|
||||
|
||||
|
|
|
@ -80,7 +80,7 @@ p
|
|||
doc.ents = [netflix_ent]
|
||||
|
||||
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
|
||||
assert ents = [(u'Netflix', 0, 7, u'ORG')]
|
||||
assert ents == [(u'Netflix', 0, 7, u'ORG')]
|
||||
|
||||
p
|
||||
| Keep in mind that you need to create a #[code Span] with the start and
|
||||
|
|
|
@ -54,10 +54,21 @@ p
|
|||
|
||||
p
|
||||
| The matcher returns a list of #[code (match_id, start, end)] tuples – in
|
||||
| this case, #[code [('HelloWorld', 0, 2)]], which maps to the span
|
||||
| #[code doc[0:2]] of our original document. Optionally, we could also
|
||||
| choose to add more than one pattern, for example to also match sequences
|
||||
| without punctuation between "hello" and "world":
|
||||
| this case, #[code [('15578876784678163569', 0, 2)]], which maps to the
|
||||
| span #[code doc[0:2]] of our original document. The #[code match_id]
|
||||
| is the #[+a("/usage/spacy-101#vocab") hash value] of the string ID
|
||||
| "HelloWorld". To get the string value, you can look up the ID
|
||||
| in the #[+api("stringstore") #[code StringStore]].
|
||||
|
||||
+code.
|
||||
for match_id, start, end in matches:
|
||||
string_id = nlp.vocab.strings[match_id] # 'HelloWorld'
|
||||
span = doc[start:end] # the matched span
|
||||
|
||||
p
|
||||
| Optionally, we could also choose to add more than one pattern, for
|
||||
| example to also match sequences without punctuation between "hello" and
|
||||
| "world":
|
||||
|
||||
+code.
|
||||
matcher.add('HelloWorld', None,
|
||||
|
@ -91,6 +102,10 @@ p
|
|||
+cell.u-nowrap #[code LOWER]
|
||||
+cell The lowercase form of the token text.
|
||||
|
||||
+row
|
||||
+cell #[code LENGTH]
|
||||
+cell The length of the token text.
|
||||
|
||||
+row
|
||||
+cell.u-nowrap #[code IS_ALPHA], #[code IS_ASCII], #[code IS_DIGIT]
|
||||
+cell
|
||||
|
@ -117,6 +132,10 @@ p
|
|||
| The token's simple and extended part-of-speech tag, dependency
|
||||
| label, lemma, shape.
|
||||
|
||||
+row
|
||||
+cell.u-nowrap #[code ENT_TYPE]
|
||||
+cell The token's entity label.
|
||||
|
||||
+h(4, "adding-patterns-wildcard") Using wildcard token patterns
|
||||
+tag-new(2)
|
||||
|
||||
|
@ -335,7 +354,8 @@ p
|
|||
| flag.
|
||||
|
||||
+code.
|
||||
IS_DEFINITELY = nlp.vocab.add_flag(re.compile(r'deff?in[ia]tely').match)
|
||||
definitely_flag = lambda text: bool(re.compile(r'deff?in[ia]tely').match(text))
|
||||
IS_DEFINITELY = nlp.vocab.add_flag(definitely_flag)
|
||||
|
||||
matcher = Matcher(nlp.vocab)
|
||||
matcher.add('DEFINITELY', None, [{IS_DEFINITELY: True}])
|
||||
|
|
|
@ -54,7 +54,7 @@ p
|
|||
|
||||
+code.
|
||||
import spacy
|
||||
from spacy.symbols import ORTH, LEMMA, POS
|
||||
from spacy.symbols import ORTH, LEMMA, POS, TAG
|
||||
|
||||
nlp = spacy.load('en')
|
||||
doc = nlp(u'gimme that') # phrase to tokenize
|
||||
|
|
|
@ -31,3 +31,13 @@ p
|
|||
import spacy
|
||||
nlp = spacy.load('en')
|
||||
doc = nlp(u'This is a sentence.')
|
||||
|
||||
+infobox("Important note", "⚠️")
|
||||
| To allow loading models via convenient shortcuts like #[code 'en'], spaCy
|
||||
| will create a symlink within the #[code spacy/data] directory. This means
|
||||
| that your user needs the #[strong required permissions].
|
||||
| If you've installed spaCy to a system directory and don't have admin
|
||||
| privileges, the model linking may fail. The easiest solution
|
||||
| is to re-run the command as admin, or use a #[code virtualenv]. For more
|
||||
| info on this, see the
|
||||
| #[+a("/usage/#symlink-privilege") troubleshooting guide].
|
||||
|
|
|
@ -132,7 +132,7 @@ p
|
|||
# set up shortcut link to load local model as "my_amazing_model"
|
||||
python -m spacy link /Users/you/model my_amazing_model
|
||||
|
||||
+infobox("Important note")
|
||||
+infobox("Important note", "⚠️")
|
||||
| In order to create a symlink, your user needs the #[strong required permissions].
|
||||
| If you've installed spaCy to a system directory and don't have admin
|
||||
| privileges, the #[code spacy link] command may fail. The easiest solution
|
||||
|
|