Mirror of https://github.com/explosion/spaCy.git (synced 2024-11-11 20:28:20 +03:00)
Merge branch 'master' into spacy.io
This commit is contained in: commit 522a5ffbfe
.github/contributors/b1uec0in.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field                         | Entry        |
| ----------------------------- | ------------ |
| Name                          | Bae, Yong-Ju |
| Company name (if applicable)  |              |
| Title or role (if applicable) |              |
| Date                          | 2019-07-25   |
| GitHub username               | b1uec0in     |
| Website (optional)            |              |

CONTRIBUTING.md (184 lines changed)

@@ -2,12 +2,13 @@
# Contribute to spaCy

Thanks for your interest in contributing to spaCy 🎉 The project is maintained
by [@honnibal](https://github.com/honnibal) and [@ines](https://github.com/ines),
and we'll do our best to help you get started. This page will give you a quick
overview of how things are organised and most importantly, how to get involved.

## Table of contents

1. [Issues and bug reports](#issues-and-bug-reports)
2. [Contributing to the code base](#contributing-to-the-code-base)
3. [Code conventions](#code-conventions)

@@ -42,33 +43,33 @@ can also submit a [regression test](#fixing-bugs) straight away. When you're
opening an issue to report the bug, simply refer to your pull request in the
issue body. A few more tips:

-* **Describing your issue:** Try to provide as many details as possible. What
-exactly goes wrong? *How* is it failing? Is there an error?
-"XY doesn't work" usually isn't that helpful for tracking down problems. Always
-remember to include the code you ran and if possible, extract only the relevant
-parts and don't just dump your entire script. This will make it easier for us to
-reproduce the error.
+- **Describing your issue:** Try to provide as many details as possible. What
+exactly goes wrong? _How_ is it failing? Is there an error?
+"XY doesn't work" usually isn't that helpful for tracking down problems. Always
+remember to include the code you ran and if possible, extract only the relevant
+parts and don't just dump your entire script. This will make it easier for us to
+reproduce the error.

-* **Getting info about your spaCy installation and environment:** If you're
-using spaCy v1.7+, you can use the command line interface to print details and
-even format them as Markdown to copy-paste into GitHub issues:
-`python -m spacy info --markdown`.
+- **Getting info about your spaCy installation and environment:** If you're
+using spaCy v1.7+, you can use the command line interface to print details and
+even format them as Markdown to copy-paste into GitHub issues:
+`python -m spacy info --markdown`.

-* **Checking the model compatibility:** If you're having problems with a
-[statistical model](https://spacy.io/models), it may be because the
-model is incompatible with your spaCy installation. In spaCy v2.0+, you can check
-this on the command line by running `python -m spacy validate`.
+- **Checking the model compatibility:** If you're having problems with a
+[statistical model](https://spacy.io/models), it may be because the
+model is incompatible with your spaCy installation. In spaCy v2.0+, you can check
+this on the command line by running `python -m spacy validate`.

-* **Sharing a model's output, like dependencies and entities:** spaCy v2.0+
-comes with [built-in visualizers](https://spacy.io/usage/visualizers) that
-you can run from within your script or a Jupyter notebook. For some issues, it's
-helpful to **include a screenshot** of the visualization. You can simply drag and
-drop the image into GitHub's editor and it will be uploaded and included.
+- **Sharing a model's output, like dependencies and entities:** spaCy v2.0+
+comes with [built-in visualizers](https://spacy.io/usage/visualizers) that
+you can run from within your script or a Jupyter notebook. For some issues, it's
+helpful to **include a screenshot** of the visualization. You can simply drag and
+drop the image into GitHub's editor and it will be uploaded and included.

-* **Sharing long blocks of code or logs:** If you need to include long code,
-logs or tracebacks, you can wrap them in `<details>` and `</details>`. This
-[collapses the content](https://developer.mozilla.org/en/docs/Web/HTML/Element/details)
-so it only becomes visible on click, making the issue easier to read and follow.
+- **Sharing long blocks of code or logs:** If you need to include long code,
+logs or tracebacks, you can wrap them in `<details>` and `</details>`. This
+[collapses the content](https://developer.mozilla.org/en/docs/Web/HTML/Element/details)
+so it only becomes visible on click, making the issue easier to read and follow.

### Issue labels

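To make the two CLI commands referenced in the hunk above concrete, here is a minimal sketch (not part of this diff) that simply shells out to them from Python in the active environment:

```python
import subprocess
import sys

# Print installation details as Markdown, ready to paste into a GitHub issue (spaCy v1.7+).
subprocess.run([sys.executable, "-m", "spacy", "info", "--markdown"], check=True)

# Check that installed models are compatible with the installed spaCy version (spaCy v2.0+).
subprocess.run([sys.executable, "-m", "spacy", "validate"], check=True)
```
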
@@ -94,39 +95,39 @@ shipped in the core library, and what could be provided in other packages. Our
philosophy is to prefer a smaller core library. We generally ask the following
questions:

-* **What would this feature look like if implemented in a separate package?**
-Some features would be very difficult to implement externally – for example,
-changes to spaCy's built-in methods. In contrast, a library of word
-alignment functions could easily live as a separate package that depended on
-spaCy — there's little difference between writing `import word_aligner` and
-`import spacy.word_aligner`. spaCy v2.0+ makes it easy to implement
-[custom pipeline components](https://spacy.io/usage/processing-pipelines#custom-components),
-and add your own attributes, properties and methods to the `Doc`, `Token` and
-`Span`. If you're looking to implement a new spaCy feature, starting with a
-custom component package is usually the best strategy. You won't have to worry
-about spaCy's internals and you can test your module in an isolated
-environment. And if it works well, we can always integrate it into the core
-library later.
+- **What would this feature look like if implemented in a separate package?**
+Some features would be very difficult to implement externally – for example,
+changes to spaCy's built-in methods. In contrast, a library of word
+alignment functions could easily live as a separate package that depended on
+spaCy — there's little difference between writing `import word_aligner` and
+`import spacy.word_aligner`. spaCy v2.0+ makes it easy to implement
+[custom pipeline components](https://spacy.io/usage/processing-pipelines#custom-components),
+and add your own attributes, properties and methods to the `Doc`, `Token` and
+`Span`. If you're looking to implement a new spaCy feature, starting with a
+custom component package is usually the best strategy. You won't have to worry
+about spaCy's internals and you can test your module in an isolated
+environment. And if it works well, we can always integrate it into the core
+library later.

-* **Would the feature be easier to implement if it relied on "heavy" dependencies spaCy doesn't currently require?**
-Python has a very rich ecosystem. Libraries like scikit-learn, SciPy, Gensim or
-TensorFlow/Keras do lots of useful things — but we don't want to have them as
-dependencies. If the feature requires functionality in one of these libraries,
-it's probably better to break it out into a different package.
+- **Would the feature be easier to implement if it relied on "heavy" dependencies spaCy doesn't currently require?**
+Python has a very rich ecosystem. Libraries like scikit-learn, SciPy, Gensim or
+TensorFlow/Keras do lots of useful things — but we don't want to have them as
+dependencies. If the feature requires functionality in one of these libraries,
+it's probably better to break it out into a different package.

-* **Is the feature orthogonal to the current spaCy functionality, or overlapping?**
-spaCy strongly prefers to avoid having 6 different ways of doing the same thing.
-As better techniques are developed, we prefer to drop support for "the old way".
-However, it's rare that one approach *entirely* dominates another. It's very
-common that there's still a use-case for the "obsolete" approach. For instance,
-[WordNet](https://wordnet.princeton.edu/) is still very useful — but word
-vectors are better for most use-cases, and the two approaches to lexical
-semantics do a lot of the same things. spaCy therefore only supports word
-vectors, and support for WordNet is currently left for other packages.
+- **Is the feature orthogonal to the current spaCy functionality, or overlapping?**
+spaCy strongly prefers to avoid having 6 different ways of doing the same thing.
+As better techniques are developed, we prefer to drop support for "the old way".
+However, it's rare that one approach _entirely_ dominates another. It's very
+common that there's still a use-case for the "obsolete" approach. For instance,
+[WordNet](https://wordnet.princeton.edu/) is still very useful — but word
+vectors are better for most use-cases, and the two approaches to lexical
+semantics do a lot of the same things. spaCy therefore only supports word
+vectors, and support for WordNet is currently left for other packages.

-* **Do you need the feature to get basic things done?** We do want spaCy to be
-at least somewhat self-contained. If we keep needing some feature in our
-recipes, that does provide some argument for bringing it "in house".
+- **Do you need the feature to get basic things done?** We do want spaCy to be
+at least somewhat self-contained. If we keep needing some feature in our
+recipes, that does provide some argument for bringing it "in house".

### Getting started

@@ -155,7 +156,6 @@ Changes to `.py` files will be effective immediately.

📖 **For more details and instructions, see the documentation on [compiling spaCy from source](https://spacy.io/usage/#source) and the [quickstart widget](https://spacy.io/usage/#section-quickstart) to get the right commands for your platform and Python version.**

### Contributor agreement

If you've made a contribution to spaCy, you should fill in the

@@ -167,7 +167,6 @@ and include it with your pull request, or submit it separately to
your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

### Fixing bugs

When fixing a bug, first create an

@@ -199,7 +198,7 @@ modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**
[`black`](https://github.com/ambv/black) is an opinionated Python code
formatter, optimised to produce readable code and small diffs. You can run
`black` from the command-line, or via your code editor. For example, if you're
using [Visual Studio Code](https://code.visualstudio.com/), you can add the
following to your `settings.json` to use `black` for formatting and auto-format
your files on save:

@@ -415,11 +414,10 @@ Python. If it's not fast enough the first time, just switch to Cython.

### Resources to get you started

-* [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
-* [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
-* [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
-* [Multi-threading spaCy’s parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
+- [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
+- [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
+- [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
+- [Multi-threading spaCy’s parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)

## Adding tests

@@ -444,66 +442,40 @@ use the `get_doc()` utility function to construct it manually.

📖 **For more guidelines and information on how to add tests, check out the [tests README](spacy/tests/README.md).**

## Updating the website

-For instructions on how to build and run the [website](https://spacy.io) locally see **[Setup and installation](https://github.com/explosion/spaCy/blob/master/website/README.md#setup-and-installation-setup)** in the *website* directory's README.
+For instructions on how to build and run the [website](https://spacy.io) locally see **[Setup and installation](https://github.com/explosion/spaCy/blob/master/website/README.md#setup-and-installation-setup)** in the _website_ directory's README.

The docs can always use another example or more detail, and they should always
be up to date and not misleading. To quickly find the correct file to edit,
-simply click on the "Suggest edits" button at the bottom of a page. To keep
-long pages maintainable, and allow including content in several places without
-doubling it, sections often consist of partials. Partials and partial directories
-are prefixed by an underscore `_` so they're not compiled with the site. For
-example:
-
-```pug
-+section("tokenization")
-    +h(2, "tokenization") Tokenization
-        include _spacy-101/_tokenization
-```
-
-So if you're looking to edit the content of the tokenization section, you can
-find it in `_spacy-101/_tokenization.jade`. To make it easy to add content
-components, we use a [collection of custom mixins](_includes/_mixins.jade),
-like `+table`, `+list` or `+code`. For an overview of the available mixins and
-components, see the [styleguide](https://spacy.io/styleguide).
+simply click on the "Suggest edits" button at the bottom of a page.

📖 **For more info and troubleshooting guides, check out the [website README](website).**

-### Resources to get you started
-
-* [Guide to static websites with Harp and Jade](https://ines.io/blog/the-ultimate-guide-static-websites-harp-jade) (ines.io)
-* [Building a website with modular markup components (mixins)](https://explosion.ai/blog/modular-markup) (explosion.ai)
-* [spacy.io Styleguide](https://spacy.io/styleguide) (spacy.io)
-* [Jade/Pug documentation](https://pugjs.org) (pugjs.org)
-* [Harp documentation](https://harpjs.com/) (harpjs.com)

## Publishing spaCy extensions and plugins

We're very excited about all the new possibilities for **community extensions**
and plugins in spaCy v2.0, and we can't wait to see what you build with it!

-* An extension or plugin should add substantial functionality, be
-**well-documented** and **open-source**. It should be available for users to download
-and install as a Python package – for example via [PyPi](http://pypi.python.org).
+- An extension or plugin should add substantial functionality, be
+**well-documented** and **open-source**. It should be available for users to download
+and install as a Python package – for example via [PyPi](http://pypi.python.org).

-* Extensions that write to `Doc`, `Token` or `Span` attributes should be wrapped
-as [pipeline components](https://spacy.io/usage/processing-pipelines#custom-components)
-that users can **add to their processing pipeline** using `nlp.add_pipe()`.
+- Extensions that write to `Doc`, `Token` or `Span` attributes should be wrapped
+as [pipeline components](https://spacy.io/usage/processing-pipelines#custom-components)
+that users can **add to their processing pipeline** using `nlp.add_pipe()`.

-* When publishing your extension on GitHub, **tag it** with the topics
-[`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
-[`spacy-extensions`](https://github.com/topics/spacy-extension?o=desc&s=stars)
-to make it easier to find. Those are also the topics we're linking to from the
-spaCy website. If you're sharing your project on Twitter, feel free to tag
-[@spacy_io](https://twitter.com/spacy_io) so we can check it out.
+- When publishing your extension on GitHub, **tag it** with the topics
+[`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
+[`spacy-extensions`](https://github.com/topics/spacy-extension?o=desc&s=stars)
+to make it easier to find. Those are also the topics we're linking to from the
+spaCy website. If you're sharing your project on Twitter, feel free to tag
+[@spacy_io](https://twitter.com/spacy_io) so we can check it out.

-* Once your extension is published, you can open an issue on the
-[issue tracker](https://github.com/explosion/spacy/issues) to suggest it for the
-[resources directory](https://spacy.io/usage/resources#extensions) on the
-website.
+- Once your extension is published, you can open an issue on the
+[issue tracker](https://github.com/explosion/spacy/issues) to suggest it for the
+[resources directory](https://spacy.io/usage/resources#extensions) on the
+website.

📖 **For more tips and best practices, see the [checklist for developing spaCy extensions](https://spacy.io/usage/processing-pipelines#extensions).**

spacy/about.py

@@ -4,13 +4,13 @@
# fmt: off

__title__ = "spacy"
-__version__ = "2.1.6"
+__version__ = "2.1.7.dev0"
__summary__ = "Industrial-strength Natural Language Processing (NLP) with Python and Cython"
__uri__ = "https://spacy.io"
__author__ = "Explosion AI"
__email__ = "contact@explosion.ai"
__license__ = "MIT"
-__release__ = True
+__release__ = False

__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"

spacy/cli/download.py

@@ -19,7 +19,7 @@ msg = Printer()
@plac.annotations(
    model=("Model to download (shortcut or name)", "positional", None, str),
    direct=("Force direct download of name + version", "flag", "d", bool),
-    pip_args=("additional arguments to be passed to `pip install` on model install"),
+    pip_args=("Additional arguments to be passed to `pip install` on model install"),
)
def download(model, direct=False, *pip_args):
    """

spacy/cli/package.py

@@ -150,6 +150,8 @@ def list_requirements(meta):
    requirements = [parent_package + meta['spacy_version']]
    if 'setup_requires' in meta:
        requirements += meta['setup_requires']
+    if 'requirements' in meta:
+        requirements += meta['requirements']
    return requirements

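For context (not part of the diff): the new `requirements` key is read from the model's `meta.json`. With a hypothetical meta like the one below (all values invented), the extra pins would now be passed through as well:

```python
# Hypothetical meta.json contents, shown as a Python dict:
meta = {
    "lang": "en",
    "name": "core_web_sm",
    "spacy_version": ">=2.1.0",
    "setup_requires": ["wheel"],
    "requirements": ["some_extra_package>=1.0.0,<2.0.0"],
}

# Assuming the surrounding code uses "spacy" as the parent package,
# list_requirements(meta) would return:
#   ["spacy>=2.1.0", "wheel", "some_extra_package>=1.0.0,<2.0.0"]
```
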
spacy/lang/ko/__init__.py

@@ -51,12 +51,15 @@ def try_mecab_import():


def check_spaces(text, tokens):
-    token_pattern = re.compile(r"\s?".join(f"({t})" for t in tokens))
-    m = token_pattern.match(text)
-    if m is not None:
-        for i in range(1, m.lastindex):
-            yield m.end(i) < m.start(i + 1)
-        yield False
+    prev_end = -1
+    start = 0
+    for token in tokens:
+        idx = text.find(token, start)
+        if prev_end > 0:
+            yield prev_end != idx
+        prev_end = idx + len(token)
+        start = prev_end
+    yield False


class KoreanTokenizer(DummyTokenizer):

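To illustrate what the rewritten generator computes (not part of the diff): it yields one boolean per token, True if that token is followed by whitespace in the original text, with the final token always yielding False. A standalone copy of the same logic:

```python
def check_spaces(text, tokens):
    # Mirror of the new implementation above.
    prev_end = -1
    start = 0
    for token in tokens:
        idx = text.find(token, start)
        if prev_end > 0:
            yield prev_end != idx
        prev_end = idx + len(token)
        start = prev_end
    yield False

print(list(check_spaces("a bc", ["a", "b", "c"])))  # [True, False, False]
```
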
spacy/language.py

@@ -618,7 +618,7 @@ class Language(object):
        if component_cfg is None:
            component_cfg = {}
        docs, golds = zip(*docs_golds)
-        docs = list(docs)
+        docs = [self.make_doc(doc) if isinstance(doc, basestring_) else doc for doc in docs]
        golds = list(golds)
        for name, pipe in self.pipeline:
            kwargs = component_cfg.get(name, {})

@@ -628,6 +628,8 @@ class Language(object):
            else:
                docs = pipe.pipe(docs, **kwargs)
        for doc, gold in zip(docs, golds):
+            if not isinstance(gold, GoldParse):
+                gold = GoldParse(doc, **gold)
            if verbose:
                print(doc)
            kwargs = component_cfg.get("scorer", {})

spacy/pipeline/pipes.pyx

@@ -83,8 +83,12 @@ class Pipe(object):
        """
        for docs in util.minibatch(stream, size=batch_size):
            docs = list(docs)
-            scores, tensors = self.predict(docs)
-            self.set_annotations(docs, scores, tensor=tensors)
+            predictions = self.predict(docs)
+            if isinstance(predictions, tuple) and len(predictions) == 2:
+                scores, tensors = predictions
+                self.set_annotations(docs, scores, tensor=tensors)
+            else:
+                self.set_annotations(docs, predictions)
            yield from docs

    def predict(self, docs):

@@ -104,8 +108,7 @@ class Pipe(object):

        Delegates to predict() and get_loss().
        """
-        self.require_model()
-        raise NotImplementedError
+        pass

    def rehearse(self, docs, sgd=None, losses=None, **config):
        pass

@@ -134,7 +137,8 @@ class Pipe(object):
        If no model has been initialized yet, the model is added."""
        if self.model is True:
            self.model = self.Model(**self.cfg)
-        link_vectors_to_models(self.vocab)
+        if hasattr(self, "vocab"):
+            link_vectors_to_models(self.vocab)
        if sgd is None:
            sgd = self.create_optimizer()
        return sgd

@@ -154,7 +158,8 @@ class Pipe(object):
        serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
        if self.model not in (True, False, None):
            serialize["model"] = self.model.to_bytes
-        serialize["vocab"] = self.vocab.to_bytes
+        if hasattr(self, "vocab"):
+            serialize["vocab"] = self.vocab.to_bytes
        exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
        return util.to_bytes(serialize, exclude)

@@ -174,7 +179,8 @@ class Pipe(object):

        deserialize = OrderedDict()
        deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
-        deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
+        if hasattr(self, "vocab"):
+            deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
        deserialize["model"] = load_model
        exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
        util.from_bytes(bytes_data, deserialize, exclude)

spacy/tests/lang/ko/test_tokenizer.py

@@ -5,7 +5,8 @@ import pytest

# fmt: off
TOKENIZER_TESTS = [("서울 타워 근처에 살고 있습니다.", "서울 타워 근처 에 살 고 있 습니다 ."),
-                   ("영등포구에 있는 맛집 좀 알려주세요.", "영등포구 에 있 는 맛집 좀 알려 주 세요 .")]
+                   ("영등포구에 있는 맛집 좀 알려주세요.", "영등포구 에 있 는 맛집 좀 알려 주 세요 ."),
+                   ("10$ 할인코드를 적용할까요?", "10 $ 할인 코드 를 적용 할까요 ?")]

TAG_TESTS = [("서울 타워 근처에 살고 있습니다.",
              "NNP NNG NNG JKB VV EC VX EF SF"),

spacy/tests/test_language.py (new file)

@@ -0,0 +1,57 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest
from spacy.vocab import Vocab
from spacy.language import Language
from spacy.tokens import Doc
from spacy.gold import GoldParse


@pytest.fixture
def nlp():
    nlp = Language(Vocab())
    textcat = nlp.create_pipe("textcat")
    for label in ("POSITIVE", "NEGATIVE"):
        textcat.add_label(label)
    nlp.add_pipe(textcat)
    nlp.begin_training()
    return nlp


def test_language_update(nlp):
    text = "hello world"
    annots = {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}
    doc = Doc(nlp.vocab, words=text.split(" "))
    gold = GoldParse(doc, **annots)
    # Update with doc and gold objects
    nlp.update([doc], [gold])
    # Update with text and dict
    nlp.update([text], [annots])
    # Update with doc object and dict
    nlp.update([doc], [annots])
    # Update with text and gold object
    nlp.update([text], [gold])
    # Update badly
    with pytest.raises(IndexError):
        nlp.update([doc], [])
    with pytest.raises(IndexError):
        nlp.update([], [gold])


def test_language_evaluate(nlp):
    text = "hello world"
    annots = {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}
    doc = Doc(nlp.vocab, words=text.split(" "))
    gold = GoldParse(doc, **annots)
    # Evaluate with doc and gold objects
    nlp.evaluate([(doc, gold)])
    # Evaluate with text and dict
    nlp.evaluate([(text, annots)])
    # Evaluate with doc object and dict
    nlp.evaluate([(doc, annots)])
    # Evaluate with text and gold object
    nlp.evaluate([(text, gold)])
    # Evaluate badly
    with pytest.raises(Exception):
        nlp.evaluate([text, gold])

spacy/tokens/span.pyx

@@ -311,7 +311,7 @@ cdef class Span:
        DOCS: https://spacy.io/api/span#similarity
        """
        if "similarity" in self.doc.user_span_hooks:
-            self.doc.user_span_hooks["similarity"](self, other)
+            return self.doc.user_span_hooks["similarity"](self, other)
        if len(self) == 1 and hasattr(other, "orth"):
            if self[0].orth == other.orth:
                return 1.0

spacy/tokens/token.pyx

@@ -202,7 +202,7 @@ cdef class Token:
        DOCS: https://spacy.io/api/token#similarity
        """
        if "similarity" in self.doc.user_token_hooks:
-            return self.doc.user_token_hooks["similarity"](self)
+            return self.doc.user_token_hooks["similarity"](self, other)
        if hasattr(other, "__len__") and len(other) == 1 and hasattr(other, "__getitem__"):
            if self.c.lex.orth == getattr(other[0], "orth", None):
                return 1.0

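Both one-line fixes matter when user hooks are installed, for example by a pipeline component that overrides similarity. A minimal sketch (not part of the diff; the constant-value hooks are obviously artificial):

```python
import spacy

def install_similarity_hooks(doc):
    # With the fixes above, the hooks' return values are actually propagated
    # by Span.similarity and Token.similarity.
    doc.user_span_hooks["similarity"] = lambda span, other: 0.5
    doc.user_token_hooks["similarity"] = lambda token, other: 0.5
    return doc

nlp = spacy.blank("en")
nlp.add_pipe(install_similarity_hooks)
doc = nlp("hello world")
print(doc[0:1].similarity(doc[1:2]))  # 0.5
print(doc[0].similarity(doc[1]))      # 0.5
```
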
spacy/util.py

@@ -160,7 +160,10 @@ def load_model_from_path(model_path, meta=False, **overrides):
    pipeline from meta.json and then calls from_disk() with path."""
    if not meta:
        meta = get_model_meta(model_path)
-    cls = get_lang_class(meta["lang"])
+    # Support language factories registered via entry points (e.g. custom
+    # language subclass) while keeping top-level language identifier "lang"
+    lang = meta.get("lang_factory", meta["lang"])
+    cls = get_lang_class(lang)
    nlp = cls(meta=meta, **overrides)
    pipeline = meta.get("pipeline", [])
    disable = overrides.get("disable", [])

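A hedged illustration of the new lookup (not part of the diff; the meta values below are hypothetical): a model package can point `lang_factory` at a custom language subclass registered via entry points while keeping a plain language code in `lang`.

```python
# Hypothetical excerpt of a model's meta.json, shown as a Python dict:
meta = {
    "lang": "en",                 # top-level language identifier, unchanged
    "lang_factory": "custom_en",  # assumed name of an entry-point-registered Language subclass
    "pipeline": ["tagger", "parser", "ner"],
}

# With the change above, load_model_from_path() resolves the Language class via
# get_lang_class("custom_en") instead of get_lang_class("en").
```
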
website/docs/api/language.md

@@ -133,13 +133,13 @@ Evaluate a model's pipeline components.
> print(scorer.scores)
> ```

-| Name | Type | Description |
-| -------------------------------------------- | -------- | ------------------------------------------------------------------------------------- |
-| `docs_golds` | iterable | Tuples of `Doc` and `GoldParse` objects. |
-| `verbose` | bool | Print debugging information. |
-| `batch_size` | int | The batch size to use. |
-| `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. |
-| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
+| Name | Type | Description |
+| -------------------------------------------- | -------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `docs_golds` | iterable | Tuples of `Doc` and `GoldParse` objects or `(text, annotations)` of raw text and a dict (see [simple training style](/usage/training#training-simple-style)). |
+| `verbose` | bool | Print debugging information. |
+| `batch_size` | int | The batch size to use. |
+| `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. |
+| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |

## Language.begin_training {#begin_training tag="method"}

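A short usage sketch of the newly documented input format, mirroring the test added above (not part of the diff; the texts and labels are made up):

```python
from spacy.vocab import Vocab
from spacy.language import Language

nlp = Language(Vocab())
textcat = nlp.create_pipe("textcat")
for label in ("POSITIVE", "NEGATIVE"):
    textcat.add_label(label)
nlp.add_pipe(textcat)
nlp.begin_training()

# Tuples of raw text and an annotations dict, no GoldParse needed:
scorer = nlp.evaluate([
    ("This is great", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("This is terrible", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
])
print(scorer.scores)
```
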
website/docs/api/span.md

@@ -258,7 +258,7 @@ Retokenize the document, such that the span is merged into a single token.
| `**attributes` | - | Attributes to assign to the merged token. By default, attributes are inherited from the syntactic root token of the span. |
| **RETURNS** | `Token` | The newly merged token. |

-## Span.ents {#ents tag="property" new="2.0.12" model="ner"}
+## Span.ents {#ents tag="property" new="2.0.13" model="ner"}

The named entities in the span. Returns a tuple of named entity `Span` objects,
if the entity recognizer has been applied.