Merge branch 'master' into spacy.io

Ines Montani 2019-07-30 10:33:09 +02:00
commit 522a5ffbfe
15 changed files with 287 additions and 135 deletions

.github/contributors/b1uec0in.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Bae, Yong-Ju |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-07-25 |
| GitHub username | b1uec0in |
| Website (optional) | |

CONTRIBUTING.md

@@ -8,6 +8,7 @@ and we'll do our best to help you get started. This page will give you a quick
overview of how things are organised and most importantly, how to get involved.
## Table of contents
1. [Issues and bug reports](#issues-and-bug-reports)
2. [Contributing to the code base](#contributing-to-the-code-base)
3. [Code conventions](#code-conventions)
@@ -42,30 +43,30 @@ can also submit a [regression test](#fixing-bugs) straight away. When you're
opening an issue to report the bug, simply refer to your pull request in the
issue body. A few more tips:
* **Describing your issue:** Try to provide as many details as possible. What
exactly goes wrong? *How* is it failing? Is there an error?
- **Describing your issue:** Try to provide as many details as possible. What
exactly goes wrong? _How_ is it failing? Is there an error?
"XY doesn't work" usually isn't that helpful for tracking down problems. Always
remember to include the code you ran and if possible, extract only the relevant
parts and don't just dump your entire script. This will make it easier for us to
reproduce the error.
* **Getting info about your spaCy installation and environment:** If you're
- **Getting info about your spaCy installation and environment:** If you're
using spaCy v1.7+, you can use the command line interface to print details and
even format them as Markdown to copy-paste into GitHub issues:
`python -m spacy info --markdown`.
* **Checking the model compatibility:** If you're having problems with a
- **Checking the model compatibility:** If you're having problems with a
[statistical model](https://spacy.io/models), it may be because the
model is incompatible with your spaCy installation. In spaCy v2.0+, you can check
this on the command line by running `python -m spacy validate`.
* **Sharing a model's output, like dependencies and entities:** spaCy v2.0+
- **Sharing a model's output, like dependencies and entities:** spaCy v2.0+
comes with [built-in visualizers](https://spacy.io/usage/visualizers) that
you can run from within your script or a Jupyter notebook. For some issues, it's
helpful to **include a screenshot** of the visualization. You can simply drag and
drop the image into GitHub's editor and it will be uploaded and included.
* **Sharing long blocks of code or logs:** If you need to include long code,
- **Sharing long blocks of code or logs:** If you need to include long code,
logs or tracebacks, you can wrap them in `<details>` and `</details>`. This
[collapses the content](https://developer.mozilla.org/en/docs/Web/HTML/Element/details)
so it only becomes visible on click, making the issue easier to read and follow.
@@ -94,7 +95,7 @@ shipped in the core library, and what could be provided in other packages. Our
philosophy is to prefer a smaller core library. We generally ask the following
questions:
* **What would this feature look like if implemented in a separate package?**
- **What would this feature look like if implemented in a separate package?**
Some features would be very difficult to implement externally, for example,
changes to spaCy's built-in methods. In contrast, a library of word
alignment functions could easily live as a separate package that depended on
@@ -108,23 +109,23 @@ about spaCy's internals and you can test your module in an isolated
environment. And if it works well, we can always integrate it into the core
library later.
* **Would the feature be easier to implement if it relied on "heavy" dependencies spaCy doesn't currently require?**
- **Would the feature be easier to implement if it relied on "heavy" dependencies spaCy doesn't currently require?**
Python has a very rich ecosystem. Libraries like scikit-learn, SciPy, Gensim or
TensorFlow/Keras do lots of useful things — but we don't want to have them as
dependencies. If the feature requires functionality in one of these libraries,
it's probably better to break it out into a different package.
* **Is the feature orthogonal to the current spaCy functionality, or overlapping?**
- **Is the feature orthogonal to the current spaCy functionality, or overlapping?**
spaCy strongly prefers to avoid having 6 different ways of doing the same thing.
As better techniques are developed, we prefer to drop support for "the old way".
However, it's rare that one approach *entirely* dominates another. It's very
However, it's rare that one approach _entirely_ dominates another. It's very
common that there's still a use-case for the "obsolete" approach. For instance,
[WordNet](https://wordnet.princeton.edu/) is still very useful — but word
vectors are better for most use-cases, and the two approaches to lexical
semantics do a lot of the same things. spaCy therefore only supports word
vectors, and support for WordNet is currently left for other packages.
* **Do you need the feature to get basic things done?** We do want spaCy to be
- **Do you need the feature to get basic things done?** We do want spaCy to be
at least somewhat self-contained. If we keep needing some feature in our
recipes, that does provide some argument for bringing it "in house".
@@ -155,7 +156,6 @@ Changes to `.py` files will be effective immediately.
📖 **For more details and instructions, see the documentation on [compiling spaCy from source](https://spacy.io/usage/#source) and the [quickstart widget](https://spacy.io/usage/#section-quickstart) to get the right commands for your platform and Python version.**
### Contributor agreement
If you've made a contribution to spaCy, you should fill in the
@@ -167,7 +167,6 @@ and include it with your pull request, or submit it separately to
your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
### Fixing bugs
When fixing a bug, first create an
@@ -415,11 +414,10 @@ Python. If it's not fast enough the first time, just switch to Cython.
### Resources to get you started
* [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
* [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
* [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
* [Multi-threading spaCy's parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
- [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
- [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
- [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
- [Multi-threading spaCy's parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
## Adding tests
@@ -444,63 +442,37 @@ use the `get_doc()` utility function to construct it manually.
📖 **For more guidelines and information on how to add tests, check out the [tests README](spacy/tests/README.md).**
## Updating the website
For instructions on how to build and run the [website](https://spacy.io) locally see **[Setup and installation](https://github.com/explosion/spaCy/blob/master/website/README.md#setup-and-installation-setup)** in the *website* directory's README.
For instructions on how to build and run the [website](https://spacy.io) locally see **[Setup and installation](https://github.com/explosion/spaCy/blob/master/website/README.md#setup-and-installation-setup)** in the _website_ directory's README.
The docs can always use another example or more detail, and they should always
be up to date and not misleading. To quickly find the correct file to edit,
simply click on the "Suggest edits" button at the bottom of a page. To keep
long pages maintainable, and allow including content in several places without
doubling it, sections often consist of partials. Partials and partial directories
are prefixed by an underscore `_` so they're not compiled with the site. For
example:
```pug
+section("tokenization")
    +h(2, "tokenization") Tokenization
        include _spacy-101/_tokenization
```
So if you're looking to edit the content of the tokenization section, you can
find it in `_spacy-101/_tokenization.jade`. To make it easy to add content
components, we use a [collection of custom mixins](_includes/_mixins.jade),
like `+table`, `+list` or `+code`. For an overview of the available mixins and
components, see the [styleguide](https://spacy.io/styleguide).
simply click on the "Suggest edits" button at the bottom of a page.
📖 **For more info and troubleshooting guides, check out the [website README](website).**
### Resources to get you started
* [Guide to static websites with Harp and Jade](https://ines.io/blog/the-ultimate-guide-static-websites-harp-jade) (ines.io)
* [Building a website with modular markup components (mixins)](https://explosion.ai/blog/modular-markup) (explosion.ai)
* [spacy.io Styleguide](https://spacy.io/styleguide) (spacy.io)
* [Jade/Pug documentation](https://pugjs.org) (pugjs.org)
* [Harp documentation](https://harpjs.com/) (harpjs.com)
## Publishing spaCy extensions and plugins
We're very excited about all the new possibilities for **community extensions**
and plugins in spaCy v2.0, and we can't wait to see what you build with it!
* An extension or plugin should add substantial functionality, be
- An extension or plugin should add substantial functionality, be
**well-documented** and **open-source**. It should be available for users to download
and install as a Python package for example via [PyPi](http://pypi.python.org).
* Extensions that write to `Doc`, `Token` or `Span` attributes should be wrapped
- Extensions that write to `Doc`, `Token` or `Span` attributes should be wrapped
as [pipeline components](https://spacy.io/usage/processing-pipelines#custom-components)
that users can **add to their processing pipeline** using `nlp.add_pipe()` (see the sketch after this list).
* When publishing your extension on GitHub, **tag it** with the topics
- When publishing your extension on GitHub, **tag it** with the topics
[`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
[`spacy-extensions`](https://github.com/topics/spacy-extension?o=desc&s=stars)
to make it easier to find. Those are also the topics we're linking to from the
spaCy website. If you're sharing your project on Twitter, feel free to tag
[@spacy_io](https://twitter.com/spacy_io) so we can check it out.
* Once your extension is published, you can open an issue on the
- Once your extension is published, you can open an issue on the
[issue tracker](https://github.com/explosion/spacy/issues) to suggest it for the
[resources directory](https://spacy.io/usage/resources#extensions) on the
website.
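As a minimal sketch of such a wrapped component (the attribute and component names here are hypothetical, not from any published extension):

```python
import spacy
from spacy.tokens import Doc

# Hypothetical custom attribute; real extensions should pick unambiguous names.
Doc.set_extension("my_flag", default=False)

def my_component(doc):
    # Set the custom attribute so downstream code can read doc._.my_flag.
    doc._.my_flag = True
    return doc

nlp = spacy.blank("en")
nlp.add_pipe(my_component, name="my_component", last=True)
doc = nlp("This text runs through the custom component.")
assert doc._.my_flag
```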

spacy/about.py

@@ -4,13 +4,13 @@
# fmt: off
__title__ = "spacy"
__version__ = "2.1.6"
__version__ = "2.1.7.dev0"
__summary__ = "Industrial-strength Natural Language Processing (NLP) with Python and Cython"
__uri__ = "https://spacy.io"
__author__ = "Explosion AI"
__email__ = "contact@explosion.ai"
__license__ = "MIT"
__release__ = True
__release__ = False
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"

spacy/cli/download.py

@@ -19,7 +19,7 @@ msg = Printer()
@plac.annotations(
model=("Model to download (shortcut or name)", "positional", None, str),
direct=("Force direct download of name + version", "flag", "d", bool),
pip_args=("additional arguments to be passed to `pip install` on model install"),
pip_args=("Additional arguments to be passed to `pip install` on model install"),
)
def download(model, direct=False, *pip_args):
"""

spacy/cli/package.py

@@ -150,6 +150,8 @@ def list_requirements(meta):
requirements = [parent_package + meta['spacy_version']]
if 'setup_requires' in meta:
requirements += meta['setup_requires']
if 'requirements' in meta:
requirements += meta['requirements']
return requirements
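For illustration, a meta dict that exercises the new branch (the extra pinned package is hypothetical):

```python
meta = {
    "spacy_version": ">=2.1.0",
    "setup_requires": ["wheel"],
    "requirements": ["regex==2019.02.07"],  # hypothetical extra requirement
}
# With the change above, list_requirements(meta) appends the "requirements"
# entries after the spaCy version pin and any "setup_requires" packages,
# giving roughly ["spacy>=2.1.0", "wheel", "regex==2019.02.07"].
```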

spacy/lang/ko/__init__.py

@@ -51,11 +51,14 @@ def try_mecab_import():
def check_spaces(text, tokens):
token_pattern = re.compile(r"\s?".join(f"({t})" for t in tokens))
m = token_pattern.match(text)
if m is not None:
for i in range(1, m.lastindex):
yield m.end(i) < m.start(i + 1)
prev_end = -1
start = 0
for token in tokens:
idx = text.find(token, start)
if prev_end > 0:
yield prev_end != idx
prev_end = idx + len(token)
start = prev_end
yield False
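The switch from one big regex to `str.find()` matters for tokens that contain regex metacharacters such as `$` or `?` (see the new tokenizer test below). A standalone sketch of the new helper, with example strings of ours:

```python
def check_spaces(text, tokens):
    # Yield True for each token that is followed by whitespace in `text`,
    # and False for the final token. Tokens are matched left to right with
    # str.find(), so metacharacters like "$" cannot break the matching.
    prev_end = -1
    start = 0
    for token in tokens:
        idx = text.find(token, start)
        if prev_end > 0:
            yield prev_end != idx
        prev_end = idx + len(token)
        start = prev_end
    yield False

print(list(check_spaces("10$ 할인", ["10", "$", "할인"])))
# -> [False, True, False]: no space after "10", a space after "$".
```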

spacy/language.py

@@ -618,7 +618,7 @@ class Language(object):
if component_cfg is None:
component_cfg = {}
docs, golds = zip(*docs_golds)
docs = list(docs)
docs = [self.make_doc(doc) if isinstance(doc, basestring_) else doc for doc in docs]
golds = list(golds)
for name, pipe in self.pipeline:
kwargs = component_cfg.get(name, {})
@@ -628,6 +628,8 @@
else:
docs = pipe.pipe(docs, **kwargs)
for doc, gold in zip(docs, golds):
if not isinstance(gold, GoldParse):
gold = GoldParse(doc, **gold)
if verbose:
print(doc)
kwargs = component_cfg.get("scorer", {})

spacy/pipeline/pipes.pyx

@@ -83,8 +83,12 @@ class Pipe(object):
"""
for docs in util.minibatch(stream, size=batch_size):
docs = list(docs)
scores, tensors = self.predict(docs)
predictions = self.predict(docs)
if isinstance(predictions, tuple) and len(predictions) == 2:
scores, tensors = predictions
self.set_annotations(docs, scores, tensors=tensors)
else:
self.set_annotations(docs, predictions)
yield from docs
def predict(self, docs):
@@ -104,8 +108,7 @@
Delegates to predict() and get_loss().
"""
self.require_model()
raise NotImplementedError
pass
def rehearse(self, docs, sgd=None, losses=None, **config):
pass
@@ -134,6 +137,7 @@ class Pipe(object):
If no model has been initialized yet, the model is added."""
if self.model is True:
self.model = self.Model(**self.cfg)
if hasattr(self, "vocab"):
link_vectors_to_models(self.vocab)
if sgd is None:
sgd = self.create_optimizer()
@@ -154,6 +158,7 @@
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
if self.model not in (True, False, None):
serialize["model"] = self.model.to_bytes
if hasattr(self, "vocab"):
serialize["vocab"] = self.vocab.to_bytes
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
return util.to_bytes(serialize, exclude)
@@ -174,6 +179,7 @@
deserialize = OrderedDict()
deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
if hasattr(self, "vocab"):
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
deserialize["model"] = load_model
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
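A standalone sketch of the `predict()` dispatch that `Pipe.pipe()` performs after this change (the function and values are illustrative, not spaCy's real pipes):

```python
def dispatch(predictions):
    # Mirrors the branch in Pipe.pipe(): a (scores, tensors) pair is
    # unpacked and both are passed on; anything else is plain scores.
    if isinstance(predictions, tuple) and len(predictions) == 2:
        scores, tensors = predictions
        return "scores + tensors", scores, tensors
    return "scores only", predictions

print(dispatch(([0.9], [[0.1, 0.2]])))  # a pipe that returns both
print(dispatch([0.9]))                  # a pipe that returns scores only
```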

spacy/tests/lang/ko/test_tokenizer.py

@@ -5,7 +5,8 @@ import pytest
# fmt: off
TOKENIZER_TESTS = [("서울 타워 근처에 살고 있습니다.", "서울 타워 근처 에 살 고 있 습니다 ."),
("영등포구에 있는 맛집 좀 알려주세요.", "영등포구 에 있 는 맛집 좀 알려 주 세요 .")]
("영등포구에 있는 맛집 좀 알려주세요.", "영등포구 에 있 는 맛집 좀 알려 주 세요 ."),
("10$ 할인코드를 적용할까요?", "10 $ 할인 코드 를 적용 할까요 ?")]
TAG_TESTS = [("서울 타워 근처에 살고 있습니다.",
"NNP NNG NNG JKB VV EC VX EF SF"),

spacy/tests/test_language.py (new file)

@@ -0,0 +1,57 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.vocab import Vocab
from spacy.language import Language
from spacy.tokens import Doc
from spacy.gold import GoldParse
@pytest.fixture
def nlp():
nlp = Language(Vocab())
textcat = nlp.create_pipe("textcat")
for label in ("POSITIVE", "NEGATIVE"):
textcat.add_label(label)
nlp.add_pipe(textcat)
nlp.begin_training()
return nlp
def test_language_update(nlp):
text = "hello world"
annots = {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}
doc = Doc(nlp.vocab, words=text.split(" "))
gold = GoldParse(doc, **annots)
# Update with doc and gold objects
nlp.update([doc], [gold])
# Update with text and dict
nlp.update([text], [annots])
# Update with doc object and dict
nlp.update([doc], [annots])
# Update with text and gold object
nlp.update([text], [gold])
# Update badly
with pytest.raises(IndexError):
nlp.update([doc], [])
with pytest.raises(IndexError):
nlp.update([], [gold])
def test_language_evaluate(nlp):
text = "hello world"
annots = {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}
doc = Doc(nlp.vocab, words=text.split(" "))
gold = GoldParse(doc, **annots)
# Evaluate with doc and gold objects
nlp.evaluate([(doc, gold)])
# Evaluate with text and dict
nlp.evaluate([(text, annots)])
# Evaluate with doc object and dict
nlp.evaluate([(doc, annots)])
# Evaluate with text and gold object
nlp.evaluate([(text, gold)])
# Evaluate badly
with pytest.raises(Exception):
nlp.evaluate([text, gold])

spacy/tokens/span.pyx

@@ -311,7 +311,7 @@ cdef class Span:
DOCS: https://spacy.io/api/span#similarity
"""
if "similarity" in self.doc.user_span_hooks:
self.doc.user_span_hooks["similarity"](self, other)
return self.doc.user_span_hooks["similarity"](self, other)
if len(self) == 1 and hasattr(other, "orth"):
if self[0].orth == other.orth:
return 1.0
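A sketch of the hook path this fixes: before the added `return`, a user-installed hook ran but its result was discarded (the hook and its value are ours):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("hello world")

def fixed_similarity(span, other):
    return 0.42  # arbitrary illustrative value

doc.user_span_hooks["similarity"] = fixed_similarity
# With the fix, the hook's return value is propagated:
print(doc[0:2].similarity(doc[0:1]))  # 0.42
```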

spacy/tokens/token.pyx

@@ -202,7 +202,7 @@ cdef class Token:
DOCS: https://spacy.io/api/token#similarity
"""
if "similarity" in self.doc.user_token_hooks:
return self.doc.user_token_hooks["similarity"](self)
return self.doc.user_token_hooks["similarity"](self, other)
if hasattr(other, "__len__") and len(other) == 1 and hasattr(other, "__getitem__"):
if self.c.lex.orth == getattr(other[0], "orth", None):
return 1.0

spacy/util.py

@@ -160,7 +160,10 @@ def load_model_from_path(model_path, meta=False, **overrides):
pipeline from meta.json and then calls from_disk() with path."""
if not meta:
meta = get_model_meta(model_path)
cls = get_lang_class(meta["lang"])
# Support language factories registered via entry points (e.g. custom
# language subclass) while keeping top-level language identifier "lang"
lang = meta.get("lang_factory", meta["lang"])
cls = get_lang_class(lang)
nlp = cls(meta=meta, **overrides)
pipeline = meta.get("pipeline", [])
disable = overrides.get("disable", [])
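For illustration, the relevant keys of such a model's meta.json, expressed as a Python dict (the factory name is hypothetical):

```python
meta = {
    "lang": "en",                 # standard language code, used elsewhere
    "lang_factory": "custom_en",  # hypothetical entry-point factory name
    "pipeline": ["tagger", "parser", "ner"],
}
# load_model_from_path() now resolves the language class via the factory
# name when it is present:
lang = meta.get("lang_factory", meta["lang"])  # -> "custom_en"
```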

website/docs/api/language.md

@@ -134,8 +134,8 @@ Evaluate a model's pipeline components.
> ```
| Name | Type | Description |
| -------------------------------------------- | -------- | ------------------------------------------------------------------------------------- |
| `docs_golds` | iterable | Tuples of `Doc` and `GoldParse` objects. |
| -------------------------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `docs_golds` | iterable | Tuples of `Doc` and `GoldParse` objects or `(text, annotations)` of raw text and a dict (see [simple training style](/usage/training#training-simple-style)). |
| `verbose` | bool | Print debugging information. |
| `batch_size` | int | The batch size to use. |
| `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. |
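For illustration, both input styles side by side, assuming an `nlp` object with a text classifier as in the tests above (a minimal sketch):

```python
from spacy.gold import GoldParse

# (Doc, GoldParse) pairs, as before:
doc = nlp.make_doc("hello world")
gold = GoldParse(doc, cats={"POSITIVE": 1.0, "NEGATIVE": 0.0})
scorer = nlp.evaluate([(doc, gold)])

# Raw text plus an annotation dict, now also accepted:
scorer = nlp.evaluate([("hello world", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}})])
```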

website/docs/api/span.md

@@ -258,7 +258,7 @@ Retokenize the document, such that the span is merged into a single token.
| `**attributes` | - | Attributes to assign to the merged token. By default, attributes are inherited from the syntactic root token of the span. |
| **RETURNS** | `Token` | The newly merged token. |
## Span.ents {#ents tag="property" new="2.0.12" model="ner"}
## Span.ents {#ents tag="property" new="2.0.13" model="ner"}
The named entities in the span. Returns a tuple of named entity `Span` objects,
if the entity recognizer has been applied.
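For example, a sketch that assumes the `en_core_web_sm` model is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Mr. Best flew to New York on Saturday morning.")
span = doc[0:6]  # "Mr. Best flew to New York"
print([(ent.text, ent.label_) for ent in span.ents])
# e.g. [('Best', 'PERSON'), ('New York', 'GPE')]
```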