diff --git a/.github/contributors/b1uec0in.md b/.github/contributors/b1uec0in.md
new file mode 100644
index 000000000..2e2cd0814
--- /dev/null
+++ b/.github/contributors/b1uec0in.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+ * you hereby assign to us joint ownership, and to the extent that such
+ assignment is or becomes invalid, ineffective or unenforceable, you hereby
+ grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+ royalty-free, unrestricted license to exercise all rights under those
+ copyrights. This includes, at our option, the right to sublicense these same
+ rights to third parties through multiple levels of sublicensees or other
+ licensing arrangements;
+
+ * you agree that each of us can do all things in relation to your
+ contribution as if each of us were the sole owners, and if one of us makes
+ a derivative work of your contribution, the one who makes the derivative
+ work (or has it made) will be the sole owner of that derivative work;
+
+ * you agree that you will not assert any moral rights in your contribution
+ against us, our licensees or transferees;
+
+ * you agree that we may register a copyright in your contribution and
+ exercise all ownership rights associated with it; and
+
+ * you agree that neither of us has any duty to consult with, obtain the
+ consent of, pay or render an accounting to the other for any use or
+ distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+ * make, have made, use, sell, offer to sell, import, and otherwise transfer
+ your contribution in whole or in part, alone or in combination with or
+ included in any product, work or materials arising out of the project to
+ which your contribution was submitted, and
+
+ * at our option, to sublicense these same rights to third parties through
+ multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+ * Each contribution that you submit is and shall be an original work of
+ authorship and you can legally grant the rights set out in this SCA;
+
+ * to the best of your knowledge, each contribution will not violate any
+ third party's copyrights, trademarks, patents, or other intellectual
+ property rights; and
+
+ * each contribution shall be in compliance with U.S. export control laws and
+ other applicable export and import laws. You agree to notify us if you
+ become aware of any circumstance which would make any of the foregoing
+ representations inaccurate in any respect. We may publicly disclose your
+ participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an "x" on one of the applicable statements below. Please do NOT
+mark both statements:
+
+ * [x] I am signing on behalf of myself as an individual and no other person
+ or entity, including my employer, has or will have rights with respect to my
+ contributions.
+
+ * [ ] I am signing on behalf of my employer or a legal entity and I have the
+ actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Bae, Yong-Ju |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | 2019-07-25 |
+| GitHub username | b1uec0in |
+| Website (optional) | |
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 82de54f01..8b02b7055 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -2,12 +2,13 @@
# Contribute to spaCy
-Thanks for your interest in contributing to spaCy 🎉 The project is maintained
+Thanks for your interest in contributing to spaCy 🎉 The project is maintained
by [@honnibal](https://github.com/honnibal) and [@ines](https://github.com/ines),
and we'll do our best to help you get started. This page will give you a quick
overview of how things are organised and most importantly, how to get involved.
## Table of contents
+
1. [Issues and bug reports](#issues-and-bug-reports)
2. [Contributing to the code base](#contributing-to-the-code-base)
3. [Code conventions](#code-conventions)
@@ -42,33 +43,33 @@ can also submit a [regression test](#fixing-bugs) straight away. When you're
opening an issue to report the bug, simply refer to your pull request in the
issue body. A few more tips:
-* **Describing your issue:** Try to provide as many details as possible. What
-exactly goes wrong? *How* is it failing? Is there an error?
-"XY doesn't work" usually isn't that helpful for tracking down problems. Always
-remember to include the code you ran and if possible, extract only the relevant
-parts and don't just dump your entire script. This will make it easier for us to
-reproduce the error.
+- **Describing your issue:** Try to provide as many details as possible. What
+ exactly goes wrong? _How_ is it failing? Is there an error?
+ "XY doesn't work" usually isn't that helpful for tracking down problems. Always
+ remember to include the code you ran and if possible, extract only the relevant
+ parts and don't just dump your entire script. This will make it easier for us to
+ reproduce the error.
-* **Getting info about your spaCy installation and environment:** If you're
-using spaCy v1.7+, you can use the command line interface to print details and
-even format them as Markdown to copy-paste into GitHub issues:
-`python -m spacy info --markdown`.
+- **Getting info about your spaCy installation and environment:** If you're
+ using spaCy v1.7+, you can use the command line interface to print details and
+ even format them as Markdown to copy-paste into GitHub issues:
+ `python -m spacy info --markdown`.
-* **Checking the model compatibility:** If you're having problems with a
-[statistical model](https://spacy.io/models), it may be because the
-model is incompatible with your spaCy installation. In spaCy v2.0+, you can check
-this on the command line by running `python -m spacy validate`.
+- **Checking the model compatibility:** If you're having problems with a
+ [statistical model](https://spacy.io/models), it may be because the
+ model is incompatible with your spaCy installation. In spaCy v2.0+, you can check
+ this on the command line by running `python -m spacy validate`.
-* **Sharing a model's output, like dependencies and entities:** spaCy v2.0+
-comes with [built-in visualizers](https://spacy.io/usage/visualizers) that
-you can run from within your script or a Jupyter notebook. For some issues, it's
-helpful to **include a screenshot** of the visualization. You can simply drag and
-drop the image into GitHub's editor and it will be uploaded and included.
+- **Sharing a model's output, like dependencies and entities:** spaCy v2.0+
+ comes with [built-in visualizers](https://spacy.io/usage/visualizers) that
+ you can run from within your script or a Jupyter notebook. For some issues, it's
+ helpful to **include a screenshot** of the visualization. You can simply drag and
+ drop the image into GitHub's editor and it will be uploaded and included.
-* **Sharing long blocks of code or logs:** If you need to include long code,
-logs or tracebacks, you can wrap them in `<details>` and `</details>`. This
-[collapses the content](https://developer.mozilla.org/en/docs/Web/HTML/Element/details)
-so it only becomes visible on click, making the issue easier to read and follow.
+- **Sharing long blocks of code or logs:** If you need to include long code,
+  logs or tracebacks, you can wrap them in `<details>` and `</details>`. This
+ [collapses the content](https://developer.mozilla.org/en/docs/Web/HTML/Element/details)
+ so it only becomes visible on click, making the issue easier to read and follow.
### Issue labels
@@ -94,39 +95,39 @@ shipped in the core library, and what could be provided in other packages. Our
philosophy is to prefer a smaller core library. We generally ask the following
questions:
-* **What would this feature look like if implemented in a separate package?**
-Some features would be very difficult to implement externally – for example,
-changes to spaCy's built-in methods. In contrast, a library of word
-alignment functions could easily live as a separate package that depended on
-spaCy – there's little difference between writing `import word_aligner` and
-`import spacy.word_aligner`. spaCy v2.0+ makes it easy to implement
-[custom pipeline components](https://spacy.io/usage/processing-pipelines#custom-components),
-and add your own attributes, properties and methods to the `Doc`, `Token` and
-`Span`. If you're looking to implement a new spaCy feature, starting with a
-custom component package is usually the best strategy. You won't have to worry
-about spaCy's internals and you can test your module in an isolated
-environment. And if it works well, we can always integrate it into the core
-library later.
+- **What would this feature look like if implemented in a separate package?**
+  Some features would be very difficult to implement externally – for example,
+ changes to spaCy's built-in methods. In contrast, a library of word
+ alignment functions could easily live as a separate package that depended on
+  spaCy – there's little difference between writing `import word_aligner` and
+ `import spacy.word_aligner`. spaCy v2.0+ makes it easy to implement
+ [custom pipeline components](https://spacy.io/usage/processing-pipelines#custom-components),
+ and add your own attributes, properties and methods to the `Doc`, `Token` and
+ `Span`. If you're looking to implement a new spaCy feature, starting with a
+ custom component package is usually the best strategy. You won't have to worry
+ about spaCy's internals and you can test your module in an isolated
+ environment. And if it works well, we can always integrate it into the core
+ library later.
-* **Would the feature be easier to implement if it relied on "heavy" dependencies spaCy doesn't currently require?**
-Python has a very rich ecosystem. Libraries like scikit-learn, SciPy, Gensim or
-TensorFlow/Keras do lots of useful things – but we don't want to have them as
-dependencies. If the feature requires functionality in one of these libraries,
-it's probably better to break it out into a different package.
+- **Would the feature be easier to implement if it relied on "heavy" dependencies spaCy doesn't currently require?**
+ Python has a very rich ecosystem. Libraries like scikit-learn, SciPy, Gensim or
+  TensorFlow/Keras do lots of useful things – but we don't want to have them as
+ dependencies. If the feature requires functionality in one of these libraries,
+ it's probably better to break it out into a different package.
-* **Is the feature orthogonal to the current spaCy functionality, or overlapping?**
-spaCy strongly prefers to avoid having 6 different ways of doing the same thing.
-As better techniques are developed, we prefer to drop support for "the old way".
-However, it's rare that one approach *entirely* dominates another. It's very
-common that there's still a use-case for the "obsolete" approach. For instance,
-[WordNet](https://wordnet.princeton.edu/) is still very useful – but word
-vectors are better for most use-cases, and the two approaches to lexical
-semantics do a lot of the same things. spaCy therefore only supports word
-vectors, and support for WordNet is currently left for other packages.
+- **Is the feature orthogonal to the current spaCy functionality, or overlapping?**
+ spaCy strongly prefers to avoid having 6 different ways of doing the same thing.
+ As better techniques are developed, we prefer to drop support for "the old way".
+ However, it's rare that one approach _entirely_ dominates another. It's very
+ common that there's still a use-case for the "obsolete" approach. For instance,
+  [WordNet](https://wordnet.princeton.edu/) is still very useful – but word
+ vectors are better for most use-cases, and the two approaches to lexical
+ semantics do a lot of the same things. spaCy therefore only supports word
+ vectors, and support for WordNet is currently left for other packages.
-* **Do you need the feature to get basic things done?** We do want spaCy to be
-at least somewhat self-contained. If we keep needing some feature in our
-recipes, that does provide some argument for bringing it "in house".
+- **Do you need the feature to get basic things done?** We do want spaCy to be
+ at least somewhat self-contained. If we keep needing some feature in our
+ recipes, that does provide some argument for bringing it "in house".
### Getting started
@@ -155,7 +156,6 @@ Changes to `.py` files will be effective immediately.
📖 **For more details and instructions, see the documentation on [compiling spaCy from source](https://spacy.io/usage/#source) and the [quickstart widget](https://spacy.io/usage/#section-quickstart) to get the right commands for your platform and Python version.**
-
### Contributor agreement
If you've made a contribution to spaCy, you should fill in the
@@ -167,7 +167,6 @@ and include it with your pull request, or submit it separately to
your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
-
### Fixing bugs
When fixing a bug, first create an
@@ -199,7 +198,7 @@ modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**
[`black`](https://github.com/ambv/black) is an opinionated Python code
formatter, optimised to produce readable code and small diffs. You can run
`black` from the command-line, or via your code editor. For example, if you're
-using [Visual Studio Code](https://code.visualstudio.com/), you can add the
+using [Visual Studio Code](https://code.visualstudio.com/), you can add the
following to your `settings.json` to use `black` for formatting and auto-format
your files on save:
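
For reference, a minimal `settings.json` of the kind this paragraph describes might look like the sketch below (exact keys depend on the version of the VS Code Python extension you use):

```json
{
    // Format Python files with black, and run the formatter on save.
    "python.formatting.provider": "black",
    "[python]": {
        "editor.formatOnSave": true
    }
}
```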
@@ -415,11 +414,10 @@ Python. If it's not fast enough the first time, just switch to Cython.
### Resources to get you started
-* [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
-* [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
-* [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
-* [Multi-threading spaCyβs parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
-
+- [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
+- [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
+- [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
+- [Multi-threading spaCyβs parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
## Adding tests
@@ -444,66 +442,40 @@ use the `get_doc()` utility function to construct it manually.
📖 **For more guidelines and information on how to add tests, check out the [tests README](spacy/tests/README.md).**
-
## Updating the website
-For instructions on how to build and run the [website](https://spacy.io) locally, see **[Setup and installation](https://github.com/explosion/spaCy/blob/master/website/README.md#setup-and-installation-setup)** in the *website* directory's README.
+For instructions on how to build and run the [website](https://spacy.io) locally, see **[Setup and installation](https://github.com/explosion/spaCy/blob/master/website/README.md#setup-and-installation-setup)** in the _website_ directory's README.
The docs can always use another example or more detail, and they should always
be up to date and not misleading. To quickly find the correct file to edit,
-simply click on the "Suggest edits" button at the bottom of a page. To keep
-long pages maintainable, and allow including content in several places without
-doubling it, sections often consist of partials. Partials and partial directories
-are prefixed by an underscore `_` so they're not compiled with the site. For
-example:
-
-```pug
-+section("tokenization")
- +h(2, "tokenization") Tokenization
- include _spacy-101/_tokenization
-```
-
-So if you're looking to edit the content of the tokenization section, you can
-find it in `_spacy-101/_tokenization.jade`. To make it easy to add content
-components, we use a [collection of custom mixins](_includes/_mixins.jade),
-like `+table`, `+list` or `+code`. For an overview of the available mixins and
-components, see the [styleguide](https://spacy.io/styleguide).
+simply click on the "Suggest edits" button at the bottom of a page.
📖 **For more info and troubleshooting guides, check out the [website README](website).**
-### Resources to get you started
-
-* [Guide to static websites with Harp and Jade](https://ines.io/blog/the-ultimate-guide-static-websites-harp-jade) (ines.io)
-* [Building a website with modular markup components (mixins)](https://explosion.ai/blog/modular-markup) (explosion.ai)
-* [spacy.io Styleguide](https://spacy.io/styleguide) (spacy.io)
-* [Jade/Pug documentation](https://pugjs.org) (pugjs.org)
-* [Harp documentation](https://harpjs.com/) (harpjs.com)
-
-
## Publishing spaCy extensions and plugins
We're very excited about all the new possibilities for **community extensions**
and plugins in spaCy v2.0, and we can't wait to see what you build with it!
-* An extension or plugin should add substantial functionality, be
-**well-documented** and **open-source**. It should be available for users to download
-and install as a Python package – for example via [PyPi](http://pypi.python.org).
+- An extension or plugin should add substantial functionality, be
+ **well-documented** and **open-source**. It should be available for users to download
+  and install as a Python package – for example via [PyPi](http://pypi.python.org).
-* Extensions that write to `Doc`, `Token` or `Span` attributes should be wrapped
-as [pipeline components](https://spacy.io/usage/processing-pipelines#custom-components)
-that users can **add to their processing pipeline** using `nlp.add_pipe()`.
+- Extensions that write to `Doc`, `Token` or `Span` attributes should be wrapped
+ as [pipeline components](https://spacy.io/usage/processing-pipelines#custom-components)
+  that users can **add to their processing pipeline** using `nlp.add_pipe()` (see the sketch at the end of this section).
-* When publishing your extension on GitHub, **tag it** with the topics
-[`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
-[`spacy-extensions`](https://github.com/topics/spacy-extension?o=desc&s=stars)
-to make it easier to find. Those are also the topics we're linking to from the
-spaCy website. If you're sharing your project on Twitter, feel free to tag
-[@spacy_io](https://twitter.com/spacy_io) so we can check it out.
+- When publishing your extension on GitHub, **tag it** with the topics
+ [`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
+ [`spacy-extensions`](https://github.com/topics/spacy-extension?o=desc&s=stars)
+ to make it easier to find. Those are also the topics we're linking to from the
+ spaCy website. If you're sharing your project on Twitter, feel free to tag
+ [@spacy_io](https://twitter.com/spacy_io) so we can check it out.
-* Once your extension is published, you can open an issue on the
-[issue tracker](https://github.com/explosion/spacy/issues) to suggest it for the
-[resources directory](https://spacy.io/usage/resources#extensions) on the
-website.
+- Once your extension is published, you can open an issue on the
+ [issue tracker](https://github.com/explosion/spacy/issues) to suggest it for the
+ [resources directory](https://spacy.io/usage/resources#extensions) on the
+ website.
📖 **For more tips and best practices, see the [checklist for developing spaCy extensions](https://spacy.io/usage/processing-pipelines#extensions).**
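
As referenced above, a minimal custom pipeline component in the v2.x style might look like the following sketch (`print_info` and the sample text are illustrative, not part of spaCy):

```python
import spacy

def print_info(doc):
    # In spaCy v2.x, a pipeline component is any callable that receives a
    # Doc, inspects or modifies it, and returns it.
    print("Tokens:", [token.text for token in doc])
    return doc

nlp = spacy.blank("en")
nlp.add_pipe(print_info, name="print_info", last=True)
doc = nlp("This is a sentence.")
```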
diff --git a/spacy/about.py b/spacy/about.py
index 16e5e9522..1b786a82a 100644
--- a/spacy/about.py
+++ b/spacy/about.py
@@ -4,13 +4,13 @@
# fmt: off
__title__ = "spacy"
-__version__ = "2.1.6"
+__version__ = "2.1.7.dev0"
__summary__ = "Industrial-strength Natural Language Processing (NLP) with Python and Cython"
__uri__ = "https://spacy.io"
__author__ = "Explosion AI"
__email__ = "contact@explosion.ai"
__license__ = "MIT"
-__release__ = True
+__release__ = False
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
diff --git a/spacy/cli/download.py b/spacy/cli/download.py
index 66a47823c..1075b0c60 100644
--- a/spacy/cli/download.py
+++ b/spacy/cli/download.py
@@ -19,7 +19,7 @@ msg = Printer()
@plac.annotations(
model=("Model to download (shortcut or name)", "positional", None, str),
direct=("Force direct download of name + version", "flag", "d", bool),
- pip_args=("additional arguments to be passed to `pip install` on model install"),
+ pip_args=("Additional arguments to be passed to `pip install` on model install"),
)
def download(model, direct=False, *pip_args):
"""
diff --git a/spacy/cli/package.py b/spacy/cli/package.py
index 2f1258162..e99a6d5ff 100644
--- a/spacy/cli/package.py
+++ b/spacy/cli/package.py
@@ -150,6 +150,8 @@ def list_requirements(meta):
requirements = [parent_package + meta['spacy_version']]
if 'setup_requires' in meta:
requirements += meta['setup_requires']
+ if 'requirements' in meta:
+ requirements += meta['requirements']
return requirements
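
To illustrate the `requirements` handling added above, a hypothetical model `meta.json` (the extra package name and version pins are made up):

```python
# Hypothetical meta.json contents: entries under "requirements" are now
# appended to the generated install requirements.
meta = {
    "lang": "ko",
    "name": "example_model",
    "spacy_version": ">=2.1.7",
    "requirements": ["natto-py>=0.9.0"],
}
# list_requirements(meta) would then return:
# ["spacy>=2.1.7", "natto-py>=0.9.0"]
```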
diff --git a/spacy/lang/ko/__init__.py b/spacy/lang/ko/__init__.py
index f5dff75f1..52a55c789 100644
--- a/spacy/lang/ko/__init__.py
+++ b/spacy/lang/ko/__init__.py
@@ -51,12 +51,15 @@ def try_mecab_import():
def check_spaces(text, tokens):
- token_pattern = re.compile(r"\s?".join(f"({t})" for t in tokens))
- m = token_pattern.match(text)
- if m is not None:
- for i in range(1, m.lastindex):
- yield m.end(i) < m.start(i + 1)
- yield False
+ prev_end = -1
+ start = 0
+ for token in tokens:
+ idx = text.find(token, start)
+ if prev_end > 0:
+ yield prev_end != idx
+ prev_end = idx + len(token)
+ start = prev_end
+ yield False
class KoreanTokenizer(DummyTokenizer):
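
The motivation for the rewrite, as a standalone sketch: the old implementation interpolated tokens into a regex unescaped, so a token like `$` acted as an end-of-string anchor and the match silently failed. The new logic walks the text with plain substring search instead. The English sample text here is illustrative; the actual regression involved Korean text (see the tokenizer test further down):

```python
import re

text = "10$ code"
tokens = ["10", "$", "code"]

# Old approach: "$" is treated as a regex anchor, so nothing matches and
# no spacing information is produced.
old_pattern = re.compile(r"\s?".join("(%s)" % t for t in tokens))
assert old_pattern.match(text) is None

# New approach, mirroring the rewritten check_spaces(): locate each token
# with str.find() and compare offsets to detect whitespace between tokens.
def spaces(text, tokens):
    prev_end, start = -1, 0
    for token in tokens:
        idx = text.find(token, start)
        if prev_end > 0:
            yield prev_end != idx  # a gap means the previous token had a space
        prev_end = idx + len(token)
        start = prev_end
    yield False  # the last token never gets a trailing space

assert list(spaces(text, tokens)) == [False, True, False]
```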
diff --git a/spacy/language.py b/spacy/language.py
index 39d95c689..bfdd00b79 100644
--- a/spacy/language.py
+++ b/spacy/language.py
@@ -618,7 +618,7 @@ class Language(object):
if component_cfg is None:
component_cfg = {}
docs, golds = zip(*docs_golds)
- docs = list(docs)
+ docs = [self.make_doc(doc) if isinstance(doc, basestring_) else doc for doc in docs]
golds = list(golds)
for name, pipe in self.pipeline:
kwargs = component_cfg.get(name, {})
@@ -628,6 +628,8 @@ class Language(object):
else:
docs = pipe.pipe(docs, **kwargs)
for doc, gold in zip(docs, golds):
+ if not isinstance(gold, GoldParse):
+ gold = GoldParse(doc, **gold)
if verbose:
print(doc)
kwargs = component_cfg.get("scorer", {})
diff --git a/spacy/pipeline/pipes.pyx b/spacy/pipeline/pipes.pyx
index ca166607f..3b5e3d41c 100644
--- a/spacy/pipeline/pipes.pyx
+++ b/spacy/pipeline/pipes.pyx
@@ -83,8 +83,12 @@ class Pipe(object):
"""
for docs in util.minibatch(stream, size=batch_size):
docs = list(docs)
- scores, tensors = self.predict(docs)
- self.set_annotations(docs, scores, tensor=tensors)
+ predictions = self.predict(docs)
+        if isinstance(predictions, tuple) and len(predictions) == 2:
+ scores, tensors = predictions
+ self.set_annotations(docs, scores, tensor=tensors)
+ else:
+ self.set_annotations(docs, predictions)
yield from docs
def predict(self, docs):
@@ -104,8 +108,7 @@ class Pipe(object):
Delegates to predict() and get_loss().
"""
- self.require_model()
- raise NotImplementedError
+ pass
def rehearse(self, docs, sgd=None, losses=None, **config):
pass
@@ -134,7 +137,8 @@ class Pipe(object):
If no model has been initialized yet, the model is added."""
if self.model is True:
self.model = self.Model(**self.cfg)
- link_vectors_to_models(self.vocab)
+ if hasattr(self, "vocab"):
+ link_vectors_to_models(self.vocab)
if sgd is None:
sgd = self.create_optimizer()
return sgd
@@ -154,7 +158,8 @@ class Pipe(object):
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
if self.model not in (True, False, None):
serialize["model"] = self.model.to_bytes
- serialize["vocab"] = self.vocab.to_bytes
+ if hasattr(self, "vocab"):
+ serialize["vocab"] = self.vocab.to_bytes
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
return util.to_bytes(serialize, exclude)
@@ -174,7 +179,8 @@ class Pipe(object):
deserialize = OrderedDict()
deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
- deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
+ if hasattr(self, "vocab"):
+ deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
deserialize["model"] = load_model
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_bytes(bytes_data, deserialize, exclude)
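
A sketch of what the reworked `Pipe.pipe()` dispatch permits: a subclass whose `predict()` returns bare scores (no tensors) now round-trips through `pipe()` without unpacking errors. Everything below is illustrative, not a real spaCy component:

```python
from spacy.pipeline.pipes import Pipe
from spacy.tokens import Doc
from spacy.vocab import Vocab

class ScoresOnly(Pipe):
    """Illustrative pipe whose predict() returns a plain list of scores
    rather than a (scores, tensors) tuple."""
    name = "scores_only"

    def __init__(self, vocab, model=False, **cfg):
        self.vocab = vocab
        self.model = model
        self.cfg = cfg

    def predict(self, docs):
        return [0.0 for _ in docs]  # dummy scores, no tensors

    def set_annotations(self, docs, scores, tensor=None):
        pass  # a real component would write the scores to the docs

vocab = Vocab()
component = ScoresOnly(vocab)
docs = list(component.pipe([Doc(vocab, words=["hello", "world"])]))
```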
diff --git a/spacy/tests/lang/ko/test_tokenizer.py b/spacy/tests/lang/ko/test_tokenizer.py
index cc7b5fd77..531a41d0b 100644
--- a/spacy/tests/lang/ko/test_tokenizer.py
+++ b/spacy/tests/lang/ko/test_tokenizer.py
@@ -5,7 +5,8 @@ import pytest
# fmt: off
TOKENIZER_TESTS = [("서울 타워 근처에 살고 있습니다.", "서울 타워 근처 에 살 고 있 습니다 ."),
- ("영등포구에 있는 맛집 좀 알려주세요.", "영등포구 에 있 는 맛집 좀 알려 주 세요 .")]
+ ("영등포구에 있는 맛집 좀 알려주세요.", "영등포구 에 있 는 맛집 좀 알려 주 세요 ."),
+ ("10$ 할인코드를 적용할까요?", "10 $ 할인 코드 를 적용 할까요 ?")]
TAG_TESTS = [("서울 타워 근처에 살고 있습니다.",
"NNP NNG NNG JKB VV EC VX EF SF"),
diff --git a/spacy/tests/test_language.py b/spacy/tests/test_language.py
new file mode 100644
index 000000000..00175fe9a
--- /dev/null
+++ b/spacy/tests/test_language.py
@@ -0,0 +1,57 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+from spacy.vocab import Vocab
+from spacy.language import Language
+from spacy.tokens import Doc
+from spacy.gold import GoldParse
+
+
+@pytest.fixture
+def nlp():
+ nlp = Language(Vocab())
+ textcat = nlp.create_pipe("textcat")
+ for label in ("POSITIVE", "NEGATIVE"):
+ textcat.add_label(label)
+ nlp.add_pipe(textcat)
+ nlp.begin_training()
+ return nlp
+
+
+def test_language_update(nlp):
+ text = "hello world"
+ annots = {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}
+ doc = Doc(nlp.vocab, words=text.split(" "))
+ gold = GoldParse(doc, **annots)
+ # Update with doc and gold objects
+ nlp.update([doc], [gold])
+ # Update with text and dict
+ nlp.update([text], [annots])
+ # Update with doc object and dict
+ nlp.update([doc], [annots])
+ # Update with text and gold object
+ nlp.update([text], [gold])
+ # Update badly
+ with pytest.raises(IndexError):
+ nlp.update([doc], [])
+ with pytest.raises(IndexError):
+ nlp.update([], [gold])
+
+
+def test_language_evaluate(nlp):
+ text = "hello world"
+ annots = {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}
+ doc = Doc(nlp.vocab, words=text.split(" "))
+ gold = GoldParse(doc, **annots)
+ # Evaluate with doc and gold objects
+ nlp.evaluate([(doc, gold)])
+ # Evaluate with text and dict
+ nlp.evaluate([(text, annots)])
+ # Evaluate with doc object and dict
+ nlp.evaluate([(doc, annots)])
+ # Evaluate with text and gold object
+ nlp.evaluate([(text, gold)])
+ # Evaluate badly
+ with pytest.raises(Exception):
+ nlp.evaluate([text, gold])
diff --git a/spacy/tokens/span.pyx b/spacy/tokens/span.pyx
index 42fb9852d..460972369 100644
--- a/spacy/tokens/span.pyx
+++ b/spacy/tokens/span.pyx
@@ -311,7 +311,7 @@ cdef class Span:
DOCS: https://spacy.io/api/span#similarity
"""
if "similarity" in self.doc.user_span_hooks:
- self.doc.user_span_hooks["similarity"](self, other)
+ return self.doc.user_span_hooks["similarity"](self, other)
if len(self) == 1 and hasattr(other, "orth"):
if self[0].orth == other.orth:
return 1.0
diff --git a/spacy/tokens/token.pyx b/spacy/tokens/token.pyx
index eb79de16b..909ebecbb 100644
--- a/spacy/tokens/token.pyx
+++ b/spacy/tokens/token.pyx
@@ -202,7 +202,7 @@ cdef class Token:
DOCS: https://spacy.io/api/token#similarity
"""
if "similarity" in self.doc.user_token_hooks:
- return self.doc.user_token_hooks["similarity"](self)
+ return self.doc.user_token_hooks["similarity"](self, other)
if hasattr(other, "__len__") and len(other) == 1 and hasattr(other, "__getitem__"):
if self.c.lex.orth == getattr(other[0], "orth", None):
return 1.0
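
Taken together, the two similarity fixes make user hooks behave as documented: the span hook's return value is now actually returned (previously it was computed and then discarded), and the token hook now receives both arguments. A quick sketch (the `always_one` hook is illustrative):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("green apples and red oranges")

def always_one(obj, other):
    return 1.0

doc.user_span_hooks["similarity"] = always_one
doc.user_token_hooks["similarity"] = always_one

assert doc[0:2].similarity(doc[3:5]) == 1.0  # span hook result is returned
assert doc[0].similarity(doc[4]) == 1.0      # token hook gets (self, other)
```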
diff --git a/spacy/util.py b/spacy/util.py
index 1a40bb5ca..713501924 100644
--- a/spacy/util.py
+++ b/spacy/util.py
@@ -160,7 +160,10 @@ def load_model_from_path(model_path, meta=False, **overrides):
pipeline from meta.json and then calls from_disk() with path."""
if not meta:
meta = get_model_meta(model_path)
- cls = get_lang_class(meta["lang"])
+    # Support language factories registered via entry points (e.g. a custom
+    # language subclass), keeping "lang" as the top-level language identifier
+ lang = meta.get("lang_factory", meta["lang"])
+ cls = get_lang_class(lang)
nlp = cls(meta=meta, **overrides)
pipeline = meta.get("pipeline", [])
disable = overrides.get("disable", [])
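
For illustration, a hypothetical `meta.json` for a model built on a custom language class (the factory name `custom_en` is made up; in v2.1, language classes can be registered under the `spacy_languages` entry point group):

```python
# Hypothetical meta.json: "lang" stays a standard language identifier,
# while "lang_factory" names the registered class actually used to
# construct the pipeline.
meta = {
    "lang": "en",
    "lang_factory": "custom_en",
    "name": "example_model",
    "pipeline": ["tagger", "parser"],
}
# load_model_from_path() now calls get_lang_class("custom_en").
```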
diff --git a/website/docs/api/language.md b/website/docs/api/language.md
index 3245a165b..3fcdeb195 100644
--- a/website/docs/api/language.md
+++ b/website/docs/api/language.md
@@ -133,13 +133,13 @@ Evaluate a model's pipeline components.
> print(scorer.scores)
> ```
-| Name | Type | Description |
-| -------------------------------------------- | -------- | ------------------------------------------------------------------------------------- |
-| `docs_golds` | iterable | Tuples of `Doc` and `GoldParse` objects. |
-| `verbose` | bool | Print debugging information. |
-| `batch_size` | int | The batch size to use. |
-| `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. |
-| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
+| Name | Type | Description |
+| -------------------------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `docs_golds` | iterable | Tuples of `Doc` and `GoldParse` objects, or `(text, annotations)` tuples of raw text and a dict (see [simple training style](/usage/training#training-simple-style)). |
+| `verbose` | bool | Print debugging information. |
+| `batch_size` | int | The batch size to use. |
+| `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. |
+| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
## Language.begin_training {#begin_training tag="method"}
diff --git a/website/docs/api/span.md b/website/docs/api/span.md
index 53041cd66..7187a32a3 100644
--- a/website/docs/api/span.md
+++ b/website/docs/api/span.md
@@ -258,7 +258,7 @@ Retokenize the document, such that the span is merged into a single token.
| `**attributes` | - | Attributes to assign to the merged token. By default, attributes are inherited from the syntactic root token of the span. |
| **RETURNS** | `Token` | The newly merged token. |
-## Span.ents {#ents tag="property" new="2.0.12" model="ner"}
+## Span.ents {#ents tag="property" new="2.0.13" model="ner"}
The named entities in the span. Returns a tuple of named entity `Span` objects,
if the entity recognizer has been applied.