Merge remote-tracking branch 'upstream/master' into bugfix/tokenizer-special-cases-matcher

Adriane Boyd 2019-09-27 09:32:15 +02:00
commit ccd94809fa
18 changed files with 179 additions and 33 deletions

.github/contributors/zqianem.md

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Em Zhan |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-09-25 |
| GitHub username | zqianem |
| Website (optional) | |


@@ -38,7 +38,7 @@ It's commercial open-source software, released under the MIT license.
| [Contribute] | How to contribute to the spaCy project and code base. |
[spacy 101]: https://spacy.io/usage/spacy-101
[new in v2.2]: https://spacy.io/usage/v2-1
[new in v2.1]: https://spacy.io/usage/v2-1
[usage guides]: https://spacy.io/usage/
[api reference]: https://spacy.io/api/
[models]: https://spacy.io/models


@@ -1,6 +1,6 @@
# fmt: off
__title__ = "spacy"
__version__ = "2.2.0.dev8"
__version__ = "2.2.0.dev10"
__summary__ = "Industrial-strength Natural Language Processing (NLP) in Python"
__uri__ = "https://spacy.io"
__author__ = "Explosion"


@@ -35,6 +35,13 @@ msg = Printer()
    clusters_loc=("Optional location of brown clusters data", "option", "c", str),
    vectors_loc=("Optional vectors file in Word2Vec format", "option", "v", str),
    prune_vectors=("Optional number of vectors to prune to", "option", "V", int),
    vectors_name=(
        "Optional name for the word vectors, e.g. en_core_web_lg.vectors",
        "option",
        "vn",
        str,
    ),
    model_name=("Optional name for the model meta", "option", "mn", str),
)
def init_model(
    lang,
@@ -44,6 +51,8 @@ def init_model(
    jsonl_loc=None,
    vectors_loc=None,
    prune_vectors=-1,
    vectors_name=None,
    model_name=None,
):
    """
    Create a new model from raw data, like word frequencies, Brown clusters
@@ -75,10 +84,10 @@ def init_model(
        lex_attrs = read_attrs_from_deprecated(freqs_loc, clusters_loc)
    with msg.loading("Creating model..."):
        nlp = create_model(lang, lex_attrs)
        nlp = create_model(lang, lex_attrs, name=model_name)
    msg.good("Successfully created model")
    if vectors_loc is not None:
        add_vectors(nlp, vectors_loc, prune_vectors)
        add_vectors(nlp, vectors_loc, prune_vectors, vectors_name)
    vec_added = len(nlp.vocab.vectors)
    lex_added = len(nlp.vocab)
    msg.good(
@@ -138,7 +147,7 @@ def read_attrs_from_deprecated(freqs_loc, clusters_loc):
    return lex_attrs


def create_model(lang, lex_attrs):
def create_model(lang, lex_attrs, name=None):
    lang_class = get_lang_class(lang)
    nlp = lang_class()
    for lexeme in nlp.vocab:
@@ -157,10 +166,12 @@ def create_model(lang, lex_attrs):
    else:
        oov_prob = DEFAULT_OOV_PROB
    nlp.vocab.cfg.update({"oov_prob": oov_prob})
    if name:
        nlp.meta["name"] = name
    return nlp


def add_vectors(nlp, vectors_loc, prune_vectors):
def add_vectors(nlp, vectors_loc, prune_vectors, name=None):
    vectors_loc = ensure_path(vectors_loc)
    if vectors_loc and vectors_loc.parts[-1].endswith(".npz"):
        nlp.vocab.vectors = Vectors(data=numpy.load(vectors_loc.open("rb")))
@@ -181,7 +192,10 @@ def add_vectors(nlp, vectors_loc, prune_vectors):
                    lexeme.is_oov = False
        if vectors_data is not None:
            nlp.vocab.vectors = Vectors(data=vectors_data, keys=vector_keys)
    nlp.vocab.vectors.name = "%s_model.vectors" % nlp.meta["lang"]
    if name is None:
        nlp.vocab.vectors.name = "%s_model.vectors" % nlp.meta["lang"]
    else:
        nlp.vocab.vectors.name = name
    nlp.meta["vectors"]["name"] = nlp.vocab.vectors.name
    if prune_vectors >= 1:
        nlp.vocab.prune_vectors(prune_vectors)
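For context, the two new arguments thread through from the CLI into `create_model` and `add_vectors`. A minimal sketch of calling the updated function from Python, assuming the usual `lang, output_dir` positional signature; the paths and names are placeholders:

```python
from spacy.cli import init_model

init_model(
    "en",                                   # lang
    "/tmp/en_model",                        # output directory (placeholder)
    vectors_loc="vectors.txt",              # Word2Vec-format text file
    vectors_name="en_core_web_md.vectors",  # new: names the vectors table
    model_name="core_web_md",               # new: sets "name" in the meta
)
```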


@@ -471,7 +471,16 @@ class Errors(object):
            "that case.")
    E166 = ("Can only merge DocBins with the same pre-defined attributes.\n"
            "Current DocBin: {current}\nOther DocBin: {other}")
    E167 = ("Tokenizer special cases are not allowed to modify the text. "
    E167 = ("Unknown morphological feature: '{feat}' ({feat_id}). This can "
            "happen if the tagger was trained with a different set of "
            "morphological features. If you're using a pre-trained model, make "
            "sure that your models are up to date:\npython -m spacy validate")
    E168 = ("Unknown field: {field}")
    E169 = ("Can't find module: {module}")
    E170 = ("Cannot apply transition {name}: invalid for the current state.")
    E171 = ("Matcher.add received invalid on_match callback argument: expected "
            "callable or None, but got: {arg_type}")
    E172 = ("Tokenizer special cases are not allowed to modify the text. "
            "This would map '{chunk}' to '{orth}' given token attributes "
            "'{token_attrs}'.")


@@ -103,6 +103,8 @@ cdef class Matcher:
        *patterns (list): List of token descriptions.
        """
        errors = {}
        if on_match is not None and not hasattr(on_match, "__call__"):
            raise ValueError(Errors.E171.format(arg_type=type(on_match)))
        for i, pattern in enumerate(patterns):
            if len(pattern) == 0:
                raise ValueError(Errors.E012.format(key=key))
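The new check makes an invalid callback fail fast at `add()` time. A small sketch using the v2 calling convention `Matcher.add(key, on_match, *patterns)`; the pattern keys are illustrative:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# None or any callable is accepted as the on_match callback.
matcher.add("HELLO", None, [{"LOWER": "hello"}])

# A non-callable in the on_match slot (e.g. a pattern passed one
# argument too early) now raises E171 immediately, instead of
# failing later at match time.
matcher.add("BAD", [{"LOWER": "hello"}], [{"LOWER": "world"}])  # ValueError
```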


@@ -197,7 +197,7 @@ cdef class Morphology:
        cdef attr_t feature
        for feature in features:
            if feature != 0 and feature not in self._feat_map.id2feat:
                raise KeyError("Unknown feature: %s" % self.strings[feature])
                raise ValueError(Errors.E167.format(feat=self.strings[feature], feat_id=feature))
        cdef MorphAnalysisC tag
        tag = create_rich_tag(features)
        cdef hash_t key = self.insert(tag)
@@ -531,7 +531,7 @@ cdef attr_t get_field(const MorphAnalysisC* tag, int field_id) nogil:
    elif field == Field_VerbType:
        return tag.verb_type
    else:
        raise ValueError("Unknown field: (%d)" % field_id)
        raise ValueError(Errors.E168.format(field=field_id))


cdef int check_feature(const MorphAnalysisC* tag, attr_t feature) nogil:
@@ -726,7 +726,7 @@ cdef int set_feature(MorphAnalysisC* tag,
    elif field == Field_VerbType:
        tag.verb_type = value_
    else:
        raise ValueError(Errors.E167.format(feat=FEATURE_NAMES.get(feature), feat_id=feature))


FIELDS = {


@@ -96,8 +96,7 @@ cdef class TransitionSystem:
    def apply_transition(self, StateClass state, name):
        if not self.is_valid(state, name):
            raise ValueError(
                "Cannot apply transition {name}: invalid for the current state.".format(name=name))
            raise ValueError(Errors.E170.format(name=name))
        action = self.lookup_transition(name)
        action.do(state.c, action.label)


@@ -410,3 +410,11 @@ def test_matcher_schema_token_attributes(en_vocab, pattern, text):
    assert len(matcher) == 1
    matches = matcher(doc)
    assert len(matches) == 1


def test_matcher_valid_callback(en_vocab):
    """Test that on_match can only be None or callable."""
    matcher = Matcher(en_vocab)
    with pytest.raises(ValueError):
        matcher.add("TEST", [], [{"TEXT": "test"}])
    matcher(Doc(en_vocab, words=["test"]))


@@ -558,7 +558,7 @@ cdef class Tokenizer:
        attrs = [intify_attrs(spec, _do_deprecated=True) for spec in substrings]
        orth = "".join([spec[ORTH] for spec in attrs])
        if chunk != orth:
            raise ValueError(Errors.E167.format(chunk=chunk, orth=orth, token_attrs=substrings))
            raise ValueError(Errors.E172.format(chunk=chunk, orth=orth, token_attrs=substrings))

    def add_special_case(self, unicode string, substrings):
        """Add a special-case tokenization rule.


@@ -136,7 +136,7 @@ def load_language_data(path):
def get_module_path(module):
    if not hasattr(module, "__module__"):
        raise ValueError("Can't find module {}".format(repr(module)))
        raise ValueError(Errors.E169.format(module=repr(module)))
    return Path(sys.modules[module.__module__].__file__).parent


@@ -63,7 +63,7 @@ cdef class Vectors:
        shape (tuple): Size of the table, as (# entries, # columns)
        data (numpy.ndarray): The vector data.
        keys (iterable): A sequence of keys, aligned with the data.
        name (string): A name to identify the vectors table.
        name (unicode): A name to identify the vectors table.
        RETURNS (Vectors): The newly created object.

        DOCS: https://spacy.io/api/vectors#init


@@ -45,6 +45,7 @@ cdef class Vocab:
        strings (StringStore): StringStore that maps strings to integers, and
            vice versa.
        lookups (Lookups): Container for large lookup tables and dictionaries.
        name (unicode): Optional name to identify the vectors table.
        RETURNS (Vocab): The newly constructed object.
        """
        lex_attr_getters = lex_attr_getters if lex_attr_getters is not None else {}


@@ -538,6 +538,7 @@ $ python -m spacy init-model [lang] [output_dir] [--jsonl-loc] [--vectors-loc]
| `--jsonl-loc`, `-j` | option | Optional location of JSONL-formatted [vocabulary file](/api/annotation#vocab-jsonl) with lexical attributes. |
| `--vectors-loc`, `-v` | option | Optional location of vectors. Should be a file where the first row contains the dimensions of the vectors, followed by a space-separated Word2Vec table. File can be provided in `.txt` format or as a zipped text file in `.zip` or `.tar.gz` format. |
| `--prune-vectors`, `-V` | flag | Number of vectors to prune the vocabulary to. Defaults to `-1` for no pruning. |
| `--vectors-name`, `-vn` | option | Name to assign to the word vectors in the `meta.json`, e.g. `en_core_web_md.vectors`. |
| **CREATES** | model | A spaCy model containing the vocab and vectors. |
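For example, the new option could be passed like this (the model and file names are placeholders):

```bash
python -m spacy init-model en /tmp/en_model --vectors-loc vectors.txt --vectors-name en_core_web_md.vectors
```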
## Evaluate {#evaluate new="2"}


@@ -35,6 +35,7 @@ you can add vectors to later.
| `data` | `ndarray[ndim=1, dtype='float32']` | The vector data. |
| `keys` | iterable | A sequence of keys aligned with the data. |
| `shape` | tuple | Size of the table as `(n_entries, n_columns)`, the number of entries and number of columns. Not required if you're initializing the object with `data` and `keys`. |
| `name` | unicode | A name to identify the vectors table. |
| **RETURNS** | `Vectors` | The newly created object. |
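A short sketch of the new argument; the data, keys and name here are made up:

```python
import numpy
from spacy.vectors import Vectors

data = numpy.zeros((3, 300), dtype="f")
keys = ["cat", "dog", "rat"]
# name identifies the table, e.g. in the model's meta.json.
vectors = Vectors(data=data, keys=keys, name="en_core_web_md.vectors")
assert vectors.name == "en_core_web_md.vectors"
```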
## Vectors.\_\_getitem\_\_ {#getitem tag="method"}
@@ -211,7 +212,7 @@ Iterate over `(key, vector)` pairs, in order.
| ---------- | ----- | -------------------------------- |
| **YIELDS** | tuple | `(key, vector)` pairs, in order. |
## Vectors.find (#find tag="method")
## Vectors.find {#find tag="method"}
Look up one or more keys by row, or vice versa.


@@ -21,13 +21,14 @@ Create the vocabulary.
> vocab = Vocab(strings=["hello", "world"])
> ```
| Name | Type | Description |
| ------------------ | -------------------- | ------------------------------------------------------------------------------------------------------------------ |
| `lex_attr_getters` | dict | A dictionary mapping attribute IDs to functions to compute them. Defaults to `None`. |
| `tag_map` | dict | A dictionary mapping fine-grained tags to coarse-grained parts-of-speech, and optionally morphological attributes. |
| `lemmatizer` | object | A lemmatizer. Defaults to `None`. |
| `strings` | `StringStore` / list | A [`StringStore`](/api/stringstore) that maps strings to hash values, and vice versa, or a list of strings. |
| **RETURNS** | `Vocab` | The newly constructed object. |
| Name | Type | Description |
| ------------------------------------------- | -------------------- | ------------------------------------------------------------------------------------------------------------------ |
| `lex_attr_getters` | dict | A dictionary mapping attribute IDs to functions to compute them. Defaults to `None`. |
| `tag_map` | dict | A dictionary mapping fine-grained tags to coarse-grained parts-of-speech, and optionally morphological attributes. |
| `lemmatizer` | object | A lemmatizer. Defaults to `None`. |
| `strings` | `StringStore` / list | A [`StringStore`](/api/stringstore) that maps strings to hash values, and vice versa, or a list of strings. |
| `vectors_name` <Tag variant="new">2.2</Tag> | unicode | A name to identify the vectors table. |
| **RETURNS** | `Vocab` | The newly constructed object. |
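A minimal sketch of the new parameter, assuming it is passed as a keyword argument; the name is a placeholder:

```python
from spacy.vocab import Vocab

# vectors_name labels the vocab's vectors table at construction time,
# instead of renaming nlp.vocab.vectors after the fact.
vocab = Vocab(strings=["hello", "world"], vectors_name="demo_vectors")
```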
## Vocab.\_\_len\_\_ {#len tag="method"}


@@ -45,7 +45,7 @@ for token in doc:
| for | for | `ADP` | `IN` | `prep` | `xxx` | `True` | `True` |
| \$ | \$ | `SYM` | `$` | `quantmod` | `$` | `False` | `False` |
| 1 | 1 | `NUM` | `CD` | `compound` | `d` | `False` | `False` |
| billion | billion | `NUM` | `CD` | `probj` | `xxxx` | `True` | `False` |
| billion | billion | `NUM` | `CD` | `pobj` | `xxxx` | `True` | `False` |
> #### Tip: Understanding tags and labels
>


@@ -432,17 +432,21 @@
        {
            "id": "neuralcoref",
            "slogan": "State-of-the-art coreference resolution based on neural nets and spaCy",
            "description": "This coreference resolution module is based on the super fast [spaCy](https://spacy.io/) parser and uses the neural net scoring model described in [Deep Reinforcement Learning for Mention-Ranking Coreference Models](http://cs.stanford.edu/people/kevclark/resources/clark-manning-emnlp2016-deep.pdf) by Kevin Clark and Christopher D. Manning, EMNLP 2016. With ✨Neuralcoref v2.0, you should now be able to train the coreference resolution system on your own dataset, e.g., another language than English! — **provided you have an annotated dataset**.",
            "description": "This coreference resolution module is based on the super fast [spaCy](https://spacy.io/) parser and uses the neural net scoring model described in [Deep Reinforcement Learning for Mention-Ranking Coreference Models](http://cs.stanford.edu/people/kevclark/resources/clark-manning-emnlp2016-deep.pdf) by Kevin Clark and Christopher D. Manning, EMNLP 2016. Since ✨Neuralcoref v2.0, you can train the coreference resolution system on your own dataset, e.g., another language than English! — **provided you have an annotated dataset**. Note that to use neuralcoref with spaCy > 2.1.0, you'll have to install neuralcoref from source.",
            "github": "huggingface/neuralcoref",
            "thumb": "https://i.imgur.com/j6FO9O6.jpg",
            "code_example": [
                "from neuralcoref import Coref",
                "import spacy",
                "import neuralcoref",
                "",
                "coref = Coref()",
                "clusters = coref.one_shot_coref(utterances=u\"She loves him.\", context=u\"My sister has a dog.\")",
                "mentions = coref.get_mentions()",
                "utterances = coref.get_utterances()",
                "resolved_utterance_text = coref.get_resolved_utterances()"
                "nlp = spacy.load('en')",
                "neuralcoref.add_to_pipe(nlp)",
                "doc1 = nlp('My sister has a dog. She loves him.')",
                "print(doc1._.coref_clusters)",
                "",
                "doc2 = nlp('Angela lives in Boston. She is quite happy in that city.')",
                "for ent in doc2.ents:",
                "    print(ent._.coref_cluster)"
            ],
            "author": "Hugging Face",
            "author_links": {
@@ -735,7 +739,7 @@
            "slogan": "Use NLP to go beyond vanilla word2vec",
            "description": "sense2vec ([Trask et al.](https://arxiv.org/abs/1511.06388), 2015) is a nice twist on [word2vec](https://en.wikipedia.org/wiki/Word2vec) that lets you learn more interesting, detailed and context-sensitive word vectors. For an interactive example of the technology, see our [sense2vec demo](https://explosion.ai/demos/sense2vec) that lets you explore semantic similarities across all Reddit comments of 2015.",
            "github": "explosion/sense2vec",
            "pip": "sense2vec==1.0.0a0",
            "pip": "sense2vec==1.0.0a1",
            "thumb": "https://i.imgur.com/awfdhX6.jpg",
            "image": "https://explosion.ai/assets/img/demos/sense2vec.png",
            "url": "https://explosion.ai/demos/sense2vec",