Merge branch 'master' into spacy.io

2025-11-18 08:45:50 +03:00 · 2020-05-22 13:50:37 +02:00 · 2020-05-22 13:50:37 +02:00 · f30b9d3038
commit f30b9d3038
parent 85064b5c22 65c7e82de2
12 changed files with 356 additions and 94 deletions
--- a/.github/contributors/lfiedler.md
+++ b/.github/contributors/lfiedler.md
@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Leander Fiedler      |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 06 April 2020        |
+| GitHub username                | lfiedler             |
+| Website (optional)             |                      |
--- a/spacy/errors.py
+++ b/spacy/errors.py
@ -567,6 +567,7 @@ class Errors(object):
    E197 = ("Row out of bounds, unable to add row {row} for key {key}.")
    E198 = ("Unable to return {n} most similar vectors for the current vectors "
            "table, which contains {n_rows} vectors.")
+    E199 = ("Unable to merge 0-length span at doc[{start}:{end}].")


@add_codes
--- a/spacy/kb.pyx
+++ b/spacy/kb.pyx
@ -445,10 +445,10 @@ cdef class KnowledgeBase:

 cdef class Writer:
    def __init__(self, object loc):
-        if path.exists(loc):
-            assert not path.isdir(loc), "%s is directory." % loc
        if isinstance(loc, Path):
            loc = bytes(loc)
+        if path.exists(loc):
+            assert not path.isdir(loc), "%s is directory." % loc
        cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
        self._fp = fopen(<char*>bytes_loc, 'wb')
        if not self._fp:
@ -490,10 +490,10 @@ cdef class Writer:

 cdef class Reader:
    def __init__(self, object loc):
-        assert path.exists(loc)
-        assert not path.isdir(loc)
        if isinstance(loc, Path):
            loc = bytes(loc)
+        assert path.exists(loc)
+        assert not path.isdir(loc)
        cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
        self._fp = fopen(<char*>bytes_loc, 'rb')
        if not self._fp:
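
Review note on the `Writer`/`Reader` hunks: both move the `path.exists` / `path.isdir` checks below the `Path`-to-`bytes` conversion. Going by the `test_writer_with_path_py35` regression test added in this commit, a plausible reason is that `os.path` functions only accept `pathlib.Path` objects from Python 3.6 onward. The sketch below mirrors the new ordering in plain Python. `normalize_loc` is a hypothetical helper, not part of spaCy, and it assumes that `path` in `kb.pyx` refers to `os.path`.

```python
import os
from pathlib import Path


def normalize_loc(loc):
    """Return `loc` as bytes, checking it in the same order as the patched code.

    Hypothetical helper for illustration only; not part of spaCy.
    """
    # Convert Path objects first, so the os.path calls below only ever see
    # str or bytes (os.path accepts Path objects only on Python 3.6+).
    if isinstance(loc, Path):
        loc = bytes(loc)
    if os.path.exists(loc):
        assert not os.path.isdir(loc), "%r is a directory." % loc
    # Text paths are encoded; bytes paths pass through unchanged.
    return loc.encode("utf8") if isinstance(loc, str) else loc
```

With the checks placed after the conversion, the same code path handles `str`, `bytes`, and `Path` inputs without relying on `os.path` support for path-like objects.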
--- a/spacy/language.py
+++ b/spacy/language.py
@ -907,9 +907,8 @@ class Language(object):
        serializers["tokenizer"] = lambda p: self.tokenizer.to_disk(
            p, exclude=["vocab"]
        )
-        serializers["meta.json"] = lambda p: p.open("w").write(
-            srsly.json_dumps(self.meta)
-        )
+        serializers["meta.json"] = lambda p: srsly.write_json(p, self.meta)
+
        for name, proc in self.pipeline:
            if not hasattr(proc, "name"):
                continue
--- a/spacy/pipeline/pipes.pyx
+++ b/spacy/pipeline/pipes.pyx
@ -203,7 +203,7 @@ class Pipe(object):
        serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
        serialize["vocab"] = lambda p: self.vocab.to_disk(p)
        if self.model not in (None, True, False):
-            serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes())
+            serialize["model"] = lambda p: self.model.to_disk(p)
        exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
        util.to_disk(path, serialize, exclude)

@ -626,7 +626,7 @@ class Tagger(Pipe):
        serialize = OrderedDict((
            ("vocab", lambda p: self.vocab.to_disk(p)),
            ("tag_map", lambda p: srsly.write_msgpack(p, tag_map)),
-            ("model", lambda p: p.open("wb").write(self.model.to_bytes())),
+            ("model", lambda p: self.model.to_disk(p)),
            ("cfg", lambda p: srsly.write_json(p, self.cfg))
        ))
        exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
@ -1395,7 +1395,7 @@ class EntityLinker(Pipe):
        serialize["vocab"] = lambda p: self.vocab.to_disk(p)
        serialize["kb"] = lambda p: self.kb.dump(p)
        if self.model not in (None, True, False):
-            serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes())
+            serialize["model"] = lambda p: self.model.to_disk(p)
        exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
        util.to_disk(path, serialize, exclude)

--- a/spacy/tests/doc/test_retokenize_merge.py
+++ b/spacy/tests/doc/test_retokenize_merge.py
@ -425,3 +425,10 @@ def test_retokenize_skip_duplicates(en_vocab):
        retokenizer.merge(doc[0:2])
    assert len(doc) == 2
    assert doc[0].text == "hello world"
+
+
+def test_retokenize_disallow_zero_length(en_vocab):
+    doc = Doc(en_vocab, words=["hello", "world", "!"])
+    with pytest.raises(ValueError):
+        with doc.retokenize() as retokenizer:
+            retokenizer.merge(doc[1:1])
--- a/spacy/tests/regression/test_issue5230.py
+++ b/spacy/tests/regression/test_issue5230.py
@ -0,0 +1,142 @@
+# coding: utf8
+import warnings
+from unittest import TestCase
+
+import pytest
+import srsly
+from numpy import zeros
+from spacy.kb import KnowledgeBase, Writer
+from spacy.vectors import Vectors
+
+from spacy.language import Language
+from spacy.pipeline import Pipe
+from spacy.tests.util import make_tempdir
+
+
+def nlp():
+    return Language()
+
+
+def vectors():
+    data = zeros((3, 1), dtype="f")
+    keys = ["cat", "dog", "rat"]
+    return Vectors(data=data, keys=keys)
+
+
+def custom_pipe():
+    # create dummy pipe partially implementing interface -- only want to test to_disk
+    class SerializableDummy(object):
+        def __init__(self, **cfg):
+            if cfg:
+                self.cfg = cfg
+            else:
+                self.cfg = None
+            super(SerializableDummy, self).__init__()
+
+        def to_bytes(self, exclude=tuple(), disable=None, **kwargs):
+            return srsly.msgpack_dumps({"dummy": srsly.json_dumps(None)})
+
+        def from_bytes(self, bytes_data, exclude):
+            return self
+
+        def to_disk(self, path, exclude=tuple(), **kwargs):
+            pass
+
+        def from_disk(self, path, exclude=tuple(), **kwargs):
+            return self
+
+    class MyPipe(Pipe):
+        def __init__(self, vocab, model=True, **cfg):
+            if cfg:
+                self.cfg = cfg
+            else:
+                self.cfg = None
+            self.model = SerializableDummy()
+            self.vocab = SerializableDummy()
+
+    return MyPipe(None)
+
+
+def tagger():
+    nlp = Language()
+    nlp.add_pipe(nlp.create_pipe("tagger"))
+    tagger = nlp.get_pipe("tagger")
+    # need to add model for two reasons:
+    # 1. no model leads to error in serialization,
+    # 2. the affected line is the one for model serialization
+    tagger.begin_training(pipeline=nlp.pipeline)
+    return tagger
+
+
+def entity_linker():
+    nlp = Language()
+    nlp.add_pipe(nlp.create_pipe("entity_linker"))
+    entity_linker = nlp.get_pipe("entity_linker")
+    # need to add model for two reasons:
+    # 1. no model leads to error in serialization,
+    # 2. the affected line is the one for model serialization
+    kb = KnowledgeBase(nlp.vocab, entity_vector_length=1)
+    entity_linker.set_kb(kb)
+    entity_linker.begin_training(pipeline=nlp.pipeline)
+    return entity_linker
+
+
+objects_to_test = (
+    [nlp(), vectors(), custom_pipe(), tagger(), entity_linker()],
+    ["nlp", "vectors", "custom_pipe", "tagger", "entity_linker"],
+)
+
+
+def write_obj_and_catch_warnings(obj):
+    with make_tempdir() as d:
+        with warnings.catch_warnings(record=True) as warnings_list:
+            warnings.filterwarnings("always", category=ResourceWarning)
+            obj.to_disk(d)
+            # in python3.5 it seems that deprecation warnings are not filtered by filterwarnings
+            return list(filter(lambda x: isinstance(x, ResourceWarning), warnings_list))
+
+
+@pytest.mark.parametrize("obj", objects_to_test[0], ids=objects_to_test[1])
+def test_to_disk_resource_warning(obj):
+    warnings_list = write_obj_and_catch_warnings(obj)
+    assert len(warnings_list) == 0
+
+
+def test_writer_with_path_py35():
+    writer = None
+    with make_tempdir() as d:
+        path = d / "test"
+        try:
+            writer = Writer(path)
+        except Exception as e:
+            pytest.fail(str(e))
+        finally:
+            if writer:
+                writer.close()
+
+
+def test_save_and_load_knowledge_base():
+    nlp = Language()
+    kb = KnowledgeBase(nlp.vocab, entity_vector_length=1)
+    with make_tempdir() as d:
+        path = d / "kb"
+        try:
+            kb.dump(path)
+        except Exception as e:
+            pytest.fail(str(e))
+
+        try:
+            kb_loaded = KnowledgeBase(nlp.vocab, entity_vector_length=1)
+            kb_loaded.load_bulk(path)
+        except Exception as e:
+            pytest.fail(str(e))
+
+
+class TestToDiskResourceWarningUnittest(TestCase):
+    def test_resource_warning(self):
+        scenarios = zip(*objects_to_test)
+
+        for scenario in scenarios:
+            with self.subTest(msg=scenario[1]):
+                warnings_list = write_obj_and_catch_warnings(scenario[0])
+                self.assertEqual(len(warnings_list), 0)
--- a/spacy/tests/test_misc.py
+++ b/spacy/tests/test_misc.py
@ -135,3 +135,14 @@ def test_ascii_filenames():
    root = Path(__file__).parent.parent
    for path in root.glob("**/*"):
        assert all(ord(c) < 128 for c in path.name), path.name
+
+
+def test_load_model_blank_shortcut():
+    """Test that using a model name like "blank:en" works as a shortcut for
+    spacy.blank("en").
+    """
+    nlp = util.load_model("blank:en")
+    assert nlp.lang == "en"
+    assert nlp.pipeline == []
+    with pytest.raises(ImportError):
+        util.load_model("blank:fjsfijsdof")
--- a/spacy/tokens/_retokenize.pyx
+++ b/spacy/tokens/_retokenize.pyx
@ -55,6 +55,8 @@ cdef class Retokenizer:
        """
        if (span.start, span.end) in self._spans_to_merge:
            return
+        if span.end - span.start <= 0:
+            raise ValueError(Errors.E199.format(start=span.start, end=span.end))
        for token in span:
            if token.i in self.tokens_to_merge:
                raise ValueError(Errors.E102.format(token=repr(token)))
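
Review note on the zero-length guard: a span such as `doc[1:1]` has `start == end` and therefore covers no tokens, which is the case the new `test_retokenize_disallow_zero_length` test exercises. The sketch below reproduces the check on plain integers. `check_mergeable` is a hypothetical stand-in, not spaCy API, and the message text is copied from `E199` as added to `spacy/errors.py` in this commit.

```python
# Message text copied from E199 in spacy/errors.py (added in this commit).
E199 = "Unable to merge 0-length span at doc[{start}:{end}]."


def check_mergeable(start, end):
    """Raise ValueError for spans that cover no tokens.

    Hypothetical stand-in for the check added to Retokenizer.merge.
    """
    if end - start <= 0:
        raise ValueError(E199.format(start=start, end=end))
    return end - start


words = ["hello", "world", "!"]
assert words[1:1] == []  # a slice with start == end selects nothing
assert check_mergeable(0, 2) == 2

try:
    check_mergeable(1, 1)
except ValueError as err:
    message = str(err)
```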
--- a/spacy/util.py
+++ b/spacy/util.py
@ -161,6 +161,8 @@ def load_model(name, **overrides):
    if not data_path or not data_path.exists():
        raise IOError(Errors.E049.format(path=path2str(data_path)))
    if isinstance(name, basestring_):  # in data dir / shortcut
+        if name.startswith("blank:"):  # shortcut for blank model
+            return get_lang_class(name.replace("blank:", ""))()
        if name in set([d.name for d in data_path.iterdir()]):
            return load_model_from_link(name, **overrides)
        if is_package(name):  # installed as package
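
Review note on the `blank:` shortcut: the patch strips the prefix with `name.replace("blank:", "")`, which removes every occurrence of `"blank:"` in the name, not only the leading one. For ordinary language codes the result is the same, but a prefix slice states the intent more directly. The sketch below shows that variant. `blank_lang_code` is a hypothetical helper and is not part of `spacy.util`.

```python
def blank_lang_code(name):
    """Return the language code for a "blank:<lang>" name, else None.

    Hypothetical helper for illustration; not part of spaCy's util module.
    """
    prefix = "blank:"
    if name.startswith(prefix):
        # Slice off the prefix only; str.replace would also strip any later
        # occurrence of "blank:" inside the name.
        return name[len(prefix):]
    return None


print(blank_lang_code("blank:en"))        # en
print(blank_lang_code("en_core_web_sm"))  # None
```

Per the `test_load_model_blank_shortcut` test in this commit, `util.load_model("blank:en")` is expected to return a pipeline with `nlp.lang == "en"` and an empty `nlp.pipeline`, and an unknown code is expected to raise `ImportError`.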
--- a/spacy/vectors.pyx
+++ b/spacy/vectors.pyx
@ -383,8 +383,16 @@ cdef class Vectors:
            save_array = lambda arr, file_: xp.save(file_, arr, allow_pickle=False)
        else:
            save_array = lambda arr, file_: xp.save(file_, arr)
+
+        def save_vectors(path):
+            # the source of numpy.save indicates that the file object is closed after use.
+            # but it seems that somehow this does not happen, as ResourceWarnings are raised here.
+            # in order to not rely on this, wrap in context manager.
+            with path.open("wb") as _file:
+                save_array(self.data, _file)
+
        serializers = OrderedDict((
-            ("vectors", lambda p: save_array(self.data, p.open("wb"))),
+            ("vectors", lambda p: save_vectors(p)),
            ("key2row", lambda p: srsly.write_msgpack(p, self.key2row))
        ))
        return util.to_disk(path, serializers, [])
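
Review note on the `save_vectors` change, which follows the same pattern as the `language.py` and `pipes.pyx` hunks in this commit: a file handle returned by `p.open(...)` is no longer left for the garbage collector to close. The sketch below uses only the standard library and counts `ResourceWarning`s in roughly the way `write_obj_and_catch_warnings` does in the new regression test. All function names here are hypothetical. Whether the leaky variant actually emits a warning depends on the interpreter, so only the context-manager variant is checked.

```python
import gc
import tempfile
import warnings
from pathlib import Path


def write_leaky(path, data):
    # The file object is never closed explicitly; cleanup is left to the
    # garbage collector, which may emit a ResourceWarning.
    path.open("wb").write(data)


def write_safe(path, data):
    # The context manager closes the file as soon as the block exits.
    with path.open("wb") as file_:
        file_.write(data)


def count_resource_warnings(write_func, data=b"vectors"):
    """Run write_func on a temp file and count the ResourceWarnings recorded."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always", ResourceWarning)
            write_func(Path(tmp_dir) / "out.bin", data)
            gc.collect()  # let non-refcounting interpreters finalize objects
        return sum(issubclass(w.category, ResourceWarning) for w in caught)


safe_count = count_resource_warnings(write_safe)
```

Calling `count_resource_warnings(write_leaky)` on CPython is a quick way to see the warning that motivated the change, since the unclosed handle is reported when the file object is deallocated.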
--- a/website/docs/api/token.md
+++ b/website/docs/api/token.md
@ -351,25 +351,9 @@ property to `0` for the first word of the document.
 - assert doc[4].sent_start == 1
 + assert doc[4].is_sent_start == True
 ```
+
 </Infobox>

-## Token.is_sent_end {#is_sent_end tag="property" new="2"}
-
-A boolean value indicating whether the token ends a sentence. `None` if
-unknown. Defaults to `True` for the last token in the `Doc`.
-
-> #### Example
->
-> ```python
-> doc = nlp("Give it back! He pleaded.")
-> assert doc[3].is_sent_end
-> assert not doc[4].is_sent_end
-> ```
-
-| Name        | Type | Description                          |
-| ----------- | ---- | ------------------------------------ |
-| **RETURNS** | bool | Whether the token ends a sentence. |
-
 ## Token.has_vector {#has_vector tag="property" model="vectors"}

 A boolean value indicating whether a word vector is associated with the token.
@ -425,7 +409,7 @@ The L2 norm of the token's vector representation.
 ## Attributes {#attributes}

 | Name                                         | Type         | Description                                                                                                                                                                                                                                                    |
-| -------------------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| -------------------------------------------- | ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `doc`                                        | `Doc`        | The parent document.                                                                                                                                                                                                                                           |
 | `sent` <Tag variant="new">2.0.12</Tag>       | `Span`       | The sentence span that this token is a part of.                                                                                                                                                                                                                |
 | `text`                                       | unicode      | Verbatim text content.                                                                                                                                                                                                                                         |