mirror of https://github.com/explosion/spaCy.git
synced 2024-12-26 01:46:28 +03:00

fixed tag_map.py merge conflict

parent eba4f77526 · commit 80e15af76c
.github/contributors/ivigamberdiev.md (vendored, new file, +106 lines)

@@ -0,0 +1,106 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an "x" on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Igor Igamberdiev     |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | April 2, 2019        |
| GitHub username                | ivigamberdiev        |
| Website (optional)             |                      |
.github/contributors/nlptown.md (vendored, new file, +106 lines)

@@ -0,0 +1,106 @@

The agreement text is identical to ivigamberdiev.md above, except that the term
**"us"** is defined as
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal), and the
employer statement is the one marked:

* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                              |
|------------------------------- | ---------------------------------- |
| Name                           | Yves Peirsman                      |
| Company name (if applicable)   | NLP Town (Island Constraints BVBA) |
| Title or role (if applicable)  | Co-founder                         |
| Date                           | 14.03.2019                         |
| GitHub username                | nlptown                            |
| Website (optional)             | http://www.nlp.town                |
.github/contributors/socool.md (vendored, new file, +106 lines)

@@ -0,0 +1,106 @@

The agreement text is again identical (with **"us"** defined as
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal)), signed as
an individual:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

## Contributor Details

| Field                          | Entry                    |
|------------------------------- | ------------------------ |
| Name                           | Kamolsit Mongkolsrisawat |
| Company name (if applicable)   | Mojito                   |
| Title or role (if applicable)  |                          |
| Date                           | 02-4-2019                |
| GitHub username                | socool                   |
| Website (optional)             |                          |
README.md (18 lines changed)

@@ -17,7 +17,7 @@ released under the MIT license.
 [![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-devops&style=flat-square)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
 [![Travis Build Status](https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square&logo=travis)](https://travis-ci.org/explosion/spaCy)
 [![Current Release Version](https://img.shields.io/github/release/explosion/spacy.svg?style=flat-square)](https://github.com/explosion/spaCy/releases)
-[![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square)](https://pypi.python.org/pypi/spacy)
+[![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square)](https://pypi.org/project/spacy/)
 [![conda Version](https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square)](https://anaconda.org/conda-forge/spacy)
 [![Python wheels](https://img.shields.io/badge/wheels-%E2%9C%93-4c1.svg?longCache=true&style=flat-square&logo=python&logoColor=white)](https://github.com/explosion/wheelwright/releases)
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)

@@ -42,7 +42,7 @@ released under the MIT license.
 [api reference]: https://spacy.io/api/
 [models]: https://spacy.io/models
 [universe]: https://spacy.io/universe
-[changelog]: https://spacy.io/usage/#changelog
+[changelog]: https://spacy.io/usage#changelog
 [contribute]: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md

 ## 💬 Where to ask questions

@@ -60,7 +60,7 @@ valuable if it's shared publicly, so that more people can benefit from it.
 | 🗯 **General Discussion** | [Gitter Chat] · [Reddit User Group] |

 [github issue tracker]: https://github.com/explosion/spaCy/issues
-[stack overflow]: http://stackoverflow.com/questions/tagged/spacy
+[stack overflow]: https://stackoverflow.com/questions/tagged/spacy
 [gitter chat]: https://gitter.im/explosion/spaCy
 [reddit user group]: https://www.reddit.com/r/spacynlp

@@ -95,7 +95,7 @@ For detailed installation instructions, see the
 - **Python version**: Python 2.7, 3.5+ (only 64 bit)
 - **Package managers**: [pip] · [conda] (via `conda-forge`)

-[pip]: https://pypi.python.org/pypi/spacy
+[pip]: https://pypi.org/project/spacy/
 [conda]: https://anaconda.org/conda-forge/spacy

 ### pip

@@ -219,7 +219,7 @@ source. That is the common way if you want to make changes to the code base.
 You'll need to make sure that you have a development environment consisting of a
 Python distribution including header files, a compiler,
 [pip](https://pip.pypa.io/en/latest/installing/),
-[virtualenv](https://virtualenv.pypa.io/) and [git](https://git-scm.com)
+[virtualenv](https://virtualenv.pypa.io/en/latest/) and [git](https://git-scm.com)
 installed. The compiler part is the trickiest. How to do that depends on your
 system. See notes on Ubuntu, OS X and Windows for details.

@@ -239,8 +239,8 @@ python setup.py build_ext --inplace
 Compared to regular install via pip, [requirements.txt](requirements.txt)
 additionally installs developer dependencies such as Cython. For more details
 and instructions, see the documentation on
-[compiling spaCy from source](https://spacy.io/usage/#source) and the
-[quickstart widget](https://spacy.io/usage/#section-quickstart) to get
+[compiling spaCy from source](https://spacy.io/usage#source) and the
+[quickstart widget](https://spacy.io/usage#section-quickstart) to get
 the right commands for your platform and Python version.

 ### Ubuntu

@@ -260,7 +260,7 @@ and git preinstalled.
 ### Windows

 Install a version of the [Visual C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/) or
-[Visual Studio Express](https://www.visualstudio.com/vs/visual-studio-express/)
+[Visual Studio Express](https://visualstudio.microsoft.com/vs/express/)
 that matches the version that was used to compile your Python
 interpreter. For official distributions these are VS 2008 (Python 2.7),
 VS 2010 (Python 3.4) and VS 2015 (Python 3.5).

@@ -282,5 +282,5 @@ pip install -r path/to/requirements.txt
 python -m pytest <spacy-directory>
 ```

-See [the documentation](https://spacy.io/usage/#tests) for more details and
+See [the documentation](https://spacy.io/usage#tests) for more details and
 examples.
@@ -23,7 +23,7 @@ For more details, see the documentation:
 * Training: https://spacy.io/usage/training
 * NER: https://spacy.io/usage/linguistic-features#named-entities

-Compatible with: spaCy v2.0.0+
+Compatible with: spaCy v2.1.0+
 Last tested with: v2.1.0
 """
 from __future__ import unicode_literals, print_function
spacy/_ml.py (41 lines changed)

@@ -86,7 +86,7 @@ def with_cpu(ops, model):
     as necessary."""
     model.to_cpu()

-    def with_cpu_forward(inputs, drop=0.):
+    def with_cpu_forward(inputs, drop=0.0):
         cpu_outputs, backprop = model.begin_update(_to_cpu(inputs), drop=drop)
         gpu_outputs = _to_device(ops, cpu_outputs)

@@ -106,7 +106,7 @@ def _to_cpu(X):
         return tuple([_to_cpu(x) for x in X])
     elif isinstance(X, list):
         return [_to_cpu(x) for x in X]
-    elif hasattr(X, 'get'):
+    elif hasattr(X, "get"):
         return X.get()
     else:
         return X

@@ -142,7 +142,9 @@ class extract_ngrams(Model):
         # The dtype here matches what thinc is expecting -- which differs per
         # platform (by int definition). This should be fixed once the problem
         # is fixed on Thinc's side.
-        lengths = self.ops.asarray([arr.shape[0] for arr in batch_keys], dtype=numpy.int_)
+        lengths = self.ops.asarray(
+            [arr.shape[0] for arr in batch_keys], dtype=numpy.int_
+        )
         batch_keys = self.ops.xp.concatenate(batch_keys)
         batch_vals = self.ops.asarray(self.ops.xp.concatenate(batch_vals), dtype="f")
         return (batch_keys, batch_vals, lengths), None

@@ -592,32 +594,27 @@ def build_text_classifier(nr_class, width=64, **cfg):
     )

     linear_model = build_bow_text_classifier(
-        nr_class, ngram_size=cfg.get("ngram_size", 1), exclusive_classes=False)
-    if cfg.get('exclusive_classes'):
+        nr_class, ngram_size=cfg.get("ngram_size", 1), exclusive_classes=False
+    )
+    if cfg.get("exclusive_classes"):
         output_layer = Softmax(nr_class, nr_class * 2)
     else:
         output_layer = (
-            zero_init(Affine(nr_class, nr_class * 2, drop_factor=0.0))
-            >> logistic
+            zero_init(Affine(nr_class, nr_class * 2, drop_factor=0.0)) >> logistic
         )
-    model = (
-        (linear_model | cnn_model)
-        >> output_layer
-    )
+    model = (linear_model | cnn_model) >> output_layer
     model.tok2vec = chain(tok2vec, flatten)
     model.nO = nr_class
     model.lsuv = False
     return model


-def build_bow_text_classifier(nr_class, ngram_size=1, exclusive_classes=False,
-                              no_output_layer=False, **cfg):
+def build_bow_text_classifier(
+    nr_class, ngram_size=1, exclusive_classes=False, no_output_layer=False, **cfg
+):
     with Model.define_operators({">>": chain}):
-        model = (
-            with_cpu(Model.ops,
-                extract_ngrams(ngram_size, attr=ORTH)
-                >> LinearModel(nr_class)
-            )
-        )
+        model = with_cpu(
+            Model.ops, extract_ngrams(ngram_size, attr=ORTH) >> LinearModel(nr_class)
+        )
         if not no_output_layer:
             model = model >> (cpu_softmax if exclusive_classes else logistic)

@@ -626,11 +623,9 @@ def build_bow_text_classifier(nr_class, ngram_size=1, exclusive_classes=False,
 @layerize
-def cpu_softmax(X, drop=0.):
+def cpu_softmax(X, drop=0.0):
     ops = NumpyOps()

+    Y = ops.softmax(X)

     def cpu_softmax_backward(dY, sgd=None):
         return dY

@@ -648,7 +643,9 @@ def build_simple_cnn_text_classifier(tok2vec, nr_class, exclusive_classes=False,
     if exclusive_classes:
         output_layer = Softmax(nr_class, tok2vec.nO)
     else:
-        output_layer = zero_init(Affine(nr_class, tok2vec.nO, drop_factor=0.0)) >> logistic
+        output_layer = (
+            zero_init(Affine(nr_class, tok2vec.nO, drop_factor=0.0)) >> logistic
+        )
     model = tok2vec >> flatten_add_lengths >> Pooling(mean_pool) >> output_layer
     model.tok2vec = chain(tok2vec, flatten)
     model.nO = nr_class
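Most of the `spacy/_ml.py` hunks above are mechanical black-style reformatting; the one piece of notation worth unpacking is the `>>` operator, which is thinc's `chain` combinator bound as an operator inside a `Model.define_operators` block. A minimal sketch, assuming the thinc 7.x-era imports that `spacy/_ml.py` itself uses (the layer sizes here are arbitrary examples):

```
# Minimal sketch of the operator overloading used in spacy/_ml.py.
from thinc.api import chain
from thinc.v2v import Model, Affine, Softmax

with Model.define_operators({">>": chain}):
    # ">>" pipes one layer's output into the next, so this is
    # equivalent to chain(Affine(64, 128), Softmax(10, 64)).
    model = Affine(64, 128) >> Softmax(10, 64)
```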
@@ -125,7 +125,9 @@ def pretrain(
             max_length=max_length,
             min_length=min_length,
         )
-        loss = make_update(model, docs, optimizer, objective=loss_func, drop=dropout)
+        loss = make_update(
+            model, docs, optimizer, objective=loss_func, drop=dropout
+        )
         progress = tracker.update(epoch, loss, docs)
         if progress:
             msg.row(progress, **row_settings)
@@ -50,8 +50,9 @@ class DependencyRenderer(object):
         rendered = []
         for i, p in enumerate(parsed):
             if i == 0:
-                self.direction = p["settings"].get("direction", DEFAULT_DIR)
-                self.lang = p["settings"].get("lang", DEFAULT_LANG)
+                settings = p.get("settings", {})
+                self.direction = settings.get("direction", DEFAULT_DIR)
+                self.lang = settings.get("lang", DEFAULT_LANG)
             render_id = "{}-{}".format(id_prefix, i)
             svg = self.render_svg(render_id, p["words"], p["arcs"])
             rendered.append(svg)

@@ -254,9 +255,10 @@ class EntityRenderer(object):
         rendered = []
         for i, p in enumerate(parsed):
             if i == 0:
-                self.direction = p["settings"].get("direction", DEFAULT_DIR)
-                self.lang = p["settings"].get("lang", DEFAULT_LANG)
-            rendered.append(self.render_ents(p["text"], p["ents"], p["title"]))
+                settings = p.get("settings", {})
+                self.direction = settings.get("direction", DEFAULT_DIR)
+                self.lang = settings.get("lang", DEFAULT_LANG)
+            rendered.append(self.render_ents(p["text"], p["ents"], p.get("title")))
         if page:
             docs = "".join([TPL_FIGURE.format(content=doc) for doc in rendered])
             markup = TPL_PAGE.format(content=docs, lang=self.lang, dir=self.direction)
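The change above makes both renderers tolerate input dicts that carry no "settings" key (and, for the entity renderer, no "title"), which is the common case when callers pass manually prepared data. A sketch of such a call; the example text and entity offsets are invented for illustration:

```
# Manually prepared displacy input with neither "settings" nor "title";
# with the fix above this renders instead of raising a KeyError.
from spacy import displacy

doc = {
    "text": "But Google is starting from behind.",
    "ents": [{"start": 4, "end": 10, "label": "ORG"}],
}
html = displacy.render(doc, style="ent", manual=True)
```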
@@ -1,7 +1,7 @@
 # coding: utf8
 from __future__ import unicode_literals

-from ...symbols import LEMMA, PRON_LEMMA, AUX
+from ...symbols import LEMMA, PRON_LEMMA

 _subordinating_conjunctions = [
     "that",

@@ -457,7 +457,6 @@ MORPH_RULES = {
     "have": {"POS": "AUX"},
     "'m": {"POS": "AUX", LEMMA: "be"},
     "'ve": {"POS": "AUX"},
-    "'re": {"POS": "AUX", LEMMA: "be"},
     "'s": {"POS": "AUX"},
     "is": {"POS": "AUX"},
     "'d": {"POS": "AUX"},
@@ -39,7 +39,7 @@ made make many may me meanwhile might mine more moreover most mostly move much
 must my myself

 name namely neither never nevertheless next nine no nobody none noone nor not
-nothing now nowhere n't
+nothing now nowhere

 of off often on once one only onto or other others otherwise our ours ourselves
 out over own

@@ -66,7 +66,13 @@ whereafter whereas whereby wherein whereupon wherever whether which while
 whither who whoever whole whom whose why will with within without would

 yet you your yours yourself yourselves
-
-'d 'll 'm 're 's 've
 """.split()
 )
+
+contractions = ["n't", "'d", "'ll", "'m", "'re", "'s", "'ve"]
+STOP_WORDS.update(contractions)
+
+for apostrophe in ["‘", "’"]:
+    for stopword in contractions:
+        STOP_WORDS.add(stopword.replace("'", apostrophe))
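The loop added above registers typographic-apostrophe variants of each contraction alongside the ASCII forms, so texts that use curly quotes are covered too. A standalone sketch of the same idea; the miniature STOP_WORDS set here is invented for illustration:

```
# Standalone sketch: generate curly-apostrophe variants of contraction
# stop words, mirroring the loop added in the diff above.
STOP_WORDS = set("nothing now nowhere".split())

contractions = ["n't", "'d", "'ll", "'m", "'re", "'s", "'ve"]
STOP_WORDS.update(contractions)

for apostrophe in ["‘", "’"]:
    for stopword in contractions:
        STOP_WORDS.add(stopword.replace("'", apostrophe))

assert "n’t" in STOP_WORDS  # the right-single-quote variant is now covered
```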
@@ -2,7 +2,11 @@
 from __future__ import unicode_literals

 from ...symbols import POS, PUNCT, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
+<<<<<<< HEAD
 from ...symbols import NOUN, PRON, AUX, SCONJ, INTJ, PART, PROPN
+=======
+from ...symbols import NOUN, PRON, AUX, SCONJ
+>>>>>>> 4faf62d5154c2d2adb6def32da914d18d5e9c8fe


 # POS explanations for indonesian available from https://www.aclweb.org/anthology/Y12-1014

@@ -92,4 +96,3 @@ TAG_MAP = {
     "D--+PS2":{POS: ADV},
     "PP3+T—": {POS: PRON}
 }
@@ -4,6 +4,11 @@ from __future__ import unicode_literals
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .tag_map import TAG_MAP
+from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
+from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
+
+from .lemmatizer import LOOKUP, LEMMA_EXC, LEMMA_INDEX, RULES
+from .lemmatizer.lemmatizer import DutchLemmatizer

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
 from ..norm_exceptions import BASE_NORMS

@@ -13,20 +18,33 @@ from ...util import update_exc, add_lookups


 class DutchDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters.update(LEX_ATTRS)
-    lex_attr_getters[LANG] = lambda text: "nl"
-    lex_attr_getters[NORM] = add_lookups(
-        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
-    )
-    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS)
+    lex_attr_getters[LANG] = lambda text: 'nl'
+    lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM],
+                                         BASE_NORMS)
+    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
     tag_map = TAG_MAP
+    infixes = TOKENIZER_INFIXES
+    suffixes = TOKENIZER_SUFFIXES
+
+    @classmethod
+    def create_lemmatizer(cls, nlp=None):
+        rules = RULES
+        lemma_index = LEMMA_INDEX
+        lemma_exc = LEMMA_EXC
+        lemma_lookup = LOOKUP
+        return DutchLemmatizer(index=lemma_index,
+                               exceptions=lemma_exc,
+                               lookup=lemma_lookup,
+                               rules=rules)


 class Dutch(Language):
-    lang = "nl"
+    lang = 'nl'
     Defaults = DutchDefaults


-__all__ = ["Dutch"]
+__all__ = ['Dutch']
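For orientation (not part of the commit): once `create_lemmatizer` is defined on the defaults, a plain v2.x-era Dutch pipeline should pick up the lookup- and rule-backed lemmatizer automatically. A rough usage sketch, with the sample sentence borrowed from the examples file below:

```
# Rough sketch: the Dutch language class now wires in DutchLemmatizer,
# so lemmas come from the lookup/rule tables even without a tagger.
from spacy.lang.nl import Dutch

nlp = Dutch()
doc = nlp("Londen is een grote stad in het Verenigd Koninkrijk")
print([token.lemma_ for token in doc])
```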
@@ -14,5 +14,5 @@ sentences = [
     "Apple overweegt om voor 1 miljard een U.K. startup te kopen",
     "Autonome auto's verschuiven de verzekeringverantwoordelijkheid naar producenten",
     "San Francisco overweegt robots op voetpaden te verbieden",
-    "Londen is een grote stad in het Verenigd Koninkrijk",
+    "Londen is een grote stad in het Verenigd Koninkrijk"
 ]
spacy/lang/nl/lemmatizer/__init__.py (new file, +40 lines)

@@ -0,0 +1,40 @@
# coding: utf8
from __future__ import unicode_literals

from ._verbs_irreg import VERBS_IRREG
from ._nouns_irreg import NOUNS_IRREG
from ._adjectives_irreg import ADJECTIVES_IRREG
from ._adverbs_irreg import ADVERBS_IRREG

from ._adpositions_irreg import ADPOSITIONS_IRREG
from ._determiners_irreg import DETERMINERS_IRREG
from ._pronouns_irreg import PRONOUNS_IRREG

from ._verbs import VERBS
from ._nouns import NOUNS
from ._adjectives import ADJECTIVES

from ._adpositions import ADPOSITIONS
from ._determiners import DETERMINERS

from .lookup import LOOKUP

from ._lemma_rules import RULES

from .lemmatizer import DutchLemmatizer


LEMMA_INDEX = {"adj": ADJECTIVES,
               "noun": NOUNS,
               "verb": VERBS,
               "adp": ADPOSITIONS,
               "det": DETERMINERS}

LEMMA_EXC = {"adj": ADJECTIVES_IRREG,
             "adv": ADVERBS_IRREG,
             "adp": ADPOSITIONS_IRREG,
             "noun": NOUNS_IRREG,
             "verb": VERBS_IRREG,
             "det": DETERMINERS_IRREG,
             "pron": PRONOUNS_IRREG}
spacy/lang/nl/lemmatizer/_adjectives.py (new file, +3461 lines; diff suppressed because it is too large)
spacy/lang/nl/lemmatizer/_adjectives_irreg.py (new file, +3033 lines; diff suppressed because it is too large)
spacy/lang/nl/lemmatizer/_adpositions.py (new file, +24 lines)

@@ -0,0 +1,24 @@
# coding: utf8
from __future__ import unicode_literals


ADPOSITIONS = set(
    ('aan aangaande aanwezig achter af afgezien al als an annex anno anti '
     'behalve behoudens beneden benevens benoorden beoosten betreffende bewesten '
     'bezijden bezuiden bij binnen binnenuit binst bladzij blijkens boven bovenop '
     'buiten conform contra cq daaraan daarbij daarbuiten daarin daarnaar '
     'daaronder daartegenover daarvan dankzij deure dichtbij door doordat doorheen '
     'echter eraf erop erover errond eruit ervoor evenals exclusief gedaan '
     'gedurende gegeven getuige gezien halfweg halverwege heen hierdoorheen hierop '
     'houdende in inclusief indien ingaande ingevolge inzake jegens kortweg '
     'krachtens kralj langs langsheen langst lastens linksom lopende luidens mede '
     'mee met middels midden middenop mits na naan naar naartoe naast naat nabij '
     'nadat namens neer neffe neffen neven nevenst niettegenstaande nopens '
     'officieel om omheen omstreeks omtrent onafgezien ondanks onder onderaan '
     'ondere ongeacht ooit op open over per plus pro qua rechtover rond rondom '
     "sedert sinds spijts strekkende te tegen tegenaan tegenop tegenover telde "
     'teneinde terug tijdens toe tot totdat trots tussen tégen uit uitgenomen '
     'ultimo van vanaf vandaan vandoor vanop vanuit vanwege versus via vinnen '
     'vlakbij volgens voor voor- voorbij voordat voort voren vòòr vóór waaraan '
     'waarbij waardoor waaronder weg wegens weleens zijdens zoals zodat zonder '
     'zónder à').split())
spacy/lang/nl/lemmatizer/_adpositions_irreg.py (new file, +12 lines)

@@ -0,0 +1,12 @@
# coding: utf8
from __future__ import unicode_literals


ADPOSITIONS_IRREG = {
    "'t": ('te',),
    'me': ('mee',),
    'meer': ('mee',),
    'on': ('om',),
    'ten': ('te',),
    'ter': ('te',)
}
spacy/lang/nl/lemmatizer/_adverbs_irreg.py (new file, +19 lines)

@@ -0,0 +1,19 @@
# coding: utf8
from __future__ import unicode_literals


ADVERBS_IRREG = {
    "'ns": ('eens',),
    "'s": ('eens',),
    "'t": ('het',),
    "d'r": ('er',),
    "d'raf": ('eraf',),
    "d'rbij": ('erbij',),
    "d'rheen": ('erheen',),
    "d'rin": ('erin',),
    "d'rna": ('erna',),
    "d'rnaar": ('ernaar',),
    'hele': ('heel',),
    'nevenst': ('nevens',),
    'overend': ('overeind',)
}
spacy/lang/nl/lemmatizer/_determiners.py (new file, +17 lines)

@@ -0,0 +1,17 @@
# coding: utf8
from __future__ import unicode_literals


DETERMINERS = set(
    ("al allebei allerhande allerminst alletwee"
     "beide clip-on d'n d'r dat datgeen datgene de dees degeen degene den dewelke "
     'deze dezelfde die diegeen diegene diehien dien diene diens diezelfde dit '
     'ditgene e een eene eigen elk elkens elkes enig enkel enne ettelijke eure '
     'euren evenveel ewe ge geen ginds géén haar haaren halfelf het hetgeen '
     'hetwelk hetzelfde heur heure hulder hulle hullen hullie hun hunder hunderen '
     'ieder iederes ja je jen jouw jouwen jouwes jullie junder keiveel keiweinig '
     "m'ne me meer meerder meerdere menen menig mijn mijnes minst méér niemendal "
     'oe ons onse se sommig sommigeder superveel telken teveel titulair ulder '
     'uldere ulderen ulle under une uw vaak veel veels véél wat weinig welk welken '
     "welkene welksten z'nen ze zenen zijn zo'n zo'ne zoiet zoveel zovele zovelen "
     'zuk zulk zulkdanig zulken zulks zullie zíjn àlle álle').split())
spacy/lang/nl/lemmatizer/_determiners_irreg.py (new file, +69 lines)

@@ -0,0 +1,69 @@
# coding: utf8
from __future__ import unicode_literals


DETERMINERS_IRREG = {
    "'r": ('haar',),
    "'s": ('de',),
    "'t": ('het',),
    "'tgene": ('hetgeen',),
    'alle': ('al',),
    'allen': ('al',),
    'aller': ('al',),
    'beiden': ('beide',),
    'beider': ('beide',),
    "d'": ('het',),
    "d'r": ('haar',),
    'der': ('de',),
    'des': ('de',),
    'dezer': ('deze',),
    'dienen': ('die',),
    'dier': ('die',),
    'elke': ('elk',),
    'ene': ('een',),
    'enen': ('een',),
    'ener': ('een',),
    'enige': ('enig',),
    'enigen': ('enig',),
    'er': ('haar',),
    'gene': ('geen',),
    'genen': ('geen',),
    'hare': ('haar',),
    'haren': ('haar',),
    'harer': ('haar',),
    'hunne': ('hun',),
    'hunnen': ('hun',),
    'jou': ('jouw',),
    'jouwe': ('jouw',),
    'julliejen': ('jullie',),
    "m'n": ('mijn',),
    'mee': ('meer',),
    'meer': ('veel',),
    'meerderen': ('meerdere',),
    'meest': ('veel',),
    'meesten': ('veel',),
    'meet': ('veel',),
    'menige': ('menig',),
    'mij': ('mijn',),
    'mijnen': ('mijn',),
    'minder': ('weinig',),
    'mindere': ('weinig',),
    'minst': ('weinig',),
    'minste': ('minst',),
    'ne': ('een',),
    'onze': ('ons',),
    'onzent': ('ons',),
    'onzer': ('ons',),
    'ouw': ('uw',),
    'sommige': ('sommig',),
    'sommigen': ('sommig',),
    'u': ('uw',),
    'vaker': ('vaak',),
    'vele': ('veel',),
    'velen': ('veel',),
    'welke': ('welk',),
    'zijne': ('zijn',),
    'zijnen': ('zijn',),
    'zijns': ('zijn',),
    'één': ('een',)
}
spacy/lang/nl/lemmatizer/_lemma_rules.py (new file, +79 lines)

@@ -0,0 +1,79 @@
# coding: utf8
from __future__ import unicode_literals


ADJECTIVE_SUFFIX_RULES = [
    ["sten", ""],
    ["ste", ""],
    ["st", ""],
    ["er", ""],
    ["en", ""],
    ["e", ""],
    ["ende", "end"]
]

VERB_SUFFIX_RULES = [
    ["dt", "den"],
    ["de", "en"],
    ["te", "en"],
    ["dde", "den"],
    ["tte", "ten"],
    ["dden", "den"],
    ["tten", "ten"],
    ["end", "en"],
]

NOUN_SUFFIX_RULES = [
    ["en", ""],
    ["ën", ""],
    ["'er", ""],
    ["s", ""],
    ["tje", ""],
    ["kje", ""],
    ["'s", ""],
    ["ici", "icus"],
    ["heden", "heid"],
    ["elen", "eel"],
    ["ezen", "ees"],
    ["even", "eef"],
    ["ssen", "s"],
    ["rren", "r"],
    ["kken", "k"],
    ["bben", "b"]
]

NUM_SUFFIX_RULES = [
    ["ste", ""],
    ["sten", ""],
    ["ën", ""],
    ["en", ""],
    ["de", ""],
    ["er", ""],
    ["ër", ""],
    ["tjes", ""]
]

PUNCT_SUFFIX_RULES = [
    ["“", "\""],
    ["”", "\""],
    ["\u2018", "'"],
    ["\u2019", "'"]
]


# In-place sort guaranteeing that longer -- more specific -- rules are
# applied first.
for rule_set in (ADJECTIVE_SUFFIX_RULES,
                 NOUN_SUFFIX_RULES,
                 NUM_SUFFIX_RULES,
                 VERB_SUFFIX_RULES):
    rule_set.sort(key=lambda r: len(r[0]), reverse=True)


RULES = {
    "adj": ADJECTIVE_SUFFIX_RULES,
    "noun": NOUN_SUFFIX_RULES,
    "verb": VERB_SUFFIX_RULES,
    "num": NUM_SUFFIX_RULES,
    "punct": PUNCT_SUFFIX_RULES
}
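The in-place sort matters because the lemmatizer (see lemmatizer.py below) tries rules in order and several suffixes are prefixes of one another, e.g. "st" vs "sten". A standalone sketch of the effect; the word and the simplified helper are invented for illustration:

```
# Standalone sketch: longest-suffix-first ordering ensures the most
# specific rule is tried before its shorter prefixes ("ste" before "e").
rules = [["sten", ""], ["ste", ""], ["st", ""], ["er", ""],
         ["en", ""], ["e", ""], ["ende", "end"]]
rules.sort(key=lambda r: len(r[0]), reverse=True)

def first_candidate(form, rules):
    # Simplified version of the suffix handling in lemmatize():
    # strip the first matching suffix and substitute its replacement.
    for old, new in rules:
        if form.endswith(old):
            return form[:len(form) - len(old)] + new
    return form

print(first_candidate("mooiste", rules))  # -> "mooi" (via "ste", not "e")
```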
spacy/lang/nl/lemmatizer/_nouns.py (new file, +27890 lines; diff suppressed because it is too large)
spacy/lang/nl/lemmatizer/_nouns_irreg.py (new file, +3240 lines; diff suppressed because it is too large)
spacy/lang/nl/lemmatizer/_numbers_irreg.py (new file, +31 lines)

@@ -0,0 +1,31 @@
# coding: utf8
from __future__ import unicode_literals


NUMBERS_IRREG = {
    'achten': ('acht',),
    'biljoenen': ('biljoen',),
    'drieën': ('drie',),
    'duizenden': ('duizend',),
    'eentjes': ('één',),
    'elven': ('elf',),
    'miljoenen': ('miljoen',),
    'negenen': ('negen',),
    'negentiger': ('negentig',),
    'tienduizenden': ('tienduizend',),
    'tienen': ('tien',),
    'tientjes': ('tien',),
    'twaalven': ('twaalf',),
    'tweeën': ('twee',),
    'twintiger': ('twintig',),
    'twintigsten': ('twintig',),
    'vieren': ('vier',),
    'vijftiger': ('vijftig',),
    'vijven': ('vijf',),
    'zessen': ('zes',),
    'zestiger': ('zestig',),
    'zevenen': ('zeven',),
    'zeventiger': ('zeventig',),
    'zovele': ('zoveel',),
    'zovelen': ('zoveel',)
}
spacy/lang/nl/lemmatizer/_pronouns_irreg.py (new file, +35 lines)

@@ -0,0 +1,35 @@
# coding: utf8
from __future__ import unicode_literals


PRONOUNS_IRREG = {
    "'r": ('haar',),
    "'rzelf": ('haarzelf',),
    "'t": ('het',),
    "d'r": ('haar',),
    'da': ('dat',),
    'dienen': ('die',),
    'diens': ('die',),
    'dies': ('die',),
    'elkaars': ('elkaar',),
    'elkanders': ('elkander',),
    'ene': ('een',),
    'enen': ('een',),
    'fik': ('ik',),
    'gaat': ('gaan',),
    'gene': ('geen',),
    'harer': ('haar',),
    'ieders': ('ieder',),
    'iemands': ('iemand',),
    'ikke': ('ik',),
    'mijnen': ('mijn',),
    'oe': ('je',),
    'onzer': ('ons',),
    'wa': ('wat',),
    'watte': ('wat',),
    'wier': ('wie',),
    'zijns': ('zijn',),
    'zoietsken': ('zoietske',),
    'zulks': ('zulk',),
    'één': ('een',)
}
spacy/lang/nl/lemmatizer/_verbs.py (new file, +2873 lines; diff suppressed because it is too large)
spacy/lang/nl/lemmatizer/_verbs_irreg.py (new file, +7201 lines; diff suppressed because it is too large)

spacy/lang/nl/lemmatizer/lemmatizer.py (new file, +130 lines)
|
@@ -0,0 +1,130 @@
# coding: utf8
from __future__ import unicode_literals

from ....symbols import POS, NOUN, VERB, ADJ, NUM, DET, PRON, ADP, AUX, ADV


class DutchLemmatizer(object):
    # Note: CGN does not distinguish AUX verbs, so we treat AUX as VERB.
    univ_pos_name_variants = {
        NOUN: "noun", "NOUN": "noun", "noun": "noun",
        VERB: "verb", "VERB": "verb", "verb": "verb",
        AUX: "verb", "AUX": "verb", "aux": "verb",
        ADJ: "adj", "ADJ": "adj", "adj": "adj",
        ADV: "adv", "ADV": "adv", "adv": "adv",
        PRON: "pron", "PRON": "pron", "pron": "pron",
        DET: "det", "DET": "det", "det": "det",
        ADP: "adp", "ADP": "adp", "adp": "adp",
        NUM: "num", "NUM": "num", "num": "num"
    }

    @classmethod
    def load(cls, path, index=None, exc=None, rules=None, lookup=None):
        return cls(index, exc, rules, lookup)

    def __init__(self, index=None, exceptions=None, rules=None, lookup=None):
        self.index = index
        self.exc = exceptions
        self.rules = rules or {}
        self.lookup_table = lookup if lookup is not None else {}

    def __call__(self, string, univ_pos, morphology=None):
        # Difference 1: self.rules is assumed to be non-None, so no
        # 'is None' check required.
        # The string is lowercased from the get-go: all lemmatization
        # results are lowercased strings. For most applications this
        # shouldn't pose any problems, and it keeps the exceptions indexes
        # small. If this creates problems for proper nouns, we can
        # introduce a check for univ_pos == "PROPN".
        string = string.lower()
        try:
            univ_pos = self.univ_pos_name_variants[univ_pos]
        except KeyError:
            # Because PROPN is not in self.univ_pos_name_variants, proper
            # names are not lemmatized. They are lowercased, however.
            return [string]
        # if string in self.lemma_index.get(univ_pos)
        lemma_index = self.index.get(univ_pos, {})
        # string is already a lemma
        if string in lemma_index:
            return [string]
        exceptions = self.exc.get(univ_pos, {})
        # string is an irregular token contained in the exceptions index
        try:
            lemma = exceptions[string]
            return [lemma[0]]
        except KeyError:
            pass
        # string corresponds to a key in the lookup table
        lookup_table = self.lookup_table
        looked_up_lemma = lookup_table.get(string)
        if looked_up_lemma and looked_up_lemma in lemma_index:
            return [looked_up_lemma]

        forms, is_known = lemmatize(
            string,
            lemma_index,
            exceptions,
            self.rules.get(univ_pos, []))

        # Back off through the remaining return-value candidates.
        if forms:
            if is_known:
                return forms
            else:
                for form in forms:
                    if form in exceptions:
                        return [form]
                if looked_up_lemma:
                    return [looked_up_lemma]
                else:
                    return forms
        elif looked_up_lemma:
            return [looked_up_lemma]
        else:
            return [string]

    # Overrides the parent method so that a lowercased version of the
    # string is used to search the lookup table. This is necessary because
    # our lookup table consists entirely of lowercase keys.
    def lookup(self, string):
        string = string.lower()
        return self.lookup_table.get(string, string)

    def noun(self, string, morphology=None):
        return self(string, 'noun', morphology)

    def verb(self, string, morphology=None):
        return self(string, 'verb', morphology)

    def adj(self, string, morphology=None):
        return self(string, 'adj', morphology)

    def det(self, string, morphology=None):
        return self(string, 'det', morphology)

    def pron(self, string, morphology=None):
        return self(string, 'pron', morphology)

    def adp(self, string, morphology=None):
        return self(string, 'adp', morphology)

    def punct(self, string, morphology=None):
        return self(string, 'punct', morphology)


# Reimplemented to focus more on the application of suffix rules and to
# return as early as possible.
def lemmatize(string, index, exceptions, rules):
    # returns (forms, is_known: bool)
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index:
                return [form], True  # True = is known (is a lemma)
            else:
                oov_forms.append(form)
    return list(set(oov_forms)), False
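To make the back-off order above concrete, here is a minimal usage sketch with made-up toy tables (the real tables live in the suppressed _verbs.py, _verbs_irreg.py and lookup.py files; the [old_suffix, new_suffix] rule format simply mirrors the `for old, new in rules` loop):

# Sketch: exercising the back-off chain with tiny, made-up tables.
index = {"verb": {"gaan", "lopen"}}     # known lemmas per POS
exc = {"verb": {"gaat": ("gaan",)}}     # irregular forms per POS
rules = {"verb": [["t", "en"]]}         # toy suffix rewrite rule
lookup = {"liep": "lopen"}              # fallback lookup table

lemmatizer = DutchLemmatizer(index, exc, rules, lookup)

print(lemmatizer("gaat", "verb"))        # ['gaan']   -- exceptions hit
print(lemmatizer("gaan", "verb"))        # ['gaan']   -- already a lemma
print(lemmatizer("liep", "verb"))        # ['lopen']  -- lookup table hit
print(lemmatizer("loopt", "verb"))       # ['loopen'] -- OOV rule candidate,
                                         #               returned as-is
print(lemmatizer("Amsterdam", "PROPN"))  # ['amsterdam'] -- unknown POS, so
                                         #               lowercased only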
212951    spacy/lang/nl/lemmatizer/lookup.py    Normal file
File diff suppressed because it is too large
@@ -4,22 +4,18 @@ from __future__ import unicode_literals
 from ...attrs import LIKE_NUM


-_num_words = set(
-    """
+_num_words = set("""
 nul een één twee drie vier vijf zes zeven acht negen tien elf twaalf dertien
 veertien twintig dertig veertig vijftig zestig zeventig tachtig negentig honderd
 duizend miljoen miljard biljoen biljard triljoen triljard
-""".split()
-)
+""".split())


-_ordinal_words = set(
-    """
+_ordinal_words = set("""
 eerste tweede derde vierde vijfde zesde zevende achtste negende tiende elfde
 twaalfde dertiende veertiende twintigste dertigste veertigste vijftigste
 zestigste zeventigste tachtigste negentigste honderdste duizendste miljoenste
 miljardste biljoenste biljardste triljoenste triljardste
-""".split()
-)
+""".split())


 def like_num(text):
@@ -27,13 +23,11 @@ def like_num(text):
     # or matches one of the number words. In order to handle numbers like
     # "drieëntwintig", more work is required.
     # See this discussion: https://github.com/explosion/spaCy/pull/1177
-    if text.startswith(("+", "-", "±", "~")):
-        text = text[1:]
-    text = text.replace(",", "").replace(".", "")
+    text = text.replace(',', '').replace('.', '')
     if text.isdigit():
         return True
-    if text.count("/") == 1:
-        num, denom = text.split("/")
+    if text.count('/') == 1:
+        num, denom = text.split('/')
         if num.isdigit() and denom.isdigit():
             return True
     if text.lower() in _num_words:
@@ -43,4 +37,6 @@ def like_num(text):
     return False


-LEX_ATTRS = {LIKE_NUM: like_num}
+LEX_ATTRS = {
+    LIKE_NUM: like_num
+}
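As a quick illustration of what the updated like_num accepts (a sketch; the sample strings are made up):

# like_num on some Dutch inputs:
print(like_num("twaalf"))          # True  -- listed number word
print(like_num("Elf"))             # True  -- lowercased before the lookup
print(like_num("1.000"))           # True  -- separators are stripped first
print(like_num("3/4"))             # True  -- simple fractions are recognised
print(like_num("drieëntwintig"))   # False -- compound numerals not handled yet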
33    spacy/lang/nl/punctuation.py    Normal file
@@ -0,0 +1,33 @@
# coding: utf8
from __future__ import unicode_literals

from ..char_classes import LIST_ELLIPSES, LIST_ICONS
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER

from ..punctuation import TOKENIZER_SUFFIXES as DEFAULT_TOKENIZER_SUFFIXES


# Copied from `de` package. Main purpose is to ensure that hyphens are not
# split on.

_quotes = CONCAT_QUOTES.replace("'", '')

_infixes = (LIST_ELLIPSES + LIST_ICONS +
            [r'(?<=[{}])\.(?=[{}])'.format(ALPHA_LOWER, ALPHA_UPPER),
             r'(?<=[{a}])[,!?](?=[{a}])'.format(a=ALPHA),
             r'(?<=[{a}"])[:<>=](?=[{a}])'.format(a=ALPHA),
             r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
             r'(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])'.format(a=ALPHA, q=_quotes),
             r'(?<=[{a}])--(?=[{a}])'.format(a=ALPHA),
             r'(?<=[0-9])-(?=[0-9])'])


# Remove "'s" suffix from the suffix list. In Dutch, "'s" is a plural ending
# when it occurs as a suffix and a clitic for "eens" in standalone use. To
# avoid ambiguity it's better to just leave it attached when it occurs as a
# suffix.
default_suffix_blacklist = ("'s", "'S", '’s', '’S')
_suffixes = [suffix for suffix in DEFAULT_TOKENIZER_SUFFIXES
             if suffix not in default_suffix_blacklist]

TOKENIZER_INFIXES = _infixes
TOKENIZER_SUFFIXES = _suffixes
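A quick way to probe the hyphen behaviour these infixes encode (a sketch with TOKENIZER_INFIXES as defined above; compile_infix_regex is a standard spaCy utility, and the sample strings are made up):

from spacy.util import compile_infix_regex

infix_re = compile_infix_regex(TOKENIZER_INFIXES)
# No letter-letter hyphen pattern is defined, so hyphenated compounds
# yield no infix match and stay one token:
print([m.group() for m in infix_re.finditer("mond-tot-mondreclame")])  # []
# Digit-digit hyphens *are* listed, so number ranges do get split:
print([m.group() for m in infix_re.finditer("1940-1945")])             # ['-']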
@@ -1,45 +1,73 @@
 # coding: utf8
 from __future__ import unicode_literals

-# Stop words are retrieved from http://www.damienvanholten.com/downloads/dutch-stop-words.txt
+# The original stop words list (added in f46ffe3) was taken from
+# http://www.damienvanholten.com/downloads/dutch-stop-words.txt
+# and consisted of about 100 tokens.
+# In order to achieve parity with some of the better-supported
+# languages, e.g., English, French, and German, this original list has been
+# extended with 200 additional tokens. The main source of inspiration was
+# https://raw.githubusercontent.com/stopwords-iso/stopwords-nl/master/stopwords-nl.txt.
+# However, quite a bit of manual editing has taken place as well.
+# Tokens whose status as a stop word is not entirely clear were admitted or
+# rejected by deferring to their counterparts in the stop words lists for English
+# and French. Similarly, those lists were used to identify and fill in gaps so
+# that -- in principle -- each token contained in the English stop words list
+# should have a Dutch counterpart here.

-STOP_WORDS = set(
-    """
-aan af al alles als altijd andere
+STOP_WORDS = set("""
+aan af al alle alles allebei alleen allen als altijd ander anders andere anderen aangaangde aangezien achter achterna
+afgelopen aldus alhoewel anderzijds

-ben bij
+ben bij bijna bijvoorbeeld behalve beide beiden beneden bent bepaald beter betere betreffende binnen binnenin boven
+bovenal bovendien bovenstaand buiten

-daar dan dat de der deze die dit doch doen door dus
+daar dan dat de der den deze die dit doch doen door dus daarheen daarin daarna daarnet daarom daarop des dezelfde dezen
+dien dikwijls doet doorgaand doorgaans

-een eens en er
+een eens en er echter enige eerder eerst eerste eersten effe eigen elk elke enkel enkele enz erdoor etc even eveneens
+evenwel

+ff

-ge geen geweest
+ge geen geweest gauw gedurende gegeven gehad geheel gekund geleden gelijk gemogen geven geweest gewoon gewoonweg
+geworden gij

-haar had heb hebben heeft hem het hier hij hoe hun
+haar had heb hebben heeft hem het hier hij hoe hun hadden hare hebt hele hen hierbeneden hierboven hierin hoewel hun

-iemand iets ik in is
+iemand iets ik in is idd ieder ikke ikzelf indien inmiddels inz inzake

-ja je
+ja je jou jouw jullie jezelf jij jijzelf jouwe juist

-kan kon kunnen
+kan kon kunnen klaar konden krachtens kunnen kunt

+lang later liet liever

-maar me meer men met mij mijn moet
+maar me meer men met mij mijn moet mag mede meer meesten mezelf mijzelf min minder misschien mocht mochten moest moesten
+moet moeten mogelijk mogen

-na naar niet niets nog nu
+na naar niet niets nog nu nabij nadat net nogal nooit nr nu

-of om omdat ons ook op over
+of om omdat ons ook op over omhoog omlaag omstreeks omtrent omver onder ondertussen ongeveer onszelf onze ooit opdat
+opnieuw opzij over overigens

+pas pp precies prof publ

-reeds
+reeds rond rondom

+sedert sinds sindsdien slechts sommige spoedig steeds

-te tegen toch toen tot
+‘t 't te tegen toch toen tot tamelijk ten tenzij ter terwijl thans tijdens toe totdat tussen

-u uit uw
+u uit uw uitgezonderd uwe uwen

-van veel voor
+van veel voor vaak vanaf vandaan vanuit vanwege veeleer verder verre vervolgens vgl volgens vooraf vooral vooralsnog
+voorbij voordat voordien voorheen voorop voort voorts vooruit vrij vroeg

-want waren was wat we wel werd wezen wie wij wil worden
+want waren was wat we wel werd wezen wie wij wil worden waar waarom wanneer want weer weg wegens weinig weinige weldra
+welk welke welken werd werden wiens wier wilde wordt

-zal ze zei zelf zich zij zijn zo zonder zou
-""".split()
-)
+zal ze zei zelf zich zij zijn zo zonder zou zeer zeker zekere zelfde zelfs zichzelf zijnde zijne zo’n zoals zodra zouden
+zoveel zowat zulk zulke zulks zullen zult
+""".split())
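A quick sanity check of the extended list (a sketch; assumes the module above is importable under the usual spaCy layout):

from spacy.lang.nl.stop_words import STOP_WORDS

assert "aan" in STOP_WORDS        # token from the original list
assert "alhoewel" in STOP_WORDS   # one of the newly added tokens
print(len(STOP_WORDS))            # roughly 300 tokens after the extension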
@@ -5,7 +5,6 @@ from ...symbols import POS, PUNCT, ADJ, NUM, DET, ADV, ADP, X, VERB
 from ...symbols import NOUN, PROPN, SPACE, PRON, CONJ


-# fmt: off
 TAG_MAP = {
     "ADJ__Number=Sing": {POS: ADJ},
     "ADJ___": {POS: ADJ},
@@ -811,4 +810,3 @@ TAG_MAP = {
     "X___": {POS: X},
     "_SP": {POS: SPACE}
 }
-# fmt: on
340    spacy/lang/nl/tokenizer_exceptions.py    Normal file
@@ -0,0 +1,340 @@
# coding: utf8
from __future__ import unicode_literals

from ...symbols import ORTH, LEMMA, TAG, NORM, PRON_LEMMA

# Extensive list of both common and uncommon Dutch abbreviations copied from
# github.com/diasks2/pragmatic_segmenter, a Ruby library for rule-based
# sentence boundary detection (MIT, Copyright 2015 Kevin S. Dias).
# Source file: https://github.com/diasks2/pragmatic_segmenter/blob/master/lib/pragmatic_segmenter/languages/dutch.rb
# (Last commit: 4d1477b)

# Main purpose of such an extensive list: considerably improved sentence
# segmentation.

# Note: This list has been copied over largely as-is. Some of the abbreviations
# are extremely domain-specific. Tokenizer performance may benefit from some
# slight pruning, although no performance regression has been observed so far.


abbrevs = ['a.2d.', 'a.a.', 'a.a.j.b.', 'a.f.t.', 'a.g.j.b.',
           'a.h.v.', 'a.h.w.', 'a.hosp.', 'a.i.', 'a.j.b.', 'a.j.t.',
           'a.m.', 'a.m.r.', 'a.p.m.', 'a.p.r.', 'a.p.t.', 'a.s.',
           'a.t.d.f.', 'a.u.b.', 'a.v.a.', 'a.w.', 'aanbev.',
           'aanbev.comm.', 'aant.', 'aanv.st.', 'aanw.', 'vnw.',
           'aanw.vnw.', 'abd.', 'abm.', 'abs.', 'acc.act.',
           'acc.bedr.m.', 'acc.bedr.t.', 'achterv.', 'act.dr.',
           'act.dr.fam.', 'act.fisc.', 'act.soc.', 'adm.akk.',
           'adm.besl.', 'adm.lex.', 'adm.onderr.', 'adm.ov.', 'adv.',
           'adv.', 'gen.', 'adv.bl.', 'afd.', 'afl.', 'aggl.verord.',
           'agr.', 'al.', 'alg.', 'alg.richts.', 'amén.', 'ann.dr.',
           'ann.dr.lg.', 'ann.dr.sc.pol.', 'ann.ét.eur.',
           'ann.fac.dr.lg.', 'ann.jur.créd.',
           'ann.jur.créd.règl.coll.', 'ann.not.', 'ann.parl.',
           'ann.prat.comm.', 'app.', 'arb.', 'aud.', 'arbbl.',
           'arbh.', 'arbit.besl.', 'arbrb.', 'arr.', 'arr.cass.',
           'arr.r.v.st.', 'arr.verbr.', 'arrondrb.', 'art.', 'artw.',
           'aud.', 'b.', 'b.', 'b.&w.', 'b.a.', 'b.a.s.', 'b.b.o.',
           'b.best.dep.', 'b.br.ex.', 'b.coll.fr.gem.comm.',
           'b.coll.vl.gem.comm.', 'b.d.cult.r.', 'b.d.gem.ex.',
           'b.d.gem.reg.', 'b.dep.', 'b.e.b.', 'b.f.r.',
           'b.fr.gem.ex.', 'b.fr.gem.reg.', 'b.i.h.', 'b.inl.j.d.',
           'b.inl.s.reg.', 'b.j.', 'b.l.', 'b.o.z.', 'b.prov.r.',
           'b.r.h.', 'b.s.', 'b.sr.', 'b.stb.', 'b.t.i.r.',
           'b.t.s.z.', 'b.t.w.rev.', 'b.v.',
           'b.ver.coll.gem.gem.comm.', 'b.verg.r.b.', 'b.versl.',
           'b.vl.ex.', 'b.voorl.reg.', 'b.w.', 'b.w.gew.ex.',
           'b.z.d.g.', 'b.z.v.', 'bab.', 'bedr.org.', 'begins.',
           'beheersov.', 'bekendm.comm.', 'bel.', 'bel.besch.',
           'bel.w.p.', 'beleidsov.', 'belg.', 'grondw.', 'ber.',
           'ber.w.', 'besch.', 'besl.', 'beslagr.', 'bestuurswet.',
           'bet.', 'betr.', 'betr.', 'vnw.', 'bevest.', 'bew.',
           'bijbl.', 'ind.', 'eig.', 'bijbl.n.bijdr.', 'bijl.',
           'bijv.', 'bijw.', 'bijz.decr.', 'bin.b.', 'bkh.', 'bl.',
           'blz.', 'bm.', 'bn.', 'rh.', 'bnw.', 'bouwr.', 'br.parl.',
           'bs.', 'bull.', 'bull.adm.pénit.', 'bull.ass.',
           'bull.b.m.m.', 'bull.bel.', 'bull.best.strafinr.',
           'bull.bmm.', 'bull.c.b.n.', 'bull.c.n.c.', 'bull.cbn.',
           'bull.centr.arb.', 'bull.cnc.', 'bull.contr.',
           'bull.doc.min.fin.', 'bull.f.e.b.', 'bull.feb.',
           'bull.fisc.fin.r.', 'bull.i.u.m.',
           'bull.inf.ass.secr.soc.', 'bull.inf.i.e.c.',
           'bull.inf.i.n.a.m.i.', 'bull.inf.i.r.e.', 'bull.inf.iec.',
           'bull.inf.inami.', 'bull.inf.ire.', 'bull.inst.arb.',
           'bull.ium.', 'bull.jur.imm.', 'bull.lég.b.', 'bull.off.',
           'bull.trim.b.dr.comp.', 'bull.us.', 'bull.v.b.o.',
           'bull.vbo.', 'bv.', 'bw.', 'bxh.', 'byz.', 'c.', 'c.a.',
           'c.a.-a.', 'c.a.b.g.', 'c.c.', 'c.c.i.', 'c.c.s.',
           'c.conc.jur.', 'c.d.e.', 'c.d.p.k.', 'c.e.', 'c.ex.',
           'c.f.', 'c.h.a.', 'c.i.f.', 'c.i.f.i.c.', 'c.j.', 'c.l.',
           'c.n.', 'c.o.d.', 'c.p.', 'c.pr.civ.', 'c.q.', 'c.r.',
           'c.r.a.', 'c.s.', 'c.s.a.', 'c.s.q.n.', 'c.v.', 'c.v.a.',
           'c.v.o.', 'ca.', 'cadeaust.', 'cah.const.',
           'cah.dr.europ.', 'cah.dr.immo.', 'cah.dr.jud.', 'cal.',
           '2d.', 'cal.', '3e.', 'cal.', 'rprt.', 'cap.', 'carg.',
           'cass.', 'cass.', 'verw.', 'cert.', 'cf.', 'ch.', 'chron.',
           'chron.d.s.', 'chron.dr.not.', 'cie.', 'cie.',
           'verz.schr.', 'cir.', 'circ.', 'circ.z.', 'cit.',
           'cit.loc.', 'civ.', 'cl.et.b.', 'cmt.', 'co.',
           'cognoss.v.', 'coll.', 'v.', 'b.', 'colp.w.', 'com.',
           'com.', 'cas.', 'com.v.min.', 'comm.', 'comm.', 'v.',
           'comm.bijz.ov.', 'comm.erf.', 'comm.fin.', 'comm.ger.',
           'comm.handel.', 'comm.pers.', 'comm.pub.', 'comm.straf.',
           'comm.v.', 'comm.venn.', 'comm.verz.', 'comm.voor.',
           'comp.', 'compt.w.', 'computerr.', 'con.m.', 'concl.',
           'concr.', 'conf.', 'confl.w.', 'confl.w.huwbetr.', 'cons.',
           'conv.', 'coöp.', 'ver.', 'corr.', 'corr.bl.',
           'cour.fisc.', 'cour.immo.', 'cridon.', 'crim.', 'cur.',
           'cur.', 'crt.', 'curs.', 'd.', 'd.-g.', 'd.a.', 'd.a.v.',
           'd.b.f.', 'd.c.', 'd.c.c.r.', 'd.d.', 'd.d.p.', 'd.e.t.',
           'd.gem.r.', 'd.h.', 'd.h.z.', 'd.i.', 'd.i.t.', 'd.j.',
           'd.l.r.', 'd.m.', 'd.m.v.', 'd.o.v.', 'd.parl.', 'd.w.z.',
           'dact.', 'dat.', 'dbesch.', 'dbesl.', 'decr.', 'decr.d.',
           'decr.fr.', 'decr.vl.', 'decr.w.', 'def.', 'dep.opv.',
           'dep.rtl.', 'derg.', 'desp.', 'det.mag.', 'deurw.regl.',
           'dez.', 'dgl.', 'dhr.', 'disp.', 'diss.', 'div.',
           'div.act.', 'div.bel.', 'dl.', 'dln.', 'dnotz.', 'doc.',
           'hist.', 'doc.jur.b.', 'doc.min.fin.', 'doc.parl.',
           'doctr.', 'dpl.', 'dpl.besl.', 'dr.', 'dr.banc.fin.',
           'dr.circ.', 'dr.inform.', 'dr.mr.', 'dr.pén.entr.',
           'dr.q.m.', 'drs.', 'dtp.', 'dwz.', 'dyn.', 'e.', 'e.a.',
           'e.b.', 'tek.mod.', 'e.c.', 'e.c.a.', 'e.d.', 'e.e.',
           'e.e.a.', 'e.e.g.', 'e.g.', 'e.g.a.', 'e.h.a.', 'e.i.',
           'e.j.', 'e.m.a.', 'e.n.a.c.', 'e.o.', 'e.p.c.', 'e.r.c.',
           'e.r.f.', 'e.r.h.', 'e.r.o.', 'e.r.p.', 'e.r.v.',
           'e.s.r.a.', 'e.s.t.', 'e.v.', 'e.v.a.', 'e.w.', 'e&o.e.',
           'ec.pol.r.', 'econ.', 'ed.', 'ed(s).', 'eff.', 'eig.',
           'eig.mag.', 'eil.', 'elektr.', 'enmb.', 'enz.', 'err.',
           'etc.', 'etq.', 'eur.', 'parl.', 'eur.t.s.', 'ev.', 'evt.',
           'ex.', 'ex.crim.', 'exec.', 'f.', 'f.a.o.', 'f.a.q.',
           'f.a.s.', 'f.i.b.', 'f.j.f.', 'f.o.b.', 'f.o.r.', 'f.o.s.',
           'f.o.t.', 'f.r.', 'f.supp.', 'f.suppl.', 'fa.', 'facs.',
           'fasc.', 'fg.', 'fid.ber.', 'fig.', 'fin.verh.w.', 'fisc.',
           'fisc.', 'tijdschr.', 'fisc.act.', 'fisc.koer.', 'fl.',
           'form.', 'foro.', 'it.', 'fr.', 'fr.cult.r.', 'fr.gem.r.',
           'fr.parl.', 'fra.', 'ft.', 'g.', 'g.a.', 'g.a.v.',
           'g.a.w.v.', 'g.g.d.', 'g.m.t.', 'g.o.', 'g.omt.e.', 'g.p.',
           'g.s.', 'g.v.', 'g.w.w.', 'geb.', 'gebr.', 'gebrs.',
           'gec.', 'gec.decr.', 'ged.', 'ged.st.', 'gedipl.',
           'gedr.st.', 'geh.', 'gem.', 'gem.', 'gem.',
           'gem.gem.comm.', 'gem.st.', 'gem.stem.', 'gem.w.',
           'gemeensch.optr.', 'gemeensch.standp.', 'gemeensch.strat.',
           'gemeent.', 'gemeent.b.', 'gemeent.regl.',
           'gemeent.verord.', 'geol.', 'geopp.', 'gepubl.',
           'ger.deurw.', 'ger.w.', 'gerekw.', 'gereq.', 'gesch.',
           'get.', 'getr.', 'gev.m.', 'gev.maatr.', 'gew.', 'ghert.',
           'gir.eff.verk.', 'gk.', 'gr.', 'gramm.', 'grat.w.',
           'grootb.w.', 'grs.', 'grvm.', 'grw.', 'gst.', 'gw.',
           'h.a.', 'h.a.v.o.', 'h.b.o.', 'h.e.a.o.', 'h.e.g.a.',
           'h.e.geb.', 'h.e.gestr.', 'h.l.', 'h.m.', 'h.o.', 'h.r.',
           'h.t.l.', 'h.t.m.', 'h.w.geb.', 'hand.', 'handelsn.w.',
           'handelspr.', 'handelsr.w.', 'handelsreg.w.', 'handv.',
           'harv.l.rev.', 'hc.', 'herald.', 'hert.', 'herz.',
           'hfdst.', 'hfst.', 'hgrw.', 'hhr.', 'hist.', 'hooggel.',
           'hoogl.', 'hosp.', 'hpw.', 'hr.', 'hr.', 'ms.', 'hr.ms.',
           'hregw.', 'hrg.', 'hst.', 'huis.just.', 'huisv.w.',
           'huurbl.', 'hv.vn.', 'hw.', 'hyp.w.', 'i.b.s.', 'i.c.',
           'i.c.m.h.', 'i.e.', 'i.f.', 'i.f.p.', 'i.g.v.', 'i.h.',
           'i.h.a.', 'i.h.b.', 'i.l.pr.', 'i.o.', 'i.p.o.', 'i.p.r.',
           'i.p.v.', 'i.pl.v.', 'i.r.d.i.', 'i.s.m.', 'i.t.t.',
           'i.v.', 'i.v.m.', 'i.v.s.', 'i.w.tr.', 'i.z.', 'ib.',
           'ibid.', 'icip-ing.cons.', 'iem.', 'indic.soc.', 'indiv.',
           'inf.', 'inf.i.d.a.c.', 'inf.idac.', 'inf.r.i.z.i.v.',
           'inf.riziv.', 'inf.soc.secr.', 'ing.', 'ing.', 'cons.',
           'ing.cons.', 'inst.', 'int.', 'int.', 'rechtsh.',
           'strafz.', 'interm.', 'intern.fisc.act.',
           'intern.vervoerr.', 'inv.', 'inv.', 'f.', 'inv.w.',
           'inv.wet.', 'invord.w.', 'inz.', 'ir.', 'irspr.', 'iwtr.',
           'j.', 'j.-cl.', 'j.c.b.', 'j.c.e.', 'j.c.fl.', 'j.c.j.',
           'j.c.p.', 'j.d.e.', 'j.d.f.', 'j.d.s.c.', 'j.dr.jeun.',
           'j.j.d.', 'j.j.p.', 'j.j.pol.', 'j.l.', 'j.l.m.b.',
           'j.l.o.', 'j.p.a.', 'j.r.s.', 'j.t.', 'j.t.d.e.',
           'j.t.dr.eur.', 'j.t.o.', 'j.t.t.', 'jaarl.', 'jb.hand.',
           'jb.kred.', 'jb.kred.c.s.', 'jb.l.r.b.', 'jb.lrb.',
           'jb.markt.', 'jb.mens.', 'jb.t.r.d.', 'jb.trd.',
           'jeugdrb.', 'jeugdwerkg.w.', 'jg.', 'jis.', 'jl.',
           'journ.jur.', 'journ.prat.dr.fisc.fin.', 'journ.proc.',
           'jrg.', 'jur.', 'jur.comm.fl.', 'jur.dr.soc.b.l.n.',
           'jur.f.p.e.', 'jur.fpe.', 'jur.niv.', 'jur.trav.brux.',
           'jurambt.', 'jv.cass.', 'jv.h.r.j.', 'jv.hrj.', 'jw.',
           'k.', 'k.', 'k.b.', 'k.g.', 'k.k.', 'k.m.b.o.', 'k.o.o.',
           'k.v.k.', 'k.v.v.v.', 'kadasterw.', 'kaderb.', 'kador.',
           'kbo-nr.', 'kg.', 'kh.', 'kiesw.', 'kind.bes.v.', 'kkr.',
           'koopv.', 'kr.', 'krankz.w.', 'ksbel.', 'kt.', 'ktg.',
           'ktr.', 'kvdm.', 'kw.r.', 'kymr.', 'kzr.', 'kzw.', 'l.',
           'l.b.', 'l.b.o.', 'l.bas.', 'l.c.', 'l.gew.', 'l.j.',
           'l.k.', 'l.l.', 'l.o.', 'l.r.b.', 'l.u.v.i.', 'l.v.r.',
           'l.v.w.', 'l.w.', "l'exp.-compt.b..", 'l’exp.-compt.b.',
           'landinr.w.', 'landscrt.', 'lat.', 'law.ed.', 'lett.',
           'levensverz.', 'lgrs.', 'lidw.', 'limb.rechtsl.', 'lit.',
           'litt.', 'liw.', 'liwet.', 'lk.', 'll.', 'll.(l.)l.r.',
           'loonw.', 'losbl.', 'ltd.', 'luchtv.', 'luchtv.w.', 'm.',
           'm.', 'not.', 'm.a.v.o.', 'm.a.w.', 'm.b.', 'm.b.o.',
           'm.b.r.', 'm.b.t.', 'm.d.g.o.', 'm.e.a.o.', 'm.e.r.',
           'm.h.', 'm.h.d.', 'm.i.v.', 'm.j.t.', 'm.k.', 'm.m.',
           'm.m.a.', 'm.m.h.h.', 'm.m.v.', 'm.n.', 'm.not.fisc.',
           'm.nt.', 'm.o.', 'm.r.', 'm.s.a.', 'm.u.p.', 'm.v.a.',
           'm.v.h.n.', 'm.v.t.', 'm.z.', 'maatr.teboekgest.luchtv.',
           'maced.', 'mand.', 'max.', 'mbl.not.', 'me.', 'med.',
           'med.', 'v.b.o.', 'med.b.u.f.r.', 'med.bufr.', 'med.vbo.',
           'meerv.', 'meetbr.w.', 'mém.adm.', 'mgr.', 'mgrs.', 'mhd.',
           'mi.verantw.', 'mil.', 'mil.bed.', 'mil.ger.', 'min.',
           'min.', 'aanbev.', 'min.', 'circ.', 'min.', 'fin.',
           'min.j.omz.', 'min.just.circ.', 'mitt.', 'mnd.', 'mod.',
           'mon.', 'mouv.comm.', 'mr.', 'ms.', 'muz.', 'mv.', 'n.',
           'chr.', 'n.a.', 'n.a.g.', 'n.a.v.', 'n.b.', 'n.c.',
           'n.chr.', 'n.d.', 'n.d.r.', 'n.e.a.', 'n.g.', 'n.h.b.c.',
           'n.j.', 'n.j.b.', 'n.j.w.', 'n.l.', 'n.m.', 'n.m.m.',
           'n.n.', 'n.n.b.', 'n.n.g.', 'n.n.k.', 'n.o.m.', 'n.o.t.k.',
           'n.rapp.', 'n.tijd.pol.', 'n.v.', 'n.v.d.r.', 'n.v.d.v.',
           'n.v.o.b.', 'n.v.t.', 'nat.besch.w.', 'nat.omb.',
           'nat.pers.', 'ned.cult.r.', 'neg.verkl.', 'nhd.', 'wisk.',
           'njcm-bull.', 'nl.', 'nnd.', 'no.', 'not.fisc.m.',
           'not.w.', 'not.wet.', 'nr.', 'nrs.', 'nste.', 'nt.',
           'numism.', 'o.', 'o.a.', 'o.b.', 'o.c.', 'o.g.', 'o.g.v.',
           'o.i.', 'o.i.d.', 'o.m.', 'o.o.', 'o.o.d.', 'o.o.v.',
           'o.p.', 'o.r.', 'o.regl.', 'o.s.', 'o.t.s.', 'o.t.t.',
           'o.t.t.t.', 'o.t.t.z.', 'o.tk.t.', 'o.v.t.', 'o.v.t.t.',
           'o.v.tk.t.', 'o.v.v.', 'ob.', 'obsv.', 'octr.',
           'octr.gem.regl.', 'octr.regl.', 'oe.', 'off.pol.', 'ofra.',
           'ohd.', 'omb.', 'omnil.', 'omz.', 'on.ww.', 'onderr.',
           'onfrank.', 'onteig.w.', 'ontw.', 'b.w.', 'onuitg.',
           'onz.', 'oorl.w.', 'op.cit.', 'opin.pa.', 'opm.', 'or.',
           'ord.br.', 'ord.gem.', 'ors.', 'orth.', 'os.', 'osm.',
           'ov.', 'ov.w.i.', 'ov.w.ii.', 'ov.ww.', 'overg.w.',
           'overw.', 'ovkst.', 'oz.', 'p.', 'p.a.', 'p.a.o.',
           'p.b.o.', 'p.e.', 'p.g.', 'p.j.', 'p.m.', 'p.m.a.', 'p.o.',
           'p.o.j.t.', 'p.p.', 'p.v.', 'p.v.s.', 'pachtw.', 'pag.',
           'pan.', 'pand.b.', 'pand.pér.', 'parl.gesch.',
           'parl.gesch.', 'inv.', 'parl.st.', 'part.arb.', 'pas.',
           'pasin.', 'pat.', 'pb.c.', 'pb.l.', 'pens.',
           'pensioenverz.', 'per.ber.i.b.r.', 'per.ber.ibr.', 'pers.',
           'st.', 'pft.', 'pk.', 'pktg.', 'plv.', 'po.', 'pol.',
           'pol.off.', 'pol.r.', 'pol.w.', 'postbankw.', 'postw.',
           'pp.', 'pr.', 'preadv.', 'pres.', 'prf.', 'prft.', 'prg.',
           'prijz.w.', 'proc.', 'procesregl.', 'prof.', 'prot.',
           'prov.', 'prov.b.', 'prov.instr.h.m.g.', 'prov.regl.',
           'prov.verord.', 'prov.w.', 'publ.', 'pun.', 'pw.',
           'q.b.d.', 'q.e.d.', 'q.q.', 'q.r.', 'r.', 'r.a.b.g.',
           'r.a.c.e.', 'r.a.j.b.', 'r.b.d.c.', 'r.b.d.i.', 'r.b.s.s.',
           'r.c.', 'r.c.b.', 'r.c.d.c.', 'r.c.j.b.', 'r.c.s.j.',
           'r.cass.', 'r.d.c.', 'r.d.i.', 'r.d.i.d.c.', 'r.d.j.b.',
           'r.d.j.p.', 'r.d.p.c.', 'r.d.s.', 'r.d.t.i.', 'r.e.',
           'r.f.s.v.p.', 'r.g.a.r.', 'r.g.c.f.', 'r.g.d.c.', 'r.g.f.',
           'r.g.z.', 'r.h.a.', 'r.i.c.', 'r.i.d.a.', 'r.i.e.j.',
           'r.i.n.', 'r.i.s.a.', 'r.j.d.a.', 'r.j.i.', 'r.k.', 'r.l.',
           'r.l.g.b.', 'r.med.', 'r.med.rechtspr.', 'r.n.b.', 'r.o.',
           'r.ov.', 'r.p.', 'r.p.d.b.', 'r.p.o.t.', 'r.p.r.j.',
           'r.p.s.', 'r.r.d.', 'r.r.s.', 'r.s.', 'r.s.v.p.',
           'r.stvb.', 'r.t.d.f.', 'r.t.d.h.', 'r.t.l.',
           'r.trim.dr.eur.', 'r.v.a.', 'r.verkb.', 'r.w.', 'r.w.d.',
           'rap.ann.c.a.', 'rap.ann.c.c.', 'rap.ann.c.e.',
           'rap.ann.c.s.j.', 'rap.ann.ca.', 'rap.ann.cass.',
           'rap.ann.cc.', 'rap.ann.ce.', 'rap.ann.csj.', 'rapp.',
           'rb.', 'rb.kh.', 'rdn.', 'rdnr.', 're.pers.', 'rec.',
           'rec.c.i.j.', 'rec.c.j.c.e.', 'rec.cij.', 'rec.cjce.',
           'rec.gén.enr.not.', 'rechtsk.t.', 'rechtspl.zeem.',
           'rechtspr.arb.br.', 'rechtspr.b.f.e.', 'rechtspr.bfe.',
           'rechtspr.soc.r.b.l.n.', 'recl.reg.', 'rect.', 'red.',
           'reg.', 'reg.huiz.bew.', 'reg.w.', 'registr.w.', 'regl.',
           'regl.', 'r.v.k.', 'regl.besl.', 'regl.onderr.',
           'regl.r.t.', 'rep.', 'rép.fisc.', 'rép.not.', 'rep.r.j.',
           'rep.rj.', 'req.', 'res.', 'resp.', 'rev.', 'rev.',
           'comp.', 'rev.', 'trim.', 'civ.', 'rev.', 'trim.', 'comm.',
           'rev.acc.trav.', 'rev.adm.', 'rev.b.compt.',
           'rev.b.dr.const.', 'rev.b.dr.intern.', 'rev.b.séc.soc.',
           'rev.banc.fin.', 'rev.comm.', 'rev.cons.prud.',
           'rev.dr.b.', 'rev.dr.commun.', 'rev.dr.étr.',
           'rev.dr.fam.', 'rev.dr.intern.comp.', 'rev.dr.mil.',
           'rev.dr.min.', 'rev.dr.pén.', 'rev.dr.pén.mil.',
           'rev.dr.rur.', 'rev.dr.u.l.b.', 'rev.dr.ulb.', 'rev.exp.',
           'rev.faill.', 'rev.fisc.', 'rev.gd.', 'rev.hist.dr.',
           'rev.i.p.c.', 'rev.ipc.', 'rev.not.b.',
           'rev.prat.dr.comm.', 'rev.prat.not.b.', 'rev.prat.soc.',
           'rev.rec.', 'rev.rw.', 'rev.trav.', 'rev.trim.d.h.',
           'rev.trim.dr.fam.', 'rev.urb.', 'richtl.', 'riv.dir.int.',
           'riv.dir.int.priv.proc.', 'rk.', 'rln.', 'roln.', 'rom.',
           'rondz.', 'rov.', 'rtl.', 'rubr.', 'ruilv.wet.',
           'rv.verdr.', 'rvkb.', 's.', 's.', 's.a.', 's.b.n.',
           's.ct.', 's.d.', 's.e.c.', 's.e.et.o.', 's.e.w.',
           's.exec.rept.', 's.hrg.', 's.j.b.', 's.l.', 's.l.e.a.',
           's.l.n.d.', 's.p.a.', 's.s.', 's.t.', 's.t.b.', 's.v.',
           's.v.p.', 'samenw.', 'sc.', 'sch.', 'scheidsr.uitspr.',
           'schepel.besl.', 'secr.comm.', 'secr.gen.', 'sect.soc.',
           'sess.', 'cas.', 'sir.', 'soc.', 'best.', 'soc.', 'handv.',
           'soc.', 'verz.', 'soc.act.', 'soc.best.', 'soc.kron.',
           'soc.r.', 'soc.sw.', 'soc.weg.', 'sofi-nr.', 'somm.',
           'somm.ann.', 'sp.c.c.', 'sr.', 'ss.', 'st.doc.b.c.n.a.r.',
           'st.doc.bcnar.', 'st.vw.', 'stagever.', 'stas.', 'stat.',
           'stb.', 'stbl.', 'stcrt.', 'stud.dipl.', 'su.', 'subs.',
           'subst.', 'succ.w.', 'suppl.', 'sv.', 'sw.', 't.', 't.a.',
           't.a.a.', 't.a.n.', 't.a.p.', 't.a.s.n.', 't.a.v.',
           't.a.v.w.', 't.aann.', 't.acc.', 't.agr.r.', 't.app.',
           't.b.b.r.', 't.b.h.', 't.b.m.', 't.b.o.', 't.b.p.',
           't.b.r.', 't.b.s.', 't.b.v.', 't.bankw.', 't.belg.not.',
           't.desk.', 't.e.m.', 't.e.p.', 't.f.r.', 't.fam.',
           't.fin.r.', 't.g.r.', 't.g.t.', 't.g.v.', 't.gem.',
           't.gez.', 't.huur.', 't.i.n.', 't.j.k.', 't.l.l.',
           't.l.v.', 't.m.', 't.m.r.', 't.m.w.', 't.mil.r.',
           't.mil.strafr.', 't.not.', 't.o.', 't.o.r.b.', 't.o.v.',
           't.ontv.', 't.p.r.', 't.pol.', 't.r.', 't.r.g.',
           't.r.o.s.', 't.r.v.', 't.s.r.', 't.strafr.', 't.t.',
           't.u.', 't.v.c.', 't.v.g.', 't.v.m.r.', 't.v.o.', 't.v.v.',
           't.v.v.d.b.', 't.v.w.', 't.verz.', 't.vred.', 't.vreemd.',
           't.w.', 't.w.k.', 't.w.v.', 't.w.v.r.', 't.wrr.', 't.z.',
           't.z.t.', 't.z.v.', 'taalk.', 'tar.burg.z.', 'td.',
           'techn.', 'telecomm.', 'toel.', 'toel.st.v.w.', 'toep.',
           'toep.regl.', 'tom.', 'top.', 'trans.b.', 'transp.r.',
           'trb.', 'trib.', 'trib.civ.', 'trib.gr.inst.', 'ts.',
           'ts.', 'best.', 'ts.', 'verv.', 'turnh.rechtsl.', 'tvpol.',
           'tvpr.', 'tvrechtsgesch.', 'tw.', 'u.', 'u.a.', 'u.a.r.',
           'u.a.v.', 'u.c.', 'u.c.c.', 'u.g.', 'u.p.', 'u.s.',
           'u.s.d.c.', 'uitdr.', 'uitl.w.', 'uitv.besch.div.b.',
           'uitv.besl.', 'uitv.besl.', 'succ.w.', 'uitv.besl.bel.rv.',
           'uitv.besl.l.b.', 'uitv.reg.', 'inv.w.', 'uitv.reg.bel.d.',
           'uitv.reg.afd.verm.', 'uitv.reg.lb.', 'uitv.reg.succ.w.',
           'univ.', 'univ.verkl.', 'v.', 'v.', 'chr.', 'v.a.',
           'v.a.v.', 'v.c.', 'v.chr.', 'v.h.', 'v.huw.verm.', 'v.i.',
           'v.i.o.', 'v.k.a.', 'v.m.', 'v.o.f.', 'v.o.n.',
           'v.onderh.verpl.', 'v.p.', 'v.r.', 'v.s.o.', 'v.t.t.',
           'v.t.t.t.', 'v.tk.t.', 'v.toep.r.vert.', 'v.v.b.',
           'v.v.g.', 'v.v.t.', 'v.v.t.t.', 'v.v.tk.t.', 'v.w.b.',
           'v.z.m.', 'vb.', 'vb.bo.', 'vbb.', 'vc.', 'vd.', 'veldw.',
           'ver.k.', 'ver.verg.gem.', 'gem.comm.', 'verbr.', 'verd.',
           'verdr.', 'verdr.v.', 'tek.mod.', 'verenw.', 'verg.',
           'verg.fr.gem.', 'comm.', 'verkl.', 'verkl.herz.gw.',
           'verl.', 'deelw.', 'vern.', 'verord.', 'vers.r.',
           'versch.', 'versl.c.s.w.', 'versl.csw.', 'vert.', 'verw.',
           'verz.', 'verz.w.', 'verz.wett.besl.',
           'verz.wett.decr.besl.', 'vgl.', 'vid.', 'viss.w.',
           'vl.parl.', 'vl.r.', 'vl.t.gez.', 'vl.w.reg.',
           'vl.w.succ.', 'vlg.', 'vn.', 'vnl.', 'vnw.', 'vo.',
           'vo.bl.', 'voegw.', 'vol.', 'volg.', 'volt.', 'deelw.',
           'voorl.', 'voorz.', 'vord.w.', 'vorst.d.', 'vr.', 'vred.',
           'vrg.', 'vnw.', 'vrijgrs.', 'vs.', 'vt.', 'vw.', 'vz.',
           'vzngr.', 'vzr.', 'w.', 'w.a.', 'w.b.r.', 'w.c.h.',
           'w.conf.huw.', 'w.conf.huwelijksb.', 'w.consum.kr.',
           'w.f.r.', 'w.g.', 'w.gew.r.', 'w.ident.pl.', 'w.just.doc.',
           'w.kh.', 'w.l.r.', 'w.l.v.', 'w.mil.straf.spr.', 'w.n.',
           'w.not.ambt.', 'w.o.', 'w.o.d.huurcomm.', 'w.o.d.k.',
           'w.openb.manif.', 'w.parl.', 'w.r.', 'w.reg.', 'w.succ.',
           'w.u.b.', 'w.uitv.pl.verord.', 'w.v.', 'w.v.k.',
           'w.v.m.s.', 'w.v.r.', 'w.v.w.', 'w.venn.', 'wac.', 'wd.',
           'wetb.', 'n.v.h.', 'wgb.', 'winkelt.w.', 'wisk.',
           'wka-verkl.', 'wnd.', 'won.w.', 'woningw.', 'woonr.w.',
           'wrr.', 'wrr.ber.', 'wrsch.', 'ws.', 'wsch.', 'wsr.',
           'wtvb.', 'ww.', 'x.d.', 'z.a.', 'z.g.', 'z.i.', 'z.j.',
           'z.o.z.', 'z.p.', 'z.s.m.', 'zg.', 'zgn.', 'zn.', 'znw.',
           'zr.', 'zr.', 'ms.', 'zr.ms.']


_exc = {}
for orth in abbrevs:
    _exc[orth] = [{ORTH: orth}]
    uppered = orth.upper()
    capsed = orth.capitalize()
    for i in [uppered, capsed]:
        _exc[i] = [{ORTH: i}]


TOKENIZER_EXCEPTIONS = _exc
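To see the exceptions in action, a small sketch (assuming a spaCy installation that ships this nl data; the sample sentence is made up):

import spacy

nlp = spacy.blank("nl")
doc = nlp("Stuur het verslag a.u.b. vóór vrijdag naar de afd. personeel.")
print([t.text for t in doc])
# 'a.u.b.' and 'afd.' are matched by the exceptions above and survive as
# single tokens instead of being split on their internal periods.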
|
@ -5,6 +5,320 @@ from ...symbols import ORTH, LEMMA
|
||||||
|
|
||||||
|
|
||||||
_exc = {
|
_exc = {
|
||||||
|
#หน่วยงานรัฐ / government agency
|
||||||
|
"กกต.": [{ORTH: "กกต.", LEMMA: "คณะกรรมการการเลือกตั้ง"}],
|
||||||
|
"กทท.": [{ORTH: "กทท.", LEMMA: "การท่าเรือแห่งประเทศไทย"}],
|
||||||
|
"กทพ.": [{ORTH: "กทพ.", LEMMA: "การทางพิเศษแห่งประเทศไทย"}],
|
||||||
|
"กบข.": [{ORTH: "กบข.", LEMMA: "กองทุนบำเหน็จบำนาญข้าราชการพลเรือน"}],
|
||||||
|
"กบว.": [{ORTH: "กบว.", LEMMA: "คณะกรรมการบริหารวิทยุกระจายเสียงและวิทยุโทรทัศน์"}],
|
||||||
|
"กปน.": [{ORTH: "กปน.", LEMMA: "การประปานครหลวง"}],
|
||||||
|
"กปภ.": [{ORTH: "กปภ.", LEMMA: "การประปาส่วนภูมิภาค"}],
|
||||||
|
"กปส.": [{ORTH: "กปส.", LEMMA: "กรมประชาสัมพันธ์"}],
|
||||||
|
"กผม.": [{ORTH: "กผม.", LEMMA: "กองผังเมือง"}],
|
||||||
|
"กฟน.": [{ORTH: "กฟน.", LEMMA: "การไฟฟ้านครหลวง"}],
|
||||||
|
"กฟผ.": [{ORTH: "กฟผ.", LEMMA: "การไฟฟ้าฝ่ายผลิตแห่งประเทศไทย"}],
|
||||||
|
"กฟภ.": [{ORTH: "กฟภ.", LEMMA: "การไฟฟ้าส่วนภูมิภาค"}],
|
||||||
|
"ก.ช.น.": [{ORTH: "ก.ช.น.", LEMMA: "คณะกรรมการช่วยเหลือชาวนาชาวไร่"}],
|
||||||
|
"กยศ.": [{ORTH: "กยศ.", LEMMA: "กองทุนเงินให้กู้ยืมเพื่อการศึกษา"}],
|
||||||
|
"ก.ล.ต.": [{ORTH: "ก.ล.ต.", LEMMA: "คณะกรรมการกำกับหลักทรัพย์และตลาดหลักทรัพย์"}],
|
||||||
|
"กศ.บ.": [{ORTH: "กศ.บ.", LEMMA: "การศึกษาบัณฑิต"}],
|
||||||
|
"กศน.": [{ORTH: "กศน.", LEMMA: "กรมการศึกษานอกโรงเรียน"}],
|
||||||
|
"กสท.": [{ORTH: "กสท.", LEMMA: "การสื่อสารแห่งประเทศไทย"}],
|
||||||
|
"กอ.รมน.": [{ORTH: "กอ.รมน.", LEMMA: "กองอำนวยการรักษาความมั่นคงภายใน"}],
|
||||||
|
"กร.": [{ORTH: "กร.", LEMMA: "กองเรือยุทธการ"}],
|
||||||
|
"ขสมก.": [{ORTH: "ขสมก.", LEMMA: "องค์การขนส่งมวลชนกรุงเทพ"}],
|
||||||
|
"คตง.": [{ORTH: "คตง.", LEMMA: "คณะกรรมการตรวจเงินแผ่นดิน"}],
|
||||||
|
"ครม.": [{ORTH: "ครม.", LEMMA: "คณะรัฐมนตรี"}],
|
||||||
|
"คมช.": [{ORTH: "คมช.", LEMMA: "คณะมนตรีความมั่นคงแห่งชาติ"}],
|
||||||
|
"ตชด.": [{ORTH: "ตชด.", LEMMA: "ตำรวจตะเวนชายเดน"}],
|
||||||
|
"ตม.": [{ORTH: "ตม.", LEMMA: "กองตรวจคนเข้าเมือง"}],
|
||||||
|
"ตร.": [{ORTH: "ตร.", LEMMA: "ตำรวจ"}],
|
||||||
|
"ททท.": [{ORTH: "ททท.", LEMMA: "การท่องเที่ยวแห่งประเทศไทย"}],
|
||||||
|
"ททบ.": [{ORTH: "ททบ.", LEMMA: "สถานีวิทยุโทรทัศน์กองทัพบก"}],
|
||||||
|
"ทบ.": [{ORTH: "ทบ.", LEMMA: "กองทัพบก"}],
|
||||||
|
"ทร.": [{ORTH: "ทร.", LEMMA: "กองทัพเรือ"}],
|
||||||
|
"ทอ.": [{ORTH: "ทอ.", LEMMA: "กองทัพอากาศ"}],
|
||||||
|
"ทอท.": [{ORTH: "ทอท.", LEMMA: "การท่าอากาศยานแห่งประเทศไทย"}],
|
||||||
|
"ธ.ก.ส.": [{ORTH: "ธ.ก.ส.", LEMMA: "ธนาคารเพื่อการเกษตรและสหกรณ์การเกษตร"}],
|
||||||
|
"ธปท.": [{ORTH: "ธปท.", LEMMA: "ธนาคารแห่งประเทศไทย"}],
|
||||||
|
"ธอส.": [{ORTH: "ธอส.", LEMMA: "ธนาคารอาคารสงเคราะห์"}],
|
||||||
|
"นย.": [{ORTH: "นย.", LEMMA: "นาวิกโยธิน"}],
|
||||||
|
"ปตท.": [{ORTH: "ปตท.", LEMMA: "การปิโตรเลียมแห่งประเทศไทย"}],
|
||||||
|
"ป.ป.ช.": [{ORTH: "ป.ป.ช.", LEMMA: "คณะกรรมการป้องกันและปราบปรามการทุจริตและประพฤติมิชอบในวงราชการ"}],
|
||||||
|
"ป.ป.ส.": [{ORTH: "ป.ป.ส.", LEMMA: "คณะกรรมการป้องกันและปราบปรามยาเสพติด"}],
|
||||||
|
"บพร.": [{ORTH: "บพร.", LEMMA: "กรมการบินพลเรือน"}],
|
||||||
|
"บย.": [{ORTH: "บย.", LEMMA: "กองบินยุทธการ"}],
|
||||||
|
"พสวท.": [{ORTH: "พสวท.", LEMMA: "โครงการพัฒนาและส่งเสริมผู้มีความรู้ความสามารถพิเศษทางวิทยาศาสตร์และเทคโนโลยี"}],
|
||||||
|
"มอก.": [{ORTH: "มอก.", LEMMA: "สำนักงานมาตรฐานผลิตภัณฑ์อุตสาหกรรม"}],
|
||||||
|
"ยธ.": [{ORTH: "ยธ.", LEMMA: "กรมโยธาธิการ"}],
|
||||||
|
"รพช.": [{ORTH: "รพช.", LEMMA: "สำนักงานเร่งรัดพัฒนาชนบท"}],
|
||||||
|
"รฟท.": [{ORTH: "รฟท.", LEMMA: "การรถไฟแห่งประเทศไทย"}],
|
||||||
|
"รฟม.": [{ORTH: "รฟม.", LEMMA: "การรถไฟฟ้าขนส่งมวลชนแห่งประเทศไทย"}],
|
||||||
|
"ศธ.": [{ORTH: "ศธ.", LEMMA: "กระทรวงศึกษาธิการ"}],
|
||||||
|
"ศนธ.": [{ORTH: "ศนธ.", LEMMA: "ศูนย์กลางนิสิตนักศึกษาแห่งประเทศไทย"}],
|
||||||
|
"สกจ.": [{ORTH: "สกจ.", LEMMA: "สหกรณ์จังหวัด"}],
|
||||||
|
"สกท.": [{ORTH: "สกท.", LEMMA: "สำนักงานคณะกรรมการส่งเสริมการลงทุน"}],
|
||||||
|
"สกว.": [{ORTH: "สกว.", LEMMA: "สำนักงานกองทุนสนับสนุนการวิจัย"}],
|
||||||
|
"สคบ.": [{ORTH: "สคบ.", LEMMA: "สำนักงานคณะกรรมการคุ้มครองผู้บริโภค"}],
|
||||||
|
"สจร.": [{ORTH: "สจร.", LEMMA: "สำนักงานคณะกรรมการจัดระบบการจราจรทางบก"}],
|
||||||
|
"สตง.": [{ORTH: "สตง.", LEMMA: "สำนักงานตรวจเงินแผ่นดิน"}],
|
||||||
|
"สทท.": [{ORTH: "สทท.", LEMMA: "สถานีวิทยุโทรทัศน์แห่งประเทศไทย"}],
|
||||||
|
"สทร.": [{ORTH: "สทร.", LEMMA: "สำนักงานกลางทะเบียนราษฎร์"}],
|
||||||
|
"สธ": [{ORTH: "สธ", LEMMA: "กระทรวงสาธารณสุข"}],
|
||||||
|
"สนช.": [{ORTH: "สนช.", LEMMA: "สภานิติบัญญัติแห่งชาติ,สำนักงานนวัตกรรมแห่งชาติ"}],
|
||||||
|
"สนนท.": [{ORTH: "สนนท.", LEMMA: "สหพันธ์นิสิตนักศึกษาแห่งประเทศไทย"}],
|
||||||
|
"สปก.": [{ORTH: "สปก.", LEMMA: "สำนักงานการปฏิรูปที่ดินเพื่อเกษตรกรรม"}],
|
||||||
|
"สปช.": [{ORTH: "สปช.", LEMMA: "สำนักงานคณะกรรมการการประถมศึกษาแห่งชาติ"}],
|
||||||
|
"สปอ.": [{ORTH: "สปอ.", LEMMA: "สำนักงานการประถมศึกษาอำเภอ"}],
|
||||||
|
"สพช.": [{ORTH: "สพช.", LEMMA: "สำนักงานคณะกรรมการนโยบายพลังงานแห่งชาติ"}],
|
||||||
|
"สยช.": [{ORTH: "สยช.", LEMMA: "สำนักงานคณะกรรมการส่งเสริมและประสานงานเยาวชนแห่งชาติ"}],
|
||||||
|
"สวช.": [{ORTH: "สวช.", LEMMA: "สำนักงานคณะกรรมการวัฒนธรรมแห่งชาติ"}],
|
||||||
|
"สวท.": [{ORTH: "สวท.", LEMMA: "สถานีวิทยุกระจายเสียงแห่งประเทศไทย"}],
|
||||||
|
"สวทช.": [{ORTH: "สวทช.", LEMMA: "สำนักงานพัฒนาวิทยาศาสตร์และเทคโนโลยีแห่งชาติ"}],
|
||||||
|
"สคช.": [{ORTH: "สคช.", LEMMA: "สำนักงานคณะกรรมการพัฒนาการเศรษฐกิจและสังคมแห่งชาติ"}],
|
||||||
|
"สสว.": [{ORTH: "สสว.", LEMMA: "สำนักงานส่งเสริมวิสาหกิจขนาดกลางและขนาดย่อม"}],
|
||||||
|
"สสส.": [{ORTH: "สสส.", LEMMA: "สำนักงานกองทุนสนับสนุนการสร้างเสริมสุขภาพ"}],
|
||||||
|
"สสวท.": [{ORTH: "สสวท.", LEMMA: "สถาบันส่งเสริมการสอนวิทยาศาสตร์และเทคโนโลยี"}],
|
||||||
|
"อตก.": [{ORTH: "อตก.", LEMMA: "องค์การตลาดเพื่อเกษตรกร"}],
|
||||||
|
"อบจ.": [{ORTH: "อบจ.", LEMMA: "องค์การบริหารส่วนจังหวัด"}],
|
||||||
|
"อบต.": [{ORTH: "อบต.", LEMMA: "องค์การบริหารส่วนตำบล"}],
|
||||||
|
"อปพร.": [{ORTH: "อปพร.", LEMMA: "อาสาสมัครป้องกันภัยฝ่ายพลเรือน"}],
|
||||||
|
"อย.": [{ORTH: "อย.", LEMMA: "สำนักงานคณะกรรมการอาหารและยา"}],
|
||||||
|
"อ.ส.ม.ท.": [{ORTH: "อ.ส.ม.ท.", LEMMA: "องค์การสื่อสารมวลชนแห่งประเทศไทย"}],
|
||||||
|
#มหาวิทยาลัย / สถานศึกษา / university / college
|
||||||
|
"มทส.": [{ORTH: "มทส.", LEMMA: "มหาวิทยาลัยเทคโนโลยีสุรนารี"}],
|
||||||
|
"มธ.": [{ORTH: "มธ.", LEMMA: "มหาวิทยาลัยธรรมศาสตร์"}],
|
||||||
|
"ม.อ.": [{ORTH: "ม.อ.", LEMMA: "มหาวิทยาลัยสงขลานครินทร์"}],
|
||||||
|
"มทร.": [{ORTH: "มทร.", LEMMA: "มหาวิทยาลัยเทคโนโลยีราชมงคล"}],
|
||||||
|
"มมส.": [{ORTH: "มมส.", LEMMA: "มหาวิทยาลัยมหาสารคาม"}],
|
||||||
|
"วท.": [{ORTH: "วท.", LEMMA: "วิทยาลัยเทคนิค"}],
|
||||||
|
"สตม.": [{ORTH: "สตม.", LEMMA: "สำนักงานตรวจคนเข้าเมือง (ตำรวจ)"}],
|
||||||
|
#ยศ / rank
|
||||||
|
"ดร.": [{ORTH: "ดร.", LEMMA: "ดอกเตอร์"}],
|
||||||
|
"ด.ต.": [{ORTH: "ด.ต.", LEMMA: "ดาบตำรวจ"}],
|
||||||
|
"จ.ต.": [{ORTH: "จ.ต.", LEMMA: "จ่าตรี"}],
|
||||||
|
"จ.ท.": [{ORTH: "จ.ท.", LEMMA: "จ่าโท"}],
|
||||||
|
"จ.ส.ต.": [{ORTH: "จ.ส.ต.", LEMMA: "จ่าสิบตรี (ทหารบก)"}],
|
||||||
|
"จสต.": [{ORTH: "จสต.", LEMMA: "จ่าสิบตำรวจ"}],
|
||||||
|
"จ.ส.ท.": [{ORTH: "จ.ส.ท.", LEMMA: "จ่าสิบโท"}],
|
||||||
|
"จ.ส.อ.": [{ORTH: "จ.ส.อ.", LEMMA: "จ่าสิบเอก"}],
|
||||||
|
"จ.อ.": [{ORTH: "จ.อ.", LEMMA: "จ่าเอก"}],
|
||||||
|
"ทพญ.": [{ORTH: "ทพญ.", LEMMA: "ทันตแพทย์หญิง"}],
|
||||||
|
"ทนพ.": [{ORTH: "ทนพ.", LEMMA: "เทคนิคการแพทย์"}],
|
||||||
|
"นจอ.": [{ORTH: "นจอ.", LEMMA: "นักเรียนจ่าอากาศ"}],
|
||||||
|
"น.ช.": [{ORTH: "น.ช.", LEMMA: "นักโทษชาย"}],
|
||||||
|
"น.ญ.": [{ORTH: "น.ญ.", LEMMA: "นักโทษหญิง"}],
|
||||||
|
"น.ต.": [{ORTH: "น.ต.", LEMMA: "นาวาตรี"}],
|
||||||
|
"น.ท.": [{ORTH: "น.ท.", LEMMA: "นาวาโท"}],
|
||||||
|
"นตท.": [{ORTH: "นตท.", LEMMA: "นักเรียนเตรียมทหาร"}],
|
||||||
|
"นนส.": [{ORTH: "นนส.", LEMMA: "นักเรียนนายสิบทหารบก"}],
|
||||||
|
"นนร.": [{ORTH: "นนร.", LEMMA: "นักเรียนนายร้อย"}],
|
||||||
|
"นนอ.": [{ORTH: "นนอ.", LEMMA: "นักเรียนนายเรืออากาศ"}],
|
||||||
|
"นพ.": [{ORTH: "นพ.", LEMMA: "นายแพทย์"}],
|
||||||
|
"นพท.": [{ORTH: "นพท.", LEMMA: "นายแพทย์ทหาร"}],
|
||||||
|
"นรจ.": [{ORTH: "นรจ.", LEMMA: "นักเรียนจ่าทหารเรือ"}],
|
||||||
|
"นรต.": [{ORTH: "นรต.", LEMMA: "นักเรียนนายร้อยตำรวจ"}],
|
||||||
|
"นศพ.": [{ORTH: "นศพ.", LEMMA: "นักศึกษาแพทย์"}],
|
||||||
|
"นศท.": [{ORTH: "นศท.", LEMMA: "นักศึกษาวิชาทหาร"}],
|
||||||
|
"น.สพ.": [{ORTH: "น.สพ.", LEMMA: "นายสัตวแพทย์ (พ.ร.บ.วิชาชีพการสัตวแพทย์)"}],
|
||||||
|
"น.อ.": [{ORTH: "น.อ.", LEMMA: "นาวาเอก"}],
|
||||||
|
"บช.ก.": [{ORTH: "บช.ก.", LEMMA: "กองบัญชาการตำรวจสอบสวนกลาง"}],
|
||||||
|
"บช.น.": [{ORTH: "บช.น.", LEMMA: "กองบัญชาการตำรวจนครบาล"}],
|
||||||
|
"ผกก.": [{ORTH: "ผกก.", LEMMA: "ผู้กำกับการ"}],
|
||||||
|
"ผกก.ภ.": [{ORTH: "ผกก.ภ.", LEMMA: "ผู้กำกับการตำรวจภูธร"}],
|
||||||
|
"ผจก.": [{ORTH: "ผจก.", LEMMA: "ผู้จัดการ"}],
|
||||||
|
"ผช.": [{ORTH: "ผช.", LEMMA: "ผู้ช่วย"}],
|
||||||
|
"ผชก.": [{ORTH: "ผชก.", LEMMA: "ผู้ชำนาญการ"}],
|
||||||
|
"ผช.ผอ.": [{ORTH: "ผช.ผอ.", LEMMA: "ผู้ช่วยผู้อำนวยการ"}],
|
||||||
|
"ผญบ.": [{ORTH: "ผญบ.", LEMMA: "ผู้ใหญ่บ้าน"}],
|
||||||
|
"ผบ.": [{ORTH: "ผบ.", LEMMA: "ผู้บังคับบัญชา"}],
|
||||||
|
"ผบก.": [{ORTH: "ผบก.", LEMMA: "ผู้บังคับบัญชาการ (ตำรวจ)"}],
|
||||||
|
"ผบก.": [{ORTH: "ผบก.", LEMMA: "ผู้บังคับการ (ตำรวจ)"}],
|
||||||
|
"ผบก.น.": [{ORTH: "ผบก.น.", LEMMA: "ผู้บังคับการตำรวจนครบาล"}],
|
||||||
|
"ผบก.ป.": [{ORTH: "ผบก.ป.", LEMMA: "ผู้บังคับการตำรวจกองปราบปราม"}],
|
||||||
|
"ผบก.ปค.": [{ORTH: "ผบก.ปค.", LEMMA: "ผู้บังคับการ กองบังคับการปกครอง (โรงเรียนนายร้อยตำรวจ)"}],
|
||||||
|
"ผบก.ปม.": [{ORTH: "ผบก.ปม.", LEMMA: "ผู้บังคับการตำรวจป่าไม้"}],
|
||||||
|
"ผบก.ภ.": [{ORTH: "ผบก.ภ.", LEMMA: "ผู้บังคับการตำรวจภูธร"}],
|
||||||
|
"ผบช.": [{ORTH: "ผบช.", LEMMA: "ผู้บัญชาการ (ตำรวจ)"}],
|
||||||
|
"ผบช.ก.": [{ORTH: "ผบช.ก.", LEMMA: "ผู้บัญชาการตำรวจสอบสวนกลาง"}],
|
||||||
|
"ผบช.ตชด.": [{ORTH: "ผบช.ตชด.", LEMMA: "ผู้บัญชาการตำรวจตระเวนชายแดน"}],
|
||||||
|
"ผบช.น.": [{ORTH: "ผบช.น.", LEMMA: "ผู้บัญชาการตำรวจนครบาล"}],
|
||||||
|
"ผบช.ภ.": [{ORTH: "ผบช.ภ.", LEMMA: "ผู้บัญชาการตำรวจภูธร"}],
|
||||||
|
"ผบ.ทบ.": [{ORTH: "ผบ.ทบ.", LEMMA: "ผู้บัญชาการทหารบก"}],
|
||||||
|
"ผบ.ตร.": [{ORTH: "ผบ.ตร.", LEMMA: "ผู้บัญชาการตำรวจแห่งชาติ"}],
|
||||||
|
"ผบ.ทร.": [{ORTH: "ผบ.ทร.", LEMMA: "ผู้บัญชาการทหารเรือ"}],
|
||||||
|
"ผบ.ทอ.": [{ORTH: "ผบ.ทอ.", LEMMA: "ผู้บัญชาการทหารอากาศ"}],
|
||||||
|
"ผบ.ทสส.": [{ORTH: "ผบ.ทสส.", LEMMA: "ผู้บัญชาการทหารสูงสุด"}],
|
||||||
|
"ผวจ.": [{ORTH: "ผวจ.", LEMMA: "ผู้ว่าราชการจังหวัด"}],
|
||||||
|
"ผู้ว่าฯ": [{ORTH: "ผู้ว่าฯ", LEMMA: "ผู้ว่าราชการจังหวัด"}],
|
||||||
|
"พ.จ.ต.": [{ORTH: "พ.จ.ต.", LEMMA: "พันจ่าตรี"}],
|
||||||
|
"พ.จ.ท.": [{ORTH: "พ.จ.ท.", LEMMA: "พันจ่าโท"}],
|
||||||
|
"พ.จ.อ.": [{ORTH: "พ.จ.อ.", LEMMA: "พันจ่าเอก"}],
|
||||||
|
"พญ.": [{ORTH: "พญ.", LEMMA: "แพทย์หญิง"}],
|
||||||
|
"ฯพณฯ": [{ORTH: "ฯพณฯ", LEMMA: "พณท่าน"}],
|
||||||
|
"พ.ต.": [{ORTH: "พ.ต.", LEMMA: "พันตรี"}],
|
||||||
|
"พ.ท.": [{ORTH: "พ.ท.", LEMMA: "พันโท"}],
|
||||||
|
"พ.อ.": [{ORTH: "พ.อ.", LEMMA: "พันเอก"}],
|
||||||
|
"พ.ต.อ.พิเศษ": [{ORTH: "พ.ต.อ.พิเศษ", LEMMA: "พันตำรวจเอกพิเศษ"}],
|
||||||
|
"พลฯ": [{ORTH: "พลฯ", LEMMA: "พลทหาร"}],
|
||||||
|
"พล.๑ รอ.": [{ORTH: "พล.๑ รอ.", LEMMA: "กองพลที่ ๑ รักษาพระองค์ กองทัพบก"}],
|
||||||
|
"พล.ต.": [{ORTH: "พล.ต.", LEMMA: "พลตรี"}],
|
||||||
|
"พล.ต.ต.": [{ORTH: "พล.ต.ต.", LEMMA: "พลตำรวจตรี"}],
|
||||||
|
"พล.ต.ท.": [{ORTH: "พล.ต.ท.", LEMMA: "พลตำรวจโท"}],
|
||||||
|
"พล.ต.อ.": [{ORTH: "พล.ต.อ.", LEMMA: "พลตำรวจเอก"}],
|
||||||
|
"พล.ท.": [{ORTH: "พล.ท.", LEMMA: "พลโท"}],
|
||||||
|
"พล.ปตอ.": [{ORTH: "พล.ปตอ.", LEMMA: "กองพลทหารปืนใหญ่ต่อสู่อากาศยาน"}],
|
||||||
|
"พล.ม.": [{ORTH: "พล.ม.", LEMMA: "กองพลทหารม้า"}],
|
||||||
|
"พล.ม.๒": [{ORTH: "พล.ม.๒", LEMMA: "กองพลทหารม้าที่ ๒"}],
|
||||||
|
"พล.ร.ต.": [{ORTH: "พล.ร.ต.", LEMMA: "พลเรือตรี"}],
|
||||||
|
"พล.ร.ท.": [{ORTH: "พล.ร.ท.", LEMMA: "พลเรือโท"}],
|
||||||
|
"พล.ร.อ.": [{ORTH: "พล.ร.อ.", LEMMA: "พลเรือเอก"}],
|
||||||
|
"พล.อ.": [{ORTH: "พล.อ.", LEMMA: "พลเอก"}],
|
||||||
|
"พล.อ.ต.": [{ORTH: "พล.อ.ต.", LEMMA: "พลอากาศตรี"}],
|
||||||
|
"พล.อ.ท.": [{ORTH: "พล.อ.ท.", LEMMA: "พลอากาศโท"}],
|
||||||
|
"พล.อ.อ.": [{ORTH: "พล.อ.อ.", LEMMA: "พลอากาศเอก"}],
|
||||||
|
"พ.อ.": [{ORTH: "พ.อ.", LEMMA: "พันเอก"}],
|
||||||
|
"พ.อ.พิเศษ": [{ORTH: "พ.อ.พิเศษ", LEMMA: "พันเอกพิเศษ"}],
|
||||||
|
"พ.อ.ต.": [{ORTH: "พ.อ.ต.", LEMMA: "พันจ่าอากาศตรี"}],
|
||||||
|
"พ.อ.ท.": [{ORTH: "พ.อ.ท.", LEMMA: "พันจ่าอากาศโท"}],
|
||||||
|
"พ.อ.อ.": [{ORTH: "พ.อ.อ.", LEMMA: "พันจ่าอากาศเอก"}],
|
||||||
|
"ภกญ.": [{ORTH: "ภกญ.", LEMMA: "เภสัชกรหญิง"}],
|
||||||
|
"ม.จ.": [{ORTH: "ม.จ.", LEMMA: "หม่อมเจ้า"}],
|
||||||
|
"มท1": [{ORTH: "มท1", LEMMA: "รัฐมนตรีว่าการกระทรวงมหาดไทย"}],
|
||||||
|
"ม.ร.ว.": [{ORTH: "ม.ร.ว.", LEMMA: "หม่อมราชวงศ์"}],
|
||||||
|
"มล.": [{ORTH: "มล.", LEMMA: "หม่อมหลวง"}],
|
||||||
|
"ร.ต.": [{ORTH: "ร.ต.", LEMMA: "ร้อยตรี,เรือตรี,เรืออากาศตรี"}],
|
||||||
|
"ร.ต.ต.": [{ORTH: "ร.ต.ต.", LEMMA: "ร้อยตำรวจตรี"}],
|
||||||
|
"ร.ต.ท.": [{ORTH: "ร.ต.ท.", LEMMA: "ร้อยตำรวจโท"}],
|
||||||
|
"ร.ต.อ.": [{ORTH: "ร.ต.อ.", LEMMA: "ร้อยตำรวจเอก"}],
|
||||||
|
"ร.ท.": [{ORTH: "ร.ท.", LEMMA: "ร้อยโท,เรือโท,เรืออากาศโท"}],
|
||||||
|
"รมช.": [{ORTH: "รมช.", LEMMA: "รัฐมนตรีช่วยว่าการกระทรวง"}],
|
||||||
|
"รมต.": [{ORTH: "รมต.", LEMMA: "รัฐมนตรี"}],
|
||||||
|
"รมว.": [{ORTH: "รมว.", LEMMA: "รัฐมนตรีว่าการกระทรวง"}],
|
||||||
|
"รศ.": [{ORTH: "รศ.", LEMMA: "รองศาสตราจารย์"}],
|
||||||
|
"ร.อ.": [{ORTH: "ร.อ.", LEMMA: "ร้อยเอก,เรือเอก,เรืออากาศเอก"}],
|
||||||
|
"ศ.": [{ORTH: "ศ.", LEMMA: "ศาสตราจารย์"}],
|
||||||
|
"ส.ต.": [{ORTH: "ส.ต.", LEMMA: "สิบตรี"}],
|
||||||
|
"ส.ต.ต.": [{ORTH: "ส.ต.ต.", LEMMA: "สิบตำรวจตรี"}],
|
||||||
|
"ส.ต.ท.": [{ORTH: "ส.ต.ท.", LEMMA: "สิบตำรวจโท"}],
|
||||||
|
"ส.ต.อ.": [{ORTH: "ส.ต.อ.", LEMMA: "สิบตำรวจเอก"}],
|
||||||
|
"ส.ท.": [{ORTH: "ส.ท.", LEMMA: "สิบโท"}],
|
||||||
|
"สพ.": [{ORTH: "สพ.", LEMMA: "สัตวแพทย์"}],
|
||||||
|
"สพ.ญ.": [{ORTH: "สพ.ญ.", LEMMA: "สัตวแพทย์หญิง"}],
|
||||||
|
"สพ.ช.": [{ORTH: "สพ.ช.", LEMMA: "สัตวแพทย์ชาย"}],
|
||||||
|
"ส.อ.": [{ORTH: "ส.อ.", LEMMA: "สิบเอก"}],
|
||||||
|
"อจ.": [{ORTH: "อจ.", LEMMA: "อาจารย์"}],
|
||||||
|
"อจญ.": [{ORTH: "อจญ.", LEMMA: "อาจารย์ใหญ่"}],
|
||||||
|
#วุฒิ / bachelor degree
|
||||||
|
"ป.": [{ORTH: "ป.", LEMMA: "ประถมศึกษา"}],
|
||||||
|
"ป.กศ.": [{ORTH: "ป.กศ.", LEMMA: "ประกาศนียบัตรวิชาการศึกษา"}],
|
||||||
|
"ป.กศ.สูง": [{ORTH: "ป.กศ.สูง", LEMMA: "ประกาศนียบัตรวิชาการศึกษาชั้นสูง"}],
|
||||||
|
"ปวช.": [{ORTH: "ปวช.", LEMMA: "ประกาศนียบัตรวิชาชีพ"}],
|
||||||
|
"ปวท.": [{ORTH: "ปวท.", LEMMA: "ประกาศนียบัตรวิชาชีพเทคนิค"}],
|
||||||
|
"ปวส.": [{ORTH: "ปวส.", LEMMA: "ประกาศนียบัตรวิชาชีพชั้นสูง"}],
|
||||||
|
"ปทส.": [{ORTH: "ปทส.", LEMMA: "ประกาศนียบัตรครูเทคนิคชั้นสูง"}],
|
||||||
|
"กษ.บ.": [{ORTH: "กษ.บ.", LEMMA: "เกษตรศาสตรบัณฑิต"}],
|
||||||
|
"กษ.ม.": [{ORTH: "กษ.ม.", LEMMA: "เกษตรศาสตรมหาบัณฑิต"}],
|
||||||
|
"กษ.ด.": [{ORTH: "กษ.ด.", LEMMA: "เกษตรศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"ค.บ.": [{ORTH: "ค.บ.", LEMMA: "ครุศาสตรบัณฑิต"}],
|
||||||
|
"คศ.บ.": [{ORTH: "คศ.บ.", LEMMA: "คหกรรมศาสตรบัณฑิต"}],
|
||||||
|
"คศ.ม.": [{ORTH: "คศ.ม.", LEMMA: "คหกรรมศาสตรมหาบัณฑิต"}],
|
||||||
|
"คศ.ด.": [{ORTH: "คศ.ด.", LEMMA: "คหกรรมศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"ค.อ.บ.": [{ORTH: "ค.อ.บ.", LEMMA: "ครุศาสตรอุตสาหกรรมบัณฑิต"}],
|
||||||
|
"ค.อ.ม.": [{ORTH: "ค.อ.ม.", LEMMA: "ครุศาสตรอุตสาหกรรมมหาบัณฑิต"}],
|
||||||
|
"ค.อ.ด.": [{ORTH: "ค.อ.ด.", LEMMA: "ครุศาสตรอุตสาหกรรมดุษฎีบัณฑิต"}],
|
||||||
|
"ทก.บ.": [{ORTH: "ทก.บ.", LEMMA: "เทคโนโลยีการเกษตรบัณฑิต"}],
|
||||||
|
"ทก.ม.": [{ORTH: "ทก.ม.", LEMMA: "เทคโนโลยีการเกษตรมหาบัณฑิต"}],
|
||||||
|
"ทก.ด.": [{ORTH: "ทก.ด.", LEMMA: "เทคโนโลยีการเกษตรดุษฎีบัณฑิต"}],
|
||||||
|
"ท.บ.": [{ORTH: "ท.บ.", LEMMA: "ทันตแพทยศาสตรบัณฑิต"}],
|
||||||
|
"ท.ม.": [{ORTH: "ท.ม.", LEMMA: "ทันตแพทยศาสตรมหาบัณฑิต"}],
|
||||||
|
"ท.ด.": [{ORTH: "ท.ด.", LEMMA: "ทันตแพทยศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"น.บ.": [{ORTH: "น.บ.", LEMMA: "นิติศาสตรบัณฑิต"}],
|
||||||
|
"น.ม.": [{ORTH: "น.ม.", LEMMA: "นิติศาสตรมหาบัณฑิต"}],
|
||||||
|
"น.ด.": [{ORTH: "น.ด.", LEMMA: "นิติศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"นศ.บ.": [{ORTH: "นศ.บ.", LEMMA: "นิเทศศาสตรบัณฑิต"}],
|
||||||
|
"นศ.ม.": [{ORTH: "นศ.ม.", LEMMA: "นิเทศศาสตรมหาบัณฑิต"}],
|
||||||
|
"นศ.ด.": [{ORTH: "นศ.ด.", LEMMA: "นิเทศศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"บช.บ.": [{ORTH: "บช.บ.", LEMMA: "บัญชีบัณฑิต"}],
|
||||||
|
"บช.ม.": [{ORTH: "บช.ม.", LEMMA: "บัญชีมหาบัณฑิต"}],
|
||||||
|
"บช.ด.": [{ORTH: "บช.ด.", LEMMA: "บัญชีดุษฎีบัณฑิต"}],
|
||||||
|
"บธ.บ.": [{ORTH: "บธ.บ.", LEMMA: "บริหารธุรกิจบัณฑิต"}],
|
||||||
|
"บธ.ม.": [{ORTH: "บธ.ม.", LEMMA: "บริหารธุรกิจมหาบัณฑิต"}],
|
||||||
|
"บธ.ด.": [{ORTH: "บธ.ด.", LEMMA: "บริหารธุรกิจดุษฎีบัณฑิต"}],
|
||||||
|
"พณ.บ.": [{ORTH: "พณ.บ.", LEMMA: "พาณิชยศาสตรบัณฑิต"}],
|
||||||
|
"พณ.ม.": [{ORTH: "พณ.ม.", LEMMA: "พาณิชยศาสตรมหาบัณฑิต"}],
|
||||||
|
"พณ.ด.": [{ORTH: "พณ.ด.", LEMMA: "พาณิชยศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"พ.บ.": [{ORTH: "พ.บ.", LEMMA: "แพทยศาสตรบัณฑิต"}],
|
||||||
|
"พ.ม.": [{ORTH: "พ.ม.", LEMMA: "แพทยศาสตรมหาบัณฑิต"}],
|
||||||
|
"พ.ด.": [{ORTH: "พ.ด.", LEMMA: "แพทยศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"พธ.บ.": [{ORTH: "พธ.บ.", LEMMA: "พุทธศาสตรบัณฑิต"}],
|
||||||
|
"พธ.ม.": [{ORTH: "พธ.ม.", LEMMA: "พุทธศาสตรมหาบัณฑิต"}],
|
||||||
|
"พธ.ด.": [{ORTH: "พธ.ด.", LEMMA: "พุทธศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"พบ.บ.": [{ORTH: "พบ.บ.", LEMMA: "พัฒนบริหารศาสตรบัณฑิต"}],
|
||||||
|
"พบ.ม.": [{ORTH: "พบ.ม.", LEMMA: "พัฒนบริหารศาสตรมหาบัณฑิต"}],
|
||||||
|
"พบ.ด.": [{ORTH: "พบ.ด.", LEMMA: "พัฒนบริหารศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"พย.บ.": [{ORTH: "พย.บ.", LEMMA: "พยาบาลศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"พย.ม.": [{ORTH: "พย.ม.", LEMMA: "พยาบาลศาสตรมหาบัณฑิต"}],
|
||||||
|
"พย.ด.": [{ORTH: "พย.ด.", LEMMA: "พยาบาลศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"พศ.บ.": [{ORTH: "พศ.บ.", LEMMA: "พาณิชยศาสตรบัณฑิต"}],
|
||||||
|
"พศ.ม.": [{ORTH: "พศ.ม.", LEMMA: "พาณิชยศาสตรมหาบัณฑิต"}],
|
||||||
|
"พศ.ด.": [{ORTH: "พศ.ด.", LEMMA: "พาณิชยศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"ภ.บ.": [{ORTH: "ภ.บ.", LEMMA: "เภสัชศาสตรบัณฑิต"}],
|
||||||
|
"ภ.ม.": [{ORTH: "ภ.ม.", LEMMA: "เภสัชศาสตรมหาบัณฑิต"}],
|
||||||
|
"ภ.ด.": [{ORTH: "ภ.ด.", LEMMA: "เภสัชศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"ภ.สถ.บ.": [{ORTH: "ภ.สถ.บ.", LEMMA: "ภูมิสถาปัตยกรรมศาสตรบัณฑิต"}],
|
||||||
|
"รป.บ.": [{ORTH: "รป.บ.", LEMMA: "รัฐประศาสนศาสตร์บัณฑิต"}],
|
||||||
|
"รป.ม.": [{ORTH: "รป.ม.", LEMMA: "รัฐประศาสนศาสตร์มหาบัณฑิต"}],
|
||||||
|
"วท.บ.": [{ORTH: "วท.บ.", LEMMA: "วิทยาศาสตรบัณฑิต"}],
|
||||||
|
"วท.ม.": [{ORTH: "วท.ม.", LEMMA: "วิทยาศาสตรมหาบัณฑิต"}],
|
||||||
|
"วท.ด.": [{ORTH: "วท.ด.", LEMMA: "วิทยาศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"ศ.บ.": [{ORTH: "ศ.บ.", LEMMA: "ศิลปบัณฑิต"}],
|
||||||
|
"ศศ.บ.": [{ORTH: "ศศ.บ.", LEMMA: "ศิลปศาสตรบัณฑิต"}],
|
||||||
|
"ศษ.บ.": [{ORTH: "ศษ.บ.", LEMMA: "ศึกษาศาสตรบัณฑิต"}],
|
||||||
|
"ศส.บ.": [{ORTH: "ศส.บ.", LEMMA: "เศรษฐศาสตรบัณฑิต"}],
|
||||||
|
"สถ.บ.": [{ORTH: "สถ.บ.", LEMMA: "สถาปัตยกรรมศาสตรบัณฑิต"}],
|
||||||
|
"สถ.ม.": [{ORTH: "สถ.ม.", LEMMA: "สถาปัตยกรรมศาสตรมหาบัณฑิต"}],
|
||||||
|
"สถ.ด.": [{ORTH: "สถ.ด.", LEMMA: "สถาปัตยกรรมศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"สพ.บ.": [{ORTH: "สพ.บ.", LEMMA: "สัตวแพทยศาสตรบัณฑิต"}],
|
||||||
|
"อ.บ.": [{ORTH: "อ.บ.", LEMMA: "อักษรศาสตรบัณฑิต"}],
|
||||||
|
"อ.ม.": [{ORTH: "อ.ม.", LEMMA: "อักษรศาสตรมหาบัณฑิต"}],
|
||||||
|
"อ.ด.": [{ORTH: "อ.ด.", LEMMA: "อักษรศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
#ปี / เวลา / year / time
|
||||||
|
"ชม.": [{ORTH: "ชม.", LEMMA: "ชั่วโมง"}],
|
||||||
|
"จ.ศ.": [{ORTH: "จ.ศ.", LEMMA: "จุลศักราช"}],
|
||||||
|
"ค.ศ.": [{ORTH: "ค.ศ.", LEMMA: "คริสต์ศักราช"}],
|
||||||
|
"ฮ.ศ.": [{ORTH: "ฮ.ศ.", LEMMA: "ฮิจเราะห์ศักราช"}],
|
||||||
|
"ว.ด.ป.": [{ORTH: "ว.ด.ป.", LEMMA: "วัน เดือน ปี"}],
|
||||||
|
#ระยะทาง / distance
|
||||||
|
"ฮม.": [{ORTH: "ฮม.", LEMMA: "เฮกโตเมตร"}],
|
||||||
|
"ดคม.": [{ORTH: "ดคม.", LEMMA: "เดคาเมตร"}],
|
||||||
|
"ดม.": [{ORTH: "ดม.", LEMMA: "เดซิเมตร"}],
|
||||||
|
"มม.": [{ORTH: "มม.", LEMMA: "มิลลิเมตร"}],
|
||||||
|
"ซม.": [{ORTH: "ซม.", LEMMA: "เซนติเมตร"}],
|
||||||
|
"กม.": [{ORTH: "กม.", LEMMA: "กิโลเมตร"}],
|
||||||
|
#น้ำหนัก / weight
|
||||||
|
"น.น.": [{ORTH: "น.น.", LEMMA: "น้ำหนัก"}],
|
||||||
|
"ฮก.": [{ORTH: "ฮก.", LEMMA: "เฮกโตกรัม"}],
|
||||||
|
"ดคก.": [{ORTH: "ดคก.", LEMMA: "เดคากรัม"}],
|
||||||
|
"ดก.": [{ORTH: "ดก.", LEMMA: "เดซิกรัม"}],
|
||||||
|
"ซก.": [{ORTH: "ซก.", LEMMA: "เซนติกรัม"}],
|
||||||
|
"มก.": [{ORTH: "มก.", LEMMA: "มิลลิกรัม"}],
|
||||||
|
"ก.": [{ORTH: "ก.", LEMMA: "กรัม"}],
|
||||||
|
"กก.": [{ORTH: "กก.", LEMMA: "กิโลกรัม"}],
|
||||||
|
#ปริมาตร / volume
|
||||||
|
"ฮล.": [{ORTH: "ฮล.", LEMMA: "เฮกโตลิตร"}],
|
||||||
|
"ดคล.": [{ORTH: "ดคล.", LEMMA: "เดคาลิตร"}],
|
||||||
|
"ดล.": [{ORTH: "ดล.", LEMMA: "เดซิลิตร"}],
|
||||||
|
"ซล.": [{ORTH: "ซล.", LEMMA: "เซนติลิตร"}],
|
||||||
|
"ล.": [{ORTH: "ล.", LEMMA: "ลิตร"}],
|
||||||
|
"กล.": [{ORTH: "กล.", LEMMA: "กิโลลิตร"}],
|
||||||
|
"ลบ.": [{ORTH: "ลบ.", LEMMA: "ลูกบาศก์"}],
|
||||||
|
#พื้นที่ / area
|
||||||
|
"ตร.ซม.": [{ORTH: "ตร.ซม.", LEMMA: "ตารางเซนติเมตร"}],
|
||||||
|
"ตร.ม.": [{ORTH: "ตร.ม.", LEMMA: "ตารางเมตร"}],
|
||||||
|
"ตร.ว.": [{ORTH: "ตร.ว.", LEMMA: "ตารางวา"}],
|
||||||
|
"ตร.กม.": [{ORTH: "ตร.กม.", LEMMA: "ตารางกิโลเมตร"}],
|
||||||
|
#เดือน / month
|
||||||
"ม.ค.": [{ORTH: "ม.ค.", LEMMA: "มกราคม"}],
|
"ม.ค.": [{ORTH: "ม.ค.", LEMMA: "มกราคม"}],
|
||||||
"ก.พ.": [{ORTH: "ก.พ.", LEMMA: "กุมภาพันธ์"}],
|
"ก.พ.": [{ORTH: "ก.พ.", LEMMA: "กุมภาพันธ์"}],
|
||||||
"มี.ค.": [{ORTH: "มี.ค.", LEMMA: "มีนาคม"}],
|
"มี.ค.": [{ORTH: "มี.ค.", LEMMA: "มีนาคม"}],
|
||||||
|
@ -17,6 +331,114 @@ _exc = {
    "ต.ค.": [{ORTH: "ต.ค.", LEMMA: "ตุลาคม"}],
    "พ.ย.": [{ORTH: "พ.ย.", LEMMA: "พฤศจิกายน"}],
    "ธ.ค.": [{ORTH: "ธ.ค.", LEMMA: "ธันวาคม"}],
    # เพศ / gender
    "ช.": [{ORTH: "ช.", LEMMA: "ชาย"}],
    "ญ.": [{ORTH: "ญ.", LEMMA: "หญิง"}],
    "ด.ช.": [{ORTH: "ด.ช.", LEMMA: "เด็กชาย"}],
    "ด.ญ.": [{ORTH: "ด.ญ.", LEMMA: "เด็กหญิง"}],
    # ที่อยู่ / address
    "ถ.": [{ORTH: "ถ.", LEMMA: "ถนน"}],
    "ต.": [{ORTH: "ต.", LEMMA: "ตำบล"}],
    "อ.": [{ORTH: "อ.", LEMMA: "อำเภอ"}],
    "จ.": [{ORTH: "จ.", LEMMA: "จังหวัด"}],
    # สรรพนาม / pronoun
    "ข้าฯ": [{ORTH: "ข้าฯ", LEMMA: "ข้าพระพุทธเจ้า"}],
    "ทูลเกล้าฯ": [{ORTH: "ทูลเกล้าฯ", LEMMA: "ทูลเกล้าทูลกระหม่อม"}],
    "น้อมเกล้าฯ": [{ORTH: "น้อมเกล้าฯ", LEMMA: "น้อมเกล้าน้อมกระหม่อม"}],
    "โปรดเกล้าฯ": [{ORTH: "โปรดเกล้าฯ", LEMMA: "โปรดเกล้าโปรดกระหม่อม"}],
    # การเมือง / politics
    "ขจก.": [{ORTH: "ขจก.", LEMMA: "ขบวนการโจรก่อการร้าย"}],
    "ขบด.": [{ORTH: "ขบด.", LEMMA: "ขบวนการแบ่งแยกดินแดน"}],
    "นปช.": [{ORTH: "นปช.", LEMMA: "แนวร่วมประชาธิปไตยขับไล่เผด็จการ"}],
    "ปชป.": [{ORTH: "ปชป.", LEMMA: "พรรคประชาธิปัตย์"}],
    "ผกค.": [{ORTH: "ผกค.", LEMMA: "ผู้ก่อการร้ายคอมมิวนิสต์"}],
    "พท.": [{ORTH: "พท.", LEMMA: "พรรคเพื่อไทย"}],
    "พ.ร.ก.": [{ORTH: "พ.ร.ก.", LEMMA: "พระราชกำหนด"}],
    "พ.ร.ฎ.": [{ORTH: "พ.ร.ฎ.", LEMMA: "พระราชกฤษฎีกา"}],
    "พ.ร.บ.": [{ORTH: "พ.ร.บ.", LEMMA: "พระราชบัญญัติ"}],
    "รธน.": [{ORTH: "รธน.", LEMMA: "รัฐธรรมนูญ"}],
    "รบ.": [{ORTH: "รบ.", LEMMA: "รัฐบาล"}],
    "รสช.": [{ORTH: "รสช.", LEMMA: "คณะรักษาความสงบเรียบร้อยแห่งชาติ"}],
    "ส.ก.": [{ORTH: "ส.ก.", LEMMA: "สมาชิกสภากรุงเทพมหานคร"}],
    "สจ.": [{ORTH: "สจ.", LEMMA: "สมาชิกสภาจังหวัด"}],
    "สว.": [{ORTH: "สว.", LEMMA: "สมาชิกวุฒิสภา"}],
    "ส.ส.": [{ORTH: "ส.ส.", LEMMA: "สมาชิกสภาผู้แทนราษฎร"}],
    # ทั่วไป / general
    "ก.ข.ค.": [{ORTH: "ก.ข.ค.", LEMMA: "ก้างขวางคอ"}],
    "กทม.": [{ORTH: "กทม.", LEMMA: "กรุงเทพมหานคร"}],
    "กรุงเทพฯ": [{ORTH: "กรุงเทพฯ", LEMMA: "กรุงเทพมหานคร"}],
    "ขรก.": [{ORTH: "ขรก.", LEMMA: "ข้าราชการ"}],
    "ขส.": [{ORTH: "ขส.", LEMMA: "ขนส่ง"}],
    "ค.ร.น.": [{ORTH: "ค.ร.น.", LEMMA: "คูณร่วมน้อย"}],
    "ค.ร.ม.": [{ORTH: "ค.ร.ม.", LEMMA: "คูณร่วมมาก"}],
    "ง.ด.": [{ORTH: "ง.ด.", LEMMA: "เงินเดือน"}],
    "งป.": [{ORTH: "งป.", LEMMA: "งบประมาณ"}],
    "จก.": [{ORTH: "จก.", LEMMA: "จำกัด"}],
    "จขกท.": [{ORTH: "จขกท.", LEMMA: "เจ้าของกระทู้"}],
    "จนท.": [{ORTH: "จนท.", LEMMA: "เจ้าหน้าที่"}],
    "จ.ป.ร.": [{ORTH: "จ.ป.ร.", LEMMA: "มหาจุฬาลงกรณ ปรมราชาธิราช (พระปรมาภิไธยในพระบาทสมเด็จพระจุลจอมเกล้าเจ้าอยู่หัว)"}],
    "จ.ม.": [{ORTH: "จ.ม.", LEMMA: "จดหมาย"}],
    "จย.": [{ORTH: "จย.", LEMMA: "จักรยาน"}],
    "จยย.": [{ORTH: "จยย.", LEMMA: "จักรยานยนต์"}],
    "ตจว.": [{ORTH: "ตจว.", LEMMA: "ต่างจังหวัด"}],
    "โทร.": [{ORTH: "โทร.", LEMMA: "โทรศัพท์"}],
    "ธ.": [{ORTH: "ธ.", LEMMA: "ธนาคาร"}],
    "น.ร.": [{ORTH: "น.ร.", LEMMA: "นักเรียน"}],
    "น.ศ.": [{ORTH: "น.ศ.", LEMMA: "นักศึกษา"}],
    "น.ส.": [{ORTH: "น.ส.", LEMMA: "นางสาว"}],
    "น.ส.๓": [{ORTH: "น.ส.๓", LEMMA: "หนังสือรับรองการทำประโยชน์ในที่ดิน"}],
    "น.ส.๓ ก.": [{ORTH: "น.ส.๓ ก.", LEMMA: "หนังสือแสดงกรรมสิทธิ์ในที่ดิน (มีระวางกำหนด)"}],
    "นสพ.": [{ORTH: "นสพ.", LEMMA: "หนังสือพิมพ์"}],
    "บ.ก.": [{ORTH: "บ.ก.", LEMMA: "บรรณาธิการ"}],
    "บจก.": [{ORTH: "บจก.", LEMMA: "บริษัทจำกัด"}],
    "บงล.": [{ORTH: "บงล.", LEMMA: "บริษัทเงินทุนและหลักทรัพย์จำกัด"}],
    "บบส.": [{ORTH: "บบส.", LEMMA: "บรรษัทบริหารสินทรัพย์สถาบันการเงิน"}],
    "บมจ.": [{ORTH: "บมจ.", LEMMA: "บริษัทมหาชนจำกัด"}],
    "บลจ.": [{ORTH: "บลจ.", LEMMA: "บริษัทหลักทรัพย์จัดการกองทุนรวมจำกัด"}],
    "บ/ช": [{ORTH: "บ/ช", LEMMA: "บัญชี"}],
    "บร.": [{ORTH: "บร.", LEMMA: "บรรณารักษ์"}],
    "ปชช.": [{ORTH: "ปชช.", LEMMA: "ประชาชน"}],
    "ปณ.": [{ORTH: "ปณ.", LEMMA: "ที่ทำการไปรษณีย์"}],
    "ปณก.": [{ORTH: "ปณก.", LEMMA: "ที่ทำการไปรษณีย์กลาง"}],
    "ปณส.": [{ORTH: "ปณส.", LEMMA: "ที่ทำการไปรษณีย์สาขา"}],
    "ปธ.": [{ORTH: "ปธ.", LEMMA: "ประธาน"}],
    "ปธน.": [{ORTH: "ปธน.", LEMMA: "ประธานาธิบดี"}],
    "ปอ.": [{ORTH: "ปอ.", LEMMA: "รถยนต์โดยสารประจำทางปรับอากาศ"}],
    "ปอ.พ.": [{ORTH: "ปอ.พ.", LEMMA: "รถยนต์โดยสารประจำทางปรับอากาศพิเศษ"}],
    "พ.ก.ง.": [{ORTH: "พ.ก.ง.", LEMMA: "พัสดุเก็บเงินปลายทาง"}],
    "พ.ก.ส.": [{ORTH: "พ.ก.ส.", LEMMA: "พนักงานเก็บค่าโดยสาร"}],
    "พขร.": [{ORTH: "พขร.", LEMMA: "พนักงานขับรถ"}],
    "ภ.ง.ด.": [{ORTH: "ภ.ง.ด.", LEMMA: "ภาษีเงินได้"}],
    "ภ.ง.ด.๙": [{ORTH: "ภ.ง.ด.๙", LEMMA: "แบบแสดงรายการเสียภาษีเงินได้ของกรมสรรพากร"}],
    "ภ.ป.ร.": [{ORTH: "ภ.ป.ร.", LEMMA: "ภูมิพลอดุลยเดช ปรมราชาธิราช (พระปรมาภิไธยในพระบาทสมเด็จพระปรมินทรมหาภูมิพลอดุลยเดช)"}],
    "ภ.พ.": [{ORTH: "ภ.พ.", LEMMA: "ภาษีมูลค่าเพิ่ม"}],
    "ร.": [{ORTH: "ร.", LEMMA: "รัชกาล"}],
    "ร.ง.": [{ORTH: "ร.ง.", LEMMA: "โรงงาน"}],
    "ร.ด.": [{ORTH: "ร.ด.", LEMMA: "รักษาดินแดน"}],
    "รปภ.": [{ORTH: "รปภ.", LEMMA: "รักษาความปลอดภัย"}],
    "รพ.": [{ORTH: "รพ.", LEMMA: "โรงพยาบาล"}],
    "ร.พ.": [{ORTH: "ร.พ.", LEMMA: "โรงพิมพ์"}],
    "รร.": [{ORTH: "รร.", LEMMA: "โรงเรียน,โรงแรม"}],
    "รสก.": [{ORTH: "รสก.", LEMMA: "รัฐวิสาหกิจ"}],
    "ส.ค.ส.": [{ORTH: "ส.ค.ส.", LEMMA: "ส่งความสุขปีใหม่"}],
    "สต.": [{ORTH: "สต.", LEMMA: "สตางค์"}],
    "สน.": [{ORTH: "สน.", LEMMA: "สถานีตำรวจ"}],
    "สนข.": [{ORTH: "สนข.", LEMMA: "สำนักงานเขต"}],
    "สนง.": [{ORTH: "สนง.", LEMMA: "สำนักงาน"}],
    "สนญ.": [{ORTH: "สนญ.", LEMMA: "สำนักงานใหญ่"}],
    "ส.ป.ช.": [{ORTH: "ส.ป.ช.", LEMMA: "สร้างเสริมประสบการณ์ชีวิต"}],
    "สภ.": [{ORTH: "สภ.", LEMMA: "สถานีตำรวจภูธร"}],
    "ส.ล.น.": [{ORTH: "ส.ล.น.", LEMMA: "สร้างเสริมลักษณะนิสัย"}],
    "สวญ.": [{ORTH: "สวญ.", LEMMA: "สารวัตรใหญ่"}],
    "สวป.": [{ORTH: "สวป.", LEMMA: "สารวัตรป้องกันปราบปราม"}],
    "สว.สส.": [{ORTH: "สว.สส.", LEMMA: "สารวัตรสืบสวน"}],
    "ส.ห.": [{ORTH: "ส.ห.", LEMMA: "สารวัตรทหาร"}],
    "สอ.": [{ORTH: "สอ.", LEMMA: "สถานีอนามัย"}],
    "สอท.": [{ORTH: "สอท.", LEMMA: "สถานเอกอัครราชทูต"}],
    "เสธ.": [{ORTH: "เสธ.", LEMMA: "เสนาธิการ"}],
    "หจก.": [{ORTH: "หจก.", LEMMA: "ห้างหุ้นส่วนจำกัด"}],
    "ห.ร.ม.": [{ORTH: "ห.ร.ม.", LEMMA: "ตัวหารร่วมมาก"}],
}
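
Each key in `_exc` must reproduce, character for character, the concatenation of the `ORTH` values in its token list, which is how spaCy validates tokenizer exceptions. A minimal sketch of how a dict like this is typically merged and exposed, assuming spaCy 2.x's standard helpers (the merge itself is not shown in this diff):

```python
# Sketch: merging a language-specific exceptions dict into the shared base
# exceptions, as most spaCy 2.x language modules do. The single entry is
# copied from the dict above.
from spacy.symbols import ORTH, LEMMA
from spacy.lang.tokenizer_exceptions import BASE_EXCEPTIONS
from spacy.util import update_exc

_exc = {"กม.": [{ORTH: "กม.", LEMMA: "กิโลเมตร"}]}  # key == joined ORTH values

TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
```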
@@ -134,6 +134,11 @@ def nl_tokenizer():
    return get_lang_class("nl").Defaults.create_tokenizer()


@pytest.fixture(scope="session")
def nl_lemmatizer():
    return get_lang_class("nl").Defaults.create_lemmatizer()


@pytest.fixture(scope="session")
def pl_tokenizer():
    return get_lang_class("pl").Defaults.create_tokenizer()
143
spacy/tests/lang/nl/test_lemmatizer.py
Normal file
@@ -0,0 +1,143 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


# Calling the Lemmatizer directly
# Imitates behavior of:
# Tagger.set_annotations()
#     -> vocab.morphology.assign_tag_id()
#     -> Token.tag.__set__
#     -> vocab.morphology.assign_tag(...)
#     -> ... -> Morphology.assign_tag(...)
#     -> self.lemmatize(analysis.tag.pos, token.lex.orth, ...)

noun_irreg_lemmatization_cases = [
    ("volkeren", "volk"),
    ("vaatje", "vat"),
    ("verboden", "verbod"),
    ("ijsje", "ijsje"),
    ("slagen", "slag"),
    ("verdragen", "verdrag"),
    ("verloven", "verlof"),
    ("gebeden", "gebed"),
    ("gaten", "gat"),
    ("staven", "staf"),
    ("aquariums", "aquarium"),
    ("podia", "podium"),
    ("holen", "hol"),
    ("lammeren", "lam"),
    ("bevelen", "bevel"),
    ("wegen", "weg"),
    ("moeilijkheden", "moeilijkheid"),
    ("aanwezigheden", "aanwezigheid"),
    ("goden", "god"),
    ("loten", "lot"),
    ("kaarsen", "kaars"),
    ("leden", "lid"),
    ("glaasje", "glas"),
    ("eieren", "ei"),
    ("vatten", "vat"),
    ("kalveren", "kalf"),
    ("padden", "pad"),
    ("smeden", "smid"),
    ("genen", "gen"),
    ("beenderen", "been"),
]


verb_irreg_lemmatization_cases = [
    ("liep", "lopen"),
    ("hief", "heffen"),
    ("begon", "beginnen"),
    ("sla", "slaan"),
    ("aangekomen", "aankomen"),
    ("sproot", "spruiten"),
    ("waart", "zijn"),
    ("snoof", "snuiven"),
    ("spoot", "spuiten"),
    ("ontbeet", "ontbijten"),
    ("gehouwen", "houwen"),
    ("afgewassen", "afwassen"),
    ("deed", "doen"),
    ("schoven", "schuiven"),
    ("gelogen", "liegen"),
    ("woog", "wegen"),
    ("gebraden", "braden"),
    ("smolten", "smelten"),
    ("riep", "roepen"),
    ("aangedaan", "aandoen"),
    ("vermeden", "vermijden"),
    ("stootten", "stoten"),
    ("ging", "gaan"),
    ("geschoren", "scheren"),
    ("gesponnen", "spinnen"),
    ("reden", "rijden"),
    ("zochten", "zoeken"),
    ("leed", "lijden"),
    ("verzonnen", "verzinnen"),
]


@pytest.mark.parametrize("text,lemma", noun_irreg_lemmatization_cases)
|
||||||
|
def test_nl_lemmatizer_noun_lemmas_irreg(nl_lemmatizer, text, lemma):
|
||||||
|
pos = "noun"
|
||||||
|
lemmas_pred = nl_lemmatizer(text, pos)
|
||||||
|
assert lemma == sorted(lemmas_pred)[0]
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("text,lemma", verb_irreg_lemmatization_cases)
|
||||||
|
def test_nl_lemmatizer_verb_lemmas_irreg(nl_lemmatizer, text, lemma):
|
||||||
|
pos = "verb"
|
||||||
|
lemmas_pred = nl_lemmatizer(text, pos)
|
||||||
|
assert lemma == sorted(lemmas_pred)[0]
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.skip
|
||||||
|
@pytest.mark.parametrize("text,lemma", [])
|
||||||
|
def test_nl_lemmatizer_verb_lemmas_reg(nl_lemmatizer, text, lemma):
|
||||||
|
# TODO: add test
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.skip
|
||||||
|
@pytest.mark.parametrize("text,lemma", [])
|
||||||
|
def test_nl_lemmatizer_adjective_lemmas(nl_lemmatizer, text, lemma):
|
||||||
|
# TODO: add test
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.skip
|
||||||
|
@pytest.mark.parametrize("text,lemma", [])
|
||||||
|
def test_nl_lemmatizer_determiner_lemmas(nl_lemmatizer, text, lemma):
|
||||||
|
# TODO: add test
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.skip
|
||||||
|
@pytest.mark.parametrize("text,lemma", [])
|
||||||
|
def test_nl_lemmatizer_adverb_lemmas(nl_lemmatizer, text, lemma):
|
||||||
|
# TODO: add test
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("text,lemma", [])
|
||||||
|
def test_nl_lemmatizer_pronoun_lemmas(nl_lemmatizer, text, lemma):
|
||||||
|
# TODO: add test
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
# Using the lemma lookup table only
|
||||||
|
@pytest.mark.parametrize("text,lemma", noun_irreg_lemmatization_cases)
|
||||||
|
def test_nl_lemmatizer_lookup_noun(nl_lemmatizer, text, lemma):
|
||||||
|
lemma_pred = nl_lemmatizer.lookup(text)
|
||||||
|
assert lemma_pred in (lemma, text)
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("text,lemma", verb_irreg_lemmatization_cases)
|
||||||
|
def test_nl_lemmatizer_lookup_verb(nl_lemmatizer, text, lemma):
|
||||||
|
lemma_pred = nl_lemmatizer.lookup(text)
|
||||||
|
assert lemma_pred in (lemma, text)
|
|
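
As a quick orientation to the call chain sketched in the comments at the top of this file, the lemmatizer can also be exercised by hand, using the same `Defaults.create_lemmatizer()` factory as the conftest fixture. A minimal sketch (the noted results are what the tests above expect, not guaranteed output):

```python
# Sketch: calling the Dutch lemmatizer directly, as the fixtures above do.
from spacy.util import get_lang_class

lemmatizer = get_lang_class("nl").Defaults.create_lemmatizer()
print(lemmatizer("liep", "verb"))   # rule/exception path; tests expect "lopen"
print(lemmatizer.lookup("eieren"))  # lookup-table path; tests expect "ei"
```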
@@ -9,3 +9,19 @@ from spacy.lang.nl.lex_attrs import like_num
def test_nl_lex_attrs_capitals(word):
    assert like_num(word)
    assert like_num(word.upper())


@pytest.mark.parametrize(
    "text,num_tokens",
    [
        (
            "De aftredende minister-president benadrukte al dat zijn partij inhoudelijk weinig gemeen heeft met de groenen.",
            16,
        ),
        ("Hij is sociaal-cultureel werker.", 5),
        ("Er staan een aantal dure auto's in de garage.", 10),
    ],
)
def test_tokenizer_doesnt_split_hyphens(nl_tokenizer, text, num_tokens):
    tokens = nl_tokenizer(text)
    assert len(tokens) == num_tokens
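
As the capitals test implies, `like_num` is case-insensitive and accepts digit strings as well as number words. A small sketch; treating "elf" (eleven) as a member of the `nl` number-word list is an assumption, not something this diff shows:

```python
# Sketch of the lexical attribute under test. Digits always pass; "elf" is
# assumed to be in the Dutch number-word list, and casing should not matter.
from spacy.lang.nl.lex_attrs import like_num

assert like_num("11")
assert like_num("elf")
assert like_num("ELF")
```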
@@ -1,6 +1,8 @@
# coding: utf8
from __future__ import unicode_literals

import re
from spacy import compat

prefix_search = (
    b"^\xc2\xa7|^%|^=|^\xe2\x80\x94|^\xe2\x80\x93|^\\+(?![0-9])"
@@ -67,4 +69,4 @@ if compat.is_python2:
    # string above in the xpass message.
    def test_issue3356():
        pattern = re.compile(compat.unescape_unicode(prefix_search.decode("utf8")))
        assert not pattern.search("hello")
@@ -1,10 +1,14 @@
# coding: utf8
from __future__ import unicode_literals

from spacy.util import decaying


def test_issue3447():
    sizes = decaying(10.0, 1.0, 0.5)
    size = next(sizes)
    assert size == 10.0
    size = next(sizes)
    assert size == 10.0 - 0.5
    size = next(sizes)
    assert size == 10.0 - 0.5 - 0.5
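
`decaying` returns an infinite generator, which is why the test pulls values with `next()`. A hedged sketch of its typical use for an annealed hyperparameter; that the series never drops below `stop` is an assumption about `spacy.util`, not something this hunk shows:

```python
# Sketch: consuming decaying() in a training loop for a dropout rate that
# starts at 0.6 and decreases by 1e-4 per step (assumed to floor at 0.2).
from spacy.util import decaying

dropout_rates = decaying(0.6, 0.2, 1e-4)
for step in range(3):
    dropout = next(dropout_rates)  # 0.6, 0.5999, 0.5998, ...
```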
25
spacy/tests/regression/test_issue3449.py
Normal file
@@ -0,0 +1,25 @@
# coding: utf8
from __future__ import unicode_literals

import pytest

from spacy.lang.en import English


@pytest.mark.xfail(reason="Current default suffix rules avoid one upper-case letter before a dot.")
def test_issue3449():
    nlp = English()
    nlp.add_pipe(nlp.create_pipe("sentencizer"))

    text1 = "He gave the ball to I. Do you want to go to the movies with I?"
    text2 = "He gave the ball to I.  Do you want to go to the movies with I?"
    text3 = "He gave the ball to I.\nDo you want to go to the movies with I?"

    t1 = nlp(text1)
    t2 = nlp(text2)
    t3 = nlp(text3)

    assert t1[5].text == "I"
    assert t2[5].text == "I"
    assert t3[5].text == "I"
@@ -1,7 +1,6 @@
# coding: utf8
from __future__ import unicode_literals

from spacy.lang.en import English
from spacy.tokens import Doc
19
spacy/tests/regression/test_issue3521.py
Normal file
@@ -0,0 +1,19 @@
# coding: utf8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize(
    "word",
    [
        "don't",
        "don’t",
        "I'd",
        "I’d",
    ],
)
def test_issue3521(en_tokenizer, word):
    tok = en_tokenizer(word)[1]
    # 'not' and 'would' should be stopwords, also in their abbreviated forms
    assert tok.is_stop
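
The same check works outside the test fixtures with the bare English defaults; a minimal sketch (the token index and stop-word behavior follow the regression test above):

```python
# Sketch: "n’t" (token 1 of "don’t") should be flagged as a stop word,
# also in its curly-apostrophe form, per the regression test above.
from spacy.lang.en import English

nlp = English()
doc = nlp("don’t")
assert doc[1].is_stop
```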
33
spacy/tests/regression/test_issue3531.py
Normal file
@@ -0,0 +1,33 @@
# coding: utf8
from __future__ import unicode_literals

from spacy import displacy


def test_issue3531():
    """Test that displaCy renderer doesn't require "settings" key."""
    example_dep = {
        "words": [
            {"text": "But", "tag": "CCONJ"},
            {"text": "Google", "tag": "PROPN"},
            {"text": "is", "tag": "VERB"},
            {"text": "starting", "tag": "VERB"},
            {"text": "from", "tag": "ADP"},
            {"text": "behind.", "tag": "ADV"},
        ],
        "arcs": [
            {"start": 0, "end": 3, "label": "cc", "dir": "left"},
            {"start": 1, "end": 3, "label": "nsubj", "dir": "left"},
            {"start": 2, "end": 3, "label": "aux", "dir": "left"},
            {"start": 3, "end": 4, "label": "prep", "dir": "right"},
            {"start": 4, "end": 5, "label": "pcomp", "dir": "right"},
        ],
    }
    example_ent = {
        "text": "But Google is starting from behind.",
        "ents": [{"start": 4, "end": 10, "label": "ORG"}],
    }
    dep_html = displacy.render(example_dep, style="dep", manual=True)
    assert dep_html
    ent_html = displacy.render(example_ent, style="ent", manual=True)
    assert ent_html
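
The manual input format also renders to a standalone page; a sketch using `displacy.render`'s `page` option (the file name is arbitrary and the tiny example is not taken from the test above):

```python
# Sketch: writing a manual dependency visualization to a self-contained
# HTML file instead of asserting on the markup.
from spacy import displacy

example_dep = {
    "words": [{"text": "Google", "tag": "PROPN"}, {"text": "wins", "tag": "VERB"}],
    "arcs": [{"start": 0, "end": 1, "label": "nsubj", "dir": "left"}],
}
html = displacy.render(example_dep, style="dep", manual=True, page=True)
with open("dep.html", "w", encoding="utf8") as f:
    f.write(html)
```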
@@ -26,6 +26,7 @@ def symlink_setup_target(request, symlink_target, symlink):
    os.mkdir(path2str(symlink_target))
    # yield -- need to cleanup even if assertion fails
    # https://github.com/pytest-dev/pytest/issues/2508#issuecomment-309934240

    def cleanup():
        symlink_remove(symlink)
        os.rmdir(path2str(symlink_target))
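
The `cleanup` closure above only helps if it is registered as a finalizer rather than called at the end of the fixture. A generic sketch of the pattern the linked pytest comment recommends; the names here are hypothetical, not taken from the spaCy conftest:

```python
# Sketch: request.addfinalizer guarantees teardown runs even if setup code
# later in the fixture fails, which a yield-based fixture would skip.
import os

import pytest


@pytest.fixture
def tmp_target_dir(request):
    os.mkdir("target")

    def cleanup():
        os.rmdir("target")

    request.addfinalizer(cleanup)
    return "target"
```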
@@ -160,20 +160,14 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_textcat.py

### Visualizing spaCy vectors in TensorBoard {#tensorboard}

This script lets you load any spaCy model containing word vectors into
[TensorBoard](https://projector.tensorflow.org/) to create an
[embedding visualization](https://www.tensorflow.org/versions/r1.1/get_started/embedding_viz).

```python
https://github.com/explosion/spaCy/tree/master/examples/vectors_tensorboard.py
```
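
For readers who do not want to run the linked script, the core idea fits in a few lines. A sketch (not the script itself) that dumps a model's vectors into the two TSV files the embedding projector accepts; the model name is an assumption, any vectors-enabled package works:

```python
# Sketch: export word vectors plus labels for TensorBoard's embedding
# projector (load vectors.tsv and metadata.tsv via "Load data").
import spacy

nlp = spacy.load("en_core_web_md")  # assumption: a model shipping vectors
with open("vectors.tsv", "w", encoding="utf8") as vecs_file, open(
    "metadata.tsv", "w", encoding="utf8"
) as meta_file:
    for string in nlp.vocab.strings:
        lexeme = nlp.vocab[string]
        if lexeme.has_vector:
            vecs_file.write("\t".join(str(v) for v in lexeme.vector) + "\n")
            meta_file.write(string + "\n")
```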

## Deep Learning {#deep-learning hidden="true"}

### Text classification with Keras {#keras}
@@ -35,7 +35,7 @@ const SEO = ({ description, lang, title, section, sectionTitle, bodyClass }) =>
        siteMetadata.slogan,
        sectionTitle
    )
    const socialImage = siteMetadata.siteUrl + getImage(section)
    const meta = [
        {
            name: 'description',
@@ -126,6 +126,7 @@ const query = graphql`
                title
                description
                slogan
                siteUrl
                social {
                    twitter
                }
@@ -164,9 +164,9 @@ const Landing = ({ data }) => {
                We're pleased to invite the spaCy community and other folks working on Natural
                Language Processing to Berlin this summer for a small and intimate event{' '}
                <strong>July 5-6, 2019</strong>. The event includes a hands-on training day for
                teams using spaCy in production, followed by a one-track conference. We've
                booked a beautiful venue, hand-picked an awesome lineup of speakers and
                scheduled plenty of social time to get to know each other and exchange ideas.
            </LandingBanner>

            <LandingBanner