Merge branch 'master' into develop

2026-03-07 05:11:27 +03:00 · 2018-05-26 18:30:52 +02:00 · 2018-05-26 18:30:52 +02:00 · 330c039106
commit 330c039106
parent 5d281cf302 d85494bfae
16 changed files with 632 additions and 85 deletions
--- a/.github/contributors/BigstickCarpet.md
+++ b/.github/contributors/BigstickCarpet.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [ X] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | James Messinger                     |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | May 23, 2018                     |
+| GitHub username                | BigstickCarpet                     |
+| Website (optional)             |                      |
--- a/.github/contributors/aristorinjuang.md
+++ b/.github/contributors/aristorinjuang.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [x] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                      |
+|------------------------------- | -------------------------- |
+| Name                           | Aristo Rinjuang            |
+| Company name (if applicable)   |                            |
+| Title or role (if applicable)  |                            |
+| Date                           | May 22, 2018               |
+| GitHub username                | aristorinjuang             |
+| Website (optional)             | https://aristorinjuang.com |
--- a/.github/contributors/armsp.md
+++ b/.github/contributors/armsp.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           |  Shantam             |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           |   21/5/2018          |
+| GitHub username                |     armsp            |
+| Website (optional)             |                      |
--- a/.github/contributors/idealley.md
+++ b/.github/contributors/idealley.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [x] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           |    Pouyt Samuel      |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           |    26.05.2018        |
+| GitHub username                |    Idealley          |
+| Website (optional)             |                      |
--- a/spacy/cli/train.py
+++ b/spacy/cli/train.py
@@ -118,7 +118,7 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
    optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu)
    nlp._optimizer = None

-    print("Itn.\tP.Loss\tN.Loss\tUAS\tNER P.\tNER R.\tNER F.\tTag %\tToken %")
+    print("Itn.  Dep Loss  NER Loss  UAS     NER P.  NER R.  NER F.  Tag %   Token %  CPU WPS  GPU WPS")
    try:
        for i in range(n_iter):
            train_docs = corpus.train_docs(nlp, noise_level=0.0,
@@ -208,17 +208,17 @@ def print_progress(itn, losses, dev_scores, cpu_wps=0.0, gpu_wps=0.0):
    scores.update(dev_scores)
    scores['cpu_wps'] = cpu_wps
    scores['gpu_wps'] = gpu_wps or 0.0
-    tpl = '\t'.join((
-        '{:d}',
-        '{dep_loss:.3f}',
-        '{ner_loss:.3f}',
-        '{uas:.3f}',
-        '{ents_p:.3f}',
-        '{ents_r:.3f}',
-        '{ents_f:.3f}',
-        '{tags_acc:.3f}',
-        '{token_acc:.3f}',
-        '{cpu_wps:.1f}',
+    tpl = ''.join((
+        '{:<6d}',
+        '{dep_loss:<10.3f}',
+        '{ner_loss:<10.3f}',
+        '{uas:<8.3f}',
+        '{ents_p:<8.3f}',
+        '{ents_r:<8.3f}',
+        '{ents_f:<8.3f}',
+        '{tags_acc:<8.3f}',
+        '{token_acc:<9.3f}',
+        '{cpu_wps:<9.1f}',
        '{gpu_wps:.1f}',
    ))
    print(tpl.format(itn, **scores))
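
The switch from tab-joined fields to left-aligned, fixed-width fields can be checked in isolation. The sketch below reuses the header string and the template from this hunk; the score values are made up for illustration and are not output from a real training run.

```python
# Hypothetical scores; only the keys are taken from the template in the hunk.
scores = {
    'dep_loss': 12.3456, 'ner_loss': 7.891, 'uas': 88.123,
    'ents_p': 80.5, 'ents_r': 79.25, 'ents_f': 79.87,
    'tags_acc': 95.001, 'token_acc': 99.9,
    'cpu_wps': 10234.56, 'gpu_wps': 0.0,
}

# Header as printed by train(); split across two literals only for line length.
header = ("Itn.  Dep Loss  NER Loss  UAS     NER P.  NER R.  NER F.  "
          "Tag %   Token %  CPU WPS  GPU WPS")

# '<' left-aligns each value and the number after it is the minimum field
# width, so every column starts at the same offset as its header label.
tpl = ''.join((
    '{:<6d}',
    '{dep_loss:<10.3f}',
    '{ner_loss:<10.3f}',
    '{uas:<8.3f}',
    '{ents_p:<8.3f}',
    '{ents_r:<8.3f}',
    '{ents_f:<8.3f}',
    '{tags_acc:<8.3f}',
    '{token_acc:<9.3f}',
    '{cpu_wps:<9.1f}',
    '{gpu_wps:.1f}',
))

row = tpl.format(3, **scores)
print(header)
print(row)
```

The field widths (6, 10, 10, 8, 8, 8, 8, 8, 9, 9) match the widths of the header labels including their trailing spaces, which is what keeps the columns lined up without tabs. A value wider than its field would still push later columns to the right.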
--- a/spacy/lang/id/lex_attrs.py
+++ b/spacy/lang/id/lex_attrs.py
@@ -4,19 +4,10 @@ from __future__ import unicode_literals
 from ...attrs import LIKE_NUM


-_num_words = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
-              'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
-              'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty',
-              'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety',
-              'hundred', 'thousand', 'million', 'billion', 'trillion', 'quadrillion',
-              'gajillion', 'bazillion',
-              'nol', 'satu', 'dua', 'tiga', 'empat', 'lima', 'enam', 'tujuh',
-              'delapan', 'sembilan', 'sepuluh', 'sebelas', 'duabelas', 'tigabelas',
-              'empatbelas', 'limabelas', 'enambelas', 'tujuhbelas', 'delapanbelas',
-              'sembilanbelas', 'duapuluh', 'seratus', 'seribu', 'sejuta',
-              'ribu', 'rb', 'juta', 'jt', 'miliar', 'biliun', 'triliun',
-              'kuadriliun', 'kuintiliun', 'sekstiliun', 'septiliun', 'oktiliun',
-              'noniliun', 'desiliun']
+_num_words = ['nol', 'satu', 'dua', 'tiga', 'empat', 'lima', 'enam', 'tujuh',
+              'delapan', 'sembilan', 'sepuluh', 'sebelas', 'belas', 'puluh',
+              'ratus', 'ribu', 'juta', 'miliar', 'biliun', 'triliun', 'kuadriliun',
+              'kuintiliun', 'sekstiliun', 'septiliun', 'oktiliun', 'noniliun', 'desiliun']


 def like_num(text):
--- a/spacy/lang/id/norm_exceptions.py
+++ b/spacy/lang/id/norm_exceptions.py
@@ -1,14 +1,7 @@
 # coding: utf8
 from __future__ import unicode_literals

-_exc = {
-    "Rp": "$",
-    "IDR": "$",
-    "RMB": "$",
-    "USD": "$",
-    "AUD": "$",
-    "GBP": "$",
-}
+_exc = {}

 NORM_EXCEPTIONS = {}

--- a/spacy/lang/id/tokenizer_exceptions.py
+++ b/spacy/lang/id/tokenizer_exceptions.py
@@ -5,7 +5,7 @@ import regex as re

 from ._tokenizer_exceptions_list import ID_BASE_EXCEPTIONS
 from ..tokenizer_exceptions import URL_PATTERN
-from ...symbols import ORTH
+from ...symbols import ORTH, LEMMA, NORM


 _exc = {}
@@ -29,17 +29,58 @@ for orth in ID_BASE_EXCEPTIONS:
        orth_caps = '-'.join([part.upper() for part in orth.split('-')])
        _exc[orth_caps] = [{ORTH: orth_caps}]

+for exc_data in [
+    {ORTH: "CKG", LEMMA: "Cakung", NORM: "Cakung"},
+    {ORTH: "CGP", LEMMA: "Grogol Petamburan", NORM: "Grogol Petamburan"},
+    {ORTH: "KSU", LEMMA: "Kepulauan Seribu Utara", NORM: "Kepulauan Seribu Utara"},
+    {ORTH: "KYB", LEMMA: "Kebayoran Baru", NORM: "Kebayoran Baru"},
+    {ORTH: "TJP", LEMMA: "Tanjungpriok", NORM: "Tanjungpriok"},
+    {ORTH: "TNA", LEMMA: "Tanah Abang", NORM: "Tanah Abang"},
+
+    {ORTH: "BEK", LEMMA: "Bengkayang", NORM: "Bengkayang"},
+    {ORTH: "KTP", LEMMA: "Ketapang", NORM: "Ketapang"},
+    {ORTH: "MPW", LEMMA: "Mempawah", NORM: "Mempawah"},
+    {ORTH: "NGP", LEMMA: "Nanga Pinoh", NORM: "Nanga Pinoh"},
+    {ORTH: "NBA", LEMMA: "Ngabang", NORM: "Ngabang"},
+    {ORTH: "PTK", LEMMA: "Pontianak", NORM: "Pontianak"},
+    {ORTH: "PTS", LEMMA: "Putussibau", NORM: "Putussibau"},
+    {ORTH: "SBS", LEMMA: "Sambas", NORM: "Sambas"},
+    {ORTH: "SAG", LEMMA: "Sanggau", NORM: "Sanggau"},
+    {ORTH: "SED", LEMMA: "Sekadau", NORM: "Sekadau"},
+    {ORTH: "SKW", LEMMA: "Singkawang", NORM: "Singkawang"},
+    {ORTH: "STG", LEMMA: "Sintang", NORM: "Sintang"},
+    {ORTH: "SKD", LEMMA: "Sukadane", NORM: "Sukadane"},
+    {ORTH: "SRY", LEMMA: "Sungai Raya", NORM: "Sungai Raya"},
+
+    {ORTH: "Jan.", LEMMA: "Januari", NORM: "Januari"},
+    {ORTH: "Feb.", LEMMA: "Februari", NORM: "Februari"},
+    {ORTH: "Mar.", LEMMA: "Maret", NORM: "Maret"},
+    {ORTH: "Apr.", LEMMA: "April", NORM: "April"},
+    {ORTH: "Jun.", LEMMA: "Juni", NORM: "Juni"},
+    {ORTH: "Jul.", LEMMA: "Juli", NORM: "Juli"},
+    {ORTH: "Agu.", LEMMA: "Agustus", NORM: "Agustus"},
+    {ORTH: "Ags.", LEMMA: "Agustus", NORM: "Agustus"},
+    {ORTH: "Sep.", LEMMA: "September", NORM: "September"},
+    {ORTH: "Okt.", LEMMA: "Oktober", NORM: "Oktober"},
+    {ORTH: "Nov.", LEMMA: "November", NORM: "November"},
+    {ORTH: "Des.", LEMMA: "Desember", NORM: "Desember"}]:
+    _exc[exc_data[ORTH]] = [exc_data]

 for orth in [
-    "'d", "a.m.", "Adm.", "Bros.", "co.", "Co.", "Corp.", "D.C.", "Dr.", "e.g.",
-    "E.g.", "E.G.", "Gen.", "Gov.", "i.e.", "I.e.", "I.E.", "Inc.", "Jr.",
-    "Ltd.", "Md.", "Messrs.", "Mo.", "Mont.", "Mr.", "Mrs.", "Ms.", "p.m.",
-    "Ph.D.", "Rep.", "Rev.", "Sen.", "St.", "vs.",
+    "A.AB.", "A.Ma.", "A.Md.", "A.Md.Keb.", "A.Md.Kep.", "A.P.",
    "B.A.", "B.Ch.E.", "B.Sc.", "Dr.", "Dra.", "Drs.", "Hj.", "Ka.", "Kp.",
-    "M.Ag.", "M.Hum.", "M.Kes,", "M.Kom.", "M.M.", "M.P.", "M.Pd.", "M.Sc.",
-    "M.Si.", "M.Sn.", "M.T.", "M.Th.", "No.", "Pjs.", "Plt.", "R.A.", "S.Ag.",
-    "S.E.", "S.H.", "S.Hut.", "S.K.M.", "S.Kedg.", "S.Kedh.", "S.Kom.",
-    "S.Pd.", "S.Pol.", "S.Psi.", "S.S.", "S.Sos.", "S.T.", "S.Tekp.", "S.Th.",
+    "M.AB", "M.Ag.", "M.AP", "M.Arl", "M.A.R.S", "M.Hum.", "M.I.Kom.", "M.Kes.",
+    "M.Kom.", "M.M.", "M.P.", "M.Pd.", "M.Psi.", "M.Psi.T.", "M.Sc.", "M.SArl",
+    "M.Si.", "M.Sn.", "M.T.", "M.Th.", "No.", "Pjs.", "Plt.", "R.A.",
+    "S.AB", "S.AP", "S.Adm", "S.Ag.", "S.Agr", "S.Ant", "S.Arl", "S.Ars",
+    "S.A.R.S", "S.Ds", "S.E.", "S.E.I.", "S.Farm", "S.Gz.", "S.H.", "S.Han",
+    "S.H.Int", "S.Hum", "S.Hut.", "S.In.", "S.IK.", "S.I.Kom.", "S.I.P", "S.IP",
+    "S.P.", "S.Pt", "S.Psi", "S.Ptk", "S.Keb", "S.Ked", "S.Kep", "S.KG", "S.KH",
+    "S.Kel", "S.K.M.", "S.Kedg.", "S.Kedh.", "S.Kom.", "S.KPM", "S.Mb", "S.Mat",
+    "S.Par", "S.Pd.", "S.Pd.I.", "S.Pd.SD", "S.Pol.", "S.Psi.", "S.S.", "S.SArl.",
+    "S.Sn", "S.Si.", "S.Si.Teol.", "S.SI.", "S.ST.", "S.ST.Han", "S.STP", "S.Sos.",
+    "S.Sy.", "S.T.", "S.T.Han", "S.Th.", "S.Th.I", "S.TI.", "S.T.P.", "S.TrK",
+    "S.Tekp.", "S.Th.",
    "a.l.", "a.n.", "a.s.", "b.d.", "d.a.", "d.l.", "d/h", "dkk.", "dll.",
    "dr.", "drh.", "ds.", "dsb.", "dst.", "faks.", "fax.", "hlm.", "i/o",
    "n.b.", "p.p.", "pjs.", "s.d.", "tel.", "u.p.",
--- a/spacy/lang/ro/lex_attrs.py
+++ b/spacy/lang/ro/lex_attrs.py
@@ -0,0 +1,42 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ...attrs import LIKE_NUM
+
+
+_num_words = set("""
+zero unu doi două trei patru cinci șase șapte opt nouă zece
+unsprezece doisprezece douăsprezece treisprezece patrusprezece cincisprezece șaisprezece șaptesprezece optsprezece nouăsprezece
+douăzeci treizeci patruzeci cincizeci șaizeci șaptezeci optzeci nouăzeci
+sută mie milion miliard bilion trilion cvadrilion catralion cvintilion sextilion septilion enșpemii
+""".split())
+
+_ordinal_words = set("""
+primul doilea treilea patrulea cincilea șaselea șaptelea optulea nouălea zecelea
+prima doua treia patra cincia șasea șaptea opta noua zecea
+unsprezecelea doisprezecelea treisprezecelea patrusprezecelea cincisprezecelea șaisprezecelea șaptesprezecelea optsprezecelea nouăsprezecelea
+unsprezecea douăsprezecea treisprezecea patrusprezecea cincisprezecea șaisprezecea șaptesprezecea optsprezecea nouăsprezecea
+douăzecilea treizecilea patruzecilea cincizecilea șaizecilea șaptezecilea optzecilea nouăzecilea sutălea
+douăzecea treizecea patruzecea cincizecea șaizecea șaptezecea optzecea nouăzecea suta
+miilea mielea mia milionulea milioana miliardulea miliardelea miliarda enșpemia
+""".split())
+
+
+def like_num(text):
+    text = text.replace(',', '').replace('.', '')
+    if text.isdigit():
+        return True
+    if text.count('/') == 1:
+        num, denom = text.split('/')
+        if num.isdigit() and denom.isdigit():
+            return True
+    if text.lower() in _num_words:
+        return True
+    if text.lower() in _ordinal_words:
+        return True
+    return False
+
+
+LEX_ATTRS = {
+    LIKE_NUM: like_num
+}
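
The order of checks in `like_num` can be tried without installing spaCy. The sketch below copies the function body from this file but uses a small hand-picked subset of the word lists (an assumption made to keep the example short), so it illustrates the logic rather than reproducing the module.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

# Reduced word sets for illustration only; the real module lists many more.
_num_words = {"zero", "unu", "doi", "două", "trei", "zece", "sută", "mie"}
_ordinal_words = {"primul", "prima", "doilea", "doua", "zecelea"}


def like_num(text):
    # Drop thousands and decimal separators so "10.000" and "1,5" are digits.
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    # Accept simple fractions such as "3/4", but not "1/2/3".
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    # Cardinal and ordinal words are matched case-insensitively.
    if text.lower() in _num_words:
        return True
    if text.lower() in _ordinal_words:
        return True
    return False


for token in ["10.000", "3/4", "Două", "primul", "câine", "1/2/3"]:
    print(token, like_num(token))
```

Because separators are stripped before any other check, a bare "." or "," becomes the empty string and is reported as not number-like.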
--- a/spacy/lang/ro/tokenizer_exceptions.py
+++ b/spacy/lang/ro/tokenizer_exceptions.py
@@ -9,8 +9,9 @@ _exc = {}

 # Source: https://en.wiktionary.org/wiki/Category:Romanian_abbreviations
 for orth in [
-    "1-a", "1-ul", "10-a", "10-lea", "2-a", "3-a", "3-lea", "6-lea",
-    "d-voastră", "dvs.", "Rom.", "str."]:
+    "1-a", "2-a", "3-a", "4-a", "5-a", "6-a", "7-a", "8-a", "9-a", "10-a", "11-a", "12-a",
+    "1-ul", "2-lea", "3-lea", "4-lea", "5-lea", "6-lea", "7-lea", "8-lea", "9-lea", "10-lea", "11-lea", "12-lea",
+    "d-voastră", "dvs.", "ing.", "dr.", "Rom.", "str.", "nr.", "etc.", "d.p.d.v.", "dpdv", "șamd.", "ș.a.m.d."]:
    _exc[orth] = [{ORTH: orth}]


--- a/spacy/tests/conftest.py
+++ b/spacy/tests/conftest.py
@ -15,7 +15,7 @@ from .. import util
 # here if it's using spaCy's tokenizer (not a different library)
 # TODO: re-implement generic tokenizer tests
 _languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
-              'it', 'nb', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr', 'ar', 'xx']
+              'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'xx']

 _models = {'en': ['en_core_web_sm'],
           'de': ['de_core_news_md'],
--- a/spacy/tests/lang/ro/test_tokenizer.py
+++ b/spacy/tests/lang/ro/test_tokenizer.py
@ -0,0 +1,25 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+import pytest
+
+DEFAULT_TESTS = [
+    ('Adresa este str. Principală nr. 5.', ['Adresa', 'este', 'str.', 'Principală', 'nr.', '5', '.']),
+    ('Teste, etc.', ['Teste', ',', 'etc.']),
+    ('Lista, ș.a.m.d.', ['Lista', ',', 'ș.a.m.d.']),
+    ('Și d.p.d.v. al...', ['Și', 'd.p.d.v.', 'al', '...'])
+]
+
+NUMBER_TESTS = [
+    ('Clasa a 4-a.', ['Clasa', 'a', '4-a', '.']),
+    ('Al 12-lea ceas.', ['Al', '12-lea', 'ceas', '.'])
+]
+
+TESTCASES = DEFAULT_TESTS + NUMBER_TESTS
+
+
+@pytest.mark.parametrize('text,expected_tokens', TESTCASES)
+def test_tokenizer_handles_testcases(ro_tokenizer, text, expected_tokens):
+    tokens = ro_tokenizer(text)
+    token_list = [token.text for token in tokens if not token.is_space]
+    assert expected_tokens == token_list
--- a/website/api/_annotation/_training.jade
+++ b/website/api/_annotation/_training.jade
@ -53,7 +53,7 @@ p
    +tag-new(2)

 p
-    |  The populate a model's vocabulary, you can use the
+    |  To populate a model's vocabulary, you can use the
    |  #[+api("cli#vocab") #[code spacy vocab]] command and load in a
    |  #[+a("https://jsonlines.readthedocs.io/en/latest/") newline-delimited JSON]
    |  (JSONL) file containing one lexical entry per line. The first line
--- a/website/usage/_install/_quickstart.jade
+++ b/website/usage/_install/_quickstart.jade
@ -16,7 +16,9 @@

    +qs({package: 'source'}) git clone https://github.com/explosion/spaCy
    +qs({package: 'source'}) cd spaCy
-    +qs({package: 'source'}) export PYTHONPATH=`pwd`
+    +qs({package: 'source', os: 'mac'}) export PYTHONPATH=`pwd`
+    +qs({package: 'source', os: 'linux'}) export PYTHONPATH=`pwd`
+    +qs({package: 'source', os: 'windows'}) set PYTHONPATH=/path/to/spaCy
    +qs({package: 'source'}) pip install -r requirements.txt
    +qs({package: 'source'}) python setup.py build_ext --inplace

--- a/website/usage/_linguistic-features/_rule-based-matching.jade
+++ b/website/usage/_linguistic-features/_rule-based-matching.jade
@ -184,7 +184,7 @@ p

 p
    |  In versions before v2.1.0, the semantics of the #[code +] and #[code *] operators
-    |  behave inconsistently. They were usually interpretted
+    |  behave inconsistently. They were usually interpreted
    |  "greedily", i.e. longer matches are returned where possible. However, if
    |  you specify two #[code +] and #[code *] patterns in a row and their
    |  matches overlap, the first operator will behave non-greedily. This quirk
@ -260,41 +260,6 @@ p
    doc = nlp(u"This is a text about Google I/O 2015.")
    matches = matcher(doc)

-p
-    |  In addition to mentions of "Google I/O", your data also contains some
-    |  annoying pre-processing artefacts, like leftover HTML line breaks
-    |  (e.g. #[code &lt;br&gt;] or #[code &lt;BR/&gt;]). While you're at it,
-    |  you want to merge those into one token and flag them, to make sure you
-    |  can easily ignore them later. So you add a second pattern and pass in a
-    |  function #[code merge_and_flag]:
-
-+code-exec.
-    import spacy
-    from spacy.matcher import Matcher
-    from spacy.tokens import Token
-
-    nlp = spacy.load('en_core_web_sm')
-    matcher = Matcher(nlp.vocab)
-    # register a new token extension to flag bad HTML
-    Token.set_extension('bad_html', default=False)
-
-    def merge_and_flag(matcher, doc, i, matches):
-        match_id, start, end = matches[i]
-        span = doc[start : end]
-        span.merge(is_stop=True) # merge (and mark it as a stop word, just in case)
-        for token in span:
-            token._.bad_html = True  # mark token as bad HTML
-        print(span.text)
-
-    matcher.add('BAD_HTML', merge_and_flag,
-                [{'ORTH': '&lt;'}, {'LOWER': 'br'}, {'ORTH': '&gt;'}],
-                [{'ORTH': '&lt;'}, {'LOWER': 'br/'}, {'ORTH': '&gt;'}])
-
-    doc = nlp(u"Hello&lt;br&gt;world!")
-    matches = matcher(doc)
-    for token in doc:
-        print(token.text, token._.bad_html)
-
 +aside("Tip: Visualizing matches")
    |  When working with entities, you can use #[+api("top-level#displacy") displaCy]
    |  to quickly generate a NER visualization from your updated #[code Doc],
@ -315,7 +280,7 @@ p
    |  that was matched, and invoke it.

 +code.
-    doc = nlp(LOTS_OF_TEXT)
+    doc = nlp(YOUR_TEXT_HERE)
    matcher(doc)

 p
@ -348,6 +313,69 @@ p
            |  A list of #[code (match_id, start, end)] tuples, describing the
            |  matches. A match tuple describes a span #[code doc[start:end]].

++h(3, "matcher-pipeline") Using custom pipeline components
+
+p
+    |  Let's say your data also contains some annoying pre-processing artefacts,
+    |  like leftover HTML line breaks (e.g. #[code &lt;br&gt;] or
+    |  #[code &lt;BR/&gt;]). To make your text easier to analyse, you want to
+    |  merge those into one token and flag them, to make sure you
+    |  can ignore them later. Ideally, this should all be done automatically
+    |  as you process the text. You can achieve this by adding a
+    |  #[+a("/usage/processing-pipelines#custom-components") custom pipeline component]
+    |  that's called on each #[code Doc] object, merges the leftover HTML spans
+    |  and sets an attribute #[code bad_html] on the token.
+
++code-exec.
+    import spacy
+    from spacy.matcher import Matcher
+    from spacy.tokens import Token
+
+    # we're using a class because the component needs to be initialised with
+    # the shared vocab via the nlp object
+    class BadHTMLMerger(object):
+        def __init__(self, nlp):
+            # register a new token extension to flag bad HTML
+            Token.set_extension('bad_html', default=False)
+            self.matcher = Matcher(nlp.vocab)
+            self.matcher.add('BAD_HTML', None,
+                [{'ORTH': '&lt;'}, {'LOWER': 'br'}, {'ORTH': '&gt;'}],
+                [{'ORTH': '&lt;'}, {'LOWER': 'br/'}, {'ORTH': '&gt;'}])
+
+        def __call__(self, doc):
+            # this method is invoked when the component is called on a Doc
+            matches = self.matcher(doc)
+            spans = []  # collect the matched spans here
+            for match_id, start, end in matches:
+                spans.append(doc[start:end])
+            for span in spans:
+                span.merge(is_stop=True) # merge (and mark it as a stop word)
+                for token in span:
+                    token._.bad_html = True  # mark token as bad HTML
+            return doc
+
+    nlp = spacy.load('en_core_web_sm')
+    html_merger = BadHTMLMerger(nlp)
+    nlp.add_pipe(html_merger, last=True)  # add component to the pipeline
+    doc = nlp(u"Hello&lt;br&gt;world! &lt;br/&gt; This is a test.")
+    for token in doc:
+        print(token.text, token._.bad_html)
+
+p
+    |  Instead of hard-coding the patterns into the component, you could also
+    |  make it take a path to a JSON file containing the patterns. This lets
+    |  you reuse the component with different patterns, depending on your
+    |  application:
+
++code.
+    html_merger = BadHTMLMerger(nlp, path='/path/to/patterns.json')
+
++infobox
+    |  For more details and examples of how to
+    |  #[strong create custom pipeline components] and
+    |  #[strong extension attributes], see the
+    |  #[+a("/usage/processing-pipelines") usage guide].
+
 +h(3, "regex") Using regular expressions

 p
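
Reviewer note: the matcher documentation hunk above mentions a `BadHTMLMerger(nlp, path=...)` variant that reads its patterns from a JSON file, but does not show how the file would be loaded. The sketch below is one possible way to do that with the standard library only. The helper name `load_patterns` and the file layout (a JSON list of patterns, each a list of token-attribute dicts) are assumptions, not something the commit defines.

```python
import json
import os
import tempfile


def load_patterns(path):
    """Read token patterns from a JSON file (hypothetical layout: a list of patterns)."""
    with open(path, encoding='utf8') as f:
        patterns = json.load(f)
    # Each pattern is expected to be a list of token-attribute dicts.
    for pattern in patterns:
        if not isinstance(pattern, list) or not all(isinstance(t, dict) for t in pattern):
            raise ValueError("Each pattern must be a list of dicts: %r" % (pattern,))
    return patterns


# Write an example file holding the two line-break patterns from the docs example
# (shown here with literal angle brackets rather than HTML entities).
example = [
    [{'ORTH': '<'}, {'LOWER': 'br'}, {'ORTH': '>'}],
    [{'ORTH': '<'}, {'LOWER': 'br/'}, {'ORTH': '>'}],
]
patterns_path = os.path.join(tempfile.mkdtemp(), 'patterns.json')
with open(patterns_path, 'w', encoding='utf8') as f:
    json.dump(example, f)

patterns = load_patterns(patterns_path)
print(len(patterns))
```

Inside the component's `__init__`, the loaded list could then replace the hard-coded patterns by unpacking it into the same call the docs example uses, i.e. `self.matcher.add('BAD_HTML', None, *patterns)`.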
--- a/website/usage/_vectors-similarity/_custom.jade
+++ b/website/usage/_vectors-similarity/_custom.jade
@ -52,7 +52,7 @@ p

 +code(false, "bash").
    wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz
-    python -m spacy init-model /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz
+    python -m spacy init-model en /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz

 p
    |  This will output a spaCy model in the directory