From 1a4682dd0bdf4d8b7b4c3dd23c3829b1333f2fa5 Mon Sep 17 00:00:00 2001 From: Shantam Raj Date: Mon, 21 May 2018 14:39:33 +0530 Subject: [PATCH 1/9] Update _training.jade (#2340) * Update _training.jade Correcting grammar. Replacing "The" with "To". * Create armsp.md * Update armsp.md --- .github/contributors/armsp.md | 106 +++++++++++++++++++++++++ website/api/_annotation/_training.jade | 2 +- 2 files changed, 107 insertions(+), 1 deletion(-) create mode 100644 .github/contributors/armsp.md diff --git a/.github/contributors/armsp.md b/.github/contributors/armsp.md new file mode 100644 index 000000000..63d1367e4 --- /dev/null +++ b/.github/contributors/armsp.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. 
The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. 
With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. 
Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Shantam | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 21/5/2018 | +| GitHub username | armsp | +| Website (optional) | | diff --git a/website/api/_annotation/_training.jade b/website/api/_annotation/_training.jade index 9bd59cdae..8658866aa 100644 --- a/website/api/_annotation/_training.jade +++ b/website/api/_annotation/_training.jade @@ -53,7 +53,7 @@ p +tag-new(2) p - | The populate a model's vocabulary, you can use the + | To populate a model's vocabulary, you can use the | #[+api("cli#vocab") #[code spacy vocab]] command and load in a | #[+a("https://jsonlines.readthedocs.io/en/latest/") newline-delimited JSON] | (JSONL) file containing one lexical entry per line. 
The first line From ec62cadf4c1417261e3c7bc9e4c07a01ca2d324b Mon Sep 17 00:00:00 2001 From: Jani Monoses Date: Thu, 24 May 2018 12:40:00 +0300 Subject: [PATCH 2/9] Updates to Romanian support (#2354) * Add back Romanian in conftest * Romanian lex_attr * More tokenizer exceptions for Romanian * Add tests for some Romanian tokenizer exceptions --- spacy/lang/ro/lex_attrs.py | 42 +++++++++++++++++++++++++++ spacy/lang/ro/tokenizer_exceptions.py | 5 ++-- spacy/tests/conftest.py | 2 +- spacy/tests/lang/ro/test_tokenizer.py | 25 ++++++++++++++++ 4 files changed, 71 insertions(+), 3 deletions(-) create mode 100644 spacy/lang/ro/lex_attrs.py create mode 100644 spacy/tests/lang/ro/test_tokenizer.py diff --git a/spacy/lang/ro/lex_attrs.py b/spacy/lang/ro/lex_attrs.py new file mode 100644 index 000000000..48027186b --- /dev/null +++ b/spacy/lang/ro/lex_attrs.py @@ -0,0 +1,42 @@ +# coding: utf8 +from __future__ import unicode_literals + +from ...attrs import LIKE_NUM + + +_num_words = set(""" +zero unu doi două trei patru cinci șase șapte opt nouă zece +unsprezece doisprezece douăsprezece treisprezece patrusprezece cincisprezece șaisprezece șaptesprezece optsprezece nouăsprezece +douăzeci treizeci patruzeci cincizeci șaizeci șaptezeci optzeci nouăzeci +sută mie milion miliard bilion trilion cvadrilion catralion cvintilion sextilion septilion enșpemii +""".split()) + +_ordinal_words = set(""" +primul doilea treilea patrulea cincilea șaselea șaptelea optulea nouălea zecelea +prima doua treia patra cincia șasea șaptea opta noua zecea +unsprezecelea doisprezecelea treisprezecelea patrusprezecelea cincisprezecelea șaisprezecelea șaptesprezecelea optsprezecelea nouăsprezecelea +unsprezecea douăsprezecea treisprezecea patrusprezecea cincisprezecea șaisprezecea șaptesprezecea optsprezecea nouăsprezecea +douăzecilea treizecilea patruzecilea cincizecilea șaizecilea șaptezecilea optzecilea nouăzecilea sutălea +douăzecea treizecea patruzecea cincizecea șaizecea șaptezecea optzecea 
nouăzecea suta +miilea mielea mia milionulea milioana miliardulea miliardelea miliarda enșpemia +""".split()) + + +def like_num(text): + text = text.replace(',', '').replace('.', '') + if text.isdigit(): + return True + if text.count('/') == 1: + num, denom = text.split('/') + if num.isdigit() and denom.isdigit(): + return True + if text.lower() in _num_words: + return True + if text.lower() in _ordinal_words: + return True + return False + + +LEX_ATTRS = { + LIKE_NUM: like_num +} diff --git a/spacy/lang/ro/tokenizer_exceptions.py b/spacy/lang/ro/tokenizer_exceptions.py index 42ccd6a93..bc501c32a 100644 --- a/spacy/lang/ro/tokenizer_exceptions.py +++ b/spacy/lang/ro/tokenizer_exceptions.py @@ -9,8 +9,9 @@ _exc = {} # Source: https://en.wiktionary.org/wiki/Category:Romanian_abbreviations for orth in [ - "1-a", "1-ul", "10-a", "10-lea", "2-a", "3-a", "3-lea", "6-lea", - "d-voastră", "dvs.", "Rom.", "str."]: + "1-a", "2-a", "3-a", "4-a", "5-a", "6-a", "7-a", "8-a", "9-a", "10-a", "11-a", "12-a", + "1-ul", "2-lea", "3-lea", "4-lea", "5-lea", "6-lea", "7-lea", "8-lea", "9-lea", "10-lea", "11-lea", "12-lea", + "d-voastră", "dvs.", "ing.", "dr.", "Rom.", "str.", "nr.", "etc.", "d.p.d.v.", "dpdv", "șamd.", "ș.a.m.d."]: _exc[orth] = [{ORTH: orth}] diff --git a/spacy/tests/conftest.py b/spacy/tests/conftest.py index 67f7479d1..afb3ad5cd 100644 --- a/spacy/tests/conftest.py +++ b/spacy/tests/conftest.py @@ -15,7 +15,7 @@ from .. 
import util # here if it's using spaCy's tokenizer (not a different library) # TODO: re-implement generic tokenizer tests _languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id', - 'it', 'nb', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr', 'ar', 'xx'] + 'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'xx'] _models = {'en': ['en_core_web_sm'], 'de': ['de_core_news_md'], diff --git a/spacy/tests/lang/ro/test_tokenizer.py b/spacy/tests/lang/ro/test_tokenizer.py new file mode 100644 index 000000000..e754eaeae --- /dev/null +++ b/spacy/tests/lang/ro/test_tokenizer.py @@ -0,0 +1,25 @@ +# coding: utf8 +from __future__ import unicode_literals + +import pytest + +DEFAULT_TESTS = [ + ('Adresa este str. Principală nr. 5.', ['Adresa', 'este', 'str.', 'Principală', 'nr.', '5', '.']), + ('Teste, etc.', ['Teste', ',', 'etc.']), + ('Lista, ș.a.m.d.', ['Lista', ',', 'ș.a.m.d.']), + ('Și d.p.d.v. al...', ['Și', 'd.p.d.v.', 'al', '...']) +] + +NUMBER_TESTS = [ + ('Clasa a 4-a.', ['Clasa', 'a', '4-a', '.']), + ('Al 12-lea ceas.', ['Al', '12-lea', 'ceas', '.']) +] + +TESTCASES = DEFAULT_TESTS + NUMBER_TESTS + + +@pytest.mark.parametrize('text,expected_tokens', TESTCASES) +def test_tokenizer_handles_testcases(ro_tokenizer, text, expected_tokens): + tokens = ro_tokenizer(text) + token_list = [token.text for token in tokens if not token.is_space] + assert expected_tokens == token_list From 432ede04afc052adc424e05b5544fdb17b3342a6 Mon Sep 17 00:00:00 2001 From: Aristo Rinjuang Date: Thu, 24 May 2018 16:40:57 +0700 Subject: [PATCH 3/9] adding more words and rephrasing (#2351) * adding more words and rephrasing * adding a contributor * tokenizer bugs solved --- .github/contributors/aristorinjuang.md | 106 +++++++++++++++++++++++++ spacy/lang/id/lex_attrs.py | 17 +--- spacy/lang/id/norm_exceptions.py | 9 +-- spacy/lang/id/tokenizer_exceptions.py | 59 +++++++++++--- 4 files changed, 161 insertions(+), 30 deletions(-) create mode 100644 
.github/contributors/aristorinjuang.md diff --git a/.github/contributors/aristorinjuang.md b/.github/contributors/aristorinjuang.md new file mode 100644 index 000000000..17cb692a6 --- /dev/null +++ b/.github/contributors/aristorinjuang.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. 
With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. 
Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [x] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. 
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------------- | +| Name | Aristo Rinjuang | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | May 22, 2018 | +| GitHub username | aristorinjuang | +| Website (optional) | https://aristorinjuang.com | diff --git a/spacy/lang/id/lex_attrs.py b/spacy/lang/id/lex_attrs.py index 235cee438..39f7042eb 100644 --- a/spacy/lang/id/lex_attrs.py +++ b/spacy/lang/id/lex_attrs.py @@ -4,19 +4,10 @@ from __future__ import unicode_literals from ...attrs import LIKE_NUM -_num_words = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', - 'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', - 'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty', - 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety', - 'hundred', 'thousand', 'million', 'billion', 'trillion', 'quadrillion', - 'gajillion', 'bazillion', - 'nol', 'satu', 'dua', 'tiga', 'empat', 'lima', 'enam', 'tujuh', - 'delapan', 'sembilan', 'sepuluh', 'sebelas', 'duabelas', 'tigabelas', - 'empatbelas', 'limabelas', 'enambelas', 'tujuhbelas', 'delapanbelas', - 'sembilanbelas', 'duapuluh', 'seratus', 'seribu', 'sejuta', - 'ribu', 'rb', 'juta', 'jt', 'miliar', 'biliun', 'triliun', - 'kuadriliun', 'kuintiliun', 'sekstiliun', 'septiliun', 'oktiliun', - 'noniliun', 'desiliun'] +_num_words = ['nol', 'satu', 'dua', 'tiga', 'empat', 'lima', 'enam', 'tujuh', + 'delapan', 'sembilan', 'sepuluh', 'sebelas', 'belas', 'puluh', + 'ratus', 'ribu', 'juta', 'miliar', 'biliun', 'triliun', 'kuadriliun', + 'kuintiliun', 'sekstiliun', 'septiliun', 'oktiliun', 'noniliun', 'desiliun'] def like_num(text): diff --git a/spacy/lang/id/norm_exceptions.py b/spacy/lang/id/norm_exceptions.py index cb168dfeb..2468efbcd 100644 --- a/spacy/lang/id/norm_exceptions.py +++ b/spacy/lang/id/norm_exceptions.py @@ -1,14 +1,7 @@ # coding: utf8 from __future__ import unicode_literals -_exc = { - 
"Rp": "$", - "IDR": "$", - "RMB": "$", - "USD": "$", - "AUD": "$", - "GBP": "$", -} +_exc = {} NORM_EXCEPTIONS = {} diff --git a/spacy/lang/id/tokenizer_exceptions.py b/spacy/lang/id/tokenizer_exceptions.py index 3bba57e4c..1e5282e52 100644 --- a/spacy/lang/id/tokenizer_exceptions.py +++ b/spacy/lang/id/tokenizer_exceptions.py @@ -5,7 +5,7 @@ import regex as re from ._tokenizer_exceptions_list import ID_BASE_EXCEPTIONS from ..tokenizer_exceptions import URL_PATTERN -from ...symbols import ORTH +from ...symbols import ORTH, LEMMA, NORM _exc = {} @@ -29,17 +29,58 @@ for orth in ID_BASE_EXCEPTIONS: orth_caps = '-'.join([part.upper() for part in orth.split('-')]) _exc[orth_caps] = [{ORTH: orth_caps}] +for exc_data in [ + {ORTH: "CKG", LEMMA: "Cakung", NORM: "Cakung"}, + {ORTH: "CGP", LEMMA: "Grogol Petamburan", NORM: "Grogol Petamburan"}, + {ORTH: "KSU", LEMMA: "Kepulauan Seribu Utara", NORM: "Kepulauan Seribu Utara"}, + {ORTH: "KYB", LEMMA: "Kebayoran Baru", NORM: "Kebayoran Baru"}, + {ORTH: "TJP", LEMMA: "Tanjungpriok", NORM: "Tanjungpriok"}, + {ORTH: "TNA", LEMMA: "Tanah Abang", NORM: "Tanah Abang"}, + + {ORTH: "BEK", LEMMA: "Bengkayang", NORM: "Bengkayang"}, + {ORTH: "KTP", LEMMA: "Ketapang", NORM: "Ketapang"}, + {ORTH: "MPW", LEMMA: "Mempawah", NORM: "Mempawah"}, + {ORTH: "NGP", LEMMA: "Nanga Pinoh", NORM: "Nanga Pinoh"}, + {ORTH: "NBA", LEMMA: "Ngabang", NORM: "Ngabang"}, + {ORTH: "PTK", LEMMA: "Pontianak", NORM: "Pontianak"}, + {ORTH: "PTS", LEMMA: "Putussibau", NORM: "Putussibau"}, + {ORTH: "SBS", LEMMA: "Sambas", NORM: "Sambas"}, + {ORTH: "SAG", LEMMA: "Sanggau", NORM: "Sanggau"}, + {ORTH: "SED", LEMMA: "Sekadau", NORM: "Sekadau"}, + {ORTH: "SKW", LEMMA: "Singkawang", NORM: "Singkawang"}, + {ORTH: "STG", LEMMA: "Sintang", NORM: "Sintang"}, + {ORTH: "SKD", LEMMA: "Sukadane", NORM: "Sukadane"}, + {ORTH: "SRY", LEMMA: "Sungai Raya", NORM: "Sungai Raya"}, + + {ORTH: "Jan.", LEMMA: "Januari", NORM: "Januari"}, + {ORTH: "Feb.", LEMMA: "Februari", NORM: "Februari"}, 
+ {ORTH: "Mar.", LEMMA: "Maret", NORM: "Maret"}, + {ORTH: "Apr.", LEMMA: "April", NORM: "April"}, + {ORTH: "Jun.", LEMMA: "Juni", NORM: "Juni"}, + {ORTH: "Jul.", LEMMA: "Juli", NORM: "Juli"}, + {ORTH: "Agu.", LEMMA: "Agustus", NORM: "Agustus"}, + {ORTH: "Ags.", LEMMA: "Agustus", NORM: "Agustus"}, + {ORTH: "Sep.", LEMMA: "September", NORM: "September"}, + {ORTH: "Okt.", LEMMA: "Oktober", NORM: "Oktober"}, + {ORTH: "Nov.", LEMMA: "November", NORM: "November"}, + {ORTH: "Des.", LEMMA: "Desember", NORM: "Desember"}]: + _exc[exc_data[ORTH]] = [exc_data] for orth in [ - "'d", "a.m.", "Adm.", "Bros.", "co.", "Co.", "Corp.", "D.C.", "Dr.", "e.g.", - "E.g.", "E.G.", "Gen.", "Gov.", "i.e.", "I.e.", "I.E.", "Inc.", "Jr.", - "Ltd.", "Md.", "Messrs.", "Mo.", "Mont.", "Mr.", "Mrs.", "Ms.", "p.m.", - "Ph.D.", "Rep.", "Rev.", "Sen.", "St.", "vs.", + "A.AB.", "A.Ma.", "A.Md.", "A.Md.Keb.", "A.Md.Kep.", "A.P.", "B.A.", "B.Ch.E.", "B.Sc.", "Dr.", "Dra.", "Drs.", "Hj.", "Ka.", "Kp.", - "M.Ag.", "M.Hum.", "M.Kes,", "M.Kom.", "M.M.", "M.P.", "M.Pd.", "M.Sc.", - "M.Si.", "M.Sn.", "M.T.", "M.Th.", "No.", "Pjs.", "Plt.", "R.A.", "S.Ag.", - "S.E.", "S.H.", "S.Hut.", "S.K.M.", "S.Kedg.", "S.Kedh.", "S.Kom.", - "S.Pd.", "S.Pol.", "S.Psi.", "S.S.", "S.Sos.", "S.T.", "S.Tekp.", "S.Th.", + "M.AB", "M.Ag.", "M.AP", "M.Arl", "M.A.R.S", "M.Hum.", "M.I.Kom.", "M.Kes,", + "M.Kom.", "M.M.", "M.P.", "M.Pd.", "M.Psi.", "M.Psi.T.", "M.Sc.", "M.SArl", + "M.Si.", "M.Sn.", "M.T.", "M.Th.", "No.", "Pjs.", "Plt.", "R.A.", + "S.AB", "S.AP", "S.Adm", "S.Ag.", "S.Agr", "S.Ant", "S.Arl", "S.Ars", + "S.A.R.S", "S.Ds", "S.E.", "S.E.I.", "S.Farm", "S.Gz.", "S.H.", "S.Han", + "S.H.Int", "S.Hum", "S.Hut.", "S.In.", "S.IK.", "S.I.Kom.", "S.I.P", "S.IP", + "S.P.", "S.Pt", "S.Psi", "S.Ptk", "S.Keb", "S.Ked", "S.Kep", "S.KG", "S.KH", + "S.Kel", "S.K.M.", "S.Kedg.", "S.Kedh.", "S.Kom.", "S.KPM", "S.Mb", "S.Mat", + "S.Par", "S.Pd.", "S.Pd.I.", "S.Pd.SD", "S.Pol.", "S.Psi.", "S.S.", "S.SArl.", + "S.Sn", "S.Si.", 
"S.Si.Teol.", "S.SI.", "S.ST.", "S.ST.Han", "S.STP", "S.Sos.", + "S.Sy.", "S.T.", "S.T.Han", "S.Th.", "S.Th.I", "S.TI.", "S.T.P.", "S.TrK", + "S.Tekp.", "S.Th.", "a.l.", "a.n.", "a.s.", "b.d.", "d.a.", "d.l.", "d/h", "dkk.", "dll.", "dr.", "drh.", "ds.", "dsb.", "dst.", "faks.", "fax.", "hlm.", "i/o", "n.b.", "p.p." "pjs.", "s.d.", "tel.", "u.p.", From 8adb967e0c293d3bcc66d1f0cbdd6ced49eb1099 Mon Sep 17 00:00:00 2001 From: ines Date: Thu, 24 May 2018 12:42:16 +0200 Subject: [PATCH 4/9] Fix from source quickstart instructions for Windows See: https://stackoverflow.com/a/50478036/6400719 --- website/usage/_install/_quickstart.jade | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/website/usage/_install/_quickstart.jade b/website/usage/_install/_quickstart.jade index f2d4383a4..976e7d4ad 100644 --- a/website/usage/_install/_quickstart.jade +++ b/website/usage/_install/_quickstart.jade @@ -16,7 +16,9 @@ +qs({package: 'source'}) git clone https://github.com/explosion/spaCy +qs({package: 'source'}) cd spaCy - +qs({package: 'source'}) export PYTHONPATH=`pwd` + +qs({package: 'source', os: 'mac'}) export PYTHONPATH=`pwd` + +qs({package: 'source', os: 'linux'}) export PYTHONPATH=`pwd` + +qs({package: 'source', os: 'windows'}) set PYTHONPATH=/path/to/spaCy +qs({package: 'source'}) pip install -r requirements.txt +qs({package: 'source'}) python setup.py build_ext --inplace From 592834183ad0c8097f7290631681012d9b1bb39a Mon Sep 17 00:00:00 2001 From: Shantam Raj Date: Thu, 24 May 2018 16:59:52 +0530 Subject: [PATCH 5/9] corrected spelling (#2359) changed **interpretted** to **interpreted** --- website/usage/_linguistic-features/_rule-based-matching.jade | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/usage/_linguistic-features/_rule-based-matching.jade b/website/usage/_linguistic-features/_rule-based-matching.jade index c0d418d46..2d5d73c5a 100644 --- a/website/usage/_linguistic-features/_rule-based-matching.jade +++ 
b/website/usage/_linguistic-features/_rule-based-matching.jade @@ -184,7 +184,7 @@ p p | In versions before v2.1.0, the semantics of the #[code +] and #[code *] operators - | behave inconsistently. They were usually interpretted + | behave inconsistently. They were usually interpreted | "greedily", i.e. longer matches are returned where possible. However, if | you specify two #[code +] and #[code *] patterns in a row and their | matches overlap, the first operator will behave non-greedily. This quirk From 4515e96e9016799b6b0ccbc14cf67814e5f417fc Mon Sep 17 00:00:00 2001 From: James Messinger Date: Fri, 25 May 2018 06:08:45 -0500 Subject: [PATCH 6/9] Better formatting for `spacy train` CLI (#2357) * Better formatting for `spacy train` CLI Changed to use fixed-spaces rather than tabs to align table headers and data. ### Before: ``` Itn. P.Loss N.Loss UAS NER P. NER R. NER F. Tag % Token % 0 4618.857 2910.004 76.172 79.645 67.987 88.732 88.261 100.000 4436.9 6376.4 1 4671.972 3764.812 74.481 78.046 62.374 82.680 88.377 100.000 4672.2 6227.1 2 4742.756 3673.473 71.994 77.380 63.966 84.494 90.620 100.000 4298.0 5983.9 ``` ### After: ``` Itn. Dep Loss NER Loss UAS NER P. NER R. NER F. 
Tag % Token % CPU WPS GPU WPS 0 4618.857 2910.004 76.172 79.645 67.987 88.732 88.261 100.000 4436.9 6376.4 1 4671.972 3764.812 74.481 78.046 62.374 82.680 88.377 100.000 4672.2 6227.1 2 4742.756 3673.473 71.994 77.380 63.966 84.494 90.620 100.000 4298.0 5983.9 ``` * Added contributor file --- .github/contributors/BigstickCarpet.md | 106 +++++++++++++++++++++++++ spacy/cli/train.py | 24 +++--- 2 files changed, 118 insertions(+), 12 deletions(-) create mode 100644 .github/contributors/BigstickCarpet.md diff --git a/.github/contributors/BigstickCarpet.md b/.github/contributors/BigstickCarpet.md new file mode 100644 index 000000000..07b356495 --- /dev/null +++ b/.github/contributors/BigstickCarpet.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. 
The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. 
With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. 
Please do NOT +mark both statements: + + * [ X] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | James Messinger | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | May 23, 2018 | +| GitHub username | BigstickCarpet | +| Website (optional) | | diff --git a/spacy/cli/train.py b/spacy/cli/train.py index 310b30de0..8dce873ad 100644 --- a/spacy/cli/train.py +++ b/spacy/cli/train.py @@ -116,7 +116,7 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0, optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu) nlp._optimizer = None - print("Itn.\tP.Loss\tN.Loss\tUAS\tNER P.\tNER R.\tNER F.\tTag %\tToken %") + print("Itn. Dep Loss NER Loss UAS NER P. NER R. NER F. 
Tag % Token % CPU WPS GPU WPS") try: train_docs = corpus.train_docs(nlp, projectivize=True, noise_level=0.0, gold_preproc=gold_preproc, max_length=0) @@ -207,17 +207,17 @@ def print_progress(itn, losses, dev_scores, cpu_wps=0.0, gpu_wps=0.0): scores.update(dev_scores) scores['cpu_wps'] = cpu_wps scores['gpu_wps'] = gpu_wps or 0.0 - tpl = '\t'.join(( - '{:d}', - '{dep_loss:.3f}', - '{ner_loss:.3f}', - '{uas:.3f}', - '{ents_p:.3f}', - '{ents_r:.3f}', - '{ents_f:.3f}', - '{tags_acc:.3f}', - '{token_acc:.3f}', - '{cpu_wps:.1f}', + tpl = ''.join(( + '{:<6d}', + '{dep_loss:<10.3f}', + '{ner_loss:<10.3f}', + '{uas:<8.3f}', + '{ents_p:<8.3f}', + '{ents_r:<8.3f}', + '{ents_f:<8.3f}', + '{tags_acc:<8.3f}', + '{token_acc:<9.3f}', + '{cpu_wps:<9.1f}', '{gpu_wps:.1f}', )) print(tpl.format(itn, **scores)) From fb923b31ea3a4b815e3ef3d28dad126d375aa9b3 Mon Sep 17 00:00:00 2001 From: ines Date: Sat, 26 May 2018 17:57:02 +0200 Subject: [PATCH 7/9] Fix bad HTML example (see #2376) and turn it into section on matcher + components Avoid problems caused by merging while matching (e.g. index errors). Creating a Matcher component also better reflects the recommended best practices. --- .../_rule-based-matching.jade | 100 +++++++++++------- 1 file changed, 64 insertions(+), 36 deletions(-) diff --git a/website/usage/_linguistic-features/_rule-based-matching.jade b/website/usage/_linguistic-features/_rule-based-matching.jade index c0d418d46..094f15b90 100644 --- a/website/usage/_linguistic-features/_rule-based-matching.jade +++ b/website/usage/_linguistic-features/_rule-based-matching.jade @@ -260,41 +260,6 @@ p doc = nlp(u"This is a text about Google I/O 2015.") matches = matcher(doc) -p - | In addition to mentions of "Google I/O", your data also contains some - | annoying pre-processing artefacts, like leftover HTML line breaks - | (e.g. #[code <br>] or #[code <BR/>]). 
While you're at it, - | you want to merge those into one token and flag them, to make sure you - | can easily ignore them later. So you add a second pattern and pass in a - | function #[code merge_and_flag]: - -+code-exec. - import spacy - from spacy.matcher import Matcher - from spacy.tokens import Token - - nlp = spacy.load('en_core_web_sm') - matcher = Matcher(nlp.vocab) - # register a new token extension to flag bad HTML - Token.set_extension('bad_html', default=False) - - def merge_and_flag(matcher, doc, i, matches): - match_id, start, end = matches[i] - span = doc[start : end] - span.merge(is_stop=True) # merge (and mark it as a stop word, just in case) - for token in span: - token._.bad_html = True # mark token as bad HTML - print(span.text) - - matcher.add('BAD_HTML', merge_and_flag, - [{'ORTH': '<'}, {'LOWER': 'br'}, {'ORTH': '>'}], - [{'ORTH': '<'}, {'LOWER': 'br/'}, {'ORTH': '>'}]) - - doc = nlp(u"Hello<br>world!") - matches = matcher(doc) - for token in doc: - print(token.text, token._.bad_html) - +aside("Tip: Visualizing matches") | When working with entities, you can use #[+api("top-level#displacy") displaCy] | to quickly generate a NER visualization from your updated #[code Doc], @@ -315,7 +280,7 @@ p | that was matched, and invoke it. +code. - doc = nlp(LOTS_OF_TEXT) + doc = nlp(YOUR_TEXT_HERE) matcher(doc) p @@ -348,6 +313,69 @@ p | A list of #[code (match_id, start, end)] tuples, describing the | matches. A match tuple describes a span #[code doc[start:end]]. ++h(3, "matcher-pipeline") Using custom pipeline components + +p + | Let's say your data also contains some annoying pre-processing artefacts, + | like leftover HTML line breaks (e.g. #[code <br>] or + | #[code <BR/>]). To make your text easier to analyse, you want to + | merge those into one token and flag them, to make sure you + | can ignore them later. Ideally, this should all be done automatically + | as you process the text. 
You can achieve this by adding a + | #[+a("/usage/processing-pipelines#custom-components") custom pipeline component] + | that's called on each #[code Doc] object, merges the leftover HTML spans + | and sets an attribute #[code bad_html] on the token. + ++code-exec. + import spacy + from spacy.matcher import Matcher + from spacy.tokens import Token + + # we're using a class because the component needs to be initialised with + # the shared vocab via the nlp object + class BadHTMLMerger(object): + def __init__(self, nlp): + # register a new token extension to flag bad HTML + Token.set_extension('bad_html', default=False) + self.matcher = Matcher(nlp.vocab) + self.matcher.add('BAD_HTML', None, + [{'ORTH': '<'}, {'LOWER': 'br'}, {'ORTH': '>'}], + [{'ORTH': '<'}, {'LOWER': 'br/'}, {'ORTH': '>'}]) + + def __call__(self, doc): + # this method is invoked when the component is called on a Doc + matches = self.matcher(doc) + spans = [] # collect the matched spans here + for match_id, start, end in matches: + spans.append(doc[start:end]) + for span in spans: + span.merge(is_stop=True) # merge (and mark it as a stop word) + for token in span: + token._.bad_html = True # mark token as bad HTML + return doc + + nlp = spacy.load('en_core_web_sm') + html_merger = BadHTMLMerger(nlp) + nlp.add_pipe(html_merger, last=True) # add component to the pipeline + doc = nlp(u"Hello<br>world! <br/> This is a test.") + for token in doc: + print(token.text, token._.bad_html) + +p + | Instead of hard-coding the patterns into the component, you could also + | make it take a path to a JSON file containing the patterns. This lets + | you reuse the component with different patterns, depending on your + | application: + ++code. + html_merger = BadHTMLMerger(nlp, path='/path/to/patterns.json') + ++infobox + | For more details and examples of how to + | #[strong create custom pipeline components] and + | #[strong extension attributes], see the + | #[+a("/usage/processing-pipelines") usage guide]. 
+ +h(3, "regex") Using regular expressions p From 5f988b8e9c596a941de2bfe4e64867e33e3a00aa Mon Sep 17 00:00:00 2001 From: Samuel Pouyt Date: Sat, 26 May 2018 18:17:12 +0200 Subject: [PATCH 8/9] Update _custom.jade (#2372) It seems based on the doc and trying out that the `en` or `[lang]` is missing from the `spacy model-init` --- website/usage/_vectors-similarity/_custom.jade | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/usage/_vectors-similarity/_custom.jade b/website/usage/_vectors-similarity/_custom.jade index fef22ae71..f5ad402a3 100644 --- a/website/usage/_vectors-similarity/_custom.jade +++ b/website/usage/_vectors-similarity/_custom.jade @@ -52,7 +52,7 @@ p +code(false, "bash"). wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz - python -m spacy init-model /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz + python -m spacy init-model en /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz p | This will output a spaCy model in the directory From d85494bfae583b47a5a88f09aa6933f95891497a Mon Sep 17 00:00:00 2001 From: Samuel Pouyt Date: Sat, 26 May 2018 18:19:08 +0200 Subject: [PATCH 9/9] Added agrement (#2374) --- .github/contributors/idealley.md | 106 +++++++++++++++++++++++++++++++ 1 file changed, 106 insertions(+) create mode 100644 .github/contributors/idealley.md diff --git a/.github/contributors/idealley.md b/.github/contributors/idealley.md new file mode 100644 index 000000000..9aa7d4a1b --- /dev/null +++ b/.github/contributors/idealley.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. 
The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. 
This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. 
You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [x] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Pouyt Samuel | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 26.05.2018 | +| GitHub username | Idealley | +| Website (optional) | |