mirror of
https://github.com/explosion/spaCy.git
synced 2024-11-10 19:57:17 +03:00
adding more words and rephrasing (#2351)
* adding more words and rephrasing
* adding a contributor
* tokenizer bugs solved
This commit is contained in:
parent
ec62cadf4c
commit
432ede04af
106
.github/contributors/aristorinjuang.md
vendored
Normal file
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [x] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                      |
+|------------------------------- | -------------------------- |
+| Name                           | Aristo Rinjuang            |
+| Company name (if applicable)   |                            |
+| Title or role (if applicable)  |                            |
+| Date                           | May 22, 2018               |
+| GitHub username                | aristorinjuang             |
+| Website (optional)             | https://aristorinjuang.com |
@@ -4,19 +4,10 @@ from __future__ import unicode_literals
 from ...attrs import LIKE_NUM
 
 
-_num_words = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
-              'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
-              'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty',
-              'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety',
-              'hundred', 'thousand', 'million', 'billion', 'trillion', 'quadrillion',
-              'gajillion', 'bazillion',
-              'nol', 'satu', 'dua', 'tiga', 'empat', 'lima', 'enam', 'tujuh',
-              'delapan', 'sembilan', 'sepuluh', 'sebelas', 'duabelas', 'tigabelas',
-              'empatbelas', 'limabelas', 'enambelas', 'tujuhbelas', 'delapanbelas',
-              'sembilanbelas', 'duapuluh', 'seratus', 'seribu', 'sejuta',
-              'ribu', 'rb', 'juta', 'jt', 'miliar', 'biliun', 'triliun',
-              'kuadriliun', 'kuintiliun', 'sekstiliun', 'septiliun', 'oktiliun',
-              'noniliun', 'desiliun']
+_num_words = ['nol', 'satu', 'dua', 'tiga', 'empat', 'lima', 'enam', 'tujuh',
+              'delapan', 'sembilan', 'sepuluh', 'sebelas', 'belas', 'puluh',
+              'ratus', 'ribu', 'juta', 'miliar', 'biliun', 'triliun', 'kuadriliun',
+              'kuintiliun', 'sekstiliun', 'septiliun', 'oktiliun', 'noniliun', 'desiliun']
 
 
 def like_num(text):
@@ -1,14 +1,7 @@
 # coding: utf8
 from __future__ import unicode_literals
 
-_exc = {
-    "Rp": "$",
-    "IDR": "$",
-    "RMB": "$",
-    "USD": "$",
-    "AUD": "$",
-    "GBP": "$",
-}
+_exc = {}
 
 NORM_EXCEPTIONS = {}
 
@@ -5,7 +5,7 @@ import regex as re
 
 from ._tokenizer_exceptions_list import ID_BASE_EXCEPTIONS
 from ..tokenizer_exceptions import URL_PATTERN
-from ...symbols import ORTH
+from ...symbols import ORTH, LEMMA, NORM
 
 
 _exc = {}
@@ -29,17 +29,58 @@ for orth in ID_BASE_EXCEPTIONS:
     orth_caps = '-'.join([part.upper() for part in orth.split('-')])
     _exc[orth_caps] = [{ORTH: orth_caps}]
 
+for exc_data in [
+    {ORTH: "CKG", LEMMA: "Cakung", NORM: "Cakung"},
+    {ORTH: "CGP", LEMMA: "Grogol Petamburan", NORM: "Grogol Petamburan"},
+    {ORTH: "KSU", LEMMA: "Kepulauan Seribu Utara", NORM: "Kepulauan Seribu Utara"},
+    {ORTH: "KYB", LEMMA: "Kebayoran Baru", NORM: "Kebayoran Baru"},
+    {ORTH: "TJP", LEMMA: "Tanjungpriok", NORM: "Tanjungpriok"},
+    {ORTH: "TNA", LEMMA: "Tanah Abang", NORM: "Tanah Abang"},
+
+    {ORTH: "BEK", LEMMA: "Bengkayang", NORM: "Bengkayang"},
+    {ORTH: "KTP", LEMMA: "Ketapang", NORM: "Ketapang"},
+    {ORTH: "MPW", LEMMA: "Mempawah", NORM: "Mempawah"},
+    {ORTH: "NGP", LEMMA: "Nanga Pinoh", NORM: "Nanga Pinoh"},
+    {ORTH: "NBA", LEMMA: "Ngabang", NORM: "Ngabang"},
+    {ORTH: "PTK", LEMMA: "Pontianak", NORM: "Pontianak"},
+    {ORTH: "PTS", LEMMA: "Putussibau", NORM: "Putussibau"},
+    {ORTH: "SBS", LEMMA: "Sambas", NORM: "Sambas"},
+    {ORTH: "SAG", LEMMA: "Sanggau", NORM: "Sanggau"},
+    {ORTH: "SED", LEMMA: "Sekadau", NORM: "Sekadau"},
+    {ORTH: "SKW", LEMMA: "Singkawang", NORM: "Singkawang"},
+    {ORTH: "STG", LEMMA: "Sintang", NORM: "Sintang"},
+    {ORTH: "SKD", LEMMA: "Sukadane", NORM: "Sukadane"},
+    {ORTH: "SRY", LEMMA: "Sungai Raya", NORM: "Sungai Raya"},
+
+    {ORTH: "Jan.", LEMMA: "Januari", NORM: "Januari"},
+    {ORTH: "Feb.", LEMMA: "Februari", NORM: "Februari"},
+    {ORTH: "Mar.", LEMMA: "Maret", NORM: "Maret"},
+    {ORTH: "Apr.", LEMMA: "April", NORM: "April"},
+    {ORTH: "Jun.", LEMMA: "Juni", NORM: "Juni"},
+    {ORTH: "Jul.", LEMMA: "Juli", NORM: "Juli"},
+    {ORTH: "Agu.", LEMMA: "Agustus", NORM: "Agustus"},
+    {ORTH: "Ags.", LEMMA: "Agustus", NORM: "Agustus"},
+    {ORTH: "Sep.", LEMMA: "September", NORM: "September"},
+    {ORTH: "Okt.", LEMMA: "Oktober", NORM: "Oktober"},
+    {ORTH: "Nov.", LEMMA: "November", NORM: "November"},
+    {ORTH: "Des.", LEMMA: "Desember", NORM: "Desember"}]:
+    _exc[exc_data[ORTH]] = [exc_data]
+
 for orth in [
-    "'d", "a.m.", "Adm.", "Bros.", "co.", "Co.", "Corp.", "D.C.", "Dr.", "e.g.",
-    "E.g.", "E.G.", "Gen.", "Gov.", "i.e.", "I.e.", "I.E.", "Inc.", "Jr.",
-    "Ltd.", "Md.", "Messrs.", "Mo.", "Mont.", "Mr.", "Mrs.", "Ms.", "p.m.",
-    "Ph.D.", "Rep.", "Rev.", "Sen.", "St.", "vs.",
     "A.AB.", "A.Ma.", "A.Md.", "A.Md.Keb.", "A.Md.Kep.", "A.P.",
     "B.A.", "B.Ch.E.", "B.Sc.", "Dr.", "Dra.", "Drs.", "Hj.", "Ka.", "Kp.",
-    "M.Ag.", "M.Hum.", "M.Kes,", "M.Kom.", "M.M.", "M.P.", "M.Pd.", "M.Sc.",
-    "M.Si.", "M.Sn.", "M.T.", "M.Th.", "No.", "Pjs.", "Plt.", "R.A.", "S.Ag.",
-    "S.E.", "S.H.", "S.Hut.", "S.K.M.", "S.Kedg.", "S.Kedh.", "S.Kom.",
-    "S.Pd.", "S.Pol.", "S.Psi.", "S.S.", "S.Sos.", "S.T.", "S.Tekp.", "S.Th.",
+    "M.AB", "M.Ag.", "M.AP", "M.Arl", "M.A.R.S", "M.Hum.", "M.I.Kom.", "M.Kes.",
+    "M.Kom.", "M.M.", "M.P.", "M.Pd.", "M.Psi.", "M.Psi.T.", "M.Sc.", "M.SArl",
+    "M.Si.", "M.Sn.", "M.T.", "M.Th.", "No.", "Pjs.", "Plt.", "R.A.",
+    "S.AB", "S.AP", "S.Adm", "S.Ag.", "S.Agr", "S.Ant", "S.Arl", "S.Ars",
+    "S.A.R.S", "S.Ds", "S.E.", "S.E.I.", "S.Farm", "S.Gz.", "S.H.", "S.Han",
+    "S.H.Int", "S.Hum", "S.Hut.", "S.In.", "S.IK.", "S.I.Kom.", "S.I.P", "S.IP",
+    "S.P.", "S.Pt", "S.Psi", "S.Ptk", "S.Keb", "S.Ked", "S.Kep", "S.KG", "S.KH",
+    "S.Kel", "S.K.M.", "S.Kedg.", "S.Kedh.", "S.Kom.", "S.KPM", "S.Mb", "S.Mat",
+    "S.Par", "S.Pd.", "S.Pd.I.", "S.Pd.SD", "S.Pol.", "S.Psi.", "S.S.", "S.SArl.",
+    "S.Sn", "S.Si.", "S.Si.Teol.", "S.SI.", "S.ST.", "S.ST.Han", "S.STP", "S.Sos.",
+    "S.Sy.", "S.T.", "S.T.Han", "S.Th.", "S.Th.I", "S.TI.", "S.T.P.", "S.TrK",
+    "S.Tekp.", "S.Th.",
     "a.l.", "a.n.", "a.s.", "b.d.", "d.a.", "d.l.", "d/h", "dkk.", "dll.",
     "dr.", "drh.", "ds.", "dsb.", "dst.", "faks.", "fax.", "hlm.", "i/o",
     "n.b.", "p.p.", "pjs.", "s.d.", "tel.", "u.p.",