Merge branch 'develop' of https://github.com/explosion/spaCy into develop

Matthew Honnibal 2018-07-05 13:49:42 +02:00
commit ec41ceb383
60 changed files with 31980 additions and 317 deletions

.github/contributors/aliiae.md (new file, 106 lines)

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Aliia Erofeeva |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 13 June 2018 |
| GitHub username | aliiae |
| Website (optional) | |

.github/contributors/btrungchi.md (new file, 106 lines)

@ -0,0 +1,106 @@
# spaCy contributor agreement
## Contributor Agreement
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Bui Trung Chi |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-06-30 |
| GitHub username | btrungchi |
| Website (optional) | |

.github/contributors/coryhurst.md (new file, 106 lines)

@ -0,0 +1,106 @@
# spaCy contributor agreement
## Contributor Agreement
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -----------------------------|
| Name | Cory Hurst |
| Company name (if applicable) | Samtec Smart Platform Group |
| Title or role (if applicable) | SoftwareDeveloper |
| Date | 2017-11-13 |
| GitHub username | cjhurst |
| Website (optional) | https://blog.spg.ai/ |

.github/contributors/mirfan899.md (new file, 106 lines)

@ -0,0 +1,106 @@
# spaCy contributor agreement
## Contributor Agreement
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------ |
| Name | Muhammad Irfan |
| Company name (if applicable) | |
| Title or role (if applicable) | AI & ML Developer |
| Date | 2018-09-06 |
| GitHub username | mirfan899 |
| Website (optional) | |


@ -11,5 +11,5 @@ ujson>=1.35
 dill>=0.2,<0.3
 regex==2017.4.5
 requests>=2.13.0,<3.0.0
-pytest>=3.0.6,<4.0.0
+pytest>=3.6.0,<4.0.0
 mock>=2.0.0,<3.0.0


@ -222,6 +222,7 @@ def setup_package():
             'Programming Language :: Python :: 3.4',
             'Programming Language :: Python :: 3.5',
             'Programming Language :: Python :: 3.6',
+            'Programming Language :: Python :: 3.7',
             'Topic :: Scientific/Engineering'],
         cmdclass = {
             'build_ext': build_ext_subclass},


@ -20,5 +20,5 @@ def blank(name, **kwargs):
     return LangClass(**kwargs)
-def info(model=None, markdown=False):
-    return cli_info(model, markdown)
+def info(model=None, markdown=False, silent=False):
+    return cli_info(model, markdown, silent)


@ -2,7 +2,7 @@
 from __future__ import unicode_literals
 from .render import DependencyRenderer, EntityRenderer
-from ..tokens import Doc
+from ..tokens import Doc, Span
 from ..compat import b_to_str
 from ..errors import Errors, Warnings, user_warning
 from ..util import prints, is_in_jupyter
@ -29,8 +29,11 @@ def render(docs, style='dep', page=False, minify=False, jupyter=IS_JUPYTER,
                  'ent': (EntityRenderer, parse_ents)}
     if style not in factories:
         raise ValueError(Errors.E087.format(style=style))
-    if isinstance(docs, Doc) or isinstance(docs, dict):
+    if isinstance(docs, (Doc, Span, dict)):
         docs = [docs]
+    docs = [obj if not isinstance(obj, Span) else obj.as_doc() for obj in docs]
+    if not all(isinstance(obj, (Doc, Span, dict)) for obj in docs):
+        raise ValueError(Errors.E096)
     renderer, converter = factories[style]
     renderer = renderer(options=options)
     parsed = [converter(doc, options) for doc in docs] if not manual else docs
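The input handling in this hunk amounts to a small normalization step: wrap a single object in a list, convert each `Span` to a `Doc` via `as_doc()`, then reject anything that is not a supported type. A minimal sketch of that logic follows. It assumes hypothetical stand-in `Doc` and `Span` classes rather than the real spaCy types, and `normalize_docs` is an illustrative helper name that does not exist in spaCy.

```python
class Doc(object):
    """Hypothetical stand-in for spacy.tokens.Doc (illustration only)."""
    def __init__(self, words):
        self.words = list(words)


class Span(object):
    """Hypothetical stand-in for spacy.tokens.Span (illustration only)."""
    def __init__(self, doc, start, end):
        self.doc, self.start, self.end = doc, start, end

    def as_doc(self):
        # Copy the covered slice into a new, independent Doc.
        return Doc(self.doc.words[self.start:self.end])


def normalize_docs(docs):
    """Mirror the checks in the hunk above: wrap, convert, validate."""
    # A single Doc, Span or dict is wrapped in a list.
    if isinstance(docs, (Doc, Span, dict)):
        docs = [docs]
    # Every Span is converted to a Doc; other objects pass through.
    docs = [obj if not isinstance(obj, Span) else obj.as_doc() for obj in docs]
    # Anything else (for example a plain string) is rejected.
    if not all(isinstance(obj, (Doc, Span, dict)) for obj in docs):
        raise ValueError("Can only visualize Doc or Span objects, or dicts")
    return docs


doc = Doc(['Син', 'кайда', '?'])
normalized = normalize_docs(Span(doc, 0, 2))
print(type(normalized[0]).__name__, normalized[0].words)
```

Under these assumptions, passing a list of plain strings raises `ValueError` instead of failing later inside the renderer.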


@ -136,7 +136,7 @@ class DependencyRenderer(object):
         end (int): X-coordinate of arrow end point.
         RETURNS (unicode): Definition of the arrow head path ('d' attribute).
         """
-        if direction is 'left':
+        if direction == 'left':
             pos1, pos2, pos3 = (x, x-self.arrow_width+2, x+self.arrow_width-2)
         else:
             pos1, pos2, pos3 = (end, end+self.arrow_width-2,
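The switch from `is` to `==` matters because `is` tests object identity while `==` tests value equality. Two equal strings are not guaranteed to be the same object. A minimal sketch in plain Python follows; the variable names are illustrative, and the identity result depends on the interpreter (in CPython a string built at runtime like this is typically a distinct object).

```python
LEFT = 'left'

# Build an equal string at runtime so it is a separate object from LEFT.
built = ''.join(['le', 'ft'])

same_value = (built == LEFT)    # value comparison: what the renderer needs
same_object = (built is LEFT)   # identity comparison: depends on interning

print('equal by value:', same_value)
print('same object:', same_object)
```

A direction string that arrives from user options or is assembled at runtime could therefore fail an `is 'left'` check even though its value is correct.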


@ -257,6 +257,8 @@ class Errors(object):
     E094 = ("Error reading line {line_num} in vectors file {loc}.")
     E095 = ("Can't write to frozen dictionary. This is likely an internal "
             "error. Are you writing to a default function argument?")
+    E096 = ("Invalid object passed to displaCy: Can only visualize Doc or "
+            "Span objects, or dicts if set to manual=True.")
 @add_codes


@ -16,9 +16,11 @@ _latin = r'[[\p{Ll}||\p{Lu}]&&\p{Latin}]'
 _persian = r'[\p{L}&&\p{Arabic}]'
 _russian_lower = r'[ёа-я]'
 _russian_upper = r'[ЁА-Я]'
+_tatar_lower = r'[әөүҗңһ]'
+_tatar_upper = r'[ӘӨҮҖҢҺ]'
-_upper = [_latin_upper, _russian_upper]
-_lower = [_latin_lower, _russian_lower]
+_upper = [_latin_upper, _russian_upper, _tatar_upper]
+_lower = [_latin_lower, _russian_lower, _tatar_lower]
 _uncased = [_bengali, _hebrew, _persian]
 ALPHA = merge_char_classes(_upper + _lower + _uncased)
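The Tatar letters need their own class because the Russian range `а-я` does not cover letters such as `ө` or `ә`, so a Tatar word is not matched as purely alphabetic until the extra class is merged in. A minimal sketch using only the standard `re` module follows. The names `russian_only` and `with_tatar` are illustrative, and the alternation used here is a simplification of whatever `merge_char_classes` does in the actual file.

```python
import re

_russian_lower = r'[ёа-я]'
_tatar_lower = r'[әөүҗңһ]'

# Pattern built from the Russian class alone.
russian_only = re.compile(r'(?:%s)+' % _russian_lower)
# Pattern that also accepts the Tatar-specific letters.
with_tatar = re.compile(r'(?:%s|%s)+' % (_russian_lower, _tatar_lower))

word = 'өчен'  # contains 'ө', which lies outside the range а-я

print(russian_only.fullmatch(word) is not None)
print(with_tatar.fullmatch(word) is not None)
```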


@ -60,9 +60,8 @@ def detailed_tokens(tokenizer, text):
         parts = node.feature.split(',')
         pos = ','.join(parts[0:4])
-        if len(parts) > 6:
+        if len(parts) > 7:
             # this information is only available for words in the tokenizer dictionary
-            reading = parts[6]
             base = parts[7]
         words.append( ShortUnitWord(surface, base, pos) )

spacy/lang/tt/__init__.py (new file, 31 lines)

@ -0,0 +1,31 @@
# coding: utf8
from __future__ import unicode_literals
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_INFIXES
from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...attrs import LANG
from ...language import Language
from ...util import update_exc
class TatarDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: 'tt'
    lex_attr_getters.update(LEX_ATTRS)
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    infixes = tuple(TOKENIZER_INFIXES)
    stop_words = STOP_WORDS
class Tatar(Language):
    lang = 'tt'
    Defaults = TatarDefaults
__all__ = ['Tatar']

spacy/lang/tt/examples.py (new file, 19 lines)

@ -0,0 +1,19 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.tt.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Apple Бөекбритания стартабын $1 миллиард өчен сатып алыун исәпли.",
"Автоном автомобильләр иминият җаваплылыкны җитештерүчеләргә күчерә.",
"Сан-Франциско тротуар буенча йөри торган робот-курьерларны тыю мөмкинлеген карый.",
"Лондон - Бөекбританиядә урнашкан зур шәһәр.",
"Син кайда?",
"Францияда кем президент?",
"Америка Кушма Штатларының башкаласы нинди шәһәр?",
"Барак Обама кайчан туган?"
]


@ -0,0 +1,29 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
_num_words = ['нуль', 'ноль', 'бер', 'ике', 'өч', 'дүрт', 'биш', 'алты', 'җиде',
              'сигез', 'тугыз', 'ун', 'унбер', 'унике', 'унөч', 'ундүрт',
              'унбиш', 'уналты', 'унҗиде', 'унсигез', 'унтугыз', 'егерме',
              'утыз', 'кырык', 'илле', 'алтмыш', 'җитмеш', 'сиксән', 'туксан',
              'йөз', 'мең', 'төмән', 'миллион', 'миллиард', 'триллион',
              'триллиард']
def like_num(text):
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    if text in _num_words:
        return True
    return False
LEX_ATTRS = {
    LIKE_NUM: like_num
}
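A short usage sketch of the `like_num` logic above. The function is re-declared here with a shortened, illustrative word list so that the snippet runs on its own without spaCy; the sample tokens are assumptions chosen for the example.

```python
# Shortened word list for illustration; the file above defines the full one.
_num_words = ['бер', 'ике', 'өч', 'биш', 'йөз', 'мең']


def like_num(text):
    # Strip thousands separators and decimal points before the digit check.
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    # Accept simple fractions such as 3/4.
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    # Fall back to the list of number words.
    return text in _num_words


for token in ['1,000', '3/4', 'биш', 'Биш', 'китап']:
    print(token, like_num(token))
```

The word lookup is case-sensitive, so a capitalised number word such as `Биш` is not recognised by this logic.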


@ -0,0 +1,19 @@
# coding: utf8
from __future__ import unicode_literals
from ..char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, QUOTES, HYPHENS
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
_hyphens_no_dash = HYPHENS.replace('-', '').strip('|').replace('||', '')
_infixes = (LIST_ELLIPSES + LIST_ICONS +
[r'(?<=[{}])\.(?=[{}])'.format(ALPHA_LOWER, ALPHA_UPPER),
r'(?<=[{a}])[,!?/\(\)]+(?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}{q}])[:<>=](?=[{a}])'.format(a=ALPHA, q=QUOTES),
r'(?<=[{a}])--(?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}])([{q}\)\]\(\[])(?=[\-{a}])'.format(a=ALPHA, q=QUOTES),
r'(?<=[{a}])[?";:=,.]*(?:{h})(?=[{a}])'.format(a=ALPHA,
h=_hyphens_no_dash),
r'(?<=[0-9])-(?=[0-9])'])
TOKENIZER_INFIXES = _infixes
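Each infix pattern wraps the separator in a lookbehind and a lookahead, so only the separator itself is matched, and only when the right kinds of characters surround it. A minimal sketch with the last pattern in the list follows, using the standard `re` module. `split_on_infix` is a hypothetical helper written for this illustration and is not how the spaCy tokenizer applies its infixes internally.

```python
import re

# Last pattern from the list above: a hyphen between two digits.
NUMERIC_HYPHEN = re.compile(r'(?<=[0-9])-(?=[0-9])')


def split_on_infix(text):
    """Split text around every infix match, keeping the infix as a piece."""
    pieces = []
    start = 0
    for match in NUMERIC_HYPHEN.finditer(text):
        if match.start() > start:
            pieces.append(text[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(text):
        pieces.append(text[start:])
    return pieces


print(split_on_infix('10-20'))
print(split_on_infix('Сан-Франциско'))
```

Because the lookarounds require digits on both sides, the hyphen in a word like `Сан-Франциско` is left alone by this particular pattern.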

spacy/lang/tt/stop_words.py (new file, 174 lines)

@ -0,0 +1,174 @@
# encoding: utf8
from __future__ import unicode_literals
# Tatar stopwords are from https://github.com/aliiae/stopwords-tt
STOP_WORDS = set("""алай алайса алар аларга аларда алардан аларны аларның аларча
алары аларын аларынга аларында аларыннан аларының алтмыш алтмышынчы алтмышынчыга
алтмышынчыда алтмышынчыдан алтмышынчылар алтмышынчыларга алтмышынчыларда
алтмышынчылардан алтмышынчыларны алтмышынчыларның алтмышынчыны алтмышынчының
алты алтылап алтынчы алтынчыга алтынчыда алтынчыдан алтынчылар алтынчыларга
алтынчыларда алтынчылардан алтынчыларны алтынчыларның алтынчыны алтынчының
алтышар анда андагы андай андый андыйга андыйда андыйдан андыйны андыйның аннан
ансы анча аны аныкы аныкын аныкынга аныкында аныкыннан аныкының анысы анысын
анысынга анысында анысыннан анысының аның аныңча аркылы ары аша аңа аңар аңарга
аңарда аңардагы аңардан
бар бара барлык барча барчасы барчасын барчасына барчасында барчасыннан
барчасының бары башка башкача бе­лән без безгә бездә бездән безне безнең безнеңчә
белдерүенчә белән бер бергә беренче беренчегә беренчедә беренчедән беренчеләр
беренчеләргә беренчеләрдә беренчеләрдән беренчеләрне беренчеләрнең беренчене
беренченең беркайда беркайсы беркая беркаян беркем беркемгә беркемдә беркемне
беркемнең беркемнән берлән берни бернигә бернидә бернидән бернинди бернине
бернинең берничек берничә бернәрсә бернәрсәгә бернәрсәдә бернәрсәдән бернәрсәне
бернәрсәнең беррәттән берсе берсен берсенгә берсендә берсенең берсеннән берәр
берәрсе берәрсен берәрсендә берәрсенең берәрсеннән берәрсенә берәү бигрәк бик
бирле бит биш бишенче бишенчегә бишенчедә бишенчедән бишенчеләр бишенчеләргә
бишенчеләрдә бишенчеләрдән бишенчеләрне бишенчеләрнең бишенчене бишенченең
бишләп болай болар боларга боларда болардан боларны боларның болары боларын
боларынга боларында боларыннан боларының бу буе буена буенда буенча буйлап
буларак булачак булды булмый булса булып булыр булырга бусы бүтән бәлки бән
бәрабәренә бөтен бөтенесе бөтенесен бөтенесендә бөтенесенең бөтенесеннән
бөтенесенә
вә
гел генә гына гүя гүяки гәрчә
да ди дигән диде дип дистәләгән дистәләрчә дүрт дүртенче дүртенчегә дүртенчедә
дүртенчедән дүртенчеләр дүртенчеләргә дүртенчеләрдә дүртенчеләрдән дүртенчеләрне
дүртенчеләрнең дүртенчене дүртенченең дүртләп дә
егерме егерменче егерменчегә егерменчедә егерменчедән егерменчеләр
егерменчеләргә егерменчеләрдә егерменчеләрдән егерменчеләрне егерменчеләрнең
егерменчене егерменченең ел елда
иде идек идем ике икенче икенчегә икенчедә икенчедән икенчеләр икенчеләргә
икенчеләрдә икенчеләрдән икенчеләрне икенчеләрнең икенчене икенченең икешәр икән
илле илленче илленчегә илленчедә илленчедән илленчеләр илленчеләргә
илленчеләрдә илленчеләрдән илленчеләрне илленчеләрнең илленчене илленченең илә
илән инде исә итеп иткән итте итү итә итәргә иң
йөз йөзенче йөзенчегә йөзенчедә йөзенчедән йөзенчеләр йөзенчеләргә йөзенчеләрдә
йөзенчеләрдән йөзенчеләрне йөзенчеләрнең йөзенчене йөзенченең йөзләгән йөзләрчә
йөзәрләгән
кадәр кай кайбер кайберләре кайберсе кайберәү кайберәүгә кайберәүдә кайберәүдән
кайберәүне кайберәүнең кайдагы кайсы кайсыбер кайсын кайсына кайсында кайсыннан
кайсының кайчангы кайчандагы кайчаннан караганда карамастан карамый карата каршы
каршына каршында каршындагы кебек кем кемгә кемдә кемне кемнең кемнән кенә ки
килеп килә кирәк кына кырыгынчы кырыгынчыга кырыгынчыда кырыгынчыдан
кырыгынчылар кырыгынчыларга кырыгынчыларда кырыгынчылардан кырыгынчыларны
кырыгынчыларның кырыгынчыны кырыгынчының кырык күк күпләгән күпме күпмеләп
күпмешәр күпмешәрләп күптән күрә
ләкин
максатында менә мең меңенче меңенчегә меңенчедә меңенчедән меңенчеләр
меңенчеләргә меңенчеләрдә меңенчеләрдән меңенчеләрне меңенчеләрнең меңенчене
меңенченең меңләгән меңләп меңнәрчә меңәрләгән меңәрләп миллиард миллиардлаган
миллиардларча миллион миллионлаган миллионнарча миллионынчы миллионынчыга
миллионынчыда миллионынчыдан миллионынчылар миллионынчыларга миллионынчыларда
миллионынчылардан миллионынчыларны миллионынчыларның миллионынчыны
миллионынчының мин миндә мине минем минемчә миннән миңа монда мондагы мондые
мондыен мондыенгә мондыендә мондыеннән мондыеның мондый мондыйга мондыйда
мондыйдан мондыйлар мондыйларга мондыйларда мондыйлардан мондыйларны
мондыйларның мондыйлары мондыйларын мондыйларынга мондыйларында мондыйларыннан
мондыйларының мондыйны мондыйның моннан монсыз монча моны моныкы моныкын
моныкынга моныкында моныкыннан моныкының монысы монысын монысынга монысында
монысыннан монысының моның моңа моңар моңарга мәгълүматынча мәгәр мән мөмкин
ни нибарысы никадәре нинди ниндие ниндиен ниндиенгә ниндиендә ниндиенең
ниндиеннән ниндиләр ниндиләргә ниндиләрдә ниндиләрдән ниндиләрен ниндиләренн
ниндиләреннгә ниндиләренндә ниндиләреннең ниндиләренннән ниндиләрне ниндиләрнең
ниндирәк нихәтле ничаклы ничек ничәшәр ничәшәрләп нуль нче нчы нәрсә нәрсәгә
нәрсәдә нәрсәдән нәрсәне нәрсәнең
саен сез сезгә сездә сездән сезне сезнең сезнеңчә сигез сигезенче сигезенчегә
сигезенчедә сигезенчедән сигезенчеләр сигезенчеләргә сигезенчеләрдә
сигезенчеләрдән сигезенчеләрне сигезенчеләрнең сигезенчене сигезенченең
сиксән син синдә сине синең синеңчә синнән сиңа соң сыман сүзенчә сүзләренчә
та таба теге тегеләй тегеләр тегеләргә тегеләрдә тегеләрдән тегеләре тегеләрен
тегеләренгә тегеләрендә тегеләренең тегеләреннән тегеләрне тегеләрнең тегенди
тегендигә тегендидә тегендидән тегендине тегендинең тегендә тегендәге тегене
тегенеке тегенекен тегенекенгә тегенекендә тегенекенең тегенекеннән тегенең
тегеннән тегесе тегесен тегесенгә тегесендә тегесенең тегесеннән тегеңә тиеш тик
тикле тора триллиард триллион тугыз тугызлап тугызлашып тугызынчы тугызынчыга
тугызынчыда тугызынчыдан тугызынчылар тугызынчыларга тугызынчыларда
тугызынчылардан тугызынчыларны тугызынчыларның тугызынчыны тугызынчының туксан
туксанынчы туксанынчыга туксанынчыда туксанынчыдан туксанынчылар туксанынчыларга
туксанынчыларда туксанынчылардан туксанынчыларны туксанынчыларның туксанынчыны
туксанынчының турында тыш түгел тә тәгаенләнгән төмән
уенча уйлавынча ук ул ун уналты уналтынчы уналтынчыга уналтынчыда уналтынчыдан
уналтынчылар уналтынчыларга уналтынчыларда уналтынчылардан уналтынчыларны
уналтынчыларның уналтынчыны уналтынчының унарлаган унарлап унаула унаулап унбер
унберенче унберенчегә унберенчедә унберенчедән унберенчеләр унберенчеләргә
унберенчеләрдә унберенчеләрдән унберенчеләрне унберенчеләрнең унберенчене
унберенченең унбиш унбишенче унбишенчегә унбишенчедә унбишенчедән унбишенчеләр
унбишенчеләргә унбишенчеләрдә унбишенчеләрдән унбишенчеләрне унбишенчеләрнең
унбишенчене унбишенченең ундүрт ундүртенче ундүртенчегә ундүртенчедә
ундүртенчедән ундүртенчеләр ундүртенчеләргә ундүртенчеләрдә ундүртенчеләрдән
ундүртенчеләрне ундүртенчеләрнең ундүртенчене ундүртенченең унике уникенче
уникенчегә уникенчедә уникенчедән уникенчеләр уникенчеләргә уникенчеләрдә
уникенчеләрдән уникенчеләрне уникенчеләрнең уникенчене уникенченең унлаган
унлап уннарча унсигез унсигезенче унсигезенчегә унсигезенчедә унсигезенчедән
унсигезенчеләр унсигезенчеләргә унсигезенчеләрдә унсигезенчеләрдән
унсигезенчеләрне унсигезенчеләрнең унсигезенчене унсигезенченең унтугыз
унтугызынчы унтугызынчыга унтугызынчыда унтугызынчыдан унтугызынчылар
унтугызынчыларга унтугызынчыларда унтугызынчылардан унтугызынчыларны
унтугызынчыларның унтугызынчыны унтугызынчының унынчы унынчыга унынчыда
унынчыдан унынчылар унынчыларга унынчыларда унынчылардан унынчыларны
унынчыларның унынчыны унынчының унҗиде унҗиденче унҗиденчегә унҗиденчедә
унҗиденчедән унҗиденчеләр унҗиденчеләргә унҗиденчеләрдә унҗиденчеләрдән
унҗиденчеләрне унҗиденчеләрнең унҗиденчене унҗиденченең унөч унөченче унөченчегә
унөченчедә унөченчедән унөченчеләр унөченчеләргә унөченчеләрдә унөченчеләрдән
унөченчеләрне унөченчеләрнең унөченчене унөченченең утыз утызынчы утызынчыга
утызынчыда утызынчыдан утызынчылар утызынчыларга утызынчыларда утызынчылардан
утызынчыларны утызынчыларның утызынчыны утызынчының
фикеренчә фәкать
хакында хәбәр хәлбуки хәтле хәтта
чаклы чакта чөнки
шикелле шул шулай шулар шуларга шуларда шулардан шуларны шуларның шулары шуларын
шуларынга шуларында шуларыннан шуларының шулкадәр шултикле шултиклем шулхәтле
шулчаклы шунда шундагы шундый шундыйга шундыйда шундыйдан шундыйны шундыйның
шунлыктан шуннан шунсы шунча шуны шуныкы шуныкын шуныкынга шуныкында шуныкыннан
шуныкының шунысы шунысын шунысынга шунысында шунысыннан шунысының шуның шушы
шушында шушыннан шушыны шушының шушыңа шуңа шуңар шуңарга
элек
югыйсә юк юкса
я ягъни язуынча яисә яки яктан якын ярашлы яхут яшь яшьлек
җиде җиделәп җиденче җиденчегә җиденчедә җиденчедән җиденчеләр җиденчеләргә
җиденчеләрдә җиденчеләрдән җиденчеләрне җиденчеләрнең җиденчене җиденченең
җидешәр җитмеш җитмешенче җитмешенчегә җитмешенчедә җитмешенчедән җитмешенчеләр
җитмешенчеләргә җитмешенчеләрдә җитмешенчеләрдән җитмешенчеләрне
җитмешенчеләрнең җитмешенчене җитмешенченең җыенысы
үз үзе үзем үземдә үземне үземнең үземнән үземә үзен үзендә үзенең үзеннән үзенә
үк
һичбер һичбере һичберен һичберендә һичберенең һичбереннән һичберенә һичберсе
һичберсен һичберсендә һичберсенең һичберсеннән һичберсенә һичберәү һичберәүгә
һичберәүдә һичберәүдән һичберәүне һичберәүнең һичкайсы һичкайсыга һичкайсыда
һичкайсыдан һичкайсыны һичкайсының һичкем һичкемгә һичкемдә һичкемне һичкемнең
һичкемнән һични һичнигә һичнидә һичнидән һичнинди һичнине һичнинең һичнәрсә
һичнәрсәгә һичнәрсәдә һичнәрсәдән һичнәрсәне һичнәрсәнең һәм һәммә һәммәсе
һәммәсен һәммәсендә һәммәсенең һәммәсеннән һәммәсенә һәр һәрбер һәрбере һәрберсе
һәркайсы һәркайсыга һәркайсыда һәркайсыдан һәркайсыны һәркайсының һәркем
һәркемгә һәркемдә һәркемне һәркемнең һәркемнән һәрни һәрнәрсә һәрнәрсәгә
һәрнәрсәдә һәрнәрсәдән һәрнәрсәне һәрнәрсәнең һәртөрле
ә әгәр әйтүенчә әйтүләренчә әлбәттә әле әлеге әллә әмма әнә
өстәп өч өчен өченче өченчегә өченчедә өченчедән өченчеләр өченчеләргә
өченчеләрдә өченчеләрдән өченчеләрне өченчеләрнең өченчене өченченең өчләп
өчәрләп""".split())


@ -0,0 +1,52 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import ORTH, LEMMA, NORM
_exc = {}
_abbrev_exc = [
# Weekdays abbreviations
{ORTH: "дш", LEMMA: "дүшәмбе"},
{ORTH: "сш", LEMMA: "сишәмбе"},
{ORTH: "чш", LEMMA: "чәршәмбе"},
{ORTH: "пш", LEMMA: "пәнҗешәмбе"},
{ORTH: "җм", LEMMA: "җомга"},
{ORTH: "шб", LEMMA: "шимбә"},
{ORTH: "яш", LEMMA: "якшәмбе"},
# Months abbreviations
{ORTH: "гый", LEMMA: "гыйнвар"},
{ORTH: "фев", LEMMA: "февраль"},
{ORTH: "мар", LEMMA: "март"},
{ORTH: "апр", LEMMA: "апрель"},
{ORTH: "июн", LEMMA: "июнь"},
{ORTH: "июл", LEMMA: "июль"},
{ORTH: "авг", LEMMA: "август"},
{ORTH: "сен", LEMMA: "сентябрь"},
{ORTH: "окт", LEMMA: "октябрь"},
{ORTH: "ноя", LEMMA: "ноябрь"},
{ORTH: "дек", LEMMA: "декабрь"},
# Number abbreviations
{ORTH: "млрд", LEMMA: "миллиард"},
{ORTH: "млн", LEMMA: "миллион"},
]
for abbr in _abbrev_exc:
for orth in (abbr[ORTH], abbr[ORTH].capitalize(), abbr[ORTH].upper()):
_exc[orth] = [{ORTH: orth, LEMMA: abbr[LEMMA], NORM: abbr[LEMMA]}]
_exc[orth + "."] = [
{ORTH: orth + ".", LEMMA: abbr[LEMMA], NORM: abbr[LEMMA]}
]
for exc_data in [ # "etc." abbreviations
{ORTH: "һ.б.ш.", NORM: "һәм башка шундыйлар"},
{ORTH: "һ.б.", NORM: "һәм башка"},
{ORTH: "б.э.к.", NORM: "безнең эрага кадәр"},
{ORTH: "б.э.", NORM: "безнең эра"}]:
exc_data[LEMMA] = exc_data[NORM]
_exc[exc_data[ORTH]] = [exc_data]
TOKENIZER_EXCEPTIONS = _exc
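The loop above registers every abbreviation in lowercase, capitalized and upper-case form, each with and without a trailing period. Below is a minimal standalone sketch of that expansion. It assumes plain string keys (`"ORTH"`, `"LEMMA"`, `"NORM"`) in place of spaCy's symbol constants, and the helper name `expand_abbreviations` is hypothetical.

```python
# Standalone sketch: plain string keys stand in for spaCy's ORTH/LEMMA/NORM symbols.
ORTH, LEMMA, NORM = "ORTH", "LEMMA", "NORM"


def expand_abbreviations(abbreviations):
    """Build exception entries for every casing variant, with and without '.'."""
    exceptions = {}
    for abbr in abbreviations:
        base = abbr[ORTH]
        for orth in (base, base.capitalize(), base.upper()):
            for form in (orth, orth + "."):
                # The lemma doubles as the norm, as in the module above.
                exceptions[form] = [
                    {ORTH: form, LEMMA: abbr[LEMMA], NORM: abbr[LEMMA]}
                ]
    return exceptions


exc = expand_abbreviations([{ORTH: "млн", LEMMA: "миллион"}])
print(sorted(exc))
```

Each source abbreviation therefore yields six keys, which is why the list of abbreviations can stay short.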

30
spacy/lang/ur/__init__.py Normal file

@ -0,0 +1,30 @@
# coding: utf8
from __future__ import unicode_literals
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from ..tag_map import TAG_MAP
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc
class UrduDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: 'ur'
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
tag_map = TAG_MAP
stop_words = STOP_WORDS
class Urdu(Language):
lang = 'ur'
Defaults = UrduDefaults
__all__ = ['Urdu']

16
spacy/lang/ur/examples.py Normal file

@ -0,0 +1,16 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.ur.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"اردو ہے جس کا نام ہم جانتے ہیں داغ",
"سارے جہاں میں دھوم ہماری زباں کی ہے",
]

29113
spacy/lang/ur/lemmatizer.py Normal file

File diff suppressed because it is too large


@ -0,0 +1,47 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
# Source https://quizlet.com/4271889/1-100-urdu-number-wordsurdu-numerals-flash-cards/
# http://www.urduword.com/lessons.php?lesson=numbers
# https://en.wikibooks.org/wiki/Urdu/Vocabulary/Numbers
# https://www.urdu-english.com/lessons/beginner/numbers
_num_words = """ایک دو تین چار پانچ چھ سات آٹھ نو دس گیارہ بارہ تیرہ چودہ پندرہ سولہ سترہ
اٹهارا انیس بیس اکیس بائیس تئیس چوبیس پچیس چھببیس
ستایس اٹھائس انتيس تیس اکتیس بتیس تینتیس چونتیس پینتیس
چھتیس سینتیس ارتیس انتالیس چالیس اکتالیس بیالیس تیتالیس
چوالیس پیتالیس چھیالیس سینتالیس اڑتالیس انچالیس پچاس اکاون باون
تریپن چون پچپن چھپن ستاون اٹھاون انسٹھ ساثھ
اکسٹھ باسٹھ تریسٹھ چوسٹھ پیسٹھ چھیاسٹھ سڑسٹھ اڑسٹھ
انھتر ستر اکھتر بھتتر تیھتر چوھتر تچھتر چھیتر ستتر
اٹھتر انیاسی اسی اکیاسی بیاسی تیراسی چوراسی پچیاسی چھیاسی
سٹیاسی اٹھیاسی نواسی نوے اکانوے بانوے ترانوے
چورانوے پچانوے چھیانوے ستانوے اٹھانوے ننانوے سو
""".split()
# source https://www.google.com/intl/ur/inputtools/try/
_ordinal_words = """پہلا دوسرا تیسرا چوتھا پانچواں چھٹا ساتواں آٹھواں نواں دسواں گیارہواں بارہواں تیرھواں چودھواں
پندرھواں سولہواں سترھواں اٹھارواں انیسواں بسیواں
""".split()
def like_num(text):
text = text.replace(',', '').replace('.', '')
if text.isdigit():
return True
if text.count('/') == 1:
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
return True
if text in _ordinal_words:
return True
return False
LEX_ATTRS = {
LIKE_NUM: like_num
}
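To make the behaviour of `like_num` concrete, here is a standalone sketch of the same checks. The word sets are a small illustrative subset chosen for the example, not the full lists from the module.

```python
# Standalone sketch of the like_num logic; the word sets below are a small
# illustrative subset, not the module's full lists.
_num_words = {"ایک", "دو", "تین", "دس", "سو"}
_ordinal_words = {"پہلا", "دوسرا", "تیسرا"}


def like_num(text):
    # Drop thousands separators and decimal points before the digit check.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as "3/4".
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Spelled-out cardinals and ordinals.
    return text in _num_words or text in _ordinal_words


for token in ["1,000", "3/4", "دو", "پہلا", "کتاب"]:
    print(token, like_num(token))
```

Using sets rather than lists for the word lookups is a choice made for this sketch; the module itself keeps the lists produced by `.split()`.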

515
spacy/lang/ur/stop_words.py Normal file

@ -0,0 +1,515 @@
# encoding: utf8
from __future__ import unicode_literals
# Source: collected from various resources on the internet
STOP_WORDS = set("""
ثھی
خو
گی
اپٌے
گئے
ثہت
طرف
ہوبری
پبئے
اپٌب
دوضری
گیب
کت
گب
ثھی
ضے
ہر
پر
اش
دی
گے
لگیں
ہے
ثعذ
ضکتے
تھی
اى
دیب
لئے
والے
یہ
ثدبئے
ضکتی
تھب
اًذر
رریعے
لگی
ہوبرا
ہوًے
ثبہر
ضکتب
ًہیں
تو
اور
رہب
لگے
ہوضکتب
ہوں
کب
ہوبرے
توبم
کیب
ایطے
رہی
هگر
ہوضکتی
ہیں
کریں
ہو
تک
کی
ایک
رہے
هیں
ہوضکتے
کیطے
ہوًب
تت
کہ
ہوا
آئے
ضبت
تھے
کیوں
ہو
تب
کے
پھر
ثغیر
خبر
ہے
رکھ
کی
طب
کوئی
رریعے
ثبرے
خب
اضطرذ
ثلکہ
خجکہ
رکھ
تب
کی
طرف
ثراں
خبر
رریعہ
اضکب
ثٌذ
خص
کی
لئے
توہیں
دوضرے
کررہی
اضکی
ثیچ
خوکہ
رکھتی
کیوًکہ
دوًوں
کر
رہے
خبر
ہی
ثرآں
اضکے
پچھلا
خیطب
رکھتے
کے
ثعذ
تو
ہی
دورى
کر
یہبں
آش
تھوڑا
چکے
زکویہ
دوضروں
ضکب
اوًچب
ثٌب
پل
تھوڑی
چلا
خبهوظ
دیتب
ضکٌب
اخبزت
اوًچبئی
ثٌبرہب
پوچھب
تھوڑے
چلو
ختن
دیتی
ضکی
اچھب
اوًچی
ثٌبرہی
پوچھتب
تیي
چلیں
در
دیتے
ضکے
اچھی
اوًچے
ثٌبرہے
پوچھتی
خبًب
چلے
درخبت
دیر
ضلطلہ
اچھے
اٹھبًب
ثٌبًب
پوچھتے
خبًتب
چھوٹب
درخہ
دیکھٌب
ضوچ
اختتبم
اہن
ثٌذ
پوچھٌب
خبًتی
چھوٹوں
درخے
دیکھو
ضوچب
ادھر
آئی
ثٌذکرًب
پوچھو
خبًتے
چھوٹی
درزقیقت
دیکھی
ضوچتب
ارد
آئے
ثٌذکرو
پوچھوں
خبًٌب
چھوٹے
درضت
دیکھیں
ضوچتی
اردگرد
آج
ثٌذی
پوچھیں
خططرذ
چھہ
دش
دیٌب
ضوچتے
ارکبى
آخر
ثڑا
پورا
خگہ
چیسیں
دفعہ
دے
ضوچٌب
اضتعوبل
آخر
پہلا
خگہوں
زبصل
دکھبئیں
راضتوں
ضوچو
اضتعوبلات
آدهی
ثڑی
پہلی
خگہیں
زبضر
دکھبتب
راضتہ
ضوچی
اغیب
آًب
ثڑے
پہلےضی
خلذی
زبل
دکھبتی
راضتے
ضوچیں
اطراف
آٹھ
ثھر
خٌبة
زبل
دکھبتے
رکي
ضیذھب
افراد
آیب
ثھرا
پہلے
خواى
زبلات
دکھبًب
رکھب
ضیذھی
اکثر
ثب
ہوا
پیع
خوًہی
زبلیہ
دکھبو
رکھی
ضیذھے
اکٹھب
ثھرپور
تبزٍ
خیطبکہ
زصوں
رکھے
ضیکٌڈ
اکٹھی
ثبری
ثہتر
تر
چبر
زصہ
دلچطپ
زیبدٍ
غبیذ
اکٹھے
ثبلا
ثہتری
ترتیت
چبہب
زصے
دلچطپی
ضبت
غخص
اکیلا
ثبلترتیت
ثہتریي
تریي
چبہٌب
زقبئق
دلچطپیبں
ضبدٍ
غذ
اکیلی
ثرش
پبش
تعذاد
چبہے
زقیتیں
هٌبضت
ضبرا
غروع
اکیلے
ثغیر
پبًب
چکب
زقیقت
دو
ضبرے
غروعبت
اگرچہ
ثلٌذ
پبًچ
تن
چکی
زکن
دور
ضبل
غے
الگ
پراًب
تٌہب
چکیں
دوضرا
ضبلوں
صبف
صسیر
قجیلہ
کوًطے
لازهی
هطئلے
ًیب
طریق
کرتی
کہتے
صفر
قطن
کھولا
لگتب
هطبئل
وار
طریقوں
کرتے
کہٌب
صورت
کئی
کھولٌب
لگتی
هطتعول
وار
طریقہ
کرتے
ہو
کہٌب
صورتسبل
کئے
کھولو
لگتے
هػتول
ٹھیک
طریقے
کرًب
کہو
صورتوں
کبفی
هطلق
ڈھوًڈا
طور
کرو
کہوں
صورتیں
کبم
کھولیں
لگی
هعلوم
ڈھوًڈلیب
طورپر
کریں
کہی
ضرور
کجھی
کھولے
لگے
هکول
ڈھوًڈًب
ظبہر
کرے
کہیں
ضرورت
کرا
کہب
لوجب
هلا
ڈھوًڈو
عذد
کل
کہیں
کرتب
کہتب
لوجی
هوکي
ڈھوًڈی
عظین
کن
کہے
ضروری
کرتبہوں
کہتی
لوجے
هوکٌبت
ڈھوًڈیں
علاقوں
کوتر
کیے
لوسبت
هوکٌہ
ہن
لے
ًبپطٌذ
ہورہے
علاقہ
کورا
کے
رریعے
لوسہ
هڑا
ہوئی
هتعلق
ًبگسیر
ہوگئی
علاقے
کوروں
گئی
لو
هڑًب
ہوئے
هسترم
ًطجت
ہو
گئے
علاوٍ
کورٍ
گرد
لوگ
هڑے
ہوتی
هسترهہ
ًقطہ
ہوگیب
کورے
گروپ
لوگوں
هہرثبى
ہوتے
هسطوش
ًکبلٌب
ہوًی
عووهی
کوطي
گروٍ
لڑکپي
هیرا
ہوچکب
هختلف
ًکتہ
ہی
فرد
کوى
گروہوں
لی
هیری
ہوچکی
هسیذ
فی
کوًطب
گٌتی
لیب
هیرے
ہوچکے
هطئلہ
ًوخواى
یقیٌی
قجل
کوًطی
لیٌب
ًئی
ہورہب
لیں
ًئے
ہورہی
ثبعث
ضت
""".split())

65
spacy/lang/ur/tag_map.py Normal file

@ -0,0 +1,65 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON
TAG_MAP = {
".": {POS: PUNCT, "PunctType": "peri"},
",": {POS: PUNCT, "PunctType": "comm"},
"-LRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "ini"},
"-RRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "fin"},
"``": {POS: PUNCT, "PunctType": "quot", "PunctSide": "ini"},
"\"\"": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
"''": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
":": {POS: PUNCT},
"$": {POS: SYM, "Other": {"SymType": "currency"}},
"#": {POS: SYM, "Other": {"SymType": "numbersign"}},
"AFX": {POS: ADJ, "Hyph": "yes"},
"CC": {POS: CCONJ, "ConjType": "coor"},
"CD": {POS: NUM, "NumType": "card"},
"DT": {POS: DET},
"EX": {POS: ADV, "AdvType": "ex"},
"FW": {POS: X, "Foreign": "yes"},
"HYPH": {POS: PUNCT, "PunctType": "dash"},
"IN": {POS: ADP},
"JJ": {POS: ADJ, "Degree": "pos"},
"JJR": {POS: ADJ, "Degree": "comp"},
"JJS": {POS: ADJ, "Degree": "sup"},
"LS": {POS: PUNCT, "NumType": "ord"},
"MD": {POS: VERB, "VerbType": "mod"},
"NIL": {POS: ""},
"NN": {POS: NOUN, "Number": "sing"},
"NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
"NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
"NNS": {POS: NOUN, "Number": "plur"},
"PDT": {POS: ADJ, "AdjType": "pdt", "PronType": "prn"},
"POS": {POS: PART, "Poss": "yes"},
"PRP": {POS: PRON, "PronType": "prs"},
"PRP$": {POS: ADJ, "PronType": "prs", "Poss": "yes"},
"RB": {POS: ADV, "Degree": "pos"},
"RBR": {POS: ADV, "Degree": "comp"},
"RBS": {POS: ADV, "Degree": "sup"},
"RP": {POS: PART},
"SP": {POS: SPACE},
"SYM": {POS: SYM},
"TO": {POS: PART, "PartType": "inf", "VerbForm": "inf"},
"UH": {POS: INTJ},
"VB": {POS: VERB, "VerbForm": "inf"},
"VBD": {POS: VERB, "VerbForm": "fin", "Tense": "past"},
"VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
"VBN": {POS: VERB, "VerbForm": "part", "Tense": "past", "Aspect": "perf"},
"VBP": {POS: VERB, "VerbForm": "fin", "Tense": "pres"},
"VBZ": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Number": "sing", "Person": 3},
"WDT": {POS: ADJ, "PronType": "int|rel"},
"WP": {POS: NOUN, "PronType": "int|rel"},
"WP$": {POS: ADJ, "Poss": "yes", "PronType": "int|rel"},
"WRB": {POS: ADV, "PronType": "int|rel"},
"ADD": {POS: X},
"NFP": {POS: PUNCT},
"GW": {POS: X},
"XX": {POS: X},
"BES": {POS: VERB},
"HVS": {POS: VERB},
"_SP": {POS: SPACE},
}


@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals
# Import symbols. If you need more, add them here.
from ...symbols import ORTH, LEMMA, TAG, NORM, ADP, DET
# Add tokenizer exceptions
# Documentation: https://spacy.io/docs/usage/adding-languages#tokenizer-exceptions
# Feel free to use custom logic to generate repetitive exceptions more efficiently.
# If an exception is split into more than one token, the ORTH values combined always
# need to match the original string.
# Exceptions should be added in the following format:
_exc = {
}
# To keep things clean and readable, it's recommended to only declare the
# TOKENIZER_EXCEPTIONS at the bottom:
TOKENIZER_EXCEPTIONS = _exc


@ -15,7 +15,8 @@ from .. import util
# here if it's using spaCy's tokenizer (not a different library)
# TODO: re-implement generic tokenizer tests
_languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'xx']
'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'ur', 'tt',
'xx']
_models = {'en': ['en_core_web_sm'],
'de': ['de_core_news_sm'],
@ -153,10 +154,18 @@ def th_tokenizer():
def tr_tokenizer():
return util.get_lang_class('tr').Defaults.create_tokenizer()
@pytest.fixture
def tt_tokenizer():
return util.get_lang_class('tt').Defaults.create_tokenizer()
@pytest.fixture
def ar_tokenizer():
return util.get_lang_class('ar').Defaults.create_tokenizer()
@pytest.fixture
def ur_tokenizer():
return util.get_lang_class('ur').Defaults.create_tokenizer()
@pytest.fixture
def ru_tokenizer():
pymorphy = pytest.importorskip('pymorphy2')


@ -4,30 +4,25 @@ from __future__ import unicode_literals
import pytest
@pytest.mark.models('fr')
def test_lemmatizer_verb(FR):
tokens = FR("Qu'est-ce que tu fais?")
def test_lemmatizer_verb(fr_tokenizer):
tokens = fr_tokenizer("Qu'est-ce que tu fais?")
assert tokens[0].lemma_ == "que"
assert tokens[1].lemma_ == "être"
assert tokens[5].lemma_ == "faire"
@pytest.mark.models('fr')
@pytest.mark.xfail(reason="sont tagged as AUX")
def test_lemmatizer_noun_verb_2(FR):
tokens = FR("Les abaissements de température sont gênants.")
def test_lemmatizer_noun_verb_2(fr_tokenizer):
tokens = fr_tokenizer("Les abaissements de température sont gênants.")
assert tokens[4].lemma_ == "être"
@pytest.mark.models('fr')
@pytest.mark.xfail(reason="Costaricienne TAG is PROPN instead of NOUN and spaCy doesn't lemmatize PROPN")
def test_lemmatizer_noun(FR):
tokens = FR("il y a des Costaricienne.")
def test_lemmatizer_noun(fr_tokenizer):
tokens = fr_tokenizer("il y a des Costaricienne.")
assert tokens[4].lemma_ == "Costaricain"
@pytest.mark.models('fr')
def test_lemmatizer_noun_2(FR):
tokens = FR("Les abaissements de température sont gênants.")
def test_lemmatizer_noun_2(fr_tokenizer):
tokens = fr_tokenizer("Les abaissements de température sont gênants.")
assert tokens[1].lemma_ == "abaissement"
assert tokens[5].lemma_ == "gênant"


@ -0,0 +1,75 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
INFIX_HYPHEN_TESTS = [
("Явым-төшем күләме.", "Явым-төшем күләме .".split()),
("Хатын-кыз киеме.", "Хатын-кыз киеме .".split())
]
PUNC_INSIDE_WORDS_TESTS = [
("Пассаҗир саны - 2,13 млн — кеше/көндә (2010), 783,9 млн. кеше/елда.",
"Пассаҗир саны - 2,13 млн — кеше / көндә ( 2010 ) ,"
" 783,9 млн. кеше / елда .".split()),
("Ту\"кай", "Ту \" кай".split())
]
MIXED_ORDINAL_NUMS_TESTS = [
("Иртәгә 22нче гыйнвар...", "Иртәгә 22нче гыйнвар ...".split())
]
ABBREV_TESTS = [
("«3 елда (б.э.к.) туган", "« 3 елда ( б.э.к. ) туган".split()),
("тукымадан һ.б.ш. тегелгән.", "тукымадан һ.б.ш. тегелгән .".split())
]
NAME_ABBREV_TESTS = [
("Ә.Тукай", "Ә.Тукай".split()),
("Ә.тукай", "Ә.тукай".split()),
("ә.Тукай", "ә . Тукай".split()),
("Миләүшә.", "Миләүшә .".split())
]
TYPOS_IN_PUNC_TESTS = [
("«3 елда , туган", "« 3 елда , туган".split()),
("«3 елда,туган", "« 3 елда , туган".split()),
("«3 елда,туган.", "« 3 елда , туган .".split()),
("Ул эшли(кайчан?)", "Ул эшли ( кайчан ? )".split()),
("Ул (кайчан?)эшли", "Ул ( кайчан ?) эшли".split()) # "?)" => "?)" or "? )"
]
LONG_TEXTS_TESTS = [
("Иң борынгы кешеләр суыклар һәм салкын кышлар булмый торган җылы"
"якларда яшәгәннәр, шуңа күрә аларга кием кирәк булмаган.Йөз"
"меңнәрчә еллар үткән, борынгы кешеләр акрынлап Европа һәм Азиянең"
"салкын илләрендә дә яши башлаганнар. Алар кырыс һәм салкын"
"кышлардан саклану өчен кием-салым уйлап тапканнар - итәк.",
"Иң борынгы кешеләр суыклар һәм салкын кышлар булмый торган җылы"
"якларда яшәгәннәр , шуңа күрә аларга кием кирәк булмаган . Йөз"
"меңнәрчә еллар үткән , борынгы кешеләр акрынлап Европа һәм Азиянең"
"салкын илләрендә дә яши башлаганнар . Алар кырыс һәм салкын"
"кышлардан саклану өчен кием-салым уйлап тапканнар - итәк .".split()
)
]
TESTCASES = (INFIX_HYPHEN_TESTS + PUNC_INSIDE_WORDS_TESTS +
MIXED_ORDINAL_NUMS_TESTS + ABBREV_TESTS + NAME_ABBREV_TESTS +
LONG_TEXTS_TESTS + TYPOS_IN_PUNC_TESTS)
NORM_TESTCASES = [
("тукымадан һ.б.ш. тегелгән.",
["тукымадан", "һәм башка шундыйлар", "тегелгән", "."])
]
@pytest.mark.parametrize("text,expected_tokens", TESTCASES)
def test_tokenizer_handles_testcases(tt_tokenizer, text, expected_tokens):
tokens = [token.text for token in tt_tokenizer(text) if not token.is_space]
assert expected_tokens == tokens
@pytest.mark.parametrize('text,norms', NORM_TESTCASES)
def test_tokenizer_handles_norm_exceptions(tt_tokenizer, text, norms):
tokens = tt_tokenizer(text)
assert [token.norm_ for token in tokens] == norms


@ -0,0 +1,26 @@
# coding: utf-8
"""Test that longer and mixed texts are tokenized correctly."""
from __future__ import unicode_literals
import pytest
def test_tokenizer_handles_long_text(ur_tokenizer):
text = """اصل میں رسوا ہونے کی ہمیں
کچھ عادت سی ہو گئی ہے اس لئے جگ ہنسائی کا ذکر نہیں کرتا،ہوا کچھ یوں کہ عرصہ چھ سال بعد ہمیں بھی خیال آیا
کہ ایک عدد ٹیلی ویژن ہی کیوں نہ خرید لیں ، سوچا ورلڈ کپ ہی دیکھیں گے۔اپنے پاکستان کے کھلاڑیوں کو دیکھ کر
ورلڈ کپ دیکھنے کا حوصلہ ہی نہ رہا تو اب یوں ہی ادھر اُدھر کے چینل گھمانے لگ پڑتے ہیں۔"""
tokens = ur_tokenizer(text)
assert len(tokens) == 77
@pytest.mark.parametrize('text,length', [
("تحریر باسط حبیب", 3),
("میرا پاکستان", 2)])
def test_tokenizer_handles_cnts(ur_tokenizer, text, length):
tokens = ur_tokenizer(text)
assert len(tokens) == length


@ -3,7 +3,7 @@ from __future__ import unicode_literals
from ..util import ensure_path
from .. import util
from ..displacy import parse_deps, parse_ents
from .. import displacy
from ..tokens import Span
from .util import get_doc
from .._ml import PrecomputableAffine
@ -34,18 +34,16 @@ def test_util_get_package_path(package):
assert isinstance(path, Path)
@pytest.mark.xfail
def test_displacy_parse_ents(en_vocab):
"""Test that named entities on a Doc are converted into displaCy's format."""
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings[u'ORG'])]
ents = parse_ents(doc)
ents = displacy.parse_ents(doc)
assert isinstance(ents, dict)
assert ents['text'] == 'But Google is starting from behind '
assert ents['ents'] == [{'start': 4, 'end': 10, 'label': 'ORG'}]
@pytest.mark.xfail
def test_displacy_parse_deps(en_vocab):
"""Test that deps and tags on a Doc are converted into displaCy's format."""
words = ["This", "is", "a", "sentence"]
@ -55,7 +53,7 @@ def test_displacy_parse_deps(en_vocab):
deps = ['nsubj', 'ROOT', 'det', 'attr']
doc = get_doc(en_vocab, words=words, heads=heads, pos=pos, tags=tags,
deps=deps)
deps = parse_deps(doc)
deps = displacy.parse_deps(doc)
assert isinstance(deps, dict)
assert deps['words'] == [{'text': 'This', 'tag': 'DET'},
{'text': 'is', 'tag': 'VERB'},
@ -66,7 +64,19 @@ def test_displacy_parse_deps(en_vocab):
{'start': 1, 'end': 3, 'label': 'attr', 'dir': 'right'}]
@pytest.mark.xfail
def test_displacy_spans(en_vocab):
"""Test that displaCy can render Spans."""
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings[u'ORG'])]
html = displacy.render(doc[1:4], style='ent')
assert html.startswith('<div')
def test_displacy_raises_for_wrong_type(en_vocab):
with pytest.raises(ValueError):
html = displacy.render('hello world')
def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2):
model = PrecomputableAffine(nO=nO, nI=nI, nF=nF, nP=nP)
assert model.W.shape == (nF, nO, nP, nI)


@ -394,7 +394,7 @@ cdef class Tokenizer:
data = OrderedDict()
deserializers = OrderedDict((
('vocab', lambda b: self.vocab.from_bytes(b)),
('prefix_search', lambda b: data.setdefault('prefix', b)),
('prefix_search', lambda b: data.setdefault('prefix_search', b)),
('suffix_search', lambda b: data.setdefault('suffix_search', b)),
('infix_finditer', lambda b: data.setdefault('infix_finditer', b)),
('token_match', lambda b: data.setdefault('token_match', b)),
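The one-line change matters because each deserializer stores its payload in `data` under a key, and a value stored under `'prefix'` is invisible to any lookup of `'prefix_search'`. The sketch below isolates that effect. It assumes the code following this hunk reads the values back under the same names as the deserializers; the helper `collect` and the sample patterns are hypothetical.

```python
from collections import OrderedDict


def collect(msg, prefix_key):
    """Run setdefault-style deserializers over msg and return what they stored."""
    data = OrderedDict()
    deserializers = OrderedDict((
        ("prefix_search", lambda b: data.setdefault(prefix_key, b)),
        ("suffix_search", lambda b: data.setdefault("suffix_search", b)),
    ))
    for name, setter in deserializers.items():
        if name in msg:
            setter(msg[name])
    return data


msg = {"prefix_search": r"^\(", "suffix_search": r"\)$"}
old = collect(msg, "prefix")          # key used before the fix
new = collect(msg, "prefix_search")   # key used after the fix
print(old.get("prefix_search"), new.get("prefix_search"))
```

With the old key the prefix pattern is stored but never found again, so under that assumption a tokenizer restored from bytes would lose its prefix rules.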


@ -84,8 +84,8 @@
}
],
"V_CSS": "2.1.3",
"V_JS": "2.1.2",
"V_CSS": "2.2.1",
"V_JS": "2.2.2",
"DEFAULT_SYNTAX": "python",
"ANALYTICS": "UA-58931649-1",
"MAILCHIMP": {


@ -124,6 +124,12 @@ mixin help(tooltip, icon_size)
+icon("help_o", icon_size || 16).o-icon--inline
//- Abbreviation
mixin abbr(title)
abbr.o-abbr(data-tooltip=title data-tooltip-style="code" aria-label=title)&attributes(attributes)
block
//- Aside wrapper
label - [string] aside label


@ -9,7 +9,7 @@ menu.c-sidebar.js-sidebar.u-text
each url, item in items
- var is_current = CURRENT == url || (CURRENT == "index" && url == "./")
li.c-sidebar__item
+a(url)(class=is_current ? "is-active" : null tabindex=is_current ? "-1" : null)=item
+a(url)(class=is_current ? "is-active" : null tabindex=is_current ? "-1" : null data-sidebar-active=is_current ? "" : null)=item
if is_current
if IS_MODELS && CURRENT_MODELS.length


@ -1,115 +0,0 @@
//- 💫 DOCS > API > ARCHITECTURE > CYTHON
+aside("What's Cython?")
| #[+a("http://cython.org/") Cython] is a language for writing
| C extensions for Python. Most Python code is also valid Cython, but
| you can add type declarations to get efficient memory-managed code
| just like C or C++.
p
| spaCy's core data structures are implemented as
| #[+a("http://cython.org/") Cython] #[code cdef] classes. Memory is
| managed through the #[+a(gh("cymem")) #[code cymem]]
| #[code cymem.Pool] class, which allows you
| to allocate memory which will be freed when the #[code Pool] object
| is garbage collected. This means you usually don't have to worry
| about freeing memory. You just have to decide which Python object
| owns the memory, and make it own the #[code Pool]. When that object
| goes out of scope, the memory will be freed. You do have to take
| care that no pointers outlive the object that owns them — but this
| is generally quite easy.
p
| All Cython modules should have the #[code # cython: infer_types=True]
| compiler directive at the top of the file. This makes the code much
| cleaner, as it avoids the need for many type declarations. If
| possible, you should prefer to declare your functions #[code nogil],
| even if you don't especially care about multi-threading. The reason
| is that #[code nogil] functions help the Cython compiler reason about
| your code quite a lot — you're telling the compiler that no Python
| dynamics are possible. This lets many errors be raised, and ensures
| your function will run at C speed.
p
| Cython gives you many choices of sequences: you could have a Python
| list, a numpy array, a memory view, a C++ vector, or a pointer.
| Pointers are preferred, because they are fastest, have the most
| explicit semantics, and let the compiler check your code more
| strictly. C++ vectors are also great — but you should only use them
| internally in functions. It's less friendly to accept a vector as an
| argument, because that asks the user to do much more work. Here's
| how to get a pointer from a numpy array, memory view or vector:
+code.
cdef void get_pointers(np.ndarray[int, mode='c'] numpy_array, vector[int] cpp_vector, int[::1] memory_view) nogil:
pointer1 = &lt;int*&gt;numpy_array.data
pointer2 = cpp_vector.data()
pointer3 = &memory_view[0]
p
| Both C arrays and C++ vectors reassure the compiler that no Python
| operations are possible on your variable. This is a big advantage:
| it lets the Cython compiler raise many more errors for you.
p
| When getting a pointer from a numpy array or memoryview, take care
| that the data is actually stored in C-contiguous order — otherwise
| you'll get a pointer to nonsense. The type-declarations in the code
| above should generate runtime errors if buffers with incorrect
| memory layouts are passed in. To iterate over the array, the
| following style is preferred:
+code.
cdef int c_total(const int* int_array, int length) nogil:
total = 0
for item in int_array[:length]:
total += item
return total
p
| If this is confusing, consider that the compiler couldn't deal with
| #[code for item in int_array:] — there's no length attached to a raw
| pointer, so how could we figure out where to stop? The length is
| provided in the slice notation as a solution to this. Note that we
| don't have to declare the type of #[code item] in the code above —
| the compiler can easily infer it. This gives us tidy code that looks
| quite like Python, but is exactly as fast as C — because we've made
| sure the compilation to C is trivial.
p
| Your functions cannot be declared #[code nogil] if they need to
| create Python objects or call Python functions. This is perfectly
| okay — you shouldn't torture your code just to get #[code nogil]
| functions. However, if your function isn't #[code nogil], you should
| compile your module with #[code cython -a --cplus my_module.pyx] and
| open the resulting #[code my_module.html] file in a browser. This
| will let you see how Cython is compiling your code. Calls into the
| Python run-time will be in bright yellow. This lets you easily see
| whether Cython is able to correctly type your code, or whether there
| are unexpected problems.
p
| Working in Cython is very rewarding once you're over the initial
| learning curve. As with C and C++, the first way you write something
| in Cython will often be the performance-optimal approach. In
| contrast, Python optimisation generally requires a lot of
| experimentation. Is it faster to have an #[code if item in my_dict]
| check, or to use #[code .get()]? What about
| #[code try]/#[code except]? Does this numpy operation create a copy?
| There's no way to guess the answers to these questions, and you'll
| usually be dissatisfied with your results — so there's no way to
| know when to stop this process. In the worst case, you'll make a
| mess that invites the next reader to try their luck too. This is
| like one of those
| #[+a("http://www.wemjournal.org/article/S1080-6032%2809%2970088-2/abstract") volcanic gas-traps],
| where the rescuers keep passing out from low oxygen, causing
| another rescuer to follow — only to succumb themselves. In short,
| just say no to optimizing your Python. If it's not fast enough the
| first time, just switch to Cython.
+infobox("Resources")
+list.o-no-block
+item #[+a("http://docs.cython.org/en/latest/") Official Cython documentation] (cython.org)
+item #[+a("https://explosion.ai/blog/writing-c-in-cython", true) Writing C in Cython] (explosion.ai)
+item #[+a("https://explosion.ai/blog/multithreading-with-cython") Multi-threading spaCy's parser and named entity recogniser] (explosion.ai)


@ -1,149 +0,0 @@
//- 💫 DOCS > API > ARCHITECTURE > NN MODEL ARCHITECTURE
p
| spaCy's statistical models have been custom-designed to give a
| high-performance mix of speed and accuracy. The current architecture
| hasn't been published yet, but in the meantime we prepared a video that
| explains how the models work, with particular focus on NER.
+youtube("sqDHBH9IjRU")
p
| The parsing model is a blend of recent results. The two recent
| inspirations have been the work of Eli Kiperwasser and Yoav Goldberg at
| Bar Ilan#[+fn(1)], and the SyntaxNet team from Google. The foundation of
| the parser is still based on the work of Joakim Nivre#[+fn(2)], who
| introduced the transition-based framework#[+fn(3)], the arc-eager
| transition system, and the imitation learning objective. The model is
| implemented using #[+a(gh("thinc")) Thinc], spaCy's machine learning
| library. We first predict context-sensitive vectors for each word in the
| input:
+code.
(embed_lower | embed_prefix | embed_suffix | embed_shape)
&gt;&gt; Maxout(token_width)
&gt;&gt; convolution ** 4
p
| This convolutional layer is shared between the tagger, parser and NER,
| and will also be shared by the future neural lemmatizer. Because the
| parser shares these layers with the tagger, the parser does not require
| tag features. I got this trick from David Weiss's "Stack Combination"
| paper#[+fn(4)].
p
| To boost the representation, the tagger actually predicts a "super tag"
| with POS, morphology and dependency label#[+fn(5)]. The tagger predicts
| these supertags by adding a softmax layer onto the convolutional layer,
| so we're teaching the convolutional layer to give us a representation
| that's one affine transform from this informative lexical information.
| This is obviously good for the parser (which backprops to the
| convolutions too). The parser model makes a state vector by concatenating
| the vector representations for its context tokens. The current context
| tokens:
+table
+row
+cell #[code S0], #[code S1], #[code S2]
+cell Top three words on the stack.
+row
+cell #[code B0], #[code B1]
+cell First two words of the buffer.
+row
+cell
| #[code S0L1], #[code S1L1], #[code S2L1], #[code B0L1],
| #[code B1L1]#[br]
| #[code S0L2], #[code S1L2], #[code S2L2], #[code B0L2],
| #[code B1L2]
+cell
| Leftmost and second leftmost children of #[code S0], #[code S1],
| #[code S2], #[code B0] and #[code B1].
+row
+cell
| #[code S0R1], #[code S1R1], #[code S2R1], #[code B0R1],
| #[code B1R1]#[br]
| #[code S0R2], #[code S1R2], #[code S2R2], #[code B0R2],
| #[code B1R2]
+cell
| Rightmost and second rightmost children of #[code S0], #[code S1],
| #[code S2], #[code B0] and #[code B1].
p
| This makes the state vector quite long: #[code 13*T], where #[code T] is
| the token vector width (128 is working well). Fortunately, there's a way
| to structure the computation to save some expense (and make it more
| GPU-friendly).
p
| The parser typically visits #[code 2*N] states for a sentence of length
| #[code N] (although it may visit more, if it back-tracks with a
| non-monotonic transition#[+fn(6)]). A naive implementation would require
| #[code 2*N (B, 13*T) @ (13*T, H)] matrix multiplications for a batch of
| size #[code B]. We can instead perform one #[code (B*N, T) @ (T, 13*H)]
| multiplication, to pre-compute the hidden weights for each positional
| feature with respect to the words in the batch. (Note that our token
| vectors come from the CNN, so we can't play this trick over the
| vocabulary. That's how Stanford's NN parser#[+fn(7)] works, and why its
| model is so big.)
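p
| The pre-computation trick can be checked on a toy example. The sketch
| below is plain Python rather than Thinc, with tiny made-up sizes and
| random weights. It ignores the bias, the non-linearity and empty
| feature slots, and all names in it are hypothetical. It shows that
| summing pre-computed per-token rows gives the same hidden vector as
| multiplying the concatenated state vector by the full weight matrix.

```python
import random

random.seed(0)
T, H, F, N = 4, 3, 13, 6  # token width, hidden width, feature slots, tokens

# Token vectors (N, T), standing in for the CNN output.
tokvecs = [[random.uniform(-1, 1) for _ in range(T)] for _ in range(N)]
# One (T, H) weight block per feature slot, i.e. a (13*T, H) matrix in blocks.
W = [[[random.uniform(-1, 1) for _ in range(H)] for _ in range(T)]
     for _ in range(F)]


def vec_mat(vec, mat):
    """Multiply a vector by a matrix given as a list of rows."""
    return [sum(v * row[h] for v, row in zip(vec, mat))
            for h in range(len(mat[0]))]


def naive_hidden(context_ids):
    """Concatenate the 13 context vectors, then apply the full matrix."""
    concat = [x for token_id in context_ids for x in tokvecs[token_id]]
    full_W = [row for block in W for row in block]
    return vec_mat(concat, full_W)


# Pre-compute every token's contribution to every slot once per sentence.
precomputed = [[vec_mat(vec, W[slot]) for slot in range(F)] for vec in tokvecs]


def cached_hidden(context_ids):
    """Sum the pre-computed rows for the tokens that fill each slot."""
    hidden = [0.0] * H
    for slot, token_id in enumerate(context_ids):
        for h in range(H):
            hidden[h] += precomputed[token_id][slot][h]
    return hidden


state = [random.randrange(N) for _ in range(F)]  # token index for each slot
assert all(abs(a - b) < 1e-9
           for a, b in zip(naive_hidden(state), cached_hidden(state)))
```

p
| Once the table is built, each parser state only costs 13 row lookups
| and additions, which is cheap enough to run on the CPU.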
p
| This pre-computation strategy allows a nice compromise between
| GPU-friendliness and implementation simplicity. The CNN and the wide
| lower layer are computed on the GPU, and then the precomputed hidden
| weights are moved to the CPU, before we start the transition-based
| parsing process. This makes a lot of things much easier. We don't have to
| worry about variable-length batch sizes, and we don't have to implement
| the dynamic oracle in CUDA to train.
p
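| Currently the parser's loss function is multilabel log loss#[+fn(8)], as
| the dynamic oracle allows multiple states to be 0 cost. Its gradient with
| respect to each class score is defined as follows, where #[code Z] is the
| sum of the exponentiated scores over all classes and #[code gZ] is the
| same sum over the gold classes only (the second term is dropped for
| non-gold classes):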
+code.
(exp(score) / Z) - (exp(score) / gZ)
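p
| A minimal Python sketch of this gradient, assuming the loss is
| #[code -log(gZ / Z)] over exponentiated scores. The function name and
| the #[code is_gold] argument are hypothetical, and subtracting the
| maximum score is a standard stability trick added for this example.

```python
import math


def multilabel_log_loss_grad(scores, is_gold):
    """Gradient of -log(gZ / Z) with respect to each score."""
    # Subtract the max score before exponentiating, for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    Z = sum(exps)
    gZ = sum(e for e, gold in zip(exps, is_gold) if gold)
    if gZ == 0.0:
        raise ValueError("at least one class must be gold")
    # Gold classes get both terms, all other classes only the first one.
    return [
        (e / Z) - (e / gZ if gold else 0.0)
        for e, gold in zip(exps, is_gold)
    ]


d_scores = multilabel_log_loss_grad([2.0, 1.0, 0.5], [True, True, False])
```

p
| The gradient sums to zero: probability mass is pushed away from the
| non-gold classes and shared among the gold ones in proportion to their
| current scores.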
+bibliography
+item
| #[+a("https://www.semanticscholar.org/paper/Simple-and-Accurate-Dependency-Parsing-Using-Bidir-Kiperwasser-Goldberg/3cf31ecb2724b5088783d7c96a5fc0d5604cbf41") Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations]
br
| Eliyahu Kiperwasser, Yoav Goldberg. (2016)
+item
| #[+a("https://www.semanticscholar.org/paper/A-Dynamic-Oracle-for-Arc-Eager-Dependency-Parsing-Goldberg-Nivre/22697256ec19ecc3e14fcfc63624a44cf9c22df4") A Dynamic Oracle for Arc-Eager Dependency Parsing]
br
| Yoav Goldberg, Joakim Nivre (2012)
+item
| #[+a("https://explosion.ai/blog/parsing-english-in-python") Parsing English in 500 Lines of Python]
br
| Matthew Honnibal (2013)
+item
| #[+a("https://www.semanticscholar.org/paper/Stack-propagation-Improved-Representation-Learning-Zhang-Weiss/0c133f79b23e8c680891d2e49a66f0e3d37f1466") Stack-propagation: Improved Representation Learning for Syntax]
br
| Yuan Zhang, David Weiss (2016)
+item
| #[+a("https://www.semanticscholar.org/paper/Deep-multi-task-learning-with-low-level-tasks-supe-S%C3%B8gaard-Goldberg/03ad06583c9721855ccd82c3d969a01360218d86") Deep multi-task learning with low level tasks supervised at lower layers]
br
| Anders Søgaard, Yoav Goldberg (2016)
+item
| #[+a("https://www.semanticscholar.org/paper/An-Improved-Non-monotonic-Transition-System-for-De-Honnibal-Johnson/4094cee47ade13b77b5ab4d2e6cb9dd2b8a2917c") An Improved Non-monotonic Transition System for Dependency Parsing]
br
| Matthew Honnibal, Mark Johnson (2015)
+item
| #[+a("http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf") A Fast and Accurate Dependency Parser using Neural Networks]
br
| Danqi Chen, Christopher D. Manning (2014)
+item
| #[+a("https://www.semanticscholar.org/paper/Parsing-the-Wall-Street-Journal-using-a-Lexical-Fu-Riezler-King/0ad07862a91cd59b7eb5de38267e47725a62b8b2") Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques]
br
| Stefan Riezler et al. (2002)

View File

@ -0,0 +1,71 @@
//- 💫 DOCS > API > CYTHON > CLASSES > DOC
p
| The #[code Doc] object holds an array of
| #[+api("cython-structs#tokenc") #[code TokenC]] structs.
+infobox
| This section documents the extra C-level attributes and methods that
| can't be accessed from Python. For the Python documentation, see
| #[+api("doc") #[code Doc]].
+h(3, "doc_attributes") Attributes
+table(["Name", "Type", "Description"])
+row
+cell #[code mem]
+cell #[code cymem.Pool]
+cell
| A memory pool. Allocated memory will be freed once the
| #[code Doc] object is garbage collected.
+row
+cell #[code vocab]
+cell #[code Vocab]
+cell A reference to the shared #[code Vocab] object.
+row
+cell #[code c]
+cell #[code TokenC*]
+cell
| A pointer to a #[+api("cython-structs#tokenc") #[code TokenC]]
| struct.
+row
+cell #[code length]
+cell #[code int]
+cell The number of tokens in the document.
+row
+cell #[code max_length]
+cell #[code int]
+cell The underlying size of the #[code Doc.c] array.
+h(3, "doc_push_back") Doc.push_back
+tag method
p
| Append a token to the #[code Doc]. The token can be provided as a
| #[+api("cython-structs#lexemec") #[code LexemeC]] or
| #[+api("cython-structs#tokenc") #[code TokenC]] pointer, using Cython's
| #[+a("http://cython.readthedocs.io/en/latest/src/userguide/fusedtypes.html") fused types].
+aside-code("Example").
from spacy.tokens cimport Doc
from spacy.vocab cimport Vocab
doc = Doc(Vocab())
lexeme = doc.vocab.get(doc.mem, u'hello')
doc.push_back(lexeme, True)
assert doc.text == u'hello '
+table(["Name", "Type", "Description"])
+row
+cell #[code lex_or_tok]
+cell #[code LexemeOrToken]
+cell The word to append to the #[code Doc].
+row
+cell #[code has_space]
+cell #[code bint]
+cell Whether the word has trailing whitespace.

View File

@ -0,0 +1,30 @@
//- 💫 DOCS > API > CYTHON > CLASSES > LEXEME
p
| A Cython class providing access and methods for an entry in the
| vocabulary.
+infobox
| This section documents the extra C-level attributes and methods that
| can't be accessed from Python. For the Python documentation, see
| #[+api("lexeme") #[code Lexeme]].
+h(3, "lexeme_attributes") Attributes
+table(["Name", "Type", "Description"])
+row
+cell #[code c]
+cell #[code LexemeC*]
+cell
| A pointer to a #[+api("cython-structs#lexemec") #[code LexemeC]]
| struct.
+row
+cell #[code vocab]
+cell #[code Vocab]
+cell A reference to the shared #[code Vocab] object.
+row
+cell #[code orth]
+cell #[+abbr("uint64_t") #[code attr_t]]
+cell ID of the verbatim text content.

View File

@ -0,0 +1,200 @@
//- 💫 DOCS > API > CYTHON > STRUCTS > LEXEMEC
p
| Struct holding information about a lexical type. #[code LexemeC]
| structs are usually owned by the #[code Vocab], and accessed through a
| read-only pointer on the #[code TokenC] struct.
+aside-code("Example").
lex = doc.c[3].lex
+table(["Name", "Type", "Description"])
+row
+cell #[code flags]
+cell #[+abbr("uint64_t") #[code flags_t]]
+cell Bit-field for binary lexical flag values.
+row
+cell #[code id]
+cell #[+abbr("uint64_t") #[code attr_t]]
+cell
| Usually used to map lexemes to rows in a matrix, e.g. for word
| vectors. Does not need to be unique, so currently misnamed.
+row
+cell #[code length]
+cell #[+abbr("uint64_t") #[code attr_t]]
+cell Number of unicode characters in the lexeme.
+row
+cell #[code orth]
+cell #[+abbr("uint64_t") #[code attr_t]]
+cell ID of the verbatim text content.
+row
+cell #[code lower]
+cell #[+abbr("uint64_t") #[code attr_t]]
+cell ID of the lowercase form of the lexeme.
+row
+cell #[code norm]
+cell #[+abbr("uint64_t") #[code attr_t]]
+cell ID of the lexeme's norm, i.e. a normalised form of the text.
+row
+cell #[code shape]
+cell #[+abbr("uint64_t") #[code attr_t]]
+cell Transform of the lexeme's string, to show orthographic features.
+row
+cell #[code prefix]
+cell #[+abbr("uint64_t") #[code attr_t]]
+cell
| Length-N substring from the start of the lexeme. Defaults to
| #[code N=1].
+row
+cell #[code suffix]
+cell #[+abbr("uint64_t") #[code attr_t]]
+cell
| Length-N substring from the end of the lexeme. Defaults to
| #[code N=3].
+row
+cell #[code cluster]
+cell #[+abbr("uint64_t") #[code attr_t]]
+cell Brown cluster ID.
+row
+cell #[code prob]
+cell #[code float]
+cell Smoothed log probability estimate of the lexeme's type.
+row
+cell #[code sentiment]
+cell #[code float]
+cell A scalar value indicating positivity or negativity.
+h(3, "lexeme_get_struct_attr", "spacy/lexeme.pxd") Lexeme.get_struct_attr
+tag staticmethod
+tag nogil
p Get the value of an attribute from the #[code LexemeC] struct by attribute ID.
+aside-code("Example").
from spacy.attrs cimport IS_ALPHA
from spacy.lexeme cimport Lexeme
lexeme = doc.c[3].lex
is_alpha = Lexeme.get_struct_attr(lexeme, IS_ALPHA)
+table(["Name", "Type", "Description"])
+row
+cell #[code lex]
+cell #[code const LexemeC*]
+cell A pointer to a #[code LexemeC] struct.
+row
+cell #[code feat_name]
+cell #[code attr_id_t]
+cell
| The ID of the attribute to look up. The attributes are
| enumerated in #[code spacy.attrs].
+row("foot")
+cell returns
+cell #[+abbr("uint64_t") #[code attr_t]]
+cell The value of the attribute.
+h(3, "lexeme_set_struct_attr", "spacy/lexeme.pxd") Lexeme.set_struct_attr
+tag staticmethod
+tag nogil
p Set the value of an attribute of the #[code LexemeC] struct by attribute ID.
+aside-code("Example").
from spacy.attrs cimport NORM
from spacy.lexeme cimport Lexeme
lexeme = doc.c[3].lex
Lexeme.set_struct_attr(lexeme, NORM, lexeme.lower)
+table(["Name", "Type", "Description"])
+row
+cell #[code lex]
+cell #[code const LexemeC*]
+cell A pointer to a #[code LexemeC] struct.
+row
+cell #[code feat_name]
+cell #[code attr_id_t]
+cell
| The ID of the attribute to set. The attributes are
| enumerated in #[code spacy.attrs].
+row
+cell #[code value]
+cell #[+abbr("uint64_t") #[code attr_t]]
+cell The value to set.
+h(3, "lexeme_c_check_flag", "spacy/lexeme.pxd") Lexeme.c_check_flag
+tag staticmethod
+tag nogil
p Check the value of a binary flag attribute.
+aside-code("Example").
from spacy.attrs cimport IS_STOP
from spacy.lexeme cimport Lexeme
lexeme = doc.c[3].lex
is_stop = Lexeme.c_check_flag(lexeme, IS_STOP)
+table(["Name", "Type", "Description"])
+row
+cell #[code lexeme]
+cell #[code const LexemeC*]
+cell A pointer to a #[code LexemeC] struct.
+row
+cell #[code flag_id]
+cell #[code attr_id_t]
+cell
| The ID of the flag to look up. The flag IDs are enumerated in
| #[code spacy.attrs].
+row("foot")
+cell returns
+cell #[code bint]
+cell The boolean value of the flag.
+h(3, "lexeme_c_set_flag", "spacy/lexeme.pxd") Lexeme.c_set_flag
+tag staticmethod
+tag nogil
p Set the value of a binary flag attribute.
+aside-code("Example").
from spacy.attrs cimport IS_STOP
from spacy.lexeme cimport Lexeme
lexeme = doc.c[3].lex
Lexeme.c_set_flag(lexeme, IS_STOP, 0)
+table(["Name", "Type", "Description"])
+row
+cell #[code lexeme]
+cell #[code const LexemeC*]
+cell A pointer to a #[code LexemeC] struct.
+row
+cell #[code flag_id]
+cell #[code attr_id_t]
+cell
| The ID of the flag to set. The flag IDs are enumerated in
| #[code spacy.attrs].
+row
+cell #[code value]
+cell #[code bint]
+cell The value to set.
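p
| Because #[code LexemeC.flags] is a 64-bit bit-field, checking and
| setting a flag comes down to bit masking. The sketch below shows the
| idea in plain Python. The flag IDs are hypothetical values chosen for
| this example, and the helper functions only mirror the behaviour
| described above, they are not spaCy's C-level code.

```python
# Hypothetical flag IDs for this sketch; spaCy defines its own enumeration.
IS_ALPHA = 0
IS_STOP = 1

MASK_64 = (1 << 64) - 1  # keep the value within an unsigned 64-bit field


def check_flag(flags, flag_id):
    """Return True if the bit for flag_id is set."""
    return bool(flags & (1 << flag_id))


def set_flag(flags, flag_id, value):
    """Return a copy of flags with the bit for flag_id set or cleared."""
    if value:
        return (flags | (1 << flag_id)) & MASK_64
    return flags & ~(1 << flag_id) & MASK_64


flags = 0
flags = set_flag(flags, IS_STOP, True)
assert check_flag(flags, IS_STOP)
assert not check_flag(flags, IS_ALPHA)
flags = set_flag(flags, IS_STOP, False)
assert flags == 0
```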

View File

@ -0,0 +1,43 @@
//- 💫 DOCS > API > CYTHON > CLASSES > SPAN
p
| A Cython class providing access and methods for a slice of a #[code Doc]
| object.
+infobox
| This section documents the extra C-level attributes and methods that
| can't be accessed from Python. For the Python documentation, see
| #[+api("span") #[code Span]].
+h(3, "span_attributes") Attributes
+table(["Name", "Type", "Description"])
+row
+cell #[code doc]
+cell #[code Doc]
+cell The parent document.
+row
+cell #[code start]
+cell #[code int]
+cell The index of the first token of the span.
+row
+cell #[code end]
+cell #[code int]
+cell The index of the first token after the span.
+row
+cell #[code start_char]
+cell #[code int]
+cell The index of the first character of the span.
+row
+cell #[code end_char]
+cell #[code int]
+cell The index of the last character of the span.
+row
+cell #[code label]
+cell #[+abbr("uint64_t") #[code attr_t]]
+cell A label to attach to the span, e.g. for named entities.

View File

@ -0,0 +1,23 @@
//- 💫 DOCS > API > CYTHON > CLASSES > STRINGSTORE
p A lookup table to retrieve strings by 64-bit hashes.
+infobox
| This section documents the extra C-level attributes and methods that
| can't be accessed from Python. For the Python documentation, see
| #[+api("stringstore") #[code StringStore]].
+h(3, "stringstore_attributes") Attributes
+table(["Name", "Type", "Description"])
+row
+cell #[code mem]
+cell #[code cymem.Pool]
+cell
| A memory pool. Allocated memory will be freed once the
| #[code StringStore] object is garbage collected.
+row
+cell #[code keys]
+cell #[+abbr("vector[uint64_t]") #[code vector[hash_t]]]
+cell A list of hash values in the #[code StringStore].
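p
| The idea of looking strings up by 64-bit hash can be sketched in a
| few lines of Python. The class below is a toy stand-in with
| hypothetical names, and it uses #[code hashlib.blake2b] only because
| it is available in the standard library, which is an assumption of
| this example and not the hash function spaCy uses.

```python
import hashlib


def hash_string(string):
    """64-bit hash of a string (blake2b is a stand-in chosen for this sketch)."""
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class TinyStringStore:
    """Look up strings by 64-bit hash, and hashes by string."""

    def __init__(self):
        self._strings = {}
        self.keys = []  # hash values in insertion order

    def add(self, string):
        key = hash_string(string)
        if key not in self._strings:
            self._strings[key] = string
            self.keys.append(key)
        return key

    def __getitem__(self, key):
        return self._strings[key]


store = TinyStringStore()
key = store.add(u"hello")
assert store[key] == u"hello"
assert store.keys == [key]
```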

View File

@ -0,0 +1,73 @@
//- 💫 DOCS > API > CYTHON > CLASSES > TOKEN
p
| A Cython class providing access and methods for a
| #[+api("cython-structs#tokenc") #[code TokenC]] struct. Note that the
| #[code Token] object does not own the struct. It only receives a pointer
| to it.
+infobox
| This section documents the extra C-level attributes and methods that
| can't be accessed from Python. For the Python documentation, see
| #[+api("token") #[code Token]].
+h(3, "token_attributes") Attributes
+table(["Name", "Type", "Description"])
+row
+cell #[code vocab]
+cell #[code Vocab]
+cell A reference to the shared #[code Vocab] object.
+row
+cell #[code c]
+cell #[code TokenC*]
+cell
| A pointer to a #[+api("cython-structs#tokenc") #[code TokenC]]
| struct.
+row
+cell #[code i]
+cell #[code int]
+cell The offset of the token within the document.
+row
+cell #[code doc]
+cell #[code Doc]
+cell The parent document.
+h(3, "token_cinit") Token.cinit
+tag method
p Create a #[code Token] object from a #[code TokenC*] pointer.
+aside-code("Example").
token = Token.cinit(doc.vocab, &doc.c[3], 3, doc)
+table(["Name", "Type", "Description"])
+row
+cell #[code vocab]
+cell #[code Vocab]
+cell A reference to the shared #[code Vocab].
+row
+cell #[code c]
+cell #[code TokenC*]
+cell
| A pointer to a #[+api("cython-structs#tokenc") #[code TokenC]]
| struct.
+row
+cell #[code offset]
+cell #[code int]
+cell The offset of the token within the document.
+row
+cell #[code doc]
+cell #[code Doc]
+cell The parent document.
+row("foot")
+cell returns
+cell #[code Token]
+cell The newly constructed object.

View File

@ -0,0 +1,270 @@
//- 💫 DOCS > API > CYTHON > STRUCTS > TOKENC
p
| Cython data container for the #[code Token] object.
+aside-code("Example").
token = doc.c[3]
token_ptr = &doc.c[3]
+table(["Name", "Type", "Description"])
+row
+cell #[code lex]
+cell #[code const LexemeC*]
+cell A pointer to the lexeme for the token.
+row
+cell #[code morph]
+cell #[code uint64_t]
+cell An ID allowing lookup of morphological attributes.
+row
+cell #[code pos]
+cell #[code univ_pos_t]
+cell Coarse-grained part-of-speech tag.
+row
+cell #[code spacy]
+cell #[code bint]
+cell A binary value indicating whether the token has trailing whitespace.
+row
+cell #[code tag]
+cell #[+abbr("uint64_t") #[code attr_t]]
+cell Fine-grained part-of-speech tag.
+row
+cell #[code idx]
+cell #[code int]
+cell The character offset of the token within the parent document.
+row
+cell #[code lemma]
+cell #[+abbr("uint64_t") #[code attr_t]]
+cell Base form of the token, with no inflectional suffixes.
+row
+cell #[code sense]
+cell #[+abbr("uint64_t") #[code attr_t]]
+cell Space for storing a word sense ID, currently unused.
+row
+cell #[code head]
+cell #[code int]
+cell Offset of the syntactic parent relative to the token.
+row
+cell #[code dep]
+cell #[+abbr("uint64_t") #[code attr_t]]
+cell Syntactic dependency relation.
+row
+cell #[code l_kids]
+cell #[code uint32_t]
+cell Number of left children.
+row
+cell #[code r_kids]
+cell #[code uint32_t]
+cell Number of right children.
+row
+cell #[code l_edge]
+cell #[code uint32_t]
+cell Offset of the leftmost token of this token's syntactic descendants.
+row
+cell #[code r_edge]
+cell #[code uint32_t]
+cell Offset of the rightmost token of this token's syntactic descendants.
+row
+cell #[code sent_start]
+cell #[code int]
+cell
| Ternary value indicating whether the token is the first word of
| a sentence. #[code 0] indicates a missing value, #[code -1]
| indicates #[code False] and #[code 1] indicates #[code True]. The
| default value, #[code 0], is interpreted as no sentence break. Sentence
| boundary detectors will usually set #[code 0] for all tokens except
| tokens that follow a sentence boundary.
+row
+cell #[code ent_iob]
+cell #[code int]
+cell
| IOB code of named entity tag. #[code 0] indicates a missing
| value, #[code 1] indicates #[code I], #[code 2] indicates
| #[code O] and #[code 3] indicates #[code B].
+row
+cell #[code ent_type]
+cell #[+abbr("uint64_t") #[code attr_t]]
+cell Named entity type.
+row
+cell #[code ent_id]
+cell #[+abbr("uint64_t") #[code hash_t]]
+cell
| ID of the entity the token is an instance of, if any. Currently
| not used, but potentially for coreference resolution.
+h(3, "token_get_struct_attr", "spacy/tokens/token.pxd") Token.get_struct_attr
+tag staticmethod
+tag nogil
p Get the value of an attribute from the #[code TokenC] struct by attribute ID.
+aside-code("Example").
from spacy.attrs cimport IS_ALPHA
from spacy.tokens cimport Token
is_alpha = Token.get_struct_attr(&doc.c[3], IS_ALPHA)
+table(["Name", "Type", "Description"])
+row
+cell #[code token]
+cell #[code const TokenC*]
+cell A pointer to a #[code TokenC] struct.
+row
+cell #[code feat_name]
+cell #[code attr_id_t]
+cell
| The ID of the attribute to look up. The attributes are
| enumerated in #[code spacy.attrs].
+row("foot")
+cell returns
+cell #[+abbr("uint64_t") #[code attr_t]]
+cell The value of the attribute.
+h(3, "token_set_struct_attr", "spacy/tokens/token.pxd") Token.set_struct_attr
+tag staticmethod
+tag nogil
p Set the value of an attribute of the #[code TokenC] struct by attribute ID.
+aside-code("Example").
from spacy.attrs cimport TAG
from spacy.tokens cimport Token
token = &doc.c[3]
Token.set_struct_attr(token, TAG, 0)
+table(["Name", "Type", "Description"])
+row
+cell #[code token]
+cell #[code const TokenC*]
+cell A pointer to a #[code TokenC] struct.
+row
+cell #[code feat_name]
+cell #[code attr_id_t]
+cell
| The ID of the attribute to set. The attributes are
| enumerated in #[code spacy.attrs].
+row
+cell #[code value]
+cell #[+abbr("uint64_t") #[code attr_t]]
+cell The value to set.
+h(3, "token_by_start", "spacy/tokens/doc.pxd") token_by_start
+tag function
p Find a token in a #[code TokenC*] array by the offset of its first character.
+aside-code("Example").
from spacy.tokens.doc cimport Doc, token_by_start
from spacy.vocab cimport Vocab
doc = Doc(Vocab(), words=[u'hello', u'world'])
assert token_by_start(doc.c, doc.length, 6) == 1
assert token_by_start(doc.c, doc.length, 4) == -1
+table(["Name", "Type", "Description"])
+row
+cell #[code tokens]
+cell #[code const TokenC*]
+cell A #[code TokenC*] array.
+row
+cell #[code length]
+cell #[code int]
+cell The number of tokens in the array.
+row
+cell #[code start_char]
+cell #[code int]
+cell The start index to search for.
+row("foot")
+cell returns
+cell #[code int]
+cell The index of the token in the array or #[code -1] if not found.
+h(3, "token_by_end", "spacy/tokens/doc.pxd") token_by_end
+tag function
p Find a token in a #[code TokenC*] array by the offset of its final character.
+aside-code("Example").
from spacy.tokens.doc cimport Doc, token_by_end
from spacy.vocab cimport Vocab
doc = Doc(Vocab(), words=[u'hello', u'world'])
assert token_by_end(doc.c, doc.length, 5) == 0
assert token_by_end(doc.c, doc.length, 1) == -1
+table(["Name", "Type", "Description"])
+row
+cell #[code tokens]
+cell #[code const TokenC*]
+cell A #[code TokenC*] array.
+row
+cell #[code length]
+cell #[code int]
+cell The number of tokens in the array.
+row
+cell #[code end_char]
+cell #[code int]
+cell The end index to search for.
+row("foot")
+cell returns
+cell #[code int]
+cell The index of the token in the array or #[code -1] if not found.
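p
| The two lookups above can be sketched in plain Python over a list of
| #[code (idx, length)] pairs. This is an illustration with hypothetical
| names, assuming a simple linear scan and an end offset that points one
| past the final character, which is what the examples above imply.

```python
def token_by_start(tokens, start_char):
    """Return the index of the token starting at start_char, or -1."""
    for i, (idx, length) in enumerate(tokens):
        if idx == start_char:
            return i
    return -1


def token_by_end(tokens, end_char):
    """Return the index of the token ending at end_char, or -1."""
    for i, (idx, length) in enumerate(tokens):
        if idx + length == end_char:
            return i
    return -1


# (character offset, length) pairs for u'hello world'
tokens = [(0, 5), (6, 5)]
assert token_by_start(tokens, 6) == 1
assert token_by_start(tokens, 4) == -1
assert token_by_end(tokens, 5) == 0
assert token_by_end(tokens, 1) == -1
```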
+h(3, "set_children_from_heads", "spacy/tokens/doc.pxd") set_children_from_heads
+tag function
p
| Set attributes that allow lookup of syntactic children on a
| #[code TokenC*] array. This function must be called after making changes
| to the #[code TokenC.head] attribute, in order to make the parse tree
| navigation consistent.
+aside-code("Example").
from spacy.tokens.doc cimport Doc, set_children_from_heads
from spacy.vocab cimport Vocab
doc = Doc(Vocab(), words=[u'Baileys', u'from', u'a', u'shoe'])
# TokenC.head is an offset relative to the token itself
doc.c[0].head = 0
doc.c[1].head = -1
doc.c[2].head = 1
doc.c[3].head = -2
set_children_from_heads(doc.c, doc.length)
assert doc.c[3].l_kids == 1
+table(["Name", "Type", "Description"])
+row
+cell #[code tokens]
+cell #[code const TokenC*]
+cell A #[code TokenC*] array.
+row
+cell #[code length]
+cell #[code int]
+cell The number of tokens in the array.
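p
| To make the bookkeeping concrete, here is a small Python sketch that
| derives child counts and subtree edges from a list of heads. It
| assumes #[code head] stores the offset of the parent relative to the
| token, as described for #[code TokenC], and that the edges are stored
| as absolute token indices. The function name is hypothetical and the
| code is not spaCy's implementation.

```python
def children_from_heads(heads):
    """Compute l_kids, r_kids, l_edge and r_edge from relative head offsets."""
    n = len(heads)
    l_kids = [0] * n
    r_kids = [0] * n
    l_edge = list(range(n))  # every token starts as its own subtree
    r_edge = list(range(n))
    for child, offset in enumerate(heads):
        if offset < 0:
            r_kids[child + offset] += 1  # head is to the left of the child
        elif offset > 0:
            l_kids[child + offset] += 1  # head is to the right of the child
        # Walk up to the root, widening each ancestor's subtree boundaries.
        ancestor = child
        for _ in range(n):  # bound the walk in case the heads contain a cycle
            if heads[ancestor] == 0:
                break
            ancestor += heads[ancestor]
            l_edge[ancestor] = min(l_edge[ancestor], child)
            r_edge[ancestor] = max(r_edge[ancestor], child)
    return l_kids, r_kids, l_edge, r_edge


# Baileys <- from <- shoe <- a, written as offsets relative to each token.
l_kids, r_kids, l_edge, r_edge = children_from_heads([0, -1, 1, -2])
assert l_kids == [0, 0, 0, 1]
assert r_kids == [1, 1, 0, 0]
```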

View File

@ -0,0 +1,88 @@
//- 💫 DOCS > API > CYTHON > CLASSES > VOCAB
p
| A Cython class providing access and methods for a vocabulary and other
| data shared across a language.
+infobox
| This section documents the extra C-level attributes and methods that
| can't be accessed from Python. For the Python documentation, see
| #[+api("vocab") #[code Vocab]].
+h(3, "vocab_attributes") Attributes
+table(["Name", "Type", "Description"])
+row
+cell #[code mem]
+cell #[code cymem.Pool]
+cell
| A memory pool. Allocated memory will be freed once the
| #[code Vocab] object is garbage collected.
+row
+cell #[code strings]
+cell #[code StringStore]
+cell
| A #[code StringStore] that maps strings to hash values and vice
| versa.
+row
+cell #[code length]
+cell #[code int]
+cell The number of entries in the vocabulary.
+h(3, "vocab_get") Vocab.get
+tag method
p
| Retrieve a #[+api("cython-structs#lexemec") #[code LexemeC*]] pointer
| from the vocabulary.
+aside-code("Example").
lexeme = vocab.get(vocab.mem, u'hello')
+table(["Name", "Type", "Description"])
+row
+cell #[code mem]
+cell #[code cymem.Pool]
+cell
| A memory pool. Allocated memory will be freed once the
| #[code Vocab] object is garbage collected.
+row
+cell #[code string]
+cell #[code unicode]
+cell The string of the word to look up.
+row("foot")
+cell returns
+cell #[code const LexemeC*]
+cell The lexeme in the vocabulary.
+h(3, "vocab_get_by_orth") Vocab.get_by_orth
+tag method
p
| Retrieve a #[+api("cython-structs#lexemec") #[code LexemeC*]] pointer
| from the vocabulary by the ID of its verbatim text.
+aside-code("Example").
lexeme = vocab.get_by_orth(vocab.mem, doc.c[0].lex.norm)
+table(["Name", "Type", "Description"])
+row
+cell #[code mem]
+cell #[code cymem.Pool]
+cell
| A memory pool. Allocated memory will be freed once the
| #[code Vocab] object is garbage collected.
+row
+cell #[code orth]
+cell #[+abbr("uint64_t") #[code attr_t]]
+cell ID of the verbatim text content.
+row("foot")
+cell returns
+cell #[code const LexemeC*]
+cell The lexeme in the vocabulary.

View File

@ -33,6 +33,12 @@
"Vectors": "vectors",
"GoldParse": "goldparse",
"GoldCorpus": "goldcorpus"
},
"Cython": {
"Architecture": "cython",
"Structs": "cython-structs",
"Classes": "cython-classes"
}
},
@ -41,8 +47,7 @@
"next": "annotation",
"menu": {
"Basics": "basics",
"Neural Network Model": "nn-model",
"Cython Conventions": "cython"
"Neural Network Model": "nn-model"
}
},
@ -211,5 +216,36 @@
"Named Entities": "named-entities",
"Models & Training": "training"
}
},
"cython": {
"title": "Cython Architecture",
"next": "cython-structs",
"menu": {
"Overview": "overview",
"Conventions": "conventions"
}
},
"cython-structs": {
"title": "Cython Structs",
"teaser": "C-language objects that let you group variables together in a single contiguous block.",
"next": "cython-classes",
"menu": {
"TokenC": "tokenc",
"LexemeC": "lexemec"
}
},
"cython-classes": {
"title": "Cython Classes",
"menu": {
"Doc": "doc",
"Token": "token",
"Span": "span",
"Lexeme": "lexeme",
"Vocab": "vocab",
"StringStore": "stringstore"
}
}
}

View File

@ -280,7 +280,7 @@ p
+row
+cell #[code --n-iter], #[code -n]
+cell option
+cell Number of iterations (default: #[code 20]).
+cell Number of iterations (default: #[code 30]).
+row
+cell #[code --n-sents], #[code -ns]

View File

@ -0,0 +1,39 @@
//- 💫 DOCS > API > CYTHON > CLASSES
include ../_includes/_mixins
+section("doc")
+h(2, "doc", "spacy/tokens/doc.pxd") Doc
+tag cdef class
include _cython/_doc
+section("token")
+h(2, "token", "spacy/tokens/token.pxd") Token
+tag cdef class
include _cython/_token
+section("span")
+h(2, "span", "spacy/tokens/span.pxd") Span
+tag cdef class
include _cython/_span
+section("lexeme")
+h(2, "lexeme", "spacy/lexeme.pxd") Lexeme
+tag cdef class
include _cython/_lexeme
+section("vocab")
+h(2, "vocab", "spacy/vocab.pxd") Vocab
+tag cdef class
include _cython/_vocab
+section("stringstore")
+h(2, "stringstore", "spacy/strings.pxd") StringStore
+tag cdef class
include _cython/_stringstore

View File

@ -0,0 +1,15 @@
//- 💫 DOCS > API > CYTHON > STRUCTS
include ../_includes/_mixins
+section("tokenc")
+h(2, "tokenc", "spacy/structs.pxd") TokenC
+tag C struct
include _cython/_tokenc
+section("lexemec")
+h(2, "lexemec", "spacy/structs.pxd") LexemeC
+tag C struct
include _cython/_lexemec

176
website/api/cython.jade Normal file
View File

@ -0,0 +1,176 @@
//- 💫 DOCS > API > CYTHON > ARCHITECTURE
include ../_includes/_mixins
+section("overview")
+aside("What's Cython?")
| #[+a("http://cython.org/") Cython] is a language for writing
| C extensions for Python. Most Python code is also valid Cython, but
| you can add type declarations to get efficient memory-managed code
| just like C or C++.
p
| This section documents spaCy's C-level data structures and
| interfaces, intended for use from Cython. Some of the attributes are
| primarily for internal use, and all C-level functions and methods are
| designed for speed over safety: if you make a mistake and access an
| array out-of-bounds, the program may crash abruptly.
p
| With Cython there are four ways of declaring complex data types.
| Unfortunately we use all four in different places, as they all have
| different utility:
+table(["Declaration", "Description", "Example"])
+row
+cell #[code class]
+cell A normal Python class.
+cell #[+api("language") #[code Language]]
+row
+cell #[code cdef class]
+cell
| A Python extension type. Differs from a normal Python class
| in that its attributes can be defined on the underlying
| struct. Can have C-level objects as attributes (notably
| structs and pointers), and can have methods which have
| C-level objects as arguments or return types.
+cell #[+api("cython-classes#lexeme") #[code Lexeme]]
+row
+cell #[code cdef struct]
+cell
| A struct is just a collection of variables, sort of like a
| named tuple, except the memory is contiguous. Structs can't
| have methods, only attributes.
+cell #[+api("cython-structs#lexemec") #[code LexemeC]]
+row
+cell #[code cdef cppclass]
+cell
| A C++ class. Like a struct, this can be allocated on the
| stack, but can have methods, a constructor and a destructor.
| Differs from #[code cdef class] in that it can be created and
| destroyed without acquiring the Python global interpreter
| lock. This style is the most obscure.
+cell #[+src(gh("spacy", "spacy/syntax/_state.pxd")) #[code StateC]]
p
| The most important classes in spaCy are defined as #[code cdef class]
| objects. The underlying data for these objects is usually gathered
| into a struct, which is usually named #[code c]. For instance, the
| #[+api("cython-classes#lexeme") #[code Lexeme]] class holds a
| #[+api("cython-structs#lexemec") #[code LexemeC]] struct, at
| #[code Lexeme.c]. This lets you shed the Python container, and pass
| a pointer to the underlying data into C-level functions.
+section("conventions")
+h(2, "conventions") Conventions
p
| spaCy's core data structures are implemented as
| #[+a("http://cython.org/") Cython] #[code cdef] classes. Memory is
| managed through the #[code Pool] class of the
| #[+a(gh("cymem")) #[code cymem]] library, which allows you
| to allocate memory which will be freed when the #[code Pool] object
| is garbage collected. This means you usually don't have to worry
| about freeing memory. You just have to decide which Python object
| owns the memory, and make it own the #[code Pool]. When that object
| goes out of scope, the memory will be freed. You do have to take
| care that no pointers outlive the object that owns them — but this
| is generally quite easy.
p
| All Cython modules should have the #[code # cython: infer_types=True]
| compiler directive at the top of the file. This makes the code much
| cleaner, as it avoids the need for many type declarations. If
| possible, you should prefer to declare your functions #[code nogil],
| even if you don't especially care about multi-threading. The reason
| is that #[code nogil] functions help the Cython compiler reason about
| your code quite a lot — you're telling the compiler that no Python
| dynamics are possible. This lets many errors be raised, and ensures
| your function will run at C speed.
p
| Cython gives you many choices of sequences: you could have a Python
| list, a numpy array, a memory view, a C++ vector, or a pointer.
| Pointers are preferred, because they are fastest, have the most
| explicit semantics, and let the compiler check your code more
| strictly. C++ vectors are also great — but you should only use them
| internally in functions. It's less friendly to accept a vector as an
| argument, because that asks the user to do much more work. Here's
| how to get a pointer from a numpy array, memory view or vector:
+code.
cdef void get_pointers(np.ndarray[int, mode='c'] numpy_array, vector[int] cpp_vector, int[::1] memory_view) nogil:
pointer1 = &lt;int*&gt;numpy_array.data
pointer2 = cpp_vector.data()
pointer3 = &memory_view[0]
p
| Both C arrays and C++ vectors reassure the compiler that no Python
| operations are possible on your variable. This is a big advantage:
| it lets the Cython compiler raise many more errors for you.
p
| When getting a pointer from a numpy array or memoryview, take care
| that the data is actually stored in C-contiguous order — otherwise
| you'll get a pointer to nonsense. The type-declarations in the code
| above should generate runtime errors if buffers with incorrect
| memory layouts are passed in. To iterate over the array, the
| following style is preferred:
+code.
cdef int c_total(const int* int_array, int length) nogil:
total = 0
for item in int_array[:length]:
total += item
return total
p
| If this is confusing, consider that the compiler couldn't deal with
| #[code for item in int_array:] — there's no length attached to a raw
| pointer, so how could we figure out where to stop? The length is
| provided in the slice notation as a solution to this. Note that we
| don't have to declare the type of #[code item] in the code above —
| the compiler can easily infer it. This gives us tidy code that looks
| quite like Python, but is exactly as fast as C — because we've made
| sure the compilation to C is trivial.
p
| Your functions cannot be declared #[code nogil] if they need to
| create Python objects or call Python functions. This is perfectly
| okay — you shouldn't torture your code just to get #[code nogil]
| functions. However, if your function isn't #[code nogil], you should
| compile your module with #[code cython -a --cplus my_module.pyx] and
| open the resulting #[code my_module.html] file in a browser. This
| will let you see how Cython is compiling your code. Calls into the
| Python run-time will be in bright yellow. This lets you easily see
| whether Cython is able to correctly type your code, or whether there
| are unexpected problems.
p
| Working in Cython is very rewarding once you're over the initial
| learning curve. As with C and C++, the first way you write something
| in Cython will often be the performance-optimal approach. In
| contrast, Python optimisation generally requires a lot of
| experimentation. Is it faster to have an #[code if item in my_dict]
| check, or to use #[code .get()]? What about
| #[code try]/#[code except]? Does this numpy operation create a copy?
| There's no way to guess the answers to these questions, and you'll
| usually be dissatisfied with your results — so there's no way to
| know when to stop this process. In the worst case, you'll make a
| mess that invites the next reader to try their luck too. This is
| like one of those
| #[+a("http://www.wemjournal.org/article/S1080-6032%2809%2970088-2/abstract") volcanic gas-traps],
| where the rescuers keep passing out from low oxygen, causing
| another rescuer to follow — only to succumb themselves. In short,
| just say no to optimizing your Python. If it's not fast enough the
| first time, just switch to Cython.
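p
| To make the dictionary question above concrete, here is a minimal
| sketch of the three lookup styles and one way to time them with the
| standard #[code timeit] module. The dictionary contents, the function
| names and the iteration count are all assumptions made for
| illustration, and the timings will vary by interpreter and machine.

```python
import timeit

# Hypothetical data, only here so the three variants have something to look up.
counts = {"apple": 3, "pear": 5}


def with_membership(key):
    # Check first, then index: two hash lookups when the key exists.
    if key in counts:
        return counts[key]
    return 0


def with_get(key):
    # A single method call with a default value.
    return counts.get(key, 0)


def with_try(key):
    # Index directly and fall back to the default on KeyError.
    try:
        return counts[key]
    except KeyError:
        return 0


if __name__ == "__main__":
    for func in (with_membership, with_get, with_try):
        for key in ("apple", "missing"):
            seconds = timeit.timeit(lambda: func(key), number=200000)
            print(func.__name__, key, round(seconds, 4))
```

p
| All three functions return the same values, so any difference is purely
| a matter of speed, and the ranking can flip depending on how often the
| key is missing. That is the kind of experimentation the paragraph above
| warns about.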
+infobox("Resources")
+list.o-no-block
+item #[+a("http://docs.cython.org/en/latest/") Official Cython documentation] (cython.org)
+item #[+a("https://explosion.ai/blog/writing-c-in-cython", true) Writing C in Cython] (explosion.ai)
+item #[+a("https://explosion.ai/blog/multithreading-with-cython") Multi-threading spaCy's parser and named entity recogniser] (explosion.ai)

View File

@ -7,8 +7,151 @@ include ../_includes/_mixins
+section("nn-model")
+h(2, "nn-model") Neural network model architecture
include _architecture/_nn-model
+section("cython")
+h(2, "cython") Cython conventions
include _architecture/_cython
p
| spaCy's statistical models have been custom-designed to give a
| high-performance mix of speed and accuracy. The current architecture
| hasn't been published yet, but in the meantime we prepared a video that
| explains how the models work, with particular focus on NER.
+youtube("sqDHBH9IjRU")
p
| The parsing model is a blend of recent results. The two main recent
| inspirations have been the work of Eli Kiperwasser and Yoav Goldberg at
| Bar Ilan#[+fn(1)], and the SyntaxNet team from Google. The foundation of
| the parser is still based on the work of Joakim Nivre#[+fn(2)], who
| introduced the transition-based framework#[+fn(3)], the arc-eager
| transition system, and the imitation learning objective. The model is
| implemented using #[+a(gh("thinc")) Thinc], spaCy's machine learning
| library. We first predict context-sensitive vectors for each word in the
| input:
+code.
(embed_lower | embed_prefix | embed_suffix | embed_shape)
&gt;&gt; Maxout(token_width)
&gt;&gt; convolution ** 4
p
| This convolutional layer is shared between the tagger, parser and NER,
| and will also be shared by the future neural lemmatizer. Because the
| parser shares these layers with the tagger, the parser does not require
| tag features. I got this trick from David Weiss's "Stack-propagation"
| paper#[+fn(4)].
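p
| The operators in the model definition above can be read as function
| combinators: #[code |] concatenates the outputs of several layers,
| #[code &gt;&gt;] chains layers, and #[code **] repeats a layer. The sketch
| below illustrates that reading in plain Python. It is not Thinc's
| implementation: every layer here (#[code toy_lower], #[code halve],
| #[code make_add_one] and so on) is a hypothetical stand-in with no
| learned weights.

```python
def chain(*layers):
    # `a >> b`: feed the output of one layer into the next.
    def forward(x):
        for layer in layers:
            x = layer(x)
        return x
    return forward


def concatenate(*layers):
    # `a | b`: run every layer on the same input and join the outputs.
    def forward(x):
        out = []
        for layer in layers:
            out.extend(layer(x))
        return out
    return forward


def clone(make_layer, n):
    # `layer ** n`: chain n separately created copies of a layer.
    return chain(*[make_layer() for _ in range(n)])


# Toy "embeddings": each maps a word to a one-element feature vector.
def toy_lower(word):
    return [float(len(word.lower()))]


def toy_prefix(word):
    return [float(word[:1].isupper())]


def toy_suffix(word):
    return [float(word.endswith("s"))]


def toy_shape(word):
    return [float(sum(ch.isdigit() for ch in word))]


def halve(vec):
    # Stand-in for the Maxout layer: a fixed element-wise transform.
    return [0.5 * v for v in vec]


def make_add_one():
    # Stand-in for one convolution layer; a new function per copy.
    return lambda vec: [v + 1.0 for v in vec]


model = chain(
    concatenate(toy_lower, toy_prefix, toy_suffix, toy_shape),
    halve,
    clone(make_add_one, 4),
)

print(model("Apple"))  # [6.5, 4.5, 4.0, 4.0]
```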
p
| To boost the representation, the tagger actually predicts a "super tag"
| with POS, morphology and dependency label#[+fn(5)]. The tagger predicts
| these supertags by adding a softmax layer onto the convolutional layer,
| so we're teaching the convolutional layer to give us a representation
| that's one affine transform from this informative lexical information.
| This is obviously good for the parser (which backprops to the
| convolutions too). The parser model makes a state vector by concatenating
| the vector representations for its context tokens. The current context
| tokens:
+table
+row
+cell #[code S0], #[code S1], #[code S2]
+cell Top three words on the stack.
+row
+cell #[code B0], #[code B1]
+cell First two words of the buffer.
+row
+cell
| #[code S0L1], #[code S1L1], #[code S2L1], #[code B0L1],
| #[code B1L1]#[br]
| #[code S0L2], #[code S1L2], #[code S2L2], #[code B0L2],
| #[code B1L2]
+cell
| Leftmost and second leftmost children of #[code S0], #[code S1],
| #[code S2], #[code B0] and #[code B1].
+row
+cell
| #[code S0R1], #[code S1R1], #[code S2R1], #[code B0R1],
| #[code B1R1]#[br]
| #[code S0R2], #[code S1R2], #[code S2R2], #[code B0R2],
| #[code B1R2]
+cell
| Rightmost and second rightmost children of #[code S0], #[code S1],
| #[code S2], #[code B0] and #[code B1].
p
| This makes the state vector quite long: #[code 13*T], where #[code T] is
| the token vector width (128 is working well). Fortunately, there's a way
| to structure the computation to save some expense (and make it more
| GPU-friendly).
p
| The parser typically visits #[code 2*N] states for a sentence of length
| #[code N] (although it may visit more, if it back-tracks with a
| non-monotonic transition#[+fn(6)]). A naive implementation would require
| #[code 2*N (B, 13*T) @ (13*T, H)] matrix multiplications for a batch of
| size #[code B]. We can instead perform one #[code (B*N, T) @ (T, 13*H)]
| multiplication, to pre-compute the hidden weights for each positional
| feature with respect to the words in the batch. (Note that our token
| vectors come from the CNN — so we can't play this trick over the
| vocabulary. That's how Stanford's NN parser#[+fn(7)] works — and why its
| model is so big.)
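p
| The pre-computation works because multiplying a concatenated state
| vector by the #[code (13*T, H)] weight matrix equals summing one
| #[code (T, H)] product per feature slot. The sketch below checks that
| identity in plain Python with tiny, randomly initialised matrices. The
| sizes, the helper names and the random context are assumptions for
| illustration, and missing context tokens (an empty stack, for example)
| are ignored.

```python
import random


def matvec(vec, matrix):
    # Row vector times matrix: (n,) @ (n, m) -> (m,)
    n_out = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec)))
            for j in range(n_out)]


random.seed(0)
T, H, F, N = 4, 3, 13, 6  # token width, hidden width, feature slots, words

# Token vectors, standing in for the CNN output for one sentence.
tokens = [[random.uniform(-1, 1) for _ in range(T)] for _ in range(N)]
# One (T, H) weight block per feature slot.
blocks = [[[random.uniform(-1, 1) for _ in range(H)] for _ in range(T)]
          for _ in range(F)]
# Stacking the blocks row-wise gives the (F*T, H) matrix of the naive version.
stacked = [row for block in blocks for row in block]


def naive_hidden(context):
    # Concatenate the F context vectors, then do one (F*T,) @ (F*T, H) product.
    state = [x for word in context for x in tokens[word]]
    return matvec(state, stacked)


# Pre-compute once per sentence: cache[word][slot] = tokens[word] @ blocks[slot].
cache = [[matvec(tok, block) for block in blocks] for tok in tokens]


def cached_hidden(context):
    # Each parser state now only sums F cached rows of width H.
    hidden = [0.0] * H
    for slot, word in enumerate(context):
        for j in range(H):
            hidden[j] += cache[word][slot][j]
    return hidden


# A made-up parser state: which word fills each of the F feature slots.
context = [random.randrange(N) for _ in range(F)]
print(naive_hidden(context))
print(cached_hidden(context))
```

p
| Both functions produce the same hidden vector up to floating-point
| rounding, but the cached version does no multiplications per state.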
p
| This pre-computation strategy allows a nice compromise between
| GPU-friendliness and implementation simplicity. The CNN and the wide
| lower layer are computed on the GPU, and then the precomputed hidden
| weights are moved to the CPU, before we start the transition-based
| parsing process. This makes a lot of things much easier. We don't have to
| worry about variable-length batch sizes, and we don't have to implement
| the dynamic oracle in CUDA to train.
p
| Currently the parser's loss function is multilabel log loss#[+fn(8)], as
| the dynamic oracle allows multiple states to be 0 cost. Its gradient for
| a gold class is as follows, where #[code Z] is the sum of
| #[code exp(score)] over all classes and #[code gZ] is the same sum over
| the gold classes only:
+code.
(exp(score) / Z) - (exp(score) / gZ)
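p
| As a worked example of the expression above, the following sketch
| computes the gradient for every class. It assumes that #[code Z] sums
| #[code exp(score)] over all classes, that #[code gZ] sums it over the
| gold (zero-cost) classes, and that the second term only applies to gold
| classes. The function name and the example scores are hypothetical.

```python
import math


def multilabel_log_loss_gradient(scores, is_gold):
    # Subtracting the maximum keeps exp() stable and leaves the ratios unchanged.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    Z = sum(exps)
    # Assumes at least one class is gold, so gZ is never zero.
    gZ = sum(e for e, gold in zip(exps, is_gold) if gold)
    return [e / Z - (e / gZ if gold else 0.0)
            for e, gold in zip(exps, is_gold)]


scores = [2.0, 1.0, 0.5]
is_gold = [True, True, False]  # two actions have zero cost
grad = multilabel_log_loss_gradient(scores, is_gold)
print(grad)
```

p
| Under these assumptions the gradient is negative for the gold classes,
| positive for the others, and sums to zero across all classes.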
+bibliography
+item
| #[+a("https://www.semanticscholar.org/paper/Simple-and-Accurate-Dependency-Parsing-Using-Bidir-Kiperwasser-Goldberg/3cf31ecb2724b5088783d7c96a5fc0d5604cbf41") Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations]
br
| Eliyahu Kiperwasser, Yoav Goldberg. (2016)
+item
| #[+a("https://www.semanticscholar.org/paper/A-Dynamic-Oracle-for-Arc-Eager-Dependency-Parsing-Goldberg-Nivre/22697256ec19ecc3e14fcfc63624a44cf9c22df4") A Dynamic Oracle for Arc-Eager Dependency Parsing]
br
| Yoav Goldberg, Joakim Nivre (2012)
+item
| #[+a("https://explosion.ai/blog/parsing-english-in-python") Parsing English in 500 Lines of Python]
br
| Matthew Honnibal (2013)
+item
| #[+a("https://www.semanticscholar.org/paper/Stack-propagation-Improved-Representation-Learning-Zhang-Weiss/0c133f79b23e8c680891d2e49a66f0e3d37f1466") Stack-propagation: Improved Representation Learning for Syntax]
br
| Yuan Zhang, David Weiss (2016)
+item
| #[+a("https://www.semanticscholar.org/paper/Deep-multi-task-learning-with-low-level-tasks-supe-S%C3%B8gaard-Goldberg/03ad06583c9721855ccd82c3d969a01360218d86") Deep multi-task learning with low level tasks supervised at lower layers]
br
| Anders Søgaard, Yoav Goldberg (2016)
+item
| #[+a("https://www.semanticscholar.org/paper/An-Improved-Non-monotonic-Transition-System-for-De-Honnibal-Johnson/4094cee47ade13b77b5ab4d2e6cb9dd2b8a2917c") An Improved Non-monotonic Transition System for Dependency Parsing]
br
| Matthew Honnibal, Mark Johnson (2015)
+item
| #[+a("http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf") A Fast and Accurate Dependency Parser using Neural Networks]
br
| Danqi Chen, Christopher D. Manning (2014)
+item
| #[+a("https://www.semanticscholar.org/paper/Parsing-the-Wall-Street-Journal-using-a-Lexical-Fu-Riezler-King/0ad07862a91cd59b7eb5de38267e47725a62b8b2") Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques]
br
| Stefan Riezler et al. (2002)

View File

@ -573,15 +573,15 @@ p The L2 norm of the token's vector representation.
+cell #[code ent_id]
+cell int
+cell
| ID of the entity the token is an instance of, if any. Usually
| assigned by patterns in the Matcher.
| ID of the entity the token is an instance of, if any. Currently
| not used, but potentially for coreference resolution.
+row
+cell #[code ent_id_]
+cell unicode
+cell
| ID of the entity the token is an instance of, if any. Usually
| assigned by patterns in the Matcher.
| ID of the entity the token is an instance of, if any. Currently
| not used, but potentially for coreference resolution.
+row
+cell #[code lemma]

View File

@ -231,3 +231,19 @@
border: none
text-align-last: center
width: 100%
//- Abbreviations
.o-abbr
+breakpoint(min, md)
cursor: help
border-bottom: 2px dotted $color-theme
padding-bottom: 3px
+breakpoint(max, sm)
&[data-tooltip]:before
content: none
&:after
content: " (" attr(aria-label) ")"
color: $color-subtle-dark

View File

@ -47,7 +47,10 @@ import initUniverse from './universe.vue.js';
*/
{
if (window.Juniper) {
new Juniper({ repo: 'ines/spacy-io-binder' });
new Juniper({
repo: 'ines/spacy-io-binder',
storageExpire: 60
});
}
}
@ -58,8 +61,13 @@ import initUniverse from './universe.vue.js';
const sectionAttr = 'data-section';
const navAttr = 'data-nav';
const activeClass = 'is-active';
const sidebarAttr = 'data-sidebar-active';
const sections = [...document.querySelectorAll(`[${navAttr}]`)];
const currentItem = document.querySelector(`[${sidebarAttr}]`);
if (window.inView) {
if (currentItem && Element.prototype.scrollIntoView && !inView.is(currentItem)) {
currentItem.scrollIntoView();
}
if (sections.length) { // highlight first item regardless
sections[0].classList.add(activeClass);
}
@ -69,6 +77,9 @@ import initUniverse from './universe.vue.js';
if (el) {
sections.forEach(el => el.classList.remove(activeClass));
el.classList.add(activeClass);
if (Element.prototype.scrollIntoView && !inView.is(el)) {
el.scrollIntoView();
}
}
});
}

File diff suppressed because one or more lines are too long

View File

@ -16,7 +16,7 @@ Prism.languages.json={property:/".*?"(?=\s*:)/gi,string:/"(?!:)(\\?[^"])*?"(?!:)
!function(a){var e=/\\([^a-z()[\]]|[a-z\*]+)/i,n={"equation-command":{pattern:e,alias:"regex"}};a.languages.latex={comment:/%.*/m,cdata:{pattern:/(\\begin\{((?:verbatim|lstlisting)\*?)\})([\w\W]*?)(?=\\end\{\2\})/,lookbehind:!0},equation:[{pattern:/\$(?:\\?[\w\W])*?\$|\\\((?:\\?[\w\W])*?\\\)|\\\[(?:\\?[\w\W])*?\\\]/,inside:n,alias:"string"},{pattern:/(\\begin\{((?:equation|math|eqnarray|align|multline|gather)\*?)\})([\w\W]*?)(?=\\end\{\2\})/,lookbehind:!0,inside:n,alias:"string"}],keyword:{pattern:/(\\(?:begin|end|ref|cite|label|usepackage|documentclass)(?:\[[^\]]+\])?\{)[^}]+(?=\})/,lookbehind:!0},url:{pattern:/(\\url\{)[^}]+(?=\})/,lookbehind:!0},headline:{pattern:/(\\(?:part|chapter|section|subsection|frametitle|subsubsection|paragraph|subparagraph|subsubparagraph|subsubsubparagraph)\*?(?:\[[^\]]+\])?\{)[^}]+(?=\}(?:\[[^\]]+\])?)/,lookbehind:!0,alias:"class-name"},"function":{pattern:e,alias:"selector"},punctuation:/[[\]{}&]/}}(Prism);
Prism.languages.makefile={comment:{pattern:/(^|[^\\])#(?:\\(?:\r\n|[\s\S])|.)*/,lookbehind:!0},string:/(["'])(?:\\(?:\r\n|[\s\S])|(?!\1)[^\\\r\n])*\1/,builtin:/\.[A-Z][^:#=\s]+(?=\s*:(?!=))/,symbol:{pattern:/^[^:=\r\n]+(?=\s*:(?!=))/m,inside:{variable:/\$+(?:[^(){}:#=\s]+|(?=[({]))/}},variable:/\$+(?:[^(){}:#=\s]+|\([@*%<^+?][DF]\)|(?=[({]))/,keyword:[/-include\b|\b(?:define|else|endef|endif|export|ifn?def|ifn?eq|include|override|private|sinclude|undefine|unexport|vpath)\b/,{pattern:/(\()(?:addsuffix|abspath|and|basename|call|dir|error|eval|file|filter(?:-out)?|findstring|firstword|flavor|foreach|guile|if|info|join|lastword|load|notdir|or|origin|patsubst|realpath|shell|sort|strip|subst|suffix|value|warning|wildcard|word(?:s|list)?)(?=[ \t])/,lookbehind:!0}],operator:/(?:::|[?:+!])?=|[|@]/,punctuation:/[:;(){}]/};
Prism.languages.markdown=Prism.languages.extend("markup",{}),Prism.languages.insertBefore("markdown","prolog",{blockquote:{pattern:/^>(?:[\t ]*>)*/m,alias:"punctuation"},code:[{pattern:/^(?: {4}|\t).+/m,alias:"keyword"},{pattern:/``.+?``|`[^`\n]+`/,alias:"keyword"}],title:[{pattern:/\w+.*(?:\r?\n|\r)(?:==+|--+)/,alias:"important",inside:{punctuation:/==+$|--+$/}},{pattern:/(^\s*)#+.+/m,lookbehind:!0,alias:"important",inside:{punctuation:/^#+|#+$/}}],hr:{pattern:/(^\s*)([*-])([\t ]*\2){2,}(?=\s*$)/m,lookbehind:!0,alias:"punctuation"},list:{pattern:/(^\s*)(?:[*+-]|\d+\.)(?=[\t ].)/m,lookbehind:!0,alias:"punctuation"},"url-reference":{pattern:/!?\[[^\]]+\]:[\t ]+(?:\S+|<(?:\\.|[^>\\])+>)(?:[\t ]+(?:"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\((?:\\.|[^)\\])*\)))?/,inside:{variable:{pattern:/^(!?\[)[^\]]+/,lookbehind:!0},string:/(?:"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\((?:\\.|[^)\\])*\))$/,punctuation:/^[\[\]!:]|[<>]/},alias:"url"},bold:{pattern:/(^|[^\\])(\*\*|__)(?:(?:\r?\n|\r)(?!\r?\n|\r)|.)+?\2/,lookbehind:!0,inside:{punctuation:/^\*\*|^__|\*\*$|__$/}},italic:{pattern:/(^|[^\\])([*_])(?:(?:\r?\n|\r)(?!\r?\n|\r)|.)+?\2/,lookbehind:!0,inside:{punctuation:/^[*_]|[*_]$/}},url:{pattern:/!?\[[^\]]+\](?:\([^\s)]+(?:[\t ]+"(?:\\.|[^"\\])*")?\)| ?\[[^\]\n]*\])/,inside:{variable:{pattern:/(!?\[)[^\]]+(?=\]$)/,lookbehind:!0},string:{pattern:/"(?:\\.|[^"\\])*"(?=\)$)/}}}}),Prism.languages.markdown.bold.inside.url=Prism.util.clone(Prism.languages.markdown.url),Prism.languages.markdown.italic.inside.url=Prism.util.clone(Prism.languages.markdown.url),Prism.languages.markdown.bold.inside.italic=Prism.util.clone(Prism.languages.markdown.italic),Prism.languages.markdown.italic.inside.bold=Prism.util.clone(Prism.languages.markdown.bold);
Prism.languages.python={"triple-quoted-string":{pattern:/"""[\s\S]+?"""|'''[\s\S]+?'''/,alias:"string"},comment:{pattern:/(^|[^\\])#.*/,lookbehind:!0},string:/("|')(?:\\?.)*?\1/,"function":{pattern:/((?:^|\s)def[ \t]+)[a-zA-Z_][a-zA-Z0-9_]*(?=\()/g,lookbehind:!0},"class-name":{pattern:/(\bclass\s+)[a-z0-9_]+/i,lookbehind:!0},keyword:/\b(?:as|assert|async|await|break|class|continue|def|del|elif|else|except|exec|finally|for|from|global|if|import|in|is|lambda|pass|print|raise|return|try|while|with|yield)\b/,"boolean":/\b(?:True|False|None)\b/,number:/\b-?(?:0[bo])?(?:(?:\d|0x[\da-f])[\da-f]*\.?\d*|\.\d+)(?:e[+-]?\d+)?j?\b/i,operator:/[-+%=]=?|!=|\*\*?=?|\/\/?=?|<[<=>]?|>[=>]?|[&|^~]|\b(?:or|and|not)\b/,punctuation:/[{}[\];(),.:]/,"constant":/\b[A-Z_]{2,}\b/};
Prism.languages.python={"triple-quoted-string":{pattern:/"""[\s\S]+?"""|'''[\s\S]+?'''/,alias:"string"},comment:{pattern:/(^|[^\\])#.*/,lookbehind:!0},string:/("|')(?:\\?.)*?\1/,"function":{pattern:/((?:^|\s)def[ \t]+)[a-zA-Z_][a-zA-Z0-9_]*(?=\()/g,lookbehind:!0},"class-name":{pattern:/(\bclass\s+)[a-z0-9_]+/i,lookbehind:!0},keyword:/\b(?:as|assert|async|await|break|class|continue|def|del|elif|else|except|exec|finally|for|from|global|if|import|in|is|lambda|pass|print|raise|return|try|while|with|yield|cimport)\b/,"boolean":/\b(?:True|False|None)\b/,number:/\b-?(?:0[bo])?(?:(?:\d|0x[\da-f])[\da-f]*\.?\d*|\.\d+)(?:e[+-]?\d+)?j?\b/i,operator:/[-+%=]=?|!=|\*\*?=?|\/\/?=?|<[<=>]?|>[=>]?|[&|^~]|\b(?:or|and|not)\b/,punctuation:/[{}[\];(),.:]/,"constant":/\b[A-Z_]{2,}\b/};
Prism.languages.rest={table:[{pattern:/(\s*)(?:\+[=-]+)+\+(?:\r?\n|\r)(?:\1(?:[+|].+)+[+|](?:\r?\n|\r))+\1(?:\+[=-]+)+\+/,lookbehind:!0,inside:{punctuation:/\||(?:\+[=-]+)+\+/}},{pattern:/(\s*)(?:=+ +)+=+((?:\r?\n|\r)\1.+)+(?:\r?\n|\r)\1(?:=+ +)+=+(?=(?:\r?\n|\r){2}|\s*$)/,lookbehind:!0,inside:{punctuation:/[=-]+/}}],"substitution-def":{pattern:/(^\s*\.\. )\|(?:[^|\s](?:[^|]*[^|\s])?)\| [^:]+::/m,lookbehind:!0,inside:{substitution:{pattern:/^\|(?:[^|\s]|[^|\s][^|]*[^|\s])\|/,alias:"attr-value",inside:{punctuation:/^\||\|$/}},directive:{pattern:/( +)[^:]+::/,lookbehind:!0,alias:"function",inside:{punctuation:/::$/}}}},"link-target":[{pattern:/(^\s*\.\. )\[[^\]]+\]/m,lookbehind:!0,alias:"string",inside:{punctuation:/^\[|\]$/}},{pattern:/(^\s*\.\. )_(?:`[^`]+`|(?:[^:\\]|\\.)+):/m,lookbehind:!0,alias:"string",inside:{punctuation:/^_|:$/}}],directive:{pattern:/(^\s*\.\. )[^:]+::/m,lookbehind:!0,alias:"function",inside:{punctuation:/::$/}},comment:{pattern:/(^\s*\.\.)(?:(?: .+)?(?:(?:\r?\n|\r).+)+| .+)(?=(?:\r?\n|\r){2}|$)/m,lookbehind:!0},title:[{pattern:/^(([!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~])\2+)(?:\r?\n|\r).+(?:\r?\n|\r)\1$/m,inside:{punctuation:/^[!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]+|[!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]+$/,important:/.+/}},{pattern:/(^|(?:\r?\n|\r){2}).+(?:\r?\n|\r)([!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~])\2+(?=\r?\n|\r|$)/,lookbehind:!0,inside:{punctuation:/[!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]+$/,important:/.+/}}],hr:{pattern:/((?:\r?\n|\r){2})([!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~])\2{3,}(?=(?:\r?\n|\r){2})/,lookbehind:!0,alias:"punctuation"},field:{pattern:/(^\s*):[^:\r\n]+:(?= )/m,lookbehind:!0,alias:"attr-name"},"command-line-option":{pattern:/(^\s*)(?:[+-][a-z\d]|(?:\-\-|\/)[a-z\d-]+)(?:[ =](?:[a-z][a-z\d_-]*|<[^<>]+>))?(?:, (?:[+-][a-z\d]|(?:\-\-|\/)[a-z\d-]+)(?:[ =](?:[a-z][a-z\d_-]*|<[^<>]+>))?)*(?=(?:\r?\n|\r)? 
{2,}\S)/im,lookbehind:!0,alias:"symbol"},"literal-block":{pattern:/::(?:\r?\n|\r){2}([ \t]+).+(?:(?:\r?\n|\r)\1.+)*/,inside:{"literal-block-punctuation":{pattern:/^::/,alias:"punctuation"}}},"quoted-literal-block":{pattern:/::(?:\r?\n|\r){2}([!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]).*(?:(?:\r?\n|\r)\1.*)*/,inside:{"literal-block-punctuation":{pattern:/^(?:::|([!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~])\1*)/m,alias:"punctuation"}}},"list-bullet":{pattern:/(^\s*)(?:[*+\-•‣⁃]|\(?(?:\d+|[a-z]|[ivxdclm]+)\)|(?:\d+|[a-z]|[ivxdclm]+)\.)(?= )/im,lookbehind:!0,alias:"punctuation"},"doctest-block":{pattern:/(^\s*)>>> .+(?:(?:\r?\n|\r).+)*/m,lookbehind:!0,inside:{punctuation:/^>>>/}},inline:[{pattern:/(^|[\s\-:\/'"<(\[{])(?::[^:]+:`.*?`|`.*?`:[^:]+:|(\*\*?|``?|\|)(?!\s).*?[^\s]\2(?=[\s\-.,:;!?\\\/'")\]}]|$))/m,lookbehind:!0,inside:{bold:{pattern:/(^\*\*).+(?=\*\*$)/,lookbehind:!0},italic:{pattern:/(^\*).+(?=\*$)/,lookbehind:!0},"inline-literal":{pattern:/(^``).+(?=``$)/,lookbehind:!0,alias:"symbol"},role:{pattern:/^:[^:]+:|:[^:]+:$/,alias:"function",inside:{punctuation:/^:|:$/}},"interpreted-text":{pattern:/(^`).+(?=`$)/,lookbehind:!0,alias:"attr-value"},substitution:{pattern:/(^\|).+(?=\|$)/,lookbehind:!0,alias:"attr-value"},punctuation:/\*\*?|``?|\|/}}],link:[{pattern:/\[[^\]]+\]_(?=[\s\-.,:;!?\\\/'")\]}]|$)/,alias:"string",inside:{punctuation:/^\[|\]_$/}},{pattern:/(?:\b[a-z\d](?:[_.:+]?[a-z\d]+)*_?_|`[^`]+`_?_|_`[^`]+`)(?=[\s\-.,:;!?\\\/'")\]}]|$)/i,alias:"string",inside:{punctuation:/^_?`|`$|`?_?_$/}}],punctuation:{pattern:/(^\s*)(?:\|(?= |$)|(?:---?|—|\.\.|__)(?= )|\.\.$)/m,lookbehind:!0}};
!function(e){e.languages.sass=e.languages.extend("css",{comment:{pattern:/^([ \t]*)\/[\/*].*(?:(?:\r?\n|\r)\1[ \t]+.+)*/m,lookbehind:!0}}),e.languages.insertBefore("sass","atrule",{"atrule-line":{pattern:/^(?:[ \t]*)[@+=].+/m,inside:{atrule:/(?:@[\w-]+|[+=])/m}}}),delete e.languages.sass.atrule;var a=/((\$[-_\w]+)|(#\{\$[-_\w]+\}))/i,t=[/[+*\/%]|[=!]=|<=?|>=?|\b(?:and|or|not)\b/,{pattern:/(\s+)-(?=\s)/,lookbehind:!0}];e.languages.insertBefore("sass","property",{"variable-line":{pattern:/^[ \t]*\$.+/m,inside:{punctuation:/:/,variable:a,operator:t}},"property-line":{pattern:/^[ \t]*(?:[^:\s]+ *:.*|:[^:\s]+.*)/m,inside:{property:[/[^:\s]+(?=\s*:)/,{pattern:/(:)[^:\s]+/,lookbehind:!0}],punctuation:/:/,variable:a,operator:t,important:e.languages.sass.important}}}),delete e.languages.sass.property,delete e.languages.sass.important,delete e.languages.sass.selector,e.languages.insertBefore("sass","punctuation",{selector:{pattern:/([ \t]*)\S(?:,?[^,\r\n]+)*(?:,(?:\r?\n|\r)\1[ \t]+\S(?:,?[^,\r\n]+)*)*/,lookbehind:!0}})}(Prism);
Prism.languages.scss=Prism.languages.extend("css",{comment:{pattern:/(^|[^\\])(?:\/\*[\w\W]*?\*\/|\/\/.*)/,lookbehind:!0},atrule:{pattern:/@[\w-]+(?:\([^()]+\)|[^(])*?(?=\s+[{;])/,inside:{rule:/@[\w-]+/}},url:/(?:[-a-z]+-)*url(?=\()/i,selector:{pattern:/(?=\S)[^@;\{\}\(\)]?([^@;\{\}\(\)]|&|#\{\$[-_\w]+\})+(?=\s*\{(\}|\s|[^\}]+(:|\{)[^\}]+))/m,inside:{placeholder:/%[-_\w]+/}}}),Prism.languages.insertBefore("scss","atrule",{keyword:[/@(?:if|else(?: if)?|for|each|while|import|extend|debug|warn|mixin|include|function|return|content)/i,{pattern:/( +)(?:from|through)(?= )/,lookbehind:!0}]}),Prism.languages.insertBefore("scss","property",{variable:/\$[-_\w]+|#\{\$[-_\w]+\}/}),Prism.languages.insertBefore("scss","function",{placeholder:{pattern:/%[-_\w]+/,alias:"selector"},statement:/\B!(?:default|optional)\b/i,"boolean":/\b(?:true|false)\b/,"null":/\bnull\b/,operator:{pattern:/(\s)(?:[-+*\/%]|[=!]=|<=?|>=?|and|or|not)(?=\s)/,lookbehind:!0}}),Prism.languages.scss.atrule.inside.rest=Prism.util.clone(Prism.languages.scss);

View File

@ -76,6 +76,7 @@
},
"MODEL_LICENSES": {
"MIT": "https://opensource.org/licenses/MIT",
"CC BY 4.0": "https://creativecommons.org/licenses/by/4.0/",
"CC BY-SA": "https://creativecommons.org/licenses/by-sa/3.0/",
"CC BY-SA 3.0": "https://creativecommons.org/licenses/by-sa/3.0/",
@ -118,6 +119,8 @@
"he": "Hebrew",
"ar": "Arabic",
"fa": "Persian",
"ur": "Urdu",
"tt": "Tatar",
"ga": "Irish",
"bn": "Bengali",
"hi": "Hindi",

View File

@ -157,7 +157,13 @@ p
+infobox("Important note", "⚠️")
| This evaluation was conducted in 2015. We're working on benchmarks on
| current CPU and GPU hardware.
| current CPU and GPU hardware. In the meantime, we're grateful to the
| Stanford folks for drawing our attention to what seems
| to be #[+a("https://nlp.stanford.edu/software/tokenizer.html#Speed") a long-standing error]
| in our CoreNLP benchmarks, especially for their
| tokenizer. Until we run corrected experiments, we have updated the table
| using their figures.
+aside("Methodology")
| #[strong Set up:] 100,000 plain-text documents were streamed from an
@ -183,14 +189,14 @@ p
+row
+cell #[strong spaCy]
each data in [ "0.2ms", "1ms", "19ms"]
+cell("num") #[strong=data]
+cell("num")=data
each data in ["1x", "1x", "1x"]
+cell("num")=data
+row
+cell CoreNLP
each data in ["2ms", "10ms", "49ms", "10x", "10x", "2.6x"]
each data in ["0.18ms", "10ms", "49ms", "0.9x", "10x", "2.6x"]
+cell("num")=data
+row
+cell ZPar

View File

@ -354,7 +354,7 @@ p
string = ''.join(output)
string = string.replace('\n', '')
string = string.replace('\t', ' ')
return '&lt;pre&gt;{}&lt;/pre&gt;.format(string)
return '&lt;pre&gt;{}&lt;/pre&gt;'.format(string)
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"This is a test.\n\nHello world.")