Mirror of https://github.com/explosion/spaCy.git, synced 2025-01-05 23:06:28 +03:00

Merge branch 'master' into spacy.io

This commit is contained in: commit 4b61750985
New file: `.github/contributors/mmaybeno.md` (106 lines)

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
   object code, patch, tool, sample, graphic, specification, manual,
   documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
   registrations, in your contribution:

   * you hereby assign to us joint ownership, and to the extent that such
     assignment is or becomes invalid, ineffective or unenforceable, you hereby
     grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
     royalty-free, unrestricted license to exercise all rights under those
     copyrights. This includes, at our option, the right to sublicense these same
     rights to third parties through multiple levels of sublicensees or other
     licensing arrangements;

   * you agree that each of us can do all things in relation to your
     contribution as if each of us were the sole owners, and if one of us makes
     a derivative work of your contribution, the one who makes the derivative
     work (or has it made) will be the sole owner of that derivative work;

   * you agree that you will not assert any moral rights in your contribution
     against us, our licensees or transferees;

   * you agree that we may register a copyright in your contribution and
     exercise all ownership rights associated with it; and

   * you agree that neither of us has any duty to consult with, obtain the
     consent of, pay or render an accounting to the other for any use or
     distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
   to any third party, you hereby grant to us a perpetual, irrevocable,
   non-exclusive, worldwide, no-charge, royalty-free license to:

   * make, have made, use, sell, offer to sell, import, and otherwise transfer
     your contribution in whole or in part, alone or in combination with or
     included in any product, work or materials arising out of the project to
     which your contribution was submitted, and

   * at our option, to sublicense these same rights to third parties through
     multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
   contribution. The rights that you grant to us under these terms are effective
   on the date you first submitted a contribution to us, even if your submission
   took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

   * Each contribution that you submit is and shall be an original work of
     authorship and you can legally grant the rights set out in this SCA;

   * to the best of your knowledge, each contribution will not violate any
     third party's copyrights, trademarks, patents, or other intellectual
     property rights; and

   * each contribution shall be in compliance with U.S. export control laws and
     other applicable export and import laws. You agree to notify us if you
     become aware of any circumstance which would make any of the foregoing
     representations inaccurate in any respect. We may publicly disclose your
     participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
   U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
   mark both statements:

   * [x] I am signing on behalf of myself as an individual and no other person
     or entity, including my employer, has or will have rights with respect to my
     contributions.

   * [ ] I am signing on behalf of my employer or a legal entity and I have the
     actual authority to contractually bind that entity.

## Contributor Details

| Field                         | Entry        |
| ----------------------------- | ------------ |
| Name                          | Matt Maybeno |
| Company name (if applicable)  |              |
| Title or role (if applicable) |              |
| Date                          | 2019-11-19   |
| GitHub username               | mmaybeno     |
| Website (optional)            |              |
Other files changed in this commit:

```diff
@@ -73,7 +73,7 @@ cuda100 =
     cupy-cuda100>=5.0.0b4
 # Language tokenizers with external dependencies
 ja =
-    mecab-python3==0.7
+    fugashi>=0.1.3
 ko =
     natto-py==0.9.0
 th =
```
```diff
@@ -2,7 +2,7 @@
 from __future__ import unicode_literals

 from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, SCONJ, NUM, DET, ADV, ADP, X, VERB
-from ...symbols import NOUN, PROPN, PART, INTJ, PRON
+from ...symbols import NOUN, PROPN, PART, INTJ, PRON, AUX


 TAG_MAP = {
```
```diff
@@ -4249,4 +4249,20 @@ TAG_MAP = {
         "Voice": "Act",
         "Case": "Nom|Gen|Dat|Acc|Voc",
     },
+    'ADJ': {POS: ADJ},
+    'ADP': {POS: ADP},
+    'ADV': {POS: ADV},
+    'AtDf': {POS: DET},
+    'AUX': {POS: AUX},
+    'CCONJ': {POS: CCONJ},
+    'DET': {POS: DET},
+    'NOUN': {POS: NOUN},
+    'NUM': {POS: NUM},
+    'PART': {POS: PART},
+    'PRON': {POS: PRON},
+    'PROPN': {POS: PROPN},
+    'SCONJ': {POS: SCONJ},
+    'SYM': {POS: SYM},
+    'VERB': {POS: VERB},
+    'X': {POS: X},
 }
```
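The entries added above map plain Universal Dependencies tag names (plus the Greek-specific `AtDf`) straight to coarse POS symbols. A minimal sketch of what such an entry looks like in isolation, assuming a spaCy 2.x install (the surrounding dict is illustrative, not copied from the diff):

```python
# Each TAG_MAP key is a fine-grained tag; its value maps the POS symbol to a
# coarse Universal Dependencies part of speech, which spaCy's morphology uses
# to fill Token.pos when Token.tag is assigned.
from spacy.symbols import POS, DET, AUX

TAG_MAP = {
    "AtDf": {POS: DET},  # Greek definite article tag -> coarse DET
    "AUX": {POS: AUX},   # pass-through entry for a plain UD tag
}
```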
```diff
@@ -305,6 +305,9 @@ TAG_MAP = {
     "VERB__VerbForm=Ger": {"morph": "VerbForm=Ger", POS: VERB},
     "VERB__VerbForm=Inf": {"morph": "VerbForm=Inf", POS: VERB},
     "X___": {"morph": "_", POS: X},
+    "___PunctType=Quot": {POS: PUNCT},
+    "___VerbForm=Inf": {POS: VERB},
+    "___Number=Sing|Person=2|PronType=Prs": {POS: PRON},
     "_SP": {"morph": "_", POS: SPACE},
 }
 # fmt: on
```
```diff
@@ -12,21 +12,23 @@ from ...tokens import Doc
 from ...compat import copy_reg
 from ...util import DummyTokenizer

-ShortUnitWord = namedtuple("ShortUnitWord", ["surface", "lemma", "pos"])
-
-
-def try_mecab_import():
-    """Mecab is required for Japanese support, so check for it.
+# Handling for multiple spaces in a row is somewhat awkward, this simplifies
+# the flow by creating a dummy with the same interface.
+DummyNode = namedtuple("DummyNode", ["surface", "pos", "feature"])
+DummyNodeFeatures = namedtuple("DummyNodeFeatures", ["lemma"])
+DummySpace = DummyNode(' ', ' ', DummyNodeFeatures(' '))
+
+
+def try_fugashi_import():
+    """Fugashi is required for Japanese support, so check for it.

     It it's not available blow up and explain how to fix it."""
     try:
-        import MeCab
+        import fugashi

-        return MeCab
+        return fugashi
     except ImportError:
         raise ImportError(
-            "Japanese support requires MeCab: "
-            "https://github.com/SamuraiT/mecab-python3"
+            "Japanese support requires Fugashi: "
+            "https://github.com/polm/fugashi"
         )
```
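The `DummyNode`/`DummySpace` trick above is plain duck typing: extra whitespace is wrapped in a namedtuple exposing the same attributes (`surface`, `pos`, `feature.lemma`) as a real fugashi node, so downstream code needs no special case. A standalone sketch of the idea, reusing only the namedtuples from the diff:

```python
from collections import namedtuple

# A fake node with the attributes the tokenizer code reads from real nodes.
DummyNode = namedtuple("DummyNode", ["surface", "pos", "feature"])
DummyNodeFeatures = namedtuple("DummyNodeFeatures", ["lemma"])
DummySpace = DummyNode(" ", " ", DummyNodeFeatures(" "))

# Accessed exactly like a tokenizer node: both attributes are a single space.
print(repr(DummySpace.surface), repr(DummySpace.feature.lemma))
```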
```diff
@@ -39,7 +41,7 @@ def resolve_pos(token):
     """

     # this is only used for consecutive ascii spaces
-    if token.pos == "空白":
+    if token.surface == " ":
         return "空白"

     # TODO: This is a first take. The rules here are crude approximations.
```
```diff
@@ -53,55 +55,45 @@ def resolve_pos(token):
         return token.pos + ",ADJ"
     return token.pos


-def detailed_tokens(tokenizer, text):
-    """Format Mecab output into a nice data structure, based on Janome."""
-    node = tokenizer.parseToNode(text)
-    node = node.next  # first node is beginning of sentence and empty, skip it
+def get_words_and_spaces(tokenizer, text):
+    """Get the individual tokens that make up the sentence and handle white space.
+
+    Japanese doesn't usually use white space, and MeCab's handling of it for
+    multiple spaces in a row is somewhat awkward.
+    """
+
+    tokens = tokenizer.parseToNodeList(text)
+
     words = []
     spaces = []
-    while node.posid != 0:
-        surface = node.surface
-        base = surface  # a default value. Updated if available later.
-        parts = node.feature.split(",")
-        pos = ",".join(parts[0:4])
-        if len(parts) > 7:
-            # this information is only available for words in the tokenizer
-            # dictionary
-            base = parts[7]
-        words.append(ShortUnitWord(surface, base, pos))
-
-        # The way MeCab stores spaces is that the rlength of the next token is
-        # the length of that token plus any preceding whitespace, **in bytes**.
-        # also note that this is only for half-width / ascii spaces. Full width
-        # spaces just become tokens.
-        scount = node.next.rlength - node.next.length
-        spaces.append(bool(scount))
-        while scount > 1:
-            words.append(ShortUnitWord(" ", " ", "空白"))
+    for token in tokens:
+        # If there's more than one space, spaces after the first become tokens
+        for ii in range(len(token.white_space) - 1):
+            words.append(DummySpace)
             spaces.append(False)
-            scount -= 1
-
-        node = node.next
+
+        words.append(token)
+        spaces.append(bool(token.white_space))
     return words, spaces


 class JapaneseTokenizer(DummyTokenizer):
     def __init__(self, cls, nlp=None):
         self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
-        self.tokenizer = try_mecab_import().Tagger()
-        self.tokenizer.parseToNode("")  # see #2901
+        self.tokenizer = try_fugashi_import().Tagger()
+        self.tokenizer.parseToNodeList("")  # see #2901

     def __call__(self, text):
-        dtokens, spaces = detailed_tokens(self.tokenizer, text)
+        dtokens, spaces = get_words_and_spaces(self.tokenizer, text)
         words = [x.surface for x in dtokens]
         doc = Doc(self.vocab, words=words, spaces=spaces)
-        mecab_tags = []
+        unidic_tags = []
         for token, dtoken in zip(doc, dtokens):
-            mecab_tags.append(dtoken.pos)
+            unidic_tags.append(dtoken.pos)
             token.tag_ = resolve_pos(dtoken)
-            token.lemma_ = dtoken.lemma
-        doc.user_data["mecab_tags"] = mecab_tags
+            # if there's no lemma info (it's an unk) just use the surface
+            token.lemma_ = dtoken.feature.lemma or dtoken.surface
+        doc.user_data["unidic_tags"] = unidic_tags
         return doc
```
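In short, the Japanese tokenizer now drives fugashi instead of raw MeCab bindings, stores UniDic-style tags in `doc.user_data["unidic_tags"]`, and falls back to the surface form when a token has no lemma. A hedged usage sketch, assuming spaCy 2.2+ with `fugashi` and a UniDic dictionary installed (the sample sentence is the one from the language metadata further down):

```python
import spacy

nlp = spacy.blank("ja")  # builds the fugashi-backed JapaneseTokenizer
doc = nlp("これは文章です。")
for token in doc:
    # tag_ holds the resolved UniDic-style tag, lemma_ the dictionary lemma
    # (or the surface form for unknown words).
    print(token.text, token.tag_, token.lemma_)
print(doc.user_data["unidic_tags"])
```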
```diff
@@ -131,5 +123,4 @@ def pickle_japanese(instance):

 copy_reg.pickle(Japanese, pickle_japanese)

-
 __all__ = ["Japanese"]
```
```diff
@@ -5039,5 +5039,19 @@ TAG_MAP = {
     "punc": {POS: PUNCT},
     "v-pcp|M|P": {POS: VERB},
     "v-pcp|M|S": {POS: VERB},
+    "ADJ": {POS: ADJ},
+    "AUX": {POS: AUX},
+    "CCONJ": {POS: CCONJ},
+    "DET": {POS: DET},
+    "INTJ": {POS: INTJ},
+    "NUM": {POS: NUM},
+    "PART": {POS: PART},
+    "PRON": {POS: PRON},
+    "PUNCT": {POS: PUNCT},
+    "SCONJ": {POS: SCONJ},
+    "SYM": {POS: SYM},
+    "VERB": {POS: VERB},
+    "X": {POS: X},
+    "adv": {POS: ADV},
     "_SP": {POS: SPACE},
 }
```
```diff
@@ -125,7 +125,7 @@ def it_tokenizer():

 @pytest.fixture(scope="session")
 def ja_tokenizer():
-    pytest.importorskip("MeCab")
+    pytest.importorskip("fugashi")
     return get_lang_class("ja").Defaults.create_tokenizer()

```
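`pytest.importorskip` keeps the Japanese tests from failing on machines without the optional dependency: the fixture, and every test that requests it, is skipped unless `fugashi` can be imported. A hedged sketch of a test consuming the session-scoped fixture (the test itself is illustrative, not part of the diff):

```python
def test_ja_tokenizer_produces_doc(ja_tokenizer):
    # The fixture returns the fugashi-backed tokenizer; calling it yields a Doc.
    doc = ja_tokenizer("これは文章です。")
    assert len(doc) > 0
```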
```diff
@@ -3,7 +3,6 @@
 from __future__ import unicode_literals
 from libc.string cimport memcpy

-import numpy
 import srsly
 from collections import OrderedDict
 from thinc.neural.util import get_array_module
@@ -361,7 +360,8 @@ cdef class Vocab:
             minn = len(word)
         if maxn is None:
             maxn = len(word)
-        vectors = numpy.zeros((self.vectors_length,), dtype="f")
+        xp = get_array_module(self.vectors.data)
+        vectors = xp.zeros((self.vectors_length,), dtype="f")
         # Fasttext's ngram computation taken from
         # https://github.com/facebookresearch/fastText
         ngrams_size = 0;
@@ -381,7 +381,7 @@ cdef class Vocab:
                     j = j + 1
                 if (n >= minn and not (n == 1 and (i == 0 or j == len(word)))):
                     if self.strings[ngram] in self.vectors.key2row:
-                        vectors = numpy.add(self.vectors[self.strings[ngram]],vectors)
+                        vectors = xp.add(self.vectors[self.strings[ngram]], vectors)
                         ngrams_size += 1
                 n = n + 1
         if ngrams_size > 0:
```
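This change makes the subword-ngram fallback device-agnostic: instead of hard-coding `numpy`, the array module is looked up from the vectors data, so the same code path works when the vectors live on the GPU as CuPy arrays. A small sketch of the pattern, assuming thinc 7.x (which provides `thinc.neural.util.get_array_module`, as imported in the hunk above):

```python
import numpy
from thinc.neural.util import get_array_module

data = numpy.zeros((10, 300), dtype="f")  # would be a cupy.ndarray on GPU
xp = get_array_module(data)               # numpy here, cupy for GPU arrays
vector = xp.zeros((300,), dtype="f")
vector = xp.add(data[0], vector)          # stays on whichever device `data` uses
print(type(vector))
```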
```diff
@@ -123,7 +123,7 @@ The L2 norm of the lexeme's vector representation.
 ## Attributes {#attributes}

 | Name | Type | Description |
-| ----------- | ------- | ---------------------------------------------------------------- |
+| ----------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
 | `vocab` | `Vocab` | The lexeme's vocabulary. |
 | `text` | unicode | Verbatim text content. |
 | `orth` | int | ID of the verbatim text content. |
@@ -134,8 +134,8 @@ The L2 norm of the lexeme's vector representation.
 | `norm_` | unicode | The lexeme's norm, i.e. a normalized form of the lexeme text. |
 | `lower` | int | Lowercase form of the word. |
 | `lower_` | unicode | Lowercase form of the word. |
-| `shape` | int | Transform of the word's string, to show orthographic features. |
-| `shape_` | unicode | Transform of the word's string, to show orthographic features. |
+| `shape` | int | Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
+| `shape_` | unicode | Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
 | `prefix` | int | Length-N substring from the start of the word. Defaults to `N=1`. |
 | `prefix_` | unicode | Length-N substring from the start of the word. Defaults to `N=1`. |
 | `suffix` | int | Length-N substring from the end of the word. Defaults to `N=3`. |
```
```diff
@@ -409,7 +409,7 @@ The L2 norm of the token's vector representation.
 ## Attributes {#attributes}

 | Name | Type | Description |
-| ----------- | ------------ | ----------------------------------------------------------------------------- |
+| ----------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------ |
 | `doc` | `Doc` | The parent document. |
 | `sent` <Tag variant="new">2.0.12</Tag> | `Span` | The sentence span that this token is a part of. |
 | `text` | unicode | Verbatim text content. |
@@ -437,8 +437,8 @@ The L2 norm of the token's vector representation.
 | `norm_` | unicode | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
 | `lower` | int | Lowercase form of the token. |
 | `lower_` | unicode | Lowercase form of the token text. Equivalent to `Token.text.lower()`. |
-| `shape` | int | Transform of the token's string, to show orthographic features. For example, "Xxxx" or "dd". |
-| `shape_` | unicode | Transform of the token's string, to show orthographic features. For example, "Xxxx" or "dd". |
+| `shape` | int | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
+| `shape_` | unicode | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
 | `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. |
 | `prefix_` | unicode | A length-N substring from the start of the token. Defaults to `N=1`. |
 | `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. |
```
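A quick way to see the truncation rule the new description spells out, assuming a plain spaCy 2.x install (the sample words are my own):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Apple 12345 abcdefgh")
# Runs of the same shape character are cut off after 4, so long digit or
# letter runs all collapse to "dddd" / "xxxx".
print([t.shape_ for t in doc])  # ['Xxxxx', 'dddd', 'xxxx']
```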
````diff
@@ -638,7 +638,7 @@ punctuation – depending on the

 The `IS_DIGIT` flag is not very helpful here, because it doesn't tell us
 anything about the length. However, you can use the `SHAPE` flag, with each `d`
-representing a digit:
+representing a digit (up to 4 digits / characters):

 ```python
 [{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "dddd"},
@@ -654,7 +654,7 @@ match the most common formats of

 ```python
 [{"ORTH": "+"}, {"ORTH": "49"}, {"ORTH": "(", "OP": "?"}, {"SHAPE": "dddd"},
- {"ORTH": ")", "OP": "?"}, {"SHAPE": "dddddd"}]
+ {"ORTH": ")", "OP": "?"}, {"SHAPE": "dddd", "LENGTH": 6}]
 ```

 Depending on the formats your application needs to match, creating an extensive
````
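The docs fix matters because `"dddddd"` can never match: shapes never contain more than four consecutive `d`s, so a six-digit token has shape `"dddd"` and needs `LENGTH` to pin down the exact size. A hedged, self-contained sketch of that combination using the spaCy 2.x `Matcher` API (the pattern name and sample text are my own):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# SHAPE "dddd" matches any token of 4+ digits; LENGTH narrows it to exactly 6.
matcher.add("SIX_DIGITS", None, [{"SHAPE": "dddd", "LENGTH": 6}])

doc = nlp("ids: 1234 123456 12345678")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # 123456
```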
```diff
@@ -155,7 +155,8 @@
     "name": "Japanese",
     "dependencies": [
       { "name": "Unidic", "url": "http://unidic.ninjal.ac.jp/back_number#unidic_cwj" },
-      { "name": "Mecab", "url": "https://github.com/taku910/mecab" }
+      { "name": "Mecab", "url": "https://github.com/taku910/mecab" },
+      { "name": "fugashi", "url": "https://github.com/polm/fugashi" }
     ],
     "example": "これは文章です。",
     "has_examples": true
```