mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-26 09:56:28 +03:00
Merge branch 'spacy.io' into spacy.io-develop
This commit is contained in:
commit
414dc7ace1
106
.github/contributors/hertelm.md
vendored
Normal file
@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an "x" in one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                         | Entry           |
| ----------------------------- | --------------- |
| Name                          | Matthias Hertel |
| Company name (if applicable)  |                 |
| Title or role (if applicable) |                 |
| Date                          | June 29, 2020   |
| GitHub username               | hertelm         |
| Website (optional)            |                 |

@ -91,7 +91,7 @@ Match a stream of documents, yielding them in turn.
  > ```python
  > from spacy.matcher import PhraseMatcher
  > matcher = PhraseMatcher(nlp.vocab)
- > for doc in matcher.pipe(texts, batch_size=50):
+ > for doc in matcher.pipe(docs, batch_size=50):
  >     pass
  > ```

@ -122,7 +122,7 @@ for match_id, start, end in matches:
```

The matcher returns a list of `(match_id, start, end)` tuples – in this case,
- `[('15578876784678163569', 0, 2)]`, which maps to the span `doc[0:2]` of our
+ `[('15578876784678163569', 0, 3)]`, which maps to the span `doc[0:3]` of our
original document. The `match_id` is the [hash value](/usage/spacy-101#vocab) of
the string ID "HelloWorld". To get the string value, you can look up the ID in
the [`StringStore`](/api/stringstore).

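To make the tuple shape concrete, here is a minimal stand-alone sketch. The token list and the hash-to-string dict below are hypothetical stand-ins for the `Doc` and the `StringStore`:

```python
# Hypothetical stand-ins: a plain token list for the Doc and a dict for
# the StringStore's hash -> string-ID lookup.
tokens = ["Hello", ",", "world", "!"]
strings = {15578876784678163569: "HelloWorld"}

# a match list shaped like the matcher's output: (match_id, start, end)
matches = [(15578876784678163569, 0, 3)]

for match_id, start, end in matches:
    label = strings[match_id]   # string ID recovered from its hash value
    span = tokens[start:end]    # corresponds to the span doc[start:end]
```

Here `label` resolves to `"HelloWorld"` and `span` covers the first three tokens, mirroring `doc[0:3]`.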
@ -161,10 +161,18 @@ debugging your tokenizer configuration.
spaCy's custom warnings have been replaced with native Python
[`warnings`](https://docs.python.org/3/library/warnings.html). Instead of
setting `SPACY_WARNING_IGNORE`, use the
[`warnings` filters](https://docs.python.org/3/library/warnings.html#the-warnings-filter)
to manage warnings.

```diff
import spacy
+ import warnings

- spacy.errors.SPACY_WARNING_IGNORE.append('W007')
+ warnings.filterwarnings("ignore", message=r"\[W007\]", category=UserWarning)
```

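As a runnable illustration of the new approach (standard library only; the `[W007]` message below is a made-up stand-in for a real spaCy warning), a filter added this way suppresses only the matching warnings:

```python
import warnings

# Record warnings so the filter's effect is observable.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # ignore warnings whose message starts with "[W007]"
    warnings.filterwarnings("ignore", message=r"\[W007\]", category=UserWarning)
    warnings.warn("[W007] a suppressed warning", UserWarning)
    warnings.warn("some other warning", UserWarning)

# only the non-matching warning was recorded
```

Note that the `message` argument is a regex matched against the start of the warning message, which is why the bracket characters are escaped.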
#### Normalization tables

The normalization tables have moved from the language data in

@ -174,6 +182,65 @@ If you're adding data for a new language, the normalization table should be

added to `spacy-lookups-data`. See
[adding norm exceptions](/usage/adding-languages#norm-exceptions).

#### No preloaded vocab for models with vectors

To reduce the initial loading time, the lexemes in `nlp.vocab` are no longer
loaded on initialization for models with vectors. As you process texts, the
lexemes will be added to the vocab automatically, just as in small models
without vectors.

To see the number of unique vectors and the number of words with vectors, check
`nlp.meta['vectors']`. For example, `en_core_web_md` has `20000` unique vectors
and `684830` words with vectors:

```python
{
    'width': 300,
    'vectors': 20000,
    'keys': 684830,
    'name': 'en_core_web_md.vectors'
}
```

If required, for instance if you are working directly with word vectors rather
than processing texts, you can load all lexemes for words with vectors at once:

```python
for orth in nlp.vocab.vectors:
    _ = nlp.vocab[orth]
```

If your workflow previously iterated over `nlp.vocab`, a similar alternative
is to iterate over words with vectors instead:

```diff
- lexemes = [w for w in nlp.vocab]
+ lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
```

Be aware that the set of preloaded lexemes in a v2.2 model is not equivalent to
the set of words with vectors. For English, v2.2 `md/lg` models have 1.3M
provided lexemes but only 685K words with vectors. The vectors have been
updated for most languages in v2.2, but the English models contain the same
vectors in both v2.2 and v2.3.

#### Lexeme.is_oov and Token.is_oov

<Infobox title="Important note" variant="warning">

Due to a bug, the values for `is_oov` are reversed in v2.3.0, but this will be
fixed in the next patch release, v2.3.1.

</Infobox>

In v2.3, `Lexeme.is_oov` and `Token.is_oov` are `True` if the lexeme does not
have a word vector. This is equivalent to `token.orth not in
nlp.vocab.vectors`.
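The equivalence can be sketched in plain Python, using a set of orth IDs to stand in for `nlp.vocab.vectors` (the IDs below are made up for illustration):

```python
# Hypothetical sketch of the v2.3 semantics: a lexeme is OOV exactly when
# it has no word vector. A plain set stands in for nlp.vocab.vectors,
# whose keys are the orth IDs of words that have a vector.
vector_keys = {101, 202, 303}

def is_oov(orth: int) -> bool:
    # mirrors: token.orth not in nlp.vocab.vectors
    return orth not in vector_keys
```

With these stand-in IDs, `is_oov(101)` is `False` and `is_oov(999)` is `True`.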

Previously in v2.2, `is_oov` corresponded to whether a lexeme had stored
probability and cluster features. The probability and cluster features are no
longer included in the provided medium and large models (see the next section).

#### Probability and cluster features

> #### Load and save extra prob lookups table

@ -201,6 +268,28 @@ model vocab, which will take a few seconds on initial loading. When you save
this model after loading the `prob` table, the full `prob` table will be saved
as part of the model vocab.

To load the probability table into a provided model, first make sure you have
`spacy-lookups-data` installed. To load the table, remove the empty provided
`lexeme_prob` table and then access `Lexeme.prob` for any word to load the
table from `spacy-lookups-data`:

```diff
+ # prerequisite: pip install spacy-lookups-data
import spacy

nlp = spacy.load("en_core_web_md")

# remove the empty placeholder prob table
+ if nlp.vocab.lookups_extra.has_table("lexeme_prob"):
+     nlp.vocab.lookups_extra.remove_table("lexeme_prob")

# access any `.prob` to load the full table into the model
assert nlp.vocab["a"].prob == -3.9297883511

# if desired, save this model with the probability table included
nlp.to_disk("/path/to/model")
```

If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as part
of a new model, add the data to
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) under

@ -218,3 +307,39 @@ When you initialize a new model with [`spacy init-model`](/api/cli#init-model),
the `prob` table from `spacy-lookups-data` may be loaded as part of the
initialization. If you'd like to omit this extra data as in spaCy's provided
v2.3 models, use the new flag `--omit-extra-lookups`.

#### Tag maps in provided models vs. blank models

The tag maps in the provided models may differ from the tag maps in the spaCy
library. You can access the tag map in a loaded model under
`nlp.vocab.morphology.tag_map`.

The tag map from `spacy.lang.lg.tag_map` is still used when a blank model is
initialized. If you want to provide an alternate tag map, update
`nlp.vocab.morphology.tag_map` after initializing the model, or if you're using
the [train CLI](/api/cli#train), you can use the new `--tag-map-path` option to
provide the tag map as a JSON dict.

If you want to export a tag map from a provided model for use with the train
CLI, you can save it as a JSON dict. To only use string keys as required by
JSON and to make it easier to read and edit, any internal integer IDs need to
be converted back to strings:

```python
import spacy
import srsly

nlp = spacy.load("en_core_web_sm")
tag_map = {}

# convert any integer IDs to strings for JSON
for tag, morph in nlp.vocab.morphology.tag_map.items():
    tag_map[tag] = {}
    for feat, val in morph.items():
        feat = nlp.vocab.strings.as_string(feat)
        if not isinstance(val, bool):
            val = nlp.vocab.strings.as_string(val)
        tag_map[tag][feat] = val

srsly.write_json("tag_map.json", tag_map)
```

@ -78,11 +78,14 @@
  "name": "Japanese",
  "models": ["ja_core_news_sm", "ja_core_news_md", "ja_core_news_lg"],
  "dependencies": [
+   { "name": "Unidic", "url": "http://unidic.ninjal.ac.jp/back_number#unidic_cwj" },
+   { "name": "Mecab", "url": "https://github.com/taku910/mecab" },
    {
      "name": "SudachiPy",
      "url": "https://github.com/WorksApplications/SudachiPy"
    }
  ],
+ "example": "これは文章です。",
  "has_examples": true
},
{

@ -191,17 +194,6 @@
  "example": "นี่คือประโยค",
  "has_examples": true
},
- {
-   "code": "ja",
-   "name": "Japanese",
-   "dependencies": [
-     { "name": "Unidic", "url": "http://unidic.ninjal.ac.jp/back_number#unidic_cwj" },
-     { "name": "Mecab", "url": "https://github.com/taku910/mecab" },
-     { "name": "fugashi", "url": "https://github.com/polm/fugashi" }
-   ],
-   "example": "これは文章です。",
-   "has_examples": true
- },
{
  "code": "ko",
  "name": "Korean",

13295
website/package-lock.json
generated
File diff suppressed because it is too large

@ -16,7 +16,7 @@
  "autoprefixer": "^9.4.7",
  "classnames": "^2.2.6",
  "codemirror": "^5.43.0",
- "gatsby": "^2.1.18",
+ "gatsby": "^2.11.1",
  "gatsby-image": "^2.0.29",
  "gatsby-mdx": "^0.3.6",
  "gatsby-plugin-catch-links": "^2.0.11",