Mirror of https://github.com/explosion/spaCy.git, synced 2024-12-25 17:36:30 +03:00

Commit 414dc7ace1: Merge branch 'spacy.io' into spacy.io-develop
.github/contributors/hertelm.md (vendored, new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term **"you"** shall mean
the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
   object code, patch, tool, sample, graphic, specification, manual,
   documentation, or any other material posted or submitted by you to the
   project.

2. With respect to any worldwide copyrights, or copyright applications and
   registrations, in your contribution:

   * you hereby assign to us joint ownership, and to the extent that such
     assignment is or becomes invalid, ineffective or unenforceable, you hereby
     grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
     royalty-free, unrestricted license to exercise all rights under those
     copyrights. This includes, at our option, the right to sublicense these
     same rights to third parties through multiple levels of sublicensees or
     other licensing arrangements;

   * you agree that each of us can do all things in relation to your
     contribution as if each of us were the sole owners, and if one of us makes
     a derivative work of your contribution, the one who makes the derivative
     work (or has it made) will be the sole owner of that derivative work;

   * you agree that you will not assert any moral rights in your contribution
     against us, our licensees or transferees;

   * you agree that we may register a copyright in your contribution and
     exercise all ownership rights associated with it; and

   * you agree that neither of us has any duty to consult with, obtain the
     consent of, pay or render an accounting to the other for any use or
     distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
   to any third party, you hereby grant to us a perpetual, irrevocable,
   non-exclusive, worldwide, no-charge, royalty-free license to:

   * make, have made, use, sell, offer to sell, import, and otherwise transfer
     your contribution in whole or in part, alone or in combination with or
     included in any product, work or materials arising out of the project to
     which your contribution was submitted, and

   * at our option, to sublicense these same rights to third parties through
     multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
   contribution. The rights that you grant to us under these terms are
   effective on the date you first submitted a contribution to us, even if your
   submission took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

   * each contribution that you submit is and shall be an original work of
     authorship and you can legally grant the rights set out in this SCA;

   * to the best of your knowledge, each contribution will not violate any
     third party's copyrights, trademarks, patents, or other intellectual
     property rights; and

   * each contribution shall be in compliance with U.S. export control laws and
     other applicable export and import laws. You agree to notify us if you
     become aware of any circumstance which would make any of the foregoing
     representations inaccurate in any respect. We may publicly disclose your
     participation in the project, including the fact that you have signed the
     SCA.

6. This SCA is governed by the laws of the State of California and applicable
   U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
   mark both statements:

   * [x] I am signing on behalf of myself as an individual and no other person
     or entity, including my employer, has or will have rights with respect to
     my contributions.

   * [ ] I am signing on behalf of my employer or a legal entity and I have the
     actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry           |
| ------------------------------ | --------------- |
| Name                           | Matthias Hertel |
| Company name (if applicable)   |                 |
| Title or role (if applicable)  |                 |
| Date                           | June 29, 2020   |
| GitHub username                | hertelm         |
| Website (optional)             |                 |
@@ -91,7 +91,7 @@ Match a stream of documents, yielding them in turn.

> ```python
> from spacy.matcher import PhraseMatcher
> matcher = PhraseMatcher(nlp.vocab)
- > for doc in matcher.pipe(texts, batch_size=50):
+ > for doc in matcher.pipe(docs, batch_size=50):
>     pass
> ```
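The replacement reflects that `PhraseMatcher.pipe` consumes a stream of `Doc` objects, not raw strings. A minimal sketch of the corrected usage, assuming `en_core_web_sm` is installed; the "OBAMA" pattern and the example texts are illustrative only:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")  # assumption: this model is installed
matcher = PhraseMatcher(nlp.vocab)
# hypothetical pattern for illustration (v2.2.2+ list-style add)
matcher.add("OBAMA", [nlp("Barack Obama")])

texts = ["Barack Obama was the 44th president.", "Nothing to match here."]
docs = nlp.pipe(texts)  # convert raw strings to Doc objects first
for doc in matcher.pipe(docs, batch_size=50):
    print(doc.text, matcher(doc))
```
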
@@ -122,7 +122,7 @@ for match_id, start, end in matches:
```

The matcher returns a list of `(match_id, start, end)` tuples – in this case,
- `[('15578876784678163569', 0, 2)]`, which maps to the span `doc[0:2]` of our
+ `[('15578876784678163569', 0, 3)]`, which maps to the span `doc[0:3]` of our
original document. The `match_id` is the [hash value](/usage/spacy-101#vocab) of
the string ID "HelloWorld". To get the string value, you can look up the ID in
the [`StringStore`](/api/stringstore).
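A short sketch of that lookup, reusing `nlp`, `doc`, and `matches` from the surrounding example:

```python
# resolve each match hash back to its string ID via the StringStore
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # e.g. "HelloWorld"
    span = doc[start:end]
    print(string_id, span.text)
```
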
@@ -161,10 +161,18 @@ debugging your tokenizer configuration.

spaCy's custom warnings have been replaced with native Python
[`warnings`](https://docs.python.org/3/library/warnings.html). Instead of
- setting `SPACY_WARNING_IGNORE`, use the
- [`warnings` filters](https://docs.python.org/3/library/warnings.html#the-warnings-filter)
+ setting `SPACY_WARNING_IGNORE`, use the [`warnings`
+ filters](https://docs.python.org/3/library/warnings.html#the-warnings-filter)
to manage warnings.

```diff
import spacy
+ import warnings

- spacy.errors.SPACY_WARNING_IGNORE.append('W007')
+ warnings.filterwarnings("ignore", message=r"\[W007\]", category=UserWarning)
```
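To silence a warning only for a specific block of code rather than globally, the standard library's context manager works as well; a sketch, with `W007` and the model name as illustrative choices:

```python
import warnings

import spacy

nlp = spacy.load("en_core_web_md")  # assumption: a model is installed

with warnings.catch_warnings():
    # suppress the warning only inside this block
    warnings.filterwarnings("ignore", message=r"\[W007\]", category=UserWarning)
    doc1 = nlp("apple")
    doc2 = nlp("orange")
    print(doc1.similarity(doc2))
```
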
#### Normalization tables

The normalization tables have moved from the language data in

@@ -174,6 +182,65 @@ If you're adding data for a new language, the normalization table should be

added to `spacy-lookups-data`. See
[adding norm exceptions](/usage/adding-languages#norm-exceptions).

#### No preloaded vocab for models with vectors

To reduce the initial loading time, the lexemes in `nlp.vocab` are no longer
loaded on initialization for models with vectors. As you process texts, the
lexemes will be added to the vocab automatically, just as in small models
without vectors.

To see the number of unique vectors and the number of words with vectors, check
`nlp.meta['vectors']`. For example, `en_core_web_md` has `20000` unique vectors
and `684830` words with vectors:

```python
{
    'width': 300,
    'vectors': 20000,
    'keys': 684830,
    'name': 'en_core_web_md.vectors'
}
```
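As a quick sketch, these numbers can be read off a loaded model directly (assuming `en_core_web_md` is installed):

```python
import spacy

nlp = spacy.load("en_core_web_md")
meta = nlp.meta["vectors"]
print(meta["vectors"], "unique vectors,", meta["keys"], "words with vectors")
```
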

If required, for instance if you are working directly with word vectors rather
than processing texts, you can load all lexemes for words with vectors at once:

```python
for orth in nlp.vocab.vectors:
    _ = nlp.vocab[orth]
```

If your workflow previously iterated over `nlp.vocab`, a similar alternative
is to iterate over words with vectors instead:

```diff
- lexemes = [w for w in nlp.vocab]
+ lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
```

Be aware that the set of preloaded lexemes in a v2.2 model is not equivalent to
the set of words with vectors. For English, v2.2 `md/lg` models have 1.3M
provided lexemes but only 685K words with vectors. The vectors have been
updated for most languages in v2.2, but the English models contain the same
vectors for both v2.2 and v2.3.

#### Lexeme.is_oov and Token.is_oov

<Infobox title="Important note" variant="warning">

Due to a bug, the values for `is_oov` are reversed in v2.3.0, but this will be
fixed in the next patch release v2.3.1.

</Infobox>

In v2.3, `Lexeme.is_oov` and `Token.is_oov` are `True` if the lexeme does not
have a word vector. This is equivalent to `token.orth not in
nlp.vocab.vectors`.
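A small sketch of that equivalence, assuming a model with vectors is installed and spaCy v2.3.1+ (where the reversed values noted above are fixed):

```python
import spacy

nlp = spacy.load("en_core_web_md")  # assumption: md model with vectors
doc = nlp("apple qwzxblorp")  # second token is an illustrative nonsense word

for token in doc:
    # is_oov mirrors vector membership in v2.3.1+
    assert token.is_oov == (token.orth not in nlp.vocab.vectors)
```
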

Previously in v2.2, `is_oov` corresponded to whether a lexeme had stored
probability and cluster features. The probability and cluster features are no
longer included in the provided medium and large models (see the next section).

#### Probability and cluster features

> #### Load and save extra prob lookups table

@@ -201,6 +268,28 @@ model vocab, which will take a few seconds on initial loading. When you save

this model after loading the `prob` table, the full `prob` table will be saved
as part of the model vocab.

To load the probability table into a provided model, first make sure you have
`spacy-lookups-data` installed. To load the table, remove the empty provided
`lexeme_prob` table and then access `Lexeme.prob` for any word to load the
table from `spacy-lookups-data`:

```diff
+ # prerequisite: pip install spacy-lookups-data
import spacy

nlp = spacy.load("en_core_web_md")

# remove the empty placeholder prob table
+ if nlp.vocab.lookups_extra.has_table("lexeme_prob"):
+     nlp.vocab.lookups_extra.remove_table("lexeme_prob")

# access any `.prob` to load the full table into the model
assert nlp.vocab["a"].prob == -3.9297883511

# if desired, save this model with the probability table included
nlp.to_disk("/path/to/model")
```
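Since the full `prob` table is saved with the model, reloading it should reproduce the same value; a sketch using the placeholder path from the snippet above:

```python
import spacy

# reload the model saved above and confirm the prob table came along
nlp2 = spacy.load("/path/to/model")  # placeholder path from the snippet above
assert nlp2.vocab["a"].prob == -3.9297883511
```
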

If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as part
of a new model, add the data to
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) under

@@ -218,3 +307,39 @@ When you initialize a new model with [`spacy init-model`](/api/cli#init-model),

the `prob` table from `spacy-lookups-data` may be loaded as part of the
initialization. If you'd like to omit this extra data as in spaCy's provided
v2.3 models, use the new flag `--omit-extra-lookups`.

#### Tag maps in provided models vs. blank models

The tag maps in the provided models may differ from the tag maps in the spaCy
library. You can access the tag map in a loaded model under
`nlp.vocab.morphology.tag_map`.

The tag map from `spacy.lang.lg.tag_map` is still used when a blank model is
initialized. If you want to provide an alternate tag map, update
`nlp.vocab.morphology.tag_map` after initializing the model, or, if you're
using the [train CLI](/api/cli#train), use the new `--tag-map-path` option to
provide the tag map as a JSON dict.

If you want to export a tag map from a provided model for use with the train
CLI, you can save it as a JSON dict. To only use string keys as required by
JSON and to make it easier to read and edit, any internal integer IDs need to
be converted back to strings:

```python
import spacy
import srsly

nlp = spacy.load("en_core_web_sm")
tag_map = {}

# convert any integer IDs to strings for JSON
for tag, morph in nlp.vocab.morphology.tag_map.items():
    tag_map[tag] = {}
    for feat, val in morph.items():
        feat = nlp.vocab.strings.as_string(feat)
        if not isinstance(val, bool):
            val = nlp.vocab.strings.as_string(val)
        tag_map[tag][feat] = val

srsly.write_json("tag_map.json", tag_map)
```
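Reading the exported file back is symmetric, e.g. to inspect or tweak the tag map before passing it to `--tag-map-path`; a minimal sketch assuming `tag_map.json` was written as above:

```python
import srsly

# load the exported tag map back into a dict
tag_map = srsly.read_json("tag_map.json")
print(len(tag_map), "tags in the exported map")
```
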
@@ -78,11 +78,14 @@
      "name": "Japanese",
      "models": ["ja_core_news_sm", "ja_core_news_md", "ja_core_news_lg"],
      "dependencies": [
        { "name": "Unidic", "url": "http://unidic.ninjal.ac.jp/back_number#unidic_cwj" },
        { "name": "Mecab", "url": "https://github.com/taku910/mecab" },
        {
          "name": "SudachiPy",
          "url": "https://github.com/WorksApplications/SudachiPy"
        }
      ],
      "example": "これは文章です。",
      "has_examples": true
    },
    {
@@ -191,17 +194,6 @@
      "example": "นี่คือประโยค",
      "has_examples": true
    },
-   {
-     "code": "ja",
-     "name": "Japanese",
-     "dependencies": [
-       { "name": "Unidic", "url": "http://unidic.ninjal.ac.jp/back_number#unidic_cwj" },
-       { "name": "Mecab", "url": "https://github.com/taku910/mecab" },
-       { "name": "fugashi", "url": "https://github.com/polm/fugashi" }
-     ],
-     "example": "これは文章です。",
-     "has_examples": true
-   },
    {
      "code": "ko",
      "name": "Korean",

website/package-lock.json (generated, 13295 lines)
File diff suppressed because it is too large.

@@ -16,7 +16,7 @@
    "autoprefixer": "^9.4.7",
    "classnames": "^2.2.6",
    "codemirror": "^5.43.0",
-   "gatsby": "^2.1.18",
+   "gatsby": "^2.11.1",
    "gatsby-image": "^2.0.29",
    "gatsby-mdx": "^0.3.6",
    "gatsby-plugin-catch-links": "^2.0.11",