Mirror of https://github.com/explosion/spaCy.git (synced 2025-02-14 18:40:33 +03:00)

Merge branch 'develop' into nightly.spacy.io

This commit is contained in: commit b58b4f1e0e
.github/contributors/jmargeta.md (vendored, new file, +106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
   object code, patch, tool, sample, graphic, specification, manual,
   documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
   registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
      assignment is or becomes invalid, ineffective or unenforceable, you hereby
      grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
      royalty-free, unrestricted license to exercise all rights under those
      copyrights. This includes, at our option, the right to sublicense these same
      rights to third parties through multiple levels of sublicensees or other
      licensing arrangements;

    * you agree that each of us can do all things in relation to your
      contribution as if each of us were the sole owners, and if one of us makes
      a derivative work of your contribution, the one who makes the derivative
      work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
      against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
      exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
      consent of, pay or render an accounting to the other for any use or
      distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
   to any third party, you hereby grant to us a perpetual, irrevocable,
   non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
      your contribution in whole or in part, alone or in combination with or
      included in any product, work or materials arising out of the project to
      which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
      multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
   contribution. The rights that you grant to us under these terms are effective
   on the date you first submitted a contribution to us, even if your submission
   took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
      authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
      third party's copyrights, trademarks, patents, or other intellectual
      property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
      other applicable export and import laws. You agree to notify us if you
      become aware of any circumstance which would make any of the foregoing
      representations inaccurate in any respect. We may publicly disclose your
      participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
   U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
   mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
      or entity, including my employer, has or will have rights with respect to my
      contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
      actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
| ------------------------------ | -------------------- |
| Name                           | Jan Margeta          |
| Company name (if applicable)   | KardioMe             |
| Title or role (if applicable)  | Founder              |
| Date                           | 2020-10-16           |
| GitHub username                | jmargeta             |
| Website (optional)             | kardio.me            |
@@ -58,5 +58,7 @@ redirects = [
     {from = "/universe", to = "/universe/project/:id", query = {id = ":id"}, force = true},
     {from = "/universe", to = "/universe/category/:category", query = {category = ":category"}, force = true},
     # Renamed universe projects
-    {from = "/universe/project/spacy-pytorch-transformers", to = "/universe/project/spacy-transformers", force = true}
+    {from = "/universe/project/spacy-pytorch-transformers", to = "/universe/project/spacy-transformers", force = true},
+    # Old model pages
+    {from = "/models/en-starters", to = "/models/en", force = true},
 ]
@@ -1,17 +1,20 @@
 # Recommended settings and available resources for each language, if available.
+# Not all languages have recommended word vectors or transformers and for some,
+# the recommended transformer for efficiency and accuracy may be the same.
-en:
-  word_vectors: en_vectors_web_lg
 ar:
   word_vectors: null
   transformer:
     efficiency:
-      name: roberta-base
+      name: asafaya/bert-base-arabic
       size_factor: 3
     accuracy:
-      name: roberta-base
+      name: asafaya/bert-base-arabic
       size_factor: 3
+da:
+  word_vectors: da_core_news_lg
+  transformer: null
 de:
-  word_vectors: null
+  word_vectors: de_core_news_lg
   transformer:
     efficiency:
       name: bert-base-german-cased
@@ -19,17 +22,26 @@ de:
     accuracy:
       name: bert-base-german-cased
       size_factor: 3
-fr:
-  word_vectors: null
+el:
+  word_vectors: el_core_news_lg
   transformer:
     efficiency:
-      name: camembert-base
+      name: nlpaueb/bert-base-greek-uncased-v1
       size_factor: 3
     accuracy:
-      name: camembert-base
+      name: nlpaueb/bert-base-greek-uncased-v1
       size_factor: 3
+en:
+  word_vectors: en_core_web_lg
+  transformer:
+    efficiency:
+      name: roberta-base
+      size_factor: 3
+    accuracy:
+      name: roberta-base
+      size_factor: 3
 es:
-  word_vectors: null
+  word_vectors: es_core_news_lg
   transformer:
     efficiency:
       name: dccuchile/bert-base-spanish-wwm-cased
@@ -37,15 +49,6 @@ es:
     accuracy:
       name: dccuchile/bert-base-spanish-wwm-cased
       size_factor: 3
-sv:
-  word_vectors: null
-  transformer:
-    efficiency:
-      name: KB/bert-base-swedish-cased
-      size_factor: 3
-    accuracy:
-      name: KB/bert-base-swedish-cased
-      size_factor: 3
 fi:
   word_vectors: null
   transformer:
@@ -55,14 +58,65 @@ fi:
     accuracy:
       name: TurkuNLP/bert-base-finnish-cased-v1
       size_factor: 3
-el:
+fr:
+  word_vectors: fr_core_news_lg
+  transformer:
+    efficiency:
+      name: camembert-base
+      size_factor: 3
+    accuracy:
+      name: camembert-base
+      size_factor: 3
+it:
+  word_vectors: it_core_news_lg
+  transformers: null
+ja:
+  word_vectors: ja_core_news_lg
+  transformers: null
+lt:
+  word_vectors: lt_core_news_lg
+  transformers: null
+nb:
+  word_vectors: nb_core_news_lg
+  transformers: null
+nl:
+  word_vectors: nl_core_news_lg
+  transformer:
+    efficiency:
+      name: pdelobelle/robbert-v2-dutch-base
+      size_factor: 3
+    accuracy:
+      name: pdelobelle/robbert-v2-dutch-base
+      size_factor: 3
+pl:
+  word_vectors: pl_core_news_lg
+  transformer:
+    efficiency:
+      name: dkleczek/bert-base-polish-cased-v1
+      size_factor: 3
+    accuracy:
+      name: dkleczek/bert-base-polish-cased-v1
+      size_factor: 3
+pt:
+  word_vectors: pt_core_news_lg
+  transformer:
+    efficiency:
+      name: neuralmind/bert-base-portuguese-cased
+      size_factor: 3
+    accuracy:
+      name: neuralmind/bert-base-portuguese-cased
+      size_factor: 3
+ro:
+  word_vectors: ro_core_news_lg
+  transformers: null
+sv:
   word_vectors: null
   transformer:
     efficiency:
-      name: nlpaueb/bert-base-greek-uncased-v1
+      name: KB/bert-base-swedish-cased
       size_factor: 3
     accuracy:
-      name: nlpaueb/bert-base-greek-uncased-v1
+      name: KB/bert-base-swedish-cased
       size_factor: 3
 tr:
   word_vectors: null
@@ -74,7 +128,7 @@ tr:
       name: dbmdz/bert-base-turkish-cased
       size_factor: 3
 zh:
-  word_vectors: null
+  word_vectors: zh_core_web_lg
   transformer:
     efficiency:
       name: bert-base-chinese
@@ -83,39 +137,3 @@ zh:
       name: bert-base-chinese
       size_factor: 3
   has_letters: false
-ar:
-  word_vectors: null
-  transformer:
-    efficiency:
-      name: asafaya/bert-base-arabic
-      size_factor: 3
-    accuracy:
-      name: asafaya/bert-base-arabic
-      size_factor: 3
-pl:
-  word_vectors: null
-  transformer:
-    efficiency:
-      name: dkleczek/bert-base-polish-cased-v1
-      size_factor: 3
-    accuracy:
-      name: dkleczek/bert-base-polish-cased-v1
-      size_factor: 3
-nl:
-  word_vectors: null
-  transformer:
-    efficiency:
-      name: pdelobelle/robbert-v2-dutch-base
-      size_factor: 3
-    accuracy:
-      name: pdelobelle/robbert-v2-dutch-base
-      size_factor: 3
-pt:
-  word_vectors: null
-  transformer:
-    efficiency:
-      name: neuralmind/bert-base-portuguese-cased
-      size_factor: 3
-    accuracy:
-      name: neuralmind/bert-base-portuguese-cased
-      size_factor: 3
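The YAML above is a flat mapping from language code to recommended resources, so tooling can look up a language's suggested vectors and transformer directly. A minimal sketch of consuming it with PyYAML; the file path is assumed for illustration:

```python
import yaml  # PyYAML

# Hypothetical path for the recommendations file shown in the diff above
with open("quickstart_training_recommendations.yml") as f:
    recommendations = yaml.safe_load(f)

sv = recommendations["sv"]
print(sv["word_vectors"])  # None: no recommended word vectors for Swedish
print(sv["transformer"]["efficiency"]["name"])  # KB/bert-base-swedish-cased
```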
@@ -257,7 +257,7 @@ class TokenPattern(BaseModel):


 class TokenPatternSchema(BaseModel):
-    pattern: List[TokenPattern] = Field(..., minItems=1)
+    pattern: List[TokenPattern] = Field(..., min_items=1)

     class Config:
         extra = "forbid"
spacy/tests/regression/test_issue6258.py (new file, +14 lines)

@@ -0,0 +1,14 @@
import pydantic
import pytest
from pydantic import ValidationError
from spacy.schemas import TokenPattern, TokenPatternSchema


def test_issue6258():
    """Test that the non-empty constraint on the pattern field is respected"""
    # This one is valid
    TokenPatternSchema(pattern=[TokenPattern()])
    # But an empty pattern list should fail to validate
    # based on the schema's constraint
    with pytest.raises(ValidationError):
        TokenPatternSchema(pattern=[])
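The one-character rename above (in `spacy.schemas`, per the test's import) matters because pydantic v1's `Field()` accepts arbitrary keyword arguments and folds unknown ones, such as the camelCase `minItems`, into the JSON-schema extras without ever enforcing them; only the snake_case `min_items` is a real validation constraint. A self-contained sketch, assuming pydantic v1 (the `Broken` and `Fixed` models are illustrative, not from the codebase):

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class Broken(BaseModel):
    # camelCase kwarg is stored as schema metadata only, never enforced
    items: List[int] = Field(..., minItems=1)


class Fixed(BaseModel):
    # snake_case kwarg is an actual pydantic v1 constraint
    items: List[int] = Field(..., min_items=1)


Broken(items=[])  # validates, despite the intended non-empty constraint
try:
    Fixed(items=[])
except ValidationError as err:
    print(err)  # "ensure this value has at least 1 items"
```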
@@ -68,8 +68,8 @@ representation consists of 300 dimensions of `0`, which means it's practically
 nonexistent. If your application will benefit from a **large vocabulary** with
 more vectors, you should consider using one of the larger pipeline packages or
 loading in a full vector package, for example,
-[`en_vectors_web_lg`](/models/en-starters#en_vectors_web_lg), which includes
-over **1 million unique vectors**.
+[`en_core_web_lg`](/models/en#en_core_web_lg), which includes **685k unique
+vectors**.

 spaCy is able to compare two objects, and make a prediction of **how similar
 they are**. Predicting similarity is useful for building recommendation systems
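The similarity prediction described above can be tried directly once a vector-bearing package is installed; a minimal sketch, assuming `en_core_web_lg` has been downloaded with `python -m spacy download en_core_web_lg`:

```python
import spacy

nlp = spacy.load("en_core_web_lg")
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")
# The score is driven by the package's 685k word vectors
print(doc1.similarity(doc2))
```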
@@ -1859,9 +1859,8 @@ pruning the vectors will be taken care of automatically if you set the `--prune`
 flag. You can also do it manually in the following steps:

 1. Start with a **word vectors package** that covers a huge vocabulary. For
-   instance, the [`en_vectors_web_lg`](/models/en-starters#en_vectors_web_lg)
-   starter provides 300-dimensional GloVe vectors for over 1 million terms of
-   English.
+   instance, the [`en_core_web_lg`](/models/en#en_core_web_lg) package provides
+   300-dimensional GloVe vectors for 685k terms of English.
 2. If your vocabulary has values set for the `Lexeme.prob` attribute, the
    lexemes will be sorted by descending probability to determine which vectors
    to prune. Otherwise, lexemes will be sorted by their order in the `Vocab`.
@@ -1869,7 +1868,7 @@ flag. You can also do it manually in the following steps:
    vectors you want to keep.

 ```python
-nlp = spacy.load('en_vectors_web_lg')
+nlp = spacy.load("en_core_web_lg")
 n_vectors = 105000  # number of vectors to keep
 removed_words = nlp.vocab.prune_vectors(n_vectors)
 ```
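As a follow-up to the snippet above: `Vocab.prune_vectors` returns a dictionary that maps each removed word to the kept word whose vector now stands in for it, together with their similarity score. A short sketch of inspecting that mapping; the printed values are illustrative:

```python
import spacy

nlp = spacy.load("en_core_web_lg")
removed_words = nlp.vocab.prune_vectors(105000)
# Each removed word was remapped to its nearest remaining neighbor
for word, (kept, score) in list(removed_words.items())[:3]:
    print(f"{word} -> {kept} ({score:.2f})")
```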
@@ -22,7 +22,7 @@ For more details and a behind-the-scenes look at the new release,
 >
 > ```bash
 > $ python -m spacy pretrain ./raw_text.jsonl
-> en_vectors_web_lg ./pretrained-model
+> en_core_web_lg ./pretrained-model
 > ```

 spaCy v2.1 introduces a new CLI command, `spacy pretrain`, that can make your
@@ -226,8 +226,6 @@ exports.createPages = ({ graphql, actions }) => {

      const langs = result.data.site.siteMetadata.languages
      const modelLangs = langs.filter(({ models }) => models && models.length)
-     const starterLangs = langs.filter(({ starters }) => starters && starters.length)
-
      modelLangs.forEach(({ code, name, models, example, has_examples }, i) => {
        const slug = `/models/${code}`
        const next = i < modelLangs.length - 1 ? modelLangs[i + 1] : null
@@ -247,28 +245,6 @@ exports.createPages = ({ graphql, actions }) => {
          },
        })
      })
-
-     starterLangs.forEach(({ code, name, starters }, i) => {
-       const slug = `/models/${code}-starters`
-       const next = i < starterLangs.length - 1 ? starterLangs[i + 1] : null
-       createPage({
-         path: slug,
-         component: DEFAULT_TEMPLATE,
-         context: {
-           id: `${code}-starters`,
-           slug: slug,
-           isIndex: false,
-           title: name,
-           section: 'models',
-           sectionTitle: sections.models.title,
-           theme: sections.models.theme,
-           next: next
-             ? { title: next.name, slug: `/models/${next.code}-starters` }
-             : null,
-           meta: { models: starters, isStarters: true },
-         },
-       })
-     })
    })
  )
})
|
@ -52,19 +52,6 @@ const Docs = ({ pageContext, children }) => (
|
|||
id: model,
|
||||
})),
|
||||
}))
|
||||
if (sidebar.items.length > 2) {
|
||||
sidebar.items[2].items = languages
|
||||
.filter(({ starters }) => starters && starters.length)
|
||||
.map(lang => ({
|
||||
text: lang.name,
|
||||
url: `/models/${lang.code}-starters`,
|
||||
isActive: id === `${lang.code}-starters`,
|
||||
menu: lang.starters.map(model => ({
|
||||
text: model,
|
||||
id: model,
|
||||
})),
|
||||
}))
|
||||
}
|
||||
}
|
||||
const sourcePath = source ? github(source) : null
|
||||
const currentSource = getCurrentSource(slug, isIndex)
|
||||
|
|
|
@ -374,7 +374,7 @@ const Models = ({ pageContext, repo, children }) => {
|
|||
const [initialized, setInitialized] = useState(false)
|
||||
const [compatibility, setCompatibility] = useState({})
|
||||
const { id, title, meta } = pageContext
|
||||
const { models, isStarters } = meta
|
||||
const { models } = meta
|
||||
const baseUrl = `https://raw.githubusercontent.com/${repo}/master`
|
||||
|
||||
useEffect(() => {
|
||||
|
@ -388,26 +388,9 @@ const Models = ({ pageContext, repo, children }) => {
|
|||
}
|
||||
}, [initialized, baseUrl])
|
||||
|
||||
const modelTitle = title
|
||||
const modelTeaser = `Available trained pipelines for ${title}`
|
||||
const starterTitle = `${title} starters`
|
||||
const starterTeaser = `Available transfer learning starter packs for ${title}`
|
||||
|
||||
return (
|
||||
<>
|
||||
<Title
|
||||
title={isStarters ? starterTitle : modelTitle}
|
||||
teaser={isStarters ? starterTeaser : modelTeaser}
|
||||
/>
|
||||
{isStarters && (
|
||||
<Section>
|
||||
<p>
|
||||
Starter packs are pretrained weights you can initialize your models with to
|
||||
achieve better accuracy, like word vectors (which will be used as features
|
||||
during training).
|
||||
</p>
|
||||
</Section>
|
||||
)}
|
||||
<Title title={title} teaser={`Available trained pipelines for ${title}`} />
|
||||
<StaticQuery
|
||||
query={query}
|
||||
render={({ site }) =>
|
||||
|
|