This commit is contained in:
Matthew Honnibal 2017-04-27 13:21:39 +02:00
commit 31ec9e1371
16 changed files with 522 additions and 30 deletions

106
.github/contributors/luvogels.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Leif Uwe Vogelsang |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 4/27/2017 |
| GitHub username | luvogels |
| Website (optional) | |

View File

@ -29,6 +29,7 @@ This is a list of everyone who has made significant contributions to spaCy, in a
* Juan Miguel Cejuela, [@juanmirocks](https://github.com/juanmirocks)
* Kendrick Tan, [@kendricktan](https://github.com/kendricktan)
* Kyle P. Johnson, [@kylepjohnson](https://github.com/kylepjohnson)
* Leif Uwe Vogelsang, [@luvogels](https://github.com/luvogels)
* Liling Tan, [@alvations](https://github.com/alvations)
* Magnus Burton, [@magnusburton](https://github.com/magnusburton)
* Mark Amery, [@ExplodingCabbage](https://github.com/ExplodingCabbage)

View File

@ -4,9 +4,9 @@ spaCy: Industrial-strength NLP
spaCy is a library for advanced natural language processing in Python and
Cython. spaCy is built on the very latest research, but it isn't researchware.
It was designed from day one to be used in real products. spaCy currently supports
English and German, as well as tokenization for Chinese, Spanish, Italian, French,
Portuguese, Dutch, Swedish, Finnish, Hungarian, Bengali and Hebrew. It's commercial
open-source software, released under the MIT license.
English, German and French, as well as tokenization for Chinese, Spanish, Italian,
Portuguese, Dutch, Swedish, Finnish, Norwegian, Hungarian, Bengali and Hebrew. It's
commercial open-source software, released under the MIT license.
📊 **Help us improve the library!** `Take the spaCy user survey <https://survey.spacy.io>`_.
@ -320,6 +320,7 @@ and ``--model`` are optional and enable additional tests:
=========== ============== ===========
Version Date Description
=========== ============== ===========
`v1.8.2`_ ``2017-04-26`` French model and small improvements
`v1.8.1`_ ``2017-04-23`` Saving, loading and training bug fixes
`v1.8.0`_ ``2017-04-16`` Better NER training, saving and loading
`v1.7.5`_ ``2017-04-07`` Bug fixes and new CLI commands
@ -352,6 +353,7 @@ Version Date Description
`v0.93`_ ``2015-09-22`` Bug fixes to word vectors
=========== ============== ===========
.. _v1.8.2: https://github.com/explosion/spaCy/releases/tag/v1.8.2
.. _v1.8.1: https://github.com/explosion/spaCy/releases/tag/v1.8.1
.. _v1.8.0: https://github.com/explosion/spaCy/releases/tag/v1.8.0
.. _v1.7.5: https://github.com/explosion/spaCy/releases/tag/v1.7.5

1
setup.py Normal file → Executable file
View File

@ -36,6 +36,7 @@ PACKAGES = [
'spacy.fi',
'spacy.bn',
'spacy.he',
'spacy.nb',
'spacy.en.lemmatizer',
'spacy.cli.converters',
'spacy.language_data',

View File

@ -3,14 +3,14 @@ from __future__ import unicode_literals
from . import util
from .deprecated import resolve_model_name
from .cli import info
from .cli.info import info
from . import en, de, zh, es, it, hu, fr, pt, nl, sv, fi, bn, he
from . import en, de, zh, es, it, hu, fr, pt, nl, sv, fi, bn, he, nb
_languages = (en.English, de.German, es.Spanish, pt.Portuguese, fr.French,
it.Italian, hu.Hungarian, zh.Chinese, nl.Dutch, sv.Swedish,
fi.Finnish, bn.Bengali, he.Hebrew)
fi.Finnish, bn.Bengali, he.Hebrew, nb.Norwegian)
for _lang in _languages:

25
spacy/nb/__init__.py Normal file
View File

@ -0,0 +1,25 @@
# encoding: utf8
from __future__ import unicode_literals, print_function
from os import path
from ..language import Language
from ..attrs import LANG
# Import language-specific data
from .language_data import *
# create Language subclass
class Norwegian(Language):
lang = 'nb' # ISO code
class Defaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'nb'
# override defaults
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
#tag_map = TAG_MAP
stop_words = STOP_WORDS

28
spacy/nb/language_data.py Normal file
View File

@ -0,0 +1,28 @@
# encoding: utf8
from __future__ import unicode_literals
# import base language data
from .. import language_data as base
# import util functions
from ..language_data import update_exc, strings_to_exc, expand_exc
# import language-specific data from files
#from .tag_map import TAG_MAP
from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, ORTH_ONLY
from .morph_rules import MORPH_RULES
TOKENIZER_EXCEPTIONS = dict(TOKENIZER_EXCEPTIONS)
#TAG_MAP = dict(TAG_MAP)
STOP_WORDS = set(STOP_WORDS)
# customize tokenizer exceptions
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(ORTH_ONLY))
update_exc(TOKENIZER_EXCEPTIONS, expand_exc(TOKENIZER_EXCEPTIONS, "'", ""))
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(base.EMOTICONS))
# export
__all__ = ["TOKENIZER_EXCEPTIONS", "STOP_WORDS", "MORPH_RULES"]

67
spacy/nb/morph_rules.py Normal file
View File

@ -0,0 +1,67 @@
# encoding: utf8
# norwegian bokmål
from __future__ import unicode_literals
from ..symbols import *
from ..language_data import PRON_LEMMA
# Used the table of pronouns at https://no.wiktionary.org/wiki/Tillegg:Pronomen_i_norsk
MORPH_RULES = {
"PRP": {
"jeg": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Nom"},
"meg": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc"},
"du": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Case": "Nom"},
"deg": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Case": "Acc"},
"han": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Nom"},
"ham": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Acc"},
"han": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Acc"},
"hun": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Nom"},
"henne": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Acc"},
"den": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut"},
"det": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut"},
"seg": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Reflex": "Yes"},
"vi": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Nom"},
"oss": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Acc"},
"dere": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Case": "Nom"},
"de": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Nom"},
"dem": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Acc"},
"seg": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Reflex": "Yes"},
"min": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Gender": "Masc"},
"mi": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Gender": "Fem"},
"mitt": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Gender": "Neu"},
"mine": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes"},
"din": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Poss": "Yes", "Gender": "Masc"},
"di": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Poss": "Yes", "Gender": "Fem"},
"ditt": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Poss": "Yes", "Gender": "Neu"},
"dine": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes"},
"hans": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Poss": "Yes", "Gender": "Masc"},
"hennes": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Poss": "Yes", "Gender": "Fem"},
"dens": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Poss": "Yes", "Gender": "Neu"},
"dets": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Poss": "Yes", "Gender": "Neu"},
"vår": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes"},
"vårt": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes"},
"våre": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Gender":"Neu"},
"deres": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Gender":"Neu", "Reflex":"Yes"},
"sin": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Gender":"Masc", "Reflex":"Yes"},
"si": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Gender":"Fem", "Reflex":"Yes"},
"sitt": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Gender":"Neu", "Reflex":"Yes"},
"sine": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex":"Yes"},
},
"VBZ": {
"er": {LEMMA: "be", "VerbForm": "Fin", "Person": "One", "Tense": "Pres", "Mood": "Ind"},
"er": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
"er": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"},
},
"VBP": {
"er": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Pres", "Mood": "Ind"}
},
"VBD": {
"var": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Past", "Number": "Sing"},
"vært": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Past", "Number": "Plur"}
}
}

49
spacy/nb/stop_words.py Normal file
View File

@ -0,0 +1,49 @@
# encoding: utf8
from __future__ import unicode_literals
STOP_WORDS = set("""
alle allerede alt and andre annen annet at av
bak bare bedre beste blant ble bli blir blitt bris by både
da dag de del dem den denne der dermed det dette disse drept du
eller en enn er et ett etter
fem fikk fire fjor flere folk for fortsatt fotball fra fram frankrike fredag funnet får fått før først første
gang gi gikk gjennom gjorde gjort gjør gjøre god godt grunn går
ha hadde ham han hans har hele helt henne hennes her hun hva hvor hvordan hvorfor
i ifølge igjen ikke ingen inn
ja jeg
kamp kampen kan kl klart kom komme kommer kontakt kort kroner kunne kveld kvinner
la laget land landet langt leder ligger like litt løpet lørdag
man mandag mange mannen mars med meg mellom men mener menn mennesker mens mer millioner minutter mot msci mye mål måtte
ned neste noe noen nok norge norsk norske ntb ny nye når
og også om onsdag opp opplyser oslo oss over
personer plass poeng politidistrikt politiet president prosent
regjeringen runde rundt russland
sa saken samme sammen samtidig satt se seg seks selv senere september ser sett siden sier sin sine siste sitt skal skriver skulle slik som sted stedet stor store står sverige svært søndag
ta tatt tid tidligere til tilbake tillegg tirsdag to tok torsdag tre tror tyskland
under usa ut uten utenfor
vant var ved veldig vi videre viktig vil ville viser vår være vært
å år
ønsker
""".split())

View File

@ -0,0 +1,175 @@
# encoding: utf8
# Norwegian bokmaål
from __future__ import unicode_literals
from ..symbols import *
from ..language_data import PRON_LEMMA
TOKENIZER_EXCEPTIONS = {
"jan.": [
{ORTH: "jan.", LEMMA: "januar"}
],
"feb.": [
{ORTH: "feb.", LEMMA: "februar"}
],
"jul.": [
{ORTH: "jul.", LEMMA: "juli"}
]
}
ORTH_ONLY = ["adm.dir.",
"a.m.",
"Aq.",
"b.c.",
"bl.a.",
"bla.",
"bm.",
"bto.",
"ca.",
"cand.mag.",
"c.c.",
"co.",
"d.d.",
"dept.",
"d.m.",
"dr.philos.",
"dvs.",
"d.y.",
"E. coli",
"eg.",
"ekskl.",
"e.Kr.",
"el.",
"e.l.",
"et.",
"etg.",
"ev.",
"evt.",
"f.",
"f.eks.",
"fhv.",
"fk.",
"f.Kr.",
"f.o.m.",
"foreg.",
"fork.",
"fv.",
"fvt.",
"g.",
"gt.",
"gl.",
"gno.",
"gnr.",
"grl.",
"hhv.",
"hoh.",
"hr.",
"h.r.adv.",
"ifb.",
"ifm.",
"iht.",
"inkl.",
"istf.",
"jf.",
"jr.",
"jun.",
"kfr.",
"kgl.res.",
"kl.",
"komm.",
"kst.",
"lø.",
"ma.",
"mag.art.",
"m.a.o.",
"md.",
"mfl.",
"mill.",
"min.",
"m.m.",
"mnd.",
"moh.",
"Mr.",
"muh.",
"mv.",
"mva.",
"ndf.",
"no.",
"nov.",
"nr.",
"nto.",
"nyno.",
"n.å.",
"o.a.",
"off.",
"ofl.",
"okt.",
"o.l.",
"on.",
"op.",
"osv.",
"ovf.",
"p.",
"p.a.",
"Pb.",
"pga.",
"ph.d.",
"pkt.",
"p.m.",
"pr.",
"pst.",
"p.t.",
"red.anm.",
"ref.",
"res.",
"res.kap.",
"resp.",
"rv.",
"s.",
"s.d.",
"sen.",
"sep.",
"siviling.",
"sms.",
"spm.",
"sr.",
"sst.",
"st.",
"stip.",
"stk.",
"st.meld.",
"st.prp.",
"stud.",
"s.u.",
"sv.",
"sø.",
"s.å.",
"såk.",
"temp.",
"ti.",
"tils.",
"tilsv.",
"tl;dr",
"tlf.",
"to.",
"t.o.m.",
"ult.",
"utg.",
"v.",
"vedk.",
"vedr.",
"vg.",
"vgs.",
"vha.",
"vit.ass.",
"vn.",
"vol.",
"vs.",
"vsa.",
"årg.",
"årh."
]

View File

@ -13,6 +13,8 @@ from ..hu import Hungarian
from ..fi import Finnish
from ..bn import Bengali
from ..he import Hebrew
from ..nb import Norwegian
from ..tokens import Doc
from ..strings import StringStore
@ -26,7 +28,7 @@ import pytest
LANGUAGES = [English, German, Spanish, Italian, French, Portuguese, Dutch,
Swedish, Hungarian, Finnish, Bengali]
Swedish, Hungarian, Finnish, Bengali, Norwegian]
@pytest.fixture(params=LANGUAGES)
@ -88,6 +90,9 @@ def bn_tokenizer():
def he_tokenizer():
return Hebrew.Defaults.create_tokenizer()
@pytest.fixture
def nb_tokenizer():
return Norwegian.Defaults.create_tokenizer()
@pytest.fixture
def stringstore():

View File

View File

@ -0,0 +1,17 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
NB_TOKEN_EXCEPTION_TESTS = [
('Smørsausen brukes bl.a. til fisk', ['Smørsausen', 'brukes', 'bl.a.', 'til', 'fisk']),
('Jeg kommer først kl. 13 pga. diverse forsinkelser', ['Jeg', 'kommer', 'først', 'kl.', '13', 'pga.', 'diverse', 'forsinkelser'])
]
@pytest.mark.parametrize('text,expected_tokens', NB_TOKEN_EXCEPTION_TESTS)
def test_tokenizer_handles_exception_cases(nb_tokenizer, text, expected_tokens):
tokens = nb_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list

View File

@ -7,6 +7,7 @@ p spaCy currently supports the following languages and capabilities:
+aside-code("Download language models", "bash").
python -m spacy download en
python -m spacy download de
python -m spacy download fr
+table([ "Language", "Token", "SBD", "Lemma", "POS", "NER", "Dep", "Vector", "Sentiment"])
+row
@ -19,6 +20,14 @@ p spaCy currently supports the following languages and capabilities:
each icon in [ "pro", "pro", "con", "pro", "pro", "pro", "pro", "con" ]
+cell.u-text-center #[+procon(icon)]
+row
+cell French #[code fr]
each icon in [ "pro", "pro", "con", "pro", "con", "pro", "pro", "con" ]
+cell.u-text-center #[+procon(icon)]
+h(2, "available") Available models
include ../usage/_models-list
+h(2, "alpha-support") Alpha support
@ -27,7 +36,7 @@ p
| the existing language data and extending the tokenization patterns.
+table([ "Language", "Source" ])
each language, code in { zh: "Chinese", es: "Spanish", it: "Italian", fr: "French", pt: "Portuguese", nl: "Dutch", sv: "Swedish", fi: "Finnish", hu: "Hungarian", bn: "Bengali", he: "Hebrew" }
each language, code in { zh: "Chinese", es: "Spanish", it: "Italian", pt: "Portuguese", nl: "Dutch", sv: "Swedish", fi: "Finnish", nb: "Norwegian Bokmål", hu: "Hungarian", bn: "Bengali", he: "Hebrew" }
+row
+cell #{language} #[code=code]
+cell

View File

@ -0,0 +1,27 @@
//- 💫 DOCS > USAGE > MODELS LIST
include ../../_includes/_mixins
p
| Model differences are mostly statistical. In general, we do expect larger
| models to be "better" and more accurate overall. Ultimately, it depends on
| your use case and requirements, and we recommend starting with the default
| models (marked with a star below).
+aside
| Models are now available as #[code .tar.gz] archives #[+a(gh("spacy-models")) from GitHub],
| attached to individual releases. They can be downloaded and loaded manually,
| or using spaCy's #[code download] and #[code link] commands. All models
| follow the naming convention of #[code [language]_[type]_[genre]_[size]].
| #[br]#[br]
+button(gh("spacy-models"), true, "primary").u-text-tag
| View model releases
+table(["Name", "Language", "Voc", "Dep", "Ent", "Vec", "Size", "License"])
+model-row("en_core_web_sm", "English", [1, 1, 1, 1], "50 MB", "CC BY-SA", true)
+model-row("en_core_web_md", "English", [1, 1, 1, 1], "1 GB", "CC BY-SA")
+model-row("en_depent_web_md", "English", [1, 1, 1, 0], "328 MB", "CC BY-SA")
+model-row("en_vectors_glove_md", "English", [1, 0, 0, 1], "727 MB", "CC BY-SA")
+model-row("de_core_news_md", "German", [1, 1, 1, 1], "645 MB", "CC BY-SA", true, true)
+model-row("fr_depvec_web_lg", "French", [1, 1, 0, 1], "1.33 GB", "CC BY-NC", true, true)

View File

@ -33,28 +33,7 @@ p
+h(2, "available") Available models
p
| Model differences are mostly statistical. In general, we do expect larger
| models to be "better" and more accurate overall. Ultimately, it depends on
| your use case and requirements, and we recommend starting with the default
| models (marked with a star below).
+aside
| Models are now available as #[code .tar.gz] archives #[+a(gh("spacy-models")) from GitHub],
| attached to individual releases. They can be downloaded and loaded manually,
| or using spaCy's #[code download] and #[code link] commands. All models
| follow the naming convention of #[code [language]_[type]_[genre]_[size]].
| #[br]#[br]
+button(gh("spacy-models"), true, "primary").u-text-tag
| View model releases
+table(["Name", "Language", "Voc", "Dep", "Ent", "Vec", "Size", "License"])
+model-row("en_core_web_sm", "English", [1, 1, 1, 1], "50 MB", "CC BY-SA", true)
+model-row("en_core_web_md", "English", [1, 1, 1, 1], "1 GB", "CC BY-SA")
+model-row("en_depent_web_md", "English", [1, 1, 1, 0], "328 MB", "CC BY-SA")
+model-row("en_vectors_glove_md", "English", [0, 0, 0, 1], "727 MB", "CC BY-SA")
+model-row("de_core_news_md", "German", [1, 1, 1, 1], "645 MB", "CC BY-SA", true, true)
include _models-list
+h(2, "download") Downloading models
@ -82,6 +61,7 @@ p
# out-of-the-box: download best-matching default model
python -m spacy download en
python -m spacy download de
python -m spacy download fr
# download best-matching version of specific model for your spaCy installation
python -m spacy download en_core_web_md