Update readme with release notes for v0.100.8
This commit is contained in: parent 72564213e3, commit 1b8b888a57
README.rst | 32
@@ -13,6 +13,38 @@ spaCy is built on the very latest research, but it isn't researchware. It was
designed from day 1 to be used in real products. It's commercial open-source
software, released under the MIT license.

2016-04-05 v0.100.7: German!
----------------------------

spaCy finally supports another language, in addition to English. We're lucky to have Wolfgang Seeker on the team, and the new German model is just the beginning.

Now that there are multiple languages, you should consider loading spaCy via the load() function. This function also makes it easier to load extra word vector data for English:

.. code:: python

    import spacy

    en_nlp = spacy.load('en', vectors='en_glove_cc_300_1m_vectors')
    de_nlp = spacy.load('de')

To support use of the load function, there are also two new helper functions: spacy.get_lang_class and spacy.set_lang_class.
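
For example, the helpers might be used like this (a minimal sketch; the CustomGerman subclass and the 'de2' shortcut are hypothetical illustrations, not part of spaCy, and the exact lookup semantics may differ):

.. code:: python

    import spacy

    # Look up the class registered under a language shortcut.
    German = spacy.get_lang_class('de')

    # A hypothetical subclass, e.g. with a customised pipeline.
    class CustomGerman(German):
        pass

    # Register it under a new shortcut so it can be looked up by name.
    spacy.set_lang_class('de2', CustomGerman)
    assert spacy.get_lang_class('de2') is CustomGerman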

Once the German model is loaded, you can use it just like the English model:

.. code:: python

    doc = de_nlp(u'''Wikipedia ist ein Projekt zum Aufbau einer Enzyklopädie aus freien Inhalten, zu dem du mit deinem Wissen beitragen kannst. Seit Mai 2001 sind 1.936.257 Artikel in deutscher Sprache entstanden.''')

    for sent in doc.sents:
        print(sent.root.text, sent.root.n_lefts, sent.root.n_rights)

    # (u'ist', 1, 2)
    # (u'sind', 1, 3)

The German model provides tokenization, POS tagging, sentence boundary detection, syntactic dependency parsing, recognition of organisation, location and person entities, and word vector representations trained on a mix of open subtitles and Wikipedia data. It doesn't yet provide lemmatisation or morphological analysis, and it doesn't yet recognise numeric entities such as numbers and dates.
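
For instance, you can inspect the recognised entities and word vectors just as you would in English (a minimal sketch; the example sentence is illustrative, and the exact entities found may differ):

.. code:: python

    doc = de_nlp(u'Angela Merkel besuchte die Volkswagen AG in Wolfsburg.')
    # "Angela Merkel visited Volkswagen AG in Wolfsburg."

    # Person, location and organisation entities recognised by the model.
    for ent in doc.ents:
        print(ent.text, ent.label_)

    # Word vectors trained on open subtitles and Wikipedia data.
    print(doc[0].vector.shape)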

Bugfixes
--------

* spaCy < 0.100.7 had a bug in the semantics of the Token.__str__ and
  Token.__unicode__ built-ins: they included a trailing space (see the example
  after this list).
* Improve handling of "infixed" hyphens. Previously the tokenizer struggled with multiple hyphens, such as "well-to-do".
|
||||||
|
* Improve handling of periods after mixed-case tokens
|
||||||
|
* Improve lemmatization for English special-case tokens
|
||||||
|
* Fix bug that allowed spaces to be treated as heads in the syntactic parse
|
||||||
|
* Fix bug that led to inconsistent sentence boundaries before and after serialisation.
|
||||||
|
* Fix bug from deserialising untagged documents.
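
To illustrate the first bugfix, here's a minimal sketch of the corrected string semantics (the example text is illustrative):

.. code:: python

    doc = en_nlp(u'Hello world')

    # In spaCy < 0.100.7, str(doc[0]) included a trailing space: 'Hello '.
    # From v0.100.7 on, it is the token text alone:
    assert str(doc[0]) == 'Hello'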

Features
--------