commit 1bd0d90a9e

Merge branch 'master' of github.com:honnibal/spaCy into mrshu/docs-postags-fix

Signed-off-by: mr.Shu <mr@shu.io>

Conflicts:
    docs/source/index.rst
@@ -8,7 +8,8 @@ spaCy: Industrial-strength NLP
 ==============================
 
 `spaCy`_ is a new library for text processing in Python and Cython.
-I wrote it because I think small companies are terrible at NLP. Or rather:
+I wrote it because I think small companies are terrible at
+natural language processing (NLP). Or rather:
 small companies are using terrible NLP technology.
 
 .. _spaCy: https://github.com/honnibal/spaCy/
@@ -77,7 +78,7 @@ particularly egregious:
     >>> nlp = spacy.en.English()
     >>> tokens = nlp("‘Give it back,’ he pleaded abjectly, ‘it’s mine.’",
                      tag=True, parse=False)
-    >>> print(''.join(tok.string.upper() if tok.pos == ADV else tok.string) for t in tokens)
+    >>> print(''.join(tok.string.upper() if tok.pos == ADV else tok.string for tok in tokens))
     ‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’
 
 
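The fix on the ``+`` line is the parenthesis placement: in the old version, ``''.join(...)`` closes before the ``for`` clause, so ``print`` receives a generator object rather than the joined string (and the loop variable ``t`` never binds the ``tok`` the expression uses). A minimal sketch of the difference, using a hypothetical list of ``(string, is_adverb)`` pairs in place of spaCy tokens::

    # Hypothetical stand-in for spaCy tokens: (string, is_adverb) pairs.
    toks = [("Give ", False), ("it ", False), ("back", True)]

    # Old placement: join() closes before the for-clause, so print()
    # receives a generator object instead of text.
    print(''.join(t.upper() if adv else t) for t, adv in toks)
    # -> <generator object <genexpr> at 0x...>

    # New placement: the for-clause lives inside join(), producing one string.
    print(''.join(t.upper() if adv else t for t, adv in toks))
    # -> Give it BACK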
@@ -143,7 +144,7 @@ cosine metric:
     >>> from numpy import dot
     >>> from numpy.linalg import norm
     >>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
-    >>> words = [w for w in nlp.vocab if w.is_lower]
+    >>> words = [w for w in nlp.vocab if w.lower]
     >>> words.sort(key=lambda w: cosine(w.repvec, pleaded.repvec))
     >>> words.reverse()
     >>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
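The metric itself is just the dot product of two vectors scaled by the product of their norms (dividing by a tuple such as ``(norm(v1), norm(v2))`` would raise a TypeError). A standalone sketch with plain NumPy and toy vectors, independent of the ``repvec`` embeddings spaCy used in this era::

    import numpy as np

    # Cosine similarity: dot(v1, v2) / (|v1| * |v2|), always in [-1, 1].
    def cosine(v1, v2):
        return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 4.0, 6.0])   # parallel to a
    print(cosine(a, b))    # 1.0
    print(cosine(a, -b))   # -1.0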
@@ -207,6 +208,7 @@ problematic, given our starting assumptions:
     >>> from numpy.linalg import norm
     >>> import spacy.en
     >>> from spacy.parts_of_speech import ADV, VERB
+    >>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
     >>> def is_bad_adverb(token, target_verb, tol):
     ...     if token.pos != ADV:
     ...         return False
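The added ``cosine`` line assumes ``dot`` from the earlier ``from numpy import dot``, and the hunk cuts the function off after its first check. Purely as an illustration, not the original continuation: one plausible completion, assuming a "bad adverb" is one modifying a verb close to ``target_verb`` in vector space, and using the era's ``head``, ``pos``, and ``repvec`` attributes::

    from numpy import dot
    from numpy.linalg import norm
    from spacy.parts_of_speech import ADV, VERB

    cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))

    # Hypothetical completion -- the original body is cut off by the hunk.
    def is_bad_adverb(token, target_verb, tol):
        if token.pos != ADV:
            return False                  # only adverbs qualify
        elif token.head.pos != VERB:
            return False                  # the adverb must modify a verb
        else:
            # "bad" if the modified verb is within tol of the target verb
            return cosine(token.head.repvec, target_verb.repvec) >= tol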
@@ -310,6 +312,7 @@ on the standard evaluation from the Wall Street Journal, given gold-standard
 sentence boundaries and tokenization. I'm in the process of completing a more
 realistic evaluation on web text.
 
 
 spaCy's parser offers a better speed/accuracy trade-off than any published
 system: its accuracy is within 1% of the current state-of-the-art, and it's
 seven times faster than the 2014 CoreNLP neural network parser, which is the