2016-02-07 03:13:40 +03:00
""" Issue #252
Question:
In the documents and tutorials the main thing I haven't found is examples on how to break sentences down into small sub-thoughts/chunks. The noun_chunks is handy, but having examples on using the token.head to find small (near-complete) sentence chunks would be neat.
Let's take the example sentence on https://displacy.spacy.io/displacy/index.html
displaCy uses CSS and JavaScript to show you how computers understand language
This sentence has two main parts (XCOMP & CCOMP) according to the breakdown:
[displaCy] uses CSS and JavaScript [to + show]
&
show you how computers understand [language]
I'm assuming that we can use the token.head to build these groups. In one of your examples you had the following function.
def dependency_labels_to_root(token):
    '''Walk up the syntactic tree, collecting the arc labels.'''
    dep_labels = []
    while token.head is not token:
        dep_labels.append(token.dep)
        token = token.head
    return dep_labels
"""
from __future__ import print_function, unicode_literals
# Answer:
# The easiest way is to find the head of the subtree you want, and then use the
# `.subtree`, `.children`, `.lefts` and `.rights` iterators. `.subtree` is the
# one that does what you're asking for most directly:
from spacy.en import English
nlp = English()
doc = nlp(u'displaCy uses CSS and JavaScript to show you how computers understand language')
for word in doc:
    if word.dep_ in ('xcomp', 'ccomp'):
        # `text_with_ws` already carries trailing whitespace, so join on ''.
        print(''.join(w.text_with_ws for w in word.subtree))
# It'd probably be better for `word.subtree` to return a `Span` object instead
# of a generator over the tokens. If you want the `Span` you can get it via the
# `.right_edge` and `.left_edge` properties. The `Span` object is nice because
# you can easily get a vector, merge it, etc.
doc = nlp(u'displaCy uses CSS and JavaScript to show you how computers understand language')
for word in doc:
    if word.dep_ in ('xcomp', 'ccomp'):
        # Slice from the leftmost to the rightmost token of the subtree.
        subtree_span = doc[word.left_edge.i : word.right_edge.i + 1]
        print(subtree_span.text, '|', subtree_span.root.text)
        print(subtree_span.similarity(doc))
        print(subtree_span.similarity(subtree_span.root))
# You might also want to select a head, and then select a start and end position by
# walking along its children. You could then take the `.left_edge` and `.right_edge`
# of those tokens, and use them to calculate a span.
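
# A minimal sketch of that last idea, under stated assumptions. The helper
# `span_bounds_from_children` is hypothetical (it is not part of spaCy); it only
# reads the `.i`, `.dep_`, `.children`, `.left_edge` and `.right_edge` token
# attributes. `ToyToken` is a hand-built stand-in so the arithmetic can be
# checked without loading a model, and the small tree below is written by hand
# for illustration. It is not parser output, and the labels are assumptions.


def span_bounds_from_children(head, keep_deps):
    '''Return (start, end) offsets covering `head` plus the subtrees of those
    children whose dependency label is in `keep_deps`.'''
    selected = [child for child in head.children if child.dep_ in keep_deps]
    start = min([head.i] + [child.left_edge.i for child in selected])
    end = max([head.i] + [child.right_edge.i for child in selected]) + 1
    return start, end


class ToyToken(object):
    '''Hypothetical stand-in exposing only the attributes the helper reads.'''

    def __init__(self, i, dep_, left_edge=None, right_edge=None):
        self.i = i
        self.dep_ = dep_
        self.children = []
        # A token with no dependents is its own left and right edge.
        self.left_edge = left_edge or self
        self.right_edge = right_edge or self


# Hand-assigned offsets and labels for the tail of the example sentence:
# "to show you how computers understand language" (tokens 5 to 11).
to = ToyToken(5, 'aux')
you = ToyToken(7, 'dobj')
how = ToyToken(8, 'advmod')
language = ToyToken(11, 'dobj')
understand = ToyToken(10, 'ccomp', left_edge=how, right_edge=language)
show = ToyToken(6, 'xcomp', left_edge=to, right_edge=language)
show.children = [to, you, understand]

print(span_bounds_from_children(show, ('aux',)))    # (5, 7): "to show"
print(span_bounds_from_children(show, ('ccomp',)))  # (6, 12): "show ... language"

# With a parsed `doc` and a real token `word`, the same offsets would slice a
# `Span`, e.g. `start, end = span_bounds_from_children(word, ('aux',))` followed
# by `doc[start:end]`. A span is contiguous, so any tokens lying between the
# head and a selected child are included as well.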