spaCy/spacy/syntax/iterators.pyx

from spacy.parts_of_speech cimport NOUN, PROPN, PRON


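def english_noun_chunks(doc):
    """Yield (start, end, label) triples for noun chunks in an English parse.

    start and end are token indices (end is exclusive); label is the string
    ID of 'NP' in the document's vocabulary.
    """
    labels = ['nsubj', 'dobj', 'nsubjpass', 'pcomp', 'pobj',
              'attr', 'root']
    # A set gives constant-time membership tests inside the loop.
    np_deps = set(doc.vocab.strings[label] for label in labels)
    conj = doc.vocab.strings['conj']
    np_label = doc.vocab.strings['NP']
    for word in doc:
        if word.pos in (NOUN, PROPN, PRON) and word.dep in np_deps:
            yield word.left_edge.i, word.i+1, np_label
        elif word.pos == NOUN and word.dep == conj:
            # Walk leftwards up the chain of conjuncts to the first conjunct,
            # whose dependency label describes the whole coordination.
            head = word.head
            while head.dep == conj and head.head.i < head.i:
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
                yield word.left_edge.i, word.i+1, np_label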


# This iterator extracts spans headed by a NOUN, PROPN or PRON, running from
# the left-most token of the head's subtree up to the head itself.
# For close apposition and measurement constructions, the span is extended
# to the right of the head.
# Example: "eine Tasse Tee" (a cup (of) tea) yields "eine Tasse Tee" and not
# just "eine Tasse"; likewise for "das Thema Familie".
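def german_noun_chunks(doc):
    """Yield (start, end, label) triples for noun chunks in a German parse.

    Tokens to the left of the previous chunk's right boundary are not
    considered as chunk heads.
    """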
    labels = ['sb', 'oa', 'da', 'nk', 'mo', 'ag', 'root', 'cj', 'pd', 'og', 'app']
    np_label = doc.vocab.strings['NP']
    np_deps = set(doc.vocab.strings[label] for label in labels)
    close_app = doc.vocab.strings['nk']

    rbracket = 0
    for i, word in enumerate(doc):
        if i < rbracket:
            continue
        if word.pos in (NOUN, PROPN, PRON) and word.dep in np_deps:
            rbracket = word.i+1
            # try to extend the span to the right
            # to capture close apposition/measurement constructions
            for rdep in word.rights:
                if rdep.pos in (NOUN, PROPN) and rdep.dep == close_app:
                    rbracket = rdep.i+1
            yield word.left_edge.i, rbracket, np_label


CHUNKERS = {'en': english_noun_chunks, 'de': german_noun_chunks}
* Add baseclass DocIterator for iterators over documents. Add classes for English and German noun chunks. The respective iterators are set for the document when created by the parser, as they depend on the annotation scheme of the parsing model. (2016-03-16 17:53:35 +03:00)
* Refactor noun chunk iterators, so that they're simple functions. Install the iterator when the Doc is created, but allow users to write to the noun_chunk_iterator attribute. The iterator functions accept an object and yield (int start, int end, int label) triples. (2016-05-02 15:25:10 +03:00)
* Make the code less cryptic. (2016-05-03 18:19:05 +03:00)
* Fix whitespace. (2016-05-04 08:40:38 +03:00)
* Fix Issue #365: Error introduced during noun phrase chunking, due to use of corrected PRON/PROPN/etc tags. (2016-05-06 01:21:05 +03:00)
* Add fix for German noun chunk iterator (issue #365). (2016-05-06 02:41:26 +03:00)
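
The (start, end, label) contract of the two chunkers can be tried out without a parser. The sketch below is a plain-Python port that runs on hand-annotated toy tokens; it is not the spaCy implementation. `ToyToken`, `toy_english_noun_chunks`, `toy_german_noun_chunks` and `chunk_texts` are hypothetical names introduced for illustration, labels are plain strings instead of vocabulary IDs, and the parses (including each token's `left_edge`) are written by hand as an assumption about what a parser might produce.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

Chunk = Tuple[int, int, str]  # (start, exclusive end, label)

NOMINAL = {"NOUN", "PROPN", "PRON"}
EN_NP_DEPS = {"nsubj", "dobj", "nsubjpass", "pcomp", "pobj", "attr", "root"}
DE_NP_DEPS = {"sb", "oa", "da", "nk", "mo", "ag", "root", "cj", "pd", "og", "app"}


@dataclass
class ToyToken:
    """Hypothetical stand-in for a parsed token; not a spaCy class."""
    i: int          # position in the sentence
    text: str
    pos: str        # coarse part-of-speech tag
    dep: str        # dependency label
    head: int       # index of the head token (the root points to itself)
    left_edge: int  # index of the left-most token in this token's subtree


def toy_english_noun_chunks(tokens: List[ToyToken]) -> Iterator[Chunk]:
    for word in tokens:
        if word.pos in NOMINAL and word.dep in EN_NP_DEPS:
            yield word.left_edge, word.i + 1, "NP"
        elif word.pos == "NOUN" and word.dep == "conj":
            # Climb leftwards through the conjunct chain to the first conjunct.
            head = tokens[word.head]
            while head.dep == "conj" and head.head < head.i:
                head = tokens[head.head]
            if head.dep in EN_NP_DEPS:
                yield word.left_edge, word.i + 1, "NP"


def toy_german_noun_chunks(tokens: List[ToyToken]) -> Iterator[Chunk]:
    rbracket = 0
    for word in tokens:
        if word.i < rbracket:
            continue  # already covered by the previous chunk
        if word.pos in NOMINAL and word.dep in DE_NP_DEPS:
            rbracket = word.i + 1
            # Extend to the right over close apposition / measurement nouns.
            for rdep in tokens:
                is_right_child = rdep.head == word.i and rdep.i > word.i
                if is_right_child and rdep.pos in ("NOUN", "PROPN") and rdep.dep == "nk":
                    rbracket = rdep.i + 1
            yield word.left_edge, rbracket, "NP"


def chunk_texts(tokens: List[ToyToken], chunks: List[Chunk]) -> List[str]:
    return [" ".join(t.text for t in tokens[start:end]) for start, end, _ in chunks]


# "The cat chased a mouse and a rat" (hand-written parse, an assumption)
en_tokens = [
    ToyToken(0, "The", "DET", "det", 1, 0),
    ToyToken(1, "cat", "NOUN", "nsubj", 2, 0),
    ToyToken(2, "chased", "VERB", "root", 2, 0),
    ToyToken(3, "a", "DET", "det", 4, 3),
    ToyToken(4, "mouse", "NOUN", "dobj", 2, 3),
    ToyToken(5, "and", "CCONJ", "cc", 4, 5),
    ToyToken(6, "a", "DET", "det", 7, 6),
    ToyToken(7, "rat", "NOUN", "conj", 4, 6),
]
# "Ich trinke eine Tasse Tee" (hand-written parse, an assumption)
de_tokens = [
    ToyToken(0, "Ich", "PRON", "sb", 1, 0),
    ToyToken(1, "trinke", "VERB", "root", 1, 0),
    ToyToken(2, "eine", "DET", "nk", 3, 2),
    ToyToken(3, "Tasse", "NOUN", "oa", 1, 2),
    ToyToken(4, "Tee", "NOUN", "nk", 3, 4),
]

en_chunks = list(toy_english_noun_chunks(en_tokens))
de_chunks = list(toy_german_noun_chunks(de_tokens))
print(chunk_texts(en_tokens, en_chunks))  # ['The cat', 'a mouse', 'a rat']
print(chunk_texts(de_tokens, de_chunks))  # ['Ich', 'eine Tasse Tee']
```

In the English example, "rat" is picked up through the conj branch because its first conjunct "mouse" is a direct object. In the German example, "Tee" is absorbed into the chunk headed by "Tasse", so it is skipped as a head of its own.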