spaCy/spacy/pipeline/functions.py

# coding: utf8
from __future__ import unicode_literals

from ..matcher import Matcher


def merge_noun_chunks(doc):
    """Merge noun chunks into a single token.

    doc (Doc): The Doc object.
    RETURNS (Doc): The Doc object with merged noun chunks.

    DOCS: https://spacy.io/api/pipeline-functions#merge_noun_chunks
    """
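    # Noun chunks are derived from the dependency parse, so return the Doc
    # unchanged if it has not been parsed.
    if not doc.is_parsed:
        return doc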
    with doc.retokenize() as retokenizer:
        for np in doc.noun_chunks:
            attrs = {"tag": np.root.tag, "dep": np.root.dep}
            retokenizer.merge(np, attrs=attrs)
    return doc


def merge_entities(doc):
    """Merge entities into a single token.

    doc (Doc): The Doc object.
    RETURNS (Doc): The Doc object with merged entities.

    DOCS: https://spacy.io/api/pipeline-functions#merge_entities
    """
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            attrs = {"tag": ent.root.tag, "dep": ent.root.dep, "ent_type": ent.label}
            retokenizer.merge(ent, attrs=attrs)
    return doc


def merge_subtokens(doc, label="subtok"):
    """Merge subtokens into a single token.

    doc (Doc): The Doc object.
    label (unicode): The subtoken dependency label.
    RETURNS (Doc): The Doc object with merged subtokens.

    DOCS: https://spacy.io/api/pipeline-functions#merge_subtokens
    """
    merger = Matcher(doc.vocab)
    merger.add("SUBTOK", None, [{"DEP": label, "op": "+"}])
    matches = merger(doc)
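    # Extend each match by one token so the span also covers the token that
    # follows the run of `label` tokens.
    spans = [doc[start : end + 1] for _, start, end in matches]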
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc
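
A note on `merge_subtokens`: with the `"+"` operator, the matcher may report several overlapping matches for one run of subtokens, and overlapping spans cannot be merged in a single `retokenize` block. The sketch below shows one way to reduce such matches to the longest non-overlapping ranges before merging. It is plain Python and does not call spaCy; `longest_non_overlapping` and the sample `matches` list are hypothetical and used only for illustration. Later spaCy releases provide `spacy.util.filter_spans` for a similar purpose.

```python
def longest_non_overlapping(ranges):
    """Prefer longer ranges and drop any range overlapping one already kept.

    ranges: iterable of (start, end) token offsets with an exclusive end.
    """
    # Longest first; ties are broken in favour of the earlier start.
    ordered = sorted(ranges, key=lambda r: (r[1] - r[0], -r[0]), reverse=True)
    kept = []
    seen = set()
    for start, end in ordered:
        # Because longer ranges are handled first, checking both end points
        # is enough to detect an overlap with a range that was already kept.
        if start not in seen and end - 1 not in seen:
            kept.append((start, end))
            seen.update(range(start, end))
    return sorted(kept)


# Hypothetical matcher output for a doc in which tokens 2 and 3 carry the
# subtoken label: (match_id, start, end) with an exclusive end.
matches = [(0, 2, 3), (0, 2, 4), (0, 3, 4)]
# Mirror the `end + 1` extension used in merge_subtokens.
ranges = [(start, end + 1) for _, start, end in matches]
merged = longest_non_overlapping(ranges)
print(merged)
```

Each remaining `(start, end)` pair could then be sliced as `doc[start:end]` and passed to `retokenizer.merge`, assuming the offsets come from the same `Doc`.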