spaCy/spacy/pipeline/functions.py

from ..language import component
from ..matcher import Matcher
from ..util import filter_spans


@component(
    "merge_noun_chunks",
    requires=["token.dep", "token.tag", "token.pos"],
    retokenizes=True,
)
def merge_noun_chunks(doc):
    """Merge noun chunks into a single token.

    doc (Doc): The Doc object.
    RETURNS (Doc): The Doc object with merged noun chunks.

    DOCS: https://spacy.io/api/pipeline-functions#merge_noun_chunks
    """
    # Noun chunks are derived from the dependency parse; without one, return
    # the doc unchanged.
    if not doc.is_parsed:
        return doc
    with doc.retokenize() as retokenizer:
        for np in doc.noun_chunks:
            attrs = {"tag": np.root.tag, "dep": np.root.dep}
            retokenizer.merge(np, attrs=attrs)
    return doc


@component(
    "merge_entities",
    requires=["doc.ents", "token.ent_iob", "token.ent_type"],
    retokenizes=True,
)
def merge_entities(doc):
    """Merge entities into a single token.

    doc (Doc): The Doc object.
    RETURNS (Doc): The Doc object with merged entities.

    DOCS: https://spacy.io/api/pipeline-functions#merge_entities
    """
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            attrs = {"tag": ent.root.tag, "dep": ent.root.dep, "ent_type": ent.label}
            retokenizer.merge(ent, attrs=attrs)
    return doc


@component("merge_subtokens", requires=["token.dep"], retokenizes=True)
def merge_subtokens(doc, label="subtok"):
    """Merge subtokens into a single token.

    doc (Doc): The Doc object.
    label (unicode): The subtoken dependency label.
    RETURNS (Doc): The Doc object with merged subtokens.

    DOCS: https://spacy.io/api/pipeline-functions#merge_subtokens
    """
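    merger = Matcher(doc.vocab)
    # Match one or more consecutive tokens with the subtoken dependency label.
    merger.add("SUBTOK", None, [{"DEP": label, "op": "+"}])
    matches = merger(doc)
    # The matcher returns every sub-run of a subtoken sequence, so keep only the
    # longest non-overlapping spans. The match `end` is exclusive, so `end + 1`
    # extends each span to include the token after the run.
    spans = filter_spans([doc[start : end + 1] for _, start, end in matches])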
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc
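
The filtering step in `merge_subtokens()` can be illustrated without spaCy. The following is a minimal sketch, not the library's implementation: it works on plain `(start, end)` tuples instead of `Span` objects, and the name `filter_overlapping` is hypothetical. It assumes the selection rule described in the commit message for #4539: prefer the longest matches and drop anything that overlaps a span already kept.

```python
def filter_overlapping(spans):
    """Keep the longest spans, dropping any that overlap an already kept one.

    spans: iterable of (start, end) pairs with an exclusive end.
    Hypothetical helper for illustration; not part of spaCy.
    """
    # Longest first; among spans of equal length, the earlier start wins.
    by_priority = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    seen = set()  # token indices already covered by a kept span
    for start, end in by_priority:
        # Checking the two endpoints is enough here: a later span is never
        # longer than a kept one, so it cannot strictly contain it.
        if start not in seen and (end - 1) not in seen:
            kept.append((start, end))
            seen.update(range(start, end))
    return sorted(kept)


# Three consecutive subtoken-labelled tokens at positions 2, 3 and 4 of a
# 6-token doc: a "+" pattern matches every contiguous sub-run.
matches = [(2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]
# Mirror the `end + 1` extension used in merge_subtokens().
candidates = [(start, end + 1) for start, end in matches]
merged = filter_overlapping(candidates)
print(merged)
```

Without the filter, the retokenizer would be asked to merge six overlapping spans. With it, only the single span covering the whole run plus the following token remains.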

Commit history (from the blame view, oldest first):

2019-02-10  💫 Break up large pipeline.pyx (#3246): function definitions and the `Matcher` import. Break up large pipeline.pyx; merge some components back together; fix typo.
2019-02-15  💫 Replace {Doc,Span}.merge with Doc.retokenize (#3280): the `with doc.retokenize() as retokenizer:` blocks. Add deprecation warning to Doc.merge and Span.merge.
2019-03-08  Tidy up and improve docs and docstrings (#3370): docstrings and DOCS URLs. Tidy up and adjust Cython code to code style; improve docstrings and make calling `help()` nicer; add URLs to new docs pages to docstrings wherever possible; fix various typos and inconsistencies in docs.
2019-10-27  Component decorator and component analysis (#4517): the `@component(...)` decorators and `from ..language import component`. Among other changes: simplify the decorator so it writes to the object instead of returning a wrapped class; update assigns/requires declarations of builtins; add "retokenizes" to the printed overview; put auto pipeline analysis behind a flag for now.
2019-10-28  Filter subtoken matches in merge_subtokens() (#4539): the `filter_spans` import and the `spans = filter_spans(...)` line. The `Matcher` in `merge_subtokens()` returns all possible subsequences of `subtok`, so for sequences of two or more subtoks it's necessary to filter the matches so that the retokenizer is only merging the longest matches with no overlapping spans.