spaCy/spacy/lang/de/syntax_iterators.py

# coding: utf8
from __future__ import unicode_literals

from ...symbols import NOUN, PROPN, PRON
from ...errors import Errors


def noun_chunks(doclike):
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    """
    # This iterator extracts spans headed by nouns (NOUN, PROPN, PRON),
    # starting from the left-most syntactic dependent and ending at the head
    # noun itself. For close apposition and measurement constructions, the span
    # is sometimes extended to the right of the noun. Example: "eine Tasse Tee"
    # (a cup (of) tea) returns "eine Tasse Tee" and not just "eine Tasse"; the
    # same holds for "das Thema Familie".
    labels = [
        "sb",
        "oa",
        "da",
        "nk",
        "mo",
        "ag",
        "ROOT",
        "root",
        "cj",
        "pd",
        "og",
        "app",
    ]
    doc = doclike.doc  # Ensure works on both Doc and Span.

    if not doc.is_parsed:
        raise ValueError(Errors.E029)

    np_label = doc.vocab.strings.add("NP")
    np_deps = set(doc.vocab.strings.add(label) for label in labels)
    close_app = doc.vocab.strings.add("nk")

    rbracket = 0
    prev_end = -1
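    for word in doclike:
        # Compare document-level indices: rbracket is derived from word.i, so
        # a span-relative counter would skip tokens when doclike is a Span.
        if word.i < rbracket:
            continue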
        # Prevent nested chunks from being produced
        if word.left_edge.i <= prev_end:
            continue
        if word.pos in (NOUN, PROPN, PRON) and word.dep in np_deps:
            rbracket = word.i + 1
            # try to extend the span to the right
            # to capture close apposition/measurement constructions
            for rdep in doc[word.i].rights:
                if rdep.pos in (NOUN, PROPN) and rdep.dep == close_app:
                    rbracket = rdep.i + 1
            prev_end = rbracket - 1
            yield word.left_edge.i, rbracket, np_label


SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}
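
The right-extension and overlap checks in `noun_chunks` can be traced without a trained pipeline. The following is a minimal sketch, not spaCy code: `ToyToken`, `left_edge`, `toy_noun_chunks` and `SENTENCE` are hypothetical names introduced for illustration, and the dependency parse of the example sentence is written by hand rather than produced by a parser.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

# Dependency labels copied from the `labels` list in noun_chunks.
NP_DEPS = {"sb", "oa", "da", "nk", "mo", "ag", "ROOT", "root", "cj", "pd", "og", "app"}
HEAD_POS = {"NOUN", "PROPN", "PRON"}


@dataclass
class ToyToken:
    """Hypothetical stand-in for a spaCy Token; not part of spaCy."""

    i: int  # position in the sentence (must equal the list index)
    text: str
    pos: str  # coarse part-of-speech tag
    dep: str  # dependency label
    head: int  # index of the syntactic head (the root points to itself)


def left_edge(tokens: List[ToyToken], idx: int) -> int:
    """Return the left-most index in the subtree rooted at idx."""
    edge = idx
    for tok in tokens:
        if tok.head == idx and tok.i != idx:
            edge = min(edge, left_edge(tokens, tok.i))
    return edge


def toy_noun_chunks(tokens: List[ToyToken]) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs, following the steps of noun_chunks."""
    rbracket = 0
    prev_end = -1
    for word in tokens:
        if word.i < rbracket:
            continue
        start = left_edge(tokens, word.i)
        # Skip words whose subtree overlaps a chunk that was already emitted.
        if start <= prev_end:
            continue
        if word.pos in HEAD_POS and word.dep in NP_DEPS:
            rbracket = word.i + 1
            # Extend to the right over direct "nk" dependents (close apposition).
            for rdep in tokens[word.i + 1:]:
                if rdep.head == word.i and rdep.pos in ("NOUN", "PROPN") and rdep.dep == "nk":
                    rbracket = rdep.i + 1
            prev_end = rbracket - 1
            yield start, rbracket


# Hand-written parse of "Ich trinke eine Tasse Tee" (an assumption, not parser output).
SENTENCE = [
    ToyToken(0, "Ich", "PRON", "sb", 1),
    ToyToken(1, "trinke", "VERB", "ROOT", 1),
    ToyToken(2, "eine", "DET", "nk", 3),
    ToyToken(3, "Tasse", "NOUN", "oa", 1),
    ToyToken(4, "Tee", "NOUN", "nk", 3),
]

chunks = [" ".join(t.text for t in SENTENCE[s:e]) for s, e in toy_noun_chunks(SENTENCE)]
print(chunks)
```

With this hand-written parse the sketch produces two chunks, "Ich" and "eine Tasse Tee": "Tee" attaches to "Tasse" with the label `nk`, so the right bracket moves past it, which is the measurement construction described in the comment at the top of `noun_chunks`.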