spaCy/spacy/lang/de/syntax_iterators.py

# coding: utf8
from __future__ import unicode_literals

from ...symbols import NOUN, PROPN, PRON
from ...errors import Errors


def noun_chunks(obj):
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    """
    # this iterator extracts spans headed by NOUNs starting from the left-most
    # syntactic dependent until the NOUN itself for close apposition and
    # measurement construction, the span is sometimes extended to the right of
    # the NOUN. Example: "eine Tasse Tee" (a cup (of) tea) returns "eine Tasse Tee"
    # and not just "eine Tasse", same for "das Thema Familie".
    labels = [
        "sb",
        "oa",
        "da",
        "nk",
        "mo",
        "ag",
        "ROOT",
        "root",
        "cj",
        "pd",
        "og",
        "app",
    ]
    doc = obj.doc  # Ensure works on both Doc and Span.

    if not doc.is_parsed:
        raise ValueError(Errors.E029)

    np_label = doc.vocab.strings.add("NP")
    np_deps = set(doc.vocab.strings.add(label) for label in labels)
    close_app = doc.vocab.strings.add("nk")

    rbracket = 0
    for i, word in enumerate(obj):
        if i < rbracket:
            continue
        if word.pos in (NOUN, PROPN, PRON) and word.dep in np_deps:
            rbracket = word.i + 1
            # try to extend the span to the right
            # to capture close apposition/measurement constructions
            for rdep in doc[word.i].rights:
                if rdep.pos in (NOUN, PROPN) and rdep.dep == close_app:
                    rbracket = rdep.i + 1
            yield word.left_edge.i, rbracket, np_label


SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}
Add language-specific syntax iterators to en and de 2017-05-17 12:37:48 +03:00			`# coding: utf8`
			`from __future__ import unicode_literals`

			`from ...symbols import NOUN, PROPN, PRON`
Limiting noun_chunks for specific languages (#5396) * Limiting noun_chunks for specific langauges * Limiting noun_chunks for specific languages Contributor Agreement * Addressing review comments * Removed unused fixtures and imports * Add fa_tokenizer in test suite * Use fa_tokenizer in test * Undo extraneous reformatting Co-authored-by: adrianeboyd <adrianeboyd@gmail.com> 2020-05-14 13:58:06 +03:00			`from ...errors import Errors`
Add language-specific syntax iterators to en and de 2017-05-17 12:37:48 +03:00

			`def noun_chunks(obj):`
			`"""`
			`Detect base noun phrases from a dependency parse. Works on both Doc and Span.`
			`"""`
			`# this iterator extracts spans headed by NOUNs starting from the left-most`
			`# syntactic dependent until the NOUN itself for close apposition and`
			`# measurement construction, the span is sometimes extended to the right of`
			`# the NOUN. Example: "eine Tasse Tee" (a cup (of) tea) returns "eine Tasse Tee"`
			`# and not just "eine Tasse", same for "das Thema Familie".`
💫 Tidy up and auto-format .py files (#2983) <!--- Provide a general summary of your changes in the title. --> ## Description - [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files. - [x] Update flake8 config to exclude very large files (lemmatization tables etc.) - [x] Update code to be compatible with flake8 rules - [x] Fix various small bugs, inconsistencies and messy stuff in the language data - [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means) Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` actually get meaningful results. At the moment, the code style and linting isn't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information. ### Types of change enhancement, code style ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. 2018-11-30 19:03:03 +03:00			`labels = [`
			`"sb",`
			`"oa",`
			`"da",`
			`"nk",`
			`"mo",`
			`"ag",`
			`"ROOT",`
			`"root",`
			`"cj",`
			`"pd",`
			`"og",`
			`"app",`
			`]`
			`doc = obj.doc # Ensure works on both Doc and Span.`
Limiting noun_chunks for specific languages (#5396) * Limiting noun_chunks for specific langauges * Limiting noun_chunks for specific languages Contributor Agreement * Addressing review comments * Removed unused fixtures and imports * Add fa_tokenizer in test suite * Use fa_tokenizer in test * Undo extraneous reformatting Co-authored-by: adrianeboyd <adrianeboyd@gmail.com> 2020-05-14 13:58:06 +03:00
			`if not doc.is_parsed:`
			`raise ValueError(Errors.E029)`

💫 Tidy up and auto-format .py files (#2983) <!--- Provide a general summary of your changes in the title. --> ## Description - [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files. - [x] Update flake8 config to exclude very large files (lemmatization tables etc.) - [x] Update code to be compatible with flake8 rules - [x] Fix various small bugs, inconsistencies and messy stuff in the language data - [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means) Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` actually get meaningful results. At the moment, the code style and linting isn't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information. ### Types of change enhancement, code style ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. 2018-11-30 19:03:03 +03:00			`np_label = doc.vocab.strings.add("NP")`
Fix syntax iterators 2017-06-04 23:51:50 +03:00			`np_deps = set(doc.vocab.strings.add(label) for label in labels)`
💫 Tidy up and auto-format .py files (#2983) <!--- Provide a general summary of your changes in the title. --> ## Description - [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files. - [x] Update flake8 config to exclude very large files (lemmatization tables etc.) - [x] Update code to be compatible with flake8 rules - [x] Fix various small bugs, inconsistencies and messy stuff in the language data - [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means) Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` actually get meaningful results. At the moment, the code style and linting isn't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information. ### Types of change enhancement, code style ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. 2018-11-30 19:03:03 +03:00			`close_app = doc.vocab.strings.add("nk")`
Add language-specific syntax iterators to en and de 2017-05-17 12:37:48 +03:00
			`rbracket = 0`
			`for i, word in enumerate(obj):`
			`if i < rbracket:`
			`continue`
			`if word.pos in (NOUN, PROPN, PRON) and word.dep in np_deps:`
💫 Tidy up and auto-format .py files (#2983) <!--- Provide a general summary of your changes in the title. --> ## Description - [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files. - [x] Update flake8 config to exclude very large files (lemmatization tables etc.) - [x] Update code to be compatible with flake8 rules - [x] Fix various small bugs, inconsistencies and messy stuff in the language data - [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means) Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` actually get meaningful results. At the moment, the code style and linting isn't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information. ### Types of change enhancement, code style ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. 2018-11-30 19:03:03 +03:00			`rbracket = word.i + 1`
Add language-specific syntax iterators to en and de 2017-05-17 12:37:48 +03:00			`# try to extend the span to the right`
			`# to capture close apposition/measurement constructions`
			`for rdep in doc[word.i].rights:`
			`if rdep.pos in (NOUN, PROPN) and rdep.dep == close_app:`
💫 Tidy up and auto-format .py files (#2983) <!--- Provide a general summary of your changes in the title. --> ## Description - [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files. - [x] Update flake8 config to exclude very large files (lemmatization tables etc.) - [x] Update code to be compatible with flake8 rules - [x] Fix various small bugs, inconsistencies and messy stuff in the language data - [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means) Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` actually get meaningful results. At the moment, the code style and linting isn't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information. ### Types of change enhancement, code style ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. 2018-11-30 19:03:03 +03:00			`rbracket = rdep.i + 1`
Add language-specific syntax iterators to en and de 2017-05-17 12:37:48 +03:00			`yield word.left_edge.i, rbracket, np_label`


💫 Tidy up and auto-format .py files (#2983) <!--- Provide a general summary of your changes in the title. --> ## Description - [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files. - [x] Update flake8 config to exclude very large files (lemmatization tables etc.) - [x] Update code to be compatible with flake8 rules - [x] Fix various small bugs, inconsistencies and messy stuff in the language data - [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means) Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` actually get meaningful results. At the moment, the code style and linting isn't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information. ### Types of change enhancement, code style ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. 2018-11-30 19:03:03 +03:00			`SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}`