diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml index d60c90c1c..4099b31e2 100644 --- a/.github/workflows/tests.yml +++ b/.github/workflows/tests.yml @@ -45,6 +45,12 @@ jobs: run: | python -m pip install flake8==5.0.4 python -m flake8 spacy --count --select=E901,E999,F821,F822,F823,W605 --show-source --statistics + - name: cython-lint + run: | + python -m pip install cython-lint -c requirements.txt + # E501: line too long, W291: trailing whitespace, E266: too many leading '#' for block comment + cython-lint spacy --ignore E501,W291,E266 + tests: name: Test needs: Validate diff --git a/Makefile b/Makefile index 4de628663..c8f68be7f 100644 --- a/Makefile +++ b/Makefile @@ -1,11 +1,11 @@ SHELL := /bin/bash ifndef SPACY_EXTRAS -override SPACY_EXTRAS = spacy-lookups-data==1.0.2 jieba spacy-pkuseg==0.0.28 sudachipy sudachidict_core pymorphy2 +override SPACY_EXTRAS = spacy-lookups-data==1.0.3 endif ifndef PYVER -override PYVER = 3.6 +override PYVER = 3.8 endif VENV := ./env$(PYVER) diff --git a/README.md b/README.md index 59d3ee9ee..9a8f90749 100644 --- a/README.md +++ b/README.md @@ -6,23 +6,20 @@ spaCy is a library for **advanced Natural Language Processing** in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products. -spaCy comes with -[pretrained pipelines](https://spacy.io/models) and -currently supports tokenization and training for **70+ languages**. It features -state-of-the-art speed and **neural network models** for tagging, -parsing, **named entity recognition**, **text classification** and more, -multi-task learning with pretrained **transformers** like BERT, as well as a +spaCy comes with [pretrained pipelines](https://spacy.io/models) and currently +supports tokenization and training for **70+ languages**. 
It features +state-of-the-art speed and **neural network models** for tagging, parsing, +**named entity recognition**, **text classification** and more, multi-task +learning with pretrained **transformers** like BERT, as well as a production-ready [**training system**](https://spacy.io/usage/training) and easy model packaging, deployment and workflow management. spaCy is commercial -open-source software, released under the [MIT license](https://github.com/explosion/spaCy/blob/master/LICENSE). +open-source software, released under the +[MIT license](https://github.com/explosion/spaCy/blob/master/LICENSE). -💥 **We'd love to hear more about your experience with spaCy!** -[Fill out our survey here.](https://form.typeform.com/to/aMel9q9f) - -💫 **Version 3.5 out now!** +💫 **Version 3.6 out now!** [Check out the release notes here.](https://github.com/explosion/spaCy/releases) -[![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-pipelines&style=flat-square&label=build)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8) +[![tests](https://github.com/explosion/spaCy/actions/workflows/tests.yml/badge.svg)](https://github.com/explosion/spaCy/actions/workflows/tests.yml) [![Current Release Version](https://img.shields.io/github/release/explosion/spacy.svg?style=flat-square&logo=github)](https://github.com/explosion/spaCy/releases) [![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/spacy/) [![conda Version](https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square&logo=conda-forge&logoColor=white)](https://anaconda.org/conda-forge/spacy) @@ -35,22 +32,22 @@ open-source software, released under the [MIT license](https://github.com/explos ## 📖 Documentation -| Documentation | | -| ----------------------------- | ---------------------------------------------------------------------- | -| ⭐️ **[spaCy 101]** | New to spaCy? 
Here's everything you need to know! | -| 📚 **[Usage Guides]** | How to use spaCy and its features. | -| 🚀 **[New in v3.0]** | New features, backwards incompatibilities and migration guide. | -| 🪐 **[Project Templates]** | End-to-end workflows you can clone, modify and run. | -| 🎛 **[API Reference]** | The detailed reference for spaCy's API. | -| 📦 **[Models]** | Download trained pipelines for spaCy. | -| 🌌 **[Universe]** | Plugins, extensions, demos and books from the spaCy ecosystem. | -| ⚙️ **[spaCy VS Code Extension]** | Additional tooling and features for working with spaCy's config files. | -| 👩‍🏫 **[Online Course]** | Learn spaCy in this free and interactive online course. | -| 📺 **[Videos]** | Our YouTube channel with video tutorials, talks and more. | -| 🛠 **[Changelog]** | Changes and version history. | -| 💝 **[Contribute]** | How to contribute to the spaCy project and code base. | -| spaCy Tailored Pipelines | Get a custom spaCy pipeline, tailor-made for your NLP problem by spaCy's core developers. Streamlined, production-ready, predictable and maintainable. Start by completing our 5-minute questionnaire to tell us what you need and we'll be in touch! **[Learn more →](https://explosion.ai/spacy-tailored-pipelines)** | -| spaCy Tailored Pipelines | Bespoke advice for problem solving, strategy and analysis for applied NLP projects. Services include data strategy, code reviews, pipeline design and annotation coaching. Curious? Fill in our 5-minute questionnaire to tell us what you need and we'll be in touch! 
**[Learn more →](https://explosion.ai/spacy-tailored-analysis)** | +| Documentation | | +| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| ⭐️ **[spaCy 101]** | New to spaCy? Here's everything you need to know! | +| 📚 **[Usage Guides]** | How to use spaCy and its features. | +| 🚀 **[New in v3.0]** | New features, backwards incompatibilities and migration guide. | +| 🪐 **[Project Templates]** | End-to-end workflows you can clone, modify and run. | +| 🎛 **[API Reference]** | The detailed reference for spaCy's API. | +| 📦 **[Models]** | Download trained pipelines for spaCy. | +| 🌌 **[Universe]** | Plugins, extensions, demos and books from the spaCy ecosystem. | +| ⚙️ **[spaCy VS Code Extension]** | Additional tooling and features for working with spaCy's config files. | +| 👩‍🏫 **[Online Course]** | Learn spaCy in this free and interactive online course. | +| 📺 **[Videos]** | Our YouTube channel with video tutorials, talks and more. | +| 🛠 **[Changelog]** | Changes and version history. | +| 💝 **[Contribute]** | How to contribute to the spaCy project and code base. | +| spaCy Tailored Pipelines | Get a custom spaCy pipeline, tailor-made for your NLP problem by spaCy's core developers. Streamlined, production-ready, predictable and maintainable. Start by completing our 5-minute questionnaire to tell us what you need and we'll be in touch! 
**[Learn more →](https://explosion.ai/spacy-tailored-pipelines)** | +| spaCy Tailored Pipelines | Bespoke advice for problem solving, strategy and analysis for applied NLP projects. Services include data strategy, code reviews, pipeline design and annotation coaching. Curious? Fill in our 5-minute questionnaire to tell us what you need and we'll be in touch! **[Learn more →](https://explosion.ai/spacy-tailored-analysis)** | [spacy 101]: https://spacy.io/usage/spacy-101 [new in v3.0]: https://spacy.io/usage/v3 @@ -58,7 +55,7 @@ open-source software, released under the [MIT license](https://github.com/explos [api reference]: https://spacy.io/api/ [models]: https://spacy.io/models [universe]: https://spacy.io/universe -[spaCy VS Code Extension]: https://github.com/explosion/spacy-vscode +[spacy vs code extension]: https://github.com/explosion/spacy-vscode [videos]: https://www.youtube.com/c/ExplosionAI [online course]: https://course.spacy.io [project templates]: https://github.com/explosion/projects @@ -92,7 +89,9 @@ more people can benefit from it. - State-of-the-art speed - Production-ready **training system** - Linguistically-motivated **tokenization** -- Components for named **entity recognition**, part-of-speech-tagging, dependency parsing, sentence segmentation, **text classification**, lemmatization, morphological analysis, entity linking and more +- Components for named **entity recognition**, part-of-speech-tagging, + dependency parsing, sentence segmentation, **text classification**, + lemmatization, morphological analysis, entity linking and more - Easily extensible with **custom components** and attributes - Support for custom models in **PyTorch**, **TensorFlow** and other frameworks - Built in **visualizers** for syntax and NER @@ -118,8 +117,8 @@ For detailed installation instructions, see the ### pip Using pip, spaCy releases are available as source packages and binary wheels. 
-Before you install spaCy and its dependencies, make sure that -your `pip`, `setuptools` and `wheel` are up to date. +Before you install spaCy and its dependencies, make sure that your `pip`, +`setuptools` and `wheel` are up to date. ```bash pip install -U pip setuptools wheel @@ -174,9 +173,9 @@ with the new version. ## 📦 Download model packages -Trained pipelines for spaCy can be installed as **Python packages**. This -means that they're a component of your application, just like any other module. -Models can be installed using spaCy's [`download`](https://spacy.io/api/cli#download) +Trained pipelines for spaCy can be installed as **Python packages**. This means +that they're a component of your application, just like any other module. Models +can be installed using spaCy's [`download`](https://spacy.io/api/cli#download) command, or manually by pointing pip to a path or URL. | Documentation | | @@ -242,8 +241,7 @@ do that depends on your system. | **Mac** | Install a recent version of [XCode](https://developer.apple.com/xcode/), including the so-called "Command Line Tools". macOS and OS X ship with Python and git preinstalled. | | **Windows** | Install a version of the [Visual C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/) or [Visual Studio Express](https://visualstudio.microsoft.com/vs/express/) that matches the version that was used to compile your Python interpreter. | -For more details -and instructions, see the documentation on +For more details and instructions, see the documentation on [compiling spaCy from source](https://spacy.io/usage#source) and the [quickstart widget](https://spacy.io/usage#section-quickstart) to get the right commands for your platform and Python version. 
diff --git a/requirements.txt b/requirements.txt index a007f495e..4a131d18c 100644 --- a/requirements.txt +++ b/requirements.txt @@ -38,4 +38,5 @@ types-setuptools>=57.0.0 types-requests types-setuptools>=57.0.0 black==22.3.0 +cython-lint>=0.15.0; python_version >= "3.7" isort>=5.0,<6.0 diff --git a/setup.py b/setup.py index 243554c7a..3b6fae37b 100755 --- a/setup.py +++ b/setup.py @@ -1,10 +1,9 @@ #!/usr/bin/env python from setuptools import Extension, setup, find_packages import sys -import platform import numpy -from distutils.command.build_ext import build_ext -from distutils.sysconfig import get_python_inc +from setuptools.command.build_ext import build_ext +from sysconfig import get_path from pathlib import Path import shutil from Cython.Build import cythonize @@ -88,30 +87,6 @@ COPY_FILES = { } -def is_new_osx(): - """Check whether we're on OSX >= 10.7""" - if sys.platform != "darwin": - return False - mac_ver = platform.mac_ver()[0] - if mac_ver.startswith("10"): - minor_version = int(mac_ver.split(".")[1]) - if minor_version >= 7: - return True - else: - return False - return False - - -if is_new_osx(): - # On Mac, use libc++ because Apple deprecated use of - # libstdc - COMPILE_OPTIONS["other"].append("-stdlib=libc++") - LINK_OPTIONS["other"].append("-lc++") - # g++ (used by unix compiler on mac) links to libstdc++ as a default lib. 
- # See: https://stackoverflow.com/questions/1653047/avoid-linking-to-libstdc - LINK_OPTIONS["other"].append("-nodefaultlibs") - - # By subclassing build_extensions we have the actual compiler that will be used which is really known only after finalize_options # http://stackoverflow.com/questions/724664/python-distutils-how-to-get-a-compiler-that-is-going-to-be-used class build_ext_options: @@ -204,7 +179,7 @@ def setup_package(): include_dirs = [ numpy.get_include(), - get_python_inc(plat_specific=True), + get_path("include"), ] ext_modules = [] ext_modules.append( diff --git a/spacy/attrs.pxd b/spacy/attrs.pxd index 6dc9ecaee..fbbac0ec2 100644 --- a/spacy/attrs.pxd +++ b/spacy/attrs.pxd @@ -96,4 +96,4 @@ cdef enum attr_id_t: ENT_ID = symbols.ENT_ID IDX - SENT_END \ No newline at end of file + SENT_END diff --git a/spacy/attrs.pyx b/spacy/attrs.pyx index dc8eed7c3..97b5d5e36 100644 --- a/spacy/attrs.pyx +++ b/spacy/attrs.pyx @@ -117,7 +117,7 @@ def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False): if "pos" in stringy_attrs: stringy_attrs["TAG"] = stringy_attrs.pop("pos") if "morph" in stringy_attrs: - morphs = stringy_attrs.pop("morph") + morphs = stringy_attrs.pop("morph") # no-cython-lint if "number" in stringy_attrs: stringy_attrs.pop("number") if "tenspect" in stringy_attrs: diff --git a/spacy/kb/candidate.pxd b/spacy/kb/candidate.pxd index 9fc4c4e9d..80fcbc459 100644 --- a/spacy/kb/candidate.pxd +++ b/spacy/kb/candidate.pxd @@ -4,7 +4,8 @@ from ..typedefs cimport hash_t from .kb cimport KnowledgeBase -# Object used by the Entity Linker that summarizes one entity-alias candidate combination. +# Object used by the Entity Linker that summarizes one entity-alias candidate +# combination. 
cdef class Candidate: cdef readonly KnowledgeBase kb cdef hash_t entity_hash diff --git a/spacy/kb/candidate.pyx b/spacy/kb/candidate.pyx index 4cd734f43..53fc9b036 100644 --- a/spacy/kb/candidate.pyx +++ b/spacy/kb/candidate.pyx @@ -8,15 +8,24 @@ from ..tokens import Span cdef class Candidate: - """A `Candidate` object refers to a textual mention (`alias`) that may or may not be resolved - to a specific `entity` from a Knowledge Base. This will be used as input for the entity linking - algorithm which will disambiguate the various candidates to the correct one. + """A `Candidate` object refers to a textual mention (`alias`) that may or + may not be resolved to a specific `entity` from a Knowledge Base. This + will be used as input for the entity linking algorithm which will + disambiguate the various candidates to the correct one. Each candidate (alias, entity) pair is assigned a certain prior probability. DOCS: https://spacy.io/api/kb/#candidate-init """ - def __init__(self, KnowledgeBase kb, entity_hash, entity_freq, entity_vector, alias_hash, prior_prob): + def __init__( + self, + KnowledgeBase kb, + entity_hash, + entity_freq, + entity_vector, + alias_hash, + prior_prob + ): self.kb = kb self.entity_hash = entity_hash self.entity_freq = entity_freq @@ -59,7 +68,8 @@ cdef class Candidate: def get_candidates(kb: KnowledgeBase, mention: Span) -> Iterable[Candidate]: """ - Return candidate entities for a given mention and fetching appropriate entries from the index. + Return candidate entities for a given mention and fetching appropriate + entries from the index. kb (KnowledgeBase): Knowledge base to query. mention (Span): Entity mention for which to identify candidates. RETURNS (Iterable[Candidate]): Identified candidates. 
@@ -67,9 +77,12 @@ def get_candidates(kb: KnowledgeBase, mention: Span) -> Iterable[Candidate]: return kb.get_candidates(mention) -def get_candidates_batch(kb: KnowledgeBase, mentions: Iterable[Span]) -> Iterable[Iterable[Candidate]]: +def get_candidates_batch( + kb: KnowledgeBase, mentions: Iterable[Span] +) -> Iterable[Iterable[Candidate]]: """ - Return candidate entities for the given mentions and fetching appropriate entries from the index. + Return candidate entities for the given mentions and fetching appropriate entries + from the index. kb (KnowledgeBase): Knowledge base to query. mention (Iterable[Span]): Entity mentions for which to identify candidates. RETURNS (Iterable[Iterable[Candidate]]): Identified candidates. diff --git a/spacy/kb/kb.pyx b/spacy/kb/kb.pyx index a88e18e1f..6ad4c3564 100644 --- a/spacy/kb/kb.pyx +++ b/spacy/kb/kb.pyx @@ -12,8 +12,9 @@ from .candidate import Candidate cdef class KnowledgeBase: - """A `KnowledgeBase` instance stores unique identifiers for entities and their textual aliases, - to support entity linking of named entities to real-world concepts. + """A `KnowledgeBase` instance stores unique identifiers for entities and + their textual aliases, to support entity linking of named entities to + real-world concepts. This is an abstract class and requires its operations to be implemented. DOCS: https://spacy.io/api/kb @@ -31,10 +32,13 @@ cdef class KnowledgeBase: self.entity_vector_length = entity_vector_length self.mem = Pool() - def get_candidates_batch(self, mentions: Iterable[Span]) -> Iterable[Iterable[Candidate]]: + def get_candidates_batch( + self, mentions: Iterable[Span] + ) -> Iterable[Iterable[Candidate]]: """ - Return candidate entities for specified texts. Each candidate defines the entity, the original alias, - and the prior probability of that alias resolving to that entity. + Return candidate entities for specified texts. 
Each candidate defines + the entity, the original alias, and the prior probability of that + alias resolving to that entity. If no candidate is found for a given text, an empty list is returned. mentions (Iterable[Span]): Mentions for which to get candidates. RETURNS (Iterable[Iterable[Candidate]]): Identified candidates. @@ -43,14 +47,17 @@ cdef class KnowledgeBase: def get_candidates(self, mention: Span) -> Iterable[Candidate]: """ - Return candidate entities for specified text. Each candidate defines the entity, the original alias, + Return candidate entities for specified text. Each candidate defines + the entity, the original alias, and the prior probability of that alias resolving to that entity. If the no candidate is found for a given text, an empty list is returned. mention (Span): Mention for which to get candidates. RETURNS (Iterable[Candidate]): Identified candidates. """ raise NotImplementedError( - Errors.E1045.format(parent="KnowledgeBase", method="get_candidates", name=self.__name__) + Errors.E1045.format( + parent="KnowledgeBase", method="get_candidates", name=self.__name__ + ) ) def get_vectors(self, entities: Iterable[str]) -> Iterable[Iterable[float]]: @@ -68,7 +75,9 @@ cdef class KnowledgeBase: RETURNS (Iterable[float]): Vector for specified entity. """ raise NotImplementedError( - Errors.E1045.format(parent="KnowledgeBase", method="get_vector", name=self.__name__) + Errors.E1045.format( + parent="KnowledgeBase", method="get_vector", name=self.__name__ + ) ) def to_bytes(self, **kwargs) -> bytes: @@ -76,7 +85,9 @@ cdef class KnowledgeBase: RETURNS (bytes): Current state as binary string. 
""" raise NotImplementedError( - Errors.E1045.format(parent="KnowledgeBase", method="to_bytes", name=self.__name__) + Errors.E1045.format( + parent="KnowledgeBase", method="to_bytes", name=self.__name__ + ) ) def from_bytes(self, bytes_data: bytes, *, exclude: Tuple[str] = tuple()): @@ -85,25 +96,35 @@ cdef class KnowledgeBase: exclude (Tuple[str]): Properties to exclude when restoring KB. """ raise NotImplementedError( - Errors.E1045.format(parent="KnowledgeBase", method="from_bytes", name=self.__name__) + Errors.E1045.format( + parent="KnowledgeBase", method="from_bytes", name=self.__name__ + ) ) - def to_disk(self, path: Union[str, Path], exclude: Iterable[str] = SimpleFrozenList()) -> None: + def to_disk( + self, path: Union[str, Path], exclude: Iterable[str] = SimpleFrozenList() + ) -> None: """ Write KnowledgeBase content to disk. path (Union[str, Path]): Target file path. exclude (Iterable[str]): List of components to exclude. """ raise NotImplementedError( - Errors.E1045.format(parent="KnowledgeBase", method="to_disk", name=self.__name__) + Errors.E1045.format( + parent="KnowledgeBase", method="to_disk", name=self.__name__ + ) ) - def from_disk(self, path: Union[str, Path], exclude: Iterable[str] = SimpleFrozenList()) -> None: + def from_disk( + self, path: Union[str, Path], exclude: Iterable[str] = SimpleFrozenList() + ) -> None: """ Load KnowledgeBase content from disk. path (Union[str, Path]): Target file path. exclude (Iterable[str]): List of components to exclude. 
""" raise NotImplementedError( - Errors.E1045.format(parent="KnowledgeBase", method="from_disk", name=self.__name__) + Errors.E1045.format( + parent="KnowledgeBase", method="from_disk", name=self.__name__ + ) ) diff --git a/spacy/kb/kb_in_memory.pxd b/spacy/kb/kb_in_memory.pxd index 08ec6b2a3..e0e33301a 100644 --- a/spacy/kb/kb_in_memory.pxd +++ b/spacy/kb/kb_in_memory.pxd @@ -55,23 +55,28 @@ cdef class InMemoryLookupKB(KnowledgeBase): # optional data, we can let users configure a DB as the backend for this. cdef object _features_table - cdef inline int64_t c_add_vector(self, vector[float] entity_vector) nogil: """Add an entity vector to the vectors table.""" cdef int64_t new_index = self._vectors_table.size() self._vectors_table.push_back(entity_vector) return new_index - - cdef inline int64_t c_add_entity(self, hash_t entity_hash, float freq, - int32_t vector_index, int feats_row) nogil: + cdef inline int64_t c_add_entity( + self, + hash_t entity_hash, + float freq, + int32_t vector_index, + int feats_row + ) nogil: """Add an entry to the vector of entries. - After calling this method, make sure to update also the _entry_index using the return value""" + After calling this method, make sure to update also the _entry_index + using the return value""" # This is what we'll map the entity hash key to. It's where the entry will sit # in the vector of entries, so we can get it later. cdef int64_t new_index = self._entries.size() - # Avoid struct initializer to enable nogil, cf https://github.com/cython/cython/issues/1642 + # Avoid struct initializer to enable nogil, cf. 
+ # https://github.com/cython/cython/issues/1642 cdef KBEntryC entry entry.entity_hash = entity_hash entry.vector_index = vector_index @@ -81,11 +86,17 @@ cdef class InMemoryLookupKB(KnowledgeBase): self._entries.push_back(entry) return new_index - cdef inline int64_t c_add_aliases(self, hash_t alias_hash, vector[int64_t] entry_indices, vector[float] probs) nogil: - """Connect a mention to a list of potential entities with their prior probabilities . - After calling this method, make sure to update also the _alias_index using the return value""" - # This is what we'll map the alias hash key to. It's where the alias will be defined - # in the vector of aliases. + cdef inline int64_t c_add_aliases( + self, + hash_t alias_hash, + vector[int64_t] entry_indices, + vector[float] probs + ) nogil: + """Connect a mention to a list of potential entities with their prior + probabilities. After calling this method, make sure to update also the + _alias_index using the return value""" + # This is what we'll map the alias hash key to. It's where the alias will be + # defined in the vector of aliases. cdef int64_t new_index = self._aliases_table.size() # Avoid struct initializer to enable nogil @@ -98,8 +109,9 @@ cdef class InMemoryLookupKB(KnowledgeBase): cdef inline void _create_empty_vectors(self, hash_t dummy_hash) nogil: """ - Initializing the vectors and making sure the first element of each vector is a dummy, - because the PreshMap maps pointing to indices in these vectors can not contain 0 as value + Initializing the vectors and making sure the first element of each vector is a + dummy, because the PreshMap maps pointing to indices in these vectors can not + contain 0 as value. cf. 
https://github.com/explosion/preshed/issues/17 """ cdef int32_t dummy_value = 0 @@ -130,12 +142,18 @@ cdef class InMemoryLookupKB(KnowledgeBase): cdef class Writer: cdef FILE* _fp - cdef int write_header(self, int64_t nr_entries, int64_t entity_vector_length) except -1 + cdef int write_header( + self, int64_t nr_entries, int64_t entity_vector_length + ) except -1 cdef int write_vector_element(self, float element) except -1 - cdef int write_entry(self, hash_t entry_hash, float entry_freq, int32_t vector_index) except -1 + cdef int write_entry( + self, hash_t entry_hash, float entry_freq, int32_t vector_index + ) except -1 cdef int write_alias_length(self, int64_t alias_length) except -1 - cdef int write_alias_header(self, hash_t alias_hash, int64_t candidate_length) except -1 + cdef int write_alias_header( + self, hash_t alias_hash, int64_t candidate_length + ) except -1 cdef int write_alias(self, int64_t entry_index, float prob) except -1 cdef int _write(self, void* value, size_t size) except -1 @@ -143,12 +161,18 @@ cdef class Writer: cdef class Reader: cdef FILE* _fp - cdef int read_header(self, int64_t* nr_entries, int64_t* entity_vector_length) except -1 + cdef int read_header( + self, int64_t* nr_entries, int64_t* entity_vector_length + ) except -1 cdef int read_vector_element(self, float* element) except -1 - cdef int read_entry(self, hash_t* entity_hash, float* freq, int32_t* vector_index) except -1 + cdef int read_entry( + self, hash_t* entity_hash, float* freq, int32_t* vector_index + ) except -1 cdef int read_alias_length(self, int64_t* alias_length) except -1 - cdef int read_alias_header(self, hash_t* alias_hash, int64_t* candidate_length) except -1 + cdef int read_alias_header( + self, hash_t* alias_hash, int64_t* candidate_length + ) except -1 cdef int read_alias(self, int64_t* entry_index, float* prob) except -1 cdef int _read(self, void* value, size_t size) except -1 diff --git a/spacy/kb/kb_in_memory.pyx b/spacy/kb/kb_in_memory.pyx index 
e991f7720..02773cbae 100644 --- a/spacy/kb/kb_in_memory.pyx +++ b/spacy/kb/kb_in_memory.pyx @@ -1,5 +1,5 @@ # cython: infer_types=True, profile=True -from typing import Any, Callable, Dict, Iterable, Union +from typing import Any, Callable, Dict, Iterable import srsly @@ -27,8 +27,9 @@ from .candidate import Candidate as Candidate cdef class InMemoryLookupKB(KnowledgeBase): - """An `InMemoryLookupKB` instance stores unique identifiers for entities and their textual aliases, - to support entity linking of named entities to real-world concepts. + """An `InMemoryLookupKB` instance stores unique identifiers for entities + and their textual aliases, to support entity linking of named entities to + real-world concepts. DOCS: https://spacy.io/api/inmemorylookupkb """ @@ -71,7 +72,8 @@ cdef class InMemoryLookupKB(KnowledgeBase): def add_entity(self, str entity, float freq, vector[float] entity_vector): """ - Add an entity to the KB, optionally specifying its log probability based on corpus frequency + Add an entity to the KB, optionally specifying its log probability + based on corpus frequency. Return the hash of the entity ID/name at the end. 
""" cdef hash_t entity_hash = self.vocab.strings.add(entity) @@ -83,14 +85,20 @@ cdef class InMemoryLookupKB(KnowledgeBase): # Raise an error if the provided entity vector is not of the correct length if len(entity_vector) != self.entity_vector_length: - raise ValueError(Errors.E141.format(found=len(entity_vector), required=self.entity_vector_length)) + raise ValueError( + Errors.E141.format( + found=len(entity_vector), required=self.entity_vector_length + ) + ) vector_index = self.c_add_vector(entity_vector=entity_vector) - new_index = self.c_add_entity(entity_hash=entity_hash, - freq=freq, - vector_index=vector_index, - feats_row=-1) # Features table currently not implemented + new_index = self.c_add_entity( + entity_hash=entity_hash, + freq=freq, + vector_index=vector_index, + feats_row=-1 + ) # Features table currently not implemented self._entry_index[entity_hash] = new_index return entity_hash @@ -115,7 +123,12 @@ cdef class InMemoryLookupKB(KnowledgeBase): else: entity_vector = vector_list[i] if len(entity_vector) != self.entity_vector_length: - raise ValueError(Errors.E141.format(found=len(entity_vector), required=self.entity_vector_length)) + raise ValueError( + Errors.E141.format( + found=len(entity_vector), + required=self.entity_vector_length + ) + ) entry.entity_hash = entity_hash entry.freq = freq_list[i] @@ -149,11 +162,15 @@ cdef class InMemoryLookupKB(KnowledgeBase): previous_alias_nr = self.get_size_aliases() # Throw an error if the length of entities and probabilities are not the same if not len(entities) == len(probabilities): - raise ValueError(Errors.E132.format(alias=alias, - entities_length=len(entities), - probabilities_length=len(probabilities))) + raise ValueError( + Errors.E132.format( + alias=alias, + entities_length=len(entities), + probabilities_length=len(probabilities)) + ) - # Throw an error if the probabilities sum up to more than 1 (allow for some rounding errors) + # Throw an error if the probabilities sum up to more than 1 
(allow for + # some rounding errors) prob_sum = sum(probabilities) if prob_sum > 1.00001: raise ValueError(Errors.E133.format(alias=alias, sum=prob_sum)) @@ -170,40 +187,47 @@ cdef class InMemoryLookupKB(KnowledgeBase): for entity, prob in zip(entities, probabilities): entity_hash = self.vocab.strings[entity] - if not entity_hash in self._entry_index: + if entity_hash not in self._entry_index: raise ValueError(Errors.E134.format(entity=entity)) entry_index = self._entry_index.get(entity_hash) entry_indices.push_back(int(entry_index)) probs.push_back(float(prob)) - new_index = self.c_add_aliases(alias_hash=alias_hash, entry_indices=entry_indices, probs=probs) + new_index = self.c_add_aliases( + alias_hash=alias_hash, entry_indices=entry_indices, probs=probs + ) self._alias_index[alias_hash] = new_index if previous_alias_nr + 1 != self.get_size_aliases(): raise RuntimeError(Errors.E891.format(alias=alias)) return alias_hash - def append_alias(self, str alias, str entity, float prior_prob, ignore_warnings=False): + def append_alias( + self, str alias, str entity, float prior_prob, ignore_warnings=False + ): """ - For an alias already existing in the KB, extend its potential entities with one more. + For an alias already existing in the KB, extend its potential entities + with one more. Throw a warning if either the alias or the entity is unknown, or when the combination is already previously recorded. Throw an error if this entity+prior prob would exceed the sum of 1. - For efficiency, it's best to use the method `add_alias` as much as possible instead of this one. + For efficiency, it's best to use the method `add_alias` as much as + possible instead of this one. 
""" # Check if the alias exists in the KB cdef hash_t alias_hash = self.vocab.strings[alias] - if not alias_hash in self._alias_index: + if alias_hash not in self._alias_index: raise ValueError(Errors.E176.format(alias=alias)) # Check if the entity exists in the KB cdef hash_t entity_hash = self.vocab.strings[entity] - if not entity_hash in self._entry_index: + if entity_hash not in self._entry_index: raise ValueError(Errors.E134.format(entity=entity)) entry_index = self._entry_index.get(entity_hash) - # Throw an error if the prior probabilities (including the new one) sum up to more than 1 + # Throw an error if the prior probabilities (including the new one) + # sum up to more than 1 alias_index = self._alias_index.get(alias_hash) alias_entry = self._aliases_table[alias_index] current_sum = sum([p for p in alias_entry.probs]) @@ -236,12 +260,13 @@ cdef class InMemoryLookupKB(KnowledgeBase): def get_alias_candidates(self, str alias) -> Iterable[Candidate]: """ - Return candidate entities for an alias. Each candidate defines the entity, the original alias, - and the prior probability of that alias resolving to that entity. + Return candidate entities for an alias. Each candidate defines the + entity, the original alias, and the prior probability of that alias + resolving to that entity. If the alias is not known in the KB, and empty list is returned. 
""" cdef hash_t alias_hash = self.vocab.strings[alias] - if not alias_hash in self._alias_index: + if alias_hash not in self._alias_index: return [] alias_index = self._alias_index.get(alias_hash) alias_entry = self._aliases_table[alias_index] @@ -249,10 +274,14 @@ cdef class InMemoryLookupKB(KnowledgeBase): return [Candidate(kb=self, entity_hash=self._entries[entry_index].entity_hash, entity_freq=self._entries[entry_index].freq, - entity_vector=self._vectors_table[self._entries[entry_index].vector_index], + entity_vector=self._vectors_table[ + self._entries[entry_index].vector_index + ], alias_hash=alias_hash, prior_prob=prior_prob) - for (entry_index, prior_prob) in zip(alias_entry.entry_indices, alias_entry.probs) + for (entry_index, prior_prob) in zip( + alias_entry.entry_indices, alias_entry.probs + ) if entry_index != 0] def get_vector(self, str entity): @@ -266,8 +295,9 @@ cdef class InMemoryLookupKB(KnowledgeBase): return self._vectors_table[self._entries[entry_index].vector_index] def get_prior_prob(self, str entity, str alias): - """ Return the prior probability of a given alias being linked to a given entity, - or return 0.0 when this combination is not known in the knowledge base""" + """ Return the prior probability of a given alias being linked to a + given entity, or return 0.0 when this combination is not known in the + knowledge base.""" cdef hash_t alias_hash = self.vocab.strings[alias] cdef hash_t entity_hash = self.vocab.strings[entity] @@ -278,7 +308,9 @@ cdef class InMemoryLookupKB(KnowledgeBase): entry_index = self._entry_index[entity_hash] alias_entry = self._aliases_table[alias_index] - for (entry_index, prior_prob) in zip(alias_entry.entry_indices, alias_entry.probs): + for (entry_index, prior_prob) in zip( + alias_entry.entry_indices, alias_entry.probs + ): if self._entries[entry_index].entity_hash == entity_hash: return prior_prob @@ -288,13 +320,19 @@ cdef class InMemoryLookupKB(KnowledgeBase): """Serialize the current state to a binary 
string. """ def serialize_header(): - header = (self.get_size_entities(), self.get_size_aliases(), self.entity_vector_length) + header = ( + self.get_size_entities(), + self.get_size_aliases(), + self.entity_vector_length + ) return srsly.json_dumps(header) def serialize_entries(): i = 1 tuples = [] - for entry_hash, entry_index in sorted(self._entry_index.items(), key=lambda x: x[1]): + for entry_hash, entry_index in sorted( + self._entry_index.items(), key=lambda x: x[1] + ): entry = self._entries[entry_index] assert entry.entity_hash == entry_hash assert entry_index == i @@ -307,7 +345,9 @@ cdef class InMemoryLookupKB(KnowledgeBase): headers = [] indices_lists = [] probs_lists = [] - for alias_hash, alias_index in sorted(self._alias_index.items(), key=lambda x: x[1]): + for alias_hash, alias_index in sorted( + self._alias_index.items(), key=lambda x: x[1] + ): alias = self._aliases_table[alias_index] assert alias_index == i candidate_length = len(alias.entry_indices) @@ -365,7 +405,7 @@ cdef class InMemoryLookupKB(KnowledgeBase): indices = srsly.json_loads(all_data[1]) probs = srsly.json_loads(all_data[2]) for header, indices, probs in zip(headers, indices, probs): - alias_hash, candidate_length = header + alias_hash, _candidate_length = header alias.entry_indices = indices alias.probs = probs self._aliases_table[i] = alias @@ -414,10 +454,14 @@ cdef class InMemoryLookupKB(KnowledgeBase): writer.write_vector_element(element) i = i+1 - # dumping the entry records in the order in which they are in the _entries vector. - # index 0 is a dummy object not stored in the _entry_index and can be ignored. + # dumping the entry records in the order in which they are in the + # _entries vector. + # index 0 is a dummy object not stored in the _entry_index and can + # be ignored. 
i = 1 - for entry_hash, entry_index in sorted(self._entry_index.items(), key=lambda x: x[1]): + for entry_hash, entry_index in sorted( + self._entry_index.items(), key=lambda x: x[1] + ): entry = self._entries[entry_index] assert entry.entity_hash == entry_hash assert entry_index == i @@ -429,7 +473,9 @@ cdef class InMemoryLookupKB(KnowledgeBase): # dumping the aliases in the order in which they are in the _alias_index vector. # index 0 is a dummy object not stored in the _aliases_table and can be ignored. i = 1 - for alias_hash, alias_index in sorted(self._alias_index.items(), key=lambda x: x[1]): + for alias_hash, alias_index in sorted( + self._alias_index.items(), key=lambda x: x[1] + ): alias = self._aliases_table[alias_index] assert alias_index == i @@ -535,7 +581,8 @@ cdef class Writer: def __init__(self, path): assert isinstance(path, Path) content = bytes(path) - cdef bytes bytes_loc = content.encode('utf8') if type(content) == str else content + cdef bytes bytes_loc = content.encode('utf8') \ + if type(content) == str else content self._fp = fopen(bytes_loc, 'wb') if not self._fp: raise IOError(Errors.E146.format(path=path)) @@ -545,14 +592,18 @@ cdef class Writer: cdef size_t status = fclose(self._fp) assert status == 0 - cdef int write_header(self, int64_t nr_entries, int64_t entity_vector_length) except -1: + cdef int write_header( + self, int64_t nr_entries, int64_t entity_vector_length + ) except -1: self._write(&nr_entries, sizeof(nr_entries)) self._write(&entity_vector_length, sizeof(entity_vector_length)) cdef int write_vector_element(self, float element) except -1: self._write(&element, sizeof(element)) - cdef int write_entry(self, hash_t entry_hash, float entry_freq, int32_t vector_index) except -1: + cdef int write_entry( + self, hash_t entry_hash, float entry_freq, int32_t vector_index + ) except -1: self._write(&entry_hash, sizeof(entry_hash)) self._write(&entry_freq, sizeof(entry_freq)) self._write(&vector_index, sizeof(vector_index)) @@ 
-561,7 +612,9 @@ cdef class Writer: cdef int write_alias_length(self, int64_t alias_length) except -1: self._write(&alias_length, sizeof(alias_length)) - cdef int write_alias_header(self, hash_t alias_hash, int64_t candidate_length) except -1: + cdef int write_alias_header( + self, hash_t alias_hash, int64_t candidate_length + ) except -1: self._write(&alias_hash, sizeof(alias_hash)) self._write(&candidate_length, sizeof(candidate_length)) @@ -577,16 +630,19 @@ cdef class Writer: cdef class Reader: def __init__(self, path): content = bytes(path) - cdef bytes bytes_loc = content.encode('utf8') if type(content) == str else content + cdef bytes bytes_loc = content.encode('utf8') \ + if type(content) == str else content self._fp = fopen(bytes_loc, 'rb') if not self._fp: PyErr_SetFromErrno(IOError) - status = fseek(self._fp, 0, 0) # this can be 0 if there is no header + fseek(self._fp, 0, 0) # this can be 0 if there is no header def __dealloc__(self): fclose(self._fp) - cdef int read_header(self, int64_t* nr_entries, int64_t* entity_vector_length) except -1: + cdef int read_header( + self, int64_t* nr_entries, int64_t* entity_vector_length + ) except -1: status = self._read(nr_entries, sizeof(int64_t)) if status < 1: if feof(self._fp): @@ -606,7 +662,9 @@ cdef class Reader: return 0 # end of file raise IOError(Errors.E145.format(param="vector element")) - cdef int read_entry(self, hash_t* entity_hash, float* freq, int32_t* vector_index) except -1: + cdef int read_entry( + self, hash_t* entity_hash, float* freq, int32_t* vector_index + ) except -1: status = self._read(entity_hash, sizeof(hash_t)) if status < 1: if feof(self._fp): @@ -637,7 +695,9 @@ cdef class Reader: return 0 # end of file raise IOError(Errors.E145.format(param="alias length")) - cdef int read_alias_header(self, hash_t* alias_hash, int64_t* candidate_length) except -1: + cdef int read_alias_header( + self, hash_t* alias_hash, int64_t* candidate_length + ) except -1: status = self._read(alias_hash, 
sizeof(hash_t)) if status < 1: if feof(self._fp): diff --git a/spacy/language.py b/spacy/language.py index 3b3e33991..46f4a7996 100644 --- a/spacy/language.py +++ b/spacy/language.py @@ -1825,7 +1825,6 @@ class Language: # Later we replace the component config with the raw config again. interpolated = filled.interpolate() if not filled.is_interpolated else filled pipeline = interpolated.get("components", {}) - sourced = util.get_sourced_components(interpolated) # If components are loaded from a source (existing models), we cache # them here so they're only loaded once source_nlps = {} diff --git a/spacy/lexeme.pyx b/spacy/lexeme.pyx index 00e2c6258..60d22e615 100644 --- a/spacy/lexeme.pyx +++ b/spacy/lexeme.pyx @@ -1,7 +1,6 @@ # cython: embedsignature=True # Compiler crashes on memory view coercion without this. Should report bug. cimport numpy as np -from cython.view cimport array as cvarray from libc.string cimport memset np.import_array() @@ -35,7 +34,7 @@ from .typedefs cimport attr_t, flags_t from .attrs import intify_attrs from .errors import Errors, Warnings -OOV_RANK = 0xffffffffffffffff # UINT64_MAX +OOV_RANK = 0xffffffffffffffff # UINT64_MAX memset(&EMPTY_LEXEME, 0, sizeof(LexemeC)) EMPTY_LEXEME.id = OOV_RANK @@ -105,7 +104,7 @@ cdef class Lexeme: if isinstance(value, float): continue elif isinstance(value, (int, long)): - Lexeme.set_struct_attr(self.c, attr, value) + Lexeme.set_struct_attr(self.c, attr, value) else: Lexeme.set_struct_attr(self.c, attr, self.vocab.strings.add(value)) @@ -137,10 +136,12 @@ cdef class Lexeme: if hasattr(other, "orth"): if self.c.orth == other.orth: return 1.0 - elif hasattr(other, "__len__") and len(other) == 1 \ - and hasattr(other[0], "orth"): - if self.c.orth == other[0].orth: - return 1.0 + elif ( + hasattr(other, "__len__") and len(other) == 1 + and hasattr(other[0], "orth") + and self.c.orth == other[0].orth + ): + return 1.0 if self.vector_norm == 0 or other.vector_norm == 0: 
warnings.warn(Warnings.W008.format(obj="Lexeme")) return 0.0 @@ -149,7 +150,7 @@ cdef class Lexeme: result = xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm) # ensure we get a scalar back (numpy does this automatically but cupy doesn't) return result.item() - + @property def has_vector(self): """RETURNS (bool): Whether a word vector is associated with the object. diff --git a/spacy/matcher/dependencymatcher.pyx b/spacy/matcher/dependencymatcher.pyx index a214c0668..348e000ff 100644 --- a/spacy/matcher/dependencymatcher.pyx +++ b/spacy/matcher/dependencymatcher.pyx @@ -108,7 +108,7 @@ cdef class DependencyMatcher: key (str): The match ID. RETURNS (bool): Whether the matcher contains rules for this match ID. """ - return self.has_key(key) + return self.has_key(key) # no-cython-lint: W601 def _validate_input(self, pattern, key): idx = 0 @@ -264,7 +264,7 @@ cdef class DependencyMatcher: def remove(self, key): key = self._normalize_key(key) - if not key in self._patterns: + if key not in self._patterns: raise ValueError(Errors.E175.format(key=key)) self._patterns.pop(key) self._raw_patterns.pop(key) @@ -382,7 +382,7 @@ cdef class DependencyMatcher: return [] return [doc[node].head] - def _gov(self,doc,node): + def _gov(self, doc, node): return list(doc[node].children) def _dep_chain(self, doc, node): @@ -443,7 +443,7 @@ cdef class DependencyMatcher: def _right_child(self, doc, node): return [child for child in doc[node].rights] - + def _left_child(self, doc, node): return [child for child in doc[node].lefts] @@ -461,7 +461,7 @@ cdef class DependencyMatcher: if doc[node].head.i > node: return [doc[node].head] return [] - + def _left_parent(self, doc, node): if doc[node].head.i < node: return [doc[node].head] diff --git a/spacy/matcher/matcher.pyx b/spacy/matcher/matcher.pyx index 3d03f37ae..167f85af4 100644 --- a/spacy/matcher/matcher.pyx +++ b/spacy/matcher/matcher.pyx @@ -12,31 +12,18 @@ import warnings import srsly -from ..attrs cimport ( - DEP, - 
ENT_IOB, - ID, - LEMMA, - MORPH, - NULL_ATTR, - ORTH, - POS, - TAG, - attr_id_t, -) +from ..attrs cimport DEP, ENT_IOB, ID, LEMMA, MORPH, NULL_ATTR, POS, TAG from ..structs cimport TokenC from ..tokens.doc cimport Doc, get_token_attr_for_matcher from ..tokens.morphanalysis cimport MorphAnalysis from ..tokens.span cimport Span from ..tokens.token cimport Token from ..typedefs cimport attr_t -from ..vocab cimport Vocab from ..attrs import IDS from ..errors import Errors, MatchPatternError, Warnings from ..schemas import validate_token_pattern from ..strings import get_string_id -from ..util import registry from .levenshtein import levenshtein_compare DEF PADDING = 5 @@ -87,9 +74,9 @@ cdef class Matcher: key (str): The match ID. RETURNS (bool): Whether the matcher contains rules for this match ID. """ - return self.has_key(key) + return self.has_key(key) # no-cython-lint: W601 - def add(self, key, patterns, *, on_match=None, greedy: str=None): + def add(self, key, patterns, *, on_match=None, greedy: str = None): """Add a match-rule to the matcher. A match-rule consists of: an ID key, an on_match callback, and one or more patterns. @@ -143,8 +130,13 @@ cdef class Matcher: key = self._normalize_key(key) for pattern in patterns: try: - specs = _preprocess_pattern(pattern, self.vocab, - self._extensions, self._extra_predicates, self._fuzzy_compare) + specs = _preprocess_pattern( + pattern, + self.vocab, + self._extensions, + self._extra_predicates, + self._fuzzy_compare + ) self.patterns.push_back(init_pattern(self.mem, key, specs)) for spec in specs: for attr, _ in spec[1]: @@ -168,7 +160,7 @@ cdef class Matcher: key (str): The ID of the match rule. 
""" norm_key = self._normalize_key(key) - if not norm_key in self._patterns: + if norm_key not in self._patterns: raise ValueError(Errors.E175.format(key=key)) self._patterns.pop(norm_key) self._callbacks.pop(norm_key) @@ -268,8 +260,15 @@ cdef class Matcher: if self.patterns.empty(): matches = [] else: - matches = find_matches(&self.patterns[0], self.patterns.size(), doclike, length, - extensions=self._extensions, predicates=self._extra_predicates, with_alignments=with_alignments) + matches = find_matches( + &self.patterns[0], + self.patterns.size(), + doclike, + length, + extensions=self._extensions, + predicates=self._extra_predicates, + with_alignments=with_alignments + ) final_matches = [] pairs_by_id = {} # For each key, either add all matches, or only the filtered, @@ -289,9 +288,9 @@ cdef class Matcher: memset(matched, 0, length * sizeof(matched[0])) span_filter = self._filter.get(key) if span_filter == "FIRST": - sorted_pairs = sorted(pairs, key=lambda x: (x[0], -x[1]), reverse=False) # sort by start + sorted_pairs = sorted(pairs, key=lambda x: (x[0], -x[1]), reverse=False) # sort by start elif span_filter == "LONGEST": - sorted_pairs = sorted(pairs, key=lambda x: (x[1]-x[0], -x[0]), reverse=True) # reverse sort by length + sorted_pairs = sorted(pairs, key=lambda x: (x[1]-x[0], -x[0]), reverse=True) # reverse sort by length else: raise ValueError(Errors.E947.format(expected=["FIRST", "LONGEST"], arg=span_filter)) for match in sorted_pairs: @@ -366,7 +365,6 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e cdef vector[MatchC] matches cdef vector[vector[MatchAlignmentC]] align_states cdef vector[vector[MatchAlignmentC]] align_matches - cdef PatternStateC state cdef int i, j, nr_extra_attr cdef Pool mem = Pool() output = [] @@ -388,14 +386,22 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e value = token.vocab.strings[value] extra_attr_values[i * nr_extra_attr + index] = value # Main loop - 
cdef int nr_predicate = len(predicates) for i in range(length): for j in range(n): states.push_back(PatternStateC(patterns[j], i, 0)) if with_alignments != 0: align_states.resize(states.size()) - transition_states(states, matches, align_states, align_matches, predicate_cache, - doclike[i], extra_attr_values, predicates, with_alignments) + transition_states( + states, + matches, + align_states, + align_matches, + predicate_cache, + doclike[i], + extra_attr_values, + predicates, + with_alignments + ) extra_attr_values += nr_extra_attr predicate_cache += len(predicates) # Handle matches that end in 0-width patterns @@ -421,18 +427,28 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e return output -cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& matches, - vector[vector[MatchAlignmentC]]& align_states, vector[vector[MatchAlignmentC]]& align_matches, - int8_t* cached_py_predicates, - Token token, const attr_t* extra_attrs, py_predicates, bint with_alignments) except *: +cdef void transition_states( + vector[PatternStateC]& states, + vector[MatchC]& matches, + vector[vector[MatchAlignmentC]]& align_states, + vector[vector[MatchAlignmentC]]& align_matches, + int8_t* cached_py_predicates, + Token token, + const attr_t* extra_attrs, + py_predicates, + bint with_alignments +) except *: cdef int q = 0 cdef vector[PatternStateC] new_states cdef vector[vector[MatchAlignmentC]] align_new_states - cdef int nr_predicate = len(py_predicates) for i in range(states.size()): if states[i].pattern.nr_py >= 1: - update_predicate_cache(cached_py_predicates, - states[i].pattern, token, py_predicates) + update_predicate_cache( + cached_py_predicates, + states[i].pattern, + token, + py_predicates + ) action = get_action(states[i], token.c, extra_attrs, cached_py_predicates) if action == REJECT: @@ -468,8 +484,12 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match 
align_new_states.push_back(align_states[q]) states[q].pattern += 1 if states[q].pattern.nr_py != 0: - update_predicate_cache(cached_py_predicates, - states[q].pattern, token, py_predicates) + update_predicate_cache( + cached_py_predicates, + states[q].pattern, + token, + py_predicates + ) action = get_action(states[q], token.c, extra_attrs, cached_py_predicates) # Update alignment before the transition of current state @@ -485,8 +505,12 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match ent_id = get_ent_id(state.pattern) if action == MATCH: matches.push_back( - MatchC(pattern_id=ent_id, start=state.start, - length=state.length+1)) + MatchC( + pattern_id=ent_id, + start=state.start, + length=state.length+1 + ) + ) # `align_matches` always corresponds to `matches` 1:1 if with_alignments != 0: align_matches.push_back(align_states[q]) @@ -494,23 +518,35 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match # push match without last token if length > 0 if state.length > 0: matches.push_back( - MatchC(pattern_id=ent_id, start=state.start, - length=state.length)) + MatchC( + pattern_id=ent_id, + start=state.start, + length=state.length + ) + ) # MATCH_DOUBLE emits matches twice, # add one more to align_matches in order to keep 1:1 relationship if with_alignments != 0: align_matches.push_back(align_states[q]) # push match with last token matches.push_back( - MatchC(pattern_id=ent_id, start=state.start, - length=state.length+1)) + MatchC( + pattern_id=ent_id, + start=state.start, + length=state.length + 1 + ) + ) # `align_matches` always corresponds to `matches` 1:1 if with_alignments != 0: align_matches.push_back(align_states[q]) elif action == MATCH_REJECT: matches.push_back( - MatchC(pattern_id=ent_id, start=state.start, - length=state.length)) + MatchC( + pattern_id=ent_id, + start=state.start, + length=state.length + ) + ) # `align_matches` always corresponds to `matches` 1:1 if with_alignments != 0: 
align_matches.push_back(align_states[q]) @@ -533,8 +569,12 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match align_states.push_back(align_new_states[i]) -cdef int update_predicate_cache(int8_t* cache, - const TokenPatternC* pattern, Token token, predicates) except -1: +cdef int update_predicate_cache( + int8_t* cache, + const TokenPatternC* pattern, + Token token, + predicates +) except -1: # If the state references any extra predicates, check whether they match. # These are cached, so that we don't call these potentially expensive # Python functions more than we need to. @@ -580,10 +620,12 @@ cdef void finish_states(vector[MatchC]& matches, vector[PatternStateC]& states, else: state.pattern += 1 - -cdef action_t get_action(PatternStateC state, - const TokenC* token, const attr_t* extra_attrs, - const int8_t* predicate_matches) nogil: +cdef action_t get_action( + PatternStateC state, + const TokenC * token, + const attr_t * extra_attrs, + const int8_t * predicate_matches +) nogil: """We need to consider: a) Does the token match the specification? [Yes, No] b) What's the quantifier? [1, 0+, ?] 
@@ -649,53 +691,56 @@ cdef action_t get_action(PatternStateC state, is_match = not is_match quantifier = ONE if quantifier == ONE: - if is_match and is_final: - # Yes, final: 1000 - return MATCH - elif is_match and not is_final: - # Yes, non-final: 0100 - return ADVANCE - elif not is_match and is_final: - # No, final: 0000 - return REJECT - else: - return REJECT + if is_match and is_final: + # Yes, final: 1000 + return MATCH + elif is_match and not is_final: + # Yes, non-final: 0100 + return ADVANCE + elif not is_match and is_final: + # No, final: 0000 + return REJECT + else: + return REJECT elif quantifier == ZERO_PLUS: - if is_match and is_final: - # Yes, final: 1001 - return MATCH_EXTEND - elif is_match and not is_final: - # Yes, non-final: 0011 - return RETRY_EXTEND - elif not is_match and is_final: - # No, final 2000 (note: Don't include last token!) - return MATCH_REJECT - else: - # No, non-final 0010 - return RETRY + if is_match and is_final: + # Yes, final: 1001 + return MATCH_EXTEND + elif is_match and not is_final: + # Yes, non-final: 0011 + return RETRY_EXTEND + elif not is_match and is_final: + # No, final 2000 (note: Don't include last token!) + return MATCH_REJECT + else: + # No, non-final 0010 + return RETRY elif quantifier == ZERO_ONE: - if is_match and is_final: - # Yes, final: 3000 - # To cater for a pattern ending in "?", we need to add - # a match both with and without the last token - return MATCH_DOUBLE - elif is_match and not is_final: - # Yes, non-final: 0110 - # We need both branches here, consider a pair like: - # pattern: .?b string: b - # If we 'ADVANCE' on the .?, we miss the match. - return RETRY_ADVANCE - elif not is_match and is_final: - # No, final 2000 (note: Don't include last token!) 
- return MATCH_REJECT - else: - # No, non-final 0010 - return RETRY + if is_match and is_final: + # Yes, final: 3000 + # To cater for a pattern ending in "?", we need to add + # a match both with and without the last token + return MATCH_DOUBLE + elif is_match and not is_final: + # Yes, non-final: 0110 + # We need both branches here, consider a pair like: + # pattern: .?b string: b + # If we 'ADVANCE' on the .?, we miss the match. + return RETRY_ADVANCE + elif not is_match and is_final: + # No, final 2000 (note: Don't include last token!) + return MATCH_REJECT + else: + # No, non-final 0010 + return RETRY -cdef int8_t get_is_match(PatternStateC state, - const TokenC* token, const attr_t* extra_attrs, - const int8_t* predicate_matches) nogil: +cdef int8_t get_is_match( + PatternStateC state, + const TokenC* token, + const attr_t* extra_attrs, + const int8_t* predicate_matches +) nogil: for i in range(state.pattern.nr_py): if predicate_matches[state.pattern.py_predicates[i]] == -1: return 0 @@ -860,7 +905,7 @@ class _FuzzyPredicate: self.is_extension = is_extension if self.predicate not in self.operators: raise ValueError(Errors.E126.format(good=self.operators, bad=self.predicate)) - fuzz = self.predicate[len("FUZZY"):] # number after prefix + fuzz = self.predicate[len("FUZZY"):] # number after prefix self.fuzzy = int(fuzz) if fuzz else -1 self.fuzzy_compare = fuzzy_compare self.key = _predicate_cache_key(self.attr, self.predicate, value, fuzzy=self.fuzzy) @@ -1082,7 +1127,7 @@ def _get_extra_predicates_dict(attr, value_dict, vocab, predicate_types, elif cls == _FuzzyPredicate: if isinstance(value, dict): # add predicates inside fuzzy operator - fuzz = type_[len("FUZZY"):] # number after prefix + fuzz = type_[len("FUZZY"):] # number after prefix fuzzy_val = int(fuzz) if fuzz else -1 output.extend(_get_extra_predicates_dict(attr, value, vocab, predicate_types, extra_predicates, seen_predicates, @@ -1101,8 +1146,9 @@ def _get_extra_predicates_dict(attr, value_dict, 
vocab, predicate_types, return output -def _get_extension_extra_predicates(spec, extra_predicates, predicate_types, - seen_predicates): +def _get_extension_extra_predicates( + spec, extra_predicates, predicate_types, seen_predicates +): output = [] for attr, value in spec.items(): if isinstance(value, dict): @@ -1131,7 +1177,7 @@ def _get_operators(spec): return (ONE,) elif spec["OP"] in lookup: return lookup[spec["OP"]] - #Min_max {n,m} + # Min_max {n,m} elif spec["OP"].startswith("{") and spec["OP"].endswith("}"): # {n} --> {n,n} exactly n ONE,(n) # {n,m}--> {n,m} min of n, max of m ONE,(n),ZERO_ONE,(m) @@ -1142,8 +1188,8 @@ def _get_operators(spec): min_max = min_max if "," in min_max else f"{min_max},{min_max}" n, m = min_max.split(",") - #1. Either n or m is a blank string and the other is numeric -->isdigit - #2. Both are numeric and n <= m + # 1. Either n or m is a blank string and the other is numeric -->isdigit + # 2. Both are numeric and n <= m if (not n.isdecimal() and not m.isdecimal()) or (n.isdecimal() and m.isdecimal() and int(n) > int(m)): keys = ", ".join(lookup.keys()) + ", {n}, {n,m}, {n,}, {,m} where n and m are integers and n <= m " raise ValueError(Errors.E011.format(op=spec["OP"], opts=keys)) diff --git a/spacy/matcher/phrasematcher.pyx b/spacy/matcher/phrasematcher.pyx index c407cf1cc..26633e6d6 100644 --- a/spacy/matcher/phrasematcher.pyx +++ b/spacy/matcher/phrasematcher.pyx @@ -1,14 +1,12 @@ # cython: infer_types=True, profile=True -from libc.stdint cimport uintptr_t from preshed.maps cimport map_clear, map_get, map_init, map_iter, map_set import warnings -from ..attrs cimport DEP, LEMMA, MORPH, ORTH, POS, TAG +from ..attrs cimport DEP, LEMMA, MORPH, POS, TAG from ..attrs import IDS -from ..structs cimport TokenC from ..tokens.span cimport Span from ..tokens.token cimport Token from ..typedefs cimport attr_t diff --git a/spacy/ml/parser_model.pxd b/spacy/ml/parser_model.pxd index ca31c1699..4d2d7b3fe 100644 --- a/spacy/ml/parser_model.pxd 
+++ b/spacy/ml/parser_model.pxd @@ -40,11 +40,16 @@ cdef ActivationsC alloc_activations(SizesC n) nogil cdef void free_activations(const ActivationsC* A) nogil -cdef void predict_states(CBlas cblas, ActivationsC* A, StateC** states, - const WeightsC* W, SizesC n) nogil - +cdef void predict_states( + CBlas cblas, ActivationsC* A, StateC** states, const WeightsC* W, SizesC n +) nogil + cdef int arg_max_if_valid(const weight_t* scores, const int* is_valid, int n) nogil -cdef void cpu_log_loss(float* d_scores, - const float* costs, const int* is_valid, const float* scores, int O) nogil - +cdef void cpu_log_loss( + float* d_scores, + const float* costs, + const int* is_valid, + const float* scores, + int O +) nogil diff --git a/spacy/ml/parser_model.pyx b/spacy/ml/parser_model.pyx index 5cffc4c2d..ae60972aa 100644 --- a/spacy/ml/parser_model.pyx +++ b/spacy/ml/parser_model.pyx @@ -8,13 +8,13 @@ from thinc.backends.linalg cimport Vec, VecVec import numpy import numpy.random -from thinc.api import CupyOps, Model, NumpyOps, get_ops +from thinc.api import CupyOps, Model, NumpyOps from .. 
import util from ..errors import Errors from ..pipeline._parser_internals.stateclass cimport StateClass -from ..typedefs cimport class_t, hash_t, weight_t +from ..typedefs cimport weight_t cdef WeightsC get_c_weights(model) except *: @@ -78,33 +78,48 @@ cdef void resize_activations(ActivationsC* A, SizesC n) nogil: A.is_valid = calloc(n.states * n.classes, sizeof(A.is_valid[0])) A._max_size = n.states else: - A.token_ids = realloc(A.token_ids, - n.states * n.feats * sizeof(A.token_ids[0])) - A.scores = realloc(A.scores, - n.states * n.classes * sizeof(A.scores[0])) - A.unmaxed = realloc(A.unmaxed, - n.states * n.hiddens * n.pieces * sizeof(A.unmaxed[0])) - A.hiddens = realloc(A.hiddens, - n.states * n.hiddens * sizeof(A.hiddens[0])) - A.is_valid = realloc(A.is_valid, - n.states * n.classes * sizeof(A.is_valid[0])) + A.token_ids = realloc( + A.token_ids, n.states * n.feats * sizeof(A.token_ids[0]) + ) + A.scores = realloc( + A.scores, n.states * n.classes * sizeof(A.scores[0]) + ) + A.unmaxed = realloc( + A.unmaxed, n.states * n.hiddens * n.pieces * sizeof(A.unmaxed[0]) + ) + A.hiddens = realloc( + A.hiddens, n.states * n.hiddens * sizeof(A.hiddens[0]) + ) + A.is_valid = realloc( + A.is_valid, n.states * n.classes * sizeof(A.is_valid[0]) + ) A._max_size = n.states A._curr_size = n.states -cdef void predict_states(CBlas cblas, ActivationsC* A, StateC** states, - const WeightsC* W, SizesC n) nogil: - cdef double one = 1.0 +cdef void predict_states( + CBlas cblas, ActivationsC* A, StateC** states, const WeightsC* W, SizesC n +) nogil: resize_activations(A, n) for i in range(n.states): states[i].set_context_tokens(&A.token_ids[i*n.feats], n.feats) memset(A.unmaxed, 0, n.states * n.hiddens * n.pieces * sizeof(float)) memset(A.hiddens, 0, n.states * n.hiddens * sizeof(float)) - sum_state_features(cblas, A.unmaxed, - W.feat_weights, A.token_ids, n.states, n.feats, n.hiddens * n.pieces) + sum_state_features( + cblas, + A.unmaxed, + W.feat_weights, + A.token_ids, + n.states, 
+ n.feats, + n.hiddens * n.pieces + ) for i in range(n.states): - VecVec.add_i(&A.unmaxed[i*n.hiddens*n.pieces], - W.feat_bias, 1., n.hiddens * n.pieces) + VecVec.add_i( + &A.unmaxed[i*n.hiddens*n.pieces], + W.feat_bias, 1., + n.hiddens * n.pieces + ) for j in range(n.hiddens): index = i * n.hiddens * n.pieces + j * n.pieces which = Vec.arg_max(&A.unmaxed[index], n.pieces) @@ -114,14 +129,15 @@ cdef void predict_states(CBlas cblas, ActivationsC* A, StateC** states, memcpy(A.scores, A.hiddens, n.states * n.classes * sizeof(float)) else: # Compute hidden-to-output - sgemm(cblas)(False, True, n.states, n.classes, n.hiddens, + sgemm(cblas)( + False, True, n.states, n.classes, n.hiddens, 1.0, A.hiddens, n.hiddens, W.hidden_weights, n.hiddens, - 0.0, A.scores, n.classes) + 0.0, A.scores, n.classes + ) # Add bias for i in range(n.states): - VecVec.add_i(&A.scores[i*n.classes], - W.hidden_bias, 1., n.classes) + VecVec.add_i(&A.scores[i*n.classes], W.hidden_bias, 1., n.classes) # Set unseen classes to minimum value i = 0 min_ = A.scores[0] @@ -134,9 +150,16 @@ cdef void predict_states(CBlas cblas, ActivationsC* A, StateC** states, A.scores[i*n.classes+j] = min_ -cdef void sum_state_features(CBlas cblas, float* output, - const float* cached, const int* token_ids, int B, int F, int O) nogil: - cdef int idx, b, f, i +cdef void sum_state_features( + CBlas cblas, + float* output, + const float* cached, + const int* token_ids, + int B, + int F, + int O +) nogil: + cdef int idx, b, f cdef const float* feature padding = cached cached += F * O @@ -153,9 +176,13 @@ cdef void sum_state_features(CBlas cblas, float* output, token_ids += F -cdef void cpu_log_loss(float* d_scores, - const float* costs, const int* is_valid, const float* scores, - int O) nogil: +cdef void cpu_log_loss( + float* d_scores, + const float* costs, + const int* is_valid, + const float* scores, + int O +) nogil: """Do multi-label log loss""" cdef double max_, gmax, Z, gZ best = arg_max_if_gold(scores, costs, 
is_valid, O) @@ -179,8 +206,9 @@ cdef void cpu_log_loss(float* d_scores, d_scores[i] = exp(scores[i]-max_) / Z -cdef int arg_max_if_gold(const weight_t* scores, const weight_t* costs, - const int* is_valid, int n) nogil: +cdef int arg_max_if_gold( + const weight_t* scores, const weight_t* costs, const int* is_valid, int n +) nogil: # Find minimum cost cdef float cost = 1 for i in range(n): @@ -204,10 +232,17 @@ cdef int arg_max_if_valid(const weight_t* scores, const int* is_valid, int n) no return best - class ParserStepModel(Model): - def __init__(self, docs, layers, *, has_upper, unseen_classes=None, train=True, - dropout=0.1): + def __init__( + self, + docs, + layers, + *, + has_upper, + unseen_classes=None, + train=True, + dropout=0.1 + ): Model.__init__(self, name="parser_step_model", forward=step_forward) self.attrs["has_upper"] = has_upper self.attrs["dropout_rate"] = dropout @@ -268,8 +303,10 @@ class ParserStepModel(Model): return ids def backprop_step(self, token_ids, d_vector, get_d_tokvecs): - if isinstance(self.state2vec.ops, CupyOps) \ - and not isinstance(token_ids, self.state2vec.ops.xp.ndarray): + if ( + isinstance(self.state2vec.ops, CupyOps) + and not isinstance(token_ids, self.state2vec.ops.xp.ndarray) + ): # Move token_ids and d_vector to GPU, asynchronously self.backprops.append(( util.get_async(self.cuda_stream, token_ids), @@ -279,7 +316,6 @@ class ParserStepModel(Model): else: self.backprops.append((token_ids, d_vector, get_d_tokvecs)) - def finish_steps(self, golds): # Add a padding vector to the d_tokvecs gradient, so that missing # values don't affect the real gradient. 
@@ -292,14 +328,15 @@ class ParserStepModel(Model): ids = ids.flatten() d_state_features = d_state_features.reshape( (ids.size, d_state_features.shape[2])) - self.ops.scatter_add(d_tokvecs, ids, - d_state_features) + self.ops.scatter_add(d_tokvecs, ids, d_state_features) # Padded -- see update() self.bp_tokvecs(d_tokvecs[:-1]) return d_tokvecs + NUMPY_OPS = NumpyOps() + def step_forward(model: ParserStepModel, states, is_train): token_ids = model.get_token_ids(states) vector, get_d_tokvecs = model.state2vec(token_ids, is_train) @@ -312,7 +349,7 @@ def step_forward(model: ParserStepModel, states, is_train): scores, get_d_vector = model.vec2scores(vector, is_train) else: scores = NumpyOps().asarray(vector) - get_d_vector = lambda d_scores: d_scores + get_d_vector = lambda d_scores: d_scores # no-cython-lint: E731 # If the class is unseen, make sure its score is minimum scores[:, model._class_mask == 0] = numpy.nanmin(scores) @@ -448,9 +485,11 @@ cdef class precompute_hiddens: feat_weights = self.get_feat_weights() cdef int[:, ::1] ids = token_ids - sum_state_features(cblas, state_vector.data, - feat_weights, &ids[0,0], - token_ids.shape[0], self.nF, self.nO*self.nP) + sum_state_features( + cblas, state_vector.data, + feat_weights, &ids[0, 0], + token_ids.shape[0], self.nF, self.nO*self.nP + ) state_vector += self.bias state_vector, bp_nonlinearity = self._nonlinearity(state_vector) @@ -475,7 +514,7 @@ cdef class precompute_hiddens: def backprop_maxout(d_best): return self.ops.backprop_maxout(d_best, mask, self.nP) - + return state_vector, backprop_maxout def _relu_nonlinearity(self, state_vector): @@ -489,5 +528,5 @@ cdef class precompute_hiddens: def backprop_relu(d_best): d_best *= mask return d_best.reshape((d_best.shape + (1,))) - + return state_vector, backprop_relu diff --git a/spacy/morphology.pxd b/spacy/morphology.pxd index 968764b82..ee43aa4ec 100644 --- a/spacy/morphology.pxd +++ b/spacy/morphology.pxd @@ -11,7 +11,7 @@ from .typedefs cimport attr_t, 
hash_t cdef class Morphology: cdef readonly Pool mem cdef readonly StringStore strings - cdef PreshMap tags # Keyed by hash, value is pointer to tag + cdef PreshMap tags # Keyed by hash, value is pointer to tag cdef MorphAnalysisC create_morph_tag(self, field_feature_pairs) except * cdef int insert(self, MorphAnalysisC tag) except -1 @@ -20,4 +20,8 @@ cdef class Morphology: cdef int check_feature(const MorphAnalysisC* morph, attr_t feature) nogil cdef list list_features(const MorphAnalysisC* morph) cdef np.ndarray get_by_field(const MorphAnalysisC* morph, attr_t field) -cdef int get_n_by_field(attr_t* results, const MorphAnalysisC* morph, attr_t field) nogil +cdef int get_n_by_field( + attr_t* results, + const MorphAnalysisC* morph, + attr_t field, +) nogil diff --git a/spacy/morphology.pyx b/spacy/morphology.pyx index 1062fff09..ecbbed729 100644 --- a/spacy/morphology.pyx +++ b/spacy/morphology.pyx @@ -83,10 +83,11 @@ cdef class Morphology: features = self.normalize_attrs(features) string_features = {self.strings.as_string(field): self.strings.as_string(values) for field, values in features.items()} # normalized UFEATS string with sorted fields and values - norm_feats_string = self.FEATURE_SEP.join(sorted([ - self.FIELD_SEP.join([field, values]) - for field, values in string_features.items() - ])) + norm_feats_string = self.FEATURE_SEP.join( + sorted( + [self.FIELD_SEP.join([field, values]) for field, values in string_features.items()] + ) + ) return norm_feats_string or self.EMPTY_MORPH def normalize_attrs(self, attrs): @@ -192,6 +193,7 @@ cdef int get_n_by_field(attr_t* results, const MorphAnalysisC* morph, attr_t fie n_results += 1 return n_results + def unpickle_morphology(strings, tags): cdef Morphology morphology = Morphology(strings) for tag in tags: diff --git a/spacy/parts_of_speech.pxd b/spacy/parts_of_speech.pxd index a0b2567f1..b5423d113 100644 --- a/spacy/parts_of_speech.pxd +++ b/spacy/parts_of_speech.pxd @@ -8,7 +8,7 @@ cpdef enum univ_pos_t: ADV 
AUX CONJ - CCONJ # U20 + CCONJ # U20 DET INTJ NOUN diff --git a/spacy/pipeline/_edit_tree_internals/edit_trees.pxd b/spacy/pipeline/_edit_tree_internals/edit_trees.pxd index 3d63af921..41acd2b07 100644 --- a/spacy/pipeline/_edit_tree_internals/edit_trees.pxd +++ b/spacy/pipeline/_edit_tree_internals/edit_trees.pxd @@ -46,11 +46,18 @@ cdef struct EditTreeC: bint is_match_node NodeC inner -cdef inline EditTreeC edittree_new_match(len_t prefix_len, len_t suffix_len, - uint32_t prefix_tree, uint32_t suffix_tree): - cdef MatchNodeC match_node = MatchNodeC(prefix_len=prefix_len, - suffix_len=suffix_len, prefix_tree=prefix_tree, - suffix_tree=suffix_tree) +cdef inline EditTreeC edittree_new_match( + len_t prefix_len, + len_t suffix_len, + uint32_t prefix_tree, + uint32_t suffix_tree +): + cdef MatchNodeC match_node = MatchNodeC( + prefix_len=prefix_len, + suffix_len=suffix_len, + prefix_tree=prefix_tree, + suffix_tree=suffix_tree + ) cdef NodeC inner = NodeC(match_node=match_node) return EditTreeC(is_match_node=True, inner=inner) diff --git a/spacy/pipeline/_edit_tree_internals/edit_trees.pyx b/spacy/pipeline/_edit_tree_internals/edit_trees.pyx index daab0d204..78cd25622 100644 --- a/spacy/pipeline/_edit_tree_internals/edit_trees.pyx +++ b/spacy/pipeline/_edit_tree_internals/edit_trees.pyx @@ -5,8 +5,6 @@ from libc.string cimport memset from libcpp.pair cimport pair from libcpp.vector cimport vector -from pathlib import Path - from ...typedefs cimport hash_t from ... import util @@ -25,17 +23,16 @@ cdef LCS find_lcs(str source, str target): target (str): The second string. RETURNS (LCS): The spans of the longest common subsequences. 
""" - cdef Py_ssize_t source_len = len(source) cdef Py_ssize_t target_len = len(target) - cdef size_t longest_align = 0; + cdef size_t longest_align = 0 cdef int source_idx, target_idx cdef LCS lcs cdef Py_UCS4 source_cp, target_cp memset(&lcs, 0, sizeof(lcs)) - cdef vector[size_t] prev_aligns = vector[size_t](target_len); - cdef vector[size_t] cur_aligns = vector[size_t](target_len); + cdef vector[size_t] prev_aligns = vector[size_t](target_len) + cdef vector[size_t] cur_aligns = vector[size_t](target_len) for (source_idx, source_cp) in enumerate(source): for (target_idx, target_cp) in enumerate(target): @@ -89,7 +86,7 @@ cdef class EditTrees: cdef LCS lcs = find_lcs(form, lemma) cdef EditTreeC tree - cdef uint32_t tree_id, prefix_tree, suffix_tree + cdef uint32_t prefix_tree, suffix_tree if lcs_is_empty(lcs): tree = edittree_new_subst(self.strings.add(form), self.strings.add(lemma)) else: @@ -108,7 +105,7 @@ cdef class EditTrees: return self._tree_id(tree) cdef uint32_t _tree_id(self, EditTreeC tree): - # If this tree has been constructed before, return its identifier. + # If this tree has been constructed before, return its identifier. 
cdef hash_t hash = edittree_hash(tree) cdef unordered_map[hash_t, uint32_t].iterator iter = self.map.find(hash) if iter != self.map.end(): @@ -289,6 +286,7 @@ def _tree2dict(tree): tree = tree["inner"]["subst_node"] return(dict(tree)) + def _dict2tree(tree): errors = validate_edit_tree(tree) if errors: diff --git a/spacy/pipeline/_parser_internals/_beam_utils.pyx b/spacy/pipeline/_parser_internals/_beam_utils.pyx index 04dd3f11e..de8f0bf7b 100644 --- a/spacy/pipeline/_parser_internals/_beam_utils.pyx +++ b/spacy/pipeline/_parser_internals/_beam_utils.pyx @@ -1,17 +1,14 @@ # cython: infer_types=True # cython: profile=True -cimport numpy as np - import numpy -from cpython.ref cimport Py_XDECREF, PyObject from thinc.extra.search cimport Beam from thinc.extra.search import MaxViolation from thinc.extra.search cimport MaxViolation -from ...typedefs cimport class_t, hash_t +from ...typedefs cimport class_t from .transition_system cimport Transition, TransitionSystem from ...errors import Errors @@ -146,7 +143,6 @@ def update_beam(TransitionSystem moves, states, golds, model, int width, beam_de cdef MaxViolation violn pbeam = BeamBatch(moves, states, golds, width=width, density=beam_density) gbeam = BeamBatch(moves, states, golds, width=width, density=0.0) - cdef StateClass state beam_maps = [] backprops = [] violns = [MaxViolation() for _ in range(len(states))] diff --git a/spacy/pipeline/_parser_internals/_state.pxd b/spacy/pipeline/_parser_internals/_state.pxd index 24acc350c..c063cf97c 100644 --- a/spacy/pipeline/_parser_internals/_state.pxd +++ b/spacy/pipeline/_parser_internals/_state.pxd @@ -277,7 +277,6 @@ cdef cppclass StateC: return n - int n_L(int head) nogil const: return n_arcs(this._left_arcs, head) diff --git a/spacy/pipeline/_parser_internals/arc_eager.pyx b/spacy/pipeline/_parser_internals/arc_eager.pyx index 2c9eb0ff5..bcb4626fb 100644 --- a/spacy/pipeline/_parser_internals/arc_eager.pyx +++ b/spacy/pipeline/_parser_internals/arc_eager.pyx @@ -9,7 +9,7 
@@ from ...strings cimport hash_string from ...structs cimport TokenC from ...tokens.doc cimport Doc, set_children_from_heads from ...tokens.token cimport MISSING_DEP -from ...typedefs cimport attr_t, hash_t +from ...typedefs cimport attr_t from ...training import split_bilu_label @@ -68,8 +68,9 @@ cdef struct GoldParseStateC: weight_t pop_cost -cdef GoldParseStateC create_gold_state(Pool mem, const StateC* state, - heads, labels, sent_starts) except *: +cdef GoldParseStateC create_gold_state( + Pool mem, const StateC* state, heads, labels, sent_starts +) except *: cdef GoldParseStateC gs gs.length = len(heads) gs.stride = 1 @@ -82,7 +83,7 @@ cdef GoldParseStateC create_gold_state(Pool mem, const StateC* state, gs.n_kids_in_stack = mem.alloc(gs.length, sizeof(gs.n_kids_in_stack[0])) for i, is_sent_start in enumerate(sent_starts): - if is_sent_start == True: + if is_sent_start is True: gs.state_bits[i] = set_state_flag( gs.state_bits[i], IS_SENT_START, @@ -210,6 +211,7 @@ cdef class ArcEagerGold: def update(self, StateClass stcls): update_gold_state(&self.c, stcls.c) + def _get_aligned_sent_starts(example): """Get list of SENT_START attributes aligned to the predicted tokenization. If the reference has not sentence starts, return a list of None values. 
@@ -524,7 +526,6 @@ cdef class Break: """ @staticmethod cdef bint is_valid(const StateC* st, attr_t label) nogil: - cdef int i if st.buffer_length() < 2: return False elif st.B(1) != st.B(0) + 1: @@ -556,8 +557,8 @@ cdef class Break: cost -= 1 if gold.heads[si] == b0: cost -= 1 - if not is_sent_start(gold, state.B(1)) \ - and not is_sent_start_unknown(gold, state.B(1)): + if not is_sent_start(gold, state.B(1)) and\ + not is_sent_start_unknown(gold, state.B(1)): cost += 1 return cost @@ -803,7 +804,6 @@ cdef class ArcEager(TransitionSystem): raise TypeError(Errors.E909.format(name="ArcEagerGold")) cdef ArcEagerGold gold_ = gold gold_state = gold_.c - n_gold = 0 if self.c[i].is_valid(stcls.c, self.c[i].label): cost = self.c[i].get_cost(stcls.c, &gold_state, self.c[i].label) else: @@ -875,7 +875,7 @@ cdef class ArcEager(TransitionSystem): print("Gold") for token in example.y: print(token.i, token.text, token.dep_, token.head.text) - aligned_heads, aligned_labels = example.get_aligned_parse() + aligned_heads, _aligned_labels = example.get_aligned_parse() print("Aligned heads") for i, head in enumerate(aligned_heads): print(example.x[i], example.x[head] if head is not None else "__") diff --git a/spacy/pipeline/_parser_internals/ner.pyx b/spacy/pipeline/_parser_internals/ner.pyx index e1edb4464..6c4f8e245 100644 --- a/spacy/pipeline/_parser_internals/ner.pyx +++ b/spacy/pipeline/_parser_internals/ner.pyx @@ -1,6 +1,3 @@ -import os -import random - from cymem.cymem cimport Pool from libc.stdint cimport int32_t @@ -14,7 +11,7 @@ from ...tokens.span import Span from ...attrs cimport IS_SPACE from ...lexeme cimport Lexeme -from ...structs cimport SpanC, TokenC +from ...structs cimport SpanC from ...tokens.span cimport Span from ...typedefs cimport attr_t, weight_t @@ -141,11 +138,10 @@ cdef class BiluoPushDown(TransitionSystem): OUT: Counter() } actions[OUT][''] = 1 # Represents a token predicted to be outside of any entity - actions[UNIT][''] = 1 # Represents a token 
prohibited to be in an entity + actions[UNIT][''] = 1 # Represents a token prohibited to be in an entity for entity_type in kwargs.get('entity_types', []): for action in (BEGIN, IN, LAST, UNIT): actions[action][entity_type] = 1 - moves = ('M', 'B', 'I', 'L', 'U') for example in kwargs.get('examples', []): for token in example.y: ent_type = token.ent_type_ @@ -164,7 +160,7 @@ cdef class BiluoPushDown(TransitionSystem): if token.ent_type: labels.add(token.ent_type_) return labels - + def move_name(self, int move, attr_t label): if move == OUT: return 'O' @@ -325,7 +321,6 @@ cdef class BiluoPushDown(TransitionSystem): raise TypeError(Errors.E909.format(name="BiluoGold")) cdef BiluoGold gold_ = gold gold_state = gold_.c - n_gold = 0 if self.c[i].is_valid(stcls.c, self.c[i].label): cost = self.c[i].get_cost(stcls.c, &gold_state, self.c[i].label) else: @@ -486,10 +481,8 @@ cdef class In: @staticmethod cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) nogil: gold = _gold - move = IN cdef int next_act = gold.ner[s.B(1)].move if s.B(1) >= 0 else OUT cdef int g_act = gold.ner[s.B(0)].move - cdef attr_t g_tag = gold.ner[s.B(0)].label cdef bint is_sunk = _entity_is_sunk(s, gold.ner) if g_act == MISSING: @@ -549,12 +542,10 @@ cdef class Last: @staticmethod cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) nogil: gold = _gold - move = LAST b0 = s.B(0) ent_start = s.E(0) cdef int g_act = gold.ner[b0].move - cdef attr_t g_tag = gold.ner[b0].label cdef int cost = 0 @@ -650,7 +641,6 @@ cdef class Unit: cost += 1 break return cost - cdef class Out: @@ -675,7 +665,6 @@ cdef class Out: cdef weight_t cost(const StateC* s, const void* _gold, attr_t label) nogil: gold = _gold cdef int g_act = gold.ner[s.B(0)].move - cdef attr_t g_tag = gold.ner[s.B(0)].label cdef weight_t cost = 0 if g_act == MISSING: pass diff --git a/spacy/pipeline/_parser_internals/nonproj.pyx b/spacy/pipeline/_parser_internals/nonproj.pyx index 66f423b3b..93ad14feb 100644 --- 
a/spacy/pipeline/_parser_internals/nonproj.pyx +++ b/spacy/pipeline/_parser_internals/nonproj.pyx @@ -125,14 +125,17 @@ def decompose(label): def is_decorated(label): return DELIMITER in label + def count_decorated_labels(gold_data): freqs = {} for example in gold_data: proj_heads, deco_deps = projectivize(example.get_aligned("HEAD"), example.get_aligned("DEP")) # set the label to ROOT for each root dependent - deco_deps = ['ROOT' if head == i else deco_deps[i] - for i, head in enumerate(proj_heads)] + deco_deps = [ + 'ROOT' if head == i else deco_deps[i] + for i, head in enumerate(proj_heads) + ] # count label frequencies for label in deco_deps: if is_decorated(label): @@ -160,9 +163,9 @@ def projectivize(heads, labels): cdef vector[int] _heads_to_c(heads): - cdef vector[int] c_heads; + cdef vector[int] c_heads for head in heads: - if head == None: + if head is None: c_heads.push_back(-1) else: assert head < len(heads) @@ -199,6 +202,7 @@ def _decorate(heads, proj_heads, labels): deco_labels.append(labels[tokenid]) return deco_labels + def get_smallest_nonproj_arc_slow(heads): cdef vector[int] c_heads = _heads_to_c(heads) return _get_smallest_nonproj_arc(c_heads) diff --git a/spacy/pipeline/_parser_internals/stateclass.pyx b/spacy/pipeline/_parser_internals/stateclass.pyx index 0a2657af1..fdb5004bb 100644 --- a/spacy/pipeline/_parser_internals/stateclass.pyx +++ b/spacy/pipeline/_parser_internals/stateclass.pyx @@ -1,6 +1,4 @@ # cython: infer_types=True -import numpy - from libcpp.vector cimport vector from ...tokens.doc cimport Doc @@ -38,11 +36,11 @@ cdef class StateClass: cdef vector[ArcC] arcs self.c.get_arcs(&arcs) return list(arcs) - #py_arcs = [] - #for arc in arcs: - # if arc.head != -1 and arc.child != -1: - # py_arcs.append((arc.head, arc.child, arc.label)) - #return arcs + # py_arcs = [] + # for arc in arcs: + # if arc.head != -1 and arc.child != -1: + # py_arcs.append((arc.head, arc.child, arc.label)) + # return arcs def add_arc(self, int head, int 
child, int label): self.c.add_arc(head, child, label) @@ -52,10 +50,10 @@ cdef class StateClass: def H(self, int child): return self.c.H(child) - + def L(self, int head, int idx): return self.c.L(head, idx) - + def R(self, int head, int idx): return self.c.R(head, idx) @@ -98,7 +96,7 @@ cdef class StateClass: def H(self, int i): return self.c.H(i) - + def E(self, int i): return self.c.E(i) @@ -116,7 +114,7 @@ cdef class StateClass: def H_(self, int i): return self.doc[self.c.H(i)] - + def E_(self, int i): return self.doc[self.c.E(i)] @@ -125,7 +123,7 @@ cdef class StateClass: def R_(self, int i, int idx): return self.doc[self.c.R(i, idx)] - + def empty(self): return self.c.empty() @@ -134,7 +132,7 @@ cdef class StateClass: def at_break(self): return False - #return self.c.at_break() + # return self.c.at_break() def has_head(self, int i): return self.c.has_head(i) diff --git a/spacy/pipeline/_parser_internals/transition_system.pxd b/spacy/pipeline/_parser_internals/transition_system.pxd index ce17480d4..04cd10d88 100644 --- a/spacy/pipeline/_parser_internals/transition_system.pxd +++ b/spacy/pipeline/_parser_internals/transition_system.pxd @@ -20,11 +20,15 @@ cdef struct Transition: int (*do)(StateC* state, attr_t label) nogil -ctypedef weight_t (*get_cost_func_t)(const StateC* state, const void* gold, - attr_tlabel) nogil -ctypedef weight_t (*move_cost_func_t)(const StateC* state, const void* gold) nogil -ctypedef weight_t (*label_cost_func_t)(const StateC* state, const void* - gold, attr_t label) nogil +ctypedef weight_t (*get_cost_func_t)( + const StateC* state, const void* gold, attr_tlabel +) nogil +ctypedef weight_t (*move_cost_func_t)( + const StateC* state, const void* gold +) nogil +ctypedef weight_t (*label_cost_func_t)( + const StateC* state, const void* gold, attr_t label +) nogil ctypedef int (*do_func_t)(StateC* state, attr_t label) nogil diff --git a/spacy/pipeline/_parser_internals/transition_system.pyx 
b/spacy/pipeline/_parser_internals/transition_system.pyx index 053c87f22..aabbdfa24 100644 --- a/spacy/pipeline/_parser_internals/transition_system.pyx +++ b/spacy/pipeline/_parser_internals/transition_system.pyx @@ -8,9 +8,7 @@ from collections import Counter import srsly from ...structs cimport TokenC -from ...tokens.doc cimport Doc from ...typedefs cimport attr_t, weight_t -from . cimport _beam_utils from .stateclass cimport StateClass from ... import util @@ -231,7 +229,6 @@ cdef class TransitionSystem: return self def to_bytes(self, exclude=tuple()): - transitions = [] serializers = { 'moves': lambda: srsly.json_dumps(self.labels), 'strings': lambda: self.strings.to_bytes(), diff --git a/spacy/pipeline/dep_parser.pyx b/spacy/pipeline/dep_parser.pyx index cb896c385..57f091788 100644 --- a/spacy/pipeline/dep_parser.pyx +++ b/spacy/pipeline/dep_parser.pyx @@ -1,6 +1,6 @@ # cython: infer_types=True, profile=True, binding=True from collections import defaultdict -from typing import Callable, Iterable, Optional +from typing import Callable, Optional from thinc.api import Config, Model @@ -124,6 +124,7 @@ def make_parser( scorer=scorer, ) + @Language.factory( "beam_parser", assigns=["token.dep", "token.head", "token.is_sent_start", "doc.sents"], diff --git a/spacy/pipeline/morphologizer.pyx b/spacy/pipeline/morphologizer.pyx index 4ca0ce165..7ca3908bd 100644 --- a/spacy/pipeline/morphologizer.pyx +++ b/spacy/pipeline/morphologizer.pyx @@ -2,7 +2,6 @@ from itertools import islice from typing import Callable, Dict, Optional, Union -import srsly from thinc.api import Config, Model, SequenceCategoricalCrossentropy from ..morphology cimport Morphology @@ -14,10 +13,8 @@ from ..errors import Errors from ..language import Language from ..parts_of_speech import IDS as POS_IDS from ..scorer import Scorer -from ..symbols import POS from ..training import validate_examples, validate_get_examples from ..util import registry -from .pipe import deserialize_config from .tagger 
import Tagger # See #9050 @@ -76,8 +73,11 @@ def morphologizer_score(examples, **kwargs): results = {} results.update(Scorer.score_token_attr(examples, "pos", **kwargs)) results.update(Scorer.score_token_attr(examples, "morph", getter=morph_key_getter, **kwargs)) - results.update(Scorer.score_token_attr_per_feat(examples, - "morph", getter=morph_key_getter, **kwargs)) + results.update( + Scorer.score_token_attr_per_feat( + examples, "morph", getter=morph_key_getter, **kwargs + ) + ) return results @@ -233,7 +233,6 @@ class Morphologizer(Tagger): if isinstance(docs, Doc): docs = [docs] cdef Doc doc - cdef Vocab vocab = self.vocab cdef bint overwrite = self.cfg["overwrite"] cdef bint extend = self.cfg["extend"] labels = self.labels diff --git a/spacy/pipeline/multitask.pyx b/spacy/pipeline/multitask.pyx index 6b62c0811..2a62a50d5 100644 --- a/spacy/pipeline/multitask.pyx +++ b/spacy/pipeline/multitask.pyx @@ -4,13 +4,10 @@ from typing import Optional import numpy from thinc.api import Config, CosineDistance, Model, set_dropout_rate, to_categorical -from ..tokens.doc cimport Doc - -from ..attrs import ID, POS +from ..attrs import ID from ..errors import Errors from ..language import Language from ..training import validate_examples -from ._parser_internals import nonproj from .tagger import Tagger from .trainable_pipe import TrainablePipe @@ -103,10 +100,9 @@ class MultitaskObjective(Tagger): cdef int idx = 0 correct = numpy.zeros((scores.shape[0],), dtype="i") guesses = scores.argmax(axis=1) - docs = [eg.predicted for eg in examples] for i, eg in enumerate(examples): # Handles alignment for tokenization differences - doc_annots = eg.get_aligned() # TODO + _doc_annots = eg.get_aligned() # TODO for j in range(len(eg.predicted)): tok_annots = {key: values[j] for key, values in tok_annots.items()} label = self.make_label(j, tok_annots) @@ -206,7 +202,6 @@ class ClozeMultitask(TrainablePipe): losses[self.name] = 0. 
set_dropout_rate(self.model, drop) validate_examples(examples, "ClozeMultitask.rehearse") - docs = [eg.predicted for eg in examples] predictions, bp_predictions = self.model.begin_update() loss, d_predictions = self.get_loss(examples, self.vocab.vectors.data, predictions) bp_predictions(d_predictions) diff --git a/spacy/pipeline/ner.pyx b/spacy/pipeline/ner.pyx index 8dd6c3c43..15c092ae9 100644 --- a/spacy/pipeline/ner.pyx +++ b/spacy/pipeline/ner.pyx @@ -1,6 +1,6 @@ # cython: infer_types=True, profile=True, binding=True from collections import defaultdict -from typing import Callable, Iterable, Optional +from typing import Callable, Optional from thinc.api import Config, Model @@ -10,7 +10,7 @@ from ._parser_internals.ner cimport BiluoPushDown from .transition_parser cimport Parser from ..language import Language -from ..scorer import PRFScore, get_ner_prf +from ..scorer import get_ner_prf from ..training import remove_bilu_prefix from ..util import registry @@ -100,6 +100,7 @@ def make_ner( scorer=scorer, ) + @Language.factory( "beam_ner", assigns=["doc.ents", "token.ent_iob", "token.ent_type"], diff --git a/spacy/pipeline/pipe.pyx b/spacy/pipeline/pipe.pyx index 42f518882..90775c465 100644 --- a/spacy/pipeline/pipe.pyx +++ b/spacy/pipeline/pipe.pyx @@ -1,6 +1,6 @@ # cython: infer_types=True, profile=True, binding=True import warnings -from typing import Callable, Dict, Iterable, Iterator, Optional, Tuple, Union +from typing import Callable, Dict, Iterable, Iterator, Tuple, Union import srsly @@ -40,7 +40,7 @@ cdef class Pipe: """ raise NotImplementedError(Errors.E931.format(parent="Pipe", method="__call__", name=self.name)) - def pipe(self, stream: Iterable[Doc], *, batch_size: int=128) -> Iterator[Doc]: + def pipe(self, stream: Iterable[Doc], *, batch_size: int = 128) -> Iterator[Doc]: """Apply the pipe to a stream of documents. This usually happens under the hood when the nlp object is called on a text and all components are applied to the Doc. 
@@ -59,7 +59,7 @@ cdef class Pipe: except Exception as e: error_handler(self.name, self, [doc], e) - def initialize(self, get_examples: Callable[[], Iterable[Example]], *, nlp: Language=None): + def initialize(self, get_examples: Callable[[], Iterable[Example]], *, nlp: Language = None): """Initialize the pipe. For non-trainable components, this method is optional. For trainable components, which should inherit from the subclass TrainablePipe, the provided data examples diff --git a/spacy/pipeline/sentencizer.pyx b/spacy/pipeline/sentencizer.pyx index 2fe7e1540..76f296644 100644 --- a/spacy/pipeline/sentencizer.pyx +++ b/spacy/pipeline/sentencizer.pyx @@ -7,13 +7,13 @@ from ..tokens.doc cimport Doc from .. import util from ..language import Language -from ..scorer import Scorer from .pipe import Pipe from .senter import senter_score # see #9050 BACKWARD_OVERWRITE = False + @Language.factory( "sentencizer", assigns=["token.is_sent_start", "doc.sents"], @@ -36,17 +36,19 @@ class Sentencizer(Pipe): DOCS: https://spacy.io/api/sentencizer """ - default_punct_chars = ['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', - '।', '॥', '၊', '။', '።', '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄', - '᥅', '᪨', '᪩', '᪪', '᪫', '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', - '‼', '‽', '⁇', '⁈', '⁉', '⸮', '⸼', '꓿', '꘎', '꘏', '꛳', '꛷', '꡶', - '꡷', '꣎', '꣏', '꤯', '꧈', '꧉', '꩝', '꩞', '꩟', '꫰', '꫱', '꯫', '﹒', - '﹖', '﹗', '!', '.', '?', '𐩖', '𐩗', '𑁇', '𑁈', '𑂾', '𑂿', '𑃀', - '𑃁', '𑅁', '𑅂', '𑅃', '𑇅', '𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', - '𑊩', '𑑋', '𑑌', '𑗂', '𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', - '𑗑', '𑗒', '𑗓', '𑗔', '𑗕', '𑗖', '𑗗', '𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', - '𑩃', '𑪛', '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈', - '。', '。'] + default_punct_chars = [ + '!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', + '।', '॥', '၊', '။', '።', '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄', + '᥅', '᪨', '᪩', '᪪', '᪫', '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', + '‼', '‽', '⁇', '⁈', '⁉', '⸮', '⸼', '꓿', '꘎', 
'꘏', '꛳', '꛷', '꡶', + '꡷', '꣎', '꣏', '꤯', '꧈', '꧉', '꩝', '꩞', '꩟', '꫰', '꫱', '꯫', '﹒', + '﹖', '﹗', '!', '.', '?', '𐩖', '𐩗', '𑁇', '𑁈', '𑂾', '𑂿', '𑃀', + '𑃁', '𑅁', '𑅂', '𑅃', '𑇅', '𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', + '𑊩', '𑑋', '𑑌', '𑗂', '𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', + '𑗑', '𑗒', '𑗓', '𑗔', '𑗕', '𑗖', '𑗗', '𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', + '𑩃', '𑪛', '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈', + '。', '。' + ] def __init__( self, @@ -128,7 +130,6 @@ class Sentencizer(Pipe): if isinstance(docs, Doc): docs = [docs] cdef Doc doc - cdef int idx = 0 for i, doc in enumerate(docs): doc_tag_ids = batch_tag_ids[i] for j, tag_id in enumerate(doc_tag_ids): @@ -169,7 +170,6 @@ class Sentencizer(Pipe): path = path.with_suffix(".json") srsly.write_json(path, {"punct_chars": list(self.punct_chars), "overwrite": self.overwrite}) - def from_disk(self, path, *, exclude=tuple()): """Load the sentencizer from disk. diff --git a/spacy/pipeline/senter.pyx b/spacy/pipeline/senter.pyx index 26f98ba59..37ddcc3c0 100644 --- a/spacy/pipeline/senter.pyx +++ b/spacy/pipeline/senter.pyx @@ -2,7 +2,6 @@ from itertools import islice from typing import Callable, Optional -import srsly from thinc.api import Config, Model, SequenceCategoricalCrossentropy from ..tokens.doc cimport Doc diff --git a/spacy/pipeline/tagger.pyx b/spacy/pipeline/tagger.pyx index 47aae2bb7..4c5265a78 100644 --- a/spacy/pipeline/tagger.pyx +++ b/spacy/pipeline/tagger.pyx @@ -1,26 +1,18 @@ # cython: infer_types=True, profile=True, binding=True -import warnings from itertools import islice from typing import Callable, Optional import numpy -import srsly from thinc.api import Config, Model, SequenceCategoricalCrossentropy, set_dropout_rate -from thinc.types import Floats2d -from ..morphology cimport Morphology from ..tokens.doc cimport Doc -from ..vocab cimport Vocab from .. 
import util -from ..attrs import ID, POS -from ..errors import Errors, Warnings +from ..errors import Errors from ..language import Language -from ..parts_of_speech import X from ..scorer import Scorer from ..training import validate_examples, validate_get_examples from ..util import registry -from .pipe import deserialize_config from .trainable_pipe import TrainablePipe # See #9050 @@ -169,7 +161,6 @@ class Tagger(TrainablePipe): if isinstance(docs, Doc): docs = [docs] cdef Doc doc - cdef Vocab vocab = self.vocab cdef bint overwrite = self.cfg["overwrite"] labels = self.labels for i, doc in enumerate(docs): diff --git a/spacy/pipeline/trainable_pipe.pyx b/spacy/pipeline/trainable_pipe.pyx index 7aa91ac16..e5865e070 100644 --- a/spacy/pipeline/trainable_pipe.pyx +++ b/spacy/pipeline/trainable_pipe.pyx @@ -55,7 +55,7 @@ cdef class TrainablePipe(Pipe): except Exception as e: error_handler(self.name, self, [doc], e) - def pipe(self, stream: Iterable[Doc], *, batch_size: int=128) -> Iterator[Doc]: + def pipe(self, stream: Iterable[Doc], *, batch_size: int = 128) -> Iterator[Doc]: """Apply the pipe to a stream of documents. This usually happens under the hood when the nlp object is called on a text and all components are applied to the Doc. @@ -102,9 +102,9 @@ cdef class TrainablePipe(Pipe): def update(self, examples: Iterable["Example"], *, - drop: float=0.0, - sgd: Optimizer=None, - losses: Optional[Dict[str, float]]=None) -> Dict[str, float]: + drop: float = 0.0, + sgd: Optimizer = None, + losses: Optional[Dict[str, float]] = None) -> Dict[str, float]: """Learn from a batch of documents and gold-standard information, updating the pipe's model. Delegates to predict and get_loss. 
@@ -138,8 +138,8 @@ cdef class TrainablePipe(Pipe): def rehearse(self, examples: Iterable[Example], *, - sgd: Optimizer=None, - losses: Dict[str, float]=None, + sgd: Optimizer = None, + losses: Dict[str, float] = None, **config) -> Dict[str, float]: """Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the current model to make predictions similar to an initial model, @@ -177,7 +177,7 @@ cdef class TrainablePipe(Pipe): """ return util.create_default_optimizer() - def initialize(self, get_examples: Callable[[], Iterable[Example]], *, nlp: Language=None): + def initialize(self, get_examples: Callable[[], Iterable[Example]], *, nlp: Language = None): """Initialize the pipe for training, using data examples if available. This method needs to be implemented by each TrainablePipe component, ensuring the internal model (if available) is initialized properly diff --git a/spacy/pipeline/transition_parser.pxd b/spacy/pipeline/transition_parser.pxd index e5e88d521..7ddb91e01 100644 --- a/spacy/pipeline/transition_parser.pxd +++ b/spacy/pipeline/transition_parser.pxd @@ -13,8 +13,18 @@ cdef class Parser(TrainablePipe): cdef readonly TransitionSystem moves cdef public object _multitasks - cdef void _parseC(self, CBlas cblas, StateC** states, - WeightsC weights, SizesC sizes) nogil + cdef void _parseC( + self, + CBlas cblas, + StateC** states, + WeightsC weights, + SizesC sizes + ) nogil - cdef void c_transition_batch(self, StateC** states, const float* scores, - int nr_class, int batch_size) nogil + cdef void c_transition_batch( + self, + StateC** states, + const float* scores, + int nr_class, + int batch_size + ) nogil diff --git a/spacy/pipeline/transition_parser.pyx b/spacy/pipeline/transition_parser.pyx index ef4d9b362..11c8fafc7 100644 --- a/spacy/pipeline/transition_parser.pyx +++ b/spacy/pipeline/transition_parser.pyx @@ -7,20 +7,15 @@ from cymem.cymem cimport Pool from itertools import islice from libc.stdlib cimport calloc, free -from 
libc.string cimport memcpy, memset +from libc.string cimport memset from libcpp.vector cimport vector import random -import srsly -from thinc.api import CupyOps, NumpyOps, get_ops, set_dropout_rate - -from thinc.extra.search cimport Beam - -import warnings - import numpy import numpy.random +import srsly +from thinc.api import CupyOps, NumpyOps, set_dropout_rate from ..ml.parser_model cimport ( ActivationsC, @@ -42,7 +37,7 @@ from .trainable_pipe import TrainablePipe from ._parser_internals cimport _beam_utils from .. import util -from ..errors import Errors, Warnings +from ..errors import Errors from ..training import validate_examples, validate_get_examples from ._parser_internals import _beam_utils @@ -258,7 +253,6 @@ cdef class Parser(TrainablePipe): except Exception as e: error_handler(self.name, self, batch_in_order, e) - def predict(self, docs): if isinstance(docs, Doc): docs = [docs] @@ -300,8 +294,6 @@ cdef class Parser(TrainablePipe): return batch def beam_parse(self, docs, int beam_width, float drop=0., beam_density=0.): - cdef Beam beam - cdef Doc doc self._ensure_labels_are_added(docs) batch = _beam_utils.BeamBatch( self.moves, @@ -321,16 +313,18 @@ cdef class Parser(TrainablePipe): del model return list(batch) - cdef void _parseC(self, CBlas cblas, StateC** states, - WeightsC weights, SizesC sizes) nogil: - cdef int i, j + cdef void _parseC( + self, CBlas cblas, StateC** states, WeightsC weights, SizesC sizes + ) nogil: + cdef int i cdef vector[StateC*] unfinished cdef ActivationsC activations = alloc_activations(sizes) while sizes.states >= 1: predict_states(cblas, &activations, states, &weights, sizes) # Validate actions, argmax, take action. 
- self.c_transition_batch(states, - activations.scores, sizes.classes, sizes.states) + self.c_transition_batch( + states, activations.scores, sizes.classes, sizes.states + ) for i in range(sizes.states): if not states[i].is_final(): unfinished.push_back(states[i]) @@ -342,7 +336,6 @@ cdef class Parser(TrainablePipe): def set_annotations(self, docs, states_or_beams): cdef StateClass state - cdef Beam beam cdef Doc doc states = _beam_utils.collect_states(states_or_beams, docs) for i, (state, doc) in enumerate(zip(states, docs)): @@ -359,8 +352,13 @@ cdef class Parser(TrainablePipe): self.c_transition_batch(&c_states[0], c_scores, scores.shape[1], scores.shape[0]) return [state for state in states if not state.c.is_final()] - cdef void c_transition_batch(self, StateC** states, const float* scores, - int nr_class, int batch_size) nogil: + cdef void c_transition_batch( + self, + StateC** states, + const float* scores, + int nr_class, + int batch_size + ) nogil: # n_moves should not be zero at this point, but make sure to avoid zero-length mem alloc with gil: assert self.moves.n_moves > 0, Errors.E924.format(name=self.name) @@ -380,7 +378,6 @@ cdef class Parser(TrainablePipe): free(is_valid) def update(self, examples, *, drop=0., sgd=None, losses=None): - cdef StateClass state if losses is None: losses = {} losses.setdefault(self.name, 0.) 
@@ -419,8 +416,7 @@ cdef class Parser(TrainablePipe): if not states: return losses model, backprop_tok2vec = self.model.begin_update([eg.x for eg in examples]) - - all_states = list(states) + states_golds = list(zip(states, golds)) n_moves = 0 while states_golds: @@ -500,8 +496,16 @@ cdef class Parser(TrainablePipe): del tutor return losses - def update_beam(self, examples, *, beam_width, - drop=0., sgd=None, losses=None, beam_density=0.0): + def update_beam( + self, + examples, + *, + beam_width, + drop=0., + sgd=None, + losses=None, + beam_density=0.0 + ): states, golds, _ = self.moves.init_gold_batch(examples) if not states: return losses @@ -531,8 +535,9 @@ cdef class Parser(TrainablePipe): is_valid = mem.alloc(self.moves.n_moves, sizeof(int)) costs = mem.alloc(self.moves.n_moves, sizeof(float)) - cdef np.ndarray d_scores = numpy.zeros((len(states), self.moves.n_moves), - dtype='f', order='C') + cdef np.ndarray d_scores = numpy.zeros( + (len(states), self.moves.n_moves), dtype='f', order='C' + ) c_d_scores = d_scores.data unseen_classes = self.model.attrs["unseen_classes"] for i, (state, gold) in enumerate(zip(states, golds)): @@ -542,8 +547,9 @@ cdef class Parser(TrainablePipe): for j in range(self.moves.n_moves): if costs[j] <= 0.0 and j in unseen_classes: unseen_classes.remove(j) - cpu_log_loss(c_d_scores, - costs, is_valid, &scores[i, 0], d_scores.shape[1]) + cpu_log_loss( + c_d_scores, costs, is_valid, &scores[i, 0], d_scores.shape[1] + ) c_d_scores += d_scores.shape[1] # Note that we don't normalize this. See comment in update() for why. 
if losses is not None: diff --git a/spacy/strings.pyx b/spacy/strings.pyx index 16c3e2b5b..b0799d6fc 100644 --- a/spacy/strings.pyx +++ b/spacy/strings.pyx @@ -2,7 +2,6 @@ cimport cython from libc.stdint cimport uint32_t from libc.string cimport memcpy -from libcpp.set cimport set from murmurhash.mrmr cimport hash32, hash64 import srsly @@ -20,9 +19,10 @@ cdef inline bint _try_coerce_to_hash(object key, hash_t* out_hash): try: out_hash[0] = key return True - except: + except: # no-cython-lint return False + def get_string_id(key): """Get a string ID, handling the reserved symbols correctly. If the key is already an ID, return it. @@ -87,7 +87,6 @@ cdef Utf8Str* _allocate(Pool mem, const unsigned char* chars, uint32_t length) e cdef int n_length_bytes cdef int i cdef Utf8Str* string = mem.alloc(1, sizeof(Utf8Str)) - cdef uint32_t ulength = length if length < sizeof(string.s): string.s[0] = length memcpy(&string.s[1], chars, length) diff --git a/spacy/structs.pxd b/spacy/structs.pxd index 9efb068fd..8cfcc2964 100644 --- a/spacy/structs.pxd +++ b/spacy/structs.pxd @@ -52,7 +52,7 @@ cdef struct TokenC: int sent_start int ent_iob - attr_t ent_type # TODO: Is there a better way to do this? Multiple sources of truth.. + attr_t ent_type # TODO: Is there a better way to do this? Multiple sources of truth.. 
attr_t ent_kb_id hash_t ent_id diff --git a/spacy/symbols.pxd b/spacy/symbols.pxd index bc15d9b80..73be19145 100644 --- a/spacy/symbols.pxd +++ b/spacy/symbols.pxd @@ -92,7 +92,7 @@ cdef enum symbol_t: ADV AUX CONJ - CCONJ # U20 + CCONJ # U20 DET INTJ NOUN @@ -418,7 +418,7 @@ cdef enum symbol_t: ccomp complm conj - cop # U20 + cop # U20 csubj csubjpass dep @@ -441,8 +441,8 @@ cdef enum symbol_t: num number oprd - obj # U20 - obl # U20 + obj # U20 + obl # U20 parataxis partmod pcomp diff --git a/spacy/symbols.pyx b/spacy/symbols.pyx index b0345c710..d1deeb0e7 100644 --- a/spacy/symbols.pyx +++ b/spacy/symbols.pyx @@ -96,7 +96,7 @@ IDS = { "ADV": ADV, "AUX": AUX, "CONJ": CONJ, - "CCONJ": CCONJ, # U20 + "CCONJ": CCONJ, # U20 "DET": DET, "INTJ": INTJ, "NOUN": NOUN, @@ -421,7 +421,7 @@ IDS = { "ccomp": ccomp, "complm": complm, "conj": conj, - "cop": cop, # U20 + "cop": cop, # U20 "csubj": csubj, "csubjpass": csubjpass, "dep": dep, @@ -444,8 +444,8 @@ IDS = { "num": num, "number": number, "oprd": oprd, - "obj": obj, # U20 - "obl": obl, # U20 + "obj": obj, # U20 + "obl": obl, # U20 "parataxis": parataxis, "partmod": partmod, "pcomp": pcomp, diff --git a/spacy/tests/matcher/test_pattern_validation.py b/spacy/tests/matcher/test_pattern_validation.py index 21fa36865..45f9f4ee7 100644 --- a/spacy/tests/matcher/test_pattern_validation.py +++ b/spacy/tests/matcher/test_pattern_validation.py @@ -52,7 +52,8 @@ TEST_PATTERNS = [ @pytest.mark.parametrize( - "pattern", [[{"XX": "y"}, {"LENGTH": "2"}, {"TEXT": {"IN": 5}}]] + "pattern", + [[{"XX": "y"}], [{"LENGTH": "2"}], [{"TEXT": {"IN": 5}}], [{"text": {"in": 6}}]], ) def test_matcher_pattern_validation(en_vocab, pattern): matcher = Matcher(en_vocab, validate=True) diff --git a/spacy/tests/package/test_requirements.py b/spacy/tests/package/test_requirements.py index 9e83d5fb1..fab1e8218 100644 --- a/spacy/tests/package/test_requirements.py +++ b/spacy/tests/package/test_requirements.py @@ -12,6 +12,7 @@ def 
test_build_dependencies(): "flake8", "hypothesis", "pre-commit", + "cython-lint", "black", "isort", "mypy", diff --git a/spacy/tokenizer.pxd b/spacy/tokenizer.pxd index f7585b45a..a902ebad9 100644 --- a/spacy/tokenizer.pxd +++ b/spacy/tokenizer.pxd @@ -31,24 +31,58 @@ cdef class Tokenizer: cdef Doc _tokenize_affixes(self, str string, bint with_special_cases) cdef int _apply_special_cases(self, Doc doc) except -1 - cdef void _filter_special_spans(self, vector[SpanC] &original, - vector[SpanC] &filtered, int doc_len) nogil - cdef object _prepare_special_spans(self, Doc doc, - vector[SpanC] &filtered) - cdef int _retokenize_special_spans(self, Doc doc, TokenC* tokens, - object span_data) - cdef int _try_specials_and_cache(self, hash_t key, Doc tokens, - int* has_special, - bint with_special_cases) except -1 - cdef int _tokenize(self, Doc tokens, str span, hash_t key, - int* has_special, bint with_special_cases) except -1 - cdef str _split_affixes(self, Pool mem, str string, - vector[LexemeC*] *prefixes, - vector[LexemeC*] *suffixes, int* has_special, - bint with_special_cases) - cdef int _attach_tokens(self, Doc tokens, str string, - vector[LexemeC*] *prefixes, - vector[LexemeC*] *suffixes, int* has_special, - bint with_special_cases) except -1 - cdef int _save_cached(self, const TokenC* tokens, hash_t key, - int* has_special, int n) except -1 + cdef void _filter_special_spans( + self, + vector[SpanC] &original, + vector[SpanC] &filtered, + int doc_len, + ) nogil + cdef object _prepare_special_spans( + self, + Doc doc, + vector[SpanC] &filtered, + ) + cdef int _retokenize_special_spans( + self, + Doc doc, + TokenC* tokens, + object span_data, + ) + cdef int _try_specials_and_cache( + self, + hash_t key, + Doc tokens, + int* has_special, + bint with_special_cases, + ) except -1 + cdef int _tokenize( + self, + Doc tokens, + str span, + hash_t key, + int* has_special, + bint with_special_cases, + ) except -1 + cdef str _split_affixes( + self, + Pool mem, + str string, + 
vector[LexemeC*] *prefixes, + vector[LexemeC*] *suffixes, int* has_special, + bint with_special_cases, + ) + cdef int _attach_tokens( + self, + Doc tokens, + str string, + vector[LexemeC*] *prefixes, + vector[LexemeC*] *suffixes, int* has_special, + bint with_special_cases, + ) except -1 + cdef int _save_cached( + self, + const TokenC* tokens, + hash_t key, + int* has_special, + int n, + ) except -1 diff --git a/spacy/tokenizer.pyx b/spacy/tokenizer.pyx index 3861b1cee..8fc95bea0 100644 --- a/spacy/tokenizer.pyx +++ b/spacy/tokenizer.pyx @@ -8,20 +8,18 @@ from libcpp.set cimport set as stdset from preshed.maps cimport PreshMap import re -import warnings - from .lexeme cimport EMPTY_LEXEME from .strings cimport hash_string from .tokens.doc cimport Doc from . import util from .attrs import intify_attrs -from .errors import Errors, Warnings +from .errors import Errors from .scorer import Scorer from .symbols import NORM, ORTH from .tokens import Span from .training import validate_examples -from .util import get_words_and_spaces, registry +from .util import get_words_and_spaces cdef class Tokenizer: @@ -324,7 +322,7 @@ cdef class Tokenizer: cdef int span_start cdef int span_end while i < doc.length: - if not i in span_data: + if i not in span_data: tokens[i + offset] = doc.c[i] i += 1 else: @@ -395,12 +393,15 @@ cdef class Tokenizer: self._save_cached(&tokens.c[orig_size], orig_key, has_special, tokens.length - orig_size) - cdef str _split_affixes(self, Pool mem, str string, - vector[const LexemeC*] *prefixes, - vector[const LexemeC*] *suffixes, - int* has_special, - bint with_special_cases): - cdef size_t i + cdef str _split_affixes( + self, + Pool mem, + str string, + vector[const LexemeC*] *prefixes, + vector[const LexemeC*] *suffixes, + int* has_special, + bint with_special_cases + ): cdef str prefix cdef str suffix cdef str minus_pre @@ -445,10 +446,6 @@ cdef class Tokenizer: vector[const LexemeC*] *suffixes, int* has_special, bint with_special_cases) except -1: 
- cdef bint specials_hit = 0 - cdef bint cache_hit = 0 - cdef int split, end - cdef const LexemeC* const* lexemes cdef const LexemeC* lexeme cdef str span cdef int i @@ -458,9 +455,11 @@ cdef class Tokenizer: if string: if self._try_specials_and_cache(hash_string(string), tokens, has_special, with_special_cases): pass - elif (self.token_match and self.token_match(string)) or \ - (self.url_match and \ - self.url_match(string)): + elif ( + (self.token_match and self.token_match(string)) or + (self.url_match and self.url_match(string)) + ): + # We're always saying 'no' to spaces here -- the caller will # fix up the outermost one, with reference to the original. # See Issue #859 @@ -821,7 +820,7 @@ cdef class Tokenizer: self.infix_finditer = None self.token_match = None self.url_match = None - msg = util.from_bytes(bytes_data, deserializers, exclude) + util.from_bytes(bytes_data, deserializers, exclude) if "prefix_search" in data and isinstance(data["prefix_search"], str): self.prefix_search = re.compile(data["prefix_search"]).search if "suffix_search" in data and isinstance(data["suffix_search"], str): diff --git a/spacy/tokens/_retokenize.pyx b/spacy/tokens/_retokenize.pyx index 8ed707ab9..f28d2e088 100644 --- a/spacy/tokens/_retokenize.pyx +++ b/spacy/tokens/_retokenize.pyx @@ -1,7 +1,6 @@ # cython: infer_types=True, bounds_check=False, profile=True from cymem.cymem cimport Pool -from libc.stdlib cimport free, malloc -from libc.string cimport memcpy, memset +from libc.string cimport memset import numpy from thinc.api import get_array_module @@ -10,7 +9,7 @@ from ..attrs cimport MORPH, NORM from ..lexeme cimport EMPTY_LEXEME, Lexeme from ..structs cimport LexemeC, TokenC from ..vocab cimport Vocab -from .doc cimport Doc, set_children_from_heads, token_by_end, token_by_start +from .doc cimport Doc, set_children_from_heads, token_by_start from .span cimport Span from .token cimport Token @@ -147,7 +146,7 @@ def _merge(Doc doc, merges): syntactic root of the span. 
RETURNS (Token): The first newly merged token. """ - cdef int i, merge_index, start, end, token_index, current_span_index, current_offset, offset, span_index + cdef int i, merge_index, start, token_index, current_span_index, current_offset, offset, span_index cdef Span span cdef const LexemeC* lex cdef TokenC* token @@ -165,7 +164,6 @@ def _merge(Doc doc, merges): merges.sort(key=_get_start) for merge_index, (span, attributes) in enumerate(merges): start = span.start - end = span.end spans.append(span) # House the new merged token where it starts token = &doc.c[start] @@ -203,8 +201,9 @@ def _merge(Doc doc, merges): # for the merged region. To do this, we create a boolean array indicating # whether the row is to be deleted, then use numpy.delete if doc.tensor is not None and doc.tensor.size != 0: - doc.tensor = _resize_tensor(doc.tensor, - [(m[0].start, m[0].end) for m in merges]) + doc.tensor = _resize_tensor( + doc.tensor, [(m[0].start, m[0].end) for m in merges] + ) # Memorize span roots and sets dependencies of the newly merged # tokens to the dependencies of their roots. 
span_roots = [] @@ -267,11 +266,11 @@ def _merge(Doc doc, merges): span_index += 1 if span_index < len(spans) and i == spans[span_index].start: # First token in a span - doc.c[i - offset] = doc.c[i] # move token to its place + doc.c[i - offset] = doc.c[i] # move token to its place offset += (spans[span_index].end - spans[span_index].start) - 1 in_span = True if not in_span: - doc.c[i - offset] = doc.c[i] # move token to its place + doc.c[i - offset] = doc.c[i] # move token to its place for i in range(doc.length - offset, doc.length): memset(&doc.c[i], 0, sizeof(TokenC)) @@ -345,7 +344,11 @@ def _split(Doc doc, int token_index, orths, heads, attrs): if to_process_tensor: xp = get_array_module(doc.tensor) if xp is numpy: - doc.tensor = xp.append(doc.tensor, xp.zeros((nb_subtokens,doc.tensor.shape[1]), dtype="float32"), axis=0) + doc.tensor = xp.append( + doc.tensor, + xp.zeros((nb_subtokens, doc.tensor.shape[1]), dtype="float32"), + axis=0 + ) else: shape = (doc.tensor.shape[0] + nb_subtokens, doc.tensor.shape[1]) resized_array = xp.zeros(shape, dtype="float32") @@ -367,7 +370,8 @@ def _split(Doc doc, int token_index, orths, heads, attrs): token.norm = 0 # reset norm if to_process_tensor: # setting the tensors of the split tokens to array of zeros - doc.tensor[token_index + i:token_index + i + 1] = xp.zeros((1,doc.tensor.shape[1]), dtype="float32") + doc.tensor[token_index + i:token_index + i + 1] = \ + xp.zeros((1, doc.tensor.shape[1]), dtype="float32") # Update the character offset of the subtokens if i != 0: token.idx = orig_token.idx + idx_offset @@ -455,7 +459,6 @@ def normalize_token_attrs(Vocab vocab, attrs): def set_token_attrs(Token py_token, attrs): cdef TokenC* token = py_token.c cdef const LexemeC* lex = token.lex - cdef Doc doc = py_token.doc # Assign attributes for attr_name, attr_value in attrs.items(): if attr_name == "_": # Set extension attributes diff --git a/spacy/tokens/doc.pxd b/spacy/tokens/doc.pxd index d7f092c94..d9719609c 100644 --- 
a/spacy/tokens/doc.pxd +++ b/spacy/tokens/doc.pxd @@ -31,7 +31,7 @@ cdef int token_by_start(const TokenC* tokens, int length, int start_char) except cdef int token_by_end(const TokenC* tokens, int length, int end_char) except -2 -cdef int [:,:] _get_lca_matrix(Doc, int start, int end) +cdef int [:, :] _get_lca_matrix(Doc, int start, int end) cdef class Doc: @@ -61,7 +61,6 @@ cdef class Doc: cdef int length cdef int max_length - cdef public object noun_chunks_iterator cdef object __weakref__ diff --git a/spacy/tokens/doc.pyx b/spacy/tokens/doc.pyx index 146b276e2..8fc2c4b3c 100644 --- a/spacy/tokens/doc.pyx +++ b/spacy/tokens/doc.pyx @@ -43,14 +43,13 @@ from ..attrs cimport ( attr_id_t, ) from ..lexeme cimport EMPTY_LEXEME, Lexeme -from ..typedefs cimport attr_t, flags_t +from ..typedefs cimport attr_t from .token cimport Token from .. import parts_of_speech, schemas, util from ..attrs import IDS, intify_attr -from ..compat import copy_reg, pickle +from ..compat import copy_reg from ..errors import Errors, Warnings -from ..morphology import Morphology from ..util import get_words_and_spaces from ._retokenize import Retokenizer from .underscore import Underscore, get_ext_args @@ -784,7 +783,7 @@ cdef class Doc: # TODO: # 1. Test basic data-driven ORTH gazetteer # 2. Test more nuanced date and currency regex - cdef attr_t entity_type, kb_id, ent_id + cdef attr_t kb_id, ent_id cdef int ent_start, ent_end ent_spans = [] for ent_info in ents: @@ -987,7 +986,6 @@ cdef class Doc: >>> np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA]) """ cdef int i, j - cdef attr_id_t feature cdef np.ndarray[attr_t, ndim=2] output # Handle scalar/list inputs of strings/ints for py_attr_ids # See also #3064 @@ -999,8 +997,10 @@ cdef class Doc: py_attr_ids = [py_attr_ids] # Allow strings, e.g. 
'lemma' or 'LEMMA' try: - py_attr_ids = [(IDS[id_.upper()] if hasattr(id_, "upper") else id_) - for id_ in py_attr_ids] + py_attr_ids = [ + (IDS[id_.upper()] if hasattr(id_, "upper") else id_) + for id_ in py_attr_ids + ] except KeyError as msg: keys = [k for k in IDS.keys() if not k.startswith("FLAG")] raise KeyError(Errors.E983.format(dict="IDS", key=msg, keys=keys)) from None @@ -1030,8 +1030,6 @@ cdef class Doc: DOCS: https://spacy.io/api/doc#count_by """ cdef int i - cdef attr_t attr - cdef size_t count if counts is None: counts = Counter() @@ -1093,7 +1091,6 @@ cdef class Doc: cdef int i, col cdef int32_t abs_head_index cdef attr_id_t attr_id - cdef TokenC* tokens = self.c cdef int length = len(array) if length != len(self): raise ValueError(Errors.E971.format(array_length=length, doc_length=len(self))) @@ -1225,7 +1222,7 @@ cdef class Doc: span.label, span.kb_id, span.id, - span.text, # included as a check + span.text, # included as a check )) char_offset += len(doc.text) if len(doc) > 0 and ensure_whitespace and not doc[-1].is_space and not bool(doc[-1].whitespace_): @@ -1508,7 +1505,6 @@ cdef class Doc: attributes are inherited from the syntactic root of the span. RETURNS (Token): The first newly merged token. 
""" - cdef str tag, lemma, ent_type attr_len = len(attributes) span_len = len(spans) if not attr_len == span_len: @@ -1624,7 +1620,6 @@ cdef class Doc: for token in char_span[1:]: token.is_sent_start = False - for span_group in doc_json.get("spans", {}): spans = [] for span in doc_json["spans"][span_group]: @@ -1656,7 +1651,7 @@ cdef class Doc: start = token_by_char(self.c, self.length, token_data["start"]) value = token_data["value"] self[start]._.set(token_attr, value) - + for span_attr in doc_json.get("underscore_span", {}): if not Span.has_extension(span_attr): Span.set_extension(span_attr) @@ -1698,7 +1693,7 @@ cdef class Doc: token_data["dep"] = token.dep_ token_data["head"] = token.head.i data["tokens"].append(token_data) - + if self.spans: data["spans"] = {} for span_group in self.spans: @@ -1769,7 +1764,6 @@ cdef class Doc: output.fill(255) cdef int i, j, start_idx, end_idx cdef bytes byte_string - cdef unsigned char utf8_char for i, byte_string in enumerate(byte_strings): j = 0 start_idx = 0 @@ -1822,8 +1816,6 @@ cdef int token_by_char(const TokenC* tokens, int length, int char_idx) except -2 cdef int set_children_from_heads(TokenC* tokens, int start, int end) except -1: # note: end is exclusive - cdef TokenC* head - cdef TokenC* child cdef int i # Set number of left/right children to 0. We'll increment it in the loops. for i in range(start, end): @@ -1923,7 +1915,7 @@ cdef int _get_tokens_lca(Token token_j, Token token_k): return -1 -cdef int [:,:] _get_lca_matrix(Doc doc, int start, int end): +cdef int [:, :] _get_lca_matrix(Doc doc, int start, int end): """Given a doc and a start and end position defining a set of contiguous tokens within it, returns a matrix of Lowest Common Ancestors (LCA), where LCA[i, j] is the index of the lowest common ancestor among token i and j. 
@@ -1936,7 +1928,7 @@ cdef int [:,:] _get_lca_matrix(Doc doc, int start, int end): RETURNS (int [:, :]): memoryview of numpy.array[ndim=2, dtype=numpy.int32], with shape (n, n), where n = len(doc). """ - cdef int [:,:] lca_matrix + cdef int [:, :] lca_matrix cdef int j, k n_tokens= end - start lca_mat = numpy.empty((n_tokens, n_tokens), dtype=numpy.int32) diff --git a/spacy/tokens/graph.pyx b/spacy/tokens/graph.pyx index 47f0a20d4..1cbec09f4 100644 --- a/spacy/tokens/graph.pyx +++ b/spacy/tokens/graph.pyx @@ -3,7 +3,7 @@ from typing import Generator, List, Tuple cimport cython from cython.operator cimport dereference -from libc.stdint cimport int32_t, int64_t +from libc.stdint cimport int32_t from libcpp.pair cimport pair from libcpp.unordered_map cimport unordered_map from libcpp.unordered_set cimport unordered_set @@ -11,7 +11,6 @@ from libcpp.unordered_set cimport unordered_set import weakref from murmurhash.mrmr cimport hash64 -from preshed.maps cimport map_get_unless_missing from .. import Errors @@ -28,7 +27,7 @@ from .token import Token cdef class Edge: cdef readonly Graph graph cdef readonly int i - + def __init__(self, Graph graph, int i): self.graph = graph self.i = i @@ -44,7 +43,7 @@ cdef class Edge: @property def head(self) -> "Node": return Node(self.graph, self.graph.c.edges[self.i].head) - + @property def tail(self) -> "Tail": return Node(self.graph, self.graph.c.edges[self.i].tail) @@ -70,7 +69,7 @@ cdef class Node: def __init__(self, Graph graph, int i): """A reference to a node of an annotation graph. Each node is made up of an ordered set of zero or more token indices. - + Node references are usually created by the Graph object itself, or from the Node or Edge objects. You usually won't need to instantiate this class yourself. @@ -109,13 +108,13 @@ cdef class Node: @property def is_none(self) -> bool: """Whether the node is a special value, indicating 'none'. 
- + The NoneNode type is returned by the Graph, Edge and Node objects when there is no match to a query. It has the same API as Node, but it always returns NoneNode, NoneEdge or empty lists for its queries. """ return False - + @property def doc(self) -> "Doc": """The Doc object that the graph refers to.""" @@ -130,19 +129,19 @@ cdef class Node: def head(self, i=None, label=None) -> "Node": """Get the head of the first matching edge, searching by index, label, both or neither. - + For instance, `node.head(i=1)` will get the head of the second edge that this node is a tail of. `node.head(i=1, label="ARG0")` will further check that the second edge has the label `"ARG0"`. - + If no matching node can be found, the graph's NoneNode is returned. """ return self.headed(i=i, label=label) - + def tail(self, i=None, label=None) -> "Node": """Get the tail of the first matching edge, searching by index, label, both or neither. - + If no matching node can be found, the graph's NoneNode is returned. """ return self.tailed(i=i, label=label).tail @@ -171,7 +170,7 @@ cdef class Node: cdef vector[int] edge_indices self._find_edges(edge_indices, "head", label) return [Node(self.graph, self.graph.c.edges[i].head) for i in edge_indices] - + def tails(self, label=None) -> List["Node"]: """Find all matching tails of this node.""" cdef vector[int] edge_indices @@ -200,7 +199,7 @@ cdef class Node: return NoneEdge(self.graph) else: return Edge(self.graph, idx) - + def tailed(self, i=None, label=None) -> Edge: """Find the first matching edge tailed by this node. If no matching edge can be found, the graph's NoneEdge is returned. 
@@ -283,7 +282,7 @@ cdef class NoneEdge(Edge): def __init__(self, graph): self.graph = graph self.i = -1 - + @property def doc(self) -> "Doc": return self.graph.doc @@ -291,7 +290,7 @@ cdef class NoneEdge(Edge): @property def head(self) -> "NoneNode": return NoneNode(self.graph) - + @property def tail(self) -> "NoneNode": return NoneNode(self.graph) @@ -319,7 +318,7 @@ cdef class NoneNode(Node): def __len__(self): return 0 - + @property def is_none(self): return -1 @@ -340,14 +339,14 @@ cdef class NoneNode(Node): def walk_heads(self): yield from [] - + def walk_tails(self): yield from [] - + cdef class Graph: """A set of directed labelled relationships between sets of tokens. - + EXAMPLE: Construction 1 >>> graph = Graph(doc, name="srl") @@ -372,7 +371,9 @@ cdef class Graph: >>> assert graph.has_node((0,)) >>> assert graph.has_edge((0,), (1,3), label="agent") """ - def __init__(self, doc, *, name="", nodes=[], edges=[], labels=None, weights=None): + def __init__( + self, doc, *, name="", nodes=[], edges=[], labels=None, weights=None # no-cython-lint + ): """Create a Graph object. doc (Doc): The Doc object the graph will refer to. @@ -438,13 +439,11 @@ cdef class Graph: def add_edge(self, head, tail, *, label="", weight=None) -> Edge: """Add an edge to the graph, connecting two groups of tokens. - + If there is already an edge for the (head, tail, label) triple, it will be returned, and no new edge will be created. The weight of the edge will be updated if a weight is specified. """ - label_hash = self.doc.vocab.strings.as_int(label) - weight_float = weight if weight is not None else 0.0 edge_index = add_edge( &self.c, EdgeC( @@ -478,11 +477,11 @@ cdef class Graph: def has_edge(self, head, tail, label) -> bool: """Check whether a (head, tail, label) triple is an edge in the graph.""" return not self.get_edge(head, tail, label=label).is_none - + def add_node(self, indices) -> Node: """Add a node to the graph and return it. 
Nodes refer to ordered sets of token indices. - + This method is idempotent: if there is already a node for the given indices, it is returned without a new node being created. """ @@ -510,7 +509,7 @@ cdef class Graph: return NoneNode(self) else: return Node(self, node_index) - + def has_node(self, tuple indices) -> bool: """Check whether the graph has a node for the given indices.""" return not self.get_node(indices).is_none @@ -570,7 +569,7 @@ cdef int add_node(GraphC* graph, vector[int32_t]& node) nogil: graph.roots.insert(index) graph.node_map.insert(pair[hash_t, int](key, index)) return index - + cdef int get_node(const GraphC* graph, vector[int32_t] node) nogil: key = hash64(&node[0], node.size() * sizeof(node[0]), 0) diff --git a/spacy/tokens/morphanalysis.pyx b/spacy/tokens/morphanalysis.pyx index 0992a0b66..ba7c638f6 100644 --- a/spacy/tokens/morphanalysis.pyx +++ b/spacy/tokens/morphanalysis.pyx @@ -89,4 +89,3 @@ cdef class MorphAnalysis: def __repr__(self): return self.to_json() - diff --git a/spacy/tokens/span.pyx b/spacy/tokens/span.pyx index 59ee21687..cf90e416b 100644 --- a/spacy/tokens/span.pyx +++ b/spacy/tokens/span.pyx @@ -1,5 +1,4 @@ cimport numpy as np -from libc.math cimport sqrt import copy import warnings @@ -10,11 +9,10 @@ from thinc.api import get_array_module from ..attrs cimport * from ..attrs cimport ORTH, attr_id_t from ..lexeme cimport Lexeme -from ..parts_of_speech cimport univ_pos_t -from ..structs cimport LexemeC, TokenC +from ..structs cimport TokenC from ..symbols cimport dep -from ..typedefs cimport attr_t, flags_t, hash_t -from .doc cimport _get_lca_matrix, get_token_attr, token_by_end, token_by_start +from ..typedefs cimport attr_t, hash_t +from .doc cimport _get_lca_matrix, get_token_attr from .token cimport Token from ..errors import Errors, Warnings @@ -595,7 +593,6 @@ cdef class Span: """ return "".join([t.text_with_ws for t in self]) - @property def noun_chunks(self): """Iterate over the base noun phrases in the span. 
Yields base diff --git a/spacy/tokens/span_group.pyx b/spacy/tokens/span_group.pyx index 48ad4a516..d245a1425 100644 --- a/spacy/tokens/span_group.pyx +++ b/spacy/tokens/span_group.pyx @@ -1,7 +1,7 @@ import struct import weakref from copy import deepcopy -from typing import TYPE_CHECKING, Iterable, Optional, Tuple, Union +from typing import Iterable, Optional, Union import srsly @@ -34,7 +34,7 @@ cdef class SpanGroup: DOCS: https://spacy.io/api/spangroup """ - def __init__(self, doc, *, name="", attrs={}, spans=[]): + def __init__(self, doc, *, name="", attrs={}, spans=[]): # no-cython-lint """Create a SpanGroup. doc (Doc): The reference Doc object. @@ -311,7 +311,7 @@ cdef class SpanGroup: other_attrs = deepcopy(other_group.attrs) span_group.attrs.update({ - key: value for key, value in other_attrs.items() \ + key: value for key, value in other_attrs.items() if key not in span_group.attrs }) if len(other_group): diff --git a/spacy/tokens/token.pxd b/spacy/tokens/token.pxd index fc02ff624..f4e4611df 100644 --- a/spacy/tokens/token.pxd +++ b/spacy/tokens/token.pxd @@ -26,7 +26,7 @@ cdef class Token: cdef Token self = Token.__new__(Token, vocab, doc, offset) return self - #cdef inline TokenC struct_from_attrs(Vocab vocab, attrs): + # cdef inline TokenC struct_from_attrs(Vocab vocab, attrs): # cdef TokenC token # attrs = normalize_attrs(attrs) @@ -98,12 +98,10 @@ cdef class Token: elif feat_name == SENT_START: token.sent_start = value - @staticmethod cdef inline int missing_dep(const TokenC* token) nogil: return token.dep == MISSING_DEP - @staticmethod cdef inline int missing_head(const TokenC* token) nogil: return Token.missing_dep(token) diff --git a/spacy/tokens/token.pyx b/spacy/tokens/token.pyx index 6018c3112..de967ba25 100644 --- a/spacy/tokens/token.pyx +++ b/spacy/tokens/token.pyx @@ -1,13 +1,11 @@ # cython: infer_types=True # Compiler crashes on memory view coercion without this. Should report bug. 
cimport numpy as np -from cython.view cimport array as cvarray np.import_array() import warnings -import numpy from thinc.api import get_array_module from ..attrs cimport ( @@ -238,7 +236,7 @@ cdef class Token: result = xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm) # ensure we get a scalar back (numpy does this automatically but cupy doesn't) return result.item() - + def has_morph(self): """Check whether the token has annotated morph information. Return False when the morph annotation is unset/missing. @@ -545,9 +543,9 @@ cdef class Token: def __get__(self): if self.i + 1 == len(self.doc): return True - elif self.doc[self.i+1].is_sent_start == None: + elif self.doc[self.i+1].is_sent_start is None: return None - elif self.doc[self.i+1].is_sent_start == True: + elif self.doc[self.i+1].is_sent_start is True: return True else: return False diff --git a/spacy/training/align.pyx b/spacy/training/align.pyx index 8bd43b048..79fec73c4 100644 --- a/spacy/training/align.pyx +++ b/spacy/training/align.pyx @@ -37,10 +37,14 @@ def get_alignments(A: List[str], B: List[str]) -> Tuple[List[List[int]], List[Li b2a.append(set()) # Process the alignment at the current position if A[token_idx_a] == B[token_idx_b] and \ - (char_idx_a == 0 or \ - char_to_token_a[char_idx_a - 1] < token_idx_a) and \ - (char_idx_b == 0 or \ - char_to_token_b[char_idx_b - 1] < token_idx_b): + ( + char_idx_a == 0 or + char_to_token_a[char_idx_a - 1] < token_idx_a + ) and \ + ( + char_idx_b == 0 or + char_to_token_b[char_idx_b - 1] < token_idx_b + ): # Current tokens are identical and both character offsets are the # start of a token (either at the beginning of the document or the # previous character belongs to a different token) diff --git a/spacy/training/example.pyx b/spacy/training/example.pyx index abdac23ea..3f0cf5ade 100644 --- a/spacy/training/example.pyx +++ b/spacy/training/example.pyx @@ -1,4 +1,3 @@ -import warnings from collections.abc import Iterable as IterableInstance 
import numpy @@ -31,9 +30,9 @@ cpdef Doc annotations_to_doc(vocab, tok_annot, doc_annot): attrs, array = _annot2array(vocab, tok_annot, doc_annot) output = Doc(vocab, words=tok_annot["ORTH"], spaces=tok_annot["SPACY"]) if "entities" in doc_annot: - _add_entities_to_doc(output, doc_annot["entities"]) + _add_entities_to_doc(output, doc_annot["entities"]) if "spans" in doc_annot: - _add_spans_to_doc(output, doc_annot["spans"]) + _add_spans_to_doc(output, doc_annot["spans"]) if array.size: output = output.from_array(attrs, array) # links are currently added with ENT_KB_ID on the token level @@ -161,7 +160,6 @@ cdef class Example: self._y_sig = y_sig return self._cached_alignment - def _get_aligned_vectorized(self, align, gold_values): # Fast path for Doc attributes/fields that are predominantly a single value, # i.e., TAG, POS, MORPH. @@ -204,7 +202,6 @@ cdef class Example: return output.tolist() - def _get_aligned_non_vectorized(self, align, gold_values): # Slower path for fields that return multiple values (resulting # in ragged arrays that cannot be vectorized trivially). @@ -221,7 +218,6 @@ cdef class Example: return output - def get_aligned(self, field, as_string=False): """Return an aligned array for a token attribute.""" align = self.alignment.x2y @@ -330,7 +326,7 @@ cdef class Example: missing=None ) # Now fill the tokens we can align to O. 
- O = 2 # I=1, O=2, B=3 + O = 2 # I=1, O=2, B=3 # no-cython-lint: E741 for i, ent_iob in enumerate(self.get_aligned("ENT_IOB")): if x_tags[i] is None: if ent_iob == O: @@ -340,7 +336,7 @@ cdef class Example: return x_ents, x_tags def get_aligned_ner(self): - x_ents, x_tags = self.get_aligned_ents_and_ner() + _x_ents, x_tags = self.get_aligned_ents_and_ner() return x_tags def get_matching_ents(self, check_label=True): @@ -398,7 +394,6 @@ cdef class Example: return span_dict - def _links_to_dict(self): links = {} for ent in self.reference.ents: @@ -589,6 +584,7 @@ def _fix_legacy_dict_data(example_dict): "doc_annotation": doc_dict } + def _has_field(annot, field): if field not in annot: return False @@ -625,6 +621,7 @@ def _parse_ner_tags(biluo_or_offsets, vocab, words, spaces): ent_types.append("") return ent_iobs, ent_types + def _parse_links(vocab, words, spaces, links): reference = Doc(vocab, words=words, spaces=spaces) starts = {token.idx: token.i for token in reference} diff --git a/spacy/training/gold_io.pyx b/spacy/training/gold_io.pyx index 1e7b3681d..2fc36e41f 100644 --- a/spacy/training/gold_io.pyx +++ b/spacy/training/gold_io.pyx @@ -1,4 +1,3 @@ -import json import warnings import srsly @@ -6,7 +5,7 @@ import srsly from .. 
import util from ..errors import Warnings from ..tokens import Doc -from .iob_utils import offsets_to_biluo_tags, tags_to_entities +from .iob_utils import offsets_to_biluo_tags def docs_to_json(docs, doc_id=0, ner_missing_tag="O"): @@ -23,7 +22,13 @@ def docs_to_json(docs, doc_id=0, ner_missing_tag="O"): json_doc = {"id": doc_id, "paragraphs": []} for i, doc in enumerate(docs): raw = None if doc.has_unknown_spaces else doc.text - json_para = {'raw': raw, "sentences": [], "cats": [], "entities": [], "links": []} + json_para = { + 'raw': raw, + "sentences": [], + "cats": [], + "entities": [], + "links": [] + } for cat, val in doc.cats.items(): json_cat = {"label": cat, "value": val} json_para["cats"].append(json_cat) @@ -35,13 +40,17 @@ def docs_to_json(docs, doc_id=0, ner_missing_tag="O"): if ent.kb_id_: link_dict = {(ent.start_char, ent.end_char): {ent.kb_id_: 1.0}} json_para["links"].append(link_dict) - biluo_tags = offsets_to_biluo_tags(doc, json_para["entities"], missing=ner_missing_tag) + biluo_tags = offsets_to_biluo_tags( + doc, json_para["entities"], missing=ner_missing_tag + ) attrs = ("TAG", "POS", "MORPH", "LEMMA", "DEP", "ENT_IOB") include_annotation = {attr: doc.has_annotation(attr) for attr in attrs} for j, sent in enumerate(doc.sents): json_sent = {"tokens": [], "brackets": []} for token in sent: - json_token = {"id": token.i, "orth": token.text, "space": token.whitespace_} + json_token = { + "id": token.i, "orth": token.text, "space": token.whitespace_ + } if include_annotation["TAG"]: json_token["tag"] = token.tag_ if include_annotation["POS"]: @@ -125,9 +134,14 @@ def json_to_annotations(doc): else: sent_starts.append(-1) if "brackets" in sent: - brackets.extend((b["first"] + sent_start_i, - b["last"] + sent_start_i, b["label"]) - for b in sent["brackets"]) + brackets.extend( + ( + b["first"] + sent_start_i, + b["last"] + sent_start_i, + b["label"] + ) + for b in sent["brackets"] + ) example["token_annotation"] = dict( ids=ids, @@ -160,6 +174,7 @@ 
def json_to_annotations(doc): ) yield example + def json_iterate(bytes utf8_str): # We should've made these files jsonl...But since we didn't, parse out # the docs one-by-one to reduce memory usage. diff --git a/spacy/vectors.pyx b/spacy/vectors.pyx index bf79481b8..a88f380f9 100644 --- a/spacy/vectors.pyx +++ b/spacy/vectors.pyx @@ -1,10 +1,8 @@ -cimport numpy as np from cython.operator cimport dereference as deref from libc.stdint cimport uint32_t, uint64_t from libcpp.set cimport set as cppset from murmurhash.mrmr cimport hash128_x64 -import functools import warnings from enum import Enum from typing import cast @@ -119,7 +117,7 @@ cdef class Vectors: if self.mode == Mode.default: if data is None: if shape is None: - shape = (0,0) + shape = (0, 0) ops = get_current_ops() data = ops.xp.zeros(shape, dtype="f") self._unset = cppset[int]({i for i in range(data.shape[0])}) @@ -260,11 +258,10 @@ cdef class Vectors: def __eq__(self, other): # Check for equality, with faster checks first return ( - self.shape == other.shape - and self.key2row == other.key2row - and self.to_bytes(exclude=["strings"]) - == other.to_bytes(exclude=["strings"]) - ) + self.shape == other.shape + and self.key2row == other.key2row + and self.to_bytes(exclude=["strings"]) == other.to_bytes(exclude=["strings"]) + ) def resize(self, shape, inplace=False): """Resize the underlying vectors array. If inplace=True, the memory @@ -520,11 +517,12 @@ cdef class Vectors: # vectors e.g. (10000, 300) # sims e.g. 
(1024, 10000) sims = xp.dot(batch, vectors.T) - best_rows[i:i+batch_size] = xp.argpartition(sims, -n, axis=1)[:,-n:] - scores[i:i+batch_size] = xp.partition(sims, -n, axis=1)[:,-n:] + best_rows[i:i+batch_size] = xp.argpartition(sims, -n, axis=1)[:, -n:] + scores[i:i+batch_size] = xp.partition(sims, -n, axis=1)[:, -n:] if sort and n >= 2: - sorted_index = xp.arange(scores.shape[0])[:,None][i:i+batch_size],xp.argsort(scores[i:i+batch_size], axis=1)[:,::-1] + sorted_index = xp.arange(scores.shape[0])[:, None][i:i+batch_size], \ + xp.argsort(scores[i:i+batch_size], axis=1)[:, ::-1] scores[i:i+batch_size] = scores[sorted_index] best_rows[i:i+batch_size] = best_rows[sorted_index] @@ -538,8 +536,12 @@ cdef class Vectors: numpy_rows = get_current_ops().to_numpy(best_rows) keys = xp.asarray( - [[row2key[row] for row in numpy_rows[i] if row in row2key] - for i in range(len(queries)) ], dtype="uint64") + [ + [row2key[row] for row in numpy_rows[i] if row in row2key] + for i in range(len(queries)) + ], + dtype="uint64" + ) return (keys, best_rows, scores) def to_ops(self, ops: Ops): @@ -582,9 +584,9 @@ cdef class Vectors: """ xp = get_array_module(self.data) if xp is numpy: - save_array = lambda arr, file_: xp.save(file_, arr, allow_pickle=False) + save_array = lambda arr, file_: xp.save(file_, arr, allow_pickle=False) # no-cython-lint else: - save_array = lambda arr, file_: xp.save(file_, arr) + save_array = lambda arr, file_: xp.save(file_, arr) # no-cython-lint def save_vectors(path): # the source of numpy.save indicates that the file object is closed after use. 
diff --git a/spacy/vocab.pxd b/spacy/vocab.pxd index 3b0173e3e..43e47af1d 100644 --- a/spacy/vocab.pxd +++ b/spacy/vocab.pxd @@ -32,7 +32,7 @@ cdef class Vocab: cdef public object writing_system cdef public object get_noun_chunks cdef readonly int length - cdef public object _unused_object # TODO remove in v4, see #9150 + cdef public object _unused_object # TODO remove in v4, see #9150 cdef public object lex_attr_getters cdef public object cfg diff --git a/spacy/vocab.pyx b/spacy/vocab.pyx index 520228b51..d1edc8533 100644 --- a/spacy/vocab.pyx +++ b/spacy/vocab.pyx @@ -1,6 +1,4 @@ # cython: profile=True -from libc.string cimport memcpy - import functools import numpy @@ -19,7 +17,6 @@ from .errors import Errors from .lang.lex_attrs import LEX_ATTRS, get_lang, is_stop from .lang.norm_exceptions import BASE_NORMS from .lookups import Lookups -from .util import registry from .vectors import Mode as VectorsMode from .vectors import Vectors @@ -51,9 +48,17 @@ cdef class Vocab: DOCS: https://spacy.io/api/vocab """ - def __init__(self, lex_attr_getters=None, strings=tuple(), lookups=None, - oov_prob=-20., vectors_name=None, writing_system={}, - get_noun_chunks=None, **deprecated_kwargs): + def __init__( + self, + lex_attr_getters=None, + strings=tuple(), + lookups=None, + oov_prob=-20., + vectors_name=None, + writing_system={}, # no-cython-lint + get_noun_chunks=None, + **deprecated_kwargs + ): """Create the vocabulary. lex_attr_getters (dict): A dictionary mapping attribute IDs to @@ -150,7 +155,6 @@ cdef class Vocab: cdef LexemeC* lex cdef hash_t key = self.strings[string] lex = self._by_orth.get(key) - cdef size_t addr if lex != NULL: assert lex.orth in self.strings if lex.orth != key: @@ -183,7 +187,7 @@ cdef class Vocab: # of the doc ownership). # TODO: Change the C API so that the mem isn't passed in here. 
mem = self.mem - #if len(string) < 3 or self.length < 10000: + # if len(string) < 3 or self.length < 10000: # mem = self.mem cdef bint is_oov = mem is not self.mem lex = mem.alloc(1, sizeof(LexemeC)) @@ -463,7 +467,6 @@ cdef class Vocab: self.lookups.get_table("lexeme_norm"), ) - def to_disk(self, path, *, exclude=tuple()): """Save the current state to a directory. @@ -476,7 +479,6 @@ cdef class Vocab: path = util.ensure_path(path) if not path.exists(): path.mkdir() - setters = ["strings", "vectors"] if "strings" not in exclude: self.strings.to_disk(path / "strings.json") if "vectors" not in exclude: @@ -495,7 +497,6 @@ cdef class Vocab: DOCS: https://spacy.io/api/vocab#to_disk """ path = util.ensure_path(path) - getters = ["strings", "vectors"] if "strings" not in exclude: self.strings.from_disk(path / "strings.json") # TODO: add exclude? if "vectors" not in exclude: diff --git a/website/docs/api/large-language-models.mdx b/website/docs/api/large-language-models.mdx new file mode 100644 index 000000000..cc8328790 --- /dev/null +++ b/website/docs/api/large-language-models.mdx @@ -0,0 +1,1488 @@ +--- +title: Large Language Models +teaser: Integrating LLMs into structured NLP pipelines +menu: + - ['Config', 'config'] + - ['Tasks', 'tasks'] + - ['Models', 'models'] + - ['Cache', 'cache'] + - ['Various Functions', 'various-functions'] +--- + +[The spacy-llm package](https://github.com/explosion/spacy-llm) integrates Large +Language Models (LLMs) into spaCy, featuring a modular system for **fast +prototyping** and **prompting**, and turning unstructured responses into +**robust outputs** for various NLP tasks, **no training data** required. 
+ +## Config {id="config"} + +`spacy-llm` exposes a `llm` factory that accepts the following configuration +options: + +| Argument | Description | +| ---------------- | ------------------------------------------------------------------------------------------------------- | +| `task` | An LLMTask can generate prompts and parse LLM responses. See [docs](#tasks). ~~Optional[LLMTask]~~ | +| `model` | Callable querying a specific LLM API. See [docs](#models). ~~Callable[[Iterable[Any]], Iterable[Any]]~~ | +| `cache` | Cache to use for caching prompts and responses per doc (batch). See [docs](#cache). ~~Cache~~ | +| `save_io` | Whether to save prompts/responses within `Doc.user_data["llm_io"]`. ~~bool~~ | +| `validate_types` | Whether to check if signatures of configured model and task are consistent. ~~bool~~ | + +An `llm` component is defined by two main settings: + +- A [**task**](#tasks), defining the prompt to send to the LLM as well as the + functionality to parse the resulting response back into structured fields on + the [Doc](/api/doc) objects. +- A [**model**](#models) defining the model and how to connect to it. Note that + `spacy-llm` supports both access to external APIs (such as OpenAI) as well as + access to self-hosted open-source LLMs (such as using Dolly through Hugging + Face). + +Moreover, `spacy-llm` exposes a customizable [**caching**](#cache) functionality +to avoid running the same document through an LLM service (be it local or +through a REST API) more than once. + +Finally, you can choose to save a stringified version of LLM prompts/responses +within the `Doc.user_data["llm_io"]` attribute by setting `save_io` to `True`. +`Doc.user_data["llm_io"]` is a dictionary containing one entry for every LLM +component within the `nlp` pipeline. Each entry is itself a dictionary, with two +keys: `prompt` and `response`. 
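As a rough illustration of that structure, the sketch below walks a dictionary shaped like `Doc.user_data["llm_io"]`. It is only a sketch under stated assumptions: the component name `llm`, the prompt/response strings and the helper `describe_llm_io` are made-up placeholders, and a plain dict stands in for the real `Doc.user_data` so the snippet runs without spaCy or an LLM.

```python
from typing import Dict, List

# Hypothetical stand-in for doc.user_data["llm_io"]: one entry per LLM
# component in the pipeline, each holding a "prompt" and a "response" key.
llm_io: Dict[str, Dict[str, str]] = {
    "llm": {
        "prompt": "Summarize the following text: ...",
        "response": "A short summary.",
    }
}


def describe_llm_io(io_by_component: Dict[str, Dict[str, str]]) -> List[str]:
    """Render one readable line per LLM component (illustrative helper)."""
    lines = []
    for component_name, io in io_by_component.items():
        lines.append(
            f"{component_name}: prompt={io['prompt']!r} response={io['response']!r}"
        )
    return lines


for line in describe_llm_io(llm_io):
    print(line)
```

With a real pipeline the same loop would be applied to `doc.user_data["llm_io"]` after setting `save_io` to `True`.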
+ +A note on `validate_types`: by default, `spacy-llm` checks whether the +signatures of the `model` and `task` callables are consistent with each other +and emits a warning if they aren't. `validate_types` can be set to `False` if you +want to disable this behavior. + +### Tasks {id="tasks"} + +A _task_ defines an NLP problem or question that will be sent to the LLM via a +prompt. Further, the task defines how to parse the LLM's responses back into +structured information. All tasks are registered in the `llm_tasks` registry. + +#### task.generate_prompts {id="task-generate-prompts"} + +Takes a collection of documents, and returns a collection of "prompts", which +can be of type `Any`. Often, prompts are of type `str` - but this is not +enforced to allow for maximum flexibility in the framework. + +| Argument | Description | +| ----------- | ---------------------------------------- | +| `docs` | The input documents. ~~Iterable[Doc]~~ | +| **RETURNS** | The generated prompts. ~~Iterable[Any]~~ | + +#### task.parse_responses {id="task-parse-responses"} + +Takes a collection of LLM responses and the original documents, parses the +responses into structured information, and sets the annotations on the +documents. The `parse_responses` function is free to set the annotations in any +way, including `Doc` fields like `ents`, `spans` or `cats`, or using custom +defined fields. + +The `responses` are of type `Iterable[Any]`, though they will often be `str` +objects. This depends on the return type of the [model](#models). + +| Argument | Description | +| ----------- | ------------------------------------------ | +| `docs` | The input documents. ~~Iterable[Doc]~~ | +| `responses` | The responses generated by the LLM. ~~Iterable[Any]~~ | +| **RETURNS** | The annotated documents. ~~Iterable[Doc]~~ | + +#### spacy.Summarization.v1 {id="summarization-v1"} + +The `spacy.Summarization.v1` task supports both zero-shot and few-shot +prompting.
+ +> #### Example config +> +> ```ini +> [components.llm.task] +> @llm_tasks = "spacy.Summarization.v1" +> examples = null +> max_n_words = null +> ``` + +| Argument | Description | +| ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `template` | Custom prompt template to send to LLM model. Default templates for each task are located in the `spacy_llm/tasks/templates` directory. Defaults to [summarization.jinja](./spacy_llm/tasks/templates/summarization.jinja). ~~str~~ | +| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ | +| `max_n_words` | Maximum number of words to be used in summary. Note that this should not be expected to work exactly. Defaults to `None`. ~~Optional[int]~~ | +| `field` | Name of extension attribute to store summary in (i.e. the summary will be available in `doc._.{field}`). Defaults to `summary`. ~~str~~ | + +The summarization task prompts the model for a concise summary of the provided +text. It optionally allows limiting the response to a certain number of words - +note that this requirement will be included in the prompt, but the task doesn't +perform a hard cut-off. It's hence possible that your summary exceeds +`max_n_words`. + +To perform [few-shot learning](/usage/large-language-models#few-shot-prompts), +you can write down a few examples in a separate file, and provide these to be +injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1` +supports `.yml`, `.yaml`, `.json` and `.jsonl`.
+ +```yaml +- text: > + The United Nations, referred to informally as the UN, is an + intergovernmental organization whose stated purposes are to maintain + international peace and security, develop friendly relations among nations, + achieve international cooperation, and serve as a centre for harmonizing the + actions of nations. It is the world's largest international organization. + The UN is headquartered on international territory in New York City, and the + organization has other offices in Geneva, Nairobi, Vienna, and The Hague, + where the International Court of Justice is headquartered.\n\n The UN was + established after World War II with the aim of preventing future world wars, + and succeeded the League of Nations, which was characterized as + ineffective. + summary: + 'The UN is an international organization that promotes global peace, + cooperation, and harmony. Established after WWII, its purpose is to prevent + future world wars.' +``` + +```ini +[components.llm.task] +@llm_tasks = "spacy.Summarization.v1" +max_n_words = 20 +[components.llm.task.examples] +@misc = "spacy.FewShotReader.v1" +path = "summarization_examples.yml" +``` + +#### spacy.NER.v2 {id="ner-v2"} + +The built-in NER task supports both zero-shot and few-shot prompting. This +version also supports explicitly defining the provided labels with custom +descriptions. + +> #### Example config +> +> ```ini +> [components.llm.task] +> @llm_tasks = "spacy.NER.v2" +> labels = ["PERSON", "ORGANISATION", "LOCATION"] +> examples = null +> ``` + +| Argument | Description | +| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `labels` | List of labels or str of comma-separated list of labels. 
~~Union[List[str], str]~~ | +| `template` | Custom prompt template to send to LLM model. Default templates for each task are located in the `spacy_llm/tasks/templates` directory. Defaults to [ner.v2.jinja](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/ner.v2.jinja). ~~str~~ | +| `label_definitions` | Optional dict mapping a label to a description of that label. These descriptions are added to the prompt to help instruct the LLM on what to extract. Defaults to `None`. ~~Optional[Dict[str, str]]~~ | +| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ | +| `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, defaults to `spacy.LowercaseNormalizer.v1`. Defaults to `None`. ~~Optional[Callable[[str], str]]~~ | +| `alignment_mode` | Alignment mode in case the LLM returns entities that do not align with token boundaries. Options are `"strict"`, `"contract"` or `"expand"`. Defaults to `"contract"`. ~~str~~ | +| `case_sensitive_matching` | Whether to search without case sensitivity. Defaults to `False`. ~~bool~~ | +| `single_match` | Whether to match an entity in the LLM's response only once (the first hit) or multiple times. Defaults to `False`. ~~bool~~ | + +The NER task implementation doesn't currently ask the LLM for specific offsets, +but simply expects a list of strings that represent the entities in the document. +This means that a form of string matching is required. This can be configured by +the following parameters: + +- The `single_match` parameter is typically set to `False` to allow for multiple + matches. For instance, the response from the LLM might only mention the entity + "Paris" once, but you'd still want to mark it every time it occurs in the + document. +- The `case_sensitive_matching` parameter is typically set to `False` to be robust against + case variances in the LLM's output.
+- The `alignment_mode` argument is used to match entities as returned by the LLM + to the tokens from the original `Doc` - specifically it's used as an argument in + the call to [`doc.char_span()`](/api/doc#char_span). The `"strict"` mode will + only keep spans that strictly adhere to the given token boundaries. + `"contract"` will only keep those tokens that are fully within the given + range, e.g. reducing `"New Y"` to `"New"`. Finally, `"expand"` will expand the + span to the next token boundaries, e.g. expanding `"New Y"` out to + `"New York"`. + +To perform [few-shot learning](/usage/large-language-models#few-shot-prompts), +you can write down a few examples in a separate file, and provide these to be +injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1` +supports `.yml`, `.yaml`, `.json` and `.jsonl`. + +```yaml +- text: Jack and Jill went up the hill. + entities: + PERSON: + - Jack + - Jill + LOCATION: + - hill +- text: Jack fell down and broke his crown. + entities: + PERSON: + - Jack +``` + +```ini +[components.llm.task] +@llm_tasks = "spacy.NER.v2" +labels = PERSON,ORGANISATION,LOCATION +[components.llm.task.examples] +@misc = "spacy.FewShotReader.v1" +path = "ner_examples.yml" +``` + +> Label descriptions can also be used with explicit examples to give as much +> info to the LLM model as possible. + +You can also write definitions for each label and provide them via the +`label_definitions` argument. This lets you tell the LLM exactly what you're +looking for rather than relying on the LLM to interpret its task given just the +label name. Label descriptions are freeform so you can write whatever you want +here, but through some experiments a brief description along with some examples +and counter examples seems to work quite well. + +```ini +[components.llm.task] +@llm_tasks = "spacy.NER.v2" +labels = PERSON,SPORTS_TEAM +[components.llm.task.label_definitions] +PERSON = "Extract any named individual in the text."
+SPORTS_TEAM = "Extract the names of any professional sports team. e.g. Golden State Warriors, LA Lakers, Man City, Real Madrid" +``` + +#### spacy.NER.v1 {id="ner-v1"} + +The original version of the built-in NER task supports both zero-shot and +few-shot prompting. + +> #### Example config +> +> ```ini +> [components.llm.task] +> @llm_tasks = "spacy.NER.v1" +> labels = PERSON,ORGANISATION,LOCATION +> examples = null +> ``` + +| Argument | Description | +| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `labels` | Comma-separated list of labels. ~~str~~ | +| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ | +| `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, defaults to `spacy.LowercaseNormalizer.v1`. ~~Optional[Callable[[str], str]]~~ | +| `alignment_mode` | Alignment mode in case the LLM returns entities that do not align with token boundaries. Options are `"strict"`, `"contract"` or `"expand"`. Defaults to `"contract"`. ~~str~~ | +| `case_sensitive_matching` | Whether to search without case sensitivity. Defaults to `False`. ~~bool~~ | +| `single_match` | Whether to match an entity in the LLM's response only once (the first hit) or multiple times. Defaults to `False`. ~~bool~~ | + +The NER task implementation doesn't currently ask the LLM for specific offsets, +but simply expects a list of strings that represent the entities in the document. +This means that a form of string matching is required. This can be configured by +the following parameters: + +- The `single_match` parameter is typically set to `False` to allow for multiple + matches.
For instance, the response from the LLM might only mention the entity + "Paris" once, but you'd still want to mark it every time it occurs in the + document. +- The `case_sensitive_matching` parameter is typically set to `False` to be robust against + case variances in the LLM's output. +- The `alignment_mode` argument is used to match entities as returned by the LLM + to the tokens from the original `Doc` - specifically it's used as an argument in + the call to [`doc.char_span()`](/api/doc#char_span). The `"strict"` mode will + only keep spans that strictly adhere to the given token boundaries. + `"contract"` will only keep those tokens that are fully within the given + range, e.g. reducing `"New Y"` to `"New"`. Finally, `"expand"` will expand the + span to the next token boundaries, e.g. expanding `"New Y"` out to + `"New York"`. + +To perform [few-shot learning](/usage/large-language-models#few-shot-prompts), +you can write down a few examples in a separate file, and provide these to be +injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1` +supports `.yml`, `.yaml`, `.json` and `.jsonl`. + +```yaml +- text: Jack and Jill went up the hill. + entities: + PERSON: + - Jack + - Jill + LOCATION: + - hill +- text: Jack fell down and broke his crown. + entities: + PERSON: + - Jack +``` + +```ini +[components.llm.task] +@llm_tasks = "spacy.NER.v1" +labels = PERSON,ORGANISATION,LOCATION +[components.llm.task.examples] +@misc = "spacy.FewShotReader.v1" +path = "ner_examples.yml" +``` + +#### spacy.SpanCat.v2 {id="spancat-v2"} + +The built-in SpanCat task is a simple adaptation of the NER task to support +overlapping entities and store its annotations in `doc.spans`.
+ +> #### Example config +> +> ```ini +> [components.llm.task] +> @llm_tasks = "spacy.SpanCat.v2" +> labels = ["PERSON", "ORGANISATION", "LOCATION"] +> examples = null +> ``` + +| Argument | Description | +| ------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `labels` | List of labels or str of comma-separated list of labels. ~~Union[List[str], str]~~ | +| `template` | Custom prompt template to send to LLM model. Default templates for each task are located in the `spacy_llm/tasks/templates` directory. Defaults to [`spancat.v2.jinja`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/spancat.v2.jinja). ~~str~~ | +| `label_definitions` | Optional dict mapping a label to a description of that label. These descriptions are added to the prompt to help instruct the LLM on what to extract. Defaults to `None`. ~~Optional[Dict[str, str]]~~ | +| `spans_key` | Key of the `Doc.spans` dict to save the spans under. Defaults to `"sc"`. ~~str~~ | +| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ | +| `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, defaults to `spacy.LowercaseNormalizer.v1`. ~~Optional[Callable[[str], str]]~~ | +| `alignment_mode` | Alignment mode in case the LLM returns entities that do not align with token boundaries. Options are `"strict"`, `"contract"` or `"expand"`. Defaults to `"contract"`. ~~str~~ | +| `case_sensitive_matching` | Whether to search without case sensitivity. Defaults to `False`. ~~bool~~ | +| `single_match` | Whether to match an entity in the LLM's response only once (the first hit) or multiple times. Defaults to `False`. 
~~bool~~ | + +Except for the `spans_key` parameter, the SpanCat task reuses the configuration +from the NER task. Refer to [its documentation](#ner-v2) for more insight. + +#### spacy.SpanCat.v1 {id="spancat-v1"} + +The original version of the built-in SpanCat task is a simple adaptation of the +v1 NER task to support overlapping entities and store its annotations in +`doc.spans`. + +> #### Example config +> +> ```ini +> [components.llm.task] +> @llm_tasks = "spacy.SpanCat.v1" +> labels = PERSON,ORGANISATION,LOCATION +> examples = null +> ``` + +| Argument | Description | +| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `labels` | Comma-separated list of labels. ~~str~~ | +| `spans_key` | Key of the `Doc.spans` dict to save the spans under. Defaults to `"sc"`. ~~str~~ | +| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ | +| `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, defaults to `spacy.LowercaseNormalizer.v1`. ~~Optional[Callable[[str], str]]~~ | +| `alignment_mode` | Alignment mode in case the LLM returns entities that do not align with token boundaries. Options are `"strict"`, `"contract"` or `"expand"`. Defaults to `"contract"`. ~~str~~ | +| `case_sensitive_matching` | Whether to search without case sensitivity. Defaults to `False`. ~~bool~~ | +| `single_match` | Whether to match an entity in the LLM's response only once (the first hit) or multiple times. Defaults to `False`. ~~bool~~ | + +Except for the `spans_key` parameter, the SpanCat task reuses the configuration +from the NER task. Refer to [its documentation](#ner-v1) for more insight. 
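The three `alignment_mode` options shared by the NER and SpanCat tasks can be illustrated without spaCy. The sketch below is not the library's implementation: it assumes plain whitespace tokenization, and the helper names `token_offsets` and `align_char_span` are hypothetical. It only mimics how `"strict"`, `"contract"` and `"expand"` treat a character range that ends in the middle of a token.

```python
import re
from typing import List, Optional, Tuple


def token_offsets(text: str) -> List[Tuple[int, int]]:
    """Character offsets of whitespace-separated tokens (a simplification)."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def align_char_span(text: str, start: int, end: int, mode: str = "contract") -> Optional[str]:
    """Snap the character range [start, end) to token boundaries."""
    tokens = token_offsets(text)
    if mode == "strict":
        # Keep the span only if both ends already sit on token boundaries.
        starts = {s for s, _ in tokens}
        ends = {e for _, e in tokens}
        return text[start:end] if start in starts and end in ends else None
    if mode == "contract":
        # Keep only tokens that lie completely inside the range.
        covered = [(s, e) for s, e in tokens if s >= start and e <= end]
    elif mode == "expand":
        # Keep every token that overlaps the range at least partially.
        covered = [(s, e) for s, e in tokens if s < end and e > start]
    else:
        raise ValueError(f"Unknown alignment mode: {mode}")
    if not covered:
        return None
    return text[covered[0][0]:covered[-1][1]]


text = "I like New York"
start = text.index("New Y")
end = start + len("New Y")
print(align_char_span(text, start, end, "strict"))    # None
print(align_char_span(text, start, end, "contract"))  # New
print(align_char_span(text, start, end, "expand"))    # New York
```

In a real pipeline the tokens come from spaCy's tokenizer rather than from whitespace splitting, so results can differ around punctuation.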
+ +#### spacy.TextCat.v3 {id="textcat-v3"} + +Version 3 (the most recent) of the built-in TextCat task supports both zero-shot +and few-shot prompting. It allows setting definitions of labels. Those +definitions are included in the prompt. + +> #### Example config +> +> ```ini +> [components.llm.task] +> @llm_tasks = "spacy.TextCat.v3" +> labels = ["COMPLIMENT", "INSULT"] +> label_definitions = { +> "COMPLIMENT": "a polite expression of praise or admiration.", +> "INSULT": "a disrespectful or scornfully abusive remark or act." +> } +> examples = null +> ``` + +| Argument | Description | +| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `labels` | List of labels or str of comma-separated list of labels. ~~Union[List[str], str]~~ | +| `label_definitions` | Dictionary of label definitions. Included in the prompt, if set. Defaults to `None`. ~~Optional[Dict[str, str]]~~ | +| `template` | Custom prompt template to send to LLM model. Default templates for each task are located in the `spacy_llm/tasks/templates` directory. Defaults to [`textcat.jinja`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/textcat.jinja). ~~str~~ | +| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ | +| `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, falls back to `spacy.LowercaseNormalizer.v1`. Defaults to `None`. ~~Optional[Callable[[str], str]]~~ | +| `exclusive_classes` | If set to `True`, only one label per document should be valid. If set to `False`, one document can have multiple labels. Defaults to `False`. 
~~bool~~ | +| `allow_none` | When set to `True`, allows the LLM to not return any of the given labels. The resulting dict in `doc.cats` will have `0.0` scores for all labels. Defaults to `True`. ~~bool~~ | +| `verbose` | If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. ~~bool~~ | + +To perform [few-shot learning](/usage/large-language-models#few-shot-prompts), +you can write down a few examples in a separate file, and provide these to be +injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1` +supports `.yml`, `.yaml`, `.json` and `.jsonl`. + +```json +[ + { + "text": "You look great!", + "answer": "Compliment" + }, + { + "text": "You are not very clever at all.", + "answer": "Insult" + } +] +``` + +```ini +[components.llm.task] +@llm_tasks = "spacy.TextCat.v3" +labels = ["COMPLIMENT", "INSULT"] +label_definitions = { + "COMPLIMENT": "a polite expression of praise or admiration.", + "INSULT": "a disrespectful or scornfully abusive remark or act." +} +[components.llm.task.examples] +@misc = "spacy.FewShotReader.v1" +path = "textcat_examples.json" +``` + +#### spacy.TextCat.v2 {id="textcat-v2"} + +Version 2 of the built-in TextCat task supports both zero-shot and few-shot +prompting and includes an improved prompt template. + +> #### Example config +> +> ```ini +> [components.llm.task] +> @llm_tasks = "spacy.TextCat.v2" +> labels = ["COMPLIMENT", "INSULT"] +> examples = null +> ``` + +| Argument | Description | +| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `labels` | List of labels or str of comma-separated list of labels. ~~Union[List[str], str]~~ | +| `template` | Custom prompt template to send to LLM model.
Default templates for each task are located in the `spacy_llm/tasks/templates` directory. Defaults to [`textcat.jinja`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/textcat.jinja). ~~str~~ | +| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ | +| `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, falls back to `spacy.LowercaseNormalizer.v1`. ~~Optional[Callable[[str], str]]~~ | +| `exclusive_classes` | If set to `True`, only one label per document should be valid. If set to `False`, one document can have multiple labels. Defaults to `False`. ~~bool~~ | +| `allow_none` | When set to `True`, allows the LLM to not return any of the given label. The resulting dict in `doc.cats` will have `0.0` scores for all labels. Defaults to `True`. ~~bool~~ | +| `verbose` | If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. ~~bool~~ | + +To perform [few-shot learning](/usage/large-langauge-models#few-shot-prompts), +you can write down a few examples in a separate file, and provide these to be +injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1` +supports `.yml`, `.yaml`, `.json` and `.jsonl`. + +```json +[ + { + "text": "You look great!", + "answer": "Compliment" + }, + { + "text": "You are not very clever at all.", + "answer": "Insult" + } +] +``` + +```ini +[components.llm.task] +@llm_tasks = "spacy.TextCat.v2" +labels = ["COMPLIMENT", "INSULT"] +[components.llm.task.examples] +@misc = "spacy.FewShotReader.v1" +path = "textcat_examples.json" +``` + +#### spacy.TextCat.v1 {id="textcat-v1"} + +Version 1 of the built-in TextCat task supports both zero-shot and few-shot +prompting. 
+ +> #### Example config +> +> ```ini +> [components.llm.task] +> @llm_tasks = "spacy.TextCat.v1" +> labels = COMPLIMENT,INSULT +> examples = null +> ``` + +| Argument | Description | +| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `labels` | Comma-separated list of labels. ~~str~~ | +| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ | +| `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, falls back to `spacy.LowercaseNormalizer.v1`. ~~Optional[Callable[[str], str]]~~ | +| `exclusive_classes` | If set to `True`, only one label per document should be valid. If set to `False`, one document can have multiple labels. Defaults to `False`. ~~bool~~ | +| `allow_none` | When set to `True`, allows the LLM to not return any of the given labels. The resulting dict in `doc.cats` will have `0.0` scores for all labels. Defaults to `True`. ~~bool~~ | +| `verbose` | If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. ~~bool~~ | + +To perform [few-shot learning](/usage/large-langauge-models#few-shot-prompts), +you can write down a few examples in a separate file, and provide these to be +injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1` +supports `.yml`, `.yaml`, `.json` and `.jsonl`. 
+ +```json +[ + { + "text": "You look great!", + "answer": "Compliment" + }, + { + "text": "You are not very clever at all.", + "answer": "Insult" + } +] +``` + +```ini +[components.llm.task] +@llm_tasks = "spacy.TextCat.v1" +labels = COMPLIMENT,INSULT +[components.llm.task.examples] +@misc = "spacy.FewShotReader.v1" +path = "textcat_examples.json" +``` + +#### spacy.REL.v1 {id="rel-v1"} + +The built-in REL task supports both zero-shot and few-shot prompting. It relies +on an upstream NER component for entity extraction. + +> #### Example config +> +> ```ini +> [components.llm.task] +> @llm_tasks = "spacy.REL.v1" +> labels = ["LivesIn", "Visits"] +> ``` + +| Argument | Description | +| ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `labels` | List of labels or str of comma-separated list of labels. ~~Union[List[str], str]~~ | +| `template` | Custom prompt template to send to LLM model. Default templates for each task are located in the `spacy_llm/tasks/templates` directory. Defaults to [`rel.jinja`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/rel.jinja). ~~str~~ | +| `label_description` | Dictionary providing a description for each relation label. Defaults to `None`. ~~Optional[Dict[str, str]]~~ | +| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ | +| `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, falls back to `spacy.LowercaseNormalizer.v1`. Defaults to `None`. ~~Optional[Callable[[str], str]]~~ | +| `verbose` | If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. 
~~bool~~ | + +To perform [few-shot learning](/usage/large-langauge-models#few-shot-prompts), +you can write down a few examples in a separate file, and provide these to be +injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1` +supports `.yml`, `.yaml`, `.json` and `.jsonl`. + +```json +{"text": "Laura bought a house in Boston with her husband Mark.", "ents": [{"start_char": 0, "end_char": 5, "label": "PERSON"}, {"start_char": 24, "end_char": 30, "label": "GPE"}, {"start_char": 48, "end_char": 52, "label": "PERSON"}], "relations": [{"dep": 0, "dest": 1, "relation": "LivesIn"}, {"dep": 2, "dest": 1, "relation": "LivesIn"}]} +{"text": "Michael travelled through South America by bike.", "ents": [{"start_char": 0, "end_char": 7, "label": "PERSON"}, {"start_char": 26, "end_char": 39, "label": "LOC"}], "relations": [{"dep": 0, "dest": 1, "relation": "Visits"}]} +``` + +```ini +[components.llm.task] +@llm_tasks = "spacy.REL.v1" +labels = ["LivesIn", "Visits"] +[components.llm.task.examples] +@misc = "spacy.FewShotReader.v1" +path = "rel_examples.jsonl" +``` + +Note: the REL task relies on pre-extracted entities to make its prediction. +Hence, you'll need to add a component that populates `doc.ents` with recognized +spans to your spaCy pipeline and put it _before_ the REL component. + +#### spacy.Lemma.v1 {id="lemma-v1"} + +The `Lemma.v1` task lemmatizes the provided text and updates the `lemma_` +attribute in the doc's tokens accordingly. + +> #### Example config +> +> ```ini +> [components.llm.task] +> @llm_tasks = "spacy.Lemma.v1" +> examples = null +> ``` + +| Argument | Description | +| ---------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `template` | Custom prompt template to send to LLM model. 
Default templates for each task are located in the `spacy_llm/tasks/templates` directory. Defaults to [lemma.jinja](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/lemma.jinja). ~~str~~ | +| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ | + +`Lemma.v1` prompts the LLM to lemmatize the passed text and return the +lemmatized version as a list of tokens and their corresponding lemmas. E. g. the +text `I'm buying ice cream for my friends.` should yield the response + +``` +I: I +'m: be +buying: buy +ice: ice +cream: cream +for: for +my: my +friends: friend +.: . +``` + +If for any given text/doc instance the number of lemmas returned by the LLM +doesn't match the number of tokens from the pipeline's tokenizer, no lemmas are +stored in the corresponding doc's tokens. Otherwise, each token's `.lemma_` +attribute is updated with the lemma suggested by the LLM. + +To perform [few-shot learning](/usage/large-langauge-models#few-shot-prompts), +you can write down a few examples in a separate file, and provide these to be +injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1` +supports `.yml`, `.yaml`, `.json` and `.jsonl`. + +```yaml +- text: I'm buying ice cream. + lemmas: + - 'I': 'I' + - "'m": 'be' + - 'buying': 'buy' + - 'ice': 'ice' + - 'cream': 'cream' + - '.': '.' + +- text: I've watered the plants. + lemmas: + - 'I': 'I' + - "'ve": 'have' + - 'watered': 'water' + - 'the': 'the' + - 'plants': 'plant' + - '.': '.' +``` + +```ini +[components.llm.task] +@llm_tasks = "spacy.Lemma.v1" +[components.llm.task.examples] +@misc = "spacy.FewShotReader.v1" +path = "lemma_examples.yml" +``` + +#### spacy.Sentiment.v1 {id="sentiment-v1"} + +Performs sentiment analysis on provided texts. Scores between 0 and 1 are stored +in `Doc._.sentiment` - the higher, the more positive. Note that in cases of parsing +issues (e. g. 
in case of unexpected LLM responses) the value might be `None`. + +> #### Example config +> +> ```ini +> [components.llm.task] +> @llm_tasks = "spacy.Sentiment.v1" +> examples = null +> ``` + +| Argument | Description | +| ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `template` | Custom prompt template to send to LLM model. Default templates for each task are located in the `spacy_llm/tasks/templates` directory. Defaults to [sentiment.jinja](./spacy_llm/tasks/templates/sentiment.jinja). ~~str~~ | +| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ | +| `field` | Name of the extension attribute to store the sentiment score in (i. e. the score will be available in `doc._.{field}`). Defaults to `sentiment`. ~~str~~ | + +To perform [few-shot learning](/usage/large-langauge-models#few-shot-prompts), +you can write down a few examples in a separate file, and provide these to be +injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1` +supports `.yml`, `.yaml`, `.json` and `.jsonl`. + +```yaml +- text: 'This is horrifying.' + score: 0 +- text: 'This is underwhelming.' + score: 0.25 +- text: 'This is ok.' + score: 0.5 +- text: "I'm looking forward to this!" + score: 1.0 +``` + +```ini +[components.llm.task] +@llm_tasks = "spacy.Sentiment.v1" +[components.llm.task.examples] +@misc = "spacy.FewShotReader.v1" +path = "sentiment_examples.yml" +``` + +#### spacy.NoOp.v1 {id="noop-v1"} + +> #### Example config +> +> ```ini +> [components.llm.task] +> @llm_tasks = "spacy.NoOp.v1" +> ``` + +This task is only useful for testing - it tells the LLM to do nothing, and does +not set any fields on the `docs`. 
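
As a rough illustration of the `spacy.Lemma.v1` rule described earlier in this section (lemmas are only stored when the LLM returns exactly one lemma per token), the sketch below parses a `token: lemma` response and rejects it on a count mismatch. The helper names `parse_lemma_response` and `align_lemmas` are hypothetical and not part of `spacy-llm`; the library's actual parsing logic may differ.

```python
from typing import List, Optional, Tuple


def parse_lemma_response(response: str) -> List[Tuple[str, str]]:
    """Parse 'token: lemma' lines into (token, lemma) pairs (hypothetical helper)."""
    pairs = []
    for line in response.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last ": " so that tokens containing a colon stay intact.
        token, sep, lemma = line.rpartition(": ")
        if not sep:
            continue  # skip lines that don't follow the expected format
        pairs.append((token, lemma))
    return pairs


def align_lemmas(tokens: List[str], response: str) -> Optional[List[str]]:
    """Return one lemma per token, or None if the counts don't match."""
    pairs = parse_lemma_response(response)
    if len(pairs) != len(tokens):
        return None  # mirrors the documented rule: no lemmas are stored
    return [lemma for _, lemma in pairs]
```

Calling `align_lemmas` with the tokenizer's tokens and the raw LLM response either returns a list that can be written to the tokens one by one, or `None`, in which case the doc would be left unchanged.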
+ +### Models {id="models"} + +A _model_ defines which LLM to query, and how to query it. It can be a +simple function taking a collection of prompts (consistent with the output type +of `task.generate_prompts()`) and returning a collection of responses +(consistent with the expected input of `parse_responses`). Generally speaking, +it's a function of type `Callable[[Iterable[Any]], Iterable[Any]]`, but specific +implementations can have other signatures, like +`Callable[[Iterable[str]], Iterable[str]]`. + +#### API Keys {id="api-keys"} + +Note that when using hosted services, you have to ensure that the proper API +keys are set as environment variables as described by the corresponding +provider's documentation. + +E. g. when using OpenAI, you have to get an API key from openai.com, and ensure +that the keys are set as environment variables: + +```shell +export OPENAI_API_KEY="sk-..." +export OPENAI_API_ORG="org-..." +``` + +For Cohere it's + +```shell +export CO_API_KEY="..." +``` + +and for Anthropic + +```shell +export ANTHROPIC_API_KEY="..." +``` + +#### spacy.GPT-4.v1 {id="gpt-4"} + +OpenAI's `gpt-4` model family. + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.GPT-4.v1" +> name = "gpt-4" +> config = {"temperature": 0.3} +> ``` + +| Argument | Description | +| ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Model name, i. e. any supported variant for this particular model. Defaults to `"gpt-4"`. ~~Literal["gpt-4", "gpt-4-0314", "gpt-4-32k", "gpt-4-32k-0314"]~~ | +| `config` | Further configuration passed on to the model. Defaults to `{}`. ~~Dict[Any, Any]~~ | +| `strict` | If `True`, raises an error if the LLM API returns a malformed response. Otherwise, return the error responses as is. Defaults to `True`. ~~bool~~ | +| `max_tries` | Max. number of tries for API request. 
Defaults to `3`. ~~int~~ | +| `timeout` | Timeout for API request in seconds. Defaults to `30`. ~~int~~ | + +#### spacy.GPT-3-5.v1 {id="gpt-3-5"} + +OpenAI's `gpt-3-5` model family. + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.GPT-3-5.v1" +> name = "gpt-3.5-turbo" +> config = {"temperature": 0.3} +> ``` + +| Argument | Description | +| ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Model name, i. e. any supported variant for this particular model. Defaults to `"gpt-3.5-turbo"`. ~~Literal["gpt-3.5-turbo", "gpt-3.5-turbo-16k", "gpt-3.5-turbo-0613", "gpt-3.5-turbo-0613-16k"]~~ | +| `config` | Further configuration passed on to the model. Defaults to `{}`. ~~Dict[Any, Any]~~ | +| `strict` | If `True`, raises an error if the LLM API returns a malformed response. Otherwise, return the error responses as is. Defaults to `True`. ~~bool~~ | +| `max_tries` | Max. number of tries for API request. Defaults to `3`. ~~int~~ | +| `timeout` | Timeout for API request in seconds. Defaults to `30`. ~~int~~ | + +#### spacy.Text-Davinci.v1 {id="text-davinci"} + +OpenAI's `text-davinci` model family. + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.Text-Davinci.v1" +> name = "text-davinci-003" +> config = {"temperature": 0.3} +> ``` + +| Argument | Description | +| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Model name, i. e. any supported variant for this particular model. Defaults to `"text-davinci-003"`. ~~Literal["text-davinci-002", "text-davinci-003"]~~ | +| `config` | Further configuration passed on to the model. Defaults to `{}`. 
~~Dict[Any, Any]~~ | +| `strict` | If `True`, raises an error if the LLM API returns a malformed response. Otherwise, return the error responses as is. Defaults to `True`. ~~bool~~ | +| `max_tries` | Max. number of tries for API request. Defaults to `3`. ~~int~~ | +| `timeout` | Timeout for API request in seconds. Defaults to `30`. ~~int~~ | + +#### spacy.Code-Davinci.v1 {id="code-davinci"} + +OpenAI's `code-davinci` model family. + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.Code-Davinci.v1" +> name = "code-davinci-002" +> config = {"temperature": 0.3} +> ``` + +| Argument | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Model name, i. e. any supported variant for this particular model. Defaults to `"code-davinci-002"`. ~~Literal["code-davinci-002"]~~ | +| `config` | Further configuration passed on to the model. Defaults to `{}`. ~~Dict[Any, Any]~~ | +| `strict` | If `True`, raises an error if the LLM API returns a malformed response. Otherwise, return the error responses as is. Defaults to `True`. ~~bool~~ | +| `max_tries` | Max. number of tries for API request. Defaults to `3`. ~~int~~ | +| `timeout` | Timeout for API request in seconds. Defaults to `30`. ~~int~~ | + +#### spacy.Text-Curie.v1 {id="text-curie"} + +OpenAI's `text-curie` model family. + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.Text-Curie.v1" +> name = "text-curie-001" +> config = {"temperature": 0.3} +> ``` + +| Argument | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Model name, i. e. any supported variant for this particular model. Defaults to `"text-curie-001"`. 
~~Literal["text-curie-001"]~~ | +| `config` | Further configuration passed on to the model. Defaults to `{}`. ~~Dict[Any, Any]~~ | +| `strict` | If `True`, raises an error if the LLM API returns a malformed response. Otherwise, return the error responses as is. Defaults to `True`. ~~bool~~ | +| `max_tries` | Max. number of tries for API request. Defaults to `3`. ~~int~~ | +| `timeout` | Timeout for API request in seconds. Defaults to `30`. ~~int~~ | + +#### spacy.Text-Babbage.v1 {id="text-babbage"} + +OpenAI's `text-babbage` model family. + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.Text-Babbage.v1" +> name = "text-babbage-001" +> config = {"temperature": 0.3} +> ``` + +| Argument | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Model name, i. e. any supported variant for this particular model. Defaults to `"text-babbage-001"`. ~~Literal["text-babbage-001"]~~ | +| `config` | Further configuration passed on to the model. Defaults to `{}`. ~~Dict[Any, Any]~~ | +| `strict` | If `True`, raises an error if the LLM API returns a malformed response. Otherwise, return the error responses as is. Defaults to `True`. ~~bool~~ | +| `max_tries` | Max. number of tries for API request. Defaults to `3`. ~~int~~ | +| `timeout` | Timeout for API request in seconds. Defaults to `30`. ~~int~~ | + +#### spacy.Text-Ada.v1 {id="text-ada"} + +OpenAI's `text-ada` model family. + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.Text-Ada.v1" +> name = "text-ada-001" +> config = {"temperature": 0.3} +> ``` + +| Argument | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Model name, i. e. 
any supported variant for this particular model. Defaults to `"text-ada-001"`. ~~Literal["text-ada-001"]~~ | +| `config` | Further configuration passed on to the model. Defaults to `{}`. ~~Dict[Any, Any]~~ | +| `strict` | If `True`, raises an error if the LLM API returns a malformed response. Otherwise, return the error responses as is. Defaults to `True`. ~~bool~~ | +| `max_tries` | Max. number of tries for API request. Defaults to `3`. ~~int~~ | +| `timeout` | Timeout for API request in seconds. Defaults to `30`. ~~int~~ | + +#### spacy.Davinci.v1 {id="davinci"} + +OpenAI's `davinci` model family. + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.Davinci.v1" +> name = "davinci" +> config = {"temperature": 0.3} +> ``` + +| Argument | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Model name, i. e. any supported variant for this particular model. Defaults to `"davinci"`. ~~Literal["davinci"]~~ | +| `config` | Further configuration passed on to the model. Defaults to `{}`. ~~Dict[Any, Any]~~ | +| `strict` | If `True`, raises an error if the LLM API returns a malformed response. Otherwise, return the error responses as is. Defaults to `True`. ~~bool~~ | +| `max_tries` | Max. number of tries for API request. Defaults to `3`. ~~int~~ | +| `timeout` | Timeout for API request in seconds. Defaults to `30`. ~~int~~ | + +#### spacy.Curie.v1 {id="curie"} + +OpenAI's `curie` model family. + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.Curie.v1" +> name = "curie" +> config = {"temperature": 0.3} +> ``` + +| Argument | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Model name, i. e. 
any supported variant for this particular model. Defaults to `"curie"`. ~~Literal["curie"]~~ | +| `config` | Further configuration passed on to the model. Defaults to `{}`. ~~Dict[Any, Any]~~ | +| `strict` | If `True`, raises an error if the LLM API returns a malformed response. Otherwise, return the error responses as is. Defaults to `True`. ~~bool~~ | +| `max_tries` | Max. number of tries for API request. Defaults to `3`. ~~int~~ | +| `timeout` | Timeout for API request in seconds. Defaults to `30`. ~~int~~ | + +#### spacy.Babbage.v1 {id="babbage"} + +OpenAI's `babbage` model family. + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.Babbage.v1" +> name = "babbage" +> config = {"temperature": 0.3} +> ``` + +| Argument | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Model name, i. e. any supported variant for this particular model. Defaults to `"babbage"`. ~~Literal["babbage"]~~ | +| `config` | Further configuration passed on to the model. Defaults to `{}`. ~~Dict[Any, Any]~~ | +| `strict` | If `True`, raises an error if the LLM API returns a malformed response. Otherwise, return the error responses as is. Defaults to `True`. ~~bool~~ | +| `max_tries` | Max. number of tries for API request. Defaults to `3`. ~~int~~ | +| `timeout` | Timeout for API request in seconds. Defaults to `30`. ~~int~~ | + +#### spacy.Ada.v1 {id="ada"} + +OpenAI's `ada` model family. + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.Ada.v1" +> name = "ada" +> config = {"temperature": 0.3} +> ``` + +| Argument | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Model name, i. e. any supported variant for this particular model. 
Defaults to `"ada"`. ~~Literal["ada"]~~ | +| `config` | Further configuration passed on to the model. Defaults to `{}`. ~~Dict[Any, Any]~~ | +| `strict` | If `True`, raises an error if the LLM API returns a malformed response. Otherwise, return the error responses as is. Defaults to `True`. ~~bool~~ | +| `max_tries` | Max. number of tries for API request. Defaults to `3`. ~~int~~ | +| `timeout` | Timeout for API request in seconds. Defaults to `30`. ~~int~~ | + +#### spacy.Command.v1 {id="command"} + +Cohere's `command` model family. + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.Command.v1" +> name = "command" +> config = {"temperature": 0.3} +> ``` + +| Argument | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Model name, i. e. any supported variant for this particular model. Defaults to `"command"`. ~~Literal["command", "command-light", "command-light-nightly", "command-nightly"]~~ | +| `config` | Further configuration passed on to the model. Defaults to `{}`. ~~Dict[Any, Any]~~ | +| `strict` | If `True`, raises an error if the LLM API returns a malformed response. Otherwise, return the error responses as is. Defaults to `True`. ~~bool~~ | +| `max_tries` | Max. number of tries for API request. Defaults to `3`. ~~int~~ | +| `timeout` | Timeout for API request in seconds. Defaults to `30`. ~~int~~ | + +#### spacy.Claude-2.v1 {id="claude-2"} + +Anthropic's `claude-2` model family. 
+ +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.Claude-2.v1" +> name = "claude-2" +> config = {"temperature": 0.3} +> ``` + +| Argument | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Model name, i. e. any supported variant for this particular model. Defaults to `"claude-2"`. ~~Literal["claude-2", "claude-2-100k"]~~ | +| `config` | Further configuration passed on to the model. Defaults to `{}`. ~~Dict[Any, Any]~~ | +| `strict` | If `True`, raises an error if the LLM API returns a malformed response. Otherwise, return the error responses as is. Defaults to `True`. ~~bool~~ | +| `max_tries` | Max. number of tries for API request. Defaults to `3`. ~~int~~ | +| `timeout` | Timeout for API request in seconds. Defaults to `30`. ~~int~~ | + +#### spacy.Claude-1.v1 {id="claude-1"} + +Anthropic's `claude-1` model family. + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.Claude-1.v1" +> name = "claude-1" +> config = {"temperature": 0.3} +> ``` + +| Argument | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Model name, i. e. any supported variant for this particular model. Defaults to `"claude-1"`. ~~Literal["claude-1", "claude-1-100k"]~~ | +| `config` | Further configuration passed on to the model. Defaults to `{}`. ~~Dict[Any, Any]~~ | +| `strict` | If `True`, raises an error if the LLM API returns a malformed response. Otherwise, return the error responses as is. Defaults to `True`. ~~bool~~ | +| `max_tries` | Max. number of tries for API request. Defaults to `3`. ~~int~~ | +| `timeout` | Timeout for API request in seconds. Defaults to `30`. 
~~int~~ | + +#### spacy.Claude-instant-1.v1 {id="claude-instant-1"} + +Anthropic's `claude-instant-1` model family. + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.Claude-instant-1.v1" +> name = "claude-instant-1" +> config = {"temperature": 0.3} +> ``` + +| Argument | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Model name, i. e. any supported variant for this particular model. Defaults to `"claude-instant-1"`. ~~Literal["claude-instant-1", "claude-instant-1-100k"]~~ | +| `config` | Further configuration passed on to the model. Defaults to `{}`. ~~Dict[Any, Any]~~ | +| `strict` | If `True`, raises an error if the LLM API returns a malformed response. Otherwise, return the error responses as is. Defaults to `True`. ~~bool~~ | +| `max_tries` | Max. number of tries for API request. Defaults to `3`. ~~int~~ | +| `timeout` | Timeout for API request in seconds. Defaults to `30`. ~~int~~ | + +#### spacy.Claude-instant-1-1.v1 {id="claude-instant-1-1"} + +Anthropic's `claude-instant-1.1` model family. + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.Claude-instant-1-1.v1" +> name = "claude-instant-1.1" +> config = {"temperature": 0.3} +> ``` + +| Argument | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Model name, i. e. any supported variant for this particular model. Defaults to `"claude-instant-1.1"`. ~~Literal["claude-instant-1.1", "claude-instant-1.1-100k"]~~ | +| `config` | Further configuration passed on to the model. Defaults to `{}`. ~~Dict[Any, Any]~~ | +| `strict` | If `True`, raises an error if the LLM API returns a malformed response. 
Otherwise, return the error responses as is. Defaults to `True`. ~~bool~~ | +| `max_tries` | Max. number of tries for API request. Defaults to `3`. ~~int~~ | +| `timeout` | Timeout for API request in seconds. Defaults to `30`. ~~int~~ | + +#### spacy.Claude-1-0.v1 {id="claude-1-0"} + +Anthropic's `claude-1.0` model family. + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.Claude-1-0.v1" +> name = "claude-1.0" +> config = {"temperature": 0.3} +> ``` + +| Argument | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Model name, i. e. any supported variant for this particular model. Defaults to `"claude-1.0"`. ~~Literal["claude-1.0"]~~ | +| `config` | Further configuration passed on to the model. Defaults to `{}`. ~~Dict[Any, Any]~~ | +| `strict` | If `True`, raises an error if the LLM API returns a malformed response. Otherwise, return the error responses as is. Defaults to `True`. ~~bool~~ | +| `max_tries` | Max. number of tries for API request. Defaults to `3`. ~~int~~ | +| `timeout` | Timeout for API request in seconds. Defaults to `30`. ~~int~~ | + +#### spacy.Claude-1-2.v1 {id="claude-1-2"} + +Anthropic's `claude-1.2` model family. + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.Claude-1-2.v1" +> name = "claude-1.2" +> config = {"temperature": 0.3} +> ``` + +| Argument | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Model name, i. e. any supported variant for this particular model. Defaults to `"claude-1.2"`. ~~Literal["claude-1.2"]~~ | +| `config` | Further configuration passed on to the model. Defaults to `{}`. 
~~Dict[Any, Any]~~ | +| `strict` | If `True`, raises an error if the LLM API returns a malformed response. Otherwise, return the error responses as is. Defaults to `True`. ~~bool~~ | +| `max_tries` | Max. number of tries for API request. Defaults to `3`. ~~int~~ | +| `timeout` | Timeout for API request in seconds. Defaults to `30`. ~~int~~ | + +#### spacy.Claude-1-3.v1 {id="claude-1-3"} + +Anthropic's `claude-1.3` model family. + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.Claude-1-3.v1" +> name = "claude-1.3" +> config = {"temperature": 0.3} +> ``` + +| Argument | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Model name, i. e. any supported variant for this particular model. Defaults to `"claude-1.3"`. ~~Literal["claude-1.3", "claude-1.3-100k"]~~ | +| `config` | Further configuration passed on to the model. Defaults to `{}`. ~~Dict[Any, Any]~~ | +| `strict` | If `True`, raises an error if the LLM API returns a malformed response. Otherwise, return the error responses as is. Defaults to `True`. ~~bool~~ | +| `max_tries` | Max. number of tries for API request. Defaults to `3`. ~~int~~ | +| `timeout` | Timeout for API request in seconds. Defaults to `30`. ~~int~~ | + +#### spacy.Dolly.v1 {id="dolly"} + +To use this model, ideally you have a GPU enabled and have installed +`transformers`, `torch` and CUDA in your virtual environment. This allows you to +have the setting `device=cuda:0` in your config, which ensures that the model is +loaded entirely on the GPU (and fails otherwise). 
+ +You can do so with + +```shell +python -m pip install "spacy-llm[transformers]" "transformers[sentencepiece]" +``` + +If you don't have access to a GPU, you can install `accelerate` and +set `device_map=auto` instead, but be aware that this may result in some layers +getting distributed to the CPU or even the hard drive, which may ultimately +result in extremely slow queries. + +```shell +python -m pip install "accelerate>=0.16.0,<1.0" +``` + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.Dolly.v1" +> name = "dolly-v2-3b" +> ``` + +| Argument | Description | +| ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | The name of a Dolly model that is supported (e. g. "dolly-v2-3b" or "dolly-v2-12b"). ~~Literal["dolly-v2-3b", "dolly-v2-7b", "dolly-v2-12b"]~~ | +| `config_init` | Further configuration passed on to the construction of the model with `transformers.pipeline()`. Defaults to `{}`. ~~Dict[str, Any]~~ | +| `config_run` | Further configuration used during model inference. Defaults to `{}`. ~~Dict[str, Any]~~ | + +Supported models (see the +[Databricks models page](https://huggingface.co/databricks) on Hugging Face for +details): + +- `"databricks/dolly-v2-3b"` +- `"databricks/dolly-v2-7b"` +- `"databricks/dolly-v2-12b"` + +Note that Hugging Face will download this model the first time you use it - you +can +[define the cache directory](https://huggingface.co/docs/huggingface_hub/main/en/guides/manage-cache) +by setting the environmental variable `HF_HOME`. + +#### spacy.Llama2.v1 {id="llama2"} + +To use this model, ideally you have a GPU enabled and have installed +`transformers`, `torch` and CUDA in your virtual environment. This allows you to +have the setting `device=cuda:0` in your config, which ensures that the model is +loaded entirely on the GPU (and fails otherwise).
+ +You can do so with + +```shell +python -m pip install "spacy-llm[transformers]" "transformers[sentencepiece]" +``` + +If you don't have access to a GPU, you can install `accelerate` and +set `device_map=auto` instead, but be aware that this may result in some layers +getting distributed to the CPU or even the hard drive, which may ultimately +result in extremely slow queries. + +```shell +python -m pip install "accelerate>=0.16.0,<1.0" +``` + +Note that the chat model variants of Llama 2 are currently not supported. This +is because they need a particular prompting setup and don't add any discernible +benefits in the use case of `spacy-llm` (i. e. no interactive chat) compared to the +completion model variants. + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.Llama2.v1" +> name = "Llama-2-7b-hf" +> ``` + +| Argument | Description | +| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `name` | The name of a Llama 2 model variant that is supported. Defaults to `"Llama-2-7b-hf"`. ~~Literal["Llama-2-7b-hf", "Llama-2-13b-hf", "Llama-2-70b-hf"]~~ | +| `config_init` | Further configuration passed on to the construction of the model with `transformers.pipeline()`. Defaults to `{}`. ~~Dict[str, Any]~~ | +| `config_run` | Further configuration used during model inference. Defaults to `{}`. ~~Dict[str, Any]~~ | + +Note that Hugging Face will download this model the first time you use it - you +can +[define the cache directory](https://huggingface.co/docs/huggingface_hub/main/en/guides/manage-cache) +by setting the environmental variable `HF_HOME`. + +#### spacy.Falcon.v1 {id="falcon"} + +To use this model, ideally you have a GPU enabled and have installed +`transformers`, `torch` and CUDA in your virtual environment.
This allows you to +have the setting `device=cuda:0` in your config, which ensures that the model is +loaded entirely on the GPU (and fails otherwise). + +You can do so with + +```shell +python -m pip install "spacy-llm[transformers]" "transformers[sentencepiece]" +``` + +If you don't have access to a GPU, you can install `accelerate` and +set `device_map=auto` instead, but be aware that this may result in some layers +getting distributed to the CPU or even the hard drive, which may ultimately +result in extremely slow queries. + +```shell +python -m pip install "accelerate>=0.16.0,<1.0" +``` + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.Falcon.v1" +> name = "falcon-7b" +> ``` + +| Argument | Description | +| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `name` | The name of a Falcon model variant that is supported. Defaults to `"falcon-7b-instruct"`. ~~Literal["falcon-rw-1b", "falcon-7b", "falcon-7b-instruct", "falcon-40b-instruct"]~~ | +| `config_init` | Further configuration passed on to the construction of the model with `transformers.pipeline()`. Defaults to `{}`. ~~Dict[str, Any]~~ | +| `config_run` | Further configuration used during model inference. Defaults to `{}`. ~~Dict[str, Any]~~ | + +Note that Hugging Face will download this model the first time you use it - you +can +[define the cache directory](https://huggingface.co/docs/huggingface_hub/main/en/guides/manage-cache) +by setting the environmental variable `HF_HOME`. + +#### spacy.StableLM.v1 {id="stablelm"} + +To use this model, ideally you have a GPU enabled and have installed +`transformers`, `torch` and CUDA in your virtual environment.
+ +You can do so with + +```shell +python -m pip install "spacy-llm[transformers]" "transformers[sentencepiece]" +``` + +If you don't have access to a GPU, you can install `accelerate` and +set `device_map=auto` instead, but be aware that this may result in some layers +getting distributed to the CPU or even the hard drive, which may ultimately +result in extremely slow queries. + +```shell +python -m pip install "accelerate>=0.16.0,<1.0" +``` + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.StableLM.v1" +> name = "stablelm-tuned-alpha-7b" +> ``` + +| Argument | Description | +| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | The name of a StableLM model that is supported (e. g. "stablelm-tuned-alpha-7b"). ~~Literal["stablelm-base-alpha-3b", "stablelm-base-alpha-7b", "stablelm-tuned-alpha-3b", "stablelm-tuned-alpha-7b"]~~ | +| `config_init` | Further configuration passed on to the construction of the model with `transformers.AutoModelForCausalLM.from_pretrained()`. Defaults to `{}`. ~~Dict[str, Any]~~ | +| `config_run` | Further configuration used during model inference. Defaults to `{}`. ~~Dict[str, Any]~~ | + +See the +[Stability AI StableLM GitHub repo](https://github.com/Stability-AI/StableLM/#stablelm-alpha) +for details. + +Note that Hugging Face will download this model the first time you use it - you +can +[define the cache directory](https://huggingface.co/docs/huggingface_hub/main/en/guides/manage-cache) +by setting the environmental variable `HF_HOME`. + +#### spacy.OpenLLaMA.v1 {id="openllama"} + +To use this model, ideally you have a GPU enabled and have installed + +- `transformers[sentencepiece]` +- `torch` +- CUDA in your virtual environment.
+ +You can do so with + +```shell +python -m pip install "spacy-llm[transformers]" "transformers[sentencepiece]" +``` + +If you don't have access to a GPU, you can install `accelerate` and +set `device_map=auto` instead, but be aware that this may result in some layers +getting distributed to the CPU or even the hard drive, which may ultimately +result in extremely slow queries. + +```shell +python -m pip install "accelerate>=0.16.0,<1.0" +``` + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "spacy.OpenLLaMA.v1" +> name = "open_llama_3b" +> ``` + +| Argument | Description | +| ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | The name of an OpenLLaMA model that is supported. ~~Literal["open_llama_3b", "open_llama_7b", "open_llama_7b_v2", "open_llama_13b"]~~ | +| `config_init` | Further configuration passed on to the construction of the model with `transformers.AutoModelForCausalLM.from_pretrained()`. Defaults to `{}`. ~~Dict[str, Any]~~ | +| `config_run` | Further configuration used during model inference. Defaults to `{}`. ~~Dict[str, Any]~~ | + +See the +[OpenLM Research OpenLLaMA GitHub repo](https://github.com/openlm-research/open_llama) +for details. + +Note that Hugging Face will download this model the first time you use it - you +can +[define the cache directory](https://huggingface.co/docs/huggingface_hub/main/en/guides/manage-cache) +by setting the environmental variable `HF_HOME`. + +#### LangChain models {id="langchain-models"} + +To use [LangChain](https://github.com/hwchase17/langchain) for the API retrieval +part, make sure you have installed it first: + +```shell +python -m pip install "langchain==0.0.191" +# Or install with spacy-llm directly +python -m pip install "spacy-llm[extras]" +``` + +Note that LangChain currently only supports Python 3.9 and beyond.
+ +LangChain models in `spacy-llm` work slightly differently. `langchain`'s models +are parsed automatically: each LLM class in `langchain` has one entry in +`spacy-llm`'s registry. As `langchain`'s design has one class per API and not +per model, this results in registry entries like `langchain.OpenAI.v1` - i. e. +there is one registry entry per API and not per model (family), as for the REST- +and HuggingFace-based entries. + +The name of the model to be used has to be passed in via the `name` attribute. + +> #### Example config +> +> ```ini +> [components.llm.model] +> @llm_models = "langchain.OpenAI.v1" +> name = "gpt-3.5-turbo" +> query = {"@llm_queries": "spacy.CallLangChain.v1"} +> config = {"temperature": 0.3} +> ``` + +| Argument | Description | +| -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | The name of a model supported by LangChain for this API. ~~str~~ | +| `config` | Configuration passed on to the LangChain model. Defaults to `{}`. ~~Dict[Any, Any]~~ | +| `query` | Function that executes the prompts. If `None`, defaults to `spacy.CallLangChain.v1`. ~~Optional[Callable[["langchain.llms.BaseLLM", Iterable[Any]], Iterable[Any]]]~~ | + +The default `query` (`spacy.CallLangChain.v1`) executes the prompts by running +`model(text)` for each given textual prompt. + +### Cache {id="cache"} + +Interacting with LLMs, either through an external API or a local instance, is +costly. Since developing an NLP pipeline generally means a lot of exploration +and prototyping, `spacy-llm` implements a built-in cache that keeps batches of +documents stored on disk, to avoid reprocessing the same documents at each run.
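The idea of a disk-backed cache that works in batches can be sketched in plain Python. This is only an illustration and not the `spacy.BatchCache.v1` implementation: `TinyBatchCache` and all of its methods are hypothetical, it caches strings rather than `Doc` objects, and it keeps its index in memory only. The `batch_size` and `max_batches_in_mem` names mirror the settings of the built-in cache.

```python
import hashlib
import json
from collections import OrderedDict
from pathlib import Path
from typing import Dict, Optional


class TinyBatchCache:
    """Hypothetical sketch: results are written to disk one full batch at a time."""

    def __init__(self, path: Path, batch_size: int = 64, max_batches_in_mem: int = 4):
        self.path = Path(path)
        self.path.mkdir(parents=True, exist_ok=True)
        self.batch_size = batch_size
        self.max_batches_in_mem = max_batches_in_mem
        self._index: Dict[str, int] = {}  # text hash -> number of the batch file
        self._loaded: "OrderedDict[int, Dict[str, str]]" = OrderedDict()
        self._pending: Dict[str, str] = {}  # current batch, not yet on disk
        self._n_batches = 0

    @staticmethod
    def key(text: str) -> str:
        return hashlib.sha256(text.encode("utf8")).hexdigest()

    def add(self, text: str, result: str) -> None:
        self._pending[self.key(text)] = result
        if len(self._pending) >= self.batch_size:
            self._persist()

    def _persist(self) -> None:
        # Write the full batch to its own file and remember where each key lives.
        batch_file = self.path / f"{self._n_batches}.json"
        batch_file.write_text(json.dumps(self._pending), encoding="utf8")
        for k in self._pending:
            self._index[k] = self._n_batches
        self._n_batches += 1
        self._pending = {}

    def get(self, text: str) -> Optional[str]:
        k = self.key(text)
        if k in self._pending:
            return self._pending[k]
        batch_id = self._index.get(k)
        if batch_id is None:
            return None  # cache miss
        batch = self._loaded.get(batch_id)
        if batch is None:
            # Load the batch from disk and keep it in memory for later lookups.
            batch_file = self.path / f"{batch_id}.json"
            batch = json.loads(batch_file.read_text(encoding="utf8"))
            self._loaded[batch_id] = batch
            if len(self._loaded) > self.max_batches_in_mem:
                self._loaded.popitem(last=False)  # drop the oldest loaded batch
        return batch[k]
```

A lookup first checks the batch that is still being filled, then resolves which batch file holds the entry and loads that file only if it is not already in memory.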
+ +> #### Example config +> +> ```ini +> [components.llm.cache] +> @llm_misc = "spacy.BatchCache.v1" +> path = "path/to/cache" +> batch_size = 64 +> max_batches_in_mem = 4 +> ``` + +| Argument | Description | +| -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ | +| `path` | Cache directory. If `None`, no caching is performed, and this component will act as a NoOp. Defaults to `None`. ~~Optional[Union[str, Path]]~~ | +| `batch_size` | Number of docs in one batch (file). Once a batch is full, it will be persisted to disk. Defaults to 64. ~~int~~ | +| `max_batches_in_mem` | Max. number of batches to hold in memory. Allows you to limit the effect on your memory if you're handling a lot of docs. Defaults to 4. ~~int~~ | + +When retrieving a document, the `BatchCache` will first figure out what batch +the document belongs to. If the batch isn't in memory it will try to load the +batch from disk and then move it into memory. + +Note that since the cache is generated by a registered function, you can also +provide your own registered function returning your own cache implementation. If +you wish to do so, ensure that your cache object adheres to the `Protocol` +defined in `spacy_llm.ty.Cache`. + +### Various functions {id="various-functions"} + +#### spacy.FewShotReader.v1 {id="fewshotreader-v1"} + +This function is registered in spaCy's `misc` registry, and reads in examples +from a `.yml`, `.yaml`, `.json` or `.jsonl` file. It uses +[`srsly`](https://github.com/explosion/srsly) to read in these files and parses +them depending on the file extension.
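The suffix-based dispatch described above can be sketched roughly as follows. This is an illustrative stand-in, not the `spacy.FewShotReader.v1` implementation: `read_examples` is a hypothetical helper, it uses the standard library instead of `srsly`, and it leaves out the `.yml`/`.yaml` branch because the standard library ships no YAML parser.

```python
import json
from pathlib import Path
from typing import Any, List, Union


def read_examples(path: Union[str, Path]) -> List[Any]:
    """Hypothetical helper: choose a parser based on the file suffix."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        # A .json file holds a single document, normally a list of examples.
        data = json.loads(path.read_text(encoding="utf8"))
        return data if isinstance(data, list) else [data]
    if suffix == ".jsonl":
        # A .jsonl file holds one JSON object per non-empty line.
        lines = path.read_text(encoding="utf8").splitlines()
        return [json.loads(line) for line in lines if line.strip()]
    raise ValueError(f"Unsupported examples file suffix: {suffix!r}")
```

Raising on an unknown suffix, rather than guessing a format, keeps a misnamed examples file from silently producing an empty few-shot prompt.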
+ +> #### Example config +> +> ```ini +> [components.llm.task.examples] +> @misc = "spacy.FewShotReader.v1" +> path = "ner_examples.yml" +> ``` + +| Argument | Description | +| -------- | ----------------------------------------------------------------------------------------------- | +| `path` | Path to an examples file with suffix `.yml`, `.yaml`, `.json` or `.jsonl`. ~~Union[str, Path]~~ | + +#### spacy.FileReader.v1 {id="filereader-v1"} + +This function is registered in spaCy's `misc` registry, and reads a file +provided to the `path` to return a `str` representation of its contents. This +function is typically used to read +[Jinja](https://jinja.palletsprojects.com/en/3.1.x/) files containing the prompt +template. + +> #### Example config +> +> ```ini +> [components.llm.task.template] +> @misc = "spacy.FileReader.v1" +> path = "ner_template.jinja2" +> ``` + +| Argument | Description | +| -------- | ------------------------------------------------- | +| `path` | Path to the file to be read. ~~Union[str, Path]~~ | + +#### Normalizer functions {id="normalizer-functions"} + +These functions provide simple normalizations for string comparisons, e.g. +between a list of specified labels and a label given in the raw text of the LLM +response. They are registered in spaCy's `misc` registry and have the signature +`Callable[[str], str]`. + +- `spacy.StripNormalizer.v1`: only apply `text.strip()` +- `spacy.LowercaseNormalizer.v1`: applies `text.strip().lower()` to compare + strings in a case-insensitive way. diff --git a/website/docs/usage/index.mdx b/website/docs/usage/index.mdx index 4b06178d5..414968d42 100644 --- a/website/docs/usage/index.mdx +++ b/website/docs/usage/index.mdx @@ -261,7 +261,7 @@ source code and recompiling frequently. 
#### Visual Studio Code extension -![spaCy extension demo](/images/spacy-extension-demo.gif) +![spaCy extension demo](/images/spacy-extension-demo.gif) The [spaCy VSCode Extension](https://github.com/explosion/spacy-vscode) provides additional tooling and features for working with spaCy's config files. Version @@ -310,7 +310,7 @@ You can configure the build process with the following environment variables: | Variable | Description | | -------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `SPACY_EXTRAS` | Additional Python packages to install alongside spaCy with optional version specifications. Should be a string that can be passed to `pip install`. See [`Makefile`](%%GITHUB_SPACY/Makefile) for defaults. | -| `PYVER` | The Python version to build against. This version needs to be available on your build and runtime machines. Defaults to `3.6`. | +| `PYVER` | The Python version to build against. This version needs to be available on your build and runtime machines. Defaults to `3.8`. | | `WHEELHOUSE` | Directory to store the wheel files during compilation. Defaults to `./wheelhouse`. 
| ### Run tests {id="run-tests"} diff --git a/website/docs/usage/large-language-models.mdx b/website/docs/usage/large-language-models.mdx new file mode 100644 index 000000000..3c2c52c68 --- /dev/null +++ b/website/docs/usage/large-language-models.mdx @@ -0,0 +1,512 @@ +--- +title: Large Language Models +teaser: Integrating LLMs into structured NLP pipelines +menu: + - ['Motivation', 'motivation'] + - ['Install', 'install'] + - ['Usage', 'usage'] + - ['Logging', 'logging'] + - ['API', 'api'] + - ['Tasks', 'tasks'] + - ['Models', 'models'] +--- + +[The spacy-llm package](https://github.com/explosion/spacy-llm) integrates Large +Language Models (LLMs) into spaCy pipelines, featuring a modular system for +**fast prototyping** and **prompting**, and turning unstructured responses into +**robust outputs** for various NLP tasks, **no training data** required. + +- Serializable `llm` **component** to integrate prompts into your pipeline +- **Modular functions** to define the [**task**](#tasks) (prompting and parsing) + and [**model**](#models) (model to use) +- Support for **hosted APIs** and self-hosted **open-source models** +- Integration with [`LangChain`](https://github.com/hwchase17/langchain) +- Access to + **[OpenAI API](https://platform.openai.com/docs/api-reference/introduction)**, + including GPT-4 and various GPT-3 models +- Built-in support for various **open-source** models hosted on + [Hugging Face](https://huggingface.co/) +- Usage examples for standard NLP tasks such as **Named Entity Recognition** and + **Text Classification** +- Easy implementation of **your own functions** via the + [registry](/api/top-level#registry) for custom prompting, parsing and model + integrations + +## Motivation {id="motivation"} + +Large Language Models (LLMs) feature powerful natural language understanding +capabilities. 
With only a few (and sometimes no) examples, an LLM can be +prompted to perform custom NLP tasks such as text categorization, named entity +recognition, coreference resolution, information extraction and more. + +Supervised learning is much worse than LLM prompting for prototyping, but for +many tasks it's much better for production. A transformer model that runs +comfortably on a single GPU is extremely powerful, and it's likely to be a +better choice for any task for which you have a well-defined output. You train +the model with anything from a few hundred to a few thousand labelled examples, +and it will learn to do exactly that. Efficiency, reliability and control are +all better with supervised learning, and accuracy will generally be higher than +LLM prompting as well. + +`spacy-llm` lets you have **the best of both worlds**. You can quickly +initialize a pipeline with components powered by LLM prompts, and freely mix in +components powered by other approaches. As your project progresses, you can look +at replacing some or all of the LLM-powered components as you require. + +Of course, there can be components in your system for which the power of an LLM +is fully justified. If you want a system that can synthesize information from +multiple documents in subtle ways and generate a nuanced summary for you, bigger +is better. However, even if your production system needs an LLM for some of the +task, that doesn't mean you need an LLM for all of it. Maybe you want to use a +cheap text classification model to help you find the texts to summarize, or +maybe you want to add a rule-based system to sanity check the output of the +summary. These before-and-after tasks are much easier with a mature and +well-thought-out library, which is exactly what spaCy provides. + +## Install {id="install"} + +`spacy-llm` will be installed automatically in future spaCy versions. 
For now, +you can run the following in the same virtual environment where you already have +`spacy` [installed](/usage). + +> ⚠️ This package is still experimental and it is possible that changes made to +> the interface will be breaking in minor version updates. + +```bash +python -m pip install spacy-llm +``` + +## Usage {id="usage"} + +The task and the model have to be supplied to the `llm` pipeline component using +the [config system](/api/data-formats#config). This package provides a range of +built-in functionality, as detailed in the [API](#api) documentation. + +### Example 1: Add a text classifier using a GPT-3 model from OpenAI {id="example-1"} + +Create a new API key from openai.com or fetch an existing one, and ensure the +keys are set as environmental variables. For more background information, see +the [OpenAI](/api/large-language-models#gpt-3-5) section. + +Create a config file `config.cfg` containing at least the following (or see the +full example +[here](https://github.com/explosion/spacy-llm/tree/main/usage_examples/textcat_openai)): + +```ini +[nlp] +lang = "en" +pipeline = ["llm"] + +[components] + +[components.llm] +factory = "llm" + +[components.llm.task] +@llm_tasks = "spacy.TextCat.v2" +labels = ["COMPLIMENT", "INSULT"] + +[components.llm.model] +@llm_models = "spacy.GPT-3-5.v1" +config = {"temperature": 0.3} +``` + +Now run: + +```python +from spacy_llm.util import assemble + +nlp = assemble("config.cfg") +doc = nlp("You look gorgeous!") +print(doc.cats) +``` + +### Example 2: Add NER using an open-source model through Hugging Face {id="example-2"} + +To run this example, ensure that you have a GPU enabled, and `transformers`, +`torch` and CUDA installed. For more background information, see the +[DollyHF](/api/large-language-models#dolly) section.
+ +Create a config file `config.cfg` containing at least the following (or see the +full example +[here](https://github.com/explosion/spacy-llm/tree/main/usage_examples/ner_dolly)): + +```ini +[nlp] +lang = "en" +pipeline = ["llm"] + +[components] + +[components.llm] +factory = "llm" + +[components.llm.task] +@llm_tasks = "spacy.NER.v2" +labels = ["PERSON", "ORGANISATION", "LOCATION"] + +[components.llm.model] +@llm_models = "spacy.Dolly.v1" +# For better performance, use dolly-v2-12b instead +name = "dolly-v2-3b" +``` + +Now run: + +```python +from spacy_llm.util import assemble + +nlp = assemble("config.cfg") +doc = nlp("Jack and Jill rode up the hill in Les Deux Alpes") +print([(ent.text, ent.label_) for ent in doc.ents]) +``` + +Note that Hugging Face will download the `"databricks/dolly-v2-3b"` model the +first time you use it. You can +[define the cache directory](https://huggingface.co/docs/huggingface_hub/main/en/guides/manage-cache) +by setting the environmental variable `HF_HOME`. Also, you can upgrade the model +to be `"databricks/dolly-v2-12b"` for better performance. + +### Example 3: Create the component directly in Python {id="example-3"} + +The `llm` component behaves as any other component does, so adding it to an +existing pipeline follows the same pattern: + +```python +import spacy + +nlp = spacy.blank("en") +nlp.add_pipe( + "llm", + config={ + "task": { + "@llm_tasks": "spacy.NER.v2", + "labels": ["PERSON", "ORGANISATION", "LOCATION"] + }, + "model": { + "@llm_models": "spacy.GPT-3-5.v1", + }, + }, +) +nlp.initialize() +doc = nlp("Jack and Jill rode up the hill in Les Deux Alpes") +print([(ent.text, ent.label_) for ent in doc.ents]) +``` + +Note that for efficient usage of resources, typically you would use +[`nlp.pipe(docs)`](/api/language#pipe) with a batch, instead of calling +`nlp(doc)` with a single document.
+ +### Example 4: Implement your own custom task {id="example-4"} + +To write a [`task`](#tasks), you need to implement two functions: +`generate_prompts` that takes a list of [`Doc`](/api/doc) objects and transforms +them into a list of prompts, and `parse_responses` that transforms the LLM +outputs into annotations on the [`Doc`](/api/doc), e.g. entity spans, text +categories and more. + +To register your custom task, decorate a factory function using the +`spacy_llm.registry.llm_tasks` decorator with a custom name that you can refer +to in your config. + +> 📖 For more details, see the +> [**usage example on writing your own task**](https://github.com/explosion/spacy-llm/tree/main/usage_examples#writing-your-own-task) + +```python +from typing import Iterable, List +from spacy.tokens import Doc +from spacy_llm.registry import registry +from spacy_llm.util import split_labels + + +@registry.llm_tasks("my_namespace.MyTask.v1") +def make_my_task(labels: str, my_other_config_val: float) -> "MyTask": + labels_list = split_labels(labels) + return MyTask(labels=labels_list, my_other_config_val=my_other_config_val) + + +class MyTask: + def __init__(self, labels: List[str], my_other_config_val: float): + ... + + def generate_prompts(self, docs: Iterable[Doc]) -> Iterable[str]: + ... + + def parse_responses( + self, docs: Iterable[Doc], responses: Iterable[str] + ) -> Iterable[Doc]: + ... +``` + +```ini +# config.cfg (excerpt) +[components.llm.task] +@llm_tasks = "my_namespace.MyTask.v1" +labels = LABEL1,LABEL2,LABEL3 +my_other_config_val = 0.3 +``` + +## Logging {id="logging"} + +spacy-llm has a built-in logger that can log the prompt sent to the LLM as well +as its raw response. This logger uses the debug level and by default has a +`logging.NullHandler()` configured. 
+ +In order to use this logger, you can set up a simple handler like this: + +```python +import logging +import spacy_llm + + +spacy_llm.logger.addHandler(logging.StreamHandler()) +spacy_llm.logger.setLevel(logging.DEBUG) +``` + +> NOTE: Any `logging` handler will work here so you probably want to use some +> sort of rotating `FileHandler` as the generated prompts can be quite long, +> especially for tasks with few-shot examples. + +Then when using the pipeline you'll be able to view the prompt and response. + +E.g. with the config and code from [Example 1](#example-1) above: + +```python +from spacy_llm.util import assemble + + +nlp = assemble("config.cfg") +doc = nlp("You look gorgeous!") +print(doc.cats) +``` + +You will see `logging` output similar to: + +``` +Generated prompt for doc: You look gorgeous! + +You are an expert Text Classification system. Your task is to accept Text as input +and provide a category for the text based on the predefined labels. + +Classify the text below to any of the following labels: COMPLIMENT, INSULT +The task is non-exclusive, so you can provide more than one label as long as +they're comma-delimited. For example: Label1, Label2, Label3. +Do not put any other text in your answer, only one or more of the provided labels with nothing before or after. +If the text cannot be classified into any of the provided labels, answer `==NONE==`. + +Here is the text that needs classification + + +Text: +''' +You look gorgeous! +''' + +Model response for doc: You look gorgeous! +COMPLIMENT +``` + +`print(doc.cats)` to standard output should look like: + +``` +{'COMPLIMENT': 1.0, 'INSULT': 0.0} +``` + +## API {id="api"} + +`spacy-llm` exposes an `llm` factory with +[configurable settings](/api/large-language-models#config).
+ +An `llm` component is defined by two main settings: + +- A [**task**](#tasks), defining the prompt to send to the LLM as well as the + functionality to parse the resulting response back into structured fields on + the [Doc](/api/doc) objects. +- A [**model**](#models) defining the model to use and how to connect to it. + Note that `spacy-llm` supports both access to external APIs (such as OpenAI) + and access to self-hosted open-source LLMs (such as using Dolly through + Hugging Face). + +Moreover, `spacy-llm` exposes a customizable [**caching**](#cache) functionality +to avoid running the same document through an LLM service (be it local or +through a REST API) more than once. + +Finally, you can choose to save a stringified version of LLM prompts/responses +within the `Doc.user_data["llm_io"]` attribute by setting `save_io` to `True`. +`Doc.user_data["llm_io"]` is a dictionary containing one entry for every LLM +component within the `nlp` pipeline. Each entry is itself a dictionary, with two +keys: `prompt` and `response`. + +A note on `validate_types`: by default, `spacy-llm` checks whether the +signatures of the `model` and `task` callables are consistent with each other +and emits a warning if they aren't. `validate_types` can be set to `False` if you +want to disable this behavior. + +### Tasks {id="tasks"} + +A _task_ defines an NLP problem or question that will be sent to the LLM via a +prompt. Further, the task defines how to parse the LLM's responses back into +structured information. All tasks are registered in the `llm_tasks` registry. + +Practically speaking, a task should adhere to the `Protocol` `LLMTask` defined +in [`ty.py`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/ty.py). +It needs to define a `generate_prompts` function and a `parse_responses` +function.
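What "adhering to a `Protocol`" means can be sketched with the standard library alone. The names `TaskLike` and `EchoTask` below are hypothetical and are not the definitions found in `spacy_llm.ty`; plain strings stand in for `Doc` objects so that the snippet runs without spaCy installed.

```python
from typing import Any, Iterable, Protocol, runtime_checkable


@runtime_checkable
class TaskLike(Protocol):
    """Hypothetical stand-in for a task protocol: two required methods."""

    def generate_prompts(self, docs: Iterable[Any]) -> Iterable[Any]:
        ...

    def parse_responses(
        self, docs: Iterable[Any], responses: Iterable[Any]
    ) -> Iterable[Any]:
        ...


class EchoTask:
    """Toy task: asks the model to repeat each text, keeps the stripped reply."""

    def generate_prompts(self, docs):
        return [f"Repeat this text: {doc}" for doc in docs]

    def parse_responses(self, docs, responses):
        # Pair every input with its cleaned-up response.
        return [(doc, response.strip()) for doc, response in zip(docs, responses)]


task = EchoTask()
# The check is structural: EchoTask never inherits from TaskLike.
print(isinstance(task, TaskLike))
```

A `runtime_checkable` protocol only verifies that the methods exist, not that their signatures match, so it is a quick sanity check rather than full type validation.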
+ +| Task | Description | +| --------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| [`task.generate_prompts`](/api/large-language-models#task-generate-prompts) | Takes a collection of documents, and returns a collection of "prompts", which can be of type `Any`. | +| [`task.parse_responses`](/api/large-language-models#task-parse-responses) | Takes a collection of LLM responses and the original documents, parses the responses into structured information, and sets the annotations on the documents. | + +Moreover, the task may define an optional [`scorer` method](/api/scorer#score). +It should accept an iterable of `Example`s as input and return a score +dictionary. If the `scorer` method is defined, `spacy-llm` will call it to +evaluate the component. + +| Component | Description | +| ----------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| [`spacy.Summarization.v1`](/api/large-language-models#summarization-v1) | The summarization task prompts the model for a concise summary of the provided text. | +| [`spacy.NER.v2`](/api/large-language-models#ner-v2) | The built-in NER task supports both zero-shot and few-shot prompting. This version also supports explicitly defining the provided labels with custom descriptions. | +| [`spacy.NER.v1`](/api/large-language-models#ner-v1) | The original version of the built-in NER task supports both zero-shot and few-shot prompting. | +| [`spacy.SpanCat.v2`](/api/large-language-models#spancat-v2) | The built-in SpanCat task is a simple adaptation of the NER task to support overlapping entities and store its annotations in `doc.spans`. 
| +| [`spacy.SpanCat.v1`](/api/large-language-models#spancat-v1) | The original version of the built-in SpanCat task is a simple adaptation of the v1 NER task to support overlapping entities and store its annotations in `doc.spans`. | +| [`spacy.TextCat.v3`](/api/large-language-models#textcat-v3) | Version 3 (the most recent) of the built-in TextCat task supports both zero-shot and few-shot prompting. It allows setting definitions of labels. | +| [`spacy.TextCat.v2`](/api/large-language-models#textcat-v2) | Version 2 of the built-in TextCat task supports both zero-shot and few-shot prompting and includes an improved prompt template. | +| [`spacy.TextCat.v1`](/api/large-language-models#textcat-v1) | Version 1 of the built-in TextCat task supports both zero-shot and few-shot prompting. | +| [`spacy.REL.v1`](/api/large-language-models#rel-v1) | The built-in REL task supports both zero-shot and few-shot prompting. It relies on an upstream NER component for entities extraction. | +| [`spacy.Lemma.v1`](/api/large-language-models#lemma-v1) | The `Lemma.v1` task lemmatizes the provided text and updates the `lemma_` attribute in the doc's tokens accordingly. | +| [`spacy.Sentiment.v1`](/api/large-language-models#sentiment-v1) | Performs sentiment analysis on provided texts. | +| [`spacy.NoOp.v1`](/api/large-language-models#noop-v1) | This task is only useful for testing - it tells the LLM to do nothing, and does not set any fields on the `docs`. | + +#### Providing examples for few-shot prompts {id="few-shot-prompts"} + +All built-in tasks support few-shot prompts, i. e. including examples in a +prompt. Examples can be supplied in two ways: (1) as a separate file containing +only examples or (2) by initializing `llm` with a `get_examples()` callback +(like any other pipeline component). 
+ +##### (1) Few-shot example file + +A file containing examples for few-shot prompting can be configured like this: + +```ini +[components.llm.task] +@llm_tasks = "spacy.NER.v2" +labels = PERSON,ORGANISATION,LOCATION +[components.llm.task.examples] +@misc = "spacy.FewShotReader.v1" +path = "ner_examples.yml" +``` + +The supplied file has to conform to the format expected by the required task +(see the task documentation further down). + +##### (2) Initializing the `llm` component with a `get_examples()` callback + +Alternatively, you can initialize your `nlp` pipeline by providing a +`get_examples` callback for [`nlp.initialize`](/api/language#initialize) and +setting `n_prompt_examples` to a positive number to automatically fetch a few +examples for few-shot learning. Set `n_prompt_examples` to `-1` to use all +examples as part of the few-shot learning prompt. + +```ini +[initialize.components.llm] +n_prompt_examples = 3 +``` + +### Model {id="models"} + +A _model_ defines which LLM to query, and how to query it. It can be a +simple function taking a collection of prompts (consistent with the output type +of `task.generate_prompts()`) and returning a collection of responses +(consistent with the expected input of `parse_responses`). Generally speaking, +it's a function of type `Callable[[Iterable[Any]], Iterable[Any]]`, but specific +implementations can have other signatures, like +`Callable[[Iterable[str]], Iterable[str]]`. + +All built-in models are registered in `llm_models`. If no model is specified, +`spacy-llm` currently connects to the `OpenAI` API by default using REST, and +accesses the `"gpt-3.5-turbo"` model. + +Currently, three different approaches to using LLMs are supported: + +1. `spacy-llm`'s native REST interface. This is the default for all hosted models + (e.g. OpenAI, Cohere, Anthropic, ...). +2. A HuggingFace integration that allows you to run a limited set of HF models + locally. +3. 
A LangChain integration that allows you to run any model supported by LangChain + (hosted or local). + +Approaches 1 and 2 are the defaults for hosted and local models, +respectively. Alternatively, you can use LangChain to access hosted or local +models by specifying one of the models registered with the `langchain.` prefix. + + +_Why LangChain if there are also a native REST and a HuggingFace interface? When should I use what?_ + +Third-party libraries like `langchain` focus on prompt management, integration +of many different LLM APIs, and other related features such as conversational +memory or agents. `spacy-llm`, on the other hand, emphasizes features we consider +useful in the context of NLP pipelines utilizing LLMs to process documents +(mostly) independently of each other. It makes sense that the feature sets of +such third-party libraries and `spacy-llm` aren't identical - and users might +want to take advantage of features not available in `spacy-llm`. + +The advantage of implementing our own REST and HuggingFace integrations is that +we can ensure a higher degree of stability and robustness, as we can guarantee +backwards-compatibility and more smoothly integrated error handling. + +If, however, there are features or APIs not natively covered by `spacy-llm`, it's +trivial to utilize LangChain to cover this - and easy to customize the prompting +mechanism, if so required. + + + + +Note that when using hosted services, you have to ensure that the [proper API +keys](/api/large-language-models#api-keys) are set as environment variables as described by the corresponding +provider's documentation. + + + +| Component | Description | +| ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------ | +| [`spacy.GPT-4.v1`](/api/large-language-models#gpt-4) | OpenAI’s `gpt-4` model family. 
| +| [`spacy.GPT-3-5.v1`](/api/large-language-models#gpt-3-5) | OpenAI’s `gpt-3-5` model family. | +| [`spacy.Text-Davinci.v1`](/api/large-language-models#text-davinci) | OpenAI’s `text-davinci` model family. | +| [`spacy.Code-Davinci.v1`](/api/large-language-models#code-davinci) | OpenAI’s `code-davinci` model family. | +| [`spacy.Text-Curie.v1`](/api/large-language-models#text-curie) | OpenAI’s `text-curie` model family. | +| [`spacy.Text-Babbage.v1`](/api/large-language-models#text-babbage) | OpenAI’s `text-babbage` model family. | +| [`spacy.Text-Ada.v1`](/api/large-language-models#text-ada) | OpenAI’s `text-ada` model family. | +| [`spacy.Davinci.v1`](/api/large-language-models#davinci) | OpenAI’s `davinci` model family. | +| [`spacy.Curie.v1`](/api/large-language-models#curie) | OpenAI’s `curie` model family. | +| [`spacy.Babbage.v1`](/api/large-language-models#babbage) | OpenAI’s `babbage` model family. | +| [`spacy.Ada.v1`](/api/large-language-models#ada) | OpenAI’s `ada` model family. | +| [`spacy.Command.v1`](/api/large-language-models#command) | Cohere’s `command` model family. | +| [`spacy.Claude-1.v1`](/api/large-language-models#claude-1) | Anthropic’s `claude-1` model family. | +| [`spacy.Claude-instant-1.v1`](/api/large-language-models#claude-instant-1) | Anthropic’s `claude-instant-1` model family. | +| [`spacy.Claude-instant-1-1.v1`](/api/large-language-models#claude-instant-1-1) | Anthropic’s `claude-instant-1.1` model family. | +| [`spacy.Claude-1-0.v1`](/api/large-language-models#claude-1-0) | Anthropic’s `claude-1.0` model family. | +| [`spacy.Claude-1-2.v1`](/api/large-language-models#claude-1-2) | Anthropic’s `claude-1.2` model family. | +| [`spacy.Claude-1-3.v1`](/api/large-language-models#claude-1-3) | Anthropic’s `claude-1.3` model family. | +| [`spacy.Dolly.v1`](/api/large-language-models#dolly) | Dolly models through [Databricks](https://huggingface.co/databricks) on HuggingFace. 
| +| [`spacy.Falcon.v1`](/api/large-language-models#falcon) | Falcon model through HuggingFace. | +| [`spacy.StableLM.v1`](/api/large-language-models#stablelm) | StableLM model through HuggingFace. | +| [`spacy.OpenLLaMA.v1`](/api/large-language-models#openllama) | OpenLLaMA model through HuggingFace. | +| [LangChain models](/api/large-language-models#langchain-models) | LangChain models for API retrieval. | + +### Cache {id="cache"} + +Interacting with LLMs, either through an external API or a local instance, is +costly. Since developing an NLP pipeline generally means a lot of exploration +and prototyping, `spacy-llm` implements a built-in +[cache](/api/large-language-models#cache) that keeps batches of documents +stored on disk, to avoid reprocessing the same documents at each run. + +### Various functions {id="various-functions"} + +| Component | Description | +| ----------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| [`spacy.FewShotReader.v1`](/api/large-language-models#fewshotreader-v1) | This function is registered in spaCy's `misc` registry, and reads in examples from a `.yml`, `.yaml`, `.json` or `.jsonl` file. It uses [`srsly`](https://github.com/explosion/srsly) to read in these files and parses them depending on the file extension. | +| [`spacy.FileReader.v1`](/api/large-language-models#filereader-v1) | This function is registered in spaCy's `misc` registry, and reads a file provided to the `path` to return a `str` representation of its contents. This function is typically used to read [Jinja](https://jinja.palletsprojects.com/en/3.1.x/) files containing the prompt template. 
| +| [Normalizer functions](/api/large-language-models#normalizer-functions) | These functions provide simple normalizations for string comparisons, e.g. between a list of specified labels and a label given in the raw text of the LLM response. | diff --git a/website/meta/sidebars.json b/website/meta/sidebars.json index 04102095f..033f71b12 100644 --- a/website/meta/sidebars.json +++ b/website/meta/sidebars.json @@ -26,16 +26,19 @@ { "text": "Processing Pipelines", "url": "/usage/processing-pipelines" }, { "text": "Embeddings & Transformers", - "url": "/usage/embeddings-transformers", + "url": "/usage/embeddings-transformers" + }, + { + "text": "Large Language Models", + "url": "/usage/large-language-models", "tag": "new" }, - { "text": "Training Models", "url": "/usage/training", "tag": "new" }, + { "text": "Training Models", "url": "/usage/training" }, { "text": "Layers & Model Architectures", - "url": "/usage/layers-architectures", - "tag": "new" + "url": "/usage/layers-architectures" }, - { "text": "spaCy Projects", "url": "/usage/projects", "tag": "new" }, + { "text": "spaCy Projects", "url": "/usage/projects" }, { "text": "Saving & Loading", "url": "/usage/saving-loading" }, { "text": "Visualizers", "url": "/usage/visualizers" } ] @@ -102,6 +105,7 @@ { "text": "EntityLinker", "url": "/api/entitylinker" }, { "text": "EntityRecognizer", "url": "/api/entityrecognizer" }, { "text": "EntityRuler", "url": "/api/entityruler" }, + { "text": "Large Language Models", "url": "/api/large-language-models" }, { "text": "Lemmatizer", "url": "/api/lemmatizer" }, { "text": "Morphologizer", "url": "/api/morphologizer" }, { "text": "SentenceRecognizer", "url": "/api/sentencerecognizer" }, diff --git a/website/pages/index.tsx b/website/pages/index.tsx index fc0dba378..089d75b52 100644 --- a/website/pages/index.tsx +++ b/website/pages/index.tsx @@ -106,50 +106,21 @@ const Landing = () => {

- - - + + The spacy-llm package + {' '} + integrates Large Language Models (LLMs) into spaCy, featuring a modular + system for fast prototyping and prompting, + and turning unstructured responses into robust outputs for + various NLP tasks, no training data required.

-

- - Get a custom spaCy pipeline, tailor-made for your NLP problem by - spaCy's core developers. - -

-
    -
  • - Streamlined. Nobody knows spaCy better than we do. Send - us your pipeline requirements and we'll be ready to start producing - your solution in no time at all. -
  • -
  • - Production ready. spaCy pipelines are robust and easy - to deploy. You'll get a complete spaCy project folder which is - ready to spacy project run. -
  • -
  • - Predictable. You'll know exactly what you're - going to get and what it's going to cost. We quote fees up-front, - let you try before you buy, and don't charge for over-runs at our - end — all the risk is on us. -
  • -
  • - Maintainable. spaCy is an industry standard, and - we'll deliver your pipeline with full code, data, tests and - documentation, so your team can retrain, update and extend the solution - as your requirements change. -
  • -
{

- spaCy v3.0 features all new transformer-based pipelines{' '} - that bring spaCy's accuracy right up to the current{' '} - state-of-the-art. You can use any pretrained transformer to - train your own pipelines, and even share one transformer between multiple - components with multi-task learning. Training is now fully - configurable and extensible, and you can define your own custom models using{' '} - PyTorch, TensorFlow and other frameworks. + + +

+

+ + Get a custom spaCy pipeline, tailor-made for your NLP problem by + spaCy's core developers. + +

+
    +
  • + Streamlined. Nobody knows spaCy better than we do. Send + us your pipeline requirements and we'll be ready to start producing + your solution in no time at all. +
  • +
  • + Production ready. spaCy pipelines are robust and easy + to deploy. You'll get a complete spaCy project folder which is + ready to spacy project run. +
  • +
  • + Predictable. You'll know exactly what you're + going to get and what it's going to cost. We quote fees up-front, + let you try before you buy, and don't charge for over-runs at our + end — all the risk is on us. +
  • +
  • + Maintainable. spaCy is an industry standard, and + we'll deliver your pipeline with full code, data, tests and + documentation, so your team can retrain, update and extend the solution + as your requirements change. +
  • +
{ small >

- +