spaCy/sentencizer.md at b278f31ee684e5d402a1891a0445a9c7c1c1f644

mirror of https://github.com/explosion/spaCy.git synced 2024-11-11 20:28:20 +03:00



title	tag	source	teaser	api_string_name	api_trainable
Sentencizer	class	spacy/pipeline/sentencizer.pyx	Pipeline component for rule-based sentence boundary detection	sentencizer	false

A simple pipeline component that provides rule-based sentence boundary detection without requiring the dependency parse. By default, sentence segmentation is performed by the DependencyParser, so the Sentencizer lets you implement a simpler, rule-based strategy that doesn't require a statistical model to be loaded.
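To make the rule-based idea concrete, here is a simplified pure-Python sketch of this kind of strategy, working on a list of already-tokenized strings. It is an illustration only, not spaCy's code: `is_punct` and `sentence_starts` are hypothetical helper names, and the library's actual implementation may differ in detail.

```python
import unicodedata
from typing import List, Sequence


def is_punct(token: str) -> bool:
    # Hypothetical helper: treat a token as punctuation if every
    # character is in a Unicode "P*" (punctuation) category.
    return all(unicodedata.category(ch).startswith("P") for ch in token)


def sentence_starts(
    tokens: Sequence[str], punct_chars: Sequence[str] = (".", "?", "!")
) -> List[bool]:
    # Hypothetical helper: mark the tokens that open a new sentence.
    starts = [False] * len(tokens)
    if not tokens:
        return starts
    starts[0] = True  # the first token always opens a sentence
    seen_end = False
    for i, tok in enumerate(tokens):
        in_chars = tok in punct_chars
        if seen_end and not in_chars and not is_punct(tok):
            # First non-punctuation token after a sentence-final mark.
            starts[i] = True
            seen_end = False
        elif in_chars:
            seen_end = True
    return starts


tokens = ["Wait", "?", "!", "Yes", "."]
flags = sentence_starts(tokens)
print(flags)
```

Runs of closing punctuation such as `?!` stay attached to the sentence they end, because a new sentence only starts at the next non-punctuation token.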

Config and implementation

The default config is defined by the pipeline component factory and describes how the component should be configured. You can override its settings via the config argument on nlp.add_pipe or in your config.cfg for training.

Example

config = {"punct_chars": None}
nlp.add_pipe("sentencizer", config=config)

Setting	Description
`punct_chars`	Optional custom list of punctuation characters that mark sentence ends. See below for defaults if not set. Defaults to `None`. ~~Optional[List[str]]~~
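As a hedged illustration of overriding the setting, the sketch below restricts `punct_chars` to a single character so that `?` no longer ends a sentence. It assumes spaCy v3 is installed and uses a blank English pipeline, so no trained model is required; the sample text is made up.

```python
import spacy

# A blank pipeline has a tokenizer but no statistical components.
nlp = spacy.blank("en")

# Only "." closes a sentence in this configuration.
nlp.add_pipe("sentencizer", config={"punct_chars": ["."]})

doc = nlp("Is this one? Yes. Done.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

With the default characters the same text would be split after the question mark as well.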

The implementation is in spacy/pipeline/sentencizer.pyx.

Sentencizer.__init__

Initialize the sentencizer.

Example

# Construction via add_pipe
sentencizer = nlp.add_pipe("sentencizer")

# Construction from class
from spacy.pipeline import Sentencizer
sentencizer = Sentencizer()

Name	Description
keyword-only
`punct_chars`	Optional custom list of punctuation characters that mark sentence ends. See below for defaults. ~~Optional[List[str]]~~
`scorer`	The scoring method. Defaults to `Scorer.score_spans` for the attribute `"sents"`. ~~Optional[Callable]~~

### punct_chars defaults
['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', '।', '॥', '၊', '။', '።',
 '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄', '᥅', '᪨', '᪩', '᪪', '᪫',
 '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', '‼', '‽', '⁇', '⁈', '⁉',
 '⸮', '⸼', '꓿', '꘎', '꘏', '꛳', '꛷', '꡶', '꡷', '꣎', '꣏', '꤯', '꧈',
 '꧉', '꩝', '꩞', '꩟', '꫰', '꫱', '꯫', '﹒', '﹖', '﹗', '！', '．', '？',
 '𐩖', '𐩗', '𑁇', '𑁈', '𑂾', '𑂿', '𑃀', '𑃁', '𑅁', '𑅂', '𑅃', '𑇅',
 '𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', '𑊩', '𑑋', '𑑌', '𑗂',
 '𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', '𑗑', '𑗒', '𑗓',
 '𑗔', '𑗕', '𑗖', '𑗗', '𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', '𑩃', '𑪛',
 '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈', '｡', '。']
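A minimal sketch of how a custom list replaces these defaults when constructing from the class. It assumes that the component keeps its active characters in a `punct_chars` attribute stored as a set, which holds for the spaCy v3 releases this was written against but is not stated in the table above.

```python
from spacy.pipeline import Sentencizer

# No argument: the built-in default characters are used.
default_sentencizer = Sentencizer()

# Custom list: only these characters mark a sentence end.
custom_sentencizer = Sentencizer(punct_chars=[".", "。"])

print("?" in default_sentencizer.punct_chars)
print(sorted(custom_sentencizer.punct_chars))
```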

Sentencizer.__call__

Apply the sentencizer on a Doc. Typically, this happens automatically after the component has been added to the pipeline using nlp.add_pipe.

Example

from spacy.lang.en import English

nlp = English()
nlp.add_pipe("sentencizer")
doc = nlp("This is a sentence. This is another sentence.")
assert len(list(doc.sents)) == 2

Name	Description
`doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline. ~~Doc~~
RETURNS	The modified `Doc` with added sentence boundaries. ~~Doc~~
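The component can also be called directly on a Doc outside of a pipeline. The sketch below, assuming spaCy v3, tokenizes a text with `nlp.make_doc`, applies a standalone Sentencizer, and then reads the boundaries back from `Token.is_sent_start`.

```python
from spacy.lang.en import English
from spacy.pipeline import Sentencizer

nlp = English()
sentencizer = Sentencizer()

# make_doc only runs the tokenizer; no pipeline components are applied.
doc = nlp.make_doc("This is a sentence. This is another sentence.")

# Calling the component sets the sentence boundaries on the Doc.
doc = sentencizer(doc)

starts = [token.text for token in doc if token.is_sent_start]
print(starts)
```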

Sentencizer.pipe

Apply the pipe to a stream of documents. This usually happens under the hood when the nlp object is called on a text and all pipeline components are applied to the Doc in order.

Example

sentencizer = nlp.add_pipe("sentencizer")
for doc in sentencizer.pipe(docs, batch_size=50):
    pass

Name	Description
`stream`	A stream of documents. ~~Iterable[Doc]~~
keyword-only
`batch_size`	The number of documents to buffer. Defaults to `128`. ~~int~~
YIELDS	The processed documents in order. ~~Doc~~
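For a self-contained version of the pattern above, this sketch builds a few Doc objects with `nlp.make_doc` and streams them through the component. It assumes spaCy v3; the sample texts and the small `batch_size` are arbitrary choices for illustration.

```python
import spacy

nlp = spacy.blank("en")
sentencizer = nlp.add_pipe("sentencizer")

texts = ["One. Two.", "Just one sentence here.", "Why? Because! Done."]

# Tokenize only, so the sentencizer is the first component to touch the docs.
docs = [nlp.make_doc(text) for text in texts]

# pipe() yields the processed docs in the same order as the input.
counts = [len(list(doc.sents)) for doc in sentencizer.pipe(docs, batch_size=2)]
print(counts)
```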

Sentencizer.to_disk

Save the sentencizer settings (punctuation characters) to disk as a JSON file. This also happens automatically when you save an nlp object with a sentencizer added to its pipeline, in which case the file is named sentencizer.json.

Example

config = {"punct_chars": [".", "?", "!", "。"]}
sentencizer = nlp.add_pipe("sentencizer", config=config)
sentencizer.to_disk("/path/to/sentencizer.json")

Name	Description
`path`	A path to a JSON file, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~

Sentencizer.from_disk

Load the sentencizer settings from a file. Expects a JSON file. This also happens automatically when you load an nlp object or model with a sentencizer added to its pipeline.

Example

sentencizer = nlp.add_pipe("sentencizer")
sentencizer.from_disk("/path/to/sentencizer.json")

Name	Description
`path`	A path to a JSON file. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~
RETURNS	The modified `Sentencizer` object. ~~Sentencizer~~
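A hedged round-trip sketch that saves the settings to a temporary directory and loads them into a fresh component. It assumes spaCy v3 and that the active characters are readable from a `punct_chars` attribute; the temporary path stands in for a real location.

```python
import tempfile
from pathlib import Path

from spacy.pipeline import Sentencizer

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sentencizer.json"

    # Save a sentencizer that uses a custom character list.
    Sentencizer(punct_chars=[".", "。"]).to_disk(path)
    saved = path.exists()

    # from_disk modifies the new object in place and returns it.
    restored = Sentencizer().from_disk(path)

print(saved, sorted(restored.punct_chars))
```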

Sentencizer.to_bytes

Serialize the sentencizer settings to a bytestring.

Example

config = {"punct_chars": [".", "?", "!", "。"]}
sentencizer = nlp.add_pipe("sentencizer", config=config)
sentencizer_bytes = sentencizer.to_bytes()

Name	Description
RETURNS	The serialized data. ~~bytes~~

Sentencizer.from_bytes

Load the pipe from a bytestring. Modifies the object in place and returns it.

Example

sentencizer_bytes = sentencizer.to_bytes()
sentencizer = nlp.add_pipe("sentencizer")
sentencizer.from_bytes(sentencizer_bytes)

Name	Description
`bytes_data`	The bytestring to load. ~~bytes~~
RETURNS	The modified `Sentencizer` object. ~~Sentencizer~~
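The same round trip works in memory. This sketch, again assuming spaCy v3 and a readable `punct_chars` attribute, serializes one component and restores its settings into another.

```python
from spacy.pipeline import Sentencizer

source = Sentencizer(punct_chars=["!", "?"])
data = source.to_bytes()

# A fresh component starts with the defaults and takes over the saved settings.
target = Sentencizer().from_bytes(data)

print(type(data).__name__, sorted(target.punct_chars))
```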
