spaCy/sentencizer.md at ff6a084e9cdf9114cbc8cb55fe0e9c69e4cabc34

mirror of https://github.com/explosion/spaCy.git synced 2024-09-22 03:49:17 +03:00

Documentation updates for v2.3.0 (#5593)

Co-authored-by: Ines Montani <ines@ines.io>

2020-06-16 15:37:35 +02:00

title	tag	source
Sentencizer	class	spacy/pipeline/pipes.pyx

A simple pipeline component that provides custom sentence boundary detection logic without requiring the dependency parse. By default, sentence segmentation is performed by the DependencyParser, so the Sentencizer lets you implement a simpler, rule-based strategy that doesn't require a statistical model to be loaded. The component is also available via the string name "sentencizer". After initialization, it is typically added to the processing pipeline using nlp.add_pipe.

Compared to the previous SentenceSegmenter class, the Sentencizer component doesn't add a hook to doc.user_hooks["sents"]. Instead, it iterates over the tokens in the Doc and sets the Token.is_sent_start property. The SentenceSegmenter is still available if you import it directly:

from spacy.pipeline import SentenceSegmenter
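
The token-by-token strategy described above can be illustrated without loading spaCy. The following is a simplified sketch, not the library's implementation: it assumes tokens are plain strings, and the helper name mark_sentence_starts and its three-character default set are hypothetical. It flags the first token and every token that follows a run of sentence-final punctuation, which stands in for setting Token.is_sent_start.

```python
def mark_sentence_starts(tokens, punct_chars=(".", "?", "!")):
    """Return one boolean per token: True where a new sentence begins.

    Hypothetical helper for illustration only; tokens are plain strings.
    """
    punct = set(punct_chars)
    flags = [False] * len(tokens)
    seen_end = False  # True once sentence-final punctuation has been seen
    for i, text in enumerate(tokens):
        if text in punct:
            # Trailing punctuation (including runs like "?!") stays with
            # the sentence it closes.
            seen_end = True
        elif seen_end:
            # First non-punctuation token after a sentence end.
            flags[i] = True
            seen_end = False
    if tokens:
        flags[0] = True  # the first token always opens a sentence
    return flags


tokens = ["This", "is", "a", "sentence", ".", "Really", "?", "!", "Yes", "."]
flags = mark_sentence_starts(tokens)
start_indices = [i for i, is_start in enumerate(flags) if is_start]
```

Here start_indices lists the positions where a sentence would begin. Passing a different punct_chars sequence changes which tokens count as sentence ends, in the same spirit as the punct_chars setting documented below.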

Sentencizer.__init__

Initialize the sentencizer.

Example

from spacy.lang.en import English
from spacy.pipeline import Sentencizer

nlp = English()

# Construction via create_pipe
sentencizer = nlp.create_pipe("sentencizer")

# Construction from class
sentencizer = Sentencizer()

Name	Type	Description
`punct_chars`	list	Optional custom list of punctuation characters that mark sentence ends. Defaults to ['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', '।', '॥', '၊', '။', '።', '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄', '᥅', '᪨', '᪩', '᪪', '᪫', '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', '‼', '‽', '⁇', '⁈', '⁉', '⸮', '⸼', '꓿', '꘎', '꘏', '꛳', '꛷', '꡶', '꡷', '꣎', '꣏', '꤯', '꧈', '꧉', '꩝', '꩞', '꩟', '꫰', '꫱', '꯫', '﹒', '﹖', '﹗', '！', '．', '？', '𐩖', '𐩗', '𑁇', '𑁈', '𑂾', '𑂿', '𑃀', '𑃁', '𑅁', '𑅂', '𑅃', '𑇅', '𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', '𑊩', '𑑋', '𑑌', '𑗂', '𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', '𑗑', '𑗒', '𑗓', '𑗔', '𑗕', '𑗖', '𑗗', '𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', '𑩃', '𑪛', '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈', '｡', '。'].
RETURNS	`Sentencizer`	The newly constructed object.

Sentencizer.__call__

Apply the sentencizer to a Doc. Typically, this happens automatically after the component has been added to the pipeline using nlp.add_pipe.

Example

from spacy.lang.en import English

nlp = English()
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)
doc = nlp("This is a sentence. This is another sentence.")
assert len(list(doc.sents)) == 2

Name	Type	Description
`doc`	`Doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline.
RETURNS	`Doc`	The modified `Doc` with added sentence boundaries.
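
Once sentence boundaries are set, sentences can be read off the per-token start flags. The following is a minimal sketch, assuming tokens are plain strings and the flags are a parallel list of booleans standing in for Token.is_sent_start. The helper split_on_starts is hypothetical and not part of spaCy.

```python
def split_on_starts(tokens, starts):
    """Group tokens into sentences, opening a new group at each True flag."""
    sentences = []
    for text, is_start in zip(tokens, starts):
        # Open a new sentence at a start flag, or for the very first token.
        if is_start or not sentences:
            sentences.append([])
        sentences[-1].append(text)
    return sentences


tokens = ["This", "is", "a", "sentence", ".",
          "This", "is", "another", "sentence", "."]
starts = [True, False, False, False, False,
          True, False, False, False, False]
sentences = split_on_starts(tokens, starts)
```

With the flags above, the tokens fall into two groups, which corresponds to the two sentences asserted in the example for this method.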

Sentencizer.to_disk

Save the sentencizer settings (punctuation characters) to disk. Will create a file sentencizer.json. This also happens automatically when you save an nlp object with a sentencizer added to its pipeline.

Example

from spacy.pipeline import Sentencizer

sentencizer = Sentencizer(punct_chars=[".", "?", "!", "。"])
sentencizer.to_disk("/path/to/sentencizer.json")

Name	Type	Description
`path`	unicode / `Path`	A path to a file, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects.

Sentencizer.from_disk

Load the sentencizer settings from a file. Expects a JSON file. This also happens automatically when you load an nlp object or model with a sentencizer added to its pipeline.

Example

from spacy.pipeline import Sentencizer

sentencizer = Sentencizer()
sentencizer.from_disk("/path/to/sentencizer.json")

Name	Type	Description
`path`	unicode / `Path`	A path to a JSON file. Paths may be either strings or `Path`-like objects.
RETURNS	`Sentencizer`	The modified `Sentencizer` object.
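
The save and load steps amount to a round trip of the punctuation settings through a JSON file. The sketch below uses only the standard library to show the idea. The helpers save_settings and load_settings are hypothetical, and the {"punct_chars": [...]} file layout is an assumption for illustration rather than a documented format.

```python
import json
import tempfile
from pathlib import Path


def save_settings(path, punct_chars):
    """Write the punctuation characters to a JSON file (assumed layout)."""
    data = {"punct_chars": list(punct_chars)}
    Path(path).write_text(json.dumps(data), encoding="utf8")


def load_settings(path):
    """Read the punctuation characters back from the JSON file."""
    data = json.loads(Path(path).read_text(encoding="utf8"))
    return data["punct_chars"]


# Round trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    settings_path = Path(tmp_dir) / "sentencizer.json"
    save_settings(settings_path, [".", "?", "!", "。"])
    restored = load_settings(settings_path)
```

The restored list matches the list that was saved, including the non-ASCII full stop, because JSON escapes and unescapes it transparently.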

Sentencizer.to_bytes

Serialize the sentencizer settings to a bytestring.

Example

from spacy.pipeline import Sentencizer

sentencizer = Sentencizer(punct_chars=[".", "?", "!", "。"])
sentencizer_bytes = sentencizer.to_bytes()

Name	Type	Description
RETURNS	bytes	The serialized data.

Sentencizer.from_bytes

Load the pipe from a bytestring. Modifies the object in place and returns it.

Example

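from spacy.pipeline import Sentencizer

# Serialize one sentencizer, then load its settings into a fresh instance
sentencizer_bytes = Sentencizer(punct_chars=[".", "?", "!", "。"]).to_bytes()
sentencizer = Sentencizer()
sentencizer.from_bytes(sentencizer_bytes)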

Name	Type	Description
`bytes_data`	bytes	The bytestring to load.
RETURNS	`Sentencizer`	The modified `Sentencizer` object.
