spaCy/website/docs/api
Daniël de Kok 8d69874afb
Add spacy.PlainTextCorpusReader.v1 (#12122)
* Add `spacy.PlainTextCorpusReader.v1`

This is a corpus reader that reads plain text corpora with the following
format:

- UTF-8 encoding.
- One line per document.
- Blank lines are ignored.

It is useful for applications that process very large corpora, such as
distillation, where the space overhead of serialized formats is
undesirable. Additionally, many large corpora are already distributed in
this text format, which keeps the necessary preprocessing to a minimum.
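
The format rules above can be illustrated with a minimal sketch in plain
Python. It does not use spaCy and is not the reader added by this commit:
the helper name `iter_plain_text_docs` is hypothetical, and it yields raw
strings, whereas the actual reader presumably wraps each line in spaCy
objects. It only shows the line handling: UTF-8 decoding, one document per
line, blank lines skipped, and only newline/carriage return stripped.

```python
import tempfile
from pathlib import Path
from typing import Iterator


def iter_plain_text_docs(path: Path) -> Iterator[str]:
    """Yield one document text per non-blank line of a UTF-8 file.

    Hypothetical helper for illustration only.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Strip only the line terminator; other whitespace is kept.
            text = line.rstrip("\r\n")
            # Blank lines do not produce a document.
            if text:
                yield text


# Usage: write a small corpus into a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = Path(tmp_dir) / "corpus.txt"
    corpus_path.write_bytes(
        "First document.\n\nSecond document.\r\n  indented third\n".encode("utf-8")
    )
    docs = list(iter_plain_text_docs(corpus_path))

print(docs)
# ['First document.', 'Second document.', '  indented third']
```

Because the file is read lazily line by line, memory use stays independent
of corpus size, which matches the large-corpus motivation given above.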

* Update spacy/training/corpus.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* docs: add version to `PlainTextCorpus`

* Add docstring to registry function

* Add plain text corpus tests

* Only strip newline/carriage return

* Add return type to `_string_to_tmp_file` helper

* Use a temporary directory in place of file name

Different OS auto delete/sharing semantics are just wonky.

* This will be new in 3.5.1 (rather than 4)

* Test improvements from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-26 11:33:22 +01:00
architectures.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
attributeruler.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
attributes.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
cli.mdx Fix broken syntax for type annotations (#12171) 2023-01-25 08:51:25 +01:00
coref.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
corpus.mdx Add spacy.PlainTextCorpusReader.v1 (#12122) 2023-01-26 11:33:22 +01:00
cython-classes.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
cython-structs.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
cython.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
data-formats.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
dependencymatcher.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
dependencyparser.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
doc.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
docbin.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
edittreelemmatizer.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
entitylinker.mdx API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar (#12128) 2023-01-19 13:29:17 +01:00
entityrecognizer.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
entityruler.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
example.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
index.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
inmemorylookupkb.mdx API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar (#12128) 2023-01-19 13:29:17 +01:00
kb.mdx API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar (#12128) 2023-01-19 13:29:17 +01:00
language.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
legacy.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
lemmatizer.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
lexeme.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
lookups.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
matcher.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
morphologizer.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
morphology.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
phrasematcher.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
pipe.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
pipeline-functions.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
scorer.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
sentencerecognizer.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
sentencizer.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
span-resolver.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
span.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
spancategorizer.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
spangroup.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
spanruler.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
stringstore.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
tagger.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
textcategorizer.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
tok2vec.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
token.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
tokenizer.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
top-level.mdx Clean up displacy port-related error messages, docs (#12089) 2023-01-12 14:54:09 +09:00
transformer.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
vectors.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00
vocab.mdx Website migration from Gatsby to Next (#12058) 2023-01-11 17:30:07 +01:00