* Normalize whitespace in evaluate CLI output test
Depending on terminal settings, lines may be padded to the screen width
so the comparison is too strict with only the command string replacement.
* Move to test util method
* Change to normalization method
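The whitespace normalization described above can be sketched as follows. This is a minimal illustration, not the actual test utility: the helper name `normalize_whitespace` is hypothetical, and collapsing every run of whitespace into a single space is an assumption about what "normalization" means here.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse each run of whitespace (spaces, tabs, newlines) into one space."""
    # Hypothetical helper: the real test utility may normalize differently.
    return re.sub(r"\s+", " ", text).strip()


# Output padded to the terminal width versus the same output without padding.
padded = "Results   \n\nTOK     100.00      \n"
plain = "Results\n\nTOK 100.00\n"

# After normalization the two outputs compare equal.
assert normalize_whitespace(padded) == normalize_whitespace(plain)
```

Comparing normalized strings keeps the test independent of how wide the terminal is.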
* Add span_id to Span.char_span, update Doc/Span.char_span docs
`Span.char_span(id=)` should be removed in the future.
* Also use Union[int, str] in Doc docstring
* WIP
* rm ipython embeds
* rm total
* WIP
* cleanup
* cleanup + reword
* rm component function
* remove migration support form
* fix reference dataset for dev data
* additional fixes
- set approach to identifying unique trees
- adjust line length on messages
- add logic for detecting docs without annotations
* use 0 instead of none for no annotation
* partial annotation support
* initial tests for _compile_gold lemma attributes
Using the example data from the edit tree lemmatizer tests for:
- lemmatizer_trees
- partial_lemma_annotations
- n_low_cardinality_lemmas
- no_lemma_annotations
* adds output test for cli app
* switch msg level
* rm unclear uniqueness check
* Revert "rm unclear uniqueness check"
This reverts commit 6ea2b3524b.
* remove good message on uniqueness
* formatting
* use en_vocab fixture
* clarify data set source in messages
* remove unnecessary import
Co-authored-by: svlandeg <svlandeg@github.com>
* Add `spacy.PlainTextCorpusReader.v1`
This is a corpus reader that reads plain text corpora with the following
format:
- UTF-8 encoding
- One line per document.
- Blank lines are ignored.
It is useful for applications where we deal with very large corpora,
such as distillation, and don't want to deal with the space overhead of
serialized formats. Additionally, many large corpora already use such
a text format, keeping the necessary preprocessing to a minimum.
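The file format described above can be read with a few lines of plain Python. The sketch below only illustrates the format and is not the `spacy.PlainTextCorpusReader.v1` implementation: the function name `read_plain_text` is hypothetical, and it yields raw strings rather than spaCy objects.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Union


def read_plain_text(path: Union[str, Path]) -> Iterator[str]:
    """Yield one document per non-empty line of a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Strip only the line terminator, keeping other whitespace intact.
            text = line.rstrip("\r\n")
            if text:  # blank lines are ignored
                yield text


# Usage: write a small corpus into a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = Path(tmp_dir) / "corpus.txt"
    corpus_path.write_text("First document.\n\nSecond document.\n", encoding="utf-8")
    docs = list(read_plain_text(corpus_path))
```

Because the file is streamed line by line, memory use stays independent of corpus size.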
* Update spacy/training/corpus.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* docs: add version to `PlainTextCorpus`
* Add docstring to registry function
* Add plain text corpus tests
* Only strip newline/carriage return
* Add return type to _string_to_tmp_file helper
* Use a temporary directory in place of file name
Different OS auto delete/sharing semantics are just wonky.
* This will be new in 3.5.1 (rather than 4)
* Test improvements from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Rename CSS class to make use more clear
* Rename component prop to improve code readability
* Fix `aria-hidden` directly on a link element
This link wouldn't have been clickable for screen reader users
* Refactor component
This removes an unnecessary `div` and a duplicate link
Co-authored-by: Ines Montani <ines@ines.io>
Originally introduced in 62b9c9c6d7
Original error: Warning: Invalid DOM property `class`. Did you mean `className`?
React doesn't have `class`, it uses `className`.
* Fix missing comma
* Activate user zoom for website
This is recommended by lighthouse:
> Disabling zooming is problematic for users with low vision who rely on screen magnification to properly see the contents of a web page.
Also iOS already ignores this attribute anyway.
* Fix gap in landing pattern at the top
* Replace `.jpg` patterns with `.png`
This drastically reduces file size (for the landing page from 221kb to 57kb) while doubling the resolution to look sharper on retina displays.
* Refactor _scores2guesses
* Handle arrays on GPU
* Convert argmax result to raw integer
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Use NumpyOps() to copy data to CPU
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Changes based on review comments
* Use different _scores2guesses depending on tree_k
* Add tests for corner cases
* Add empty line for consistency
* Improve naming
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* Improve naming
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* skeleton
* Fill in non-CLI details from release notes draft
* Add TODO for fuzzy matching
* Website updates for v3-5 draft
* Fill in usage examples
* Add fuzzy matching to intro
* Fix fuzzy examples
* Shell example formatting
* Fix typo
* Format
* Remove trailing periods in internal list
* Update
* Fix spacing for nested lists
* Update InMemoryLookupKB link
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
* API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar
* adjust to mdx
* linkout to InMemoryLookupKB at first occurrence in kb.mdx
* fix links to docs
* revert Azure trigger setting (I'll make a separate PR)
Co-authored-by: svlandeg <svlandeg@github.com>
* Update years in website landing page
* Update website/pages/index.tsx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update Dockerfile to work with Next.js
- Update to Node 18
- Do not run as root; this also works better with Node
privilege-dropping.
- Update README with new run instructions and add the
`--rm` flag to avoid leaving a bunch of unused Docker
containers.
- Also change README to recommend building the image locally.
Image builds are pretty fast and the uploaded images get
outdated pretty quickly.
* Add .dockerignore to avoid sending large build contexts
* Typo
* Clarify how `--code` arg works
This adds a few sentences to the docs to clarify how the `--code`
argument works, including an explanation of how to load custom
components in your own code.
* Add link to spacy.load docs
* Add a `spacy evaluate speed` subcommand
This subcommand reports the mean batch performance of a model on a data set with
a 95% confidence interval. For reliability, it first performs some warmup
rounds. Then it will measure performance on batches with randomly shuffled
documents.
To avoid having too many spaCy commands, `speed` is a subcommand of `evaluate`
and accuracy evaluation is moved to its own `evaluate accuracy` subcommand.
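The measurement procedure described above (warmup rounds, then timing batches of randomly shuffled documents and reporting a mean with a 95% confidence interval) can be sketched in plain Python. Everything here is an assumption for illustration: the function name `benchmark_speed`, its parameters, whitespace-based word counting, and the normal-approximation interval (1.96 standard errors) are not taken from the spaCy implementation, which may compute the interval differently.

```python
import math
import random
import statistics
import time
from typing import Callable, List, Sequence, Tuple


def benchmark_speed(
    process_batch: Callable[[List[str]], object],
    docs: Sequence[str],
    batch_size: int = 2,
    n_warmup: int = 2,
    n_batches: int = 10,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) words/s over randomly sampled batches."""
    if not docs:
        raise ValueError("Cannot benchmark speed on an empty corpus")
    if n_batches < 2:
        raise ValueError("Need at least two batches to estimate an interval")
    rng = random.Random(seed)
    docs = list(docs)

    # Warmup rounds: run the pipeline without recording timings.
    for _ in range(n_warmup):
        process_batch(docs[:batch_size])

    rates = []
    for _ in range(n_batches):
        # Sample a batch of documents in random order.
        batch = rng.sample(docs, k=min(batch_size, len(docs)))
        n_words = sum(len(doc.split()) for doc in batch)
        start = time.perf_counter()
        process_batch(batch)
        # Guard against a zero reading on coarse timers.
        elapsed = max(time.perf_counter() - start, 1e-9)
        rates.append(n_words / elapsed)

    mean = statistics.mean(rates)
    # Normal approximation: 95% interval is about 1.96 standard errors wide.
    half_width = 1.96 * statistics.stdev(rates) / math.sqrt(len(rates))
    return mean, mean - half_width, mean + half_width


# Usage with a stand-in "pipeline" that just upper-cases each document.
mean, low, high = benchmark_speed(
    lambda batch: [doc.upper() for doc in batch],
    ["a b c", "d e", "f g h i"],
)
```

The warmup rounds keep one-time costs such as lazy initialization out of the recorded timings.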
* Fix import cycle
* Restore `spacy evaluate`, make `spacy benchmark speed` an alias
* Add documentation for `spacy benchmark`
* CREATES -> PRINTS
* WPS -> words/s
* Disable formatting of benchmark speed arguments
* Fail with an error message when trying to speed bench empty corpus
* Make it clearer that `benchmark accuracy` is a replacement for `evaluate`
* Fix docstring webpage reference
* tests: check `evaluate` output against `benchmark accuracy`