15815 Commits

Author SHA1 Message Date
Adriane Boyd
606273f7e4
Normalize whitespace in evaluate CLI output test (#12157)
* Normalize whitespace in evaluate CLI output test

Depending on terminal settings, lines may be padded to the screen width
so the comparison is too strict with only the command string replacement.

* Move to test util method

* Change to normalization method
2023-01-27 16:13:34 +01:00
Sofie Van Landeghem
bd739e67d6
explain KB change and how to remedy (#12189) 2023-01-27 15:13:20 +01:00
Adriane Boyd
5f8a398bb9
Add span_id to Span.char_span, update Doc/Span.char_span docs (#12196)
* Add span_id to Span.char_span, update Doc/Span.char_span docs

`Span.char_span(id=)` should be removed in the future.

* Also use Union[int, str] in Doc docstring
2023-01-27 15:09:17 +01:00
Simon Gurcke
774c10fa39
Add alignment_mode argument to Span.char_span() (#12145)
* Add alignment_mode argument to Span.char_span()

* Update website

* Update spacy/tokens/span.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-27 11:43:40 +01:00
Peter Baumgartner
c68e6b8a96
trainable_lemmatizer in debug data (#11419)
* WIP

* rm ipython embeds

* rm total

* WIP

* cleanup

* cleanup + reword

* rm component function

* remove migration support form

* fix reference dataset for dev data

* additional fixes

- set approach to identifying unique trees
- adjust line length on messages
- add logic for detecting docs without annotations

* use 0 instead of none for no annotation

* partial annotation support

* initial tests for _compile_gold lemma attributes

Using the example data from the edit tree lemmatizer tests for:
- lemmatizer_trees
- partial_lemma_annotations
- n_low_cardinality_lemmas
- no_lemma_annotations

* adds output test for cli app

* switch msg level

* rm unclear uniqueness check

* Revert "rm unclear uniqueness check"

This reverts commit 6ea2b3524b.

* remove good message on uniqueness

* formatting

* use en_vocab fixture

* clarify data set source in messages

* remove unnecessary import

Co-authored-by: svlandeg <svlandeg@github.com>
2023-01-26 17:36:50 +01:00
Daniël de Kok
8d69874afb
Add spacy.PlainTextCorpusReader.v1 (#12122)
* Add `spacy.PlainTextCorpusReader.v1`

This is a corpus reader that reads plain text corpora with the following
format:

- UTF-8 encoding
- One line per document.
- Blank lines are ignored.

It is useful for applications where we deal with very large corpora,
such as distillation, and don't want to deal with the space overhead of
serialized formats. Additionally, many large corpora already use such
a text format, keeping the necessary preprocessing to a minimum.

* Update spacy/training/corpus.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* docs: add version to `PlainTextCorpus`

* Add docstring to registry function

* Add plain text corpus tests

* Only strip newline/carriage return

* Add return type _string_to_tmp_file helper

* Use a temporary directory in place of file name

Different OS auto delete/sharing semantics are just wonky.

* This will be new in 3.5.1 (rather than 4)

* Test improvements from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-26 11:33:22 +01:00
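
The plain-text corpus format described in the commit above (UTF-8, one document per line, blank lines ignored, only newline/carriage return stripped) can be sketched in plain Python as follows. This is an illustration of the format only, not the code of `spacy.PlainTextCorpusReader.v1`. The function name `iter_plain_text_docs` is hypothetical, and treating whitespace-only lines as non-blank is an assumption that follows from stripping nothing but the line terminator.

```python
import tempfile
from pathlib import Path
from typing import Iterator


def iter_plain_text_docs(path: Path) -> Iterator[str]:
    """Yield one document text per non-blank line of a UTF-8 text file.

    Hypothetical helper for illustration; not part of spaCy.
    """
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            # Strip only the line terminator so that any other leading or
            # trailing whitespace in the document text is preserved.
            text = line.rstrip("\r\n")
            # Assumption: a line counts as blank only if nothing is left
            # after removing the terminator.
            if text:
                yield text


if __name__ == "__main__":
    # Use a temporary directory rather than a temporary file name, which
    # avoids OS-specific auto-delete and file-sharing behaviour.
    with tempfile.TemporaryDirectory() as tmp_dir:
        corpus_path = Path(tmp_dir) / "corpus.txt"
        corpus_path.write_text(
            "First document.\n\nSecond document. \n", encoding="utf-8"
        )
        for text in iter_plain_text_docs(corpus_path):
            print(repr(text))
```

Because the file is read lazily line by line, memory use stays flat regardless of corpus size, which matches the motivation given in the commit message for very large corpora.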
Marcus Blättermann
a37117abd0
Fix text colors in docs (#12186) 2023-01-26 10:30:24 +01:00
Marcus Blättermann
056b73468c
Load components dynamically (decrease initial file size for docs) (#12175)
* Extract `CodeBlock` component into own file

* Extract `InlineCode` component into own file

* Extract `TypeAnnotation` component into own file

* Convert named `export` to `default export`

* Remove unused `export`

* Simplify `TypeAnnotation` to remove dependency for Prism

* Load `Code` component dynamically

* Extract `MarkdownToReact` component into own file

* WIP Code Dynamic

* Load `MarkdownToReact` component dynamically

* Extract `htmlToReact` to own file

* Load `htmlToReact` component dynamically

* Dynamically load `Juniper`
2023-01-25 17:30:41 +01:00
Adriane Boyd
07dfa54669
CI: Extend website excludes (#12185) 2023-01-25 15:35:17 +01:00
Marcus Blättermann
11f10fff60
Fix frontpage image (#12184) 2023-01-25 13:17:35 +01:00
Marcus Blättermann
5a6000fb8b
Fix text color in docs (#12183)
* Fix text color on landing page

* Fix code color
2023-01-25 13:14:32 +01:00
Adriane Boyd
8ea15240ca
Update binder version to v3.5 (#12153) 2023-01-25 13:14:23 +01:00
Adriane Boyd
2dbb764183
CI: Add black formatting check to validation (#12182) 2023-01-25 12:51:37 +01:00
Marcus Blättermann
99a05734a8
Add aria-label to quickstart widget (#12179) 2023-01-25 11:46:55 +01:00
Marcus Blättermann
0298b1a863
WEB-28 Increase contrast of grey text (#12178)
* Use transparent colors to increase contrast on darker backgrounds

* Increase color contrast of grey text
2023-01-25 11:46:43 +01:00
Marcus Blättermann
3062fae2ca
Fix broken URL (#12176) 2023-01-25 11:42:19 +01:00
Marcus Blättermann
57ba37bc52
Fix regression with links in prompts (#12172) 2023-01-25 08:51:40 +01:00
Marcus Blättermann
05a3685849
Fix broken syntax for type annotations (#12171) 2023-01-25 08:51:25 +01:00
Marcus Blättermann
f3c586f74a
Fix navigation alert (#12169)
Fixes a regression introduced in #12163
2023-01-24 16:40:40 +01:00
Marcus Blättermann
49237f05a6
Fix aria-hidden element (#12163)
* Rename CSS class to make use more clear

* Rename component prop to improve code readability

* Fix `aria-hidden` directly on a link element

This link wouldn't have been clickable by screenreaders

* Refactor component

This removes an unnecessary `div` and a duplicate link

Co-authored-by: Ines Montani <ines@ines.io>
2023-01-24 14:44:47 +01:00
Marcus Blättermann
0a70696923
Fix wrong HTML element attribute (#12151)
Originally introduced in 62b9c9c6d7

Original error: Warning: Invalid DOM property `class`. Did you mean `className`?

React doesn't have `class`, it uses `className`.
2023-01-24 14:35:31 +01:00
Marcus Blättermann
9555e7aecf
Remove unnecessary links (#12159)
There is no need to link to the image we are already viewing and this is also considered an accessibility issue.
2023-01-24 14:01:00 +01:00
Marcus Blättermann
031f6c7b60
WEB-27 Add alt tags to images (#12166)
* Update spaCy badge `alt` text

* Add `next/image` component to Universe

* Add missing `alt` texts
2023-01-24 13:56:14 +01:00
Marcus Blättermann
c9beb47ab7
Increase contrast of text and theme color (#12165) 2023-01-24 13:55:20 +01:00
Marcus Blättermann
a7d6a62f7c
Remove zoom locking (#12164)
* Fix missing comma

* Activate user zoom for website

This is recommended by lighthouse:

> Disabling zooming is problematic for users with low vision who rely on screen magnification to properly see the contents of a web page.

Also iOS already ignores this attribute anyway.
2023-01-24 13:54:49 +01:00
Marcus Blättermann
48159e1d60
Update explosion logo (#12162)
This fixes a misalignment of the explosion logo
2023-01-24 13:53:51 +01:00
Marcus Blättermann
7160f7835d
Fix GitHub badge (#12161)
* Extract component

* Remove rounded border from GitHub Stars badge

* Add `alt` text
2023-01-24 13:53:28 +01:00
Marcus Blättermann
3aa61e615f
Add missing label (#12160) 2023-01-24 13:52:55 +01:00
Marcus Blättermann
fcedcd54a8
WEB-30 spaCy pattern in .png (#12158)
* Fix gap in landing pattern at the top

* Replace `.jpg` patterns with `.png`

This drastically reduces file size (for the landing page from 221kb to 57kb) while doubling the resolution to look sharper on retina displays.
2023-01-24 13:51:39 +01:00
Sofie Van Landeghem
de1fe8dce3
Fix Azure ignoring website files (#12129)
* ignore all mdx files and all files in website

* have both .md and .mdx

* exclude everything but universe.json
2023-01-24 10:02:07 +01:00
Edward
e9048fd4a1
Add how to load probability tables to existing models to spaCy docs (#12051)
* add section about adding tables to models

* change to lexeme_norm

* Change syntax

* change to _prob

* Update website/docs/usage/saving-loading.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-24 10:01:22 +01:00
Raphael Mitsch
950fceceb6
Make test_cli_find_threshold() more robust. (#12148) 2023-01-23 14:42:33 +01:00
Richard Hudson
f9e020dd67
Fix speed problem with top_k>1 on CPU in edit tree lemmatizer (#12017)
* Refactor _scores2guesses

* Handle arrays on GPU

* Convert argmax result to raw integer

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Use NumpyOps() to copy data to CPU

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Changes based on review comments

* Use different _scores2guesses depending on tree_k

* Add tests for corner cases

* Add empty line for consistency

* Improve naming

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

* Improve naming

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2023-01-20 19:34:11 +01:00
Marcus Blättermann
8a3ca77d9e
Fix broken social media image (#12137) 2023-01-20 16:57:43 +01:00
Adriane Boyd
dec81508d2
Update README for v3.5 (#12132) 2023-01-19 16:13:41 +01:00
Sofie Van Landeghem
0f5d8a27f2
3.5 usage page (#12057)
* skeleton

* Fill in non-CLI details from release notes draft

* Add TODO for fuzzy matching

* Website updates for v3-5 draft

* Fill in usage examples

* Add fuzzy matching to intro

* Fix fuzzy examples

* Shell example formatting

* Fix typo

* Format

* Remove trailing periods in internal list

* Update

* Fix spacing for nested lists

* Update InMemoryLookupKB link

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
2023-01-19 16:13:04 +01:00
Adriane Boyd
1e993d3b03
Merge pull request #12121 from adrianeboyd/chore/v3.5.0-2
Revert "Temporarily skip tests that require models/compat"
2023-01-19 15:59:30 +01:00
Adriane Boyd
3b8918e166
API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar (#12128)
* API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar

* adjust to mdx

* linkout to InMemoryLookupKB at first occurrence in kb.mdx

* fix links to docs

* revert Azure trigger setting (I'll make a separate PR)

Co-authored-by: svlandeg <svlandeg@github.com>
2023-01-19 13:29:17 +01:00
Adriane Boyd
a9910b6081
Update years in website landing page (#12107)
* Update years in website landing page

* Update website/pages/index.tsx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-01-19 11:08:02 +01:00
Sofie Van Landeghem
7d88c55eeb
update docs for apply (#12127)
* update docs for apply

* prettier
2023-01-19 10:37:09 +01:00
Adriane Boyd
28fd589b85
Move all website gitignore settings to website/.gitignore (#12120) 2023-01-18 21:46:19 +01:00
Daniël de Kok
668ec989ad
Update Dockerfile to work with Next.js (#12119)
* Update Dockerfile to work with Next.js

- Update to Node 18
- Do not run as root; this also works better with Node
  privilege-dropping.
- Update README with new run instructions and add the
  `--rm` flag to avoid leaving a bunch of unused Docker
  containers.
- Also change README to recommend building the image locally.
  Image builds are pretty fast and the uploaded images get
  outdated pretty quickly.

* Add .dockerignore to avoid sending large build contexts

* Typo
2023-01-18 18:15:47 +01:00
Adriane Boyd
dc0f527039
Revert "Temporarily skip tests that require models/compat"
This reverts commit 378db0eb1e.
2023-01-18 12:54:56 +01:00
Adriane Boyd
794cea6907
Fix comments and examples for levenshtein_compare (#12113) 2023-01-18 08:02:33 +01:00
Paul O'Leary McCann
a3b15c9f53
Clarify how --code arg works (#12102)
* Clarify how `--code` arg works

This adds a few sentences to the docs to clarify how the `--code`
argument works, including an explanation of how to load custom
components in your own code.

* Add link to spacy.load docs
2023-01-17 19:30:02 +09:00
Albert Villanova del Moral
25373d8e8e
Fix required maximum version of typing-extensions (#12036)
* Fix required maximum version of typing-extensions

* Restrict to <4.5.0, sync minimum pin

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-13 10:44:02 +01:00
github-actions[bot]
9ef7d26032
Auto-format code with black (#12100)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2023-01-13 10:12:10 +01:00
Daniël de Kok
dda7331da3
Handle missing annotations in the edit tree lemmatizer (#12098)
The losses/gradients of missing annotations were not correctly masked
out. Fix this and check the masking in the partial data test.
2023-01-12 12:13:55 +01:00
Daniël de Kok
319eb508b5
Add a spacy benchmark speed subcommand (#11902)
* Add a `spacy evaluate speed` subcommand

This subcommand reports the mean batch performance of a model on a data set with
a 95% confidence interval. For reliability, it first performs some warmup
rounds. Then it will measure performance on batches with randomly shuffled
documents.

To avoid having too many spaCy commands, `speed` is a subcommand of `evaluate`
and accuracy evaluation is moved to its own `evaluate accuracy` subcommand.

* Fix import cycle

* Restore `spacy evaluate`, make `spacy benchmark speed` an alias

* Add documentation for `spacy benchmark`

* CREATES -> PRINTS

* WPS -> words/s

* Disable formatting of benchmark speed arguments

* Fail with an error message when trying to speed bench empty corpus

* Make it clearer that `benchmark accuracy` is a replacement for `evaluate`

* Fix docstring webpage reference

* tests: check `evaluate` output against `benchmark accuracy`
2023-01-12 11:55:21 +01:00
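
The measurement procedure described in the commit above (warmup rounds, then timing batches of randomly shuffled documents and reporting a mean with a 95% confidence interval) can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the code behind `spacy benchmark speed`. The function name, its parameters, the whitespace-based word count and the normal-approximation interval (mean ± 1.96 standard errors) are all choices made for illustration; the commit message does not say how the interval is computed.

```python
import math
import random
import statistics
import time
from typing import Callable, List, Sequence, Tuple


def benchmark_words_per_second(
    process: Callable[[Sequence[str]], None],
    texts: Sequence[str],
    *,
    warmup_rounds: int = 3,
    batches: int = 50,
    batch_size: int = 100,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) words/s over randomly sampled batches.

    Hypothetical helper for illustration; not part of spaCy.
    """
    if not texts:
        # Fail with a clear error on an empty corpus.
        raise ValueError("Cannot benchmark speed on an empty corpus.")
    if batches < 2:
        raise ValueError("At least two batches are needed for an interval.")

    # Warmup: run the pipeline a few times so caches and lazy
    # initialisation do not distort the timed batches.
    for _ in range(warmup_rounds):
        process(texts)

    rng = random.Random(seed)
    pool = list(texts)
    rates: List[float] = []
    for _ in range(batches):
        # Draw a randomly ordered batch of documents.
        batch = rng.sample(pool, k=min(batch_size, len(pool)))
        # Assumption: words are approximated by whitespace-separated tokens.
        n_words = sum(len(text.split()) for text in batch)
        start = time.perf_counter()
        process(batch)
        # Guard against a zero reading from a coarse timer.
        elapsed = max(time.perf_counter() - start, 1e-9)
        rates.append(n_words / elapsed)

    mean = statistics.mean(rates)
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * statistics.stdev(rates) / math.sqrt(len(rates))
    return mean, mean - half_width, mean + half_width


if __name__ == "__main__":
    def toy_pipeline(batch: Sequence[str]) -> None:
        # Stand-in for a real pipeline call.
        for text in batch:
            text.lower().split()

    sample_texts = ["This is a short document.", "Another one here."] * 200
    mean, low, high = benchmark_words_per_second(toy_pipeline, sample_texts)
    print(f"{mean:.0f} words/s (95% CI: {low:.0f} to {high:.0f})")
```

Reporting an interval rather than a single number makes it easier to tell whether a difference between two runs is larger than the run-to-run noise.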
Paul O'Leary McCann
8e558095a1
Clean up displacy port-related error messages, docs (#12089)
* Clean up displacy port-related error messages, docs

There were some issues in the error messages and docs in #11948.

1. the error messages didn't specify the port argument to displacy.serve correctly
2. the docs didn't mark the auto select argument as new

This addresses those issues.

* Update website/docs/api/top-level.md

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* Apply prettier

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-01-12 14:54:09 +09:00