Commit Graph

15830 Commits

Author SHA1 Message Date
Adriane Boyd
4539fbae17
Revert "Fix FUZZY operator definition (#12318)" (#12336)
This reverts commit daedc45d05.

The default length depends on the length of the pattern string and was
correct for this example.
2023-02-27 09:48:36 +01:00
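The revert above notes that the default FUZZY edit distance depends on the length of the pattern string. A minimal sketch of such a length-dependent default, assuming the rule described in the spaCy matcher docs (at least 2 edits, and up to 30% of the pattern length). The helper name `default_fuzzy_edits` is hypothetical and not part of spaCy:

```python
def default_fuzzy_edits(pattern: str) -> int:
    """Hypothetical helper: default maximum edit distance for FUZZY.

    Assumption: at least 2 edits, and up to 30% of the pattern length.
    """
    return max(2, round(0.3 * len(pattern)))


# Short patterns fall back to the floor of 2; longer ones scale with length.
for pattern in ("the", "definition", "internationalization"):
    print(pattern, len(pattern), default_fuzzy_edits(pattern))
```

Under this assumption a fixed value such as 2 or 3 only holds for patterns of a particular length, which is consistent with the reasoning given for the revert.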
Kevin Humphreys
acdd993071
Matcher performance fix for extension predicates: use shared key function (#12272)
* standardize predicate key format

* single key function

* Make optional args in key function keyword-only

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-02-27 08:35:08 +01:00
Paul O'Leary McCann
1e8bac99f3
Add tests for projects to master (#12303)
* Add tests for projects to master

* Fix git clone related issues on Windows

* Add stat import
2023-02-23 10:22:57 +01:00
andyjessen
daedc45d05
Fix FUZZY operator definition (#12318)
* Fix FUZZY operator definition

The default length of the FUZZY operator is 2 and not 3.

* adjust edit distance in matcher usage docs too

---------

Co-authored-by: svlandeg <svlandeg@github.com>
2023-02-23 09:37:40 +01:00
Adriane Boyd
80bc140533
Add grc to langs with lexeme norms in spacy-lookups-data (#12287)
2023-02-16 17:57:02 +01:00
Edward
61b8454137
Adjust return type of registry.find (#12227)
* Fix registry find return type

* add dot

* Add type ignore for mypy

* update black formatting version

* add mypy ignore to package cli

* mypy type fix (for real)

* Update find description in spacy/util.py

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* adjust mypy directive

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-02-15 12:32:53 +01:00
Raphael Mitsch
2d4fb94ba0
Fix wrong file name in docs for rule-based matcher. (#12262)
2023-02-09 12:58:14 +01:00
Adriane Boyd
9d920bafcf
Extend mypy to v1.0.x (#12245)
2023-02-08 14:33:16 +01:00
Raphael Mitsch
d38a88f0f3
Remove negation. (#12252)
2023-02-08 14:18:33 +01:00
Adriane Boyd
9a454676f3
Use black version constraints from requirements.txt (#12220)
2023-02-03 11:44:10 +01:00
Sofie Van Landeghem
79ef6cf0f9
Have logging calls use string formatting types (#12215)
* change logging call for spacy.LookupsDataLoader.v1

* substitutions in language and _util

* various more substitutions

* add string formatting guidelines to contribution guidelines
2023-02-02 11:15:22 +01:00
Sofie Van Landeghem
4c60afb946
Backslash fixes in docs (#12213)
* backslash fixes

* revert unrelated change
2023-02-01 10:15:38 +01:00
Raphael Mitsch
02af17a5c8
Remove flaky assertions. (#12210)
2023-01-31 16:52:06 +01:00
Adriane Boyd
0e51c918ae
Normalize whitespace in evaluate CLI output test (#12157)
* Normalize whitespace in evaluate CLI output test

Depending on terminal settings, lines may be padded to the screen width
so the comparison is too strict with only the command string replacement.

* Move to test util method

* Change to normalization method
2023-01-30 17:51:27 +01:00
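The commit above explains that CLI output may be padded to the terminal width, so a strict string comparison can fail. A minimal sketch of one way to normalize whitespace before comparing; the function name `normalize_whitespace` and the sample output are hypothetical, and the actual test utility in spaCy may differ:

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space and trim both ends."""
    return " ".join(text.split())


# Output padded to the terminal width, as a CLI table might produce it.
padded = "TOK     100.00   \nTAG      95.12            \n"
expected = "TOK 100.00 TAG 95.12"

assert normalize_whitespace(padded) == normalize_whitespace(expected)
```

Normalizing both sides keeps the comparison independent of padding and line wrapping while still checking the tokens and their order.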
Paul O'Leary McCann
8932f4dc35
Add extra flag to assets docs (#12194)
* Add extra flag to assets docs

For some reason this wasn't included.

* Add new tag to docs
2023-01-30 10:05:23 +01:00
Adriane Boyd
606273f7e4
Normalize whitespace in evaluate CLI output test (#12157)
* Normalize whitespace in evaluate CLI output test

Depending on terminal settings, lines may be padded to the screen width
so the comparison is too strict with only the command string replacement.

* Move to test util method

* Change to normalization method
2023-01-27 16:13:34 +01:00
Sofie Van Landeghem
bd739e67d6
explain KB change and how to remedy (#12189)
2023-01-27 15:13:20 +01:00
Adriane Boyd
5f8a398bb9
Add span_id to Span.char_span, update Doc/Span.char_span docs (#12196)
* Add span_id to Span.char_span, update Doc/Span.char_span docs

`Span.char_span(id=)` should be removed in the future.

* Also use Union[int, str] in Doc docstring
2023-01-27 15:09:17 +01:00
Simon Gurcke
774c10fa39
Add alignment_mode argument to Span.char_span() (#12145)
* Add alignment_mode argument to Span.char_span()

* Update website

* Update spacy/tokens/span.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-27 11:43:40 +01:00
Peter Baumgartner
c68e6b8a96
trainable_lemmatizer in debug data (#11419)
* WIP

* rm ipython embeds

* rm total

* WIP

* cleanup

* cleanup + reword

* rm component function

* remove migration support form

* fix reference dataset for dev data

* additional fixes

- set approach to identifying unique trees
- adjust line length on messages
- add logic for detecting docs without annotations

* use 0 instead of none for no annotation

* partial annotation support

* initial tests for _compile_gold lemma attributes

Using the example data from the edit tree lemmatizer tests for:
- lemmatizer_trees
- partial_lemma_annotations
- n_low_cardinality_lemmas
- no_lemma_annotations

* adds output test for cli app

* switch msg level

* rm unclear uniqueness check

* Revert "rm unclear uniqueness check"

This reverts commit 6ea2b3524b.

* remove good message on uniqueness

* formatting

* use en_vocab fixture

* clarify data set source in messages

* remove unnecessary import

Co-authored-by: svlandeg <svlandeg@github.com>
2023-01-26 17:36:50 +01:00
Daniël de Kok
8d69874afb
Add spacy.PlainTextCorpusReader.v1 (#12122)
* Add `spacy.PlainTextCorpusReader.v1`

This is a corpus reader that reads plain text corpora with the following
format:

- UTF-8 encoding
- One line per document.
- Blank lines are ignored.

It is useful for applications where we deal with very large corpora,
such as distillation, and don't want to deal with the space overhead of
serialized formats. Additionally, many large corpora already use such
a text format, keeping the necessary preprocessing to a minimum.

* Update spacy/training/corpus.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* docs: add version to `PlainTextCorpus`

* Add docstring to registry function

* Add plain text corpus tests

* Only strip newline/carriage return

* Add return type _string_to_tmp_file helper

* Use a temporary directory in place of file name

Different OS auto delete/sharing semantics are just wonky.

* This will be new in 3.5.1 (rather than 4)

* Test improvements from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-26 11:33:22 +01:00
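The plain text format described in the commit above (UTF-8, one document per line, blank lines ignored) can be sketched without spaCy as follows. The function `read_plain_text` is a hypothetical stand-in, not the `spacy.PlainTextCorpusReader.v1` implementation, and it yields raw strings instead of spaCy objects:

```python
import tempfile
from pathlib import Path
from typing import Iterator


def read_plain_text(path: Path) -> Iterator[str]:
    """Yield one document per non-blank line of a UTF-8 text file."""
    with path.open("r", encoding="utf8") as file_:
        for line in file_:
            # Strip only the newline/carriage return, keeping other whitespace.
            text = line.rstrip("\r\n")
            if text:
                yield text


# Usage: write a small corpus into a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    corpus = Path(tmp_dir) / "corpus.txt"
    corpus.write_text("First document.\n\nSecond document.\n", encoding="utf8")
    docs = list(read_plain_text(corpus))

print(docs)
```

Because the file is streamed line by line, memory use stays flat for very large corpora. In this sketch a line that contains only spaces still counts as a document, since only line terminators are stripped.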
Marcus Blättermann
a37117abd0
Fix text colors in docs (#12186)
2023-01-26 10:30:24 +01:00
Marcus Blättermann
056b73468c
Load components dynamically (decrease initial file size for docs) (#12175)
* Extract `CodeBlock` component into own file

* Extract `InlineCode` component into own file

* Extract `TypeAnnotation` component into own file

* Convert named `export` to `default export`

* Remove unused `export`

* Simplify `TypeAnnotation` to remove dependency for Prism

* Load `Code` component dynamically

* Extract `MarkdownToReact` component into own file

* WIP Code Dynamic

* Load `MarkdownToReact` component dynamically

* Extract `htmlToReact` to own file

* Load `htmlToReact` component dynamically

* Dynamically load `Juniper`
2023-01-25 17:30:41 +01:00
Adriane Boyd
07dfa54669
CI: Extend website excludes (#12185)
2023-01-25 15:35:17 +01:00
Marcus Blättermann
11f10fff60
Fix frontpage image (#12184)
2023-01-25 13:17:35 +01:00
Marcus Blättermann
5a6000fb8b
Fix text color in docs (#12183)
* Fix text color on landing page

* Fix code color
2023-01-25 13:14:32 +01:00
Adriane Boyd
8ea15240ca
Update binder version to v3.5 (#12153)
2023-01-25 13:14:23 +01:00
Adriane Boyd
2dbb764183
CI: Add black formatting check to validation (#12182)
2023-01-25 12:51:37 +01:00
Marcus Blättermann
99a05734a8
Add aria-label to quickstart widget (#12179)
2023-01-25 11:46:55 +01:00
Marcus Blättermann
0298b1a863
WEB-28 Increase contrast of grey text (#12178)
* Use transparent colors to increase contrast on darker backgrounds

* Increase color contrast of grey text
2023-01-25 11:46:43 +01:00
Marcus Blättermann
3062fae2ca
Fix broken URL (#12176)
2023-01-25 11:42:19 +01:00
Marcus Blättermann
57ba37bc52
Fix regression with links in prompts (#12172)
2023-01-25 08:51:40 +01:00
Marcus Blättermann
05a3685849
Fix broken syntax for type annotations (#12171)
2023-01-25 08:51:25 +01:00
Marcus Blättermann
f3c586f74a
Fix navigation alert (#12169)
Fixes a regression introduced in #12163
2023-01-24 16:40:40 +01:00
Marcus Blättermann
49237f05a6
Fix aria-hidden element (#12163)
* Rename CSS class to make use more clear

* Rename component prop to improve code readability

* Fix `aria-hidden` directly on a link element

This link wouldn't have been clickable by screenreaders

* Refactor component

This removes an unnecessary `div` and a duplicate link

Co-authored-by: Ines Montani <ines@ines.io>
2023-01-24 14:44:47 +01:00
Marcus Blättermann
0a70696923
Fix wrong HTML element attribute (#12151)
Originally introduced in 62b9c9c6d7

Original error: Warning: Invalid DOM property `class`. Did you mean `className`?

React doesn't have `class`, it uses `className`.
2023-01-24 14:35:31 +01:00
Marcus Blättermann
9555e7aecf
Remove unnecessary links (#12159)
There is no need to link to the image we are already viewing and this is also considered an accessibility issue.
2023-01-24 14:01:00 +01:00
Marcus Blättermann
031f6c7b60
WEB-27 Add alt tags to images (#12166)
* Update spaCy badge `alt` text

* Add `next/image` component to Universe

* Add missing `alt` texts
2023-01-24 13:56:14 +01:00
Marcus Blättermann
c9beb47ab7
Increase contrast of text and theme color (#12165)
2023-01-24 13:55:20 +01:00
Marcus Blättermann
a7d6a62f7c
Remove zoom locking (#12164)
* Fix missing comma

* Activate user zoom for website

This is recommended by lighthouse:

> Disabling zooming is problematic for users with low vision who rely on screen magnification to properly see the contents of a web page.

Also iOS already ignores this attribute anyway.
2023-01-24 13:54:49 +01:00
Marcus Blättermann
48159e1d60
Update explosion logo (#12162)
This fixes a misalignment of the explosion logo
2023-01-24 13:53:51 +01:00
Marcus Blättermann
7160f7835d
Fix GitHub badge (#12161)
* Extract component

* Remove rounded border from GitHub Stars badge

* Add `alt` text
2023-01-24 13:53:28 +01:00
Marcus Blättermann
3aa61e615f
Add missing label (#12160)
2023-01-24 13:52:55 +01:00
Marcus Blättermann
fcedcd54a8
WEB-30 spaCy pattern in .png (#12158)
* Fix gap in landing pattern at the top

* Replace `.jpg` patterns with `.png`

This drastically reduces file size (for the landing page from 221kb to 57kb) while doubling the resolution to look sharper on retina displays.
2023-01-24 13:51:39 +01:00
Sofie Van Landeghem
de1fe8dce3
Fix Azure ignoring website files (#12129)
* ignore all mdx files and all files in website

* have both .md and .mdx

* exclude everything but universe.json
2023-01-24 10:02:07 +01:00
Edward
e9048fd4a1
Add how to load probability tables to existing models to spaCy docs (#12051)
* add section about adding tables to models

* change to lexeme_norm

* Change syntax

* change to _prob

* Update website/docs/usage/saving-loading.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-24 10:01:22 +01:00
Raphael Mitsch
950fceceb6
Make test_cli_find_threshold() more robust. (#12148)
2023-01-23 14:42:33 +01:00
Richard Hudson
f9e020dd67
Fix speed problem with top_k>1 on CPU in edit tree lemmatizer (#12017)
* Refactor _scores2guesses

* Handle arrays on GPU

* Convert argmax result to raw integer

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Use NumpyOps() to copy data to CPU

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Changes based on review comments

* Use different _scores2guesses depending on tree_k

* Add tests for corner cases

* Add empty line for consistency

* Improve naming

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

* Improve naming

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2023-01-20 19:34:11 +01:00
Marcus Blättermann
8a3ca77d9e
Fix broken social media image (#12137)
2023-01-20 16:57:43 +01:00
Adriane Boyd
dec81508d2
Update README for v3.5 (#12132)
2023-01-19 16:13:41 +01:00