Commit Graph

15249 Commits

Author SHA1 Message Date
Evgen Kytonin
fc3d446c71 Update Ukrainian tokenizer_exceptions 2022-02-01 13:24:00 +02:00
Lj Miranda
345e7f6bc4
Clarify Span.ents documentation (#10154)
* Clarify Span.ents documentation

Ref: #10135

Retain current behaviour. Span.ents will only include entities within
said span. You can't get tokens outside of the original span.

* Reword docstrings

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update API docs in the website

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-01-31 08:41:42 +01:00
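The behaviour retained by this commit can be pictured with plain offsets. The sketch below is not spaCy code: `ents_within` and the `(start, end, label)` tuples are hypothetical stand-ins for `Span.ents`, and it assumes end-exclusive token offsets.

```python
def ents_within(span_start, span_end, ents):
    """Keep only entities that lie entirely inside [span_start, span_end)."""
    return [
        (start, end, label)
        for start, end, label in ents
        if start >= span_start and end <= span_end
    ]


# Hypothetical document-level entities as (start, end, label) token offsets.
doc_ents = [(0, 2, "PERSON"), (4, 5, "GPE"), (7, 9, "ORG")]

# Entities of the span covering tokens 3..7: only the one fully inside it.
span_ents = ents_within(3, 8, doc_ents)
```

Entities that start before the span or end after it are dropped, which mirrors the "only entities within said span" wording of the commit message.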
Marek Šuppa
f09c799a96
fix: Add missing comma to _eleven_to_beyond (#10166)
* This comma was most probably left out unintentionally, leading to string concatenation between the two consecutive lines. This issue was found automatically using a regular expression.
2022-01-30 16:45:06 +09:00
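The bug class fixed here is easy to reproduce in plain Python: adjacent string literals are concatenated, so a missing comma silently merges two list items. The word lists below are made up and are not the contents of `_eleven_to_beyond`; the regular expression is only a rough guess at such a check, not the one the author used.

```python
import re

# Hypothetical word lists, not taken from the fixed file.
with_comma = [
    "eleven",
    "twelve",
    "thirteen",
]
missing_comma = [
    "eleven",
    "twelve"  # no comma: Python joins this literal with the next one
    "thirteen",
]

# Rough heuristic (an assumption): a double-quoted string that ends its line
# without a comma, followed by a line that starts with another string.
SUSPECT = re.compile(r'"[^"\n]*"[ \t]*\n[ \t]*"')


def find_suspects(source):
    """Return 1-based line numbers where a comma may be missing."""
    return [source.count("\n", 0, m.start()) + 1 for m in SUSPECT.finditer(source)]
```

A heuristic like this will also flag intentional multi-line string concatenation, so its hits would still need manual review.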
Marek Šuppa
67ecac633f
fix: Add missing comma to examples.py (#10167)
* This comma was most probably left out unintentionally, leading to string concatenation between the two consecutive lines. This issue was found automatically using a regular expression.
2022-01-30 16:43:29 +09:00
Adriane Boyd
4f441dfa24
Fix infix as prefix in Tokenizer.explain (#10140)
* Fix infix as prefix in Tokenizer.explain

Update `Tokenizer.explain` to align with the `Tokenizer` algorithm:

* skip infix matches that are prefixes in the current substring

* Update tokenizer pseudocode in docs
2022-01-28 17:00:54 +01:00
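The rule described in this commit (skip an infix match that sits at the start of the current substring) might be sketched as follows. This is a simplified illustration, not the `Tokenizer.explain` implementation: `INFIX_RE` and `split_on_infixes` are hypothetical, and the real infix rules are far richer.

```python
import re

# Hypothetical infix pattern for illustration only.
INFIX_RE = re.compile(r"[-~]")


def split_on_infixes(substring):
    """Split on infix matches, ignoring a match at position 0."""
    tokens = []
    offset = 0
    for match in INFIX_RE.finditer(substring):
        # A match at the very start of the substring acts as a prefix,
        # so it is not used as an infix split point.
        if offset == 0 and match.start() == 0:
            continue
        if substring[offset:match.start()]:
            tokens.append(("TOKEN", substring[offset:match.start()]))
        tokens.append(("INFIX", match.group()))
        offset = match.end()
    if substring[offset:]:
        tokens.append(("TOKEN", substring[offset:]))
    return tokens
```

With this rule, `"-a-b"` keeps its leading hyphen attached to the first token instead of producing an infix at position 0.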
Eduard Zorita
30cf9d6a05
Update typing hints (#10109)
* Improve typing hints for Matcher.__call__

* Add typing hints for DependencyMatcher

* Add typing hints to underscore extensions

* Update Doc.tensor type (requires numpy 1.21)

* Fix typing hints for Language.component decorator

* Use generic np.ndarray type in Doc to avoid numpy version update

* Fix mypy errors

* Fix cyclic import caused by Underscore typing hints

* Use Literal type from spacy.compat

* Update matcher.pyi import format

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-28 16:59:54 +01:00
Adriane Boyd
09734c56fc
Use simple suggester for spancat initialization (#10143)
Instead of running the actual suggester, which may require
annotation from annotating components that is not necessarily present in
the reference docs, use the built-in 1-gram suggester.
2022-01-28 09:34:23 +01:00
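A 1-gram suggester needs nothing but token boundaries, which is why it is safe to use during initialization. The sketch below shows the idea with plain offsets; `ngram_spans` is a hypothetical helper, not spaCy's built-in suggester.

```python
def ngram_spans(n_tokens, sizes):
    """Return (start, end) token offsets for every n-gram of the given sizes."""
    spans = []
    for size in sizes:
        for start in range(0, n_tokens - size + 1):
            spans.append((start, start + size))
    return spans


# With sizes=[1], every token becomes its own candidate span.
unigram_candidates = ngram_spans(3, [1])
```

Because the candidates depend only on the number of tokens, no annotation from other pipeline components is required.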
github-actions[bot]
6d4db5c3c7
Auto-format code with black (#10106)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-01-21 10:01:10 +01:00
Ines Montani
34ed93ef68
Support version tags in universe and add note about reporting (#10093)
* Support version tags in universe and add note about reporting

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-20 23:21:26 +01:00
Peter Baumgartner
a69005037a
Docker Image for Website Dev (#10098)
* add docker instructions

* Update website/README.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/README.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* clarifying language on docker image

* fix markdown formatting

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-20 23:02:13 +01:00
Duygu Altinok
47a2916801
Intify IOB (#9738)
* added iob to int

* added tests

* added iob strings

* added error

* blacked attrs

* Update spacy/tests/lang/test_attrs.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/attrs.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* added iob strings as global

* minor refinement with iob

* removed iob strings from token

* changed to uppercase

* cleaned and went back to master version

* imported iob from attrs

* Update and format errors

* Support and test both str and int ENT_IOB key

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-01-20 13:19:38 +01:00
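Accepting both string and integer IOB values could look roughly like the sketch below. The ordering of `IOB_STRINGS` (0 = unset, 1 = "I", 2 = "O", 3 = "B") is an assumption made for this example, and `intify_iob` is a hypothetical helper rather than the function added in `spacy/attrs.pyx`.

```python
# Assumed integer codes: 0 = unset, 1 = "I", 2 = "O", 3 = "B".
IOB_STRINGS = ("", "I", "O", "B")


def intify_iob(value):
    """Map an IOB string or integer code to its integer code."""
    if isinstance(value, int):
        if 0 <= value < len(IOB_STRINGS):
            return value
        raise ValueError(f"Invalid IOB code: {value!r}")
    if value in IOB_STRINGS:
        return IOB_STRINGS.index(value)
    raise ValueError(f"Invalid IOB string: {value!r}")
```

Normalizing to integers at the boundary means the rest of the code only has to handle one representation.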
Duygu Altinok
268ddf8a06
Add ENT_IOB key to Matcher (#9649)
* added new field

* added exception for IOB strings

* minor refinement to schema

* removed field

* fixed typo

* imported numerical val

* changed the code bit

* cosmetics

* added test for matcher

* set ents of mock docs

* added invalid pattern

* minor update to documentation

* blacked matcher

* added pattern validation

* add IOB vals to schema

* changed into test

* mypy compat

* cleaned left over

* added compat import

* changed type

* added compat import

* changed literal a bit

* went back to old

* made explicit type

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-01-20 13:18:39 +01:00
Paul O'Leary McCann
32bd3856b3
Rename FACILITY to FAC in color list (#10067)
This matches the English models
2022-01-20 12:00:28 +01:00
Adriane Boyd
a55212fca0
Determine labels by factory name in debug data (#10079)
* Determine labels by factory name in debug data

For all components, return labels for all components with the
corresponding factory name rather than for only the default name.

For `spancat`, return labels as a dict keyed by `spans_key`.

* Refactor for typing

* Add test

* Use assert instead of cast, removed unneeded arg

* Mark test as slow
2022-01-20 11:42:52 +01:00
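Collecting labels by factory name rather than by component name might be sketched like this. The dictionaries below are a hypothetical stand-in for a pipeline; the actual `debug data` code works on real pipeline objects.

```python
def labels_for_factory(components, factory_name):
    """Union the labels of all components created by the given factory."""
    labels = set()
    for component in components:
        if component["factory"] == factory_name:
            labels.update(component["labels"])
    return labels


# Hypothetical pipeline with two components built by the "ner" factory.
pipeline = [
    {"name": "ner", "factory": "ner", "labels": ["PERSON", "ORG"]},
    {"name": "ner_custom", "factory": "ner", "labels": ["PRODUCT"]},
    {"name": "tagger", "factory": "tagger", "labels": ["NOUN"]},
]
```

Looking up only the default name `ner` would miss the labels of `ner_custom`; matching on the factory name picks up both.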
Richard Hudson
e9c6314539
Bugfix for similarity return types (#10051) 2022-01-20 11:40:46 +01:00
Adriane Boyd
7d528e607c
Update quickstart install steps (#10092)
* For conda:
  * Use conda environment rather than venv
  * Install `spacy-transformers` as a conda package
* For pip:
  * Add quotes if extras are included
2022-01-20 10:53:40 +01:00
Paul O'Leary McCann
2ff53834bb
Add link to pattern file info in EntityRuler.initialize docs (#10091)
* Add link to pattern file info in EntityRuler.initialize docs

* Update website/docs/api/entityruler.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-19 10:45:11 +01:00
Daniël de Kok
50d2a2c930
Use fewer Vectors internals (#9879)
* Use Vectors.shape rather than Vectors.data.shape

* Use Vectors.size rather than Vectors.data.size

* Add Vectors.to_ops to move data between different ops

* Add documentation for Vectors.to_ops
2022-01-18 17:14:35 +01:00
Adriane Boyd
4dfd559e55
Fix spaces in Doc.from_docs for empty docs (#10052)
Fix spaces in `Doc.from_docs(ensure_whitespace=True)` for cases where a
doc ending in whitespace is followed by an empty doc.
2022-01-18 17:12:42 +01:00
Paul O'Leary McCann
c28e33637b
Mark flaky spancat test so it doesn't fail the build (#10075)
* Mark flaky spancat test so it doesn't fail the build

* Skip, don't run and ignore
2022-01-18 09:36:28 +01:00
Adriane Boyd
39f1b13e77
Update sudachipy extras (#10072)
By @polm, redone from #9917 after incorrect (reverted) rebase.

`sudachipy>=0.5.2` is needed for newer dictionaries. `sudachipy<0.6.0`
is kept for users who might still prefer the older version, in
particular to be able to compile it without rust.
2022-01-17 11:48:39 +01:00
Adriane Boyd
add52935ff
Revert "Bump sudachipy version (#9917)" (#10071)
This reverts commit 58bdd8607b.
2022-01-17 10:38:37 +01:00
Tuomo Hiippala
6a8619dd73
Update the entry for Applied Language Technology in spaCy Universe (#10068)
* add entry for Applied Language Technology under "Courses"

Added the following entry into `universe.json`:

```
        {
            "type": "education",
            "id": "applt-course",
            "title": "Applied Language Technology",
            "slogan": "NLP for newcomers using spaCy and Stanza",
            "description": "These learning materials provide an introduction to applied language technology for audiences who are unfamiliar with language technology and programming. The learning materials assume no previous knowledge of the Python programming language.",
            "url": "https://applied-language-technology.readthedocs.io/",
            "image": "https://www.mv.helsinki.fi/home/thiippal/images/applt-preview.jpg",
            "thumb": "https://applied-language-technology.readthedocs.io/en/latest/_static/logo.png",
            "author": "Tuomo Hiippala",
            "author_links": {
                "twitter": "tuomo_h",
                "github": "thiippal",
                "website": "https://www.mv.helsinki.fi/home/thiippal/"
            },
            "category": ["courses"]
        },
```

* Update the entry for "Applied Language Technology"
2022-01-17 08:28:51 +01:00
Paul O'Leary McCann
58bdd8607b
Bump sudachipy version (#9917)
* Edited Slovenian stop words list (#9707)

* Noun chunks for Italian (#9662)

* added it vocab

* copied portuguese

* added possessive determiner

* added conjed Nps

* added nmoded Nps

* test misc

* more examples

* fixed typo

* fixed parenth

* fixed comma

* comma fix

* added syntax iters

* fix some index problems

* fixed index

* corrected heads for test case

* fixed test case

* fixed determiner gender

* cleaned left over

* added example with apostrophe

* French NP review (#9667)

* adapted from pt

* added basic tests

* added fr vocab

* fixed noun chunks

* more examples

* typo fix

* changed naming

* changed the naming

* typo fix

* Add Japanese kana characters to default exceptions (fix #9693) (#9742)

This includes the main kana, or phonetic characters, used in Japanese.

There are some supplemental kana blocks in Unicode outside the BMP that
could also be included, but because their actual use is rare I omitted
them for now, but maybe they should be added. The omitted blocks are:

- Kana Supplement
- Kana Extended (A and B)
- Small Kana Extension

* Remove NER words from stop words in Norwegian (#9820)

Default stop words in Norwegian Bokmål (nb) in spaCy contain important entities, e.g. France, Germany, Russia, Sweden and USA, police district, important units of time, e.g. months and days of the week, and organisations.

Nobody expects their presence among the default stop words. There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data.

See explanation in https://github.com/explosion/spaCy/issues/3052#issuecomment-986756711 and comment https://github.com/explosion/spaCy/issues/3052#issuecomment-986951831

* Bump sudachipy version

* Update sudachipy versions

* Bump versions

Bumping to the most recent dictionary just to keep things current.
Bumping sudachipy to 0.5.2 because older versions don't support recent
dictionaries.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Richard Hudson <richard@explosion.ai>
Co-authored-by: Duygu Altinok <duygu@explosion.ai>
Co-authored-by: Haakon Meland Eriksen <haakon.eriksen@far.no>
2022-01-17 08:16:22 +01:00
ColleterVi
a784b12eff
fix: new restcountries url (#10043)
The URL extension "eu" and the path "rest" are no longer available. Replaced them with a working URL.
2022-01-13 20:25:06 +09:00
Daniël de Kok
28299644fc
Speed up the StateC::L feature function (#10019)
* Speed up the StateC::L feature function

This function gets the n-th most-recent left-arc with a particular head.
Before this change, StateC::L would construct a vector of all left-arcs
with the given head and then pick the n-th most recent from that vector.
Since the number of left-arcs strongly correlates with the doc length
and the feature is constructed for every transition, this can make
transition-parsing quadratic.

With this change StateC::L:

- Searches left-arcs backwards.
- Stops early when the n-th matching transition is found.
- Does not construct a vector (reducing memory pressure).

This change doesn't avoid the linear search when the transition that is
queried does not occur in the left-arcs. Regardless, performance is
improved quite a bit with very long docs:

Before:

   N  Time

 400   3.3
 800   5.4
1600  11.6
3200  30.7

After:

   N  Time

 400   3.2
 800   5.0
1600   9.5
3200  23.2

We can probably do better with more tailored data structures, but I
first wanted to make a low-impact PR.

Found while investigating #9858.

* StateC::L: simplify loop
2022-01-13 09:03:55 +01:00
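The lookup strategy described in this commit (search backwards, stop early, build no intermediate vector) can be sketched in Python. The representation of left-arcs as `(head, child)` pairs in creation order and the `-1` return value for "not found" are assumptions of this example; the real `StateC::L` is C++ code inside the parser state.

```python
def nth_left_arc_child(left_arcs, head, n):
    """Return the child of the n-th most recent left-arc with `head`, or -1."""
    if n < 1:
        return -1
    # Walk from the most recent arc to the oldest without copying the list.
    for arc_head, arc_child in reversed(left_arcs):
        if arc_head == head:
            n -= 1
            if n == 0:
                # Stop as soon as the n-th match is found.
                return arc_child
    # The full scan is only paid when fewer than n arcs match.
    return -1


# Hypothetical arcs in the order they were created.
arcs = [(5, 1), (5, 2), (7, 3), (5, 4)]
```

As the commit message notes, the scan is still linear when the requested arc does not exist, but the common case returns after inspecting only a few recent arcs.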
Ryn Daniels
057b8c64c0
Check for assets with size of 0 bytes (#10026)
* Check for assets with size of 0 bytes

* Update spacy/cli/project/assets.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-12 10:34:23 +01:00
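A check for zero-byte assets can be written with the standard library alone. The sketch below is not the code from `spacy/cli/project/assets.py`; `find_empty_assets` is a hypothetical helper that only shows the size test.

```python
import os
import tempfile


def find_empty_assets(paths):
    """Return the paths that exist as files but contain 0 bytes."""
    return [p for p in paths if os.path.isfile(p) and os.path.getsize(p) == 0]


# Usage: create one empty and one non-empty file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    empty = os.path.join(tmp, "empty.bin")
    full = os.path.join(tmp, "full.bin")
    open(empty, "wb").close()
    with open(full, "wb") as f:
        f.write(b"data")
    empty_found = find_empty_assets([empty, full])
    expected = [empty]
```

An empty file usually signals a failed or interrupted download, so flagging it early avoids confusing errors later in a project workflow.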
Sofie Van Landeghem
5ba4171b19
Update LICENSE to include 2022 [ci skip] 2022-01-07 09:24:07 +01:00
Ines Montani
005e23a525
Merge pull request #9989 from explosion/docs/update-algolia-search-api [ci skip] 2022-01-05 14:14:42 +01:00
Ines Montani
a437ca6737 Update website to use new Algolia search API 2022-01-05 13:21:06 +01:00
Lj Miranda
00e7bf5ffd
Add a few docs to the default_config.cfg (#9981)
* Clarify patience hyperparameter

The current value for patience doesn't seem to indicate that it's
pointing to the number of steps. It may be useful to specify that
explicitly.

Ref: https://github.com/explosion/spaCy/discussions/7450
Ref: https://github.com/explosion/spaCy/discussions/7465

* Update docs for max_steps
2022-01-05 09:16:40 +01:00
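To make the "number of steps" reading concrete, here is a minimal sketch of how `patience` and `max_steps` could act as stopping conditions. It assumes that a value of 0 disables the corresponding limit; `should_stop` is a hypothetical helper, not the training loop's actual code.

```python
def should_stop(step, best_step, patience, max_steps):
    """Decide whether training should stop after the given step."""
    # Assumption: 0 disables the corresponding limit.
    if max_steps and step >= max_steps:
        return True
    # Patience counts steps since the best score, not epochs.
    if patience and (step - best_step) >= patience:
        return True
    return False
```

For example, with `patience=1600` and the best score seen at step 0, this sketch stops at step 1600 regardless of how many epochs that corresponds to.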
Duygu Altinok
55cf492218
Feat/debug data warn spread ents (#9960)
* added check for crossing boundaries

* formatted blacked

* Rephrasing slightly

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-04 18:22:10 +01:00
Sofie Van Landeghem
56dcb39fb7
Fix references to config file in the docs & UX (#9961)
* doc fixes around config file

* fix typo

* clarify default
2022-01-04 14:31:26 +01:00
Sofie Van Landeghem
029a48e340
fix type of lexeme.rank (#9979) 2022-01-04 13:15:25 +01:00
Sam Edwardes
6f65e2b544
Added spacypdfreader to universe.json (#9963) 2022-01-03 16:34:36 +09:00
Paul O'Leary McCann
f40e237c5a
Remove denomme from universe (#9952)
Package seems to have been deleted.
2021-12-29 11:41:29 +01:00
Yoav Vollansky
9d63dfacfc
Update UNIVERSE.md (#9941)
typo
2021-12-27 13:46:04 +01:00
Peter Baumgartner
72abf9e102
MultiHashEmbed vector docs correction (#9918) 2021-12-27 11:18:08 +01:00
Adriane Boyd
837d241b68
Make floret murmurhash endian-neutral (#9735) 2021-12-20 17:11:31 +01:00
Adriane Boyd
1163073756
Remove outdated patterns MANIFEST.in (#9912) 2021-12-20 16:40:20 +01:00
Adriane Boyd
18e5638af0
Extend cupy to v10.x (#9911)
* Add extra for `cupy-cuda115`
2021-12-20 15:48:35 +01:00
Daniël de Kok
93e9bf681f
Merge pull request #9873 from danieldk/temporarily-pin-mypy
Pin mypy to 0.910 until there is a compatible pydantic version
2021-12-16 10:28:31 +01:00
Daniël de Kok
b08f1ac17d Pin mypy to 0.910 until there is a compatible pydantic version 2021-12-16 09:31:45 +01:00
Adriane Boyd
94fbd88521
Use dict.copy().items() instead of list(.items()) (#9868) 2021-12-16 09:17:33 +01:00
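Both `list(d.items())` and `d.copy().items()` iterate over a snapshot, so the original dictionary can be modified inside the loop. The commit message does not state its motivation, and the sketch below only demonstrates the pattern with a hypothetical `drop_empty` helper.

```python
def drop_empty(values):
    """Remove falsy entries from `values` in place and return it."""
    # Iterate over a shallow copy so deleting from `values` is safe.
    for key, value in values.copy().items():
        if not value:
            del values[key]
    return values


cleaned = drop_empty({"a": 1, "b": 0, "c": "", "d": "x"})
```

Iterating over `values.items()` directly while deleting keys would raise a `RuntimeError` because the dictionary changes size during iteration.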
Edward
018827e9fd
Add healthsea to universe (#9838)
* Add healthsea to universe

* Update website/meta/universe.json

* Add thumbnail

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-12-15 17:57:19 +01:00
antonpibm
ac45ae3779
Update Tokenizer documentation to reflect token_match and url_match signatures (#9859) 2021-12-15 09:34:33 +01:00
Ines Montani
ba0fa7a64e
Support Google Sheets embeds in docs (#9861) 2021-12-15 09:27:08 +01:00
Adriane Boyd
800737b416
Set version to v3.2.1 (#9823) 2021-12-07 10:51:45 +01:00
Adriane Boyd
51a3b60027
Document Tagger neg_prefix, fix typo (#9821) 2021-12-07 09:42:40 +01:00
Adriane Boyd
a0cdc2b007
Use Language.pipe in evaluate (#9800) 2021-12-06 20:39:15 +01:00