Commit Graph

8958 Commits

Author SHA1 Message Date
Adriane Boyd
fef896ce49
Allow Example to align whitespace annotation (#10189)
Remove exception for whitespace tokens in `Example.get_aligned` so that
annotation on whitespace tokens is aligned in the same way as for
non-whitespace tokens.
2022-02-03 17:01:53 +01:00
pepemedigu
2abd380f2d
Update lex_attrs.py for Spanish with ordinals (#10038)
* Update lex_attrs.py

Add ordinal words

* black formatting

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-20 15:44:13 +01:00
Sofie Van Landeghem
4465fe0306
Merge branch 'develop' into feature/master_copy 2022-01-20 13:36:17 +01:00
Duygu Altinok
47a2916801
Intify IOB (#9738)
* added iob to int

* added tests

* added iob strings

* added error

* blacked attrs

* Update spacy/tests/lang/test_attrs.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/attrs.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* added iob strings as global

* minor refinement with iob

* removed iob strings from token

* changed to uppercase

* cleaned and went back to master version

* imported iob from attrs

* Update and format errors

* Support and test both str and int ENT_IOB key

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-01-20 13:19:38 +01:00
Duygu Altinok
268ddf8a06
Add ENT_IOB key to Matcher (#9649)
* added new field

* added exception for IOb strings

* minor refinement to schema

* removed field

* fixed typo

* imported numeriacla val

* changed the code bit

* cosmetics

* added test for matcher

* set ents of moc docs

* added invalid pattern

* minor update to documentation

* blacked matcher

* added pattern validation

* add IOB vals to schema

* changed into test

* mypy compat

* cleaned left over

* added compat import

* changed type

* added compat import

* changed literal a bit

* went back to old

* made explicit type

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-01-20 13:18:39 +01:00
Daniël de Kok
6984f55277
Merge pull request #10048 from danieldk/index-arcs-by-head
Use constant-time head lookups in StateC::{L,R}
2022-01-20 13:06:14 +01:00
Paul O'Leary McCann
32bd3856b3
Rename FACILITY to FAC in color list (#10067)
This matches the English models
2022-01-20 12:00:28 +01:00
Adriane Boyd
a55212fca0
Determine labels by factory name in debug data (#10079)
* Determine labels by factory name in debug data

For all components, return labels for all components with the
corresponding factory name rather than for only the default name.

For `spancat`, return labels as a dict keyed by `spans_key`.

* Refactor for typing

* Add test

* Use assert instead of cast, removed unneeded arg

* Mark test as slow
2022-01-20 11:42:52 +01:00
Richard Hudson
e9c6314539
Bugfix for similarity return types (#10051) 2022-01-20 11:40:46 +01:00
Daniël de Kok
50d2a2c930
User fewer Vector internals (#9879)
* Use Vectors.shape rather than Vectors.data.shape

* Use Vectors.size rather than Vectors.data.size

* Add Vectors.to_ops to move data between different ops

* Add documentation for Vector.to_ops
2022-01-18 17:14:35 +01:00
Adriane Boyd
4dfd559e55
Fix spaces in Doc.from_docs for empty docs (#10052)
Fix spaces in `Doc.from_docs(ensure_whitespace=True)` for cases where an
doc ending in whitespace is followed by an empty doc.
2022-01-18 17:12:42 +01:00
Paul O'Leary McCann
c28e33637b
Mark flaky spancat test so it doesn't fail the build (#10075)
* Mark flaky spancat test so it doesn't fail the build

* Skip, don't run and ignore
2022-01-18 09:36:28 +01:00
Natalia Rodnova
47ea6704f1
Span richcmp fix (#9956)
* Corrected Span's __richcmp__ implementation to take end, label and kb_id in consideration

* Updated test

* Updated test

* Removed formatting from a test for readability sake

* Use same tuples for all comparisons

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-01-17 11:17:49 +01:00
Adriane Boyd
add52935ff
Revert "Bump sudachipy version (#9917)" (#10071)
This reverts commit 58bdd8607b.
2022-01-17 10:38:37 +01:00
Paul O'Leary McCann
58bdd8607b
Bump sudachipy version (#9917)
* Edited Slovenian stop words list (#9707)

* Noun chunks for Italian (#9662)

* added it vocab

* copied portuguese

* added possessive determiner

* added conjed Nps

* added nmoded Nps

* test misc

* more examples

* fixed typo

* fixed parenth

* fixed comma

* comma fix

* added syntax iters

* fix some index problems

* fixed index

* corrected heads for test case

* fixed tets case

* fixed determiner gender

* cleaned left over

* added example with apostophe

* French NP review (#9667)

* adapted from pt

* added basic tests

* added fr vocab

* fixed noun chunks

* more examples

* typo fix

* changed naming

* changed the naming

* typo fix

* Add Japanese kana characters to default exceptions (fix #9693) (#9742)

This includes the main kana, or phonetic characters, used in Japanese.

There are some supplemental kana blocks in Unicode outside the BMP that
could also be included, but because their actual use is rare I omitted
them for now, but maybe they should be added. The omitted blocks are:

- Kana Supplement
- Kana Extended (A and B)
- Small Kana Extension

* Remove NER words from stop words in Norwegian (#9820)

Default stop words in Norwegian bokmål (nb) in Spacy contain important entities, e.g. France, Germany, Russia, Sweden and USA, police district, important units of time, e.g. months and days of the week, and organisations.

Nobody expects their presence among the default stop words. There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data.

See explanation in https://github.com/explosion/spaCy/issues/3052#issuecomment-986756711 and comment https://github.com/explosion/spaCy/issues/3052#issuecomment-986951831

* Bump sudachipy version

* Update sudachipy versions

* Bump versions

Bumping to the most recent dictionary just to keep thing current.
Bumping sudachipy to 5.2 because older versions don't support recent
dictionaries.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Richard Hudson <richard@explosion.ai>
Co-authored-by: Duygu Altinok <duygu@explosion.ai>
Co-authored-by: Haakon Meland Eriksen <haakon.eriksen@far.no>
2022-01-17 08:16:22 +01:00
Daniël de Kok
63fa55089d Use constant-time head lookups in StateC::{L,R}
This change changes the type of left/right-arc collections from
vector[ArcC] to unordered_map[int, vector[Arc]], so that the arcs are
keyed by the head. This allows us to find all the left/right arcs for a
particular head in constant time in StateC::{L,R}.

Benchmarks with long docs (N is the number of text repetitions):

Before (using #10019):

    N  Time (s)

  400   3.2
  800   5.0
 1600   9.5
 3200  23.2
 6400  66.8
12800  220.0

After (this commit):

   N   Time (s)

  400   3.1
  800   4.3
 1600   6.7
 3200  12.0
 6400  22.0
12800  42.0

Related to #9858 and #10019.
2022-01-13 12:08:46 +01:00
Daniël de Kok
677c1a3507 Speed up the StateC::L feature function (#10019)
* Speed up the StateC::L feature function

This function gets the n-th most-recent left-arc with a particular head.
Before this change, StateC::L would construct a vector of all left-arcs
with the given head and then pick the n-th most recent from that vector.
Since the number of left-arcs strongly correlates with the doc length
and the feature is constructed for every transition, this can make
transition-parsing quadratic.

With this change StateC::L:

- Searches left-arcs backwards.
- Stops early when the n-th matching transition is found.
- Does not construct a vector (reducing memory pressure).

This change doesn't avoid the linear search when the transition that is
queried does not occur in the left-arcs. Regardless, performance is
improved quite a bit with very long docs:

Before:

   N  Time

 400   3.3
 800   5.4
1600  11.6
3200  30.7

After:

   N  Time

 400   3.2
 800   5.0
1600   9.5
3200  23.2

We can probably do better with more tailored data structures, but I
first wanted to make a low-impact PR.

Found while investigating #9858.

* StateC::L: simplify loop
2022-01-13 09:29:58 +01:00
Daniël de Kok
28299644fc
Speed up the StateC::L feature function (#10019)
* Speed up the StateC::L feature function

This function gets the n-th most-recent left-arc with a particular head.
Before this change, StateC::L would construct a vector of all left-arcs
with the given head and then pick the n-th most recent from that vector.
Since the number of left-arcs strongly correlates with the doc length
and the feature is constructed for every transition, this can make
transition-parsing quadratic.

With this change StateC::L:

- Searches left-arcs backwards.
- Stops early when the n-th matching transition is found.
- Does not construct a vector (reducing memory pressure).

This change doesn't avoid the linear search when the transition that is
queried does not occur in the left-arcs. Regardless, performance is
improved quite a bit with very long docs:

Before:

   N  Time

 400   3.3
 800   5.4
1600  11.6
3200  30.7

After:

   N  Time

 400   3.2
 800   5.0
1600   9.5
3200  23.2

We can probably do better with more tailored data structures, but I
first wanted to make a low-impact PR.

Found while investigating #9858.

* StateC::L: simplify loop
2022-01-13 09:03:55 +01:00
jsnfly
176a90edee
Fix texcat loss scaling (#9904) (#10002)
* add failing test for issue 9904

* remove division by batch size and summation before applying the mean

Co-authored-by: jonas <jsnfly@gmx.de>
2022-01-13 09:03:23 +01:00
Sofie Van Landeghem
d8a3012539
Merge pull request #10037 from explosion/master
Update develop with master
2022-01-12 12:29:23 +01:00
Ryn Daniels
057b8c64c0
Check for assets with size of 0 bytes (#10026)
* Check for assets with size of 0 bytes

* Update spacy/cli/project/assets.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-12 10:34:23 +01:00
Sofie Van Landeghem
067a44a417
Merge pull request #9987 from explosion/master
Update develop with commits from master
2022-01-05 11:49:50 +01:00
Lj Miranda
00e7bf5ffd
Add a few docs to the default_config.cfg (#9981)
* Clarify patience hyperparameter

The current value for patience doesn't seem to indicate that it's
pointing to the number of steps. It may be useful to specify that
explicitly.

Ref: https://github.com/explosion/spaCy/discussions/7450
Ref: https://github.com/explosion/spaCy/discussions/7465

* Update docs for max_steps
2022-01-05 09:16:40 +01:00
Duygu Altinok
55cf492218
Feat/debug data warn spread ents (#9960)
* added check for crossing boundaries

* formatted blacked

* Rephrasing slightly

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-04 18:22:10 +01:00
Sofie Van Landeghem
56dcb39fb7
Fix references to config file in the docs & UX (#9961)
* doc fixes around config file

* fix typo

* clarify default
2022-01-04 14:31:26 +01:00
Sofie Van Landeghem
029a48e340
fix type of lexeme.rank (#9979) 2022-01-04 13:15:25 +01:00
Florian Cäsar
86e71e7b19
Fix Scorer.score_cats for missing labels (#9443)
* Fix Scorer.score_cats for missing labels

* Add test case for Scorer.score_cats missing labels

* semantic nitpick

* black formatting

* adjust test to give different results depending on multi_label setting

* fix loss function according to whether or not missing values are supported

* add note to docs

* small fixes

* make mypy happy

* Update spacy/pipeline/textcat.py

Co-authored-by: Florian Cäsar <florian.caesar@pm.me>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
2021-12-29 11:04:39 +01:00
Sofie Van Landeghem
b8106e0f95
Merge pull request #9951 from explosion/master
Update develop branch with master
2021-12-29 10:11:43 +01:00
Peter Baumgartner
72abf9e102
MultiHashEmbed vector docs correction (#9918) 2021-12-27 11:18:08 +01:00
Duygu Altinok
7ec1452f5f
added ellided forms (#9878)
* added ellided forms

* rearranged a bit

* rearranged a bit

* added stopword tests

* blacked tests file
2021-12-23 13:41:01 +01:00
Andrew Janco
3cfeb518ee
Handle "_" value for token pos in conllu data (#9903)
* change '_' to '' to allow Token.pos, when no value for token pos in conllu data

* Minor code style

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-12-21 15:46:33 +01:00
Adriane Boyd
837d241b68
Make floret murmurhash endian-neutral (#9735) 2021-12-20 17:11:31 +01:00
Sofie Van Landeghem
7847839003
Merge pull request #9891 from explosion/master
Update develop with master
2021-12-17 14:01:27 +01:00
Adriane Boyd
94fbd88521
Use dict.copy().items() instead of list(.items()) (#9868) 2021-12-16 09:17:33 +01:00
antonpibm
ac45ae3779
Update Tokenizer documentation to reflect token_match and url_match signatures (#9859) 2021-12-15 09:34:33 +01:00
Adriane Boyd
800737b416
Set version to v3.2.1 (#9823) 2021-12-07 10:51:45 +01:00
Haakon Meland Eriksen
251119455d
Remove NER words from stop words in Norwegian (#9820)
Default stop words in Norwegian bokmål (nb) in Spacy contain important entities, e.g. France, Germany, Russia, Sweden and USA, police district, important units of time, e.g. months and days of the week, and organisations.

Nobody expects their presence among the default stop words. There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data.

See explanation in https://github.com/explosion/spaCy/issues/3052#issuecomment-986756711 and comment https://github.com/explosion/spaCy/issues/3052#issuecomment-986951831
2021-12-07 09:45:10 +01:00
Adriane Boyd
a0cdc2b007
Use Language.pipe in evaluate (#9800) 2021-12-06 20:39:15 +01:00
Adriane Boyd
9964243eb2
Make the Tagger neg_prefix configurable (#9802) 2021-12-06 18:04:44 +01:00
Duygu Altinok
b56b9e7f31
Entity ruler remove pattern (#9685)
* added ruler coe

* added error for none existing pattern

* changed error to warning

* changed error to warning

* added basic tests

* fixed place

* added test files

* went back to error

* went back to pattern error

* minor change to docs

* changed style

* changed doc

* changed error slightly

* added remove to phrasem api

* error key already existed

* phrase matcher match code to api

* blacked tests

* moved comments before expr

* corrected error no

* Update website/docs/api/entityruler.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/entityruler.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-12-06 15:32:49 +01:00
Natalia Rodnova
472740d613
Added sents property to Span for Spans spanning over several sentences (#9699)
* Added sents property to Span class that returns a generator of sentences the Span belongs to

* Added description to Span.sents property

* Update test_span to clarify the difference between span.sent and span.sents

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/tests/doc/test_span.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix documentation typos in spacy/tokens/span.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update Span.sents doc string in spacy/tokens/span.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Parametrized test_span_spans

* Corrected Span.sents to check for span-level hook first. Also, made Span.sent respect doc-level sents hook if no span-level hook is provided

* Corrected Span ocumentation copy/paste issue

* Put back accidentally deleted lines

* Fixed formatting in span.pyx

* Moved check for SENT_START annotation after user hooks in Span.sents

* add version where the property was introduced

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-12-06 09:58:01 +01:00
Lj Miranda
7d50804644
Migrate regression tests into the main test suite (#9655)
* Migrate regressions 1-1000

* Move serialize test to correct file

* Remove tests that won't work in v3

* Migrate regressions 1000-1500

Removed regression test 1250 because v3 doesn't support the old LEX
scheme anymore.

* Add missing imports in serializer tests

* Migrate tests 1500-2000

* Migrate regressions from 2000-2500

* Migrate regressions from 2501-3000

* Migrate regressions from 3000-3501

* Migrate regressions from 3501-4000

* Migrate regressions from 4001-4500

* Migrate regressions from 4501-5000

* Migrate regressions from 5001-5501

* Migrate regressions from 5501 to 7000

* Migrate regressions from 7001 to 8000

* Migrate remaining regression tests

* Fixing missing imports

* Update docs with new system [ci skip]

* Update CONTRIBUTING.md

- Fix formatting
- Update wording

* Remove lemmatizer tests in el lang

* Move a few tests into the general tokenizer

* Separate Doc and DocBin tests
2021-12-04 20:34:48 +01:00
Paul O'Leary McCann
b4d526c357
Add Japanese kana characters to default exceptions (fix #9693) (#9742)
This includes the main kana, or phonetic characters, used in Japanese.

There are some supplemental kana blocks in Unicode outside the BMP that
could also be included, but because their actual use is rare I omitted
them for now, but maybe they should be added. The omitted blocks are:

- Kana Supplement
- Kana Extended (A and B)
- Small Kana Extension
2021-11-30 23:36:39 +01:00
Sofie Van Landeghem
58e29776bd
Merge pull request #9777 from explosion/master
Update develop with master
2021-11-30 14:01:23 +01:00
Duygu Altinok
29f28d1f3e
French NP review (#9667)
* adapted from pt

* added basic tests

* added fr vocab

* fixed noun chunks

* more examples

* typo fix

* changed naming

* changed the naming

* typo fix
2021-11-30 12:19:07 +01:00
Daniël de Kok
72f7f4e68a
morphologizer: avoid recreating label tuple for each token (#9764)
* morphologizer: avoid recreating label tuple for each token

The `labels` property converts the dictionary key set to a tuple. This
property was used for every annotated token, recreating the tuple over
and over again.

Construct the tuple once in the set_annotations function and reuse it.

On a Finnish pipeline that I was experimenting with, this results in a
speedup of ~15% (~13000 -> ~15000 WPS).

* tagger: avoid recreating label tuple for each token
2021-11-30 11:58:59 +01:00
Narayan Acharya
1be8a4dab3
Displacy serve entity linking support without manual=True support. (#9748)
* Add support for kb_id to be displayed via displacy.serve. The current support is only limited to the manual option in displacy.render

* Commit to check pre-commit hooks are run.

* Update spacy/displacy/__init__.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Changes as per suggestions on the PR.

* Update website/docs/api/top-level.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/top-level.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* tag option as new from 3.2.1 onwards

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2021-11-29 17:13:26 +01:00
Paul O'Leary McCann
ac05de2c6c
Fix Language-specific factory handling in package command (#9674)
* Use internal names for factories

If a component factory is registered like `@French.factory(...)` instead
of `@Language.factory(...)`, the name in the factories registry will be
prefixed with the language code. However in the nlp.config object the
factory will be listed without the language code. The `add_pipe` code
has fallback logic to handle this, but packaging code and the registry
itself don't.

This change makes it so that the factory name in nlp.config is the
language-specific form. It's not clear if this will break anything else,
but it does seem to fix the inconsistency and resolve the specific user
issue that brought this to our attention.

* Change approach to use fallback in package lookup

This adds fallback logic to the package lookup, so it doesn't have to
touch the way the config is built. It seems to fix the tests too.

* Remove unecessary line

* Add test

Thsi also adds an assert that seems to have been forgotten.
2021-11-29 08:31:02 +01:00
Richard Hudson
7b134b8fbd
New tests for a number of alpha languages (#9703)
* Added Slovak

* Added Slovenian tests

* Added Estonian tests

* Added Croatian tests

* Added Latvian tests

* Added Icelandic tests

* Added Afrikaans tests

* Added language-independent tests

* Added Kannada tests

* Tidied up

* Added Albanian tests

* Formatted with black

* Added failing tests for anomalies

* Update spacy/tests/lang/af/test_text.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Added context to failing Estonian tokenizer test

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Added context to failing Croatian tokenizer test

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Added context to failing Icelandic tokenizer test

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Added context to failing Latvian tokenizer test

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Added context to failing Slovak tokenizer test

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Added context to failing Slovenian tokenizer test

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-11-28 21:59:23 +01:00
Natalia Rodnova
a4c43e5c57
Allow Matcher to match on ENT_ID and ENT_KB_ID (#9688)
* Added ENT_ID and ENT_KB_ID into the list of the attributes that Matcher matches on

* Added ENT_ID and ENT_KB_ID to TEST_PATTERNS in test_pattern_validation.py. Disabled tests that I added before

* Update website/docs/api/matcher.md

* Format

* Remove skipped tests

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-11-24 10:37:10 +01:00