Commit Graph

15815 Commits

Author SHA1 Message Date
Jacobo Myerston
3e8bc1272f
add punctuation to grc (#11426)
* add punctuation to grc

Add support for special editorial punctuation that is common in ancient Greek texts. Ancient Greek texts, as found in digital and print form, have been largely edited by scholars. Restorations and improvements are normally marked with special characters that need to be handled properly by the tokenizer.

* add unit tests

* simplify regex

* move generic quotes to char classes

* rename unit test

* fix regex

Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-09-27 11:38:56 +02:00
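As a rough illustration of the tokenizer concern in the grc commit above, the following sketch splits editorial marks off the words they attach to. The set of marks (square brackets, angle brackets, braces, dagger) and the function name are assumptions chosen for illustration; they are not taken from the commit, and the real change works through spaCy's punctuation rules rather than a standalone function.

```python
import re

# Hypothetical set of editorial marks; NOT the list used in the commit.
EDITORIAL_MARKS = "[]⟨⟩{}†"

# A capturing group makes re.split keep the marks as separate pieces.
_SPLIT = re.compile("([" + re.escape(EDITORIAL_MARKS) + "])")


def split_editorial_marks(text):
    """Split on whitespace, then separate editorial marks from words."""
    tokens = []
    for chunk in text.split():
        tokens.extend(piece for piece in _SPLIT.split(chunk) if piece)
    return tokens


print(split_editorial_marks("[λόγ]ος †μῆνιν†"))
```

A restored word such as `[λόγ]ος` comes out as four tokens, so the brackets no longer hide the word form from downstream components.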
Paul O'Leary McCann
a44b7d4622
Add experimental coref docs (#11291)
* Add experimental coref docs

* Docs cleanup

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Apply changes from code review

* Fix prettier formatting

It seems a period after a number made this think it was a list?

* Update docs on examples for initialize

* Add docs for coref scorers

* Remove 3.4 notes from coref

There won't be a "new" tag until it's in core.

* Add docs for span cleaner

* Fix docs

* Fix docs to match spacy-experimental

These weren't properly updated when the code was moved out of spacy
core.

* More doc fixes

* Formatting

* Update architectures

* Fix links

* Fix another link

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
2022-09-27 18:11:23 +09:00
Adriane Boyd
877671e09a
Preserve missing entity annotation in augmenters (#11540)
Preserve both `-` and `O` annotation in augmenters rather than relying
on `Example.to_dict`'s default support for one option outside of labeled
entity spans.

This is intended as a temporary workaround for augmenters for v3.4.x.
The behavior of `Example` and related IOB utils could be improved in the
general case for v3.5.
2022-09-27 10:16:51 +02:00
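The distinction that the augmenter commit above preserves can be sketched as follows: `-` marks a token whose entity annotation is missing, while `O` marks a token known to be outside any entity. The function names and the toy lowercasing augmenter are hypothetical and only illustrate why the two values should not be merged.

```python
MISSING = "-"   # annotation unknown for this token
OUTSIDE = "O"   # token is known to be outside any entity


def lowercase_augment(words, tags):
    """Toy augmenter: change the text, copy the tags through verbatim."""
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    return [w.lower() for w in words], list(tags)


def collapse_to_outside(tags):
    """What a lossy round trip would do: treat missing as outside."""
    return [OUTSIDE if t == MISSING else t for t in tags]


words = ["Apple", "is", "in", "Cupertino"]
tags = ["B-ORG", "O", "-", "B-GPE"]
aug_words, aug_tags = lowercase_augment(words, tags)
print(aug_tags)                   # tags unchanged, "-" still present
print(collapse_to_outside(tags))  # the "-" has been turned into "O"
```

Collapsing `-` into `O` would turn an unknown token into a negative example during training, which is the information loss the commit message describes avoiding.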
Paul O'Leary McCann
936a5f0506
Fix English pipeline names in 3.4 release notes (#11542) 2022-09-27 08:25:24 +02:00
Richard Hudson
6f692a06d5
Remove side effects from Doc.__init__() (#11506)
* Remove side effects from Doc.__init__()

* Changes based on review comment

* Readd test

* Change interface of Doc.__init__()

* Simplify test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update doc.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-09-26 15:58:21 +02:00
Basile Dura
f40d2fac29
fix: remove duplicate v3.2 (#11530) 2022-09-23 13:18:51 +02:00
Kevin Humphreys
0da324ab5b reinstate FUZZY operator
with length-based distance function
2022-09-22 19:26:52 -07:00
Kevin Humphreys
eab96f7c03 fix min distance 2022-09-22 15:37:19 -07:00
Raphael Mitsch
af9b01ef97
Add dependency check to project step runs (#11226)
* Add dependency check to project step running.

* Fix dependency mismatch warning.

* Remove newline.

* Add types-setuptools to setup.cfg.

* Move types-setuptools to test requirements. Move warnings into _validate_requirements(). Handle file reading in project_run().

* Remove newline formatting for output of package conflicts.

* Show full version conflict message instead of just package name.

* Update spacy/cli/project/run.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix typo.

* Re-add rephrasing of message for conflicting packages. Remove requirements path redundancy.

* Update spacy/cli/project/run.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/cli/project/run.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Print unified message for requirement conflicts and missing requirements.

* Update spacy/cli/project/run.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix warning message.

* Print conflict/missing messages individually.

* Add check_requirements setting in project.yml to disable requirements check.

* Update website/docs/usage/projects.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/usage/projects.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update description of project.yml structure in projects.md.

* Update website/docs/usage/projects.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Prettify projects docs.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-16 16:54:31 +02:00
github-actions[bot]
279358be63
Auto-format code with black (#11513)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-09-16 11:50:19 +02:00
Sofie Van Landeghem
df0b815c23
more explicit Example constructor example (#11489)
* make constructor example for Example more explicit

* shorten example and add spaces
2022-09-16 09:26:33 +02:00
Kevin Humphreys
4a677acf5d don't allow more edits than characters 2022-09-15 16:14:24 -07:00
Kevin Humphreys
252e9ab3af exclude whitespace tokens 2022-09-15 15:50:07 -07:00
Kevin Humphreys
a1c984043a remove polyleven 2022-09-15 12:42:17 -07:00
Kevin Humphreys
711f16cc82 Merge branch 'master' into rapidfuzz 2022-09-15 11:54:16 -07:00
Sofie Van Landeghem
d5c8498f2f
disable mypy run for Python 3.10 (#11508) (#11511) 2022-09-15 17:41:25 +02:00
Sofie Van Landeghem
0509f90874
add dot (#11500) 2022-09-15 17:29:42 +02:00
Sofie Van Landeghem
ca1ad67458
disable mypy run for Python 3.10 (#11508) 2022-09-15 15:51:19 +02:00
Kevin Humphreys
b393525b50 Merge branch 'rapidfuzz' of https://github.com/kwhumphreys/spaCy into rapidfuzz 2022-09-14 15:56:18 -07:00
Kevin Humphreys
b7599dfb2f fuzzy match only on oov tokens 2022-09-14 15:54:05 -07:00
Adriane Boyd
7c98245c0c
Add levenshtein from polyleven (#11418)
Add a simple levenshtein distance function using the implementation from
the polyleven library as `spacy.matcher.levenshtein`.
2022-09-14 17:05:22 +02:00
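For readers unfamiliar with the metric named in the commit above, here is a minimal pure-Python sketch of Levenshtein distance using the standard two-row dynamic programme. It is not the polyleven implementation and makes no claim about the signature of `spacy.matcher.levenshtein`; it only shows what the number means.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b`."""
    # Keep the shorter string as the row to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```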
Richard Hudson
3f0c3ad7d3
Correct alignment example and documentation (#11491)
* Correct example and documentation

* Added altered example.md

* Changes based on review + apply prettier

* Remove unnecessary 'the'

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2022-09-14 09:36:55 +02:00
Adriane Boyd
6be6913ba5
Update cupy extras (#11279)
* Update cupy extras:

* Extend to v11
* Add `cupy-cuda11x` and `cupy-wheel`
* Update quickstart to use `cupy-wheel` for CUDA 10.2+

* Rename cuda-wheel to cuda-autodetect, remove repeated CUDA in menu
2022-09-13 09:04:53 +02:00
Kevin Humphreys
a6d26a0195 switch to polyleven
(Python package)
2022-09-12 16:45:51 -07:00
Kevin Humphreys
568a843c09 revert changes added for fuzzy param 2022-09-12 16:45:51 -07:00
Kevin Humphreys
3591a69d35 switch to FUZZYn predicates
use Levenshtein distance.
remove fuzzy param.
remove rapidfuzz_capi.
2022-09-12 16:45:51 -07:00
Kevin Humphreys
974e5f9902 case fix 2022-09-12 16:45:51 -07:00
Kevin Humphreys
e636f4941b simplify fuzzy sets 2022-09-12 16:45:51 -07:00
Kevin Humphreys
9c0f9368a9 handle fuzzy sets 2022-09-12 16:45:51 -07:00
Kevin Humphreys
0859e391c6 remove unnecessary dependency 2022-09-12 16:45:50 -07:00
Kevin Humphreys
ee25d434b6 tidying 2022-09-12 16:45:50 -07:00
Kevin Humphreys
3dba984db9 fix type properly 2022-09-12 16:45:50 -07:00
Kevin Humphreys
63f5e1331d add fuzzy attribute list 2022-09-12 16:45:50 -07:00
Kevin Humphreys
594674db92 add FUZZY predicate 2022-09-12 16:45:50 -07:00
Kevin Humphreys
426f3349d4 fix type 2022-09-12 16:45:50 -07:00
Kevin Humphreys
3a63ad1913 include rapidfuzz_capi
not yet used
2022-09-12 16:45:50 -07:00
Kevin Humphreys
66e9fdd246 add fuzzy param to EntityMatcher 2022-09-12 16:45:50 -07:00
Kevin Humphreys
dacfb57b03 enable fuzzy matching 2022-09-12 16:45:50 -07:00
Sofie Van Landeghem
cc10a27c59
Prevent tok2vec from broadcasting to listeners when predicting (#11385)
* replicate bug with tok2vec in annotating components

* add overfitting test with a frozen tok2vec

* remove broadcast from predict and check doc.tensor instead

* remove broadcast

* proper error

* slight rephrase of documentation
2022-09-12 15:36:48 +02:00
Madeesh Kannan
0ec9a696e6
Fix config validation failures caused by NVTX pipeline wrappers (#11460)
* Enable Cython<->Python bindings for `Pipe` and `TrainablePipe` methods

* `pipes_with_nvtx_range`: Skip hooking methods whose signature cannot be ascertained

When loading pipelines from a config file, the arguments passed to individual pipeline components are validated by `pydantic` during init. For this, the validation model attempts to parse the function signature of the component's constructor/entry point so that it can check if all mandatory parameters are present in the config file.

When using `models_and_pipes_with_nvtx_range` as an `after_pipeline_creation` callback, the methods of all pipeline components get replaced by an NVTX range wrapper **before** the above-mentioned validation takes place. This can be problematic for components that are implemented as Cython extension types: if the extension type is not compiled with Python bindings for its methods, they will have no signatures at runtime. This resulted in `pydantic` matching the *wrapper's* parameters with those in the config and raising errors.

To avoid this, we now skip applying the wrapper to any (Cython) methods that do not have signatures.
2022-09-12 14:55:41 +02:00
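The skip rule described in the commit above can be sketched with the standard library alone. The helper names and the list-based log standing in for NVTX range push/pop calls are assumptions for illustration; the actual spaCy code is not reproduced here.

```python
import functools
import inspect


def with_range(func, log):
    """Stand-in for an NVTX range wrapper: record entry and exit."""
    @functools.wraps(func)
    def wrapped(*args, **kwargs):
        log.append(("push", func.__name__))
        try:
            return func(*args, **kwargs)
        finally:
            log.append(("pop", func.__name__))
    return wrapped


def wrap_if_inspectable(func, log):
    """Wrap `func` only if its signature can be determined."""
    try:
        inspect.signature(func)
    except (ValueError, TypeError):
        # No usable signature (e.g. some compiled methods): leave as-is,
        # so later signature-based validation never sees the wrapper.
        return func
    return with_range(func, log)


def predict(docs):
    return len(docs)


log = []
wrapped = wrap_if_inspectable(predict, log)
print(wrapped(["a", "b"]), log)
print(inspect.signature(wrapped))
```

Because `functools.wraps` sets `__wrapped__`, `inspect.signature` still reports the original parameters for wrapped functions, while callables without a signature are returned untouched.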
kadarakos
6b83fee58d
Assets message (#11458)
* new error message when 'project run assets'

* Update spacy/cli/project/run.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-09 17:17:10 +02:00
Adriane Boyd
8a86a35eab
Remove has_letters in config template (#11465)
Due to problems with the javascript conversion in the website
quickstart, remove the `has_letters` setting to simplify generating
`attrs` for the default `tok2vec`.

Additionally reduce `PREFIX` as in the trained pipelines.
2022-09-09 15:10:04 +02:00
github-actions[bot]
0c72c6bb2c
Auto-format code with black (#11468)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-09-09 11:21:17 +02:00
Madeesh Kannan
aac9a58c29
Add docs for the spacy.models_and_pipes_with_nvtx_range.v1 callback (#11463)
* Add docs for the `spacy.models_and_pipes_with_nvtx_range.v1` callback

* Add `new` tag
2022-09-09 10:46:01 +02:00
Paul O'Leary McCann
2602a30d32
Fix DVC command example (#11457)
This command doesn't have the project dir, but it's required.
2022-09-08 13:42:47 +02:00
Raphael Mitsch
1f23c615d7
Refactor KB for easier customization (#11268)
* Add implementation of batching + backwards compatibility fixes. Tests indicate issue with batch disambiguation for custom singular entity lookups.

* Fix tests. Add distinction w.r.t. batch size.

* Remove redundant and add new comments.

* Adjust comments. Fix variable naming in EL prediction.

* Fix mypy errors.

* Remove KB entity type config option. Change return types of candidate retrieval functions to Iterable from Iterator. Fix various other issues.

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/kb_base.pyx

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/kb_base.pyx

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Add error messages to NotImplementedErrors. Remove redundant comment.

* Fix imports.

* Remove redundant comments.

* Rename KnowledgeBase to InMemoryLookupKB and BaseKnowledgeBase to KnowledgeBase.

* Fix tests.

* Update spacy/errors.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Move KB into subdirectory.

* Adjust imports after KB move to dedicated subdirectory.

* Fix config imports.

* Move Candidate + retrieval functions to separate module. Fix other, small issues.

* Fix docstrings and error message w.r.t. class names. Fix typing for candidate retrieval functions.

* Update spacy/kb/kb_in_memory.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix typing.

* Change typing of mentions to be Span instead of Union[Span, str].

* Update docs.

* Update EntityLinker and _architecture docs.

* Update website/docs/api/entitylinker.md

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Adjust message for E1046.

* Re-add section for Candidate in kb.md, add reference to dedicated page.

* Update docs and docstrings.

* Re-add section + reference for KnowledgeBase.get_alias_candidates() in docs.

* Update spacy/kb/candidate.pyx

* Update spacy/kb/kb_in_memory.pyx

* Update spacy/pipeline/legacy/entity_linker.py

* Remove candidate.md. Remove mistakenly added config snippet in entity_linker.py.

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-08 10:38:07 +02:00
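The shape of the KB refactor above (an abstract base class whose unimplemented methods raise `NotImplementedError` with a message, plus an in-memory lookup implementation) might look roughly like the sketch below. The class names, method names and the use of plain strings for mentions are hypothetical simplifications; the commit itself types mentions as `Span` and the real classes live in spaCy's `kb` package.

```python
from typing import Dict, Iterable, List


class KnowledgeBaseSketch:
    """Abstract interface: subclasses decide how candidates are stored."""

    def get_candidates(self, mention: str) -> Iterable[str]:
        raise NotImplementedError(
            f"{type(self).__name__} must implement get_candidates()"
        )


class InMemoryLookupSketch(KnowledgeBaseSketch):
    """Dictionary-backed implementation kept entirely in memory."""

    def __init__(self) -> None:
        self._aliases: Dict[str, List[str]] = {}

    def add_alias(self, alias: str, entity_ids: Iterable[str]) -> None:
        self._aliases[alias] = list(entity_ids)

    def get_candidates(self, mention: str) -> Iterable[str]:
        # Return a copy so callers cannot mutate the stored list.
        return list(self._aliases.get(mention, []))


kb = InMemoryLookupSketch()
kb.add_alias("Douglas", ["Q42", "Q5301561"])
print(list(kb.get_candidates("Douglas")))
```

A custom knowledge base, for example one backed by a database, would subclass the abstract class and override only the retrieval method.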
Paul O'Leary McCann
515d5c65d5
Add dev docs on satellite packages (#11435)
* Add dev docs on satellite packages

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add displacy link

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-07 15:24:22 +02:00
Adriane Boyd
f292569b1a
Merge pull request #11444 from shadeMe/merge-master-into-develop
Merge `master` into `develop`
2022-09-06 19:58:21 +02:00
shademe
21000ae935
Merge branch 'master' into merge-master-into-develop 2022-09-06 17:50:07 +02:00
Paul O'Leary McCann
ff0522f8da Fix asent pip package name 2022-09-06 19:19:05 +09:00