spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-09-03 10:54:55 +03:00

Author	SHA1	Message	Date
Adriane Boyd	e61c8d2975	Add SpanRuler component (#9880 ) * Add SpanRuler component Add a `SpanRuler` component similar to `EntityRuler` that saves a list of matched spans to `Doc.spans[spans_key]`. The matches from the token and phrase matchers are deduplicated and sorted before assignment but are not otherwise filtered. * Update spacy/pipeline/span_ruler.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix cast * Add self.key property * Use number of patterns as length * Remove patterns kwarg from init * Update spacy/tests/pipeline/test_span_ruler.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Add options for spans filter and setting to ents * Add `spans_filter` option as a registered function * Make `spans_key` optional and if `None`, set to `doc.ents` instead of `doc.spans[spans_key]`. * Update and generalize tests * Add test for setting doc.ents, fix key property type * Fix typing * Allow independent doc.spans and doc.ents * If `spans_key` is set, set `doc.spans` with `spans_filter`. * If `annotate_ents` is set, set `doc.ents` with `ents_filter`. * Use `util.filter_spans` by default as `ents_filter`. * Use a custom warning if the filter does not work for `doc.ents`.
* Enable use of SpanC.id in Span * Support id in SpanRuler as Span.id * Update types * `id` can only be provided as string (already by `PatternType` definition) * Update all uses of Span.id/ent_id in Doc * Rename Span id kwarg to span_id * Update types and docs * Add ents filter to mimic EntityRuler overwrite_ents * Refactor `ents_filter` to take `entities, spans` args for more filtering options * Give registered filters more descriptive names * Allow registered `filter_spans` filter (`spacy.first_longest_spans_filter.v1`) to take any number of `Iterable[Span]` objects as args so it can be used for spans filter or ents filter * Implement future entity ruler as span ruler Implement a compatible `entity_ruler` as `future_entity_ruler` using `SpanRuler` as the underlying component: * Add `sort_key` and `sort_reverse` to allow the sorting behavior to be customized. (Necessary for the same sorting/filtering as in `EntityRuler`.) * Implement `overwrite_overlapping_ents_filter` and `preserve_existing_ents_filter` to support `EntityRuler.overwrite_ents` settings. * Add `remove_by_id` to support `EntityRuler.remove` functionality. * Refactor `entity_ruler` tests to parametrize all tests to test both `entity_ruler` and `future_entity_ruler` * Implement `SpanRuler.token_patterns` and `SpanRuler.phrase_patterns` properties. Additional changes: * Move all config settings to top-level attributes to avoid duplicating settings in the config vs. `span_ruler/cfg`. (Also avoids a lot of casting.) 
* Format * Fix filter make method name * Refactor to use same error for removing by label or ID * Also provide existing spans to spans filter * Support ids property * Remove token_patterns and phrase_patterns * Update docstrings * Add span ruler docs * Fix types * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Move sorting into filters * Check for all tokens in seen tokens in entity ruler filters * Remove registered sort key * Set Token.ent_id in a backwards-compatible way in Doc.set_ents * Remove sort options from API docs * Update docstrings * Rename entity ruler filters * Fix and parameterize scoring * Add id to Span API docs * Fix typo in API docs * Include explicit labeled=True for scorer Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-06-02 13:36:11 +02:00
Sofie Van Landeghem	7d30e6620e	fix typo + CI slow testing (#10835 ) * fix typo * one more typo	2022-06-02 08:18:19 +02:00
Madeesh Kannan	34261f628f	Add `test_slow_gpu` explosion-bot command (#10858 )	2022-06-02 08:18:11 +02:00
richardpaulhudson	1bf18b85e4	Update Holmes entry in universe.json	2022-06-02 08:17:30 +02:00
Max Tarlov	f6b39fce2c	Update documentation for displacy style kwargs (#10841 ) * Update docs for displacy style kwargs Added "span" to the accepted values for the style kwarg in the displacy.serve and displacy.render top-level functions. These styles are new as of spaCy 3.3, so I added the "new" tag for that option only * restored alpha ordering	2022-06-02 08:17:17 +02:00
Peter Baumgartner	ecd4343990	add doc cleaner to menu (#10862 )	2022-06-02 08:17:09 +02:00
Freddy Heppell	985cf8eb64	Fix misspelt keyword in StringStore example	2022-06-02 08:16:39 +02:00
github-actions[bot]	de6607fc9b	Auto-format code with black (#10857 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-06-02 08:16:24 +02:00
kadarakos	31a00ad7e0	Better errors for has_annotation and Matcher (#10830 ) * Show input argument instead of None * catch invalid attr early * moved error message from code to errors.py * Update spacy/errors.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/errors.py * update E153 and E154 Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-06-02 08:16:08 +02:00
Sofie Van Landeghem	4619a99185	Remove NBSP's across tables in the docs (#10842 )	2022-06-02 08:15:53 +02:00
Paul O'Leary McCann	6be09bbd07	Fix Entity Linker with tokenization mismatches (fix #9575 ) (#10457 ) * Add failing test * Partial fix for issue This kind of works. The issue with token length mismatches is gone. The problem is that when you get empty lists of encodings to compare, it fails because the sizes are not the same, even though they're both zero: (0, 3) vs (0,). Not sure why that happens... * Short circuit on empties * Remove spurious check The check here isn't needed now that the short circuit is fixed. * Update spacy/tests/pipeline/test_entity_linker.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Use "eg", not "example" Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-05-23 20:42:26 +02:00
Lj Miranda	1d34aa2b3d	Add spacy-span-analyzer to debug data (#10668 ) * Rename to spans_key for consistency * Implement spans length in debug data * Implement how span bounds and spans are obtained In this commit, I implemented how span boundaries (the tokens) around a given span and spans are obtained. I've put them in the compile_gold() function so that it's accessible later on. I will do the actual computation of the span and boundary distinctiveness in the main function above. * Compute for p_spans and p_bounds * Add computation for SD and BD * Fix mypy issues * Add weighted average computation * Fix compile_gold conditional logic * Add test for frequency distribution computation * Add tests for kl-divergence computation * Fix weighted average computation * Make tables more compact by rounding them * Add more descriptive checks for spans * Modularize span computation methods In this commit, I added the _get_span_characteristics and _print_span_characteristics functions so that they can be reusable anywhere. * Remove unnecessary arguments and make fxs more compact * Update a few parameter arguments * Add tests for print_span and get_span methods * Update API to talk about span characteristics in brief * Add better reporting of spans_length * Add test for span length reporting * Update formatting of span length report Removed '' to indicate that it's not a string, then sort the n-grams by their length, not by their frequency. * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Show all frequency distribution when -V In this commit, I displayed the full frequency distribution of the span lengths when --verbose is passed. To make things simpler, I rewrote some of the formatter functions so that I can call them whenever. Another notable change is that instead of showing percentages as Integers, I showed them as floats (max 2-decimal places). I did this because it looks weird when it displays (0%). 
* Update logic on how total is computed The way the 90% thresholding is computed now is that we keep adding the percentages until we reach >= 90%. I also updated the wording and used the term "At least" to denote that >= 90% of your spans have these distributions. * Fix display when showing the threshold percentage * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Add better phrasing for span information * Update spacy/cli/debug_data.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Add minor edits for whitespaces etc. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-05-23 19:06:38 +02:00
Peter Baumgartner	7ce3460b23	add floret to static vectors docs (#10833 )	2022-05-23 09:16:31 +02:00
kadarakos	a3814ee739	oov confusion fix (#10828 )	2022-05-23 09:15:51 +02:00
Madeesh Kannan	4fb1809c72	Disable weekly GPU/slow tests on forks (#10831 )	2022-05-20 15:46:30 +02:00
Adriane Boyd	a82ec56aae	Remove cuda extras for non-linux arm in install widget (#10796 ) * Remove cuda extras for non-linux arm platforms in install widget * Extend cuda versions install widget * Update GPU install docs to clarify cuda	2022-05-20 09:57:41 +02:00
Paul O'Leary McCann	46982cf694	Add glossary entry for root (#10821 ) * Add glossary entry for root There was already one but it was lower case, maybe that should be removed? * remove lowercase root On reflection, that was probably just a mistake. * Add lowercase root back It's harmless to leave it there.	2022-05-20 09:56:32 +02:00
Raphael Mitsch	357be2614e	Fuzz tokenizer.explain: draft for fuzzy tests. (#10771 ) * Fuzz tokenizer.explain: draft for fuzzy tests. * Fuzz tokenizer.explain: xignoring tokenizer.explain() tests. Removed deadline modification. Removed LANGUAGES_WITHOUT_TOKENIZERS. * Fuzz tokenizer.explain: changed tokenizer initialization to avoid failures in Azure runs. * Fuzz tokenizer.explain: type hint for tokenizer in test. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-05-17 10:23:16 +02:00
github-actions[bot]	99aeaf9bd3	Auto-format code with black (#10795 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-05-13 19:02:08 +02:00
kadarakos	fd36469900	bugfix parser labels (#10797 )	2022-05-13 11:41:32 +02:00
Paul O'Leary McCann	7634a488fe	Merge pull request #10793 from Schero1994/feature/update Update spaCy Universe: spacytextblob (code example)	2022-05-13 12:07:37 +09:00
schaeran	f5952c0851	update spaCy Universe: spacytextblob (code example)	2022-05-12 18:23:00 +02:00
Patrick Düggelin	cb06309ed8	Fix PhraseMatcher remove overlapping terms (#10734 ) * Add regression test for issue 10643 * Improve overlapping terms testcase * Fix removing overlapping terms in phrase matcher (#10643)	2022-05-12 12:23:52 +02:00
Raphael Mitsch	6f9e2ca81f	Ignore overrides for pipe names in config argument (#10779 ) * Pipe name override in config: added check with warning, added removal of name override from config, extended tests. * Pipe name override in config: added pytest UserWarning. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-05-12 11:46:08 +02:00
Adriane Boyd	b65d652881	Override SpanGroups.setdefault to provide default SpanGroup (#10772 ) * Fix mistake in SpanGroup API docs * Restrict SpanGroups.setdefault to SpanGroup only * Refactor to support default span iterable	2022-05-12 10:06:25 +02:00
Richard Hudson	d524f6415f	Add documentation tip about overriding variables (#10780 )	2022-05-11 10:15:32 +02:00
Raphael Mitsch	2904359685	Allow assets to be optional in spacy project (#10714 ) * Allow assets to be optional in spacy project: draft for optional flag/download_all options. * Allow assets to be optional in spacy project: added OPTIONAL_DEFAULT reflecting default asset optionality. * Allow assets to be optional in spacy project: renamed --all to --extra. * Allow assets to be optional in spacy project: included optional flag in project config test. * Allow assets to be optional in spacy project: added documentation. * Allow assets to be optional in spacy project: fixing deprecated --all reference. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Allow assets to be optional in spacy project: fixed project_assets() docstring. * Allow assets to be optional in spacy project: adjusted wording in justification of optional assets. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: switched to as keyword in project.yml. Updated docs. * Allow assets to be optional in spacy project: updated comment. * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in output. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in docstring.. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test.. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: renamed OPTIONAL_DEFAULT to EXTRA_DEFAULT. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-05-10 10:40:11 +02:00
Sofie Van Landeghem	1543558d08	Add test for old architectures (#10751 ) * add v1 and v2 tests for tok2vec architectures * textcat architectures are not "layers" * test older textcat architectures * test older parser architecture	2022-05-10 08:24:42 +02:00
Madeesh Kannan	733114bdd9	`training.md`: Fix typos (#10775 )	2022-05-09 19:44:14 +02:00
Raphael Mitsch	e626df959f	Document different ways to create a pipeline (#10762 ) * Document different ways to create a pipeline: moved up/slightly modified paragraph on pipeline creation. * Document different ways to create a pipeline: changed Finnish to Ukrainian in example for language without trained pipeline. * Document different ways to create a pipeline: added explanation of blank pipeline. * Document different ways to create a pipeline: exchanged Ukrainian with Yoruba.	2022-05-06 15:40:59 +02:00
Richard Hudson	c32e1a0079	Updated Coreferee Universe entry (#10763 )	2022-05-06 13:21:39 +02:00
Luca Dorigo	0a92d5644e	Fix StringStore.__getitem__ return type depending on parameter types (#10741 ) * Fix StringStore.__getitem__ return type depending on parameter types Small fix using `@overload` so that `StringStore.__getitem__` returns an `int` when given a `str` or `bytes` and a `str` when given an `int`. * Update spacy/strings.pyi Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-05-03 17:57:07 +02:00
Sofie Van Landeghem	e03b9f8095	Small doc typos (#10750 ) * fix typos * formatting	2022-05-03 13:55:27 +02:00
Raphael Mitsch	f5390e278a	Refactor error messages to remove hardcoded strings (#10729 ) * Use custom error msg instead of hardcoded string: replaced remaining hardcoded error message strings. * Use custom error msg instead of hardcoded string: fixing faulty Errors import.	2022-05-02 13:38:46 +02:00
Madeesh Kannan	0a503ce5e0	Remove vestigial debug print statement in `walk_head_nodes` (#10718 ) * `graph`: Remove vestigial debug print statement in `walk_head_nodes` * Revert whitespace changes * Remove more debug print statements	2022-05-02 13:36:35 +02:00
vincent d warmerdam	f3de976513	Update universe.json to Include spaCy video #6 (#10723 ) * Update universe.json I noticed that episode 6 was missing, so I added it. * Update universe.json * Update universe.json	2022-05-02 13:35:14 +02:00
Adriane Boyd	497a708c71	Docs for v3.3 (#10628 ) * Temporarily disable CI tests * Start v3.3 website updates * Add trainable lemmatizer to pipeline design * Fix Vectors.most_similar * Add floret vector info to pipeline design * Add Lower and Upper Sorbian * Add span to sidebar * Work on release notes * Copy from release notes * Update pipeline design graphic * Upgrading note about Doc.from_docs * Add tables and details * Update website/docs/models/index.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix da lemma acc * Add minimal intro, various updates * Round lemma acc * Add section on floret / word lists * Add new pipelines table, minor edits * Fix displacy spans example title * Clarify adding non-trainable lemmatizer * Update adding-languages URLs * Revert "Temporarily disable CI tests" This reverts commit `1dee505920`. * Spell out words/sec Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-04-28 14:09:35 +02:00
Adriane Boyd	10377fb945	Set version to v3.3.0 (#10614 ) * Set version to v3.3.0 * Revert "Temporarily skip tests that require models/compat" This reverts commit `e422101e00`.	2022-04-28 13:07:49 +02:00
Raphael Mitsch	3579507ba1	Bumped black to 22.3.0 due to a fix for https://github.com/psf/black/issues/2964 . (#10715 )	2022-04-27 14:49:24 +02:00
harmbuisman	c066fb8a4e	#10672 : fixes displacy output for manual unsorted entities (#10673 ) * #10672: fixes displacy output for manual unsorted entities * #10672: removed unused import * fix prettier formatting Co-authored-by: Harm Buisman <h.buisman@iknl.nl> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-04-27 09:51:58 +02:00
Sofie Van Landeghem	b3717ba53a	removing print statements from the test suite (#10712 )	2022-04-27 09:14:25 +02:00
Adriane Boyd	455f089c9b	Support exclude in Doc.from_docs (#10689 ) * Support exclude in Doc.from_docs * Update API docs * Add new tag to docs	2022-04-25 18:19:03 +02:00
Mike	3b208197c3	Fixed example for spacy_syllables (#10705 ) There was a typo in the example for the spacy_syllables project.	2022-04-25 16:40:54 +02:00
github-actions[bot]	e07500369c	Auto-format code with black (#10687 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-04-22 11:24:53 +02:00
Sofie Van Landeghem	2c2dbb844c	Syntax for a branch from a PR	2022-04-22 09:45:49 +02:00
Ryn Daniels	29afbdb91e	add readme for explosion-bot (#10677 )	2022-04-20 09:52:34 +02:00
Richard Hudson	4b227f4861	Merge pull request #10669 from mgrojo/develop Fix some issues in Spanish stop-word list and examples	2022-04-19 09:37:34 +02:00
mgr	3d50b1a989	Fix some issues in Spanish examples - Spelling: nationalities in lowercase, accent. - Incorrect verb composition - Untranslated word	2022-04-18 22:12:57 +02:00
mgr	2a2654c756	Remove significant or not very frequent words from stop word list [es] The list of stop words for Spanish contained many inadequate words, see: https://github.com/explosion/spaCy/issues/3052#issuecomment-1100760100 Removed words: - verb forms of 'trabajar' (work) and intentar (try) - words related to 'empleo' (employment) - incorrect words: ampleamos, arribaabajo, soyos, paìs - miscellaneous words due to being too significant or too infrequent: actualmente, aproximadamente, antaño, cosas, ejemplo, horas, general, pais, principalmente, raras Added other stop words for completion: - Spanish one-letter words - numbers up to twelve Some reformatting to 79 columns. When in doubt, the English and German lists have been consulted as good examples.	2022-04-18 22:04:02 +02:00
Madeesh Kannan	aa6780eb27	`Matcher`: Remove superfluous GIL-acquiring check in `get_is_final` (#10659 ) * `Matcher`: Remove superfluous GIL-acquiring check in `get_is_final` This check incurred a significant performance penalty due to implicit interactions between the GIL and Cython ref-counting code. * `Matcher`: Inline `PatternStateC` accessors	2022-04-18 12:59:34 +02:00
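
Note on commit 1d34aa2b3d (Add spacy-span-analyzer to debug data): its message describes the 90% span-length threshold as adding up percentages until the running total reaches >= 90%. Below is a minimal sketch of that accumulation in plain Python. The function name `lengths_covering`, its parameters, and the most-frequent-first iteration order are assumptions made for illustration; this is not the code in spacy/cli/debug_data.py.

```python
from collections import Counter
from typing import List, Tuple


def lengths_covering(
    span_lengths: List[int], threshold: float = 90.0
) -> List[Tuple[int, float]]:
    """Hypothetical helper: collect (length, percentage) pairs until the
    running percentage total reaches `threshold`."""
    counts = Counter(span_lengths)
    total = sum(counts.values())
    covered: List[Tuple[int, float]] = []
    running = 0.0
    # Assumption: visit the most frequent span lengths first.
    for length, count in counts.most_common():
        pct = count * 100 / total
        # Percentages are kept as floats with at most two decimals,
        # as the commit message describes for the report.
        covered.append((length, round(pct, 2)))
        running += pct
        if running >= threshold:
            break
    return covered


if __name__ == "__main__":
    # Five spans of length 1, three of length 2, two of length 3.
    print(lengths_covering([1] * 5 + [2] * 3 + [3] * 2, threshold=80.0))
```

With an empty input the loop never runs and the helper returns an empty list, so no division by zero occurs.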

15446 Commits