spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-01-27 01:34:30 +03:00

Author	SHA1	Message	Date
github-actions[bot]	24aafdffad	Auto-format code with black (#10908 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-06-03 11:01:55 +02:00
Adriane Boyd	727ce6d1f5	Remove English exceptions with mismatched features (#10873 ) Remove English contraction exceptions with mismatched features that lead to exceptions like "theses" and "thisre".	2022-06-03 09:44:04 +02:00
Madeesh Kannan	41389ffe1e	Avoid pickling `Doc` inputs passed to `Language.pipe()` (#10864 ) * `Language.pipe()`: Serialize `Doc` objects to bytes when using multiprocessing to avoid pickling overhead * `Doc.to_dict()`: Serialize `_context` attribute (keeping in line with `(un)pickle_doc()` * Correct type annotations * Fix typo * `Doc`: Do not serialize `_context` * `Language.pipe`: Send context objects to child processes, Simplify `as_tuples` handling * Fix type annotation * `Language.pipe`: Simplify `as_tuple` multiprocessor handling * Cleanup code, fix typos * MyPy fixes * Move doc preparation function into `_multiprocessing_pipe` Whitespace changes * Remove superfluous comma * Rename `prepare_doc` to `prepare_input` * Update spacy/errors.py * Undo renaming for error Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-06-02 20:06:49 +02:00
single-fingal	6c6b8da7cc	Fix: De/Serialize `SpanGroups` including the SpanGroup keys (#10707 ) * fix: De/Serialize `SpanGroups` including the SpanGroup keys This prevents the loss of `SpanGroup`s that have the same .name as other `SpanGroup`s within the same `SpanGroups` object (upon de/serialization of the `SpanGroups`). Fixes #10685 * Maintain backwards compatibility for serialized `SpanGroups` (serialized as: a list of `SpanGroup`s, or b'') * Add tests for `SpanGroups` deserialization backwards-compatibility * Move a `SpanGroups` de/serialization test (test_issue10685) to tests/serialize/test_serialize_spangroups.py * Output a warning if deserializing a `SpanGroups` with duplicate .name-d `SpanGroup`s * Minor refactor * `SpanGroups.from_bytes` handles only `list` and `dict` types with `dict` as the expected default * For lists, keep first rather than last value encountered * Update error message * Rename and update tests * Update to preserve list serialization of SpanGroups To avoid breaking compatibility of serialized `Doc` and `DocBin` with earlier versions of spacy v3, revert back to a list-only serialization, but update the names just for serialization so that the SpanGroups keys override the SpanGroup names. * Preserve object identity and current key overwrite * Preserve SpanGroup object identity * Preserve last rather than first span group from SpanGroup list format without SpanGroups keys * Update inline comments * Fix types * Add type info for SpanGroup.copy * Deserialize `SpanGroup`s as copies when a single SpanGroup is the value for more than 1 `SpanGroups` key. This is because we serialize `SpanGroups` as dicts (to maintain backward- and forward-compatibility) and we can't assume `SpanGroup`s with the same bytes/serialization were the same (identical) object, pre-serialization. * Update spacy/tokens/_dict_proxies.py * Add more SpanGroups serialization tests Test that serialized SpanGroups maintain their Span order * small clarification on older spaCy version * Update spacy/tests/serialize/test_serialize_span_groups.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-06-02 15:56:27 +02:00
Adriane Boyd	7e13652d36	Fix schemas import in Doc (#10898 )	2022-06-02 15:53:03 +02:00
Raphael Mitsch	8387ce4c01	Add Doc.from_json() (#10688 ) * Implement Doc.from_json: rough draft. * Implement Doc.from_json: first draft with tests. * Implement Doc.from_json: added documentation on website for Doc.to_json(), Doc.from_json(). * Implement Doc.from_json: formatting changes. * Implement Doc.to_json(): reverting unrelated formatting changes. * Implement Doc.to_json(): fixing entity and span conversion. Moving fixture and doc <-> json conversion tests into single file. * Implement Doc.from_json(): replaced entity/span converters with doc.char_span() calls. * Implement Doc.from_json(): handling sentence boundaries in spans. * Implementing Doc.from_json(): added parser-free sentence boundaries transfer. * Implementing Doc.from_json(): added parser-free sentence boundaries transfer. * Implementing Doc.from_json(): incorporated various PR feedback. * Renaming fixture for document without dependencies. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implementing Doc.from_json(): using two sent_starts instead of one. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implementing Doc.from_json(): doc_without_dependency_parser() -> doc_without_deps. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implementing Doc.from_json(): incorporating various PR feedback. Rebased on latest master. * Implementing Doc.from_json(): refactored Doc.from_json() to work with annotation IDs instead of their string representations. * Implement Doc.from_json(): reverting unwanted formatting/rebasing changes. * Implement Doc.from_json(): added check for char_span() calculation for entities. * Update spacy/tokens/doc.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): minor refactoring, additional check for token attribute consistency with corresponding test. * Implement Doc.from_json(): removed redundancy in annotation type key naming. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): Simplifying setting annotation values. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement doc.from_json(): renaming annot_types to token_attrs. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): adjustments for renaming of annot_types to token_attrs. * Implement Doc.from_json(): removing default categories. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): simplifying lexeme initialization. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): simplifying lexeme initialization. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): refactoring to only have keys for present annotations. * Implement Doc.from_json(): fix check for tokens' HEAD attributes. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): refactoring Doc.from_json(). * Implement Doc.from_json(): fixing span_group retrieval. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): fixing span retrieval. * Implement Doc.from_json(): added schema for Doc JSON format. Minor refactoring in Doc.from_json(). * Implement Doc.from_json(): added comment regarding Token and Span extension support. * Implement Doc.from_json(): renaming inconsistent_props to partial_attrs.. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): adjusting error message. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): extending E1038 message. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): added params to E1038 raises. * Implement Doc.from_json(): combined attribute collection with partial attributes check. * Implement Doc.from_json(): added optional schema validation. * Implement Doc.from_json(): fixed optional fields in schema, tests. * Implement Doc.from_json(): removed redundant None check for DEP. * Implement Doc.from_json(): added passing of schema validatoin message to E1037.. * Implement Doc.from_json(): removing redundant error E1040. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): changing message for E1037. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): adjusted website docs and docstring of Doc.from_json(). * Update spacy/tests/doc/test_json_doc_conversion.py * Implement Doc.from_json(): docstring update. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): docstring update. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): website docs update. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): docstring formatting. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): docstring formatting. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): fixing Doc reference in website docs. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): reformatted website/docs/api/doc.md. * Implement Doc.from_json(): bumped IDs of new errors to avoid merge conflicts. * Implement Doc.from_json(): fixing bug in tests. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): fix setting of sentence starts for docs without DEP. * Implement Doc.from_json(): add check for valid char spans when manually setting sentence boundaries. Refactor sentence boundary setting slightly. Move error message for lack of support for partial token annotations to errors.py. * Implement Doc.from_json(): simplify token sentence start manipulation. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Combine related error messages * Update spacy/tests/doc/test_json_doc_conversion.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-06-02 14:03:47 +02:00
Adriane Boyd	a322d6d5f2	Add SpanRuler component (#9880 ) * Add SpanRuler component Add a `SpanRuler` component similar to `EntityRuler` that saves a list of matched spans to `Doc.spans[spans_key]`. The matches from the token and phrase matchers are deduplicated and sorted before assignment but are not otherwise filtered. * Update spacy/pipeline/span_ruler.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix cast * Add self.key property * Use number of patterns as length * Remove patterns kwarg from init * Update spacy/tests/pipeline/test_span_ruler.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Add options for spans filter and setting to ents * Add `spans_filter` option as a registered function' * Make `spans_key` optional and if `None`, set to `doc.ents` instead of `doc.spans[spans_key]`. * Update and generalize tests * Add test for setting doc.ents, fix key property type * Fix typing * Allow independent doc.spans and doc.ents * If `spans_key` is set, set `doc.spans` with `spans_filter`. * If `annotate_ents` is set, set `doc.ents` with `ents_fitler`. * Use `util.filter_spans` by default as `ents_filter`. * Use a custom warning if the filter does not work for `doc.ents`. * Enable use of SpanC.id in Span * Support id in SpanRuler as Span.id * Update types * `id` can only be provided as string (already by `PatternType` definition) * Update all uses of Span.id/ent_id in Doc * Rename Span id kwarg to span_id * Update types and docs * Add ents filter to mimic EntityRuler overwrite_ents * Refactor `ents_filter` to take `entities, spans` args for more filtering options * Give registered filters more descriptive names * Allow registered `filter_spans` filter (`spacy.first_longest_spans_filter.v1`) to take any number of `Iterable[Span]` objects as args so it can be used for spans filter or ents filter * Implement future entity ruler as span ruler Implement a compatible `entity_ruler` as `future_entity_ruler` using `SpanRuler` as the underlying component: * Add `sort_key` and `sort_reverse` to allow the sorting behavior to be customized. (Necessary for the same sorting/filtering as in `EntityRuler`.) * Implement `overwrite_overlapping_ents_filter` and `preserve_existing_ents_filter` to support `EntityRuler.overwrite_ents` settings. * Add `remove_by_id` to support `EntityRuler.remove` functionality. * Refactor `entity_ruler` tests to parametrize all tests to test both `entity_ruler` and `future_entity_ruler` * Implement `SpanRuler.token_patterns` and `SpanRuler.phrase_patterns` properties. Additional changes: * Move all config settings to top-level attributes to avoid duplicating settings in the config vs. `span_ruler/cfg`. (Also avoids a lot of casting.) * Format * Fix filter make method name * Refactor to use same error for removing by label or ID * Also provide existing spans to spans filter * Support ids property * Remove token_patterns and phrase_patterns * Update docstrings * Add span ruler docs * Fix types * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Move sorting into filters * Check for all tokens in seen tokens in entity ruler filters * Remove registered sort key * Set Token.ent_id in a backwards-compatible way in Doc.set_ents * Remove sort options from API docs * Update docstrings * Rename entity ruler filters * Fix and parameterize scoring * Add id to Span API docs * Fix typo in API docs * Include explicit labeled=True for scorer Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-06-02 13:12:53 +02:00
Sofie Van Landeghem	f7507c2327	fix typo + CI slow testing (#10835 ) * fix typo * one more typo	2022-06-02 00:10:16 +02:00
Paul O'Leary McCann	dca2e8c644	Minor NEL type fixes (#10860 ) * Fix TODO about typing Fix was simple: just request an array2f. * Add type ignore Maxout has a more restrictive type than the residual layer expects (only Floats2d vs any Floats). * Various cleanup This moves a lot of lines around but doesn't change any functionality. Details: 1. use `continue` to reduce indentation 2. move sentence doc building inside conditional since it's otherwise unused 3. reduces some temporary assignments	2022-06-01 00:41:28 +02:00
Daniël de Kok	09a5f03dd7	Merge pull request #10849 from danieldk/simplify-gpu-check Simplify GPU check	2022-05-30 16:35:10 +02:00
Daniël de Kok	85dd2b6c04	Parser: use C saxpy/sgemm provided by the Ops implementation (#10773 ) * Parser: use C saxpy/sgemm provided by the Ops implementation This is a backport of https://github.com/explosion/spaCy/pull/10747 from the parser refactor branch. It eliminates the explicit calls to BLIS, instead using the saxpy/sgemm provided by the Ops implementation. This allows us to use Accelerate in the parser on M1 Macs (with an updated thinc-apple-ops). Performance of the de_core_news_lg pipe: BLIS 0.7.0, no thinc-apple-ops: 6385 WPS BLIS 0.7.0, thinc-apple-ops: 36455 WPS BLIS 0.9.0, no thinc-apple-ops: 19188 WPS BLIS 0.9.0, thinc-apple-ops: 36682 WPS This PR, thinc-apple-ops: 38726 WPS Performance of the de_core_news_lg pipe (only tok2vec -> parser): BLIS 0.7.0, no thinc-apple-ops: 13907 WPS BLIS 0.7.0, thinc-apple-ops: 73172 WPS BLIS 0.9.0, no thinc-apple-ops: 41576 WPS BLIS 0.9.0, thinc-apple-ops: 72569 WPS This PR, thinc-apple-ops: 87061 WPS * Require thinc >=8.1.0,<8.2.0 * Lower thinc lowerbound to 8.1.0.dev0 * Use best CPU ops for CBLAS when the parser model is on the GPU * Fix another unguarded cblas() call * Fix: use ops as a shorthand for self.model.ops Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>	2022-05-27 11:20:52 +02:00
github-actions[bot]	6172af8158	Auto-format code with black (#10857 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-05-27 10:54:54 +02:00
Daniël de Kok	7c6a97559d	Simplify GPU check This change removes `thinc.util.has_cupy` from the GPU presence check. Currently `gpu_is_available` already implies `has_cupy`. We also want to show this warning in the future when a machine has a non-CuPy GPU.	2022-05-25 14:06:45 +02:00
kadarakos	f6a4b80c0b	Better errors for has_annotation and Matcher (#10830 ) * Show input argument instead of None * catch invalid attr early * moved error message from code to errors.py * Update spacy/errors.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/errors.py * update E153 and E154 Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-05-25 11:12:29 +02:00
Richard Hudson	32954c3bcb	Fix issues for Mypy 0.950 and Pydantic 1.9.0 (#10786 ) * Make changes to typing * Correction * Format with black * Corrections based on review * Bumped Thinc dependency version * Bumped blis requirement * Correction for older Python versions * Update spacy/ml/models/textcat.py Co-authored-by: Daniël de Kok <me@github.danieldk.eu> * Corrections based on review feedback * Readd deleted docstring line Co-authored-by: Daniël de Kok <me@github.danieldk.eu>	2022-05-25 09:33:54 +02:00
Paul O'Leary McCann	6be09bbd07	Fix Entity Linker with tokenization mismatches (fix #9575 ) (#10457 ) * Add failing test * Partial fix for issue This kind of works. The issue with token length mismatches is gone. The problem is that when you get empty lists of encodings to compare, it fails because the sizes are not the same, even though they're both zero: (0, 3) vs (0,). Not sure why that happens... * Short circuit on empties * Remove spurious check The check here isn't needed now the the short circuit is fixed. * Update spacy/tests/pipeline/test_entity_linker.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Use "eg", not "example" Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-05-23 20:42:26 +02:00
Lj Miranda	1d34aa2b3d	Add spacy-span-analyzer to debug data (#10668 ) * Rename to spans_key for consistency * Implement spans length in debug data * Implement how span bounds and spans are obtained In this commit, I implemented how span boundaries (the tokens) around a given span and spans are obtained. I've put them in the compile_gold() function so that it's accessible later on. I will do the actual computation of the span and boundary distinctiveness in the main function above. * Compute for p_spans and p_bounds * Add computation for SD and BD * Fix mypy issues * Add weighted average computation * Fix compile_gold conditional logic * Add test for frequency distribution computation * Add tests for kl-divergence computation * Fix weighted average computation * Make tables more compact by rounding them * Add more descriptive checks for spans * Modularize span computation methods In this commit, I added the _get_span_characteristics and _print_span_characteristics functions so that they can be reusable anywhere. * Remove unnecessary arguments and make fxs more compact * Update a few parameter arguments * Add tests for print_span and get_span methods * Update API to talk about span characteristics in brief * Add better reporting of spans_length * Add test for span length reporting * Update formatting of span length report Removed '' to indicate that it's not a string, then sort the n-grams by their length, not by their frequency. * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Show all frequency distribution when -V In this commit, I displayed the full frequency distribution of the span lengths when --verbose is passed. To make things simpler, I rewrote some of the formatter functions so that I can call them whenever. Another notable change is that instead of showing percentages as Integers, I showed them as floats (max 2-decimal places). I did this because it looks weird when it displays (0%). * Update logic on how total is computed The way the 90% thresholding is computed now is that we keep adding the percentages until we reach >= 90%. I also updated the wording and used the term "At least" to denote that >= 90% of your spans have these distributions. * Fix display when showing the threshold percentage * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Add better phrasing for span information * Update spacy/cli/debug_data.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Add minor edits for whitespaces etc. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-05-23 19:06:38 +02:00
Paul O'Leary McCann	46982cf694	Add glossary entry for root (#10821 ) * Add glossary entry for root There was already one but it was lower case, maybe that should be removed? * remove lowercase root On reflection, that was probably just a mistake. * Add lowercase root back It's harmless to leave it there.	2022-05-20 09:56:32 +02:00
Raphael Mitsch	357be2614e	Fuzz tokenizer.explain: draft for fuzzy tests. (#10771 ) * Fuzz tokenizer.explain: draft for fuzzy tests. * Fuzz tokenizer.explain: xignoring tokenizer.explain() tests. Removed deadline modification. Removed LANGUAGES_WITHOUT_TOKENIZERS. * Fuzz tokenizer.explain: changed tokenizer initialization to avoid failus in Azure runs. * Fuzz tokenizer.explain: type hint for tokenizer in test. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-05-17 10:23:16 +02:00
github-actions[bot]	99aeaf9bd3	Auto-format code with black (#10795 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-05-13 19:02:08 +02:00
kadarakos	fd36469900	bugfix parser labels (#10797 )	2022-05-13 11:41:32 +02:00
Patrick Düggelin	cb06309ed8	Fix PhraseMatcher remove overlapping terms (#10734 ) * Add regression test for issue 10643 * Improve overlapping terms testcase * Fix removing overlapping terms in phrase matcher (#10643)	2022-05-12 12:23:52 +02:00
Raphael Mitsch	6f9e2ca81f	Ignore overrides for pipe names in config argument (#10779 ) * Pipe name override in config: added check with warning, added removal of name override from config, extended tests. * Pipoe name override in config: added pytest UserWarning. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-05-12 11:46:08 +02:00
Adriane Boyd	b65d652881	Override SpanGroups.setdefault to provide default SpanGroup (#10772 ) * Fix mistake in SpanGroup API docs * Restrict SpanGroups.setdefault to SpanGroup only * Refactor to support default span iterable	2022-05-12 10:06:25 +02:00
Raphael Mitsch	2904359685	Allow assets to be optional in spacy project (#10714 ) * Allow assets to be optional in spacy project: draft for optional flag/download_all options. * Allow assets to be optional in spacy project: added OPTIONAL_DEFAULT reflecting default asset optionality. * Allow assets to be optional in spacy project: renamed --all to --extra. * Allow assets to be optional in spacy project: included optional flag in project config test. * Allow assets to be optional in spacy project: added documentation. * Allow assets to be optional in spacy project: fixing deprecated --all reference. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Allow assets to be optional in spacy project: fixed project_assets() docstring. * Allow assets to be optional in spacy project: adjusted wording in justification of optional assets. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: switched to as keyword in project.yml. Updated docs. * Allow assets to be optional in spacy project: updated comment. * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in output. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in docstring.. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test.. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: renamed OPTIONAL_DEFAULT to EXTRA_DEFAULT. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-05-10 10:40:11 +02:00
Sofie Van Landeghem	1543558d08	Add test for old architectures (#10751 ) * add v1 and v2 tests for tok2vec architectures * textcat architectures are not "layers" * test older textcat architectures * test older parser architecture	2022-05-10 08:24:42 +02:00
Luca Dorigo	0a92d5644e	Fix StringStore.__getitem__ return type depending on parameter types (#10741 ) * Fix StringStore.__getitem__ return type depending on parameter types Small fix using `@overload` so that `StringStore.__getitem__` returns an `int` when given a `str` or `bytes` and a `str` when given an `int`. * Update spacy/strings.pyi Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-05-03 17:57:07 +02:00
Raphael Mitsch	f5390e278a	Refactor error messages to remove hardcoded strings (#10729 ) * Use custom error msg instead of hardcoded string: replaced remaining hardcoded error message strings. * Use custom error msg instead of hardcoded string: fixing faulty Errors import.	2022-05-02 13:38:46 +02:00
Madeesh Kannan	0a503ce5e0	Remove vestigial debug print statement in `walk_head_nodes` (#10718 ) * `graph`: Remove vestigial debug print statement in `walk_head_nodes` * Revert whitespace changes * Remove more debug print statements	2022-05-02 13:36:35 +02:00
Adriane Boyd	10377fb945	Set version to v3.3.0 (#10614 ) * Set version to v3.3.0 * Revert "Temporarily skip tests that require models/compat" This reverts commit `e422101e00`.	2022-04-28 13:07:49 +02:00
harmbuisman	c066fb8a4e	#10672 : fixes displacy output for manual unsorted entities (#10673 ) * #10672: fixes displacy output for manual unsorted entities * #10672: removed unused import * fix prettier formatting Co-authored-by: Harm Buisman <h.buisman@iknl.nl> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-04-27 09:51:58 +02:00
Sofie Van Landeghem	b3717ba53a	removing print statements from the test suite (#10712 )	2022-04-27 09:14:25 +02:00
Adriane Boyd	455f089c9b	Support exclude in Doc.from_docs (#10689 ) * Support exclude in Doc.from_docs * Update API docs * Add new tag to docs	2022-04-25 18:19:03 +02:00
github-actions[bot]	e07500369c	Auto-format code with black (#10687 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-04-22 11:24:53 +02:00
Richard Hudson	4b227f4861	Merge pull request #10669 from mgrojo/develop Fix some issues in Spanish stop-word list and examples	2022-04-19 09:37:34 +02:00
mgr	3d50b1a989	Fix some issues in Spanish examples - Spelling: nationalities in lowercase, accent. - Incorrect verb composition - Untranslated word	2022-04-18 22:12:57 +02:00
mgr	2a2654c756	Remove significant or not very frequent words from stop word list [es] The list of stop words for Spanish contained many inadequate words, see: https://github.com/explosion/spaCy/issues/3052#issuecomment-1100760100 Removed words: - verb forms of 'trabajar' (work) and intentar (try) - words related to 'empleo' (employment) - incorrect words: ampleamos, arribaabajo, soyos, paìs - miscellaneous words due to being too significant of too infrequent: actualmente, aproximadamente, antaño, cosas, ejemplo, horas, general, pais, principalmente, raras Added other stop words for completion: - Spanish one-letter words - numbers up to twelve Some reformatting to 79 columns. When in doubt, the English and German lists have been consulted as good examples.	2022-04-18 22:04:02 +02:00
Madeesh Kannan	aa6780eb27	`Matcher`: Remove superfluous GIL-acquiring check in `get_is_final` (#10659 ) * `Matcher`: Remove superfluous GIL-acquiring check in `get_is_final` This check incurred a significant performance penalty due to implict interactions between the GIL and Cython ref-counting code. * `Matcher`: Inline `PatternStateC` accessors	2022-04-18 12:59:34 +02:00
Duy Ngo	229ecaf0ea	Add numbers and definitions (#10665 )	2022-04-18 12:58:32 +02:00
Joachim Fainberg	4e1716223c	displaCy: Avoid increasing levels for identical arcs (#10639 ) * Test for arc levels for identical arcs Also moves the test in order with the other numbered tests. * displaCy: filter identical arcs Avoid increased levels due to identical arcs by first filtering any identical arcs. * Sort keys before filtering Manual entry with keys out of order would previously become different tuples and therefore not filtered correctly. Co-authored-by: Joachim Fainberg <joachimfainberg@Joachims-MBP.lan>	2022-04-14 16:48:00 +02:00
fonfonx	028cbad05e	Add feminine form of word "one" in French (#10653 ) * Add French number * Add fonfonx.md * Add feminine ordinal words for French	2022-04-14 10:21:27 +02:00
single-fingal	4228f3c757	Fix a few minor bugs in the SpanGroup API web docs (#10650 ) * Fix a few minor bugs in the SpanGroup API web docs * Update SpanGroup docs examples to have Spans reflect intended "errors"	2022-04-14 09:59:48 +02:00
Richard Hudson	75fbbcdc18	Display warning when spacy.explain() finds no term (#10645 ) * Display warning when spacy.explain() finds no term * Updated warning message text	2022-04-12 10:48:28 +02:00
Madeesh Kannan	9ba3e1cb2f	Basic tests for the Tamil language (#10629 ) * Add basic tests for Tamil (ta) * Add comment Remove superfluous condition * Remove superfluous call to `pipe` Instantiate new tokenizer for special case	2022-04-07 14:47:37 +02:00
Lj Miranda	02dafa3a84	Add debug diff command in spaCy CLI (#10502 ) * Add initial design for diff command For now, the diffing process looks like this: - The default config is created based from some values in the user config (e.g. which pipeline components were used, the lang, etc.) - The user must supply manually if it was optimized for acc/efficiency and if pretraining was involved. * Make diff command structure similar to siblings * Include gpu as a user option for CLI * Make variables more explicit * Fix type declaration for optimize enum * Improve docstrings for diff CLI * Add debug-diff to website API docs * Switch position of configs so that user config is modded * Add markdown flag for debug diff This commit adds a --markdown (--md) flag that allows easier copy-pasting to Github issues. Please note that this commit is dependent on an unreleased version of wasabi (for the time being). For posterity, the related PR is found here: https://github.com/ines/wasabi/pull/20 * Bump version of wasabi to 0.9.1 So that we can use the add_symbols parameter. * Apply suggestions from code review Co-authored-by: Ines Montani <ines@ines.io> * Update docs based on code review suggestions Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Change command name from diff -> diff-config * Clarify when options are relevant or not * Rerun prettier on cli.md Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-04-07 10:48:45 +02:00
Joachim Fainberg	b91255a454	displacy: avoid overlapping arcs in manual mode (#10534 ) * Added test for overlapping arcs * Provide distinct levels to overlapping arcs * Update return type hint for get_levels * Improved formatting spacy/displacy/render.py Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Joachim Fainberg <joachimfainberg@Joachims-MacBook-Pro.local> Co-authored-by: Ines Montani <ines@ines.io>	2022-04-05 09:08:02 +02:00
Adriane Boyd	849bef2de6	Merge pull request #10596 from adrianeboyd/chore/v3.3.0.dev0 Set version to v3.3.0.dev0	2022-04-04 09:18:07 +02:00
Adriane Boyd	e422101e00	Temporarily skip tests that require models/compat	2022-04-01 11:09:28 +02:00
Adriane Boyd	ca54de27bb	Support more internal methods for SpanGroup (#10476 ) * Added new convenience cython functions to SpanGroup to avoid unnecessary allocation/deallocation of objects * Replaced sorting in has_overlap with C++ for efficiency. Also, added a test for has_overlap * Added a method to efficiently merge SpanGroups * Added __delitem__, __add__ and __iadd__. Also, allowed to pass span lists to merge function. Replaced extend() body with call to merge * Renamed merge to concat and added missing things to documentation * Added operator+ and operator += in the documentation * Added a test for Doc deallocation * Update spacy/tokens/span_group.pyx * Updated SpanGroup tests to use new span list comparison function rather than assert_span_list_equal, eliminating the need to have a separate assert_not_equal fnction * Fixed typos in SpanGroup documentation Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Minor changes requested by Sofie: rearranged import statements. Added new=3.2.1 tag to SpanGroup.__setitem__ documentation * SpanGroup: moved repetitive list index check/adjustment in a separate function * Turn off formatting that hurts readability spacy/tests/doc/test_span_group.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Remove formatting that hurts readability spacy/tests/doc/test_span_group.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Turn off formatting that hurts readability in spacy/tests/doc/test_span_group.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Support more internal methods for SpanGroup Add support for: * `__setitem__` * `__delitem__` * `__iadd__`: for `SpanGroup` or `Iterable[Span]` * `__add__`: for `SpanGroup` only Adapted from #9698 with the scope limited to the magic methods. * Use v3.3 as new version in docs * Add new tag to SpanGroup.copy in API docs * Remove duplicate import * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Remaining suggestions and formatting Co-authored-by: nrodnova <nrodnova@hotmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Natalia Rodnova <4512370+nrodnova@users.noreply.github.com>	2022-04-01 09:56:26 +02:00
Adriane Boyd	d56b1400d2	Set version to v3.3.0.dev0	2022-04-01 09:54:52 +02:00

1 2 3 4 5 ...

9077 Commits