spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-16 06:37:04 +03:00

Author	SHA1	Message	Date
Shen Qin	be00db6645	Addition of min_max quantifier in matcher {n,m} (#10981 ) * Min_max_operators 1. Modified API and Usage for spaCy website to include min_max operator 2. Modified matcher.pyx to include min_max function {n,m} and its variants 3. Modified schemas.py to include min_max validation error 4. Added test cases to test_matcher_api.py, test_matcher_logic.py and test_pattern_validation.py * attempt to fix mypy/pydantic compat issue * formatting * Update spacy/tests/matcher/test_pattern_validation.py Co-authored-by: Source-Shen <82353723+Source-Shen@users.noreply.github.com> Co-authored-by: svlandeg <svlandeg@github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-06-30 11:01:58 +02:00
Daniël de Kok	0ff14aabce	vectors: avoid expensive comparisons between numpy ints and Python ints (#10992 ) * vectors: avoid expensive comparisons between numpy ints and Python ints * vectors: avoid failure on lists of ints * Convert another numpy int to Python	2022-06-29 12:58:31 +02:00
Peter Baumgartner	dd038b536c	fix to horizontal space (#10994 )	2022-06-28 20:42:40 +02:00
Adriane Boyd	24f4908fce	Update vector handling in similarity methods (#11013 ) Distinguish between vectors that are 0 vs. missing vectors when warning about missing vectors. Update `Doc.has_vector` to match `Span.has_vector` and `Token.has_vector` for cases where the vocab has vectors but none of the tokens in the container have vectors.	2022-06-28 19:50:47 +02:00
Madeesh Kannan	1d5cad0b42	`Example.get_aligned_parse`: Handle unit and zero length vectors correctly (#11026 ) * `Example.get_aligned_parse`: Do not squeeze gold token idx vector Correctly handle zero-size vectors passed to `np.vectorize` * Add tests * Use `Doc` ctor to initialize attributes * Remove unintended change Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Remove unused import Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-06-28 19:42:58 +02:00
Richard Hudson	a9559e7435	Handle Cyrillic combining diacritics (#10837 ) * Handle Russian, Ukrainian and Bulgarian * Corrections * Correction * Correction to comment * Changes based on review * Correction * Reverted irrelevant change in punctuation.py * Remove unnecessary group * Reverted accidental change	2022-06-28 15:35:32 +02:00
Zackere	8ffff18ac4	Try cloning repo from main & master (#10843 ) * Try cloning repo from main & master * fixup! Try cloning repo from main & master * fixup! fixup! Try cloning repo from main & master * refactor clone and check for repo:branch existence * spacing fix * make mypy happy * type util function * Update spacy/cli/project/clone.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-06-28 09:11:15 -04:00
Daniël de Kok	1605ef7319	Merge remote-tracking branch 'upstream/master' into merge-master-v4-20220627-2	2022-06-27 17:45:45 +02:00
github-actions[bot]	4155a59d47	Auto-format code with black (#11022 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-06-27 09:35:35 +02:00
Adriane Boyd	738b38064f	Merge pull request #11021 from adrianeboyd/chore/v3.4.0 Set version to v3.4.0	2022-06-24 14:54:16 +02:00
Madeesh Kannan	8f1ba4de58	Backport parser/alignment optimizations from `feature/refactor-parser` (#10952 )	2022-06-24 13:39:52 +02:00
Adriane Boyd	d9320db7db	Temporarily skip tests that require models/compat	2022-06-24 11:20:53 +02:00
Adriane Boyd	bffe54d02b	Set version to v3.4.0	2022-06-24 08:48:58 +02:00
Sofie Van Landeghem	f8116078ce	disable failing test because Stanford servers are down (#11015 )	2022-06-23 10:57:46 +02:00
Sofie Van Landeghem	f00254ae27	add counts to verbose list of NER labels (#10957 )	2022-06-20 09:48:40 +02:00
Raphael Mitsch	4c058eb40a	`enable` argument for spacy.load() (#10784 ) * Enable flag on spacy.load: foundation for include, enable arguments. * Enable flag on spacy.load: fixed tests. * Enable flag on spacy.load: switched from pretrained model to empty model with added pipes for tests. * Enable flag on spacy.load: switched to more consistent error on misspecification of component activity. Test refactoring. Added to default config. * Enable flag on spacy.load: added support for fields not in pipeline. * Enable flag on spacy.load: removed serialization fields from supported fields. * Enable flag on spacy.load: removed 'enable' from config again. * Enable flag on spacy.load: relaxed checks in _resolve_component_activation_status() to allow non-standard pipes. * Enable flag on spacy.load: fixed relaxed checks for _resolve_component_activation_status() to allow non-standard pipes. Extended tests. * Enable flag on spacy.load: comments w.r.t. resolution workarounds. * Enable flag on spacy.load: remove include fields. Update website docs. * Enable flag on spacy.load: updates w.r.t. changes in master. * Implement Doc.from_json(): update docstrings. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): remove newline. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): change error message for E1038. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Enable flag on spacy.load: wrapped docstring for _resolve_component_status() at 80 chars. * Enable flag on spacy.load: changed examples for enable flag. * Remove newline. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix docstring for Language._resolve_component_status(). * Rename E1038 to E1042. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-06-17 20:24:13 +01:00
Sofie Van Landeghem	eaeca5eb6a	account for NER labels with a hyphen in the name (#10960 ) * account for NER labels with a hyphen in the name * cleanup * fix docstring * add return type to helper method * shorter method and few more occurrences * user helper method across repo * fix circular import * partial revert to avoid circular import	2022-06-17 20:02:37 +01:00
github-actions[bot]	6313787fb6	Auto-format code with black (#10977 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-06-17 19:41:55 +01:00
Raphael Mitsch	d50668dbf0	Made _initialize_X() methods private. (#10978 )	2022-06-17 15:55:34 +02:00
Raphael Mitsch	a7f6bc5dfb	Workaround for Typer optional default values with Python calls (#10788 ) * Workaround for Typer optional default values with Python calls: added test and workaround. * @rmitsch Workaround for Typer optional default values with Python calls: reverting some black formatting changes. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * @rmitsch Workaround for Typer optional default values with Python calls: removing return type hint. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Workaround for Typer optional default values with Python calls: fixed imports, added GitHub issue marker. * Workaround for Typer optional default values with Python calls: removed forcing of default values for optional arguments in init_config_cli(). Added default values for init_config(). Synchronized default values for init_config_cli() and init_config(). * Workaround for Typer optional default values with Python calls: removed unused import. * Workaround for Typer optional default values with Python calls: fixed usage of optimize in init_config_cli(). * Workaround for Typer optional default values with Python calls: remove output_file from InitDefaultValues. * Workaround for Typer optional default values with Python calls: rename class for default init values. * Workaround for Typer optional default values with Python calls: remove newline. * remove introduced newlines * Remove test_init_config_from_python_without_optional_args(). * remove leftover import * reformat import * remove duplicate Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-06-17 12:15:36 +02:00
Daniël de Kok	3d3fbeda9f	Update for CBlas changes in Thinc 8.1.0.dev2 (#10970 )	2022-06-16 11:42:34 +02:00
Daniël de Kok	0d352c46ed	vectors: remove use of float as row number (#10955 ) The float -1 was returned rather than the integer -1 as the row for unknown keys. This doesn't introduce a real bug, since such floats cast (without issues) to int in the conversion to NumPy arrays. Still, it's nice to do the correct thing :).	2022-06-15 15:32:02 +02:00
Madeesh Kannan	126d1db123	Add failing test: `test_matcher_extension_in_set_predicate` (#10948 )	2022-06-13 10:56:45 +02:00
Daniël de Kok	a83a501195	precomputable_biaffine: avoid concatenation (#10911 ) The `forward` of `precomputable_biaffine` performs matrix multiplication and then `vstack`s the result with padding. This creates a temporary array used for the output of matrix concatenation. This change avoids the temporary by pre-allocating an array that is large enough for the output of matrix multiplication plus padding and fills the array in-place. This gave me a small speedup (a bit over 100 WPS) on de_core_news_lg on M1 Max (after changing thinc-apple-ops to support in-place gemm as BLIS does).	2022-06-10 18:12:28 +02:00
github-actions[bot]	97e8a5041b	Auto-format code with black (#10945 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-06-10 13:21:33 +02:00
Daniël de Kok	2f05c6824c	Merge remote-tracking branch 'upstream/master' into merge-master-v4-20220609	2022-06-09 10:18:25 +02:00
kadarakos	1bb87f35bc	Detect cycle during projectivize (#10877 ) * detect cycle during projectivize * not complete test to detect cycle in projectivize * boolean to int type to propagate error * use unordered_set instead of set * moved error message to errors * removed cycle from test case * use find instead of count * cycle check: only perform one lookup * Return bool again from _has_head_as_ancestor Communicate presence of cycles through an output argument. * Switch to returning std::pair to encode presence of a cycle The has_cycle pointer is too easy to misuse. Ideally, we would have a sum type like Rust's `Result` here, but C++ is not there yet. * _is_non_proj_arc: clarify what we are returning * _has_head_as_ancestor: remove count We are now explicitly checking for cycles, so the algorithm must always terminate. Either we encounter the head, we find a root, or a cycle. * _is_nonproj_arc: simplify condition * Another refactor using C++ exceptions * Remove unused error code * Print graph with cycle on exception * Include .hh files in source package * Add FIXME comment * cycle detection test * find cycle when starting from problematic vertex Co-authored-by: Daniël de Kok <me@danieldk.eu>	2022-06-08 19:34:11 +02:00
github-actions[bot]	24aafdffad	Auto-format code with black (#10908 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-06-03 11:01:55 +02:00
Adriane Boyd	727ce6d1f5	Remove English exceptions with mismatched features (#10873 ) Remove English contraction exceptions with mismatched features that lead to exceptions like "theses" and "thisre".	2022-06-03 09:44:04 +02:00
Madeesh Kannan	41389ffe1e	Avoid pickling `Doc` inputs passed to `Language.pipe()` (#10864 ) * `Language.pipe()`: Serialize `Doc` objects to bytes when using multiprocessing to avoid pickling overhead * `Doc.to_dict()`: Serialize `_context` attribute (keeping in line with `(un)pickle_doc()` * Correct type annotations * Fix typo * `Doc`: Do not serialize `_context` * `Language.pipe`: Send context objects to child processes, Simplify `as_tuples` handling * Fix type annotation * `Language.pipe`: Simplify `as_tuple` multiprocessor handling * Cleanup code, fix typos * MyPy fixes * Move doc preparation function into `_multiprocessing_pipe` Whitespace changes * Remove superfluous comma * Rename `prepare_doc` to `prepare_input` * Update spacy/errors.py * Undo renaming for error Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-06-02 20:06:49 +02:00
single-fingal	6c6b8da7cc	Fix: De/Serialize `SpanGroups` including the SpanGroup keys (#10707 ) * fix: De/Serialize `SpanGroups` including the SpanGroup keys This prevents the loss of `SpanGroup`s that have the same .name as other `SpanGroup`s within the same `SpanGroups` object (upon de/serialization of the `SpanGroups`). Fixes #10685 * Maintain backwards compatibility for serialized `SpanGroups` (serialized as: a list of `SpanGroup`s, or b'') * Add tests for `SpanGroups` deserialization backwards-compatibility * Move a `SpanGroups` de/serialization test (test_issue10685) to tests/serialize/test_serialize_spangroups.py * Output a warning if deserializing a `SpanGroups` with duplicate .name-d `SpanGroup`s * Minor refactor * `SpanGroups.from_bytes` handles only `list` and `dict` types with `dict` as the expected default * For lists, keep first rather than last value encountered * Update error message * Rename and update tests * Update to preserve list serialization of SpanGroups To avoid breaking compatibility of serialized `Doc` and `DocBin` with earlier versions of spacy v3, revert back to a list-only serialization, but update the names just for serialization so that the SpanGroups keys override the SpanGroup names. * Preserve object identity and current key overwrite * Preserve SpanGroup object identity * Preserve last rather than first span group from SpanGroup list format without SpanGroups keys * Update inline comments * Fix types * Add type info for SpanGroup.copy * Deserialize `SpanGroup`s as copies when a single SpanGroup is the value for more than 1 `SpanGroups` key. This is because we serialize `SpanGroups` as dicts (to maintain backward- and forward-compatibility) and we can't assume `SpanGroup`s with the same bytes/serialization were the same (identical) object, pre-serialization. 
* Update spacy/tokens/_dict_proxies.py * Add more SpanGroups serialization tests Test that serialized SpanGroups maintain their Span order * small clarification on older spaCy version * Update spacy/tests/serialize/test_serialize_span_groups.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-06-02 15:56:27 +02:00
Adriane Boyd	7e13652d36	Fix schemas import in Doc (#10898 )	2022-06-02 15:53:03 +02:00
Raphael Mitsch	8387ce4c01	Add Doc.from_json() (#10688 ) * Implement Doc.from_json: rough draft. * Implement Doc.from_json: first draft with tests. * Implement Doc.from_json: added documentation on website for Doc.to_json(), Doc.from_json(). * Implement Doc.from_json: formatting changes. * Implement Doc.to_json(): reverting unrelated formatting changes. * Implement Doc.to_json(): fixing entity and span conversion. Moving fixture and doc <-> json conversion tests into single file. * Implement Doc.from_json(): replaced entity/span converters with doc.char_span() calls. * Implement Doc.from_json(): handling sentence boundaries in spans. * Implementing Doc.from_json(): added parser-free sentence boundaries transfer. * Implementing Doc.from_json(): added parser-free sentence boundaries transfer. * Implementing Doc.from_json(): incorporated various PR feedback. * Renaming fixture for document without dependencies. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implementing Doc.from_json(): using two sent_starts instead of one. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implementing Doc.from_json(): doc_without_dependency_parser() -> doc_without_deps. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implementing Doc.from_json(): incorporating various PR feedback. Rebased on latest master. * Implementing Doc.from_json(): refactored Doc.from_json() to work with annotation IDs instead of their string representations. * Implement Doc.from_json(): reverting unwanted formatting/rebasing changes. * Implement Doc.from_json(): added check for char_span() calculation for entities. * Update spacy/tokens/doc.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): minor refactoring, additional check for token attribute consistency with corresponding test. * Implement Doc.from_json(): removed redundancy in annotation type key naming. 
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): Simplifying setting annotation values. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement doc.from_json(): renaming annot_types to token_attrs. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): adjustments for renaming of annot_types to token_attrs. * Implement Doc.from_json(): removing default categories. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): simplifying lexeme initialization. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): simplifying lexeme initialization. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): refactoring to only have keys for present annotations. * Implement Doc.from_json(): fix check for tokens' HEAD attributes. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): refactoring Doc.from_json(). * Implement Doc.from_json(): fixing span_group retrieval. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): fixing span retrieval. * Implement Doc.from_json(): added schema for Doc JSON format. Minor refactoring in Doc.from_json(). * Implement Doc.from_json(): added comment regarding Token and Span extension support. * Implement Doc.from_json(): renaming inconsistent_props to partial_attrs.. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): adjusting error message. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): extending E1038 message. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): added params to E1038 raises. * Implement Doc.from_json(): combined attribute collection with partial attributes check. * Implement Doc.from_json(): added optional schema validation. * Implement Doc.from_json(): fixed optional fields in schema, tests. 
* Implement Doc.from_json(): removed redundant None check for DEP. * Implement Doc.from_json(): added passing of schema validation message to E1037. * Implement Doc.from_json(): removing redundant error E1040. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): changing message for E1037. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): adjusted website docs and docstring of Doc.from_json(). * Update spacy/tests/doc/test_json_doc_conversion.py * Implement Doc.from_json(): docstring update. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): docstring update. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): website docs update. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): docstring formatting. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): docstring formatting. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): fixing Doc reference in website docs. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): reformatted website/docs/api/doc.md. * Implement Doc.from_json(): bumped IDs of new errors to avoid merge conflicts. * Implement Doc.from_json(): fixing bug in tests. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): fix setting of sentence starts for docs without DEP. * Implement Doc.from_json(): add check for valid char spans when manually setting sentence boundaries. Refactor sentence boundary setting slightly. Move error message for lack of support for partial token annotations to errors.py. * Implement Doc.from_json(): simplify token sentence start manipulation. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Combine related error messages * Update spacy/tests/doc/test_json_doc_conversion.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-06-02 14:03:47 +02:00
Adriane Boyd	a322d6d5f2	Add SpanRuler component (#9880 ) * Add SpanRuler component Add a `SpanRuler` component similar to `EntityRuler` that saves a list of matched spans to `Doc.spans[spans_key]`. The matches from the token and phrase matchers are deduplicated and sorted before assignment but are not otherwise filtered. * Update spacy/pipeline/span_ruler.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix cast * Add self.key property * Use number of patterns as length * Remove patterns kwarg from init * Update spacy/tests/pipeline/test_span_ruler.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Add options for spans filter and setting to ents * Add `spans_filter` option as a registered function * Make `spans_key` optional and if `None`, set to `doc.ents` instead of `doc.spans[spans_key]`. * Update and generalize tests * Add test for setting doc.ents, fix key property type * Fix typing * Allow independent doc.spans and doc.ents * If `spans_key` is set, set `doc.spans` with `spans_filter`. * If `annotate_ents` is set, set `doc.ents` with `ents_filter`. * Use `util.filter_spans` by default as `ents_filter`. * Use a custom warning if the filter does not work for `doc.ents`. 
* Enable use of SpanC.id in Span * Support id in SpanRuler as Span.id * Update types * `id` can only be provided as string (already by `PatternType` definition) * Update all uses of Span.id/ent_id in Doc * Rename Span id kwarg to span_id * Update types and docs * Add ents filter to mimic EntityRuler overwrite_ents * Refactor `ents_filter` to take `entities, spans` args for more filtering options * Give registered filters more descriptive names * Allow registered `filter_spans` filter (`spacy.first_longest_spans_filter.v1`) to take any number of `Iterable[Span]` objects as args so it can be used for spans filter or ents filter * Implement future entity ruler as span ruler Implement a compatible `entity_ruler` as `future_entity_ruler` using `SpanRuler` as the underlying component: * Add `sort_key` and `sort_reverse` to allow the sorting behavior to be customized. (Necessary for the same sorting/filtering as in `EntityRuler`.) * Implement `overwrite_overlapping_ents_filter` and `preserve_existing_ents_filter` to support `EntityRuler.overwrite_ents` settings. * Add `remove_by_id` to support `EntityRuler.remove` functionality. * Refactor `entity_ruler` tests to parametrize all tests to test both `entity_ruler` and `future_entity_ruler` * Implement `SpanRuler.token_patterns` and `SpanRuler.phrase_patterns` properties. Additional changes: * Move all config settings to top-level attributes to avoid duplicating settings in the config vs. `span_ruler/cfg`. (Also avoids a lot of casting.) 
* Format * Fix filter make method name * Refactor to use same error for removing by label or ID * Also provide existing spans to spans filter * Support ids property * Remove token_patterns and phrase_patterns * Update docstrings * Add span ruler docs * Fix types * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Move sorting into filters * Check for all tokens in seen tokens in entity ruler filters * Remove registered sort key * Set Token.ent_id in a backwards-compatible way in Doc.set_ents * Remove sort options from API docs * Update docstrings * Rename entity ruler filters * Fix and parameterize scoring * Add id to Span API docs * Fix typo in API docs * Include explicit labeled=True for scorer Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-06-02 13:12:53 +02:00
Sofie Van Landeghem	f7507c2327	fix typo + CI slow testing (#10835 ) * fix typo * one more typo	2022-06-02 00:10:16 +02:00
Paul O'Leary McCann	dca2e8c644	Minor NEL type fixes (#10860 ) * Fix TODO about typing Fix was simple: just request an array2f. * Add type ignore Maxout has a more restrictive type than the residual layer expects (only Floats2d vs any Floats). * Various cleanup This moves a lot of lines around but doesn't change any functionality. Details: 1. use `continue` to reduce indentation 2. move sentence doc building inside conditional since it's otherwise unused 3. reduces some temporary assignments	2022-06-01 00:41:28 +02:00
Daniël de Kok	09a5f03dd7	Merge pull request #10849 from danieldk/simplify-gpu-check Simplify GPU check	2022-05-30 16:35:10 +02:00
Daniël de Kok	85dd2b6c04	Parser: use C saxpy/sgemm provided by the Ops implementation (#10773 ) * Parser: use C saxpy/sgemm provided by the Ops implementation This is a backport of https://github.com/explosion/spaCy/pull/10747 from the parser refactor branch. It eliminates the explicit calls to BLIS, instead using the saxpy/sgemm provided by the Ops implementation. This allows us to use Accelerate in the parser on M1 Macs (with an updated thinc-apple-ops). Performance of the de_core_news_lg pipe: BLIS 0.7.0, no thinc-apple-ops: 6385 WPS BLIS 0.7.0, thinc-apple-ops: 36455 WPS BLIS 0.9.0, no thinc-apple-ops: 19188 WPS BLIS 0.9.0, thinc-apple-ops: 36682 WPS This PR, thinc-apple-ops: 38726 WPS Performance of the de_core_news_lg pipe (only tok2vec -> parser): BLIS 0.7.0, no thinc-apple-ops: 13907 WPS BLIS 0.7.0, thinc-apple-ops: 73172 WPS BLIS 0.9.0, no thinc-apple-ops: 41576 WPS BLIS 0.9.0, thinc-apple-ops: 72569 WPS This PR, thinc-apple-ops: 87061 WPS * Require thinc >=8.1.0,<8.2.0 * Lower thinc lowerbound to 8.1.0.dev0 * Use best CPU ops for CBLAS when the parser model is on the GPU * Fix another unguarded cblas() call * Fix: use ops as a shorthand for self.model.ops Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>	2022-05-27 11:20:52 +02:00
github-actions[bot]	6172af8158	Auto-format code with black (#10857 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-05-27 10:54:54 +02:00
Daniël de Kok	7c6a97559d	Simplify GPU check This change removes `thinc.util.has_cupy` from the GPU presence check. Currently `gpu_is_available` already implies `has_cupy`. We also want to show this warning in the future when a machine has a non-CuPy GPU.	2022-05-25 14:06:45 +02:00
kadarakos	f6a4b80c0b	Better errors for has_annotation and Matcher (#10830 ) * Show input argument instead of None * catch invalid attr early * moved error message from code to errors.py * Update spacy/errors.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/errors.py * update E153 and E154 Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-05-25 11:12:29 +02:00
Richard Hudson	32954c3bcb	Fix issues for Mypy 0.950 and Pydantic 1.9.0 (#10786 ) * Make changes to typing * Correction * Format with black * Corrections based on review * Bumped Thinc dependency version * Bumped blis requirement * Correction for older Python versions * Update spacy/ml/models/textcat.py Co-authored-by: Daniël de Kok <me@github.danieldk.eu> * Corrections based on review feedback * Readd deleted docstring line Co-authored-by: Daniël de Kok <me@github.danieldk.eu>	2022-05-25 09:33:54 +02:00
Paul O'Leary McCann	6be09bbd07	Fix Entity Linker with tokenization mismatches (fix #9575 ) (#10457 ) * Add failing test * Partial fix for issue This kind of works. The issue with token length mismatches is gone. The problem is that when you get empty lists of encodings to compare, it fails because the sizes are not the same, even though they're both zero: (0, 3) vs (0,). Not sure why that happens... * Short circuit on empties * Remove spurious check The check here isn't needed now that the short circuit is fixed. * Update spacy/tests/pipeline/test_entity_linker.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Use "eg", not "example" Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-05-23 20:42:26 +02:00
Lj Miranda	1d34aa2b3d	Add spacy-span-analyzer to debug data (#10668 ) * Rename to spans_key for consistency * Implement spans length in debug data * Implement how span bounds and spans are obtained In this commit, I implemented how span boundaries (the tokens) around a given span and spans are obtained. I've put them in the compile_gold() function so that it's accessible later on. I will do the actual computation of the span and boundary distinctiveness in the main function above. * Compute for p_spans and p_bounds * Add computation for SD and BD * Fix mypy issues * Add weighted average computation * Fix compile_gold conditional logic * Add test for frequency distribution computation * Add tests for kl-divergence computation * Fix weighted average computation * Make tables more compact by rounding them * Add more descriptive checks for spans * Modularize span computation methods In this commit, I added the _get_span_characteristics and _print_span_characteristics functions so that they can be reusable anywhere. * Remove unnecessary arguments and make fxs more compact * Update a few parameter arguments * Add tests for print_span and get_span methods * Update API to talk about span characteristics in brief * Add better reporting of spans_length * Add test for span length reporting * Update formatting of span length report Removed '' to indicate that it's not a string, then sort the n-grams by their length, not by their frequency. * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Show all frequency distribution when -V In this commit, I displayed the full frequency distribution of the span lengths when --verbose is passed. To make things simpler, I rewrote some of the formatter functions so that I can call them whenever. Another notable change is that instead of showing percentages as Integers, I showed them as floats (max 2-decimal places). I did this because it looks weird when it displays (0%). 
* Update logic on how total is computed The way the 90% thresholding is computed now is that we keep adding the percentages until we reach >= 90%. I also updated the wording and used the term "At least" to denote that >= 90% of your spans have these distributions. * Fix display when showing the threshold percentage * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Add better phrasing for span information * Update spacy/cli/debug_data.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Add minor edits for whitespaces etc. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-05-23 19:06:38 +02:00
Paul O'Leary McCann	46982cf694	Add glossary entry for root (#10821 ) * Add glossary entry for root There was already one but it was lower case, maybe that should be removed? * remove lowercase root On reflection, that was probably just a mistake. * Add lowercase root back It's harmless to leave it there.	2022-05-20 09:56:32 +02:00
Daniël de Kok	5586fd9311	Merge remote-tracking branch 'upstream/master' into v4-merge-master-20220518	2022-05-18 11:34:54 +02:00
Raphael Mitsch	357be2614e	Fuzz tokenizer.explain: draft for fuzzy tests. (#10771 ) * Fuzz tokenizer.explain: draft for fuzzy tests. * Fuzz tokenizer.explain: xignoring tokenizer.explain() tests. Removed deadline modification. Removed LANGUAGES_WITHOUT_TOKENIZERS. * Fuzz tokenizer.explain: changed tokenizer initialization to avoid failures in Azure runs. * Fuzz tokenizer.explain: type hint for tokenizer in test. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-05-17 10:23:16 +02:00
github-actions[bot]	99aeaf9bd3	Auto-format code with black (#10795 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-05-13 19:02:08 +02:00
kadarakos	fd36469900	bugfix parser labels (#10797 )	2022-05-13 11:41:32 +02:00
Patrick Düggelin	cb06309ed8	Fix PhraseMatcher remove overlapping terms (#10734 ) * Add regression test for issue 10643 * Improve overlapping terms testcase * Fix removing overlapping terms in phrase matcher (#10643)	2022-05-12 12:23:52 +02:00
Raphael Mitsch	6f9e2ca81f	Ignore overrides for pipe names in config argument (#10779 ) * Pipe name override in config: added check with warning, added removal of name override from config, extended tests. * Pipe name override in config: added pytest UserWarning. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-05-12 11:46:08 +02:00
Adriane Boyd	b65d652881	Override SpanGroups.setdefault to provide default SpanGroup (#10772 ) * Fix mistake in SpanGroup API docs * Restrict SpanGroups.setdefault to SpanGroup only * Refactor to support default span iterable	2022-05-12 10:06:25 +02:00
Raphael Mitsch	2904359685	Allow assets to be optional in spacy project (#10714 ) * Allow assets to be optional in spacy project: draft for optional flag/download_all options. * Allow assets to be optional in spacy project: added OPTIONAL_DEFAULT reflecting default asset optionality. * Allow assets to be optional in spacy project: renamed --all to --extra. * Allow assets to be optional in spacy project: included optional flag in project config test. * Allow assets to be optional in spacy project: added documentation. * Allow assets to be optional in spacy project: fixing deprecated --all reference. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Allow assets to be optional in spacy project: fixed project_assets() docstring. * Allow assets to be optional in spacy project: adjusted wording in justification of optional assets. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: switched to as keyword in project.yml. Updated docs. * Allow assets to be optional in spacy project: updated comment. * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in output. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in docstring.. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test.. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: renamed OPTIONAL_DEFAULT to EXTRA_DEFAULT. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-05-10 10:40:11 +02:00
Sofie Van Landeghem	1543558d08	Add test for old architectures (#10751 ) * add v1 and v2 tests for tok2vec architectures * textcat architectures are not "layers" * test older textcat architectures * test older parser architecture	2022-05-10 08:24:42 +02:00
Luca Dorigo	0a92d5644e	Fix StringStore.__getitem__ return type depending on parameter types (#10741 ) * Fix StringStore.__getitem__ return type depending on parameter types Small fix using `@overload` so that `StringStore.__getitem__` returns an `int` when given a `str` or `bytes` and a `str` when given an `int`. * Update spacy/strings.pyi Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-05-03 17:57:07 +02:00
Raphael Mitsch	f5390e278a	Refactor error messages to remove hardcoded strings (#10729 ) * Use custom error msg instead of hardcoded string: replaced remaining hardcoded error message strings. * Use custom error msg instead of hardcoded string: fixing faulty Errors import.	2022-05-02 13:38:46 +02:00
Madeesh Kannan	0a503ce5e0	Remove vestigial debug print statement in `walk_head_nodes` (#10718 ) * `graph`: Remove vestigial debug print statement in `walk_head_nodes` * Revert whitespace changes * Remove more debug print statements	2022-05-02 13:36:35 +02:00
Adriane Boyd	10377fb945	Set version to v3.3.0 (#10614 ) * Set version to v3.3.0 * Revert "Temporarily skip tests that require models/compat" This reverts commit `e422101e00`.	2022-04-28 13:07:49 +02:00
harmbuisman	c066fb8a4e	#10672 : fixes displacy output for manual unsorted entities (#10673 ) * #10672: fixes displacy output for manual unsorted entities * #10672: removed unused import * fix prettier formatting Co-authored-by: Harm Buisman <h.buisman@iknl.nl> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-04-27 09:51:58 +02:00
Sofie Van Landeghem	b3717ba53a	removing print statements from the test suite (#10712 )	2022-04-27 09:14:25 +02:00
Adriane Boyd	455f089c9b	Support exclude in Doc.from_docs (#10689 ) * Support exclude in Doc.from_docs * Update API docs * Add new tag to docs	2022-04-25 18:19:03 +02:00
github-actions[bot]	e07500369c	Auto-format code with black (#10687 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-04-22 11:24:53 +02:00
Richard Hudson	4b227f4861	Merge pull request #10669 from mgrojo/develop Fix some issues in Spanish stop-word list and examples	2022-04-19 09:37:34 +02:00
mgr	3d50b1a989	Fix some issues in Spanish examples - Spelling: nationalities in lowercase, accent. - Incorrect verb composition - Untranslated word	2022-04-18 22:12:57 +02:00
mgr	2a2654c756	Remove significant or not very frequent words from stop word list [es] The list of stop words for Spanish contained many inadequate words, see: https://github.com/explosion/spaCy/issues/3052#issuecomment-1100760100 Removed words: - verb forms of 'trabajar' (work) and intentar (try) - words related to 'empleo' (employment) - incorrect words: ampleamos, arribaabajo, soyos, paìs - miscellaneous words due to being too significant or too infrequent: actualmente, aproximadamente, antaño, cosas, ejemplo, horas, general, pais, principalmente, raras Added other stop words for completion: - Spanish one-letter words - numbers up to twelve Some reformatting to 79 columns. When in doubt, the English and German lists have been consulted as good examples.	2022-04-18 22:04:02 +02:00
Madeesh Kannan	aa6780eb27	`Matcher`: Remove superfluous GIL-acquiring check in `get_is_final` (#10659 ) * `Matcher`: Remove superfluous GIL-acquiring check in `get_is_final` This check incurred a significant performance penalty due to implicit interactions between the GIL and Cython ref-counting code. * `Matcher`: Inline `PatternStateC` accessors	2022-04-18 12:59:34 +02:00
Duy Ngo	229ecaf0ea	Add numbers and definitions (#10665 )	2022-04-18 12:58:32 +02:00
Adriane Boyd	0e71bd973f	Return doc offsets in Matcher on spans (#10576 ) The returned match offsets were only adjusted for `as_spans`, not generally. Because the `on_match` callbacks are always applied to the doc, the `Matcher` matches on spans should consistently use the doc offsets.	2022-04-15 15:34:58 +02:00
Joachim Fainberg	4e1716223c	displaCy: Avoid increasing levels for identical arcs (#10639 ) * Test for arc levels for identical arcs Also moves the test in order with the other numbered tests. * displaCy: filter identical arcs Avoid increased levels due to identical arcs by first filtering any identical arcs. * Sort keys before filtering Manual entry with keys out of order would previously become different tuples and therefore not filtered correctly. Co-authored-by: Joachim Fainberg <joachimfainberg@Joachims-MBP.lan>	2022-04-14 16:48:00 +02:00
fonfonx	028cbad05e	Add feminine form of word "one" in French (#10653 ) * Add French number * Add fonfonx.md * Add feminine ordinal words for French	2022-04-14 10:21:27 +02:00
single-fingal	4228f3c757	Fix a few minor bugs in the SpanGroup API web docs (#10650 ) * Fix a few minor bugs in the SpanGroup API web docs * Update SpanGroup docs examples to have Spans reflect intended "errors"	2022-04-14 09:59:48 +02:00
Richard Hudson	75fbbcdc18	Display warning when spacy.explain() finds no term (#10645 ) * Display warning when spacy.explain() finds no term * Updated warning message text	2022-04-12 10:48:28 +02:00
Madeesh Kannan	9ba3e1cb2f	Basic tests for the Tamil language (#10629 ) * Add basic tests for Tamil (ta) * Add comment Remove superfluous condition * Remove superfluous call to `pipe` Instantiate new tokenizer for special case	2022-04-07 14:47:37 +02:00
Lj Miranda	02dafa3a84	Add debug diff command in spaCy CLI (#10502 ) * Add initial design for diff command For now, the diffing process looks like this: - The default config is created based on some values in the user config (e.g. which pipeline components were used, the lang, etc.) - The user must manually specify whether it was optimized for acc/efficiency and whether pretraining was involved. * Make diff command structure similar to siblings * Include gpu as a user option for CLI * Make variables more explicit * Fix type declaration for optimize enum * Improve docstrings for diff CLI * Add debug-diff to website API docs * Switch position of configs so that user config is modded * Add markdown flag for debug diff This commit adds a --markdown (--md) flag that allows easier copy-pasting to Github issues. Please note that this commit is dependent on an unreleased version of wasabi (for the time being). For posterity, the related PR is found here: https://github.com/ines/wasabi/pull/20 * Bump version of wasabi to 0.9.1 So that we can use the add_symbols parameter. * Apply suggestions from code review Co-authored-by: Ines Montani <ines@ines.io> * Update docs based on code review suggestions Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Change command name from diff -> diff-config * Clarify when options are relevant or not * Rerun prettier on cli.md Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-04-07 10:48:45 +02:00
Joachim Fainberg	b91255a454	displacy: avoid overlapping arcs in manual mode (#10534 ) * Added test for overlapping arcs * Provide distinct levels to overlapping arcs * Update return type hint for get_levels * Improved formatting spacy/displacy/render.py Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Joachim Fainberg <joachimfainberg@Joachims-MacBook-Pro.local> Co-authored-by: Ines Montani <ines@ines.io>	2022-04-05 09:08:02 +02:00
Adriane Boyd	849bef2de6	Merge pull request #10596 from adrianeboyd/chore/v3.3.0.dev0 Set version to v3.3.0.dev0	2022-04-04 09:18:07 +02:00
Adriane Boyd	e422101e00	Temporarily skip tests that require models/compat	2022-04-01 11:09:28 +02:00
Adriane Boyd	ca54de27bb	Support more internal methods for SpanGroup (#10476 ) * Added new convenience cython functions to SpanGroup to avoid unnecessary allocation/deallocation of objects * Replaced sorting in has_overlap with C++ for efficiency. Also, added a test for has_overlap * Added a method to efficiently merge SpanGroups * Added __delitem__, __add__ and __iadd__. Also, allowed to pass span lists to merge function. Replaced extend() body with call to merge * Renamed merge to concat and added missing things to documentation * Added operator+ and operator += in the documentation * Added a test for Doc deallocation * Update spacy/tokens/span_group.pyx * Updated SpanGroup tests to use new span list comparison function rather than assert_span_list_equal, eliminating the need to have a separate assert_not_equal function * Fixed typos in SpanGroup documentation Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Minor changes requested by Sofie: rearranged import statements. Added new=3.2.1 tag to SpanGroup.__setitem__ documentation * SpanGroup: moved repetitive list index check/adjustment in a separate function * Turn off formatting that hurts readability spacy/tests/doc/test_span_group.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Remove formatting that hurts readability spacy/tests/doc/test_span_group.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Turn off formatting that hurts readability in spacy/tests/doc/test_span_group.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Support more internal methods for SpanGroup Add support for: * `__setitem__` * `__delitem__` * `__iadd__`: for `SpanGroup` or `Iterable[Span]` * `__add__`: for `SpanGroup` only Adapted from #9698 with the scope limited to the magic methods. 
* Use v3.3 as new version in docs * Add new tag to SpanGroup.copy in API docs * Remove duplicate import * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Remaining suggestions and formatting Co-authored-by: nrodnova <nrodnova@hotmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Natalia Rodnova <4512370+nrodnova@users.noreply.github.com>	2022-04-01 09:56:26 +02:00
Adriane Boyd	d56b1400d2	Set version to v3.3.0.dev0	2022-04-01 09:54:52 +02:00
Daniël de Kok	c90dd6f265	Alignment: use a simplified ragged type for performance (#10319 ) * Alignment: use a simplified ragged type for performance This introduces the AlignmentArray type, which is a simplified version of Ragged that performs better on the simple(r) indexing performed for alignment. * AlignmentArray: raise an error when using unsupported index * AlignmentArray: move error messages to Errors * AlignmentArray: remove simplified ... with simplifications * AlignmentArray: fix typo that broke a[n:n] indexing	2022-04-01 09:02:06 +02:00
Adriane Boyd	03762b4b92	Add spancat, trainable_lemmatizer to quickstart (#10524 ) * Add `SPACY` and `IS_SPACE` as default `tok2vec` features	2022-04-01 09:01:04 +02:00
Adriane Boyd	e3ccc1973b	Provide debug data info for floret vectors (#10592 )	2022-03-31 15:11:32 +02:00
Yunus Atahan	36d3af3013	Fixed typo in Turkish lang. (#10582 ) * added failing test case for the issue. * Fixed typo. * fixed typo in test. * added corrected typo word into test_tr_lex_attrs_capitals as param. Test passes. Also tried and confirmed that test is failing after fixing the typo in the test case I wrote. Deleted the test case for typo. Co-authored-by: Yunus Atahan <yunus.atahan@trmotor.local>	2022-03-30 13:16:08 +02:00
Adriane Boyd	f98b41c390	Add vector deduplication (#10551 ) * Add vector deduplication * Add `Vocab.deduplicate_vectors()` * Always run deduplication in `spacy init vectors` * Clean up a few vector-related error messages and docs examples * Always unique with numpy * Fix types	2022-03-30 08:54:23 +02:00
Adriane Boyd	85778dfcf4	Add edit tree lemmatizer (#10231 ) * Add edit tree lemmatizer Co-authored-by: Daniël de Kok <me@danieldk.eu> * Hide edit tree lemmatizer labels * Use relative imports * Switch to single quotes in error message * Type annotation fixes Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Reformat edit_tree_lemmatizer with black * EditTreeLemmatizer.predict: take Iterable Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Validate edit trees during deserialization This change also changes the serialized representation. Rather than mirroring the deep C structure, we use a simple flat union of the match and substitution node types. * Move edit_trees to _edit_tree_internals * Fix invalid edit tree format error message * edit_tree_lemmatizer: remove outdated TODO comment * Rename factory name to trainable_lemmatizer * Ignore type instead of casting truths to List[Union[Ints1d, Floats2d, List[int], List[str]]] for thinc v8.0.14 * Switch to Tagger.v2 * Add documentation for EditTreeLemmatizer * docs: Fix 3.2 -> 3.3 somewhere * trainable_lemmatizer documentation fixes * docs: EditTreeLemmatizer is in edit_tree_lemmatizer.py Co-authored-by: Daniël de Kok <me@danieldk.eu> Co-authored-by: Daniël de Kok <me@github.danieldk.eu> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-28 11:13:50 +02:00
github-actions[bot]	98ed941c39	Auto-format code with black (#10550 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-03-28 10:44:46 +02:00
Luka Dragar	53674bb745	Examples for Slovene (#10539 ) * Added examples for Slovene * Update spacy/lang/sl/examples.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Corrected a typo in one of the sentences Co-authored-by: Luka Dragar <D20124481@mytudublin.ie> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-03-28 10:44:10 +02:00
Adriane Boyd	3711af74e5	Add tokenizer option to allow Matcher handling for all rules (#10452 ) * Add tokenizer option to allow Matcher handling for all rules Add tokenizer option `with_faster_rules_heuristics` that determines whether the special cases applied by the internal `Matcher` are filtered by whether they contain affixes or space. If `True` (default), the rules are filtered to prioritize speed over rare edge cases. If `False`, all rules are included in the final `Matcher`-based pass over the doc. * Reset all caches when reloading special cases * Revert "Reset all caches when reloading special cases" This reverts commit `4ef6bd171d`. * Initialize max_length properly * Add new tag to API docs * Rename to faster heuristics	2022-03-24 13:21:32 +01:00
Adriane Boyd	31a5d99efa	Maintain support for empty DocBin span groups (#10538 )	2022-03-24 11:51:07 +01:00
Daniël de Kok	2ff197603e	matcher: remove undefined behavior (#10537 ) Indexing into a zero-length std::vector is undefined behavior.	2022-03-24 11:48:22 +01:00
Adriane Boyd	d85117f88c	Stream large assets on download (#10521 ) Stream large assets on download rather than reading the whole file at once and potentially running into `urllib3` limits on single read sizes.	2022-03-24 11:47:05 +01:00
Adriane Boyd	e908a67829	Handle unknown tags in KoreanTokenizer tag map (#10536 )	2022-03-24 11:25:36 +01:00
Adriane Boyd	c17980e535	Save vectors as little endian, load with Ops.asarray (#10201 ) * Save vectors as little endian, load with Ops.asarray * Always save vector data as little endian * Always run `Vectors.to_ops` when vector data is loaded so that `Ops.asarray` can be used to load the data correctly for the current ops. * Update spacy/vectors.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/vectors.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-21 14:24:46 +01:00
github-actions[bot]	bf1cf77a5b	Auto-format code with black (#10518 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-03-21 09:21:24 +01:00
Grey Murav	3ff5a6a5c0	Extend list of _num_words (#10468 )	2022-03-16 18:25:42 +01:00
Lj Miranda	a79cd3542b	Add displacy support for overlapping Spans (#10332 ) * Fix docstring for EntityRenderer * Add warning in displacy if doc.spans are empty * Implement parse_spans converter One notable change here is that the default spans_key is sc, and it's set by the user through the options. * Implement SpanRenderer Here, I implemented a SpanRenderer that looks similar to the EntityRenderer except for some templates. The spans_key, by default, is set to sc, but can be configured in the options (see parse_spans). The way I rendered these spans is per-token, i.e., I first check if each token (1) belongs to a given span type and (2) is a starting token of a given span type. Once I have this information, I render them into the markup. * Fix mypy issues on typing * Add tests for displacy spans support * Update colors from RGB to hex Co-authored-by: Ines Montani <ines@ines.io> * Remove unnecessary CSS properties * Add documentation for website * Remove unnecessary scripts * Update wording on the documentation Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Put typing dependency on top of file * Put back z-index so that spans overlap properly * Make warning more explicit for spans_key Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-16 18:14:34 +01:00
Daniël de Kok	e5debc68e4	Tagger: use unnormalized probabilities for inference (#10197 ) * Tagger: use unnormalized probabilities for inference Using unnormalized softmax avoids use of the relatively expensive exp function, which can significantly speed up non-transformer models (e.g. I got a speedup of 27% on a German tagging + parsing pipeline). * Add spacy.Tagger.v2 with configurable normalization Normalization of probabilities is disabled by default to improve performance. * Update documentation, models, and tests to spacy.Tagger.v2 * Move Tagger.v1 to spacy-legacy * docs/architectures: run prettier * Unnormalized softmax is now a Softmax_v2 option * Require thinc 8.0.14 and spacy-legacy 3.0.9	2022-03-15 14:15:31 +01:00
Adriane Boyd	0dc454ba95	Update docs for Vocab.get_vector (#10486 ) * Update docs for Vocab.get_vector * Clarify description of 0-vector dimensions	2022-03-15 09:10:47 +01:00
Edward	2eef47dd26	Save span candidates produced by spancat suggesters (#10413 ) * Add save_candidates attribute * Change spancat api * Add unit test * reimplement method to produce a list of doc * Add method to docs * Add new version tag * Add intended use to docstring * prettier formatting	2022-03-14 16:46:58 +01:00
Edward	b68bf43f5b	Add spans to doc.to_json (#10073 ) * Add spans to to_json * adjustments to_json * Change docstring * change doc key naming * Update spacy/tokens/doc.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-03-14 15:47:57 +01:00
github-actions[bot]	1bbf232074	Auto-format code with black (#10479 ) * Auto-format code with black * Update spacy/lang/hsb/lex_attrs.py Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-03-11 12:20:23 +01:00
Adriane Boyd	297dd82c86	Fix initial special cases for Tokenizer.explain (#10460 ) Add the missing initial check for special cases to `Tokenizer.explain` to align with `Tokenizer._tokenize_affixes`.	2022-03-11 10:50:47 +01:00
Adriane Boyd	191e8b31fa	Remove English tokenizer exception May. (#10463 )	2022-03-08 14:28:46 +01:00
jnphilipp	5ca0dbae76	Add Lower Sorbian support. (#10431 ) * Add basic support for Lower Sorbian. * Add some tests for dsb. * Update spacy/lang/dsb/examples.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-07 16:57:14 +01:00
Paul O'Leary McCann	61ba5450ff	Fix get_matching_ents (#10451 ) * Fix get_matching_ents Not sure what happened here - the code prior to this commit simply does not work. It's already covered by entity linker tests, which were succeeding in the NEL PR, but couldn't possibly succeed on master. * Fix test Test was indented inside another test and so doesn't seem to have been running properly.	2022-03-07 16:56:57 +01:00
jnphilipp	7ed7908716	Add Upper Sorbian support. (#10432 ) * Add basic support for Upper Sorbian. * Add tokenizer exceptions and tests. * Update spacy/lang/hsb/examples.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-07 16:20:39 +01:00
Sofie Van Landeghem	d89dac4066	hook up meta in load_model_from_config (#10400 )	2022-03-04 11:07:45 +01:00
Paul O'Leary McCann	91acc3ea75	Fix entity linker batching (#9669 ) * Partial fix of entity linker batching * Add import * Better name * Add `use_gold_ents` option, docs * Change to v2, create stub v1, update docs etc. * Fix error type Honestly no idea what the right type to use here is. ConfigValidationError seems wrong. Maybe a NotImplementedError? * Make mypy happy * Add hacky fix for init issue * Add legacy pipeline entity linker * Fix references to class name * Add __init__.py for legacy * Attempted fix for loss issue * Remove placeholder V1 * formatting * slightly more interesting train data * Handle batches with no usable examples This adds a test for batches that have docs but not entities, and a check in the component that detects such cases and skips the update step as though the batch were empty. * Remove todo about data verification Check for empty data was moved further up so this should be OK now - the case in question shouldn't be possible. * Fix gradient calculation The model doesn't know which entities are not in the kb, so it generates embeddings for the context of all of them. However, the loss does know which entities aren't in the kb, and it ignores them, as there's no sensible gradient. This has the issue that the gradient will not be calculated for some of the input embeddings, which causes a dimension mismatch in backprop. That should have caused a clear error, but with numpyops it was causing nans to happen, which is another problem that should be addressed separately. This commit changes the loss to give a zero gradient for entities not in the kb. * add failing test for v1 EL legacy architecture * Add nasty but simple working check for legacy arch * Clarify why init hack works the way it does * Clarify use_gold_ents use case * Fix use gold ents related handling * Add tests for no gold ents and fix other tests * Use aligned ents function (not working) This doesn't actually work because the "aligned" ents are gold-only. 
But if I have a different function that returns the intersection, then this will work as desired. * Use proper matching ent check This changes the process when gold ents are not used so that the intersection of ents in the pred and gold is used. * Move get_matching_ents to Example * Use model attribute to check for legacy arch * Rename flag * bump spacy-legacy to lower 3.0.9 Co-authored-by: svlandeg <svlandeg@github.com>	2022-03-04 09:17:36 +01:00
Adriane Boyd	8e93fa8507	Fix Vectors.n_keys for floret vectors (#10394 ) Fix `Vectors.n_keys` for floret vectors to match docstring description and avoid W007 warnings in similarity methods.	2022-03-01 09:21:25 +01:00
Sofie Van Landeghem	3f68bbcfec	Clean up loggers docs (#10351 ) * update docs to point to spacy-loggers docs * remove unused error code	2022-02-25 16:29:12 +01:00
github-actions[bot]	d637b34e2f	Auto-format code with black (#10377 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-02-25 10:00:21 +01:00
Adriane Boyd	b16da378bb	Re-remove universe tests from test suite (#10357 )	2022-02-23 21:08:56 +01:00
kadarakos	249b97184d	Bugfixes and test for rehearse (#10347 ) * fixing argument order for rehearse * rehearse test for ner and tagger * rehearse bugfix * added test for parser * test for multilabel textcat * rehearse fix * remove debug line * Update spacy/tests/training/test_rehearse.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/tests/training/test_rehearse.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Kádár Ákos <akos@onyx.uvt.nl> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-02-23 16:10:05 +01:00
Daniël de Kok	78a8bec4d0	Make core projectivization functions cdef nogil (#10241 ) * Make core projectivization methods cdef nogil While profiling the parser, I noticed that a relatively large amount of time is spent in projectivization. This change rewrites the functions in the core loops as cdef nogil for efficiency. In C++-land, we use vector in place of Python lists and absent heads are represented as -1 in place of None. * _heads_to_c: add assertion Validation should be performed by the caller, but this assertion ensures that we are not reading/writing out of bounds with incorrect input.	2022-02-21 15:02:21 +01:00
Adriane Boyd	30030176ee	Update Korean defaults for Tokenizer (#10322 ) Update Korean defaults for `Tokenizer` for tokenization following UD Korean Kaist.	2022-02-21 10:26:19 +01:00
Adriane Boyd	f32ee2e533	Fix NER check in CoNLL-U converter (#10302 ) * Fix NER check in CoNLL-U converter Leave ents unset if no NER annotation is found in the MISC column. * Revert to global rather than per-sentence NER check * Update spacy/training/converters/conllu_to_docs.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-02-21 10:24:52 +01:00
Peter Baumgartner	3358fb9bdd	Miscellaneous Minor SpanGroups/DocBin Improvements (#10250 ) * MultiHashEmbed vector docs correction * doc copy span test * ignore empty lists in DocBin.span_groups * serialized empty list const + SpanGroups.is_empty * add conditional deserial on from_bytes * clean up + reorganize * rm test * add constant as class attribute * rename to _EMPTY_BYTES * Update spacy/tests/doc/test_span.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-02-21 10:24:15 +01:00
Adriane Boyd	f4c74764b8	Fix Tok2Vec for empty batches (#10324 ) * Add test for tok2vec with vectors and empty docs * Add shortcut for empty batch in Tok2Vec.predict * Avoid types	2022-02-21 10:22:36 +01:00
github-actions[bot]	6de84c8757	Auto-format code with black (#10333 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-02-21 09:15:42 +01:00
Adriane Boyd	28ba31e793	Add whitespace and combined augmenters (#10170 ) Add whitespace augmenter that inserts a single whitespace token into a doc containing annotation used in core trained pipelines. Add a combined augmenter that handles lowercasing, orth variants and whitespace augmentation.	2022-02-17 15:54:09 +01:00
Grey Murav	aa93b471a1	Extend list of stopwords for ru language (#10313 )	2022-02-17 15:51:15 +01:00
Grey Murav	23f06dc37f	Extend list of numbers for ru language (#10280 ) * Extended list of numbers for ru language Extended list of numbers with all forms and cases including short forms, slang variants and roman numerals. * Update lex_attrs.py * Update 'like_num' function with percentages Added support for numbers with percentages such as 12% and 1.2% to the 'like_num' function. * black formatting Co-authored-by: thomashacker <EdwardSchmuhl@web.de>	2022-02-17 15:50:08 +01:00
Grey Murav	a9756963e6	Extend list of abbreviations for ru language (#10282 ) * Extend list of abbreviations for ru language Extended list of abbreviations for ru language that may influence tokenization. * black formatting Co-authored-by: thomashacker <EdwardSchmuhl@web.de>	2022-02-17 15:48:50 +01:00
Adriane Boyd	da7520a83c	Delay loading of mecab in Korean tokenizer (#10295 ) * Delay loading of mecab in Korean tokenizer Delay loading of mecab until the tokenizer is called the first time so that it's possible to initialize a blank `ko` pipeline without having mecab installed, e.g. for use with `spacy init vectors`. * Move mecab import back to __init__ Move mecab import back to __init__ to warn users at the same point as before for missing python dependencies.	2022-02-17 11:35:34 +01:00
Sofie Van Landeghem	a16b14e591	Merge branch 'master' into copy/develop	2022-02-16 14:04:59 +01:00
github-actions[bot]	5adedb8587	Auto-format code with black (#10260 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-02-11 14:23:01 +01:00
Adriane Boyd	bbaf41fb3b	Set version to v3.2.2 (#10262 )	2022-02-11 11:45:26 +01:00
Edward	7961a0a959	Fix typo in errors (#10256 )	2022-02-10 13:45:46 +01:00
Peter Baumgartner	ee662ec381	Raise error in spacy package when model name is not a valid python identifier (#10192 ) * MultiHashEmbed vector docs correction * raise error for invalid identifier as model name * more succinct error message * update success message * permitted package name + double underscore * clarify package name error * clarify underscore run message * tweak language + simplify underscore run * cleanup underscore run warning * spacing correction * Update spacy/tests/test_cli.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-02-10 08:15:23 +01:00
Ramon Ziai	6477dafac2	fix(phrasematcher.pyi): change type annotation of `docs` in `add()` to `List[Doc]` (#10235 ) https://github.com/explosion/spaCy/issues/10234	2022-02-08 13:37:27 +01:00
Adriane Boyd	a9ee5bff98	Support mixed case model package names (#10223 )	2022-02-08 10:52:46 +01:00
Antti Ajanki	e9c26f2ee9	Add a noun chunker for Finnish (#10214 ) with test cases	2022-02-08 08:44:11 +01:00
Sofie Van Landeghem	deb143fa70	Token sent attributes more consistent (#10164 ) * remove duplicate line * add sent start/end token attributes to the docs * let has_annotation work with IS_SENT_END * elif instead of if * add has_annotation test for sent attributes * fix typo * remove duplicate is_sent_start entry in docs	2022-02-08 08:35:37 +01:00
Lj Miranda	42072f4468	Add spancat pipeline in spacy debug data (#10070 ) * Setup debug data for spancat * Add check for missing labels * Add low-level data warning error * Improve logic when compiling the gold train data * Implement check for negative examples * Remove breakpoint * Remove ws_ents and missing entity checks * Fix mypy errors * Make variable name spans_key consistent * Rename pipeline -> component for consistency * Account for missing labels per spans_key * Cleanup variable names for consistency * Improve brevity of conditional statements * Remove unused variables * Include spans_key as an argument for _get_examples * Add a conditional check for spans_key * Update spancat debug data based on new API - Instead of using _get_labels_from_model(), I'm now using _get_labels_from_spancat() (cf. https://github.com/explosion/spaCy/pull10079) - The way information is displayed was also changed (text -> table) * Rename model_labels to ensure mypy works * Update wording on warning messages Use "span type" instead of "entity type" in wording the warning messages. This is because Spans aren't necessarily entities. * Update component type into a Literal This is to make it clear that the component parameter should only accept either 'spancat' or 'ner'. * Update checks to include actual model span_keys Instead of looking at everything in the data, we only check those span_keys from the actual spancat component. Instead of doing the filter inside the for-loop, I just made another dictionary, data_labels_in_component to hold this value. * Update spacy/cli/debug_data.py * Show label counts only when verbose is True Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-02-07 15:03:36 +01:00
Adriane Boyd	63e1e4e8f6	Fix debug data check for ents that cross sents (#10188 ) * Fix debug data check for ents that cross sents * Use aligned sent starts to have the same indices for the NER and sent start annotation * Add a temporary, insufficient hack for the case where a sentence-initial reference token is split into multiple tokens in the predicted doc, since `Example.get_aligned("SENT_START")` currently aligns `True` to all the split tokens. * Improve test example * Use Example.get_aligned_sent_starts * Add test for crossing entity	2022-02-07 08:53:30 +01:00
github-actions[bot]	91ccacea12	Auto-format code with black (#10209 ) * Auto-format code with black * add black requirement to dev dependencies and pin to 22.x * ignore black dependency for comparison with setup.cfg Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com> Co-authored-by: svlandeg <svlandeg@github.com>	2022-02-06 16:30:30 +01:00
Sofie Van Landeghem	bc12ecb870	Merge pull request #10185 from martinjack/master Update Ukrainian tokenizer_exceptions	2022-02-06 16:30:03 +01:00
Sofie Van Landeghem	14513f82da	Merge pull request #10215 from explosion/master update develop	2022-02-06 13:45:41 +01:00
Adriane Boyd	0668a449ba	Add Pipe.hide_labels to omit labels from pipeline meta (#10175 )	2022-02-05 17:59:24 +01:00
Adriane Boyd	6f551043e4	Use paths.vectors for vectors in init config (#10146 ) So that overriding `paths.vectors` works consistently in generated configs, set vectors model in `paths.vectors` and always refer to this path in `initialize.vectors`.	2022-02-04 21:09:48 +01:00
Adriane Boyd	fef896ce49	Allow Example to align whitespace annotation (#10189 ) Remove exception for whitespace tokens in `Example.get_aligned` so that annotation on whitespace tokens is aligned in the same way as for non-whitespace tokens.	2022-02-03 17:01:53 +01:00
Evgen Kytonin	fc3d446c71	Update Ukrainian tokenizer_exceptions	2022-02-01 13:24:00 +02:00
Lj Miranda	345e7f6bc4	Clarify Span.ents documentation (#10154 ) * Clarify Span.ents documentation Ref: #10135 Retain current behaviour. Span.ents will only include entities within said span. You can't get tokens outside of the original span. * Reword docstrings Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update API docs in the website Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-01-31 08:41:42 +01:00
Marek Šuppa	f09c799a96	fix: Add missing comma to `_eleven_to_beyond` (#10166 ) * This comma has most probably been left out unintentionally, leading to string concatenation between the two consecutive lines. This issue has been found automatically using a regular expression.	2022-01-30 16:45:06 +09:00
Marek Šuppa	67ecac633f	fix: Add missing comma to `examples.py` (#10167 ) * This comma has most probably been left out unintentionally, leading to string concatenation between the two consecutive lines. This issue has been found automatically using a regular expression.	2022-01-30 16:43:29 +09:00
Adriane Boyd	4f441dfa24	Fix infix as prefix in Tokenizer.explain (#10140 ) * Fix infix as prefix in Tokenizer.explain Update `Tokenizer.explain` to align with the `Tokenizer` algorithm: * skip infix matches that are prefixes in the current substring * Update tokenizer pseudocode in docs	2022-01-28 17:00:54 +01:00
Eduard Zorita	30cf9d6a05	Update typing hints (#10109 ) * Improve typing hints for Matcher.__call__ * Add typing hints for DependencyMatcher * Add typing hints to underscore extensions * Update Doc.tensor type (requires numpy 1.21) * Fix typing hints for Language.component decorator * Use generic np.ndarray type in Doc to avoid numpy version update * Fix mypy errors * Fix cyclic import caused by Underscore typing hints * Use Literal type from spacy.compat * Update matcher.pyi import format Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-01-28 16:59:54 +01:00
Adriane Boyd	09734c56fc	Use simple suggester for spancat initialization (#10143 ) Instead of running the actual suggester, which may require annotation from annotating components that is not necessarily present in the reference docs, use the built-in 1-gram suggester.	2022-01-28 09:34:23 +01:00
github-actions[bot]	6d4db5c3c7	Auto-format code with black (#10106 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-01-21 10:01:10 +01:00
pepemedigu	2abd380f2d	Update lex_attrs.py for Spanish with ordinals (#10038 ) * Update lex_attrs.py Add ordinal words * black formatting Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-01-20 15:44:13 +01:00
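
Note on the two missing-comma fixes listed above (#10166 and #10167): Python joins adjacent string literals implicitly, so a comma dropped between two lines of a list silently merges two entries into one. The sketch below illustrates the effect with a hypothetical word list; it is not the actual content of `_eleven_to_beyond` or `examples.py`.

```python
# Hypothetical data for illustration only (not spaCy's real word lists).
# The comma after "twelve" is missing, so Python concatenates the two
# adjacent string literals into a single list element.
buggy = [
    "eleven",
    "twelve"
    "thirteen",
]

# With the comma restored, each word is its own element.
fixed = [
    "eleven",
    "twelve",
    "thirteen",
]

print(len(buggy), buggy)  # 2 ['eleven', 'twelvethirteen']
print(len(fixed), fixed)  # 3 ['eleven', 'twelve', 'thirteen']
```

No error or warning is raised for the buggy list, which is why such omissions tend to be found by pattern searches rather than by running the code.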

9207 Commits