spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-17 07:31:59 +03:00

Author	SHA1	Message	Date
Adriane Boyd	bffe54d02b	Set version to v3.4.0	2022-06-24 08:48:58 +02:00
Sofie Van Landeghem	f8116078ce	disable failing test because Stanford servers are down (#11015 )	2022-06-23 10:57:46 +02:00
Sofie Van Landeghem	f00254ae27	add counts to verbose list of NER labels (#10957 )	2022-06-20 09:48:40 +02:00
Raphael Mitsch	4c058eb40a	`enable` argument for spacy.load() (#10784 ) * Enable flag on spacy.load: foundation for include, enable arguments. * Enable flag on spacy.load: fixed tests. * Enable flag on spacy.load: switched from pretrained model to empty model with added pipes for tests. * Enable flag on spacy.load: switched to more consistent error on misspecification of component activity. Test refactoring. Added to default config. * Enable flag on spacy.load: added support for fields not in pipeline. * Enable flag on spacy.load: removed serialization fields from supported fields. * Enable flag on spacy.load: removed 'enable' from config again. * Enable flag on spacy.load: relaxed checks in _resolve_component_activation_status() to allow non-standard pipes. * Enable flag on spacy.load: fixed relaxed checks for _resolve_component_activation_status() to allow non-standard pipes. Extended tests. * Enable flag on spacy.load: comments w.r.t. resolution workarounds. * Enable flag on spacy.load: remove include fields. Update website docs. * Enable flag on spacy.load: updates w.r.t. changes in master. * Implement Doc.from_json(): update docstrings. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): remove newline. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): change error message for E1038. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Enable flag on spacy.load: wrapped docstring for _resolve_component_status() at 80 chars. * Enable flag on spacy.load: changed exmples for enable flag. * Remove newline. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix docstring for Language._resolve_component_status(). * Rename E1038 to E1042. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-06-17 20:24:13 +01:00
Sofie Van Landeghem	eaeca5eb6a	account for NER labels with a hyphen in the name (#10960 ) * account for NER labels with a hyphen in the name * cleanup * fix docstring * add return type to helper method * shorter method and few more occurrences * user helper method across repo * fix circular import * partial revert to avoid circular import	2022-06-17 20:02:37 +01:00
github-actions[bot]	6313787fb6	Auto-format code with black (#10977 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-06-17 19:41:55 +01:00
Raphael Mitsch	d50668dbf0	Made _initialize_X() methods private. (#10978 )	2022-06-17 15:55:34 +02:00
Raphael Mitsch	a7f6bc5dfb	Workaround for Typer optional default values with Python calls (#10788 ) * Workaround for Typer optional default values with Python calls: added test and workaround. * @rmitsch Workaround for Typer optional default values with Python calls: reverting some black formatting changes. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * @rmitsch Workaround for Typer optional default values with Python calls: removing return type hint. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Workaround for Typer optional default values with Python calls: fixed imports, added GitHub issue marker. * Workaround for Typer optional default values with Python calls: removed forcing of default values for optional arguments in init_config_cli(). Added default values for init_config(). Synchronized default values for init_config_cli() and init_config(). * Workaround for Typer optional default values with Python calls: removed unused import. * Workaround for Typer optional default values with Python calls: fixed usage of optimize in init_config_cli(). * Workaround for Typer optional default values with Pythhon calls: remove output_file from InitDefaultValues. * Workaround for Typer optional default values with Python calls: rename class for default init values. * Workaround for Typer optional default values with Python calls: remove newline. * remove introduced newlines * Remove test_init_config_from_python_without_optional_args(). * remove leftover import * reformat import * remove duplicate Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-06-17 12:15:36 +02:00
Daniël de Kok	3d3fbeda9f	Update for CBlas changes in Thinc 8.1.0.dev2 (#10970 )	2022-06-16 11:42:34 +02:00
Daniël de Kok	0d352c46ed	vectors: remove use of float as row number (#10955 ) The float -1 was returned rather than the integer -1 as the row for unknown keys. This doesn't introduce a realy bug, since such floats cast (without issues) to int in the conversion to NumPy arrays. Still, it's nice to to do the correct thing :).	2022-06-15 15:32:02 +02:00
Madeesh Kannan	126d1db123	Add failing test: `test_matcher_extension_in_set_predicate` (#10948 )	2022-06-13 10:56:45 +02:00
Daniël de Kok	a83a501195	precomputable_biaffine: avoid concatenation (#10911 ) The `forward` of `precomputable_biaffine` performs matrix multiplication and then `vstack`s the result with padding. This creates a temporary array used for the output of matrix concatenation. This change avoids the temporary by pre-allocating an array that is large enough for the output of matrix multiplication plus padding and fills the array in-place. This gave me a small speedup (a bit over 100 WPS) on de_core_news_lg on M1 Max (after changing thinc-apple-ops to support in-place gemm as BLIS does).	2022-06-10 18:12:28 +02:00
github-actions[bot]	97e8a5041b	Auto-format code with black (#10945 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-06-10 13:21:33 +02:00
kadarakos	1bb87f35bc	Detect cycle during projectivize (#10877 ) * detect cycle during projectivize * not complete test to detect cycle in projectivize * boolean to int type to propagate error * use unordered_set instead of set * moved error message to errors * removed cycle from test case * use find instead of count * cycle check: only perform one lookup * Return bool again from _has_head_as_ancestor Communicate presence of cycles through an output argument. * Switch to returning std::pair to encode presence of a cycle The has_cycle pointer is too easy to misuse. Ideally, we would have a sum type like Rust's `Result` here, but C++ is not there yet. * _is_non_proj_arc: clarify what we are returning * _has_head_as_ancestor: remove count We are now explicitly checking for cycles, so the algorithm must always terminate. Either we encounter the head, we find a root, or a cycle. * _is_nonproj_arc: simplify condition * Another refactor using C++ exceptions * Remove unused error code * Print graph with cycle on exception * Include .hh files in source package * Add FIXME comment * cycle detection test * find cycle when starting from problematic vertex Co-authored-by: Daniël de Kok <me@danieldk.eu>	2022-06-08 19:34:11 +02:00
github-actions[bot]	24aafdffad	Auto-format code with black (#10908 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-06-03 11:01:55 +02:00
Adriane Boyd	727ce6d1f5	Remove English exceptions with mismatched features (#10873 ) Remove English contraction exceptions with mismatched features that lead to exceptions like "theses" and "thisre".	2022-06-03 09:44:04 +02:00
Madeesh Kannan	41389ffe1e	Avoid pickling `Doc` inputs passed to `Language.pipe()` (#10864 ) * `Language.pipe()`: Serialize `Doc` objects to bytes when using multiprocessing to avoid pickling overhead * `Doc.to_dict()`: Serialize `_context` attribute (keeping in line with `(un)pickle_doc()` * Correct type annotations * Fix typo * `Doc`: Do not serialize `_context` * `Language.pipe`: Send context objects to child processes, Simplify `as_tuples` handling * Fix type annotation * `Language.pipe`: Simplify `as_tuple` multiprocessor handling * Cleanup code, fix typos * MyPy fixes * Move doc preparation function into `_multiprocessing_pipe` Whitespace changes * Remove superfluous comma * Rename `prepare_doc` to `prepare_input` * Update spacy/errors.py * Undo renaming for error Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-06-02 20:06:49 +02:00
single-fingal	6c6b8da7cc	Fix: De/Serialize `SpanGroups` including the SpanGroup keys (#10707 ) * fix: De/Serialize `SpanGroups` including the SpanGroup keys This prevents the loss of `SpanGroup`s that have the same .name as other `SpanGroup`s within the same `SpanGroups` object (upon de/serialization of the `SpanGroups`). Fixes #10685 * Maintain backwards compatibility for serialized `SpanGroups` (serialized as: a list of `SpanGroup`s, or b'') * Add tests for `SpanGroups` deserialization backwards-compatibility * Move a `SpanGroups` de/serialization test (test_issue10685) to tests/serialize/test_serialize_spangroups.py * Output a warning if deserializing a `SpanGroups` with duplicate .name-d `SpanGroup`s * Minor refactor * `SpanGroups.from_bytes` handles only `list` and `dict` types with `dict` as the expected default * For lists, keep first rather than last value encountered * Update error message * Rename and update tests * Update to preserve list serialization of SpanGroups To avoid breaking compatibility of serialized `Doc` and `DocBin` with earlier versions of spacy v3, revert back to a list-only serialization, but update the names just for serialization so that the SpanGroups keys override the SpanGroup names. * Preserve object identity and current key overwrite * Preserve SpanGroup object identity * Preserve last rather than first span group from SpanGroup list format without SpanGroups keys * Update inline comments * Fix types * Add type info for SpanGroup.copy * Deserialize `SpanGroup`s as copies when a single SpanGroup is the value for more than 1 `SpanGroups` key. This is because we serialize `SpanGroups` as dicts (to maintain backward- and forward-compatibility) and we can't assume `SpanGroup`s with the same bytes/serialization were the same (identical) object, pre-serialization. * Update spacy/tokens/_dict_proxies.py * Add more SpanGroups serialization tests Test that serialized SpanGroups maintain their Span order * small clarification on older spaCy version * Update spacy/tests/serialize/test_serialize_span_groups.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-06-02 15:56:27 +02:00
Adriane Boyd	7e13652d36	Fix schemas import in Doc (#10898 )	2022-06-02 15:53:03 +02:00
Raphael Mitsch	8387ce4c01	Add Doc.from_json() (#10688 ) * Implement Doc.from_json: rough draft. * Implement Doc.from_json: first draft with tests. * Implement Doc.from_json: added documentation on website for Doc.to_json(), Doc.from_json(). * Implement Doc.from_json: formatting changes. * Implement Doc.to_json(): reverting unrelated formatting changes. * Implement Doc.to_json(): fixing entity and span conversion. Moving fixture and doc <-> json conversion tests into single file. * Implement Doc.from_json(): replaced entity/span converters with doc.char_span() calls. * Implement Doc.from_json(): handling sentence boundaries in spans. * Implementing Doc.from_json(): added parser-free sentence boundaries transfer. * Implementing Doc.from_json(): added parser-free sentence boundaries transfer. * Implementing Doc.from_json(): incorporated various PR feedback. * Renaming fixture for document without dependencies. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implementing Doc.from_json(): using two sent_starts instead of one. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implementing Doc.from_json(): doc_without_dependency_parser() -> doc_without_deps. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implementing Doc.from_json(): incorporating various PR feedback. Rebased on latest master. * Implementing Doc.from_json(): refactored Doc.from_json() to work with annotation IDs instead of their string representations. * Implement Doc.from_json(): reverting unwanted formatting/rebasing changes. * Implement Doc.from_json(): added check for char_span() calculation for entities. * Update spacy/tokens/doc.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): minor refactoring, additional check for token attribute consistency with corresponding test. * Implement Doc.from_json(): removed redundancy in annotation type key naming. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): Simplifying setting annotation values. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement doc.from_json(): renaming annot_types to token_attrs. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): adjustments for renaming of annot_types to token_attrs. * Implement Doc.from_json(): removing default categories. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): simplifying lexeme initialization. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): simplifying lexeme initialization. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): refactoring to only have keys for present annotations. * Implement Doc.from_json(): fix check for tokens' HEAD attributes. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): refactoring Doc.from_json(). * Implement Doc.from_json(): fixing span_group retrieval. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): fixing span retrieval. * Implement Doc.from_json(): added schema for Doc JSON format. Minor refactoring in Doc.from_json(). * Implement Doc.from_json(): added comment regarding Token and Span extension support. * Implement Doc.from_json(): renaming inconsistent_props to partial_attrs.. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): adjusting error message. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): extending E1038 message. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): added params to E1038 raises. * Implement Doc.from_json(): combined attribute collection with partial attributes check. * Implement Doc.from_json(): added optional schema validation. * Implement Doc.from_json(): fixed optional fields in schema, tests. * Implement Doc.from_json(): removed redundant None check for DEP. * Implement Doc.from_json(): added passing of schema validatoin message to E1037.. * Implement Doc.from_json(): removing redundant error E1040. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): changing message for E1037. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): adjusted website docs and docstring of Doc.from_json(). * Update spacy/tests/doc/test_json_doc_conversion.py * Implement Doc.from_json(): docstring update. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): docstring update. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): website docs update. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): docstring formatting. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): docstring formatting. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): fixing Doc reference in website docs. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): reformatted website/docs/api/doc.md. * Implement Doc.from_json(): bumped IDs of new errors to avoid merge conflicts. * Implement Doc.from_json(): fixing bug in tests. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): fix setting of sentence starts for docs without DEP. * Implement Doc.from_json(): add check for valid char spans when manually setting sentence boundaries. Refactor sentence boundary setting slightly. Move error message for lack of support for partial token annotations to errors.py. * Implement Doc.from_json(): simplify token sentence start manipulation. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Combine related error messages * Update spacy/tests/doc/test_json_doc_conversion.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-06-02 14:03:47 +02:00
Adriane Boyd	a322d6d5f2	Add SpanRuler component (#9880 ) * Add SpanRuler component Add a `SpanRuler` component similar to `EntityRuler` that saves a list of matched spans to `Doc.spans[spans_key]`. The matches from the token and phrase matchers are deduplicated and sorted before assignment but are not otherwise filtered. * Update spacy/pipeline/span_ruler.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix cast * Add self.key property * Use number of patterns as length * Remove patterns kwarg from init * Update spacy/tests/pipeline/test_span_ruler.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Add options for spans filter and setting to ents * Add `spans_filter` option as a registered function' * Make `spans_key` optional and if `None`, set to `doc.ents` instead of `doc.spans[spans_key]`. * Update and generalize tests * Add test for setting doc.ents, fix key property type * Fix typing * Allow independent doc.spans and doc.ents * If `spans_key` is set, set `doc.spans` with `spans_filter`. * If `annotate_ents` is set, set `doc.ents` with `ents_fitler`. * Use `util.filter_spans` by default as `ents_filter`. * Use a custom warning if the filter does not work for `doc.ents`. * Enable use of SpanC.id in Span * Support id in SpanRuler as Span.id * Update types * `id` can only be provided as string (already by `PatternType` definition) * Update all uses of Span.id/ent_id in Doc * Rename Span id kwarg to span_id * Update types and docs * Add ents filter to mimic EntityRuler overwrite_ents * Refactor `ents_filter` to take `entities, spans` args for more filtering options * Give registered filters more descriptive names * Allow registered `filter_spans` filter (`spacy.first_longest_spans_filter.v1`) to take any number of `Iterable[Span]` objects as args so it can be used for spans filter or ents filter * Implement future entity ruler as span ruler Implement a compatible `entity_ruler` as `future_entity_ruler` using `SpanRuler` as the underlying component: * Add `sort_key` and `sort_reverse` to allow the sorting behavior to be customized. (Necessary for the same sorting/filtering as in `EntityRuler`.) * Implement `overwrite_overlapping_ents_filter` and `preserve_existing_ents_filter` to support `EntityRuler.overwrite_ents` settings. * Add `remove_by_id` to support `EntityRuler.remove` functionality. * Refactor `entity_ruler` tests to parametrize all tests to test both `entity_ruler` and `future_entity_ruler` * Implement `SpanRuler.token_patterns` and `SpanRuler.phrase_patterns` properties. Additional changes: * Move all config settings to top-level attributes to avoid duplicating settings in the config vs. `span_ruler/cfg`. (Also avoids a lot of casting.) * Format * Fix filter make method name * Refactor to use same error for removing by label or ID * Also provide existing spans to spans filter * Support ids property * Remove token_patterns and phrase_patterns * Update docstrings * Add span ruler docs * Fix types * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Move sorting into filters * Check for all tokens in seen tokens in entity ruler filters * Remove registered sort key * Set Token.ent_id in a backwards-compatible way in Doc.set_ents * Remove sort options from API docs * Update docstrings * Rename entity ruler filters * Fix and parameterize scoring * Add id to Span API docs * Fix typo in API docs * Include explicit labeled=True for scorer Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-06-02 13:12:53 +02:00
Sofie Van Landeghem	f7507c2327	fix typo + CI slow testing (#10835 ) * fix typo * one more typo	2022-06-02 00:10:16 +02:00
Paul O'Leary McCann	dca2e8c644	Minor NEL type fixes (#10860 ) * Fix TODO about typing Fix was simple: just request an array2f. * Add type ignore Maxout has a more restrictive type than the residual layer expects (only Floats2d vs any Floats). * Various cleanup This moves a lot of lines around but doesn't change any functionality. Details: 1. use `continue` to reduce indentation 2. move sentence doc building inside conditional since it's otherwise unused 3. reduces some temporary assignments	2022-06-01 00:41:28 +02:00
Daniël de Kok	09a5f03dd7	Merge pull request #10849 from danieldk/simplify-gpu-check Simplify GPU check	2022-05-30 16:35:10 +02:00
Daniël de Kok	85dd2b6c04	Parser: use C saxpy/sgemm provided by the Ops implementation (#10773 ) * Parser: use C saxpy/sgemm provided by the Ops implementation This is a backport of https://github.com/explosion/spaCy/pull/10747 from the parser refactor branch. It eliminates the explicit calls to BLIS, instead using the saxpy/sgemm provided by the Ops implementation. This allows us to use Accelerate in the parser on M1 Macs (with an updated thinc-apple-ops). Performance of the de_core_news_lg pipe: BLIS 0.7.0, no thinc-apple-ops: 6385 WPS BLIS 0.7.0, thinc-apple-ops: 36455 WPS BLIS 0.9.0, no thinc-apple-ops: 19188 WPS BLIS 0.9.0, thinc-apple-ops: 36682 WPS This PR, thinc-apple-ops: 38726 WPS Performance of the de_core_news_lg pipe (only tok2vec -> parser): BLIS 0.7.0, no thinc-apple-ops: 13907 WPS BLIS 0.7.0, thinc-apple-ops: 73172 WPS BLIS 0.9.0, no thinc-apple-ops: 41576 WPS BLIS 0.9.0, thinc-apple-ops: 72569 WPS This PR, thinc-apple-ops: 87061 WPS * Require thinc >=8.1.0,<8.2.0 * Lower thinc lowerbound to 8.1.0.dev0 * Use best CPU ops for CBLAS when the parser model is on the GPU * Fix another unguarded cblas() call * Fix: use ops as a shorthand for self.model.ops Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>	2022-05-27 11:20:52 +02:00
github-actions[bot]	6172af8158	Auto-format code with black (#10857 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-05-27 10:54:54 +02:00
Daniël de Kok	7c6a97559d	Simplify GPU check This change removes `thinc.util.has_cupy` from the GPU presence check. Currently `gpu_is_available` already implies `has_cupy`. We also want to show this warning in the future when a machine has a non-CuPy GPU.	2022-05-25 14:06:45 +02:00
kadarakos	f6a4b80c0b	Better errors for has_annotation and Matcher (#10830 ) * Show input argument instead of None * catch invalid attr early * moved error message from code to errors.py * Update spacy/errors.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/errors.py * update E153 and E154 Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-05-25 11:12:29 +02:00
Richard Hudson	32954c3bcb	Fix issues for Mypy 0.950 and Pydantic 1.9.0 (#10786 ) * Make changes to typing * Correction * Format with black * Corrections based on review * Bumped Thinc dependency version * Bumped blis requirement * Correction for older Python versions * Update spacy/ml/models/textcat.py Co-authored-by: Daniël de Kok <me@github.danieldk.eu> * Corrections based on review feedback * Readd deleted docstring line Co-authored-by: Daniël de Kok <me@github.danieldk.eu>	2022-05-25 09:33:54 +02:00
Paul O'Leary McCann	6be09bbd07	Fix Entity Linker with tokenization mismatches (fix #9575 ) (#10457 ) * Add failing test * Partial fix for issue This kind of works. The issue with token length mismatches is gone. The problem is that when you get empty lists of encodings to compare, it fails because the sizes are not the same, even though they're both zero: (0, 3) vs (0,). Not sure why that happens... * Short circuit on empties * Remove spurious check The check here isn't needed now the the short circuit is fixed. * Update spacy/tests/pipeline/test_entity_linker.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Use "eg", not "example" Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-05-23 20:42:26 +02:00
Lj Miranda	1d34aa2b3d	Add spacy-span-analyzer to debug data (#10668 ) * Rename to spans_key for consistency * Implement spans length in debug data * Implement how span bounds and spans are obtained In this commit, I implemented how span boundaries (the tokens) around a given span and spans are obtained. I've put them in the compile_gold() function so that it's accessible later on. I will do the actual computation of the span and boundary distinctiveness in the main function above. * Compute for p_spans and p_bounds * Add computation for SD and BD * Fix mypy issues * Add weighted average computation * Fix compile_gold conditional logic * Add test for frequency distribution computation * Add tests for kl-divergence computation * Fix weighted average computation * Make tables more compact by rounding them * Add more descriptive checks for spans * Modularize span computation methods In this commit, I added the _get_span_characteristics and _print_span_characteristics functions so that they can be reusable anywhere. * Remove unnecessary arguments and make fxs more compact * Update a few parameter arguments * Add tests for print_span and get_span methods * Update API to talk about span characteristics in brief * Add better reporting of spans_length * Add test for span length reporting * Update formatting of span length report Removed '' to indicate that it's not a string, then sort the n-grams by their length, not by their frequency. * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Show all frequency distribution when -V In this commit, I displayed the full frequency distribution of the span lengths when --verbose is passed. To make things simpler, I rewrote some of the formatter functions so that I can call them whenever. Another notable change is that instead of showing percentages as Integers, I showed them as floats (max 2-decimal places). I did this because it looks weird when it displays (0%). * Update logic on how total is computed The way the 90% thresholding is computed now is that we keep adding the percentages until we reach >= 90%. I also updated the wording and used the term "At least" to denote that >= 90% of your spans have these distributions. * Fix display when showing the threshold percentage * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Add better phrasing for span information * Update spacy/cli/debug_data.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Add minor edits for whitespaces etc. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-05-23 19:06:38 +02:00
Paul O'Leary McCann	46982cf694	Add glossary entry for root (#10821 ) * Add glossary entry for root There was already one but it was lower case, maybe that should be removed? * remove lowercase root On reflection, that was probably just a mistake. * Add lowercase root back It's harmless to leave it there.	2022-05-20 09:56:32 +02:00
Raphael Mitsch	357be2614e	Fuzz tokenizer.explain: draft for fuzzy tests. (#10771 ) * Fuzz tokenizer.explain: draft for fuzzy tests. * Fuzz tokenizer.explain: xignoring tokenizer.explain() tests. Removed deadline modification. Removed LANGUAGES_WITHOUT_TOKENIZERS. * Fuzz tokenizer.explain: changed tokenizer initialization to avoid failus in Azure runs. * Fuzz tokenizer.explain: type hint for tokenizer in test. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-05-17 10:23:16 +02:00
github-actions[bot]	99aeaf9bd3	Auto-format code with black (#10795 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-05-13 19:02:08 +02:00
kadarakos	fd36469900	bugfix parser labels (#10797 )	2022-05-13 11:41:32 +02:00
Patrick Düggelin	cb06309ed8	Fix PhraseMatcher remove overlapping terms (#10734 ) * Add regression test for issue 10643 * Improve overlapping terms testcase * Fix removing overlapping terms in phrase matcher (#10643)	2022-05-12 12:23:52 +02:00
Raphael Mitsch	6f9e2ca81f	Ignore overrides for pipe names in config argument (#10779 ) * Pipe name override in config: added check with warning, added removal of name override from config, extended tests. * Pipoe name override in config: added pytest UserWarning. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-05-12 11:46:08 +02:00
Adriane Boyd	b65d652881	Override SpanGroups.setdefault to provide default SpanGroup (#10772 ) * Fix mistake in SpanGroup API docs * Restrict SpanGroups.setdefault to SpanGroup only * Refactor to support default span iterable	2022-05-12 10:06:25 +02:00
Raphael Mitsch	2904359685	Allow assets to be optional in spacy project (#10714 ) * Allow assets to be optional in spacy project: draft for optional flag/download_all options. * Allow assets to be optional in spacy project: added OPTIONAL_DEFAULT reflecting default asset optionality. * Allow assets to be optional in spacy project: renamed --all to --extra. * Allow assets to be optional in spacy project: included optional flag in project config test. * Allow assets to be optional in spacy project: added documentation. * Allow assets to be optional in spacy project: fixing deprecated --all reference. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Allow assets to be optional in spacy project: fixed project_assets() docstring. * Allow assets to be optional in spacy project: adjusted wording in justification of optional assets. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: switched to as keyword in project.yml. Updated docs. * Allow assets to be optional in spacy project: updated comment. * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in output. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in docstring.. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test.. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: renamed OPTIONAL_DEFAULT to EXTRA_DEFAULT. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-05-10 10:40:11 +02:00
Sofie Van Landeghem	1543558d08	Add test for old architectures (#10751 ) * add v1 and v2 tests for tok2vec architectures * textcat architectures are not "layers" * test older textcat architectures * test older parser architecture	2022-05-10 08:24:42 +02:00
Luca Dorigo	0a92d5644e	Fix StringStore.__getitem__ return type depending on parameter types (#10741 ) * Fix StringStore.__getitem__ return type depending on parameter types Small fix using `@overload` so that `StringStore.__getitem__` returns an `int` when given a `str` or `bytes` and a `str` when given an `int`. * Update spacy/strings.pyi Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-05-03 17:57:07 +02:00
Raphael Mitsch	f5390e278a	Refactor error messages to remove hardcoded strings (#10729 ) * Use custom error msg instead of hardcoded string: replaced remaining hardcoded error message strings. * Use custom error msg instead of hardcoded string: fixing faulty Errors import.	2022-05-02 13:38:46 +02:00
Madeesh Kannan	0a503ce5e0	Remove vestigial debug print statement in `walk_head_nodes` (#10718 ) * `graph`: Remove vestigial debug print statement in `walk_head_nodes` * Revert whitespace changes * Remove more debug print statements	2022-05-02 13:36:35 +02:00
Adriane Boyd	10377fb945	Set version to v3.3.0 (#10614 ) * Set version to v3.3.0 * Revert "Temporarily skip tests that require models/compat" This reverts commit `e422101e00`.	2022-04-28 13:07:49 +02:00
harmbuisman	c066fb8a4e	#10672 : fixes displacy output for manual unsorted entities (#10673 ) * #10672: fixes displacy output for manual unsorted entities * #10672: removed unused import * fix prettier formatting Co-authored-by: Harm Buisman <h.buisman@iknl.nl> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-04-27 09:51:58 +02:00
Sofie Van Landeghem	b3717ba53a	removing print statements from the test suite (#10712 )	2022-04-27 09:14:25 +02:00
Adriane Boyd	455f089c9b	Support exclude in Doc.from_docs (#10689 ) * Support exclude in Doc.from_docs * Update API docs * Add new tag to docs	2022-04-25 18:19:03 +02:00
github-actions[bot]	e07500369c	Auto-format code with black (#10687 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-04-22 11:24:53 +02:00
Richard Hudson	4b227f4861	Merge pull request #10669 from mgrojo/develop Fix some issues in Spanish stop-word list and examples	2022-04-19 09:37:34 +02:00
mgr	3d50b1a989	Fix some issues in Spanish examples - Spelling: nationalities in lowercase, accent. - Incorrect verb composition - Untranslated word	2022-04-18 22:12:57 +02:00
mgr	2a2654c756	Remove significant or not very frequent words from stop word list [es] The list of stop words for Spanish contained many inadequate words, see: https://github.com/explosion/spaCy/issues/3052#issuecomment-1100760100 Removed words: - verb forms of 'trabajar' (work) and intentar (try) - words related to 'empleo' (employment) - incorrect words: ampleamos, arribaabajo, soyos, paìs - miscellaneous words due to being too significant of too infrequent: actualmente, aproximadamente, antaño, cosas, ejemplo, horas, general, pais, principalmente, raras Added other stop words for completion: - Spanish one-letter words - numbers up to twelve Some reformatting to 79 columns. When in doubt, the English and German lists have been consulted as good examples.	2022-04-18 22:04:02 +02:00
Madeesh Kannan	aa6780eb27	`Matcher`: Remove superfluous GIL-acquiring check in `get_is_final` (#10659 ) * `Matcher`: Remove superfluous GIL-acquiring check in `get_is_final` This check incurred a significant performance penalty due to implict interactions between the GIL and Cython ref-counting code. * `Matcher`: Inline `PatternStateC` accessors	2022-04-18 12:59:34 +02:00
Duy Ngo	229ecaf0ea	Add numbers and definitions (#10665 )	2022-04-18 12:58:32 +02:00
Joachim Fainberg	4e1716223c	displaCy: Avoid increasing levels for identical arcs (#10639 ) * Test for arc levels for identical arcs Also moves the test in order with the other numbered tests. * displaCy: filter identical arcs Avoid increased levels due to identical arcs by first filtering any identical arcs. * Sort keys before filtering Manual entry with keys out of order would previously become different tuples and therefore not filtered correctly. Co-authored-by: Joachim Fainberg <joachimfainberg@Joachims-MBP.lan>	2022-04-14 16:48:00 +02:00
fonfonx	028cbad05e	Add feminine form of word "one" in French (#10653 ) * Add French number * Add fonfonx.md * Add feminine ordinal words for French	2022-04-14 10:21:27 +02:00
single-fingal	4228f3c757	Fix a few minor bugs in the SpanGroup API web docs (#10650 ) * Fix a few minor bugs in the SpanGroup API web docs * Update SpanGroup docs examples to have Spans reflect intended "errors"	2022-04-14 09:59:48 +02:00
Richard Hudson	75fbbcdc18	Display warning when spacy.explain() finds no term (#10645 ) * Display warning when spacy.explain() finds no term * Updated warning message text	2022-04-12 10:48:28 +02:00
Madeesh Kannan	9ba3e1cb2f	Basic tests for the Tamil language (#10629 ) * Add basic tests for Tamil (ta) * Add comment Remove superfluous condition * Remove superfluous call to `pipe` Instantiate new tokenizer for special case	2022-04-07 14:47:37 +02:00
Lj Miranda	02dafa3a84	Add debug diff command in spaCy CLI (#10502 ) * Add initial design for diff command For now, the diffing process looks like this: - The default config is created based from some values in the user config (e.g. which pipeline components were used, the lang, etc.) - The user must supply manually if it was optimized for acc/efficiency and if pretraining was involved. * Make diff command structure similar to siblings * Include gpu as a user option for CLI * Make variables more explicit * Fix type declaration for optimize enum * Improve docstrings for diff CLI * Add debug-diff to website API docs * Switch position of configs so that user config is modded * Add markdown flag for debug diff This commit adds a --markdown (--md) flag that allows easier copy-pasting to Github issues. Please note that this commit is dependent on an unreleased version of wasabi (for the time being). For posterity, the related PR is found here: https://github.com/ines/wasabi/pull/20 * Bump version of wasabi to 0.9.1 So that we can use the add_symbols parameter. * Apply suggestions from code review Co-authored-by: Ines Montani <ines@ines.io> * Update docs based on code review suggestions Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Change command name from diff -> diff-config * Clarify when options are relevant or not * Rerun prettier on cli.md Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-04-07 10:48:45 +02:00
Joachim Fainberg	b91255a454	displacy: avoid overlapping arcs in manual mode (#10534 ) * Added test for overlapping arcs * Provide distinct levels to overlapping arcs * Update return type hint for get_levels * Improved formatting spacy/displacy/render.py Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Joachim Fainberg <joachimfainberg@Joachims-MacBook-Pro.local> Co-authored-by: Ines Montani <ines@ines.io>	2022-04-05 09:08:02 +02:00
Adriane Boyd	849bef2de6	Merge pull request #10596 from adrianeboyd/chore/v3.3.0.dev0 Set version to v3.3.0.dev0	2022-04-04 09:18:07 +02:00
Adriane Boyd	e422101e00	Temporarily skip tests that require models/compat	2022-04-01 11:09:28 +02:00
Adriane Boyd	ca54de27bb	Support more internal methods for SpanGroup (#10476 ) * Added new convenience cython functions to SpanGroup to avoid unnecessary allocation/deallocation of objects * Replaced sorting in has_overlap with C++ for efficiency. Also, added a test for has_overlap * Added a method to efficiently merge SpanGroups * Added __delitem__, __add__ and __iadd__. Also, allowed to pass span lists to merge function. Replaced extend() body with call to merge * Renamed merge to concat and added missing things to documentation * Added operator+ and operator += in the documentation * Added a test for Doc deallocation * Update spacy/tokens/span_group.pyx * Updated SpanGroup tests to use new span list comparison function rather than assert_span_list_equal, eliminating the need to have a separate assert_not_equal fnction * Fixed typos in SpanGroup documentation Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Minor changes requested by Sofie: rearranged import statements. Added new=3.2.1 tag to SpanGroup.__setitem__ documentation * SpanGroup: moved repetitive list index check/adjustment in a separate function * Turn off formatting that hurts readability spacy/tests/doc/test_span_group.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Remove formatting that hurts readability spacy/tests/doc/test_span_group.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Turn off formatting that hurts readability in spacy/tests/doc/test_span_group.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Support more internal methods for SpanGroup Add support for: * `__setitem__` * `__delitem__` * `__iadd__`: for `SpanGroup` or `Iterable[Span]` * `__add__`: for `SpanGroup` only Adapted from #9698 with the scope limited to the magic methods. * Use v3.3 as new version in docs * Add new tag to SpanGroup.copy in API docs * Remove duplicate import * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Remaining suggestions and formatting Co-authored-by: nrodnova <nrodnova@hotmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Natalia Rodnova <4512370+nrodnova@users.noreply.github.com>	2022-04-01 09:56:26 +02:00
Adriane Boyd	d56b1400d2	Set version to v3.3.0.dev0	2022-04-01 09:54:52 +02:00
Daniël de Kok	c90dd6f265	Alignment: use a simplified ragged type for performance (#10319 ) * Alignment: use a simplified ragged type for performance This introduces the AlignmentArray type, which is a simplified version of Ragged that performs better on the simple(r) indexing performed for alignment. * AlignmentArray: raise an error when using unsupported index * AlignmentArray: move error messages to Errors * AlignmentArray: remove simlified ... with simplifications * AlignmentArray: fix typo that broke a[n:n] indexing	2022-04-01 09:02:06 +02:00
Adriane Boyd	03762b4b92	Add spancat, trainable_lemmatizer to quickstart (#10524 ) * Add `SPACY` and `IS_SPACE` as default `tok2vec` features	2022-04-01 09:01:04 +02:00
Adriane Boyd	e3ccc1973b	Provide debug data info for floret vectors (#10592 )	2022-03-31 15:11:32 +02:00
Yunus Atahan	36d3af3013	Fixed typo in Turkish lang. (#10582 ) * added failing test case for the issue. * Fixed typo. * fixed typo in test. * added corrected typo word into test_tr_lex_attrs_capitals as param. Test passes. Also tried and confirmed that test is failing after fixing the typo in the test case I wrote. Deleted the test case for typo. Co-authored-by: Yunus Atahan <yunus.atahan@trmotor.local>	2022-03-30 13:16:08 +02:00
Adriane Boyd	f98b41c390	Add vector deduplication (#10551 ) * Add vector deduplication * Add `Vocab.deduplicate_vectors()` * Always run deduplication in `spacy init vectors` * Clean up a few vector-related error messages and docs examples * Always unique with numpy * Fix types	2022-03-30 08:54:23 +02:00
Adriane Boyd	85778dfcf4	Add edit tree lemmatizer (#10231 ) * Add edit tree lemmatizer Co-authored-by: Daniël de Kok <me@danieldk.eu> * Hide edit tree lemmatizer labels * Use relative imports * Switch to single quotes in error message * Type annotation fixes Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Reformat edit_tree_lemmatizer with black * EditTreeLemmatizer.predict: take Iterable Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Validate edit trees during deserialization This change also changes the serialized representation. Rather than mirroring the deep C structure, we use a simple flat union of the match and substitution node types. * Move edit_trees to _edit_tree_internals * Fix invalid edit tree format error message * edit_tree_lemmatizer: remove outdated TODO comment * Rename factory name to trainable_lemmatizer * Ignore type instead of casting truths to List[Union[Ints1d, Floats2d, List[int], List[str]]] for thinc v8.0.14 * Switch to Tagger.v2 * Add documentation for EditTreeLemmatizer * docs: Fix 3.2 -> 3.3 somewhere * trainable_lemmatizer documentation fixes * docs: EditTreeLemmatizer is in edit_tree_lemmatizer.py Co-authored-by: Daniël de Kok <me@danieldk.eu> Co-authored-by: Daniël de Kok <me@github.danieldk.eu> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-28 11:13:50 +02:00
github-actions[bot]	98ed941c39	Auto-format code with black (#10550 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-03-28 10:44:46 +02:00
Luka Dragar	53674bb745	Examples for Slovene (#10539 ) * Added examples for Slovene * Update spacy/lang/sl/examples.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Corrected a typo in one of the sentences Co-authored-by: Luka Dragar <D20124481@mytudublin.ie> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-03-28 10:44:10 +02:00
Adriane Boyd	3711af74e5	Add tokenizer option to allow Matcher handling for all rules (#10452 ) * Add tokenizer option to allow Matcher handling for all rules Add tokenizer option `with_faster_rules_heuristics` that determines whether the special cases applied by the internal `Matcher` are filtered by whether they contain affixes or space. If `True` (default), the rules are filtered to prioritize speed over rare edge cases. If `False`, all rules are included in the final `Matcher`-based pass over the doc. * Reset all caches when reloading special cases * Revert "Reset all caches when reloading special cases" This reverts commit `4ef6bd171d`. * Initialize max_length properly * Add new tag to API docs * Rename to faster heuristics	2022-03-24 13:21:32 +01:00
Adriane Boyd	31a5d99efa	Maintain support for empty DocBin span groups (#10538 )	2022-03-24 11:51:07 +01:00
Daniël de Kok	2ff197603e	matcher: remove an undefined behavior (#10537 ) Indexing into a zero-length std::vector is an undefined behavior.	2022-03-24 11:48:22 +01:00
Adriane Boyd	d85117f88c	Stream large assets on download (#10521 ) Stream large assets on download rather than reading the whole file at once and potentially running into `urllib3` limits on single read sizes.	2022-03-24 11:47:05 +01:00
Adriane Boyd	e908a67829	Handle unknown tags in KoreanTokenizer tag map (#10536 )	2022-03-24 11:25:36 +01:00
Adriane Boyd	c17980e535	Save vectors as little endian, load with Ops.asarray (#10201 ) * Save vectors as little endian, load with Ops.asarray * Always save vector data as little endian * Always run `Vectors.to_ops` when vector data is loaded so that `Ops.asarray` can be used to load the data correctly for the current ops. * Update spacy/vectors.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/vectors.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-21 14:24:46 +01:00
github-actions[bot]	bf1cf77a5b	Auto-format code with black (#10518 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-03-21 09:21:24 +01:00
Grey Murav	3ff5a6a5c0	Extend list of _num_words (#10468 )	2022-03-16 18:25:42 +01:00
Lj Miranda	a79cd3542b	Add displacy support for overlapping Spans (#10332 ) * Fix docstring for EntityRenderer * Add warning in displacy if doc.spans are empty * Implement parse_spans converter One notable change here is that the default spans_key is sc, and it's set by the user through the options. * Implement SpanRenderer Here, I implemented a SpanRenderer that looks similar to the EntityRenderer except for some templates. The spans_key, by default, is set to sc, but can be configured in the options (see parse_spans). The way I rendered these spans is per-token, i.e., I first check if each token (1) belongs to a given span type and (2) a starting token of a given span type. Once I have this information, I render them into the markup. * Fix mypy issues on typing * Add tests for displacy spans support * Update colors from RGB to hex Co-authored-by: Ines Montani <ines@ines.io> * Remove unnecessary CSS properties * Add documentation for website * Remove unnecesasry scripts * Update wording on the documentation Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Put typing dependency on top of file * Put back z-index so that spans overlap properly * Make warning more explicit for spans_key Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-16 18:14:34 +01:00
Daniël de Kok	e5debc68e4	Tagger: use unnormalized probabilities for inference (#10197 ) * Tagger: use unnormalized probabilities for inference Using unnormalized softmax avoids use of the relatively expensive exp function, which can significantly speed up non-transformer models (e.g. I got a speedup of 27% on a German tagging + parsing pipeline). * Add spacy.Tagger.v2 with configurable normalization Normalization of probabilities is disabled by default to improve performance. * Update documentation, models, and tests to spacy.Tagger.v2 * Move Tagger.v1 to spacy-legacy * docs/architectures: run prettier * Unnormalized softmax is now a Softmax_v2 option * Require thinc 8.0.14 and spacy-legacy 3.0.9	2022-03-15 14:15:31 +01:00
Adriane Boyd	0dc454ba95	Update docs for Vocab.get_vector (#10486 ) * Update docs for Vocab.get_vector * Clarify description of 0-vector dimensions	2022-03-15 09:10:47 +01:00
Edward	2eef47dd26	Save span candidates produced by spancat suggesters (#10413 ) * Add save_candidates attribute * Change spancat api * Add unit test * reimplement method to produce a list of doc * Add method to docs * Add new version tag * Add intended use to docstring * prettier formatting	2022-03-14 16:46:58 +01:00
Edward	b68bf43f5b	Add spans to doc.to_json (#10073 ) * Add spans to to_json * adjustments to_json * Change docstring * change doc key naming * Update spacy/tokens/doc.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-03-14 15:47:57 +01:00
github-actions[bot]	1bbf232074	Auto-format code with black (#10479 ) * Auto-format code with black * Update spacy/lang/hsb/lex_attrs.py Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-03-11 12:20:23 +01:00
Adriane Boyd	297dd82c86	Fix initial special cases for Tokenizer.explain (#10460 ) Add the missing initial check for special cases to `Tokenizer.explain` to align with `Tokenizer._tokenize_affixes`.	2022-03-11 10:50:47 +01:00
Adriane Boyd	191e8b31fa	Remove English tokenizer exception May. (#10463 )	2022-03-08 14:28:46 +01:00
jnphilipp	5ca0dbae76	Add Lower Sorbian support. (#10431 ) * Add support basic support for lower sorbian. * Add some test for dsb. * Update spacy/lang/dsb/examples.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-07 16:57:14 +01:00
Paul O'Leary McCann	61ba5450ff	Fix get_matching_ents (#10451 ) * Fix get_matching_ents Not sure what happened here - the code prior to this commit simply does not work. It's already covered by entity linker tests, which were succeeding in the NEL PR, but couldn't possibly succeed on master. * Fix test Test was indented inside another test and so doesn't seem to have been running properly.	2022-03-07 16:56:57 +01:00
jnphilipp	7ed7908716	Add Upper Sorbian support. (#10432 ) * Add support basic support for upper sorbian. * Add tokenizer exceptions and tests. * Update spacy/lang/hsb/examples.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-07 16:20:39 +01:00
Sofie Van Landeghem	d89dac4066	hook up meta in load_model_from_config (#10400 )	2022-03-04 11:07:45 +01:00
Paul O'Leary McCann	91acc3ea75	Fix entity linker batching (#9669 ) * Partial fix of entity linker batching * Add import * Better name * Add `use_gold_ents` option, docs * Change to v2, create stub v1, update docs etc. * Fix error type Honestly no idea what the right type to use here is. ConfigValidationError seems wrong. Maybe a NotImplementedError? * Make mypy happy * Add hacky fix for init issue * Add legacy pipeline entity linker * Fix references to class name * Add __init__.py for legacy * Attempted fix for loss issue * Remove placeholder V1 * formatting * slightly more interesting train data * Handle batches with no usable examples This adds a test for batches that have docs but not entities, and a check in the component that detects such cases and skips the update step as thought the batch were empty. * Remove todo about data verification Check for empty data was moved further up so this should be OK now - the case in question shouldn't be possible. * Fix gradient calculation The model doesn't know which entities are not in the kb, so it generates embeddings for the context of all of them. However, the loss does know which entities aren't in the kb, and it ignores them, as there's no sensible gradient. This has the issue that the gradient will not be calculated for some of the input embeddings, which causes a dimension mismatch in backprop. That should have caused a clear error, but with numpyops it was causing nans to happen, which is another problem that should be addressed separately. This commit changes the loss to give a zero gradient for entities not in the kb. * add failing test for v1 EL legacy architecture * Add nasty but simple working check for legacy arch * Clarify why init hack works the way it does * Clarify use_gold_ents use case * Fix use gold ents related handling * Add tests for no gold ents and fix other tests * Use aligned ents function (not working) This doesn't actually work because the "aligned" ents are gold-only. But if I have a different function that returns the intersection, then this will work as desired. * Use proper matching ent check This changes the process when gold ents are not used so that the intersection of ents in the pred and gold is used. * Move get_matching_ents to Example * Use model attribute to check for legacy arch * Rename flag * bump spacy-legacy to lower 3.0.9 Co-authored-by: svlandeg <svlandeg@github.com>	2022-03-04 09:17:36 +01:00
Adriane Boyd	8e93fa8507	Fix Vectors.n_keys for floret vectors (#10394 ) Fix `Vectors.n_keys` for floret vectors to match docstring description and avoid W007 warnings in similarity methods.	2022-03-01 09:21:25 +01:00
Sofie Van Landeghem	3f68bbcfec	Clean up loggers docs (#10351 ) * update docs to point to spacy-loggers docs * remove unused error code	2022-02-25 16:29:12 +01:00
github-actions[bot]	d637b34e2f	Auto-format code with black (#10377 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-02-25 10:00:21 +01:00
Adriane Boyd	b16da378bb	Re-remove universe tests from test suite (#10357 )	2022-02-23 21:08:56 +01:00
kadarakos	249b97184d	Bugfixes and test for rehearse (#10347 ) * fixing argument order for rehearse * rehearse test for ner and tagger * rehearse bugfix * added test for parser * test for multilabel textcat * rehearse fix * remove debug line * Update spacy/tests/training/test_rehearse.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/tests/training/test_rehearse.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Kádár Ákos <akos@onyx.uvt.nl> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-02-23 16:10:05 +01:00
Daniël de Kok	78a8bec4d0	Make core projectivization functions cdef nogil (#10241 ) * Make core projectivization methods cdef nogil While profiling the parser, I noticed that relatively a lot of time is spent in projectivization. This change rewrites the functions in the core loops as cdef nogil for efficiency. In C++-land, we use vector in place of Python lists and absent heads are represented as -1 in place of None. * _heads_to_c: add assertion Validation should be performed by the caller, but this assertion ensures that we are not reading/writing out of bounds with incorrect input.	2022-02-21 15:02:21 +01:00
Adriane Boyd	30030176ee	Update Korean defaults for Tokenizer (#10322 ) Update Korean defaults for `Tokenizer` for tokenization following UD Korean Kaist.	2022-02-21 10:26:19 +01:00
Adriane Boyd	f32ee2e533	Fix NER check in CoNLL-U converter (#10302 ) * Fix NER check in CoNLL-U converter Leave ents unset if no NER annotation is found in the MISC column. * Revert to global rather than per-sentence NER check * Update spacy/training/converters/conllu_to_docs.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-02-21 10:24:52 +01:00
Peter Baumgartner	3358fb9bdd	Miscellaneous Minor SpanGroups/DocBin Improvements (#10250 ) * MultiHashEmbed vector docs correction * doc copy span test * ignore empty lists in DocBin.span_groups * serialized empty list const + SpanGroups.is_empty * add conditional deserial on from_bytes * clean up + reorganize * rm test * add constant as class attribute * rename to _EMPTY_BYTES * Update spacy/tests/doc/test_span.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-02-21 10:24:15 +01:00
Adriane Boyd	f4c74764b8	Fix Tok2Vec for empty batches (#10324 ) * Add test for tok2vec with vectors and empty docs * Add shortcut for empty batch in Tok2Vec.predict * Avoid types	2022-02-21 10:22:36 +01:00
github-actions[bot]	6de84c8757	Auto-format code with black (#10333 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-02-21 09:15:42 +01:00
Adriane Boyd	28ba31e793	Add whitespace and combined augmenters (#10170 ) Add whitespace augmenter that inserts a single whitespace token into a doc containing annotation used in core trained pipelines. Add a combined augmenter that handles lowercasing, orth variants and whitespace augmentation.	2022-02-17 15:54:09 +01:00
Grey Murav	aa93b471a1	Extend list of stopwords for ru language (#10313 )	2022-02-17 15:51:15 +01:00
Grey Murav	23f06dc37f	Extend list of numbers for ru language (#10280 ) * Extended list of numbers for ru language Extended list of numbers with all forms and cases including short forms, slang variants and roman numerals. * Update lex_attrs.py * Update 'like_num' function with percentages Added support for numbers with percentages like 12%, 1.2% and etc. to the 'like_num' function. * black formatting Co-authored-by: thomashacker <EdwardSchmuhl@web.de>	2022-02-17 15:50:08 +01:00
Grey Murav	a9756963e6	Extend list of abbreviations for ru language (#10282 ) * Extend list of abbreviations for ru language Extended list of abbreviations for ru language those may have influence on tokenization. * black formatting Co-authored-by: thomashacker <EdwardSchmuhl@web.de>	2022-02-17 15:48:50 +01:00
Adriane Boyd	da7520a83c	Delay loading of mecab in Korean tokenizer (#10295 ) * Delay loading of mecab in Korean tokenizer Delay loading of mecab until the tokenizer is called the first time so that it's possible to initialize a blank `ko` pipeline without having mecab installed, e.g. for use with `spacy init vectors`. * Move mecab import back to __init__ Move mecab import back to __init__ to warn users at the same point as before for missing python dependencies.	2022-02-17 11:35:34 +01:00
Sofie Van Landeghem	a16b14e591	Merge branch 'master' into copy/develop	2022-02-16 14:04:59 +01:00
github-actions[bot]	5adedb8587	Auto-format code with black (#10260 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-02-11 14:23:01 +01:00
Adriane Boyd	bbaf41fb3b	Set version to v3.2.2 (#10262 )	2022-02-11 11:45:26 +01:00
Edward	7961a0a959	Fix typo in errors (#10256 )	2022-02-10 13:45:46 +01:00
Peter Baumgartner	ee662ec381	Raise error in spacy package when model name is not a valid python identifier (#10192 ) * MultiHashEmbed vector docs correction * raise error for invalid identifier as model name * more succinct error message * update success message * permitted package name + double underscore * clarify package name error * clarify underscore run message * tweak language + simplify underscore run * cleanup underscore run warning * spacing correction * Update spacy/tests/test_cli.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-02-10 08:15:23 +01:00
Ramon Ziai	6477dafac2	fix(phrasematcher.pyi): change type annotation of `docs` in `add()` to `List[Doc]` (#10235 ) https://github.com/explosion/spaCy/issues/10234	2022-02-08 13:37:27 +01:00
Adriane Boyd	a9ee5bff98	Support mixed case model package names (#10223 )	2022-02-08 10:52:46 +01:00
Antti Ajanki	e9c26f2ee9	Add a noun chunker for Finnish (#10214 ) with test cases	2022-02-08 08:44:11 +01:00
Sofie Van Landeghem	deb143fa70	Token sent attributes more consistent (#10164 ) * remove duplicate line * add sent start/end token attributes to the docs * let has_annotation work with IS_SENT_END * elif instead of if * add has_annotation test for sent attributes * fix typo * remove duplicate is_sent_start entry in docs	2022-02-08 08:35:37 +01:00
Lj Miranda	42072f4468	Add spancat pipeline in spacy debug data (#10070 ) * Setup debug data for spancat * Add check for missing labels * Add low-level data warning error * Improve logic when compiling the gold train data * Implement check for negative examples * Remove breakpoint * Remove ws_ents and missing entity checks * Fix mypy errors * Make variable name spans_key consistent * Rename pipeline -> component for consistency * Account for missing labels per spans_key * Cleanup variable names for consistency * Improve brevity of conditional statements * Remove unused variables * Include spans_key as an argument for _get_examples * Add a conditional check for spans_key * Update spancat debug data based on new API - Instead of using _get_labels_from_model(), I'm now using _get_labels_from_spancat() (cf. https://github.com/explosion/spaCy/pull10079) - The way information is displayed was also changed (text -> table) * Rename model_labels to ensure mypy works * Update wording on warning messages Use "span type" instead of "entity type" in wording the warning messages. This is because Spans aren't necessarily entities. * Update component type into a Literal This is to make it clear that the component parameter should only accept either 'spancat' or 'ner'. * Update checks to include actual model span_keys Instead of looking at everything in the data, we only check those span_keys from the actual spancat component. Instead of doing the filter inside the for-loop, I just made another dictionary, data_labels_in_component to hold this value. * Update spacy/cli/debug_data.py * Show label counts only when verbose is True Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-02-07 15:03:36 +01:00
Adriane Boyd	63e1e4e8f6	Fix debug data check for ents that cross sents (#10188 ) * Fix debug data check for ents that cross sents * Use aligned sent starts to have the same indices for the NER and sent start annotation * Add a temporary, insufficient hack for the case where a sentence-initial reference token is split into multiple tokens in the predicted doc, since `Example.get_aligned("SENT_START")` currently aligns `True` to all the split tokens. * Improve test example * Use Example.get_aligned_sent_starts * Add test for crossing entity	2022-02-07 08:53:30 +01:00
github-actions[bot]	91ccacea12	Auto-format code with black (#10209 ) * Auto-format code with black * add black requirement to dev dependencies and pin to 22.x * ignore black dependency for comparison with setup.cfg Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com> Co-authored-by: svlandeg <svlandeg@github.com>	2022-02-06 16:30:30 +01:00
Sofie Van Landeghem	bc12ecb870	Merge pull request #10185 from martinjack/master Update Ukrainian tokenizer_exceptions	2022-02-06 16:30:03 +01:00
Sofie Van Landeghem	14513f82da	Merge pull request #10215 from explosion/master update develop	2022-02-06 13:45:41 +01:00
Adriane Boyd	0668a449ba	Add Pipe.hide_labels to omit labels from pipeline meta (#10175 )	2022-02-05 17:59:24 +01:00
Adriane Boyd	6f551043e4	Use paths.vectors for vectors in init config (#10146 ) So that overriding `paths.vectors` works consistently in generated configs, set vectors model in `paths.vectors` and always refer to this path in `initialize.vectors`.	2022-02-04 21:09:48 +01:00
Adriane Boyd	fef896ce49	Allow Example to align whitespace annotation (#10189 ) Remove exception for whitespace tokens in `Example.get_aligned` so that annotation on whitespace tokens is aligned in the same way as for non-whitespace tokens.	2022-02-03 17:01:53 +01:00
Evgen Kytonin	fc3d446c71	Update Ukrainian tokenizer_exceptions	2022-02-01 13:24:00 +02:00
Lj Miranda	345e7f6bc4	Clarify Span.ents documentation (#10154 ) * Clarify Span.ents documentation Ref: #10135 Retain current behaviour. Span.ents will only include entities within said span. You can't get tokens outside of the original span. * Reword docstrings Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update API docs in the website Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-01-31 08:41:42 +01:00
Marek Šuppa	f09c799a96	fix: Add missing comma to `_eleven_to_beyond` (#10166 ) * This comma has been most probably been left out unintentionally, leading to string concatenation between the two consecutive lines. This issue has been found automatically using a regular expression.	2022-01-30 16:45:06 +09:00
Marek Šuppa	67ecac633f	fix: Add missing comma to `examples.py` (#10167 ) * This comma has been most probably been left out unintentionally, leading to string concatenation between the two consecutive lines. This issue has been found automatically using a regular expression.	2022-01-30 16:43:29 +09:00
Adriane Boyd	4f441dfa24	Fix infix as prefix in Tokenizer.explain (#10140 ) * Fix infix as prefix in Tokenizer.explain Update `Tokenizer.explain` to align with the `Tokenizer` algorithm: * skip infix matches that are prefixes in the current substring * Update tokenizer pseudocode in docs	2022-01-28 17:00:54 +01:00
Eduard Zorita	30cf9d6a05	Update typing hints (#10109 ) * Improve typing hints for Matcher.__call__ * Add typing hints for DependencyMatcher * Add typing hints to underscore extensions * Update Doc.tensor type (requires numpy 1.21) * Fix typing hints for Language.component decorator * Use generic np.ndarray type in Doc to avoid numpy version update * Fix mypy errors * Fix cyclic import caused by Underscore typing hints * Use Literal type from spacy.compat * Update matcher.pyi import format Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-01-28 16:59:54 +01:00
Adriane Boyd	09734c56fc	Use simple suggester for spancat initialization (#10143 ) Instead of the running the actual suggester, which may require annotation from annotating components that is not necessarily present in the reference docs, use the built-in 1-gram suggester.	2022-01-28 09:34:23 +01:00
github-actions[bot]	6d4db5c3c7	Auto-format code with black (#10106 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-01-21 10:01:10 +01:00
pepemedigu	2abd380f2d	Update lex_attrs.py for Spanish with ordinals (#10038 ) * Update lex_attrs.py Add ordinal words * black formatting Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-01-20 15:44:13 +01:00
Sofie Van Landeghem	4465fe0306	Merge branch 'develop' into feature/master_copy	2022-01-20 13:36:17 +01:00
Duygu Altinok	47a2916801	Intify IOB (#9738 ) * added iob to int * added tests * added iob strings * added error * blacked attrs * Update spacy/tests/lang/test_attrs.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/attrs.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * added iob strings as global * minor refinement with iob * removed iob strings from token * changed to uppercase * cleaned and went back to master version * imported iob from attrs * Update and format errors * Support and test both str and int ENT_IOB key Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-01-20 13:19:38 +01:00
Duygu Altinok	268ddf8a06	Add ENT_IOB key to Matcher (#9649 ) * added new field * added exception for IOb strings * minor refinement to schema * removed field * fixed typo * imported numeriacla val * changed the code bit * cosmetics * added test for matcher * set ents of moc docs * added invalid pattern * minor update to documentation * blacked matcher * added pattern validation * add IOB vals to schema * changed into test * mypy compat * cleaned left over * added compat import * changed type * added compat import * changed literal a bit * went back to old * made explicit type * Update spacy/schemas.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/schemas.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/schemas.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-01-20 13:18:39 +01:00
Daniël de Kok	6984f55277	Merge pull request #10048 from danieldk/index-arcs-by-head Use constant-time head lookups in StateC::{L,R}	2022-01-20 13:06:14 +01:00
Paul O'Leary McCann	32bd3856b3	Rename FACILITY to FAC in color list (#10067 ) This matches the English models	2022-01-20 12:00:28 +01:00
Adriane Boyd	a55212fca0	Determine labels by factory name in debug data (#10079 ) * Determine labels by factory name in debug data For all components, return labels for all components with the corresponding factory name rather than for only the default name. For `spancat`, return labels as a dict keyed by `spans_key`. * Refactor for typing * Add test * Use assert instead of cast, removed unneeded arg * Mark test as slow	2022-01-20 11:42:52 +01:00
Richard Hudson	e9c6314539	Bugfix for similarity return types (#10051 )	2022-01-20 11:40:46 +01:00
Daniël de Kok	50d2a2c930	User fewer Vector internals (#9879 ) * Use Vectors.shape rather than Vectors.data.shape * Use Vectors.size rather than Vectors.data.size * Add Vectors.to_ops to move data between different ops * Add documentation for Vector.to_ops	2022-01-18 17:14:35 +01:00
Adriane Boyd	4dfd559e55	Fix spaces in Doc.from_docs for empty docs (#10052 ) Fix spaces in `Doc.from_docs(ensure_whitespace=True)` for cases where an doc ending in whitespace is followed by an empty doc.	2022-01-18 17:12:42 +01:00
Paul O'Leary McCann	c28e33637b	Mark flaky spancat test so it doesn't fail the build (#10075 ) * Mark flaky spancat test so it doesn't fail the build * Skip, don't run and ignore	2022-01-18 09:36:28 +01:00
Natalia Rodnova	47ea6704f1	Span richcmp fix (#9956 ) * Corrected Span's __richcmp__ implementation to take end, label and kb_id in consideration * Updated test * Updated test * Removed formatting from a test for readability sake * Use same tuples for all comparisons Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-01-17 11:17:49 +01:00
Adriane Boyd	add52935ff	Revert "Bump sudachipy version (#9917 )" (#10071 ) This reverts commit `58bdd8607b`.	2022-01-17 10:38:37 +01:00
Paul O'Leary McCann	58bdd8607b	Bump sudachipy version (#9917 ) * Edited Slovenian stop words list (#9707) * Noun chunks for Italian (#9662) * added it vocab * copied portuguese * added possessive determiner * added conjed Nps * added nmoded Nps * test misc * more examples * fixed typo * fixed parenth * fixed comma * comma fix * added syntax iters * fix some index problems * fixed index * corrected heads for test case * fixed tets case * fixed determiner gender * cleaned left over * added example with apostophe * French NP review (#9667) * adapted from pt * added basic tests * added fr vocab * fixed noun chunks * more examples * typo fix * changed naming * changed the naming * typo fix * Add Japanese kana characters to default exceptions (fix #9693) (#9742) This includes the main kana, or phonetic characters, used in Japanese. There are some supplemental kana blocks in Unicode outside the BMP that could also be included, but because their actual use is rare I omitted them for now, but maybe they should be added. The omitted blocks are: - Kana Supplement - Kana Extended (A and B) - Small Kana Extension * Remove NER words from stop words in Norwegian (#9820) Default stop words in Norwegian bokmål (nb) in Spacy contain important entities, e.g. France, Germany, Russia, Sweden and USA, police district, important units of time, e.g. months and days of the week, and organisations. Nobody expects their presence among the default stop words. There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data. See explanation in https://github.com/explosion/spaCy/issues/3052#issuecomment-986756711 and comment https://github.com/explosion/spaCy/issues/3052#issuecomment-986951831 * Bump sudachipy version * Update sudachipy versions * Bump versions Bumping to the most recent dictionary just to keep thing current. Bumping sudachipy to 5.2 because older versions don't support recent dictionaries. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Richard Hudson <richard@explosion.ai> Co-authored-by: Duygu Altinok <duygu@explosion.ai> Co-authored-by: Haakon Meland Eriksen <haakon.eriksen@far.no>	2022-01-17 08:16:22 +01:00
Daniël de Kok	63fa55089d	Use constant-time head lookups in StateC::{L,R} This change changes the type of left/right-arc collections from vector[ArcC] to unordered_map[int, vector[Arc]], so that the arcs are keyed by the head. This allows us to find all the left/right arcs for a particular head in constant time in StateC::{L,R}. Benchmarks with long docs (N is the number of text repetitions): Before (using #10019): N Time (s) 400 3.2 800 5.0 1600 9.5 3200 23.2 6400 66.8 12800 220.0 After (this commit): N Time (s) 400 3.1 800 4.3 1600 6.7 3200 12.0 6400 22.0 12800 42.0 Related to #9858 and #10019.	2022-01-13 12:08:46 +01:00
Daniël de Kok	677c1a3507	Speed up the StateC::L feature function (#10019 ) * Speed up the StateC::L feature function This function gets the n-th most-recent left-arc with a particular head. Before this change, StateC::L would construct a vector of all left-arcs with the given head and then pick the n-th most recent from that vector. Since the number of left-arcs strongly correlates with the doc length and the feature is constructed for every transition, this can make transition-parsing quadratic. With this change StateC::L: - Searches left-arcs backwards. - Stops early when the n-th matching transition is found. - Does not construct a vector (reducing memory pressure). This change doesn't avoid the linear search when the transition that is queried does not occur in the left-arcs. Regardless, performance is improved quite a bit with very long docs: Before: N Time 400 3.3 800 5.4 1600 11.6 3200 30.7 After: N Time 400 3.2 800 5.0 1600 9.5 3200 23.2 We can probably do better with more tailored data structures, but I first wanted to make a low-impact PR. Found while investigating #9858. * StateC::L: simplify loop	2022-01-13 09:29:58 +01:00
Daniël de Kok	28299644fc	Speed up the StateC::L feature function (#10019 ) * Speed up the StateC::L feature function This function gets the n-th most-recent left-arc with a particular head. Before this change, StateC::L would construct a vector of all left-arcs with the given head and then pick the n-th most recent from that vector. Since the number of left-arcs strongly correlates with the doc length and the feature is constructed for every transition, this can make transition-parsing quadratic. With this change StateC::L: - Searches left-arcs backwards. - Stops early when the n-th matching transition is found. - Does not construct a vector (reducing memory pressure). This change doesn't avoid the linear search when the transition that is queried does not occur in the left-arcs. Regardless, performance is improved quite a bit with very long docs: Before: N Time 400 3.3 800 5.4 1600 11.6 3200 30.7 After: N Time 400 3.2 800 5.0 1600 9.5 3200 23.2 We can probably do better with more tailored data structures, but I first wanted to make a low-impact PR. Found while investigating #9858. * StateC::L: simplify loop	2022-01-13 09:03:55 +01:00
jsnfly	176a90edee	Fix texcat loss scaling (#9904 ) (#10002 ) * add failing test for issue 9904 * remove division by batch size and summation before applying the mean Co-authored-by: jonas <jsnfly@gmx.de>	2022-01-13 09:03:23 +01:00
Sofie Van Landeghem	d8a3012539	Merge pull request #10037 from explosion/master Update develop with master	2022-01-12 12:29:23 +01:00
Ryn Daniels	057b8c64c0	Check for assets with size of 0 bytes (#10026 ) * Check for assets with size of 0 bytes * Update spacy/cli/project/assets.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-01-12 10:34:23 +01:00
Sofie Van Landeghem	067a44a417	Merge pull request #9987 from explosion/master Update develop with commits from master	2022-01-05 11:49:50 +01:00
Lj Miranda	00e7bf5ffd	Add a few docs to the default_config.cfg (#9981 ) * Clarify patience hyperparameter The current value for patience doesn't seem to indicate that it's pointing to the number of steps. It may be useful to specify that explicitly. Ref: https://github.com/explosion/spaCy/discussions/7450 Ref: https://github.com/explosion/spaCy/discussions/7465 * Update docs for max_steps	2022-01-05 09:16:40 +01:00
Duygu Altinok	55cf492218	Feat/debug data warn spread ents (#9960 ) * added check for crossing boundaries * formatted blacked * Rephrasing slightly Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-01-04 18:22:10 +01:00
Sofie Van Landeghem	56dcb39fb7	Fix references to config file in the docs & UX (#9961 ) * doc fixes around config file * fix typo * clarify default	2022-01-04 14:31:26 +01:00
Sofie Van Landeghem	029a48e340	fix type of lexeme.rank (#9979 )	2022-01-04 13:15:25 +01:00
Florian Cäsar	86e71e7b19	Fix Scorer.score_cats for missing labels (#9443 ) * Fix Scorer.score_cats for missing labels * Add test case for Scorer.score_cats missing labels * semantic nitpick * black formatting * adjust test to give different results depending on multi_label setting * fix loss function according to whether or not missing values are supported * add note to docs * small fixes * make mypy happy * Update spacy/pipeline/textcat.py Co-authored-by: Florian Cäsar <florian.caesar@pm.me> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <svlandeg@github.com>	2021-12-29 11:04:39 +01:00
Sofie Van Landeghem	b8106e0f95	Merge pull request #9951 from explosion/master Update develop branch with master	2021-12-29 10:11:43 +01:00
Peter Baumgartner	72abf9e102	MultiHashEmbed vector docs correction (#9918 )	2021-12-27 11:18:08 +01:00
Duygu Altinok	7ec1452f5f	added ellided forms (#9878 ) * added ellided forms * rearranged a bit * rearranged a bit * added stopword tests * blacked tests file	2021-12-23 13:41:01 +01:00
Andrew Janco	3cfeb518ee	Handle "_" value for token pos in conllu data (#9903 ) * change '_' to '' to allow Token.pos, when no value for token pos in conllu data * Minor code style Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-12-21 15:46:33 +01:00
Adriane Boyd	837d241b68	Make floret murmurhash endian-neutral (#9735 )	2021-12-20 17:11:31 +01:00
Sofie Van Landeghem	7847839003	Merge pull request #9891 from explosion/master Update develop with master	2021-12-17 14:01:27 +01:00
Adriane Boyd	94fbd88521	Use dict.copy().items() instead of list(.items()) (#9868 )	2021-12-16 09:17:33 +01:00
antonpibm	ac45ae3779	Update Tokenizer documentation to reflect token_match and url_match signatures (#9859 )	2021-12-15 09:34:33 +01:00
Adriane Boyd	800737b416	Set version to v3.2.1 (#9823 )	2021-12-07 10:51:45 +01:00
Haakon Meland Eriksen	251119455d	Remove NER words from stop words in Norwegian (#9820 ) Default stop words in Norwegian bokmål (nb) in Spacy contain important entities, e.g. France, Germany, Russia, Sweden and USA, police district, important units of time, e.g. months and days of the week, and organisations. Nobody expects their presence among the default stop words. There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data. See explanation in https://github.com/explosion/spaCy/issues/3052#issuecomment-986756711 and comment https://github.com/explosion/spaCy/issues/3052#issuecomment-986951831	2021-12-07 09:45:10 +01:00
Adriane Boyd	a0cdc2b007	Use Language.pipe in evaluate (#9800 )	2021-12-06 20:39:15 +01:00
Adriane Boyd	9964243eb2	Make the Tagger neg_prefix configurable (#9802 )	2021-12-06 18:04:44 +01:00
Duygu Altinok	b56b9e7f31	Entity ruler remove pattern (#9685 ) * added ruler coe * added error for none existing pattern * changed error to warning * changed error to warning * added basic tests * fixed place * added test files * went back to error * went back to pattern error * minor change to docs * changed style * changed doc * changed error slightly * added remove to phrasem api * error key already existed * phrase matcher match code to api * blacked tests * moved comments before expr * corrected error no * Update website/docs/api/entityruler.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/entityruler.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-12-06 15:32:49 +01:00
Natalia Rodnova	472740d613	Added sents property to Span for Spans spanning over several sentences (#9699 ) * Added sents property to Span class that returns a generator of sentences the Span belongs to * Added description to Span.sents property * Update test_span to clarify the difference between span.sent and span.sents Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/tests/doc/test_span.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix documentation typos in spacy/tokens/span.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update Span.sents doc string in spacy/tokens/span.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Parametrized test_span_spans * Corrected Span.sents to check for span-level hook first. Also, made Span.sent respect doc-level sents hook if no span-level hook is provided * Corrected Span ocumentation copy/paste issue * Put back accidentally deleted lines * Fixed formatting in span.pyx * Moved check for SENT_START annotation after user hooks in Span.sents * add version where the property was introduced Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-12-06 09:58:01 +01:00
Lj Miranda	7d50804644	Migrate regression tests into the main test suite (#9655 ) * Migrate regressions 1-1000 * Move serialize test to correct file * Remove tests that won't work in v3 * Migrate regressions 1000-1500 Removed regression test 1250 because v3 doesn't support the old LEX scheme anymore. * Add missing imports in serializer tests * Migrate tests 1500-2000 * Migrate regressions from 2000-2500 * Migrate regressions from 2501-3000 * Migrate regressions from 3000-3501 * Migrate regressions from 3501-4000 * Migrate regressions from 4001-4500 * Migrate regressions from 4501-5000 * Migrate regressions from 5001-5501 * Migrate regressions from 5501 to 7000 * Migrate regressions from 7001 to 8000 * Migrate remaining regression tests * Fixing missing imports * Update docs with new system [ci skip] * Update CONTRIBUTING.md - Fix formatting - Update wording * Remove lemmatizer tests in el lang * Move a few tests into the general tokenizer * Separate Doc and DocBin tests	2021-12-04 20:34:48 +01:00
Paul O'Leary McCann	b4d526c357	Add Japanese kana characters to default exceptions (fix #9693 ) (#9742 ) This includes the main kana, or phonetic characters, used in Japanese. There are some supplemental kana blocks in Unicode outside the BMP that could also be included, but because their actual use is rare I omitted them for now, but maybe they should be added. The omitted blocks are: - Kana Supplement - Kana Extended (A and B) - Small Kana Extension	2021-11-30 23:36:39 +01:00
Sofie Van Landeghem	58e29776bd	Merge pull request #9777 from explosion/master Update develop with master	2021-11-30 14:01:23 +01:00
Duygu Altinok	29f28d1f3e	French NP review (#9667 ) * adapted from pt * added basic tests * added fr vocab * fixed noun chunks * more examples * typo fix * changed naming * changed the naming * typo fix	2021-11-30 12:19:07 +01:00
Daniël de Kok	72f7f4e68a	morphologizer: avoid recreating label tuple for each token (#9764 ) * morphologizer: avoid recreating label tuple for each token The `labels` property converts the dictionary key set to a tuple. This property was used for every annotated token, recreating the tuple over and over again. Construct the tuple once in the set_annotations function and reuse it. On a Finnish pipeline that I was experimenting with, this results in a speedup of ~15% (~13000 -> ~15000 WPS). * tagger: avoid recreating label tuple for each token	2021-11-30 11:58:59 +01:00
Narayan Acharya	1be8a4dab3	Displacy serve entity linking support without `manual=True` support. (#9748 ) * Add support for kb_id to be displayed via displacy.serve. The current support is only limited to the manual option in displacy.render * Commit to check pre-commit hooks are run. * Update spacy/displacy/__init__.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Changes as per suggestions on the PR. * Update website/docs/api/top-level.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/top-level.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * tag option as new from 3.2.1 onwards Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2021-11-29 17:13:26 +01:00
Paul O'Leary McCann	ac05de2c6c	Fix Language-specific factory handling in package command (#9674 ) * Use internal names for factories If a component factory is registered like `@French.factory(...)` instead of `@Language.factory(...)`, the name in the factories registry will be prefixed with the language code. However in the nlp.config object the factory will be listed without the language code. The `add_pipe` code has fallback logic to handle this, but packaging code and the registry itself don't. This change makes it so that the factory name in nlp.config is the language-specific form. It's not clear if this will break anything else, but it does seem to fix the inconsistency and resolve the specific user issue that brought this to our attention. * Change approach to use fallback in package lookup This adds fallback logic to the package lookup, so it doesn't have to touch the way the config is built. It seems to fix the tests too. * Remove unecessary line * Add test Thsi also adds an assert that seems to have been forgotten.	2021-11-29 08:31:02 +01:00
Richard Hudson	7b134b8fbd	New tests for a number of alpha languages (#9703 ) * Added Slovak * Added Slovenian tests * Added Estonian tests * Added Croatian tests * Added Latvian tests * Added Icelandic tests * Added Afrikaans tests * Added language-independent tests * Added Kannada tests * Tidied up * Added Albanian tests * Formatted with black * Added failing tests for anomalies * Update spacy/tests/lang/af/test_text.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Added context to failing Estonian tokenizer test Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Added context to failing Croatian tokenizer test Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Added context to failing Icelandic tokenizer test Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Added context to failing Latvian tokenizer test Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Added context to failing Slovak tokenizer test Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Added context to failing Slovenian tokenizer test Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-11-28 21:59:23 +01:00
Natalia Rodnova	a4c43e5c57	Allow Matcher to match on ENT_ID and ENT_KB_ID (#9688 ) * Added ENT_ID and ENT_KB_ID into the list of the attributes that Matcher matches on * Added ENT_ID and ENT_KB_ID to TEST_PATTERNS in test_pattern_validation.py. Disabled tests that I added before * Update website/docs/api/matcher.md * Format * Remove skipped tests Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-11-24 10:37:10 +01:00
Duygu Altinok	25bd9f9d48	Noun chunks for Italian (#9662 ) * added it vocab * copied portuguese * added possessive determiner * added conjed Nps * added nmoded Nps * test misc * more examples * fixed typo * fixed parenth * fixed comma * comma fix * added syntax iters * fix some index problems * fixed index * corrected heads for test case * fixed tets case * fixed determiner gender * cleaned left over * added example with apostophe	2021-11-23 16:29:25 +01:00
Duygu Altinok	a7d7e80adb	EntityRuler improve disk load error message (#9658 ) * added error string * added serialization test * added more to if statements * wrote file to tempdir * added tempdir * changed parameter a bit * Update spacy/tests/pipeline/test_entity_ruler.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-11-23 16:26:05 +01:00
Adriane Boyd	9ac6d4991e	Add doc_cleaner component (#9659 ) * Add doc_cleaner component * Fix types * Fix loop * Rephrase method description	2021-11-23 15:33:33 +01:00
Adriane Boyd	a77f50baa4	Allow Scorer.score_spans to handle pred docs with missing annotation (#9701 ) If the predicted docs are missing annotation according to `has_annotation`, treat the docs as having no predictions rather than raising errors when the annotation is missing. The motivation for this is a combined tokenization+sents scorer for a component where the sents annotation is optional. To provide a single scorer in the component factory, it needs to be possible for the scorer to continue despite missing sents annotation in the case where the component is not annotating sents.	2021-11-23 15:17:19 +01:00
Adriane Boyd	36c7047946	Use reference parse to initialize parser moves (#9722 )	2021-11-23 14:55:55 +01:00
Richard Hudson	a1f25412da	Edited Slovenian stop words list (#9707 )	2021-11-22 09:46:34 +01:00
Adriane Boyd	0e93b315f3	Convert labels to strings for README in package CLI (#9694 )	2021-11-19 08:51:46 +01:00
Adriane Boyd	ea450d652c	Exclude strings from v3.2+ source vector checks (#9697 ) Exclude strings from `Vector.to_bytes()` comparions for v3.2+ `Vectors` that now include the string store so that the source vector comparison is only comparing the vectors and not the strings.	2021-11-19 08:51:19 +01:00
Paul O'Leary McCann	f3981bd0c8	Clarify how to fill in init_tok2vec after pretraining (#9639 ) * Clarify how to fill in init_tok2vec after pretraining * Ignore init_tok2vec arg in pretraining * Update docs, config setting * Remove obsolete note about not filling init_tok2vec early This seems to have also caught some lines that needed cleanup.	2021-11-18 15:38:30 +01:00
Adriane Boyd	c9baf9d196	Fix spancat for empty docs and zero suggestions (#9654 ) * Fix spancat for empty docs and zero suggestions * Use ops.xp.zeros in test	2021-11-15 12:40:55 +01:00
github-actions[bot]	67d8c8a081	Auto-format code with black (#9664 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-11-12 10:00:03 +01:00
Sofie Van Landeghem	24cdd4c88e	Merge pull request #9638 from polm/fix/optional-pretrain-path Make Jsonl Corpus reader path optional again	2021-11-09 10:45:14 +01:00
Paul O'Leary McCann	8aa2d32ca9	Update jsonlcorpus constructor types	2021-11-09 16:20:19 +09:00
Paul O'Leary McCann	71fb00ed95	Update spacy/training/corpus.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-11-08 10:02:29 +00:00
Sofie Van Landeghem	c97f29c593	Merge pull request #9629 from ljvmiranda921/chore/migrate-regressions Migrate regression and other tests to the new pytest marker	2021-11-08 09:07:38 +01:00
Paul O'Leary McCann	141f12b92e	Make Jsonl Corpus reader optional again	2021-11-07 18:56:23 +09:00
Lj Miranda	909177589d	Remove utility script	2021-11-06 06:35:58 +08:00

... 2 3 4 5 6 ...

9241 Commits