spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-01-08 01:31:19 +03:00

Author	SHA1	Message	Date
Luca Dorigo	0a92d5644e	Fix StringStore.__getitem__ return type depending on parameter types (#10741 ) * Fix StringStore.__getitem__ return type depending on parameter types Small fix using `@overload` so that `StringStore.__getitem__` returns an `int` when given a `str` or `bytes` and a `str` when given an `int`. * Update spacy/strings.pyi Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-05-03 17:57:07 +02:00
Raphael Mitsch	f5390e278a	Refactor error messages to remove hardcoded strings (#10729 ) * Use custom error msg instead of hardcoded string: replaced remaining hardcoded error message strings. * Use custom error msg instead of hardcoded string: fixing faulty Errors import.	2022-05-02 13:38:46 +02:00
Madeesh Kannan	0a503ce5e0	Remove vestigial debug print statement in `walk_head_nodes` (#10718 ) * `graph`: Remove vestigial debug print statement in `walk_head_nodes` * Revert whitespace changes * Remove more debug print statements	2022-05-02 13:36:35 +02:00
Adriane Boyd	10377fb945	Set version to v3.3.0 (#10614 ) * Set version to v3.3.0 * Revert "Temporarily skip tests that require models/compat" This reverts commit `e422101e00`.	2022-04-28 13:07:49 +02:00
harmbuisman	c066fb8a4e	#10672 : fixes displacy output for manual unsorted entities (#10673 ) * #10672: fixes displacy output for manual unsorted entities * #10672: removed unused import * fix prettier formatting Co-authored-by: Harm Buisman <h.buisman@iknl.nl> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-04-27 09:51:58 +02:00
Sofie Van Landeghem	b3717ba53a	removing print statements from the test suite (#10712 )	2022-04-27 09:14:25 +02:00
Adriane Boyd	455f089c9b	Support exclude in Doc.from_docs (#10689 ) * Support exclude in Doc.from_docs * Update API docs * Add new tag to docs	2022-04-25 18:19:03 +02:00
github-actions[bot]	e07500369c	Auto-format code with black (#10687 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-04-22 11:24:53 +02:00
Richard Hudson	4b227f4861	Merge pull request #10669 from mgrojo/develop Fix some issues in Spanish stop-word list and examples	2022-04-19 09:37:34 +02:00
mgr	3d50b1a989	Fix some issues in Spanish examples - Spelling: nationalities in lowercase, accent. - Incorrect verb composition - Untranslated word	2022-04-18 22:12:57 +02:00
mgr	2a2654c756	Remove significant or not very frequent words from stop word list [es] The list of stop words for Spanish contained many inadequate words, see: https://github.com/explosion/spaCy/issues/3052#issuecomment-1100760100 Removed words: - verb forms of 'trabajar' (work) and intentar (try) - words related to 'empleo' (employment) - incorrect words: ampleamos, arribaabajo, soyos, paìs - miscellaneous words due to being too significant or too infrequent: actualmente, aproximadamente, antaño, cosas, ejemplo, horas, general, pais, principalmente, raras Added other stop words for completion: - Spanish one-letter words - numbers up to twelve Some reformatting to 79 columns. When in doubt, the English and German lists have been consulted as good examples.	2022-04-18 22:04:02 +02:00
Madeesh Kannan	aa6780eb27	`Matcher`: Remove superfluous GIL-acquiring check in `get_is_final` (#10659 ) * `Matcher`: Remove superfluous GIL-acquiring check in `get_is_final` This check incurred a significant performance penalty due to implicit interactions between the GIL and Cython ref-counting code. * `Matcher`: Inline `PatternStateC` accessors	2022-04-18 12:59:34 +02:00
Duy Ngo	229ecaf0ea	Add numbers and definitions (#10665 )	2022-04-18 12:58:32 +02:00
Joachim Fainberg	4e1716223c	displaCy: Avoid increasing levels for identical arcs (#10639 ) * Test for arc levels for identical arcs Also moves the test in order with the other numbered tests. * displaCy: filter identical arcs Avoid increased levels due to identical arcs by first filtering any identical arcs. * Sort keys before filtering Manual entry with keys out of order would previously become different tuples and therefore not filtered correctly. Co-authored-by: Joachim Fainberg <joachimfainberg@Joachims-MBP.lan>	2022-04-14 16:48:00 +02:00
fonfonx	028cbad05e	Add feminine form of word "one" in French (#10653 ) * Add French number * Add fonfonx.md * Add feminine ordinal words for French	2022-04-14 10:21:27 +02:00
single-fingal	4228f3c757	Fix a few minor bugs in the SpanGroup API web docs (#10650 ) * Fix a few minor bugs in the SpanGroup API web docs * Update SpanGroup docs examples to have Spans reflect intended "errors"	2022-04-14 09:59:48 +02:00
Richard Hudson	75fbbcdc18	Display warning when spacy.explain() finds no term (#10645 ) * Display warning when spacy.explain() finds no term * Updated warning message text	2022-04-12 10:48:28 +02:00
Madeesh Kannan	9ba3e1cb2f	Basic tests for the Tamil language (#10629 ) * Add basic tests for Tamil (ta) * Add comment Remove superfluous condition * Remove superfluous call to `pipe` Instantiate new tokenizer for special case	2022-04-07 14:47:37 +02:00
Lj Miranda	02dafa3a84	Add debug diff command in spaCy CLI (#10502 ) * Add initial design for diff command For now, the diffing process looks like this: - The default config is created based from some values in the user config (e.g. which pipeline components were used, the lang, etc.) - The user must supply manually if it was optimized for acc/efficiency and if pretraining was involved. * Make diff command structure similar to siblings * Include gpu as a user option for CLI * Make variables more explicit * Fix type declaration for optimize enum * Improve docstrings for diff CLI * Add debug-diff to website API docs * Switch position of configs so that user config is modded * Add markdown flag for debug diff This commit adds a --markdown (--md) flag that allows easier copy-pasting to Github issues. Please note that this commit is dependent on an unreleased version of wasabi (for the time being). For posterity, the related PR is found here: https://github.com/ines/wasabi/pull/20 * Bump version of wasabi to 0.9.1 So that we can use the add_symbols parameter. * Apply suggestions from code review Co-authored-by: Ines Montani <ines@ines.io> * Update docs based on code review suggestions Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Change command name from diff -> diff-config * Clarify when options are relevant or not * Rerun prettier on cli.md Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-04-07 10:48:45 +02:00
Joachim Fainberg	b91255a454	displacy: avoid overlapping arcs in manual mode (#10534 ) * Added test for overlapping arcs * Provide distinct levels to overlapping arcs * Update return type hint for get_levels * Improved formatting spacy/displacy/render.py Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Joachim Fainberg <joachimfainberg@Joachims-MacBook-Pro.local> Co-authored-by: Ines Montani <ines@ines.io>	2022-04-05 09:08:02 +02:00
Adriane Boyd	849bef2de6	Merge pull request #10596 from adrianeboyd/chore/v3.3.0.dev0 Set version to v3.3.0.dev0	2022-04-04 09:18:07 +02:00
Adriane Boyd	e422101e00	Temporarily skip tests that require models/compat	2022-04-01 11:09:28 +02:00
Adriane Boyd	ca54de27bb	Support more internal methods for SpanGroup (#10476 ) * Added new convenience cython functions to SpanGroup to avoid unnecessary allocation/deallocation of objects * Replaced sorting in has_overlap with C++ for efficiency. Also, added a test for has_overlap * Added a method to efficiently merge SpanGroups * Added __delitem__, __add__ and __iadd__. Also, allowed to pass span lists to merge function. Replaced extend() body with call to merge * Renamed merge to concat and added missing things to documentation * Added operator+ and operator += in the documentation * Added a test for Doc deallocation * Update spacy/tokens/span_group.pyx * Updated SpanGroup tests to use new span list comparison function rather than assert_span_list_equal, eliminating the need to have a separate assert_not_equal function * Fixed typos in SpanGroup documentation Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Minor changes requested by Sofie: rearranged import statements. Added new=3.2.1 tag to SpanGroup.__setitem__ documentation * SpanGroup: moved repetitive list index check/adjustment in a separate function * Turn off formatting that hurts readability spacy/tests/doc/test_span_group.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Remove formatting that hurts readability spacy/tests/doc/test_span_group.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Turn off formatting that hurts readability in spacy/tests/doc/test_span_group.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Support more internal methods for SpanGroup Add support for: * `__setitem__` * `__delitem__` * `__iadd__`: for `SpanGroup` or `Iterable[Span]` * `__add__`: for `SpanGroup` only Adapted from #9698 with the scope limited to the magic methods. * Use v3.3 as new version in docs * Add new tag to SpanGroup.copy in API docs * Remove duplicate import * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Remaining suggestions and formatting Co-authored-by: nrodnova <nrodnova@hotmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Natalia Rodnova <4512370+nrodnova@users.noreply.github.com>	2022-04-01 09:56:26 +02:00
Adriane Boyd	d56b1400d2	Set version to v3.3.0.dev0	2022-04-01 09:54:52 +02:00
Daniël de Kok	c90dd6f265	Alignment: use a simplified ragged type for performance (#10319 ) * Alignment: use a simplified ragged type for performance This introduces the AlignmentArray type, which is a simplified version of Ragged that performs better on the simple(r) indexing performed for alignment. * AlignmentArray: raise an error when using unsupported index * AlignmentArray: move error messages to Errors * AlignmentArray: remove simlified ... with simplifications * AlignmentArray: fix typo that broke a[n:n] indexing	2022-04-01 09:02:06 +02:00
Adriane Boyd	03762b4b92	Add spancat, trainable_lemmatizer to quickstart (#10524 ) * Add `SPACY` and `IS_SPACE` as default `tok2vec` features	2022-04-01 09:01:04 +02:00
Adriane Boyd	e3ccc1973b	Provide debug data info for floret vectors (#10592 )	2022-03-31 15:11:32 +02:00
Yunus Atahan	36d3af3013	Fixed typo in Turkish lang. (#10582 ) * added failing test case for the issue. * Fixed typo. * fixed typo in test. * added corrected typo word into test_tr_lex_attrs_capitals as param. Test passes. Also tried and confirmed that test is failing after fixing the typo in the test case I wrote. Deleted the test case for typo. Co-authored-by: Yunus Atahan <yunus.atahan@trmotor.local>	2022-03-30 13:16:08 +02:00
Adriane Boyd	f98b41c390	Add vector deduplication (#10551 ) * Add vector deduplication * Add `Vocab.deduplicate_vectors()` * Always run deduplication in `spacy init vectors` * Clean up a few vector-related error messages and docs examples * Always unique with numpy * Fix types	2022-03-30 08:54:23 +02:00
Adriane Boyd	85778dfcf4	Add edit tree lemmatizer (#10231 ) * Add edit tree lemmatizer Co-authored-by: Daniël de Kok <me@danieldk.eu> * Hide edit tree lemmatizer labels * Use relative imports * Switch to single quotes in error message * Type annotation fixes Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Reformat edit_tree_lemmatizer with black * EditTreeLemmatizer.predict: take Iterable Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Validate edit trees during deserialization This change also changes the serialized representation. Rather than mirroring the deep C structure, we use a simple flat union of the match and substitution node types. * Move edit_trees to _edit_tree_internals * Fix invalid edit tree format error message * edit_tree_lemmatizer: remove outdated TODO comment * Rename factory name to trainable_lemmatizer * Ignore type instead of casting truths to List[Union[Ints1d, Floats2d, List[int], List[str]]] for thinc v8.0.14 * Switch to Tagger.v2 * Add documentation for EditTreeLemmatizer * docs: Fix 3.2 -> 3.3 somewhere * trainable_lemmatizer documentation fixes * docs: EditTreeLemmatizer is in edit_tree_lemmatizer.py Co-authored-by: Daniël de Kok <me@danieldk.eu> Co-authored-by: Daniël de Kok <me@github.danieldk.eu> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-28 11:13:50 +02:00
github-actions[bot]	98ed941c39	Auto-format code with black (#10550 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-03-28 10:44:46 +02:00
Luka Dragar	53674bb745	Examples for Slovene (#10539 ) * Added examples for Slovene * Update spacy/lang/sl/examples.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Corrected a typo in one of the sentences Co-authored-by: Luka Dragar <D20124481@mytudublin.ie> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-03-28 10:44:10 +02:00
Adriane Boyd	3711af74e5	Add tokenizer option to allow Matcher handling for all rules (#10452 ) * Add tokenizer option to allow Matcher handling for all rules Add tokenizer option `with_faster_rules_heuristics` that determines whether the special cases applied by the internal `Matcher` are filtered by whether they contain affixes or space. If `True` (default), the rules are filtered to prioritize speed over rare edge cases. If `False`, all rules are included in the final `Matcher`-based pass over the doc. * Reset all caches when reloading special cases * Revert "Reset all caches when reloading special cases" This reverts commit `4ef6bd171d`. * Initialize max_length properly * Add new tag to API docs * Rename to faster heuristics	2022-03-24 13:21:32 +01:00
Adriane Boyd	31a5d99efa	Maintain support for empty DocBin span groups (#10538 )	2022-03-24 11:51:07 +01:00
Daniël de Kok	2ff197603e	matcher: remove an undefined behavior (#10537 ) Indexing into a zero-length std::vector is an undefined behavior.	2022-03-24 11:48:22 +01:00
Adriane Boyd	d85117f88c	Stream large assets on download (#10521 ) Stream large assets on download rather than reading the whole file at once and potentially running into `urllib3` limits on single read sizes.	2022-03-24 11:47:05 +01:00
Adriane Boyd	e908a67829	Handle unknown tags in KoreanTokenizer tag map (#10536 )	2022-03-24 11:25:36 +01:00
Adriane Boyd	c17980e535	Save vectors as little endian, load with Ops.asarray (#10201 ) * Save vectors as little endian, load with Ops.asarray * Always save vector data as little endian * Always run `Vectors.to_ops` when vector data is loaded so that `Ops.asarray` can be used to load the data correctly for the current ops. * Update spacy/vectors.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/vectors.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-21 14:24:46 +01:00
github-actions[bot]	bf1cf77a5b	Auto-format code with black (#10518 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-03-21 09:21:24 +01:00
Grey Murav	3ff5a6a5c0	Extend list of _num_words (#10468 )	2022-03-16 18:25:42 +01:00
Lj Miranda	a79cd3542b	Add displacy support for overlapping Spans (#10332 ) * Fix docstring for EntityRenderer * Add warning in displacy if doc.spans are empty * Implement parse_spans converter One notable change here is that the default spans_key is sc, and it's set by the user through the options. * Implement SpanRenderer Here, I implemented a SpanRenderer that looks similar to the EntityRenderer except for some templates. The spans_key, by default, is set to sc, but can be configured in the options (see parse_spans). The way I rendered these spans is per-token, i.e., I first check if each token (1) belongs to a given span type and (2) a starting token of a given span type. Once I have this information, I render them into the markup. * Fix mypy issues on typing * Add tests for displacy spans support * Update colors from RGB to hex Co-authored-by: Ines Montani <ines@ines.io> * Remove unnecessary CSS properties * Add documentation for website * Remove unnecessary scripts * Update wording on the documentation Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Put typing dependency on top of file * Put back z-index so that spans overlap properly * Make warning more explicit for spans_key Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-16 18:14:34 +01:00
Daniël de Kok	e5debc68e4	Tagger: use unnormalized probabilities for inference (#10197 ) * Tagger: use unnormalized probabilities for inference Using unnormalized softmax avoids use of the relatively expensive exp function, which can significantly speed up non-transformer models (e.g. I got a speedup of 27% on a German tagging + parsing pipeline). * Add spacy.Tagger.v2 with configurable normalization Normalization of probabilities is disabled by default to improve performance. * Update documentation, models, and tests to spacy.Tagger.v2 * Move Tagger.v1 to spacy-legacy * docs/architectures: run prettier * Unnormalized softmax is now a Softmax_v2 option * Require thinc 8.0.14 and spacy-legacy 3.0.9	2022-03-15 14:15:31 +01:00
Adriane Boyd	0dc454ba95	Update docs for Vocab.get_vector (#10486 ) * Update docs for Vocab.get_vector * Clarify description of 0-vector dimensions	2022-03-15 09:10:47 +01:00
Edward	2eef47dd26	Save span candidates produced by spancat suggesters (#10413 ) * Add save_candidates attribute * Change spancat api * Add unit test * reimplement method to produce a list of doc * Add method to docs * Add new version tag * Add intended use to docstring * prettier formatting	2022-03-14 16:46:58 +01:00
Edward	b68bf43f5b	Add spans to doc.to_json (#10073 ) * Add spans to to_json * adjustments to_json * Change docstring * change doc key naming * Update spacy/tokens/doc.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-03-14 15:47:57 +01:00
github-actions[bot]	1bbf232074	Auto-format code with black (#10479 ) * Auto-format code with black * Update spacy/lang/hsb/lex_attrs.py Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-03-11 12:20:23 +01:00
Adriane Boyd	297dd82c86	Fix initial special cases for Tokenizer.explain (#10460 ) Add the missing initial check for special cases to `Tokenizer.explain` to align with `Tokenizer._tokenize_affixes`.	2022-03-11 10:50:47 +01:00
Adriane Boyd	191e8b31fa	Remove English tokenizer exception May. (#10463 )	2022-03-08 14:28:46 +01:00
jnphilipp	5ca0dbae76	Add Lower Sorbian support. (#10431 ) * Add basic support for Lower Sorbian. * Add some tests for dsb. * Update spacy/lang/dsb/examples.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-07 16:57:14 +01:00
Paul O'Leary McCann	61ba5450ff	Fix get_matching_ents (#10451 ) * Fix get_matching_ents Not sure what happened here - the code prior to this commit simply does not work. It's already covered by entity linker tests, which were succeeding in the NEL PR, but couldn't possibly succeed on master. * Fix test Test was indented inside another test and so doesn't seem to have been running properly.	2022-03-07 16:56:57 +01:00
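
Note on commit 0a92d5644e above: its message describes using `@overload` so that `__getitem__` is typed as returning an `int` for a `str` or `bytes` key and a `str` for an `int` key. The following is only a minimal sketch of that typing pattern. `TinyStringStore` is a hypothetical stand-in written for illustration, it is not spaCy's `StringStore`, and it uses list positions as keys where the real class is assumed to behave differently.

```python
from typing import Dict, List, Union, overload


class TinyStringStore:
    """Hypothetical toy class for illustration; not spaCy's StringStore."""

    def __init__(self) -> None:
        self._keys: Dict[str, int] = {}   # string -> integer key
        self._strings: List[str] = []     # integer key -> string

    # The overloads tell a type checker which return type goes with
    # which argument type. They have no effect at runtime.
    @overload
    def __getitem__(self, key: Union[str, bytes]) -> int: ...

    @overload
    def __getitem__(self, key: int) -> str: ...

    def __getitem__(self, key):
        if isinstance(key, int):
            # Integer in, string out.
            return self._strings[key]
        if isinstance(key, bytes):
            key = key.decode("utf8")
        # String in, integer out (the toy store adds unseen strings).
        if key not in self._keys:
            self._keys[key] = len(self._strings)
            self._strings.append(key)
        return self._keys[key]
```

With the overloads in place, a checker such as mypy can infer `int` for `store["apple"]` and `str` for `store[0]` without a cast at the call site.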

9051 Commits
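
Note on commit e5debc68e4 above: its message says that skipping softmax normalization at inference avoids the relatively expensive `exp` call. The sketch below illustrates why the predicted label can stay the same: softmax is monotonic, so the index of the largest raw score is also the index of the largest probability. This is a plain-Python illustration of the general idea under that assumption, not the thinc `Softmax_v2` implementation. The helper names `softmax` and `argmax` are made up for this example.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalized probabilities; needs one exp() call per score."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Raw, unnormalized scores for four hypothetical tags.
scores = [1.2, -0.3, 3.4, 0.0]

# Picking the best tag from the raw scores gives the same answer as
# picking it from the normalized probabilities, without calling exp().
best_raw = argmax(scores)
best_normalized = argmax(softmax(scores))
```

Normalization is still needed when the probabilities themselves are consumed, which is presumably why the commit message describes it as configurable.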