spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-27 18:36:36 +03:00

Author	SHA1	Message	Date
richardpaulhudson	d4218366c5	Update Holmes entry in universe.json	2022-05-30 18:05:26 +02:00
Max Tarlov	709d6d9114	Update documentation for displacy style kwargs (#10841 ) * Update docs for displacy style kwargs Added "span" to the accepted values for the style kwarg in the displacy.serve and displacy.render top-level functions. These styles are new as of SpaCy 3.3, so I added the "new" tag for that option only * restored alpha ordering	2022-05-30 09:11:55 +02:00
Peter Baumgartner	bf95f0a1dd	add doc cleaner to menu (#10862 )	2022-05-30 08:51:19 +02:00
Freddy Heppell	322c5a3ac4	Fix misspelt keyword in StringStore example	2022-05-29 10:49:19 +01:00
Sofie Van Landeghem	83ed1f391b	Remove NBSP's across tables in the docs (#10842 )	2022-05-25 09:48:39 +02:00
Lj Miranda	1d34aa2b3d	Add spacy-span-analyzer to debug data (#10668 ) * Rename to spans_key for consistency * Implement spans length in debug data * Implement how span bounds and spans are obtained In this commit, I implemented how span boundaries (the tokens) around a given span and spans are obtained. I've put them in the compile_gold() function so that it's accessible later on. I will do the actual computation of the span and boundary distinctiveness in the main function above. * Compute for p_spans and p_bounds * Add computation for SD and BD * Fix mypy issues * Add weighted average computation * Fix compile_gold conditional logic * Add test for frequency distribution computation * Add tests for kl-divergence computation * Fix weighted average computation * Make tables more compact by rounding them * Add more descriptive checks for spans * Modularize span computation methods In this commit, I added the _get_span_characteristics and _print_span_characteristics functions so that they can be reusable anywhere. * Remove unnecessary arguments and make fxs more compact * Update a few parameter arguments * Add tests for print_span and get_span methods * Update API to talk about span characteristics in brief * Add better reporting of spans_length * Add test for span length reporting * Update formatting of span length report Removed '' to indicate that it's not a string, then sort the n-grams by their length, not by their frequency. * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Show all frequency distribution when -V In this commit, I displayed the full frequency distribution of the span lengths when --verbose is passed. To make things simpler, I rewrote some of the formatter functions so that I can call them whenever. Another notable change is that instead of showing percentages as Integers, I showed them as floats (max 2-decimal places). I did this because it looks weird when it displays (0%). 
* Update logic on how total is computed The way the 90% thresholding is computed now is that we keep adding the percentages until we reach >= 90%. I also updated the wording and used the term "At least" to denote that >= 90% of your spans have these distributions. * Fix display when showing the threshold percentage * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Add better phrasing for span information * Update spacy/cli/debug_data.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Add minor edits for whitespaces etc. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-05-23 19:06:38 +02:00
Peter Baumgartner	7ce3460b23	add floret to static vectors docs (#10833 )	2022-05-23 09:16:31 +02:00
kadarakos	a3814ee739	oov confusion fix (#10828 )	2022-05-23 09:15:51 +02:00
Adriane Boyd	a82ec56aae	Remove cuda extras for non-linux arm in install widget (#10796 ) * Remove cuda extras for non-linux arm platforms in install widget * Extend cuda versions install widget * Update GPU install docs to clarify cuda	2022-05-20 09:57:41 +02:00
schaeran	f5952c0851	update spaCy Universe: spacytextblob (code example)	2022-05-12 18:23:00 +02:00
Adriane Boyd	b65d652881	Override SpanGroups.setdefault to provide default SpanGroup (#10772 ) * Fix mistake in SpanGroup API docs * Restrict SpanGroups.setdefault to SpanGroup only * Refactor to support default span iterable	2022-05-12 10:06:25 +02:00
Richard Hudson	d524f6415f	Add documentation tip about overriding variables (#10780 )	2022-05-11 10:15:32 +02:00
Raphael Mitsch	2904359685	Allow assets to be optional in spacy project (#10714 ) * Allow assets to be optional in spacy project: draft for optional flag/download_all options. * Allow assets to be optional in spacy project: added OPTIONAL_DEFAULT reflecting default asset optionality. * Allow assets to be optional in spacy project: renamed --all to --extra. * Allow assets to be optional in spacy project: included optional flag in project config test. * Allow assets to be optional in spacy project: added documentation. * Allow assets to be optional in spacy project: fixing deprecated --all reference. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Allow assets to be optional in spacy project: fixed project_assets() docstring. * Allow assets to be optional in spacy project: adjusted wording in justification of optional assets. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: switched to as keyword in project.yml. Updated docs. * Allow assets to be optional in spacy project: updated comment. * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in output. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in docstring.. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test.. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: renamed OPTIONAL_DEFAULT to EXTRA_DEFAULT. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-05-10 10:40:11 +02:00
Sofie Van Landeghem	1543558d08	Add test for old architectures (#10751 ) * add v1 and v2 tests for tok2vec architectures * textcat architectures are not "layers" * test older textcat architectures * test older parser architecture	2022-05-10 08:24:42 +02:00
Madeesh Kannan	733114bdd9	`training.md`: Fix typos (#10775 )	2022-05-09 19:44:14 +02:00
Raphael Mitsch	e626df959f	Document different ways to create a pipeline (#10762 ) * Document different ways to create a pipeline: moved up/slightly modified paragraph on pipeline creation. * Document different ways to create a pipeline: changed Finnish to Ukrainian in example for language without trained pipeline. * Document different ways to create a pipeline: added explanation of blank pipeline. * Document different ways to create a pipeline: exchanged Ukrainian with Yoruba.	2022-05-06 15:40:59 +02:00
Richard Hudson	c32e1a0079	Updated Coreferee Universe entry (#10763 )	2022-05-06 13:21:39 +02:00
Sofie Van Landeghem	e03b9f8095	Small doc typos (#10750 ) * fix typos * formatting	2022-05-03 13:55:27 +02:00
vincent d warmerdam	f3de976513	Update universe.json to Include spaCy video #6 (#10723 ) * Update universe.json I noticed that episode 6 was missing, so I added it. * Update universe.json * Update universe.json	2022-05-02 13:35:14 +02:00
Adriane Boyd	497a708c71	Docs for v3.3 (#10628 ) * Temporarily disable CI tests * Start v3.3 website updates * Add trainable lemmatizer to pipeline design * Fix Vectors.most_similar * Add floret vector info to pipeline design * Add Lower and Upper Sorbian * Add span to sidebar * Work on release notes * Copy from release notes * Update pipeline design graphic * Upgrading note about Doc.from_docs * Add tables and details * Update website/docs/models/index.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix da lemma acc * Add minimal intro, various updates * Round lemma acc * Add section on floret / word lists * Add new pipelines table, minor edits * Fix displacy spans example title * Clarify adding non-trainable lemmatizer * Update adding-languages URLs * Revert "Temporarily disable CI tests" This reverts commit `1dee505920`. * Spell out words/sec Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-04-28 14:09:35 +02:00
harmbuisman	c066fb8a4e	#10672 : fixes displacy output for manual unsorted entities (#10673 ) * #10672: fixes displacy output for manual unsorted entities * #10672: removed unused import * fix prettier formatting Co-authored-by: Harm Buisman <h.buisman@iknl.nl> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-04-27 09:51:58 +02:00
Adriane Boyd	455f089c9b	Support exclude in Doc.from_docs (#10689 ) * Support exclude in Doc.from_docs * Update API docs * Add new tag to docs	2022-04-25 18:19:03 +02:00
Mike	3b208197c3	Fixed example for spacy_syllables (#10705 ) There was a typo in the example for the spacy_syllables project.	2022-04-25 16:40:54 +02:00
Schero1994	d622883a42	Adding and updating content in the spacy universe (#10493 ) * signing contributor agreement * adding new content to the spaCy universe * updating outdated example codes * resolving issues for the PR * resolve review for klayers * remove contributor-agreement file from the PR * Update code example of spaCySentiWS Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy-sentiws code example Co-authored-by: schaeran <schaeran1994@gmail.com> Co-authored-by: schaeran <schaeran@explosion.ai> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-04-15 15:36:54 +02:00
Philip Vollet	e63a5d4888	Update newsletter id (#10655 )	2022-04-14 13:34:01 +02:00
Schero1994	caf8528af7	Batch #1 | spaCy universe cleanup (#10642 ) * delete universe object: wmd-relax * delete universe object: spaCy.jl * delete universe object: saber * delete universe object: languagecrunch * delete universe object: gracyql * delete universe object: ExcelCy * delete universe object: EpiTator Co-authored-by: schaeran <schaeran1994@gmail.com>	2022-04-14 10:08:19 +02:00
single-fingal	4228f3c757	Fix a few minor bugs in the SpanGroup API web docs (#10650 ) * Fix a few minor bugs in the SpanGroup API web docs * Update SpanGroup docs examples to have Spans reflect intended "errors"	2022-04-14 09:59:48 +02:00
David Berenstein	d4196a62f1	added crosslingual coreference to spacy universe without additional commits (#10580 ) * added crosslingual coreference to spacy universe * Updated example to introduce batching example. Co-authored-by: David Berenstein <david.berenstein@pandoraintelligence.com>	2022-04-08 08:23:58 +02:00
Lj Miranda	02dafa3a84	Add debug diff command in spaCy CLI (#10502 ) * Add initial design for diff command For now, the diffing process looks like this: - The default config is created based from some values in the user config (e.g. which pipeline components were used, the lang, etc.) - The user must supply manually if it was optimized for acc/efficiency and if pretraining was involved. * Make diff command structure similar to siblings * Include gpu as a user option for CLI * Make variables more explicit * Fix type declaration for optimize enum * Improve docstrings for diff CLI * Add debug-diff to website API docs * Switch position of configs so that user config is modded * Add markdown flag for debug diff This commit adds a --markdown (--md) flag that allows easier copy-pasting to Github issues. Please note that this commit is dependent on an unreleased version of wasabi (for the time being). For posterity, the related PR is found here: https://github.com/ines/wasabi/pull/20 * Bump version of wasabi to 0.9.1 So that we can use the add_symbols parameter. * Apply suggestions from code review Co-authored-by: Ines Montani <ines@ines.io> * Update docs based on code review suggestions Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Change command name from diff -> diff-config * Clarify when options are relevant or not * Rerun prettier on cli.md Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-04-07 10:48:45 +02:00
Adriane Boyd	0d0153db63	Update default spans_key to sc in API docs (#10616 )	2022-04-04 18:09:15 +02:00
Bram Vanroy	f966bf6a15	Update to spacy_conll in universe (#10617 ) * update to spacy_conll * Update website/meta/universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/meta/universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-04-04 17:57:52 +02:00
Adriane Boyd	ca54de27bb	Support more internal methods for SpanGroup (#10476 ) * Added new convenience cython functions to SpanGroup to avoid unnecessary allocation/deallocation of objects * Replaced sorting in has_overlap with C++ for efficiency. Also, added a test for has_overlap * Added a method to efficiently merge SpanGroups * Added __delitem__, __add__ and __iadd__. Also, allowed to pass span lists to merge function. Replaced extend() body with call to merge * Renamed merge to concat and added missing things to documentation * Added operator+ and operator += in the documentation * Added a test for Doc deallocation * Update spacy/tokens/span_group.pyx * Updated SpanGroup tests to use new span list comparison function rather than assert_span_list_equal, eliminating the need to have a separate assert_not_equal function * Fixed typos in SpanGroup documentation Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Minor changes requested by Sofie: rearranged import statements. Added new=3.2.1 tag to SpanGroup.__setitem__ documentation * SpanGroup: moved repetitive list index check/adjustment in a separate function * Turn off formatting that hurts readability spacy/tests/doc/test_span_group.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Remove formatting that hurts readability spacy/tests/doc/test_span_group.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Turn off formatting that hurts readability in spacy/tests/doc/test_span_group.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Support more internal methods for SpanGroup Add support for: * `__setitem__` * `__delitem__` * `__iadd__`: for `SpanGroup` or `Iterable[Span]` * `__add__`: for `SpanGroup` only Adapted from #9698 with the scope limited to the magic methods. 
* Use v3.3 as new version in docs * Add new tag to SpanGroup.copy in API docs * Remove duplicate import * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Remaining suggestions and formatting Co-authored-by: nrodnova <nrodnova@hotmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Natalia Rodnova <4512370+nrodnova@users.noreply.github.com>	2022-04-01 09:56:26 +02:00
Adriane Boyd	03762b4b92	Add spancat, trainable_lemmatizer to quickstart (#10524 ) * Add `SPACY` and `IS_SPACE` as default `tok2vec` features	2022-04-01 09:01:04 +02:00
Adriane Boyd	f98b41c390	Add vector deduplication (#10551 ) * Add vector deduplication * Add `Vocab.deduplicate_vectors()` * Always run deduplication in `spacy init vectors` * Clean up a few vector-related error messages and docs examples * Always unique with numpy * Fix types	2022-03-30 08:54:23 +02:00
Adriane Boyd	85778dfcf4	Add edit tree lemmatizer (#10231 ) * Add edit tree lemmatizer Co-authored-by: Daniël de Kok <me@danieldk.eu> * Hide edit tree lemmatizer labels * Use relative imports * Switch to single quotes in error message * Type annotation fixes Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Reformat edit_tree_lemmatizer with black * EditTreeLemmatizer.predict: take Iterable Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Validate edit trees during deserialization This change also changes the serialized representation. Rather than mirroring the deep C structure, we use a simple flat union of the match and substitution node types. * Move edit_trees to _edit_tree_internals * Fix invalid edit tree format error message * edit_tree_lemmatizer: remove outdated TODO comment * Rename factory name to trainable_lemmatizer * Ignore type instead of casting truths to List[Union[Ints1d, Floats2d, List[int], List[str]]] for thinc v8.0.14 * Switch to Tagger.v2 * Add documentation for EditTreeLemmatizer * docs: Fix 3.2 -> 3.3 somewhere * trainable_lemmatizer documentation fixes * docs: EditTreeLemmatizer is in edit_tree_lemmatizer.py Co-authored-by: Daniël de Kok <me@danieldk.eu> Co-authored-by: Daniël de Kok <me@github.danieldk.eu> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-28 11:13:50 +02:00
Adriane Boyd	d5666fd12d	Add NORM to Matcher feature in docs (#10560 )	2022-03-28 10:35:47 +02:00
Adriane Boyd	33eb63b157	Remove now-built-in jinja2>=3.1.0 extensions	2022-03-25 14:29:33 +01:00
David Berenstein	ed2ac34a8a	added Concise Concepts to spaCy universe (#10499 ) * Update universe.json added classy-classification to Spacy universe * Update universe.json added classy-classification to the spacy universe resources * Update universe.json corrected a small typo in json * Update website/meta/universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/meta/universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/meta/universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update universe.json processed merge feedback * Update universe.json * updated information for Classy Classification Made a more comprehensible and easy description for Classy Classification based on feedback of Philip Vollet to prepare for sharing. * added note about examples * corrected for wrong formatting changes * Update website/meta/universe.json with small typo correction Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * resolved another typo * Update website/meta/universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * added Concise Concepts package to spaCy universe. * updated example code Concise Concepts * updated description for Concise Concepts * updated PR with more visually appealing examples SO to koaning for the suggestions. * corrected for small json typos in concise concepts Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-03-24 18:00:12 +01:00
Adriane Boyd	3711af74e5	Add tokenizer option to allow Matcher handling for all rules (#10452 ) * Add tokenizer option to allow Matcher handling for all rules Add tokenizer option `with_faster_rules_heuristics` that determines whether the special cases applied by the internal `Matcher` are filtered by whether they contain affixes or space. If `True` (default), the rules are filtered to prioritize speed over rare edge cases. If `False`, all rules are included in the final `Matcher`-based pass over the doc. * Reset all caches when reloading special cases * Revert "Reset all caches when reloading special cases" This reverts commit `4ef6bd171d`. * Initialize max_length properly * Add new tag to API docs * Rename to faster heuristics	2022-03-24 13:21:32 +01:00
Basile Dura	107bab56b5	docs: add EDS-NLP to spaCy universe (#10489 ) * docs: add EDS-NLP to spaCy universe * fix: remove "standalone" tag for EDS-NLP Co-authored-by: Basile Dura <basile.dura-ext@aphp.fr>	2022-03-21 11:03:39 +01:00
Lj Miranda	0b02dc4c57	Fix mixed-up parameters for spacy-conll (#10516 )	2022-03-18 08:56:21 +01:00
Lj Miranda	a79cd3542b	Add displacy support for overlapping Spans (#10332 ) * Fix docstring for EntityRenderer * Add warning in displacy if doc.spans are empty * Implement parse_spans converter One notable change here is that the default spans_key is sc, and it's set by the user through the options. * Implement SpanRenderer Here, I implemented a SpanRenderer that looks similar to the EntityRenderer except for some templates. The spans_key, by default, is set to sc, but can be configured in the options (see parse_spans). The way I rendered these spans is per-token, i.e., I first check if each token (1) belongs to a given span type and (2) a starting token of a given span type. Once I have this information, I render them into the markup. * Fix mypy issues on typing * Add tests for displacy spans support * Update colors from RGB to hex Co-authored-by: Ines Montani <ines@ines.io> * Remove unnecessary CSS properties * Add documentation for website * Remove unnecessary scripts * Update wording on the documentation Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Put typing dependency on top of file * Put back z-index so that spans overlap properly * Make warning more explicit for spans_key Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-16 18:14:34 +01:00
David Berenstein	e021dc6279	Updated explanation for classy classification (#10484 ) * Update universe.json added classy-classification to Spacy universe * Update universe.json added classy-classification to the spacy universe resources * Update universe.json corrected a small typo in json * Update website/meta/universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/meta/universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/meta/universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update universe.json processed merge feedback * Update universe.json * updated information for Classy Classification Made a more comprehensible and easy description for Classy Classification based on feedback of Philip Vollet to prepare for sharing. * added note about examples * corrected for wrong formatting changes * Update website/meta/universe.json with small typo correction Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * resolved another typo * Update website/meta/universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-03-15 16:42:33 +01:00
Daniël de Kok	e5debc68e4	Tagger: use unnormalized probabilities for inference (#10197 ) * Tagger: use unnormalized probabilities for inference Using unnormalized softmax avoids use of the relatively expensive exp function, which can significantly speed up non-transformer models (e.g. I got a speedup of 27% on a German tagging + parsing pipeline). * Add spacy.Tagger.v2 with configurable normalization Normalization of probabilities is disabled by default to improve performance. * Update documentation, models, and tests to spacy.Tagger.v2 * Move Tagger.v1 to spacy-legacy * docs/architectures: run prettier * Unnormalized softmax is now a Softmax_v2 option * Require thinc 8.0.14 and spacy-legacy 3.0.9	2022-03-15 14:15:31 +01:00
Adriane Boyd	e8357923ec	Various install docs updates (#10487 ) * Simplify quickstart source install to use only editable pip install * Update pytorch install instructions to more recent versions	2022-03-15 11:12:50 +01:00
vincent d warmerdam	610001e8c7	Update universe.json (#10490 ) The project moved away from Rasa and into my personal GitHub account.	2022-03-15 11:12:04 +01:00
Adriane Boyd	0dc454ba95	Update docs for Vocab.get_vector (#10486 ) * Update docs for Vocab.get_vector * Clarify description of 0-vector dimensions	2022-03-15 09:10:47 +01:00
Edward	2eef47dd26	Save span candidates produced by spancat suggesters (#10413 ) * Add save_candidates attribute * Change spancat api * Add unit test * reimplement method to produce a list of doc * Add method to docs * Add new version tag * Add intended use to docstring * prettier formatting	2022-03-14 16:46:58 +01:00
Adriane Boyd	297dd82c86	Fix initial special cases for Tokenizer.explain (#10460 ) Add the missing initial check for special cases to `Tokenizer.explain` to align with `Tokenizer._tokenize_affixes`.	2022-03-11 10:50:47 +01:00
Peter Baumgartner	01ec6349ea	Add `path.mkdir` to custom component examples of `to_disk` (#10348 ) * add `path.mkdir` to examples * add ensure_path + mkdir * update highlights	2022-03-08 16:04:10 +01:00
Adriane Boyd	60520d8669	Fix types in API docs for moves in parser and ner (#10464 )	2022-03-08 13:51:11 +01:00
Adriane Boyd	b2bbefd0b5	Add Finnish, Korean, and Swedish models and Korean support notes (#10355 ) * Add Finnish, Korean, and Swedish models to website * Add Korean language support notes	2022-03-07 17:03:45 +01:00
David Berenstein	a6d5824e5f	added classy-classification package to spacy universe (#10393 ) * Update universe.json added classy-classification to Spacy universe * Update universe.json added classy-classification to the spacy universe resources * Update universe.json corrected a small typo in json * Update website/meta/universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/meta/universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/meta/universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update universe.json processed merge feedback * Update universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-07 12:47:26 +01:00
Paul O'Leary McCann	91acc3ea75	Fix entity linker batching (#9669 ) * Partial fix of entity linker batching * Add import * Better name * Add `use_gold_ents` option, docs * Change to v2, create stub v1, update docs etc. * Fix error type Honestly no idea what the right type to use here is. ConfigValidationError seems wrong. Maybe a NotImplementedError? * Make mypy happy * Add hacky fix for init issue * Add legacy pipeline entity linker * Fix references to class name * Add __init__.py for legacy * Attempted fix for loss issue * Remove placeholder V1 * formatting * slightly more interesting train data * Handle batches with no usable examples This adds a test for batches that have docs but not entities, and a check in the component that detects such cases and skips the update step as though the batch were empty. * Remove todo about data verification Check for empty data was moved further up so this should be OK now - the case in question shouldn't be possible. * Fix gradient calculation The model doesn't know which entities are not in the kb, so it generates embeddings for the context of all of them. However, the loss does know which entities aren't in the kb, and it ignores them, as there's no sensible gradient. This has the issue that the gradient will not be calculated for some of the input embeddings, which causes a dimension mismatch in backprop. That should have caused a clear error, but with numpyops it was causing nans to happen, which is another problem that should be addressed separately. This commit changes the loss to give a zero gradient for entities not in the kb. * add failing test for v1 EL legacy architecture * Add nasty but simple working check for legacy arch * Clarify why init hack works the way it does * Clarify use_gold_ents use case * Fix use gold ents related handling * Add tests for no gold ents and fix other tests * Use aligned ents function (not working) This doesn't actually work because the "aligned" ents are gold-only. 
But if I have a different function that returns the intersection, then this will work as desired. * Use proper matching ent check This changes the process when gold ents are not used so that the intersection of ents in the pred and gold is used. * Move get_matching_ents to Example * Use model attribute to check for legacy arch * Rename flag * bump spacy-legacy to lower 3.0.9 Co-authored-by: svlandeg <svlandeg@github.com>	2022-03-04 09:17:36 +01:00
Adriane Boyd	8e93fa8507	Fix Vectors.n_keys for floret vectors (#10394 ) Fix `Vectors.n_keys` for floret vectors to match docstring description and avoid W007 warnings in similarity methods.	2022-03-01 09:21:25 +01:00
Sofie Van Landeghem	3f68bbcfec	Clean up loggers docs (#10351 ) * update docs to point to spacy-loggers docs * remove unused error code	2022-02-25 16:29:12 +01:00
Sam Edwardes	5f568f7e41	Updated spaCy universe for spacytextblob (#10335 ) * Updated spacytextblob in universe.json * Fixed json * Update website/meta/universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Added spacy_version tag to spacytextblob Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-02-24 14:18:10 +09:00
Sofie Van Landeghem	a16b14e591	Merge branch 'master' into copy/develop	2022-02-16 14:04:59 +01:00
Paul O'Leary McCann	23bd103d89	Add tmtoolkit setup steps	2022-02-14 15:17:25 +09:00
Markus Konrad	8818a44a39	add tmtoolkit package to spaCy universe (#10245 )	2022-02-14 15:16:43 +09:00
John Boy	10c77af83d	add textnets to spaCy universe (#10216 ) https://github.com/jboynyc/textnets/issues/38	2022-02-09 15:04:26 +09:00
Ines Montani	7b883da9fd	Merge pull request #10239 from explosion/docs/spacy-tailored-pipelines [ci skip]	2022-02-08 18:04:01 +01:00
Ines Montani	f2c2b97e56	Add spaCy Tailored Pipelines	2022-02-08 11:46:42 +01:00
Sofie Van Landeghem	deb143fa70	Token sent attributes more consistent (#10164 ) * remove duplicate line * add sent start/end token attributes to the docs * let has_annotation work with IS_SENT_END * elif instead of if * add has_annotation test for sent attributes * fix typo * remove duplicate is_sent_start entry in docs	2022-02-08 08:35:37 +01:00
Peter Baumgartner	836f689cc7	YAML multiline tip for project.yml files (#10187 ) * MultiHashEmbed vector docs correction * add in multi-line tip * convert to sidebar tip	2022-02-08 08:35:09 +01:00
Kenneth Enevoldsen	e4625d2fc3	Added Augmenty to universe (#10229 ) * Added Augmenty to universe * Update website/meta/universe.json Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/meta/universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-02-08 08:32:11 +01:00
Lj Miranda	72fece712f	Add shuffle parameter to Corpus API docs (#10220 ) * Add shuffle parameter to Corpus API docs * Update website/docs/api/corpus.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-02-07 14:55:53 +01:00
Sofie Van Landeghem	14513f82da	Merge pull request #10215 from explosion/master update develop	2022-02-06 13:45:41 +01:00
Kenneth Enevoldsen	a2f27ff83a	Added spacy-wrap to universe (#10168 ) * Added spacy-wrap to universe Added spacy-wrap to universe a small package for wrapping fine-tuned huggingface transformers to a spacy pipeline following the same API as spacy-transformers. (Currently limited to classification models) * Update website/meta/universe.json * Update website/meta/universe.json * Update website/meta/universe.json * Update website/meta/universe.json Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-02-03 12:30:09 +01:00
Lj Miranda	345e7f6bc4	Clarify Span.ents documentation (#10154 ) * Clarify Span.ents documentation Ref: #10135 Retain current behaviour. Span.ents will only include entities within said span. You can't get tokens outside of the original span. * Reword docstrings Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update API docs in the website Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-01-31 08:41:42 +01:00
Adriane Boyd	4f441dfa24	Fix infix as prefix in Tokenizer.explain (#10140 ) * Fix infix as prefix in Tokenizer.explain Update `Tokenizer.explain` to align with the `Tokenizer` algorithm: * skip infix matches that are prefixes in the current substring * Update tokenizer pseudocode in docs	2022-01-28 17:00:54 +01:00
Ines Montani	34ed93ef68	Support version tags in universe and add note about reporting (#10093 ) * Support version tags in universe and add note about reporting * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-01-20 23:21:26 +01:00
Peter Baumgartner	a69005037a	Docker Image for Website Dev (#10098 ) * add docker instructions * Update website/README.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/README.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * clarifying language on docker image * fix markdown formatting Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-01-20 23:02:13 +01:00
Sofie Van Landeghem	4465fe0306	Merge branch 'develop' into feature/master_copy	2022-01-20 13:36:17 +01:00
Duygu Altinok	268ddf8a06	Add ENT_IOB key to Matcher (#9649 ) * added new field * added exception for IOB strings * minor refinement to schema * removed field * fixed typo * imported numerical val * changed the code bit * cosmetics * added test for matcher * set ents of mock docs * added invalid pattern * minor update to documentation * blacked matcher * added pattern validation * add IOB vals to schema * changed into test * mypy compat * cleaned left over * added compat import * changed type * added compat import * changed literal a bit * went back to old * made explicit type * Update spacy/schemas.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/schemas.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/schemas.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-01-20 13:18:39 +01:00
Adriane Boyd	7d528e607c	Update quickstart install steps (#10092 ) * For conda: * Use conda environment rather than venv * Install `spacy-transformers` as a conda package * For pip: * Add quotes if extras are included	2022-01-20 10:53:40 +01:00
Paul O'Leary McCann	2ff53834bb	Add link to pattern file info in EntityRuler.initialize docs (#10091 ) * Add link to pattern file info in EntityRuler.initialize docs * Update website/docs/api/entityruler.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-01-19 10:45:11 +01:00
Daniël de Kok	50d2a2c930	Use fewer Vectors internals (#9879 ) * Use Vectors.shape rather than Vectors.data.shape * Use Vectors.size rather than Vectors.data.size * Add Vectors.to_ops to move data between different ops * Add documentation for Vectors.to_ops	2022-01-18 17:14:35 +01:00
Tuomo Hiippala	6a8619dd73	Update the entry for Applied Language Technology in spaCy Universe (#10068 ) * add entry for Applied Language Technology under "Courses" Added the following entry into `universe.json`: ``` { "type": "education", "id": "applt-course", "title": "Applied Language Technology", "slogan": "NLP for newcomers using spaCy and Stanza", "description": "These learning materials provide an introduction to applied language technology for audiences who are unfamiliar with language technology and programming. The learning materials assume no previous knowledge of the Python programming language.", "url": "https://applied-language-technology.readthedocs.io/", "image": "https://www.mv.helsinki.fi/home/thiippal/images/applt-preview.jpg", "thumb": "https://applied-language-technology.readthedocs.io/en/latest/_static/logo.png", "author": "Tuomo Hiippala", "author_links": { "twitter": "tuomo_h", "github": "thiippal", "website": "https://www.mv.helsinki.fi/home/thiippal/" }, "category": ["courses"] }, ``` * Update the entry for "Applied Language Technology"	2022-01-17 08:28:51 +01:00
ColleterVi	a784b12eff	fix: new restcountries url (#10043 ) Url extension "eu" and path "rest" are no longer available. Replacing them for a working url.	2022-01-13 20:25:06 +09:00
Sofie Van Landeghem	d8a3012539	Merge pull request #10037 from explosion/master Update develop with master	2022-01-12 12:29:23 +01:00
Ines Montani	a437ca6737	Update website to use new Algolia search API	2022-01-05 13:21:06 +01:00
Sofie Van Landeghem	067a44a417	Merge pull request #9987 from explosion/master Update develop with commits from master	2022-01-05 11:49:50 +01:00
Sofie Van Landeghem	56dcb39fb7	Fix references to config file in the docs & UX (#9961 ) * doc fixes around config file * fix typo * clarify default	2022-01-04 14:31:26 +01:00
Sam Edwardes	6f65e2b544	Added spacypdfreader to universe.json (#9963 )	2022-01-03 16:34:36 +09:00
Paul O'Leary McCann	f40e237c5a	Remove denomme from universe (#9952 ) Package seems to have been deleted.	2021-12-29 11:41:29 +01:00
Florian Cäsar	86e71e7b19	Fix Scorer.score_cats for missing labels (#9443 ) * Fix Scorer.score_cats for missing labels * Add test case for Scorer.score_cats missing labels * semantic nitpick * black formatting * adjust test to give different results depending on multi_label setting * fix loss function according to whether or not missing values are supported * add note to docs * small fixes * make mypy happy * Update spacy/pipeline/textcat.py Co-authored-by: Florian Cäsar <florian.caesar@pm.me> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <svlandeg@github.com>	2021-12-29 11:04:39 +01:00
Yoav Vollansky	9d63dfacfc	Update UNIVERSE.md (#9941 ) typo	2021-12-27 13:46:04 +01:00
Peter Baumgartner	72abf9e102	MultiHashEmbed vector docs correction (#9918 )	2021-12-27 11:18:08 +01:00
Edward	018827e9fd	Add healthsea to universe (#9838 ) * Add healthsea to universe * Update website/meta/universe.json * Add thumbnail * Update website/meta/universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-12-15 17:57:19 +01:00
Ines Montani	ba0fa7a64e	Support Google Sheets embeds in docs (#9861 )	2021-12-15 09:27:08 +01:00
Adriane Boyd	51a3b60027	Document Tagger neg_prefix, fix typo (#9821 )	2021-12-07 09:42:40 +01:00
Duygu Altinok	b56b9e7f31	Entity ruler remove pattern (#9685 ) * added ruler code * added error for non-existing pattern * changed error to warning * changed error to warning * added basic tests * fixed place * added test files * went back to error * went back to pattern error * minor change to docs * changed style * changed doc * changed error slightly * added remove to phrase matcher API * error key already existed * phrase matcher match code to api * blacked tests * moved comments before expr * corrected error no * Update website/docs/api/entityruler.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/entityruler.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-12-06 15:32:49 +01:00
Natalia Rodnova	472740d613	Added sents property to Span for Spans spanning over several sentences (#9699 ) * Added sents property to Span class that returns a generator of sentences the Span belongs to * Added description to Span.sents property * Update test_span to clarify the difference between span.sent and span.sents Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/tests/doc/test_span.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix documentation typos in spacy/tokens/span.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update Span.sents doc string in spacy/tokens/span.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Parametrized test_span_spans * Corrected Span.sents to check for span-level hook first. Also, made Span.sent respect doc-level sents hook if no span-level hook is provided * Corrected Span documentation copy/paste issue * Put back accidentally deleted lines * Fixed formatting in span.pyx * Moved check for SENT_START annotation after user hooks in Span.sents * add version where the property was introduced Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-12-06 09:58:01 +01:00
Narayan Acharya	1be8a4dab3	Displacy serve entity linking support without `manual=True` support. (#9748 ) * Add support for kb_id to be displayed via displacy.serve. The current support is only limited to the manual option in displacy.render * Commit to check pre-commit hooks are run. * Update spacy/displacy/__init__.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Changes as per suggestions on the PR. * Update website/docs/api/top-level.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/top-level.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * tag option as new from 3.2.1 onwards Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2021-11-29 17:13:26 +01:00
Adriane Boyd	6763cbfdc0	Update Catalan acknowledgements for v3.2 (#9763 )	2021-11-29 14:14:21 +01:00
Tuomo Hiippala	5c44533263	add entry for Applied Language Technology under "Courses" (#9755 ) Added the following entry into `universe.json`: ``` { "type": "education", "id": "applt-course", "title": "Applied Language Technology", "slogan": "NLP for newcomers using spaCy and Stanza", "description": "These learning materials provide an introduction to applied language technology for audiences who are unfamiliar with language technology and programming. The learning materials assume no previous knowledge of the Python programming language.", "url": "https://applied-language-technology.readthedocs.io/", "image": "https://www.mv.helsinki.fi/home/thiippal/images/applt-preview.jpg", "thumb": "https://applied-language-technology.readthedocs.io/en/latest/_static/logo.png", "author": "Tuomo Hiippala", "author_links": { "twitter": "tuomo_h", "github": "thiippal", "website": "https://www.mv.helsinki.fi/home/thiippal/" }, "category": ["courses"] }, ```	2021-11-28 19:33:16 +09:00
Natalia Rodnova	a4c43e5c57	Allow Matcher to match on ENT_ID and ENT_KB_ID (#9688 ) * Added ENT_ID and ENT_KB_ID into the list of the attributes that Matcher matches on * Added ENT_ID and ENT_KB_ID to TEST_PATTERNS in test_pattern_validation.py. Disabled tests that I added before * Update website/docs/api/matcher.md * Format * Remove skipped tests Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-11-24 10:37:10 +01:00
Adriane Boyd	9ac6d4991e	Add doc_cleaner component (#9659 ) * Add doc_cleaner component * Fix types * Fix loop * Rephrase method description	2021-11-23 15:33:33 +01:00
Paul O'Leary McCann	52b8c2d2e0	Add note on batch contract for listeners (#9691 ) * Add note on batch contract Using listeners requires batches to be consistent. This is obvious if you understand how the listener works, but it wasn't clearly stated in the Docs, and was subtle enough that the EntityLinker missed it. There is probably a clearer way to explain what the actual requirement is, but I figure this is a good start. * Rewrite to clarify role of caching	2021-11-22 11:06:07 +01:00
Sofie Van Landeghem	13645dcbf5	add note that annotating components is new since 3.1 (#9678 )	2021-11-22 14:43:11 +09:00
Paul O'Leary McCann	f3981bd0c8	Clarify how to fill in init_tok2vec after pretraining (#9639 ) * Clarify how to fill in init_tok2vec after pretraining * Ignore init_tok2vec arg in pretraining * Update docs, config setting * Remove obsolete note about not filling init_tok2vec early This seems to have also caught some lines that needed cleanup.	2021-11-18 15:38:30 +01:00
Vishnu Nandakumar	86fa37e8ba	Update universe.json with new library eng_spacysentiment (#9679 ) * Update universe.json * Update universe.json * Cleanup fields Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2021-11-16 14:06:19 +09:00
Adriane Boyd	216ed231a9	What's new in v3.2 (#9633 ) * What's new in v3.2 * Fix formatting * Fix typo * Redo thanks * Formatting * Fix typo * Fix project links * Fix typo * Minimal intro, floret python module * Rephrase * Rephrase, extend * Rephrase * Update links and formatting [ci skip] * Minor correction * Fix typo Co-authored-by: Ines Montani <ines@ines.io>	2021-11-05 16:31:14 +01:00
Adriane Boyd	07dea324f6	Merge remote-tracking branch 'upstream/develop' into chore/switch-to-master-v3.2.0	2021-11-03 15:32:18 +01:00
Paul O'Leary McCann	c1cc94a33a	Fix typo about receptive field size (#9564 )	2021-11-03 15:16:55 +01:00
Adriane Boyd	79cea03983	Update website model display (#9589 ) * Remove vectors from core trf model descriptions * Update accuracy labels and exclude morph_acc for ja	2021-11-03 09:56:00 +01:00
Paul O'Leary McCann	e43639b27a	Add note about round-trip serializing pipeline to API docs (#9583 )	2021-11-03 09:55:30 +01:00
xxyzz	90ec820f05	Add WordDumb to spaCy Universe (#9572 ) * Add WordDumb to spaCy Universe * Add standalone category Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2021-11-01 18:38:41 +09:00
Bruce W. Lee (이웅성)	a4dcb68cf6	Adding LingFeat Software to spaCy Universe. (#9574 ) * add lingfeat in universe * add lingfeat in universe * Fix JSON * Minor cleanup Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2021-11-01 18:38:14 +09:00
Vasundhara	5279c7c4ba	Fix broken link to mappings-exceptions (#9573 )	2021-10-31 13:44:29 +09:00
Adriane Boyd	2d430958e1	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-3	2021-10-29 12:18:15 +02:00
Paul O'Leary McCann	006df1ae1f	Clarify error when words are of wrong type (#9541 ) * Clarify error when words are of wrong type See #9437 * Update docs * Use try/except * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-29 12:08:40 +02:00
Paul O'Leary McCann	2fd8d616e7	Add docs section for spacy.cli.train.train (#9545 ) * Add section for spacy.cli.train.train * Add link from training page to train function * Ensure path in train helper * Update docs Co-authored-by: Ines Montani <ines@ines.io>	2021-10-29 10:36:34 +02:00
Adriane Boyd	5477453ea3	Docs for thinc-apple-ops (#9549 ) * Docs for thinc-apple-ops * Ignore thinc-apple-ops in reqs tests * Fix install quickstart * Add cupy cuda 113, 114 extras * Remove draft section Co-authored-by: Ines Montani <ines@ines.io>	2021-10-29 10:35:31 +02:00
Adriane Boyd	12974bf4d9	Add micro PRF for morph scoring (#9546 ) * Add micro PRF for morph scoring For pipelines where morph features are added by more than one component and a reference training corpus may not contain all features, a micro PRF score is more flexible than a simple accuracy score. An example is the reading and inflection features added by the Japanese tokenizer. * Use `morph_micro_f` as the default morph score for Japanese morphologizers. * Update docstring * Fix typo in docstring * Update Scorer API docs * Fix results type * Organize score list by attribute prefix	2021-10-29 10:29:29 +02:00
Philip Vollet	76173b0866	fixed typo and URL (#9560 )	2021-10-29 13:57:44 +09:00
Adriane Boyd	c053f158c5	Add support for floret vectors (#8909 ) * Add support for fasttext-bloom hash-only vectors Overview: * Extend `Vectors` to have two modes: `default` and `ngram` * `default` is the default mode and equivalent to the current `Vectors` * `ngram` supports the hash-only ngram tables from `fasttext-bloom` * Extend `spacy.StaticVectors.v2` to handle both modes with no changes for `default` vectors * Extend `spacy init vectors` to support ngram tables The `ngram` mode only supports vector tables produced by this fork of fastText, which adds an option to represent all vectors using only the ngram buckets table and which uses the exact same ngram generation algorithm and hash function (`MurmurHash3_x64_128`). `fasttext-bloom` produces an additional `.hashvec` table, which can be loaded by `spacy init vectors --fasttext-bloom-vectors`. https://github.com/adrianeboyd/fastText/tree/feature/bloom Implementation details: * `Vectors` now includes the `StringStore` as `Vectors.strings` so that the API can stay consistent for both `default` (which can look up from `str` or `int`) and `ngram` (which requires `str` to calculate the ngrams). * In ngram mode `Vectors` uses a default `Vectors` object as a cache since the ngram vectors lookups are relatively expensive. * The default cache size is the same size as the provided ngram vector table. * Once the cache is full, no more entries are added. The user is responsible for managing the cache in cases where the initial documents are not representative of the texts. * The cache can be resized by setting `Vectors.ngram_cache_size` or cleared with `vectors._ngram_cache.clear()`. * The API ends up a bit split between methods for `default` and for `ngram`, so functions that only make sense for `default` or `ngram` include warnings with custom messages suggesting alternatives where possible. * `Vocab.vectors` becomes a property so that the string stores can be synced when assigning vectors to a vocab. 
* `Vectors` serializes its own config settings as `vectors.cfg`. * The `Vectors` serialization methods have added support for `exclude` so that the `Vocab` can exclude the `Vectors` strings while serializing. Removed: * The `minn` and `maxn` options and related code from `Vocab.get_vector`, which does not work in a meaningful way for default vector tables. * The unused `GlobalRegistry` in `Vectors`. * Refactor to use reduce_mean Refactor to use reduce_mean and remove the ngram vectors cache. * Rename to floret * Rename to floret in error messages * Use --vectors-mode in CLI, vector init * Fix vectors mode in init * Remove unused var * Minor API and docstrings adjustments * Rename `--vectors-mode` to `--mode` in `init vectors` CLI * Rename `Vectors.get_floret_vectors` to `Vectors.get_batch` and support both modes. * Minor updates to Vectors docstrings. * Update API docs for Vectors and init vectors CLI * Update types for StaticVectors	2021-10-27 14:08:31 +02:00
Adriane Boyd	a803af9dfa	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1	2021-10-26 11:53:50 +02:00
Elia Robyn Lake (Robyn Speer)	fa70837f28	clarify how to connect pretraining to training (#9450 ) * clarify how to connect pretraining to training Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * Update website/docs/usage/embeddings-transformers.md * Update website/docs/usage/embeddings-transformers.md * Update website/docs/usage/embeddings-transformers.md * Update website/docs/usage/embeddings-transformers.md Co-authored-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-10-22 13:15:47 +02:00
Duygu Altinok	7b98aa4c16	Corrected broken (#9505 )	2021-10-20 17:31:59 +02:00
Daniël de Kok	1f05f56433	Add the spacy.models_with_nvtx_range.v1 callback (#9124 ) * Add the spacy.models_with_nvtx_range.v1 callback This callback recursively adds NVTX ranges to the Models in each pipe in a pipeline. * Fix create_models_with_nvtx_range type signature * NVTX range: wrap models of all trainable pipes jointly This avoids that (sub-)models that are shared between pipes get wrapped twice. * NVTX range callback: make color configurable Add forward_color and backprop_color options to set the color for the NVTX range. * Move create_models_with_nvtx_range to spacy.ml * Update create_models_with_nvtx_range for thinc changes with_nvtx_range now updates an existing node, rather than returning a wrapper node. So, we can simply walk over the nodes and update them. * NVTX: use after_pipeline_creation in example	2021-10-20 11:59:48 +02:00
Adriane Boyd	3f181b73d0	Add ja_core_news_trf to website (#9515 )	2021-10-20 10:18:02 +02:00
Paul O'Leary McCann	222cf9b6d2	Clarify how to change base Transformer model (#9498 ) * Add note about how the model name is used * Add link to TransformersModel docs, separate paragraph * Local link * Revise docs * Update website/docs/usage/embeddings-transformers.md * Update website/docs/usage/embeddings-transformers.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-10-19 23:28:20 +02:00
Adriane Boyd	a6424bcea9	Minor updates to spacy-transformers docs for v1.1.0 (#9496 )	2021-10-18 14:55:02 +02:00
Adriane Boyd	9b86209a4a	Update docs for spacy-transformers v1.1 data classes (#9361 )	2021-10-18 14:16:58 +02:00
Sofie Van Landeghem	3fd3531e12	Docs for new spacy-trf architectures (#8954 ) * use TransformerModel.v2 in quickstart * update docs for new transformer architectures * bump spacy_transformers to 1.1.0 * Add new arguments spacy-transformers.TransformerModel.v3 * Mention that mixed-precision support is experimental * Describe delta transformers.Tok2VecTransformer versions * add dot * add dot, again * Update some more TransformerModel references v2 -> v3 * Add mixed-precision options to the training quickstart Disable mixed-precision training/prediction by default. * Update setup.cfg Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/usage/embeddings-transformers.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Daniël de Kok <me@danieldk.eu> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-18 14:15:06 +02:00
Connor Brinton	657af5f91f	🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167 ) * 🚨 Ignore all existing Mypy errors * 🏗 Add Mypy check to CI * Add types-mock and types-requests as dev requirements * Add additional type ignore directives * Add types packages to dev-only list in reqs test * Add types-dataclasses for python 3.6 * Add ignore to pretrain * 🏷 Improve type annotation on `run_command` helper The `run_command` helper previously declared that it returned an `Optional[subprocess.CompletedProcess]`, but it isn't actually possible for the function to return `None`. These changes modify the type annotation of the `run_command` helper and remove all now-unnecessary `# type: ignore` directives. * 🔧 Allow variable type redefinition in limited contexts These changes modify how Mypy is configured to allow variables to have their type automatically redefined under certain conditions. The Mypy documentation contains the following example: ```python def process(items: List[str]) -> None: # 'items' has type List[str] items = [item.split() for item in items] # 'items' now has type List[List[str]] ... ``` This configuration change is especially helpful in reducing the number of `# type: ignore` directives needed to handle the common pattern of: * Accepting a filepath as a string * Overwriting the variable using `filepath = ensure_path(filepath)` These changes enable redefinition and remove all `# type: ignore` directives rendered redundant by this change. 
* 🏷 Add type annotation to converters mapping * 🚨 Fix Mypy error in convert CLI argument verification * 🏷 Improve type annotation on `resolve_dot_names` helper * 🏷 Add type annotations for `Vocab` attributes `strings` and `vectors` * 🏷 Add type annotations for more `Vocab` attributes * 🏷 Add loose type annotation for gold data compilation * 🏷 Improve `_format_labels` type annotation * 🏷 Fix `get_lang_class` type annotation * 🏷 Loosen return type of `Language.evaluate` * 🏷 Don't accept `Scorer` in `handle_scores_per_type` * 🏷 Add `string_to_list` overloads * 🏷 Fix non-Optional command-line options * 🙈 Ignore redefinition of `wandb_logger` in `loggers.py` * ➕ Install `typing_extensions` in Python 3.8+ The `typing_extensions` package states that it should be used when "writing code that must be compatible with multiple Python versions". Since spaCy needs to support multiple Python versions, it should be used when newer `typing` module members are required. One example of this is `Literal`, which is available starting with Python 3.8. Previously spaCy tried to import `Literal` from `typing`, falling back to `typing_extensions` if the import failed. However, Mypy doesn't seem to be able to understand what `Literal` means when the initial import fails. Therefore, these changes modify how `compat` imports `Literal` by always importing it from `typing_extensions`. These changes also modify how `typing_extensions` is installed, so that it is a requirement for all Python versions, including those greater than or equal to 3.8. * 🏷 Improve type annotation for `Language.pipe` These changes add a missing overload variant to the type signature of `Language.pipe`. Additionally, the type signature is enhanced to allow type checkers to differentiate between the two overload variants based on the `as_tuple` parameter. 
Fixes #8772 * ➖ Don't install `typing-extensions` in Python 3.8+ After more detailed analysis of how to implement Python version-specific type annotations using spaCy, it has been determined that branching on a comparison against `sys.version_info` can be statically analyzed by Mypy well enough to enable us to conditionally use `typing_extensions.Literal`. This means that we no longer need to install `typing_extensions` for Python versions greater than or equal to 3.8! 🎉 These changes revert previous changes installing `typing-extensions` regardless of Python version and modify how we import the `Literal` type to ensure that Mypy treats it properly. * resolve mypy errors for Strict pydantic types * refactor code to avoid missing return statement * fix types of convert CLI command * avoid list-set confusion in debug_data * fix typo and formatting * small fixes to avoid type ignores * fix types in profile CLI command and make it more efficient * type fixes in projects CLI * put one ignore back * type fixes for render * fix render types - the sequel * fix BaseDefault in language definitions * fix type of noun_chunks iterator - yields tuple instead of span * fix types in language-specific modules * 🏷 Expand accepted inputs of `get_string_id` `get_string_id` accepts either a string (in which case it returns its ID) or an ID (in which case it immediately returns the ID). These changes extend the type annotation of `get_string_id` to indicate that it can accept either strings or IDs. * 🏷 Handle override types in `combine_score_weights` The `combine_score_weights` function allows users to pass an `overrides` mapping to override data extracted from the `weights` argument. Since it allows `Optional` dictionary values, the return value may also include `Optional` dictionary values. These changes update the type annotations for `combine_score_weights` to reflect this fact. 
* 🏷 Fix tokenizer serialization method signatures in `DummyTokenizer` * 🏷 Fix redefinition of `wandb_logger` These changes fix the redefinition of `wandb_logger` by giving a separate name to each `WandbLogger` version. For backwards-compatibility, `spacy.train` still exports `wandb_logger_v3` as `wandb_logger` for now. * more fixes for typing in language * type fixes in model definitions * 🏷 Annotate `_RandomWords.probs` as `NDArray` * 🏷 Annotate `tok2vec` layers to help Mypy * 🐛 Fix `_RandomWords.probs` type annotations for Python 3.6 Also remove an import that I forgot to move to the top of the module 😅 * more fixes for matchers and other pipeline components * quick fix for entity linker * fixing types for spancat, textcat, etc * bugfix for tok2vec * type annotations for scorer * add runtime_checkable for Protocol * type and import fixes in tests * mypy fixes for training utilities * few fixes in util * fix import * 🐵 Remove unused `# type: ignore` directives * 🏷 Annotate `Language._components` * 🏷 Annotate `spacy.pipeline.Pipe` * add doc as property to span.pyi * small fixes and cleanup * explicit type annotations instead of via comment Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: svlandeg <svlandeg@github.com>	2021-10-14 15:21:40 +02:00
Adriane Boyd	d98d525bc8	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.1-3	2021-10-14 09:41:46 +02:00
Edward	72711dc2c9	Update universe example codes (#9422 ) * Update universe plugins * Adjust azure trigger * Add init to tests/universe * deliberately trying to break the universe to see if the CI catches it * revert Co-authored-by: svlandeg <svlandeg@github.com>	2021-10-13 16:29:19 +02:00
Paul O'Leary McCann	b53e39455e	Fix UD POS docs links (fix #9013 ) (#9407 ) * Fix UD POS docs links (fix #9013) The previous link seems to have been for UD v1. * Fix link	2021-10-11 11:51:19 +02:00
Adriane Boyd	fd7edbc645	Fix types descriptions of sm and sent models (#9401 )	2021-10-11 11:17:18 +02:00
Adriane Boyd	a5231cb044	Remove traces of lexemes from vocab serialization (#9400 )	2021-10-11 11:13:35 +02:00
Adriane Boyd	ae1b3e960b	Update overwrite and scorer in API docs (#9384 ) * Update overwrite and scorer in API docs * Rephrase morphologizer extend + example	2021-10-11 10:35:07 +02:00
Sofie Van Landeghem	f87ae3cb7d	Doc fixes in convert API (#9350 ) * add more info on the spacy debug command * formatting	2021-10-06 13:13:18 +09:00
Elia Robyn Lake (Robyn Speer)	53b5f245ed	Allow IETF language codes, aliases, and close matches (#9342 ) * use language-matching to allow language code aliases Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * link to "IETF language tags" in docs Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * Make requirements consistent Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * change "two-letter language ID" to "IETF language tag" in language docs Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * use langcodes 3.2 and handle language-tag errors better Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * all unknown language codes are ImportErrors Signed-off-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Elia Robyn Speer <elia@explosion.ai>	2021-10-05 09:52:22 +02:00
Paul O'Leary McCann	1ee6541ab0	Moving Japanese tokenizer extra info to Token.morph (#8977 ) * Use morph for extra Japanese tokenizer info Previously Japanese tokenizer info that didn't correspond to Token fields was put in user data. Since spaCy core should avoid touching user data, this moves most information to the Token.morph attribute. It also adds the normalized form, which wasn't exposed before. The subtokens, which are a list of full tokens, are still added to user data, except with the default tokenizer granularity. With the default tokenizer settings the subtokens are all None, so in this case the user data is simply not set. * Update tests Also adds a new test for norm data. * Update docs * Add Japanese morphologizer factory Set the default to `extend=True` so that the morphologizer does not clobber the values set by the tokenizer. * Use the norm_ field for normalized forms Before this commit, normalized forms were put in the "norm" field in the morph attributes. I am not sure why I did that instead of using the token morph, I think I just forgot about it. * Skip test if sudachipy is not installed * Fix import Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-01 19:19:26 +02:00
Paul O'Leary McCann	6e833b617a	Updating Troubleshooting Docs (#9329 ) * Add link to Discussions FAQ * Remove old FAQ entries I think these are no longer relevant. - no-cache-dir: affected pip versions are very old now - narrow unicode: not an issue from py3.3+ - utf-8 osx: upstream bug closed in 2019 Some of the other issues are also maybe not frequent.	2021-10-01 12:28:22 +02:00
Paul O'Leary McCann	78a88f7de7	Fix invalid json	2021-09-30 15:23:55 +09:00
Martin Vallone	a14ab7e882	Adding PhruzzMatcher to spaCy universe (#9321 ) * Adding PhruzzMatcher to spaCy universe * Fixes to make the package work properly	2021-09-30 13:46:53 +09:00
Elia Robyn Lake (Robyn Speer)	5b0b0ca809	Move WandB loggers into spacy-loggers (#9223 ) * factor out the WandB logger into spacy-loggers Signed-off-by: Elia Robyn Speer <gh@arborelia.net> * depend on spacy-loggers so they are available Signed-off-by: Elia Robyn Speer <gh@arborelia.net> * remove docs of spacy.WandbLogger.v2 (moved to spacy-loggers) Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * Version number suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * update references to WandbLogger Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * make order of deps more consistent Signed-off-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-09-29 11:12:50 +02:00
Adriane Boyd	03f234b739	Merge remote-tracking branch 'upstream/master' into develop	2021-09-27 09:10:45 +02:00
Ines Montani	6bb0324b81	Adjust kb_id visualizer templating and docs	2021-09-23 11:59:02 +02:00
Ines Montani	beb4a8c524	Merge pull request #9199 from shigapov/master (resolves #9129 )	2021-09-23 19:41:53 +10:00
Philip Vollet	d2adfe1efa	Add projects to spaCy Universe (#9269 ) * Added spaCy Universe projects * Added user license agreement Philip Vollet * Update website/meta/universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/meta/universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/meta/universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-09-23 10:56:45 +02:00
Edward	8bda39f088	Update Hammurabi example code to v3 (#9218 ) * Update Hammurabi example code * Fix typo	2021-09-16 13:32:44 +02:00
Jozef Harag	865cfbc903	feat: add `spacy.WandbLogger.v3` with optional `run_name` and `entity` parameters (#9202 ) * feat: add `spacy.WandbLogger.v3` with optional `run_name` and `entity` parameters * update versioning in docs Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2021-09-16 12:26:41 +02:00
Paul O'Leary McCann	1d57d78758	Make docs consistent (fix #9126 )	2021-09-16 15:54:12 +09:00
Renat Shigapov	d5cc009faf	Merge branch 'explosion:master' into master	2021-09-13 08:43:48 +02:00
Renat Shigapov	e61d93f8c3	add NEL-visualisation to manual-usage	2021-09-13 08:38:58 +02:00
Paul O'Leary McCann	f89e1c34c9	Minor typo fix in docs	2021-09-11 14:22:05 +09:00
Renat Shigapov	646f3a54db	added spaCyOpenTapioca (#9181 ) * add spaCyOpenTapioca to universe * add agreement * fix misprint in tags	2021-09-11 13:16:51 +09:00
mylibrar	ee28aac68e	Update example code of forte (#9175 ) Co-authored-by: Suqi Sun <suqi.sun@petuum.com>	2021-09-11 13:13:13 +09:00
Renat Shigapov	c1927fe994	fix misprint in tags	2021-09-09 15:37:34 +02:00
Renat Shigapov	ea58294076	add spaCyOpenTapioca to universe	2021-09-09 15:13:18 +02:00
Sofie Van Landeghem	8895e3c9ad	matcher doc corrections (#9115 ) * update error message to current UX * clarify uppercase effect * fix docstring	2021-09-02 09:26:33 +02:00
Robyn Speer	d60b748e3c	Fix surprises when asking for the root of a git repo (#9074 ) * Fix surprises when asking for the root of a git repo In the case of the first asset I wanted to get from git, the data I wanted was the entire repository. I tried leaving "path" blank, which gave a less-than-helpful error, and then I tried `path: "/"`, which started copying my entire filesystem into the project. The path I should have used was "". I've made two changes to make this smoother for others: - The 'path' within a git clone defaults to "" - If the path points outside of the tmpdir that the git clone goes into, we fail with an error Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * use a descriptive error instead of a default plus some minor fixes from PR review Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * check for None values in assets Signed-off-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Elia Robyn Speer <elia@explosion.ai>	2021-09-01 22:52:08 +02:00
Paul O'Leary McCann	ba6a37d358	Document Assigned Attributes of Pipeline Components (#9041 ) * Add textcat docs * Add NER docs * Add Entity Linker docs * Add assigned fields docs for the tagger This also adds a preamble, since there wasn't one. * Add morphologizer docs * Add dependency parser docs * Update entityrecognizer docs This is a little weird because `Doc.ents` is the only thing assigned to, but it's actually a bidirectional property. * Add token fields for entityrecognizer * Fix section name * Add entity ruler docs * Add lemmatizer docs * Add sentencizer/recognizer docs * Update website/docs/api/entityrecognizer.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/entityruler.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/tagger.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/entityruler.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update type for Doc.ents This was `Tuple[Span, ...]` everywhere but `Tuple[Span]` seems to be correct. * Run prettier * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Run prettier * Add transformers section This basically just moves and renames the "custom attributes" section from the bottom of the page to be consistent with "assigned attributes" on other pages. I looked at moving the paragraph just above the section into the section, but it includes the unrelated registry additions, so it seemed better to leave it unchanged. * Make table header consistent Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-09-01 12:09:39 +02:00
Davide Fiocco	1dd69be1f1	Fix point typo on docbin docs (#9097 )	2021-08-31 10:55:44 +02:00
Meenal Jhajharia	2613f0e98f	benepar usage example has deprecated imports	2021-08-28 16:35:58 +05:30
Sofie Van Landeghem	1e974de837	config is not Optional (#9024 )	2021-08-27 11:44:31 +02:00
Sofie Van Landeghem	4d39430b82	Document use-case of freezing tok2vec (#8992 ) * update error msg * add sentence to docs * expand note on frozen components	2021-08-26 09:50:35 +02:00
Sofie Van Landeghem	94fb840443	fix docs for Span constructor arguments (#9023 )	2021-08-25 16:06:22 +02:00
Sofie Van Landeghem	de025beb5f	Warn and document spangroup.doc weakref (#8980 ) * test for error after Doc has been garbage collected * warn about using a SpanGroup when the Doc has been garbage collected * add warning to the docs * rephrase slightly * raise error instead of warning * update * move warning to doc property	2021-08-20 11:06:19 +02:00
Paul O'Leary McCann	37fe847af4	Fix type annotation in docs	2021-08-20 15:34:22 +09:00
Ines Montani	f2b61b77a5	Fix universe.json [ci skip]	2021-08-20 11:26:29 +10:00
Baltazar	71e65fe943	added spacy api v3 docker	2021-08-19 21:29:25 +02:00
Paul O'Leary McCann	9391998c77	Add notes on preparing training data to docs (#8964 ) * Add training data section Not entirely sure this is in the right location on the page - maybe it should be after quickstart? * Add pointer from binary format to training data section * Minor cleanup * Add to ToC, fix filename * Update website/docs/usage/training.md Co-authored-by: Ines Montani <ines@ines.io> * Update website/docs/usage/training.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/usage/training.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Move the training data section further down the page * Update website/docs/usage/training.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/usage/training.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Run prettier Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-08-16 17:37:21 +02:00
Lasse	839ea0f987	change tags formatting to match	2021-08-13 14:40:08 +02:00
Lasse	70ab596f61	Merge branch 'master' of https://github.com/HLasse/spaCy	2021-08-13 14:35:21 +02:00
Lasse	195e4e48c3	add textdescriptives to universe	2021-08-13 14:35:18 +02:00
Adriane Boyd	b278f31ee6	Document scorers in registry and components from #8766 (#8929 ) * Document scorers in registry and components from #8766 * Update spacy/pipeline/lemmatizer.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/dependencyparser.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Reformat Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-08-12 12:50:03 +02:00
Ines Montani	4f769ff913	Update Prodigy project template for v1.11 [ci skip]	2021-08-12 13:46:20 +10:00
Paul O'Leary McCann	e227d24d43	Allow passing in array vars for speedup (#8882 ) * Allow passing in array vars for speedup This fixes #8845. Not sure about the docstring changes here... * Update docs Types maybe need more detail? Maybe not? * Run prettier on docs * Update spacy/tokens/span.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-08-10 15:13:53 +02:00
Paul O'Leary McCann	6029cfc391	Add scores to output in spancat (#8855 ) * Add scores to output in spancat This exposes the scores as an attribute on the SpanGroup. Includes a basic test. * Add basic doc note * Vectorize score calcs * Add "annotation format" section * Update website/docs/api/spancategorizer.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Clean up doc section * Ran prettier on docs * Get arrays off the gpu before iterating over them * Remove int() calls Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-08-10 13:47:49 +02:00
Duygu Altinok	380b2817cf	updated unv json for new book	2021-08-09 12:39:22 +02:00
Paul O'Leary McCann	cac298471f	Fix #8902 (bad link in docs) typo fix	2021-08-08 22:04:00 +09:00
Adriane Boyd	175847f92c	Support list values and INTERSECTS in Matcher (#8784 ) * Support list values and IS_INTERSECT in Matcher * Support list values as token attributes for set operators, not just as pattern values. * Add `IS_INTERSECT` operator. * Fix incorrect `ISSUBSET` and `ISSUPERSET` in schema and docs. * Rename IS_INTERSECT to INTERSECTS	2021-08-02 19:39:26 +02:00
Ines Montani	30f20496d5	Merge pull request #8840 from polm/docs/evaluate-speed [ci skip]	2021-07-30 09:10:15 +10:00
Ines Montani	65d163fab5	Adjust formatting [ci skip]	2021-07-30 09:10:04 +10:00
Ines Montani	3a701d3645	Merge pull request #8841 from adrianeboyd/docs/ent-id-sep [ci skip] Fix formatting of ent_id_sep in EntityRuler API docs	2021-07-30 09:09:25 +10:00
thomashacker	02258916c8	Fix example config typo for transformer architecture	2021-07-29 11:19:40 +02:00
Adriane Boyd	15b12f3e35	Fix formatting of ent_id_sep in EntityRuler API docs	2021-07-29 10:10:12 +02:00
Paul O'Leary McCann	a60cb13910	Update speed entry in metrics table	2021-07-29 16:35:19 +09:00
Paul O'Leary McCann	e125313a50	Revert "Add note about SPEED in output" This reverts commit `c92d268176`.	2021-07-29 16:34:08 +09:00
Ines Montani	0a1e299d30	Merge pull request #8814 from polm/docs/migrate-lexeme-tables [ci skip]	2021-07-29 17:18:02 +10:00
Paul O'Leary McCann	c92d268176	Add note about SPEED in output In #8823 it was pointed out that the `SPEED` value wasn't documented anywhere.	2021-07-29 15:03:07 +09:00
Paul O'Leary McCann	8867e60fbb	Update website/docs/usage/v3.md Co-authored-by: Ines Montani <ines@ines.io>	2021-07-29 14:56:56 +09:00
Adriane Boyd	8547514aa4	Remove labels from textcat component config example (#8815 )	2021-07-27 13:14:38 +02:00
Paul O'Leary McCann	76ac95923a	Add note to migration guide about lexeme tables (fix #7290 ) This just adds the resolution from #6388 to the docs.	2021-07-27 19:19:25 +09:00
Paul O'Leary McCann	67ecdcc3ac	Update subset/superset docs (#8795 ) * Update subset/superset docs * Update website/docs/usage/rule-based-matching.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-07-27 12:08:46 +02:00
Ines Montani	134cb06af3	Merge pull request #8808 from kevinlu1248/master [ci skip] Changed a CLI command in data-formats.md due to erroneous information	2021-07-27 12:15:16 +10:00
Kevin Lu	4a8e9e4e4e	Update data-formats.md	2021-07-25 22:58:53 -07:00
Ledenel	413f745c68	fix broken example in spaCy universe Chatterbot	2021-07-25 15:53:32 +00:00
Paul O'Leary McCann	d717593eb7	Merge pull request #8754 from KennethEnevoldsen/patch-1 [minor] removed outdated spacy version for spacymoji	2021-07-18 19:17:33 +09:00
Kenneth Enevoldsen	5d6aed0773	fixed GitHub link and thumbnail Sorry, I seem to have misunderstood that the GitHub reference shouldn't be a link.	2021-07-18 10:22:00 +02:00
Ines Montani	313f55e560	Fix JSON [ci skip]	2021-07-18 13:21:33 +10:00
Ines Montani	51e5903d6f	Merge pull request #8702 from KennethEnevoldsen/master [ci skip]	2021-07-18 13:18:42 +10:00
Kenneth Enevoldsen	8546948fba	removed outdated spacy version for spacymoji From the documentation of spacymoji (and the requirements.txt) it seems like it is not only for version 2.	2021-07-17 15:19:43 +02:00
Kenneth Enevoldsen	a0e0ccdb46	Update website/meta/universe.json Co-authored-by: Ines Montani <ines@ines.io>	2021-07-17 07:14:46 +02:00
Mario Šaško	1ba2e8a646	Add TakeLab/spacy-udpipe to Universe (#8698 ) * Add TakeLab/spacy-udpipe to universe * Add SCA * Sign SCA	2021-07-16 11:15:52 +02:00
Adriane Boyd	f5acc48111	Remove TrainablePipe as base class for Lemmatizer in API docs (#8725 )	2021-07-15 16:41:36 +02:00
Sofie Van Landeghem	77859beb99	spacy.ngram_range_suggester.v1 (#8699 )	2021-07-15 10:01:22 +02:00
Ines Montani	2a8eeed5da	Merge pull request #8703 from thomashacker/update/spacy-stanza [ci skip] Update spacy-stanza universe.json	2021-07-13 19:03:42 +10:00
thomashacker	aafb89df78	Update universe.json code_example	2021-07-13 10:22:49 +02:00
Kenneth Enevoldsen	94ce904e10	added missing comma	2021-07-13 09:59:34 +02:00
Kenneth Enevoldsen	a81fcc81b0	added dacy to universe	2021-07-13 09:54:08 +02:00
Ines Montani	50000d37e4	Avoid double parentheses [ci skip]	2021-07-10 10:52:01 +10:00
Calum Sieppert	e2d53aa1a6	Typo fixes	2021-07-09 10:25:56 -06:00
Adriane Boyd	1ee5bee29d	Add Macedonian models to website (#8637 )	2021-07-08 09:32:14 +02:00
Paul O'Leary McCann	1d9209d43a	Merge pull request #8547 from mylibrar/update-universe Add forte to universe.json	2021-07-08 14:59:49 +09:00
Ines Montani	39c8f7949e	Add code preview for textcat_multilabel [ci skip]	2021-07-08 13:33:25 +10:00
Calum Sieppert	889c187bc2	Typo fixes	2021-07-07 16:53:04 -06:00
Adriane Boyd	6db647dfe0	Update v3.1 usage docs	2021-07-07 08:43:33 +02:00
Sofie Van Landeghem	64fac754fe	add spacy prefix to ngram_suggester.v1 (#8623 )	2021-07-07 08:09:30 +02:00
Sofie Van Landeghem	e7d747e3ee	TransitionBasedParser.v1 to legacy (#8586 ) * TransitionBasedParser.v1 to legacy * register sublayers * bump spacy-legacy to 3.0.7	2021-07-06 15:26:45 +02:00
Ines Montani	04a9ade40f	Merge pull request #8466 from explosion/docs/new-in-v3-1 [ci skip]	2021-07-06 22:20:24 +10:00
Sofie Van Landeghem	b9f59118bf	Fix silent evaluation (#8581 ) * fix silentness * sneak in docs typo fix * pass silent boolean instead	2021-07-06 14:16:19 +02:00
Adriane Boyd	29906884c5	Raise an error for textcat with <2 labels (#8584 ) * Raise an error for textcat with <2 labels Raise an error if initializing a `textcat` component without at least two labels. * Add similar note to docs * Update positive_label description in API docs	2021-07-06 12:35:22 +02:00
Ines Montani	5bb7fe4b41	Update with HF hub integration [ci skip]	2021-07-06 19:30:59 +10:00
Cass	7d13fc799b	Fix a command typo in models.md "dowmload" -> "download"	2021-07-05 18:44:18 -07:00
Ines Montani	8423864b50	Add docs notes on installing models from Python and in Jupyter [ci skip] (#8597 )	2021-07-05 13:49:20 +02:00
Yoichiro Hasebe	596e04cbb4	Github repo info fixed for ruby-spacy	2021-07-04 18:55:17 +09:00
Yoichiro Hasebe	2bdfa42107	Update universe.json	2021-07-04 08:44:39 +09:00
Suqi Sun	3901507df8	Update pip	2021-06-30 16:44:43 -04:00
Suqi Sun	61c868ed75	Update pip and code example	2021-06-30 14:49:51 -04:00
Ines Montani	af9d984407	Merge pull request #8405 from svlandeg/fix/whitespace_tokenizer [ci skip]	2021-06-30 20:52:59 +10:00
Suqi Sun	4331c40b78	Add forte to universe.json	2021-06-29 16:17:22 -04:00
Adriane Boyd	41292a1b84	Add note about updating with fill-config	2021-06-29 10:45:36 +02:00
Nick Sorros	bb781ae7f7	Remove extra parenthesis from the example for spacy-streamlit (#8527 )	2021-06-28 14:03:31 +02:00
Adriane Boyd	4d1ef8f695	Tidy up docs	2021-06-28 12:08:15 +02:00
Ines Montani	93572dc12a	Merge pull request #8505 from bryant1410/patch-2 [ci skip] Fix double slash in model release web page	2021-06-28 12:51:06 +10:00
Kevin	1a3e7cc5ef	Updated PyATE syntax to fit spaCy V3	2021-06-26 17:52:41 -07:00
Santiago Castro	2e71944e1e	Fix double slash in model release web page	2021-06-25 19:19:10 -07:00
Ines Montani	4544412442	Update wording [ci skip] Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-06-25 13:52:48 +10:00
Ines Montani	0d2e2b59bc	Update intro [ci skip]	2021-06-24 22:53:20 +10:00
Matthew Honnibal	f9946154d9	Add SpanCategorizer component (#6747 ) * Draft spancat model * Add spancat model * Add test for extract_spans * Add extract_spans layer * Upd extract_spans * Add spancat model * Add test for spancat model * Upd spancat model * Update spancat component * Upd spancat * Update spancat model * Add quick spancat test * Import SpanCategorizer * Fix SpanCategorizer component * Import SpanGroup * Fix span extraction * Fix import * Fix import * Upd model * Update spancat models * Add scoring, update defaults * Update and add docs * Fix type * Update spacy/ml/extract_spans.py * Auto-format and fix import * Fix comment * Fix type * Fix type * Update website/docs/api/spancategorizer.md * Fix comment Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Better defense Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix labels list Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/ml/extract_spans.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/pipeline/spancat.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Set annotations during update * Set annotations in spancat * fix imports in test * Update spacy/pipeline/spancat.py * replace MaxoutLogistic with LinearLogistic * fix config * various small fixes * remove set_annotations parameter in update * use our beloved tupley format with recent support for doc.spans * bugfix to allow renaming the default span_key (scores weren't showing up) * use different key in docs example * change defaults to better-working parameters from project (WIP) * register spacy.extract_spans.v1 for legacy purposes * Upd dev version so can build wheel * layers instead of architectures for smaller building blocks * Update website/docs/api/spancategorizer.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/spancategorizer.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Include additional scores from overrides in combined score weights * Parameterize spans key in scoring Parameterize the `SpanCategorizer` `spans_key` for scoring purposes so that it's possible to evaluate multiple `spancat` components in the same pipeline. * Use the (intentionally very short) default spans key `sc` in the `SpanCategorizer` * Adjust the default score weights to include the default key * Adjust the scorer to use `spans_{spans_key}` as the prefix for the returned score * Revert addition of `attr_name` argument to `score_spans` and adjust the key in the `getter` instead. Note that for `spancat` components with a custom `span_key`, the score weights currently need to be modified manually in `[training.score_weights]` for them to be available during training. To suppress the default score weights `spans_sc_p/r/f` during training, set them to `null` in `[training.score_weights]`. * Update website/docs/api/scorer.md * Fix scorer for spans key containing underscore * Increment version * Add Spans to Evaluate CLI (#8439) * Add Spans to Evaluate CLI * Change to spans_key * Add spans per_type output Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Fix spancat GPU issues (#8455) * Fix GPU issues * Require thinc >=8.0.6 * Switch to glorot_uniform_init * Fix and test ngram suggester * Include final ngram in doc for all sizes * Fix ngrams for docs of the same length as ngram size * Handle batches of docs that result in no ngrams * Add tests Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Nirant <NirantK@users.noreply.github.com>	2021-06-24 12:35:27 +02:00
Ines Montani	68721af628	Formatting and preliminary intro [ci skip]	2021-06-24 20:32:23 +10:00
Adriane Boyd	92dc6b409e	Notes on source with vectors	2021-06-24 10:34:07 +02:00
Adriane Boyd	35425d7e26	Add details for Catalan and Danish	2021-06-24 10:10:33 +02:00
Ines Montani	5daf450f51	Update upgrading notes [ci skip]	2021-06-24 18:06:28 +10:00
Ines Montani	528746129d	Merge branch 'master' into docs/new-in-v3-1	2021-06-24 13:11:37 +10:00
Ines Montani	a8e8d02ba7	Merge pull request #8465 from explosion/feature/spacy-package-readme	2021-06-24 13:11:08 +10:00
Ines Montani	3e058dee62	Update features [ci skip]	2021-06-24 12:36:04 +10:00
Ines Montani	40f13c3f0c	Add docs [ci skip]	2021-06-24 11:57:15 +10:00
Ines Montani	a1e4aca267	Fix sentence [ci skip]	2021-06-24 11:40:36 +10:00
Ines Montani	ca0d904faa	Update details [ci skip]	2021-06-23 13:05:56 +10:00
themrmax	d96c422cfc	Fix broken link change /api/registry to /api/top-level#registry	2021-06-22 15:34:06 -07:00
Ines Montani	e9b68d4f4c	Update details and add example [ci skip]	2021-06-22 17:51:03 +10:00
Nick Sorros	31504f5982	Switch model and data path in prodigy project.yml recipe (#8467 )	2021-06-22 09:41:45 +02:00
Ines Montani	bc93c34f54	Add "New in v3.1" guide	2021-06-22 15:23:18 +10:00
Adriane Boyd	e39d1bd4ab	Various docs updates for v3.1 (#8406 ) * Update for Catalan/Italian lemmatizer changes * Add warning about relevance of section	2021-06-21 09:33:50 +02:00
Ines Montani	02d2fdb123	Add link anchor [ci skip]	2021-06-20 11:29:19 +10:00
Matthew Honnibal	6f5e308d17	Support negative examples in partial NER annotations (#8106 ) * Support a cfg field in transition system * Make NER 'has gold' check use right alignment for span * Pass 'negative_samples_key' property into NER transition system * Add field for negative samples to NER transition system * Check neg_key in NER has_gold * Support negative examples in NER oracle * Test for negative examples in NER * Fix name of config variable in NER * Remove vestiges of old-style partial annotation * Remove obsolete tests * Add comment noting lack of support for negative samples in parser * Additions to "neg examples" PR (#8201) * add custom error and test for deprecated format * add test for unlearning an entity * add break also for Begin's cost * add negative_samples_key property on Parser * rename * extend docs & fix some older docs issues * add subclass constructors, clean up tests, fix docs * add flaky test with ValueError if gold parse was not found * remove ValueError if n_gold == 0 * fix docstring * Hack in environment variables to try out training * Remove hack * Remove NER hack, and support 'negative O' samples * Fix O oracle * Fix transition parser * Remove 'not O' from oracle * Fix NER oracle * check for spans in both gold.ents and gold.spans and raise if so, to prevent memory access violation * use set instead of list in consistency check Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-06-17 17:33:00 +10:00
svlandeg	bb9d2f1546	extend example to ensure the text is preserved	2021-06-16 23:56:35 +02:00
Sofie Van Landeghem	e796aab4b3	Resizable textcat (#7862 ) * implement textcat resizing for TextCatCNN * resizing textcat in-place * simplify code * ensure predictions for old textcat labels remain the same after resizing (WIP) * fix for softmax * store softmax as attr * fix ensemble weight copy and cleanup * restructure slightly * adjust documentation, update tests and quickstart templates to use latest versions * extend unit test slightly * revert unnecessary edits * fix typo * ensemble architecture won't be resizable for now * use resizable layer (WIP) * revert using resizable layer * resizable container while avoid shape inference trouble * cleanup * ensure model continues training after resizing * use fill_b parameter * use fill_defaults * resize_layer callback * format * bump thinc to 8.0.4 * bump spacy-legacy to 3.0.6	2021-06-16 11:45:00 +02:00
svlandeg	29d83dec0c	adjust whitespace tokenizer to avoid sep in split()	2021-06-16 10:58:45 +02:00
Adriane Boyd	5646fcbe46	Merge remote-tracking branch 'upstream/develop' into chore/develop-into-master-v3.1	2021-06-15 15:05:17 +02:00
Sofie Van Landeghem	0fd0d949c4	fix 's typo's across code base (#8384 )	2021-06-15 10:57:08 +02:00
Adriane Boyd	507422149f	Various docs updates for v3.0 (#8353 ) * Update cats score names in Scorer API docs * Refer to performance in meta * Update package naming/versions, lemmatizer details * Minor formatting fixes * Provide more explanation for cats_score_desc * Provide language-specific lemmatizer defaults in API docs Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2021-06-14 12:19:36 +02:00
Adriane Boyd	63d748f80e	Add Catalan and Danish trf to website models (#8378 )	2021-06-14 09:50:13 +02:00
Ines Montani	3259faad42	Update YouTube embed [ci skip]	2021-06-14 10:21:01 +10:00
Ines Montani	7f0f674a1b	Fix universe.json and auto-format [ci skip]	2021-06-14 10:18:06 +10:00
Francisco Aranda	0a1a4c665d	update spacy-wordnet code example (#8327 ) * update spacy-wordnet code example - include spaCy 2.x and 3.x init alternatives - upgrade recognai logo * fix escape chars	2021-06-10 21:53:11 +02:00
Paul O'Leary McCann	5aba213349	Fix skweak Github URL Github entry should not contain url, just user/repo	2021-05-31 18:00:43 +09:00
Kristian Boda	dc8d8d15d2	Add hmrb to spaCy Universe (#8129 ) * docs: add hmrb to spacy universe * docs: add sentence on spacy versions * docs: update description and images * misc: add spaCy Contributor Agreement	2021-05-31 18:40:48 +10:00
Sofie Van Landeghem	3c58c0323f	fix docs (#8200 )	2021-05-27 10:48:59 +02:00
Paul O'Leary McCann	0c553ecd4e	Fix docs (fix #8189 )	2021-05-24 19:47:30 +09:00
Sofie Van Landeghem	202943bc8c	KB & NEL to/from bytes (#8113 ) * unit test for pickling KB * add pickling test for NEL * KB to_bytes and from_bytes * NEL to_bytes and from_bytes * xfail pickle tests for now * fix docs * cleanup	2021-05-20 18:11:30 +10:00
Adriane Boyd	6baab565eb	Minor updates to quickstart settings/instructions (#7965 ) * Minor updates to quickstart settings/instructions * set default value of textcat exclusive to `false` until the default checkbox behavior is updated * add the `morphologizer` to the list of components * add a note that v3.0.6+ is required * Switch to warning above quickstart * Undo changes to textcat default in quickstart Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-05-17 16:55:22 +02:00
Julien Salinas	c496f78245	Add NLP Cloud to Universe.	2021-05-14 11:13:44 +02:00
Frederic R. Hopp	c5962b9fba	Update universe.json fixed typo	2021-05-13 07:40:05 -07:00
Frederic R. Hopp	a9ca221e03	Update universe.json Added more detailed description to eMFDscore project	2021-05-12 09:20:17 -07:00
Frederic R. Hopp	7bba9cdc14	Update universe.json	2021-05-11 19:18:19 -07:00
Ines Montani	3883d49446	Fix default transformer in quickstart generator (resolves #8018 ) [ci skip]	2021-05-11 11:27:08 +10:00
Adriane Boyd	71c2a3ab47	Fix new version for match_alignments (#8021 )	2021-05-07 09:55:20 +02:00
Jeno Pizarro	5cf76ab608	Update negspacy example code for spaCy 3.0 (#8022 )	2021-05-07 09:33:21 +02:00
Sofie Van Landeghem	02a6a5fea0	Fix 'debug model' for transformers + generalize (#7973 ) * add overrides to docs * fix debug model with transformer * assume training data is set in config	2021-05-06 18:43:32 +10:00
Paul O'Leary McCann	66bfabd839	Fix pretraining objectives fragment (#8005 ) * Fix pretraining objectives fragment The fragment here is reused from a heading higher up, so you couldn't link to this section. * Fix section link to new fragment	2021-05-06 08:27:36 +02:00
meghanabhange	debaab7021	Update details in universe denomme \| Multilingual Name Detection (#7982 ) * Add denomme * spaCy contributor agreement * Update install and thumb Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-05-05 17:12:13 +02:00
Ines Montani	12d3d0fedd	Fix quickstart default checked of conditional fields [ci skip]	2021-05-03 11:48:12 +10:00
Adriane Boyd	2320791f6d	Fix Transformer.initialize example (#7963 )	2021-04-30 12:21:31 +02:00
Adriane Boyd	95c0833656	Add training option to set annotations on update (#7767 ) * Add training option to set annotations on update Add a `[training]` option called `set_annotations_on_update` to specify a list of components for which the predicted annotations should be set on `example.predicted` immediately after that component has been updated. The predicted annotations can be accessed by later components in the pipeline during the processing of the batch in the same `update` call. * Rename to annotates / annotating_components * Add test for `annotating_components` when training from config * Add documentation	2021-04-26 16:53:53 +02:00
Adriane Boyd	bdb485cc80	Add callback to copy vocab/tokenizer from model (#7750 ) * Add callback to copy vocab/tokenizer from model Add callback `spacy.copy_from_base_model.v1` to copy the tokenizer settings and/or vocab (including vectors) from a base model. * Move spacy.copy_from_base_model.v1 to spacy.training.callbacks * Add documentation * Modify to specify model as tokenizer and vocab params	2021-04-22 12:36:50 +02:00
Adriane Boyd	f68fc29130	Update sent_starts in Example.from_dict (#7847 ) * Update sent_starts in Example.from_dict Update `sent_starts` for `Example.from_dict` so that `Optional[bool]` values have the same meaning as for `Token.is_sent_start`. Use `Optional[bool]` as the type for sent start values in the docs. * Use helper function for conversion to ternary ints	2021-04-22 11:32:45 +02:00
Adriane Boyd	d2bdaa7823	Replace negative rows with 0 in StaticVectors (#7674 ) * Replace negative rows with 0 in StaticVectors Replace negative row indices with 0-vectors in `StaticVectors`. * Increase versions related to StaticVectors * Increase versions of all architectures and layers related to `StaticVectors` * Improve efficiency of 0-vector operations Parallel `spacy-legacy` PR: https://github.com/explosion/spacy-legacy/pull/5 * Update config defaults to new versions * Update docs	2021-04-22 18:04:15 +10:00
Sofie Van Landeghem	6f565cf39d	fix typo in entity_linker docs	2021-04-22 09:59:24 +02:00
Sofie Van Landeghem	2e746dbf32	update EL training data format in docs (#7839 ) * update EL training data format * fix typo * all -1 because reasons	2021-04-22 08:50:09 +02:00
meghanabhange	49ff1126bf	Project Idea : denomme \| Multilingual Name Detection (#7845 ) * Add denomme * spaCy contributor agreement Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-04-22 08:48:17 +02:00
Sam Edwardes	b8c6c10c6f	Added a logo to spaCyTextBlob (#7818 ) * Added a logo to spaCyTextBlob * Updated to better thumb	2021-04-22 08:41:55 +02:00
Diego Palma	bbade153ed	Add TRUNAJOD to spaCy universe. (#7754 ) * Add TRUNAJOD to spaCy universe. * Add trunajod logo and thumb. Co-authored-by: Diego <dpalma@evernote.com>	2021-04-22 08:40:28 +02:00
Ines Montani	a9e5ae9b5c	Auto-format [ci skip]	2021-04-22 10:58:05 +10:00
Pierre Lison	debfb46088	adding skweak to the SpaCy universe	2021-04-22 00:58:09 +02:00
Shantam Raj	6017fcf693	Default code for Setting Entity annotations on the website errors (#7738 ) * the default example for "Setting entity annotations" errors on Binder * updating contributor info * using a new variable to store original entities	2021-04-21 09:16:32 +02:00
hudsonr	2722424ec5	Added universe entry for Coreferee	2021-04-19 14:28:06 +02:00
langdonholmes	df541c6b5e	Update processing-pipelines.md to mention method for doc metadata (#7480 ) * Update processing-pipelines.md Under "things to try," inform users they can save metadata when using nlp.pipe(foobar, as_tuples=True) Link to a new example on the attributes page detailing the following: > ``` > data = [ > ("Some text to process", {"meta": "foo"}), > ("And more text...", {"meta": "bar"}) > ] > > for doc, context in nlp.pipe(data, as_tuples=True): > # Let's assume you have a "meta" extension registered on the Doc > doc._.meta = context["meta"] > ``` from https://stackoverflow.com/questions/57058798/make-spacy-nlp-pipe-process-tuples-of-text-and-additional-information-to-add-as * Updating the attributes section Update the attributes section with example of how extensions can be used to store metadata. * Update processing-pipelines.md * Update processing-pipelines.md Made as_tuples example executable and relocated to the end of the "Processing Text" section. * Update processing-pipelines.md * Update processing-pipelines.md Removed extra line * Reformat and rephrase Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-04-19 11:58:12 +02:00
Adriane Boyd	0e7f94b247	Update Tokenizer.explain with special matches (#7749 ) * Update Tokenizer.explain with special matches Update `Tokenizer.explain` and the pseudo-code in the docs to include the processing of special cases that contain affixes or whitespace. * Handle optional settings in explain * Add test for special matches in explain Add test for `Tokenizer.explain` for special cases containing affixes.	2021-04-19 19:08:20 +10:00
Sofie Van Landeghem	c786e98e56	assemble CLI command (#7783 ) * assemble CLI command * ensure assemble runs even without training section * cleanup	2021-04-19 18:39:11 +10:00
Bram Vanroy	ed561cf428	Terminology: deprecated vs obsolete (#7621 ) * Terminology: deprecated vs obsolete Typically, deprecated is used for functionality that is bound to become unavailable but that can still be used. Obsolete is used for features that have been removed. In E941, I think what is meant is "obsolete" since loading a model by a shortcut simply does not work anymore (and throws an error). This is different from downloading a model with a shortcut, which is deprecated but still works. In light of this, perhaps all other error codes should be checked as well. * clarify that the link command is removed and not just deprecated Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2021-04-12 14:37:00 +02:00
Adriane Boyd	673e2bc4c0	Add usage docs for streamed train corpora (#7693 )	2021-04-09 16:15:38 +02:00
Sofie Van Landeghem	3e5bd5055e	expand quickstart widget with cuda 11.1 and 11.2 (#7615 )	2021-04-08 12:25:42 +02:00
Sofie Van Landeghem	204c2f116b	Extend score_spans for overlapping & non-labeled spans (#7209 ) * extend span scorer with consider_label and allow_overlap * unit test for spans y2x overlap * add score_spans unit test * docs for new fields in scorer.score_spans * rename to include_label * spell out if-else for clarity * rename to 'labeled' Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-04-08 12:19:17 +02:00
broaddeep	ee159b8543	Support match alignments (#7321 ) * Support match alignments * change naming from match_alignments to with_alignments, add conditional flow if with_alignments is given, validate with_alignments, add related test case * remove added errors, utilize bint type, cleanup whitespace * fix no new line in end of file * Minor formatting * Skip alignments processing if as_spans is set * Add with_alignments to Matcher API docs * Update website/docs/api/matcher.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-04-08 18:10:14 +10:00
Ines Montani	de4f4c9b8a	Add more link anchors [ci skip]	2021-04-06 14:15:21 +10:00
Ines Montani	5bbdd7dc4c	Update pipeline design docs [ci skip]	2021-04-06 14:13:22 +10:00
Ines Montani	1d1cfadbca	Fix formatting [ci skip]	2021-04-06 14:13:13 +10:00
Jaidev Deshpande	93ee74a0a6	Add Numerizer to SpaCy universe (#7650 ) Numerizer is a spaCy extension that converts numbers written in natural language into numeric strings.	2021-04-05 19:02:27 +02:00
Sam Edwardes	f6ad4684bd	Updates to universe.json for spaCyTextBlob (#7647 ) * Updates to universe.json for spaCyTextBlob Updated the documentation for spaCy 3.0. * SamEdwardes.md * Update SamEdwardes.md	2021-04-04 20:17:57 +02:00
Ayush Chaurasia	3c2ce41dd8	W&B integration: Optional support for dataset and model checkpoint logging and versioning (#7429 ) * Add optional artifacts logging * Update docs * Update spacy/training/loggers.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/training/loggers.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/training/loggers.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Bump WandbLogger Version * Add documentation of v1 to legacy docs * bump spacy-legacy to 3.0.2 (to be released) Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2021-04-01 19:36:23 +02:00
vincent d warmerdam	8b3eec6e62	Add Tokenwiser to Projects (#7541 ) * Add tokenwiser * Update universe.json	2021-04-01 14:39:36 +02:00
Sofie Van Landeghem	59c2069eb1	Legacy docs (#7601 ) * document legacy Tok2Vec architectures * add TextCatEnsemble.v1 legacy documentation * Separate legacy section in side bar	2021-03-30 12:43:14 +02:00
Santiago Castro	af07fc3bc1	Add support for CUDA 11.2 (#7583 ) * Add support for CUDA 11.2 * Update the docs * Format Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-03-30 09:47:33 +02:00
Álvaro Abella Bascarán	5b4dde38a3	fix fn name: tokenizer.infixes_finditer -> tokenizer.infix_finditer (#7606 )	2021-03-30 09:45:49 +02:00
Ines Montani	be55f43163	Merge pull request #7473 from adrianeboyd/docs/v3-pipeline-deps-order	2021-03-22 12:43:07 +01:00
Ines Montani	3ee2fcfba0	Merge pull request #7483 from adrianeboyd/docs/various-v3-4 [ci skip]	2021-03-22 12:37:06 +01:00
Ines Montani	88e5a0dc16	Merge pull request #7504 from polm/fix/lexeme-docs [ci skip] Fix mismatched backtick in Lexeme docs	2021-03-22 12:36:44 +01:00
Adriane Boyd	0d2b723e8d	Update entity setting section	2021-03-20 11:38:55 +01:00
Paul O'Leary McCann	e39c0dcf33	Fix mismatched backtick in Lexeme docs	2021-03-20 18:40:00 +09:00
Adriane Boyd	c771ec22f0	Update matcher errors and docs * Mention `tagger+attribute_ruler` in `POS`/`MORPH` error messages for `Matcher` and `PhraseMatcher` * Document `Matcher.__call__(allow_missing=)`	2021-03-19 10:11:18 +01:00
Adriane Boyd	6a9a467766	Update website/docs/usage/processing-pipelines.md Co-authored-by: Ines Montani <ines@ines.io>	2021-03-19 08:12:49 +01:00
Adriane Boyd	6354b642c5	Fix typo	2021-03-18 19:01:10 +01:00
Adriane Boyd	40e5d3a980	Update saving/loading example	2021-03-18 16:56:10 +01:00
Adriane Boyd	0fb1881f36	Reformat processing pipelines	2021-03-18 13:31:42 +01:00
Adriane Boyd	acc58719da	Update custom similarity hooks example	2021-03-18 13:31:42 +01:00
Adriane Boyd	c9e1a9ac17	Add multiprocessing section	2021-03-18 13:31:42 +01:00
Adriane Boyd	9a254d3995	Include all en_core_web_sm components in examples	2021-03-18 13:31:42 +01:00
Adriane Boyd	83c1b919a7	Fix positional/option in CLI types	2021-03-18 13:31:42 +01:00
Adriane Boyd	9fd41d6742	Remove Language.pipe cleanup arg	2021-03-18 13:31:42 +01:00
Adriane Boyd	5da323fd86	Minor edits	2021-03-17 12:59:05 +01:00
Adriane Boyd	a5ffe8dfed	Add details about pretrained pipeline design	2021-03-17 11:31:26 +01:00
Paolo Arduin	00e59be966	Add SpikeX to spaCy universe	2021-03-16 18:22:03 +01:00
bsweileh	61472e7cb3	Update _training.md - Fix broken link on backpropagation (#7431 ) * Update _training.md Fix broken link on backpropagation * Add agreement add spacy contributor agreement	2021-03-15 09:21:35 +01:00
Ines Montani	c67d5a6eb0	Merge pull request #7394 from adrianeboyd/docs/ner-example-data-readme	2021-03-13 04:26:18 +01:00
Ines Montani	068b97a617	Merge pull request #7408 from adrianeboyd/bugfix/load-keyword-only	2021-03-13 04:25:50 +01:00
Adriane Boyd	3168103605	Fix type of spacy train --output in docs	2021-03-12 10:04:57 +01:00
Adriane Boyd	03e9e7b567	Add --code option to init fill-config	2021-03-12 10:03:57 +01:00
Adriane Boyd	124304b146	Add vocab kwarg back to spacy.load * Additional minor formatting and docs cleanup	2021-03-11 10:58:59 +01:00
Adriane Boyd	84470d9b9e	Incorporate BILUO note from #7407	2021-03-11 10:11:21 +01:00
Adriane Boyd	4294bcf4ab	Align keyword-only in docs for init/util	2021-03-11 09:52:40 +01:00
Adriane Boyd	28726c25a1	Update docs for convert CLI and NER examples	2021-03-10 11:42:02 +01:00
Adriane Boyd	d746ea6278	Add warning about GPU selection in Jupyter notebooks (#7075 ) * Initial warning * Update check * Redo edit * Move jupyter warning to helper method * Add link with details to warnings	2021-03-09 15:35:21 +01:00
Sofie Van Landeghem	932887b950	textcat scoring fix and multi_label docs (#6974 ) * add multi-label textcat to menu * add infobox on textcat API * add info to v3 migration guide * small edits * further fixes in doc strings * add infobox to textcat architectures * add textcat_multilabel to overview of built-in components * spelling * fix unrelated warn msg * Add textcat_multilabel to quickstart [ci skip] * remove separate documentation page for multilabel_textcategorizer * small edits * positive label clarification * avoid duplicating information in self.cfg and fix textcat.score * fix multilabel textcat too * revert threshold to storage in cfg * revert threshold stuff for multi-textcat Co-authored-by: Ines Montani <ines@ines.io>	2021-03-09 23:04:22 +11:00
Sofie Van Landeghem	cd70c3cb79	Fixing pretrain (#7342 ) * initialize NLP with train corpus * add more pretraining tests * more tests * function to fetch tok2vec layer for pretraining * clarify parameter name * test different objectives * formatting * fix check for static vectors when using vectors objective * clarify docs * logger statement * fix init_tok2vec and proc.initialize order * test training after pretraining * add init_config tests for pretraining * pop pretraining block to avoid config validation errors * custom errors	2021-03-09 14:01:13 +11:00
Ines Montani	dfb23a419e	Merge branch 'spacy.io' [ci skip]	2021-03-06 17:38:54 +11:00
graue70	7d085d5b1c	Fix typo in docs	2021-03-05 18:30:09 +01:00
vincent d warmerdam	1b0d413e45	Removed Languages that were listed twice on Docs (#7272 ) * removed languages that were listed twice * sorted * d0h * the d0h strikes back when you dont hit save	2021-03-05 14:31:15 +01:00
svlandeg	682a6232e3	fix typo	2021-03-02 17:59:13 +01:00
svlandeg	d900c55061	consistently use registry as callable	2021-03-02 17:56:28 +01:00
graue70	0fddc0447c	Fix copy & paste error in API docs	2021-03-02 14:00:14 +01:00
Ines Montani	8f7c7b2658	Merge pull request #7211 from svlandeg/docs/el_update [ci skip] kb.get_candidates renamed to get_alias_candidates	2021-02-27 11:51:22 +11:00

3166 Commits