spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-01-06 07:16:29 +03:00

Author	SHA1	Message	Date
Adriane Boyd	794cea6907	Fix comments and examples for levenshtein_compare (#12113)	2023-01-18 08:02:33 +01:00
Paul O'Leary McCann	a3b15c9f53	Clarify how `--code` arg works (#12102) * Clarify how `--code` arg works This adds a few sentences to the docs to clarify how the `--code` argument works, including an explanation of how to load custom components in your own code. * Add link to spacy.load docs	2023-01-17 19:30:02 +09:00
Daniël de Kok	5e297aa20e	Add `TrainablePipe.{distill,get_teacher_student_loss}` (#12016 ) * Add `TrainablePipe.{distill,get_teacher_student_loss}` This change adds two methods: - `TrainablePipe::distill` which performs a training step of a student pipe on a teacher pipe, giving a batch of `Doc`s. - `TrainablePipe::get_teacher_student_loss` computes the loss of a student relative to the teacher. The `distill` or `get_teacher_student_loss` methods are also implemented in the tagger, edit tree lemmatizer, and parser pipes, to enable distillation in those pipes and as an example for other pipes. * Fix stray `Beam` import * Fix incorrect import * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * TrainablePipe.distill: use `Iterable[Example]` * Add Pipe.is_distillable method * Add `validate_distillation_examples` This first calls `validate_examples` and then checks that the student/teacher tokens are the same. * Update distill documentation * Add distill documentation for all pipes that support distillation * Fix incorrect identifier * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Add comment to explain `is_distillable` Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-01-16 10:25:53 +01:00
Sofie Van Landeghem	c2f3e699ca	fix anchors (#12095)	2023-01-13 11:14:58 +01:00
Daniël de Kok	319eb508b5	Add a `spacy benchmark speed` subcommand (#11902 ) * Add a `spacy evaluate speed` subcommand This subcommand reports the mean batch performance of a model on a data set with a 95% confidence interval. For reliability, it first performs some warmup rounds. Then it will measure performance on batches with randomly shuffled documents. To avoid having too many spaCy commands, `speed` is a subcommand of `evaluate` and accuracy evaluation is moved to its own `evaluate accuracy` subcommand. * Fix import cycle * Restore `spacy evaluate`, make `spacy benchmark speed` an alias * Add documentation for `spacy benchmark` * CREATES -> PRINTS * WPS -> words/s * Disable formatting of benchmark speed arguments * Fail with an error message when trying to speed bench empty corpus * Make it clearer that `benchmark accuracy` is a replacement for `evaluate` * Fix docstring webpage reference * tests: check `evaluate` output against `benchmark accuracy`	2023-01-12 11:55:21 +01:00
Paul O'Leary McCann	8e558095a1	Clean up displacy port-related error messages, docs (#12089 ) * Clean up displacy port-related error messages, docs There were some issues in the error messages and docs in #11948. 1. the error messages didn't specify the port argument to displacy.serve correctly 2. the docs didn't mark the auto select argument as new This addresses those issues. * Update website/docs/api/top-level.md Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> * Apply prettier Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-01-12 14:54:09 +09:00
svlandeg	b2fd9490e3	Merge branch 'copy_master' into copy_v4	2023-01-11 18:40:55 +01:00
Sofie Van Landeghem	554df9ef20	Website migration from Gatsby to Next (#12058) * Rename all MDX files to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: `77b5f79a4d/packages/next-mdx/readme.md` * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnecessary adding of id to sections The slugified section ids are useless, because they cannot be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. 
* Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components were not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different than the client, therefore we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnecessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn off syntax highlighting, but 
using unknown languages now throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this creates an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this natively. Additionally, the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 
🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hiding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simply filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn off image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Don't build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indented with 4 spaces, but Next always indents with 2 spaces. Since `npm install` automatically uses the indentation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>	2023-01-11 17:30:07 +01:00
Adriane Boyd	9e0322de1a	Restore v2 token_acc score implementation (#12073) In the v3 scorer refactoring, `token_acc` was implemented incorrectly. It should use `precision` instead of `fscore` for the measure of correctly aligned tokens / number of predicted tokens. Fix the docs to reflect that the measure uses the number of predicted tokens rather than the number of gold tokens.	2023-01-11 08:01:47 +01:00
Kevin Humphreys	19650ebb52	Enable fuzzy text matching in Matcher (#11359 ) * enable fuzzy matching * add fuzzy param to EntityMatcher * include rapidfuzz_capi not yet used * fix type * add FUZZY predicate * add fuzzy attribute list * fix type properly * tidying * remove unnecessary dependency * handle fuzzy sets * simplify fuzzy sets * case fix * switch to FUZZYn predicates use Levenshtein distance. remove fuzzy param. remove rapidfuzz_capi. * revert changes added for fuzzy param * switch to polyleven (Python package) * enable fuzzy matching * add fuzzy param to EntityMatcher * include rapidfuzz_capi not yet used * fix type * add FUZZY predicate * add fuzzy attribute list * fix type properly * tidying * remove unnecessary dependency * handle fuzzy sets * simplify fuzzy sets * case fix * switch to FUZZYn predicates use Levenshtein distance. remove fuzzy param. remove rapidfuzz_capi. * revert changes added for fuzzy param * switch to polyleven (Python package) * fuzzy match only on oov tokens * remove polyleven * exclude whitespace tokens * don't allow more edits than characters * fix min distance * reinstate FUZZY operator with length-based distance function * handle sets inside regex operator * remove is_oov check * attempt build fix no mypy failure locally * re-attempt build fix * don't overwrite fuzzy param value * move fuzzy_match to its own Python module to allow patching * move fuzzy_match back inside Matcher simplify logic and add tests * Format tests * Parametrize fuzzyn tests * Parametrize and merge fuzzy+set tests * Format * Move fuzzy_match to a standalone method * Change regex kwarg type to bool * Add types for fuzzy_match - Refactor variable names - Add test for symmetrical behavior * Parametrize fuzzyn+set tests * Minor refactoring for fuzz/fuzzy * Make fuzzy_match a Matcher kwarg * Update type for _default_fuzzy_match * don't overwrite function param * Rename to fuzzy_compare * Update fuzzy_compare default argument declarations * allow fuzzy_compare 
override from EntityRuler * define new Matcher keyword arg * fix type definition * Implement fuzzy_compare config option for EntityRuler and SpanRuler * Rename _default_fuzzy_compare to fuzzy_compare, remove from reexported objects * Use simpler fuzzy_compare algorithm * Update types * Increase minimum to 2 in fuzzy_compare to allow one transposition * Fix predicate keys and matching for SetPredicate with FUZZY and REGEX * Add FUZZY6..9 * Add initial docs * Increase default fuzzy to rounded 30% of pattern length * Update docs for fuzzy_compare in components * Update EntityRuler and SpanRuler API docs * Rename EntityRuler and SpanRuler setting to matcher_fuzzy_compare To have naming similar to `phrase_matcher_attr`, rename `fuzzy_compare` setting for `EntityRuler` and `SpanRuler` to `matcher_fuzzy_compare`. Organize next to `phrase_matcher_attr` in docs. * Fix schema aliases Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix typo Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Add FUZZY6-9 operators and update tests * Parameterize test over greedy Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix type for fuzzy_compare to remove Optional * Rename to spacy.levenshtein_compare.v1, move to spacy.matcher.levenshtein * Update docs following levenshtein_compare renaming Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-01-10 10:36:17 +01:00
Zhangrp	eb8bb35c13	improve ux for displacy when the serve port is in use (#11948 ) * check port in use and add itself * check port in use and add itself * Auto switch to nearest available port. * Use bind to check port instead of connect_ex. * Reformat. * Add auto_select_port argument. * update docs for displacy.serve * Update spacy/errors.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update website/docs/api/top-level.md Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update spacy/errors.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Add test using multiprocessing * fix argument name * Increase sleep times Want to rule this out as a cause of test failure * Don't terminate a process that isn't alive * Refactor port finding logic This moves all the port logic into its own util function, which can be tested without having to background a server directly. * Use with for the server This ensures the server is closed correctly. * Pass in the host when checking port availability * Shorten argument name * Update error codes following merge * Add types for arguments, specify docstrings. * Add typing for arguments with default value. * Update docstring to match spaCy format. * Update docstring to match spaCy format. * Fix docs Arg name changed from `auto_select_port` to just `auto_select`. * Revert "Fix docs" This reverts commit `356966fe84`. Co-authored-by: zhiiw <1302593554@qq.com> Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-01-10 15:52:57 +09:00
Madeesh Kannan	f1dcdefc8a	Add version tag to `before_update` config key (#12059)	2023-01-05 11:46:04 +01:00
Paul O'Leary McCann	dbd829f0ed	Fix inconsistency in displaCy docs about page option (#12047) * Fix inconsistency in displaCy docs about page option The `page` option, which wraps the output SVG in HTML, is true by default for `serve` but not for `render`. The `render` docs were wrong though, so this updates them. * Update the same statement in more docs A few renderers used the same language	2023-01-04 12:51:40 +09:00
svlandeg	6852adc8b7	Merge branch 'copy_master' into copy_v4	2023-01-03 13:34:05 +01:00
Madeesh Kannan	aa2b471a6e	New console logger with expanded progress tracking (#11972 ) * Add `ConsoleLogger.v3` This addition expands the progress bar feature to count up the training/distillation steps to either the next evaluation pass or the maximum number of steps. * Rename progress bar types * Add defaults to docs Minor fixes * Move comment * Minor punctuation fixes * Explicitly check for `None` when validating progress bar type Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2022-12-23 15:21:44 +01:00
Daniël de Kok	207565a788	Merge remote-tracking branch 'upstream/master' into chore/v4-merge-master-20221222	2022-12-22 10:08:54 +01:00
Raphael Mitsch	eef3d950b4	Fix `SpanGroup` and `Span` typing (#12009 ) * Correct Span.label, Span.kb_id types. Fix SpanGroup.__iter__(). * Extend test. * Rename test. Fix typo. * Add comment. * Fix types for Span.label, Span.kb_id, Span.char_span(). * Update spacy/tests/doc/test_span_group.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update docs. * Fix typo. * Update spacy/tokens/span_group.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-12-21 18:54:27 +01:00
kadarakos	c223cd7a86	Add apply CLI (#11376 ) * annotate cli first try * add batch-size and n_process * rename to apply * typing fix * handle file suffixes * walk directories * support jsonl * typing fix * remove debug * make suffix optional for walk * revert unrelated * don't warn but raise * better error message * minor touch up * Update spacy/tests/test_cli.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/cli/apply.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/cli/apply.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * update tests and bugfix * add force_overwrite * typo * fix adding .spacy suffix * Update spacy/cli/apply.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/cli/apply.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/cli/apply.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * store user data and rename cmd arg * include test for user attr * rename cmd arg * better help message * documentation * prettier * black * link fix * Update spacy/cli/apply.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update website/docs/api/cli.md Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update website/docs/api/cli.md Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update website/docs/api/cli.md Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * addressing reviews * dont quit but warn * prettier Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2022-12-20 17:11:33 +01:00
cfuerbachersparks	3a2b655a29	Update lexeme.md (#11994) Change suffix_ string to end	2022-12-19 10:33:38 +01:00
Paul O'Leary McCann	d60997febb	Remove old model shortcuts (#11916 ) * Remove old model shortcuts * Remove error, docs warnings about shortcuts * Fix import in util Accidentally deleted the whole import and not just the old part... * Change universe example to v3 style * Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928) * Switch ubuntu-latest to ubuntu-20.04 in main tests * Only use 20.04 for 3.6 * Update some model loading in Universe * Add v2 tag to neuralcoref * Use the spacy-version feature instead of a v2 tag Co-authored-by: svlandeg <svlandeg@github.com>	2022-12-08 11:45:52 +01:00
Paul O'Leary McCann	6b9af38eeb	Remove all references to "begin_training" (#11943) When v3 was released, `begin_training` was renamed to `initialize`. There were warnings in the code and docs about that. This PR removes them.	2022-12-08 11:43:52 +01:00
svlandeg	799d226676	prettier formatting	2022-12-05 08:57:24 +01:00
svlandeg	04fea09ffd	Merge branch 'copy_master' into copy_v4	2022-12-05 08:56:15 +01:00
Sofie Van Landeghem	4b2097a271	fix links (#11927)	2022-12-05 16:29:13 +09:00
Zhangrp	9cf3fa9711	Add docs for biluo_to_iob and iob_to_biluo. (#11901) * Add docs for biluo_to_iob and iob_to_biluo. * Fix typos. * Remove redundant links.	2022-12-01 13:30:27 +01:00
Damian Romero	afd7a2476d	Fix typo in vocab.md table (#11908) * Fix typo in vocab.md table Fixes explosion/spaCy/#11907 * Reformat vocab.md with Prettier	2022-12-01 13:06:28 +01:00
Adriane Boyd	1ebe7db07c	Support local filesystem remotes for projects (#11762 ) * Support local filesystem remotes for projects * Fix support for local filesystem remotes for projects * Use `FluidPath` instead of `Pathy` to support both filesystem and remote paths * Create missing parent directories if required for local filesystem * Add a more general `_file_exists` method to support both `Pathy`, `Path`, and `smart_open`-compatible URLs * Add explicit `smart_open` dependency starting with support for `compression` flag * Update `pathy` dependency to exclude older versions that aren't compatible with required `smart_open` version * Update docs to refer to `Pathy` instead of `smart_open` for project remotes (technically you can still push to any `smart_open`-compatible path but you can't pull from them) * Add tests for local filesystem remotes * Update pathy for general BlobStat sorting * Add import * Remove _file_exists since only Pathy remotes are supported * Format CLI docs * Clean up merge	2022-11-29 11:40:58 +01:00
Sofie Van Landeghem	96c9cf3448	Merge pull request #11855 from essenmitsosse/move-styleguide-out-of-readme Move Styleguide out of Readme	2022-11-28 21:22:56 +01:00
Marcus Blättermann	5c9faf6eea	Update menu for styleguide This reflects the removed parts from `ecbf052abd`	2022-11-27 03:48:05 +01:00
Marcus Blättermann	90141202c0	Merge branch 'move-styleguide-out-of-readme' into migrate-to-next-web-17	2022-11-27 03:48:03 +01:00
Raphael Mitsch	c0fd8a2e71	find-threshold: CLI command for multi-label classifier threshold tuning (#11280 ) * Add foundation for find-threshold CLI functionality. * Finish first draft for find-threshold. * Add tests. * Revert adjusted import statements. * Fix mypy errors. * Fix imports. * Harmonize arguments with spacy evaluate command. * Generalize component and threshold handling. Harmonize arguments with 'spacy evaluate' CLI. * Fix Spancat test. * Add beta parameter to Scorer and PRFScore. * Make beta a component scorer setting. * Remove beta. * Update nlp.config (workaround). * Reload pipeline on threshold change. Adjust tests. Remove confection reference. * Remove assumption of component being a Pipe object or having a .cfg attribute. * Adjust test output and reference values. * Remove beta references. Delete universe.json. * Reverting unnecessary changes. Removing unused default values. Renaming variables in find-cli tests. * Update spacy/cli/find_threshold.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Remove adding labels in tests. * Remove unused error * Undo changes to PRFScorer * Change default value for n_trials. Log table iteratively. * Add warnings for pointless applications of find_threshold(). * Fix imports. * Adjust type check of TextCategorizer to exclude subclasses. * Change check of if there's only one unique value in scores. * Update spacy/cli/find_threshold.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Incorporate feedback. * Fix test issue. Update docstring. * Update docs & docstring. * Update spacy/tests/test_cli.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Add examples to docs. Rename _nlp to nlp in tests. 
* Update spacy/cli/find_threshold.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/cli/find_threshold.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-11-25 11:44:55 +01:00
kadarakos	dece775279	correct ndim in docs (#11869)	2022-11-25 11:31:28 +01:00
Madeesh Kannan	5ea14af32b	Add `training.before_update` callback (#11739) * Add `training.before_update` callback This callback can be used to implement training paradigms like gradual (un)freezing of components (e.g., the Transformer) after a certain number of training steps to mitigate catastrophic forgetting during fine-tuning. * Fix type annotation, default config value * Generalize arguments passed to the callback * Update schema * Pass `epoch` to callback, rename `current_step` to `step` * Add test * Simplify test * Replace config string with `spacy.blank` * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Cleanup imports Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-11-23 17:54:58 +01:00
Edward	e79910d57e	Remove sentiment extension (#11722) * remove sentiment attribute * remove sentiment from docs * add test for backwards compatibility * replace from_disk with from_bytes * Fix docs and format file * Fix formatting	2022-11-23 13:09:32 +01:00
Marcus Blättermann	ecbf052abd	Remove `README.md` content from styleguide	2022-11-23 02:04:54 +01:00
Marcus Blättermann	8c0ceca637	Move `README.md` content to styleguide	2022-11-23 02:04:54 +01:00
Marcus Blättermann	96218a1e8f	Delete `styleguide.md` This is an intermediate commit, so the content of `/README.md` can be moved to the styleguide, but the history is kept	2022-11-23 02:04:54 +01:00
Peter Baumgartner	9baa686f82	remove migration support form (#11802)	2022-11-14 16:53:14 +01:00
Paul O'Leary McCann	bb523d4d91	Remove spacy-ray from docs (#11781) * Remove spacy ray from cli docs * Remove more ray docs * Remove ray from universe	2022-11-14 19:58:38 +09:00
Edward	3478ff1eb0	remove new v2 tags (#11780)	2022-11-14 17:41:01 +09:00
Raphael Mitsch	20bbbe3e44	Revert disable/disabled merging behavior (#11745) * Merge disable with disabled. Adjust warnings, errors and tests. * Replace any() with set operation. * Update spacy/tests/pipeline/test_pipe_methods.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update docs. * Remove reference to config entry nlp.enabled from docs. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-11-08 14:58:10 +01:00
Adriane Boyd	68b8fa2df2	Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master-4	2022-11-03 09:42:36 +01:00
Adriane Boyd	420b1d854b	Update textcat scorer threshold behavior (#11696 ) * Update textcat scorer threshold behavior For `textcat` (with exclusive classes) the scorer should always use a threshold of 0.0 because there should be one predicted label per doc and the numeric score for that particular label should not matter. * Rename to test_textcat_multilabel_threshold * Remove all uses of threshold for multi_label=False * Update Scorer.score_cats API docs * Add tests for score_cats with thresholds * Update textcat API docs * Fix types * Convert threshold back to float * Fix threshold type in docstring * Improve formatting in Scorer API docs	2022-11-02 15:35:04 +01:00
Aaron Zipp	d25f09468c	Spelling mistake in rule-based-matching.md (#11717) Changed retokenize to retokenizer	2022-10-31 13:27:12 +09:00
Adriane Boyd	cae4589f5a	Replace EntityRuler with SpanRuler implementation (#11320 ) * Replace EntityRuler with SpanRuler implementation Remove `EntityRuler` and rename the `SpanRuler`-based `future_entity_ruler` to `entity_ruler`. Main changes: * It is no longer possible to load patterns on init as with `EntityRuler(patterns=)`. * The older serialization formats (`patterns.jsonl`) are no longer supported and the related tests are removed. * The config settings are only stored in the config, not in the serialized component (in particular the `phrase_matcher_attr` and overwrite settings). * Add migration guide to EntityRuler API docs * docs update * Minor edit Co-authored-by: svlandeg <svlandeg@github.com>	2022-10-24 09:11:35 +02:00
Adriane Boyd	103b24fb25	Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master	2022-10-21 09:13:32 +02:00
Adriane Boyd	6c380d4fc6	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.5	2022-10-20 13:45:17 +02:00
Adriane Boyd	7e56701057	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.5	2022-10-20 13:38:49 +02:00
Paul O'Leary McCann	bf83f6872a	Add detailed example of env dict usage (#11677) * Add detailed example of env dict usage * Mark code blocks as yaml	2022-10-20 20:35:03 +09:00
Paul O'Leary McCann	858565a567	Fix issues with DVC commands (#11592) * Fix flag handling in dvc Prior to this commit, if a flag (--verbose or --quiet) was passed to DVC, it would be added to the end of the generated dvc command line. This would result in the command being interpreted as part of the actual command to run, rather than an argument to dvc. This would result in command lines like: spacy project run preprocess --verbose That would fail with an error that there's no such directory as `--verbose`. This change puts the flags at the front of the dvc command so that they are interpreted correctly. It removes the `run_dvc_commands` function, which had been reduced to just a for loop and wasn't used elsewhere. A separate problem is that there's no way to specify the quiet behaviour to dvc from the command line, though it's unclear if that's a bug. * Add dvc quiet flag to docs * Handle case in DVC where no commands are appropriate If you only have commands with no deps or outputs (admittedly unlikely), you get a weird error about the dvc file not existing. This gives explicit output instead. * Add support for quiet flag * Fix command execution Commands are strings now because they're joined further up.	2022-10-18 15:11:39 +09:00
Madeesh Kannan	446a3ecf34	`StringStore` refactoring (#11344 ) * `strings`: Remove unused `hash32_utf8` function * `strings`: Make `hash_utf8` and `decode_Utf8Str` private * `strings`: Reorganize private functions * 'strings': Raise error when non-string/-int types are passed to functions that don't accept them * `strings`: Add `items()` method, add type hints, remove unused methods, restrict inputs to specific types, reorganize methods * `Morphology`: Use `StringStore.items()` to enumerate features when pickling * `test_stringstore`: Update pre-Python 3 tests * Update `StringStore` docs * Fix `get_string_id` imports * Replace redundant test with tests for type checking * Rename `_retrieve_interned_str`, remove `.get` default arg * Add `get_string_id` to `strings.pyi` Remove `mypy` ignore directives from imports of the above * `strings.pyi`: Replace functions that consume `Union`-typed params with overloads * `strings.pyi`: Revert some function signatures * Update `SYMBOLS_BY_INT` lookups and error codes post-merge * Revert clobbered change introduced in a previous merge * Remove unnecessary type hint * Invert tuple order in `StringStore.items()` * Add test for `StringStore.items()` * Revert "`Morphology`: Use `StringStore.items()` to enumerate features when pickling" This reverts commit `1af9510ceb`. * Rename `keys` and `key_map` * Add `keys()` and `values()` * Add comment about the inverted key-value semantics in the API * Fix type hints * Implement `keys()`, `values()`, `items()` without generators * Fix type hints, remove unnecessary boxing * Update docs * Simplify `keys/values/items()` impl * `mypy` fix * Fix error message, doc fixes	2022-10-06 10:51:06 +02:00
Sofie Van Landeghem	b187076a2d	fix docs (#11573 )	2022-10-03 17:01:04 +02:00
svlandeg	e3027c65b8	Merge branch 'copy_develop' into copy_v4	2022-10-03 14:12:16 +02:00
svlandeg	9c8cdb403e	Merge branch 'master_copy' into develop_copy	2022-09-30 15:40:26 +02:00
Paul O'Leary McCann	ba63f57f81	Update docs to reflect Doc input to Language (#11555 )	2022-09-29 18:50:29 +09:00
Paul O'Leary McCann	a44b7d4622	Add experimental coref docs (#11291 ) * Add experimental coref docs * Docs cleanup * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Apply changes from code review * Fix prettier formatting It seems a period after a number made this think it was a list? * Update docs on examples for initialize * Add docs for coref scorers * Remove 3.4 notes from coref There won't be a "new" tag until it's in core. * Add docs for span cleaner * Fix docs * Fix docs to match spacy-experimental These weren't properly updated when the code was moved out of spacy core. * More doc fixes * Formatting * Update architectures * Fix links * Fix another link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <svlandeg@github.com>	2022-09-27 18:11:23 +09:00
Paul O'Leary McCann	936a5f0506	Fix English pipeline names in 3.4 release notes (#11542 )	2022-09-27 08:25:24 +02:00
Richard Hudson	6f692a06d5	Remove side effects from Doc.__init__() (#11506 ) * Remove side effects from Doc.__init__() * Changes based on review comment * Readd test * Change interface of Doc.__init__() * Simplify test Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update doc.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-09-26 15:58:21 +02:00
Raphael Mitsch	af9b01ef97	Add dependency check to project step runs (#11226 ) * Add dependency check to project step running. * Fix dependency mismatch warning. * Remove newline. * Add types-setuptools to setup.cfg. * Move types-setuptools to test requirements. Move warnings into _validate_requirements(). Handle file reading in project_run(). * Remove newline formatting for output of package conflicts. * Show full version conflict message instead of just package name. * Update spacy/cli/project/run.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Fix typo. * Re-add rephrasing of message for conflicting packages. Remove requirements path redundancy. * Update spacy/cli/project/run.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/cli/project/run.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Print unified message for requirement conflicts and missing requirements. * Update spacy/cli/project/run.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Fix warning message. * Print conflict/missing messages individually. * Print conflict/missing messages individually. * Add check_requirements setting in project.yml to disable requirements check. * Update website/docs/usage/projects.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/usage/projects.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update description of project.yml structure in projects.md. * Update website/docs/usage/projects.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Prettify projects docs. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-09-16 16:54:31 +02:00
Sofie Van Landeghem	df0b815c23	more explicit Example constructor example (#11489 ) * make constructor example for Example more explicit * shorten example and add spaces	2022-09-16 09:26:33 +02:00
Richard Hudson	3f0c3ad7d3	Correct alignment example and documentation (#11491 ) * Correct example and documentation * Added altered example.md * Changes based on review + apply prettier * Remove unnecessary 'the' Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>	2022-09-14 09:36:55 +02:00
Daniël de Kok	efdbb722c5	Store activations in `Doc`s when `save_activations` is enabled (#11002 ) * Store activations in Doc when `store_activations` is enabled This change adds the new `activations` attribute to `Doc`. This attribute can be used by trainable pipes to store their activations, probabilities, and guesses for downstream users. As an example, this change modifies the `tagger` and `senter` pipes to add an `store_activations` option. When this option is enabled, the probabilities and guesses are stored in `set_annotations`. * Change type of `store_activations` to `Union[bool, List[str]]` When the value is: - A bool: all activations are stored when set to `True`. - A List[str]: the activations named in the list are stored * Formatting fixes in Tagger * Support store_activations in spancat and morphologizer * Make Doc.activations type visible to MyPy * textcat/textcat_multilabel: add store_activations option * trainable_lemmatizer/entity_linker: add store_activations option * parser/ner: do not currently support returning activations * Extend tagger and senter tests So that they, like the other tests, also check that we get no activations if no activations were requested. * Document `Doc.activations` and `store_activations` in the relevant pipes * Start errors/warnings at higher numbers to avoid merge conflicts Between the master and v4 branches. * Add `store_activations` to docstrings. * Replace store_activations setter by set_store_activations method Setters that take a different type than what the getter returns are still problematic for MyPy. Replace the setter by a method, so that type inference works everywhere. * Use dict comprehension suggested by @svlandeg * Revert "Use dict comprehension suggested by @svlandeg" This reverts commit `6e7b958f70`. 
* EntityLinker: add type annotations to _add_activations * _store_activations: make kwarg-only, remove doc_scores_lens arg * set_annotations: add type annotations * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * TextCat.predict: return dict * Make the `TrainablePipe.store_activations` property a bool This means that we can also bring back `store_activations` setter. * Remove `TrainablePipe.activations` We do not need to enumerate the activations anymore since `store_activations` is `bool`. * Add type annotations for activations in predict/set_annotations * Rename `TrainablePipe.store_activations` to `save_activations` * Error E1400 is not used anymore This error was used when activations were still `Union[bool, List[str]]`. * Change wording in API docs after store -> save change * docs: tag (save_)activations as new in spaCy 4.0 * Fix copied line in morphologizer activations test * Don't train in any test_save_activations test * Rename activations - "probs" -> "probabilities" - "guesses" -> "label_ids", except in the edit tree lemmatizer, where "guesses" -> "tree_ids". * Remove unused W400 warning. This warning was used when we still allowed the user to specify which activations to save. * Formatting fixes Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Replace "kb_ids" by a constant * spancat: replace a cast by an assertion * Fix EOF spacing * Fix comments in test_save_activations tests * Do not set RNG seed in activation saving tests * Revert "spancat: replace a cast by an assertion" This reverts commit `0bd5730d16`. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-09-13 09:51:12 +02:00
Sofie Van Landeghem	cc10a27c59	Prevent tok2vec to broadcast to listeners when predicting (#11385 ) * replicate bug with tok2vec in annotating components * add overfitting test with a frozen tok2vec * remove broadcast from predict and check doc.tensor instead * remove broadcast * proper error * slight rephrase of documentation	2022-09-12 15:36:48 +02:00
Madeesh Kannan	aac9a58c29	Add docs for the `spacy.models_and_pipes_with_nvtx_range.v1` callback (#11463 ) * Add docs for the `spacy.models_and_pipes_with_nvtx_range.v1` callback * Add `new` tag	2022-09-09 10:46:01 +02:00
Paul O'Leary McCann	2602a30d32	Fix DVC command example (#11457 ) This command doesn't have the project dir, but it's required.	2022-09-08 13:42:47 +02:00
Raphael Mitsch	1f23c615d7	Refactor KB for easier customization (#11268 ) * Add implementation of batching + backwards compatibility fixes. Tests indicate issue with batch disambiguation for custom singular entity lookups. * Fix tests. Add distinction w.r.t. batch size. * Remove redundant and add new comments. * Adjust comments. Fix variable naming in EL prediction. * Fix mypy errors. * Remove KB entity type config option. Change return types of candidate retrieval functions to Iterable from Iterator. Fix various other issues. * Update spacy/pipeline/entity_linker.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update spacy/pipeline/entity_linker.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update spacy/kb_base.pyx Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update spacy/kb_base.pyx Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update spacy/pipeline/entity_linker.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Add error messages to NotImplementedErrors. Remove redundant comment. * Fix imports. * Remove redundant comments. * Rename KnowledgeBase to InMemoryLookupKB and BaseKnowledgeBase to KnowledgeBase. * Fix tests. * Update spacy/errors.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Move KB into subdirectory. * Adjust imports after KB move to dedicated subdirectory. * Fix config imports. * Move Candidate + retrieval functions to separate module. Fix other, small issues. * Fix docstrings and error message w.r.t. class names. Fix typing for candidate retrieval functions. * Update spacy/kb/kb_in_memory.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/ml/models/entity_linker.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix typing. * Change typing of mentions to be Span instead of Union[Span, str]. * Update docs. * Update EntityLinker and _architecture docs. 
* Update website/docs/api/entitylinker.md Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Adjust message for E1046. * Re-add section for Candidate in kb.md, add reference to dedicated page. * Update docs and docstrings. * Re-add section + reference for KnowledgeBase.get_alias_candidates() in docs. * Update spacy/kb/candidate.pyx * Update spacy/kb/kb_in_memory.pyx * Update spacy/pipeline/legacy/entity_linker.py * Remove canididate.md. Remove mistakenly added config snippet in entity_linker.py. Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-09-08 10:38:07 +02:00
shademe	977b847cce	Merge branch 'develop' into merge-develop-into-v4	2022-09-07 11:35:47 +02:00
Sofie Van Landeghem	d801cccd38	Merge pull request #11430 from rmitsch/chore/synch-develop Synch develop with master	2022-09-05 15:07:18 +02:00
Paul O'Leary McCann	977dc33312	Add a way to get the URL to download a pipeline to the CLI (#11175 ) * Add a dry run flag to download * Remove --dry-run, add --url option to `spacy info` instead * Make mypy happy * Print only the URL, so it's easier to use in scripts * Don't add the egg hash unless downloading an sdist * Update spacy/cli/info.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Add two implementations of requirements * Clean up requirements sample slightly This should make mypy happy * Update URL help string * Remove requirements option * Add url option to docs * Add URL to spacy info model output, when available * Add types-setuptools to testing reqs * Add types-setuptools to requirements * Add "compatible", expand docstring * Update spacy/cli/info.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Run prettier on CLI docs * Update docs Add a sidebar about finding download URLs, with some examples of the new command. * Add download URLs to table on model page * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Updates from review * download url -> download link * Update docs Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-09-02 11:58:21 +02:00
Madeesh Kannan	604a7c3c26	`SpanGroup(s)`-related optimizations (#11380 ) * `SpanGroup`: Add support for binding copies to a new reference document * `SpanGroups`: Replace superfluous serialize-deserialize roundtrip in `copy` Instead, directly copy the in-memory representations of the constituent `SpanGroup`s. * Update `SpanGroup.copy()` signature * Rename `new_doc` param to `doc` * Fix kwdarg * Update `.pyi` file and docstrings * `mypy` fix * Update spacy/tokens/span_group.pyx * Update docs Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-08-31 09:03:20 +02:00
Sofie Van Landeghem	8fc0efc502	Allow string argument for disable/enable/exclude (#11406 ) * adding unit test for spacy.load with disable/exclude string arg * allow pure strings in from_config * update docs * upstream type adjustements * docs update * make docstring more consistent * Update spacy/language.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * two more cleanups * fix type in internal method Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-08-31 09:02:34 +02:00
Paul O'Leary McCann	698b8b495f	Update/remove old Matcher syntax (#11370 ) * Clean up old Matcher call style related stuff In v2 Matcher.add was called with (key, on_match, *patterns). In v3 this was changed to (key, patterns, *, on_match=None), but there were various points where the old call syntax was documented or handled specially. This removes all those. The Matcher itself didn't need any code changes, as it just gives a generic type error. However the PhraseMatcher required some changes because it would automatically "fix" the old call style. Surprisingly, the tokenizer was still using the old call style in one place. After these changes tests failed in two places: 1. one test for the "new" call style, including the "old" call style. I removed this test. 2. deserializing the PhraseMatcher fails because the input docs are a set. I am not sure why 2 is happening - I guess it's a quirk of the serialization format? - so for now I just convert the set to a list when deserializing. The check that the input Docs are a List in the PhraseMatcher is a new check, but makes it parallel with the other Matchers, which seemed like the right thing to do. * Add notes related to input docs / deserialization type * Remove Typing import * Remove old note about call style change * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Use separate method for setting internal doc representations In addition to the title change, this changes the internal dict to be a defaultdict, instead of a dict with frequent use of setdefault. * Add _add_from_arrays for unpickling * Cleanup around adding from arrays This moves adding to internal structures into the private batch method, and removes the single-add method. This has one behavioral change for `add`, in that if something is wrong with the list of input Docs (such as one of the items not being a Doc), valid items before the invalid one will not be added. 
Also the callback will not be updated if anything is invalid. This change should not be significant. This also adds a test to check failure when given a non-Doc. * Update spacy/matcher/phrasematcher.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-08-30 15:40:31 +02:00
Patrick J. Burns	5ae63b1fbd	Add Latin language support (#11349 ) * Add lang folder for la (Latin) * Add Latin lang classes * Add minimal tokenizer exceptions * Add minimal stopwords * Add minimal lex_attrs * Update stopwords, tokenizer exceptions * Add la tests; register la_tokenizer in conftest.py * Update spacy/lang/la/lex_attrs.py Remove duplicate form in Latin lex_attrs Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update natto-py version spec (#11222) * Update natto-py version spec * Update setup.cfg Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Add scorer to textcat API docs config settings (#11263) * Update docs for pipeline initialize() methods (#11221) * Update documentation for dependency parser * Update documentation for trainable_lemmatizer * Update documentation for entity_linker * Update documentation for ner * Update documentation for morphologizer * Update documentation for senter * Update documentation for spancat * Update documentation for tagger * Update documentation for textcat * Update documentation for tok2vec * Run prettier on edited files * Apply similar changes in transformer docs * Remove need to say annotated example explicitly I removed the need to say "Must contain at least one annotated Example" because it's often a given that Examples will contain some gold-standard annotation. 
* Run prettier on transformer docs * chore: add 'concepCy' to spacy universe (#11255) * chore: add 'concepCy' to spacy universe * docs: add 'slogan' to concepCy * Support full prerelease versions in the compat table (#11228) * Support full prerelease versions in the compat table * Fix types * adding spans to doc_annotation in Example.to_dict (#11261) * adding spans to doc_annotation in Example.to_dict * to_dict compatible with from_dict: tuples instead of spans * use strings for label and kb_id * Simplify test * Update data formats docs Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Fix regex invalid escape sequences (#11276) * Add W605 to the errors raised by flake8 in the CI (#11283) * Clean up automated label-based issue handling (#11284) * Clean up automated label-based issue handline 1. upgrade tiangolo/issue-manager to latest 2. move needs-more-info to tiangolo 3. change needs-more-info close time to 7 days 4. 
delete old needs-more-info config * Use old, longer message * Fix label name * Fix Dutch noun chunks to skip overlapping spans (#11275) * Add test for overlapping noun chunks * Skip overlapping noun chunks * Update spacy/tests/lang/nl/test_noun_chunks.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Docs: displaCy documentation - data types, `parse_{deps,ents,spans}`, spans example (#10950) * add in spans example and parse references * rm autoformatter * rm extra ents copy * TypedDict draft * type fixes * restore non-documentation files * docs update * fix spans example * fix hyperlinks * add parse example * example fix + argument fix * fix api arg in docs * fix bad variable replacement * fix spacing in style Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * fix spacing on table * fix spacing on table * rm temp files Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * include span_ruler for default warning filter (#11333) * Add uk pipelines to website (#11332) * Check for . in factory names (#11336) * Make fixes for PR #11349 * Fix roman numeral coverage in #11349 Co-authored-by: Patrick J. Burns <patricks@diyclassics.org> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Lj Miranda <12949683+ljvmiranda921@users.noreply.github.com> Co-authored-by: Jules Belveze <32683010+JulesBelveze@users.noreply.github.com> Co-authored-by: stefawolf <wlf.ste@gmail.com> Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com> Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>	2022-08-30 14:04:54 +02:00
Edward	6723d76f24	Add ConsoleLogger.v2 (#11214 ) * Init * Change logger to ConsoleLogger.v2 * adjust naming * More naming adjustments * Fix output_file reference error * ignore type * Add basic test for logger * Hopefully fix mypy issue * mypy ignore line * Update mypy line Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update test method name Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Change file saving logic * Fix finalize method * increase spacy-legacy version in requirements * Update docs * small adjustments Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-08-29 10:23:05 +02:00
Adriane Boyd	2a558a7cdc	Switch to mecab-ko as default Korean tokenizer (#11294 ) * Switch to mecab-ko as default Korean tokenizer Switch to the (confusingly-named) mecab-ko python module for default Korean tokenization. Maintain the previous `natto-py` tokenizer as `spacy.KoreanNattoTokenizer.v1`. * Temporarily run tests with mecab-ko tokenizer * Fix types * Fix duplicate test names * Update requirements test * Revert "Temporarily run tests with mecab-ko tokenizer" This reverts commit `d2083e7044`. * Add mecab_args setting, fix pickle for KoreanNattoTokenizer * Fix length check * Update docs * Formatting * Update natto-py error message Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2022-08-26 10:11:18 +02:00
Adriane Boyd	740c33fe58	Merge remote-tracking branch 'upstream/develop' into chore/update-v4-from-develop	2022-08-24 20:43:07 +02:00
Adriane Boyd	81874265e9	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.5-1	2022-08-24 12:47:42 +02:00
Adriane Boyd	c44d243f25	Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master	2022-08-24 07:15:41 +02:00
Tal Zussman	7e75327893	Fix menu order in linguistic-features.md (#11364 ) Swap 'Vectors & Similarity' and 'Mappings & Exceptions' in menu to match order in body	2022-08-23 14:40:38 +09:00
Adriane Boyd	bb0e178878	Make Span/Doc.ents more consistent for ent_kb_id and ent_id (#11328 ) * Map `Span.id` to `Token.ent_id` in all cases when setting `Doc.ents` * Reset `Token.ent_id` and `Token.ent_kb_id` when setting `Doc.ents` * Make `Span.ent_id` an alias of `Span.id` rather than a read-only view of the root token's `ent_id` annotation	2022-08-22 20:28:57 +02:00
Adriane Boyd	5fa8f4faca	Switch ru and uk lemmatizers to pymorphy3 (#11345 ) * Switch ru and uk lemmatizers to pymorphy3 * Switch to pymorphy3 in tests	2022-08-22 11:27:14 +02:00
Peter Baumgartner	db7b9938a4	Docs: displaCy documentation - data types, `parse_{deps,ents,spans}`, spans example (#10950 ) * add in spans example and parse references * rm autoformatter * rm extra ents copy * TypedDict draft * type fixes * restore non-documentation files * docs update * fix spans example * fix hyperlinks * add parse example * example fix + argument fix * fix api arg in docs * fix bad variable replacement * fix spacing in style Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * fix spacing on table * fix spacing on table * rm temp files Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-08-16 11:23:34 -04:00
Sofie Van Landeghem	5d54c0e32a	Rename modules for consistency (#11286 ) * rename Python module to entity_ruler * rename Python module to attribute_ruler	2022-08-10 11:44:05 +02:00
stefawolf	23749cfc91	adding spans to doc_annotation in Example.to_dict (#11261 ) * adding spans to doc_annotation in Example.to_dict * to_dict compatible with from_dict: tuples instead of spans * use strings for label and kb_id * Simplify test * Update data formats docs Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-08-05 12:26:38 +02:00
Lj Miranda	d993df41e5	Update docs for pipeline initialize() methods (#11221 ) * Update documentation for dependency parser * Update documentation for trainable_lemmatizer * Update documentation for entity_linker * Update documentation for ner * Update documentation for morphologizer * Update documentation for senter * Update documentation for spancat * Update documentation for tagger * Update documentation for textcat * Update documentation for tok2vec * Run prettier on edited files * Apply similar changes in transformer docs * Remove need to say annotated example explicitly I removed the need to say "Must contain at least one annotated Example" because it's often a given that Examples will contain some gold-standard annotation. * Run prettier on transformer docs	2022-08-03 16:53:02 +02:00
Adriane Boyd	d0578c2ede	Add scorer to textcat API docs config settings (#11263 )	2022-08-03 16:41:20 +02:00
Daniël de Kok	1ff683a50b	Merge remote-tracking branch 'upstream/master' into merge-master-v4-20220728	2022-07-28 13:53:59 +02:00
ninjalu	95a1b8aca6	add additional REL_OP (#10371 ) * add additional REL_OP * change to condition and new rel_op symbols * add operators to docs * add the anchor while we're in here * add tests Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>	2022-07-27 13:16:44 +02:00
Dan Radenkovic	a5aa3a818f	fix docs (#11123 )	2022-07-24 17:16:36 +09:00
Madeesh Kannan	ba18d2913d	`Morphology`/`Morphologizer` optimizations and refactoring (#11024 ) * `Morphology`: Refactor to use C types, reduce allocations, remove unused code * `Morphologizer`: Avoid unnecessary sorting of morpho features * `Morphologizer`: Remove excessive reallocations of labels, improve hash lookups of labels, coerce `numpy` numeric types to native ints Update docs * Remove unused method * Replace `unique_ptr` usage with `shared_ptr` * Add type annotations to internal Python methods, rename `hash` variable, fix typos * Add comment to clarify implementation detail * Fix return type * `Morphology`: Stop early when splitting fields and values	2022-07-15 11:14:08 +02:00
Adriane Boyd	11f859c132	Docs for v3.4 (#11057 ) * Add draft of v3.4 usage * Add Croatian models * Add Matcher min/max * Update release notes * Minor edits * Add updates, tables * Update pydantic/mypy versions * Update version in README * Fix sidebar	2022-07-11 15:36:31 +02:00
Adriane Boyd	3701039c1f	Tweak build jobs setting, update install docs (#11077 ) * Restrict SPACY_NUM_BUILD_JOBS to only override if set * Update install docs	2022-07-08 19:21:17 +02:00
Adriane Boyd	be9e17c0e4	Add docs for compiling with build constraints (#11081 )	2022-07-08 11:45:56 +02:00
Raphael Mitsch	e9eb59699f	NEL confidence threshold (#11016 ) * Add base for NEL abstention threshold mechanism. * Add abstention threshold to entity linker. Add test. * Fix entity linking tests. * Changed abstention default threshold from 0 to None. * Fix default values for abstention thresholds. * Fix mypy errors. * Replace assertion with raise of proper error code. * Simplify threshold check. Remove thresholding from EntityLinker_v1. * Rename test. * Update spacy/pipeline/entity_linker.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/pipeline/entity_linker.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Make E1043 configurable. * Update docs. * Rephrase description in docs. Adjusting error code message. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-07-04 17:05:21 +02:00
Paul O'Leary McCann	e8fdbfc65e	Minor fix in Lemmatizer docs	2022-07-01 14:28:03 +09:00
Shen Qin	be00db6645	Addition of min_max quantifier in matcher {n,m} (#10981 ) * Min_max_operators 1. Modified API and Usage for spaCy website to include min_max operator 2. Modified matcher.pyx to include min_max function {n,m} and its variants 3. Modified schemas.py to include min_max validation error 4. Added test cases to test_matcher_api.py, test_matcher_logic.py and test_pattern_validation.py * attempt to fix mypy/pydantic compat issue * formatting * Update spacy/tests/matcher/test_pattern_validation.py Co-authored-by: Source-Shen <82353723+Source-Shen@users.noreply.github.com> Co-authored-by: svlandeg <svlandeg@github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-06-30 11:01:58 +02:00
Eric Holscher	308a612ec9	Remove `simply` (#11017 ) I was reading this page, and as a relative beginner, nothing about it was simple :)	2022-06-27 09:45:22 +02:00
Adriane Boyd	f1197d9175	Add API docs for token attribute symbols (#10836 ) * Add API docs for token attribute symbols * Remove NBSP's * Fix typo * Rephrase Co-authored-by: svlandeg <svlandeg@github.com>	2022-06-23 08:16:38 +02:00
jademlc	bed23ff291	Update serialization methods code block (#11004 ) * Update serialization methods code block * Update website/docs/usage/saving-loading.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-06-22 20:45:26 +02:00
Sofie Van Landeghem	0fa004c4cd	the 'new' indicator wants a 'number' (#10997 )	2022-06-21 22:01:16 +02:00
Victoria	a08ca064e5	Update linguistic-features.md (#10993 ) Change link for downloading fasttext word vectors	2022-06-21 15:03:41 +09:00
Raphael Mitsch	4c058eb40a	`enable` argument for spacy.load() (#10784 ) * Enable flag on spacy.load: foundation for include, enable arguments. * Enable flag on spacy.load: fixed tests. * Enable flag on spacy.load: switched from pretrained model to empty model with added pipes for tests. * Enable flag on spacy.load: switched to more consistent error on misspecification of component activity. Test refactoring. Added to default config. * Enable flag on spacy.load: added support for fields not in pipeline. * Enable flag on spacy.load: removed serialization fields from supported fields. * Enable flag on spacy.load: removed 'enable' from config again. * Enable flag on spacy.load: relaxed checks in _resolve_component_activation_status() to allow non-standard pipes. * Enable flag on spacy.load: fixed relaxed checks for _resolve_component_activation_status() to allow non-standard pipes. Extended tests. * Enable flag on spacy.load: comments w.r.t. resolution workarounds. * Enable flag on spacy.load: remove include fields. Update website docs. * Enable flag on spacy.load: updates w.r.t. changes in master. * Implement Doc.from_json(): update docstrings. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): remove newline. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): change error message for E1038. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Enable flag on spacy.load: wrapped docstring for _resolve_component_status() at 80 chars. * Enable flag on spacy.load: changed exmples for enable flag. * Remove newline. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix docstring for Language._resolve_component_status(). * Rename E1038 to E1042. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-06-17 20:24:13 +01:00
Paul O'Leary McCann	d176afd32f	Add note about multiple patterns (#10826 ) * Add note about multiple patterns * Move note to the top of method docs * Remove EntityRuler note	2022-06-08 16:24:14 +02:00
Sofie Van Landeghem	763dcbf885	Fix version in SpanRuler docs (#10925 ) * SpanRuler is new since 3.3.1 * update SpanRuler version since 3.3.1	2022-06-08 14:45:04 +02:00
Ilya Nikitin	c323789721	`token.md`: Fix documentation of `Token.ancestors` (#10917 )	2022-06-06 14:32:36 +09:00
Raphael Mitsch	8387ce4c01	Add Doc.from_json() (#10688 ) * Implement Doc.from_json: rough draft. * Implement Doc.from_json: first draft with tests. * Implement Doc.from_json: added documentation on website for Doc.to_json(), Doc.from_json(). * Implement Doc.from_json: formatting changes. * Implement Doc.to_json(): reverting unrelated formatting changes. * Implement Doc.to_json(): fixing entity and span conversion. Moving fixture and doc <-> json conversion tests into single file. * Implement Doc.from_json(): replaced entity/span converters with doc.char_span() calls. * Implement Doc.from_json(): handling sentence boundaries in spans. * Implementing Doc.from_json(): added parser-free sentence boundaries transfer. * Implementing Doc.from_json(): added parser-free sentence boundaries transfer. * Implementing Doc.from_json(): incorporated various PR feedback. * Renaming fixture for document without dependencies. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implementing Doc.from_json(): using two sent_starts instead of one. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implementing Doc.from_json(): doc_without_dependency_parser() -> doc_without_deps. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implementing Doc.from_json(): incorporating various PR feedback. Rebased on latest master. * Implementing Doc.from_json(): refactored Doc.from_json() to work with annotation IDs instead of their string representations. * Implement Doc.from_json(): reverting unwanted formatting/rebasing changes. * Implement Doc.from_json(): added check for char_span() calculation for entities. * Update spacy/tokens/doc.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): minor refactoring, additional check for token attribute consistency with corresponding test. * Implement Doc.from_json(): removed redundancy in annotation type key naming. 
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): Simplifying setting annotation values. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement doc.from_json(): renaming annot_types to token_attrs. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): adjustments for renaming of annot_types to token_attrs. * Implement Doc.from_json(): removing default categories. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): simplifying lexeme initialization. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): simplifying lexeme initialization. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): refactoring to only have keys for present annotations. * Implement Doc.from_json(): fix check for tokens' HEAD attributes. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): refactoring Doc.from_json(). * Implement Doc.from_json(): fixing span_group retrieval. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): fixing span retrieval. * Implement Doc.from_json(): added schema for Doc JSON format. Minor refactoring in Doc.from_json(). * Implement Doc.from_json(): added comment regarding Token and Span extension support. * Implement Doc.from_json(): renaming inconsistent_props to partial_attrs.. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): adjusting error message. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): extending E1038 message. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): added params to E1038 raises. * Implement Doc.from_json(): combined attribute collection with partial attributes check. * Implement Doc.from_json(): added optional schema validation. * Implement Doc.from_json(): fixed optional fields in schema, tests. 
* Implement Doc.from_json(): removed redundant None check for DEP. * Implement Doc.from_json(): added passing of schema validation message to E1037. * Implement Doc.from_json(): removing redundant error E1040. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): changing message for E1037. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): adjusted website docs and docstring of Doc.from_json(). * Update spacy/tests/doc/test_json_doc_conversion.py * Implement Doc.from_json(): docstring update. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): docstring update. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): website docs update. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): docstring formatting. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): docstring formatting. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): fixing Doc reference in website docs. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): reformatted website/docs/api/doc.md. * Implement Doc.from_json(): bumped IDs of new errors to avoid merge conflicts. * Implement Doc.from_json(): fixing bug in tests. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Implement Doc.from_json(): fix setting of sentence starts for docs without DEP. * Implement Doc.from_json(): add check for valid char spans when manually setting sentence boundaries. Refactor sentence boundary setting slightly. Move error message for lack of support for partial token annotations to errors.py. * Implement Doc.from_json(): simplify token sentence start manipulation. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Combine related error messages * Update spacy/tests/doc/test_json_doc_conversion.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-06-02 14:03:47 +02:00
Adriane Boyd	a322d6d5f2	Add SpanRuler component (#9880 ) * Add SpanRuler component Add a `SpanRuler` component similar to `EntityRuler` that saves a list of matched spans to `Doc.spans[spans_key]`. The matches from the token and phrase matchers are deduplicated and sorted before assignment but are not otherwise filtered. * Update spacy/pipeline/span_ruler.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix cast * Add self.key property * Use number of patterns as length * Remove patterns kwarg from init * Update spacy/tests/pipeline/test_span_ruler.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Add options for spans filter and setting to ents * Add `spans_filter` option as a registered function * Make `spans_key` optional and if `None`, set to `doc.ents` instead of `doc.spans[spans_key]`. * Update and generalize tests * Add test for setting doc.ents, fix key property type * Fix typing * Allow independent doc.spans and doc.ents * If `spans_key` is set, set `doc.spans` with `spans_filter`. * If `annotate_ents` is set, set `doc.ents` with `ents_filter`. * Use `util.filter_spans` by default as `ents_filter`. * Use a custom warning if the filter does not work for `doc.ents`.
* Enable use of SpanC.id in Span * Support id in SpanRuler as Span.id * Update types * `id` can only be provided as string (already by `PatternType` definition) * Update all uses of Span.id/ent_id in Doc * Rename Span id kwarg to span_id * Update types and docs * Add ents filter to mimic EntityRuler overwrite_ents * Refactor `ents_filter` to take `entities, spans` args for more filtering options * Give registered filters more descriptive names * Allow registered `filter_spans` filter (`spacy.first_longest_spans_filter.v1`) to take any number of `Iterable[Span]` objects as args so it can be used for spans filter or ents filter * Implement future entity ruler as span ruler Implement a compatible `entity_ruler` as `future_entity_ruler` using `SpanRuler` as the underlying component: * Add `sort_key` and `sort_reverse` to allow the sorting behavior to be customized. (Necessary for the same sorting/filtering as in `EntityRuler`.) * Implement `overwrite_overlapping_ents_filter` and `preserve_existing_ents_filter` to support `EntityRuler.overwrite_ents` settings. * Add `remove_by_id` to support `EntityRuler.remove` functionality. * Refactor `entity_ruler` tests to parametrize all tests to test both `entity_ruler` and `future_entity_ruler` * Implement `SpanRuler.token_patterns` and `SpanRuler.phrase_patterns` properties. Additional changes: * Move all config settings to top-level attributes to avoid duplicating settings in the config vs. `span_ruler/cfg`. (Also avoids a lot of casting.) 
* Format * Fix filter make method name * Refactor to use same error for removing by label or ID * Also provide existing spans to spans filter * Support ids property * Remove token_patterns and phrase_patterns * Update docstrings * Add span ruler docs * Fix types * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Move sorting into filters * Check for all tokens in seen tokens in entity ruler filters * Remove registered sort key * Set Token.ent_id in a backwards-compatible way in Doc.set_ents * Remove sort options from API docs * Update docstrings * Rename entity ruler filters * Fix and parameterize scoring * Add id to Span API docs * Fix typo in API docs * Include explicit labeled=True for scorer Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-06-02 13:12:53 +02:00
Max Tarlov	709d6d9114	Update documentation for displacy style kwargs (#10841 ) * Update docs for displacy style kwargs Added "span" to the accepted values for the style kwarg in the displacy.serve and displacy.render top-level functions. This style is new as of spaCy 3.3, so I added the "new" tag for that option only * restored alpha ordering	2022-05-30 09:11:55 +02:00
Peter Baumgartner	bf95f0a1dd	add doc cleaner to menu (#10862 )	2022-05-30 08:51:19 +02:00
Freddy Heppell	322c5a3ac4	Fix misspelt keyword in StringStore example	2022-05-29 10:49:19 +01:00
Sofie Van Landeghem	83ed1f391b	Remove NBSP's across tables in the docs (#10842 )	2022-05-25 09:48:39 +02:00
Lj Miranda	1d34aa2b3d	Add spacy-span-analyzer to debug data (#10668 ) * Rename to spans_key for consistency * Implement spans length in debug data * Implement how span bounds and spans are obtained In this commit, I implemented how span boundaries (the tokens) around a given span and spans are obtained. I've put them in the compile_gold() function so that it's accessible later on. I will do the actual computation of the span and boundary distinctiveness in the main function above. * Compute for p_spans and p_bounds * Add computation for SD and BD * Fix mypy issues * Add weighted average computation * Fix compile_gold conditional logic * Add test for frequency distribution computation * Add tests for kl-divergence computation * Fix weighted average computation * Make tables more compact by rounding them * Add more descriptive checks for spans * Modularize span computation methods In this commit, I added the _get_span_characteristics and _print_span_characteristics functions so that they can be reusable anywhere. * Remove unnecessary arguments and make fxs more compact * Update a few parameter arguments * Add tests for print_span and get_span methods * Update API to talk about span characteristics in brief * Add better reporting of spans_length * Add test for span length reporting * Update formatting of span length report Removed '' to indicate that it's not a string, then sort the n-grams by their length, not by their frequency. * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Show all frequency distribution when -V In this commit, I displayed the full frequency distribution of the span lengths when --verbose is passed. To make things simpler, I rewrote some of the formatter functions so that I can call them whenever. Another notable change is that instead of showing percentages as Integers, I showed them as floats (max 2-decimal places). I did this because it looks weird when it displays (0%). 
* Update logic on how total is computed The way the 90% thresholding is computed now is that we keep adding the percentages until we reach >= 90%. I also updated the wording and used the term "At least" to denote that >= 90% of your spans have these distributions. * Fix display when showing the threshold percentage * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Add better phrasing for span information * Update spacy/cli/debug_data.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Add minor edits for whitespaces etc. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-05-23 19:06:38 +02:00
Peter Baumgartner	7ce3460b23	add floret to static vectors docs (#10833 )	2022-05-23 09:16:31 +02:00
kadarakos	a3814ee739	oov confusion fix (#10828 )	2022-05-23 09:15:51 +02:00
Adriane Boyd	a82ec56aae	Remove cuda extras for non-linux arm in install widget (#10796 ) * Remove cuda extras for non-linux arm platforms in install widget * Extend cuda versions install widget * Update GPU install docs to clarify cuda	2022-05-20 09:57:41 +02:00
Adriane Boyd	b65d652881	Override SpanGroups.setdefault to provide default SpanGroup (#10772 ) * Fix mistake in SpanGroup API docs * Restrict SpanGroups.setdefault to SpanGroup only * Refactor to support default span iterable	2022-05-12 10:06:25 +02:00
Richard Hudson	d524f6415f	Add documentation tip about overriding variables (#10780 )	2022-05-11 10:15:32 +02:00
Raphael Mitsch	2904359685	Allow assets to be optional in spacy project (#10714 ) * Allow assets to be optional in spacy project: draft for optional flag/download_all options. * Allow assets to be optional in spacy project: added OPTIONAL_DEFAULT reflecting default asset optionality. * Allow assets to be optional in spacy project: renamed --all to --extra. * Allow assets to be optional in spacy project: included optional flag in project config test. * Allow assets to be optional in spacy project: added documentation. * Allow assets to be optional in spacy project: fixing deprecated --all reference. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Allow assets to be optional in spacy project: fixed project_assets() docstring. * Allow assets to be optional in spacy project: adjusted wording in justification of optional assets. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: switched to as keyword in project.yml. Updated docs. * Allow assets to be optional in spacy project: updated comment. * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in output. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in docstring.. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test.. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Allow assets to be optional in spacy project: renamed OPTIONAL_DEFAULT to EXTRA_DEFAULT. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-05-10 10:40:11 +02:00
Sofie Van Landeghem	1543558d08	Add test for old architectures (#10751 ) * add v1 and v2 tests for tok2vec architectures * textcat architectures are not "layers" * test older textcat architectures * test older parser architecture	2022-05-10 08:24:42 +02:00
Madeesh Kannan	733114bdd9	`training.md`: Fix typos (#10775 )	2022-05-09 19:44:14 +02:00
Raphael Mitsch	e626df959f	Document different ways to create a pipeline (#10762 ) * Document different ways to create a pipeline: moved up/slightly modified paragraph on pipeline creation. * Document different ways to create a pipeline: changed Finnish to Ukrainian in example for language without trained pipeline. * Document different ways to create a pipeline: added explanation of blank pipeline. * Document different ways to create a pipeline: exchanged Ukrainian with Yoruba.	2022-05-06 15:40:59 +02:00
Sofie Van Landeghem	e03b9f8095	Small doc typos (#10750 ) * fix typos * formatting	2022-05-03 13:55:27 +02:00
Adriane Boyd	497a708c71	Docs for v3.3 (#10628 ) * Temporarily disable CI tests * Start v3.3 website updates * Add trainable lemmatizer to pipeline design * Fix Vectors.most_similar * Add floret vector info to pipeline design * Add Lower and Upper Sorbian * Add span to sidebar * Work on release notes * Copy from release notes * Update pipeline design graphic * Upgrading note about Doc.from_docs * Add tables and details * Update website/docs/models/index.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix da lemma acc * Add minimal intro, various updates * Round lemma acc * Add section on floret / word lists * Add new pipelines table, minor edits * Fix displacy spans example title * Clarify adding non-trainable lemmatizer * Update adding-languages URLs * Revert "Temporarily disable CI tests" This reverts commit `1dee505920`. * Spell out words/sec Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-04-28 14:09:35 +02:00
harmbuisman	c066fb8a4e	#10672 : fixes displacy output for manual unsorted entities (#10673 ) * #10672: fixes displacy output for manual unsorted entities * #10672: removed unused import * fix prettier formatting Co-authored-by: Harm Buisman <h.buisman@iknl.nl> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-04-27 09:51:58 +02:00
Adriane Boyd	455f089c9b	Support exclude in Doc.from_docs (#10689 ) * Support exclude in Doc.from_docs * Update API docs * Add new tag to docs	2022-04-25 18:19:03 +02:00
single-fingal	4228f3c757	Fix a few minor bugs in the SpanGroup API web docs (#10650 ) * Fix a few minor bugs in the SpanGroup API web docs * Update SpanGroup docs examples to have Spans reflect intended "errors"	2022-04-14 09:59:48 +02:00
Lj Miranda	02dafa3a84	Add debug diff command in spaCy CLI (#10502 ) * Add initial design for diff command For now, the diffing process looks like this: - The default config is created based on some values in the user config (e.g. which pipeline components were used, the lang, etc.) - The user must supply manually if it was optimized for acc/efficiency and if pretraining was involved. * Make diff command structure similar to siblings * Include gpu as a user option for CLI * Make variables more explicit * Fix type declaration for optimize enum * Improve docstrings for diff CLI * Add debug-diff to website API docs * Switch position of configs so that user config is modded * Add markdown flag for debug diff This commit adds a --markdown (--md) flag that allows easier copy-pasting to GitHub issues. Please note that this commit is dependent on an unreleased version of wasabi (for the time being). For posterity, the related PR is found here: https://github.com/ines/wasabi/pull/20 * Bump version of wasabi to 0.9.1 So that we can use the add_symbols parameter. * Apply suggestions from code review Co-authored-by: Ines Montani <ines@ines.io> * Update docs based on code review suggestions Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Change command name from diff -> diff-config * Clarify when options are relevant or not * Rerun prettier on cli.md Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-04-07 10:48:45 +02:00
Adriane Boyd	0d0153db63	Update default spans_key to sc in API docs (#10616 )	2022-04-04 18:09:15 +02:00
Adriane Boyd	ca54de27bb	Support more internal methods for SpanGroup (#10476 ) * Added new convenience cython functions to SpanGroup to avoid unnecessary allocation/deallocation of objects * Replaced sorting in has_overlap with C++ for efficiency. Also, added a test for has_overlap * Added a method to efficiently merge SpanGroups * Added __delitem__, __add__ and __iadd__. Also, allowed to pass span lists to merge function. Replaced extend() body with call to merge * Renamed merge to concat and added missing things to documentation * Added operator+ and operator += in the documentation * Added a test for Doc deallocation * Update spacy/tokens/span_group.pyx * Updated SpanGroup tests to use new span list comparison function rather than assert_span_list_equal, eliminating the need to have a separate assert_not_equal function * Fixed typos in SpanGroup documentation Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Minor changes requested by Sofie: rearranged import statements. Added new=3.2.1 tag to SpanGroup.__setitem__ documentation * SpanGroup: moved repetitive list index check/adjustment in a separate function * Turn off formatting that hurts readability spacy/tests/doc/test_span_group.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Remove formatting that hurts readability spacy/tests/doc/test_span_group.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Turn off formatting that hurts readability in spacy/tests/doc/test_span_group.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Support more internal methods for SpanGroup Add support for: * `__setitem__` * `__delitem__` * `__iadd__`: for `SpanGroup` or `Iterable[Span]` * `__add__`: for `SpanGroup` only Adapted from #9698 with the scope limited to the magic methods.
* Use v3.3 as new version in docs * Add new tag to SpanGroup.copy in API docs * Remove duplicate import * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Remaining suggestions and formatting Co-authored-by: nrodnova <nrodnova@hotmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Natalia Rodnova <4512370+nrodnova@users.noreply.github.com>	2022-04-01 09:56:26 +02:00
Adriane Boyd	f98b41c390	Add vector deduplication (#10551 ) * Add vector deduplication * Add `Vocab.deduplicate_vectors()` * Always run deduplication in `spacy init vectors` * Clean up a few vector-related error messages and docs examples * Always unique with numpy * Fix types	2022-03-30 08:54:23 +02:00
Adriane Boyd	85778dfcf4	Add edit tree lemmatizer (#10231 ) * Add edit tree lemmatizer Co-authored-by: Daniël de Kok <me@danieldk.eu> * Hide edit tree lemmatizer labels * Use relative imports * Switch to single quotes in error message * Type annotation fixes Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Reformat edit_tree_lemmatizer with black * EditTreeLemmatizer.predict: take Iterable Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Validate edit trees during deserialization This change also changes the serialized representation. Rather than mirroring the deep C structure, we use a simple flat union of the match and substitution node types. * Move edit_trees to _edit_tree_internals * Fix invalid edit tree format error message * edit_tree_lemmatizer: remove outdated TODO comment * Rename factory name to trainable_lemmatizer * Ignore type instead of casting truths to List[Union[Ints1d, Floats2d, List[int], List[str]]] for thinc v8.0.14 * Switch to Tagger.v2 * Add documentation for EditTreeLemmatizer * docs: Fix 3.2 -> 3.3 somewhere * trainable_lemmatizer documentation fixes * docs: EditTreeLemmatizer is in edit_tree_lemmatizer.py Co-authored-by: Daniël de Kok <me@danieldk.eu> Co-authored-by: Daniël de Kok <me@github.danieldk.eu> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-28 11:13:50 +02:00
Adriane Boyd	d5666fd12d	Add NORM to Matcher feature in docs (#10560 )	2022-03-28 10:35:47 +02:00
Adriane Boyd	3711af74e5	Add tokenizer option to allow Matcher handling for all rules (#10452 ) * Add tokenizer option to allow Matcher handling for all rules Add tokenizer option `with_faster_rules_heuristics` that determines whether the special cases applied by the internal `Matcher` are filtered by whether they contain affixes or space. If `True` (default), the rules are filtered to prioritize speed over rare edge cases. If `False`, all rules are included in the final `Matcher`-based pass over the doc. * Reset all caches when reloading special cases * Revert "Reset all caches when reloading special cases" This reverts commit `4ef6bd171d`. * Initialize max_length properly * Add new tag to API docs * Rename to faster heuristics	2022-03-24 13:21:32 +01:00
Lj Miranda	a79cd3542b	Add displacy support for overlapping Spans (#10332 ) * Fix docstring for EntityRenderer * Add warning in displacy if doc.spans are empty * Implement parse_spans converter One notable change here is that the default spans_key is sc, and it's set by the user through the options. * Implement SpanRenderer Here, I implemented a SpanRenderer that looks similar to the EntityRenderer except for some templates. The spans_key, by default, is set to sc, but can be configured in the options (see parse_spans). The way I rendered these spans is per-token, i.e., I first check if each token (1) belongs to a given span type and (2) is a starting token of a given span type. Once I have this information, I render them into the markup. * Fix mypy issues on typing * Add tests for displacy spans support * Update colors from RGB to hex Co-authored-by: Ines Montani <ines@ines.io> * Remove unnecessary CSS properties * Add documentation for website * Remove unnecessary scripts * Update wording on the documentation Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Put typing dependency on top of file * Put back z-index so that spans overlap properly * Make warning more explicit for spans_key Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-16 18:14:34 +01:00
Daniël de Kok	e5debc68e4	Tagger: use unnormalized probabilities for inference (#10197 ) * Tagger: use unnormalized probabilities for inference Using unnormalized softmax avoids use of the relatively expensive exp function, which can significantly speed up non-transformer models (e.g. I got a speedup of 27% on a German tagging + parsing pipeline). * Add spacy.Tagger.v2 with configurable normalization Normalization of probabilities is disabled by default to improve performance. * Update documentation, models, and tests to spacy.Tagger.v2 * Move Tagger.v1 to spacy-legacy * docs/architectures: run prettier * Unnormalized softmax is now a Softmax_v2 option * Require thinc 8.0.14 and spacy-legacy 3.0.9	2022-03-15 14:15:31 +01:00
Adriane Boyd	e8357923ec	Various install docs updates (#10487 ) * Simplify quickstart source install to use only editable pip install * Update pytorch install instructions to more recent versions	2022-03-15 11:12:50 +01:00
Adriane Boyd	0dc454ba95	Update docs for Vocab.get_vector (#10486 ) * Update docs for Vocab.get_vector * Clarify description of 0-vector dimensions	2022-03-15 09:10:47 +01:00
Edward	2eef47dd26	Save span candidates produced by spancat suggesters (#10413 ) * Add save_candidates attribute * Change spancat api * Add unit test * reimplement method to produce a list of doc * Add method to docs * Add new version tag * Add intended use to docstring * prettier formatting	2022-03-14 16:46:58 +01:00
Adriane Boyd	297dd82c86	Fix initial special cases for Tokenizer.explain (#10460 ) Add the missing initial check for special cases to `Tokenizer.explain` to align with `Tokenizer._tokenize_affixes`.	2022-03-11 10:50:47 +01:00
Peter Baumgartner	01ec6349ea	Add `path.mkdir` to custom component examples of `to_disk` (#10348 ) * add `path.mkdir` to examples * add ensure_path + mkdir * update highlights	2022-03-08 16:04:10 +01:00
Adriane Boyd	60520d8669	Fix types in API docs for moves in parser and ner (#10464 )	2022-03-08 13:51:11 +01:00
Adriane Boyd	b2bbefd0b5	Add Finnish, Korean, and Swedish models and Korean support notes (#10355 ) * Add Finnish, Korean, and Swedish models to website * Add Korean language support notes	2022-03-07 17:03:45 +01:00
Paul O'Leary McCann	91acc3ea75	Fix entity linker batching (#9669 ) * Partial fix of entity linker batching * Add import * Better name * Add `use_gold_ents` option, docs * Change to v2, create stub v1, update docs etc. * Fix error type Honestly no idea what the right type to use here is. ConfigValidationError seems wrong. Maybe a NotImplementedError? * Make mypy happy * Add hacky fix for init issue * Add legacy pipeline entity linker * Fix references to class name * Add __init__.py for legacy * Attempted fix for loss issue * Remove placeholder V1 * formatting * slightly more interesting train data * Handle batches with no usable examples This adds a test for batches that have docs but not entities, and a check in the component that detects such cases and skips the update step as though the batch were empty. * Remove todo about data verification Check for empty data was moved further up so this should be OK now - the case in question shouldn't be possible. * Fix gradient calculation The model doesn't know which entities are not in the kb, so it generates embeddings for the context of all of them. However, the loss does know which entities aren't in the kb, and it ignores them, as there's no sensible gradient. This has the issue that the gradient will not be calculated for some of the input embeddings, which causes a dimension mismatch in backprop. That should have caused a clear error, but with numpyops it was causing nans to happen, which is another problem that should be addressed separately. This commit changes the loss to give a zero gradient for entities not in the kb. * add failing test for v1 EL legacy architecture * Add nasty but simple working check for legacy arch * Clarify why init hack works the way it does * Clarify use_gold_ents use case * Fix use gold ents related handling * Add tests for no gold ents and fix other tests * Use aligned ents function (not working) This doesn't actually work because the "aligned" ents are gold-only.
But if I have a different function that returns the intersection, then this will work as desired. * Use proper matching ent check This changes the process when gold ents are not used so that the intersection of ents in the pred and gold is used. * Move get_matching_ents to Example * Use model attribute to check for legacy arch * Rename flag * bump spacy-legacy to lower 3.0.9 Co-authored-by: svlandeg <svlandeg@github.com>	2022-03-04 09:17:36 +01:00
Adriane Boyd	8e93fa8507	Fix Vectors.n_keys for floret vectors (#10394 ) Fix `Vectors.n_keys` for floret vectors to match docstring description and avoid W007 warnings in similarity methods.	2022-03-01 09:21:25 +01:00
Sofie Van Landeghem	3f68bbcfec	Clean up loggers docs (#10351 ) * update docs to point to spacy-loggers docs * remove unused error code	2022-02-25 16:29:12 +01:00
Sofie Van Landeghem	a16b14e591	Merge branch 'master' into copy/develop	2022-02-16 14:04:59 +01:00
Ines Montani	7b883da9fd	Merge pull request #10239 from explosion/docs/spacy-tailored-pipelines [ci skip]	2022-02-08 18:04:01 +01:00
Ines Montani	f2c2b97e56	Add spaCy Tailored Pipelines	2022-02-08 11:46:42 +01:00
Sofie Van Landeghem	deb143fa70	Token sent attributes more consistent (#10164 ) * remove duplicate line * add sent start/end token attributes to the docs * let has_annotation work with IS_SENT_END * elif instead of if * add has_annotation test for sent attributes * fix typo * remove duplicate is_sent_start entry in docs	2022-02-08 08:35:37 +01:00
Peter Baumgartner	836f689cc7	YAML multiline tip for project.yml files (#10187 ) * MultiHashEmbed vector docs correction * add in multi-line tip * convert to sidebar tip	2022-02-08 08:35:09 +01:00
Lj Miranda	72fece712f	Add shuffle parameter to Corpus API docs (#10220 ) * Add shuffle parameter to Corpus API docs * Update website/docs/api/corpus.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-02-07 14:55:53 +01:00
Sofie Van Landeghem	14513f82da	Merge pull request #10215 from explosion/master update develop	2022-02-06 13:45:41 +01:00
Lj Miranda	345e7f6bc4	Clarify Span.ents documentation (#10154 ) * Clarify Span.ents documentation Ref: #10135 Retain current behaviour. Span.ents will only include entities within said span. You can't get tokens outside of the original span. * Reword docstrings Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update API docs in the website Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-01-31 08:41:42 +01:00
Adriane Boyd	4f441dfa24	Fix infix as prefix in Tokenizer.explain (#10140 ) * Fix infix as prefix in Tokenizer.explain Update `Tokenizer.explain` to align with the `Tokenizer` algorithm: * skip infix matches that are prefixes in the current substring * Update tokenizer pseudocode in docs	2022-01-28 17:00:54 +01:00
Sofie Van Landeghem	4465fe0306	Merge branch 'develop' into feature/master_copy	2022-01-20 13:36:17 +01:00
Duygu Altinok	268ddf8a06	Add ENT_IOB key to Matcher (#9649 ) * added new field * added exception for IOB strings * minor refinement to schema * removed field * fixed typo * imported numerical val * changed the code bit * cosmetics * added test for matcher * set ents of mock docs * added invalid pattern * minor update to documentation * blacked matcher * added pattern validation * add IOB vals to schema * changed into test * mypy compat * cleaned left over * added compat import * changed type * added compat import * changed literal a bit * went back to old * made explicit type * Update spacy/schemas.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/schemas.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/schemas.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-01-20 13:18:39 +01:00
Paul O'Leary McCann	2ff53834bb	Add link to pattern file info in EntityRuler.initialize docs (#10091 ) * Add link to pattern file info in EntityRuler.initialize docs * Update website/docs/api/entityruler.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-01-19 10:45:11 +01:00
Daniël de Kok	50d2a2c930	Use fewer Vector internals (#9879 ) * Use Vectors.shape rather than Vectors.data.shape * Use Vectors.size rather than Vectors.data.size * Add Vectors.to_ops to move data between different ops * Add documentation for Vector.to_ops	2022-01-18 17:14:35 +01:00
ColleterVi	a784b12eff	fix: new restcountries url (#10043 ) Url extension "eu" and path "rest" are no longer available. Replacing them for a working url.	2022-01-13 20:25:06 +09:00
Sofie Van Landeghem	067a44a417	Merge pull request #9987 from explosion/master Update develop with commits from master	2022-01-05 11:49:50 +01:00
Sofie Van Landeghem	56dcb39fb7	Fix references to config file in the docs & UX (#9961 ) * doc fixes around config file * fix typo * clarify default	2022-01-04 14:31:26 +01:00
Florian Cäsar	86e71e7b19	Fix Scorer.score_cats for missing labels (#9443 ) * Fix Scorer.score_cats for missing labels * Add test case for Scorer.score_cats missing labels * semantic nitpick * black formatting * adjust test to give different results depending on multi_label setting * fix loss function according to whether or not missing values are supported * add note to docs * small fixes * make mypy happy * Update spacy/pipeline/textcat.py Co-authored-by: Florian Cäsar <florian.caesar@pm.me> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <svlandeg@github.com>	2021-12-29 11:04:39 +01:00
Peter Baumgartner	72abf9e102	MultiHashEmbed vector docs correction (#9918 )	2021-12-27 11:18:08 +01:00
Adriane Boyd	51a3b60027	Document Tagger neg_prefix, fix typo (#9821 )	2021-12-07 09:42:40 +01:00
Duygu Altinok	b56b9e7f31	Entity ruler remove pattern (#9685 ) * added ruler coe * added error for none existing pattern * changed error to warning * changed error to warning * added basic tests * fixed place * added test files * went back to error * went back to pattern error * minor change to docs * changed style * changed doc * changed error slightly * added remove to phrasem api * error key already existed * phrase matcher match code to api * blacked tests * moved comments before expr * corrected error no * Update website/docs/api/entityruler.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/entityruler.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-12-06 15:32:49 +01:00
Natalia Rodnova	472740d613	Added sents property to Span for Spans spanning over several sentences (#9699 ) * Added sents property to Span class that returns a generator of sentences the Span belongs to * Added description to Span.sents property * Update test_span to clarify the difference between span.sent and span.sents Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/tests/doc/test_span.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix documentation typos in spacy/tokens/span.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update Span.sents doc string in spacy/tokens/span.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Parametrized test_span_spans * Corrected Span.sents to check for span-level hook first. Also, made Span.sent respect doc-level sents hook if no span-level hook is provided * Corrected Span documentation copy/paste issue * Put back accidentally deleted lines * Fixed formatting in span.pyx * Moved check for SENT_START annotation after user hooks in Span.sents * add version where the property was introduced Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-12-06 09:58:01 +01:00
Narayan Acharya	1be8a4dab3	Displacy serve entity linking support without `manual=True` support. (#9748 ) * Add support for kb_id to be displayed via displacy.serve. The current support is only limited to the manual option in displacy.render * Commit to check pre-commit hooks are run. * Update spacy/displacy/__init__.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Changes as per suggestions on the PR. * Update website/docs/api/top-level.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/top-level.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * tag option as new from 3.2.1 onwards Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2021-11-29 17:13:26 +01:00
Adriane Boyd	6763cbfdc0	Update Catalan acknowledgements for v3.2 (#9763 )	2021-11-29 14:14:21 +01:00
Natalia Rodnova	a4c43e5c57	Allow Matcher to match on ENT_ID and ENT_KB_ID (#9688 ) * Added ENT_ID and ENT_KB_ID into the list of the attributes that Matcher matches on * Added ENT_ID and ENT_KB_ID to TEST_PATTERNS in test_pattern_validation.py. Disabled tests that I added before * Update website/docs/api/matcher.md * Format * Remove skipped tests Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-11-24 10:37:10 +01:00
Adriane Boyd	9ac6d4991e	Add doc_cleaner component (#9659 ) * Add doc_cleaner component * Fix types * Fix loop * Rephrase method description	2021-11-23 15:33:33 +01:00
Paul O'Leary McCann	52b8c2d2e0	Add note on batch contract for listeners (#9691 ) * Add note on batch contract Using listeners requires batches to be consistent. This is obvious if you understand how the listener works, but it wasn't clearly stated in the Docs, and was subtle enough that the EntityLinker missed it. There is probably a clearer way to explain what the actual requirement is, but I figure this is a good start. * Rewrite to clarify role of caching	2021-11-22 11:06:07 +01:00
Sofie Van Landeghem	13645dcbf5	add note that annotating components is new since 3.1 (#9678 )	2021-11-22 14:43:11 +09:00
Paul O'Leary McCann	f3981bd0c8	Clarify how to fill in init_tok2vec after pretraining (#9639 ) * Clarify how to fill in init_tok2vec after pretraining * Ignore init_tok2vec arg in pretraining * Update docs, config setting * Remove obsolete note about not filling init_tok2vec early This seems to have also caught some lines that needed cleanup.	2021-11-18 15:38:30 +01:00
Adriane Boyd	216ed231a9	What's new in v3.2 (#9633 ) * What's new in v3.2 * Fix formatting * Fix typo * Redo thanks * Formatting * Fix typo * Fix project links * Fix typo * Minimal intro, floret python module * Rephrase * Rephrase, extend * Rephrase * Update links and formatting [ci skip] * Minor correction * Fix typo Co-authored-by: Ines Montani <ines@ines.io>	2021-11-05 16:31:14 +01:00
Adriane Boyd	07dea324f6	Merge remote-tracking branch 'upstream/develop' into chore/switch-to-master-v3.2.0	2021-11-03 15:32:18 +01:00
Paul O'Leary McCann	c1cc94a33a	Fix typo about receptive field size (#9564 )	2021-11-03 15:16:55 +01:00
Paul O'Leary McCann	e43639b27a	Add note about round-trip serializing pipeline to API docs (#9583 )	2021-11-03 09:55:30 +01:00
Vasundhara	5279c7c4ba	Fix broken link to mappings-exceptions (#9573 )	2021-10-31 13:44:29 +09:00
Adriane Boyd	2d430958e1	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-3	2021-10-29 12:18:15 +02:00
Paul O'Leary McCann	006df1ae1f	Clarify error when words are of wrong type (#9541 ) * Clarify error when words are of wrong type See #9437 * Update docs * Use try/except * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-29 12:08:40 +02:00
Paul O'Leary McCann	2fd8d616e7	Add docs section for spacy.cli.train.train (#9545 ) * Add section for spacy.cli.train.train * Add link from training page to train function * Ensure path in train helper * Update docs Co-authored-by: Ines Montani <ines@ines.io>	2021-10-29 10:36:34 +02:00
Adriane Boyd	5477453ea3	Docs for thinc-apple-ops (#9549 ) * Docs for thinc-apple-ops * Ignore thinc-apple-ops in reqs tests * Fix install quickstart * Add cupy cuda 113, 114 extras * Remove draft section Co-authored-by: Ines Montani <ines@ines.io>	2021-10-29 10:35:31 +02:00
Adriane Boyd	12974bf4d9	Add micro PRF for morph scoring (#9546 ) * Add micro PRF for morph scoring For pipelines where morph features are added by more than one component and a reference training corpus may not contain all features, a micro PRF score is more flexible than a simple accuracy score. An example is the reading and inflection features added by the Japanese tokenizer. * Use `morph_micro_f` as the default morph score for Japanese morphologizers. * Update docstring * Fix typo in docstring * Update Scorer API docs * Fix results type * Organize score list by attribute prefix	2021-10-29 10:29:29 +02:00
Adriane Boyd	c053f158c5	Add support for floret vectors (#8909 ) * Add support for fasttext-bloom hash-only vectors Overview: * Extend `Vectors` to have two modes: `default` and `ngram` * `default` is the default mode and equivalent to the current `Vectors` * `ngram` supports the hash-only ngram tables from `fasttext-bloom` * Extend `spacy.StaticVectors.v2` to handle both modes with no changes for `default` vectors * Extend `spacy init vectors` to support ngram tables The `ngram` mode only supports vector tables produced by this fork of fastText, which adds an option to represent all vectors using only the ngram buckets table and which uses the exact same ngram generation algorithm and hash function (`MurmurHash3_x64_128`). `fasttext-bloom` produces an additional `.hashvec` table, which can be loaded by `spacy init vectors --fasttext-bloom-vectors`. https://github.com/adrianeboyd/fastText/tree/feature/bloom Implementation details: * `Vectors` now includes the `StringStore` as `Vectors.strings` so that the API can stay consistent for both `default` (which can look up from `str` or `int`) and `ngram` (which requires `str` to calculate the ngrams). * In ngram mode `Vectors` uses a default `Vectors` object as a cache since the ngram vectors lookups are relatively expensive. * The default cache size is the same size as the provided ngram vector table. * Once the cache is full, no more entries are added. The user is responsible for managing the cache in cases where the initial documents are not representative of the texts. * The cache can be resized by setting `Vectors.ngram_cache_size` or cleared with `vectors._ngram_cache.clear()`. * The API ends up a bit split between methods for `default` and for `ngram`, so functions that only make sense for `default` or `ngram` include warnings with custom messages suggesting alternatives where possible. * `Vocab.vectors` becomes a property so that the string stores can be synced when assigning vectors to a vocab. 
* `Vectors` serializes its own config settings as `vectors.cfg`. * The `Vectors` serialization methods have added support for `exclude` so that the `Vocab` can exclude the `Vectors` strings while serializing. Removed: * The `minn` and `maxn` options and related code from `Vocab.get_vector`, which does not work in a meaningful way for default vector tables. * The unused `GlobalRegistry` in `Vectors`. * Refactor to use reduce_mean Refactor to use reduce_mean and remove the ngram vectors cache. * Rename to floret * Rename to floret in error messages * Use --vectors-mode in CLI, vector init * Fix vectors mode in init * Remove unused var * Minor API and docstrings adjustments * Rename `--vectors-mode` to `--mode` in `init vectors` CLI * Rename `Vectors.get_floret_vectors` to `Vectors.get_batch` and support both modes. * Minor updates to Vectors docstrings. * Update API docs for Vectors and init vectors CLI * Update types for StaticVectors	2021-10-27 14:08:31 +02:00
Adriane Boyd	a803af9dfa	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1	2021-10-26 11:53:50 +02:00
Elia Robyn Lake (Robyn Speer)	fa70837f28	clarify how to connect pretraining to training (#9450 ) * clarify how to connect pretraining to training Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * Update website/docs/usage/embeddings-transformers.md * Update website/docs/usage/embeddings-transformers.md * Update website/docs/usage/embeddings-transformers.md * Update website/docs/usage/embeddings-transformers.md Co-authored-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-10-22 13:15:47 +02:00
Daniël de Kok	1f05f56433	Add the spacy.models_with_nvtx_range.v1 callback (#9124 ) * Add the spacy.models_with_nvtx_range.v1 callback This callback recursively adds NVTX ranges to the Models in each pipe in a pipeline. * Fix create_models_with_nvtx_range type signature * NVTX range: wrap models of all trainable pipes jointly This avoids that (sub-)models that are shared between pipes get wrapped twice. * NVTX range callback: make color configurable Add forward_color and backprop_color options to set the color for the NVTX range. * Move create_models_with_nvtx_range to spacy.ml * Update create_models_with_nvtx_range for thinc changes with_nvtx_range now updates an existing node, rather than returning a wrapper node. So, we can simply walk over the nodes and update them. * NVTX: use after_pipeline_creation in example	2021-10-20 11:59:48 +02:00
Paul O'Leary McCann	222cf9b6d2	Clarify how to change base Transformer model (#9498 ) * Add note about how the model name is used * Add link to TransformersModel docs, separate paragraph * Local link * Revise docs * Update website/docs/usage/embeddings-transformers.md * Update website/docs/usage/embeddings-transformers.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-10-19 23:28:20 +02:00
Adriane Boyd	a6424bcea9	Minor updates to spacy-transformers docs for v1.1.0 (#9496 )	2021-10-18 14:55:02 +02:00
Adriane Boyd	9b86209a4a	Update docs for spacy-transformers v1.1 data classes (#9361 )	2021-10-18 14:16:58 +02:00
Sofie Van Landeghem	3fd3531e12	Docs for new spacy-trf architectures (#8954 ) * use TransformerModel.v2 in quickstart * update docs for new transformer architectures * bump spacy_transformers to 1.1.0 * Add new arguments spacy-transformers.TransformerModel.v3 * Mention that mixed-precision support is experimental * Describe delta transformers.Tok2VecTransformer versions * add dot * add dot, again * Update some more TransformerModel references v2 -> v3 * Add mixed-precision options to the training quickstart Disable mixed-precision training/prediction by default. * Update setup.cfg Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/usage/embeddings-transformers.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Daniël de Kok <me@danieldk.eu> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-18 14:15:06 +02:00
Connor Brinton	657af5f91f	🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167 ) * 🚨 Ignore all existing Mypy errors * 🏗 Add Mypy check to CI * Add types-mock and types-requests as dev requirements * Add additional type ignore directives * Add types packages to dev-only list in reqs test * Add types-dataclasses for python 3.6 * Add ignore to pretrain * 🏷 Improve type annotation on `run_command` helper The `run_command` helper previously declared that it returned an `Optional[subprocess.CompletedProcess]`, but it isn't actually possible for the function to return `None`. These changes modify the type annotation of the `run_command` helper and remove all now-unnecessary `# type: ignore` directives. * 🔧 Allow variable type redefinition in limited contexts These changes modify how Mypy is configured to allow variables to have their type automatically redefined under certain conditions. The Mypy documentation contains the following example: ```python def process(items: List[str]) -> None: # 'items' has type List[str] items = [item.split() for item in items] # 'items' now has type List[List[str]] ... ``` This configuration change is especially helpful in reducing the number of `# type: ignore` directives needed to handle the common pattern of: * Accepting a filepath as a string * Overwriting the variable using `filepath = ensure_path(filepath)` These changes enable redefinition and remove all `# type: ignore` directives rendered redundant by this change. 
* 🏷 Add type annotation to converters mapping * 🚨 Fix Mypy error in convert CLI argument verification * 🏷 Improve type annotation on `resolve_dot_names` helper * 🏷 Add type annotations for `Vocab` attributes `strings` and `vectors` * 🏷 Add type annotations for more `Vocab` attributes * 🏷 Add loose type annotation for gold data compilation * 🏷 Improve `_format_labels` type annotation * 🏷 Fix `get_lang_class` type annotation * 🏷 Loosen return type of `Language.evaluate` * 🏷 Don't accept `Scorer` in `handle_scores_per_type` * 🏷 Add `string_to_list` overloads * 🏷 Fix non-Optional command-line options * 🙈 Ignore redefinition of `wandb_logger` in `loggers.py` * ➕ Install `typing_extensions` in Python 3.8+ The `typing_extensions` package states that it should be used when "writing code that must be compatible with multiple Python versions". Since SpaCy needs to support multiple Python versions, it should be used when newer `typing` module members are required. One example of this is `Literal`, which is available starting with Python 3.8. Previously SpaCy tried to import `Literal` from `typing`, falling back to `typing_extensions` if the import failed. However, Mypy doesn't seem to be able to understand what `Literal` means when the initial import means. Therefore, these changes modify how `compat` imports `Literal` by always importing it from `typing_extensions`. These changes also modify how `typing_extensions` is installed, so that it is a requirement for all Python versions, including those greater than or equal to 3.8. * 🏷 Improve type annotation for `Language.pipe` These changes add a missing overload variant to the type signature of `Language.pipe`. Additionally, the type signature is enhanced to allow type checkers to differentiate between the two overload variants based on the `as_tuple` parameter. 
Fixes #8772 * ➖ Don't install `typing-extensions` in Python 3.8+ After more detailed analysis of how to implement Python version-specific type annotations using SpaCy, it has been determined that by branching on a comparison against `sys.version_info` can be statically analyzed by Mypy well enough to enable us to conditionally use `typing_extensions.Literal`. This means that we no longer need to install `typing_extensions` for Python versions greater than or equal to 3.8! 🎉 These changes revert previous changes installing `typing-extensions` regardless of Python version and modify how we import the `Literal` type to ensure that Mypy treats it properly. * resolve mypy errors for Strict pydantic types * refactor code to avoid missing return statement * fix types of convert CLI command * avoid list-set confustion in debug_data * fix typo and formatting * small fixes to avoid type ignores * fix types in profile CLI command and make it more efficient * type fixes in projects CLI * put one ignore back * type fixes for render * fix render types - the sequel * fix BaseDefault in language definitions * fix type of noun_chunks iterator - yields tuple instead of span * fix types in language-specific modules * 🏷 Expand accepted inputs of `get_string_id` `get_string_id` accepts either a string (in which case it returns its ID) or an ID (in which case it immediately returns the ID). These changes extend the type annotation of `get_string_id` to indicate that it can accept either strings or IDs. * 🏷 Handle override types in `combine_score_weights` The `combine_score_weights` function allows users to pass an `overrides` mapping to override data extracted from the `weights` argument. Since it allows `Optional` dictionary values, the return value may also include `Optional` dictionary values. These changes update the type annotations for `combine_score_weights` to reflect this fact. 
* 🏷 Fix tokenizer serialization method signatures in `DummyTokenizer` * 🏷 Fix redefinition of `wandb_logger` These changes fix the redefinition of `wandb_logger` by giving a separate name to each `WandbLogger` version. For backwards-compatibility, `spacy.train` still exports `wandb_logger_v3` as `wandb_logger` for now. * more fixes for typing in language * type fixes in model definitions * 🏷 Annotate `_RandomWords.probs` as `NDArray` * 🏷 Annotate `tok2vec` layers to help Mypy * 🐛 Fix `_RandomWords.probs` type annotations for Python 3.6 Also remove an import that I forgot to move to the top of the module 😅 * more fixes for matchers and other pipeline components * quick fix for entity linker * fixing types for spancat, textcat, etc * bugfix for tok2vec * type annotations for scorer * add runtime_checkable for Protocol * type and import fixes in tests * mypy fixes for training utilities * few fixes in util * fix import * 🐵 Remove unused `# type: ignore` directives * 🏷 Annotate `Language._components` * 🏷 Annotate `spacy.pipeline.Pipe` * add doc as property to span.pyi * small fixes and cleanup * explicit type annotations instead of via comment Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: svlandeg <svlandeg@github.com>	2021-10-14 15:21:40 +02:00
Adriane Boyd	d98d525bc8	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.1-3	2021-10-14 09:41:46 +02:00
Paul O'Leary McCann	b53e39455e	Fix UD POS docs links (fix #9013 ) (#9407 ) * Fix UD POS docs links (fix #9013) The previous link seems to have been for UD v1. * Fix link	2021-10-11 11:51:19 +02:00
Adriane Boyd	a5231cb044	Remove traces of lexemes from vocab serialization (#9400 )	2021-10-11 11:13:35 +02:00
Adriane Boyd	ae1b3e960b	Update overwrite and scorer in API docs (#9384 ) * Update overwrite and scorer in API docs * Rephrase morphologizer extend + example	2021-10-11 10:35:07 +02:00
Sofie Van Landeghem	f87ae3cb7d	Doc fixes in convert API (#9350 ) * add more info on the spacy debug command * formatting	2021-10-06 13:13:18 +09:00
Elia Robyn Lake (Robyn Speer)	53b5f245ed	Allow IETF language codes, aliases, and close matches (#9342 ) * use language-matching to allow language code aliases Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * link to "IETF language tags" in docs Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * Make requirements consistent Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * change "two-letter language ID" to "IETF language tag" in language docs Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * use langcodes 3.2 and handle language-tag errors better Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * all unknown language codes are ImportErrors Signed-off-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Elia Robyn Speer <elia@explosion.ai>	2021-10-05 09:52:22 +02:00
Paul O'Leary McCann	1ee6541ab0	Moving Japanese tokenizer extra info to Token.morph (#8977 ) * Use morph for extra Japanese tokenizer info Previously Japanese tokenizer info that didn't correspond to Token fields was put in user data. Since spaCy core should avoid touching user data, this moves most information to the Token.morph attribute. It also adds the normalized form, which wasn't exposed before. The subtokens, which are a list of full tokens, are still added to user data, except with the default tokenizer granualarity. With the default tokenizer settings the subtokens are all None, so in this case the user data is simply not set. * Update tests Also adds a new test for norm data. * Update docs * Add Japanese morphologizer factory Set the default to `extend=True` so that the morphologizer does not clobber the values set by the tokenizer. * Use the norm_ field for normalized forms Before this commit, normalized forms were put in the "norm" field in the morph attributes. I am not sure why I did that instead of using the token morph, I think I just forgot about it. * Skip test if sudachipy is not installed * Fix import Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-01 19:19:26 +02:00
Paul O'Leary McCann	6e833b617a	Updating Troubleshooting Docs (#9329 ) * Add link to Discussions FAQ * Remove old FAQ entries I think these are no longer relevant. - no-cache-dir: affected pip versions are very old now - narrow unicode: not an issue from py3.3+ - utf-8 osx: upstream bug closed in 2019 Some of the other issues are also maybe not frequent.	2021-10-01 12:28:22 +02:00

1969 Commits