spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-03-02 19:01:29 +03:00

Author	SHA1	Message	Date
Ian Thompson	ef20e114e0	Typo fix in `Language.replace_listeners` docs (#12823 ) * modified: spacy/language.py - corrected typo in docstring for :method:`Language.replace_listeners` - added noqa comment on unused local variable assignment in :method:`Language.from_config` as I wasn't sure if it should be unassigned modified: website/docs/api/language.mdx - corrected typo in `Language.replace_listeners` markdown * modified: spacy/language.py - removed noqa comment --------- Co-authored-by: Ian Thompson <ian.thompson@hrblock.com>	2023-07-14 09:45:54 +02:00
Adriane Boyd	41dba5bd34	Update max_length default in span finder docs (#12803 )	2023-07-07 10:17:41 +02:00
Adriane Boyd	fb0da3e097	Support custom token/lexeme attribute for vectors (#12625 ) * Support custom token/lexeme attribute for vectors * Fix imports * Back off to ORTH without Vectors.attr * Fallback if vectors.attr doesn't exist * Update docs	2023-06-28 09:43:14 +02:00
kadarakos	c003aac29a	SpanFinder into spaCy from experimental (#12507 ) * span finder integrated into spacy from experimental * black * isort * black * default spankey constant * black * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * rename * rename * max_length and min_length as Optional[int] and strict checking * black * mypy fix for integer type infinity * revert line order * implement all comparison operators for inf int * avoid two for loops over all docs by not precomputing * interleave thresholding with span creation * black * revert to not interleaving (relized its faster) * black * Update spacy/errors.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * update dosctring * enforce that the gold and predicted documents have the same text * new error for ensuring reference and predicted texts are the same * remove todo * adjust test * black * handle misaligned tokenization * return correct variable * failing overfit test * only use a single spans_key like in spancat * black * remove debug lines * typo * remove comment * remove near duplicate reduntant method * use the 'spans_key' variable name everywhere * Update spacy/pipeline/span_finder.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * flaky test fix suggestion, hand set bias terms * only test suggester and test result exhaustively * make it clear that the span_finder_suggester is more general (not specific to span_finder) * Update spacy/tests/pipeline/test_span_finder.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Apply suggestions from code review * remove question comment * move preset_spans_suggester test to spancat tests * Add docs and unify default configs for spancat and span finder * Add `allow_overlap=True` to span finder scorer * Fix offset bug in set_annotations * Ignore labels in span finder scorer * Format * Add span_finder to quickstart template * Move settings to self.cfg, store min/max unset as None * Remove debugging * Update 
docstrings and docs * Update spacy/pipeline/span_finder.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix imports --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-06-07 15:52:28 +02:00
Lj Miranda	58779c24ef	Remove shorthand for output-file in spacy apply (#12636 ) The output-file argument is positional, so can't use a shorthand like -o.	2023-05-17 12:36:29 +02:00
Adriane Boyd	3dc445df8d	Fix new tags in docs for v3.5.x (#12629 ) * Fix new tags in docs for v3.5.x * Fix new tag	2023-05-15 12:06:58 +02:00
Adriane Boyd	3637148c4d	Add scorer option to return per-component scores (#12540 ) * Add scorer option to return per-component scores Add `per_component` option to `Language.evaluate` and `Scorer.score` to return scores keyed by `tokenizer` (hard-coded) or by component name. Add option to `evaluate` CLI to score by component. Per-component scores can only be saved to JSON. * Update help text and messages	2023-05-12 15:36:54 +02:00
Kenneth Enevoldsen	88680a6eed	docs: remove invalid huggingface-hub push argument (#12624 )	2023-05-12 09:40:28 +02:00
Kenneth Enevoldsen	73698326df	Update inmemorylookupkb.mdx (#12586 ) Example does not refer to the in memory lookup	2023-05-02 12:51:13 +02:00
Adriane Boyd	b60b027927	Add default option to MorphAnalysis.get (#12545 ) * Add default to MorphAnalysis.get Similar to `dict`, allow a `default` option for `MorphAnalysis.get` for the user to provide a default return value if the field is not found. The default return value remains `[]`, which is not the same as `dict.get`, but is already established as this method's default return value with the return type `List[str]`. However the new `default` option does not enforce that the user-provided default is actually `List[str]`. * Restore test case	2023-04-20 14:06:32 +02:00
TAN Long	119f959218	docs(REL_OP): modify docs for REL_OPs to match Semgrex's update on CoreNLP v4.5.2 (#12531 ) Co-authored-by: Tan Long <tanloong@foxmail.com>	2023-04-17 13:14:01 +02:00
Edward	c95d320d28	Add more information to custom code docs (#12491 ) * Add info to sections * Update website/docs/usage/training.mdx --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-04-06 11:45:19 +02:00
Will Frey	8d4129e177	Fix invalid ConsoleLogger.v3 example config (#12498 ) Replace `progress_bar = "all_steps"` with `progress_bar = "eval"`, which is consistent with the default behavior for `spacy.ConsoleLogger.v1` and `spacy.ConsoleLogger.v2`.	2023-04-04 20:53:07 +02:00
Edward	de32011e4c	Add model-last saving mechanism to pretraining (#12459 ) * Adjust pretrain command * chane naming and add finally block * Add unit test * Add unit test assertions * Update spacy/training/pretrain.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * change finally block * Add to docs * Update website/docs/usage/embeddings-transformers.mdx * Add flag to skip saving model-last --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-04-03 15:24:03 +02:00
Ye Lei (叶磊)	ce258670b7	Allow passing a Span to displacy.parse_deps (#12477 ) * Allow passing a Span to displacy.parse_deps * Update docstring Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update API docs --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-03-31 09:44:01 +02:00
Edward	dba4e7bece	Add info to stringstore and vocab (#12471 )	2023-03-27 13:15:14 +02:00
Prajakta Darade	ae7779e830	corrected example code (#12466 )	2023-03-27 11:32:49 +02:00
kadarakos	d1474fdd91	add explanation about overwriting behaviour (#12464 ) * add explanation about overwriting behaviour * Update website/docs/api/spancategorizer.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/spancategorizer.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/spancategorizer.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * format --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-03-27 10:27:11 +02:00
Vinit Ravishankar	28de85737f	Tagger label smoothing (#12293 ) * add label smoothing * use True/False instead of floats * add entropy to debug data * formatting * docs * change test to check difference in distributions * Update website/docs/api/tagger.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/tagger.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * bool -> float * update docs * fix seed * black * update tests to use label_smoothing = 0.0 * set default to 0.0, update quickstart * Update spacy/pipeline/tagger.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * update morphologizer, tagger test * fix morph docs * add url to docs --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-03-22 12:17:56 +01:00
Adriane Boyd	2ce9a220db	Fix --verbose for spacy find-threshold (#12418 )	2023-03-14 17:16:49 +01:00
Lj Miranda	913d74f509	Add spancat_singlelabel pipeline for multiclass and non-overlapping span labelling tasks (#11365 ) * [wip] Update * [wip] Update * Add initial port * [wip] Update * Fix all imports * Add spancat_exclusive to pipeline * [WIP] Update * [ci skip] Add breakpoint for debugging * Use spacy.SpanCategorizer.v1 as default archi * Update spacy/pipeline/spancat_exclusive.py Co-authored-by: kadarakos <kadar.akos@gmail.com> * [ci skip] Small updates * Use Softmax v2 directly from thinc * Cache the label map * Fix mypy errors However, I ignored line 370 because it opened up a bunch of type errors that might be trickier to solve and might lead to a more complicated codebase. * avoid multiplication with 1.0 Co-authored-by: kadarakos <kadar.akos@gmail.com> * Update spacy/pipeline/spancat_exclusive.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update component versions to v2 * Add scorer to docstring * Add _n_labels property to SpanCategorizer Instead of using len(self.labels) in initialize() I am using a private property self._n_labels. This achieves implementation parity and allows me to delete the whole initialize() method for spancat_exclusive (since it's now the same with spancat). * Inherit from SpanCat instead of TrainablePipe This commit changes the inheritance structure of Exclusive_Spancat, now it's inheriting from SpanCategorizer than TrainablePipe. This allows me to remove duplicate methods that are already present in the parent function. 
* Revert documentation link to spancat * Fix init call for exclusive spancat * Update spacy/pipeline/spancat_exclusive.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Import Suggester from spancat * Include zero_init.v1 for spancat * Implement _allow_extra_label to use _n_labels To ensure that spancat / spancat_exclusive cannot be resized after initialization, I inherited the _allow_extra_label() method from spacy/pipeline/trainable_pipe.pyx and used self._n_labels instead of len(self.labels) for checking. I think that changing it locally is a better solution rather than forcing each class that inherits TrainablePipe to use the self._n_labels attribute. Also note that I turned-off black formatting in this block of code because it reads better without the overhang. * Extend existing tests to spancat_exclusive In this commit, I extended the existing tests for spancat to include spancat_exclusive. I parametrized the test functions with 'name' (similar var name with textcat and textcat_multilabel) for each applicable test. TODO: Add overfitting tests for spancat_exclusive * Update documentation for spancat * Turn on formatting for allow_extra_label * Remove initializers in default config * Use DEFAULT_EXCL_SPANCAT_MODEL I also renamed spancat_exclusive_default_config into spancat_excl_default_config because black does some not pretty formatting changes. * Update documentation Update grammar and usage Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Clarify docstring for Exclusive_SpanCategorizer * Remove mypy ignore and typecast labels to list * Fix documentation API * Use a single variable for tests * Update defaults for number of rows Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Put back initializers in spancat config Whenever I remove model.scorer.init_w and model.scorer.init_b, I encounter an error in the test: SystemError: <method '__getitem__' of 'dict' objects> returned a result with an error set. 
My Thinc version is 8.1.5, but I can't seem to check what's causing the error. * Update spancat_exclusive docstring * Remove init_W and init_B parameters This commit is expected to fail until the new Thinc release. * Require thinc>=8.1.6 for serializable Softmax defaults * Handle zero suggestions to make tests pass I'm not sure if this is the most elegant solution. But what should happen is that the _make_span_group function MUST return an empty SpanGroup if there are no suggestions. The error happens when the 'scores' variable is empty. We cannot get the 'predicted' and other downstream vars. * Better approach for handling zero suggestions * Update website/docs/api/spancategorizer.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spancategorizer headers * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Add default value in negative_weight in docs * Add default value in allow_overlap in docs * Update how spancat_exclusive is constructed In this commit, I added the following: - Put the default values of negative_weight and allow_overlap in the default_config dictionary. 
- Rename make_spancat -> make_exclusive_spancat * Run prettier on spancategorizer.mdx * Change exactly one -> at most one * Add suggester documentation in Exclusive_SpanCategorizer * Add suggester to spancat docstrings * merge multilabel and singlelabel spancat * rename spancat_exclusive to singlelable * wire up different make_spangroups for single and multilabel * black * black * add docstrings * more docstring and fix negative_label * don't rely on default arguments * black * remove spancat exclusive * replace single_label with add_negative_label and adjust inference * mypy * logical bug in configuration check * add spans.attrs[scores] * single label make_spangroup test * bugfix * black * tests for make_span_group with negative labels * refactor make_span_group * black * Update spacy/tests/pipeline/test_spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * remove duplicate declaration * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * raise error instead of just print * make label mapper private * update docs * run prettier * Update website/docs/api/spancategorizer.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/spancategorizer.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * don't keep recomputing self._label_map for each span * typo in docs * Intervals to private and document 'name' param * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * add Tag to new features * replace tags * revert * revert * 
revert * revert * Update website/docs/api/spancategorizer.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/spancategorizer.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * prettier * Fix merge * Update website/docs/api/spancategorizer.mdx * remove references to 'single_label' * remove old paragraph * Add spancat_singlelabel to config template * Format * Extend init config tests --------- Co-authored-by: kadarakos <kadar.akos@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-03-09 10:30:59 +01:00
Raphael Mitsch	6aa6b86d49	Make generation of empty `KnowledgeBase` instances configurable in `EntityLinker` (#12320 ) * Make empty_kb() configurable. * Format. * Update docs. * Be more specific in KB serialization test. * Update KB serialization tests. Update docs. * Remove doc update for batched candidate generation. * Fix serialization of subclassed KB in tests. * Format. * Update docstring. * Update docstring. * Switch from pickle to json for custom field serialization.	2023-03-01 16:02:55 +01:00
kadarakos	56aa0cc75f	Displacy doc fix (#12352 ) * more details for color setting * more details for color setting * prettier	2023-03-01 15:38:23 +01:00
Raphael Mitsch	efbc3d37b3	Update docs w.r.t. spacy.CandidateBatchGenerator.v1. (#12350 )	2023-03-01 11:01:35 +01:00
Adriane Boyd	33864f1d07	Add new tags in docs for #12334 (#12348 )	2023-03-01 10:46:13 +01:00
TAN Long	071667376a	Add new REL_OPs: `>+`, `>-`, `<+`, and `<-` (#12334 ) * Add immediate left/right child/parent dependency relations * Add tests for new REL_OPs: `>+`, `>-`, `<+`, and `<-`. --------- Co-authored-by: Tan Long <tanloong@foxmail.com>	2023-02-28 14:36:33 +01:00
Raphael Mitsch	d38a88f0f3	Remove negation. (#12252 )	2023-02-08 14:18:33 +01:00
Sofie Van Landeghem	4c60afb946	Backslash fixes in docs (#12213 ) * backslash fixes * revert unrelated change	2023-02-01 10:15:38 +01:00
Paul O'Leary McCann	8932f4dc35	Add extra flag to assets docs (#12194 ) * Add extra flag to assets docs For some reason this wasn't included. * Add new tag to docs	2023-01-30 10:05:23 +01:00
Adriane Boyd	5f8a398bb9	Add span_id to Span.char_span, update Doc/Span.char_span docs (#12196 ) * Add span_id to Span.char_span, update Doc/Span.char_span docs `Span.char_span(id=)` should be removed in the future. * Also use Union[int, str] in Doc docstring	2023-01-27 15:09:17 +01:00
Simon Gurcke	774c10fa39	Add alignment_mode argument to Span.char_span() (#12145 ) * Add alignment_mode argument to Span.char_span() * Update website * Update spacy/tokens/span.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Add test Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-01-27 11:43:40 +01:00
Daniël de Kok	8d69874afb	Add `spacy.PlainTextCorpusReader.v1` (#12122 ) * Add `spacy.PlainTextCorpusReader.v1` This is a corpus reader that reads plain text corpora with the following format: - UTF-8 encoding - One line per document. - Blank lines are ignored. It is useful for applications where we deal with very large corpora, such as distillation, and don't want to deal with the space overhead of serialized formats. Additionally, many large corpora already use such a text format, keeping the necessary preprocessing to a minimum. * Update spacy/training/corpus.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * docs: add version to `PlainTextCorpus` * Add docstring to registry function * Add plain text corpus tests * Only strip newline/carriage return * Add return type _string_to_tmp_file helper * Use a temporary directory in place of file name Different OS auto delete/sharing semantics are just wonky. * This will be new in 3.5.1 (rather than 4) * Test improvements from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-01-26 11:33:22 +01:00
Marcus Blättermann	05a3685849	Fix broken syntax for type annotations (#12171 )	2023-01-25 08:51:25 +01:00
Adriane Boyd	3b8918e166	API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar (#12128 ) * API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar * adjust to mdx * linkout to InMemoryLookupKB at first occurrence in kb.mdx * fix links to docs * revert Azure trigger setting (I'll make a separate PR) Co-authored-by: svlandeg <svlandeg@github.com>	2023-01-19 13:29:17 +01:00
Sofie Van Landeghem	7d88c55eeb	update docs for apply (#12127 ) * update docs for apply * prettier	2023-01-19 10:37:09 +01:00
Daniël de Kok	319eb508b5	Add a `spacy benchmark speed` subcommand (#11902 ) * Add a `spacy evaluate speed` subcommand This subcommand reports the mean batch performance of a model on a data set with a 95% confidence interval. For reliability, it first performs some warmup rounds. Then it will measure performance on batches with randomly shuffled documents. To avoid having too many spaCy commands, `speed` is a subcommand of `evaluate` and accuracy evaluation is moved to its own `evaluate accuracy` subcommand. * Fix import cycle * Restore `spacy evaluate`, make `spacy benchmark speed` an alias * Add documentation for `spacy benchmark` * CREATES -> PRINTS * WPS -> words/s * Disable formatting of benchmark speed arguments * Fail with an error message when trying to speed bench empty corpus * Make it clearer that `benchmark accuracy` is a replacement for `evaluate` * Fix docstring webpage reference * tests: check `evaluate` output against `benchmark accuracy`	2023-01-12 11:55:21 +01:00
Paul O'Leary McCann	8e558095a1	Clean up displacy port-related error messages, docs (#12089 ) * Clean up displacy port-related error messages, docs There were some issues in the error messages and docs in #11948. 1. the error messages didn't specify the port argument to displacy.serve correctly 2. the docs didn't mark the auto select argument as new This addresses those issues. * Update website/docs/api/top-level.md Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> * Apply prettier Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-01-12 14:54:09 +09:00
Sofie Van Landeghem	554df9ef20	Website migration from Gatsby to Next (#12058 ) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: `77b5f79a4d/packages/next-mdx/readme.md` * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. 
* Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but 
using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 
🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>	2023-01-11 17:30:07 +01:00
Adriane Boyd	9e0322de1a	Restore v2 token_acc score implementation (#12073 ) In the v3 scorer refactoring, `token_acc` was implemented incorrectly. It should use `precision` instead of `fscore` for the measure of correctly aligned tokens / number of predicted tokens. Fix the docs to reflect that the measure uses the number of predicted tokens rather than the number of gold tokens.	2023-01-11 08:01:47 +01:00
Kevin Humphreys	19650ebb52	Enable fuzzy text matching in Matcher (#11359 ) * enable fuzzy matching * add fuzzy param to EntityMatcher * include rapidfuzz_capi not yet used * fix type * add FUZZY predicate * add fuzzy attribute list * fix type properly * tidying * remove unnecessary dependency * handle fuzzy sets * simplify fuzzy sets * case fix * switch to FUZZYn predicates use Levenshtein distance. remove fuzzy param. remove rapidfuzz_capi. * revert changes added for fuzzy param * switch to polyleven (Python package) * enable fuzzy matching * add fuzzy param to EntityMatcher * include rapidfuzz_capi not yet used * fix type * add FUZZY predicate * add fuzzy attribute list * fix type properly * tidying * remove unnecessary dependency * handle fuzzy sets * simplify fuzzy sets * case fix * switch to FUZZYn predicates use Levenshtein distance. remove fuzzy param. remove rapidfuzz_capi. * revert changes added for fuzzy param * switch to polyleven (Python package) * fuzzy match only on oov tokens * remove polyleven * exclude whitespace tokens * don't allow more edits than characters * fix min distance * reinstate FUZZY operator with length-based distance function * handle sets inside regex operator * remove is_oov check * attempt build fix no mypy failure locally * re-attempt build fix * don't overwrite fuzzy param value * move fuzzy_match to its own Python module to allow patching * move fuzzy_match back inside Matcher simplify logic and add tests * Format tests * Parametrize fuzzyn tests * Parametrize and merge fuzzy+set tests * Format * Move fuzzy_match to a standalone method * Change regex kwarg type to bool * Add types for fuzzy_match - Refactor variable names - Add test for symmetrical behavior * Parametrize fuzzyn+set tests * Minor refactoring for fuzz/fuzzy * Make fuzzy_match a Matcher kwarg * Update type for _default_fuzzy_match * don't overwrite function param * Rename to fuzzy_compare * Update fuzzy_compare default argument declarations * allow fuzzy_compare 
override from EntityRuler * define new Matcher keyword arg * fix type definition * Implement fuzzy_compare config option for EntityRuler and SpanRuler * Rename _default_fuzzy_compare to fuzzy_compare, remove from reexported objects * Use simpler fuzzy_compare algorithm * Update types * Increase minimum to 2 in fuzzy_compare to allow one transposition * Fix predicate keys and matching for SetPredicate with FUZZY and REGEX * Add FUZZY6..9 * Add initial docs * Increase default fuzzy to rounded 30% of pattern length * Update docs for fuzzy_compare in components * Update EntityRuler and SpanRuler API docs * Rename EntityRuler and SpanRuler setting to matcher_fuzzy_compare To having naming similar to `phrase_matcher_attr`, rename `fuzzy_compare` setting for `EntityRuler` and `SpanRuler` to `matcher_fuzzy_compare. Organize next to `phrase_matcher_attr` in docs. * Fix schema aliases Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix typo Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Add FUZZY6-9 operators and update tests * Parameterize test over greedy Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix type for fuzzy_compare to remove Optional * Rename to spacy.levenshtein_compare.v1, move to spacy.matcher.levenshtein * Update docs following levenshtein_compare renaming Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-01-10 10:36:17 +01:00
Zhangrp	eb8bb35c13	improve ux for displacy when the serve port is in use (#11948 ) * check port in use and add itself * check port in use and add itself * Auto switch to nearest available port. * Use bind to check port instead of connect_ex. * Reformat. * Add auto_select_port argument. * update docs for displacy.serve * Update spacy/errors.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update website/docs/api/top-level.md Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update spacy/errors.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Add test using multiprocessing * fix argument name * Increase sleep times Want to rule this out as a cause of test failure * Don't terminate a process that isn't alive * Refactor port finding logic This moves all the port logic into its own util function, which can be tested without having to background a server directly. * Use with for the server This ensures the server is closed correctly. * Pass in the host when checking port availability * Shorten argument name * Update error codes following merge * Add types for arguments, specify docstrings. * Add typing for arguments with default value. * Update docstring to match spaCy format. * Update docstring to match spaCy format. * Fix docs Arg name changed from `auto_select_port` to just `auto_select`. * Revert "Fix docs" This reverts commit `356966fe84`. Co-authored-by: zhiiw <1302593554@qq.com> Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-01-10 15:52:57 +09:00
Madeesh Kannan	f1dcdefc8a	Add version tag to `before_update` config key (#12059 )	2023-01-05 11:46:04 +01:00
Paul O'Leary McCann	dbd829f0ed	Fix inconsistency in displaCy docs about page option (#12047 ) * Fix inconsistency in displaCy docs about page option The `page` option, which wraps the output SVG in HTML, is true by default for `serve` but not for `render`. The `render` docs were wrong though, so this updates them. * Update the same statement in more docs A few renderers used the same language	2023-01-04 12:51:40 +09:00
Madeesh Kannan	aa2b471a6e	New console logger with expanded progress tracking (#11972 ) * Add `ConsoleLogger.v3` This addition expands the progress bar feature to count up the training/distillation steps to either the next evaluation pass or the maximum number of steps. * Rename progress bar types * Add defaults to docs Minor fixes * Move comment * Minor punctuation fixes * Explicitly check for `None` when validating progress bar type Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2022-12-23 15:21:44 +01:00
Raphael Mitsch	eef3d950b4	Fix `SpanGroup` and `Span` typing (#12009 ) * Correct Span.label, Span.kb_id types. Fix SpanGroup.__iter__(). * Extend test. * Rename test. Fix typo. * Add comment. * Fix types for Span.label, Span.kb_id, Span.char_span(). * Update spacy/tests/doc/test_span_group.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update docs. * Fix typo. * Update spacy/tokens/span_group.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-12-21 18:54:27 +01:00
kadarakos	c223cd7a86	Add apply CLI (#11376 ) * annotate cli first try * add batch-size and n_process * rename to apply * typing fix * handle file suffixes * walk directories * support jsonl * typing fix * remove debug * make suffix optional for walk * revert unrelated * don't warn but raise * better error message * minor touch up * Update spacy/tests/test_cli.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/cli/apply.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/cli/apply.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * update tests and bugfix * add force_overwrite * typo * fix adding .spacy suffix * Update spacy/cli/apply.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/cli/apply.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/cli/apply.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * store user data and rename cmd arg * include test for user attr * rename cmd arg * better help message * documentation * prettier * black * link fix * Update spacy/cli/apply.py Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update website/docs/api/cli.md Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update website/docs/api/cli.md Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * Update website/docs/api/cli.md Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> * addressing reviews * dont quit but warn * prettier Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2022-12-20 17:11:33 +01:00
cfuerbachersparks	3a2b655a29	Update lexeme.md (#11994 ) Change suffix_ string to end	2022-12-19 10:33:38 +01:00
Zhangrp	9cf3fa9711	Add docs for biluo_to_iob and iob_to_biluo. (#11901 ) * Add docs for biluo_to_iob and iob_to_biluo. * Fix typos. * Remove redundant links.	2022-12-01 13:30:27 +01:00
Damian Romero	afd7a2476d	Fix typo in vocab.md table (#11908 ) * Fix typo in vocab.md table Fixes explosion/spaCy/#11907 * Reformat vocab.md with Prettier	2022-12-01 13:06:28 +01:00
Adriane Boyd	1ebe7db07c	Support local filesystem remotes for projects (#11762 ) * Support local filesystem remotes for projects * Fix support for local filesystem remotes for projects * Use `FluidPath` instead of `Pathy` to support both filesystem and remote paths * Create missing parent directories if required for local filesystem * Add a more general `_file_exists` method to support both `Pathy`, `Path`, and `smart_open`-compatible URLs * Add explicit `smart_open` dependency starting with support for `compression` flag * Update `pathy` dependency to exclude older versions that aren't compatible with required `smart_open` version * Update docs to refer to `Pathy` instead of `smart_open` for project remotes (technically you can still push to any `smart_open`-compatible path but you can't pull from them) * Add tests for local filesystem remotes * Update pathy for general BlobStat sorting * Add import * Remove _file_exists since only Pathy remotes are supported * Format CLI docs * Clean up merge	2022-11-29 11:40:58 +01:00

983 Commits