mirror of
				https://github.com/explosion/spaCy.git
				synced 2025-10-25 21:21:10 +03:00 
			
		
		
		
	This reverts commit daedc45d05.
The default length depends on the length of the pattern string and was
correct for this example.
		
	
			
		
			
				
	
	
		
			
		
		
	
	
---
title: What's New in v3.5
teaser: New features and how to upgrade
menu:
  - ['New Features', 'features']
  - ['Upgrading Notes', 'upgrading']
---

## New features {id="features",hidden="true"}

spaCy v3.5 introduces three new CLI commands, `apply`, `benchmark` and
`find-threshold`, adds fuzzy matching, provides improvements to our entity
linking functionality, and includes a range of language updates and bug fixes.

### New CLI commands {id="cli"}

#### apply CLI

The [`apply` CLI](/api/cli#apply) can be used to apply a pipeline to one or more
`.txt`, `.jsonl` or `.spacy` input files, saving the annotated docs in a single
`.spacy` file.

```bash
$ spacy apply en_core_web_sm my_texts/ output.spacy
```

#### benchmark CLI

The [`benchmark` CLI](/api/cli#benchmark) has been added to extend the existing
`evaluate` functionality with a wider range of profiling subcommands.

The `benchmark accuracy` CLI is introduced as an alias for `evaluate`. The new
`benchmark speed` CLI performs warmup rounds before measuring the speed in words
per second on batches of randomly shuffled documents from the provided data.

```bash
$ spacy benchmark speed my_pipeline data.spacy
```

The output is the mean performance using batches (`nlp.pipe`) with a 95%
confidence interval, e.g., profiling `en_core_web_sm` on CPU:

```none
Outliers: 2.0%, extreme outliers: 0.0%
Mean: 18904.1 words/s (95% CI: -256.9 +244.1)
```
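
The text above does not say how the confidence interval is computed. The
asymmetric offsets in the sample output would be consistent with a percentile
bootstrap, so the following is a minimal sketch under that assumption, using
only the standard library. The helper `mean_with_ci` and the measurements are
hypothetical and are not taken from spaCy.

```python
import random
import statistics


def mean_with_ci(samples, n_resamples=2000, seed=0):
    """Return (mean, lower_offset, upper_offset) for a 95% bootstrap interval."""
    rng = random.Random(seed)
    mean = statistics.fmean(samples)
    # Resample with replacement and record the mean of each resample.
    boot_means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))
        for _ in range(n_resamples)
    )
    # Take (approximately) the 2.5th and 97.5th percentiles of those means.
    lower = boot_means[int(0.025 * n_resamples)]
    upper = boot_means[int(0.975 * n_resamples) - 1]
    return mean, lower - mean, upper - mean


# Hypothetical words-per-second measurements, one per timed run.
wps = [18650.0, 19120.0, 18840.0, 19310.0, 18500.0, 18990.0, 19075.0, 18720.0]
mean, low, high = mean_with_ci(wps)
print(f"Mean: {mean:.1f} words/s (95% CI: {low:+.1f} {high:+.1f})")
```

The offsets are reported relative to the mean, which mirrors the layout of the
sample output. The lower and upper offsets need not have the same magnitude.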

#### find-threshold CLI

The [`find-threshold` CLI](/api/cli#find-threshold) runs a series of trials
across threshold values from `0.0` to `1.0` and identifies the best threshold
for the provided score metric.

The following command runs 20 trials for the `spancat` component in
`my_pipeline`, recording the `spans_sc_f` score for each value of the threshold
`[components.spancat.threshold]` from `0.0` to `1.0`:

```bash
$ spacy find-threshold my_pipeline data.spacy spancat threshold spans_sc_f --n_trials 20
```

The `find-threshold` CLI can be used with `textcat_multilabel`, `spancat` and
custom components with thresholds that are applied while predicting or scoring.
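
The idea of sweeping a threshold and keeping the best-scoring value can be
sketched without spaCy. The example below is an illustration only: it assumes
evenly spaced thresholds and a binary F-score, and the helpers `f_score` and
`find_threshold`, as well as the toy labels and scores, are hypothetical.

```python
def f_score(gold, scores, threshold):
    """F1 for binary labels when scores at or above the threshold are positive."""
    pred = [s >= threshold for s in scores]
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def find_threshold(gold, scores, n_trials=11):
    """Try n_trials evenly spaced thresholds in [0.0, 1.0] and return the best."""
    thresholds = [i / (n_trials - 1) for i in range(n_trials)]
    results = {t: f_score(gold, scores, t) for t in thresholds}
    # On ties, max() keeps the first (lowest) threshold with the top score.
    best = max(results, key=results.get)
    return best, results


gold = [True, True, False, True, False, False]
scores = [0.92, 0.71, 0.64, 0.55, 0.31, 0.08]
best, results = find_threshold(gold, scores)
for t, f in results.items():
    print(f"threshold={t:.1f} f={f:.3f}")
print("best threshold:", best)
```

With this toy data, the F-score peaks at a threshold of `0.4`, where all three
gold positives are kept and only one false positive remains.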

### Fuzzy matching {id="fuzzy"}

New `FUZZY` operators support [fuzzy matching](/usage/rule-based-matching#fuzzy)
with the `Matcher`. By default, the `FUZZY` operator allows a Levenshtein edit
distance of at least 2 and up to 30% of the pattern string length.
`FUZZY1`..`FUZZY9` can be used to specify the exact number of allowed edits.

```python
# Match lowercase with fuzzy matching (allows up to 3 edits)
pattern = [{"LOWER": {"FUZZY": "definitely"}}]

# Match custom attribute values with fuzzy matching (allows up to 3 edits)
pattern = [{"_": {"country": {"FUZZY": "Kyrgyzstan"}}}]

# Match with exact Levenshtein edit distance limits (allows up to 4 edits)
pattern = [{"_": {"country": {"FUZZY4": "Kyrgyzstan"}}}]
```

Note that `FUZZY` uses Levenshtein edit distance rather than Damerau-Levenshtein
edit distance, so a transposition like `teh` for `the` counts as two edits, one
insertion and one deletion.
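
To make the edit counts concrete, here is a small standard-library sketch of
plain Levenshtein distance together with a default edit budget. The budget
formula `max(2, round(0.3 * len(pattern_text)))` is an assumption chosen to
agree with the "up to 3 edits" comments for the 10-character patterns in the
examples. Both function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def default_max_edits(pattern_text: str) -> int:
    """Assumed default budget: at least 2, up to 30% of the pattern length."""
    return max(2, round(0.3 * len(pattern_text)))


print(default_max_edits("definitely"))          # 3 (10 characters)
print(levenshtein("teh", "the"))                # 2 (a transposition costs two edits)
print(levenshtein("definately", "definitely"))  # 1 (one substitution)
```

Short patterns still get a budget of 2, so under this assumption the default
allows `teh` to match the pattern `the`.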

If you'd prefer an alternate fuzzy matching algorithm, you can provide your own
custom method to the `Matcher` or as a config option for an entity ruler and
span ruler.
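
As one possible alternative, the sketch below counts an adjacent transposition
as a single edit (optimal string alignment distance). The
`(input_text, pattern_text, fuzzy) -> bool` signature, and the convention that
a negative `fuzzy` means "no explicit limit", are assumptions made for
illustration. Check the `Matcher` API reference for the exact argument name and
signature before wiring in a function like this.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment: Levenshtein plus adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Swapping two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def transposition_aware_compare(input_text: str, pattern_text: str, fuzzy: int = -1) -> bool:
    """Return True if input_text is within the allowed number of edits."""
    # Assumption: a negative value means no explicit FUZZY1..FUZZY9 limit was given.
    max_edits = fuzzy if fuzzy >= 0 else max(2, round(0.3 * len(pattern_text)))
    return osa_distance(input_text, pattern_text) <= max_edits


print(osa_distance("teh", "the"))                    # 1
print(transposition_aware_compare("teh", "the", 1))  # True
```

With plain Levenshtein distance the same pair would need a budget of two edits.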

### FUZZY and REGEX with lists {id="fuzzy-regex-lists"}

The `FUZZY` and `REGEX` operators are also now supported for lists with `IN` and
`NOT_IN`:

```python
pattern = [{"TEXT": {"FUZZY": {"IN": ["awesome", "cool", "wonderful"]}}}]
pattern = [{"TEXT": {"REGEX": {"NOT_IN": ["^awe(some)?$", "^wonder(ful)?"]}}}]
```
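
One way to read the list forms is "any item matches" for `IN` and "no item
matches" for `NOT_IN`. The sketch below spells that reading out for the `REGEX`
case with Python's `re` module. It assumes `re.search` semantics and is not
spaCy's implementation. The helper names are hypothetical.

```python
import re


def regex_in(text, patterns):
    """True if at least one regular expression in the list matches the text."""
    return any(re.search(p, text) for p in patterns)


def regex_not_in(text, patterns):
    """True if none of the regular expressions in the list matches the text."""
    return not regex_in(text, patterns)


patterns = ["^awe(some)?$", "^wonder(ful)?"]
for token_text in ["awe", "awesome", "wonderfully", "cool"]:
    print(token_text, regex_not_in(token_text, patterns))
```

Because the second expression has no `$` anchor, it also matches longer words
that start with `wonder`, such as `wonderfully`.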

### Entity linking generalization {id="el"}

The knowledge base used for entity linking is now easier to customize and has a
new default implementation [`InMemoryLookupKB`](/api/inmemorylookupkb).

### Additional features and improvements {id="additional-features-and-improvements"}

- Language updates:
  - Extended support for Slovenian
  - Fixed lookup fallback for French and Catalan lemmatizers
  - Switch Russian and Ukrainian lemmatizers to `pymorphy3`
  - Support for editorial punctuation in Ancient Greek
  - Update to Russian tokenizer exceptions
  - Small fix for Dutch stop words
- Allow up to `typer` v0.7.x, `mypy` 0.990 and `typing_extensions` v4.4.x.
- New `spacy.ConsoleLogger.v3` with expanded progress
  [tracking](/api/top-level#ConsoleLogger).
- Improved scoring behavior for `textcat` with `spacy.textcat_scorer.v2` and
  `spacy.textcat_multilabel_scorer.v2`.
- Updates so that downstream components can train properly on a frozen `tok2vec`
  or `transformer` layer.
- Allow interpolation of variables in directory names in projects.
- Support for local file system [remotes](/usage/projects#remote) for projects.
- Improve UX around `displacy.serve` when the default port is in use.
- Optional `before_update` callback that is invoked at the start of each
  [training step](/api/data-formats#config-training).
- Improve performance of `SpanGroup` and fix typing issues for `SpanGroup` and
  `Span` objects.
- Patch a
  [security vulnerability](https://github.com/advisories/GHSA-gw9q-c7gh-j9vm) in
  extracting tar files.
- Add equality definition for `Vectors`.
- Ensure `Vocab.to_disk` respects the exclude setting for `lookups` and
  `vectors`.
- Correctly handle missing annotations in the edit tree lemmatizer.

### Trained pipeline updates {id="pipelines"}

- The CNN pipelines add `IS_SPACE` as a `tok2vec` feature for `tagger` and
  `morphologizer` components to improve tagging of non-whitespace vs. whitespace
  tokens.
- The transformer pipelines require `spacy-transformers` v1.2, which uses the
  exact alignment from `tokenizers` for fast tokenizers instead of the heuristic
  alignment from `spacy-alignments`. For all trained pipelines except
  `ja_core_news_trf`, the alignments between spaCy tokens and transformer tokens
  may be slightly different. More details about the `spacy-transformers` changes
  are available in the
  [v1.2.0 release notes](https://github.com/explosion/spacy-transformers/releases/tag/v1.2.0).

## Notes about upgrading from v3.4 {id="upgrading"}

### Validation of textcat values {id="textcat-validation"}

An error is now raised when unsupported values are given as input to train a
`textcat` or `textcat_multilabel` model. Ensure that values are `0.0` or `1.0`,
as explained in the [docs](/api/textcategorizer#assigned-attributes).
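
If you want to catch such values before training starts, a simple pre-check
over your category dictionaries may help. The sketch below is a hypothetical
helper, not part of spaCy. It assumes annotations are available as a dictionary
mapping labels to values, in the style of `doc.cats`.

```python
def validate_cats(cats):
    """Raise ValueError if any category value is not 0.0 or 1.0."""
    for label, value in cats.items():
        if value not in (0.0, 1.0):
            raise ValueError(
                f"Unsupported value {value!r} for label {label!r}: expected 0.0 or 1.0"
            )


validate_cats({"POSITIVE": 1.0, "NEGATIVE": 0.0})  # passes silently

try:
    validate_cats({"POSITIVE": 0.7, "NEGATIVE": 0.3})
except ValueError as err:
    print(err)
```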

### Using the default knowledge base

As `KnowledgeBase` is now an abstract class, you should call the constructor of
the new `InMemoryLookupKB` instead when you want to use spaCy's default KB
implementation:

```diff
- kb = KnowledgeBase()
+ kb = InMemoryLookupKB()
```

If you've written a custom KB that inherits from `KnowledgeBase`, you'll need to
implement its abstract methods, or alternatively inherit from `InMemoryLookupKB`
instead.
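
The general pattern of an abstract base class with a concrete in-memory
subclass can be illustrated with Python's `abc` module. The classes below are
toy stand-ins with simplified, hypothetical method names and signatures. They
are not spaCy's `KnowledgeBase` or `InMemoryLookupKB`, whose actual abstract
methods are listed in the API documentation.

```python
from abc import ABC, abstractmethod


class ToyKnowledgeBase(ABC):
    """Hypothetical stand-in for an abstract KB; not spaCy's class."""

    @abstractmethod
    def get_candidates(self, mention: str) -> list:
        """Return candidate entity IDs for a mention."""


class ToyInMemoryKB(ToyKnowledgeBase):
    """Concrete KB that looks up candidates in a plain dictionary."""

    def __init__(self):
        self._aliases = {}

    def add_alias(self, alias: str, entity_ids: list) -> None:
        self._aliases[alias] = list(entity_ids)

    def get_candidates(self, mention: str) -> list:
        return self._aliases.get(mention, [])


try:
    ToyKnowledgeBase()  # the abstract class cannot be instantiated directly
except TypeError as err:
    print("TypeError:", err)

kb = ToyInMemoryKB()
kb.add_alias("Douglas", ["Q42"])
print(kb.get_candidates("Douglas"))
```

A subclass that leaves `get_candidates` unimplemented would fail in the same
way as the abstract base class when instantiated.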

### Updated scorers for tokenization and textcat {id="scores"}

We fixed a bug that inflated the `token_acc` scores in v3.0-v3.4. The reported
`token_acc` will drop from v3.4 to v3.5, but if `token_p/r/f` stay the same,
your tokenization performance has not changed from v3.4.

For new `textcat` or `textcat_multilabel` configs, the new default `v2` scorers:

- ignore `threshold` for `textcat`, so the reported `cats_p/r/f` may increase
  slightly in v3.5 even though the underlying predictions are unchanged
- report the performance of only the **final** `textcat` or `textcat_multilabel`
  component in the pipeline by default
- allow custom scorers to be used to score multiple `textcat` and
  `textcat_multilabel` components with `Scorer.score_cats` by restricting the
  evaluation to the component's provided labels

### Pipeline package version compatibility {id="version-compat"}

> #### Using legacy implementations
>
> In spaCy v3, you'll still be able to load and reference legacy implementations
> via [`spacy-legacy`](https://github.com/explosion/spacy-legacy), even if the
> components or architectures change and newer versions are available in the
> core library.

When you're loading a pipeline package trained with an earlier version of spaCy
v3, you will see a warning telling you that the pipeline may be incompatible.
This doesn't necessarily have to be true, but we recommend running your
pipelines against your test suite or evaluation data to make sure there are no
unexpected results.

If you're using one of the [trained pipelines](/models) we provide, you should
run [`spacy download`](/api/cli#download) to update to the latest version. To
see an overview of all installed packages and their compatibility, you can run
[`spacy validate`](/api/cli#validate).

If you've trained your own custom pipeline and you've confirmed that it's still
working as expected, you can update the spaCy version requirements in the
[`meta.json`](/api/data-formats#meta):

```diff
- "spacy_version": ">=3.4.0,<3.5.0",
+ "spacy_version": ">=3.4.0,<3.6.0",
```
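
If you prefer to script this edit, a few lines of standard-library Python are
enough. The sketch below is an illustration: `widen_spacy_version` is a
hypothetical helper, and it runs against a throwaway file rather than a real
pipeline directory.

```python
import json
import tempfile
from pathlib import Path


def widen_spacy_version(meta_path: Path, new_range: str) -> str:
    """Rewrite the "spacy_version" field in a meta.json and return the old value."""
    meta = json.loads(meta_path.read_text(encoding="utf8"))
    old_range = meta.get("spacy_version", "")
    meta["spacy_version"] = new_range
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf8")
    return old_range


# Demonstrate on a temporary meta.json with made-up contents.
with tempfile.TemporaryDirectory() as tmp_dir:
    meta_path = Path(tmp_dir) / "meta.json"
    sample = {"name": "my_pipeline", "spacy_version": ">=3.4.0,<3.5.0"}
    meta_path.write_text(json.dumps(sample), encoding="utf8")
    old = widen_spacy_version(meta_path, ">=3.4.0,<3.6.0")
    updated = json.loads(meta_path.read_text(encoding="utf8"))
    print(old, "->", updated["spacy_version"])
```

Only widen the range after you have confirmed that the pipeline still behaves
as expected, as described above.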

### Updating v3.4 configs

To update a config from spaCy v3.4 with the new v3.5 settings, run
[`init fill-config`](/api/cli#init-fill-config):

```cli
$ python -m spacy init fill-config config-v3.4.cfg config-v3.5.cfg
```

In many cases ([`spacy train`](/api/cli#train),
[`spacy.load`](/api/top-level#spacy.load)), the new defaults will be filled in
automatically, but you'll need to fill in the new settings to run
[`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data).