mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-26 17:24:41 +03:00
Add "New in v3.1" guide
parent
caba63b74f
commit
bc93c34f54
|
@ -82,7 +82,7 @@ shortcut for this and instantiate the component using its string name and
|
||||||
| `moves` | A list of transition names. Inferred from the data if set to `None`, which is the default. ~~Optional[List[str]]~~ |
|
| `moves` | A list of transition names. Inferred from the data if set to `None`, which is the default. ~~Optional[List[str]]~~ |
|
||||||
| _keyword-only_ | |
|
| _keyword-only_ | |
|
||||||
| `update_with_oracle_cut_size` | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. Defaults to `100`. ~~int~~ |
|
| `update_with_oracle_cut_size` | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. Defaults to `100`. ~~int~~ |
|
||||||
| `incorrect_spans_key` | Identifies spans that are known to be incorrect entity annotations. The incorrect entity annotations can be stored in the span group, under this key. Defaults to `None`. ~~Optional[str]~~ |
|
| `incorrect_spans_key` | Identifies spans that are known to be incorrect entity annotations. The incorrect entity annotations can be stored in the span group in [`Doc.spans`](/api/doc#spans), under this key. Defaults to `None`. ~~Optional[str]~~ |
|
||||||
|
|
||||||
## EntityRecognizer.\_\_call\_\_ {#call tag="method"}
|
## EntityRecognizer.\_\_call\_\_ {#call tag="method"}
|
||||||
|
|
||||||
|
|
114
website/docs/usage/v3-1.md
Normal file
|
@ -0,0 +1,114 @@
|
||||||
|
---
|
||||||
|
title: What's New in v3.1
|
||||||
|
teaser: New features and how to upgrade
|
||||||
|
menu:
|
||||||
|
- ['New Features', 'features']
|
||||||
|
- ['Upgrading Notes', 'upgrading']
|
||||||
|
---
|
||||||
|
|
||||||
|
## New Features {#features hidden="true"}
|
||||||
|
|
||||||
|
<!-- TODO: intro -->
|
||||||
|
|
||||||
|
### Using predicted annotations during training {#predicted-annotations-training}
|
||||||
|
|
||||||
|
<!-- TODO: write -->
|
||||||
|
|
||||||
|
<Project id="pipelines/tagger_parser_predicted_annotations">
|
||||||
|
|
||||||
|
This project shows how to use the `token.dep` attribute predicted by the parser
|
||||||
|
as a feature for a subsequent tagger component in the pipeline.
|
||||||
|
|
||||||
|
</Project>
|
||||||
|
|
||||||
|
### SpanCategorizer for predicting arbitrary and overlapping spans {#spancategorizer tag="experimental"}
|
||||||
|
|
||||||
|
A common task in applied NLP is extracting spans of text from documents,
|
||||||
|
including longer phrases or nested expressions. Named entity recognition isn't
|
||||||
|
the right tool for this problem, since an entity recognizer typically predicts
|
||||||
|
single token-based tags that are very sensitive to boundaries. This is effective
|
||||||
|
for proper nouns and self-contained expressions, but less useful for other types
|
||||||
|
of phrases or overlapping spans. The new
|
||||||
|
[`SpanCategorizer`](/api/spancategorizer) component and
|
||||||
|
[SpanCategorizer](/api/architectures#spancategorizer) architecture let you label
|
||||||
|
arbitrary and potentially overlapping spans of text. A span categorizer
|
||||||
|
consists of two parts: a [suggester function](/api/spancategorizer#suggesters)
|
||||||
|
that proposes candidate spans, which may or may not overlap, and a labeler model
|
||||||
|
that predicts zero or more labels for each candidate. The predicted spans are
|
||||||
|
available via the [`Doc.spans`](/api/doc#spans) container.
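
To make the suggester idea concrete, here is a minimal plain-Python sketch of an n-gram style suggester. It is an illustration only and does not use spaCy's API: the function name `ngram_candidates` and the `(start, end)` token-offset representation are assumptions made for this example. It shows how candidate spans of several lengths naturally overlap, which is what the labeler model then scores.

```python
from typing import List, Sequence, Tuple


def ngram_candidates(tokens: Sequence[str], sizes: Sequence[int]) -> List[Tuple[int, int]]:
    """Propose every contiguous span whose length is in `sizes`.

    Hypothetical helper for illustration; spans are (start, end) token
    offsets with an exclusive end, not spaCy Span objects.
    """
    candidates = []
    for size in sizes:
        # A size larger than the text yields an empty range, so no candidates.
        for start in range(len(tokens) - size + 1):
            candidates.append((start, start + size))
    return candidates


tokens = ["acute", "kidney", "failure"]
spans = ngram_candidates(tokens, sizes=[1, 2])
for start, end in spans:
    print(start, end, " ".join(tokens[start:end]))
```

In this sketch the candidates `(0, 2)` and `(1, 3)` share the token `kidney`, so a labeler could assign labels to both without either one excluding the other.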
|
||||||
|
|
||||||
|
<!-- TODO: example, getting started (init config?), maybe project template -->
|
||||||
|
|
||||||
|
<Infobox title="Tip: Create data with Prodigy's new span annotation UI">
|
||||||
|
|
||||||
|
<!-- TODO: screenshot -->
|
||||||
|
|
||||||
|
The upcoming version of our annotation tool [Prodigy](https://prodi.gy)
|
||||||
|
(currently available as a [pre-release](https://support.prodi.gy/t/3861) for all
|
||||||
|
users) features a [new workflow and UI](https://support.prodi.gy/t/3861) for
|
||||||
|
annotating overlapping and nested spans. You can use it to create training data
|
||||||
|
for spaCy's `SpanCategorizer` component.
|
||||||
|
|
||||||
|
</Infobox>
|
||||||
|
|
||||||
|
### Update the entity recognizer with partial incorrect annotations {#negative-samples}
|
||||||
|
|
||||||
|
> #### config.cfg (excerpt)
|
||||||
|
>
|
||||||
|
> ```ini
|
||||||
|
> [components.ner]
|
||||||
|
> factory = "ner"
|
||||||
|
> incorrect_spans_key = "incorrect_spans"
|
||||||
|
> moves = null
|
||||||
|
> update_with_oracle_cut_size = 100
|
||||||
|
> ```
|
||||||
|
|
||||||
|
The [`EntityRecognizer`](/api/entityrecognizer) can now be updated with known
|
||||||
|
incorrect annotations, which lets you take advantage of partial and sparse data.
|
||||||
|
For example, you'll be able to use the information that certain spans of text
|
||||||
|
are definitely **not** `PERSON` entities, without having to provide the
|
||||||
|
complete gold-standard annotations for the given example. The incorrect span
|
||||||
|
annotations can be added via [`Doc.spans`](/api/doc#spans) in the training
|
||||||
|
data under the key defined as
|
||||||
|
[`incorrect_spans_key`](/api/entityrecognizer#init) in the component config.
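
As a rough sketch of what such training data could look like, the snippet below stores one known-incorrect span on a reference `Doc`. It assumes spaCy v3.1 or later is installed and that the component config sets `incorrect_spans_key = "incorrect_spans"`, as in the excerpt; the sentence and the `PERSON` label are made up for illustration.

```python
import spacy
from spacy.tokens import Span
from spacy.training import Example

nlp = spacy.blank("en")
text = "Apple is opening a store in Paris"

# Reference doc carrying the partial annotation (illustrative data).
reference = nlp.make_doc(text)

# Record that the first token is definitely NOT a PERSON entity.
# The key must match `incorrect_spans_key` in the component config.
reference.spans["incorrect_spans"] = [Span(reference, 0, 1, label="PERSON")]

# Pair it with an unannotated doc of the same text to form a training example.
example = Example(nlp.make_doc(text), reference)

incorrect = [(span.text, span.label_) for span in example.reference.spans["incorrect_spans"]]
print(incorrect)
```

No gold-standard entities are set on the reference here; the only information it carries is the negative annotation under the configured key.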
|
||||||
|
|
||||||
|
<!-- TODO: more details and/or example project? -->
|
||||||
|
|
||||||
|
### New pipeline packages for Catalan and Danish {#pipeline-packages}
|
||||||
|
|
||||||
|
<!-- TODO: intro and update with final numbers -->
|
||||||
|
|
||||||
|
| Package | Language | Tagger | Parser | NER |
|
||||||
|
| ------------------------------------------------- | -------- | -----: | -----: | ---: |
|
||||||
|
| [`ca_core_news_sm`](/models/ca#ca_core_news_sm) | Catalan | | | |
|
||||||
|
| [`ca_core_news_md`](/models/ca#ca_core_news_md) | Catalan | | | |
|
||||||
|
| [`ca_core_news_lg`](/models/ca#ca_core_news_lg) | Catalan | | | |
|
||||||
|
| [`ca_core_news_trf`](/models/ca#ca_core_news_trf) | Catalan | | | |
|
||||||
|
| [`da_core_news_trf`](/models/da#da_core_news_trf) | Danish | | | |
|
||||||
|
|
||||||
|
### Resizable text classification architectures {#resizable-textcat}
|
||||||
|
|
||||||
|
<!-- TODO: write -->
|
||||||
|
|
||||||
|
### CLI command to assemble pipeline from config {#assemble}
|
||||||
|
|
||||||
|
The [`spacy assemble`](/api/cli#assemble) command lets you assemble a pipeline
|
||||||
|
from a config file without additional training. It can be especially useful for
|
||||||
|
creating a blank pipeline with a custom tokenizer, rule-based components or word
|
||||||
|
vectors.
|
||||||
|
|
||||||
|
```cli
|
||||||
|
$ python -m spacy assemble config.cfg ./output
|
||||||
|
```
|
||||||
|
|
||||||
|
### Support for streaming large or infinite corpora {#streaming-corpora}
|
||||||
|
|
||||||
|
<!-- TODO: write -->
|
||||||
|
|
||||||
|
### New lemmatizers for Catalan and Italian {#pos-lemmatizers}
|
||||||
|
|
||||||
|
<!-- TODO: write -->
|
||||||
|
|
||||||
|
## Notes about upgrading from v3.0 {#upgrading}
|
||||||
|
|
||||||
|
<!-- TODO: this could just be a bullet-point list mentioning stuff like the spacy_version, vectors initialization etc. -->
|
|
@ -9,7 +9,8 @@
|
||||||
{ "text": "Models & Languages", "url": "/usage/models" },
|
{ "text": "Models & Languages", "url": "/usage/models" },
|
||||||
{ "text": "Facts & Figures", "url": "/usage/facts-figures" },
|
{ "text": "Facts & Figures", "url": "/usage/facts-figures" },
|
||||||
{ "text": "spaCy 101", "url": "/usage/spacy-101" },
|
{ "text": "spaCy 101", "url": "/usage/spacy-101" },
|
||||||
{ "text": "New in v3.0", "url": "/usage/v3" }
|
{ "text": "New in v3.0", "url": "/usage/v3" },
|
||||||
|
{ "text": "New in v3.1", "url": "/usage/v3-1" }
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
|
@ -135,9 +136,7 @@
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"label": "Legacy",
|
"label": "Legacy",
|
||||||
"items": [
|
"items": [{ "text": "Legacy functions", "url": "/api/legacy" }]
|
||||||
{ "text": "Legacy functions", "url": "/api/legacy" }
|
|
||||||
]
|
|
||||||
}
|
}
|
||||||
]
|
]
|
||||||
}
|
}
|
||||||
|
|
|
@ -119,8 +119,8 @@ const AlertSpace = ({ nightly, legacy }) => {
|
||||||
}
|
}
|
||||||
|
|
||||||
const navAlert = (
|
const navAlert = (
|
||||||
<Link to="/usage/v3" hidden>
|
<Link to="/usage/v3-1" hidden>
|
||||||
<strong>💥 Out now:</strong> spaCy v3.0
|
<strong>💥 Out now:</strong> spaCy v3.1
|
||||||
</Link>
|
</Link>
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|