Add "New in v3.1" guide

2025-12-24 02:23:19 +03:00 · 2021-06-22 15:23:18 +10:00 · bc93c34f54
commit bc93c34f54
parent caba63b74f
4 changed files with 120 additions and 7 deletions
--- a/website/docs/api/entityrecognizer.md
+++ b/website/docs/api/entityrecognizer.md
@@ -82,7 +82,7 @@ shortcut for this and instantiate the component using its string name and
 | `moves`                       | A list of transition names. Inferred from the data if set to `None`, which is the default. ~~Optional[List[str]]~~                                                                                                                                  |
 | _keyword-only_                |                                                                                                                                                                                                                                                     |
 | `update_with_oracle_cut_size` | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. Defaults to `100`. ~~int~~ |
-| `incorrect_spans_key`         | Identifies spans that are known to be incorrect entity annotations. The incorrect entity annotations can be stored in the span group, under this key. Defaults to `None`. ~~Optional[str]~~                                                         |
+| `incorrect_spans_key`         | Identifies spans that are known to be incorrect entity annotations. The incorrect entity annotations can be stored in the span group in [`Doc.spans`](/api/doc#spans), under this key. Defaults to `None`. ~~Optional[str]~~                        |

 ## EntityRecognizer.\_\_call\_\_ {#call tag="method"}

--- a/website/docs/usage/v3-1.md
+++ b/website/docs/usage/v3-1.md
@@ -0,0 +1,114 @@
+---
+title: What's New in v3.1
+teaser: New features and how to upgrade
+menu:
+  - ['New Features', 'features']
+  - ['Upgrading Notes', 'upgrading']
+---
+
+## New Features {#features hidden="true"}
+
+<!-- TODO: intro -->
+
+### Using predicted annotations during training {#predicted-annotations-training}
+
+<!-- TODO: write -->
+
+<Project id="pipelines/tagger_parser_predicted_annotations">
+
+This project shows how to use the `token.dep` attribute predicted by the parser
+as a feature for a subsequent tagger component in the pipeline.
+
+</Project>
+
+### SpanCategorizer for predicting arbitrary and overlapping spans {#spancategorizer tag="experimental"}
+
+A common task in applied NLP is extracting spans of text from documents,
+including longer phrases or nested expressions. Named entity recognition isn't
+the right tool for this problem, since an entity recognizer typically predicts
+single token-based tags that are very sensitive to boundaries. This is effective
+for proper nouns and self-contained expressions, but less useful for other types
+of phrases or overlapping spans. The new
+[`SpanCategorizer`](/api/spancategorizer) component and
+[SpanCategorizer](/api/architectures#spancategorizer) architecture let you label
+arbitrary and potentially overlapping spans of text. A span categorizer
+consists of two parts: a [suggester function](/api/spancategorizer#suggesters)
+that proposes candidate spans, which may or may not overlap, and a labeler model
+that predicts zero or more labels for each candidate. The predicted spans are
+available via the [`Doc.spans`](/api/doc#spans) container.
+
+<!-- TODO: example, getting started (init config?), maybe project template -->
+
+<Infobox title="Tip: Create data with Prodigy's new span annotation UI">
+
+<!-- TODO: screenshot -->
+
+The upcoming version of our annotation tool [Prodigy](https://prodi.gy)
+(currently available as a [pre-release](https://support.prodi.gy/t/3861) for all
+users) features a [new workflow and UI](https://support.prodi.gy/t/3861) for
+annotating overlapping and nested spans. You can use it to create training data
+for spaCy's `SpanCategorizer` component.
+
+</Infobox>
+
+### Update the entity recognizer with partial incorrect annotations {#negative-samples}
+
+> #### config.cfg (excerpt)
+>
+> ```ini
+> [components.ner]
+> factory = "ner"
+> incorrect_spans_key = "incorrect_spans"
+> moves = null
+> update_with_oracle_cut_size = 100
+> ```
+
+The [`EntityRecognizer`](/api/entityrecognizer) can now be updated with known
+incorrect annotations, which lets you take advantage of partial and sparse data.
+For example, you'll be able to use the information that certain spans of text
+are definitely **not** `PERSON` entities, without having to provide the
+complete gold-standard annotations for the given example. The incorrect span
+annotations can be added via [`Doc.spans`](/api/doc#spans) in the training
+data under the key defined as
+[`incorrect_spans_key`](/api/entityrecognizer#init) in the component config.
+
+<!-- TODO: more details and/or example project? -->
+
+### New pipeline packages for Catalan and Danish {#pipeline-packages}
+
+<!-- TODO: intro and update with final numbers -->
+
+| Package                                           | Language | Tagger | Parser |  NER |
+| ------------------------------------------------- | -------- | -----: | -----: | ---: |
+| [`ca_core_news_sm`](/models/ca#ca_core_news_sm)   | Catalan  |        |        |      |
+| [`ca_core_news_md`](/models/ca#ca_core_news_md)   | Catalan  |        |        |      |
+| [`ca_core_news_lg`](/models/ca#ca_core_news_lg)   | Catalan  |        |        |      |
+| [`ca_core_news_trf`](/models/ca#ca_core_news_trf) | Catalan  |        |        |      |
+| [`da_core_news_trf`](/models/da#da_core_news_trf) | Danish   |        |        |      |
+
+### Resizable text classification architectures {#resizable-textcat}
+
+<!-- TODO: write -->
+
+### CLI command to assemble pipeline from config {#assemble}
+
+The [`spacy assemble`](/api/cli#assemble) command lets you assemble a pipeline
+from a config file without additional training. It can be especially useful for
+creating a blank pipeline with a custom tokenizer, rule-based components or word
+vectors.
+
+```cli
+$ python -m spacy assemble config.cfg ./output
+```
+
+### Support for streaming large or infinite corpora {#streaming-corpora}
+
+<!-- TODO: write -->
+
+### New lemmatizers for Catalan and Italian {#pos-lemmatizers}
+
+<!-- TODO: write -->
+
+## Notes about upgrading from v3.0 {#upgrading}
+
+<!-- TODO: this could just be a bullet-point list mentioning stuff like the spacy_version, vectors initialization etc. -->
--- a/website/meta/sidebars.json
+++ b/website/meta/sidebars.json
@@ -9,7 +9,8 @@
                    { "text": "Models & Languages", "url": "/usage/models" },
                    { "text": "Facts & Figures", "url": "/usage/facts-figures" },
                    { "text": "spaCy 101", "url": "/usage/spacy-101" },
-                    { "text": "New in v3.0", "url": "/usage/v3" }
+                    { "text": "New in v3.0", "url": "/usage/v3" },
+                    { "text": "New in v3.1", "url": "/usage/v3-1" }
                ]
            },
            {
@@ -135,9 +136,7 @@
            },
            {
                "label": "Legacy",
-                "items": [
-                    { "text": "Legacy functions", "url": "/api/legacy" }
-                ]
+                "items": [{ "text": "Legacy functions", "url": "/api/legacy" }]
            }
        ]
    }
--- a/website/src/templates/index.js
+++ b/website/src/templates/index.js
@@ -119,8 +119,8 @@ const AlertSpace = ({ nightly, legacy }) => {
 }

 const navAlert = (
-    <Link to="/usage/v3" hidden>
-        <strong>💥 Out now:</strong> spaCy v3.0
+    <Link to="/usage/v3-1" hidden>
+        <strong>💥 Out now:</strong> spaCy v3.1
    </Link>
 )
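
Note on the `SpanCategorizer` section added in `website/docs/usage/v3-1.md`: the guide describes a suggester function that proposes candidate spans, which may overlap, before a labeler assigns zero or more labels to each one. The candidate-generation step can be sketched in plain Python, independent of spaCy. This is a minimal illustration under stated assumptions, not spaCy's implementation: `ngram_candidates` and its `sizes` parameter are hypothetical names, and representing a span as a `(start, end)` pair of token offsets with an exclusive end is an assumption made for the example.

```python
from typing import List, Sequence, Tuple


def ngram_candidates(
    tokens: Sequence[str], sizes: Sequence[int] = (1, 2, 3)
) -> List[Tuple[int, int]]:
    """Propose every contiguous token span whose length is in `sizes`.

    Hypothetical helper for illustration only; not part of spaCy.
    Spans are (start, end) token offsets with an exclusive end.
    """
    candidates = []
    for size in sizes:
        # Slide a window of `size` tokens across the text.
        for start in range(len(tokens) - size + 1):
            candidates.append((start, start + size))
    return candidates


tokens = ["acute", "kidney", "injury"]
spans = ngram_candidates(tokens, sizes=(1, 2))
print(spans)
```

Candidates such as `(0, 2)` and `(1, 3)` share a token, so a labeler that scores each candidate independently can keep both. That is what allows overlapping or nested spans, in contrast to the single tag per token that an entity recognizer predicts.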