Mirror of https://github.com/explosion/spaCy.git, synced 2024-12-25 01:16:28 +03:00

Update docs and consistency [ci skip]

commit aa6a7cd6e7
parent 52bd3a8b48
@@ -5,7 +5,7 @@
 Thanks for your interest in contributing to spaCy 🎉 The project is maintained
 by [@honnibal](https://github.com/honnibal) and [@ines](https://github.com/ines),
 and we'll do our best to help you get started. This page will give you a quick
-overview of how things are organised and most importantly, how to get involved.
+overview of how things are organized and most importantly, how to get involved.

 ## Table of contents

@@ -195,7 +195,7 @@ modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**
 ### Code formatting

 [`black`](https://github.com/ambv/black) is an opinionated Python code
-formatter, optimised to produce readable code and small diffs. You can run
+formatter, optimized to produce readable code and small diffs. You can run
 `black` from the command-line, or via your code editor. For example, if you're
 using [Visual Studio Code](https://code.visualstudio.com/), you can add the
 following to your `settings.json` to use `black` for formatting and auto-format

@@ -286,7 +286,7 @@ Code that interacts with the file-system should accept objects that follow the
 If the function is user-facing and takes a path as an argument, it should check
 whether the path is provided as a string. Strings should be converted to
 `pathlib.Path` objects. Serialization and deserialization functions should always
-accept **file-like objects**, as it makes the library io-agnostic. Working on
+accept **file-like objects**, as it makes the library IO-agnostic. Working on
 buffers makes the code more general, easier to test, and compatible with Python
 3's asynchronous IO.

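The convention described in this hunk is simple to apply. Below is a minimal sketch of the pattern, not taken from the diff; the helper name is hypothetical, and spaCy's `spacy.util.ensure_path` performs the same string-to-`Path` normalization:

```python
from pathlib import Path

def read_text(path):
    # User-facing: accept a string or a Path, normalize to Path internally
    if isinstance(path, str):
        path = Path(path)
    with path.open("r", encoding="utf8") as file_:
        return file_.read()
```
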
@@ -384,7 +384,7 @@ of Python and C++, with additional complexity and syntax from numpy. The
 many "traps for new players". Working in Cython is very rewarding once you're
 over the initial learning curve. As with C and C++, the first way you write
 something in Cython will often be the performance-optimal approach. In contrast,
-Python optimisation generally requires a lot of experimentation. Is it faster to
+Python optimization generally requires a lot of experimentation. Is it faster to
 have an `if item in my_dict` check, or to use `.get()`? What about `try`/`except`?
 Does this numpy operation create a copy? There's no way to guess the answers to
 these questions, and you'll usually be dissatisfied with your results — so

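The micro-benchmarking questions this hunk raises can't be answered by inspection, but they are easy to settle empirically. A quick illustration (not from the guide) using the standard library's `timeit`:

```python
import timeit

setup = "my_dict = {'answer': 42}"
# Measure both idioms instead of guessing which is faster
print(timeit.timeit("my_dict['answer'] if 'answer' in my_dict else None", setup=setup))
print(timeit.timeit("my_dict.get('answer')", setup=setup))
```
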
@@ -400,7 +400,7 @@ Python. If it's not fast enough the first time, just switch to Cython.
 - [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
 - [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
 - [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
-- [Multi-threading spaCy’s parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
+- [Multi-threading spaCy’s parser and named entity recognizer](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)

 ## Adding tests

@@ -412,7 +412,7 @@ name. For example, tests for the `Tokenizer` can be found in
 all test files and test functions need to be prefixed with `test_`.

 When adding tests, make sure to use descriptive names, keep the code short and
-concise and only test for one behaviour at a time. Try to `parametrize` test
+concise and only test for one behavior at a time. Try to `parametrize` test
 cases wherever possible, use our pre-defined fixtures for spaCy components and
 avoid unnecessary imports.

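For illustration, a hypothetical test in the style this hunk describes, assuming the suite's pre-defined `en_tokenizer` fixture; `parametrize` keeps each behavior in its own case:

```python
import pytest

@pytest.mark.parametrize(
    "text,expected",
    [("Hello, world!", ["Hello", ",", "world", "!"]), ("don't", ["do", "n't"])],
)
def test_en_tokenizer_splits_punct(en_tokenizer, text, expected):
    tokens = en_tokenizer(text)
    assert [t.text for t in tokens] == expected
```
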
@@ -49,9 +49,8 @@ It's commercial open-source software, released under the MIT license.

 ## 💬 Where to ask questions

-The spaCy project is maintained by [@honnibal](https://github.com/honnibal) and
-[@ines](https://github.com/ines), along with core contributors
-[@svlandeg](https://github.com/svlandeg) and
+The spaCy project is maintained by [@honnibal](https://github.com/honnibal),
+[@ines](https://github.com/ines), [@svlandeg](https://github.com/svlandeg) and
 [@adrianeboyd](https://github.com/adrianeboyd). Please understand that we won't
 be able to provide individual support via email. We also believe that help is
 much more valuable if it's shared publicly, so that more people can benefit from

@@ -47,9 +47,9 @@ cdef class Tokenizer:
         `infix_finditer` (callable): A function matching the signature of
             `re.compile(string).finditer` to find infixes.
         token_match (callable): A boolean function matching strings to be
-            recognised as tokens.
+            recognized as tokens.
         url_match (callable): A boolean function matching strings to be
-            recognised as tokens after considering prefixes and suffixes.
+            recognized as tokens after considering prefixes and suffixes.

         EXAMPLE:
             >>> tokenizer = Tokenizer(nlp.vocab)

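To make the two arguments concrete, here is a small sketch (not from the diff) showing `token_match` overriding an affix rule. Per the docstring above, `token_match` is checked before prefixes and suffixes, while `url_match` only applies after them; the regexes here are invented for the example:

```python
import re
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.blank("en")
suffix_re = re.compile(r"!$")

# Without token_match, the suffix rule splits off the trailing "!"
tok = Tokenizer(nlp.vocab, suffix_search=suffix_re.search)
print([t.text for t in tok("wow!")])  # ['wow', '!']

# token_match takes priority: matching strings stay single tokens
tok = Tokenizer(nlp.vocab, suffix_search=suffix_re.search,
                token_match=re.compile(r"^wow!$").match)
print([t.text for t in tok("wow!")])  # ['wow!']
```
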
@@ -184,7 +184,7 @@ yourself. For details on how to get started with training your own model, check
 out the [training quickstart](/usage/training#quickstart).

 <!-- TODO:
-<Project id="en_core_bert">
+<Project id="en_core_trf_lg">

 The easiest way to get started is to clone a transformers-based project
 template. Swap in your data, edit the settings and hyperparameters and train,

@@ -169,7 +169,7 @@ $ python setup.py build_ext --inplace # compile spaCy

 Compared to regular install via pip, the
 [`requirements.txt`](https://github.com/explosion/spaCy/tree/master/requirements.txt)
-additionally installs developer dependencies such as Cython. See the
+additionally installs developer dependencies such as Cython. See the
 [quickstart widget](#quickstart) to get the right commands for your platform and
 Python version.

@@ -368,7 +368,7 @@ from is called `spacy`. So, when using spaCy, never call anything else `spacy`.

 </Accordion>

-<Accordion title="NER model doesn't recognise other entities anymore after training" id="catastrophic-forgetting">
+<Accordion title="NER model doesn't recognize other entities anymore after training" id="catastrophic-forgetting">

 If your training data only contained new entities and you didn't mix in any
 examples the model previously recognized, it can cause the model to "forget"

@@ -429,7 +429,7 @@ nlp = spacy.load("en_core_web_sm")
 doc = nlp("fb is hiring a new vice president of global policy")
 ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
 print('Before', ents)
-# the model didn't recognise "fb" as an entity :(
+# The model didn't recognize "fb" as an entity :(

 fb_ent = Span(doc, 0, 1, label="ORG")  # create a Span for the new entity
 doc.ents = list(doc.ents) + [fb_ent]

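In the published docs this snippet continues by checking the result, presumably along these lines (assuming `Span` was imported from `spacy.tokens` above the excerpt):

```python
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('After', ents)
# Now includes ('fb', 0, 2, 'ORG') alongside the model's own predictions
```
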
@@ -558,11 +558,11 @@ import spacy
 nlp = spacy.load("my_custom_el_model")
 doc = nlp("Ada Lovelace was born in London")

-# document level
+# Document level
 ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]
 print(ents)  # [('Ada Lovelace', 'PERSON', 'Q7259'), ('London', 'GPE', 'Q84')]

-# token level
+# Token level
 ent_ada_0 = [doc[0].text, doc[0].ent_type_, doc[0].ent_kb_id_]
 ent_ada_1 = [doc[1].text, doc[1].ent_type_, doc[1].ent_kb_id_]
 ent_london_5 = [doc[5].text, doc[5].ent_type_, doc[5].ent_kb_id_]

@@ -914,12 +914,12 @@ from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
 from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
 from spacy.util import compile_infix_regex

-# default tokenizer
+# Default tokenizer
 nlp = spacy.load("en_core_web_sm")
 doc = nlp("mother-in-law")
 print([t.text for t in doc])  # ['mother', '-', 'in', '-', 'law']

-# modify tokenizer infix patterns
+# Modify tokenizer infix patterns
 infixes = (
     LIST_ELLIPSES
     + LIST_ICONS

@@ -929,8 +929,8 @@ infixes = (
             al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
         ),
         r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
-        # EDIT: commented out regex that splits on hyphens between letters:
-        #r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
+        # ✅ Commented out regex that splits on hyphens between letters:
+        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
         r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
     ]
 )

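Past the end of this hunk, the example presumably compiles the modified patterns and swaps them into the live tokenizer, roughly like this (a sketch for context, using `compile_infix_regex` imported above):

```python
infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
doc = nlp("mother-in-law")
print([t.text for t in doc])  # ['mother-in-law']
```
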
@@ -108,11 +108,11 @@ class, or defined within a [model package](/usage/saving-loading#models).
 >
 > [components.tagger]
 > factory = "tagger"
-> # settings for the tagger component
+> # Settings for the tagger component
 >
 > [components.parser]
 > factory = "parser"
-> # settings for the parser component
+> # Settings for the parser component
 > ```

 When you load a model, spaCy first consults the model's

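The `[components.*]` blocks map one-to-one onto the pipeline: each names a registered factory plus its settings. A quick way to see that mapping from Python (an illustration, not part of the diff):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("tagger")  # Add a component by its factory name
# The factory name and settings are reflected in the nlp object's config
print(nlp.config["components"]["tagger"])  # {'factory': 'tagger', ...}
```
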
@@ -171,11 +171,11 @@ lang = "en"
 pipeline = ["tagger", "parser", "ner"]
 data_path = "path/to/en_core_web_sm/en_core_web_sm-2.0.0"

-cls = spacy.util.get_lang_class(lang)  # 1. Get Language instance, e.g. English()
-nlp = cls()                            # 2. Initialize it
+cls = spacy.util.get_lang_class(lang)  # 1. Get Language class, e.g. English
+nlp = cls()                            # 2. Initialize it
 for name in pipeline:
-    nlp.add_pipe(name)                 # 3. Add the component to the pipeline
-nlp.from_disk(model_data_path)         # 4. Load in the binary data
+    nlp.add_pipe(name)                 # 3. Add the component to the pipeline
+nlp.from_disk(model_data_path)         # 4. Load in the binary data
 ```

 When you call `nlp` on a text, spaCy will **tokenize** it and then **call each

@@ -187,9 +187,9 @@ which is then processed by the component next in the pipeline.

 ```python
 ### The pipeline under the hood
-doc = nlp.make_doc("This is a sentence")  # create a Doc from raw text
-for name, proc in nlp.pipeline:           # iterate over components in order
-    doc = proc(doc)                       # apply each component
+doc = nlp.make_doc("This is a sentence")  # Create a Doc from raw text
+for name, proc in nlp.pipeline:           # Iterate over components in order
+    doc = proc(doc)                       # Apply each component
 ```

 The current processing pipeline is available as `nlp.pipeline`, which returns a

@@ -473,7 +473,7 @@ only being able to modify it afterwards.
 >
 > @Language.component("my_component")
 > def my_component(doc):
->     # do something to the doc here
+>     # Do something to the doc here
 >     return doc
 > ```

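Once decorated with `@Language.component`, the function is registered under its string name and can be added to any pipeline. A minimal usage sketch, assuming the `my_component` function from the excerpt has been defined:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("my_component")  # Refer to the component by its registered name
print(nlp.pipe_names)  # ['my_component']
```
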
@@ -511,21 +511,21 @@ from spacy.language import Language
 from spacy.matcher import Matcher
 from spacy.tokens import Token

-# We're using a component factory because the component needs to be initialized
-# with the shared vocab via the nlp object
+# We're using a component factory because the component needs to be
+# initialized with the shared vocab via the nlp object
 @Language.factory("html_merger")
 def create_bad_html_merger(nlp, name):
-    return BadHTMLMerger(nlp)
+    return BadHTMLMerger(nlp.vocab)

 class BadHTMLMerger:
-    def __init__(self, nlp):
+    def __init__(self, vocab):
         patterns = [
             [{"ORTH": "<"}, {"LOWER": "br"}, {"ORTH": ">"}],
             [{"ORTH": "<"}, {"LOWER": "br/"}, {"ORTH": ">"}],
         ]
         # Register a new token extension to flag bad HTML
         Token.set_extension("bad_html", default=False)
-        self.matcher = Matcher(nlp.vocab)
+        self.matcher = Matcher(vocab)
         self.matcher.add("BAD_HTML", patterns)

     def __call__(self, doc):

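The hunk cuts off at `__call__`. In the published docs this method continues roughly as below (a sketch for context, not part of the diff): the matcher finds the HTML fragments, the retokenizer merges each span into one token, and the `bad_html` flag is set on the merged tokens.

```python
    def __call__(self, doc):
        # Find <br> / <br/> sequences and merge each into a single token
        matches = self.matcher(doc)
        spans = [doc[start:end] for match_id, start, end in matches]
        with doc.retokenize() as retokenizer:
            for span in spans:
                retokenizer.merge(span)
                for token in span:
                    token._.bad_html = True  # Flag the merged token
        return doc
```
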
@@ -792,7 +792,7 @@ you save the transformer outputs for later use.

 <!-- TODO:

-<Project id="en_core_bert">
+<Project id="en_core_trf_lg">

 Try out a BERT-based model pipeline using this project template: swap in your
 data, edit the settings and hyperparameters and train, evaluate, package and

@@ -66,7 +66,7 @@ menu:
 - **Architectures: ** [TransformerModel](/api/architectures#TransformerModel),
   [Tok2VecListener](/api/architectures#transformers-Tok2VecListener),
   [Tok2VecTransformer](/api/architectures#Tok2VecTransformer)
-- **Models:** [`en_core_bert_sm`](/models/en)
+- **Models:** [`en_core_trf_lg_sm`](/models/en)
 - **Implementation:**
   [`spacy-transformers`](https://github.com/explosion/spacy-transformers)

@@ -293,7 +293,8 @@ format for documenting argument and return types.

 - **Usage: ** [Embeddings & Transformers](/usage/embeddings-transformers),
   [Training models](/usage/training), [Projects](/usage/projects),
-  [Custom pipeline components](/usage/processing-pipelines#custom-components)
+  [Custom pipeline components](/usage/processing-pipelines#custom-components),
+  [Custom tokenizers](/usage/linguistic-features#custom-tokenizer)
 - **API Reference: ** [Library architecture](/api),
   [Model architectures](/api/architectures), [Data formats](/api/data-formats)
 - **New Classes: ** [`Example`](/api/example), [`Tok2Vec`](/api/tok2vec),

@@ -363,7 +363,7 @@ body [id]:target
     color: var(--color-red-medium)
     background: var(--color-red-transparent)

-    &.italic
+    &.italic, &.comment
         font-style: italic

@@ -384,9 +384,11 @@ body [id]:target
 // Settings for ini syntax (config files)
 [class*="language-ini"]
     color: var(--syntax-comment)
     font-style: italic !important

+    .token
+        color: var(--color-subtle)
+        font-style: normal !important

 .gatsby-highlight-code-line

@@ -424,6 +426,7 @@ body [id]:target

 .cm-comment
     color: var(--syntax-comment)
+    font-style: italic

 .cm-keyword
     color: var(--syntax-keyword)