Update docs and consistency [ci skip]

Ines Montani 2020-08-21 13:49:18 +02:00
parent 52bd3a8b48
commit aa6a7cd6e7
11 changed files with 43 additions and 40 deletions

View File

@@ -5,7 +5,7 @@
 Thanks for your interest in contributing to spaCy 🎉 The project is maintained
 by [@honnibal](https://github.com/honnibal) and [@ines](https://github.com/ines),
 and we'll do our best to help you get started. This page will give you a quick
-overview of how things are organised and most importantly, how to get involved.
+overview of how things are organized and most importantly, how to get involved.
 
 ## Table of contents
@@ -195,7 +195,7 @@ modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**
 ### Code formatting
 
 [`black`](https://github.com/ambv/black) is an opinionated Python code
-formatter, optimised to produce readable code and small diffs. You can run
+formatter, optimized to produce readable code and small diffs. You can run
 `black` from the command-line, or via your code editor. For example, if you're
 using [Visual Studio Code](https://code.visualstudio.com/), you can add the
 following to your `settings.json` to use `black` for formatting and auto-format
@@ -286,7 +286,7 @@ Code that interacts with the file-system should accept objects that follow the
 
 If the function is user-facing and takes a path as an argument, it should check
 whether the path is provided as a string. Strings should be converted to
 `pathlib.Path` objects. Serialization and deserialization functions should always
-accept **file-like objects**, as it makes the library io-agnostic. Working on
+accept **file-like objects**, as it makes the library IO-agnostic. Working on
 buffers makes the code more general, easier to test, and compatible with Python
 3's asynchronous IO.
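A minimal sketch of the convention described above; the function names and the line-based file format are made up for illustration and are not spaCy APIs:

```python
from pathlib import Path

def load_entries(path):
    # User-facing: accept a string or a Path and normalize to pathlib.Path
    if isinstance(path, str):
        path = Path(path)
    with path.open("r", encoding="utf8") as file_:
        return read_entries(file_)

def read_entries(file_):
    # IO-agnostic: works on any file-like object (open file, io.StringIO, etc.)
    return [line.strip() for line in file_ if line.strip()]
```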
@@ -384,7 +384,7 @@ of Python and C++, with additional complexity and syntax from numpy. The
 many "traps for new players". Working in Cython is very rewarding once you're
 over the initial learning curve. As with C and C++, the first way you write
 something in Cython will often be the performance-optimal approach. In contrast,
-Python optimisation generally requires a lot of experimentation. Is it faster to
+Python optimization generally requires a lot of experimentation. Is it faster to
 have an `if item in my_dict` check, or to use `.get()`? What about `try`/`except`?
 Does this numpy operation create a copy? There's no way to guess the answers to
 these questions, and you'll usually be dissatisfied with your results — so
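One way to settle questions like these is a quick throwaway benchmark rather than guessing; a rough `timeit` sketch (the dictionary and key are made up, absolute numbers will vary by machine):

```python
import timeit

setup = "my_dict = {str(i): i for i in range(1000)}; item = '500'"
try_except = """
try:
    x = my_dict[item]
except KeyError:
    x = None
"""

# Compare membership check, .get() and try/except for a key that exists
print(timeit.timeit("x = my_dict[item] if item in my_dict else None", setup=setup))
print(timeit.timeit("x = my_dict.get(item)", setup=setup))
print(timeit.timeit(try_except, setup=setup))
```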
@@ -400,7 +400,7 @@ Python. If it's not fast enough the first time, just switch to Cython.
 
 - [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
 - [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
 - [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
-- [Multi-threading spaCy's parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
+- [Multi-threading spaCy's parser and named entity recognizer](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
 
 ## Adding tests
@@ -412,7 +412,7 @@ name. For example, tests for the `Tokenizer` can be found in
 all test files and test functions need to be prefixed with `test_`.
 
 When adding tests, make sure to use descriptive names, keep the code short and
-concise and only test for one behaviour at a time. Try to `parametrize` test
+concise and only test for one behavior at a time. Try to `parametrize` test
 cases wherever possible, use our pre-defined fixtures for spaCy components and
 avoid unnecessary imports.
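A rough sketch of what that advice looks like in practice; the `en_tokenizer` fixture name and the expected token counts are assumptions for illustration, not taken from the actual test suite:

```python
import pytest

@pytest.mark.parametrize("text,expected_len", [("Hello world!", 3), ("don't", 2)])
def test_en_tokenizer_handles_basic_text(en_tokenizer, text, expected_len):
    # en_tokenizer: assumed shared fixture returning an English tokenizer
    tokens = en_tokenizer(text)
    assert len(tokens) == expected_len
```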

View File

@@ -49,9 +49,8 @@ It's commercial open-source software, released under the MIT license.
 ## 💬 Where to ask questions
 
-The spaCy project is maintained by [@honnibal](https://github.com/honnibal) and
-[@ines](https://github.com/ines), along with core contributors
-[@svlandeg](https://github.com/svlandeg) and
+The spaCy project is maintained by [@honnibal](https://github.com/honnibal),
+[@ines](https://github.com/ines), [@svlandeg](https://github.com/svlandeg) and
 [@adrianeboyd](https://github.com/adrianeboyd). Please understand that we won't
 be able to provide individual support via email. We also believe that help is
 much more valuable if it's shared publicly, so that more people can benefit from

View File

@@ -47,9 +47,9 @@ cdef class Tokenizer:
 `infix_finditer` (callable): A function matching the signature of
     `re.compile(string).finditer` to find infixes.
 token_match (callable): A boolean function matching strings to be
-    recognised as tokens.
+    recognized as tokens.
 url_match (callable): A boolean function matching strings to be
-    recognised as tokens after considering prefixes and suffixes.
+    recognized as tokens after considering prefixes and suffixes.
 
 EXAMPLE:
     >>> tokenizer = Tokenizer(nlp.vocab)
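To make the `token_match` argument concrete, a minimal construction sketch (the regex and example text are invented; assumes spaCy v3-style `spacy.blank`):

```python
import re
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.blank("en")
# token_match: strings matching this pattern are always kept as single tokens
ticket_match = re.compile(r"^[A-Z]+-\d+$").match
nlp.tokenizer = Tokenizer(nlp.vocab, token_match=ticket_match)
print([t.text for t in nlp("Ticket ABC-123 is open")])
```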

View File

@@ -184,7 +184,7 @@ yourself. For details on how to get started with training your own model, check
 out the [training quickstart](/usage/training#quickstart).
 
 <!-- TODO:
-<Project id="en_core_bert">
+<Project id="en_core_trf_lg">
 
 The easiest way to get started is to clone a transformers-based project
 template. Swap in your data, edit the settings and hyperparameters and train,

View File

@@ -169,7 +169,7 @@ $ python setup.py build_ext --inplace # compile spaCy
 
 Compared to regular install via pip, the
 [`requirements.txt`](https://github.com/explosion/spaCy/tree/master/requirements.txt)
 additionally installs developer dependencies such as Cython. See the
 [quickstart widget](#quickstart) to get the right commands for your platform and
 Python version.
@@ -368,7 +368,7 @@ from is called `spacy`. So, when using spaCy, never call anything else `spacy`.
 
 </Accordion>
 
-<Accordion title="NER model doesn't recognise other entities anymore after training" id="catastrophic-forgetting">
+<Accordion title="NER model doesn't recognize other entities anymore after training" id="catastrophic-forgetting">
 
 If your training data only contained new entities and you didn't mix in any
 examples the model previously recognized, it can cause the model to "forget"

View File

@@ -429,7 +429,7 @@ nlp = spacy.load("en_core_web_sm")
 doc = nlp("fb is hiring a new vice president of global policy")
 ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
 print('Before', ents)
-# the model didn't recognise "fb" as an entity :(
+# The model didn't recognize "fb" as an entity :(
 
 fb_ent = Span(doc, 0, 1, label="ORG")  # create a Span for the new entity
 doc.ents = list(doc.ents) + [fb_ent]
@@ -558,11 +558,11 @@ import spacy
 nlp = spacy.load("my_custom_el_model")
 doc = nlp("Ada Lovelace was born in London")
 
-# document level
+# Document level
 ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]
 print(ents)  # [('Ada Lovelace', 'PERSON', 'Q7259'), ('London', 'GPE', 'Q84')]
 
-# token level
+# Token level
 ent_ada_0 = [doc[0].text, doc[0].ent_type_, doc[0].ent_kb_id_]
 ent_ada_1 = [doc[1].text, doc[1].ent_type_, doc[1].ent_kb_id_]
 ent_london_5 = [doc[5].text, doc[5].ent_type_, doc[5].ent_kb_id_]
@@ -914,12 +914,12 @@ from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
 from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
 from spacy.util import compile_infix_regex
 
-# default tokenizer
+# Default tokenizer
 nlp = spacy.load("en_core_web_sm")
 doc = nlp("mother-in-law")
 print([t.text for t in doc])  # ['mother', '-', 'in', '-', 'law']
 
-# modify tokenizer infix patterns
+# Modify tokenizer infix patterns
 infixes = (
     LIST_ELLIPSES
     + LIST_ICONS
@@ -929,8 +929,8 @@ infixes = (
             al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
         ),
         r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
-        # EDIT: commented out regex that splits on hyphens between letters:
-        #r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
+        # ✅ Commented out regex that splits on hyphens between letters:
+        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
         r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
     ]
 )
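The hunk ends inside the example; continuing it, the modified patterns would typically be recompiled and assigned back to the tokenizer along these lines (a sketch using the `nlp` and `infixes` defined above, not part of this diff):

```python
# Recompile the modified infix patterns and install them on the tokenizer
infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
print([t.text for t in nlp("mother-in-law")])  # expected: ['mother-in-law']
```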

View File

@@ -108,11 +108,11 @@ class, or defined within a [model package](/usage/saving-loading#models).
 >
 > [components.tagger]
 > factory = "tagger"
-> # settings for the tagger component
+> # Settings for the tagger component
 >
 > [components.parser]
 > factory = "parser"
-> # settings for the parser component
+> # Settings for the parser component
 > ```
 
 When you load a model, spaCy first consults the model's
@@ -171,11 +171,11 @@ lang = "en"
 pipeline = ["tagger", "parser", "ner"]
 data_path = "path/to/en_core_web_sm/en_core_web_sm-2.0.0"
 
-cls = spacy.util.get_lang_class(lang)   # 1. Get Language instance, e.g. English()
+cls = spacy.util.get_lang_class(lang)   # 1. Get Language class, e.g. English
 nlp = cls()                             # 2. Initialize it
 for name in pipeline:
     nlp.add_pipe(name)                  # 3. Add the component to the pipeline
 nlp.from_disk(model_data_path)          # 4. Load in the binary data
 ```
 
 When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
@@ -187,9 +187,9 @@ which is then processed by the component next in the pipeline.
 
 ```python
 ### The pipeline under the hood
-doc = nlp.make_doc("This is a sentence")  # create a Doc from raw text
-for name, proc in nlp.pipeline:           # iterate over components in order
-    doc = proc(doc)                       # apply each component
+doc = nlp.make_doc("This is a sentence")  # Create a Doc from raw text
+for name, proc in nlp.pipeline:           # Iterate over components in order
+    doc = proc(doc)                       # Apply each component
 ```
 
 The current processing pipeline is available as `nlp.pipeline`, which returns a
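For instance, the component names and `(name, component)` tuples can be inspected directly; a small sketch assuming the standard `en_core_web_sm` model is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)   # component names in order, e.g. ['tagger', 'parser', 'ner', ...]
print(nlp.pipeline[0])  # first (name, component) tuple
```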
@@ -473,7 +473,7 @@ only being able to modify it afterwards.
 >
 > @Language.component("my_component")
 > def my_component(doc):
->     # do something to the doc here
+>     # Do something to the doc here
 >     return doc
 > ```
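Once registered via the decorator, the component is added to a pipeline by its string name; a minimal runnable sketch (the `"my_component"` name comes from the snippet above, the blank pipeline is an assumption):

```python
import spacy
from spacy.language import Language

@Language.component("my_component")
def my_component(doc):
    # Do something to the doc here
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("my_component", first=True)  # add by registered name
print(nlp.pipe_names)  # ['my_component']
```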

View File

@@ -511,21 +511,21 @@ from spacy.language import Language
 from spacy.matcher import Matcher
 from spacy.tokens import Token
 
-# We're using a component factory because the component needs to be initialized
-# with the shared vocab via the nlp object
+# We're using a component factory because the component needs to be
+# initialized with the shared vocab via the nlp object
 @Language.factory("html_merger")
 def create_bad_html_merger(nlp, name):
-    return BadHTMLMerger(nlp)
+    return BadHTMLMerger(nlp.vocab)
 
 class BadHTMLMerger:
-    def __init__(self, nlp):
+    def __init__(self, vocab):
         patterns = [
             [{"ORTH": "<"}, {"LOWER": "br"}, {"ORTH": ">"}],
             [{"ORTH": "<"}, {"LOWER": "br/"}, {"ORTH": ">"}],
         ]
         # Register a new token extension to flag bad HTML
         Token.set_extension("bad_html", default=False)
-        self.matcher = Matcher(nlp.vocab)
+        self.matcher = Matcher(vocab)
         self.matcher.add("BAD_HTML", patterns)
 
     def __call__(self, doc):
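The hunk cuts off inside `__call__`; independently of that, a pipeline would pick up the factory above by name, roughly like this (a sketch, assuming the definitions above are importable):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("html_merger", last=True)  # built via the "html_merger" factory above
print(nlp.pipe_names)  # ['html_merger']
```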

View File

@@ -792,7 +792,7 @@ you save the transformer outputs for later use.
 
 <!-- TODO:
-<Project id="en_core_bert">
+<Project id="en_core_trf_lg">
 
 Try out a BERT-based model pipeline using this project template: swap in your
 data, edit the settings and hyperparameters and train, evaluate, package and

View File

@@ -66,7 +66,7 @@ menu:
 - **Architectures: ** [TransformerModel](/api/architectures#TransformerModel),
   [Tok2VecListener](/api/architectures#transformers-Tok2VecListener),
   [Tok2VecTransformer](/api/architectures#Tok2VecTransformer)
-- **Models:** [`en_core_bert_sm`](/models/en)
+- **Models:** [`en_core_trf_lg_sm`](/models/en)
 - **Implementation:**
   [`spacy-transformers`](https://github.com/explosion/spacy-transformers)
@@ -293,7 +293,8 @@ format for documenting argument and return types.
 
 - **Usage: ** [Embeddings & Transformers](/usage/embeddings-transformers),
   [Training models](/usage/training), [Projects](/usage/projects),
-  [Custom pipeline components](/usage/processing-pipelines#custom-components)
+  [Custom pipeline components](/usage/processing-pipelines#custom-components),
+  [Custom tokenizers](/usage/linguistic-features#custom-tokenizer)
 - **API Reference: ** [Library architecture](/api),
   [Model architectures](/api/architectures), [Data formats](/api/data-formats)
 - **New Classes: ** [`Example`](/api/example), [`Tok2Vec`](/api/tok2vec),

View File

@@ -363,7 +363,7 @@ body [id]:target
     color: var(--color-red-medium)
     background: var(--color-red-transparent)
 
-    &.italic
+    &.italic, &.comment
         font-style: italic
@@ -384,9 +384,11 @@ body [id]:target
 
 // Settings for ini syntax (config files)
 [class*="language-ini"]
     color: var(--syntax-comment)
+    font-style: italic !important
 
     .token
         color: var(--color-subtle)
+        font-style: normal !important
 
 .gatsby-highlight-code-line
@@ -424,6 +426,7 @@ body [id]:target
 
 .cm-comment
     color: var(--syntax-comment)
+    font-style: italic
 
 .cm-keyword
     color: var(--syntax-keyword)