Update docs and consistency [ci skip]
This commit is contained in:
parent 52bd3a8b48
commit aa6a7cd6e7
@@ -5,7 +5,7 @@
 Thanks for your interest in contributing to spaCy 🎉 The project is maintained
 by [@honnibal](https://github.com/honnibal) and [@ines](https://github.com/ines),
 and we'll do our best to help you get started. This page will give you a quick
-overview of how things are organised and most importantly, how to get involved.
+overview of how things are organized and most importantly, how to get involved.

 ## Table of contents

@@ -195,7 +195,7 @@ modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**
 ### Code formatting

 [`black`](https://github.com/ambv/black) is an opinionated Python code
-formatter, optimised to produce readable code and small diffs. You can run
+formatter, optimized to produce readable code and small diffs. You can run
 `black` from the command-line, or via your code editor. For example, if you're
 using [Visual Studio Code](https://code.visualstudio.com/), you can add the
 following to your `settings.json` to use `black` for formatting and auto-format
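
The docs' actual `settings.json` block isn't included in this hunk. As a rough sketch, such an entry might look like the following; the exact keys depend on your VS Code Python extension version and are an assumption here:

```json
{
    "python.formatting.provider": "black",
    "editor.formatOnSave": true
}
```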
@@ -286,7 +286,7 @@ Code that interacts with the file-system should accept objects that follow the
 If the function is user-facing and takes a path as an argument, it should check
 whether the path is provided as a string. Strings should be converted to
 `pathlib.Path` objects. Serialization and deserialization functions should always
-accept **file-like objects**, as it makes the library io-agnostic. Working on
+accept **file-like objects**, as it makes the library IO-agnostic. Working on
 buffers makes the code more general, easier to test, and compatible with Python
 3's asynchronous IO.

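
The string-to-`Path` convention described above is small enough to sketch. This is an illustrative helper under the stated convention, not necessarily spaCy's own implementation:

```python
from pathlib import Path

def ensure_path(path):
    # Convert user-supplied strings to Path objects; leave Path
    # instances and file-like objects untouched
    return Path(path) if isinstance(path, str) else path
```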
@@ -384,7 +384,7 @@ of Python and C++, with additional complexity and syntax from numpy. The
 many "traps for new players". Working in Cython is very rewarding once you're
 over the initial learning curve. As with C and C++, the first way you write
 something in Cython will often be the performance-optimal approach. In contrast,
-Python optimisation generally requires a lot of experimentation. Is it faster to
+Python optimization generally requires a lot of experimentation. Is it faster to
 have an `if item in my_dict` check, or to use `.get()`? What about `try`/`except`?
 Does this numpy operation create a copy? There's no way to guess the answers to
 these questions, and you'll usually be dissatisfied with your results — so
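
The kind of micro-experiment this hunk alludes to is easy to run. A minimal sketch with `timeit`; timings vary by Python version and hit/miss ratio, so the point is to measure rather than guess:

```python
import timeit

setup = "d = {str(i): i for i in range(1000)}"
# Compare the two lookup styles mentioned above
print(timeit.timeit("d['500'] if '500' in d else None", setup=setup))
print(timeit.timeit("d.get('500')", setup=setup))
```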
@@ -400,7 +400,7 @@ Python. If it's not fast enough the first time, just switch to Cython.
 - [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
 - [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
 - [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
-- [Multi-threading spaCy’s parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
+- [Multi-threading spaCy’s parser and named entity recognizer](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)

 ## Adding tests

@@ -412,7 +412,7 @@ name. For example, tests for the `Tokenizer` can be found in
 all test files and test functions need to be prefixed with `test_`.

 When adding tests, make sure to use descriptive names, keep the code short and
-concise and only test for one behaviour at a time. Try to `parametrize` test
+concise and only test for one behavior at a time. Try to `parametrize` test
 cases wherever possible, use our pre-defined fixtures for spaCy components and
 avoid unnecessary imports.

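
A minimal sketch of the parametrized style recommended here; the `en_tokenizer` fixture and the expected token counts are assumptions for illustration:

```python
import pytest

@pytest.mark.parametrize("text,length", [("Hello, world!", 4), ("don't", 2)])
def test_en_tokenizer_splits_punct(en_tokenizer, text, length):
    # One behavior per test: punctuation splitting, parametrized over inputs
    tokens = en_tokenizer(text)
    assert len(tokens) == length
```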
@@ -49,9 +49,8 @@ It's commercial open-source software, released under the MIT license.
 ## 💬 Where to ask questions

-The spaCy project is maintained by [@honnibal](https://github.com/honnibal) and
-[@ines](https://github.com/ines), along with core contributors
-[@svlandeg](https://github.com/svlandeg) and
+The spaCy project is maintained by [@honnibal](https://github.com/honnibal),
+[@ines](https://github.com/ines), [@svlandeg](https://github.com/svlandeg) and
 [@adrianeboyd](https://github.com/adrianeboyd). Please understand that we won't
 be able to provide individual support via email. We also believe that help is
 much more valuable if it's shared publicly, so that more people can benefit from
@@ -47,9 +47,9 @@ cdef class Tokenizer:
     `infix_finditer` (callable): A function matching the signature of
         `re.compile(string).finditer` to find infixes.
     token_match (callable): A boolean function matching strings to be
-        recognised as tokens.
+        recognized as tokens.
     url_match (callable): A boolean function matching strings to be
-        recognised as tokens after considering prefixes and suffixes.
+        recognized as tokens after considering prefixes and suffixes.

     EXAMPLE:
         >>> tokenizer = Tokenizer(nlp.vocab)
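
For context on the `token_match` argument the docstring describes, a sketch that keeps hashtags together as single tokens. The pattern is an illustrative assumption; on older spaCy versions you may need to pass `token_match` to the `Tokenizer` constructor instead of assigning it:

```python
import re
import spacy

nlp = spacy.blank("en")
# "#" is a default prefix, so "#spaCy" normally splits into ["#", "spaCy"]
print([t.text for t in nlp("loving #spaCy")])

# token_match is checked before prefix/suffix splitting
nlp.tokenizer.token_match = re.compile(r"^#\w+$").match
print([t.text for t in nlp("loving #spaCy")])  # ['loving', '#spaCy']
```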
@@ -184,7 +184,7 @@ yourself. For details on how to get started with training your own model, check
 out the [training quickstart](/usage/training#quickstart).

 <!-- TODO:
-<Project id="en_core_bert">
+<Project id="en_core_trf_lg">

 The easiest way to get started is to clone a transformers-based project
 template. Swap in your data, edit the settings and hyperparameters and train,
@@ -169,7 +169,7 @@ $ python setup.py build_ext --inplace # compile spaCy

 Compared to regular install via pip, the
 [`requirements.txt`](https://github.com/explosion/spaCy/tree/master/requirements.txt)
 additionally installs developer dependencies such as Cython. See the
 [quickstart widget](#quickstart) to get the right commands for your platform and
 Python version.

@@ -368,7 +368,7 @@ from is called `spacy`. So, when using spaCy, never call anything else `spacy`.

 </Accordion>

-<Accordion title="NER model doesn't recognise other entities anymore after training" id="catastrophic-forgetting">
+<Accordion title="NER model doesn't recognize other entities anymore after training" id="catastrophic-forgetting">

 If your training data only contained new entities and you didn't mix in any
 examples the model previously recognized, it can cause the model to "forget"
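
The usual mitigation, which the accordion goes on to describe, is to mix examples the model already gets right back into the updates. A rough sketch of the idea, with made-up example data:

```python
import random

new_examples = [("fb is hiring", {"entities": [(0, 2, "ORG")]})]
revision_examples = [("Apple is looking at buying a startup",
                      {"entities": [(0, 5, "ORG")]})]
# Blend old, still-correct annotations with the new ones so the model
# doesn't overfit to the new label and "forget" what it knew
train_data = new_examples + revision_examples
random.shuffle(train_data)
```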
@@ -429,7 +429,7 @@ nlp = spacy.load("en_core_web_sm")
 doc = nlp("fb is hiring a new vice president of global policy")
 ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
 print('Before', ents)
-# the model didn't recognise "fb" as an entity :(
+# The model didn't recognize "fb" as an entity :(

 fb_ent = Span(doc, 0, 1, label="ORG") # create a Span for the new entity
 doc.ents = list(doc.ents) + [fb_ent]
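
The hunk cuts this example off after the `doc.ents` assignment; the natural continuation, mirroring the 'Before' check above, is a sketch like:

```python
# Verify the manually added entity is now part of doc.ents
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('After', ents)  # should now include ('fb', 0, 2, 'ORG')
```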
@@ -558,11 +558,11 @@ import spacy
 nlp = spacy.load("my_custom_el_model")
 doc = nlp("Ada Lovelace was born in London")

-# document level
+# Document level
 ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]
 print(ents) # [('Ada Lovelace', 'PERSON', 'Q7259'), ('London', 'GPE', 'Q84')]

-# token level
+# Token level
 ent_ada_0 = [doc[0].text, doc[0].ent_type_, doc[0].ent_kb_id_]
 ent_ada_1 = [doc[1].text, doc[1].ent_type_, doc[1].ent_kb_id_]
 ent_london_5 = [doc[5].text, doc[5].ent_type_, doc[5].ent_kb_id_]
@@ -914,12 +914,12 @@ from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
 from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
 from spacy.util import compile_infix_regex

-# default tokenizer
+# Default tokenizer
 nlp = spacy.load("en_core_web_sm")
 doc = nlp("mother-in-law")
 print([t.text for t in doc]) # ['mother', '-', 'in', '-', 'law']

-# modify tokenizer infix patterns
+# Modify tokenizer infix patterns
 infixes = (
     LIST_ELLIPSES
     + LIST_ICONS
@@ -929,8 +929,8 @@ infixes = (
             al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
         ),
         r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
-        # EDIT: commented out regex that splits on hyphens between letters:
-        #r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
+        # ✅ Commented out regex that splits on hyphens between letters:
+        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
         r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
     ]
 )
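
The diff ends before the example applies the modified patterns. Based on the `compile_infix_regex` import at the top of the snippet, a sketch of how it continues:

```python
# Compile the modified patterns and plug them into the tokenizer
infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
doc = nlp("mother-in-law")
print([t.text for t in doc])  # ['mother-in-law']
```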
@@ -108,11 +108,11 @@ class, or defined within a [model package](/usage/saving-loading#models).
 >
 > [components.tagger]
 > factory = "tagger"
-> # settings for the tagger component
+> # Settings for the tagger component
 >
 > [components.parser]
 > factory = "parser"
-> # settings for the parser component
+> # Settings for the parser component
 > ```

 When you load a model, spaCy first consults the model's
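
For orientation, the `[components.*]` blocks quoted above sit alongside a top-level `[nlp]` block in v3 configs. A minimal sketch, with most keys omitted:

```ini
[nlp]
lang = "en"
pipeline = ["tagger", "parser"]

[components.tagger]
factory = "tagger"
```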
@@ -171,11 +171,11 @@ lang = "en"
 pipeline = ["tagger", "parser", "ner"]
 data_path = "path/to/en_core_web_sm/en_core_web_sm-2.0.0"

-cls = spacy.util.get_lang_class(lang)  # 1. Get Language instance, e.g. English()
+cls = spacy.util.get_lang_class(lang)  # 1. Get Language class, e.g. English
 nlp = cls()                            # 2. Initialize it
 for name in pipeline:
     nlp.add_pipe(name)                 # 3. Add the component to the pipeline
 nlp.from_disk(model_data_path)         # 4. Load in the binary data
 ```

 When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
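
The four numbered steps above are what the high-level loader wraps, so the snippet is roughly equivalent to:

```python
import spacy

# spacy.load performs steps 1-4 internally
nlp = spacy.load("en_core_web_sm")
```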
@@ -187,9 +187,9 @@ which is then processed by the component next in the pipeline.

 ```python
 ### The pipeline under the hood
-doc = nlp.make_doc("This is a sentence")  # create a Doc from raw text
-for name, proc in nlp.pipeline:           # iterate over components in order
-    doc = proc(doc)                       # apply each component
+doc = nlp.make_doc("This is a sentence")  # Create a Doc from raw text
+for name, proc in nlp.pipeline:           # Iterate over components in order
+    doc = proc(doc)                       # Apply each component
 ```

 The current processing pipeline is available as `nlp.pipeline`, which returns a
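
A quick way to see this structure in practice, assuming a loaded `nlp` with the usual components:

```python
print(nlp.pipeline)    # [('tagger', <...>), ('parser', <...>), ('ner', <...>)]
print(nlp.pipe_names)  # ['tagger', 'parser', 'ner']
```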
@@ -473,7 +473,7 @@ only being able to modify it afterwards.
 >
 > @Language.component("my_component")
 > def my_component(doc):
->     # do something to the doc here
+>     # Do something to the doc here
 >     return doc
 > ```

@@ -511,21 +511,21 @@ from spacy.language import Language
 from spacy.matcher import Matcher
 from spacy.tokens import Token

-# We're using a component factory because the component needs to be initialized
-# with the shared vocab via the nlp object
+# We're using a component factory because the component needs to be
+# initialized with the shared vocab via the nlp object
 @Language.factory("html_merger")
 def create_bad_html_merger(nlp, name):
-    return BadHTMLMerger(nlp)
+    return BadHTMLMerger(nlp.vocab)

 class BadHTMLMerger:
-    def __init__(self, nlp):
+    def __init__(self, vocab):
         patterns = [
             [{"ORTH": "<"}, {"LOWER": "br"}, {"ORTH": ">"}],
             [{"ORTH": "<"}, {"LOWER": "br/"}, {"ORTH": ">"}],
         ]
         # Register a new token extension to flag bad HTML
         Token.set_extension("bad_html", default=False)
-        self.matcher = Matcher(nlp.vocab)
+        self.matcher = Matcher(vocab)
         self.matcher.add("BAD_HTML", patterns)

     def __call__(self, doc):
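
A short usage sketch for the factory defined above; it assumes the component's truncated `__call__` merges the matched HTML spans, and that `import spacy` plus an installed `en_core_web_sm` model are available:

```python
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("html_merger", last=True)  # Add component to the pipeline
doc = nlp("Hello<br>world! <br/> This is a test.")
for token in doc:
    print(token.text, token._.bad_html)
```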
@@ -792,7 +792,7 @@ you save the transformer outputs for later use.

 <!-- TODO:

-<Project id="en_core_bert">
+<Project id="en_core_trf_lg">

 Try out a BERT-based model pipeline using this project template: swap in your
 data, edit the settings and hyperparameters and train, evaluate, package and
@@ -66,7 +66,7 @@ menu:
 - **Architectures: ** [TransformerModel](/api/architectures#TransformerModel),
   [Tok2VecListener](/api/architectures#transformers-Tok2VecListener),
   [Tok2VecTransformer](/api/architectures#Tok2VecTransformer)
-- **Models:** [`en_core_bert_sm`](/models/en)
+- **Models:** [`en_core_trf_lg_sm`](/models/en)
 - **Implementation:**
   [`spacy-transformers`](https://github.com/explosion/spacy-transformers)

@@ -293,7 +293,8 @@ format for documenting argument and return types.

 - **Usage: ** [Embeddings & Transformers](/usage/embeddings-transformers),
   [Training models](/usage/training), [Projects](/usage/projects),
-  [Custom pipeline components](/usage/processing-pipelines#custom-components)
+  [Custom pipeline components](/usage/processing-pipelines#custom-components),
+  [Custom tokenizers](/usage/linguistic-features#custom-tokenizer)
 - **API Reference: ** [Library architecture](/api),
   [Model architectures](/api/architectures), [Data formats](/api/data-formats)
 - **New Classes: ** [`Example`](/api/example), [`Tok2Vec`](/api/tok2vec),
@@ -363,7 +363,7 @@ body [id]:target
         color: var(--color-red-medium)
         background: var(--color-red-transparent)

-    &.italic
+    &.italic, &.comment
         font-style: italic


@@ -384,9 +384,11 @@ body [id]:target
 // Settings for ini syntax (config files)
 [class*="language-ini"]
     color: var(--syntax-comment)
+    font-style: italic !important

     .token
         color: var(--color-subtle)
+        font-style: normal !important


 .gatsby-highlight-code-line
@@ -424,6 +426,7 @@ body [id]:target

 .cm-comment
     color: var(--syntax-comment)
+    font-style: italic

 .cm-keyword
     color: var(--syntax-keyword)