Auto-format [ci skip]
parent 691e0088cf
commit 198b7e9789
@@ -186,63 +186,63 @@ The German part-of-speech tagger uses the
annotation scheme. We also map the tags to the simpler Google Universal POS tag
set.

| Tag | POS | Morphology | Description |
| --------- | ------- | ---------------------------------------- | ------------------------------------------------- |
| `$(` | `PUNCT` | `PunctType=brck` | other sentence-internal punctuation mark |
| `$,` | `PUNCT` | `PunctType=comm` | comma |
| `$.` | `PUNCT` | `PunctType=peri` | sentence-final punctuation mark |
| `ADJA` | `ADJ` | | adjective, attributive |
| `ADJD` | `ADJ` | `Variant=short` | adjective, adverbial or predicative |
| `ADV` | `ADV` | | adverb |
| `APPO` | `ADP` | `AdpType=post` | postposition |
| `APPR` | `ADP` | `AdpType=prep` | preposition; circumposition left |
| `APPRART` | `ADP` | `AdpType=prep PronType=art` | preposition with article |
| `APZR` | `ADP` | `AdpType=circ` | circumposition right |
| `ART` | `DET` | `PronType=art` | definite or indefinite article |
| `CARD` | `NUM` | `NumType=card` | cardinal number |
| `FM` | `X` | `Foreign=yes` | foreign language material |
| `ITJ` | `INTJ` | | interjection |
| `KOKOM` | `CONJ` | `ConjType=comp` | comparative conjunction |
| `KON` | `CONJ` | | coordinate conjunction |
| `KOUI` | `SCONJ` | | subordinate conjunction with "zu" and infinitive |
| `KOUS` | `SCONJ` | | subordinate conjunction with sentence |
| `NE` | `PROPN` | | proper noun |
| `NNE` | `PROPN` | | proper noun |
| `NN` | `NOUN` | | noun, singular or mass |
| `PROAV` | `ADV` | `PronType=dem` | pronominal adverb |
| `PDAT` | `DET` | `PronType=dem` | attributive demonstrative pronoun |
| `PDS` | `PRON` | `PronType=dem` | substituting demonstrative pronoun |
| `PIAT` | `DET` | `PronType=ind\|neg\|tot` | attributive indefinite pronoun without determiner |
| `PIS` | `PRON` | `PronType=ind\|neg\|tot` | substituting indefinite pronoun |
| `PPER` | `PRON` | `PronType=prs` | non-reflexive personal pronoun |
| `PPOSAT` | `DET` | `Poss=yes PronType=prs` | attributive possessive pronoun |
| `PPOSS` | `PRON` | `PronType=rel` | substituting possessive pronoun |
| `PRELAT` | `DET` | `PronType=rel` | attributive relative pronoun |
| `PRELS` | `PRON` | `PronType=rel` | substituting relative pronoun |
| `PRF` | `PRON` | `PronType=prs Reflex=yes` | reflexive personal pronoun |
| `PTKA` | `PART` | | particle with adjective or adverb |
| `PTKANT` | `PART` | `PartType=res` | answer particle |
| `PTKNEG` | `PART` | `Negative=yes` | negative particle |
| `PTKVZ` | `PART` | `PartType=vbp` | separable verbal particle |
| `PTKZU` | `PART` | `PartType=inf` | "zu" before infinitive |
| `PWAT` | `DET` | `PronType=int` | attributive interrogative pronoun |
| `PWAV` | `ADV` | `PronType=int` | adverbial interrogative or relative pronoun |
| `PWS` | `PRON` | `PronType=int` | substituting interrogative pronoun |
| `TRUNC` | `X` | `Hyph=yes` | word remnant |
| `VAFIN` | `AUX` | `Mood=ind VerbForm=fin` | finite verb, auxiliary |
| `VAIMP` | `AUX` | `Mood=imp VerbForm=fin` | imperative, auxiliary |
| `VAINF` | `AUX` | `VerbForm=inf` | infinitive, auxiliary |
| `VAPP` | `AUX` | `Aspect=perf VerbForm=fin` | perfect participle, auxiliary |
| `VMFIN` | `VERB` | `Mood=ind VerbForm=fin VerbType=mod` | finite verb, modal |
| `VMINF` | `VERB` | `VerbForm=fin VerbType=mod` | infinitive, modal |
| `VMPP` | `VERB` | `Aspect=perf VerbForm=part VerbType=mod` | perfect participle, modal |
| `VVFIN` | `VERB` | `Mood=ind VerbForm=fin` | finite verb, full |
| `VVIMP` | `VERB` | `Mood=imp VerbForm=fin` | imperative, full |
| `VVINF` | `VERB` | `VerbForm=inf` | infinitive, full |
| `VVIZU` | `VERB` | `VerbForm=inf` | infinitive with "zu", full |
| `VVPP` | `VERB` | `Aspect=perf VerbForm=part` | perfect participle, full |
| `XY` | `X` | | non-word containing non-letter |
| `SP` | `SPACE` | | space |

</Accordion>
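
The tag table can be read as a lookup from a fine-grained TIGER tag to a coarse Universal POS tag. The following is a minimal sketch of that lookup, assuming a hand-copied subset of the rows; `TAG_TO_POS` and `coarse_pos` are illustrative names and are not part of spaCy, and the tagged sentence is made up for the example.

```python
# Hand-copied subset of the tag table; this is NOT spaCy's own tag map object.
TAG_TO_POS = {
    "ADJA": "ADJ",
    "ADJD": "ADJ",
    "APPR": "ADP",
    "ART": "DET",
    "NE": "PROPN",
    "NN": "NOUN",
    "PPER": "PRON",
    "VAFIN": "AUX",
    "VVFIN": "VERB",
    "$.": "PUNCT",
}


def coarse_pos(tag: str) -> str:
    """Return the coarse POS for a fine-grained tag, or "X" if it is not listed."""
    return TAG_TO_POS.get(tag, "X")


# Hypothetical, hand-tagged sentence: "Der Hund schläft ."
tagged = [("Der", "ART"), ("Hund", "NN"), ("schläft", "VVFIN"), (".", "$.")]
coarse = [(word, coarse_pos(tag)) for word, tag in tagged]
print(coarse)
```

In spaCy itself, the fine-grained tag of a token is available as `token.tag_` and the coarse tag as `token.pos_`.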

@@ -379,51 +379,51 @@ The German dependency labels use the
[TIGER Treebank](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/index.html)
annotation scheme.

| Label | Description |
| ------- | ------------------------------- |
| `ac` | adpositional case marker |
| `adc` | adjective component |
| `ag` | genitive attribute |
| `ams` | measure argument of adjective |
| `app` | apposition |
| `avc` | adverbial phrase component |
| `cc` | comparative complement |
| `cd` | coordinating conjunction |
| `cj` | conjunct |
| `cm` | comparative conjunction |
| `cp` | complementizer |
| `cvc` | collocational verb construction |
| `da` | dative |
| `dm` | discourse marker |
| `ep` | expletive es |
| `ju` | junctor |
| `mnr` | postnominal modifier |
| `mo` | modifier |
| `ng` | negation |
| `nk` | noun kernel element |
| `nmc` | numerical component |
| `oa` | accusative object |
| `oa2` | second accusative object |
| `oc` | clausal object |
| `og` | genitive object |
| `op` | prepositional object |
| `par` | parenthetical element |
| `pd` | predicate |
| `pg` | phrasal genitive |
| `ph` | placeholder |
| `pm` | morphological particle |
| `pnc` | proper noun component |
| `punct` | punctuation |
| `rc` | relative clause |
| `re` | repeated element |
| `rs` | reported speech |
| `sb` | subject |
| `sbp` | passivized subject (PP) |
| `sp` | subject or predicate |
| `svp` | separable verb prefix |
| `uc` | unit component |
| `vo` | vocative |
| `ROOT` | root |

</Accordion>
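
As a rough illustration of how these labels are used, the sketch below picks the subject (`sb`) and accusative object (`oa`) attached to the root of a sentence. The `(token, label, head)` triples are hand-written for the example rather than parser output, and `children_with_label` is a hypothetical helper, not a spaCy function.

```python
# Hypothetical analysis of "Der Hund jagt die Katze" as (token, label, head) triples.
# Labels come from the dependency table; the analysis itself is hand-written.
parse = [
    ("Der", "nk", "Hund"),
    ("Hund", "sb", "jagt"),
    ("jagt", "ROOT", "jagt"),
    ("die", "nk", "Katze"),
    ("Katze", "oa", "jagt"),
]


def children_with_label(triples, head, label):
    """Return the tokens attached to `head` with the given dependency label."""
    return [tok for tok, dep, h in triples if h == head and dep == label and tok != head]


root = next(tok for tok, dep, _ in parse if dep == "ROOT")
subjects = children_with_label(parse, root, "sb")
objects = children_with_label(parse, root, "oa")
print(root, subjects, objects)
```

With a parsed spaCy `Doc`, the same information is read from `token.dep_` and `token.head`.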

@@ -174,12 +174,12 @@ All output files generated by this command are compatible with

<!-- TODO: document jsonl option – maybe update it? -->

| ID | Description |
| ------------------------------ | --------------------------------------------------------------- |
| `auto` | Automatically pick converter based on file extension and file content (default). |
| `conll`, `conllu`, `conllubio` | Universal Dependencies `.conllu` or `.conll` format. |
| `ner` | NER with IOB/IOB2 tags, one token per line with columns separated by whitespace. The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). |
| `iob` | NER with IOB/IOB2 tags, one sentence per line with tokens separated by whitespace and annotation separated by `\|`, either `word\|B-ENT` or `word\|POS\|B-ENT`. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). |

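To make the `iob` layout concrete, here is a minimal sketch of a parser for one sentence line. It is not spaCy's converter; `parse_iob_line` is a hypothetical helper, the sample sentences are invented, and the sketch assumes that no token contains a literal `|`.

```python
def parse_iob_line(line):
    """Split one sentence line into (word, pos, iob_tag) tuples.

    Tokens are separated by whitespace and fields by "|".
    For the two-field form (word|B-ENT) the POS slot is None.
    """
    tokens = []
    for item in line.split():
        fields = item.split("|")
        if len(fields) == 2:
            word, iob = fields
            pos = None
        elif len(fields) == 3:
            word, pos, iob = fields
        else:
            raise ValueError(f"expected word|B-ENT or word|POS|B-ENT, got {item!r}")
        tokens.append((word, pos, iob))
    return tokens


# Invented sample lines, one per supported layout.
two_col = parse_iob_line("Angela|B-PER Merkel|I-PER spricht|O")
three_col = parse_iob_line("Berlin|NE|B-LOC ist|VAFIN|O")
print(two_col)
print(three_col)
```
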
## Train {#train}

@@ -291,26 +291,26 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir]
[--seed] [--n-iter] [--use-vectors] [--n-save-every] [--init-tok2vec] [--epoch-start]
```

| Argument | Type | Description |
| ----------------------------------------------------- | ---------- | --------------------------------------------------------------------------------------------------------------------------------- |
| `texts_loc` | positional | Path to JSONL file with raw texts to learn from, with text provided as the key `"text"` or tokens as the key `"tokens"`. [See here](#pretrain-jsonl) for details. |
| `vectors_model` | positional | Name or path to spaCy model with vectors to learn from. |
| `output_dir` | positional | Directory to write models to on each epoch. |
| `--width`, `-cw` | option | Width of CNN layers. |
| `--depth`, `-cd` | option | Depth of CNN layers. |
| `--embed-rows`, `-er` | option | Number of embedding rows. |
| `--loss-func`, `-L` | option | Loss function to use for the objective. Either `"L2"` or `"cosine"`. |
| `--dropout`, `-d` | option | Dropout rate. |
| `--batch-size`, `-bs` | option | Number of words per training batch. |
| `--max-length`, `-xw` | option | Maximum words per example. Longer examples are discarded. |
| `--min-length`, `-nw` | option | Minimum words per example. Shorter examples are discarded. |
| `--seed`, `-s` | option | Seed for random number generators. |
| `--n-iter`, `-i` | option | Number of iterations to pretrain. |
| `--use-vectors`, `-uv` | flag | Whether to use the static vectors as input features. |
| `--n-save-every`, `-se` | option | Save model every X batches. |
| `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. |
| `--epoch-start`, `-es` <Tag variant="new">2.1.5</Tag> | option | The epoch to start counting at. Only relevant when using `--init-tok2vec` and the given weight file has been renamed. Prevents unintended overwriting of existing weight files. |
| **CREATES** | weights | The pre-trained weights that can be used to initialize `spacy train`. |

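As a sketch of how the arguments in this table combine, the snippet below assembles a `spacy pretrain` command line in Python without running it. The paths, the model name and the option values are placeholders chosen for illustration; only the flags listed in the table are used.

```python
import shlex
import sys

# Placeholder inputs, chosen only for illustration.
texts_loc = "/path/to/text.jsonl"
vectors_model = "en_vectors_web_lg"
output_dir = "/path/to/output"

cmd = [
    sys.executable, "-m", "spacy", "pretrain",
    texts_loc, vectors_model, output_dir,  # the three positional arguments
    "--n-iter", "10",                      # option: takes a value
    "--dropout", "0.2",                    # option: takes a value
    "--use-vectors",                       # flag: takes no value
]

# Show the command as it would be typed in a shell.
print(" ".join(shlex.quote(part) for part in cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list keeps each argument separate, so paths containing spaces do not need manual quoting.
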
### JSONL format for raw text {#pretrain-jsonl}

@@ -330,10 +330,10 @@ tokenization can be provided.
> srsly.write_jsonl("/path/to/text.jsonl", data)
> ```

| Key | Type | Description |
| -------- | ------- | ---------------------------------------------------------- |
| `text` | unicode | The raw input text. Not required if `tokens` is available. |
| `tokens` | list | Optional tokenization, one string per token. |

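The two keys in this table can be illustrated with a short round trip. This sketch swaps in the standard-library `json` module instead of `srsly`, writes to a temporary directory, and uses invented records; each line of the file holds one JSON object with either a `text` or a `tokens` key.

```python
import json
import os
import tempfile

# Invented records: the first gives raw text, the second only a tokenization.
data = [
    {"text": "Some text"},
    {"tokens": ["Some", "more", "text"]},
]

# Temporary location so the sketch does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "text.jsonl")
with open(path, "w", encoding="utf8") as f:
    for record in data:
        f.write(json.dumps(record) + "\n")  # one JSON object per line

# Read the file back, one record per line.
with open(path, encoding="utf8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded)
```
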
```json
### Example

@@ -424,7 +424,7 @@ pip install dist/en_model-0.0.0.tar.gz
| `input_dir` | positional | Path to directory containing model data. |
| `output_dir` | positional | Directory to create package folder in. |
| `--meta-path`, `-m` <Tag variant="new">2</Tag> | option | Path to `meta.json` file (optional). |
| `--create-meta`, `-c` <Tag variant="new">2</Tag> | flag | Create a `meta.json` file on the command line, even if one already exists in the directory. If an existing file is found, its entries will be shown as the defaults in the command line prompt. |
| `--force`, `-f` | flag | Force overwriting of existing folder in output directory. |
| `--help`, `-h` | flag | Show help message and available arguments. |
| **CREATES** | directory | A Python package containing the spaCy model. |