spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-03-06 12:51:26 +03:00

adrianeboyd a365359b36 Add convert CLI option to merge CoNLL-U subtokens (#4722)		2020-01-29 17:44:25 +01:00

* Add convert CLI option to merge CoNLL-U subtokens

Add `-T` option to convert CLI that merges CoNLL-U subtokens into one token in the converted data. Each CoNLL-U sentence is read into a `Doc` and the `Retokenizer` is used to merge subtokens with features as follows:

* `orth` is the merged token orth (should correspond to raw text and `# text`)
* `tag` is all subtoken tags concatenated with `_`, e.g. `ADP_DET`
* `pos` is the POS of the syntactic root of the span (as determined by the Retokenizer)
* `morph` is all morphological features merged
* `lemma` is all subtoken lemmas concatenated with ` `, e.g. `de o`
* with `-m` all morphological features are combined with the tag using the separator `__`, e.g. `ADP_DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art`
* `dep` is the dependency relation for the syntactic root of the span (as determined by the Retokenizer)

Concatenated tags will be mapped to the UD POS of the syntactic root (e.g., `ADP`) and the morphological features will be the combined features.

In many cases, the original UD subtokens can be reconstructed from the available features given a language-specific lookup table, e.g., Portuguese `do / ADP_DET / Definite=Def|Gender=Masc|Number=Sing|PronType=Art` is `de / ADP`, `o / DET / Definite=Def|Gender=Masc|Number=Sing|PronType=Art` or lookup rules for forms containing open class words like Spanish `hablarlo / VERB_PRON / Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|VerbForm=Inf`.

* Clean up imports
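
The merge rules listed in the commit message above can be illustrated with a small sketch in plain Python. This is not the converter's actual code: the `SubToken` container, the `merge_morph` and `merge_subtokens` helpers, and the `root_index` and `append_morphology` parameters are all hypothetical names introduced here. In the real converter the `Retokenizer` determines the syntactic root; the sketch simply takes the root position as an argument, and `dep` is left out because it would be copied from the root in the same way as `pos`.

```python
from dataclasses import dataclass


@dataclass
class SubToken:
    """One CoNLL-U subtoken (hypothetical container, not a spaCy class)."""
    orth: str
    lemma: str
    tag: str
    pos: str
    morph: str  # UD FEATS string such as "Gender=Masc|Number=Sing", or "_"


def merge_morph(feat_strings):
    """Combine several UD FEATS strings into one sorted Key=Value string."""
    merged = {}
    for feat_string in feat_strings:
        if feat_string in ("", "_"):  # "_" marks an empty FEATS column
            continue
        for pair in feat_string.split("|"):
            key, value = pair.split("=", 1)
            merged.setdefault(key, set()).update(value.split(","))
    items = sorted(merged.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{k}={','.join(sorted(v))}" for k, v in items)


def merge_subtokens(subtokens, surface, root_index, append_morphology=False):
    """Apply the merge rules from the commit message to a list of subtokens.

    root_index stands in for the syntactic root that the Retokenizer
    would pick in the real converter (an assumption of this sketch).
    """
    tag = "_".join(t.tag for t in subtokens)
    morph = merge_morph(t.morph for t in subtokens)
    if append_morphology and morph:  # mirrors the `-m` behaviour
        tag = f"{tag}__{morph}"
    return {
        "orth": surface,
        "tag": tag,
        "pos": subtokens[root_index].pos,
        "morph": morph,
        "lemma": " ".join(t.lemma for t in subtokens),
    }


# Portuguese "do" = "de" + "o", the example used in the commit message.
subtokens = [
    SubToken("de", "de", "ADP", "ADP", "_"),
    SubToken("o", "o", "DET", "DET",
             "Definite=Def|Gender=Masc|Number=Sing|PronType=Art"),
]
merged = merge_subtokens(subtokens, surface="do", root_index=0)
merged_with_morph = merge_subtokens(
    subtokens, surface="do", root_index=0, append_morphology=True
)
print(merged["tag"], "/", merged["lemma"], "/", merged["pos"])
print(merged_with_morph["tag"])
```

Passing `root_index=0` here is chosen only so that the merged `pos` matches the `ADP` example given in the commit message.
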
cli	Add convert CLI option to merge CoNLL-U subtokens (#4722)	2020-01-29 17:44:25 +01:00
data	Make spacy/data a package	2017-03-18 20:04:22 +01:00
displacy	Drop Python 2.7 and 3.5 (#4828)	2019-12-22 01:53:56 +01:00
lang	Modify morphology to support arbitrary features (#4932)	2020-01-23 22:01:54 +01:00
matcher	Add better schemas and validation using Pydantic (#4831)	2019-12-25 12:39:49 +01:00
ml	Update spaCy for thinc 8.0.0 (#4920)	2020-01-29 17:06:46 +01:00
pipeline	Update spaCy for thinc 8.0.0 (#4920)	2020-01-29 17:06:46 +01:00
syntax	Update spaCy for thinc 8.0.0 (#4920)	2020-01-29 17:06:46 +01:00
tests	Add convert CLI option to merge CoNLL-U subtokens (#4722)	2020-01-29 17:44:25 +01:00
tokens	Update spaCy for thinc 8.0.0 (#4920)	2020-01-29 17:06:46 +01:00
__init__.pxd	* Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags.	2014-10-24 02:23:42 +11:00
__init__.py	Update spaCy for thinc 8.0.0 (#4920)	2020-01-29 17:06:46 +01:00
__main__.py	Update spaCy for thinc 8.0.0 (#4920)	2020-01-29 17:06:46 +01:00
_ml.py	Update spaCy for thinc 8.0.0 (#4920)	2020-01-29 17:06:46 +01:00
about.py	Update version [ci skip]	2019-11-21 18:19:37 +01:00
analysis.py	Drop Python 2.7 and 3.5 (#4828)	2019-12-22 01:53:56 +01:00
attrs.pxd	Fix attrs alignment	2019-07-12 17:59:47 +02:00
attrs.pyx	Drop Python 2.7 and 3.5 (#4828)	2019-12-22 01:53:56 +01:00
compat.py	Update spaCy for thinc 8.0.0 (#4920)	2020-01-29 17:06:46 +01:00
errors.py	Modify morphology to support arbitrary features (#4932)	2020-01-23 22:01:54 +01:00
glossary.py	Drop Python 2.7 and 3.5 (#4828)	2019-12-22 01:53:56 +01:00
gold.pxd	Add support for pos/morphs/lemmas in training data (#4941)	2020-01-28 11:36:29 +01:00
gold.pyx	Add support for pos/morphs/lemmas in training data (#4941)	2020-01-28 11:36:29 +01:00
kb.pxd	rename entity frequency	2019-07-19 17:40:28 +02:00
kb.pyx	More formatting changes	2019-12-25 17:59:52 +01:00
language.py	Update spaCy for thinc 8.0.0 (#4920)	2020-01-29 17:06:46 +01:00
lemmatizer.py	Drop Python 2.7 and 3.5 (#4828)	2019-12-22 01:53:56 +01:00
lexeme.pxd	💫 Support lexical attributes in retokenizer attrs (closes #2390) (#3325)	2019-02-24 21:13:51 +01:00
lexeme.pyx	Update spaCy for thinc 8.0.0 (#4920)	2020-01-29 17:06:46 +01:00
lookups.py	Drop Python 2.7 and 3.5 (#4828)	2019-12-22 01:53:56 +01:00
morphology.pxd	Modify morphology to support arbitrary features (#4932)	2020-01-23 22:01:54 +01:00
morphology.pyx	Modify morphology to support arbitrary features (#4932)	2020-01-23 22:01:54 +01:00
parts_of_speech.pxd	Add support for Universal Dependencies v2.0	2017-03-03 13:17:34 +01:00
parts_of_speech.pyx	Drop Python 2.7 and 3.5 (#4828)	2019-12-22 01:53:56 +01:00
schemas.py	Add better schemas and validation using Pydantic (#4831)	2019-12-25 12:39:49 +01:00
scorer.py	Drop Python 2.7 and 3.5 (#4828)	2019-12-22 01:53:56 +01:00
strings.pxd	Try to fix StringStore clean up (see #1506)	2017-11-11 03:11:27 +03:00
strings.pyx	Drop Python 2.7 and 3.5 (#4828)	2019-12-22 01:53:56 +01:00
structs.pxd	Modify morphology to support arbitrary features (#4932)	2020-01-23 22:01:54 +01:00
symbols.pxd	Modify morphology to support arbitrary features (#4932)	2020-01-23 22:01:54 +01:00
symbols.pyx	Modify morphology to support arbitrary features (#4932)	2020-01-23 22:01:54 +01:00
tokenizer.pxd	Generalize handling of tokenizer special cases (#4259)	2019-11-13 21:24:35 +01:00
tokenizer.pyx	Drop Python 2.7 and 3.5 (#4828)	2019-12-22 01:53:56 +01:00
typedefs.pxd	Update spaCy for thinc 8.0.0 (#4920)	2020-01-29 17:06:46 +01:00
typedefs.pyx	Tidy up rest	2017-10-27 21:07:59 +02:00
util.py	Update spaCy for thinc 8.0.0 (#4920)	2020-01-29 17:06:46 +01:00
vectors.pyx	Update spaCy for thinc 8.0.0 (#4920)	2020-01-29 17:06:46 +01:00
vocab.pxd	💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167)	2019-08-22 14:21:32 +02:00
vocab.pyx	Update spaCy for thinc 8.0.0 (#4920)	2020-01-29 17:06:46 +01:00