spaCy/spacy
Matthew Honnibal 4262f231c5 Fix conversion of older CoNLL parsing files
There are a billion "CoNLL" formats, depending on the tool producing
them. The Stanford v3.3 converter has a few quirks that the CoNLL-X
conversion wasn't handling:

* Sentences may be separated by extra whitespace between the newlines
* The coarse-grained POS is the same as the fine-grained POS, so we need
  a tag map to get the coarse-grained POS.
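
The two quirks above can be illustrated with a minimal sketch. This is not
the converter's actual code: the names `TAG_MAP`, `split_sentences` and
`read_tokens` are hypothetical, the tag map is a made-up fragment, and the
column layout assumed is the usual CoNLL-X order (ID, FORM, LEMMA, CPOSTAG,
POSTAG, ...).

```python
import re

# Hypothetical tag map fragment: fine-grained tag -> coarse-grained POS.
TAG_MAP = {"DT": "DET", "NN": "NOUN", "NNS": "NOUN", "VBZ": "VERB"}


def split_sentences(conll_text):
    """Split on blank lines, tolerating extra whitespace-only lines."""
    blocks = re.split(r"\n\s*\n", conll_text.strip())
    return [block for block in blocks if block.strip()]


def read_tokens(block, tag_map=None):
    """Return (form, coarse_pos, fine_pos) for each token line in a block."""
    tokens = []
    for line in block.strip().split("\n"):
        cols = line.split("\t")
        form, coarse, fine = cols[1], cols[3], cols[4]
        if tag_map is not None:
            # Overwrites the corpus coarse tag whenever the map has an entry.
            coarse = tag_map.get(fine, coarse)
        tokens.append((form, coarse, fine))
    return tokens


# Coarse and fine columns are identical, and the sentence break is padded
# with an extra whitespace-only line.
SAMPLE = (
    "1\tThe\t_\tDT\tDT\t_\t2\tdet\n"
    "2\tdog\t_\tNN\tNN\t_\t3\tnsubj\n"
    "3\tbarks\t_\tVBZ\tVBZ\t_\t0\troot\n"
    "\n   \n\n"
    "1\tCats\t_\tNNS\tNNS\t_\t2\tnsubj\n"
    "2\tsleep\t_\tVBP\tVBP\t_\t0\troot\n"
)

sentences = [read_tokens(block, TAG_MAP) for block in split_sentences(SAMPLE)]
```

Tags missing from the map (here `VBP`) keep whatever the corpus had in the
coarse column, which makes the overwrite behaviour easy to see side by side.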

Needing the tag map is particularly unfortunate; it feels like something
that should be patched in the source data. Adding the extra option may
confuse people, especially since it *overwrites* the corpus tag.
2020-09-12 18:20:18 +02:00
cli Fix conversion of older CoNLL parsing files 2020-09-12 18:20:18 +02:00
displacy Merge branch 'develop' into master-tmp 2020-09-04 13:15:36 +02:00
lang Remove unicode declarations and update language data 2020-09-04 13:19:16 +02:00
matcher Merge branch 'develop' into pr/6018 2020-09-04 15:54:49 +02:00
ml Merge pull request #6045 from svlandeg/feature/more-layers-docs [ci skip] 2020-09-09 21:46:40 +02:00
pipeline Remove simple_ner code (#6041) 2020-09-09 16:11:27 +02:00
tests string_to_list to parse comma-separated string into a list 2020-09-12 14:43:22 +02:00
tokens fix typo 2020-09-08 18:32:12 +02:00
training Renaming gold & annotation_setter (#6042) 2020-09-09 10:31:03 +02:00
__init__.pxd * Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags. 2014-10-24 02:23:42 +11:00
__init__.py Use frozen list with custom errors 2020-08-29 15:20:11 +02:00
__main__.py Tidy up 2020-06-22 00:45:40 +02:00
about.py Update default projects repo [ci skip] 2020-09-10 11:42:14 +02:00
attrs.pxd Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
attrs.pyx Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
compat.py Tidy up, autoformat, add types 2020-07-25 15:01:15 +02:00
default_config_pretraining.cfg Fix handling of optional [pretraining] block (#5954) 2020-08-24 15:56:03 +02:00
default_config.cfg Fix defaults 2020-09-08 15:31:21 +02:00
errors.py Renaming gold & annotation_setter (#6042) 2020-09-09 10:31:03 +02:00
glossary.py unicode -> str consistency 2020-05-24 17:20:58 +02:00
kb.pxd Define candidate generator in EL config (#5876) 2020-08-18 16:10:36 +02:00
kb.pyx Update docs links in codebase 2020-09-04 12:58:50 +02:00
language.py prevent overwriting score_weights 2020-09-11 15:12:05 +02:00
lexeme.pxd Fix Lexeme.from_ptr 2020-08-10 16:43:37 +02:00
lexeme.pyx Update docs links in codebase 2020-09-04 12:58:50 +02:00
lookups.py Update docs links in codebase 2020-09-04 12:58:50 +02:00
morphology.pxd Add Lemmatizer and simplify related components (#5848) 2020-08-07 15:27:13 +02:00
morphology.pyx Add Lemmatizer and simplify related components (#5848) 2020-08-07 15:27:13 +02:00
parts_of_speech.pxd Add support for Universal Dependencies v2.0 2017-03-03 13:17:34 +01:00
parts_of_speech.pyx Drop Python 2.7 and 3.5 (#4828) 2019-12-22 01:53:56 +01:00
pipe_analysis.py Simplify pipe analysis 2020-08-01 13:40:06 +02:00
schemas.py Fix meta.json validation 2020-09-11 11:38:33 +02:00
scorer.py Renaming gold & annotation_setter (#6042) 2020-09-09 10:31:03 +02:00
strings.pxd Remove 'cleanup' of strings (#6007) 2020-09-01 16:12:15 +02:00
strings.pyx Update docs links in codebase 2020-09-04 12:58:50 +02:00
structs.pxd Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
symbols.pxd Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
symbols.pyx Remove PRON_LEMMA symbol (#5968) 2020-08-25 14:21:29 +02:00
tokenizer.pxd Simplify specials and cache checks (#6012) 2020-09-03 09:42:49 +02:00
tokenizer.pyx Renaming gold & annotation_setter (#6042) 2020-09-09 10:31:03 +02:00
typedefs.pxd Update spaCy for thinc 8.0.0 (#4920) 2020-01-29 17:06:46 +01:00
typedefs.pyx Tidy up rest 2017-10-27 21:07:59 +02:00
util.py WIP: fix project clone compatibility 2020-09-10 15:49:13 +02:00
vectors.pyx Update docs links in codebase 2020-09-04 12:58:50 +02:00
vocab.pxd Tidy up and move noun_chunks, token_match, url_match 2020-07-22 22:18:46 +02:00
vocab.pyx Update docs links in codebase 2020-09-04 12:58:50 +02:00