spaCy/spacy
Paul O'Leary McCann 6e9e686568 Sample implementation of Japanese Tagger (ref #1214)
This is far from complete but it should be enough to check some things.

1. MeCab transition. Janome doesn't support UniDic, only IPAdic, but the UD
tag mappings are based on UniDic. This replaces Janome with MeCab to
get around that.

2. Raw tag extension. A simple tag map can't meet the specifications for
UD tag mappings, so this adds an extra field to ambiguous cases. For
this demo it just deals with the simplest case, which only needs to look
at the literal token. (In reality it may be necessary to look at the
whole sentence, but that's another issue.)

3. General code structure. Seems nobody else has implemented a custom
Tagger yet, so still not sure this is the correct way to pass the
vocabulary around, for example.
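
To illustrate point 2, here is a rough sketch of the idea, not the code in this commit. The raw tag strings, the word list, and the names `TAG_MAP`, `resolve_tag` and `ud_pos` are all assumptions made up for the example. An ambiguous raw tag gets an extra field appended, chosen by looking only at the literal token, and the extended tag is then looked up in an ordinary tag map.

```python
# Hypothetical tag map from UniDic-style raw tags to UD POS tags.
# Ambiguous raw tags only appear here with an extra field appended.
TAG_MAP = {
    "名詞,普通名詞": "NOUN",
    "動詞,一般": "VERB",
    "連体詞,DET": "DET",
    "連体詞,ADJ": "ADJ",
}

# Raw tags that a plain one-to-one map cannot handle (illustrative).
AMBIGUOUS_TAGS = {"連体詞"}

# Literal tokens treated as determiners in this sketch (illustrative list).
DETERMINERS = {"この", "その", "あの", "どの"}


def resolve_tag(surface, raw_tag):
    """Append an extra field to an ambiguous raw tag based on the literal token."""
    if raw_tag in AMBIGUOUS_TAGS:
        extra = "DET" if surface in DETERMINERS else "ADJ"
        return raw_tag + "," + extra
    return raw_tag


def ud_pos(surface, raw_tag):
    """Look up the UD POS tag for a token after resolving ambiguity."""
    return TAG_MAP[resolve_tag(surface, raw_tag)]


if __name__ == "__main__":
    for surface, raw_tag in [("この", "連体詞"), ("大きな", "連体詞"), ("猫", "名詞,普通名詞")]:
        print(surface, ud_pos(surface, raw_tag))
```

Cases that need the whole sentence would not fit this shape, since `resolve_tag` only sees one token at a time.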

Any feedback would be greatly appreciated. -POLM
2017-08-08 01:27:15 +09:00
bn Merge pull request #885 from PySUST/master 2017-03-12 13:20:59 +01:00
cli Fixed typo in cli/package.py 2017-06-07 16:19:08 +02:00
data Make spacy/data a package 2017-03-18 20:04:22 +01:00
de Handle deprecated language-specific model downloading 2017-03-15 17:37:55 +01:00
en Fix typo in English tokenizer exceptions (resolves #1071) 2017-05-23 12:18:00 +02:00
es Update tokenizer_exceptions.py 2017-06-02 19:00:01 +02:00
fi Remove duplicate keys in [en|fi] data dicts 2017-03-19 11:40:29 +01:00
fr French NUM_WORDS and ORDINAL_WORDS 2017-06-28 14:11:20 +02:00
he add hebrew tokenizer 2017-03-24 18:27:44 +03:00
hu Use regex instead of re 2017-04-20 02:22:52 +03:00
it Use consistent unicode declarations 2017-03-12 13:07:28 +01:00
ja Sample implementation of Japanese Tagger (ref #1214) 2017-08-08 01:27:15 +09:00
language_data Add missing SP symbol to tag map, re #1052 2017-07-22 13:44:17 +02:00
munge * Fix Python3 problem in align_raw 2015-07-28 16:06:53 +02:00
nb Add newline 2017-04-27 11:15:41 +02:00
nl fix import of stop words in language data 2017-07-05 14:08:04 +02:00
pt Import and combine Portuguese tokenizer exceptions (see #943) 2017-04-01 10:37:42 +02:00
serialize Fix Issue #459 -- failed to deserialize empty doc. 2016-10-23 16:31:05 +02:00
sv Use consistent unicode declarations 2017-03-12 13:07:28 +01:00
syntax Default to English noun chunks iterator if no lang set 2017-07-22 14:15:25 +02:00
tests Sample implementation of Japanese Tagger (ref #1214) 2017-08-08 01:27:15 +09:00
tokens Fix Span.noun_chunks. Closes #1207 2017-07-22 14:14:57 +02:00
zh Update __init__.py 2017-07-01 13:12:00 +08:00
__init__.pxd * Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags. 2014-10-24 02:23:42 +11:00
__init__.py Add __version__ symbol in __init__.py 2017-07-22 13:45:21 +02:00
__main__.py Add more options to read in meta data in package command 2017-04-16 13:06:02 +02:00
about.py Increment version 2017-07-22 15:43:16 +02:00
attrs.pxd Whitespace 2016-12-18 16:51:40 +01:00
attrs.pyx Clean up imports, unused code, whitespace, docstrings 2017-04-15 12:05:47 +02:00
cfile.pxd Add hacky support for StringCFile, to make pickling easier. 2017-03-07 20:24:37 +01:00
cfile.pyx Clean up imports, unused code, whitespace, docstrings 2017-04-15 12:05:47 +02:00
compat.py Simplify compat.fix_text 2017-04-23 21:06:50 +02:00
deprecated.py Rename about.__docs__ to about.__docs_models__ 2017-05-13 13:09:00 +02:00
glossary.py Fix formatting 2017-05-03 20:11:02 +02:00
gold.pxd Fix gold.pyx for 1.0 2016-11-25 08:57:59 -06:00
gold.pyx Fix training methods 2017-04-16 13:00:37 -05:00
language.py Create directory if missing in save_to_directory 2017-04-23 21:24:43 +02:00
lemmatizer.py Clean up imports, unused code, whitespace, docstrings 2017-04-15 12:05:47 +02:00
lexeme.pxd Remove stray .tensor attribute from Lexeme 2016-10-18 01:16:32 +02:00
lexeme.pyx Fix gaps in Lexeme API. Closes #1031 2017-07-22 13:53:48 +02:00
matcher.pyx Fix json imports and use ujson 2017-04-15 12:13:34 +02:00
morphology.pxd Add support for Universal Dependencies v2.0 2017-03-03 13:17:34 +01:00
morphology.pyx Clean up imports, unused code, whitespace, docstrings 2017-04-15 12:05:47 +02:00
orth.pxd remove text-unidecode dependency 2016-02-24 08:01:59 +01:00
orth.pyx Clean up imports, unused code, whitespace, docstrings 2017-04-15 12:05:47 +02:00
parts_of_speech.pxd Add support for Universal Dependencies v2.0 2017-03-03 13:17:34 +01:00
parts_of_speech.pyx Clean up imports, unused code, whitespace, docstrings 2017-04-15 12:05:47 +02:00
pipeline.pxd Add classes for beam parser and beam NER 2017-03-11 12:45:37 -06:00
pipeline.pyx Clean up imports, unused code, whitespace, docstrings 2017-04-15 12:05:47 +02:00
scorer.py Clean up imports, unused code, whitespace, docstrings 2017-04-15 12:05:47 +02:00
strings.pxd Update strings.pxd 2016-10-24 14:00:35 +02:00
strings.pyx Clean up imports, unused code, whitespace, docstrings 2017-04-15 12:05:47 +02:00
structs.pxd Initial, limited support for quantified patterns in Matcher, and tracking of ent_id attribute in Token and Span. The quantifiers need a lot more testing, and there are some known problems. The main known problem is that the zero-plus and one-plus quantifiers won't work if a token can match both the quantified pattern expression AND the tail of the match. 2016-09-21 14:54:55 +02:00
symbols.pxd Add support for Universal Dependencies v2.0 2017-03-03 13:17:34 +01:00
symbols.pyx Clean up imports, unused code, whitespace, docstrings 2017-04-15 12:05:47 +02:00
tagger.pxd Add cfg field to Tagger 2016-10-17 01:03:41 +02:00
tagger.pyx Fix json imports and use ujson 2017-04-15 12:13:34 +02:00
tokenizer.pxd Revert "Revert "Merge remote-tracking branch 'origin/master'"" 2017-01-09 13:28:13 +01:00
tokenizer.pyx Add flush_cache method to tokenizer, to fix #1061 2017-07-22 15:06:50 +02:00
train.py Clean up imports, unused code, whitespace, docstrings 2017-04-15 12:05:47 +02:00
typedefs.pxd Revert "Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good." 2016-09-30 20:20:22 +02:00
typedefs.pyx * Move POS tag definitions to parts_of_speech.pxd 2015-01-25 16:31:07 +11:00
util.py Use regex instead of re 2017-04-20 02:22:52 +03:00
vocab.pxd Revert "Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good." 2016-09-30 20:20:22 +02:00
vocab.pyx Fix json imports and use ujson 2017-04-15 12:13:34 +02:00