spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-02-03 05:56:15 +03:00

Paul O'Leary McCann 6e9e686568 Sample implementation of Japanese Tagger (ref #1214) This is far from complete but it should be enough to check some things. 1. Mecab transition. Janome doesn't support Unidic, only IPAdic, but UD tag mappings are based on Unidic. This replaces Janome with Mecab to get around that. 2. Raw tag extension. A simple tag map can't meet the specifications for UD tag mappings, so this adds an extra field to ambiguous cases. For this demo it just deals with the simplest case, which only needs to look at the literal token. (In reality it may be necessary to look at the whole sentence, but that's another issue.) 3. General code structure. Seems nobody else has implemented a custom Tagger yet, so still not sure this is the correct way to pass the vocabulary around, for example. Any feedback would be greatly appreciated. -POLM		2017-08-08 01:27:15 +09:00
bn	Merge pull request #885 from PySUST/master	2017-03-12 13:20:59 +01:00
cli	Fixed typo in cli/package.py	2017-06-07 16:19:08 +02:00
data	Make spacy/data a package	2017-03-18 20:04:22 +01:00
de	Handle deprecated language-specific model downloading	2017-03-15 17:37:55 +01:00
en	Fix typo in English tokenizer exceptions (resolves #1071)	2017-05-23 12:18:00 +02:00
es	Update tokenizer_exceptions.py	2017-06-02 19:00:01 +02:00
fi	Remove duplicate keys in [en|fi] data dicts	2017-03-19 11:40:29 +01:00
fr	French NUM_WORDS and ORDINAL_WORDS	2017-06-28 14:11:20 +02:00
he	add hebrew tokenizer	2017-03-24 18:27:44 +03:00
hu	Use `regex` instead of `re`	2017-04-20 02:22:52 +03:00
it	Use consistent unicode declarations	2017-03-12 13:07:28 +01:00
ja	Sample implementation of Japanese Tagger (ref #1214)	2017-08-08 01:27:15 +09:00
language_data	Add missing SP symbol to tag map, re #1052	2017-07-22 13:44:17 +02:00
munge	* Fix Python3 problem in align_raw	2015-07-28 16:06:53 +02:00
nb	Add newline	2017-04-27 11:15:41 +02:00
nl	fix import of stop words in language data	2017-07-05 14:08:04 +02:00
pt	Import and combine Portuguese tokenizer exceptions (see #943)	2017-04-01 10:37:42 +02:00
serialize	Fix Issue #459 -- failed to deserialize empty doc.	2016-10-23 16:31:05 +02:00
sv	Use consistent unicode declarations	2017-03-12 13:07:28 +01:00
syntax	Default to English noun chunks iterator if no lang set	2017-07-22 14:15:25 +02:00
tests	Sample implementation of Japanese Tagger (ref #1214)	2017-08-08 01:27:15 +09:00
tokens	Fix Span.noun_chunks. Closes #1207	2017-07-22 14:14:57 +02:00
zh	Update __init__.py	2017-07-01 13:12:00 +08:00
__init__.pxd	* Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags.	2014-10-24 02:23:42 +11:00
__init__.py	Add __version__ symbol in __init__.py	2017-07-22 13:45:21 +02:00
__main__.py	Add more options to read in meta data in package command	2017-04-16 13:06:02 +02:00
about.py	Increment version	2017-07-22 15:43:16 +02:00
attrs.pxd	Whitespace	2016-12-18 16:51:40 +01:00
attrs.pyx	Clean up imports, unused code, whitespace, docstrings	2017-04-15 12:05:47 +02:00
cfile.pxd	Add hacky support for StringCFile, to make pickling easier.	2017-03-07 20:24:37 +01:00
cfile.pyx	Clean up imports, unused code, whitespace, docstrings	2017-04-15 12:05:47 +02:00
compat.py	Simplify compat.fix_text	2017-04-23 21:06:50 +02:00
deprecated.py	Rename about.__docs__ to about.__docs_models__	2017-05-13 13:09:00 +02:00
glossary.py	Fix formatting	2017-05-03 20:11:02 +02:00
gold.pxd	Fix gold.pyx for 1.0	2016-11-25 08:57:59 -06:00
gold.pyx	Fix training methods	2017-04-16 13:00:37 -05:00
language.py	Create directory if missing in save_to_directory	2017-04-23 21:24:43 +02:00
lemmatizer.py	Clean up imports, unused code, whitespace, docstrings	2017-04-15 12:05:47 +02:00
lexeme.pxd	Remove stray .tensor attribute from Lexeme	2016-10-18 01:16:32 +02:00
lexeme.pyx	Fix gaps in Lexeme API. Closes #1031	2017-07-22 13:53:48 +02:00
matcher.pyx	Fix json imports and use ujson	2017-04-15 12:13:34 +02:00
morphology.pxd	Add support for Universal Dependencies v2.0	2017-03-03 13:17:34 +01:00
morphology.pyx	Clean up imports, unused code, whitespace, docstrings	2017-04-15 12:05:47 +02:00
orth.pxd	remove text-unidecode dependency	2016-02-24 08:01:59 +01:00
orth.pyx	Clean up imports, unused code, whitespace, docstrings	2017-04-15 12:05:47 +02:00
parts_of_speech.pxd	Add support for Universal Dependencies v2.0	2017-03-03 13:17:34 +01:00
parts_of_speech.pyx	Clean up imports, unused code, whitespace, docstrings	2017-04-15 12:05:47 +02:00
pipeline.pxd	Add classes for beam parser and beam NER	2017-03-11 12:45:37 -06:00
pipeline.pyx	Clean up imports, unused code, whitespace, docstrings	2017-04-15 12:05:47 +02:00
scorer.py	Clean up imports, unused code, whitespace, docstrings	2017-04-15 12:05:47 +02:00
strings.pxd	Update strings.pxd	2016-10-24 14:00:35 +02:00
strings.pyx	Clean up imports, unused code, whitespace, docstrings	2017-04-15 12:05:47 +02:00
structs.pxd	Initial, limited support for quantified patterns in Matcher, and tracking of ent_id attribute in Token and Span. The quantifiers need a lot more testing, and there are some known problems. The main known problem is that the zero-plus and one-plus quantifiers won't work if a token can match both the quantified pattern expression AND the tail of the match.	2016-09-21 14:54:55 +02:00
symbols.pxd	Add support for Universal Dependencies v2.0	2017-03-03 13:17:34 +01:00
symbols.pyx	Clean up imports, unused code, whitespace, docstrings	2017-04-15 12:05:47 +02:00
tagger.pxd	Add cfg field to Tagger	2016-10-17 01:03:41 +02:00
tagger.pyx	Fix json imports and use ujson	2017-04-15 12:13:34 +02:00
tokenizer.pxd	Revert "Revert "Merge remote-tracking branch 'origin/master'""	2017-01-09 13:28:13 +01:00
tokenizer.pyx	Add flush_cache method to tokenizer, to fix #1061	2017-07-22 15:06:50 +02:00
train.py	Clean up imports, unused code, whitespace, docstrings	2017-04-15 12:05:47 +02:00
typedefs.pxd	Revert "Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good."	2016-09-30 20:20:22 +02:00
typedefs.pyx	* Move POS tag definitions to parts_of_speech.pxd	2015-01-25 16:31:07 +11:00
util.py	Use `regex` instead of `re`	2017-04-20 02:22:52 +03:00
vocab.pxd	Revert "Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good."	2016-09-30 20:20:22 +02:00
vocab.pyx	Fix json imports and use ujson	2017-04-15 12:13:34 +02:00