* Fix german stop words
Two stop words ("einige" and "einigen") are sticking together.
Remove three nouns that may serve as stop words in a specific context (e.g. religious or news) but are not applicable for general use.
* Create Jan-711.md
* Rename `tag_map.py` to `tag_map_fine.py` to indicate that it's not the
default tag map
* Remove duplicate generic UD tag map and load `../tag_map.py` instead
* don't split on a colon. Colon is used to attach suffixes for abbreviations
* tokenize on any of LIST_HYPHENS (except a single hyphen), not just on --
* simplify infix rules by merging similar rules
* Add correct stopwords for Slovak language
* Add SNK Tags
* Disable formatting lint for TAGS
* Add example sentences for Slovak language
* Add slovak numerals in base form
* Add lex_attrs to sk init
* Add contributor agreement
* match domains longer than `hostname.domain.tld` like `www.foo.co.uk`
* expand allowed characters in domain names while only matching
lowercase TLDs so that "this.That" isn't matched as a URL and can be
split on the period as an infix (relevant for at least English, German,
and Tatar)
* Adding Support for Yoruba
* test text
* Updated test string.
* Fixing encoding declaration.
* Adding encoding to stop_words.py
* Added contributor agreement and removed iranlowo.
* Added removed test files and removed iranlowo to keep project bare.
* Returned CONTRIBUTING.md to default state.
* Added delted conftest entries
* Tidy up and auto-format
* Revert CONTRIBUTING.md
Co-authored-by: Ines Montani <ines@ines.io>
* Enable lex_attrs on Finnish
* Copy the Danish tokenizer rules to Finnish
Specifically, don't break hyphenated compound words
* Contributor agreement
* A new file for Finnish tokenizer rules instead of including the Danish ones
- added some tests for tokenization issues
- fixed some issues with tokenization of words with hyphen infix
- rewrote the "tokenizer_exceptions.py" file (stemming from the German version)
* Switch from mecab-python3 to fugashi
mecab-python3 has been the best MeCab binding for a long time but it's
not very actively maintained, and since it's based on old SWIG code
distributed with MeCab there's a limit to how effectively it can be
maintained.
Fugashi is a new Cython-based MeCab wrapper I wrote. Since it's not
based on the old SWIG code it's easier to keep it current and make small
deviations from the MeCab C/C++ API where that makes sense.
* Change mecab-python3 to fugashi in setup.cfg
* Change "mecab tags" to "unidic tags"
The tags come from MeCab, but the tag schema is specified by Unidic, so
it's more proper to refer to it that way.
* Update conftest
* Add fugashi link to external deps list for Japanese
* Rework Chinese language initialization
* Create a `ChineseTokenizer` class
* Modify jieba post-processing to handle whitespace correctly
* Modify non-jieba character tokenization to handle whitespace correctly
* Add a `create_tokenizer()` method to `ChineseDefaults`
* Load lexical attributes
* Update Chinese tag_map for UD v2
* Add very basic Chinese tests
* Test tokenization with and without jieba
* Test `like_num` attribute
* Fix try_jieba_import()
* Fix zh code formatting
* Update English tag_map
Update English tag_map based on this conversion table:
https://universaldependencies.org/tagset-conversion/en-penn-uposf.html
* Update German tag_map
Update German tag_map based on this conversion table:
https://universaldependencies.org/tagset-conversion/de-stts-uposf.html
* Add missing Tiger dependencies to glossary
* Add quotes to definition of TO
* Update POS/TAG tables in docs
Update POS/TAG tables for English and German docs using current
information generated from the tag_maps and GLOSSARY.
* Update warning that -PRON- is specific to English
* Revert docs to default JSON output with convert
* Revert "Revert docs to default JSON output with convert"
This reverts commit 6b78c048f1.
* Create syntax_iterators.py
Replica of spacy/lang/fr/syntax_iterators.py
* Added import statements for SYNTAX_ITERATORS
* Create gustavengstrom.md
* Added "dobj" to list of labels in noun_chunks method and a test_noun_chunks method to the Swedish language model.
* Delete README-checkpoint.md
Co-authored-by: Gustav <gustav@davcon.se>
Co-authored-by: Ines Montani <ines@ines.io>
* Move prefix and suffix detection for URL_PATTERN
Move prefix and suffix detection for `URL_PATTERN` into the tokenizer.
Remove associated lookahead and lookbehind from `URL_PATTERN`.
Fix tokenization for Hungarian given new modified handling of prefixes
and suffixes.
* Match a wider range of URI schemes