* match domains longer than `hostname.domain.tld` like `www.foo.co.uk`
* expand allowed characters in domain names while only matching
lowercase TLDs so that "this.That" isn't matched as a URL and can be
split on the period as an infix (relevant for at least English, German,
and Tatar)
* expand serialization test for custom token attribute
* add failing test for issue 4849
* define ENT_ID as attr and use in doc serialization
* fix a few typos
* label in span not writable anymore
* Revert "label in span not writable anymore"
This reverts commit ab442338c8.
* provide a friendlier error message when parsing a file fails
* Adding Support for Yoruba
* test text
* Updated test string.
* Fixing encoding declaration.
* Adding encoding to stop_words.py
* Added contributor agreement and removed iranlowo.
* Added removed test files and removed iranlowo to keep project bare.
* Returned CONTRIBUTING.md to default state.
* Added deleted conftest entries
* Tidy up and auto-format
* Revert CONTRIBUTING.md
Co-authored-by: Ines Montani <ines@ines.io>
Instead of a hard-coded NER tag simplification function that was only
intended for NorNE, map NER tags in CoNLL-U converter using a dict
provided as JSON as a command-line option.
Map NER entity types to a new tag or to "" for 'O', e.g.:
```
{"PER": "PERSON", "BAD": ""}
=>
B-PER -> B-PERSON
B-BAD -> O
```
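A minimal sketch of how such a mapping could be applied to BIO-style tags. It is an illustration, not the converter's actual code: the helper name `map_ner_tag` and the sample tag list are hypothetical, and it assumes tags have the form `PREFIX-LABEL` or `O`.

```python
import json


def map_ner_tag(tag, ner_map):
    """Map the entity type of a BIO-style tag using ner_map (hypothetical helper).

    Labels missing from ner_map are left unchanged; labels mapped to ""
    collapse to "O".
    """
    prefix, sep, label = tag.partition("-")
    if not sep:
        # No "PREFIX-LABEL" structure (e.g. "O"): return the tag as-is.
        return tag
    new_label = ner_map.get(label, label)
    if new_label == "":
        return "O"
    return f"{prefix}-{new_label}"


# The mapping would be supplied as a JSON string, as in the example above.
ner_map = json.loads('{"PER": "PERSON", "BAD": ""}')
tags = ["B-PER", "I-PER", "B-BAD", "O", "B-ORG"]
mapped = [map_ner_tag(t, ner_map) for t in tags]
print(mapped)
```

Keeping the mapping in a plain dict means labels that are not listed pass through untouched, so a partial map is enough.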
* Update token.md
The documentation is confusing: '?' is right punctuation, but '¿' is left punctuation
* Update token.md
add quotation marks around parentheses in `is_left_punct` and `is_right_punct` for clarification, ensuring the question mark that follows is not perceived as an example of left and right punctuation
* Move quotes into code block [ci skip]
* Include Doc.cats in to_bytes()
* Include Doc.cats in DocBin serialization
* Add tests for serialization of cats
Test serialization of cats for Doc and DocBin.
* Enable lex_attrs on Finnish
* Copy the Danish tokenizer rules to Finnish
Specifically, don't break hyphenated compound words
* Contributor agreement
* A new file for Finnish tokenizer rules instead of including the Danish ones
- added some tests for tokenization issues
- fixed some issues with tokenization of words with hyphen infix
- rewrote the "tokenizer_exceptions.py" file (stemming from the German version)