* Change retokenize.split() API for heads
* Pass lists as values for attrs in split
* Fix test_doc_split filename
* Add error for mismatched tokens after split
* Raise error if new tokens don't match text
* Fix doc test
* Fix error
* Move deps under attrs
* Fix split tests
* Fix retokenize.split
* Add base classes for more languages
* Add test for language class initialization
Make sure language can be initialize – otherwise, it's difficult to catch serious errors in the test suite, because languages are lazy-loaded
* Add split one token into several (resolves#2838)
* Improve error message for token splitting
* Make retokenizer.split() tests use a Token object
Change retokenizer.split() to use a Token object, instead of an index.
* Pass Token into retokenize.split()
Tweak retokenize.split() API so that we pass the `Token` object, not the index.
* Fix token.idx in retokenize.split()
* Test that token.idx is correct after split
* Fix token.idx for split tokens
* Fix retokenize.split()
* Fix retokenize.split
* Fix retokenize.split() test
* Add custom MatchPatternError
* Improve validators and add validation option to Matcher
* Adjust formatting
* Never validate in Matcher within PhraseMatcher
If we do decide to make validate default to True, the PhraseMatcher's Matcher shouldn't ever validate. Here, we create the patterns automatically anyways (and it's currently unclear whether the validation has performance impacts at a very large scale).
In most cases, the PhraseMatcher will match on the verbatim token text or as of v2.1, sometimes the lowercase text. This means that we only need a tokenized Doc, without any other attributes.
If phrase patterns are created by processing large terminology lists with the full `nlp` object, this easily can make things a lot slower, because all components will be applied, even if we don't actually need the attributes they set (like part-of-speech tags, dependency labels).
The warning message also includes a suggestion to use nlp.make_doc or nlp.tokenizer.pipe for even faster processing. For now, the validation has to be enabled explicitly by setting validate=True.
* Improved stop words list
* Removed some wrong stop words form list
* Improved stop words list
* Removed some wrong stop words form list
* Improved Polish Tokenizer (#38)
* Add tests for polish tokenizer
* Add polish tokenizer exceptions
* Don't split any words containing hyphens
* Fix test case with wrong model answer
* Remove commented out line of code until better solution is found
* Add source srx' license
* Rename exception_list.py to match spaCy conventionality
* Add a brief explanation of where the exception list comes from
* Add newline after reach exception
* Rename COPYING.txt to LICENSE
* Delete old files
* Add header to the license
* Agreements signed
* Stanisław Giziński agreement
* Krzysztof Kowalczyk - signed agreement
* Mateusz Olko agreement
* Add DoomCoder's contributor agreement
* Improve like number checking in polish lang
* like num tests added
* all from SI system added
* Final licence and removed splitting exceptions
* Added polish stop words to LEX_ATTRA
* Add encoding info to pl tokenizer exceptions
## Description
1. Added the same infix rule as in French (`d'une`, `j'ai`) for Italian (`c'è`, `l'ha`), bringing F-score on `it_isdt-ud-train.txt` from 96% to 99%. Added unit test to check this behaviour.
2. Added specific Urdu punctuation character as suffix, improving F-score on `ur_udtb-ud-train.txt` from 94% to 100%. Added unit test to check this behaviour.
### Types of change
Enhancement of Italian & Urdu tokenization
## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* replace unicode categories with raw list of code points
* simplifying ranges
* fixing variable length quotes
* removing redundant regular expression
* small cleanup of regexp notations
* quotes and alpha as ranges instead of alterations
* removed most regexp dependencies and features
* exponential backtracking - unit tests
* rewrote expression with pathological backtracking
* disabling double hyphen tests for now
* test additional variants of repeating punctuation
* remove regex and redundant backslashes from load_reddit script
* small typo fixes
* disable double punctuation test for russian
* clean up old comments
* format block code
* final cleanup
* naming consistency
* french strings as unicode for python 2 support
* french regular expression case insensitive
* Update matcher engine for regex and extensions
Add support for matching over arbitrary Python predicate functions, and
arbitrary Python attribute getters. This will allow matching over regex
patterns, and allow supporting extension attributes.
The results of the Python predicate functions are cached, so that we don't
call the same predicate function twice for the same token. The extension
attributes are fetched into an array for each token in the doc. This
should minimise the performance impact of the new features.
We still need to wire up these features to the patterns, and test it
all.
* Work on wiring up extra attributes in matcher
* Work on tests for extra matcher attrs
* Add support for extension attrs to matcher
* Test extension attribute matching
* Work on implementing predicate-based match patterns
* Get predicates working for set membership
* Add test for set membership
* Make extensions+predicates work
* Test matcher extensions
* Cache predicate results better in Matcher
* Remove print statement in matcher test
* Use srsly to get key for predicates
* Added the same punctuation rules as danish language.
* Added abbreviations and also the possibility to have capitalized abbreviations on some. Added a few specific cases too
* Added test for long texts in swedish
* Added morph rules, infixes and suffixes to __init__.py for swedish
* Added some tests for prefixes, infixes and suffixes
* Added tests for lemma
* Renamed files to follow convention
* [sv] Removed ambigious abbreviations
* Added more tests for tokenizer exceptions
* Added test for problem with punctuation in issue #2578
* Contributor agreement
* Removed faulty lemmatization of 'jag' ('I') as it was lemmatized to 'jaga' ('hunt')
This PR adds a test for an untested case of `Span.get_lca_matrix`, and fixes a bug for that scenario, which I introduced in [this PR](https://github.com/explosion/spaCy/pull/3089) (sorry!).
## Description
The previous implementation of get_lca_matrix was failing for the case `doc[j:k].get_lca_matrix()` where `j > 0`. A test has been added for this case and the bug has been fixed.
### Types of change
Bug fix
## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
Initially span.as_doc() was designed to return a view of the span's contents, as a Doc object. This was a nice idea, but it fails due to the token.idx property, which refers to the character offset within the string. In a span, the idx of the first token might not be 0. Because this data is different, we can't have a view --- it'll be inconsistent.
This patch changes span.as_doc() to instead return a copy. The docs are updated accordingly. Closes#1537
* Update test for span.as_doc()
* Make span.as_doc() return a copy. Closes#1537
* Document change to Span.as_doc()