* Classes for Ukrainian; small fix in Russian.
* Contributor agreement
* pymorphy2 initialization split for ru and uk (#3327)
* stop-words fixed
* Unit-tests updated
<!--- Provide a general summary of your changes in the title. -->
## Description
This PR adds the abilility to override custom extension attributes during merging. This will only work for attributes that are writable, i.e. attributes registered with a default value like `default=False` or attribute that have both a getter *and* a setter implemented.
```python
Token.set_extension('is_musician', default=False)
doc = nlp("I like David Bowie.")
with doc.retokenize() as retokenizer:
attrs = {"LEMMA": "David Bowie", "_": {"is_musician": True}}
retokenizer.merge(doc[2:4], attrs=attrs)
assert doc[2].text == "David Bowie"
assert doc[2].lemma_ == "David Bowie"
assert doc[2]._.is_musician
```
### Types of change
enhancement
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Keep TextCategorizer default model same as v2.0
* Add option 'architecture' that allows "simple_cnn" to switch to
simpler model.
* Add option exclusive_classes, defaulting to False. If set to True,
the model treats classes as mutually exclusive, i.e. only one class can
be true per instance.
* splitting up latin unicode interval
* removing hyphen as infix for French
* adding failing test for issue 1235
* test for issue #3002 which now works
* partial fix for issue #2070
* keep the hyphen as infix for French (as it was)
* restore french expressions with hyphen as infix (as it was)
* added succeeding unit test for Issue #2656
* Fix issue #2822 with custom Italian exception
* Fix issue #2926 by allowing numbers right before infix /
* splitting up latin unicode interval
* removing hyphen as infix for French
* adding failing test for issue 1235
* test for issue #3002 which now works
* partial fix for issue #2070
* keep the hyphen as infix for French (as it was)
* restore french expressions with hyphen as infix (as it was)
* added succeeding unit test for Issue #2656
* Fix issue #2822 with custom Italian exception
* Fix issue #2926 by allowing numbers right before infix /
* remove duplicate
* remove xfail for Issue #2179 fixed by Matt
* adjust documentation and remove reference to regex lib
* Fix matching on extension attrs and predicates
* Fix detection of match_id when using extension attributes. The match
ID is stored as the last entry in the pattern. We were checking for this
with nr_attr == 0, which didn't account for extension attributes.
* Fix handling of predicates. The wrong count was being passed through,
so even patterns that didn't have a predicate were being checked.
* Fix regex pattern
* Fix matcher set value test
* Change retokenize.split() API for heads
* Pass lists as values for attrs in split
* Fix test_doc_split filename
* Add error for mismatched tokens after split
* Raise error if new tokens don't match text
* Fix doc test
* Fix error
* Move deps under attrs
* Fix split tests
* Fix retokenize.split
* Add base classes for more languages
* Add test for language class initialization
Make sure language can be initialize – otherwise, it's difficult to catch serious errors in the test suite, because languages are lazy-loaded
* Add split one token into several (resolves#2838)
* Improve error message for token splitting
* Make retokenizer.split() tests use a Token object
Change retokenizer.split() to use a Token object, instead of an index.
* Pass Token into retokenize.split()
Tweak retokenize.split() API so that we pass the `Token` object, not the index.
* Fix token.idx in retokenize.split()
* Test that token.idx is correct after split
* Fix token.idx for split tokens
* Fix retokenize.split()
* Fix retokenize.split
* Fix retokenize.split() test
Otherwise, the true error that happens within a Language subclass is swallowed, because if it's imported lazily like that, it'll always be an ImportError
* Add custom MatchPatternError
* Improve validators and add validation option to Matcher
* Adjust formatting
* Never validate in Matcher within PhraseMatcher
If we do decide to make validate default to True, the PhraseMatcher's Matcher shouldn't ever validate. Here, we create the patterns automatically anyways (and it's currently unclear whether the validation has performance impacts at a very large scale).
In most cases, the PhraseMatcher will match on the verbatim token text or as of v2.1, sometimes the lowercase text. This means that we only need a tokenized Doc, without any other attributes.
If phrase patterns are created by processing large terminology lists with the full `nlp` object, this easily can make things a lot slower, because all components will be applied, even if we don't actually need the attributes they set (like part-of-speech tags, dependency labels).
The warning message also includes a suggestion to use nlp.make_doc or nlp.tokenizer.pipe for even faster processing. For now, the validation has to be enabled explicitly by setting validate=True.
* Improved stop words list
* Removed some wrong stop words form list
* Improved stop words list
* Removed some wrong stop words form list
* Improved Polish Tokenizer (#38)
* Add tests for polish tokenizer
* Add polish tokenizer exceptions
* Don't split any words containing hyphens
* Fix test case with wrong model answer
* Remove commented out line of code until better solution is found
* Add source srx' license
* Rename exception_list.py to match spaCy conventionality
* Add a brief explanation of where the exception list comes from
* Add newline after reach exception
* Rename COPYING.txt to LICENSE
* Delete old files
* Add header to the license
* Agreements signed
* Stanisław Giziński agreement
* Krzysztof Kowalczyk - signed agreement
* Mateusz Olko agreement
* Add DoomCoder's contributor agreement
* Improve like number checking in polish lang
* like num tests added
* all from SI system added
* Final licence and removed splitting exceptions
* Added polish stop words to LEX_ATTRA
* Add encoding info to pl tokenizer exceptions
## Description
1. Added the same infix rule as in French (`d'une`, `j'ai`) for Italian (`c'è`, `l'ha`), bringing F-score on `it_isdt-ud-train.txt` from 96% to 99%. Added unit test to check this behaviour.
2. Added specific Urdu punctuation character as suffix, improving F-score on `ur_udtb-ud-train.txt` from 94% to 100%. Added unit test to check this behaviour.
### Types of change
Enhancement of Italian & Urdu tokenization
## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* replace unicode categories with raw list of code points
* simplifying ranges
* fixing variable length quotes
* removing redundant regular expression
* small cleanup of regexp notations
* quotes and alpha as ranges instead of alterations
* removed most regexp dependencies and features
* exponential backtracking - unit tests
* rewrote expression with pathological backtracking
* disabling double hyphen tests for now
* test additional variants of repeating punctuation
* remove regex and redundant backslashes from load_reddit script
* small typo fixes
* disable double punctuation test for russian
* clean up old comments
* format block code
* final cleanup
* naming consistency
* french strings as unicode for python 2 support
* french regular expression case insensitive
* modifying FR lookup to remove ambiguity and adding lookup vocab to FR files
* modifying FR lookup to remove ambiguity and adding lookup vocab to FR files
* updating the contributor agreement for amperinet
Resolves#3208.
Prevent interactions with other libraries (pandas) that also access `get_ipython().config` and its parameters. See #3208 for details. I don't fully understand why this happens, but in spaCy, we can at least make sure we avoid calling into this method.
<!--- Provide a general summary of your changes in the title. -->
## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->
### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* running UD eval
* printing timing of tokenizer: tokens per second
* timing of default English model
* structured output and parameterization to compare different runs
* additional flag to allow evaluation without parsing info
* printing verbose log of errors for manual inspection
* printing over- and undersegmented cases (and combo's)
* add under and oversegmented numbers to Score and structured output
* print high-freq over/under segmented words and word shapes
* printing examples as part of the structured output
* print the results to file
* batch run of different models and treebanks per language
* cleaning up code
* commandline script to process all languages in spaCy & UD
* heuristic to remove blinded corpora and option to run one single best per language
* pathlib instead of os for file paths
* Update matcher engine for regex and extensions
Add support for matching over arbitrary Python predicate functions, and
arbitrary Python attribute getters. This will allow matching over regex
patterns, and allow supporting extension attributes.
The results of the Python predicate functions are cached, so that we don't
call the same predicate function twice for the same token. The extension
attributes are fetched into an array for each token in the doc. This
should minimise the performance impact of the new features.
We still need to wire up these features to the patterns, and test it
all.
* Work on wiring up extra attributes in matcher
* Work on tests for extra matcher attrs
* Add support for extension attrs to matcher
* Test extension attribute matching
* Work on implementing predicate-based match patterns
* Get predicates working for set membership
* Add test for set membership
* Make extensions+predicates work
* Test matcher extensions
* Cache predicate results better in Matcher
* Remove print statement in matcher test
* Use srsly to get key for predicates
* Added the same punctuation rules as danish language.
* Added abbreviations and also the possibility to have capitalized abbreviations on some. Added a few specific cases too
* Added test for long texts in swedish
* Added morph rules, infixes and suffixes to __init__.py for swedish
* Added some tests for prefixes, infixes and suffixes
* Added tests for lemma
* Renamed files to follow convention
* [sv] Removed ambigious abbreviations
* Added more tests for tokenizer exceptions
* Added test for problem with punctuation in issue #2578
* Contributor agreement
* Removed faulty lemmatization of 'jag' ('I') as it was lemmatized to 'jaga' ('hunt')