mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-27 18:36:36 +03:00
a2ad9832e5
7 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Sofie
|
9a478b6db8 |
Clean up of char classes, few tokenizer fixes and faster default French tokenizer (#3293)
* splitting up latin unicode interval * removing hyphen as infix for French * adding failing test for issue 1235 * test for issue #3002 which now works * partial fix for issue #2070 * keep the hyphen as infix for French (as it was) * restore french expressions with hyphen as infix (as it was) * added succeeding unit test for Issue #2656 * Fix issue #2822 with custom Italian exception * Fix issue #2926 by allowing numbers right before infix / * splitting up latin unicode interval * removing hyphen as infix for French * adding failing test for issue 1235 * test for issue #3002 which now works * partial fix for issue #2070 * keep the hyphen as infix for French (as it was) * restore french expressions with hyphen as infix (as it was) * added succeeding unit test for Issue #2656 * Fix issue #2822 with custom Italian exception * Fix issue #2926 by allowing numbers right before infix / * remove duplicate * remove xfail for Issue #2179 fixed by Matt * adjust documentation and remove reference to regex lib |
||
Ines Montani
|
5651a0d052 |
💫 Replace {Doc,Span}.merge with Doc.retokenize (#3280)
* Add deprecation warning to Doc.merge and Span.merge * Replace {Doc,Span}.merge with Doc.retokenize |
||
Sofie
|
46dfe773e1 |
Replacing regex library with re to increase tokenization speed (#3218)
* replace unicode categories with raw list of code points * simplifying ranges * fixing variable length quotes * removing redundant regular expression * small cleanup of regexp notations * quotes and alpha as ranges instead of alterations * removed most regexp dependencies and features * exponential backtracking - unit tests * rewrote expression with pathological backtracking * disabling double hyphen tests for now * test additional variants of repeating punctuation * remove regex and redundant backslashes from load_reddit script * small typo fixes * disable double punctuation test for russian * clean up old comments * format block code * final cleanup * naming consistency * french strings as unicode for python 2 support * french regular expression case insensitive |
||
Ines Montani
|
323fc26880 | Tidy up and format remaining files | ||
Ines Montani
|
d33953037e
|
💫 Port master changes over to develop (#2979)
* Create aryaprabhudesai.md (#2681) * Update _install.jade (#2688) Typo fix: "models" -> "model" * Add FAC to spacy.explain (resolves #2706) * Remove docstrings for deprecated arguments (see #2703) * When calling getoption() in conftest.py, pass a default option (#2709) * When calling getoption() in conftest.py, pass a default option This is necessary to allow testing an installed spacy by running: pytest --pyargs spacy * Add contributor agreement * update bengali token rules for hyphen and digits (#2731) * Less norm computations in token similarity (#2730) * Less norm computations in token similarity * Contributor agreement * Remove ')' for clarity (#2737) Sorry, don't mean to be nitpicky, I just noticed this when going through the CLI and thought it was a quick fix. That said, if this was intention than please let me know. * added contributor agreement for mbkupfer (#2738) * Basic support for Telugu language (#2751) * Lex _attrs for polish language (#2750) * Signed spaCy contributor agreement * Added polish version of english lex_attrs * Introduces a bulk merge function, in order to solve issue #653 (#2696) * Fix comment * Introduce bulk merge to increase performance on many span merges * Sign contributor agreement * Implement pull request suggestions * Describe converters more explicitly (see #2643) * Add multi-threading note to Language.pipe (resolves #2582) [ci skip] * Fix formatting * Fix dependency scheme docs (closes #2705) [ci skip] * Don't set stop word in example (closes #2657) [ci skip] * Add words to portuguese language _num_words (#2759) * Add words to portuguese language _num_words * Add words to portuguese language _num_words * Update Indonesian model (#2752) * adding e-KTP in tokenizer exceptions list * add exception token * removing lines with containing space as it won't matter since we use .split() method in the end, added new tokens in exception * add tokenizer exceptions list * combining base_norms with norm_exceptions * adding norm_exception * fix double key in lemmatizer * remove unused import on punctuation.py * reformat stop_words to reduce number of lines, improve readibility * updating tokenizer exception * implement is_currency for lang/id * adding orth_first_upper in tokenizer_exceptions * update the norm_exception list * remove bunch of abbreviations * adding contributors file * Fixed spaCy+Keras example (#2763) * bug fixes in keras example * created contributor agreement * Adding French hyphenated first name (#2786) * Fix typo (closes #2784) * Fix typo (#2795) [ci skip] Fixed typo on line 6 "regcognizer --> recognizer" * Adding basic support for Sinhala language. (#2788) * adding Sinhala language package, stop words, examples and lex_attrs. * Adding contributor agreement * Updating contributor agreement * Also include lowercase norm exceptions * Fix error (#2802) * Fix error ValueError: cannot resize an array that references or is referenced by another array in this way. Use the resize function * added spaCy Contributor Agreement * Add charlax's contributor agreement (#2805) * agreement of contributor, may I introduce a tiny pl languge contribution (#2799) * Contributors agreement * Contributors agreement * Contributors agreement * Add jupyter=True to displacy.render in documentation (#2806) * Revert "Also include lowercase norm exceptions" This reverts commit |
||
Ines Montani
|
b6e991440c |
💫 Tidy up and auto-format tests (#2967)
* Auto-format tests with black * Add flake8 config * Tidy up and remove unused imports * Fix redefinitions of test functions * Replace orths_and_spaces with words and spaces * Fix compatibility with pytest 4.0 * xfail test for now Test was previously overwritten by following test due to naming conflict, so failure wasn't reported * Unfail passing test * Only use fixture via arguments Fixes pytest 4.0 compatibility |
||
Ines Montani
|
75f3234404
|
💫 Refactor test suite (#2568)
## Description Related issues: #2379 (should be fixed by separating model tests) * **total execution time down from > 300 seconds to under 60 seconds** 🎉 * removed all model-specific tests that could only really be run manually anyway – those will now live in a separate test suite in the [`spacy-models`](https://github.com/explosion/spacy-models) repository and are already integrated into our new model training infrastructure * changed all relative imports to absolute imports to prepare for moving the test suite from `/spacy/tests` to `/tests` (it'll now always test against the installed version) * merged old regression tests into collections, e.g. `test_issue1001-1500.py` (about 90% of the regression tests are very short anyways) * tidied up and rewrote existing tests wherever possible ### Todo - [ ] move tests to `/tests` and adjust CI commands accordingly - [x] move model test suite from internal repo to `spacy-models` - [x] ~~investigate why `pipeline/test_textcat.py` is flakey~~ - [x] review old regression tests (leftover files) and see if they can be merged, simplified or deleted - [ ] update documentation on how to run tests ### Types of change enhancement, tests ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [ ] My changes don't require a change to the documentation, or if they do, I've added all required information. |