Commit Graph

9669 Commits

Author SHA1 Message Date
Ines Montani
e7593b791e Fix import 2019-02-08 20:50:52 +01:00
Ines Montani
0754b848fe Actually xfail test for #1971 2019-02-08 20:50:35 +01:00
Ines Montani
414a69b736 Add xfailing test (see #1971, #2675, #2671) 2019-02-08 20:50:01 +01:00
Ines Montani
ea07f3022e Only run noun chunks iterator in Span if available (closes #3199) 2019-02-08 18:33:16 +01:00
Ines Montani
ff36b14cb2 Fix whitespace 2019-02-08 18:31:31 +01:00
Ines Montani
f4ce7bb7e9 Fix typo and deprecation message (resolves #3195) [ci skip] 2019-02-08 18:09:23 +01:00
Ines Montani
8ad15a2377 Fix typo [ci skip] 2019-02-08 17:29:53 +01:00
Ines Montani
7a985cba24 Fix typo (closes #3232) [ci skip] 2019-02-08 17:29:18 +01:00
Ines Montani
694139aad3 Fix formatting [ci skip] 2019-02-08 16:32:36 +01:00
Ines Montani
2898768757 Remove unused attribute [ci skip] 2019-02-08 16:31:30 +01:00
Ines Montani
586c56fc6c Tidy up regression tests 2019-02-08 15:51:13 +01:00
Ines Montani
25602c794c Tidy up and fix small bugs and typos 2019-02-08 14:14:49 +01:00
Ines Montani
9e652afa4b Merge branch 'master' into develop 2019-02-08 13:28:09 +01:00
Björn Lennartsson
647f0140c7 Fixed tag map for Swedish Talbanken (#3186) 2019-02-08 14:28:59 +11:00
Stanisław Giziński
1448ad100c Improved polish tokenizer and stop words. (#2974)
* Improved stop words list

* Removed some wrong stop words from list

* Improved stop words list

* Removed some wrong stop words from list

* Improved Polish Tokenizer (#38)

* Add tests for polish tokenizer

* Add polish tokenizer exceptions

* Don't split any words containing hyphens

* Fix test case with wrong model answer

* Remove commented out line of code until better solution is found

* Add source srx' license

* Rename exception_list.py to match spaCy conventionality

* Add a brief explanation of where the exception list comes from

* Add newline after each exception

* Rename COPYING.txt to LICENSE

* Delete old files

* Add header to the license

* Agreements signed

* Stanisław Giziński agreement

* Krzysztof Kowalczyk - signed agreement

* Mateusz Olko agreement

* Add DoomCoder's contributor agreement

* Improve like number checking in polish lang


* like num tests added

* all from SI system added

* Final licence and removed splitting exceptions

* Added polish stop words to LEX_ATTRA

* Add encoding info to pl tokenizer exceptions
2019-02-08 14:27:21 +11:00
Ines Montani
402d133c90 Add Ukrainian unicode 2019-02-07 21:11:58 +01:00
Ines Montani
e2d93e4852 Merge branch 'master' into develop 2019-02-07 21:10:08 +01:00
Ines Montani
2499da97e8 Format 2019-02-07 21:07:02 +01:00
Ines Montani
18205c6c48 Update company name 2019-02-07 21:06:55 +01:00
Julia Makogon
b41d64825a Ukrainian language added. Small fixes in Russian (#3241)
* Classes for Ukrainian; small fix in Russian.

* Contributor agreement
2019-02-07 21:05:11 +01:00
Ines Montani
77efee0295 Auto-format 2019-02-07 21:00:04 +01:00
Ines Montani
be1ff09403 Update dependencies 2019-02-07 20:57:55 +01:00
Ines Montani
f7e4674423 Fix contributor agreement 2019-02-07 20:56:13 +01:00
Ines Montani
4684195822 Rename contributer_agreement.md to .github/contributors/lauraBaakman.md 2019-02-07 20:55:53 +01:00
Ines Montani
5d0b60999d Merge branch 'master' into develop 2019-02-07 20:54:07 +01:00
Laura Baakman
04aa041c9e Update Example input JSON file to adhere to specification. (#3243)
* Example file does not adhere to json input spec.

According to the [JSON input spec](https://spacy.io/api/annotation#json-input), the `id` needs to be an `int`, not a string. Using a string as `id` results in a `TypeError` when calling `spacy.gold.read_json_file()`.

* Add spaCy Contributor Agreement.
2019-02-07 16:18:01 +01:00
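
The `id` constraint described in this commit can be checked before a file is handed to spaCy. The sketch below is a hypothetical pre-flight check that uses only the standard library. `find_non_int_ids` is an illustrative helper, not part of spaCy, and it assumes the top level of the file is a list of documents that each carry an `id` key.

```python
import json


def find_non_int_ids(json_text):
    """Return the positions of top-level documents whose `id` is not an int.

    Hypothetical helper for illustration; it is not part of spaCy.
    """
    docs = json.loads(json_text)
    bad = []
    for position, doc in enumerate(docs):
        doc_id = doc.get("id")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(doc_id, bool) or not isinstance(doc_id, int):
            bad.append(position)
    return bad


# The second document uses a string id, which is the case the commit fixes.
example = '[{"id": 0, "paragraphs": []}, {"id": "1", "paragraphs": []}]'
print(find_non_int_ids(example))
```

An empty result means every top-level `id` is an integer; any listed position points at a document that would need fixing.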
Matthew Honnibal
dbeebfa3a2 Set version to v2.1.0a7.dev1 2019-02-08 01:54:01 +11:00
Ines Montani
338d659bd0 Store JSON schemas in Python and tidy up (#3235) 2019-02-07 19:44:31 +11:00
Ines Montani
1ea4df459d 💫 Break up large matcher.pyx (#3236)
* Break up large matcher.pyx

* Remove unused function
2019-02-07 19:42:25 +11:00
Ines Montani
a9bf5d9fd8 Add xfailing test for set value with operator [ci skip] 2019-02-06 13:40:11 +01:00
Ines Montani
e51a238b3f Auto-format 2019-02-06 13:32:18 +01:00
Ines Montani
f25bd9f5e4 Add gold.spans_from_biluo_tags helper (#3227) 2019-02-06 21:50:26 +11:00
Ines Montani
5e16490d9d Fix default argument in TextCategorizer.Model (resolves #3221) 2019-02-05 12:33:47 +01:00
Ines Montani
89ad095900 Fix whitespace 2019-02-05 12:32:20 +01:00
Sofie
9745b0d523 Improve Italian & Urdu tokenization accuracy (#3228)
## Description

1. Added the same infix rule as in French (`d'une`, `j'ai`) for Italian (`c'è`, `l'ha`), bringing F-score on `it_isdt-ud-train.txt` from 96% to 99%. Added unit test to check this behaviour.

2. Added specific Urdu punctuation character as suffix, improving F-score on `ur_udtb-ud-train.txt` from 94% to 100%. Added unit test to check this behaviour.

### Types of change
Enhancement of Italian & Urdu tokenization

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-02-04 22:39:25 +01:00
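
The elision behaviour mentioned in point 1 can be illustrated with the standard library alone. This is a rough sketch, not spaCy's actual infix rule: the pattern and the `split_elision` helper are assumptions made for illustration, and they only show the idea of splitting right after an apostrophe that sits between two letters.

```python
import re

# Zero-width boundary: a letter plus an apostrophe behind, a letter ahead.
# [^\W\d_] matches any Unicode letter with Python 3's `re`.
ELISION_BOUNDARY = re.compile(r"(?<=[^\W\d_]['’])(?=[^\W\d_])")


def split_elision(word):
    """Split `word` after each elision apostrophe (illustrative only)."""
    pieces = []
    start = 0
    for match in ELISION_BOUNDARY.finditer(word):
        pieces.append(word[start:match.start()])
        start = match.start()
    pieces.append(word[start:])
    return pieces


print(split_elision("c'è"))
print(split_elision("l'ha"))
```

Words without an apostrophe come back as a single piece, so the helper can be applied to every whitespace-separated token.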
PierreMonico
114d64c4b5 Fix typo (#3223) 2019-02-04 11:37:29 +01:00
Sofie
a3efa3e8d9 Improve Catalan tokenization accuracy (#3225)
* small hyphen clean up for French

* catalan infix similar to french
2019-02-04 20:37:19 +11:00
Ines Montani
e00680a33a Remove unused outdated file 2019-02-01 11:39:48 +01:00
Matthew Honnibal
27e3f98cae Set version to v2.1.0a7.dev0 2019-02-01 18:06:34 +11:00
Sofie
46dfe773e1 Replacing regex library with re to increase tokenization speed (#3218)
* replace unicode categories with raw list of code points

* simplifying ranges

* fixing variable length quotes

* removing redundant regular expression

* small cleanup of regexp notations

* quotes and alpha as ranges instead of alternations

* removed most regexp dependencies and features

* exponential backtracking - unit tests

* rewrote expression with pathological backtracking

* disabling double hyphen tests for now

* test additional variants of repeating punctuation

* remove regex and redundant backslashes from load_reddit script

* small typo fixes

* disable double punctuation test for russian

* clean up old comments

* format block code

* final cleanup

* naming consistency

* french strings as unicode for python 2 support

* french regular expression case insensitive
2019-02-01 18:05:22 +11:00
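
The first bullet of this commit, replacing Unicode categories with raw code points, reflects a limitation of the standard library: `re` has no property classes such as `\p{L}`, so letter classes have to be spelled out as explicit ranges. The sketch below shows the idea on a deliberately small subset. The ranges and the names `LATIN_LETTERS` and `WORD` are assumptions for illustration and are far narrower than what a real tokenizer needs.

```python
import re

# Illustrative subset only: ASCII letters plus the Latin-1 Supplement letters.
# U+00D7 (multiplication sign) and U+00F7 (division sign) are skipped on purpose.
LATIN_LETTERS = "A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF"
WORD = re.compile("[{}]+".format(LATIN_LETTERS))

print(WORD.findall("déjà vu, naïve × 2"))
```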
Amandine Périnet
d570e75dbb Improving the French lookup dictionary for ambiguous words (#3185)
* modifying FR lookup to remove ambiguity and adding lookup vocab to FR files

* modifying FR lookup to remove ambiguity and adding lookup vocab to FR files

* updating the contributor agreement for amperinet
2019-01-31 23:53:45 +01:00
Ines Montani
e9a6dbe4f3 Don't check for Jupyter in global scope and fix check (#3213)
Resolves #3208.

Prevent interactions with other libraries (pandas) that also access `get_ipython().config` and its parameters. See #3208 for details. I don't fully understand why this happens, but in spaCy, we can at least make sure we avoid calling into this method.

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-01-31 23:49:13 +01:00
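
The approach this commit describes, running the Jupyter check on demand instead of at import time, might look roughly like the sketch below. It is an assumption-laden illustration rather than spaCy's code: the function only inspects the shell's class name, never touches `get_ipython().config`, and relies on the kernel shell class being called `ZMQInteractiveShell`.

```python
def is_in_jupyter():
    """Return True if running inside a Jupyter kernel (illustrative sketch)."""
    try:
        # `get_ipython` is injected into builtins by IPython; in a plain
        # Python interpreter the name is undefined and raises NameError.
        shell_name = get_ipython().__class__.__name__
    except NameError:
        return False
    # Assumption: the notebook kernel's shell class is ZMQInteractiveShell.
    return shell_name == "ZMQInteractiveShell"


# The check runs only when called, not when the module is imported.
print(is_in_jupyter())
```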
Amandine Périnet
b34bc9d2e9 add small fix for French lemmatizer (#3206) 2019-01-31 23:44:10 +01:00
mak
8fc6aaf134 Updated main to make use of lang variable (#3220)
Updated main to make use of language variable when initializing spacy.
2019-01-31 23:43:22 +01:00
adrianeboyd
03d58f9feb Update TIGER/German dependency relations in documentation (#3204)
* Add missing dependency relations for TIGER/German

* Contributor agreement for adrianeboyd
2019-01-30 14:23:12 +01:00
Loghi
5ca8e2b269 Tamil (#3194)
* Tamil language support
* Stop words, examples and numerical attribute support added

* Contributor agreement signed

* Create Loghijiaha.md

Added contributor agreement

* Update CONTRIBUTOR_AGREEMENT.md

Adjusted contributor_agreement.md

* Norm exceptions added
2019-01-27 06:02:04 +01:00
foufaster
8bd85fd9d5 Fix french lemmatization (#3180) 2019-01-27 06:01:30 +01:00
Sofie
66016ac289 Batch UD evaluation script (#3174)
* running UD eval

* printing timing of tokenizer: tokens per second

* timing of default English model

* structured output and parameterization to compare different runs

* additional flag to allow evaluation without parsing info

* printing verbose log of errors for manual inspection

* printing over- and undersegmented cases (and combo's)

* add under and oversegmented numbers to Score and structured output

* print high-freq over/under segmented words and word shapes

* printing examples as part of the structured output

* print the results to file

* batch run of different models and treebanks per language

* cleaning up code

* commandline script to process all languages in spaCy & UD

* heuristic to remove blinded corpora and option to run one single best per language

* pathlib instead of os for file paths
2019-01-27 06:01:02 +01:00
Jo
f9ca09caa0 Create PolyglotOpenstreetmap.md (#3198)
* Create PolyglotOpenstreetmap.md

* forgot to tick that box
2019-01-26 14:02:54 +01:00
Matthew Honnibal
5a4737df09 Set version to 2.1.0a6 2019-01-21 18:32:34 +01:00