Commit Graph

5047 Commits

Author SHA1 Message Date
Matthew Honnibal
9447739027 Merge branch 'master' of https://github.com/explosion/spaCy 2018-10-27 00:50:48 +02:00
Matthew Honnibal
ad068f51be Fix out-of-bounds access in NER training
The helper method state.B(1) gets the index of the first token of the
buffer, or -1 if no such token exists. Normally this is safe because we
pass this to functions like state.safe_get(), which returns an empty
token. Here we used it directly as an array index, which is not okay!

This error may have been the cause of out-of-bounds access errors during
training. Similar errors may still be around, so much be hunted down.
Hunting this one down took a long time...I printed out values across
training runs and diffed, looking for points of divergence between
runs, when no randomness should be allowed.
2018-10-27 00:46:30 +02:00
Grivaz
57f274b693 raise error when setting overlapping entities as doc.ents (#2880) 2018-10-26 23:29:16 +02:00
Ines Montani
48b1bc44d3 Update version to 2.0.16 2018-10-15 14:39:25 +02:00
Ines Montani
a0f6647160 Increment version 2018-10-15 14:20:55 +02:00
Ines Montani
7bc7fa8f1e Increment version 2018-10-15 01:40:44 +02:00
Matthew Honnibal
8612b75890 Set version to 2.0.14 2018-10-15 00:10:04 +02:00
Matthew Honnibal
d6e9cf8b09 Set version to 2.0.14.dev1 2018-10-15 00:09:02 +02:00
Matthew Honnibal
8ccfa52d19 Unhack prefer_gpu 2018-10-14 23:27:09 +02:00
Matthew Honnibal
41adf3572b Set version to v2.0.14 2018-10-14 23:15:34 +02:00
Matthew Honnibal
38aa835ada Workaround bug in thinc require_gpu 2018-10-14 23:15:08 +02:00
Matthew Honnibal
91593b7378 Add tests for prefer_gpu() and require_gpu() 2018-10-14 23:05:22 +02:00
Matthew Honnibal
62c70b3163 Import prefer_gpu and require_gpu functions from Thinc 2018-10-14 23:03:06 +02:00
Ines Montani
295da0f11b Increment version to 2.0.14.dev0 2018-10-14 16:37:46 +02:00
Matthew Honnibal
7de0dcb91f Merge branch 'master' of https://github.com/explosion/spaCy 2018-10-14 16:12:23 +02:00
Keshan
cb075c8e72 Adding "This is a sentence" example to Sinhala (#2846) 2018-10-14 00:06:40 +02:00
Matthew Honnibal
9cfab5933a Set version to 2.0.13 2018-10-13 19:42:16 +02:00
Matthew Honnibal
6a6ae5b0af Merge branch 'master' of https://github.com/explosion/spaCy 2018-10-13 19:41:00 +02:00
mauryaland
36514b5762 Rule-based French Lemmatizer (#2818)
<!--- Provide a general summary of your changes in the title. -->

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

Add a rule-based French Lemmatizer following the english one and the excellent PR for [greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class.

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

- Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version.
- Add several files containing exhaustive list of words for each part of speech 
- Add some lemma rules
- Add POS that are not checked in the standard Lemmatizer, i.e PRON, DET, ADV and AUX
- Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentionned
- Modify the lemmatize function to check in lookup table as a last resort
- Init files are updated so the model can support all the functionalities mentioned above
- Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [X] I have submitted the spaCy Contributor Agreement.
- [X] I ran the tests, and all new and existing tests passed.
- [X] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-10-13 16:38:21 +02:00
Matthew Honnibal
de46286107 Merge branch 'master' of https://github.com/explosion/spaCy 2018-10-13 16:11:16 +02:00
Ines Montani
cb57b35bb8 Also include lowercase norm exceptions 2018-10-13 15:37:30 +02:00
JKhakpour
74a30d883c Add Persian(Farsi) language support (#2797) 2018-10-13 15:31:49 +02:00
Matthew Honnibal
c3ddf98b1e Set version to 2.0.13.dev4 2018-10-13 15:20:59 +02:00
Marina Lysyuk
b76fe08308 Correcting lang/ru/examples.py (#2845)
* Correct some grammatical inaccuracies in lang\ru\examples.py; filled Contributor Agreement

* Correct some grammatical inaccuracies in lang\ru\examples.py

* Move contributor agreement to separate file
2018-10-13 15:19:43 +02:00
Matthew Honnibal
67ddce68d8 Unskip test 2018-10-02 23:47:55 +02:00
Matthew Honnibal
4cf5ce2cc2 Revert "Remove problematic test"
This reverts commit bdebbef455.
2018-10-02 23:47:24 +02:00
Matthew Honnibal
bdebbef455 Remove problematic test 2018-10-02 23:16:29 +02:00
Matthew Honnibal
6afc6ffe56 Skip seemingly problematic test 2018-10-02 22:33:40 +02:00
Matthew Honnibal
9e4079ddb2 Merge branch 'master' of https://github.com/explosion/spaCy 2018-10-02 19:44:43 +02:00
Matthew Honnibal
40f228c2f2 Set version to 2.0.13.dev3 2018-10-02 19:44:25 +02:00
Filipe Caixeta
6c498f9ff4 Update Portuguese Language (#2790)
* Add words to portuguese language _num_words

* Add words to portuguese language _num_words

* Portuguese - Add/remove stopwords, fix tokenizer, add currency symbols

* Extended punctuation and norm_exceptions in the Portuguese language
2018-09-29 09:51:45 +02:00
Matthew Honnibal
6430b1fe64 Restore encoding arg on msgpack-numpy 2018-09-27 15:58:21 +02:00
Matthew Honnibal
2ac69facc6 Fix Python 2 test failure 2018-09-27 15:34:16 +02:00
Matthew Honnibal
72778375fb Merge branch 'master' of https://github.com/explosion/spaCy 2018-09-27 13:54:49 +02:00
Matthew Honnibal
96fe314d8d Fix bug when too many entity types. Fixes #2800 2018-09-27 13:54:34 +02:00
Suraj Rajan
bbdc6456c6 Set up dependency tree pattern matching skeleton (#2732) 2018-09-27 13:27:18 +02:00
Matthew Honnibal
8809dc4514 Remove deprecated encoding argument to msgpack 2018-09-27 12:56:23 +02:00
Matthew Honnibal
bae6b3e2b3 Merge branch 'master' of https://github.com/explosion/spaCy 2018-09-27 12:50:31 +02:00
Ines Montani
71cdbeada7 Revert "Also include lowercase norm exceptions"
This reverts commit 70f4e8adf3.
2018-09-27 12:29:25 +02:00
darindf
8227566805 Fix error (#2802)
* Fix error
ValueError: cannot resize an array that references or is referenced
by another array in this way.  Use the resize function

* added spaCy Contributor Agreement
2018-09-26 21:31:03 +02:00
Ines Montani
5e0dfb34fa Merge branch 'master' of https://github.com/explosion/spaCy 2018-09-26 11:13:58 +02:00
Ines Montani
70f4e8adf3 Also include lowercase norm exceptions 2018-09-25 12:22:02 +02:00
Keshan
9a016d17c2 Adding basic support for Sinhala language. (#2788)
* adding Sinhala language package, stop words, examples and lex_attrs.

* Adding contributor agreement

* Updating contributor agreement
2018-09-25 12:18:25 +02:00
Ines Montani
3c4e3ade30 Fix typo (closes #2784) 2018-09-21 10:45:11 +02:00
mauryaland
68b3c544d5 Adding French hyphenated first name (#2786) 2018-09-21 10:38:13 +02:00
Andrew Ongko
81564cc4e8 Update Indonesian model (#2752)
* adding e-KTP in tokenizer exceptions list

* add exception token

* removing lines with containing space as it won't matter since we use .split() method in the end, added new tokens in exception

* add tokenizer exceptions list

* combining base_norms with norm_exceptions

* adding norm_exception

* fix double key in lemmatizer

* remove unused import on punctuation.py

* reformat stop_words to reduce number of lines, improve readibility

* updating tokenizer exception

* implement is_currency for lang/id

* adding orth_first_upper in tokenizer_exceptions

* update the norm_exception list

* remove bunch of abbreviations

* adding contributors file
2018-09-14 12:30:32 +02:00
Filipe Caixeta
fe515085f3 Add words to portuguese language _num_words (#2759)
* Add words to portuguese language _num_words

* Add words to portuguese language _num_words
2018-09-14 12:30:16 +02:00
Grivaz
aeba99ab0d Introduces a bulk merge function, in order to solve issue #653 (#2696)
* Fix comment

* Introduce bulk merge to increase performance on many span merges

* Sign contributor agreement

* Implement pull request suggestions
2018-09-10 16:41:42 +02:00
tyburam
476472d181 Lex _attrs for polish language (#2750)
* Signed spaCy contributor agreement

* Added polish version of english lex_attrs
2018-09-10 11:53:57 +02:00
Sainath Adapa
77139bc03c Basic support for Telugu language (#2751) 2018-09-10 11:53:18 +02:00
Maxim Kupfer
cebe50b5b8 Remove ')' for clarity (#2737)
Sorry, don't mean to be nitpicky, I just noticed this when going through the CLI and thought it was a quick fix. That said, if this was intention than please let me know.
2018-09-10 11:31:49 +02:00
Piotr Żelasko
bdb2165bd1 Less norm computations in token similarity (#2730)
* Less norm computations in token similarity

* Contributor agreement
2018-09-05 21:50:23 +02:00
Aniruddha Adhikary
4530ddcc51 update bengali token rules for hyphen and digits (#2731) 2018-09-05 21:49:00 +02:00
Nathaniel J. Smith
26849874ad When calling getoption() in conftest.py, pass a default option (#2709)
* When calling getoption() in conftest.py, pass a default option

This is necessary to allow testing an installed spacy by running:

  pytest --pyargs spacy

* Add contributor agreement
2018-09-03 09:57:52 +02:00
Ines Montani
e9022f7b33 Remove docstrings for deprecated arguments (see #2703) 2018-08-26 14:23:13 +02:00
Ines Montani
559f4139e3 Add FAC to spacy.explain (resolves #2706) 2018-08-26 14:13:50 +02:00
Matthew Honnibal
13fa550b36 Merge branch 'master' of https://github.com/explosion/spaCy 2018-08-14 02:32:01 +02:00
Ioannis Daras
fe94e696d3 Optimize Greek language support (#2658) 2018-08-14 02:31:32 +02:00
Matthew Honnibal
85000ea13b Increment version to 2.0.13.dev2 2018-08-10 00:43:55 +02:00
Matthew Honnibal
c4ac981e6d Try again to filter warnings 2018-08-10 00:42:54 +02:00
Matthew Honnibal
ae7fc42a41 Increment version to v2.0.13.dev1 2018-08-10 00:14:31 +02:00
Matthew Honnibal
19f5046934 Undoing warning suppression, as doesnt really work 2018-08-10 00:13:34 +02:00
Matthew Honnibal
3fb828352d Set version to 2.0.13.dev0 2018-08-09 23:49:34 +02:00
Matthew Honnibal
1c0614ecd2 Catch numpy warning 2018-08-09 23:49:24 +02:00
Aashish Gangwani
6eebfc7bf4 Added numbers to ../lang/hi/lex_attrs.py (#2629)
I have added numbers in hindi lex_attrs.py file according to Indian numbering system(https://en.wikipedia.org/wiki/Indian_numbering_system) and here are there english translations:
'शून्य' => zero 
'एक' => one
'दो' => two
'तीन' => three
 'चार' => four
'पांच' => five
'छह' => six
'सात'=>seven 
'आठ' => eight
'नौ' => nine
'दस' => ten
'ग्यारह' => eleven
'बारह' => twelve
 'तेरह' => thirteen
'चौदह' => fourteen
'पंद्रह' => fifteen
'सोलह'=> sixteen
'सत्रह' => seventeen
'अठारह' => eighteen
'उन्नीस' => nineteen
'बीस' => twenty
 'तीस' => thirty
'चालीस' => forty
'पचास' => fifty
'साठ' => sixty
'सत्तर' => seventy
'अस्सी' => eighty
'नब्बे' => ninety
'सौ' => hundred
'हज़ार' => thousand
'लाख' => hundred thousand
'करोड़' => ten million
'अरब' => billion
'खरब' => hundred billion

<!--- Provide a general summary of your changes in the title. -->

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-08-08 16:06:11 +02:00
Emil Stenström
3834f4146d Add abbreviations from UD_Swedish-Talbanken (#2613)
* Add abbreviations from UD_Swedish-Talbanken

* Add contributor agreement.
2018-08-07 13:53:17 +02:00
Ole Henrik Skogstrøm
0473add369 Feature/span ents (#2599)
* Created Span.ents property

* Add tests for span.ents

* Add tests for start and end of sentence
2018-08-07 13:52:32 +02:00
Xiaoquan Kong
87fa847e6e Fix Chinese language related bugs (#2634) 2018-08-07 11:26:31 +02:00
Xiaoquan Kong
f0c9652ed1 New Feature: display more detail when Error E067 (#2639)
* Fix off-by-one error

* Add verbose option

* Update verbose option

* Update documents for verbose option
2018-08-07 10:45:29 +02:00
Emil Stenström
1914c488d3 Swedish: Exceptions for single letter words ending sentence (#2615)
* Exceptions for single letter words ending sentence

Sentences ending in "i." (as in "... peka i."), "m." (as in "...än 2000 m."), should be tokenized as two separate tokens.

* Add test
2018-08-05 14:14:30 +02:00
Matthew Honnibal
860f5bd91f Add test for issue 2626 2018-08-05 13:46:57 +02:00
Kaisa (Katarzyna) Korsak
e531a827db Changed conllu2json to be able to extract NER tags (#2594)
* extract ner tags from conllu file if available

* fixed a bug in regex
2018-07-25 22:21:31 +02:00
Dmitry Bruhanov
07d0cc9de7 Update examples.py (#2597) 2018-07-25 22:20:24 +02:00
Matthew Honnibal
66983d8412
Port BenDerPan's Chinese changes to v2 (finally) (#2591)
* add  template files for Chinese

* add  template files for Chinese, and test directory .
2018-07-25 02:47:23 +02:00
ines
f2e3e039b7 Update French stop words (resolves #2540) 2018-07-24 23:41:51 +02:00
Matthew Honnibal
6303ce3d0e Try to fix memory error by moving fr_tokenizer to module scope 2018-07-24 20:09:06 +02:00
Matthew Honnibal
afe3fa4449 Merge branch 'master' of https://github.com/explosion/spaCy 2018-07-24 19:44:31 +02:00
Matthew Honnibal
b2e9e958b9 Add session scoping to tokenizers to try to fix oom on Appveyor 2018-07-24 19:44:18 +02:00
Ines Montani
a43ad114c2
Fix typo [ci skip] 2018-07-24 18:45:40 +02:00
Dmitry Bruhanov
27160b1516 added some widespread written jargon & dialectizms (#2584)
This jargon is not offencive but emotionally colored as funny due to its deviation from the norm for various reasons: immitating a dialect, deliberately wrong spelling emphasizing its low colloquial nature, obsolete form, foreign borrowing with native flections, etc.
Dmitry Briukhanov, Linguist & Pythonist
2018-07-24 18:44:29 +02:00
Matthew Honnibal
90c269e1a9 Set about to v2.0.12 release 2018-07-21 15:09:42 +02:00
Matthew Honnibal
1a1c7304cf Set version to 2.0.12.dev1 2018-07-21 13:08:01 +02:00
ines
1ea881c80b Allow ignoring warnings and only overwrite if set explicitly 2018-07-20 22:50:19 +02:00
Matthew Honnibal
e0caf3ae8c Fix msgpack for new version 2018-07-20 17:32:00 +02:00
Matthew Honnibal
899f1cf442 Add regression test for issue 2179 2018-07-20 17:15:44 +02:00
Matthew Honnibal
9db77fd914 Fix deserialization for msgpack 2018-07-20 14:11:09 +02:00
katarkor
5ca853bee0 changed tag_map, morph_rules, lemmatizer for Norwegian (#2565)
* changed tag_map, morph_rules, lemmatizer for Norwegian

* Move unicode declaration up

Hopefully fixes test failure on Python 2

* Update CONTRIBUTOR_AGREEMENT.md

* Move unicode declarations

Hopefully fixes test this time

* Revert "Merge remote-tracking branch 'origin/patch-1'"

This reverts commit f5ccd5dd0d, reversing
changes made to dd07e180ea.

* Update contributor agreement [ci skip]
2018-07-19 19:38:24 +02:00
Ole Henrik Skogstrøm
6e2930a4a2 Conll(u)-bio converter (#2525)
* Started simple conllxbiluo converter

* Fix missing BIO to BILUO conversion
2018-07-18 18:55:42 +02:00
Ioannis Daras
6ed18412d0 Greek language optimizations (#2558)
* Greek language optimizations

* Add encoding on files containing greek words

* Add encoding on files containing greek words
2018-07-18 18:51:38 +02:00
Paul O'Leary McCann
61ef0739b8 Add Japanese stop words. (#2549)
List created by taking the 2000 top words from a Wikipedia dump and
removing everything that wasn't hiragana.

Tried going through kanji words and deciding what to keep but there were
too many obvious non-stopwords (東京 was in the top 500) and many other
words where it wasn't clear if they should be included or not.
2018-07-17 10:12:48 +02:00
Tero K
f35980f865 Enhancement/lang fi examples (#2547)
* Added a file with examples in finnish

* added contributor agreement
2018-07-15 09:50:27 +02:00
Paul O'Leary McCann
1987f3f784 Add Japanese lemmas (#2543)
This info was already available from Mecab, forgot to add it before.
2018-07-13 10:55:14 +02:00
Eleni170
6042723535 Add support for Greek language (#2535)
* Add contributor agreement

* Support for Greek language

* Fix missing el_tokenizer
2018-07-10 13:48:38 +02:00
Stefan Schweter
3dfc7f86be lemmatizer: correct lemma for Rang (#2537)
<!--- Provide a general summary of your changes in the title. -->

## Description

This PR corrects the German lemma form for the word "Rang". Initially, the lemma form was "ringen", which is not correct, because it refers to the verb ("ringen") and not to the noun ("Rang").

### Types of change

The lemma form for "Rang" is corrected to "Rang", see also the [Duden](https://www.duden.de/rechtschreibung/Rang) entry.

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-10 13:11:19 +02:00
Duygu Altinok
00b9a58558 German lemmatizer additions (#2529)
* lemma of was-> was

* added new pairs issue @2486

* added article tests
2018-07-09 11:10:15 +02:00
Ole Henrik Skogstrøm
c21efea9bb Add sent property to token (#2521)
* Add sent property to token

* Refactored and cleaned up copy paste errors.
2018-07-06 15:54:15 +02:00
kleinay
a82c3153ad fix issue #2452 - displacy arrow direction is always forward (#2506) (closes #2452)
<!--- Provide a general summary of your changes in the title. -->
Referring #2452, fixing displacy arrow directions to match the input. 

## Description
The fix is simply replacing `direction is 'left'` with `direction == 'left'` to include the case `direction` is a `str` and not a `unicode`.

### Types of change
bug fix

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [ ] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-04 14:12:08 +02:00
Bùi Trung Chí
9af46b4f1b Fix loading tokenizer with custom prefix search (#2495)
* Add contributor agreement

* Fix loading tokenizer with cutom prefix search
2018-07-04 12:56:07 +02:00
Ole Henrik Skogstrøm
d16cb6bee6 Accept Span to displacy render (#2478) (closes #2477)
* Add Span to displacy render

* Fix span support, errors and add tests
2018-06-25 14:55:16 +02:00
Muhammad Irfan
f33c703066 Add Urdu Language Support (#2430)
* added Urdu language support.

* added Urdu language tests.

* modified conftest.py for Urdu language support.

* added spacy contributor agreement.
2018-06-22 11:14:03 +02:00