Commit Graph

4994 Commits

Author SHA1 Message Date
Ines Montani
48b1bc44d3 Update version to 2.0.16 2018-10-15 14:39:25 +02:00
Ines Montani
a0f6647160 Increment version 2018-10-15 14:20:55 +02:00
Ines Montani
7bc7fa8f1e Increment version 2018-10-15 01:40:44 +02:00
Matthew Honnibal
8612b75890 Set version to 2.0.14 2018-10-15 00:10:04 +02:00
Matthew Honnibal
d6e9cf8b09 Set version to 2.0.14.dev1 2018-10-15 00:09:02 +02:00
Matthew Honnibal
8ccfa52d19 Unhack prefer_gpu 2018-10-14 23:27:09 +02:00
Matthew Honnibal
41adf3572b Set version to v2.0.14 2018-10-14 23:15:34 +02:00
Matthew Honnibal
38aa835ada Workaround bug in thinc require_gpu 2018-10-14 23:15:08 +02:00
Matthew Honnibal
91593b7378 Add tests for prefer_gpu() and require_gpu() 2018-10-14 23:05:22 +02:00
Matthew Honnibal
62c70b3163 Import prefer_gpu and require_gpu functions from Thinc 2018-10-14 23:03:06 +02:00
Ines Montani
295da0f11b Increment version to 2.0.14.dev0 2018-10-14 16:37:46 +02:00
Matthew Honnibal
7de0dcb91f Merge branch 'master' of https://github.com/explosion/spaCy 2018-10-14 16:12:23 +02:00
Keshan
cb075c8e72 Adding "This is a sentence" example to Sinhala (#2846) 2018-10-14 00:06:40 +02:00
Matthew Honnibal
9cfab5933a Set version to 2.0.13 2018-10-13 19:42:16 +02:00
Matthew Honnibal
6a6ae5b0af Merge branch 'master' of https://github.com/explosion/spaCy 2018-10-13 19:41:00 +02:00
mauryaland
36514b5762 Rule-based French Lemmatizer (#2818)

## Description

Add a rule-based French Lemmatizer, following the English one and the excellent PR for [Greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class.

### Types of change

- The lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html); I used the XML version.
- Add several files containing an exhaustive list of words for each part of speech
- Add some lemma rules
- Add POS tags that are not checked in the standard Lemmatizer, i.e. PRON, DET, ADV and AUX
- Modify the Lemmatizer class to check the lookup table as a last resort if the POS is not mentioned
- Modify the lemmatize function to check the lookup table as a last resort
- Update the init files so the model can support all the functionality mentioned above
- Add words to tokenizer_exceptions_list.py with respect to the regex used in tokenizer_exceptions.py

## Checklist
- [X] I have submitted the spaCy Contributor Agreement.
- [X] I ran the tests, and all new and existing tests passed.
- [X] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-10-13 16:38:21 +02:00
Matthew Honnibal
de46286107 Merge branch 'master' of https://github.com/explosion/spaCy 2018-10-13 16:11:16 +02:00
Ines Montani
cb57b35bb8 Also include lowercase norm exceptions 2018-10-13 15:37:30 +02:00
JKhakpour
74a30d883c Add Persian (Farsi) language support (#2797) 2018-10-13 15:31:49 +02:00
Matthew Honnibal
c3ddf98b1e Set version to 2.0.13.dev4 2018-10-13 15:20:59 +02:00
Marina Lysyuk
b76fe08308 Correcting lang/ru/examples.py (#2845)
* Correct some grammatical inaccuracies in lang\ru\examples.py; filled Contributor Agreement

* Correct some grammatical inaccuracies in lang\ru\examples.py

* Move contributor agreement to separate file
2018-10-13 15:19:43 +02:00
Matthew Honnibal
67ddce68d8 Unskip test 2018-10-02 23:47:55 +02:00
Matthew Honnibal
4cf5ce2cc2 Revert "Remove problematic test"
This reverts commit bdebbef455.
2018-10-02 23:47:24 +02:00
Matthew Honnibal
bdebbef455 Remove problematic test 2018-10-02 23:16:29 +02:00
Matthew Honnibal
6afc6ffe56 Skip seemingly problematic test 2018-10-02 22:33:40 +02:00
Matthew Honnibal
9e4079ddb2 Merge branch 'master' of https://github.com/explosion/spaCy 2018-10-02 19:44:43 +02:00
Matthew Honnibal
40f228c2f2 Set version to 2.0.13.dev3 2018-10-02 19:44:25 +02:00
Filipe Caixeta
6c498f9ff4 Update Portuguese Language (#2790)
* Add words to Portuguese language _num_words

* Portuguese - Add/remove stopwords, fix tokenizer, add currency symbols

* Extended punctuation and norm_exceptions in the Portuguese language
2018-09-29 09:51:45 +02:00
Matthew Honnibal
6430b1fe64 Restore encoding arg on msgpack-numpy 2018-09-27 15:58:21 +02:00
Matthew Honnibal
2ac69facc6 Fix Python 2 test failure 2018-09-27 15:34:16 +02:00
Matthew Honnibal
72778375fb Merge branch 'master' of https://github.com/explosion/spaCy 2018-09-27 13:54:49 +02:00
Matthew Honnibal
96fe314d8d Fix bug when too many entity types. Fixes #2800 2018-09-27 13:54:34 +02:00
Suraj Rajan
bbdc6456c6 Set up dependency tree pattern matching skeleton (#2732) 2018-09-27 13:27:18 +02:00
Matthew Honnibal
8809dc4514 Remove deprecated encoding argument to msgpack 2018-09-27 12:56:23 +02:00
Matthew Honnibal
bae6b3e2b3 Merge branch 'master' of https://github.com/explosion/spaCy 2018-09-27 12:50:31 +02:00
Ines Montani
71cdbeada7 Revert "Also include lowercase norm exceptions"
This reverts commit 70f4e8adf3.
2018-09-27 12:29:25 +02:00
darindf
8227566805 Fix error (#2802)
* Fix error
ValueError: cannot resize an array that references or is referenced
by another array in this way.  Use the resize function

* added spaCy Contributor Agreement
2018-09-26 21:31:03 +02:00
Ines Montani
5e0dfb34fa Merge branch 'master' of https://github.com/explosion/spaCy 2018-09-26 11:13:58 +02:00
Ines Montani
70f4e8adf3 Also include lowercase norm exceptions 2018-09-25 12:22:02 +02:00
Keshan
9a016d17c2 Adding basic support for Sinhala language. (#2788)
* adding Sinhala language package, stop words, examples and lex_attrs.

* Adding contributor agreement

* Updating contributor agreement
2018-09-25 12:18:25 +02:00
Ines Montani
3c4e3ade30 Fix typo (closes #2784) 2018-09-21 10:45:11 +02:00
mauryaland
68b3c544d5 Adding French hyphenated first name (#2786) 2018-09-21 10:38:13 +02:00
Andrew Ongko
81564cc4e8 Update Indonesian model (#2752)
* adding e-KTP in tokenizer exceptions list

* add exception token

* removing lines containing spaces, as they won't matter since we use the .split() method in the end; added new tokens to the exceptions

* add tokenizer exceptions list

* combining base_norms with norm_exceptions

* adding norm_exception

* fix double key in lemmatizer

* remove unused import on punctuation.py

* reformat stop_words to reduce number of lines, improve readability

* updating tokenizer exception

* implement is_currency for lang/id

* adding orth_first_upper in tokenizer_exceptions

* update the norm_exception list

* remove bunch of abbreviations

* adding contributors file
2018-09-14 12:30:32 +02:00
Filipe Caixeta
fe515085f3 Add words to Portuguese language _num_words (#2759)
* Add words to Portuguese language _num_words
2018-09-14 12:30:16 +02:00
Grivaz
aeba99ab0d Introduces a bulk merge function, in order to solve issue #653 (#2696)
* Fix comment

* Introduce bulk merge to increase performance on many span merges

* Sign contributor agreement

* Implement pull request suggestions
2018-09-10 16:41:42 +02:00
tyburam
476472d181 lex_attrs for Polish language (#2750)
* Signed spaCy contributor agreement

* Added Polish version of English lex_attrs
2018-09-10 11:53:57 +02:00
Sainath Adapa
77139bc03c Basic support for Telugu language (#2751) 2018-09-10 11:53:18 +02:00
Maxim Kupfer
cebe50b5b8 Remove ')' for clarity (#2737)
Sorry, I don't mean to be nitpicky; I just noticed this when going through the CLI and thought it was a quick fix. That said, if this was intentional, then please let me know.
2018-09-10 11:31:49 +02:00
Piotr Żelasko
bdb2165bd1 Less norm computations in token similarity (#2730)
* Less norm computations in token similarity

* Contributor agreement
2018-09-05 21:50:23 +02:00
Aniruddha Adhikary
4530ddcc51 Update Bengali token rules for hyphen and digits (#2731) 2018-09-05 21:49:00 +02:00