Commit Graph

9478 Commits

Author SHA1 Message Date
mauryaland
36514b5762 Rule-based French Lemmatizer (#2818)
<!--- Provide a general summary of your changes in the title. -->

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

Add a rule-based French Lemmatizer following the english one and the excellent PR for [greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class.

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

- Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version.
- Add several files containing exhaustive list of words for each part of speech 
- Add some lemma rules
- Add POS that are not checked in the standard Lemmatizer, i.e PRON, DET, ADV and AUX
- Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentionned
- Modify the lemmatize function to check in lookup table as a last resort
- Init files are updated so the model can support all the functionalities mentioned above
- Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [X] I have submitted the spaCy Contributor Agreement.
- [X] I ran the tests, and all new and existing tests passed.
- [X] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-10-13 16:38:21 +02:00
Matthew Honnibal
de46286107 Merge branch 'master' of https://github.com/explosion/spaCy 2018-10-13 16:11:16 +02:00
Ines Montani
fa23be0f3c Remove in favour of https://github.com/explosion/spaCy/graphs/contributors 2018-10-13 15:46:57 +02:00
Ines Montani
cb57b35bb8 Also include lowercase norm exceptions 2018-10-13 15:37:30 +02:00
JKhakpour
74a30d883c Add Persian(Farsi) language support (#2797) 2018-10-13 15:31:49 +02:00
Matthew Honnibal
c3ddf98b1e Set version to 2.0.13.dev4 2018-10-13 15:20:59 +02:00
Marina Lysyuk
b76fe08308 Correcting lang/ru/examples.py (#2845)
* Correct some grammatical inaccuracies in lang\ru\examples.py; filled Contributor Agreement

* Correct some grammatical inaccuracies in lang\ru\examples.py

* Move contributor agreement to separate file
2018-10-13 15:19:43 +02:00
Jacopo Farina
42c42376a3 Visual C++ link updated (#2842) (closes #2841) [ci skip]
* New landing page

* Add contribution agreement
2018-10-12 14:59:45 +02:00
Ines Montani
4cd9ec0f00
💫 Update training examples and use minibatching (#2830)
<!--- Provide a general summary of your changes in the title. -->

## Description
Update the training examples in `/examples/training` to show usage of spaCy's `minibatch` and `compounding` helpers ([see here](https://spacy.io/usage/training#tips-batch-size) for details). The lack of batching in the examples has caused some confusion in the past, especially for beginners who would copy-paste the examples, update them with large training sets and experienced slow and unsatisfying results.

### Types of change
enhancements

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-10-10 01:40:29 +02:00
Matthew Honnibal
f784e42ffe Try older version of regex 2018-10-03 00:23:40 +02:00
Matthew Honnibal
67ddce68d8 Unskip test 2018-10-02 23:47:55 +02:00
Matthew Honnibal
4cf5ce2cc2 Revert "Remove problematic test"
This reverts commit bdebbef455.
2018-10-02 23:47:24 +02:00
Matthew Honnibal
e4fd2ccd07 Try previous version of regex 2018-10-02 23:37:17 +02:00
Matthew Honnibal
bdebbef455 Remove problematic test 2018-10-02 23:16:29 +02:00
Matthew Honnibal
6afc6ffe56 Skip seemingly problematic test 2018-10-02 22:33:40 +02:00
Matthew Honnibal
9e4079ddb2 Merge branch 'master' of https://github.com/explosion/spaCy 2018-10-02 19:44:43 +02:00
Matthew Honnibal
40f228c2f2 Set version to 2.0.13.dev3 2018-10-02 19:44:25 +02:00
Matthew Honnibal
9937ff93e5 Update regex version dependency 2018-10-02 19:43:59 +02:00
Ines Montani
7806deceb4 Fix typo (closes #2815) [ci skip] 2018-10-01 10:49:29 +02:00
Ines Montani
ea20b72c08 💫 Make like_num work for prefixed numbers (#2808)
* Only split + prefix if not numbers

* Make like_num work for prefixed numbers

* Add test for like_num
2018-10-01 10:49:14 +02:00
John Stewart
9faea3ff10 Update Keras Example for (Parikh et al, 2016) implementation (#2803)
* bug fixes in keras example

* created contributor agreement

* baseline for Parikh model

* initial version of parikh 2016 implemented

* tested asymmetric models

* fixed grevious error in normalization

* use standard SNLI test file

* begin to rework parikh example

* initial version of running example

* start to document the new version

* start to document the new version

* Update Decompositional Attention.ipynb

* fixed calls to similarity

* updated the README

* import sys package duh

* simplified indexing on mapping word to IDs

* stupid python indent error

* added code from https://github.com/tensorflow/tensorflow/issues/3388 for tf bug workaround
2018-10-01 10:28:45 +02:00
Ioannis Daras
405a826436 Correct error in spacy universe docs concerning spacy-lookup (#2814) 2018-10-01 10:24:50 +02:00
Filipe Caixeta
6c498f9ff4 Update Portuguese Language (#2790)
* Add words to portuguese language _num_words

* Add words to portuguese language _num_words

* Portuguese - Add/remove stopwords, fix tokenizer, add currency symbols

* Extended punctuation and norm_exceptions in the Portuguese language
2018-09-29 09:51:45 +02:00
Matthew Honnibal
b39810d692 Fix copy_reg compatibility on _serialize module 2018-09-28 15:23:14 +02:00
Matthew Honnibal
f82f8ba5dd Fix serialization when empty parser model. Closes #2482 2018-09-28 15:18:52 +02:00
Matthew Honnibal
d5a6c63b62 Add regression test for #2482 2018-09-28 15:18:30 +02:00
Matthew Honnibal
e3e9fe18d4 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-09-28 14:27:35 +02:00
Matthew Honnibal
0323f5be0c Fix _serialize module 2018-09-28 14:27:24 +02:00
Matthew Honnibal
05b6103a0c Try to fix version pin for msgpack-numpy 2018-09-28 14:07:00 +02:00
Ines Montani
5d56eb70d7 Tidy up tests 2018-09-27 16:41:57 +02:00
Ines Montani
1f1bab9264 Remove unused import 2018-09-27 16:41:37 +02:00
Matthew Honnibal
6430b1fe64 Restore encoding arg on msgpack-numpy 2018-09-27 15:58:21 +02:00
Matthew Honnibal
276aa83d1a Require older msgpack-numpy 2018-09-27 15:34:24 +02:00
Matthew Honnibal
2ac69facc6 Fix Python 2 test failure 2018-09-27 15:34:16 +02:00
Matthew Honnibal
72778375fb Merge branch 'master' of https://github.com/explosion/spaCy 2018-09-27 13:54:49 +02:00
Matthew Honnibal
96fe314d8d Fix bug when too many entity types. Fixes #2800 2018-09-27 13:54:34 +02:00
Suraj Rajan
bbdc6456c6 Set up dependency tree pattern matching skeleton (#2732) 2018-09-27 13:27:18 +02:00
Matthew Honnibal
8809dc4514 Remove deprecated encoding argument to msgpack 2018-09-27 12:56:23 +02:00
Matthew Honnibal
bae6b3e2b3 Merge branch 'master' of https://github.com/explosion/spaCy 2018-09-27 12:50:31 +02:00
Ines Montani
71cdbeada7 Revert "Also include lowercase norm exceptions"
This reverts commit 70f4e8adf3.
2018-09-27 12:29:25 +02:00
Charles-Axel Dein
014dd47c70 Add jupyter=True to displacy.render in documentation (#2806) 2018-09-27 12:28:04 +02:00
Przemysław Hojnacki
966b583d5e agreement of contributor, may I introduce a tiny pl languge contribution (#2799)
* Contributors agreement

* Contributors agreement

* Contributors agreement
2018-09-27 12:25:22 +02:00
Charles-Axel Dein
94ad3c55f1 Add charlax's contributor agreement (#2805) 2018-09-27 12:24:42 +02:00
darindf
8227566805 Fix error (#2802)
* Fix error
ValueError: cannot resize an array that references or is referenced
by another array in this way.  Use the resize function

* added spaCy Contributor Agreement
2018-09-26 21:31:03 +02:00
Ines Montani
5e0dfb34fa Merge branch 'master' of https://github.com/explosion/spaCy 2018-09-26 11:13:58 +02:00
Ines Montani
70f4e8adf3 Also include lowercase norm exceptions 2018-09-25 12:22:02 +02:00
Keshan
9a016d17c2 Adding basic support for Sinhala language. (#2788)
* adding Sinhala language package, stop words, examples and lex_attrs.

* Adding contributor agreement

* Updating contributor agreement
2018-09-25 12:18:25 +02:00
Pranshu Jethmalani
9fd27d777e Fix typo (#2795) [ci skip]
Fixed typo on line 6 "regcognizer --> recognizer"
2018-09-25 12:12:40 +02:00
Matthew Honnibal
b42c123e5d Fix regression introduced by 1759abf1e 2018-09-25 11:08:58 +02:00
Matthew Honnibal
500898907b Fix regression in parser.begin_training() 2018-09-25 11:08:31 +02:00