Commit Graph

8915 Commits

Author SHA1 Message Date
Amandine Périnet
eef11a7a2c French lemmatization: correcting wrong lemmas in the lookup dictionnary (#3104)
* modifying French lookup that contained wrong lemmas

* correcting wrong line breaks on hyphen

* adding contributor agreement for amperinet@

* correcting a typo
2019-01-07 14:15:19 +01:00
Ines Montani
ac8487a96a Adjust pytest pin
Somehow, 4.1.x seems to cause test failure due to get_marker – possibly needs to be investigated for spacy-models/tests and likely not relevant on develop anymore
2019-01-07 12:08:19 +01:00
Álvaro Abella Bascarán
e03e1eee92 Bugfix/get lca matrix (#3110)
This PR adds a test for an untested case of `Span.get_lca_matrix`, and fixes a bug for that scenario, which I introduced in [this PR](https://github.com/explosion/spaCy/pull/3089) (sorry!).

## Description
The previous implementation of get_lca_matrix was failing for the case `doc[j:k].get_lca_matrix()` where `j > 0`. A test has been added for this case and the bug has been fixed.

### Types of change
Bug fix

## Checklist

- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-01-06 19:07:50 +01:00
alvations
9972716e01 Create alvations.md (#3119) 2019-01-05 13:11:06 +01:00
alvations
f43338a4c5 Joblib site has moved. (#3118) 2019-01-05 13:10:54 +01:00
Álvaro Abella Bascarán
6fe276f85d Fix issue 2396 (#3089)
* Test on #2396: bug in Doc.get_lca_matrix()

* reimplementation of Doc.get_lca_matrix(), (closes #2396)

* reimplement Span.get_lca_matrix(), and call it from Doc.get_lca_matrix()

* tests Span.get_lca_matrix() as well as Doc.get_lca_matrix()

* implement _get_lca_matrix as a helper function in doc.pyx; call it from Doc.get_lca_matrix and Span.get_lca_matrix

* use memory view instead of np.ndarray in _get_lca_matrix (faster)

* fix bug when calling Span.get_lca_matrix; return lca matrix as np.array instead of memoryview

* cleaner conditional, add comment
2018-12-29 18:02:26 +01:00
Sofie
b7916fffcf Fixing few typos in the documentation (#3103)
* few typos / small grammatical errors corrected in documentation

* one more typo

* one last typo
2018-12-28 15:52:26 +01:00
Will Price
4a6af0852a Improve random prefix generation in displaCy arcs (#3096)
* Improve random prefix generation in displaCy arcs

* Add @willprice contributor agreement
2018-12-27 14:46:02 +01:00
Özcan Kasal
b573ebca77 trilyon forgotten (#3083)
* trilyon forgotten

* contributor added
2018-12-27 14:44:23 +01:00
Ines Montani
2dc6c52ccc Update displayed Binder version (see #3077) [ci skip] 2018-12-20 17:36:19 +01:00
Muhammad Irfan
2e84ec1513 Fixed ISO code for Urdu. (#3073) 2018-12-20 12:28:53 +01:00
Kirill Bulygin
10189d9092 Fix the first nlp call for ja (closes #2901) (#3065)
* Fix the first `nlp` call for `ja` (closes #2901)

* Add unicode declaration, formatting and use relative import
2018-12-18 14:53:50 +01:00
Brixjohn
52f3c95004 Added alpha support for Tagalog language (#3062)
I have added alpha support for the Tagalog language from the Philippines. It is the basis for the country's national language Filipino. I have heavily based the format to the EN and ES languages.

I have provided several words in the lemmatizer lookup table, added stop words from a source, translated numeric words to its Tagalog counterpart, added some tokenizer exceptions, and kept the tag map the same as the English language.

While the alpha language passed the preliminary testing that you provided, I think it needs more data to be useful for most cases.

* Added alpha support for Tagalog language

* Edited contributor template

* Included SCA; Reverted templates

* Fixed SCA template

* Fixed changes in SCA template
2018-12-18 13:08:38 +01:00
Ines Montani
c9a89bba50 Don't call begin_training if updating new model (see #3059) [ci skip] 2018-12-17 13:45:28 +01:00
Ines Montani
6f1438b5d9 Auto-format example 2018-12-17 13:44:38 +01:00
Amandine Périnet
361554f629 Lemmatization of Adjectives - French : adding rules and vocabulary (#3045)
* modifying FR lemmatisation for Adjectives

* adding contributor agreement for amperinet

* correcting some errors in vocabulary files
2018-12-16 18:11:07 +01:00
Shooter23
6ae8e49bff Fix docstring for is_right_punct(). (#3044) 2018-12-14 10:11:11 +01:00
Matthew Honnibal
e5685d98a2 Fix averaging in textcat example (closes #2745) (#3032) [ci skip] 2018-12-08 13:27:05 +01:00
Ines Montani
8c0f0f50bc Use nlp.make_doc instead of nlp for patterns [ci skip] 2018-12-08 11:56:01 +01:00
Paul O'Leary McCann
7dd21b66d5 Extras require mecab (#3024)
* Add note that Unidic is required for Japanese

This addresses #3001. -POLM

* Add extras_require for mecab with old version

Related to issue #3018.

* mecab → ja

Co-Authored-By: polm <polm@dampfkraft.com>
2018-12-08 06:34:49 +01:00
Aki Ariga
7fcd6419ff Upadate the document for Unidic link with latest version URL (#3022)
* Upadate Unidic link for latest version in document

This patch improves #3017 . The link for Unidic was old version one, so will the lates version.

* Add contributor agreement

* Use more specific link for unidic-cwj
2018-12-07 17:24:48 +01:00
Amandine Périnet
0b44ea23bd Lemmatization of Nouns - French : adding rules and vocabulary (#2992)
* modifying FR lemmatization for nouns

* modifying FR lemmatization for nouns

* adding contributor agreement for amperinet

* adding rules for words with inclusive parentheses wrongly tokenized

* adding contributor agreement for amperinet

* adding a missing comma
2018-12-06 22:42:18 +01:00
Ines Montani
27905a7b14 Remove reference to cuda10 in docs (closes #2894) [ci skip] 2018-12-06 16:05:37 +01:00
Gavriel Loria
9c8c4287bf Accept iob2 and allow generic whitespace (#2999)
* accept non-pipe whitespace as delimiter; allow iob2 filename

* added small documentation note for IOB2 allowance

* added contributor agreement
2018-12-06 15:50:25 +01:00
Amandine Périnet
2457318b7a Lemmatization of Verbs - French : adding rules and vocabulary (#3006)
* updating rules and vocabulary for French lemmatization of verbs

* updating the file with French auxiliary verb

* updating rules and vocabulary for French lemmatization of verbs

* adding contributor agreement for amperinet

* adding rules for words with inclusive parentheses wrongly tokenized
2018-12-06 15:49:28 +01:00
Beate Sildnes
f0d7e206ec Updated wordforms for Norwegian lemmatizer (#3007)
* Updated wordforms for Norwegian lemmatizer

Upload of updated lists of wordforms for the Norwegian lemmatizer (nouns, verbs, adverbs, adjectives and lookup).

* Add spaCy contributor agreement for user beatesi

*  Updated wordforms for Norwegian lemmatizer
2018-12-06 15:46:18 +01:00
Paul O'Leary McCann
b36f6eabfb Add note that Unidic is required for Japanese (#3017)
This addresses #3001. -POLM
2018-12-06 15:14:10 +01:00
Gavriel Loria
ae5601beae Initialize trues to 0.0 in training example (#3004)
* added contributor agreement

* if there are no true positives, precision should be 0.0
2018-12-03 01:33:22 +01:00
Justin DuJardin
33fca8672f fix issue compiling the latest spacy on MacOS 10.3.6 (#2998) 2018-12-02 05:51:11 +01:00
Matthew Honnibal
bbaca991ba Set version to v2.0.18 2018-12-01 03:35:09 +01:00
Matthew Honnibal
05b2336ffa Try again to fix OSX build 2018-12-01 03:12:21 +01:00
Matthew Honnibal
e1a4b0d7f7 Set version to v2.0.18.dev1 2018-12-01 03:12:12 +01:00
Matthew Honnibal
413530b269 Set version to 2.0.18 2018-12-01 03:00:27 +01:00
Matthew Honnibal
24d52876e1 Set version to v2.0.18.dev0 2018-12-01 02:38:04 +01:00
Matthew Honnibal
4895b2e830 Merge branch 'master' of https://github.com/explosion/spaCy 2018-12-01 02:37:21 +01:00
Matthew Honnibal
3f16af123e Try to fix OSX build error 2018-12-01 02:36:56 +01:00
Matthew Honnibal
61abb1ef70 Remove msgpack dependency, to try to fix #2995 2018-12-01 02:36:41 +01:00
Ines Montani
add6469225 Add "new in v2.0.12" note to Span.ents (closes #2986) 2018-11-30 20:50:55 +01:00
Ines Montani
c9bdeafbc7 Don't run weird failing test for now 2018-11-30 16:13:40 +01:00
wxv
06820ef6e7 Fix is_ascii documentation and create contributor file (#2988)
Proposed in #2933
2018-11-30 15:57:58 +01:00
Sofie
585de273cd Fix small typo bug in French regexp + relevant unit test (#2980)
* additional unit test for new entr word not in other lists

* bugfix - unit test works

* use _latin_lower instead of alpha_lower for french

* revert back to ALPHA_LOWER (following the code for languages)

* contributor agreement
2018-11-29 20:16:13 +01:00
Ben Batorsky
658f7e0dc8 OntoNotes url fix (#2981)
The website for OntoNotes 5 is: https://catalog.ldc.upenn.edu/LDC2013T19, currently the named entity section has it as https://catalog.ldc.upenn.edu/ldc2013T19.
2018-11-29 19:34:30 +01:00
Adam Schwalm
00566949de Fix bug where Vocab.prune_vector did not use 'batch_size' (#2977)
Fixes #2976
2018-11-28 19:49:33 +01:00
Ines Montani
58757c5684
Update README.rst 2018-11-26 20:56:17 +01:00
Matthew Honnibal
9e2ff2f583
Fix regex pin to harmonize with conda (#2964) 2018-11-26 19:28:54 +01:00
Ines Montani
968aff2f6a
Update tests for pytest 4.x (#2965)
<!--- Provide a general summary of your changes in the title. -->

## Description
- [x] Replace marks in params for pytest 4.0 compat ([see here](https://docs.pytest.org/en/latest/deprecations.html#marks-in-pytest-mark-parametrize))
- [x] Un-xfail passing tests (some fixes in a recent update resolved a bunch of issues, but tests were apparently never updated here)

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-11-26 18:14:57 +01:00
Ines Montani
c80c20e1ec Sort languages alphabetically [ci skip] 2018-11-26 15:37:53 +01:00
Marc Puig
98fe1ab259 Catalan Language Support (#2940)
* Catalan language Support

* Ddding Catalan to documentation
2018-11-26 15:25:47 +01:00
Ines Montani
1844bc238a Update universe [ci skip] 2018-11-26 14:16:22 +01:00
Ines Montani
048416f265 Fix formatting 2018-11-26 13:27:41 +01:00