Commit Graph

4928 Commits

Author SHA1 Message Date
Emil Stenström
1914c488d3 Swedish: Exceptions for single letter words ending sentence (#2615)
* Exceptions for single letter words ending sentence

Sentences ending in "i." (as in "... peka i."), "m." (as in "...än 2000 m."), should be tokenized as two separate tokens.

* Add test
2018-08-05 14:14:30 +02:00
Matthew Honnibal
860f5bd91f Add test for issue 2626 2018-08-05 13:46:57 +02:00
Kaisa (Katarzyna) Korsak
e531a827db Changed conllu2json to be able to extract NER tags (#2594)
* extract ner tags from conllu file if available

* fixed a bug in regex
2018-07-25 22:21:31 +02:00
Dmitry Bruhanov
07d0cc9de7 Update examples.py (#2597) 2018-07-25 22:20:24 +02:00
Matthew Honnibal
66983d8412
Port BenDerPan's Chinese changes to v2 (finally) (#2591)
* add  template files for Chinese

* add  template files for Chinese, and test directory .
2018-07-25 02:47:23 +02:00
ines
f2e3e039b7 Update French stop words (resolves #2540) 2018-07-24 23:41:51 +02:00
Matthew Honnibal
6303ce3d0e Try to fix memory error by moving fr_tokenizer to module scope 2018-07-24 20:09:06 +02:00
Matthew Honnibal
afe3fa4449 Merge branch 'master' of https://github.com/explosion/spaCy 2018-07-24 19:44:31 +02:00
Matthew Honnibal
b2e9e958b9 Add session scoping to tokenizers to try to fix oom on Appveyor 2018-07-24 19:44:18 +02:00
Ines Montani
a43ad114c2
Fix typo [ci skip] 2018-07-24 18:45:40 +02:00
Dmitry Bruhanov
27160b1516 added some widespread written jargon & dialectizms (#2584)
This jargon is not offencive but emotionally colored as funny due to its deviation from the norm for various reasons: immitating a dialect, deliberately wrong spelling emphasizing its low colloquial nature, obsolete form, foreign borrowing with native flections, etc.
Dmitry Briukhanov, Linguist & Pythonist
2018-07-24 18:44:29 +02:00
Matthew Honnibal
90c269e1a9 Set about to v2.0.12 release 2018-07-21 15:09:42 +02:00
Matthew Honnibal
1a1c7304cf Set version to 2.0.12.dev1 2018-07-21 13:08:01 +02:00
ines
1ea881c80b Allow ignoring warnings and only overwrite if set explicitly 2018-07-20 22:50:19 +02:00
Matthew Honnibal
e0caf3ae8c Fix msgpack for new version 2018-07-20 17:32:00 +02:00
Matthew Honnibal
899f1cf442 Add regression test for issue 2179 2018-07-20 17:15:44 +02:00
Matthew Honnibal
9db77fd914 Fix deserialization for msgpack 2018-07-20 14:11:09 +02:00
katarkor
5ca853bee0 changed tag_map, morph_rules, lemmatizer for Norwegian (#2565)
* changed tag_map, morph_rules, lemmatizer for Norwegian

* Move unicode declaration up

Hopefully fixes test failure on Python 2

* Update CONTRIBUTOR_AGREEMENT.md

* Move unicode declarations

Hopefully fixes test this time

* Revert "Merge remote-tracking branch 'origin/patch-1'"

This reverts commit f5ccd5dd0d, reversing
changes made to dd07e180ea.

* Update contributor agreement [ci skip]
2018-07-19 19:38:24 +02:00
Ole Henrik Skogstrøm
6e2930a4a2 Conll(u)-bio converter (#2525)
* Started simple conllxbiluo converter

* Fix missing BIO to BILUO conversion
2018-07-18 18:55:42 +02:00
Ioannis Daras
6ed18412d0 Greek language optimizations (#2558)
* Greek language optimizations

* Add encoding on files containing greek words

* Add encoding on files containing greek words
2018-07-18 18:51:38 +02:00
Paul O'Leary McCann
61ef0739b8 Add Japanese stop words. (#2549)
List created by taking the 2000 top words from a Wikipedia dump and
removing everything that wasn't hiragana.

Tried going through kanji words and deciding what to keep but there were
too many obvious non-stopwords (東京 was in the top 500) and many other
words where it wasn't clear if they should be included or not.
2018-07-17 10:12:48 +02:00
Tero K
f35980f865 Enhancement/lang fi examples (#2547)
* Added a file with examples in finnish

* added contributor agreement
2018-07-15 09:50:27 +02:00
Paul O'Leary McCann
1987f3f784 Add Japanese lemmas (#2543)
This info was already available from Mecab, forgot to add it before.
2018-07-13 10:55:14 +02:00
Eleni170
6042723535 Add support for Greek language (#2535)
* Add contributor agreement

* Support for Greek language

* Fix missing el_tokenizer
2018-07-10 13:48:38 +02:00
Stefan Schweter
3dfc7f86be lemmatizer: correct lemma for Rang (#2537)
<!--- Provide a general summary of your changes in the title. -->

## Description

This PR corrects the German lemma form for the word "Rang". Initially, the lemma form was "ringen", which is not correct, because it refers to the verb ("ringen") and not to the noun ("Rang").

### Types of change

The lemma form for "Rang" is corrected to "Rang", see also the [Duden](https://www.duden.de/rechtschreibung/Rang) entry.

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-10 13:11:19 +02:00
Duygu Altinok
00b9a58558 German lemmatizer additions (#2529)
* lemma of was-> was

* added new pairs issue @2486

* added article tests
2018-07-09 11:10:15 +02:00
Ole Henrik Skogstrøm
c21efea9bb Add sent property to token (#2521)
* Add sent property to token

* Refactored and cleaned up copy paste errors.
2018-07-06 15:54:15 +02:00
kleinay
a82c3153ad fix issue #2452 - displacy arrow direction is always forward (#2506) (closes #2452)
<!--- Provide a general summary of your changes in the title. -->
Referring #2452, fixing displacy arrow directions to match the input. 

## Description
The fix is simply replacing `direction is 'left'` with `direction == 'left'` to include the case `direction` is a `str` and not a `unicode`.

### Types of change
bug fix

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [ ] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-04 14:12:08 +02:00
Bùi Trung Chí
9af46b4f1b Fix loading tokenizer with custom prefix search (#2495)
* Add contributor agreement

* Fix loading tokenizer with cutom prefix search
2018-07-04 12:56:07 +02:00
Ole Henrik Skogstrøm
d16cb6bee6 Accept Span to displacy render (#2478) (closes #2477)
* Add Span to displacy render

* Fix span support, errors and add tests
2018-06-25 14:55:16 +02:00
Muhammad Irfan
f33c703066 Add Urdu Language Support (#2430)
* added Urdu language support.

* added Urdu language tests.

* modified conftest.py for Urdu language support.

* added spacy contributor agreement.
2018-06-22 11:14:03 +02:00
himkt
14d9007efd fix wrong indexing (#2416)
* fix wrong indexing

* add agreement
2018-06-19 10:20:57 +02:00
Aliia E
428bae66b5 Add Tatar Language Support (#2444)
* add Tatar lang support

* add Tatar letters

* add Tatar tests

* sign contributor agreement

* sign contributor agreement [x]

* remove comments from Language class

* remove all template comments
2018-06-19 10:17:53 +02:00
Cory Hurst
446f5ec41b Silent keyword in info function in init (#2459)
* Pass through "silent" kwarg to the wrapper in the spacy module init.
reference issue  #2196

* Pass through "silent" kwarg to the wrapper in the spacy module init.
reference issue  #2196

* contributor agreement
2018-06-18 12:24:21 +02:00
Nour Shalabi
a169b79092 Additions to Arabic stop words. (#2422)
* Additions to Arabic stop words.

* Create nourshalabi.md
2018-06-08 02:33:23 +02:00
ines
b8ef9c1000 Fix model names in conftest (see #2379) 2018-05-30 14:10:20 +02:00
Maciej
c7d53348d7 Fix bug in CLI iob and ner converter (#2392) (fixes #2385)
* issue_2385 add tests for iob_to_biluo converter function

* issue_2385 fix and modify iob_to_biluo function to accept either iob or biluo tags in cli.converter

* issue_2385 add test to fix b char bug

* add contributor agreement

* fill contributor agreement
2018-05-30 12:28:44 +02:00
ansgar-t
9732988951 escape html in displacy.render (#2378) (closes #2361)
## Description
Fix for issue #2361 :
replace &, <, >, " with &amp;amp; , &amp;lt; , &amp;gt; , &amp;quot; in before rendering svg

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
(As discussed in the comments to #2361)
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-05-28 18:36:41 +02:00
James Messinger
4515e96e90 Better formatting for spacy train CLI (#2357)
* Better formatting for `spacy train` CLI

Changed to use fixed-spaces rather than tabs to align table headers and data.

### Before:
```
Itn.    P.Loss  N.Loss  UAS     NER P.  NER R.  NER F.  Tag %   Token %
0       4618.857        2910.004        76.172  79.645  67.987  88.732  88.261  100.000 4436.9  6376.4
1       4671.972        3764.812        74.481  78.046  62.374  82.680  88.377  100.000 4672.2  6227.1
2       4742.756        3673.473        71.994  77.380  63.966  84.494  90.620  100.000 4298.0  5983.9
```

### After:
```
Itn.  Dep Loss  NER Loss  UAS     NER P.  NER R.  NER F.  Tag %   Token %  CPU WPS  GPU WPS
0     4618.857  2910.004  76.172  79.645  67.987  88.732  88.261  100.000  4436.9   6376.4
1     4671.972  3764.812  74.481  78.046  62.374  82.680  88.377  100.000  4672.2   6227.1
2     4742.756  3673.473  71.994  77.380  63.966  84.494  90.620  100.000  4298.0   5983.9
```

* Added contributor file
2018-05-25 13:08:45 +02:00
Aristo Rinjuang
432ede04af adding more words and rephrasing (#2351)
* adding more words and rephrasing

* adding a contributor

* tokenizer bugs solved
2018-05-24 11:40:57 +02:00
Jani Monoses
ec62cadf4c Updates to Romanian support (#2354)
* Add back Romanian in conftest

* Romanian lex_attr

* More tokenizer exceptions for Romanian

* Add tests for some Romanian tokenizer exceptions
2018-05-24 11:40:00 +02:00
cclauss
f7dcaa1f6b Simplify is_config() and normalize_string_keys() (#2305)
* Simplify is_config() and normalize_string_keys()

* Use __in__ to avoid the nested _ands_ and _ors_.
* Dict comprehension directly tracks with the doc string

* Keep more basic loop in normalize_string_keys

* Whitespace
2018-05-21 01:54:35 +02:00
Ines Montani
d4cc736b7c 💫 Improve model downloads: check for existing install, customise pip and use requests library again (#2346)
* Go back to using requests instead of urllib (closes #2320)

Fewer dependencies are good, but this one was simply causing too many other problems around SSL verification and Python 2/3 compatibility. requests is a popular enough package that it's okay for spaCy to depend on it – and this will hopefully make model downloads less flakey.

* Only download model if not installed (see #1456)

Use #egg=model==version to allow pip to check for existing installations. The download is only started if no installation matching the package/version is found. Fixes a long-standing inconvenience.

* Pass additional options to pip when installing model (resolves #1456)

Treat all additional arguments passed to the download command as pip options to allow user to customise the command. For example:

python -m spacy download en --user

* Add CLI option to enable installing model package dependencies

* Revert "Add CLI option to enable installing model package dependencies"

This reverts commit 9336ffe695.

* Update documentation
2018-05-20 20:26:56 +02:00
ines
b59e3b157f Don't require attrs argument in Doc.retokenize and allow both ints and unicode (resolves #2304) 2018-05-20 15:15:37 +02:00
ines
5768df4f09 Add SimpleFrozenDict util to use as default function argument 2018-05-20 15:13:37 +02:00
Matthew Honnibal
581d318971 Fix conftest 2018-05-15 00:54:45 +02:00
Tahar Zanouda
00417794d3 Add Arabic language (#2314)
* added support for Arabic lang

* added Arabic language support

* updated conftest
2018-05-15 00:27:19 +02:00
Jani Monoses
0e08e49e87 Lemmatizer ro (#2319)
* Add Romanian lemmatizer lookup table.

Adapted from http://www.lexiconista.com/datasets/lemmatization/
by replacing cedillas with commas (ș and ț).

The original dataset is licensed under the Open Database License.

* Fix one blatant issue in the Romanian lemmatizer

* Romanian examples file

* Add ro_tokenizer in conftest

* Add Romanian lemmatizer test
2018-05-12 15:20:04 +02:00
Jani Monoses
42b34832e4 Update Romanian stopword list (#2316)
* Contributor agreement for janimo

* Update Romanian stopword list

Include the correct spellings of all the words already in the repo
that are using cedillas (ş and ţ) instead of commas (ș and ț).

Add another unrelated spelling fix.

See https://github.com/stopwords-iso/stopwords-ro/pull/1 and
https://github.com/stopwords-iso/stopwords-ro/pull/2
2018-05-10 12:16:56 +02:00
Lucas Abbade
be7fdc59d1 Update lex_attrs.py (#2307)
* Update lex_attrs.py

Fixed spelling mistakes of some numbers (according to Brazilian Portuguese).

* Update lex_attrs.py

As requested, I've included the correct spelling for both Brazilian Portuguese and Portuguese Portuguese.

I will advise however, that the two are separated in the future. Brazilian Portuguese is a very different language from the original one, although most of the writing is unified, the way people talk in both countries is radically different. Keeping both languages as one may lead to bigger issues in the future, especially when it comes to spell checking.
2018-05-09 20:49:31 +02:00