Commit Graph

9070 Commits

Author SHA1 Message Date
Matthew Honnibal
66a3f2ba21 Lower-case text before alignment 2018-08-16 00:42:36 +02:00
Matthew Honnibal
595c893791 Expose noise_level option in train CLI 2018-08-16 00:41:44 +02:00
Matthew Honnibal
8365226bf3 Fix lookup of symbols in vocab. 2018-08-15 23:43:34 +02:00
Matthew Honnibal
b9f0588580 Set version to v2.1.0a1 2018-08-15 17:22:39 +02:00
Matthew Honnibal
e968016417 Note link between issues #2671 and #2675 2018-08-15 17:18:28 +02:00
Matthew Honnibal
63bdc734ba Skip flakey test 2018-08-15 16:56:55 +02:00
Matthew Honnibal
ce512e1d47 Fix #2671: Incorrect match ID on some patterns 2018-08-15 16:19:08 +02:00
Matthew Honnibal
f12b9190f6 Xfail test for issue #2671 2018-08-15 15:55:31 +02:00
Matthew Honnibal
7cfa665ce6 Add failing test for issue 2671: Incorrect rule ID returned from matcher 2018-08-15 15:54:33 +02:00
Matthew Honnibal
1b2a5869ab Set version to v2.1.0a2.dev0 2018-08-15 15:38:52 +02:00
Matthew Honnibal
5080760288 Add extra comment on 'add label' in parser 2018-08-15 15:37:24 +02:00
Matthew Honnibal
6e749d3c70 Skip flakey parser test 2018-08-15 15:37:04 +02:00
Ines Montani
fd9d175a53 Update live code [ci skip] 2018-08-15 15:28:48 +02:00
Matthew Honnibal
48ed1ca29d Add branch option to push-tag script 2018-08-15 03:16:43 +02:00
Matthew Honnibal
6ea981c839 Add converter for jsonl NER data 2018-08-14 14:04:32 +02:00
Matthew Honnibal
a9fb6d5511 Fix docs2jsonl function 2018-08-14 14:03:48 +02:00
Matthew Honnibal
ea2edd1e2c Merge branch 'feature/docs_to_json' into develop 2018-08-14 13:23:42 +02:00
Matthew Honnibal
6ec236ab08 Fix label-clobber bug in parser.begin_training()
The parser.begin_training() method was rewritten in v2.1. The rewrite
introduced a regression, where if you added labels prior to
begin_training(), these labels were discarded. This patch fixes that.
2018-08-14 13:20:19 +02:00
Matthew Honnibal
02c5c114d0 Fix usage of deprecated freqs.txt in init-model 2018-08-14 13:19:15 +02:00
Matthew Honnibal
2a5a61683e Add function to get train format from Doc objects
Our JSON training format is annoying to work with, and we've wanted to
retire it for some time. In the meantime, we can at least add some
missing functions to make it easier to live with.

This patch adds a function that generates the JSON format from a list
of Doc objects, one per paragraph. This should be a convenient way to handle
a lot of data conversions: whatever format you have the source
information in, you can use it to setup a Doc object. This approach
should offer better future-proofing as well. Hopefully, we can steadily
rewrite code that is sensitive to the current data-format, so that it
instead goes through this function. Then when we change the data format,
we won't have such a problem.
2018-08-14 13:13:10 +02:00
Matthew Honnibal
4336397ecb Update develop from master 2018-08-14 03:04:28 +02:00
Matthew Honnibal
13fa550b36 Merge branch 'master' of https://github.com/explosion/spaCy 2018-08-14 02:32:01 +02:00
Ioannis Daras
fe94e696d3 Optimize Greek language support (#2658) 2018-08-14 02:31:32 +02:00
Wojciech Łukasiewicz
3953e967a0 User correct variable name in the examples (#2664)
* correct naming

* add contributor agreement
2018-08-13 22:21:24 +02:00
Matthew Honnibal
85000ea13b Increment version to 2.0.13.dev2 2018-08-10 00:43:55 +02:00
Matthew Honnibal
c4ac981e6d Try again to filter warnings 2018-08-10 00:42:54 +02:00
Matthew Honnibal
ae7fc42a41 Increment version to v2.0.13.dev1 2018-08-10 00:14:31 +02:00
Matthew Honnibal
7be9118be3 Require numpy>=1.15.0 to avoid the RuntimeWarning 2018-08-10 00:14:13 +02:00
Matthew Honnibal
19f5046934 Undoing warning suppression, as doesnt really work 2018-08-10 00:13:34 +02:00
Matthew Honnibal
3fb828352d Set version to 2.0.13.dev0 2018-08-09 23:49:34 +02:00
Matthew Honnibal
1c0614ecd2 Catch numpy warning 2018-08-09 23:49:24 +02:00
Aashish Gangwani
6eebfc7bf4 Added numbers to ../lang/hi/lex_attrs.py (#2629)
I have added numbers in hindi lex_attrs.py file according to Indian numbering system(https://en.wikipedia.org/wiki/Indian_numbering_system) and here are there english translations:
'शून्य' => zero 
'एक' => one
'दो' => two
'तीन' => three
 'चार' => four
'पांच' => five
'छह' => six
'सात'=>seven 
'आठ' => eight
'नौ' => nine
'दस' => ten
'ग्यारह' => eleven
'बारह' => twelve
 'तेरह' => thirteen
'चौदह' => fourteen
'पंद्रह' => fifteen
'सोलह'=> sixteen
'सत्रह' => seventeen
'अठारह' => eighteen
'उन्नीस' => nineteen
'बीस' => twenty
 'तीस' => thirty
'चालीस' => forty
'पचास' => fifty
'साठ' => sixty
'सत्तर' => seventy
'अस्सी' => eighty
'नब्बे' => ninety
'सौ' => hundred
'हज़ार' => thousand
'लाख' => hundred thousand
'करोड़' => ten million
'अरब' => billion
'खरब' => hundred billion

<!--- Provide a general summary of your changes in the title. -->

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-08-08 16:06:11 +02:00
Ines Montani
71723cece1 Add note on visualizing long texts ans sentences (see #2636) [ci skip] 2018-08-08 15:28:21 +02:00
Ines Montani
6147bd3eb4 Fix link target (closes #2645) [ci skip] 2018-08-08 15:03:52 +02:00
Ines Montani
8c47da1f19 Update Language serialization docs (see #2628) [ci skip]
Add note on using from_disk and from_bytes via subclasses and add example
2018-08-07 14:17:57 +02:00
Emil Stenström
3834f4146d Add abbreviations from UD_Swedish-Talbanken (#2613)
* Add abbreviations from UD_Swedish-Talbanken

* Add contributor agreement.
2018-08-07 13:53:17 +02:00
Ole Henrik Skogstrøm
0473add369 Feature/span ents (#2599)
* Created Span.ents property

* Add tests for span.ents

* Add tests for start and end of sentence
2018-08-07 13:52:32 +02:00
Xiaoquan Kong
87fa847e6e Fix Chinese language related bugs (#2634) 2018-08-07 11:26:31 +02:00
Matthew Honnibal
664cfc29bc Merge branch 'master' of https://github.com/explosion/spaCy 2018-08-07 10:49:39 +02:00
Matthew Honnibal
2278c9734e Fix spelling error #2640 2018-08-07 10:49:21 +02:00
Xiaoquan Kong
f0c9652ed1 New Feature: display more detail when Error E067 (#2639)
* Fix off-by-one error

* Add verbose option

* Update verbose option

* Update documents for verbose option
2018-08-07 10:45:29 +02:00
Emil Stenström
1914c488d3 Swedish: Exceptions for single letter words ending sentence (#2615)
* Exceptions for single letter words ending sentence

Sentences ending in "i." (as in "... peka i."), "m." (as in "...än 2000 m."), should be tokenized as two separate tokens.

* Add test
2018-08-05 14:14:30 +02:00
Matthew Honnibal
860f5bd91f Add test for issue 2626 2018-08-05 13:46:57 +02:00
Matthew Honnibal
f762d52b24 Add example for Issue #2627 2018-08-05 13:33:52 +02:00
Ines Montani
6a4360e425 Update universe [ci skip] 2018-08-02 17:33:08 +02:00
Sami
dbc993f5b3 Updating description and code snippet spacy-lefff (#2623)
* updating description and code snippet spacy-lefff

* contributors agreement
2018-08-02 17:25:27 +02:00
Vikas Kumar Yadav
23876dbc70 Create vikaskyadav.md (#2621) 2018-08-02 14:03:44 +02:00
Vikas Kumar Yadav
d3e21aad64 Update _benchmarks.jade (#2618) 2018-08-02 00:28:28 +02:00
Brian Phillips
8227de0099 Update language.jade (#2616) 2018-07-31 12:34:42 +02:00
Ioannis Daras
055cc0de44 Bug fix to pseudocode for tokenizer customization (#2604) 2018-07-27 11:04:12 +02:00