Commit Graph

9034 Commits

Author SHA1 Message Date
Suraj Krishnan Rajan
356af7b0a1 Fix tests 2018-09-06 01:39:36 +02:00
Matthew Honnibal
4d2d7d5866 Fix new feature flags 2018-08-27 02:12:39 +02:00
Matthew Honnibal
598dbf1ce0 Fix character-based tokenization for Japanese 2018-08-27 01:51:38 +02:00
Matthew Honnibal
3763e20afc Pass subword_features and conv_depth params 2018-08-27 01:51:15 +02:00
Matthew Honnibal
8051136d70 Support subword_features and conv_depth params in Tok2Vec 2018-08-27 01:50:48 +02:00
Matthew Honnibal
9c33d4d1df Add more hyper-parameters to spacy ud-train
* subword_features: Controls whether subword features are used in the
word embeddings. True by default (specifically, prefix, suffix and word
shape). Should be set to False for languages like Chinese and Japanese.

* conv_depth: Depth of the convolutional layers. Defaults to 4.
2018-08-27 01:48:46 +02:00
Matthew Honnibal
51a9efbf3b Add draft Binder class 2018-08-22 13:12:51 +02:00
Matthew Honnibal
f0e6be689a Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-08-16 17:18:19 +02:00
Matthew Honnibal
5ce459d2ee Fix error in vocab 2018-08-16 17:18:09 +02:00
Ines Montani
aeb49eb625 Update version [ci skip] 2018-08-16 16:56:02 +02:00
Ines Montani
a0eacd3293 Merge branch 'master' into develop 2018-08-16 16:55:05 +02:00
Ines Montani
c0fa9903f4 Update model directory JS [ci skip]
Prevent the default release URL from being overwritten and add license type
2018-08-16 16:54:50 +02:00
Ines Montani
03f661fefb Add Greek to models directory [ci skip] 2018-08-16 16:51:56 +02:00
Matthew Honnibal
00febda2e3 Improve alignment around quotes 2018-08-16 01:04:34 +02:00
Matthew Honnibal
66a3f2ba21 Lower-case text before alignment 2018-08-16 00:42:36 +02:00
Matthew Honnibal
595c893791 Expose noise_level option in train CLI 2018-08-16 00:41:44 +02:00
Matthew Honnibal
8365226bf3 Fix lookup of symbols in vocab. 2018-08-15 23:43:34 +02:00
Matthew Honnibal
b9f0588580 Set version to v2.1.0a1 2018-08-15 17:22:39 +02:00
Matthew Honnibal
e968016417 Note link between issues #2671 and #2675 2018-08-15 17:18:28 +02:00
Matthew Honnibal
63bdc734ba Skip flakey test 2018-08-15 16:56:55 +02:00
Matthew Honnibal
ce512e1d47 Fix #2671: Incorrect match ID on some patterns 2018-08-15 16:19:08 +02:00
Matthew Honnibal
f12b9190f6 Xfail test for issue #2671 2018-08-15 15:55:31 +02:00
Matthew Honnibal
7cfa665ce6 Add failing test for issue 2671: Incorrect rule ID returned from matcher 2018-08-15 15:54:33 +02:00
Matthew Honnibal
1b2a5869ab Set version to v2.1.0a2.dev0 2018-08-15 15:38:52 +02:00
Matthew Honnibal
5080760288 Add extra comment on 'add label' in parser 2018-08-15 15:37:24 +02:00
Matthew Honnibal
6e749d3c70 Skip flakey parser test 2018-08-15 15:37:04 +02:00
Ines Montani
fd9d175a53 Update live code [ci skip] 2018-08-15 15:28:48 +02:00
Matthew Honnibal
48ed1ca29d Add branch option to push-tag script 2018-08-15 03:16:43 +02:00
Matthew Honnibal
6ea981c839 Add converter for jsonl NER data 2018-08-14 14:04:32 +02:00
Matthew Honnibal
a9fb6d5511 Fix docs2jsonl function 2018-08-14 14:03:48 +02:00
Matthew Honnibal
ea2edd1e2c Merge branch 'feature/docs_to_json' into develop 2018-08-14 13:23:42 +02:00
Matthew Honnibal
6ec236ab08 Fix label-clobber bug in parser.begin_training()
The parser.begin_training() method was rewritten in v2.1. The rewrite
introduced a regression, where if you added labels prior to
begin_training(), these labels were discarded. This patch fixes that.
2018-08-14 13:20:19 +02:00
Matthew Honnibal
02c5c114d0 Fix usage of deprecated freqs.txt in init-model 2018-08-14 13:19:15 +02:00
Matthew Honnibal
2a5a61683e Add function to get train format from Doc objects
Our JSON training format is annoying to work with, and we've wanted to
retire it for some time. In the meantime, we can at least add some
missing functions to make it easier to live with.

This patch adds a function that generates the JSON format from a list
of Doc objects, one per paragraph. This should be a convenient way to handle
a lot of data conversions: whatever format you have the source
information in, you can use it to setup a Doc object. This approach
should offer better future-proofing as well. Hopefully, we can steadily
rewrite code that is sensitive to the current data-format, so that it
instead goes through this function. Then when we change the data format,
we won't have such a problem.
2018-08-14 13:13:10 +02:00
Matthew Honnibal
4336397ecb Update develop from master 2018-08-14 03:04:28 +02:00
Matthew Honnibal
13fa550b36 Merge branch 'master' of https://github.com/explosion/spaCy 2018-08-14 02:32:01 +02:00
Ioannis Daras
fe94e696d3 Optimize Greek language support (#2658) 2018-08-14 02:31:32 +02:00
Wojciech Łukasiewicz
3953e967a0 User correct variable name in the examples (#2664)
* correct naming

* add contributor agreement
2018-08-13 22:21:24 +02:00
Matthew Honnibal
85000ea13b Increment version to 2.0.13.dev2 2018-08-10 00:43:55 +02:00
Matthew Honnibal
c4ac981e6d Try again to filter warnings 2018-08-10 00:42:54 +02:00
Matthew Honnibal
ae7fc42a41 Increment version to v2.0.13.dev1 2018-08-10 00:14:31 +02:00
Matthew Honnibal
7be9118be3 Require numpy>=1.15.0 to avoid the RuntimeWarning 2018-08-10 00:14:13 +02:00
Matthew Honnibal
19f5046934 Undoing warning suppression, as doesnt really work 2018-08-10 00:13:34 +02:00
Matthew Honnibal
3fb828352d Set version to 2.0.13.dev0 2018-08-09 23:49:34 +02:00
Matthew Honnibal
1c0614ecd2 Catch numpy warning 2018-08-09 23:49:24 +02:00
Aashish Gangwani
6eebfc7bf4 Added numbers to ../lang/hi/lex_attrs.py (#2629)
I have added numbers in hindi lex_attrs.py file according to Indian numbering system(https://en.wikipedia.org/wiki/Indian_numbering_system) and here are there english translations:
'शून्य' => zero 
'एक' => one
'दो' => two
'तीन' => three
 'चार' => four
'पांच' => five
'छह' => six
'सात'=>seven 
'आठ' => eight
'नौ' => nine
'दस' => ten
'ग्यारह' => eleven
'बारह' => twelve
 'तेरह' => thirteen
'चौदह' => fourteen
'पंद्रह' => fifteen
'सोलह'=> sixteen
'सत्रह' => seventeen
'अठारह' => eighteen
'उन्नीस' => nineteen
'बीस' => twenty
 'तीस' => thirty
'चालीस' => forty
'पचास' => fifty
'साठ' => sixty
'सत्तर' => seventy
'अस्सी' => eighty
'नब्बे' => ninety
'सौ' => hundred
'हज़ार' => thousand
'लाख' => hundred thousand
'करोड़' => ten million
'अरब' => billion
'खरब' => hundred billion

<!--- Provide a general summary of your changes in the title. -->

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-08-08 16:06:11 +02:00
Ines Montani
71723cece1 Add note on visualizing long texts ans sentences (see #2636) [ci skip] 2018-08-08 15:28:21 +02:00
Ines Montani
6147bd3eb4 Fix link target (closes #2645) [ci skip] 2018-08-08 15:03:52 +02:00
Ines Montani
8c47da1f19 Update Language serialization docs (see #2628) [ci skip]
Add note on using from_disk and from_bytes via subclasses and add example
2018-08-07 14:17:57 +02:00
Emil Stenström
3834f4146d Add abbreviations from UD_Swedish-Talbanken (#2613)
* Add abbreviations from UD_Swedish-Talbanken

* Add contributor agreement.
2018-08-07 13:53:17 +02:00