ines
5768df4f09
Add SimpleFrozenDict util to use as default function argument
2018-05-20 15:13:37 +02:00
Matthew Honnibal
7431e9c87f
Fix parser for GPU
2018-05-19 17:24:34 +00:00
Matthew Honnibal
401213fb1f
Only warn about unnamed vectors if non-zero sized.
2018-05-19 18:51:55 +02:00
Matthew Honnibal
74d5c625b3
Use rising beam update prob
2018-05-16 20:11:59 +02:00
Matthew Honnibal
544ae7f1db
Merge branch 'develop' into feature/refactor-parser
2018-05-16 02:06:49 +02:00
Matthew Honnibal
d1b27fe5aa
Revert "Improve dynamic oracle when values are missing in parse"
...
This reverts commit f56bd4736b
.
2018-05-16 00:31:52 +02:00
Matthew Honnibal
83acaa0358
Add missing name attribute for parser
2018-05-15 19:01:53 +02:00
Matthew Honnibal
f328c195ca
Fix size limits in training data
2018-05-15 19:01:41 +02:00
Matthew Honnibal
8446b35ce0
Fix parser model loading
2018-05-15 18:43:46 +02:00
Matthew Honnibal
dc1a479fbd
Merge branch 'develop' into feature/refactor-parser
2018-05-15 18:39:21 +02:00
Matthew Honnibal
546dd99cdf
Merge master into develop -- mostly Arabic and website
2018-05-15 18:14:28 +02:00
Matthew Honnibal
5664ab7e6c
Revert hacks to tests
2018-05-15 18:00:09 +02:00
Matthew Honnibal
7b9195657b
Restore beam_density argument for parser beam
2018-05-15 17:55:11 +02:00
Matthew Honnibal
581d318971
Fix conftest
2018-05-15 00:54:45 +02:00
Tahar Zanouda
00417794d3
Add Arabic language ( #2314 )
...
* added support for Arabic lang
* added Arabic language support
* updated conftest
2018-05-15 00:27:19 +02:00
Jani Monoses
0e08e49e87
Lemmatizer ro ( #2319 )
...
* Add Romanian lemmatizer lookup table.
Adapted from http://www.lexiconista.com/datasets/lemmatization/
by replacing cedillas with commas (ș and ț).
The original dataset is licensed under the Open Database License.
* Fix one blatant issue in the Romanian lemmatizer
* Romanian examples file
* Add ro_tokenizer in conftest
* Add Romanian lemmatizer test
2018-05-12 15:20:04 +02:00
Matthew Honnibal
887631ca25
Disable some tests to figure out why CI fails
2018-05-10 16:42:01 +02:00
Matthew Honnibal
902a172cb7
Disable some tests to figure out why CI fails
2018-05-10 16:30:07 +02:00
Matthew Honnibal
614d45ea58
Set a more aggressive threshold on the max violn update
2018-05-10 15:38:24 +02:00
Matthew Honnibal
8e8724b55b
Default to beam_update_prob 1
2018-05-10 15:38:02 +02:00
Jani Monoses
42b34832e4
Update Romanian stopword list ( #2316 )
...
* Contributor agreement for janimo
* Update Romanian stopword list
Include the correct spellings of all the words already in the repo
that are using cedillas (ş and ţ) instead of commas (ș and ț).
Add another unrelated spelling fix.
See https://github.com/stopwords-iso/stopwords-ro/pull/1 and
https://github.com/stopwords-iso/stopwords-ro/pull/2
2018-05-10 12:16:56 +02:00
Lucas Abbade
be7fdc59d1
Update lex_attrs.py ( #2307 )
...
* Update lex_attrs.py
Fixed spelling mistakes of some numbers (according to Brazilian Portuguese).
* Update lex_attrs.py
As requested, I've included the correct spelling for both Brazilian Portuguese and Portuguese Portuguese.
I will advise however, that the two are separated in the future. Brazilian Portuguese is a very different language from the original one, although most of the writing is unified, the way people talk in both countries is radically different. Keeping both languages as one may lead to bigger issues in the future, especially when it comes to spell checking.
2018-05-09 20:49:31 +02:00
mauryaland
5368ba028a
Update stop_words.py for French language ( #2310 )
...
* Add contraction forms of some common stopwords
All the stopwords added contain the apostrophe" ' "or " ’ ".
* Adds contributor agreement mauryaland
* Update mauryaland.md
2018-05-09 12:04:38 +02:00
Matthew Honnibal
a61fd60681
Fix error in beam gradient calculation
2018-05-09 02:44:09 +02:00
Matthew Honnibal
a6ae1ee6f7
Don't modify Token in global scope
2018-05-09 00:43:00 +02:00
Matthew Honnibal
f94f721f40
Avoid importing fused token symbol in ud-run-test, untl that's added
2018-05-09 00:28:03 +02:00
Matthew Honnibal
659ec5b975
Avoid importing fused token symbol in ud-run-test, untl that's added
2018-05-08 19:40:33 +02:00
Matthew Honnibal
4cb0494bef
Bug fixes to beam search after refactor
2018-05-08 13:48:50 +02:00
Matthew Honnibal
5ed71973b3
Add a keyword argument sink to GoldParse
2018-05-08 13:48:32 +02:00
Matthew Honnibal
8cfe326f87
Avoid relying on final gold check in beam search
2018-05-08 13:48:19 +02:00
Matthew Honnibal
fc4dd49b77
Support oracle segmentation in ud-train CLI command
2018-05-08 13:47:45 +02:00
Matthew Honnibal
c49e44349a
Fix beam parsing
2018-05-08 02:53:24 +02:00
Matthew Honnibal
99649d114d
Fix parser
2018-05-08 00:27:26 +02:00
Matthew Honnibal
8a82367a9d
Fix beam search after refactor
2018-05-08 00:20:33 +02:00
Matthew Honnibal
5a0f26be0c
Readd beam search after refactor
2018-05-08 00:19:52 +02:00
ines
7a3599c21a
Fix formatting and consistency
2018-05-07 23:02:11 +02:00
Matthew Honnibal
36b2c9bdd5
Fix refactored parser
2018-05-07 18:58:09 +02:00
Matthew Honnibal
bde3be1ad1
Fix refactored parser
2018-05-07 18:31:04 +02:00
Matthew Honnibal
01c4e13b02
Update test
2018-05-07 16:59:52 +02:00
Matthew Honnibal
f6cdafc00e
Fix refactored parser
2018-05-07 16:59:38 +02:00
Matthew Honnibal
f56bd4736b
Improve dynamic oracle when values are missing in parse
2018-05-07 15:53:18 +02:00
Matthew Honnibal
eddc0e0c74
Set gold.sent_starts in ud_train
2018-05-07 15:52:47 +02:00
Matthew Honnibal
bf19f22340
Allow gold.sent_starts to be set from Python
2018-05-07 15:51:34 +02:00
Matthew Honnibal
7f163442e6
Work on refactoring greedy parser
2018-05-07 15:45:52 +02:00
Douglas Knox
9b49a40f4e
Test and fix for Issue #2219 ( #2272 )
...
Test and fix for Issue #2219 : Token.similarity() failed if single letter
2018-05-03 18:40:46 +02:00
Paul O'Leary McCann
bd72fbf09c
Port Japanese mecab tokenizer from v1 ( #2036 )
...
* Port Japanese mecab tokenizer from v1
This brings the Mecab-based Japanese tokenization introduced in #1246 to
spaCy v2. There isn't a JapaneseTagger implementation yet, but POS tag
information from Mecab is stored in a token extension. A tag map is also
included.
As a reminder, Mecab is required because Universal Dependencies are
based on Unidic tags, and Janome doesn't support Unidic.
Things to check:
1. Is this the right way to use a token extension?
2. What's the right way to implement a JapaneseTagger? The approach in
#1246 relied on `tag_from_strings` which is just gone now. I guess the
best thing is to just try training spaCy's default Tagger?
-POLM
* Add tagging/make_doc and tests
2018-05-03 18:38:26 +02:00
G.Pruvost
cc8e804648
#2211 - Support for ssl certs config on download command ( #2212 )
...
* Add support for SSL/Certs customization on download CLI
* Add a note on SSL options for the 'download' CLI in the README
* Add contributor agreement
2018-05-03 18:37:02 +02:00
Jens Dahl Møllerhøj
b9290397fb
rename SP to _SP ( #2289 )
2018-05-03 18:33:49 +02:00
Matthew Honnibal
a8e70a4187
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-05-03 14:02:10 +02:00
Matthew Honnibal
c0e596283b
Set version to 2.1.0a0
2018-05-03 14:00:11 +02:00
Matthew Honnibal
8cd06cc763
Try to fix root-outside-sentence bug
2018-05-02 14:39:48 +00:00
Matthew Honnibal
acebd01033
Set cildren from heads in finalize doc
2018-05-02 14:19:22 +00:00
Matthew Honnibal
569440a6db
Dont normalize gradient by batch size
2018-05-02 08:42:10 +02:00
Matthew Honnibal
281e29cbcd
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-05-02 01:36:23 +00:00
Matthew Honnibal
2338e8c7fc
Update develop from master
2018-05-02 01:36:12 +00:00
Matthew Honnibal
9d147e12c4
Merge remote-tracking branch 'origin/master' into develop
2018-05-01 18:18:51 +02:00
Matthew Honnibal
6d0fe67b72
Constrain subtok label to adjacent tokens
2018-05-01 17:34:27 +02:00
Matthew Honnibal
8f21953fc5
Constrain subtok to adjacent words
2018-05-01 17:29:00 +02:00
Matthew Honnibal
b43bfd3524
Fix arc-eager oracle tests
2018-05-01 16:16:14 +02:00
Matthew Honnibal
31ed64e9b0
Fix textcat test
2018-05-01 15:18:39 +02:00
Matthew Honnibal
548bdff943
Update default Adam settings
2018-05-01 15:18:20 +02:00
Matthew Honnibal
adbb1f7533
Add better arc-eager oracle tests
2018-05-01 15:14:55 +02:00
Matthew Honnibal
697bcaa34f
Add some methods to ArcEager that make testing easier
2018-05-01 15:13:14 +02:00
Mr Roboto
6f5ccda19c
Addresses Issue #2228 - Deserialization fails when using tensor=False or sentiment=False ( #2230 )
...
* Fixes issue #2228
* Adds a new contributor
2018-05-01 13:40:22 +02:00
Matthew Honnibal
d44bb45c72
Fix scoring if tokenization changes
2018-05-01 01:33:20 +02:00
Matthew Honnibal
2b26c007cd
Revert "Disable batch size compounding in ud-train"
...
This reverts commit 8a120fb455
.
2018-04-29 14:09:02 +00:00
Matthew Honnibal
723b328062
Add script to run UD test
2018-04-29 15:50:25 +02:00
Matthew Honnibal
17af6aa3a4
Update ud_train script
2018-04-29 15:49:32 +02:00
Matthew Honnibal
5de8a36537
Fix arc_eager is_nonproj_tree
2018-04-29 15:49:11 +02:00
Matthew Honnibal
5260268f70
Fix textcat after merge
2018-04-29 15:48:53 +02:00
Matthew Honnibal
ad3d56c3ba
Fix compile error in matcher
2018-04-29 15:48:34 +02:00
Matthew Honnibal
a8bc947fd4
Fix Token.set_extension
2018-04-29 15:48:19 +02:00
Matthew Honnibal
2c4a6d66fa
Merge master into develop. Big merge, many conflicts -- need to review
2018-04-29 14:49:26 +02:00
ines
3c80f69ff5
Return data in cli.info and add silent option ( resolves #2196 )
2018-04-29 01:59:44 +02:00
ines
1c6d77610c
Add remove_extension method on Doc, Token and Span ( closes #2242 )
2018-04-28 23:33:09 +02:00
ines
abdb853ebf
Simplify underscore tests
2018-04-28 23:30:33 +02:00
ines
6fb6371670
Add collapse_phrases option to displacy ( closes #2266 )
2018-04-28 23:06:50 +02:00
Robin Linderborg
1f9904ef12
fixes #2238 ( #2241 )
...
* Remove erroneous lemma lookup år > åra in Swedish
* Add contributors agreement
* Add contrib agreement to correct directory
* Revert change to CONTRIBUTOR_AGREEMENT
2018-04-28 14:55:22 +02:00
Robin Linderborg
d01f503b54
Remove incorrect lemma lookup gäng->gänga ( #2252 )
...
* Remove incorrect lemma lookup gäng->gänga
In modern Swedish, "gäng" is mostly associated with "gang" or "group of people". The removed lemma lookup lemmatized it to the verb "thread".
* Add contrib agreement to correct directory
* Revert change to CONTRIBUTOR_AGREEMENT
2018-04-28 14:54:41 +02:00
Suraj Krishnan Rajan
69d041148f
Implement Fast-Text vectors with subword features
2018-04-21 01:34:14 +05:30
ines
686225eadd
Fix Spanish noun_chunks ( resolves #2210 )
...
Make sure 'NP' label is added to StringStore and move noun_bounds helper into a closure to allow reusing label sets
2018-04-18 18:44:01 -04:00
ines
9632595fb4
Use correct, non-deprecated merge syntax ( resolves #2226 )
2018-04-18 18:28:28 -04:00
Suraj Rajan
5957f15227
Fixed typos for #2222,#2223 ( #2233 ) ( closes #2222 , closes #2223 )
2018-04-18 14:55:26 -07:00
Matthew Honnibal
97851d2c4e
Increment version to v2.0.12.dev0
2018-04-10 22:20:16 +02:00
Matthew Honnibal
ed39c75a92
Merge branch 'master' of https://github.com/explosion/spaCy
2018-04-10 22:19:40 +02:00
Matthew Honnibal
3836199a83
Fix loading of models when custom vectors are added
2018-04-10 22:19:20 +02:00
ines
0299d5fac8
Update argument annotations and formatting
2018-04-10 21:45:11 +02:00
ines
49b1e48bf5
Fix syntax error
2018-04-10 21:44:59 +02:00
ines
70052e46e9
Fix formatting [ci skip]
2018-04-10 21:42:46 +02:00
Matthew Honnibal
0ddb152be0
Improve error message when reading vectors
2018-04-10 21:26:50 +02:00
Matthew Honnibal
db50ac524e
Support zipped vector files in init-model
2018-04-10 21:21:00 +02:00
ines
270fcfd925
Fix typo in package command message ( closes #2200 )
2018-04-10 19:14:31 +02:00
ines
24d8bf348d
Revert "Add support for .zip to init_model"
...
This reverts commit 7ee880a0ad
.
2018-04-10 19:08:06 +02:00
Matthew Honnibal
7ee880a0ad
Add support for .zip to init_model
2018-04-10 14:30:04 +00:00
ines
5ecb274764
Fix indentation error and set Doc.is_tagged correctly
2018-04-10 16:14:52 +02:00
ines
987ee27af7
Return Doc if noun chunks merger component if Doc is not parsed
2018-04-09 14:51:02 +02:00
Xiaoquan Kong
e2f13ec722
bugfix: Doc.noun_chunks
call Doc.noun_chunks_iterator
without checking ( closes #2194 )
2018-04-08 23:44:05 +02:00
Jens Dahl Møllerhøj
e5055e3cf6
Add Danish lemmatizer ( #2184 )
...
* add danish lemmatizer
* fill contributor agreement
2018-04-07 19:07:28 +02:00
ines
bccbf538ef
Revert "Check if spaCy has compiled correctly and show error message"
...
This reverts commit 3463ded7cf
.
2018-04-06 15:49:44 +02:00
ines
fb4eda6616
Merge branch 'master' of https://github.com/explosion/spaCy
2018-04-06 00:38:48 +02:00