Ines Montani
e619aba8df
Move WordNet license to correct place
2016-10-21 01:07:16 +02:00
Matthew Honnibal
eab2376547
* Allow longer ellipses to be treated as a single token, e.g. Hello......there
2016-05-09 13:22:53 +02:00
Matthew Honnibal
fe9299a118
* Fix long-standing issue with coarse-grained tags: proper nouns weren't receiving the PROPN tag, and personal pronouns weren't receiving the PRON tag. This should fix Issue #191, and also Issue #325, which reported that proper nouns were being lemmatized using the common noun policies. This lemmatization will be prevented if the universal tag is PROPN, not NOUN, as no lemmatization rules are loaded for the PROPN tag.
2016-04-14 12:46:43 +02:00
Matthew Honnibal
6f82065761
* Fix infixed commas in tokenizer, re Issue #326. Need to benchmark on empirical data, to make sure this doesn't break other cases.
2016-04-14 11:36:03 +02:00
Matthew Honnibal
85485f5c2b
Fix inconsistencies in generate_specials.py
Re Issue #321, fix inconsistencies in the script that generates specials.json. The result still isn't so satisfying --- we need to revise this as we move to parse more morphologically rich languages.
2016-04-07 11:21:52 +10:00
Matthew Honnibal
910a6c805f
* Add infix rule for double hyphens, re Issue #302
2016-03-29 13:03:44 +11:00
Matthew Honnibal
6c633f2edc
Fix Issue #243: Incorrect gazetteer entry
2016-01-30 06:58:29 +11:00
Matthew Honnibal
4b4eec8b47
* Fix Issue #201: Tokenization of there'll
2015-12-29 18:09:09 +01:00
Matthew Honnibal
e8bd92f1e7
* Fix lemma of let's, re Issue #177
2015-11-13 06:42:23 +11:00
Matthew Honnibal
726bb648da
* Fix non-breaking space in specials.json
2015-10-19 12:46:11 +11:00
Matthew Honnibal
e39095da82
* Fix designation of non-breaking space in specials.json.
2015-10-19 12:39:03 +11:00
Matthew Honnibal
454c1996d0
* Add tokenizer rule to fix numeric range tokenization
2015-10-17 15:49:51 +11:00
Matthew Honnibal
7488821677
* Map NIL to empty string in tag map
2015-10-10 22:09:50 +11:00
Matthew Honnibal
4bbd1388bd
* Whitespace
2015-10-10 16:03:48 +11:00
Matthew Honnibal
bdcb8d695c
* Add non-breaking space to specials.json
2015-10-10 15:54:06 +11:00
Matthew Honnibal
c12d36d5f4
* Fix quote marks in lemma_rules
2015-10-10 15:03:36 +11:00
Matthew Honnibal
57b3cd4661
* Add smart-quotes to lemma rules
2015-10-10 14:06:46 +11:00
Matthew Honnibal
7e7f28e1fd
* Add smart-quote possessive marker in generate_specials
2015-10-10 14:06:09 +11:00
Matthew Honnibal
a510858f5a
* Pretty-print specials.json, and add the em dash
2015-10-09 11:07:45 +02:00
Matthew Honnibal
49600a44a8
* Fix trailing comma in lemma_rules.json
2015-10-09 11:06:57 +02:00
Matthew Honnibal
0e92e8574a
* Fix pos tag in em-dash in specials
2015-10-09 11:06:37 +02:00
Matthew Honnibal
d341443282
* Remove em-dash from lemma rules. Handle instead in specials.
2015-10-09 10:27:13 +02:00
Matthew Honnibal
b6047afe4c
* Fix punctuation lemma rules, to resolve Issue #130
2015-10-09 10:25:37 +02:00
Matthew Honnibal
393a13d1af
* Add unicode em dash to specials.json, so that we can control what POS tag it gets. This way we can prevent sentence boundary detection errors, to address Issue #130.
2015-10-09 19:24:33 +11:00
Matthew Honnibal
1490feda29
* Make generate_specials pretty-print the specials.json file
2015-10-09 19:23:47 +11:00
Matthew Honnibal
1842a53e73
* Lemmatize smart quotes as plain quotes
2015-10-09 19:09:36 +11:00
Matthew Honnibal
5332c0b697
* Add support for punctuation lemmatization, to handle unicode characters. This should help in addressing Issue #130
2015-10-09 18:54:40 +11:00
Matthew Honnibal
095831e5bf
* Start adding auxiliaries to morphs.json
2015-09-27 16:56:34 +10:00
Matthew Honnibal
c579b6b96c
* Update English morphs.json
2015-09-24 22:38:41 +10:00
Matthew Honnibal
3b3547251c
* Fix Issue #102: DT tag was mapped to DET.
2015-09-24 18:38:47 +10:00
Matthew Honnibal
be4848fbcb
* Update morphs.json with universal dependencies/interset morphological features
2015-09-24 00:59:42 +10:00
Henning Peters
911de2ae49
add overseen (?) char
2015-09-22 12:29:47 +02:00
Matthew Honnibal
b9e31dc245
* Bug fix to gazetteer.json
2015-09-10 14:50:44 +02:00
Matthew Honnibal
623329b19a
Merge branch 'master' of ssh://github.com/honnibal/spaCy into develop
2015-09-08 14:27:01 +02:00
Matthew Honnibal
86c888667f
* Merge in changes from de branch
2015-09-06 19:49:28 +02:00
Matthew Honnibal
b3703836f9
* Add en lemma rules
2015-09-06 17:56:11 +02:00
Matthew Honnibal
c9f2082e3c
* Fix compilation error in en/tag_map.json
2015-09-06 17:54:51 +02:00
Matthew Honnibal
0af139e183
* Tagger training now working. Still need to test load/save of model. Morphology still broken.
2015-08-27 09:16:11 +02:00
Matthew Honnibal
56c4e07a59
Update gazetteer.json
2015-08-27 08:53:48 +10:00
Matthew Honnibal
494da25872
* Refactor for more universal spacy
2015-08-26 19:13:50 +02:00
jxs8172
85f01c5e16
Add contributor agreement. Add exception to 'it' so that 'its' and 'Its' aren't generated (its =/= it's)
2015-08-24 18:20:06 -04:00
jxs8172
5876248109
Add missing we've and hardcoded 's and 'S
2015-08-21 22:57:47 -04:00
jxs8172
a5e0a0073b
Add a script to generate the specials.json file, to take care of handling uppercase and missing apostrophe contractions
2015-08-21 22:39:33 -04:00
Matthew Honnibal
b27bd18d6e
* Add spaCy to gazetteer
2015-08-08 23:30:49 +02:00
Matthew Honnibal
855af087fc
* Fix gazetteer.json
2015-08-06 17:27:51 +02:00
Matthew Honnibal
0e098815cc
* Expand gazetteer with some of the errors from the reddit parse
2015-08-06 17:13:27 +02:00
Matthew Honnibal
6fcc3df989
* Expand gazetteer with some of the errors from the reddit parse
2015-08-06 17:11:00 +02:00
Matthew Honnibal
832896ea6c
* Add html to gazetteer
2015-08-06 16:36:54 +02:00
Matthew Honnibal
5c3c962038
* Add html to gazetteer
2015-08-06 16:34:51 +02:00
Matthew Honnibal
91a94e152b
* Make initial gazetteer
2015-08-06 16:10:04 +02:00