
290 Commits

Author SHA1 Message Date
Paul O'Leary McCann
bd72fbf09c Port Japanese mecab tokenizer from v1 (#2036)
* Port Japanese mecab tokenizer from v1

This brings the Mecab-based Japanese tokenization introduced in #1246 to
spaCy v2. There isn't a JapaneseTagger implementation yet, but POS tag
information from Mecab is stored in a token extension. A tag map is also
included.

As a reminder, Mecab is required because Universal Dependencies are
based on Unidic tags, and Janome doesn't support Unidic.

Things to check:

1. Is this the right way to use a token extension?

2. What's the right way to implement a JapaneseTagger? The approach in
#1246 relied on `tag_from_strings` which is just gone now. I guess the
best thing is to just try training spaCy's default Tagger?

-POLM

* Add tagging/make_doc and tests
2018-05-03 18:38:26 +02:00
Robin Linderborg
1f9904ef12 fixes #2238 (#2241)
* Remove erroneous lemma lookup år > åra in Swedish

* Add contributors agreement

* Add contrib agreement to correct directory

* Revert change to CONTRIBUTOR_AGREEMENT
2018-04-28 14:55:22 +02:00
Robin Linderborg
d01f503b54 Remove incorrect lemma lookup gäng->gänga (#2252)
* Remove incorrect lemma lookup gäng->gänga
In modern Swedish, "gäng" is mostly associated with "gang" or "group of people". The removed lemma lookup lemmatized it to the verb "thread".

* Add contrib agreement to correct directory

* Revert change to CONTRIBUTOR_AGREEMENT
2018-04-28 14:54:41 +02:00
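
To illustrate what removing a lemma lookup entry does, here is a minimal sketch assuming the lookup is a plain word-to-lemma dictionary with a fall-back to the surface form. The names `LOOKUP` and `lemmatize` are hypothetical and are not taken from spaCy's Swedish data or API; only the `gäng -> gänga` pair comes from the commit message above.

```python
# Hypothetical lookup table; only the "gäng" entry mirrors the commit message.
LOOKUP = {
    "gäng": "gänga",  # the incorrect entry: noun "gang" mapped to verb "thread"
}


def lemmatize(word, table):
    """Return the table's lemma for `word`, or `word` itself if there is no entry."""
    return table.get(word, word)


# With the incorrect entry present, the noun is mapped to the verb.
before = lemmatize("gäng", LOOKUP)

# Dropping the entry lets the word fall back to its own surface form.
fixed = {word: lemma for word, lemma in LOOKUP.items() if word != "gäng"}
after = lemmatize("gäng", fixed)
```

In this sketch, deleting the entry is enough because an unknown word is returned unchanged, which is the desired lemma for "gäng".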
ines
686225eadd Fix Spanish noun_chunks (resolves #2210)
Make sure 'NP' label is added to StringStore and move noun_bounds helper into a closure to allow reusing label sets
2018-04-18 18:44:01 -04:00
Jens Dahl Møllerhøj
e5055e3cf6 Add Danish lemmatizer (#2184)
* add danish lemmatizer

* fill contributor agreement
2018-04-07 19:07:28 +02:00
Matthew Honnibal
21047bde52 Fix syntax error in italian lemmatizer 2018-04-03 23:13:22 +02:00
Viet Trung Tran
ea2af94cd9 Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer (#2155)
* support for Vietnamese

* Contributor Agreement for adding Vietnamese support on spaCy
2018-03-29 12:19:51 +02:00
ines
11c4735ccf Fix issue in Italian lemmatizer data (resolves #2050) 2018-03-27 23:55:22 +02:00
Ines Montani
68226109f4
Merge pull request #2142 from jimregan/polish-more-tokens
more exceptions
2018-03-24 19:06:44 +01:00
Matthew Honnibal
0d3bf0d4eb Merge branch 'master' of https://github.com/explosion/spaCy 2018-03-24 17:31:49 +01:00
dejanmarich
ccd1c04c63 Update stop_words.py
Added more words
2018-03-24 17:31:24 +01:00
ines
f1446b0257 Port over Turkish changes 2018-03-24 17:31:07 +01:00
DuyguA
cd604878a4 quick typo fix 2018-03-24 17:26:35 +01:00
Jim O'Regan
efe037e8be more exceptions 2018-03-24 00:05:27 +00:00
alldefector
f4e5904fc2 Fix Spanish noun_chunks failure caused by typo 2018-03-14 17:03:17 +01:00
Ines Montani
14e7e0f12a
Merge pull request #2000 from jimregan/polish-tag-map
Polish tag map
2018-02-18 19:05:58 +01:00
Matthew Honnibal
eb3040ce46
Merge pull request #1891 from fucking-signup/master
Fix issue #1889
2018-02-18 13:47:47 +01:00
4altinok
94fb0b75e3 code for is_currency 2018-02-11 18:51:32 +01:00
Ines Montani
0954e15dda
Merge pull request #1913 from ohenrik/nb_syntax_iterator
Norwegian Language (nb) - Added french syntax iterator with explanation
2018-02-06 04:59:07 +01:00
Ole Henrik Skogstrøm
251a7805fe Copied French syntax iterator to simplify future changes 2018-02-05 14:45:05 +01:00
ines
f1d3deffac Add Russian example sentences (see #1107) 2018-02-01 20:09:40 +01:00
Ole Henrik Skogstrøm
e40465487c Added French syntax iterator with explanation 2018-01-30 15:44:29 +01:00
Matthew Honnibal
cb7110c22e
Merge pull request #1882 from ohenrik/nb_lemma_and_tag_map
Add norwegian bokmål ('nb') lemmatizer and tag_map
2018-01-29 18:18:50 +01:00
Ali Zarezade
bb6bd3d8ae add persian language 2018-01-27 13:27:26 +03:30
Ali Zarezade
d195675db5 add persian language 2018-01-27 13:21:38 +03:30
Kit
4b42267ba3
Fix issue #1889 2018-01-25 23:17:22 +01:00
Ole Henrik Skogstrøm
8e2c9f2475 Cleaned up nb tag_map comments 2018-01-25 11:09:28 +01:00
Ole Henrik Skogstrøm
1107e89fcf Updated doc string on nb tag_map module 2018-01-25 11:08:28 +01:00
Ole Henrik Skogstrøm
4058a7d579 Fix æøå characters in lemmatizer 2018-01-24 14:03:14 +01:00
Ole Henrik Skogstrøm
42248f423f Updated tag map 2018-01-24 13:50:33 +01:00
Ole Henrik Skogstrøm
74b430b49a Correct Lemmatizer 2018-01-24 13:26:33 +01:00
Ole Henrik Skogstrøm
b9b3a40c78 Add norwegian lemmatizer and tag_map 2018-01-24 12:28:29 +01:00
Ali Zarezade
42349471bc
add ٪ as punctuation 2018-01-23 18:11:33 +03:30
Ali Zarezade
2bda582135
Add Persian character and symbols
Add Persian characters and the following:
- ٪ used instead of %
- ؟ used instead of ?
- ﷼ used instead of $
- ، used instead of ,
- ؛ used instead of ;
2018-01-23 13:20:36 +03:30
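
The symbol pairs listed in the commit message above can be written as a small mapping. This is only a sketch: the names `PERSIAN_TO_ASCII`, `is_persian_punct` and `normalize_punct` are hypothetical, and the code does not reproduce how the commit registers these characters in spaCy.

```python
# Persian/Arabic symbols and the ASCII characters they stand in for,
# as listed in the commit message. Escapes are used to avoid RTL display issues.
PERSIAN_TO_ASCII = {
    "\u066a": "%",  # ARABIC PERCENT SIGN
    "\u061f": "?",  # ARABIC QUESTION MARK
    "\ufdfc": "$",  # RIAL SIGN
    "\u060c": ",",  # ARABIC COMMA
    "\u061b": ";",  # ARABIC SEMICOLON
}


def is_persian_punct(ch):
    """True if `ch` is one of the symbols listed above."""
    return ch in PERSIAN_TO_ASCII


def normalize_punct(text):
    """Replace each listed symbol with its ASCII counterpart."""
    return text.translate(str.maketrans(PERSIAN_TO_ASCII))
```

A tokenizer would more likely treat these characters as punctuation directly rather than rewrite them; the normalizing helper is included only to make the correspondence explicit.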
Kit
701e7cc6aa
Rename variable to keep code consistent 2018-01-08 03:38:44 +01:00
Kit
ed0db95183
Find lowercased forms of ordinal words, where possible 2018-01-08 03:28:50 +01:00
Kit
9bc524982e
Find lowercased forms of numeric words 2018-01-08 03:25:08 +01:00
Kevin Humphreys
7918fa4ef9 handle would've 2018-01-03 12:25:48 -08:00
zqhZY
f27859fa99 add ChineseDefaults class for pickling 2017-12-28 17:13:58 +08:00
Søren Lind Kristiansen
bef735aef7 Fix Danish abbreviation 'm.h.t.' 2017-12-21 09:24:31 +01:00
Ines Montani
a3dd167d7f
Merge branch 'master' into da_ud_tokenization 2017-12-20 21:05:34 +00:00
Ines Montani
97f100f69f
Merge pull request #1742 from kimfalk/master
Two corrections in the da lan.
2017-12-20 21:02:00 +00:00
Ines Montani
d682a8803e
Merge pull request #1672 from cbilgili/master
Adds Turkish Lemmatization
2017-12-20 21:01:00 +00:00
Benjamin Peterson
9452134cd1 remove no-break spaces from Hindi example (fixes #1750) 2017-12-20 11:35:30 -08:00
Søren Lind Kristiansen
7a2f2f6f94 Fix formatting. 2017-12-20 18:37:37 +01:00
Søren Lind Kristiansen
15d13efafd Tune Danish tokenizer to more closely match tokenization in Universal Dependencies. 2017-12-20 17:36:52 +01:00
Kim FalkJørgensen
648dc60755 Remove the incorrect exception 'm.h.t' 2017-12-20 10:02:39 +01:00
Kim FalkJørgensen
9c9f4ef84a Fixing a translation error in examples.py
Adding an exception in the tokenizer_exceptions.py
2017-12-19 15:26:50 +01:00
ines
22dc744b48 Fix check for '@' in like_url (see #1715) 2017-12-16 13:48:43 +01:00
Ines Montani
6455b574fc
Check for email address first 2017-12-12 10:25:13 +01:00