spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-09-23 20:39:20 +03:00

Author	SHA1	Message	Date
Paul O'Leary McCann	bd72fbf09c	Port Japanese mecab tokenizer from v1 (#2036 ) * Port Japanese mecab tokenizer from v1 This brings the Mecab-based Japanese tokenization introduced in #1246 to spaCy v2. There isn't a JapaneseTagger implementation yet, but POS tag information from Mecab is stored in a token extension. A tag map is also included. As a reminder, Mecab is required because Universal Dependencies are based on Unidic tags, and Janome doesn't support Unidic. Things to check: 1. Is this the right way to use a token extension? 2. What's the right way to implement a JapaneseTagger? The approach in #1246 relied on `tag_from_strings` which is just gone now. I guess the best thing is to just try training spaCy's default Tagger? -POLM * Add tagging/make_doc and tests	2018-05-03 18:38:26 +02:00
Robin Linderborg	1f9904ef12	fixes #2238 (#2241 ) * Remove erroneous lemma lookup år > åra in Swedish * Add contributors agreement * Add contrib agreement to correct directory * Revert change to CONTRIBUTOR_AGREEMENT	2018-04-28 14:55:22 +02:00
Robin Linderborg	d01f503b54	Remove incorrect lemma lookup gäng->gänga (#2252 ) * Remove incorrect lemma lookup gäng->gänga In modern Swedish, "gäng" is mostly associated with "gang" or "group of people". The removed lemma lookup lemmatized it to the verb "thread". * Add contrib agreement to correct directory * Revert change to CONTRIBUTOR_AGREEMENT	2018-04-28 14:54:41 +02:00
ines	686225eadd	Fix Spanish noun_chunks (resolves #2210 ) Make sure 'NP' label is added to StringStore and move noun_bounds helper into a closure to allow reusing label sets	2018-04-18 18:44:01 -04:00
Jens Dahl Møllerhøj	e5055e3cf6	Add Danish lemmatizer (#2184 ) * add danish lemmatizer * fill contributor agreement	2018-04-07 19:07:28 +02:00
Matthew Honnibal	21047bde52	Fix syntax error in italian lemmatizer	2018-04-03 23:13:22 +02:00
Viet Trung Tran	ea2af94cd9	Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer (#2155 ) * support for Vietnamese * Contributor Agreement for adding Vietnamese support on spaCy	2018-03-29 12:19:51 +02:00
ines	11c4735ccf	Fix issue in Italian lemmatizer data (resolves #2050 )	2018-03-27 23:55:22 +02:00
Ines Montani	68226109f4	Merge pull request #2142 from jimregan/polish-more-tokens more exceptions	2018-03-24 19:06:44 +01:00
Matthew Honnibal	0d3bf0d4eb	Merge branch 'master' of https://github.com/explosion/spaCy	2018-03-24 17:31:49 +01:00
dejanmarich	ccd1c04c63	Update stop_words.py Added more words	2018-03-24 17:31:24 +01:00
ines	f1446b0257	Port over Turkish changes	2018-03-24 17:31:07 +01:00
DuyguA	cd604878a4	quick typo fix	2018-03-24 17:26:35 +01:00
Jim O'Regan	efe037e8be	more exceptions	2018-03-24 00:05:27 +00:00
alldefector	f4e5904fc2	Fix Spanish noun_chunks failure caused by typo	2018-03-14 17:03:17 +01:00
Ines Montani	14e7e0f12a	Merge pull request #2000 from jimregan/polish-tag-map Polish tag map	2018-02-18 19:05:58 +01:00
Matthew Honnibal	eb3040ce46	Merge pull request #1891 from fucking-signup/master Fix issue #1889	2018-02-18 13:47:47 +01:00
4altinok	94fb0b75e3	code for is_currency	2018-02-11 18:51:32 +01:00
Ines Montani	0954e15dda	Merge pull request #1913 from ohenrik/nb_syntax_iterator Norwegian Language (nb) - Added french syntax iterator with explanation	2018-02-06 04:59:07 +01:00
Ole Henrik Skogstrøm	251a7805fe	Copied French syntax iterator to simplify future changes	2018-02-05 14:45:05 +01:00
ines	f1d3deffac	Add Russian example sentences (see #1107 )	2018-02-01 20:09:40 +01:00
Ole Henrik Skogstrøm	e40465487c	Added french syntax iterator with explenation	2018-01-30 15:44:29 +01:00
Matthew Honnibal	cb7110c22e	Merge pull request #1882 from ohenrik/nb_lemma_and_tag_map Add norwegian bokmål ('nb') lemmatizer and tag_map	2018-01-29 18:18:50 +01:00
Ali Zarezade	bb6bd3d8ae	add persian language	2018-01-27 13:27:26 +03:30
Ali Zarezade	d195675db5	add persian language	2018-01-27 13:21:38 +03:30
Kit	4b42267ba3	Fix issue #1889	2018-01-25 23:17:22 +01:00
Ole Henrik Skogstrøm	8e2c9f2475	Cleaned up nb tag_map comments	2018-01-25 11:09:28 +01:00
Ole Henrik Skogstrøm	1107e89fcf	Updated doc string on nb tag_map module	2018-01-25 11:08:28 +01:00
Ole Henrik Skogstrøm	4058a7d579	Fix æøå characters in lemmatizer	2018-01-24 14:03:14 +01:00
Ole Henrik Skogstrøm	42248f423f	Updated tag map	2018-01-24 13:50:33 +01:00
Ole Henrik Skogstrøm	74b430b49a	Correct Lemmatizer	2018-01-24 13:26:33 +01:00
Ole Henrik Skogstrøm	b9b3a40c78	Add norwegian lemmatizer and tag_map	2018-01-24 12:28:29 +01:00
Ali Zarezade	42349471bc	add ٪ as punctuation	2018-01-23 18:11:33 +03:30
Ali Zarezade	2bda582135	Add Persian character and symbols Add Persian characters and the following: - ٪ used instead of % - ؟ used instead of ? - ﷼ used instead of $ - ، used instead of , - ؛ used instead of ;	2018-01-23 13:20:36 +03:30
Kit	701e7cc6aa	Rename variable to keep code consistent	2018-01-08 03:38:44 +01:00
Kit	ed0db95183	Find lowercased forms of ordinal words, where possible	2018-01-08 03:28:50 +01:00
Kit	9bc524982e	Find lowercased forms of numeric words	2018-01-08 03:25:08 +01:00
Kevin Humphreys	7918fa4ef9	handle would've	2018-01-03 12:25:48 -08:00
zqhZY	f27859fa99	add ChineseDefaults class for pickling	2017-12-28 17:13:58 +08:00
Søren Lind Kristiansen	bef735aef7	Fix Danish abbreviation 'm.h.t.'	2017-12-21 09:24:31 +01:00
Ines Montani	a3dd167d7f	Merge branch 'master' into da_ud_tokenization	2017-12-20 21:05:34 +00:00
Ines Montani	97f100f69f	Merge pull request #1742 from kimfalk/master Two corrections in the da lan.	2017-12-20 21:02:00 +00:00
Ines Montani	d682a8803e	Merge pull request #1672 from cbilgili/master Adds Turkish Lemmatization	2017-12-20 21:01:00 +00:00
Benjamin Peterson	9452134cd1	remove no-break spaces from Hindi example (fixes #1750 )	2017-12-20 11:35:30 -08:00
Søren Lind Kristiansen	7a2f2f6f94	Fix formatting.	2017-12-20 18:37:37 +01:00
Søren Lind Kristiansen	15d13efafd	Tune Danish tokenizer to more closely match tokenization in Universal Dependencies.	2017-12-20 17:36:52 +01:00
Kim FalkJørgensen	648dc60755	Remove the incorrect exception 'm.h.t'	2017-12-20 10:02:39 +01:00
Kim FalkJørgensen	9c9f4ef84a	Fixing a translation error in examples.py Adding an exception in the tokenizer_exceptions.py	2017-12-19 15:26:50 +01:00
ines	22dc744b48	Fix check for '@' in like_url (see #1715 )	2017-12-16 13:48:43 +01:00
Ines Montani	6455b574fc	Check for email address first	2017-12-12 10:25:13 +01:00
Bri-Will	d77361d76c	Update lex_attrs.py. Fix like_url from matching on e-mail	2017-12-11 14:13:28 -08:00
Matthew Honnibal	2ab0f2d186	Merge pull request #1664 from jimregan/italian-lemmatizer BOM in Italian lemmatiser	2017-12-06 11:09:04 +01:00
Matthew Honnibal	3f247119d3	Merge pull request #1668 from sorenlind/da_morph Add more Danish morph rules and clean up existing ones	2017-12-06 11:08:09 +01:00
ines	f2ea6d4713	Add Dutch example sentences (see #1107 )	2017-12-01 23:36:05 +01:00
Canbey Bilgili	abe098b255	Adds Turkish Lemmatization	2017-12-01 17:04:32 +03:00
Søren Lind Kristiansen	d86b537a38	Enable morph rules for Danish	2017-11-30 15:58:02 +01:00
Søren Lind Kristiansen	13a988adc3	Remove 'Number[psor]'	2017-11-30 15:55:04 +01:00
Søren Lind Kristiansen	dd6fde18a9	Add more Danish morph rules and clean up existing ones	2017-11-30 11:17:19 +01:00
Vadim Mazaev	4ba7ddf651	Bugfixies	2017-11-30 12:29:38 +03:00
Jim O'Regan	c3e6cee17a	use inan in polimorf tagset conversion	2017-11-29 23:15:47 +00:00
Jim O'Regan	b32575e78c	imports	2017-11-29 23:03:41 +00:00
Jim O'Regan	3696ce6a7b	add UD mapping	2017-11-29 22:59:19 +00:00
Matthew Honnibal	f9ed9ea529	Merge pull request #1624 from GreenRiverRUS/russian Add support for Russian	2017-11-29 23:10:01 +01:00
Jim O'Regan	076a6fc60a	symbols	2017-11-29 20:11:20 +00:00
Jim O'Regan	834ba3c69a	(semi generated) Polimorf mapping	2017-11-29 20:08:24 +00:00
Jim O'Regan	ba6a23fd11	BOM in Italian lemmatiser	2017-11-29 17:40:07 +00:00
Ines Montani	9052643e2c	Merge pull request #1653 from sorenlind/da_example_typo Fix typo	2017-11-27 14:47:42 +00:00
Søren Lind Kristiansen	5fe58b885b	Fix typo	2017-11-27 15:36:18 +01:00
Ines Montani	d52b1ab245	Add unicode_literals (hopefully fixes test failure on Python 2)	2017-11-27 15:16:54 +01:00
Søren Lind Kristiansen	0ffd27b0f6	Add several Danish alternative spellings	2017-11-27 13:35:41 +01:00
Vadim Mazaev	cacd859dcd	Added tag map, fixed tests fails, added more exceptions	2017-11-26 20:54:48 +03:00
Søren Lind Kristiansen	ef03e9ea53	Remove unused import.	2017-11-25 13:04:02 +01:00
Søren Lind Kristiansen	6aa241bcec	Add day of month tokenizer exceptions for Danish.	2017-11-24 15:03:24 +01:00
Søren Lind Kristiansen	0c276ed020	Add weekday abbreviations and remove abiguous month abbreviations for Danish.	2017-11-24 14:43:29 +01:00
Søren Lind Kristiansen	056547e989	Add multiple tokenizer exceptions for Danish.	2017-11-24 11:51:26 +01:00
Søren Lind Kristiansen	ac8116510d	Fix tokenization of 'i.' for Danish.	2017-11-24 11:16:53 +01:00
Vadim Mazaev	81314f8659	Fixed tokenizer: added char classes; added first lemmatizer and tokenizer tests	2017-11-21 22:23:59 +03:00
Vadim Mazaev	52ee1f9bf9	Updated Russian Language, added lemmatizer, norm exceptions and lex attrs	2017-11-21 11:44:46 +03:00
Vadim Mazaev	a0739a06d4	Returned russian support from v1.10 branch	2017-11-17 17:06:15 +03:00
ines	c9d72de0fb	Add dummy serialization methods for Japanese and missing lang getter (resolves #1557 )	2017-11-15 12:44:02 +01:00
Mathias Deschamps	c0691b2ab4	Add tokenizer exceptions for ing verbs Extend list of tokenizing exceptions introduced in `123810b`	2017-11-13 17:46:05 +01:00
Mathias Deschamps	288298ead9	Add norm exception for ing verbs Some ing verbs are sometimes written in or in'. Make the NORM form correct	2017-11-13 17:46:05 +01:00
Abhinav Sharma	59f5740ede	improved upon the list of included stop_words	2017-11-13 17:13:49 +05:30
ines	123810b6de	Add "lovin'" to tokenizer exceptions (see #1248 )	2017-11-09 17:09:30 +01:00
Ines Montani	42b241ccd0	Update language code in usage example in comment	2017-11-08 11:36:38 +01:00
Abhinav Sharma	84edade82d	Create examples.py Populated the file with the translations of English example sentences	2017-11-08 13:23:08 +05:30
ines	bcf42b8846	Fix typo	2017-11-08 01:06:37 +01:00
ines	acb9bdb852	Fix PRON_LEMMA imports	2017-11-06 17:41:53 +01:00
ines	baa231745c	Fix Dutch tag map	2017-11-05 21:41:50 +01:00
ines	507ecb67af	Fix Spanish tag map	2017-11-05 19:23:34 +01:00
ines	975e1042ff	Fix Italian tag map	2017-11-05 18:34:09 +01:00
ines	6b2d6e4937	Fix Portuguese tag map	2017-11-05 18:31:00 +01:00
ines	fa2687fded	Fix Dutch tag map	2017-11-05 17:57:59 +01:00
ines	fb8990d916	Fix Spanish tag map	2017-11-05 17:48:46 +01:00
ines	9d13288f73	Fix French tag map	2017-11-05 17:47:59 +01:00
ines	54579805c5	Fix French tag map	2017-11-05 17:44:05 +01:00
Matthew Honnibal	0d4bd6414e	Fix Italian tag map	2017-11-05 14:11:03 +01:00
ines	ef597622a6	Add Portuguese tag map	2017-11-05 13:58:34 +01:00
ines	793c62dfda	Add Dutch tag map	2017-11-05 13:48:07 +01:00
ines	f7485a09c8	Fix Italian tag map	2017-11-05 13:12:58 +01:00

1 2 3 4 5 ...

340 Commits