spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-26 01:46:28 +03:00

Author	SHA1	Message	Date
Søren Lind Kristiansen	bef735aef7	Fix Danish abbreviation 'm.h.t.'	2017-12-21 09:24:31 +01:00
Ines Montani	a3dd167d7f	Merge branch 'master' into da_ud_tokenization	2017-12-20 21:05:34 +00:00
Ines Montani	97f100f69f	Merge pull request #1742 from kimfalk/master Two corrections in the da lan.	2017-12-20 21:02:00 +00:00
Ines Montani	d682a8803e	Merge pull request #1672 from cbilgili/master Adds Turkish Lemmatization	2017-12-20 21:01:00 +00:00
Benjamin Peterson	9452134cd1	remove no-break spaces from Hindi example (fixes #1750)	2017-12-20 11:35:30 -08:00
Søren Lind Kristiansen	7a2f2f6f94	Fix formatting.	2017-12-20 18:37:37 +01:00
Søren Lind Kristiansen	15d13efafd	Tune Danish tokenizer to more closely match tokenization in Universal Dependencies.	2017-12-20 17:36:52 +01:00
Kim FalkJørgensen	648dc60755	Remove the incorrect exception 'm.h.t'	2017-12-20 10:02:39 +01:00
Kim FalkJørgensen	9c9f4ef84a	Fixing a translation error in examples.py Adding an exception in the tokenizer_exceptions.py	2017-12-19 15:26:50 +01:00
ines	22dc744b48	Fix check for '@' in like_url (see #1715)	2017-12-16 13:48:43 +01:00
Ines Montani	6455b574fc	Check for email address first	2017-12-12 10:25:13 +01:00
Bri-Will	d77361d76c	Update lex_attrs.py. Fix like_url from matching on e-mail	2017-12-11 14:13:28 -08:00
Matthew Honnibal	2ab0f2d186	Merge pull request #1664 from jimregan/italian-lemmatizer BOM in Italian lemmatiser	2017-12-06 11:09:04 +01:00
Matthew Honnibal	3f247119d3	Merge pull request #1668 from sorenlind/da_morph Add more Danish morph rules and clean up existing ones	2017-12-06 11:08:09 +01:00
ines	f2ea6d4713	Add Dutch example sentences (see #1107)	2017-12-01 23:36:05 +01:00
Canbey Bilgili	abe098b255	Adds Turkish Lemmatization	2017-12-01 17:04:32 +03:00
Søren Lind Kristiansen	d86b537a38	Enable morph rules for Danish	2017-11-30 15:58:02 +01:00
Søren Lind Kristiansen	13a988adc3	Remove 'Number[psor]'	2017-11-30 15:55:04 +01:00
Søren Lind Kristiansen	dd6fde18a9	Add more Danish morph rules and clean up existing ones	2017-11-30 11:17:19 +01:00
Vadim Mazaev	4ba7ddf651	Bugfixes	2017-11-30 12:29:38 +03:00
Jim O'Regan	c3e6cee17a	use inan in polimorf tagset conversion	2017-11-29 23:15:47 +00:00
Jim O'Regan	b32575e78c	imports	2017-11-29 23:03:41 +00:00
Jim O'Regan	3696ce6a7b	add UD mapping	2017-11-29 22:59:19 +00:00
Matthew Honnibal	f9ed9ea529	Merge pull request #1624 from GreenRiverRUS/russian Add support for Russian	2017-11-29 23:10:01 +01:00
Jim O'Regan	076a6fc60a	symbols	2017-11-29 20:11:20 +00:00
Jim O'Regan	834ba3c69a	(semi generated) Polimorf mapping	2017-11-29 20:08:24 +00:00
Jim O'Regan	ba6a23fd11	BOM in Italian lemmatiser	2017-11-29 17:40:07 +00:00
Ines Montani	9052643e2c	Merge pull request #1653 from sorenlind/da_example_typo Fix typo	2017-11-27 14:47:42 +00:00
Søren Lind Kristiansen	5fe58b885b	Fix typo	2017-11-27 15:36:18 +01:00
Ines Montani	d52b1ab245	Add unicode_literals (hopefully fixes test failure on Python 2)	2017-11-27 15:16:54 +01:00
Søren Lind Kristiansen	0ffd27b0f6	Add several Danish alternative spellings	2017-11-27 13:35:41 +01:00
Vadim Mazaev	cacd859dcd	Added tag map, fixed tests fails, added more exceptions	2017-11-26 20:54:48 +03:00
Søren Lind Kristiansen	ef03e9ea53	Remove unused import.	2017-11-25 13:04:02 +01:00
Søren Lind Kristiansen	6aa241bcec	Add day of month tokenizer exceptions for Danish.	2017-11-24 15:03:24 +01:00
Søren Lind Kristiansen	0c276ed020	Add weekday abbreviations and remove ambiguous month abbreviations for Danish.	2017-11-24 14:43:29 +01:00
Søren Lind Kristiansen	056547e989	Add multiple tokenizer exceptions for Danish.	2017-11-24 11:51:26 +01:00
Søren Lind Kristiansen	ac8116510d	Fix tokenization of 'i.' for Danish.	2017-11-24 11:16:53 +01:00
Vadim Mazaev	81314f8659	Fixed tokenizer: added char classes; added first lemmatizer and tokenizer tests	2017-11-21 22:23:59 +03:00
Vadim Mazaev	52ee1f9bf9	Updated Russian Language, added lemmatizer, norm exceptions and lex attrs	2017-11-21 11:44:46 +03:00
Vadim Mazaev	a0739a06d4	Returned russian support from v1.10 branch	2017-11-17 17:06:15 +03:00
ines	c9d72de0fb	Add dummy serialization methods for Japanese and missing lang getter (resolves #1557)	2017-11-15 12:44:02 +01:00
Mathias Deschamps	c0691b2ab4	Add tokenizer exceptions for ing verbs Extend list of tokenizing exceptions introduced in `123810b`	2017-11-13 17:46:05 +01:00
Mathias Deschamps	288298ead9	Add norm exception for ing verbs Some ing verbs are sometimes written in or in'. Make the NORM form correct	2017-11-13 17:46:05 +01:00
Abhinav Sharma	59f5740ede	improved upon the list of included stop_words	2017-11-13 17:13:49 +05:30
ines	123810b6de	Add "lovin'" to tokenizer exceptions (see #1248)	2017-11-09 17:09:30 +01:00
Ines Montani	42b241ccd0	Update language code in usage example in comment	2017-11-08 11:36:38 +01:00
Abhinav Sharma	84edade82d	Create examples.py Populated the file with the translations of English example sentences	2017-11-08 13:23:08 +05:30
ines	bcf42b8846	Fix typo	2017-11-08 01:06:37 +01:00
ines	acb9bdb852	Fix PRON_LEMMA imports	2017-11-06 17:41:53 +01:00
ines	baa231745c	Fix Dutch tag map	2017-11-05 21:41:50 +01:00
ines	507ecb67af	Fix Spanish tag map	2017-11-05 19:23:34 +01:00
ines	975e1042ff	Fix Italian tag map	2017-11-05 18:34:09 +01:00
ines	6b2d6e4937	Fix Portuguese tag map	2017-11-05 18:31:00 +01:00
ines	fa2687fded	Fix Dutch tag map	2017-11-05 17:57:59 +01:00
ines	fb8990d916	Fix Spanish tag map	2017-11-05 17:48:46 +01:00
ines	9d13288f73	Fix French tag map	2017-11-05 17:47:59 +01:00
ines	54579805c5	Fix French tag map	2017-11-05 17:44:05 +01:00
Matthew Honnibal	0d4bd6414e	Fix Italian tag map	2017-11-05 14:11:03 +01:00
ines	ef597622a6	Add Portuguese tag map	2017-11-05 13:58:34 +01:00
ines	793c62dfda	Add Dutch tag map	2017-11-05 13:48:07 +01:00
ines	f7485a09c8	Fix Italian tag map	2017-11-05 13:12:58 +01:00
ines	3cef901834	Add tag map for French and Italian	2017-11-04 23:32:51 +01:00
ines	6c15aafebd	Fix formatting	2017-11-04 23:07:02 +01:00
ines	9baab241b4	Add skeleton language data for Turkish	2017-11-02 16:32:24 +01:00
ines	c6fea3e5f6	Add Romanian and Croatian skeletons (experimental) Add language data templates to make it easier for others to contribute to the language support	2017-11-01 23:04:28 +01:00
ines	18c859500b	Add missing imports	2017-11-01 23:02:51 +01:00
ines	819e30a26e	Tidy up tokenizer exceptions	2017-11-01 23:02:45 +01:00
ines	9659391944	Update deprecated methods and add warnings	2017-11-01 16:49:42 +01:00
Ines Montani	d11659463b	Merge pull request #1152 from jimregan/develop-irish [WIP] attempt a port from #1147	2017-11-01 00:23:43 +01:00
ines	7e424a1804	Don't copy exception dicts if not necessary and tidy up	2017-10-31 21:05:29 +01:00
Ines Montani	06c25a8882	Remove comma that caused list to wrap in tuple! Also removed extra dict wrappings for performance (we used to have them in there, but they should only really exist if copying the dict is absolutely necessary)	2017-10-31 20:13:16 +01:00
Ines Montani	147448b65b	Add missing symbols	2017-10-31 19:34:45 +01:00
Ines Montani	9b0de9fb43	Fix import of symbols (now nested one level lower)	2017-10-31 19:17:58 +01:00
Jim O'Regan	41dd29e48e	merge	2017-10-31 14:07:45 +00:00
Ines Montani	090bd00369	Merge pull request #1464 from mayukh18/develop_bengali_pronouns added the bengali pronouns for v2.0	2017-10-25 21:55:25 +02:00
mayukh18	1bc07758fa	added few bengali pronouns	2017-10-25 22:24:40 +05:30
Ines Montani	d3bf488e16	Merge pull request #1171 from mollerhoj/support-danish Improve basic support for Danish	2017-10-24 20:29:57 +02:00
Matthew Honnibal	66766c1454	Restore SP tag to English tag_map, until models migrate	2017-10-24 17:05:00 +02:00
ines	c55db0a4a1	Add example sentences for Japanese and Chinese (see #1107)	2017-10-24 13:02:24 +02:00
ines	66f8f9d4a0	Fix Japanese tokenizer JapaneseTokenizer now returns a Doc, not individual words	2017-10-24 13:02:19 +02:00
Ines Montani	facf77e541	Merge branch 'develop' into support-danish	2017-10-24 11:53:19 +02:00
Matthew Honnibal	49895fbef6	Rename 'SP' special tag to '_SP' Renaming the tag with an underscore lets us add it to the tag map without worrying that we'll change the sequence of tags, which throws off the tag-to-ID mapping. For instance, if we inserted a 'SP' tag, the "VERB" tag is pushed to a different class ID, and the model is all messed up.	2017-10-20 14:01:12 +02:00
Ines Montani	f0d577e460	Merge pull request #1425 from explosion/feature/hindi-tokenizer 💫 Basic Hindi tokenization support	2017-10-18 13:34:52 +02:00
Matthew Honnibal	839de87ca9	Make lambda func a named function, for pickling	2017-10-17 18:21:20 +02:00
Matthew Honnibal	9ce7d6af87	Make lex attr functions top-level functions, to promote pickling	2017-10-17 18:19:18 +02:00
Ines Montani	aab299c8ae	Merge pull request #1429 from vishnunekkanti/develop fix syntax error in zh	2017-10-17 14:45:02 +02:00
ines	485c4f6df5	Add Hungarian examples (see #1107)	2017-10-17 02:37:45 +02:00
Vishnu Kumar Nekkanti	d3c54cf39a	fixed SyntaxError while checking for jieba	2017-10-16 18:51:33 +05:30
ines	266e7180a7	Add Language class, stop words and basic stemmer that sets NORM	2017-10-14 14:59:52 +02:00
ines	e85e1d571b	Update base punctuation	2017-10-14 14:59:23 +02:00
ines	9d6c8eaa49	Update base norm exceptions with more unicode characters e.g. unicode variations of punctuation used in Chinese	2017-10-14 14:58:52 +02:00
ines	38c756fd85	Port over changes from #1287	2017-10-14 13:16:21 +02:00
ines	612224c10d	Port over changes from #1157	2017-10-14 13:11:39 +02:00
ines	a4d974d97b	Port over URL pattern changes from #1411	2017-10-14 12:58:07 +02:00
ines	09aed58140	Port over changes from #1333 and add comments	2017-10-14 12:52:59 +02:00
ines	8ce6f96180	Don't make copies of language data components	2017-10-11 15:34:55 +02:00
ines	417d45f5d0	Add lemmatizer data as variable on language data Don't create lookup lemmatizer within Language class and just pass in the data so it can be set on Token creation	2017-10-11 02:24:58 +02:00
ines	0c2343d73a	Tidy up language data	2017-10-11 02:22:49 +02:00
Matthew Honnibal	8143618497	Set prefix length back to 1	2017-10-10 19:32:54 +02:00
Matthew Honnibal	dce8afb9cf	Set prefix length to 3	2017-10-09 21:55:55 -05:00

301 Commits