spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-03-28 13:54:12 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	66766c1454	Restore SP tag to English tag_map, until models migrate	2017-10-24 17:05:00 +02:00
ines	c55db0a4a1	Add example sentences for Japanese and Chinese (see #1107)	2017-10-24 13:02:24 +02:00
ines	66f8f9d4a0	Fix Japanese tokenizer: JapaneseTokenizer now returns a Doc, not individual words	2017-10-24 13:02:19 +02:00
Matthew Honnibal	49895fbef6	Rename 'SP' special tag to '_SP'. Renaming the tag with an underscore lets us add it to the tag map without worrying that we'll change the sequence of tags, which throws off the tag-to-ID mapping. For instance, if we inserted a 'SP' tag, the "VERB" tag is pushed to a different class ID, and the model is all messed up.	2017-10-20 14:01:12 +02:00
Ines Montani	f0d577e460	Merge pull request #1425 from explosion/feature/hindi-tokenizer: 💫 Basic Hindi tokenization support	2017-10-18 13:34:52 +02:00
Matthew Honnibal	839de87ca9	Make lambda func a named function, for pickling	2017-10-17 18:21:20 +02:00
Matthew Honnibal	9ce7d6af87	Make lex attr functions top-level functions, to promote pickling	2017-10-17 18:19:18 +02:00
Ines Montani	aab299c8ae	Merge pull request #1429 from vishnunekkanti/develop: fix syntax error in zh	2017-10-17 14:45:02 +02:00
ines	485c4f6df5	Add Hungarian examples (see #1107)	2017-10-17 02:37:45 +02:00
Vishnu Kumar Nekkanti	d3c54cf39a	fixed SyntaxError while checking for jieba	2017-10-16 18:51:33 +05:30
ines	266e7180a7	Add Language class, stop words and basic stemmer that sets NORM	2017-10-14 14:59:52 +02:00
ines	e85e1d571b	Update base punctuation	2017-10-14 14:59:23 +02:00
ines	9d6c8eaa49	Update base norm exceptions with more unicode characters, e.g. unicode variations of punctuation used in Chinese	2017-10-14 14:58:52 +02:00
ines	38c756fd85	Port over changes from #1287	2017-10-14 13:16:21 +02:00
ines	612224c10d	Port over changes from #1157	2017-10-14 13:11:39 +02:00
ines	a4d974d97b	Port over URL pattern changes from #1411	2017-10-14 12:58:07 +02:00
ines	09aed58140	Port over changes from #1333 and add comments	2017-10-14 12:52:59 +02:00
ines	8ce6f96180	Don't make copies of language data components	2017-10-11 15:34:55 +02:00
ines	417d45f5d0	Add lemmatizer data as variable on language data. Don't create lookup lemmatizer within Language class and just pass in the data so it can be set on Token creation	2017-10-11 02:24:58 +02:00
ines	0c2343d73a	Tidy up language data	2017-10-11 02:22:49 +02:00
Matthew Honnibal	8143618497	Set prefix length back to 1	2017-10-10 19:32:54 +02:00
Matthew Honnibal	dce8afb9cf	Set prefix length to 3	2017-10-09 21:55:55 -05:00
Ines Montani	959c46eabe	Merge pull request #1365 from wannaphongcom/develop: Add Thai language for spaCy v2	2017-09-26 23:43:05 +02:00
Wannaphong Phatthiyaphaibun	3d5046c499	fix import in th	2017-09-26 22:41:20 +07:00
Wannaphong Phatthiyaphaibun	a63f790b8c	fix thai tag_map	2017-09-26 22:28:57 +07:00
Wannaphong Phatthiyaphaibun	2ea27d07f4	fix tokenizer_exceptions in thai	2017-09-26 22:14:47 +07:00
Wannaphong Phatthiyaphaibun	a2bf4cc7bf	fix newline in file	2017-09-26 21:49:43 +07:00
ines	bb5c631402	Implement like_num getter for French (via #1161)	2017-09-26 16:47:45 +02:00
ines	15479b3bae	Add comment to like_num re: future work	2017-09-26 16:43:28 +02:00
ines	adda08fe14	Implement like_num getter for Dutch (via #1177)	2017-09-26 16:39:15 +02:00
ines	5ee10379db	Port over changes from #1340	2017-09-26 16:38:08 +02:00
Wannaphong Phatthiyaphaibun	5cba67146c	add thai in spacy2	2017-09-26 21:36:27 +07:00
ines	10d291f129	Port over change from #1351	2017-09-26 16:11:41 +02:00
ines	ece30c28a8	Don't split hyphenated words in German. This way, the tokenizer matches the tokenization in German treebanks	2017-09-16 20:40:15 +02:00
Ines Montani	bd3da3d6fb	Port over change from #1323 and tidy up	2017-09-14 19:23:13 +02:00
Matthew Honnibal	b29e6bff46	Improve lemmatization rule for am|VBP	2017-09-04 15:18:10 +02:00
Matthew Honnibal	2e28982e28	Merge pull request #1288 from geovedi/indonesian: Indonesian language support	2017-08-26 21:31:13 +02:00
Matthew Honnibal	cfc055734e	Split % in units, for compatibility with corpus	2017-08-25 20:03:37 -05:00
Jim Geovedi	58d8078971	Merge remote-tracking branch 'upstream/develop' into indonesian	2017-08-25 09:21:49 +08:00
Matthew Honnibal	bb2541ffd3	Fix PROB attr for OOV words	2017-08-23 12:11:52 +02:00
ines	a68dc891ea	Port over changes from #1281	2017-08-21 23:19:18 +02:00
Jim Geovedi	f77443ab68	reworked	2017-08-20 13:43:21 +07:00
Jim Geovedi	b7d83f37c8	indonesian abbr.	2017-08-20 12:16:50 +07:00
Jim Geovedi	7193c47f0b	direct lookup	2017-08-20 11:57:52 +07:00
Jim Geovedi	fdf802d505	added examples	2017-08-20 11:57:10 +07:00
Jim Geovedi	fa544e6c9a	Merge remote-tracking branch 'upstream/develop' into indonesian	2017-08-20 11:49:40 +07:00
ines	1fe5e1a4d1	Add language example sentences (see #1107): da, de, en, es, fr, he, it, nb, pl, pt, sv	2017-08-19 12:22:29 +02:00
Jim Geovedi	37f19f5ed2	added more currencies based on corpus data	2017-08-03 13:03:25 +07:00
Jim Geovedi	30fd068d42	hashtag prefix should be handled somewhere else	2017-08-03 13:03:02 +07:00
Jim Geovedi	ba07e23c87	added USD in currency rules	2017-08-02 22:42:47 +07:00
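
Commit 49895fbef6 above explains that inserting an 'SP' tag would push "VERB" to a different class ID, while '_SP' would not. The following is a minimal sketch of that effect. It assumes class IDs are assigned by sorted tag name, which the commit message implies but does not state. The tag set and the helpers `build_tag_ids` and `shifted` are hypothetical and are not spaCy's actual tag map or API.

```python
def build_tag_ids(tag_names):
    """Assign class IDs by sorted tag name (an assumption for illustration)."""
    return {tag: i for i, tag in enumerate(sorted(tag_names))}


def shifted(before, after):
    """Return the tags whose class ID differs between two mappings."""
    return sorted(tag for tag in before if before[tag] != after[tag])


# Hypothetical toy tag set, not spaCy's real English tag map.
base_tags = ["ADJ", "NOUN", "PUNCT", "VERB"]

ids_before = build_tag_ids(base_tags)
ids_with_sp = build_tag_ids(base_tags + ["SP"])
ids_with_underscore_sp = build_tag_ids(base_tags + ["_SP"])

# 'SP' sorts before 'VERB', so VERB moves to a new class ID.
print(shifted(ids_before, ids_with_sp))             # ['VERB']

# '_' sorts after the uppercase ASCII letters, so '_SP' lands at the end
# and every existing tag keeps its ID.
print(shifted(ids_before, ids_with_underscore_sp))  # []
```

Under this assumption, a model trained against the original IDs would still see the same ID for every existing tag after '_SP' is added, which is the property the commit message relies on.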


153 Commits