spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-17 07:31:59 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	dce8afb9cf	Set prefix length to 3	2017-10-09 21:55:55 -05:00
Ines Montani	959c46eabe	Merge pull request #1365 from wannaphongcom/develop Add Thai language for spaCy v2	2017-09-26 23:43:05 +02:00
Wannaphong Phatthiyaphaibun	3d5046c499	fix import in th	2017-09-26 22:41:20 +07:00
Wannaphong Phatthiyaphaibun	a63f790b8c	fix thai tag_map	2017-09-26 22:28:57 +07:00
Wannaphong Phatthiyaphaibun	2ea27d07f4	fix tokenizer_exceptions in thai	2017-09-26 22:14:47 +07:00
Wannaphong Phatthiyaphaibun	a2bf4cc7bf	fix newline in file	2017-09-26 21:49:43 +07:00
ines	bb5c631402	Implement like_num getter for French (via #1161)	2017-09-26 16:47:45 +02:00
ines	15479b3bae	Add comment to like_num re: future work	2017-09-26 16:43:28 +02:00
ines	adda08fe14	Implement like_num getter for Dutch (via #1177)	2017-09-26 16:39:15 +02:00
ines	5ee10379db	Port over changes from #1340	2017-09-26 16:38:08 +02:00
Wannaphong Phatthiyaphaibun	5cba67146c	add thai in spacy2	2017-09-26 21:36:27 +07:00
ines	10d291f129	Port over change from #1351	2017-09-26 16:11:41 +02:00
ines	ece30c28a8	Don't split hyphenated words in German This way, the tokenizer matches the tokenization in German treebanks	2017-09-16 20:40:15 +02:00
Ines Montani	bd3da3d6fb	Port over change from #1323 and tidy up	2017-09-14 19:23:13 +02:00
Jim O'Regan	9dfd301962	rearrange	2017-09-11 10:14:18 +01:00
Jim O'Regan	1ee75ae337	Merge remote-tracking branch 'origin/develop' into develop-irish	2017-09-11 08:40:11 +01:00
Matthew Honnibal	b29e6bff46	Improve lemmatization rule for am|VBP	2017-09-04 15:18:10 +02:00
Matthew Honnibal	2e28982e28	Merge pull request #1288 from geovedi/indonesian Indonesian language support	2017-08-26 21:31:13 +02:00
Matthew Honnibal	cfc055734e	Split % in units, for compatibility with corpus	2017-08-25 20:03:37 -05:00
Jim Geovedi	58d8078971	Merge remote-tracking branch 'upstream/develop' into indonesian	2017-08-25 09:21:49 +08:00
Matthew Honnibal	bb2541ffd3	Fix PROB attr for OOV words	2017-08-23 12:11:52 +02:00
ines	a68dc891ea	Port over changes from #1281	2017-08-21 23:19:18 +02:00
Jim Geovedi	f77443ab68	reworked	2017-08-20 13:43:21 +07:00
Jim Geovedi	b7d83f37c8	indonesian abbr.	2017-08-20 12:16:50 +07:00
Jim Geovedi	7193c47f0b	direct lookup	2017-08-20 11:57:52 +07:00
Jim Geovedi	fdf802d505	added examples	2017-08-20 11:57:10 +07:00
Jim Geovedi	fa544e6c9a	Merge remote-tracking branch 'upstream/develop' into indonesian	2017-08-20 11:49:40 +07:00
ines	1fe5e1a4d1	Add language example sentences (see #1107) da, de, en, es, fr, he, it, nb, pl, pt, sv	2017-08-19 12:22:29 +02:00
Jim O'Regan	c069b4acb5	fix in UD submitted; map either way	2017-08-08 19:22:14 +01:00
Jim O'Regan	76c22dec4d	UD Irish tag mapping	2017-08-08 19:04:52 +01:00
Jim O'Regan	95921d7d4c	Merge branch 'develop' into develop-irish	2017-08-08 17:21:27 +01:00
Jim Geovedi	37f19f5ed2	added more currencies based on corpus data	2017-08-03 13:03:25 +07:00
Jim Geovedi	30fd068d42	hashtag prefix should be handled somewhere else	2017-08-03 13:03:02 +07:00
Jim Geovedi	ba07e23c87	added USD in currency rules	2017-08-02 22:42:47 +07:00
Jim Geovedi	bb08d696f9	added hashtag rule and fixed currency rules	2017-07-30 21:23:28 +07:00
Jim Geovedi	e9af79a803	added u-\d+ rules (sports team)	2017-07-30 21:23:01 +07:00
Jim Geovedi	e5adc26c72	simplified rules	2017-07-29 18:21:32 +07:00
Jim Geovedi	4d04898dea	updated regexp	2017-07-29 17:44:57 +07:00
Jim Geovedi	7d96d477ea	updated like_num	2017-07-29 17:44:46 +07:00
Jim Geovedi	3cca4ed798	added lex attrs rules	2017-07-29 17:22:21 +07:00
Jim Geovedi	8b814c63f1	more exceptions	2017-07-27 19:46:30 +07:00
Jim Geovedi	6c725e8dcf	updated lemma	2017-07-27 19:46:21 +07:00
Jim Geovedi	547973b92a	wip syntax iterators	2017-07-27 10:51:34 +07:00
Jim Geovedi	bbc75da38d	enable syntax iterator and lemma lookup	2017-07-27 10:51:15 +07:00
Jim Geovedi	24a8c8bf28	added wip lemma dict	2017-07-26 21:39:54 +07:00
Jim Geovedi	63f14ba46b	added hyphen-suffix rules	2017-07-26 19:28:57 +07:00
Jim Geovedi	f288964441	removed -el from suffix rules	2017-07-26 19:28:38 +07:00
Jim Geovedi	6eee7a7411	updated tokenizer exceptions	2017-07-26 19:13:47 +07:00
Jim Geovedi	edec51b1b1	update punctuation rules	2017-07-26 19:13:36 +07:00
Jim Geovedi	62443d495a	enable token match	2017-07-26 19:13:14 +07:00
Jim Geovedi	c97f5ae0bb	updated tokenizer exceptions	2017-07-26 19:12:52 +07:00
Jim Geovedi	73f6ac9d9b	added hyhen	2017-07-24 15:56:31 +07:00
Jim Geovedi	68454c40bf	added missing import	2017-07-24 14:12:34 +07:00
Jim Geovedi	eaf9cbd708	cursed of copy & paste	2017-07-24 14:11:51 +07:00
Jim Geovedi	7aad6718bc	enable tokenizer exceptions	2017-07-24 14:11:10 +07:00
Jim Geovedi	ad56c9179a	added tokenizer exceptions list	2017-07-24 14:10:16 +07:00
Jim Geovedi	c1f3fe99fe	updated punctuation rules	2017-07-24 13:57:21 +07:00
Jim Geovedi	37fa2c8c80	punctution rules	2017-07-24 06:17:18 +07:00
Jim Geovedi	082e94ac1c	added inflix rules	2017-07-24 06:17:07 +07:00
Jim Geovedi	d0ec484725	reverted	2017-07-24 06:16:29 +07:00
Jim Geovedi	0e590c711f	added prefix & suffix rules	2017-07-23 23:46:40 +07:00
Jim Geovedi	ba922e30e8	added ampere hour unit	2017-07-23 23:46:18 +07:00
Jim Geovedi	3b17eba27b	added frequency units	2017-07-23 23:10:52 +07:00
Jim Geovedi	d5fd32a572	added known currencies	2017-07-23 22:56:48 +07:00
Jim Geovedi	f6f15678fb	added lex_attrs	2017-07-23 22:55:22 +07:00
Jim Geovedi	bed8162d00	added tokenizer_exceptions	2017-07-23 22:55:05 +07:00
Jim Geovedi	b80c35bc9a	added norm_exceptions	2017-07-23 22:54:49 +07:00
Jim Geovedi	b5de329ea3	added norm_exceptions	2017-07-23 22:54:19 +07:00
Jim Geovedi	082e9ade46	fixed typo	2017-07-23 21:30:34 +07:00
Jim Geovedi	e2efeb186e	added stopwords	2017-07-23 20:52:37 +07:00
Jim Geovedi	da98676839	use template	2017-07-23 20:51:31 +07:00
Jim Geovedi	c2b4dd7809	start working on Indonesian language	2017-07-23 20:50:56 +07:00
mollerhoj	85144835da	Add Tag_map for Danish	2017-07-03 15:52:55 +02:00
mollerhoj	64c732918a	Add Morph_rules. (TODO: Not working?)	2017-07-03 15:52:55 +02:00
mollerhoj	3b2cb107a3	Add like_num functionality to Danish	2017-07-03 15:49:51 +02:00
mollerhoj	e8f40ceed8	Add short names of months to tokenizer_exceptions	2017-07-03 15:49:51 +02:00
mollerhoj	23025d3b05	Clean up a couple of strange English stopwords	2017-07-03 15:41:59 +02:00
mollerhoj	dc5be7d2f3	Cleanup list of Danish stopwords	2017-07-03 15:40:58 +02:00
Ines Montani	c91642efd5	Port over changes from #1168	2017-07-01 11:43:54 +02:00
Jim O'Regan	70f4d26c10	bounds checks	2017-06-28 10:59:46 +01:00
Jim O'Regan	1ba38b2036	some helpers; the Irish part of UD only has 2500 sentences so this will need source of morphology	2017-06-28 00:42:00 +01:00
Jim O'Regan	559e03605a	b'	2017-06-27 22:42:16 +01:00
Jim Regan	d81ceb0cd5	Merge branch 'develop' into polish	2017-06-26 22:42:27 +01:00
Jim O'Regan	2f84c73585	a start	2017-06-26 22:40:04 +01:00
Jim O'Regan	28d7f0a672	reference	2017-06-26 22:38:28 +01:00
Jim O'Regan	e12defdd9c	missed a couple	2017-06-26 22:24:14 +01:00
Jim O'Regan	c1e4e0f3bf	just now discovered that you can do multiwords	2017-06-26 22:19:39 +01:00
Jim O'Regan	5e5f94c1c0	fix dup	2017-06-26 21:57:00 +01:00
Jim O'Regan	a8dff9133e	add POS	2017-06-26 21:53:41 +01:00
Jim O'Regan	e9213f54de	missed one	2017-06-26 21:29:21 +01:00
Jim O'Regan	1eb7cc3017	attempt a port from #1147	2017-06-26 21:24:55 +01:00
Matthew Honnibal	91e52543ef	Merge pull request #1118 from Gregory-Howard/patch-2 Update _tokenizer_exceptions_list (adding cities)	2017-06-20 11:16:07 +02:00
Tpt	7745b3ae04	Adds noun chunks to French syntax iterators	2017-06-12 15:29:58 +02:00
Grégory Howard	cd974b32b7	Update _tokenizer_exceptions_list (adding cities)	2017-06-09 17:58:18 +02:00
Matthew Honnibal	55d0621532	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-06-04 15:53:25 -05:00
Matthew Honnibal	e28f90b672	Fix syntax iterators	2017-06-04 15:51:50 -05:00
Ines Montani	112c5787eb	Merge pull request #1101 from oroszgy/hu_tokenizer_fix More robust Hungarian tokenizer.	2017-06-04 22:37:51 +02:00
ines	9254a3dd78	Import and add Spanish syntax iterators	2017-06-04 21:42:15 +02:00
Matthew Honnibal	7ca215bc26	Resolve lex_attr_getters conflict	2017-06-03 16:12:01 -05:00
ines	4c643d74c5	Add norm exceptions to other Language classes	2017-06-03 22:29:21 +02:00
ines	fa7e576c57	Change order of exception dicts	2017-06-03 21:52:06 +02:00
Matthew Honnibal	3f5c85d8de	Reorder setting of lex attrs, to avoid clobbering	2017-06-03 14:47:55 -05:00
Matthew Honnibal	aeb7520133	Make norm use lower-case	2017-06-03 14:47:38 -05:00
Matthew Honnibal	de3954843e	Populate norm exceptions with lower-case	2017-06-03 14:47:12 -05:00
ines	e47eef5e03	Update German tokenizer exceptions and tests	2017-06-03 21:07:44 +02:00
ines	0d6fa8b241	Add German norm exceptions	2017-06-03 20:54:18 +02:00
ines	5bd311c77e	Fix update of norm exceptions	2017-06-03 20:54:09 +02:00
ines	746653880c	Add English norm exceptions to lex_attrs	2017-06-03 20:27:28 +02:00
ines	095eeeb12f	Update English tokenizer exceptions and add norms	2017-06-03 20:27:16 +02:00
ines	e5d426406a	Add base norm exceptions	2017-06-03 20:27:05 +02:00
ines	2f1025a94c	Port over Spanish changes from #1096	2017-06-02 19:09:58 +02:00
Gyorgy Orosz	f0c3b09242	More robust Hungarian tokenizer.	2017-05-31 22:28:40 +02:00
Gyorgy Orosz	8c0b4b850e	Fixed emoji handling for Hungarian	2017-05-30 21:34:46 +02:00
ines	84189c1cab	Add 'xx' language ID for multi-language support Allows models to specify their language ID as 'xx'.	2017-05-28 00:58:59 +02:00
ines	33e332e67c	Remove unused export	2017-05-28 00:57:59 +02:00
ines	a8e58e04ef	Add symbols class to punctuation rules to handle emoji (see #1088) Currently doesn't work for Hungarian, because of conflicts with the custom punctuation rules. Also doesn't take multi-character emoji like 👩🏽‍💻 into account.	2017-05-27 17:57:10 +02:00
Matthew Honnibal	5db89053aa	Merge docstrings	2017-05-21 13:46:23 -05:00
ines	924e8506de	Move Defaults subclass to module scope (necessary for pickling)	2017-05-20 19:02:27 +02:00
Matthew Honnibal	61fe55efba	Move EnglishDefaults class out of English	2017-05-20 02:18:19 -05:00
Matthew Honnibal	8815507f8e	Move SpanishDefaults out of Language class, for pickle	2017-05-18 04:28:51 -05:00
ines	1a05078c79	Add language-specific syntax iterators to en and de	2017-05-17 12:04:03 +02:00
Matthew Honnibal	4b9d69f428	Merge branch 'v2' into develop * Move v2 parser into nn_parser.pyx * New TokenVectorEncoder class in pipeline.pyx * New spacy/_ml.py module Currently the two parsers live side-by-side, until we figure out how to organize them.	2017-05-14 01:10:23 +02:00
ines	a4a37a783e	Remove import from non-existing module	2017-05-13 16:00:09 +02:00
ines	c13b3fa052	Add LEX_ATTRS	2017-05-12 15:37:45 +02:00
ines	bca2ea9c72	Update Portuguese lexical attributes	2017-05-12 15:37:39 +02:00
ines	2f870123bf	Fix formatting	2017-05-12 15:37:20 +02:00
ines	ca65993d59	Add basic Polish Language class	2017-05-12 09:25:37 +02:00
ines	48177c4f92	Add missing tokenizer exceptions	2017-05-12 09:25:24 +02:00
ines	bb8be3d194	Add Danish language data	2017-05-10 21:15:12 +02:00
ines	a0b00624bb	Make sure like_email returns bool	2017-05-09 11:37:29 +02:00
ines	ea60932e1b	Fix formatting	2017-05-09 11:08:14 +02:00
ines	02d0ac5cab	Remove redundant function and fix formatting	2017-05-09 11:06:04 +02:00
ines	b5ca50607e	Reorganise entity rules	2017-05-09 01:37:10 +02:00
ines	12c3d5fbba	Fix formatting	2017-05-09 01:15:28 +02:00
ines	2829a024ef	Re-add basic like_num check to global lex_attrs	2017-05-09 01:15:23 +02:00
ines	88adeee548	Add English lex_attrs overrides	2017-05-09 01:09:52 +02:00
ines	8f3fbbb147	Fix typos	2017-05-09 01:09:37 +02:00
ines	2216e5f326	Reorganise lex_attrs and add dict	2017-05-09 00:57:54 +02:00
ines	e666f14d20	Add global lex_attrs	2017-05-09 00:41:53 +02:00
ines	41972c43fe	Use consistent regex imports	2017-05-09 00:34:31 +02:00
ines	9f0fd5963f	Reorganise Hungarian punctuation rules	2017-05-09 00:01:59 +02:00
ines	fc0d793360	Reorganise Bengali punctuation rules	2017-05-09 00:01:52 +02:00
ines	e895d1afd7	Reorganise French punctuation rules	2017-05-09 00:00:54 +02:00
ines	014bda0ae3	Reorganise global punctuation rules	2017-05-09 00:00:46 +02:00
ines	a91278cb32	Rename _URL_PATTERN to URL_PATTERN	2017-05-09 00:00:00 +02:00
ines	604f299cf6	Add char classes to global language data	2017-05-08 23:59:33 +02:00
ines	f6f5d78cb9	Fix formatting	2017-05-08 23:59:17 +02:00
ines	3c0f85de8e	Remove imports in /lang/__init__.py	2017-05-08 23:58:07 +02:00
ines	614aa09582	Tidy up Bengali tokenizer exceptions	2017-05-08 22:29:49 +02:00
ines	73b577cb01	Fix relative imports	2017-05-08 22:29:04 +02:00
ines	ae99990f63	Fix formatting	2017-05-08 22:23:48 +02:00
ines	f46ffe3e89	Move language data to /lang module	2017-05-08 20:00:40 +02:00


802 Commits