spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-01-17 05:49:11 +03:00

Author	SHA1	Message	Date
Raphaël Bournhonesque	3452d6ce52	Resolve issue #1078 by simplifying URL pattern - avoid catastrophic backtracking - reduce character range of host name, domain name and TLD identifier	2017-10-11 11:24:00 +02:00
Yam	923c4c2fb2	Update punctuation.py add `……`	2017-09-22 09:50:46 +08:00
Yam	978b24ccd4	Update punctuation.py In Chinese, `~` and `——` is hyphens, `·` is intermittent symbol	2017-09-20 23:02:22 +08:00
Yu-chun Huang	188b439b25	Add Chinese punctuation Add Chinese punctuation.	2017-09-19 16:58:42 +08:00
Yu-chun Huang	1f1f35dcd0	Add Chinese punctuation Add Chinese punctuation.	2017-09-19 16:57:24 +08:00
Matthew Honnibal	8b9c4c5e1c	Add missing SP symbol to tag map, re #1052	2017-07-22 13:44:17 +02:00
Ben Eyal	33af52599e	Redefine alphabetic characters For caseless languages (Hebrew, Bengali) all characters are both lowercase and uppercase.	2017-04-20 02:25:02 +03:00
Ben Eyal	d8098a8be2	Use `regex` instead of `re`	2017-04-20 02:22:52 +03:00
ines	bf0f15e762	Add / to tokenizer infixes (resolves #891)	2017-04-07 17:30:44 +02:00
Matthew Honnibal	83dca920d4	Rename test #913 -> #957, comment Make test for #957 reference correct bug. Add comment. Previous commit closes #957.	2017-04-07 15:54:25 +02:00
Matthew Honnibal	e7b1ee9efd	Switch to regex module for URL identification The URL detection regex was failing on input such as 0.1.2.3, as this input triggered excessive back-tracking in the builtin re module. The solution was to switch to the regex module, which behaves better. Closes #913.	2017-04-07 15:47:36 +02:00
ines	66c1f194f9	Use consistent unicode declarations	2017-03-12 13:07:28 +01:00
Matthew Honnibal	ea2592879f	Merge branch 'master' of https://github.com/explosion/spaCy	2017-03-11 11:13:37 -06:00
ines	b04893a059	Make regex locale-independent for Python 2	2017-03-10 14:21:57 +01:00
Matthew Honnibal	ea53647362	Merge branch 'develop'	2017-03-10 02:49:39 -06:00
Dan Rapp	3b1df3808d	Issue #840 - URL pattenr too broad	2017-03-09 11:39:39 -07:00
Roman Inflianskas	66e1109b53	Add support for Universal Dependencies v2.0	2017-03-03 13:17:34 +01:00
Ines Montani	012f4820cb	Keep infixes of punctuation + hyphens as one token (see #801)	2017-02-02 16:22:40 +01:00
Ines Montani	1219a5f513	Add = to tokenizer prefixes	2017-02-02 16:21:11 +01:00
Ines Montani	ff04748eb6	Add missing emoticon	2017-02-02 16:21:00 +01:00
Ines Montani	116c675c3c	Merge pull request #742 from oroszgy/hu_tokenizer_fix Improved Hungarian tokenizer	2017-01-14 23:52:44 +01:00
Gyorgy Orosz	63037e79af	Fixed hyphen handling in the Hungarian tokenizer.	2017-01-14 16:30:11 +01:00
Gyorgy Orosz	be7a7aeb1a	Reversed accidental changes.	2017-01-14 15:59:36 +01:00
Gyorgy Orosz	1be5da1ac6	Fixed Hungarian tokenizer for numbers	2017-01-14 15:51:59 +01:00
Ines Montani	0894b8c0ef	Don't split tokens with digits and "/" infixes (resolves #740)	2017-01-12 22:58:26 +01:00
Matthew Honnibal	fba67fa342	Fix Issue #736: Times were being tokenized with incorrect string values.	2017-01-12 11:21:01 +01:00
Ines Montani	aa876884f0	Revert "Revert "Merge remote-tracking branch 'origin/master'"" This reverts commit `fb9d3bb022`.	2017-01-09 13:28:13 +01:00
Ines Montani	eef94e3ee2	Split off period after two or more uppercase letters (fixes #483)	2017-01-08 22:28:25 +01:00
Ines Montani	347c4a2d06	Reorganise and reformat global tokenizer prefixes, suffixes and infixes	2017-01-08 20:37:39 +01:00
Ines Montani	7c3cb2a652	Add global abbreviations data	2017-01-08 20:34:03 +01:00
Ines Montani	bc911322b3	Move ") to emoticons (see Tweebo challenge test)	2017-01-05 18:05:38 +01:00
Ines Montani	fb9d3bb022	Revert "Merge remote-tracking branch 'origin/master'" This reverts commit `d3b181cdf1`, reversing changes made to `b19cfcc144`.	2017-01-03 18:21:36 +01:00
Matthew Honnibal	9936a1b9b5	Merge branch 'tokenization_w_exception_patterns' of https://github.com/oroszgy/spaCy.hu into oroszgy-tokenization_w_exception_patterns	2016-12-30 14:53:40 -06:00
Petter Hohle	f112e7754e	Add PART to tag map 16 of the 17 PoS tags in the UD tag set is added; PART is missing.	2016-12-28 18:39:01 +01:00
Gyorgy Orosz	3a9be4d485	Updated token exception handling mechanism to allow the usage of arbitrary functions as token exception matchers.	2016-12-23 23:49:34 +01:00
Gyorgy Orosz	1748549aeb	Added exception pattern mechanism to the tokenizer.	2016-12-21 23:16:19 +01:00
Ines Montani	920fa0fed2	Add DET_LEMMA constant	2016-12-21 18:05:41 +01:00
Ines Montani	4e95737c6c	Add base tag map	2016-12-18 16:54:28 +01:00
Ines Montani	2b2ea8ca11	Reorganise language data	2016-12-18 16:54:19 +01:00
Ines Montani	bc40dad7d9	Add entity rules	2016-12-18 15:36:53 +01:00
Ines Montani	eaa3b1319d	Fix formatting	2016-12-18 15:36:53 +01:00
Ines Montani	62655fd36f	Add ENT_ID constant	2016-12-18 15:36:53 +01:00
Ines Montani	f324311249	Add global language data utils	2016-12-17 12:27:41 +01:00
Ines Montani	e47ee94761	Split punctuation into its own file	2016-12-08 19:46:43 +01:00
Ines Montani	e8ae588be9	Add emoticons	2016-12-08 19:45:18 +01:00
Ines Montani	5908c0ed9f	Fix formatting	2016-12-08 19:45:11 +01:00
Ines Montani	0d07d7fc80	Apply emoticon exceptions to tokenizer	2016-12-07 21:11:59 +01:00
Ines Montani	9413bcd9ee	Declare encoding and unicode literals	2016-12-07 21:10:34 +01:00
Ines Montani	a280ff2657	Fix __all__	2016-12-07 21:10:12 +01:00
Ines Montani	ba8721953c	Add missing emoticons	2016-12-07 21:09:44 +01:00

51 Commits