spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-07-15 18:52:29 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	1a2f61725c	Fix tokenizer serialization	2018-07-06 12:23:04 +02:00
ines	63666af328	Merge branch 'master' into develop	2018-07-04 14:52:25 +02:00
Bùi Trung Chí	9af46b4f1b	Fix loading tokenizer with custom prefix search (#2495) * Add contributor agreement * Fix loading tokenizer with custom prefix search	2018-07-04 12:56:07 +02:00
Matthew Honnibal	46d8a66fef	Fix tokenizer serialization if token_match is None	2018-06-29 14:24:46 +02:00
Ines Montani	3141e04822	💫 New system for error messages and warnings (#2163) * Add spacy.errors module * Update deprecation and user warnings * Replace errors and asserts with new error message system * Remove redundant asserts * Fix whitespace * Add messages for print/util.prints statements * Fix typo * Fix typos * Move CLI messages to spacy.cli._messages * Add decorator to display error code with message. An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc. * Remove unused link in spacy.about * Update errors for invalid pipeline components * Improve error for unknown factories * Add displaCy warnings * Update formatting consistency * Move error message to spacy.errors * Update errors and check if doc returned by component is None	2018-04-03 15:50:31 +02:00
Matthew Honnibal	6bc0f4d29f	Merge pull request #1611 from fsonntag/master Solving #1494	2017-11-29 23:11:23 +01:00
Felix Sonntag	724ae7dc55	Fixed issue of infix capturing prefixes	2017-11-28 17:17:12 +01:00
Matthew Honnibal	542e6fd4ea	Don't remove entries from specials	2017-11-23 12:17:42 +00:00
Felix Sonntag	33b0f86de3	Changed tokenizer to add infix when infix_start is offset	2017-11-19 16:32:10 +01:00
Roman Domrachev	61d28d03e4	Try again to do selective remove cache	2017-11-15 19:11:12 +03:00
Roman Domrachev	b3311100c7	Merge branch 'master' of github.com:explosion/spaCy	2017-11-15 18:30:04 +03:00
Roman Domrachev	505c6a2f2f	Completely clean up tokenizer cache. Tokenizer cache can have different keys than string. That modification can slow down tokenizer and needs to be measured	2017-11-15 17:55:48 +03:00
Matthew Honnibal	fe3c42a06b	Fix caching in tokenizer	2017-11-15 13:55:46 +01:00
Roman Domrachev	91e2fa6561	Clean all caches	2017-11-14 21:15:04 +03:00
Daniel Hershcovich	d7ae54ff44	Fix typo in message	2017-11-08 16:06:28 +02:00
ines	9659391944	Update deprecated methods and add warnings	2017-11-01 16:49:42 +01:00
ines	d96e72f656	Tidy up rest	2017-10-27 21:07:59 +02:00
ines	72497c8cb2	Remove comments and add TODO	2017-10-25 12:15:43 +02:00
Matthew Honnibal	b0f6fd3f1d	Disable tokenizer cache for special-cases. Fixes #1250	2017-10-24 16:08:05 +02:00
Matthew Honnibal	f45973848c	Rename 'tokens' variable 'doc' in tokenizer	2017-10-17 18:21:41 +02:00
ines	cd6a29dce7	Port over changes from #1294	2017-10-14 13:28:46 +02:00
ines	7c919aeb09	Make sure serializers and deserializers are ordered	2017-06-03 17:05:09 +02:00
ines	0153b66a86	Return self in Tokenizer.from_bytes	2017-06-03 13:26:13 +02:00
Matthew Honnibal	0561df2a9d	Fix tokenizer serialization	2017-05-31 14:12:38 +02:00
Matthew Honnibal	e9419072e7	Fix tokenizer serialisation	2017-05-31 13:43:31 +02:00
Matthew Honnibal	66af019d5d	Fix serialization of tokenizer	2017-05-31 11:43:40 +02:00
Matthew Honnibal	a318f0cae1	Add to/from disk/bytes methods for tokenizer	2017-05-29 12:24:41 +02:00
ines	c5a653fa48	Update docstrings and API docs for Tokenizer	2017-05-21 13:18:14 +02:00
ines	f216422ac5	Remove deprecated load classmethod	2017-05-21 13:18:01 +02:00
Matthew Honnibal	793430aa7a	Get spaCy train command working with neural network * Integrate models into pipeline * Add basic serialization (maybe incorrect) * Fix pickle on vocab	2017-05-17 12:04:50 +02:00
ines	e1efd589c3	Fix json imports and use ujson	2017-04-15 12:13:34 +02:00
ines	c05ec4b89a	Add compat functions and remove old workarounds. Add ensure_path util function to handle checking instance of path	2017-04-15 12:11:16 +02:00
ines	d24589aa72	Clean up imports, unused code, whitespace, docstrings	2017-04-15 12:05:47 +02:00
ines	561f2a3eb4	Use consistent formatting for docstrings	2017-04-15 11:59:21 +02:00
Raphaël Bournhonesque	f332bf05be	Remove unused import statements	2017-03-21 21:08:54 +01:00
Matthew Honnibal	0ac3d27689	Fix handling of trailing whitespace. Fix off-by-one error that meant trailing spaces were being dropped. Closes #792	2017-03-08 15:01:40 +01:00
Matthew Honnibal	0a6d7ca200	Fix spacing after token_match. The boolean flag indicating a space after the token was being set incorrectly after the token_match regex was applied. Fixes #859.	2017-03-08 14:33:32 +01:00
Raphaël Bournhonesque	dce8f5515e	Allow zero-width 'infix' token	2017-01-23 18:28:01 +01:00
Ines Montani	aa876884f0	Revert "Revert "Merge remote-tracking branch 'origin/master'"" This reverts commit `fb9d3bb022`.	2017-01-09 13:28:13 +01:00
Matthew Honnibal	a36353df47	Temporarily put back the tokenize_from_strings method, while tests aren't updated yet.	2016-11-04 19:18:07 +01:00
Matthew Honnibal	e0c9695615	Fix doc strings for tokenizer	2016-11-02 23:15:39 +01:00
Matthew Honnibal	e9e6fce576	Handle null prefix/suffix/infix search in tokenizer	2016-11-02 20:35:48 +01:00
Matthew Honnibal	8ce8803824	Fix JSON in tokenizer	2016-10-21 01:44:20 +02:00
Matthew Honnibal	95aaea0d3f	Refactor so that the tokenizer data is read from Python data, rather than from disk	2016-09-25 14:49:53 +02:00
Matthew Honnibal	fd65cf6cbb	Finish refactoring data loading	2016-09-24 20:26:17 +02:00
Matthew Honnibal	83e364188c	Mostly finished loading refactoring. Design is in place, but doesn't work yet.	2016-09-24 15:42:01 +02:00
Matthew Honnibal	cc8bf62208	* Fix Issue #360: Tokenizer failed when the infix regex matched the start of the string while trying to tokenize multi-infix tokens.	2016-05-09 13:23:47 +02:00
Matthew Honnibal	519366f677	* Fix Issue #351: Indices off when leading whitespace	2016-05-04 15:53:36 +02:00
Matthew Honnibal	04d0209be9	* Recognise multiple infixes in a token.	2016-04-13 18:38:26 +10:00
Henning Peters	b8f63071eb	add lang registration facility	2016-03-25 18:54:45 +01:00
Matthew Honnibal	141639ea3a	* Fix bug in tokenizer that caused new tokens to be added for affixes	2016-02-21 23:17:47 +00:00
Matthew Honnibal	f9e765cae7	* Add pipe() method to tokenizer	2016-02-03 02:32:37 +01:00
Matthew Honnibal	3e9961d2c4	* If final token is whitespace, don't mark it as owning a trailing space. Fixes Issue #154	2016-01-16 17:08:59 +01:00
Henning Peters	235f094534	untangle data_path/via	2016-01-16 12:23:45 +01:00
Henning Peters	846fa49b2a	distinct load() and from_package() methods	2016-01-16 10:00:57 +01:00
Henning Peters	788f734513	refactored data_dir->via, add zip_safe, add spacy.load()	2016-01-15 18:01:02 +01:00
Henning Peters	bc229790ac	integrate with sputnik	2016-01-13 19:46:17 +01:00
Matthew Honnibal	a6ba43ecaf	* Fix errors in packaging revision	2015-12-29 18:37:26 +01:00
Matthew Honnibal	aec130af56	Use util.Package class for io. Previous Sputnik integration caused API change: Vocab, Tagger, etc were loaded via a from_package classmethod, that required a sputnik.Package instance. This forced users to first create a sputnik.Sputnik() instance, in order to acquire a Package via sp.pool(). Instead I've created a small file-system shim, util.Package, which allows classes to have a .load() classmethod, that accepts either util.Package objects, or strings. We can later gut the internals of this and make it a proxy for Sputnik if we need more functionality that should live in the Sputnik library. Sputnik is now only used to download and install the data, in spacy.en.download	2015-12-29 18:00:48 +01:00
Henning Peters	9027cef3bc	access model via sputnik	2015-12-07 06:01:28 +01:00
Matthew Honnibal	68f479e821	* Rename Doc.data to Doc.c	2015-11-04 00:15:14 +11:00
Chris DuBois	dac8fe7bdb	Add __reduce__ to Tokenizer so that English pickles. - Add tests to test_pickle and test_tokenizer that save to tempfiles.	2015-10-23 22:24:03 -07:00
Matthew Honnibal	3ba66f2dc7	* Add string length cap in Tokenizer.__call__	2015-10-16 04:54:16 +11:00
Matthew Honnibal	c2307fa9ee	* More work on language-generic parsing	2015-08-28 02:02:33 +02:00
Matthew Honnibal	119c0f8c3f	* Hack out morphology stuff from tokenizer, while morphology being reimplemented.	2015-08-26 19:20:11 +02:00
Matthew Honnibal	9c4d0aae62	* Switch to better Python2/3 compatible unicode handling	2015-07-28 14:45:37 +02:00
Matthew Honnibal	0c507bd80a	* Fix tokenizer	2015-07-22 14:10:30 +02:00
Matthew Honnibal	2fc66e3723	* Use Py_UNICODE in tokenizer for now, while sort out Py_UCS4 stuff	2015-07-22 13:38:45 +02:00
Matthew Honnibal	109106a949	* Replace UniStr, using unicode objects instead	2015-07-22 04:52:05 +02:00
Matthew Honnibal	e49c7f1478	* Update oov check in tokenizer	2015-07-18 22:45:28 +02:00
Matthew Honnibal	cfd842769e	* Allow infix tokens to be variable length	2015-07-18 22:45:00 +02:00
Matthew Honnibal	3b5baa660f	* Fix tokenizer	2015-07-14 00:10:51 +02:00
Matthew Honnibal	24d6ce99ec	* Add comment to tokenizer, explaining the spacy attr	2015-07-13 22:29:13 +02:00
Matthew Honnibal	67641f3b58	* Refactor tokenizer, to set the 'spacy' field on TokenC instead of passing a string	2015-07-13 21:46:02 +02:00
Matthew Honnibal	6eef0bf9ab	* Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx	2015-07-13 20:20:58 +02:00
Matthew Honnibal	bb522496dd	* Rename Tokens to Doc	2015-07-08 18:53:00 +02:00
Matthew Honnibal	935bcdf3e5	* Remove redundant tag_names argument to Tokenizer	2015-07-08 12:36:04 +02:00
Matthew Honnibal	2d0e99a096	* Pass pos_tags into Tokenizer.from_dir	2015-07-07 14:23:08 +02:00
Matthew Honnibal	6788c86b2f	* Begin refactor	2015-07-07 14:00:07 +02:00
Matthew Honnibal	98cfd84123	* Remove hyphenation from main tokenizer loop: do it in infix.txt instead. This lets emoticons work	2015-06-06 05:57:03 +02:00
Matthew Honnibal	20f1d868a3	* Tmp commit. Working on whole document parsing	2015-05-24 02:49:56 +02:00
Jordan Suchow	3a8d9b37a6	Remove trailing whitespace	2015-04-19 13:01:38 -07:00
Matthew Honnibal	f02c39dfaf	* Compare to is not None, for more robustness	2015-03-26 16:44:48 +01:00
Matthew Honnibal	7237c805c7	* Load tag for specials.json token	2015-03-26 16:44:46 +01:00
Matthew Honnibal	0492cee8b4	* Fix Issue #24: Lemmas are empty when the L field is missing for special-cased tokens	2015-02-08 18:30:30 -05:00
Matthew Honnibal	4ff180db74	* Fix off-by-one error in commit `0a7fceb`	2015-01-30 12:49:33 +11:00
Matthew Honnibal	0a7fcebdf7	* Fix Issue #12: Incorrect token.idx calculations for some punctuation, in the presence of token cache	2015-01-30 12:33:38 +11:00
Matthew Honnibal	5928d158ce	* Pass the string to Tokens	2015-01-22 02:04:58 +11:00
Matthew Honnibal	6c7e44140b	* Work on word vectors, and other stuff	2015-01-17 16:21:17 +11:00
Matthew Honnibal	ce2edd6312	* Tmp commit. Refactoring to create a Python Lexeme class.	2015-01-12 10:26:22 +11:00
Matthew Honnibal	3f1944d688	* Make PyPy work	2015-01-05 17:54:38 +11:00
Matthew Honnibal	9976aa976e	* Messily fix morphology and POS tags on special tokens.	2014-12-30 23:24:37 +11:00
Matthew Honnibal	4c4aa2c5c9	* Work on train	2014-12-22 07:25:43 +11:00
Matthew Honnibal	e1c1a4b868	* Tmp	2014-12-21 05:36:29 +11:00
Matthew Honnibal	be1bdcbd85	* Move lang.pyx to tokenizer.pyx	2014-12-20 07:55:40 +11:00

145 Commits