spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-26 18:06:29 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	341a3e85cd	* Upd downloaded data version	2015-10-23 00:56:57 +02:00
Henning Peters	ccffd2ef53	fixed extract directory	2015-10-21 07:59:34 +02:00
Henning Peters	da4c9cee06	assert filename match	2015-10-20 19:33:59 +02:00
Henning Peters	4f703f0cb4	better error reporting, cleanup	2015-10-20 19:11:29 +02:00
Matthew Honnibal	9cdea6e450	* Import uget correctly	2015-10-19 08:32:41 +02:00
Henning Peters	bfde91fa49	add custom download tool (uget), replace wget with uget	2015-10-18 12:35:04 +02:00
Matthew Honnibal	e886e6a406	* Inc version	2015-10-13 13:46:17 +11:00
Matthew Honnibal	a3dfe2b901	* Increment data version	2015-10-09 13:26:17 +02:00
Matthew Honnibal	b228a8f4a6	* Remove spacy/en/attrs	2015-10-06 16:20:46 +11:00
Matthew Honnibal	693677fd8d	* Prepare to remove en/attrx file, now that moving to symbols.pyx	2015-10-06 16:20:13 +11:00
Matthew Honnibal	ecc5281b36	* Remove en/pos.pyx, as the tagger code now lives in spacy/tagger.pyx	2015-10-06 10:12:08 +11:00
Robert	8711b64860	Force SSL for downloading English language data. It would also be nice to have a checksum for this.	2015-09-21 17:26:01 -07:00
Matthew Honnibal	e13e47e9e5	* Add English stop words	2015-09-14 17:48:51 +10:00
Matthew Honnibal	0b7d2a6c62	* Inc version	2015-09-13 01:26:29 +02:00
Matthew Honnibal	e2ef78b29c	* Gut pos.pyx module, since functionality moved to spacy/tagger.pyx	2015-08-26 19:15:42 +02:00
Matthew Honnibal	c4d8754385	* Specify LOCAL_DATA_DIR global in spacy.en.__init__.py	2015-08-26 19:15:07 +02:00
Matthew Honnibal	c5a27d1821	* Move lemmatizer to spacy	2015-08-25 15:47:08 +02:00
Matthew Honnibal	82217c6ec6	* Generalize lemmatizer	2015-08-25 15:46:19 +02:00
Matthew Honnibal	8083a07c3e	* Use language base class	2015-08-25 15:37:30 +02:00
Matthew Honnibal	5dd76be446	* Split EnPosTagger up into base class and subclass	2015-08-24 05:25:55 +02:00
Matthew Honnibal	6f1743692a	* Work on language-independent refactoring	2015-08-23 20:49:18 +02:00
Matthew Honnibal	cad0cca4e3	* Tmp	2015-08-22 22:04:34 +02:00
Matthew Honnibal	5737115e1e	* Work on gazetteer matching	2015-08-06 14:33:21 +02:00
Matthew Honnibal	c609ea18f0	* Increment version in download script	2015-07-28 15:22:17 +02:00
Matthew Honnibal	ddc1a5cfe5	* Fix training under python3	2015-07-28 14:09:30 +02:00
Matthew Honnibal	a296d72b54	* Fix en/attrs	2015-07-27 21:16:33 +02:00
Matthew Honnibal	8535d872e8	* Set is_oov property in get_flags	2015-07-27 01:51:24 +02:00
Matthew Honnibal	8e4c69ee8c	* Add is_oov property, and fix up handling of attributes	2015-07-27 01:50:06 +02:00
Matthew Honnibal	6bb96c122d	* Host IS_ flags in attrs.pxd, and add properties for them on Token and Lexeme objects	2015-07-26 16:37:16 +02:00
Matthew Honnibal	eeaea25f0c	* Check oov_prob file is present	2015-07-26 16:36:38 +02:00
Matthew Honnibal	1b5d1da2a7	* Allow an OOV probability to be specified in get_lex_props	2015-07-26 00:03:43 +02:00
Matthew Honnibal	cd6e25132b	* Allow an OOV probability to be specified in get_lex_props	2015-07-26 00:01:46 +02:00
Matthew Honnibal	5b41744270	* Check for directory presence before loading annotators	2015-07-23 09:27:37 +02:00
Matthew Honnibal	12699a1152	* Set initial freqs, to avoid missing values in serializer	2015-07-23 01:16:27 +02:00
Matthew Honnibal	680bb47b55	* Write serializer freqs to single file, vocab/serializer.json	2015-07-23 01:15:25 +02:00
Matthew Honnibal	38ef986b29	* Update spacy/en/attrs.pxd	2015-07-23 01:10:58 +02:00
Matthew Honnibal	c86dbe4944	* Update English.save_models for new Packer save/load stuff	2015-07-22 13:40:23 +02:00
Matthew Honnibal	317cbbc015	* Serialization round trip now working with decent API, but with rough spots in the organisation and requiring vocabulary to be fixed ahead of time.	2015-07-19 15:18:17 +02:00
Matthew Honnibal	4dddc8a69b	* Fix type declarations for attr_t. Remove unused id_t.	2015-07-18 22:39:57 +02:00
Matthew Honnibal	95e57c2780	* Remove unnecessary key and id properties from Utf8String.	2015-07-17 01:40:18 +02:00
Matthew Honnibal	db9dfd2e23	* Major refactor of serialization. Nearly complete now.	2015-07-17 01:27:54 +02:00
Matthew Honnibal	897de2d438	* Add 'bitter' property for serializer in English class	2015-07-16 17:47:53 +02:00
Matthew Honnibal	6eef0bf9ab	* Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx	2015-07-13 20:20:58 +02:00
Matthew Honnibal	ff9ff6f3fa	* Ensure unseen words are given low log probability	2015-07-12 01:31:09 +02:00
Matthew Honnibal	89a91ad726	* Add SPACE part-of-speech tag, and train tagger to assign it. Also train tagger not to make whitespace an entity	2015-07-09 13:30:41 +02:00
Matthew Honnibal	6ddb2f5e45	* Restore merge_mwe in English class	2015-07-08 19:35:30 +02:00
Matthew Honnibal	6859f6adac	* Restore merge_mwe in English class	2015-07-08 19:34:55 +02:00
Matthew Honnibal	e3c53f5ecd	* Fix mention of Tokens in docstring	2015-07-08 18:56:27 +02:00
Matthew Honnibal	bb522496dd	* Rename Tokens to Doc	2015-07-08 18:53:00 +02:00
Matthew Honnibal	4e4fac452b	* Refactor __init__ for simplicity. Allow parse=True, tag=True etc flags to be passed at top-level. Do not lazy-load parser.	2015-07-08 12:35:29 +02:00
Matthew Honnibal	1d2deb4616	* Work on refactoring default arguments to English.__init__	2015-07-07 15:53:25 +02:00
Matthew Honnibal	6788c86b2f	* Begin refactor	2015-07-07 14:00:07 +02:00
Matthew Honnibal	9af86b0b0b	* Fix attrs.pxd	2015-06-30 18:16:30 +02:00
Matthew Honnibal	5d595b5a8c	* Inc versions	2015-06-30 18:11:06 +02:00
Matthew Honnibal	d2eeba6667	* Start wiring up color and emotion lexicons. Hopefully we get to use them.	2015-06-30 16:22:23 +02:00
Matthew Honnibal	b266a63f2c	* Inc version of downloadable data	2015-06-24 04:53:08 +02:00
Matthew Honnibal	7d265a9c62	* Revert to wget in spacy.en.download	2015-06-08 00:48:56 +02:00
Matthew Honnibal	1515862861	* Fix download.py	2015-06-08 00:08:05 +02:00
Matthew Honnibal	7e9e8f654a	* Use urllib in spacy.en.download	2015-06-07 23:51:38 +02:00
Matthew Honnibal	80cff41a9c	* Upd download.py	2015-06-07 19:13:28 +02:00
Matthew Honnibal	58d5ac0944	* Add beam search capabilities to Parser. Rename GreedyParser to Parser.	2015-06-02 00:28:02 +02:00
Matthew Honnibal	62424e6c76	* Remove unused regularize argument from _ml.Model	2015-06-02 00:27:07 +02:00
Matthew Honnibal	04bda8648d	* Pass parameter for regularization to model	2015-05-27 03:16:58 +02:00
Matthew Honnibal	eba7b34f66	* Add flag to disable loading of word vectors	2015-05-25 01:02:42 +02:00
Matthew Honnibal	03ebf70a66	* Inc version to 0.84	2015-05-12 02:38:51 +02:00
Matthew Honnibal	fb8d50b3d5	Merge branch 'master' of ssh://github.com/honnibal/spaCy	2015-04-30 12:45:15 +02:00
Matthew Honnibal	378c2a6435	* Fix POS model: make it use tag instead of pos in history features	2015-04-29 00:02:53 +02:00
Jordan Suchow	3a8d9b37a6	Remove trailing whitespace	2015-04-19 13:01:38 -07:00
Matthew Honnibal	cc4e395927	* Add some ad hoc regexes, for multi-word location prepositions	2015-04-17 04:44:24 +02:00
Matthew Honnibal	684d0e5e85	* Download updated data	2015-04-16 04:29:15 +02:00
Matthew Honnibal	42617548af	* Disable merge_mwes by default	2015-04-16 04:20:31 +02:00
Matthew Honnibal	77d0700caf	* Add on X way regexes	2015-04-16 01:35:46 +02:00
Matthew Honnibal	c6707778dd	* Fix Issue #51 : Handle non-ascii lemmas correctly	2015-04-13 22:28:59 +02:00
Matthew Honnibal	761a19113a	* Fix /tmp moving thing in download.py	2015-04-12 07:04:10 +02:00
Matthew Honnibal	b64b2bd910	* Fix Issue #43 : TAG attr not supported. Also add DEP attr, while I'm at it. Need better way of ensuring future changes don't break in similar way.	2015-04-07 06:00:30 +02:00
Matthew Honnibal	b8d34531c4	* Add support for units to English.__init__, by loading and applying regular expressions	2015-04-07 04:02:32 +02:00
Matthew Honnibal	2fee67cfa3	* Add regular expressions for English multi-word expressions	2015-04-07 03:45:18 +02:00
Matthew Honnibal	567388e38d	* Use values encoded by StringStore in POS tagging, rather than indices into a list of tags	2015-03-26 16:44:45 +01:00
Matthew Honnibal	801bf14f4f	* Clean up handling of dep_strings and ent_strings, using StringStore to encode the label names.	2015-03-26 16:44:45 +01:00
Matthew Honnibal	f21ab2d7fb	* Fix bug in ugly ent_strings hack on English class	2015-03-26 16:44:45 +01:00
Matthew Honnibal	8057a95f20	* NER seems to be working, scoring 69 F. Need to add decision-history features --- currently only use current word, 2 words context. Need refactoring.	2015-03-26 16:44:44 +01:00
Matthew Honnibal	220ce8bfed	* Prepare English class for NER	2015-03-26 16:44:44 +01:00
Matthew Honnibal	179b7eb0a7	* Specify parser transition system in language	2015-03-26 16:44:43 +01:00
Matthew Honnibal	8cc3524dc9	* Ws	2015-03-26 16:44:41 +01:00
Matthew Honnibal	2e8d0e5d45	* Upd download script	2015-03-03 05:47:16 -05:00
Matthew Honnibal	caf046b220	* Hastily add method to apply tags from a list of strings, instead of predicting the tags.	2015-02-23 15:40:17 -05:00
Matthew Honnibal	64645a1c2f	* Improve docstring on English	2015-02-11 15:13:20 -05:00
Matthew Honnibal	594e50bd45	* Add option to download speech-parsing data set.	2015-02-11 14:20:29 -05:00
Matthew Honnibal	0b7e769211	* Add POS tags to support SWBD tag set	2015-02-11 14:08:28 -05:00
Matthew Honnibal	312b3a45f3	* Fix issue #19 : Allow parsing/pos tagging of empty strings	2015-02-10 10:15:58 -05:00
Matthew Honnibal	2a0615104b	* Upd download script	2015-02-09 10:22:59 -05:00
Matthew Honnibal	5c3513583d	* Clear buffered python tokens when modifying the Tokens object. Need to clean this up, and modify via a method on Tokens.	2015-02-09 03:57:10 -05:00
Matthew Honnibal	be5536d239	* Fix Issue #22 : PRP and PRP$ were mapped to NOUN. Should be PRON.	2015-02-08 18:36:18 -05:00
Matthew Honnibal	44c7eafe44	* Fix download.py	2015-02-07 12:00:36 -05:00
Matthew Honnibal	6ca7f2eedc	* Upd download script	2015-02-07 11:32:33 -05:00
Matthew Honnibal	56c2ef2982	* Tweak POS features for web text	2015-02-02 11:59:36 +11:00
Matthew Honnibal	a20fdbd8ee	* Upd download script	2015-02-01 13:22:23 +11:00
Matthew Honnibal	63abdf154c	* Hastily hack download file	2015-01-31 22:48:32 +11:00
Matthew Honnibal	a1ed574b7b	* Fix default model path for English	2015-01-31 16:38:27 +11:00
Matthew Honnibal	e013555b25	* Add option to download script	2015-01-31 13:51:56 +11:00
Matthew Honnibal	024cfd485c	* Pass tag_strings as a tuple, to support new Tokens API	2015-01-31 13:43:37 +11:00
Matthew Honnibal	83a4df5a1a	* Fix download script	2015-01-30 20:40:42 +11:00
Matthew Honnibal	6f9ebc2f34	* Fix download script	2015-01-30 20:33:19 +11:00
Matthew Honnibal	8b85d0bb8a	* Only download small data if no data dir exists	2015-01-30 20:27:14 +11:00
Matthew Honnibal	cb95ef6934	* Fix download script	2015-01-30 19:28:43 +11:00
Matthew Honnibal	e578bd37bd	* Fix download script	2015-01-30 18:59:31 +11:00
Matthew Honnibal	df52014d12	* Fix download script	2015-01-30 18:36:24 +11:00
Matthew Honnibal	998b607f65	* Upd download script, having it download all data if there's no data/ directory, allowing easier compilation from source	2015-01-30 18:04:01 +11:00
Matthew Honnibal	67d6e53a69	* Ensure parser and tagger function correctly when training from missing values, indicated by -1	2015-01-30 14:08:56 +11:00
Matthew Honnibal	c38c62d4a3	* Add docstring to English class	2015-01-27 02:45:21 +11:00
Matthew Honnibal	7f87716cf7	* Fix download script	2015-01-25 23:01:10 +11:00
Matthew Honnibal	12b034e3ef	* Move POS tag definitions to parts_of_speech.pxd	2015-01-25 16:31:07 +11:00
Matthew Honnibal	7431c133d8	* Add error if try to access head and not is_parsed	2015-01-25 15:33:54 +11:00
Matthew Honnibal	951d06c824	* Silently don't parse if data is not present	2015-01-25 14:47:38 +11:00
Matthew Honnibal	4e857ab7a6	* Fix bug in POS tagger feature	2015-01-25 02:20:15 +11:00
Matthew Honnibal	dd56e298e2	* Ensure tagging is applied if parse=True	2015-01-25 02:19:44 +11:00
Matthew Honnibal	94750819cd	* Set parse=True by default --- i.e. parse unless told not to.	2015-01-25 01:28:28 +11:00
Matthew Honnibal	a97bed9359	* Fix POS and dependency label tag names. Add parse and string navigation functions.	2015-01-24 17:29:04 +11:00
Matthew Honnibal	fda94271af	* Rename NORM1 and NORM2 attrs to lower and norm	2015-01-24 06:17:03 +11:00
Matthew Honnibal	5ed8b2b98f	* Rename sic to orth	2015-01-23 02:08:25 +11:00
Matthew Honnibal	f2a229136c	* Fix data_dir=None argument to English class	2015-01-21 18:27:31 +11:00
Matthew Honnibal	ef49b8c179	* Add stop-word flag	2015-01-21 18:22:31 +11:00
Matthew Honnibal	6646bfc5df	* Add LOWER attr	2015-01-21 18:19:08 +11:00
Matthew Honnibal	6c7e44140b	* Work on word vectors, and other stuff	2015-01-17 16:21:17 +11:00
Matthew Honnibal	7d3c40de7d	* Tests passing after refactor. API has obvious warts, particularly in Token and Lexeme	2015-01-15 00:33:16 +11:00
Matthew Honnibal	0930892fc1	* Tmp. Working on refactor. Compiles, must hook up lexical feats.	2015-01-14 00:03:48 +11:00
Matthew Honnibal	46da3d74d2	* Tmp. Refactoring, introducing a Lexeme PyObject.	2015-01-12 11:23:44 +11:00
Matthew Honnibal	ce2edd6312	* Tmp commit. Refactoring to create a Python Lexeme class.	2015-01-12 10:26:22 +11:00
Matthew Honnibal	7689dccd0f	* Remove unused import	2015-01-05 18:48:48 +11:00
Matthew Honnibal	3f1944d688	* Make PyPy work	2015-01-05 17:54:38 +11:00
Matthew Honnibal	a510d9f677	* Another assertion removed	2015-01-05 13:01:40 +11:00
Matthew Honnibal	2856946a66	* Remove assertion that doesn't work on Python 3	2015-01-05 12:51:16 +11:00
Matthew Honnibal	94034f1112	* Fix encoding in lemmatization	2015-01-05 11:54:29 +11:00
Matthew Honnibal	b132b3caa6	* Fix unicode error in lemmatizer	2015-01-05 11:53:54 +11:00
Matthew Honnibal	477e7fbffe	* Fix data reading for lemmatizer	2015-01-05 06:01:32 +11:00
Matthew Honnibal	4e085d5166	* Fix lemmatizer for Python3	2015-01-05 05:51:26 +11:00
Matthew Honnibal	0e4c2ba036	* Fix loading of special morph words	2015-01-03 23:13:00 +11:00
Matthew Honnibal	f5d41028b5	* Move around data files for test release	2015-01-03 01:59:22 +11:00
Matthew Honnibal	a24321b63a	* Add downloader	2015-01-02 21:44:41 +11:00
Matthew Honnibal	5d9a096e2f	* Some minor clean-up after HastyModel	2014-12-31 19:46:04 +11:00
Matthew Honnibal	aafaf58cbe	* Refactor _ml.Model, and finish implementing HastyModel so far not worthwhile.	2014-12-31 19:40:59 +11:00
Matthew Honnibal	1a075f77ff	* Don't over-ride pre-loaded POS tags, if set by special-cases	2014-12-30 23:26:32 +11:00
Matthew Honnibal	785c7ba76a	* Embed signature on attrs	2014-12-30 23:25:31 +11:00
Matthew Honnibal	30e5805656	* Lazy-load tagger and parser	2014-12-30 23:25:09 +11:00
Matthew Honnibal	bb0b00f819	* Repurpose the Tagger class as a generic Model, wrapping thinc's interface	2014-12-30 21:20:15 +11:00
Matthew Honnibal	bb80937544	* Upd docstrings	2014-12-27 18:45:16 +11:00
Matthew Honnibal	b8b65903fc	* Tmp	2014-12-24 17:42:00 +11:00
Matthew Honnibal	7708d0e24a	* Move lemmatizer to en dir	2014-12-23 15:16:57 +11:00
Matthew Honnibal	98eb4c0426	* Fix path to parser model	2014-12-23 15:09:09 +11:00
Matthew Honnibal	b00bc01d8c	* All tests now passing for reorg	2014-12-23 13:18:59 +11:00

255 Commits