spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-26 18:06:29 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	c1ef07788c	Update train_ud.py Create deps folder if it doesn't exist.	2017-01-09 10:55:44 +11:00
Matthew Honnibal	46e98ec029	Move init_model.py script from repo. These meta-tools should live elsewhere	2016-12-18 14:03:40 +01:00
dafnevk	cdf5dcc40a	fixed bug in init_model so that it runs for dutch	2016-12-13 14:33:44 +01:00
Matthew Honnibal	c7889492f9	Fix model saving error for Python 3	2016-11-25 18:04:30 -06:00
Matthew Honnibal	22189e60db	Use unicode literals in train_ud	2016-11-25 17:45:45 -06:00
Matthew Honnibal	da5f0cce36	Fix train_ud script, which trains models from the Universal Dependencies format.	2016-11-25 11:19:33 -06:00
Matthew Honnibal	314bc8d34f	Fix train script for 1.0	2016-11-25 08:57:37 -06:00
Matthew Honnibal	bd1bfcca61	Update train.py	2016-10-13 03:23:48 +02:00
Matthew Honnibal	ea23b64cc8	Refactor training, with new spacy.train module. Defaults still a little awkward.	2016-10-09 12:24:24 +02:00
Matthew Honnibal	53fbd3dd1c	Fix train.py for v1.0.0-rc1	2016-10-05 01:11:46 +02:00
Matthew Honnibal	ae202e7a60	Fix init_model.py	2016-09-25 15:58:51 +02:00
Matthew Honnibal	af847e07fc	Fix usage of pathlib for Python3 -- turning paths to strings.	2016-09-24 21:05:27 +02:00
Matthew Honnibal	d310dc73ef	Fix bin/init_model.py after refactoring	2016-09-24 20:38:18 +02:00
Matthew Honnibal	8036368d96	* Fix model saving	2016-05-23 12:01:46 +00:00
Matthew Honnibal	35214053fd	* Work around get_lex_attr bug introduced during German parsing	2016-05-23 10:53:00 +00:00
Wolfgang Seeker	dae6bc05eb	define German dummy lemmatizer until morphology is done	2016-05-02 16:04:53 +02:00
Matthew Honnibal	8569dbc2d0	* Add initial stuff for Chinese parsing	2016-04-24 18:44:24 +02:00
Wolfgang Seeker	f9150ccf2a	rename vectors.tgz to vectors.bz2 because it's not compressed with gzip but bzip	2016-04-08 13:38:07 +02:00
Wolfgang Seeker	a8f4e49900	update init_model.py to previous (better) state	2016-03-29 16:12:13 +02:00
Matthew Honnibal	d249e2f7f3	* Improve error message in bin/parser/train.py	2016-03-29 13:04:33 +11:00
Yaser Martinez Palenzuela	3c210f45fa	make use of log_smooth_count	2016-03-17 12:19:52 +01:00
Matthew Honnibal	fcaa0ad7ce	Merge pull request #280 from wbwseeker/german_parser German parser	2016-03-04 03:27:42 +11:00
Wolfgang Seeker	690c5acabf	adjust train.py to train both english and german models	2016-03-03 15:21:00 +01:00
Matthew Honnibal	9d51e4d13c	Delete gather_freqs.py This script was in a broken state, and should be unnecessary. The functionality is subsumed by `get_freqs.py`	2016-03-02 00:42:55 +11:00
Yaser Martinez Palenzuela	1a93d7f725	replace codecs.open with io.open	2016-03-01 14:10:11 +01:00
Wolfgang Seeker	eae35e9b27	add tokenizer files for German, add/change code to train German pos tagger - add files to specify rules for German tokenization - change generate_specials.py to generate from an external file (abbrev.de.tab) - copy gazetteer.json from lang_data/en/ - init_model.py - change doc freq threshold to 0 - add train_german_tagger.py - expects conll09-formatted input	2016-02-18 13:24:20 +01:00
Henning Peters	a89ca6537b	fix cythonize	2016-02-05 16:17:23 +01:00
Henning Peters	3a50448bf3	py3 compatibility	2016-02-05 15:43:50 +01:00
Henning Peters	7627969aba	refactor, listen on setup.py, *.pxd	2016-02-05 15:37:00 +01:00
Matthew Honnibal	5dc6cffc67	* Fix gather_freqs.py	2016-02-04 20:21:58 +01:00
Matthew Honnibal	e2ed6251d7	* Fancy up the CLI for the conll train script	2016-02-02 22:58:06 +01:00
Matthew Honnibal	a676d66807	* Update the CoNLL train script, to get working on other languages	2016-02-02 22:29:34 +01:00
Henning Peters	73674a4afb	try using system-wide headers	2015-12-13 12:51:23 +01:00
Henning Peters	92fabd0114	wrap virtualenv around cythonize	2015-12-13 12:32:22 +01:00
Henning Peters	9662cf04c9	new approach to dependency headers	2015-12-13 11:53:02 +01:00
Matthew Honnibal	6e68b344c1	* Train after parsing, not before.	2015-11-12 04:43:52 +11:00
Matthew Honnibal	4fb038a9eb	* Update conll_train.py script for spaCy v0.97	2015-10-31 00:53:51 +11:00
Matthew Honnibal	cfaa4bde5d	* Add train and parse scripts that use CoNLL formatted data	2015-10-30 12:54:49 +11:00
Matthew Honnibal	2348a08481	* Load/dump strings with a json file, instead of the hacky strings file we were using.	2015-10-22 21:13:03 +11:00
Matthew Honnibal	0ce12e4548	* Import io in get_freqs	2015-10-19 12:56:18 +11:00
Matthew Honnibal	17fffb4c57	* Update get_freqs.py script	2015-10-16 04:33:49 +11:00
Matthew Honnibal	5ff4454177	* Update get_freqs.py script	2015-10-16 04:31:15 +11:00
Matthew Honnibal	a748146dd3	* Update get_freqs.py script	2015-10-16 04:24:50 +11:00
Matthew Honnibal	a29fd79fbc	* Update get_freqs.py script	2015-10-16 04:24:08 +11:00
Matthew Honnibal	e08a4b46a2	* Update get_freqs.py script	2015-10-16 04:20:35 +11:00
Matthew Honnibal	92f750cf8b	* Use a gzipped frequencies file in init_model	2015-10-11 06:59:44 +02:00
Matthew Honnibal	064bd69ad0	* Refactor symbols, so that frequency rank can be derived from the orth id of a word.	2015-10-10 16:03:48 +11:00
Matthew Honnibal	83dccf0fd7	* Use io module insteads of deprecated codecs module	2015-10-10 14:13:01 +11:00
Matthew Honnibal	f35632e2e5	* Remove SBD print statement in train, after SBD evaluation was removed from Scorer	2015-10-09 11:08:58 +02:00
Matthew Honnibal	6ea1601e93	* Add script to train models off the UD treebanks. Note that the UD data is restricted to research purposes only, and should only be used to train models for academic experiments.	2015-10-08 12:01:08 +11:00
Matthew Honnibal	c503654ec1	* Update bin/parser/train for printing output.	2015-10-06 10:35:22 +11:00
alvations	8caedba42a	caught more codecs.open -> io.open	2015-09-30 20:20:09 +02:00
alvations	764bdc62e7	caught another codecs.open	2015-09-30 20:16:52 +02:00
Matthew Honnibal	1ae55cb63a	* Copy tag_map.json in init_model	2015-09-12 05:54:02 +02:00
Matthew Honnibal	b2e82e55f6	* Create POS model dir in training script	2015-09-08 15:36:23 +02:00
Matthew Honnibal	5ad4527c42	* Rename Deutsch to German	2015-09-06 20:18:58 +02:00
Matthew Honnibal	d1eea2d865	* Update train.py for language-generic spaCy	2015-09-06 17:51:48 +02:00
Matthew Honnibal	950ce36660	* Update init model	2015-09-06 17:51:30 +02:00
Matthew Honnibal	b6b1e1aa12	* Add link for Finnish model	2015-08-27 10:26:02 +02:00
Matthew Honnibal	320ced276a	* Add tagger training script	2015-08-27 09:15:41 +02:00
Matthew Honnibal	dc13edd7cb	* Refactor init_model to accomodate other languages	2015-08-26 19:14:05 +02:00
Matthew Honnibal	bbf07ac253	* Cut down init_model to work on more languages	2015-08-24 01:05:20 +02:00
Matthew Honnibal	3ecacb9635	* Copy gazetteer file in init_model	2015-08-06 16:07:23 +02:00
Matthew Honnibal	ddc1a5cfe5	* Fix training under python3	2015-07-28 14:09:30 +02:00
Matthew Honnibal	174ed1ad20	* Tighten the frequency filter in init_model	2015-07-27 21:44:51 +02:00
Matthew Honnibal	6047f2aa35	* Fix path to freqs.txt	2015-07-27 02:22:35 +02:00
Matthew Honnibal	0368889d6c	* Support gzipped frequencies in init_model	2015-07-26 22:39:22 +02:00
Matthew Honnibal	c4f20847da	* Fix init_model for travis tests	2015-07-26 14:03:30 +02:00
Matthew Honnibal	09312b9353	* Fix init_model for travis tests	2015-07-26 13:55:47 +02:00
Matthew Honnibal	90ad717dc4	* Update default freq thresholds in init_model	2015-07-26 01:41:17 +02:00
Matthew Honnibal	6a5e035a48	* Ensure data files are copied for tokenizer in init_model	2015-07-26 01:36:19 +02:00
Matthew Honnibal	ab93898ac6	* Make heuristics more explicit in init_model	2015-07-26 00:22:19 +02:00
Matthew Honnibal	5c04dcd7c1	* Fix init_model	2015-07-25 23:33:02 +02:00
Matthew Honnibal	fd525f0675	* Pass OOV probability around	2015-07-25 23:29:51 +02:00
Matthew Honnibal	5b6bf4d4a6	* Remove probability cap on lexicon	2015-07-25 23:05:51 +02:00
Matthew Honnibal	c62eb110c0	* Fix merge conflict in init_model	2015-07-25 23:04:30 +02:00
Matthew Honnibal	0301472d15	* Fix init_model	2015-07-25 22:56:35 +02:00
Matthew Honnibal	8e800adfbc	* Fix init_model	2015-07-25 22:54:08 +02:00
Matthew Honnibal	5f183098e4	Merge branch 'master' of ssh://github.com/honnibal/spaCy	2015-07-25 22:37:04 +02:00
Matthew Honnibal	6076213c16	* Fix init_model script	2015-07-25 22:35:52 +02:00
Matthew Honnibal	1a99eb69da	Merge branch 'master' of https://github.com/honnibal/spaCy	2015-07-25 22:19:48 +02:00
Matthew Honnibal	ef448649b3	* Add read_freqs function in init_model	2015-07-25 22:16:36 +02:00
Matthew Honnibal	2e6a60eaec	Merge branch 'master' of https://github.com/honnibal/spaCy	2015-07-25 21:14:07 +02:00
Matthew Honnibal	105305b4aa	* Upd get_freqs script	2015-07-25 21:13:41 +02:00
Matthew Honnibal	616445e027	* Add simple script to collate frequencies from sorted file	2015-07-25 21:12:45 +02:00
Matthew Honnibal	c52179f5fa	* Use print function in train.py, for py 2/3 compatibility	2015-07-24 04:52:35 +02:00
Matthew Honnibal	6be3ee311c	Py3 compatibility tweak	2015-07-23 13:13:15 +02:00
Matthew Honnibal	d4407d8e2f	Py3 compatibility tweak	2015-07-23 09:45:15 +02:00
Matthew Honnibal	da4821fc14	* Add cluster words to probs in init_model	2015-07-23 09:27:07 +02:00
Matthew Honnibal	4af2595d99	* Fix structure of wordnet directory for init_model	2015-07-23 06:35:38 +02:00
Matthew Honnibal	83c0f0da22	* Remove lemmatizer from init_model	2015-07-23 02:32:34 +02:00
Matthew Honnibal	4729200dfc	* Whitespace	2015-07-23 01:19:26 +02:00
Matthew Honnibal	2b7bd46508	* Update get_freqs script	2015-07-22 15:43:06 +02:00
Matthew Honnibal	386246db5b	* Update init_model, making language resources optional	2015-07-22 00:25:14 +02:00
Matthew Honnibal	317cbbc015	* Serialization round trip now working with decent API, but with rough spots in the organisation and requiring vocabulary to be fixed ahead of time.	2015-07-19 15:18:17 +02:00
Matthew Honnibal	a6ff7e6ca4	* Fix redundant options in train.py	2015-07-17 22:38:05 +02:00
Matthew Honnibal	6cfa83157e	Merge branch 'refactor' of ssh://github.com/honnibal/spaCy into refactor	2015-07-17 21:38:04 +02:00
Matthew Honnibal	38ca0c33f5	Merge branch 'neuralnet' into refactor Mostly refactors parser, to use new thinc3.2 Example class. Aim is to remove use of shared memory, so that we can parallelize over documents easily. Conflicts: setup.py spacy/syntax/parser.pxd spacy/syntax/parser.pyx spacy/syntax/stateclass.pyx	2015-07-14 14:13:47 +02:00
Matthew Honnibal	af54d05d60	* Remove sense stuff from init_model	2015-07-14 10:56:17 +02:00
Matthew Honnibal	3de1b3ef1d	* Change get_freqs to take a list of files	2015-07-14 10:55:56 +02:00

256 Commits