spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-09-21 11:29:13 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	dc13edd7cb	* Refactor init_model to accomodate other languages	2015-08-26 19:14:05 +02:00
Matthew Honnibal	bbf07ac253	* Cut down init_model to work on more languages	2015-08-24 01:05:20 +02:00
Matthew Honnibal	3ecacb9635	* Copy gazetteer file in init_model	2015-08-06 16:07:23 +02:00
Matthew Honnibal	ddc1a5cfe5	* Fix training under python3	2015-07-28 14:09:30 +02:00
Matthew Honnibal	174ed1ad20	* Tighten the frequency filter in init_model	2015-07-27 21:44:51 +02:00
Matthew Honnibal	6047f2aa35	* Fix path to freqs.txt	2015-07-27 02:22:35 +02:00
Matthew Honnibal	0368889d6c	* Support gzipped frequencies in init_model	2015-07-26 22:39:22 +02:00
Matthew Honnibal	c4f20847da	* Fix init_model for travis tests	2015-07-26 14:03:30 +02:00
Matthew Honnibal	09312b9353	* Fix init_model for travis tests	2015-07-26 13:55:47 +02:00
Matthew Honnibal	90ad717dc4	* Update default freq thresholds in init_model	2015-07-26 01:41:17 +02:00
Matthew Honnibal	6a5e035a48	* Ensure data files are copied for tokenizer in init_model	2015-07-26 01:36:19 +02:00
Matthew Honnibal	ab93898ac6	* Make heuristics more explicit in init_model	2015-07-26 00:22:19 +02:00
Matthew Honnibal	5c04dcd7c1	* Fix init_model	2015-07-25 23:33:02 +02:00
Matthew Honnibal	fd525f0675	* Pass OOV probability around	2015-07-25 23:29:51 +02:00
Matthew Honnibal	5b6bf4d4a6	* Remove probability cap on lexicon	2015-07-25 23:05:51 +02:00
Matthew Honnibal	c62eb110c0	* Fix merge conflict in init_model	2015-07-25 23:04:30 +02:00
Matthew Honnibal	0301472d15	* Fix init_model	2015-07-25 22:56:35 +02:00
Matthew Honnibal	8e800adfbc	* Fix init_model	2015-07-25 22:54:08 +02:00
Matthew Honnibal	5f183098e4	Merge branch 'master' of ssh://github.com/honnibal/spaCy	2015-07-25 22:37:04 +02:00
Matthew Honnibal	6076213c16	* Fix init_model script	2015-07-25 22:35:52 +02:00
Matthew Honnibal	1a99eb69da	Merge branch 'master' of https://github.com/honnibal/spaCy	2015-07-25 22:19:48 +02:00
Matthew Honnibal	ef448649b3	* Add read_freqs function in init_model	2015-07-25 22:16:36 +02:00
Matthew Honnibal	2e6a60eaec	Merge branch 'master' of https://github.com/honnibal/spaCy	2015-07-25 21:14:07 +02:00
Matthew Honnibal	105305b4aa	* Upd get_freqs script	2015-07-25 21:13:41 +02:00
Matthew Honnibal	616445e027	* Add simple script to collate frequencies from sorted file	2015-07-25 21:12:45 +02:00
Matthew Honnibal	c52179f5fa	* Use print function in train.py, for py 2/3 compatibility	2015-07-24 04:52:35 +02:00
Matthew Honnibal	6be3ee311c	Py3 compatibility tweak	2015-07-23 13:13:15 +02:00
Matthew Honnibal	d4407d8e2f	Py3 compatibility tweak	2015-07-23 09:45:15 +02:00
Matthew Honnibal	da4821fc14	* Add cluster words to probs in init_model	2015-07-23 09:27:07 +02:00
Matthew Honnibal	4af2595d99	* Fix structure of wordnet directory for init_model	2015-07-23 06:35:38 +02:00
Matthew Honnibal	83c0f0da22	* Remove lemmatizer from init_model	2015-07-23 02:32:34 +02:00
Matthew Honnibal	4729200dfc	* Whitespace	2015-07-23 01:19:26 +02:00
Matthew Honnibal	2b7bd46508	* Update get_freqs script	2015-07-22 15:43:06 +02:00
Matthew Honnibal	386246db5b	* Update init_model, making language resources optional	2015-07-22 00:25:14 +02:00
Matthew Honnibal	317cbbc015	* Serialization round trip now working with decent API, but with rough spots in the organisation and requiring vocabulary to be fixed ahead of time.	2015-07-19 15:18:17 +02:00
Matthew Honnibal	a6ff7e6ca4	* Fix redundant options in train.py	2015-07-17 22:38:05 +02:00
Matthew Honnibal	6cfa83157e	Merge branch 'refactor' of ssh://github.com/honnibal/spaCy into refactor	2015-07-17 21:38:04 +02:00
Matthew Honnibal	38ca0c33f5	Merge branch 'neuralnet' into refactor Mostly refactors parser, to use new thinc3.2 Example class. Aim is to remove use of shared memory, so that we can parallelize over documents easily. Conflicts: setup.py spacy/syntax/parser.pxd spacy/syntax/parser.pyx spacy/syntax/stateclass.pyx	2015-07-14 14:13:47 +02:00
Matthew Honnibal	af54d05d60	* Remove sense stuff from init_model	2015-07-14 10:56:17 +02:00
Matthew Honnibal	3de1b3ef1d	* Change get_freqs to take a list of files	2015-07-14 10:55:56 +02:00
Matthew Honnibal	39c93116eb	* Add get_freqs script	2015-07-14 02:31:32 +02:00
Matthew Honnibal	62cfcd76fe	* Add supersense sets to lexemes, from WordNet. Look-up via lemmatization.	2015-07-01 18:48:59 +02:00
Matthew Honnibal	31b5e58aeb	* Begin reorganizing neuralnet work	2015-06-30 14:26:53 +02:00
Matthew Honnibal	1135cfe50a	* Tidy nn_train a bit	2015-06-29 16:45:14 +02:00
Matthew Honnibal	df8179ca4f	* Add separate Param and AdadeltaParam classes. AdadeltaParam seems broken.	2015-06-29 16:39:16 +02:00
Matthew Honnibal	1dff04acb5	* Apply regularization to the softmax, not the bias	2015-06-29 11:45:38 +02:00
Matthew Honnibal	ca30fe1582	* Use He initialization trick	2015-06-29 10:56:02 +02:00
Matthew Honnibal	fc34e1b6e4	* Move Theano functions into nn_train.py script	2015-06-29 07:09:16 +02:00
Matthew Honnibal	fe7b24ecef	* whitespace	2015-06-28 11:37:17 +02:00
Matthew Honnibal	7b8275fcc4	* Wire hyperparameters to script interface	2015-06-28 11:37:17 +02:00
Matthew Honnibal	897dd0dd0b	* Merge changes, and adjust Example to use memoryview	2015-06-28 11:36:11 +02:00
Matthew Honnibal	ef97b90833	* Fix token scoring	2015-06-28 06:22:18 +02:00
Matthew Honnibal	34c0ef2ee8	* Don't compile the orig_arc_eager and tree_arc_eager modules used for the EMNLP paper	2015-06-23 05:38:17 +02:00
Matthew Honnibal	59e9f9153c	* Remove projectivity constraint in train.py, but raise Exception if non-projective sentence is encountered, since we've told GoldParse to projectivize	2015-06-23 05:04:46 +02:00
Matthew Honnibal	839e5038b7	* Raise exception on non-projective input	2015-06-23 00:01:55 +02:00
Matthew Honnibal	4dad4058c3	* Uncomment NER training	2015-06-16 23:36:54 +02:00
Matthew Honnibal	5699585278	* Use tree_arc_eager system as baseline in experiments	2015-06-15 08:23:43 +02:00
Matthew Honnibal	4841f8ad5e	* Set transition system early	2015-06-15 02:54:12 +02:00
Matthew Honnibal	bcfdf126a4	* Add toggle for OrigArcEager system	2015-06-14 20:28:14 +02:00
Matthew Honnibal	c500d72dc2	* Temporarily disable NER, and wire up the verbose flag during training	2015-06-14 17:45:31 +02:00
Matthew Honnibal	ac422492cf	* Fix write_parses mode of bin/parser/train.py	2015-06-07 19:08:48 +02:00
Matthew Honnibal	4073533e28	* Upd munge_ewtb for the new json format	2015-06-06 02:10:33 +02:00
Matthew Honnibal	6a1341b29e	* Add tb pre-process script	2015-06-06 01:59:44 +02:00
Matthew Honnibal	1736fc5a67	* Add more options to bin/parser/train	2015-06-05 23:49:26 +02:00
Matthew Honnibal	362f87dc3a	* Update input corruption method to work with lists as well as trings	2015-06-05 19:33:32 +02:00
Matthew Honnibal	0aed9c9a33	* Fix train.py	2015-06-05 15:50:24 +02:00
Matthew Honnibal	8466600add	* Clean up train.py, removing unused tag jackknifing code	2015-06-05 15:01:28 +02:00
Matthew Honnibal	e772b48dcd	* Skip sentences of length 1 in training	2015-06-05 02:29:03 +02:00
Matthew Honnibal	e822df0867	* Fix bugs in new greedy/beam parser	2015-06-02 02:01:33 +02:00
Matthew Honnibal	70a7ad89ca	* Removed unused imports from train.py	2015-06-02 00:59:09 +02:00
Matthew Honnibal	a3de20118e	* Wire up beam-width command line argument	2015-06-02 00:54:12 +02:00
Matthew Honnibal	08044ea70c	* Remove try/except around parser.train	2015-05-31 15:21:56 +02:00
Matthew Honnibal	c8a553fe91	* Fix cluster initialization	2015-05-31 15:21:28 +02:00
Matthew Honnibal	d7cc2338e7	* Fix bug in train.py	2015-05-31 06:49:06 +02:00
Matthew Honnibal	c037f80638	* Add case expansion to Brown clusters	2015-05-31 05:50:50 +02:00
Matthew Honnibal	5ab0f233a1	* Ensure words in Brown clusters make it into the vocab, even if they're not in our probs list	2015-05-31 05:46:16 +02:00
Matthew Honnibal	d42dda0372	* Shuffle docs before doing jackknife partition --- otherwise we'll not get the right genre mixes...	2015-05-31 01:25:02 +02:00
Matthew Honnibal	4d8d490547	* Exclude empty sentences in prepare_treebank	2015-05-31 01:12:46 +02:00
Matthew Honnibal	d512d20d81	* Allow parser to jackknife POS tags before training.	2015-05-31 01:11:11 +02:00
Matthew Honnibal	6bbdcc5db5	* Fix gold_preproc flag in train.py	2015-05-30 05:23:02 +02:00
Matthew Honnibal	76300bbb1b	* Use updated JSON format, with sentences below paragraphs. Allows use of gold preprocessing flag.	2015-05-30 01:25:46 +02:00
Matthew Honnibal	2d11739f28	* Change data format of JSON corpus, putting sentences into lists with the paragraph	2015-05-30 01:25:00 +02:00
Matthew Honnibal	784e577f45	* Check NER length matches conll length in prepare_treebank	2015-05-29 03:54:06 +02:00
Matthew Honnibal	b76bbbd12c	* Read json files recursively from a directory, instead of requiring a single .json file	2015-05-29 03:52:55 +02:00
Matthew Honnibal	ef67ef7a4c	* Recomment in training in train.py	2015-05-28 22:40:26 +02:00
Matthew Honnibal	5eb64eeb11	* Print json treebank by genre, instead of by large file	2015-05-28 22:40:01 +02:00
Matthew Honnibal	f42dc1f7d8	* Fix evaluate method in train.py, to use sentences which don't have raw text	2015-05-28 16:30:23 +02:00
Matthew Honnibal	a7cee46fe9	* Update train.py, to support paragraphs where there's no raw_text	2015-05-27 19:14:02 +02:00
Matthew Honnibal	ef1333cf89	* Have prepare_treebank read train/dev/test IDs.	2015-05-27 17:35:05 +02:00
Matthew Honnibal	e140e03516	* Read in OntoNotes. Doesn't support train/test/dev split yet	2015-05-27 17:04:29 +02:00
Matthew Honnibal	895060e774	* Ensure tagger and NER are trained, even if non-projective problem	2015-05-27 03:16:21 +02:00
Matthew Honnibal	32ae2cdabe	* In prepare_treebank, move ner into the token descriptions	2015-05-26 19:52:39 +02:00
Matthew Honnibal	61885aee76	* Work on prepare_treebank script, adding NER to it	2015-05-26 19:28:29 +02:00
Matthew Honnibal	15bbbf4901	* Remove cruft from train.py	2015-05-25 07:54:10 +02:00
Matthew Honnibal	fc75210941	* Move spacy.syntax.conll to spacy.gold	2015-05-24 21:35:02 +02:00
Matthew Honnibal	541c62c126	* Remove import of removed read_docparse_file function	2015-05-24 20:05:13 +02:00
Matthew Honnibal	bfeb29ebd1	* Tmp commit	2015-05-24 02:50:14 +02:00
Matthew Honnibal	983d954ef4	* Tmp commit, while switch to new format that assumes alignment happens during training	2015-05-23 17:39:04 +02:00
Matthew Honnibal	f35503018e	* Tmp commit of train, while I move to better alignment in gold standard	2015-05-23 17:21:25 +02:00
Matthew Honnibal	3d6b3fc6fb	* Restore shuffling, and remove print statements from train.py	2015-05-12 20:27:56 +02:00

1 2 3 4

196 Commits