spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-11-11 21:35:47 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	680043ebca	Improve efficiency of tagger.set_annotations for GPU	2017-08-12 08:54:21 -05:00
Matthew Honnibal	ebe0f7f641	Pass embed size correctly in tagger, and cache embeddings for efficiency	2017-08-12 05:45:20 -05:00
Matthew Honnibal	1a59db1c86	Fix dropout and learn rate in parser	2017-08-12 05:44:39 -05:00
Matthew Honnibal	d01dc3704a	Adjust parser model	2017-08-09 20:06:33 -05:00
Matthew Honnibal	f37528ef58	Pass embed size for parser fine-tune. Use SELU	2017-08-09 17:52:53 -05:00
Matthew Honnibal	f93f2bed58	Revert use of layer normalization in Tok2Vec	2017-08-09 17:47:03 -05:00
Matthew Honnibal	20944dd8aa	Fix conflict in parser fine-tuning	2017-08-09 16:43:05 -05:00
Matthew Honnibal	ac2de6dced	Switch to ReLu layers in Tok2Vec	2017-08-09 16:41:25 -05:00
Matthew Honnibal	bbace204be	Gate parser fine-tuning behind feature flag	2017-08-09 16:40:42 -05:00
Matthew Honnibal	a59a1deac4	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-08-09 16:23:19 -05:00
Matthew Honnibal	bcce6f7de0	Fix parser fine tuning	2017-08-09 16:23:12 -05:00
ines	28e2fec23b	Fix autolinking failure on fresh model install (resolves #1138). On fresh install via subprocess, pip.get_installed_distributions() won't show new model, so is_package check in link command fails. Solution for now is to get model package path explicitly and pass it to link command.	2017-08-09 11:52:38 +02:00
Jim Geovedi	c62b49b7cc	Merge remote-tracking branch 'upstream/develop' into indonesian	2017-08-09 09:17:46 +07:00
Matthew Honnibal	dbdd8afc4b	Fix parser fine-tune training	2017-08-08 15:46:07 -05:00
Matthew Honnibal	88bf1cf87c	Update parser for fine tuning	2017-08-08 15:34:17 -05:00
Jim O'Regan	c069b4acb5	fix in UD submitted; map either way	2017-08-08 19:22:14 +01:00
Jim O'Regan	76c22dec4d	UD Irish tag mapping	2017-08-08 19:04:52 +01:00
Jim O'Regan	95921d7d4c	Merge branch 'develop' into develop-irish	2017-08-08 17:21:27 +01:00
Matthew Honnibal	5d837c3776	Add mix weights on fine_tune	2017-08-07 06:32:59 -05:00
Matthew Honnibal	42bd26f6f3	Give parser its own tok2vec weights	2017-08-06 18:33:46 +02:00
Matthew Honnibal	3ed203de25	Use LayerNorm and SELU in Tok2Vec	2017-08-06 18:33:18 +02:00
Matthew Honnibal	78498a072d	Return Transition for missing actions in lookup_action	2017-08-06 14:16:36 +02:00
Matthew Honnibal	4a5cc89138	Fix tagger 'fine_tune', to keep private CNN weights	2017-08-06 14:15:48 +02:00
Matthew Honnibal	3cb8f06881	Fix NeuralLabeller	2017-08-06 14:15:14 +02:00
Matthew Honnibal	0acce0521b	Fix Language.update for pipeline	2017-08-06 14:13:03 +02:00
Matthew Honnibal	bfffdeabb2	Fix parser batch-size bug introduced during cleanup	2017-08-06 14:10:48 +02:00
Matthew Honnibal	0eec7c9e9b	Fix Language.evaluate	2017-08-06 02:18:31 +02:00
Matthew Honnibal	0a566dc320	Add update_tensors flag to Language.update. Experimental, re #1182	2017-08-06 02:18:12 +02:00
Matthew Honnibal	cc19ea0e7c	Add update_tensors flag to Language.update. Experimental, re #1182	2017-08-06 02:17:10 +02:00
Matthew Honnibal	4cfb7a54e7	Fix tagger	2017-08-06 01:53:31 +02:00
Matthew Honnibal	e9ab800e15	Fix tagging model	2017-08-06 01:50:08 +02:00
Matthew Honnibal	468c138ab3	WIP: Add fine-tuning logic to tagger model, re #1182	2017-08-06 01:13:23 +02:00
Matthew Honnibal	7f876a7a82	Clean up some unused code in parser	2017-08-06 00:00:21 +02:00
Matthew Honnibal	ae1ad81069	Increment version	2017-08-05 18:09:32 +02:00
Jim Geovedi	cc4772cac2	reworks	2017-08-03 13:08:38 +07:00
Jim Geovedi	37f19f5ed2	added more currencies based on corpus data	2017-08-03 13:03:25 +07:00
Jim Geovedi	30fd068d42	hashtag prefix should be handled somewhere else	2017-08-03 13:03:02 +07:00
Jim Geovedi	4705ae19ba	Merge remote-tracking branch 'upstream/develop' into indonesian	2017-08-03 12:40:19 +07:00
Jim Geovedi	ba07e23c87	added USD in currency rules	2017-08-02 22:42:47 +07:00
Matthew Honnibal	5c323daa1a	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-08-01 22:10:37 +02:00
Matthew Honnibal	2e00361522	Fix update when 0 docs	2017-08-01 22:10:17 +02:00
Matthew Honnibal	8fce187de4	Fix ArcEager for missing values	2017-08-01 22:10:05 +02:00
ines	78e262140f	Add workaround for displaCy server on Python 2/3 (resolves #1227). Make sure status and headers are bytes on Python 2 and strings on Python 3.	2017-08-01 01:11:35 +02:00
Jim Geovedi	2572a9ddf0	Merge remote-tracking branch 'upstream/develop' into indonesian	2017-07-30 21:24:16 +07:00
Jim Geovedi	bb08d696f9	added hashtag rule and fixed currency rules	2017-07-30 21:23:28 +07:00
Jim Geovedi	e9af79a803	added u-\d+ rules (sports team)	2017-07-30 21:23:01 +07:00
Matthew Honnibal	27abc56e98	Add method to get beam entities	2017-07-29 21:59:02 +02:00
Matthew Honnibal	ec63f4fe7b	Add option to control how missing entities are handled when getting NER tags	2017-07-29 21:58:37 +02:00
Jim Geovedi	e5adc26c72	simplified rules	2017-07-29 18:21:32 +07:00
Jim Geovedi	783f7d8b86	added test set for Indonesian language	2017-07-29 18:21:07 +07:00
Jim Geovedi	4d04898dea	updated regexp	2017-07-29 17:44:57 +07:00
Jim Geovedi	7d96d477ea	updated like_num	2017-07-29 17:44:46 +07:00
Jim Geovedi	3cca4ed798	added lex attrs rules	2017-07-29 17:22:21 +07:00
Jim Geovedi	8b814c63f1	more exceptions	2017-07-27 19:46:30 +07:00
Jim Geovedi	6c725e8dcf	updated lemma	2017-07-27 19:46:21 +07:00
Jim Geovedi	c194f7ae26	Merge remote-tracking branch 'upstream/develop' into indonesian	2017-07-27 10:55:34 +07:00
Jim Geovedi	547973b92a	wip syntax iterators	2017-07-27 10:51:34 +07:00
Jim Geovedi	bbc75da38d	enable syntax iterator and lemma lookup	2017-07-27 10:51:15 +07:00
Jim Geovedi	24a8c8bf28	added wip lemma dict	2017-07-26 21:39:54 +07:00
Jim Geovedi	63f14ba46b	added hyphen-suffix rules	2017-07-26 19:28:57 +07:00
Jim Geovedi	f288964441	removed -el from suffix rules	2017-07-26 19:28:38 +07:00
Jim Geovedi	6eee7a7411	updated tokenizer exceptions	2017-07-26 19:13:47 +07:00
Jim Geovedi	edec51b1b1	update punctuation rules	2017-07-26 19:13:36 +07:00
Jim Geovedi	62443d495a	enable token match	2017-07-26 19:13:14 +07:00
Jim Geovedi	c97f5ae0bb	updated tokenizer exceptions	2017-07-26 19:12:52 +07:00
Matthew Honnibal	aff325b7e0	Increment version	2017-07-25 19:41:20 +02:00
Matthew Honnibal	6780132821	Fix tagger loading	2017-07-25 19:41:11 +02:00
Matthew Honnibal	fd20a4af55	Increment version	2017-07-25 18:58:34 +02:00
Matthew Honnibal	523b0df2c9	Update text classification model	2017-07-25 18:57:59 +02:00
Matthew Honnibal	7c7fac9337	Add spacy.blank() loading function	2017-07-25 18:56:37 +02:00
Jim Geovedi	73f6ac9d9b	added hyphen	2017-07-24 15:56:31 +07:00
Jim Geovedi	68454c40bf	added missing import	2017-07-24 14:12:34 +07:00
Jim Geovedi	eaf9cbd708	cursed of copy & paste	2017-07-24 14:11:51 +07:00
Jim Geovedi	7aad6718bc	enable tokenizer exceptions	2017-07-24 14:11:10 +07:00
Jim Geovedi	ad56c9179a	added tokenizer exceptions list	2017-07-24 14:10:16 +07:00
Jim Geovedi	c1f3fe99fe	updated punctuation rules	2017-07-24 13:57:21 +07:00
Jim Geovedi	37fa2c8c80	punctuation rules	2017-07-24 06:17:18 +07:00
Jim Geovedi	082e94ac1c	added infix rules	2017-07-24 06:17:07 +07:00
Jim Geovedi	d0ec484725	reverted	2017-07-24 06:16:29 +07:00
Jim Geovedi	0e590c711f	added prefix & suffix rules	2017-07-23 23:46:40 +07:00
Jim Geovedi	ba922e30e8	added ampere hour unit	2017-07-23 23:46:18 +07:00
Jim Geovedi	3b17eba27b	added frequency units	2017-07-23 23:10:52 +07:00
Jim Geovedi	d5fd32a572	added known currencies	2017-07-23 22:56:48 +07:00
Jim Geovedi	f6f15678fb	added lex_attrs	2017-07-23 22:55:22 +07:00
Jim Geovedi	bed8162d00	added tokenizer_exceptions	2017-07-23 22:55:05 +07:00
Jim Geovedi	b80c35bc9a	added norm_exceptions	2017-07-23 22:54:49 +07:00
Jim Geovedi	b5de329ea3	added norm_exceptions	2017-07-23 22:54:19 +07:00
Jim Geovedi	082e9ade46	fixed typo	2017-07-23 21:30:34 +07:00
Jim Geovedi	e2efeb186e	added stopwords	2017-07-23 20:52:37 +07:00
Jim Geovedi	da98676839	use template	2017-07-23 20:51:31 +07:00
Jim Geovedi	c2b4dd7809	start working on Indonesian language	2017-07-23 20:50:56 +07:00
Matthew Honnibal	5771bd1ff8	Increment version	2017-07-23 14:18:38 +02:00
Matthew Honnibal	c4a81a47a4	Fix deserialization	2017-07-23 14:11:07 +02:00
Matthew Honnibal	2df563ad24	Remove optimization for textcat that caused loading problem	2017-07-23 14:10:51 +02:00
Matthew Honnibal	4fe77bced2	Add cfg attr to pipeline components	2017-07-23 00:52:47 +02:00
Matthew Honnibal	d8aa721664	Compute Language.meta with a property	2017-07-23 00:50:18 +02:00
Matthew Honnibal	a88a7deffe	Fix save/load of textcat config	2017-07-23 00:33:43 +02:00
Matthew Honnibal	9bae0ddc50	Fix minibatching	2017-07-22 20:14:49 +02:00
Matthew Honnibal	ded0df5e2f	Expose hyper-param as keyword arg	2017-07-22 20:14:37 +02:00
Matthew Honnibal	f5de8deeec	Increment version	2017-07-22 20:04:53 +02:00
Matthew Honnibal	b55714d5d1	Make gold_tuples arg optional in begin_training	2017-07-22 20:04:43 +02:00
Matthew Honnibal	ed6c85fa3c	Fix loading of text categories in GoldParse	2017-07-22 20:04:03 +02:00
Matthew Honnibal	6ffec9dfea	Update _ml, for textcat model	2017-07-22 20:03:40 +02:00
Matthew Honnibal	d6a5c2c85a	Add test for NER	2017-07-22 01:48:58 +02:00
Matthew Honnibal	28244df4da	Add test for beam parsing	2017-07-22 01:48:35 +02:00
Matthew Honnibal	c86445bdfd	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-07-22 01:14:28 +02:00
Matthew Honnibal	b3a749610e	Fix name of TextCategorizer	2017-07-22 01:14:07 +02:00
Matthew Honnibal	2424493970	Remove unnecessary import of Mock	2017-07-22 01:13:54 +02:00
Matthew Honnibal	baa3d81c35	Add text categorizer to Language	2017-07-22 01:13:36 +02:00
Matthew Honnibal	a6a2159969	Add slot for text categories to Doc	2017-07-22 00:34:15 +02:00
Matthew Honnibal	374ab3ecfb	Increment alpha version	2017-07-22 00:32:49 +02:00
Matthew Honnibal	289f23df51	Test beam parsing	2017-07-20 15:03:10 +02:00
Matthew Honnibal	3da1063b36	Add beam decoding to parser, to allow NER uncertainties	2017-07-20 15:02:55 +02:00
Matthew Honnibal	0ca5832427	Improve negative example handling in NER oracle	2017-07-20 00:18:49 +02:00
Matthew Honnibal	a231b56d40	Add text-classification hook to pipeline	2017-07-20 00:18:15 +02:00
Matthew Honnibal	7ea50182a5	Add support for text-classification labels to GoldParse	2017-07-20 00:17:47 +02:00
Matthew Honnibal	727481377e	Add text-classifier thinc models	2017-07-20 00:17:17 +02:00
Matthew Honnibal	f014138c11	Fix parser tests	2017-07-20 00:16:52 +02:00
mollerhoj	85144835da	Add Tag_map for Danish	2017-07-03 15:52:55 +02:00
mollerhoj	64c732918a	Add Morph_rules. (TODO: Not working?)	2017-07-03 15:52:55 +02:00
mollerhoj	3b2cb107a3	Add like_num functionality to Danish	2017-07-03 15:49:51 +02:00
mollerhoj	e8f40ceed8	Add short names of months to tokenizer_exceptions	2017-07-03 15:49:51 +02:00
mollerhoj	e840077601	Add some basic tests for Danish	2017-07-03 15:49:51 +02:00
mollerhoj	23025d3b05	Clean up a couple of strange English stopwords	2017-07-03 15:41:59 +02:00
mollerhoj	dc5be7d2f3	Cleanup list of Danish stopwords	2017-07-03 15:40:58 +02:00
Ines Montani	c91642efd5	Port over changes from #1168	2017-07-01 11:43:54 +02:00
Jim O'Regan	70f4d26c10	bounds checks	2017-06-28 10:59:46 +01:00
Jim O'Regan	1ba38b2036	some helpers; the Irish part of UD only has 2500 sentences so this will need source of morphology	2017-06-28 00:42:00 +01:00
Jim O'Regan	559e03605a	b'	2017-06-27 22:42:16 +01:00
Jim Regan	d81ceb0cd5	Merge branch 'develop' into polish	2017-06-26 22:42:27 +01:00
Jim O'Regan	2f84c73585	a start	2017-06-26 22:40:04 +01:00
Jim O'Regan	28d7f0a672	reference	2017-06-26 22:38:28 +01:00
Jim O'Regan	e12defdd9c	missed a couple	2017-06-26 22:24:14 +01:00
Jim O'Regan	c1e4e0f3bf	just now discovered that you can do multiwords	2017-06-26 22:19:39 +01:00
Jim O'Regan	5e5f94c1c0	fix dup	2017-06-26 21:57:00 +01:00
Jim O'Regan	a8dff9133e	add POS	2017-06-26 21:53:41 +01:00
Jim O'Regan	e9213f54de	missed one	2017-06-26 21:29:21 +01:00
Jim O'Regan	1eb7cc3017	attempt a port from #1147	2017-06-26 21:24:55 +01:00
Matthew Honnibal	91e52543ef	Merge pull request #1118 from Gregory-Howard/patch-2: Update _tokenizer_exceptions_list (adding cities)	2017-06-20 11:16:07 +02:00
Matthew Honnibal	8ea785e01a	Merge pull request #1119 from oroszgy/patch-3: Fixed conllu converter	2017-06-20 11:14:41 +02:00
Tpt	7745b3ae04	Adds noun chunks to French syntax iterators	2017-06-12 15:29:58 +02:00
Tpt	57e8254f63	Adds function to extract french noun chunks	2017-06-12 15:20:49 +02:00
György Orosz	62dbf9025c	Fixed conllu converter	2017-06-09 22:53:56 +02:00
Grégory Howard	cd974b32b7	Update _tokenizer_exceptions_list (adding cities)	2017-06-09 17:58:18 +02:00
ines	34a2eecb17	Add simple "naughty strings" test (see #1107)	2017-06-06 17:43:51 +02:00
ines	045574a936	Update package name and increment version	2017-06-05 20:41:30 +02:00
Matthew Honnibal	1f5874a927	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-06-05 20:20:00 +02:00
ines	03db56f48c	Detect spaCy version and add package title. Package title allows customised package names (like spacy-nightly).	2017-06-05 20:11:02 +02:00
Matthew Honnibal	c0d90f52f7	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-06-05 19:20:13 +02:00
ines	cc9c5dc7a3	Fix noun chunks test	2017-06-05 16:39:04 +02:00
Matthew Honnibal	836bfa2d0f	Add factory for experimental SimilarityHook component	2017-06-05 15:40:22 +02:00
Matthew Honnibal	d59fa32df1	Add experimental SimilarityHook component	2017-06-05 15:40:03 +02:00
Matthew Honnibal	5489b49203	Remove print statement	2017-06-05 13:20:41 +02:00
Matthew Honnibal	fc4204a12a	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-06-05 13:13:23 +02:00
Matthew Honnibal	2479cde446	Support disable keyword in Language.__init__	2017-06-05 13:13:07 +02:00
ines	ea167e14db	Fix model package loading from link	2017-06-05 13:10:49 +02:00
ines	dd6dc4c120	Update spacy.load() helper functions	2017-06-05 13:02:31 +02:00
Matthew Honnibal	b4cdd05466	Add vectors.pyx in setup	2017-06-05 12:45:29 +02:00
Matthew Honnibal	280d419529	Add pickle method for vectors	2017-06-05 12:36:04 +02:00
Matthew Honnibal	30369d580f	Start testing Vectors class	2017-06-05 12:32:49 +02:00
Matthew Honnibal	eb7cbb62c2	Flesh out Vectors class	2017-06-05 12:32:08 +02:00
ines	51d7414e94	Make sure sents are a list	2017-06-05 12:30:13 +02:00
Matthew Honnibal	ebb6c49cd5	Make alignment case-insensitive for gold	2017-06-04 20:26:42 -05:00
Matthew Honnibal	fc4dd62e84	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-06-04 20:19:05 -05:00
Matthew Honnibal	8f8f90b46b	Disable labeller if not parsing	2017-06-04 20:18:54 -05:00
Matthew Honnibal	c52fde40f4	Improve train CLI	2017-06-04 20:18:37 -05:00
Matthew Honnibal	a053b1218e	Fix item counting during training	2017-06-04 20:18:20 -05:00
Matthew Honnibal	b3b5521625	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-06-04 20:17:18 -05:00
Matthew Honnibal	9bc4a26213	Add option of data augmentation noise	2017-06-04 20:16:57 -05:00
Matthew Honnibal	7b2ede783d	Add SP tag to tag map if missing	2017-06-04 20:16:30 -05:00
ines	a0f4592f0a	Update tests	2017-06-05 02:26:13 +02:00
ines	3e105bcd36	Update tests	2017-06-05 02:09:27 +02:00
Matthew Honnibal	516798e9fc	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-06-05 01:35:21 +02:00
Matthew Honnibal	193bf913c0	Set is_tagged=True after tagging	2017-06-05 01:35:07 +02:00
ines	078232932c	Fix tokenizer fixture scope	2017-06-05 01:06:34 +02:00
Matthew Honnibal	58be0e1f6f	Update tests	2017-06-04 16:35:06 -05:00
Matthew Honnibal	b78cc318c3	Fix loading of morphology exceptions	2017-06-04 16:34:32 -05:00
Matthew Honnibal	bb98d45a63	Fix tests	2017-06-04 16:00:44 -05:00
Matthew Honnibal	55d0621532	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-06-04 15:53:25 -05:00
Matthew Honnibal	5b9f116aca	Update tests	2017-06-04 15:53:17 -05:00
Matthew Honnibal	2a3bd5ee90	Fix fetching of noun chunk iterator	2017-06-04 15:53:05 -05:00
Matthew Honnibal	3680c51b8f	Avoid clobbering preset POS tags	2017-06-04 15:52:42 -05:00
Matthew Honnibal	939e8ed567	Add lookup properties for components in Language	2017-06-04 15:52:09 -05:00
Matthew Honnibal	e28f90b672	Fix syntax iterators	2017-06-04 15:51:50 -05:00
ines	8a29308d0b	Remove unused imports	2017-06-04 22:39:29 +02:00
Ines Montani	112c5787eb	Merge pull request #1101 from oroszgy/hu_tokenizer_fix: More robust Hungarian tokenizer.	2017-06-04 22:37:51 +02:00
ines	96867a24ae	Fix typo	2017-06-04 22:36:40 +02:00
ines	f432bb4b48	Fix fixture scopes	2017-06-04 22:34:31 +02:00
Matthew Honnibal	6d0356e6cc	Whitespace	2017-06-04 14:55:24 -05:00
Matthew Honnibal	8a683a4494	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-06-04 21:53:56 +02:00
Matthew Honnibal	92ae36f84e	Improve way noun chunks iterator is looked up	2017-06-04 21:53:39 +02:00
ines	9254a3dd78	Import and add Spanish syntax iterators	2017-06-04 21:42:15 +02:00
ines	7db1a0e83e	Make sure printed values are always strings	2017-06-04 21:27:20 +02:00
Matthew Honnibal	51e1541ddb	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-06-04 14:26:29 -05:00
Matthew Honnibal	add9a33782	Return False for vocab.has_vector	2017-06-04 14:26:14 -05:00
Matthew Honnibal	675f448313	Fix vector linkage on Doc	2017-06-04 14:25:30 -05:00
Matthew Honnibal	f4662e9218	Fix vector linkage for token	2017-06-04 14:19:58 -05:00
ines	070e026ed9	Ensure path on read_json	2017-06-04 20:44:37 +02:00
ines	e1e73936b1	Raise correct error	2017-06-04 20:44:27 +02:00
ines	848e47669e	Fix typo	2017-06-04 20:44:15 +02:00
ines	c4614c02a2	Fix dev resources URL	2017-06-04 15:45:50 +02:00
ines	a66cf24ee8	xfail tokenizer serialization tests for now. Tests pass locally, but not on Travis – needs more investigation.	2017-06-04 13:58:20 +02:00
ines	7b7d46b64e	Fix typo and success message	2017-06-04 13:45:50 +02:00
ines	90d117f378	Update version	2017-06-04 13:41:16 +02:00
Matthew Honnibal	7ca215bc26	Resolve lex_attr_getters conflict	2017-06-03 16:12:01 -05:00
Matthew Honnibal	21eef90dbc	Support specifying which GPU	2017-06-03 16:10:23 -05:00
Matthew Honnibal	d0e42f9275	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-06-03 15:30:32 -05:00
Matthew Honnibal	8a17b99b1c	Use NORM attribute, not LOWER	2017-06-03 15:30:16 -05:00
ines	4c643d74c5	Add norm exceptions to other Language classes	2017-06-03 22:29:21 +02:00
ines	fa7e576c57	Change order of exception dicts	2017-06-03 21:52:06 +02:00
Matthew Honnibal	3f5c85d8de	Reorder setting of lex attrs, to avoid clobbering	2017-06-03 14:47:55 -05:00
Matthew Honnibal	aeb7520133	Make norm use lower-case	2017-06-03 14:47:38 -05:00
Matthew Honnibal	de3954843e	Populate norm exceptions with lower-case	2017-06-03 14:47:12 -05:00
Matthew Honnibal	f6955a459c	Fix prev commit	2017-06-03 14:38:37 -05:00
Matthew Honnibal	468ca6c760	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-06-03 14:33:51 -05:00
Matthew Honnibal	c647a0d33e	Fix training counter for gold preprocessing	2017-06-03 14:33:39 -05:00
ines	e47eef5e03	Update German tokenizer exceptions and tests	2017-06-03 21:07:44 +02:00
ines	d77c2cc8bb	Add tests for English norm exceptions	2017-06-03 20:59:50 +02:00
ines	0d6fa8b241	Add German norm exceptions	2017-06-03 20:54:18 +02:00
ines	5bd311c77e	Fix update of norm exceptions	2017-06-03 20:54:09 +02:00
Matthew Honnibal	94e063ae2a	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-06-03 13:31:40 -05:00
Matthew Honnibal	fea1144e6d	Set max batch size in evaluate	2017-06-03 13:31:33 -05:00
Matthew Honnibal	805495af27	Fix off-by-one in number of tags	2017-06-03 13:29:23 -05:00
Matthew Honnibal	e62f46d39f	Clarify gold.pyx slightly	2017-06-03 13:28:52 -05:00
Matthew Honnibal	43353b5413	Improve train CLI script	2017-06-03 13:28:20 -05:00
ines	746653880c	Add English norm exceptions to lex_attrs	2017-06-03 20:27:28 +02:00
ines	095eeeb12f	Update English tokenizer exceptions and add norms	2017-06-03 20:27:16 +02:00
ines	e5d426406a	Add base norm exceptions	2017-06-03 20:27:05 +02:00
ines	4c2bbc3ccc	Add add_lookups util function	2017-06-03 19:44:47 +02:00
ines	05fe6758a7	Set lexeme attributes for tokenizer special cases	2017-06-03 19:44:39 +02:00
ines	3152ee5ca2	Update serialization tests for tokenizer	2017-06-03 17:05:28 +02:00
ines	7c919aeb09	Make sure serializers and deserializers are ordered	2017-06-03 17:05:09 +02:00
ines	1ebd0d3f27	Add assert_packed_msg_equal util function	2017-06-03 17:04:30 +02:00
ines	de974f7bef	Add serializer tests for tokenizer	2017-06-03 13:26:34 +02:00
ines	0153b66a86	Return self in Tokenizer.from_bytes	2017-06-03 13:26:13 +02:00
ines	82154a1861	Add letter spacing to arrow label	2017-06-03 13:25:41 +02:00
ines	32c6f05de9	Adjust spacing and sizing in compact mode	2017-06-03 13:25:32 +02:00
ines	cc8c8617a4	Shut down displaCy server on KeyboardInterrupt	2017-06-03 13:24:56 +02:00
ines	70fbba7d08	Clone Doc to never merge punctuation on original Doc	2017-06-03 13:24:43 +02:00
ines	459a1e8470	Fix whitespace	2017-06-03 11:31:18 +02:00
ines	5109bba910	Port over fix from #1070	2017-06-03 11:31:11 +02:00
ines	d21459f87d	Update serializer tests	2017-06-02 21:42:26 +02:00
ines	6669583f4e	Use OrderedDict	2017-06-02 21:07:56 +02:00
ines	2f1025a94c	Port over Spanish changes from #1096	2017-06-02 19:09:58 +02:00
ines	d86e7cde93	Add entity recognizer to parser serialization tests	2017-06-02 18:40:06 +02:00
ines	0051c05964	Add tests for serializing parser	2017-06-02 18:37:19 +02:00
ines	fdd0923be4	Translate model=True in exclude to lower_model and upper_model	2017-06-02 18:37:07 +02:00
ines	cef547a9f0	Add serialization tests for tensorizer	2017-06-02 18:18:30 +02:00
ines	924c58bde3	Fix serialization of optional elements	2017-06-02 18:18:17 +02:00
ines	f74a45c1fe	Remove unnecessary argument	2017-06-02 18:17:46 +02:00
ines	43b4d63f85	Add serialization tests for tagger	2017-06-02 17:29:34 +02:00
ines	1b593bbd6d	Fix encoding on tagger serialization	2017-06-02 17:29:21 +02:00
Matthew Honnibal	5f4d328e2c	Fix serialization of tag_map in NeuralTagger	2017-06-02 10:18:37 -05:00
Matthew Honnibal	ed6f575e06	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-06-02 04:26:39 -05:00
ines	acd65c00f6	Add serialization tests for StringStore and Vocab	2017-06-02 10:57:42 +02:00
ines	41a6adf1f6	Initialise Vocab length correctly	2017-06-02 10:57:25 +02:00
ines	53b82f972a	Add strings to Vocab in init, instead of StringStore	2017-06-02 10:57:06 +02:00
ines	023f38bdd4	Fix return value of Vocab.from_bytes	2017-06-02 10:56:40 +02:00
ines	9692c98f57	Add test utils for temp file and temp dir	2017-06-02 10:56:09 +02:00
Matthew Honnibal	c650bc481c	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-06-01 13:03:57 -05:00
Matthew Honnibal	307d615c5f	Fix serialization for tagger when tag_map has changed	2017-06-01 12:18:36 -05:00
Matthew Honnibal	1d18cedae8	Fiddle with msgpack bytes vs unicode	2017-06-01 10:48:43 -05:00
ines	7a2380f617	Rename "nn_tagger" to "tagger"	2017-06-01 17:37:53 +02:00
ines	e5ae6ccf4e	Fix typo	2017-06-01 16:46:15 +02:00
ines	a3e4f91f4a	Only load vocab if it exists	2017-06-01 14:38:35 +02:00
Matthew Honnibal	d310b0aab3	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-06-01 04:58:03 -05:00
Matthew Honnibal	3ff7d7fcef	Merge for updated requirements	2017-06-01 04:57:47 -05:00
Matthew Honnibal	5eae3b9a1e	Fix to/from disk in tagger	2017-06-01 04:55:49 -05:00
ines	d5c8d2f5fd	Update about.py and increment version	2017-06-01 11:52:24 +02:00
Matthew Honnibal	4c97371051	Fixes for thinc 6.7	2017-06-01 04:22:16 -05:00
Matthew Honnibal	53d00a0371	Move weight serialization to Thinc	2017-06-01 03:04:36 -05:00
Matthew Honnibal	ae8010b526	Move weight serialization to Thinc	2017-06-01 02:56:12 -05:00
Gyorgy Orosz	f0c3b09242	More robust Hungarian tokenizer.	2017-05-31 22:28:40 +02:00
Matthew Honnibal	c8a58cfcf8	Fix Python2/3 load bug	2017-05-31 15:21:44 -05:00
Matthew Honnibal	99982684b0	Fix normalize_string_keys function	2017-05-31 14:08:16 -05:00
Matthew Honnibal	67ade63fc4	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-31 08:28:42 -05:00
Matthew Honnibal	490b38e6bb	Fix reference to thinc copy_array util	2017-05-31 08:25:21 -05:00
Matthew Honnibal	9805e0e369	Fix vocab pickling	2017-05-31 08:25:01 -05:00
Matthew Honnibal	6c51cd77b4	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-31 15:06:56 +02:00
Matthew Honnibal	8dfb9546f0	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-31 07:21:14 -05:00
Matthew Honnibal	480ef8bfc8	Add compat function to normalize dict keys	2017-05-31 07:14:29 -05:00
Matthew Honnibal	92f9e5cc9a	Silence env_opt, and fix serialization for GPU	2017-05-31 07:14:11 -05:00
Matthew Honnibal	0561df2a9d	Fix tokenizer serialization	2017-05-31 14:12:38 +02:00
Matthew Honnibal	4a398c15b7	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-31 13:44:16 +02:00
Matthew Honnibal	097ab9c6e4	Fix transition system to/from disk	2017-05-31 13:44:00 +02:00
Matthew Honnibal	b1469d3360	Fix string serialisation	2017-05-31 13:43:44 +02:00
Matthew Honnibal	e9419072e7	Fix tokenizer serialisation	2017-05-31 13:43:31 +02:00
Matthew Honnibal	33e5ec737f	Fix to/from disk methods	2017-05-31 13:43:10 +02:00
ines	5e1c361270	Update tests README with info on model tests	2017-05-31 12:22:58 +02:00
Matthew Honnibal	fe28602f2e	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-31 11:43:56 +02:00
Matthew Honnibal	66af019d5d	Fix serialization of tokenizer	2017-05-31 11:43:40 +02:00
Ines Montani	e6cf3c7e1c	Merge pull request #1093 from oroszgy/hu_emoji_fix: Fixed emoji handling for Hungarian	2017-05-31 11:33:24 +02:00
Matthew Honnibal	e98eff275d	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-31 10:29:15 +02:00
Matthew Honnibal	53a3824334	Fix mistake in ner feature	2017-05-31 03:01:02 +02:00
Matthew Honnibal	8a693c2605	Write binary file during training	2017-05-31 02:59:18 +02:00
Matthew Honnibal	498ad85309	Try using tensor for vector/similarity methods	2017-05-30 23:35:17 +02:00
Matthew Honnibal	a131981f3b	Work on vectors	2017-05-30 23:34:50 +02:00
Matthew Honnibal	6937e311a4	Update doc tests	2017-05-30 23:34:23 +02:00
Matthew Honnibal	cc911feab2	Fix bug in NER state	2017-05-30 22:12:19 +02:00
Gyorgy Orosz	8c0b4b850e	Fixed emoji handling for Hungarian	2017-05-30 21:34:46 +02:00
Matthew Honnibal	be4a640f0c	Fix arc eager label costs for uint64	2017-05-30 20:37:58 +02:00
Matthew Honnibal	b127645afc	Fix test_misc merge conflict	2017-05-29 18:31:44 -05:00
Matthew Honnibal	e0e8eae7c7	Tweak package test	2017-05-29 18:30:42 -05:00
Matthew Honnibal	11840ff5dd	Store tag map before normalizing props	2017-05-29 17:53:48 -05:00
Matthew Honnibal	b92a89f87b	Make it easier to reference embedding tables	2017-05-29 17:53:29 -05:00
Matthew Honnibal	293d1b425b	Serialize in consistent order	2017-05-29 17:53:06 -05:00
Matthew Honnibal	9bf22a94aa	Fix tag set serialisation	2017-05-29 17:52:36 -05:00
Matthew Honnibal	2a061e2777	Fix serialisation, for reals this time	2017-05-29 17:52:08 -05:00
ines	20a7003c0d	Update model fixtures and reorganise tests	2017-05-29 22:14:31 +02:00
ines	795fe43a4d	Add load_test_model function with importorskip(). Loads model only if it can be imported, i.e. if it's installed as a package.	2017-05-29 22:11:31 +02:00
ines	ad3c8b3ad9	Fix formatting	2017-05-29 22:10:50 +02:00
ines	6e3937efc5	Check for arguments of model markers to specify models to test. Lets user set --models --en for only English models.	2017-05-29 22:10:16 +02:00
Matthew Honnibal	35d981241f	Fix model deserialization	2017-05-29 14:46:31 -05:00
Matthew Honnibal	5b29f227ae	Fix serialization	2017-05-29 14:35:53 -05:00
Matthew Honnibal	1e6df0a2a1	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-29 14:30:12 -05:00
ines	08382f21e3	Pass model meta to nlp object in load_model	2017-05-29 20:44:11 +02:00
ines	6145fe6a93	Catch all kwargs on Language	2017-05-29 20:43:48 +02:00
ines	0d7d50fe22	Add __version__ to __init__.py	2017-05-29 20:43:24 +02:00
Matthew Honnibal	6522ea6c8b	More serialization fixes. Still broken	2017-05-29 13:23:47 -05:00
Matthew Honnibal	9c9ee24411	Fix broken lambda scoping in Python 2	2017-05-29 13:23:28 -05:00
Matthew Honnibal	f1acdaab55	Fix serialization of weight offsets	2017-05-29 13:23:11 -05:00
Matthew Honnibal	c044e9c21c	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-29 08:41:02 -05:00
Matthew Honnibal	aa4c33914b	Work on serialization	2017-05-29 08:40:45 -05:00
ines	9e83a17e95	Use new model templates	2017-05-29 15:27:24 +02:00
ines	567485a818	Fix and document model loading with pipeline and overrides	2017-05-29 14:10:10 +02:00
Matthew Honnibal	deac7eb01c	Fix for serialization	2017-05-29 13:54:18 +02:00
Matthew Honnibal	04c32aa091	Fix for serialization	2017-05-29 13:53:32 +02:00
Matthew Honnibal	a1960c2d09	Fix for serialization	2017-05-29 13:47:42 +02:00
Matthew Honnibal	7b06bb896e	Fix for serialization	2017-05-29 13:42:55 +02:00
Matthew Honnibal	74235587ef	Fix to serialization	2017-05-29 13:40:31 +02:00
Matthew Honnibal	59f355d525	Fixes for serialization	2017-05-29 13:38:20 +02:00
Matthew Honnibal	920887f4e4	Specify order of vocab deserialization	2017-05-29 13:04:40 +02:00
Matthew Honnibal	f4aafca222	Merge changes to test_misc	2017-05-29 12:26:02 +02:00
Matthew Honnibal	a318f0cae1	Add to/from disk/bytes methods for tokenizer	2017-05-29 12:24:41 +02:00
Matthew Honnibal	ff26aa6c37	Work on to/from bytes/disk serialization methods	2017-05-29 11:45:45 +02:00
ines	df920ba0e7	Add tests for displaCy and util functions and fix util typo	2017-05-29 10:51:19 +02:00
ines	c5714d4fb2	xfail matcher test for now until setting norm via Span.merge works	2017-05-29 10:51:02 +02:00
Matthew Honnibal	6b019b0540	Update to/from bytes methods	2017-05-29 10:14:20 +02:00
Matthew Honnibal	c91b121aeb	Move serialization functions to util	2017-05-29 10:13:42 +02:00
Matthew Honnibal	1fa2bfb600	Add model_to_bytes and model_from_bytes helpers. Probably belong in thinc.	2017-05-29 09:27:04 +02:00
Matthew Honnibal	6dad4117ad	Work on serialization for models	2017-05-29 01:37:57 +02:00
ines	7b1ddcc04d	Add test for vocab serialization	2017-05-29 01:09:52 +02:00
ines	00b2094dc3	Fix typos, long integers and tests	2017-05-29 01:09:52 +02:00
ines	804dbb8d25	Add StringStore test for API docs	2017-05-29 01:09:52 +02:00
Matthew Honnibal	6cd5730ee7	Fix lex struct setters for strings	2017-05-29 01:05:09 +02:00
Matthew Honnibal	2edd96ce47	Draft Vocab to/from disk/bytes	2017-05-28 23:34:12 +02:00
Matthew Honnibal	4ddff020c3	Fix compile error	2017-05-28 23:30:40 +02:00
Matthew Honnibal	6d3caeadd2	Fix type check for long	2017-05-28 23:22:45 +02:00
Matthew Honnibal	92dbf28c1e	Hack a fixture in the vectors tests, for xfail	2017-05-28 20:28:32 +02:00
Matthew Honnibal	9239f06ed3	Fix German noun chunks iterator	2017-05-28 20:13:03 +02:00
Matthew Honnibal	fd9b6722a9	Fix noun chunks iterator for new stringstore	2017-05-28 20:12:10 +02:00
ines	414193e9ba	Update docs to reflect StringStore changes	2017-05-28 18:19:11 +02:00
Matthew Honnibal	7996d21717	Fixes for new StringStore	2017-05-28 11:09:27 -05:00
Matthew Honnibal	8a24c60c1e	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-28 08:12:05 -05:00
Matthew Honnibal	bc97bc292c	Fix __call__ method	2017-05-28 08:11:58 -05:00
Matthew Honnibal	5cf47b847b	Handle iob with no tag in converter	2017-05-28 08:11:39 -05:00
Matthew Honnibal	fe11564b8e	Finish stringstore change. Also xfail vectors tests	2017-05-28 15:10:22 +02:00
Matthew Honnibal	b007a2b0d3	Update stringstore tests	2017-05-28 14:08:09 +02:00
Matthew Honnibal	84e66ca6d4	WIP on stringstore change. 27 failures	2017-05-28 14:06:40 +02:00
Matthew Honnibal	fe4a746300	Accommodate symbols in new string scheme	2017-05-28 13:03:16 +02:00
Matthew Honnibal	f51e6a6c16	Adjust lexeme sizing for attr_t being 64 bit	2017-05-28 12:51:09 +02:00
Matthew Honnibal	a5606c3eda	Work on changing StringStore to return hashes.	2017-05-28 12:36:27 +02:00
Matthew Honnibal	39293ab2ee	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-28 11:46:57 +02:00
Matthew Honnibal	dd052572d4	Update arc eager for SBD changes	2017-05-28 11:46:51 +02:00
Matthew Honnibal	3ea98e2043	Remove vector member from lexeme	2017-05-28 11:46:24 +02:00
Matthew Honnibal	2445707f3c	Re-delegate vectors to vocab	2017-05-28 11:46:10 +02:00
Matthew Honnibal	6863d01361	Remove vectors from lexeme	2017-05-28 11:45:48 +02:00
Matthew Honnibal	15f6efc127	Remove vectors from vocab	2017-05-28 11:45:32 +02:00
Matthew Honnibal	c1263a844b	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-27 18:32:57 -05:00
Matthew Honnibal	9e711c3476	Divide d_loss by batch size	2017-05-27 18:32:46 -05:00
Matthew Honnibal	b082f76494	Randomize pipeline order during training	2017-05-27 18:32:21 -05:00
Matthew Honnibal	a1d4c97fb7	Improve correctness of minibatching	2017-05-27 17:59:00 -05:00
ines	84189c1cab	Add 'xx' language ID for multi-language support. Allows models to specify their language ID as 'xx'.	2017-05-28 00:58:59 +02:00
ines	33e332e67c	Remove unused export	2017-05-28 00:57:59 +02:00
ines	c1983621fb	Update util functions for model loading	2017-05-28 00:22:40 +02:00
ines	c8543c8237	Fix formatting and docstrings and remove deprecated function	2017-05-28 00:22:40 +02:00
Matthew Honnibal	49235017bf	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-27 16:34:28 -05:00
Matthew Honnibal	7ebd26b8aa	Use ordered dict to specify transitions	2017-05-27 15:52:20 -05:00
Matthew Honnibal	3eea5383a1	Add move_names property to parser	2017-05-27 15:51:55 -05:00
Matthew Honnibal	8de9829f09	Don't overwrite model in initialization, when loading	2017-05-27 15:50:40 -05:00
Matthew Honnibal	99316fa631	Use ordered dict to specify actions	2017-05-27 15:50:21 -05:00
Matthew Honnibal	655ca58c16	Clarifying change to StateC.clone	2017-05-27 15:49:37 -05:00
Matthew Honnibal	5e4312feed	Evaluate loaded class, to ensure save/load works	2017-05-27 15:47:02 -05:00
Matthew Honnibal	34bbad8e0e	Add __reduce__ methods on parser subclasses. Fixes pickling.	2017-05-27 15:46:06 -05:00
Matthew Honnibal	7cc9c3e9a6	Fix convert CLI	2017-05-27 15:44:42 -05:00
ines	1203959625	Add pipeline setting to meta.json generator	2017-05-27 20:02:01 +02:00
ines	086a06e7d7	Fix CLI docstrings and add command as first argument. Workaround for Plac.	2017-05-27 20:01:46 +02:00
ines	a8e58e04ef	Add symbols class to punctuation rules to handle emoji (see #1088). Currently doesn't work for Hungarian, because of conflicts with the custom punctuation rules. Also doesn't take multi-character emoji like 👩🏽‍💻 into account.	2017-05-27 17:57:10 +02:00
Matthew Honnibal	dc07d72d80	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-27 08:20:40 -05:00
Matthew Honnibal	de13fe0305	Remove length cap on sentences	2017-05-27 08:20:32 -05:00
Matthew Honnibal	73a643d32a	Don't randomise pipeline for training, and don't update if no gradient	2017-05-27 08:20:13 -05:00
Matthew Honnibal	3d22fcaf0b	Return None from parser if there are no annotations	2017-05-26 14:02:59 -05:00
Matthew Honnibal	d06f235fc9	Fix conflict on convert.py	2017-05-26 11:33:29 -05:00
Matthew Honnibal	2e587c6417	Export iob_to_biluo utility	2017-05-26 11:32:55 -05:00
Matthew Honnibal	2b3b937a04	Fix converter CLI	2017-05-26 11:32:41 -05:00
Matthew Honnibal	5a87bcf35f	Fix converters	2017-05-26 11:32:34 -05:00
Matthew Honnibal	8af3100143	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-26 11:31:41 -05:00
Matthew Honnibal	3d5a536eaa	Improve efficiency of parser batching	2017-05-26 11:31:23 -05:00
Matthew Honnibal	daac3e3573	Always shuffle gold data, and support length cap	2017-05-26 11:30:52 -05:00
Matthew Honnibal	d65f99a720	Improve model saving in train script	2017-05-26 05:52:09 -05:00
ines	51882c4984	Fix formatting	2017-05-26 12:37:45 +02:00
ines	353f0ef8d7	Use disable argument (list) for serialization	2017-05-26 12:33:54 +02:00
Matthew Honnibal	22d7b448a5	Fix convert command	2017-05-25 19:47:12 -05:00
Matthew Honnibal	dbf2a4cf57	Update all models on each epoch	2017-05-25 19:46:56 -05:00
Matthew Honnibal	faff1c23fb	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-25 17:16:10 -05:00
Matthew Honnibal	82b11b0320	Remove print statement	2017-05-25 17:15:59 -05:00
Matthew Honnibal	80cf42e33b	Fix compounding and decaying utils	2017-05-25 17:15:39 -05:00
Matthew Honnibal	df8015f05d	Tweaks to train script	2017-05-25 17:15:24 -05:00
Matthew Honnibal	3a6e59cc53	Add minibatch function in spacy.gold	2017-05-25 17:15:09 -05:00
Matthew Honnibal	702fe74a4d	Clean up spacy.cli.train	2017-05-25 16:16:30 -05:00
Matthew Honnibal	b9cea9cd93	Add compounding and decaying functions	2017-05-25 16:16:10 -05:00
Matthew Honnibal	2cb7cc2db7	Remove commented code from parser	2017-05-25 14:55:09 -05:00
Matthew Honnibal	f403c2cd5f	Add env opts for optimizer	2017-05-25 11:19:26 -05:00
Matthew Honnibal	c245ff6b27	Rebatch parser inputs, with mid-sentence states	2017-05-25 11:18:59 -05:00
Matthew Honnibal	679efe79c8	Make parser update less hacky	2017-05-25 06:49:00 -05:00
Matthew Honnibal	8500d9b1da	Only train one task per iter, holding grads	2017-05-25 06:47:42 -05:00
Matthew Honnibal	b27c587800	Fix pieces argument to PrecomputedMaxout	2017-05-25 06:46:59 -05:00
Matthew Honnibal	e1cb5be0c7	Adjust dropout, depth and multi-task in parser	2017-05-24 20:11:41 -05:00
Matthew Honnibal	e6cc927ab1	Rearrange multi-task learning	2017-05-24 20:10:54 -05:00
Matthew Honnibal	135a13790c	Disable gold preprocessing	2017-05-24 20:10:20 -05:00
Matthew Honnibal	467bbeadb8	Add hidden layers for tagger	2017-05-24 20:09:51 -05:00
ines	66088851dc	Add Doc.to_disk() and Doc.from_disk() methods	2017-05-24 11:58:17 +02:00
Matthew Honnibal	620df0414f	Fix dropout in parser	2017-05-23 15:20:45 -05:00
Matthew Honnibal	5b67bcbee0	Increase default embed size to 7500	2017-05-23 15:20:16 -05:00
Matthew Honnibal	48eef94f92	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-23 18:47:32 +02:00
Matthew Honnibal	d44b1eafc4	Fix conflict artefacts	2017-05-23 18:47:11 +02:00
Matthew Honnibal	01e59e4e6e	* Add Token.sent_start property, re Issue #235	2017-05-23 18:41:11 +02:00
Matthew Honnibal	4917cbb484	Include sent_start test	2017-05-23 18:40:37 +02:00
Matthew Honnibal	d68dd1f251	Add SENT_START attribute, for custom sentence boundary detection	2017-05-23 18:37:58 +02:00
Matthew Honnibal	8026c183d0	Add hacky logic to accelerate depth=0 case in parser	2017-05-23 11:06:49 -05:00
Matthew Honnibal	e7d3159d91	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-23 05:58:17 -05:00
Matthew Honnibal	a8b6d11c5b	Support optional maxout layer	2017-05-23 05:58:07 -05:00
Matthew Honnibal	c55b8fa7c5	Fix bugs in parse_batch	2017-05-23 05:57:52 -05:00
ines	fb0ff0272f	xfail neural parser tests for now and remove test for deprecated method	2017-05-23 12:40:37 +02:00
Matthew Honnibal	964707d795	Restore support for deeper networks in parser	2017-05-23 05:31:13 -05:00
Matthew Honnibal	e27262f431	Go back to previous matcher signature, with on_match positional	2017-05-23 04:37:40 -05:00
Matthew Honnibal	5418bcf5d7	Resolve conflict on test	2017-05-23 04:37:16 -05:00
ines	e6acd3bbf2	Fix matcher tests and matcher docs	2017-05-23 11:36:02 +02:00
ines	d0c6d4f76d	Fix formatting	2017-05-23 11:32:00 +02:00
Matthew Honnibal	f0bcc0bd8d	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-23 04:29:28 -05:00
Matthew Honnibal	9adfe9e8fc	Don't hold gradient updates in language -- let the parser decide how to batch the updates.	2017-05-23 04:29:10 -05:00
Matthew Honnibal	6b918cc58e	Support making updates periodically during training	2017-05-23 04:23:29 -05:00
Matthew Honnibal	3f725ff7b3	Roll back changes to parser update	2017-05-23 04:23:05 -05:00
Matthew Honnibal	3959d778ac	Revert "Revert "WIP on improving parser efficiency"" This reverts commit `532afef4a8`.	2017-05-23 03:06:53 -05:00
Matthew Honnibal	532afef4a8	Revert "WIP on improving parser efficiency" This reverts commit `bdaac7ab44`.	2017-05-23 03:05:25 -05:00
Matthew Honnibal	bdaac7ab44	WIP on improving parser efficiency	2017-05-23 02:59:31 -05:00
Matthew Honnibal	8a9e318deb	Put the parsing loop in a nogil prange block	2017-05-22 17:58:12 -05:00
ines	a23f487b06	Tidy up displaCy and add "manual" option. Also don't require title in EntityRenderer.	2017-05-22 18:48:20 +02:00
Matthew Honnibal	0264447c4d	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-22 10:41:56 -05:00
Matthew Honnibal	6e8dce2c05	Fix train command line args	2017-05-22 10:41:39 -05:00
Matthew Honnibal	a7ee63c0ac	Fix labeller loss for unseen labels	2017-05-22 10:41:20 -05:00
Matthew Honnibal	c9760b2104	Support sentence limits in GoldCorpus	2017-05-22 10:40:46 -05:00
Matthew Honnibal	e2136232f9	Exclude states with no matching gold annotations from parsing	2017-05-22 10:30:12 -05:00
Matthew Honnibal	83ffd16474	Fix offset calculation for other negative values	2017-05-22 08:00:53 -05:00
ines	b3c7ee0148	Fix tests and use the new Matcher API	2017-05-22 13:54:20 +02:00
Matthew Honnibal	f00f821496	Fix pseudoprojectivity->nonproj	2017-05-22 06:14:42 -05:00
Matthew Honnibal	ae8cf70dc1	Fix CLI train signature	2017-05-22 06:13:39 -05:00
Matthew Honnibal	187f370734	Update tests for matcher changes	2017-05-22 12:59:50 +02:00
Matthew Honnibal	5d59e74cf6	PseudoProjectivity->nonproj	2017-05-22 05:49:53 -05:00
Matthew Honnibal	7e2cdc0c81	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-22 12:39:34 +02:00
Matthew Honnibal	70a8c531cd	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-22 05:39:18 -05:00
Matthew Honnibal	2f78413a02	PseudoProjectivity->nonproj	2017-05-22 05:39:03 -05:00
Matthew Honnibal	89ebc5c3cd	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-22 12:38:15 +02:00
Matthew Honnibal	d8bb5bb959	Implement StringStore serialization, and update tests	2017-05-22 12:38:00 +02:00
ines	54f04a9fe0	Update API docs with changes in spacy.gold and spacy.language	2017-05-22 12:29:30 +02:00
ines	b5fb43fdd8	Allow sys.exit status as exits keyword arg in util.prints()	2017-05-22 12:29:15 +02:00
ines	fc3ec733ea	Reduce complexity in CLI. Remove now-redundant model command and move plac annotations to cli files.	2017-05-22 12:28:58 +02:00
Matthew Honnibal	b45b4aa392	PseudoProjectivity --> nonproj	2017-05-22 05:17:44 -05:00
Matthew Honnibal	aae97f00e9	Fix nonproj import	2017-05-22 05:15:06 -05:00
Matthew Honnibal	9262fc4829	Fix syntax error	2017-05-22 05:14:59 -05:00
Matthew Honnibal	93a042253b	Make GoldParse attributes writeable	2017-05-22 04:51:08 -05:00
Matthew Honnibal	2a5eb9f61e	Make nonproj methods top-level functions, instead of class methods	2017-05-22 04:51:08 -05:00
Matthew Honnibal	c998776c25	Make single array for features, to reduce GPU copies	2017-05-22 04:51:08 -05:00
Matthew Honnibal	bc2294d7f1	Add support for fiddly hyper-parameters to train func	2017-05-22 04:51:08 -05:00
Matthew Honnibal	80e19a2399	Simplify CLI implementation for subcommands. Remove model command.	2017-05-22 04:51:08 -05:00
Matthew Honnibal	33e2222839	Remove unused code in deprojectivize	2017-05-22 04:51:08 -05:00
Matthew Honnibal	4e0988605a	Pass through non-projective=True	2017-05-22 04:51:08 -05:00
Matthew Honnibal	025d9bbc37	Fix handling of non-projective deps	2017-05-22 04:51:08 -05:00
Matthew Honnibal	5738d373d5	Add deprojectivize to pipeline	2017-05-22 04:51:08 -05:00
Matthew Honnibal	1b5fa68996	Do pseudo-projective pre-processing for parser	2017-05-22 04:51:08 -05:00
Matthew Honnibal	1d5d9838a2	Fix action collection for parser	2017-05-22 04:51:08 -05:00
Matthew Honnibal	8d1e64be69	Add experimental NeuralLabeller	2017-05-22 04:51:08 -05:00
Matthew Honnibal	9b1b0742fd	Fix prediction for tok2vec	2017-05-22 04:51:08 -05:00
Matthew Honnibal	f13d6c7359	Support gold preprocessing and single gold files	2017-05-22 04:51:08 -05:00
Matthew Honnibal	e14533757b	Use averaged params for evaluation	2017-05-22 04:51:08 -05:00
Matthew Honnibal	7811d97339	Refactor CLI	2017-05-22 04:51:08 -05:00
Matthew Honnibal	5db89053aa	Merge docstrings	2017-05-21 13:46:23 -05:00
Matthew Honnibal	432b3499b3	Fix memory leak	2017-05-21 13:38:46 -05:00
Matthew Honnibal	59fbfb3829	Remove train.py -- functions now in GoldCorpus and Language	2017-05-21 09:08:27 -05:00
Matthew Honnibal	8904814c0e	Add missing import	2017-05-21 09:07:56 -05:00
Matthew Honnibal	baf3ef0ddc	Remove import of removed train_config script	2017-05-21 09:07:34 -05:00
Matthew Honnibal	4c9202249d	Refactor training, to fix memory leak	2017-05-21 09:07:06 -05:00
Matthew Honnibal	4803b3b69e	Add GoldCorpus class, to manage data streaming	2017-05-21 09:06:17 -05:00
Matthew Honnibal	180e5afede	Fix tokvecs flattening in pipeline	2017-05-21 09:05:34 -05:00
Matthew Honnibal	0731971bfc	Add itershuffle utility function. Maybe belongs in thinc	2017-05-21 09:05:05 -05:00
ines	2c5cfe8bbf	Update docstrings and API docs for StringStore	2017-05-21 14:18:58 +02:00
ines	251346b59f	Fix typos and formatting	2017-05-21 14:18:46 +02:00
ines	075f5ff87a	Update docstrings and API docs for GoldParse	2017-05-21 13:53:46 +02:00
ines	99b631617d	Reformat docstrings	2017-05-21 13:32:15 +02:00
ines	885e82c9b0	Update docstrings and remove deprecated load classmethod	2017-05-21 13:27:52 +02:00
ines	c5a653fa48	Update docstrings and API docs for Tokenizer	2017-05-21 13:18:14 +02:00
ines	f216422ac5	Remove deprecated load classmethod	2017-05-21 13:18:01 +02:00
ines	d82ae9a585	Change "function" to "callable" in docs	2017-05-21 13:17:40 +02:00
ines	3871157d84	Update spacy.util documentation	2017-05-21 01:12:09 +02:00
ines	0c6c65aa3c	Improve messaging if model linking fails after download	2017-05-21 00:28:37 +02:00
Matthew Honnibal	3b7c108246	Pass tokvecs through as a list, instead of concatenated. Also fix padding	2017-05-20 13:23:32 -05:00
ines	924e8506de	Move Defaults subclass to module scope (necessary for pickling)	2017-05-20 19:02:27 +02:00
Matthew Honnibal	d52b65aec2	Revert "Move to contiguous buffer for token_ids and d_vectors" This reverts commit `3ff8c35a79`.	2017-05-20 11:26:23 -05:00
ines	27de0834b2	Update docstrings and API docs for Lexeme	2017-05-20 15:13:42 +02:00
ines	7ed8a92ed1	Update docstrings and API docs for Token	2017-05-20 15:13:33 +02:00
ines	4ed6a36622	Update docstrings and API docs for Matcher	2017-05-20 14:43:10 +02:00
ines	39f36539f6	Update docstrings and API docs for Matcher	2017-05-20 14:32:34 +02:00
ines	c00ff257be	Update docstrings and API docs for Matcher	2017-05-20 14:26:10 +02:00
ines	790435e51c	Update docstrings	2017-05-20 14:05:07 +02:00
ines	f0cc642bb9	Update docstrings and API docs for Vocab	2017-05-20 14:00:41 +02:00
Matthew Honnibal	ce9234f593	Update Matcher API	2017-05-20 13:54:53 +02:00
Matthew Honnibal	b272890a8c	Try to move parser to simpler PrecomputedAffine class. Currently broken -- maybe the previous change	2017-05-20 06:40:10 -05:00
ines	e39ad78267	Resolve model name properly in cli.info. Use util.resolve_model_path() to also allow package names and paths.	2017-05-20 12:24:40 +02:00
Matthew Honnibal	3ff8c35a79	Move to contiguous buffer for token_ids and d_vectors	2017-05-20 04:17:30 -05:00
Matthew Honnibal	8b04b0af9f	Remove freqs from transition_system	2017-05-20 02:20:48 -05:00
Matthew Honnibal	61fe55efba	Move EnglishDefaults class out of English	2017-05-20 02:18:19 -05:00
Matthew Honnibal	a1ba20e2b1	Fix over-run on parse_batch	2017-05-19 18:57:30 -05:00
ines	1d4d3d0ecd	Add TODO	2017-05-20 01:38:04 +02:00
Matthew Honnibal	7ee1827af0	Disable data caching in parser	2017-05-19 18:17:11 -05:00
Matthew Honnibal	e84de028b5	Remove 'rebatch' op, and remove min-batch cap	2017-05-19 18:16:36 -05:00
Matthew Honnibal	3376d4d6e8	Update the train script, fixing GPU memory leak	2017-05-19 18:15:50 -05:00
Matthew Honnibal	836fe1d880	Update neural net tests	2017-05-19 18:11:29 -05:00
ines	fe5d8819ea	Update Matcher docstrings and API docs	2017-05-19 21:47:06 +02:00
Matthew Honnibal	08766240c3	Add incomplete iob converter	2017-05-19 13:27:51 -05:00
Matthew Honnibal	c12ab47a56	Remove state argument in pipeline. Other changes	2017-05-19 13:26:36 -05:00
Matthew Honnibal	66ea9aebe7	Remove the state argument from Language	2017-05-19 13:25:42 -05:00
Matthew Honnibal	09a877886b	WIP on iob converter	2017-05-19 13:24:39 -05:00
ines	a804045597	Use is_ancestor instead of deprecated is_ancestor_of	2017-05-19 20:23:40 +02:00
Matthew Honnibal	8d5e6d9f4f	Rename no_ner arg to no_entities	2017-05-19 13:23:11 -05:00
ines	e9e62b01b0	Update docstrings and API docs for Token	2017-05-19 18:47:56 +02:00
ines	62ceec4fc6	Update docstrings and API docs for Span	2017-05-19 18:47:46 +02:00
ines	23f9a3ccc8	Update docstrings and API docs for Doc	2017-05-19 18:47:39 +02:00
ines	2c8c9dc0c9	Update docstrings and API docs for Language	2017-05-19 18:47:24 +02:00
ines	0791f0aae6	Update docstrings and API docs for Span class	2017-05-19 00:31:31 +02:00
ines	8455cb1327	Update docstring for Doc.__getitem__	2017-05-19 00:30:51 +02:00
ines	0fc05e54e4	Document TokenVectorEncoder	2017-05-19 00:00:02 +02:00
ines	b687ad109d	Update docstrings and API docs for Doc class	2017-05-18 23:59:44 +02:00
ines	d42bc16868	Update docstrings and API docs for Language class	2017-05-18 23:57:38 +02:00
ines	593361ee3c	Update docstrings for Span class	2017-05-18 22:17:41 +02:00
ines	b87066ff10	Update docstrings and API docs for Doc class	2017-05-18 22:17:41 +02:00
Matthew Honnibal	238be0f16a	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-18 08:32:22 -05:00
Matthew Honnibal	c214c0decb	Improve env_opt reporting	2017-05-18 08:32:03 -05:00
Matthew Honnibal	bbb59e371c	Fix GPU evaluation	2017-05-18 08:31:15 -05:00
Matthew Honnibal	c2c825127a	Fix use_params and pipe methods	2017-05-18 08:30:59 -05:00
Matthew Honnibal	ca70b08661	Fix GPU training and evaluation	2017-05-18 08:30:33 -05:00
ines	489d2fb4ba	Add is_in_jupyter() helper for displaCy (see #1058)	2017-05-18 14:13:14 +02:00
ines	abf0188b0a	Move cupy and CudaStream to compat	2017-05-18 14:12:45 +02:00
ines	33decd85b6	Reorganise and explicitly state what's importable	2017-05-18 14:12:31 +02:00
Matthew Honnibal	a438cef8c5	Fix significant bug in feature calculation -- off by 1	2017-05-18 06:21:32 -05:00
Matthew Honnibal	fc8d3a112c	Add util.env_opt support: Can set hyper params through environment variables.	2017-05-18 04:36:53 -05:00
Matthew Honnibal	d2626fdb45	Fix name error in nn parser	2017-05-18 04:31:01 -05:00
Matthew Honnibal	b460533827	Bug fixes to pipeline	2017-05-18 04:29:51 -05:00
Matthew Honnibal	8815507f8e	Move SpanishDefaults out of Language class, for pickle	2017-05-18 04:28:51 -05:00
Matthew Honnibal	2713041571	Fix GPU usage in Language	2017-05-18 04:25:19 -05:00
Matthew Honnibal	711ad5edc4	Cache features in doc2feats	2017-05-18 04:22:20 -05:00
Matthew Honnibal	39ea38c4b1	Add option to use gpu to spacy train	2017-05-18 04:21:49 -05:00
Matthew Honnibal	a1d8e420b5	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-17 08:00:04 -05:00
Matthew Honnibal	edfea3a513	Fix progress bar	2017-05-17 14:59:37 +02:00
Matthew Honnibal	0b7fd67408	Fix style check in displacy	2017-05-17 07:57:24 -05:00
Matthew Honnibal	55dab77de8	Add conversion rule for .conll	2017-05-17 13:13:48 +02:00
Matthew Honnibal	692bd2a186	Bug fix to tagger: wasn't backpropagating to token vectors	2017-05-17 13:13:14 +02:00
Matthew Honnibal	877f83807f	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-17 12:09:29 +02:00
Matthew Honnibal	793430aa7a	Get spaCy train command working with neural network: integrate models into pipeline; add basic serialization (maybe incorrect); fix pickle on vocab	2017-05-17 12:04:50 +02:00
Matthew Honnibal	3bf4a28d8d	Use tag in CoNLL converter, not POS	2017-05-17 12:04:33 +02:00
ines	1a05078c79	Add language-specific syntax iterators to en and de	2017-05-17 12:04:03 +02:00
Matthew Honnibal	c9a5d5d24b	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-16 16:22:05 +02:00
Matthew Honnibal	8cf097ca88	Redesign training to integrate NN components: obsolete .parser, .entity etc. names in favour of .pipeline; components no longer create models on initialization; models are created by a loading method (from_disk(), from_bytes() etc.) or by .begin_training(); add .predict() and .set_annotations() methods in components; pass state through pipeline, to allow components to share information more flexibly.	2017-05-16 16:17:30 +02:00
Matthew Honnibal	221b4c1ee8	Fix test for Python 3	2017-05-16 13:06:30 +02:00
Matthew Honnibal	5211645af3	Get data flowing through pipeline. Needs redesign	2017-05-16 11:21:59 +02:00
Matthew Honnibal	1d7c18e58a	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-15 21:53:47 +02:00
Matthew Honnibal	a9edb3aa1d	Improve integration of NN parser, to support unified training API	2017-05-15 21:53:27 +02:00
ines	98354be150	Only get user_data if it exists on doc	2017-05-15 13:39:47 +02:00
ines	c33bdeb564	Use uppercase for entity types	2017-05-15 01:24:57 +02:00
ines	4aaa607b8d	Add xmlns:xlink so SVGs are rendered properly as individual files	2017-05-14 19:54:13 +02:00
ines	9dd13cd76a	Update docstrings	2017-05-14 19:30:47 +02:00
ines	a04550605a	Add Jupyter notebook support (see #1058)	2017-05-14 18:39:01 +02:00
ines	c31792aaec	Add displaCy visualisers (see #1058)	2017-05-14 17:50:23 +02:00
ines	b462076d80	Merge load_lang_class and get_lang_class	2017-05-14 01:31:10 +02:00
ines	36bebe7164	Update docstrings	2017-05-14 01:30:29 +02:00
Matthew Honnibal	4b9d69f428	Merge branch 'v2' into develop: move v2 parser into nn_parser.pyx; new TokenVectorEncoder class in pipeline.pyx; new spacy/_ml.py module. Currently the two parsers live side-by-side, until we figure out how to organize them.	2017-05-14 01:10:23 +02:00
Matthew Honnibal	5cac951a16	Move new parser to nn_parser.pyx, and restore old parser, to make tests pass.	2017-05-14 00:55:01 +02:00
Matthew Honnibal	f8c02b4341	Remove cupy imports from parser, so it can work on CPU	2017-05-14 00:37:53 +02:00
Matthew Honnibal	613ba79e2e	Fiddle with sizings for parser	2017-05-13 17:20:23 -05:00
Matthew Honnibal	e6d71e1778	Small fixes to parser	2017-05-13 17:19:04 -05:00
Matthew Honnibal	188c0f6949	Clean up unused import	2017-05-13 17:18:27 -05:00
Matthew Honnibal	f85c8464f7	Draft support of regression loss in parser	2017-05-13 17:17:27 -05:00
ines	1694c24e52	Add docstrings, error messages and fix consistency	2017-05-13 21:22:49 +02:00
ines	ee7dcf65c9	Fix expand_exc to make sure it returns combined dict	2017-05-13 21:22:25 +02:00
ines	824d09bb74	Move resolve_load_name to deprecated	2017-05-13 21:21:47 +02:00
ines	a4a37a783e	Remove import from non-existing module	2017-05-13 16:00:09 +02:00
ines	5858857a78	Update languages list in conftest	2017-05-13 15:37:54 +02:00
ines	9d85cda8e4	Fix models error message and use about.__docs_models__ (see #1051)	2017-05-13 13:05:47 +02:00
ines	6b942763f0	Tidy up imports	2017-05-13 13:04:40 +02:00
ines	8c2a0c026d	Fix parse_tree test	2017-05-13 12:32:45 +02:00
ines	6129016e15	Replace deepcopy	2017-05-13 12:32:37 +02:00
ines	df68bf45ce	Set defaults for light and flat kwargs	2017-05-13 12:32:23 +02:00
ines	b9dea345e5	Remove old import	2017-05-13 12:32:11 +02:00
ines	293ee359c5	Fix formatting	2017-05-13 12:32:06 +02:00
ines	4eefb288e3	Port over PR #1055	2017-05-13 03:25:32 +02:00
Matthew Honnibal	ee1d35bdb0	Fix merge conflict	2017-05-13 03:20:19 +02:00
Matthew Honnibal	b2540d2379	Merge Kengz's tree_print patch	2017-05-13 03:18:49 +02:00
Matthew Honnibal	827b5af697	Update draft of parser neural network model. Model is good, but code is messy. Currently requires Chainer, which may cause the build to fail on machines without a GPU. Outline of the model: We first predict context-sensitive vectors for each word in the input: (embed_lower | embed_prefix | embed_suffix | embed_shape) >> Maxout(token_width) >> convolution ** 4. This convolutional layer is shared between the tagger and the parser. This prevents the parser from needing tag features. To boost the representation, we make a "super tag" with POS, morphology and dependency label. The tagger predicts this by adding a softmax layer onto the convolutional layer --- so, we're teaching the convolutional layer to give us a representation that's one affine transform from this informative lexical information. This is obviously good for the parser (which backprops to the convolutions too). The parser model makes a state vector by concatenating the vector representations for its context tokens. Current results suggest few context tokens works well. Maybe this is a bug. The current context tokens: S0, S1, S2 (top three words on the stack); B0, B1 (first two words of the buffer); S0L1, S0L2 (leftmost and second leftmost children of S0); S0R1, S0R2 (rightmost and second rightmost children of S0); S1L1, S1L2, S1R2, S1R, B0L1, B0L2 (likewise for S1 and B0). This makes the state vector quite long: 13T, where T is the token vector width (128 is working well). Fortunately, there's a way to structure the computation to save some expense (and make it more GPU friendly). The parser typically visits 2N states for a sentence of length N (although it may visit more, if it back-tracks with a non-monotonic transition). A naive implementation would require 2N (B, 13T) @ (13T, H) matrix multiplications for a batch of size B. We can instead perform one (BN, T) @ (T, 13*H) multiplication, to pre-compute the hidden weights for each positional feature wrt the words in the batch. (Note that our token vectors come from the CNN -- so we can't play this trick over the vocabulary. That's how Stanford's NN parser works --- and why its model is so big.) This pre-computation strategy allows a nice compromise between GPU-friendliness and implementation simplicity. The CNN and the wide lower layer are computed on the GPU, and then the precomputed hidden weights are moved to the CPU, before we start the transition-based parsing process. This makes a lot of things much easier. We don't have to worry about variable-length batch sizes, and we don't have to implement the dynamic oracle in CUDA to train. Currently the parser's loss function is multilabel log loss, as the dynamic oracle allows multiple states to be 0 cost. This is defined as: (exp(score) / Z) - (exp(score) / gZ), where gZ is the sum of the scores assigned to gold classes. I'm very interested in regressing on the cost directly, but so far this isn't working well. Machinery is in place for beam-search, which has been working well for the linear model. Beam search should benefit greatly from the pre-computation trick.	2017-05-12 16:09:15 -05:00
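The pre-computation trick described in the commit message above can be illustrated with a small NumPy sketch. This is not spaCy's or thinc's implementation: the function names (naive_hidden, precompute_features, hidden_from_cache), the (F, T, H) weight layout and the toy sizes are assumptions made for illustration, and the sketch ignores padding for missing context tokens and the nonlinearity that follows. It only checks that one (N, T) @ (T, F*H) product, followed by per-state lookups and sums, gives the same hidden-layer input as a separate product over the concatenated F*T state vector for each state.

```python
import numpy as np


def naive_hidden(tokvecs, W, feature_ids):
    """Concatenate the F context-token vectors of each state and multiply.

    tokvecs:     (N, T) token vectors for one sentence
    W:           (F, T, H) weights, one (T, H) slice per feature slot (assumed layout)
    feature_ids: (S, F) token index used by each feature slot in each state
    """
    F, T, H = W.shape
    # (S, F, T) -> (S, F*T): the long concatenated state vector
    states = tokvecs[feature_ids].reshape(len(feature_ids), F * T)
    return states @ W.reshape(F * T, H)


def precompute_features(tokvecs, W):
    """One wide product: (N, T) @ (T, F*H), reshaped to (N, F, H)."""
    F, T, H = W.shape
    W_wide = W.transpose(1, 0, 2).reshape(T, F * H)
    return (tokvecs @ W_wide).reshape(len(tokvecs), F, H)


def hidden_from_cache(cache, feature_ids):
    """Look up the cached row for each (token, slot) pair and sum over slots."""
    F = feature_ids.shape[1]
    return cache[feature_ids, np.arange(F)].sum(axis=1)


# Toy sizes (assumptions): N tokens, width T, hidden size H, F feature slots.
N, T, H, F = 6, 4, 5, 13
S = 2 * N  # roughly the number of states visited for a sentence of length N
rng = np.random.default_rng(0)
tokvecs = rng.standard_normal((N, T))
W = rng.standard_normal((F, T, H))
feature_ids = rng.integers(0, N, size=(S, F))

hidden_naive = naive_hidden(tokvecs, W, feature_ids)
cache = precompute_features(tokvecs, W)
hidden_cached = hidden_from_cache(cache, feature_ids)
print(np.allclose(hidden_naive, hidden_cached))
```

Because the cache depends only on the token vectors and not on the parser state, it can be built once per batch and then indexed cheaply while the transition system runs, which matches the GPU-then-CPU split the message describes.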
ines	c4857bc7db	Remove unused argument	2017-05-12 15:37:54 +02:00
ines	c13b3fa052	Add LEX_ATTRS	2017-05-12 15:37:45 +02:00
ines	bca2ea9c72	Update Portuguese lexical attributes	2017-05-12 15:37:39 +02:00
ines	2f870123bf	Fix formatting	2017-05-12 15:37:20 +02:00
ines	ca65993d59	Add basic Polish Language class	2017-05-12 09:25:37 +02:00
ines	48177c4f92	Add missing tokenizer exceptions	2017-05-12 09:25:24 +02:00
ines	bb8be3d194	Add Danish language data	2017-05-10 21:15:12 +02:00
Matthew Honnibal	4efb391994	Fix serializer	2017-05-09 18:45:18 +02:00
Matthew Honnibal	b16ae75824	Remove serializer hacks from pipeline classes	2017-05-09 18:16:40 +02:00
Matthew Honnibal	7253b4e649	Remove old serialization tests	2017-05-09 18:12:58 +02:00
Matthew Honnibal	f9327343ce	Start updating serializer test	2017-05-09 18:12:03 +02:00
Matthew Honnibal	1166b0c491	Implement Doc.to_bytes and Doc.from_bytes methods	2017-05-09 18:11:34 +02:00
Matthew Honnibal	9e167b7bb6	Strip serializer from code	2017-05-09 17:28:50 +02:00
Matthew Honnibal	b53f7dfdc3	Remove spacy.serialize	2017-05-09 17:22:06 +02:00
Matthew Honnibal	62ecdea9f2	Add binder class for document serialization	2017-05-09 17:21:00 +02:00
ines	a0b00624bb	Make sure like_email returns bool	2017-05-09 11:37:29 +02:00
ines	ea60932e1b	Fix formatting	2017-05-09 11:08:14 +02:00
ines	2c3bdd09b1	Add English test for like_num	2017-05-09 11:06:34 +02:00
ines	22375eafb0	Fix and merge attrs and lex_attrs tests	2017-05-09 11:06:25 +02:00
ines	02d0ac5cab	Remove redundant function and fix formatting	2017-05-09 11:06:04 +02:00
ines	b5ca50607e	Reorganise entity rules	2017-05-09 01:37:10 +02:00
ines	564939391a	Remove spacy.orth	2017-05-09 01:21:47 +02:00
ines	12c3d5fbba	Fix formatting	2017-05-09 01:15:28 +02:00
ines	2829a024ef	Re-add basic like_num check to global lex_attrs	2017-05-09 01:15:23 +02:00
ines	88adeee548	Add English lex_attrs overrides	2017-05-09 01:09:52 +02:00
ines	8f3fbbb147	Fix typos	2017-05-09 01:09:37 +02:00
ines	ea5fa46475	Import LEX_ATTRS from lang.lex_attrs	2017-05-09 00:58:10 +02:00
ines	2216e5f326	Reorganise lex_attrs and add dict	2017-05-09 00:57:54 +02:00
ines	e666f14d20	Add global lex_attrs	2017-05-09 00:41:53 +02:00
ines	41972c43fe	Use consistent regex imports	2017-05-09 00:34:31 +02:00
ines	7b83977020	Remove unused munge package	2017-05-09 00:16:16 +02:00
ines	c714841cc8	Move language-specific tests to tests/lang	2017-05-09 00:02:37 +02:00
ines	bd57b611cc	Update conftest to lazy load languages	2017-05-09 00:02:21 +02:00
ines	9f0fd5963f	Reorganise Hungarian punctuation rules	2017-05-09 00:01:59 +02:00
ines	fc0d793360	Reorganise Bengali punctuation rules	2017-05-09 00:01:52 +02:00
ines	e895d1afd7	Reorganise French punctuation rules	2017-05-09 00:00:54 +02:00
ines	014bda0ae3	Reorganise global punctuation rules	2017-05-09 00:00:46 +02:00
ines	a91278cb32	Rename _URL_PATTERN to URL_PATTERN	2017-05-09 00:00:00 +02:00
ines	604f299cf6	Add char classes to global language data	2017-05-08 23:59:33 +02:00
ines	f6f5d78cb9	Fix formatting	2017-05-08 23:59:17 +02:00
ines	6eb6306843	Fix language data imports	2017-05-08 23:58:31 +02:00
ines	3c0f85de8e	Remove imports in /lang/__init__.py	2017-05-08 23:58:07 +02:00
ines	86d9c29f30	Reorder util functions	2017-05-08 23:51:15 +02:00
ines	9a0d2fdef1	Add load_lang_class() util function	2017-05-08 23:50:45 +02:00
ines	614aa09582	Tidy up Bengali tokenizer exceptions	2017-05-08 22:29:49 +02:00
ines	73b577cb01	Fix relative imports	2017-05-08 22:29:04 +02:00
ines	ae99990f63	Fix formatting	2017-05-08 22:23:48 +02:00
ines	f46ffe3e89	Move language data to /lang module	2017-05-08 20:00:40 +02:00
ines	41a322c733	Fix LEMMA in exceptions and morph rules	2017-05-08 19:57:36 +02:00
ines	2edc0aee12	Update warning message	2017-05-08 19:53:36 +02:00
ines	6025cdb992	Fix string interpolation in times	2017-05-08 16:38:16 +02:00
ines	b9ba58ba5c	Add function to resolve load name. Warn if old 'path' keyword argument is used.	2017-05-08 16:33:37 +02:00
ines	e6f1a5d0a1	Add unicode declaration	2017-05-08 16:22:17 +02:00
ines	be5541bd16	Fix import and tokenizer exceptions	2017-05-08 16:20:14 +02:00
ines	2324788970	Remove bad tests	2017-05-08 16:15:27 +02:00
ines	b88c4193e7	Add missing symbol	2017-05-08 16:15:20 +02:00
ines	9a5b2bdd4c	Don't set morph rules without tag map	2017-05-08 16:15:12 +02:00
ines	4930f0fa8f	Explicitly import TOKEN_MATCH	2017-05-08 16:11:54 +02:00
ines	50b7ec03ca	Fix typo	2017-05-08 16:11:45 +02:00
ines	3ca611fe48	Fix wildcard imports	2017-05-08 15:56:29 +02:00
ines	c2469b8135	Remove __all__ export	2017-05-08 15:56:22 +02:00
ines	14a9c3ee7a	Fix wildcard import	2017-05-08 15:56:13 +02:00
ines	deed623864	Remove comment	2017-05-08 15:56:05 +02:00
ines	e7f95c37ee	Merge base tokenizer exceptions	2017-05-08 15:55:52 +02:00
ines	24606d364c	Remove redundant language_data.py files in languages Originally intended to collect all components of a language, but just made things messy. Now each component is in charge of exporting itself properly.	2017-05-08 15:55:29 +02:00
ines	a627d3e3b0	Reorganise Chinese language data	2017-05-08 15:54:36 +02:00
ines	7b86ee093a	Reorganise Swedish language data	2017-05-08 15:54:29 +02:00
ines	50510fa947	Reorganise Portuguese language data	2017-05-08 15:52:01 +02:00
ines	279895ea83	Reorganise Dutch language data	2017-05-08 15:51:39 +02:00
ines	04ef5025bd	Reorganise Norwegian language data	2017-05-08 15:51:22 +02:00
ines	5edbc725d8	Reorganise Japanese language data	2017-05-08 15:50:46 +02:00
ines	51a389d3bb	Reorganise Italian language data	2017-05-08 15:50:17 +02:00
ines	1bbfa14436	Reorganise Hungarian language data	2017-05-08 15:49:56 +02:00
ines	a77c9fc60d	Reorganise Hebrew language data	2017-05-08 15:49:28 +02:00
ines	7f05e977fa	Reorganise French language data	2017-05-08 15:49:05 +02:00
ines	0207ffdd52	Reorganise Finnish language data	2017-05-08 15:48:31 +02:00
ines	8e483ec950	Reorganise Spanish language data	2017-05-08 15:48:04 +02:00
ines	c7c21b980f	Reorganise English language data	2017-05-08 15:47:25 +02:00
ines	1bf9d5ec8b	Reorganise German language data	2017-05-08 15:44:26 +02:00
ines	7b3a983f96	Reorganise Bengali language data	2017-05-08 15:43:50 +02:00
ines	607ba458e7	Fix whitespace	2017-05-08 15:42:31 +02:00
ines	60db497525	Add update_exc and expand_exc to util Doesn't require separate language data util anymore	2017-05-08 15:42:12 +02:00
Matthew Honnibal	b44f7e259c	Clean up unused parser code	2017-05-08 15:42:04 +02:00
ines	6e5bd4f228	Remove unused functions from deprecated	2017-05-08 15:40:16 +02:00
Matthew Honnibal	17efb1c001	Change width	2017-05-08 08:40:13 -05:00
ines	f68e420bc0	Add PRON_LEMMA and DET_LEMMA to deprecated Will be replaced with proper values across the language data later.	2017-05-08 15:35:30 +02:00
ines	bd6a7cf4f6	Simplify deprecated model downloading Only relevant for spaCy < v1.7.0.	2017-05-08 15:32:10 +02:00
ines	95edd9e896	Let parse_package_meta take full path	2017-05-08 15:30:48 +02:00
ines	326746eb15	Add util function to resolve arg to model path: 1. check if in data dir or shortcut link; 2. check if installed as a pip package; 3. check if string is path to model; 4. check if Path or Path-like object	2017-05-08 15:29:47 +02:00
Matthew Honnibal	bef89ef23d	Mergery	2017-05-08 08:29:36 -05:00
ines	a7801e7342	Update spacy.load(). path argument is now deprecated and name can either take a model name or path. Implement lazy loading by importing module and read Language class name off __all__.	2017-05-08 15:27:25 +02:00
Matthew Honnibal	50ddc9fc45	Fix infinite loop bug	2017-05-08 07:54:26 -05:00
Matthew Honnibal	94e86ae00a	Predict tags with encoder	2017-05-08 07:53:45 -05:00
Matthew Honnibal	56073a11ef	Don't use tags when calculating token vectors	2017-05-08 07:52:24 -05:00
Matthew Honnibal	a66a4a4d0f	Replace einsums	2017-05-08 14:46:50 +02:00
Matthew Honnibal	8d2eab74da	Use PretrainableMaxouts	2017-05-08 14:24:55 +02:00
Matthew Honnibal	807cb2e370	Add PretrainableMaxouts	2017-05-08 14:24:43 +02:00
Matthew Honnibal	2e2268a442	Precomputable hidden now working	2017-05-08 11:36:37 +02:00
ines	94697e9afc	Fix typo	2017-05-08 02:00:37 +02:00
ines	0ee2a22b67	Merge branch 'pr/1024' into develop	2017-05-08 01:12:44 +02:00
ines	c4492d260a	Fix kwargs	2017-05-08 01:05:24 +02:00
Matthew Honnibal	10682d35ab	Get pre-computed version working	2017-05-08 00:38:35 +02:00
ines	b5a726c5cd	Tidy up deprecated.py	2017-05-07 23:29:22 +02:00
ines	59c3b9d4dd	Tidy up CLI and fix print functions	2017-05-07 23:25:29 +02:00
ines	311704674d	Add path2str compat function	2017-05-07 23:24:56 +02:00
ines	e34069db9f	Move is_package and get_model_package_path to util	2017-05-07 23:24:51 +02:00
ines	957ba676b4	Add model files base path to about.py	2017-05-07 23:22:35 +02:00
ines	8d8dd9ceb2	Don't set default value for model	2017-05-07 23:22:21 +02:00
Matthew Honnibal	35458987e8	Checkpoint -- nearly finished reimpl	2017-05-07 23:05:01 +02:00
Matthew Honnibal	4441866f55	Checkpoint -- nearly finished reimpl	2017-05-07 22:47:06 +02:00
Matthew Honnibal	6782eedf9b	Tmp GPU code	2017-05-07 11:04:24 -05:00
Matthew Honnibal	e420e5a809	Tmp	2017-05-07 07:31:09 -05:00
Matthew Honnibal	12039e80ca	Switch to single matmul for state layer	2017-05-07 14:26:34 +02:00
Matthew Honnibal	700979fb3c	CPU/GPU compat	2017-05-07 04:01:11 +02:00
Matthew Honnibal	f99f5b75dc	working residual net	2017-05-07 03:57:26 +02:00
Matthew Honnibal	bdf2dba9fb	WIP on refactor, with hidde pre-computing	2017-05-07 02:02:43 +02:00
Matthew Honnibal	b439e04f8d	Learning smoothly	2017-05-06 20:38:12 +02:00
Matthew Honnibal	08bee76790	Learns things	2017-05-06 18:24:38 +02:00
Matthew Honnibal	04ae1c01f1	Learns things	2017-05-06 18:21:02 +02:00
Matthew Honnibal	bcf4cd0a5f	Learns things	2017-05-06 17:37:36 +02:00
Matthew Honnibal	8e48b58cd6	Gradients look correct	2017-05-06 16:47:15 +02:00
Matthew Honnibal	7e04260d38	Data running through, likely errors in model	2017-05-06 14:22:20 +02:00
Matthew Honnibal	fa7c1990b6	Restore tok2vec function	2017-05-05 20:12:03 +02:00
Matthew Honnibal	efe9630e1c	Bug fixes	2017-05-05 20:09:50 +02:00
Matthew Honnibal	ef4fa594aa	Draft of NN parser, to be tested	2017-05-05 19:20:39 +02:00
Matthew Honnibal	7d1df50aec	Draft up Parser model	2017-05-04 13:31:40 +02:00
Matthew Honnibal	ccaf26206b	Pseudocode for parser	2017-05-04 12:17:59 +02:00
ines	b1f22c5a10	Fix formatting	2017-05-03 20:11:02 +02:00
ines	a04b5be1b2	Add glossary for annotation scheme (closes #1034) Can be imported as explain from spacy.glossary, or called as spacy.explain(term)	2017-05-03 17:02:17 +02:00
Gregory Howard	929f2792a7	Rennaming cls in module. cls is now a class	2017-05-03 15:41:07 +02:00
Gregory Howard	0e8c41ea4f	Adding method lemmatizer for every class	2017-05-03 12:14:42 +02:00
Gregory Howard	32ca07989e	adding export japanese	2017-05-03 11:07:29 +02:00
Grégory Howard	f9d7144224	Merge branch 'master' into master	2017-05-03 11:04:51 +02:00
Gregory Howard	f2ab7d77b4	Lazy imports language	2017-05-03 11:01:42 +02:00
Ines Montani	3ea23a3f4d	Fix formatting	2017-05-03 09:44:38 +02:00
Ines Montani	d730eb0c0d	Raise custom ImportError if importing janome fails	2017-05-03 09:43:29 +02:00
Ines Montani	949ad6594b	Add newline	2017-05-03 09:38:43 +02:00
Ines Montani	d12ca587ea	Add newline	2017-05-03 09:38:29 +02:00
Ines Montani	8676cd0135	Add newline	2017-05-03 09:38:07 +02:00
Yasuaki Uechi	c8f83aeb87	Add basic japanese support	2017-05-03 13:56:21 +09:00
Gregory Howard	c0afcd22bb	Merge remote-tracking branch 'remotes/upstream/master'	2017-04-27 14:42:54 +02:00
Matthew Honnibal	31ec9e1371	Merge branch 'master' of https://github.com/explosion/spaCy	2017-04-27 13:21:39 +02:00
Matthew Honnibal	2da16adcc2	Add dropout optin for parser and NER Dropout can now be specified in the `Parser.update()` method via the `drop` keyword argument, e.g. nlp.entity.update(doc, gold, drop=0.4) This will randomly drop 40% of features, and multiply the value of the others by 1. / 0.4. This may be useful for generalising from small data sets. This commit also patches the examples/training/train_new_entity_type.py example, to use dropout and fix the output (previously it did not output the learned entity).	2017-04-27 13:18:39 +02:00
Gregory Howard	92f368f83b	Removing extra spaces	2017-04-27 12:02:14 +02:00
Gregory Howard	13b6957c8e	Adding unitest for tokenization in french (with title)	2017-04-27 11:53:44 +02:00
Gregory Howard	8ff4682255	correcting tokenizer exception. Adding tests for lemmatization	2017-04-27 11:52:14 +02:00
Ines Montani	7da9cefd25	Merge pull request #1022 from luvogels/master Initial support for Norwegian Bokmål	2017-04-27 11:16:06 +02:00
Ines Montani	c9e592ae6c	Add newline	2017-04-27 11:15:41 +02:00
Ines Montani	5942adccc2	Add newline	2017-04-27 11:15:19 +02:00
Ines Montani	4cd9269aef	Add newline	2017-04-27 11:15:04 +02:00
Ines Montani	ccf13ecc21	Add newline	2017-04-27 11:14:42 +02:00
Ines Montani	03d2b0cc05	Add newline	2017-04-27 11:14:26 +02:00
Gregory Howard	44cb486849	Adding unitest for tokenization in french (with title)	2017-04-27 10:59:38 +02:00
Gregory Howard	ad8129cb45	Improvement of rules now title insentive and have same declaration format	2017-04-27 10:23:56 +02:00
luvogels	d12a0b6431	Hooked up tokenizer tests	2017-04-26 23:21:41 +02:00
Matthew Honnibal	f0e1606d27	Increment version	2017-04-26 20:25:41 +02:00
luvogels	b331929a7e	Merge branch 'master' of https://github.com/luvogels/spaCy	2017-04-26 19:15:48 +02:00
luvogels	8de59ce3b9	Added tokenizer tests	2017-04-26 19:10:18 +02:00
Matthew Honnibal	4d98511db7	Make Span hashable. Closes #1019	2017-04-26 19:01:05 +02:00
Matthew Honnibal	24c4c51f13	Try to make test999 less flakey	2017-04-26 18:42:06 +02:00
Leif Uwe Vogelsang	460094bf09	Update __init__.py	2017-04-26 18:27:55 +02:00
ines	527d51ac9a	Fetch shortcuts from GitHub and improve error handling	2017-04-26 18:00:28 +02:00
Gregory Howard	ed5f094451	Adding insensitive lemmatisation test	2017-04-25 18:07:02 +02:00
ghoward	26e31afc18	renamming tests	2017-04-25 17:46:01 +02:00
ghoward	c085c2d391	Adding some unitests	2017-04-25 17:44:16 +02:00
ghoward	55c6910f90	Look_up table for languages in spacy. Need to find an another name for lemmatizerlookup. I was not inspired. Trying to uses new files in fr language.	2017-04-24 16:39:00 +02:00
Matthew Honnibal	c4be9c36fe	Fix unicode header in tests	2017-04-24 10:09:01 +02:00
Matthew Honnibal	65f10b53e5	Fix test	2017-04-24 00:25:55 +02:00
Matthew Honnibal	70a43858e1	Fix flakey test	2017-04-24 00:06:30 +02:00
Matthew Honnibal	3973af2d15	Make training test less flakey	2017-04-23 22:59:34 +02:00
Matthew Honnibal	4f9657b42b	Fix reporting if no dev data with train	2017-04-23 22:27:10 +02:00
Matthew Honnibal	df2ac8b843	Merge branch 'master' of https://github.com/explosion/spaCy	2017-04-23 21:25:07 +02:00
Matthew Honnibal	d0e19267e8	Create directory if missing in save_to_directory	2017-04-23 21:24:43 +02:00
ines	42305bc519	Remove unnecessary test	2017-04-23 21:21:41 +02:00
ines	012ea594d1	Add file for misc tests	2017-04-23 21:06:51 +02:00
ines	83f66947dc	Rename test_download to test_cli	2017-04-23 21:06:50 +02:00
ines	401045433c	Simplify compat.fix_text	2017-04-23 21:06:50 +02:00
Matthew Honnibal	e033c86a64	Increment version	2017-04-23 21:03:43 +02:00
Matthew Honnibal	d2436dc17b	Update fix for Issue #999	2017-04-23 18:14:37 +02:00
Matthew Honnibal	874a3cbb07	Add test for Issue #955	2017-04-23 17:57:01 +02:00
Matthew Honnibal	60703cede5	Ensure noun chunks can't be nested. Closes #955	2017-04-23 17:56:39 +02:00
Matthew Honnibal	c9ec24b257	Merge branch 'master' of https://github.com/explosion/spaCy	2017-04-23 17:07:46 +02:00
Matthew Honnibal	5d8af40445	Add test for Issue #999	2017-04-23 17:06:30 +02:00
Matthew Honnibal	4d2a659c52	Fix json dump for Python3	2017-04-23 17:05:53 +02:00
Matthew Honnibal	040751ad17	Remove xfail on Test #910	2017-04-23 16:28:55 +02:00
ines	3a9710f356	Pass dev_scores to print_progress correctly (resolves #1008) Only read scores attribute if command is used with dev_data, otherwise default dev_scores to empty dict.	2017-04-23 15:58:40 +02:00
Matthew Honnibal	1b12f342e4	Merge branch 'master' of https://github.com/explosion/spaCy	2017-04-20 17:03:11 +02:00
Matthew Honnibal	4eef200bab	Persist the actions within spacy.parser.cfg	2017-04-20 17:02:44 +02:00
ines	25c70b4cc5	Move fix_text to spacy.compat (see #1002)	2017-04-20 15:47:17 +02:00
Ines Montani	60b5243bee	Merge pull request #1002 from oroszgy/model_cli_fix Fixes for the `model` CLI	2017-04-20 15:41:03 +02:00
Gyorgy Orosz	4a06a2572c	Using ftfy for handling broken encoded strings.	2017-04-20 13:34:51 +02:00
Ines Montani	3800b29046	Merge pull request #1001 from recognai/master Add SPACE to es tag map	2017-04-20 12:16:34 +02:00
oeg	f0bcd0babb	fix(model): Add SPACE to es tag_map. Fixing error in morphology.pyx when SP tag is missing	2017-04-20 11:36:24 +02:00
Ben Eyal	e90e8a3f10	Enable test	2017-04-20 02:25:24 +03:00
Ben Eyal	33af52599e	Redefine alphabetic characters For caseless languages (Hebrew, Bengali) all characters are both lowercase and uppercase.	2017-04-20 02:25:02 +03:00
Ben Eyal	d8098a8be2	Use `regex` instead of `re`	2017-04-20 02:22:52 +03:00
oeg	daaa42dd25	Merge remote-tracking branch 'upstream/master'	2017-04-19 23:30:36 +02:00
oeg	936a297241	fix(model): Fix tag map for fixing issues with tag SPACE	2017-04-19 23:30:21 +02:00
luvogels	c7cec7e5e2	Update __init__.py	2017-04-19 21:06:30 +02:00
luvogels	55e8cade36	Update __init__.py	2017-04-19 21:06:30 +02:00
luvogels	03abd0c8e6	Update __init__.py	2017-04-19 21:06:30 +02:00
Leif Uwe Vogelsang	538a8d6b12	Resolved merge conflict by incorporating both suggestions.	2017-04-19 21:06:07 +02:00
Leif Uwe Vogelsang	e821c48489	Norwegian language basics	2017-04-19 21:04:01 +02:00
Leif Uwe Vogelsang	3796c668d9	more norwegian	2017-04-19 21:01:32 +02:00
Leif Uwe Vogelsang	bc9557b21f	Norwegian language basics	2017-04-19 21:00:01 +02:00
ines	2bd89e7ade	Tidy up Hebrew tests and test for punctuation (see #995)	2017-04-19 19:28:03 +02:00
ines	48da244058	Use spacy.compat.json_dumps for Python 2/3 compatibility (resolves #991)	2017-04-19 11:50:36 +02:00
ines	ddd5194088	Update Language docs and docstrings	2017-04-17 01:52:13 +02:00
ines	f62b740961	Use compat.json_dumps	2017-04-17 01:46:14 +02:00
ines	8e83f8e2fa	Update docstrings	2017-04-17 01:40:26 +02:00
ines	e2299dc389	Ensure path in save_to_directory	2017-04-17 01:40:14 +02:00
ines	82f5f1f98f	Replace str with compat.unicode_	2017-04-17 01:29:54 +02:00
ines	16a8521efa	Increment version	2017-04-16 22:38:38 +02:00
Matthew Honnibal	4efd6fb9d6	Fix training	2017-04-16 15:28:27 -05:00
Matthew Honnibal	17c9fffb9e	Fix naked except	2017-04-16 15:28:16 -05:00
ines	5610fdcc06	Get language name first if no model path exists Makes sure spaCy fails early if no tokenizer exists, and allows printing better error message.	2017-04-16 22:16:47 +02:00
ines	ad168ba88c	Set model name to empty string if path override exists Required for parse_package_meta, which composes path of data_path and model_name (needs to be fixed in the future)	2017-04-16 22:15:51 +02:00
ines	97647c46cd	Add docstring and todo note	2017-04-16 22:14:45 +02:00
ines	5c5f8c0a72	Check if full string is found in lang classes first This allows users to set arbitrary strings. (Otherwise, custom lang class "my_custom_class" would always load Burmese "my" tokenizer if one was available.)	2017-04-16 22:14:38 +02:00
ines	13d30b6c01	xfail lemmatizer test that's causing problems (see #546)	2017-04-16 21:18:39 +02:00
Matthew Honnibal	4931c56afc	Increment version	2017-04-16 13:59:38 -05:00
ines	6145b7c153	Remove redundant Path	2017-04-16 20:53:25 +02:00
Matthew Honnibal	fa89613444	Merge branch 'master' of https://github.com/explosion/spaCy	2017-04-16 13:42:56 -05:00
ines	1f9f867c70	Remove unused util function	2017-04-16 20:37:45 +02:00
ines	7670c745b6	Update spacy.load() and fix path checks	2017-04-16 20:37:45 +02:00
ines	d3759dfb32	Fix docstring	2017-04-16 20:37:45 +02:00
ines	ed7e19ad68	Remove unused import	2017-04-16 20:37:45 +02:00
ines	0084466a66	Remove unused utf8open util and replace os.path with ensure_path	2017-04-16 20:37:45 +02:00
Matthew Honnibal	89a4f262fc	Fix training methods	2017-04-16 13:00:37 -05:00
Matthew Honnibal	6a4221a6de	Allow lemma to be set from Python. Re #973	2017-04-16 18:07:53 +02:00
Matthew Honnibal	137b210bcf	Restore use of FTRL training	2017-04-16 18:02:42 +02:00
ines	d10bd0eaf9	Fix formatting	2017-04-16 13:42:34 +02:00
ines	8191e33cf1	Update link error message with info on permissions	2017-04-16 13:32:31 +02:00
ines	a3ddbc0444	Add note about --force flag to error message	2017-04-16 13:14:36 +02:00
ines	e3de035814	Add meta validation to check for required settings Complain if no "lang", "name" or "version" is found (those settings are used in directory / package names). Package will still build without, but it'll inevitably fail somewhere down the line.	2017-04-16 13:13:17 +02:00
ines	a7574b7572	Add more options to read in meta data in package command Add meta option to supply path to meta.json. If no meta path is set, check if meta.json exists in input directory and use it. Otherwise, prompt for details on the command line.	2017-04-16 13:06:02 +02:00
ines	13c8a42d2b	Fix typos	2017-04-16 13:03:58 +02:00
ines	31fa73293a	Move read_json out to own util function	2017-04-16 13:03:28 +02:00
Matthew Honnibal	45464d065e	Remove print statement	2017-04-15 16:11:43 +02:00
Matthew Honnibal	c76cb8af35	Fix training for new labels	2017-04-15 16:11:26 +02:00
Matthew Honnibal	4884b2c113	Refix StepwiseState	2017-04-15 16:00:28 +02:00
Matthew Honnibal	e6ee7e130f	Fix parse package meta	2017-04-15 13:38:53 +02:00
Matthew Honnibal	1a98e48b8e	Fix Stepwisestate'	2017-04-15 13:35:01 +02:00
ines	0739ae7b76	Tidy up and fix formatting and imports	2017-04-15 13:05:15 +02:00
ines	fefe6684cd	Fix symlink function to check for Windows	2017-04-15 12:17:27 +02:00
ines	35fb4febe2	Fix whitespace	2017-04-15 12:13:45 +02:00
ines	e1efd589c3	Fix json imports and use ujson	2017-04-15 12:13:34 +02:00
ines	958b12dec8	Use pathlib instead of os.path	2017-04-15 12:13:00 +02:00
ines	956dc36785	Move functions to deprecated	2017-04-15 12:12:31 +02:00
ines	c05ec4b89a	Add compat functions and remove old workarounds Add ensure_path util function to handle checking instance of path	2017-04-15 12:11:16 +02:00
ines	26445ee304	Add compat module for Python2/3 and platform compatibility	2017-04-15 12:07:02 +02:00
ines	d24589aa72	Clean up imports, unused code, whitespace, docstrings	2017-04-15 12:05:47 +02:00
ines	561f2a3eb4	Use consistent formatting for docstrings	2017-04-15 11:59:21 +02:00
Matthew Honnibal	d13f0a7017	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-04-14 23:54:57 +02:00
Matthew Honnibal	354458484c	WIP on add_label bug during NER training Currently when a new label is introduced to NER during training, it causes the labels to be read in in an unexpected order. This invalidates the model.	2017-04-14 23:52:17 +02:00
Matthew Honnibal	33ba5066eb	Refactor Language.end_training, making new save_to_directory method	2017-04-14 23:51:24 +02:00
ines	84341c2975	Only compile list of models if data_path exists	2017-04-14 16:48:02 +02:00
Gyorgy Orosz	dd3244c08a	Made json dump to produce unicode strings in py2	2017-04-13 23:30:47 +02:00
Gyorgy Orosz	a9469c8173	Fixed typo	2017-04-13 15:24:14 +02:00
ines	41037f0f07	Remove unused imports	2017-04-13 13:52:11 +02:00
ines	1b92c8d5d5	Use unicode paths on Windows/Python 2 and catch other errors (resolves #970) try/except here is quite dirty, but it'll at least make sure users see an error message that explains what's going on	2017-04-10 17:49:51 +02:00
Matthew Honnibal	49e2de900e	Add costs property to StepwiseState, to show which moves are gold.	2017-04-10 11:37:04 +02:00
Matthew Honnibal	e26577b202	Increment version	2017-04-07 18:45:06 +02:00
Matthew Honnibal	40bf7ecf27	Increment version	2017-04-07 18:44:20 +02:00
Matthew Honnibal	1dca7eeb03	Add unicode declaration on new regression test	2017-04-07 18:09:23 +02:00
ines	887827fc6a	Merge branch 'develop'	2017-04-07 17:36:23 +02:00
ines	444dd511c5	Fix xpassing URL test case	2017-04-07 17:36:05 +02:00
ines	bf0f15e762	Add / to tokenizer infixes (resolves #891)	2017-04-07 17:30:44 +02:00
ines	00b9011a49	Fix whitespace	2017-04-07 17:29:59 +02:00
ines	f9869e4dc5	Merge branch 'master' into develop	2017-04-07 17:23:40 +02:00
Matthew Honnibal	4a6204dbad	Merge remote-tracking branch 'origin/develop'	2017-04-07 17:20:09 +02:00
Matthew Honnibal	0513c43bf0	Merge branch 'master' of https://github.com/explosion/spaCy	2017-04-07 17:07:10 +02:00
Matthew Honnibal	cc36c308f4	Fix noun_chunk rules around coordination Closes #693.	2017-04-07 17:06:40 +02:00
Matthew Honnibal	ab846256cf	Merge pull request #966 from recognai/master Prepare Spanish language for training models, including configuration, rich-UD tag map and tests	2017-04-07 16:12:29 +02:00
Matthew Honnibal	83dca920d4	Rename test #913 -> #957, comment Make test for #957 reference correct bug. Add comment. Previous commit closes #957.	2017-04-07 15:54:25 +02:00
Matthew Honnibal	be204ed714	Merge branch 'master' of https://github.com/explosion/spaCy	2017-04-07 15:50:14 +02:00
Matthew Honnibal	e7b1ee9efd	Switch to regex module for URL identification The URL detection regex was failing on input such as 0.1.2.3, as this input triggered excessive back-tracking in the builtin re module. The solution was to switch to the regex module, which behaves better. Closes #913.	2017-04-07 15:47:36 +02:00
Matthew Honnibal	5887383fc0	Add test for Issue #913: Hang from bad regex	2017-04-07 15:47:27 +02:00
ines	7ea1673072	Fix whitespace	2017-04-07 13:28:48 +02:00
ines	255650dbc2	Add connlu2json converter from explosion/spacy-dev-resources/#11	2017-04-07 13:05:12 +02:00
ines	789ce8a45e	Add convert command	2017-04-07 13:04:17 +02:00
ines	9952d3b08a	Fix whitespace	2017-04-07 13:02:05 +02:00
ines	47ddce6eb7	Remove unused variable	2017-04-07 13:01:48 +02:00
ines	dcf8ab0c47	Merge branch 'develop'	2017-04-07 12:00:09 +02:00
ines	75f9b4c6e2	Fix whitespace	2017-04-07 10:22:18 +02:00
oeg	c693d40791	feature(model): Add support for creating the Spanish model, including rich tagset, configuration, and basich tests	2017-04-06 18:48:45 +02:00
oeg	010293fb2f	fix(typo): Fixes typo in method calling PseudoProjectivity.deprojectivize, failing with new train cli	2017-04-06 17:33:15 +02:00
ines	808cd6cf7f	Add missing tags to verbs (resolves #948)	2017-04-03 18:12:52 +02:00
ines	ad8bf1829f	Import and combine Portuguese tokenizer exceptions (see #943)	2017-04-01 10:37:42 +02:00
Ines Montani	f8b2d9c3b7	Merge pull request #943 from mamoit/master Portuguese improvements	2017-04-01 10:32:00 +02:00
ines	3b667a24d4	Remove whitespace	2017-04-01 10:21:08 +02:00
ines	e71a1f4bd0	Fix download commands in error messages (see #946)	2017-04-01 10:20:57 +02:00
ines	42382d5692	Fix download commands in error messages (see #946)	2017-04-01 10:19:32 +02:00
ines	d4a59c254b	Remove whitespace	2017-04-01 10:19:01 +02:00
Matthew Honnibal	51882ee2b8	Fix check for setting ent_id in merge	2017-03-31 19:32:01 +02:00
Miguel Almeida	4fde64c4ea	Portuguese contractions and some abreviations	2017-03-31 15:52:55 +01:00
Miguel Almeida	465b240bcb	Review Portuguese stop words Mainly to review typos and add missing masculines/feminines	2017-03-31 13:00:47 +01:00
Matthew Honnibal	fc3900e5b2	Allow ent_id to be set in Token	2017-03-31 14:00:14 +02:00
Matthew Honnibal	9720103428	Improve attribute handlign in doc.merge(). Still unsatisfying	2017-03-31 13:59:58 +02:00
Matthew Honnibal	cfff4e0f61	Improve test	2017-03-31 13:59:32 +02:00
Matthew Honnibal	1bb7b4ca71	Add comment	2017-03-31 13:59:19 +02:00
Matthew Honnibal	725249c59a	Add merge_phrase callback in matcher.pyx	2017-03-31 13:58:59 +02:00
Matthew Honnibal	e854f28304	Add test for Issue #758 Issue #758 occurs when no actions are available for a single token doc after merging.	2017-03-31 13:26:25 +02:00
Miguel Almeida	c1d020b0a6	Remove "ista" from portuguese stop words	2017-03-31 12:26:13 +01:00
Miguel Almeida	17a1e7a119	Add Portuguese numbers and ordinals	2017-03-31 12:21:01 +01:00
Matthew Honnibal	47a3ef06a6	Unhack deprojetivization, moving it into pipeline Previously the deprojectivize() call was attached to the transition system, and only called for German. Instead it should be a separate process, called after the parser. This makes it available for any language. Closes #898.	2017-03-31 12:31:50 +02:00
Joshua Reeter	564daf6dec	Issue #934 symlink should not convert paths as_posix under windows.	2017-03-30 23:47:45 -05:00
Bruno P. Kinoshita	c2d48974bc	Fix typos in Portuguese stop words	2017-03-30 21:59:18 +13:00
Matthew Honnibal	0fefdfcbda	Merge pull request #935 from ericzhao28/master Add option to use label=ent_type in doc.merge arguments (Bug fix for issue #862)	2017-03-30 02:51:24 +02:00
ines	4759fd437d	Merge branch 'master' into develop	2017-03-29 10:37:13 +02:00
ines	7e4befec88	Add Hebrew to init and setup.py	2017-03-29 10:34:57 +02:00
Grégory Howard	9c2996b27f	correction of package.py (encoding on open instead of write)	2017-03-29 09:11:02 +02:00
Eric Zhao	aafdf6ffb8	Add option to use label karg to determine ent_type in doc.merge	2017-03-28 23:35:03 -07:00
ines	7198cf1c8a	Remove unused import	2017-03-26 20:56:05 +02:00
ines	7ceaa1614b	Add experimental model init command	2017-03-26 20:51:40 +02:00
Matthew Honnibal	83ba6c247c	Fix init of Language without model	2017-03-26 16:46:00 +02:00
Matthew Honnibal	fa107f95f6	Remove unused train_config command	2017-03-26 09:28:59 -05:00
Matthew Honnibal	df83921f0a	Increment version	2017-03-26 09:27:32 -05:00
Matthew Honnibal	92ac3af21d	Merge branch 'master' of https://github.com/explosion/spaCy	2017-03-26 09:26:59 -05:00
Matthew Honnibal	a9b1f23c7d	Enable regression loss for parser	2017-03-26 09:26:30 -05:00
ines	c00d997924	Merge branch 'develop'	2017-03-26 15:57:00 +02:00
Matthew Honnibal	2efdbc08ff	Make training work with directories	2017-03-26 08:46:44 -05:00
ines	007a2492bd	Remove train_config command for now	2017-03-26 15:40:50 +02:00
ines	b297fab062	Update error message for missing commands	2017-03-26 15:40:02 +02:00
ines	7f95023fc0	Fix formatting	2017-03-26 15:37:37 +02:00
ines	5901c8f7f0	Update spacy train CLI documentation	2017-03-26 15:33:48 +02:00
Matthew Honnibal	9dcb58aaaf	Merge CLI changes	2017-03-26 07:30:45 -05:00
Matthew Honnibal	6b7f7a2060	Connect parser L1 option to train CLI	2017-03-26 07:24:07 -05:00
Matthew Honnibal	ed2b106f4d	Fix circular import in lemmatizer	2017-03-26 07:17:07 -05:00
Matthew Honnibal	dec5571bf3	Update train CLI	2017-03-26 07:16:52 -05:00
ines	53cf2f1c0e	Make dev data optional	2017-03-26 11:48:17 +02:00
Matthew Honnibal	5eac089fbe	Merge branch 'master' into develop	2017-03-26 04:45:43 -05:00
ines	0fc56e2544	Update flag and defaults	2017-03-26 11:42:11 +02:00
Matthew Honnibal	2f63806ddb	Update config when adding label. Re #910	2017-03-25 22:35:44 +01:00
Matthew Honnibal	b94286de30	Fix regression test	2017-03-25 22:35:07 +01:00
Matthew Honnibal	c748907a66	Fix errors in previous commit	2017-03-25 22:25:01 +01:00
Matthew Honnibal	4f400fa486	Prevent lemmatization of base nouns Update lemmatizer's base-form check, for change in morphology class. Closes #903.	2017-03-25 21:51:12 +01:00
Matthew Honnibal	850d35dcb3	Make morphology use int attributes internally The morphology class was calling the lemmatizer inconsistently, which some string-valued attributes. This caused Issue #903.	2017-03-25 21:49:10 +01:00
Matthew Honnibal	4454c1b23f	Block lemmatization of base-form adjectives Fixes check that an adjective is a base form (as opposed to a comparative or superlative), so that it's not lemmatized. e.g. inner -!> inn. Closes #912.	2017-03-25 21:29:57 +01:00
ines	97814f8da6	Update Windows Python 2 link workaround to use helper functions	2017-03-25 14:04:27 +01:00
ines	fdec758113	Add is_windows and is_python2 utility functions	2017-03-25 14:04:02 +01:00
Ines Montani	09837158e4	Merge pull request #921 from solresol/master Possible solution to #909	2017-03-25 13:51:55 +01:00
Greg Baker	b7f714b498	Possible solution to #909	2017-03-25 21:36:38 +11:00
Ines Montani	97cb4d5e3c	Merge branch 'master' into master	2017-03-25 10:03:47 +01:00
Iddo Berger	da135bd823	add hebrew tokenizer	2017-03-24 18:27:44 +03:00
Matthew Honnibal	f40fbc3710	Add test for Issue #910: Resuming entity training	2017-03-23 23:38:57 +01:00
Matthew Honnibal	9c9cd99144	Merge branch 'master' of https://github.com/explosion/spaCy	2017-03-23 11:11:24 +01:00
ines	0035fd9efe	Add spacy train work in progress	2017-03-23 11:08:41 +01:00
ines	d5ebf583a4	Fix formatting	2017-03-23 11:08:30 +01:00
ines	3f20efe165	Merge branch 'develop' # Conflicts: # spacy/util.py	2017-03-22 17:14:15 +01:00
Ines Montani	f86a3a92d5	Merge pull request #899 from raphael0202/duplicate_keys Remove duplicate keys in [en|fi] language data dicts	2017-03-22 10:20:11 +01:00
Ines Montani	87a2c85e1b	Merge pull request #900 from raphael0202/unused_imports Remove unused import statements	2017-03-22 10:10:43 +01:00
ines	ce065e5d65	Fix imports	2017-03-22 10:02:14 +01:00
Andrew Poliakov	07199c3e8b	Fix infinite recursion in spacy.info	2017-03-22 11:43:22 +03:00
Raphaël Bournhonesque	f332bf05be	Remove unused import statements	2017-03-21 21:08:54 +01:00
ines	c3a9f73896	Fix writing to file	2017-03-21 12:35:22 +01:00
ines	d74aa428ad	Fix path	2017-03-21 12:26:00 +01:00
ines	83a999ea83	Change default license from MIT to CC	2017-03-21 12:24:43 +01:00
ines	ae46647560	Fix brackets	2017-03-21 12:21:42 +01:00
ines	3e134b5b2b	Make sure paths in copytree and rmtree are strings	2017-03-21 12:15:33 +01:00
ines	cf0094187e	Fetch MANIFEST.in from GitHub as well	2017-03-21 11:32:38 +01:00
ines	09b24bc5a9	Add docs for package command	2017-03-21 11:19:21 +01:00
ines	3f4e3fda1d	Update command and fetch file templates from GitHub While feature is still experimental, this allows files to be modified without having to ship a new version of spaCy.	2017-03-21 11:17:36 +01:00
ines	5230ed5b98	Move directory check and overwriting/creating dirs to own function	2017-03-21 02:06:53 +01:00
ines	46bc3c36b0	Fix typo	2017-03-21 02:06:37 +01:00
ines	64e38f304e	Only import shutil	2017-03-21 02:06:29 +01:00
ines	448a916d0d	Add --force option to override directory	2017-03-21 02:05:34 +01:00
ines	8eb9a2b355	Fix formatting	2017-03-21 02:05:14 +01:00
ines	b2bcdec0f6	Update docstring	2017-03-20 22:50:55 +01:00
ines	bf240132d7	Add cli.package command to build model packages	2017-03-20 22:50:13 +01:00
ines	a54e3c2efe	Remove empty line	2017-03-20 22:49:36 +01:00
ines	5aea327a5b	Add util function to get raw user input	2017-03-20 22:48:56 +01:00
ines	a6c0361803	Handle raw_input vs input in Python 2 and 3	2017-03-20 22:48:32 +01:00
ines	adbcac6591	Fix spacing	2017-03-20 22:48:21 +01:00
Matthew Honnibal	692eb0603d	Fix high memory usage in download command Due to PyPi issue #2984, installing large packages via pip causes a large spike in memory usage. The recommended fix is to disable caching.	2017-03-20 18:24:44 +01:00
ines	f830213c4c	Remove compatibility check test Will only cause problems when incrementing version and not updating table. Also depends on external URL, which is bad.	2017-03-20 13:20:26 +01:00
Matthew Honnibal	f314d3d044	Increment version	2017-03-20 12:58:24 +01:00
Matthew Honnibal	b487b8735a	Decrease beam density, and fix Python 3 problem in beam	2017-03-20 12:56:05 +01:00
Ines Montani	b6ee241e26	Fix print statements	2017-03-20 11:46:37 +01:00
ines	b8f8d5d8bf	Make sure model_path is a Posix path Otherwise, formatting the success message with model_path.as_posix() fails when using a local path for linking (linking still works, but the error message is confusing)	2017-03-19 11:57:13 +01:00
ines	fe0ff00fe1	Fix spacing	2017-03-19 11:55:37 +01:00
ines	5712da6095	Add regression test for #891	2017-03-19 11:48:01 +01:00
Raphaël Bournhonesque	7f579ae834	Remove duplicate keys in [en|fi] data dicts	2017-03-19 11:40:29 +01:00
ines	8de5108af6	Exclude common cache directories from mode list in cli.info This means models called "cache" etc. won't show up in the list, but it seems worth it.	2017-03-19 01:44:43 +01:00
Matthew Honnibal	6ee2ea1128	Increment version	2017-03-19 01:40:52 +01:00
Matthew Honnibal	797f286c38	Use import to find data package	2017-03-19 01:39:36 +01:00
Matthew Honnibal	5941fb9e92	Make spacy/data a package	2017-03-18 20:04:22 +01:00
Matthew Honnibal	bc10d06bc2	Merge branch 'master' of https://github.com/explosion/spaCy	2017-03-18 19:32:54 +01:00
Matthew Honnibal	583628c350	Import metadata into __init__	2017-03-18 19:30:03 +01:00
Matthew Honnibal	1754e0db9b	Call pip via subprocess, to make it use virtualenv	2017-03-18 19:29:36 +01:00
ines	1277abcde2	Remove print statement	2017-03-18 19:14:58 +01:00
Matthew Honnibal	dcec104643	Remove unused import	2017-03-18 18:57:45 +01:00
Matthew Honnibal	703eb7bdbd	Fix link module	2017-03-18 18:57:31 +01:00
Matthew Honnibal	f6c6c89546	Add empty data directory	2017-03-18 18:32:29 +01:00
ines	7d33104180	Use distutils.sysconfig.get_python_lib site.getsitepackages seems to not work as expected in Python 2	2017-03-18 18:20:40 +01:00
Matthew Honnibal	1a53fcc685	Fix CLI for Python 2	2017-03-18 18:14:03 +01:00
ines	aefb898e37	Add title-case version of morph rules (resolves #686)	2017-03-18 17:27:11 +01:00
ines	64ec17abc1	Pass xpassing tests and add xfails for failures	2017-03-18 17:20:46 +01:00
ines	d0b85faf69	Pass regression test for #401 (resolves #401) Fixed in new English models.	2017-03-18 17:06:49 +01:00
ines	be9daefbdd	Remove actual model downloading from tests	2017-03-18 17:01:10 +01:00
ines	850650221a	Use correct command in deprecated download command message	2017-03-18 17:01:01 +01:00
ines	0dd7710556	Make sure paths are paths	2017-03-18 16:48:52 +01:00
Matthew Honnibal	de0e6385b4	Merge branch 'master' of https://github.com/explosion/spaCy	2017-03-18 16:17:28 +01:00
Matthew Honnibal	fe442cac53	Fix #717: Set correct lemma for contracted verbs	2017-03-18 16:16:10 +01:00
ines	ad934a9abd	Add regression test for #693	2017-03-18 16:12:30 +01:00
ines	f57c616830	Add regression test for #704 and test new model (resolves #704) (using new English model)	2017-03-18 16:04:14 +01:00
Matthew Honnibal	413138de79	Fix #719: Lemmatizer can no longer output empty string	2017-03-18 16:02:06 +01:00
ines	ab1451f997	Don't mark compatibility test as slow	2017-03-18 15:17:39 +01:00
ines	ec3e810662	Add directory cli and set up command line interface	2017-03-18 15:14:48 +01:00
ines	cd94ea1095	Use info module for spacy.info()	2017-03-18 13:01:26 +01:00
ines	e3e25c0a33	Add spacy.info module Print info about spaCy installation, local setup and models. Allow export in Markdown format to copy-paste into GitHub issues.	2017-03-18 13:01:16 +01:00
ines	0eafc0f2c6	Add util functions to print data as table or markdown list	2017-03-18 13:00:14 +01:00
ines	6b9b444065	Fix imports	2017-03-18 12:59:41 +01:00
ines	a035ebd32a	Use pathlib.Path instead of os.path	2017-03-18 12:59:21 +01:00
ines	9605cf39cc	Handle default path in Language classes	2017-03-18 12:58:45 +01:00
Matthew Honnibal	ac4b88cce9	Fix auto-linking in download command	2017-03-17 21:36:13 +01:00
ines	8a34c3e666	Fix shortcut name	2017-03-17 20:07:34 +01:00
Matthew Honnibal	6420f86f02	Merge changes to __init__.py	2017-03-17 19:51:45 +01:00
ines	e01fbacf81	Update resolve_model_name	2017-03-17 19:26:28 +01:00
ines	aedefef49d	Add function to resolve model names and link them	2017-03-17 18:47:05 +01:00
Matthew Honnibal	d013aba7b5	Merge branch 'master' of https://github.com/explosion/spaCy	2017-03-17 18:30:53 +01:00
Matthew Honnibal	854cfce7cf	Make vocabs more compatible across versions Previously, symbols were inserted into the string-store before strings were loaded. This meant that adding a symbol would invalidate saved models. We now make sure that strings are loaded faithfully, so that compatibility is maintained.	2017-03-17 18:29:04 +01:00
Matthew Honnibal	1cc841e600	Merge branch 'master' of https://github.com/explosion/spaCy	2017-03-17 08:18:11 -05:00
Matthew Honnibal	4bfc55b532	Auto-add words to vocab when loading vectors When calling vocab.load_vectors_from_bin_loc, ensure that missing entries are added to the vocab. Otherwise, loading vectors into an empty vocab object resulted in no vectors being added.	2017-03-17 08:15:59 -05:00
ines	0e533ad0cc	Mark compatibility table test as slow (temporary) Prevent Travis from running test until models repo is published	2017-03-17 13:11:36 +01:00
ines	279b1d1965	Update version	2017-03-17 12:43:08 +01:00
ines	8af4b9e4df	Fix compatibility.json link	2017-03-17 12:43:03 +01:00
Matthew Honnibal	a630726b13	Fix typo in tests	2017-03-16 20:50:36 -05:00
Matthew Honnibal	f98b30583f	Fix tests	2017-03-16 19:48:00 -05:00
Matthew Honnibal	db51abf685	Fix tests	2017-03-16 18:53:47 -05:00
Matthew Honnibal	adb0b7e43b	Fix loading when no package found	2017-03-16 18:30:23 -05:00
Matthew Honnibal	5c66cffafd	Add tag map for Spanish	2017-03-16 18:05:15 -05:00
Matthew Honnibal	c4351e1165	Update base-form check in lemmatizer, for UD 2.0 morphology	2017-03-16 17:59:31 -05:00
Matthew Honnibal	1e10383e1b	Merge branch 'master' of https://github.com/explosion/spaCy	2017-03-16 17:41:13 -05:00
Matthew Honnibal	859315863a	Merge branch 'master' of https://github.com/explosion/spaCy	2017-03-16 17:40:07 -05:00
Matthew Honnibal	fea9fe08af	Merge pull request #866 from juanmirocks/master Fix lemmatization of OOV words	2017-03-16 23:37:36 +01:00
Matthew Honnibal	ffd4a19383	Increment version	2017-03-16 17:35:57 -05:00
Matthew Honnibal	28bb546939	Merge pull request #883 from ericzhao28/master Add `lower_` and `upper_` properties to `Span` class	2017-03-16 23:35:47 +01:00
ines	fd60961825	Fix spacing	2017-03-16 23:23:26 +01:00
Matthew Honnibal	890747d8ff	Fix trailing whitespace on morphology features	2017-03-16 17:07:37 -05:00
Matthew Honnibal	af41a9790c	Merge remote-tracking branch 'origin/develop-downloads'	2017-03-16 20:41:37 +01:00
Matthew Honnibal	303a56f173	Get absolute path for linking	2017-03-16 20:41:23 +01:00
ines	3d484c3faf	Don't print in parse_package_meta and accept on_error callback instead TODO: log warning for missing meta data in spacy.link, as this affects the Language class returned by spacy.load()	2017-03-16 20:34:50 +01:00
ines	d8c984b65e	Don't exit if no model meta data is present	2017-03-16 20:33:33 +01:00
Matthew Honnibal	2524efc0ac	Merge remote-tracking branch 'origin/develop-downloads'	2017-03-16 20:20:41 +01:00
ines	8253581057	Link model automatically if not direct download	2017-03-16 19:54:51 +01:00
Matthew Honnibal	8843b84bd1	Merge remote-tracking branch 'origin/develop-downloads'	2017-03-16 12:00:42 -05:00
Matthew Honnibal	55f813bfbb	Don't reapply the model during training	2017-03-16 11:59:43 -05:00
Matthew Honnibal	c90dc7ac29	Clean up state initialisation in transition system	2017-03-16 11:59:11 -05:00
Matthew Honnibal	a46933a8fe	Clean up FTRL parsing stuff.	2017-03-16 11:58:20 -05:00
ines	618ce3b425	Add .meta to Language object Allows getting the current model's meta data, e.g.: nlp = spacy.load('my-model') print(nlp.meta)	2017-03-16 17:14:56 +01:00
ines	e348d4434c	Add spacy.info(model_name) to show model meta Allows "previewing" model before loading and making sure it's linked correctly.	2017-03-16 17:13:40 +01:00
ines	eea3b35e3f	Update model loading to support links Remove match_best_version check, fetch model language from meta instead of directory name, and don't make too many assumptions – if model is downloaded via downloader, version should match anyway. (Otherwise, users should be free to add and load whichever models they want.)	2017-03-16 17:13:08 +01:00
ines	5f3f04bd0a	Add util function to load and parse package meta.json	2017-03-16 17:10:05 +01:00
ines	7f920c2f75	Don't break text when rendering print_msg	2017-03-16 17:09:50 +01:00
ines	16a63d9676	Add docstring	2017-03-16 17:09:11 +01:00
ines	68c04fa897	Move sys_exit() function to util	2017-03-16 17:08:58 +01:00
ines	ccd1a79988	Add spacy.link module to link model directories to shortcuts	2017-03-16 17:01:51 +01:00
Matthew Honnibal	2611ac2a89	Fix scorer bug for NER, related to ambiguity between missing annotations and misaligned tokens	2017-03-16 09:38:28 -05:00
ines	595d89698a	Add basestring	2017-03-16 10:01:14 +01:00
ines	7b2eca36e4	Revert "Fix formatting and remove unused code" This reverts commit `d7898d586f`.	2017-03-16 09:58:41 +01:00
ines	2f0db1dd36	Use small English model as default	2017-03-16 09:54:40 +01:00
Matthew Honnibal	3d0833c3df	Fix off-by-1 in parse features fill_context	2017-03-15 19:55:35 -05:00
Matthew Honnibal	4ef68c413f	Approximate cost in Break transition, to speed things up a bit.	2017-03-15 16:40:27 -05:00
Matthew Honnibal	8543db8a5b	Use ftrl optimizer in parser	2017-03-15 11:56:37 -05:00
ines	4cfc8ffbd2	Reformat pickle tests	2017-03-15 17:39:54 +01:00
ines	2a0fcf1354	Add tests for new download module	2017-03-15 17:39:43 +01:00
ines	71956c94db	Handle deprecated language-specific model downloading	2017-03-15 17:37:55 +01:00
ines	58b884b6d4	Refactor download script and about.py to use new download method	2017-03-15 17:37:18 +01:00
ines	f5d1a39a5b	Add util functions for printing and wrapping messages	2017-03-15 17:35:57 +01:00
ines	d7898d586f	Fix formatting and remove unused code	2017-03-15 17:35:41 +01:00
ines	b672e95045	Fix formatting	2017-03-15 17:35:04 +01:00
ines	0474e706a0	Remove unused deprecated functions for sputnik	2017-03-15 17:34:54 +01:00
ines	b13e7f79b4	Fix formatting and remove unused imports	2017-03-15 17:33:57 +01:00
ines	1101fd3855	Fix formatting and remove unused imports	2017-03-15 17:33:39 +01:00
ines	842782c128	Move fix_deprecated_glove_vectors_loading to deprecated.py	2017-03-15 17:33:29 +01:00
Matthew Honnibal	4cab8ac136	Update morph exceptions test	2017-03-15 09:31:34 -05:00
Matthew Honnibal	d719f8e77e	Use nogil in parser, and set L1 to 0.0 by default	2017-03-15 09:31:01 -05:00
Matthew Honnibal	c61c501406	Update beam-parser to allow parser to maintain nogil	2017-03-15 09:30:22 -05:00
Matthew Honnibal	3d4e389d23	Whitespace	2017-03-15 09:29:42 -05:00
Matthew Honnibal	7769bc31e3	Add beam-search classes	2017-03-15 09:27:41 -05:00
Matthew Honnibal	c79b3129e3	Fix setting of empty lexeme in initial parse state	2017-03-15 09:26:53 -05:00
Matthew Honnibal	d864708072	Add more morphology names in attrs.pyx	2017-03-15 09:26:16 -05:00
Matthew Honnibal	b382dc902c	Add morph rules in Language	2017-03-15 09:24:40 -05:00
Matthew Honnibal	8dbff4f5f4	Wire up English lemma and morph rules.	2017-03-15 09:23:22 -05:00
Matthew Honnibal	f70be44746	Use lemmatizer in code, not from downloaded model.	2017-03-15 04:52:50 -05:00
ines	42ba740dde	Revert "Merge branch 'debug'" This reverts commit `89b79d1178`, reversing changes made to `02bdf490a1`.	2017-03-13 20:11:52 +01:00
ines	4c5f51e49e	Update regression test	2017-03-13 15:16:11 +01:00
ines	02bdf490a1	Remove regression test to see if it caused pytest Travis error	2017-03-13 13:00:22 +01:00
ines	17018750ac	Add regression test for #717	2017-03-13 12:58:22 +01:00
ines	2883ebfca2	Remove print statement	2017-03-13 12:30:42 +01:00
ines	98c13d8aa9	Add regression test for #401	2017-03-13 12:28:41 +01:00
ines	444d665f9d	Add regression test for #686	2017-03-13 12:23:35 +01:00
ines	46b17e5b51	Add regression test for #719	2017-03-13 12:17:35 +01:00
ines	c8ae682ff9	Add regression test for #636	2017-03-13 12:08:31 +01:00
ines	337f9601f2	Add missing unicode declaration	2017-03-13 12:08:19 +01:00
ines	d70386ec6e	Update docstring in #886 regression test	2017-03-13 12:00:38 +01:00
ines	51ba3ef0a8	Add regression test for #886	2017-03-13 11:44:58 +01:00
ines	eec3f21c50	Add WordNet license	2017-03-12 13:58:24 +01:00
ines	f9e603903b	Rename stop_words.py to word_sets.py and include more sets NUM_WORDS and ORDINAL_WORDS are currently not used, but the hard-coded list should be removed from orth.pyx and replaced to use language-specific functions. This will later allow other languages to use their own functions to set those flags. (In English, this is easier because it only needs to be checked against a set – in German for example, this requires a more complex function, as most number words are one word.)	2017-03-12 13:58:22 +01:00
ines	f24f9b4b7b	Remove unused code	2017-03-12 13:58:22 +01:00
ines	1da29a7146	Use new Lemmatizer data and remove file import Since there's currently only an English lemmatizer, the global Lemmatizer imports from spacy.en. This is unideal and still needs to be fixed.	2017-03-12 13:58:22 +01:00
ines	0957737ee8	Add Python-formatted lemmatizer data and rules	2017-03-12 13:58:22 +01:00
ines	c89e30d1a3	Add test for English time exceptions ("1a.m." etc.)	2017-03-12 13:58:22 +01:00
ines	ce9568af84	Move English time exceptions ("1a.m." etc.) and refactor	2017-03-12 13:58:22 +01:00
ines	6b30541774	Fix formatting	2017-03-12 13:58:22 +01:00
Ines Montani	e97a30b99a	Merge pull request #885 from PySUST/master [Bengali] Spell checked and add new stop words	2017-03-12 13:20:59 +01:00
ines	66c1f194f9	Use consistent unicode declarations	2017-03-12 13:07:28 +01:00
shuvanon	91cb4cdb2b	Sort stop_words	2017-03-12 17:55:51 +06:00
shuvanon	784f6cfa49	Update stop_words	2017-03-12 17:41:01 +06:00
shuvanon	73cc17078e	Merge branch 'master' of https://github.com/PySUST/spaCy	2017-03-12 14:52:17 +06:00
shuvanon	35ec7135bb	Spell checked and add new stop words	2017-03-12 14:51:34 +06:00
Em	9c809efc25	Removed mapStr	2017-03-11 16:23:26 -08:00
Matthew Honnibal	fa23278ee3	Add classes for beam parser and beam NER	2017-03-11 12:45:37 -06:00
Matthew Honnibal	6c4108c073	Add header for beam parser	2017-03-11 12:45:12 -06:00
Matthew Honnibal	4382f175b3	Squelch compiler warnings	2017-03-11 12:44:43 -06:00
Matthew Honnibal	ea2592879f	Merge branch 'master' of https://github.com/explosion/spaCy	2017-03-11 11:13:37 -06:00
Matthew Honnibal	1224c4d3c6	Improve output on trainer	2017-03-11 11:12:48 -06:00
Matthew Honnibal	b438dfd3f3	Add itn argument to tagger.update	2017-03-11 11:12:21 -06:00
Matthew Honnibal	931feb3360	Allow beam parsing for NER	2017-03-11 11:12:01 -06:00
Matthew Honnibal	f77a5bb60a	Switch back to greedy parser	2017-03-11 11:11:30 -06:00
Matthew Honnibal	ca9c8c57c0	Add iteration argument to parser.update	2017-03-11 07:00:47 -06:00
Matthew Honnibal	dcce9ca3f3	Use beam parser	2017-03-11 07:00:20 -06:00
Matthew Honnibal	e30ffdd003	Use ftrl optimizer in tagger	2017-03-11 06:59:13 -06:00
Matthew Honnibal	d59c6926c1	I think this fixes the segfault	2017-03-11 06:58:34 -06:00
Matthew Honnibal	318b9e32ff	WIP on beam parser. Currently segfaults.	2017-03-11 06:19:52 -06:00
Em	426d17167f	Added string manipulation for spans	2017-03-10 16:50:02 -08:00
Matthew Honnibal	b0d80dc9ae	Update name of 'train' function in BeamParser	2017-03-10 14:35:43 -06:00
Matthew Honnibal	d11f1a4ddf	Record negative costs in non-monotonic arc eager oracle	2017-03-10 11:22:04 -06:00
Matthew Honnibal	ecf91a2dbb	Support beam parser	2017-03-10 11:21:21 -06:00
Ines Montani	a16aff17aa	Merge pull request #876 from PySUST/master [Bangla] Update "tokenizer_exceptions.py"	2017-03-10 14:46:00 +01:00
ines	10e29189ac	Adjust URL testcases and xfail problems (instead of comment)	2017-03-10 14:22:50 +01:00
ines	b04893a059	Make regex locale-independent for Python 2	2017-03-10 14:21:57 +01:00
Matthew Honnibal	ea53647362	Merge branch 'develop'	2017-03-10 02:49:39 -06:00
Ines Montani	1c40890321	Add missing comma Should fix Travis build error	2017-03-10 09:34:54 +01:00
Shuvanon Razik	c251703428	Update abbreviations	2017-03-10 10:45:01 +06:00
Matthew Honnibal	b5247c49eb	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-03-09 18:45:43 -06:00
Matthew Honnibal	798450136d	Set L1 penalty to 0 in tagger.	2017-03-09 18:43:47 -06:00
Matthew Honnibal	c62da02344	Use ftrl training, to learn compressed model.	2017-03-09 18:43:21 -06:00
Matthew Honnibal	f71eeef9bb	Pass path argument to end_training	2017-03-09 18:42:40 -06:00
Dan Rapp	123d3f2d38	Fix error in test case parameterization	2017-03-09 12:18:21 -07:00
Dan Rapp	b9307dfcd7	Merge branch 'master' into rappdw/tokenizer_exceptions_url_fix	2017-03-09 11:42:14 -07:00
Dan Rapp	3b1df3808d	Issue #840 - URL pattern too broad	2017-03-09 11:39:39 -07:00
Matthew Honnibal	5b0b968d13	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-03-08 15:03:10 +01:00
Matthew Honnibal	0ac3d27689	Fix handling of trailing whitespace Fix off-by-one error that meant trailing spaces were being dropped. Closes #792	2017-03-08 15:01:40 +01:00
ines	c2e3e651b8	Re-add regression test for #859	2017-03-08 14:36:09 +01:00
Matthew Honnibal	0a6d7ca200	Fix spacing after token_match The boolean flag indicating a space after the token was being set incorrectly after the token_match regex was applied. Fixes #859.	2017-03-08 14:33:32 +01:00
shuvanon	85438aee1b	update tokenizer	2017-03-08 17:29:39 +06:00
shuvanon	45bc78461c	update tokenizer	2017-03-08 17:27:12 +06:00
Matthew Honnibal	cd33b39a04	Fix 2/3 problem for json save/load	2017-03-08 01:39:13 +01:00
Matthew Honnibal	40703988bc	Use FTRL training in parser	2017-03-08 01:38:51 +01:00
Matthew Honnibal	d108534dc2	Fix 2/3 problems for training	2017-03-08 01:37:52 +01:00
Matthew Honnibal	d03d6a13f1	Merge branch 'rominf-ud20' into develop	2017-03-07 21:48:56 +01:00
Matthew Honnibal	f7374d0b86	Merge branch 'ud20' of https://github.com/rominf/spaCy into rominf-ud20	2017-03-07 21:48:37 +01:00
Matthew Honnibal	16670d3251	Xfail the vocab pickling for now	2017-03-07 21:43:28 +01:00
Matthew Honnibal	a89c3500f6	Fixes to hacky vocab pickling	2017-03-07 20:58:55 +01:00
Matthew Honnibal	d814892805	Hackish pickle support for Vocab.	2017-03-07 20:25:12 +01:00
Matthew Honnibal	26614e028f	Add hacky support for StringCFile, to make pickling easier.	2017-03-07 20:24:37 +01:00
Matthew Honnibal	3edb8ae207	Whitespace	2017-03-07 17:16:26 +01:00
Matthew Honnibal	5de7e712b7	Add support for pickling StringStore.	2017-03-07 17:15:18 +01:00
Matthew Honnibal	4e75e74247	Update regression test for variable-length pattern problem in the matcher.	2017-03-07 16:08:32 +01:00
Matthew Honnibal	6d67213b80	Add test for 850: Matcher fails on zero-or-more.	2017-03-07 15:55:28 +01:00
Aniruddha Adhikary	696215a3fb	add tests for Bengali	2017-03-05 11:25:12 +06:00
Aniruddha Adhikary	8f3bfe9bfc	[Bengali] basic tag map, morph, lemma rules and exceptions	2017-03-04 12:36:59 +06:00
Roman Inflianskas	66e1109b53	Add support for Universal Dependencies v2.0	2017-03-03 13:17:34 +01:00
ines	8dff040032	Revert "Add regression test for #859" This reverts commit `c4f16c66d1`.	2017-03-01 21:56:20 +01:00
Juan Miguel Cejuela	25c29f072d	apply patch	2017-03-01 21:44:17 +01:00
Juan Miguel Cejuela	a8cfde46d3	#781 Fix test: colocalizes is lemmatized to colocaliz and colocalize	2017-03-01 21:43:08 +01:00
Juan Miguel Cejuela	a471114eb2	#781 add regression test, failing previous bug fix	2017-03-01 21:30:51 +01:00
ines	c4f16c66d1	Add regression test for #859	2017-03-01 16:07:27 +01:00
Aniruddha Adhikary	d91be7aed4	add punctuations for Bengali	2017-02-28 21:07:14 +06:00
Aniruddha Adhikary	5a4fc09576	add basic Bengali support	2017-02-28 07:48:37 +06:00
Matthew Honnibal	cc9b2b74e3	Merge branch 'french-tokenizer-exceptions'	2017-02-27 11:44:39 +01:00
Matthew Honnibal	bd4375a2e6	Remove comment	2017-02-27 11:44:26 +01:00
Matthew Honnibal	e7e22d8be6	Move import within get_exceptions() function, to speed import	2017-02-27 11:34:48 +01:00
Matthew Honnibal	34bcc8706d	Merge branch 'french-tokenizer-exceptions'	2017-02-27 11:21:21 +01:00
Matthew Honnibal	0aaa546435	Fix test after updating the French tokenizer stuff	2017-02-27 11:20:47 +01:00
Matthew Honnibal	26446aa728	Avoid loading all French exceptions on import Move exceptions loading behind a get_tokenizer_exceptions() function for French, instead of loading into the top-level namespace. This cuts import times from 0.6s to 0.2s, at the expense of making the French data a little different from the others (there's no top-level TOKENIZER_EXCEPTIONS variable.) The current solution feels somewhat unsatisfying.	2017-02-25 11:55:00 +01:00
ines	376c5813a7	Remove print statements from test	2017-02-24 18:26:32 +01:00
ines	7c1260e98c	Add regression test	2017-02-24 18:22:49 +01:00
ines	0e2e331b58	Convert exceptions to Python list	2017-02-24 18:22:40 +01:00
ines	51eb190ef4	Remove print statements from test	2017-02-24 17:41:12 +01:00
Matthew Honnibal	db5ada3995	Merge branch 'master' of https://github.com/explosion/spaCy	2017-02-24 14:28:12 +01:00
Matthew Honnibal	8f94897d07	Add 1 operator to matcher, and make sure open patterns are closed at end of document. Closes Issue #766	2017-02-24 14:27:02 +01:00
ines	67991b6e5f	Add more test cases to #775 regression test to cover #847	2017-02-18 14:10:44 +01:00
ines	30ce2a6793	Exclude "shed" and "Shed" from tokenizer exceptions (see #847)	2017-02-18 14:10:44 +01:00
Ines Montani	de997c1a33	Merge pull request #842 from magnusburton/master Added regular verb rules for Swedish	2017-02-17 11:18:20 +01:00
Magnus Burton	41fcfd06b8	Added regular verb rules for Swedish	2017-02-17 10:04:04 +01:00
ines	aa92d4e9b5	Fix unicode regex for Python 2 (see #834)	2017-02-16 23:49:54 +01:00
ines	44de3c7642	Reformat test and use text_file fixture	2017-02-16 23:49:19 +01:00
ines	3dd22e9c88	Mark vectors test as xfail (temporary)	2017-02-16 23:28:51 +01:00
ines	85d249d451	Revert "Revert "Merge pull request #836 from raphael0202/load_vectors (closes #834)"" This reverts commit `ea05f78660`.	2017-02-16 23:26:25 +01:00
ines	ea05f78660	Revert "Merge pull request #836 from raphael0202/load_vectors (closes #834)" This reverts commit `7d8c9eee7f`, reversing changes made to `f6b69babcc`.	2017-02-16 15:27:12 +01:00
Raphaël Bournhonesque	06a71d22df	Fix test failure by using unicode literals	2017-02-16 14:48:00 +01:00
Raphaël Bournhonesque	3ba109622c	Add regression test with non ' ' space character as token	2017-02-16 12:23:27 +01:00
Raphaël Bournhonesque	e17dc2db75	Remove useless import	2017-02-16 12:10:24 +01:00
Raphaël Bournhonesque	3fd2742649	load_vectors should accept arbitrary space characters as word tokens Fix bug #834	2017-02-16 12:08:30 +01:00
ines	f08e180a47	Make groups non-capturing Prevents hitting the 100 named groups limit in Python	2017-02-10 13:35:02 +01:00
ines	fa3b8512da	Use consistent imports and exports Bundle everything in language_data to keep it consistent with other languages and make TOKENIZER_EXCEPTIONS importable from there.	2017-02-10 13:34:09 +01:00
ines	21f09d10d7	Revert "Revert "Merge pull request #818 from raphael0202/tokenizer_exceptions"" This reverts commit `f02a2f9322`.	2017-02-10 13:17:05 +01:00
ines	f02a2f9322	Revert "Merge pull request #818 from raphael0202/tokenizer_exceptions" This reverts commit `b95afdf39c`, reversing changes made to `b0ccf32378`.	2017-02-09 17:07:21 +01:00
Raphaël Bournhonesque	309da78bf0	Merge branch 'master' into tokenizer_exceptions	2017-02-09 16:32:12 +01:00
Raphaël Bournhonesque	4ce0bbc6b6	Update unit tests	2017-02-09 16:30:43 +01:00
Raphaël Bournhonesque	5d706ab95d	Merge tokenizer exceptions from PR #802	2017-02-09 16:30:28 +01:00
ines	654fe447b1	Add Swedish tokenizer tests (see #807)	2017-02-05 11:47:07 +01:00
ines	6715615d55	Add missing EXC variable and combine tokenizer exceptions	2017-02-05 11:42:52 +01:00
Ines Montani	30a52d576b	Merge pull request #807 from magnusburton/master Added swedish lemma rules and more verb contractions	2017-02-05 11:34:19 +01:00
Magnus Burton	19c0ce745a	Added swedish lemma rules	2017-02-04 17:53:32 +01:00
Michael Wallin	d25556bf80	[issue 805] Fix issue	2017-02-04 16:22:21 +02:00
Michael Wallin	35100c8bdd	[issue 805] Add regression test and the required fixture	2017-02-04 16:21:34 +02:00
ines	0ab353b0ca	Add line breaks to Finnish stop words for better readability	2017-02-04 13:40:25 +01:00
Michael Wallin	1a1952afa5	[finnish] Add initial tests for tokenizer	2017-02-04 13:54:10 +02:00
Michael Wallin	f9bb25d1cf	[finnish] Reformat and correct stop words	2017-02-04 13:54:10 +02:00
Michael Wallin	73f66ec570	Add preliminary support for Finnish	2017-02-04 13:54:10 +02:00
Ines Montani	65d6202107	Merge pull request #802 from Tpt/fr-tokenizer Adds more French tokenizer exceptions	2017-02-03 10:52:20 +01:00
Tpt	75a74857bb	Adds more French tokenizer exceptions	2017-02-03 13:45:18 +04:00
Ines Montani	afc6365388	Update regression test for #801 to match current expected behaviour	2017-02-02 16:23:05 +01:00
Ines Montani	012f4820cb	Keep infixes of punctuation + hyphens as one token (see #801)	2017-02-02 16:22:40 +01:00
Ines Montani	1219a5f513	Add = to tokenizer prefixes	2017-02-02 16:21:11 +01:00
Ines Montani	ff04748eb6	Add missing emoticon	2017-02-02 16:21:00 +01:00
Ines Montani	13a4ab37e0	Add regression test for #801	2017-02-02 15:33:52 +01:00
Raphaël Bournhonesque	85f951ca99	Add tokenizer exceptions for French	2017-02-02 08:36:16 +01:00
Matvey Ezhov	32a22291bc	Small `Doc.count_by` documentation update Current example doesn't work	2017-01-31 19:18:45 +03:00
Ines Montani	e4875834fe	Fix formatting	2017-01-31 15:19:33 +01:00
Ines Montani	c304834e45	Add missing import	2017-01-31 15:18:30 +01:00
Ines Montani	e6465b9ca3	Parametrize test cases and mark as xfail	2017-01-31 15:14:42 +01:00
latkins	e4c84321a5	Added regression test for Issue #792.	2017-01-31 13:47:42 +00:00
Matthew Honnibal	6c665b81df	Fix redundant == TAG in from_array conditional	2017-01-31 00:46:21 +11:00
Ines Montani	19501f3340	Add regression test for #775	2017-01-25 13:16:52 +01:00
Ines Montani	209c37bbcf	Exclude "shell" and "Shell" from English tokenizer exceptions (resolves #775)	2017-01-25 13:15:02 +01:00
Raphaël Bournhonesque	1be9c0e724	Add fr tokenization unit tests	2017-01-24 10:57:37 +01:00
Raphaël Bournhonesque	1faaf698ca	Add infixes and abbreviation exceptions (fr)	2017-01-24 10:57:37 +01:00
Raphaël Bournhonesque	cf8474401b	Remove unused import statement	2017-01-24 10:57:37 +01:00
Raphaël Bournhonesque	902f136f18	Add support for elision in French	2017-01-24 10:57:37 +01:00
Ines Montani	55c9c62abc	Use relative import	2017-01-23 21:27:49 +01:00
Ines Montani	0967eb07be	Add regression test for #768	2017-01-23 21:25:46 +01:00
Ines Montani	6baa98f774	Merge pull request #769 from raphael0202/spacy-768 Allow zero-width 'infix' token	2017-01-23 21:24:33 +01:00
Raphaël Bournhonesque	dce8f5515e	Allow zero-width 'infix' token	2017-01-23 18:28:01 +01:00
Ines Montani	5f6f48e734	Add regression test for #759	2017-01-20 15:11:48 +01:00
Ines Montani	09ecc39b4e	Fix multi-line string of NUM_WORDS (resolves #759)	2017-01-20 15:11:48 +01:00
Magnus Burton	69eab727d7	Added loops to handle contractions with verbs	2017-01-19 14:08:52 +01:00
Matthew Honnibal	be26085277	Fix missing import Closes #755	2017-01-19 22:03:52 +11:00
Ines Montani	7e36568d5b	Fix title to accommodate sputnik	2017-01-17 00:51:09 +01:00
Ines Montani	d704cfa60d	Fix typo	2017-01-16 21:30:33 +01:00
Ines Montani	64e142f460	Update about.py	2017-01-16 14:23:08 +01:00
Matthew Honnibal	e889cd698e	Increment version	2017-01-16 14:01:35 +01:00
Matthew Honnibal	e7f8e13cf3	Make Token hashable. Fixes #743	2017-01-16 13:27:57 +01:00
Matthew Honnibal	2c60d0cb1e	Test #743: Tokens unhashable.	2017-01-16 13:27:26 +01:00
Matthew Honnibal	48c712f1c1	Merge branch 'master' of ssh://github.com/explosion/spaCy	2017-01-16 13:18:06 +01:00
Matthew Honnibal	7ccf490c73	Increment version	2017-01-16 13:17:58 +01:00
Ines Montani	50878ef598	Exclude "were" and "Were" from tokenizer exceptions and add regression test (resolves #744)	2017-01-16 13:10:38 +01:00
Ines Montani	e053c7693b	Fix formatting	2017-01-16 13:09:52 +01:00
Ines Montani	116c675c3c	Merge pull request #742 from oroszgy/hu_tokenizer_fix Improved Hungarian tokenizer	2017-01-14 23:52:44 +01:00
Gyorgy Orosz	92345b6a41	Further numeric test.	2017-01-14 22:44:19 +01:00
Gyorgy Orosz	b4df202bfa	Better error handling	2017-01-14 22:24:58 +01:00
Gyorgy Orosz	b03a46792c	Better error handling	2017-01-14 22:09:29 +01:00
Gyorgy Orosz	a45f22913f	Added further abbreviations present in the Szeged corpus	2017-01-14 22:08:55 +01:00
Ines Montani	332ce2d758	Update README.md	2017-01-14 21:12:11 +01:00
Gyorgy Orosz	9505c6a72b	Passing all old tests.	2017-01-14 20:39:21 +01:00
Gyorgy Orosz	63037e79af	Fixed hyphen handling in the Hungarian tokenizer.	2017-01-14 16:30:11 +01:00
Gyorgy Orosz	f77c0284d6	Maintaining compatibility with other spacy tokenizers.	2017-01-14 16:19:15 +01:00
Gyorgy Orosz	be7a7aeb1a	Reversed accidental changes.	2017-01-14 15:59:36 +01:00
Gyorgy Orosz	1be5da1ac6	Fixed Hungarian tokenizer for numbers	2017-01-14 15:51:59 +01:00
Ines Montani	a89e269a5a	Fix test formatting and consistency	2017-01-14 13:41:19 +01:00
Ines Montani	3424e3a7e5	Update README.md	2017-01-13 15:54:54 +01:00
Ines Montani	49186b34a1	Mark lemmatizer tests as models since they use installed data	2017-01-13 15:12:07 +01:00
Ines Montani	138deb80a1	Modernise vector tests, use add_vecs_to_vocab and don't depend on models	2017-01-13 15:12:07 +01:00
Ines Montani	96f0caa28a	Fix test name for consistency	2017-01-13 15:12:07 +01:00
Ines Montani	dc2bb1259f	Add util function to add vectors to vocab	2017-01-13 15:12:07 +01:00
Ines Montani	db9b25663d	Reformat add_docs_equal and add docstring	2017-01-13 15:12:07 +01:00
Ines Montani	62ce0a0073	Add README.md to tests to explain organisation and conventions	2017-01-13 15:11:18 +01:00
Ines Montani	38d60f6b90	Modernise serializer I/O tests and don't depend on models where possible	2017-01-13 02:24:56 +01:00
Ines Montani	4bb5b89ee4	Add text_file_b fixture using BytesIO	2017-01-13 02:23:50 +01:00
Ines Montani	49febd8c62	Modernise noun chunks tests and don't depend on models	2017-01-13 02:01:00 +01:00
Ines Montani	3ee97b5686	Rename test_parser to test_noun_chunks	2017-01-13 01:36:33 +01:00
Ines Montani	a308703f47	Remove old tests	2017-01-13 01:34:48 +01:00
Ines Montani	12eb8edf26	Move parser tests from unit to parser	2017-01-13 01:34:38 +01:00
Ines Montani	138c53ff2e	Merge tokenizer tests	2017-01-13 01:34:14 +01:00
Ines Montani	01f36ca3ff	Move attrs tests from unit to root and modernise	2017-01-13 01:33:50 +01:00
Ines Montani	3610d27967	Move alignment tests from munge to gold and modernise	2017-01-13 01:33:31 +01:00
Ines Montani	094ff7396a	Reformat and rename Pragmatic Segmenter tests and mark xfails	2017-01-13 01:30:20 +01:00
Ines Montani	affcf1b19d	Modernise lemmatizer tests	2017-01-12 23:41:17 +01:00
Ines Montani	33d9cf87f9	Modernise tagger tests and fix xpassing test	2017-01-12 23:40:52 +01:00
Ines Montani	33e5f8dc2e	Create basic and extended test set for URLs	2017-01-12 23:40:02 +01:00
Ines Montani	5e4f5ebfc8	Modernise BILUO tests	2017-01-12 23:39:18 +01:00
Ines Montani	09acfbca01	Add Lemmatizer fixture	2017-01-12 23:38:55 +01:00
Ines Montani	514bfa2597	Add path fixture for spaCy data path	2017-01-12 23:38:47 +01:00
Ines Montani	0894b8c0ef	Don't split tokens with digits and "/" infixes (resolves #740)	2017-01-12 22:58:26 +01:00
Ines Montani	e9e99a5670	Add regression test for #740	2017-01-12 22:57:38 +01:00
Ines Montani	6935d55409	Fix formatting	2017-01-12 22:56:20 +01:00
Ines Montani	5f0d196a31	Modernise and merge matcher tests	2017-01-12 22:23:11 +01:00
Ines Montani	d5d774413a	Update comments on EN and DE fixtures	2017-01-12 22:03:07 +01:00
Ines Montani	9b4bea1df9	Tidy up and rename regression tests and remove unnecessary imports	2017-01-12 22:00:37 +01:00
Ines Montani	5e1b6178e3	Fix formatting and consistency	2017-01-12 22:00:06 +01:00
Ines Montani	a3fd32455e	Remove redundant language loading integration tests	2017-01-12 21:59:48 +01:00
Ines Montani	61f1ca09c2	Modernise serializer codecs tests	2017-01-12 21:58:55 +01:00
Ines Montani	5dbc6e59f6	Modernise Huffman tests	2017-01-12 21:58:40 +01:00
Ines Montani	edeeeccea5	Modernise packer tests and don't depend on models where possible	2017-01-12 21:58:07 +01:00
Ines Montani	d084676cd0	Modernise and merge serialization tests	2017-01-12 21:57:19 +01:00
Ines Montani	442237787c	Add assert_docs_equal util to compare two docs	2017-01-12 21:56:52 +01:00
Ines Montani	eac3f700fb	Add fixture for entity recognizer	2017-01-12 21:56:32 +01:00
Ines Montani	b438cfddbc	Modernise matcher tests and split into two files	2017-01-12 17:51:46 +01:00
Ines Montani	27482ebed8	Move matcher tests for #188 and #242 to regression tests Modernise tests and remove unnecessary imports	2017-01-12 17:33:57 +01:00
Ines Montani	0a4dc632bd	Update test to not create redundant Doc object	2017-01-12 17:33:18 +01:00
Ines Montani	a2526e66d8	Fix formatting, naming and unicode declaration	2017-01-12 16:51:13 +01:00
Ines Montani	052cdff07d	Modernise vector similarity tests	2017-01-12 16:51:13 +01:00
Ines Montani	bd20ec0a6a	Add get_cosine util function	2017-01-12 16:51:13 +01:00
Ines Montani	51ef75f629	Fix regression test for #615 and remove unnecessary imports	2017-01-12 16:51:12 +01:00
Ines Montani	aeb747e10c	Adjust formatting	2017-01-12 16:51:12 +01:00
Ines Montani	8e3e58a7e6	Modernise and merge lexeme vocab tests	2017-01-12 16:51:12 +01:00
Ines Montani	c3d4516fc2	Move test for #361 to regression tests	2017-01-12 16:51:12 +01:00
Daniel Hershcovich	99eb494a82	Fix #737 : support loading word vectors with " " as a word	2017-01-12 17:00:14 +02:00
Ines Montani	7cb3d74426	Modernise span tests and don't depend on models	2017-01-12 15:30:49 +01:00
Ines Montani	92e3d8b3ee	Modernise vocab API tests and remove old xfailing tests	2017-01-12 15:27:46 +01:00
Ines Montani	7ea87684cd	Rename test_vocab.py to test_vocab_api.py	2017-01-12 15:12:21 +01:00
Ines Montani	0da2ee5c68	Merge flag features tests into orth tests in tests root	2017-01-12 15:12:00 +01:00
Ines Montani	03c136cfd3	Remove StringStore tests from vocab tests	2017-01-12 15:11:15 +01:00
Ines Montani	d7bd57abdf	Modernise add vectors vocab test	2017-01-12 15:09:49 +01:00
Ines Montani	89525ef345	Use consistent test names	2017-01-12 15:09:21 +01:00
Ines Montani	f8803808ce	Remove old unused tests and conftest files	2017-01-12 15:09:05 +01:00
Ines Montani	4d0bfebcd9	Move Pragmatic Segmenter test cases (currently unused) to parser tests	2017-01-12 15:08:02 +01:00
Ines Montani	26d018d874	Add tests for StringStore	2017-01-12 15:07:31 +01:00
Ines Montani	9b6784bab5	Add fixture for StringStore	2017-01-12 15:05:40 +01:00
Ines Montani	99d66d613a	Modernise tests for merging spans and don't depend on models	2017-01-12 12:26:26 +01:00
Ines Montani	fa8f67596d	Remove unused old test	2017-01-12 12:26:08 +01:00
Ines Montani	359f73a96b	Move test for #54 to regression tests	2017-01-12 12:25:51 +01:00
Ines Montani	3f3a46722c	Remove unused conftest	2017-01-12 12:25:24 +01:00
Ines Montani	c2406e92bc	Allow setting ents in get_doc	2017-01-12 12:25:10 +01:00
Ines Montani	c5914c6fe5	Fix and pass regression test for #736	2017-01-12 11:48:56 +01:00
Matthew Honnibal	4e48862fa8	Remove print statement	2017-01-12 11:25:39 +01:00
Matthew Honnibal	d1d8214767	Increment version	2017-01-12 11:21:57 +01:00
Matthew Honnibal	fba67fa342	Fix Issue #736 : Times were being tokenized with incorrect string values.	2017-01-12 11:21:01 +01:00
Ines Montani	a6790b6694	Rename tags to pos in get_doc and allow adding tags to tokens	2017-01-12 11:18:36 +01:00
Ines Montani	1add8ace67	Merge lemmatizer tests	2017-01-12 11:16:53 +01:00
Ines Montani	3bc082abdf	Modernise morph exceptions test and don't depend on models	2017-01-12 11:14:29 +01:00
Ines Montani	ec7739b76e	Add regression test for #736	2017-01-12 11:12:44 +01:00
Ines Montani	6c1c564891	Move language-specific tests out of redundant tokenizer directories	2017-01-12 02:17:18 +01:00
Ines Montani	8fecedac3a	Tidy up	2017-01-12 02:16:37 +01:00
Ines Montani	ae7edd30e7	Move text file back to tokenizer tests directory	2017-01-12 02:10:23 +01:00
Ines Montani	ffcaba9017	Remove old and/or redundant tests	2017-01-12 02:10:18 +01:00
Ines Montani	19c4132097	Modernise space attachment parser tests and don't depend on models	2017-01-12 01:54:44 +01:00
Ines Montani	69778924c8	Modernise and merge parser tests and don't depend on models	2017-01-12 01:07:29 +01:00
Ines Montani	178c147612	Modernise nonprojectivity tests and don't depend on models	2017-01-12 01:06:36 +01:00
Ines Montani	1a3984742c	Modernise sentence boundary detection tests and don't depend on models (where possible)	2017-01-11 23:53:08 +01:00
Ines Montani	0cdb6ea61d	Remove old unused pickle test	2017-01-11 23:52:28 +01:00
Ines Montani	c9671329dc	Move test for #309 to regression tests	2017-01-11 23:52:13 +01:00
Ines Montani	d0e37b5670	Modernise parser tests and don't depend on models	2017-01-11 21:30:27 +01:00
Ines Montani	342cb41782	Add apply_transition_sequence util function to utils	2017-01-11 21:30:14 +01:00
Ines Montani	09807addff	Add en_parser fixture	2017-01-11 21:29:59 +01:00
Ines Montani	55d151aa61	Modernise Doc parse tree navigation tests and don't depend on models	2017-01-11 21:14:15 +01:00
Ines Montani	7262421bb2	Use consistent test names	2017-01-11 19:00:52 +01:00
Ines Montani	33800c9367	Rename "tokens" tests to "doc"	2017-01-11 18:59:01 +01:00
Ines Montani	3a9c6a9563	Remove old unused files	2017-01-11 18:58:38 +01:00
Ines Montani	8e962de39f	Remove old word vector tests	2017-01-11 18:55:08 +01:00
Ines Montani	e027936920	Modernise Doc noun chunks tests	2017-01-11 18:54:56 +01:00
Ines Montani	439f396acd	Modernise Doc array tests and don't depend on models	2017-01-11 18:54:46 +01:00
Ines Montani	05447be884	Modernise test for adding entities	2017-01-11 18:54:24 +01:00
Ines Montani	6e883f4c00	Modernise Doc API tests and don't depend on models	2017-01-11 18:05:36 +01:00
Ines Montani	8bf3bb5c44	Make words optional for get_doc	2017-01-11 18:05:10 +01:00
Ines Montani	928db7e419	Fix StringIO import for Python 3	2017-01-11 14:07:48 +01:00
Ines Montani	69998f216b	Rename test_tokens_api.py to test_doc_api.py	2017-01-11 13:58:56 +01:00
Ines Montani	d94dea1b18	Merge token tests into token API tests	2017-01-11 13:57:02 +01:00
Ines Montani	eb23424ab0	Modernise token API tests and don't depend on loading models	2017-01-11 13:56:54 +01:00
Ines Montani	c682b8ca90	Merge conftests into one cohesive file	2017-01-11 13:56:32 +01:00
Ines Montani	909f24d7df	Add test utils and get_doc helper function Create Doc object from given vocab, words and annotations to allow tests not to depend on loading the models.	2017-01-11 13:55:33 +01:00
Matthew Honnibal	e12c90e03f	Merge branch 'master' of ssh://github.com/explosion/spaCy	2017-01-11 13:03:51 +01:00
Matthew Honnibal	12cd27b821	Amend 8ae8b443f: Handle comparison with None tokens.	2017-01-11 13:03:32 +01:00
Daniel Hershcovich	8e603cc917	Avoid "True if ... else False"	2017-01-11 11:18:22 +02:00
Matthew Honnibal	44e2b0100d	Support TAG attribute in doc.from_array	2017-01-10 22:47:07 +01:00
Ines Montani	3e6e1f0251	Tidy up regression tests	2017-01-10 19:24:10 +01:00
Magnus Burton	aad23ab0b4	Supplemented with capitalized Swedish exceptions	2017-01-10 16:07:20 +01:00
Ines Montani	869963c3c4	Mark extensive prefix/suffix tests as slow	2017-01-10 15:57:35 +01:00
Ines Montani	487e020ebe	Add simple test for surrounding brackets	2017-01-10 15:57:26 +01:00
Ines Montani	0ba5cf51d2	Assert length first	2017-01-10 15:57:00 +01:00
Ines Montani	2185d31907	Adjust names and formatting	2017-01-10 15:56:35 +01:00
Ines Montani	e10d4ca964	Remove semi-redundant URLs and punctuation for faster testing	2017-01-10 15:54:25 +01:00
Ines Montani	3a3cb2c90c	Add unicode declaration	2017-01-10 15:53:15 +01:00
Matthew Honnibal	0f9b8a00a5	Unbreak data download	2017-01-09 23:40:26 +01:00
Matthew Honnibal	8ae8b443f1	Add richcmp method to Token. Closes #631	2017-01-09 19:30:31 +01:00
Matthew Honnibal	64f747cb65	Token comparison test	2017-01-09 19:12:00 +01:00
Matthew Honnibal	18c3c2d05c	Add tests for token comparison, re Issue #631	2017-01-09 19:09:59 +01:00
Matthew Honnibal	97a1286129	Revert changes to tagger and parser for thinc 6	2017-01-09 10:08:34 -06:00
Matthew Honnibal	95a52005df	Revert "Fix Issue #683 : Add 'SP' to tag_map, if it's not there already, within the Morphology class." This reverts commit `40e71586d6`.	2017-01-09 09:55:55 -06:00
Ines Montani	363f09e68c	Merge pull request #726 from magnusburton/master Added Swedish abbreviations as token exceptions	2017-01-09 14:58:15 +01:00
Matthew Honnibal	42cd598f57	Use correct fixtures in URL tokenizer	2017-01-09 14:10:40 +01:00
Matthew Honnibal	d9a77ddf14	Return None for data path if it doesn't exist	2017-01-09 14:10:05 +01:00
Matthew Honnibal	e4862d1dab	Merge branch 'develop'	2017-01-09 13:36:01 +01:00
Ines Montani	aa876884f0	Revert "Revert "Merge remote-tracking branch 'origin/master'"" This reverts commit `fb9d3bb022`.	2017-01-09 13:28:13 +01:00
Ines Montani	d5c72c40eb	Remove old tests for old website example code	2017-01-08 22:28:53 +01:00
Ines Montani	eef94e3ee2	Split off period after two or more uppercase letters (fixes #483 )	2017-01-08 22:28:25 +01:00
Ines Montani	a89a6000e5	Remove unused import	2017-01-08 22:17:37 +01:00
Ines Montani	5d28664fc5	Don't test Hungarian for numbers and hyphens for now Reinvestigate behaviour of case affixes given reorganised tokenizer patterns.	2017-01-08 20:45:40 +01:00
Ines Montani	53362b6b93	Reorganise Hungarian prefixes/suffixes/infixes Use global prefixes and suffixes for non-language-specific rules, import list of alpha unicode characters and adjust regexes.	2017-01-08 20:40:33 +01:00
Ines Montani	347c4a2d06	Reorganise and reformat global tokenizer prefixes, suffixes and infixes	2017-01-08 20:37:39 +01:00
Ines Montani	0dec90e9f7	Use global abbreviation data languages and remove duplicates	2017-01-08 20:36:00 +01:00
Ines Montani	7c3cb2a652	Add global abbreviations data	2017-01-08 20:34:03 +01:00
Ines Montani	de5aa92bc2	Handle deprecated tokenizer prefix data	2017-01-08 20:33:28 +01:00
Ines Montani	abb09782f9	Move sun.txt to original location and fix path to not break parser tests	2017-01-08 20:32:54 +01:00
Ines Montani	cab39c59c5	Add missing contractions to English tokenizer exceptions Inspired by https://github.com/kootenpv/contractions/blob/master/contractions/__init__.py	2017-01-05 19:59:06 +01:00
Ines Montani	a23504fe07	Move abbreviations below other exceptions	2017-01-05 19:58:07 +01:00
Ines Montani	7d2cf934b9	Generate he/she/it correctly with 's instead of 've	2017-01-05 19:57:00 +01:00
Ines Montani	8328925e1f	Add newlines to long German text	2017-01-05 18:13:30 +01:00
Ines Montani	55b46d7cf6	Add tokenizer tests for German	2017-01-05 18:11:25 +01:00
Ines Montani	5bb4081f52	Remove redundant test_tokenizer.py for English	2017-01-05 18:11:11 +01:00
Ines Montani	8216ba599b	Add tests for longer and mixed English texts	2017-01-05 18:11:04 +01:00
Ines Montani	65f937d5c6	Move basic contraction tests to test_contractions.py	2017-01-05 18:09:53 +01:00
Ines Montani	bbe7cab3a1	Move non-English-specific tests back to general tokenizer tests	2017-01-05 18:09:29 +01:00
Ines Montani	038002d616	Reformat HU tokenizer tests and adapt to general style Improve readability of test cases and add conftest.py with fixture	2017-01-05 18:06:44 +01:00
Ines Montani	bc911322b3	Move ") to emoticons (see Tweebo challenge test)	2017-01-05 18:05:38 +01:00
Ines Montani	637f785036	Add general sanity tests for all tokenizers	2017-01-05 16:25:38 +01:00
Ines Montani	c5f2dc15de	Move English tokenizer tests to directory /en	2017-01-05 16:25:04 +01:00
Ines Montani	8b45363b4d	Modernize and merge general tokenizer tests	2017-01-05 13:17:05 +01:00
Ines Montani	02cfda48c9	Modernize and merge tokenizer tests for string loading	2017-01-05 13:16:55 +01:00
Ines Montani	a11f684822	Modernize and merge tokenizer tests for whitespace	2017-01-05 13:16:33 +01:00
Ines Montani	8b284fc6f1	Modernize and merge tokenizer tests for text from file	2017-01-05 13:15:52 +01:00
Ines Montani	2c2e878653	Modernize and merge tokenizer tests for punctuation	2017-01-05 13:14:16 +01:00
Ines Montani	8a74129cdf	Modernize and merge tokenizer tests for prefixes/suffixes/infixes	2017-01-05 13:13:12 +01:00
Ines Montani	0e65dca9a5	Modernize and merge tokenizer tests for exception and emoticons	2017-01-05 13:11:31 +01:00
Ines Montani	34c47bb20d	Fix formatting	2017-01-05 13:10:51 +01:00
Ines Montani	2e72683baa	Add missing docstrings	2017-01-05 13:10:21 +01:00
Ines Montani	da10a049a6	Add unicode declarations	2017-01-05 13:09:48 +01:00
Ines Montani	58adae8774	Remove unused file	2017-01-05 13:09:22 +01:00
Ines Montani	c6e5a5349d	Move regression test for #360 into own file	2017-01-04 00:49:31 +01:00
Ines Montani	8279993a6f	Modernize and merge tokenizer tests for punctuation	2017-01-04 00:49:20 +01:00
Ines Montani	550630df73	Update tokenizer tests for contractions	2017-01-04 00:48:42 +01:00
Ines Montani	109f202e8f	Update conftest fixture	2017-01-04 00:48:21 +01:00
Ines Montani	ee6b49b293	Modernize tokenizer tests for emoticons	2017-01-04 00:47:59 +01:00
Ines Montani	f09b5a5dfd	Modernize tokenizer tests for infixes	2017-01-04 00:47:42 +01:00
Ines Montani	59059fed27	Move regression test for #351 to own file	2017-01-04 00:47:11 +01:00
Ines Montani	667051375d	Modernize tokenizer tests for whitespace	2017-01-04 00:46:35 +01:00
Ines Montani	aafc894285	Modernize tokenizer tests for contractions Use @pytest.mark.parametrize.	2017-01-03 23:02:21 +01:00
Ines Montani	1d237664af	Add lowercase lemma to tokenizer exceptions	2017-01-03 23:02:21 +01:00
Ines Montani	84a87951eb	Fix typos	2017-01-03 18:27:43 +01:00
Ines Montani	35b39f53c3	Reorganise English tokenizer exceptions (as discussed in #718 ) Add logic to generate exceptions that follow a consistent pattern (like verbs and pronouns) and allow certain tokens to be excluded explicitly.	2017-01-03 18:26:09 +01:00
Ines Montani	fb9d3bb022	Revert "Merge remote-tracking branch 'origin/master'" This reverts commit `d3b181cdf1`, reversing changes made to `b19cfcc144`.	2017-01-03 18:21:36 +01:00
Ines Montani	461cbb99d8	Revert "Reorganise English tokenizer exceptions (as discussed in #718 )" This reverts commit `b19cfcc144`.	2017-01-03 18:21:29 +01:00
Ines Montani	d3b181cdf1	Merge remote-tracking branch 'origin/master' # Conflicts: # spacy/en/tokenizer_exceptions.py	2017-01-03 18:20:01 +01:00
Ines Montani	b19cfcc144	Reorganise English tokenizer exceptions (as discussed in #718 ) Add logic to generate exceptions that follow a consistent pattern (like verbs and pronouns) and allow certain tokens to be excluded explicitly.	2017-01-03 18:17:57 +01:00
Ines Montani	1bd53bbf89	Fix typos (resolves #718 )	2017-01-03 11:26:21 +01:00
Matthew Honnibal	fde53be3b4	Move whole token match inside _split_affixes.	2016-12-30 17:11:50 -06:00
Matthew Honnibal	3ba7c167a8	Fix URL tests	2016-12-30 17:10:08 -06:00
Matthew Honnibal	9936a1b9b5	Merge branch 'tokenization_w_exception_patterns' of https://github.com/oroszgy/spaCy.hu into oroszgy-tokenization_w_exception_patterns	2016-12-30 14:53:40 -06:00
Magnus Burton	56e2219b65	Added Swedish city abbreviations	2016-12-30 21:17:34 +01:00
Magnus Burton	e935c950d8	Added months and days as abbreviations for Swedish	2016-12-30 21:08:44 +01:00
kengz	73a38bd4d1	Merge remote-tracking branch 'upstream/master'	2016-12-30 12:19:59 -05:00
kengz	da44183ae1	move parse_tree logic to a new tokens/printers.py file	2016-12-30 12:19:18 -05:00
Matthew Honnibal	3e8d9c772e	Test interaction of token_match and punctuation Check that the new token_match function applies after punctuation is split off.	2016-12-31 00:52:17 +11:00
Matthew Honnibal	74b921f394	Merge branch 'master' of ssh://github.com/explosion/spaCy into develop	2016-12-30 14:38:27 +01:00
Matthew Honnibal	623d94e14f	Whitespace	2016-12-31 00:30:28 +11:00
Matthew Honnibal	af81ac8bb0	Use thinc 6.0	2016-12-29 11:58:42 +01:00
Petter Hohle	f112e7754e	Add PART to tag map 16 of the 17 PoS tags in the UD tag set are added; PART is missing.	2016-12-28 18:39:01 +01:00
Matthew Honnibal	f62db78dc3	Increment version	2016-12-27 21:11:22 +01:00
Matthew Honnibal	cade536d1e	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-12-27 21:04:10 +01:00
Matthew Honnibal	ce4539dafd	Allow the vocabulary to grow to 10,000, to prevent cold-start problem.	2016-12-27 21:03:45 +01:00
Ines Montani	ad3669cef5	Merge pull request #703 from magnusburton/master Added Swedish abbreviations	2016-12-27 01:01:49 +01:00
Ines Montani	78f754dd9a	Merge pull request #705 from oroszgy/hu_tokenizer Initial support for Hungarian	2016-12-27 00:48:13 +01:00
Ines Montani	8785706039	Reformat stop words for better readability	2016-12-24 00:58:40 +01:00
Gyorgy Orosz	45e045a87b	Unicode/UTF8 compatibility for Python2	2016-12-24 00:21:00 +01:00
Gyorgy Orosz	72b61b6d03	Typo fix.	2016-12-24 00:10:29 +01:00
Gyorgy Orosz	3a9be4d485	Updated token exception handling mechanism to allow the usage of arbitrary functions as token exception matchers.	2016-12-23 23:49:34 +01:00
Ines Montani	1436b9f15a	Fix formatting and consistency	2016-12-23 21:36:01 +01:00
Ines Montani	1d64527727	Update Spanish tokenizer Remove reflexive pronouns as they're part of an open class, fix mistakes and add exceptions	2016-12-23 21:36:01 +01:00
Ines Montani	7f411fd01c	Remove exceptions containing whitespace / no special chars	2016-12-23 14:30:06 +01:00
Magnus Burton	fdf4776262	Added Swedish abbreviations	2016-12-22 22:45:18 +01:00
Gyorgy Orosz	d9c59c4751	Maintaining backward compatibility.	2016-12-21 23:30:49 +01:00
Gyorgy Orosz	1748549aeb	Added exception pattern mechanism to the tokenizer.	2016-12-21 23:16:19 +01:00
Gyorgy Orosz	35aa54765d	Hungarian module is exposed in spacy.	2016-12-21 20:45:36 +01:00
Gyorgy Orosz	ab2f6ea46c	Removed data files from tests..	2016-12-21 20:22:09 +01:00
Ines Montani	3c87c71d43	Add tokenizer exceptions for a.m. and p.m. in Spanish	2016-12-21 18:19:10 +01:00
Ines Montani	78e63dc7d0	Update tokenizer exceptions for English	2016-12-21 18:06:34 +01:00
Ines Montani	702d1eed93	Update tokenizer exceptions for German	2016-12-21 18:06:27 +01:00
Ines Montani	d60380418e	Update tokenizer exceptions for Spanish	2016-12-21 18:06:17 +01:00
Ines Montani	920fa0fed2	Add DET_LEMMA constant	2016-12-21 18:05:41 +01:00
Ines Montani	8978806ea6	Allow Vocab to load without serializer_freqs	2016-12-21 18:05:23 +01:00
Ines Montani	be8ed811f6	Remove trailing whitespace	2016-12-21 18:04:41 +01:00
Ines Montani	926e19184a	Merge pull request #695 from magnusburton/master Added Swedish morph rules	2016-12-21 01:06:00 +01:00
Gyorgy Orosz	3d5306acb9	Added further testcases.	2016-12-20 23:49:35 +01:00
Gyorgy Orosz	23956e72ff	Improved partial support for tokenizing Hungarian numbers	2016-12-20 23:36:59 +01:00
Gyorgy Orosz	6add156075	Refactored language data structure	2016-12-20 22:28:20 +01:00
Gyorgy Orosz	366b3f8685	Merge branch 'master' into hu_tokenizer	2016-12-20 20:53:31 +01:00
Gyorgy Orosz	c035928156	Partial Hungarian number tokenization is added.	2016-12-20 20:46:20 +01:00
JM	70ff0639b5	Fixed missing vec_path declaration that was failing if 'add_vectors' was set Added vec_path variable declaration to avoid accessing it before assignment in case 'add_vectors' is in overrides.	2016-12-20 18:21:05 +01:00
Magnus Burton	48dcc9f647	Added morph rules	2016-12-20 13:18:41 +01:00
Magnus Burton	db5a077d2b	Initial commit for Swedish	2016-12-20 11:05:06 +01:00
Matthew Honnibal	3f5747a9b2	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-12-18 23:44:22 +01:00
Matthew Honnibal	40e71586d6	Fix Issue #683 : Add 'SP' to tag_map, if it's not there already, within the Morphology class.	2016-12-18 23:44:05 +01:00
Matthew Honnibal	fa1d23e10d	Merge branch 'master' of https://github.com/explosion/spaCy	2016-12-18 23:32:03 +01:00
Matthew Honnibal	f38eb25fe1	Fix test for word vector	2016-12-18 23:31:55 +01:00
Matthew Honnibal	4e68abebc4	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-12-18 23:19:45 +01:00
Matthew Honnibal	5a6328a5a4	Increment version	2016-12-18 23:19:19 +01:00
Matthew Honnibal	13a0b31279	Another tweak to GloVe path hackery.	2016-12-18 23:12:49 +01:00
Matthew Honnibal	2c6228565e	Fix vector loading re glove hack	2016-12-18 23:06:44 +01:00
Matthew Honnibal	618b50a064	Fix issue #684 : GloVe vectors not loaded in spacy.en.English.	2016-12-18 22:46:31 +01:00

5008 Commits