spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-29 11:26:28 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	bc97bc292c	Fix __call__ method	2017-05-28 08:11:58 -05:00
Matthew Honnibal	5cf47b847b	Handle iob with no tag in converter	2017-05-28 08:11:39 -05:00
Matthew Honnibal	fe11564b8e	Finish stringstore change. Also xfail vectors tests	2017-05-28 15:10:22 +02:00
Matthew Honnibal	b007a2b0d3	Update stringstore tests	2017-05-28 14:08:09 +02:00
Matthew Honnibal	84e66ca6d4	WIP on stringstore change. 27 failures	2017-05-28 14:06:40 +02:00
Matthew Honnibal	fe4a746300	Accommodate symbols in new string scheme	2017-05-28 13:03:16 +02:00
Matthew Honnibal	f51e6a6c16	Adjust lexeme sizing for attr_t being 64 bit	2017-05-28 12:51:09 +02:00
Matthew Honnibal	a5606c3eda	Work on changing StringStore to return hashes.	2017-05-28 12:36:27 +02:00
Matthew Honnibal	39293ab2ee	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-28 11:46:57 +02:00
Matthew Honnibal	dd052572d4	Update arc eager for SBD changes	2017-05-28 11:46:51 +02:00
Matthew Honnibal	3ea98e2043	Remove vector member from lexeme	2017-05-28 11:46:24 +02:00
Matthew Honnibal	2445707f3c	Re-delegate vectors to vocab	2017-05-28 11:46:10 +02:00
Matthew Honnibal	6863d01361	Remove vectors from lexeme	2017-05-28 11:45:48 +02:00
Matthew Honnibal	15f6efc127	Remove vectors from vocab	2017-05-28 11:45:32 +02:00
Matthew Honnibal	c1263a844b	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-27 18:32:57 -05:00
Matthew Honnibal	9e711c3476	Divide d_loss by batch size	2017-05-27 18:32:46 -05:00
Matthew Honnibal	b082f76494	Randomize pipeline order during training	2017-05-27 18:32:21 -05:00
Matthew Honnibal	a1d4c97fb7	Improve correctness of minibatching	2017-05-27 17:59:00 -05:00
ines	84189c1cab	Add 'xx' language ID for multi-language support. Allows models to specify their language ID as 'xx'.	2017-05-28 00:58:59 +02:00
ines	33e332e67c	Remove unused export	2017-05-28 00:57:59 +02:00
ines	c1983621fb	Update util functions for model loading	2017-05-28 00:22:40 +02:00
ines	c8543c8237	Fix formatting and docstrings and remove deprecated function	2017-05-28 00:22:40 +02:00
Matthew Honnibal	49235017bf	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-27 16:34:28 -05:00
Matthew Honnibal	7ebd26b8aa	Use ordered dict to specify transitions	2017-05-27 15:52:20 -05:00
Matthew Honnibal	3eea5383a1	Add move_names property to parser	2017-05-27 15:51:55 -05:00
Matthew Honnibal	8de9829f09	Don't overwrite model in initialization, when loading	2017-05-27 15:50:40 -05:00
Matthew Honnibal	99316fa631	Use ordered dict to specify actions	2017-05-27 15:50:21 -05:00
Matthew Honnibal	655ca58c16	Clarifying change to StateC.clone	2017-05-27 15:49:37 -05:00
Matthew Honnibal	5e4312feed	Evaluate loaded class, to ensure save/load works	2017-05-27 15:47:02 -05:00
Matthew Honnibal	34bbad8e0e	Add __reduce__ methods on parser subclasses. Fixes pickling.	2017-05-27 15:46:06 -05:00
Matthew Honnibal	7cc9c3e9a6	Fix convert CLI	2017-05-27 15:44:42 -05:00
ines	1203959625	Add pipeline setting to meta.json generator	2017-05-27 20:02:01 +02:00
ines	086a06e7d7	Fix CLI docstrings and add command as first argument. Workaround for Plac.	2017-05-27 20:01:46 +02:00
ines	a8e58e04ef	Add symbols class to punctuation rules to handle emoji (see #1088). Currently doesn't work for Hungarian, because of conflicts with the custom punctuation rules. Also doesn't take multi-character emoji like 👩🏽‍💻 into account.	2017-05-27 17:57:10 +02:00
Matthew Honnibal	dc07d72d80	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-27 08:20:40 -05:00
Matthew Honnibal	de13fe0305	Remove length cap on sentences	2017-05-27 08:20:32 -05:00
Matthew Honnibal	73a643d32a	Don't randomise pipeline for training, and don't update if no gradient	2017-05-27 08:20:13 -05:00
Matthew Honnibal	3d22fcaf0b	Return None from parser if there are no annotations	2017-05-26 14:02:59 -05:00
Matthew Honnibal	d06f235fc9	Fix conflict on convert.py	2017-05-26 11:33:29 -05:00
Matthew Honnibal	2e587c6417	Export iob_to_biluo utility	2017-05-26 11:32:55 -05:00
Matthew Honnibal	2b3b937a04	Fix converter CLI	2017-05-26 11:32:41 -05:00
Matthew Honnibal	5a87bcf35f	Fix converters	2017-05-26 11:32:34 -05:00
Matthew Honnibal	8af3100143	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-26 11:31:41 -05:00
Matthew Honnibal	3d5a536eaa	Improve efficiency of parser batching	2017-05-26 11:31:23 -05:00
Matthew Honnibal	daac3e3573	Always shuffle gold data, and support length cap	2017-05-26 11:30:52 -05:00
Matthew Honnibal	d65f99a720	Improve model saving in train script	2017-05-26 05:52:09 -05:00
ines	51882c4984	Fix formatting	2017-05-26 12:37:45 +02:00
ines	353f0ef8d7	Use disable argument (list) for serialization	2017-05-26 12:33:54 +02:00
Matthew Honnibal	22d7b448a5	Fix convert command	2017-05-25 19:47:12 -05:00
Matthew Honnibal	dbf2a4cf57	Update all models on each epoch	2017-05-25 19:46:56 -05:00
Matthew Honnibal	faff1c23fb	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-25 17:16:10 -05:00
Matthew Honnibal	82b11b0320	Remove print statement	2017-05-25 17:15:59 -05:00
Matthew Honnibal	80cf42e33b	Fix compounding and decaying utils	2017-05-25 17:15:39 -05:00
Matthew Honnibal	df8015f05d	Tweaks to train script	2017-05-25 17:15:24 -05:00
Matthew Honnibal	3a6e59cc53	Add minibatch function in spacy.gold	2017-05-25 17:15:09 -05:00
Matthew Honnibal	702fe74a4d	Clean up spacy.cli.train	2017-05-25 16:16:30 -05:00
Matthew Honnibal	b9cea9cd93	Add compounding and decaying functions	2017-05-25 16:16:10 -05:00
Matthew Honnibal	2cb7cc2db7	Remove commented code from parser	2017-05-25 14:55:09 -05:00
Matthew Honnibal	f403c2cd5f	Add env opts for optimizer	2017-05-25 11:19:26 -05:00
Matthew Honnibal	c245ff6b27	Rebatch parser inputs, with mid-sentence states	2017-05-25 11:18:59 -05:00
Matthew Honnibal	679efe79c8	Make parser update less hacky	2017-05-25 06:49:00 -05:00
Matthew Honnibal	8500d9b1da	Only train one task per iter, holding grads	2017-05-25 06:47:42 -05:00
Matthew Honnibal	b27c587800	Fix pieces argument to PrecomputedMaxout	2017-05-25 06:46:59 -05:00
Matthew Honnibal	e1cb5be0c7	Adjust dropout, depth and multi-task in parser	2017-05-24 20:11:41 -05:00
Matthew Honnibal	e6cc927ab1	Rearrange multi-task learning	2017-05-24 20:10:54 -05:00
Matthew Honnibal	135a13790c	Disable gold preprocessing	2017-05-24 20:10:20 -05:00
Matthew Honnibal	467bbeadb8	Add hidden layers for tagger	2017-05-24 20:09:51 -05:00
ines	66088851dc	Add Doc.to_disk() and Doc.from_disk() methods	2017-05-24 11:58:17 +02:00
Matthew Honnibal	620df0414f	Fix dropout in parser	2017-05-23 15:20:45 -05:00
Matthew Honnibal	5b67bcbee0	Increase default embed size to 7500	2017-05-23 15:20:16 -05:00
Matthew Honnibal	48eef94f92	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-23 18:47:32 +02:00
Matthew Honnibal	d44b1eafc4	Fix conflict artefacts	2017-05-23 18:47:11 +02:00
Matthew Honnibal	01e59e4e6e	* Add Token.sent_start property, re Issue #235	2017-05-23 18:41:11 +02:00
Matthew Honnibal	4917cbb484	Include sent_start test	2017-05-23 18:40:37 +02:00
Matthew Honnibal	d68dd1f251	Add SENT_START attribute, for custom sentence boundary detection	2017-05-23 18:37:58 +02:00
Matthew Honnibal	8026c183d0	Add hacky logic to accelerate depth=0 case in parser	2017-05-23 11:06:49 -05:00
Matthew Honnibal	e7d3159d91	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-23 05:58:17 -05:00
Matthew Honnibal	a8b6d11c5b	Support optional maxout layer	2017-05-23 05:58:07 -05:00
Matthew Honnibal	c55b8fa7c5	Fix bugs in parse_batch	2017-05-23 05:57:52 -05:00
ines	fb0ff0272f	xfail neural parser tests for now and remove test for deprecated method	2017-05-23 12:40:37 +02:00
Matthew Honnibal	964707d795	Restore support for deeper networks in parser	2017-05-23 05:31:13 -05:00
Matthew Honnibal	e27262f431	Go back to previous matcher signature, with on_match positional	2017-05-23 04:37:40 -05:00
Matthew Honnibal	5418bcf5d7	Resolve conflict on test	2017-05-23 04:37:16 -05:00
ines	e6acd3bbf2	Fix matcher tests and matcher docs	2017-05-23 11:36:02 +02:00
ines	d0c6d4f76d	Fix formatting	2017-05-23 11:32:00 +02:00
Matthew Honnibal	f0bcc0bd8d	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-23 04:29:28 -05:00
Matthew Honnibal	9adfe9e8fc	Don't hold gradient updates in language -- let the parser decide how to batch the updates.	2017-05-23 04:29:10 -05:00
Matthew Honnibal	6b918cc58e	Support making updates periodically during training	2017-05-23 04:23:29 -05:00
Matthew Honnibal	3f725ff7b3	Roll back changes to parser update	2017-05-23 04:23:05 -05:00
Matthew Honnibal	3959d778ac	Revert "Revert "WIP on improving parser efficiency"" This reverts commit `532afef4a8`.	2017-05-23 03:06:53 -05:00
Matthew Honnibal	532afef4a8	Revert "WIP on improving parser efficiency" This reverts commit `bdaac7ab44`.	2017-05-23 03:05:25 -05:00
Matthew Honnibal	bdaac7ab44	WIP on improving parser efficiency	2017-05-23 02:59:31 -05:00
Matthew Honnibal	8a9e318deb	Put the parsing loop in a nogil prange block	2017-05-22 17:58:12 -05:00
ines	a23f487b06	Tidy up displaCy and add "manual" option. Also don't require title in EntityRenderer.	2017-05-22 18:48:20 +02:00
Matthew Honnibal	0264447c4d	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-22 10:41:56 -05:00
Matthew Honnibal	6e8dce2c05	Fix train command line args	2017-05-22 10:41:39 -05:00
Matthew Honnibal	a7ee63c0ac	Fix labeller loss for unseen labels	2017-05-22 10:41:20 -05:00
Matthew Honnibal	c9760b2104	Support sentence limits in GoldCorpus	2017-05-22 10:40:46 -05:00
Matthew Honnibal	e2136232f9	Exclude states with no matching gold annotations from parsing	2017-05-22 10:30:12 -05:00
Matthew Honnibal	83ffd16474	Fix offset calculation for other negative values	2017-05-22 08:00:53 -05:00
ines	b3c7ee0148	Fix tests and use the new Matcher API	2017-05-22 13:54:20 +02:00
Matthew Honnibal	f00f821496	Fix pseudoprojectivity->nonproj	2017-05-22 06:14:42 -05:00
Matthew Honnibal	ae8cf70dc1	Fix CLI train signature	2017-05-22 06:13:39 -05:00
Matthew Honnibal	187f370734	Update tests for matcher changes	2017-05-22 12:59:50 +02:00
Matthew Honnibal	5d59e74cf6	PseudoProjectivity->nonproj	2017-05-22 05:49:53 -05:00
Matthew Honnibal	7e2cdc0c81	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-22 12:39:34 +02:00
Matthew Honnibal	70a8c531cd	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-22 05:39:18 -05:00
Matthew Honnibal	2f78413a02	PseudoProjectivity->nonproj	2017-05-22 05:39:03 -05:00
Matthew Honnibal	89ebc5c3cd	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-22 12:38:15 +02:00
Matthew Honnibal	d8bb5bb959	Implement StringStore serialization, and update tests	2017-05-22 12:38:00 +02:00
ines	54f04a9fe0	Update API docs with changes in spacy.gold and spacy.language	2017-05-22 12:29:30 +02:00
ines	b5fb43fdd8	Allow sys.exit status as exits keyword arg in util.prints()	2017-05-22 12:29:15 +02:00
ines	fc3ec733ea	Reduce complexity in CLI. Remove now redundant model command and move plac annotations to cli files.	2017-05-22 12:28:58 +02:00
Matthew Honnibal	b45b4aa392	PseudoProjectivity --> nonproj	2017-05-22 05:17:44 -05:00
Matthew Honnibal	aae97f00e9	Fix nonproj import	2017-05-22 05:15:06 -05:00
Matthew Honnibal	9262fc4829	Fix syntax error	2017-05-22 05:14:59 -05:00
Matthew Honnibal	93a042253b	Make GoldParse attributes writeable	2017-05-22 04:51:08 -05:00
Matthew Honnibal	2a5eb9f61e	Make nonproj methods top-level functions, instead of class methods	2017-05-22 04:51:08 -05:00
Matthew Honnibal	c998776c25	Make single array for features, to reduce GPU copies	2017-05-22 04:51:08 -05:00
Matthew Honnibal	bc2294d7f1	Add support for fiddly hyper-parameters to train func	2017-05-22 04:51:08 -05:00
Matthew Honnibal	80e19a2399	Simplify CLI implementation for subcommands. Remove model command.	2017-05-22 04:51:08 -05:00
Matthew Honnibal	33e2222839	Remove unused code in deprojectivize	2017-05-22 04:51:08 -05:00
Matthew Honnibal	4e0988605a	Pass through non-projective=True	2017-05-22 04:51:08 -05:00
Matthew Honnibal	025d9bbc37	Fix handling of non-projective deps	2017-05-22 04:51:08 -05:00
Matthew Honnibal	5738d373d5	Add deprojectivize to pipeline	2017-05-22 04:51:08 -05:00
Matthew Honnibal	1b5fa68996	Do pseudo-projective pre-processing for parser	2017-05-22 04:51:08 -05:00
Matthew Honnibal	1d5d9838a2	Fix action collection for parser	2017-05-22 04:51:08 -05:00
Matthew Honnibal	8d1e64be69	Add experimental NeuralLabeller	2017-05-22 04:51:08 -05:00
Matthew Honnibal	9b1b0742fd	Fix prediction for tok2vec	2017-05-22 04:51:08 -05:00
Matthew Honnibal	f13d6c7359	Support gold preprocessing and single gold files	2017-05-22 04:51:08 -05:00
Matthew Honnibal	e14533757b	Use averaged params for evaluation	2017-05-22 04:51:08 -05:00
Matthew Honnibal	7811d97339	Refactor CLI	2017-05-22 04:51:08 -05:00
Matthew Honnibal	5db89053aa	Merge docstrings	2017-05-21 13:46:23 -05:00
Matthew Honnibal	432b3499b3	Fix memory leak	2017-05-21 13:38:46 -05:00
Matthew Honnibal	59fbfb3829	Remove train.py -- functions now in GoldCorpus and Language	2017-05-21 09:08:27 -05:00
Matthew Honnibal	8904814c0e	Add missing import	2017-05-21 09:07:56 -05:00
Matthew Honnibal	baf3ef0ddc	Remove import of removed train_config script	2017-05-21 09:07:34 -05:00
Matthew Honnibal	4c9202249d	Refactor training, to fix memory leak	2017-05-21 09:07:06 -05:00
Matthew Honnibal	4803b3b69e	Add GoldCorpus class, to manage data streaming	2017-05-21 09:06:17 -05:00
Matthew Honnibal	180e5afede	Fix tokvecs flattening in pipeline	2017-05-21 09:05:34 -05:00
Matthew Honnibal	0731971bfc	Add itershuffle utility function. Maybe belongs in thinc	2017-05-21 09:05:05 -05:00
ines	2c5cfe8bbf	Update docstrings and API docs for StringStore	2017-05-21 14:18:58 +02:00
ines	251346b59f	Fix typos and formatting	2017-05-21 14:18:46 +02:00
ines	075f5ff87a	Update docstrings and API docs for GoldParse	2017-05-21 13:53:46 +02:00
ines	99b631617d	Reformat docstrings	2017-05-21 13:32:15 +02:00
ines	885e82c9b0	Update docstrings and remove deprecated load classmethod	2017-05-21 13:27:52 +02:00
ines	c5a653fa48	Update docstrings and API docs for Tokenizer	2017-05-21 13:18:14 +02:00
ines	f216422ac5	Remove deprecated load classmethod	2017-05-21 13:18:01 +02:00
ines	d82ae9a585	Change "function" to "callable" in docs	2017-05-21 13:17:40 +02:00
ines	3871157d84	Update spacy.util documentation	2017-05-21 01:12:09 +02:00
ines	0c6c65aa3c	Improve messaging if model linking fails after download	2017-05-21 00:28:37 +02:00
Matthew Honnibal	3b7c108246	Pass tokvecs through as a list, instead of concatenated. Also fix padding	2017-05-20 13:23:32 -05:00
ines	924e8506de	Move Defaults subclass to module scope (necessary for pickling)	2017-05-20 19:02:27 +02:00
Matthew Honnibal	d52b65aec2	Revert "Move to contiguous buffer for token_ids and d_vectors" This reverts commit `3ff8c35a79`.	2017-05-20 11:26:23 -05:00
ines	27de0834b2	Update docstrings and API docs for Lexeme	2017-05-20 15:13:42 +02:00
ines	7ed8a92ed1	Update docstrings and API docs for Token	2017-05-20 15:13:33 +02:00
ines	4ed6a36622	Update docstrings and API docs for Matcher	2017-05-20 14:43:10 +02:00
ines	39f36539f6	Update docstrings and API docs for Matcher	2017-05-20 14:32:34 +02:00
ines	c00ff257be	Update docstrings and API docs for Matcher	2017-05-20 14:26:10 +02:00
ines	790435e51c	Update docstrings	2017-05-20 14:05:07 +02:00
ines	f0cc642bb9	Update docstrings and API docs for Vocab	2017-05-20 14:00:41 +02:00
Matthew Honnibal	ce9234f593	Update Matcher API	2017-05-20 13:54:53 +02:00
Matthew Honnibal	b272890a8c	Try to move parser to simpler PrecomputedAffine class. Currently broken -- maybe the previous change	2017-05-20 06:40:10 -05:00
ines	e39ad78267	Resolve model name properly in cli.info. Use util.resolve_model_path() to also allow package names and paths.	2017-05-20 12:24:40 +02:00
Matthew Honnibal	3ff8c35a79	Move to contiguous buffer for token_ids and d_vectors	2017-05-20 04:17:30 -05:00
Matthew Honnibal	8b04b0af9f	Remove freqs from transition_system	2017-05-20 02:20:48 -05:00
Matthew Honnibal	61fe55efba	Move EnglishDefaults class out of English	2017-05-20 02:18:19 -05:00
Matthew Honnibal	a1ba20e2b1	Fix over-run on parse_batch	2017-05-19 18:57:30 -05:00
ines	1d4d3d0ecd	Add TODO	2017-05-20 01:38:04 +02:00
Matthew Honnibal	7ee1827af0	Disable data caching in parser	2017-05-19 18:17:11 -05:00
Matthew Honnibal	e84de028b5	Remove 'rebatch' op, and remove min-batch cap	2017-05-19 18:16:36 -05:00
Matthew Honnibal	3376d4d6e8	Update the train script, fixing GPU memory leak	2017-05-19 18:15:50 -05:00
Matthew Honnibal	836fe1d880	Update neural net tests	2017-05-19 18:11:29 -05:00
ines	fe5d8819ea	Update Matcher docstrings and API docs	2017-05-19 21:47:06 +02:00
Matthew Honnibal	08766240c3	Add incomplete iob converter	2017-05-19 13:27:51 -05:00
Matthew Honnibal	c12ab47a56	Remove state argument in pipeline. Other changes	2017-05-19 13:26:36 -05:00
Matthew Honnibal	66ea9aebe7	Remove the state argument from Language	2017-05-19 13:25:42 -05:00
Matthew Honnibal	09a877886b	WIP on iob converter	2017-05-19 13:24:39 -05:00
ines	a804045597	Use is_ancestor instead of deprecated is_ancestor_of	2017-05-19 20:23:40 +02:00
Matthew Honnibal	8d5e6d9f4f	Rename no_ner arg to no_entities	2017-05-19 13:23:11 -05:00
ines	e9e62b01b0	Update docstrings and API docs for Token	2017-05-19 18:47:56 +02:00
ines	62ceec4fc6	Update docstrings and API docs for Span	2017-05-19 18:47:46 +02:00
ines	23f9a3ccc8	Update docstrings and API docs for Doc	2017-05-19 18:47:39 +02:00
ines	2c8c9dc0c9	Update docstrings and API docs for Language	2017-05-19 18:47:24 +02:00
ines	0791f0aae6	Update docstrings and API docs for Span class	2017-05-19 00:31:31 +02:00
ines	8455cb1327	Update docstring for Doc.__getitem__	2017-05-19 00:30:51 +02:00
ines	0fc05e54e4	Document TokenVectorEncoder	2017-05-19 00:00:02 +02:00
ines	b687ad109d	Update docstrings and API docs for Doc class	2017-05-18 23:59:44 +02:00
ines	d42bc16868	Update docstrings and API docs for Language class	2017-05-18 23:57:38 +02:00
ines	593361ee3c	Update docstrings for Span class	2017-05-18 22:17:41 +02:00
ines	b87066ff10	Update docstrings and API docs for Doc class	2017-05-18 22:17:41 +02:00
Matthew Honnibal	238be0f16a	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-18 08:32:22 -05:00
Matthew Honnibal	c214c0decb	Improve env_opt reporting	2017-05-18 08:32:03 -05:00
Matthew Honnibal	bbb59e371c	Fix GPU evaluation	2017-05-18 08:31:15 -05:00
Matthew Honnibal	c2c825127a	Fix use_params and pipe methods	2017-05-18 08:30:59 -05:00
Matthew Honnibal	ca70b08661	Fix GPU training and evaluation	2017-05-18 08:30:33 -05:00
ines	489d2fb4ba	Add is_in_jupyter() helper for displaCy (see #1058)	2017-05-18 14:13:14 +02:00
ines	abf0188b0a	Move cupy and CudaStream to compat	2017-05-18 14:12:45 +02:00
ines	33decd85b6	Reorganise and explicitly state what's importable	2017-05-18 14:12:31 +02:00
Matthew Honnibal	a438cef8c5	Fix significant bug in feature calculation -- off by 1	2017-05-18 06:21:32 -05:00
Matthew Honnibal	fc8d3a112c	Add util.env_opt support: Can set hyper params through environment variables.	2017-05-18 04:36:53 -05:00
Matthew Honnibal	d2626fdb45	Fix name error in nn parser	2017-05-18 04:31:01 -05:00
Matthew Honnibal	b460533827	Bug fixes to pipeline	2017-05-18 04:29:51 -05:00
Matthew Honnibal	8815507f8e	Move SpanishDefaults out of Language class, for pickle	2017-05-18 04:28:51 -05:00
Matthew Honnibal	2713041571	Fix GPU usage in Language	2017-05-18 04:25:19 -05:00
Matthew Honnibal	711ad5edc4	Cache features in doc2feats	2017-05-18 04:22:20 -05:00
Matthew Honnibal	39ea38c4b1	Add option to use gpu to spacy train	2017-05-18 04:21:49 -05:00
Matthew Honnibal	a1d8e420b5	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-17 08:00:04 -05:00
Matthew Honnibal	edfea3a513	Fix progress bar	2017-05-17 14:59:37 +02:00
Matthew Honnibal	0b7fd67408	Fix style check in displacy	2017-05-17 07:57:24 -05:00
Matthew Honnibal	55dab77de8	Add conversion rule for .conll	2017-05-17 13:13:48 +02:00
Matthew Honnibal	692bd2a186	Bug fix to tagger: wasn't backpropagating to token vectors	2017-05-17 13:13:14 +02:00
Matthew Honnibal	877f83807f	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-17 12:09:29 +02:00
Matthew Honnibal	793430aa7a	Get spaCy train command working with neural network: integrate models into pipeline; add basic serialization (maybe incorrect); fix pickle on vocab	2017-05-17 12:04:50 +02:00
Matthew Honnibal	3bf4a28d8d	Use tag in CoNLL converter, not POS	2017-05-17 12:04:33 +02:00
ines	1a05078c79	Add language-specific syntax iterators to en and de	2017-05-17 12:04:03 +02:00
Matthew Honnibal	c9a5d5d24b	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-16 16:22:05 +02:00
Matthew Honnibal	8cf097ca88	Redesign training to integrate NN components: obsolete .parser, .entity etc names in favour of .pipeline; components no longer create models on initialization; models created by loading method (from_disk(), from_bytes() etc), or .begin_training(); add .predict(), .set_annotations() methods in components; pass state through pipeline, to allow components to share information more flexibly.	2017-05-16 16:17:30 +02:00
Matthew Honnibal	221b4c1ee8	Fix test for Python 3	2017-05-16 13:06:30 +02:00
Matthew Honnibal	5211645af3	Get data flowing through pipeline. Needs redesign	2017-05-16 11:21:59 +02:00
Matthew Honnibal	1d7c18e58a	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-15 21:53:47 +02:00
Matthew Honnibal	a9edb3aa1d	Improve integration of NN parser, to support unified training API	2017-05-15 21:53:27 +02:00
ines	98354be150	Only get user_data if it exists on doc	2017-05-15 13:39:47 +02:00
ines	c33bdeb564	Use uppercase for entity types	2017-05-15 01:24:57 +02:00
ines	4aaa607b8d	Add xmlns:xlink so SVGs are rendered properly as individual files	2017-05-14 19:54:13 +02:00
ines	9dd13cd76a	Update docstrings	2017-05-14 19:30:47 +02:00
ines	a04550605a	Add Jupyter notebook support (see #1058)	2017-05-14 18:39:01 +02:00
ines	c31792aaec	Add displaCy visualisers (see #1058)	2017-05-14 17:50:23 +02:00
ines	b462076d80	Merge load_lang_class and get_lang_class	2017-05-14 01:31:10 +02:00
ines	36bebe7164	Update docstrings	2017-05-14 01:30:29 +02:00
Matthew Honnibal	4b9d69f428	Merge branch 'v2' into develop: move v2 parser into nn_parser.pyx; new TokenVectorEncoder class in pipeline.pyx; new spacy/_ml.py module. Currently the two parsers live side-by-side, until we figure out how to organize them.	2017-05-14 01:10:23 +02:00
Matthew Honnibal	5cac951a16	Move new parser to nn_parser.pyx, and restore old parser, to make tests pass.	2017-05-14 00:55:01 +02:00
Matthew Honnibal	f8c02b4341	Remove cupy imports from parser, so it can work on CPU	2017-05-14 00:37:53 +02:00
Matthew Honnibal	613ba79e2e	Fiddle with sizings for parser	2017-05-13 17:20:23 -05:00
Matthew Honnibal	e6d71e1778	Small fixes to parser	2017-05-13 17:19:04 -05:00
Matthew Honnibal	188c0f6949	Clean up unused import	2017-05-13 17:18:27 -05:00
Matthew Honnibal	f85c8464f7	Draft support of regression loss in parser	2017-05-13 17:17:27 -05:00
ines	1694c24e52	Add docstrings, error messages and fix consistency	2017-05-13 21:22:49 +02:00
ines	ee7dcf65c9	Fix expand_exc to make sure it returns combined dict	2017-05-13 21:22:25 +02:00
ines	824d09bb74	Move resolve_load_name to deprecated	2017-05-13 21:21:47 +02:00
ines	a4a37a783e	Remove import from non-existing module	2017-05-13 16:00:09 +02:00
ines	5858857a78	Update languages list in conftest	2017-05-13 15:37:54 +02:00
ines	9d85cda8e4	Fix models error message and use about.__docs_models__ (see #1051)	2017-05-13 13:05:47 +02:00
ines	6b942763f0	Tidy up imports	2017-05-13 13:04:40 +02:00
ines	8c2a0c026d	Fix parse_tree test	2017-05-13 12:32:45 +02:00
ines	6129016e15	Replace deepcopy	2017-05-13 12:32:37 +02:00
ines	df68bf45ce	Set defaults for light and flat kwargs	2017-05-13 12:32:23 +02:00
ines	b9dea345e5	Remove old import	2017-05-13 12:32:11 +02:00
ines	293ee359c5	Fix formatting	2017-05-13 12:32:06 +02:00
ines	4eefb288e3	Port over PR #1055	2017-05-13 03:25:32 +02:00
Matthew Honnibal	ee1d35bdb0	Fix merge conflict	2017-05-13 03:20:19 +02:00
Matthew Honnibal	b2540d2379	Merge Kengz's tree_print patch	2017-05-13 03:18:49 +02:00
Matthew Honnibal	827b5af697	Update draft of parser neural network model

Model is good, but code is messy. Currently requires Chainer, which may cause the build to fail on machines without a GPU.

Outline of the model:

We first predict context-sensitive vectors for each word in the input:

(embed_lower | embed_prefix | embed_suffix | embed_shape) >> Maxout(token_width) >> convolution ** 4

This convolutional layer is shared between the tagger and the parser. This prevents the parser from needing tag features. To boost the representation, we make a "super tag" with POS, morphology and dependency label. The tagger predicts this by adding a softmax layer onto the convolutional layer --- so, we're teaching the convolutional layer to give us a representation that's one affine transform from this informative lexical information. This is obviously good for the parser (which backprops to the convolutions too).

The parser model makes a state vector by concatenating the vector representations for its context tokens. Current results suggest few context tokens works well. Maybe this is a bug. The current context tokens:

* S0, S1, S2: Top three words on the stack
* B0, B1: First two words of the buffer
* S0L1, S0L2: Leftmost and second leftmost children of S0
* S0R1, S0R2: Rightmost and second rightmost children of S0
* S1L1, S1L2, S1R2, S1R, B0L1, B0L2: Likewise for S1 and B0

This makes the state vector quite long: 13T, where T is the token vector width (128 is working well). Fortunately, there's a way to structure the computation to save some expense (and make it more GPU friendly).

The parser typically visits 2N states for a sentence of length N (although it may visit more, if it back-tracks with a non-monotonic transition). A naive implementation would require 2N (B, 13T) @ (13T, H) matrix multiplications for a batch of size B. We can instead perform one (BN, T) @ (T, 13*H) multiplication, to pre-compute the hidden weights for each positional feature wrt the words in the batch. (Note that our token vectors come from the CNN -- so we can't play this trick over the vocabulary. That's how Stanford's NN parser works --- and why its model is so big.)

This pre-computation strategy allows a nice compromise between GPU-friendliness and implementation simplicity. The CNN and the wide lower layer are computed on the GPU, and then the precomputed hidden weights are moved to the CPU, before we start the transition-based parsing process. This makes a lot of things much easier. We don't have to worry about variable-length batch sizes, and we don't have to implement the dynamic oracle in CUDA to train.

Currently the parser's loss function is multilabel log loss, as the dynamic oracle allows multiple states to be 0 cost. This is defined as:

(exp(score) / Z) - (exp(score) / gZ)

Where gZ is the sum of the scores assigned to gold classes. I'm very interested in regressing on the cost directly, but so far this isn't working well.

Machinery is in place for beam-search, which has been working well for the linear model. Beam search should benefit greatly from the pre-computation trick.	2017-05-12 16:09:15 -05:00
ines	c4857bc7db	Remove unused argument	2017-05-12 15:37:54 +02:00
ines	c13b3fa052	Add LEX_ATTRS	2017-05-12 15:37:45 +02:00
ines	bca2ea9c72	Update Portuguese lexical attributes	2017-05-12 15:37:39 +02:00
ines	2f870123bf	Fix formatting	2017-05-12 15:37:20 +02:00
ines	ca65993d59	Add basic Polish Language class	2017-05-12 09:25:37 +02:00
ines	48177c4f92	Add missing tokenizer exceptions	2017-05-12 09:25:24 +02:00
ines	bb8be3d194	Add Danish language data	2017-05-10 21:15:12 +02:00
Matthew Honnibal	4efb391994	Fix serializer	2017-05-09 18:45:18 +02:00
Matthew Honnibal	b16ae75824	Remove serializer hacks from pipeline classes	2017-05-09 18:16:40 +02:00
Matthew Honnibal	7253b4e649	Remove old serialization tests	2017-05-09 18:12:58 +02:00
Matthew Honnibal	f9327343ce	Start updating serializer test	2017-05-09 18:12:03 +02:00
Matthew Honnibal	1166b0c491	Implement Doc.to_bytes and Doc.from_bytes methods	2017-05-09 18:11:34 +02:00
Matthew Honnibal	9e167b7bb6	Strip serializer from code	2017-05-09 17:28:50 +02:00
Matthew Honnibal	b53f7dfdc3	Remove spacy.serialize	2017-05-09 17:22:06 +02:00
Matthew Honnibal	62ecdea9f2	Add binder class for document serialization	2017-05-09 17:21:00 +02:00
ines	a0b00624bb	Make sure like_email returns bool	2017-05-09 11:37:29 +02:00
ines	ea60932e1b	Fix formatting	2017-05-09 11:08:14 +02:00
ines	2c3bdd09b1	Add English test for like_num	2017-05-09 11:06:34 +02:00
ines	22375eafb0	Fix and merge attrs and lex_attrs tests	2017-05-09 11:06:25 +02:00
ines	02d0ac5cab	Remove redundant function and fix formatting	2017-05-09 11:06:04 +02:00
ines	b5ca50607e	Reorganise entity rules	2017-05-09 01:37:10 +02:00
ines	564939391a	Remove spacy.orth	2017-05-09 01:21:47 +02:00
ines	12c3d5fbba	Fix formatting	2017-05-09 01:15:28 +02:00
ines	2829a024ef	Re-add basic like_num check to global lex_attrs	2017-05-09 01:15:23 +02:00
ines	88adeee548	Add English lex_attrs overrides	2017-05-09 01:09:52 +02:00
ines	8f3fbbb147	Fix typos	2017-05-09 01:09:37 +02:00
ines	ea5fa46475	Import LEX_ATTRS from lang.lex_attrs	2017-05-09 00:58:10 +02:00
ines	2216e5f326	Reorganise lex_attrs and add dict	2017-05-09 00:57:54 +02:00
ines	e666f14d20	Add global lex_attrs	2017-05-09 00:41:53 +02:00
ines	41972c43fe	Use consistent regex imports	2017-05-09 00:34:31 +02:00
ines	7b83977020	Remove unused munge package	2017-05-09 00:16:16 +02:00
ines	c714841cc8	Move language-specific tests to tests/lang	2017-05-09 00:02:37 +02:00
ines	bd57b611cc	Update conftest to lazy load languages	2017-05-09 00:02:21 +02:00
ines	9f0fd5963f	Reorganise Hungarian punctuation rules	2017-05-09 00:01:59 +02:00
ines	fc0d793360	Reorganise Bengali punctuation rules	2017-05-09 00:01:52 +02:00
ines	e895d1afd7	Reorganise French punctuation rules	2017-05-09 00:00:54 +02:00
ines	014bda0ae3	Reorganise global punctuation rules	2017-05-09 00:00:46 +02:00
ines	a91278cb32	Rename _URL_PATTERN to URL_PATTERN	2017-05-09 00:00:00 +02:00
ines	604f299cf6	Add char classes to global language data	2017-05-08 23:59:33 +02:00
ines	f6f5d78cb9	Fix formatting	2017-05-08 23:59:17 +02:00
ines	6eb6306843	Fix language data imports	2017-05-08 23:58:31 +02:00
ines	3c0f85de8e	Remove imports in /lang/__init__.py	2017-05-08 23:58:07 +02:00
ines	86d9c29f30	Reorder util functions	2017-05-08 23:51:15 +02:00
ines	9a0d2fdef1	Add load_lang_class() util function	2017-05-08 23:50:45 +02:00
ines	614aa09582	Tidy up Bengali tokenizer exceptions	2017-05-08 22:29:49 +02:00
ines	73b577cb01	Fix relative imports	2017-05-08 22:29:04 +02:00
ines	ae99990f63	Fix formatting	2017-05-08 22:23:48 +02:00
ines	f46ffe3e89	Move language data to /lang module	2017-05-08 20:00:40 +02:00
ines	41a322c733	Fix LEMMA in exceptions and morph rules	2017-05-08 19:57:36 +02:00
ines	2edc0aee12	Update warning message	2017-05-08 19:53:36 +02:00
ines	6025cdb992	Fix string interpolation in times	2017-05-08 16:38:16 +02:00
ines	b9ba58ba5c	Add function to resolve load name. Warn if old 'path' keyword argument is used.	2017-05-08 16:33:37 +02:00
ines	e6f1a5d0a1	Add unicode declaration	2017-05-08 16:22:17 +02:00
ines	be5541bd16	Fix import and tokenizer exceptions	2017-05-08 16:20:14 +02:00
ines	2324788970	Remove bad tests	2017-05-08 16:15:27 +02:00
ines	b88c4193e7	Add missing symbol	2017-05-08 16:15:20 +02:00
ines	9a5b2bdd4c	Don't set morph rules without tag map	2017-05-08 16:15:12 +02:00
ines	4930f0fa8f	Explicitly import TOKEN_MATCH	2017-05-08 16:11:54 +02:00
ines	50b7ec03ca	Fix typo	2017-05-08 16:11:45 +02:00
ines	3ca611fe48	Fix wildcard imports	2017-05-08 15:56:29 +02:00
ines	c2469b8135	Remove __all__ export	2017-05-08 15:56:22 +02:00
ines	14a9c3ee7a	Fix wildcard import	2017-05-08 15:56:13 +02:00
ines	deed623864	Remove comment	2017-05-08 15:56:05 +02:00
ines	e7f95c37ee	Merge base tokenizer exceptions	2017-05-08 15:55:52 +02:00
ines	24606d364c	Remove redundant language_data.py files in languages Originally intended to collect all components of a language, but just made things messy. Now each component is in charge of exporting itself properly.	2017-05-08 15:55:29 +02:00
ines	a627d3e3b0	Reorganise Chinese language data	2017-05-08 15:54:36 +02:00
ines	7b86ee093a	Reorganise Swedish language data	2017-05-08 15:54:29 +02:00
ines	50510fa947	Reorganise Portuguese language data	2017-05-08 15:52:01 +02:00
ines	279895ea83	Reorganise Dutch language data	2017-05-08 15:51:39 +02:00
ines	04ef5025bd	Reorganise Norwegian language data	2017-05-08 15:51:22 +02:00
ines	5edbc725d8	Reorganise Japanese language data	2017-05-08 15:50:46 +02:00
ines	51a389d3bb	Reorganise Italian language data	2017-05-08 15:50:17 +02:00
ines	1bbfa14436	Reorganise Hungarian language data	2017-05-08 15:49:56 +02:00
ines	a77c9fc60d	Reorganise Hebrew language data	2017-05-08 15:49:28 +02:00
ines	7f05e977fa	Reorganise French language data	2017-05-08 15:49:05 +02:00
ines	0207ffdd52	Reorganise Finnish language data	2017-05-08 15:48:31 +02:00
ines	8e483ec950	Reorganise Spanish language data	2017-05-08 15:48:04 +02:00
ines	c7c21b980f	Reorganise English language data	2017-05-08 15:47:25 +02:00
ines	1bf9d5ec8b	Reorganise German language data	2017-05-08 15:44:26 +02:00
ines	7b3a983f96	Reorganise Bengali language data	2017-05-08 15:43:50 +02:00
ines	607ba458e7	Fix whitespace	2017-05-08 15:42:31 +02:00
ines	60db497525	Add update_exc and expand_exc to util Doesn't require separate language data util anymore	2017-05-08 15:42:12 +02:00
Matthew Honnibal	b44f7e259c	Clean up unused parser code	2017-05-08 15:42:04 +02:00
ines	6e5bd4f228	Remove unused functions from deprecated	2017-05-08 15:40:16 +02:00
Matthew Honnibal	17efb1c001	Change width	2017-05-08 08:40:13 -05:00
ines	f68e420bc0	Add PRON_LEMMA and DET_LEMMA to deprecated Will be replaced with proper values across the language data later.	2017-05-08 15:35:30 +02:00
ines	bd6a7cf4f6	Simplify deprecated model downloading Only relevant for spaCy < v1.7.0.	2017-05-08 15:32:10 +02:00
ines	95edd9e896	Let parse_package_meta take full path	2017-05-08 15:30:48 +02:00
ines	326746eb15	Add util function to resolve arg to model path 1. check if in data dir or shortcut link 2. check if installed as a pip package 3. check if string is path to model 4. check if Path or Path-like object	2017-05-08 15:29:47 +02:00
Matthew Honnibal	bef89ef23d	Mergery	2017-05-08 08:29:36 -05:00
ines	a7801e7342	Update spacy.load() path argument is now deprecated and name can either take a model name or path. Implement lazy loading by importing module and read Language class name off __all__.	2017-05-08 15:27:25 +02:00
Matthew Honnibal	50ddc9fc45	Fix infinite loop bug	2017-05-08 07:54:26 -05:00
Matthew Honnibal	94e86ae00a	Predict tags with encoder	2017-05-08 07:53:45 -05:00
Matthew Honnibal	56073a11ef	Don't use tags when calculating token vectors	2017-05-08 07:52:24 -05:00
Matthew Honnibal	a66a4a4d0f	Replace einsums	2017-05-08 14:46:50 +02:00
Matthew Honnibal	8d2eab74da	Use PretrainableMaxouts	2017-05-08 14:24:55 +02:00
Matthew Honnibal	807cb2e370	Add PretrainableMaxouts	2017-05-08 14:24:43 +02:00
Matthew Honnibal	2e2268a442	Precomputable hidden now working	2017-05-08 11:36:37 +02:00
ines	94697e9afc	Fix typo	2017-05-08 02:00:37 +02:00
ines	0ee2a22b67	Merge branch 'pr/1024' into develop	2017-05-08 01:12:44 +02:00
ines	c4492d260a	Fix kwargs	2017-05-08 01:05:24 +02:00
Matthew Honnibal	10682d35ab	Get pre-computed version working	2017-05-08 00:38:35 +02:00
ines	b5a726c5cd	Tidy up deprecated.py	2017-05-07 23:29:22 +02:00
ines	59c3b9d4dd	Tidy up CLI and fix print functions	2017-05-07 23:25:29 +02:00
ines	311704674d	Add path2str compat function	2017-05-07 23:24:56 +02:00
ines	e34069db9f	Move is_package and get_model_package_path to util	2017-05-07 23:24:51 +02:00
ines	957ba676b4	Add model files base path to about.py	2017-05-07 23:22:35 +02:00
ines	8d8dd9ceb2	Don't set default value for model	2017-05-07 23:22:21 +02:00
Matthew Honnibal	35458987e8	Checkpoint -- nearly finished reimpl	2017-05-07 23:05:01 +02:00
Matthew Honnibal	4441866f55	Checkpoint -- nearly finished reimpl	2017-05-07 22:47:06 +02:00
Matthew Honnibal	6782eedf9b	Tmp GPU code	2017-05-07 11:04:24 -05:00
Matthew Honnibal	e420e5a809	Tmp	2017-05-07 07:31:09 -05:00
Matthew Honnibal	12039e80ca	Switch to single matmul for state layer	2017-05-07 14:26:34 +02:00
Matthew Honnibal	700979fb3c	CPU/GPU compat	2017-05-07 04:01:11 +02:00
Matthew Honnibal	f99f5b75dc	working residual net	2017-05-07 03:57:26 +02:00
Matthew Honnibal	bdf2dba9fb	WIP on refactor, with hidde pre-computing	2017-05-07 02:02:43 +02:00
Matthew Honnibal	b439e04f8d	Learning smoothly	2017-05-06 20:38:12 +02:00
Matthew Honnibal	08bee76790	Learns things	2017-05-06 18:24:38 +02:00
Matthew Honnibal	04ae1c01f1	Learns things	2017-05-06 18:21:02 +02:00
Matthew Honnibal	bcf4cd0a5f	Learns things	2017-05-06 17:37:36 +02:00
Matthew Honnibal	8e48b58cd6	Gradients look correct	2017-05-06 16:47:15 +02:00
Matthew Honnibal	7e04260d38	Data running through, likely errors in model	2017-05-06 14:22:20 +02:00
Matthew Honnibal	fa7c1990b6	Restore tok2vec function	2017-05-05 20:12:03 +02:00
Matthew Honnibal	efe9630e1c	Bug fixes	2017-05-05 20:09:50 +02:00
Matthew Honnibal	ef4fa594aa	Draft of NN parser, to be tested	2017-05-05 19:20:39 +02:00
Matthew Honnibal	7d1df50aec	Draft up Parser model	2017-05-04 13:31:40 +02:00
Matthew Honnibal	ccaf26206b	Pseudocode for parser	2017-05-04 12:17:59 +02:00
ines	b1f22c5a10	Fix formatting	2017-05-03 20:11:02 +02:00
ines	a04b5be1b2	Add glossary for annotation scheme (closes #1034) Can be imported as explain from spacy.glossary, or called as spacy.explain(term)	2017-05-03 17:02:17 +02:00
Gregory Howard	929f2792a7	Rennaming cls in module. cls is now a class	2017-05-03 15:41:07 +02:00
Gregory Howard	0e8c41ea4f	Adding method lemmatizer for every class	2017-05-03 12:14:42 +02:00
Gregory Howard	32ca07989e	adding export japanese	2017-05-03 11:07:29 +02:00
Grégory Howard	f9d7144224	Merge branch 'master' into master	2017-05-03 11:04:51 +02:00
Gregory Howard	f2ab7d77b4	Lazy imports language	2017-05-03 11:01:42 +02:00
Ines Montani	3ea23a3f4d	Fix formatting	2017-05-03 09:44:38 +02:00
Ines Montani	d730eb0c0d	Raise custom ImportError if importing janome fails	2017-05-03 09:43:29 +02:00
Ines Montani	949ad6594b	Add newline	2017-05-03 09:38:43 +02:00
Ines Montani	d12ca587ea	Add newline	2017-05-03 09:38:29 +02:00
Ines Montani	8676cd0135	Add newline	2017-05-03 09:38:07 +02:00
Yasuaki Uechi	c8f83aeb87	Add basic japanese support	2017-05-03 13:56:21 +09:00
Gregory Howard	c0afcd22bb	Merge remote-tracking branch 'remotes/upstream/master'	2017-04-27 14:42:54 +02:00
Matthew Honnibal	31ec9e1371	Merge branch 'master' of https://github.com/explosion/spaCy	2017-04-27 13:21:39 +02:00
Matthew Honnibal	2da16adcc2	Add dropout optin for parser and NER Dropout can now be specified in the `Parser.update()` method via the `drop` keyword argument, e.g. nlp.entity.update(doc, gold, drop=0.4) This will randomly drop 40% of features, and multiply the value of the others by 1. / 0.4. This may be useful for generalising from small data sets. This commit also patches the examples/training/train_new_entity_type.py example, to use dropout and fix the output (previously it did not output the learned entity).	2017-04-27 13:18:39 +02:00
Gregory Howard	92f368f83b	Removing extra spaces	2017-04-27 12:02:14 +02:00
Gregory Howard	13b6957c8e	Adding unitest for tokenization in french (with title)	2017-04-27 11:53:44 +02:00
Gregory Howard	8ff4682255	correcting tokenizer exception. Adding tests for lemmatization	2017-04-27 11:52:14 +02:00
Ines Montani	7da9cefd25	Merge pull request #1022 from luvogels/master Initial support for Norwegian Bokmål	2017-04-27 11:16:06 +02:00
Ines Montani	c9e592ae6c	Add newline	2017-04-27 11:15:41 +02:00
Ines Montani	5942adccc2	Add newline	2017-04-27 11:15:19 +02:00
Ines Montani	4cd9269aef	Add newline	2017-04-27 11:15:04 +02:00
Ines Montani	ccf13ecc21	Add newline	2017-04-27 11:14:42 +02:00
Ines Montani	03d2b0cc05	Add newline	2017-04-27 11:14:26 +02:00
Gregory Howard	44cb486849	Adding unitest for tokenization in french (with title)	2017-04-27 10:59:38 +02:00
Gregory Howard	ad8129cb45	Improvement of rules now title insentive and have same declaration format	2017-04-27 10:23:56 +02:00
luvogels	d12a0b6431	Hooked up tokenizer tests	2017-04-26 23:21:41 +02:00
Matthew Honnibal	f0e1606d27	Increment version	2017-04-26 20:25:41 +02:00
luvogels	b331929a7e	Merge branch 'master' of https://github.com/luvogels/spaCy	2017-04-26 19:15:48 +02:00
luvogels	8de59ce3b9	Added tokenizer tests	2017-04-26 19:10:18 +02:00
Matthew Honnibal	4d98511db7	Make Span hashable. Closes #1019	2017-04-26 19:01:05 +02:00
Matthew Honnibal	24c4c51f13	Try to make test999 less flakey	2017-04-26 18:42:06 +02:00
Leif Uwe Vogelsang	460094bf09	Update __init__.py	2017-04-26 18:27:55 +02:00
ines	527d51ac9a	Fetch shortcuts from GitHub and improve error handling	2017-04-26 18:00:28 +02:00
Gregory Howard	ed5f094451	Adding insensitive lemmatisation test	2017-04-25 18:07:02 +02:00
ghoward	26e31afc18	renamming tests	2017-04-25 17:46:01 +02:00
ghoward	c085c2d391	Adding some unitests	2017-04-25 17:44:16 +02:00
ghoward	55c6910f90	Look_up table for languages in spacy. Need to find an another name for lemmatizerlookup. I was not inspired. Trying to uses new files in fr language.	2017-04-24 16:39:00 +02:00
Matthew Honnibal	c4be9c36fe	Fix unicode header in tests	2017-04-24 10:09:01 +02:00
Matthew Honnibal	65f10b53e5	Fix test	2017-04-24 00:25:55 +02:00
Matthew Honnibal	70a43858e1	Fix flakey test	2017-04-24 00:06:30 +02:00
Matthew Honnibal	3973af2d15	Make training test less flakey	2017-04-23 22:59:34 +02:00
Matthew Honnibal	4f9657b42b	Fix reporting if no dev data with train	2017-04-23 22:27:10 +02:00
Matthew Honnibal	df2ac8b843	Merge branch 'master' of https://github.com/explosion/spaCy	2017-04-23 21:25:07 +02:00
Matthew Honnibal	d0e19267e8	Create directory if missing in save_to_directory	2017-04-23 21:24:43 +02:00
ines	42305bc519	Remove unnecessary test	2017-04-23 21:21:41 +02:00
ines	012ea594d1	Add file for misc tests	2017-04-23 21:06:51 +02:00
ines	83f66947dc	Rename test_download to test_cli	2017-04-23 21:06:50 +02:00
ines	401045433c	Simplify compat.fix_text	2017-04-23 21:06:50 +02:00
Matthew Honnibal	e033c86a64	Increment version	2017-04-23 21:03:43 +02:00
Matthew Honnibal	d2436dc17b	Update fix for Issue #999	2017-04-23 18:14:37 +02:00
Matthew Honnibal	874a3cbb07	Add test for Issue #955	2017-04-23 17:57:01 +02:00
Matthew Honnibal	60703cede5	Ensure noun chunks can't be nested. Closes #955	2017-04-23 17:56:39 +02:00
Matthew Honnibal	c9ec24b257	Merge branch 'master' of https://github.com/explosion/spaCy	2017-04-23 17:07:46 +02:00
Matthew Honnibal	5d8af40445	Add test for Issue #999	2017-04-23 17:06:30 +02:00
Matthew Honnibal	4d2a659c52	Fix json dump for Python3	2017-04-23 17:05:53 +02:00
Matthew Honnibal	040751ad17	Remove xfail on Test #910	2017-04-23 16:28:55 +02:00
ines	3a9710f356	Pass dev_scores to print_progress correctly (resolves #1008) Only read scores attribute if command is used with dev_data, otherwise default dev_scores to empty dict.	2017-04-23 15:58:40 +02:00
Matthew Honnibal	1b12f342e4	Merge branch 'master' of https://github.com/explosion/spaCy	2017-04-20 17:03:11 +02:00
Matthew Honnibal	4eef200bab	Persist the actions within spacy.parser.cfg	2017-04-20 17:02:44 +02:00
ines	25c70b4cc5	Move fix_text to spacy.compat (see #1002)	2017-04-20 15:47:17 +02:00
Ines Montani	60b5243bee	Merge pull request #1002 from oroszgy/model_cli_fix Fixes for the `model` CLI	2017-04-20 15:41:03 +02:00
Gyorgy Orosz	4a06a2572c	Using ftfy for handling broken encoded strings.	2017-04-20 13:34:51 +02:00
Ines Montani	3800b29046	Merge pull request #1001 from recognai/master Add SPACE to es tag map	2017-04-20 12:16:34 +02:00
oeg	f0bcd0babb	fix(model): Add SPACE to es tag_map. Fixing error in morphology.pyx when SP tag is missing	2017-04-20 11:36:24 +02:00
Ben Eyal	e90e8a3f10	Enable test	2017-04-20 02:25:24 +03:00
Ben Eyal	33af52599e	Redefine alphabetic characters For caseless languages (Hebrew, Bengali) all characters are both lowercase and uppercase.	2017-04-20 02:25:02 +03:00
Ben Eyal	d8098a8be2	Use `regex` instead of `re`	2017-04-20 02:22:52 +03:00
oeg	daaa42dd25	Merge remote-tracking branch 'upstream/master'	2017-04-19 23:30:36 +02:00
oeg	936a297241	fix(model): Fix tag map for fixing issues with tag SPACE	2017-04-19 23:30:21 +02:00
luvogels	c7cec7e5e2	Update __init__.py	2017-04-19 21:06:30 +02:00
luvogels	55e8cade36	Update __init__.py	2017-04-19 21:06:30 +02:00
luvogels	03abd0c8e6	Update __init__.py	2017-04-19 21:06:30 +02:00
Leif Uwe Vogelsang	538a8d6b12	Resolved merge conflict by incorporating both suggestions.	2017-04-19 21:06:07 +02:00
Leif Uwe Vogelsang	e821c48489	Norwegian language basics	2017-04-19 21:04:01 +02:00
Leif Uwe Vogelsang	3796c668d9	more norwegian	2017-04-19 21:01:32 +02:00
Leif Uwe Vogelsang	bc9557b21f	Norwegian language basics	2017-04-19 21:00:01 +02:00
ines	2bd89e7ade	Tidy up Hebrew tests and test for punctuation (see #995)	2017-04-19 19:28:03 +02:00
ines	48da244058	Use spacy.compat.json_dumps for Python 2/3 compatibility (resolves #991)	2017-04-19 11:50:36 +02:00
ines	ddd5194088	Update Language docs and docstrings	2017-04-17 01:52:13 +02:00
ines	f62b740961	Use compat.json_dumps	2017-04-17 01:46:14 +02:00
ines	8e83f8e2fa	Update docstrings	2017-04-17 01:40:26 +02:00
ines	e2299dc389	Ensure path in save_to_directory	2017-04-17 01:40:14 +02:00
ines	82f5f1f98f	Replace str with compat.unicode_	2017-04-17 01:29:54 +02:00
ines	16a8521efa	Increment version	2017-04-16 22:38:38 +02:00
Matthew Honnibal	4efd6fb9d6	Fix training	2017-04-16 15:28:27 -05:00
Matthew Honnibal	17c9fffb9e	Fix naked except	2017-04-16 15:28:16 -05:00
ines	5610fdcc06	Get language name first if no model path exists Makes sure spaCy fails early if no tokenizer exists, and allows printing better error message.	2017-04-16 22:16:47 +02:00
ines	ad168ba88c	Set model name to empty string if path override exists Required for parse_package_meta, which composes path of data_path and model_name (needs to be fixed in the future)	2017-04-16 22:15:51 +02:00
ines	97647c46cd	Add docstring and todo note	2017-04-16 22:14:45 +02:00
ines	5c5f8c0a72	Check if full string is found in lang classes first This allows users to set arbitrary strings. (Otherwise, custom lang class "my_custom_class" would always load Burmese "my" tokenizer if one was available.)	2017-04-16 22:14:38 +02:00
ines	13d30b6c01	xfail lemmatizer test that's causing problems (see #546)	2017-04-16 21:18:39 +02:00
Matthew Honnibal	4931c56afc	Increment version	2017-04-16 13:59:38 -05:00
ines	6145b7c153	Remove redundant Path	2017-04-16 20:53:25 +02:00
Matthew Honnibal	fa89613444	Merge branch 'master' of https://github.com/explosion/spaCy	2017-04-16 13:42:56 -05:00
ines	1f9f867c70	Remove unused util function	2017-04-16 20:37:45 +02:00
ines	7670c745b6	Update spacy.load() and fix path checks	2017-04-16 20:37:45 +02:00
ines	d3759dfb32	Fix docstring	2017-04-16 20:37:45 +02:00
ines	ed7e19ad68	Remove unused import	2017-04-16 20:37:45 +02:00
ines	0084466a66	Remove unused utf8open util and replace os.path with ensure_path	2017-04-16 20:37:45 +02:00
Matthew Honnibal	89a4f262fc	Fix training methods	2017-04-16 13:00:37 -05:00
Matthew Honnibal	6a4221a6de	Allow lemma to be set from Python. Re #973	2017-04-16 18:07:53 +02:00
Matthew Honnibal	137b210bcf	Restore use of FTRL training	2017-04-16 18:02:42 +02:00
ines	d10bd0eaf9	Fix formatting	2017-04-16 13:42:34 +02:00
ines	8191e33cf1	Update link error message with info on permissions	2017-04-16 13:32:31 +02:00
ines	a3ddbc0444	Add note about --force flag to error message	2017-04-16 13:14:36 +02:00
ines	e3de035814	Add meta validation to check for required settings Complain if no "lang", "name" or "version" is found (those settings are used in directory / package names). Package will still build without, but it'll inevitably fail somewhere down the line.	2017-04-16 13:13:17 +02:00
ines	a7574b7572	Add more options to read in meta data in package command Add meta option to supply path to meta.json. If no meta path is set, check if meta.json exists in input directory and use it. Otherwise, prompt for details on the command line.	2017-04-16 13:06:02 +02:00
ines	13c8a42d2b	Fix typos	2017-04-16 13:03:58 +02:00
ines	31fa73293a	Move read_json out to own util function	2017-04-16 13:03:28 +02:00
Matthew Honnibal	45464d065e	Remove print statement	2017-04-15 16:11:43 +02:00
Matthew Honnibal	c76cb8af35	Fix training for new labels	2017-04-15 16:11:26 +02:00
Matthew Honnibal	4884b2c113	Refix StepwiseState	2017-04-15 16:00:28 +02:00
Matthew Honnibal	e6ee7e130f	Fix parse package meta	2017-04-15 13:38:53 +02:00
Matthew Honnibal	1a98e48b8e	Fix Stepwisestate'	2017-04-15 13:35:01 +02:00
ines	0739ae7b76	Tidy up and fix formatting and imports	2017-04-15 13:05:15 +02:00
ines	fefe6684cd	Fix symlink function to check for Windows	2017-04-15 12:17:27 +02:00
ines	35fb4febe2	Fix whitespace	2017-04-15 12:13:45 +02:00
ines	e1efd589c3	Fix json imports and use ujson	2017-04-15 12:13:34 +02:00
ines	958b12dec8	Use pathlib instead of os.path	2017-04-15 12:13:00 +02:00
ines	956dc36785	Move functions to deprecated	2017-04-15 12:12:31 +02:00
ines	c05ec4b89a	Add compat functions and remove old workarounds Add ensure_path util function to handle checking instance of path	2017-04-15 12:11:16 +02:00
ines	26445ee304	Add compat module for Python2/3 and platform compatibility	2017-04-15 12:07:02 +02:00
ines	d24589aa72	Clean up imports, unused code, whitespace, docstrings	2017-04-15 12:05:47 +02:00
ines	561f2a3eb4	Use consistent formatting for docstrings	2017-04-15 11:59:21 +02:00
Matthew Honnibal	d13f0a7017	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-04-14 23:54:57 +02:00
Matthew Honnibal	354458484c	WIP on add_label bug during NER training Currently when a new label is introduced to NER during training, it causes the labels to be read in in an unexpected order. This invalidates the model.	2017-04-14 23:52:17 +02:00
Matthew Honnibal	33ba5066eb	Refactor Language.end_training, making new save_to_directory method	2017-04-14 23:51:24 +02:00
ines	84341c2975	Only compile list of models if data_path exists	2017-04-14 16:48:02 +02:00
Gyorgy Orosz	dd3244c08a	Made json dump to produce unicode strings in py2	2017-04-13 23:30:47 +02:00
Gyorgy Orosz	a9469c8173	Fixed typo	2017-04-13 15:24:14 +02:00
ines	41037f0f07	Remove unused imports	2017-04-13 13:52:11 +02:00
ines	1b92c8d5d5	Use unicode paths on Windows/Python 2 and catch other errors (resolves #970) try/except here is quite dirty, but it'll at least make sure users see an error message that explains what's going on	2017-04-10 17:49:51 +02:00
Matthew Honnibal	49e2de900e	Add costs property to StepwiseState, to show which moves are gold.	2017-04-10 11:37:04 +02:00
Matthew Honnibal	e26577b202	Increment version	2017-04-07 18:45:06 +02:00
Matthew Honnibal	40bf7ecf27	Increment version	2017-04-07 18:44:20 +02:00
Matthew Honnibal	1dca7eeb03	Add unicode declaration on new regression test	2017-04-07 18:09:23 +02:00
ines	887827fc6a	Merge branch 'develop'	2017-04-07 17:36:23 +02:00
ines	444dd511c5	Fix xpassing URL test case	2017-04-07 17:36:05 +02:00
ines	bf0f15e762	Add / to tokenizer infixes (resolves #891)	2017-04-07 17:30:44 +02:00
ines	00b9011a49	Fix whitespace	2017-04-07 17:29:59 +02:00
ines	f9869e4dc5	Merge branch 'master' into develop	2017-04-07 17:23:40 +02:00
Matthew Honnibal	4a6204dbad	Merge remote-tracking branch 'origin/develop'	2017-04-07 17:20:09 +02:00
Matthew Honnibal	0513c43bf0	Merge branch 'master' of https://github.com/explosion/spaCy	2017-04-07 17:07:10 +02:00
Matthew Honnibal	cc36c308f4	Fix noun_chunk rules around coordination Closes #693.	2017-04-07 17:06:40 +02:00
Matthew Honnibal	ab846256cf	Merge pull request #966 from recognai/master Prepare Spanish language for training models, including configuration, rich-UD tag map and tests	2017-04-07 16:12:29 +02:00
Matthew Honnibal	83dca920d4	Rename test #913 -> #957, comment Make test for #957 reference correct bug. Add comment. Previous commit closes #957.	2017-04-07 15:54:25 +02:00
Matthew Honnibal	be204ed714	Merge branch 'master' of https://github.com/explosion/spaCy	2017-04-07 15:50:14 +02:00
Matthew Honnibal	e7b1ee9efd	Switch to regex module for URL identification The URL detection regex was failing on input such as 0.1.2.3, as this input triggered excessive back-tracking in the builtin re module. The solution was to switch to the regex module, which behaves better. Closes #913.	2017-04-07 15:47:36 +02:00
Matthew Honnibal	5887383fc0	Add test for Issue #913: Hang from bad regex	2017-04-07 15:47:27 +02:00
ines	7ea1673072	Fix whitespace	2017-04-07 13:28:48 +02:00
ines	255650dbc2	Add connlu2json converter from explosion/spacy-dev-resources/#11	2017-04-07 13:05:12 +02:00
ines	789ce8a45e	Add convert command	2017-04-07 13:04:17 +02:00
ines	9952d3b08a	Fix whitespace	2017-04-07 13:02:05 +02:00
ines	47ddce6eb7	Remove unused variable	2017-04-07 13:01:48 +02:00
ines	dcf8ab0c47	Merge branch 'develop'	2017-04-07 12:00:09 +02:00
ines	75f9b4c6e2	Fix whitespace	2017-04-07 10:22:18 +02:00
oeg	c693d40791	feature(model): Add support for creating the Spanish model, including rich tagset, configuration, and basich tests	2017-04-06 18:48:45 +02:00
oeg	010293fb2f	fix(typo): Fixes typo in method calling PseudoProjectivity.deprojectivize, failing with new train cli	2017-04-06 17:33:15 +02:00
ines	808cd6cf7f	Add missing tags to verbs (resolves #948)	2017-04-03 18:12:52 +02:00
ines	ad8bf1829f	Import and combine Portuguese tokenizer exceptions (see #943)	2017-04-01 10:37:42 +02:00
Ines Montani	f8b2d9c3b7	Merge pull request #943 from mamoit/master Portuguese improvements	2017-04-01 10:32:00 +02:00
ines	3b667a24d4	Remove whitespace	2017-04-01 10:21:08 +02:00
ines	e71a1f4bd0	Fix download commands in error messages (see #946)	2017-04-01 10:20:57 +02:00
ines	42382d5692	Fix download commands in error messages (see #946)	2017-04-01 10:19:32 +02:00
ines	d4a59c254b	Remove whitespace	2017-04-01 10:19:01 +02:00
Matthew Honnibal	51882ee2b8	Fix check for setting ent_id in merge	2017-03-31 19:32:01 +02:00
Miguel Almeida	4fde64c4ea	Portuguese contractions and some abreviations	2017-03-31 15:52:55 +01:00
Miguel Almeida	465b240bcb	Review Portuguese stop words Mainly to review typos and add missing masculines/feminines	2017-03-31 13:00:47 +01:00

3704 Commits