spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-07-08 05:43:08 +03:00

Author	SHA1	Message	Date
Ines Montani	16c2522791	Merge branch 'master' into develop	2019-09-14 16:42:01 +02:00
adrianeboyd	6942a6a69b	Extend default punct for sentencizer (#4290 ) Most of these characters are for languages / writing systems that aren't supported by spacy, but I don't think it causes problems to include them. In the UD evals, Hindi and Urdu improve a lot as expected (from 0-10% to 70-80%) and Persian improves a little (90% to 96%). Tamil improves in combination with #4288. The punctuation list is converted to a set internally because of its increased length. Sentence final punctuation generated with: ``` unichars -gas '[\p{Sentence_Break=STerm}\p{Sentence_Break=ATerm}]' '\p{Terminal_Punctuation}' ``` See: https://stackoverflow.com/a/9508766/461847 Fixes #4269.	2019-09-14 15:25:48 +02:00
Ines Montani	27106d6528	Merge branch 'master' into develop	2019-09-13 17:07:17 +02:00
Sofie Van Landeghem	2ae5db580e	dim bugfix when incl_prior is False (#4285 )	2019-09-13 16:30:05 +02:00
Ines Montani	3c3658ef9f	Merge branch 'master' into develop	2019-09-12 18:03:01 +02:00
Ines Montani	228bbf506d	Improve label properties on pipes	2019-09-12 18:02:44 +02:00
Ines Montani	655b434553	Merge branch 'master' into develop	2019-09-12 11:39:18 +02:00
tamuhey	71909cdf22	Fix iss4278 (#4279 ) * fix: len(tuple) == 2 * (#4278) add fail test * add contributor's aggreement	2019-09-12 10:44:49 +02:00
Ines Montani	8ebc3711dc	Fix bug in Parser.labels and add test (#4275 )	2019-09-11 18:29:35 +02:00
Matthew Honnibal	c308cf3e3e	Merge branch 'master' into feature/lemmatizer	2019-08-25 13:52:27 +02:00
Matthew Honnibal	bb911e5f4e	Fix #3830 : 'subtok' label being added even if learn_tokens=False (#4188 ) * Prevent subtok label if not learning tokens The parser introduces the subtok label to mark tokens that should be merged during post-processing. Previously this happened even if we did not have the --learn-tokens flag set. This patch passes the config through to the parser, to prevent the problem. * Make merge_subtokens a parser post-process if learn_subtokens * Fix train script * Add test for 3830: subtok problem * Fix handlign of non-subtok in parser training	2019-08-23 17:54:00 +02:00
Matthew Honnibal	bcd08f20af	Merge changes from master	2019-08-21 14:18:52 +02:00
Ines Montani	f65e36925d	Fix absolute imports and avoid importing from cli	2019-08-20 15:08:59 +02:00
Sofie Van Landeghem	0ba1b5eebc	CLI scripts for entity linking (wikipedia & generic) (#4091 ) * document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bool's instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustements to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * turn kb_creator into CLI script (wip) * proper parameters for training entity vectors * wikidata pipeline split up into two executable scripts * remove context_width * move wikidata scripts in bin directory, remove old dummy script * refine KB script with logs and preprocessing options * small edits * small improvements to logging of EL CLI script	2019-08-13 15:38:59 +02:00
Matthew Honnibal	4632c597e7	Fix Pipe base class	2019-08-01 17:29:01 +02:00
Sofie Van Landeghem	7de3b129ab	Resolve edge case when calling textcat.predict with empty doc (#4035 ) * resolve edge case where no doc has tokens when calling textcat.predict * more explicit value test	2019-07-30 14:58:01 +02:00
Matthew Honnibal	06eb428ed1	Make pipe base class a bit less presumptuous	2019-07-28 17:56:11 +02:00
Matthew Honnibal	16b5144095	Don't raise NotImplemented in Pipe.update	2019-07-28 17:54:11 +02:00
Matthew Honnibal	73e095923f	💫 Improve error message when model.from_bytes() dies (#4014 ) * Improve error message when model.from_bytes() dies When Thinc's model.from_bytes() is called with a mismatched model, often we get a particularly ungraceful error, e.g. "AttributeError: FunctionLayer has no attribute G" This is because we're trying to load the parameters for something like a LayerNorm layer, and the model architecture has some other layer there instead. This is obviously terrible, especially since the error type is wrong. I've changed it to raise a ValueError. The error message is still probably a bit terse, but it's hard to be sure exactly what's gone wrong. * Update spacy/pipeline/pipes.pyx * Update spacy/pipeline/pipes.pyx * Update spacy/pipeline/pipes.pyx * Update spacy/syntax/nn_parser.pyx * Update spacy/syntax/nn_parser.pyx * Update spacy/pipeline/pipes.pyx Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com> * Update spacy/pipeline/pipes.pyx Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com> Co-authored-by: Ines Montani <ines@ines.io>	2019-07-24 11:27:34 +02:00
svlandeg	4e7ec1ed31	return fix	2019-07-23 14:23:58 +02:00
svlandeg	400ff342cf	replace assert's with custom error messages	2019-07-23 11:52:48 +02:00
svlandeg	20389e4553	format and bugfix	2019-07-22 15:08:17 +02:00
svlandeg	41fb5204ba	output tensors as part of predict	2019-07-19 14:47:36 +02:00
svlandeg	21176517a7	have gold.links correspond exactly to doc.ents	2019-07-19 12:36:15 +02:00
svlandeg	e1213eaf6a	use original gold object in get_loss function	2019-07-18 13:35:10 +02:00
svlandeg	ec55d2fccd	filter training data beforehand (+black formatting)	2019-07-18 10:22:24 +02:00
svlandeg	a63d15a142	code cleanup	2019-07-15 17:36:43 +02:00
svlandeg	60f299374f	set default context width	2019-07-15 12:03:09 +02:00
Sofie Van Landeghem	c4c21cb428	more friendly textcat errors (#3946 ) * more friendly textcat errors with require_model and require_labels * update thinc version with recent bugfix	2019-07-10 19:39:38 +02:00
Ines Montani	f2ea3e3ea2	Merge branch 'master' into feature/nel-wiki	2019-07-09 21:57:47 +02:00
Ines Montani	547464609d	Remove merge_subtokens from parser postprocessing for now	2019-07-09 21:50:30 +02:00
svlandeg	668b17ea4a	deuglify kb deserializer	2019-07-03 15:00:42 +02:00
svlandeg	8840d4b1b3	fix for context encoder optimizer	2019-07-03 13:35:36 +02:00
svlandeg	2d2dea9924	experiment with adding NER types to the feature vector	2019-06-29 14:52:36 +02:00
svlandeg	c664f58246	adding prior probability as feature in the model	2019-06-28 16:22:58 +02:00
svlandeg	68a0662019	context encoder with Tok2Vec + linking model instead of cosine	2019-06-28 08:29:31 +02:00
Ines Montani	37f744ca00	Auto-format [ci skip]	2019-06-26 14:48:09 +02:00
svlandeg	1de61f68d6	improve speed of prediction loop	2019-06-26 13:53:10 +02:00
svlandeg	58a5b40ef6	clean up duplicate code	2019-06-24 15:19:58 +02:00
svlandeg	b58bace84b	small fixes	2019-06-24 10:55:04 +02:00
svlandeg	cc9ae28a52	custom error and warning messages	2019-06-19 12:35:26 +02:00
svlandeg	791327e3c5	Merge remote-tracking branch 'upstream/master' into feature/nel-wiki	2019-06-19 09:44:05 +02:00
svlandeg	a31648d28b	further code cleanup	2019-06-19 09:15:43 +02:00
svlandeg	478305cd3f	small tweaks and documentation	2019-06-18 18:38:09 +02:00
svlandeg	0d177c1146	clean up code, remove old code, move to bin	2019-06-18 13:20:40 +02:00
svlandeg	ffae7d3555	sentence encoder only (removing article/mention encoder)	2019-06-18 00:05:47 +02:00
svlandeg	b312f2d0e7	redo training data to be independent of KB and entity-level instead of doc-level	2019-06-14 15:55:26 +02:00
svlandeg	78dd3e11da	write entity linking pipe to file and keep vocab consistent between kb and nlp	2019-06-13 16:25:39 +02:00
svlandeg	b12001f368	small fixes	2019-06-12 22:05:53 +02:00
svlandeg	6521cfa132	speeding up training	2019-06-12 13:37:05 +02:00

1 2

84 Commits