spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-03-26 12:54:12 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	bb911e5f4e	Fix #3830: 'subtok' label being added even if learn_tokens=False (#4188) * Prevent subtok label if not learning tokens The parser introduces the subtok label to mark tokens that should be merged during post-processing. Previously this happened even if we did not have the --learn-tokens flag set. This patch passes the config through to the parser, to prevent the problem. * Make merge_subtokens a parser post-process if learn_subtokens * Fix train script * Add test for 3830: subtok problem * Fix handling of non-subtok in parser training	2019-08-23 17:54:00 +02:00
Ines Montani	f5d3afb1a3	Fix typo in docstrings [ci skip]	2019-08-22 16:24:15 +02:00
adrianeboyd	8fe7bdd0fa	Improve token pattern checking without validation (#4105) * Fix typo in rule-based matching docs * Improve token pattern checking without validation Add more detailed token pattern checks without full JSON pattern validation and provide more detailed error messages. Addresses #4070 (also related: #4063, #4100). * Check whether top-level attributes in patterns and attr for PhraseMatcher are in token pattern schema * Check whether attribute value types are supported in general (as opposed to per attribute with full validation) * Report various internal error types (OverflowError, AttributeError, KeyError) as ValueError with standard error messages * Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS, LEMMA, and DEP * Add error messages with relevant details on how to use validate=True or nlp() instead of nlp.make_doc() * Support attr=TEXT for PhraseMatcher * Add NORM to schema * Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler * Remove unnecessary .keys() * Rephrase error messages * Add another type check to Matcher Add another type check to Matcher for more understandable error messages in some rare cases. * Support phrase_matcher_attr=TEXT for EntityRuler * Don't use spacy.errors in examples and bin scripts * Fix error code * Auto-format Also try get Azure pipelines to finally start a build :( * Update errors.py Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2019-08-21 14:00:37 +02:00
Ines Montani	f65e36925d	Fix absolute imports and avoid importing from cli	2019-08-20 15:08:59 +02:00
Sofie Van Landeghem	0ba1b5eebc	CLI scripts for entity linking (wikipedia & generic) (#4091) * document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bool's instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustments to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * turn kb_creator into CLI script (wip) * proper parameters for training entity vectors * wikidata pipeline split up into two executable scripts * remove context_width * move wikidata scripts in bin directory, remove old dummy script * refine KB script with logs and preprocessing options * small edits * small improvements to logging of EL CLI script	2019-08-13 15:38:59 +02:00
adrianeboyd	69aca7d839	Add validate option to EntityRuler (#4089) * Add validate option to EntityRuler * Add validate to EntityRuler, passed to Matcher and PhraseMatcher * Add validate to usage and API docs * Update website/docs/usage/rule-based-matching.md Co-Authored-By: Ines Montani <ines@ines.io> * Update website/docs/usage/rule-based-matching.md Co-Authored-By: Ines Montani <ines@ines.io>	2019-08-07 00:40:53 +02:00
Matthew Honnibal	4632c597e7	Fix Pipe base class	2019-08-01 17:29:01 +02:00
Sofie Van Landeghem	7de3b129ab	Resolve edge case when calling textcat.predict with empty doc (#4035) * resolve edge case where no doc has tokens when calling textcat.predict * more explicit value test	2019-07-30 14:58:01 +02:00
Matthew Honnibal	06eb428ed1	Make pipe base class a bit less presumptuous	2019-07-28 17:56:11 +02:00
Matthew Honnibal	16b5144095	Don't raise NotImplemented in Pipe.update	2019-07-28 17:54:11 +02:00
Matthew Honnibal	73e095923f	💫 Improve error message when model.from_bytes() dies (#4014) * Improve error message when model.from_bytes() dies When Thinc's model.from_bytes() is called with a mismatched model, often we get a particularly ungraceful error, e.g. "AttributeError: FunctionLayer has no attribute G" This is because we're trying to load the parameters for something like a LayerNorm layer, and the model architecture has some other layer there instead. This is obviously terrible, especially since the error type is wrong. I've changed it to raise a ValueError. The error message is still probably a bit terse, but it's hard to be sure exactly what's gone wrong. * Update spacy/pipeline/pipes.pyx * Update spacy/pipeline/pipes.pyx * Update spacy/pipeline/pipes.pyx * Update spacy/syntax/nn_parser.pyx * Update spacy/syntax/nn_parser.pyx * Update spacy/pipeline/pipes.pyx Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com> * Update spacy/pipeline/pipes.pyx Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com> Co-authored-by: Ines Montani <ines@ines.io>	2019-07-24 11:27:34 +02:00
svlandeg	4e7ec1ed31	return fix	2019-07-23 14:23:58 +02:00
svlandeg	400ff342cf	replace assert's with custom error messages	2019-07-23 11:52:48 +02:00
svlandeg	20389e4553	format and bugfix	2019-07-22 15:08:17 +02:00
svlandeg	41fb5204ba	output tensors as part of predict	2019-07-19 14:47:36 +02:00
svlandeg	21176517a7	have gold.links correspond exactly to doc.ents	2019-07-19 12:36:15 +02:00
svlandeg	e1213eaf6a	use original gold object in get_loss function	2019-07-18 13:35:10 +02:00
svlandeg	ec55d2fccd	filter training data beforehand (+black formatting)	2019-07-18 10:22:24 +02:00
svlandeg	a63d15a142	code cleanup	2019-07-15 17:36:43 +02:00
svlandeg	60f299374f	set default context width	2019-07-15 12:03:09 +02:00
Sofie Van Landeghem	c4c21cb428	more friendly textcat errors (#3946) * more friendly textcat errors with require_model and require_labels * update thinc version with recent bugfix	2019-07-10 19:39:38 +02:00
Ines Montani	40cd03fc35	Improve EntityRuler serialization	2019-07-10 12:25:45 +02:00
Ines Montani	570ab1f481	Fix handling of old entity ruler files Expected an `entity_ruler.jsonl` file in the top-level model directory, so the path passed to from_disk by default (model path plus component name), but with the suffix ".jsonl".	2019-07-10 12:14:12 +02:00
Ines Montani	ea2050079b	Auto-format	2019-07-10 12:03:05 +02:00
Ines Montani	f2ea3e3ea2	Merge branch 'master' into feature/nel-wiki	2019-07-09 21:57:47 +02:00
Ines Montani	547464609d	Remove merge_subtokens from parser postprocessing for now	2019-07-09 21:50:30 +02:00
Joshua Smith	2eb925bd05	Added an argument to `EntityRuler` constructor to pass attrs to… (#3919) * Preserve flags in EntityRuler The EntityRuler (explosion/spaCy#3526) does not preserve overwrite flags (or `ent_id_sep`) when serialized. This commit adds support for serialization/deserialization preserving overwrite and ent_id_sep flags. * add signed contributor agreement * flake8 cleanup mostly blank line issues. * mark test from the issue as needing a model The test from the issue needs some language model for serialization but the test wasn't originally marked correctly. * Adds `phrase_matcher_attr` to allow args to PhraseMatcher This is an added arg to pass to the `PhraseMatcher`. For example, this allows creation of a case insensitive phrase matcher when the `EntityRuler` is created. References explosion/spaCy#3822 * remove unneeded model loading The model didn't need to be loaded, and I replaced it with a change that doesn't require it (using existing fixtures) * updated docstring for new argument * updated docs to reflect new argument to the EntityRuler constructor * change tempdir handling to be compatible with python 2.7 * return conflicted code to entityruler Some stuff got cut out because of merge conflicts, this returns that code for the phrase_matcher_attr. * fixed typo in the code added back after conflicts * flake8 compliance When I deconflicted the branch there were some flake8 issues introduced. This resolves the spacing problems. * test changes: attempts to fix flaky test in python3.5 These tests seem to be a little flaky in 3.5 so I changed the check to avoid the comparisons that seem to fail sometimes.	2019-07-09 20:09:17 +02:00
Joshua Smith	e8420ab2b7	Added support for serializing overwrite and ent_id_sep (#3918) * Preserve flags in EntityRuler The EntityRuler (explosion/spaCy#3526) does not preserve overwrite flags (or `ent_id_sep`) when serialized. This commit adds support for serialization/deserialization preserving overwrite and ent_id_sep flags. * add signed contributor agreement * flake8 cleanup mostly blank line issues. * mark test from the issue as needing a model The test from the issue needs some language model for serialization but the test wasn't originally marked correctly. * remove unneeded model loading The model didn't need to be loaded, and I replaced it with a change that doesn't require it (using existing fixtures) * change tempdir handling to be compatible with python 2.7 * Adds code to handle item saved before this change. This code changes how the save files are handled and how the bytes are stored as well. This code adds check to dispatch correctly if it encounters bytes or files saved in the old format (and tests for those cases). * use util function for tempdir management Updated after PR comments: this code now uses the make_tempdir function from util instead of doing it by hand.	2019-07-08 17:28:28 +02:00
svlandeg	668b17ea4a	deuglify kb deserializer	2019-07-03 15:00:42 +02:00
svlandeg	8840d4b1b3	fix for context encoder optimizer	2019-07-03 13:35:36 +02:00
svlandeg	2d2dea9924	experiment with adding NER types to the feature vector	2019-06-29 14:52:36 +02:00
svlandeg	c664f58246	adding prior probability as feature in the model	2019-06-28 16:22:58 +02:00
svlandeg	68a0662019	context encoder with Tok2Vec + linking model instead of cosine	2019-06-28 08:29:31 +02:00
Ines Montani	37f744ca00	Auto-format [ci skip]	2019-06-26 14:48:09 +02:00
svlandeg	1de61f68d6	improve speed of prediction loop	2019-06-26 13:53:10 +02:00
svlandeg	58a5b40ef6	clean up duplicate code	2019-06-24 15:19:58 +02:00
svlandeg	b58bace84b	small fixes	2019-06-24 10:55:04 +02:00
svlandeg	cc9ae28a52	custom error and warning messages	2019-06-19 12:35:26 +02:00
svlandeg	791327e3c5	Merge remote-tracking branch 'upstream/master' into feature/nel-wiki	2019-06-19 09:44:05 +02:00
svlandeg	a31648d28b	further code cleanup	2019-06-19 09:15:43 +02:00
svlandeg	478305cd3f	small tweaks and documentation	2019-06-18 18:38:09 +02:00
svlandeg	0d177c1146	clean up code, remove old code, move to bin	2019-06-18 13:20:40 +02:00
svlandeg	ffae7d3555	sentence encoder only (removing article/mention encoder)	2019-06-18 00:05:47 +02:00
Kabir Khan	1e19f34e29	Add optional `id` property to EntityRuler patterns (#3591) * Adding support for entity_id in EntityRuler pipeline component * Adding Spacy Contributor agreement * Updating EntityRuler to use string.format instead of f strings * Update Entity Ruler to support an 'id' attribute per pattern that explicitly identifies an entity. * Fixing tests * Remove custom extension entity_id and use built in ent_id token attribute. * Changing entity_id to ent_id for consistent naming * entity_ids => ent_ids * Removing kb, cleaning up tests, making util functions private, use rsplit instead of split	2019-06-16 13:29:04 +02:00
svlandeg	b312f2d0e7	redo training data to be independent of KB and entity-level instead of doc-level	2019-06-14 15:55:26 +02:00
svlandeg	78dd3e11da	write entity linking pipe to file and keep vocab consistent between kb and nlp	2019-06-13 16:25:39 +02:00
svlandeg	b12001f368	small fixes	2019-06-12 22:05:53 +02:00
svlandeg	6521cfa132	speeding up training	2019-06-12 13:37:05 +02:00
svlandeg	fe1ed432ef	eval on dev set, varying combo's of prior and context scores	2019-06-11 11:40:58 +02:00
svlandeg	83dc7b46fd	first tests with EL pipe	2019-06-10 21:25:26 +02:00

77 Commits