spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-08 03:04:13 +03:00

Author	SHA1	Message	Date
AJ Rader	2f3648700c	Correction of default lemmatizer lookup in English (Issue # 4104) (#4110) * pytest file for issue4104 established * edited default lookup english lemmatizer for spun; fixes issue 4102 * eliminated parameterization and sorted dictionary dependnency in issue 4104 test * added contributor agreement	2019-08-15 11:39:10 +02:00
Ines Montani	1711b5eb62	💫 Support displaCy user colors via entry point (#4113)	2019-08-13 15:59:55 +02:00
Sofie Van Landeghem	0ba1b5eebc	CLI scripts for entity linking (wikipedia & generic) (#4091) * document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bool's instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustements to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * turn kb_creator into CLI script (wip) * proper parameters for training entity vectors * wikidata pipeline split up into two executable scripts * remove context_width * move wikidata scripts in bin directory, remove old dummy script * refine KB script with logs and preprocessing options * small edits * small improvements to logging of EL CLI script	2019-08-13 15:38:59 +02:00
黎谢鹏	250a54414b	update lang/zh (#4103) * update lang/zh * update lang/zh	2019-08-12 10:37:48 +02:00
Sofie Van Landeghem	963ea5e8d0	Update lemma and vector information after splitting a token (#4097) * fixing vector and lemma attributes after retokenizer.split * fixing unit test with mockup tensor * xp instead of numpy	2019-08-08 15:09:44 +02:00
Matthew Honnibal	04113a844d	Set version to v2.1.8	2019-08-07 13:53:58 +02:00
Ines Montani	6bec24cdd0	Require downloaded model in pkg_resources (#4090)	2019-08-07 13:18:11 +02:00
adrianeboyd	69aca7d839	Add validate option to EntityRuler (#4089) * Add validate option to EntityRuler * Add validate to EntityRuler, passed to Matcher and PhraseMatcher * Add validate to usage and API docs * Update website/docs/usage/rule-based-matching.md Co-Authored-By: Ines Montani <ines@ines.io> * Update website/docs/usage/rule-based-matching.md Co-Authored-By: Ines Montani <ines@ines.io>	2019-08-07 00:40:53 +02:00
Jeno	15be09ceb0	Raise error if annotation dict in simple training style has unexpected keys #4074 (#4079) * adding enhancement #4074. * modified behavior to strictly require top level dictionary keys - issue #4074 * pass expected keys to error message and add links as expected top level key	2019-08-06 11:01:25 +02:00
Sofie Van Landeghem	ad09b0d6f3	fetch norm from lex if necessary for matching (#4080)	2019-08-05 23:51:04 +02:00
Pavle Vidanović	e1a935d71c	Stopwords for Serbian language. (#4078) * Serbian stopwords added. (cyrillic alphabet) * spaCy Contribution agreement included. * Test initialize updated	2019-08-05 10:22:27 +02:00
veer-bains	874bd8c8dd	Fixed syntax error in lang/ko when using python 2 (#4082) (closes #4068) * fixed syntax error in declaring variables with python 2.7 in spacy/lang/ko/__init__.py * fixed syntax error in declaring variables with python 2.7 in spacy/lang/ko/__init__.py * Update __init__.py * Create veer-bains.md * Update __init__.py fixed syntax errors in variable datatype assignment when calling spacy.blank("ko") with python 2.7	2019-08-05 10:19:32 +02:00
Ines Montani	87ddbdc33e	Fix handling of kwargs in Language.evaluate Makes it consistent with other methods	2019-08-04 13:44:21 +02:00
Muhammad Irfan	d1d30b0442	added missing punctuation following conventions. (#4066)	2019-08-04 13:41:18 +02:00
Anastassia	33b14724a5	Update gold corpus code to properly ingest a directory of jsonl… (#4067) * Update gold corpus code to properly ingest a directory of jsonlines files In response to: https://github.com/explosion/spaCy/issues/3975 * Update spacy/gold.pyx Co-Authored-By: Ines Montani <ines@ines.io>	2019-08-02 09:58:51 +02:00
Matthew Honnibal	944a66c326	Add span.tensor and token.tensor attributes	2019-08-01 18:30:50 +02:00
Matthew Honnibal	d3071ecdbc	Set version to v2.1.7	2019-08-01 18:09:19 +02:00
Matthew Honnibal	97c51ef93b	Set version to v2.1.7.dev1	2019-08-01 17:29:25 +02:00
Matthew Honnibal	4632c597e7	Fix Pipe base class	2019-08-01 17:29:01 +02:00
Ines Montani	8718ca8b1f	Fix init_model if there's no vocab (closes #4048) (#4049)	2019-08-01 17:26:09 +02:00
adrianeboyd	925a852bb6	Improve NER per type scoring (#4052) * Improve NER per type scoring * include all gold labels in per type scoring, not only when recall > 0 * improve efficiency of per type scoring * Create Scorer tests, initially with NER tests * move regression test #3968 (per type NER scoring) to Scorer tests * add new test for per type NER scoring with imperfect P/R/F and per type P/R/F including a case where R == 0.0	2019-08-01 17:15:36 +02:00
Sofie Van Landeghem	f7d950de6d	ensure the lang of vocab and nlp stay consistent (#4057) * ensure the language of vocab and nlp stay consistent across serialization * equality with =	2019-08-01 17:13:01 +02:00
Sofie Van Landeghem	7de3b129ab	Resolve edge case when calling textcat.predict with empty doc (#4035) * resolve edge case where no doc has tokens when calling textcat.predict * more explicit value test	2019-07-30 14:58:01 +02:00
Matthew Honnibal	89c92c65fb	Update version	2019-07-28 17:56:38 +02:00
Matthew Honnibal	06eb428ed1	Make pipe base class a bit less presumptuous	2019-07-28 17:56:11 +02:00
Matthew Honnibal	16b5144095	Don't raise NotImplemented in Pipe.update	2019-07-28 17:54:11 +02:00
Ines Montani	fc69da0acb	💫 Support simple training format in nlp.evaluate and add tests (#4033) * Support simple training format in nlp.evaluate and add tests * Update docs [ci skip]	2019-07-27 17:30:18 +02:00
Ines Montani	a3723f439c	Fix formatting [ci skip]	2019-07-27 16:35:42 +02:00
Ines Montani	d5bce35fb1	Fix bug in Span.similarity when called via hook	2019-07-27 15:33:27 +02:00
Ines Montani	109b5e1798	Fix bug in Token.similarity when called via hook	2019-07-27 15:26:01 +02:00
Ines Montani	e000b5ed82	Also support "requirements" in model.json	2019-07-27 13:34:57 +02:00
Ines Montani	307ffe472d	Support custom language factory setting in meta.json (#4031)	2019-07-27 13:17:43 +02:00
Bae Yong-Ju	05fbf5d976	Fix error when Korean text contains regexp special characters. (#4022)	2019-07-25 17:53:33 +02:00
Matthew Honnibal	73e095923f	💫 Improve error message when model.from_bytes() dies (#4014) * Improve error message when model.from_bytes() dies When Thinc's model.from_bytes() is called with a mismatched model, often we get a particularly ungraceful error, e.g. "AttributeError: FunctionLayer has no attribute G" This is because we're trying to load the parameters for something like a LayerNorm layer, and the model architecture has some other layer there instead. This is obviously terrible, especially since the error type is wrong. I've changed it to raise a ValueError. The error message is still probably a bit terse, but it's hard to be sure exactly what's gone wrong. * Update spacy/pipeline/pipes.pyx * Update spacy/pipeline/pipes.pyx * Update spacy/pipeline/pipes.pyx * Update spacy/syntax/nn_parser.pyx * Update spacy/syntax/nn_parser.pyx * Update spacy/pipeline/pipes.pyx Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com> * Update spacy/pipeline/pipes.pyx Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com> Co-authored-by: Ines Montani <ines@ines.io>	2019-07-24 11:27:34 +02:00
Ines Montani	87fcf3141c	Merge pull request #4003 from svlandeg/feature/nel-fixes API changes for Entity linking functionality	2019-07-23 23:17:07 +02:00
Paul O'Leary McCann	c8949ce88a	Remove old comment (#4012) Norwegian used to borrow from French but that doesn't appear to have been true for a while now, so the comment that was here is no longer relevant.	2019-07-23 23:10:06 +02:00
Sofie Van Landeghem	ba02957c80	Fix dependency copy for as_doc (#3969) * failing unit test for issue 3962 * attempt to fix Issue #3962 * create artificial unit test example * using length instead of self.length * sp * reformat with black * find better ancestor within span and use generic 'dep' * attach to span.root if there is no appropriate ancestor * comment span text * clean up ancestor code * reconstruct dep tree to keep same number of sentences	2019-07-23 18:28:54 +02:00
svlandeg	4e7ec1ed31	return fix	2019-07-23 14:23:58 +02:00
svlandeg	400ff342cf	replace assert's with custom error messages	2019-07-23 11:52:48 +02:00
svlandeg	20389e4553	format and bugfix	2019-07-22 15:08:17 +02:00
svlandeg	b1911f7105	Errors.E146 for IO error when FP is null	2019-07-22 14:56:13 +02:00
svlandeg	5d544f89ba	Errors.E145 for IO errors when reading KB	2019-07-22 14:36:07 +02:00
Ines Montani	a32b033b8c	Add regression test for #4002 Test that the PhraseMatcher can match on overwritten NORM attributes.	2019-07-22 14:18:24 +02:00
svlandeg	ad65171837	Merge remote-tracking branch 'upstream/master' into feature/nel-fixes	2019-07-22 13:41:28 +02:00
svlandeg	76184374e2	test corner cases	2019-07-22 13:39:32 +02:00
svlandeg	9f8c1e71a2	fix for Issue #4000	2019-07-22 13:34:12 +02:00
svlandeg	dae8a21282	rename entity frequency	2019-07-19 17:40:28 +02:00
svlandeg	41fb5204ba	output tensors as part of predict	2019-07-19 14:47:36 +02:00
svlandeg	21176517a7	have gold.links correspond exactly to doc.ents	2019-07-19 12:36:15 +02:00
BreakBB	3e370cf2ba	Add 'Prof.' to Englisch tokenizer_exceptions	2019-07-19 10:00:45 +02:00

6135 Commits