spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-12-10 19:54:17 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	89c92c65fb	Update version	2019-07-28 17:56:38 +02:00
Matthew Honnibal	06eb428ed1	Make pipe base class a bit less presumptuous	2019-07-28 17:56:11 +02:00
Matthew Honnibal	16b5144095	Don't raise NotImplemented in Pipe.update	2019-07-28 17:54:11 +02:00
Ines Montani	fc69da0acb	💫 Support simple training format in nlp.evaluate and add tests (#4033) * Support simple training format in nlp.evaluate and add tests * Update docs [ci skip]	2019-07-27 17:30:18 +02:00
Ines Montani	a3723f439c	Fix formatting [ci skip]	2019-07-27 16:35:42 +02:00
Ines Montani	d5bce35fb1	Fix bug in Span.similarity when called via hook	2019-07-27 15:33:27 +02:00
Ines Montani	109b5e1798	Fix bug in Token.similarity when called via hook	2019-07-27 15:26:01 +02:00
Ines Montani	e000b5ed82	Also support "requirements" in model.json	2019-07-27 13:34:57 +02:00
Ines Montani	307ffe472d	Support custom language factory setting in meta.json (#4031)	2019-07-27 13:17:43 +02:00
Bae Yong-Ju	05fbf5d976	Fix error when Korean text contains regexp special characters. (#4022)	2019-07-25 17:53:33 +02:00
Matthew Honnibal	73e095923f	💫 Improve error message when model.from_bytes() dies (#4014 ) * Improve error message when model.from_bytes() dies When Thinc's model.from_bytes() is called with a mismatched model, often we get a particularly ungraceful error, e.g. "AttributeError: FunctionLayer has no attribute G" This is because we're trying to load the parameters for something like a LayerNorm layer, and the model architecture has some other layer there instead. This is obviously terrible, especially since the error type is wrong. I've changed it to raise a ValueError. The error message is still probably a bit terse, but it's hard to be sure exactly what's gone wrong. * Update spacy/pipeline/pipes.pyx * Update spacy/pipeline/pipes.pyx * Update spacy/pipeline/pipes.pyx * Update spacy/syntax/nn_parser.pyx * Update spacy/syntax/nn_parser.pyx * Update spacy/pipeline/pipes.pyx Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com> * Update spacy/pipeline/pipes.pyx Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com> Co-authored-by: Ines Montani <ines@ines.io>	2019-07-24 11:27:34 +02:00
Ines Montani	87fcf3141c	Merge pull request #4003 from svlandeg/feature/nel-fixes API changes for Entity linking functionality	2019-07-23 23:17:07 +02:00
Paul O'Leary McCann	c8949ce88a	Remove old comment (#4012) Norwegian used to borrow from French but that doesn't appear to have been true for a while now, so the comment that was here is no longer relevant.	2019-07-23 23:10:06 +02:00
Sofie Van Landeghem	ba02957c80	Fix dependency copy for as_doc (#3969 ) * failing unit test for issue 3962 * attempt to fix Issue #3962 * create artificial unit test example * using length instead of self.length * sp * reformat with black * find better ancestor within span and use generic 'dep' * attach to span.root if there is no appropriate ancestor * comment span text * clean up ancestor code * reconstruct dep tree to keep same number of sentences	2019-07-23 18:28:54 +02:00
svlandeg	4e7ec1ed31	return fix	2019-07-23 14:23:58 +02:00
svlandeg	400ff342cf	replace assert's with custom error messages	2019-07-23 11:52:48 +02:00
svlandeg	20389e4553	format and bugfix	2019-07-22 15:08:17 +02:00
svlandeg	b1911f7105	Errors.E146 for IO error when FP is null	2019-07-22 14:56:13 +02:00
svlandeg	5d544f89ba	Errors.E145 for IO errors when reading KB	2019-07-22 14:36:07 +02:00
Ines Montani	a32b033b8c	Add regression test for #4002 Test that the PhraseMatcher can match on overwritten NORM attributes.	2019-07-22 14:18:24 +02:00
svlandeg	ad65171837	Merge remote-tracking branch 'upstream/master' into feature/nel-fixes	2019-07-22 13:41:28 +02:00
svlandeg	76184374e2	test corner cases	2019-07-22 13:39:32 +02:00
svlandeg	9f8c1e71a2	fix for Issue #4000	2019-07-22 13:34:12 +02:00
svlandeg	dae8a21282	rename entity frequency	2019-07-19 17:40:28 +02:00
svlandeg	41fb5204ba	output tensors as part of predict	2019-07-19 14:47:36 +02:00
svlandeg	21176517a7	have gold.links correspond exactly to doc.ents	2019-07-19 12:36:15 +02:00
BreakBB	3e370cf2ba	Add 'Prof.' to Englisch tokenizer_exceptions	2019-07-19 10:00:45 +02:00
svlandeg	e1213eaf6a	use original gold object in get_loss function	2019-07-18 13:35:10 +02:00
svlandeg	ec55d2fccd	filter training data beforehand (+black formatting)	2019-07-18 10:22:24 +02:00
Falak Asad	ff1e73e35c	Bugfix/issue 3968 (#3982) * Fix for issue-3968 * Added contributor agreement * Made suggested changes	2019-07-18 00:20:32 +02:00
svlandeg	d833d4c358	fixes in kb and gold	2019-07-17 17:18:26 +02:00
Ines Montani	73565c6d9d	Rename function arguments	2019-07-17 14:29:52 +02:00
Matthew Honnibal	394e4d8058	Add docstring for spacy.gold.align	2019-07-17 13:59:17 +02:00
Ines Montani	073013f129	Auto-format [ci skip]	2019-07-17 12:34:13 +02:00
svlandeg	4086c6ff60	get vector functionality + unit test	2019-07-17 12:17:02 +02:00
Ines Montani	62ff128888	Add regression test for #3951	2019-07-16 14:00:00 +02:00
Ines Montani	7f551050b1	Add regression test for #3972	2019-07-16 13:07:35 +02:00
svlandeg	a63d15a142	code cleanup	2019-07-15 17:36:43 +02:00
svlandeg	cdc589d344	small fix	2019-07-15 12:04:45 +02:00
svlandeg	60f299374f	set default context width	2019-07-15 12:03:09 +02:00
svlandeg	6e809e9b8b	proper error for missing cfg arguments	2019-07-15 11:42:50 +02:00
svlandeg	6026958957	tokenizer doc fix	2019-07-15 11:19:34 +02:00
Ines Montani	c0e29f7029	Merge pull request #3957 from sorenlind/danish-tokenizer-slash Make Danish tokenizer split on forward slash	2019-07-12 18:19:22 +02:00
Matthew Honnibal	ef666656b3	Fix attrs alignment	2019-07-12 17:59:47 +02:00
Matthew Honnibal	c345c042b0	Fix symbol alignment	2019-07-12 17:48:38 +02:00
Ines Montani	7281026879	Increment version [ci skip]	2019-07-12 17:40:00 +02:00
Søren Lind Kristiansen	26aee70d95	Make Danish tokenizer split on forward slash	2019-07-12 15:20:42 +02:00
Matthew Honnibal	3bc4d618f9	Set version to v2.1.5	2019-07-12 13:26:12 +02:00
Sofie Van Landeghem	ed774cb953	Fixing ngram bug (#3953) * minimal failing example for Issue #3661 * referenced Issue #3661 instead of Issue #3611 * cleanup	2019-07-12 10:01:35 +02:00
Matthew Honnibal	09dc01a426	Fix #3853, and add warning	2019-07-11 14:46:47 +02:00
Matthew Honnibal	7369949d2e	Add warning for #3853	2019-07-11 14:46:47 +02:00
Ines Montani	673c864a06	Fix doc.count_by functionality (#3950) Fix doc.count_by functionality	2019-07-11 13:44:00 +02:00
Ines Montani	2426f4d44c	Fix default punctuation rules for splitting Hindi text (#3948 ) Fix default punctuation rules for splitting Hindi text Co-authored-by: yash <patadiayash@gmail.com> Co-authored-by: Ines Montani <ines@ines.io>	2019-07-11 13:36:28 +02:00
svlandeg	349107daa3	cleanup	2019-07-11 13:09:22 +02:00
svlandeg	0f0f07318a	counter instead of preshcounter	2019-07-11 13:05:53 +02:00
Matthew Honnibal	b40b4c2c31	💫 Fix issue #3839: Incorrect entity IDs from Matcher with operators (#3949) * Add regression test for issue #3541 * Add comment on bugfix * Remove incorrect test * Un-xfail test	2019-07-11 12:55:11 +02:00
Matthew Honnibal	e19f4ee719	Add warning message re Issue #3853	2019-07-11 12:50:38 +02:00
Ines Montani	197cfd7ebc	Merge branch 'master' into pr/3948	2019-07-11 12:18:31 +02:00
Ines Montani	d166756607	Fix test	2019-07-11 12:16:43 +02:00
Ines Montani	0b8406a05c	Tidy up and auto-format	2019-07-11 12:02:25 +02:00
yash	6751af3e78	Merge branch 'master' of https://github.com/yash1994/spaCy	2019-07-11 15:26:57 +05:30
yash	ae2d52e323	Add default encoding utf-8 for test file	2019-07-11 15:26:27 +05:30
Ines Montani	33ca0a036a	Merge branch 'master' into pr/3948	2019-07-11 11:55:54 +02:00
Matthew Honnibal	0491a8e7c8	Reformat	2019-07-11 11:49:36 +02:00
Matthew Honnibal	bd3c3f342b	Fix _serialize	2019-07-11 11:48:55 +02:00
yash	815f8d13dd	Fix default punctuation rules for hindi text (#3625 explosion)	2019-07-11 15:00:51 +05:30
yash	d5311b3c42	Add test file for issue (#3625) and spacy contributor agreement	2019-07-11 14:53:14 +05:30
svlandeg	e080412385	tracked the bug down to PreshCounter.inc - still unclear what goes wrong	2019-07-11 01:53:06 +02:00
svlandeg	a89fecce97	failing unit test for issue #3869	2019-07-11 00:43:55 +02:00
Matthew Honnibal	a388888074	Merge branch 'master' of https://github.com/explosion/spaCy	2019-07-10 22:54:17 +02:00
Matthew Honnibal	c6cb782758	Set version to 2.1.5.dev0	2019-07-10 22:54:09 +02:00
Sofie Van Landeghem	c4c21cb428	more friendly textcat errors (#3946) * more friendly textcat errors with require_model and require_labels * update thinc version with recent bugfix	2019-07-10 19:39:38 +02:00
Matthew Honnibal	b94c5443d9	Rename Binder->DocBox, and improve it.	2019-07-10 19:37:20 +02:00
Matthew Honnibal	3d18600c05	Return True from doc.is_... when no ambiguity * Make doc.is_sentenced return True if len(doc) < 2. * Make doc.is_nered return True if len(doc) == 0, for consistency. Closes #3934	2019-07-10 19:21:42 +02:00
Matthew Honnibal	465456edb9	Un-xfail test #3880	2019-07-10 14:01:17 +02:00
Matthew Honnibal	87f7ec34d5	Add test for #3880	2019-07-10 13:53:55 +02:00
Ines Montani	4e04080b76	Only compare sorted patterns in test Try to work around flaky tests on Python 3.5	2019-07-10 13:00:52 +02:00
Ines Montani	82045aac8a	Merge regression tests	2019-07-10 12:49:18 +02:00
Ines Montani	40cd03fc35	Improve EntityRuler serialization	2019-07-10 12:25:45 +02:00
Ines Montani	570ab1f481	Fix handling of old entity ruler files Expected an `entity_ruler.jsonl` file in the top-level model directory, so the path passed to from_disk by default (model path plus componentn name), but with the suffix ".jsonl".	2019-07-10 12:14:12 +02:00
Ines Montani	874d914a44	Tidy up test	2019-07-10 12:13:23 +02:00
Ines Montani	ea2050079b	Auto-format	2019-07-10 12:03:05 +02:00
Ines Montani	6ba5ddbd5f	Merge pull request #3864 from svlandeg/feature/nel-wiki Entity linking using Wikipedia & Wikidata	2019-07-10 11:25:41 +02:00
Ines Montani	8721849423	Update Scorer.ents_per_type	2019-07-10 11:19:28 +02:00
Björn Böing	205c73a589	Update tokenizer and doc init example (#3939) * Fix Doc.to_json hyperlink * Update tokenizer and doc init examples * Change "matchin rules" to "punctuation rules" * Auto-format	2019-07-10 10:16:48 +02:00
cedar101	58f06e6180	Korean support (#3901 ) * start lang/ko * add test codes * using natto-py * add test_ko_tokenizer_full_tags() * spaCy contributor agreement * external dependency for ko * collections.namedtuple for python version < 3.5 * case fix * tuple unpacking * add jongseong(final consonant) * apply mecab option * Remove Pipfile for now Co-authored-by: Ines Montani <ines@ines.io>	2019-07-09 22:23:16 +02:00
Ines Montani	f2ea3e3ea2	Merge branch 'master' into feature/nel-wiki	2019-07-09 21:57:47 +02:00
Ines Montani	547464609d	Remove merge_subtokens from parser postprocessing for now	2019-07-09 21:50:30 +02:00
Björn Böing	04982ccc40	Update pretrain to prevent unintended overwriting of weight fil… (#3902 ) * Update pretrain to prevent unintended overwriting of weight files for #3859 * Add '--epoch-start' to pretrain docs * Add mising pretrain arguments to bash example * Update doc tag for v2.1.5	2019-07-09 21:48:30 +02:00
Alejandro Alcalde	6d577f0b92	Evaluation of NER model per entity type, closes #3490 (#3911 ) * Evaluation of NER model per entity type, closes ##3490 Now each ent score is tracked individually in order to have its own Precision, Recall and F1 Score * Keep track of each entity individually using dicts * Improving how to compute the scores for each entity * Fixed bug computing scores for ents * Formatting with black * Added key ents_per_type to the scores function The key `ents_per_type` contains the metrics Precision, Recall and F1-Score for each entity individually	2019-07-09 20:54:59 +02:00
Joshua Smith	2eb925bd05	Added an argument to `EntityRuler` constructor to pass attrs to… (#3919 ) * Perserve flags in EntityRuler The EntityRuler (explosion/spaCy#3526) does not preserve overwrite flags (or `ent_id_sep`) when serialized. This commit adds support for serialization/deserialization preserving overwrite and ent_id_sep flags. * add signed contributor agreement * flake8 cleanup mostly blank line issues. * mark test from the issue as needing a model The test from the issue needs some language model for serialization but the test wasn't originally marked correctly. * Adds `phrase_matcher_attr` to allow args to PhraseMatcher This is an added arg to pass to the `PhraseMatcher`. For example, this allows creation of a case insensitive phrase matcher when the `EntityRuler` is created. References explosion/spaCy#3822 * remove unneeded model loading The model didn't need to be loaded, and I replaced it with a change that doesn't require it (using existings fixtures) * updated docstring for new argument * updated docs to reflect new argument to the EntityRuler constructor * change tempdir handling to be compatible with python 2.7 * return conflicted code to entityruler Some stuff got cut out because of merge conflicts, this returns that code for the phrase_matcher_attr. * fixed typo in the code added back after conflicts * flake8 compliance When I deconflicted the branch there were some flake8 issues introduced. This resolves the spacing problems. * test changes: attempts to fix flaky test in python3.5 These tests seem to be alittle flaky in 3.5 so I changed the check to avoid the comparisons that seem to be fail sometimes.	2019-07-09 20:09:17 +02:00
Joshua Smith	e8420ab2b7	Added support for serializing overwrite and ent_id_sep (#3918 ) * Perserve flags in EntityRuler The EntityRuler (explosion/spaCy#3526) does not preserve overwrite flags (or `ent_id_sep`) when serialized. This commit adds support for serialization/deserialization preserving overwrite and ent_id_sep flags. * add signed contributor agreement * flake8 cleanup mostly blank line issues. * mark test from the issue as needing a model The test from the issue needs some language model for serialization but the test wasn't originally marked correctly. * remove unneeded model loading The model didn't need to be loaded, and I replaced it with a change that doesn't require it (using existings fixtures) * change tempdir handling to be compatible with python 2.7 * Adds code to handle item saved before this change. This code chanes how the save files are handled and how the bytes are stored as well. This code adds check to dispatch correctly if it encounters bytes or files saved in the old format (and tests for those cases). * use util function for tempdir management Updated after PR comments: this code now uses the make_tempdir function from util instead of doing it by hand.	2019-07-08 17:28:28 +02:00
Knut O. Hellan	a54f0cfc2b	Norwegian tweaks (#3894 ) * Norwegian fix Add support for alternative past tense verb form (vaska). * Norwegian months Add all Norwegian months to tokenizer excpetions. * More Norwegian abbreviations Add more Norwegian abbreviations to tokenizer_exceptions. * Contributor agreement khellan Add signed contributor agreement for khellan (Knut O. Hellan).	2019-07-08 10:28:47 +02:00
Rokas Ramanauskas	61ce126d4c	Lithuanian language support (#3895 ) * initial LT lang support * Added more stopwords. Started setting up some basic test environment (not complete) * Initial morph rules for LT lang * Closes #1 Adds tokenizer exceptions for Lithuanian * Closes #5 Punctuation rules. Closes #6 Lexical Attributes * test: add native examples to basic tests * feat: add tag map for lt lang * fix: remove undefined tag attribute 'Definite' * feat: add lemmatizer for lt lang * refactor: add new instances to lt lang morph rules; use tags from tag map * refactor: add morph rules to lt lang defaults * refactor: only keep nouns, verbs, adverbs and adjectives in lt lang lemmatizer lookup * refactor: add capitalized words to lt lang lemmatizer * refactor: add more num words to lt lang lex attrs * refactor: update lt lang stop word set * refactor: add new instances to lt lang tokenizer exceptions * refactor: remove comments form lt lang init file * refactor: use function instead of lambda in lt lex lang getter * refactor: remove conversion to dict in lt init when dict is already provided * chore: rename lt 'test_basic' to 'test_text' * feat: add more lt text tests * feat: add lemmatizer tests * refactor: remove unused imports, add newline to end of file * chore: add contributor agreement * chore: change 'en' to 'lt' in lt example description * fix: add missing encoding info * style: add newline to end of file * refactor: use python2 compatible syntax * style: reformat code using black	2019-07-08 10:25:22 +02:00
svlandeg	0ea52c86b8	remove redundancy	2019-07-03 15:02:10 +02:00
svlandeg	668b17ea4a	deuglify kb deserializer	2019-07-03 15:00:42 +02:00
svlandeg	8840d4b1b3	fix for context encoder optimizer	2019-07-03 13:35:36 +02:00
svlandeg	2d2dea9924	experiment with adding NER types to the feature vector	2019-06-29 14:52:36 +02:00
svlandeg	c664f58246	adding prior probability as feature in the model	2019-06-28 16:22:58 +02:00
svlandeg	1c80b85241	fix tests	2019-06-28 08:59:23 +02:00

6162 Commits