spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-09-14 16:12:39 +03:00

Author	SHA1	Message	Date
Ines Montani	547464609d	Remove merge_subtokens from parser postprocessing for now	2019-07-09 21:50:30 +02:00
Björn Böing	04982ccc40	Update pretrain to prevent unintended overwriting of weight fil… (#3902) * Update pretrain to prevent unintended overwriting of weight files for #3859 * Add '--epoch-start' to pretrain docs * Add missing pretrain arguments to bash example * Update doc tag for v2.1.5	2019-07-09 21:48:30 +02:00
Alejandro Alcalde	6d577f0b92	Evaluation of NER model per entity type, closes #3490 (#3911) * Evaluation of NER model per entity type, closes #3490 Now each ent score is tracked individually in order to have its own Precision, Recall and F1 Score * Keep track of each entity individually using dicts * Improving how to compute the scores for each entity * Fixed bug computing scores for ents * Formatting with black * Added key ents_per_type to the scores function The key `ents_per_type` contains the metrics Precision, Recall and F1-Score for each entity individually	2019-07-09 20:54:59 +02:00
Joshua Smith	2eb925bd05	Added an argument to `EntityRuler` constructor to pass attrs to… (#3919) * Preserve flags in EntityRuler The EntityRuler (explosion/spaCy#3526) does not preserve overwrite flags (or `ent_id_sep`) when serialized. This commit adds support for serialization/deserialization preserving overwrite and ent_id_sep flags. * add signed contributor agreement * flake8 cleanup mostly blank line issues. * mark test from the issue as needing a model The test from the issue needs some language model for serialization but the test wasn't originally marked correctly. * Adds `phrase_matcher_attr` to allow args to PhraseMatcher This is an added arg to pass to the `PhraseMatcher`. For example, this allows creation of a case insensitive phrase matcher when the `EntityRuler` is created. References explosion/spaCy#3822 * remove unneeded model loading The model didn't need to be loaded, and I replaced it with a change that doesn't require it (using existing fixtures) * updated docstring for new argument * updated docs to reflect new argument to the EntityRuler constructor * change tempdir handling to be compatible with python 2.7 * return conflicted code to entityruler Some stuff got cut out because of merge conflicts, this returns that code for the phrase_matcher_attr. * fixed typo in the code added back after conflicts * flake8 compliance When I deconflicted the branch there were some flake8 issues introduced. This resolves the spacing problems. * test changes: attempts to fix flaky test in python3.5 These tests seem to be a little flaky in 3.5 so I changed the check to avoid the comparisons that seem to fail sometimes.	2019-07-09 20:09:17 +02:00
Alex	a795fbd3b2	added contributor agreement ameyuuno.md (#3925) @ines hi! I asked to change my username (yuukos -> ameyuuno). So I added a new contributor agreement.	2019-07-09 10:09:52 +02:00
Joshua Smith	e8420ab2b7	Added support for serializing overwrite and ent_id_sep (#3918) * Preserve flags in EntityRuler The EntityRuler (explosion/spaCy#3526) does not preserve overwrite flags (or `ent_id_sep`) when serialized. This commit adds support for serialization/deserialization preserving overwrite and ent_id_sep flags. * add signed contributor agreement * flake8 cleanup mostly blank line issues. * mark test from the issue as needing a model The test from the issue needs some language model for serialization but the test wasn't originally marked correctly. * remove unneeded model loading The model didn't need to be loaded, and I replaced it with a change that doesn't require it (using existing fixtures) * change tempdir handling to be compatible with python 2.7 * Adds code to handle items saved before this change. This code changes how the save files are handled and how the bytes are stored as well. This code adds a check to dispatch correctly if it encounters bytes or files saved in the old format (and tests for those cases). * use util function for tempdir management Updated after PR comments: this code now uses the make_tempdir function from util instead of doing it by hand.	2019-07-08 17:28:28 +02:00
Knut O. Hellan	a54f0cfc2b	Norwegian tweaks (#3894) * Norwegian fix Add support for alternative past tense verb form (vaska). * Norwegian months Add all Norwegian months to tokenizer exceptions. * More Norwegian abbreviations Add more Norwegian abbreviations to tokenizer_exceptions. * Contributor agreement khellan Add signed contributor agreement for khellan (Knut O. Hellan).	2019-07-08 10:28:47 +02:00
Patrick Hogan	8c0586fd9c	Update example and sign contributor agreement (#3916) * Sign contributor agreement for askhogan * Remove unneeded `seen_tokens` which is never used within the scope	2019-07-08 10:27:20 +02:00
Rokas Ramanauskas	61ce126d4c	Lithuanian language support (#3895) * initial LT lang support * Added more stopwords. Started setting up some basic test environment (not complete) * Initial morph rules for LT lang * Closes #1 Adds tokenizer exceptions for Lithuanian * Closes #5 Punctuation rules. Closes #6 Lexical Attributes * test: add native examples to basic tests * feat: add tag map for lt lang * fix: remove undefined tag attribute 'Definite' * feat: add lemmatizer for lt lang * refactor: add new instances to lt lang morph rules; use tags from tag map * refactor: add morph rules to lt lang defaults * refactor: only keep nouns, verbs, adverbs and adjectives in lt lang lemmatizer lookup * refactor: add capitalized words to lt lang lemmatizer * refactor: add more num words to lt lang lex attrs * refactor: update lt lang stop word set * refactor: add new instances to lt lang tokenizer exceptions * refactor: remove comments from lt lang init file * refactor: use function instead of lambda in lt lex lang getter * refactor: remove conversion to dict in lt init when dict is already provided * chore: rename lt 'test_basic' to 'test_text' * feat: add more lt text tests * feat: add lemmatizer tests * refactor: remove unused imports, add newline to end of file * chore: add contributor agreement * chore: change 'en' to 'lt' in lt example description * fix: add missing encoding info * style: add newline to end of file * refactor: use python2 compatible syntax * style: reformat code using black	2019-07-08 10:25:22 +02:00
svlandeg	b7a0c9bf60	fixing the context/prior weight settings	2019-07-03 17:48:09 +02:00
svlandeg	0ea52c86b8	remove redundancy	2019-07-03 15:02:10 +02:00
svlandeg	668b17ea4a	deuglify kb deserializer	2019-07-03 15:00:42 +02:00
svlandeg	8840d4b1b3	fix for context encoder optimizer	2019-07-03 13:35:36 +02:00
svlandeg	3420cbe496	small fixes	2019-07-03 10:25:51 +02:00
svlandeg	2d2dea9924	experiment with adding NER types to the feature vector	2019-06-29 14:52:36 +02:00
svlandeg	c664f58246	adding prior probability as feature in the model	2019-06-28 16:22:58 +02:00
svlandeg	1c80b85241	fix tests	2019-06-28 08:59:23 +02:00
svlandeg	68a0662019	context encoder with Tok2Vec + linking model instead of cosine	2019-06-28 08:29:31 +02:00
Ines Montani	4f1dae1c6b	Update languages and examples (see #1107)	2019-06-26 16:19:17 +02:00
svlandeg	dbc53b9870	rename to KBEntryC	2019-06-26 15:55:26 +02:00
Ines Montani	37f744ca00	Auto-format [ci skip]	2019-06-26 14:48:09 +02:00
Ines Montani	d361e380b8	Fix matcher callback example (closes #3862)	2019-06-26 14:47:26 +02:00
Ines Montani	6ccdf37574	Exclude user_data when copying doc in displaCy (closes #3882)	2019-06-26 14:37:05 +02:00
svlandeg	1de61f68d6	improve speed of prediction loop	2019-06-26 13:53:10 +02:00
svlandeg	bee23cd8af	try Tok2Vec instead of SpacyVectors	2019-06-25 16:09:22 +02:00
svlandeg	8608685543	ensure Span.as_doc keeps the entity links + unit test	2019-06-25 15:28:51 +02:00
svlandeg	58a5b40ef6	clean up duplicate code	2019-06-24 15:19:58 +02:00
svlandeg	ddc73b11a9	fix unicode literals	2019-06-24 12:58:18 +02:00
Bram Vanroy	f22704621e	Update CITATION (#3873) As discussed in https://github.com/explosion/spaCy/pull/2167 the citation should look slightly different.	2019-06-24 11:03:16 +02:00
svlandeg	f4af47ce4a	Merge branch 'feature/nel-wiki' of https://github.com/svlandeg/spaCy into feature/nel-wiki	2019-06-24 10:57:07 +02:00
svlandeg	b58bace84b	small fixes	2019-06-24 10:55:04 +02:00
Ines Montani	c833d9b314	Add "v.s." to English tokenizer exceptions (see #3868)	2019-06-20 17:48:45 +02:00
Ines Montani	ae2c208735	Auto-format [ci skip]	2019-06-20 10:36:38 +02:00
Ines Montani	872121955c	Update error code	2019-06-20 10:35:51 +02:00
Ines Montani	e1be80e3ec	Merge branch 'master' into pr/3864	2019-06-20 10:35:37 +02:00
Guillaume Claret	d7a519a922	Typo (#3865) * Typo * Add contributor agreement	2019-06-20 10:31:19 +02:00
Björn Böing	ebf5a04d6c	Update pretrain docs and add unsupported loss_func error (#3860) * Add error to `get_vectors_loss` for unsupported loss function of `pretrain` * Add missing "--loss-func" argument to pretrain docs. Update pretrain plac annotations to match docs. * Add missing quotation marks	2019-06-20 10:30:44 +02:00
Alejandro Alcalde	4866a7ee9e	Changed learning rate by its param name. (#3855) * Changed learning rate by its param name. I've been searching for a while how the parameter learning rate was named, with `beta1` and `beta2` it's easy as they are marked as code, but learning rate wasn't. I think writing the actual parameter name would be helpful. * Signing SCA	2019-06-20 10:29:20 +02:00
svlandeg	b76a43bee4	unicode strings	2019-06-19 13:26:33 +02:00
svlandeg	0b0959b363	UTF8 encoding	2019-06-19 13:11:39 +02:00
svlandeg	cc9ae28a52	custom error and warning messages	2019-06-19 12:35:26 +02:00
svlandeg	791327e3c5	Merge remote-tracking branch 'upstream/master' into feature/nel-wiki	2019-06-19 09:44:05 +02:00
svlandeg	a31648d28b	further code cleanup	2019-06-19 09:15:43 +02:00
svlandeg	478305cd3f	small tweaks and documentation	2019-06-18 18:38:09 +02:00
svlandeg	0d177c1146	clean up code, remove old code, move to bin	2019-06-18 13:20:40 +02:00
svlandeg	ffae7d3555	sentence encoder only (removing article/mention encoder)	2019-06-18 00:05:47 +02:00
svlandeg	6332af40de	baseline performances: oracle KB, random and prior prob	2019-06-17 14:39:40 +02:00
svlandeg	24db1392b9	reprocessing all of wikipedia for training data	2019-06-16 21:14:45 +02:00
Ines Montani	81c12640ab	Auto-format [ci skip]	2019-06-16 14:33:20 +02:00
Greg Werner	9041a72d7f	Update tokenizer.md for construction example (#3790) * Update tokenizer.md for construction example Self contained example. You should really say what nlp is so that the example will work as is * Update CONTRIBUTOR_AGREEMENT.md * Restore contributor agreement * Adjust construction examples	2019-06-16 14:32:56 +02:00

10506 Commits