spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-01-14 04:19:07 +03:00

Author	SHA1	Message	Date
Ines Montani	cf65a80f36	Refactor lemmatizer and data table integration (#4353 ) * Move test * Allow default in Lookups.get_table * Start with blank tables in Lookups.from_bytes * Refactor lemmatizer to hold instance of Lookups * Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk) * Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency * Remove old and unsupported Lemmatizer.load classmethod * Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need * Update tests and docs * Fix more tests * Fix lemmatizer * Upgrade pytest to try and fix weird CI errors * Try pytest 4.6.5	2019-10-01 21:36:03 +02:00
Ines Montani	a8a1800f2a	Update lemma data documentation [ci skip]	2019-10-01 13:22:13 +02:00
Ines Montani	932ad9cb91	Fix typos and formatting [ci skip]	2019-10-01 12:30:04 +02:00
Ines Montani	f8d1e2f214	Update CLI docs [ci skip]	2019-09-28 13:12:30 +02:00
Ines Montani	aad66d9bb9	Document PhraseMatcher.remove [ci skip]	2019-09-27 16:34:53 +02:00
Ines Montani	eb0649e38e	Fix tag [ci skip]	2019-09-26 16:22:33 +02:00
Ines Montani	da9a869d3f	Update vectors name docs [ci skip]	2019-09-26 16:21:32 +02:00
Matthew Honnibal	92ed4dc5e0	Allow vectors name to be set in init-model (#4321 ) * Allow vectors name to be specified in init-model * Document --vectors-name argument to init-model * Update website/docs/api/cli.md Co-Authored-By: Ines Montani <ines@ines.io>	2019-09-25 13:11:00 +02:00
Matthew Honnibal	e2047576c4	Fix merge conflict	2019-09-18 21:42:11 +02:00
Matthew Honnibal	46c02d25b1	Merge changes to test_ner	2019-09-18 21:41:24 +02:00
Ines Montani	6ebdc5f7d2	Update download docs [ci skip]	2019-09-18 21:21:39 +02:00
Ines Montani	dd1810f05a	Update DocBin and add docs	2019-09-18 20:23:21 +02:00
Ines Montani	bd435faddd	Add note about usage docs [ci skip]	2019-09-18 19:56:43 +02:00
Ines Montani	ee15fdfe88	Fix wording [ci skip]	2019-09-17 14:59:42 +02:00
Ines Montani	f566e69f38	Fix --vectors-loc docs (closes #4270 )	2019-09-17 14:59:12 +02:00
Ines Montani	25c2b4b9a5	Improve init-model docs (see #4137 )	2019-09-17 14:51:44 +02:00
Ines Montani	198b7e9789	Auto-format [ci skip]	2019-09-17 14:48:35 +02:00
adrianeboyd	b5d999e510	Add textcat to train CLI (#4226 ) * Add doc.cats to spacy.gold at the paragraph level Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in the spacy JSON training format at the paragraph level. * `spacy.gold.docs_to_json()` writes `docs.cats` * `GoldCorpus` reads in cats in each `GoldParse` * Update instances of gold_tuples to handle cats Update iteration over gold_tuples / gold_parses to handle addition of cats at the paragraph level. * Add textcat to train CLI * Add textcat options to train CLI * Add textcat labels in `TextCategorizer.begin_training()` * Add textcat evaluation to `Scorer`: * For binary exclusive classes with provided label: F1 for label * For 2+ exclusive classes: F1 macro average * For multilabel (not exclusive): ROC AUC macro average (currently relying on sklearn) * Provide user info on textcat evaluation settings, potential incompatibilities * Provide pipeline to Scorer in `Language.evaluate` for textcat config * Customize train CLI output to include only metrics relevant to current pipeline * Add textcat evaluation to evaluate CLI * Fix handling of unset arguments and config params Fix handling of unset arguments and model confiug parameters in Scorer initialization. * Temporarily add sklearn requirement * Remove sklearn version number * Improve Scorer handling of models without textcats * Fixing Scorer handling of models without textcats * Update Scorer output for python 2.7 * Modify inf in Scorer for python 2.7 * Auto-format Also make small adjustments to make auto-formatting with black easier and produce nicer results * Move error message to Errors * Update documentation * Add cats to annotation JSON format [ci skip] * Fix tpl flag and docs [ci skip] * Switch to internal roc_auc_score Switch to internal `roc_auc_score()` adapted from scikit-learn. * Add AUCROCScore tests and improve errors/warnings * Add tests for AUCROCScore and roc_auc_score * Add missing error for only positive/negative values * Remove unnecessary warnings and errors * Make reduced roc_auc_score functions private Because most of the checks and warnings have been stripped for the internal functions and access is only intended through `ROCAUCScore`, make the functions for roc_auc_score adapted from scikit-learn private. * Check that data corresponds with multilabel flag Check that the training instances correspond with the multilabel flag, adding the multilabel flag if required. * Add textcat score to early stopping check * Add more checks to debug-data for textcat * Add example training data for textcat * Add more checks to textcat train CLI * Check configuration when extending base model * Fix typos * Update textcat example data * Provide licensing details and licenses for data * Remove two labels with no positive instances from jigsaw-toxic-comment data. Co-authored-by: Ines Montani <ines@ines.io>	2019-09-15 22:31:31 +02:00
Ines Montani	bab9976d9a	💫 Adjust Table API and add docs (#4289 ) * Adjust Table API and add docs * Add attributes and update description [ci skip] * Use strings.get_string_id instead of hash_string * Fix table method calls * Make orth arg in Lemmatizer.lookup optional Fall back to string, which is now handled by Table.__contains__ out-of-the-box * Fix method name * Auto-format	2019-09-15 22:08:13 +02:00
Ines Montani	5c8b5e68ec	Fix docs consistency [ci skip]	2019-09-14 16:23:37 +02:00
Ines Montani	3126dd0904	Tidy up and auto-format [ci skip]	2019-09-14 12:58:06 +02:00
Ines Montani	3c3658ef9f	Merge branch 'master' into develop	2019-09-12 18:03:01 +02:00
Sofie Van Landeghem	9be4d1c105	Allow copying of user_data in as_doc (#4282 ) * Allow copying the user_data with as_doc + unit test * add option to docs * add typing * import fix * workaround to avoid bool clashing ... * bint instead of bool	2019-09-12 17:08:14 +02:00
Ines Montani	ff51fba96a	Update lemmaitzer docs [ci skip]	2019-09-12 16:26:33 +02:00
Ines Montani	25b2b3ff45	Remove LEMMA from exception examples [ci skip]	2019-09-12 16:26:27 +02:00
Ines Montani	82c16b7943	Remove u-strings and fix formatting [ci skip]	2019-09-12 16:11:15 +02:00
Ines Montani	b544dcb3c5	Document debug-data [ci skip]	2019-09-12 15:26:20 +02:00
Ines Montani	10257f3131	Document Lookups [ci skip]	2019-09-12 14:00:14 +02:00
Ines Montani	aa4ff0baa1	Auto-format [ci skip]	2019-09-12 13:05:53 +02:00
Ines Montani	625ce2db8e	Update Language docs [ci skip]	2019-09-12 13:03:38 +02:00
Ines Montani	cb41a33d14	Update displaCy API docs [ci skip]	2019-09-12 12:59:20 +02:00
Sofie Van Landeghem	0b4b4f1819	Documentation for Entity Linking (#4065 ) * document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bool's instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustements to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * typo fix * add candidate API to kb documentation * update API sidebar with EntityLinker and KnowledgeBase * remove EL from 101 docs * remove entity linker from 101 pipelines / rephrase * custom el model instead of existing model * set version to 2.2 for EL functionality * update documentation for 2 CLI scripts	2019-09-12 11:38:34 +02:00
Sofie Van Landeghem	53a9ca45c9	Docs: bufsize instead of buffsize (#4247 )	2019-09-06 11:11:54 +02:00
adrianeboyd	82159b5c19	Updates/bugfixes for NER/IOB converters (#4186 ) * Updates/bugfixes for NER/IOB converters * Converter formats `ner` and `iob` use autodetect to choose a converter if possible * `iob2json` is reverted to handle sentence-per-line data like `word1\|pos1\|ent1 word2\|pos2\|ent2` * Fix bug in `merge_sentences()` so the second sentence in each batch isn't skipped * `conll_ner2json` is made more general so it can handle more formats with whitespace-separated columns * Supports all formats where the first column is the token and the final column is the IOB tag; if present, the second column is the POS tag * As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O` separates documents * Add option for segmenting sentences (new flag `-s`) * Parser-based sentence segmentation with a provided model, otherwise with sentencizer (new option `-b` to specify model) * Can group sentences into documents with `n_sents` as long as sentence segmentation is available * Only applies automatic segmentation when there are no existing delimiters in the data * Provide info about settings applied during conversion with warnings and suggestions if settings conflict or might not be not optimal. * Add tests for common formats * Add '(default)' back to docs for -c auto * Add document count back to output * Revert changes to converter output message * Use explicit tabs in convert CLI test data * Adjust/add messages for n_sents=1 default * Add sample NER data to training examples * Update README * Add links in docs to example NER data * Define msg within converters	2019-08-29 12:04:01 +02:00
Björn Böing	bae0455f91	Fix visualizer options linking for displaCy. (#4202 )	2019-08-27 14:04:28 +02:00
Christos Aridas	61f5c007a0	DOC Fix pipeline functions examples (#4189 )	2019-08-23 19:15:32 +02:00
Ines Montani	fe230c8776	Fix typo [ci skip]	2019-08-20 13:02:05 +02:00
Daniel Bourke	b0a28fd0de	fix PhraseMatcher link typo (#4150 ) /api/phtasematcher -> /api/phrasematcher	2019-08-20 13:01:43 +02:00
Ines Montani	ce4c3e5204	Document force flag on set_extension (closes #4148 )	2019-08-19 19:22:07 +02:00
adrianeboyd	69aca7d839	Add validate option to EntityRuler (#4089 ) * Add validate option to EntityRuler * Add validate to EntityRuler, passed to Matcher and PhraseMatcher * Add validate to usage and API docs * Update website/docs/usage/rule-based-matching.md Co-Authored-By: Ines Montani <ines@ines.io> * Update website/docs/usage/rule-based-matching.md Co-Authored-By: Ines Montani <ines@ines.io>	2019-08-07 00:40:53 +02:00
Ines Montani	0f76e0022d	Update .tensor docs [ci skip]	2019-08-01 18:37:09 +02:00
Björn Böing	a83c0add2e	Add links to tokenizer API docs to refer relevant information. (#4064 ) * Add links to tokenizer API docs to refer relevant information. * Add suggested changes Co-Authored-By: Ines Montani <ines@ines.io>	2019-08-01 14:28:38 +02:00
Ejar	2cdf7d39e7	Corrected imported fucntion (#4062 ) The example showed an incorrected import	2019-08-01 12:43:36 +02:00
Ines Montani	fcd2f7f656	Fix version introducing Span.ents (closes #4045 ) [ci skip]	2019-07-30 10:32:33 +02:00
Ines Montani	fc69da0acb	💫 Support simple training format in nlp.evaluate and add tests (#4033 ) * Support simple training format in nlp.evaluate and add tests * Update docs [ci skip]	2019-07-27 17:30:18 +02:00
Ines Montani	1fa6d6ba55	Improve consistency of docs examples [ci skip]	2019-07-25 14:24:56 +02:00
adrianeboyd	784a5f4284	Update GoldParse attributes in API docs (#4023 ) * add `words` * update name of entity list to `ner` I think it might be a bit more consistent to have `ner` named `entities` or `ents` (and `ents` is actually set somewhere to `None`, which is a bit confusing), but it looks like renaming it would be a non-trivial decision.	2019-07-25 12:14:02 +02:00
Adriane Boyd	6c5044ed2a	Update annotation docs for German - minor formatting fixes - remove STTS tags not used in Tiger - update list of dependency relations to match tiger2dep	2019-07-22 11:59:03 +02:00
adrianeboyd	d2c474cbb7	Fix initial example in EntityRuler API docs (#3999 )	2019-07-22 11:18:55 +02:00
BreakBB	6d9a7c0749	Add '--silent' argument to bash example of CLI Info	2019-07-19 10:00:45 +02:00

1 2 3 4 5 ...

275 Commits