spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-18 08:01:58 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	88a23cf49a	Fix name	2019-09-18 13:38:29 +02:00
Matthew Honnibal	3507943b15	Add docstring for DocPallet	2019-09-18 13:25:47 +02:00
Matthew Honnibal	1c8de6b2e5	Rename DocBox->DocPallet	2019-09-18 13:13:51 +02:00
Ines Montani	691e0088cf	Remove duplicate tok2vec property (closes #4302 )	2019-09-17 11:22:03 +02:00
Ines Montani	a84025d70b	Remove --no-deps from default pip args on download Add warning if user is executing spaCy without having it installed and add --no-deps to prevent the package from being redownloaded	2019-09-16 23:32:41 +02:00
Matthew Honnibal	84c65f9455	Merge branch 'master' into develop	2019-09-16 22:12:20 +02:00
Matthew Honnibal	47055d5988	Fix type declarations in _merge method	2019-09-16 22:10:13 +02:00
Sofie Van Landeghem	03ac29f437	Ensure that doc.ents preserves kb_id annotations (#4294 ) * bugfix: ensure doc.ents preserves kb_id annotations * fix backward compatibility * additional test	2019-09-16 15:18:37 +02:00
Ines Montani	139428c20f	Set unique vector names in tests	2019-09-16 15:16:54 +02:00
Ines Montani	bf06d9d537	Allow passing vectors_name to Vocab	2019-09-16 15:16:41 +02:00
Ines Montani	cb6c68a573	Pass vectors name correctly in prune_vectors	2019-09-16 15:16:29 +02:00
Ines Montani	3ba5238282	Make "unnamed vectors" warning a real warning	2019-09-16 15:16:12 +02:00
adrianeboyd	b5d999e510	Add textcat to train CLI (#4226 ) * Add doc.cats to spacy.gold at the paragraph level Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in the spacy JSON training format at the paragraph level. * `spacy.gold.docs_to_json()` writes `docs.cats` * `GoldCorpus` reads in cats in each `GoldParse` * Update instances of gold_tuples to handle cats Update iteration over gold_tuples / gold_parses to handle addition of cats at the paragraph level. * Add textcat to train CLI * Add textcat options to train CLI * Add textcat labels in `TextCategorizer.begin_training()` * Add textcat evaluation to `Scorer`: * For binary exclusive classes with provided label: F1 for label * For 2+ exclusive classes: F1 macro average * For multilabel (not exclusive): ROC AUC macro average (currently relying on sklearn) * Provide user info on textcat evaluation settings, potential incompatibilities * Provide pipeline to Scorer in `Language.evaluate` for textcat config * Customize train CLI output to include only metrics relevant to current pipeline * Add textcat evaluation to evaluate CLI * Fix handling of unset arguments and config params Fix handling of unset arguments and model config parameters in Scorer initialization. * Temporarily add sklearn requirement * Remove sklearn version number * Improve Scorer handling of models without textcats * Fixing Scorer handling of models without textcats * Update Scorer output for python 2.7 * Modify inf in Scorer for python 2.7 * Auto-format Also make small adjustments to make auto-formatting with black easier and produce nicer results * Move error message to Errors * Update documentation * Add cats to annotation JSON format [ci skip] * Fix tpl flag and docs [ci skip] * Switch to internal roc_auc_score Switch to internal `roc_auc_score()` adapted from scikit-learn. 
* Add AUCROCScore tests and improve errors/warnings * Add tests for AUCROCScore and roc_auc_score * Add missing error for only positive/negative values * Remove unnecessary warnings and errors * Make reduced roc_auc_score functions private Because most of the checks and warnings have been stripped for the internal functions and access is only intended through `ROCAUCScore`, make the functions for roc_auc_score adapted from scikit-learn private. * Check that data corresponds with multilabel flag Check that the training instances correspond with the multilabel flag, adding the multilabel flag if required. * Add textcat score to early stopping check * Add more checks to debug-data for textcat * Add example training data for textcat * Add more checks to textcat train CLI * Check configuration when extending base model * Fix typos * Update textcat example data * Provide licensing details and licenses for data * Remove two labels with no positive instances from jigsaw-toxic-comment data. Co-authored-by: Ines Montani <ines@ines.io>	2019-09-15 22:31:31 +02:00
Ines Montani	bab9976d9a	💫 Adjust Table API and add docs (#4289 ) * Adjust Table API and add docs * Add attributes and update description [ci skip] * Use strings.get_string_id instead of hash_string * Fix table method calls * Make orth arg in Lemmatizer.lookup optional Fall back to string, which is now handled by Table.__contains__ out-of-the-box * Fix method name * Auto-format	2019-09-15 22:08:13 +02:00
Ines Montani	88a9d87f6f	Fix test	2019-09-15 18:04:44 +02:00
Ines Montani	23e28e2844	Merge branch 'master' into develop	2019-09-15 17:57:09 +02:00
Ines Montani	c7e4ea7154	Update examples and languages.json [ci skip]	2019-09-15 17:56:40 +02:00
Ines Montani	aa3c59a2f3	Include Norwegian NER entity types in glossary [ci skip] See https://github.com/ltgoslo/norne	2019-09-15 17:16:21 +02:00
Ines Montani	7194845234	Skip tests properly instead of xfailing them	2019-09-15 17:00:17 +02:00
Ines Montani	16c2522791	Merge branch 'master' into develop	2019-09-14 16:42:01 +02:00
adrianeboyd	6942a6a69b	Extend default punct for sentencizer (#4290 ) Most of these characters are for languages / writing systems that aren't supported by spacy, but I don't think it causes problems to include them. In the UD evals, Hindi and Urdu improve a lot as expected (from 0-10% to 70-80%) and Persian improves a little (90% to 96%). Tamil improves in combination with #4288. The punctuation list is converted to a set internally because of its increased length. Sentence final punctuation generated with: ``` unichars -gas '[\p{Sentence_Break=STerm}\p{Sentence_Break=ATerm}]' '\p{Terminal_Punctuation}' ``` See: https://stackoverflow.com/a/9508766/461847 Fixes #4269.	2019-09-14 15:25:48 +02:00
adrianeboyd	bee7961927	Add Kannada, Tamil, and Telugu unicode blocks (#4288 ) Add Kannada, Tamil, and Telugu unicode blocks to uncased character classes so that period is recognized as a suffix during tokenization. (I'm sure a few symbols in the code blocks should not be ALPHA, but this is mainly relevant for suffix detection and seems to be an improvement in practice.)	2019-09-14 14:23:06 +02:00
Ines Montani	3126dd0904	Tidy up and auto-format [ci skip]	2019-09-14 12:58:06 +02:00
Ines Montani	27106d6528	Merge branch 'master' into develop	2019-09-13 17:07:17 +02:00
Sofie Van Landeghem	2ae5db580e	dim bugfix when incl_prior is False (#4285 )	2019-09-13 16:30:05 +02:00
Paul O'Leary McCann	29a9e636eb	Fix half-width space handling in JA (#4284 ) (closes #4262 ) Before this patch, half-width spaces between words were simply lost in Japanese text. This wasn't immediately noticeable because much Japanese text never uses spaces at all.	2019-09-13 16:28:12 +02:00
Ines Montani	3c3658ef9f	Merge branch 'master' into develop	2019-09-12 18:03:01 +02:00
Ines Montani	228bbf506d	Improve label properties on pipes	2019-09-12 18:02:44 +02:00
Paul O'Leary McCann	7d8df69158	Bloom-filter backed Lookup Tables (#4268 ) * Improve load_language_data helper * WIP: Add Lookups implementation * Start moving lemma data over to JSON * WIP: move data over for more languages * Convert more languages * Fix lemmatizer fixtures in tests * Finish conversion * Auto-format JSON files * Fix test for now * Make sure tables are stored on instance * Update docstrings * Update docstrings and errors * Update test * Add Lookups.__len__ * Add serialization methods * Add Lookups.remove_table * Use msgpack for serialization to disk * Fix file exists check * Try using OrderedDict for everything * Update .flake8 [ci skip] * Try fixing serialization * Update test_lookups.py * Update test_serialize_vocab_strings.py * Lookups / Tables now work This implements the stubs in the Lookups/Table classes. Currently this is in Cython but with no type declarations, so that could be improved. * Add lookups to setup.py * Actually add lookups pyx The previous commit added the old py file... * Lookups work-in-progress * Move from pyx back to py * Add string based lookups, fix serialization * Update tests, language/lemmatizer to work with string lookups There are some outstanding issues here: - a pickling-related test fails due to the bloom filter - some custom lemmatizers (fr/nl at least) have issues More generally, there's a question of how to deal with the case where you have a string but want to use the lookup table. Currently the table allows access by string or id, but that's getting pretty awkward. * Change lemmatizer lookup method to pass (orth, string) * Fix token lookup * Fix French lookup * Fix lt lemmatizer test * Fix Dutch lemmatizer * Fix lemmatizer lookup test This was using a normal dict instead of a Table, so checks for the string instead of an integer key failed. * Make uk/nl/ru lemmatizer lookup methods consistent The mentioned tokenizers all have their own implementation of the `lookup` method, which accesses a `Lookups` table. 
The way that was called in `token.pyx` was changed so this should be updated to have the same arguments as `lookup` in `lemmatizer.py` (specifically (orth/id, string)). Prior to this change tests weren't failing, but there would probably be issues with normal use of a model. More tests should probably be added. Additionally, the language-specific `lookup` implementations seem like they might not be needed, since they handle things like lower-casing that aren't actually language specific. * Make recently added Greek method compatible * Remove redundant class/method Leftovers from a merge not cleaned up adequately.	2019-09-12 17:26:11 +02:00
Sofie Van Landeghem	9be4d1c105	Allow copying of user_data in as_doc (#4282 ) * Allow copying the user_data with as_doc + unit test * add option to docs * add typing * import fix * workaround to avoid bool clashing ... * bint instead of bool	2019-09-12 17:08:14 +02:00
Matthew Honnibal	7d782aa97b	Add more docstrings for MorphAnalysis	2019-09-12 16:48:30 +02:00
Ines Montani	b544dcb3c5	Document debug-data [ci skip]	2019-09-12 15:26:20 +02:00
Ines Montani	05a2df6616	Remove not implemented file validation [ci skip]	2019-09-12 15:26:02 +02:00
Ines Montani	10257f3131	Document Lookups [ci skip]	2019-09-12 14:00:14 +02:00
Ines Montani	32404e613c	Create directory if it doesn't exist	2019-09-12 14:00:01 +02:00
Ines Montani	625ce2db8e	Update Language docs [ci skip]	2019-09-12 13:03:38 +02:00
Ines Montani	655b434553	Merge branch 'master' into develop	2019-09-12 11:39:18 +02:00
Sofie Van Landeghem	0b4b4f1819	Documentation for Entity Linking (#4065 ) * document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bool's instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustments to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * typo fix * add candidate API to kb documentation * update API sidebar with EntityLinker and KnowledgeBase * remove EL from 101 docs * remove entity linker from 101 pipelines / rephrase * custom el model instead of existing model * set version to 2.2 for EL functionality * update documentation for 2 CLI scripts	2019-09-12 11:38:34 +02:00
Ines Montani	4d4b3b0783	Add "labels" to Language.meta	2019-09-12 11:34:25 +02:00
Ines Montani	ac0e27a825	💫 Add Language.pipe_labels (#4276 ) * Add Language.pipe_labels * Update spacy/language.py Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>	2019-09-12 10:56:28 +02:00
tamuhey	71909cdf22	Fix iss4278 (#4279 ) * fix: len(tuple) == 2 * (#4278) add fail test * add contributor's agreement	2019-09-12 10:44:49 +02:00
Ines Montani	8ebc3711dc	Fix bug in Parser.labels and add test (#4275 )	2019-09-11 18:29:35 +02:00
Matthew Honnibal	7fbb559045	Set version to v2.2.0.dev6	2019-09-11 18:07:20 +02:00
Matthew Honnibal	f7a096b462	Update morphology	2019-09-11 18:06:43 +02:00
Matthew Honnibal	f8ce9dde0f	Set version to v2.2.0.dev5	2019-09-11 17:41:21 +02:00
Matthew Honnibal	c47c0269b1	Update morphology features	2019-09-11 15:16:53 +02:00
Ines Montani	af25323653	Tidy up and auto-format	2019-09-11 14:00:36 +02:00
Matthew Honnibal	af93997993	Fix conllu converter	2019-09-11 13:28:07 +02:00
Matthew Honnibal	178d010b25	Set version to 2.2.0.dev4	2019-09-11 12:28:37 +02:00
Ines Montani	e82a8d0d7a	Merge branch 'master' into develop	2019-09-11 11:52:38 +02:00
Ines Montani	8f9f48b04c	Add GreekLemmatizer.lookup (resolves #4272 )	2019-09-11 11:44:40 +02:00
Ines Montani	6279d74c65	Tidy up and auto-format	2019-09-11 11:38:22 +02:00
Matthew Honnibal	7b858ba606	Update from master	2019-09-10 20:14:08 +02:00
Ines Montani	669a7d37ce	Exclude vocab when testing to_bytes	2019-09-10 19:45:16 +02:00
adrianeboyd	e367864e59	Update Ukrainian create_lemmatizer kwargs (#4266 ) Allow Ukrainian create_lemmatizer to accept lookups kwarg.	2019-09-10 11:14:46 +02:00
adrianeboyd	c32126359a	Allow period as suffix following punctuation (#4248 ) Addresses rare cases (such as `_MATH_.`, see #1061) where the final period was not recognized as a suffix following punctuation.	2019-09-09 19:19:22 +02:00
Ines Montani	3e8f136ba7	💫 WIP: Basic lookup class scaffolding and JSON for all lemmatizer data (#4178 ) * Improve load_language_data helper * WIP: Add Lookups implementation * Start moving lemma data over to JSON * WIP: move data over for more languages * Convert more languages * Fix lemmatizer fixtures in tests * Finish conversion * Auto-format JSON files * Fix test for now * Make sure tables are stored on instance * Update docstrings * Update docstrings and errors * Update test * Add Lookups.__len__ * Add serialization methods * Add Lookups.remove_table * Use msgpack for serialization to disk * Fix file exists check * Try using OrderedDict for everything * Update .flake8 [ci skip] * Try fixing serialization * Update test_lookups.py * Update test_serialize_vocab_strings.py * Fix serialization for lookups * Fix lookups * Fix lookups * Fix lookups * Try to fix serialization * Try to fix serialization * Try to fix serialization * Try to fix serialization * Give up on serialization test * Xfail more serialization tests for 3.5 * Fix lookups for 2.7	2019-09-09 19:17:55 +02:00
Sofie Van Landeghem	482c7cd1b9	pulling tqdm imports in functions to avoid bug (tmp fix) (#4263 )	2019-09-09 16:32:11 +02:00
Mihai Gliga	25aecd504f	adding Romanian tag_map (#4257 ) * adding Romanian tag_map * added SCA file * forgotten import	2019-09-09 11:53:09 +02:00
Matthew Honnibal	1653b818c5	Update Lithuanian tag map	2019-09-08 20:57:58 +02:00
adrianeboyd	3780e2ff50	Flush tokenizer cache when necessary (#4258 ) Flush tokenizer cache when affixes, token_match, or special cases are modified. Fixes #4238, same issue as in #1250.	2019-09-08 20:52:46 +02:00
Matthew Honnibal	da8830d909	Set version to v2.2.0.dev3	2019-09-08 18:22:03 +02:00
Matthew Honnibal	1a65c5b7af	Update develop from master	2019-09-08 18:21:41 +02:00
Matthew Honnibal	aec6174ae6	Fix lemmatizer	2019-09-08 18:09:53 +02:00
Matthew Honnibal	fde4f8ac8e	Create lookups if not passed in	2019-09-08 18:08:09 +02:00
Pavle Vidanović	d03401f532	Lemmatizer lookup dictionary for Serbian and basic tag set adde… (#4251 ) * Serbian stopwords added. (cyrillic alphabet) * spaCy Contribution agreement included. * Test initialize updated * Serbian language code update. --bugfix * Tokenizer exceptions added. Init file updated. * Norm exceptions and lexical attributes added. * Examples added. * Tests added. * sr_lang examples update. * Tokenizer exceptions updated. (Serbian) * Lemmatizer created. Licence included. * Test updated. * Tag map basic added. * tag_map.py file removed since it uses default spacy tags.	2019-09-08 14:19:15 +02:00
Ivan Šarić	b01025dd06	adds Croatian lemma_lookup.json, license file and corresponding tests (#4252 )	2019-09-08 13:40:45 +02:00
adrianeboyd	aec755d3a3	Modify retokenizer to use span root attributes (#4219 ) * Modify retokenizer to use span root attributes * tag/pos/morph are set to root tag/pos/morph * lemma and norm are reset and end up as orth (not ideal, but better than orth of first token) * Also handle individual merge case * Add test * Attempt to handle ent_iob and ent_type in merges * Fix check for whether B-ENT should become I-ENT * Move IOB consistency check to after attrs Move all IOB consistency checks after attrs are set and simplify to check entire document, modifying I to B at the beginning of the document or if the entity type of the previous token isn't the same. * Move IOB consistency check for single merge Move IOB consistency check after the token array is compressed for the single merge case. * Update spacy/tokens/_retokenize.pyx Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com> * Remove single vs. multiple merge distinction Remove original single-instance `_merge()` and use `_bulk_merge()` (now renamed `_merge()`) for all merges. * Add out-of-bound check in previous entity check	2019-09-08 13:04:49 +02:00
Bae Yong-Ju	a55f5a744f	Fix ValueError exception on empty Korean text. (#4245 )	2019-09-06 10:29:40 +02:00
Adriane Boyd	0f28418446	Add regression test for #1061 back to test suite	2019-09-04 20:42:24 +02:00
Adriane Boyd	c39c13f26b	Add guillemets/chevrons to German orth variants Add guillemets/chevrons to German orth variants for both German/Austrian and Swiss conventions.	2019-09-04 20:05:08 +02:00
Adriane Boyd	6b0fec76fd	Fix handling of preset entities in NER * Fix check of valid ent_type for B * Add valid L as preset-I followed by not-I	2019-09-04 13:42:42 +02:00
Ines Montani	419ae59c79	Make flaky test test_issue_1971_4 more explicit	2019-08-31 14:08:05 +02:00
Ines Montani	cd90752193	Tidy up and auto-format [ci skip]	2019-08-31 13:39:06 +02:00
Matthew Honnibal	67c3d03905	Revert morphology serialisation	2019-08-30 13:13:07 +02:00
Adriane Boyd	893f11a9e3	Serialize tag_map directly Fix Aspect_prof typo	2019-08-30 11:30:03 +02:00
Adriane Boyd	02babf9317	English tag map without unsupported features/values	2019-08-30 11:29:19 +02:00
Matthew Honnibal	516650f58f	Merge pull request #4207 from svlandeg/bugfix/serialize-tok-exc Bugfix for serializing tokenizer rules/exceptions	2019-08-30 11:04:58 +02:00
Matthew Honnibal	f3c3ce7f1e	Update vocab	2019-08-29 21:19:54 +02:00
Matthew Honnibal	fc0a3c8c38	Add morphology serialization	2019-08-29 21:17:34 +02:00
Matthew Honnibal	c94fc9edb9	Fix noise addition	2019-08-29 15:39:32 +02:00
Matthew Honnibal	32842a3cd4	Disable whitespace corruption	2019-08-29 15:01:58 +02:00
Matthew Honnibal	3c1c0ec18e	Add tests for NER oracle with whitespace	2019-08-29 14:33:39 +02:00
Matthew Honnibal	6511e1d8d3	Fix NER gold-standard around whitespace	2019-08-29 14:33:07 +02:00
adrianeboyd	82159b5c19	Updates/bugfixes for NER/IOB converters (#4186 ) * Updates/bugfixes for NER/IOB converters * Converter formats `ner` and `iob` use autodetect to choose a converter if possible * `iob2json` is reverted to handle sentence-per-line data like `word1|pos1|ent1 word2|pos2|ent2` * Fix bug in `merge_sentences()` so the second sentence in each batch isn't skipped * `conll_ner2json` is made more general so it can handle more formats with whitespace-separated columns * Supports all formats where the first column is the token and the final column is the IOB tag; if present, the second column is the POS tag * As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O` separates documents * Add option for segmenting sentences (new flag `-s`) * Parser-based sentence segmentation with a provided model, otherwise with sentencizer (new option `-b` to specify model) * Can group sentences into documents with `n_sents` as long as sentence segmentation is available * Only applies automatic segmentation when there are no existing delimiters in the data * Provide info about settings applied during conversion with warnings and suggestions if settings conflict or might not be optimal. * Add tests for common formats * Add '(default)' back to docs for -c auto * Add document count back to output * Revert changes to converter output message * Use explicit tabs in convert CLI test data * Adjust/add messages for n_sents=1 default * Add sample NER data to training examples * Update README * Add links in docs to example NER data * Define msg within converters	2019-08-29 12:04:01 +02:00
adrianeboyd	5feb342f5e	Add more token attributes to token pattern schema (#4210 ) Add token attributes with tests to token pattern schema.	2019-08-29 12:02:26 +02:00
Adriane Boyd	f3906950d3	Add separate noise vs orth level to train CLI	2019-08-29 09:10:35 +02:00
Matthew Honnibal	7d6d438566	Set version to v2.2.0.dev2	2019-08-28 18:30:43 +02:00
Matthew Honnibal	bc5ce49859	Fix 'noise_level' in train cmd	2019-08-28 17:55:38 +02:00
Matthew Honnibal	782056d117	Fix morph rules	2019-08-28 16:59:45 +02:00
Matthew Honnibal	6b2ea883ed	Merge pull request #4205 from adrianeboyd/feature/gold-train-orth-variants Add train_docs() option to add orth variants	2019-08-28 16:54:06 +02:00
svlandeg	c54aabc3cd	fix loading custom tokenizer rules/exceptions from file	2019-08-28 14:17:44 +02:00
svlandeg	7bec0ebbcb	failing unit test for Issue 4190	2019-08-28 14:16:34 +02:00
Adriane Boyd	0a26e94d02	Modify raw to match orth variant annotation tuples If raw is available, attempt to modify raw to match the orth variants. If raw/words can't be aligned, abort and return unmodified raw/annotation.	2019-08-28 13:38:54 +02:00
Adriane Boyd	47af3f676e	Single and paired orth variants for German	2019-08-28 09:19:18 +02:00
Adriane Boyd	56c38484a1	Single and paired orth variants for English	2019-08-28 09:19:18 +02:00
Adriane Boyd	aae05ff16b	Add train_docs() option to add orth variants Filtering by orth and tag, create variants of training docs with alternate orth variants, e.g., unicode quotes, dashes, and ellipses. The variants can be single tokens (dashes) or paired tokens (quotes) with left and right versions. Currently restricted to only add variants to training documents without raw text provided, where only gold.words needs to be modified.	2019-08-28 09:18:36 +02:00
Matthew Honnibal	af7fad2c6d	Set version to v2.2.0.dev1	2019-08-25 22:05:47 +02:00
Matthew Honnibal	71c0321ecf	Fix test	2019-08-25 22:03:37 +02:00
Matthew Honnibal	188a1cf297	Fix morphology for | features	2019-08-25 21:57:02 +02:00

6423 Commits