spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-11 12:18:04 +03:00

Author	SHA1	Message	Date
Ines Montani	0750d59e5a	Allow setting ner_missing_tag on docs_to_json	2019-12-21 13:47:21 +01:00
Matthew Honnibal	a927b3a21e	Put new alignment behind flag for v2.2.2 release (#4541) * Xfail new tokenization test * Put new alignment behind feature flag * Move USE_ALIGN to top of the file [ci skip] Co-authored-by: Ines Montani <ines@ines.io>	2019-10-28 16:12:32 +01:00
tamuhey	df293f3894	modified gold.align to handle space tokens (#4537) Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2019-10-28 15:44:28 +01:00
Ines Montani	92018b9cd4	Tidy up and auto-format	2019-10-28 12:36:23 +01:00
Matthew Honnibal	f0ec7bcb79	Flag to ignore examples with mismatched raw/gold text (#4534) * Flag to ignore examples with mismatched raw/gold text After #4525, we're seeing some alignment failures on our OntoNotes data. I think we actually have fixes for most of these cases. In general it's better to fix the data, but it seems good to allow the GoldCorpus class to just skip cases where the raw text doesn't match up to the gold words. I think previously we were silently ignoring these cases. * Try to fix test on Python 2.7	2019-10-28 11:40:12 +01:00
Matthew Honnibal	f8d740bfb1	Fix --gold-preproc train cli command (#4392) * Fix get labels for textcat * Fix char_embed for gpu * Revert "Fix char_embed for gpu" This reverts commit `055b9a9e85`. * Fix passing of cats in gold.pyx * Revert "Match pop with append for training format (#4516)" This reverts commit `8e7414dace`. * Fix popping gold parses * Fix handling of cats in gold tuples * Fix name * Fix ner_multitask_objective script * Add test for 4402	2019-10-27 21:58:50 +01:00
Sofie Van Landeghem	8e7414dace	Match pop with append for training format (#4516) * trying to fix script - not successful yet * match pop() with extend() to avoid changing the data * few more pop-extend fixes * reinsert deleted print statement * fix print statement * add last tested version * append instead of extend * add in few comments * quick fix for 4402 + unit test * fixing number of docs (not counting cats) * more fixes * fix len * print tmp file instead of using data from examples dir * print tmp file instead of using data from examples dir (2)	2019-10-27 16:01:32 +01:00
tamuhey	fcd25db033	[#4529] fix: gold pyx (#4530) * fix: gold pyx * remove print * skip test in python2 * Add unicode declarations and don't skip test on Python 2	2019-10-27 13:50:07 +01:00
Matthew Honnibal	bddfbc7e1b	Restore missing normalization from gold align PR #4526 missed extra lower-casing and spacing normalization.	2019-10-27 13:47:08 +01:00
tamuhey	554850206c	[#4525] fix gold.align (#4526) * fix: gold.align * fix align * remove old align	2019-10-27 13:38:04 +01:00
adrianeboyd	8516e9d53b	Support train dict format as JSONL (#4471) * Support train dict format as JSONL * Add (overly simple) check for dict vs. tuple to read JSONL lines as either train dicts or train tuples * Extend JSON/JSONL roundtrip conversion tests using `docs_to_json()` and `GoldCorpus.train_tuples` * Revert docs to default JSON output with convert	2019-10-23 16:01:44 +02:00
Sofie Van Landeghem	48886afc78	prevent zero-length mem alloc (#4429) * raise specific error when removing a matcher rule that doesn't exist * rephrasing * goldparse init: allocate fields only if doc is not empty * avoid zero length alloc in saving tokenizer cache * avoid allocating zero length mem in matcher * asserts to avoid allocating zero length mem * fix zero-length allocation in matcher * bump cymem version * revert cymem version bump	2019-10-22 16:54:33 +02:00
tamuhey	fb89f6792b	refactor: remove unused variable (#4499)	2019-10-22 14:38:17 +02:00
adrianeboyd	f5c551a43a	Checks/errors related to ill-formed IOB input in CLI convert and debug-data (#4487) * Error for ill-formed input to iob_to_biluo() Check for empty label in iob_to_biluo(), which can result from ill-formed input. * Check for empty NER label in debug-data	2019-10-21 12:20:28 +02:00
Anastassia	4a77d03ff7	Fix documentation for the docs_to_json function (#4456)	2019-10-16 23:17:58 +02:00
Matthew Honnibal	7d510c833e	Fix orth replacement	2019-09-19 00:03:24 +02:00
Matthew Honnibal	42df49133d	Also lower-case in orth variants	2019-09-18 21:54:51 +02:00
adrianeboyd	b5d999e510	Add textcat to train CLI (#4226) * Add doc.cats to spacy.gold at the paragraph level Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in the spacy JSON training format at the paragraph level. * `spacy.gold.docs_to_json()` writes `docs.cats` * `GoldCorpus` reads in cats in each `GoldParse` * Update instances of gold_tuples to handle cats Update iteration over gold_tuples / gold_parses to handle addition of cats at the paragraph level. * Add textcat to train CLI * Add textcat options to train CLI * Add textcat labels in `TextCategorizer.begin_training()` * Add textcat evaluation to `Scorer`: * For binary exclusive classes with provided label: F1 for label * For 2+ exclusive classes: F1 macro average * For multilabel (not exclusive): ROC AUC macro average (currently relying on sklearn) * Provide user info on textcat evaluation settings, potential incompatibilities * Provide pipeline to Scorer in `Language.evaluate` for textcat config * Customize train CLI output to include only metrics relevant to current pipeline * Add textcat evaluation to evaluate CLI * Fix handling of unset arguments and config params Fix handling of unset arguments and model config parameters in Scorer initialization. * Temporarily add sklearn requirement * Remove sklearn version number * Improve Scorer handling of models without textcats * Fixing Scorer handling of models without textcats * Update Scorer output for python 2.7 * Modify inf in Scorer for python 2.7 * Auto-format Also make small adjustments to make auto-formatting with black easier and produce nicer results * Move error message to Errors * Update documentation * Add cats to annotation JSON format [ci skip] * Fix tpl flag and docs [ci skip] * Switch to internal roc_auc_score Switch to internal `roc_auc_score()` adapted from scikit-learn. * Add AUCROCScore tests and improve errors/warnings * Add tests for AUCROCScore and roc_auc_score * Add missing error for only positive/negative values * Remove unnecessary warnings and errors * Make reduced roc_auc_score functions private Because most of the checks and warnings have been stripped for the internal functions and access is only intended through `ROCAUCScore`, make the functions for roc_auc_score adapted from scikit-learn private. * Check that data corresponds with multilabel flag Check that the training instances correspond with the multilabel flag, adding the multilabel flag if required. * Add textcat score to early stopping check * Add more checks to debug-data for textcat * Add example training data for textcat * Add more checks to textcat train CLI * Check configuration when extending base model * Fix typos * Update textcat example data * Provide licensing details and licenses for data * Remove two labels with no positive instances from jigsaw-toxic-comment data. Co-authored-by: Ines Montani <ines@ines.io>	2019-09-15 22:31:31 +02:00
Matthew Honnibal	c94fc9edb9	Fix noise addition	2019-08-29 15:39:32 +02:00
Matthew Honnibal	32842a3cd4	Disable whitespace corruption	2019-08-29 15:01:58 +02:00
Matthew Honnibal	6511e1d8d3	Fix NER gold-standard around whitespace	2019-08-29 14:33:07 +02:00
Matthew Honnibal	6b2ea883ed	Merge pull request #4205 from adrianeboyd/feature/gold-train-orth-variants Add train_docs() option to add orth variants	2019-08-28 16:54:06 +02:00
Adriane Boyd	0a26e94d02	Modify raw to match orth variant annotation tuples If raw is available, attempt to modify raw to match the orth variants. If raw/words can't be aligned, abort and return unmodified raw/annotation.	2019-08-28 13:38:54 +02:00
Adriane Boyd	aae05ff16b	Add train_docs() option to add orth variants Filtering by orth and tag, create variants of training docs with alternate orth variants, e.g., unicode quotes, dashes, and ellipses. The variants can be single tokens (dashes) or paired tokens (quotes) with left and right versions. Currently restricted to only add variants to training documents without raw text provided, where only gold.words needs to be modified.	2019-08-28 09:18:36 +02:00
Matthew Honnibal	c308cf3e3e	Merge branch 'master' into feature/lemmatizer	2019-08-25 13:52:27 +02:00
Matthew Honnibal	bcd08f20af	Merge changes from master	2019-08-21 14:18:52 +02:00
adrianeboyd	a58cb023d7	WIP: Extending debug-data (#4114) * Extending debug-data with dependency checks, etc. * Modify debug-data to load with GoldCorpus to iterate over .json/.jsonl files within directories * Add GoldCorpus iterator train_docs_without_preprocessing to load original train docs without shuffling and projectivizing * Report number of misaligned tokens * Add more dependency checks and messages * Update spacy/cli/debug_data.py Co-Authored-By: Ines Montani <ines@ines.io> * Fixed conflict * Move counts to _compile_gold() * Move all dependency nonproj/sent/head/cycle counting to _compile_gold() * Unclobber previous merges * Update variable names * Update more variable names, fix misspelling * Don't clobber loading error messages * Only warn about misaligned tokens if present	2019-08-16 10:52:46 +02:00
Ziming He	eea7d4f4a8	biluo_tags_from_offsets throw exception for overlapping entities (#4021) * Check whether two entities overlap - biluo_gold_biluo_overlap now throw exception when entities passed in have overlaps - added unit test * SCA agreement	2019-08-15 18:13:32 +02:00
adrianeboyd	2f9b28c218	Provide more info in cycle error message E069 (#4123) Provide the tokens in the cycle and the first 50 tokens from document in the error message so it's easier to track down the location of the cycle in the data. Addresses feature request in #3698.	2019-08-15 18:08:28 +02:00
Anastassia	33b14724a5	Update gold corpus code to properly ingest a directory of jsonl… (#4067) * Update gold corpus code to properly ingest a directory of jsonlines files In response to: https://github.com/explosion/spaCy/issues/3975 * Update spacy/gold.pyx Co-Authored-By: Ines Montani <ines@ines.io>	2019-08-02 09:58:51 +02:00
svlandeg	ad65171837	Merge remote-tracking branch 'upstream/master' into feature/nel-fixes	2019-07-22 13:41:28 +02:00
svlandeg	21176517a7	have gold.links correspond exactly to doc.ents	2019-07-19 12:36:15 +02:00
svlandeg	d833d4c358	fixes in kb and gold	2019-07-17 17:18:26 +02:00
Ines Montani	73565c6d9d	Rename function arguments	2019-07-17 14:29:52 +02:00
Matthew Honnibal	394e4d8058	Add docstring for spacy.gold.align	2019-07-17 13:59:17 +02:00
svlandeg	0486ccabfd	introduce goldparse.links	2019-06-07 13:54:45 +02:00
Ramanan Balakrishnan	26c37c5a4d	fix all references to BILUO annotation format (#3797)	2019-05-31 12:19:19 +02:00
Ines Montani	6ae3b5699e	Make sure path is string (resolves #3546)	2019-04-08 12:53:41 +02:00
Ines Montani	d0f5e015cb	Auto-format	2019-04-08 12:53:16 +02:00
Matthew Honnibal	47e110375d	Fix jsonl to json conversion (#3419) * Fix spacy.gold.docs_to_json function * Fix jsonl2json converter	2019-03-17 22:12:54 +01:00
Matthew Honnibal	78aba46530	Update feature/lemmatizer from develop	2019-03-10 02:45:33 +01:00
Ines Montani	76764fcf59	💫 Improve converters and training data file formats (#3374) * Populate converter argument info automatically * Add conversion option for msgpack * Update docs * Allow reading training data from JSONL	2019-03-08 23:15:23 +01:00
Matthew Honnibal	4cf897e8e1	Update from develop	2019-03-08 16:56:54 +01:00
Ines Montani	296446a1c8	Tidy up and improve docs and docstrings (#3370) * tidy up and adjust Cython code to code style * improve docstrings and make calling `help()` nicer * add URLs to new docs pages to docstrings wherever possible, mostly to user-facing objects * fix various typos and inconsistencies in docs Types of change: enhancement, docs	2019-03-08 11:42:26 +01:00
Matthew Honnibal	3993f41cc4	Update morphology branch from develop	2019-03-07 00:14:43 +01:00
Matthew Honnibal	f1d77eb140	💫 Improve handling of missing NER tags (closes #2603) (#3341) * Improve handling of missing NER tags GoldParse can accept missing NER tags, if entities is provided in BILUO format (rather than as spans). Missing tags can be provided as None values. Fix bug that occurred when first tag was a None value. Closes #2603. * Document specification of missing NER tags.	2019-02-27 12:06:32 +01:00
Ines Montani	f25bd9f5e4	Add gold.spans_from_biluo_tags helper (#3227)	2019-02-06 21:50:26 +11:00
Matthew Honnibal	a338c6f8f6	Fix JSON segmentation bug that affected French Fix a bug in the JSON streaming code that GoldCorpus uses. Escaped slashes were being handled incorrectly. This bug caused low scores for French in the early v2.1.0 alphas, because most of the data was not being read in. Fittingly, the document that triggered the bug was a Wikipedia article about Perl. Parsing perl remains difficult!	2018-12-08 10:41:24 +01:00
Ines Montani	5b2741f751	Remove unused cytoolz / itertools imports	2018-12-03 02:12:07 +01:00
Ines Montani	f37863093a	💫 Replace ujson, msgpack and dill/pickle/cloudpickle with srsly (#3003) Remove hacks and wrappers, keep code in sync across our libraries and move spaCy a few steps closer to only depending on packages with binary wheels 🎉 See here: https://github.com/explosion/srsly Serialization is hard, especially across Python versions and multiple platforms. After dealing with many subtle bugs over the years (encodings, locales, large files) our libraries like spaCy and Prodigy have steadily grown a number of utility functions to wrap the multiple serialization formats we need to support (especially json, msgpack and pickle). These wrapping functions ended up duplicated across our codebases, so we wanted to put them in one place. At the same time, we noticed that having a lot of small dependencies was making maintenance harder, and making installation slower. To solve this, we've made srsly standalone, by including the component packages directly within it. This way we can provide all the serialization utilities we need in a single binary wheel. srsly currently includes forks of the following packages: ujson, msgpack, msgpack-numpy, cloudpickle * WIP: replace json/ujson with srsly * Replace ujson in examples Use regular json instead of srsly to make code easier to read and follow * Update requirements * Fix imports * Fix typos * Replace msgpack with srsly * Fix warning	2018-12-03 01:28:22 +01:00

160 Commits