spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-01-02 15:03:29 +03:00


Lj Miranda 1d34aa2b3d Add spacy-span-analyzer to debug data (#10668)	2022-05-23 19:06:38 +02:00

* Rename to spans_key for consistency
* Implement spans length in debug data
* Implement how span bounds and spans are obtained
  In this commit, I implemented how span boundaries (the tokens) around a given span and spans are obtained. I've put them in the compile_gold() function so that it's accessible later on. I will do the actual computation of the span and boundary distinctiveness in the main function above.
* Compute for p_spans and p_bounds
* Add computation for SD and BD
* Fix mypy issues
* Add weighted average computation
* Fix compile_gold conditional logic
* Add test for frequency distribution computation
* Add tests for kl-divergence computation
* Fix weighted average computation
* Make tables more compact by rounding them
* Add more descriptive checks for spans
* Modularize span computation methods
  In this commit, I added the _get_span_characteristics and _print_span_characteristics functions so that they can be reusable anywhere.
* Remove unnecessary arguments and make fxs more compact
* Update a few parameter arguments
* Add tests for print_span and get_span methods
* Update API to talk about span characteristics in brief
* Add better reporting of spans_length
* Add test for span length reporting
* Update formatting of span length report
  Removed '' to indicate that it's not a string, then sort the n-grams by their length, not by their frequency.
* Apply suggestions from code review
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Show all frequency distribution when -V
  In this commit, I displayed the full frequency distribution of the span lengths when --verbose is passed. To make things simpler, I rewrote some of the formatter functions so that I can call them whenever. Another notable change is that instead of showing percentages as Integers, I showed them as floats (max 2-decimal places). I did this because it looks weird when it displays (0%).
* Update logic on how total is computed
  The way the 90% thresholding is computed now is that we keep adding the percentages until we reach >= 90%. I also updated the wording and used the term "At least" to denote that >= 90% of your spans have these distributions.
* Fix display when showing the threshold percentage
* Apply suggestions from code review
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add better phrasing for span information
* Update spacy/cli/debug_data.py
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add minor edits for whitespaces etc.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
architectures.md	Tagger: use unnormalized probabilities for inference (#10197)	2022-03-15 14:15:31 +01:00
attributeruler.md	Document scorers in registry and components from #8766 (#8929)	2021-08-12 12:50:03 +02:00
cli.md	Add spacy-span-analyzer to debug data (#10668)	2022-05-23 19:06:38 +02:00
corpus.md	Add shuffle parameter to Corpus API docs (#10220)	2022-02-07 14:55:53 +01:00
cython-classes.md	Update docs, types and API consistency	2020-08-17 16:45:24 +02:00
cython-structs.md	Update docs, types and API consistency	2020-08-17 16:45:24 +02:00
cython.md	Update docs [ci skip]	2020-09-12 17:05:10 +02:00
data-formats.md	Fix references to config file in the docs & UX (#9961)	2022-01-04 14:31:26 +01:00
dependencymatcher.md	doc fixes	2020-09-12 17:38:54 +02:00
dependencyparser.md	Fix types in API docs for moves in parser and ner (#10464)	2022-03-08 13:51:11 +01:00
doc.md	Docs for v3.3 (#10628)	2022-04-28 14:09:35 +02:00
docbin.md	Fix point typo on docbin docs (#9097)	2021-08-31 10:55:44 +02:00
edittreelemmatizer.md	Add edit tree lemmatizer (#10231)	2022-03-28 11:13:50 +02:00
entitylinker.md	Fix entity linker batching (#9669)	2022-03-04 09:17:36 +01:00
entityrecognizer.md	Fix types in API docs for moves in parser and ner (#10464)	2022-03-08 13:51:11 +01:00
entityruler.md	Add link to pattern file info in EntityRuler.initialize docs (#10091)	2022-01-19 10:45:11 +01:00
example.md	Extend score_spans for overlapping & non-labeled spans (#7209)	2021-04-08 12:19:17 +02:00
index.md	Update v3 docs	2020-07-03 16:48:21 +02:00
kb.md	Tidy up docs	2021-06-28 12:08:15 +02:00
language.md	Merge remote-tracking branch 'upstream/develop' into chore/switch-to-master-v3.2.0	2021-11-03 15:32:18 +01:00
legacy.md	Add test for old architectures (#10751)	2022-05-10 08:24:42 +02:00
lemmatizer.md	Add edit tree lemmatizer (#10231)	2022-03-28 11:13:50 +02:00
lexeme.md	fix 's typo's across code base (#8384)	2021-06-15 10:57:08 +02:00
lookups.md	Update docs, types and API consistency	2020-08-17 16:45:24 +02:00
matcher.md	Add NORM to Matcher feature in docs (#10560)	2022-03-28 10:35:47 +02:00
morphologizer.md	Update overwrite and scorer in API docs (#9384)	2021-10-11 10:35:07 +02:00
morphology.md	Document Assigned Attributes of Pipeline Components (#9041)	2021-09-01 12:09:39 +02:00
phrasematcher.md	🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167)	2021-10-14 15:21:40 +02:00
pipe.md	Document scorers in registry and components from #8766 (#8929)	2021-08-12 12:50:03 +02:00
pipeline-functions.md	Add doc_cleaner component (#9659)	2021-11-23 15:33:33 +01:00
scorer.md	Add micro PRF for morph scoring (#9546)	2021-10-29 10:29:29 +02:00
sentencerecognizer.md	Update overwrite and scorer in API docs (#9384)	2021-10-11 10:35:07 +02:00
sentencizer.md	Update overwrite and scorer in API docs (#9384)	2021-10-11 10:35:07 +02:00
span.md	Docs for v3.3 (#10628)	2022-04-28 14:09:35 +02:00
spancategorizer.md	Update default spans_key to sc in API docs (#10616)	2022-04-04 18:09:15 +02:00
spangroup.md	Override SpanGroups.setdefault to provide default SpanGroup (#10772)	2022-05-12 10:06:25 +02:00
stringstore.md	Update docs, types and API consistency	2020-08-17 16:45:24 +02:00
tagger.md	Document Tagger neg_prefix, fix typo (#9821)	2021-12-07 09:42:40 +01:00
textcategorizer.md	Fix Scorer.score_cats for missing labels (#9443)	2021-12-29 11:04:39 +01:00
tok2vec.md	Tidy up docs	2021-06-28 12:08:15 +02:00
token.md	Token sent attributes more consistent (#10164)	2022-02-08 08:35:37 +01:00
tokenizer.md	Add tokenizer option to allow Matcher handling for all rules (#10452)	2022-03-24 13:21:32 +01:00
top-level.md	#10672: fixes displacy output for manual unsorted entities (#10673)	2022-04-27 09:51:58 +02:00
transformer.md	Update docs for spacy-transformers v1.1 data classes (#9361)	2021-10-18 14:16:58 +02:00
vectors.md	Docs for v3.3 (#10628)	2022-04-28 14:09:35 +02:00
vocab.md	Add vector deduplication (#10551)	2022-03-30 08:54:23 +02:00