spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-12-05 17:24:29 +03:00

Author	SHA1	Message	Date
Paul O'Leary McCann	00d481dd12	Stack the mention scorer In the reference implementations, there's usually a function to build a ffnn of arbitrary depth, consisting of a stack of Linear >> Relu >> Dropout. In practice the depth is always 1 in coref-hoi, but in earlier iterations of the model, which are more similar to our model here (since we aren't using attention or even necessarily BERT), using a small depth like 2 was common. This hard-codes a stack of 2. In brief tests this allows similar performance to the unstacked version with much smaller embedding sizes. The depth of the stack could be made into a hyperparameter.	2021-08-09 18:04:42 +09:00
Paul O'Leary McCann	56803d3909	Change mention limit to match reference implementations This generally means fewer spans are considered, which makes individual steps in training faster but can make training take longer to find the good spans.	2021-08-08 19:55:52 +09:00
Paul O'Leary McCann	1d1679d431	Minor speedup This continue should be a break. The current form doesn't cause errors but using a break will be a bit faster.	2021-07-21 19:50:10 +09:00
Paul O'Leary McCann	a151c62d13	Add sentence map test	2021-07-19 13:05:26 +09:00
Paul O'Leary McCann	3ed0fae671	Add multi-sentence mention test Also formatting.	2021-07-19 13:00:16 +09:00
Paul O'Leary McCann	8bd0474730	Run black	2021-07-18 20:20:22 +09:00
Paul O'Leary McCann	bc081c24fa	Add full traditional scoring This calculates scores as an average of three metrics. As noted in the code, these metrics all have issues, but we want to use them to match up with prior work. This should be replaced with some simpler default scoring and the scorer here should be moved to an external project to be passed in just for generating the traditional scores.	2021-07-18 20:13:10 +09:00
Paul O'Leary McCann	a4531be099	Add simple mention test	2021-07-18 19:15:32 +09:00
Paul O'Leary McCann	9b63cbb775	Add extract spans import	2021-07-15 18:16:53 +09:00
Paul O'Leary McCann	e9626e38c1	Fix serialization test This test was failing not because the thing it was testing wasn't working, but because of the way span equality works. Span equality relies on doc equality, and doc equality is object identity, so spans from different docs will never be equal.	2021-07-14 18:37:34 +09:00
Paul O'Leary McCann	4a9dc00d86	Use relative indices for mentions Was using batch absolute indices to manage mentions, but extract_spans expects doc-relative ones.	2021-07-14 18:36:18 +09:00
Paul O'Leary McCann	3684f7fdfd	Remove comment from fixed test	2021-07-14 18:22:14 +09:00
Paul O'Leary McCann	f1796e4af7	Fix mention list bug There was an off-by-one error in how mentions are generated that would affect mentions at the end of a sentence. This was pretty nasty.	2021-07-14 18:19:00 +09:00
Paul O'Leary McCann	80a17071d3	Remove unused code	2021-07-11 18:46:39 +09:00
Paul O'Leary McCann	447c7070e3	Fix loss Accidentally deleted it	2021-07-10 22:45:25 +09:00
Paul O'Leary McCann	c25ec292a9	Cleanup	2021-07-10 22:42:55 +09:00
Paul O'Leary McCann	e00bd422d9	Fix span embeds Some of the lengths and backprop weren't right. Also various cleanup.	2021-07-10 21:38:53 +09:00
Paul O'Leary McCann	d7d317a1b5	Clean up span embedding code This is now cleaner and significantly faster. There's still some messy parts in the code (particularly variable names), will get to that later.	2021-07-10 19:59:08 +09:00
Paul O'Leary McCann	dc1f974d39	Merge branch 'master' into feature/coref	2021-07-10 18:10:40 +09:00
Paul O'Leary McCann	f34915c1e8	Use scatter_add to speed up span embed backprop This was the slowest part of the code, and using scatter_add here probably reduces the runtime by 50%.	2021-07-10 18:08:51 +09:00
Adriane Boyd	d8805a1073	Fix ru/uk lemmatizer mp with spawn (#8657) Use an instance variable instead of a class variable for the morphological analyzer so that multiprocessing with spawn is possible.	2021-07-09 15:36:56 +02:00
Adriane Boyd	b8e720fdb9	Fix Azerbaijani init, extend lang init tests (#8656) * Extend langs in initialize tests * Fix az init	2021-07-09 15:36:35 +02:00
explosion-bot	334f1f98d8	Auto-format code with black	2021-07-09 08:06:06 +00:00
Paul O'Leary McCann	d0b041aff4	Switch to using Thinc tuplify The tuplify code here was added to Thinc proper and that's been released, so no need to have it here any more.	2021-07-08 16:08:36 +09:00
Sofie Van Landeghem	64fac754fe	add spacy prefix to ngram_suggester.v1 (#8623)	2021-07-07 08:09:30 +02:00
Sofie Van Landeghem	733e8ceea9	fix spancat initialize with labels (#8620)	2021-07-06 19:08:25 +02:00
Sofie Van Landeghem	608fc1d623	avoid msg var implicitness (#8619) * avoid msg var implicitness * rename local msg * Add CI tests for debug data and train * Adjust debug data CLI test Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-07-06 19:08:08 +02:00
Sofie Van Landeghem	e7d747e3ee	TransitionBasedParser.v1 to legacy (#8586) * TransitionBasedParser.v1 to legacy * register sublayers * bump spacy-legacy to 3.0.7	2021-07-06 15:26:45 +02:00
Luca Dorigo	e8ef4a46d5	Add the right return type for Language.pipe and an overload for the as_tuples case (#8441) * Add the right return type for Language.pipe and an overload for the as_tuples version * Reformat, tidy up Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-07-06 14:18:40 +02:00
Sofie Van Landeghem	b9f59118bf	Fix silent evaluation (#8581) * fix silentness * sneak in docs typo fix * pass silent boolean instead	2021-07-06 14:16:19 +02:00
Sofie Van Landeghem	3daf57d70c	Small spancat fixes (#8614) * two small fixes + additional tests * rename	2021-07-06 14:15:41 +02:00
Ines Montani	327f83573a	Move scores per type handling into util function (#8590)	2021-07-06 13:02:37 +02:00
Adriane Boyd	5fd0b5207e	Fix vectors check for sourced components (#8559) * Fix vectors check for sourced components Since vectors are not loaded when components are sourced, store a hash for the vectors of each sourced component and compare it to the loaded vectors after the vectors are loaded from the `[initialize]` block. * Pop temporary info * Remove stored hash in remove_pipe * Add default for pop * Add additional convert/debug/assemble CLI tests	2021-07-06 12:43:17 +02:00
Adriane Boyd	29906884c5	Raise an error for textcat with <2 labels (#8584) * Raise an error for textcat with <2 labels Raise an error if initializing a `textcat` component without at least two labels. * Add similar note to docs * Update positive_label description in API docs	2021-07-06 12:35:22 +02:00
Paul O'Leary McCann	eb5820b593	Improve take_vecs implementation This pulls out references to needed bits so that other parts (the larger embeddings) can be freed before backprop.	2021-07-05 21:08:42 +09:00
Paul O'Leary McCann	13bef2ddb6	Add width prior feature Not necessary for convergence, but in coref-hoi this seems to add a few f1 points. Note that there are two width-related features in coref-hoi. This is a "prior" that is added to mention scores. The other width related feature is appended to the span embedding representation for other layers to reference.	2021-07-05 21:06:28 +09:00
Paul O'Leary McCann	8f66176b2d	Fix loss? This rewrites the loss to not use the Thinc crossentropy code at all. The main difference here is that the negative predictions are being masked out (= marginalized over), but negative gradient is still being reflected. I'm still not sure this is exactly right but models seem to train reliably now.	2021-07-05 18:17:10 +09:00
Paul O'Leary McCann	5db28ec2fd	Tweak mention limit calculation The calculation of this in the coref-hoi code is hard to follow. Based on comments and variable names it sounds like it's using the doc length, but it might actually be the number of mentions? Number of mentions should be much larger and seems more correct, but might want to revisit this.	2021-07-03 21:13:32 +09:00
Paul O'Leary McCann	2d3c559dc4	On initialize, use just two samples Coref docs are kind of long, and using 10 samples on a smallish GPU can cause OOMs.	2021-07-03 18:43:03 +09:00
Paul O'Leary McCann	251a5b43ac	Minor fix in crossing spans code I think this was technically incorrect but harmless. The reason the code here is different than the reference in coref-hoi is that the indices there are such that they get +1 at the end of processing, while the code here handles indices directly.	2021-07-03 18:41:46 +09:00
Paul O'Leary McCann	865caedebd	Remove XXX comment Comment wondered if there should be some subtraction to avoid double counting, but it probably doesn't matter because the diagonal is 0.	2021-07-03 18:40:38 +09:00
Paul O'Leary McCann	d74fa82c80	Fix axis handling in topk In practice this is only ever used with axis=1, so it wasn't causing issues, even though it was wrong.	2021-07-03 18:39:25 +09:00
Paul O'Leary McCann	f2e0e9dc28	Move placeholder handling into model code	2021-07-03 18:38:48 +09:00
Paul O'Leary McCann	3f66e18592	Clean up pw_prod loss This doesn't change the math but makes the transposes slightly easier to understand (maybe?).	2021-07-03 18:33:17 +09:00
explosion-bot	ee37288a1f	Auto-format code with black	2021-07-02 07:48:26 +00:00
Ines Montani	af9d984407	Merge pull request #8405 from svlandeg/fix/whitespace_tokenizer [ci skip]	2021-06-30 20:52:59 +10:00
Adriane Boyd	2b8c679a3d	Fix duplicate spacy package CLI opts (#8551) Use `-c` for `--code` and not additionally for `--create-meta`, in line with the docs.	2021-06-30 11:23:26 +02:00
Ines Montani	7f65902702	Merge pull request #8522 from adrianeboyd/chore/update-flake8 Update flake8 version in reqs and CI	2021-06-28 21:46:06 +10:00
Adriane Boyd	86d01e9229	Tidy up with flake8: imports, comparisons, etc.	2021-06-28 12:08:15 +02:00
Adriane Boyd	5eeb25f043	Tidy up code	2021-06-28 12:08:15 +02:00
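
Commit f34915c1e8 above mentions using scatter_add to speed up span embedding backprop. As a rough illustration of what a scatter-add does, here is a minimal pure-Python sketch. It is not the code from that commit: the helper `scatter_add`, the variable names, and the toy numbers are all assumptions made for this example. The idea is that when several spans cover the same token, each span's gradient has to be accumulated into that token's row rather than overwriting it.

```python
def scatter_add(table, indices, values):
    """Add values[i] into table[indices[i]], accumulating over repeated indices.

    Hypothetical pure-Python stand-in for an array-library scatter-add;
    it modifies `table` in place and also returns it.
    """
    for idx, row in zip(indices, values):
        for j, v in enumerate(row):
            table[idx][j] += v
    return table


# Toy setup (assumed for illustration): 4 tokens with width-2 vectors.
token_grads = [[0.0, 0.0] for _ in range(4)]

# Span A covers tokens 0-1 and span B covers tokens 1-2, so token 1 is shared.
span_token_indices = [0, 1, 1, 2]
span_grads = [
    [1.0, 1.0],  # span A, token 0
    [1.0, 1.0],  # span A, token 1
    [2.0, 2.0],  # span B, token 1
    [2.0, 2.0],  # span B, token 2
]

scatter_add(token_grads, span_token_indices, span_grads)
# token 1 now holds the sum of both spans' gradients; token 3 is untouched.
```

In an array library the same accumulation is done in one vectorized call instead of a Python loop, which is presumably where the speedup described in the commit message comes from.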

8799 Commits