spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-09-19 18:42:37 +03:00

Author	SHA1	Message	Date
adrianeboyd	49ef06d793	Add option for base model in init-model CLI (#5467 ) Intended for languages like Chinese with a custom tokenizer.	2020-05-20 18:49:11 +02:00
Adriane Boyd	4b229bfc22	Improve handling of NER in CoNLL-U MISC	2020-05-20 18:48:51 +02:00
Matthew Honnibal	609c0ba557	Fix accidentally quadratic runtime in Example.split_sents (#5464 ) * Tidy up train-from-config a bit * Fix accidentally quadratic perf in TokenAnnotation.brackets When we're reading in the gold data, we had a nested loop where we looped over the brackets for each token, looking for brackets that start on that word. This is accidentally quadratic, because we have one bracket per word (for the POS tags). So we had an O(N**2) behaviour here that ended up being pretty slow. To solve this I'm indexing the brackets by their starting word on the TokenAnnotations object, and having a property to provide the previous view. Fixes	2020-05-20 18:48:18 +02:00
Kevin Lu	c7c4cd5fe1	Changed pyate code example in universe.json	2020-05-20 09:11:32 -07:00
Adriane Boyd	daaa7bf451	Add option to omit extra lexeme tables in CLI	2020-05-20 15:51:44 +02:00
Adriane Boyd	8cba0e41d8	Return lowercase form as default except for PROPN	2020-05-20 15:35:08 +02:00
adrianeboyd	9393253b66	Remove peeking from Parser.begin_training (#5456 ) Inspect all instances in `Parser.begin_training` rather than only the first 1000.	2020-05-20 15:18:06 +02:00
Ines Montani	78bb9ff5e0	doc_or_span -> obj	2020-05-20 14:56:52 +02:00
Matthw Honnibal	60e8da4813	Tidy up train-from-config a bit	2020-05-20 12:56:27 +02:00
Matthw Honnibal	fda7355508	Fix train-from-config	2020-05-20 12:30:21 +02:00
Matthw Honnibal	24efd54a42	Merge from develop	2020-05-20 12:27:31 +02:00
Sofie Van Landeghem	7f5715a081	Various fixes to NEL functionality, Example class etc (#5460 ) * setting KB in the EL constructor, similar to how the model is passed on * removing wikipedia example files - moved to projects * throw an error when nlp.update is called with 2 positional arguments * rewriting the config logic in create pipe to accommodate for other objects (e.g. KB) in the config * update config files with new parameters * avoid training pipeline components that don't have a model (like sentencizer) * various small fixes + UX improvements * small fixes * set thinc to 8.0.0a9 everywhere * remove outdated comment	2020-05-20 11:41:12 +02:00
Adriane Boyd	4fa9670537	Extend lemmatizer rules for all UPOS tags	2020-05-20 10:15:43 +02:00
Kevin Lu	291b9ad7b9	Update CONTRIBUTOR_AGREEMENT.md	2020-05-19 20:29:53 -07:00
Kevin Lu	9a1a535215	Create kevinlu1248.md	2020-05-19 20:25:45 -07:00
Kevin Lu	a23b3a5a50	Update CONTRIBUTOR_AGREEMENT.md	2020-05-19 20:24:24 -07:00
Kevin Lu	0a5b140235	Update universe.json	2020-05-19 20:12:21 -07:00
Matthew Honnibal	664a3603b0	Set version to v3.0.0.dev8	2020-05-19 17:15:39 +02:00
adrianeboyd	40e65d6f63	Fix most_similar for vectors with unused rows (#5348 ) * Fix most_similar for vectors with unused rows Address issues related to the unused rows in the vector table and `most_similar`: * Update `most_similar()` to search only through rows that are in use according to `key2row`. * Raise an error when `most_similar(n=n)` is larger than the number of vectors in the table. * Set and restore `_unset` correctly when vectors are added or deserialized so that new vectors are added in the correct row. * Set data and keys to the same length in `Vocab.prune_vectors()` to avoid spurious entries in `key2row`. * Fix regression test using `most_similar` Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-19 16:41:26 +02:00
Matthew Honnibal	a2830c3ef5	Use thinc 8.0.0a9	2020-05-19 16:23:11 +02:00
Sofie Van Landeghem	f00de445dd	default models defined in component decorator (#5452 ) * move defaults to pipeline and use in component decorator * black formatting * relative import	2020-05-19 16:20:03 +02:00
adrianeboyd	70da1fd2d6	Add warning for misaligned character offset spans (#5007 ) * Add warning for misaligned character offset spans * Resolve conflict * Filter warnings in example scripts Filter warnings in example scripts to show warnings once, in particular warnings about misaligned entities. Co-authored-by: Ines Montani <ines@ines.io>	2020-05-19 16:01:18 +02:00
adrianeboyd	0061992d95	Update Polish tokenizer for UD_Polish-PDB (#5432 ) Update Polish tokenizer for UD_Polish-PDB, which is a relatively major change from the existing tokenizer. Unused exceptions files and conflicting test cases removed. Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-19 15:59:55 +02:00
adrianeboyd	a5cd203284	Reduce stored lexemes data, move feats to lookups (#5238 ) * Reduce stored lexemes data, move feats to lookups * Move non-derivable lexemes features (`norm / cluster / prob`) to `spacy-lookups-data` as lookups * Get/set `norm` in both lookups and `LexemeC`, serialize in lookups * Remove `cluster` and `prob` from `LexemesC`, get/set/serialize in lookups only * Remove serialization of lexemes data as `vocab/lexemes.bin` * Remove `SerializedLexemeC` * Remove `Lexeme.to_bytes/from_bytes` * Modify normalization exception loading: * Always create `Vocab.lookups` table `lexeme_norm` for normalization exceptions * Load base exceptions from `lang.norm_exceptions`, but load language-specific exceptions from lookups * Set `lex_attr_getter[NORM]` including new lookups table in `BaseDefaults.create_vocab()` and when deserializing `Vocab` * Remove all cached lexemes when deserializing vocab to override existing normalizations with the new normalizations (as a replacement for the previous step that replaced all lexemes data with the deserialized data) * Skip English normalization test Skip English normalization test because the data is now in `spacy-lookups-data`. * Remove norm exceptions Moved to spacy-lookups-data. * Move norm exceptions test to spacy-lookups-data * Load extra lookups from spacy-lookups-data lazily Load extra lookups (currently for cluster and prob) lazily from the entry point `lg_extra` as `Vocab.lookups_extra`. * Skip creating lexeme cache on load To improve model loading times, do not create the full lexeme cache when loading. The lexemes will be created on demand when processing. * Identify numeric values in Lexeme.set_attrs() With the removal of a special case for `PROB`, also identify `float` to avoid trying to convert it with the `StringStore`. 
* Skip lexeme cache init in from_bytes * Unskip and update lookups tests for python3.6+ * Update vocab pickle to include lookups_extra * Update vocab serialization tests Check strings rather than lexemes since lexemes aren't initialized automatically, account for addition of "_SP". * Re-skip lookups test because of python3.5 * Skip PROB/float values in Lexeme.set_attrs * Convert is_oov from lexeme flag to lex in vectors Instead of storing `is_oov` as a lexeme flag, `is_oov` reports whether the lexeme has a vector. Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-19 15:59:14 +02:00
Sofie Van Landeghem	0d94737857	Feature toggle_pipes (#5378 ) * make disable_pipes deprecated in favour of the new toggle_pipes * rewrite disable_pipes statements * update documentation * remove bin/wiki_entity_linking folder * one more fix * remove deprecated link to documentation * few more doc fixes * add note about name change to the docs * restore original disable_pipes * small fixes * fix typo * fix error number to W096 * rename to select_pipes * also make changes to the documentation Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-18 22:27:10 +02:00
Matthew Honnibal	333b1a308b	Adapt parser and NER for transformers (#5449 ) * Draft layer for BILUO actions * Fixes to biluo layer * WIP on BILUO layer * Add tests for BILUO layer * Format * Fix transitions * Update test * Link in the simple_ner * Update BILUO tagger * Update __init__ * Import simple_ner * Update test * Import * Add files * Add config * Fix label passing for BILUO and tagger * Fix label handling for simple_ner component * Update simple NER test * Update config * Hack train script * Update BILUO layer * Fix SimpleNER component * Update train_from_config * Add biluo_to_iob helper * Add IOB layer * Add IOBTagger model * Update biluo layer * Update SimpleNER tagger * Update BILUO * Read random seed in train-from-config * Update use of normal_init * Fix normalization of gradient in SimpleNER * Update IOBTagger * Remove print * Tweak masking in BILUO * Add dropout in SimpleNER * Update thinc * Tidy up simple_ner * Fix biluo model * Unhack train-from-config * Update setup.cfg and requirements * Add tb_framework.py for parser model * Try to avoid memory leak in BILUO * Move ParserModel into spacy.ml, avoid need for subclass. * Use updated parser model * Remove incorrect call to model.initialize in PrecomputableAffine * Update parser model * Avoid divide by zero in tagger * Add extra dropout layer in tagger * Refine minibatch_by_words function to avoid oom * Fix parser model after refactor * Try to avoid div-by-zero in SimpleNER * Fix infinite loop in minibatch_by_words * Use SequenceCategoricalCrossentropy in Tagger * Fix parser model when hidden layer * Remove extra dropout from tagger * Add extra nan check in tagger * Fix thinc version * Update tests and imports * Fix test * Update test * Update tests * Fix tests * Fix test Co-authored-by: Ines Montani <ines@ines.io>	2020-05-18 22:23:33 +02:00
Ines Montani	3100c97e69	Merge pull request #5441 from svlandeg/fix/updating	2020-05-18 10:53:41 +02:00
Ines Montani	e8ff4c1e6a	Pin flake8 version	2020-05-18 10:50:21 +02:00
Ines Montani	a41e28ceba	Merge pull request #5436 from ilivans/fix_errors_with_codes	2020-05-18 10:45:56 +02:00
Ilkyu Ju	72a25c9cef	Very minor issues in Korean example sentences (#5446 ) * Add contributor agreement * Improve ko translation of example sentences I fixed unnatural translations and word spacing errors. * Update osori.md	2020-05-17 13:43:34 +02:00
svlandeg	6fb6a8518c	bump to 3.0.0.dev7 and thinc to 8.0.0a8	2020-05-15 13:25:54 +02:00
svlandeg	047f3d7d94	remove ops argument for Adam	2020-05-15 13:25:00 +02:00
svlandeg	79d4f196e5	pin flake8 to 3.5.0	2020-05-15 11:53:01 +02:00
svlandeg	e0fda2bd81	throw warning when model_cfg is None	2020-05-15 11:02:10 +02:00
adrianeboyd	908dea3939	Skip duplicate lexeme rank setting (#5401 ) Skip duplicate lexeme rank setting within `_fix_pretrained_vectors_name()`.	2020-05-14 18:26:12 +02:00
adrianeboyd	f49e2810e6	Add Polish lemmatizer (#5413 ) * Add Polish lemmatizer Contributed by @ryszardtuora * Add missing import	2020-05-14 18:23:19 +02:00
adrianeboyd	e63880e081	Use Token.sent_start for Span.sent (#5439 ) Use `Token.sent_start` for sentence boundaries in `Span.sent` so that `Doc.sents` and `Span.sent` return the same sentence boundaries.	2020-05-14 18:22:51 +02:00
adrianeboyd	780b869345	Fix syntax iterators for Persian (#5437 )	2020-05-14 16:51:03 +02:00
Ilia Ivanov	ee8fe37474	Add ilivans' contributor agreement	2020-05-14 15:59:06 +02:00
Ilia Ivanov	712d9d4820	fixup! Fix ErrorsWithCodes().__class__ return value	2020-05-14 15:45:58 +02:00
Ilia Ivanov	a987e9e45d	Fix ErrorsWithCodes().__class__ return value	2020-05-14 14:14:15 +02:00
Vishnu Priya VR	9ce059dd06	Limiting noun_chunks for specific languages (#5396 ) * Limiting noun_chunks for specific languages * Limiting noun_chunks for specific languages Contributor Agreement * Addressing review comments * Removed unused fixtures and imports * Add fa_tokenizer in test suite * Use fa_tokenizer in test * Undo extraneous reformatting Co-authored-by: adrianeboyd <adrianeboyd@gmail.com>	2020-05-14 12:58:06 +02:00
Sofie Van Landeghem	b04738903e	prevent None in gold fields (#5425 ) * set gold fields to empty list instead of keeping them as None * add unit test	2020-05-13 22:08:50 +02:00
adrianeboyd	113e7981d0	Check that row is within bounds when adding vector (#5430 ) Check that row is within bounds for the vector data array when adding a vector. Don't add vectors with rank OOV_RANK in `init-model` (change is due to shift from OOV as 0 to OOV as OOV_RANK).	2020-05-13 22:08:28 +02:00
adrianeboyd	07639dd6ac	Remove TAG from da/sv tokenizer exceptions (#5428 ) Remove `TAG` value from Danish and Swedish tokenizer exceptions because it may not be included in a tag map (and these settings are problematic as tokenizer exceptions anyway).	2020-05-13 10:25:54 +02:00
adrianeboyd	24e7108f80	Modify array type to accommodate OOV_RANK (#5429 ) Modify indices array type in `Vocab.prune_vectors` to accommodate OOV_RANK index as max(uint64).	2020-05-13 10:25:05 +02:00
svlandeg	102c8c7e2f	fix fan_in renaming	2020-05-12 13:56:10 +02:00
svlandeg	9fe1e23512	update to thinc 8.0.0a6	2020-05-12 13:51:25 +02:00
Ines Montani	f333c2a011	Merge pull request #5386 from svlandeg/fix/nel-docs	2020-05-10 12:00:09 +02:00
adrianeboyd	440b81bddc	Improve exceptions for 'd (would/had) in English (#5379 ) Instead of treating `'d` in contractions like `I'd` as `would` in all cases in the tokenizer exceptions, leave the tagging and lemmatization up to later components.	2020-05-08 15:10:57 +02:00
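
Note on 609c0ba557: the message describes replacing a per-token scan over all brackets with an index keyed by each bracket's starting word. The following is only a minimal sketch of that general technique, not the code from the commit. The function names and the `(start, end, label)` tuple layout are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def brackets_by_scan(n_tokens, brackets):
    """Quadratic version: for every token, scan every bracket.

    `brackets` is a list of hypothetical (start, end, label) tuples.
    """
    return [[b for b in brackets if b[0] == i] for i in range(n_tokens)]


def brackets_by_index(n_tokens, brackets):
    """Linear version: group brackets by start word once, then look up."""
    by_start = defaultdict(list)
    for start, end, label in brackets:
        by_start[start].append((start, end, label))
    # Rebuild the per-token view from the index, one lookup per token.
    return [by_start.get(i, []) for i in range(n_tokens)]


# One bracket per word (as with POS tags) plus one longer span.
sample = [(0, 0, "DT"), (1, 1, "NN"), (2, 2, "VBZ"), (0, 1, "NP")]
per_token = brackets_by_index(3, sample)
```

With one bracket per word, the scan does about N comparisons for each of N tokens, while the indexed version makes one pass over the brackets and one lookup per token.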

11793 Commits