spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-09-23 12:36:46 +03:00

Author	SHA1	Message	Date
adrianeboyd	84e06f9fb7	Improve GoldParse NER alignment (#5335 ) Improve GoldParse NER alignment by including all cases where the start and end of the NER span can be aligned, regardless of internal tokenization differences. To do this, convert BILUO tags to character offsets, check start/end alignment with `doc.char_span()`, and assign the BILUO tags for the aligned spans. Alignment for `O/-` tags is handled through the one-to-one and multi alignments.	2020-04-23 16:58:23 +02:00
Mike	481574cbc8	[minor doc change] embedding vis. link is broken in `website/docs/usage/examples.md` (#5325 ) * The embedding vis. link is broken The first link seems to be reasonable for now unless someone has an updated embedding vis they want to share? * contributor agreement * Update Mlawrence95.md * Update website/docs/usage/examples.md Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2020-04-21 20:35:12 +02:00
adrianeboyd	521f361052	Switch to new gold.align method (#5334 ) * Switch from original `_align` to new simpler alignment algorithm from #4526 * Remove alignment normalizations beyond whitespace and lowercasing	2020-04-21 19:31:03 +02:00
Matthew Honnibal	b2ef6100af	Only run backprop once when shared tok2vec weights (#5331 ) Previously, pipelines with shared tok2vec weights would call the tok2vec backprop callback multiple times, once for each pipeline component. This caused errors for PyTorch, and was inefficient. Instead, accumulate the gradient for all but one component, and just call the callback once.	2020-04-21 19:30:41 +02:00
adrianeboyd	bf5c13d170	Modify jieba install message (#5328 ) Modify jieba install message to instruct the user to use `ChineseDefaults.use_jieba = False` so that it's possible to load pkuseg-only models without jieba installed.	2020-04-20 22:06:53 +02:00
Matthew Honnibal	6918d99b6c	Improve GPU usage for train-with-config (#5330 ) * Adjust for no ops in Optimizer * Fix gpu in train-from-config * Update train-from-config script * Fix parser * Fix GPU efficiency of padding backprop	2020-04-20 22:06:28 +02:00
Ines Montani	b919844fce	Tidy up and fix alignment of landing cards (#5317 )	2020-04-20 20:33:13 +02:00
adrianeboyd	f7471abd82	Add pkuseg and serialization support for Chinese (#5308 ) * Add pkuseg and serialization support for Chinese Add support for pkuseg alongside jieba * Specify model through `Language` meta: * split on characters (if no word segmentation packages are installed) ``` Chinese(meta={"tokenizer": {"config": {"use_jieba": False, "use_pkuseg": False}}}) ``` * jieba (remains the default tokenizer if installed) ``` Chinese() Chinese(meta={"tokenizer": {"config": {"use_jieba": True}}}) # explicit ``` * pkuseg ``` Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}}}) ``` * The new tokenizer setting `require_pkuseg` is used to override `use_jieba` default, which is intended for models that provide a pkuseg model: ``` nlp_pkuseg = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "require_pkuseg": True}}}) nlp = Chinese() # has `use_jieba` as `True` by default nlp.from_bytes(nlp_pkuseg.to_bytes()) # `require_pkuseg` overrides `use_jieba` when calling the tokenizer ``` Add support for serialization of tokenizer settings and pkuseg model, if loaded * Add sorting for `Language.to_bytes()` serialization of `Language.meta` so that the (emptied, but still present) tokenizer metadata is in a consistent position in the serialized data Extend tests to cover all three tokenizer configurations and serialization * Fix from_disk and tests without jieba or pkuseg * Load cfg first and only show error if `use_pkuseg` * Fix blank/default initialization in serialization tests * Explicitly initialize jieba's cache on init * Add serialization for pkuseg pre/postprocessors * Reformat pkuseg install message	2020-04-18 17:01:53 +02:00
laszabine	fb73d4943a	Amend documentation to Language.evaluate (#5319 ) * Specified usage of arguments to Language.evaluate * Created contributor agreement	2020-04-16 20:00:18 +02:00
Ines Montani	068146d4ca	Update netlify.toml [ci skip]	2020-04-16 14:45:25 +02:00
Jakob Jul Elben	663333c3b2	Fixes #5413 (#5315 ) * Fix 5314 * Add contributor * Resolve requested changes Co-authored-by: Jakob Jul Elben <jakob@datamaga.com>	2020-04-16 13:29:02 +02:00
Sébastien Harinck	dac70f29eb	contrib: add contributor agreement for user sebastienharinck (#5316 )	2020-04-16 11:32:09 +02:00
Sébastien Harinck	688a328668	docs(website): fix issue on example in spacy-lookup	2020-04-15 16:47:29 +02:00
Paolo Arduin	1ca32d8f9c	Matcher support for Span as well as Doc (#5113 ) * Matcher support for Span, as well as Doc #5056 * Removes an import unused * Signed contributors agreement * Code optimization and better test * Add error message for bad Matcher call argument * Fix merging	2020-04-15 13:51:33 +02:00
Thomas Thiebaud	1eef60c658	Add spacy_fastlang to universe (#5271 ) * Add spacy_fastlang to universe * Sign SCA	2020-04-15 13:50:46 +02:00
adrianeboyd	98c59027ed	Use max(uint64) for OOV lexeme rank (#5303 ) * Use max(uint64) for OOV lexeme rank * Add test for default OOV rank * Revert back to thinc==7.4.0 Requiring the updated version of thinc was unnecessary. * Define OOV_RANK in one place Define OOV_RANK in one place in `util`. * Fix formatting [ci skip] * Switch to external definitions of max(uint64) Switch to external definitions of max(uint64) and confirm that they are equal.	2020-04-15 13:49:47 +02:00
adrianeboyd	3d2c308906	Add Doc init from list of words and text (#5251 ) * Add Doc init from list of words and text Add an option to initialize a `Doc` from a text and list of words where the words may or may not include all whitespace tokens. If the text and words are mismatched, raise an error. * Fix error code * Remove all whitespace before aligning words/text * Move words/text init to util function * Update error message * Rename to get_words_and_spaces * Fix formatting	2020-04-14 19:15:52 +02:00
Paolo Arduin	8ce408d2e1	Comparison predicate handling for `!=` (#5282 ) * Fix #5281 * Optim test	2020-04-14 19:14:15 +02:00
Sofie Van Landeghem	a3965ec13d	tag-map-path since 2.2.4 instead of 2.2.3 (#5289 )	2020-04-14 14:53:47 +02:00
Marek Grzenkowicz	6a8a52650f	[Closes #5292 ] Fix typo in option name "--n-save_every" (#5293 ) * Sign contributor agreement for chopeen * Fix typo in option name and close #5292	2020-04-11 23:35:01 +02:00
Umar Butler	8952effcc4	Fixed Typo in Warning (#5284 ) * Fixed typo in cli warning Fixed a typo in the warning for the provision of exactly two labels, which have not been designated as binary, to textcat. * Create and signed contributor form	2020-04-09 15:46:15 +02:00
Sofie Van Landeghem	42364dcd9f	Remove "pala" tokenizer exception for Spanish (#5265 )	2020-04-09 10:21:20 +02:00
adrianeboyd	cf579a398d	Add __init__.py to eu and hy tests (#5278 )	2020-04-08 20:03:06 +02:00
adrianeboyd	ae4af52ce7	Add ideographic stops to sentencizer (#5263 ) Add ideographic half- and fullwidth full stops to default sentencizer punctuation.	2020-04-08 12:58:39 +02:00
Sofie Van Landeghem	7ad0fcf01d	fix json (#5267 )	2020-04-08 12:58:09 +02:00
adrianeboyd	fa760010a5	Set rank for new vector in Vocab.set_vector (#5266 ) Set `Lexeme.rank` for vectors added with `Vocab.set_vector` so that the lexeme `ID` accessed by a model points the right row for the new vector.	2020-04-07 12:04:51 +02:00
adrianeboyd	c981aa6684	Use inline flags in token_match patterns (#5257 ) * Use inline flags in token_match patterns Use inline flags in `token_match` patterns so that serializing does not lose the flag information. * Modify inline flag * Modify inline flag	2020-04-06 13:19:04 +02:00
adrianeboyd	e8be15e9b7	Improve tokenization for UD Spanish AnCora (#5253 )	2020-04-06 13:18:23 +02:00
adrianeboyd	f4ef64a526	Improve tokenization for UD Dutch corpora (#5259 ) * Improve tokenization for UD Dutch corpora Improve tokenization for UD Dutch Alpino and LassySmall. * Format Dutch tokenizer exceptions	2020-04-06 13:18:07 +02:00
vincent d warmerdam	f329d5663a	add "whatlies" to spaCy universe (#5252 ) * Add "whatlies" We're releasing it on our side officially on the 16th of April. If possible, let's announce around the same time :) * sign contributor thing * Added fancy gif as the image * Update universe.json Spelling error and spaCy clarification.	2020-04-06 11:29:30 +02:00
Muhammad Irfan	406d5748b3	add missing Urdu tags	2020-04-05 20:55:38 +05:00
nlptechbook	ddf3c2430d	Update universe.json	2020-04-03 12:10:03 -04:00
Sofie Van Landeghem	b2e93be867	Optimizer defaults (#5244 ) * set optimizer defaults to mimic thinc 7 + bump to dev6 * larger error range for senter overfitting test	2020-04-03 13:02:46 +02:00
YohannesDatasci	beef184e53	Armenian language support (#5246 ) * add Armenian language and test cases * agreement submission	2020-04-03 13:02:18 +02:00
Sofie Van Landeghem	1137420840	Small doc fixes (#5250 ) * fix link * torchtext instead tochtext	2020-04-03 13:01:43 +02:00
Sofie Van Landeghem	9cf965c260	avoid enumerate to avoid long waiting at 0% (#5159 )	2020-04-02 15:04:15 +02:00
Michael Leichtfried	2b14997b68	Remove duplicated branch in if/else-if statement (#5234 ) * Remove duplicated branch in if-elif-statement * Add contributor agreement for leicmi	2020-04-02 14:47:42 +02:00
adrianeboyd	b71a11ff6d	Update morphologizer (#5108 ) * Add pos and morph scoring to Scorer Add pos, morph, and morph_per_type to `Scorer`. Report pos and morph accuracy in `spacy evaluate`. * Update morphologizer for v3 * switch to tagger-based morphologizer * use `spacy.HashCharEmbedCNN` for morphologizer defaults * add `Doc.is_morphed` flag * Add morphologizer to train CLI * Add basic morphologizer pipeline tests * Add simple morphologizer training example * Remove subword_features from CharEmbed models Remove `subword_features` argument from `spacy.HashCharEmbedCNN.v1` and `spacy.HashCharEmbedBiLSTM.v1` since in these cases `subword_features` is always `False`. * Rename setting in morphologizer example Use `with_pos_tags` instead of `without_pos_tags`. * Fix kwargs for spacy.HashCharEmbedBiLSTM.v1 * Remove defaults for spacy.HashCharEmbedBiLSTM.v1 Remove default `nM/nC` for `spacy.HashCharEmbedBiLSTM.v1`. * Set random seed for textcat overfitting test	2020-04-02 14:46:32 +02:00
adrianeboyd	d107afcffb	Raise error for inplace resize with new vector dim (#5228 ) Raise an error if there is an attempt to resize the vectors in place with a different vector dimension.	2020-04-02 10:43:13 +02:00
Jacob Lauritzen	0b76212831	Extend and fix Danish examples (#5227 ) * Extend and fix Danish examples This PR fixes two examples, adds additional examples translated from the English version, and adds punctuation. The two changed examples are: * "fortov" changed to "fortovet", which is more [used](https://www.google.com/search?client=firefox-b-d&sxsrf=ALeKk0143gEuPe4IbIUpzBBt-oU10OMVqA%3A1585549036477&ei=7I6BXuvJHMGOrwSqi46oCQ&q=l%C3%B8behjul+p%C3%A5+fortov&oq=l%C3%B8behjul+p%C3%A5+fortov&gs_lcp=CgZwc3ktYWIQAzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQR1DT8xZY0_MWYK_0FmgAcAZ4AIABAIgBAJIBAJgBAKABAaoBB2d3cy13aXo&sclient=psy-ab&ved=0ahUKEwjr7964xsHoAhVBx4sKHaqFA5UQ4dUDCAo&uact=5) and more natural. The Swedish and Norwegian examples also use this version of the word. * "stor by" changed to "storby". In Danish we have a specific noun to describe a large, metropolitan city which is different from just describing a city as "large". In this sentence it would be much more natural to describe London as a "storby". Google even corrects a search for "London stor by" to "London storby". * Sign contrib agreement	2020-04-02 10:42:35 +02:00
Sofie Van Landeghem	ab59f3124e	fix NEL overfitting test for GPU (#5236 )	2020-04-02 10:32:52 +02:00
Sofie Van Landeghem	311133e579	Train textcat with config (#5143 ) * bring back default build_text_classifier method * remove _set_dims_ hack in favor of proper dim inference * add tok2vec initialize to unit test * small fixes * add unit test for various textcat config settings * logistic output layer does not have nO * fix window_size setting * proper fix * fix W initialization * Update textcat training example * Use ml_datasets * Convert training data to `Example` format * Use `n_texts` to set proportionate dev size * fix _init renaming on latest thinc * avoid setting a non-existing dim * update to thinc==8.0.0a2 * add BOW and CNN defaults for easy testing * various experiments with train_textcat script, fix softmax activation in textcat bow * allow textcat train script to work on other datasets as well * have dataset as a parameter * train textcat from config, with example config * add config for training textcat * formatting * fix exclusive_classes * fixing BOW for GPU * bump thinc to 8.0.0a3 (not published yet so CI will fail) * add in link_vectors_to_models which got deleted Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-03-29 19:40:36 +02:00
Ines Montani	09f8486eb1	Merge pull request #5223 from nikhilsaldanha/fix-entity-recognizer-docs update docs for return type of EntityRecognizer.predict	2020-03-29 19:10:42 +02:00
Ines Montani	99da6e1d79	Merge branch 'master' into fix-entity-recognizer-docs	2020-03-29 19:10:18 +02:00
adrianeboyd	ce0e538068	Check whether doc is instantiated in Example.get_gold_parses() (#5167 ) * Check whether doc is instantiated When creating docs to pair with gold parses, modify test to check whether a doc is unset rather than whether it contains tokens. * Restore test of evaluate on an empty doc * Set a minimal gold.orig for the scorer Without a minimal gold.orig the scorer can't evaluate empty docs. This is the v3 equivalent of #4925.	2020-03-29 13:57:00 +02:00
Sofie Van Landeghem	d6d95674c1	bugfix in span similarity (#5155 ) * bugfix in span similarity * also rewrite doc.pyx for clarity * formatting	2020-03-29 13:56:07 +02:00
Nikhil Saldanha	4f27a24f5b	Add kannada examples (#5162 ) * Add example sentences for Kannada * sign contributor agreement	2020-03-29 13:54:42 +02:00
adrianeboyd	d47b810ba4	Fix exclusive_classes in textcat ensemble (#5166 ) Pass the exclusive_classes setting to the bow model within the ensemble textcat model.	2020-03-29 13:52:34 +02:00
Tom Milligan	e904958115	Limit to cupy-cuda v8, so as not to pull in v9 automatically. (#5194 )	2020-03-29 13:52:08 +02:00
adrianeboyd	963bd890c1	Modify Vector.resize to work with cupy and improve resizing (#5216 ) * Modify Vector.resize to work with cupy Modify `Vectors.resize` to work with cupy. Modify behavior when resizing to a different vector dimension so that individual vectors are truncated or extended with zeros instead of having the original values filled into the new shape without regard for the original axes. * Update spacy/tests/vocab_vectors/test_vectors.py Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com> Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-03-29 13:51:20 +02:00
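
Note on 963bd890c1 (Modify Vector.resize ...): the message says that when the vector dimension changes, each vector is truncated or extended with zeros rather than having the original values poured into the new shape. The sketch below illustrates that difference with plain Python lists. It is not spaCy's implementation; `resize_rows` and `naive_refill` are hypothetical names used only for this illustration, and the sketch covers only the width change, not a change in the number of rows.

```python
def resize_rows(rows, new_width):
    """Truncate or zero-pad each vector independently to new_width."""
    resized = []
    for row in rows:
        kept = list(row[:new_width])
        # Pads with zeros when the row is shorter; adds nothing otherwise.
        kept.extend([0.0] * (new_width - len(kept)))
        resized.append(kept)
    return resized


def naive_refill(rows, new_width):
    """Pour the flattened values into the new shape, ignoring the original axes."""
    flat = [value for row in rows for value in row]
    refilled = []
    for i in range(len(rows)):
        chunk = flat[i * new_width:(i + 1) * new_width]
        chunk.extend([0.0] * (new_width - len(chunk)))
        refilled.append(chunk)
    return refilled


vectors = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(resize_rows(vectors, 2))   # each row keeps its own leading values
print(naive_refill(vectors, 2))  # second row now holds values from the first
```

With the per-row approach, row 2 still starts with its own values (4.0, 5.0); with the naive refill it starts with 3.0, which belonged to row 1.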

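Note on 3d2c308906 (Add Doc init from list of words and text): the message describes aligning a list of words against a text and raising an error when they are mismatched. The sketch below is a simplified, standalone illustration of that idea. It is not the `get_words_and_spaces` function from the commit: `words_and_spaces` is a hypothetical helper, it assumes the word list contains no whitespace tokens, and it only records whether each word is followed by whitespace.

```python
def words_and_spaces(words, text):
    """Return (word, followed_by_whitespace) pairs, or raise on a mismatch."""
    pairs = []
    pos = 0
    for word in words:
        # Skip any whitespace between the previous word and this one.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if not text.startswith(word, pos):
            raise ValueError(f"word {word!r} not found at offset {pos}")
        pos += len(word)
        pairs.append((word, pos < len(text) and text[pos].isspace()))
    # Anything left over must be whitespace, otherwise the words do not cover the text.
    if text[pos:].strip():
        raise ValueError("text has content not covered by words")
    return pairs


print(words_and_spaces(["Hello", ",", "world", "!"], "Hello, world!"))
```

The pairs could then be split into the separate `words` and `spaces` lists that `Doc` accepts; that final step is left out here to keep the sketch free of a spaCy dependency.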

11636 Commits