spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-09-23 12:36:46 +03:00

Author	SHA1	Message	Date
adrianeboyd	ae4af52ce7	Add ideographic stops to sentencizer (#5263) Add ideographic half- and fullwidth full stops to default sentencizer punctuation.	2020-04-08 12:58:39 +02:00
Sofie Van Landeghem	7ad0fcf01d	fix json (#5267)	2020-04-08 12:58:09 +02:00
adrianeboyd	fa760010a5	Set rank for new vector in Vocab.set_vector (#5266) Set `Lexeme.rank` for vectors added with `Vocab.set_vector` so that the lexeme `ID` accessed by a model points to the right row for the new vector.	2020-04-07 12:04:51 +02:00
lfiedler	e1e25c7e30	issue5230: added unittest test case for completion	2020-04-06 21:36:02 +02:00
Leander Fiedler	b63871ceff	issue5230: added contributors agreement	2020-04-06 21:04:06 +02:00
Leander Fiedler	cde96f6c64	issue5230: optimized unit test a bit	2020-04-06 20:51:12 +02:00
Leander Fiedler	71cc903d65	issue5230: replaced open statements on path objects so that serialization still works and files are closed	2020-04-06 20:30:41 +02:00
Leander Fiedler	273ed452bb	issue5230: added unicode declaration at top of the file	2020-04-06 19:22:32 +02:00
Leander Fiedler	1cd975d4a5	issue5230: fixed resource warnings in language	2020-04-06 18:54:32 +02:00
Leander Fiedler	493c77462a	issue5230: test cases covering known sources of resource warnings	2020-04-06 18:46:51 +02:00
adrianeboyd	c981aa6684	Use inline flags in token_match patterns (#5257) * Use inline flags in token_match patterns Use inline flags in `token_match` patterns so that serializing does not lose the flag information. * Modify inline flag * Modify inline flag	2020-04-06 13:19:04 +02:00
adrianeboyd	e8be15e9b7	Improve tokenization for UD Spanish AnCora (#5253)	2020-04-06 13:18:23 +02:00
adrianeboyd	f4ef64a526	Improve tokenization for UD Dutch corpora (#5259) * Improve tokenization for UD Dutch corpora Improve tokenization for UD Dutch Alpino and LassySmall. * Format Dutch tokenizer exceptions	2020-04-06 13:18:07 +02:00
vincent d warmerdam	f329d5663a	add "whatlies" to spaCy universe (#5252) * Add "whatlies" We're releasing it on our side officially on the 16th of April. If possible, let's announce around the same time :) * sign contributor thing * Added fancy gif as the image * Update universe.json Spelling error and spaCy clarification.	2020-04-06 11:29:30 +02:00
Muhammad Irfan	406d5748b3	add missing Urdu tags	2020-04-05 20:55:38 +05:00
nlptechbook	ddf3c2430d	Update universe.json	2020-04-03 12:10:03 -04:00
YohannesDatasci	beef184e53	Armenian language support (#5246) * add Armenian language and test cases * agreement submission	2020-04-03 13:02:18 +02:00
Sofie Van Landeghem	1137420840	Small doc fixes (#5250) * fix link * torchtext instead of tochtext	2020-04-03 13:01:43 +02:00
Sofie Van Landeghem	9cf965c260	avoid enumerate to avoid long waiting at 0% (#5159)	2020-04-02 15:04:15 +02:00
Michael Leichtfried	2b14997b68	Remove duplicated branch in if/else-if statement (#5234) * Remove duplicated branch in if-elif-statement * Add contributor agreement for leicmi	2020-04-02 14:47:42 +02:00
adrianeboyd	d107afcffb	Raise error for inplace resize with new vector dim (#5228) Raise an error if there is an attempt to resize the vectors in place with a different vector dimension.	2020-04-02 10:43:13 +02:00
Jacob Lauritzen	0b76212831	Extend and fix Danish examples (#5227) * Extend and fix Danish examples This PR fixes two examples, adds additional examples translated from the English version, and adds punctuation. The two changed examples are: * "fortov" changed to "fortovet", which is more [used](https://www.google.com/search?client=firefox-b-d&sxsrf=ALeKk0143gEuPe4IbIUpzBBt-oU10OMVqA%3A1585549036477&ei=7I6BXuvJHMGOrwSqi46oCQ&q=l%C3%B8behjul+p%C3%A5+fortov&oq=l%C3%B8behjul+p%C3%A5+fortov&gs_lcp=CgZwc3ktYWIQAzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQR1DT8xZY0_MWYK_0FmgAcAZ4AIABAIgBAJIBAJgBAKABAaoBB2d3cy13aXo&sclient=psy-ab&ved=0ahUKEwjr7964xsHoAhVBx4sKHaqFA5UQ4dUDCAo&uact=5) and more natural. The Swedish and Norwegian examples also use this version of the word. * "stor by" changed to "storby". In Danish we have a specific noun to describe a large, metropolitan city which is different from just describing a city as "large". In this sentence it would be much more natural to describe London as a "storby". Google even corrects a search for "London stor by" to "London storby". * Sign contrib agreement	2020-04-02 10:42:35 +02:00
Ines Montani	09f8486eb1	Merge pull request #5223 from nikhilsaldanha/fix-entity-recognizer-docs update docs for return type of EntityRecognizer.predict	2020-03-29 19:10:42 +02:00
Ines Montani	99da6e1d79	Merge branch 'master' into fix-entity-recognizer-docs	2020-03-29 19:10:18 +02:00
Nikhil Saldanha	4f27a24f5b	Add kannada examples (#5162) * Add example sentences for Kannada * sign contributor agreement	2020-03-29 13:54:42 +02:00
adrianeboyd	d47b810ba4	Fix exclusive_classes in textcat ensemble (#5166) Pass the exclusive_classes setting to the bow model within the ensemble textcat model.	2020-03-29 13:52:34 +02:00
Tom Milligan	e904958115	Limit to cupy-cuda v8, so as not to pull in v9 automatically. (#5194)	2020-03-29 13:52:08 +02:00
adrianeboyd	963bd890c1	Modify Vector.resize to work with cupy and improve resizing (#5216) * Modify Vector.resize to work with cupy Modify `Vectors.resize` to work with cupy. Modify behavior when resizing to a different vector dimension so that individual vectors are truncated or extended with zeros instead of having the original values filled into the new shape without regard for the original axes. * Update spacy/tests/vocab_vectors/test_vectors.py Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com> Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-03-29 13:51:20 +02:00
Nikhil Saldanha	be6d10517f	sign contributor agreement	2020-03-28 18:36:55 +01:00
Nikhil Saldanha	d1ddfa1cb7	update docs for EntityRecognizer.predict return type was wrongly written as a tuple, changed to syntax.StateClass	2020-03-28 18:13:02 +01:00
Tiljander	e53232533b	Describing priority rules for overlapping matches (#5197) * Describing priority rules for overlapping matches * Create Tiljander.md * Describing priority rules for overlapping matches * Update website/docs/api/entityruler.md Co-Authored-By: Ines Montani <ines@ines.io> Co-authored-by: Ines Montani <ines@ines.io>	2020-03-26 13:13:22 +01:00
adrianeboyd	8d3563f1c4	Minor bugfixes for train CLI (#5186) * Omit per_type scores from model-best calculations The addition of per_type scores to the included metrics (#4911) causes errors when they're compared while determining the best model, so omit them for this `max()` comparison. * Add default speed data for interrupted train CLI Add better speed meta defaults so that an interrupted iteration still produces a best model. Co-authored-by: Ines Montani <ines@ines.io>	2020-03-26 10:46:50 +01:00
adrianeboyd	a04f802099	Fix GoldParse init when token count differs (#5191) Fix the `GoldParse` initialization when the number of tokens has changed (due to merging subtokens with the parser).	2020-03-26 10:46:23 +01:00
adrianeboyd	d88a377bed	Remove Vectors.from_glove (#5209)	2020-03-26 10:45:47 +01:00
Ines Montani	828acffc12	Tidy up and auto-format	2020-03-25 12:28:12 +01:00
adrianeboyd	b71dd44dbc	Improved Romanian tokenization for UD RRT (#5206) Modifications to Romanian tokenization to improve tokenization for UD_Romanian-RRT.	2020-03-25 11:28:19 +01:00
adrianeboyd	86c43e55fa	Improve Lithuanian tokenization (#5205) * Improve Lithuanian tokenization Modify Lithuanian tokenization to improve performance for UD_Lithuanian-ALKSNIS. * Update Lithuanian tokenizer tests	2020-03-25 11:28:12 +01:00
adrianeboyd	1a944e5976	Improve Italian tokenization (#5204) Improve Italian tokenization for UD_Italian-ISDT.	2020-03-25 11:28:02 +01:00
adrianeboyd	923a453449	Modifications/updates to Portuguese tokenization (#5203) Modifications to Portuguese tokenization for UD_Portuguese-Bosque. Instead of splitting contractions as exceptions, they are kept as merged tokens.	2020-03-25 11:27:53 +01:00
adrianeboyd	4117a5c705	Improve French tokenization (#5202) Improve French tokenization for UD_French-Sequoia.	2020-03-25 11:27:42 +01:00
Ines Montani	a3d09ffe61	Merge pull request #5201 from adrianeboyd/feature/ud-tokenization-nb-v2 Improved tokenization for UD_Norwegian-Bokmaal	2020-03-25 11:27:31 +01:00
Ines Montani	0e8dfdf77e	Merge pull request #5065 from adrianeboyd/feature/ud-tokenization-da Add a few more Danish tokenizer exceptions	2020-03-25 11:27:19 +01:00
Adriane Boyd	09d442f5ad	Merge remote-tracking branch 'upstream/master' into feature/ud-tokenization-da	2020-03-25 09:41:52 +01:00
Adriane Boyd	cba2d1d972	Disable failing abbreviation test UD_Danish-DDT has (as far as I can tell) hallucinated periods after abbreviations, so the changes are an artifact of the corpus and not due to anything meaningful about Danish tokenization.	2020-03-25 09:39:26 +01:00
Adriane Boyd	79737adb90	Improved tokenization for UD_Norwegian-Bokmaal	2020-03-25 08:54:02 +01:00
Ines Montani	5f2afa0479	Merge pull request #5185 from adrianeboyd/bugfix/de-punctuation-style Improve German tokenizer settings style	2020-03-24 16:38:32 +01:00
Ines Montani	3fc2309c48	Merge pull request #5174 from Baciccin/master Add Ligurian language	2020-03-24 16:33:59 +01:00
Ines Montani	f434d6aaa9	Merge pull request #5190 from guerda/patch-1 Remove max_length parameter in PhraseMatcher example	2020-03-24 16:32:12 +01:00
Philip Gillißen	128acb9ee1	Update guerda.md	2020-03-24 10:42:30 +01:00
Philip Gillißen	5d067bcc5e	Add SCA for guerda	2020-03-24 10:42:10 +01:00

11610 Commits