spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-09-23 04:26:46 +03:00

Author	SHA1	Message	Date
adrianeboyd	84e06f9fb7	Improve GoldParse NER alignment (#5335) Improve GoldParse NER alignment by including all cases where the start and end of the NER span can be aligned, regardless of internal tokenization differences. To do this, convert BILUO tags to character offsets, check start/end alignment with `doc.char_span()`, and assign the BILUO tags for the aligned spans. Alignment for `O/-` tags is handled through the one-to-one and multi alignments.	2020-04-23 16:58:23 +02:00
Mike	481574cbc8	[minor doc change] embedding vis. link is broken in `website/docs/usage/examples.md` (#5325) * The embedding vis. link is broken The first link seems to be reasonable for now unless someone has an updated embedding vis they want to share? * contributor agreement * Update Mlawrence95.md * Update website/docs/usage/examples.md Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2020-04-21 20:35:12 +02:00
adrianeboyd	521f361052	Switch to new gold.align method (#5334) * Switch from original `_align` to new simpler alignment algorithm from #4526 * Remove alignment normalizations beyond whitespace and lowercasing	2020-04-21 19:31:03 +02:00
adrianeboyd	bf5c13d170	Modify jieba install message (#5328) Modify jieba install message to instruct the user to use `ChineseDefaults.use_jieba = False` so that it's possible to load pkuseg-only models without jieba installed.	2020-04-20 22:06:53 +02:00
Ines Montani	b919844fce	Tidy up and fix alignment of landing cards (#5317)	2020-04-20 20:33:13 +02:00
adrianeboyd	f7471abd82	Add pkuseg and serialization support for Chinese (#5308) * Add pkuseg and serialization support for Chinese Add support for pkuseg alongside jieba * Specify model through `Language` meta: * split on characters (if no word segmentation packages are installed) ``` Chinese(meta={"tokenizer": {"config": {"use_jieba": False, "use_pkuseg": False}}}) ``` * jieba (remains the default tokenizer if installed) ``` Chinese() Chinese(meta={"tokenizer": {"config": {"use_jieba": True}}}) # explicit ``` * pkuseg ``` Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}}}) ``` * The new tokenizer setting `require_pkuseg` is used to override `use_jieba` default, which is intended for models that provide a pkuseg model: ``` nlp_pkuseg = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "require_pkuseg": True}}}) nlp = Chinese() # has `use_jieba` as `True` by default nlp.from_bytes(nlp_pkuseg.to_bytes()) # `require_pkuseg` overrides `use_jieba` when calling the tokenizer ``` Add support for serialization of tokenizer settings and pkuseg model, if loaded * Add sorting for `Language.to_bytes()` serialization of `Language.meta` so that the (emptied, but still present) tokenizer metadata is in a consistent position in the serialized data Extend tests to cover all three tokenizer configurations and serialization * Fix from_disk and tests without jieba or pkuseg * Load cfg first and only show error if `use_pkuseg` * Fix blank/default initialization in serialization tests * Explicitly initialize jieba's cache on init * Add serialization for pkuseg pre/postprocessors * Reformat pkuseg install message	2020-04-18 17:01:53 +02:00
laszabine	fb73d4943a	Amend documentation to Language.evaluate (#5319) * Specified usage of arguments to Language.evaluate * Created contributor agreement	2020-04-16 20:00:18 +02:00
Ines Montani	068146d4ca	Update netlify.toml [ci skip]	2020-04-16 14:45:25 +02:00
Jakob Jul Elben	663333c3b2	Fixes #5413 (#5315) * Fix 5314 * Add contributor * Resolve requested changes Co-authored-by: Jakob Jul Elben <jakob@datamaga.com>	2020-04-16 13:29:02 +02:00
Sébastien Harinck	dac70f29eb	contrib: add contributor agreement for user sebastienharinck (#5316)	2020-04-16 11:32:09 +02:00
Leander Fiedler	a3401b1194	issue5230 changed reference to function to anonymous function	2020-04-15 21:52:52 +02:00
Leander Fiedler	cef0c909b9	issue5230 changed reference to function to anonymous function	2020-04-15 19:28:33 +02:00
Paolo Arduin	1ca32d8f9c	Matcher support for Span as well as Doc (#5113) * Matcher support for Span, as well as Doc #5056 * Removes an import unused * Signed contributors agreement * Code optimization and better test * Add error message for bad Matcher call argument * Fix merging	2020-04-15 13:51:33 +02:00
Thomas Thiebaud	1eef60c658	Add spacy_fastlang to universe (#5271) * Add spacy_fastlang to universe * Sign SCA	2020-04-15 13:50:46 +02:00
adrianeboyd	98c59027ed	Use max(uint64) for OOV lexeme rank (#5303) * Use max(uint64) for OOV lexeme rank * Add test for default OOV rank * Revert back to thinc==7.4.0 Requiring the updated version of thinc was unnecessary. * Define OOV_RANK in one place Define OOV_RANK in one place in `util`. * Fix formatting [ci skip] * Switch to external definitions of max(uint64) Switch to external definitions of max(uint64) and confirm that they are equal.	2020-04-15 13:49:47 +02:00
adrianeboyd	3d2c308906	Add Doc init from list of words and text (#5251) * Add Doc init from list of words and text Add an option to initialize a `Doc` from a text and list of words where the words may or may not include all whitespace tokens. If the text and words are mismatched, raise an error. * Fix error code * Remove all whitespace before aligning words/text * Move words/text init to util function * Update error message * Rename to get_words_and_spaces * Fix formatting	2020-04-14 19:15:52 +02:00
Paolo Arduin	8ce408d2e1	Comparison predicate handling for `!=` (#5282) * Fix #5281 * Optim test	2020-04-14 19:14:15 +02:00
Sofie Van Landeghem	a3965ec13d	tag-map-path since 2.2.4 instead of 2.2.3 (#5289)	2020-04-14 14:53:47 +02:00
Leander Fiedler	6700006830	issue5230 attempted fix of pytest segfault for python3.5	2020-04-12 09:34:54 +02:00
Leander Fiedler	d60e2d3ebf	issue5230 added unit test for dumping and loading knowledgebase	2020-04-12 09:08:41 +02:00
Marek Grzenkowicz	6a8a52650f	[Closes #5292] Fix typo in option name "--n-save_every" (#5293) * Sign contributor agreement for chopeen * Fix typo in option name and close #5292	2020-04-11 23:35:01 +02:00
Leander Fiedler	d2bb649227	issue5230 filter warnings in addition to filterwarnings to prevent deprecation warnings in python35(win) setup to pop up	2020-04-10 23:21:13 +02:00
Leander Fiedler	ca2a7a44db	issue5230 store string values of warnings to remotely debug failing python35(win) setup	2020-04-10 22:26:55 +02:00
Leander Fiedler	88ca40a15d	issue5230 raise warnings as errors to remotely debug failing python35(win) setup	2020-04-10 21:45:53 +02:00
Leander Fiedler	a7bdfe42e1	issue5230 added print statement to warnings filter to remotely debug failing python35(win) setup	2020-04-10 21:14:33 +02:00
Leander Fiedler	8c1d0d628f	issue5230 writer now checks instance of loc parameter before trying to operate on it	2020-04-10 20:35:52 +02:00
Umar Butler	8952effcc4	Fixed Typo in Warning (#5284) * Fixed typo in cli warning Fixed a typo in the warning for the provision of exactly two labels, which have not been designated as binary, to textcat. * Create and signed contributor form	2020-04-09 15:46:15 +02:00
adrianeboyd	cf579a398d	Add __init__.py to eu and hy tests (#5278)	2020-04-08 20:03:06 +02:00
adrianeboyd	ae4af52ce7	Add ideographic stops to sentencizer (#5263) Add ideographic half- and fullwidth full stops to default sentencizer punctuation.	2020-04-08 12:58:39 +02:00
Sofie Van Landeghem	7ad0fcf01d	fix json (#5267)	2020-04-08 12:58:09 +02:00
adrianeboyd	fa760010a5	Set rank for new vector in Vocab.set_vector (#5266) Set `Lexeme.rank` for vectors added with `Vocab.set_vector` so that the lexeme `ID` accessed by a model points to the right row for the new vector.	2020-04-07 12:04:51 +02:00
lfiedler	e1e25c7e30	issue5230: added unittest test case for completion	2020-04-06 21:36:02 +02:00
Leander Fiedler	b63871ceff	issue5230: added contributors agreement	2020-04-06 21:04:06 +02:00
Leander Fiedler	cde96f6c64	issue5230: optimized unit test a bit	2020-04-06 20:51:12 +02:00
Leander Fiedler	71cc903d65	issue5230: replaced open statements on path objects so that serialization still works and files are closed	2020-04-06 20:30:41 +02:00
Leander Fiedler	273ed452bb	issue5230: added unicode declaration at top of the file	2020-04-06 19:22:32 +02:00
Leander Fiedler	1cd975d4a5	issue5230: fixed resource warnings in language	2020-04-06 18:54:32 +02:00
Leander Fiedler	493c77462a	issue5230: test cases covering known sources of resource warnings	2020-04-06 18:46:51 +02:00
adrianeboyd	c981aa6684	Use inline flags in token_match patterns (#5257) * Use inline flags in token_match patterns Use inline flags in `token_match` patterns so that serializing does not lose the flag information. * Modify inline flag * Modify inline flag	2020-04-06 13:19:04 +02:00
adrianeboyd	e8be15e9b7	Improve tokenization for UD Spanish AnCora (#5253)	2020-04-06 13:18:23 +02:00
adrianeboyd	f4ef64a526	Improve tokenization for UD Dutch corpora (#5259) * Improve tokenization for UD Dutch corpora Improve tokenization for UD Dutch Alpino and LassySmall. * Format Dutch tokenizer exceptions	2020-04-06 13:18:07 +02:00
vincent d warmerdam	f329d5663a	add "whatlies" to spaCy universe (#5252) * Add "whatlies" We're releasing it on our side officially on the 16th of April. If possible, let's announce around the same time :) * sign contributor thing * Added fancy gif as the image * Update universe.json Spelling error and spaCy clarification.	2020-04-06 11:29:30 +02:00
Muhammad Irfan	406d5748b3	add missing Urdu tags	2020-04-05 20:55:38 +05:00
nlptechbook	ddf3c2430d	Update universe.json	2020-04-03 12:10:03 -04:00
YohannesDatasci	beef184e53	Armenian language support (#5246) * add Armenian language and test cases * agreement submission	2020-04-03 13:02:18 +02:00
Sofie Van Landeghem	1137420840	Small doc fixes (#5250) * fix link * torchtext instead tochtext	2020-04-03 13:01:43 +02:00
Sofie Van Landeghem	9cf965c260	avoid enumerate to avoid long waiting at 0% (#5159)	2020-04-02 15:04:15 +02:00
Michael Leichtfried	2b14997b68	Remove duplicated branch in if/else-if statement (#5234) * Remove duplicated branch in if-elif-statement * Add contributor agreement for leicmi	2020-04-02 14:47:42 +02:00
adrianeboyd	d107afcffb	Raise error for inplace resize with new vector dim (#5228) Raise an error if there is an attempt to resize the vectors in place with a different vector dimension.	2020-04-02 10:43:13 +02:00
Jacob Lauritzen	0b76212831	Extend and fix Danish examples (#5227) * Extend and fix Danish examples This PR fixes two examples, adds additional examples translated from the English version, and adds punctuation. The two changed examples are: * "fortov" changed to "fortovet", which is more [used](https://www.google.com/search?client=firefox-b-d&sxsrf=ALeKk0143gEuPe4IbIUpzBBt-oU10OMVqA%3A1585549036477&ei=7I6BXuvJHMGOrwSqi46oCQ&q=l%C3%B8behjul+p%C3%A5+fortov&oq=l%C3%B8behjul+p%C3%A5+fortov&gs_lcp=CgZwc3ktYWIQAzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQR1DT8xZY0_MWYK_0FmgAcAZ4AIABAIgBAJIBAJgBAKABAaoBB2d3cy13aXo&sclient=psy-ab&ved=0ahUKEwjr7964xsHoAhVBx4sKHaqFA5UQ4dUDCAo&uact=5) and more natural. The Swedish and Norwegian examples also use this version of the word. * "stor by" changed to "storby". In Danish we have a specific noun to describe a large, metropolitan city which is different from just describing a city as "large". In this sentence it would be much more natural to describe London as a "storby". Google even corrects a search for "London stor by" to "London storby". * Sign contrib agreement	2020-04-02 10:42:35 +02:00
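
Note on c981aa6684 (inline flags in `token_match` patterns): the commit message says that inline flags keep the flag information through serialization. The sketch below illustrates that general point using only Python's standard `re` module. It is not spaCy's code; the URL-like pattern and the `round_trip` helper are assumptions made up for this illustration, and the helper stands in for any serializer that stores only the pattern string.

```python
import re

# Flag passed as an argument: it lives on the compiled object,
# not in the pattern string.
with_arg = re.compile(r"^https?://\S+$", re.IGNORECASE)

# Inline flag: it is part of the pattern string itself.
inline = re.compile(r"(?i)^https?://\S+$")


def round_trip(compiled):
    """Hypothetical serializer: keep only the pattern string, then recompile."""
    return re.compile(compiled.pattern)


url = "HTTP://EXAMPLE.COM"

# The argument-style flag is dropped by the round trip, the inline flag is not.
arg_survives = round_trip(with_arg).match(url) is not None
inline_survives = round_trip(inline).match(url) is not None

print("flag as argument survives:", arg_survives)
print("inline flag survives:", inline_survives)
```

Both compiled patterns match the upper-case URL before the round trip; only the inline version still matches afterwards, because `(?i)` travels with the pattern text.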


11688 Commits
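
Note on 98c59027ed (max(uint64) for the OOV lexeme rank): the commit message mentions switching to external definitions of max(uint64) and confirming that they are equal. The log does not say which definitions were used, so the following is only a sketch of that kind of cross-check with the Python standard library. The three sources compared here are assumptions chosen for illustration, and `OOV_RANK` simply reuses the name from the commit message.

```python
import ctypes
import struct

# Three independent ways to obtain the largest unsigned 64-bit value.
from_arithmetic = 2**64 - 1
# ctypes does no overflow checking, so -1 wraps around to all bits set.
from_ctypes = ctypes.c_uint64(-1).value
# Eight 0xFF bytes interpreted as one little-endian unsigned 64-bit integer.
from_struct = struct.unpack("<Q", b"\xff" * 8)[0]

# Name taken from the commit message; the value here is defined in one place
# and checked against the other definitions.
OOV_RANK = from_arithmetic
assert OOV_RANK == from_ctypes == from_struct

print(OOV_RANK)
```

Defining the constant once and asserting agreement between sources guards against one side silently using a signed or narrower integer type.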