spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-30 20:06:30 +03:00

Author	SHA1	Message	Date
Adriane Boyd	2a558a7cdc	Switch to mecab-ko as default Korean tokenizer (#11294) * Switch to mecab-ko as default Korean tokenizer Switch to the (confusingly-named) mecab-ko python module for default Korean tokenization. Maintain the previous `natto-py` tokenizer as `spacy.KoreanNattoTokenizer.v1`. * Temporarily run tests with mecab-ko tokenizer * Fix types * Fix duplicate test names * Update requirements test * Revert "Temporarily run tests with mecab-ko tokenizer" This reverts commit `d2083e7044`. * Add mecab_args setting, fix pickle for KoreanNattoTokenizer * Fix length check * Update docs * Formatting * Update natto-py error message Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2022-08-26 10:11:18 +02:00
Raphael Mitsch	e626df959f	Document different ways to create a pipeline (#10762) * Document different ways to create a pipeline: moved up/slightly modified paragraph on pipeline creation. * Document different ways to create a pipeline: changed Finnish to Ukrainian in example for language without trained pipeline. * Document different ways to create a pipeline: added explanation of blank pipeline. * Document different ways to create a pipeline: exchanged Ukrainian with Yoruba.	2022-05-06 15:40:59 +02:00
Adriane Boyd	b2bbefd0b5	Add Finnish, Korean, and Swedish models and Korean support notes (#10355) * Add Finnish, Korean, and Swedish models to website * Add Korean language support notes	2022-03-07 17:03:45 +01:00
Paul O'Leary McCann	1ee6541ab0	Moving Japanese tokenizer extra info to Token.morph (#8977) * Use morph for extra Japanese tokenizer info Previously Japanese tokenizer info that didn't correspond to Token fields was put in user data. Since spaCy core should avoid touching user data, this moves most information to the Token.morph attribute. It also adds the normalized form, which wasn't exposed before. The subtokens, which are a list of full tokens, are still added to user data, except with the default tokenizer granularity. With the default tokenizer settings the subtokens are all None, so in this case the user data is simply not set. * Update tests Also adds a new test for norm data. * Update docs * Add Japanese morphologizer factory Set the default to `extend=True` so that the morphologizer does not clobber the values set by the tokenizer. * Use the norm_ field for normalized forms Before this commit, normalized forms were put in the "norm" field in the morph attributes. I am not sure why I did that instead of using the token morph, I think I just forgot about it. * Skip test if sudachipy is not installed * Fix import Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-01 19:19:26 +02:00
Cass	7d13fc799b	Fix a command typo in models.md "dowmload" -> "download"	2021-07-05 18:44:18 -07:00
Ines Montani	8423864b50	Add docs notes on installing models from Python and in Jupyter [ci skip] (#8597)	2021-07-05 13:49:20 +02:00
Sofie Van Landeghem	0fd0d949c4	fix 's typo's across code base (#8384)	2021-06-15 10:57:08 +02:00
Tocic	b1996a51a1	fix typo in models.md (#7157)	2021-02-22 09:00:38 +01:00
Pengcheng YIN	6fdc33203a	Fix a typo	2021-02-01 17:26:28 -05:00
Ines Montani	a59f3fcf5d	Make wheel the default format and update docs [ci skip]	2021-02-01 23:18:43 +11:00
Ines Montani	7752f80f39	Update docs [ci skip]	2021-01-31 16:11:24 +11:00
Ines Montani	43e59bb22a	Update docs and install extras [ci skip]	2020-10-08 10:58:50 +02:00
Adriane Boyd	aa9c9f3bf0	Update Chinese usage for spacy-pkuseg	2020-10-06 11:21:17 +02:00
Ines Montani	df06f7a792	Update docs [ci skip]	2020-10-02 13:24:33 +02:00
Adriane Boyd	351f352cdc	Update Japanese docs and pin for sudachipy	2020-10-02 10:12:44 +02:00
Adriane Boyd	7670df04dd	Update Chinese usage docs	2020-10-02 10:09:03 +02:00
Ines Montani	012b3a7096	Update docs [ci skip]	2020-09-20 17:44:58 +02:00
Ines Montani	8b0dabe987	Update docs [ci skip]	2020-09-12 17:05:10 +02:00
Ines Montani	b5a0657fd6	"model" terminology consistency in docs	2020-09-03 13:13:03 +02:00
Ines Montani	13291e97ba	Update docs [ci skip]	2020-08-19 00:28:37 +02:00
Ines Montani	82f0e20318	Update docs and consistency [ci skip]	2020-08-18 14:39:40 +02:00
Ines Montani	3ae5e02f4f	Update docs, types and API consistency	2020-08-17 16:45:24 +02:00
Ines Montani	644074b954	Merge branch 'develop' into master-tmp	2020-07-20 14:58:04 +02:00
Adriane Boyd	39ebcd9ec9	Refactor Chinese tokenizer configuration (#5736) * Refactor Chinese tokenizer configuration Refactor `ChineseTokenizer` configuration so that it uses a single `segmenter` setting to choose between character segmentation, jieba, and pkuseg. * replace `use_jieba`, `use_pkuseg`, `require_pkuseg` with the setting `segmenter` with the supported values: `char`, `jieba`, `pkuseg` * make the default segmenter plain character segmentation `char` (no additional libraries required) * Fix Chinese serialization test to use char default * Warn if attempting to customize other segmenter Add a warning if `Chinese.pkuseg_update_user_dict` is called when another segmenter is selected.	2020-07-19 13:34:37 +02:00
Adriane Boyd	cd5af72c9a	Update pkuseg version (#5774) * Update pkuseg version in Chinese tokenizer warnings * Update pkuseg version in `Makefile` * Remove warning about python3.8 wheels in docs	2020-07-19 11:09:49 +02:00
Ines Montani	bb3ee38cf9	Update WIP	2020-07-06 22:22:37 +02:00
Ines Montani	1e0d54edd1	Update docs	2020-07-04 14:23:10 +02:00
Ines Montani	fe4cfd0632	Start updating website for v3 [ci skip]	2020-07-01 21:26:39 +02:00
Adriane Boyd	931d80de72	Warning for sudachipy 0.4.5 (#5611)	2020-06-19 12:43:41 +02:00
Ines Montani	44af53bdd9	Add pkuseg warnings and auto-format [ci skip]	2020-06-16 17:13:35 +02:00
Adriane Boyd	d5110ffbf2	Documentation updates for v2.3.0 (#5593) * Update website models for v2.3.0 * Add docs for Chinese word segmentation * Tighten up Chinese docs section * Merge branch 'master' into docs/v2.3.0 [ci skip] * Merge branch 'master' into docs/v2.3.0 [ci skip] * Auto-format and update version * Update matcher.md * Update languages and sorting * Typo in landing page * Infobox about token_match behavior * Add meta and basic docs for Japanese * POS -> TAG in models table * Add info about lookups for normalization * Updates to API docs for v2.3 * Update adding norm exceptions for adding languages * Add --omit-extra-lookups to CLI API docs * Add initial draft of "What's New in v2.3" * Add new in v2.3 tags to Chinese and Japanese sections * Add tokenizer to migration section * Add new in v2.3 flags to init-model * Typo * More what's new in v2.3 Co-authored-by: Ines Montani <ines@ines.io>	2020-06-16 15:37:35 +02:00
Ines Montani	a8a1800f2a	Update lemma data documentation [ci skip]	2019-10-01 13:22:13 +02:00
Ines Montani	9c940eab94	Update version in examples [ci skip]	2019-09-18 21:23:26 +02:00
Ines Montani	82c16b7943	Remove u-strings and fix formatting [ci skip]	2019-09-12 16:11:15 +02:00
mak	89379a7fa4	Corrected example model URL in requirements.txt (#3786) The URL used to show how to add a model to the requirements.txt had the old release path (excl. explosion).	2019-05-29 10:51:55 +02:00
Ines Montani	a611b32fbf	Update model docs [ci skip]	2019-03-17 11:48:18 +01:00
Ines Montani	4cfe4aa224	Fix small issues in the docs [ci skip]	2019-03-12 22:57:15 +01:00
Ines Montani	212ff359ef	Fix links [ci skip]	2019-02-17 22:25:50 +01:00
Ines Montani	e597110d31	💫 Update website (#3285) ## Description The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in straightforward Markdown without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on. This PR also includes various new docs pages and content. Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837. ### Types of change enhancement ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-02-17 19:31:19 +01:00

39 Commits