spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-09-02 02:14:56 +03:00

Author	SHA1	Message	Date
adrianeboyd	0b9a5f4074	Rework Chinese language initialization and tokenization (#4619 ) * Rework Chinese language initialization * Create a `ChineseTokenizer` class * Modify jieba post-processing to handle whitespace correctly * Modify non-jieba character tokenization to handle whitespace correctly * Add a `create_tokenizer()` method to `ChineseDefaults` * Load lexical attributes * Update Chinese tag_map for UD v2 * Add very basic Chinese tests * Test tokenization with and without jieba * Test `like_num` attribute * Fix try_jieba_import() * Fix zh code formatting	2019-11-11 14:23:21 +01:00
adrianeboyd	4d85f67eee	Minor updates to language example sentences (#4608 ) * Add punctuation to Spanish example sentences * Combine multilanguage examples for lang xx * Add punctuation to nb examples	2019-11-07 22:34:58 +01:00
Priscilla de Abreu Lopes	39e79fcc86	Bugfix/dep matcher issue 4590 (#4601 ) * add contributor agreement for prilopes * add test for issue #4590 * fix on_match params for DependencyMatcher (#4590)	2019-11-07 12:01:06 +01:00
Ines Montani	09cec3e41b	Replace function registries with catalogue (#4584 ) * Replace functions registries with catalogue * Update __init__.py * Fix test * Revert unrelated flag [ci skip]	2019-11-07 11:45:22 +01:00
adrianeboyd	0f8678c0b1	Fix DocBin.merge() example (#4599 )	2019-11-07 11:26:48 +01:00
walterhenry	5563c42ef5	Fixed typo: Added space between "recognize" and "various" (#4600 )	2019-11-06 23:06:36 +01:00
Ines Montani	828ef27a32	Add warnings about 3.8 (resolves #4593 ) [ci skip]	2019-11-05 18:30:11 +01:00
Ines Montani	fed53b1552	Update README.md	2019-11-05 18:26:47 +01:00
Ines Montani	83381018d3	Add load_from_docbin example [ci skip] TODO: upload the file somewhere	2019-11-05 11:52:43 +01:00
Sofie Van Landeghem	4ec7623288	Fix conllu script (#4579 ) * force extensions to avoid clash between example scripts * fix arg order and default file encoding * add example config for conllu script * newline * move extension definitions to main function * few more encodings fixes	2019-11-04 20:31:26 +01:00
Matthew Honnibal	4e43c0ba93	Fix multiprocessing for as_tuples=True (#4582 )	2019-11-04 20:29:03 +01:00
Ines Montani	4b95587ad4	Update universe.json [ci skip]	2019-11-04 13:55:55 +01:00
Yash Patadia	0c396aeed4	add dframcy to universe.json (#4580 )	2019-11-04 13:53:23 +01:00
Ines Montani	3ec231f7e1	Reorganise install_requires	2019-11-04 02:39:28 +01:00
Ines Montani	cf4ec88b38	Use latest wasabi	2019-11-04 02:38:45 +01:00
Ines Montani	d82630d7c1	Revert "Update azure-pipelines.yml" This reverts commit `ed1060cf59`.	2019-11-03 17:48:54 +01:00
Ines Montani	ed1060cf59	Update azure-pipelines.yml	2019-11-03 17:48:26 +01:00
Ines Montani	6ec119d976	Add error in debug-data if no dev docs are available (see #4575 )	2019-11-02 16:08:11 +01:00
adrianeboyd	56ad3a3988	Add LAS per dependency to Scorer (#4560 )	2019-10-31 21:18:16 +01:00
Matthew Honnibal	de98d66f87	Set version to v2.2.2	2019-10-31 15:53:31 +01:00
Matthew Honnibal	55f2241d72	Merge branch 'master' of https://github.com/explosion/spaCy	2019-10-31 15:37:52 +01:00
Ines Montani	df4c9ae3dc	Fix formatting [ci skip]	2019-10-31 15:10:25 +01:00
Ines Montani	59358d9b71	Remove box-decoration-break from entities in displacy (#4564 )	2019-10-31 15:09:43 +01:00
Matthew Honnibal	8b9954d1b7	Set version to v2.2.2.dev5	2019-10-31 15:06:19 +01:00
Ines Montani	2c107f02a4	Auto-format [ci skip]	2019-10-31 15:01:56 +01:00
Matthew Honnibal	e82306937e	Put Tok2Vec refactor behind feature flag (#4563 ) * Add back pre-2.2.2 tok2vec * Add simple tok2vec tests * Add simple tok2vec tests * Reformat * Fix CharacterEmbed in new tok2vec * Fix legacy tok2vec * Resolve circular imports * Fix test for Python 2	2019-10-31 15:01:15 +01:00
Ines Montani	828108a57f	Update README.md [ci skip]	2019-10-31 13:23:25 +01:00
Ines Montani	5e9849b60f	Auto-format [ci skip]	2019-10-30 19:27:18 +01:00
Ines Montani	afe4a428f7	Fix pipeline analysis on remove pipe (#4557 ) Validate after component is removed, not before	2019-10-30 19:04:17 +01:00
Matthew Honnibal	6b874ef096	Set version to v2.2.2.dev4	2019-10-30 17:36:20 +01:00
Ines Montani	85f2b04c45	Support span._. in component decorator attrs (#4555 ) * Support span._. in component decorator attrs * Adjust error [ci skip]	2019-10-30 17:19:36 +01:00
Ines Montani	4e1de85e43	Update syntax iterators [ci skip]	2019-10-30 14:31:40 +01:00
Ines Montani	726c5dd306	Update universe.json [ci skip]	2019-10-30 13:29:00 +01:00
Neel Kamath	6c036ab57d	Add "spaCy Server" to spaCy Universe (#4553 ) * Add "spaCy Server" to spaCy Universe * Accept the spaCy Contributor Agreement	2019-10-30 13:20:46 +01:00
Nipun Sadvilkar	2a5e71232b	✨ project: pySBD - Python Sentence Boundary Disambiguation (#4455 ) * ✨ project: pySBD - Python Sentence Boundary Disambiguation * 📝 Update links and description * 🐛 Fix missing comma * Update universe.json pysbd as a spacy component through entrypoints * 🚨 Fix universe.json * 📝 Update code_example	2019-10-30 12:13:29 +01:00
Matthew Honnibal	c2f5f9f572	Set version to v2.2.2.dev3	2019-10-29 16:37:58 +01:00
Sofie Van Landeghem	33ba9ff464	set encodings explicitly to utf8 (#4551 )	2019-10-29 13:16:55 +01:00
Matthew Honnibal	9e210fa7fd	Fix tok2vec structure after model registry refactor (#4549 ) The model registry refactor of the Tok2Vec function broke loading models trained with the previous function, because the model tree was slightly different. Specifically, the new function wrote: concatenate(norm, prefix, suffix, shape) To build the embedding layer. In the previous implementation, I had used the operator overloading shortcut: ( norm | prefix | suffix | shape ) This actually gets mapped to a binary association, giving something like: concatenate(norm, concatenate(prefix, concatenate(suffix, shape))) This is a different tree, so the layers iterate differently and we loaded the weights wrongly.	2019-10-28 23:59:03 +01:00
Matthew Honnibal	bade60fe64	Set version to v2.2.2.dev1	2019-10-28 19:09:34 +01:00
Matthew Honnibal	b1505380ff	Fix training with vectors	2019-10-28 18:06:38 +01:00
Matthew Honnibal	a927b3a21e	Put new alignment behind flag for v2.2.2 release (#4541 ) * Xfail new tokenization test * Put new alignment behind feature flag * Move USE_ALIGN to top of the file [ci skip] Co-authored-by: Ines Montani <ines@ines.io>	2019-10-28 16:12:32 +01:00
Ines Montani	a90025b277	Fix serialization of extension attr values in DocBin (#4540 )	2019-10-28 16:02:13 +01:00
tamuhey	df293f3894	modified gold.align to handle space tokens (#4537 ) Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2019-10-28 15:44:28 +01:00
adrianeboyd	f2bfaa1b38	Filter subtoken matches in merge_subtokens() (#4539 ) The `Matcher` in `merge_subtokens()` returns all possible subsequences of `subtok`, so for sequences of two or more subtoks it's necessary to filter the matches so that the retokenizer is only merging the longest matches with no overlapping spans.	2019-10-28 15:40:28 +01:00
Matthew Honnibal	d5509e0989	Support Mish activation (requires Thinc 7.3) (#4536 ) * Add arch for MishWindowEncoder * Support mish in tok2vec and conv window >=2 * Pass new tok2vec settings from parser * Syntax error * Fix tok2vec setting * Fix registration of MishWindowEncoder * Fix receptive field setting * Fix mish arch * Pass more options from parser * Support more tok2vec options in pretrain * Require thinc 7.3 * Add docs [ci skip] * Require thinc 7.3.0.dev0 to run CI * Run black * Fix typo * Update Thinc version Co-authored-by: Ines Montani <ines@ines.io>	2019-10-28 15:16:33 +01:00
Ines Montani	96bb8f2187	Add regression test for #4528 [ci skip]	2019-10-28 14:36:03 +01:00
Matthew Honnibal	02e8adf2c2	Add the spacy_lookups_data to pex file	2019-10-28 14:03:35 +01:00
Ines Montani	c5e41247e8	Tidy up and auto-format	2019-10-28 12:43:55 +01:00
Ines Montani	92018b9cd4	Tidy up and auto-format	2019-10-28 12:36:23 +01:00
Matthew Honnibal	f0ec7bcb79	Flag to ignore examples with mismatched raw/gold text (#4534 ) * Flag to ignore examples with mismatched raw/gold text After #4525, we're seeing some alignment failures on our OntoNotes data. I think we actually have fixes for most of these cases. In general it's better to fix the data, but it seems good to allow the GoldCorpus class to just skip cases where the raw text doesn't match up to the gold words. I think previously we were silently ignoring these cases. * Try to fix test on Python 2.7	2019-10-28 11:40:12 +01:00
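
Note on commit 9e210fa7fd: the message describes how one variadic concatenate(...) call and a chain of binary | operators can build differently shaped model trees. The following is a minimal toy sketch of that idea only. The Layer class, the concatenate() helper and the walk() function are hypothetical stand-ins written for illustration, not Thinc's or spaCy's actual classes, and the breadth-first walk is an assumed traversal order.

```python
from collections import deque


class Layer:
    """Toy stand-in for a model layer (hypothetical, not Thinc's class)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def __or__(self, other):
        # A binary operator can only ever wrap two operands per use.
        return Layer("concatenate", [self, other])


def concatenate(*layers):
    # A variadic call creates one node holding every operand directly.
    return Layer("concatenate", layers)


def walk(root):
    """Return node names in breadth-first order (assumed traversal)."""
    queue = deque([root])
    names = []
    while queue:
        node = queue.popleft()
        names.append(node.name)
        queue.extend(node.children)
    return names


def make_leaves():
    return [Layer(n) for n in ("norm", "prefix", "suffix", "shape")]


# One flat node with four children.
flat = concatenate(*make_leaves())

# A chain of binary operators: three nested two-child nodes.
norm, prefix, suffix, shape = make_leaves()
nested = norm | prefix | suffix | shape

flat_order = walk(flat)
nested_order = walk(nested)
print(flat_order)
print(nested_order)
```

Python groups a chain of | operators from left to right, so this toy tree nests toward the left, whereas the commit message sketches the nesting as "something like" a right-leaning tree. Either way the nested form has more nodes than the flat one and its layers come out of the walk in a different order, so weights matched to layers by position would no longer line up.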


11246 Commits