spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-02-14 19:20:39 +03:00

Author	SHA1	Message	Date
adrianeboyd	2c876eb672	Add tokenizer explain() debugging method (#4596 ) * Expose tokenizer rules as a property Expose the tokenizer rules property in the same way as the other core properties. (The cache resetting is overkill, but consistent with `from_bytes` for now.) Add tests and update Tokenizer API docs. * Update Hungarian punctuation to remove empty string Update Hungarian punctuation definitions so that `_units` does not match an empty string. * Use _load_special_tokenization consistently Use `_load_special_tokenization()` and have it to handle `None` checks. * Fix precedence of `token_match` vs. special cases Remove `token_match` check from `_split_affixes()` so that special cases have precedence over `token_match`. `token_match` is checked only before infixes are split. * Add `make_debug_doc()` to the Tokenizer Add `make_debug_doc()` to the Tokenizer as a working implementation of the pseudo-code in the docs. Add a test (marked as slow) that checks that `nlp.tokenizer()` and `nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens for all languages that have `examples.sentences` that can be imported. * Update tokenization usage docs Update pseudo-code and algorithm description to correspond to `nlp.tokenizer.make_debug_doc()` with example debugging usage. Add more examples for customizing tokenizers while preserving the existing defaults. Minor edits / clarifications. * Revert "Update Hungarian punctuation to remove empty string" This reverts commit `f0a577f7a5`. * Rework `make_debug_doc()` as `explain()` Rework `make_debug_doc()` as `explain()`, which returns a list of `(pattern_string, token_string)` tuples rather than a non-standard `Doc`. Update docs and tests accordingly, leaving the visualization for future work. * Handle cases with bad tokenizer patterns Detect when tokenizer patterns match empty prefixes and suffixes so that `explain()` does not hang on bad patterns. * Remove unused displacy image * Add tokenizer.explain() to usage docs	2019-11-20 13:07:25 +01:00
Matthew Honnibal	a3c43a1692	Support no hidden layer in parser and NER (#4672 ) * Support no hidden layers for parser * Fix parser model for depth 1 * Fix parser for hidden depth=0 * Add option of non-blocking to CUDA stream	2019-11-19 15:54:34 +01:00
Matthew Honnibal	4b123952aa	Add option for improved NER feature extraction (#4671 ) * Support option of three NER features * Expose nr_feature parser model setting * Give feature tokens better name * Test nr_feature=3 for NER * Format	2019-11-19 15:03:14 +01:00
Elijah Rippeth	5ad5c4b44a	Add initial Korean support (#4660 ) * add hangul and jamo char classes. * add initial Korean lexical attributes. * add contributor agreement	2019-11-18 12:56:07 +01:00
Ines Montani	e8b9cee6fd	Make example consistent with model (closes #4587 ) [ci skip]	2019-11-18 12:41:48 +01:00
Ines Montani	e01a1a237f	Auto-format [ci skip]	2019-11-18 12:41:31 +01:00
adrianeboyd	62e00fd9da	Update tokenization usage docs (#4666 ) Update pseudo-code and algorithm description to correspond to current tokenizer behavior. Add more examples for customizing tokenizers while preserving the existing defaults. Minor edits / clarifications.	2019-11-18 12:35:13 +01:00
Ines Montani	5adcb352e9	Adjust order of docs sections [ci skip]	2019-11-17 16:08:56 +01:00
Ines Montani	74b951fe61	Fix xpassing tests (#4657 ) * Ignore internal warnings * Un-xfail passing tests * Skip instead of xfail	2019-11-16 20:20:53 +01:00
Ines Montani	3bd15055ce	Fix bug in Language.evaluate for components without .pipe (#4662 )	2019-11-16 20:20:37 +01:00
adrianeboyd	bdfb696677	Fix conllu2json converter to output all sentences (#4656 ) Make sure that the last batch of sentences is output if n_sents > 1.	2019-11-15 17:08:32 +01:00
Ines Montani	d64cfce546	Remove unnecessary newline replace	2019-11-15 16:19:01 +01:00
Christoph Purschke	433748e867	Fix basic language support for Luxembourgish (by adding punctuation.py) (#4648 ) * Update __init__.py * Create punctuation.py * Update tokenizer_exceptions.py * Create questoph.md * Update questoph.md * Update test_text.py * Update test_text.py * Update test_text.py * Update test_text.py	2019-11-15 16:16:47 +01:00
Ines Montani	e5b25a9cee	Update azure-pipelines.yml	2019-11-15 02:02:25 +01:00
Ines Montani	57af7c9d7f	Don't upgrade pip	2019-11-15 01:51:56 +01:00
Ines Montani	463a056e85	Merge branch 'master' of https://github.com/explosion/spaCy	2019-11-15 01:50:58 +01:00
Ines Montani	64f34d97b1	Use newer pip to try fix wheel selection on 3.8 Windows	2019-11-15 01:50:55 +01:00
Ines Montani	e30d08410a	Add CI for Python 3.8 (#4479 ) * Add 3.8 classifier * Update azure-pipelines.yml * Remove 3.8 warning from docs [ci skip]	2019-11-15 01:13:48 +01:00
Ines Montani	98b9d387c9	Auto-format [ci skip]	2019-11-15 00:33:44 +01:00
f11r	877971860e	Fix assert in sentencizer documentation. (#4639 )	2019-11-13 15:24:14 +01:00
Ines Montani	9d5ff177c4	Work around Markdown rendering issue surfaced in #4600 [ci skip]	2019-11-11 17:12:08 +01:00
adrianeboyd	91f89f9693	Fix realloc in retokenizer.split() (#4606 ) Always realloc to a size larger than `doc.max_length` in `retokenizer.split()` (or cymem will throw errors).	2019-11-11 16:26:46 +01:00
adrianeboyd	f415e9b7d1	Set extensions when write_conllu() is called in UD train script (#4618 ) * Set extensions when write_conllu() is called `run_eval.py` uses the `write_conllu()` function from `ud_train.py` by itself, so it needs to set the token extensions if necessary. * Switch from try to if	2019-11-11 16:25:03 +01:00
adrianeboyd	0b9a5f4074	Rework Chinese language initialization and tokenization (#4619 ) * Rework Chinese language initialization * Create a `ChineseTokenizer` class * Modify jieba post-processing to handle whitespace correctly * Modify non-jieba character tokenization to handle whitespace correctly * Add a `create_tokenizer()` method to `ChineseDefaults` * Load lexical attributes * Update Chinese tag_map for UD v2 * Add very basic Chinese tests * Test tokenization with and without jieba * Test `like_num` attribute * Fix try_jieba_import() * Fix zh code formatting	2019-11-11 14:23:21 +01:00
adrianeboyd	4d85f67eee	Minor updates to language example sentences (#4608 ) * Add punctuation to Spanish example sentences * Combine multilanguage examples for lang xx * Add punctuation to nb examples	2019-11-07 22:34:58 +01:00
Priscilla de Abreu Lopes	39e79fcc86	Bugfix/dep matcher issue 4590 (#4601 ) * add contributor agreement for prilopes * add test for issue #4590 * fix on_match params for DependencyMacther (#4590)	2019-11-07 12:01:06 +01:00
Ines Montani	09cec3e41b	Replace function registries with catalogue (#4584 ) * Replace functions registries with catalogue * Update __init__.py * Fix test * Revert unrelated flag [ci skip]	2019-11-07 11:45:22 +01:00
adrianeboyd	0f8678c0b1	Fix DocBin.merge() example (#4599 )	2019-11-07 11:26:48 +01:00
walterhenry	5563c42ef5	Fixed typo: Added space between "recognize" and "various" (#4600 )	2019-11-06 23:06:36 +01:00
Ines Montani	828ef27a32	Add warnings about 3.8 (resolves #4593 ) [ci skip]	2019-11-05 18:30:11 +01:00
Ines Montani	fed53b1552	Update README.md	2019-11-05 18:26:47 +01:00
Ines Montani	83381018d3	Add load_from_docbin example [ci skip] TODO: upload the file somewhere	2019-11-05 11:52:43 +01:00
Sofie Van Landeghem	4ec7623288	Fix conllu script (#4579 ) * force extensions to avoid clash between example scripts * fix arg order and default file encoding * add example config for conllu script * newline * move extension definitions to main function * few more encodings fixes	2019-11-04 20:31:26 +01:00
Matthew Honnibal	4e43c0ba93	Fix multiprocessing for as_tuples=True (#4582 )	2019-11-04 20:29:03 +01:00
Ines Montani	4b95587ad4	Update universe.json [ci skip]	2019-11-04 13:55:55 +01:00
Yash Patadia	0c396aeed4	add dframcy to universe.json (#4580 )	2019-11-04 13:53:23 +01:00
Ines Montani	3ec231f7e1	Reorganise install_requires	2019-11-04 02:39:28 +01:00
Ines Montani	cf4ec88b38	Use latest wasabi	2019-11-04 02:38:45 +01:00
Ines Montani	d82630d7c1	Revert "Update azure-pipelines.yml" This reverts commit `ed1060cf59`.	2019-11-03 17:48:54 +01:00
Ines Montani	ed1060cf59	Update azure-pipelines.yml	2019-11-03 17:48:26 +01:00
Ines Montani	6ec119d976	Add error in debug-data if no dev docs are available (see #4575 )	2019-11-02 16:08:11 +01:00
adrianeboyd	56ad3a3988	Add LAS per dependency to Scorer (#4560 )	2019-10-31 21:18:16 +01:00
Matthew Honnibal	de98d66f87	Set version to v2.2.2	2019-10-31 15:53:31 +01:00
Matthw Honnibal	55f2241d72	Merge branch 'master' of https://github.com/explosion/spaCy	2019-10-31 15:37:52 +01:00
Ines Montani	df4c9ae3dc	Fix formatting [ci skip]	2019-10-31 15:10:25 +01:00
Ines Montani	59358d9b71	Remove box-decoration-break from entities in displacy (#4564 )	2019-10-31 15:09:43 +01:00
Matthw Honnibal	8b9954d1b7	Set version to v2.2.2.dev5	2019-10-31 15:06:19 +01:00
Ines Montani	2c107f02a4	Auto-format [ci skip]	2019-10-31 15:01:56 +01:00
Matthew Honnibal	e82306937e	Put Tok2Vec refactor behind feature flag (#4563 ) * Add back pre-2.2.2 tok2vec * Add simple tok2vec tests * Add simple tok2vec tests * Reformat * Fix CharacterEmbed in new tok2vec * Fix legacy tok2vec * Resolve circular imports * Fix test for Python 2	2019-10-31 15:01:15 +01:00
Ines Montani	828108a57f	Update README.md [ci skip]	2019-10-31 13:23:25 +01:00

11069 Commits