spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-11-25 04:16:11 +03:00

Author	SHA1	Message	Date
Christoph Purschke	a7ee4b6f17	new tests & tokenization fixes (#4734 ) - added some tests for tokenization issues - fixed some issues with tokenization of words with hyphen infix - rewrote the "tokenizer_exceptions.py" file (stemming from the German version)	2019-12-01 23:08:21 +01:00
adrianeboyd	48ea2e8d0f	Restructure Sentencizer to follow Pipe API (#4721 ) * Restructure Sentencizer to follow Pipe API Restructure Sentencizer to follow Pipe API so that it can be scored with `nlp.evaluate()`. * Add Sentencizer pipe() test	2019-11-27 16:33:34 +01:00
Ines Montani	5b36dec7eb	Auto-exclude disabled when calling from_disk during load (#4708 )	2019-11-25 16:01:22 +01:00
adrianeboyd	2d8c6e1124	Iterate over lr_edges until sents are correct (#4702 ) Iterate over lr_edges until all heads are within the current sentence. Instead of iterating over them for a fixed number of iterations, check whether the sentence boundaries are correct for the heads and stop when all are correct. Stop after a maximum of 10 iterations, providing a warning in this case since the sentence boundaries may not be correct.	2019-11-25 13:06:36 +01:00
Paul O'Leary McCann	f0e3e606a6	Replace python-mecab3 with fugashi for Japanese (#4621 ) * Switch from mecab-python3 to fugashi mecab-python3 has been the best MeCab binding for a long time but it's not very actively maintained, and since it's based on old SWIG code distributed with MeCab there's a limit to how effectively it can be maintained. Fugashi is a new Cython-based MeCab wrapper I wrote. Since it's not based on the old SWIG code it's easier to keep it current and make small deviations from the MeCab C/C++ API where that makes sense. * Change mecab-python3 to fugashi in setup.cfg * Change "mecab tags" to "unidic tags" The tags come from MeCab, but the tag schema is specified by Unidic, so it's more proper to refer to it that way. * Update conftest * Add fugashi link to external deps list for Japanese	2019-11-23 14:31:04 +01:00
Ines Montani	5d4eede1e4	Fix test util imports	2019-11-21 16:28:29 +01:00
GuiGel	8f7ab70870	Bugfix/fix entity ruler from disk (#4670 ) * fix EntityRuler from_disk bug * add contributor file * Test EntityRuler PhraseMatcher deserialization (#4651) * newline at end of file * fix copy paste error * serializing the EntityRuler by itself * Add unicode declarations for Python 2 and auto-format	2019-11-21 16:26:37 +01:00
adrianeboyd	054df5d90a	Add error for non-string labels (#4690 ) Add error when attempting to add non-string labels to `Tagger` or `TextCategorizer`.	2019-11-21 16:24:10 +01:00
adrianeboyd	d7f32b285c	Detect more empty matches in tokenizer.explain() (#4675 ) * Detect more empty matches in tokenizer.explain() * Include a few languages in explain non-slow tests Mark a few languages in tokenizer.explain() tests as not slow so they're run by default.	2019-11-20 16:31:29 +01:00
Ines Montani	5bf9ab5b03	Tidy up and auto-format	2019-11-20 13:16:33 +01:00
Ines Montani	7f3b00164a	Re-add slow marker	2019-11-20 13:15:59 +01:00
Ines Montani	6e303de717	Auto-format	2019-11-20 13:15:24 +01:00
Ines Montani	2e7c896fe5	Update Tokenizer.explain tests	2019-11-20 13:14:11 +01:00
adrianeboyd	2c876eb672	Add tokenizer explain() debugging method (#4596 ) * Expose tokenizer rules as a property Expose the tokenizer rules property in the same way as the other core properties. (The cache resetting is overkill, but consistent with `from_bytes` for now.) Add tests and update Tokenizer API docs. * Update Hungarian punctuation to remove empty string Update Hungarian punctuation definitions so that `_units` does not match an empty string. * Use _load_special_tokenization consistently Use `_load_special_tokenization()` and have it to handle `None` checks. * Fix precedence of `token_match` vs. special cases Remove `token_match` check from `_split_affixes()` so that special cases have precedence over `token_match`. `token_match` is checked only before infixes are split. * Add `make_debug_doc()` to the Tokenizer Add `make_debug_doc()` to the Tokenizer as a working implementation of the pseudo-code in the docs. Add a test (marked as slow) that checks that `nlp.tokenizer()` and `nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens for all languages that have `examples.sentences` that can be imported. * Update tokenization usage docs Update pseudo-code and algorithm description to correspond to `nlp.tokenizer.make_debug_doc()` with example debugging usage. Add more examples for customizing tokenizers while preserving the existing defaults. Minor edits / clarifications. * Revert "Update Hungarian punctuation to remove empty string" This reverts commit `f0a577f7a5`. * Rework `make_debug_doc()` as `explain()` Rework `make_debug_doc()` as `explain()`, which returns a list of `(pattern_string, token_string)` tuples rather than a non-standard `Doc`. Update docs and tests accordingly, leaving the visualization for future work. * Handle cases with bad tokenizer patterns Detect when tokenizer patterns match empty prefixes and suffixes so that `explain()` does not hang on bad patterns. * Remove unused displacy image * Add tokenizer.explain() to usage docs	2019-11-20 13:07:25 +01:00
Matthew Honnibal	4b123952aa	Add option for improved NER feature extraction (#4671 ) * Support option of three NER features * Expose nr_feature parser model setting * Give feature tokens better name * Test nr_feature=3 for NER * Format	2019-11-19 15:03:14 +01:00
Ines Montani	74b951fe61	Fix xpassing tests (#4657 ) * Ignore internal warnings * Un-xfail passing tests * Skip instead of xfail	2019-11-16 20:20:53 +01:00
Ines Montani	3bd15055ce	Fix bug in Language.evaluate for components without .pipe (#4662 )	2019-11-16 20:20:37 +01:00
Christoph Purschke	433748e867	Fix basic language support for Luxembourgish (by adding punctuation.py) (#4648 ) * Update __init__.py * Create punctuation.py * Update tokenizer_exceptions.py * Create questoph.md * Update questoph.md * Update test_text.py * Update test_text.py * Update test_text.py * Update test_text.py	2019-11-15 16:16:47 +01:00
adrianeboyd	91f89f9693	Fix realloc in retokenizer.split() (#4606 ) Always realloc to a size larger than `doc.max_length` in `retokenizer.split()` (or cymem will throw errors).	2019-11-11 16:26:46 +01:00
adrianeboyd	0b9a5f4074	Rework Chinese language initialization and tokenization (#4619 ) * Rework Chinese language initialization * Create a `ChineseTokenizer` class * Modify jieba post-processing to handle whitespace correctly * Modify non-jieba character tokenization to handle whitespace correctly * Add a `create_tokenizer()` method to `ChineseDefaults` * Load lexical attributes * Update Chinese tag_map for UD v2 * Add very basic Chinese tests * Test tokenization with and without jieba * Test `like_num` attribute * Fix try_jieba_import() * Fix zh code formatting	2019-11-11 14:23:21 +01:00
Priscilla de Abreu Lopes	39e79fcc86	Bugfix/dep matcher issue 4590 (#4601 ) * add contributor agreement for prilopes * add test for issue #4590 * fix on_match params for DependencyMatcher (#4590)	2019-11-07 12:01:06 +01:00
Ines Montani	09cec3e41b	Replace function registries with catalogue (#4584 ) * Replace functions registries with catalogue * Update __init__.py * Fix test * Revert unrelated flag [ci skip]	2019-11-07 11:45:22 +01:00
adrianeboyd	56ad3a3988	Add LAS per dependency to Scorer (#4560 )	2019-10-31 21:18:16 +01:00
Matthew Honnibal	e82306937e	Put Tok2Vec refactor behind feature flag (#4563 ) * Add back pre-2.2.2 tok2vec * Add simple tok2vec tests * Add simple tok2vec tests * Reformat * Fix CharacterEmbed in new tok2vec * Fix legacy tok2vec * Resolve circular imports * Fix test for Python 2	2019-10-31 15:01:15 +01:00
Ines Montani	5e9849b60f	Auto-format [ci skip]	2019-10-30 19:27:18 +01:00
Ines Montani	afe4a428f7	Fix pipeline analysis on remove pipe (#4557 ) Validate after component is removed, not before	2019-10-30 19:04:17 +01:00
Ines Montani	85f2b04c45	Support span._. in component decorator attrs (#4555 ) * Support span._. in component decorator attrs * Adjust error [ci skip]	2019-10-30 17:19:36 +01:00
Matthew Honnibal	a927b3a21e	Put new alignment behind flag for v2.2.2 release (#4541 ) * Xfail new tokenization test * Put new alignment behind feature flag * Move USE_ALIGN to top of the file [ci skip] Co-authored-by: Ines Montani <ines@ines.io>	2019-10-28 16:12:32 +01:00
Ines Montani	a90025b277	Fix serialization of extension attr values in DocBin (#4540 )	2019-10-28 16:02:13 +01:00
tamuhey	df293f3894	modified gold.align to handle space tokens (#4537 ) Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2019-10-28 15:44:28 +01:00
adrianeboyd	f2bfaa1b38	Filter subtoken matches in merge_subtokens() (#4539 ) The `Matcher` in `merge_subtokens()` returns all possible subsequences of `subtok`, so for sequences of two or more subtoks it's necessary to filter the matches so that the retokenizer is only merging the longest matches with no overlapping spans.	2019-10-28 15:40:28 +01:00
Ines Montani	96bb8f2187	Add regression test for #4528 [ci skip]	2019-10-28 14:36:03 +01:00
Ines Montani	c5e41247e8	Tidy up and auto-format	2019-10-28 12:43:55 +01:00
Matthew Honnibal	426b745640	Fix tests for gpu	2019-10-27 22:19:18 +01:00
Sofie Van Landeghem	8e7414dace	Match pop with append for training format (#4516 ) * trying to fix script - not successful yet * match pop() with extend() to avoid changing the data * few more pop-extend fixes * reinsert deleted print statement * fix print statement * add last tested version * append instead of extend * add in few comments * quick fix for 4402 + unit test * fixing number of docs (not counting cats) * more fixes * fix len * print tmp file instead of using data from examples dir * print tmp file instead of using data from examples dir (2)	2019-10-27 16:01:32 +01:00
tamuhey	fcd25db033	[#4529 ] fix: gold pyx (#4530 ) * fix: gold pyx * remove print * skip test in python2 * Add unicode declarations and don't skip test on Python 2	2019-10-27 13:50:07 +01:00
tamuhey	554850206c	[#4525 ] fix gold.align (#4526 ) * fix: gold.align * fix align * remove old align	2019-10-27 13:38:04 +01:00
Ines Montani	a9c6104047	Component decorator and component analysis (#4517 ) * Add work in progress * Update analysis helpers and component decorator * Fix porting of docstrings for Python 2 * Fix docstring stuff on Python 2 * Support meta factories when loading model * Put auto pipeline analysis behind flag for now * Analyse pipes on remove_pipe and replace_pipe * Move analysis to root for now Try to find a better place for it, but it needs to go for now to avoid circular imports * Simplify decorator Don't return a wrapped class and instead just write to the object * Update existing components and factories * Add condition in factory for classes vs. functions * Add missing from_nlp classmethods * Add "retokenizes" to printed overview * Update assigns/requires declarations of builtins * Only return data if no_print is enabled * Use multiline table for overview * Don't support Span * Rewrite errors/warnings and move them to spacy.errors	2019-10-27 13:35:49 +01:00
Ines Montani	cfffdba7b1	Implement new API for {Phrase}Matcher.add (backwards-compatible) (#4522 ) * Implement new API for {Phrase}Matcher.add (backwards-compatible) * Update docs * Also update DependencyMatcher.add * Update internals * Rewrite tests to use new API * Add basic check for common mistake Raise error with suggestion if user likely passed in a pattern instead of a list of patterns * Fix typo [ci skip]	2019-10-25 22:21:08 +02:00
Ines Montani	d2da117114	Also support passing list to Language.disable_pipes (#4521 ) * Also support passing list to Language.disable_pipes * Adjust internals	2019-10-25 16:19:08 +02:00
Ines Montani	cc05d9dad6	Auto-format [ci skip]	2019-10-24 16:21:08 +02:00
Ines Montani	73dc63d3bf	Tidy up and auto-format [ci skip]	2019-10-24 16:20:48 +02:00
Ines Montani	9a849fe54e	Explicitly catch warning in test	2019-10-24 16:16:27 +02:00
adrianeboyd	8516e9d53b	Support train dict format as JSONL (#4471 ) * Support train dict format as JSONL * Add (overly simple) check for dict vs. tuple to read JSONL lines as either train dicts or train tuples * Extend JSON/JSONL roundtrip conversion tests using `docs_to_json()` and `GoldCorpus.train_tuples` * Revert docs to default JSON output with convert	2019-10-23 16:01:44 +02:00
Matthew Honnibal	9489c5f6b2	Clip most_similar to range [-1, 1] (fixes #4506 ) (#4507 ) * Clip most_similar to range [-1, 1] * Add/fix vectors tests * Fix test	2019-10-22 20:10:42 +02:00
Ines Montani	74a19aeb1c	Add xfailing test [ci skip]	2019-10-22 18:18:43 +02:00
gustavengstrom	050e2445a8	Adding noun_chunks to the Swedish language model (sv) (#4422 ) * Create syntax_iterators.py Replica of spacy/lang/fr/syntax_iterators.py * Added import statements for SYNTAX_ITERATORS * Create gustavengstrom.md * Added "dobj" to list of labels in noun_chunks method and a test_noun_chunks method to the Swedish language model. * Delete README-checkpoint.md Co-authored-by: Gustav <gustav@davcon.se> Co-authored-by: Ines Montani <ines@ines.io>	2019-10-21 12:57:06 +02:00
adrianeboyd	f5c551a43a	Checks/errors related to ill-formed IOB input in CLI convert and debug-data (#4487 ) * Error for ill-formed input to iob_to_biluo() Check for empty label in iob_to_biluo(), which can result from ill-formed input. * Check for empty NER label in debug-data	2019-10-21 12:20:28 +02:00
Sofie Van Landeghem	d5d55312b2	prevent division by zero in most_similar method (#4488 )	2019-10-21 12:04:46 +02:00
Ines Montani	181c01f629	Tidy up and auto-format	2019-10-18 11:27:38 +02:00
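
Note on f2bfaa1b38 (Filter subtoken matches in merge_subtokens()): the commit message describes keeping only the longest matches with no overlapping spans. The sketch below illustrates that filtering idea on plain (start, end) offsets. It is a minimal standalone example, not the code from the commit; the function name `filter_longest_spans` and the tuple format (end exclusive) are assumptions made for illustration.

```python
def filter_longest_spans(matches):
    """Keep the longest matches and drop any that overlap one already kept.

    `matches` is a list of (start, end) token offsets, with `end` exclusive.
    (Hypothetical helper for illustration; not part of spaCy.)
    """
    # Process the longest spans first; ties go to the earlier start.
    ordered = sorted(matches, key=lambda m: (m[1] - m[0], -m[0]), reverse=True)
    kept = []
    seen = set()  # token indices already covered by a kept span
    for start, end in ordered:
        # Because longer spans are handled first, an overlapping span
        # always has its first or last token inside a kept span.
        if start not in seen and end - 1 not in seen:
            kept.append((start, end))
            seen.update(range(start, end))
    return sorted(kept)


# All subsequences of a three-token run, plus one separate single token.
candidates = [(0, 2), (1, 3), (0, 3), (5, 6)]
filtered = filter_longest_spans(candidates)
print(filtered)
```

With the candidates above, only the full three-token span and the separate single-token span survive, so each token is merged at most once.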

1466 Commits