spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-02-17 20:50:55 +03:00

Author	SHA1	Message	Date
Paul O'Leary McCann	410fb7ee43	Add Japanese Model (#5544 ) * Add more rules to deal with Japanese UD mappings Japanese UD rules sometimes give different UD tags to tokens with the same underlying POS tag. The UD spec indicates these cases should be disambiguated using the output of a tool called "comainu", but rules are enough to get the right result. These rules are taken from Ginza at time of writing, see #3756. * Add new tags from GSD This is a few rare tags that aren't in Unidic but are in the GSD data. * Add basic Japanese sentencization This code is taken from Ginza again. * Add sentenceizer quote handling Could probably add more paired characters but this will do for now. Also includes some tests. * Replace fugashi with SudachiPy * Modify tag format to match GSD annotations Some of the tests still need to be updated, but I want to get this up for testing training. * Deal with case with closing punct without opening * refactor resolve_pos() * change tag field separator from "," to "-" * add TAG_ORTH_MAP * add TAG_BIGRAM_MAP * revise rules for 連体詞 * revise rules for 連体詞 * improve POS about 2% * add syntax_iterator.py (not mature yet) * improve syntax_iterators.py * improve syntax_iterators.py * add phrases including nouns and drop NPs consist of STOP_WORDS * First take at noun chunks This works in many situations but still has issues in others. If the start of a subtree has no noun, then nested phrases can be generated. また行きたい、そんな気持ちにさせてくれるお店です。 [そんな気持ち, また行きたい、そんな気持ちにさせてくれるお店] For some reason て gets included sometimes. Not sure why. ゲンに連れ添って円盤生物を調査するパートナーとなる。 [て円盤生物, ...] Some phrases that look like they should be split are grouped together; not entirely sure that's wrong. This whole thing becomes one chunk: 道の駅遠山郷北側からかぐら大橋南詰現道交点までの1.060kmのみ開通済み * Use new generic get_words_and_spaces The new get_words_and_spaces function is simpler than what was used in Japanese, so it's good to be able to switch to it. However, there was an issue. The new function works just on text, so POS info could get out of sync. Fixing this required a small change to the way dtokens (tokens with POS and lemma info) were generated. Specifically, multiple extraneous spaces now become a single token, so when generating dtokens multiple space tokens should be created in a row. * Fix noun_chunks, should be working now * Fix some tests, add naughty strings tests Some of the existing tests changed because the tokenization mode of Sudachi changed to the more fine-grained A mode. Sudachi also has issues with some strings, so this adds a test against the naughty strings. * Remove empty Sudachi tokens Not doing this creates zero-length tokens and causes errors in the internal spaCy processing. * Add yield_bunsetu back in as a separate piece of code Co-authored-by: Hiroshi Matsuda <40782025+hiroshi-matsuda-rit@users.noreply.github.com> Co-authored-by: hiroshi <hiroshi_matsuda@megagon.ai>	2020-06-04 19:15:43 +02:00
Ines Montani	c6ec19c844	Add missing declaration	2020-05-21 17:30:05 +02:00
Ines Montani	bd6353715a	Merge branch 'master' into fix/travis-tests	2020-05-21 14:23:04 +02:00
Ines Montani	d8f3190c0a	Tidy up and auto-format	2020-05-21 14:14:01 +02:00
Ines Montani	56de520afd	Try to fix tests on Travis (2.7)	2020-05-21 14:04:57 +02:00
adrianeboyd	0061992d95	Update Polish tokenizer for UD_Polish-PDB (#5432 ) Update Polish tokenizer for UD_Polish-PDB, which is a relatively major change from the existing tokenizer. Unused exceptions files and conflicting test cases removed. Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-19 15:59:55 +02:00
adrianeboyd	a5cd203284	Reduce stored lexemes data, move feats to lookups (#5238 ) * Reduce stored lexemes data, move feats to lookups * Move non-derivable lexemes features (`norm / cluster / prob`) to `spacy-lookups-data` as lookups * Get/set `norm` in both lookups and `LexemeC`, serialize in lookups * Remove `cluster` and `prob` from `LexemesC`, get/set/serialize in lookups only * Remove serialization of lexemes data as `vocab/lexemes.bin` * Remove `SerializedLexemeC` * Remove `Lexeme.to_bytes/from_bytes` * Modify normalization exception loading: * Always create `Vocab.lookups` table `lexeme_norm` for normalization exceptions * Load base exceptions from `lang.norm_exceptions`, but load language-specific exceptions from lookups * Set `lex_attr_getter[NORM]` including new lookups table in `BaseDefaults.create_vocab()` and when deserializing `Vocab` * Remove all cached lexemes when deserializing vocab to override existing normalizations with the new normalizations (as a replacement for the previous step that replaced all lexemes data with the deserialized data) * Skip English normalization test Skip English normalization test because the data is now in `spacy-lookups-data`. * Remove norm exceptions Moved to spacy-lookups-data. * Move norm exceptions test to spacy-lookups-data * Load extra lookups from spacy-lookups-data lazily Load extra lookups (currently for cluster and prob) lazily from the entry point `lg_extra` as `Vocab.lookups_extra`. * Skip creating lexeme cache on load To improve model loading times, do not create the full lexeme cache when loading. The lexemes will be created on demand when processing. * Identify numeric values in Lexeme.set_attrs() With the removal of a special case for `PROB`, also identify `float` to avoid trying to convert it with the `StringStore`. * Skip lexeme cache init in from_bytes * Unskip and update lookups tests for python3.6+ * Update vocab pickle to include lookups_extra * Update vocab serialization tests Check strings rather than lexemes since lexemes aren't initialized automatically, account for addition of "_SP". * Re-skip lookups test because of python3.5 * Skip PROB/float values in Lexeme.set_attrs * Convert is_oov from lexeme flag to lex in vectors Instead of storing `is_oov` as a lexeme flag, `is_oov` reports whether the lexeme has a vector. Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-19 15:59:14 +02:00
adrianeboyd	780b869345	Fix syntax iterators for Persian (#5437 )	2020-05-14 16:51:03 +02:00
Vishnu Priya VR	9ce059dd06	Limiting noun_chunks for specific languages (#5396 ) * Limiting noun_chunks for specific langauges * Limiting noun_chunks for specific languages Contributor Agreement * Addressing review comments * Removed unused fixtures and imports * Add fa_tokenizer in test suite * Use fa_tokenizer in test * Undo extraneous reformatting Co-authored-by: adrianeboyd <adrianeboyd@gmail.com>	2020-05-14 12:58:06 +02:00
adrianeboyd	c963e269ba	Add method to update / reset pkuseg user dict (#5404 )	2020-05-08 11:21:46 +02:00
Samuel Rodríguez Medina	5e55bfa821	Fixed tests for Swedish that were written in Danish. (#5395 )	2020-05-05 14:06:27 +02:00
Samuel Rodríguez Medina	148b036e0c	Spanish like num improvement (#5381 ) * Add tests for Spanish like_num. * Add missing numbers in Spanish lexical attributes for like_num. * Modify Spanish test function name. * Add contributor agreement.	2020-04-30 11:13:23 +02:00
Samuel Rodríguez Medina	8602daba85	Swedish like_num (#5371 ) * Sign contributor agreement. * Add like_num functionality to Swedish. * Update spacy/tests/lang/sv/test_lex_attrs.py Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update contributor agreement Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2020-04-29 21:25:22 +02:00
Sofie Van Landeghem	9203d821ae	Add 2 ini files in tests/lang (#5359 )	2020-04-27 13:01:54 +02:00
Punitvara	b2b7e1f37a	This PR adds Gujarati Language class along with (#5355 ) * This PR adds Gujarati Language class along with - stop words * Add test for gu tokenizer	2020-04-27 11:07:37 +02:00
sabiqueqb	fc91660aa2	Gh 5339 language class for malayalam (#5342 ) * Initialize Malayalam Language class * Add lex_attrs and examples for Malayalam * Add spaCy Contributor Agreement * Add test for ml tokenizer	2020-04-27 09:45:08 +02:00
adrianeboyd	f7471abd82	Add pkuseg and serialization support for Chinese (#5308 ) * Add pkuseg and serialization support for Chinese Add support for pkuseg alongside jieba * Specify model through `Language` meta: * split on characters (if no word segmentation packages are installed) ``` Chinese(meta={"tokenizer": {"config": {"use_jieba": False, "use_pkuseg": False}}}) ``` * jieba (remains the default tokenizer if installed) ``` Chinese() Chinese(meta={"tokenizer": {"config": {"use_jieba": True}}}) # explicit ``` * pkuseg ``` Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}}}) ``` * The new tokenizer setting `require_pkuseg` is used to override `use_jieba` default, which is intended for models that provide a pkuseg model: ``` nlp_pkuseg = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "require_pkuseg": True}}}) nlp = Chinese() # has `use_jieba` as `True` by default nlp.from_bytes(nlp_pkuseg.to_bytes()) # `require_pkuseg` overrides `use_jieba` when calling the tokenizer ``` Add support for serialization of tokenizer settings and pkuseg model, if loaded * Add sorting for `Language.to_bytes()` serialization of `Language.meta` so that the (emptied, but still present) tokenizer metadata is in a consistent position in the serialized data Extend tests to cover all three tokenizer configurations and serialization * Fix from_disk and tests without jieba or pkuseg * Load cfg first and only show error if `use_pkuseg` * Fix blank/default initialization in serialization tests * Explicitly initialize jieba's cache on init * Add serialization for pkuseg pre/postprocessors * Reformat pkuseg install message	2020-04-18 17:01:53 +02:00
adrianeboyd	cf579a398d	Add __init__.py to eu and hy tests (#5278 )	2020-04-08 20:03:06 +02:00
YohannesDatasci	beef184e53	Armenian language support (#5246 ) * add Armenian language and test cases * agreement submission	2020-04-03 13:02:18 +02:00
Ines Montani	828acffc12	Tidy up and auto-format	2020-03-25 12:28:12 +01:00
adrianeboyd	86c43e55fa	Improve Lithuanian tokenization (#5205 ) * Improve Lithuanian tokenization Modify Lithuanian tokenization to improve performance for UD_Lithuanian-ALKSNIS. * Update Lithuanian tokenizer tests	2020-03-25 11:28:12 +01:00
Adriane Boyd	09d442f5ad	Merge remote-tracking branch 'upstream/master' into feature/ud-tokenization-da	2020-03-25 09:41:52 +01:00
Adriane Boyd	cba2d1d972	Disable failing abbreviation test UD_Danish-DDT has (as far as I can tell) hallucinated periods after abbreviations, so the changes are an artifact of the corpus and not due to anything meaningful about Danish tokenization.	2020-03-25 09:39:26 +01:00
Muhammad Irfan	03376c9d9b	Basque language added and tested.	2020-03-04 11:58:56 +05:00
Sofie Van Landeghem	44f4142ce4	add two abbreviations and some additional unit tests (#5040 )	2020-02-22 14:12:32 +01:00
Antti Ajanki	e1f777b151	Improvements for Finnish tokenizer (#4985 ) * don't split on a colon. Colon is used to attach suffixes for abbreviations * tokenize on any of LIST_HYPHENS (except a single hyphen), not just on -- * simplify infix rules by merging similar rules	2020-02-10 20:32:43 -05:00
adrianeboyd	aef83e8070	Mark most Hungarian tokenizer test cases as slow (#4883 ) * Mark most Hungarian tokenizer test cases as slow Mark most Hungarian tokenizer test cases as slow to reduce the runtime of the test suite in ordinary usage: * for normal tests: run default tests plus 10% of the detailed tests * for slow tests: run all tests * Rework to mark individual tests as slow	2020-01-08 12:34:06 +01:00
adrianeboyd	d652ff215d	Add trailing whitespace to multiline test text (#4877 )	2020-01-06 14:58:59 +01:00
Ines Montani	cb4145adc7	Tidy up and auto-format	2019-12-21 19:04:17 +01:00
Olamilekan Wahab	a741de7cf6	Adding support for Yoruba Language (#4614 ) * Adding Support for Yoruba * test text * Updated test string. * Fixing encoding declaration. * Adding encoding to stop_words.py * Added contributor agreement and removed iranlowo. * Added removed test files and removed iranlowo to keep project bare. * Returned CONTRIBUTING.md to default state. * Added delted conftest entries * Tidy up and auto-format * Revert CONTRIBUTING.md Co-authored-by: Ines Montani <ines@ines.io>	2019-12-21 14:11:50 +01:00
Antti Ajanki	e626a011cc	Improvements to the Finnish language data (#4738 ) * Enable lex_attrs on Finnish * Copy the Danish tokenizer rules to Finnish Specifically, don't break hyphenated compound words * Contributor agreement * A new file for Finnish tokenizer rules instead of including the Danish ones	2019-12-03 12:55:28 +01:00
Christoph Purschke	a7ee4b6f17	new tests & tokenization fixes (#4734 ) - added some tests for tokenization issues - fixed some issues with tokenization of words with hyphen infix - rewrote the "tokenizer_exceptions.py" file (stemming from the German version)	2019-12-01 23:08:21 +01:00
adrianeboyd	2c876eb672	Add tokenizer explain() debugging method (#4596 ) * Expose tokenizer rules as a property Expose the tokenizer rules property in the same way as the other core properties. (The cache resetting is overkill, but consistent with `from_bytes` for now.) Add tests and update Tokenizer API docs. * Update Hungarian punctuation to remove empty string Update Hungarian punctuation definitions so that `_units` does not match an empty string. * Use _load_special_tokenization consistently Use `_load_special_tokenization()` and have it to handle `None` checks. * Fix precedence of `token_match` vs. special cases Remove `token_match` check from `_split_affixes()` so that special cases have precedence over `token_match`. `token_match` is checked only before infixes are split. * Add `make_debug_doc()` to the Tokenizer Add `make_debug_doc()` to the Tokenizer as a working implementation of the pseudo-code in the docs. Add a test (marked as slow) that checks that `nlp.tokenizer()` and `nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens for all languages that have `examples.sentences` that can be imported. * Update tokenization usage docs Update pseudo-code and algorithm description to correspond to `nlp.tokenizer.make_debug_doc()` with example debugging usage. Add more examples for customizing tokenizers while preserving the existing defaults. Minor edits / clarifications. * Revert "Update Hungarian punctuation to remove empty string" This reverts commit `f0a577f7a5`. * Rework `make_debug_doc()` as `explain()` Rework `make_debug_doc()` as `explain()`, which returns a list of `(pattern_string, token_string)` tuples rather than a non-standard `Doc`. Update docs and tests accordingly, leaving the visualization for future work. * Handle cases with bad tokenizer patterns Detect when tokenizer patterns match empty prefixes and suffixes so that `explain()` does not hang on bad patterns. * Remove unused displacy image * Add tokenizer.explain() to usage docs	2019-11-20 13:07:25 +01:00
Christoph Purschke	433748e867	Fix basic language support for Luxembourgish (by adding punctuation.py) (#4648 ) * Update __init__.py * Create punctuation.py * Update tokenizer_exceptions.py * Create questoph.md * Update questoph.md * Update test_text.py * Update test_text.py * Update test_text.py * Update test_text.py	2019-11-15 16:16:47 +01:00
adrianeboyd	0b9a5f4074	Rework Chinese language initialization and tokenization (#4619 ) * Rework Chinese language initialization * Create a `ChineseTokenizer` class * Modify jieba post-processing to handle whitespace correctly * Modify non-jieba character tokenization to handle whitespace correctly * Add a `create_tokenizer()` method to `ChineseDefaults` * Load lexical attributes * Update Chinese tag_map for UD v2 * Add very basic Chinese tests * Test tokenization with and without jieba * Test `like_num` attribute * Fix try_jieba_import() * Fix zh code formatting	2019-11-11 14:23:21 +01:00
Ines Montani	73dc63d3bf	Tidy up and auto-format [ci skip]	2019-10-24 16:20:48 +02:00
gustavengstrom	050e2445a8	Adding noun_chunks to the Swedish language model (sv) (#4422 ) * Create syntax_iterators.py Replica of spacy/lang/fr/syntax_iterators.py * Added import statements for SYNTAX_ITERATORS * Create gustavengstrom.md * Added "dobj" to list of labels in noun_chunks method and a test_noun_chunks method to the Swedish language model. * Delete README-checkpoint.md Co-authored-by: Gustav <gustav@davcon.se> Co-authored-by: Ines Montani <ines@ines.io>	2019-10-21 12:57:06 +02:00
Ines Montani	181c01f629	Tidy up and auto-format	2019-10-18 11:27:38 +02:00
Peter Gilles	428887b8f2	Initial commit: New language Luxembourgish (lb) (#4424 ) * new language: Luxembourgish (lb) * update * update * Update and rename .github/CONTRIBUTOR_AGREEMENT.md to .github/contributors/PeterGilles.md * Update and rename .github/contributors/PeterGilles.md to .github/CONTRIBUTOR_AGREEMENT.md * Update norm_exceptions.py * Delete README.md * moved test_lemma.py * deactivated 'lemma_lookup = LOOKUP' * update * Update conftest.py * update * tests updated * import unicode_literals * Update spacy/tests/lang/lb/test_text.py Co-Authored-By: Ines Montani <ines@ines.io> * Create PeterGilles.md	2019-10-14 12:27:50 +02:00
Ines Montani	e0cf4796a5	Move lookup tables out of the core library (#4346 ) * Add default to util.get_entry_point * Tidy up entry points * Read lookups from entry points * Remove lookup tables and related tests * Add lookups install option * Remove lemmatizer tests * Remove logic to process language data files * Update setup.cfg	2019-10-01 00:01:27 +02:00
Ines Montani	3d8fd4b461	Revert #4334	2019-09-29 17:32:12 +02:00
Ines Montani	c9cd516d96	Move tests out of package (#4334 ) * Move tests out of package * Fix typo	2019-09-28 18:05:00 +02:00
Ines Montani	5fe61539c4	Fix unicode "e" in filename	2019-09-28 15:45:16 +02:00
Ines Montani	acd5bcb0b3	Tidy up fixtures	2019-09-27 17:57:59 +02:00
Ines Montani	bab9976d9a	💫 Adjust Table API and add docs (#4289 ) * Adjust Table API and add docs * Add attributes and update description [ci skip] * Use strings.get_string_id instead of hash_string * Fix table method calls * Make orth arg in Lemmatizer.lookup optional Fall back to string, which is now handled by Table.__contains__ out-of-the-box * Fix method name * Auto-format	2019-09-15 22:08:13 +02:00
Ines Montani	3126dd0904	Tidy up and auto-format [ci skip]	2019-09-14 12:58:06 +02:00
Paul O'Leary McCann	29a9e636eb	Fix half-width space handling in JA (#4284 ) (closes #4262 ) Before this patch, half-width spaces between words were simply lost in Japanese text. This wasn't immediately noticeable because much Japanese text never uses spaces at all.	2019-09-13 16:28:12 +02:00
Paul O'Leary McCann	7d8df69158	Bloom-filter backed Lookup Tables (#4268 ) * Improve load_language_data helper * WIP: Add Lookups implementation * Start moving lemma data over to JSON * WIP: move data over for more languages * Convert more languages * Fix lemmatizer fixtures in tests * Finish conversion * Auto-format JSON files * Fix test for now * Make sure tables are stored on instance * Update docstrings * Update docstrings and errors * Update test * Add Lookups.__len__ * Add serialization methods * Add Lookups.remove_table * Use msgpack for serialization to disk * Fix file exists check * Try using OrderedDict for everything * Update .flake8 [ci skip] * Try fixing serialization * Update test_lookups.py * Update test_serialize_vocab_strings.py * Lookups / Tables now work This implements the stubs in the Lookups/Table classes. Currently this is in Cython but with no type declarations, so that could be improved. * Add lookups to setup.py * Actually add lookups pyx The previous commit added the old py file... * Lookups work-in-progress * Move from pyx back to py * Add string based lookups, fix serialization * Update tests, language/lemmatizer to work with string lookups There are some outstanding issues here: - a pickling-related test fails due to the bloom filter - some custom lemmatizers (fr/nl at least) have issues More generally, there's a question of how to deal with the case where you have a string but want to use the lookup table. Currently the table allows access by string or id, but that's getting pretty awkward. * Change lemmatizer lookup method to pass (orth, string) * Fix token lookup * Fix French lookup * Fix lt lemmatizer test * Fix Dutch lemmatizer * Fix lemmatizer lookup test This was using a normal dict instead of a Table, so checks for the string instead of an integer key failed. * Make uk/nl/ru lemmatizer lookup methods consistent The mentioned tokenizers all have their own implementation of the `lookup` method, which accesses a `Lookups` table. The way that was called in `token.pyx` was changed so this should be updated to have the same arguments as `lookup` in `lemmatizer.py` (specificially (orth/id, string)). Prior to this change tests weren't failing, but there would probably be issues with normal use of a model. More tests should proably be added. Additionally, the language-specific `lookup` implementations seem like they might not be needed, since they handle things like lower-casing that aren't actually language specific. * Make recently added Greek method compatible * Remove redundant class/method Leftovers from a merge not cleaned up adequately.	2019-09-12 17:26:11 +02:00
adrianeboyd	c32126359a	Allow period as suffix following punctuation (#4248 ) Addresses rare cases (such as `_MATH_.`, see #1061) where the final period was not recognized as a suffix following punctuation.	2019-09-09 19:19:22 +02:00
Pavle Vidanović	d03401f532	Lemmatizer lookup dictionary for Serbian and basic tag set adde… (#4251 ) * Serbian stopwords added. (cyrillic alphabet) * spaCy Contribution agreement included. * Test initialize updated * Serbian language code update. --bugfix * Tokenizer exceptions added. Init file updated. * Norm exceptions and lexical attributes added. * Examples added. * Tests added. * sr_lang examples update. * Tokenizer exceptions updated. (Serbian) * Lemmatizer created. Licence included. * Test updated. * Tag map basic added. * tag_map.py file removed since it uses default spacy tags.	2019-09-08 14:19:15 +02:00

1 2 3 4

169 Commits