spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-17 15:41:59 +03:00

Author	SHA1	Message	Date
Adriane Boyd	1d59fdbd39	Update Vietnamese tokenizer (#8099 ) * Adapt tokenization methods from `pyvi` to preserve text encoding and whitespace * Add serialization support similar to Chinese and Japanese Note: as for Chinese and Japanese, some settings are duplicated in `config.cfg` and `tokenizer/cfg`.	2021-05-17 18:16:20 +10:00
Boian Tzonev	cca8651fc8	Bulgarian tokenizer exceptions (#7114 ) * [Bulgarian] Add tokenizer exceptions and like_num for Bulgarian * [Bulgarian] Add tokenizer exceptions and like_num for Bulgarian	2021-02-19 19:19:19 +01:00
Ines Montani	6c450decfc	Fix punctuation settings and add to initialize tests	2021-02-13 11:51:21 +11:00
Ines Montani	e6accb3a9e	Tidy up and auto-format	2021-01-30 12:52:33 +11:00
Ines Montani	5ed51c9dd2	Merge pull request #6828 from explosion/master-tmp	2021-01-27 23:05:46 +11:00
Adriane Boyd	d17afb4826	Add Spanish rule-based lemmatizer (#6833 ) * Initial Spanish lemmatizer * Handle merged verb+pron(s) multi-word tokens * Use VERB for AUX rule lookup * Add morph to lemma cache key * Fix aux lookups, minor refactoring * Improve verb+pron handling * Move verb+pron handling into its own method * Check for exceptions (primarily for se) * Collect pronouns in the same (not reversed) order * Only add modified possible lemmas	2021-01-27 19:21:35 +08:00
Ines Montani	230e651ad6	Merge branch 'develop' into master-tmp	2021-01-27 13:26:29 +11:00
muratjumashev	87168eb81f	Add tests	2021-01-24 20:56:16 +06:00
Sofie Van Landeghem	fed8f48965	raise NotImplementedError when noun_chunks iterator is not implemented (#6711 ) * raise NotImplementedError when noun_chunks iterator is not implemented * bring back, fix and document span.noun_chunks * formatting Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2021-01-17 19:56:05 +08:00
Ines Montani	b0b743597c	Tidy up and auto-format	2021-01-15 11:57:36 +11:00
Adriane Boyd	0c936004d1	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-rc3	2021-01-14 11:49:58 +01:00
ophelielacroix	e3222fdec9	Add (noun chunks) syntax iterators for Danish (#6246 ) * add syntax iterators for danish * add test noun chunks for danish syntax iterators * add contributor agreement * update da syntax iterators to remove nested chunks * add tests for da noun chunks * Fix test * add missing import * fix example * Prevent overlapping noun chunks Prevent overlapping noun chunks by tracking the end index of the previous noun chunk span. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-01-07 16:33:00 +11:00
Sofie Van Landeghem	6f7e7d88b9	remove cause without apostrophe from norm exceptions (#6636 )	2021-01-06 12:30:30 +08:00
Ines Montani	81f018fb67	Merge pull request #6671 from explosion/chore/tidy-autoformat Tidy up and auto-format	2021-01-05 14:45:31 +11:00
Ines Montani	991669c934	Tidy up and auto-format	2021-01-05 13:41:53 +11:00
svlandeg	a6a68da673	unskipping tests with python >= 3.6	2020-12-30 18:46:43 +01:00
Yosi	cf52510631	Add Amharic አማርኛ Language support (#6583 ) * Add Amharic to space * clean up * Add some PRON_LEMMA * add Tigrinya support * remove text_noun_chunks * Tigrinya Support * added some more details for ti * fix unit test * add amharic char range * changes from review * amharic and tigrinya share same unicode block * get rid of _amharic/_tigrinya in char_classes Co-authored-by: Josiah Solomon <jsolomon@meteorcomm.com>	2020-12-22 16:50:34 +01:00
Adriane Boyd	724831b066	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master * Update Macedonian for v3 * Update Turkish for v3	2020-11-25 11:49:34 +01:00
Duygu Altinok	0e55f806dd	Turkish tokenization improvements (#6268 ) * added single and paired orth variants * added token match * added long text tokenization test * inverted init * normalized lemmas to lowercase * more abbrevs * tests for ordinals and abbrevs * separated period abbvrevs to another list * fiex typo * added ordinal and abbrev tests * added number tests for dates * minor refinement * added inflected abbrevs regex * added percentage and inflection * cosmetics * added token match * added url inflection tests * excluded url tokens from custom pattern * removed url match import	2020-10-29 09:43:17 +01:00
Borijan Georgievski	2311192ba1	Include Macedonian language (#6230 ) * Include Macedonian language * Fix indentation at char_classes.py * Fix indentation at char_classes.py * Add Macedonian tests, update lex_attrs and char_classes * Import unicode literals for python 2	2020-10-15 15:55:01 +02:00
Ines Montani	d165af26be	Auto-format [ci skip]	2020-10-15 10:08:53 +02:00
Ines Montani	5d62499266	Fix tests	2020-10-15 09:29:15 +02:00
Ines Montani	178760855f	Merge branch 'develop' into master-tmp	2020-10-15 09:06:03 +02:00
Ines Montani	539b0c10da	Tidy up and auto-format	2020-10-10 19:14:48 +02:00
Duygu Altinok	80fb1bffc9	Ordinal numbers for Turkish (#6142 ) * minor ordinal number addition * fixed typo * added corresponding lexical test	2020-10-09 10:13:15 +02:00
Duygu Altinok	2fad279a44	Turkish language syntax iterators (#6191 ) * added tr_vocab to config * basic test * added syntax iterator to Turkish lang class * first version for Turkish syntax iter, without flat * added simple tests with nmod, amod, det * more tests to amod and nmod * separated noun chunks and parser test * rearrangement after nchunk parser separation * added recursive NPs * tests with complicated recursive NPs * tests with conjed NPs * additional tests for conj NP * small modification for shaving off conj from NP * added tests with flat * more tests with flat * added examples with flats conjed * added inner func for flat trick * corrected parse Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-09 10:10:22 +02:00
Duygu Altinok	7e821c2776	Turkish language syntax iterators (#6191 ) * added tr_vocab to config * basic test * added syntax iterator to Turkish lang class * first version for Turkish syntax iter, without flat * added simple tests with nmod, amod, det * more tests to amod and nmod * separated noun chunks and parser test * rearrangement after nchunk parser separation * added recursive NPs * tests with complicated recursive NPs * tests with conjed NPs * additional tests for conj NP * small modification for shaving off conj from NP * added tests with flat * more tests with flat * added examples with flats conjed * added inner func for flat trick * corrected parse Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-07 11:07:52 +02:00
Duygu Altinok	b95a11dd95	Ordinal numbers for Turkish (#6142 ) * minor ordinal number addition * fixed typo * added corresponding lexical test	2020-10-07 10:25:37 +02:00
Rahul Gupta	1a00bff06d	Hindi: Adds tests for lexical attributes (norm and like_num) (#5829 ) * Hindi: Adds tests for lexical attributes (norm and like_num) * Signs and sdds the contributor agreement * Add ordinal numbers to be tagged as like_num * Adds alternate pronunciation for 31 and 39	2020-10-07 10:23:32 +02:00
Ines Montani	3bc3c05fcc	Tidy up and auto-format	2020-10-03 17:20:18 +02:00
Ines Montani	7c4ab7e82c	Fix Lemmatizer.get_lookups_config	2020-10-03 17:16:10 +02:00
Ines Montani	f0b30aedad	Make lemmatizers use initialize logic (#6182 ) * Make lemmatizer use initialize logic and tidy up * Fix typo * Raise for uninitialized tables	2020-10-02 15:42:36 +02:00
Ines Montani	381258b75b	Merge pull request #6165 from explosion/feature/update-tokenizers-initialize	2020-10-01 09:49:47 +02:00
Adriane Boyd	6b7bb32834	Refactor Chinese initialization	2020-09-30 11:46:45 +02:00
Ines Montani	fa47f87924	Tidy up and auto-format	2020-09-29 21:39:28 +02:00
Adriane Boyd	11e195d3ed	Update ChineseTokenizer * Allow `pkuseg_model` to be set to `None` on initialization * Don't save config within tokenizer * Force convert pkuseg_model to use pickle protocol 4 by reencoding with `pickle5` on serialization * Update pkuseg serialization test	2020-09-27 14:00:18 +02:00
Ines Montani	ca3c997062	Improve CLI config validation with latest Thinc	2020-09-26 13:13:57 +02:00
Ines Montani	67fbcb3da5	Tidy up tests and docs	2020-09-21 20:43:54 +02:00
Adriane Boyd	7e4cd7575c	Refactor Docs.is_ flags (#6044 ) * Refactor Docs.is_ flags * Add derived `Doc.has_annotation` method * `Doc.has_annotation(attr)` returns `True` for partial annotation * `Doc.has_annotation(attr, require_complete=True)` returns `True` for complete annotation * Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced` and `is_nered` * Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The list is the `DocBin` attributes list plus `SPACY` and `LENGTH`. Notes on `Doc.has_annotation`: * `HEAD` is converted to `DEP` because heads don't have an unset state * Accept `IS_SENT_START` as a synonym of `SENT_START` Additional changes: * Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for `DocBin` * In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override `SENT_START` * In `Doc.from_array()` using `attrs` other than `Doc._get_array_attrs()` (i.e., a user's custom list rather than our default internal list) with both `HEAD` and `SENT_START` shows a warning that `HEAD` will override `SENT_START` * `set_children_from_heads` does not require dependency labels to set sentence boundaries and sets `sent_start` for all non-sentence starts to `-1` * Fix call to set_children_form_heads Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-17 00:14:01 +02:00
Adriane Boyd	87c329c711	Set rule-based lemmatizers as default (#6076 ) For languages without provided models and with lemmatizer rules in `spacy-lookups-data`, make the rule-based lemmatizer the default: Bengali, Persian, Norwegian, Swedish	2020-09-16 17:37:29 +02:00
Ines Montani	90043a6f9b	Tidy up and auto-format	2020-09-04 13:42:33 +02:00
Ines Montani	df0b68f60e	Remove unicode declarations and update language data	2020-09-04 13:19:16 +02:00
Ines Montani	864a697e63	Merge branch 'develop' into master-tmp	2020-09-04 13:15:36 +02:00
Ines Montani	5afe6447cd	registry.assets -> registry.misc	2020-09-03 17:31:14 +02:00
Shashank	450720aca2	Added support for Sanskrit language (#5956 ) * Added support for Sanskrit language * Added tests for lexical attribute like_num	2020-08-25 10:56:29 +02:00
idoshr	b10c7bc56e	Hebrew like num (#5952 ) * Update stop_words.py Hebrew STOP WORDS * Update stop_words.py * contributor * contributor * add some common domain extentions support human number 1K/1M.... * support human number 1K/1M.... * hebrew number tokenize 1K/1M implement in EN * test human tokenize fix * test * heb like num revert human number change * heb like num	2020-08-24 14:30:05 +02:00
Sofie Van Landeghem	56eabcb2f2	Adding num_like test for Czech (#5946 ) * Create lex_attrs.py Hello, I am missing a CZECH language in SpaCy. So I would like to help to push it a little. This file is base on others lex_attrs.py files just with translation to Czech. * Update __init__.py Updated for use with new Czech Lex_attrs file * Update stop_words.py * Create test_text.py * add like_num testing for czech Co-authored-by: holubvl3 <47881982+holubvl3@users.noreply.github.com> Co-authored-by: holubvl3 <vilemrousi@gmail.com> Co-authored-by: Vladimír Holubec <vholubec@arcdata.cz>	2020-08-21 17:06:33 +02:00
Adriane Boyd	e962784531	Add Lemmatizer and simplify related components (#5848 ) * Add Lemmatizer and simplify related components * Add `Lemmatizer` pipe with `lookup` and `rule` modes using the `Lookups` tables. * Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma) * Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer, or morph rules) * Remove lemmatizer from `Vocab` * Adjust many many tests Differences: * No default lookup lemmas * No special treatment of TAG in `from_array` and similar required * Easier to modify labels in a `Tagger` * No extra strings added from morphology / tag map * Fix test * Initial fix for Lemmatizer config/serialization * Adjust init test to be more generic * Adjust init test to force empty Lookups * Add simple cache to rule-based lemmatizer * Convert language-specific lemmatizers Convert language-specific lemmatizers to component lemmatizers. Remove previous lemmatizer class. * Fix French and Polish lemmatizers * Remove outdated UPOS conversions * Update Russian lemmatizer init in tests * Add minimal init/run tests for custom lemmatizers * Add option to overwrite existing lemmas * Update mode setting, lookup loading, and caching * Make `mode` an immutable property * Only enforce strict `load_lookups` for known supported modes * Move caching into individual `_lemmatize` methods * Implement strict when lang is not found in lookups * Fix tables/lookups in make_lemmatizer * Reallow provided lookups and allow for stricter checks * Add lookups asset to all Lemmatizer pipe tests * Rename lookups in lemmatizer init test * Clean up merge * Refactor lookup table loading * Add helper from `load_lemmatizer_lookups` that loads required and optional lookups tables based on settings provided by a config. Additional slight refactor of lookups: * Add `Lookups.set_table` to set a table from a provided `Table` * Reorder class definitions to be able to specify type as `Table` * Move registry assets into test methods * Refactor lookups tables config Use class methods within `Lemmatizer` to provide the config for particular modes and to load the lookups from a config. * Add pipe and score to lemmatizer * Simplify Tagger.score * Add missing import * Clean up imports and auto-format * Remove unused kwarg * Tidy up and auto-format * Update docstrings for Lemmatizer Update docstrings for Lemmatizer. Additionally modify `is_base_form` API to take `Token` instead of individual features. * Update docstrings * Remove tag map values from Tagger.add_label * Update API docs * Fix relative link in Lemmatizer API docs	2020-08-07 15:27:13 +02:00
Ines Montani	e68459296d	Tidy up and auto-format	2020-08-05 16:00:59 +02:00
Rahul Gupta	f76fae0e8d	English: adds ordinal numbers (#5830 )	2020-07-29 20:22:47 +02:00


250 Commits