spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-07-15 02:32:37 +03:00

Author	SHA1	Message	Date
Adriane Boyd	c5de9b463a	Update custom tokenizer APIs and pickling (#8972 ) * Fix incorrect pickling of Japanese and Korean pipelines, which led to the entire pipeline being reset if pickled * Enable pickling of Vietnamese tokenizer * Update tokenizer APIs for Chinese, Japanese, Korean, Thai, and Vietnamese so that only the `Vocab` is required for initialization	2021-08-19 14:37:47 +02:00
Adriane Boyd	f99d6d5e39	Refactor scoring methods to use registered functions (#8766 ) * Add scorer option to components Add an optional `scorer` parameter to all pipeline components. If a scoring function is provided, it overrides the default scoring method for that component. * Add registered scorers for all components * Add `scorers` registry * Move all scoring methods outside of components as independent functions and register * Use the registered scoring methods as defaults in configs and inits Additional: * The scoring methods no longer have access to the full component, so use settings from `cfg` as default scorer options to handle settings such as `labels`, `threshold`, and `positive_label` * The `attribute_ruler` scoring method no longer has access to the patterns, so all scoring methods are called * Bug fix: `spancat` scoring method is updated to set `allow_overlap` to score overlapping spans correctly * Update Russian lemmatizer to use direct score method * Check type of cfg in Pipe.score * Fix check * Update spacy/pipeline/sentencizer.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Remove validate_examples from scoring functions * Use Pipe.labels instead of Pipe.cfg["labels"] Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-08-10 15:13:39 +02:00
fgaim	ee011ca963	Update Tigrinya ትግርኛ language support (#8900 ) * Add missing punctuation for Tigrinya and Amharic * Fix numeral and ordinal numbers for Tigrinya - Amharic was used in many cases - Also fixed some typos * Update Tigrinya stop-words * Contributor agreement for fgaim * Fix typo in "ti" lang test * Remove multi-word entries from numbers and ordinals	2021-08-10 13:55:08 +02:00
Dimitar Ganev	733ffe439d	Improve the stop words and the tokenizer exceptions in Bulgarian language. (#8862 ) * Add more stop words and Improve the readability * Add and categorize the tokenizer exceptions for `bg` lang * Create syrull.md * Add references for the additional stop words and tokenizer exc abbrs	2021-08-10 13:44:23 +02:00
Adriane Boyd	81d3a1edb1	Use tokenizer URL_MATCH pattern in LIKE_URL (#8765 )	2021-07-27 12:07:01 +02:00
Adriane Boyd	d48c01a6f7	Remove extraneous grc test file (#8768 )	2021-07-20 15:51:15 +02:00
explosion-bot	eff3d1088b	Auto-format code with black	2021-07-16 08:03:36 +00:00
jmyerston	993b0fab0e	Added ancient Greek language support (#8606 ) * Add ancient Greek language support Initial commit * Contributor Agreement * grc tokenizer test added and files formatted with black, unnecessary import removed Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Commas in lists fixed. __init__py added to test * Update lex_attrs.py * Update stop_words.py * Update stop_words.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-07-15 10:27:17 +02:00
Julien Rossi	e117573822	Adding noun_chunks to the DUTCH language model (nl) (#8529 ) * ✨ implement noun_chunks for dutch language * copy/paste FR and SV syntax iterators to accommodate UD tags * added tests with dutch text * signed contributor agreement * 🐛 fix noun chunks generator * built from scratch * define noun chunk as a single Noun-Phrase * includes some corner cases debugging (incorrect POS tagging) * test with provided annotated sample (POS, DEP) * ✅ fix failing test * CI pipeline did not like the added sample file * add the sample as a pytest fixture * Update spacy/lang/nl/syntax_iterators.py * Update spacy/lang/nl/syntax_iterators.py Code readability Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/tests/lang/nl/test_noun_chunks.py correct comment Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * finalize code * change "if next_word" into "if next_word is not None" Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-07-14 14:01:02 +02:00
Adriane Boyd	d8805a1073	Fix ru/uk lemmatizer mp with spawn (#8657 ) Use an instance variable instead of a class variable for the morphological analyzer so that multiprocessing with spawn is possible.	2021-07-09 15:36:56 +02:00
Adriane Boyd	b8e720fdb9	Fix Azerbaijani init, extend lang init tests (#8656 ) * Extend langs in initialize tests * Fix az init	2021-07-09 15:36:35 +02:00
Adriane Boyd	86d01e9229	Tidy up with flake8: imports, comparisons, etc.	2021-06-28 12:08:15 +02:00
Adriane Boyd	5eeb25f043	Tidy up code	2021-06-28 12:08:15 +02:00
Adriane Boyd	02bac8f269	Fix non-deterministic deduplication in Greek lemmatizer (#8421 )	2021-06-17 09:11:01 +02:00
Giovanni Toffoli	19521d525b	Added Italian POS-aware lemmatizer. (#8079 ) * Added Italian POS-aware lemmatizer. Also added the code used to build the lookup tables by POS. * Create gtoffoli.md * Add imports and format * Remove helper script * Use lemma_lookup instead of lemma_lookup_legacy Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-06-16 11:14:45 +02:00
Antti Ajanki	5a6125c227	[Finnish tokenizer] Handle conjunction contractions (#8105 )	2021-06-16 10:56:47 +02:00
Adriane Boyd	5646fcbe46	Merge remote-tracking branch 'upstream/develop' into chore/develop-into-master-v3.1	2021-06-15 15:05:17 +02:00
Adriane Boyd	b98d216205	Update Catalan language data (#8308 ) * Update Catalan language data Update Catalan language data based on contributions from the Text Mining Unit at the Barcelona Supercomputing Center: https://github.com/TeMU-BSC/spacy4release/tree/main/lang_data * Update tokenizer settings for UD Catalan AnCora Update for UD Catalan AnCora v2.7 with merged multi-word tokens. * Update test * Move prefix pattern to more generic infix pattern * Clean up	2021-06-11 10:21:22 +02:00
Adriane Boyd	f4008bdb13	Restrict pymorphy2 requirement to pymorphy2 mode (#8299 ) For the Russian and Ukrainian lemmatizers, restrict the `pymorphy2` requirement to the mode `pymorphy2` so that lookup or other lemmatizer modes can be loaded without installing `pymorphy2`.	2021-06-11 10:19:22 +02:00
Jean-Hugues Roy	ff5cf3606c	Improvements to French stopwords list (#7941 ) * "y" etc. Many changes described in pull request * Update spacy/lang/fr/stop_words.py * Update spacy/lang/fr/stop_words.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-06-02 11:50:49 +02:00
Paul O'Leary McCann	d1a221a374	Add all symbols in Unicode Currency Symbols block (#8212 ) * Add all symbols in Unicode Currency Symbols block In #8102 it came up that the rupee symbol was treated different from dollar / euro / yen symbols. This adds many symbols not already included. * Fix test * Fix training test	2021-05-31 18:03:40 +10:00
Adriane Boyd	1d59fdbd39	Update Vietnamese tokenizer (#8099 ) * Adapt tokenization methods from `pyvi` to preserve text encoding and whitespace * Add serialization support similar to Chinese and Japanese Note: as for Chinese and Japanese, some settings are duplicated in `config.cfg` and `tokenizer/cfg`.	2021-05-17 18:16:20 +10:00
Paul O'Leary McCann	bdeaf3a18b	Fix/fix en ordinals (#8028 ) * Fix #8019 "th" is not the only ordinal ending. * Add some more ordinal tests	2021-05-07 10:26:42 +02:00
Adriane Boyd	31528f62ed	Add / to nb infixes (#7991 )	2021-05-04 11:00:10 +02:00
Sevdimali	49aed683cc	Azerbaijani language added (#7911 )	2021-04-28 14:42:02 +02:00
Jacopo Farina	c105ed10fd	Remove torino from stop words (#7634 ) Torino is the proper name of a city and the token has no other meaning	2021-04-26 16:53:43 +02:00
m0canu1	921feee092	Added more exceptions for the Italian language from https://forum.wordr … (#7246 ) * Added more exceptions for the Italian language from https://forum.wordreference.com/threads/le-abbreviazioni-nella-lingua-italiana-abbreviations-in-italian.2464189/ * Remove unnecessary exception Co-authored-by: Alexandru Mocanu <alexandru.mocanu@augeos.it> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-03-30 10:23:32 +02:00
Adriane Boyd	3bcf74aca7	Rename and update ru pymorphy2 lookup lemmatize * To allow default lookup lemmatization with a blank Russian model, rename pymorphy2 lookup mode to `pymorphy2_lookup` * Bug fix: update pymorphy2 lookup lemmatize to return list rather than string	2021-03-15 11:11:06 +01:00
Adriane Boyd	264862c67a	Fix Ukrainian lemmatizer init (#7127 ) Fix class variable and init for `UkrainianLemmatizer` so that it loads the `uk` dictionaries rather than having the parent `RussianLemmatizer` override with the `ru` settings.	2021-02-22 11:05:08 +11:00
Boian Tzonev	cca8651fc8	Bulgarian tokenizer exceptions (#7114 ) * [Bulgarian] Add tokenizer exceptions and like_num for Bulgarian * [Bulgarian] Add tokenizer exceptions and like_num for Bulgarian	2021-02-19 19:19:19 +01:00
Ines Montani	9ba715ed16	Tidy up and auto-format	2021-02-13 12:55:56 +11:00
Ines Montani	6c450decfc	Fix punctuation settings and add to initialize tests	2021-02-13 11:51:21 +11:00
Shumi	4e514f1ea8	Update stop_words.py I have deleted line 1 to 5 and the statement print(STOP_WORDS)	2021-02-11 21:30:34 +02:00
Shumi	0d57e84b7b	Update lex_attrs.py I have removed line 1 to 4	2021-02-11 21:28:23 +02:00
Shumi	37ec67f868	Update examples.py I have removed two lines: # coding: utf8 from __future__ import unicode_literals And updated: >>> from spacy.lang.tn.examples import sentences	2021-02-11 21:25:58 +02:00
Shumi	39eeba6760	Update __init__.py Added infixes = TOKENIZER_INFIXES	2021-02-11 21:20:46 +02:00
Shumi	ed3397727e	Delete tag_map.py Tag map file is deleted. I will add it later because it was failing validations	2021-02-10 20:41:18 +02:00
Shumi	7c8721b1bd	Update tag_map.py Updated tag_map	2021-02-10 20:21:22 +02:00
Shumi	f6be28cfb2	Added files to Setswana Language Add South African Setswana Language	2021-02-10 20:15:13 +02:00
Shumi	24046fef17	South African Setswana language Please accept the addition of the Setswana language	2021-02-10 20:12:33 +02:00
svlandeg	91e72c031e	reformatting	2021-01-30 17:29:33 +01:00
svlandeg	a8d84188f0	add stop words Co-authored-by: tewodrosm <tedmaam2006@gmail.com>	2021-01-30 17:26:49 +01:00
Ines Montani	e6accb3a9e	Tidy up and auto-format	2021-01-30 12:52:33 +11:00
Ines Montani	817b0db521	Fix escape sequence	2021-01-30 12:39:58 +11:00
Ines Montani	bbf080dfe5	Merge pull request #6645 from bittlingmayer/patch-3	2021-01-30 01:26:28 +11:00
Adriane Boyd	bced6309e5	Add full exceptions with spaces	2021-01-29 14:27:22 +01:00
Ines Montani	5ed51c9dd2	Merge pull request #6828 from explosion/master-tmp	2021-01-27 23:05:46 +11:00
Adriane Boyd	d17afb4826	Add Spanish rule-based lemmatizer (#6833 ) * Initial Spanish lemmatizer * Handle merged verb+pron(s) multi-word tokens * Use VERB for AUX rule lookup * Add morph to lemma cache key * Fix aux lookups, minor refactoring * Improve verb+pron handling * Move verb+pron handling into its own method * Check for exceptions (primarily for se) * Collect pronouns in the same (not reversed) order * Only add modified possible lemmas	2021-01-27 19:21:35 +08:00
Ines Montani	615dba9d99	Fix tokenizer exceptions	2021-01-27 22:11:42 +11:00
Ines Montani	e3f8be9a94	Update language data	2021-01-27 13:29:22 +11:00

842 Commits