spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-28 19:06:33 +03:00

Author	SHA1	Message	Date
Adriane Boyd	03f234b739	Merge remote-tracking branch 'upstream/master' into develop	2021-09-27 09:10:45 +02:00
Rumesh Madhusanka	68264b4cee	Updating the stop word list for Sinhala language (#9270 )	2021-09-22 20:43:42 +02:00
Paul O'Leary McCann	0f01f46e02	Update Cython string types (#9143 ) * Replace all basestring references with unicode `basestring` was a compatibility type introduced by Cython to make dealing with utf-8 strings in Python2 easier. In Python3 it is equivalent to the unicode (or str) type. I replaced all references to basestring with unicode, since that was used elsewhere, but we could also just replace them with str, which should also be equivalent. All tests pass locally. * Replace all references to unicode type with str Since we only support python3 this is simpler. * Remove all references to unicode type This removes all references to the unicode type across the codebase and replaces them with `str`, which makes it more drastic than the prior commits. In order to make this work importing `unicode_literals` had to be removed, and one explicit unicode literal also had to be removed (it is unclear why this is necessary in Cython with language level 3, but without doing it there were errors about implicit conversion). When `unicode` is used as a type in comments it was also edited to be `str`. Additionally `coding: utf8` headers were removed from a few files.	2021-09-13 17:02:17 +02:00
David Strouk	31e9b126a0	Fix verbs list in lang/fr/tokenizer_exceptions.py (#9033 )	2021-08-25 15:55:09 +02:00
Adriane Boyd	c5de9b463a	Update custom tokenizer APIs and pickling (#8972 ) * Fix incorrect pickling of Japanese and Korean pipelines, which led to the entire pipeline being reset if pickled * Enable pickling of Vietnamese tokenizer * Update tokenizer APIs for Chinese, Japanese, Korean, Thai, and Vietnamese so that only the `Vocab` is required for initialization	2021-08-19 14:37:47 +02:00
Adriane Boyd	f99d6d5e39	Refactor scoring methods to use registered functions (#8766 ) * Add scorer option to components Add an optional `scorer` parameter to all pipeline components. If a scoring function is provided, it overrides the default scoring method for that component. * Add registered scorers for all components * Add `scorers` registry * Move all scoring methods outside of components as independent functions and register * Use the registered scoring methods as defaults in configs and inits Additional: * The scoring methods no longer have access to the full component, so use settings from `cfg` as default scorer options to handle settings such as `labels`, `threshold`, and `positive_label` * The `attribute_ruler` scoring method no longer has access to the patterns, so all scoring methods are called * Bug fix: `spancat` scoring method is updated to set `allow_overlap` to score overlapping spans correctly * Update Russian lemmatizer to use direct score method * Check type of cfg in Pipe.score * Fix check * Update spacy/pipeline/sentencizer.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Remove validate_examples from scoring functions * Use Pipe.labels instead of Pipe.cfg["labels"] Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-08-10 15:13:39 +02:00
fgaim	ee011ca963	Update Tigrinya ትግርኛ language support (#8900 ) * Add missing punctuation for Tigrinya and Amharic * Fix numeral and ordinal numbers for Tigrinya - Amharic was used in many cases - Also fixed some typos * Update Tigrinya stop-words * Contributor agreement for fgaim * Fix typo in "ti" lang test * Remove multi-word entries from numbers and ordinals	2021-08-10 13:55:08 +02:00
Dimitar Ganev	733ffe439d	Improve the stop words and the tokenizer exceptions in Bulgarian language. (#8862 ) * Add more stop words and Improve the readability * Add and categorize the tokenizer exceptions for `bg` lang * Create syrull.md * Add references for the additional stop words and tokenizer exc abbrs	2021-08-10 13:44:23 +02:00
Adriane Boyd	81d3a1edb1	Use tokenizer URL_MATCH pattern in LIKE_URL (#8765 )	2021-07-27 12:07:01 +02:00
Adriane Boyd	d48c01a6f7	Remove extraneous grc test file (#8768 )	2021-07-20 15:51:15 +02:00
explosion-bot	eff3d1088b	Auto-format code with black	2021-07-16 08:03:36 +00:00
jmyerston	993b0fab0e	Added ancient Greek language support (#8606 ) * Add ancient Greek language support Initial commit * Contributor Agreement * grc tokenizer test added and files formatted with black, unnecessary import removed Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Commas in lists fixed. __init__py added to test * Update lex_attrs.py * Update stop_words.py * Update stop_words.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-07-15 10:27:17 +02:00
Julien Rossi	e117573822	Adding noun_chunks to the DUTCH language model (nl) (#8529 ) * ✨ implement noun_chunks for dutch language * copy/paste FR and SV syntax iterators to accommodate UD tags * added tests with dutch text * signed contributor agreement * 🐛 fix noun chunks generator * built from scratch * define noun chunk as a single Noun-Phrase * includes some corner cases debugging (incorrect POS tagging) * test with provided annotated sample (POS, DEP) * ✅ fix failing test * CI pipeline did not like the added sample file * add the sample as a pytest fixture * Update spacy/lang/nl/syntax_iterators.py * Update spacy/lang/nl/syntax_iterators.py Code readability Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/tests/lang/nl/test_noun_chunks.py correct comment Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * finalize code * change "if next_word" into "if next_word is not None" Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-07-14 14:01:02 +02:00
Adriane Boyd	d8805a1073	Fix ru/uk lemmatizer mp with spawn (#8657 ) Use an instance variable instead of a class variable for the morphological analyzer so that multiprocessing with spawn is possible.	2021-07-09 15:36:56 +02:00
Adriane Boyd	b8e720fdb9	Fix Azerbaijani init, extend lang init tests (#8656 ) * Extend langs in initialize tests * Fix az init	2021-07-09 15:36:35 +02:00
Adriane Boyd	86d01e9229	Tidy up with flake8: imports, comparisons, etc.	2021-06-28 12:08:15 +02:00
Adriane Boyd	5eeb25f043	Tidy up code	2021-06-28 12:08:15 +02:00
Adriane Boyd	02bac8f269	Fix non-deterministic deduplication in Greek lemmatizer (#8421 )	2021-06-17 09:11:01 +02:00
Giovanni Toffoli	19521d525b	Added Italian POS-aware lemmatizer. (#8079 ) * Added Italian POS-aware lemmatizer. Also added the code used to build the lookup tables by POS. * Create gtoffoli.md * Add imports and format * Remove helper script * Use lemma_lookup instead of lemma_lookup_legacy Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-06-16 11:14:45 +02:00
Antti Ajanki	5a6125c227	[Finnish tokenizer] Handle conjunction contractions (#8105 )	2021-06-16 10:56:47 +02:00
Adriane Boyd	5646fcbe46	Merge remote-tracking branch 'upstream/develop' into chore/develop-into-master-v3.1	2021-06-15 15:05:17 +02:00
Adriane Boyd	b98d216205	Update Catalan language data (#8308 ) * Update Catalan language data Update Catalan language data based on contributions from the Text Mining Unit at the Barcelona Supercomputing Center: https://github.com/TeMU-BSC/spacy4release/tree/main/lang_data * Update tokenizer settings for UD Catalan AnCora Update for UD Catalan AnCora v2.7 with merged multi-word tokens. * Update test * Move prefix pattern to more generic infix pattern * Clean up	2021-06-11 10:21:22 +02:00
Adriane Boyd	f4008bdb13	Restrict pymorphy2 requirement to pymorphy2 mode (#8299 ) For the Russian and Ukrainian lemmatizers, restrict the `pymorphy2` requirement to the mode `pymorphy2` so that lookup or other lemmatizer modes can be loaded without installing `pymorphy2`.	2021-06-11 10:19:22 +02:00
Jean-Hugues Roy	ff5cf3606c	Improvements to French stopwords list (#7941 ) * "y" etc. Many changes described in pull request * Update spacy/lang/fr/stop_words.py * Update spacy/lang/fr/stop_words.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-06-02 11:50:49 +02:00
Paul O'Leary McCann	d1a221a374	Add all symbols in Unicode Currency Symbols block (#8212 ) * Add all symbols in Unicode Currency Symbols block In #8102 it came up that the rupee symbol was treated different from dollar / euro / yen symbols. This adds many symbols not already included. * Fix test * Fix training test	2021-05-31 18:03:40 +10:00
Adriane Boyd	1d59fdbd39	Update Vietnamese tokenizer (#8099 ) * Adapt tokenization methods from `pyvi` to preserve text encoding and whitespace * Add serialization support similar to Chinese and Japanese Note: as for Chinese and Japanese, some settings are duplicated in `config.cfg` and `tokenizer/cfg`.	2021-05-17 18:16:20 +10:00
Paul O'Leary McCann	bdeaf3a18b	Fix/fix en ordinals (#8028 ) * Fix #8019 "th" is not the only ordinal ending. * Add some more ordinal tests	2021-05-07 10:26:42 +02:00
Adriane Boyd	31528f62ed	Add / to nb infixes (#7991 )	2021-05-04 11:00:10 +02:00
Sevdimali	49aed683cc	Azerbaijani language added (#7911 )	2021-04-28 14:42:02 +02:00
Jacopo Farina	c105ed10fd	Remove torino from stop words (#7634 ) Torino is the proper name of a city and the token has no other meaning	2021-04-26 16:53:43 +02:00
m0canu1	921feee092	Added more exception to the italian language from https://forum.wordr … (#7246 ) * Added more exception to the italian language from https://forum.wordreference.com/threads/le-abbreviazioni-nella-lingua-italiana-abbreviations-in-italian.2464189/ * Remove unnecessary exception Co-authored-by: Alexandru Mocanu <alexandru.mocanu@augeos.it> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-03-30 10:23:32 +02:00
Adriane Boyd	3bcf74aca7	Rename and update ru pymorphy2 lookup lemmatize * To allow default lookup lemmatization with a blank Russian model, rename pymorphy2 lookup mode to `pymorphy2_lookup` * Bug fix: update pymorphy2 lookup lemmatize to return list rather than string	2021-03-15 11:11:06 +01:00
Adriane Boyd	264862c67a	Fix Ukrainian lemmatizer init (#7127 ) Fix class variable and init for `UkrainianLemmatizer` so that it loads the `uk` dictionaries rather than having the parent `RussianLemmatizer` override with the `ru` settings.	2021-02-22 11:05:08 +11:00
Boian Tzonev	cca8651fc8	Bulgarian tokenizer exceptions (#7114 ) * [Bulgarian] Add tokenizer exceptions and like_num for Bulgarian * [Bulgarian] Add tokenizer exceptions and like_num for Bulgarian	2021-02-19 19:19:19 +01:00
Ines Montani	9ba715ed16	Tidy up and auto-format	2021-02-13 12:55:56 +11:00
Ines Montani	6c450decfc	Fix punctuation settings and add to initialize tests	2021-02-13 11:51:21 +11:00
Shumi	4e514f1ea8	Update stop_words.py I have deleted line 1 to 5 and the statement print(STOP_WORDS)	2021-02-11 21:30:34 +02:00
Shumi	0d57e84b7b	Update lex_attrs.py I have removed line 1 to 4	2021-02-11 21:28:23 +02:00
Shumi	37ec67f868	Update examples.py I have removed two lines: # coding: utf8 from __future__ import unicode_literals And updated: >>> from spacy.lang.tn.examples import sentences	2021-02-11 21:25:58 +02:00
Shumi	39eeba6760	Update __init__.py Added infixes = TOKENIZER_INFIXES	2021-02-11 21:20:46 +02:00
Shumi	ed3397727e	Delete tag_map.py Tag map file is deleted. I will add it later because it was failing validations	2021-02-10 20:41:18 +02:00
Shumi	7c8721b1bd	Update tag_map.py Updated tag_map	2021-02-10 20:21:22 +02:00
Shumi	f6be28cfb2	Added files to Setswana Language Add South African Setswana Language	2021-02-10 20:15:13 +02:00
Shumi	24046fef17	South African Setswana language Please accept the addition of Setswana language	2021-02-10 20:12:33 +02:00
svlandeg	91e72c031e	reformatting	2021-01-30 17:29:33 +01:00
svlandeg	a8d84188f0	add stop words Co-authored-by: tewodrosm <tedmaam2006@gmail.com>	2021-01-30 17:26:49 +01:00
Ines Montani	e6accb3a9e	Tidy up and auto-format	2021-01-30 12:52:33 +11:00
Ines Montani	817b0db521	Fix escape sequence	2021-01-30 12:39:58 +11:00
Ines Montani	bbf080dfe5	Merge pull request #6645 from bittlingmayer/patch-3	2021-01-30 01:26:28 +11:00
Adriane Boyd	bced6309e5	Add full exceptions with spaces	2021-01-29 14:27:22 +01:00
Ines Montani	5ed51c9dd2	Merge pull request #6828 from explosion/master-tmp	2021-01-27 23:05:46 +11:00
Adriane Boyd	d17afb4826	Add Spanish rule-based lemmatizer (#6833 ) * Initial Spanish lemmatizer * Handle merged verb+pron(s) multi-word tokens * Use VERB for AUX rule lookup * Add morph to lemma cache key * Fix aux lookups, minor refactoring * Improve verb+pron handling * Move verb+pron handling into its own method * Check for exceptions (primarily for se) * Collect pronouns in the same (not reversed) order * Only add modified possible lemmas	2021-01-27 19:21:35 +08:00
Ines Montani	615dba9d99	Fix tokenizer exceptions	2021-01-27 22:11:42 +11:00
Ines Montani	e3f8be9a94	Update language data	2021-01-27 13:29:22 +11:00
Ines Montani	230e651ad6	Merge branch 'develop' into master-tmp	2021-01-27 13:26:29 +11:00
Adriane Boyd	71a6350744	Implement overwrite param for all custom lemmatizers (#6794 )	2021-01-26 14:53:43 +11:00
muratjumashev	2b19ebad59	Remove Kyrgyz chars fr. char_classes since Tatar ones already cover	2021-01-25 00:46:45 +06:00
muratjumashev	53abf759ad	Fix punctuation	2021-01-24 20:54:22 +06:00
muratjumashev	2a2646362b	Fix language subclass	2021-01-23 22:00:50 +06:00
muratjumashev	fe3b5b8ff5	Add kyrgyz to char_classes	2021-01-23 21:53:41 +06:00
muratjumashev	e30bbf5432	Add examples	2021-01-23 21:49:08 +06:00
muratjumashev	2f385385a9	Remove comment	2021-01-23 21:36:28 +06:00
muratjumashev	d53724ba1d	Add lex_attrs	2021-01-23 21:35:25 +06:00
muratjumashev	4418ec2eee	Add punctuation	2021-01-23 21:31:31 +06:00
muratjumashev	101d265778	Add stopwords	2021-01-23 21:25:28 +06:00
muratjumashev	28d06ab860	Add tokenizer_exceptions	2021-01-22 23:08:41 +06:00
Sofie Van Landeghem	fed8f48965	raise NotImplementedError when noun_chunks iterator is not implemented (#6711 ) * raise NotImplementedError when noun_chunks iterator is not implemented * bring back, fix and document span.noun_chunks * formatting Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2021-01-17 19:56:05 +08:00
Adriane Boyd	185fc62f4d	Remove unused is_base_form for mk lemmatizer (#6743 ) Remove unimplemented/incorrect is_base_form for Macedonian lemmatizer.	2021-01-17 09:41:35 +01:00
Ines Montani	b0b743597c	Tidy up and auto-format	2021-01-15 11:57:36 +11:00
Adriane Boyd	0c936004d1	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-rc3	2021-01-14 11:49:58 +01:00
Adriane Boyd	e649242927	Prevent overlapping noun chunks for Spanish (#6712 ) * Prevent overlapping noun chunks in Spanish noun chunk iterator * Clean up similar code in Danish noun chunk iterator	2021-01-14 17:33:31 +11:00
Adriane Boyd	54e8e3c208	Update model-related dependencies (#6725 ) * Update pymorphy2 error messages for Russian and Ukrainian * Add pymorphy2 to pex * Update spacy-pkuseg version for pex	2021-01-14 17:29:44 +11:00
Alex Combessie	9cc880014c	Remove questionable French stopwords (#6310 ) * Remove questionable French stopwords * Create alexcombessie.md	2021-01-08 11:36:22 +11:00
Cristiana S Parada	7a0222f260	Update stop_words.py in Portuguese (a,o,e) (#6345 ) * Update stop_words.py Added three additional stopwords: "a" and "o" that means "the", and "e" that means "and" * Create cristianasp.md * zero edit to push CI Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-01-08 11:35:38 +11:00
Lorena Ciutacu	f11002f1f1	add new Romanian stopwords (#6621 ) * add contributor agreement * update ro stopwords list * add new stopwords	2021-01-08 11:34:47 +11:00
ophelielacroix	e3222fdec9	Add (noun chunks) syntax iterators for Danish (#6246 ) * add syntax iterators for danish * add test noun chunks for danish syntax iterators * add contributor agreement * update da syntax iterators to remove nested chunks * add tests for da noun chunks * Fix test * add missing import * fix example * Prevent overlapping noun chunks Prevent overlapping noun chunks by tracking the end index of the previous noun chunk span. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-01-07 16:33:00 +11:00
Sofie Van Landeghem	6f7e7d88b9	remove cause without apostrophe from norm exceptions (#6636 )	2021-01-06 12:30:30 +08:00
Ines Montani	991669c934	Tidy up and auto-format	2021-01-05 13:41:53 +11:00
Adam Bittlingmayer	f2fe60bacf	Update tokenizer_exceptions.py See https://github.com/explosion/spaCy/pull/6643	2020-12-29 16:05:11 +04:00
Yosi	cf52510631	Add Amharic አማርኛ Language support (#6583 ) * Add Amharic to space * clean up * Add some PRON_LEMMA * add Tigrinya support * remove text_noun_chunks * Tigrinya Support * added some more details for ti * fix unit test * add amharic char range * changes from review * amharic and tigrinya share same unicode block * get rid of _amharic/_tigrinya in char_classes Co-authored-by: Josiah Solomon <jsolomon@meteorcomm.com>	2020-12-22 16:50:34 +01:00
Ines Montani	1da1568110	Remove tag map	2020-12-09 11:13:49 +11:00
Adriane Boyd	724831b066	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master * Update Macedonian for v3 * Update Turkish for v3	2020-11-25 11:49:34 +01:00
Daniel Vasic	20d72de986	Added Multext-East V5 tagset for Croatian language (#6248 ) * Added Multext-East V5 tagset for Croatian language * Create danielvasic.md * Update danielvasic.md * Update danielvasic.md * Add tag map to CroatianDefaults Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-11-05 12:19:22 +01:00
Robert Šípek	6069efe57d	Add tag map to cs language (#6284 )	2020-11-05 10:13:11 +01:00
Vu Ha	6d465ec52c	add oprd to the list of accepted deps for noun chunking (#6302 ) * add oprd to the list of accepted deps for noun chunking * add SCA	2020-11-05 09:17:35 +01:00
Duygu Altinok	0e55f806dd	Turkish tokenization improvements (#6268 ) * added single and paired orth variants * added token match * added long text tokenization test * inverted init * normalized lemmas to lowercase * more abbrevs * tests for ordinals and abbrevs * separated period abbrevs to another list * fixed typo * added ordinal and abbrev tests * added number tests for dates * minor refinement * added inflected abbrevs regex * added percentage and inflection * cosmetics * added token match * added url inflection tests * excluded url tokens from custom pattern * removed url match import	2020-10-29 09:43:17 +01:00
Adriane Boyd	4299a7f654	Setup / install / quickstart updates * Add `cuda110` to setup.cfg and quickstart dropdown * Switch to `pip` for pip-only packages in conda quickstart instructions * Update zh pkuseg install message with version range and conda * Remove `zh` from `extras_require` because the default doesn't require additional packages	2020-10-23 11:27:54 +02:00
Borijan Georgievski	2311192ba1	Include Macedonian language (#6230 ) * Include Macedonian language * Fix indentation at char_classes.py * Fix indentation at char_classes.py * Add Macedonian tests, update lex_attrs and char_classes * Import unicode literals for python 2	2020-10-15 15:55:01 +02:00
Ines Montani	d165af26be	Auto-format [ci skip]	2020-10-15 10:08:53 +02:00
Ines Montani	5665a21517	Tidy up	2020-10-15 09:30:32 +02:00
Ines Montani	178760855f	Merge branch 'develop' into master-tmp	2020-10-15 09:06:03 +02:00
Ines Montani	7f92a5ee6a	Update spacy/lang/ta/examples.py	2020-10-13 11:03:35 +02:00
Ines Montani	539b0c10da	Tidy up and auto-format	2020-10-10 19:14:48 +02:00
Duygu Altinok	80fb1bffc9	Ordinal numbers for Turkish (#6142 ) * minor ordinal number addition * fixed typo * added corresponding lexical test	2020-10-09 10:13:15 +02:00
Duygu Altinok	2fad279a44	Turkish language syntax iterators (#6191 ) * added tr_vocab to config * basic test * added syntax iterator to Turkish lang class * first version for Turkish syntax iter, without flat * added simple tests with nmod, amod, det * more tests to amod and nmod * separated noun chunks and parser test * rearrangement after nchunk parser separation * added recursive NPs * tests with complicated recursive NPs * tests with conjed NPs * additional tests for conj NP * small modification for shaving off conj from NP * added tests with flat * more tests with flat * added examples with flats conjed * added inner func for flat trick * corrected parse Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-09 10:10:22 +02:00
Baranitharan	d6037c1860	added sentence	2020-10-08 08:22:58 +05:30
Baranitharan	81afe9b19d	Update examples.py	2020-10-08 08:17:25 +05:30
Wannaphong Phatthiyaphaibun	9fc8392b38	Add Thai tag map (LST20 Corpus) (#6163 ) * Add Thai tag map (LST20 Corpus) By @korakot * Update tag_map.py * Update tag_map.py * Update tag_map.py	2020-10-07 11:12:01 +02:00
Duygu Altinok	7e821c2776	Turkish language syntax iterators (#6191 ) * added tr_vocab to config * basic test * added syntax iterator to Turkish lang class * first version for Turkish syntax iter, without flat * added simple tests with nmod, amod, det * more tests to amod and nmod * separated noun chunks and parser test * rearrangement after nchunk parser separation * added recursive NPs * tests with complicated recursive NPs * tests with conjed NPs * additional tests for conj NP * small modification for shaving off conj from NP * added tests with flat * more tests with flat * added examples with flats conjed * added inner func for flat trick * corrected parse Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-07 11:07:52 +02:00
Duygu Altinok	2ce6fc2611	Turkish tag map and morph rules addition (#6141 ) * feat: added turkish tag map * feat: morph rules cconj and sconj * feat: more conjuncts * feat: added popular postpositions * feat: added adverbs * feat: added personal pronouns * feat: added reflexive pronouns * minor: corrected case capital * minor: fixed comma typo * feat: added indef pronouns * feat: added dict iter * fixed comma typo * updated language class with tag map and morph * use default tag map instead * removed tag map	2020-10-07 10:27:36 +02:00
Duygu Altinok	b95a11dd95	Ordinal numbers for Turkish (#6142 ) * minor ordinal number addition * fixed typo * added corresponding lexical test	2020-10-07 10:25:37 +02:00
Rahul Gupta	1a00bff06d	Hindi: Adds tests for lexical attributes (norm and like_num) (#5829 ) * Hindi: Adds tests for lexical attributes (norm and like_num) * Signs and adds the contributor agreement * Add ordinal numbers to be tagged as like_num * Adds alternate pronunciation for 31 and 39	2020-10-07 10:23:32 +02:00
Nuccy90	c809b2c8e7	Update morph_rules.py (#6102 ) * Update morph_rules.py Added "dig" and "dej" ("you" in accusative form) * Create Nuccy90.md * Update Nuccy90.md	2020-10-06 15:14:47 +02:00
Ines Montani	126268ce50	Auto-format [ci skip]	2020-10-05 21:58:18 +02:00
Ines Montani	2d0c0134bc	Adjust message [ci skip]	2020-10-05 21:38:23 +02:00
Ines Montani	6abfc2911d	Merge pull request #6203 from adrianeboyd/feature/zh-spacy-pkuseg	2020-10-05 21:35:57 +02:00
Adriane Boyd	f102ef6b54	Read features.msgpack instead of features.pkl	2020-10-05 17:47:39 +02:00
Adriane Boyd	187234648c	Revert back to "default" as default for pkuseg_user_dict	2020-10-05 16:24:28 +02:00
Adriane Boyd	5d19dfc9d3	Update Chinese tokenizer for spacy-pkuseg fork	2020-10-05 14:21:53 +02:00
Adriane Boyd	b0b93854cb	Update ru/uk lemmatizers for new nlp.initialize	2020-10-05 09:27:16 +02:00
Ines Montani	59deeb7da6	Merge branch 'develop' into master-tmp	2020-10-04 14:52:20 +02:00
Ines Montani	3bc3c05fcc	Tidy up and auto-format	2020-10-03 17:20:18 +02:00
Ines Montani	7c4ab7e82c	Fix Lemmatizer.get_lookups_config	2020-10-03 17:16:10 +02:00
Ines Montani	f0b30aedad	Make lemmatizers use initialize logic (#6182 ) * Make lemmatizer use initialize logic and tidy up * Fix typo * Raise for uninitialized tables	2020-10-02 15:42:36 +02:00
Ines Montani	d48ddd6c9a	Remove default initialize lookups	2020-10-01 21:54:33 +02:00
Ines Montani	381258b75b	Merge pull request #6165 from explosion/feature/update-tokenizers-initialize	2020-10-01 09:49:47 +02:00
Ines Montani	4b6afd3611	Remove English [initialize] default block for now to get tests to pass	2020-09-30 23:49:29 +02:00
Ines Montani	6f29f68f69	Update errors and make Tokenizer.initialize args less strict	2020-09-30 23:48:47 +02:00
Adriane Boyd	6b7bb32834	Refactor Chinese initialization	2020-09-30 11:46:45 +02:00
Ines Montani	34f9c26c62	Add lexeme norm defaults	2020-09-30 10:20:14 +02:00
Ines Montani	fa47f87924	Tidy up and auto-format	2020-09-29 21:39:28 +02:00
Ines Montani	6467a560e3	WIP: Test updating Chinese tokenizer	2020-09-29 21:10:22 +02:00
Ines Montani	4f3102d09c	Auto-format	2020-09-29 21:09:10 +02:00
Ines Montani	c3f8c09d7d	Merge pull request #6154 from adrianeboyd/bugfix/chinese-tokenizer-pickle	2020-09-29 20:54:59 +02:00
Adriane Boyd	013b66de05	Add tokenizer scoring to ja / ko / zh (#6152 )	2020-09-27 22:20:45 +02:00
Adriane Boyd	8393dbedad	Minor fixes * Put `cfg` back in serialization * Add `pickle5` to pytest conf	2020-09-27 15:15:53 +02:00
Adriane Boyd	54fe871935	Fix formatting, refactor pickle5 exceptions	2020-09-27 14:37:28 +02:00
Adriane Boyd	11e195d3ed	Update ChineseTokenizer * Allow `pkuseg_model` to be set to `None` on initialization * Don't save config within tokenizer * Force convert pkuseg_model to use pickle protocol 4 by reencoding with `pickle5` on serialization * Update pkuseg serialization test	2020-09-27 14:00:18 +02:00
Ines Montani	ae51f580c1	Fix handling of score_weights	2020-09-24 10:27:33 +02:00
Muhammad Fahmi Rasyid	7489d02dea	Update Indonesian Example Phrases (#6124 ) * create contributor agreement * Update Indonesian example. (see #1107) Update Indonesian examples with more proper phrases. the current phrases contains sensitive and violent words.	2020-09-23 14:02:26 +02:00
Ines Montani	f976bab710	Remove empty file [ci skip]	2020-09-23 09:30:09 +02:00
Adriane Boyd	9b4979407d	Fix overlapping German noun chunks (#6112 ) Add a similar fix as in #5470 to prevent the German noun chunks iterator from producing overlapping spans.	2020-09-22 21:52:42 +02:00
Adriane Boyd	7e4cd7575c	Refactor Docs.is_ flags (#6044 ) * Refactor Docs.is_ flags * Add derived `Doc.has_annotation` method * `Doc.has_annotation(attr)` returns `True` for partial annotation * `Doc.has_annotation(attr, require_complete=True)` returns `True` for complete annotation * Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced` and `is_nered` * Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The list is the `DocBin` attributes list plus `SPACY` and `LENGTH`. Notes on `Doc.has_annotation`: * `HEAD` is converted to `DEP` because heads don't have an unset state * Accept `IS_SENT_START` as a synonym of `SENT_START` Additional changes: * Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for `DocBin` * In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override `SENT_START` * In `Doc.from_array()` using `attrs` other than `Doc._get_array_attrs()` (i.e., a user's custom list rather than our default internal list) with both `HEAD` and `SENT_START` shows a warning that `HEAD` will override `SENT_START` * `set_children_from_heads` does not require dependency labels to set sentence boundaries and sets `sent_start` for all non-sentence starts to `-1` * Fix call to set_children_form_heads Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-17 00:14:01 +02:00
Adriane Boyd	87c329c711	Set rule-based lemmatizers as default (#6076 ) For languages without provided models and with lemmatizer rules in `spacy-lookups-data`, make the rule-based lemmatizer the default: Bengali, Persian, Norwegian, Swedish	2020-09-16 17:37:29 +02:00
Ines Montani	df0b68f60e	Remove unicode declarations and update language data	2020-09-04 13:19:16 +02:00
Ines Montani	864a697e63	Merge branch 'develop' into master-tmp	2020-09-04 13:15:36 +02:00
holubvl3	0a27fca557	Create examples.py (#5985 ) * Create examples.py * Create tag_map.py * Delete tag_map.py * Update examples.py formatting: add empty line Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2020-09-04 11:00:14 +02:00
Adriane Boyd	b97d98783a	Fix Hungarian % tokenization (#6013 )	2020-09-02 13:06:16 +02:00
Adriane Boyd	7d7b65ffd4	Fix raw strings in URL pattern (#5972 ) Add missing raw string specifiers.	2020-08-26 04:00:49 +02:00
Hiroshi Matsuda	332803eda9	fix ja leading spaces (#5969 ) * change condition for space after * add NAUGHTY_STRINGS test example	2020-08-25 14:16:24 +02:00
Shashank	450720aca2	Added support for Sanskrit language (#5956 ) * Added support for Sanskrit language * Added tests for lexical attribute like_num	2020-08-25 10:56:29 +02:00
Adriane Boyd	abd3f2b65a	Rename Polish lemmatizer method (#5960 ) Rename Polish lemmatizer method to `pos_lookup` to distinguish it from pure token-based lookup methods.	2020-08-25 00:22:27 +02:00
idoshr	b10c7bc56e	Hebrew like num (#5952 ) * Update stop_words.py Hebrew STOP WORDS * Update stop_words.py * contributor * contributor * add some common domain extensions support human number 1K/1M.... * support human number 1K/1M.... * hebrew number tokenize 1K/1M implement in EN * test human tokenize fix * test * heb like num revert human number change * heb like num	2020-08-24 14:30:05 +02:00
holubvl3	a341b4ef09	Adding support for Czech language (#5826 ) * Create lex_attrs.py Hello, I am missing a CZECH language in SpaCy. So I would like to help to push it a little. This file is base on others lex_attrs.py files just with translation to Czech. * Update __init__.py Updated for use with new Czech Lex_attrs file * Update stop_words.py * Create test_text.py Co-authored-by: Vladimír Holubec <vholubec@arcdata.cz>	2020-08-21 16:17:53 +02:00
Ines Montani	3eaeb73342	Tidy up and auto-format	2020-08-09 22:36:23 +02:00
Adriane Boyd	e962784531	Add Lemmatizer and simplify related components (#5848 ) * Add Lemmatizer and simplify related components * Add `Lemmatizer` pipe with `lookup` and `rule` modes using the `Lookups` tables. * Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma) * Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer, or morph rules) * Remove lemmatizer from `Vocab` * Adjust many many tests Differences: * No default lookup lemmas * No special treatment of TAG in `from_array` and similar required * Easier to modify labels in a `Tagger` * No extra strings added from morphology / tag map * Fix test * Initial fix for Lemmatizer config/serialization * Adjust init test to be more generic * Adjust init test to force empty Lookups * Add simple cache to rule-based lemmatizer * Convert language-specific lemmatizers Convert language-specific lemmatizers to component lemmatizers. Remove previous lemmatizer class. * Fix French and Polish lemmatizers * Remove outdated UPOS conversions * Update Russian lemmatizer init in tests * Add minimal init/run tests for custom lemmatizers * Add option to overwrite existing lemmas * Update mode setting, lookup loading, and caching * Make `mode` an immutable property * Only enforce strict `load_lookups` for known supported modes * Move caching into individual `_lemmatize` methods * Implement strict when lang is not found in lookups * Fix tables/lookups in make_lemmatizer * Reallow provided lookups and allow for stricter checks * Add lookups asset to all Lemmatizer pipe tests * Rename lookups in lemmatizer init test * Clean up merge * Refactor lookup table loading * Add helper from `load_lemmatizer_lookups` that loads required and optional lookups tables based on settings provided by a config. Additional slight refactor of lookups: * Add `Lookups.set_table` to set a table from a provided `Table` * Reorder class definitions to be able to specify type as `Table` * Move registry assets into test methods * Refactor lookups tables config Use class methods within `Lemmatizer` to provide the config for particular modes and to load the lookups from a config. * Add pipe and score to lemmatizer * Simplify Tagger.score * Add missing import * Clean up imports and auto-format * Remove unused kwarg * Tidy up and auto-format * Update docstrings for Lemmatizer Update docstrings for Lemmatizer. Additionally modify `is_base_form` API to take `Token` instead of individual features. * Update docstrings * Remove tag map values from Tagger.add_label * Update API docs * Fix relative link in Lemmatizer API docs	2020-08-07 15:27:13 +02:00
Ines Montani	56c17973aa	Use "raise ... from" in custom errors for better tracebacks	2020-08-05 23:53:21 +02:00
Adriane Boyd	cd59979ab4	Fix span boundary handling in Spanish noun_chunks (#5860 )	2020-08-03 13:53:15 +02:00
Rahul Gupta	f76fae0e8d	English: adds ordinal numbers (#5830 )	2020-07-29 20:22:47 +02:00
oculusrepairo	03ab518f28	Update examples.py (#5820 ) * Update examples.py adding factual sentences to the list * Add missing comma separators Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-07-29 10:28:56 +02:00

946 Commits