spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-14 06:04:15 +03:00

Author	SHA1	Message	Date
muratjumashev	e30bbf5432	Add examples	2021-01-23 21:49:08 +06:00
muratjumashev	2f385385a9	Remove comment	2021-01-23 21:36:28 +06:00
muratjumashev	d53724ba1d	Add lex_attrs	2021-01-23 21:35:25 +06:00
muratjumashev	4418ec2eee	Add punctuation	2021-01-23 21:31:31 +06:00
muratjumashev	101d265778	Add stopwords	2021-01-23 21:25:28 +06:00
muratjumashev	28d06ab860	Add tokenizer_exceptions	2021-01-22 23:08:41 +06:00
Adriane Boyd	e649242927	Prevent overlapping noun chunks for Spanish (#6712) * Prevent overlapping noun chunks in Spanish noun chunk iterator * Clean up similar code in Danish noun chunk iterator	2021-01-14 17:33:31 +11:00
Alex Combessie	9cc880014c	Remove questionable French stopwords (#6310) * Remove questionable French stopwords * Create alexcombessie.md	2021-01-08 11:36:22 +11:00
Cristiana S Parada	7a0222f260	Update stop_words.py in Portuguese (a,o,e) (#6345) * Update stop_words.py Added three aditional stopwords: "a" and "o" that means "the", and "e" that means "and" * Create cristianasp.md * zero edit to push CI Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-01-08 11:35:38 +11:00
Lorena Ciutacu	f11002f1f1	add new Romanian stopwords (#6621) * add contributor agreement * update ro stopwords list * add new stopwords	2021-01-08 11:34:47 +11:00
ophelielacroix	e3222fdec9	Add (noun chunks) syntax iterators for Danish (#6246) * add syntax iterators for danish * add test noun chunks for danish syntax iterators * add contributor agreement * update da syntax iterators to remove nested chunks * add tests for da noun chunks * Fix test * add missing import * fix example * Prevent overlapping noun chunks Prevent overlapping noun chunks by tracking the end index of the previous noun chunk span. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-01-07 16:33:00 +11:00
Sofie Van Landeghem	6f7e7d88b9	remove cause without apostrophe from norm exceptions (#6636)	2021-01-06 12:30:30 +08:00
Yosi	cf52510631	Add Amharic አማርኛ Language support (#6583) * Add Amharic to space * clean up * Add some PRON_LEMMA * add Tigrinya support * remove text_noun_chunks * Tigrinya Support * added some more details for ti * fix unit test * add amharic char range * changes from review * amharic and tigrinya share same unicode block * get rid of _amharic/_tigrinya in char_classes Co-authored-by: Josiah Solomon <jsolomon@meteorcomm.com>	2020-12-22 16:50:34 +01:00
Daniel Vasic	20d72de986	Added Multext-East V5 tagset for Croatian language (#6248) * Added Multext-East V5 tagset for Croatian language * Create danielvasic.md * Update danielvasic.md * Update danielvasic.md * Add tag map to CroatianDefaults Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-11-05 12:19:22 +01:00
Robert Šípek	6069efe57d	Add tag map to cs language (#6284)	2020-11-05 10:13:11 +01:00
Vu Ha	6d465ec52c	add oprd to the list of accepted deps for noun chunking (#6302) * add oprd to the list of accepted deps for noun chunking * add SCA	2020-11-05 09:17:35 +01:00
Duygu Altinok	0e55f806dd	Turkish tokenization improvements (#6268) * added single and paired orth variants * added token match * added long text tokenization test * inverted init * normalized lemmas to lowercase * more abbrevs * tests for ordinals and abbrevs * separated period abbvrevs to another list * fiex typo * added ordinal and abbrev tests * added number tests for dates * minor refinement * added inflected abbrevs regex * added percentage and inflection * cosmetics * added token match * added url inflection tests * excluded url tokens from custom pattern * removed url match import	2020-10-29 09:43:17 +01:00
Borijan Georgievski	2311192ba1	Include Macedonian language (#6230) * Include Macedonian language * Fix indentation at char_classes.py * Fix indentation at char_classes.py * Add Macedonian tests, update lex_attrs and char_classes * Import unicode literals for python 2	2020-10-15 15:55:01 +02:00
Ines Montani	7f92a5ee6a	Update spacy/lang/ta/examples.py	2020-10-13 11:03:35 +02:00
Baranitharan	d6037c1860	added sentence	2020-10-08 08:22:58 +05:30
Baranitharan	81afe9b19d	Update examples.py	2020-10-08 08:17:25 +05:30
Wannaphong Phatthiyaphaibun	9fc8392b38	Add Thai tag map (LST20 Corpus) (#6163) * Add Thai tag map (LST20 Corpus) By @korakot * Update tag_map.py * Update tag_map.py * Update tag_map.py	2020-10-07 11:12:01 +02:00
Duygu Altinok	7e821c2776	Turkish language syntax iterators (#6191) * added tr_vocab to config * basic test * added syntax iterator to Turkish lang class * first version for Turkish syntax iter, without flat * added simple tests with nmod, amod, det * more tests to amod and nmod * separated noun chunks and parser test * rearrangement after nchunk parser separation * added recursive NPs * tests with complicated recursive NPs * tests with conjed NPs * additional tests for conj NP * small modification for shaving off conj from NP * added tests with flat * more tests with flat * added examples with flats conjed * added inner func for flat trick * corrected parse Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-07 11:07:52 +02:00
Duygu Altinok	2ce6fc2611	Turkish tag map and morph rules addition (#6141) * feat: added turkish tag map * feat: morph rules cconj and sconj * feat: more conjuncts * feat: added popular postpositions * feat: added adverbs * feat: added personal pronouns * feat: added reflexive pronouns * minor: corrected case capital * minor: fixed comma typo * feat: added indef pronouns * feat: added dict iter * fixed comma typo * updated language class with tag map and morph * use default tag map instead * removed tag map	2020-10-07 10:27:36 +02:00
Duygu Altinok	b95a11dd95	Ordinal numbers for Turkish (#6142) * minor ordinal number addition * fixed typo * added corresponding lexical test	2020-10-07 10:25:37 +02:00
Rahul Gupta	1a00bff06d	Hindi: Adds tests for lexical attributes (norm and like_num) (#5829) * Hindi: Adds tests for lexical attributes (norm and like_num) * Signs and sdds the contributor agreement * Add ordinal numbers to be tagged as like_num * Adds alternate pronunciation for 31 and 39	2020-10-07 10:23:32 +02:00
Nuccy90	c809b2c8e7	Update morph_rules.py (#6102) * Update morph_rules.py Added "dig" and "dej" ("you" in accusative form) * Create Nuccy90.md * Update Nuccy90.md	2020-10-06 15:14:47 +02:00
Muhammad Fahmi Rasyid	7489d02dea	Update Indonesian Example Phrases (#6124) * create contributor agreement * Update Indonesian example. (see #1107) Update Indonesian examples with more proper phrases. the current phrases contains sensitive and violent words.	2020-09-23 14:02:26 +02:00
Adriane Boyd	9b4979407d	Fix overlapping German noun chunks (#6112) Add a similar fix as in #5470 to prevent the German noun chunks iterator from producing overlapping spans.	2020-09-22 21:52:42 +02:00
holubvl3	0a27fca557	Create examples.py (#5985) * Create examples.py * Create tag_map.py * Delete tag_map.py * Update examples.py formatting: add empty line Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2020-09-04 11:00:14 +02:00
Adriane Boyd	7d7b65ffd4	Fix raw strings in URL pattern (#5972) Add missing raw string specifiers.	2020-08-26 04:00:49 +02:00
Hiroshi Matsuda	332803eda9	fix ja leading spaces (#5969) * change condition for space after * add NAUGHTY_STRINGS test example	2020-08-25 14:16:24 +02:00
Shashank	450720aca2	Added support for Sanskrit language (#5956) * Added support for Sanskrit language * Added tests for lexical attribute like_num	2020-08-25 10:56:29 +02:00
idoshr	b10c7bc56e	Hebrew like num (#5952) * Update stop_words.py Hebrew STOP WORDS * Update stop_words.py * contributor * contributor * add some common domain extentions support human number 1K/1M.... * support human number 1K/1M.... * hebrew number tokenize 1K/1M implement in EN * test human tokenize fix * test * heb like num revert human number change * heb like num	2020-08-24 14:30:05 +02:00
holubvl3	a341b4ef09	Adding support for Czech language (#5826) * Create lex_attrs.py Hello, I am missing a CZECH language in SpaCy. So I would like to help to push it a little. This file is base on others lex_attrs.py files just with translation to Czech. * Update __init__.py Updated for use with new Czech Lex_attrs file * Update stop_words.py * Create test_text.py Co-authored-by: Vladimír Holubec <vholubec@arcdata.cz>	2020-08-21 16:17:53 +02:00
Adriane Boyd	cd59979ab4	Fix span boundary handling in Spanish noun_chunks (#5860)	2020-08-03 13:53:15 +02:00
Rahul Gupta	f76fae0e8d	English: adds ordinal numbers (#5830)	2020-07-29 20:22:47 +02:00
oculusrepairo	03ab518f28	Update examples.py (#5820) * Update examples.py adding factual sentences to the list * Add missing comma separators Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-07-29 10:28:56 +02:00
Joshua Olson	6d4d5c074c	Mark Japanese documents as tagged. (#5803) Mark the document as tagged before returning it to the user from the JapaneseTokenizer. Fixes #5802	2020-07-23 08:57:01 +02:00
Adriane Boyd	bf24f7f672	Update invalid tag maps (#5796) * Remove copy of (old?) PTB tag map for: bn, eu * Remove unsupported features from: hy, pl, ro, ru	2020-07-22 16:02:51 +02:00
Adriane Boyd	cd5af72c9a	Update pkuseg version (#5774) * Update pkuseg version in Chinese tokenizer warnings * Update pkuseg version in `Makefile` * Remove warning about python3.8 wheels in docs	2020-07-19 11:09:49 +02:00
Adriane Boyd	0a62098c5f	Fix lemmatizer is_base_form for python2.7 (#5734) * Fix lemmatizer init args for python2.7 * Move English is_base_form to a class method * Skip test pickling PhraseMatcher for python2	2020-07-09 22:11:24 +02:00
Adriane Boyd	923affd091	Remove is_base_form from French lemmatizer (#5733) Remove English-specific is_base_form from French lemmatizer.	2020-07-09 22:11:13 +02:00
Mike Izbicki	7a2ca00794	fix bug in Korean language, resulting in 100x speedup by reducing overhead of mecab (#5701) * speed up Korean nlp 100x by stopping mecab from reloading on each doc * add contributor agreement * rename variables to improve code readability	2020-07-06 17:03:33 +02:00
Matthew Honnibal	2d715451a2	Revert "Convert custom user_data to token extension format for Japanese tokenizer (#5652)" (#5665) This reverts commit `1dd38191ec`.	2020-06-29 14:34:15 +02:00
Adriane Boyd	1dd38191ec	Convert custom user_data to token extension format for Japanese tokenizer (#5652) * Convert custom user_data to token extension format Convert the user_data values so that they can be loaded as custom token extensions for `inflection`, `reading_form`, `sub_tokens`, and `lemma`. * Reset Underscore state in ja tokenizer tests	2020-06-29 14:20:26 +02:00
Adriane Boyd	167df42cb6	Move lemmatizer is_base_form to language settings (#5663) Move `Lemmatizer.is_base_form` to the language settings so that each language can provide a language-specific method as `LanguageDefaults.is_base_form`. The existing English-specific `Lemmatizer.is_base_form` is moved to `EnglishDefaults`.	2020-06-29 14:16:57 +02:00
Hiroshi Matsuda	150a39ccca	Japanese model: add user_dict entries and small refactor (#5573) * user_dict fields: adding inflections, reading_forms, sub_tokens deleting: unidic_tags improve code readability around the token alignment procedure * add test cases, replace fugashi with sudachipy in conftest * move bunsetu.py to spaCy Universe as a pipeline component BunsetuRecognizer * tag is space -> both surface and tag are spaces * consider len(text)==0	2020-06-22 14:32:25 +02:00
Rameshh	c34420794a	Add Nepali Language (#5622) * added support for nepali lang * added examples and test files * added spacy contributor agreement	2020-06-22 10:25:46 +02:00
Karen Hambardzumyan	66a4834e56	Some changes for Armenian (#5616) * Fixing numericals * We need a Armenian question sign to make the sentence a question	2020-06-22 08:50:34 +02:00

703 Commits