spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-12-15 22:24:31 +03:00

Author	SHA1	Message	Date
svlandeg	91e72c031e	reformatting	2021-01-30 17:29:33 +01:00
svlandeg	a8d84188f0	add stop words Co-authored-by: tewodrosm <tedmaam2006@gmail.com>	2021-01-30 17:26:49 +01:00
Ines Montani	e6accb3a9e	Tidy up and auto-format	2021-01-30 12:52:33 +11:00
Ines Montani	817b0db521	Fix escape sequence	2021-01-30 12:39:58 +11:00
Ines Montani	bbf080dfe5	Merge pull request #6645 from bittlingmayer/patch-3	2021-01-30 01:26:28 +11:00
Adriane Boyd	bced6309e5	Add full exceptions with spaces	2021-01-29 14:27:22 +01:00
Ines Montani	5ed51c9dd2	Merge pull request #6828 from explosion/master-tmp	2021-01-27 23:05:46 +11:00
Adriane Boyd	d17afb4826	Add Spanish rule-based lemmatizer (#6833) * Initial Spanish lemmatizer * Handle merged verb+pron(s) multi-word tokens * Use VERB for AUX rule lookup * Add morph to lemma cache key * Fix aux lookups, minor refactoring * Improve verb+pron handling * Move verb+pron handling into its own method * Check for exceptions (primarily for se) * Collect pronouns in the same (not reversed) order * Only add modified possible lemmas	2021-01-27 19:21:35 +08:00
Ines Montani	615dba9d99	Fix tokenizer exceptions	2021-01-27 22:11:42 +11:00
Ines Montani	e3f8be9a94	Update language data	2021-01-27 13:29:22 +11:00
Ines Montani	230e651ad6	Merge branch 'develop' into master-tmp	2021-01-27 13:26:29 +11:00
Adriane Boyd	71a6350744	Implement overwrite param for all custom lemmatizers (#6794)	2021-01-26 14:53:43 +11:00
muratjumashev	2b19ebad59	Remove Kyrgyz chars fr. char_classes since Tatar ones already cover	2021-01-25 00:46:45 +06:00
muratjumashev	53abf759ad	Fix punctuation	2021-01-24 20:54:22 +06:00
muratjumashev	2a2646362b	Fix language subclass	2021-01-23 22:00:50 +06:00
muratjumashev	fe3b5b8ff5	Add kyrgyz to char_classes	2021-01-23 21:53:41 +06:00
muratjumashev	e30bbf5432	Add examples	2021-01-23 21:49:08 +06:00
muratjumashev	2f385385a9	Remove comment	2021-01-23 21:36:28 +06:00
muratjumashev	d53724ba1d	Add lex_attrs	2021-01-23 21:35:25 +06:00
muratjumashev	4418ec2eee	Add punctuation	2021-01-23 21:31:31 +06:00
muratjumashev	101d265778	Add stopwords	2021-01-23 21:25:28 +06:00
muratjumashev	28d06ab860	Add tokenizer_exceptions	2021-01-22 23:08:41 +06:00
Sofie Van Landeghem	fed8f48965	raise NotImplementedError when noun_chunks iterator is not implemented (#6711) * raise NotImplementedError when noun_chunks iterator is not implemented * bring back, fix and document span.noun_chunks * formatting Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2021-01-17 19:56:05 +08:00
Adriane Boyd	185fc62f4d	Remove unused is_base_form for mk lemmatizer (#6743) Remove unimplemented/incorrect is_base_form for Macedonian lemmatizer.	2021-01-17 09:41:35 +01:00
Ines Montani	b0b743597c	Tidy up and auto-format	2021-01-15 11:57:36 +11:00
Adriane Boyd	0c936004d1	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-rc3	2021-01-14 11:49:58 +01:00
Adriane Boyd	e649242927	Prevent overlapping noun chunks for Spanish (#6712) * Prevent overlapping noun chunks in Spanish noun chunk iterator * Clean up similar code in Danish noun chunk iterator	2021-01-14 17:33:31 +11:00
Adriane Boyd	54e8e3c208	Update model-related dependencies (#6725) * Update pymorphy2 error messages for Russian and Ukrainian * Add pymorphy2 to pex * Update spacy-pkuseg version for pex	2021-01-14 17:29:44 +11:00
Alex Combessie	9cc880014c	Remove questionable French stopwords (#6310) * Remove questionable French stopwords * Create alexcombessie.md	2021-01-08 11:36:22 +11:00
Cristiana S Parada	7a0222f260	Update stop_words.py in Portuguese (a,o,e) (#6345) * Update stop_words.py Added three aditional stopwords: "a" and "o" that means "the", and "e" that means "and" * Create cristianasp.md * zero edit to push CI Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-01-08 11:35:38 +11:00
Lorena Ciutacu	f11002f1f1	add new Romanian stopwords (#6621) * add contributor agreement * update ro stopwords list * add new stopwords	2021-01-08 11:34:47 +11:00
ophelielacroix	e3222fdec9	Add (noun chunks) syntax iterators for Danish (#6246) * add syntax iterators for danish * add test noun chunks for danish syntax iterators * add contributor agreement * update da syntax iterators to remove nested chunks * add tests for da noun chunks * Fix test * add missing import * fix example * Prevent overlapping noun chunks Prevent overlapping noun chunks by tracking the end index of the previous noun chunk span. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-01-07 16:33:00 +11:00
Sofie Van Landeghem	6f7e7d88b9	remove cause without apostrophe from norm exceptions (#6636)	2021-01-06 12:30:30 +08:00
Ines Montani	991669c934	Tidy up and auto-format	2021-01-05 13:41:53 +11:00
Adam Bittlingmayer	f2fe60bacf	Update tokenizer_exceptions.py See https://github.com/explosion/spaCy/pull/6643	2020-12-29 16:05:11 +04:00
Yosi	cf52510631	Add Amharic አማርኛ Language support (#6583) * Add Amharic to space * clean up * Add some PRON_LEMMA * add Tigrinya support * remove text_noun_chunks * Tigrinya Support * added some more details for ti * fix unit test * add amharic char range * changes from review * amharic and tigrinya share same unicode block * get rid of _amharic/_tigrinya in char_classes Co-authored-by: Josiah Solomon <jsolomon@meteorcomm.com>	2020-12-22 16:50:34 +01:00
Ines Montani	1da1568110	Remove tag map	2020-12-09 11:13:49 +11:00
Adriane Boyd	724831b066	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master * Update Macedonian for v3 * Update Turkish for v3	2020-11-25 11:49:34 +01:00
Daniel Vasic	20d72de986	Added Multext-East V5 tagset for Croatian language (#6248) * Added Multext-East V5 tagset for Croatian language * Create danielvasic.md * Update danielvasic.md * Update danielvasic.md * Add tag map to CroatianDefaults Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-11-05 12:19:22 +01:00
Robert Šípek	6069efe57d	Add tag map to cs language (#6284)	2020-11-05 10:13:11 +01:00
Vu Ha	6d465ec52c	add oprd to the list of accepted deps for noun chunking (#6302) * add oprd to the list of accepted deps for noun chunking * add SCA	2020-11-05 09:17:35 +01:00
Duygu Altinok	0e55f806dd	Turkish tokenization improvements (#6268) * added single and paired orth variants * added token match * added long text tokenization test * inverted init * normalized lemmas to lowercase * more abbrevs * tests for ordinals and abbrevs * separated period abbvrevs to another list * fiex typo * added ordinal and abbrev tests * added number tests for dates * minor refinement * added inflected abbrevs regex * added percentage and inflection * cosmetics * added token match * added url inflection tests * excluded url tokens from custom pattern * removed url match import	2020-10-29 09:43:17 +01:00
Adriane Boyd	4299a7f654	Setup / install / quickstart updates * Add `cuda110` to setup.cfg and quickstart dropdown * Switch to `pip` for pip-only packages in conda quickstart instructions * Update zh pkuseg install message with version range and conda * Remove `zh` from `extras_require` because the default doesn't require additional packages	2020-10-23 11:27:54 +02:00
Borijan Georgievski	2311192ba1	Include Macedonian language (#6230) * Include Macedonian language * Fix indentation at char_classes.py * Fix indentation at char_classes.py * Add Macedonian tests, update lex_attrs and char_classes * Import unicode literals for python 2	2020-10-15 15:55:01 +02:00
Ines Montani	d165af26be	Auto-format [ci skip]	2020-10-15 10:08:53 +02:00
Ines Montani	5665a21517	Tidy up	2020-10-15 09:30:32 +02:00
Ines Montani	178760855f	Merge branch 'develop' into master-tmp	2020-10-15 09:06:03 +02:00
Ines Montani	7f92a5ee6a	Update spacy/lang/ta/examples.py	2020-10-13 11:03:35 +02:00
Ines Montani	539b0c10da	Tidy up and auto-format	2020-10-10 19:14:48 +02:00
Duygu Altinok	80fb1bffc9	Ordinal numbers for Turkish (#6142) * minor ordinal number addition * fixed typo * added corresponding lexical test	2020-10-09 10:13:15 +02:00


802 Commits
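
Commits e3222fdec9 and e649242927 describe preventing overlapping noun chunks "by tracking the end index of the previous noun chunk span". The following is a minimal sketch of that idea in plain Python, not the code from those commits: the `non_overlapping` helper, the `Span` alias, and the use of half-open `(start, end)` token offsets are all assumptions made for illustration, and candidate spans are assumed to arrive in document order.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical representation: (start, end) token offsets, end exclusive.
Span = Tuple[int, int]


def non_overlapping(candidates: Iterable[Span]) -> Iterator[Span]:
    """Yield candidate spans in order, skipping any that overlap the last one yielded."""
    prev_end = 0  # exclusive end offset of the most recently yielded span
    for start, end in candidates:
        if start < prev_end:
            # This candidate begins inside the previous chunk, so drop it.
            continue
        prev_end = end
        yield start, end


# Illustrative input: the second and fourth candidates overlap their predecessors.
chunks = list(non_overlapping([(0, 3), (2, 5), (5, 7), (6, 8)]))
print(chunks)  # [(0, 3), (5, 7)]
```

Keeping only a single `prev_end` value is enough here because the candidates are visited left to right; an unsorted input would need sorting first.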