spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-03-06 04:41:32 +03:00

Author	SHA1	Message	Date
Hiroshi Matsuda	332803eda9	fix ja leading spaces (#5969 ) * change condition for space after * add NAUGHTY_STRINGS test example	2020-08-25 14:16:24 +02:00
Shashank	450720aca2	Added support for Sanskrit language (#5956 ) * Added support for Sanskrit language * Added tests for lexical attribute like_num	2020-08-25 10:56:29 +02:00
idoshr	b10c7bc56e	Hebrew like num (#5952 ) * Update stop_words.py Hebrew STOP WORDS * Update stop_words.py * contributor * contributor * add some common domain extensions support human number 1K/1M.... * support human number 1K/1M.... * hebrew number tokenize 1K/1M implement in EN * test human tokenize fix * test * heb like num revert human number change * heb like num	2020-08-24 14:30:05 +02:00
Sofie Van Landeghem	56eabcb2f2	Adding num_like test for Czech (#5946 ) * Create lex_attrs.py Hello, I am missing a CZECH language in SpaCy. So I would like to help to push it a little. This file is base on others lex_attrs.py files just with translation to Czech. * Update __init__.py Updated for use with new Czech Lex_attrs file * Update stop_words.py * Create test_text.py * add like_num testing for czech Co-authored-by: holubvl3 <47881982+holubvl3@users.noreply.github.com> Co-authored-by: holubvl3 <vilemrousi@gmail.com> Co-authored-by: Vladimír Holubec <vholubec@arcdata.cz>	2020-08-21 17:06:33 +02:00
holubvl3	a341b4ef09	Adding support for Czech language (#5826 ) * Create lex_attrs.py Hello, I am missing a CZECH language in SpaCy. So I would like to help to push it a little. This file is base on others lex_attrs.py files just with translation to Czech. * Update __init__.py Updated for use with new Czech Lex_attrs file * Update stop_words.py * Create test_text.py Co-authored-by: Vladimír Holubec <vholubec@arcdata.cz>	2020-08-21 16:17:53 +02:00
Ines Montani	99d2a25687	Make sure sys.argv exists (#5943 ) * Make sure sys.argv exists (resolves #5610) * Fix typo	2020-08-20 16:30:11 +02:00
Sofie Van Landeghem	071c09ff35	add coding (#5942 )	2020-08-20 11:08:38 +02:00
Attila Szász	669dc70822	Create tilusnet.md (#5914 )	2020-08-12 22:46:08 +02:00
Adam Bittlingmayer	7b33b2854f	Add Armenian sentence-final verchaket, Greek question mark and Arabic question mark to default punct (#5910 ) * Add Armenian sentence-final verchaket * Add Greek and Arabic question marks, and contributor agreement * Check box	2020-08-12 15:36:14 +02:00
graue70	49e690bde1	Fix typos in comments (#5904 ) * Fix typo in comment * Fix typo * Add spaCy Contributor Agreement	2020-08-12 15:35:25 +02:00
Adriane Boyd	4193402c47	Add warning when Matcher subpattern is discarded (#5873 ) * Add a warning when a subpattern is not processed and discarded * Normalize subpattern attribute/operator keys to upper case like top-level attributes	2020-08-05 14:56:14 +02:00
Bram Vanroy	9e45d064bb	Update universe details spacy_conll (#5871 )	2020-08-05 14:34:12 +02:00
Adriane Boyd	c62fd878a3	Allow Doc.char_span to snap to token boundaries (#5849 ) * Allow Doc.char_span to snap to token boundaries Add a `mode` option to allow `Doc.char_span` to snap to token boundaries. The `mode` options: * `strict`: character offsets must match token boundaries (default, same as before) * `inside`: all tokens completely within the character span * `outside`: all tokens at least partially covered by the character span Add a new helper function `token_by_char` that returns the token corresponding to a character position in the text. Update `token_by_start` and `token_by_end` to use `token_by_char` for more efficient searching. * Remove unused import * Rename mode to alignment_mode Rename `mode` to `alignment_mode` with the options `strict`/`contract`/`expand`. Any unrecognized modes are silently converted to `strict`.	2020-08-04 13:36:32 +02:00
Adriane Boyd	b841248589	Add Span index boundary checks (#5861 ) * Add Span index boundary checks * Return Span-specific IndexError in all cases * Simplify and fix if/else	2020-08-04 13:35:25 +02:00
Adriane Boyd	cd59979ab4	Fix span boundary handling in Spanish noun_chunks (#5860 )	2020-08-03 13:53:15 +02:00
Adriane Boyd	ac14ce7c30	Prefer earlier spans in EntityRuler (#5843 ) Similar to #4414, update the sorting in EntityRuler to prefer the first span in overlapping spans.	2020-07-31 16:09:32 +02:00
holubvl3	d16c0f2c3a	Create holubvl3 (#5845 ) * Create holubvl3 * Rename holubvl3 to holubvl3.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2020-07-30 17:40:31 +02:00
Rahul Gupta	f76fae0e8d	English: adds ordinal numbers (#5830 )	2020-07-29 20:22:47 +02:00
Gustavo Zadrozny Leyendecker	90b958fd01	Fix on EntityRendered to support break lines (after last entity) (closes #5838 )	2020-07-29 18:48:39 +02:00
oculusrepairo	03ab518f28	Update examples.py (#5820 ) * Update examples.py adding factual sentences to the list * Add missing comma separators Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-07-29 10:28:56 +02:00
graue70	b97dbab998	Fix typo in unit tests (#5823 )	2020-07-27 20:18:48 +02:00
Adriane Boyd	2880d8a555	Normalize spelling for spaCy (#5822 )	2020-07-27 10:09:33 +02:00
Martino Mensio	2f6b8132ef	Sentence transformers added to spaCy universe (#5814 ) * fix details for spacy-universal-sentence-encoder * added sentence-transformers	2020-07-27 09:44:33 +02:00
Nipun Sadvilkar	a66ad89fcb	✏️ typo in pysbd code example (#5821 )	2020-07-27 09:43:39 +02:00
Li Zhe	a69eb445dc	fix the wrong hash url in adding-languages.md file (#5810 ) * fix the wrong hash url in adding-languages.md file change the #101 url hash path to #language-data * filled in the spaCy Contributor Agreement filled in the spaCy Contributor Agreement	2020-07-25 13:13:38 +02:00
Adriane Boyd	19dc42776a	Remove hard-coded GPU ID from pretrain (#5808 )	2020-07-24 09:26:26 +02:00
Joshua Olson	6d4d5c074c	Mark Japanese documents as tagged. (#5803 ) Mark the document as tagged before returning it to the user from the JapaneseTokenizer. Fixes #5802	2020-07-23 08:57:01 +02:00
Adriane Boyd	038ff1a811	Improve warnings around normalization tables (#5794 ) Provide more customized normalization table warnings when training a new model. Only suggest installing `spacy-lookups-data` if it's not already installed and it includes a table for this language (currently checked in a hard-coded list).	2020-07-22 16:04:58 +02:00
Adriane Boyd	bf24f7f672	Update invalid tag maps (#5796 ) * Remove copy of (old?) PTB tag map for: bn, eu * Remove unsupported features from: hy, pl, ro, ru	2020-07-22 16:02:51 +02:00
Alec Chapman	a8978ca285	Add VA COVID-19 NLP project to spaCy Universe (#5777 ) * Update universe.json Add cov-bsv to "resources" * Update universe.json * add contributor agreement	2020-07-19 13:35:31 +02:00
Adriane Boyd	597bcc629e	Improve tag map initialization and updating (#5768 ) * Improve tag map initialization and updating Generalize tag map initialization and updating so that a provided tag map can be loaded correctly in the CLI. * normalize provided tag map as necessary * use the same method for initializing and overwriting the tag map * Reinitialize cache after loading new tag map Reinitialize the cache with the right size after loading a new tag map.	2020-07-19 11:13:39 +02:00
Adriane Boyd	7e14272096	Lower upper pin for cupy to 8.0.0 (#5773 )	2020-07-19 11:10:11 +02:00
Adriane Boyd	cd5af72c9a	Update pkuseg version (#5774 ) * Update pkuseg version in Chinese tokenizer warnings * Update pkuseg version in `Makefile` * Remove warning about python3.8 wheels in docs	2020-07-19 11:09:49 +02:00
Ines Montani	6f4e4aceb3	Add Plausible [ci skip]	2020-07-18 23:50:29 +02:00
Adriane Boyd	5228920e2f	Clarify warning W030 for misaligned BILUO tags (#5761 )	2020-07-14 14:09:48 +02:00
Adriane Boyd	7ea2cc7650	Set version to 2.3.2 (#5756 )	2020-07-13 14:55:56 +02:00
Mark Neumann	27a1cd3c63	fix meta serialization in train (#5751 ) Co-authored-by: Mark Neumann <markng@allenai.org>	2020-07-12 22:06:46 +02:00
Adriane Boyd	0a62098c5f	Fix lemmatizer is_base_form for python2.7 (#5734 ) * Fix lemmatizer init args for python2.7 * Move English is_base_form to a class method * Skip test pickling PhraseMatcher for python2	2020-07-09 22:11:24 +02:00
Adriane Boyd	923affd091	Remove is_base_form from French lemmatizer (#5733 ) Remove English-specific is_base_form from French lemmatizer.	2020-07-09 22:11:13 +02:00
Ines Montani	3d83721551	Merge pull request #5723 from gandersen101/fix-spaczz-universe-typo	2020-07-08 11:35:40 +02:00
gandersen101	893133873d	Fix quote issue in spaczz universe.json	2020-07-07 19:16:28 -05:00
Ines Montani	109849bd31	Fix and update universe.json [ci skip]	2020-07-07 21:12:28 +02:00
gandersen101	9097549227	Adding spaczz package to universe.json (#5717 ) * Adding spaczz package to universe.json * Adding contributor agreement.	2020-07-07 20:55:24 +02:00
Jonathan Besomi	546f3d10d4	Add texthero to universe.json (#5716 ) * Add texthero to universe.json * Add spaCy contributor Agreement	2020-07-07 20:54:22 +02:00
Mike Izbicki	7a2ca00794	fix bug in Korean language, resulting in 100x speedup by reducing overhead of mecab (#5701 ) * speed up Korean nlp 100x by stopping mecab from reloading on each doc * add contributor agreement * rename variables to improve code readability	2020-07-06 17:03:33 +02:00
graue70	9860b8399e	Fix typo in test function docstring (#5696 )	2020-07-05 15:49:06 +02:00
Matthew Honnibal	3e78e82a83	Experimental character-based pretraining (#5700 ) * Use cosine loss in Cloze multitask * Fix char_embed for gpu * Call resume_training for base model in train CLI * Fix bilstm_depth default in pretrain command * Implement character-based pretraining objective * Use chars loss in ClozeMultitask * Add method to decode predicted characters * Fix number characters * Rescale gradients for mlm * Fix char embed+vectors in ml * Fix pipes * Fix pretrain args * Move get_characters_loss * Fix import * Fix import * Mention characters loss option in pretrain * Remove broken 'self attention' option in pretrain * Revert "Remove broken 'self attention' option in pretrain" This reverts commit `56b820f6af`. * Document 'characters' objective of pretrain	2020-07-05 15:48:39 +02:00
Adriane Boyd	86d13a9fb8	Set version to 2.3.1 (#5705 )	2020-07-03 13:38:41 +02:00
Matthias Hertel	2fb9bd795d	Fixed vocabulary in the entity linker training example (#5676 ) * entity linker training example: model loading changed according to issue 5668 (https://github.com/explosion/spaCy/issues/5668) + vocab_path is a required argument * contributor agreement	2020-07-03 10:24:02 +02:00
Adriane Boyd	a77c4c3465	Add strings and ENT_KB_ID to Doc serialization (#5691 ) * Add strings for all writeable Token attributes to `Doc.to/from_bytes()`. * Add ENT_KB_ID to default attributes.	2020-07-02 17:11:57 +02:00
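
The `alignment_mode` options introduced for `Doc.char_span` in commit c62fd878a3 (#5849) can be illustrated with a small standalone sketch. This is not spaCy's implementation: the function name `snap_char_span` and the plain list of `(start_char, end_char)` token offsets are assumptions made for illustration. It only mirrors the behaviour the commit message describes: `strict` requires the offsets to match token boundaries, `contract` keeps the tokens completely inside the character span, `expand` keeps the tokens at least partially covered, and unrecognized modes fall back to `strict`.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for a tokenized Doc: one (start_char, end_char)
# pair per token, with end_char exclusive.
TokenOffsets = List[Tuple[int, int]]


def snap_char_span(
    tokens: TokenOffsets, start: int, end: int, alignment_mode: str = "strict"
) -> Optional[Tuple[int, int]]:
    """Map a character span to a (first_token, last_token_exclusive) pair.

    Returns None when no token span satisfies the chosen mode.
    """
    # Per the commit message, unrecognized modes are treated as "strict".
    if alignment_mode not in ("strict", "contract", "expand"):
        alignment_mode = "strict"

    if alignment_mode == "strict":
        # Both offsets must coincide exactly with token boundaries.
        starts = {s: i for i, (s, _) in enumerate(tokens)}
        ends = {e: i for i, (_, e) in enumerate(tokens)}
        if start in starts and end in ends and starts[start] <= ends[end]:
            return starts[start], ends[end] + 1
        return None

    if alignment_mode == "contract":
        # Keep only tokens that lie completely within [start, end).
        hits = [i for i, (s, e) in enumerate(tokens) if s >= start and e <= end]
    else:
        # "expand": keep every token that overlaps [start, end) at all.
        hits = [i for i, (s, e) in enumerate(tokens) if s < end and e > start]

    if not hits:
        return None
    return hits[0], hits[-1] + 1


# Token offsets for the text "Give it back".
tokens = [(0, 4), (5, 7), (8, 12)]

# Characters 2..7 cover "ve it", which starts in the middle of "Give".
strict_result = snap_char_span(tokens, 2, 7, "strict")
contract_result = snap_char_span(tokens, 2, 7, "contract")
expand_result = snap_char_span(tokens, 2, 7, "expand")
```

With these offsets the `strict` call finds no span, `contract` keeps only "it", and `expand` covers "Give it". In spaCy itself, going by the commit message, the corresponding choice is made through the `alignment_mode` argument of `Doc.char_span`.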


11599 Commits