spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-02-22 22:40:32 +03:00

Author	SHA1	Message	Date
Adriane Boyd	b9c524917a	Cast to uint64 for all array-based doc representations	2022-12-07 09:53:33 +01:00
Adriane Boyd	19f64ee18a	Set version to v2.3.8	2022-10-18 16:06:10 +02:00
Adriane Boyd	cae72e46dd	Set version to v2.3.7 (#8289) * Set version to v2.3.7 * Add download test to CI	2021-06-04 19:33:42 +02:00
Adriane Boyd	22287c89c0	Fix pip args in download CLI (#8287)	2021-06-04 19:02:02 +02:00
Adriane Boyd	2c1de4b9a4	Set version to v2.3.6 (#8117)	2021-05-17 17:55:19 +02:00
Adriane Boyd	5e7e7cda94	Fix range in Span.get_lca_matrix (#8115) Fix the adjusted token index / lca matrix index ranges for `_get_lca_matrix` for spans. * The range for `k` should correspond to the adjusted indices in `lca_matrix` with the `start` indexed at `0`	2021-05-17 16:54:10 +02:00
Pamphile ROY	41ee75ac6d	Remove --no-cache-dir when downloading models When `--no-cache-dir` is present, it prevents caching to properly function. If the user still wants to do this, there is the possibility to pass options with `user_pip_args`. But you should not enforce options like these. In my case this is preventing some docker build (using buildkit caching) to have proper caching of models.	2021-01-29 15:37:44 +01:00
Adriane Boyd	4096a79de7	Add alignment mode error and fix Doc.char_span docs (#6820) * Raise an error on an unrecognized alignment mode rather than defaulting to `strict` * Fix the `Doc.char_span` API doc alignment mode details	2021-01-27 23:40:42 +11:00
muratjumashev	2b19ebad59	Remove Kyrgyz chars fr. char_classes since Tatar ones already cover	2021-01-25 00:46:45 +06:00
muratjumashev	87168eb81f	Add tests	2021-01-24 20:56:16 +06:00
muratjumashev	53abf759ad	Fix punctuation	2021-01-24 20:54:22 +06:00
muratjumashev	2a2646362b	Fix language subclass	2021-01-23 22:00:50 +06:00
muratjumashev	fe3b5b8ff5	Add kyrgyz to char_classes	2021-01-23 21:53:41 +06:00
muratjumashev	e30bbf5432	Add examples	2021-01-23 21:49:08 +06:00
muratjumashev	2f385385a9	Remove comment	2021-01-23 21:36:28 +06:00
muratjumashev	d53724ba1d	Add lex_attrs	2021-01-23 21:35:25 +06:00
muratjumashev	4418ec2eee	Add punctuation	2021-01-23 21:31:31 +06:00
muratjumashev	101d265778	Add stopwords	2021-01-23 21:25:28 +06:00
muratjumashev	28d06ab860	Add tokenizer_exceptions	2021-01-22 23:08:41 +06:00
Sofie Van Landeghem	5ace559201	ensure span.text works for an empty span (#6772)	2021-01-21 23:18:46 +08:00
Sofie Van Landeghem	fdf8c77630	support IS_SENT_START in PhraseMatcher (#6771) * support IS_SENT_START in PhraseMatcher * add unit test and friendlier error * use IDS.get instead	2021-01-21 09:59:17 +01:00
Adriane Boyd	bc7d83d4be	Skip 0-length matches (#6759) Add hack to prevent matcher from returning 0-length matches.	2021-01-19 07:38:11 +08:00
Santiago Castro	28256522c8	Fix `spacy.util.minibatch` when the size iterator is finished (#6745)	2021-01-17 19:48:43 +08:00
Adriane Boyd	e649242927	Prevent overlapping noun chunks for Spanish (#6712) * Prevent overlapping noun chunks in Spanish noun chunk iterator * Clean up similar code in Danish noun chunk iterator	2021-01-14 17:33:31 +11:00
Adriane Boyd	9957ed7897	Override language defaults for null token and URL match (#6705) * Override language defaults for null token and URL match When the serialized `token_match` or `url_match` is `None`, override the language defaults to preserve `None` on deserialization. * Fix fixtures in tests	2021-01-14 17:31:29 +11:00
Alex Combessie	9cc880014c	Remove questionable French stopwords (#6310) * Remove questionable French stopwords * Create alexcombessie.md	2021-01-08 11:36:22 +11:00
Cristiana S Parada	7a0222f260	Update stop_words.py in Portuguese (a,o,e) (#6345 ) * Update stop_words.py Added three aditional stopwords: "a" and "o" that means "the", and "e" that means "and" * Create cristianasp.md * zero edit to push CI Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-01-08 11:35:38 +11:00
Lorena Ciutacu	f11002f1f1	add new Romanian stopwords (#6621) * add contributor agreement * update ro stopwords list * add new stopwords	2021-01-08 11:34:47 +11:00
ophelielacroix	e3222fdec9	Add (noun chunks) syntax iterators for Danish (#6246) * add syntax iterators for danish * add test noun chunks for danish syntax iterators * add contributor agreement * update da syntax iterators to remove nested chunks * add tests for da noun chunks * Fix test * add missing import * fix example * Prevent overlapping noun chunks Prevent overlapping noun chunks by tracking the end index of the previous noun chunk span. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-01-07 16:33:00 +11:00
Sofie Van Landeghem	6f7e7d88b9	remove cause without apostrophe from norm exceptions (#6636)	2021-01-06 12:30:30 +08:00
Yosi	cf52510631	Add Amharic አማርኛ Language support (#6583 ) * Add Amharic to space * clean up * Add some PRON_LEMMA * add Tigrinya support * remove text_noun_chunks * Tigrinya Support * added some more details for ti * fix unit test * add amharic char range * changes from review * amharic and tigrinya share same unicode block * get rid of _amharic/_tigrinya in char_classes Co-authored-by: Josiah Solomon <jsolomon@meteorcomm.com>	2020-12-22 16:50:34 +01:00
Tim Gates	292c1d6a73	docs: fix simple typo, speficied -> specified (#6611) There is a small typo in spacy/cli/info.py. Should read `specified` rather than `speficied`.	2020-12-22 09:14:10 +01:00
Adriane Boyd	7b277661f6	Set version to v2.3.5	2020-12-10 13:32:10 +01:00
Koichi Yasuoka	0afb54ac93	JapaneseTokenizer.pipe added (#6515) * JapaneseTokenizer.pipe added For [spacymoji](https://spacy.io/universe/project/spacymoji) with `Japanese()`. * DummyTokenizer.pipe added instead	2020-12-08 20:02:23 +01:00
Adriane Boyd	6c221d4841	Fix subsequent pipe detection in EntityRuler Fix subsequent pipe detection to detect the position of the current object by comparing the component itself rather than from the factory name.	2020-12-08 10:01:30 +01:00
Adriane Boyd	5ceac425ee	Remove non-working --use-chars from train CLI Remove the non-working `--use-chars` option from the train CLI. The implementation of the option across component types and the CLI settings could be fixed, but the `CharacterEmbed` model does not work on GPU in v2 so it's better to remove it.	2020-12-08 08:30:00 +01:00
Adriane Boyd	e931d3f72b	Move max_length to nlp.make_doc() (#6512 ) Move max_length check to `nlp.make_doc()` so that's it's also checked for `nlp.pipe()`.	2020-12-08 14:24:02 +08:00
Adriane Boyd	53c0fb7431	Only set NORM on Token in retokenizer (#6464 ) * Only set NORM on Token in retokenizer Instead of setting `NORM` on both the token and lexeme, set `NORM` only on the token. The retokenizer tries to set all possible attributes with `Token/Lexeme.set_struct_attr` so that it doesn't have to enumerate which attributes are available for each. `NORM` is the only attribute that's stored on both and for most cases it doesn't make sense to set the global norms based on a individual retokenization. For lexeme-only attributes like `IS_STOP` there's no way to avoid the global side effects, but I think that `NORM` would be better only on the token. * Fix test	2020-11-30 09:35:42 +08:00
Adriane Boyd	03ae77e603	Add SPACY as a Matcher attribute (#6463)	2020-11-30 09:34:50 +08:00
Adriane Boyd	3a5cc5f8b4	Set version to v2.3.4	2020-11-26 08:48:52 +01:00
Adriane Boyd	e0f5646a4a	Restore cleanup_beam method (#6446)	2020-11-25 13:21:48 +01:00
Adriane Boyd	573f5c863f	Fix tag map clobbering in spacy train (#6437) Fix bug from #5768 where the tag map is clobbered if a custom tag map isn't provided.	2020-11-24 13:13:16 +01:00
Adriane Boyd	ce18fc6588	Set version to v2.3.3	2020-11-24 10:03:45 +01:00
Adriane Boyd	cd61d264ef	Set version to v2.3.3.dev0	2020-11-23 13:51:59 +01:00
Sofie Van Landeghem	2af31a8c8d	Bugfix textcat reproducibility on GPU (#6411) * add seed argument to ParametricAttention layer * bump thinc to 7.4.3 * set thinc version range Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-11-23 12:29:35 +01:00
Adriane Boyd	3f61f5eb54	Use int8_t instead of char in Matcher (#6413) * Use signed char instead of char in Matcher Remove unused char* utf8_t typedef * Use int8_t instead of signed char	2020-11-23 10:26:47 +01:00
Adriane Boyd	4284605683	Remove Beam cleanup (#6414) Beam cleanup is handled through the Beam finalization method.	2020-11-23 10:01:46 +01:00
Adriane Boyd	a8c2dad466	Add all vectors to vocab before pruning (#6408) Add all vectors to the vocab before pruning to correct the selection of vectors to prioritize.	2020-11-23 10:00:59 +01:00
Adriane Boyd	320a8b1481	Add ent_id_ to strings serialized with Doc (#6353)	2020-11-10 20:16:07 +08:00
Daniel Vasic	20d72de986	Added Multext-East V5 tagset for Croatian language (#6248) * Added Multext-East V5 tagset for Croatian language * Create danielvasic.md * Update danielvasic.md * Update danielvasic.md * Add tag map to CroatianDefaults Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-11-05 12:19:22 +01:00

6966 Commits