spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-03-04 20:01:28 +03:00

Author	SHA1	Message	Date
Preston Badeer	b216ff43c9	Update vectors-similarity.md (#4889 ) These links are broken on the website, due to quotes around the URLs.	2020-01-08 16:49:40 +01:00
adrianeboyd	aef83e8070	Mark most Hungarian tokenizer test cases as slow (#4883 ) * Mark most Hungarian tokenizer test cases as slow Mark most Hungarian tokenizer test cases as slow to reduce the runtime of the test suite in ordinary usage: * for normal tests: run default tests plus 10% of the detailed tests * for slow tests: run all tests * Rework to mark individual tests as slow	2020-01-08 12:34:06 +01:00
Sofie Van Landeghem	7b96a5e10f	Reduce mem usage in training Entity Linker (#4811 ) * move nlp processing for el pipe to batch training instead of preprocessing * adding dev eval back in, and limit in articles instead of entities * use pipe whenever possible * few more small doc changes * access dev data through generator * tqdm description * small fixes * update documentation	2020-01-06 14:59:50 +01:00
Sofie Van Landeghem	6e9b61b49d	add warning in debug_data for punctuation in entities (#4853 )	2020-01-06 14:59:28 +01:00
adrianeboyd	d652ff215d	Add trailing whitespace to multiline test text (#4877 )	2020-01-06 14:58:59 +01:00
adrianeboyd	de69bc6509	Fix and improve URL pattern (#4882 ) * match domains longer than `hostname.domain.tld` like `www.foo.co.uk` * expand allowed characters in domain names while only matching lowercase TLDs so that "this.That" isn't matched as a URL and can be split on the period as an infix (relevant for at least English, German, and Tatar)	2020-01-06 14:58:30 +01:00
Sofie Van Landeghem	a1b22e90cd	serialize ENT_ID (#4852 ) * expand serialization test for custom token attribute * add failing test for issue 4849 * define ENT_ID as attr and use in doc serialization * fix few typos	2020-01-06 14:57:34 +01:00
Geoffrey Gordon Ashbrook	53929138d7	remove extra word typo (#4875 ) "let you find you"	2020-01-06 12:37:42 +01:00
Ines Montani	400257a802	Update index.md [ci skip]	2020-01-04 01:52:18 +01:00
Al Johri	1aa2d4dac9	stop rendering mathjax by default in displacy (#4840 ) * stop rendering mathjax by default in displacy * Replace f-string and add comment Co-authored-by: Ines Montani <ines@ines.io>	2020-01-01 13:15:05 +01:00
Anastasiia Iurshina	db9257559c	Adds script shebang (#4846 )	2019-12-29 14:25:05 +01:00
Anastasiia Iurshina	1830a12578	Fixes typos (#4843 ) * Fixes typos * Fixes typo * Contributor agreement	2019-12-29 14:24:13 +01:00
Ivan Echevarria	ef13e0c038	Add n_process to Language.pipe documentation (#4842 ) [ci skip] * Add n_process to documentation * Auto-format and add default [ci skip] Co-authored-by: Ines Montani <ines@ines.io>	2019-12-29 14:23:33 +01:00
Al Johri	fd4a7bd2b7	sign contributor agreement for AlJohri (#4839 ) [ci skip]	2019-12-29 14:17:28 +01:00
Ines Montani	3431ac42de	Fix typo	2019-12-21 21:17:45 +01:00
Ines Montani	7c69d30de5	Tidy up and expect warning	2019-12-21 21:14:52 +01:00
Sofie Van Landeghem	732142bf28	facilitate larger training files (#4827 ) * add warning for large file and change start var to long * type for file_length	2019-12-21 21:12:19 +01:00
Ines Montani	cb4145adc7	Tidy up and auto-format	2019-12-21 19:04:17 +01:00
Olamilekan Wahab	a741de7cf6	Adding support for Yoruba Language (#4614 ) * Adding Support for Yoruba * test text * Updated test string. * Fixing encoding declaration. * Adding encoding to stop_words.py * Added contributor agreement and removed iranlowo. * Added removed test files and removed iranlowo to keep project bare. * Returned CONTRIBUTING.md to default state. * Added deleted conftest entries * Tidy up and auto-format * Revert CONTRIBUTING.md Co-authored-by: Ines Montani <ines@ines.io>	2019-12-21 14:11:50 +01:00
Ines Montani	1b838d1313	Divide models into core and starters [ci skip]	2019-12-21 14:10:22 +01:00
Ines Montani	0750d59e5a	Allow setting ner_missing_tag on docs_to_json	2019-12-21 13:47:21 +01:00
Sofie Van Landeghem	8ebbb85117	Documentation for PhraseMatcher constructor (#4826 ) * add max_length as argument for init PhraseMatcher * improve error message too	2019-12-20 23:00:04 +01:00
Sofie Van Landeghem	12158c1e3a	Restore tqdm imports (#4804 ) * set 4.38.0 to minimal version with color bug fix * set imports back to proper place * add upper range for tqdm	2019-12-16 13:12:19 +01:00
Ines Montani	c466e02466	Update universe [ci skip]	2019-12-13 15:57:39 +01:00
Sofie Van Landeghem	557dcf5659	NEL requires sentences to be set (#4801 )	2019-12-13 15:55:18 +01:00
tamuhey	1707e77c5e	add char_span to Span (#4793 )	2019-12-13 15:54:58 +01:00
Sofie Van Landeghem	f9b541f9ef	More robust set entities method in KB (#4794 ) * add unit test for setting entities with duplicate identifiers * count the number of actual unique identifiers and throw duplicate warning	2019-12-13 10:45:29 +01:00
Thiago Lages de Alencar	a067ded495	Update doc.md (#4796 )	2019-12-11 18:21:40 +01:00
Sofie Van Landeghem	5355b0038f	Update EL example (#4789 ) * update EL example script after sentence-central refactor * version bump * set incl_prior to False for quick demo purposes * clean up	2019-12-11 18:19:42 +01:00
adrianeboyd	38e1bc19f4	Add destructors for states in TransitionSystem (#4686 )	2019-12-10 13:23:27 +01:00
Matthew Honnibal	45efdb1ef7	Merge branch 'master' of https://github.com/explosion/spaCy	2019-12-10 00:54:18 +01:00
Matthew Honnibal	0a3175d46f	Require thinc v7.4.0.dev0	2019-12-10 00:47:51 +01:00
adrianeboyd	c208eb6e4d	Fix int value handling in Matcher (#4749 ) Add `int` values (for `LENGTH`) in _get_attr_values() instead of treating `int` like `dict`.	2019-12-06 19:22:57 +01:00
Tclack88	ab8dc2732c	Update token.md (#4767 ) * Update token.md documentation is confusing: A '?' is a right punct, but '¿' is a left punct * Update token.md add quotations around parentheses in `is_left_punct` and `is_right_punct` for clarification, ensuring the question mark that follows is not perceived as an example of left and right punctuation * Move quotes into code block [ci skip]	2019-12-06 19:22:02 +01:00
Sofie Van Landeghem	780d43aac7	fix bug in EL predict (#4779 )	2019-12-06 19:18:14 +01:00
Ines Montani	bf611ebca7	Document jsonl option on converter [ci skip]	2019-12-06 19:17:45 +01:00
Nicolai Bjerre Pedersen	de5453cdcb	Fix link to user hooks in docs (#4778 ) * Fix link to user hooks in docs * Update mr_bjerre.md Mistake in contributor agreement * Apparently hard to get it right (wrong name of sca)	2019-12-06 19:17:12 +01:00
adrianeboyd	676e75838f	Include Doc.cats in serialization of Doc and DocBin (#4774 ) * Include Doc.cats in to_bytes() * Include Doc.cats in DocBin serialization * Add tests for serialization of cats Test serialization of cats for Doc and DocBin.	2019-12-06 14:07:39 +01:00
Antti Ajanki	e626a011cc	Improvements to the Finnish language data (#4738 ) * Enable lex_attrs on Finnish * Copy the Danish tokenizer rules to Finnish Specifically, don't break hyphenated compound words * Contributor agreement * A new file for Finnish tokenizer rules instead of including the Danish ones	2019-12-03 12:55:28 +01:00
Christoph Purschke	a7ee4b6f17	new tests & tokenization fixes (#4734 ) - added some tests for tokenization issues - fixed some issues with tokenization of words with hyphen infix - rewrote the "tokenizer_exceptions.py" file (stemming from the German version)	2019-12-01 23:08:21 +01:00
adrianeboyd	48ea2e8d0f	Restructure Sentencizer to follow Pipe API (#4721 ) * Restructure Sentencizer to follow Pipe API Restructure Sentencizer to follow Pipe API so that it can be scored with `nlp.evaluate()`. * Add Sentencizer pipe() test	2019-11-27 16:33:34 +01:00
Jari Bakken	16cb19e960	update nb tag_map (#4711 )	2019-11-25 21:26:26 +01:00
Ines Montani	5b36dec7eb	Auto-exclude disabled when calling from_disk during load (#4708 )	2019-11-25 16:01:22 +01:00
Ines Montani	2160ecfc92	Fix typo [ci skip]	2019-11-25 13:08:19 +01:00
adrianeboyd	2d8c6e1124	Iterate over lr_edges until sents are correct (#4702 ) Iterate over lr_edges until all heads are within the current sentence. Instead of iterating over them for a fixed number of iterations, check whether the sentence boundaries are correct for the heads and stop when all are correct. Stop after a maximum of 10 iterations, providing a warning in this case since the sentence boundaries may not be correct.	2019-11-25 13:06:36 +01:00
Ines Montani	cbacb0f1a4	Update shape docs and examples (resolves #4615 ) [ci skip]	2019-11-23 17:16:55 +01:00
Matt Maybeno	c9f1e99787	Agnostic vocab array fix (#4680 ) * Use get_array_module instead of numpy * add contributor agreement	2019-11-23 14:59:52 +01:00
adrianeboyd	46250f60ac	Add missing tags to el/es/pt tag maps (#4696 ) * Add missing tags to pt tag map * Add missing tags to es tag map * Add missing tags to el tag map * Add missing symbol in el tag map	2019-11-23 14:57:21 +01:00
Paul O'Leary McCann	f0e3e606a6	Replace python-mecab3 with fugashi for Japanese (#4621 ) * Switch from mecab-python3 to fugashi mecab-python3 has been the best MeCab binding for a long time but it's not very actively maintained, and since it's based on old SWIG code distributed with MeCab there's a limit to how effectively it can be maintained. Fugashi is a new Cython-based MeCab wrapper I wrote. Since it's not based on the old SWIG code it's easier to keep it current and make small deviations from the MeCab C/C++ API where that makes sense. * Change mecab-python3 to fugashi in setup.cfg * Change "mecab tags" to "unidic tags" The tags come from MeCab, but the tag schema is specified by Unidic, so it's more proper to refer to it that way. * Update conftest * Add fugashi link to external deps list for Japanese	2019-11-23 14:31:04 +01:00
Ines Montani	a0fb1acb10	Update version [ci skip]	2019-11-21 18:19:37 +01:00

11133 Commits