spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-12-08 10:44:30 +03:00

Author	SHA1	Message	Date
Ines Montani	a98d1cd58e	Update Thinc version and remove GPU ops	2019-10-20 19:01:45 +02:00
Pepe Berba	7772d5d3c5	Update `vocab.get_vector` docs to include features on Fasttext ngram (#4464) * Update `vocab.get_vector` * Added contrib agreement	2019-10-20 01:28:18 +02:00
Ines Montani	2c96a5e5b0	Remove lemma attrs on BaseDefaults (#4468)	2019-10-19 23:18:09 +02:00
Ines Montani	f6af3cf8d9	Add 3.8 classifier [ci skip]	2019-10-19 18:13:25 +02:00
Ines Montani	5e59c9b3ee	Fix unicode strings in examples [ci skip]	2019-10-18 18:47:59 +02:00
adrianeboyd	8d3de90bc4	Suppress convert output if writing to stdout (#4472)	2019-10-18 18:12:59 +02:00
Ines Montani	692d7f4291	Fix formatting [ci skip]	2019-10-18 11:33:38 +02:00
Ines Montani	181c01f629	Tidy up and auto-format	2019-10-18 11:27:38 +02:00
Ines Montani	fb11852750	Remove unused imports	2019-10-18 11:06:41 +02:00
adrianeboyd	d359da9687	Replace Entity/MatchStruct with SpanC (#4459) * Replace MatchStruct with Entity Replace MatchStruct with Entity since the existing Entity struct is nearly identical. * Replace Entity with more general SpanC	2019-10-18 11:01:47 +02:00
adrianeboyd	29e3da6493	Add missing cats to gold annot_tuples in Scorer (#4466) Add missing `cats` in `Scorer` call to `GoldParse.from_annot_tuples()` when the `doc` and `gold` have differing lengths.	2019-10-18 11:00:02 +02:00
adrianeboyd	135e3de531	Check for docs with 2+ sentences in debug-data (#4467)	2019-10-18 10:59:16 +02:00
Ghola	258eb9e064	Misspelling on Lemmatizer Example #4406 (#4449) Removing extra o in the lookups = Loookups()	2019-10-16 23:23:15 +02:00
Daniel King	e646956176	Most similar bug (#4446) * Add batch size indexing * Don't sort if n == 1 * Add test for most similar vectors issue * Change > to >=	2019-10-16 23:18:55 +02:00
Anastassia	4a77d03ff7	Fix documentation for the docs_to_json function (#4456)	2019-10-16 23:17:58 +02:00
adrianeboyd	275c9ad872	Allow int values in token patterns (#4444) * Add missing int value option to top-level pattern validation in Matcher * Adjust existing tests accordingly * Add new test for valid pattern `{"LENGTH": int}`	2019-10-16 13:40:18 +02:00
Sofie Van Landeghem	7d1efac4eb	Fix remove pattern from matcher (#4454) * raise specific error when removing a matcher rule that doesn't exist * rephrasing * bugfix in remove matcher + extended unit test	2019-10-16 13:34:58 +02:00
Sofie Van Landeghem	2d249a9502	KB extensions and better parsing of WikiData (#4375) * fix overflow error on windows * more documentation & logging fixes * md fix * 3 different limit parameters to play with execution time * bug fixes directory locations * small fixes * exclude dev test articles from prior probabilities stats * small fixes * filtering wikidata entities, removing numeric and meta items * adding aliases from wikidata also to the KB * fix adding WD aliases * adding also new aliases to previously added entities * fixing comma's * small doc fixes * adding subclassof filtering * append alias functionality in KB * prevent appending the same entity-alias pair * fix for appending WD aliases * remove date filter * remove unnecessary import * small corrections and reformatting * remove WD aliases for now (too slow) * removing numeric entities from training and evaluation * small fixes * shortcut during prediction if there is only one candidate * add counts and fscore logging, remove FP NER from evaluation * fix entity_linker.predict to take docs instead of single sentences * remove enumeration sentences from the WP dataset * entity_linker.update to process full doc instead of single sentence * spelling corrections and dump locations in readme * NLP IO fix * reading KB is unnecessary at the end of the pipeline * small logging fix * remove empty files	2019-10-14 12:28:53 +02:00
Peter Gilles	428887b8f2	Initial commit: New language Luxembourgish (lb) (#4424) * new language: Luxembourgish (lb) * update * update * Update and rename .github/CONTRIBUTOR_AGREEMENT.md to .github/contributors/PeterGilles.md * Update and rename .github/contributors/PeterGilles.md to .github/CONTRIBUTOR_AGREEMENT.md * Update norm_exceptions.py * Delete README.md * moved test_lemma.py * deactivated 'lemma_lookup = LOOKUP' * update * Update conftest.py * update * tests updated * import unicode_literals * Update spacy/tests/lang/lb/test_text.py Co-Authored-By: Ines Montani <ines@ines.io> * Create PeterGilles.md	2019-10-14 12:27:50 +02:00
adrianeboyd	98a961a60e	Fix PhraseMatcher.remove for overlapping patterns (#4437)	2019-10-14 12:19:51 +02:00
Ines Montani	f8f68bb062	Auto-format [ci skip]	2019-10-10 17:08:39 +02:00
adrianeboyd	d2d2baaf76	Revert training example edit from #4327 (#4403) I think the original annotation was correct and this change also unfortunately introduced a cycle into the dependency tree.	2019-10-10 17:00:26 +02:00
adrianeboyd	6f54e59fe7	Fix util.filter_spans() to prefer first span in overlapping sam… (#4414) * Update util.filter_spans() to prefer earlier spans * Add filter_spans test for first same-length span * Update entity relation example to refer to util.filter_spans()	2019-10-10 17:00:03 +02:00
Sofie Van Landeghem	da6e0de34f	fix attrs field in the matcher (#4423) * raise specific error when removing a matcher rule that doesn't exist * rephrasing * ensure attrs is NULL when nr_attr == 0 + several fixes to prevent OOB	2019-10-10 15:20:59 +02:00
Sofie Van Landeghem	5efae495f1	Error when removing a matcher rule that doesn't exist (#4420) * raise specific error when removing a matcher rule that doesn't exist * rephrasing	2019-10-10 14:01:53 +02:00
Matthew Honnibal	fa95c030a5	Unify matcher get_ent_id and get_pattern_key (#4415) This is basically stabbing blindly at the ghost match problem, but it at least seems like there was a bug previously here --- so this should hopefully be an improvement, even if it doesn't fix the ghost match problem.	2019-10-09 15:26:31 +02:00
Ines Montani	77643de2ca	Downgrade importlib_metadata requirement	2019-10-08 23:43:24 +02:00
Ines Montani	5cbe21700b	Only show label scheme if not empty [ci skip]	2019-10-08 15:52:59 +02:00
Ines Montani	8f76d6c9ef	Update transformer model details [ci skip]	2019-10-08 15:39:38 +02:00
Ines Montani	dd30d3ec99	Add setuptools as runtime dependency	2019-10-08 12:46:59 +02:00
Ines Montani	c4f95c1569	Update formatting and docstrings [ci skip]	2019-10-08 12:25:23 +02:00
Matthew Honnibal	ddd6fda59c	Add registry for model creation functions ('architectures') (#4395) * Add architecture registry * Add test for arch registry * Add error for model architectures	2019-10-08 12:21:03 +02:00
tamuhey	650cbfe82d	multiprocessing pipe (#1303) (#4371) * refactor: separate formatting docs and golds in Language.update * fix return typo * add pipe test * unpickleable object cannot be assigned to p.map * passed test pipe * passed test! * pipe terminate * try pipe * passed test * fix ch * add comments * fix len(texts) * add comment * add comment * fix: multiprocessing of pipe is not supported in 2 * test: use assert_docs_equal * fix: is_python3 -> is_python2 * fix: change _pipe arg to use functools.partial * test: add vector modification test * test: add sample ner_pipe and user_data pipe * add warnings test * test: fix user warnings * test: fix warnings capture * fix: remove islice import * test: remove warnings test * test: add stream test * test: rename * fix: multiproc stream * fix: stream pipe * add comment * mp.Pipe seems to be able to use with relative small data * test: skip stream test in python2 * sort imports * test: add reason to skiptest * fix: use pipe for docs communucation * add comments * add comment	2019-10-08 12:20:55 +02:00
adrianeboyd	14841d0aa6	Fix PhraseMatcher callback and add tests (#4399) * Fix callback lookup in PhraseMatcher (string key rather than hash key) * Add callback tests for Matcher and PhraseMatcher	2019-10-08 12:07:02 +02:00
Matthew Honnibal	fd4a5341b0	Fix ner_jsonl2json converter (fix #4389) (#4394)	2019-10-08 00:52:45 +02:00
Matthew Honnibal	29f9fec267	Improve spacy pretrain (#4393) * Support bilstm_depth arg in spacy pretrain * Add option to ignore zero vectors in get_cossim_loss * Use cosine loss in Cloze multitask	2019-10-07 23:34:58 +02:00
Ines Montani	9cd6ca3e4d	Improve usage of pkg_resources and handling of entry points (#4387) * Only import pkg_resources where it's needed Apparently it's really slow * Use importlib_metadata for entry points * Revert "Only import pkg_resources where it's needed" This reverts commit `5ed8c03afa`. * Revert "Revert "Only import pkg_resources where it's needed"" This reverts commit `8b30b57957`. * Revert "Use importlib_metadata for entry points" This reverts commit `9f071f5c40`. * Revert "Revert "Use importlib_metadata for entry points"" This reverts commit `02e12a17ec`. * Skip test that weirdly hangs * Fix hanging test by using global	2019-10-07 17:22:09 +02:00
adrianeboyd	d53a8d9313	Consider batch_size when sorting similar vectors (#4388)	2019-10-07 13:38:35 +02:00
adrianeboyd	a3509f67d4	Extend unicode character block for Sinhala (#4378) * Extend unicode character block for Sinhala * Add sentencizer tests for more languages	2019-10-07 13:17:03 +02:00
Ines Montani	573e543e4a	Alphanumeric -> alphabetic [ci skip] see ines/spacy-course#38	2019-10-06 13:30:01 +02:00
adrianeboyd	cbc2cee2c8	Improve URL_PATTERN and handling in tokenizer (#4374) * Move prefix and suffix detection for URL_PATTERN Move prefix and suffix detection for `URL_PATTERN` into the tokenizer. Remove associated lookahead and lookbehind from `URL_PATTERN`. Fix tokenization for Hungarian given new modified handling of prefixes and suffixes. * Match a wider range of URI schemes	2019-10-05 13:00:09 +02:00
Ines Montani	e65dffd80b	Clarify serialization of extension attributes (closes #4377) [ci skip]	2019-10-05 11:58:00 +02:00
Ines Montani	fec9433044	Make PhraseMatcher.vocab consistent with Matcher.vocab (closes #4373)	2019-10-04 12:18:41 +02:00
Ines Montani	e7ddc6f662	Add conda install for lookups [ci skip]	2019-10-03 17:52:53 +02:00
Matthew Honnibal	37ef874d8b	Set version to v2.2.1	2019-10-03 14:50:39 +02:00
Sofie Van Landeghem	4e7259c6cf	Bugfix initializing DocBin with attributes (#4368) * docbin init fix + documentation fix + unit tests * newline * try with zlib instead of gzip (python 2 incompatibilities)	2019-10-03 14:48:45 +02:00
Ines Montani	ce1d441de5	Add docs for Vectors.most_similar [ci skip]	2019-10-03 14:29:47 +02:00
Ben Taylor	1db79a33cb	most_similar() return the k most similar vectors (#4364) * most_similar return n-most similar vectors * updated most_similar comment * add bintay contributor agreement * sign bintay contributor agreement * fix most_similar documentation typo * fixed error in prune_vectors * updated prune_vectors test	2019-10-03 14:09:44 +02:00
Ines Montani	4159936720	Update README.md [ci skip]	2019-10-02 19:15:22 +02:00
Ines Montani	e4782feae9	Update README.md [ci skip]	2019-10-02 18:49:55 +02:00

10998 Commits