spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-25 17:36:30 +03:00

Author	SHA1	Message	Date
Ines Montani	f31876154d	Adjust formatting [ci skip]	2019-10-25 11:19:46 +02:00
Kabir Khan	93640373c7	Make entity_ruler ent_id resolution 2x faster and add docs for… (#4513 ) * Update entityruler.py * Making ent_id resolution 2x faster and adding docs * Fixing newlines in docstrings * Fixing newlines in docstrings	2019-10-25 11:16:42 +02:00
Ines Montani	cc05d9dad6	Auto-format [ci skip]	2019-10-24 16:21:08 +02:00
Ines Montani	73dc63d3bf	Tidy up and auto-format [ci skip]	2019-10-24 16:20:48 +02:00
Ines Montani	6dd2832438	Use numpy.frombuffer instead of fromstring Deprecation warning says we should do this	2019-10-24 16:18:41 +02:00
Ines Montani	9a849fe54e	Explicitly catch warning in test	2019-10-24 16:16:27 +02:00
adrianeboyd	1b0bbe4b76	Update tag maps and docs for English and German (#4501 ) * Update English tag_map Update English tag_map based on this conversion table: https://universaldependencies.org/tagset-conversion/en-penn-uposf.html * Update German tag_map Update German tag_map based on this conversion table: https://universaldependencies.org/tagset-conversion/de-stts-uposf.html * Add missing Tiger dependencies to glossary * Add quotes to definition of TO * Update POS/TAG tables in docs Update POS/TAG tables for English and German docs using current information generated from the tag_maps and GLOSSARY. * Update warning that -PRON- is specific to English * Revert docs to default JSON output with convert * Revert "Revert docs to default JSON output with convert" This reverts commit `6b78c048f1`.	2019-10-24 12:56:05 +02:00
Zhuoru Lin	10d88b09bb	Bugfix/fix wikidata train entity linker (#4509 ) * Fix labels_discard Nonetype iteration error * Contributor agreement for Zhuoru Lin * Enhance EntityLinker.predict() to handle labels_discard is None case.	2019-10-24 12:52:59 +02:00
adrianeboyd	8516e9d53b	Support train dict format as JSONL (#4471 ) * Support train dict format as JSONL * Add (overly simple) check for dict vs. tuple to read JSONL lines as either train dicts or train tuples * Extend JSON/JSONL roundtrip conversion tests using `docs_to_json()` and `GoldCorpus.train_tuples` * Revert docs to default JSON output with convert	2019-10-23 16:01:44 +02:00
Matthew Honnibal	ca7f0e669e	Set version to v2.2.2.dev1	2019-10-22 20:11:25 +02:00
Matthew Honnibal	9489c5f6b2	Clip most_similar to range [-1, 1] (fixes #4506 ) (#4507 ) * Clip most_similar to range [-1, 1] * Add/fix vectors tests * Fix test	2019-10-22 20:10:42 +02:00
Ines Montani	74a19aeb1c	Add xfailing test [ci skip]	2019-10-22 18:18:43 +02:00
Matthew Honnibal	3f6cb618a9	Set version to v2.2.2.dev0	2019-10-22 17:47:36 +02:00
Sofie Van Landeghem	48886afc78	prevent zero-length mem alloc (#4429 ) * raise specific error when removing a matcher rule that doesn't exist * rephrasing * goldparse init: allocate fields only if doc is not empty * avoid zero length alloc in saving tokenizer cache * avoid allocating zero length mem in matcher * asserts to avoid allocating zero length mem * fix zero-length allocation in matcher * bump cymem version * revert cymem version bump	2019-10-22 16:54:33 +02:00
adrianeboyd	3dfc764577	Free pointers in parser activations (#4486 ) * Free pointers in ActivationsC * Restructure alloc/free for parser activations * Rewrite/restructure to have allocation and free in parallel functions in `_parser_model` rather than partially in `_parseC()` in `Parser`. * Remove `resize_activations` from `_parser_model.pxd`.	2019-10-22 15:06:44 +02:00
tamuhey	fb89f6792b	refactor: remove unused variable (#4499 )	2019-10-22 14:38:17 +02:00
gustavengstrom	050e2445a8	Adding noun_chunks to the Swedish language model (sv) (#4422 ) * Create syntax_iterators.py Replica of spacy/lang/fr/syntax_iterators.py * Added import statements for SYNTAX_ITERATORS * Create gustavengstrom.md * Added "dobj" to list of labels in noun_chunks method and a test_noun_chunks method to the Swedish language model. * Delete README-checkpoint.md Co-authored-by: Gustav <gustav@davcon.se> Co-authored-by: Ines Montani <ines@ines.io>	2019-10-21 12:57:06 +02:00
adrianeboyd	f5c551a43a	Checks/errors related to ill-formed IOB input in CLI convert and debug-data (#4487 ) * Error for ill-formed input to iob_to_biluo() Check for empty label in iob_to_biluo(), which can result from ill-formed input. * Check for empty NER label in debug-data	2019-10-21 12:20:28 +02:00
Sofie Van Landeghem	d5d55312b2	prevent division by zero in most_similar method (#4488 )	2019-10-21 12:04:46 +02:00
Pepe Berba	7772d5d3c5	Update `vocab.get_vector` docs to include features on Fasttext ngram (#4464 ) * Update `vocab.get_vector` * Added contrib agreement	2019-10-20 01:28:18 +02:00
Ines Montani	2c96a5e5b0	Remove lemma attrs on BaseDefaults (#4468 )	2019-10-19 23:18:09 +02:00
adrianeboyd	8d3de90bc4	Suppress convert output if writing to stdout (#4472 )	2019-10-18 18:12:59 +02:00
Ines Montani	692d7f4291	Fix formatting [ci skip]	2019-10-18 11:33:38 +02:00
Ines Montani	181c01f629	Tidy up and auto-format	2019-10-18 11:27:38 +02:00
Ines Montani	fb11852750	Remove unused imports	2019-10-18 11:06:41 +02:00
adrianeboyd	d359da9687	Replace Entity/MatchStruct with SpanC (#4459 ) * Replace MatchStruct with Entity Replace MatchStruct with Entity since the existing Entity struct is nearly identical. * Replace Entity with more general SpanC	2019-10-18 11:01:47 +02:00
adrianeboyd	29e3da6493	Add missing cats to gold annot_tuples in Scorer (#4466 ) Add missing `cats` in `Scorer` call to `GoldParse.from_annot_tuples()` when the `doc` and `gold` have differing lengths.	2019-10-18 11:00:02 +02:00
adrianeboyd	135e3de531	Check for docs with 2+ sentences in debug-data (#4467 )	2019-10-18 10:59:16 +02:00
Daniel King	e646956176	Most similar bug (#4446 ) * Add batch size indexing * Don't sort if n == 1 * Add test for most similar vectors issue * Change > to >=	2019-10-16 23:18:55 +02:00
Anastassia	4a77d03ff7	Fix documentation for the docs_to_json function (#4456 )	2019-10-16 23:17:58 +02:00
adrianeboyd	275c9ad872	Allow int values in token patterns (#4444 ) * Add missing int value option to top-level pattern validation in Matcher * Adjust existing tests accordingly * Add new test for valid pattern `{"LENGTH": int}`	2019-10-16 13:40:18 +02:00
Sofie Van Landeghem	7d1efac4eb	Fix remove pattern from matcher (#4454 ) * raise specific error when removing a matcher rule that doesn't exist * rephrasing * bugfix in remove matcher + extended unit test	2019-10-16 13:34:58 +02:00
Sofie Van Landeghem	2d249a9502	KB extensions and better parsing of WikiData (#4375 ) * fix overflow error on windows * more documentation & logging fixes * md fix * 3 different limit parameters to play with execution time * bug fixes directory locations * small fixes * exclude dev test articles from prior probabilities stats * small fixes * filtering wikidata entities, removing numeric and meta items * adding aliases from wikidata also to the KB * fix adding WD aliases * adding also new aliases to previously added entities * fixing commas * small doc fixes * adding subclassof filtering * append alias functionality in KB * prevent appending the same entity-alias pair * fix for appending WD aliases * remove date filter * remove unnecessary import * small corrections and reformatting * remove WD aliases for now (too slow) * removing numeric entities from training and evaluation * small fixes * shortcut during prediction if there is only one candidate * add counts and fscore logging, remove FP NER from evaluation * fix entity_linker.predict to take docs instead of single sentences * remove enumeration sentences from the WP dataset * entity_linker.update to process full doc instead of single sentence * spelling corrections and dump locations in readme * NLP IO fix * reading KB is unnecessary at the end of the pipeline * small logging fix * remove empty files	2019-10-14 12:28:53 +02:00
Peter Gilles	428887b8f2	Initial commit: New language Luxembourgish (lb) (#4424 ) * new language: Luxembourgish (lb) * update * update * Update and rename .github/CONTRIBUTOR_AGREEMENT.md to .github/contributors/PeterGilles.md * Update and rename .github/contributors/PeterGilles.md to .github/CONTRIBUTOR_AGREEMENT.md * Update norm_exceptions.py * Delete README.md * moved test_lemma.py * deactivated 'lemma_lookup = LOOKUP' * update * Update conftest.py * update * tests updated * import unicode_literals * Update spacy/tests/lang/lb/test_text.py Co-Authored-By: Ines Montani <ines@ines.io> * Create PeterGilles.md	2019-10-14 12:27:50 +02:00
adrianeboyd	98a961a60e	Fix PhraseMatcher.remove for overlapping patterns (#4437 )	2019-10-14 12:19:51 +02:00
Ines Montani	f8f68bb062	Auto-format [ci skip]	2019-10-10 17:08:39 +02:00
adrianeboyd	6f54e59fe7	Fix util.filter_spans() to prefer first span in overlapping sam… (#4414 ) * Update util.filter_spans() to prefer earlier spans * Add filter_spans test for first same-length span * Update entity relation example to refer to util.filter_spans()	2019-10-10 17:00:03 +02:00
Sofie Van Landeghem	da6e0de34f	fix attrs field in the matcher (#4423 ) * raise specific error when removing a matcher rule that doesn't exist * rephrasing * ensure attrs is NULL when nr_attr == 0 + several fixes to prevent OOB	2019-10-10 15:20:59 +02:00
Sofie Van Landeghem	5efae495f1	Error when removing a matcher rule that doesn't exist (#4420 ) * raise specific error when removing a matcher rule that doesn't exist * rephrasing	2019-10-10 14:01:53 +02:00
Matthew Honnibal	fa95c030a5	Unify matcher get_ent_id and get_pattern_key (#4415 ) This is basically stabbing blindly at the ghost match problem, but it at least seems like there was a bug previously here --- so this should hopefully be an improvement, even if it doesn't fix the ghost match problem.	2019-10-09 15:26:31 +02:00
Ines Montani	c4f95c1569	Update formatting and docstrings [ci skip]	2019-10-08 12:25:23 +02:00
Matthew Honnibal	ddd6fda59c	Add registry for model creation functions ('architectures') (#4395 ) * Add architecture registry * Add test for arch registry * Add error for model architectures	2019-10-08 12:21:03 +02:00
tamuhey	650cbfe82d	multiprocessing pipe (#1303 ) (#4371 ) * refactor: separate formatting docs and golds in Language.update * fix return typo * add pipe test * unpickleable object cannot be assigned to p.map * passed test pipe * passed test! * pipe terminate * try pipe * passed test * fix ch * add comments * fix len(texts) * add comment * add comment * fix: multiprocessing of pipe is not supported in 2 * test: use assert_docs_equal * fix: is_python3 -> is_python2 * fix: change _pipe arg to use functools.partial * test: add vector modification test * test: add sample ner_pipe and user_data pipe * add warnings test * test: fix user warnings * test: fix warnings capture * fix: remove islice import * test: remove warnings test * test: add stream test * test: rename * fix: multiproc stream * fix: stream pipe * add comment * mp.Pipe seems to be able to use with relative small data * test: skip stream test in python2 * sort imports * test: add reason to skiptest * fix: use pipe for docs communication * add comments * add comment	2019-10-08 12:20:55 +02:00
adrianeboyd	14841d0aa6	Fix PhraseMatcher callback and add tests (#4399 ) * Fix callback lookup in PhraseMatcher (string key rather than hash key) * Add callback tests for Matcher and PhraseMatcher	2019-10-08 12:07:02 +02:00
Matthew Honnibal	fd4a5341b0	Fix ner_jsonl2json converter (fix #4389 ) (#4394 )	2019-10-08 00:52:45 +02:00
Matthew Honnibal	29f9fec267	Improve spacy pretrain (#4393 ) * Support bilstm_depth arg in spacy pretrain * Add option to ignore zero vectors in get_cossim_loss * Use cosine loss in Cloze multitask	2019-10-07 23:34:58 +02:00
Ines Montani	9cd6ca3e4d	Improve usage of pkg_resources and handling of entry points (#4387 ) * Only import pkg_resources where it's needed Apparently it's really slow * Use importlib_metadata for entry points * Revert "Only import pkg_resources where it's needed" This reverts commit `5ed8c03afa`. * Revert "Revert "Only import pkg_resources where it's needed"" This reverts commit `8b30b57957`. * Revert "Use importlib_metadata for entry points" This reverts commit `9f071f5c40`. * Revert "Revert "Use importlib_metadata for entry points"" This reverts commit `02e12a17ec`. * Skip test that weirdly hangs * Fix hanging test by using global	2019-10-07 17:22:09 +02:00
adrianeboyd	d53a8d9313	Consider batch_size when sorting similar vectors (#4388 )	2019-10-07 13:38:35 +02:00
adrianeboyd	a3509f67d4	Extend unicode character block for Sinhala (#4378 ) * Extend unicode character block for Sinhala * Add sentencizer tests for more languages	2019-10-07 13:17:03 +02:00
Ines Montani	573e543e4a	Alphanumeric -> alphabetic [ci skip] see ines/spacy-course#38	2019-10-06 13:30:01 +02:00
adrianeboyd	cbc2cee2c8	Improve URL_PATTERN and handling in tokenizer (#4374 ) * Move prefix and suffix detection for URL_PATTERN Move prefix and suffix detection for `URL_PATTERN` into the tokenizer. Remove associated lookahead and lookbehind from `URL_PATTERN`. Fix tokenization for Hungarian given new modified handling of prefixes and suffixes. * Match a wider range of URI schemes	2019-10-05 13:00:09 +02:00
Ines Montani	fec9433044	Make PhraseMatcher.vocab consistent with Matcher.vocab (closes #4373 )	2019-10-04 12:18:41 +02:00
Matthew Honnibal	37ef874d8b	Set version to v2.2.1	2019-10-03 14:50:39 +02:00
Sofie Van Landeghem	4e7259c6cf	Bugfix initializing DocBin with attributes (#4368 ) * docbin init fix + documentation fix + unit tests * newline * try with zlib instead of gzip (python 2 incompatibilities)	2019-10-03 14:48:45 +02:00
Ben Taylor	1db79a33cb	most_similar() return the k most similar vectors (#4364 ) * most_similar return n-most similar vectors * updated most_similar comment * add bintay contributor agreement * sign bintay contributor agreement * fix most_similar documentation typo * fixed error in prune_vectors * updated prune_vectors test	2019-10-03 14:09:44 +02:00
Matthew Honnibal	2eb31012e7	Set version to v2.2.0	2019-10-02 14:40:06 +02:00
Matthew Honnibal	796072e560	Set version to v2.2.0.dev19	2019-10-02 12:51:29 +02:00
Sofie Van Landeghem	9d3ce7cba2	Ensure training doesn't crash with empty batches (#4360 ) * unit test for previously resolved unflatten issue * prevent batch of empty docs to cause problems	2019-10-02 12:50:47 +02:00
adrianeboyd	dda86118bd	Update Ukrainian lemmatizer with new lookups (#4359 ) * Update Ukrainian lemmatizer with new lookups * Add missing import Co-authored-by: Ines Montani <ines@ines.io>	2019-10-02 12:04:06 +02:00
Ines Montani	b6670bf0c2	Use consistent spelling	2019-10-02 10:37:39 +02:00
Matthew Honnibal	38b6e69389	Merge branch 'master' of https://github.com/explosion/spaCy	2019-10-01 22:28:25 +02:00
Matthew Honnibal	d4b63bb6dd	Set version to v2.2.0	2019-10-01 22:28:13 +02:00
Ines Montani	475e3188ce	Add docs on filtering overlapping spans for merging (resolves #4352 ) [ci skip]	2019-10-01 21:59:50 +02:00
Matthew Honnibal	64a9577d43	Set version to v2.2.0.dev17	2019-10-01 21:36:59 +02:00
Ines Montani	cf65a80f36	Refactor lemmatizer and data table integration (#4353 ) * Move test * Allow default in Lookups.get_table * Start with blank tables in Lookups.from_bytes * Refactor lemmatizer to hold instance of Lookups * Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk) * Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency * Remove old and unsupported Lemmatizer.load classmethod * Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need * Update tests and docs * Fix more tests * Fix lemmatizer * Upgrade pytest to try and fix weird CI errors * Try pytest 4.6.5	2019-10-01 21:36:03 +02:00
Ines Montani	3297a19545	Warn in Tagger.begin_training if no lemma tables are available (#4351 )	2019-10-01 15:13:55 +02:00
Matthew Honnibal	2fb05482dd	Set version to v2.2.0	2019-10-01 03:50:13 +02:00
Matthew Honnibal	dc22ec0aad	Set version to v2.2.0.dev17	2019-10-01 03:26:53 +02:00
Matthew Honnibal	aedfba867a	Set version to v2.2.0.dev16	2019-10-01 00:31:00 +02:00
Ines Montani	e0cf4796a5	Move lookup tables out of the core library (#4346 ) * Add default to util.get_entry_point * Tidy up entry points * Read lookups from entry points * Remove lookup tables and related tests * Add lookups install option * Remove lemmatizer tests * Remove logic to process language data files * Update setup.cfg	2019-10-01 00:01:27 +02:00
Rahul Soni	ed620daa5c	Fix example sentences in Hindi for grammatical errors (#4343 ) * Fix grammar for hindi * Fix grammar for hindi * Submit contributor agreement	2019-09-30 23:32:49 +02:00
Ines Montani	ba186299e1	Tidy up and modernize setup and config (#4344 ) * Tidy up and modernize setup and config * Update setup.cfg * Re-add pyproject.toml * Delete .flake8 * Move static meta from about to setup.cfg * Update setup.cfg Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>	2019-09-30 20:10:55 +02:00
Ines Montani	4f905ac9e6	Add test for ASCII filenames (#4345 )	2019-09-30 18:45:30 +02:00
Matthew Honnibal	b5c775dd42	Set version to v2.2.0	2019-09-30 12:47:08 +02:00
Ines Montani	f7d1736241	Skip duplicate spans in Doc.retokenize (#4339 )	2019-09-30 12:43:48 +02:00
Ines Montani	0226b3bf0e	Fix test imports	2019-09-29 17:34:56 +02:00
Ines Montani	3d8fd4b461	Revert #4334	2019-09-29 17:32:12 +02:00
adrianeboyd	ba5595c764	Fix PhraseMatcher to remember attr on pickling (#4336 ) * Fix PhraseMatcher to remember attr on pickling * Check for attr as int or long	2019-09-29 17:12:33 +02:00
Ines Montani	75514b5970	Fix Korean	2019-09-29 17:10:56 +02:00
Ines Montani	499c39acba	Remove unnecessary namedtuple/dataclass	2019-09-29 15:05:28 +02:00
Matthew Honnibal	eba708404d	Set version to v2.2.0.dev15	2019-09-28 22:23:53 +02:00
Matthew Honnibal	6189959adb	Set version to v2.2.0.dev14	2019-09-28 22:09:46 +02:00
Matthew Honnibal	0df2a599b7	Set version to v2.2.0.dev13	2019-09-28 21:26:05 +02:00
Ines Montani	c9cd516d96	Move tests out of package (#4334 ) * Move tests out of package * Fix typo	2019-09-28 18:05:00 +02:00
Matthew Honnibal	d05eb56ce2	Set version to v2.2.0.dev12	2019-09-28 16:35:56 +02:00
Ines Montani	5fe61539c4	Fix unicode "e" in filename	2019-09-28 15:45:16 +02:00
Ines Montani	811c4c97c9	Correct lookup lemma of "lenses" (see #4332 )	2019-09-28 14:04:07 +02:00
Ines Montani	f8d1e2f214	Update CLI docs [ci skip]	2019-09-28 13:12:30 +02:00
Sofie Van Landeghem	22b9e12159	Ensure the NER remains consistent after resizing (#4330 ) * test and fix for second bug of issue 4042 * fix for first bug in 4042 * crashing test for Issue 4313 * forgot one instance of resize * remove prints * undo uncomment * delete test for 4313 (uses third party lib) * add fix for Issue 4313 * unit test for 4313	2019-09-27 20:57:13 +02:00
adrianeboyd	3906785b49	Initialize low data warning for debug-data parser (#4331 )	2019-09-27 20:56:49 +02:00
Ines Montani	206e8a5ac7	Also apply hotfix to Ukrainian lemmatizer	2019-09-27 18:03:26 +02:00
Ines Montani	acd5bcb0b3	Tidy up fixtures	2019-09-27 17:57:59 +02:00
Ines Montani	b21b2e27e5	Hotfix Russian lemmatizer	2019-09-27 17:56:12 +02:00
Matthew Honnibal	a4d4c4bfa4	Set version to v2.2.0.dev11	2019-09-27 16:40:26 +02:00
Ines Montani	aad66d9bb9	Document PhraseMatcher.remove [ci skip]	2019-09-27 16:34:53 +02:00
adrianeboyd	c23edf302b	Replace PhraseMatcher with trie-based search (#4309 ) * Replace PhraseMatcher with Aho-Corasick Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays of the hash values for the relevant attribute. The implementation is based on FlashText. The speed should be similar to the previous PhraseMatcher. It is now possible to easily remove match IDs and matches don't go missing with large keyword lists / vocabularies. Fixes #4308. * Restore support for pickling * Fix internal keyword add/remove for numpy arrays * Add missing loop for match ID set in search loop * Remove cruft in matching loop for partial matches There was a bit of unnecessary code left over from FlashText in the matching loop to handle partial token matches, which we don't have with PhraseMatcher. * Replace dict trie with MapStruct trie * Fix how match ID hash is stored/added * Update fix for match ID vocab * Switch from map_get_unless_missing to map_get * Switch from numpy array to Token.get_struct_attr Access token attributes directly in Doc instead of making a copy of the relevant values in a numpy array. Add unsatisfactory warning for hash collision with reserved terminal hash key. (Ideally it would change the reserved terminal hash and redo the whole trie, but for now, I'm hoping there won't be collisions.) * Restructure imports to export find_matches * Implement full remove() Remove unnecessary trie paths and free unused maps. Parallel to Matcher, raise KeyError when attempting to remove a match ID that has not been added. * Store docs internally only as attr lists * Reduces size for pickle * Remove duplicate keywords store Now that docs are stored as lists of attr hashes, there's no need to have the duplicate _keywords store.	2019-09-27 16:22:34 +02:00
tamuhey	b408b5b29e	Refactor language update (#4316 ) * refactor: separate formatting docs and golds in Language.update * fix return typo	2019-09-27 16:20:21 +02:00
Jaydeep Borkar	6a06a3fa6a	Update stop_words.py and add name in contributors (#4325 ) * Update stop_words.py and add name in contributors * add jaydeepborkar.md in contributors directory * Reset template [ci skip] Co-authored-by: Ines Montani <ines@ines.io>	2019-09-27 11:57:27 +02:00
Ines Montani	da9a869d3f	Update vectors name docs [ci skip]	2019-09-26 16:21:32 +02:00
Matthew Honnibal	58533f01bf	Set version to v2.2.0.dev10	2019-09-26 03:03:50 +02:00

6552 Commits