spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-12-09 19:24:22 +03:00

Author	SHA1	Message	Date
Sofie Van Landeghem	ba02957c80	Fix dependency copy for as_doc (#3969) * failing unit test for issue 3962 * attempt to fix Issue #3962 * create artificial unit test example * using length instead of self.length * sp * reformat with black * find better ancestor within span and use generic 'dep' * attach to span.root if there is no appropriate ancestor * comment span text * clean up ancestor code * reconstruct dep tree to keep same number of sentences	2019-07-23 18:28:54 +02:00
Ines Montani	a32b033b8c	Add regression test for #4002 Test that the PhraseMatcher can match on overwritten NORM attributes.	2019-07-22 14:18:24 +02:00
Falak Asad	ff1e73e35c	Bugfix/issue 3968 (#3982) * Fix for issue-3968 * Added contributor agreement * Made suggested changes	2019-07-18 00:20:32 +02:00
Ines Montani	073013f129	Auto-format [ci skip]	2019-07-17 12:34:13 +02:00
Ines Montani	62ff128888	Add regression test for #3951	2019-07-16 14:00:00 +02:00
Ines Montani	7f551050b1	Add regression test for #3972	2019-07-16 13:07:35 +02:00
Søren Lind Kristiansen	26aee70d95	Make Danish tokenizer split on forward slash	2019-07-12 15:20:42 +02:00
Sofie Van Landeghem	ed774cb953	Fixing ngram bug (#3953) * minimal failing example for Issue #3661 * referenced Issue #3661 instead of Issue #3611 * cleanup	2019-07-12 10:01:35 +02:00
Ines Montani	673c864a06	Fix doc.count_by functionality (#3950) Fix doc.count_by functionality	2019-07-11 13:44:00 +02:00
Ines Montani	2426f4d44c	Fix default punctuation rules for splitting Hindi text (#3948) Fix default punctuation rules for splitting Hindi text Co-authored-by: yash <patadiayash@gmail.com> Co-authored-by: Ines Montani <ines@ines.io>	2019-07-11 13:36:28 +02:00
svlandeg	349107daa3	cleanup	2019-07-11 13:09:22 +02:00
Matthew Honnibal	b40b4c2c31	💫 Fix issue #3839: Incorrect entity IDs from Matcher with operators (#3949) * Add regression test for issue #3541 * Add comment on bugfix * Remove incorrect test * Un-xfail test	2019-07-11 12:55:11 +02:00
Ines Montani	197cfd7ebc	Merge branch 'master' into pr/3948	2019-07-11 12:18:31 +02:00
Ines Montani	d166756607	Fix test	2019-07-11 12:16:43 +02:00
Ines Montani	0b8406a05c	Tidy up and auto-format	2019-07-11 12:02:25 +02:00
yash	ae2d52e323	Add default encoding utf-8 for test file	2019-07-11 15:26:27 +05:30
yash	d5311b3c42	Add test file for issue (#3625) and spacy contributor agreement	2019-07-11 14:53:14 +05:30
svlandeg	e080412385	tracked the bug down to PreshCounter.inc - still unclear what goes wrong	2019-07-11 01:53:06 +02:00
svlandeg	a89fecce97	failing unit test for issue #3869	2019-07-11 00:43:55 +02:00
Matthew Honnibal	465456edb9	Un-xfail test #3880	2019-07-10 14:01:17 +02:00
Matthew Honnibal	87f7ec34d5	Add test for #3880	2019-07-10 13:53:55 +02:00
Ines Montani	4e04080b76	Only compare sorted patterns in test Try to work around flaky tests on Python 3.5	2019-07-10 13:00:52 +02:00
Ines Montani	82045aac8a	Merge regression tests	2019-07-10 12:49:18 +02:00
Ines Montani	570ab1f481	Fix handling of old entity ruler files Expected an `entity_ruler.jsonl` file in the top-level model directory, so the path passed to from_disk by default (model path plus component name), but with the suffix ".jsonl".	2019-07-10 12:14:12 +02:00
Ines Montani	874d914a44	Tidy up test	2019-07-10 12:13:23 +02:00
Ines Montani	6ba5ddbd5f	Merge pull request #3864 from svlandeg/feature/nel-wiki Entity linking using Wikipedia & Wikidata	2019-07-10 11:25:41 +02:00
cedar101	58f06e6180	Korean support (#3901) * start lang/ko * add test codes * using natto-py * add test_ko_tokenizer_full_tags() * spaCy contributor agreement * external dependency for ko * collections.namedtuple for python version < 3.5 * case fix * tuple unpacking * add jongseong(final consonant) * apply mecab option * Remove Pipfile for now Co-authored-by: Ines Montani <ines@ines.io>	2019-07-09 22:23:16 +02:00
Ines Montani	f2ea3e3ea2	Merge branch 'master' into feature/nel-wiki	2019-07-09 21:57:47 +02:00
Joshua Smith	2eb925bd05	Added an argument to `EntityRuler` constructor to pass attrs to… (#3919) * Preserve flags in EntityRuler The EntityRuler (explosion/spaCy#3526) does not preserve overwrite flags (or `ent_id_sep`) when serialized. This commit adds support for serialization/deserialization preserving overwrite and ent_id_sep flags. * add signed contributor agreement * flake8 cleanup mostly blank line issues. * mark test from the issue as needing a model The test from the issue needs some language model for serialization but the test wasn't originally marked correctly. * Adds `phrase_matcher_attr` to allow args to PhraseMatcher This is an added arg to pass to the `PhraseMatcher`. For example, this allows creation of a case insensitive phrase matcher when the `EntityRuler` is created. References explosion/spaCy#3822 * remove unneeded model loading The model didn't need to be loaded, and I replaced it with a change that doesn't require it (using existing fixtures) * updated docstring for new argument * updated docs to reflect new argument to the EntityRuler constructor * change tempdir handling to be compatible with python 2.7 * return conflicted code to entityruler Some stuff got cut out because of merge conflicts, this returns that code for the phrase_matcher_attr. * fixed typo in the code added back after conflicts * flake8 compliance When I deconflicted the branch there were some flake8 issues introduced. This resolves the spacing problems. * test changes: attempts to fix flaky test in python3.5 These tests seem to be a little flaky in 3.5 so I changed the check to avoid the comparisons that seem to fail sometimes.	2019-07-09 20:09:17 +02:00
Joshua Smith	e8420ab2b7	Added support for serializing overwrite and ent_id_sep (#3918) * Preserve flags in EntityRuler The EntityRuler (explosion/spaCy#3526) does not preserve overwrite flags (or `ent_id_sep`) when serialized. This commit adds support for serialization/deserialization preserving overwrite and ent_id_sep flags. * add signed contributor agreement * flake8 cleanup mostly blank line issues. * mark test from the issue as needing a model The test from the issue needs some language model for serialization but the test wasn't originally marked correctly. * remove unneeded model loading The model didn't need to be loaded, and I replaced it with a change that doesn't require it (using existing fixtures) * change tempdir handling to be compatible with python 2.7 * Adds code to handle item saved before this change. This code changes how the save files are handled and how the bytes are stored as well. This code adds check to dispatch correctly if it encounters bytes or files saved in the old format (and tests for those cases). * use util function for tempdir management Updated after PR comments: this code now uses the make_tempdir function from util instead of doing it by hand.	2019-07-08 17:28:28 +02:00
Rokas Ramanauskas	61ce126d4c	Lithuanian language support (#3895) * initial LT lang support * Added more stopwords. Started setting up some basic test environment (not complete) * Initial morph rules for LT lang * Closes #1 Adds tokenizer exceptions for Lithuanian * Closes #5 Punctuation rules. Closes #6 Lexical Attributes * test: add native examples to basic tests * feat: add tag map for lt lang * fix: remove undefined tag attribute 'Definite' * feat: add lemmatizer for lt lang * refactor: add new instances to lt lang morph rules; use tags from tag map * refactor: add morph rules to lt lang defaults * refactor: only keep nouns, verbs, adverbs and adjectives in lt lang lemmatizer lookup * refactor: add capitalized words to lt lang lemmatizer * refactor: add more num words to lt lang lex attrs * refactor: update lt lang stop word set * refactor: add new instances to lt lang tokenizer exceptions * refactor: remove comments from lt lang init file * refactor: use function instead of lambda in lt lex lang getter * refactor: remove conversion to dict in lt init when dict is already provided * chore: rename lt 'test_basic' to 'test_text' * feat: add more lt text tests * feat: add lemmatizer tests * refactor: remove unused imports, add newline to end of file * chore: add contributor agreement * chore: change 'en' to 'lt' in lt example description * fix: add missing encoding info * style: add newline to end of file * refactor: use python2 compatible syntax * style: reformat code using black	2019-07-08 10:25:22 +02:00
svlandeg	1c80b85241	fix tests	2019-06-28 08:59:23 +02:00
Ines Montani	6ccdf37574	Exclude user_data when copying doc in displaCy (closes #3882)	2019-06-26 14:37:05 +02:00
svlandeg	8608685543	ensure Span.as_doc keeps the entity links + unit test	2019-06-25 15:28:51 +02:00
svlandeg	ddc73b11a9	fix unicode literals	2019-06-24 12:58:18 +02:00
svlandeg	b76a43bee4	unicode strings	2019-06-19 13:26:33 +02:00
svlandeg	0b0959b363	UTF8 encoding	2019-06-19 13:11:39 +02:00
svlandeg	791327e3c5	Merge remote-tracking branch 'upstream/master' into feature/nel-wiki	2019-06-19 09:44:05 +02:00
Kabir Khan	1e19f34e29	Add optional `id` property to EntityRuler patterns (#3591) * Adding support for entity_id in EntityRuler pipeline component * Adding Spacy Contributor agreement * Updating EntityRuler to use string.format instead of f strings * Update Entity Ruler to support an 'id' attribute per pattern that explicitly identifies an entity. * Fixing tests * Remove custom extension entity_id and use built in ent_id token attribute. * Changing entity_id to ent_id for consistent naming * entity_ids => ent_ids * Removing kb, cleaning up tests, making util functions private, use rsplit instead of split	2019-06-16 13:29:04 +02:00
Suraj Rajan	46c78d0a41	Dependency tree pattern matcher (#3465) * Functional dependency tree pattern matcher * Tests fail due to inconsistent behaviour * Renamed dependencymatcher and added optimizations	2019-06-16 13:25:32 +02:00
BreakBB	d8573ee715	Update error raising for CLI pretrain to fix #3840 (#3843) * Add check for empty input file to CLI pretrain * Raise error if JSONL is not a dict or contains neither `tokens` nor `text` key * Skip empty values for correct pretrain keys and log a counter as warning * Add tests for CLI pretrain core function make_docs. * Add a short hint for the `tokens` key to the CLI pretrain docs * Add success message to CLI pretrain * Update model loading to fix the tests * Skip empty values and do not create docs out of it	2019-06-16 13:22:57 +02:00
Ines Montani	f35ce09776	Add regression test for #3839	2019-06-12 13:38:30 +02:00
Ines Montani	aae9034492	Tidy up [ci skip]	2019-06-12 13:38:23 +02:00
svlandeg	5c723c32c3	entity vectors in the KB + serialization of them	2019-06-05 18:29:18 +02:00
svlandeg	d83a1e3052	Merge branch 'master' into feature/nel-wiki	2019-06-03 09:35:10 +02:00
Germán	86eb817b74	Overwrites default getter for like_num in Spanish by adding _num_words and like_num to lex_attrs.py (#3810) (closes #3803)) * (#3803) Spanish like_num returning false for number-like token * (#3803) Spanish like_num now returning True for number-like token	2019-06-02 12:22:57 +02:00
BreakBB	ed18a6efbd	Add check for callable to 'Language.replace_pipe' to fix #3737 (#3741)	2019-05-14 16:59:31 +02:00
Ines Montani	8baff1c7c0	💫 Improve introspection of custom extension attributes (#3729) * Add custom __dir__ to Underscore (see #3707) * Make sure custom extension methods keep their docstrings (see #3707) * Improve tests * Prepend note on partial to docstring (see #3707) * Remove print statement * Handle cases where docstring is None	2019-05-12 00:53:11 +02:00
Ines Montani	505c9e0e19	Add util.filter_spans helper (#3686)	2019-05-08 02:33:40 +02:00
svlandeg	19e8f339cb	deduce entity freq from WP corpus and serialize vocab in WP test	2019-04-29 17:37:29 +02:00

1290 Commits