spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-25 17:36:30 +03:00

Author	SHA1	Message	Date
Ines Montani	cc76a26fe8	Raise error for negative arc indices (closes #3917 )	2019-08-20 15:51:37 +02:00
Ines Montani	009280fbc5	Tidy up and auto-format	2019-08-18 15:09:16 +02:00
Ziming He	eea7d4f4a8	biluo_tags_from_offsets throw exception for overlapping entities (#4021 ) * Check whether two entities overlap - biluo_gold_biluo_overlap now throw exception when entities passed in have overlaps - added unit test * SCA agreement	2019-08-15 18:13:32 +02:00
AJ Rader	2f3648700c	Correction of default lemmatizer lookup in English (Issue # 4104) (#4110 ) * pytest file for issue4104 established * edited default lookup english lemmatizer for spun; fixes issue 4102 * eliminated parameterization and sorted dictionary dependnency in issue 4104 test * added contributor agreement	2019-08-15 11:39:10 +02:00
Sofie Van Landeghem	0ba1b5eebc	CLI scripts for entity linking (wikipedia & generic) (#4091 ) * document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bool's instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustements to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * turn kb_creator into CLI script (wip) * proper parameters for training entity vectors * wikidata pipeline split up into two executable scripts * remove context_width * move wikidata scripts in bin directory, remove old dummy script * refine KB script with logs and preprocessing options * small edits * small improvements to logging of EL CLI script	2019-08-13 15:38:59 +02:00
Sofie Van Landeghem	963ea5e8d0	Update lemma and vector information after splitting a token (#4097 ) * fixing vector and lemma attributes after retokenizer.split * fixing unit test with mockup tensor * xp instead of numpy	2019-08-08 15:09:44 +02:00
adrianeboyd	69aca7d839	Add validate option to EntityRuler (#4089 ) * Add validate option to EntityRuler * Add validate to EntityRuler, passed to Matcher and PhraseMatcher * Add validate to usage and API docs * Update website/docs/usage/rule-based-matching.md Co-Authored-By: Ines Montani <ines@ines.io> * Update website/docs/usage/rule-based-matching.md Co-Authored-By: Ines Montani <ines@ines.io>	2019-08-07 00:40:53 +02:00
Jeno	15be09ceb0	Raise error if annotation dict in simple training style has unexpected keys #4074 (#4079 ) * adding enhancement #4074. * modified behavior to strictly require top level dictionary keys - issue #4074 * pass expected keys to error message and add links as expected top level key	2019-08-06 11:01:25 +02:00
Sofie Van Landeghem	ad09b0d6f3	fetch norm from lex if necessary for matching (#4080 )	2019-08-05 23:51:04 +02:00
Pavle Vidanović	e1a935d71c	Stopwords for Serbian language. (#4078 ) * Serbian stopwords added. (cyrillic alphabet) * spaCy Contribution agreement included. * Test initialize updated	2019-08-05 10:22:27 +02:00
Muhammad Irfan	d1d30b0442	added missing punctuation following conventions. (#4066 )	2019-08-04 13:41:18 +02:00
adrianeboyd	925a852bb6	Improve NER per type scoring (#4052 ) * Improve NER per type scoring * include all gold labels in per type scoring, not only when recall > 0 * improve efficiency of per type scoring * Create Scorer tests, initially with NER tests * move regression test #3968 (per type NER scoring) to Scorer tests * add new test for per type NER scoring with imperfect P/R/F and per type P/R/F including a case where R == 0.0	2019-08-01 17:15:36 +02:00
Sofie Van Landeghem	f7d950de6d	ensure the lang of vocab and nlp stay consistent (#4057 ) * ensure the language of vocab and nlp stay consistent across serialization * equality with =	2019-08-01 17:13:01 +02:00
Sofie Van Landeghem	7de3b129ab	Resolve edge case when calling textcat.predict with empty doc (#4035 ) * resolve edge case where no doc has tokens when calling textcat.predict * more explicit value test	2019-07-30 14:58:01 +02:00
Ines Montani	fc69da0acb	💫 Support simple training format in nlp.evaluate and add tests (#4033 ) * Support simple training format in nlp.evaluate and add tests * Update docs [ci skip]	2019-07-27 17:30:18 +02:00
Bae Yong-Ju	05fbf5d976	Fix error when Korean text contains regexp special characters. (#4022 )	2019-07-25 17:53:33 +02:00
Ines Montani	87fcf3141c	Merge pull request #4003 from svlandeg/feature/nel-fixes API changes for Entity linking functionality	2019-07-23 23:17:07 +02:00
Sofie Van Landeghem	ba02957c80	Fix dependency copy for as_doc (#3969 ) * failing unit test for issue 3962 * attempt to fix Issue #3962 * create artificial unit test example * using length instead of self.length * sp * reformat with black * find better ancestor within span and use generic 'dep' * attach to span.root if there is no appropriate ancestor * comment span text * clean up ancestor code * reconstruct dep tree to keep same number of sentences	2019-07-23 18:28:54 +02:00
Ines Montani	a32b033b8c	Add regression test for #4002 Test that the PhraseMatcher can match on overwritten NORM attributes.	2019-07-22 14:18:24 +02:00
svlandeg	ad65171837	Merge remote-tracking branch 'upstream/master' into feature/nel-fixes	2019-07-22 13:41:28 +02:00
svlandeg	76184374e2	test corner cases	2019-07-22 13:39:32 +02:00
svlandeg	dae8a21282	rename entity frequency	2019-07-19 17:40:28 +02:00
Falak Asad	ff1e73e35c	Bugfix/issue 3968 (#3982 ) * Fix for issue-3968 * Added contributor agreement * Made suggested changes	2019-07-18 00:20:32 +02:00
svlandeg	d833d4c358	fixes in kb and gold	2019-07-17 17:18:26 +02:00
Ines Montani	073013f129	Auto-format [ci skip]	2019-07-17 12:34:13 +02:00
svlandeg	4086c6ff60	get vector functionality + unit test	2019-07-17 12:17:02 +02:00
Ines Montani	62ff128888	Add regression test for #3951	2019-07-16 14:00:00 +02:00
Ines Montani	7f551050b1	Add regression test for #3972	2019-07-16 13:07:35 +02:00
Søren Lind Kristiansen	26aee70d95	Make Danish tokenizer split on forward slash	2019-07-12 15:20:42 +02:00
Sofie Van Landeghem	ed774cb953	Fixing ngram bug (#3953 ) * minimal failing example for Issue #3661 * referenced Issue #3661 instead of Issue #3611 * cleanup	2019-07-12 10:01:35 +02:00
Ines Montani	673c864a06	Fix doc.count_by functionality (#3950 ) Fix doc.count_by functionality	2019-07-11 13:44:00 +02:00
Ines Montani	2426f4d44c	Fix default punctuation rules for splitting Hindi text (#3948 ) Fix default punctuation rules for splitting Hindi text Co-authored-by: yash <patadiayash@gmail.com> Co-authored-by: Ines Montani <ines@ines.io>	2019-07-11 13:36:28 +02:00
svlandeg	349107daa3	cleanup	2019-07-11 13:09:22 +02:00
Matthew Honnibal	b40b4c2c31	💫 Fix issue #3839 : Incorrect entity IDs from Matcher with operators (#3949 ) * Add regression test for issue #3541 * Add comment on bugfix * Remove incorrect test * Un-xfail test	2019-07-11 12:55:11 +02:00
Ines Montani	197cfd7ebc	Merge branch 'master' into pr/3948	2019-07-11 12:18:31 +02:00
Ines Montani	d166756607	Fix test	2019-07-11 12:16:43 +02:00
Ines Montani	0b8406a05c	Tidy up and auto-format	2019-07-11 12:02:25 +02:00
yash	ae2d52e323	Add default encoding utf-8 for test file	2019-07-11 15:26:27 +05:30
yash	d5311b3c42	Add test file for issue (#3625 ) and spacy contributor agreement	2019-07-11 14:53:14 +05:30
svlandeg	e080412385	tracked the bug down to PreshCounter.inc - still unclear what goes wrong	2019-07-11 01:53:06 +02:00
svlandeg	a89fecce97	failing unit test for issue #3869	2019-07-11 00:43:55 +02:00
Matthew Honnibal	465456edb9	Un-xfail test #3880	2019-07-10 14:01:17 +02:00
Matthew Honnibal	87f7ec34d5	Add test for #3880	2019-07-10 13:53:55 +02:00
Ines Montani	4e04080b76	Only compare sorted patterns in test Try to work around flaky tests on Python 3.5	2019-07-10 13:00:52 +02:00
Ines Montani	82045aac8a	Merge regression tests	2019-07-10 12:49:18 +02:00
Ines Montani	570ab1f481	Fix handling of old entity ruler files Expected an `entity_ruler.jsonl` file in the top-level model directory, so the path passed to from_disk by default (model path plus componentn name), but with the suffix ".jsonl".	2019-07-10 12:14:12 +02:00
Ines Montani	874d914a44	Tidy up test	2019-07-10 12:13:23 +02:00
Ines Montani	6ba5ddbd5f	Merge pull request #3864 from svlandeg/feature/nel-wiki Entity linking using Wikipedia & Wikidata	2019-07-10 11:25:41 +02:00
cedar101	58f06e6180	Korean support (#3901 ) * start lang/ko * add test codes * using natto-py * add test_ko_tokenizer_full_tags() * spaCy contributor agreement * external dependency for ko * collections.namedtuple for python version < 3.5 * case fix * tuple unpacking * add jongseong(final consonant) * apply mecab option * Remove Pipfile for now Co-authored-by: Ines Montani <ines@ines.io>	2019-07-09 22:23:16 +02:00
Ines Montani	f2ea3e3ea2	Merge branch 'master' into feature/nel-wiki	2019-07-09 21:57:47 +02:00
Joshua Smith	2eb925bd05	Added an argument to `EntityRuler` constructor to pass attrs to… (#3919 ) * Perserve flags in EntityRuler The EntityRuler (explosion/spaCy#3526) does not preserve overwrite flags (or `ent_id_sep`) when serialized. This commit adds support for serialization/deserialization preserving overwrite and ent_id_sep flags. * add signed contributor agreement * flake8 cleanup mostly blank line issues. * mark test from the issue as needing a model The test from the issue needs some language model for serialization but the test wasn't originally marked correctly. * Adds `phrase_matcher_attr` to allow args to PhraseMatcher This is an added arg to pass to the `PhraseMatcher`. For example, this allows creation of a case insensitive phrase matcher when the `EntityRuler` is created. References explosion/spaCy#3822 * remove unneeded model loading The model didn't need to be loaded, and I replaced it with a change that doesn't require it (using existings fixtures) * updated docstring for new argument * updated docs to reflect new argument to the EntityRuler constructor * change tempdir handling to be compatible with python 2.7 * return conflicted code to entityruler Some stuff got cut out because of merge conflicts, this returns that code for the phrase_matcher_attr. * fixed typo in the code added back after conflicts * flake8 compliance When I deconflicted the branch there were some flake8 issues introduced. This resolves the spacing problems. * test changes: attempts to fix flaky test in python3.5 These tests seem to be alittle flaky in 3.5 so I changed the check to avoid the comparisons that seem to be fail sometimes.	2019-07-09 20:09:17 +02:00
Joshua Smith	e8420ab2b7	Added support for serializing overwrite and ent_id_sep (#3918 ) * Perserve flags in EntityRuler The EntityRuler (explosion/spaCy#3526) does not preserve overwrite flags (or `ent_id_sep`) when serialized. This commit adds support for serialization/deserialization preserving overwrite and ent_id_sep flags. * add signed contributor agreement * flake8 cleanup mostly blank line issues. * mark test from the issue as needing a model The test from the issue needs some language model for serialization but the test wasn't originally marked correctly. * remove unneeded model loading The model didn't need to be loaded, and I replaced it with a change that doesn't require it (using existings fixtures) * change tempdir handling to be compatible with python 2.7 * Adds code to handle item saved before this change. This code chanes how the save files are handled and how the bytes are stored as well. This code adds check to dispatch correctly if it encounters bytes or files saved in the old format (and tests for those cases). * use util function for tempdir management Updated after PR comments: this code now uses the make_tempdir function from util instead of doing it by hand.	2019-07-08 17:28:28 +02:00
Rokas Ramanauskas	61ce126d4c	Lithuanian language support (#3895 ) * initial LT lang support * Added more stopwords. Started setting up some basic test environment (not complete) * Initial morph rules for LT lang * Closes #1 Adds tokenizer exceptions for Lithuanian * Closes #5 Punctuation rules. Closes #6 Lexical Attributes * test: add native examples to basic tests * feat: add tag map for lt lang * fix: remove undefined tag attribute 'Definite' * feat: add lemmatizer for lt lang * refactor: add new instances to lt lang morph rules; use tags from tag map * refactor: add morph rules to lt lang defaults * refactor: only keep nouns, verbs, adverbs and adjectives in lt lang lemmatizer lookup * refactor: add capitalized words to lt lang lemmatizer * refactor: add more num words to lt lang lex attrs * refactor: update lt lang stop word set * refactor: add new instances to lt lang tokenizer exceptions * refactor: remove comments form lt lang init file * refactor: use function instead of lambda in lt lex lang getter * refactor: remove conversion to dict in lt init when dict is already provided * chore: rename lt 'test_basic' to 'test_text' * feat: add more lt text tests * feat: add lemmatizer tests * refactor: remove unused imports, add newline to end of file * chore: add contributor agreement * chore: change 'en' to 'lt' in lt example description * fix: add missing encoding info * style: add newline to end of file * refactor: use python2 compatible syntax * style: reformat code using black	2019-07-08 10:25:22 +02:00
svlandeg	1c80b85241	fix tests	2019-06-28 08:59:23 +02:00
Ines Montani	6ccdf37574	Exclude user_data when copying doc in displaCy (closes #3882 )	2019-06-26 14:37:05 +02:00
svlandeg	8608685543	ensure Span.as_doc keeps the entity links + unit test	2019-06-25 15:28:51 +02:00
svlandeg	ddc73b11a9	fix unicode literals	2019-06-24 12:58:18 +02:00
svlandeg	b76a43bee4	unicode strings	2019-06-19 13:26:33 +02:00
svlandeg	0b0959b363	UTF8 encoding	2019-06-19 13:11:39 +02:00
svlandeg	791327e3c5	Merge remote-tracking branch 'upstream/master' into feature/nel-wiki	2019-06-19 09:44:05 +02:00
Kabir Khan	1e19f34e29	Add optional `id` property to EntityRuler patterns (#3591 ) * Adding support for entity_id in EntityRuler pipeline component * Adding Spacy Contributor aggreement * Updating EntityRuler to use string.format instead of f strings * Update Entity Ruler to support an 'id' attribute per pattern that explicitly identifies an entity. * Fixing tests * Remove custom extension entity_id and use built in ent_id token attribute. * Changing entity_id to ent_id for consistent naming * entity_ids => ent_ids * Removing kb, cleaning up tests, making util functions private, use rsplit instead of split	2019-06-16 13:29:04 +02:00
Suraj Rajan	46c78d0a41	Dependency tree pattern matcher (#3465 ) * Functional dependency tree pattern matcher * Tests fail due to inconsistent behaviour * Renamed dependencymatcher and added optimizations	2019-06-16 13:25:32 +02:00
BreakBB	d8573ee715	Update error raising for CLI pretrain to fix #3840 (#3843 ) * Add check for empty input file to CLI pretrain * Raise error if JSONL is not a dict or contains neither `tokens` nor `text` key * Skip empty values for correct pretrain keys and log a counter as warning * Add tests for CLI pretrain core function make_docs. * Add a short hint for the `tokens` key to the CLI pretrain docs * Add success message to CLI pretrain * Update model loading to fix the tests * Skip empty values and do not create docs out of it	2019-06-16 13:22:57 +02:00
Ines Montani	f35ce09776	Add regression test for #3839	2019-06-12 13:38:30 +02:00
Ines Montani	aae9034492	Tidy up [ci skip]	2019-06-12 13:38:23 +02:00
svlandeg	5c723c32c3	entity vectors in the KB + serialization of them	2019-06-05 18:29:18 +02:00
svlandeg	d83a1e3052	Merge branch 'master' into feature/nel-wiki	2019-06-03 09:35:10 +02:00
Germán	86eb817b74	Overwrites default getter for like_num in Spanish by adding _num_words and like_num to lex_attrs.py (#3810 ) (closes #3803 )) * (#3803) Spanish like_num returning false for number-like token * (#3803) Spanish like_num now returning True for number-like token	2019-06-02 12:22:57 +02:00
BreakBB	ed18a6efbd	Add check for callable to 'Language.replace_pipe' to fix #3737 (#3741 )	2019-05-14 16:59:31 +02:00
Ines Montani	8baff1c7c0	💫 Improve introspection of custom extension attributes (#3729 ) * Add custom __dir__ to Underscore (see #3707) * Make sure custom extension methods keep their docstrings (see #3707) * Improve tests * Prepend note on partial to docstring (see #3707) * Remove print statement * Handle cases where docstring is None	2019-05-12 00:53:11 +02:00
Ines Montani	505c9e0e19	Add util.filter_spans helper (#3686 )	2019-05-08 02:33:40 +02:00
svlandeg	19e8f339cb	deduce entity freq from WP corpus and serialize vocab in WP test	2019-04-29 17:37:29 +02:00
svlandeg	387263d618	simplify chains	2019-04-29 13:58:07 +02:00
svlandeg	54d0cea062	unit test for KB serialization	2019-04-24 23:52:34 +02:00
BreakBB	5b8dbe4975	Fix symlink creation to show error message on failure (#3589 ) (resolves #3307 )) * Fix symlink creation to show error message on failure. Update tests to reflect those changes. * Fix test to succeed on non windows systems.	2019-04-16 11:58:31 +02:00
svlandeg	9a7d534b1b	enable nogil for cython functions in kb.pxd	2019-04-10 17:25:10 +02:00
Ines Montani	4d198a7e92	Ensure match pattern error isn't raised on empty errors (closes #3549 )	2019-04-09 12:50:43 +02:00
Ines Montani	145c0b7e88	Tidy up and auto-format	2019-04-09 11:40:19 +02:00
Ines Montani	5f005adf61	Add xfailing test for #3555	2019-04-09 11:07:14 +02:00
Ines Montani	4faf62d515	Merge pull request #3530 from svlandeg/fix/issue_3521 Allow English stopwords with any type of apostrophe	2019-04-03 14:14:03 +02:00
Yves Peirsman	951825532c	Improved Dutch language resources and Dutch lemmatization (#3409 ) * Improved Dutch language resources and Dutch lemmatization * Fix conftest * Update punctuation.py * Auto-format * Format and fix tests * Remove unused test file * Re-add deleted test * removed redundant infix regex pattern for ','; note: brackets + simple hyphen remains * Cleaner lemmatization files	2019-04-03 14:13:26 +02:00
svlandeg	4ff786e113	addressed all comments by Ines	2019-04-03 13:50:33 +02:00
Ines Montani	6a4575a56c	Don't make "settings" or "title" required in displaCy data (closes #3531 )	2019-04-03 10:13:16 +02:00
svlandeg	85b4319f33	specify encoding in files	2019-04-02 15:05:31 +02:00
svlandeg	673c81bbb4	unicode string for python 2.7	2019-04-02 13:52:07 +02:00
svlandeg	eca9cc5417	fixing Issue #3521 by adding all hyphen variants for each stopword	2019-04-02 13:24:59 +02:00
svlandeg	e7062cf699	failing test for Issue #3521	2019-04-02 13:15:35 +02:00
svlandeg	1424b12b09	failing test for Issue #3449	2019-04-02 13:06:37 +02:00
Ines Montani	c23e234d65	Auto-format	2019-04-01 12:11:27 +02:00
Ines Montani	68900066e0	Merge pull request #3459 from svlandeg/feature/el-framework Basic framework and APIs for entity linker	2019-03-29 14:02:22 +01:00
Hiromu Hota	914b9ff3d2	Tags are joined with a comma and padded with asterisks (#3491 ) ## Description Fix a bug in the test of JapaneseTokenizer. This PR may require @polm's review. ### Types of change Bug fix ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-03-28 16:17:31 +01:00
Samuel Kane	06a1846379	fix(util): fix decaying function output (#3495 ) * fix(util): fix decaying function output * fix(util): better test and adhere to code standards * fix(util): correct variable name, pytestify test, update website text	2019-03-28 13:24:47 +01:00
Duygu Altinok	5a7bc6b39d	Fix/irreg adverbs extension (#3499 ) * extended list of irreg adverbs * added test to exceptions * fixed typo	2019-03-28 13:23:33 +01:00
Sofie	a4a6bfa4e1	Merge branch 'master' into feature/el-framework	2019-03-26 11:00:02 +01:00
svlandeg	8814b9010d	entity as one field instead of both ID and name	2019-03-25 18:10:41 +01:00
Ines Montani	06bf130890	💫 Add better and serializable sentencizer (#3471 ) * Add better serializable sentencizer component * Replace default factory * Add tests * Tidy up * Pass test * Update docs	2019-03-23 15:45:02 +01:00
Matthew Honnibal	d9a07a7f6e	💫 Fix class mismap on parser deserializing (closes #3433 ) (#3470 ) v2.1 introduced a regression when deserializing the parser after parser.add_label() had been called. The code around the class mapping is pretty confusing currently, as it was written to accommodate backwards model compatibility. It needs to be revised when the models are next retrained. Closes #3433	2019-03-23 13:46:25 +01:00
Matthew Honnibal	444a3abfe5	Add xfail test for #3433 . Improve test for add label.	2019-03-23 12:36:00 +01:00
Ines Montani	6b6e9b638e	Fix test for #3468	2019-03-23 11:24:29 +01:00
Ines Montani	fbec72b4c3	Slightly modify test for #3468 Check for Token.is_sent_start first (which is serialized/deserialized correctly)	2019-03-23 11:22:44 +01:00
Ines Montani	02d9378d8c	Add xfailing test for #3468	2019-03-23 11:19:11 +01:00
svlandeg	9de9900510	adding future import unicode literals to .py files	2019-03-22 16:18:04 +01:00
svlandeg	9751312aff	specify unicode strings for python 2.7	2019-03-22 14:15:18 +01:00
svlandeg	ec3e860b44	Merge remote-tracking branch 'upstream/master' into feature/el-framework	2019-03-22 13:47:08 +01:00
svlandeg	12d4caf341	Merge remote-tracking branch 'upstream/master' into feature/el-framework	2019-03-22 13:44:36 +01:00
Matthew Honnibal	e65b5bb9a0	Fix tokenizer on Python2.7 (#3460 ) spaCy v2.1 switched to the built-in re module, where v2.0 had been using the third-party regex library. When the tokenizer was deserialized on Python2.7, the `re.compile()` function was called with expressions that featured escaped unicode codepoints that were not in Python2.7's unicode database. Problems occurred when we had a range between two of these unknown codepoints, like this: ``` '[\\uAA77-\\uAA79]' ``` On Python2.7, the unknown codepoints are not unescaped correctly, resulting in arbitrary out-of-range characters being matched by the expression. This problem does not occur if we instead have a range between two unicode literals, rather than the escape sequences. To fix the bug, we therefore add a new compat function that unescapes unicode sequences using the `ast.literal_eval()` function. Care is taken to ensure we do not also escape non-unicode sequences. Closes #3356. - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-03-22 13:42:47 +01:00
Ines Montani	188ccd5750	Fix xfail marker	2019-03-22 12:54:14 +01:00
svlandeg	5b1cd49222	error msg and unit tests for setting kb_id on span	2019-03-22 12:05:35 +01:00
svlandeg	a48241e9a2	use nlp's vocab for stringstore	2019-03-22 11:36:45 +01:00
svlandeg	c71123dd0c	ensure no candidates are returned for unknown aliases	2019-03-22 11:36:45 +01:00
svlandeg	98ae77a682	unit test on number of candidates generated	2019-03-22 11:36:45 +01:00
svlandeg	a9074e0886	check the length of entities and probabilities vector + unit test	2019-03-22 11:36:45 +01:00
svlandeg	d133ffaff9	correct size, not counting dummy elements in the vector	2019-03-22 11:36:45 +01:00
svlandeg	33f8a0fe2e	check and unit test in case prior probs exceed 1	2019-03-22 11:36:45 +01:00
svlandeg	20a7b7b1c0	raising error when adding alias for unknown entity + unit test	2019-03-22 11:36:45 +01:00
Matthew Honnibal	d811c97da1	Fix test that caused pytest to choke on Python3	2019-03-22 10:28:51 +01:00
Matthew Honnibal	a2ad9832e5	Add failing test for #3356	2019-03-22 02:42:37 +01:00
Ines Montani	278e9d2eb0	Merge branch 'master' into feature/lemmatizer	2019-03-16 13:44:22 +01:00
Ryan Ford	00842d7f1b	Merging conversion scripts for conll formats (#3405 ) * merging conllu/conll and conllubio scripts * tabs to spaces * removing conllubio2json from converters/__init__.py * Move not-really-CLI tests to misc * Add converter test using no-ud data * Fix test I broke * removing include_biluo parameter * fixing read_conllx * remove include_biluo from convert.py	2019-03-15 18:14:46 +01:00
Ines Montani	bec8db91e6	Add actual deprecation warning for n_threads (resolves #3410 )	2019-03-15 16:38:44 +01:00
Sofie	c45ed32c74	label in span not writable anymore (#3408 ) * label in span not writable anymore * more explicit unit test and error message for readonly label * bit more explanation (view) * error msg tailored to specific case * fix None case	2019-03-15 00:46:45 +01:00
Ines Montani	479b5cff43	Auto-format [ci skip]	2019-03-12 13:35:34 +01:00
Ines Montani	886e5966c0	Update test_displacy.py	2019-03-11 19:03:52 +01:00
Ines Montani	4bd2688eac	💫 Fix displaCy support for RTL languages (#3393 ) Closes #2091. ## Description With the new `vocab.writing_system` property introduced in #3390 (exposed via the language defaults), I was able to finally fix this (I think!). Based on the `Doc`, dispaCy now detects whether it's a RTL or LTR language and adjusts the visualization accordingly. Wherever possible, I've also added `direction` and `lang` attributes. Entity visualization now looks like this: <img width="318" alt="Screenshot 2019-03-11 at 16 06 51" src="https://user-images.githubusercontent.com/13643239/54136866-d97afd80-441c-11e9-8c27-3d46994cc833.png"> And dependencies like this (ignore the most likely incorrect tags and dependencies): <img width="621" alt="Screenshot 2019-03-11 at 16 51 59" src="https://user-images.githubusercontent.com/13643239/54137771-8b66f980-441e-11e9-8460-0682b95eef2a.png"> ### Types of change enhancement, bug fix ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-03-11 18:52:50 +01:00
Matthew Honnibal	b0b990e405	Fix token.conjuncts (closes #795 ) (#3392 ) * Implement conjuncts method * Add span.conjuncts property * Un-xfail token.conjuncts tests * Update docs for token.conjuncts and span.conjuncts * Fix merge error in token.conjuncts	2019-03-11 17:05:45 +01:00
Matthew Honnibal	e2b9b523ce	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2019-03-11 15:59:28 +01:00
Matthew Honnibal	db79a704bf	Add xfail tests for token.conjuncts	2019-03-11 15:46:52 +01:00
Ines Montani	c3df4d1108	Move displaCy tests to own file	2019-03-11 15:28:34 +01:00
Ines Montani	c5a407e95a	Fix code style	2019-03-11 15:28:22 +01:00
Matthew Honnibal	39a4741e26	Add support for vocab.writing_system property (#3390 ) * Add xfail test for vocab.writing_system * Add vocab.writing_system property * Set Language.Defaults.writing_system * Set default writing system * Remove xfail on test_vocab_writing_system	2019-03-11 15:23:20 +01:00
Ines Montani	ebcf2bb1c3	Add Doc.lang and Doc.lang_	2019-03-11 14:21:40 +01:00
Ines Montani	c399162a82	Tidy up	2019-03-11 13:34:14 +01:00
Ines Montani	7c05ca01e8	💫 Support mutable default values for extension attributes (#3389 ) * Support mutable default values in extensions * Update documentation	2019-03-11 12:50:44 +01:00
Matthew Honnibal	80b94313b6	💫 Fix interaction of lemmatizer and tokenizer exceptions (#3388 ) Closes #2203. Closes #3268. Lemmas set from outside the `Morphology` class were being overwritten. The result was especially confusing when deserialising, as it meant some lemmas could change when storing and retrieving a `Doc` object. This PR applies two fixes: 1) When we go to set the lemma in the `Morphology` class, first check whether a lemma is already set. If so, don't overwrite. 2) When we load with `doc.from_array()`, take care to apply the `TAG` field first. This allows other fields to overwrite the `TAG` implied properties, if they're provided explicitly (e.g. the `LEMMA`). ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-03-11 01:31:21 +01:00
Ines Montani	8f45ff3dc2	Adjust formatting [ci skip]	2019-03-11 00:47:41 +01:00
Ines Montani	7ba3a5d95c	💫 Make serialization methods consistent (#3385 ) * Make serialization methods consistent exclude keyword argument instead of random named keyword arguments and deprecation handling * Update docs and add section on serialization fields	2019-03-10 19:16:45 +01:00
Ines Montani	67e38690d4	Un-xfail passing tests and tidy up	2019-03-10 18:42:16 +01:00
Matthew Honnibal	27dd820753	Fix vocab deserialization when loading already present lexemes (#3383 ) * Fix vocab deserialization bug. Closes #2153 * Un-xfail test for #2153	2019-03-10 17:21:19 +01:00
Matthew Honnibal	61e5ce02a4	Add xfailing test for #2153	2019-03-10 16:36:29 +01:00
Matthew Honnibal	8a6272f842	Un-xfail test	2019-03-10 15:51:15 +01:00
Ines Montani	0426689db8	💫 Improve Doc.to_json and add Doc.is_nered (#3381 ) * Use default return instead of else * Add Doc.is_nered to indicate if entities have been set * Add properties in Doc.to_json if they were set, not if they're available This way, if a processed Doc exports "pos": None, it means that the tag was explicitly unset. If it exports "ents": [], it means that entity annotations are available but that this document doesn't contain any entities. Before, this would have been unclear and problematic for training.	2019-03-10 15:24:34 +01:00
Ines Montani	7984543953	Add xfailing test for to_array/from_array string attrs	2019-03-10 15:08:15 +01:00
Ines Montani	6bbf4ea309	Simplify tests and avoid tokenizing	2019-03-10 15:05:56 +01:00
Matthew Honnibal	a5b1f6dcec	Fix NER when preset entities cross sentence boundaries (#3379 ) 💫 Fix NER when preset entities cross sentence boundaries	2019-03-10 14:53:03 +01:00
Matthew Honnibal	231bc7bb7b	Add xfailing test for #3345	2019-03-10 13:00:15 +01:00
Ines Montani	ad834be494	Tidy up and auto-format	2019-03-08 13:28:53 +01:00
Ines Montani	d260aa17fd	Merge branch 'develop' into feature/lemmatizer	2019-03-08 13:25:00 +01:00
Matthew Honnibal	19e6b39786	Test morphological features	2019-03-08 01:38:54 +01:00
Matthew Honnibal	3c32590243	Add test for morph analysis	2019-03-08 00:10:07 +01:00
Matthew Honnibal	fed0371db7	Remove enums from morphology	2019-03-07 17:14:57 +01:00
Ines Montani	96b91a8898	Fix noqa [ci skip]	2019-03-07 12:25:00 +01:00
Matthew Honnibal	3993f41cc4	Update morphology branch from develop	2019-03-07 00:14:43 +01:00
Ines Montani	533b580c19	Add test for stray print statements in languages (see #3342 )	2019-02-27 16:04:30 +01:00
Ines Montani	9b62639d19	Auto-format [ci skip]	2019-02-27 14:24:55 +01:00
Matthew Honnibal	f1d77eb140	💫 Improve handling of missing NER tags (closes #2603 ) (#3341 ) * Improve handling of missing NER tags GoldParse can accept missing NER tags, if entities is provided in BILUO format (rather than as spans). Missing tags can be provided as None values. Fix bug that occurred when first tag was a None value. Closes #2603. * Document specification of missing NER tags.	2019-02-27 12:06:32 +01:00
Ines Montani	e359bdd0e3	Auto-format	2019-02-27 11:56:45 +01:00
Matthew Honnibal	4a3371acd5	Make doc[0].is_sent_start == True (closes #2869 ) (#3340 ) * Make doc[0] have sent_start True. Closes #2869 * Document that doc[0].is_sent_start defaults True.	2019-02-27 11:17:17 +01:00
Matthew Honnibal	2d3ce89b78	Improve matcher tests re issue #3328	2019-02-27 10:25:56 +01:00
Matthew Honnibal	8d6954e0e7	Fix matcher bug #3328	2019-02-27 10:25:39 +01:00
Ines Montani	aadf586789	Add xfailing test for #3331	2019-02-25 22:33:30 +01:00
Ines Montani	f135d663f7	Update conftest.py	2019-02-25 15:55:29 +01:00
Ines Montani	76ce8b2662	Merge branch 'master' into develop	2019-02-25 15:54:55 +01:00
Julia Makogon	f1c3108d52	Fixing pymorphy2 dependency issue (#3329 ) (closes #3327 ) * Classes for Ukrainian; small fix in Russian. * Contributor agreement * pymorphy2 initialization split for ru and uk (#3327) * stop-words fixed * Unit-tests updated	2019-02-25 15:48:17 +01:00
Ines Montani	1a735e0f1f	Add regression test for #3328	2019-02-25 10:12:58 +01:00
Ines Montani	62b558ab72	💫 Support lexical attributes in retokenizer attrs (closes #2390 ) (#3325 ) * Fix formatting and whitespace * Add support for lexical attributes (closes #2390) * Document lexical attribute setting during retokenization * Assign variable outside of nested loop	2019-02-24 21:13:51 +01:00
Ines Montani	a48deb4081	Merge regression tests	2019-02-24 21:03:39 +01:00
Ines Montani	8f6c193a4d	Delete _test_issue1622.py	2019-02-24 20:33:31 +01:00
Ines Montani	c8e967c78d	Try include previously segfaulting test	2019-02-24 20:32:46 +01:00
Ines Montani	328b589deb	Merge regression tests	2019-02-24 20:31:38 +01:00
Ines Montani	3bc53905cc	Remove print statements from test	2019-02-24 20:31:15 +01:00
Ines Montani	1ae0df3da9	Un-x-fail passing test	2019-02-24 20:24:15 +01:00
Ines Montani	399a5803d0	Tidy up tests [ci skip]	2019-02-24 19:02:16 +01:00
Ines Montani	df19e2bff6	💫 Allow setting of custom attributes during retokenization (closes #3314 ) (#3324 )	2019-02-24 18:38:47 +01:00

## Description

This PR adds the ability to override custom extension attributes during merging. This will only work for attributes that are writable, i.e. attributes registered with a default value like `default=False` or attributes that have both a getter and a setter implemented.

```python
from spacy.tokens import Token

# Assumes `nlp` is an already loaded spaCy pipeline.
Token.set_extension('is_musician', default=False)

doc = nlp("I like David Bowie.")
with doc.retokenize() as retokenizer:
    # The "_" key carries values for custom extension attributes.
    attrs = {"LEMMA": "David Bowie", "_": {"is_musician": True}}
    retokenizer.merge(doc[2:4], attrs=attrs)
assert doc[2].text == "David Bowie"
assert doc[2].lemma_ == "David Bowie"
assert doc[2]._.is_musician
```

### Types of change

enhancement

## Checklist

- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
Ines Montani	d8f69d592f	Tidy up retokenizer tests	2019-02-24 14:14:11 +01:00
Ines Montani	723e27cb8c	Tidy up tests	2019-02-24 14:11:23 +01:00
Ines Montani	80bdcb99c5	Fix escaping of HTML in displacy ENT (closes #2728 )	2019-02-21 14:30:39 +01:00
Matthew Honnibal	c5f947f194	Fix regex deprecation warnings	2019-02-21 11:56:47 +01:00
Matthew Honnibal	80195bc2d1	Fix issue #3288 (#3308 )	2019-02-21 09:48:53 +01:00
Matthew Honnibal	a137e8b418	Fix Pipe.to_bytes() when model uninitialized Closes #3289	2019-02-21 09:42:02 +01:00
Sofie	9a478b6db8	Clean up of char classes, few tokenizer fixes and faster default French tokenizer (#3293 ) * splitting up latin unicode interval * removing hyphen as infix for French * adding failing test for issue 1235 * test for issue #3002 which now works * partial fix for issue #2070 * keep the hyphen as infix for French (as it was) * restore french expressions with hyphen as infix (as it was) * added succeeding unit test for Issue #2656 * Fix issue #2822 with custom Italian exception * Fix issue #2926 by allowing numbers right before infix / * remove duplicate * remove xfail for Issue #2179 fixed by Matt * adjust documentation and remove reference to regex lib	2019-02-20 22:10:13 +01:00
Matthew Honnibal	0d1ca15b13	💫 Fix bugs in matcher extensions. Closes #1971 (#3301 ) * Fix matching on extension attrs and predicates * Fix detection of match_id when using extension attributes. The match ID is stored as the last entry in the pattern. We were checking for this with nr_attr == 0, which didn't account for extension attributes. * Fix handling of predicates. The wrong count was being passed through, so even patterns that didn't have a predicate were being checked. * Fix regex pattern * Fix matcher set value test	2019-02-20 21:30:39 +01:00
Ines Montani	3b667787a9	Add xfailing test for #3289	2019-02-18 16:45:04 +01:00
Ines Montani	91f260f2c4	Add another test for #1971	2019-02-18 13:36:20 +01:00
Ines Montani	f30aac324c	Update test_issue1971.py	2019-02-18 13:36:15 +01:00
Ines Montani	8fa26ca97e	Fix tensor shape in test for #3288	2019-02-18 11:01:54 +01:00
Ines Montani	c32290557f	Add xfailing test for #3288	2019-02-18 10:59:31 +01:00
Ines Montani	3af0b2dd1c	Add xfailing test for #1971 [ci skip]	2019-02-17 13:04:47 +01:00
Ines Montani	1e252b129c	Auto-format	2019-02-17 12:22:07 +01:00
Matthew Honnibal	92b6bd2977	Refinements to retokenize.split() function (#3282 ) * Change retokenize.split() API for heads * Pass lists as values for attrs in split * Fix test_doc_split filename * Add error for mismatched tokens after split * Raise error if new tokens don't match text * Fix doc test * Fix error * Move deps under attrs * Fix split tests * Fix retokenize.split	2019-02-15 17:32:31 +01:00
Ines Montani	1aa57690dc	Add xfailing test for orth mismatch in retokenizer.split	2019-02-15 13:55:04 +01:00
Ines Montani	819768483f	Add xfailing test for out-of-bounds heads	2019-02-15 13:09:07 +01:00
Ines Montani	d8051e89ca	Tidy up tests	2019-02-15 12:56:51 +01:00
Ines Montani	c31a9dabd5	💫 Add en/em dash to prefixes and suffixes (#3281 ) * Auto-format * Add en/em dash to prefixes and suffixes	2019-02-15 10:29:59 +01:00
Ines Montani	5651a0d052	💫 Replace {Doc,Span}.merge with Doc.retokenize (#3280 ) * Add deprecation warning to Doc.merge and Span.merge * Replace {Doc,Span}.merge with Doc.retokenize	2019-02-15 10:29:44 +01:00
Ines Montani	f146121092	💫 Make handling of [Pipe].labels consistent (#3273 ) * Make handling of [Pipe].labels consistent * Un-xfail passing test * Update spacy/pipeline/pipes.pyx Co-Authored-By: ines <ines@ines.io> * Update spacy/pipeline/pipes.pyx Co-Authored-By: ines <ines@ines.io> * Update spacy/tests/pipeline/test_pipe_methods.py Co-Authored-By: ines <ines@ines.io> * Update spacy/pipeline/pipes.pyx Co-Authored-By: ines <ines@ines.io> * Move error message to spacy.errors * Fix textcat labels and test * Make EntityRuler.labels return tuple as well	2019-02-15 06:03:19 +11:00
Ines Montani	3d577b77c6	Auto-formatting	2019-02-14 19:56:38 +01:00
Ines Montani	e104e47c21	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2019-02-14 15:35:34 +01:00
Ines Montani	0cd01a8c5e	Merge branch 'master' into develop	2019-02-14 15:35:20 +01:00
Ines Montani	2e31921d0a	💫 Add base Language classes for more languages (#3276 ) * Add base classes for more languages * Add test for language class initialization Make sure each language can be initialized; otherwise, it's difficult to catch serious errors in the test suite, because languages are lazy-loaded	2019-02-15 01:31:19 +11:00
Grivaz	39815513e2	Add split one token into several (resolves #2838 ) (#3253 ) * Add split one token into several (resolves #2838) * Improve error message for token splitting * Make retokenizer.split() tests use a Token object Change retokenizer.split() to use a Token object, instead of an index. * Pass Token into retokenize.split() Tweak retokenize.split() API so that we pass the `Token` object, not the index. * Fix token.idx in retokenize.split() * Test that token.idx is correct after split * Fix token.idx for split tokens * Fix retokenize.split() * Fix retokenize.split * Fix retokenize.split() test	2019-02-15 01:27:13 +11:00
Ines Montani	743ecf728c	Tidy up conftest	2019-02-14 13:27:13 +01:00
Ines Montani	4d2438f985	Tidy up and auto-format	2019-02-13 15:29:08 +01:00
Ines Montani	fbf9f1edf1	Also raise error in Span.__reduce__	2019-02-13 13:22:05 +01:00
Ines Montani	2d0c3c73f4	Raise better error if token is pickled (resolves #2833 ) (#3267 )	2019-02-13 11:27:04 +01:00
Ines Montani	b589b945db	Fix PhraseMatcher pickling and length (resolves #3248 ) (#3252 )	2019-02-12 18:27:54 +01:00
Ines Montani	483dddc9bc	💫 Add token match pattern validation via JSON schemas (#3244 ) * Add custom MatchPatternError * Improve validators and add validation option to Matcher * Adjust formatting * Never validate in Matcher within PhraseMatcher If we do decide to make validate default to True, the PhraseMatcher's Matcher shouldn't ever validate. Here, we create the patterns automatically anyways (and it's currently unclear whether the validation has performance impacts at a very large scale).	2019-02-13 01:47:26 +11:00
Ines Montani	ad2a514cdf	Show warning if phrase pattern Doc was overprocessed (#3255 ) In most cases, the PhraseMatcher will match on the verbatim token text or as of v2.1, sometimes the lowercase text. This means that we only need a tokenized Doc, without any other attributes. If phrase patterns are created by processing large terminology lists with the full `nlp` object, this easily can make things a lot slower, because all components will be applied, even if we don't actually need the attributes they set (like part-of-speech tags, dependency labels). The warning message also includes a suggestion to use nlp.make_doc or nlp.tokenizer.pipe for even faster processing. For now, the validation has to be enabled explicitly by setting validate=True.	2019-02-13 01:45:31 +11:00
Matthew Honnibal	6ec834dc72	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2019-02-13 01:14:44 +11:00
Matthew Honnibal	43fa039d96	xfail regression test for model labels	2019-02-13 01:14:26 +11:00
Matthew Honnibal	bc300d4e31	Add test for issue 3209	2019-02-13 01:13:01 +11:00
Ines Montani	34a3cc26a9	Add xfailing test for reverse pattern (see #1971 )	2019-02-12 14:49:59 +01:00
Ines Montani	fe39fd4d13	Make warning tests more explicit	2019-02-10 14:02:19 +01:00
Ines Montani	e7593b791e	Fix import	2019-02-08 20:50:52 +01:00
Ines Montani	0754b848fe	Actually xfail test for #1971	2019-02-08 20:50:35 +01:00
Ines Montani	414a69b736	Add xfailing test (see #1971 , #2675 , #2671 )	2019-02-08 20:50:01 +01:00
Ines Montani	ea07f3022e	Only run noun chunks iterator in Span if available (closes #3199 )	2019-02-08 18:33:16 +01:00
Ines Montani	586c56fc6c	Tidy up regression tests	2019-02-08 15:51:13 +01:00
Ines Montani	25602c794c	Tidy up and fix small bugs and typos	2019-02-08 14:14:49 +01:00
Ines Montani	9e652afa4b	Merge branch 'master' into develop	2019-02-08 13:28:09 +01:00
Stanisław Giziński	1448ad100c	Improved Polish tokenizer and stop words. (#2974 ) * Improved stop words list * Removed some wrong stop words from list * Improved stop words list * Removed some wrong stop words from list * Improved Polish Tokenizer (#38) * Add tests for Polish tokenizer * Add Polish tokenizer exceptions * Don't split any words containing hyphens * Fix test case with wrong model answer * Remove commented out line of code until better solution is found * Add source srx' license * Rename exception_list.py to match spaCy conventionality * Add a brief explanation of where the exception list comes from * Add newline after each exception * Rename COPYING.txt to LICENSE * Delete old files * Add header to the license * Agreements signed * Stanisław Giziński agreement * Krzysztof Kowalczyk - signed agreement * Mateusz Olko agreement * Add DoomCoder's contributor agreement * Improve like number checking in Polish lang * like num tests added * all from SI system added * Final licence and removed splitting exceptions * Added Polish stop words to LEX_ATTRS * Add encoding info to pl tokenizer exceptions	2019-02-08 14:27:21 +11:00
Ines Montani	e2d93e4852	Merge branch 'master' into develop	2019-02-07 21:10:08 +01:00
Julia Makogon	b41d64825a	Ukrainian language added. Small fixes in Russian (#3241 ) * Classes for Ukrainian; small fix in Russian. * Contributor agreement	2019-02-07 21:05:11 +01:00
Ines Montani	5d0b60999d	Merge branch 'master' into develop	2019-02-07 20:54:07 +01:00
Ines Montani	338d659bd0	Store JSON schemas in Python and tidy up (#3235 )	2019-02-07 19:44:31 +11:00
Ines Montani	a9bf5d9fd8	Add xfailing test for set value with operator [ci skip]	2019-02-06 13:40:11 +01:00
Ines Montani	e51a238b3f	Auto-format	2019-02-06 13:32:18 +01:00
Ines Montani	f25bd9f5e4	Add gold.spans_from_biluo_tags helper (#3227 )	2019-02-06 21:50:26 +11:00
Sofie	9745b0d523	Improve Italian & Urdu tokenization accuracy (#3228 ) ## Description 1. Added the same infix rule as in French (`d'une`, `j'ai`) for Italian (`c'è`, `l'ha`), bringing F-score on `it_isdt-ud-train.txt` from 96% to 99%. Added unit test to check this behaviour. 2. Added specific Urdu punctuation character as suffix, improving F-score on `ur_udtb-ud-train.txt` from 94% to 100%. Added unit test to check this behaviour. ### Types of change Enhancement of Italian & Urdu tokenization ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-02-04 22:39:25 +01:00
Sofie	a3efa3e8d9	Improve Catalan tokenization accuracy (#3225 ) * small hyphen clean up for French * catalan infix similar to french	2019-02-04 20:37:19 +11:00
Sofie	46dfe773e1	Replacing regex library with re to increase tokenization speed (#3218 ) * replace unicode categories with raw list of code points * simplifying ranges * fixing variable length quotes * removing redundant regular expression * small cleanup of regexp notations * quotes and alpha as ranges instead of alterations * removed most regexp dependencies and features * exponential backtracking - unit tests * rewrote expression with pathological backtracking * disabling double hyphen tests for now * test additional variants of repeating punctuation * remove regex and redundant backslashes from load_reddit script * small typo fixes * disable double punctuation test for russian * clean up old comments * format block code * final cleanup * naming consistency * french strings as unicode for python 2 support * french regular expression case insensitive	2019-02-01 18:05:22 +11:00
foufaster	8bd85fd9d5	Fix french lemmatization (#3180 )	2019-01-27 06:01:30 +01:00
Matthew Honnibal	77ddcf7381	💫 Update matcher engine for regex and extensions (#3173 ) * Update matcher engine for regex and extensions Add support for matching over arbitrary Python predicate functions, and arbitrary Python attribute getters. This will allow matching over regex patterns, and allow supporting extension attributes. The results of the Python predicate functions are cached, so that we don't call the same predicate function twice for the same token. The extension attributes are fetched into an array for each token in the doc. This should minimise the performance impact of the new features. We still need to wire up these features to the patterns, and test it all. * Work on wiring up extra attributes in matcher * Work on tests for extra matcher attrs * Add support for extension attrs to matcher * Test extension attribute matching * Work on implementing predicate-based match patterns * Get predicates working for set membership * Add test for set membership * Make extensions+predicates work * Test matcher extensions * Cache predicate results better in Matcher * Remove print statement in matcher test * Use srsly to get key for predicates	2019-01-21 13:23:15 +01:00
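
The predicate caching described in the commit above can be illustrated with a minimal pure-Python sketch. It does not use spaCy, and the class name `PredicateCache` and its `check` method are hypothetical, introduced only for illustration; the actual matcher engine is implemented differently. The idea shown is just that a predicate's result is stored per (predicate, token) pair so the same predicate never runs twice on the same token.

```python
class PredicateCache:
    """Hypothetical helper: memoise predicate results per (predicate, token)."""

    def __init__(self, predicates):
        self.predicates = predicates  # list of callables taking one token
        self.results = {}             # (predicate index, token index) -> bool
        self.calls = 0                # how often a predicate actually ran

    def check(self, pred_i, tok_i, token):
        key = (pred_i, tok_i)
        if key not in self.results:
            # First request for this pair: run the predicate and remember it.
            self.calls += 1
            self.results[key] = bool(self.predicates[pred_i](token))
        return self.results[key]


tokens = ["Apple", "pie", "Apple"]
cache = PredicateCache([str.istitle, lambda t: t in {"pie", "cake"}])

# Two passes over the same tokens; the second pass is served from the cache.
first = [cache.check(0, i, t) for i, t in enumerate(tokens)]
second = [cache.check(0, i, t) for i, t in enumerate(tokens)]
```

Keying on the token index rather than the token text means two identical strings at different positions are still evaluated separately, which matters if a predicate depends on more than the text.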
Björn Lennartsson	b892b446cc	Updates to Swedish Language (#3164 ) * Added the same punctuation rules as Danish language. * Added abbreviations and also the possibility to have capitalized abbreviations on some. Added a few specific cases too * Added test for long texts in Swedish * Added morph rules, infixes and suffixes to __init__.py for Swedish * Added some tests for prefixes, infixes and suffixes * Added tests for lemma * Renamed files to follow convention * [sv] Removed ambiguous abbreviations * Added more tests for tokenizer exceptions * Added test for problem with punctuation in issue #2578 * Contributor agreement * Removed faulty lemmatization of 'jag' ('I') as it was lemmatized to 'jaga' ('hunt')	2019-01-16 13:45:50 +01:00
Álvaro Abella Bascarán	e03e1eee92	Bugfix/get lca matrix (#3110 ) This PR adds a test for an untested case of `Span.get_lca_matrix`, and fixes a bug for that scenario, which I introduced in [this PR](https://github.com/explosion/spaCy/pull/3089) (sorry!). ## Description The previous implementation of get_lca_matrix was failing for the case `doc[j:k].get_lca_matrix()` where `j > 0`. A test has been added for this case and the bug has been fixed. ### Types of change Bug fix ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-01-06 19:07:50 +01:00
Matthew Honnibal	3c09d3d986	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-30 15:49:57 +01:00
Matthew Honnibal	bf20252ae0	Update test for #3012	2018-12-30 15:46:46 +01:00
Matthew Honnibal	63b7accd74	💫 Make span.as_doc() return a copy, not a view. Closes #1537 (#3107 ) Initially span.as_doc() was designed to return a view of the span's contents, as a Doc object. This was a nice idea, but it fails due to the token.idx property, which refers to the character offset within the string. In a span, the idx of the first token might not be 0. Because this data is different, we can't have a view --- it'll be inconsistent. This patch changes span.as_doc() to instead return a copy. The docs are updated accordingly. Closes #1537 * Update test for span.as_doc() * Make span.as_doc() return a copy. Closes #1537 * Document change to Span.as_doc()	2018-12-30 15:17:46 +01:00
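
The reason a view cannot work, as described in the commit above, can be sketched without spaCy. The functions `tokenize` and `as_doc` below are hypothetical stand-ins, assuming tokens are plain dicts with a `text` and a character offset `idx`; they are not the library's implementation. The sketch shows that a span's first token keeps its offset into the original string, so a standalone document has to copy the tokens and rebase the offsets.

```python
def tokenize(text):
    """Split on single spaces, recording each token's character offset."""
    tokens, offset = [], 0
    for word in text.split(" "):
        tokens.append({"text": word, "idx": offset})
        offset += len(word) + 1  # +1 for the separating space
    return tokens


def as_doc(span_tokens):
    """Copy the span's tokens, rebasing idx so the first token starts at 0."""
    base = span_tokens[0]["idx"]
    return [{"text": t["text"], "idx": t["idx"] - base} for t in span_tokens]


doc = tokenize("I like New York")
span = doc[2:4]         # shares token data with doc, so idx still points into doc
new_doc = as_doc(span)  # independent copy with offsets starting at 0
```

Because `new_doc` holds different `idx` values than `doc` for the same tokens, the two cannot share one underlying buffer, which is the inconsistency the commit message refers to.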
Matthew Honnibal	72e4d3782a	Resize doc.tensor when merging spans. Closes #1963 (#3106 ) The doc.retokenize() context manager wasn't resizing doc.tensor, leading to a mismatch between the number of tokens in the doc and the number of rows in the tensor. We fix this by deleting rows from the tensor. Merged spans are represented by the vector of their last token. * Add test for resizing doc.tensor when merging * Add test for resizing doc.tensor when merging. Closes #1963 * Update get_lca_matrix test for develop * Fix retokenize if tensor unset	2018-12-30 15:17:17 +01:00
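
The tensor resizing rule from the commit above (delete rows, keep the vector of each merged span's last token) can be sketched in plain Python. The function name `merge_tensor_rows` is hypothetical, the tensor is modelled as a list of rows rather than an array, and the spans are assumed to be non-overlapping `(start, end)` token ranges with an exclusive end.

```python
def merge_tensor_rows(tensor, spans):
    """Collapse each (start, end) span to the row of its last token.

    Assumes spans do not overlap; `end` is exclusive.
    """
    drop = set()
    for start, end in spans:
        # Drop every row of the span except the last one (index end - 1).
        drop.update(range(start, end - 1))
    return [row for i, row in enumerate(tensor) if i not in drop]


tensor = [[0.0], [1.0], [2.0], [3.0]]          # one row per token
merged = merge_tensor_rows(tensor, [(1, 3)])   # merge tokens 1 and 2
```

After the merge the row count matches the new token count (three tokens, three rows), which is the invariant the fix restores.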
Matthew Honnibal	3d64eb4a74	Update get_lca_matrix test for develop	2018-12-30 14:28:07 +01:00
Matthew Honnibal	ac9e3a4a8b	Add test for #1773	2018-12-30 13:16:05 +01:00
Kirill Bulygin	b665a32b95	Enabling `tests/lang/ru/test_lemmatizer.py`, fixing a `unicode` issue (#3084 ) ## Description See #3079. Here I'm merging into `develop` instead of `master`. ### Types of change Bug fix. ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-12-30 12:10:26 +01:00
Álvaro Abella Bascarán	9bc4cc1352	Fix issue 2396 (#3089 ) * Test on #2396: bug in Doc.get_lca_matrix() * reimplementation of Doc.get_lca_matrix(), (closes #2396) * reimplement Span.get_lca_matrix(), and call it from Doc.get_lca_matrix() * tests Span.get_lca_matrix() as well as Doc.get_lca_matrix() * implement _get_lca_matrix as a helper function in doc.pyx; call it from Doc.get_lca_matrix and Span.get_lca_matrix * use memory view instead of np.ndarray in _get_lca_matrix (faster) * fix bug when calling Span.get_lca_matrix; return lca matrix as np.array instead of memoryview * cleaner conditional, add comment	2018-12-29 18:05:52 +01:00
Álvaro Abella Bascarán	6fe276f85d	Fix issue 2396 (#3089 ) * Test on #2396: bug in Doc.get_lca_matrix() * reimplementation of Doc.get_lca_matrix(), (closes #2396) * reimplement Span.get_lca_matrix(), and call it from Doc.get_lca_matrix() * tests Span.get_lca_matrix() as well as Doc.get_lca_matrix() * implement _get_lca_matrix as a helper function in doc.pyx; call it from Doc.get_lca_matrix and Span.get_lca_matrix * use memory view instead of np.ndarray in _get_lca_matrix (faster) * fix bug when calling Span.get_lca_matrix; return lca matrix as np.array instead of memoryview * cleaner conditional, add comment	2018-12-29 18:02:26 +01:00
Matthew Honnibal	174e85439b	Fix behaviour of Matcher's ? quantifier for v2.1 (#3105 ) * Add failing test for matcher bug #3009 * Deduplicate matches from Matcher * Update matcher ? quantifier test * Fix bug with ? quantifier in Matcher The ? quantifier indicates a token may occur zero or one times. If the token pattern fit, the matcher would fail to consider valid matches where the token pattern did not fit. Consider a simple regex like: .?b If we have the string 'b', the .? part will fit --- but then the 'b' in the pattern will not fit, leaving us with no match. The same bug left us with too few matches in some cases. For instance, consider: .?.? If we have a string of length two, like 'ab', we actually have three possible matches here: [a, b, ab]. We were only recovering 'ab'. This should now be fixed. Note that the fix also uncovered another bug, where we weren't deduplicating the matches. There are actually two ways we might match 'a' and two ways we might match 'b': as the second token of the pattern, or as the first token of the pattern. This ambiguity is spurious, so we need to deduplicate. Closes #2464 and #3009 * Fix Python2	2018-12-29 16:18:09 +01:00
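
The two examples in the commit above (`.?b` on 'b', and `.?.?` on 'ab') can be reproduced with a small pure-Python sketch. This is not the Matcher's algorithm; `find_matches` is a hypothetical brute-force enumerator, assuming a pattern is a list of `(predicate, optional)` pairs. It tries both branches for every optional token and collects spans in a set, which removes the spurious duplicates the message mentions.

```python
def find_matches(pattern, tokens):
    """Return sorted, deduplicated (start, end) spans where the pattern fits.

    `pattern` is a list of (predicate, optional) pairs; an optional entry
    behaves like the ? quantifier (zero or one token).
    """
    found = set()  # a set deduplicates spans reached via different paths

    def walk(p, start, end):
        if p == len(pattern):
            if end > start:  # ignore empty matches
                found.add((start, end))
            return
        predicate, optional = pattern[p]
        if optional:
            walk(p + 1, start, end)  # branch 1: the token occurs zero times
        if end < len(tokens) and predicate(tokens[end]):
            walk(p + 1, start, end + 1)  # branch 2: the token is consumed

    for start in range(len(tokens)):
        walk(0, start, start)
    return sorted(found)


any_token = lambda t: True
is_b = lambda t: t == "b"

one = find_matches([(any_token, True), (is_b, False)], ["b"])            # .?b on 'b'
two = find_matches([(any_token, True), (any_token, True)], ["a", "b"])  # .?.? on 'ab'
```

Here `one` holds the single span for 'b', and `two` holds the three spans for 'a', 'ab' and 'b', even though 'a' and 'b' are each reachable along two different paths through the pattern.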
Ines Montani	ca244f5f84	Small fixes to displaCy (#3076 ) ## Description - [x] fix auto-detection of Jupyter notebooks (even if `jupyter=True` isn't set) - [x] add `displacy.set_render_wrapper` method to define a custom function called around the HTML markup generated in all calls to `displacy.render` (can be used to allow custom integrations, callbacks and page formatting) - [x] add option to customise host for web server - [x] show warning if `displacy.serve` is called from within Jupyter notebooks - [x] move error message to `spacy.errors.Errors`. ### Types of change enhancement ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-12-20 17:32:04 +01:00
Muhammad Irfan	2e84ec1513	Fixed ISO code for Urdu. (#3073 )	2018-12-20 12:28:53 +01:00
Ken	5f0c5fbfa4	issue #3012 : add test (#3021 ) * issue #3012: add test * add contributor agreement * Make test work without models and fix typos ten.pos_ instead of ten.orth_ and comparison against "10" instead of integer 10	2018-12-18 15:02:49 +01:00
Kirill Bulygin	2fb004832f	Fix the first `nlp` call for `ja` (closes #2901 ) (#3065 ) * Fix the first `nlp` call for `ja` (closes #2901) * Add unicode declaration, formatting and use relative import	2018-12-18 15:01:06 +01:00
Kirill Bulygin	10189d9092	Fix the first `nlp` call for `ja` (closes #2901 ) (#3065 ) * Fix the first `nlp` call for `ja` (closes #2901) * Add unicode declaration, formatting and use relative import	2018-12-18 14:53:50 +01:00
Ines Montani	ae880ef912	Tidy up merge conflict leftovers	2018-12-18 13:58:30 +01:00

1522 Commits