spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-11 20:28:20 +03:00

Author	SHA1	Message	Date
Adriane Boyd	e4a1b5dab1	Rename to url_match Rename to `url_match` and update docs.	2020-05-22 12:41:03 +02:00
Adriane Boyd	565e0eef73	Add tokenizer option for token match with affixes To fix the slow tokenizer URL (#4374) and allow `token_match` to take priority over prefixes and suffixes by default, introduce a new tokenizer option for a token match pattern that's applied after prefixes and suffixes but before infixes.	2020-05-05 10:35:33 +02:00
Adriane Boyd	792c8af8cf	Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match	2020-05-05 09:25:57 +02:00
Adriane Boyd	bc39f97e11	Simplify warnings	2020-04-28 13:37:37 +02:00
adrianeboyd	c981aa6684	Use inline flags in token_match patterns (#5257) * Use inline flags in token_match patterns Use inline flags in `token_match` patterns so that serializing does not lose the flag information. * Modify inline flag * Modify inline flag	2020-04-06 13:19:04 +02:00
Adriane Boyd	1139247532	Revert changes to token_match priority from #4374 * Revert changes to priority of `token_match` so that it has priority over all other tokenizer patterns * Add lookahead and potentially slow lookbehind back to the default URL pattern * Expand character classes in URL pattern to improve matching around lookaheads and lookbehinds related to #4882 * Revert changes to Hungarian tokenizer * Revert (xfail) several URL tests to their status before #4374 * Update `tokenizer.explain()` and docs accordingly	2020-03-09 12:09:41 +01:00
Mark Abraham	0345135167	Tokenizer to_disk and from_disk now ensure paths (#5116) * Tokenizer to_disk and from_disk now ensure strings are converted to paths Fixes #5115 * Sign contributor agreement	2020-03-08 13:25:56 +01:00
adrianeboyd	2281c4708c	Restore empty tokenizer properties (#5026) * Restore empty tokenizer properties * Check for types in tokenizer.from_bytes() * Add test for setting empty tokenizer rules	2020-03-02 11:55:02 +01:00
adrianeboyd	d7f32b285c	Detect more empty matches in tokenizer.explain() (#4675) * Detect more empty matches in tokenizer.explain() * Include a few languages in explain non-slow tests Mark a few languages in tokenizer.explain() tests as not slow so they're run by default.	2019-11-20 16:31:29 +01:00
adrianeboyd	2c876eb672	Add tokenizer explain() debugging method (#4596) * Expose tokenizer rules as a property Expose the tokenizer rules property in the same way as the other core properties. (The cache resetting is overkill, but consistent with `from_bytes` for now.) Add tests and update Tokenizer API docs. * Update Hungarian punctuation to remove empty string Update Hungarian punctuation definitions so that `_units` does not match an empty string. * Use _load_special_tokenization consistently Use `_load_special_tokenization()` and have it handle `None` checks. * Fix precedence of `token_match` vs. special cases Remove `token_match` check from `_split_affixes()` so that special cases have precedence over `token_match`. `token_match` is checked only before infixes are split. * Add `make_debug_doc()` to the Tokenizer Add `make_debug_doc()` to the Tokenizer as a working implementation of the pseudo-code in the docs. Add a test (marked as slow) that checks that `nlp.tokenizer()` and `nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens for all languages that have `examples.sentences` that can be imported. * Update tokenization usage docs Update pseudo-code and algorithm description to correspond to `nlp.tokenizer.make_debug_doc()` with example debugging usage. Add more examples for customizing tokenizers while preserving the existing defaults. Minor edits / clarifications. * Revert "Update Hungarian punctuation to remove empty string" This reverts commit `f0a577f7a5`. * Rework `make_debug_doc()` as `explain()` Rework `make_debug_doc()` as `explain()`, which returns a list of `(pattern_string, token_string)` tuples rather than a non-standard `Doc`. Update docs and tests accordingly, leaving the visualization for future work. * Handle cases with bad tokenizer patterns Detect when tokenizer patterns match empty prefixes and suffixes so that `explain()` does not hang on bad patterns. * Remove unused displacy image * Add tokenizer.explain() to usage docs	2019-11-20 13:07:25 +01:00
Sofie Van Landeghem	48886afc78	prevent zero-length mem alloc (#4429) * raise specific error when removing a matcher rule that doesn't exist * rephrasing * goldparse init: allocate fields only if doc is not empty * avoid zero length alloc in saving tokenizer cache * avoid allocating zero length mem in matcher * asserts to avoid allocating zero length mem * fix zero-length allocation in matcher * bump cymem version * revert cymem version bump	2019-10-22 16:54:33 +02:00
adrianeboyd	cbc2cee2c8	Improve URL_PATTERN and handling in tokenizer (#4374) * Move prefix and suffix detection for URL_PATTERN Move prefix and suffix detection for `URL_PATTERN` into the tokenizer. Remove associated lookahead and lookbehind from `URL_PATTERN`. Fix tokenization for Hungarian given new modified handling of prefixes and suffixes. * Match a wider range of URI schemes	2019-10-05 13:00:09 +02:00
adrianeboyd	3780e2ff50	Flush tokenizer cache when necessary (#4258) Flush tokenizer cache when affixes, token_match, or special cases are modified. Fixes #4238, same issue as in #1250.	2019-09-08 20:52:46 +02:00
svlandeg	c54aabc3cd	fix loading custom tokenizer rules/exceptions from file	2019-08-28 14:17:44 +02:00
svlandeg	6026958957	tokenizer doc fix	2019-07-15 11:19:34 +02:00
Bharat Raghunathan	1db3e47509	DOC: Update tokenizer docs to include default value for batch_size in pipe (#3492)	2019-03-28 12:48:02 +01:00
Matthew Honnibal	e65b5bb9a0	Fix tokenizer on Python2.7 (#3460) spaCy v2.1 switched to the built-in re module, where v2.0 had been using the third-party regex library. When the tokenizer was deserialized on Python2.7, the `re.compile()` function was called with expressions that featured escaped unicode codepoints that were not in Python2.7's unicode database. Problems occurred when we had a range between two of these unknown codepoints, like this: `'[\\uAA77-\\uAA79]'` On Python2.7, the unknown codepoints are not unescaped correctly, resulting in arbitrary out-of-range characters being matched by the expression. This problem does not occur if we instead have a range between two unicode literals, rather than the escape sequences. To fix the bug, we therefore add a new compat function that unescapes unicode sequences using the `ast.literal_eval()` function. Care is taken to ensure we do not also escape non-unicode sequences. Closes #3356. - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-03-22 13:42:47 +01:00
Ines Montani	bec8db91e6	Add actual deprecation warning for n_threads (resolves #3410)	2019-03-15 16:38:44 +01:00
Ines Montani	cb5dbfa63a	Tidy up references to n_threads and fix default	2019-03-15 16:24:26 +01:00
Ines Montani	7ba3a5d95c	💫 Make serialization methods consistent (#3385) * Make serialization methods consistent exclude keyword argument instead of random named keyword arguments and deprecation handling * Update docs and add section on serialization fields	2019-03-10 19:16:45 +01:00
Ines Montani	296446a1c8	Tidy up and improve docs and docstrings (#3370) ## Description * tidy up and adjust Cython code to code style * improve docstrings and make calling `help()` nicer * add URLs to new docs pages to docstrings wherever possible, mostly to user-facing objects * fix various typos and inconsistencies in docs ### Types of change enhancement, docs ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-03-08 11:42:26 +01:00
Matthew Honnibal	b21481eeca	Load token_match regex with .match, not .search	2019-02-21 09:09:03 +01:00
Sofie	46dfe773e1	Replacing regex library with re to increase tokenization speed (#3218) * replace unicode categories with raw list of code points * simplifying ranges * fixing variable length quotes * removing redundant regular expression * small cleanup of regexp notations * quotes and alpha as ranges instead of alterations * removed most regexp dependencies and features * exponential backtracking - unit tests * rewrote expression with pathological backtracking * disabling double hyphen tests for now * test additional variants of repeating punctuation * remove regex and redundant backslashes from load_reddit script * small typo fixes * disable double punctuation test for russian * clean up old comments * format block code * final cleanup * naming consistency * french strings as unicode for python 2 support * french regular expression case insensitive	2019-02-01 18:05:22 +11:00
Matthew Honnibal	82277f63a3	💫 Small efficiency fixes to tokenizer (#2587) This patch improves tokenizer speed by about 10%, and reduces memory usage in the `Vocab` by removing a redundant index. The `vocab._by_orth` and `vocab._by_hash` indexed on different data in v1, but in v2 the orth and the hash are identical. The patch also fixes an uninitialized variable in the tokenizer, the `has_special` flag. This checks whether a chunk we're tokenizing triggers a special-case rule. If it does, then we avoid caching within the chunk. This check led to incorrectly rejecting some chunks from the cache. With the `en_core_web_md` model, we now tokenize the IMDB train data at 503,104k words per second. Prior to this patch, we had 465,764k words per second. Before switching to the regex library and supporting more languages, we had 1.3m words per second for the tokenizer. In order to recover the missing speed, we need to: * Fix the variable-length lookarounds in the suffix, infix and `token_match` rules * Improve the performance of the `token_match` regex * Switch back from the `regex` library to the `re` library. ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-07-24 23:35:54 +02:00
Matthew Honnibal	43dcaa473e	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-07-06 12:36:42 +02:00
Matthew Honnibal	6c8d627733	Fix tokenizer deserialization	2018-07-06 12:36:33 +02:00
ines	c001d46153	Tidy up	2018-07-06 12:33:42 +02:00
Matthew Honnibal	63f5651f8d	Fix tokenizer serialization	2018-07-06 12:32:11 +02:00
Matthew Honnibal	1a2f61725c	Fix tokenizer serialization	2018-07-06 12:23:04 +02:00
ines	63666af328	Merge branch 'master' into develop	2018-07-04 14:52:25 +02:00
Bùi Trung Chí	9af46b4f1b	Fix loading tokenizer with custom prefix search (#2495) * Add contributor agreement * Fix loading tokenizer with custom prefix search	2018-07-04 12:56:07 +02:00
Matthew Honnibal	46d8a66fef	Fix tokenizer serialization if token_match is None	2018-06-29 14:24:46 +02:00
Ines Montani	3141e04822	💫 New system for error messages and warnings (#2163) * Add spacy.errors module * Update deprecation and user warnings * Replace errors and asserts with new error message system * Remove redundant asserts * Fix whitespace * Add messages for print/util.prints statements * Fix typo * Fix typos * Move CLI messages to spacy.cli._messages * Add decorator to display error code with message An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc. * Remove unused link in spacy.about * Update errors for invalid pipeline components * Improve error for unknown factories * Add displaCy warnings * Update formatting consistency * Move error message to spacy.errors * Update errors and check if doc returned by component is None	2018-04-03 15:50:31 +02:00
Matthew Honnibal	6bc0f4d29f	Merge pull request #1611 from fsonntag/master Solving #1494	2017-11-29 23:11:23 +01:00
Felix Sonntag	724ae7dc55	Fixed issue of infix capturing prefixes	2017-11-28 17:17:12 +01:00
Matthew Honnibal	542e6fd4ea	Don't remove entries from specials	2017-11-23 12:17:42 +00:00
Felix Sonntag	33b0f86de3	Changed tokenizer to add infix when infix_start is offset	2017-11-19 16:32:10 +01:00
Roman Domrachev	61d28d03e4	Try again to do selective remove cache	2017-11-15 19:11:12 +03:00
Roman Domrachev	b3311100c7	Merge branch 'master' of github.com:explosion/spaCy	2017-11-15 18:30:04 +03:00
Roman Domrachev	505c6a2f2f	Completely clean up tokenizer cache. The tokenizer cache can have keys other than strings. This modification can slow down the tokenizer and needs to be measured	2017-11-15 17:55:48 +03:00
Matthew Honnibal	fe3c42a06b	Fix caching in tokenizer	2017-11-15 13:55:46 +01:00
Roman Domrachev	91e2fa6561	Clean all caches	2017-11-14 21:15:04 +03:00
Daniel Hershcovich	d7ae54ff44	Fix typo in message	2017-11-08 16:06:28 +02:00
ines	9659391944	Update deprecated methods and add warnings	2017-11-01 16:49:42 +01:00
ines	d96e72f656	Tidy up rest	2017-10-27 21:07:59 +02:00
ines	72497c8cb2	Remove comments and add TODO	2017-10-25 12:15:43 +02:00
Matthew Honnibal	b0f6fd3f1d	Disable tokenizer cache for special-cases. Fixes #1250	2017-10-24 16:08:05 +02:00
Matthew Honnibal	f45973848c	Rename 'tokens' variable 'doc' in tokenizer	2017-10-17 18:21:41 +02:00
ines	cd6a29dce7	Port over changes from #1294	2017-10-14 13:28:46 +02:00
ines	7c919aeb09	Make sure serializers and deserializers are ordered	2017-06-03 17:05:09 +02:00
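
Two of the regex-related commits above (c981aa6684, inline flags in `token_match` patterns, and b21481eeca, loading `token_match` with `.match` rather than `.search`) can be illustrated with the standard library alone. The sketch below is not spaCy code: `PATTERN` is a made-up stand-in for a URL pattern, and it assumes that a serialized regex is stored as its pattern string only, which is the situation the first commit message describes.

```python
import re

# Hypothetical pattern for illustration only; not spaCy's actual URL pattern.
PATTERN = r"https?://\S+"

# Flag passed as a separate argument: it is not part of the pattern string,
# so it is lost if only `.pattern` is saved and recompiled.
with_arg = re.compile(PATTERN, re.IGNORECASE)
restored_arg = re.compile(with_arg.pattern)

# Inline flag: it lives inside the pattern string and survives the round trip.
with_inline = re.compile("(?i)" + PATTERN)
restored_inline = re.compile(with_inline.pattern)

text = "HTTP://EXAMPLE.COM"
lost_flag = restored_arg.match(text) is None         # case-insensitivity gone
kept_flag = restored_inline.match(text) is not None  # case-insensitivity kept

# `.match` only succeeds at the start of the string; `.search` scans ahead.
sentence = "see http://example.com"
match_result = with_inline.match(sentence)    # no match at position 0
search_result = with_inline.search(sentence)  # finds the URL later on
```

For a whole-token check such as `token_match`, anchoring at the start with `.match` is the behaviour the second commit title points to; `.search` would also accept a string that merely contains a match somewhere inside it.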

123 Commits