spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-09-23 04:26:46 +03:00

Author	SHA1	Message	Date
adrianeboyd	c32126359a	Allow period as suffix following punctuation (#4248) Addresses rare cases (such as `_MATH_.`, see #1061) where the final period was not recognized as a suffix following punctuation.	2019-09-09 19:19:22 +02:00
Ines Montani	3e8f136ba7	💫 WIP: Basic lookup class scaffolding and JSON for all lemmatizer data (#4178) * Improve load_language_data helper * WIP: Add Lookups implementation * Start moving lemma data over to JSON * WIP: move data over for more languages * Convert more languages * Fix lemmatizer fixtures in tests * Finish conversion * Auto-format JSON files * Fix test for now * Make sure tables are stored on instance * Update docstrings * Update docstrings and errors * Update test * Add Lookups.__len__ * Add serialization methods * Add Lookups.remove_table * Use msgpack for serialization to disk * Fix file exists check * Try using OrderedDict for everything * Update .flake8 [ci skip] * Try fixing serialization * Update test_lookups.py * Update test_serialize_vocab_strings.py * Fix serialization for lookups * Fix lookups * Fix lookups * Fix lookups * Try to fix serialization * Try to fix serialization * Try to fix serialization * Try to fix serialization * Give up on serialization test * Xfail more serialization tests for 3.5 * Fix lookups for 2.7	2019-09-09 19:17:55 +02:00
Sofie Van Landeghem	482c7cd1b9	pulling tqdm imports in functions to avoid bug (tmp fix) (#4263)	2019-09-09 16:32:11 +02:00
Mihai Gliga	25aecd504f	adding Romanian tag_map (#4257) * adding Romanian tag_map * added SCA file * forgotten import	2019-09-09 11:53:09 +02:00
Matthew Honnibal	1653b818c5	Update Lithuanian tag map	2019-09-08 20:57:58 +02:00
adrianeboyd	3780e2ff50	Flush tokenizer cache when necessary (#4258) Flush tokenizer cache when affixes, token_match, or special cases are modified. Fixes #4238, same issue as in #1250.	2019-09-08 20:52:46 +02:00
Matthew Honnibal	da8830d909	Set version to v2.2.0.dev3	2019-09-08 18:22:03 +02:00
Matthew Honnibal	1a65c5b7af	Update develop from master	2019-09-08 18:21:41 +02:00
Matthew Honnibal	aec6174ae6	Fix lemmatizer	2019-09-08 18:09:53 +02:00
Matthew Honnibal	fde4f8ac8e	Create lookups if not passed in	2019-09-08 18:08:09 +02:00
Pavle Vidanović	d03401f532	Lemmatizer lookup dictionary for Serbian and basic tag set adde… (#4251) * Serbian stopwords added. (cyrillic alphabet) * spaCy Contribution agreement included. * Test initialize updated * Serbian language code update. --bugfix * Tokenizer exceptions added. Init file updated. * Norm exceptions and lexical attributes added. * Examples added. * Tests added. * sr_lang examples update. * Tokenizer exceptions updated. (Serbian) * Lemmatizer created. Licence included. * Test updated. * Tag map basic added. * tag_map.py file removed since it uses default spacy tags.	2019-09-08 14:19:15 +02:00
Ivan Šarić	b01025dd06	adds Croatian lemma_lookup.json, license file and corresponding tests (#4252)	2019-09-08 13:40:45 +02:00
adrianeboyd	aec755d3a3	Modify retokenizer to use span root attributes (#4219) * Modify retokenizer to use span root attributes * tag/pos/morph are set to root tag/pos/morph * lemma and norm are reset and end up as orth (not ideal, but better than orth of first token) * Also handle individual merge case * Add test * Attempt to handle ent_iob and ent_type in merges * Fix check for whether B-ENT should become I-ENT * Move IOB consistency check to after attrs Move all IOB consistency checks after attrs are set and simplify to check entire document, modifying I to B at the beginning of the document or if the entity type of the previous token isn't the same. * Move IOB consistency check for single merge Move IOB consistency check after the token array is compressed for the single merge case. * Update spacy/tokens/_retokenize.pyx Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com> * Remove single vs. multiple merge distinction Remove original single-instance `_merge()` and use `_bulk_merge()` (now renamed `_merge()`) for all merges. * Add out-of-bound check in previous entity check	2019-09-08 13:04:49 +02:00
Sofie Van Landeghem	53a9ca45c9	Docs: bufsize instead of buffsize (#4247)	2019-09-06 11:11:54 +02:00
Sofie Van Landeghem	6b012cebff	Make pos/tag distinction more clear in docs (#4246) * make distinction between tag and pos more prominent in docs * out of the 101	2019-09-06 10:31:21 +02:00
Bae Yong-Ju	a55f5a744f	Fix ValueError exception on empty Korean text. (#4245)	2019-09-06 10:29:40 +02:00
Ines Montani	232a029de6	Send referrer for internal links [ci skip]	2019-09-05 10:41:46 +02:00
Matthew Honnibal	d039ed2267	Merge pull request #4237 from adrianeboyd/feature/gold-train-orth-variants Add guillemets/chevrons to German orth variants	2019-09-04 23:10:49 +02:00
Matthew Honnibal	b94c34ec8f	Merge pull request #4239 from adrianeboyd/bugfix/tokenizer-cache-test-1061 Add regression test for #1061 back to test suite	2019-09-04 23:10:12 +02:00
Adriane Boyd	0f28418446	Add regression test for #1061 back to test suite	2019-09-04 20:42:24 +02:00
Adriane Boyd	c39c13f26b	Add guillemets/chevrons to German orth variants Add guillemets/chevrons to German orth variants for both German/Austrian and Swiss conventions.	2019-09-04 20:05:08 +02:00
Ines Montani	2f31f96fce	Update languages.json [ci skip]	2019-09-04 18:15:42 +02:00
Ines Montani	2245e95e2d	Update languages.json [ci skip]	2019-09-04 17:11:40 +02:00
Matthew Honnibal	17c039406b	Merge pull request #4232 from adrianeboyd/bugfix/entityruler-ner-4229 Fix handling of preset entities in NER	2019-09-04 15:02:31 +02:00
Adriane Boyd	6b0fec76fd	Fix handling of preset entities in NER * Fix check of valid ent_type for B * Add valid L as preset-I followed by not-I	2019-09-04 13:42:42 +02:00
Ines Montani	419ae59c79	Make flaky test test_issue_1971_4 more explicit	2019-08-31 14:08:05 +02:00
Ines Montani	dad5621166	Tidy up and auto-format [ci skip]	2019-08-31 13:39:31 +02:00
Ines Montani	cd90752193	Tidy up and auto-format [ci skip]	2019-08-31 13:39:06 +02:00
Ines Montani	bcd1b12f43	Add contributor agreement [ci skip]	2019-08-30 17:02:43 +02:00
Matthew Honnibal	67c3d03905	Revert morphology serialisation	2019-08-30 13:13:07 +02:00
Matthew Honnibal	efcb51ddc8	Merge pull request #4217 from adrianeboyd/bugfix/morph-en-serialization Morphology tag_map-related bugfixes	2019-08-30 12:46:29 +02:00
Adriane Boyd	893f11a9e3	Serialize tag_map directly Fix Aspect_prof typo	2019-08-30 11:30:03 +02:00
Adriane Boyd	02babf9317	English tag map without unsupported features/values	2019-08-30 11:29:19 +02:00
Matthew Honnibal	516650f58f	Merge pull request #4207 from svlandeg/bugfix/serialize-tok-exc Bugfix for serializing tokenizer rules/exceptions	2019-08-30 11:04:58 +02:00
Matthew Honnibal	f3c3ce7f1e	Update vocab	2019-08-29 21:19:54 +02:00
Matthew Honnibal	fc0a3c8c38	Add morphology serialization	2019-08-29 21:17:34 +02:00
Matthew Honnibal	c94fc9edb9	Fix noise addition	2019-08-29 15:39:32 +02:00
Matthew Honnibal	32842a3cd4	Disable whitespace corruption	2019-08-29 15:01:58 +02:00
Matthew Honnibal	3c1c0ec18e	Add tests for NER oracle with whitespace	2019-08-29 14:33:39 +02:00
Matthew Honnibal	6511e1d8d3	Fix NER gold-standard around whitespace	2019-08-29 14:33:07 +02:00
adrianeboyd	82159b5c19	Updates/bugfixes for NER/IOB converters (#4186) * Updates/bugfixes for NER/IOB converters * Converter formats `ner` and `iob` use autodetect to choose a converter if possible * `iob2json` is reverted to handle sentence-per-line data like `word1|pos1|ent1 word2|pos2|ent2` * Fix bug in `merge_sentences()` so the second sentence in each batch isn't skipped * `conll_ner2json` is made more general so it can handle more formats with whitespace-separated columns * Supports all formats where the first column is the token and the final column is the IOB tag; if present, the second column is the POS tag * As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O` separates documents * Add option for segmenting sentences (new flag `-s`) * Parser-based sentence segmentation with a provided model, otherwise with sentencizer (new option `-b` to specify model) * Can group sentences into documents with `n_sents` as long as sentence segmentation is available * Only applies automatic segmentation when there are no existing delimiters in the data * Provide info about settings applied during conversion with warnings and suggestions if settings conflict or might not be optimal. * Add tests for common formats * Add '(default)' back to docs for -c auto * Add document count back to output * Revert changes to converter output message * Use explicit tabs in convert CLI test data * Adjust/add messages for n_sents=1 default * Add sample NER data to training examples * Update README * Add links in docs to example NER data * Define msg within converters	2019-08-29 12:04:01 +02:00
adrianeboyd	5feb342f5e	Add more token attributes to token pattern schema (#4210) Add token attributes with tests to token pattern schema.	2019-08-29 12:02:26 +02:00
Matthew Honnibal	216f63a987	Merge pull request #4208 from adrianeboyd/bugfix/orth-vs-noise Add separate noise vs orth level to train CLI	2019-08-29 10:26:42 +02:00
Adriane Boyd	f3906950d3	Add separate noise vs orth level to train CLI	2019-08-29 09:10:35 +02:00
Matthew Honnibal	7d6d438566	Set version to v2.2.0.dev2	2019-08-28 18:30:43 +02:00
Matthew Honnibal	bc5ce49859	Fix 'noise_level' in train cmd	2019-08-28 17:55:38 +02:00
Matthew Honnibal	782056d117	Fix morph rules	2019-08-28 16:59:45 +02:00
Matthew Honnibal	6b2ea883ed	Merge pull request #4205 from adrianeboyd/feature/gold-train-orth-variants Add train_docs() option to add orth variants	2019-08-28 16:54:06 +02:00
svlandeg	c54aabc3cd	fix loading custom tokenizer rules/exceptions from file	2019-08-28 14:17:44 +02:00
svlandeg	7bec0ebbcb	failing unit test for Issue 4190	2019-08-28 14:16:34 +02:00


10798 Commits