spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-17 23:51:58 +03:00

Author	SHA1	Message	Date
Ines Montani	cbacb0f1a4	Update shape docs and examples (resolves #4615 ) [ci skip]	2019-11-23 17:16:55 +01:00
Paul O'Leary McCann	f0e3e606a6	Replace mecab-python3 with fugashi for Japanese (#4621 ) * Switch from mecab-python3 to fugashi mecab-python3 has been the best MeCab binding for a long time but it's not very actively maintained, and since it's based on old SWIG code distributed with MeCab there's a limit to how effectively it can be maintained. Fugashi is a new Cython-based MeCab wrapper I wrote. Since it's not based on the old SWIG code it's easier to keep it current and make small deviations from the MeCab C/C++ API where that makes sense. * Change mecab-python3 to fugashi in setup.cfg * Change "mecab tags" to "unidic tags" The tags come from MeCab, but the tag schema is specified by Unidic, so it's more proper to refer to it that way. * Update conftest * Add fugashi link to external deps list for Japanese	2019-11-23 14:31:04 +01:00
Ines Montani	a6200bc424	Update scorer.md [ci skip]	2019-11-21 17:02:43 +01:00
richardpaulhudson	8d06386e1e	Update to Holmes Universe entry (#4679 ) * Updated Universe entry for Holmes * Correction * Updated model name * Updated wording	2019-11-21 16:23:24 +01:00
Ines Montani	235fe6fe3b	Auto-format [ci skip]	2019-11-20 13:14:58 +01:00
adrianeboyd	2c876eb672	Add tokenizer explain() debugging method (#4596 ) * Expose tokenizer rules as a property Expose the tokenizer rules property in the same way as the other core properties. (The cache resetting is overkill, but consistent with `from_bytes` for now.) Add tests and update Tokenizer API docs. * Update Hungarian punctuation to remove empty string Update Hungarian punctuation definitions so that `_units` does not match an empty string. * Use _load_special_tokenization consistently Use `_load_special_tokenization()` and have it to handle `None` checks. * Fix precedence of `token_match` vs. special cases Remove `token_match` check from `_split_affixes()` so that special cases have precedence over `token_match`. `token_match` is checked only before infixes are split. * Add `make_debug_doc()` to the Tokenizer Add `make_debug_doc()` to the Tokenizer as a working implementation of the pseudo-code in the docs. Add a test (marked as slow) that checks that `nlp.tokenizer()` and `nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens for all languages that have `examples.sentences` that can be imported. * Update tokenization usage docs Update pseudo-code and algorithm description to correspond to `nlp.tokenizer.make_debug_doc()` with example debugging usage. Add more examples for customizing tokenizers while preserving the existing defaults. Minor edits / clarifications. * Revert "Update Hungarian punctuation to remove empty string" This reverts commit `f0a577f7a5`. * Rework `make_debug_doc()` as `explain()` Rework `make_debug_doc()` as `explain()`, which returns a list of `(pattern_string, token_string)` tuples rather than a non-standard `Doc`. Update docs and tests accordingly, leaving the visualization for future work. * Handle cases with bad tokenizer patterns Detect when tokenizer patterns match empty prefixes and suffixes so that `explain()` does not hang on bad patterns. 
* Remove unused displacy image * Add tokenizer.explain() to usage docs	2019-11-20 13:07:25 +01:00
Ines Montani	e8b9cee6fd	Make example consistent with model (closes #4587 ) [ci skip]	2019-11-18 12:41:48 +01:00
Ines Montani	e01a1a237f	Auto-format [ci skip]	2019-11-18 12:41:31 +01:00
adrianeboyd	62e00fd9da	Update tokenization usage docs (#4666 ) Update pseudo-code and algorithm description to correspond to current tokenizer behavior. Add more examples for customizing tokenizers while preserving the existing defaults. Minor edits / clarifications.	2019-11-18 12:35:13 +01:00
Ines Montani	5adcb352e9	Adjust order of docs sections [ci skip]	2019-11-17 16:08:56 +01:00
Ines Montani	e30d08410a	Add CI for Python 3.8 (#4479 ) * Add 3.8 classifier * Update azure-pipelines.yml * Remove 3.8 warning from docs [ci skip]	2019-11-15 01:13:48 +01:00
adrianeboyd	faaa832518	Generalize handling of tokenizer special cases (#4259 ) * Generalize handling of tokenizer special cases Handle tokenizer special cases more generally by using the Matcher internally to match special cases after the affix/token_match tokenization is complete. Instead of only matching special cases while processing balanced or nearly balanced prefixes and suffixes, this recognizes special cases in a wider range of contexts: * Allows arbitrary numbers of prefixes/affixes around special cases * Allows special cases separated by infixes Existing tests/settings that couldn't be preserved as before: * The emoticon '")' is no longer a supported special case * The emoticon ':)' in "example:)" is a false positive again When merged with #4258 (or the relevant cache bugfix), the affix and token_match properties should be modified to flush and reload all special cases to use the updated internal tokenization with the Matcher. * Remove accidentally added test case * Really remove accidentally added test * Reload special cases when necessary Reload special cases when affixes or token_match are modified. Skip reloading during initialization. 
* Update error code number * Fix offset and whitespace in Matcher special cases * Fix offset bugs when merging and splitting tokens * Set final whitespace on final token in inserted special case * Improve cache flushing in tokenizer * Separate cache and specials memory (temporarily) * Flush cache when adding special cases * Repeated `self._cache = PreshMap()` and `self._specials = PreshMap()` are necessary due to this bug: https://github.com/explosion/preshed/issues/21 * Remove reinitialized PreshMaps on cache flush * Update UD bin scripts * Update imports for `bin/` * Add all currently supported languages * Update subtok merger for new Matcher validation * Modify blinded check to look at tokens instead of lemmas (for corpora with tokens but not lemmas like Telugu) * Use special Matcher only for cases with affixes * Reinsert specials cache checks during normal tokenization for special cases as much as possible * Additionally include specials cache checks while splitting on infixes * Since the special Matcher needs consistent affix-only tokenization for the special cases themselves, introduce the argument `with_special_cases` in order to do tokenization with or without specials cache checks * After normal tokenization, postprocess with special cases Matcher for special cases containing affixes * Replace PhraseMatcher with Aho-Corasick Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays of the hash values for the relevant attribute. The implementation is based on FlashText. The speed should be similar to the previous PhraseMatcher. It is now possible to easily remove match IDs and matches don't go missing with large keyword lists / vocabularies. Fixes #4308. 
* Restore support for pickling * Fix internal keyword add/remove for numpy arrays * Add test for #4248, clean up test * Improve efficiency of special cases handling * Use PhraseMatcher instead of Matcher * Improve efficiency of merging/splitting special cases in document * Process merge/splits in one pass without repeated token shifting * Merge in place if no splits * Update error message number * Remove UD script modifications Only used for timing/testing, should be a separate PR * Remove final traces of UD script modifications * Update UD bin scripts * Update imports for `bin/` * Add all currently supported languages * Update subtok merger for new Matcher validation * Modify blinded check to look at tokens instead of lemmas (for corpora with tokens but not lemmas like Telugu) * Add missing loop for match ID set in search loop * Remove cruft in matching loop for partial matches There was a bit of unnecessary code left over from FlashText in the matching loop to handle partial token matches, which we don't have with PhraseMatcher. * Replace dict trie with MapStruct trie * Fix how match ID hash is stored/added * Update fix for match ID vocab * Switch from map_get_unless_missing to map_get * Switch from numpy array to Token.get_struct_attr Access token attributes directly in Doc instead of making a copy of the relevant values in a numpy array. Add unsatisfactory warning for hash collision with reserved terminal hash key. (Ideally it would change the reserved terminal hash and redo the whole trie, but for now, I'm hoping there won't be collisions.) * Restructure imports to export find_matches * Implement full remove() Remove unnecessary trie paths and free unused maps. Parallel to Matcher, raise KeyError when attempting to remove a match ID that has not been added. 
* Switch to PhraseMatcher.find_matches * Switch to local cdef functions for span filtering * Switch special case reload threshold to variable Refer to variable instead of hard-coded threshold * Move more of special case retokenize to cdef nogil Move as much of the special case retokenization to nogil as possible. * Rewrap sort as stdsort for OS X * Rewrap stdsort with specific types * Switch to qsort * Fix merge * Improve cmp functions * Fix realloc * Fix realloc again * Initialize span struct while retokenizing * Temporarily skip retokenizing * Revert "Move more of special case retokenize to cdef nogil" This reverts commit `0b7e52c797`. * Revert "Switch to qsort" This reverts commit `a98d71a942`. * Fix specials check while caching * Modify URL test with emoticons The multiple suffix tests result in the emoticon `:>`, which is now retokenized into one token as a special case after the suffixes are split off. * Refactor _apply_special_cases() * Use cdef ints for span info used in multiple spots * Modify _filter_special_spans() to prefer earlier Parallel to #4414, modify _filter_special_spans() so that the earlier span is preferred for overlapping spans of the same length. * Replace MatchStruct with Entity Replace MatchStruct with Entity since the existing Entity struct is nearly identical. * Replace Entity with more general SpanC * Replace MatchStruct with SpanC * Add error in debug-data if no dev docs are available (see #4575) * Update azure-pipelines.yml * Revert "Update azure-pipelines.yml" This reverts commit `ed1060cf59`. 
* Use latest wasabi * Reorganise install_requires * add dframcy to universe.json (#4580) * Update universe.json [ci skip] * Fix multiprocessing for as_tuples=True (#4582) * Fix conllu script (#4579) * force extensions to avoid clash between example scripts * fix arg order and default file encoding * add example config for conllu script * newline * move extension definitions to main function * few more encodings fixes * Add load_from_docbin example [ci skip] TODO: upload the file somewhere * Update README.md * Add warnings about 3.8 (resolves #4593) [ci skip] * Fixed typo: Added space between "recognize" and "various" (#4600) * Fix DocBin.merge() example (#4599) * Replace function registries with catalogue (#4584) * Replace functions registries with catalogue * Update __init__.py * Fix test * Revert unrelated flag [ci skip] * Bugfix/dep matcher issue 4590 (#4601) * add contributor agreement for prilopes * add test for issue #4590 * fix on_match params for DependencyMacther (#4590) * Minor updates to language example sentences (#4608) * Add punctuation to Spanish example sentences * Combine multilanguage examples for lang xx * Add punctuation to nb examples * Always realloc to a larger size Avoid potential (unlikely) edge case and cymem error seen in #4604. * Add error in debug-data if no dev docs are available (see #4575) * Update debug-data for GoldCorpus / Example * Ignore None label in misaligned NER data	2019-11-13 21:24:35 +01:00
f11r	877971860e	Fix assert in sentencizer documentation. (#4639 )	2019-11-13 15:24:14 +01:00
Ines Montani	9d5ff177c4	Work around Markdown rendering issue surfaced in #4600 [ci skip]	2019-11-11 17:12:08 +01:00
adrianeboyd	0f8678c0b1	Fix DocBin.merge() example (#4599 )	2019-11-07 11:26:48 +01:00
walterhenry	5563c42ef5	Fixed typo: Added space between "recognize" and "various" (#4600 )	2019-11-06 23:06:36 +01:00
Ines Montani	828ef27a32	Add warnings about 3.8 (resolves #4593 ) [ci skip]	2019-11-05 18:30:11 +01:00
Ines Montani	4b95587ad4	Update universe.json [ci skip]	2019-11-04 13:55:55 +01:00
Yash Patadia	0c396aeed4	add dframcy to universe.json (#4580 )	2019-11-04 13:53:23 +01:00
Ines Montani	59358d9b71	Remove box-decoration-break from entities in displacy (#4564 )	2019-10-31 15:09:43 +01:00
Ines Montani	4e1de85e43	Update syntax iterators [ci skip]	2019-10-30 14:31:40 +01:00
Ines Montani	726c5dd306	Update universe.json [ci skip]	2019-10-30 13:29:00 +01:00
Neel Kamath	6c036ab57d	Add "spaCy Server" to spaCy Universe (#4553 ) * Add "spaCy Server" to spaCy Universe * Accept the spaCy Contributor Agreement	2019-10-30 13:20:46 +01:00
Nipun Sadvilkar	2a5e71232b	✨ project: pySBD - Python Sentence Boundary Disambiguation (#4455 ) * ✨ project: pySBD - Python Sentence Boundary Disambiguation * 📝 Update links and description * 🐛 Fix missing comma * Update universe.json pysbd as a spacy component through entrypoints * 🚨 Fix universe.json * 📝 Update code_example	2019-10-30 12:13:29 +01:00
Matthew Honnibal	d5509e0989	Support Mish activation (requires Thinc 7.3) (#4536 ) * Add arch for MishWindowEncoder * Support mish in tok2vec and conv window >=2 * Pass new tok2vec settings from parser * Syntax error * Fix tok2vec setting * Fix registration of MishWindowEncoder * Fix receptive field setting * Fix mish arch * Pass more options from parser * Support more tok2vec options in pretrain * Require thinc 7.3 * Add docs [ci skip] * Require thinc 7.3.0.dev0 to run CI * Run black * Fix typo * Update Thinc version Co-authored-by: Ines Montani <ines@ines.io>	2019-10-28 15:16:33 +01:00
Ines Montani	1180304449	Update languages.json [ci skip]	2019-10-26 13:51:42 +02:00
Ines Montani	cfffdba7b1	Implement new API for {Phrase}Matcher.add (backwards-compatible) (#4522 ) * Implement new API for {Phrase}Matcher.add (backwards-compatible) * Update docs * Also update DependencyMatcher.add * Update internals * Rewrite tests to use new API * Add basic check for common mistake Raise error with suggestion if user likely passed in a pattern instead of a list of patterns * Fix typo [ci skip]	2019-10-25 22:21:08 +02:00
Ines Montani	d2da117114	Also support passing list to Language.disable_pipes (#4521 ) * Also support passing list to Language.disable_pipes * Adjust internals	2019-10-25 16:19:08 +02:00
Ines Montani	493be8e9db	Update new version identifier [ci skip]	2019-10-25 11:42:49 +02:00
Ines Montani	2abf1028cb	Update docs [ci skip]	2019-10-25 11:27:00 +02:00
Ines Montani	f31876154d	Adjust formatting [ci skip]	2019-10-25 11:19:46 +02:00
Kabir Khan	93640373c7	Make entity_ruler ent_id resolution 2x faster and add docs for… (#4513 ) * Update entityruler.py * Making ent_id resolution 2x faster and adding docs * Fixing newlines in docstrings * Fixing newlines in docstrings	2019-10-25 11:16:42 +02:00
adrianeboyd	1b0bbe4b76	Update tag maps and docs for English and German (#4501 ) * Update English tag_map Update English tag_map based on this conversion table: https://universaldependencies.org/tagset-conversion/en-penn-uposf.html * Update German tag_map Update German tag_map based on this conversion table: https://universaldependencies.org/tagset-conversion/de-stts-uposf.html * Add missing Tiger dependencies to glossary * Add quotes to definition of TO * Update POS/TAG tables in docs Update POS/TAG tables for English and German docs using current information generated from the tag_maps and GLOSSARY. * Update warning that -PRON- is specific to English * Revert docs to default JSON output with convert * Revert "Revert docs to default JSON output with convert" This reverts commit `6b78c048f1`.	2019-10-24 12:56:05 +02:00
adrianeboyd	8516e9d53b	Support train dict format as JSONL (#4471 ) * Support train dict format as JSONL * Add (overly simple) check for dict vs. tuple to read JSONL lines as either train dicts or train tuples * Extend JSON/JSONL roundtrip conversion tests using `docs_to_json()` and `GoldCorpus.train_tuples` * Revert docs to default JSON output with convert	2019-10-23 16:01:44 +02:00
adrianeboyd	7fc39f124c	Fix logic in rules+model entity example [ci skip] (#4510 )	2019-10-23 14:41:21 +02:00
Ines Montani	388ea03065	Update universe.json [ci skip]	2019-10-22 14:54:47 +02:00
Kabir Khan	8a7a30ea1d	Add cookiecutter-spacy-fastapi to spacy universe (#4498 )	2019-10-22 14:50:40 +02:00
Ines Montani	4659435573	Fix argument type in PhraseMatcher.add docs (closes #4496 ) [ci skip]	2019-10-22 14:37:30 +02:00
Julin S	3ee15fce0d	Update information about Rasa (#4492 ) Rasa has been updated and rasa core and rasa nlu have been merged.	2019-10-22 14:32:31 +02:00
Ines Montani	b2f88e2060	Fix formatting [ci skip]	2019-10-21 12:26:07 +02:00
adrianeboyd	3195a8f170	Add Entity Linking to menu (#4489 )	2019-10-21 12:17:30 +02:00
Pepe Berba	7772d5d3c5	Update `vocab.get_vector` docs to include features on Fasttext ngram (#4464 ) * Update `vocab.get_vector` * Added contrib agreement	2019-10-20 01:28:18 +02:00
Ghola	258eb9e064	Misspelling on Lemmatizer Example #4406 (#4449 ) Removing extra o in the lookups = Loookups()	2019-10-16 23:23:15 +02:00
Anastassia	4a77d03ff7	Fix documentation for the docs_to_json function (#4456 )	2019-10-16 23:17:58 +02:00
Ines Montani	5cbe21700b	Only show label scheme if not empty [ci skip]	2019-10-08 15:52:59 +02:00
Ines Montani	8f76d6c9ef	Update transformer model details [ci skip]	2019-10-08 15:39:38 +02:00
Ines Montani	573e543e4a	Alphanumeric -> alphabetic [ci skip] see ines/spacy-course#38	2019-10-06 13:30:01 +02:00
Ines Montani	e65dffd80b	Clarify serialization of extension attributes (closes #4377 ) [ci skip]	2019-10-05 11:58:00 +02:00
Ines Montani	e7ddc6f662	Add conda install for lookups [ci skip]	2019-10-03 17:52:53 +02:00
Sofie Van Landeghem	4e7259c6cf	Bugfix initializing DocBin with attributes (#4368 ) * docbin init fix + documentation fix + unit tests * newline * try with zlib instead of gzip (python 2 incompatibilities)	2019-10-03 14:48:45 +02:00
Ines Montani	ce1d441de5	Add docs for Vectors.most_similar [ci skip]	2019-10-03 14:29:47 +02:00
Ines Montani	80cf385f65	Update v2-2.md [ci skip]	2019-10-02 16:58:21 +02:00
Ines Montani	12a941d841	Update binder version [ci skip]	2019-10-02 16:47:01 +02:00
Ines Montani	b6670bf0c2	Use consistent spelling	2019-10-02 10:37:39 +02:00
Ines Montani	475e3188ce	Add docs on filtering overlapping spans for merging (resolves #4352 ) [ci skip]	2019-10-01 21:59:50 +02:00
Ines Montani	0dd127bb00	Update v2-2.md [ci skip]	2019-10-01 21:37:06 +02:00
Ines Montani	cf65a80f36	Refactor lemmatizer and data table integration (#4353 ) * Move test * Allow default in Lookups.get_table * Start with blank tables in Lookups.from_bytes * Refactor lemmatizer to hold instance of Lookups * Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk) * Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency * Remove old and unsupported Lemmatizer.load classmethod * Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need * Update tests and docs * Fix more tests * Fix lemmatizer * Upgrade pytest to try and fix weird CI errors * Try pytest 4.6.5	2019-10-01 21:36:03 +02:00
Ines Montani	bc7e7db208	Fix wording [ci skip]	2019-10-01 14:20:44 +02:00
Ines Montani	2a3a4565cd	Update infobox [ci skip]	2019-10-01 14:19:34 +02:00
Ines Montani	66aa0d479f	Update v2.2 page [ci skip]	2019-10-01 14:11:05 +02:00
Ines Montani	a8a1800f2a	Update lemma data documentation [ci skip]	2019-10-01 13:22:13 +02:00
Ines Montani	932ad9cb91	Fix typos and formatting [ci skip]	2019-10-01 12:30:04 +02:00
Ines Montani	ca0b20ae8b	Make prereleases less verbose [ci skip]	2019-10-01 12:29:14 +02:00
Ines Montani	61263e2fbc	Update universe.json [ci skip]	2019-09-30 13:49:44 +02:00
Ines Montani	71bd040834	Update models.js [ci skip]	2019-09-30 12:01:09 +02:00
Ines Montani	3d8fd4b461	Revert #4334	2019-09-29 17:32:12 +02:00
Ines Montani	3bd4da068e	Fix link [ci skip]	2019-09-29 17:30:38 +02:00
Ines Montani	089f44cc56	Update serialization docs [ci skip]	2019-09-29 17:11:13 +02:00
Ines Montani	c9cd516d96	Move tests out of package (#4334 ) * Move tests out of package * Fix typo	2019-09-28 18:05:00 +02:00
Ines Montani	10742d3219	Update v2 docs [ci skip]	2019-09-28 15:57:22 +02:00
Ines Montani	a2815f6643	Fix model table display [ci skip]	2019-09-28 14:23:03 +02:00
Ines Montani	129670283e	Pass meta labels through correctly [ci skip]	2019-09-28 14:08:33 +02:00
Ines Montani	f8d1e2f214	Update CLI docs [ci skip]	2019-09-28 13:12:30 +02:00
Ines Montani	59beab8405	Update v2-2.md [ci skip]	2019-09-27 18:10:43 +02:00
Ines Montani	685e4b2554	Update v2-2.md [ci skip]	2019-09-27 16:35:01 +02:00
Ines Montani	aad66d9bb9	Document PhraseMatcher.remove [ci skip]	2019-09-27 16:34:53 +02:00
Ines Montani	3624153591	Update languages.json [ci skip]	2019-09-27 15:15:41 +02:00
Ajinkya Kale	975aebd7e4	typo fix for wordnet_annotator (#4326 )	2019-09-27 11:52:53 +02:00
Ines Montani	eb0649e38e	Fix tag [ci skip]	2019-09-26 16:22:33 +02:00
Ines Montani	da9a869d3f	Update vectors name docs [ci skip]	2019-09-26 16:21:32 +02:00
Em Zhan	aafa091541	Fix typo in documentation (#4322 ) * Fix typo 'probj' instead of 'pobj' * Add spaCy contributor agreement for zqianem	2019-09-25 19:42:18 +02:00
Matthew Honnibal	92ed4dc5e0	Allow vectors name to be set in init-model (#4321 ) * Allow vectors name to be specified in init-model * Document --vectors-name argument to init-model * Update website/docs/api/cli.md Co-Authored-By: Ines Montani <ines@ines.io>	2019-09-25 13:11:00 +02:00
Eric Semeniuc	09816f8323	update sense2vec version (#4320 )	2019-09-25 12:17:54 +02:00
Sofie Van Landeghem	42340740e3	update neuralcoref example (#4317 )	2019-09-24 10:47:17 +02:00
Ines Montani	197406de1d	Update v2-2.md [ci skip]	2019-09-19 14:33:58 +02:00
Ines Montani	ddc09b08ed	Update v2-2.md [ci skip]	2019-09-19 00:58:30 +02:00
Matthew Honnibal	e2047576c4	Fix merge conflict	2019-09-18 21:42:11 +02:00
Matthew Honnibal	46c02d25b1	Merge changes to test_ner	2019-09-18 21:41:24 +02:00
Ines Montani	d84763727c	Remove unused setting [ci skip]	2019-09-18 21:24:14 +02:00
Ines Montani	9c940eab94	Update version in examples [ci skip]	2019-09-18 21:23:26 +02:00
Ines Montani	f873548f6c	Add backwards incompatibility [ci skip]	2019-09-18 21:21:48 +02:00
Ines Montani	6ebdc5f7d2	Update download docs [ci skip]	2019-09-18 21:21:39 +02:00
Ines Montani	dd1810f05a	Update DocBin and add docs	2019-09-18 20:23:21 +02:00
Ines Montani	d62690b3ba	Update examples	2019-09-18 19:57:36 +02:00
Ines Montani	bd435faddd	Add note about usage docs [ci skip]	2019-09-18 19:56:43 +02:00
Matthew Honnibal	931e96b6c7	DocPallet->DocBin in docs	2019-09-18 15:17:26 +02:00
Matthew Honnibal	f537cbeacc	Update v2-2 docs	2019-09-18 14:07:55 +02:00
Ines Montani	c922f8e8b0	Fix sources rendering [ci skip]	2019-09-18 12:09:21 +02:00
Ines Montani	ea2a686cf7	Support new model sources format [ci skip]	2019-09-18 11:42:45 +02:00
Ines Montani	ee15fdfe88	Fix wording [ci skip]	2019-09-17 14:59:42 +02:00
Ines Montani	f566e69f38	Fix --vectors-loc docs (closes #4270 )	2019-09-17 14:59:12 +02:00
Ines Montani	25c2b4b9a5	Improve init-model docs (see #4137 )	2019-09-17 14:51:44 +02:00
Ines Montani	198b7e9789	Auto-format [ci skip]	2019-09-17 14:48:35 +02:00
adrianeboyd	b5d999e510	Add textcat to train CLI (#4226 ) * Add doc.cats to spacy.gold at the paragraph level Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in the spacy JSON training format at the paragraph level. * `spacy.gold.docs_to_json()` writes `docs.cats` * `GoldCorpus` reads in cats in each `GoldParse` * Update instances of gold_tuples to handle cats Update iteration over gold_tuples / gold_parses to handle addition of cats at the paragraph level. * Add textcat to train CLI * Add textcat options to train CLI * Add textcat labels in `TextCategorizer.begin_training()` * Add textcat evaluation to `Scorer`: * For binary exclusive classes with provided label: F1 for label * For 2+ exclusive classes: F1 macro average * For multilabel (not exclusive): ROC AUC macro average (currently relying on sklearn) * Provide user info on textcat evaluation settings, potential incompatibilities * Provide pipeline to Scorer in `Language.evaluate` for textcat config * Customize train CLI output to include only metrics relevant to current pipeline * Add textcat evaluation to evaluate CLI * Fix handling of unset arguments and config params Fix handling of unset arguments and model confiug parameters in Scorer initialization. * Temporarily add sklearn requirement * Remove sklearn version number * Improve Scorer handling of models without textcats * Fixing Scorer handling of models without textcats * Update Scorer output for python 2.7 * Modify inf in Scorer for python 2.7 * Auto-format Also make small adjustments to make auto-formatting with black easier and produce nicer results * Move error message to Errors * Update documentation * Add cats to annotation JSON format [ci skip] * Fix tpl flag and docs [ci skip] * Switch to internal roc_auc_score Switch to internal `roc_auc_score()` adapted from scikit-learn. 
* Add AUCROCScore tests and improve errors/warnings * Add tests for AUCROCScore and roc_auc_score * Add missing error for only positive/negative values * Remove unnecessary warnings and errors * Make reduced roc_auc_score functions private Because most of the checks and warnings have been stripped for the internal functions and access is only intended through `ROCAUCScore`, make the functions for roc_auc_score adapted from scikit-learn private. * Check that data corresponds with multilabel flag Check that the training instances correspond with the multilabel flag, adding the multilabel flag if required. * Add textcat score to early stopping check * Add more checks to debug-data for textcat * Add example training data for textcat * Add more checks to textcat train CLI * Check configuration when extending base model * Fix typos * Update textcat example data * Provide licensing details and licenses for data * Remove two labels with no positive instances from jigsaw-toxic-comment data. Co-authored-by: Ines Montani <ines@ines.io>	2019-09-15 22:31:31 +02:00
Ines Montani	bab9976d9a	💫 Adjust Table API and add docs (#4289 ) * Adjust Table API and add docs * Add attributes and update description [ci skip] * Use strings.get_string_id instead of hash_string * Fix table method calls * Make orth arg in Lemmatizer.lookup optional Fall back to string, which is now handled by Table.__contains__ out-of-the-box * Fix method name * Auto-format	2019-09-15 22:08:13 +02:00
Ines Montani	23e28e2844	Merge branch 'master' into develop	2019-09-15 17:57:09 +02:00
Ines Montani	57f4c088be	Use full model name in quickstart install [ci skip]	2019-09-15 17:56:54 +02:00
Ines Montani	c7e4ea7154	Update examples and languages.json [ci skip]	2019-09-15 17:56:40 +02:00
Ines Montani	16c2522791	Merge branch 'master' into develop	2019-09-14 16:42:01 +02:00
Ines Montani	86befc80bf	WIP: Add v2.2 page [ci skip]	2019-09-14 16:41:48 +02:00
Ines Montani	04d36d2471	Remove unused link [ci skip]	2019-09-14 16:41:19 +02:00
Ines Montani	76d26a3d5e	Update site.json [ci skip]	2019-09-14 16:32:24 +02:00
Ines Montani	fe87ccc8d1	Update languages.json [ci skip]	2019-09-14 16:23:50 +02:00
Ines Montani	5c8b5e68ec	Fix docs consistency [ci skip]	2019-09-14 16:23:37 +02:00
Ines Montani	bbf7337eaf	Update adding languages docs [ci skip]	2019-09-14 15:32:15 +02:00
Ines Montani	3126dd0904	Tidy up and auto-format [ci skip]	2019-09-14 12:58:06 +02:00
Ines Montani	3c3658ef9f	Merge branch 'master' into develop	2019-09-12 18:03:01 +02:00
Ines Montani	03809b82b7	Support label schemes in model directory	2019-09-12 18:01:46 +02:00
Sofie Van Landeghem	9be4d1c105	Allow copying of user_data in as_doc (#4282 ) * Allow copying the user_data with as_doc + unit test * add option to docs * add typing * import fix * workaround to avoid bool clashing ... * bint instead of bool	2019-09-12 17:08:14 +02:00
Ines Montani	ff51fba96a	Update lemmatizer docs [ci skip]	2019-09-12 16:26:33 +02:00
Ines Montani	25b2b3ff45	Remove LEMMA from exception examples [ci skip]	2019-09-12 16:26:27 +02:00
Ines Montani	82c16b7943	Remove u-strings and fix formatting [ci skip]	2019-09-12 16:11:15 +02:00
Ines Montani	38037d6816	Update landing [ci skip]	2019-09-12 15:33:39 +02:00
Ines Montani	a31e9e1cd5	Update training docs [ci skip]	2019-09-12 15:32:39 +02:00
Ines Montani	b544dcb3c5	Document debug-data [ci skip]	2019-09-12 15:26:20 +02:00
Ines Montani	72274e83f2	Ensure accordion label is left-aligned [ci skip]	2019-09-12 15:24:17 +02:00
Ines Montani	c0a4cab178	Update "Adding languages" docs [ci skip]	2019-09-12 14:53:06 +02:00
Ines Montani	10257f3131	Document Lookups [ci skip]	2019-09-12 14:00:14 +02:00
Ines Montani	aa4ff0baa1	Auto-format [ci skip]	2019-09-12 13:05:53 +02:00
Ines Montani	625ce2db8e	Update Language docs [ci skip]	2019-09-12 13:03:38 +02:00
Ines Montani	cb41a33d14	Update displaCy API docs [ci skip]	2019-09-12 12:59:20 +02:00
Ines Montani	e7c20ad1d2	Update colors entry points docs [ci skip]	2019-09-12 12:59:10 +02:00
Ines Montani	7b59a919e6	Update entry points docs [ci skip]	2019-09-12 12:52:06 +02:00
Sofie Van Landeghem	0b4b4f1819	Documentation for Entity Linking (#4065 ) * document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bool's instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustements to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * typo fix * add candidate API to kb documentation * update API sidebar with EntityLinker and KnowledgeBase * remove EL from 101 docs * remove entity linker from 101 pipelines / rephrase * custom el model instead of existing model * set version to 2.2 for EL functionality * update documentation for 2 CLI scripts	2019-09-12 11:38:34 +02:00
Sofie Van Landeghem	53a9ca45c9	Docs: bufsize instead of buffsize (#4247 )	2019-09-06 11:11:54 +02:00
Sofie Van Landeghem	6b012cebff	Make pos/tag distinction more clear in docs (#4246 ) * make distinction between tag and pos more prominent in docs * out of the 101	2019-09-06 10:31:21 +02:00
Ines Montani	232a029de6	Send referrer for internal links [ci skip]	2019-09-05 10:41:46 +02:00
Ines Montani	2f31f96fce	Update languages.json [ci skip]	2019-09-04 18:15:42 +02:00
Ines Montani	2245e95e2d	Update languages.json [ci skip]	2019-09-04 17:11:40 +02:00
adrianeboyd	82159b5c19	Updates/bugfixes for NER/IOB converters (#4186 ) * Updates/bugfixes for NER/IOB converters * Converter formats `ner` and `iob` use autodetect to choose a converter if possible * `iob2json` is reverted to handle sentence-per-line data like `word1|pos1|ent1 word2|pos2|ent2` * Fix bug in `merge_sentences()` so the second sentence in each batch isn't skipped * `conll_ner2json` is made more general so it can handle more formats with whitespace-separated columns * Supports all formats where the first column is the token and the final column is the IOB tag; if present, the second column is the POS tag * As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O` separates documents * Add option for segmenting sentences (new flag `-s`) * Parser-based sentence segmentation with a provided model, otherwise with sentencizer (new option `-b` to specify model) * Can group sentences into documents with `n_sents` as long as sentence segmentation is available * Only applies automatic segmentation when there are no existing delimiters in the data * Provide info about settings applied during conversion with warnings and suggestions if settings conflict or might not be optimal. * Add tests for common formats * Add '(default)' back to docs for -c auto * Add document count back to output * Revert changes to converter output message * Use explicit tabs in convert CLI test data * Adjust/add messages for n_sents=1 default * Add sample NER data to training examples * Update README * Add links in docs to example NER data * Define msg within converters	2019-08-29 12:04:01 +02:00
Ines Montani	b91425f803	Update universe.json [ci skip]	2019-08-28 13:45:06 +02:00
Ines Montani	aedae8b4c5	Update universe.json [ci skip]	2019-08-28 11:59:06 +02:00
Björn Böing	bae0455f91	Fix visualizer options linking for displaCy. (#4202 )	2019-08-27 14:04:28 +02:00
Ines Montani	8114933f01	Fix universe.json [ci skip]	2019-08-27 12:13:42 +02:00
Ines Montani	48385552c6	Update languages.json [ci skip]	2019-08-27 11:52:51 +02:00
yanaiela	5d7bc26735	new universe project - the numeric fused-head (#4192 ) * new universe project * Update website/meta/universe.json Co-Authored-By: Ines Montani <ines@ines.io> * Update website/meta/universe.json Co-Authored-By: Ines Montani <ines@ines.io>	2019-08-25 17:25:28 +02:00
Christos Aridas	61f5c007a0	DOC Fix pipeline functions examples (#4189 )	2019-08-23 19:15:32 +02:00
Ines Montani	b072c13017	Update universe with videos [ci skip]	2019-08-21 21:35:37 +02:00
Pavle Vidanović	4fe9329bfb	Serbian language code update "rs" -> "sr" (#4159 ) * Serbian stopwords added. (cyrillic alphabet) * spaCy Contribution agreement included. * Test initialize updated * Serbian language code update. --bugfix	2019-08-21 19:57:37 +02:00
adrianeboyd	8fe7bdd0fa	Improve token pattern checking without validation (#4105 ) * Fix typo in rule-based matching docs * Improve token pattern checking without validation Add more detailed token pattern checks without full JSON pattern validation and provide more detailed error messages. Addresses #4070 (also related: #4063, #4100). * Check whether top-level attributes in patterns and attr for PhraseMatcher are in token pattern schema * Check whether attribute value types are supported in general (as opposed to per attribute with full validation) * Report various internal error types (OverflowError, AttributeError, KeyError) as ValueError with standard error messages * Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS, LEMMA, and DEP * Add error messages with relevant details on how to use validate=True or nlp() instead of nlp.make_doc() * Support attr=TEXT for PhraseMatcher * Add NORM to schema * Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler * Remove unnecessary .keys() * Rephrase error messages * Add another type check to Matcher Add another type check to Matcher for more understandable error messages in some rare cases. * Support phrase_matcher_attr=TEXT for EntityRuler * Don't use spacy.errors in examples and bin scripts * Fix error code * Auto-format Also try get Azure pipelines to finally start a build :( * Update errors.py Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2019-08-21 14:00:37 +02:00
Ines Montani	3134a9b6e0	Add section on expanding regex match to token boundaries (see #4158 ) [ci skip]	2019-08-21 12:53:31 +02:00
Ines Montani	072860fcd0	Auto-format [ci skip]	2019-08-20 14:46:41 +02:00
Andrei-Marius Avram	199589228e	Added RONEC to spaCy Universe (#4151 ) * Added RONEC to spaCy Universe * Added contributor file * Corrected date from .github/contributors/avramandrei.md * Convert tabs to spaces * Remove duplicate keys Can only have one GitHub link unfortunately * Also add models category * Adjust ID This is used to generate the URL, so a simpler string is better	2019-08-20 14:46:07 +02:00
Ines Montani	fe230c8776	Fix typo [ci skip]	2019-08-20 13:02:05 +02:00
Daniel Bourke	b0a28fd0de	fix PhraseMatcher link typo (#4150 ) /api/phtasematcher -> /api/phrasematcher	2019-08-20 13:01:43 +02:00
Ines Montani	ce4c3e5204	Document force flag on set_extension (closes #4148 )	2019-08-19 19:22:07 +02:00
Ines Montani	66aba2d676	Improve regex matching docs [ci skip]	2019-08-19 13:59:41 +02:00
Sofie Van Landeghem	cc66f47893	Make enabling/disabling jupyter mode more explicit (#4144 ) * make enabling/disabling jupyter mode more explicit * markup fix	2019-08-19 11:53:34 +02:00
Ines Montani	e520eb3f6c	Make visualized NER examples more clear (closes #4104 ) [ci skip]	2019-08-18 16:29:29 +02:00
Jeno	91441f169c	Update universe.json to include negspacy (#4132 )	2019-08-16 17:48:17 +02:00
Ines Montani	1362f793cf	Improve docs on phrase pattern attributes (closes #4100 ) [ci skip]	2019-08-11 11:13:49 +02:00
Ines Montani	1f4d8bf77e	Update universe.json [ci skip]	2019-08-09 17:42:37 +02:00
ICLR&D	87e40b17a0	Add entry for Blackstone in universe.json (#4101 ) * Add entry for Blackstone in universe.json Add an entry for the Blackstone project. Checked JSON is valid. * Create ICLRandD.md * Fix indentation (tabs to spaces) It looks like during validation, the JSON file automatically changed spaces to tabs. This caused the diff to show everything as changed, which is obviously not true. This hopefully fixes that. * Try to fix formatting for diff * Fix diff Co-authored-by: Ines Montani <ines@ines.io>	2019-08-09 17:16:51 +02:00
Ines Montani	a2ac2e873f	Update Binder version [ci skip]	2019-08-08 13:03:45 +02:00
Ines Montani	3e60afacf9	Add Serbian to languages [ci skip]	2019-08-07 13:38:25 +02:00
Ines Montani	1dc28a9ecb	Update Binder version [ci skip]	2019-08-07 13:38:12 +02:00
Ines Montani	8b4a0fabbb	Adjust docs example [ci skip]	2019-08-07 00:46:47 +02:00
adrianeboyd	69aca7d839	Add validate option to EntityRuler (#4089 ) * Add validate option to EntityRuler * Add validate to EntityRuler, passed to Matcher and PhraseMatcher * Add validate to usage and API docs * Update website/docs/usage/rule-based-matching.md Co-Authored-By: Ines Montani <ines@ines.io> * Update website/docs/usage/rule-based-matching.md Co-Authored-By: Ines Montani <ines@ines.io>	2019-08-07 00:40:53 +02:00
Ines Montani	4ae320e5c2	Use consistent casing for entity ruler patterns (see #4063 ) [ci skip]	2019-08-06 12:20:22 +02:00
Ines Montani	223bde5cf6	Improve docs on matcher attributes [ci skip] (closes #4063 )	2019-08-06 12:13:42 +02:00
Ines Montani	2bfae0b167	Auto-format	2019-08-06 12:13:31 +02:00
Ines Montani	7f3212e2f5	💫 Sync branches (#4084 ) [ci skip] * Update from master * Re-added Universe readme (#3688) (closes #3680) * Fix typo * Add version tag to `--base-model` argument (closes #3720) * fixing regex matcher examples (#3708) (#3719) * Improve Token.prob and Lexeme.prob docs (resolves #3701) * Fix DependencyParser.predict docs (resolves #3561) * Update languages.json Co-authored-by: Bram Vanroy <Bram.Vanroy@UGent.be> Co-authored-by: Aaron Kub <aaronkub@gmail.com>	2019-08-05 14:32:54 +02:00
Ines Montani	0f740fad1a	Update universe.json [ci skip]	2019-08-05 14:30:07 +02:00
Ines Montani	0f76e0022d	Update .tensor docs [ci skip]	2019-08-01 18:37:09 +02:00
Ines Montani	3072eb28c2	Support and render Markdown in model meta [ci skip]	2019-08-01 18:33:10 +02:00
Björn Böing	a83c0add2e	Add links to tokenizer API docs to refer relevant information. (#4064 ) * Add links to tokenizer API docs to refer relevant information. * Add suggested changes Co-Authored-By: Ines Montani <ines@ines.io>	2019-08-01 14:28:38 +02:00
Ejar	2cdf7d39e7	Corrected imported function (#4062 ) The example showed an incorrect import	2019-08-01 12:43:36 +02:00
Mohammed Daudali	23ec07debd	Correct typo for AllenAI url on homepage (#4050 ) * Typo fix for AllenAI url Changed incorrect home page url for AllenAI from appenai.org to allenai.org * Sign contributor agreement * Change date format	2019-07-31 00:16:33 +02:00
Ines Montani	fcd2f7f656	Fix version introducing Span.ents (closes #4045 ) [ci skip]	2019-07-30 10:32:33 +02:00
Ines Montani	fc69da0acb	💫 Support simple training format in nlp.evaluate and add tests (#4033 ) * Support simple training format in nlp.evaluate and add tests * Update docs [ci skip]	2019-07-27 17:30:18 +02:00
Ines Montani	bd39e5e630	Add "Processing text" section [ci skip]	2019-07-25 17:38:03 +02:00
Ines Montani	a5e3d2f318	Improve section on disabling pipes [ci skip]	2019-07-25 14:25:34 +02:00
Ines Montani	02e444ec7c	Add section on special tokenizer component [ci skip]	2019-07-25 14:25:03 +02:00
Ines Montani	1fa6d6ba55	Improve consistency of docs examples [ci skip]	2019-07-25 14:24:56 +02:00
adrianeboyd	784a5f4284	Update GoldParse attributes in API docs (#4023 ) * add `words` * update name of entity list to `ner` I think it might be a bit more consistent to have `ner` named `entities` or `ents` (and `ents` is actually set somewhere to `None`, which is a bit confusing), but it looks like renaming it would be a non-trivial decision.	2019-07-25 12:14:02 +02:00
Adriane Boyd	6c5044ed2a	Update annotation docs for German - minor formatting fixes - remove STTS tags not used in Tiger - update list of dependency relations to match tiger2dep	2019-07-22 11:59:03 +02:00
adrianeboyd	d2c474cbb7	Fix initial example in EntityRuler API docs (#3999 )	2019-07-22 11:18:55 +02:00
Ines Montani	1167c303a0	Fix typos [ci skip]	2019-07-19 13:08:18 +02:00
BreakBB	6d9a7c0749	Add '--silent' argument to bash example of CLI Info	2019-07-19 10:00:45 +02:00
BreakBB	c8ba0f690d	Fix --force parameter of CLI package	2019-07-19 10:00:45 +02:00
Ines Montani	a0acb1b3cd	Also add infobox to API docs [ci skip]	2019-07-17 16:26:41 +02:00
Ines Montani	c3ead02ea5	Adjust wording [ci skip]	2019-07-17 16:06:25 +02:00
Ines Montani	1d5ff3e455	Add infobox	2019-07-17 15:29:36 +02:00
Ines Montani	114cb18892	Improve wording	2019-07-17 15:27:53 +02:00
Ines Montani	7522beef9e	Add "Things to try" prompts	2019-07-17 15:25:02 +02:00
Ines Montani	9f02e3c027	Adjust example Not actually supported in this alignment interpretation	2019-07-17 15:13:50 +02:00
Ines Montani	1ea472468a	Add usage docs for aligning tokenization	2019-07-17 15:08:33 +02:00
Ines Montani	f97a555445	Add API documentation	2019-07-17 14:30:04 +02:00
pmbaumgartner	040bb061fd	Merge branch 'master' of github.com:pmbaumgartner/spaCy	2019-07-14 20:25:37 -04:00
pmbaumgartner	9a86d95ea2	fix custom attribute links	2019-07-14 20:23:54 -04:00

1710 Commits