spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-01-07 17:21:15 +03:00

Author	SHA1	Message	Date
Ines Montani	abd5c06374	Adjust formatting [ci skip]	2020-02-03 13:00:02 +01:00
Martin A. Kayser	02a44c5be2	Adding a note on retrieving the string rep of the match_id (#4904) Stolen from here: https://stackoverflow.com/questions/47638877/using-phrasematcher-in-spacy-to-find-multiple-match-types	2020-02-03 12:58:58 +01:00
Omri Mendels	6ff947e1f9	Added presidio-research to universe.json (#4950) * Added presidio-research to universe.json Added a reference to Presidio Research, the data-science toolbox for Microsoft Presidio. * Updated url	2020-02-03 12:57:55 +01:00
Paco Nathan	49fefb6139	Submitting `PyTextRank` for inclusion in the spaCy uniVerse (#4942) * submitting PyTextRank for consideration of including in the spaCy uniVerse * including SCA	2020-01-28 11:37:54 +01:00
adrianeboyd	7ad000fce7	Update docs for train CLI --use_gpu option (#4927)	2020-01-20 17:02:47 +01:00
Bram Vanroy	718704022a	Changes to spacy_conll in universe (#4914) * Update information on spacy_conll * Typo fix	2020-01-16 01:56:39 +01:00
Preston Badeer	b216ff43c9	Update vectors-similarity.md (#4889) These links are broken on the website, due to quotes around the URLs.	2020-01-08 16:49:40 +01:00
Geoffrey Gordon Ashbrook	53929138d7	remove extra word typo (#4875) "let you find you"	2020-01-06 12:37:42 +01:00
Ines Montani	400257a802	Update index.md [ci skip]	2020-01-04 01:52:18 +01:00
Ivan Echevarria	ef13e0c038	Add n_process to Language.pipe documentation (#4842) [ci skip] * Add n_process to documentation * Auto-format and add default [ci skip] Co-authored-by: Ines Montani <ines@ines.io>	2019-12-29 14:23:33 +01:00
Ines Montani	1b838d1313	Divide models into core and starters [ci skip]	2019-12-21 14:10:22 +01:00
Sofie Van Landeghem	8ebbb85117	Documentation for PhraseMatcher constructor (#4826) * add max_length as argument for init PhraseMatcher * improve error message too	2019-12-20 23:00:04 +01:00
Ines Montani	c466e02466	Update universe [ci skip]	2019-12-13 15:57:39 +01:00
Thiago Lages de Alencar	a067ded495	Update doc.md (#4796)	2019-12-11 18:21:40 +01:00
Tclack88	ab8dc2732c	Update token.md (#4767 ) * Update token.md documentation is confusing: A '?' is a right punct, but '¿' is a left punct * Update token.md add quotations around parentheses in `is_left_punct` and `is_right_punct` for clarrification, ensuring the question mark that follows is not percieved as an example of left and right punctuation * Move quotes into code block [ci skip]	2019-12-06 19:22:02 +01:00
Ines Montani	bf611ebca7	Document jsonl option on converter [ci skip]	2019-12-06 19:17:45 +01:00
Nicolai Bjerre Pedersen	de5453cdcb	Fix link to user hooks in docs (#4778) * Fix link to user hooks in docs * Update mr_bjerre.md Mistake in contributor agreement * Apparently hard to get it right (wrong name of sca)	2019-12-06 19:17:12 +01:00
Ines Montani	cbacb0f1a4	Update shape docs and examples (resolves #4615) [ci skip]	2019-11-23 17:16:55 +01:00
Paul O'Leary McCann	f0e3e606a6	Replace python-mecab3 with fugashi for Japanese (#4621 ) * Switch from mecab-python3 to fugashi mecab-python3 has been the best MeCab binding for a long time but it's not very actively maintained, and since it's based on old SWIG code distributed with MeCab there's a limit to how effectively it can be maintained. Fugashi is a new Cython-based MeCab wrapper I wrote. Since it's not based on the old SWIG code it's easier to keep it current and make small deviations from the MeCab C/C++ API where that makes sense. * Change mecab-python3 to fugashi in setup.cfg * Change "mecab tags" to "unidic tags" The tags come from MeCab, but the tag schema is specified by Unidic, so it's more proper to refer to it that way. * Update conftest * Add fugashi link to external deps list for Japanese	2019-11-23 14:31:04 +01:00
Ines Montani	a6200bc424	Update scorer.md [ci skip]	2019-11-21 17:02:43 +01:00
richardpaulhudson	8d06386e1e	Update to Holmes Universe entry (#4679) * Updated Universe entry for Holmes * Correction * Updated model name * Updated wording	2019-11-21 16:23:24 +01:00
Ines Montani	235fe6fe3b	Auto-format [ci skip]	2019-11-20 13:14:58 +01:00
adrianeboyd	2c876eb672	Add tokenizer explain() debugging method (#4596) * Expose tokenizer rules as a property Expose the tokenizer rules property in the same way as the other core properties. (The cache resetting is overkill, but consistent with `from_bytes` for now.) Add tests and update Tokenizer API docs. * Update Hungarian punctuation to remove empty string Update Hungarian punctuation definitions so that `_units` does not match an empty string. * Use _load_special_tokenization consistently Use `_load_special_tokenization()` and have it to handle `None` checks. * Fix precedence of `token_match` vs. special cases Remove `token_match` check from `_split_affixes()` so that special cases have precedence over `token_match`. `token_match` is checked only before infixes are split. * Add `make_debug_doc()` to the Tokenizer Add `make_debug_doc()` to the Tokenizer as a working implementation of the pseudo-code in the docs. Add a test (marked as slow) that checks that `nlp.tokenizer()` and `nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens for all languages that have `examples.sentences` that can be imported. * Update tokenization usage docs Update pseudo-code and algorithm description to correspond to `nlp.tokenizer.make_debug_doc()` with example debugging usage. Add more examples for customizing tokenizers while preserving the existing defaults. Minor edits / clarifications. * Revert "Update Hungarian punctuation to remove empty string" This reverts commit `f0a577f7a5`. * Rework `make_debug_doc()` as `explain()` Rework `make_debug_doc()` as `explain()`, which returns a list of `(pattern_string, token_string)` tuples rather than a non-standard `Doc`. Update docs and tests accordingly, leaving the visualization for future work. * Handle cases with bad tokenizer patterns Detect when tokenizer patterns match empty prefixes and suffixes so that `explain()` does not hang on bad patterns. * Remove unused displacy image * Add tokenizer.explain() to usage docs	2019-11-20 13:07:25 +01:00
Ines Montani	e8b9cee6fd	Make example consistent with model (closes #4587) [ci skip]	2019-11-18 12:41:48 +01:00
Ines Montani	e01a1a237f	Auto-format [ci skip]	2019-11-18 12:41:31 +01:00
adrianeboyd	62e00fd9da	Update tokenization usage docs (#4666) Update pseudo-code and algorithm description to correspond to current tokenizer behavior. Add more examples for customizing tokenizers while preserving the existing defaults. Minor edits / clarifications.	2019-11-18 12:35:13 +01:00
Ines Montani	5adcb352e9	Adjust order of docs sections [ci skip]	2019-11-17 16:08:56 +01:00
Ines Montani	e30d08410a	Add CI for Python 3.8 (#4479) * Add 3.8 classifier * Update azure-pipelines.yml * Remove 3.8 warning from docs [ci skip]	2019-11-15 01:13:48 +01:00
f11r	877971860e	Fix assert in sentencizer documentation. (#4639)	2019-11-13 15:24:14 +01:00
Ines Montani	9d5ff177c4	Work around Markdown rendering issue surfaced in #4600 [ci skip]	2019-11-11 17:12:08 +01:00
adrianeboyd	0f8678c0b1	Fix DocBin.merge() example (#4599)	2019-11-07 11:26:48 +01:00
walterhenry	5563c42ef5	Fixed typo: Added space between "recognize" and "various" (#4600)	2019-11-06 23:06:36 +01:00
Ines Montani	828ef27a32	Add warnings about 3.8 (resolves #4593) [ci skip]	2019-11-05 18:30:11 +01:00
Ines Montani	4b95587ad4	Update universe.json [ci skip]	2019-11-04 13:55:55 +01:00
Yash Patadia	0c396aeed4	add dframcy to universe.json (#4580)	2019-11-04 13:53:23 +01:00
Ines Montani	59358d9b71	Remove box-decoration-break from entities in displacy (#4564)	2019-10-31 15:09:43 +01:00
Ines Montani	4e1de85e43	Update syntax iterators [ci skip]	2019-10-30 14:31:40 +01:00
Ines Montani	726c5dd306	Update universe.json [ci skip]	2019-10-30 13:29:00 +01:00
Neel Kamath	6c036ab57d	Add "spaCy Server" to spaCy Universe (#4553) * Add "spaCy Server" to spaCy Universe * Accept the spaCy Contributor Agreement	2019-10-30 13:20:46 +01:00
Nipun Sadvilkar	2a5e71232b	✨ project: pySBD - Python Sentence Boundary Disambiguation (#4455) * ✨ project: pySBD - Python Sentence Boundary Disambiguation * 📝 Update links and description * 🐛 Fix missing comma * Update universe.json pysbd as a spacy component through entrypoints * 🚨 Fix universe.json * 📝 Update code_example	2019-10-30 12:13:29 +01:00
Matthew Honnibal	d5509e0989	Support Mish activation (requires Thinc 7.3) (#4536) * Add arch for MishWindowEncoder * Support mish in tok2vec and conv window >=2 * Pass new tok2vec settings from parser * Syntax error * Fix tok2vec setting * Fix registration of MishWindowEncoder * Fix receptive field setting * Fix mish arch * Pass more options from parser * Support more tok2vec options in pretrain * Require thinc 7.3 * Add docs [ci skip] * Require thinc 7.3.0.dev0 to run CI * Run black * Fix typo * Update Thinc version Co-authored-by: Ines Montani <ines@ines.io>	2019-10-28 15:16:33 +01:00
Ines Montani	1180304449	Update languages.json [ci skip]	2019-10-26 13:51:42 +02:00
Ines Montani	cfffdba7b1	Implement new API for {Phrase}Matcher.add (backwards-compatible) (#4522) * Implement new API for {Phrase}Matcher.add (backwards-compatible) * Update docs * Also update DependencyMatcher.add * Update internals * Rewrite tests to use new API * Add basic check for common mistake Raise error with suggestion if user likely passed in a pattern instead of a list of patterns * Fix typo [ci skip]	2019-10-25 22:21:08 +02:00
Ines Montani	d2da117114	Also support passing list to Language.disable_pipes (#4521) * Also support passing list to Language.disable_pipes * Adjust internals	2019-10-25 16:19:08 +02:00
Ines Montani	493be8e9db	Update new version identifier [ci skip]	2019-10-25 11:42:49 +02:00
Ines Montani	2abf1028cb	Update docs [ci skip]	2019-10-25 11:27:00 +02:00
Ines Montani	f31876154d	Adjust formatting [ci skip]	2019-10-25 11:19:46 +02:00
Kabir Khan	93640373c7	Make entity_ruler ent_id resolution 2x faster and add docs for… (#4513) * Update entityruler.py * Making ent_id resolution 2x faster and adding docs * Fixing newlines in docstrings * Fixing newlines in docstrings	2019-10-25 11:16:42 +02:00
adrianeboyd	1b0bbe4b76	Update tag maps and docs for English and German (#4501) * Update English tag_map Update English tag_map based on this conversion table: https://universaldependencies.org/tagset-conversion/en-penn-uposf.html * Update German tag_map Update German tag_map based on this conversion table: https://universaldependencies.org/tagset-conversion/de-stts-uposf.html * Add missing Tiger dependencies to glossary * Add quotes to definition of TO * Update POS/TAG tables in docs Update POS/TAG tables for English and German docs using current information generated from the tag_maps and GLOSSARY. * Update warning that -PRON- is specific to English * Revert docs to default JSON output with convert * Revert "Revert docs to default JSON output with convert" This reverts commit `6b78c048f1`.	2019-10-24 12:56:05 +02:00
adrianeboyd	8516e9d53b	Support train dict format as JSONL (#4471) * Support train dict format as JSONL * Add (overly simple) check for dict vs. tuple to read JSONL lines as either train dicts or train tuples * Extend JSON/JSONL roundtrip conversion tests using `docs_to_json()` and `GoldCorpus.train_tuples` * Revert docs to default JSON output with convert	2019-10-23 16:01:44 +02:00


1576 Commits