* Update token.md
documentation is confusing: A '?' is a right punct, but '¿' is a left punct
* Update token.md
add quotations around parentheses in `is_left_punct` and `is_right_punct` for clarrification, ensuring the question mark that follows is not percieved as an example of left and right punctuation
* Move quotes into code block [ci skip]
* Switch from mecab-python3 to fugashi
mecab-python3 has been the best MeCab binding for a long time but it's
not very actively maintained, and since it's based on old SWIG code
distributed with MeCab there's a limit to how effectively it can be
maintained.
Fugashi is a new Cython-based MeCab wrapper I wrote. Since it's not
based on the old SWIG code it's easier to keep it current and make small
deviations from the MeCab C/C++ API where that makes sense.
* Change mecab-python3 to fugashi in setup.cfg
* Change "mecab tags" to "unidic tags"
The tags come from MeCab, but the tag schema is specified by Unidic, so
it's more proper to refer to it that way.
* Update conftest
* Add fugashi link to external deps list for Japanese
* Expose tokenizer rules as a property
Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)
Add tests and update Tokenizer API docs.
* Update Hungarian punctuation to remove empty string
Update Hungarian punctuation definitions so that `_units` does not match
an empty string.
* Use _load_special_tokenization consistently
Use `_load_special_tokenization()` and have it to handle `None` checks.
* Fix precedence of `token_match` vs. special cases
Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.
* Add `make_debug_doc()` to the Tokenizer
Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.
Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.
* Update tokenization usage docs
Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.
Add more examples for customizing tokenizers while preserving the
existing defaults.
Minor edits / clarifications.
* Revert "Update Hungarian punctuation to remove empty string"
This reverts commit f0a577f7a5.
* Rework `make_debug_doc()` as `explain()`
Rework `make_debug_doc()` as `explain()`, which returns a list of
`(pattern_string, token_string)` tuples rather than a non-standard
`Doc`. Update docs and tests accordingly, leaving the visualization for
future work.
* Handle cases with bad tokenizer patterns
Detect when tokenizer patterns match empty prefixes and suffixes so that
`explain()` does not hang on bad patterns.
* Remove unused displacy image
* Add tokenizer.explain() to usage docs
Update pseudo-code and algorithm description to correspond to current
tokenizer behavior.
Add more examples for customizing tokenizers while preserving the
existing defaults.
Minor edits / clarifications.
* Add arch for MishWindowEncoder
* Support mish in tok2vec and conv window >=2
* Pass new tok2vec settings from parser
* Syntax error
* Fix tok2vec setting
* Fix registration of MishWindowEncoder
* Fix receptive field setting
* Fix mish arch
* Pass more options from parser
* Support more tok2vec options in pretrain
* Require thinc 7.3
* Add docs [ci skip]
* Require thinc 7.3.0.dev0 to run CI
* Run black
* Fix typo
* Update Thinc version
Co-authored-by: Ines Montani <ines@ines.io>
* Implement new API for {Phrase}Matcher.add (backwards-compatible)
* Update docs
* Also update DependencyMatcher.add
* Update internals
* Rewrite tests to use new API
* Add basic check for common mistake
Raise error with suggestion if user likely passed in a pattern instead of a list of patterns
* Fix typo [ci skip]
* Update English tag_map
Update English tag_map based on this conversion table:
https://universaldependencies.org/tagset-conversion/en-penn-uposf.html
* Update German tag_map
Update German tag_map based on this conversion table:
https://universaldependencies.org/tagset-conversion/de-stts-uposf.html
* Add missing Tiger dependencies to glossary
* Add quotes to definition of TO
* Update POS/TAG tables in docs
Update POS/TAG tables for English and German docs using current
information generated from the tag_maps and GLOSSARY.
* Update warning that -PRON- is specific to English
* Revert docs to default JSON output with convert
* Revert "Revert docs to default JSON output with convert"
This reverts commit 6b78c048f1.
* Support train dict format as JSONL
* Add (overly simple) check for dict vs. tuple to read JSONL lines as
either train dicts or train tuples
* Extend JSON/JSONL roundtrip conversion tests using `docs_to_json()`
and `GoldCorpus.train_tuples`
* Revert docs to default JSON output with convert
* Move test
* Allow default in Lookups.get_table
* Start with blank tables in Lookups.from_bytes
* Refactor lemmatizer to hold instance of Lookups
* Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk)
* Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency
* Remove old and unsupported Lemmatizer.load classmethod
* Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need
* Update tests and docs
* Fix more tests
* Fix lemmatizer
* Upgrade pytest to try and fix weird CI errors
* Try pytest 4.6.5
* Allow vectors name to be specified in init-model
* Document --vectors-name argument to init-model
* Update website/docs/api/cli.md
Co-Authored-By: Ines Montani <ines@ines.io>
* Add doc.cats to spacy.gold at the paragraph level
Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in
the spacy JSON training format at the paragraph level.
* `spacy.gold.docs_to_json()` writes `docs.cats`
* `GoldCorpus` reads in cats in each `GoldParse`
* Update instances of gold_tuples to handle cats
Update iteration over gold_tuples / gold_parses to handle addition of
cats at the paragraph level.
* Add textcat to train CLI
* Add textcat options to train CLI
* Add textcat labels in `TextCategorizer.begin_training()`
* Add textcat evaluation to `Scorer`:
* For binary exclusive classes with provided label: F1 for label
* For 2+ exclusive classes: F1 macro average
* For multilabel (not exclusive): ROC AUC macro average (currently
relying on sklearn)
* Provide user info on textcat evaluation settings, potential
incompatibilities
* Provide pipeline to Scorer in `Language.evaluate` for textcat config
* Customize train CLI output to include only metrics relevant to current
pipeline
* Add textcat evaluation to evaluate CLI
* Fix handling of unset arguments and config params
Fix handling of unset arguments and model confiug parameters in Scorer
initialization.
* Temporarily add sklearn requirement
* Remove sklearn version number
* Improve Scorer handling of models without textcats
* Fixing Scorer handling of models without textcats
* Update Scorer output for python 2.7
* Modify inf in Scorer for python 2.7
* Auto-format
Also make small adjustments to make auto-formatting with black easier and produce nicer results
* Move error message to Errors
* Update documentation
* Add cats to annotation JSON format [ci skip]
* Fix tpl flag and docs [ci skip]
* Switch to internal roc_auc_score
Switch to internal `roc_auc_score()` adapted from scikit-learn.
* Add AUCROCScore tests and improve errors/warnings
* Add tests for AUCROCScore and roc_auc_score
* Add missing error for only positive/negative values
* Remove unnecessary warnings and errors
* Make reduced roc_auc_score functions private
Because most of the checks and warnings have been stripped for the
internal functions and access is only intended through `ROCAUCScore`,
make the functions for roc_auc_score adapted from scikit-learn private.
* Check that data corresponds with multilabel flag
Check that the training instances correspond with the multilabel flag,
adding the multilabel flag if required.
* Add textcat score to early stopping check
* Add more checks to debug-data for textcat
* Add example training data for textcat
* Add more checks to textcat train CLI
* Check configuration when extending base model
* Fix typos
* Update textcat example data
* Provide licensing details and licenses for data
* Remove two labels with no positive instances from jigsaw-toxic-comment
data.
Co-authored-by: Ines Montani <ines@ines.io>
* Adjust Table API and add docs
* Add attributes and update description [ci skip]
* Use strings.get_string_id instead of hash_string
* Fix table method calls
* Make orth arg in Lemmatizer.lookup optional
Fall back to string, which is now handled by Table.__contains__ out-of-the-box
* Fix method name
* Auto-format
* Allow copying the user_data with as_doc + unit test
* add option to docs
* add typing
* import fix
* workaround to avoid bool clashing ...
* bint instead of bool
* document token ent_kb_id
* document span kb_id
* update pipeline documentation
* prior and context weights as bool's instead
* entitylinker api documentation
* drop for both models
* finish entitylinker documentation
* small fixes
* documentation for KB
* candidate documentation
* links to api pages in code
* small fix
* frequency examples as counts for consistency
* consistent documentation about tensors returned by predict
* add entity linking to usage 101
* add entity linking infobox and KB section to 101
* entity-linking in linguistic features
* small typo corrections
* training example and docs for entity_linker
* predefined nlp and kb
* revert back to similarity encodings for simplicity (for now)
* set prior probabilities to 0 when excluded
* code clean up
* bugfix: deleting kb ID from tokens when entities were removed
* refactor train el example to use either model or vocab
* pretrain_kb example for example kb generation
* add to training docs for KB + EL example scripts
* small fixes
* error numbering
* ensure the language of vocab and nlp stay consistent across serialization
* equality with =
* avoid conflict in errors file
* add error 151
* final adjustements to the train scripts - consistency
* update of goldparse documentation
* small corrections
* push commit
* typo fix
* add candidate API to kb documentation
* update API sidebar with EntityLinker and KnowledgeBase
* remove EL from 101 docs
* remove entity linker from 101 pipelines / rephrase
* custom el model instead of existing model
* set version to 2.2 for EL functionality
* update documentation for 2 CLI scripts
* Updates/bugfixes for NER/IOB converters
* Converter formats `ner` and `iob` use autodetect to choose a converter if
possible
* `iob2json` is reverted to handle sentence-per-line data like
`word1|pos1|ent1 word2|pos2|ent2`
* Fix bug in `merge_sentences()` so the second sentence in each batch isn't
skipped
* `conll_ner2json` is made more general so it can handle more formats with
whitespace-separated columns
* Supports all formats where the first column is the token and the final
column is the IOB tag; if present, the second column is the POS tag
* As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O`
separates documents
* Add option for segmenting sentences (new flag `-s`)
* Parser-based sentence segmentation with a provided model, otherwise with
sentencizer (new option `-b` to specify model)
* Can group sentences into documents with `n_sents` as long as sentence
segmentation is available
* Only applies automatic segmentation when there are no existing delimiters
in the data
* Provide info about settings applied during conversion with warnings and
suggestions if settings conflict or might not be not optimal.
* Add tests for common formats
* Add '(default)' back to docs for -c auto
* Add document count back to output
* Revert changes to converter output message
* Use explicit tabs in convert CLI test data
* Adjust/add messages for n_sents=1 default
* Add sample NER data to training examples
* Update README
* Add links in docs to example NER data
* Define msg within converters
* Fix typo in rule-based matching docs
* Improve token pattern checking without validation
Add more detailed token pattern checks without full JSON pattern validation and
provide more detailed error messages.
Addresses #4070 (also related: #4063, #4100).
* Check whether top-level attributes in patterns and attr for PhraseMatcher are
in token pattern schema
* Check whether attribute value types are supported in general (as opposed to
per attribute with full validation)
* Report various internal error types (OverflowError, AttributeError, KeyError)
as ValueError with standard error messages
* Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS,
LEMMA, and DEP
* Add error messages with relevant details on how to use validate=True or nlp()
instead of nlp.make_doc()
* Support attr=TEXT for PhraseMatcher
* Add NORM to schema
* Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler
* Remove unnecessary .keys()
* Rephrase error messages
* Add another type check to Matcher
Add another type check to Matcher for more understandable error messages
in some rare cases.
* Support phrase_matcher_attr=TEXT for EntityRuler
* Don't use spacy.errors in examples and bin scripts
* Fix error code
* Auto-format
Also try get Azure pipelines to finally start a build :(
* Update errors.py
Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Added RONEC to spaCy Universe
* Added contributor file
* Corrected date from .github/contributors/avramandrei.md
* Convert tabs to spaces
* Remove duplicate keys
Can only have one GitHub link unfortunately
* Also add models category
* Adjust ID
This is used to generate the URL, so a simpler string is better
* Add entry for Blackstone in universe.json
Add an entry for the Blackstone project. Checked JSON is valid.
* Create ICLRandD.md
* Fix indentation (tabs to spaces)
It looks like during validation, the JSON file automatically changed spaces to tabs. This caused the diff to show *everything* as changed, which is obviously not true. This hopefully fixes that.
* Try to fix formatting for diff
* Fix diff
Co-authored-by: Ines Montani <ines@ines.io>
* Typo fix for AllenAI url
Changed incorrect home page url for AllenAI from appenai.org to allenai.org
* Sign contributor agreement
* Change date format
* add `words`
* update name of entity list to `ner`
I think it might be a bit more consistent to have `ner` named `entities`
or `ents` (and `ents` is actually set somewhere to `None`, which is a
bit confusing), but it looks like renaming it would be a non-trivial
decision.