* Add static method to Doc to allow merging of multiple docs.
* Add error description for the error that occurs if docs with different
vocabs (from different languages) are merged in Doc.from_docs().
* Add test for Doc.from_docs() implementation.
* Fix using numpy's concatenate in Doc.from_docs.
* Replace typing's type annotations in from_docs.
* Simply remove type annotations in from_docs.
* Add documentation for Doc.from_docs to api.
* Simplify from_docs, its test and the api doc for codebase consistency.
* Fix merging of Doc objects that end with whitespaces (Achieved by simply not setting the SPACY attribute on whitespace tokens). Remove two unnecessary imports of attributes.
* Add merging of user data from Doc objects in from_docs. Add user data test case to corresponding test. Add applicable warning messages.
* Fix incorrect setting of tokens idx by using concatenated spaces (again). Add test case to corresponding test.
* Add MORPH to attrs
* Update warnings calls
* Remove out-dated error from merge
* Rename space_delimiter to ensure_whitespace
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add version number to DocBin
Add a version number to DocBin for future use.
* Add POS to all attributes in DocBin
* Add morph string to strings in DocBin
* Update DocBin API
* Add string for ENT_KB_ID in DocBin
Very minor fix in docs, specifically in this part:
```
matcher = PhraseMatcher(nlp.vocab)
> for doc in matcher.pipe(texts, batch_size=50):
> pass
```
`texts` suggests the input is an iterable of strings. I replaced it for `docs`.
* Update website models for v2.3.0
* Add docs for Chinese word segmentation
* Tighten up Chinese docs section
* Merge branch 'master' into docs/v2.3.0 [ci skip]
* Merge branch 'master' into docs/v2.3.0 [ci skip]
* Auto-format and update version
* Update matcher.md
* Update languages and sorting
* Typo in landing page
* Infobox about token_match behavior
* Add meta and basic docs for Japanese
* POS -> TAG in models table
* Add info about lookups for normalization
* Updates to API docs for v2.3
* Update adding norm exceptions for adding languages
* Add --omit-extra-lookups to CLI API docs
* Add initial draft of "What's New in v2.3"
* Add new in v2.3 tags to Chinese and Japanese sections
* Add tokenizer to migration section
* Add new in v2.3 flags to init-model
* Typo
* More what's new in v2.3
Co-authored-by: Ines Montani <ines@ines.io>
* verbose and tag_map options
* adding init_tok2vec option and only changing the tok2vec that is specified
* adding omit_extra_lookups and verifying textcat config
* wip
* pretrain bugfix
* add replace and resume options
* train_textcat fix
* raw text functionality
* improve UX when KeyError or when input data can't be parsed
* avoid unnecessary access to goldparse in TextCat pipe
* save performance information in nlp.meta
* add noise_level to config
* move nn_parser's defaults to config file
* multitask in config - doesn't work yet
* scorer offering both F and AUC options, need to be specified in config
* add textcat verification code from old train script
* small fixes to config files
* clean up
* set default config for ner/parser to allow create_pipe to work as before
* two more test fixes
* small fixes
* cleanup
* fix NER pickling + additional unit test
* create_pipe as before
* make disable_pipes deprecated in favour of the new toggle_pipes
* rewrite disable_pipes statements
* update documentation
* remove bin/wiki_entity_linking folder
* one more fix
* remove deprecated link to documentation
* few more doc fixes
* add note about name change to the docs
* restore original disable_pipes
* small fixes
* fix typo
* fix error number to W096
* rename to select_pipes
* also make changes to the documentation
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
To fix the slow tokenizer URL (#4374) and allow `token_match` to take
priority over prefixes and suffixes by default, introduce a new
tokenizer option for a token match pattern that's applied after prefixes
and suffixes but before infixes.
* `debug-data`: determine coverage of provided vectors
* `evaluate`: support `blank:lg` model to make it possible to just evaluate
tokenization
* `init-model`: add option to truncate vectors to N most frequent vectors
from word2vec file
* `train`:
* if training on GPU, only run evaluation/timing on CPU in the first
iteration
* if training is aborted, exit with a non-0 exit status
Reconstruction of the original PR #4697 by @MiniLau.
Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema
because the Matcher is only going to be able to support `IS_SENT_START`.
* add lemma option to displacy 'dep' visualiser
* more compact list comprehension
* add option to doc
* fix test and add lemmas to util.get_doc
* fix capital
* remove lemma from get_doc
* cleanup
* Update token.md
documentation is confusing: A '?' is a right punct, but '¿' is a left punct
* Update token.md
add quotations around parentheses in `is_left_punct` and `is_right_punct` for clarrification, ensuring the question mark that follows is not percieved as an example of left and right punctuation
* Move quotes into code block [ci skip]
* Expose tokenizer rules as a property
Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)
Add tests and update Tokenizer API docs.
* Update Hungarian punctuation to remove empty string
Update Hungarian punctuation definitions so that `_units` does not match
an empty string.
* Use _load_special_tokenization consistently
Use `_load_special_tokenization()` and have it to handle `None` checks.
* Fix precedence of `token_match` vs. special cases
Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.
* Add `make_debug_doc()` to the Tokenizer
Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.
Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.
* Update tokenization usage docs
Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.
Add more examples for customizing tokenizers while preserving the
existing defaults.
Minor edits / clarifications.
* Revert "Update Hungarian punctuation to remove empty string"
This reverts commit f0a577f7a5.
* Rework `make_debug_doc()` as `explain()`
Rework `make_debug_doc()` as `explain()`, which returns a list of
`(pattern_string, token_string)` tuples rather than a non-standard
`Doc`. Update docs and tests accordingly, leaving the visualization for
future work.
* Handle cases with bad tokenizer patterns
Detect when tokenizer patterns match empty prefixes and suffixes so that
`explain()` does not hang on bad patterns.
* Remove unused displacy image
* Add tokenizer.explain() to usage docs
* Add arch for MishWindowEncoder
* Support mish in tok2vec and conv window >=2
* Pass new tok2vec settings from parser
* Syntax error
* Fix tok2vec setting
* Fix registration of MishWindowEncoder
* Fix receptive field setting
* Fix mish arch
* Pass more options from parser
* Support more tok2vec options in pretrain
* Require thinc 7.3
* Add docs [ci skip]
* Require thinc 7.3.0.dev0 to run CI
* Run black
* Fix typo
* Update Thinc version
Co-authored-by: Ines Montani <ines@ines.io>
* Implement new API for {Phrase}Matcher.add (backwards-compatible)
* Update docs
* Also update DependencyMatcher.add
* Update internals
* Rewrite tests to use new API
* Add basic check for common mistake
Raise error with suggestion if user likely passed in a pattern instead of a list of patterns
* Fix typo [ci skip]
* Update English tag_map
Update English tag_map based on this conversion table:
https://universaldependencies.org/tagset-conversion/en-penn-uposf.html
* Update German tag_map
Update German tag_map based on this conversion table:
https://universaldependencies.org/tagset-conversion/de-stts-uposf.html
* Add missing Tiger dependencies to glossary
* Add quotes to definition of TO
* Update POS/TAG tables in docs
Update POS/TAG tables for English and German docs using current
information generated from the tag_maps and GLOSSARY.
* Update warning that -PRON- is specific to English
* Revert docs to default JSON output with convert
* Revert "Revert docs to default JSON output with convert"
This reverts commit 6b78c048f1.
* Support train dict format as JSONL
* Add (overly simple) check for dict vs. tuple to read JSONL lines as
either train dicts or train tuples
* Extend JSON/JSONL roundtrip conversion tests using `docs_to_json()`
and `GoldCorpus.train_tuples`
* Revert docs to default JSON output with convert
* Move test
* Allow default in Lookups.get_table
* Start with blank tables in Lookups.from_bytes
* Refactor lemmatizer to hold instance of Lookups
* Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk)
* Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency
* Remove old and unsupported Lemmatizer.load classmethod
* Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need
* Update tests and docs
* Fix more tests
* Fix lemmatizer
* Upgrade pytest to try and fix weird CI errors
* Try pytest 4.6.5
* Allow vectors name to be specified in init-model
* Document --vectors-name argument to init-model
* Update website/docs/api/cli.md
Co-Authored-By: Ines Montani <ines@ines.io>