* Refactor `Doc.is_` flags
* Add derived `Doc.has_annotation` method
* `Doc.has_annotation(attr)` returns `True` for partial annotation
* `Doc.has_annotation(attr, require_complete=True)` returns `True` for
complete annotation
* Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced`
and `is_nered`
* Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs
for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The
list is the `DocBin` attributes list plus `SPACY` and `LENGTH`.
Notes on `Doc.has_annotation`:
* `HEAD` is converted to `DEP` because heads don't have an unset state
* Accept `IS_SENT_START` as a synonym of `SENT_START`
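The partial versus complete distinction can be sketched in plain Python. This is an illustrative model only, not the `Doc` implementation: the `columns` dict, the `ALIASES` table and the use of `None` for "unset" are assumptions made for the example.

```python
from typing import Dict, List, Optional

# Hypothetical alias table mirroring the notes above:
# HEAD is checked via DEP, IS_SENT_START is a synonym of SENT_START.
ALIASES = {"HEAD": "DEP", "IS_SENT_START": "SENT_START"}


def has_annotation(
    columns: Dict[str, List[Optional[str]]],
    attr: str,
    require_complete: bool = False,
) -> bool:
    """Return True if `attr` is set on some (or all) tokens."""
    attr = ALIASES.get(attr, attr)
    values = columns.get(attr, [])
    if not values:
        # Sketch-only choice: a missing or empty column counts as unannotated.
        return False
    is_set = [value is not None for value in values]
    return all(is_set) if require_complete else any(is_set)


columns = {
    "TAG": ["DT", None, "NN"],
    "DEP": ["det", "nsubj", "ROOT"],
    "SENT_START": [None, None, None],
}
tag_partial = has_annotation(columns, "TAG")
tag_complete = has_annotation(columns, "TAG", require_complete=True)
head_complete = has_annotation(columns, "HEAD", require_complete=True)
sent_partial = has_annotation(columns, "IS_SENT_START")
```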
Additional changes:
* Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for
`DocBin`
* In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override
`SENT_START`
* In `Doc.from_array()` using `attrs` other than
`Doc._get_array_attrs()` (i.e., a user's custom list rather than our
default internal list) with both `HEAD` and `SENT_START` shows a warning
that `HEAD` will override `SENT_START`
* `set_children_from_heads` does not require dependency labels to set
sentence boundaries and sets `sent_start` for all non-sentence starts to
`-1`
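A minimal sketch of deriving sentence starts from heads alone, without looking at dependency labels. It assumes absolute head indices, roots that point to themselves, and contiguous (projective) sentences; the function name is hypothetical and this is not the Cython implementation.

```python
from typing import List


def sent_starts_from_heads(heads: List[int]) -> List[int]:
    """Return 1 for sentence-initial tokens and -1 for all other tokens."""

    def root_of(i: int) -> int:
        # Follow head links until reaching a token that is its own head.
        seen = set()
        while heads[i] != i:
            if i in seen:
                raise ValueError("cycle in heads")
            seen.add(i)
            i = heads[i]
        return i

    roots = [root_of(i) for i in range(len(heads))]
    starts = []
    for i, root in enumerate(roots):
        # A new sentence begins wherever the root changes.
        is_start = i == 0 or root != roots[i - 1]
        starts.append(1 if is_start else -1)
    return starts


# Two sentences: tokens 0-2 attach to root 1, tokens 3-4 attach to root 4.
sent_starts = sent_starts_from_heads([1, 1, 1, 4, 4])
```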
* Fix call to `set_children_from_heads`
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Clean up spacy.tokens
* Update `set_children_from_heads`:
* Don't check `dep` when setting lr_* or sentence starts
* Set all non-sentence starts to `False`
* Use `set_children_from_heads` in `Token.head` setter
* Reduce similar/duplicate code (admittedly adds a bit of overhead)
* Update sentence starts consistently
* Remove unused `Doc.set_parse`
* Minor changes:
* Declare cython variables (to avoid cython warnings)
* Clean up imports
* Modify set_children_from_heads to set token range
Modify `set_children_from_heads` so that it adjusts tokens within a
specified range rather than the whole document.
Modify the `Token.head` setter to adjust only the tokens affected by the
new head assignment.
Modify `Token.morph` property so that `Token.c.morph` can be reset back
to an internal value of `0`. Allow setting `Token.morph` from a hash as
long as the morph string is already in the `StringStore`, setting it
indirectly through `Token.morph_` so that the value is added to the
morphology. If the hash is not in the `StringStore`, raise an error.
* ensure Language passes on valid examples for initialization
* fix tagger model initialization
* check for valid get_examples across components
* assume labels were added before begin_training
* fix senter initialization
* fix morphologizer initialization
* use methods to check arguments
* test textcat init, requires thinc>=8.0.0a31
* fix tok2vec init
* fix entity linker init
* use islice
* fix simple NER
* cleanup debug model
* fix assert statements
* fix tests
* throw error when adding a label if the output layer can't be resized anymore
* fix test
* add failing test for simple_ner
* UX improvements
* morphologizer UX
* assume begin_training gets a representative set and processes the labels
* remove assumptions for output of untrained NER model
* restore test for original purpose
* Add Lemmatizer and simplify related components
* Add `Lemmatizer` pipe with `lookup` and `rule` modes using the
`Lookups` tables.
* Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma)
* Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer,
or morph rules)
* Remove lemmatizer from `Vocab`
* Adjust many many tests
Differences:
* No default lookup lemmas
* No special treatment of TAG in `from_array` and similar required
* Easier to modify labels in a `Tagger`
* No extra strings added from morphology / tag map
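To illustrate the two modes, here is a toy lemmatizer in plain Python. The class name, constructor arguments and suffix-rule format are assumptions for this sketch and do not reflect the `Lemmatizer` pipe's actual API or the `Lookups` table layout.

```python
from typing import Dict, List, Optional, Tuple


class ToyLemmatizer:
    """Toy lemmatizer with a read-only mode: 'lookup' or 'rule'."""

    def __init__(
        self,
        mode: str,
        lookup: Optional[Dict[str, str]] = None,
        rules: Optional[Dict[str, List[Tuple[str, str]]]] = None,
    ) -> None:
        if mode not in ("lookup", "rule"):
            raise ValueError(f"unsupported mode: {mode}")
        self._mode = mode
        self.lookup = lookup or {}  # no default lookup lemmas
        self.rules = rules or {}

    @property
    def mode(self) -> str:
        return self._mode

    def lemmatize(self, word: str, pos: str = "") -> str:
        if self._mode == "lookup":
            # Unknown words fall back to the surface form.
            return self.lookup.get(word, word)
        # Rule mode: apply the first matching (suffix, replacement) pair for the POS.
        for old, new in self.rules.get(pos, []):
            if old and word.endswith(old):
                return word[: len(word) - len(old)] + new
        return word


by_lookup = ToyLemmatizer("lookup", lookup={"mice": "mouse"})
by_rule = ToyLemmatizer("rule", rules={"NOUN": [("ies", "y"), ("s", "")]})
lemmas = [
    by_lookup.lemmatize("mice"),
    by_lookup.lemmatize("cats"),
    by_rule.lemmatize("parties", "NOUN"),
    by_rule.lemmatize("cats", "NOUN"),
]
```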
* Fix test
* Initial fix for Lemmatizer config/serialization
* Adjust init test to be more generic
* Adjust init test to force empty Lookups
* Add simple cache to rule-based lemmatizer
* Convert language-specific lemmatizers
Convert language-specific lemmatizers to component lemmatizers. Remove
previous lemmatizer class.
* Fix French and Polish lemmatizers
* Remove outdated UPOS conversions
* Update Russian lemmatizer init in tests
* Add minimal init/run tests for custom lemmatizers
* Add option to overwrite existing lemmas
* Update mode setting, lookup loading, and caching
* Make `mode` an immutable property
* Only enforce strict `load_lookups` for known supported modes
* Move caching into individual `_lemmatize` methods
* Implement strict when lang is not found in lookups
* Fix tables/lookups in make_lemmatizer
* Reallow provided lookups and allow for stricter checks
* Add lookups asset to all Lemmatizer pipe tests
* Rename lookups in lemmatizer init test
* Clean up merge
* Refactor lookup table loading
* Add helper `load_lemmatizer_lookups` that loads required and
optional lookups tables based on settings provided by a config.
Additional slight refactor of lookups:
* Add `Lookups.set_table` to set a table from a provided `Table`
* Reorder class definitions to be able to specify type as `Table`
* Move registry assets into test methods
* Refactor lookups tables config
Use class methods within `Lemmatizer` to provide the config for
particular modes and to load the lookups from a config.
* Add pipe and score to lemmatizer
* Simplify Tagger.score
* Add missing import
* Clean up imports and auto-format
* Remove unused kwarg
* Tidy up and auto-format
* Update docstrings for Lemmatizer
Update docstrings for Lemmatizer.
Additionally modify `is_base_form` API to take `Token` instead of
individual features.
* Update docstrings
* Remove tag map values from Tagger.add_label
* Update API docs
* Fix relative link in Lemmatizer API docs
* Add AttributeRuler for token attribute exceptions
Add the `AttributeRuler` to handle exceptions for token-level
attributes. The `AttributeRuler` uses `Matcher` patterns to identify
target spans and applies the specified attributes to the token at the
provided index in the matched span. A negative index can be used to
index from the end of the matched span. The retokenizer is used to
"merge" the individual tokens and assign them the provided attributes.
Helper functions can import existing tag maps and morph rules to the
corresponding `Matcher` patterns.
There is an additional minor bug fix for `MORPH` attributes in the
retokenizer to correctly normalize the values and to handle `MORPH`
alongside `_` in an attrs dict.
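The index handling can be sketched as follows. Exact lowercase text matching stands in for `Matcher` patterns here, and the rule dict layout (`pattern`, `attrs`, `index`) and function name are assumptions for the example rather than the `AttributeRuler` API.

```python
from typing import Dict, List


def apply_attribute_rules(tokens: List[Dict[str, str]], rules: List[dict]) -> None:
    """Set attrs on the token at `index` within every matched span (in place)."""
    for rule in rules:
        pattern = rule["pattern"]  # list of lowercase token texts
        size = len(pattern)
        for start in range(len(tokens) - size + 1):
            window = tokens[start : start + size]
            if [tok["text"].lower() for tok in window] == pattern:
                # A negative index counts from the end of the matched span.
                target = window[rule.get("index", 0)]
                target.update(rule["attrs"])


tokens = [{"text": "Who"}, {"text": "'s"}, {"text": "there"}]
rules = [
    {"pattern": ["who", "'s"], "attrs": {"LEMMA": "be", "POS": "AUX"}, "index": -1},
    {"pattern": ["who", "'s"], "attrs": {"POS": "PRON"}, "index": 0},
]
apply_attribute_rules(tokens, rules)
```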
* Fix default name
* Update name in error message
* Extend AttributeRuler functionality
* Add option to initialize with a dict of AttributeRuler patterns
* Instead of silently discarding overlapping matches (the default
behavior for the retokenizer if only the attrs differ), split the
matches into disjoint sets and retokenize each set separately. This
allows, for instance, one pattern to set the POS and another pattern to
set the lemma. (If two matches modify the same attribute, it looks like
the attrs are applied in the order they were added, but it may not be
deterministic?)
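One way to split overlapping matches into disjoint sets is a greedy first-fit pass over sorted spans. This is a sketch under the assumption that spans are `(start, end)` token offsets with an exclusive end; the function name is hypothetical and the actual grouping strategy may differ.

```python
from typing import List, Tuple

Span = Tuple[int, int]


def split_into_disjoint_sets(spans: List[Span]) -> List[List[Span]]:
    """Partition spans so that no two spans within one set overlap."""
    groups: List[List[Span]] = []
    for span in sorted(spans):
        for group in groups:
            # The span fits if it lies entirely before or after every member.
            if all(span[0] >= other[1] or span[1] <= other[0] for other in group):
                group.append(span)
                break
        else:
            groups.append([span])
    return groups


disjoint = split_into_disjoint_sets([(1, 3), (0, 2), (3, 4)])
```

Each resulting set can then be processed separately, so a match that sets the POS and an overlapping match that sets the lemma both take effect.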
* Improve types
* Sort spans before processing
* Fix index boundaries in Span
* Refactor retokenizer to separate attrs methods
Add top-level `normalize_token_attrs` and `set_token_attrs` methods.
* Update AttributeRuler to use refactored methods
Update `AttributeRuler` to replace use of full retokenizer with only the
relevant methods for normalizing and setting attributes for a single
token.
* Update spacy/pipeline/attributeruler.py
Co-authored-by: Ines Montani <ines@ines.io>
* Make API more similar to EntityRuler
* Add `AttributeRuler.add_patterns` to add patterns from a list of dicts
* Return list of dicts as property `AttributeRuler.patterns`
* Make attrs_unnormed private
* Add test loading patterns from assets
* Revert "Fix index boundaries in Span"
This reverts commit 8f8a5c3386.
* Add Span index boundary checks (#5861)
* Add Span index boundary checks
* Return Span-specific IndexError in all cases
* Simplify and fix if/else
Co-authored-by: Ines Montani <ines@ines.io>
* Allow Doc.char_span to snap to token boundaries
Add a `mode` option to allow `Doc.char_span` to snap to token
boundaries. The `mode` options:
* `strict`: character offsets must match token boundaries (default, same as
before)
* `inside`: all tokens completely within the character span
* `outside`: all tokens at least partially covered by the character span
Add a new helper function `token_by_char` that returns the token
corresponding to a character position in the text. Update
`token_by_start` and `token_by_end` to use `token_by_char` for more
efficient searching.
* Remove unused import
* Rename mode to alignment_mode
Rename `mode` to `alignment_mode` with the options
`strict`/`contract`/`expand`. Any unrecognized modes are silently
converted to `strict`.
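The three alignment modes can be illustrated with a linear scan over token character offsets. This sketch assumes tokens are given as `(start, end)` offsets with an exclusive end and returns token indices; it does not use the more efficient `token_by_char` search, and the function signature is invented for the example.

```python
from typing import List, Optional, Tuple


def char_span(
    token_offsets: List[Tuple[int, int]],
    start_char: int,
    end_char: int,
    alignment_mode: str = "strict",
) -> Optional[Tuple[int, int]]:
    """Return (start_token, end_token) with exclusive end, or None."""
    if alignment_mode not in ("strict", "contract", "expand"):
        alignment_mode = "strict"  # unrecognized modes fall back to strict
    if alignment_mode == "strict":
        starts = {s: i for i, (s, _) in enumerate(token_offsets)}
        ends = {e: i for i, (_, e) in enumerate(token_offsets)}
        if start_char in starts and end_char in ends:
            if starts[start_char] <= ends[end_char]:
                return starts[start_char], ends[end_char] + 1
        return None
    if alignment_mode == "contract":
        # Tokens completely within the character span.
        hits = [
            i for i, (s, e) in enumerate(token_offsets)
            if s >= start_char and e <= end_char
        ]
    else:
        # Tokens at least partially covered by the character span.
        hits = [
            i for i, (s, e) in enumerate(token_offsets)
            if s < end_char and e > start_char
        ]
    if not hits:
        return None
    return hits[0], hits[-1] + 1


offsets = [(0, 3), (4, 8), (9, 13)]  # "New York City"
strict_ok = char_span(offsets, 0, 8)
strict_miss = char_span(offsets, 1, 8)
contracted = char_span(offsets, 1, 10, "contract")
expanded = char_span(offsets, 1, 10, "expand")
unknown_mode = char_span(offsets, 1, 8, "fuzzy")
```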
* `MorphAnalysis.get` returns only the field values
* Move `_normalize_props` inside `Morphology` as
`Morphology.normalize_attrs` and simplify
* Simplify POS field detection/conversion
* Convert all non-POS features to strings
* `Morphology` returns an empty string for a missing morph to align
with the FEATS string returned for an existing morph
* Remove unused `list_to_feats`
* Update with WIP
* Update with WIP
* Update with pipeline serialization
* Update types and pipe factories
* Add deep merge, tidy up and add tests
* Fix pipe creation from config
* Don't validate default configs on load
* Update spacy/language.py
Co-authored-by: Ines Montani <ines@ines.io>
* Adjust factory/component meta error
* Clean up factory args and remove defaults
* Add test for failing empty dict defaults
* Update pipeline handling and methods
* provide KB as registry function instead of as object
* small change in test to make functionality more clear
* update example script for EL configuration
* Fix typo
* Simplify test
* Simplify test
* splitting pipes.pyx into separate files
* moving default configs to each component file
* fix batch_size type
* removing default values from component constructors where possible (TODO: test 4725)
* skip instead of xfail
* Add test for config -> nlp with multiple instances
* pipeline.pipes -> pipeline.pipe
* Tidy up, document, remove kwargs
* small cleanup/generalization for Tok2VecListener
* use DEFAULT_UPSTREAM field
* revert to avoid circular imports
* Fix tests
* Replace deprecated arg
* Make model dirs require config
* fix pickling of keyword-only arguments in constructor
* WIP: clean up and integrate full config
* Add helper to handle function args more reliably
Now also includes keyword-only args
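A helper of this kind could be built on `inspect.signature`, which reports keyword-only parameters alongside positional ones. The names `get_arg_names` and `make_component` are hypothetical and only illustrate the idea.

```python
import inspect
from typing import Callable, List


def get_arg_names(func: Callable) -> List[str]:
    """Return all named parameters of func, including keyword-only ones."""
    named_kinds = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    params = inspect.signature(func).parameters.values()
    # *args and **kwargs are skipped because they carry no fixed names.
    return [param.name for param in params if param.kind in named_kinds]


def make_component(nlp, name, *, model=None, batch_size=128, **cfg):
    """Stand-in factory with keyword-only arguments."""
    return name


arg_names = get_arg_names(make_component)
```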
* Fix config composition and serialization
* Improve config debugging and add visual diff
* Remove unused defaults and fix type
* Remove pipeline and factories from meta
* Update spacy/default_config.cfg
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/default_config.cfg
* small UX edits
* avoid printing stack trace for debug CLI commands
* Add support for language-specific factories
* specify the section of the config which holds the model to debug
* WIP: add Language.from_config
* Update with language data refactor WIP
* Auto-format
* Add backwards-compat handling for Language.factories
* Update morphologizer.pyx
* Fix morphologizer
* Update and simplify lemmatizers
* Fix Japanese tests
* Port over tagger changes
* Fix Chinese and tests
* Update to latest Thinc
* WIP: xfail first Russian lemmatizer test
* Fix component-specific overrides
* fix nO for output layers in debug_model
* Fix default value
* Fix tests and don't pass objects in config
* Fix deep merging
* Fix lemma lookup data registry
Only load the lookups if an entry is available in the registry (and if spacy-lookups-data is installed)
* Add types
* Add Vocab.from_config
* Fix typo
* Fix tests
* Make config copying more elegant
* Fix pipe analysis
* Fix lemmatizers and is_base_form
* WIP: move language defaults to config
* Fix morphology type
* Fix vocab
* Remove comment
* Update to latest Thinc
* Add morph rules to config
* Tidy up
* Remove set_morphology option from tagger factory
* Hack use_gpu
* Move [pipeline] to top-level block and make [nlp.pipeline] list
Allows separating component blocks from component order – otherwise, ordering the config would mean a changed component order, which is bad. Also allows initial config to define more components and not use all of them
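The separation of component blocks from component order can be sketched with plain dicts. The key names and the `resolve_pipeline` helper are illustrative assumptions, not the actual config loader.

```python
from typing import Dict, List, Tuple

config = {
    # Order lives in one list ...
    "nlp": {"pipeline": ["tagger", "parser"]},
    # ... while the blocks can appear in any order and may define extras.
    "pipeline": {
        "parser": {"factory": "parser"},
        "ner": {"factory": "ner"},
        "tagger": {"factory": "tagger"},
    },
}


def resolve_pipeline(cfg: Dict) -> List[Tuple[str, Dict]]:
    """Return (name, block) pairs in the order given by nlp.pipeline."""
    blocks = cfg["pipeline"]
    order = cfg["nlp"]["pipeline"]
    missing = [name for name in order if name not in blocks]
    if missing:
        raise KeyError(f"no component block for: {missing}")
    return [(name, blocks[name]) for name in order]


component_names = [name for name, _ in resolve_pipeline(config)]
```

Reordering the blocks in the file leaves the component order unchanged, and the `ner` block is defined without being used.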
* Fix use_gpu and resume in CLI
* Auto-format
* Remove resume from config
* Fix formatting and error
* [pipeline] -> [components]
* Fix types
* Fix tagger test: requires set_morphology?
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
Remove corpus-specific tag maps from the language data for languages
without custom tokenizers. For languages with custom word segmenters
that also provide tags (Japanese and Korean), the tag maps for the
custom tokenizers are kept as the default.
The default tag maps for languages without custom tokenizers are now the
default tag map from `lang/tag_map.py`, UPOS -> UPOS.
* Add morph to morphology in Doc.from_array
Add morphological analyses to morphology table in `Doc.from_array`.
* Use separate vocab in DocBin roundtrip test
* Add static method to Doc to allow merging of multiple docs.
* Add error description for the error that occurs if docs with different
vocabs (from different languages) are merged in Doc.from_docs().
* Add test for Doc.from_docs() implementation.
* Fix using numpy's concatenate in Doc.from_docs.
* Replace typing's type annotations in from_docs.
* Simply remove type annotations in from_docs.
* Add documentation for Doc.from_docs to api.
* Simplify from_docs, its test and the api doc for codebase consistency.
* Fix merging of Doc objects that end with whitespaces (Achieved by simply not setting the SPACY attribute on whitespace tokens). Remove two unnecessary imports of attributes.
* Add merging of user data from Doc objects in from_docs. Add user data test case to corresponding test. Add applicable warning messages.
* Fix incorrect setting of tokens idx by using concatenated spaces (again). Add test case to corresponding test.
* Add MORPH to attrs
* Update warnings calls
* Remove out-dated error from merge
* Rename space_delimiter to ensure_whitespace
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
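The whitespace and offset handling described for `Doc.from_docs` can be sketched on plain lists. This is a simplified model in which each doc is a list of `(word, trailing_space)` pairs; the function name `merge_docs` is hypothetical and user data merging is left out.

```python
from typing import List, Tuple

ToyDoc = List[Tuple[str, bool]]


def merge_docs(docs: List[ToyDoc], ensure_whitespace: bool = True):
    """Concatenate docs and recompute character offsets from the spaces."""
    words: List[str] = []
    spaces: List[bool] = []
    for n, doc in enumerate(docs):
        doc_words = [word for word, _ in doc]
        doc_spaces = [space for _, space in doc]
        is_last = n == len(docs) - 1
        # Add a delimiter between docs, but never after a whitespace token.
        if ensure_whitespace and not is_last and doc_words:
            if not doc_words[-1].isspace():
                doc_spaces[-1] = True
        words.extend(doc_words)
        spaces.extend(doc_spaces)
    # Token offsets follow from the concatenated words and spaces.
    idx: List[int] = []
    offset = 0
    for word, space in zip(words, spaces):
        idx.append(offset)
        offset += len(word) + (1 if space else 0)
    return words, spaces, idx


doc_a = [("Hello", True), ("world", False)]
doc_b = [("Bye", False), ("!", False)]
merged_words, merged_spaces, merged_idx = merge_docs([doc_a, doc_b])
```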
* Update errors
* Remove beam for now (maybe)
Remove beam_utils
Update setup.py
Remove beam
* Remove GoldParse
WIP on removing goldparse
Get ArcEager compiling after GoldParse excise
Update setup.py
Get spacy.syntax compiling after removing GoldParse
Rename NewExample -> Example and clean up
Clean html files
Start updating tests
Update Morphologizer
* fix error numbers
* fix merge conflict
* informative error when calling to_array with wrong field
* fix error catching
* fixing language and scoring tests
* start testing get_aligned
* additional tests for new get_aligned function
* Draft create_gold_state for arc_eager oracle
* Fix import
* Fix import
* Remove TokenAnnotation code from nonproj
* fixing NER one-to-many alignment
* Fix many-to-one IOB codes
* fix test for misaligned
* attempt to fix cases with weird spaces
* fix spaces
* test_gold_biluo_different_tokenization works
* allow None as BILUO annotation
* fixed some tests + WIP roundtrip unit test
* add spaces to json output format
* minibatch utility can deal with strings, docs or examples
* fix augment (needs further testing)
* various fixes in scripts - needs to be further tested
* fix test_cli
* cleanup
* correct silly typo
* add support for MORPH in to/from_array, fix morphologizer overfitting test
* fix tagger
* fix entity linker
* ensure test keeps working with non-linked entities
* pipe() takes docs, not examples
* small bug fix
* textcat bugfix
* throw informative error when running the components with the wrong type of objects
* fix parser tests to work with example (most still failing)
* fix BiluoPushDown parsing entities
* small fixes
* bugfix tok2vec
* fix renames and simple_ner labels
* various small fixes
* prevent writing dummy values like deps because that could interfere with sent_start values
* fix the fix
* implement split_sent with aligned SENT_START attribute
* test for split sentences with various alignment issues, works
* Return ArcEagerGoldParse from ArcEager
* Update parser and NER gold stuff
* Draft new GoldCorpus class
* add links to to_dict
* clean up
* fix test checking for variants
* Fix oracles
* Start updating converters
* Move converters under spacy.gold
* Move things around
* Fix naming
* Fix name
* Update converter to produce DocBin
* Update converters
* Allow DocBin to take list of Doc objects.
* Make spacy convert output docbin
* Fix import
* Fix docbin
* Fix compile in ArcEager
* Fix import
* Serialize all attrs by default
* Update converter
* Remove jsonl converter
* Add json2docs converter
* Draft Corpus class for DocBin
* Work on train script
* Update Corpus
* Update DocBin
* Allocate Doc before starting to add words
* Make doc.from_array several times faster
* Update train.py
* Fix Corpus
* Fix parser model
* Start debugging arc_eager oracle
* Update header
* Fix parser declaration
* Xfail some tests
* Skip tests that cause crashes
* Skip test causing segfault
* Remove GoldCorpus
* Update imports
* Update after removing GoldCorpus
* Fix module name of corpus
* Fix import
* Work on parser oracle
* Update arc_eager oracle
* Restore ArcEager.get_cost function
* Update transition system
* Update test_arc_eager_oracle
* Remove beam test
* Update test
* Unskip
* Unskip tests
* add links to to_dict
* clean up
* fix test checking for variants
* Allow DocBin to take list of Doc objects.
* Fix compile in ArcEager
* Serialize all attrs by default
Move converters under spacy.gold
Move things around
Fix naming
Fix name
Update converter to produce DocBin
Update converters
Make spacy convert output docbin
Fix import
Fix docbin
Fix import
Update converter
Remove jsonl converter
Add json2docs converter
* Allocate Doc before starting to add words
* Make doc.from_array several times faster
* Start updating converters
* Work on train script
* Draft Corpus class for DocBin
Update Corpus
Fix Corpus
* Update DocBin
Add missing strings when serializing
* Update train.py
* Fix parser model
* Start debugging arc_eager oracle
* Update header
* Fix parser declaration
* Xfail some tests
Skip tests that cause crashes
Skip test causing segfault
* Remove GoldCorpus
Update imports
Update after removing GoldCorpus
Fix module name of corpus
Fix import
* Work on parser oracle
Update arc_eager oracle
Restore ArcEager.get_cost function
Update transition system
* Update tests
Remove beam test
Update test
Unskip
Unskip tests
* Add get_aligned_parse method in Example
Fix Example.get_aligned_parse
* Add kwargs to Corpus.dev_dataset to match train_dataset
* Update nonproj
* Use get_aligned_parse in ArcEager
* Add another arc-eager oracle test
* Remove Example.doc property
Remove Example.doc
* Update ArcEager oracle
Fix Break oracle
* Debugging
* Fix Corpus
* Fix eg.doc
* Format
* small fixes
* limit arg for Corpus
* fix test_roundtrip_docs_to_docbin
* fix test_make_orth_variants
* fix add_label test
* Update tests
* avoid writing temp dir in json2docs, fixing 4402 test
* Update test
* Add missing costs to NER oracle
* Update test
* Work on Example.get_aligned_ner method
* Clean up debugging
* Xfail tests
* Remove prints
* Remove print
* Xfail some tests
* Replace unseen labels for parser
* Update test
* Update test
* Xfail test
* Fix Corpus
* fix imports
* fix docs_to_json
* various small fixes
* cleanup
* Support gold_preproc in Corpus
* Support gold_preproc
* Pass gold_preproc setting into corpus
* Remove debugging
* Fix gold_preproc
* Fix json2docs converter
* Fix convert command
* Fix flake8
* Fix import
* fix output_dir (converted to Path by typer)
* fix var
* bugfix: update states after creating golds to avoid out of bounds indexing
* Improve efficiency of ArcEager oracle
* pull merge_sent into iob2docs to avoid Doc creation for each line
* fix asserts
* bugfix excl Span.end in iob2docs
* Support max_length in Corpus
* Fix arc_eager oracle
* Filter out unannotated sentences in NER
* Remove debugging in parser
* Simplify NER alignment
* Fix conversion of NER data
* Fix NER init_gold_batch
* Tweak efficiency of precomputable affine
* Update onto-json default
* Update gold test for NER
* Fix parser test
* Update test
* Add NER data test
* Fix convert for single file
* Fix test
* Hack scorer to avoid evaluating non-nered data
* Fix handling of NER data in Example
* Output unlabelled spans from O biluo tags in iob_utils
* Fix unset variable
* Return kept examples from init_gold_batch
* Return examples from init_gold_batch
* Don't return Example from init_gold_batch
* Set spaces on gold doc after conversion
* Add test
* Fix spaces reading
* Improve NER alignment
* Improve handling of missing values in NER
* Restore the 'cutting' in parser training
* Add assertion
* Print epochs
* Restore random cuts in parser/ner training
* Implement Doc.copy
* Implement Example.copy
* Copy examples at the start of Language.update
* Don't unset example docs
* Tweak parser model slightly
* attempt to fix _guess_spaces
* _add_entities_to_doc first, so that links don't get overwritten
* fixing get_aligned_ner for one-to-many
* fix indexing into x_text
* small fix biluo_tags_from_offsets
* Add onto-ner config
* Simplify NER alignment
* Fix NER scoring for partially annotated documents
* fix indexing into x_text
* fix test_cli failing tests by ignoring spans in doc.ents with empty label
* Fix limit
* Improve NER alignment
* Fix count_train
* Remove print statement
* fix tests, we're not having nothing but None
* fix clumsy fingers
* Fix tests
* Fix doc.ents
* Remove empty docs in Corpus and improve limit
* Update config
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
* verbose and tag_map options
* adding init_tok2vec option and only changing the tok2vec that is specified
* adding omit_extra_lookups and verifying textcat config
* wip
* pretrain bugfix
* add replace and resume options
* train_textcat fix
* raw text functionality
* improve UX when KeyError or when input data can't be parsed
* avoid unnecessary access to goldparse in TextCat pipe
* save performance information in nlp.meta
* add noise_level to config
* move nn_parser's defaults to config file
* multitask in config - doesn't work yet
* scorer offering both F and AUC options, need to be specified in config
* add textcat verification code from old train script
* small fixes to config files
* clean up
* set default config for ner/parser to allow create_pipe to work as before
* two more test fixes
* small fixes
* cleanup
* fix NER pickling + additional unit test
* create_pipe as before
Reconstruction of the original PR #4697 by @MiniLau.
Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema
because the Matcher is only going to be able to support `IS_SENT_START`.
* Add Doc init from list of words and text
Add an option to initialize a `Doc` from a text and list of words where
the words may or may not include all whitespace tokens. If the text and
words are mismatched, raise an error.
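A rough sketch of aligning a word list against a text to recover the trailing-space flags. It assumes individual words contain no internal whitespace; the body is a simplified stand-in and not the actual utility function.

```python
from typing import List, Tuple


def get_words_and_spaces(words: List[str], text: str) -> Tuple[List[str], List[bool]]:
    """Return tokens and trailing-space flags that reproduce `text`."""
    # Compare with all whitespace removed to detect mismatches up front.
    if "".join("".join(words).split()) != "".join(text.split()):
        raise ValueError("words and text do not match")
    out_words: List[str] = []
    out_spaces: List[bool] = []

    def add_gap(gap: str) -> None:
        # One leading space becomes the previous token's trailing space;
        # any remaining whitespace becomes its own token.
        if gap and out_words and gap[0] == " " and not out_spaces[-1]:
            out_spaces[-1] = True
            gap = gap[1:]
        if gap:
            out_words.append(gap)
            out_spaces.append(False)

    pos = 0
    for word in words:
        if not word.strip():
            continue  # whitespace tokens are re-derived from the text
        start = text.index(word, pos)
        add_gap(text[pos:start])
        out_words.append(word)
        out_spaces.append(False)
        pos = start + len(word)
    add_gap(text[pos:])
    return out_words, out_spaces


simple = get_words_and_spaces(["Hello", "world"], "Hello world")
double = get_words_and_spaces(["Hello", "world", "!"], "Hello  world!")
```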
* Fix error code
* Remove all whitespace before aligning words/text
* Move words/text init to util function
* Update error message
* Rename to get_words_and_spaces
* Fix formatting
* Improve token head verification
Improve the verification for valid token heads when heads are set:
* in `Token.head`: heads come from the same document
* in `Doc.from_array()`: head indices are within the bounds of the
document
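The bounds check for array input can be sketched as follows, assuming heads are given as offsets relative to each token's own position. The function name and error text are hypothetical.

```python
from typing import List


def check_heads(relative_heads: List[int]) -> None:
    """Raise IndexError if any head points outside the document."""
    length = len(relative_heads)
    for i, offset in enumerate(relative_heads):
        head_index = i + offset
        if not 0 <= head_index < length:
            raise IndexError(
                f"token {i} has head {head_index}, outside 0..{length - 1}"
            )


check_heads([1, 0, -1])  # every head stays within the three-token document
```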
* Improve error message
* fix grad_clip naming
* cleaning up pretrained_vectors out of cfg
* further refactoring Model init's
* move Model building out of pipes
* further refactor to require a model config when creating a pipe
* small fixes
* making cfg in nn_parser more consistent
* fixing nr_class for parser
* fixing nn_parser's nO
* fix printing of loss
* architectures in own file per type, consistent naming
* convenience methods default_tagger_config and default_tok2vec_config
* let create_pipe access default config if available for that component
* default_parser_config
* move defaults to separate folder
* allow reading nlp from package or dir with argument 'name'
* architecture spacy.VocabVectors.v1 to read static vectors from file
* cleanup
* default configs for nel, textcat, morphologizer, tensorizer
* fix imports
* fixing unit tests
* fixes and clean up
* fixing defaults, nO, fix unit tests
* restore parser IO
* fix IO
* 'fix' serialization test
* add *.cfg to manifest
* fix example configs with additional arguments
* replace Morphologizer with Tagger
* add IO bit when testing overfitting of tagger (currently failing)
* fix IO - don't initialize when reading from disk
* expand overfitting tests to also check IO goes OK
* remove dropout from HashEmbed to fix Tagger performance
* add defaults for sentrec
* update thinc
* always pass a Model instance to a Pipe
* fix piped_added statement
* remove obsolete W029
* remove obsolete errors
* restore byte checking tests (work again)
* clean up test
* further test cleanup
* convert from config to Model in create_pipe
* bring back error when component is not initialized
* cleanup
* remove calls for nlp2.begin_training
* use thinc.api in imports
* allow setting charembed's nM and nC
* fix for hardcoded nM/nC + unit test
* formatting fixes
* trigger build
* Sync Span __eq__ and __hash__
Use the same tuple for `__eq__` and `__hash__`, including all attributes
except `vector` and `vector_norm`.
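Keeping `__eq__` and `__hash__` in sync is commonly done by deriving both from one key tuple. The class below is a toy stand-in with an assumed attribute set, not the actual `Span`.

```python
class ToySpan:
    """Toy span whose equality and hash share a single key tuple."""

    def __init__(self, doc_id, start, end, label="", kb_id=""):
        self.doc_id = doc_id
        self.start = start
        self.end = end
        self.label = label
        self.kb_id = kb_id

    def _key(self):
        # Everything that defines identity goes here, and only here.
        return (self.doc_id, self.start, self.end, self.label, self.kb_id)

    def __eq__(self, other):
        if not isinstance(other, ToySpan):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


span_a = ToySpan(doc_id=1, start=0, end=2, label="ORG")
span_b = ToySpan(doc_id=1, start=0, end=2, label="ORG")
span_c = ToySpan(doc_id=1, start=0, end=2, label="GPE")
unique_spans = {span_a, span_b, span_c}
```

Because equal spans now hash equally, they deduplicate correctly in sets and dict keys.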
* Update entity comparison in tests
Update `assert_docs_equal()` test util to compare `Span` properties for
ents rather than `Span` objects.