* test and fix for second bug of issue 4042
* fix for first bug in 4042
* crashing test for Issue 4313
* forgot one instance of resize
* remove prints
* undo uncomment
* delete test for 4313 (uses third party lib)
* add fix for Issue 4313
* unit test for 4313
* Replace PhraseMatcher with Aho-Corasick
Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays
of the hash values for the relevant attribute. The implementation is
based on FlashText.
The speed should be similar to the previous PhraseMatcher. It is now
possible to easily remove match IDs and matches don't go missing with
large keyword lists / vocabularies.
Fixes #4308.
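To illustrate matching over per-token hash values, here is a minimal sketch and not the actual implementation: it scans a plain nested-dict trie from every start position and omits the Aho-Corasick failure links. The names `TERMINAL`, `build_trie` and `scan` are hypothetical, and small integers stand in for attribute hashes.

```python
from typing import Dict, List, Tuple

# Hypothetical reserved key marking the end of a pattern. Token keys are ints,
# so a string sentinel cannot collide with them in this sketch.
TERMINAL = "_end_"


def build_trie(patterns: Dict[str, List[List[int]]]) -> dict:
    """Build a nested-dict trie keyed by per-token hash values."""
    trie: dict = {}
    for match_id, sequences in patterns.items():
        for seq in sequences:
            node = trie
            for key in seq:
                node = node.setdefault(key, {})
            # Several match IDs may end on the same path.
            node.setdefault(TERMINAL, set()).add(match_id)
    return trie


def scan(trie: dict, doc: List[int]) -> List[Tuple[str, int, int]]:
    """Return (match_id, start, end) for every pattern occurrence in doc."""
    matches = []
    for start in range(len(doc)):
        node = trie
        for end in range(start, len(doc)):
            node = node.get(doc[end])
            if node is None:
                break
            for match_id in sorted(node.get(TERMINAL, ())):
                matches.append((match_id, start, end + 1))
    return matches
```

Because matches are read off the trie path rather than a separate vocabulary, removing a match ID in this sketch would only require deleting it from the terminal sets and pruning empty paths.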
* Restore support for pickling
* Fix internal keyword add/remove for numpy arrays
* Add missing loop for match ID set in search loop
* Remove cruft in matching loop for partial matches
There was a bit of unnecessary code left over from FlashText in the
matching loop to handle partial token matches, which we don't have with
PhraseMatcher.
* Replace dict trie with MapStruct trie
* Fix how match ID hash is stored/added
* Update fix for match ID vocab
* Switch from map_get_unless_missing to map_get
* Switch from numpy array to Token.get_struct_attr
Access token attributes directly in Doc instead of making a copy of the
relevant values in a numpy array.
Add unsatisfactory warning for hash collision with reserved terminal
hash key. (Ideally it would change the reserved terminal hash and redo
the whole trie, but for now, I'm hoping there won't be collisions.)
* Restructure imports to export find_matches
* Implement full remove()
Remove unnecessary trie paths and free unused maps.
Parallel to Matcher, raise KeyError when attempting to remove a match ID
that has not been added.
* Store docs internally only as attr lists
* Reduces size for pickle
* Remove duplicate keywords store
Now that docs are stored as lists of attr hashes, there's no need to
have the duplicate _keywords store.
* Allow vectors name to be specified in init-model
* Document --vectors-name argument to init-model
* Update website/docs/api/cli.md
Co-Authored-By: Ines Montani <ines@ines.io>
* remove duplicate unit test
* unit test (currently failing) for issue 4267
* bugfix: ensure doc.ents preserves kb_id annotations
* fix in setting doc.ents with empty label
* rename
* test for presetting an entity to a certain type
* allow overwriting Outside + blocking presets
* fix actions when previous label needs to be kept
* fix default ent_iob in set entities
* cleaner solution with U- action
* remove debugging print statements
* unit tests with explicit transitions and is_valid testing
* remove U- from move_names explicitly
* remove unit tests with pre-trained models that don't work
* remove (working) unit tests with pre-trained models
* clean up unit tests
* move unit tests
* small fixes
* remove two TODO's from doc.ents comments
* make merge more efficient
* fix offsets
* merge works with relative indices
* remove printing
* Add the SCA
* fix SCA date
* more cythonize _retokenize.pyx
* more cythonize _retokenize.pyx
* fix only declaration in _retokenize.pyx
* switch back to absolute head
* switch back to absolute head
* fix comment
* merge from origin repo
* remove redundant __call__ method in pipes.TextCategorizer
Because the parent __call__ method behaves in the same way.
* fix: Pipe.__call__ arg
* fix: invalid arg in Pipe.__call__
* modified: spacy/tests/regression/test_issue4278.py (#4278)
* deleted: Pipfile
* Add doc.cats to spacy.gold at the paragraph level
Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in
the spacy JSON training format at the paragraph level.
* `spacy.gold.docs_to_json()` writes `docs.cats`
* `GoldCorpus` reads in cats in each `GoldParse`
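As a rough illustration of the paragraph-level format, the sketch below builds one paragraph with a `cats` list and converts it to a label-to-value dict. Only the `cats` shape comes from the description above; the `raw` and `sentences` keys and the label names are example assumptions.

```python
import json

# One paragraph in the JSON training format. "raw" and "sentences" are filled
# with placeholder values here; only the shape of "cats" is taken from above.
paragraph = {
    "raw": "This is great.",
    "sentences": [],
    "cats": [
        {"label": "POSITIVE", "value": 1.0},
        {"label": "NEGATIVE", "value": 0.0},
    ],
}

# Round-trip through JSON and flatten the list into a {label: value} dict,
# which is the shape doc.cats uses.
loaded = json.loads(json.dumps(paragraph))
cats = {cat["label"]: cat["value"] for cat in loaded["cats"]}
```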
* Update instances of gold_tuples to handle cats
Update iteration over gold_tuples / gold_parses to handle addition of
cats at the paragraph level.
* Add textcat to train CLI
* Add textcat options to train CLI
* Add textcat labels in `TextCategorizer.begin_training()`
* Add textcat evaluation to `Scorer`:
* For binary exclusive classes with provided label: F1 for label
* For 2+ exclusive classes: F1 macro average
* For multilabel (not exclusive): ROC AUC macro average (currently
relying on sklearn)
* Provide user info on textcat evaluation settings, potential
incompatibilities
* Provide pipeline to Scorer in `Language.evaluate` for textcat config
* Customize train CLI output to include only metrics relevant to current
pipeline
* Add textcat evaluation to evaluate CLI
* Fix handling of unset arguments and config params
Fix handling of unset arguments and model config parameters in Scorer
initialization.
* Temporarily add sklearn requirement
* Remove sklearn version number
* Improve Scorer handling of models without textcats
* Fixing Scorer handling of models without textcats
* Update Scorer output for python 2.7
* Modify inf in Scorer for python 2.7
* Auto-format
Also make small adjustments to make auto-formatting with black easier and produce nicer results
* Move error message to Errors
* Update documentation
* Add cats to annotation JSON format [ci skip]
* Fix tpl flag and docs [ci skip]
* Switch to internal roc_auc_score
Switch to internal `roc_auc_score()` adapted from scikit-learn.
* Add AUCROCScore tests and improve errors/warnings
* Add tests for AUCROCScore and roc_auc_score
* Add missing error for only positive/negative values
* Remove unnecessary warnings and errors
* Make reduced roc_auc_score functions private
Because most of the checks and warnings have been stripped for the
internal functions and access is only intended through `ROCAUCScore`,
make the functions for roc_auc_score adapted from scikit-learn private.
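For reference, a minimal sketch of what a ROC AUC score measures. This is not the code adapted from scikit-learn: it uses the equivalent pairwise formulation (the fraction of positive/negative pairs ranked correctly, with ties counting half) rather than integrating the curve. The function name `roc_auc` is hypothetical.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Pairwise ROC AUC for binary labels (1 = positive, 0 = negative)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        # The score is undefined when only one class is present.
        raise ValueError("ROC AUC needs both positive and negative examples")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic, so it is only meant to show the definition, not to be used on large evaluation sets.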
* Check that data corresponds with multilabel flag
Check that the training instances correspond with the multilabel flag,
adding the multilabel flag if required.
* Add textcat score to early stopping check
* Add more checks to debug-data for textcat
* Add example training data for textcat
* Add more checks to textcat train CLI
* Check configuration when extending base model
* Fix typos
* Update textcat example data
* Provide licensing details and licenses for data
* Remove two labels with no positive instances from jigsaw-toxic-comment
data.
Co-authored-by: Ines Montani <ines@ines.io>
* Adjust Table API and add docs
* Add attributes and update description [ci skip]
* Use strings.get_string_id instead of hash_string
* Fix table method calls
* Make orth arg in Lemmatizer.lookup optional
Fall back to string, which is now handled by Table.__contains__ out-of-the-box
* Fix method name
* Auto-format
Most of these characters are for languages / writing systems that aren't
supported by spacy, but I don't think it causes problems to include
them. In the UD evals, Hindi and Urdu improve a lot as expected (from
0-10% to 70-80%) and Persian improves a little (90% to 96%). Tamil
improves in combination with #4288.
The punctuation list is converted to a set internally because of its
increased length.
Sentence final punctuation generated with:
```
unichars -gas '[\p{Sentence_Break=STerm}\p{Sentence_Break=ATerm}]' '\p{Terminal_Punctuation}'
```
See: https://stackoverflow.com/a/9508766/461847
Fixes #4269.
Add Kannada, Tamil, and Telugu unicode blocks to uncased character
classes so that period is recognized as a suffix during tokenization.
(I'm sure a few symbols in the code blocks should not be ALPHA, but this
is mainly relevant for suffix detection and seems to be an improvement
in practice.)
Before this patch, half-width spaces between words were simply lost in
Japanese text. This wasn't immediately noticeable because much Japanese
text never uses spaces at all.
* Improve load_language_data helper
* WIP: Add Lookups implementation
* Start moving lemma data over to JSON
* WIP: move data over for more languages
* Convert more languages
* Fix lemmatizer fixtures in tests
* Finish conversion
* Auto-format JSON files
* Fix test for now
* Make sure tables are stored on instance
* Update docstrings
* Update docstrings and errors
* Update test
* Add Lookups.__len__
* Add serialization methods
* Add Lookups.remove_table
* Use msgpack for serialization to disk
* Fix file exists check
* Try using OrderedDict for everything
* Update .flake8 [ci skip]
* Try fixing serialization
* Update test_lookups.py
* Update test_serialize_vocab_strings.py
* Lookups / Tables now work
This implements the stubs in the Lookups/Table classes. Currently this
is in Cython but with no type declarations, so that could be improved.
* Add lookups to setup.py
* Actually add lookups pyx
The previous commit added the old py file...
* Lookups work-in-progress
* Move from pyx back to py
* Add string based lookups, fix serialization
* Update tests, language/lemmatizer to work with string lookups
There are some outstanding issues here:
- a pickling-related test fails due to the bloom filter
- some custom lemmatizers (fr/nl at least) have issues
More generally, there's a question of how to deal with the case where
you have a string but want to use the lookup table. Currently the table
allows access by string or id, but that's getting pretty awkward.
* Change lemmatizer lookup method to pass (orth, string)
* Fix token lookup
* Fix French lookup
* Fix lt lemmatizer test
* Fix Dutch lemmatizer
* Fix lemmatizer lookup test
This was using a normal dict instead of a Table, so checks for the
string instead of an integer key failed.
* Make uk/nl/ru lemmatizer lookup methods consistent
The mentioned tokenizers all have their own implementation of the
`lookup` method, which accesses a `Lookups` table. The way that was
called in `token.pyx` was changed so this should be updated to have the
same arguments as `lookup` in `lemmatizer.py` (specifically (orth/id,
string)).
Prior to this change tests weren't failing, but there would probably be
issues with normal use of a model. More tests should probably be added.
Additionally, the language-specific `lookup` implementations seem like
they might not be needed, since they handle things like lower-casing
that aren't actually language specific.
* Make recently added Greek method compatible
* Remove redundant class/method
Leftovers from a merge not cleaned up adequately.
* Allow copying the user_data with as_doc + unit test
* add option to docs
* add typing
* import fix
* workaround to avoid bool clashing ...
* bint instead of bool
* document token ent_kb_id
* document span kb_id
* update pipeline documentation
* prior and context weights as bool's instead
* entitylinker api documentation
* drop for both models
* finish entitylinker documentation
* small fixes
* documentation for KB
* candidate documentation
* links to api pages in code
* small fix
* frequency examples as counts for consistency
* consistent documentation about tensors returned by predict
* add entity linking to usage 101
* add entity linking infobox and KB section to 101
* entity-linking in linguistic features
* small typo corrections
* training example and docs for entity_linker
* predefined nlp and kb
* revert back to similarity encodings for simplicity (for now)
* set prior probabilities to 0 when excluded
* code clean up
* bugfix: deleting kb ID from tokens when entities were removed
* refactor train el example to use either model or vocab
* pretrain_kb example for example kb generation
* add to training docs for KB + EL example scripts
* small fixes
* error numbering
* ensure the language of vocab and nlp stay consistent across serialization
* equality with =
* avoid conflict in errors file
* add error 151
* final adjustments to the train scripts - consistency
* update of goldparse documentation
* small corrections
* push commit
* typo fix
* add candidate API to kb documentation
* update API sidebar with EntityLinker and KnowledgeBase
* remove EL from 101 docs
* remove entity linker from 101 pipelines / rephrase
* custom el model instead of existing model
* set version to 2.2 for EL functionality
* update documentation for 2 CLI scripts
* Fix serialization for lookups
* Fix lookups
* Fix lookups
* Fix lookups
* Try to fix serialization
* Try to fix serialization
* Try to fix serialization
* Try to fix serialization
* Give up on serialization test
* Xfail more serialization tests for 3.5
* Fix lookups for 2.7
* Modify retokenizer to use span root attributes
* tag/pos/morph are set to root tag/pos/morph
* lemma and norm are reset and end up as orth (not ideal, but better
than orth of first token)
* Also handle individual merge case
* Add test
* Attempt to handle ent_iob and ent_type in merges
* Fix check for whether B-ENT should become I-ENT
* Move IOB consistency check to after attrs
Move all IOB consistency checks after attrs are set and simplify to
check entire document, modifying I to B at the beginning of the document
or if the entity type of the previous token isn't the same.
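The consistency rule described above can be sketched as follows. This is an illustration over plain `(iob, ent_type)` tuples, not the Cython code in `_retokenize.pyx`; the function name `fix_iob` is hypothetical.

```python
from typing import List, Tuple


def fix_iob(tags: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Turn I into B at the start of the document or after a different type.

    Each item is (iob, ent_type); tokens outside entities are ("O", "").
    """
    fixed: List[Tuple[str, str]] = []
    for i, (iob, ent_type) in enumerate(tags):
        if iob == "I":
            # An I tag is only valid if the previous token belongs to an
            # entity of the same type.
            if i == 0 or fixed[i - 1][1] != ent_type:
                iob = "B"
        fixed.append((iob, ent_type))
    return fixed
```

Checking the whole document once after all attributes are set avoids tracking which merge touched which entity boundary.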
* Move IOB consistency check for single merge
Move IOB consistency check after the token array is compressed for the
single merge case.
* Update spacy/tokens/_retokenize.pyx
Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>
* Remove single vs. multiple merge distinction
Remove original single-instance `_merge()` and use `_bulk_merge()` (now
renamed `_merge()`) for all merges.
* Add out-of-bound check in previous entity check
* Updates/bugfixes for NER/IOB converters
* Converter formats `ner` and `iob` use autodetect to choose a converter if
possible
* `iob2json` is reverted to handle sentence-per-line data like
`word1|pos1|ent1 word2|pos2|ent2`
* Fix bug in `merge_sentences()` so the second sentence in each batch isn't
skipped
* `conll_ner2json` is made more general so it can handle more formats with
whitespace-separated columns
* Supports all formats where the first column is the token and the final
column is the IOB tag; if present, the second column is the POS tag
* As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O`
separates documents
* Add option for segmenting sentences (new flag `-s`)
* Parser-based sentence segmentation with a provided model, otherwise with
sentencizer (new option `-b` to specify model)
* Can group sentences into documents with `n_sents` as long as sentence
segmentation is available
* Only applies automatic segmentation when there are no existing delimiters
in the data
* Provide info about settings applied during conversion with warnings and
suggestions if settings conflict or might not be optimal.
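A minimal sketch of reading the sentence-per-line format mentioned above (`word1|pos1|ent1 word2|pos2|ent2`). The helper `parse_iob_line` is hypothetical and much stricter than the `iob2json` converter: it only accepts items with exactly three `|`-separated fields.

```python
from typing import List, Tuple


def parse_iob_line(line: str) -> List[Tuple[str, str, str]]:
    """Split one sentence into (word, pos, ent) triples."""
    tokens = []
    for item in line.split():
        parts = item.split("|")
        if len(parts) != 3:
            raise ValueError("Expected word|pos|ent, got: {}".format(item))
        word, pos, ent = parts
        tokens.append((word, pos, ent))
    return tokens
```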
* Add tests for common formats
* Add '(default)' back to docs for -c auto
* Add document count back to output
* Revert changes to converter output message
* Use explicit tabs in convert CLI test data
* Adjust/add messages for n_sents=1 default
* Add sample NER data to training examples
* Update README
* Add links in docs to example NER data
* Define msg within converters
Filtering by orth and tag, create variants of training docs with
alternate orth variants, e.g., unicode quotes, dashes, and ellipses.
The variants can be single tokens (dashes) or paired tokens (quotes)
with left and right versions.
Currently restricted to only add variants to training documents without
raw text provided, where only gold.words needs to be modified.
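A rough sketch of the paired-token case, assuming Penn Treebank style tags where opening quotes are tagged ` `` ` and closing quotes `''`. The function name, the orth sets and the variant pair are illustrative assumptions, not the data used in the patch.

```python
from typing import List, Tuple

# Hypothetical orth values that count as opening / closing quotes.
LEFT_ORTHS = {"``", '"', "\u201c"}
RIGHT_ORTHS = {"''", '"', "\u201d"}


def apply_quote_variant(
    words: List[str], tags: List[str], variant: Tuple[str, str]
) -> List[str]:
    """Replace paired quote tokens, filtering by both orth and tag."""
    left, right = variant
    out = []
    for word, tag in zip(words, tags):
        if tag == "``" and word in LEFT_ORTHS:
            out.append(left)
        elif tag == "''" and word in RIGHT_ORTHS:
            out.append(right)
        else:
            out.append(word)
    return out
```

Only the word list changes, which is why this works without further bookkeeping when no raw text is attached to the training document.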
* Prevent subtok label if not learning tokens
The parser introduces the subtok label to mark tokens that should be
merged during post-processing. Previously this happened even if we did
not have the --learn-tokens flag set. This patch passes the config
through to the parser, to prevent the problem.
* Make merge_subtokens a parser post-process if learn_subtokens
* Fix train script
* Add test for 3830: subtok problem
* Fix handling of non-subtok in parser training
* allow phrasematcher to link one match to multiple original patterns
* small fix for defining ent_id in the matcher (anti-ghost prevention)
* cleanup
* formatting
Check for relevant components in the pipeline when Matcher is called,
similar to the checks for PhraseMatcher in #4105.
* keep track of attributes seen in patterns
* when Matcher is called on a Doc, check for is_tagged for LEMMA, TAG,
POS and for is_parsed for DEP
* Fix typo in rule-based matching docs
* Improve token pattern checking without validation
Add more detailed token pattern checks without full JSON pattern validation and
provide more detailed error messages.
Addresses #4070 (also related: #4063, #4100).
* Check whether top-level attributes in patterns and attr for PhraseMatcher are
in token pattern schema
* Check whether attribute value types are supported in general (as opposed to
per attribute with full validation)
* Report various internal error types (OverflowError, AttributeError, KeyError)
as ValueError with standard error messages
* Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS,
LEMMA, and DEP
* Add error messages with relevant details on how to use validate=True or nlp()
instead of nlp.make_doc()
* Support attr=TEXT for PhraseMatcher
* Add NORM to schema
* Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler
* Remove unnecessary .keys()
* Rephrase error messages
* Add another type check to Matcher
Add another type check to Matcher for more understandable error messages
in some rare cases.
* Support phrase_matcher_attr=TEXT for EntityRuler
* Don't use spacy.errors in examples and bin scripts
* Fix error code
* Auto-format
Also try to get Azure pipelines to finally start a build :(
* Update errors.py
Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Move Turkish lemmas to a json file
Rather than a large dict in Python source, the data is now a big json
file. This includes a method for loading the json file, falling back to
a compressed file, and an update to MANIFEST.in that excludes json in
the spacy/lang directory.
This focuses on Turkish specifically because it has the most language
data in core.
* Transition all lemmatizer.py files to json
This covers all lemmatizer.py files of a significant size (>500k or so).
Small files were left alone.
None of the affected files have logic, so this was pretty
straightforward.
One unusual thing is that the lemma data for Urdu doesn't seem to be
used anywhere. That may require further investigation.
* Move large lang data to json for fr/nb/nl/sv
These are the languages that use a lemmatizer directory (rather than a
single file) and are larger than English.
For most of these languages there were many language data files, in
which case only the large ones (>500k or so) were converted to json. It
may or may not be a good idea to migrate the remaining Python files to
json in the future.
* Fix id lemmas.json
The contents of this file were originally just copied from the Python
source, but that used single quotes, so it had to be properly converted
to json first.
* Add .json.gz to gitignore
This covers the json.gz files built as part of distribution.
* Add language data gzip to build process
Currently this gzips the data on every build; it works, but it should be
changed to only gzip when the source file has been updated.
* Remove Danish lemmatizer.py
Missed this when I added the json.
* Update to match latest explosion/srsly#9
The way gzipped json is loaded/saved in srsly changed a bit.
* Only compress language data if necessary
If a .json.gz file exists and is newer than the corresponding json file,
it's not recompressed.
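The timestamp check can be sketched like this. It is an illustration using only the standard library, not the build code itself (which goes through srsly); the function name `compress_if_needed` is hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def compress_if_needed(json_path: Path) -> bool:
    """Write json_path + '.gz' unless an up-to-date copy already exists.

    Returns True if the file was (re)compressed.
    """
    gz_path = json_path.with_name(json_path.name + ".gz")
    if gz_path.exists() and gz_path.stat().st_mtime > json_path.stat().st_mtime:
        # The compressed file is newer than its source: nothing to do.
        return False
    with json_path.open("rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return True
```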
* Move en/el language data to json
This only affected files >500kb, which was nouns for both languages and
the generic lookup table for English.
* Remove empty files in Norwegian tokenizer
It's unclear why, but the Norwegian (nb) tokenizer had empty files for
adj/adv/noun/verb lemmas. This may have been a result of copying the
structure of the English lemmatizer.
This removed the files, but still creates the empty sets in the
lemmatizer. That may not actually be necessary.
* Remove dubious entries in English lookup.json
" furthest" and " skilled" - both prefixed with a space - were in the
English lookup table. That seems obviously wrong so I have removed them.
* Fix small issues with en/fr lemmatizers
The en tokenizer was including the removed _nouns.py file, so that's
removed.
The fr tokenizer is unusual in that it has a lemmatizer directory with
both __init__.py and lemmatizer.py. lemmatizer.py had not been converted
to load the json language data, so that was fixed.
* Auto-format
* Auto-format
* Update srsly pin
* Consistently use pathlib paths
While working on an unrelated task I got warnings about an unsupported
escape sequence (`"\("`) in the tokenizer exceptions. Making the
tokenizer exceptions a raw string makes this warning go away.
The specific string that triggered this is `¯\(ツ)/¯`.
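A short illustration of why the raw string helps: in a regular string literal, `\(` is not a recognized escape sequence, so Python keeps the backslash but warns about it. A raw string gives the same value without the warning. The variable names are just for the example.

```python
import re

# With a plain literal, "\(" would trigger the invalid escape warning.
# The raw string keeps the backslash as a literal character, silently.
shrug = r"¯\(ツ)/¯"

# The same applies to regular expressions, where "\(" matches a literal "(".
open_paren = re.compile(r"\(")
```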
* customizable template for entity display, allowing additional parameters to be passed along with each entity
* contributor agreement
* simpler naming for the additional parameters given to the span entities renderer
Co-Authored-By: Ines Montani <ines@ines.io>
* change of default parameter, as suggested
Co-Authored-By: Ines Montani <ines@ines.io>
* Extending debug-data with dependency checks, etc.
* Modify debug-data to load with GoldCorpus to iterate over .json/.jsonl
files within directories
* Add GoldCorpus iterator train_docs_without_preprocessing to load
original train docs without shuffling and projectivizing
* Report number of misaligned tokens
* Add more dependency checks and messages
* Update spacy/cli/debug_data.py
Co-Authored-By: Ines Montani <ines@ines.io>
* Fixed conflict
* Move counts to _compile_gold()
* Move all dependency nonproj/sent/head/cycle counting to
_compile_gold()
* Unclobber previous merges
* Update variable names
* Update more variable names, fix misspelling
* Don't clobber loading error messages
* Only warn about misaligned tokens if present
* Check whether two entities overlap
- biluo_gold_biluo_overlap now throws an exception when the entities passed in overlap
- added unit test
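A minimal sketch of such an overlap check over character offsets. The function name `check_overlap` and the `(start_char, end_char, label)` tuple layout with exclusive end offsets are assumptions for illustration, not the code added in this change.

```python
from typing import List, Tuple


def check_overlap(entities: List[Tuple[int, int, str]]) -> None:
    """Raise ValueError if any two (start, end, label) spans overlap."""
    ordered = sorted(entities, key=lambda ent: (ent[0], ent[1]))
    # After sorting by start offset, any overlap shows up between neighbours.
    for prev, curr in zip(ordered, ordered[1:]):
        if curr[0] < prev[1]:
            raise ValueError(
                "Overlapping entities: {} and {}".format(prev, curr)
            )
```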
* SCA agreement
Provide the tokens in the cycle and the first 50 tokens from document in
the error message so it's easier to track down the location of the cycle
in the data.
Addresses feature request in #3698.
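A sketch of how the tokens in a cycle could be found and reported. It assumes heads are given as absolute token indices with the root pointing at itself; the function names and the message wording are hypothetical.

```python
from typing import List, Optional


def find_cycle(heads: List[int]) -> Optional[List[int]]:
    """Return the token indices of the first cycle found, or None."""
    for start in range(len(heads)):
        seen: List[int] = []
        i = start
        # Follow head links until we reach a root or revisit a token.
        while heads[i] != i:
            if i in seen:
                return seen[seen.index(i):]
            seen.append(i)
            i = heads[i]
    return None


def cycle_message(words: List[str], heads: List[int]) -> Optional[str]:
    """Build an error message naming the cycle and the document start."""
    cycle = find_cycle(heads)
    if cycle is None:
        return None
    cycle_words = [words[i] for i in cycle]
    return "Found cycle between tokens {} in document starting with: {}".format(
        cycle_words, " ".join(words[:50])
    )
```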
* pytest file for issue4104 established
* edited default lookup english lemmatizer for spun; fixes issue 4102
* eliminated parameterization and sorted dictionary dependency in issue 4104 test
* added contributor agreement
* turn kb_creator into CLI script (wip)
* proper parameters for training entity vectors
* wikidata pipeline split up into two executable scripts
* remove context_width
* move wikidata scripts in bin directory, remove old dummy script
* refine KB script with logs and preprocessing options
* small edits
* small improvements to logging of EL CLI script
* Update gold corpus code to properly ingest a directory of jsonlines files
In response to: https://github.com/explosion/spaCy/issues/3975
* Update spacy/gold.pyx
Co-Authored-By: Ines Montani <ines@ines.io>
* Improve NER per type scoring
* include all gold labels in per type scoring, not only when recall > 0
* improve efficiency of per type scoring
* Create Scorer tests, initially with NER tests
* move regression test #3968 (per type NER scoring) to Scorer tests
* add new test for per type NER scoring with imperfect P/R/F and per
type P/R/F including a case where R == 0.0
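To make the per-type behaviour concrete, here is a small sketch that scores entity spans per label and includes labels that only occur in the gold data, so a type with R == 0.0 still gets an entry. It is not the `Scorer` code: the function names are hypothetical and the values are fractions rather than percentages.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label)


def prf(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Precision, recall and F1, defined as 0.0 when a denominator is 0."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}


def score_per_type(gold: Set[Span], pred: Set[Span]) -> Dict[str, Dict[str, float]]:
    """Score each label seen in either the gold or the predicted spans."""
    labels = {span[2] for span in gold} | {span[2] for span in pred}
    scores = {}
    for label in sorted(labels):
        gold_l = {span for span in gold if span[2] == label}
        pred_l = {span for span in pred if span[2] == label}
        tp = len(gold_l & pred_l)
        scores[label] = prf(tp, len(pred_l - gold_l), len(gold_l - pred_l))
    return scores
```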
* Improve error message when model.from_bytes() dies
When Thinc's model.from_bytes() is called with a mismatched model, often
we get a particularly ungraceful error,
e.g. "AttributeError: FunctionLayer has no attribute G"
This is because we're trying to load the parameters for something like
a LayerNorm layer, and the model architecture has some other layer there
instead. This is obviously terrible, especially since the error *type*
is wrong.
I've changed it to raise a ValueError. The error message is still
probably a bit terse, but it's hard to be sure exactly what's gone
wrong.
* Update spacy/pipeline/pipes.pyx
* Update spacy/pipeline/pipes.pyx
* Update spacy/pipeline/pipes.pyx
* Update spacy/syntax/nn_parser.pyx
* Update spacy/syntax/nn_parser.pyx
* Update spacy/pipeline/pipes.pyx
Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>
* Update spacy/pipeline/pipes.pyx
Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
* failing unit test for issue 3962
* attempt to fix Issue #3962
* create artificial unit test example
* using length instead of self.length
* sp
* reformat with black
* find better ancestor within span and use generic 'dep'
* attach to span.root if there is no appropriate ancestor
* comment span text
* clean up ancestor code
* reconstruct dep tree to keep same number of sentences
Expect an `entity_ruler.jsonl` file in the top-level model directory, i.e. the path passed to from_disk by default (model path plus component name), but with the suffix ".jsonl".
* Update pretrain to prevent unintended overwriting of weight files for #3859
* Add '--epoch-start' to pretrain docs
* Add missing pretrain arguments to bash example
* Update doc tag for v2.1.5
* Evaluation of NER model per entity type, closes #3490
Now each ent score is tracked individually in order to have its own Precision, Recall and F1 Score
* Keep track of each entity individually using dicts
* Improving how to compute the scores for each entity
* Fixed bug computing scores for ents
* Formatting with black
* Added key ents_per_type to the scores function
The key `ents_per_type` contains the metrics Precision, Recall and F1-Score for each entity individually
* Preserve flags in EntityRuler
The EntityRuler (explosion/spaCy#3526) does not preserve
overwrite flags (or `ent_id_sep`) when serialized. This
commit adds support for serialization/deserialization preserving
overwrite and ent_id_sep flags.
* add signed contributor agreement
* flake8 cleanup
mostly blank line issues.
* mark test from the issue as needing a model
The test from the issue needs some language model for serialization
but the test wasn't originally marked correctly.
* Adds `phrase_matcher_attr` to allow args to PhraseMatcher
This is an added arg to pass to the `PhraseMatcher`. For example,
this allows creation of a case insensitive phrase matcher when the
`EntityRuler` is created. References explosion/spaCy#3822
* remove unneeded model loading
The model didn't need to be loaded, and I replaced it with
a change that doesn't require it (using existing fixtures)
* updated docstring for new argument
* updated docs to reflect new argument to the EntityRuler constructor
* change tempdir handling to be compatible with python 2.7
* return conflicted code to entityruler
Some stuff got cut out because of merge conflicts, this
returns that code for the phrase_matcher_attr.
* fixed typo in the code added back after conflicts
* flake8 compliance
When I deconflicted the branch there were some flake8 issues
introduced. This resolves the spacing problems.
* test changes: attempts to fix flaky test in python3.5
These tests seem to be a little flaky in 3.5 so I changed the check to avoid
the comparisons that seem to fail sometimes.
* Adds code to handle item saved before this change.
This code changes how the save files are handled and how the bytes
are stored as well. This code adds a check to dispatch correctly
if it encounters bytes or files saved in the old format (and tests
for those cases).
* use util function for tempdir management
Updated after PR comments: this code now uses the make_tempdir function from util
instead of doing it by hand.
* Norwegian fix
Add support for alternative past tense verb form (vaska).
* Norwegian months
Add all Norwegian months to tokenizer exceptions.
* More Norwegian abbreviations
Add more Norwegian abbreviations to tokenizer_exceptions.
* Contributor agreement khellan
Add signed contributor agreement for khellan (Knut O. Hellan).
* initial LT lang support
* Added more stopwords. Started setting up some basic test environment (not complete)
* Initial morph rules for LT lang
* Closes #1 Adds tokenizer exceptions for Lithuanian
* Closes #5 Punctuation rules. Closes #6 Lexical Attributes
* test: add native examples to basic tests
* feat: add tag map for lt lang
* fix: remove undefined tag attribute 'Definite'
* feat: add lemmatizer for lt lang
* refactor: add new instances to lt lang morph rules; use tags from tag map
* refactor: add morph rules to lt lang defaults
* refactor: only keep nouns, verbs, adverbs and adjectives in lt lang lemmatizer lookup
* refactor: add capitalized words to lt lang lemmatizer
* refactor: add more num words to lt lang lex attrs
* refactor: update lt lang stop word set
* refactor: add new instances to lt lang tokenizer exceptions
* refactor: remove comments from lt lang init file
* refactor: use function instead of lambda in lt lex lang getter
* refactor: remove conversion to dict in lt init when dict is already provided
* chore: rename lt 'test_basic' to 'test_text'
* feat: add more lt text tests
* feat: add lemmatizer tests
* refactor: remove unused imports, add newline to end of file
* chore: add contributor agreement
* chore: change 'en' to 'lt' in lt example description
* fix: add missing encoding info
* style: add newline to end of file
* refactor: use python2 compatible syntax
* style: reformat code using black
* Add error to `get_vectors_loss` for unsupported loss function of `pretrain`
* Add missing "--loss-func" argument to pretrain docs. Update pretrain plac annotations to match docs.
* Add missing quotation marks
* Adding support for entity_id in EntityRuler pipeline component
* Adding spaCy Contributor Agreement
* Updating EntityRuler to use string.format instead of f strings
* Update Entity Ruler to support an 'id' attribute per pattern that explicitly identifies an entity.
* Fixing tests
* Remove custom extension entity_id and use built in ent_id token attribute.
* Changing entity_id to ent_id for consistent naming
* entity_ids => ent_ids
* Removing kb, cleaning up tests, making util functions private, use rsplit instead of split
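The commits above mention an optional `id` per pattern and a switch from `split` to `rsplit`. A minimal sketch of why `rsplit` matters when a label and an ID are packed into one key; the separator `"||"` and the helper names `create_label` / `split_label` are assumptions for illustration, not confirmed by this log.

```python
ENT_ID_SEP = "||"  # assumed separator, chosen for illustration


def create_label(label, ent_id=None):
    """Combine an entity label and an optional ent_id into one key."""
    if ent_id is None:
        return label
    return label + ENT_ID_SEP + ent_id


def split_label(key):
    """Split a combined key back into (label, ent_id)."""
    if ENT_ID_SEP not in key:
        return key, None
    # rsplit with maxsplit=1 splits at the LAST separator only, so a label
    # that itself contains the separator stays intact.
    label, ent_id = key.rsplit(ENT_ID_SEP, 1)
    return label, ent_id


patterns = [
    {"label": "ORG", "pattern": "Apple", "id": "apple"},
    {"label": "GPE", "pattern": "San Francisco"},  # no id
]
keys = [create_label(p["label"], p.get("id")) for p in patterns]
```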
* Add check for empty input file to CLI pretrain
* Raise error if JSONL is not a dict or contains neither `tokens` nor `text` key
* Skip empty values for correct pretrain keys and log a counter as warning
* Add tests for CLI pretrain core function make_docs.
* Add a short hint for the `tokens` key to the CLI pretrain docs
* Add success message to CLI pretrain
* Update model loading to fix the tests
* Skip empty values and do not create docs out of it
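The input checks listed above can be sketched roughly as follows. This is a simplified stand-in for the validation in `make_docs`, not its actual code; the function name `read_pretrain_records` and the exact exception types are assumptions.

```python
import json


def read_pretrain_records(lines):
    """Parse JSONL lines, keeping usable records and counting skipped ones."""
    records, skipped, seen = [], 0, 0
    for line in lines:
        if not line.strip():
            continue
        seen += 1
        record = json.loads(line)
        if not isinstance(record, dict):
            raise TypeError("Each JSONL line must be a JSON object")
        if "tokens" not in record and "text" not in record:
            raise ValueError("Each record needs a 'tokens' or 'text' key")
        # Prefer pre-tokenized input when it is present.
        value = record["tokens"] if "tokens" in record else record["text"]
        if not value:
            # Empty values are skipped and counted, not turned into docs.
            skipped += 1
            continue
        records.append(record)
    if seen == 0:
        raise ValueError("Input file is empty")
    return records, skipped
```

The `skipped` counter is what a caller would report as a warning at the end of loading.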
* Update norm_exceptions.py
Extended the Currency set to include Franc, Indian Rupee, Bangladeshi Taka, Korean Won, Mexican Dollar, and Egyptian Pound
* Fix formatting [ci skip]
* Adding Marathi language details and folder to it
* Adding few changes and running tests
* Adding few changes and running tests
* Update __init__.py
mh -> mr
* Rename spacy/lang/mh/__init__.py to spacy/lang/mr/__init__.py
* mh -> mr
* Add custom __dir__ to Underscore (see #3707)
* Make sure custom extension methods keep their docstrings (see #3707)
* Improve tests
* Prepend note on partial to docstring (see #3707)
* Remove print statement
* Handle cases where docstring is None
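A rough sketch of the two ideas in these commits: a custom `__dir__` that lists registered extensions, and copying a method's docstring onto the `functools.partial` object that wraps it. `UnderscoreSketch` is a toy class written for illustration, not the real `Underscore` implementation.

```python
import functools


class UnderscoreSketch:
    """Toy stand-in for an object exposing custom extension methods."""

    extensions = {}  # name -> function taking the wrapped object first

    def __init__(self, obj):
        object.__setattr__(self, "_obj", obj)

    def __dir__(self):
        # List registered extensions alongside the regular attributes.
        return sorted(set(list(self.extensions) + list(super().__dir__())))

    def __getattr__(self, name):
        if name not in self.extensions:
            raise AttributeError(name)
        method = self.extensions[name]
        bound = functools.partial(method, self._obj)
        # partial objects do not inherit the wrapped docstring; copy it over
        # and cope with methods that have no docstring at all.
        note = "Partial function; the first argument is filled automatically. "
        bound.__doc__ = note + (method.__doc__ or "")
        return bound


def shout(obj, suffix="!"):
    """Return the object's text in upper case."""
    return obj.upper() + suffix


UnderscoreSketch.extensions["shout"] = shout
u = UnderscoreSketch("hello")
```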
* Update glossary.py to match information found in documentation
I used regexes to add any dependency tag that was in the documentation but not in the glossary. Solves #3679👍
* Adds forgotten colon
* test sPacy commit to git fri 04052019 10:54
* change Data format from my format to master format
* ทัทั้งนี้ ---> ทั้งนี้
* delete stop_word translate from Eng
* Adjust formatting and readability
* add Thai norm_exception
* Add Dobita21 SCA
* editรึ : หรือ,
* Update Dobita21.md
* Auto-format
* Integrate norms into language defaults
* add acronym and some norm exception words
* add lex_attrs
* Add lexical attribute getters into the language defaults
* fix LEX_ATTRS
Co-authored-by: Donut <dobita21@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
When using `spacy pretrain`, the model is saved only after every epoch. But each epoch can be very big since `pretrain` is used for language modeling tasks. So I added a `--save-every` option in the CLI to save after every `--save-every` batches.
## Description
To test...
Save this file to `sample_sents.jsonl`
```
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
```
Then run `--save-every 2` when pretraining.
```bash
spacy pretrain sample_sents.jsonl en_core_web_md here -nw 1 -bs 1 -i 10 --save-every 2
```
And it should save the model to the `here/` folder after every 2 batches. The models that are saved during an epoch will have a `.temp` appended to the save name.
At the end of training, you should see these files (`ls here/`):
```bash
config.json model2.bin model5.bin model8.bin
log.jsonl model2.temp.bin model5.temp.bin model8.temp.bin
model0.bin model3.bin model6.bin model9.bin
model0.temp.bin model3.temp.bin model6.temp.bin model9.temp.bin
model1.bin model4.bin model7.bin
model1.temp.bin model4.temp.bin model7.temp.bin
```
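The saving schedule described above can be sketched as a small loop. The file names follow the listing; whether the `.temp` file is overwritten within an epoch is an assumption inferred from that listing, and `pretrain_save_names` is a hypothetical helper, not part of the CLI.

```python
def pretrain_save_names(n_epochs, n_batches, save_every=None):
    """List the checkpoint file names written during a toy training loop."""
    saved = []
    for epoch in range(n_epochs):
        for batch_id in range(1, n_batches + 1):
            # Mid-epoch checkpoint: reuses one ".temp" name per epoch.
            if save_every and batch_id % save_every == 0:
                saved.append("model%d.temp.bin" % epoch)
        # Regular checkpoint at the end of every epoch.
        saved.append("model%d.bin" % epoch)
    return saved


names = pretrain_save_names(n_epochs=10, n_batches=16, save_every=2)
```

With 10 epochs this yields 20 distinct model file names, matching the count of model files in the listing.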
### Types of change
This is a new feature to `spacy pretrain`.
🌵 **Unfortunately, I haven't been able to test this because compiling from source is not working (cythonize error).**
```
Processing matcher.pyx
[Errno 2] No such file or directory: '/Users/mwu/github/spaCy/spacy/matcher.pyx'
Traceback (most recent call last):
File "/Users/mwu/github/spaCy/bin/cythonize.py", line 169, in <module>
run(args.root)
File "/Users/mwu/github/spaCy/bin/cythonize.py", line 158, in run
process(base, filename, db)
File "/Users/mwu/github/spaCy/bin/cythonize.py", line 124, in process
preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp")
File "/Users/mwu/github/spaCy/bin/cythonize.py", line 87, in preserve_cwd
func(*args)
File "/Users/mwu/github/spaCy/bin/cythonize.py", line 63, in process_pyx
raise Exception("Cython failed")
Exception: Cython failed
Traceback (most recent call last):
File "setup.py", line 276, in <module>
setup_package()
File "setup.py", line 209, in setup_package
generate_cython(root, "spacy")
File "setup.py", line 132, in generate_cython
raise RuntimeError("Running cythonize failed")
RuntimeError: Running cythonize failed
```
Edit: Fixed! after deleting all `.cpp` files: `find spacy -name "*.cpp" | xargs rm`
## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
If the Morphology class tries to lemmatize a word that's not in the
string store, it's forced to just return it as-is. While loading
exceptions, the class could hit a case where these strings weren't in
the string store yet. The resulting lemmas could then be cached, leading
to some words receiving upper-case lemmas. Closes #3551.
* Add early stopping
* Add return_score option to evaluate
* Fix missing str to path conversion
* Fix import + old python compatibility
* Fix bad beam_width setting during cpu evaluation in spacy train with gpu option turned on
## Checklist
- [ ] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
Co-authored-by: Ines Montani <ines@ines.io>
* added tag_map for indonesian
* changed tag map from .py to .txt to see if tests pass
* added symbols import
* added utf8 encoding flag
* added missing SCONJ symbol
* Auto-format
* Remove unused imports
* Make tag map available in Indonesian defaults
## Description
Fix a bug in the test of JapaneseTokenizer.
This PR may require @polm's review.
### Types of change
Bug fix
## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* fix(util): fix decaying function output
* fix(util): better test and adhere to code standards
* fix(util): correct variable name, pytestify test, update website text
* Fix code for bag-of-words feature extraction
The _ml.py module had a redundant copy of a function to extract unigram
bag-of-words features, except one had a bug that set values to 0.
Another function allowed extraction of bigram features. Replace all three
with a new function that supports arbitrary ngram sizes and also allows
control of which attribute is used (e.g. ORTH, LOWER, etc).
* Support 'bow' architecture for TextCategorizer
This allows efficient ngram bag-of-words models, which are better when
the classifier needs to run quickly, especially when the texts are long.
Pass architecture="bow" to use it. The extra arguments ngram_size and
attr are also available, e.g. ngram_size=2 means unigram and bigram
features will be extracted.
* Fix size limits in train_textcat example
* Explain architectures better in docs
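The n-gram extraction described in this list can be sketched as below: with `ngram_size=2`, both unigrams and bigrams are counted, and the `attr` callable stands in for the choice of token attribute (e.g. ORTH vs. LOWER). This is a plain-Python illustration over strings, not the function in `_ml.py`; `extract_ngrams` is a hypothetical name.

```python
from collections import Counter


def extract_ngrams(tokens, ngram_size=1, attr=str.lower):
    """Count all n-grams of length 1..ngram_size over one token attribute."""
    values = [attr(token) for token in tokens]
    counts = Counter()
    for n in range(1, ngram_size + 1):
        for start in range(len(values) - n + 1):
            counts[tuple(values[start:start + n])] += 1
    return counts


features = extract_ngrams(["The", "cat", "the"], ngram_size=2)
```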
v2.1 introduced a regression when deserializing the parser after
parser.add_label() had been called. The code around the class mapping is
pretty confusing currently, as it was written to accommodate backwards
model compatibility. It needs to be revised when the models are next
retrained.
Closes #3433
spaCy v2.1 switched to the built-in re module, where v2.0 had been using
the third-party regex library. When the tokenizer was deserialized on
Python2.7, the `re.compile()` function was called with expressions that
featured escaped unicode codepoints that were not in Python2.7's unicode
database.
Problems occurred when we had a range between two of these unknown
codepoints, like this:
```
'[\\uAA77-\\uAA79]'
```
On Python2.7, the unknown codepoints are not unescaped correctly,
resulting in arbitrary out-of-range characters being matched by the
expression.
This problem does not occur if we instead have a range between two
unicode literals, rather than the escape sequences. To fix the bug, we
therefore add a new compat function that unescapes unicode sequences
using the `ast.literal_eval()` function. Care is taken to ensure we
do not also unescape non-unicode sequences.
Closes #3356.
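A minimal sketch of the unescaping approach described above, written for Python 3. It is not the compat function itself: the name `unescape_unicode` is assumed here, and the sketch does not handle patterns that contain triple quotes or an escaped backslash directly before a `u`.

```python
import ast
import re


def unescape_unicode(pattern):
    """Turn unicode escape sequences into literal characters.

    Other backslash sequences (such as a regex digit class) are kept as-is.
    """
    # Protect every backslash first, so literal_eval leaves them alone...
    protected = pattern.replace("\\", "\\\\")
    # ...then undo the protection for unicode escapes only.
    protected = protected.replace("\\\\u", "\\u").replace("\\\\U", "\\U")
    # literal_eval only evaluates the string literal; it cannot run code.
    return ast.literal_eval("'''" + protected + "'''")


escaped = "[\\uAA77-\\uAA79]"
literal = unescape_unicode(escaped)
# The range is now expressed with literal characters instead of escapes.
matches = re.match(literal, "\uaa78") is not None
```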
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
I wrote a small script to read the UD English training data and check
that our tag map and morph rules were resulting in the best POS map.
This hadn't been done for some time, and there have been various changes
to the UD schema since it was last done. After these changes we should
see much better agreement between our POS assignments and the UD POS
tags.
While developing v2.1, I ran a bunch of hyper-parameter search
experiments to find settings that performed well for spaCy's NER and
parser. I ended up changing the default Adam settings from beta1=0.9,
beta2=0.999, eps=1e-8 to beta1=0.8, beta2=0.8, eps=1e-5. This was giving
a small improvement in accuracy (like, 0.4%).
Months later, I ran the models with Prodigy, which uses beam-search
decoding even when the model has been trained with a greedy objective.
The new models performed terribly...So, wtf? After a couple of days
debugging, I figured out that the new optimizer settings were causing the
model to converge to solutions where the top-scoring class often had
a score of like, -80. The variance on the weights had gone up
enormously. I guess I needed to update the L2 regularisation as well?
Anyway. Let's just revert the change --- if the optimizer is finding
such extreme solutions, that seems bad, and not nearly worth the small
improvement in accuracy.
Currently training a slate of models, to verify the accuracy change is minimal.
Once the training is complete, we can merge this.
## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
Add and document CLI options for batch size, max doc length, min doc length for `spacy pretrain`.
Also improve CLI output.
Closes #3216
## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* merging conllu/conll and conllubio scripts
* tabs to spaces
* removing conllubio2json from converters/__init__.py
* Move not-really-CLI tests to misc
* Add converter test using no-ud data
* Fix test I broke
* removing include_biluo parameter
* fixing read_conllx
* remove include_biluo from convert.py
* label in span not writable anymore
* more explicit unit test and error message for readonly label
* bit more explanation (view)
* error msg tailored to specific case
* fix None case
Closes #2091.
## Description
With the new `vocab.writing_system` property introduced in #3390 (exposed via the language defaults), I was able to finally fix this (I think!). Based on the `Doc`, displaCy now detects whether it's an RTL or LTR language and adjusts the visualization accordingly. Wherever possible, I've also added `direction` and `lang` attributes.
Entity visualization now looks like this:
<img width="318" alt="Screenshot 2019-03-11 at 16 06 51" src="https://user-images.githubusercontent.com/13643239/54136866-d97afd80-441c-11e9-8c27-3d46994cc833.png">
And dependencies like this (ignore the most likely incorrect tags and dependencies):
<img width="621" alt="Screenshot 2019-03-11 at 16 51 59" src="https://user-images.githubusercontent.com/13643239/54137771-8b66f980-441e-11e9-8460-0682b95eef2a.png">
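As a rough sketch of the idea, the snippet below picks the text direction from a writing-system mapping and writes it, together with the language code, into the markup. The dict shape `{"direction": ...}`, the `ltr` fallback, and the helper `wrap_markup` are assumptions for illustration and do not reproduce displaCy's templates.

```python
from html import escape


def wrap_markup(text, lang, writing_system):
    """Wrap text in a container carrying language and direction info."""
    # Fall back to left-to-right when no direction is given (assumption).
    direction = writing_system.get("direction", "ltr")
    return '<div lang="{}" style="direction: {}">{}</div>'.format(
        escape(lang), escape(direction), escape(text)
    )


rtl_markup = wrap_markup("שלום", "he", {"direction": "rtl"})
ltr_markup = wrap_markup("hello", "en", {})
```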
### Types of change
enhancement, bug fix
## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Add xfail test for vocab.writing_system
* Add vocab.writing_system property
* Set Language.Defaults.writing_system
* Set default writing system
* Remove xfail on test_vocab_writing_system
Closes #2203. Closes #3268.
Lemmas set from outside the `Morphology` class were being overwritten. The result was especially confusing when deserialising, as it meant some lemmas could change when storing and retrieving a `Doc` object.
This PR applies two fixes:
1) When we go to set the lemma in the `Morphology` class, first check whether a lemma is already set. If so, don't overwrite.
2) When we load with `doc.from_array()`, take care to apply the `TAG` field first. This allows other fields to overwrite the `TAG` implied properties, if they're provided explicitly (e.g. the `LEMMA`).
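The two fixes above can be illustrated with plain dictionaries standing in for tokens. This is only a sketch of the ordering logic; `set_lemma_from_tag`, `token_from_array` and the `tag_lemmas` lookup are hypothetical and much simpler than the `Morphology` class or `doc.from_array()`.

```python
def set_lemma_from_tag(token, tag, tag_lemmas):
    """Assign a tag and its implied lemma, keeping any lemma already set."""
    token["tag"] = tag
    if not token.get("lemma"):
        # Fix 1: only fill in the lemma when nothing was set explicitly.
        token["lemma"] = tag_lemmas.get((token["text"], tag), token["text"])


def token_from_array(text, fields, tag_lemmas):
    """Rebuild a token from stored fields, applying TAG before the rest."""
    token = {"text": text}
    # Fix 2: apply TAG first, whatever its position in `fields`.
    if "TAG" in fields:
        set_lemma_from_tag(token, fields["TAG"], tag_lemmas)
    for name, value in fields.items():
        if name != "TAG":
            # Explicit fields such as LEMMA override what TAG implied.
            token[name.lower()] = value
    return token


tag_lemmas = {("was", "VBD"): "be"}
restored = token_from_array("was", {"LEMMA": "custom", "TAG": "VBD"}, tag_lemmas)
```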
## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Add component_cfg kwarg to begin_training
* Document component_cfg arg to begin_training
* Update docs and auto-format
* Support component_cfg across Language
* Format
* Update docs and docstrings [ci skip]
* Fix begin_training
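The idea behind `component_cfg` can be sketched as a dictionary keyed by component name, whose values are passed on as keyword arguments to that component only. The toy pipeline and the setting names `dropout` and `beam_width` are placeholders for illustration, not the actual `Language` code.

```python
def begin_training_sketch(pipeline, component_cfg=None):
    """Call each component's setup function with its own settings."""
    component_cfg = component_cfg or {}
    results = {}
    for name, setup in pipeline:
        # Components without an entry simply get no extra keyword arguments.
        kwargs = component_cfg.get(name, {})
        results[name] = setup(**kwargs)
    return results


def tagger_setup(dropout=0.2):
    return {"dropout": dropout}


def parser_setup(beam_width=1):
    return {"beam_width": beam_width}


pipeline = [("tagger", tagger_setup), ("parser", parser_setup)]
configured = begin_training_sketch(
    pipeline, component_cfg={"parser": {"beam_width": 4}}
)
```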
* Make serialization methods consistent
Use an `exclude` keyword argument instead of arbitrarily named keyword arguments, with deprecation handling.
* Update docs and add section on serialization fields
* Use default return instead of else
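A minimal sketch of a single `exclude` argument shared by serialization methods: every field is written unless its name is listed. The field names and the helper `to_dict_sketch` are made up for illustration.

```python
def to_dict_sketch(fields, exclude=()):
    """Serialize every field whose name is not listed in `exclude`."""
    return {
        name: getter()
        for name, getter in fields.items()
        if name not in exclude
    }


fields = {
    "vocab": lambda: ["hello", "there"],
    "meta": lambda: {"lang": "en"},
}
without_vocab = to_dict_sketch(fields, exclude=["vocab"])
```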
* Add Doc.is_nered to indicate if entities have been set
* Add properties in Doc.to_json if they were set, not if they're available
This way, if a processed Doc exports "pos": None, it means that the tag was explicitly unset. If it exports "ents": [], it means that entity annotations are available but that this document doesn't contain any entities. Before, this would have been unclear and problematic for training.
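The distinction between "set but empty" and "never set" can be sketched with a sentinel value. This is an illustration of the export rule only; `to_json_sketch` and the `UNSET` sentinel are hypothetical and not how `Doc.to_json` tracks its state.

```python
UNSET = object()  # sentinel meaning "this annotation was never set"


def to_json_sketch(text, ents=UNSET, pos=UNSET):
    """Export only the annotations that were set, even if they are empty."""
    data = {"text": text}
    if ents is not UNSET:
        data["ents"] = list(ents)
    if pos is not UNSET:
        data["pos"] = pos
    return data


unprocessed = to_json_sketch("hi")
no_entities = to_json_sketch("hi", ents=[])
unset_tag = to_json_sketch("hi", pos=None)
```

A consumer can then tell a document with no entities (`"ents": []`) from one that was never run through an entity recognizer (no `"ents"` key at all).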