* fixing argument order for rehearse
* rehearse test for ner and tagger
* rehearse bugfix
* added test for parser
* test for multilabel textcat
* rehearse fix
* remove debug line
* Update spacy/tests/training/test_rehearse.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/tests/training/test_rehearse.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Kádár Ákos <akos@onyx.uvt.nl>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Make core projectivization methods cdef nogil
While profiling the parser, I noticed that relatively a lot of time is
spent in projectivization. This change rewrites the functions in the
core loops as cdef nogil for efficiency.
In C++-land, we use vector in place of Python lists and absent heads
are represented as -1 in place of None.
* _heads_to_c: add assertion
Validation should be performed by the caller, but this assertion ensures that
we are not reading/writing out of bounds with incorrect input.
* Fix NER check in CoNLL-U converter
Leave ents unset if no NER annotation is found in the MISC column.
* Revert to global rather than per-sentence NER check
* Update spacy/training/converters/conllu_to_docs.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Add whitespace augmenter that inserts a single whitespace token into a
doc containing annotation used in core trained pipelines.
Add a combined augmenter that handles lowercasing, orth variants and
whitespace augmentation.
* Extended list of numbers for ru language
Extended list of numbers with all forms and cases including short forms, slang variants and roman numerals.
* Update lex_attrs.py
* Update 'like_num' function with percentages
Added support for numbers with percentages like 12%, 1.2% and etc. to the 'like_num' function.
* black formatting
Co-authored-by: thomashacker <EdwardSchmuhl@web.de>
* Extend list of abbreviations for ru language
Extended list of abbreviations for ru language those may have influence on tokenization.
* black formatting
Co-authored-by: thomashacker <EdwardSchmuhl@web.de>
* Delay loading of mecab in Korean tokenizer
Delay loading of mecab until the tokenizer is called the first time so
that it's possible to initialize a blank `ko` pipeline without having
mecab installed, e.g. for use with `spacy init vectors`.
* Move mecab import back to __init__
Move mecab import back to __init__ to warn users at the same point as
before for missing python dependencies.
* Add github actions for slow and gpu tests
* change weekly GPU tests to also run slow tests, and change the time
* only run the tests if there were commits in the past day
* remove duplicate line
* add sent start/end token attributes to the docs
* let has_annotation work with IS_SENT_END
* elif instead of if
* add has_annotation test for sent attributes
* fix typo
* remove duplicate is_sent_start entry in docs