spaCy/spacy/tokens
Adriane Boyd f42c9026f5
Update v2.3.x branch (#5636)
* Fix typos and auto-format [ci skip]

* Add pkuseg warnings and auto-format [ci skip]

* Update Binder URL [ci skip]

* Update Binder version [ci skip]

* Update alignment example for new gold.align

* Update POS in tagging example

* Fix numpy.zeros() dtype for Doc.from_array

* Change example title to Dr.

Change example title to Dr. so the current model does exclude the title
in the initial example.

* Fix spacy convert argument

* Warning for sudachipy 0.4.5 (#5611)

* Create myavrum.md (#5612)

* Update lex_attrs.py (#5608)

* Create mahnerak.md (#5615)

* Some changes for Armenian (#5616)

* Fixing numericals

* We need an Armenian question sign to make the sentence a question

* Add Nepali Language (#5622)

* added support for nepali lang

* added examples and test files

* added spacy contributor agreement

* Japanese model: add user_dict entries and small refactor (#5573)

* user_dict fields: adding inflections, reading_forms, sub_tokens
deleting: unidic_tags
improve code readability around the token alignment procedure

* add test cases, replace fugashi with sudachipy in conftest

* move bunsetu.py to spaCy Universe as a pipeline component BunsetuRecognizer

* tag is space -> both surface and tag are spaces

* consider len(text)==0

* Add warnings example in v2.3 migration guide (#5627)

* contribute (#5632)

* Fix polarity of Token.is_oov and Lexeme.is_oov (#5634)

Fix `Token.is_oov` and `Lexeme.is_oov` so they return `True` when the
lexeme does **not** have a vector.

* Extend what's new in v2.3 with vocab / is_oov (#5635)

* Skip vocab in component config overrides (#5624)

* Fix backslashes in warnings config diff (#5640)

Fix backslashes in warnings config diff in v2.3 migration section.

* Disregard special tag _SP in check for new tag map (#5641)

* Skip special tag _SP in check for new tag map

In `Tagger.begin_training()` check for new tags aside from `_SP` in the
new tag map initialized from the provided gold tuples when determining
whether to reinitialize the morphology with the new tag map.

* Simplify _SP check

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Marat M. Yavrumyan <myavrum@ysu.am>
Co-authored-by: Karen Hambardzumyan <mahnerak@gmail.com>
Co-authored-by: Rameshh <30867740+rameshhpathak@users.noreply.github.com>
Co-authored-by: Hiroshi Matsuda <40782025+hiroshi-matsuda-rit@users.noreply.github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-06-29 14:13:12 +02:00
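
The `_SP` change described in the commit message above (#5641) can be illustrated with a small standalone sketch. This is not the code from `Tagger.begin_training()`: the function name `needs_reinit`, its parameters, and the plain-dict tag maps are hypothetical stand-ins, assuming only that the decision reduces to "does the new tag map contain any tag, other than `_SP`, that the original tag map lacks".

```python
# Hypothetical sketch of the check described in #5641; not spaCy's actual code.
SPECIAL_TAG = "_SP"


def needs_reinit(orig_tag_map, new_tag_map, ignored=(SPECIAL_TAG,)):
    """Return True if new_tag_map has a tag missing from orig_tag_map,
    disregarding the special tags listed in `ignored`."""
    return any(
        tag not in orig_tag_map and tag not in ignored
        for tag in new_tag_map
    )


orig = {"NOUN": {"pos": "NOUN"}, "VERB": {"pos": "VERB"}}

# Only the special tag is new, so the morphology would be left alone.
only_sp = {"NOUN": {"pos": "NOUN"}, "_SP": {"pos": "SPACE"}}

# A genuinely new tag (ADJ) appears, so reinitialization would be triggered.
with_adj = {"NOUN": {"pos": "NOUN"}, "ADJ": {"pos": "ADJ"}}

print(needs_reinit(orig, only_sp))
print(needs_reinit(orig, with_adj))
```

Treating `_SP` as ignorable means a tag map that differs only by the whitespace tag does not force the morphology to be rebuilt.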
__init__.pxd * Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx 2015-07-13 20:20:58 +02:00
__init__.py DocPallet -> DocBin 2019-09-18 15:15:37 +02:00
_retokenize.pyx Disallow merging 0-length spans 2020-05-22 10:14:34 +02:00
_serialize.py Include Doc.cats in serialization of Doc and DocBin (#4774) 2019-12-06 14:07:39 +01:00
doc.pxd Normalize TokenC.sent_start values for Matcher (#5346) 2020-04-29 12:57:30 +02:00
doc.pyx Limiting noun_chunks for specific languages (#5396) 2020-05-14 12:58:06 +02:00
morphanalysis.pxd Add header for morphanalysis 2019-03-07 17:24:57 +01:00
morphanalysis.pyx Remove MorphAnalysis __str__ and __repr__ 2020-05-29 14:33:47 +02:00
span.pxd annotate kb_id through ents in doc 2019-03-22 11:36:44 +01:00
span.pyx Use Token.sent_start for Span.sent (#5439) 2020-05-14 18:22:51 +02:00
token.pxd serialize ENT_ID (#4852) 2020-01-06 14:57:34 +01:00
token.pyx Update v2.3.x branch (#5636) 2020-06-29 14:13:12 +02:00
underscore.py load Underscore state when multiprocessing 2020-02-12 11:50:42 +01:00
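
The most recent change to `token.pyx` listed above includes the `is_oov` polarity fix (#5634). As a minimal sketch of the intended behavior, assuming a plain dict as a toy vector table, the class below mimics the relationship between having a vector and being out-of-vocabulary. `ToyLexeme` and its attributes are hypothetical and do not reflect how spaCy's `Token` or `Lexeme` are implemented.

```python
# Toy stand-in for a lexeme backed by a vector table; not spaCy's real classes.
class ToyLexeme:
    def __init__(self, text, vectors):
        self.text = text
        self._vectors = vectors  # assumed: dict mapping word -> vector

    @property
    def has_vector(self):
        return self.text in self._vectors

    @property
    def is_oov(self):
        # Polarity after the fix: True when the lexeme does NOT have a vector.
        return not self.has_vector


vectors = {"apple": [0.1, 0.3], "pear": [0.2, 0.1]}
known = ToyLexeme("apple", vectors)
unknown = ToyLexeme("afskfsd", vectors)

print(known.text, known.is_oov)
print(unknown.text, unknown.is_oov)
```

In this sketch `is_oov` is simply the negation of `has_vector`, which is the behavior the commit message describes.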