spaCy/spacy/lang
Paul O'Leary McCann 0f01f46e02
Update Cython string types (#9143)
* Replace all basestring references with unicode

`basestring` was a compatibility type introduced by Cython to make
dealing with UTF-8 strings in Python 2 easier. In Python 3 it is
equivalent to the unicode (or str) type.

I replaced all references to basestring with unicode, since that was
used elsewhere, but we could also just replace them with str, which
should also be equivalent.

All tests pass locally.

* Replace all references to unicode type with str

Since we only support Python 3, this is simpler.

* Remove all references to unicode type

This removes all references to the unicode type across the codebase and
replaces them with `str`, which makes it more drastic than the prior
commits. In order to make this work importing `unicode_literals` had to
be removed, and one explicit unicode literal also had to be removed (it
is unclear why this is necessary in Cython with language level 3, but
without doing it there were errors about implicit conversion).

Where `unicode` was used as a type in comments, it was also changed to
`str`.

Additionally `coding: utf8` headers were removed from a few files.
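
A minimal plain-Python sketch of why these replacements are safe on Python 3. It is an illustration only, not code from this commit: the helper `is_text` is hypothetical, and the Cython-specific implicit-conversion errors mentioned above are not reproduced here.

```python
import builtins


def is_text(value):
    """Hypothetical helper: stands in for isinstance(value, basestring)."""
    return isinstance(value, str)


# Python 3 no longer defines the Python 2 names as builtins.
legacy_names_present = [
    name for name in ("basestring", "unicode") if hasattr(builtins, name)
]

# A u"" prefix is still accepted, but it produces an ordinary str,
# so dropping the prefix does not change the value or its type.
prefixed = u"café"
plain = "café"
same_type = type(prefixed) is type(plain) is str

print(is_text(plain), is_text(b"bytes"), legacy_names_present, same_type)
```

The source file needs no `coding: utf8` header for the non-ASCII literal, since UTF-8 is the default source encoding on Python 3.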
2021-09-13 17:02:17 +02:00
af Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
am Update Tigrinya ትግርኛ language support (#8900) 2021-08-10 13:55:08 +02:00
ar Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
az Fix Azerbaijani init, extend lang init tests (#8656) 2021-07-09 15:36:35 +02:00
bg Improve the stop words and the tokenizer exceptions in Bulgarian language. (#8862) 2021-08-10 13:44:23 +02:00
bn Refactor scoring methods to use registered functions (#8766) 2021-08-10 15:13:39 +02:00
ca Refactor scoring methods to use registered functions (#8766) 2021-08-10 15:13:39 +02:00
cs Tidy up and auto-format 2021-01-05 13:41:53 +11:00
da Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-rc3 2021-01-14 11:49:58 +01:00
de Merge branch 'develop' into master-tmp 2020-10-04 14:52:20 +02:00
el Refactor scoring methods to use registered functions (#8766) 2021-08-10 15:13:39 +02:00
en Update Cython string types (#9143) 2021-09-13 17:02:17 +02:00
es Refactor scoring methods to use registered functions (#8766) 2021-08-10 15:13:39 +02:00
et Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
eu Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
fa Refactor scoring methods to use registered functions (#8766) 2021-08-10 15:13:39 +02:00
fi Tidy up code 2021-06-28 12:08:15 +02:00
fr Refactor scoring methods to use registered functions (#8766) 2021-08-10 15:13:39 +02:00
ga Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
grc Remove extraneous grc test file (#8768) 2021-07-20 15:51:15 +02:00
gu Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
he raise NotImplementedError when noun_chunks iterator is not implemented (#6711) 2021-01-17 19:56:05 +08:00
hi Auto-format [ci skip] 2020-10-15 10:08:53 +02:00
hr Remove tag map 2020-12-09 11:13:49 +11:00
hu Fix Hungarian % tokenization (#6013) 2020-09-02 13:06:16 +02:00
hy Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
id Merge branch 'develop' into master-tmp 2020-10-04 14:52:20 +02:00
is Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
it Refactor scoring methods to use registered functions (#8766) 2021-08-10 15:13:39 +02:00
ja Update custom tokenizer APIs and pickling (#8972) 2021-08-19 14:37:47 +02:00
kn Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
ko Update custom tokenizer APIs and pickling (#8972) 2021-08-19 14:37:47 +02:00
ky Tidy up and auto-format 2021-01-30 12:52:33 +11:00
lb Remove default initialize lookups 2020-10-01 21:54:33 +02:00
lij Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
lt Fix escape sequence 2021-01-30 12:39:58 +11:00
lv Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
mk Refactor scoring methods to use registered functions (#8766) 2021-08-10 15:13:39 +02:00
ml Add missing lex_attr_getters (resolves #5806) 2020-07-25 12:55:18 +02:00
mr Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
nb Refactor scoring methods to use registered functions (#8766) 2021-08-10 15:13:39 +02:00
ne Remove unicode declarations and update language data 2020-09-04 13:19:16 +02:00
nl Refactor scoring methods to use registered functions (#8766) 2021-08-10 15:13:39 +02:00
pl Refactor scoring methods to use registered functions (#8766) 2021-08-10 15:13:39 +02:00
pt Tidy up and auto-format 2021-01-15 11:57:36 +11:00
ro Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-rc3 2021-01-14 11:49:58 +01:00
ru Refactor scoring methods to use registered functions (#8766) 2021-08-10 15:13:39 +02:00
sa Tidy up and auto-format 2020-09-29 21:39:28 +02:00
si Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
sk Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
sl Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
sq Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
sr Remove default initialize lookups 2020-10-01 21:54:33 +02:00
sv Refactor scoring methods to use registered functions (#8766) 2021-08-10 15:13:39 +02:00
ta Merge branch 'develop' into master-tmp 2020-10-15 09:06:03 +02:00
te Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
th Update custom tokenizer APIs and pickling (#8972) 2021-08-19 14:37:47 +02:00
ti Update Tigrinya ትግርኛ language support (#8900) 2021-08-10 13:55:08 +02:00
tl Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
tn Tidy up and auto-format 2021-02-13 12:55:56 +11:00
tr Tidy up and auto-format 2021-01-05 13:41:53 +11:00
tt Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
uk Refactor scoring methods to use registered functions (#8766) 2021-08-10 15:13:39 +02:00
ur Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
vi Update custom tokenizer APIs and pickling (#8972) 2021-08-19 14:37:47 +02:00
xx Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
yo Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
zh Update custom tokenizer APIs and pickling (#8972) 2021-08-19 14:37:47 +02:00
__init__.py Remove imports in /lang/__init__.py 2017-05-08 23:58:07 +02:00
char_classes.py Add all symbols in Unicode Currency Symbols block (#8212) 2021-05-31 18:03:40 +10:00
lex_attrs.py Use tokenizer URL_MATCH pattern in LIKE_URL (#8765) 2021-07-27 12:07:01 +02:00
norm_exceptions.py Tidy up and auto-format 2020-02-18 15:38:18 +01:00
punctuation.py Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
tokenizer_exceptions.py Tidy up with flake8: imports, comparisons, etc. 2021-06-28 12:08:15 +02:00