diff --git a/.github/contributors/mahnerak.md b/.github/contributors/mahnerak.md deleted file mode 100644 index cc7739681..000000000 --- a/.github/contributors/mahnerak.md +++ /dev/null @@ -1,106 +0,0 @@ -# spaCy contributor agreement - -This spaCy Contributor Agreement (**"SCA"**) is based on the -[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). -The SCA applies to any contribution that you make to any product or project -managed by us (the **"project"**), and sets out the intellectual property rights -you grant to us in the contributed materials. The term **"us"** shall mean -[ExplosionAI GmbH](https://explosion.ai/legal). The term -**"you"** shall mean the person or entity identified below. - -If you agree to be bound by these terms, fill in the information requested -below and include the filled-in version with your first pull request, under the -folder [`.github/contributors/`](/.github/contributors/). The name of the file -should be your GitHub username, with the extension `.md`. For example, the user -example_user would create the file `.github/contributors/example_user.md`. - -Read this agreement carefully before signing. These terms and conditions -constitute a binding legal agreement. - -## Contributor Agreement - -1. The term "contribution" or "contributed materials" means any source code, -object code, patch, tool, sample, graphic, specification, manual, -documentation, or any other material posted or submitted by you to the project. - -2. With respect to any worldwide copyrights, or copyright applications and -registrations, in your contribution: - - * you hereby assign to us joint ownership, and to the extent that such - assignment is or becomes invalid, ineffective or unenforceable, you hereby - grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, - royalty-free, unrestricted license to exercise all rights under those - copyrights. This includes, at our option, the right to sublicense these same - rights to third parties through multiple levels of sublicensees or other - licensing arrangements; - - * you agree that each of us can do all things in relation to your - contribution as if each of us were the sole owners, and if one of us makes - a derivative work of your contribution, the one who makes the derivative - work (or has it made will be the sole owner of that derivative work; - - * you agree that you will not assert any moral rights in your contribution - against us, our licensees or transferees; - - * you agree that we may register a copyright in your contribution and - exercise all ownership rights associated with it; and - - * you agree that neither of us has any duty to consult with, obtain the - consent of, pay or render an accounting to the other for any use or - distribution of your contribution. - -3. With respect to any patents you own, or that you can license without payment -to any third party, you hereby grant to us a perpetual, irrevocable, -non-exclusive, worldwide, no-charge, royalty-free license to: - - * make, have made, use, sell, offer to sell, import, and otherwise transfer - your contribution in whole or in part, alone or in combination with or - included in any product, work or materials arising out of the project to - which your contribution was submitted, and - - * at our option, to sublicense these same rights to third parties through - multiple levels of sublicensees or other licensing arrangements. - -4. 
Except as set out above, you keep all right, title, and interest in your -contribution. The rights that you grant to us under these terms are effective -on the date you first submitted a contribution to us, even if your submission -took place before the date you sign these terms. - -5. You covenant, represent, warrant and agree that: - - * Each contribution that you submit is and shall be an original work of - authorship and you can legally grant the rights set out in this SCA; - - * to the best of your knowledge, each contribution will not violate any - third party's copyrights, trademarks, patents, or other intellectual - property rights; and - - * each contribution shall be in compliance with U.S. export control laws and - other applicable export and import laws. You agree to notify us if you - become aware of any circumstance which would make any of the foregoing - representations inaccurate in any respect. We may publicly disclose your - participation in the project, including the fact that you have signed the SCA. - -6. This SCA is governed by the laws of the State of California and applicable -U.S. Federal law. Any choice of law rules will not apply. - -7. Please place an “x” on one of the applicable statement below. Please do NOT -mark both statements: - - * [x] I am signing on behalf of myself as an individual and no other person - or entity, including my employer, has or will have rights with respect to my - contributions. - - * [ ] I am signing on behalf of my employer or a legal entity and I have the - actual authority to contractually bind that entity. - -## Contributor Details - -| Field | Entry | -|------------------------------- | -------------------- | -| Name | Karen Hambardzumyan | -| Company name (if applicable) | YerevaNN | -| Title or role (if applicable) | Researcher | -| Date | 2020-06-19 | -| GitHub username | mahnerak | -| Website (optional) | https://mahnerak.com/| diff --git a/.github/contributors/myavrum.md b/.github/contributors/myavrum.md deleted file mode 100644 index dc8f1bb84..000000000 --- a/.github/contributors/myavrum.md +++ /dev/null @@ -1,106 +0,0 @@ -# spaCy contributor agreement - -This spaCy Contributor Agreement (**"SCA"**) is based on the -[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). -The SCA applies to any contribution that you make to any product or project -managed by us (the **"project"**), and sets out the intellectual property rights -you grant to us in the contributed materials. The term **"us"** shall mean -[ExplosionAI GmbH](https://explosion.ai/legal). The term -**"you"** shall mean the person or entity identified below. - -If you agree to be bound by these terms, fill in the information requested -below and include the filled-in version with your first pull request, under the -folder [`.github/contributors/`](/.github/contributors/). The name of the file -should be your GitHub username, with the extension `.md`. For example, the user -example_user would create the file `.github/contributors/example_user.md`. - -Read this agreement carefully before signing. These terms and conditions -constitute a binding legal agreement. - -## Contributor Agreement - -1. The term "contribution" or "contributed materials" means any source code, -object code, patch, tool, sample, graphic, specification, manual, -documentation, or any other material posted or submitted by you to the project. - -2. 
With respect to any worldwide copyrights, or copyright applications and -registrations, in your contribution: - - * you hereby assign to us joint ownership, and to the extent that such - assignment is or becomes invalid, ineffective or unenforceable, you hereby - grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, - royalty-free, unrestricted license to exercise all rights under those - copyrights. This includes, at our option, the right to sublicense these same - rights to third parties through multiple levels of sublicensees or other - licensing arrangements; - - * you agree that each of us can do all things in relation to your - contribution as if each of us were the sole owners, and if one of us makes - a derivative work of your contribution, the one who makes the derivative - work (or has it made will be the sole owner of that derivative work; - - * you agree that you will not assert any moral rights in your contribution - against us, our licensees or transferees; - - * you agree that we may register a copyright in your contribution and - exercise all ownership rights associated with it; and - - * you agree that neither of us has any duty to consult with, obtain the - consent of, pay or render an accounting to the other for any use or - distribution of your contribution. - -3. With respect to any patents you own, or that you can license without payment -to any third party, you hereby grant to us a perpetual, irrevocable, -non-exclusive, worldwide, no-charge, royalty-free license to: - - * make, have made, use, sell, offer to sell, import, and otherwise transfer - your contribution in whole or in part, alone or in combination with or - included in any product, work or materials arising out of the project to - which your contribution was submitted, and - - * at our option, to sublicense these same rights to third parties through - multiple levels of sublicensees or other licensing arrangements. - -4. Except as set out above, you keep all right, title, and interest in your -contribution. The rights that you grant to us under these terms are effective -on the date you first submitted a contribution to us, even if your submission -took place before the date you sign these terms. - -5. You covenant, represent, warrant and agree that: - - * Each contribution that you submit is and shall be an original work of - authorship and you can legally grant the rights set out in this SCA; - - * to the best of your knowledge, each contribution will not violate any - third party's copyrights, trademarks, patents, or other intellectual - property rights; and - - * each contribution shall be in compliance with U.S. export control laws and - other applicable export and import laws. You agree to notify us if you - become aware of any circumstance which would make any of the foregoing - representations inaccurate in any respect. We may publicly disclose your - participation in the project, including the fact that you have signed the SCA. - -6. This SCA is governed by the laws of the State of California and applicable -U.S. Federal law. Any choice of law rules will not apply. - -7. Please place an “x” on one of the applicable statement below. Please do NOT -mark both statements: - - * [x] I am signing on behalf of myself as an individual and no other person - or entity, including my employer, has or will have rights with respect to my - contributions. - - * [ ] I am signing on behalf of my employer or a legal entity and I have the - actual authority to contractually bind that entity. 
- -## Contributor Details - -| Field | Entry | -|------------------------------- | -------------------- | -| Name | Marat M. Yavrumyan | -| Company name (if applicable) | YSU, UD_Armenian Project | -| Title or role (if applicable) | Dr., Principal Investigator | -| Date | 2020-06-19 | -| GitHub username | myavrum | -| Website (optional) | http://armtreebank.yerevann.com/ | diff --git a/.github/contributors/rameshhpathak.md b/.github/contributors/rameshhpathak.md deleted file mode 100644 index 30a543307..000000000 --- a/.github/contributors/rameshhpathak.md +++ /dev/null @@ -1,106 +0,0 @@ -# spaCy contributor agreement - -This spaCy Contributor Agreement (**"SCA"**) is based on the -[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). -The SCA applies to any contribution that you make to any product or project -managed by us (the **"project"**), and sets out the intellectual property rights -you grant to us in the contributed materials. The term **"us"** shall mean -[ExplosionAI GmbH](https://explosion.ai/legal). The term -**"you"** shall mean the person or entity identified below. - -If you agree to be bound by these terms, fill in the information requested -below and include the filled-in version with your first pull request, under the -folder [`.github/contributors/`](/.github/contributors/). The name of the file -should be your GitHub username, with the extension `.md`. For example, the user -example_user would create the file `.github/contributors/example_user.md`. - -Read this agreement carefully before signing. These terms and conditions -constitute a binding legal agreement. - -## Contributor Agreement - -1. The term "contribution" or "contributed materials" means any source code, -object code, patch, tool, sample, graphic, specification, manual, -documentation, or any other material posted or submitted by you to the project. - -2. With respect to any worldwide copyrights, or copyright applications and -registrations, in your contribution: - - * you hereby assign to us joint ownership, and to the extent that such - assignment is or becomes invalid, ineffective or unenforceable, you hereby - grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, - royalty-free, unrestricted license to exercise all rights under those - copyrights. This includes, at our option, the right to sublicense these same - rights to third parties through multiple levels of sublicensees or other - licensing arrangements; - - * you agree that each of us can do all things in relation to your - contribution as if each of us were the sole owners, and if one of us makes - a derivative work of your contribution, the one who makes the derivative - work (or has it made will be the sole owner of that derivative work; - - * you agree that you will not assert any moral rights in your contribution - against us, our licensees or transferees; - - * you agree that we may register a copyright in your contribution and - exercise all ownership rights associated with it; and - - * you agree that neither of us has any duty to consult with, obtain the - consent of, pay or render an accounting to the other for any use or - distribution of your contribution. - -3. 
With respect to any patents you own, or that you can license without payment -to any third party, you hereby grant to us a perpetual, irrevocable, -non-exclusive, worldwide, no-charge, royalty-free license to: - - * make, have made, use, sell, offer to sell, import, and otherwise transfer - your contribution in whole or in part, alone or in combination with or - included in any product, work or materials arising out of the project to - which your contribution was submitted, and - - * at our option, to sublicense these same rights to third parties through - multiple levels of sublicensees or other licensing arrangements. - -4. Except as set out above, you keep all right, title, and interest in your -contribution. The rights that you grant to us under these terms are effective -on the date you first submitted a contribution to us, even if your submission -took place before the date you sign these terms. - -5. You covenant, represent, warrant and agree that: - - * Each contribution that you submit is and shall be an original work of - authorship and you can legally grant the rights set out in this SCA; - - * to the best of your knowledge, each contribution will not violate any - third party's copyrights, trademarks, patents, or other intellectual - property rights; and - - * each contribution shall be in compliance with U.S. export control laws and - other applicable export and import laws. You agree to notify us if you - become aware of any circumstance which would make any of the foregoing - representations inaccurate in any respect. We may publicly disclose your - participation in the project, including the fact that you have signed the SCA. - -6. This SCA is governed by the laws of the State of California and applicable -U.S. Federal law. Any choice of law rules will not apply. - -7. Please place an “x” on one of the applicable statement below. Please do NOT -mark both statements: - - * [x] I am signing on behalf of myself as an individual and no other person - or entity, including my employer, has or will have rights with respect to my - contributions. - - * [ ] I am signing on behalf of my employer or a legal entity and I have the - actual authority to contractually bind that entity. - -## Contributor Details - -| Field | Entry | -|------------------------------- | -------------------- | -| Name | Ramesh Pathak | -| Company name (if applicable) | Diyo AI | -| Title or role (if applicable) | AI Engineer | -| Date | June 21, 2020 | -| GitHub username | rameshhpathak | -| Website (optional) |rameshhpathak.github.io| | diff --git a/.github/contributors/richardliaw.md b/.github/contributors/richardliaw.md deleted file mode 100644 index 2af4ce840..000000000 --- a/.github/contributors/richardliaw.md +++ /dev/null @@ -1,106 +0,0 @@ -# spaCy contributor agreement - -This spaCy Contributor Agreement (**"SCA"**) is based on the -[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). -The SCA applies to any contribution that you make to any product or project -managed by us (the **"project"**), and sets out the intellectual property rights -you grant to us in the contributed materials. The term **"us"** shall mean -[ExplosionAI GmbH](https://explosion.ai/legal). The term -**"you"** shall mean the person or entity identified below. - -If you agree to be bound by these terms, fill in the information requested -below and include the filled-in version with your first pull request, under the -folder [`.github/contributors/`](/.github/contributors/). 
The name of the file -should be your GitHub username, with the extension `.md`. For example, the user -example_user would create the file `.github/contributors/example_user.md`. - -Read this agreement carefully before signing. These terms and conditions -constitute a binding legal agreement. - -## Contributor Agreement - -1. The term "contribution" or "contributed materials" means any source code, -object code, patch, tool, sample, graphic, specification, manual, -documentation, or any other material posted or submitted by you to the project. - -2. With respect to any worldwide copyrights, or copyright applications and -registrations, in your contribution: - - * you hereby assign to us joint ownership, and to the extent that such - assignment is or becomes invalid, ineffective or unenforceable, you hereby - grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, - royalty-free, unrestricted license to exercise all rights under those - copyrights. This includes, at our option, the right to sublicense these same - rights to third parties through multiple levels of sublicensees or other - licensing arrangements; - - * you agree that each of us can do all things in relation to your - contribution as if each of us were the sole owners, and if one of us makes - a derivative work of your contribution, the one who makes the derivative - work (or has it made will be the sole owner of that derivative work; - - * you agree that you will not assert any moral rights in your contribution - against us, our licensees or transferees; - - * you agree that we may register a copyright in your contribution and - exercise all ownership rights associated with it; and - - * you agree that neither of us has any duty to consult with, obtain the - consent of, pay or render an accounting to the other for any use or - distribution of your contribution. - -3. With respect to any patents you own, or that you can license without payment -to any third party, you hereby grant to us a perpetual, irrevocable, -non-exclusive, worldwide, no-charge, royalty-free license to: - - * make, have made, use, sell, offer to sell, import, and otherwise transfer - your contribution in whole or in part, alone or in combination with or - included in any product, work or materials arising out of the project to - which your contribution was submitted, and - - * at our option, to sublicense these same rights to third parties through - multiple levels of sublicensees or other licensing arrangements. - -4. Except as set out above, you keep all right, title, and interest in your -contribution. The rights that you grant to us under these terms are effective -on the date you first submitted a contribution to us, even if your submission -took place before the date you sign these terms. - -5. You covenant, represent, warrant and agree that: - - * Each contribution that you submit is and shall be an original work of - authorship and you can legally grant the rights set out in this SCA; - - * to the best of your knowledge, each contribution will not violate any - third party's copyrights, trademarks, patents, or other intellectual - property rights; and - - * each contribution shall be in compliance with U.S. export control laws and - other applicable export and import laws. You agree to notify us if you - become aware of any circumstance which would make any of the foregoing - representations inaccurate in any respect. We may publicly disclose your - participation in the project, including the fact that you have signed the SCA. 
- -6. This SCA is governed by the laws of the State of California and applicable -U.S. Federal law. Any choice of law rules will not apply. - -7. Please place an “x” on one of the applicable statement below. Please do NOT -mark both statements: - - * [x] I am signing on behalf of myself as an individual and no other person - or entity, including my employer, has or will have rights with respect to my - contributions. - - * [ ] I am signing on behalf of my employer or a legal entity and I have the - actual authority to contractually bind that entity. - -## Contributor Details - -| Field | Entry | -|------------------------------- | -------------------- | -| Name | Richard Liaw | -| Company name (if applicable) | | -| Title or role (if applicable) | | -| Date | 06/22/2020 | -| GitHub username | richardliaw | -| Website (optional) | | \ No newline at end of file diff --git a/spacy/lang/hy/examples.py b/spacy/lang/hy/examples.py index 8a00fd243..323f77b1c 100644 --- a/spacy/lang/hy/examples.py +++ b/spacy/lang/hy/examples.py @@ -11,6 +11,6 @@ Example sentences to test spaCy and its language models. sentences = [ "Լոնդոնը Միացյալ Թագավորության մեծ քաղաք է։", "Ո՞վ է Ֆրանսիայի նախագահը։", - "Ո՞րն է Միացյալ Նահանգների մայրաքաղաքը։", + "Որն է Միացյալ Նահանգների մայրաքաղաքը։", "Ե՞րբ է ծնվել Բարաք Օբաման։", ] diff --git a/spacy/lang/hy/lex_attrs.py b/spacy/lang/hy/lex_attrs.py index dea3c0e97..910625fb8 100644 --- a/spacy/lang/hy/lex_attrs.py +++ b/spacy/lang/hy/lex_attrs.py @@ -5,8 +5,8 @@ from ...attrs import LIKE_NUM _num_words = [ - "զրո", - "մեկ", + "զրօ", + "մէկ", "երկու", "երեք", "չորս", @@ -18,21 +18,20 @@ _num_words = [ "տասը", "տասնմեկ", "տասներկու", - "տասներեք", - "տասնչորս", - "տասնհինգ", - "տասնվեց", - "տասնյոթ", - "տասնութ", - "տասնինը", - "քսան", - "երեսուն", + "տասն­երեք", + "տասն­չորս", + "տասն­հինգ", + "տասն­վեց", + "տասն­յոթ", + "տասն­ութ", + "տասն­ինը", + "քսան" "երեսուն", "քառասուն", "հիսուն", - "վաթսուն", + "վաթցսուն", "յոթանասուն", "ութսուն", - "իննսուն", + "ինիսուն", "հարյուր", "հազար", "միլիոն", diff --git a/spacy/lang/ja/__init__.py b/spacy/lang/ja/__init__.py index fb8b9d7fe..a7ad0846e 100644 --- a/spacy/lang/ja/__init__.py +++ b/spacy/lang/ja/__init__.py @@ -20,7 +20,12 @@ from ... import util # Hold the attributes we need with convenient names -DetailedToken = namedtuple("DetailedToken", ["surface", "tag", "inf", "lemma", "reading", "sub_tokens"]) +DetailedToken = namedtuple("DetailedToken", ["surface", "pos", "lemma"]) + +# Handling for multiple spaces in a row is somewhat awkward, this simplifies +# the flow by creating a dummy with the same interface. +DummyNode = namedtuple("DummyNode", ["surface", "pos", "lemma"]) +DummySpace = DummyNode(" ", " ", " ") def try_sudachi_import(split_mode="A"): @@ -48,7 +53,7 @@ def try_sudachi_import(split_mode="A"): ) -def resolve_pos(orth, tag, next_tag): +def resolve_pos(orth, pos, next_pos): """If necessary, add a field to the POS tag for UD mapping. Under Universal Dependencies, sometimes the same Unidic POS tag can be mapped differently depending on the literal token or its context @@ -59,77 +64,124 @@ def resolve_pos(orth, tag, next_tag): # Some tokens have their UD tag decided based on the POS of the following # token. 
- # apply orth based mapping - if tag in TAG_ORTH_MAP: - orth_map = TAG_ORTH_MAP[tag] + # orth based rules + if pos[0] in TAG_ORTH_MAP: + orth_map = TAG_ORTH_MAP[pos[0]] if orth in orth_map: - return orth_map[orth], None # current_pos, next_pos + return orth_map[orth], None - # apply tag bi-gram mapping - if next_tag: - tag_bigram = tag, next_tag + # tag bi-gram mapping + if next_pos: + tag_bigram = pos[0], next_pos[0] if tag_bigram in TAG_BIGRAM_MAP: - current_pos, next_pos = TAG_BIGRAM_MAP[tag_bigram] - if current_pos is None: # apply tag uni-gram mapping for current_pos - return TAG_MAP[tag][POS], next_pos # only next_pos is identified by tag bi-gram mapping + bipos = TAG_BIGRAM_MAP[tag_bigram] + if bipos[0] is None: + return TAG_MAP[pos[0]][POS], bipos[1] else: - return current_pos, next_pos + return bipos - # apply tag uni-gram mapping - return TAG_MAP[tag][POS], None + return TAG_MAP[pos[0]][POS], None -def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"): - # Compare the content of tokens and text, first +# Use a mapping of paired punctuation to avoid splitting quoted sentences. +pairpunct = {'「':'」', '『': '』', '【': '】'} + + +def separate_sentences(doc): + """Given a doc, mark tokens that start sentences based on Unidic tags. + """ + + stack = [] # save paired punctuation + + for i, token in enumerate(doc[:-2]): + # Set all tokens after the first to false by default. This is necessary + # for the doc code to be aware we've done sentencization, see + # `is_sentenced`. + token.sent_start = (i == 0) + if token.tag_: + if token.tag_ == "補助記号-括弧開": + ts = str(token) + if ts in pairpunct: + stack.append(pairpunct[ts]) + elif stack and ts == stack[-1]: + stack.pop() + + if token.tag_ == "補助記号-句点": + next_token = doc[i+1] + if next_token.tag_ != token.tag_ and not stack: + next_token.sent_start = True + + +def get_dtokens(tokenizer, text): + tokens = tokenizer.tokenize(text) + words = [] + for ti, token in enumerate(tokens): + tag = '-'.join([xx for xx in token.part_of_speech()[:4] if xx != '*']) + inf = '-'.join([xx for xx in token.part_of_speech()[4:] if xx != '*']) + dtoken = DetailedToken( + token.surface(), + (tag, inf), + token.dictionary_form()) + if ti > 0 and words[-1].pos[0] == '空白' and tag == '空白': + # don't add multiple space tokens in a row + continue + words.append(dtoken) + + # remove empty tokens. These can be produced with characters like … that + # Sudachi normalizes internally. 
+ words = [ww for ww in words if len(ww.surface) > 0] + return words + + +def get_words_lemmas_tags_spaces(dtokens, text, gap_tag=("空白", "")): words = [x.surface for x in dtokens] if "".join("".join(words).split()) != "".join(text.split()): raise ValueError(Errors.E194.format(text=text, words=words)) - - text_dtokens = [] + text_words = [] + text_lemmas = [] + text_tags = [] text_spaces = [] text_pos = 0 # handle empty and whitespace-only texts if len(words) == 0: - return text_dtokens, text_spaces + return text_words, text_lemmas, text_tags, text_spaces elif len([word for word in words if not word.isspace()]) == 0: assert text.isspace() - text_dtokens = [DetailedToken(text, gap_tag, '', text, None, None)] + text_words = [text] + text_lemmas = [text] + text_tags = [gap_tag] text_spaces = [False] - return text_dtokens, text_spaces - - # align words and dtokens by referring text, and insert gap tokens for the space char spans - for word, dtoken in zip(words, dtokens): - # skip all space tokens - if word.isspace(): - continue + return text_words, text_lemmas, text_tags, text_spaces + # normalize words to remove all whitespace tokens + norm_words, norm_dtokens = zip(*[(word, dtokens) for word, dtokens in zip(words, dtokens) if not word.isspace()]) + # align words with text + for word, dtoken in zip(norm_words, norm_dtokens): try: word_start = text[text_pos:].index(word) except ValueError: raise ValueError(Errors.E194.format(text=text, words=words)) - - # space token if word_start > 0: w = text[text_pos:text_pos + word_start] - text_dtokens.append(DetailedToken(w, gap_tag, '', w, None, None)) + text_words.append(w) + text_lemmas.append(w) + text_tags.append(gap_tag) text_spaces.append(False) text_pos += word_start - - # content word - text_dtokens.append(dtoken) + text_words.append(word) + text_lemmas.append(dtoken.lemma) + text_tags.append(dtoken.pos) text_spaces.append(False) text_pos += len(word) - # poll a space char after the word if text_pos < len(text) and text[text_pos] == " ": text_spaces[-1] = True text_pos += 1 - - # trailing space token if text_pos < len(text): w = text[text_pos:] - text_dtokens.append(DetailedToken(w, gap_tag, '', w, None, None)) + text_words.append(w) + text_lemmas.append(w) + text_tags.append(gap_tag) text_spaces.append(False) - - return text_dtokens, text_spaces + return text_words, text_lemmas, text_tags, text_spaces class JapaneseTokenizer(DummyTokenizer): @@ -139,78 +191,29 @@ class JapaneseTokenizer(DummyTokenizer): self.tokenizer = try_sudachi_import(self.split_mode) def __call__(self, text): - # convert sudachipy.morpheme.Morpheme to DetailedToken and merge continuous spaces - sudachipy_tokens = self.tokenizer.tokenize(text) - dtokens = self._get_dtokens(sudachipy_tokens) - dtokens, spaces = get_dtokens_and_spaces(dtokens, text) + dtokens = get_dtokens(self.tokenizer, text) - # create Doc with tag bi-gram based part-of-speech identification rules - words, tags, inflections, lemmas, readings, sub_tokens_list = zip(*dtokens) if dtokens else [[]] * 6 - sub_tokens_list = list(sub_tokens_list) + words, lemmas, unidic_tags, spaces = get_words_lemmas_tags_spaces(dtokens, text) doc = Doc(self.vocab, words=words, spaces=spaces) - next_pos = None # for bi-gram rules - for idx, (token, dtoken) in enumerate(zip(doc, dtokens)): - token.tag_ = dtoken.tag - if next_pos: # already identified in previous iteration + next_pos = None + for idx, (token, lemma, unidic_tag) in enumerate(zip(doc, lemmas, unidic_tags)): + token.tag_ = unidic_tag[0] + if next_pos: token.pos = 
next_pos next_pos = None else: token.pos, next_pos = resolve_pos( token.orth_, - dtoken.tag, - tags[idx + 1] if idx + 1 < len(tags) else None + unidic_tag, + unidic_tags[idx + 1] if idx + 1 < len(unidic_tags) else None ) - # if there's no lemma info (it's an unk) just use the surface - token.lemma_ = dtoken.lemma if dtoken.lemma else dtoken.surface - doc.user_data["inflections"] = inflections - doc.user_data["reading_forms"] = readings - doc.user_data["sub_tokens"] = sub_tokens_list + # if there's no lemma info (it's an unk) just use the surface + token.lemma_ = lemma + doc.user_data["unidic_tags"] = unidic_tags return doc - def _get_dtokens(self, sudachipy_tokens, need_sub_tokens=True): - sub_tokens_list = self._get_sub_tokens(sudachipy_tokens) if need_sub_tokens else None - dtokens = [ - DetailedToken( - token.surface(), # orth - '-'.join([xx for xx in token.part_of_speech()[:4] if xx != '*']), # tag - ','.join([xx for xx in token.part_of_speech()[4:] if xx != '*']), # inf - token.dictionary_form(), # lemma - token.reading_form(), # user_data['reading_forms'] - sub_tokens_list[idx] if sub_tokens_list else None, # user_data['sub_tokens'] - ) for idx, token in enumerate(sudachipy_tokens) if len(token.surface()) > 0 - # remove empty tokens which can be produced with characters like … that - ] - # Sudachi normalizes internally and outputs each space char as a token. - # This is the preparation for get_dtokens_and_spaces() to merge the continuous space tokens - return [ - t for idx, t in enumerate(dtokens) if - idx == 0 or - not t.surface.isspace() or t.tag != '空白' or - not dtokens[idx - 1].surface.isspace() or dtokens[idx - 1].tag != '空白' - ] - - def _get_sub_tokens(self, sudachipy_tokens): - if self.split_mode is None or self.split_mode == "A": # do nothing for default split mode - return None - - sub_tokens_list = [] # list of (list of list of DetailedToken | None) - for token in sudachipy_tokens: - sub_a = token.split(self.tokenizer.SplitMode.A) - if len(sub_a) == 1: # no sub tokens - sub_tokens_list.append(None) - elif self.split_mode == "B": - sub_tokens_list.append([self._get_dtokens(sub_a, False)]) - else: # "C" - sub_b = token.split(self.tokenizer.SplitMode.B) - if len(sub_a) == len(sub_b): - dtokens = self._get_dtokens(sub_a, False) - sub_tokens_list.append([dtokens, dtokens]) - else: - sub_tokens_list.append([self._get_dtokens(sub_a, False), self._get_dtokens(sub_b, False)]) - return sub_tokens_list - def _get_config(self): config = OrderedDict( ( diff --git a/spacy/lang/ja/bunsetu.py b/spacy/lang/ja/bunsetu.py new file mode 100644 index 000000000..7c3eee336 --- /dev/null +++ b/spacy/lang/ja/bunsetu.py @@ -0,0 +1,144 @@ +# coding: utf8 +from __future__ import unicode_literals + +from .stop_words import STOP_WORDS + + +POS_PHRASE_MAP = { + "NOUN": "NP", + "NUM": "NP", + "PRON": "NP", + "PROPN": "NP", + + "VERB": "VP", + + "ADJ": "ADJP", + + "ADV": "ADVP", + + "CCONJ": "CCONJP", +} + + +# return value: [(bunsetu_tokens, phrase_type={'NP', 'VP', 'ADJP', 'ADVP'}, phrase_tokens)] +def yield_bunsetu(doc, debug=False): + bunsetu = [] + bunsetu_may_end = False + phrase_type = None + phrase = None + prev = None + prev_tag = None + prev_dep = None + prev_head = None + for t in doc: + pos = t.pos_ + pos_type = POS_PHRASE_MAP.get(pos, None) + tag = t.tag_ + dep = t.dep_ + head = t.head.i + if debug: + print(t.i, t.orth_, pos, pos_type, dep, head, bunsetu_may_end, phrase_type, phrase, bunsetu) + + # DET is always an individual bunsetu + if pos == "DET": + if bunsetu: + yield bunsetu, 
phrase_type, phrase + yield [t], None, None + bunsetu = [] + bunsetu_may_end = False + phrase_type = None + phrase = None + + # PRON or Open PUNCT always splits bunsetu + elif tag == "補助記号-括弧開": + if bunsetu: + yield bunsetu, phrase_type, phrase + bunsetu = [t] + bunsetu_may_end = True + phrase_type = None + phrase = None + + # bunsetu head not appeared + elif phrase_type is None: + if bunsetu and prev_tag == "補助記号-読点": + yield bunsetu, phrase_type, phrase + bunsetu = [] + bunsetu_may_end = False + phrase_type = None + phrase = None + bunsetu.append(t) + if pos_type: # begin phrase + phrase = [t] + phrase_type = pos_type + if pos_type in {"ADVP", "CCONJP"}: + bunsetu_may_end = True + + # entering new bunsetu + elif pos_type and ( + pos_type != phrase_type or # different phrase type arises + bunsetu_may_end # same phrase type but bunsetu already ended + ): + # exceptional case: NOUN to VERB + if phrase_type == "NP" and pos_type == "VP" and prev_dep == 'compound' and prev_head == t.i: + bunsetu.append(t) + phrase_type = "VP" + phrase.append(t) + # exceptional case: VERB to NOUN + elif phrase_type == "VP" and pos_type == "NP" and ( + prev_dep == 'compound' and prev_head == t.i or + dep == 'compound' and prev == head or + prev_dep == 'nmod' and prev_head == t.i + ): + bunsetu.append(t) + phrase_type = "NP" + phrase.append(t) + else: + yield bunsetu, phrase_type, phrase + bunsetu = [t] + bunsetu_may_end = False + phrase_type = pos_type + phrase = [t] + + # NOUN bunsetu + elif phrase_type == "NP": + bunsetu.append(t) + if not bunsetu_may_end and (( + (pos_type == "NP" or pos == "SYM") and (prev_head == t.i or prev_head == head) and prev_dep in {'compound', 'nummod'} + ) or ( + pos == "PART" and (prev == head or prev_head == head) and dep == 'mark' + )): + phrase.append(t) + else: + bunsetu_may_end = True + + # VERB bunsetu + elif phrase_type == "VP": + bunsetu.append(t) + if not bunsetu_may_end and pos == "VERB" and prev_head == t.i and prev_dep == 'compound': + phrase.append(t) + else: + bunsetu_may_end = True + + # ADJ bunsetu + elif phrase_type == "ADJP" and tag != '連体詞': + bunsetu.append(t) + if not bunsetu_may_end and (( + pos == "NOUN" and (prev_head == t.i or prev_head == head) and prev_dep in {'amod', 'compound'} + ) or ( + pos == "PART" and (prev == head or prev_head == head) and dep == 'mark' + )): + phrase.append(t) + else: + bunsetu_may_end = True + + # other bunsetu + else: + bunsetu.append(t) + + prev = t.i + prev_tag = t.tag_ + prev_dep = t.dep_ + prev_head = head + + if bunsetu: + yield bunsetu, phrase_type, phrase diff --git a/spacy/lang/ne/__init__.py b/spacy/lang/ne/__init__.py deleted file mode 100644 index 21556277d..000000000 --- a/spacy/lang/ne/__init__.py +++ /dev/null @@ -1,23 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from .stop_words import STOP_WORDS -from .lex_attrs import LEX_ATTRS - -from ...language import Language -from ...attrs import LANG - - -class NepaliDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "ne" # Nepali language ISO code - stop_words = STOP_WORDS - - -class Nepali(Language): - lang = "ne" - Defaults = NepaliDefaults - - -__all__ = ["Nepali"] diff --git a/spacy/lang/ne/examples.py b/spacy/lang/ne/examples.py deleted file mode 100644 index b3c4f9e73..000000000 --- a/spacy/lang/ne/examples.py +++ /dev/null @@ -1,22 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - - -""" -Example sentences to 
test spaCy and its language models. - ->>> from spacy.lang.ne.examples import sentences ->>> docs = nlp.pipe(sentences) -""" - - -sentences = [ - "एप्पलले अमेरिकी स्टार्टअप १ अर्ब डलरमा किन्ने सोच्दै छ", - "स्वायत्त कारहरूले बीमा दायित्व निर्माताहरु तिर बदल्छन्", - "स्यान फ्रांसिस्कोले फुटपाथ वितरण रोबोटहरु प्रतिबंध गर्ने विचार गर्दै छ", - "लन्डन यूनाइटेड किंगडमको एक ठूलो शहर हो।", - "तिमी कहाँ छौ?", - "फ्रान्स को राष्ट्रपति को हो?", - "संयुक्त राज्यको राजधानी के हो?", - "बराक ओबामा कहिले कहिले जन्मेका हुन्?", -] diff --git a/spacy/lang/ne/lex_attrs.py b/spacy/lang/ne/lex_attrs.py deleted file mode 100644 index 652307577..000000000 --- a/spacy/lang/ne/lex_attrs.py +++ /dev/null @@ -1,98 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ..norm_exceptions import BASE_NORMS -from ...attrs import NORM, LIKE_NUM - - -# fmt: off -_stem_suffixes = [ - ["ा", "ि", "ी", "ु", "ू", "ृ", "े", "ै", "ो", "ौ"], - ["ँ", "ं", "्", "ः"], - ["लाई", "ले", "बाट", "को", "मा", "हरू"], - ["हरूलाई", "हरूले", "हरूबाट", "हरूको", "हरूमा"], - ["इलो", "िलो", "नु", "ाउनु", "ई", "इन", "इन्", "इनन्"], - ["एँ", "इँन्", "इस्", "इनस्", "यो", "एन", "यौं", "एनौं", "ए", "एनन्"], - ["छु", "छौँ", "छस्", "छौ", "छ", "छन्", "छेस्", "छे", "छ्यौ", "छिन्", "हुन्छ"], - ["दै", "दिन", "दिँन", "दैनस्", "दैन", "दैनौँ", "दैनौं", "दैनन्"], - ["हुन्न", "न्न", "न्न्स्", "न्नौं", "न्नौ", "न्न्न्", "िई"], - ["अ", "ओ", "ऊ", "अरी", "साथ", "वित्तिकै", "पूर्वक"], - ["याइ", "ाइ", "बार", "वार", "चाँहि"], - ["ने", "ेको", "ेकी", "ेका", "ेर", "दै", "तै", "िकन", "उ", "न", "नन्"] -] -# fmt: on - -# reference 1: https://en.wikipedia.org/wiki/Numbers_in_Nepali_language -# reference 2: https://www.imnepal.com/nepali-numbers/ -_num_words = [ - "शुन्य", - "एक", - "दुई", - "तीन", - "चार", - "पाँच", - "छ", - "सात", - "आठ", - "नौ", - "दश", - "एघार", - "बाह्र", - "तेह्र", - "चौध", - "पन्ध्र", - "सोह्र", - "सोह्र", - "सत्र", - "अठार", - "उन्नाइस", - "बीस", - "तीस", - "चालीस", - "पचास", - "साठी", - "सत्तरी", - "असी", - "नब्बे", - "सय", - "हजार", - "लाख", - "करोड", - "अर्ब", - "खर्ब", -] - - -def norm(string): - # normalise base exceptions, e.g. 
punctuation or currency symbols - if string in BASE_NORMS: - return BASE_NORMS[string] - # set stem word as norm, if available, adapted from: - # https://github.com/explosion/spaCy/blob/master/spacy/lang/hi/lex_attrs.py - # https://www.researchgate.net/publication/237261579_Structure_of_Nepali_Grammar - for suffix_group in reversed(_stem_suffixes): - length = len(suffix_group[0]) - if len(string) <= length: - break - for suffix in suffix_group: - if string.endswith(suffix): - return string[:-length] - return string - - -def like_num(text): - if text.startswith(("+", "-", "±", "~")): - text = text[1:] - text = text.replace(", ", "").replace(".", "") - if text.isdigit(): - return True - if text.count("/") == 1: - num, denom = text.split("/") - if num.isdigit() and denom.isdigit(): - return True - if text.lower() in _num_words: - return True - return False - - -LEX_ATTRS = {NORM: norm, LIKE_NUM: like_num} diff --git a/spacy/lang/ne/stop_words.py b/spacy/lang/ne/stop_words.py deleted file mode 100644 index f008697d0..000000000 --- a/spacy/lang/ne/stop_words.py +++ /dev/null @@ -1,498 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - - -# Source: https://github.com/sanjaalcorps/NepaliStopWords/blob/master/NepaliStopWords.txt - -STOP_WORDS = set( - """ -अक्सर -अगाडि -अगाडी -अघि -अझै -अठार -अथवा -अनि -अनुसार -अन्तर्गत -अन्य -अन्यत्र -अन्यथा -अब -अरु -अरुलाई -अरू -अर्को -अर्थात -अर्थात् -अलग -अलि -अवस्था -अहिले -आए -आएका -आएको -आज -आजको -आठ -आत्म -आदि -आदिलाई -आफनो -आफू -आफूलाई -आफै -आफैँ -आफ्नै -आफ्नो -आयो -उ -उक्त -उदाहरण -उनको -उनलाई -उनले -उनि -उनी -उनीहरुको -उन्नाइस -उप -उसको -उसलाई -उसले -उहालाई -ऊ -एउटा -एउटै -एक -एकदम -एघार -ओठ -औ -औं -कता -कति -कतै -कम -कमसेकम -कसरि -कसरी -कसै -कसैको -कसैलाई -कसैले -कसैसँग -कस्तो -कहाँबाट -कहिलेकाहीं -का -काम -कारण -कि -किन -किनभने -कुन -कुनै -कुन्नी -कुरा -कृपया -के -केहि -केही -को -कोहि -कोहिपनि -कोही -कोहीपनि -क्रमशः -गए -गएको -गएर -गयौ -गरि -गरी -गरे -गरेका -गरेको -गरेर -गरौं -गर्छ -गर्छन् -गर्छु -गर्दा -गर्दै -गर्न -गर्नु -गर्नुपर्छ -गर्ने -गैर -घर -चार -चाले -चाहनुहुन्छ -चाहन्छु -चाहिं -चाहिए -चाहिंले -चाहीं -चाहेको -चाहेर -चोटी -चौथो -चौध -छ -छन -छन् -छु -छू -छैन -छैनन् -छौ -छौं -जता -जताततै -जना -जनाको -जनालाई -जनाले -जब -जबकि -जबकी -जसको -जसबाट -जसमा -जसरी -जसलाई -जसले -जस्ता -जस्तै -जस्तो -जस्तोसुकै -जहाँ -जान -जाने -जाहिर -जुन -जुनै -जे -जो -जोपनि -जोपनी -झैं -ठाउँमा -ठीक -ठूलो -त -तता -तत्काल -तथा -तथापि -तथापी -तदनुसार -तपाइ -तपाई -तपाईको -तब -तर -तर्फ -तल -तसरी -तापनि -तापनी -तिन -तिनि -तिनिहरुलाई -तिनी -तिनीहरु -तिनीहरुको -तिनीहरू -तिनीहरूको -तिनै -तिमी -तिर -तिरको -ती -तीन -तुरन्त -तुरुन्त -तुरुन्तै -तेश्रो -तेस्कारण -तेस्रो -तेह्र -तैपनि -तैपनी -त्यत्तिकै -त्यत्तिकैमा -त्यस -त्यसकारण -त्यसको -त्यसले -त्यसैले -त्यसो -त्यस्तै -त्यस्तो -त्यहाँ -त्यहिँ -त्यही -त्यहीँ -त्यहीं -त्यो -त्सपछि -त्सैले -थप -थरि -थरी -थाहा -थिए -थिएँ -थिएन -थियो -दर्ता -दश -दिए -दिएको -दिन -दिनुभएको -दिनुहुन्छ -दुइ -दुइवटा -दुई -देखि -देखिन्छ -देखियो -देखे -देखेको -देखेर -दोश्री -दोश्रो -दोस्रो -द्वारा -धन्न -धेरै -धौ -न -नगर्नु -नगर्नू -नजिकै -नत्र -नत्रभने -नभई -नभएको -नभनेर -नयाँ -नि -निकै -निम्ति -निम्न -निम्नानुसार -निर्दिष्ट -नै -नौ -पक्का -पक्कै -पछाडि -पछाडी -पछि -पछिल्लो -पछी -पटक -पनि -पन्ध्र -पर्छ -पर्थ्यो -पर्दैन -पर्ने -पर्नेमा -पर्याप्त -पहिले -पहिलो -पहिल्यै -पाँच -पांच -पाचौँ -पाँचौं -पिच्छे -पूर्व -पो -प्रति -प्रतेक -प्रत्यक -प्राय -प्लस -फरक -फेरि -फेरी -बढी -बताए -बने -बरु -बाट -बारे -बाहिर -बाहेक -बाह्र -बिच -बिचमा -बिरुद्ध -बिशेष -बिस -बीच -बीचमा -बीस -भए -भएँ -भएका -भएकालाई -भएको -भएन -भएर -भन -भने -भनेको -भनेर -भन् -भन्छन् 
-भन्छु -भन्दा -भन्दै -भन्नुभयो -भन्ने -भन्या -भयेन -भयो -भर -भरि -भरी -भा -भित्र -भित्री -भीत्र -म -मध्य -मध्ये -मलाई -मा -मात्र -मात्रै -माथि -माथी -मुख्य -मुनि -मुन्तिर -मेरो -मैले -यति -यथोचित -यदि -यद्ध्यपि -यद्यपि -यस -यसका -यसको -यसपछि -यसबाहेक -यसमा -यसरी -यसले -यसो -यस्तै -यस्तो -यहाँ -यहाँसम्म -यही -या -यी -यो -र -रही -रहेका -रहेको -रहेछ -राखे -राख्छ -राम्रो -रुपमा -रूप -रे -लगभग -लगायत -लाई -लाख -लागि -लागेको -ले -वटा -वरीपरी -वा -वाट -वापत -वास्तवमा -शायद -सक्छ -सक्ने -सँग -संग -सँगको -सँगसँगै -सँगै -संगै -सङ्ग -सङ्गको -सट्टा -सत्र -सधै -सबै -सबैको -सबैलाई -समय -समेत -सम्भव -सम्म -सय -सरह -सहित -सहितै -सही -साँच्चै -सात -साथ -साथै -सायद -सारा -सुनेको -सुनेर -सुरु -सुरुको -सुरुमै -सो -सोचेको -सोचेर -सोही -सोह्र -स्थित -स्पष्ट -हजार -हरे -हरेक -हामी -हामीले -हाम्रा -हाम्रो -हुँदैन -हुन -हुनत -हुनु -हुने -हुनेछ -हुन् -हुन्छ -हुन्थ्यो -हैन -हो -होइन -होकि -होला -""".split() -) diff --git a/spacy/lexeme.pyx b/spacy/lexeme.pyx index 8042098d7..1df516dcb 100644 --- a/spacy/lexeme.pyx +++ b/spacy/lexeme.pyx @@ -349,7 +349,7 @@ cdef class Lexeme: @property def is_oov(self): """RETURNS (bool): Whether the lexeme is out-of-vocabulary.""" - return self.orth not in self.vocab.vectors + return self.orth in self.vocab.vectors property is_stop: """RETURNS (bool): Whether the lexeme is a stop word.""" diff --git a/spacy/pipeline/pipes.pyx b/spacy/pipeline/pipes.pyx index 8f07bf8f7..3f40cb545 100644 --- a/spacy/pipeline/pipes.pyx +++ b/spacy/pipeline/pipes.pyx @@ -528,10 +528,10 @@ class Tagger(Pipe): new_tag_map[tag] = orig_tag_map[tag] else: new_tag_map[tag] = {POS: X} + if "_SP" in orig_tag_map: + new_tag_map["_SP"] = orig_tag_map["_SP"] cdef Vocab vocab = self.vocab if new_tag_map: - if "_SP" in orig_tag_map: - new_tag_map["_SP"] = orig_tag_map["_SP"] vocab.morphology = Morphology(vocab.strings, new_tag_map, vocab.morphology.lemmatizer, exc=vocab.morphology.exc) diff --git a/spacy/tests/conftest.py b/spacy/tests/conftest.py index 91b7e4d9d..1f13da5d6 100644 --- a/spacy/tests/conftest.py +++ b/spacy/tests/conftest.py @@ -170,11 +170,6 @@ def nb_tokenizer(): return get_lang_class("nb").Defaults.create_tokenizer() -@pytest.fixture(scope="session") -def ne_tokenizer(): - return get_lang_class("ne").Defaults.create_tokenizer() - - @pytest.fixture(scope="session") def nl_tokenizer(): return get_lang_class("nl").Defaults.create_tokenizer() diff --git a/spacy/tests/lang/ja/test_tokenizer.py b/spacy/tests/lang/ja/test_tokenizer.py index 651e906eb..26be5cf59 100644 --- a/spacy/tests/lang/ja/test_tokenizer.py +++ b/spacy/tests/lang/ja/test_tokenizer.py @@ -4,7 +4,7 @@ from __future__ import unicode_literals import pytest from ...tokenizer.test_naughty_strings import NAUGHTY_STRINGS -from spacy.lang.ja import Japanese, DetailedToken +from spacy.lang.ja import Japanese # fmt: off TOKENIZER_TESTS = [ @@ -96,57 +96,6 @@ def test_ja_tokenizer_split_modes(ja_tokenizer, text, len_a, len_b, len_c): assert len(nlp_c(text)) == len_c -@pytest.mark.parametrize("text,sub_tokens_list_a,sub_tokens_list_b,sub_tokens_list_c", - [ - ( - "選挙管理委員会", - [None, None, None, None], - [None, None, [ - [ - DetailedToken(surface='委員', tag='名詞-普通名詞-一般', inf='', lemma='委員', reading='イイン', sub_tokens=None), - DetailedToken(surface='会', tag='名詞-普通名詞-一般', inf='', lemma='会', reading='カイ', sub_tokens=None), - ] - ]], - [[ - [ - DetailedToken(surface='選挙', tag='名詞-普通名詞-サ変可能', inf='', lemma='選挙', reading='センキョ', sub_tokens=None), - DetailedToken(surface='管理', tag='名詞-普通名詞-サ変可能', inf='', lemma='管理', reading='カンリ', sub_tokens=None), - 
DetailedToken(surface='委員', tag='名詞-普通名詞-一般', inf='', lemma='委員', reading='イイン', sub_tokens=None), - DetailedToken(surface='会', tag='名詞-普通名詞-一般', inf='', lemma='会', reading='カイ', sub_tokens=None), - ], [ - DetailedToken(surface='選挙', tag='名詞-普通名詞-サ変可能', inf='', lemma='選挙', reading='センキョ', sub_tokens=None), - DetailedToken(surface='管理', tag='名詞-普通名詞-サ変可能', inf='', lemma='管理', reading='カンリ', sub_tokens=None), - DetailedToken(surface='委員会', tag='名詞-普通名詞-一般', inf='', lemma='委員会', reading='イインカイ', sub_tokens=None), - ] - ]] - ), - ] -) -def test_ja_tokenizer_sub_tokens(ja_tokenizer, text, sub_tokens_list_a, sub_tokens_list_b, sub_tokens_list_c): - nlp_a = Japanese(meta={"tokenizer": {"config": {"split_mode": "A"}}}) - nlp_b = Japanese(meta={"tokenizer": {"config": {"split_mode": "B"}}}) - nlp_c = Japanese(meta={"tokenizer": {"config": {"split_mode": "C"}}}) - - assert ja_tokenizer(text).user_data["sub_tokens"] == sub_tokens_list_a - assert nlp_a(text).user_data["sub_tokens"] == sub_tokens_list_a - assert nlp_b(text).user_data["sub_tokens"] == sub_tokens_list_b - assert nlp_c(text).user_data["sub_tokens"] == sub_tokens_list_c - - -@pytest.mark.parametrize("text,inflections,reading_forms", - [ - ( - "取ってつけた", - ("五段-ラ行,連用形-促音便", "", "下一段-カ行,連用形-一般", "助動詞-タ,終止形-一般"), - ("トッ", "テ", "ツケ", "タ"), - ), - ] -) -def test_ja_tokenizer_inflections_reading_forms(ja_tokenizer, text, inflections, reading_forms): - assert ja_tokenizer(text).user_data["inflections"] == inflections - assert ja_tokenizer(text).user_data["reading_forms"] == reading_forms - - def test_ja_tokenizer_emptyish_texts(ja_tokenizer): doc = ja_tokenizer("") assert len(doc) == 0 diff --git a/spacy/tests/lang/ne/__init__.py b/spacy/tests/lang/ne/__init__.py deleted file mode 100644 index e69de29bb..000000000 diff --git a/spacy/tests/lang/ne/test_text.py b/spacy/tests/lang/ne/test_text.py deleted file mode 100644 index 926a7de04..000000000 --- a/spacy/tests/lang/ne/test_text.py +++ /dev/null @@ -1,19 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import pytest - - -def test_ne_tokenizer_handlers_long_text(ne_tokenizer): - text = """मैले पाएको सर्टिफिकेटलाई म त बोक्रो सम्झन्छु र अभ्यास तब सुरु भयो, जब मैले कलेज पार गरेँ र जीवनको पढाइ सुरु गरेँ ।""" - tokens = ne_tokenizer(text) - assert len(tokens) == 24 - - -@pytest.mark.parametrize( - "text,length", - [("समय जान कति पनि बेर लाग्दैन ।", 7), ("म ठूलो हुँदै थिएँ ।", 5)], -) -def test_ne_tokenizer_handles_cnts(ne_tokenizer, text, length): - tokens = ne_tokenizer(text) - assert len(tokens) == length \ No newline at end of file diff --git a/spacy/tests/pipeline/test_tagger.py b/spacy/tests/pipeline/test_tagger.py index 1681ffeaa..a5bda9090 100644 --- a/spacy/tests/pipeline/test_tagger.py +++ b/spacy/tests/pipeline/test_tagger.py @@ -3,7 +3,6 @@ from __future__ import unicode_literals import pytest from spacy.language import Language -from spacy.symbols import POS, NOUN def test_label_types(): @@ -12,16 +11,3 @@ def test_label_types(): nlp.get_pipe("tagger").add_label("A") with pytest.raises(ValueError): nlp.get_pipe("tagger").add_label(9) - - -def test_tagger_begin_training_tag_map(): - """Test that Tagger.begin_training() without gold tuples does not clobber - the tag map.""" - nlp = Language() - tagger = nlp.create_pipe("tagger") - orig_tag_count = len(tagger.labels) - tagger.add_label("A", {"POS": "NOUN"}) - nlp.add_pipe(tagger) - nlp.begin_training() - assert nlp.vocab.morphology.tag_map["A"] == {POS: NOUN} - assert orig_tag_count + 1 == len(nlp.get_pipe("tagger").labels) 
diff --git a/spacy/tests/vocab_vectors/test_vectors.py b/spacy/tests/vocab_vectors/test_vectors.py index b31cef1f2..576ca93d2 100644 --- a/spacy/tests/vocab_vectors/test_vectors.py +++ b/spacy/tests/vocab_vectors/test_vectors.py @@ -376,6 +376,6 @@ def test_vector_is_oov(): data[1] = 2.0 vocab.set_vector("cat", data[0]) vocab.set_vector("dog", data[1]) - assert vocab["cat"].is_oov is False - assert vocab["dog"].is_oov is False - assert vocab["hamster"].is_oov is True + assert vocab["cat"].is_oov is True + assert vocab["dog"].is_oov is True + assert vocab["hamster"].is_oov is False diff --git a/spacy/tokens/token.pyx b/spacy/tokens/token.pyx index 8d3406bae..45deebc93 100644 --- a/spacy/tokens/token.pyx +++ b/spacy/tokens/token.pyx @@ -923,7 +923,7 @@ cdef class Token: @property def is_oov(self): """RETURNS (bool): Whether the token is out-of-vocabulary.""" - return self.c.lex.orth not in self.vocab.vectors + return self.c.lex.orth in self.vocab.vectors @property def is_stop(self): diff --git a/spacy/util.py b/spacy/util.py index 923f56b31..5362952e2 100644 --- a/spacy/util.py +++ b/spacy/util.py @@ -208,10 +208,6 @@ def load_model_from_path(model_path, meta=False, **overrides): pipeline = nlp.Defaults.pipe_names elif pipeline in (False, None): pipeline = [] - # skip "vocab" from overrides in component initialization since vocab is - # already configured from overrides when nlp is initialized above - if "vocab" in overrides: - del overrides["vocab"] for name in pipeline: if name not in disable: config = meta.get("pipeline_args", {}).get(name, {}) diff --git a/website/docs/api/goldparse.md b/website/docs/api/goldparse.md index bc33dd4e6..5df625991 100644 --- a/website/docs/api/goldparse.md +++ b/website/docs/api/goldparse.md @@ -12,18 +12,18 @@ expects true examples of a label to have the value `1.0`, and negative examples of a label to have the value `0.0`. Labels not in the dictionary are treated as missing – the gradient for those labels will be zero. -| Name | Type | Description | -| ----------------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `doc` | `Doc` | The document the annotations refer to. | -| `words` | iterable | A sequence of unicode word strings. | -| `tags` | iterable | A sequence of strings, representing tag annotations. | -| `heads` | iterable | A sequence of integers, representing syntactic head offsets. | -| `deps` | iterable | A sequence of strings, representing the syntactic relation types. | -| `entities` | iterable | A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. | -| `cats` | dict | Labels for text classification. Each key in the dictionary is a string label for the category and each value is `1.0` (positive) or `0.0` (negative). | -| `links` | dict | Labels for entity linking. A dict with `(start_char, end_char)` keys, and the values being dicts with `kb_id:value` entries, representing external KB IDs mapped to either `1.0` (positive) or `0.0` (negative). | -| `make_projective` | bool | Whether to projectivize the dependency tree. Defaults to `False`. | -| **RETURNS** | `GoldParse` | The newly constructed object. 
| +| Name | Type | Description | +| ----------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `doc` | `Doc` | The document the annotations refer to. | +| `words` | iterable | A sequence of unicode word strings. | +| `tags` | iterable | A sequence of strings, representing tag annotations. | +| `heads` | iterable | A sequence of integers, representing syntactic head offsets. | +| `deps` | iterable | A sequence of strings, representing the syntactic relation types. | +| `entities` | iterable | A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. | +| `cats` | dict | Labels for text classification. Each key in the dictionary is a string label for the category and each value is `1.0` (positive) or `0.0` (negative). | +| `links` | dict | Labels for entity linking. A dict with `(start_char, end_char)` keys, and the values being dicts with `kb_id:value` entries, representing external KB IDs mapped to either `1.0` (positive) or `0.0` (negative). | +| `make_projective` | bool | Whether to projectivize the dependency tree. Defaults to `False.`. | +| **RETURNS** | `GoldParse` | The newly constructed object. | ## GoldParse.\_\_len\_\_ {#len tag="method"} @@ -43,17 +43,17 @@ Whether the provided syntactic annotations form a projective dependency tree. ## Attributes {#attributes} -| Name | Type | Description | -| ------------------------------------ | ---- | ------------------------------------------------------------------------------------------------------------------------ | -| `words` | list | The words. | -| `tags` | list | The part-of-speech tag annotations. | -| `heads` | list | The syntactic head annotations. | -| `labels` | list | The syntactic relation-type annotations. | -| `ner` | list | The named entity annotations as BILUO tags. | -| `cand_to_gold` | list | The alignment from candidate tokenization to gold tokenization. | -| `gold_to_cand` | list | The alignment from gold tokenization to candidate tokenization. | -| `cats` 2 | dict | Keys in the dictionary are string category labels with values `1.0` or `0.0`. | -| `links` 2.2 | dict | Keys in the dictionary are `(start_char, end_char)` triples, and the values are dictionaries with `kb_id:value` entries. | +| Name | Type | Description | +| ------------------------------------ | ---- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `words` | list | The words. | +| `tags` | list | The part-of-speech tag annotations. | +| `heads` | list | The syntactic head annotations. | +| `labels` | list | The syntactic relation-type annotations. | +| `ner` | list | The named entity annotations as BILUO tags. | +| `cand_to_gold` | list | The alignment from candidate tokenization to gold tokenization. | +| `gold_to_cand` | list | The alignment from gold tokenization to candidate tokenization. | +| `cats` 2 | dict | Keys in the dictionary are string category labels with values `1.0` or `0.0`. | +| `links` 2.2 | dict | Keys in the dictionary are `(start_char, end_char)` triples, and the values are dictionaries with `kb_id:value` entries. 
| ## Utilities {#util} @@ -61,8 +61,7 @@ Whether the provided syntactic annotations form a projective dependency tree. Convert a list of Doc objects into the [JSON-serializable format](/api/annotation#json-input) used by the -[`spacy train`](/api/cli#train) command. Each input doc will be treated as a -'paragraph' in the output doc. +[`spacy train`](/api/cli#train) command. Each input doc will be treated as a 'paragraph' in the output doc. > #### Example > diff --git a/website/docs/api/matcher.md b/website/docs/api/matcher.md index 7b195e352..ac2f898e0 100644 --- a/website/docs/api/matcher.md +++ b/website/docs/api/matcher.md @@ -57,7 +57,7 @@ spaCy v2.3, the `Matcher` can also be called on `Span` objects. | Name | Type | Description | | ----------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `doclike` | `Doc`/`Span` | The document to match over or a `Span` (as of v2.3). | +| `doclike` | `Doc`/`Span` | The document to match over or a `Span` (as of v2.3).. | | **RETURNS** | list | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. | diff --git a/website/docs/usage/101/_pos-deps.md b/website/docs/usage/101/_pos-deps.md index 1e8960edf..1a438e424 100644 --- a/website/docs/usage/101/_pos-deps.md +++ b/website/docs/usage/101/_pos-deps.md @@ -36,7 +36,7 @@ for token in doc: | Text | Lemma | POS | Tag | Dep | Shape | alpha | stop | | ------- | ------- | ------- | ----- | ---------- | ------- | ------- | ------- | | Apple | apple | `PROPN` | `NNP` | `nsubj` | `Xxxxx` | `True` | `False` | -| is | be | `AUX` | `VBZ` | `aux` | `xx` | `True` | `True` | +| is | be | `VERB` | `VBZ` | `aux` | `xx` | `True` | `True` | | looking | look | `VERB` | `VBG` | `ROOT` | `xxxx` | `True` | `False` | | at | at | `ADP` | `IN` | `prep` | `xx` | `True` | `True` | | buying | buy | `VERB` | `VBG` | `pcomp` | `xxxx` | `True` | `False` | diff --git a/website/docs/usage/adding-languages.md b/website/docs/usage/adding-languages.md index 29a9a1c27..d42aad705 100644 --- a/website/docs/usage/adding-languages.md +++ b/website/docs/usage/adding-languages.md @@ -662,7 +662,7 @@ One thing to keep in mind is that spaCy expects to train its models from **whole documents**, not just single sentences. If your corpus only contains single sentences, spaCy's models will never learn to expect multi-sentence documents, leading to low performance on real text. To mitigate this problem, you can use -the `-n` argument to the `spacy convert` command, to merge some of the sentences +the `-N` argument to the `spacy convert` command, to merge some of the sentences into longer pseudo-documents. 
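The sentence-merging behaviour described above is easiest to see with a concrete call. The following is an illustrative sketch only: the file names are placeholders, and it assumes a spaCy v2.x install where `spacy.cli.convert` mirrors the command-line converter (including its `n_sents` option) and the output directory already exists.

```python
# Illustrative sketch: paths are placeholders, output directory must already exist.
from spacy.cli import convert

# Merge every 10 sentences of a CoNLL-U corpus into one pseudo-document so that
# the tagger and parser are trained on realistic multi-sentence inputs.
convert("train.conllu", "corpus", converter="conllu", n_sents=10)
```

Each resulting pseudo-document then spans ten sentences instead of one, which is the effect the option above is meant to achieve.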
### Training the tagger and parser {#train-tagger-parser} diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md index 9031a356f..84bb3d71b 100644 --- a/website/docs/usage/linguistic-features.md +++ b/website/docs/usage/linguistic-features.md @@ -471,7 +471,7 @@ doc = nlp.make_doc("London is a big city in the United Kingdom.") print("Before", doc.ents) # [] header = [ENT_IOB, ENT_TYPE] -attr_array = numpy.zeros((len(doc), len(header)), dtype="uint64") +attr_array = numpy.zeros((len(doc), len(header))) attr_array[0, 0] = 3 # B attr_array[0, 1] = doc.vocab.strings["GPE"] doc.from_array(header, attr_array) @@ -1143,9 +1143,9 @@ from spacy.gold import align other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."] spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."] cost, a2b, b2a, a2b_multi, b2a_multi = align(other_tokens, spacy_tokens) -print("Edit distance:", cost) # 3 +print("Misaligned tokens:", cost) # 2 print("One-to-one mappings a -> b", a2b) # array([0, 1, 2, 3, -1, -1, 5, 6]) -print("One-to-one mappings b -> a", b2a) # array([0, 1, 2, 3, -1, 6, 7]) +print("One-to-one mappings b -> a", b2a) # array([0, 1, 2, 3, 5, 6, 7]) print("Many-to-one mappings a -> b", a2b_multi) # {4: 4, 5: 4} print("Many-to-one mappings b-> a", b2a_multi) # {} ``` @@ -1153,7 +1153,7 @@ print("Many-to-one mappings b-> a", b2a_multi) # {} Here are some insights from the alignment information generated in the example above: -- The edit distance (cost) is `3`: two deletions and one insertion. +- Two tokens are misaligned. - The one-to-one mappings for the first four tokens are identical, which means they map to each other. This makes sense because they're also identical in the input: `"i"`, `"listened"`, `"to"` and `"obama"`. diff --git a/website/docs/usage/models.md b/website/docs/usage/models.md index b11e6347a..382193157 100644 --- a/website/docs/usage/models.md +++ b/website/docs/usage/models.md @@ -117,18 +117,6 @@ The Chinese language class supports three word segmentation options: better segmentation for Chinese OntoNotes and the new [Chinese models](/models/zh). - - -Note that [`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship -with pre-compiled wheels for Python 3.8. If you're running Python 3.8, you can -install it from our fork and compile it locally: - -```bash -$ pip install https://github.com/honnibal/pkuseg-python/archive/master.zip -``` - - - The `meta` argument of the `Chinese` language class supports the following @@ -208,20 +196,12 @@ nlp = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "/path/to/pkuseg_mo The Japanese language class uses [SudachiPy](https://github.com/WorksApplications/SudachiPy) for word -segmentation and part-of-speech tagging. The default Japanese language class and -the provided Japanese models use SudachiPy split mode `A`. +segmentation and part-of-speech tagging. The default Japanese language class +and the provided Japanese models use SudachiPy split mode `A`. The `meta` argument of the `Japanese` language class can be used to configure the split mode to `A`, `B` or `C`. - - -If you run into errors related to `sudachipy`, which is currently under active -development, we suggest downgrading to `sudachipy==0.4.5`, which is the version -used for training the current [Japanese models](/models/ja). 
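To illustrate the `meta`-based split mode configuration discussed above, here is a minimal sketch. It assumes SudachiPy is installed (for example via `pip install spacy[ja]`) and that the v2.3-era Japanese tokenizer reads a `split_mode` key from its tokenizer config; treat that key name as an assumption rather than a guaranteed API.

```python
# Minimal sketch: requires SudachiPy (pip install spacy[ja]); "split_mode" is
# assumed to be the config key used by the v2.3 Japanese tokenizer.
from spacy.lang.ja import Japanese

nlp = Japanese(meta={"tokenizer": {"config": {"split_mode": "B"}}})
doc = nlp("選挙管理委員会")
print([token.text for token in doc])  # mode B yields coarser tokens than mode A
```

In SudachiPy, mode `A` produces the shortest units, `C` the longest, and `B` sits in between.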
- - - ## Installing and using models {#download} > #### Downloading models in spaCy < v1.7 diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md index f7866fe31..1db2405d1 100644 --- a/website/docs/usage/rule-based-matching.md +++ b/website/docs/usage/rule-based-matching.md @@ -1158,17 +1158,17 @@ what you need for your application. > available corpus. For example, the corpus spaCy's [English models](/models/en) were trained on -defines a `PERSON` entity as just the **person name**, without titles like "Mr." -or "Dr.". This makes sense, because it makes it easier to resolve the entity -type back to a knowledge base. But what if your application needs the full -names, _including_ the titles? +defines a `PERSON` entity as just the **person name**, without titles like "Mr" +or "Dr". This makes sense, because it makes it easier to resolve the entity type +back to a knowledge base. But what if your application needs the full names, +_including_ the titles? ```python ### {executable="true"} import spacy nlp = spacy.load("en_core_web_sm") -doc = nlp("Dr. Alex Smith chaired first board meeting of Acme Corp Inc.") +doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.") print([(ent.text, ent.label_) for ent in doc.ents]) ``` @@ -1233,7 +1233,7 @@ def expand_person_entities(doc): # Add the component after the named entity recognizer nlp.add_pipe(expand_person_entities, after='ner') -doc = nlp("Dr. Alex Smith chaired first board meeting of Acme Corp Inc.") +doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.") print([(ent.text, ent.label_) for ent in doc.ents]) ``` diff --git a/website/docs/usage/v2-3.md b/website/docs/usage/v2-3.md index e6b88c779..ba75b01ab 100644 --- a/website/docs/usage/v2-3.md +++ b/website/docs/usage/v2-3.md @@ -14,10 +14,10 @@ all language models, and decreased model size and loading times for models with vectors. We've added pretrained models for **Chinese, Danish, Japanese, Polish and Romanian** and updated the training data and vectors for most languages. Model packages with vectors are about **2×** smaller on disk and load -**2-4×** faster. For the full changelog, see the -[release notes on GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0). -For more details and a behind-the-scenes look at the new release, -[see our blog post](https://explosion.ai/blog/spacy-v2-3). +**2-4×** faster. For the full changelog, see the [release notes on +GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0). For more +details and a behind-the-scenes look at the new release, [see our blog +post](https://explosion.ai/blog/spacy-v2-3). ### Expanded model families with vectors {#models} @@ -33,10 +33,10 @@ For more details and a behind-the-scenes look at the new release, With new model families for Chinese, Danish, Polish, Romanian and Chinese plus `md` and `lg` models with word vectors for all languages, this release provides -a total of 46 model packages. For models trained using -[Universal Dependencies](https://universaldependencies.org) corpora, the -training data has been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish) -and Dutch has been extended to include both UD Dutch Alpino and LassySmall. +a total of 46 model packages. For models trained using [Universal +Dependencies](https://universaldependencies.org) corpora, the training data has +been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish) and Dutch has been +extended to include both UD Dutch Alpino and LassySmall. 
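Since the paragraph above centers on the new `md` and `lg` packages with word vectors, a short usage sketch may help. It assumes `en_core_web_md` is installed and is only meant to show that similarity and vector lookups in these packages are backed by real word vectors; exact scores depend on the model version.

```python
# Sketch assuming the en_core_web_md package is installed.
import spacy

nlp = spacy.load("en_core_web_md")
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")
print(doc1.similarity(doc2))    # similarity score from the packaged word vectors
print(nlp.vocab.vectors.shape)  # (number of vectors, vector width)
```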
@@ -48,7 +48,6 @@ and Dutch has been extended to include both UD Dutch Alpino and LassySmall. ### Chinese {#chinese} > #### Example -> > ```python > from spacy.lang.zh import Chinese > @@ -58,49 +57,41 @@ and Dutch has been extended to include both UD Dutch Alpino and LassySmall. > > # Append words to user dict > nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"]) -> ``` This release adds support for -[`pkuseg`](https://github.com/lancopku/pkuseg-python) for word segmentation and -the new Chinese models ship with a custom pkuseg model trained on OntoNotes. The -Chinese tokenizer can be initialized with both `pkuseg` and custom models and -the `pkuseg` user dictionary is easy to customize. Note that -[`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship with -pre-compiled wheels for Python 3.8. See the -[usage documentation](/usage/models#chinese) for details on how to install it on -Python 3.8. +[pkuseg](https://github.com/lancopku/pkuseg-python) for word segmentation and +the new Chinese models ship with a custom pkuseg model trained on OntoNotes. +The Chinese tokenizer can be initialized with both `pkuseg` and custom models +and the `pkuseg` user dictionary is easy to customize. -**Models:** [Chinese models](/models/zh) **Usage: ** -[Chinese tokenizer usage](/usage/models#chinese) +**Chinese:** [Chinese tokenizer usage](/usage/models#chinese) ### Japanese {#japanese} The updated Japanese language class switches to -[`SudachiPy`](https://github.com/WorksApplications/SudachiPy) for word -segmentation and part-of-speech tagging. Using `SudachiPy` greatly simplifies +[SudachiPy](https://github.com/WorksApplications/SudachiPy) for word +segmentation and part-of-speech tagging. Using `sudachipy` greatly simplifies installing spaCy for Japanese, which is now possible with a single command: `pip install spacy[ja]`. -**Models:** [Japanese models](/models/ja) **Usage:** -[Japanese tokenizer usage](/usage/models#japanese) +**Japanese:** [Japanese tokenizer usage](/usage/models#japanese) ### Small CLI updates -- [`spacy debug-data`](/api/cli#debug-data) provides the coverage of the vectors - in a base model with `spacy debug-data lang train dev -b base_model` -- [`spacy evaluate`](/api/cli#evaluate) supports `blank:lg` (e.g. - `spacy evaluate blank:en dev.json`) to evaluate the tokenization accuracy - without loading a model -- [`spacy train`](/api/cli#train) on GPU restricts the CPU timing evaluation to - the first iteration +- `spacy debug-data` provides the coverage of the vectors in a base model with + `spacy debug-data lang train dev -b base_model` +- `spacy evaluate` supports `blank:lg` (e.g. `spacy evaluate blank:en + dev.json`) to evaluate the tokenization accuracy without loading a model +- `spacy train` on GPU restricts the CPU timing evaluation to the first + iteration ## Backwards incompatibilities {#incompat} @@ -109,8 +100,8 @@ installing spaCy for Japanese, which is now possible with a single command: If you've been training **your own models**, you'll need to **retrain** them with the new version. Also don't forget to upgrade all models to the latest versions. Models for earlier v2 releases (v2.0, v2.1, v2.2) aren't compatible -with models for v2.3. To check if all of your models are up to date, you can run -the [`spacy validate`](/api/cli#validate) command. +with models for v2.3. To check if all of your models are up to date, you can +run the [`spacy validate`](/api/cli#validate) command. @@ -125,20 +116,21 @@ the [`spacy validate`](/api/cli#validate) command. 
> directly. - If you're training new models, you'll want to install the package - [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which - now includes both the lemmatization tables (as in v2.2) and the normalization - tables (new in v2.3). If you're using pretrained models, **nothing changes**, - because the relevant tables are included in the model packages. + [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), + which now includes both the lemmatization tables (as in v2.2) and the + normalization tables (new in v2.3). If you're using pretrained models, + **nothing changes**, because the relevant tables are included in the model + packages. - Due to the updated Universal Dependencies training data, the fine-grained part-of-speech tags will change for many provided language models. The coarse-grained part-of-speech tagset remains the same, but the mapping from particular fine-grained to coarse-grained tags may show minor differences. - For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech - tagsets contain new merged tags related to contracted forms, such as `ADP_DET` - for French `"au"`, which maps to UPOS `ADP` based on the head `"à"`. This - increases the accuracy of the models by improving the alignment between - spaCy's tokenization and Universal Dependencies multi-word tokens used for - contractions. + tagsets contain new merged tags related to contracted forms, such as + `ADP_DET` for French `"au"`, which maps to UPOS `ADP` based on the head + `"à"`. This increases the accuracy of the models by improving the alignment + between spaCy's tokenization and Universal Dependencies multi-word tokens + used for contractions. ### Migrating from spaCy 2.2 {#migrating} @@ -151,81 +143,29 @@ v2.3 so that `token_match` has priority over prefixes and suffixes as in v2.2.1 and earlier versions. A new tokenizer setting `url_match` has been introduced in v2.3.0 to handle -cases like URLs where the tokenizer should remove prefixes and suffixes (e.g., a -comma at the end of a URL) before applying the match. See the full -[tokenizer documentation](/usage/linguistic-features#tokenization) and try out +cases like URLs where the tokenizer should remove prefixes and suffixes (e.g., +a comma at the end of a URL) before applying the match. See the full [tokenizer +documentation](/usage/linguistic-features#tokenization) and try out [`nlp.tokenizer.explain()`](/usage/linguistic-features#tokenizer-debug) when debugging your tokenizer configuration. #### Warnings configuration -spaCy's custom warnings have been replaced with native Python +spaCy's custom warnings have been replaced with native python [`warnings`](https://docs.python.org/3/library/warnings.html). Instead of -setting `SPACY_WARNING_IGNORE`, use the [`warnings` +setting `SPACY_WARNING_IGNORE`, use the [warnings filters](https://docs.python.org/3/library/warnings.html#the-warnings-filter) to manage warnings. -```diff -import spacy -+ import warnings - -- spacy.errors.SPACY_WARNING_IGNORE.append('W007') -+ warnings.filterwarnings("ignore", message=r"\\[W007\\]", category=UserWarning) -``` - #### Normalization tables The normalization tables have moved from the language data in -[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) to the -package [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). -If you're adding data for a new language, the normalization table should be -added to `spacy-lookups-data`. 
See -[adding norm exceptions](/usage/adding-languages#norm-exceptions). - -#### No preloaded lexemes/vocab for models with vectors - -To reduce the initial loading time, the lexemes in `nlp.vocab` are no longer -loaded on initialization for models with vectors. As you process texts, the -lexemes will be added to the vocab automatically, just as in models without -vectors. - -To see the number of unique vectors and number of words with vectors, see -`nlp.meta['vectors']`, for example for `en_core_web_md` there are `20000` -unique vectors and `684830` words with vectors: - -```python -{ - 'width': 300, - 'vectors': 20000, - 'keys': 684830, - 'name': 'en_core_web_md.vectors' -} -``` - -If required, for instance if you are working directly with word vectors rather -than processing texts, you can load all lexemes for words with vectors at once: - -```python -for orth in nlp.vocab.vectors: - _ = nlp.vocab[orth] -``` - -#### Lexeme.is_oov and Token.is_oov - - - -Due to a bug, the values for `is_oov` are reversed in v2.3.0, but this will be -fixed in the next patch release v2.3.1. - - - -In v2.3, `Lexeme.is_oov` and `Token.is_oov` are `True` if the lexeme does not -have a word vector. This is equivalent to `token.orth not in -nlp.vocab.vectors`. - -Previously in v2.2, `is_oov` corresponded to whether a lexeme had stored -probability and cluster features. The probability and cluster features are no -longer included in the provided medium and large models (see the next section). +[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) to +the package +[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). If +you're adding data for a new language, the normalization table should be added +to `spacy-lookups-data`. See [adding norm +exceptions](/usage/adding-languages#norm-exceptions). #### Probability and cluster features @@ -241,28 +181,28 @@ longer included in the provided medium and large models (see the next section). The `Token.prob` and `Token.cluster` features, which are no longer used by the core pipeline components as of spaCy v2, are no longer provided in the -pretrained models to reduce the model size. To keep these features available for -users relying on them, the `prob` and `cluster` features for the most frequent -1M tokens have been moved to +pretrained models to reduce the model size. To keep these features available +for users relying on them, the `prob` and `cluster` features for the most +frequent 1M tokens have been moved to [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) as `extra` features for the relevant languages (English, German, Greek and Spanish). The extra tables are loaded lazily, so if you have `spacy-lookups-data` -installed and your code accesses `Token.prob`, the full table is loaded into the -model vocab, which will take a few seconds on initial loading. When you save -this model after loading the `prob` table, the full `prob` table will be saved -as part of the model vocab. +installed and your code accesses `Token.prob`, the full table is loaded into +the model vocab, which will take a few seconds on initial loading. When you +save this model after loading the `prob` table, the full `prob` table will be +saved as part of the model vocab. 
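A small sketch of the lazy loading described above, assuming spaCy v2.3 with `en_core_web_md` and the `spacy-lookups-data` package installed so that the extra tables are available.

```python
# Sketch: requires en_core_web_md and spacy-lookups-data.
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("The quick brown fox")

# The first access to Token.prob triggers loading of the extra probability table
# into the model vocab (a few seconds), after which lookups are instant.
print(doc[1].prob)     # smoothed log probability as a negative float
print(doc[1].cluster)  # Brown cluster ID, 0 if no cluster is available
```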
-If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as part -of a new model, add the data to +If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as +part of a new model, add the data to [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) under the entry point `lg_extra`, e.g. `en_extra` for English. Alternatively, you can initialize your [`Vocab`](/api/vocab) with the `lookups_extra` argument with a [`Lookups`](/api/lookups) object that includes the tables `lexeme_cluster`, `lexeme_prob`, `lexeme_sentiment` or `lexeme_settings`. `lexeme_settings` is -currently only used to provide a custom `oov_prob`. See examples in the -[`data` directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data) +currently only used to provide a custom `oov_prob`. See examples in the [`data` +directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data) in `spacy-lookups-data`. #### Initializing new models without extra lookups tables diff --git a/website/meta/site.json b/website/meta/site.json index 8b8424f82..29d71048e 100644 --- a/website/meta/site.json +++ b/website/meta/site.json @@ -23,9 +23,9 @@ "apiKey": "371e26ed49d29a27bd36273dfdaf89af", "indexName": "spacy" }, - "binderUrl": "explosion/spacy-io-binder", + "binderUrl": "ines/spacy-io-binder", "binderBranch": "live", - "binderVersion": "2.3.0", + "binderVersion": "2.2.0", "sections": [ { "id": "usage", "title": "Usage Documentation", "theme": "blue" }, { "id": "models", "title": "Models Documentation", "theme": "blue" },