diff --git a/.github/contributors/gandersen101.md b/.github/contributors/gandersen101.md new file mode 100644 index 000000000..cae4ad047 --- /dev/null +++ b/.github/contributors/gandersen101.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. 
Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [ x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Grant Andersen | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 07.06.2020 | +| GitHub username | gandersen101 | +| Website (optional) | | diff --git a/.github/contributors/jbesomi.md b/.github/contributors/jbesomi.md new file mode 100644 index 000000000..ac43a3bfd --- /dev/null +++ b/.github/contributors/jbesomi.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. 
With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. 
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Jonathan B. | +| Company name (if applicable) | besomi.ai | +| Title or role (if applicable) | - | +| Date | 07.07.2020 | +| GitHub username | jbesomi | +| Website (optional) | besomi.ai | diff --git a/.github/contributors/mikeizbicki.md b/.github/contributors/mikeizbicki.md new file mode 100644 index 000000000..6e9d8c098 --- /dev/null +++ b/.github/contributors/mikeizbicki.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. 
With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Mike Izbicki | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 02 Jun 2020 | +| GitHub username | mikeizbicki | +| Website (optional) | https://izbicki.me | diff --git a/spacy/_ml.py b/spacy/_ml.py index 60a0bbee0..d947aab1c 100644 --- a/spacy/_ml.py +++ b/spacy/_ml.py @@ -14,7 +14,7 @@ from thinc.api import with_getitem, flatten_add_lengths from thinc.api import uniqued, wrap, noop from thinc.linear.linear import LinearModel from thinc.neural.ops import NumpyOps, CupyOps -from thinc.neural.util import get_array_module, copy_array +from thinc.neural.util import get_array_module, copy_array, to_categorical from thinc.neural.optimizers import Adam from thinc import describe @@ -840,6 +840,8 @@ def masked_language_model(vocab, model, mask_prob=0.15): def mlm_backward(d_output, sgd=None): d_output *= 1 - mask + # Rescale gradient for number of instances. + d_output *= mask.size - mask.sum() return backprop(d_output, sgd=sgd) return output, mlm_backward @@ -944,7 +946,7 @@ class CharacterEmbed(Model): # for the tip. 
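The CharacterEmbed hunk below moves the UTF-8 id array onto the model's `ops` before it is used for indexing, and the comment it carries asks for the numpy incantation for gathering from a 3d table with a 2d array of indices. A minimal sketch of that gather in plain numpy — the shapes and the intended identity `output[i, j, k] == table[ids[i, j], j, k]` are assumptions read off the surrounding code, not part of this patch:

```python
import numpy

# Hypothetical shapes mirroring CharacterEmbed: for each (token, char-slot)
# pair, ids picks one row of a per-slot embedding table.
n_tokens, nC, nM, n_rows = 4, 8, 16, 256
ids = numpy.random.randint(0, n_rows, size=(n_tokens, nC))  # 2d indices
table = numpy.random.uniform(-1.0, 1.0, (n_rows, nC, nM))   # 3d data

# Advanced indexing: broadcasting the (n_tokens, nC) index array against
# arange(nC) gives output[i, j] == table[ids[i, j], j].
nCv = numpy.arange(nC)
output = table[ids[:, nCv], nCv]
assert output.shape == (n_tokens, nC, nM)
assert (output[2, 3] == table[ids[2, 3], 3]).all()
```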
nCv = self.ops.xp.arange(self.nC) for doc in docs: - doc_ids = doc.to_utf8_array(nr_char=self.nC) + doc_ids = self.ops.asarray(doc.to_utf8_array(nr_char=self.nC)) doc_vectors = self.ops.allocate((len(doc), self.nC, self.nM)) # Let's say I have a 2d array of indices, and a 3d table of data. What numpy # incantation do I chant to get @@ -986,3 +988,17 @@ def get_cossim_loss(yh, y, ignore_zeros=False): losses[zero_indices] = 0 loss = losses.sum() return loss, -d_yh + + +def get_characters_loss(ops, docs, prediction, nr_char=10): + target_ids = numpy.vstack([doc.to_utf8_array(nr_char=nr_char) for doc in docs]) + target_ids = target_ids.reshape((-1,)) + target = ops.asarray(to_categorical(target_ids, nb_classes=256), dtype="f") + target = target.reshape((-1, 256*nr_char)) + diff = prediction - target + loss = (diff**2).sum() + d_target = diff / float(prediction.shape[0]) + return loss, d_target + + + diff --git a/spacy/cli/pretrain.py b/spacy/cli/pretrain.py index aaec1ea75..6d6c65161 100644 --- a/spacy/cli/pretrain.py +++ b/spacy/cli/pretrain.py @@ -18,7 +18,8 @@ from ..errors import Errors from ..tokens import Doc from ..attrs import ID, HEAD from .._ml import Tok2Vec, flatten, chain, create_default_optimizer -from .._ml import masked_language_model, get_cossim_loss +from .._ml import masked_language_model, get_cossim_loss, get_characters_loss +from .._ml import MultiSoftmax from .. import util from .train import _load_pretrained_tok2vec @@ -42,7 +43,7 @@ from .train import _load_pretrained_tok2vec bilstm_depth=("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int), embed_rows=("Number of embedding rows", "option", "er", int), loss_func=( - "Loss function to use for the objective. Either 'L2' or 'cosine'", + "Loss function to use for the objective. Either 'characters', 'L2' or 'cosine'", "option", "L", str, @@ -85,11 +86,11 @@ def pretrain( output_dir, width=96, conv_depth=4, - bilstm_depth=0, cnn_pieces=3, sa_depth=0, - use_chars=False, cnn_window=1, + bilstm_depth=0, + use_chars=False, embed_rows=2000, loss_func="cosine", use_vectors=False, @@ -124,11 +125,7 @@ def pretrain( config[key] = str(config[key]) util.fix_random_seed(seed) - has_gpu = prefer_gpu() - if has_gpu: - import torch - - torch.set_default_tensor_type("torch.cuda.FloatTensor") + has_gpu = prefer_gpu() msg.info("Using GPU" if has_gpu else "Not using GPU") output_dir = Path(output_dir) @@ -174,6 +171,7 @@ pretrain( subword_features=not use_chars, # Set to False for Chinese etc cnn_maxout_pieces=cnn_pieces, # If set to 1, use Mish activation. ), + objective=loss_func, ) # Load in pretrained weights if init_tok2vec is not None: @@ -264,7 +262,10 @@ def make_update(model, docs, optimizer, drop=0.0, objective="L2"): RETURNS loss: A float for the loss. """ predictions, backprop = model.begin_update(docs, drop=drop) - loss, gradients = get_vectors_loss(model.ops, docs, predictions, objective) + if objective == "characters": + loss, gradients = get_characters_loss(model.ops, docs, predictions) + else: + loss, gradients = get_vectors_loss(model.ops, docs, predictions, objective) backprop(gradients, sgd=optimizer) # Don't want to return a cupy object here # The gradients are modified in-place by the BERT MLM, @@ -326,16 +327,23 @@ def get_vectors_loss(ops, docs, prediction, objective="L2"): return loss, d_target -def create_pretraining_model(nlp, tok2vec): +def create_pretraining_model(nlp, tok2vec, objective="cosine", nr_char=10): """Define a network for the pretraining.
We simply add an output layer onto the tok2vec input model. The tok2vec input model needs to be a model that takes a batch of Doc objects (as a list), and returns a list of arrays. Each array in the output needs to have one row per token in the doc. """ - output_size = nlp.vocab.vectors.data.shape[1] - output_layer = chain( - LN(Maxout(300, pieces=3)), Affine(output_size, drop_factor=0.0) - ) + if objective == "characters": + out_sizes = [256] * nr_char + output_layer = chain( + LN(Maxout(300, pieces=3)), + MultiSoftmax(out_sizes, 300) + ) + else: + output_size = nlp.vocab.vectors.data.shape[1] + output_layer = chain( + LN(Maxout(300, pieces=3)), Affine(output_size, drop_factor=0.0) + ) # This is annoying, but the parser etc have the flatten step after # the tok2vec. To load the weights in cleanly, we need to match # the shape of the models' components exactly. So what we cann diff --git a/spacy/cli/train.py b/spacy/cli/train.py index d4de9aeb4..b81214b95 100644 --- a/spacy/cli/train.py +++ b/spacy/cli/train.py @@ -285,7 +285,7 @@ def train( if base_model and not pipes_added: # Start with an existing model, use default optimizer - optimizer = create_default_optimizer(Model.ops) + optimizer = nlp.resume_training(device=use_gpu) else: # Start with a blank model, call begin_training cfg = {"device": use_gpu} @@ -576,6 +576,8 @@ def train( with nlp.use_params(optimizer.averages): final_model_path = output_path / "model-final" nlp.to_disk(final_model_path) + srsly.write_json(final_model_path / "meta.json", meta) + meta_loc = output_path / "model-final" / "meta.json" final_meta = srsly.read_json(meta_loc) final_meta.setdefault("accuracy", {}) diff --git a/spacy/lang/en/__init__.py b/spacy/lang/en/__init__.py index d52f3dfd8..f58ae4a4e 100644 --- a/spacy/lang/en/__init__.py +++ b/spacy/lang/en/__init__.py @@ -18,41 +18,6 @@ def _return_en(_): return "en" -def en_is_base_form(univ_pos, morphology=None): - """ - Check whether we're dealing with an uninflected paradigm, so we can - avoid lemmatization entirely. - - univ_pos (unicode / int): The token's universal part-of-speech tag. - morphology (dict): The token's morphological features following the - Universal Dependencies scheme. 
- """ - if morphology is None: - morphology = {} - if univ_pos == "noun" and morphology.get("Number") == "sing": - return True - elif univ_pos == "verb" and morphology.get("VerbForm") == "inf": - return True - # This maps 'VBP' to base form -- probably just need 'IS_BASE' - # morphology - elif univ_pos == "verb" and ( - morphology.get("VerbForm") == "fin" - and morphology.get("Tense") == "pres" - and morphology.get("Number") is None - ): - return True - elif univ_pos == "adj" and morphology.get("Degree") == "pos": - return True - elif morphology.get("VerbForm") == "inf": - return True - elif morphology.get("VerbForm") == "none": - return True - elif morphology.get("Degree") == "pos": - return True - else: - return False - - class EnglishDefaults(Language.Defaults): lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters.update(LEX_ATTRS) @@ -61,7 +26,6 @@ class EnglishDefaults(Language.Defaults): tag_map = TAG_MAP stop_words = STOP_WORDS morph_rules = MORPH_RULES - is_base_form = en_is_base_form syntax_iterators = SYNTAX_ITERATORS single_orth_variants = [ {"tags": ["NFP"], "variants": ["…", "..."]}, @@ -72,6 +36,41 @@ class EnglishDefaults(Language.Defaults): {"tags": ["``", "''"], "variants": [('"', '"'), ("“", "”")]}, ] + @classmethod + def is_base_form(cls, univ_pos, morphology=None): + """ + Check whether we're dealing with an uninflected paradigm, so we can + avoid lemmatization entirely. + + univ_pos (unicode / int): The token's universal part-of-speech tag. + morphology (dict): The token's morphological features following the + Universal Dependencies scheme. + """ + if morphology is None: + morphology = {} + if univ_pos == "noun" and morphology.get("Number") == "sing": + return True + elif univ_pos == "verb" and morphology.get("VerbForm") == "inf": + return True + # This maps 'VBP' to base form -- probably just need 'IS_BASE' + # morphology + elif univ_pos == "verb" and ( + morphology.get("VerbForm") == "fin" + and morphology.get("Tense") == "pres" + and morphology.get("Number") is None + ): + return True + elif univ_pos == "adj" and morphology.get("Degree") == "pos": + return True + elif morphology.get("VerbForm") == "inf": + return True + elif morphology.get("VerbForm") == "none": + return True + elif morphology.get("Degree") == "pos": + return True + else: + return False + class English(Language): lang = "en" diff --git a/spacy/lang/fr/lemmatizer.py b/spacy/lang/fr/lemmatizer.py index 79f4dd28d..af8345e1b 100644 --- a/spacy/lang/fr/lemmatizer.py +++ b/spacy/lang/fr/lemmatizer.py @@ -45,9 +45,6 @@ class FrenchLemmatizer(Lemmatizer): univ_pos = "sconj" else: return [self.lookup(string)] - # See Issue #435 for example of where this logic is requied. - if self.is_base_form(univ_pos, morphology): - return list(set([string.lower()])) index_table = self.lookups.get_table("lemma_index", {}) exc_table = self.lookups.get_table("lemma_exc", {}) rules_table = self.lookups.get_table("lemma_rules", {}) @@ -59,43 +56,6 @@ class FrenchLemmatizer(Lemmatizer): ) return lemmas - def is_base_form(self, univ_pos, morphology=None): - """ - Check whether we're dealing with an uninflected paradigm, so we can - avoid lemmatization entirely. 
- """ - morphology = {} if morphology is None else morphology - others = [ - key - for key in morphology - if key not in (POS, "Number", "POS", "VerbForm", "Tense") - ] - if univ_pos == "noun" and morphology.get("Number") == "sing": - return True - elif univ_pos == "verb" and morphology.get("VerbForm") == "inf": - return True - # This maps 'VBP' to base form -- probably just need 'IS_BASE' - # morphology - elif univ_pos == "verb" and ( - morphology.get("VerbForm") == "fin" - and morphology.get("Tense") == "pres" - and morphology.get("Number") is None - and not others - ): - return True - elif univ_pos == "adj" and morphology.get("Degree") == "pos": - return True - elif VerbForm_inf in morphology: - return True - elif VerbForm_none in morphology: - return True - elif Number_sing in morphology: - return True - elif Degree_pos in morphology: - return True - else: - return False - def noun(self, string, morphology=None): return self(string, "noun", morphology) diff --git a/spacy/lang/ko/__init__.py b/spacy/lang/ko/__init__.py index ec79a95ab..21a754168 100644 --- a/spacy/lang/ko/__init__.py +++ b/spacy/lang/ko/__init__.py @@ -42,7 +42,11 @@ def check_spaces(text, tokens): class KoreanTokenizer(DummyTokenizer): def __init__(self, cls, nlp=None): self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp) - self.Tokenizer = try_mecab_import() + MeCab = try_mecab_import() + self.mecab_tokenizer = MeCab("-F%f[0],%f[7]") + + def __del__(self): + self.mecab_tokenizer.__del__() def __call__(self, text): dtokens = list(self.detailed_tokens(text)) @@ -58,17 +62,16 @@ class KoreanTokenizer(DummyTokenizer): def detailed_tokens(self, text): # 품사 태그(POS)[0], 의미 부류(semantic class)[1], 종성 유무(jongseong)[2], 읽기(reading)[3], # 타입(type)[4], 첫번째 품사(start pos)[5], 마지막 품사(end pos)[6], 표현(expression)[7], * - with self.Tokenizer("-F%f[0],%f[7]") as tokenizer: - for node in tokenizer.parse(text, as_nodes=True): - if node.is_eos(): - break - surface = node.surface - feature = node.feature - tag, _, expr = feature.partition(",") - lemma, _, remainder = expr.partition("/") - if lemma == "*": - lemma = surface - yield {"surface": surface, "lemma": lemma, "tag": tag} + for node in self.mecab_tokenizer.parse(text, as_nodes=True): + if node.is_eos(): + break + surface = node.surface + feature = node.feature + tag, _, expr = feature.partition(",") + lemma, _, remainder = expr.partition("/") + if lemma == "*": + lemma = surface + yield {"surface": surface, "lemma": lemma, "tag": tag} class KoreanDefaults(Language.Defaults): diff --git a/spacy/lemmatizer.py b/spacy/lemmatizer.py index f72eae128..8b2375257 100644 --- a/spacy/lemmatizer.py +++ b/spacy/lemmatizer.py @@ -21,7 +21,7 @@ class Lemmatizer(object): def load(cls, *args, **kwargs): raise NotImplementedError(Errors.E172) - def __init__(self, lookups, *args, is_base_form=None, **kwargs): + def __init__(self, lookups, is_base_form=None, *args, **kwargs): """Initialize a Lemmatizer. 
lookups (Lookups): The lookups object containing the (optional) tables diff --git a/spacy/ml/_legacy_tok2vec.py b/spacy/ml/_legacy_tok2vec.py index b077a46b7..3e41b1c6a 100644 --- a/spacy/ml/_legacy_tok2vec.py +++ b/spacy/ml/_legacy_tok2vec.py @@ -49,6 +49,14 @@ def Tok2Vec(width, embed_size, **kwargs): >> LN(Maxout(width, width * 5, pieces=3)), column=cols.index(ORTH), ) + elif char_embed: + embed = concatenate_lists( + CharacterEmbed(nM=64, nC=8), + FeatureExtracter(cols) >> with_flatten(glove), + ) + reduce_dimensions = LN( + Maxout(width, 64 * 8 + width, pieces=cnn_maxout_pieces) + ) else: embed = uniqued( (glove | norm) >> LN(Maxout(width, width * 2, pieces=3)), @@ -81,7 +89,8 @@ def Tok2Vec(width, embed_size, **kwargs): ) else: tok2vec = FeatureExtracter(cols) >> with_flatten( - embed >> convolution ** conv_depth, pad=conv_depth + embed + >> convolution ** conv_depth, pad=conv_depth ) if bilstm_depth >= 1: diff --git a/spacy/pipeline/pipes.pyx b/spacy/pipeline/pipes.pyx index 8f07bf8f7..b28f34a7a 100644 --- a/spacy/pipeline/pipes.pyx +++ b/spacy/pipeline/pipes.pyx @@ -33,6 +33,7 @@ from .._ml import build_text_classifier, build_simple_cnn_text_classifier from .._ml import build_bow_text_classifier, build_nel_encoder from .._ml import link_vectors_to_models, zero_init, flatten from .._ml import masked_language_model, create_default_optimizer, get_cossim_loss +from .._ml import MultiSoftmax, get_characters_loss from ..errors import Errors, TempErrors, Warnings from .. import util @@ -846,11 +847,15 @@ class MultitaskObjective(Tagger): class ClozeMultitask(Pipe): @classmethod def Model(cls, vocab, tok2vec, **cfg): - output_size = vocab.vectors.data.shape[1] - output_layer = chain( - LayerNorm(Maxout(output_size, tok2vec.nO, pieces=3)), - zero_init(Affine(output_size, output_size, drop_factor=0.0)) - ) + if cfg["objective"] == "characters": + out_sizes = [256] * cfg.get("nr_char", 4) + output_layer = MultiSoftmax(out_sizes) + else: + output_size = vocab.vectors.data.shape[1] + output_layer = chain( + LayerNorm(Maxout(output_size, tok2vec.nO, pieces=3)), + zero_init(Affine(output_size, output_size, drop_factor=0.0)) + ) model = chain(tok2vec, output_layer) model = masked_language_model(vocab, model) model.tok2vec = tok2vec @@ -861,6 +866,8 @@ class ClozeMultitask(Pipe): self.vocab = vocab self.model = model self.cfg = cfg + self.cfg.setdefault("objective", "characters") + self.cfg.setdefault("nr_char", 4) def set_annotations(self, docs, dep_ids, tensors=None): pass @@ -869,7 +876,8 @@ class ClozeMultitask(Pipe): tok2vec=None, sgd=None, **kwargs): link_vectors_to_models(self.vocab) if self.model is True: - self.model = self.Model(self.vocab, tok2vec) + kwargs.update(self.cfg) + self.model = self.Model(self.vocab, tok2vec, **kwargs) X = self.model.ops.allocate((5, self.model.tok2vec.nO)) self.model.output_layer.begin_training(X) if sgd is None: @@ -883,13 +891,16 @@ class ClozeMultitask(Pipe): return tokvecs, vectors def get_loss(self, docs, vectors, prediction): - # The simplest way to implement this would be to vstack the - # token.vector values, but that's a bit inefficient, especially on GPU. - # Instead we fetch the index into the vectors table for each of our tokens, - # and look them up all at once. This prevents data copying. 
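The `get_loss` change continuing below dispatches to `get_characters_loss` when the objective is `"characters"`, matching the `MultiSoftmax([256] * nr_char)` head: one 256-way softmax per UTF-8 byte slot. A self-contained sketch of what that loss computes, with a numpy one-hot standing in for thinc's `to_categorical` (shapes assumed from the `_ml.py` addition above):

```python
import numpy

def characters_loss_sketch(prediction, target_ids, nr_char=10, n_classes=256):
    # prediction: (n_tokens, 256 * nr_char) scores from the MultiSoftmax head.
    # target_ids: (n_tokens, nr_char) UTF-8 byte ids from Doc.to_utf8_array.
    target = numpy.eye(n_classes, dtype="f")[target_ids.reshape((-1,))]
    target = target.reshape((-1, n_classes * nr_char))
    diff = prediction - target
    loss = (diff ** 2).sum()
    # Squared-error gradient, scaled by the number of rows as in the patch.
    d_target = diff / float(prediction.shape[0])
    return loss, d_target

prediction = numpy.zeros((2, 256 * 10), dtype="f")
target_ids = numpy.zeros((2, 10), dtype="i")
loss, grad = characters_loss_sketch(prediction, target_ids)
assert grad.shape == prediction.shape
```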
- ids = self.model.ops.flatten([doc.to_array(ID).ravel() for doc in docs]) - target = vectors[ids] - loss, gradient = get_cossim_loss(prediction, target, ignore_zeros=True) + if self.cfg["objective"] == "characters": + loss, gradient = get_characters_loss(self.model.ops, docs, prediction) + else: + # The simplest way to implement this would be to vstack the + # token.vector values, but that's a bit inefficient, especially on GPU. + # Instead we fetch the index into the vectors table for each of our tokens, + # and look them up all at once. This prevents data copying. + ids = self.model.ops.flatten([doc.to_array(ID).ravel() for doc in docs]) + target = vectors[ids] + loss, gradient = get_cossim_loss(prediction, target, ignore_zeros=True) return float(loss), gradient def update(self, docs, golds, drop=0., sgd=None, losses=None): @@ -906,6 +917,20 @@ class ClozeMultitask(Pipe): if losses is not None: losses[self.name] += loss + @staticmethod + def decode_utf8_predictions(char_array): + # The format alternates filling from start and end, and 255 is missing + words = [] + char_array = char_array.reshape((char_array.shape[0], -1, 256)) + nr_char = char_array.shape[1] + char_array = char_array.argmax(axis=-1) + for row in char_array: + starts = [chr(c) for c in row[::2] if c != 255] + ends = [chr(c) for c in row[1::2] if c != 255] + word = "".join(starts + list(reversed(ends))) + words.append(word) + return words + @component("textcat", assigns=["doc.cats"]) class TextCategorizer(Pipe): @@ -1069,6 +1094,7 @@ cdef class DependencyParser(Parser): assigns = ["token.dep", "token.is_sent_start", "doc.sents"] requires = [] TransitionSystem = ArcEager + nr_feature = 8 @property def postprocesses(self): diff --git a/spacy/tests/regression/test_issue2501-3000.py b/spacy/tests/regression/test_issue2501-3000.py index 1f5e44499..622fc3635 100644 --- a/spacy/tests/regression/test_issue2501-3000.py +++ b/spacy/tests/regression/test_issue2501-3000.py @@ -59,7 +59,7 @@ def test_issue2626_2835(en_tokenizer, text): def test_issue2656(en_tokenizer): - """Test that tokenizer correctly splits of punctuation after numbers with + """Test that tokenizer correctly splits off punctuation after numbers with decimal points. """ doc = en_tokenizer("I went for 40.3, and got home by 10.0.") diff --git a/spacy/tests/regression/test_issue3001-3500.py b/spacy/tests/regression/test_issue3001-3500.py index effbebb92..a10225390 100644 --- a/spacy/tests/regression/test_issue3001-3500.py +++ b/spacy/tests/regression/test_issue3001-3500.py @@ -121,6 +121,7 @@ def test_issue3248_1(): assert len(matcher) == 2 +@pytest.mark.skipif(is_python2, reason="Can't pickle instancemethod for is_base_form") def test_issue3248_2(): """Test that the PhraseMatcher can be pickled correctly.""" nlp = English() diff --git a/website/docs/api/cli.md b/website/docs/api/cli.md index fe8877c69..779fa7695 100644 --- a/website/docs/api/cli.md +++ b/website/docs/api/cli.md @@ -473,7 +473,7 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] | `--use-chars`, `-chr` 2.2.2 | flag | Whether to use character-based embedding. | | `--sa-depth`, `-sa` 2.2.2 | option | Depth of self-attention layers. | | `--embed-rows`, `-er` | option | Number of embedding rows. | -| `--loss-func`, `-L` | option | Loss function to use for the objective. Either `"L2"` or `"cosine"`. | +| `--loss-func`, `-L` | option | Loss function to use for the objective. Either `"cosine"`, `"L2"` or `"characters"`. | | `--dropout`, `-d` | option | Dropout rate. 
| | `--batch-size`, `-bs` | option | Number of words per training batch. | | `--max-length`, `-xw` | option | Maximum words per example. Longer examples are discarded. | diff --git a/website/meta/universe.json b/website/meta/universe.json index 2c74a2964..c5eb96e43 100644 --- a/website/meta/universe.json +++ b/website/meta/universe.json @@ -1,5 +1,58 @@ { "resources": [ + { + "id": "spacy-streamlit", + "title": "spacy-streamlit", + "slogan": "spaCy building blocks for Streamlit apps", + "github": "explosion/spacy-streamlit", + "description": "This package contains utilities for visualizing spaCy models and building interactive spaCy-powered apps with [Streamlit](https://streamlit.io). It includes various building blocks you can use in your own Streamlit app, like visualizers for **syntactic dependencies**, **named entities**, **text classification**, **semantic similarity** via word vectors, token attributes, and more.", + "pip": "spacy-streamlit", + "category": ["visualizers"], + "thumb": "https://i.imgur.com/mhEjluE.jpg", + "image": "https://user-images.githubusercontent.com/13643239/85388081-f2da8700-b545-11ea-9bd4-e303d3c5763c.png", + "code_example": [ + "import spacy_streamlit", + "", + "models = [\"en_core_web_sm\", \"en_core_web_md\"]", + "default_text = \"Sundar Pichai is the CEO of Google.\"", + "spacy_streamlit.visualize(models, default_text)" + ], + "author": "Ines Montani", + "author_links": { + "twitter": "_inesmontani", + "github": "ines", + "website": "https://ines.io" + } + }, + { + "id": "spaczz", + "title": "spaczz", + "slogan": "Fuzzy matching and more for spaCy.", + "description": "Spaczz provides fuzzy matching and multi-token regex matching functionality for spaCy. Spaczz's components have similar APIs to their spaCy counterparts and spaczz pipeline components can integrate into spaCy pipelines where they can be saved/loaded as models.", + "github": "gandersen101/spaczz", + "pip": "spaczz", + "code_example": [ + "import spacy", + "from spaczz.pipeline import SpaczzRuler", + "", + "nlp = spacy.blank('en')", + "ruler = SpaczzRuler(nlp)", + "ruler.add_patterns([{'label': 'PERSON', 'pattern': 'Bill Gates', 'type': 'fuzzy'}])", + "nlp.add_pipe(ruler)", + "", + "doc = nlp('Oops, I spelled Bill Gatez wrong.')", + "print([(ent.text, ent.start, ent.end, ent.label_) for ent in doc.ents])" + ], + "code_language": "python", + "url": "https://spaczz.readthedocs.io/en/latest/", + "author": "Grant Andersen", + "author_links": { + "twitter": "gandersen101", + "github": "gandersen101" + }, + "category": ["pipeline"], + "tags": ["fuzzy-matching", "regex"] + }, { "id": "spacy-universal-sentence-encoder", "title": "SpaCy - Universal Sentence Encoder", @@ -1237,6 +1290,19 @@ "youtube": "K1elwpgDdls", "category": ["videos"] }, + { + "type": "education", + "id": "video-spacy-course-es", + "title": "NLP avanzado con spaCy · Un curso en línea gratis", + "description": "spaCy es un paquete moderno de Python para hacer Procesamiento de Lenguaje Natural de potencia industrial.
En este curso en línea, interactivo y gratuito, aprenderás a usar spaCy para construir sistemas avanzados de comprensión de lenguaje natural usando enfoques basados en reglas y en machine learning.", + "url": "https://course.spacy.io/es", + "author": "Camila Gutiérrez", + "author_links": { + "twitter": "Mariacamilagl30" + }, + "youtube": "RNiLVCE5d4k", + "category": ["videos"] + }, { "type": "education", "id": "video-intro-to-nlp-episode-1", @@ -1293,6 +1359,20 @@ "youtube": "IqOJU1-_Fi0", "category": ["videos"] }, + { + "type": "education", + "id": "video-intro-to-nlp-episode-5", + "title": "Intro to NLP with spaCy (5)", + "slogan": "Episode 5: Rules vs. Machine Learning", + "description": "In this new video series, data science instructor Vincent Warmerdam gets started with spaCy, an open-source library for Natural Language Processing in Python. His mission: building a system to automatically detect programming languages in large volumes of text. Follow his process from the first idea to a prototype all the way to data collection and training a statistical named entity recognition model from scratch.", + "author": "Vincent Warmerdam", + "author_links": { + "twitter": "fishnets88", + "github": "koaning" + }, + "youtube": "f4sqeLRzkPg", + "category": ["videos"] + }, { "type": "education", "id": "video-spacy-irl-entity-linking", @@ -2347,6 +2427,32 @@ }, "category": ["pipeline", "conversational", "research"], "tags": ["spell check", "correction", "preprocessing", "translation", "correction"] + }, + { + "id": "texthero", + "title": "Texthero", + "slogan": "Text preprocessing, representation and visualization from zero to hero.", + "description": "Texthero is a Python package to work with text data efficiently. It empowers NLP developers with a tool to quickly understand any text-based dataset, and it provides a solid pipeline to clean and represent text data, from zero to hero.", + "github": "jbesomi/texthero", + "pip": "texthero", + "code_example": [ + "import texthero as hero", + "import pandas as pd", + "", + "df = pd.read_csv('https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv')", + "df['named_entities'] = hero.named_entities(df['text'])", + "df.head()" + ], + "code_language": "python", + "url": "https://texthero.org", + "thumb": "https://texthero.org/img/T.png", + "image": "https://texthero.org/docs/assets/texthero.png", + "author": "Jonathan Besomi", + "author_links": { + "github": "jbesomi", + "website": "https://besomi.ai" + }, + "category": ["standalone"] } ],
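For reference on the byte layout that `decode_utf8_predictions` (added to `ClozeMultitask` in the `pipes.pyx` hunk above) unpacks: even slots hold characters counted from the start of the word, odd slots characters counted from the end, and 255 marks an empty slot. A round-trip sketch of that layout — the encoder here is a hand-rolled stand-in for `Doc.to_utf8_array` and assumes an ASCII word with at least `nr_char` characters:

```python
import numpy

def encode_codes(word, nr_char=8):
    # codes[2*j] is the j-th character from the start, codes[2*j+1] the j-th
    # from the end; 255 would mark a slot with no character.
    codes = [255] * nr_char
    for j in range(nr_char // 2):
        codes[2 * j] = ord(word[j])
        codes[2 * j + 1] = ord(word[-(j + 1)])
    return codes

word = "pretraining"
scores = numpy.zeros((1, 8 * 256), dtype="f")
for j, c in enumerate(encode_codes(word)):
    scores[0, j * 256 + c] = 1.0  # one-hot "prediction" for each byte slot

# Mirror decode_utf8_predictions: argmax per slot, then stitch the starts
# onto the reversed ends.
row = scores.reshape((1, -1, 256)).argmax(axis=-1)[0]
starts = [chr(c) for c in row[::2] if c != 255]
ends = [chr(c) for c in row[1::2] if c != 255]
# Only the first and last nr_char // 2 characters survive the round trip.
assert "".join(starts + list(reversed(ends))) == "pretning"
```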