diff --git a/.github/contributors/emulbreh.md b/.github/contributors/emulbreh.md new file mode 100644 index 000000000..60388d22a --- /dev/null +++ b/.github/contributors/emulbreh.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. 
This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. 
You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Johannes Dollinger | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 2018-02-13 | +| GitHub username | emulbreh | +| Website (optional) | | diff --git a/.github/contributors/enerrio.md b/.github/contributors/enerrio.md new file mode 100644 index 000000000..85ed022ce --- /dev/null +++ b/.github/contributors/enerrio.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). 
+The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. 
This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. 
You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Aaron Marquez | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 2/15/2018 | +| GitHub username | enerrio | +| Website (optional) | | diff --git a/.github/contributors/oxinabox.md b/.github/contributors/oxinabox.md new file mode 100644 index 000000000..8e58c4ea1 --- /dev/null +++ b/.github/contributors/oxinabox.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). 
+The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. 
This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. 
You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Lyndon White | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 9/2/2018 | +| GitHub username | oxinabox | +| Website (optional) | white.ucc.asn.au | diff --git a/.github/contributors/ursachec.md b/.github/contributors/ursachec.md new file mode 100644 index 000000000..45a85f166 --- /dev/null +++ b/.github/contributors/ursachec.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). 
+The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. 
This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. 
You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. 
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | ------------------------- | +| Name | Claudiu-Vlad Ursache | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 2018-02-04 | +| GitHub username | ursachec | +| Website (optional) | https://www.cvursache.com | diff --git a/spacy/attrs.pxd b/spacy/attrs.pxd index 74397fa64..79a177ba9 100644 --- a/spacy/attrs.pxd +++ b/spacy/attrs.pxd @@ -18,9 +18,9 @@ cdef enum attr_id_t: IS_QUOTE IS_LEFT_PUNCT IS_RIGHT_PUNCT + IS_CURRENCY - FLAG18 = 18 - FLAG19 + FLAG19 = 19 FLAG20 FLAG21 FLAG22 diff --git a/spacy/attrs.pyx b/spacy/attrs.pyx index 893ec0845..d4e8a38c5 100644 --- a/spacy/attrs.pyx +++ b/spacy/attrs.pyx @@ -21,7 +21,7 @@ IDS = { "IS_QUOTE": IS_QUOTE, "IS_LEFT_PUNCT": IS_LEFT_PUNCT, "IS_RIGHT_PUNCT": IS_RIGHT_PUNCT, - "FLAG18": FLAG18, + "IS_CURRENCY": IS_CURRENCY, "FLAG19": FLAG19, "FLAG20": FLAG20, "FLAG21": FLAG21, diff --git a/spacy/cli/evaluate.py b/spacy/cli/evaluate.py index 551689413..43edd858d 100644 --- a/spacy/cli/evaluate.py +++ b/spacy/cli/evaluate.py @@ -3,8 +3,6 @@ from __future__ import unicode_literals, division, print_function import plac from timeit import default_timer as timer -import random -import numpy.random from ..gold import GoldCorpus from ..util import prints @@ -12,10 +10,6 @@ from .. import util from .. import displacy -random.seed(0) -numpy.random.seed(0) - - @plac.annotations( model=("model name or path", "positional", None, str), data_path=("location of JSON-formatted evaluation data", "positional", @@ -31,6 +25,8 @@ def evaluate(model, data_path, gpu_id=-1, gold_preproc=False, displacy_path=None Evaluate a model. To render a sample of parses in a HTML file, set an output directory as the displacy_path argument. 
""" + + util.fix_random_seed() if gpu_id >= 0: util.use_gpu(gpu_id) util.set_env_log(False) diff --git a/spacy/cli/train.py b/spacy/cli/train.py index f8363bde1..6c7b95682 100644 --- a/spacy/cli/train.py +++ b/spacy/cli/train.py @@ -6,8 +6,6 @@ from pathlib import Path import tqdm from thinc.neural._classes.model import Model from timeit import default_timer as timer -import random -import numpy.random from ..gold import GoldCorpus, minibatch from ..util import prints @@ -16,9 +14,6 @@ from .. import about from .. import displacy from ..compat import json_dumps -random.seed(0) -numpy.random.seed(0) - @plac.annotations( lang=("model language", "positional", None, str), @@ -45,6 +40,7 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0, """ Train a model. Expects data in spaCy's JSON format. """ + util.fix_random_seed() util.set_env_log(True) n_sents = n_sents or None output_path = util.ensure_path(output_dir) diff --git a/spacy/compat.py b/spacy/compat.py index e50036013..3cc214b28 100644 --- a/spacy/compat.py +++ b/spacy/compat.py @@ -43,15 +43,15 @@ fix_text = ftfy.fix_text copy_array = copy_array izip = getattr(itertools, 'izip', zip) -is_python2 = six.PY2 -is_python3 = six.PY3 is_windows = sys.platform.startswith('win') is_linux = sys.platform.startswith('linux') is_osx = sys.platform == 'darwin' +is_python2 = six.PY2 +is_python3 = six.PY3 +is_python_pre_3_5 = is_python2 or (is_python3 and sys.version_info[1]<5) if is_python2: - import imp bytes_ = str unicode_ = unicode # noqa: F821 basestring_ = basestring # noqa: F821 @@ -60,7 +60,6 @@ if is_python2: path2str = lambda path: str(path).decode('utf8') elif is_python3: - import importlib.util bytes_ = bytes unicode_ = str basestring_ = str @@ -111,9 +110,11 @@ def normalize_string_keys(old): def import_file(name, loc): loc = str(loc) - if is_python2: + if is_python_pre_3_5: + import imp return imp.load_source(name, loc) else: + import importlib.util spec = 
importlib.util.spec_from_file_location(name, str(loc)) module = importlib.util.module_from_spec(spec) spec.loader.exec_module(module) diff --git a/spacy/glossary.py b/spacy/glossary.py index c17cb7467..02d8815e0 100644 --- a/spacy/glossary.py +++ b/spacy/glossary.py @@ -115,7 +115,7 @@ GLOSSARY = { 'ADJA': 'adjective, attributive', 'ADJD': 'adjective, adverbial or predicative', 'APPO': 'postposition', - 'APRP': 'preposition; circumposition left', + 'APPR': 'preposition; circumposition left', 'APPRART': 'preposition with article', 'APZR': 'circumposition right', 'ART': 'definite or indefinite article', diff --git a/spacy/lang/lex_attrs.py b/spacy/lang/lex_attrs.py index c3bb4a8ff..f1279f035 100644 --- a/spacy/lang/lex_attrs.py +++ b/spacy/lang/lex_attrs.py @@ -69,6 +69,14 @@ def is_right_punct(text): return text in right_punct +def is_currency(text): + # can be overwritten by lang with list of currency words, e.g. dollar, euro + for char in text: + if unicodedata.category(char) != 'Sc': + return False + return True + + def like_email(text): return bool(_like_email(text)) @@ -164,5 +172,6 @@ LEX_ATTRS = { attrs.IS_QUOTE: is_quote, attrs.IS_LEFT_PUNCT: is_left_punct, attrs.IS_RIGHT_PUNCT: is_right_punct, + attrs.IS_CURRENCY: is_currency, attrs.LIKE_URL: like_url } diff --git a/spacy/language.py b/spacy/language.py index a2b945c49..bd1e8d012 100644 --- a/spacy/language.py +++ b/spacy/language.py @@ -624,7 +624,7 @@ class Language(object): deserializers = OrderedDict(( ('vocab', lambda p: self.vocab.from_disk(p)), ('tokenizer', lambda p: self.tokenizer.from_disk(p, vocab=False)), - ('meta.json', lambda p: self.meta.update(ujson.load(p.open('r')))) + ('meta.json', lambda p: self.meta.update(util.read_json(p))) )) for name, proc in self.pipeline: if name in disable: @@ -720,5 +720,5 @@ class DisabledPipes(list): def _pipe(func, docs): for doc in docs: - func(doc) + doc = func(doc) yield doc diff --git a/spacy/lexeme.pyx b/spacy/lexeme.pyx index d136540f9..78d3bed6c 100644 
--- a/spacy/lexeme.pyx +++ b/spacy/lexeme.pyx @@ -12,7 +12,7 @@ import numpy from .typedefs cimport attr_t, flags_t from .attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE from .attrs cimport IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL, IS_STOP -from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT, IS_OOV +from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT, IS_CURRENCY, IS_OOV from .attrs cimport PROB from .attrs import intify_attrs from . import about @@ -474,6 +474,14 @@ cdef class Lexeme: def __set__(self, bint x): Lexeme.c_set_flag(self.c, IS_RIGHT_PUNCT, x) + property is_currency: + """RETURNS (bool): Whether the lexeme is a currency symbol, e.g. $, €.""" + def __get__(self): + return Lexeme.c_check_flag(self.c, IS_CURRENCY) + + def __set__(self, bint x): + Lexeme.c_set_flag(self.c, IS_CURRENCY, x) + property like_url: """RETURNS (bool): Whether the lexeme resembles a URL.""" def __get__(self): diff --git a/spacy/pipeline.pyx b/spacy/pipeline.pyx index c5f8065de..e826ee0d6 100644 --- a/spacy/pipeline.pyx +++ b/spacy/pipeline.pyx @@ -144,7 +144,8 @@ class Pipe(object): return create_default_optimizer(self.model.ops, **self.cfg.get('optimizer', {})) - def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None): + def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None, + **kwargs): """Initialize the pipe for training, using data exampes if available. 
If no model has been initialized yet, the model is added.""" if self.model is True: @@ -214,7 +215,8 @@ class Pipe(object): def _load_cfg(path): if path.exists(): - return ujson.load(path.open()) + with path.open() as file_: + return ujson.load(file_) else: return {} @@ -344,7 +346,8 @@ class Tensorizer(Pipe): loss = (d_scores**2).sum() return loss, d_scores - def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None): + def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None, + **kwargs): """Allocate models, pre-process training data and acquire an optimizer. @@ -467,7 +470,8 @@ class Tagger(Pipe): d_scores = self.model.ops.unflatten(d_scores, [len(d) for d in docs]) return float(loss), d_scores - def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None): + def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None, + **kwargs): orig_tag_map = dict(self.vocab.morphology.tag_map) new_tag_map = OrderedDict() for raw_text, annots_brackets in gold_tuples: @@ -580,7 +584,8 @@ class Tagger(Pipe): def load_model(p): if self.model is True: self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg) - self.model.from_bytes(p.open('rb').read()) + with p.open('rb') as file_: + self.model.from_bytes(file_.read()) def load_tag_map(p): with p.open('rb') as file_: @@ -641,7 +646,7 @@ class MultitaskObjective(Tagger): pass def begin_training(self, gold_tuples=tuple(), pipeline=None, tok2vec=None, - sgd=None): + sgd=None, **kwargs): gold_tuples = nonproj.preprocess_training_data(gold_tuples) for raw_text, annots_brackets in gold_tuples: for annots, brackets in annots_brackets: @@ -766,7 +771,7 @@ class SimilarityHook(Pipe): def update(self, doc1_doc2, golds, sgd=None, drop=0.): sims, bp_sims = self.model.begin_update(doc1_doc2, drop=drop) - def begin_training(self, _=tuple(), pipeline=None, sgd=None): + def begin_training(self, _=tuple(), pipeline=None, sgd=None, **kwargs): """Allocate model, using width from tensorizer in 
pipeline. gold_tuples (iterable): Gold-standard training data. @@ -887,6 +892,7 @@ cdef class DependencyParser(Parser): self._multitasks.append(labeller) def init_multitask_objectives(self, gold_tuples, pipeline, sgd=None, **cfg): + self.add_multitask_objective('tag') for labeller in self._multitasks: tok2vec = self.model[0] labeller.begin_training(gold_tuples, pipeline=pipeline, diff --git a/spacy/symbols.pxd b/spacy/symbols.pxd index 6960681a3..cc1734e6d 100644 --- a/spacy/symbols.pxd +++ b/spacy/symbols.pxd @@ -17,9 +17,9 @@ cdef enum symbol_t: IS_QUOTE IS_LEFT_PUNCT IS_RIGHT_PUNCT + IS_CURRENCY - FLAG18 = 18 - FLAG19 + FLAG19 = 19 FLAG20 FLAG21 FLAG22 diff --git a/spacy/symbols.pyx b/spacy/symbols.pyx index 98e4c440d..4bc1d4228 100644 --- a/spacy/symbols.pyx +++ b/spacy/symbols.pyx @@ -22,8 +22,8 @@ IDS = { "IS_QUOTE": IS_QUOTE, "IS_LEFT_PUNCT": IS_LEFT_PUNCT, "IS_RIGHT_PUNCT": IS_RIGHT_PUNCT, + "IS_CURRENCY": IS_CURRENCY, - "FLAG18": FLAG18, "FLAG19": FLAG19, "FLAG20": FLAG20, "FLAG21": FLAG21, diff --git a/spacy/syntax/arc_eager.pyx b/spacy/syntax/arc_eager.pyx index 16d55db24..190155269 100644 --- a/spacy/syntax/arc_eager.pyx +++ b/spacy/syntax/arc_eager.pyx @@ -390,6 +390,22 @@ cdef class ArcEager(TransitionSystem): gold.c.labels[i] = self.strings.add(label) return gold + def get_beam_parses(self, Beam beam): + parses = [] + probs = beam.probs + for i in range(beam.size): + state = beam.at(i) + if state.is_final(): + self.finalize_state(state) + prob = probs[i] + parse = [] + for j in range(state.length): + head = state.H(j) + label = self.strings[state._sent[j].dep] + parse.append((head, j, label)) + parses.append((prob, parse)) + return parses + cdef Transition lookup_transition(self, object name) except *: if '-' in name: move_str, label_str = name.split('-', 1) diff --git a/spacy/syntax/nn_parser.pyx b/spacy/syntax/nn_parser.pyx index fa91c697e..a4647c159 100644 --- a/spacy/syntax/nn_parser.pyx +++ b/spacy/syntax/nn_parser.pyx @@ -835,7 +835,8 @@ cdef 
class Parser: sgd = self.create_optimizer() self.model[1].begin_training( self.model[1].ops.allocate((5, cfg['token_vector_width']))) - self.init_multitask_objectives(gold_tuples, pipeline, sgd=sgd, **cfg) + if pipeline is not None: + self.init_multitask_objectives(gold_tuples, pipeline, sgd=sgd, **cfg) link_vectors_to_models(self.vocab) else: if sgd is None: @@ -887,7 +888,7 @@ cdef class Parser: deserializers = { 'vocab': lambda p: self.vocab.from_disk(p), 'moves': lambda p: self.moves.from_disk(p, strings=False), - 'cfg': lambda p: self.cfg.update(ujson.load(p.open())), + 'cfg': lambda p: self.cfg.update(util.read_json(p)), 'model': lambda p: None } util.from_disk(path, deserializers, exclude) diff --git a/spacy/tests/lang/test_attrs.py b/spacy/tests/lang/test_attrs.py index 92ee04737..67485ee60 100644 --- a/spacy/tests/lang/test_attrs.py +++ b/spacy/tests/lang/test_attrs.py @@ -2,7 +2,7 @@ from __future__ import unicode_literals from ...attrs import intify_attrs, ORTH, NORM, LEMMA, IS_ALPHA -from ...lang.lex_attrs import is_punct, is_ascii, like_url, word_shape +from ...lang.lex_attrs import is_punct, is_ascii, is_currency, like_url, word_shape import pytest @@ -37,6 +37,13 @@ def test_lex_attrs_is_ascii(text, match): assert is_ascii(text) == match +@pytest.mark.parametrize('text,match', [('$', True), ('£', True), ('♥', False), + ('€', True), ('¥', True), ('¢', True), + ('a', False), ('www.google.com', False), ('dog', False)]) +def test_lex_attrs_is_currency(text, match): + assert is_currency(text) == match + + @pytest.mark.parametrize('text,match', [ ('www.google.com', True), ('google.com', True), ('sydney.com', True), ('2girls1cup.org', True), ('http://stupid', True), ('www.hi', True), diff --git a/spacy/tests/regression/test_issue1959.py b/spacy/tests/regression/test_issue1959.py new file mode 100644 index 000000000..0787af3b7 --- /dev/null +++ b/spacy/tests/regression/test_issue1959.py @@ -0,0 +1,23 @@ +# coding: utf8 +from __future__ import 
unicode_literals +import pytest + + +@pytest.mark.models('en') +def test_issue1959(EN): + texts = ['Apple is looking at buying U.K. startup for $1 billion.'] + # nlp = load_test_model('en_core_web_sm') + EN.add_pipe(clean_component, name='cleaner', after='ner') + doc = EN(texts[0]) + doc_pipe = [doc_pipe for doc_pipe in EN.pipe(texts)] + assert doc == doc_pipe[0] + + +def clean_component(doc): + """ Clean up text. Make lowercase and remove punctuation and stopwords """ + # Remove punctuation, symbols (#) and stopwords + doc = [tok.text.lower() for tok in doc if (not tok.is_stop + and tok.pos_ != 'PUNCT' and + tok.pos_ != 'SYM')] + doc = ' '.join(doc) + return doc diff --git a/spacy/tests/serialize/test_serialize_language.py b/spacy/tests/serialize/test_serialize_language.py new file mode 100644 index 000000000..1fcf8ef18 --- /dev/null +++ b/spacy/tests/serialize/test_serialize_language.py @@ -0,0 +1,28 @@ +# coding: utf-8 +from __future__ import unicode_literals + +from ..util import make_tempdir +from ...language import Language + +import pytest + + +@pytest.fixture +def meta_data(): + return { + 'name': 'name-in-fixture', + 'version': 'version-in-fixture', + 'description': 'description-in-fixture', + 'author': 'author-in-fixture', + 'email': 'email-in-fixture', + 'url': 'url-in-fixture', + 'license': 'license-in-fixture', + } + + +def test_serialize_language_meta_disk(meta_data): + language = Language(meta=meta_data) + with make_tempdir() as d: + language.to_disk(d) + new_language = Language().from_disk(d) + assert new_language.meta == language.meta diff --git a/spacy/tokens/token.pyx b/spacy/tokens/token.pyx index 74487b515..9e4b878cf 100644 --- a/spacy/tokens/token.pyx +++ b/spacy/tokens/token.pyx @@ -15,7 +15,7 @@ from ..lexeme cimport Lexeme from .. 
import parts_of_speech from ..attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE from ..attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT -from ..attrs cimport IS_OOV, IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL +from ..attrs cimport IS_OOV, IS_TITLE, IS_UPPER, IS_CURRENCY, LIKE_URL, LIKE_NUM, LIKE_EMAIL from ..attrs cimport IS_STOP, ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX from ..attrs cimport LENGTH, CLUSTER, LEMMA, POS, TAG, DEP from ..compat import is_config @@ -855,6 +855,11 @@ cdef class Token: def __get__(self): return Lexeme.c_check_flag(self.c.lex, IS_RIGHT_PUNCT) + property is_currency: + """RETURNS (bool): Whether the token is a currency symbol.""" + def __get__(self): + return Lexeme.c_check_flag(self.c.lex, IS_CURRENCY) + property like_url: """RETURNS (bool): Whether the token resembles a URL.""" def __get__(self): diff --git a/spacy/util.py b/spacy/util.py index 7676b33b2..dc51e467d 100644 --- a/spacy/util.py +++ b/spacy/util.py @@ -17,6 +17,7 @@ from thinc.neural._classes.model import Model import functools import cytoolz import itertools +import numpy.random from .symbols import ORTH from .compat import cupy, CudaStream, path2str, basestring_, input_, unicode_ @@ -623,3 +624,8 @@ def use_gpu(gpu_id): Model.ops = CupyOps() Model.Ops = CupyOps return device + + +def fix_random_seed(seed=0): + random.seed(seed) + numpy.random.seed(seed) diff --git a/spacy/vectors.pyx b/spacy/vectors.pyx index 079f6fc84..7daebabe6 100644 --- a/spacy/vectors.pyx +++ b/spacy/vectors.pyx @@ -347,7 +347,8 @@ cdef class Vectors: """ def load_key2row(path): if path.exists(): - self.key2row = msgpack.load(path.open('rb')) + with path.open('rb') as file_: + self.key2row = msgpack.load(file_) for key, row in self.key2row.items(): if row in self._unset: self._unset.remove(row) diff --git a/website/_includes/_navigation.jade b/website/_includes/_navigation.jade index e5837747f..8ce5e394b 100644 --- 
a/website/_includes/_navigation.jade +++ b/website/_includes/_navigation.jade @@ -10,6 +10,9 @@ nav.c-nav.u-text.js-nav(class=landing ? "c-nav--theme" : null) li.c-nav__menu__item(class=is_active ? "is-active" : null) +a(url)(tabindex=is_active ? "-1" : null)=item + li.c-nav__menu__item.u-hidden-xs + +a("https://survey.spacy.io", true) User Survey 2018 + li.c-nav__menu__item.u-hidden-xs +a(gh("spaCy"))(aria-label="GitHub") #[+icon("github", 20)] diff --git a/website/usage/_facts-figures/_benchmarks.jade b/website/usage/_facts-figures/_benchmarks.jade index b530b84de..dabf58795 100644 --- a/website/usage/_facts-figures/_benchmarks.jade +++ b/website/usage/_facts-figures/_benchmarks.jade @@ -13,7 +13,7 @@ p | Their results and subsequent discussions helped us develop a novel | psychologically-motivated technique to improve spaCy's accuracy, which | we published in joint work with Macquarie University - | #[+a("https://aclweb.org/anthology/D/D15/D15-1162.pdf") (Honnibal and Johnson, 2015)]. + | #[+a("https://www.aclweb.org/anthology/D/D15/D15-1162.pdf") (Honnibal and Johnson, 2015)]. include _benchmarks-choi-2015 diff --git a/website/usage/_install/_troubleshooting.jade b/website/usage/_install/_troubleshooting.jade index c846ff957..2135f323a 100644 --- a/website/usage/_install/_troubleshooting.jade +++ b/website/usage/_install/_troubleshooting.jade @@ -38,9 +38,10 @@ p | #[code spacy/data] directory. This means your user needs permission to do | this. The above error mostly occurs when doing a system-wide installation, | which will create the symlinks in a system directory. Run the - | #[code download] or #[code link] command as administrator, or use a - | #[code virtualenv] to install spaCy in a user directory, instead - | of doing a system-wide installation. 
+ | #[code download] or #[code link] command as administrator (on Windows, + | simply right-click on your terminal or shell and select "Run as + | Administrator"), or use a #[code virtualenv] to install spaCy in a user + | directory, instead of doing a system-wide installation. +h(3, "no-cache-dir") No such option: --no-cache-dir diff --git a/website/usage/_linguistic-features/_dependency-parse.jade b/website/usage/_linguistic-features/_dependency-parse.jade index 188b7b8f3..d8d7cbce1 100644 --- a/website/usage/_linguistic-features/_dependency-parse.jade +++ b/website/usage/_linguistic-features/_dependency-parse.jade @@ -65,9 +65,9 @@ p - var style = [0, 1, 0, 1, 0] +annotation-row(["Autonomous", "amod", "cars", "NOUN", ""], style) +annotation-row(["cars", "nsubj", "shift", "VERB", "Autonomous"], style) - +annotation-row(["shift", "ROOT", "shift", "VERB", "cars, liability"], style) + +annotation-row(["shift", "ROOT", "shift", "VERB", "cars, liability, toward"], style) +annotation-row(["insurance", "compound", "liability", "NOUN", ""], style) - +annotation-row(["liability", "dobj", "shift", "VERB", "insurance, toward"], style) + +annotation-row(["liability", "dobj", "shift", "VERB", "insurance"], style) +annotation-row(["toward", "prep", "liability", "NOUN", "manufacturers"], style) +annotation-row(["manufacturers", "pobj", "toward", "ADP", ""], style) diff --git a/website/usage/_linguistic-features/_named-entities.jade b/website/usage/_linguistic-features/_named-entities.jade index 9e55ba84e..0f32d1da3 100644 --- a/website/usage/_linguistic-features/_named-entities.jade +++ b/website/usage/_linguistic-features/_named-entities.jade @@ -80,7 +80,7 @@ p doc.ents = [netflix_ent] ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents] - assert ents = [(u'Netflix', 0, 7, u'ORG')] + assert ents == [(u'Netflix', 0, 7, u'ORG')] p | Keep in mind that you need to create a #[code Span] with the start and diff --git 
a/website/usage/_linguistic-features/_rule-based-matching.jade b/website/usage/_linguistic-features/_rule-based-matching.jade index 7872b668f..e1a7c8a81 100644 --- a/website/usage/_linguistic-features/_rule-based-matching.jade +++ b/website/usage/_linguistic-features/_rule-based-matching.jade @@ -54,10 +54,21 @@ p p | The matcher returns a list of #[code (match_id, start, end)] tuples – in - | this case, #[code [('HelloWorld', 0, 2)]], which maps to the span - | #[code doc[0:2]] of our original document. Optionally, we could also - | choose to add more than one pattern, for example to also match sequences - | without punctuation between "hello" and "world": + | this case, #[code [('15578876784678163569', 0, 2)]], which maps to the + | span #[code doc[0:2]] of our original document. The #[code match_id] + | is the #[+a("/usage/spacy-101#vocab") hash value] of the string ID + | "HelloWorld". To get the string value, you can look up the ID + | in the #[+api("stringstore") #[code StringStore]]. + ++code. + for match_id, start, end in matches: + string_id = nlp.vocab.strings[match_id] # 'HelloWorld' + span = doc[start:end] # the matched span + +p + | Optionally, we could also choose to add more than one pattern, for + | example to also match sequences without punctuation between "hello" and + | "world": +code. matcher.add('HelloWorld', None, @@ -91,6 +102,10 @@ p +cell.u-nowrap #[code LOWER] +cell The lowercase form of the token text. + +row + +cell #[code LENGTH] + +cell The length of the token text. + +row +cell.u-nowrap #[code IS_ALPHA], #[code IS_ASCII], #[code IS_DIGIT] +cell @@ -117,6 +132,10 @@ p | The token's simple and extended part-of-speech tag, dependency | label, lemma, shape. + +row + +cell.u-nowrap #[code ENT_TYPE] + +cell The token's entity label. + +h(4, "adding-patterns-wildcard") Using wildcard token patterns +tag-new(2) @@ -335,7 +354,8 @@ p | flag. +code. 
- IS_DEFINITELY = nlp.vocab.add_flag(re.compile(r'deff?in[ia]tely').match) + definitely_flag = lambda text: bool(re.compile(r'deff?in[ia]tely').match(text)) + IS_DEFINITELY = nlp.vocab.add_flag(definitely_flag) matcher = Matcher(nlp.vocab) matcher.add('DEFINITELY', None, [{IS_DEFINITELY: True}]) diff --git a/website/usage/_linguistic-features/_tokenization.jade b/website/usage/_linguistic-features/_tokenization.jade index f149556ce..2cd3a13de 100644 --- a/website/usage/_linguistic-features/_tokenization.jade +++ b/website/usage/_linguistic-features/_tokenization.jade @@ -54,7 +54,7 @@ p +code. import spacy - from spacy.symbols import ORTH, LEMMA, POS + from spacy.symbols import ORTH, LEMMA, POS, TAG nlp = spacy.load('en') doc = nlp(u'gimme that') # phrase to tokenize diff --git a/website/usage/_models/_install-basics.jade b/website/usage/_models/_install-basics.jade index 7b32e3333..3fb8fa00c 100644 --- a/website/usage/_models/_install-basics.jade +++ b/website/usage/_models/_install-basics.jade @@ -31,3 +31,13 @@ p import spacy nlp = spacy.load('en') doc = nlp(u'This is a sentence.') + ++infobox("Important note", "⚠️") + | To allow loading models via convenient shortcuts like #[code 'en'], spaCy + | will create a symlink within the #[code spacy/data] directory. This means + | that your user needs the #[strong required permissions]. + | If you've installed spaCy to a system directory and don't have admin + | privileges, the model linking may fail. The easiest solution + | is to re-run the command as admin, or use a #[code virtualenv]. For more + | info on this, see the + | #[+a("/usage/#symlink-privilege") troubleshooting guide]. 
diff --git a/website/usage/_models/_install.jade b/website/usage/_models/_install.jade index 769d3f2d6..7473e41a6 100644 --- a/website/usage/_models/_install.jade +++ b/website/usage/_models/_install.jade @@ -132,7 +132,7 @@ p # set up shortcut link to load local model as "my_amazing_model" python -m spacy link /Users/you/model my_amazing_model -+infobox("Important note") ++infobox("Important note", "⚠️") | In order to create a symlink, your user needs the #[strong required permissions]. | If you've installed spaCy to a system directory and don't have admin | privileges, the #[code spacy link] command may fail. The easiest solution