diff --git a/contributors/cla.md b/contributors/cla.md
index 007739a1a..27b522dc8 100644
--- a/contributors/cla.md
+++ b/contributors/cla.md
@@ -1,7 +1,7 @@
 Signing the Contributors License Agreement
 ==========================================
 
-SpaCy is a commercial open-source project, owned by Syllogism Co. We require that contributors to SpaCy sign our Contributors License Agreement, which is based on the Oracle Contributor Agreement. 
+SpaCy is a commercial open-source project, owned by Syllogism Co. We require that contributors to SpaCy sign our Contributors License Agreement, which is based on the Oracle Contributor Agreement. The CLA must be signed on your first pull request. To do this, simply fill in the file cla_template.md, and include the filled-in form in your first pull request.
diff --git a/contributors/cla_template.md b/contributors/cla_template.md
index fb54da72d..fca6771de 100644
--- a/contributors/cla_template.md
+++ b/contributors/cla_template.md
@@ -2,7 +2,7 @@
 Syllogism Contributor Agreement
 ===============================
 
 This Syllogism Contributor Agreement (“SCA”) is based on the Oracle Contributor
-Agreement. The SCA applies to any contribution that you make to any product or 
+Agreement. The SCA applies to any contribution that you make to any product or
 project managed by us (the “project”), and sets out the intellectual property
 rights you grant to us in the contributed materials. The term “us” shall mean
 Syllogism Co. The term "you" shall mean the person or entity identified below.
diff --git a/docs/source/api.rst b/docs/source/api.rst
index 808204e65..bb85b45ae 100644
--- a/docs/source/api.rst
+++ b/docs/source/api.rst
@@ -107,7 +107,7 @@ API
     *derivational* suffixes are not stripped, e.g. the lemma of "institutions"
     is "institution", not "institute". Lemmatization is performed using the
     WordNet data, but extended to also cover closed-class words such as
-    pronouns. By default, the WN lemmatizer returns "hi" as the lemma of "his". 
+    pronouns. By default, the WN lemmatizer returns "hi" as the lemma of "his". We assign pronouns the lemma -PRON-.
 
 lower
@@ -121,7 +121,7 @@ API
     A transform of the word's string, to show orthographic features. The
     characters a-z are mapped to x, A-Z is mapped to X, 0-9 is mapped to d.
     After these mappings, sequences of 4 or more of the same character are
-    truncated to length 4. Examples: C3Po --> XdXx, favorite --> xxxx, 
+    truncated to length 4. Examples: C3Po --> XdXx, favorite --> xxxx,
     :) --> :)
 
 prefix
diff --git a/docs/source/features.rst b/docs/source/features.rst
index 1643ad2bb..ecd465182 100644
--- a/docs/source/features.rst
+++ b/docs/source/features.rst
@@ -66,7 +66,7 @@ Boolean features
 +-------------+--------------------------------------------------------------+
 | IS_UPPER    | The result of sic.isupper()                                  |
 +-------------+--------------------------------------------------------------+
-| LIKE_URL    | Check whether the string looks like it could be a URL. Aims  | 
+| LIKE_URL    | Check whether the string looks like it could be a URL. Aims  |
 |             | for low false negative rate.                                 |
 +-------------+--------------------------------------------------------------+
 | LIKE_NUMBER | Check whether the string looks like it could be a numeric    |
diff --git a/docs/source/guide/overview.rst b/docs/source/guide/overview.rst
index 7e1b34558..dbcfebfd7 100644
--- a/docs/source/guide/overview.rst
+++ b/docs/source/guide/overview.rst
@@ -6,7 +6,7 @@ What and Why
 
 spaCy is a lightning-fast, full-cream NLP tokenizer and lexicon.
 
-Most tokenizers give you a sequence of strings. That's barbaric. 
+Most tokenizers give you a sequence of strings. That's barbaric.
 Giving you strings invites you to compute on every *token*, when what you
 should be doing is computing on every *type*.
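The shape transform described in the api.rst hunk above is easy to sketch. This is an illustrative reimplementation from the prose description only, not spaCy's internal code; the function name is mine:

```python
def word_shape(string):
    """Map a-z to "x", A-Z to "X", 0-9 to "d", leave other characters
    alone, and truncate runs of 4+ identical mapped characters to 4."""
    shape = []
    for char in string:
        if char.islower():
            mapped = "x"
        elif char.isupper():
            mapped = "X"
        elif char.isdigit():
            mapped = "d"
        else:
            mapped = char
        # Truncate runs of 4 or more of the same mapped character.
        if len(shape) >= 4 and all(c == mapped for c in shape[-4:]):
            continue
        shape.append(mapped)
    return "".join(shape)
```

On the documented examples this gives `word_shape("C3Po") == "XdXx"`, `word_shape("favorite") == "xxxx"`, and `word_shape(":)") == ":)"`.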
 Remember `Zipf's law `_: you'll
diff --git a/docs/source/howworks.rst b/docs/source/howworks.rst
index 3abc2ef05..6f88db744 100644
--- a/docs/source/howworks.rst
+++ b/docs/source/howworks.rst
@@ -116,7 +116,7 @@ this was written quickly and has not been executed):
 
 This procedure splits off tokens from the start and end of the string, at each
-point checking whether the remaining string is in our special-cases table. If 
+point checking whether the remaining string is in our special-cases table. If
 it is, we stop splitting, and return the tokenization at that point.
 
 The advantage of this design is that the prefixes, suffixes and special-cases
@@ -206,8 +206,8 @@ loop:
 
         class_, score = max(enumerate(scores), key=lambda item: item[1])
         transition(state, class_)
 
-The parser makes 2N transitions for a sentence of length N. In order to select 
-the transition, it extracts a vector of K features from the state. Each feature 
+The parser makes 2N transitions for a sentence of length N. In order to select
+the transition, it extracts a vector of K features from the state. Each feature
 is used as a key into a hash table managed by the model. The features map to
 a vector of weights, of length C. We then dot product the feature weights to the
 scores vector we are building for that instance.
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 75892b975..60a66b2ae 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -10,7 +10,7 @@ spaCy: Industrial-strength NLP
 
 .. _Issue Tracker: https://github.com/honnibal/spaCy/issues
 
-**13/04**: *Version 0.80 released. Includes named entity recognition, better sentence 
+**13/04**: *Version 0.80 released. Includes named entity recognition, better sentence
 boundary detection, and many bug fixes.*
 
 `spaCy`_ is a new library for text processing in Python and Cython.
@@ -28,7 +28,7 @@ If they don't want to stay in academia, they join Google, IBM, etc.
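The split-off procedure described in the howworks.rst hunk above can be sketched as follows. The prefix/suffix patterns and the special-cases entries here are invented stand-ins for illustration, not spaCy's actual rules, and `tokenize_chunk` is a hypothetical name:

```python
import re

# Toy stand-ins for the real prefix/suffix rules and special-cases table.
PREFIX_RE = re.compile(r'^[\("\']')
SUFFIX_RE = re.compile(r'[\)"\'!?.,;]$')
SPECIAL_CASES = {
    "don't": ["do", "n't"],
    "U.S.": ["U.S."],
}

def tokenize_chunk(string):
    """Split punctuation off the start and end of a whitespace-delimited
    chunk, stopping as soon as the remainder is a known special case."""
    prefixes = []
    suffixes = []
    while string:
        # If the remaining string is a special case, stop splitting and
        # return the tokenization at that point.
        if string in SPECIAL_CASES:
            return prefixes + SPECIAL_CASES[string] + suffixes[::-1]
        if PREFIX_RE.search(string):
            prefixes.append(string[0])
            string = string[1:]
        elif SUFFIX_RE.search(string):
            suffixes.append(string[-1])
            string = string[:-1]
        else:
            break
    return prefixes + ([string] if string else []) + suffixes[::-1]
```

For example, `tokenize_chunk('("don\'t!")')` peels off the surrounding punctuation and then hits the `don't` special case, yielding `["(", '"', "do", "n't", "!", '"', ")"]`.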
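The scoring step quoted from howworks.rst can also be sketched. This is a toy, dense version of the feature-to-weights lookup: each of the K features extracted from the state indexes a table of weight vectors of length C, and the score vector is their sum. The feature keys, weight values and class count below are all invented:

```python
from collections import defaultdict

N_CLASSES = 3  # "C" in the text: number of transition classes

# Unseen features score zero for every class.
weights = defaultdict(lambda: [0.0] * N_CLASSES)
weights[("s0_word", "saw")] = [0.1, 0.6, -0.2]
weights[("n0_tag", "NN")] = [0.3, -0.1, 0.4]

def score(features):
    """Sum the weight vectors of the active features into a score vector."""
    scores = [0.0] * N_CLASSES
    for feat in features:  # the K features extracted from the state
        for clas, weight in enumerate(weights[feat]):
            scores[clas] += weight
    return scores

scores = score([("s0_word", "saw"), ("n0_tag", "NN")])
# Select the transition exactly as in the quoted loop:
class_, score_ = max(enumerate(scores), key=lambda item: item[1])
```

The real model stores the weights in a hash table keyed on the feature values, so the cost per transition is O(K * C) regardless of vocabulary size.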
 The net result is that outside of the tech giants, commercial NLP has changed
 little in the last ten years. In academia, it's changed entirely. Amazing
-improvements in quality. Orders of magnitude faster. But the 
+improvements in quality. Orders of magnitude faster. But the
 academic code is always GPL, undocumented, unusable, or all three. You could
 implement the ideas yourself, but the papers are hard to read, and training
 data is exorbitantly expensive. So what are you left with? A common answer is
@@ -58,7 +58,7 @@ to embedded word representations, and a range of useful features are pre-calculated
 and cached.
 
 If none of that made any sense to you, here's the gist of it. Computers don't
-understand text. This is unfortunate, because that's what the web almost entirely 
+understand text. This is unfortunate, because that's what the web almost entirely
 consists of. We want to recommend people text based on other text they liked.
 We want to shorten text to display it on a mobile screen. We want to aggregate
 it, link it, filter it, categorise it, generate it and correct it.
@@ -242,7 +242,7 @@ I report mean times per document, in milliseconds.
 
 **Hardware**: Intel i7-3770 (2012)
 
-.. table:: Efficiency comparison. Lower is better. 
+.. table:: Efficiency comparison. Lower is better.
 
    +--------------+---------------------------+--------------------------------+
    |              | Absolute (ms per doc)     | Relative (to spaCy)            |
    +--------------+---------------------------+--------------------------------+
@@ -278,7 +278,7 @@ representations.
 
    publish or perform any benchmark or performance tests or analysis relating
    to the Service or the use thereof without express authorization from
    AlchemyAPI;
 
-.. Did you get that? You're not allowed to evaluate how well their system works, 
+.. Did you get that? You're not allowed to evaluate how well their system works,
    unless you're granted a special exception. Their system must be pretty
    terrible to motivate such an embarrassing restriction.
    They must know this makes them look bad, but they apparently believe allowing
diff --git a/docs/source/license.rst b/docs/source/license.rst
index 7dc889586..833b1aae7 100644
--- a/docs/source/license.rst
+++ b/docs/source/license.rst
@@ -92,7 +92,7 @@ developing. They own the copyright to any modifications they make to spaCy,
 but not to the original spaCy code.
 
 No additional fees will be due when they hire new developers, run spaCy on
-additional internal servers, etc. If their company is acquired, the license will 
+additional internal servers, etc. If their company is acquired, the license will
 be transferred to the company acquiring them. However, to use spaCy in another
 product, they will have to buy a second license.
@@ -115,9 +115,9 @@ In order to do this, they must sign a contributor agreement, ceding their
 copyright. When commercial licenses to spaCy are sold, Alex and Sasha will not
 be able to claim any royalties from their contributions.
 
-Later, Alex and Sasha implement new features into spaCy, for another paper. The 
+Later, Alex and Sasha implement new features into spaCy, for another paper. The
 code was quite rushed, and they don't want to take the time to put together a
-proper pull request. They must release their modifications under the AGPL, but 
+proper pull request. They must release their modifications under the AGPL, but
 they are not obliged to contribute it to the spaCy repository, or concede their
 copyright.
 
@@ -126,8 +126,8 @@ Phuong and Jessie: Open Source developers
 #########################################
 
 Phuong and Jessie use the open-source software Calibre to manage their e-book
-libraries. They have an idea for a search feature, and they want to use spaCy 
-to implement it. Calibre is released under the GPLv3. The AGPL has additional 
+libraries. They have an idea for a search feature, and they want to use spaCy
+to implement it. Calibre is released under the GPLv3.
+The AGPL has additional
 restrictions for projects used as a network resource, but they don't apply to
 this project, so Phuong and Jessie can use spaCy to improve Calibre. They'll
 have to release their code, but that was always their intention anyway.
diff --git a/docs/source/quickstart.rst b/docs/source/quickstart.rst
index ec9f612ff..0226d5c88 100644
--- a/docs/source/quickstart.rst
+++ b/docs/source/quickstart.rst
@@ -23,7 +23,7 @@ parser model and word vectors, which it installs within the spacy.en package dir
 
 If you're stuck using a server with an old version of Python, and you don't
 have root access, I've prepared a bootstrap script to help you compile a local
-Python install. Run: 
+Python install. Run:
 
 .. code:: bash
 
@@ -47,7 +47,7 @@ this is how I build the project.
 
     $ py.test tests/
 
 Python packaging is awkward at the best of times, and it's particularly tricky
-with C extensions, built via Cython, requiring large data files. So, please 
+with C extensions, built via Cython, requiring large data files. So, please
 report issues as you encounter them, and bear with me :)
 
 Usage
diff --git a/docs/source/updates.rst b/docs/source/updates.rst
index 0b443266a..a526ee757 100644
--- a/docs/source/updates.rst
+++ b/docs/source/updates.rst
@@ -32,7 +32,7 @@ Bug Fixes
     sometimes inconsistent.
 
     I've addressed the most immediate problems, but this part of the design is
-    still a work in progress. It's a difficult problem. The parse is a tree, 
+    still a work in progress. It's a difficult problem. The parse is a tree,
     and we want to freely navigate up and down it without creating reference
     cycles that inhibit garbage collection, and without doing a lot of copying,
     creating and deleting.
@@ -53,7 +53,7 @@ pinning down or reproducing. Please send details of your system to the
 
 Enhancements: Train and evaluate on whole paragraphs
 ----------------------------------------------------
 
-.. note:: tl;dr: I shipped the wrong parsing model with 0.3.
That model expected input to be segmented into sentences. 0.4 ships the correct model, which uses some algorithmic tricks to minimize the impact of tokenization and sentence segmentation errors on the parser. 
+.. note:: tl;dr: I shipped the wrong parsing model with 0.3. That model expected input to be segmented into sentences. 0.4 ships the correct model, which uses some algorithmic tricks to minimize the impact of tokenization and sentence segmentation errors on the parser.
 
 Most English parsing research is performed on text with perfect pre-processing:
@@ -77,7 +77,7 @@ made a big difference:
 
    | Corrected   | 89.9  | 88.8     |
    +-------------+-------+----------+
 
-.. note:: spaCy is evaluated on unlabelled dependencies, where the above accuracy figures refer to phrase-structure trees. Accuracies are non-comparable. 
+.. note:: spaCy is evaluated on unlabelled dependencies, where the above accuracy figures refer to phrase-structure trees. Accuracies are non-comparable.
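For concreteness, the "unlabelled dependencies" metric mentioned in the final note is unlabelled attachment score (UAS): the proportion of tokens whose predicted head matches the gold head. A minimal sketch, with invented head arrays purely for illustration:

```python
def unlabelled_attachment_score(gold_heads, pred_heads):
    """Fraction of tokens whose predicted head index equals the gold one."""
    assert len(gold_heads) == len(pred_heads)
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)

gold = [1, -1, 1, 2]  # gold head index per token (-1 marks the root)
pred = [1, -1, 1, 1]  # predicted heads; one error, on the last token
uas = unlabelled_attachment_score(gold, pred)  # 3/4 = 0.75
```

Phrase-structure parsers are instead scored on bracketing F-measure, which is why the note warns that the two accuracy columns are non-comparable.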