Use consistent sentence spacing within files

Jordan Suchow 2015-04-19 01:43:46 -07:00
parent 3a8d9b37a6
commit 7bddd15e27
10 changed files with 24 additions and 24 deletions

View File

@@ -1,7 +1,7 @@
Signing the Contributors License Agreement
==========================================
-SpaCy is a commercial open-source project, owned by Syllogism Co.  We require that contributors to SpaCy sign our Contributors License Agreement, which is based on the Oracle Contributor Agreement.
+SpaCy is a commercial open-source project, owned by Syllogism Co. We require that contributors to SpaCy sign our Contributors License Agreement, which is based on the Oracle Contributor Agreement.
The CLA must be signed on your first pull request. To do this, simply fill in the file cla_template.md, and include the filled-in form in your first pull request.

View File

@@ -2,7 +2,7 @@ Syllogism Contributor Agreement
===============================
This Syllogism Contributor Agreement (“SCA”) is based on the Oracle Contributor
-Agreement.  The SCA applies to any contribution that you make to any product or
+Agreement. The SCA applies to any contribution that you make to any product or
project managed by us (the “project”), and sets out the intellectual property
rights you grant to us in the contributed materials. The term “us” shall mean
Syllogism Co. The term "you" shall mean the person or entity identified below.

View File

@@ -107,7 +107,7 @@ API
*derivational* suffixes are not stripped, e.g. the lemma of "institutions"
is "institution", not "institute". Lemmatization is performed using the
WordNet data, but extended to also cover closed-class words such as
-pronouns.  By default, the WN lemmatizer returns "hi" as the lemma of "his".
+pronouns. By default, the WN lemmatizer returns "hi" as the lemma of "his".
We assign pronouns the lemma -PRON-.
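
To make this concrete, here is a minimal sketch of the pronoun special case,
using NLTK's WordNet lemmatizer as a stand-in (spaCy reads the WordNet data
directly; the pronoun list below is illustrative, not spaCy's):

.. code:: python

    # Sketch only: NLTK's lemmatizer stands in for spaCy's WordNet-based
    # lemmatization; the pronoun list is illustrative.
    from nltk.stem import WordNetLemmatizer

    _wordnet = WordNetLemmatizer()
    PRONOUNS = set("i me you he him his she her it we us they them".split())

    def lemmatize(word):
        if word.lower() in PRONOUNS:
            return '-PRON-'
        return _wordnet.lemmatize(word.lower())

    assert lemmatize('his') == '-PRON-'            # not the unhelpful "hi"
    assert lemmatize('institutions') == 'institution'
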
lower
@@ -121,7 +121,7 @@ API
A transform of the word's string, to show orthographic features. The
characters a-z are mapped to x, A-Z is mapped to X, 0-9 is mapped to d.
After these mappings, sequences of 4 or more of the same character are
-truncated to length 4.  Examples: C3Po --> XdXx, favorite --> xxxx,
+truncated to length 4. Examples: C3Po --> XdXx, favorite --> xxxx,
:) --> :)
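
The transform is easy to sketch directly from the description above (an
illustration, not spaCy's implementation):

.. code:: python

    # Sketch of the shape transform: a-z -> x, A-Z -> X, 0-9 -> d,
    # then truncate runs of the same character to length 4.
    import re

    def word_shape(string):
        shape = []
        for char in string:
            if 'a' <= char <= 'z':
                shape.append('x')
            elif 'A' <= char <= 'Z':
                shape.append('X')
            elif '0' <= char <= '9':
                shape.append('d')
            else:
                shape.append(char)
        return re.sub(r'(.)\1{4,}', r'\1' * 4, ''.join(shape))

    assert word_shape('C3Po') == 'XdXx'
    assert word_shape('favorite') == 'xxxx'
    assert word_shape(':)') == ':)'
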
prefix

View File

@@ -66,7 +66,7 @@ Boolean features
+-------------+--------------------------------------------------------------+
| IS_UPPER | The result of sic.isupper() |
+-------------+--------------------------------------------------------------+
-| LIKE_URL | Check whether the string looks like it could be a URL.  Aims |
+| LIKE_URL | Check whether the string looks like it could be a URL. Aims |
| | for low false negative rate. |
+-------------+--------------------------------------------------------------+
| LIKE_NUMBER | Check whether the string looks like it could be a numeric |

View File

@@ -6,7 +6,7 @@ What and Why
spaCy is a lightning-fast, full-cream NLP tokenizer and lexicon.
-Most tokenizers give you a sequence of strings.  That's barbaric.
+Most tokenizers give you a sequence of strings. That's barbaric.
Giving you strings invites you to compute on every *token*, when what
you should be doing is computing on every *type*. Remember
`Zipf's law <http://en.wikipedia.org/wiki/Zipf's_law>`_: you'll

View File

@@ -116,7 +116,7 @@ this was written quickly and has not been executed):
This procedure splits off tokens from the start and end of the string, at each
-point checking whether the remaining string is in our special-cases table.  If
+point checking whether the remaining string is in our special-cases table. If
it is, we stop splitting, and return the tokenization at that point.
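
As a rough sketch of that control flow, with toy affix rules and a
hypothetical special-cases table (not spaCy's actual data):

.. code:: python

    # Split prefixes and suffixes off the chunk, checking the remainder
    # against the special-cases table at each step.
    SPECIAL_CASES = {"don't": ["do", "n't"]}
    PREFIX_CHARS = ('"', '(', '[')
    SUFFIX_CHARS = ('"', ')', ']', ',', '.')

    def tokenize_chunk(string):
        prefixes, suffixes = [], []
        while string:
            if string in SPECIAL_CASES:
                # Stop splitting: the remainder has its own tokenization.
                return prefixes + SPECIAL_CASES[string] + suffixes[::-1]
            if string.startswith(PREFIX_CHARS):
                prefixes.append(string[0])
                string = string[1:]
            elif string.endswith(SUFFIX_CHARS):
                suffixes.append(string[-1])
                string = string[:-1]
            else:
                break
        return prefixes + ([string] if string else []) + suffixes[::-1]

    assert tokenize_chunk("(don't)") == ['(', 'do', "n't", ')']
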
The advantage of this design is that the prefixes, suffixes and special-cases
@@ -206,8 +206,8 @@ loop:
class_, score = max(enumerate(scores), key=lambda item: item[1])
transition(state, class_)
-The parser makes 2N transitions for a sentence of length N.  In order to select
-the transition, it extracts a vector of K features from the state.  Each feature
+The parser makes 2N transitions for a sentence of length N. In order to select
+the transition, it extracts a vector of K features from the state. Each feature
is used as a key into a hash table managed by the model. The features map to
a vector of weights, of length C. We then dot product the feature weights to the
scores vector we are building for that instance.
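
Schematically, the scoring step looks something like this (the feature keys
and weights below are invented for illustration):

.. code:: python

    # Sum the weight vector (length C) for each of the K feature keys.
    import numpy

    def score_transitions(features, weights_table, n_classes):
        scores = numpy.zeros(n_classes)
        for feature in features:
            weights = weights_table.get(feature)
            if weights is not None:
                scores += weights
        return scores

    table = {('s0.word', 'dog'): numpy.array([0.1, -0.2, 0.3]),
             ('b0.tag', 'VERB'): numpy.array([0.0, 0.5, -0.1])}
    scores = score_transitions(table.keys(), table, n_classes=3)
    class_, score = max(enumerate(scores), key=lambda item: item[1])
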

View File

@@ -10,7 +10,7 @@ spaCy: Industrial-strength NLP
.. _Issue Tracker: https://github.com/honnibal/spaCy/issues
-**13/04**: *Version 0.80 released.  Includes named entity recognition, better sentence
+**13/04**: *Version 0.80 released. Includes named entity recognition, better sentence
boundary detection, and many bug fixes.*
`spaCy`_ is a new library for text processing in Python and Cython.
@@ -28,7 +28,7 @@ If they don't want to stay in academia, they join Google, IBM, etc.
The net result is that outside of the tech giants, commercial NLP has changed
little in the last ten years. In academia, it's changed entirely. Amazing
-improvements in quality.  Orders of magnitude faster.  But the
+improvements in quality. Orders of magnitude faster. But the
academic code is always GPL, undocumented, unusable, or all three. You could
implement the ideas yourself, but the papers are hard to read, and training
data is exorbitantly expensive. So what are you left with? A common answer is
@@ -58,7 +58,7 @@ to embedded word representations, and a range of useful features are pre-calculated
and cached.
If none of that made any sense to you, here's the gist of it. Computers don't
-understand text.  This is unfortunate, because that's what the web almost entirely
+understand text. This is unfortunate, because that's what the web almost entirely
consists of. We want to recommend people text based on other text they liked.
We want to shorten text to display it on a mobile screen. We want to aggregate
it, link it, filter it, categorise it, generate it and correct it.
@@ -242,7 +242,7 @@ I report mean times per document, in milliseconds.
**Hardware**: Intel i7-3770 (2012)
-.. table:: Efficiency comparison.  Lower is better.
+.. table:: Efficiency comparison. Lower is better.
+--------------+---------------------------+--------------------------------+
| | Absolute (ms per doc) | Relative (to spaCy) |
@@ -278,7 +278,7 @@ representations.
publish or perform any benchmark or performance tests or analysis relating to
the Service or the use thereof without express authorization from AlchemyAPI;
-.. Did you get that?  You're not allowed to evaluate how well their system works,
+.. Did you get that? You're not allowed to evaluate how well their system works,
unless you're granted a special exception. Their system must be pretty
terrible to motivate such an embarrassing restriction.
They must know this makes them look bad, but they apparently believe allowing

View File

@@ -92,7 +92,7 @@ developing. They own the copyright to any modifications they make to spaCy,
but not to the original spaCy code.
No additional fees will be due when they hire new developers, run spaCy on
-additional internal servers, etc.  If their company is acquired, the license will
+additional internal servers, etc. If their company is acquired, the license will
be transferred to the company acquiring them. However, to use spaCy in another
product, they will have to buy a second license.
@@ -115,9 +115,9 @@ In order to do this, they must sign a contributor agreement, ceding their
copyright. When commercial licenses to spaCy are sold, Alex and Sasha will
not be able to claim any royalties from their contributions.
-Later, Alex and Sasha implement new features into spaCy, for another paper.  The
+Later, Alex and Sasha implement new features into spaCy, for another paper. The
code was quite rushed, and they don't want to take the time to put together a
-proper pull request.  They must release their modifications under the AGPL, but
+proper pull request. They must release their modifications under the AGPL, but
they are not obliged to contribute it to the spaCy repository, or concede their
copyright.
@@ -126,8 +126,8 @@ Phuong and Jessie: Open Source developers
#########################################
Phuong and Jessie use the open-source software Calibre to manage their e-book
-libraries.  They have an idea for a search feature, and they want to use spaCy
-to implement it.  Calibre is released under the GPLv3.  The AGPL has additional
+libraries. They have an idea for a search feature, and they want to use spaCy
+to implement it. Calibre is released under the GPLv3. The AGPL has additional
restrictions for projects used as a network resource, but they don't apply to
this project, so Phuong and Jessie can use spaCy to improve Calibre. They'll
have to release their code, but that was always their intention anyway.

View File

@@ -23,7 +23,7 @@ parser model and word vectors, which it installs within the spacy.en package directory.
If you're stuck using a server with an old version of Python, and you don't
have root access, I've prepared a bootstrap script to help you compile a local
-Python install.  Run:
+Python install. Run:
.. code:: bash
@@ -47,7 +47,7 @@ this is how I build the project.
$ py.test tests/
Python packaging is awkward at the best of times, and it's particularly tricky
-with C extensions, built via Cython, requiring large data files.  So, please
+with C extensions, built via Cython, requiring large data files. So, please
report issues as you encounter them, and bear with me :)
Usage

View File

@@ -32,7 +32,7 @@ Bug Fixes
sometimes inconsistent.
I've addressed the most immediate problems, but this part of the design is
-still a work in progress.  It's a difficult problem.  The parse is a tree,
+still a work in progress. It's a difficult problem. The parse is a tree,
and we want to freely navigate up and down it without creating reference
cycles that inhibit garbage collection, and without doing a lot of copying,
creating and deleting.
@@ -53,7 +53,7 @@ pinning down or reproducing. Please send details of your system to the
Enhancements: Train and evaluate on whole paragraphs
----------------------------------------------------
-.. note:: tl;dr: I shipped the wrong parsing model with 0.3.  That model expected input to be segmented into sentences.  0.4 ships the correct model, which uses some algorithmic tricks to minimize the impact of tokenization and sentence segmentation errors on the parser.
+.. note:: tl;dr: I shipped the wrong parsing model with 0.3. That model expected input to be segmented into sentences. 0.4 ships the correct model, which uses some algorithmic tricks to minimize the impact of tokenization and sentence segmentation errors on the parser.
Most English parsing research is performed on text with perfect pre-processing:
@@ -77,7 +77,7 @@ made a big difference:
| Corrected | 89.9 | 88.8 |
+-------------+-------+----------+
-.. note:: spaCy is evaluated on unlabelled dependencies, where the above accuracy figures refer to phrase-structure trees.  Accuracies are non-comparable.
+.. note:: spaCy is evaluated on unlabelled dependencies, where the above accuracy figures refer to phrase-structure trees. Accuracies are non-comparable.