Use consistent sentence spacing within files

Jordan Suchow 2015-04-19 01:43:46 -07:00
parent 3a8d9b37a6
commit 7bddd15e27
10 changed files with 24 additions and 24 deletions

View File

@@ -1,7 +1,7 @@
Signing the Contributors License Agreement
==========================================
SpaCy is a commercial open-source project, owned by Syllogism Co. We require that contributors to SpaCy sign our Contributors License Agreement, which is based on the Oracle Contributor Agreement.
The CLA must be signed on your first pull request. To do this, simply fill in the file cla_template.md, and include the filled-in form in your first pull request.

View File

@@ -2,7 +2,7 @@ Syllogism Contributor Agreement
===============================
This Syllogism Contributor Agreement (“SCA”) is based on the Oracle Contributor
Agreement. The SCA applies to any contribution that you make to any product or
project managed by us (the “project”), and sets out the intellectual property
rights you grant to us in the contributed materials. The term “us” shall mean
Syllogism Co. The term "you" shall mean the person or entity identified below.

View File

@@ -107,7 +107,7 @@ API
*derivational* suffixes are not stripped, e.g. the lemma of "institutions"
is "institution", not "institute". Lemmatization is performed using the
WordNet data, but extended to also cover closed-class words such as
pronouns. By default, the WN lemmatizer returns "hi" as the lemma of "his".
We assign pronouns the lemma -PRON-.
lower
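The pronoun special case described in this hunk is easy to illustrate. The sketch below is not spaCy's lemmatizer: it uses NLTK's WordNetLemmatizer as a stand-in for the bundled WordNet data, and the PRONOUNS set is a deliberately partial, hypothetical list.

.. code:: python

    from nltk.stem import WordNetLemmatizer  # stand-in for the bundled WordNet data

    PRONOUN_LEMMA = "-PRON-"
    # Hypothetical, partial pronoun list, for illustration only.
    PRONOUNS = {"i", "me", "my", "you", "he", "him", "his", "she", "her",
                "it", "its", "we", "us", "they", "them", "their"}

    _wordnet = WordNetLemmatizer()

    def lemma(word):
        # Pronouns bypass WordNet, so "his" maps to "-PRON-" rather than "hi".
        if word.lower() in PRONOUNS:
            return PRONOUN_LEMMA
        # WordNet strips inflectional suffixes only: "institutions" -> "institution".
        return _wordnet.lemmatize(word.lower())

    print(lemma("institutions"))  # institution
    print(lemma("his"))           # -PRON-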
@@ -121,7 +121,7 @@ API
A transform of the word's string, to show orthographic features. The
characters a-z are mapped to x, A-Z is mapped to X, 0-9 is mapped to d.
After these mappings, sequences of 4 or more of the same character are
truncated to length 4. Examples: C3Po --> XdXx, favorite --> xxxx,
:) --> :)
prefix
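For reference, the word-shape transform documented in the hunk above (a-z to x, A-Z to X, 0-9 to d, runs of four or more identical characters truncated to four) can be sketched in a few lines of Python. This is an independent re-implementation of the documented behaviour, not spaCy's internal code.

.. code:: python

    import re

    def word_shape(string):
        shape = []
        for char in string:
            if "a" <= char <= "z":
                shape.append("x")
            elif "A" <= char <= "Z":
                shape.append("X")
            elif "0" <= char <= "9":
                shape.append("d")
            else:
                shape.append(char)  # punctuation etc. passes through unchanged
        # Truncate runs of 4 or more identical characters to length 4.
        return re.sub(r"(.)\1{3,}", lambda m: m.group(1) * 4, "".join(shape))

    print(word_shape("C3Po"))      # XdXx
    print(word_shape("favorite"))  # xxxx
    print(word_shape(":)"))        # :)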

View File

@@ -66,7 +66,7 @@ Boolean features
+-------------+--------------------------------------------------------------+
| IS_UPPER | The result of sic.isupper() |
+-------------+--------------------------------------------------------------+
| LIKE_URL | Check whether the string looks like it could be a URL. Aims |
| | for low false negative rate. |
+-------------+--------------------------------------------------------------+
| LIKE_NUMBER | Check whether the string looks like it could be a numeric |
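To make the LIKE_URL trade-off concrete, a toy heuristic biased toward a low false negative rate might look like the sketch below. This is purely illustrative and is not spaCy's implementation.

.. code:: python

    def like_url(string):
        # Prefer over-triggering to missing real URLs (low false negative rate).
        lowered = string.lower()
        if lowered.startswith(("http://", "https://", "www.")):
            return True
        if " " in string:
            return False
        # Strings like "en.wikipedia.org/wiki/Zipf's_law" have a dot and a slash.
        return "." in lowered and "/" in lowered

    print(like_url("http://en.wikipedia.org/wiki/Zipf's_law"))  # True
    print(like_url("institution"))                              # False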

View File

@@ -6,7 +6,7 @@ What and Why
spaCy is a lightning-fast, full-cream NLP tokenizer and lexicon.
Most tokenizers give you a sequence of strings. That's barbaric.
Giving you strings invites you to compute on every *token*, when what
you should be doing is computing on every *type*. Remember
`Zipf's law <http://en.wikipedia.org/wiki/Zipf's_law>`_: you'll

View File

@@ -116,7 +116,7 @@ this was written quickly and has not been executed):
This procedure splits off tokens from the start and end of the string, at each
point checking whether the remaining string is in our special-cases table. If
it is, we stop splitting, and return the tokenization at that point.
The advantage of this design is that the prefixes, suffixes and special-cases
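The splitting procedure described in the hunk above can be sketched roughly as follows. The data structures (a dict of special cases mapping strings to pre-split token lists, plus lists of prefix and suffix punctuation) are assumptions for illustration; the real tokenizer is implemented in Cython and differs in detail.

.. code:: python

    def tokenize_chunk(string, special_cases, prefixes, suffixes):
        # special_cases: e.g. {"don't": ["do", "n't"]}
        # prefixes/suffixes: punctuation to peel off, e.g. ["("] and [")", ".", "!"]
        tokens, trailing = [], []
        while string:
            if string in special_cases:
                tokens.extend(special_cases[string])  # stop splitting here
                break
            prefix = next((p for p in prefixes if string.startswith(p)), None)
            if prefix:
                tokens.append(prefix)
                string = string[len(prefix):]
                continue
            suffix = next((s for s in suffixes if string.endswith(s)), None)
            if suffix:
                trailing.append(suffix)
                string = string[:-len(suffix)]
                continue
            tokens.append(string)  # nothing left to split off
            break
        return tokens + list(reversed(trailing))

    print(tokenize_chunk("(don't).", {"don't": ["do", "n't"]}, ["("], [")", ".", "!"]))
    # ['(', 'do', "n't", ')', '.']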
@@ -206,8 +206,8 @@ loop:
    class_, score = max(enumerate(scores), key=lambda item: item[1])
    transition(state, class_)
The parser makes 2N transitions for a sentence of length N. In order to select
the transition, it extracts a vector of K features from the state. Each feature
is used as a key into a hash table managed by the model. The features map to
a vector of weights, of length C. We then sum each feature's weights into the
scores vector we are building for that instance.
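To make the scoring step concrete, here is a rough pure-Python sketch of how the scores vector used in the loop above could be built, assuming the model's hash table is a plain dict from feature keys to length-C weight lists; the real model is a Cython hash table, so this is illustrative only.

.. code:: python

    def get_scores(features, weights, n_classes):
        # features: the K feature keys extracted from the current state
        # weights: dict mapping each feature key to a list of C weights
        scores = [0.0] * n_classes
        for feat in features:
            for class_, weight in enumerate(weights.get(feat, ())):
                scores[class_] += weight  # sum this feature's weight vector in
        return scores

    # The loop shown above then picks the highest-scoring transition:
    # class_, score = max(enumerate(scores), key=lambda item: item[1])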

View File

@@ -10,7 +10,7 @@ spaCy: Industrial-strength NLP
.. _Issue Tracker: https://github.com/honnibal/spaCy/issues
**13/04**: *Version 0.80 released. Includes named entity recognition, better sentence
boundary detection, and many bug fixes.*
`spaCy`_ is a new library for text processing in Python and Cython.
@@ -28,7 +28,7 @@ If they don't want to stay in academia, they join Google, IBM, etc.
The net result is that outside of the tech giants, commercial NLP has changed
little in the last ten years. In academia, it's changed entirely. Amazing
improvements in quality. Orders of magnitude faster. But the
academic code is always GPL, undocumented, unusable, or all three. You could
implement the ideas yourself, but the papers are hard to read, and training
data is exorbitantly expensive. So what are you left with? A common answer is
@@ -58,7 +58,7 @@ to embedded word representations, and a range of useful features are pre-calcula
and cached.
If none of that made any sense to you, here's the gist of it. Computers don't
understand text. This is unfortunate, because that's what the web almost entirely
consists of. We want to recommend people text based on other text they liked.
We want to shorten text to display it on a mobile screen. We want to aggregate
it, link it, filter it, categorise it, generate it and correct it.
@@ -242,7 +242,7 @@ I report mean times per document, in milliseconds.
**Hardware**: Intel i7-3770 (2012)
.. table:: Efficiency comparison. Lower is better.
+--------------+---------------------------+--------------------------------+
| | Absolute (ms per doc) | Relative (to spaCy) |
@@ -278,7 +278,7 @@ representations.
publish or perform any benchmark or performance tests or analysis relating to
the Service or the use thereof without express authorization from AlchemyAPI;
.. Did you get that? You're not allowed to evaluate how well their system works,
unless you're granted a special exception. Their system must be pretty
terrible to motivate such an embarrassing restriction.
They must know this makes them look bad, but they apparently believe allowing

View File

@@ -92,7 +92,7 @@ developing. They own the copyright to any modifications they make to spaCy,
but not to the original spaCy code.
No additional fees will be due when they hire new developers, run spaCy on
additional internal servers, etc. If their company is acquired, the license will
be transferred to the company acquiring them. However, to use spaCy in another
product, they will have to buy a second license.
@@ -115,9 +115,9 @@ In order to do this, they must sign a contributor agreement, ceding their
copyright. When commercial licenses to spaCy are sold, Alex and Sasha will
not be able to claim any royalties from their contributions.
Later, Alex and Sasha implement new features into spaCy, for another paper. The
code was quite rushed, and they don't want to take the time to put together a
proper pull request. They must release their modifications under the AGPL, but
they are not obliged to contribute them to the spaCy repository, or concede their
copyright.
@@ -126,8 +126,8 @@ Phuong and Jessie: Open Source developers
#########################################
Phuong and Jessie use the open-source software Calibre to manage their e-book
libraries. They have an idea for a search feature, and they want to use spaCy
to implement it. Calibre is released under the GPLv3. The AGPL has additional
restrictions for projects used as a network resource, but they don't apply to
this project, so Phuong and Jessie can use spaCy to improve Calibre. They'll
have to release their code, but that was always their intention anyway.

View File

@@ -23,7 +23,7 @@ parser model and word vectors, which it installs within the spacy.en package dir
If you're stuck using a server with an old version of Python, and you don't
have root access, I've prepared a bootstrap script to help you compile a local
Python install. Run:
.. code:: bash
@@ -47,7 +47,7 @@ this is how I build the project.
    $ py.test tests/
Python packaging is awkward at the best of times, and it's particularly tricky
with C extensions, built via Cython, requiring large data files. So, please
report issues as you encounter them, and bear with me :)
Usage

View File

@@ -32,7 +32,7 @@ Bug Fixes
sometimes inconsistent.
I've addressed the most immediate problems, but this part of the design is
still a work in progress. It's a difficult problem. The parse is a tree,
and we want to freely navigate up and down it without creating reference
cycles that inhibit garbage collection, and without doing a lot of copying,
creating and deleting.
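One way to get cycle-free navigation, shown here purely as an illustration of the design problem described above and not as spaCy's actual solution, is to keep the parse in flat arrays and derive head/child navigation from integer indices rather than from mutually referencing token objects.

.. code:: python

    class Parse:
        """Toy illustration: tokens never hold references to each other,
        so walking up and down the tree cannot create reference cycles."""

        def __init__(self, words, heads):
            self.words = words  # e.g. ["She", "ate", "pizza"]
            self.heads = heads  # index of each token's head; the root points to itself

        def head(self, i):
            return self.heads[i]

        def children(self, i):
            return [j for j in range(len(self.words))
                    if self.heads[j] == i and j != i]

    parse = Parse(["She", "ate", "pizza"], [1, 1, 1])
    print(parse.head(0))      # 1  ("She" attaches to "ate")
    print(parse.children(1))  # [0, 2]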
@@ -53,7 +53,7 @@ pinning down or reproducing. Please send details of your system to the
Enhancements: Train and evaluate on whole paragraphs
----------------------------------------------------
.. note:: tl;dr: I shipped the wrong parsing model with 0.3. That model expected input to be segmented into sentences. 0.4 ships the correct model, which uses some algorithmic tricks to minimize the impact of tokenization and sentence segmentation errors on the parser.
Most English parsing research is performed on text with perfect pre-processing:
@@ -77,7 +77,7 @@ made a big difference:
| Corrected | 89.9 | 88.8 |
+-------------+-------+----------+
.. note:: spaCy is evaluated on unlabelled dependencies, where the above accuracy figures refer to phrase-structure trees. Accuracies are non-comparable.