mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-12 10:16:27 +03:00
Use consistent sentence spacing within files
This commit is contained in:
parent 3a8d9b37a6
commit 7bddd15e27

@@ -1,7 +1,7 @@

Signing the Contributors License Agreement
==========================================

SpaCy is a commercial open-source project, owned by Syllogism Co. We require that contributors to SpaCy sign our Contributors License Agreement, which is based on the Oracle Contributor Agreement.

The CLA must be signed on your first pull request. To do this, simply fill in the file cla_template.md, and include the filled-in form in your first pull request.

@@ -2,7 +2,7 @@ Syllogism Contributor Agreement
===============================

This Syllogism Contributor Agreement (“SCA”) is based on the Oracle Contributor
Agreement. The SCA applies to any contribution that you make to any product or
project managed by us (the “project”), and sets out the intellectual property
rights you grant to us in the contributed materials. The term “us” shall mean
Syllogism Co. The term “you” shall mean the person or entity identified below.

@@ -107,7 +107,7 @@ API

*derivational* suffixes are not stripped, e.g. the lemma of "institutions"
is "institution", not "institute". Lemmatization is performed using the
WordNet data, but extended to also cover closed-class words such as
pronouns. By default, the WN lemmatizer returns "hi" as the lemma of "his".
We assign pronouns the lemma -PRON-.
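The scheme described above can be sketched as follows; the pronoun set, suffix rules, and known-lemma set here are toy stand-ins for the WordNet data, not spaCy's actual tables:

```python
# Illustrative sketch of the lemmatization scheme described above; the
# suffix rules and known-lemma set are hypothetical stand-ins for the
# WordNet data.
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they", "his", "her", "their"}
NOUN_RULES = [("ses", "s"), ("ies", "y"), ("s", "")]  # inflectional suffixes only

def lemmatize(string, known_lemmas):
    string = string.lower()
    if string in PRONOUNS:
        return "-PRON-"          # closed-class override
    if string in known_lemmas:
        return string
    for old, new in NOUN_RULES:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if form in known_lemmas:
                return form      # inflection stripped, derivation kept
    return string

print(lemmatize("institutions", {"institution"}))  # institution
print(lemmatize("his", {"institution"}))           # -PRON-
```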

lower

@@ -121,7 +121,7 @@ API

A transform of the word's string, to show orthographic features. The
characters a-z are mapped to x, A-Z is mapped to X, 0-9 is mapped to d.
After these mappings, sequences of 4 or more of the same character are
truncated to length 4. Examples: C3Po --> XdXx, favorite --> xxxx,
:) --> :)
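The mapping and truncation rules above can be sketched in a few lines (an illustrative reimplementation, not the library's own code):

```python
# A sketch of the shape transform described above: map a-z -> x,
# A-Z -> X, 0-9 -> d, then truncate long runs.
def word_shape(string):
    shape = []
    for char in string:
        if char.islower():
            char = "x"
        elif char.isupper():
            char = "X"
        elif char.isdigit():
            char = "d"
        # Truncate runs of 4+ identical characters to length 4.
        if shape[-4:] != [char] * 4:
            shape.append(char)
    return "".join(shape)

print(word_shape("C3Po"))      # XdXx
print(word_shape("favorite"))  # xxxx
print(word_shape(":)"))        # :)
```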

prefix


@@ -66,7 +66,7 @@ Boolean features

+-------------+--------------------------------------------------------------+
| IS_UPPER    | The result of sic.isupper()                                  |
+-------------+--------------------------------------------------------------+
| LIKE_URL    | Check whether the string looks like it could be a URL. Aims  |
|             | for low false negative rate.                                 |
+-------------+--------------------------------------------------------------+
| LIKE_NUMBER | Check whether the string looks like it could be a numeric    |
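The flavour of these two heuristics can be sketched as follows; the scheme prefixes and TLD list are illustrative assumptions, not the actual feature definitions:

```python
# Illustrative heuristics in the spirit of LIKE_URL and LIKE_NUMBER (not
# the actual feature definitions); the prefixes and TLD list are
# assumptions chosen to keep the false-negative rate low.
COMMON_TLDS = {"com", "org", "net", "io", "edu", "gov"}

def like_url(string):
    if string.startswith(("http://", "https://", "www.")):
        return True
    parts = string.split(".")
    return len(parts) > 1 and parts[-1].lower() in COMMON_TLDS

def like_number(string):
    # Accept optional sign and separators, e.g. "10", "-3.5", "1,000.5".
    string = string.replace(",", "").lstrip("+-")
    return string.replace(".", "", 1).isdigit()

print(like_url("spacy.io"))    # True
print(like_number("1,000.5"))  # True
```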
@@ -6,7 +6,7 @@ What and Why

spaCy is a lightning-fast, full-cream NLP tokenizer and lexicon.

Most tokenizers give you a sequence of strings. That's barbaric.
Giving you strings invites you to compute on every *token*, when what
you should be doing is computing on every *type*. Remember
`Zipf's law <http://en.wikipedia.org/wiki/Zipf's_law>`_: you'll

@@ -116,7 +116,7 @@ this was written quickly and has not been executed):

This procedure splits off tokens from the start and end of the string, at each
point checking whether the remaining string is in our special-cases table. If
it is, we stop splitting, and return the tokenization at that point.
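A toy version of this procedure looks like the following; the affix sets and special-cases table are invented for illustration:

```python
# A sketch of the splitting procedure described above; the affix sets and
# special-cases table are toy examples, not the real tables.
PREFIXES = {'"', "(", "["}
SUFFIXES = {'"', ")", "]", ",", "."}
SPECIAL_CASES = {"don't": ["do", "n't"], "U.S.": ["U.S."]}

def tokenize_chunk(string):
    prefixes = []
    suffixes = []
    while string:
        if string in SPECIAL_CASES:
            # Stop splitting and use the special-case tokenization.
            return prefixes + SPECIAL_CASES[string] + list(reversed(suffixes))
        if string[0] in PREFIXES:
            prefixes.append(string[0])
            string = string[1:]
        elif string[-1] in SUFFIXES:
            suffixes.append(string[-1])
            string = string[:-1]
        else:
            break
    if string:
        prefixes.append(string)
    return prefixes + list(reversed(suffixes))

print(tokenize_chunk('("don\'t")'))  # ['(', '"', 'do', "n't", '"', ')']
```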

The advantage of this design is that the prefixes, suffixes and special-cases

@@ -206,8 +206,8 @@ loop:

    class_, score = max(enumerate(scores), key=lambda item: item[1])
    transition(state, class_)

The parser makes 2N transitions for a sentence of length N. In order to select
the transition, it extracts a vector of K features from the state. Each feature
is used as a key into a hash table managed by the model. The features map to
a vector of weights, of length C. The weights for each active feature are
summed into the scores vector we are building for that instance.
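A sketch of this scoring scheme, with a hypothetical three-class weight table (the feature names are invented for illustration):

```python
# A sketch of the scoring scheme described above: each of K features keys
# into a table mapping to a weight vector of length C (one weight per
# transition class), and the weights are summed into the scores.
C = 3  # number of transition classes

# Hash table managed by the model: feature -> weight vector of length C.
weights = {
    "s0_word=bank": [0.5, -0.2, 0.1],
    "n0_word=of":   [0.3,  0.4, -0.1],
    "s0_pos=NN":    [-0.1, 0.2,  0.6],
}

def score(features):
    scores = [0.0] * C
    for feat in features:
        for clas, w in enumerate(weights.get(feat, [0.0] * C)):
            scores[clas] += w
    return scores

# K = 3 features extracted from the current state.
scores = score(["s0_word=bank", "n0_word=of", "s0_pos=NN"])
class_, best = max(enumerate(scores), key=lambda item: item[1])
print(class_)  # class 0 wins with the highest summed weight
```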
@@ -10,7 +10,7 @@ spaCy: Industrial-strength NLP

.. _Issue Tracker: https://github.com/honnibal/spaCy/issues

**13/04**: *Version 0.80 released. Includes named entity recognition, better sentence
boundary detection, and many bug fixes.*

`spaCy`_ is a new library for text processing in Python and Cython.

@@ -28,7 +28,7 @@ If they don't want to stay in academia, they join Google, IBM, etc.

The net result is that outside of the tech giants, commercial NLP has changed
little in the last ten years. In academia, it's changed entirely. Amazing
improvements in quality. Orders of magnitude faster. But the
academic code is always GPL, undocumented, unusable, or all three. You could
implement the ideas yourself, but the papers are hard to read, and training
data is exorbitantly expensive. So what are you left with? A common answer is

@@ -58,7 +58,7 @@ to embedded word representations, and a range of useful features are pre-calculated
and cached.

If none of that made any sense to you, here's the gist of it. Computers don't
understand text. This is unfortunate, because that's what the web almost entirely
consists of. We want to recommend people text based on other text they liked.
We want to shorten text to display it on a mobile screen. We want to aggregate
it, link it, filter it, categorise it, generate it and correct it.

@@ -242,7 +242,7 @@ I report mean times per document, in milliseconds.

**Hardware**: Intel i7-3770 (2012)

.. table:: Efficiency comparison. Lower is better.

+--------------+---------------------------+--------------------------------+
|              | Absolute (ms per doc)     | Relative (to spaCy)            |

@@ -278,7 +278,7 @@ representations.

    publish or perform any benchmark or performance tests or analysis relating to
    the Service or the use thereof without express authorization from AlchemyAPI;

.. Did you get that? You're not allowed to evaluate how well their system works,
   unless you're granted a special exception. Their system must be pretty
   terrible to motivate such an embarrassing restriction.

They must know this makes them look bad, but they apparently believe allowing

@@ -92,7 +92,7 @@ developing. They own the copyright to any modifications they make to spaCy,
but not to the original spaCy code.

No additional fees will be due when they hire new developers, run spaCy on
additional internal servers, etc. If their company is acquired, the license will
be transferred to the company acquiring them. However, to use spaCy in another
product, they will have to buy a second license.

@@ -115,9 +115,9 @@ In order to do this, they must sign a contributor agreement, ceding their
copyright. When commercial licenses to spaCy are sold, Alex and Sasha will
not be able to claim any royalties from their contributions.

Later, Alex and Sasha implement new features into spaCy, for another paper. The
code was quite rushed, and they don't want to take the time to put together a
proper pull request. They must release their modifications under the AGPL, but
they are not obliged to contribute them to the spaCy repository, or concede their
copyright.

@@ -126,8 +126,8 @@ Phuong and Jessie: Open Source developers
#########################################

Phuong and Jessie use the open-source software Calibre to manage their e-book
libraries. They have an idea for a search feature, and they want to use spaCy
to implement it. Calibre is released under the GPLv3. The AGPL has additional
restrictions for projects used as a network resource, but they don't apply to
this project, so Phuong and Jessie can use spaCy to improve Calibre. They'll
have to release their code, but that was always their intention anyway.


@@ -23,7 +23,7 @@ parser model and word vectors, which it installs within the spacy.en package directory

If you're stuck using a server with an old version of Python, and you don't
have root access, I've prepared a bootstrap script to help you compile a local
Python install. Run:

.. code:: bash

@@ -47,7 +47,7 @@ this is how I build the project.

    $ py.test tests/

Python packaging is awkward at the best of times, and it's particularly tricky
with C extensions, built via Cython, requiring large data files. So, please
report issues as you encounter them, and bear with me :)

Usage

@@ -32,7 +32,7 @@ Bug Fixes

    sometimes inconsistent.

I've addressed the most immediate problems, but this part of the design is
still a work in progress. It's a difficult problem. The parse is a tree,
and we want to freely navigate up and down it without creating reference
cycles that inhibit garbage collection, and without doing a lot of copying,
creating and deleting.
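One standard way to get cycle-free navigation in both directions, sketched here for illustration (not spaCy's actual design), is to hold strong references only downward and make the parent link a weak reference:

```python
# A common pattern for the problem described above: strong references
# point down the tree only, and each node's parent link is a weakref,
# so navigating up and down creates no reference cycles.
import weakref

class Node:
    def __init__(self, label, parent=None):
        self.label = label
        self.children = []  # strong references, down the tree only
        self._parent = weakref.ref(parent) if parent is not None else None

    @property
    def parent(self):
        # Dereference the weakref; None if this is the root.
        return self._parent() if self._parent is not None else None

    def add_child(self, label):
        child = Node(label, parent=self)
        self.children.append(child)
        return child

root = Node("S")
vp = root.add_child("VP")
print(vp.parent.label)  # S: navigate up without a strong back-reference
```

The trade-off is that a child's parent link goes dead if nothing else keeps the parent alive, which is why such designs usually have a container (the document or sentence) own the whole tree.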
@@ -53,7 +53,7 @@ pinning down or reproducing. Please send details of your system to the

Enhancements: Train and evaluate on whole paragraphs
----------------------------------------------------

.. note:: tl;dr: I shipped the wrong parsing model with 0.3. That model expected input to be segmented into sentences. 0.4 ships the correct model, which uses some algorithmic tricks to minimize the impact of tokenization and sentence segmentation errors on the parser.

Most English parsing research is performed on text with perfect pre-processing:

@@ -77,7 +77,7 @@ made a big difference:

| Corrected   | 89.9  | 88.8     |
+-------------+-------+----------+

.. note:: spaCy is evaluated on unlabelled dependencies, where the above accuracy figures refer to phrase-structure trees. Accuracies are non-comparable.