Commit Graph

5950 Commits

Author SHA1 Message Date
Matthew Honnibal
e65b5bb9a0 Fix tokenizer on Python2.7 (#3460)
spaCy v2.1 switched to the built-in re module, where v2.0 had been using
the third-party regex library. When the tokenizer was deserialized on
Python2.7, the `re.compile()` function was called with expressions that
featured escaped unicode codepoints that were not in Python2.7's unicode
database.

Problems occurred when we had a range between two of these unknown
codepoints, like this:

```
    '[\\uAA77-\\uAA79]'
```

On Python2.7, the unknown codepoints are not unescaped correctly,
resulting in arbitrary out-of-range characters being matched by the
expression.

This problem does not occur if we instead have a range between two
unicode literals, rather than the escape sequences. To fix the bug, we
therefore add a new compat function that unescapes unicode sequences
using the `ast.literal_eval()` function. Care is taken to ensure we
do not also escape non-unicode sequences.

Closes #3356.

- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-22 13:42:47 +01:00
Ines Montani
188ccd5750 Fix xfail marker 2019-03-22 12:54:14 +01:00
svlandeg
7cf0bc9a8c delete sandbox folder 2019-03-22 12:25:11 +01:00
svlandeg
5b1cd49222 error msg and unit tests for setting kb_id on span 2019-03-22 12:05:35 +01:00
svlandeg
a48241e9a2 use nlp's vocab for stringstore 2019-03-22 11:36:45 +01:00
svlandeg
1ee0e78fd7 select candidate with highest prior probabiity 2019-03-22 11:36:45 +01:00
svlandeg
7b708ab8a4 name per entity 2019-03-22 11:36:45 +01:00
svlandeg
c593607ce2 minimal EL pipe 2019-03-22 11:36:45 +01:00
svlandeg
c71123dd0c ensure no candidates are returned for unknown aliases 2019-03-22 11:36:45 +01:00
svlandeg
b6c3255a9f Entity class 2019-03-22 11:36:45 +01:00
svlandeg
1289cd6e8f property getters and keep track of KB internally 2019-03-22 11:36:45 +01:00
svlandeg
98ae77a682 unit test on number of candidates generated 2019-03-22 11:36:45 +01:00
svlandeg
9a46c431c3 store entity hash instead of pointer 2019-03-22 11:36:45 +01:00
svlandeg
9819dca80e create candidate object from entry pointer (not fully functional yet) 2019-03-22 11:36:45 +01:00
svlandeg
a9074e0886 check the length of entities and probabilities vector + unit test 2019-03-22 11:36:45 +01:00
svlandeg
d133ffaff9 correct size, not counting dummy elements in the vector 2019-03-22 11:36:45 +01:00
svlandeg
33f8a0fe2e check and unit test in case prior probs exceed 1 2019-03-22 11:36:45 +01:00
svlandeg
b55baaa1dc avoid value 0 in preshmap and helpful user warnings 2019-03-22 11:36:45 +01:00
svlandeg
20a7b7b1c0 raising error when adding alias for unknown entity + unit test 2019-03-22 11:36:45 +01:00
svlandeg
8843f9279c use StringStore 2019-03-22 11:36:45 +01:00
svlandeg
51560bf0ed bugfix adding aliases 2019-03-22 11:36:45 +01:00
svlandeg
c4ba942765 get candidates by alias 2019-03-22 11:36:45 +01:00
svlandeg
151b855cc8 adding and retrieving aliases 2019-03-22 11:36:45 +01:00
svlandeg
cf34113250 very minimal KB functionality working 2019-03-22 11:36:44 +01:00
svlandeg
af281c5466 adding aliases per entity in the KB 2019-03-22 11:36:44 +01:00
svlandeg
f77b99c103 fix compile errors 2019-03-22 11:36:44 +01:00
svlandeg
27483f9080 add pyx and separate method to add aliases 2019-03-22 11:36:44 +01:00
svlandeg
feb71e15fd hash the entity name 2019-03-22 11:36:44 +01:00
svlandeg
839dafa104 documented some comments and todos 2019-03-22 11:36:44 +01:00
svlandeg
7f37737878 kb snippet, draft by Matt (wip) 2019-03-22 11:36:44 +01:00
svlandeg
735fc2a735 annotate kb_id through ents in doc 2019-03-22 11:36:44 +01:00
svlandeg
d849eb2455 adding kb_id as field to token, el as nlp pipeline component 2019-03-22 11:34:46 +01:00
Matthew Honnibal
d811c97da1 Fix test that caused pytest to choke on Python3 2019-03-22 10:28:51 +01:00
Matthew Honnibal
a2ad9832e5 Add failing test for #3356 2019-03-22 02:42:37 +01:00
Matthew Honnibal
c66bd61e88 Fix lemmas 2019-03-21 14:22:12 +01:00
Matthew Honnibal
04395ffa49 Bring English tag_map in line with UD Treebank
I wrote a small script to read the UD English training data and check
that our tag map and morph rules were resulting in the best POS map.
This hadn't been done for some time, and there have been various changes
to the UD schema since it has been done. After these changes we should
see much better agreement between our POS assignments and the UD POS
tags.
2019-03-21 13:53:44 +01:00
Matthew Honnibal
c7f26abe5f
Merge pull request #3434 from Bharat123rox/narrow-unicode
Raise Error for a narrow unicode build of Python
2019-03-20 12:19:52 +01:00
Matthew Honnibal
1c8ff59185
Merge pull request #3441 from explosion/fix/cli-ud-scripts
💫 Move UD scripts to bin
2019-03-20 12:19:15 +01:00
Matthew Honnibal
72889a16d5 Fix similarity calculation if vectors are on GPU (#3440) 2019-03-20 12:09:59 +01:00
Matthew Honnibal
1612990e88 Implement cosine loss for spacy pretrain. Make default 2019-03-20 11:06:58 +00:00
Ines Montani
ae5b4d0e84 Fix formatting (hopefully also restarts build properly) 2019-03-20 09:55:45 +01:00
Ines Montani
6abc1ddb26 Update __main__.py 2019-03-20 09:43:26 +01:00
Bharat123Rox
f2547f02d6 Made changes suggested by @ines 2019-03-20 07:43:19 +05:30
Ines Montani
7400c7f8a7 Move UD scripts to bin 2019-03-20 01:19:34 +01:00
Ines Montani
685fff40cf Revert "Add --always-link flag to cli.download (see #3435)"
This reverts commit 583a566843.
2019-03-20 01:03:40 +01:00
Matthew Honnibal
6cfbb2d34e Merge branch 'master' of https://github.com/explosion/spaCy 2019-03-20 00:59:54 +01:00
Matthew Honnibal
5a53e9358a Set version to 2.1.1 2019-03-20 00:59:45 +01:00
Ines Montani
583a566843 Add --always-link flag to cli.download (see #3435) 2019-03-19 22:03:27 +01:00
Bharat123Rox
6db1ddd9c7 Raise ValueError for narrow unicode build 2019-03-19 23:02:58 +05:30
Mehdi Hamoumi
9211f30ee3 Tiny correction in french lookup dictionary (#3427) 2019-03-19 13:00:19 +01:00
Ines Montani
f0c1efcb00 Set version to 2.1.0 2019-03-17 22:42:58 +01:00
Matthew Honnibal
47e110375d Fix jsonl to json conversion (#3419)
* Fix spacy.gold.docs_to_json function

* Fix jsonl2json converter
2019-03-17 22:12:54 +01:00
Matthew Honnibal
0a4b074184 Improve beam search defaults 2019-03-17 21:47:45 +01:00
Ines Montani
226db621d0 Strip out .dev versions in spacy validate [ci skip] 2019-03-17 12:16:53 +01:00
Matthew Honnibal
c6be9964ec Set version to v2.1.0.dev1 2019-03-16 21:47:41 +01:00
Matthew Honnibal
61617c64d5 Revert changes to optimizer default hyper-params (WIP) (#3415)
While developing v2.1, I ran a bunch of hyper-parameter search
experiments to find settings that performed well for spaCy's NER and
parser. I ended up changing the default Adam settings from beta1=0.9,
beta2=0.999, eps=1e-8 to beta1=0.8, beta2=0.8, eps=1e-5. This was giving
a small improvement in accuracy (like, 0.4%).

Months later, I run the models with Prodigy, which uses beam-search
decoding even when the model has been trained with a greedy objective.
The new models performed terribly...So, wtf? After a couple of days
debugging, I figured out that the new optimizer settings was causing the
model to converge to solutions where the top-scoring class often had
a score of like, -80. The variance on the weights had gone up
enormously. I guess I needed to update the L2 regularisation as well?

Anyway. Let's just revert the change --- if the optimizer is finding
such extreme solutions, that seems bad, and not nearly worth the small
improvement in accuracy.

Currently training a slate of models, to verify the accuracy change is minimal.
Once the training is complete, we can merge this.

<!--- Provide a general summary of your changes in the title. -->

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-16 21:39:02 +01:00
Matthew Honnibal
62afa64a8d Expose batch size and length caps on CLI for pretrain (#3417)
Add and document CLI options for batch size, max doc length, min doc length for `spacy pretrain`.

Also improve CLI output.

Closes #3216 

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-16 21:38:45 +01:00
Matthew Honnibal
58d562d9b0
Merge pull request #3416 from explosion/feature/improve-beam
Improve beam search support
2019-03-16 18:42:18 +01:00
Ines Montani
2c5dd4d602 Update Vectors.find docs [ci skip] 2019-03-16 17:10:57 +01:00
Ines Montani
0f8739c7cb Update train.py 2019-03-16 16:04:15 +01:00
Ines Montani
e7aa25d9b1 Fix beam width integration 2019-03-16 16:02:47 +01:00
Ines Montani
c94742ff64 Only add beam width if customised 2019-03-16 15:55:31 +01:00
Ines Montani
7a354761c7 Auto-format 2019-03-16 15:55:13 +01:00
Matthew Honnibal
daa8c3787a Add eval_beam_widths argument to spacy train 2019-03-16 15:02:39 +01:00
Ines Montani
2eecd756fa Update package name 2019-03-16 14:43:53 +01:00
Ines Montani
f55a52a2dd Set version to v2.1.0.dev0 2019-03-16 13:47:03 +01:00
Ryan Ford
00842d7f1b Merging conversion scripts for conll formats (#3405)
* merging conllu/conll and conllubio scripts

* tabs to spaces

* removing conllubio2json from converters/__init__.py

* Move not-really-CLI tests to misc

* Add converter test using no-ud data

* Fix test I broke

* removing include_biluo parameter

* fixing read_conllx

* remove include_biluo from convert.py
2019-03-15 18:14:46 +01:00
Ines Montani
bec8db91e6 Add actual deprecation warning for n_threads (resolves #3410) 2019-03-15 16:38:44 +01:00
Ines Montani
cb5dbfa63a Tidy up references to n_threads and fix default 2019-03-15 16:24:26 +01:00
Ines Montani
852e1f105c Tidy up docstrings 2019-03-15 16:23:17 +01:00
Matthew Honnibal
b13b2aeb54 Use hash_state in beam 2019-03-15 15:22:58 +01:00
Matthew Honnibal
693c8934e8 Normalize over all actions in parser, not just valid ones 2019-03-15 15:22:16 +01:00
Matthew Honnibal
b94b2b1168 Export hash_state from beam_utils 2019-03-15 15:20:28 +01:00
Matthew Honnibal
ad56641324 Fix Language.evaluate 2019-03-15 15:20:09 +01:00
Matthew Honnibal
f762c36e61 Evaluate accuracy at multiple beam widths 2019-03-15 15:19:49 +01:00
Matthew Honnibal
0703f5986b Remove hack from beam 2019-03-15 00:48:39 +01:00
Sofie
c45ed32c74 label in span not writable anymore (#3408)
* label in span not writable anymore

* more explicit unit test and error message for readonly label

* bit more explanation (view)

* error msg tailored to specific case

* fix None case
2019-03-15 00:46:45 +01:00
Ines Montani
8ac197d443 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-03-12 15:22:11 +01:00
Matthew Honnibal
6aab2d8533 Set version to v2.1.0a13 2019-03-12 15:14:06 +01:00
Ines Montani
8ee6514ab8 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-03-12 15:11:39 +01:00
Ines Montani
479b5cff43 Auto-format [ci skip] 2019-03-12 13:35:34 +01:00
Matthew Honnibal
1179de0860 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-03-12 13:33:22 +01:00
Matthew Honnibal
8a4121cbc2 Fix bug introduced by component_cfg 2019-03-12 13:32:56 +01:00
Ines Montani
2912ddc9a6 Don't set extension attribute in Japanese (closes #3398) 2019-03-12 13:30:33 +01:00
Matthew Honnibal
062934aa12 Set version to v2.1.0a12 2019-03-11 22:26:19 +01:00
Ines Montani
886e5966c0 Update test_displacy.py 2019-03-11 19:03:52 +01:00
Ines Montani
4bd2688eac
💫 Fix displaCy support for RTL languages (#3393)
Closes #2091.

## Description

With the new `vocab.writing_system` property introduced in #3390 (exposed via the language defaults), I was able to finally fix this (I think!). Based on the `Doc`, dispaCy now detects whether it's a RTL or LTR language and adjusts the visualization accordingly. Wherever possible, I've also added `direction` and `lang` attributes.

Entity visualization now looks like this:

<img width="318" alt="Screenshot 2019-03-11 at 16 06 51" src="https://user-images.githubusercontent.com/13643239/54136866-d97afd80-441c-11e9-8c27-3d46994cc833.png">

And dependencies like this (ignore the most likely incorrect tags and dependencies):

<img width="621" alt="Screenshot 2019-03-11 at 16 51 59" src="https://user-images.githubusercontent.com/13643239/54137771-8b66f980-441e-11e9-8460-0682b95eef2a.png">

### Types of change
enhancement, bug fix

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-11 18:52:50 +01:00
Ines Montani
cdd418b93e Auto-format [ci skip] 2019-03-11 17:10:50 +01:00
Matthew Honnibal
b0b990e405 Fix token.conjuncts (closes #795) (#3392)
* Implement conjuncts method

* Add span.conjuncts property

* Un-xfail token.conjuncts tests

* Update docs for token.conjuncts and span.conjuncts

* Fix merge error in token.conjuncts
2019-03-11 17:05:45 +01:00
Matthew Honnibal
e2b9b523ce Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-03-11 15:59:28 +01:00
Ines Montani
47e9c274ef Tidy up property code style (#3391)
Use decorator if properties only have a getter and existing syntax if there's getter and setter
2019-03-11 15:59:09 +01:00
Matthew Honnibal
db79a704bf Add xfail tests for token.conjuncts 2019-03-11 15:46:52 +01:00
Ines Montani
c3df4d1108 Move displaCy tests to own file 2019-03-11 15:28:34 +01:00
Ines Montani
c5a407e95a Fix code style 2019-03-11 15:28:22 +01:00
Matthew Honnibal
39a4741e26 Add support for vocab.writing_system property (#3390)
* Add xfail test for vocab.writing_system

* Add vocab.writing_system property

* Set Language.Defaults.writing_system

* Set default writing system

* Remove xfail on test_vocab_writing_system
2019-03-11 15:23:20 +01:00
Matthew Honnibal
05ef0a5abb Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-03-11 14:33:15 +01:00
Ines Montani
ee4f312e89 Add writing_system to ArabicDefaults (experimental) 2019-03-11 14:22:23 +01:00
Ines Montani
ebcf2bb1c3 Add Doc.lang and Doc.lang_ 2019-03-11 14:21:40 +01:00
Ines Montani
ef80cfde6f Fix pickling of Japanese (closes #3191) 2019-03-11 13:34:23 +01:00
Ines Montani
c399162a82 Tidy up 2019-03-11 13:34:14 +01:00
Ines Montani
7c05ca01e8 💫 Support mutable default values for extension attributes (#3389)
* Support mutable default values in extensions

* Update documentation
2019-03-11 12:50:44 +01:00
Matthew Honnibal
4e8a07c7d3 Set version to v2.1.0a11 2019-03-11 10:45:06 +01:00
Matthew Honnibal
80b94313b6 💫 Fix interaction of lemmatizer and tokenizer exceptions (#3388)
Closes #2203. Closes #3268.

Lemmas set from outside the `Morphology` class were being overwritten. The result was especially confusing when deserialising, as it meant some lemmas could change when storing and retrieving a `Doc` object.

This PR applies two fixes:

1) When we go to set the lemma in the `Morphology` class, first check whether a lemma is already set. If so, don't overwrite.
2) When we load with `doc.from_array()`, take care to apply the `TAG` field first. This allows other fields to overwrite the `TAG` implied properties, if they're provided explicitly (e.g. the `LEMMA`).

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-11 01:31:21 +01:00
Matthew Honnibal
04ca710da7 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-03-11 01:07:34 +01:00
Matthew Honnibal
5d25ee52fb Fix English tag map 2019-03-11 01:06:02 +01:00
Ines Montani
8f45ff3dc2 Adjust formatting [ci skip] 2019-03-11 00:47:41 +01:00
Matthew Honnibal
7503e1e505 Improve English tag map. Re #593, #3311 2019-03-10 23:50:00 +01:00
Matthew Honnibal
98acf5ffe4 💫 Allow passing of config parameters to specific pipeline components (#3386)
* Add component_cfg kwarg to begin_training

* Document component_cfg arg to begin_training

* Update docs and auto-format

* Support component_cfg across Language

* Format

* Update docs and docstrings [ci skip]

* Fix begin_training
2019-03-10 23:36:47 +01:00
Ines Montani
c998cde7e2 Auto-format [ci skip] 2019-03-10 19:22:59 +01:00
Ines Montani
7ba3a5d95c 💫 Make serialization methods consistent (#3385)
* Make serialization methods consistent

exclude keyword argument instead of random named keyword arguments and deprecation handling

* Update docs and add section on serialization fields
2019-03-10 19:16:45 +01:00
Ines Montani
67e38690d4 Un-xfail passing tests and tidy up 2019-03-10 18:42:16 +01:00
Matthew Honnibal
27dd820753
Fix vocab deserialization when loading already present lexemes (#3383)
* Fix vocab deserialization bug. Closes #2153

* Un-xfail test for #2153
2019-03-10 17:21:19 +01:00
Matthew Honnibal
d6eaa71afc Handle scalar values in doc.from_array() 2019-03-10 16:54:03 +01:00
Matthew Honnibal
61e5ce02a4 Add xfailing test for #2153 2019-03-10 16:36:29 +01:00
Matthew Honnibal
7461e5e055 Fix batch bug in issue #3344 2019-03-10 16:01:34 +01:00
Matthew Honnibal
8a6272f842 Un-xfail test 2019-03-10 15:51:15 +01:00
Matthew Honnibal
4e80fc41ad Make doc.from_array() consistent with doc.to_array(). Closes #3382 2019-03-10 15:50:48 +01:00
Ines Montani
0426689db8 💫 Improve Doc.to_json and add Doc.is_nered (#3381)
* Use default return instead of else

* Add Doc.is_nered to indicate if entities have been set

* Add properties in Doc.to_json if they were set, not if they're available

This way, if a processed Doc exports "pos": None, it means that the tag was explicitly unset. If it exports "ents": [], it means that entity annotations are available but that this document doesn't contain any entities. Before, this would have been unclear and problematic for training.
2019-03-10 15:24:34 +01:00
Ines Montani
7984543953 Add xfailing test for to_array/from_array string attrs 2019-03-10 15:08:15 +01:00
Ines Montani
6bbf4ea309 Simplify tests and avoid tokenizing 2019-03-10 15:05:56 +01:00
Matthew Honnibal
a5b1f6dcec Fix NER when preset entities cross sentence boundaries (#3379)
💫 Fix NER when preset entities cross sentence boundaries
2019-03-10 14:53:03 +01:00
Ines Montani
3fe5811fa7 Only link model after download if shortcut link (#3378) 2019-03-10 13:02:24 +01:00
Matthew Honnibal
231bc7bb7b Add xfailing test for #3345 2019-03-10 13:00:15 +01:00
Matthew Honnibal
bdc77848f5 Add helper method to apply a transition in parser/NER 2019-03-10 13:00:00 +01:00
Matthew Honnibal
ce1fe8a510 Add comment 2019-03-09 17:51:17 +00:00
Matthew Honnibal
28c26e212d Fix textcat model for GPU 2019-03-09 17:50:08 +00:00
Ines Montani
610fb306bd Revert hyphens 2019-03-09 12:51:53 +01:00
Ines Montani
bbabb6aaae Escape more hyphens 2019-03-09 12:41:05 +01:00
Ines Montani
b8db219850 Auto-format 2019-03-09 12:40:58 +01:00
Ines Montani
a145bfe627 Try escaping hyphens again 2019-03-09 03:06:50 +01:00
Ines Montani
b9c71fc0f0 Fix flags 2019-03-09 02:46:04 +01:00
Ines Montani
ae09b6a6cf Try fixing unicode inconsistencies on Python 2 2019-03-09 02:37:50 +01:00
Ines Montani
d957d7a697 Auto-format 2019-03-09 02:37:41 +01:00
Ines Montani
65402c3d02 Revert "Experiment with escaping hyphens"
This reverts commit 9b42e2d5dd.
2019-03-09 02:13:00 +01:00
Ines Montani
9b42e2d5dd Experiment with escaping hyphens 2019-03-09 02:05:26 +01:00
Ines Montani
76764fcf59 💫 Improve converters and training data file formats (#3374)
* Populate converter argument info automatically

* Add conversion option for msgpack

* Update docs

* Allow reading training data from JSONL
2019-03-08 23:15:23 +01:00
Ines Montani
296446a1c8
Tidy up and improve docs and docstrings (#3370)
<!--- Provide a general summary of your changes in the title. -->

## Description
* tidy up and adjust Cython code to code style
* improve docstrings and make calling `help()` nicer
* add URLs to new docs pages to docstrings wherever possible, mostly to user-facing objects
* fix various typos and inconsistencies in docs

### Types of change
enhancement, docs

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-08 11:42:26 +01:00
Ines Montani
daaeeb7a2b Merge branch 'master' into develop 2019-03-07 22:07:31 +01:00
Adrien Ball
88909a9adb Fix egg fragments in direct download (#3369)
## Description
The egg fragment in the URL must be of the form `#egg=package_name==version` instead of `#egg=package_name-version`.
One of the consequences of specifying wrong egg fragments is that `pip` does not recognize the package and its version properly, and thus it re-downloads the package systematically.

I'm not sure how this should be tested properly. 
Here is what I had before the fix when running the same direct download twice:
```
$ python -m spacy download en_core_web_sm-2.0.0 --direct
Looking in indexes: https://pypi.python.org/simple/
Collecting en_core_web_sm-2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm-2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 1.6MB/s
  Generating metadata for package en-core-web-sm-2.0.0 produced metadata for project name en-core-web-sm. Fix your #egg=en-core-web-sm-2.0.0 fragments.
Installing collected packages: en-core-web-sm
  Running setup.py install for en-core-web-sm ... done
Successfully installed en-core-web-sm-2.0.0
$ python -m spacy download en_core_web_sm-2.0.0 --direct
Looking in indexes: https://pypi.python.org/simple/
Collecting en_core_web_sm-2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm-2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 919kB/s
  Generating metadata for package en-core-web-sm-2.0.0 produced metadata for project name en-core-web-sm. Fix your #egg=en-core-web-sm-2.0.0 fragments.
Requirement already satisfied (use --upgrade to upgrade): en-core-web-sm from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm-2.0.0 in ./venv3/lib/python3.6/site-packages
```

And after the fix:
```
$ python -m spacy download en_core_web_sm-2.0.0 --direct
Looking in indexes: https://pypi.python.org/simple/
Collecting en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 1.1MB/s
Installing collected packages: en-core-web-sm
  Running setup.py install for en-core-web-sm ... done
Successfully installed en-core-web-sm-2.0.0
$ python -m spacy download en_core_web_sm-2.0.0 --direct
Looking in indexes: https://pypi.python.org/simple/
Requirement already satisfied: en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0 in ./venv3/lib/python3.6/site-packages (2.0.0)
```

### Types of change
This is an enhancement as it avoids unnecessary downloads of (potentially big) spacy models, when they have already been downloaded.

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-07 21:07:19 +01:00
Ines Montani
96b91a8898 Fix noqa [ci skip] 2019-03-07 12:25:00 +01:00
Ines Montani
9d6ca18a10 Tidy up and only use self.vector once 2019-03-07 01:06:12 +01:00
Ines Montani
a8f1efd2f5 Merge branch 'master' into develop 2019-03-07 00:56:31 +01:00
Daniel King
5f40229397 Don't use numpy directly for similarity (#3362)
* Don't use numpy directly for similarity

* Contributor agreement
2019-03-06 22:58:38 +00:00
Ines Montani
6bd34e9d54 Expose Japanese stop words (closes #3346) 2019-03-06 14:21:15 +01:00
Ines Montani
85deb96278 Fix whitespace 2019-03-06 14:20:34 +01:00
Ines Montani
23f6ebf0f3 Add missing " (closes #3343) 2019-02-27 16:37:03 +01:00
Ines Montani
533b580c19 Add test for stray print statements in languages (see #3342) 2019-02-27 16:04:30 +01:00
Ines Montani
48a2046d1c Remove stray print statement (closes #3342) 2019-02-27 15:35:04 +01:00
Ines Montani
07d7c0a1af Fix whitespace 2019-02-27 15:34:21 +01:00
Ines Montani
9b62639d19 Auto-format [ci skip] 2019-02-27 14:24:55 +01:00