Commit Graph

468 Commits

Author SHA1 Message Date
svlandeg
3420cbe496 small fixes 2019-07-03 10:25:51 +02:00
svlandeg
2d2dea9924 experiment with adding NER types to the feature vector 2019-06-29 14:52:36 +02:00
svlandeg
c664f58246 adding prior probability as feature in the model 2019-06-28 16:22:58 +02:00
svlandeg
1c80b85241 fix tests 2019-06-28 08:59:23 +02:00
svlandeg
68a0662019 context encoder with Tok2Vec + linking model instead of cosine 2019-06-28 08:29:31 +02:00
svlandeg
dbc53b9870 rename to KBEntryC 2019-06-26 15:55:26 +02:00
svlandeg
1de61f68d6 improve speed of prediction loop 2019-06-26 13:53:10 +02:00
svlandeg
bee23cd8af try Tok2Vec instead of SpacyVectors 2019-06-25 16:09:22 +02:00
svlandeg
b58bace84b small fixes 2019-06-24 10:55:04 +02:00
svlandeg
a31648d28b further code cleanup 2019-06-19 09:15:43 +02:00
svlandeg
478305cd3f small tweaks and documentation 2019-06-18 18:38:09 +02:00
svlandeg
0d177c1146 clean up code, remove old code, move to bin 2019-06-18 13:20:40 +02:00
svlandeg
ffae7d3555 sentence encoder only (removing article/mention encoder) 2019-06-18 00:05:47 +02:00
svlandeg
6332af40de baseline performances: oracle KB, random and prior prob 2019-06-17 14:39:40 +02:00
svlandeg
24db1392b9 reprocessing all of wikipedia for training data 2019-06-16 21:14:45 +02:00
svlandeg
81731907ba performance per entity type 2019-06-14 19:55:46 +02:00
svlandeg
b312f2d0e7 redo training data to be independent of KB and entity-level instead of doc-level 2019-06-14 15:55:26 +02:00
svlandeg
0b04d142de regenerating KB 2019-06-13 22:32:56 +02:00
svlandeg
78dd3e11da write entity linking pipe to file and keep vocab consistent between kb and nlp 2019-06-13 16:25:39 +02:00
svlandeg
b12001f368 small fixes 2019-06-12 22:05:53 +02:00
svlandeg
6521cfa132 speeding up training 2019-06-12 13:37:05 +02:00
svlandeg
66813a1fdc speed up predictions 2019-06-11 14:18:20 +02:00
svlandeg
fe1ed432ef eval on dev set, varying combo's of prior and context scores 2019-06-11 11:40:58 +02:00
svlandeg
83dc7b46fd first tests with EL pipe 2019-06-10 21:25:26 +02:00
svlandeg
7de1ee69b8 training loop in proper pipe format 2019-06-07 15:55:10 +02:00
svlandeg
0486ccabfd introduce goldparse.links 2019-06-07 13:54:45 +02:00
svlandeg
a5c061f506 storing NEL training data in GoldParse objects 2019-06-07 12:58:42 +02:00
svlandeg
61f0e2af65 code cleanup 2019-06-06 20:22:14 +02:00
svlandeg
d8b435ceff pretraining description vectors and storing them in the KB 2019-06-06 19:51:27 +02:00
svlandeg
5c723c32c3 entity vectors in the KB + serialization of them 2019-06-05 18:29:18 +02:00
svlandeg
9abbd0899f separate entity encoder to get 64D descriptions 2019-06-05 00:09:46 +02:00
svlandeg
fb37cdb2d3 implementing el pipe in pipes.pyx (not tested yet) 2019-06-03 21:32:54 +02:00
svlandeg
d83a1e3052 Merge branch 'master' into feature/nel-wiki 2019-06-03 09:35:10 +02:00
svlandeg
9e88763dab 60% acc run 2019-06-03 08:04:49 +02:00
svlandeg
268a52ead7 experimenting with cosine sim for negative examples (not OK yet) 2019-05-29 16:07:53 +02:00
svlandeg
a761929fa5 context encoder combining sentence and article 2019-05-28 18:14:49 +02:00
svlandeg
992fa92b66 refactor again to clusters of entities and cosine similarity 2019-05-28 00:05:22 +02:00
svlandeg
8c4aa076bc small fixes 2019-05-27 14:29:38 +02:00
svlandeg
cfc27d7ff9 using Tok2Vec instead 2019-05-26 23:39:46 +02:00
svlandeg
abf9af81c9 learn rate en epochs 2019-05-24 22:04:25 +02:00
svlandeg
86ed771e0b adding local sentence encoder 2019-05-23 16:59:11 +02:00
svlandeg
4392c01b7b obtain sentence for each mention 2019-05-23 15:37:05 +02:00
svlandeg
97241a3ed7 upsampling and batch processing 2019-05-22 23:40:10 +02:00
svlandeg
1a16490d20 update per entity 2019-05-22 12:46:40 +02:00
svlandeg
eb08bdb11f hidden with for encoders 2019-05-21 23:42:46 +02:00
svlandeg
7b13e3d56f undersampling negatives 2019-05-21 18:35:10 +02:00
svlandeg
2fa3fac851 fix concat bp and more efficient batch calls 2019-05-21 13:43:59 +02:00
svlandeg
0a15ee4541 fix in bp call 2019-05-20 23:54:55 +02:00
svlandeg
89e322a637 small fixes 2019-05-20 17:20:39 +02:00
svlandeg
7edb2e1711 fix convolution layer 2019-05-20 11:58:48 +02:00
svlandeg
dd691d0053 debugging 2019-05-17 17:44:11 +02:00
svlandeg
400b19353d simplify architecture and larger-scale test runs 2019-05-17 01:51:18 +02:00
svlandeg
d51bffe63b clean up code 2019-05-16 18:36:15 +02:00
svlandeg
b5470f3d75 various tests, architectures and experiments 2019-05-16 18:25:34 +02:00
svlandeg
9ffe5437ae calculate gradient for entity encoding 2019-05-15 02:23:08 +02:00
svlandeg
2713abc651 implement loss function using dot product and prob estimate per candidate cluster 2019-05-14 22:55:56 +02:00
svlandeg
09ed446b20 different architecture / settings 2019-05-14 08:37:52 +02:00
svlandeg
4142e8dd1b train and predict per article (saving time for doc encoding) 2019-05-13 17:02:34 +02:00
svlandeg
3b81b00954 evaluating on dev set during training 2019-05-13 14:26:04 +02:00
svlandeg
b6d788064a some first experiments with different architectures and metrics 2019-05-10 12:53:14 +02:00
svlandeg
9d089c0410 grouping clusters of instances per doc+mention 2019-05-09 18:11:49 +02:00
svlandeg
c6ca8649d7 first stab at model - not functional yet 2019-05-09 17:23:19 +02:00
svlandeg
9f33732b96 using entity descriptions and article texts as input embedding vectors for training 2019-05-07 16:03:42 +02:00
svlandeg
7e348d7f7f baseline evaluation using highest-freq candidate 2019-05-06 15:13:50 +02:00
Ines Montani
dd153b2b33 Simplify helper (see #3681) [ci skip] 2019-05-06 15:13:10 +02:00
Ines Montani
f8fce6c03c Fix typo (see #3681) 2019-05-06 15:02:11 +02:00
Ines Montani
f2a56c1b56 Rewrite example to use Retokenizer (resolves #3681)
Also add helper to filter spans
2019-05-06 14:51:18 +02:00
svlandeg
6961215578 refactor code to separate functionality into different files 2019-05-06 10:56:56 +02:00
svlandeg
f5190267e7 run only 100M of WP data as training dataset (9%) 2019-05-03 18:09:09 +02:00
svlandeg
4e929600e5 fix WP id parsing, speed up processing and remove ambiguous strings in one doc (for now) 2019-05-03 17:37:47 +02:00
svlandeg
34600c92bd try catch per article to ensure the pipeline goes on 2019-05-03 15:10:09 +02:00
svlandeg
bbcb9da466 creating training data with clean WP texts and QID entities true/false 2019-05-03 10:44:29 +02:00
svlandeg
cba9680d13 run NER on clean WP text and link to gold-standard entity IDs 2019-05-02 17:24:52 +02:00
svlandeg
581dc9742d parsing clean text from WP articles to use as input data for NER and NEL 2019-05-02 17:09:56 +02:00
svlandeg
8353552191 cleanup 2019-05-01 23:26:16 +02:00
svlandeg
1ae41daaa9 allow small rounding errors 2019-05-01 23:05:40 +02:00
svlandeg
3629a52ede reading all persons in wikidata 2019-05-01 01:00:59 +02:00
svlandeg
60b54ae8ce bulk entity writing and experiment with regex wikidata reader to speed up processing 2019-05-01 00:00:38 +02:00
svlandeg
653b7d9c87 calculate entity raw counts offline to speed up KB construction 2019-04-30 11:39:42 +02:00
svlandeg
19e8f339cb deduce entity freq from WP corpus and serialize vocab in WP test 2019-04-29 17:37:29 +02:00
svlandeg
54d0cea062 unit test for KB serialization 2019-04-24 23:52:34 +02:00
svlandeg
3e0cb69065 KB aliases to and from file 2019-04-24 20:24:24 +02:00
svlandeg
ad6c5e581c writing and reading number of entries to/from header 2019-04-24 15:31:44 +02:00
svlandeg
6e3223f234 bulk loading in proper order of entity indices 2019-04-24 11:26:38 +02:00
svlandeg
694fea597a dumping all entryC entries + (inefficient) reading back in 2019-04-23 18:36:50 +02:00
svlandeg
8e70a564f1 custom reader and writer for _EntryC fields (first stab at it - not complete) 2019-04-23 16:33:40 +02:00
svlandeg
004e5e7d1c little fixes 2019-04-19 14:24:02 +02:00
svlandeg
9a8197185b fix alias capitalization 2019-04-18 22:37:50 +02:00
svlandeg
9f308eb5dc fixes for prior prob and linking wikidata IDs with wikipedia titles 2019-04-18 16:14:25 +02:00
svlandeg
10ee8dfea2 poc with few entities and collecting aliases from the WP links 2019-04-18 14:12:17 +02:00
svlandeg
6763e025e1 parse wp dump for links to determine prior probabilities 2019-04-15 11:41:57 +02:00
svlandeg
3163331b1e wikipedia dump parser and mediawiki format regex cleanup 2019-04-14 21:52:01 +02:00
svlandeg
b31a390a9a reading types, claims and sitelinks 2019-04-11 21:42:44 +02:00
svlandeg
6e997be4b4 reading wikidata descriptions and aliases 2019-04-11 21:08:22 +02:00
svlandeg
9a7d534b1b enable nogil for cython functions in kb.pxd 2019-04-10 17:25:10 +02:00
Ines Montani
24cecdb44f Update compatibility [ci skip] 2019-04-01 16:25:16 +02:00
Sofie
a4a6bfa4e1
Merge branch 'master' into feature/el-framework 2019-03-26 11:00:02 +01:00
svlandeg
8814b9010d entity as one field instead of both ID and name 2019-03-25 18:10:41 +01:00
Matthew Honnibal
6c783f8045 Bug fixes and options for TextCategorizer (#3472)
* Fix code for bag-of-words feature extraction

The _ml.py module had a redundant copy of a function to extract unigram
bag-of-words features, except one had a bug that set values to 0.
Another function allowed extraction of bigram features. Replace all three
with a new function that supports arbitrary ngram sizes and also allows
control of which attribute is used (e.g. ORTH, LOWER, etc).

* Support 'bow' architecture for TextCategorizer

This allows efficient ngram bag-of-words models, which are better when
the classifier needs to run quickly, especially when the texts are long.
Pass architecture="bow" to use it. The extra arguments ngram_size and
attr are also available, e.g. ngram_size=2 means unigram and bigram
features will be extracted.

* Fix size limits in train_textcat example

* Explain architectures better in docs
2019-03-23 16:44:44 +01:00
svlandeg
9de9900510 adding future import unicode literals to .py files 2019-03-22 16:18:04 +01:00
Matthew Honnibal
4c5f265884
Fix train loop for train_textcat example 2019-03-22 16:10:11 +01:00
svlandeg
5318ce88fa 'entity_linker' instead of 'el' 2019-03-22 13:55:10 +01:00
svlandeg
a48241e9a2 use nlp's vocab for stringstore 2019-03-22 11:36:45 +01:00
svlandeg
1ee0e78fd7 select candidate with highest prior probabiity 2019-03-22 11:36:45 +01:00
Matthew Honnibal
4e3ed2ea88 Add -t2v argument to train_textcat script 2019-03-20 23:05:42 +01:00
Ines Montani
399987c216 Test and update examples [ci skip] 2019-03-16 14:15:49 +01:00
Ines Montani
cb5dbfa63a Tidy up references to n_threads and fix default 2019-03-15 16:24:26 +01:00
Matthew Honnibal
4dc57d9e15 Update train_new_entity_type example 2019-02-24 16:41:03 +01:00
Matthew Honnibal
7ac0f9626c Update rehearsal example 2019-02-24 16:17:41 +01:00
Matthew Honnibal
981cb89194 Fix f-score calculation if zero 2019-02-23 12:45:41 +01:00
Matthew Honnibal
5063d999e5 Set architecture in textcat example 2019-02-23 11:57:59 +01:00
Matthew Honnibal
582be8746c Update multi_processing example 2019-02-21 10:33:16 +01:00
Ines Montani
9696cf16c1 Merge branch 'master' into develop 2019-02-20 21:31:27 +01:00
Michael Liberman
386cec1979 - Json fix in comment (#3294) 2019-02-19 18:01:35 +01:00
Ines Montani
5d0b60999d Merge branch 'master' into develop 2019-02-07 20:54:07 +01:00
Laura Baakman
04aa041c9e Update Example input JSON file to adhere to specification. (#3243)
* Example file does not adhere to json input spec.

According to the [json input spec ](https://spacy.io/api/annotation#json-input) the `id ` needs to be an `int` not a string. Using a string as `id` results in a `TypeError` when calling `spacy.gold.read_json_file()`.

* Add spaCy Contributor Agreement.
2019-02-07 16:18:01 +01:00
mak
8fc6aaf134 Updated main to make use of lang variable (#3220)
Updated main to make use of language variable when initializing spacy.
2019-01-31 23:43:22 +01:00
Hunter Kelly
f28a1c7271 Update call to mkdir() to create the parents (#3139)
* Update call to `mkdir()` to create the parents

- Update the call to `output_dir.mkdir()` to also create the parents if needed

* don't automatically create parents but fail fast if cannot create directory

* add signed contributors agreement for retnuh
2019-01-11 03:02:18 +01:00
Ines Montani
61d09c481b Merge branch 'master' into develop 2018-12-18 13:48:10 +01:00
Ines Montani
e3405f8af3 Don't call begin_training if updating new model (see #3059) [ci skip] 2018-12-17 13:45:49 +01:00
Ines Montani
c9a89bba50 Don't call begin_training if updating new model (see #3059) [ci skip] 2018-12-17 13:45:28 +01:00
Ines Montani
6f1438b5d9 Auto-format example 2018-12-17 13:44:38 +01:00
Matthew Honnibal
83ac227bd3
💫 Better support for semi-supervised learning (#3035)
The new spacy pretrain command implemented BERT/ULMFit/etc-like transfer learning, using our Language Modelling with Approximate Outputs version of BERT's cloze task. Pretraining is convenient, but in some ways it's a bit of a strange solution. All we're doing is initialising the weights. At the same time, we're putting a lot of work into our optimisation so that it's less sensitive to initial conditions, and more likely to find good optima. I discuss this a bit in the pseudo-rehearsal blog post: https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting
Support semi-supervised learning in spacy train

One obvious way to improve these pretraining methods is to do multi-task learning, instead of just transfer learning. This has been shown to work very well: https://arxiv.org/pdf/1809.08370.pdf . This patch makes it easy to do this sort of thing.

    Add a new argument to spacy train, --raw-text. This takes a jsonl file with unlabelled data that can be used in arbitrary ways to do semi-supervised learning.

    Add a new method to the Language class and to pipeline components, .rehearse(). This is like .update(), but doesn't expect GoldParse objects. It takes a batch of Doc objects, and performs an update on some semi-supervised objective.

    Move the BERT-LMAO objective out from spacy/cli/pretrain.py into spacy/_ml.py, so we can create a new pipeline component, ClozeMultitask. This can be specified as a parser or NER multitask in the spacy train command. Example usage:

python -m spacy train en ./tmp ~/data/en-core-web/train/nw.json ~/data/en-core-web/dev/nw.json --pipeline parser --raw-textt ~/data/unlabelled/reddit-100k.jsonl --vectors en_vectors_web_lg --parser-multitasks cloze

Implement rehearsal methods for pipeline components

The new --raw-text argument and nlp.rehearse() method also gives us a good place to implement the the idea in the pseudo-rehearsal blog post in the parser. This works as follows:

    Add a new nlp.resume_training() method. This allocates copies of pre-trained models in the pipeline, setting things up for the rehearsal updates. It also returns an optimizer object. This also greatly reduces confusion around the nlp.begin_training() method, which randomises the weights, making it not suitable for adding new labels or otherwise fine-tuning a pre-trained model.

    Implement rehearsal updates on the Parser class, making it available for the dependency parser and NER. During rehearsal, the initial model is used to supervise the model being trained. The current model is asked to match the predictions of the initial model on some data. This minimises catastrophic forgetting, by keeping the model's predictions close to the original. See the blog post for details.

    Implement rehearsal updates for tagger

    Implement rehearsal updates for text categoriz
2018-12-10 16:25:33 +01:00
Matthew Honnibal
375f0dc529
💫 Make TextCategorizer default to a simpler, GPU-friendly model (#3038)
Currently the TextCategorizer defaults to a fairly complicated model, designed partly around the active learning requirements of Prodigy. The model's a bit slow, and not very GPU-friendly.

This patch implements a straightforward CNN model that still performs pretty well. The replacement model also makes it easy to use the LMAO pretraining, since most of the parameters are in the CNN.

The replacement model has a flag to specify whether labels are mutually exclusive, which defaults to True. This has been a common problem with the text classifier. We'll also now be able to support adding labels to pretrained models again.

Resolves #2934, #2756, #1798, #1748.
2018-12-10 14:37:39 +01:00
Matthew Honnibal
e5685d98a2 Fix averaging in textcat example (closes #2745) (#3032) [ci skip] 2018-12-08 13:27:05 +01:00
Ines Montani
5b2741f751 Remove unused cytoolz / itertools imports 2018-12-03 02:12:07 +01:00
Gavriel Loria
ae5601beae Initialize trues to 0.0 in training example (#3004)
* added contributor agreement

* if there are no true positives, precision should be 0.0
2018-12-03 01:33:22 +01:00
Ines Montani
f37863093a 💫 Replace ujson, msgpack and dill/pickle/cloudpickle with srsly (#3003)
Remove hacks and wrappers, keep code in sync across our libraries and move spaCy a few steps closer to only depending on packages with binary wheels 🎉

See here: https://github.com/explosion/srsly

    Serialization is hard, especially across Python versions and multiple platforms. After dealing with many subtle bugs over the years (encodings, locales, large files) our libraries like spaCy and Prodigy have steadily grown a number of utility functions to wrap the multiple serialization formats we need to support (especially json, msgpack and pickle). These wrapping functions ended up duplicated across our codebases, so we wanted to put them in one place.

    At the same time, we noticed that having a lot of small dependencies was making maintainence harder, and making installation slower. To solve this, we've made srsly standalone, by including the component packages directly within it. This way we can provide all the serialization utilities we need in a single binary wheel.

    srsly currently includes forks of the following packages:

        ujson
        msgpack
        msgpack-numpy
        cloudpickle



* WIP: replace json/ujson with srsly

* Replace ujson in examples

Use regular json instead of srsly to make code easier to read and follow

* Update requirements

* Fix imports

* Fix typos

* Replace msgpack with srsly

* Fix warning
2018-12-03 01:28:22 +01:00
Ines Montani
40b57ea4ac Format example 2018-12-02 04:28:34 +01:00
Ines Montani
45798cc53e Auto-format examples 2018-12-02 04:26:26 +01:00
Ines Montani
d33953037e
💫 Port master changes over to develop (#2979)
* Create aryaprabhudesai.md (#2681)

* Update _install.jade (#2688)

Typo fix: "models" -> "model"

* Add FAC to spacy.explain (resolves #2706)

* Remove docstrings for deprecated arguments (see #2703)

* When calling getoption() in conftest.py, pass a default option (#2709)

* When calling getoption() in conftest.py, pass a default option

This is necessary to allow testing an installed spacy by running:

  pytest --pyargs spacy

* Add contributor agreement

* update bengali token rules for hyphen and digits (#2731)

* Less norm computations in token similarity (#2730)

* Less norm computations in token similarity

* Contributor agreement

* Remove ')' for clarity (#2737)

Sorry, don't mean to be nitpicky, I just noticed this when going through the CLI and thought it was a quick fix. That said, if this was intention than please let me know.

* added contributor agreement for mbkupfer (#2738)

* Basic support for Telugu language (#2751)

* Lex _attrs for polish language (#2750)

* Signed spaCy contributor agreement

* Added polish version of english lex_attrs

* Introduces a bulk merge function, in order to solve issue #653 (#2696)

* Fix comment

* Introduce bulk merge to increase performance on many span merges

* Sign contributor agreement

* Implement pull request suggestions

* Describe converters more explicitly (see #2643)

* Add multi-threading note to Language.pipe (resolves #2582) [ci skip]

* Fix formatting

* Fix dependency scheme docs (closes #2705) [ci skip]

* Don't set stop word in example (closes #2657) [ci skip]

* Add words to portuguese language _num_words (#2759)

* Add words to portuguese language _num_words

* Add words to portuguese language _num_words

* Update Indonesian model (#2752)

* adding e-KTP in tokenizer exceptions list

* add exception token

* removing lines with containing space as it won't matter since we use .split() method in the end, added new tokens in exception

* add tokenizer exceptions list

* combining base_norms with norm_exceptions

* adding norm_exception

* fix double key in lemmatizer

* remove unused import on punctuation.py

* reformat stop_words to reduce number of lines, improve readibility

* updating tokenizer exception

* implement is_currency for lang/id

* adding orth_first_upper in tokenizer_exceptions

* update the norm_exception list

* remove bunch of abbreviations

* adding contributors file

* Fixed spaCy+Keras example (#2763)

* bug fixes in keras example

* created contributor agreement

* Adding French hyphenated first name (#2786)

* Fix typo (closes #2784)

* Fix typo (#2795) [ci skip]

Fixed typo on line 6 "regcognizer --> recognizer"

* Adding basic support for Sinhala language. (#2788)

* adding Sinhala language package, stop words, examples and lex_attrs.

* Adding contributor agreement

* Updating contributor agreement

* Also include lowercase norm exceptions

* Fix error (#2802)

* Fix error
ValueError: cannot resize an array that references or is referenced
by another array in this way.  Use the resize function

* added spaCy Contributor Agreement

* Add charlax's contributor agreement (#2805)

* agreement of contributor, may I introduce a tiny pl languge contribution (#2799)

* Contributors agreement

* Contributors agreement

* Contributors agreement

* Add jupyter=True to displacy.render in documentation (#2806)

* Revert "Also include lowercase norm exceptions"

This reverts commit 70f4e8adf3.

* Remove deprecated encoding argument to msgpack

* Set up dependency tree pattern matching skeleton (#2732)

* Fix bug when too many entity types. Fixes #2800

* Fix Python 2 test failure

* Require older msgpack-numpy

* Restore encoding arg on msgpack-numpy

* Try to fix version pin for msgpack-numpy

* Update Portuguese Language (#2790)

* Add words to portuguese language _num_words

* Add words to portuguese language _num_words

* Portuguese - Add/remove stopwords, fix tokenizer, add currency symbols

* Extended punctuation and norm_exceptions in the Portuguese language

* Correct error in spacy universe docs concerning spacy-lookup (#2814)

* Update Keras Example for (Parikh et al, 2016) implementation  (#2803)

* bug fixes in keras example

* created contributor agreement

* baseline for Parikh model

* initial version of parikh 2016 implemented

* tested asymmetric models

* fixed grevious error in normalization

* use standard SNLI test file

* begin to rework parikh example

* initial version of running example

* start to document the new version

* start to document the new version

* Update Decompositional Attention.ipynb

* fixed calls to similarity

* updated the README

* import sys package duh

* simplified indexing on mapping word to IDs

* stupid python indent error

* added code from https://github.com/tensorflow/tensorflow/issues/3388 for tf bug workaround

* Fix typo (closes #2815) [ci skip]

* Update regex version dependency

* Set version to 2.0.13.dev3

* Skip seemingly problematic test

* Remove problematic test

* Try previous version of regex

* Revert "Remove problematic test"

This reverts commit bdebbef455.

* Unskip test

* Try older version of regex

* 💫 Update training examples and use minibatching (#2830)

<!--- Provide a general summary of your changes in the title. -->

## Description
Update the training examples in `/examples/training` to show usage of spaCy's `minibatch` and `compounding` helpers ([see here](https://spacy.io/usage/training#tips-batch-size) for details). The lack of batching in the examples has caused some confusion in the past, especially for beginners who would copy-paste the examples, update them with large training sets and experienced slow and unsatisfying results.

### Types of change
enhancements

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

* Visual C++ link updated (#2842) (closes #2841) [ci skip]

* New landing page

* Add contribution agreement

* Correcting lang/ru/examples.py (#2845)

* Correct some grammatical inaccuracies in lang\ru\examples.py; filled Contributor Agreement

* Correct some grammatical inaccuracies in lang\ru\examples.py

* Move contributor agreement to separate file

* Set version to 2.0.13.dev4

* Add Persian(Farsi) language support (#2797)

* Also include lowercase norm exceptions

* Remove in favour of https://github.com/explosion/spaCy/graphs/contributors

* Rule-based French Lemmatizer (#2818)

<!--- Provide a general summary of your changes in the title. -->

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

Add a rule-based French Lemmatizer following the english one and the excellent PR for [greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class.

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

- Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version.
- Add several files containing exhaustive list of words for each part of speech 
- Add some lemma rules
- Add POS that are not checked in the standard Lemmatizer, i.e PRON, DET, ADV and AUX
- Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentionned
- Modify the lemmatize function to check in lookup table as a last resort
- Init files are updated so the model can support all the functionalities mentioned above
- Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [X] I have submitted the spaCy Contributor Agreement.
- [X] I ran the tests, and all new and existing tests passed.
- [X] My changes don't require a change to the documentation, or if they do, I've added all required information.

* Set version to 2.0.13

* Fix formatting and consistency

* Update docs for new version [ci skip]

* Increment version [ci skip]

* Add info on wheels [ci skip]

* Adding "This is a sentence" example to Sinhala (#2846)

* Add wheels badge

* Update badge [ci skip]

* Update README.rst [ci skip]

* Update murmurhash pin

* Increment version to 2.0.14.dev0

* Update GPU docs for v2.0.14

* Add wheel to setup_requires

* Import prefer_gpu and require_gpu functions from Thinc

* Add tests for prefer_gpu() and require_gpu()

* Update requirements and setup.py

* Workaround bug in thinc require_gpu

* Set version to v2.0.14

* Update push-tag script

* Unhack prefer_gpu

* Require thinc 6.10.6

* Update prefer_gpu and require_gpu docs [ci skip]

* Fix specifiers for GPU

* Set version to 2.0.14.dev1

* Set version to 2.0.14

* Update Thinc version pin

* Increment version

* Fix msgpack-numpy version pin

* Increment version

* Update version to 2.0.16

* Update version [ci skip]

* Redundant ')' in the Stop words' example (#2856)

<!--- Provide a general summary of your changes in the title. -->

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [ ] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.

* Documentation improvement regarding joblib and SO (#2867)

Some documentation improvements

## Description
1. Fixed the dead URL to joblib
2. Fixed Stack Overflow brand name (with space)

### Types of change
Documentation

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

* raise error when setting overlapping entities as doc.ents (#2880)

* Fix out-of-bounds access in NER training

The helper method state.B(1) gets the index of the first token of the
buffer, or -1 if no such token exists. Normally this is safe because we
pass this to functions like state.safe_get(), which returns an empty
token. Here we used it directly as an array index, which is not okay!

This error may have been the cause of out-of-bounds access errors during
training. Similar errors may still be around, so much be hunted down.
Hunting this one down took a long time...I printed out values across
training runs and diffed, looking for points of divergence between
runs, when no randomness should be allowed.

* Change PyThaiNLP Url (#2876)

* Fix missing comma

* Add example showing a fix-up rule for space entities

* Set version to 2.0.17.dev0

* Update regex version

* Revert "Update regex version"

This reverts commit 62358dd867.

* Try setting older regex version, to align with conda

* Set version to 2.0.17

* Add spacy-js to universe [ci-skip]

* Add spacy-raspberry to universe (closes #2889)

* Add script to validate universe json [ci skip]

* Removed space in docs + added contributor indo (#2909)

* - removed unneeded space in documentation

* - added contributor info

* Allow input text of length up to max_length, inclusive (#2922)

* Include universe spec for spacy-wordnet component (#2919)

* feat: include universe spec for spacy-wordnet component

* chore: include spaCy contributor agreement

* Minor formatting changes [ci skip]

* Fix image [ci skip]

Twitter URL doesn't work on live site

* Check if the word is in one of the regular lists specific to each POS (#2886)

* 💫 Create random IDs for SVGs to prevent ID clashes (#2927)

Resolves #2924.

## Description
Fixes problem where multiple visualizations in Jupyter notebooks would have clashing arc IDs, resulting in weirdly positioned arc labels. Generating a random ID prefix so even identical parses won't receive the same IDs for consistency (even if effect of ID clash isn't noticable here.)

### Types of change
bug fix

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

* Fix typo [ci skip]

* fixes symbolic link on py3 and windows (#2949)

* fixes symbolic link on py3 and windows
during setup of spacy using command
python -m spacy link en_core_web_sm en
closes #2948

* Update spacy/compat.py

Co-Authored-By: cicorias <cicorias@users.noreply.github.com>

* Fix formatting

* Update universe [ci skip]

* Catalan Language Support (#2940)

* Catalan language Support

* Ddding Catalan to documentation

* Sort languages alphabetically [ci skip]

* Update tests for pytest 4.x (#2965)

<!--- Provide a general summary of your changes in the title. -->

## Description
- [x] Replace marks in params for pytest 4.0 compat ([see here](https://docs.pytest.org/en/latest/deprecations.html#marks-in-pytest-mark-parametrize))
- [x] Un-xfail passing tests (some fixes in a recent update resolved a bunch of issues, but tests were apparently never updated here)

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

* Fix regex pin to harmonize with conda (#2964)

* Update README.rst

* Fix bug where Vocab.prune_vector did not use 'batch_size' (#2977)

Fixes #2976

* Fix typo

* Fix typo

* Remove duplicate file

* Require thinc 7.0.0.dev2

Fixes bug in gpu_ops that would use cupy instead of numpy on CPU

* Add missing import

* Fix error IDs

* Fix tests
2018-11-29 16:30:29 +01:00
Matthew Honnibal
7ed9124a45
Fix Python2 error on example 2018-11-14 19:35:17 +01:00
Matthew Honnibal
09aa616182 Make pretraining script work without GPU 2018-11-04 17:09:52 +01:00
Matthew Honnibal
bc8cda818c Improve pretrain textcat example 2018-11-04 00:17:09 +00:00
Matthew Honnibal
3e7a96f99d Improve pretrain textcat example 2018-11-03 17:44:12 +00:00
Matthew Honnibal
c87c50af62 Rename new example 2018-11-03 13:09:46 +00:00
Matthew Honnibal
8e8ccc0f92 Work on pretraining script 2018-11-03 12:53:25 +00:00
Matthew Honnibal
0127f10ba3 Improve train tensorizer script 2018-11-03 10:54:20 +00:00
Matthew Honnibal
baf7feae68 Add tensorizer training example 2018-11-02 23:30:06 +00:00
Matthew Honnibal
5a4aeb96b7 Add example showing a fix-up rule for space entities 2018-10-28 16:06:00 +01:00
Ines Montani
4cd9ec0f00
💫 Update training examples and use minibatching (#2830)
<!--- Provide a general summary of your changes in the title. -->

## Description
Update the training examples in `/examples/training` to show usage of spaCy's `minibatch` and `compounding` helpers ([see here](https://spacy.io/usage/training#tips-batch-size) for details). The lack of batching in the examples has caused some confusion in the past, especially for beginners who would copy-paste the examples, update them with large training sets and experienced slow and unsatisfying results.

### Types of change
enhancements

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-10-10 01:40:29 +02:00
John Stewart
9faea3ff10 Update Keras Example for (Parikh et al, 2016) implementation (#2803)
* bug fixes in keras example

* created contributor agreement

* baseline for Parikh model

* initial version of parikh 2016 implemented

* tested asymmetric models

* fixed grevious error in normalization

* use standard SNLI test file

* begin to rework parikh example

* initial version of running example

* start to document the new version

* start to document the new version

* Update Decompositional Attention.ipynb

* fixed calls to similarity

* updated the README

* import sys package duh

* simplified indexing on mapping word to IDs

* stupid python indent error

* added code from https://github.com/tensorflow/tensorflow/issues/3388 for tf bug workaround
2018-10-01 10:28:45 +02:00
John Stewart
2d15859d2a Fixed spaCy+Keras example (#2763)
* bug fixes in keras example

* created contributor agreement
2018-09-15 13:06:39 +02:00
Matthew Honnibal
4336397ecb Update develop from master 2018-08-14 03:04:28 +02:00
Matthew Honnibal
f762d52b24 Add example for Issue #2627 2018-08-05 13:33:52 +02:00
ines
4339f64128 Merge branch 'master' into develop 2018-07-19 16:15:03 +02:00
ines
d489ffb78b Fix formatting [ci skip] 2018-07-19 13:22:25 +02:00
himkt
57311d5d47 replace janome with mecab in the documentation and the test (#2415)
* Add links to Reddit data (see #2401)

* replace janome with mecab in the documentation and the test

* add the assignment
2018-06-11 00:33:13 +02:00
Ines Montani
3f2e3cbd27
Add links to Reddit data (see #2401) 2018-05-31 16:22:43 +02:00
Matthew Honnibal
546dd99cdf Merge master into develop -- mostly Arabic and website 2018-05-15 18:14:28 +02:00