spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-09-23 12:36:46 +03:00

Author	SHA1	Message	Date
Ines Montani	8d3bfb3c04	Remove outdated options and fix formatting	2018-11-28 23:33:34 +01:00
Nathaniel J. Smith	73255091f8	Fix conftest getoption	2018-11-28 19:07:24 +01:00
Matthew Honnibal	87da5bcf5b	Set version to v2.1.0a3	2018-11-28 18:22:09 +01:00
Matthew Honnibal	647d1a1efc	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-11-28 18:21:45 +01:00
Matthew Honnibal	61e435610e	💫 Feature/improve pretraining (#2971) * Improve spacy pretrain script * Implement BERT-style 'masked language model' objective. Much better results. * Improve logging. * Add length cap for documents, to avoid memory errors. * Require thinc 7.0.0.dev1 * Require thinc 7.0.0.dev1 * Add argument for using pretrained vectors * Fix defaults * Fix syntax error * Improve spacy pretrain script * Implement BERT-style 'masked language model' objective. Much better results. * Improve logging. * Add length cap for documents, to avoid memory errors. * Require thinc 7.0.0.dev1 * Require thinc 7.0.0.dev1 * Add argument for using pretrained vectors * Fix defaults * Fix syntax error * Tweak pretraining script * Fix data limits in spacy.gold * Fix pretrain script	2018-11-28 18:04:58 +01:00
Matthew Honnibal	0fdb25b958	Fix msgpack error	2018-11-27 19:35:55 +01:00
Matthew Honnibal	ef0820827a	Update hyper-parameters after NER random search (#2972) These experiments were completed a few weeks ago, but I didn't make the PR, pending model release. Token vector width: 128->96 Hidden width: 128->64 Embed size: 5000->2000 Dropout: 0.2->0.1 Updated optimizer defaults (unclear how important?) This should improve speed, model size and load time, while keeping similar or slightly better accuracy. The tl;dr is we prefer to prevent over-fitting by reducing model size, rather than using more dropout.	2018-11-27 18:49:52 +01:00
Matthew Honnibal	c9f6acc564	Set version to 2.1.0a3.dev0	2018-11-27 05:15:27 +01:00
Ines Montani	b6e991440c	💫 Tidy up and auto-format tests (#2967) * Auto-format tests with black * Add flake8 config * Tidy up and remove unused imports * Fix redefinitions of test functions * Replace orths_and_spaces with words and spaces * Fix compatibility with pytest 4.0 * xfail test for now Test was previously overwritten by following test due to naming conflict, so failure wasn't reported * Unfail passing test * Only use fixture via arguments Fixes pytest 4.0 compatibility	2018-11-27 01:09:36 +01:00
Matthew Honnibal	2c37e0ccf6	💫 Use Blis for matrix multiplications (#2966) Our epic matrix multiplication odyssey is drawing to a close... I've now finally got the Blis linear algebra routines in a self-contained Python package, with wheels for Windows, Linux and OSX. The only missing platform at the moment is Windows Python 2.7. The result is at https://github.com/explosion/cython-blis Thinc v7.0.0 will make the change to Blis. I've put a Thinc v7.0.0.dev0 up on PyPi so that we can test these changes with the CI, and even get them out to spacy-nightly, before Thinc v7.0.0 is released. This PR also updates the other dependencies to be in line with the current versions master is using. I've also resolved the msgpack deprecation problems, and gotten spaCy and Thinc up to date with the latest Cython. The point of switching to Blis is to have control of how our matrix multiplications are executed across platforms. When we were using numpy for this, a different library would be used on pip and conda, OSX would use Accelerate, etc. This would open up different bugs and performance problems, especially when multi-threading was introduced. With the change to Blis, we now strictly single-thread the matrix multiplications. This will make it much easier to use multiprocessing to parallelise the runtime, since we won't have nested parallelism problems to deal with. * Use blis * Use -2 arg to Cython * Update dependencies * Fix requirements * Update setup dependencies * Fix requirement typo * Fix msgpack errors * Remove Python27 test from Appveyor, until Blis works there * Auto-format setup.py * Fix murmurhash version	2018-11-27 00:44:04 +01:00
Ines Montani	3832c8a2c1	💫 Use README.md instead of README.rst (#2968) * Auto-format setup.py * Use README.md instead of README.rst	2018-11-26 22:04:35 +01:00
Ines Montani	41c6002fd8	Tidy up [ci skip]	2018-11-26 18:56:04 +01:00
Ines Montani	c62d06ea5c	Port over #2949	2018-11-26 18:54:27 +01:00
Ines Montani	ec5ee9e616	Auto-format	2018-11-26 18:54:20 +01:00
Ines Montani	350c8d25b0	Add EntityRecognizer.label property	2018-11-18 00:06:26 +01:00
Ines Montani	017bc2ef2f	Expose TextCategorizer via __all__	2018-11-18 00:06:13 +01:00
Ines Montani	b4581435f6	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-11-16 13:08:22 +01:00
Ines Montani	e2f75eb492	Fix message formatting	2018-11-16 13:08:20 +01:00
Matthew Honnibal	c89fd19f66	Hack broken pipe error for Python2	2018-11-16 02:22:05 +01:00
Matthew Honnibal	2874b8efd8	Fix tok2vec loading in spacy train	2018-11-15 23:34:54 +00:00
Matthew Honnibal	2ddd428834	Fix pretrain script	2018-11-15 23:34:35 +00:00
Matthew Honnibal	09a0227656	Temporarily add a script to load reddit	2018-11-15 23:18:35 +00:00
Matthew Honnibal	f8afaa0c1c	Fix pretrain	2018-11-15 22:46:53 +00:00
Matthew Honnibal	6af6950e46	Fix pretrain	2018-11-15 22:45:36 +00:00
Matthew Honnibal	3e7b214e57	Make pretrain script work with stream from stdin	2018-11-15 22:44:07 +00:00
Matthew Honnibal	8fdb9bc278	💫 Add experimental ULMFit/BERT/Elmo-like pretraining (#2931) * Add 'spacy pretrain' command * Fix pretrain command for Python 2 * Fix pretrain command * Fix pretrain command	2018-11-15 22:17:16 +01:00
Ines Montani	e89708c3eb	💫 Allow matching non-ORTH attributes in PhraseMatcher (#2925) * Allow matching non-orth attributes in PhraseMatcher (see #1971) Usage: PhraseMatcher(nlp.vocab, attr='POS') * Allow attr argument to be int * Fix formatting * Fix typo	2018-11-15 03:00:58 +01:00
Matthew Honnibal	7ed9124a45	Fix Python2 error on example	2018-11-14 19:35:17 +01:00
Ines Montani	0d5b142c78	Fix typos and whitespace	2018-11-14 19:12:34 +01:00
Ines Montani	bd1b0e396a	Add deprecation warning for PhraseMatcher max_length	2018-11-14 19:10:46 +01:00
Ines Montani	64257bf3a7	Fix formatting	2018-11-14 19:10:21 +01:00
Ines Montani	b3cadd5b81	Delete _matcher2_notes.py	2018-11-14 16:19:12 +01:00
Matthew Honnibal	5fc98ade04	Set version to 2.1.0a2	2018-11-08 09:56:56 +01:00
Matthew Honnibal	09aa616182	Make pretraining script work without GPU	2018-11-04 17:09:52 +01:00
Matthew Honnibal	bc8cda818c	Improve pretrain textcat example	2018-11-04 00:17:09 +00:00
Matthew Honnibal	3e7a96f99d	Improve pretrain textcat example	2018-11-03 17:44:12 +00:00
Matthew Honnibal	c87c50af62	Rename new example	2018-11-03 13:09:46 +00:00
Matthew Honnibal	8e8ccc0f92	Work on pretraining script	2018-11-03 12:53:25 +00:00
Matthew Honnibal	ad44982f01	Fix dropout in tensorizer, update comment	2018-11-03 12:46:58 +00:00
Matthew Honnibal	0127f10ba3	Improve train tensorizer script	2018-11-03 10:54:20 +00:00
Matthew Honnibal	ba365ae1c9	Normalize gradient by number of words in tensorizer	2018-11-03 10:53:22 +00:00
Matthew Honnibal	dac3f1b280	Improve Tensorizer	2018-11-03 10:52:50 +00:00
Matthew Honnibal	baf7feae68	Add tensorizer training example	2018-11-02 23:30:06 +00:00
Matthew Honnibal	2527ba68e5	Fix tensorizer	2018-11-02 23:29:54 +00:00
Suraj Rajan	0bf14082a4	Added more constructs for dependency tree matcher (#2836)	2018-10-29 23:21:39 +01:00
Matthew Honnibal	817e1fc5e5	Fix out-of-bounds access in NER training The helper method state.B(1) gets the index of the first token of the buffer, or -1 if no such token exists. Normally this is safe because we pass this to functions like state.safe_get(), which returns an empty token. Here we used it directly as an array index, which is not okay! This error may have been the cause of out-of-bounds access errors during training. Similar errors may still be around, so must be hunted down. Hunting this one down took a long time... I printed out values across training runs and diffed, looking for points of divergence between runs, when no randomness should be allowed.	2018-10-27 01:12:50 +02:00
Ines Montani	ea20b72c08	💫 Make like_num work for prefixed numbers (#2808) * Only split + prefix if not numbers * Make like_num work for prefixed numbers * Add test for like_num	2018-10-01 10:49:14 +02:00
Matthew Honnibal	b39810d692	Fix copy_reg compatibility on _serialize module	2018-09-28 15:23:14 +02:00
Matthew Honnibal	f82f8ba5dd	Fix serialization when empty parser model. Closes #2482	2018-09-28 15:18:52 +02:00
Matthew Honnibal	d5a6c63b62	Add regression test for #2482	2018-09-28 15:18:30 +02:00
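
Note on 817e1fc5e5: the bug pattern that commit describes (a -1 "no such token" sentinel used directly as an array index) can be sketched in plain Python. This is only an illustration under stated assumptions, not spaCy's actual Cython code: `buffer_index` and `safe_get` below are hypothetical stand-ins for `state.B()` and `state.safe_get()`, and `None` stands in for the "empty token" the commit message mentions.

```python
from typing import List, Optional

NO_TOKEN = -1  # sentinel meaning "no such token in the buffer"


def buffer_index(buffer: List[int], offset: int) -> int:
    """Hypothetical stand-in for state.B(offset).

    Returns the token index stored at position `offset` of the buffer,
    or the -1 sentinel when the buffer has no such position.
    """
    if 0 <= offset < len(buffer):
        return buffer[offset]
    return NO_TOKEN


def safe_get(tokens: List[str], index: int) -> Optional[str]:
    """Hypothetical stand-in for state.safe_get().

    Checks the bounds first, so the sentinel (or any other invalid index)
    yields None instead of touching the array.
    """
    if 0 <= index < len(tokens):
        return tokens[index]
    return None


tokens = ["Apple", "is", "hiring"]
buffer = [2]  # only one token left in the buffer

idx = buffer_index(buffer, 1)    # no second buffer token, so idx is the sentinel
guarded = safe_get(tokens, idx)  # bounds-checked lookup: None
unguarded = tokens[idx]          # direct indexing: Python wraps -1 to the last item
```

In Python the unguarded lookup silently returns the wrong token. In C or Cython code with bounds checking and wraparound disabled, the same direct index would read outside the array, which matches the out-of-bounds access the commit message reports.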

9112 Commits