Commit Graph

1360 Commits

Author SHA1 Message Date
张晓飞
ba1ff00370 update response after calling add_pipe (#3661)
* update response after calling add_pipe

component:print_info is appened in the last, so need show it at the end of  pipeline

* Create henry860916.md
2019-05-01 12:02:18 +02:00
Ramiro Gómez
8ee4100f8f Remove dangling M (#3657)
I assume this is a typo. Sorry if it has a meaning that I'm not aware of.
2019-04-29 19:44:43 +02:00
Amit Chaudhary
167d63af31 Fix broken link to Dive Into Python 3 website (#3656)
* Fix broken link to Dive Into Python 3 website

* Sign spaCy Contributor Agreement
2019-04-29 19:44:00 +02:00
Brad Jascob
6fcafcc564 Doc changes for local website setup (#3651) 2019-04-27 13:28:23 +02:00
Ivan Tham
fa94f83697 Improve redundant variable name (#3643)
* Improve redundant variable name

* Apply suggestions from code review

Co-Authored-By: pickfire <pickfire@riseup.net>
2019-04-26 16:50:14 +02:00
Ines Montani
dc87fb805d Merge branch 'master' of https://github.com/explosion/spaCy 2019-04-26 13:17:57 +02:00
Ines Montani
62060ae9c6 Merge branch 'spacy.io' 2019-04-26 13:17:52 +02:00
Brad Jascob
9afa0d6723 Update Universe Website for pyInflect (#3641) 2019-04-26 13:17:36 +02:00
Ines Montani
db7c0dbfd6 Update seo.js 2019-04-23 18:39:30 +02:00
Ines Montani
ec0d840ab5 Document early stopping 2019-04-22 14:31:32 +02:00
Ines Montani
1d567913f9 Update spacy evaluate example 2019-04-22 14:28:42 +02:00
Ines Montani
7917ce2f73 Make flag shortcut consistent and document 2019-04-22 14:23:44 +02:00
Ines Montani
52658c80d5 Allow jupyter=False to override Jupyter mode (closes #3598) 2019-04-22 14:18:32 +02:00
Motoki Wu
8e2cef49f3 Add save after --save-every batches for spacy pretrain (#3510)
<!--- Provide a general summary of your changes in the title. -->

When using `spacy pretrain`, the model is saved only after every epoch. But each epoch can be very big since `pretrain` is used for language modeling tasks. So I added a `--save-every` option in the CLI to save after every `--save-every` batches.

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

To test...

Save this file to `sample_sents.jsonl`

```
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
```

Then run `--save-every 2` when pretraining.

```bash
spacy pretrain sample_sents.jsonl en_core_web_md here -nw 1 -bs 1 -i 10 --save-every 2
```

And it should save the model to the `here/` folder after every 2 batches. The models that are saved during an epoch will have a `.temp` appended to the save name.

At the end the training, you should see these files (`ls here/`):

```bash
config.json     model2.bin      model5.bin      model8.bin
log.jsonl       model2.temp.bin model5.temp.bin model8.temp.bin
model0.bin      model3.bin      model6.bin      model9.bin
model0.temp.bin model3.temp.bin model6.temp.bin model9.temp.bin
model1.bin      model4.bin      model7.bin
model1.temp.bin model4.temp.bin model7.temp.bin
```

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

This is a new feature to `spacy pretrain`.

🌵 **Unfortunately, I haven't been able to test this because compiling from source is not working (cythonize error).** 

```
Processing matcher.pyx
[Errno 2] No such file or directory: '/Users/mwu/github/spaCy/spacy/matcher.pyx'
Traceback (most recent call last):
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 169, in <module>
    run(args.root)
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 158, in run
    process(base, filename, db)
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 124, in process
    preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp")
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 87, in preserve_cwd
    func(*args)
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 63, in process_pyx
    raise Exception("Cython failed")
Exception: Cython failed
Traceback (most recent call last):
  File "setup.py", line 276, in <module>
    setup_package()
  File "setup.py", line 209, in setup_package
    generate_cython(root, "spacy")
  File "setup.py", line 132, in generate_cython
    raise RuntimeError("Running cythonize failed")
RuntimeError: Running cythonize failed
```

Edit: Fixed! after deleting all `.cpp` files: `find spacy -name "*.cpp" | xargs rm`

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-04-22 14:10:16 +02:00
Ines Montani
7937109ed9 Update link [ci skip] 2019-04-19 16:01:41 +02:00
Ines Montani
0dce4585b1 Add course to 101 2019-04-19 15:59:51 +02:00
Ines Montani
2efc87c382 Remove unused image 2019-04-19 15:48:12 +02:00
Ines Montani
38395d9518 Merge branch 'spacy.io' 2019-04-19 15:26:20 +02:00
Ines Montani
7ac5bb0a7b Update landing and feature overview 2019-04-19 15:23:08 +02:00
fizban99
f2f2df6e78 entity types for colors should be in uppercase (#3599)
although the text indicates the entity types should be in lowercase, the sample code shows uppercase, which is the correct format.
2019-04-17 11:22:56 +02:00
Ines Montani
5289dd1356 Fix formatting 2019-04-13 17:58:26 +02:00
Ines Montani
9e7deeaf48 Remove Datacamp 2019-04-13 17:46:32 +02:00
oterrier
2854724e69 Added project gracyql to Universe (#3570) (resolves #3568)
As discussed with Ines in https://github.com/explosion/spaCy/issues/3568 , adding a new project proposal for the community in SpaCy Universe website

GracyQL a tiny graphql wrapper aroung spacy using graphene and starlette.

## Description
Change only in universe.json file to add a new project

### Types of change
New project reference in Universe

## Checklist
- [x ] I have submitted the spaCy Contributor Agreement.
- [x ] I ran the tests, and all new and existing tests passed.
- [ x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-04-10 17:54:42 +02:00
Santiago Castro
86e4b68aa9 Fix website docs for Vectors.from_glove (#3565)
* Fix website docs for Vectors.from_glove

* Add myself as a contributor
2019-04-10 15:23:27 +02:00
Bharat Raghunathan
72820896d4 Fix typo in web docs cli.md (#3559) 2019-04-09 11:40:03 +02:00
pierremonico
0d26bfe677 Removes duplicate in table (#3550)
* Removes duplicate in table

Just fixing typos.

* Remove newline


Co-authored-by: Ines Montani <ines@ines.io>
2019-04-08 10:30:42 +02:00
Piero Molino
5198aa4ae6 Added Ludwig among the projects (#3548) [ci skip]
* Added Ludwig among the projects

* Create w4nderlust.md

* Add Uber to logo wall
2019-04-07 13:01:26 +02:00
Ines Montani
2f0f439c54 Remove non-existent example (closes #3533) 2019-04-03 09:59:17 +02:00
Ines Montani
b070e0caf7 Update landing.js 2019-03-30 22:26:46 +01:00
Ines Montani
9d1221943b Merge branch 'master' into spacy.io 2019-03-30 20:32:14 +01:00
Ines Montani
037ffdfd3f Add spaCy IRL to landing [ci skip] 2019-03-30 20:32:03 +01:00
Ines Montani
730f759b4f Merge branch 'master' into spacy.io 2019-03-28 15:26:17 +01:00
Ines Montani
7d033a7b89 Fix met a description in universe projects [ci skip] 2019-03-28 15:26:01 +01:00
Ines Montani
fe2cb642ac Merge branch 'master' into spacy.io 2019-03-28 15:13:39 +01:00
David
74e738dd4d adds textpipe to universe (#3500) [ci skip]
* Adds textpipe to universe

* signed contributor agreement

* Adjust formatting, code style and use "standalone" category
2019-03-28 15:13:19 +01:00
Ines Montani
04a9fb1a02 Merge branch 'master' into spacy.io 2019-03-28 13:34:46 +01:00
Samuel Kane
06a1846379 fix(util): fix decaying function output (#3495)
* fix(util): fix decaying function output

* fix(util): better test and adhere to code standards

* fix(util): correct variable name, pytestify test, update website text
2019-03-28 13:24:47 +01:00
Bharat Raghunathan
1db3e47509 DOC: Update tokenizer docs to include default value for batch_size in pipe (#3492) 2019-03-28 12:48:02 +01:00
Ines Montani
2ed16d82bf Fix social image 2019-03-26 18:27:40 +01:00
Ines Montani
9e14b2b69f Add Estonian to docs [ci skip] (closes #3482) 2019-03-25 18:01:54 +01:00
Ines Montani
21ade53ef7 Merge branch 'master' into spacy.io 2019-03-25 13:05:00 +01:00
Ines Montani
db938ab0e3 Update favicon (closes #3475) [ci skip] 2019-03-25 13:04:47 +01:00
Ines Montani
c8c1baaea8 Update binderVersion 2019-03-25 12:17:03 +01:00
Ines Montani
200d8bdb3c Merge branch 'spacy.io' [ci skip] 2019-03-23 16:46:34 +01:00
Ines Montani
1e5b917d75 Fix formatting [ci skip] 2019-03-23 16:45:50 +01:00
Matthew Honnibal
6c783f8045 Bug fixes and options for TextCategorizer (#3472)
* Fix code for bag-of-words feature extraction

The _ml.py module had a redundant copy of a function to extract unigram
bag-of-words features, except one had a bug that set values to 0.
Another function allowed extraction of bigram features. Replace all three
with a new function that supports arbitrary ngram sizes and also allows
control of which attribute is used (e.g. ORTH, LOWER, etc).

* Support 'bow' architecture for TextCategorizer

This allows efficient ngram bag-of-words models, which are better when
the classifier needs to run quickly, especially when the texts are long.
Pass architecture="bow" to use it. The extra arguments ngram_size and
attr are also available, e.g. ngram_size=2 means unigram and bigram
features will be extracted.

* Fix size limits in train_textcat example

* Explain architectures better in docs
2019-03-23 16:44:44 +01:00
Ines Montani
5944cf10c7 Add blog post to v2.1 page 2019-03-23 16:34:23 +01:00
Ines Montani
ffebdad08d Add cheat sheet to spaCy 101 2019-03-23 16:32:55 +01:00
Ines Montani
06bf130890 💫 Add better and serializable sentencizer (#3471)
* Add better serializable sentencizer component

* Replace default factory

* Add tests

* Tidy up

* Pass test

* Update docs
2019-03-23 15:45:02 +01:00
Ines Montani
dcd6e06c47 Improve landing example [ci skip] 2019-03-22 19:02:15 +01:00
Ines Montani
a841324034 Update landing example [ci skip] 2019-03-22 18:50:00 +01:00
Ines Montani
b532386a60 Fix typo [ci skip] 2019-03-22 18:36:17 +01:00
Ines Montani
d8533f0149 Update Binder [ci skip] 2019-03-22 18:16:46 +01:00
Christos Aridas
9cee3f702a Add missing space in landing page (#3462) [ci skip] 2019-03-22 15:17:35 +01:00
Ines Montani
5073ce63fd Merge branch 'spacy.io' [ci skip] 2019-03-22 15:17:11 +01:00
Ines Montani
0712efc6b3 Update version requirements [ci skip] 2019-03-21 10:23:54 +01:00
Ines Montani
764359c952 Merge branch 'master' into spacy.io 2019-03-20 17:24:28 +01:00
Ines Montani
dac8f8ff99 Update Span.__init__ docs (see #3445) [ci skip] 2019-03-20 17:24:17 +01:00
Ines Montani
f7b5ff7907 Move netlify.toml to root 2019-03-19 14:40:14 +01:00
Ines Montani
c6ee030721 Fix docsearch 2019-03-19 14:38:49 +01:00
Ines Montani
0155083e01 Update netlify.toml 2019-03-19 14:07:00 +01:00
Ines Montani
d4eed4a84f Add note on unicode build to troubleshooting guide (see #3421) [ci skip] 2019-03-19 10:27:02 +01:00
Ines Montani
42d4b818e4 Redirect Netlify URL 2019-03-19 10:17:56 +01:00
Ines Montani
1ee97bc282 Add page title fallback, just in case 2019-03-18 18:58:55 +01:00
Ines Montani
728ae7651b Fix universe page titles if no separate title is set 2019-03-18 18:58:46 +01:00
Ines Montani
a20d3772fd FIx responsive landing 2019-03-18 16:24:52 +01:00
Ines Montani
08284f3a11
💫 v2.1.0 launch updates (only merge on launch!) (#3414)
* Update README.md

* Use production docsearch [ci skip]

* Add option to exclude pages from search
2019-03-18 16:07:26 +01:00
Ines Montani
a611b32fbf Update model docs [ci skip] 2019-03-17 11:48:18 +01:00
Matthew Honnibal
62afa64a8d Expose batch size and length caps on CLI for pretrain (#3417)
Add and document CLI options for batch size, max doc length, min doc length for `spacy pretrain`.

Also improve CLI output.

Closes #3216 

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-16 21:38:45 +01:00
Ines Montani
2c5dd4d602 Update Vectors.find docs [ci skip] 2019-03-16 17:10:57 +01:00
Ines Montani
fa0f501165 Use dev DocSearch index 2019-03-15 14:48:38 +01:00
Ines Montani
8af7d01382 Fix general-purpose IDs 2019-03-15 14:48:26 +01:00
Ines Montani
cbcba699dd Fix missing ids 2019-03-14 17:56:53 +01:00
Ines Montani
cffe63ea24 Fix :target padding for ids 2019-03-14 17:41:02 +01:00
Ines Montani
51b7b88acf Generate active sidebar heading (h0) at compile time 2019-03-14 17:20:51 +01:00
Ines Montani
4ab1871a75 Add search-exclude classes 2019-03-14 16:51:29 +01:00
Ines Montani
59bbf85986 Add id to body 2019-03-14 16:51:18 +01:00
Ines Montani
6e07750dd8 Fix class name 2019-03-14 11:52:31 +01:00
Ines Montani
a0813b93e0 Server-side render is-active for crawler 2019-03-14 11:46:27 +01:00
Ines Montani
39ace04b55 Fix active style 2019-03-14 11:46:13 +01:00
Ines Montani
4cfe4aa224 Fix small issues in the docs [ci skip] 2019-03-12 22:57:15 +01:00
Ines Montani
ba7eb2d131 Update section [ci skip] 2019-03-12 16:18:34 +01:00
Ines Montani
cecc31b765 Don't auto-slugify accordion links [ci skip] 2019-03-12 15:30:49 +01:00
Ines Montani
d842d5698e Tidy up website and add eslint config [ci skip] 2019-03-12 15:21:58 +01:00
Ines Montani
72fb324d95 Add vector training script to bin [ci skip] 2019-03-12 12:07:56 +01:00
Ines Montani
3abf0e6b9f Replace dev-resources links with real examples 2019-03-12 12:07:40 +01:00
Ines Montani
59c0620487 Auto-format 2019-03-12 12:07:11 +01:00
Ines Montani
1664d1fa62 Update universe [ci skip] 2019-03-12 11:13:03 +01:00
Ines Montani
cdd418b93e Auto-format [ci skip] 2019-03-11 17:10:50 +01:00
Matthew Honnibal
b0b990e405 Fix token.conjuncts (closes #795) (#3392)
* Implement conjuncts method

* Add span.conjuncts property

* Un-xfail token.conjuncts tests

* Update docs for token.conjuncts and span.conjuncts

* Fix merge error in token.conjuncts
2019-03-11 17:05:45 +01:00
Ines Montani
25cb764e64 Document new API [ci skip] 2019-03-11 15:23:53 +01:00
Ines Montani
ebcf2bb1c3 Add Doc.lang and Doc.lang_ 2019-03-11 14:21:40 +01:00
Ines Montani
7c05ca01e8 💫 Support mutable default values for extension attributes (#3389)
* Support mutable default values in extensions

* Update documentation
2019-03-11 12:50:44 +01:00
Matthew Honnibal
98acf5ffe4 💫 Allow passing of config parameters to specific pipeline components (#3386)
* Add component_cfg kwarg to begin_training

* Document component_cfg arg to begin_training

* Update docs and auto-format

* Support component_cfg across Language

* Format

* Update docs and docstrings [ci skip]

* Fix begin_training
2019-03-10 23:36:47 +01:00
Ines Montani
8dbf1e9037 Also fix #3387 on develop 2019-03-10 23:36:28 +01:00
Ines Montani
7ba3a5d95c 💫 Make serialization methods consistent (#3385)
* Make serialization methods consistent

exclude keyword argument instead of random named keyword arguments and deprecation handling

* Update docs and add section on serialization fields
2019-03-10 19:16:45 +01:00
Ines Montani
9a8f169e5c Update v2-1.md 2019-03-10 18:58:51 +01:00
Ines Montani
0426689db8 💫 Improve Doc.to_json and add Doc.is_nered (#3381)
* Use default return instead of else

* Add Doc.is_nered to indicate if entities have been set

* Add properties in Doc.to_json if they were set, not if they're available

This way, if a processed Doc exports "pos": None, it means that the tag was explicitly unset. If it exports "ents": [], it means that entity annotations are available but that this document doesn't contain any entities. Before, this would have been unclear and problematic for training.
2019-03-10 15:24:34 +01:00
Ines Montani
76764fcf59 💫 Improve converters and training data file formats (#3374)
* Populate converter argument info automatically

* Add conversion option for msgpack

* Update docs

* Allow reading training data from JSONL
2019-03-08 23:15:23 +01:00
Ines Montani
296446a1c8
Tidy up and improve docs and docstrings (#3370)
<!--- Provide a general summary of your changes in the title. -->

## Description
* tidy up and adjust Cython code to code style
* improve docstrings and make calling `help()` nicer
* add URLs to new docs pages to docstrings wherever possible, mostly to user-facing objects
* fix various typos and inconsistencies in docs

### Types of change
enhancement, docs

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-08 11:42:26 +01:00