Commit Graph

275 Commits

Author SHA1 Message Date
Matthew Honnibal
98acf5ffe4 💫 Allow passing of config parameters to specific pipeline components (#3386)
* Add component_cfg kwarg to begin_training

* Document component_cfg arg to begin_training

* Update docs and auto-format

* Support component_cfg across Language

* Format

* Update docs and docstrings [ci skip]

* Fix begin_training
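
The idea of routing per-component settings can be sketched without spaCy. This is a minimal illustration only: `begin_training_sketch`, the toy components and the `dropout`/`width` settings are hypothetical names, and the real `begin_training` may handle `component_cfg` differently.

```python
# Hypothetical sketch: pass each pipeline component only the settings
# registered under its own name in component_cfg.
def begin_training_sketch(pipeline, component_cfg=None):
    component_cfg = component_cfg or {}
    received = {}
    for name, init_component in pipeline:
        kwargs = component_cfg.get(name, {})  # no entry means no extra settings
        received[name] = init_component(**kwargs)
    return received


def init_tagger(**cfg):
    return dict(cfg)  # toy component: just report the settings it was given


def init_ner(**cfg):
    return dict(cfg)


pipeline = [("tagger", init_tagger), ("ner", init_ner)]
received = begin_training_sketch(
    pipeline, component_cfg={"ner": {"dropout": 0.2, "width": 64}}
)
```

Here only the `ner` entry receives settings, while `tagger` is initialised with none.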
2019-03-10 23:36:47 +01:00
Ines Montani
7ba3a5d95c 💫 Make serialization methods consistent (#3385)
* Make serialization methods consistent

exclude keyword argument instead of random named keyword arguments and deprecation handling

* Update docs and add section on serialization fields
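
A single `exclude` argument can be sketched in plain Python as follows. This is an illustration under assumptions, not the library's code: `to_bytes_sketch` and the field names in `serializers` are hypothetical.

```python
# Hypothetical sketch: one `exclude` argument lists the fields to skip,
# instead of a separate keyword argument per field.
def to_bytes_sketch(serializers, exclude=()):
    """Serialize every named field unless its name is in `exclude`."""
    return {
        name: serialize()
        for name, serialize in serializers.items()
        if name not in exclude
    }


serializers = {
    "vocab": lambda: b"vocab-bytes",
    "tokenizer": lambda: b"tokenizer-bytes",
    "meta": lambda: b"meta-bytes",
}

full = to_bytes_sketch(serializers)
partial = to_bytes_sketch(serializers, exclude=["tokenizer"])
```

Because every method takes the same argument, callers only need to know the field names, not a different keyword per method.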
2019-03-10 19:16:45 +01:00
Ines Montani
0426689db8 💫 Improve Doc.to_json and add Doc.is_nered (#3381)
* Use default return instead of else

* Add Doc.is_nered to indicate if entities have been set

* Add properties in Doc.to_json if they were set, not if they're available

This way, if a processed Doc exports "pos": None, it means that the tag was explicitly unset. If it exports "ents": [], it means that entity annotations are available but that this document doesn't contain any entities. Before, this would have been unclear and problematic for training.
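
The difference between "set" and "available" can be illustrated with a minimal sketch. It does not use spaCy: `export_annotations` and the `UNSET` sentinel are hypothetical names, and the real `Doc.to_json` may be implemented differently.

```python
# Hypothetical sketch: export a key only when the annotation was set,
# even if the value it was set to is None or an empty list.
UNSET = object()  # sentinel meaning "this annotation was never set"


def export_annotations(text, pos=UNSET, ents=UNSET):
    data = {"text": text}
    if pos is not UNSET:
        data["pos"] = pos  # may be None: the tag was explicitly unset
    if ents is not UNSET:
        data["ents"] = list(ents)  # may be []: annotated, but no entities
    return data


no_annotations = export_annotations("Hello world")
empty_ents = export_annotations("Hello world", ents=[])
unset_pos = export_annotations("Hello world", pos=None)
```

A consumer can then tell "no entity annotations" (key absent) apart from "annotated, no entities found" (empty list).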
2019-03-10 15:24:34 +01:00
Ines Montani
76764fcf59 💫 Improve converters and training data file formats (#3374)
* Populate converter argument info automatically

* Add conversion option for msgpack

* Update docs

* Allow reading training data from JSONL
2019-03-08 23:15:23 +01:00
Ines Montani
296446a1c8 Tidy up and improve docs and docstrings (#3370)

## Description
* tidy up and adjust Cython code to code style
* improve docstrings and make calling `help()` nicer
* add URLs to new docs pages to docstrings wherever possible, mostly to user-facing objects
* fix various typos and inconsistencies in docs

### Types of change
enhancement, docs

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-08 11:42:26 +01:00
Ines Montani
fa7314b221 Clarify train_path and dev_path format (see #3366) [ci skip] 2019-03-07 12:23:27 +01:00
Ines Montani
e9babd9973 Update hyperparameters section (see #3352) 2019-03-06 14:40:30 +01:00
Ines Montani
5eadf61327 Update pretraining docs on file format (closes #3354) 2019-03-04 16:30:13 +00:00
Ines Montani
1d4ba7678f Auto-format [ci skip] 2019-02-27 12:07:35 +01:00
Matthew Honnibal
f1d77eb140 💫 Improve handling of missing NER tags (closes #2603) (#3341)
* Improve handling of missing NER tags

GoldParse can accept missing NER tags, if entities is provided
in BILUO format (rather than as spans). Missing tags can be provided
as None values.

Fix bug that occurred when first tag was a None value. Closes #2603.

* Document specification of missing NER tags.
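
The difference between a missing tag and an `O` tag can be sketched in plain Python. This is an illustration only: `biluo_to_spans` is a hypothetical helper and does not reproduce how `GoldParse` handles the tags internally.

```python
# Hypothetical sketch: read BILUO tags in which None marks a token whose
# NER tag is missing (unknown), as opposed to "O" (known non-entity).
def biluo_to_spans(tags):
    """Return (start, end, label) spans with an exclusive end index."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag is None or tag == "O":
            start = None  # no entity can continue across this token
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "U":
            spans.append((i, i + 1, label))
        elif prefix == "B":
            start = i
        elif prefix == "L" and start is not None:
            spans.append((start, i + 1, label))
            start = None
    return spans


# The first tag is None, the case the bug fix above refers to.
tags = [None, "O", "B-PERSON", "L-PERSON", None]
spans = biluo_to_spans(tags)
missing = [i for i, tag in enumerate(tags) if tag is None]
```

Tokens listed in `missing` carry no information either way, so a training procedure can ignore them instead of treating them as non-entities.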
2019-02-27 12:06:32 +01:00
Matthew Honnibal
4a3371acd5 Make doc[0].is_sent_start == True (closes #2869) (#3340)
* Make doc[0] have sent_start True. Closes #2869

* Document that doc[0].is_sent_start defaults to True.
2019-02-27 11:17:17 +01:00
Ines Montani
d0b3af9222 Fix remaining inaccuracies in API docs (closes #2329) 2019-02-24 22:21:25 +01:00
Ines Montani
62b558ab72 💫 Support lexical attributes in retokenizer attrs (closes #2390) (#3325)
* Fix formatting and whitespace

* Add support for lexical attributes (closes #2390)

* Document lexical attribute setting during retokenization

* Assign variable outside of nested loop
2019-02-24 21:13:51 +01:00
Ines Montani
df19e2bff6 💫 Allow setting of custom attributes during retokenization (closes #3314) (#3324)

## Description

This PR adds the ability to override custom extension attributes during merging. This will only work for attributes that are writable, i.e. attributes registered with a default value like `default=False` or attributes that have both a getter *and* a setter implemented.

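```python
import spacy
from spacy.tokens import Token

# Assumption: a blank English pipeline is enough here, since only the tokenizer is needed
nlp = spacy.blank("en")
Token.set_extension('is_musician', default=False)

doc = nlp("I like David Bowie.")
with doc.retokenize() as retokenizer:
    # Merge "David Bowie" into one token, setting a built-in and a custom attribute
    attrs = {"LEMMA": "David Bowie", "_": {"is_musician": True}}
    retokenizer.merge(doc[2:4], attrs=attrs)

assert doc[2].text == "David Bowie"
assert doc[2].lemma_ == "David Bowie"
assert doc[2]._.is_musician
```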

### Types of change
enhancement

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-02-24 18:38:47 +01:00
Ines Montani
1ea1bc98e7 Document regex utilities [ci skip] 2019-02-24 18:34:10 +01:00
Ines Montani
46ec5cdccc Update TextCategorizer docs 2019-02-24 13:11:57 +01:00
Ines Montani
c03cb1cc63 Improve built-in component API docs 2019-02-24 13:11:49 +01:00
Ines Montani
250e88ef55 Fix docs example (see #2728) 2019-02-21 14:22:06 +01:00
Ines Montani
04b4df0ec9 Remove n_threads 2019-02-17 22:25:42 +01:00
Ines Montani
e597110d31 💫 Update website (#3285)

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-02-17 19:31:19 +01:00
ines
808f7ee417 Update API documentation 2017-10-03 14:27:22 +02:00
ines
d15775c3ad Fix typos and commands in alpha docs 2017-08-21 13:40:11 +02:00
ines
3c33003078 Port over typo corrections from #1245 2017-08-20 12:00:17 +02:00
ines
1261b01e46 Update Doc.char_span docs 2017-08-19 16:34:32 +02:00
ines
5cb0200e63 Document new Span.to_array() method 2017-08-19 12:45:28 +02:00
ines
471eed4126 Add example to Span.merge() 2017-08-19 12:45:16 +02:00
ines
404d3067b8 Document new Doc.char_span() method 2017-08-19 12:45:00 +02:00
ines
d53cbf369f Document as_tuples kwarg on Language.pipe() 2017-08-19 12:44:50 +02:00
ines
6a37c93311 Update argument type 2017-08-19 12:44:33 +02:00
ines
4731d50220 Add break utility for long nowrap items (e.g. code) 2017-08-19 12:44:23 +02:00
ines
0aba11b64b Update package command docs 2017-08-14 16:45:44 +02:00
ines
a29f132ffd Change python -m spacy to spacy
Reflects latest change to entry point or auto-alias
2017-08-14 13:04:48 +02:00
ines
f085b88f9d Add TextCategorizer API docs stub 2017-07-22 17:56:33 +02:00
ines
ab1a4e8b3c Add Tensorizer API docs stub 2017-07-22 17:56:25 +02:00
ines
d2a7e5b8e5 Add GoldParse.cats attribute 2017-07-22 17:55:35 +02:00
ines
23d976ed00 Add Doc.cats attribute and missing v2 tag 2017-07-22 17:55:14 +02:00
Ines Montani
1ddbeddca2 Fix typo 2017-07-22 15:00:58 +02:00
Vetea
8e20cf6368 Update doc.jade
Just remove a duplicate 'doc ='
2017-06-08 10:35:58 +02:00
ines
9f55c0d4f6 Add Vectors class 2017-06-05 13:33:11 +02:00
ines
e204788c30 Add docs for util.load_model_from_path 2017-06-05 13:18:22 +02:00
ines
efc37ea3de Update train CLI 2017-06-04 23:45:14 +02:00
ines
3419ecbfdd Update docs on model shortcut links 2017-06-04 13:55:00 +02:00
ines
b0225183c2 Update displaCy defaults 2017-06-03 13:27:06 +02:00
ines
c60431357d Port over docs typo corrections 2017-06-03 11:31:30 +02:00
ines
1bebc6392c Add source files to pipeline components 2017-06-01 17:38:06 +02:00
ines
706cec6d58 Move annotation specs up 2017-06-01 13:02:43 +02:00
ines
77dca25c7f Update Language API docs 2017-06-01 11:51:31 +02:00
ines
f86289566a Update new in v2 section and add note on Matcher acceptors 2017-05-30 13:53:06 +02:00
ines
b5bfab8699 Add description 2017-05-29 15:27:16 +02:00
ines
567485a818 Fix and document model loading with pipeline and overrides 2017-05-29 14:10:10 +02:00
ines
00b2094dc3 Fix typos, long integers and tests 2017-05-29 01:09:52 +02:00
ines
606879b217 Update hash strings examples 2017-05-28 19:42:44 +02:00
ines
c7b57ea314 Update docs and change integer IDs to hash values 2017-05-28 19:25:34 +02:00
ines
0ea31d1e31 Add under construction note to pipeline components 2017-05-28 18:44:07 +02:00
ines
414193e9ba Update docs to reflect StringStore changes 2017-05-28 18:19:11 +02:00
ines
69bda9aed7 Update text, examples, typos, wording and formatting 2017-05-28 16:41:01 +02:00
ines
eb5a8be9ad Update language overview and add section on 'xx' lang class 2017-05-28 01:15:44 +02:00
ines
eb703f7656 Update API docs 2017-05-28 00:32:43 +02:00
ines
c1983621fb Update util functions for model loading 2017-05-28 00:22:40 +02:00
ines
70afcfec3e Update defaults and example 2017-05-26 14:04:31 +02:00
ines
1b982f0838 Update train command and add docs on hyperparameters 2017-05-26 14:02:38 +02:00
ines
1b9c6ded71 Update API docs and add "source" button to GH source 2017-05-26 13:40:32 +02:00
ines
d48530835a Update API docs and fix typos 2017-05-26 12:43:16 +02:00
ines
ea9474f71c Add version tag mixin to label new features 2017-05-26 12:42:36 +02:00
ines
353f0ef8d7 Use disable argument (list) for serialization 2017-05-26 12:33:54 +02:00
ines
0f48fb1f97 Rename processing text to production use and remove linear feature scheme 2017-05-25 00:10:33 +02:00
ines
8b86b08bed Update usage workflows 2017-05-24 11:59:08 +02:00
ines
66088851dc Add Doc.to_disk() and Doc.from_disk() methods 2017-05-24 11:58:17 +02:00
ines
10afb3c796 Tidy up and merge usage pages 2017-05-24 00:37:47 +02:00
ines
697d3d7cb3 Fix links to CLI docs 2017-05-24 00:36:38 +02:00
ines
a38393e2f6 Update annotation docs 2017-05-23 23:16:17 +02:00
ines
786af87ffb Update IOB docs 2017-05-23 23:15:50 +02:00
ines
c8bde2161c Add kwargs to spacy.load 2017-05-23 23:14:02 +02:00
ines
0a8a2d2f6d Remove tip infoboxes from annotation docs 2017-05-23 23:13:51 +02:00
ines
e6acd3bbf2 Fix matcher tests and matcher docs 2017-05-23 11:36:02 +02:00
ines
f497cf60b2 Update formatting 2017-05-23 11:32:25 +02:00
ines
a23f487b06 Tidy up displaCy and add "manual" option
Also don't require title in EntityRenderer
2017-05-22 18:48:20 +02:00
ines
dddad5bf26 Update util.prints docs 2017-05-22 13:54:52 +02:00
ines
d5a6a9a6a9 Use string values for attrs in Matcher docs 2017-05-22 13:54:45 +02:00
ines
54f04a9fe0 Update API docs with changes in spacy.gold and spacy.language 2017-05-22 12:29:30 +02:00
ines
fc3ec733ea Reduce complexity in CLI
Remove now redundant model command and move plac annotations to cli
files
2017-05-22 12:28:58 +02:00
ines
2c5cfe8bbf Update docstrings and API docs for StringStore 2017-05-21 14:18:58 +02:00
ines
251346b59f Fix typos and formatting 2017-05-21 14:18:46 +02:00
ines
075f5ff87a Update docstrings and API docs for GoldParse 2017-05-21 13:53:46 +02:00
ines
465a1dd710 Add BILUO scheme to annotation docs 2017-05-21 13:53:34 +02:00
ines
c9f04f3cd0 Add note on automated processes to download command 2017-05-21 13:23:39 +02:00
ines
8ab59515b2 Fix typo and use consistent description for from_bytes 2017-05-21 13:18:39 +02:00
ines
c5a653fa48 Update docstrings and API docs for Tokenizer 2017-05-21 13:18:14 +02:00
ines
d82ae9a585 Change "function" to "callable" in docs 2017-05-21 13:17:40 +02:00
ines
ee3fdffffb Move attributes and remove deprecated methods 2017-05-21 01:18:31 +02:00
ines
1cb2c86f9a Update CLI docs 2017-05-21 01:13:05 +02:00
ines
272a8981c3 Add model tag to spacy.load API docs 2017-05-21 01:12:43 +02:00
ines
3871157d84 Update spacy.util documentation 2017-05-21 01:12:09 +02:00
ines
da12aee0c1 Update spacy.load with note on get_lang_class 2017-05-21 00:19:26 +02:00
ines
27de0834b2 Update docstrings and API docs for Lexeme 2017-05-20 15:13:42 +02:00
ines
7ed8a92ed1 Update docstrings and API docs for Token 2017-05-20 15:13:33 +02:00
ines
4ed6a36622 Update docstrings and API docs for Matcher 2017-05-20 14:43:10 +02:00
ines
39f36539f6 Update docstrings and API docs for Matcher 2017-05-20 14:32:34 +02:00
ines
c00ff257be Update docstrings and API docs for Matcher 2017-05-20 14:26:10 +02:00
ines
463e3cc80f Remove resize_vectors and vectors_length 2017-05-20 14:02:14 +02:00
ines
f0cc642bb9 Update docstrings and API docs for Vocab 2017-05-20 14:00:41 +02:00
Matthew Honnibal
a93276bb78 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-05-20 13:55:12 +02:00
Matthew Honnibal
ce9234f593 Update Matcher API 2017-05-20 13:54:53 +02:00
ines
8b14476253 Fix typo 2017-05-20 13:00:13 +02:00
ines
6557ff9e85 Update example 2017-05-20 13:00:07 +02:00
ines
fea4925f41 Reorganise API docs navigation 2017-05-20 12:59:57 +02:00
ines
b2678372c7 Add API docs for top-level spaCy functions
i.e. spacy.load(), spacy.info(), spacy.explain()
2017-05-20 12:59:44 +02:00
ines
797f10ab16 Update formatting 2017-05-20 12:59:16 +02:00
ines
e10c48210d Update Matcher API and workflow to reflect new API
on_match is now the second positional argument, to easily allow a
variable number of patterns while keeping the method clean and readable.
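
The argument order can be sketched with a toy class. `ToyMatcher`, the placeholder callback and the example patterns are hypothetical and only mirror the calling convention described above; they do not perform any matching.

```python
# Hypothetical sketch: with on_match as the second positional argument,
# any number of patterns can follow it via *patterns.
class ToyMatcher:
    def __init__(self):
        self.callbacks = {}
        self.patterns = {}

    def add(self, key, on_match, *patterns):
        self.callbacks[key] = on_match
        self.patterns.setdefault(key, []).extend(patterns)


def on_match(matcher, doc, i, matches):
    pass  # placeholder callback; the parameters are illustrative


matcher = ToyMatcher()
matcher.add("GREETING", on_match, [{"LOWER": "hello"}], [{"LOWER": "hi"}])
matcher.add("NO_CALLBACK", None, [{"LOWER": "world"}])
```

Passing `None` as the second argument keeps the call shape the same when no callback is needed.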
2017-05-20 12:59:03 +02:00
ines
eb521af267 Fix formatting 2017-05-20 12:58:15 +02:00
ines
7973912114 Update CLI docs 2017-05-20 12:58:05 +02:00
ines
5163a4513e Update API docs 2017-05-20 01:43:48 +02:00
ines
e3256e7406 Update Matcher API docs 2017-05-20 01:38:34 +02:00
ines
0cabf9e13f Fix model tag 2017-05-20 01:38:14 +02:00
ines
fe5d8819ea Update Matcher docstrings and API docs 2017-05-19 21:47:06 +02:00
ines
c8580da686 Update "requires model" tags 2017-05-19 20:24:46 +02:00
ines
c3e903e4c2 Update examples and API docs 2017-05-19 19:59:02 +02:00
ines
e9e62b01b0 Update docstrings and API docs for Token 2017-05-19 18:47:56 +02:00
ines
62ceec4fc6 Update docstrings and API docs for Span 2017-05-19 18:47:46 +02:00
ines
23f9a3ccc8 Update docstrings and API docs for Doc 2017-05-19 18:47:39 +02:00
ines
2c8c9dc0c9 Update docstrings and API docs for Language 2017-05-19 18:47:24 +02:00
ines
0791f0aae6 Update docstrings and API docs for Span class 2017-05-19 00:31:31 +02:00
ines
5b68579eb8 Use returns/yields instead of return/yield 2017-05-19 00:02:34 +02:00
ines
b687ad109d Update docstrings and API docs for Doc class 2017-05-18 23:59:44 +02:00
ines
d42bc16868 Update docstrings and API docs for Language class 2017-05-18 23:57:38 +02:00
ines
b87066ff10 Update docstrings and API docs for Doc class 2017-05-18 22:17:41 +02:00
ines
476b8209fe Update docs with new Jupyter auto-detection 2017-05-18 14:58:17 +02:00
ines
02a4841e7b Move CLI docs to API reference 2017-05-17 12:04:03 +02:00
ines
d7244ae72d Add docs on collapse_punct option 2017-05-15 13:51:33 +02:00
ines
c33bdeb564 Use uppercase for entity types 2017-05-15 01:24:57 +02:00
ines
cf7e5ed534 Use American spelling for "visualizers"
Kinda sucks because we normally use British spelling, but it just looks
weird and confusing otherwise... same with tokenizer and all other
library internals. So this is sort of the "official policy" for now.
2017-05-14 23:29:36 +02:00
ines
fe5a5086e1 Fix typo 2017-05-14 23:27:56 +02:00
ines
1ae07da18f Add API docs for spacy.displacy (see #1058) 2017-05-14 19:31:23 +02:00
ines
b462076d80 Merge load_lang_class and get_lang_class 2017-05-14 01:31:10 +02:00
ines
1465c6c221 Add API docs for util functions 2017-05-13 21:23:12 +02:00
ines
19879cb693 Update alpha support docs 2017-05-12 15:57:49 +02:00
ines
63d79947c8 Update title in navigation 2017-05-12 15:40:43 +02:00
ines
531ee1373b Rename "Language models" to "Languages" in API 2017-05-12 15:38:56 +02:00
ines
fac3566aac Add descriptions to POS tagging scheme 2017-05-03 20:11:02 +02:00
ines
1570b83ee5 Add spacy.explain() note to NER annotation scheme 2017-05-03 20:11:02 +02:00
ines
219369bb7d Add detailed docs for dependency label annotations 2017-05-03 20:11:02 +02:00
ines
f9384b0fbd Update alpha languages and add aside for tokenizer dependencies 2017-05-03 09:58:31 +02:00
Yasuaki Uechi
0e7a9b9fac Add Japanese to 'Alpha support' section 2017-05-03 13:56:45 +09:00
ines
034ec5710b Fix typo and add Norwegian to alpha languages 2017-04-27 11:24:21 +02:00
ines
375edf0bb5 Add list of models and include French 2017-04-26 20:50:27 +02:00
ines
ddd5194088 Update Language docs and docstrings 2017-04-17 01:52:13 +02:00
ines
aad80a291f Add save_to_directory method to API docs 2017-04-17 01:40:34 +02:00
ines
13df2d6a60 Add documentation for spaCy's JSON format 2017-03-26 15:56:15 +02:00
ines
a5fc5fb0db Add Hebrew to list of alpha languages 2017-03-25 10:22:46 +01:00
ines
9600cd1b9e Fix download commands 2017-03-25 10:22:05 +01:00