1316 Commits

Author SHA1 Message Date
Ines Montani
1e5b917d75 Fix formatting [ci skip] 2019-03-23 16:45:50 +01:00
Matthew Honnibal
6c783f8045 Bug fixes and options for TextCategorizer (#3472)
* Fix code for bag-of-words feature extraction

The _ml.py module had a redundant copy of a function to extract unigram
bag-of-words features, except one had a bug that set values to 0.
Another function allowed extraction of bigram features. Replace all three
with a new function that supports arbitrary ngram sizes and also allows
control of which attribute is used (e.g. ORTH, LOWER, etc).

* Support 'bow' architecture for TextCategorizer

This allows efficient ngram bag-of-words models, which are better when
the classifier needs to run quickly, especially when the texts are long.
Pass architecture="bow" to use it. The extra arguments ngram_size and
attr are also available, e.g. ngram_size=2 means unigram and bigram
features will be extracted.

* Fix size limits in train_textcat example

* Explain architectures better in docs
2019-03-23 16:44:44 +01:00
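
The n-gram bag-of-words extraction described in 6c783f8045 can be illustrated with a minimal plain-Python sketch. It does not use spaCy and is not the function in `_ml.py`: `extract_ngrams` and its parameters are hypothetical names, and lowercasing the strings stands in for choosing an attribute such as LOWER instead of ORTH.

```python
from collections import Counter


def extract_ngrams(tokens, ngram_size=1, lower=True):
    """Count all n-grams of length 1..ngram_size in a list of token strings.

    Hypothetical helper for illustration only, not spaCy's implementation.
    """
    if lower:
        tokens = [t.lower() for t in tokens]
    counts = Counter()
    for n in range(1, ngram_size + 1):
        # Slide a window of width n over the token sequence
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return counts


# ngram_size=2 yields both unigram and bigram features
features = extract_ngrams(["The", "cat", "sat"], ngram_size=2)
```

With `ngram_size=2` the result holds three unigram keys and two bigram keys, matching the commit's note that unigrams and bigrams are extracted together.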
Ines Montani
5944cf10c7 Add blog post to v2.1 page 2019-03-23 16:34:23 +01:00
Ines Montani
ffebdad08d Add cheat sheet to spaCy 101 2019-03-23 16:32:55 +01:00
Ines Montani
06bf130890 💫 Add better and serializable sentencizer (#3471)
* Add better serializable sentencizer component

* Replace default factory

* Add tests

* Tidy up

* Pass test

* Update docs
2019-03-23 15:45:02 +01:00
Ines Montani
dcd6e06c47 Improve landing example [ci skip] 2019-03-22 19:02:15 +01:00
Ines Montani
a841324034 Update landing example [ci skip] 2019-03-22 18:50:00 +01:00
Ines Montani
b532386a60 Fix typo [ci skip] 2019-03-22 18:36:17 +01:00
Ines Montani
d8533f0149 Update Binder [ci skip] 2019-03-22 18:16:46 +01:00
Christos Aridas
9cee3f702a Add missing space in landing page (#3462) [ci skip] 2019-03-22 15:17:35 +01:00
Ines Montani
5073ce63fd Merge branch 'spacy.io' [ci skip] 2019-03-22 15:17:11 +01:00
Ines Montani
0712efc6b3 Update version requirements [ci skip] 2019-03-21 10:23:54 +01:00
Ines Montani
764359c952 Merge branch 'master' into spacy.io 2019-03-20 17:24:28 +01:00
Ines Montani
dac8f8ff99 Update Span.__init__ docs (see #3445) [ci skip] 2019-03-20 17:24:17 +01:00
Ines Montani
f7b5ff7907 Move netlify.toml to root 2019-03-19 14:40:14 +01:00
Ines Montani
c6ee030721 Fix docsearch 2019-03-19 14:38:49 +01:00
Ines Montani
0155083e01 Update netlify.toml 2019-03-19 14:07:00 +01:00
Ines Montani
d4eed4a84f Add note on unicode build to troubleshooting guide (see #3421) [ci skip] 2019-03-19 10:27:02 +01:00
Ines Montani
42d4b818e4 Redirect Netlify URL 2019-03-19 10:17:56 +01:00
Ines Montani
1ee97bc282 Add page title fallback, just in case 2019-03-18 18:58:55 +01:00
Ines Montani
728ae7651b Fix universe page titles if no separate title is set 2019-03-18 18:58:46 +01:00
Ines Montani
a20d3772fd Fix responsive landing 2019-03-18 16:24:52 +01:00
Ines Montani
08284f3a11 💫 v2.1.0 launch updates (only merge on launch!) (#3414)
* Update README.md

* Use production docsearch [ci skip]

* Add option to exclude pages from search
2019-03-18 16:07:26 +01:00
Ines Montani
a611b32fbf Update model docs [ci skip] 2019-03-17 11:48:18 +01:00
Matthew Honnibal
62afa64a8d Expose batch size and length caps on CLI for pretrain (#3417)
Add and document CLI options for batch size, max doc length, min doc length for `spacy pretrain`.

Also improve CLI output.

Closes #3216 

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-16 21:38:45 +01:00
Ines Montani
2c5dd4d602 Update Vectors.find docs [ci skip] 2019-03-16 17:10:57 +01:00
Ines Montani
fa0f501165 Use dev DocSearch index 2019-03-15 14:48:38 +01:00
Ines Montani
8af7d01382 Fix general-purpose IDs 2019-03-15 14:48:26 +01:00
Ines Montani
cbcba699dd Fix missing ids 2019-03-14 17:56:53 +01:00
Ines Montani
cffe63ea24 Fix :target padding for ids 2019-03-14 17:41:02 +01:00
Ines Montani
51b7b88acf Generate active sidebar heading (h0) at compile time 2019-03-14 17:20:51 +01:00
Ines Montani
4ab1871a75 Add search-exclude classes 2019-03-14 16:51:29 +01:00
Ines Montani
59bbf85986 Add id to body 2019-03-14 16:51:18 +01:00
Ines Montani
6e07750dd8 Fix class name 2019-03-14 11:52:31 +01:00
Ines Montani
a0813b93e0 Server-side render is-active for crawler 2019-03-14 11:46:27 +01:00
Ines Montani
39ace04b55 Fix active style 2019-03-14 11:46:13 +01:00
Ines Montani
4cfe4aa224 Fix small issues in the docs [ci skip] 2019-03-12 22:57:15 +01:00
Ines Montani
ba7eb2d131 Update section [ci skip] 2019-03-12 16:18:34 +01:00
Ines Montani
cecc31b765 Don't auto-slugify accordion links [ci skip] 2019-03-12 15:30:49 +01:00
Ines Montani
d842d5698e Tidy up website and add eslint config [ci skip] 2019-03-12 15:21:58 +01:00
Ines Montani
72fb324d95 Add vector training script to bin [ci skip] 2019-03-12 12:07:56 +01:00
Ines Montani
3abf0e6b9f Replace dev-resources links with real examples 2019-03-12 12:07:40 +01:00
Ines Montani
59c0620487 Auto-format 2019-03-12 12:07:11 +01:00
Ines Montani
1664d1fa62 Update universe [ci skip] 2019-03-12 11:13:03 +01:00
Ines Montani
cdd418b93e Auto-format [ci skip] 2019-03-11 17:10:50 +01:00
Matthew Honnibal
b0b990e405 Fix token.conjuncts (closes #795) (#3392)
* Implement conjuncts method

* Add span.conjuncts property

* Un-xfail token.conjuncts tests

* Update docs for token.conjuncts and span.conjuncts

* Fix merge error in token.conjuncts
2019-03-11 17:05:45 +01:00
Ines Montani
25cb764e64 Document new API [ci skip] 2019-03-11 15:23:53 +01:00
Ines Montani
ebcf2bb1c3 Add Doc.lang and Doc.lang_ 2019-03-11 14:21:40 +01:00
Ines Montani
7c05ca01e8 💫 Support mutable default values for extension attributes (#3389)
* Support mutable default values in extensions

* Update documentation
2019-03-11 12:50:44 +01:00
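
Why mutable defaults need special handling (7c05ca01e8) can be sketched in plain Python. One plausible approach, assumed here and not confirmed by the commit message, is to copy the registered default the first time each object reads it, so that objects never share one list or dict. `ExtensionStore` and its methods are hypothetical names.

```python
import copy


class ExtensionStore:
    """Hypothetical per-object attribute store with a registered default."""

    def __init__(self, default):
        self._default = default
        self._values = {}

    def get(self, obj_id):
        # Copy the default on first access so each object gets its own value
        if obj_id not in self._values:
            self._values[obj_id] = copy.deepcopy(self._default)
        return self._values[obj_id]


store = ExtensionStore(default=[])
store.get("token_a").append("musician")
```

After the append, `store.get("token_b")` is still an empty list, because the two objects do not share the default.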
Matthew Honnibal
98acf5ffe4 💫 Allow passing of config parameters to specific pipeline components (#3386)
* Add component_cfg kwarg to begin_training

* Document component_cfg arg to begin_training

* Update docs and auto-format

* Support component_cfg across Language

* Format

* Update docs and docstrings [ci skip]

* Fix begin_training
2019-03-10 23:36:47 +01:00
Ines Montani
8dbf1e9037 Also fix #3387 on develop 2019-03-10 23:36:28 +01:00
Ines Montani
7ba3a5d95c 💫 Make serialization methods consistent (#3385)
* Make serialization methods consistent

Use an `exclude` keyword argument instead of arbitrarily named keyword arguments, plus deprecation handling.

* Update docs and add section on serialization fields
2019-03-10 19:16:45 +01:00
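
The `exclude` pattern from 7ba3a5d95c can be sketched without spaCy. The `Pipeline` class, its field names and the JSON encoding below are all assumptions made for illustration. Only the idea of one `exclude` argument listing the fields to skip comes from the commit message.

```python
import json


class Pipeline:
    """Hypothetical object with named fields to serialize."""

    def __init__(self):
        self.fields = {
            "vocab": ["a", "b"],
            "meta": {"lang": "en"},
            "tokenizer": "rules",
        }

    def to_bytes(self, exclude=tuple()):
        # Serialize every field whose name is not listed in `exclude`
        kept = {k: v for k, v in self.fields.items() if k not in exclude}
        return json.dumps(kept, sort_keys=True).encode("utf8")


data = Pipeline().to_bytes(exclude=["vocab"])
```

A single list of field names keeps the signature identical across all serialization methods, which is the consistency the commit title refers to.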
Ines Montani
9a8f169e5c Update v2-1.md 2019-03-10 18:58:51 +01:00
Ines Montani
0426689db8 💫 Improve Doc.to_json and add Doc.is_nered (#3381)
* Use default return instead of else

* Add Doc.is_nered to indicate if entities have been set

* Add properties in Doc.to_json if they were set, not if they're available

This way, if a processed Doc exports "pos": None, it means that the tag was explicitly unset. If it exports "ents": [], it means that entity annotations are available but that this document doesn't contain any entities. Before, this would have been unclear and problematic for training.
2019-03-10 15:24:34 +01:00
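
The distinction drawn in 0426689db8 between "annotation not set" and "annotation set but empty" can be shown with a small stand-in function. This is a sketch under assumptions, not `Doc.to_json` itself: `to_json` here is a hypothetical free function, and the `ents_were_set` flag plays the role that `Doc.is_nered` is described as playing.

```python
def to_json(text, ents=None, ents_were_set=False):
    """Export a dict, adding "ents" only if entity annotations were set.

    Hypothetical stand-in for illustration, not spaCy's Doc.to_json.
    """
    data = {"text": text}
    if ents_were_set:
        # An empty list means "annotated, and no entities were found"
        data["ents"] = list(ents or [])
    return data


unannotated = to_json("Hello world")
annotated_empty = to_json("Hello world", ents=[], ents_were_set=True)
```

The first result has no `"ents"` key at all, while the second has `"ents": []`, so a training script can tell missing annotations apart from a document with no entities.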
Ines Montani
76764fcf59 💫 Improve converters and training data file formats (#3374)
* Populate converter argument info automatically

* Add conversion option for msgpack

* Update docs

* Allow reading training data from JSONL
2019-03-08 23:15:23 +01:00
Ines Montani
296446a1c8 Tidy up and improve docs and docstrings (#3370)

## Description
* tidy up and adjust Cython code to code style
* improve docstrings and make calling `help()` nicer
* add URLs to new docs pages to docstrings wherever possible, mostly to user-facing objects
* fix various typos and inconsistencies in docs

### Types of change
enhancement, docs

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-08 11:42:26 +01:00
Ines Montani
fa7314b221 Clarify train_path and dev_path format (see #3366) [ci skip] 2019-03-07 12:23:27 +01:00
Ines Montani
e9babd9973 Update hyperparameters section (see #3352) 2019-03-06 14:40:30 +01:00
Ines Montani
48a206a95f Fix displaCy visualizations in docs (closes #3357) [ci skip] 2019-03-06 13:20:44 +01:00
Ines Montani
5eadf61327 Update pretraining docs on file format (closes #3354) 2019-03-04 16:30:13 +00:00
Ines Montani
1d4ba7678f Auto-format [ci skip] 2019-02-27 12:07:35 +01:00
Matthew Honnibal
f1d77eb140 💫 Improve handling of missing NER tags (closes #2603) (#3341)
* Improve handling of missing NER tags

GoldParse can accept missing NER tags, if entities is provided
in BILUO format (rather than as spans). Missing tags can be provided
as None values.

Fix bug that occurred when first tag was a None value. Closes #2603.

* Document specification of missing NER tags.
2019-02-27 12:06:32 +01:00
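
The BILUO format with missing tags mentioned in f1d77eb140 can be illustrated in plain Python. The sketch only shows the data shape, with `None` marking an unknown tag as opposed to `"O"` for a token known to be outside any entity. `known_tag_mask` is a hypothetical helper and says nothing about how GoldParse handles these values internally.

```python
def known_tag_mask(biluo_tags):
    """Return True where a BILUO tag is given and False where it is missing."""
    return [tag is not None for tag in biluo_tags]


# "I like London": the tag of "like" is unknown, which differs from "O"
tags = ["O", None, "U-GPE"]
mask = known_tag_mask(tags)
```

Note that the list may also start with `None`, which is the case the commit reports as previously causing a bug.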
Ines Montani
c478a2ccb6 Update backwards incompat [ci skip] 2019-02-27 11:56:56 +01:00
Ines Montani
d7217513c9 Merge branch 'spacy.io' into develop [ci skip] 2019-02-27 11:42:10 +01:00
Matthew Honnibal
4a3371acd5 Make doc[0].is_sent_start == True (closes #2869) (#3340)
* Make doc[0] have sent_start True. Closes #2869

* Document that doc[0].is_sent_start defaults True.
2019-02-27 11:17:17 +01:00
Ines Montani
cb481aa1fe Merge branch 'spacy.io' into develop [ci skip] 2019-02-26 16:51:22 +01:00
Ines Montani
2579ecbb63 Merge branch 'spacy.io' into develop [ci skip] 2019-02-25 21:41:51 +01:00
Ines Montani
3379ebcaa4 Fix default prop [ci skip] 2019-02-25 20:29:11 +01:00
Ines Montani
e711969e3b Add more human-readable class names [ci skip] 2019-02-25 20:22:40 +01:00
Ines Montani
162bd4d75b 💫 Add Algolia DocSearch (#3332)
* Add Algolia DocSearch

* Add human-readable selector for teaser
2019-02-25 20:11:11 +01:00
Ines Montani
1b6238101a Add table explaining training metrics [closes #2644] 2019-02-25 10:03:43 +01:00
Ines Montani
1981b194cc Fix recomputing of :target [ci skip]
Prevents additional history entry
2019-02-25 10:03:20 +01:00
Ines Montani
d0b3af9222 Fix remaining inaccuracies in API docs (closes #2329) 2019-02-24 22:21:25 +01:00
Ines Montani
49d0938038 Update version [ci skip] 2019-02-24 22:01:47 +01:00
Ines Montani
62b558ab72 💫 Support lexical attributes in retokenizer attrs (closes #2390) (#3325)
* Fix formatting and whitespace

* Add support for lexical attributes (closes #2390)

* Document lexical attribute setting during retokenization

* Assign variable outside of nested loop
2019-02-24 21:13:51 +01:00
Ines Montani
aa52305461 Improve pipeline model and meta example [ci skip] 2019-02-24 18:45:39 +01:00
Ines Montani
df19e2bff6 💫 Allow setting of custom attributes during retokenization (closes #3314) (#3324)

## Description

This PR adds the ability to override custom extension attributes during merging. This will only work for attributes that are writable, i.e. attributes registered with a default value like `default=False` or attributes that have both a getter *and* a setter implemented.

```python
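import spacy
from spacy.tokens import Token

# Assumption: a blank English pipeline is enough here; a loaded model works too
nlp = spacy.blank("en")
Token.set_extension("is_musician", default=False)

doc = nlp("I like David Bowie.")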
with doc.retokenize() as retokenizer:
    attrs = {"LEMMA": "David Bowie", "_": {"is_musician": True}}
    retokenizer.merge(doc[2:4], attrs=attrs)

assert doc[2].text == "David Bowie"
assert doc[2].lemma_ == "David Bowie"
assert doc[2]._.is_musician
```

### Types of change
enhancement

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-02-24 18:38:47 +01:00
Ines Montani
403b9cd58b Add docs on adding to existing tokenizer rules [ci skip] 2019-02-24 18:35:19 +01:00
Ines Montani
1ea1bc98e7 Document regex utilities [ci skip] 2019-02-24 18:34:10 +01:00
Ines Montani
09bf08b3c3 Update redirects [ci skip] 2019-02-24 13:37:50 +01:00
Ines Montani
dceca3264d Tidy up package.json [ci skip] 2019-02-24 13:37:41 +01:00
Ines Montani
46ec5cdccc Update TextCategorizer docs 2019-02-24 13:11:57 +01:00
Ines Montani
c03cb1cc63 Improve built-in component API docs 2019-02-24 13:11:49 +01:00
Ines Montani
383e2e1f12 Update Python versions [ci skip] 2019-02-24 11:49:45 +01:00
Ines Montani
b624cb4b89 Update v2-1.md 2019-02-24 11:49:27 +01:00
Ines Montani
250e88ef55 Fix docs example (see #2728) 2019-02-21 14:22:06 +01:00
Ines Montani
0fc908d7a5 Add note on merging speed in v2.1 (see #3300) [ci skip] 2019-02-21 12:34:18 +01:00
Ines Montani
236aa94ded Update v2-1.md 2019-02-21 12:33:56 +01:00
Sofie
9a478b6db8 Clean up of char classes, few tokenizer fixes and faster default French tokenizer (#3293)
* splitting up latin unicode interval

* removing hyphen as infix for French

* adding failing test for issue 1235

* test for issue #3002 which now works

* partial fix for issue #2070

* keep the hyphen as infix for French (as it was)

* restore french expressions with hyphen as infix (as it was)

* added succeeding unit test for Issue #2656

* Fix issue #2822 with custom Italian exception

* Fix issue #2926 by allowing numbers right before infix /

* remove duplicate

* remove xfail for Issue #2179 fixed by Matt

* adjust documentation and remove reference to regex lib
2019-02-20 22:10:13 +01:00
Ines Montani
f73d01aa32 Update netlify.toml [ci skip] 2019-02-20 14:33:32 +01:00
Ines Montani
da5edbe434 Tidy up 2019-02-20 14:33:23 +01:00
Ines Montani
57ae71ea95 Add docs on serializing the pipeline (see #3289) [ci skip] 2019-02-18 14:13:29 +01:00
Ines Montani
38e4422c0d Improve matcher example (resolves #3287) 2019-02-18 13:26:37 +01:00
Ines Montani
660cfe44c5 Fix formatting 2019-02-18 13:26:22 +01:00
Ines Montani
c5476bd75b Update languages.json 2019-02-18 10:03:35 +01:00
Ines Montani
212ff359ef Fix links [ci skip] 2019-02-17 22:25:50 +01:00
Ines Montani
04b4df0ec9 Remove n_threads 2019-02-17 22:25:42 +01:00
Ines Montani
4c7ab7620a Update README.md 2019-02-17 22:16:17 +01:00
Ines Montani
8a8523d8c1 Update README.md 2019-02-17 21:59:52 +01:00
Ines Montani
e597110d31 💫 Update website (#3285)

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-02-17 19:31:19 +01:00