Commit Graph

10062 Commits

Author SHA1 Message Date
Ines Montani
dd153b2b33 Simplify helper (see #3681) [ci skip] 2019-05-06 15:13:10 +02:00
Ines Montani
f8fce6c03c Fix typo (see #3681) 2019-05-06 15:02:11 +02:00
Ines Montani
f2a56c1b56 Rewrite example to use Retokenizer (resolves #3681)
Also add helper to filter spans
2019-05-06 14:51:18 +02:00
Brad Jascob
955b95cb8b Fix inconsistant lemmatizer issue #3484 (#3646)
* Fix inconsistant lemmatizer issue #3484

* Remove test case
2019-05-04 18:16:03 +02:00
Ines Montani
b4d142e3c4 Adjust wording and formatting [ci skip] 2019-05-03 12:00:31 +02:00
Ines Montani
04658ebbb2 Relax jsonschema pin (closes #3628) 2019-05-03 11:58:58 +02:00
d5555
ba4bcbf285 Update universe.json (#3653) [ci skip]
* Update universe.json

* Update universe.json
2019-05-03 11:50:12 +02:00
Dobita21
f95ecedd83 Add Thai lex_attrs (#3655)
* test sPacy commit to git fri 04052019 10:54

* change Data format from my format to master format

* ทัทั้งนี้ ---> ทั้งนี้

* delete stop_word translate from Eng

* Adjust formatting and readability

* add Thai norm_exception

* Add Dobita21 SCA

* editรึ : หรือ,

* Update Dobita21.md

* Auto-format

* Integrate norms into language defaults

* add acronym and some norm exception words

* add lex_attrs

* Add lexical attribute getters into the language defaults

* fix LEX_ATTRS


Co-authored-by: Donut <dobita21@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
2019-05-01 12:03:14 +02:00
张晓飞
ba1ff00370 update response after calling add_pipe (#3661)
* update response after calling add_pipe

component:print_info is appened in the last, so need show it at the end of  pipeline

* Create henry860916.md
2019-05-01 12:02:18 +02:00
BreakBB
8952004dfc Update French example sents and add two German stop words (#3662)
* Update french example sentences

* Add 'anderem' and 'ihren' to German stop words
2019-05-01 12:01:35 +02:00
Ramiro Gómez
8ee4100f8f Remove dangling M (#3657)
I assume this is a typo. Sorry if it has a meaning that I'm not aware of.
2019-04-29 19:44:43 +02:00
Amit Chaudhary
167d63af31 Fix broken link to Dive Into Python 3 website (#3656)
* Fix broken link to Dive Into Python 3 website

* Sign spaCy Contributor Agreement
2019-04-29 19:44:00 +02:00
Ramiro Gómez
e7e5999ddc Create yaph.md so I can contribute (#3658) 2019-04-29 19:43:06 +02:00
Brad Jascob
6fcafcc564 Doc changes for local website setup (#3651) 2019-04-27 13:28:23 +02:00
Ivan Tham
fa94f83697 Improve redundant variable name (#3643)
* Improve redundant variable name

* Apply suggestions from code review

Co-Authored-By: pickfire <pickfire@riseup.net>
2019-04-26 16:50:14 +02:00
Ines Montani
dc87fb805d Merge branch 'master' of https://github.com/explosion/spaCy 2019-04-26 13:17:57 +02:00
Ines Montani
62060ae9c6 Merge branch 'spacy.io' 2019-04-26 13:17:52 +02:00
Brad Jascob
9afa0d6723 Update Universe Website for pyInflect (#3641) 2019-04-26 13:17:36 +02:00
Ines Montani
db7c0dbfd6 Update seo.js 2019-04-23 18:39:30 +02:00
Dobita21
721e1fc86c update norm_exceptions (#3627)
* test sPacy commit to git fri 04052019 10:54

* change Data format from my format to master format

* ทัทั้งนี้ ---> ทั้งนี้

* delete stop_word translate from Eng

* Adjust formatting and readability

* add Thai norm_exception

* Add Dobita21 SCA

* editรึ : หรือ,

* Update Dobita21.md

* Auto-format

* Integrate norms into language defaults

* add acronym and some norm exception words
2019-04-23 12:48:03 +02:00
Ines Montani
ec0d840ab5 Document early stopping 2019-04-22 14:31:32 +02:00
Ines Montani
e0f487f904 Rename early_stopping_iter to n_early_stopping 2019-04-22 14:31:25 +02:00
Ines Montani
9767427669 Auto-format 2019-04-22 14:31:11 +02:00
Ines Montani
1d567913f9 Update spacy evaluate example 2019-04-22 14:28:42 +02:00
Ines Montani
7917ce2f73 Make flag shortcut consistent and document 2019-04-22 14:23:44 +02:00
Ines Montani
52658c80d5 Allow jupyter=False to override Jupyter mode (closes #3598) 2019-04-22 14:18:32 +02:00
Motoki Wu
8e2cef49f3 Add save after --save-every batches for spacy pretrain (#3510)
<!--- Provide a general summary of your changes in the title. -->

When using `spacy pretrain`, the model is saved only after every epoch. But each epoch can be very big since `pretrain` is used for language modeling tasks. So I added a `--save-every` option in the CLI to save after every `--save-every` batches.

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

To test...

Save this file to `sample_sents.jsonl`

```
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
```

Then run `--save-every 2` when pretraining.

```bash
spacy pretrain sample_sents.jsonl en_core_web_md here -nw 1 -bs 1 -i 10 --save-every 2
```

And it should save the model to the `here/` folder after every 2 batches. The models that are saved during an epoch will have a `.temp` appended to the save name.

At the end the training, you should see these files (`ls here/`):

```bash
config.json     model2.bin      model5.bin      model8.bin
log.jsonl       model2.temp.bin model5.temp.bin model8.temp.bin
model0.bin      model3.bin      model6.bin      model9.bin
model0.temp.bin model3.temp.bin model6.temp.bin model9.temp.bin
model1.bin      model4.bin      model7.bin
model1.temp.bin model4.temp.bin model7.temp.bin
```

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

This is a new feature to `spacy pretrain`.

🌵 **Unfortunately, I haven't been able to test this because compiling from source is not working (cythonize error).** 

```
Processing matcher.pyx
[Errno 2] No such file or directory: '/Users/mwu/github/spaCy/spacy/matcher.pyx'
Traceback (most recent call last):
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 169, in <module>
    run(args.root)
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 158, in run
    process(base, filename, db)
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 124, in process
    preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp")
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 87, in preserve_cwd
    func(*args)
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 63, in process_pyx
    raise Exception("Cython failed")
Exception: Cython failed
Traceback (most recent call last):
  File "setup.py", line 276, in <module>
    setup_package()
  File "setup.py", line 209, in setup_package
    generate_cython(root, "spacy")
  File "setup.py", line 132, in generate_cython
    raise RuntimeError("Running cythonize failed")
RuntimeError: Running cythonize failed
```

Edit: Fixed! after deleting all `.cpp` files: `find spacy -name "*.cpp" | xargs rm`

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-04-22 14:10:16 +02:00
Dobita21
189c90743c Add Thai norm_exceptions (#3612)
* test sPacy commit to git fri 04052019 10:54

* change Data format from my format to master format

* ทัทั้งนี้ ---> ทั้งนี้

* delete stop_word translate from Eng

* Adjust formatting and readability

* add Thai norm_exception

* Add Dobita21 SCA

* editรึ : หรือ,

* Update Dobita21.md

* Auto-format

* Integrate norms into language defaults
2019-04-20 12:16:03 +02:00
Ines Montani
7937109ed9 Update link [ci skip] 2019-04-19 16:01:41 +02:00
Ines Montani
0dce4585b1 Add course to 101 2019-04-19 15:59:51 +02:00
Ines Montani
2efc87c382 Remove unused image 2019-04-19 15:48:12 +02:00
Ines Montani
38395d9518 Merge branch 'spacy.io' 2019-04-19 15:26:20 +02:00
Ines Montani
7ac5bb0a7b Update landing and feature overview 2019-04-19 15:23:08 +02:00
Dobita21
d86848cf1f Create Dobita21.md (#3614)
<!--- Provide a general summary of your changes in the title. -->

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-04-18 12:51:54 +02:00
fizban99
f2f2df6e78 entity types for colors should be in uppercase (#3599)
although the text indicates the entity types should be in lowercase, the sample code shows uppercase, which is the correct format.
2019-04-17 11:22:56 +02:00
fizban99
57d4a8bf3d Create fizban99.md (#3601) 2019-04-17 11:22:19 +02:00
Matthew Honnibal
83511972d3 Set version to v2.1.4.dev0 2019-04-16 14:17:26 +02:00
Matthew Honnibal
8b5ae0733e Merge branch 'master' of https://github.com/explosion/spaCy 2019-04-16 12:29:46 +02:00
Matthew Honnibal
d59b2e8a0c Fix issue #3551: Upper case lemmas
If the Morphology class tries to lemmatize a word that's not in the
string store, it's forced to just return it as-is. While loading
exceptions, the class could hit a case where these strings weren't in
the string store yet. The resulting lemmas could then be cached, leading
to some words receiving upper-case lemmas. Closes #3551.
2019-04-16 12:27:15 +02:00
BreakBB
5b8dbe4975 Fix symlink creation to show error message on failure (#3589) (resolves #3307))
* Fix symlink creation to show error message on failure. Update tests to reflect those changes.

* Fix test to succeed on non windows systems.
2019-04-16 11:58:31 +02:00
Krzysztof Kowalczyk
cc1516ec26 Improved training and evaluation (#3538)
* Add early stopping

* Add return_score option to evaluate

* Fix missing str to path conversion

* Fix import + old python compatibility

* Fix bad beam_width setting during cpu evaluation in spacy train with gpu option turned on
2019-04-15 12:04:36 +02:00
Ines Montani
5289dd1356 Fix formatting 2019-04-13 17:58:26 +02:00
Ines Montani
9e7deeaf48 Remove Datacamp 2019-04-13 17:46:32 +02:00
Shikhar Chauhan
bbf6f9f764 Change default output format from jsonl to json for cli convert (#3583) (closes #3523)
* Changing default ouput format from jsonl to json for cli convert

* Adding Contributor Agreement
2019-04-12 11:31:23 +02:00
Omer Celik
531c0869b2 Added Turkish Lira symbol(₺) (#3576)
Added Turkish Lira symbol(₺) 
https://en.wikipedia.org/wiki/Turkish_lira
2019-04-11 11:32:28 +02:00
Omer Celik
034a1f458b Signed agreement (#3577) 2019-04-11 11:31:27 +02:00
Ivan Tham
71710e2454 Add myself to contributors (#3575) 2019-04-11 11:31:04 +02:00
oterrier
2854724e69 Added project gracyql to Universe (#3570) (resolves #3568)
As discussed with Ines in https://github.com/explosion/spaCy/issues/3568 , adding a new project proposal for the community in SpaCy Universe website

GracyQL a tiny graphql wrapper aroung spacy using graphene and starlette.

## Description
Change only in universe.json file to add a new project

### Types of change
New project reference in Universe

## Checklist
- [x ] I have submitted the spaCy Contributor Agreement.
- [x ] I ran the tests, and all new and existing tests passed.
- [ x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-04-10 17:54:42 +02:00
Santiago Castro
86e4b68aa9 Fix website docs for Vectors.from_glove (#3565)
* Fix website docs for Vectors.from_glove

* Add myself as a contributor
2019-04-10 15:23:27 +02:00
Ines Montani
4d198a7e92 Ensure match pattern error isn't raised on empty errors (closes #3549) 2019-04-09 12:50:43 +02:00