Commit Graph

5935 Commits

Author SHA1 Message Date
Matthew Honnibal
3aceeeaaeb Set version to v2.1.4 2019-05-11 22:57:53 +02:00
Ines Montani
aea1c93a05 Replace cytoolz.partition_all with util.minibatch 2019-05-11 21:12:09 +02:00
Ines Montani
0bf6441863 Fix .iob converter (closes #3620) 2019-05-11 19:15:26 +02:00
Matthew Honnibal
a5159ddcf5 Set version to v2.1.4.dev1 2019-05-11 19:03:51 +02:00
Ines Montani
6b3a79ac96 Call rmtree and copytree with strings (closes #3713) 2019-05-11 15:48:35 +02:00
devforfu
21af12eb53 Make "text" key in JSONL format optional when "tokens" key is provided (#3721)
* Fix issue with forcing text key when it is not required

* Extending the docs to reflect the new behavior
2019-05-11 15:41:29 +02:00
Luca Dorigo
82d034f976 Update glossary.py to match information found in documentation (#3704) (closes ##3679)
* Update glossary.py to match information found in documentation

I used regexes to add any dependency tag that was in the documentation but not in the glossary. Solves #3679 👍

* Adds forgotten colon
2019-05-10 14:23:20 +02:00
Wannaphong Phatthiyaphaibun
5a14a13f64 fix thai bug (#3693)
fix tokenize for pythainlp
2019-05-10 14:21:34 +02:00
Ines Montani
505c9e0e19 Add util.filter_spans helper (#3686) 2019-05-08 02:33:40 +02:00
F0rge1cE
dd1e6b0bc6 Fix offset bug in loading pre-trained word2vec. (#3689)
* Fix offset bug in loading pre-trained word2vec.

* add contributor agreement
2019-05-06 23:00:38 +02:00
Ines Montani
78cb807a9a Auto-format [ci skip] 2019-05-06 16:58:29 +02:00
Brad Jascob
955b95cb8b Fix inconsistant lemmatizer issue #3484 (#3646)
* Fix inconsistant lemmatizer issue #3484

* Remove test case
2019-05-04 18:16:03 +02:00
Dobita21
f95ecedd83 Add Thai lex_attrs (#3655)
* test sPacy commit to git fri 04052019 10:54

* change Data format from my format to master format

* ทัทั้งนี้ ---> ทั้งนี้

* delete stop_word translate from Eng

* Adjust formatting and readability

* add Thai norm_exception

* Add Dobita21 SCA

* editรึ : หรือ,

* Update Dobita21.md

* Auto-format

* Integrate norms into language defaults

* add acronym and some norm exception words

* add lex_attrs

* Add lexical attribute getters into the language defaults

* fix LEX_ATTRS


Co-authored-by: Donut <dobita21@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
2019-05-01 12:03:14 +02:00
BreakBB
8952004dfc Update French example sents and add two German stop words (#3662)
* Update french example sentences

* Add 'anderem' and 'ihren' to German stop words
2019-05-01 12:01:35 +02:00
Dobita21
721e1fc86c update norm_exceptions (#3627)
* test sPacy commit to git fri 04052019 10:54

* change Data format from my format to master format

* ทัทั้งนี้ ---> ทั้งนี้

* delete stop_word translate from Eng

* Adjust formatting and readability

* add Thai norm_exception

* Add Dobita21 SCA

* editรึ : หรือ,

* Update Dobita21.md

* Auto-format

* Integrate norms into language defaults

* add acronym and some norm exception words
2019-04-23 12:48:03 +02:00
Ines Montani
e0f487f904 Rename early_stopping_iter to n_early_stopping 2019-04-22 14:31:25 +02:00
Ines Montani
9767427669 Auto-format 2019-04-22 14:31:11 +02:00
Ines Montani
7917ce2f73 Make flag shortcut consistent and document 2019-04-22 14:23:44 +02:00
Ines Montani
52658c80d5 Allow jupyter=False to override Jupyter mode (closes #3598) 2019-04-22 14:18:32 +02:00
Motoki Wu
8e2cef49f3 Add save after --save-every batches for spacy pretrain (#3510)
<!--- Provide a general summary of your changes in the title. -->

When using `spacy pretrain`, the model is saved only after every epoch. But each epoch can be very big since `pretrain` is used for language modeling tasks. So I added a `--save-every` option in the CLI to save after every `--save-every` batches.

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

To test...

Save this file to `sample_sents.jsonl`

```
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
```

Then run `--save-every 2` when pretraining.

```bash
spacy pretrain sample_sents.jsonl en_core_web_md here -nw 1 -bs 1 -i 10 --save-every 2
```

And it should save the model to the `here/` folder after every 2 batches. The models that are saved during an epoch will have a `.temp` appended to the save name.

At the end the training, you should see these files (`ls here/`):

```bash
config.json     model2.bin      model5.bin      model8.bin
log.jsonl       model2.temp.bin model5.temp.bin model8.temp.bin
model0.bin      model3.bin      model6.bin      model9.bin
model0.temp.bin model3.temp.bin model6.temp.bin model9.temp.bin
model1.bin      model4.bin      model7.bin
model1.temp.bin model4.temp.bin model7.temp.bin
```

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

This is a new feature to `spacy pretrain`.

🌵 **Unfortunately, I haven't been able to test this because compiling from source is not working (cythonize error).** 

```
Processing matcher.pyx
[Errno 2] No such file or directory: '/Users/mwu/github/spaCy/spacy/matcher.pyx'
Traceback (most recent call last):
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 169, in <module>
    run(args.root)
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 158, in run
    process(base, filename, db)
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 124, in process
    preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp")
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 87, in preserve_cwd
    func(*args)
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 63, in process_pyx
    raise Exception("Cython failed")
Exception: Cython failed
Traceback (most recent call last):
  File "setup.py", line 276, in <module>
    setup_package()
  File "setup.py", line 209, in setup_package
    generate_cython(root, "spacy")
  File "setup.py", line 132, in generate_cython
    raise RuntimeError("Running cythonize failed")
RuntimeError: Running cythonize failed
```

Edit: Fixed! after deleting all `.cpp` files: `find spacy -name "*.cpp" | xargs rm`

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-04-22 14:10:16 +02:00
Dobita21
189c90743c Add Thai norm_exceptions (#3612)
* test sPacy commit to git fri 04052019 10:54

* change Data format from my format to master format

* ทัทั้งนี้ ---> ทั้งนี้

* delete stop_word translate from Eng

* Adjust formatting and readability

* add Thai norm_exception

* Add Dobita21 SCA

* editรึ : หรือ,

* Update Dobita21.md

* Auto-format

* Integrate norms into language defaults
2019-04-20 12:16:03 +02:00
Matthew Honnibal
83511972d3 Set version to v2.1.4.dev0 2019-04-16 14:17:26 +02:00
Matthew Honnibal
8b5ae0733e Merge branch 'master' of https://github.com/explosion/spaCy 2019-04-16 12:29:46 +02:00
Matthew Honnibal
d59b2e8a0c Fix issue #3551: Upper case lemmas
If the Morphology class tries to lemmatize a word that's not in the
string store, it's forced to just return it as-is. While loading
exceptions, the class could hit a case where these strings weren't in
the string store yet. The resulting lemmas could then be cached, leading
to some words receiving upper-case lemmas. Closes #3551.
2019-04-16 12:27:15 +02:00
BreakBB
5b8dbe4975 Fix symlink creation to show error message on failure (#3589) (resolves #3307))
* Fix symlink creation to show error message on failure. Update tests to reflect those changes.

* Fix test to succeed on non windows systems.
2019-04-16 11:58:31 +02:00
Krzysztof Kowalczyk
cc1516ec26 Improved training and evaluation (#3538)
* Add early stopping

* Add return_score option to evaluate

* Fix missing str to path conversion

* Fix import + old python compatibility

* Fix bad beam_width setting during cpu evaluation in spacy train with gpu option turned on
2019-04-15 12:04:36 +02:00
Shikhar Chauhan
bbf6f9f764 Change default output format from jsonl to json for cli convert (#3583) (closes #3523)
* Changing default ouput format from jsonl to json for cli convert

* Adding Contributor Agreement
2019-04-12 11:31:23 +02:00
Omer Celik
531c0869b2 Added Turkish Lira symbol(₺) (#3576)
Added Turkish Lira symbol(₺) 
https://en.wikipedia.org/wiki/Turkish_lira
2019-04-11 11:32:28 +02:00
Ines Montani
4d198a7e92 Ensure match pattern error isn't raised on empty errors (closes #3549) 2019-04-09 12:50:43 +02:00
Ines Montani
145c0b7e88 Tidy up and auto-format 2019-04-09 11:40:19 +02:00
Ines Montani
5f005adf61 Add xfailing test for #3555 2019-04-09 11:07:14 +02:00
Ines Montani
6ae3b5699e Make sure path is string (resolves #3546) 2019-04-08 12:53:41 +02:00
Ines Montani
d0f5e015cb Auto-format 2019-04-08 12:53:16 +02:00
Dobita21
8bf6967eb7 Update Thai stop words (#3545)
* test sPacy commit to git fri 04052019 10:54

* change Data format from my format to master format

* ทัทั้งนี้ ---> ทั้งนี้

* delete stop_word translate from Eng

* Adjust formatting and readability
2019-04-05 12:06:38 +02:00
jeannefukumaru
f67d881b30 fix typos in tag_map flagged by python -m debug-data (#3542)
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [ ] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.


Co-authored-by: Ines Montani <ines@ines.io>
2019-04-05 12:06:09 +02:00
Jeanne Choo
b6c9807431 Merge remote-tracking branch 'upstream/master' 2019-04-04 14:21:50 +08:00
Jeanne Choo
80e15af76c fixed tag_map.py merge conflict 2019-04-04 14:18:27 +08:00
jeannefukumaru
876ce01567 updated tag map with missing tags 2019-04-03 23:09:11 +08:00
Ines Montani
4faf62d515
Merge pull request #3530 from svlandeg/fix/issue_3521
Allow English stopwords with any type of apostrophe
2019-04-03 14:14:03 +02:00
Yves Peirsman
951825532c Improved Dutch language resources and Dutch lemmatization (#3409)
* Improved Dutch language resources and Dutch lemmatization

* Fix conftest

* Update punctuation.py

* Auto-format

* Format and fix tests

* Remove unused test file

* Re-add deleted test

* removed redundant infix regex pattern for ','; note: brackets + simple hyphen remains

* Cleaner lemmatization files
2019-04-03 14:13:26 +02:00
svlandeg
4ff786e113 addressed all comments by Ines 2019-04-03 13:50:33 +02:00
Ines Montani
6a4575a56c Don't make "settings" or "title" required in displaCy data (closes #3531) 2019-04-03 10:13:16 +02:00
Kamolsit Mongkolsrisawat
dcc67f3f51 Update Thai tokenizer_exception list (#3529)
* add tokenizer_exceptions word (ก-น) from https://goo.gl/JpJ2qq

* update tokenizer_exceptions word list

* add contributor file
2019-04-03 09:13:36 +02:00
svlandeg
85b4319f33 specify encoding in files 2019-04-02 15:05:31 +02:00
svlandeg
673c81bbb4 unicode string for python 2.7 2019-04-02 13:52:07 +02:00
svlandeg
eca9cc5417 fixing Issue #3521 by adding all hyphen variants for each stopword 2019-04-02 13:24:59 +02:00
svlandeg
e7062cf699 failing test for Issue #3521 2019-04-02 13:15:35 +02:00
svlandeg
1424b12b09 failing test for Issue #3449 2019-04-02 13:06:37 +02:00
jeannefukumaru
6cdb7b2e04 added tag_map for indonesian (#3515)
* added tag_map for indonesian

* changed tag map from .py to .txt to see if tests pass

* added symbols import

* added utf8 encoding flag

* added missing SCONJ symbol

* Auto-format

* Remove unused imports

* Make tag map available in Indonesian defaults
2019-04-01 12:27:48 +02:00
Ines Montani
c23e234d65 Auto-format 2019-04-01 12:11:27 +02:00