Commit Graph

5216 Commits

Author SHA1 Message Date
Matthew Honnibal
019d09e3c3 Fix init model 2018-07-03 22:16:44 +02:00
Matthew Honnibal
2543f8c93a Support .npz vectors in init-model command 2018-07-03 21:42:16 +02:00
Matthew Honnibal
86aad11939 Fix init_model arg 2018-07-03 17:00:42 +02:00
Matthew Honnibal
eff42d36e3 Fix init model command 2018-07-03 16:32:23 +02:00
Matthew Honnibal
97487122ea Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-07-03 15:44:37 +02:00
Matthew Honnibal
6a89faf12e Add support for jsonl-formatted lexical attributes to init-model command. 2018-07-03 12:22:56 +02:00
Matthew Honnibal
2ec2192000 Revert #1389: Don't overrule rules when lemma exception is present 2018-06-29 19:43:02 +02:00
Matthew Honnibal
01ace9734d Make pipeline work on empty docs 2018-06-29 19:21:38 +02:00
Matthew Honnibal
a1b05048d0 Fix tagger when doc is empty 2018-06-29 16:05:40 +02:00
Matthew Honnibal
3786942ff1 Fix tagger when docs are empty 2018-06-29 15:13:45 +02:00
ines
526be40823 Add test for 46d8a66 2018-06-29 14:33:12 +02:00
ines
f08c871adf Fix typo in Language.from_disk 2018-06-29 14:32:16 +02:00
Matthew Honnibal
46d8a66fef Fix tokenizer serialization if token_match is None 2018-06-29 14:24:46 +02:00
Matthew Honnibal
e0860bcfb3 Fix bug when docs are empty 2018-06-29 13:56:29 +02:00
Matthew Honnibal
a4d2b0c293 Fix bug when docs are empty 2018-06-29 13:44:25 +02:00
Matthew Honnibal
c83fccfe2a Fix output of best model 2018-06-25 23:05:56 +02:00
Matthew Honnibal
5a65418c40 Fix handling of unseen labels in tagger 2018-06-25 22:28:59 +02:00
Matthew Honnibal
5b56aad4c2 Fix handling of unseen labels in tagger 2018-06-25 22:24:54 +02:00
Matthew Honnibal
3aabf621a3 Fix handling of unknown tags in tagger update 2018-06-25 22:01:02 +02:00
Matthew Honnibal
69c900f003 Fix init-model if no vectors provided 2018-06-25 18:26:02 +02:00
Matthew Honnibal
664f89327a Fix init-model if no vectors provided 2018-06-25 17:58:45 +02:00
Matthew Honnibal
c4698f5712 Don't collate model unless training succeeds 2018-06-25 16:36:42 +02:00
Ole Henrik Skogstrøm
d16cb6bee6 Accept Span to displacy render (#2478) (closes #2477)
* Add Span to displacy render

* Fix span support, errors and add tests
2018-06-25 14:55:16 +02:00
Matthew Honnibal
24dfbb8a28 Fix model collation 2018-06-25 14:35:24 +02:00
Matthew Honnibal
62237755a4 Import shutil 2018-06-25 13:40:17 +02:00
Matthew Honnibal
a040fca99e Import json into cli.train 2018-06-25 11:50:37 +02:00
Matthew Honnibal
2c703d99c2 Fix collation of best models 2018-06-25 01:21:34 +02:00
Matthew Honnibal
9d6a1c57f2 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-06-24 23:40:06 +02:00
Matthew Honnibal
2c80b7c013 Collate best model after training 2018-06-24 23:39:52 +02:00
Muhammad Irfan
f33c703066 Add Urdu Language Support (#2430)
* added Urdu language support.

* added Urdu language tests.

* modified conftest.py for Urdu language support.

* added spacy contributor agreement.
2018-06-22 11:14:03 +02:00
himkt
14d9007efd fix wrong indexing (#2416)
* fix wrong indexing

* add agreement
2018-06-19 10:20:57 +02:00
Aliia E
428bae66b5 Add Tatar Language Support (#2444)
* add Tatar lang support

* add Tatar letters

* add Tatar tests

* sign contributor agreement

* sign contributor agreement [x]

* remove comments from Language class

* remove all template comments
2018-06-19 10:17:53 +02:00
Cory Hurst
446f5ec41b Silent keyword in info function in init (#2459)
* Pass through "silent" kwarg to the wrapper in the spacy module init.
reference issue  #2196

* contributor agreement
2018-06-18 12:24:21 +02:00
ines
778e5f4da3 Merge branch 'master' into develop 2018-06-11 00:38:04 +02:00
himkt
57311d5d47 replace janome with mecab in the documentation and the test (#2415)
* Add links to Reddit data (see #2401)

* replace janome with mecab in the documentation and the test

* add the assignment
2018-06-11 00:33:13 +02:00
Nour Shalabi
a169b79092 Additions to Arabic stop words. (#2422)
* Additions to Arabic stop words.

* Create nourshalabi.md
2018-06-08 02:33:23 +02:00
ines
a0017e4909 Merge branch 'master' into develop 2018-05-30 14:10:47 +02:00
ines
b8ef9c1000 Fix model names in conftest (see #2379) 2018-05-30 14:10:20 +02:00
ines
4a62486340 Merge branch 'master' into develop 2018-05-30 13:01:01 +02:00
Maciej
c7d53348d7 Fix bug in CLI iob and ner converter (#2392) (fixes #2385)
* issue_2385 add tests for iob_to_biluo converter function

* issue_2385 fix and modify iob_to_biluo function to accept either iob or biluo tags in cli.converter

* issue_2385 add test to fix b char bug

* add contributor agreement

* fill contributor agreement
2018-05-30 12:28:44 +02:00
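The converter change above (accepting either IOB or BILUO tags) can be illustrated with a minimal sketch. This is not the code from the commit: the function body below is a hypothetical reimplementation, assuming tags are plain strings such as `B-PER`, `I-PER` and `O`.

```python
def iob_to_biluo(tags):
    """Convert a list of IOB tags to BILUO; BILUO input passes through unchanged.

    Hypothetical sketch, not spaCy's actual converter.
    """
    result = []
    for i, tag in enumerate(tags):
        # "O" and tags that are already BILUO-only (L-, U-) are kept as-is.
        if tag == "O" or tag[:2] in ("L-", "U-"):
            result.append(tag)
            continue
        prefix, label = tag[0], tag[2:]
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues if the next tag is inside or closes the same label.
        continues = next_tag in ("I-" + label, "L-" + label)
        if prefix == "B":
            result.append(("B-" if continues else "U-") + label)
        else:  # prefix == "I"
            result.append(("I-" if continues else "L-") + label)
    return result
```

For example, `iob_to_biluo(["B-PER", "I-PER", "O", "B-LOC"])` marks the last token of the person entity with `L-` and the single-token location with `U-`.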
ines
3c3a175018 Merge branch 'master' into develop 2018-05-28 18:37:09 +02:00
ansgar-t
9732988951 escape html in displacy.render (#2378) (closes #2361)
## Description
Fix for issue #2361 :
replace &, <, >, " with &amp;, &lt;, &gt;, &quot; before rendering svg

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
(As discussed in the comments to #2361)
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-05-28 18:36:41 +02:00
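The escaping described in the commit above can be sketched as follows. The helper name `escape_html` is an assumption for illustration and the body is not taken from the commit itself; it only shows the four replacements the message lists.

```python
def escape_html(text):
    """Replace the four characters named in the commit message with HTML entities.

    Hypothetical sketch; the ampersand must be replaced first so that the
    entities produced by the later replacements are not escaped again.
    """
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace('"', "&quot;")
    return text
```

Python's standard library offers `html.escape` for the same purpose; with its default `quote=True` it additionally escapes single quotes.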
ines
f7103babd9 Only overwrite warnings filter if set explicitly (resolves #2369)
This way, pre-defined warning filters are respected and users are still able to use the fine-grained warning settings if they like.
2018-05-26 18:44:15 +02:00
ines
330c039106 Merge branch 'master' into develop 2018-05-26 18:30:52 +02:00
James Messinger
4515e96e90 Better formatting for spacy train CLI (#2357)
* Better formatting for `spacy train` CLI

Changed to use fixed-spaces rather than tabs to align table headers and data.

### Before:
```
Itn.    P.Loss  N.Loss  UAS     NER P.  NER R.  NER F.  Tag %   Token %
0       4618.857        2910.004        76.172  79.645  67.987  88.732  88.261  100.000 4436.9  6376.4
1       4671.972        3764.812        74.481  78.046  62.374  82.680  88.377  100.000 4672.2  6227.1
2       4742.756        3673.473        71.994  77.380  63.966  84.494  90.620  100.000 4298.0  5983.9
```

### After:
```
Itn.  Dep Loss  NER Loss  UAS     NER P.  NER R.  NER F.  Tag %   Token %  CPU WPS  GPU WPS
0     4618.857  2910.004  76.172  79.645  67.987  88.732  88.261  100.000  4436.9   6376.4
1     4671.972  3764.812  74.481  78.046  62.374  82.680  88.377  100.000  4672.2   6227.1
2     4742.756  3673.473  71.994  77.380  63.966  84.494  90.620  100.000  4298.0   5983.9
```

* Added contributor file
2018-05-25 13:08:45 +02:00
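The fixed-space alignment described above can be sketched as below. This is an assumption-laden illustration, not the code from the commit: `format_row` and the `widths` argument are hypothetical names, and floats are assumed to be printed with three decimals as in the sample output.

```python
def format_row(values, widths):
    """Left-align each cell to a fixed width and join cells with two spaces.

    Hypothetical helper; padding with spaces keeps columns aligned regardless
    of the terminal's tab stops.
    """
    cells = []
    for value, width in zip(values, widths):
        if isinstance(value, float):
            text = "{:.3f}".format(value)
        else:
            text = str(value)
        cells.append(text.ljust(width))
    # Trailing padding on the last cell is not needed.
    return "  ".join(cells).rstrip()


widths = [4, 8, 8]
print(format_row(["Itn.", "Dep Loss", "NER Loss"], widths))
print(format_row([0, 4618.857, 2910.004], widths))
```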
Aristo Rinjuang
432ede04af adding more words and rephrasing (#2351)
* adding more words and rephrasing

* adding a contributor

* tokenizer bugs solved
2018-05-24 11:40:57 +02:00
Jani Monoses
ec62cadf4c Updates to Romanian support (#2354)
* Add back Romanian in conftest

* Romanian lex_attr

* More tokenizer exceptions for Romanian

* Add tests for some Romanian tokenizer exceptions
2018-05-24 11:40:00 +02:00
Matthew Honnibal
5d281cf302 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-05-22 20:50:59 +02:00
Matthew Honnibal
ce458c2428 Fix spacy requirement constraint in package template 2018-05-22 20:50:46 +02:00
Ines Montani
862da5e793 Support pipeline factories via entry points (#2348) 2018-05-22 18:29:45 +02:00
Matthew Honnibal
d5af38f80c Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-05-21 17:42:55 +02:00
Matthew Honnibal
ee33de8652 Fix unpickling of NER parser 2018-05-21 17:42:40 +02:00
ines
f9dbcac8e4 Merge branch 'master' into develop 2018-05-21 02:29:29 +02:00
cclauss
f7dcaa1f6b Simplify is_config() and normalize_string_keys() (#2305)
* Simplify is_config() and normalize_string_keys()

* Use __in__ to avoid the nested _ands_ and _ors_.
* Dict comprehension directly tracks with the doc string

* Keep more basic loop in normalize_string_keys

* Whitespace
2018-05-21 01:54:35 +02:00
Ines Montani
cae4457c38 💫 Add .similarity warnings for no vectors and option to exclude warnings (#2197)
* Add logic to filter out warning IDs via environment variable

Usage: SPACY_WARNING_EXCLUDE=W001,W007

* Add warnings for empty vectors

* Add warning if no word vectors are used in .similarity methods

For example, if only tensors are available in small models – should hopefully clear up some confusion around this

* Capture warnings in tests

* Rename SPACY_WARNING_EXCLUDE to SPACY_WARNING_IGNORE
2018-05-21 01:22:38 +02:00
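The environment-variable filter described above could look roughly like this. It is a sketch under stated assumptions, not spaCy's implementation: the function names are hypothetical, and only the variable name `SPACY_WARNING_IGNORE` and the comma-separated ID format come from the commit message.

```python
import os


def get_ignored_warning_ids(env=None):
    """Parse a comma-separated list of warning IDs from SPACY_WARNING_IGNORE.

    `env` defaults to os.environ; passing a dict makes the function testable.
    """
    env = os.environ if env is None else env
    raw = env.get("SPACY_WARNING_IGNORE", "")
    return {part.strip() for part in raw.split(",") if part.strip()}


def should_show_warning(warning_id, env=None):
    """Return False when the given warning ID has been excluded by the user."""
    return warning_id not in get_ignored_warning_ids(env)
```

With `SPACY_WARNING_IGNORE=W001,W007` set, `should_show_warning("W001")` would return `False` while other IDs are still shown.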
Matthew Honnibal
b096b22c20 Merge pull request #2247 from skrcode/1480
1480 - Implement Fast-Text vectors with subword features
2018-05-21 01:16:21 +02:00
Matthew Honnibal
f3b4f6a4ec Merge setup.py 2018-05-20 23:21:00 +02:00
Ines Montani
d4cc736b7c 💫 Improve model downloads: check for existing install, customise pip and use requests library again (#2346)
* Go back to using requests instead of urllib (closes #2320)

Fewer dependencies are good, but this one was simply causing too many other problems around SSL verification and Python 2/3 compatibility. requests is a popular enough package that it's okay for spaCy to depend on it – and this will hopefully make model downloads less flakey.

* Only download model if not installed (see #1456)

Use #egg=model==version to allow pip to check for existing installations. The download is only started if no installation matching the package/version is found. Fixes a long-standing inconvenience.

* Pass additional options to pip when installing model (resolves #1456)

Treat all additional arguments passed to the download command as pip options to allow user to customise the command. For example:

python -m spacy download en --user

* Add CLI option to enable installing model package dependencies

* Revert "Add CLI option to enable installing model package dependencies"

This reverts commit 9336ffe695.

* Update documentation
2018-05-20 20:26:56 +02:00
Matthew Honnibal
3eb446e0a5 Require thinc 6.11.1 and prepare for release to spacy-nightly 2018-05-20 19:00:34 +02:00
Matthew Honnibal
bdc23dd8c1 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-05-20 18:59:24 +02:00
ines
5401c55c75 Merge branch 'master' into develop 2018-05-20 16:49:40 +02:00
ines
b59e3b157f Don't require attrs argument in Doc.retokenize and allow both ints and unicode (resolves #2304) 2018-05-20 15:15:37 +02:00
ines
5768df4f09 Add SimpleFrozenDict util to use as default function argument 2018-05-20 15:13:37 +02:00
Matthew Honnibal
7431e9c87f Fix parser for GPU 2018-05-19 17:24:34 +00:00
Matthew Honnibal
401213fb1f Only warn about unnamed vectors if non-zero sized. 2018-05-19 18:51:55 +02:00
Matthew Honnibal
74d5c625b3 Use rising beam update prob 2018-05-16 20:11:59 +02:00
Matthew Honnibal
544ae7f1db Merge branch 'develop' into feature/refactor-parser 2018-05-16 02:06:49 +02:00
Matthew Honnibal
d1b27fe5aa Revert "Improve dynamic oracle when values are missing in parse"
This reverts commit f56bd4736b.
2018-05-16 00:31:52 +02:00
Matthew Honnibal
83acaa0358 Add missing name attribute for parser 2018-05-15 19:01:53 +02:00
Matthew Honnibal
f328c195ca Fix size limits in training data 2018-05-15 19:01:41 +02:00
Matthew Honnibal
8446b35ce0 Fix parser model loading 2018-05-15 18:43:46 +02:00
Matthew Honnibal
dc1a479fbd Merge branch 'develop' into feature/refactor-parser 2018-05-15 18:39:21 +02:00
Matthew Honnibal
546dd99cdf Merge master into develop -- mostly Arabic and website 2018-05-15 18:14:28 +02:00
Matthew Honnibal
5664ab7e6c Revert hacks to tests 2018-05-15 18:00:09 +02:00
Matthew Honnibal
7b9195657b Restore beam_density argument for parser beam 2018-05-15 17:55:11 +02:00
Matthew Honnibal
581d318971 Fix conftest 2018-05-15 00:54:45 +02:00
Tahar Zanouda
00417794d3 Add Arabic language (#2314)
* added support for Arabic lang

* added Arabic language support

* updated conftest
2018-05-15 00:27:19 +02:00
Jani Monoses
0e08e49e87 Lemmatizer ro (#2319)
* Add Romanian lemmatizer lookup table.

Adapted from http://www.lexiconista.com/datasets/lemmatization/
by replacing cedillas with commas (ș and ț).

The original dataset is licensed under the Open Database License.

* Fix one blatant issue in the Romanian lemmatizer

* Romanian examples file

* Add ro_tokenizer in conftest

* Add Romanian lemmatizer test
2018-05-12 15:20:04 +02:00
Matthew Honnibal
887631ca25 Disable some tests to figure out why CI fails 2018-05-10 16:42:01 +02:00
Matthew Honnibal
902a172cb7 Disable some tests to figure out why CI fails 2018-05-10 16:30:07 +02:00
Matthew Honnibal
614d45ea58 Set a more aggressive threshold on the max violn update 2018-05-10 15:38:24 +02:00
Matthew Honnibal
8e8724b55b Default to beam_update_prob 1 2018-05-10 15:38:02 +02:00
Jani Monoses
42b34832e4 Update Romanian stopword list (#2316)
* Contributor agreement for janimo

* Update Romanian stopword list

Include the correct spellings of all the words already in the repo
that are using cedillas (ş and ţ) instead of commas (ș and ț).

Add another unrelated spelling fix.

See https://github.com/stopwords-iso/stopwords-ro/pull/1 and
https://github.com/stopwords-iso/stopwords-ro/pull/2
2018-05-10 12:16:56 +02:00
Lucas Abbade
be7fdc59d1 Update lex_attrs.py (#2307)
* Update lex_attrs.py

Fixed spelling mistakes of some numbers (according to Brazilian Portuguese).

* Update lex_attrs.py

As requested, I've included the correct spelling for both Brazilian Portuguese and Portuguese Portuguese.

I would advise, however, that the two be separated in the future. Brazilian Portuguese is very different from the original language: although most of the writing is unified, the way people talk in the two countries is radically different. Keeping both as one language may lead to bigger issues in the future, especially when it comes to spell checking.
2018-05-09 20:49:31 +02:00
mauryaland
5368ba028a Update stop_words.py for French language (#2310)
* Add contraction forms of some common stopwords

All the stopwords added contain the apostrophe " ' " or " ’ ".

* Adds contributor agreement mauryaland

* Update mauryaland.md
2018-05-09 12:04:38 +02:00
Matthew Honnibal
a61fd60681 Fix error in beam gradient calculation 2018-05-09 02:44:09 +02:00
Matthew Honnibal
a6ae1ee6f7 Don't modify Token in global scope 2018-05-09 00:43:00 +02:00
Matthew Honnibal
f94f721f40 Avoid importing fused token symbol in ud-run-test, until that's added 2018-05-09 00:28:03 +02:00
Matthew Honnibal
659ec5b975 Avoid importing fused token symbol in ud-run-test, until that's added 2018-05-08 19:40:33 +02:00
Matthew Honnibal
4cb0494bef Bug fixes to beam search after refactor 2018-05-08 13:48:50 +02:00
Matthew Honnibal
5ed71973b3 Add a keyword argument sink to GoldParse 2018-05-08 13:48:32 +02:00
Matthew Honnibal
8cfe326f87 Avoid relying on final gold check in beam search 2018-05-08 13:48:19 +02:00
Matthew Honnibal
fc4dd49b77 Support oracle segmentation in ud-train CLI command 2018-05-08 13:47:45 +02:00
Matthew Honnibal
c49e44349a Fix beam parsing 2018-05-08 02:53:24 +02:00
Matthew Honnibal
99649d114d Fix parser 2018-05-08 00:27:26 +02:00
Matthew Honnibal
8a82367a9d Fix beam search after refactor 2018-05-08 00:20:33 +02:00
Matthew Honnibal
5a0f26be0c Readd beam search after refactor 2018-05-08 00:19:52 +02:00
ines
7a3599c21a Fix formatting and consistency 2018-05-07 23:02:11 +02:00
Matthew Honnibal
36b2c9bdd5 Fix refactored parser 2018-05-07 18:58:09 +02:00
Matthew Honnibal
bde3be1ad1 Fix refactored parser 2018-05-07 18:31:04 +02:00