Commit Graph

5937 Commits

Author SHA1 Message Date
Matthew Honnibal
ae7fc42a41 Increment version to v2.0.13.dev1 2018-08-10 00:14:31 +02:00
Matthew Honnibal
19f5046934 Undoing warning suppression, as it doesn't really work 2018-08-10 00:13:34 +02:00
Matthew Honnibal
3fb828352d Set version to 2.0.13.dev0 2018-08-09 23:49:34 +02:00
Matthew Honnibal
1c0614ecd2 Catch numpy warning 2018-08-09 23:49:24 +02:00
Aashish Gangwani
6eebfc7bf4 Added numbers to ../lang/hi/lex_attrs.py (#2629)
I have added numbers to the Hindi lex_attrs.py file according to the Indian numbering system (https://en.wikipedia.org/wiki/Indian_numbering_system). Here are their English translations:
'शून्य' => zero
'एक' => one
'दो' => two
'तीन' => three
'चार' => four
'पांच' => five
'छह' => six
'सात' => seven
'आठ' => eight
'नौ' => nine
'दस' => ten
'ग्यारह' => eleven
'बारह' => twelve
'तेरह' => thirteen
'चौदह' => fourteen
'पंद्रह' => fifteen
'सोलह' => sixteen
'सत्रह' => seventeen
'अठारह' => eighteen
'उन्नीस' => nineteen
'बीस' => twenty
'तीस' => thirty
'चालीस' => forty
'पचास' => fifty
'साठ' => sixty
'सत्तर' => seventy
'अस्सी' => eighty
'नब्बे' => ninety
'सौ' => hundred
'हज़ार' => thousand
'लाख' => hundred thousand
'करोड़' => ten million
'अरब' => billion
'खरब' => hundred billion

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-08-08 16:06:11 +02:00
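The number words added in commit 6eebfc7bf4 are the kind of data a "looks like a number" lexical attribute would consult. The following is a minimal sketch of such a check, not the code from the commit: the function name `like_num`, the variable `_num_words`, and the handling of separators and fractions are assumptions for illustration, and the word list is only a subset of the words listed above.

```python
# Subset of the Hindi number words listed above (illustrative, not exhaustive).
_num_words = ["शून्य", "एक", "दो", "तीन", "दस", "सौ", "लाख"]


def like_num(text):
    """Return True if `text` looks like a number (digits or a known number word)."""
    # Strip thousands separators and decimal points before the digit check.
    cleaned = text.replace(",", "").replace(".", "")
    if cleaned.isdigit():
        return True
    # Simple fractions such as "1/2".
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Fall back to the word list.
    return text in _num_words
```

A plain list lookup is enough here because the list is short; a set would be the natural choice for a larger vocabulary.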
Emil Stenström
3834f4146d Add abbreviations from UD_Swedish-Talbanken (#2613)
* Add abbreviations from UD_Swedish-Talbanken

* Add contributor agreement.
2018-08-07 13:53:17 +02:00
Ole Henrik Skogstrøm
0473add369 Feature/span ents (#2599)
* Created Span.ents property

* Add tests for span.ents

* Add tests for start and end of sentence
2018-08-07 13:52:32 +02:00
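The idea behind a `Span.ents` property (commit 0473add369) can be sketched without spaCy: keep only those document-level entities whose token offsets fall entirely inside the span. The `Entity` tuple and the function `ents_in_span` below are hypothetical stand-ins, and the containment rule is an assumption about the intended behavior rather than the commit's actual implementation.

```python
from collections import namedtuple

# Hypothetical stand-in for an entity: token offsets [start, end) plus a label.
Entity = namedtuple("Entity", ["start", "end", "label"])


def ents_in_span(doc_ents, span_start, span_end):
    """Return the entities that lie completely inside [span_start, span_end)."""
    return [
        ent for ent in doc_ents
        if ent.start >= span_start and ent.end <= span_end
    ]


doc_ents = [Entity(0, 2, "PERSON"), Entity(5, 6, "GPE"), Entity(8, 10, "ORG")]
inside = ents_in_span(doc_ents, 4, 8)
```

Entities that start before the span or end after it are excluded, which is the boundary case the "start and end of sentence" tests in the commit message would need to cover.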
Xiaoquan Kong
87fa847e6e Fix Chinese language related bugs (#2634) 2018-08-07 11:26:31 +02:00
Xiaoquan Kong
f0c9652ed1 New Feature: display more detail when Error E067 (#2639)
* Fix off-by-one error

* Add verbose option

* Update verbose option

* Update documents for verbose option
2018-08-07 10:45:29 +02:00
Emil Stenström
1914c488d3 Swedish: Exceptions for single letter words ending sentence (#2615)
* Exceptions for single letter words ending sentence

Sentences ending in "i." (as in "... peka i."), "m." (as in "...än 2000 m."), should be tokenized as two separate tokens.

* Add test
2018-08-05 14:14:30 +02:00
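The behavior described in commit 1914c488d3 (splitting "i." and "m." into two tokens) can be illustrated with a toy whitespace tokenizer and an exception table. This is only a sketch under stated assumptions: the table `EXCEPTIONS` and the function `tokenize` are hypothetical, and a real tokenizer handles prefixes, suffixes and abbreviations that this one ignores.

```python
# Hypothetical exception table: each string maps to the tokens it should become.
EXCEPTIONS = {
    "i.": ["i", "."],
    "m.": ["m", "."],
}


def tokenize(text):
    """Split on whitespace, then expand any chunk found in the exception table."""
    tokens = []
    for chunk in text.split():
        tokens.extend(EXCEPTIONS.get(chunk, [chunk]))
    return tokens
```

With this table, `tokenize("än 2000 m.")` yields the unit and the final period as separate tokens, while chunks that are not listed pass through unchanged.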
Matthew Honnibal
860f5bd91f Add test for issue 2626 2018-08-05 13:46:57 +02:00
Kaisa (Katarzyna) Korsak
e531a827db Changed conllu2json to be able to extract NER tags (#2594)
* extract ner tags from conllu file if available

* fixed a bug in regex
2018-07-25 22:21:31 +02:00
Dmitry Bruhanov
07d0cc9de7 Update examples.py (#2597) 2018-07-25 22:20:24 +02:00
Matthew Honnibal
66983d8412 Port BenDerPan's Chinese changes to v2 (finally) (#2591)
* add template files for Chinese

* add template files for Chinese, and test directory.
2018-07-25 02:47:23 +02:00
ines
f2e3e039b7 Update French stop words (resolves #2540) 2018-07-24 23:41:51 +02:00
Ines Montani
75f3234404 💫 Refactor test suite (#2568)
## Description

Related issues: #2379 (should be fixed by separating model tests)

* **total execution time down from > 300 seconds to under 60 seconds** 🎉
* removed all model-specific tests that could only really be run manually anyway – those will now live in a separate test suite in the [`spacy-models`](https://github.com/explosion/spacy-models) repository and are already integrated into our new model training infrastructure
* changed all relative imports to absolute imports to prepare for moving the test suite from `/spacy/tests` to `/tests` (it'll now always test against the installed version)
* merged old regression tests into collections, e.g. `test_issue1001-1500.py` (about 90% of the regression tests are very short anyways)
* tidied up and rewrote existing tests wherever possible

### Todo

- [ ] move tests to `/tests` and adjust CI commands accordingly
- [x] move model test suite from internal repo to `spacy-models`
- [x] ~~investigate why `pipeline/test_textcat.py` is flakey~~
- [x] review old regression tests (leftover files) and see if they can be merged, simplified or deleted
- [ ] update documentation on how to run tests


### Types of change
enhancement, tests

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-24 23:38:44 +02:00
Matthew Honnibal
82277f63a3 💫 Small efficiency fixes to tokenizer (#2587)
This patch improves tokenizer speed by about 10%, and reduces memory usage in the `Vocab` by removing a redundant index. The `vocab._by_orth` and `vocab._by_hash` were indexed on different data in v1, but in v2 the orth and the hash are identical.

The patch also fixes an uninitialized variable in the tokenizer, the `has_special` flag. This flag records whether a chunk we're tokenizing triggers a special-case rule. If it does, then we avoid caching within the chunk. Because the flag was uninitialized, this check incorrectly rejected some chunks from the cache.

With the `en_core_web_md` model, we now tokenize the IMDB train data at 503,104 words per second. Prior to this patch, we had 465,764 words per second.

Before switching to the regex library and supporting more languages, we had 1.3m words per second for the tokenizer. In order to recover the missing speed, we need to:

* Fix the variable-length lookarounds in the suffix, infix and `token_match` rules
* Improve the performance of the `token_match` regex
* Switch back from the `regex` library to the `re` library.

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-24 23:35:54 +02:00
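The relative speedup implied by the two throughput figures in commit 82277f63a3 can be checked with a line of arithmetic. The sketch below takes the two numbers from the message at face value; since both are in the same unit, the unit cancels in the ratio. The variable names are chosen for illustration.

```python
# Throughput figures quoted in the commit message (same unit for both).
after = 503104
before = 465764

# Relative improvement of the new figure over the old one, in percent.
speedup_pct = (after / before - 1) * 100
print(f"Speedup: {speedup_pct:.1f}%")
```

The result comes out at roughly 8%, which is in the same range as the "about 10%" stated in the message.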
Matthew Honnibal
6303ce3d0e Try to fix memory error by moving fr_tokenizer to module scope 2018-07-24 20:09:06 +02:00
Matthew Honnibal
afe3fa4449 Merge branch 'master' of https://github.com/explosion/spaCy 2018-07-24 19:44:31 +02:00
Matthew Honnibal
b2e9e958b9 Add session scoping to tokenizers to try to fix oom on Appveyor 2018-07-24 19:44:18 +02:00
Ines Montani
a43ad114c2 Fix typo [ci skip] 2018-07-24 18:45:40 +02:00
Dmitry Bruhanov
27160b1516 added some widespread written jargon & dialectisms (#2584)
This jargon is not offensive but emotionally colored as funny due to its deviation from the norm for various reasons: imitating a dialect, deliberately wrong spelling emphasizing its low colloquial nature, obsolete form, foreign borrowing with native inflections, etc.
Dmitry Briukhanov, Linguist & Pythonist
2018-07-24 18:44:29 +02:00
ines
3c30d1763c Merge branch 'master' into develop 2018-07-21 15:34:18 +02:00
Matthew Honnibal
90c269e1a9 Set about to v2.0.12 release 2018-07-21 15:09:42 +02:00
Matthew Honnibal
1a1c7304cf Set version to 2.0.12.dev1 2018-07-21 13:08:01 +02:00
ines
1ea881c80b Allow ignoring warnings and only overwrite if set explicitly 2018-07-20 22:50:19 +02:00
Matthew Honnibal
e0caf3ae8c Fix msgpack for new version 2018-07-20 17:32:00 +02:00
Matthew Honnibal
899f1cf442 Add regression test for issue 2179 2018-07-20 17:15:44 +02:00
Matthew Honnibal
9db77fd914 Fix deserialization for msgpack 2018-07-20 14:11:09 +02:00
katarkor
5ca853bee0 changed tag_map, morph_rules, lemmatizer for Norwegian (#2565)
* changed tag_map, morph_rules, lemmatizer for Norwegian

* Move unicode declaration up

Hopefully fixes test failure on Python 2

* Update CONTRIBUTOR_AGREEMENT.md

* Move unicode declarations

Hopefully fixes test this time

* Revert "Merge remote-tracking branch 'origin/patch-1'"

This reverts commit f5ccd5dd0d, reversing
changes made to dd07e180ea.

* Update contributor agreement [ci skip]
2018-07-19 19:38:24 +02:00
Ines Montani
e7b075565d 💫 Rule-based NER component (#2513)
* Add helper function for reading in JSONL

* Add rule-based NER component

* Fix whitespace

* Add component to factories

* Add tests

* Add option to disable indent on json_dumps compat

Otherwise, reading JSONL back in line by line won't work

* Fix error code
2018-07-18 19:43:16 +02:00
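The note in commit e7b075565d about disabling indentation follows from how JSONL works: every record must occupy exactly one line, so indented JSON cannot be read back line by line. The sketch below shows this with the standard `json` module only; the helper names `write_jsonl` and `read_jsonl` and the record fields are made up for illustration and are not taken from the commit.

```python
import json


def write_jsonl(records):
    """Serialize each record on exactly one line (no indent)."""
    return "\n".join(json.dumps(record) for record in records)


def read_jsonl(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical records; the field names are illustrative only.
patterns = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]},
]
assert read_jsonl(write_jsonl(patterns)) == patterns

# With indent, one record spans several lines, so line-by-line parsing fails.
indented = json.dumps(patterns[0], indent=2)
```

Passing `indented` to `read_jsonl` raises a `json.JSONDecodeError`, because the first line is just an opening brace.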
ines
d84b13e02c Merge branch 'master' into develop 2018-07-18 18:57:00 +02:00
Ole Henrik Skogstrøm
6e2930a4a2 Conll(u)-bio converter (#2525)
* Started simple conllxbiluo converter

* Fix missing BIO to BILUO conversion
2018-07-18 18:55:42 +02:00
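The BIO to BILUO conversion mentioned in commit 6e2930a4a2 can be sketched as follows. This is a simplified illustration that assumes well-formed IOB2 input (every entity starts with `B-`); the function name `iob_to_biluo` matches the name used elsewhere in this log, but the body is not the repository's implementation.

```python
def iob_to_biluo(tags):
    """Convert BIO/IOB2 tags (B-X, I-X, O) to BILUO tags."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            # A B- tag with no continuation is a single-token entity (U-).
            biluo.append(("B-" if continues else "U-") + label)
        else:
            # An I- tag with no continuation is the last token (L-).
            biluo.append(("I-" if continues else "L-") + label)
    return biluo
```

For example, `["B-PER", "I-PER", "O", "B-LOC"]` becomes `["B-PER", "L-PER", "O", "U-LOC"]`: the final token of a multi-token entity gets `L-`, and a single-token entity gets `U-`.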
ines
02aefe7cc0 Merge branch 'master' into develop 2018-07-18 18:52:59 +02:00
Ioannis Daras
6ed18412d0 Greek language optimizations (#2558)
* Greek language optimizations

* Add encoding on files containing greek words

* Add encoding on files containing greek words
2018-07-18 18:51:38 +02:00
ines
80e7485630 Merge branch 'master' into develop 2018-07-18 17:28:47 +02:00
Paul O'Leary McCann
61ef0739b8 Add Japanese stop words. (#2549)
List created by taking the 2000 top words from a Wikipedia dump and
removing everything that wasn't hiragana.

Tried going through kanji words and deciding what to keep but there were
too many obvious non-stopwords (東京 was in the top 500) and many other
words where it wasn't clear if they should be included or not.
2018-07-17 10:12:48 +02:00
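The procedure described in commit 61ef0739b8 (take the most frequent words, drop everything that is not hiragana) could look roughly like this. It is a sketch under assumptions: the input is an already tokenized list of words, the function name `hiragana_stop_words` is hypothetical, and "hiragana" is taken to mean the Unicode range U+3041 to U+309F.

```python
import re
from collections import Counter

# Hiragana range U+3041 to U+309F; fullmatch keeps only all-hiragana words.
HIRAGANA_ONLY = re.compile(r"[\u3041-\u309F]+")


def hiragana_stop_words(tokens, top_n=2000):
    """Take the `top_n` most frequent tokens and keep the hiragana-only ones."""
    most_common = [word for word, _ in Counter(tokens).most_common(top_n)]
    return [word for word in most_common if HIRAGANA_ONLY.fullmatch(word)]
```

A frequent kanji word such as 東京 is dropped by the filter even if it ranks near the top, which matches the reasoning given in the commit message.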
Tero K
f35980f865 Enhancement/lang fi examples (#2547)
* Added a file with examples in finnish

* added contributor agreement
2018-07-15 09:50:27 +02:00
Paul O'Leary McCann
1987f3f784 Add Japanese lemmas (#2543)
This info was already available from Mecab, forgot to add it before.
2018-07-13 10:55:14 +02:00
ines
3a321e79ac Merge branch 'master' into develop 2018-07-10 13:49:08 +02:00
Eleni170
6042723535 Add support for Greek language (#2535)
* Add contributor agreement

* Support for Greek language

* Fix missing el_tokenizer
2018-07-10 13:48:38 +02:00
Stefan Schweter
3dfc7f86be lemmatizer: correct lemma for Rang (#2537)

## Description

This PR corrects the German lemma form for the word "Rang". Initially, the lemma form was "ringen", which is not correct, because it refers to the verb ("ringen") and not to the noun ("Rang").

### Types of change

The lemma form for "Rang" is corrected to "Rang", see also the [Duden](https://www.duden.de/rechtschreibung/Rang) entry.

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-10 13:11:19 +02:00
ines
fd6207426a Merge branch 'master' into develop 2018-07-09 18:05:10 +02:00
Duygu Altinok
00b9a58558 German lemmatizer additions (#2529)
* lemma of was-> was

* added new pairs issue @2486

* added article tests
2018-07-09 11:10:15 +02:00
Ole Henrik Skogstrøm
c21efea9bb Add sent property to token (#2521)
* Add sent property to token

* Refactored and cleaned up copy paste errors.
2018-07-06 15:54:15 +02:00
ines
38e07ade4c Add test for custom tokenizer serialization (resolves #2494) 2018-07-06 12:40:51 +02:00
ines
c2581f9172 Tidy up tokenizer test 2018-07-06 12:40:28 +02:00
Matthew Honnibal
43dcaa473e Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-07-06 12:36:42 +02:00
Matthew Honnibal
6c8d627733 Fix tokenizer deserialization 2018-07-06 12:36:33 +02:00
ines
c001d46153 Tidy up 2018-07-06 12:33:42 +02:00
Matthew Honnibal
63f5651f8d Fix tokenizer serialization 2018-07-06 12:32:11 +02:00
Matthew Honnibal
e1569fda4e Fix compile error in matcher 2018-07-06 12:29:23 +02:00
Matthew Honnibal
f5b2076700 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-07-06 12:23:14 +02:00
Matthew Honnibal
1a2f61725c Fix tokenizer serialization 2018-07-06 12:23:04 +02:00
ines
9e09477b2f Remove unused import 2018-07-06 12:18:17 +02:00
ines
26f04a6ac3 Fix Matcher tests and add test for any token with operator 2018-07-06 12:17:50 +02:00
Matthew Honnibal
f5703b7a91 Clean up unused stuff in matcher 2018-07-06 12:16:44 +02:00
Matthew Honnibal
08c362d541 Suppress compiler warning about unreachable code 2018-07-06 11:31:22 +02:00
Matthew Honnibal
8ae1bec8bf Fix init_model 2018-07-05 14:02:06 +02:00
Matthew Honnibal
7b09a4ca49 Fix lemmatization 2018-07-05 13:56:02 +02:00
Matthew Honnibal
ec41ceb383 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-07-05 13:49:42 +02:00
Matthew Honnibal
4eb3405df7 Fix lemmatizer ordering, re Issue #1387 2018-07-05 13:49:29 +02:00
ines
63666af328 Merge branch 'master' into develop 2018-07-04 14:52:25 +02:00
ines
8feb7cfe2d Remove model dependency from French lemmatizer tests 2018-07-04 14:46:45 +02:00
kleinay
a82c3153ad fix issue #2452 - displacy arrow direction is always forward (#2506) (closes #2452)
Referring to #2452: fix displacy arrow directions to match the input.

## Description
The fix simply replaces `direction is 'left'` with `direction == 'left'`, which also covers the case where `direction` is a `str` and not a `unicode`.

### Types of change
bug fix

## Checklist
- [ ] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-04 14:12:08 +02:00
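The bug fixed in commit a82c3153ad comes down to the difference between identity (`is`) and equality (`==`). The sketch below shows the distinction in Python 3 with a string built at runtime; the function `is_left` is a hypothetical helper, and the original problem involved `str` versus `unicode` objects in Python 2.

```python
def is_left(direction):
    """Compare by value, so any string equal to "left" matches."""
    return direction == "left"


# A string built at runtime is equal to the literal, but it is not
# guaranteed to be the same object, so an identity check (`is`) may fail.
runtime_value = "".join(["le", "ft"])
literal = "left"
print(runtime_value == literal)  # value comparison
print(runtime_value is literal)  # identity comparison, implementation-dependent
```

Because the outcome of `is` depends on how the interpreter happens to allocate and intern strings, value comparisons on strings should use `==`.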
Bùi Trung Chí
9af46b4f1b Fix loading tokenizer with custom prefix search (#2495)
* Add contributor agreement

* Fix loading tokenizer with custom prefix search
2018-07-04 12:56:07 +02:00
Matthew Honnibal
dee8bdb900 Fix init-model for npz vectors 2018-07-04 02:29:48 +02:00
Matthew Honnibal
59d655e8d0 Fix model init from jsonl 2018-07-04 01:30:40 +02:00
Matthew Honnibal
1e38bea6e9 Save vectors init 2018-07-03 23:55:04 +02:00
Matthew Honnibal
6692833887 Fix init_model 2018-07-03 23:24:11 +02:00
Matthew Honnibal
4a38a26cb5 Fix init_model 2018-07-03 22:57:11 +02:00
Matthew Honnibal
019d09e3c3 Fix init model 2018-07-03 22:16:44 +02:00
Matthew Honnibal
2543f8c93a Support .npz vectors in init-model command 2018-07-03 21:42:16 +02:00
Matthew Honnibal
86aad11939 Fix init_model arg 2018-07-03 17:00:42 +02:00
Matthew Honnibal
eff42d36e3 Fix init model command 2018-07-03 16:32:23 +02:00
Matthew Honnibal
97487122ea Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-07-03 15:44:37 +02:00
Matthew Honnibal
6a89faf12e Add support for jsonl-formatted lexical attributes to init-model command. 2018-07-03 12:22:56 +02:00
Matthew Honnibal
2ec2192000 Revert #1389: Don't overrule rules when lemma exception is present 2018-06-29 19:43:02 +02:00
Matthew Honnibal
01ace9734d Make pipeline work on empty docs 2018-06-29 19:21:38 +02:00
Matthew Honnibal
a1b05048d0 Fix tagger when doc is empty 2018-06-29 16:05:40 +02:00
Matthew Honnibal
3786942ff1 Fix tagger when docs are empty 2018-06-29 15:13:45 +02:00
ines
526be40823 Add test for 46d8a66 2018-06-29 14:33:12 +02:00
ines
f08c871adf Fix typo in Language.from_disk 2018-06-29 14:32:16 +02:00
Matthew Honnibal
46d8a66fef Fix tokenizer serialization if token_match is None 2018-06-29 14:24:46 +02:00
Matthew Honnibal
e0860bcfb3 Fix bug when docs are empty 2018-06-29 13:56:29 +02:00
Matthew Honnibal
a4d2b0c293 Fix bug when docs are empty 2018-06-29 13:44:25 +02:00
Matthew Honnibal
c83fccfe2a Fix output of best model 2018-06-25 23:05:56 +02:00
Matthew Honnibal
5a65418c40 Fix handling of unseen labels in tagger 2018-06-25 22:28:59 +02:00
Matthew Honnibal
5b56aad4c2 Fix handling of unseen labels in tagger 2018-06-25 22:24:54 +02:00
Matthew Honnibal
3aabf621a3 Fix handling of unknown tags in tagger update 2018-06-25 22:01:02 +02:00
Matthew Honnibal
69c900f003 Fix init-model if no vectors provided 2018-06-25 18:26:02 +02:00
Matthew Honnibal
664f89327a Fix init-model if no vectors provided 2018-06-25 17:58:45 +02:00
Matthew Honnibal
c4698f5712 Don't collate model unless training succeeds 2018-06-25 16:36:42 +02:00
Ole Henrik Skogstrøm
d16cb6bee6 Accept Span to displacy render (#2478) (closes #2477)
* Add Span to displacy render

* Fix span support, errors and add tests
2018-06-25 14:55:16 +02:00
Matthew Honnibal
24dfbb8a28 Fix model collation 2018-06-25 14:35:24 +02:00
Matthew Honnibal
62237755a4 Import shutil 2018-06-25 13:40:17 +02:00
Matthew Honnibal
a040fca99e Import json into cli.train 2018-06-25 11:50:37 +02:00
Matthew Honnibal
2c703d99c2 Fix collation of best models 2018-06-25 01:21:34 +02:00
Matthew Honnibal
9d6a1c57f2 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-06-24 23:40:06 +02:00
Matthew Honnibal
2c80b7c013 Collate best model after training 2018-06-24 23:39:52 +02:00
Muhammad Irfan
f33c703066 Add Urdu Language Support (#2430)
* added Urdu language support.

* added Urdu language tests.

* modified conftest.py for Urdu language support.

* added spacy contributor agreement.
2018-06-22 11:14:03 +02:00
himkt
14d9007efd fix wrong indexing (#2416)
* fix wrong indexing

* add agreement
2018-06-19 10:20:57 +02:00
Aliia E
428bae66b5 Add Tatar Language Support (#2444)
* add Tatar lang support

* add Tatar letters

* add Tatar tests

* sign contributor agreement

* sign contributor agreement [x]

* remove comments from Language class

* remove all template comments
2018-06-19 10:17:53 +02:00
Cory Hurst
446f5ec41b Silent keyword in info function in init (#2459)
* Pass through "silent" kwarg to the wrapper in the spacy module init.
reference issue  #2196

* Pass through "silent" kwarg to the wrapper in the spacy module init.
reference issue  #2196

* contributor agreement
2018-06-18 12:24:21 +02:00
ines
778e5f4da3 Merge branch 'master' into develop 2018-06-11 00:38:04 +02:00
himkt
57311d5d47 replace janome with mecab in the documentation and the test (#2415)
* Add links to Reddit data (see #2401)

* replace janome with mecab in the documentation and the test

* add the assignment
2018-06-11 00:33:13 +02:00
Nour Shalabi
a169b79092 Additions to Arabic stop words. (#2422)
* Additions to Arabic stop words.

* Create nourshalabi.md
2018-06-08 02:33:23 +02:00
ines
a0017e4909 Merge branch 'master' into develop 2018-05-30 14:10:47 +02:00
ines
b8ef9c1000 Fix model names in conftest (see #2379) 2018-05-30 14:10:20 +02:00
ines
4a62486340 Merge branch 'master' into develop 2018-05-30 13:01:01 +02:00
Maciej
c7d53348d7 Fix bug in CLI iob and ner converter (#2392) (fixes #2385)
* issue_2385 add tests for iob_to_biluo converter function

* issue_2385 fix and modify iob_to_biluo function to accept either iob or biluo tags in cli.converter

* issue_2385 add test to fix b char bug

* add contributor agreement

* fill contributor agreement
2018-05-30 12:28:44 +02:00
ines
3c3a175018 Merge branch 'master' into develop 2018-05-28 18:37:09 +02:00
ansgar-t
9732988951 escape html in displacy.render (#2378) (closes #2361)
## Description
Fix for issue #2361:
replace &, <, >, " with &amp;, &lt;, &gt;, &quot; before rendering the SVG

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
(As discussed in the comments to #2361)
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-05-28 18:36:41 +02:00
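The escaping described in commit 9732988951 can be reproduced with Python's standard library. This is a sketch that uses `html.escape`, which may differ from how the commit performs the replacement; the wrapper name `escape_for_svg` is hypothetical.

```python
import html


def escape_for_svg(text):
    """Escape characters that would otherwise be parsed as markup."""
    # quote=True also escapes double and single quotes.
    return html.escape(text, quote=True)


print(escape_for_svg('AT&T <b>"bold"</b>'))
# prints: AT&amp;T &lt;b&gt;&quot;bold&quot;&lt;/b&gt;
```

The ampersand has to be escaped first, otherwise the `&` inside the other entities would be escaped a second time; `html.escape` takes care of that ordering.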
ines
f7103babd9 Only overwrite warnings filter if set explicitly (resolves #2369)
This way, pre-defined warning filters are respected and users are still able to use the fine-grained warning settings if they like.
2018-05-26 18:44:15 +02:00
ines
330c039106 Merge branch 'master' into develop 2018-05-26 18:30:52 +02:00
James Messinger
4515e96e90 Better formatting for spacy train CLI (#2357)
* Better formatting for `spacy train` CLI

Changed to use fixed-spaces rather than tabs to align table headers and data.

### Before:
```
Itn.    P.Loss  N.Loss  UAS     NER P.  NER R.  NER F.  Tag %   Token %
0       4618.857        2910.004        76.172  79.645  67.987  88.732  88.261  100.000 4436.9  6376.4
1       4671.972        3764.812        74.481  78.046  62.374  82.680  88.377  100.000 4672.2  6227.1
2       4742.756        3673.473        71.994  77.380  63.966  84.494  90.620  100.000 4298.0  5983.9
```

### After:
```
Itn.  Dep Loss  NER Loss  UAS     NER P.  NER R.  NER F.  Tag %   Token %  CPU WPS  GPU WPS
0     4618.857  2910.004  76.172  79.645  67.987  88.732  88.261  100.000  4436.9   6376.4
1     4671.972  3764.812  74.481  78.046  62.374  82.680  88.377  100.000  4672.2   6227.1
2     4742.756  3673.473  71.994  77.380  63.966  84.494  90.620  100.000  4298.0   5983.9
```

* Added contributor file
2018-05-25 13:08:45 +02:00
Aristo Rinjuang
432ede04af adding more words and rephrasing (#2351)
* adding more words and rephrasing

* adding a contributor

* tokenizer bugs solved
2018-05-24 11:40:57 +02:00
Jani Monoses
ec62cadf4c Updates to Romanian support (#2354)
* Add back Romanian in conftest

* Romanian lex_attr

* More tokenizer exceptions for Romanian

* Add tests for some Romanian tokenizer exceptions
2018-05-24 11:40:00 +02:00
Matthew Honnibal
5d281cf302 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-05-22 20:50:59 +02:00
Matthew Honnibal
ce458c2428 Fix spacy requirement constraint in package template 2018-05-22 20:50:46 +02:00
Ines Montani
862da5e793 Support pipeline factories via entry points (#2348) 2018-05-22 18:29:45 +02:00
Matthew Honnibal
d5af38f80c Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-05-21 17:42:55 +02:00
Matthew Honnibal
ee33de8652 Fix unpickling of NER parser 2018-05-21 17:42:40 +02:00
ines
f9dbcac8e4 Merge branch 'master' into develop 2018-05-21 02:29:29 +02:00
cclauss
f7dcaa1f6b Simplify is_config() and normalize_string_keys() (#2305)
* Simplify is_config() and normalize_string_keys()

* Use __in__ to avoid the nested _ands_ and _ors_.
* Dict comprehension directly tracks with the doc string

* Keep more basic loop in normalize_string_keys

* Whitespace
2018-05-21 01:54:35 +02:00
Ines Montani
cae4457c38 💫 Add .similarity warnings for no vectors and option to exclude warnings (#2197)
* Add logic to filter out warning IDs via environment variable

Usage: SPACY_WARNING_EXCLUDE=W001,W007

* Add warnings for empty vectors

* Add warning if no word vectors are used in .similarity methods

For example, if only tensors are available in small models – should hopefully clear up some confusion around this

* Capture warnings in tests

* Rename SPACY_WARNING_EXCLUDE to SPACY_WARNING_IGNORE
2018-05-21 01:22:38 +02:00
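Filtering warning IDs through an environment variable, as described in commit cae4457c38, might be sketched like this. Only the variable name `SPACY_WARNING_IGNORE` and the comma-separated format come from the commit message; the functions `ignored_warning_ids` and `should_warn` are hypothetical and do not claim to mirror spaCy's internals.

```python
import os


def ignored_warning_ids(environ=None):
    """Parse a comma-separated list of warning IDs from the environment."""
    environ = os.environ if environ is None else environ
    raw = environ.get("SPACY_WARNING_IGNORE", "")
    return {part.strip() for part in raw.split(",") if part.strip()}


def should_warn(warning_id, environ=None):
    """Return False for warning IDs the user asked to ignore."""
    return warning_id not in ignored_warning_ids(environ)
```

Accepting an optional mapping instead of always reading `os.environ` keeps the functions easy to test without modifying the real environment.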
Matthew Honnibal
b096b22c20 Merge pull request #2247 from skrcode/1480
1480 - Implement Fast-Text vectors with subword features
2018-05-21 01:16:21 +02:00
Matthew Honnibal
f3b4f6a4ec Merge setup.py 2018-05-20 23:21:00 +02:00
Ines Montani
d4cc736b7c 💫 Improve model downloads: check for existing install, customise pip and use requests library again (#2346)
* Go back to using requests instead of urllib (closes #2320)

Fewer dependencies are good, but this one was simply causing too many other problems around SSL verification and Python 2/3 compatibility. requests is a popular enough package that it's okay for spaCy to depend on it – and this will hopefully make model downloads less flakey.

* Only download model if not installed (see #1456)

Use #egg=model==version to allow pip to check for existing installations. The download is only started if no installation matching the package/version is found. Fixes a long-standing inconvenience.

* Pass additional options to pip when installing model (resolves #1456)

Treat all additional arguments passed to the download command as pip options to allow user to customise the command. For example:

python -m spacy download en --user

* Add CLI option to enable installing model package dependencies

* Revert "Add CLI option to enable installing model package dependencies"

This reverts commit 9336ffe695.

* Update documentation
2018-05-20 20:26:56 +02:00
Matthew Honnibal
3eb446e0a5 Require thinc 6.11.1 and prepare for release to spacy-nightly 2018-05-20 19:00:34 +02:00
Matthew Honnibal
bdc23dd8c1 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-05-20 18:59:24 +02:00
ines
5401c55c75 Merge branch 'master' into develop 2018-05-20 16:49:40 +02:00
ines
b59e3b157f Don't require attrs argument in Doc.retokenize and allow both ints and unicode (resolves #2304) 2018-05-20 15:15:37 +02:00
ines
5768df4f09 Add SimpleFrozenDict util to use as default function argument 2018-05-20 15:13:37 +02:00
Matthew Honnibal
7431e9c87f Fix parser for GPU 2018-05-19 17:24:34 +00:00
Matthew Honnibal
401213fb1f Only warn about unnamed vectors if non-zero sized. 2018-05-19 18:51:55 +02:00
Matthew Honnibal
74d5c625b3 Use rising beam update prob 2018-05-16 20:11:59 +02:00
Matthew Honnibal
544ae7f1db Merge branch 'develop' into feature/refactor-parser 2018-05-16 02:06:49 +02:00
Matthew Honnibal
d1b27fe5aa Revert "Improve dynamic oracle when values are missing in parse"
This reverts commit f56bd4736b.
2018-05-16 00:31:52 +02:00
Matthew Honnibal
83acaa0358 Add missing name attribute for parser 2018-05-15 19:01:53 +02:00
Matthew Honnibal
f328c195ca Fix size limits in training data 2018-05-15 19:01:41 +02:00
Matthew Honnibal
8446b35ce0 Fix parser model loading 2018-05-15 18:43:46 +02:00
Matthew Honnibal
dc1a479fbd Merge branch 'develop' into feature/refactor-parser 2018-05-15 18:39:21 +02:00
Matthew Honnibal
546dd99cdf Merge master into develop -- mostly Arabic and website 2018-05-15 18:14:28 +02:00
Matthew Honnibal
5664ab7e6c Revert hacks to tests 2018-05-15 18:00:09 +02:00
Matthew Honnibal
7b9195657b Restore beam_density argument for parser beam 2018-05-15 17:55:11 +02:00
Matthew Honnibal
581d318971 Fix conftest 2018-05-15 00:54:45 +02:00
Tahar Zanouda
00417794d3 Add Arabic language (#2314)
* added support for Arabic lang

* added Arabic language support

* updated conftest
2018-05-15 00:27:19 +02:00
Jani Monoses
0e08e49e87 Lemmatizer ro (#2319)
* Add Romanian lemmatizer lookup table.

Adapted from http://www.lexiconista.com/datasets/lemmatization/
by replacing cedillas with commas (ș and ț).

The original dataset is licensed under the Open Database License.

* Fix one blatant issue in the Romanian lemmatizer

* Romanian examples file

* Add ro_tokenizer in conftest

* Add Romanian lemmatizer test
2018-05-12 15:20:04 +02:00
Matthew Honnibal
887631ca25 Disable some tests to figure out why CI fails 2018-05-10 16:42:01 +02:00
Matthew Honnibal
902a172cb7 Disable some tests to figure out why CI fails 2018-05-10 16:30:07 +02:00
Matthew Honnibal
614d45ea58 Set a more aggressive threshold on the max violn update 2018-05-10 15:38:24 +02:00
Matthew Honnibal
8e8724b55b Default to beam_update_prob 1 2018-05-10 15:38:02 +02:00
Jani Monoses
42b34832e4 Update Romanian stopword list (#2316)
* Contributor agreement for janimo

* Update Romanian stopword list

Include the correct spellings of all the words already in the repo
that are using cedillas (ş and ţ) instead of commas (ș and ț).

Add another unrelated spelling fix.

See https://github.com/stopwords-iso/stopwords-ro/pull/1 and
https://github.com/stopwords-iso/stopwords-ro/pull/2
2018-05-10 12:16:56 +02:00
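The spelling fix in commit 42b34832e4 replaces the legacy cedilla letters with the comma-below letters that are correct for Romanian. A minimal sketch of that normalization, using `str.translate`; the names `CEDILLA_TO_COMMA` and `normalize_romanian` are assumptions for illustration.

```python
# Map cedilla forms (U+015E/F, U+0162/3) to comma-below forms (U+0218/9, U+021A/B).
CEDILLA_TO_COMMA = str.maketrans({
    "\u015e": "\u0218",  # Ş -> Ș
    "\u015f": "\u0219",  # ş -> ș
    "\u0162": "\u021a",  # Ţ -> Ț
    "\u0163": "\u021b",  # ţ -> ț
})


def normalize_romanian(text):
    """Replace legacy cedilla letters with the correct comma-below letters."""
    return text.translate(CEDILLA_TO_COMMA)
```

The two forms look nearly identical when rendered but are different code points, so a stop word list using one form will not match text written with the other.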
Lucas Abbade
be7fdc59d1 Update lex_attrs.py (#2307)
* Update lex_attrs.py

Fixed spelling mistakes of some numbers (according to Brazilian Portuguese).

* Update lex_attrs.py

As requested, I've included the correct spelling for both Brazilian Portuguese and Portuguese Portuguese.

I would advise, however, that the two be separated in the future. Brazilian Portuguese is very different from European Portuguese: although most of the writing is unified, the way people talk in the two countries is radically different. Keeping both as one language may lead to bigger issues in the future, especially when it comes to spell checking.
2018-05-09 20:49:31 +02:00
mauryaland
5368ba028a Update stop_words.py for French language (#2310)
* Add contraction forms of some common stopwords

All the stopwords added contain the apostrophe " ' " or " ’ ".

* Adds contributor agreement mauryaland

* Update mauryaland.md
2018-05-09 12:04:38 +02:00
Matthew Honnibal
a61fd60681 Fix error in beam gradient calculation 2018-05-09 02:44:09 +02:00
Matthew Honnibal
a6ae1ee6f7 Don't modify Token in global scope 2018-05-09 00:43:00 +02:00
Matthew Honnibal
f94f721f40 Avoid importing fused token symbol in ud-run-test, until that's added 2018-05-09 00:28:03 +02:00
Matthew Honnibal
659ec5b975 Avoid importing fused token symbol in ud-run-test, until that's added 2018-05-08 19:40:33 +02:00
Matthew Honnibal
4cb0494bef Bug fixes to beam search after refactor 2018-05-08 13:48:50 +02:00
Matthew Honnibal
5ed71973b3 Add a keyword argument sink to GoldParse 2018-05-08 13:48:32 +02:00
Matthew Honnibal
8cfe326f87 Avoid relying on final gold check in beam search 2018-05-08 13:48:19 +02:00
Matthew Honnibal
fc4dd49b77 Support oracle segmentation in ud-train CLI command 2018-05-08 13:47:45 +02:00
Matthew Honnibal
c49e44349a Fix beam parsing 2018-05-08 02:53:24 +02:00
Matthew Honnibal
99649d114d Fix parser 2018-05-08 00:27:26 +02:00
Matthew Honnibal
8a82367a9d Fix beam search after refactor 2018-05-08 00:20:33 +02:00
Matthew Honnibal
5a0f26be0c Readd beam search after refactor 2018-05-08 00:19:52 +02:00
ines
7a3599c21a Fix formatting and consistency 2018-05-07 23:02:11 +02:00
Matthew Honnibal
36b2c9bdd5 Fix refactored parser 2018-05-07 18:58:09 +02:00
Matthew Honnibal
bde3be1ad1 Fix refactored parser 2018-05-07 18:31:04 +02:00
Matthew Honnibal
01c4e13b02 Update test 2018-05-07 16:59:52 +02:00
Matthew Honnibal
f6cdafc00e Fix refactored parser 2018-05-07 16:59:38 +02:00
Matthew Honnibal
f56bd4736b Improve dynamic oracle when values are missing in parse 2018-05-07 15:53:18 +02:00
Matthew Honnibal
eddc0e0c74 Set gold.sent_starts in ud_train 2018-05-07 15:52:47 +02:00
Matthew Honnibal
bf19f22340 Allow gold.sent_starts to be set from Python 2018-05-07 15:51:34 +02:00
Matthew Honnibal
7f163442e6 Work on refactoring greedy parser 2018-05-07 15:45:52 +02:00
Douglas Knox
9b49a40f4e Test and fix for Issue #2219 (#2272)
Test and fix for Issue #2219: Token.similarity() failed if single letter
2018-05-03 18:40:46 +02:00
Paul O'Leary McCann
bd72fbf09c Port Japanese mecab tokenizer from v1 (#2036)
* Port Japanese mecab tokenizer from v1

This brings the Mecab-based Japanese tokenization introduced in #1246 to
spaCy v2. There isn't a JapaneseTagger implementation yet, but POS tag
information from Mecab is stored in a token extension. A tag map is also
included.

As a reminder, Mecab is required because Universal Dependencies are
based on Unidic tags, and Janome doesn't support Unidic.

Things to check:

1. Is this the right way to use a token extension?

2. What's the right way to implement a JapaneseTagger? The approach in
 #1246 relied on `tag_from_strings` which is just gone now. I guess the
best thing is to just try training spaCy's default Tagger?

-POLM

* Add tagging/make_doc and tests
2018-05-03 18:38:26 +02:00
G.Pruvost
cc8e804648 #2211 - Support for ssl certs config on download command (#2212)
* Add support for SSL/Certs customization on download CLI

* Add a note on SSL options for the 'download' CLI in the README

* Add contributor agreement
2018-05-03 18:37:02 +02:00
Jens Dahl Møllerhøj
b9290397fb rename SP to _SP (#2289) 2018-05-03 18:33:49 +02:00
Matthew Honnibal
a8e70a4187 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-05-03 14:02:10 +02:00
Matthew Honnibal
c0e596283b Set version to 2.1.0a0 2018-05-03 14:00:11 +02:00
Matthew Honnibal
8cd06cc763 Try to fix root-outside-sentence bug 2018-05-02 14:39:48 +00:00
Matthew Honnibal
acebd01033 Set children from heads in finalize doc 2018-05-02 14:19:22 +00:00
Matthew Honnibal
569440a6db Don't normalize gradient by batch size 2018-05-02 08:42:10 +02:00
Matthew Honnibal
281e29cbcd Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-05-02 01:36:23 +00:00
Matthew Honnibal
2338e8c7fc Update develop from master 2018-05-02 01:36:12 +00:00
Matthew Honnibal
9d147e12c4 Merge remote-tracking branch 'origin/master' into develop 2018-05-01 18:18:51 +02:00
Matthew Honnibal
6d0fe67b72 Constrain subtok label to adjacent tokens 2018-05-01 17:34:27 +02:00
Matthew Honnibal
8f21953fc5 Constrain subtok to adjacent words 2018-05-01 17:29:00 +02:00
Matthew Honnibal
b43bfd3524 Fix arc-eager oracle tests 2018-05-01 16:16:14 +02:00
Matthew Honnibal
31ed64e9b0 Fix textcat test 2018-05-01 15:18:39 +02:00
Matthew Honnibal
548bdff943 Update default Adam settings 2018-05-01 15:18:20 +02:00
Matthew Honnibal
adbb1f7533 Add better arc-eager oracle tests 2018-05-01 15:14:55 +02:00
Matthew Honnibal
697bcaa34f Add some methods to ArcEager that make testing easier 2018-05-01 15:13:14 +02:00
Mr Roboto
6f5ccda19c Addresses Issue #2228 - Deserialization fails when using tensor=False or sentiment=False (#2230)
* Fixes issue #2228

* Adds a new contributor
2018-05-01 13:40:22 +02:00
Matthew Honnibal
d44bb45c72 Fix scoring if tokenization changes 2018-05-01 01:33:20 +02:00
Matthew Honnibal
2b26c007cd Revert "Disable batch size compounding in ud-train"
This reverts commit 8a120fb455.
2018-04-29 14:09:02 +00:00
Matthew Honnibal
723b328062 Add script to run UD test 2018-04-29 15:50:25 +02:00
Matthew Honnibal
17af6aa3a4 Update ud_train script 2018-04-29 15:49:32 +02:00
Matthew Honnibal
5de8a36537 Fix arc_eager is_nonproj_tree 2018-04-29 15:49:11 +02:00
Matthew Honnibal
5260268f70 Fix textcat after merge 2018-04-29 15:48:53 +02:00
Matthew Honnibal
ad3d56c3ba Fix compile error in matcher 2018-04-29 15:48:34 +02:00
Matthew Honnibal
a8bc947fd4 Fix Token.set_extension 2018-04-29 15:48:19 +02:00
Matthew Honnibal
2c4a6d66fa Merge master into develop. Big merge, many conflicts -- need to review 2018-04-29 14:49:26 +02:00
ines
3c80f69ff5 Return data in cli.info and add silent option (resolves #2196) 2018-04-29 01:59:44 +02:00
ines
1c6d77610c Add remove_extension method on Doc, Token and Span (closes #2242) 2018-04-28 23:33:09 +02:00
ines
abdb853ebf Simplify underscore tests 2018-04-28 23:30:33 +02:00
ines
6fb6371670 Add collapse_phrases option to displacy (closes #2266) 2018-04-28 23:06:50 +02:00
Robin Linderborg
1f9904ef12 fixes #2238 (#2241)
* Remove erroneous lemma lookup år > åra in Swedish

* Add contributors agreement

* Add contrib agreement to correct directory

* Revert change to CONTRIBUTOR_AGREEMENT
2018-04-28 14:55:22 +02:00
Robin Linderborg
d01f503b54 Remove incorrect lemma lookup gäng->gänga (#2252)
* Remove incorrect lemma lookup gäng->gänga
In modern Swedish, "gäng" is mostly associated with "gang" or "group of people". The removed lemma lookup lemmatized it to the verb "thread".

* Add contrib agreement to correct directory

* Revert change to CONTRIBUTOR_AGREEMENT
2018-04-28 14:54:41 +02:00
Suraj Krishnan Rajan
69d041148f Implement Fast-Text vectors with subword features 2018-04-21 01:34:14 +05:30
ines
686225eadd Fix Spanish noun_chunks (resolves #2210)
Make sure 'NP' label is added to StringStore and move noun_bounds helper into a closure to allow reusing label sets
2018-04-18 18:44:01 -04:00
ines
9632595fb4 Use correct, non-deprecated merge syntax (resolves #2226) 2018-04-18 18:28:28 -04:00
Suraj Rajan
5957f15227 Fixed typos for #2222,#2223 (#2233) (closes #2222, closes #2223) 2018-04-18 14:55:26 -07:00
Matthew Honnibal
97851d2c4e Increment version to v2.0.12.dev0 2018-04-10 22:20:16 +02:00
Matthew Honnibal
ed39c75a92 Merge branch 'master' of https://github.com/explosion/spaCy 2018-04-10 22:19:40 +02:00
Matthew Honnibal
3836199a83 Fix loading of models when custom vectors are added 2018-04-10 22:19:20 +02:00
ines
0299d5fac8 Update argument annotations and formatting 2018-04-10 21:45:11 +02:00
ines
49b1e48bf5 Fix syntax error 2018-04-10 21:44:59 +02:00
ines
70052e46e9 Fix formatting [ci skip] 2018-04-10 21:42:46 +02:00
Matthew Honnibal
0ddb152be0 Improve error message when reading vectors 2018-04-10 21:26:50 +02:00
Matthew Honnibal
db50ac524e Support zipped vector files in init-model 2018-04-10 21:21:00 +02:00
ines
270fcfd925 Fix typo in package command message (closes #2200) 2018-04-10 19:14:31 +02:00
ines
24d8bf348d Revert "Add support for .zip to init_model"
This reverts commit 7ee880a0ad.
2018-04-10 19:08:06 +02:00
Matthew Honnibal
7ee880a0ad Add support for .zip to init_model 2018-04-10 14:30:04 +00:00
ines
5ecb274764 Fix indentation error and set Doc.is_tagged correctly 2018-04-10 16:14:52 +02:00
ines
987ee27af7 Return Doc in noun chunks merger component if Doc is not parsed 2018-04-09 14:51:02 +02:00
Xiaoquan Kong
e2f13ec722 bugfix: Doc.noun_chunks call Doc.noun_chunks_iterator without checking (closes #2194) 2018-04-08 23:44:05 +02:00
Jens Dahl Møllerhøj
e5055e3cf6 Add Danish lemmatizer (#2184)
* add danish lemmatizer

* fill contributor agreement
2018-04-07 19:07:28 +02:00
ines
bccbf538ef Revert "Check if spaCy has compiled correctly and show error message"
This reverts commit 3463ded7cf.
2018-04-06 15:49:44 +02:00
ines
fb4eda6616 Merge branch 'master' of https://github.com/explosion/spaCy 2018-04-06 00:38:48 +02:00
Matthew Honnibal
0c7fab4443 Set version to 2.0.11 2018-04-04 11:19:11 +02:00
Matthew Honnibal
a350be0601 Fix vector-name loading fix 2018-04-04 01:31:25 +02:00
Matthew Honnibal
21047bde52 Fix syntax error in italian lemmatizer 2018-04-03 23:13:22 +02:00
Matthew Honnibal
81f4005f3d Fix loading models with pretrained vectors 2018-04-03 23:11:48 +02:00
ines
3463ded7cf Check if spaCy has compiled correctly and show error message 2018-04-03 22:18:47 +02:00
Matthew Honnibal
96b612873b Add hyper-parameter to control whether parser makes a beam update 2018-04-03 22:02:56 +02:00
ines
e5f47cd82d Update errors 2018-04-03 21:40:29 +02:00
Matthew Honnibal
f7e6313b43 Increment version to v2.0.11.dev0 2018-04-03 20:58:47 +02:00
ines
10462816bc Fix tests for Python 2 2018-04-03 18:51:31 +02:00
ines
62b4b527d7 Don't raise error if set_extension has getter and setter (closes #2177)
Improve error messages, raise error if setter is specified without a getter and compare against _unset to allow default=None. Also add more tests.
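
The sentinel comparison mentioned above can be sketched as follows. This is a minimal illustration, not spaCy's code: `validate_extension` and its return value are hypothetical, and only the two rules stated in the message (a setter needs a getter; `default=None` is a legitimate default) are modelled.

```python
# Sentinel object: lets us tell "no default was passed" apart from default=None.
_unset = object()


def validate_extension(default=_unset, getter=None, setter=None):
    """Hypothetical validator for extension arguments (illustration only)."""
    # A setter on its own is an error; a getter plus a setter is allowed.
    if setter is not None and getter is None:
        raise ValueError("A setter was specified without a getter.")
    return {
        # Identity check against the sentinel, so None counts as a real default.
        "has_default": default is not _unset,
        "has_getter": getter is not None,
        "has_setter": setter is not None,
    }


print(validate_extension(default=None))
print(validate_extension(getter=len, setter=lambda obj, value: None))
```

Comparing with `is not _unset` rather than `is not None` is what allows a user to register an attribute whose default value is `None`.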
2018-04-03 18:30:17 +02:00
ines
ee3082ad29 Fix whitespace 2018-04-03 18:29:53 +02:00
Ines Montani
3141e04822
💫 New system for error messages and warnings (#2163)
* Add spacy.errors module

* Update deprecation and user warnings

* Replace errors and asserts with new error message system

* Remove redundant asserts

* Fix whitespace

* Add messages for print/util.prints statements

* Fix typo

* Fix typos

* Move CLI messages to spacy.cli._messages

* Add decorator to display error code with message

An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc.

* Remove unused link in spacy.about

* Update errors for invalid pipeline components

* Improve error for unknown factories

* Add displaCy warnings

* Update formatting consistency

* Move error message to spacy.errors

* Update errors and check if doc returned by component is None
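
The decorator idea from the list above (prefixing the error code only when the string is retrieved from its containing class) might look roughly like this. It is a sketch under assumptions: the decorator name, the class name, the codes and the message texts are all illustrative and not taken from the patch.

```python
def add_codes(err_cls):
    """Wrap a class of message strings so every lookup is prefixed with its code."""

    class ErrorsWithCodes(object):
        def __getattribute__(self, code):
            # Fetch the untouched message from the wrapped class and
            # prepend the attribute name as the error code.
            msg = getattr(err_cls, code)
            return "[{code}] {msg}".format(code=code, msg=msg)

    return ErrorsWithCodes()


@add_codes
class Errors(object):
    # Hypothetical messages, for illustration only.
    E001 = "No component '{name}' found in pipeline."
    E002 = "Can't find factory for '{name}'."


print(Errors.E001.format(name="ner"))
```

Because the stored strings are never modified, the code prefix is added at lookup time and nothing has to touch tracebacks or exception classes.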
2018-04-03 15:50:31 +02:00
Matthew Honnibal
abf8b16d71
Add doc.retokenize() context manager (#2172)
This patch takes a step towards #1487 by introducing the
doc.retokenize() context manager, to handle merging spans, and soon
splitting tokens.

The idea is to do merging and splitting like this:

with doc.retokenize() as retokenizer:
    for start, end, label in matches:
        retokenizer.merge(doc[start : end], attrs={'ent_type': label})

The retokenizer accumulates the merge requests, and applies them
together at the end of the block. This will allow retokenization to be
more efficient, and much less error prone.

A retokenizer.split() function will then be added, to handle splitting a
single token into multiple tokens. These methods take `Span` and `Token`
objects; if the user wants to go directly from offsets, they can append
to the .merges and .splits lists on the retokenizer.

The doc.merge() method's behaviour remains unchanged, so this patch
should be 100% backwards compatible (modulo bugs). Internally,
doc.merge() fixes up the arguments (to handle the various deprecated styles),
opens the retokenizer, and makes the single merge.

We can later start making deprecation warnings on direct calls to doc.merge(),
to migrate people to use of the retokenize context manager.
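
The accumulate-then-apply behaviour described above can be illustrated with a toy model. `ToyDoc` and `ToyRetokenizer` are hypothetical stand-ins that work on plain word lists and integer offsets; they are not spaCy's classes and ignore attributes such as `ent_type`.

```python
from contextlib import contextmanager


class ToyRetokenizer(object):
    """Collects merge requests; nothing is applied until the block exits."""

    def __init__(self):
        self.merges = []

    def merge(self, start, end):
        self.merges.append((start, end))


class ToyDoc(object):
    def __init__(self, words):
        self.words = list(words)

    @contextmanager
    def retokenize(self):
        retokenizer = ToyRetokenizer()
        yield retokenizer
        # Apply all merges together, right to left, so that the offsets
        # of the merges still to be applied remain valid.
        for start, end in sorted(retokenizer.merges, reverse=True):
            self.words[start:end] = [" ".join(self.words[start:end])]


doc = ToyDoc(["I", "like", "New", "York", "and", "San", "Francisco"])
with doc.retokenize() as retokenizer:
    retokenizer.merge(2, 4)
    retokenizer.merge(5, 7)
print(doc.words)
```

Deferring the work to the end of the block means every request can be expressed against the original offsets, which is the error-proneness the message refers to.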
2018-04-03 14:10:35 +02:00
Matthew Honnibal
8a120fb455 Disable batch size compounding in ud-train 2018-04-01 08:45:00 +00:00
Matthew Honnibal
98165e43a7 Sometimes update beam with greedy oracle 2018-04-01 08:44:35 +00:00
Suraj Rajan
1cdbb7c97c [2032] - Changed python set to cpp stl set (#2170)
Changed python set to cpp stl set #2032 

## Description

Changed the Python set to a C++ STL set. The STL set is ordered, so its methods run in logarithmic time and the minimum element can be found in constant time, whereas finding the minimum of a Python set takes linear time. Operations such as find, count, insert and delete run in either constant or logarithmic time, which makes the C++ set a better option for managing vectors.
Reference : http://www.cplusplus.com/reference/set/set/

### Types of change
Enhancement for `Vectors`, for faster initialisation of word vectors (fastText)
2018-03-31 13:28:25 +02:00
Matthew Honnibal
f3b7c5e537 Fix syntax error 2018-03-29 21:50:32 +02:00
Matthew Honnibal
23afa6429f Add input length error, to address #1826 2018-03-29 21:45:26 +02:00
Ines Montani
a609a1ca29
Merge pull request #2152 from explosion/feature/tidy-up-dependencies
💫 Tidy up dependencies
2018-03-29 14:35:09 +02:00
Viet Trung Tran
ea2af94cd9 Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer (#2155)
* support for Vietnamese

* Contributor Agreement for adding Vietnamese support on spaCy
2018-03-29 12:19:51 +02:00
ines
e6979bdbbd Merge branch 'feature/tidy-up-dependencies' of https://github.com/explosion/spaCy into feature/tidy-up-dependencies 2018-03-29 00:19:37 +02:00
ines
83146458a2 Fix urllib for Python 3 2018-03-29 00:19:33 +02:00
Matthew Honnibal
8308bbc617 Get msgpack and msgpack_numpy via Thinc, to avoid potential version conflicts 2018-03-29 00:14:55 +02:00
Matthew Honnibal
b5098079d8 Fix error on urllib 2018-03-29 00:08:16 +02:00
Ines Montani
0de599b16b
Merge pull request #2159 from explosion/feature/fix-merged-entity-iob (resolves #1554, resolves #1752)
💫 Fix token.ent_iob after doc.merge(), and ensure consistency in doc.ents
2018-03-28 23:10:00 +02:00
Ines Montani
98e9cda677
Merge pull request #2158 from explosion/feature/fix-multiple-vectors (resolves #1660)
💫 Fix loading of multiple vector models
2018-03-28 23:08:24 +02:00
Matthew Honnibal
a7c5ae2beb Avoid forcing a name on empty vectors, and remove print statement 2018-03-28 21:08:58 +02:00
ines
3eb67bbe4b Allow entity types with dashes (resolves #1967) 2018-03-28 20:51:26 +02:00
Matthew Honnibal
cf5fcf0546 Update serialization test 2018-03-28 20:12:53 +02:00
Matthew Honnibal
4555e3e251 Don't assume pretrained_vectors cfg set in build_tagger 2018-03-28 20:12:45 +02:00
Matthew Honnibal
0b375d50c8 Fix ent_iob tags in doc.merge to avoid inconsistent sequences 2018-03-28 18:39:03 +02:00
Matthew Honnibal
95fa89c4b8 Update doc.ents test 2018-03-28 18:39:03 +02:00
Matthew Honnibal
e807f88410 Resolve merge when cherry-picking ent iob patches from develop 2018-03-28 18:38:13 +02:00
Matthew Honnibal
99fbc7db33 Improve error message when entity sequence is inconsistent 2018-03-28 18:36:53 +02:00
Matthew Honnibal
cbd2794be0 Add test for ent_iob during span merge 2018-03-28 18:36:53 +02:00
Matthew Honnibal
f8dd905a24 Warn and fallback if vectors have no name 2018-03-28 18:24:53 +02:00
Matthew Honnibal
fd9e259414 Add test for #1660 2018-03-28 18:22:51 +02:00
Matthew Honnibal
bc4afa9881 Remove print statement 2018-03-28 17:48:37 +02:00
Matthew Honnibal
79dc241caa Set pretrained_vectors in parser cfg 2018-03-28 17:35:07 +02:00
Matthew Honnibal
17c3e7efa2 Add message noting vectors 2018-03-28 16:33:43 +02:00
Matthew Honnibal
9bf6e93b3e Set pretrained_vectors in begin_training 2018-03-28 16:32:41 +02:00
Matthew Honnibal
95a9615221 Fix loading of multiple pre-trained vectors
This patch addresses #1660, which was caused by keying all pre-trained
vectors with the same ID when telling Thinc how to refer to them. This
meant that if multiple models were loaded that had pre-trained vectors,
errors or incorrect behaviour resulted.

The vectors class now includes a .name attribute, which defaults to:
{nlp.meta['lang']}_{nlp.meta['name']}.vectors
The vectors name is set in the cfg of the pipeline components under the
key pretrained_vectors. This replaces the previous cfg key
pretrained_dims.

In order to make existing models compatible with this change, we check
for the pretrained_dims key when loading models in from_disk and
from_bytes, and add the cfg key pretrained_vectors if we find it.
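
A rough sketch of the naming and the compatibility step described above. The helper names `vectors_name` and `upgrade_cfg` are hypothetical, and the dictionary stands in for whatever Thinc uses to look vectors up; only the name format and the two cfg keys come from the message.

```python
def vectors_name(meta):
    """Build the default vectors name from the model's meta."""
    return "{}_{}.vectors".format(meta["lang"], meta["name"])


def upgrade_cfg(cfg, meta):
    """Hypothetical load-time step: add the new key if only the old one exists."""
    cfg = dict(cfg)
    if "pretrained_dims" in cfg and "pretrained_vectors" not in cfg:
        cfg["pretrained_vectors"] = vectors_name(meta)
    return cfg


# Two models loaded side by side now get distinct keys instead of a shared ID.
registry = {}
for meta in ({"lang": "en", "name": "model_a"}, {"lang": "de", "name": "model_b"}):
    registry[vectors_name(meta)] = "vectors table for {}".format(meta["name"])

print(sorted(registry))
print(upgrade_cfg({"pretrained_dims": 300}, {"lang": "en", "name": "model_a"}))
```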
2018-03-28 16:02:59 +02:00
ines
7fbc9e5874 Replace requests with urllib 2018-03-28 12:46:07 +02:00
ines
da1f200362 Add compat helpers for urllib 2018-03-28 12:45:53 +02:00
ines
ac88c72c9a Fix ftfy workaround and remove old import 2018-03-28 12:14:28 +02:00
ines
ce6071ca89 Remove ftfy dependency and update docs 2018-03-28 12:09:42 +02:00
Matthew Honnibal
070b6c6495 Remove dependency on ftfy 2018-03-28 12:07:02 +02:00
ines
6d2c85f428 Drop six and related hacks as a dependency 2018-03-28 10:45:25 +02:00
ines
9e83513004 Add position of invalid token to error message 2018-03-27 23:56:59 +02:00
ines
11c4735ccf Fix issue in Italian lemmatizer data (resolves #2050) 2018-03-27 23:55:22 +02:00
Matthew Honnibal
6a961928b2 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-03-27 21:01:48 +00:00
Matthew Honnibal
b7136cb094 Support zipped vector files in init-model 2018-03-27 21:01:18 +00:00
ines
693971dd8f Improve error message if token text is empty string (see #2101) 2018-03-27 22:25:40 +02:00
ines
0c829e6605 Fix whitespace 2018-03-27 22:20:59 +02:00
Matthew Honnibal
de9fd091ac Fix #2014: token.pos_ not writeable 2018-03-27 21:21:11 +02:00
Matthew Honnibal
18da89e04c Handle non-callable gold_tuples in parser begin_training 2018-03-27 21:08:41 +02:00
Matthew Honnibal
1f7229f40f Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit c9ba3d3c2d, reversing
changes made to 92c26a35d4.
2018-03-27 19:23:02 +02:00
Matthew Honnibal
8b7a74570f Revert "Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop""
This reverts commit f41e626844.
2018-03-27 19:22:52 +02:00
Matthew Honnibal
f41e626844 Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit c9ba3d3c2d, reversing
changes made to f57bfbccdc.
2018-03-27 19:22:25 +02:00
Matthew Honnibal
c9ba3d3c2d Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-03-27 18:59:08 +02:00
Matthew Honnibal
92c26a35d4 Update get_cuda_stream 2018-03-27 16:42:00 +00:00
Matthew Honnibal
f57bfbccdc Fix non-projective label filtering 2018-03-27 13:41:33 +02:00
Matthew Honnibal
d2118792e7 Merge changes from master 2018-03-27 13:38:41 +02:00
Matthew Honnibal
d4680e4d83 Merge branch 'master' of https://github.com/explosion/spaCy 2018-03-27 13:36:37 +02:00
Matthew Honnibal
63a267b34d Fix #2073: Token.set_extension not working 2018-03-27 13:36:20 +02:00
Matthew Honnibal
25280b7013 Try to make sum_state_features faster 2018-03-27 10:08:38 +00:00
Matthew Honnibal
987e1533a4 Use 8 features in parser 2018-03-27 10:08:12 +00:00
Matthew Honnibal
8bbd26579c Support GPU in UD training script 2018-03-27 09:53:35 +00:00
Matthew Honnibal
dd54511c4f Pass data as a function in begin_training methods 2018-03-27 09:39:59 +00:00
Matthew Honnibal
d9ebd78e11 Change default sizes in parser 2018-03-26 17:22:18 +02:00
Matthew Honnibal
a3d0cb15d3 Fix ent_iob tags in doc.merge to avoid inconsistent sequences 2018-03-26 07:16:06 +02:00
Matthew Honnibal
7d4687162f Update doc.ents test 2018-03-26 07:14:35 +02:00
Matthew Honnibal
514d89a3ae Set missing label for non-specified entities when setting doc.ents 2018-03-26 07:14:16 +02:00
Matthew Honnibal
54d7a1c916 Improve error message when entity sequence is inconsistent 2018-03-26 07:13:34 +02:00
Matthew Honnibal
938436455a Add test for ent_iob during span merge 2018-03-25 22:16:19 +02:00
Matthew Honnibal
8e08c378fe Fix entity IOB and tag in span merging 2018-03-25 22:16:01 +02:00
Matthew Honnibal
5430c43298 Set about to spacy-nightly 2018-03-25 19:30:14 +02:00
Ines Montani
68226109f4
Merge pull request #2142 from jimregan/polish-more-tokens
more exceptions
2018-03-24 19:06:44 +01:00
Matthew Honnibal
d566e673bf Set version to v2.0.10 2018-03-24 18:09:03 +01:00
Matthew Honnibal
0d3bf0d4eb Merge branch 'master' of https://github.com/explosion/spaCy 2018-03-24 17:31:49 +01:00
dejanmarich
ccd1c04c63 Update stop_words.py
Added more words
2018-03-24 17:31:24 +01:00
ines
f1446b0257 Port over Turkish changes 2018-03-24 17:31:07 +01:00
DuyguA
cd604878a4 quick typo fix 2018-03-24 17:26:35 +01:00
Matthew Honnibal
406548b976 Support .gz and .tar.gz files in spacy init-model 2018-03-24 17:18:32 +01:00
Jim O'Regan
efe037e8be more exceptions 2018-03-24 00:05:27 +00:00
Ines Montani
719037cf20
Update formatting and add missing commas 2018-03-23 22:18:20 +01:00
Otto Sulin
266efc2018 Added Finnish examples 2018-03-23 22:58:52 +02:00
Otto Sulin
1940e54602 Added Finnish numbers 2018-03-23 22:33:08 +02:00
Otto Sulin
4ec3f19e2b fixed stop words -> to-do lex_attrs.py 2018-03-23 22:18:17 +02:00
Matthew Honnibal
85717f570c Merge branch 'master' of https://github.com/explosion/spaCy 2018-03-23 20:30:42 +01:00
Matthew Honnibal
8902754f0b Fix vector loading for ud_train 2018-03-23 20:30:00 +01:00
Xiaoquan Kong
a71b99d7ff bugfix for global-variable-change-in-runtime related issue (#2135)
* Bugfix: setting pollution from spacy/cli/ud_train.py to whole package

* Add contributor agreement of howl-anderson
2018-03-23 11:36:38 +01:00
Matthew Honnibal
044397e269 Support .gz and .tar.gz files in spacy init-model 2018-03-21 14:33:23 +01:00
Matthew Honnibal
49fbe2dfee Use thinc.openblas in spacy.syntax.nn_parser 2018-03-20 02:22:09 +01:00
DuyguA
f708d7443b added contractions to stopwords #2020 2018-03-19 14:06:39 +01:00
Matthew Honnibal
bede11b67c
Improve label management in parser and NER (#2108)
This patch does a few smallish things that tighten up the training workflow a little, and allow memory use during training to be reduced by letting the GoldCorpus stream data properly.

Previously, the parser and entity recognizer read and saved labels as lists, with extra labels noted separately. Lists were used because ordering is very important, to ensure that the label-to-class mapping is stable.

We now manage labels as nested dictionaries, first keyed by the action, and then keyed by the label. Values are frequencies. The trick is, how do we save new labels? We need to make sure we iterate over these in the same order they're added. Otherwise, we'll get different class IDs, and the model's predictions won't make sense.

To allow stable sorting, we map the new labels to negative values. If we have two new labels, they'll be noted as having "frequency" -1 and -2. The next new label will then have "frequency" -3. When we sort by (frequency, label), we then get a stable sort.
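
The negative-frequency trick can be sketched for the labels of a single action. Two things are assumptions here: the helper names, and the choice to order classes by descending frequency (the message only says the sort key is frequency and label).

```python
def add_label(label_freqs, label):
    """Record a new label with the next unused negative 'frequency'."""
    if label in label_freqs:
        return
    lowest = min(label_freqs.values(), default=0)
    # The first new label gets -1, the next -2, and so on.
    label_freqs[label] = min(0, lowest) - 1


def class_order(label_freqs):
    """Sort by (frequency, label), most frequent first (assumed direction)."""
    items = sorted(label_freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in items]


labels = {"nsubj": 120, "dobj": 80}  # frequencies counted from training data
add_label(labels, "zzz")  # stored as -1
add_label(labels, "aaa")  # stored as -2
before = class_order(labels)
add_label(labels, "mmm")  # stored as -3
after = class_order(labels)
print(before)
print(after)
```

New labels always sort after the counted ones and in the order they were added, regardless of their names, so adding another label never shifts an existing class ID.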

Storing frequencies then allows us to make the next nice improvement. Previously we had to iterate over the whole training set, to pre-process it for the deprojectivisation. This led to storing the whole training set in memory. This was most of the required memory during training.

To prevent this, we now store the frequencies as we stream in the data, and deprojectivize as we go. Once we've built the frequencies, we can then apply a frequency cut-off when we decide how many classes to make.

Finally, to allow proper data streaming, we also have to have some way of shuffling the iterator. This is awkward if the training files have multiple documents in them. To solve this, the GoldCorpus class now writes the training data to disk in msgpack files, one per document. We can then shuffle the data by shuffling the paths.
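
The shuffle-by-path idea can be sketched with the standard library. Note the swap: this example writes JSON files rather than msgpack, so that it runs without extra dependencies, and the function names and document layout are hypothetical.

```python
import json
import random
import tempfile
from pathlib import Path


def write_docs(docs, directory):
    """Write one file per document and return the list of paths."""
    paths = []
    for i, doc in enumerate(docs):
        path = Path(directory) / "{}.json".format(i)
        path.write_text(json.dumps(doc), encoding="utf8")
        paths.append(path)
    return paths


def stream_shuffled(paths, rng):
    """Shuffle the cheap list of paths, then load documents one at a time."""
    paths = list(paths)
    rng.shuffle(paths)
    for path in paths:
        yield json.loads(path.read_text(encoding="utf8"))


tmp_dir = tempfile.mkdtemp()
paths = write_docs([{"id": i, "words": ["w"] * i} for i in range(5)], tmp_dir)
epoch = list(stream_shuffled(paths, random.Random(0)))
print([doc["id"] for doc in epoch])
```

Only the path list is held in memory while shuffling; each document is read from disk when the iterator reaches it.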

This is a squash merge, as I made a lot of very small commits. Individual commit messages below.

* Simplify label management for TransitionSystem and its subclasses

* Fix serialization for new label handling format in parser

* Simplify and improve GoldCorpus class. Reduce memory use, write to temp dir

* Set actions in transition system

* Require thinc 6.11.1.dev4

* Fix error in parser init

* Add unicode declaration

* Fix unicode declaration

* Update textcat test

* Try to get model training on less memory

* Print json loc for now

* Try rapidjson to reduce memory use

* Remove rapidjson requirement

* Try rapidjson for reduced mem usage

* Handle None heads when projectivising

* Stream json docs

* Fix train script

* Handle projectivity in GoldParse

* Fix projectivity handling

* Add minibatch_by_words util from ud_train

* Minibatch by number of words in spacy.cli.train

* Move minibatch_by_words util to spacy.util

* Fix label handling

* More hacking at label management in parser

* Fix encoding in msgpack serialization in GoldParse

* Adjust batch sizes in parser training

* Fix minibatch_by_words

* Add merge_subtokens function to pipeline.pyx

* Register merge_subtokens factory

* Restore use of msgpack tmp directory

* Use minibatch-by-words in train

* Handle retokenization in scorer

* Change back-off approach for missing labels. Use 'dep' label

* Update NER for new label management

* Set NER tags for over-segmented words

* Fix label alignment in gold

* Fix label back-off for infrequent labels

* Fix int type in labels dict key

* Fix int type in labels dict key

* Update feature definition for 8 feature set

* Update ud-train script for new label stuff

* Fix json streamer

* Print the line number if conll eval fails

* Update children and sentence boundaries after deprojectivisation

* Export set_children_from_heads from doc.pxd

* Render parses during UD training

* Remove print statement

* Require thinc 6.11.1.dev6. Try adding wheel as install_requires

* Set different dev version, to flush pip cache

* Update thinc version

* Update GoldCorpus docs

* Remove print statements

* Fix formatting and links [ci skip]
2018-03-19 02:58:08 +01:00
Matthew Honnibal
ff42b726c1 Fix unicode declaration on test 2018-03-19 02:04:24 +01:00
Matthew Honnibal
7dc76c6ff6 Add test for textcat 2018-03-16 12:39:45 +01:00
Matthew Honnibal
3cdee79a0c Add depth argument for text classifier 2018-03-16 12:37:31 +01:00
Matthew Honnibal
13067095a1 Disable broken add-after-train in textcat 2018-03-16 12:33:33 +01:00
Matthew Honnibal
565ef8c4d8 Improve argument passing in textcat 2018-03-16 12:30:51 +01:00
Matthew Honnibal
eb2a3c5971 Remove unused function 2018-03-16 12:30:33 +01:00
Matthew Honnibal
307d6bf6d3 Fix parser for Thinc 6.11 2018-03-16 10:59:31 +01:00
Matthew Honnibal
9a389c4490 Fix parser for Thinc 6.11 2018-03-16 10:38:13 +01:00
Matthew Honnibal
648532d647 Don't assume blas methods are present 2018-03-16 02:48:20 +01:00
Matthew Honnibal
e85dd038fe Merge remote-tracking branch 'origin/master' into feature/single-thread 2018-03-16 02:41:11 +01:00
Matthew Honnibal
e3be3d65b3 Version as v2.0.10.dev0 2018-03-15 17:31:22 +01:00
ines
f3f8bfc367 Add built-in factories for merge_entities and merge_noun_chunks
Allows adding those components to the pipeline out-of-the-box if they're defined in a model's meta.json. Also allows usage as nlp.add_pipe(nlp.create_pipe('merge_entities')).
2018-03-15 17:16:54 +01:00
Ines Montani
0d17377e8b
Merge pull request #2095 from DuyguA/quick-typo-fix (resolves #2063)
Quick typo fix
2018-03-15 00:29:56 +01:00
ines
d854f69fe3 Add built-in factories for merge_entities and merge_noun_chunks
Allows adding those components to the pipeline out-of-the-box if they're defined in a model's meta.json. Also allows usage as nlp.add_pipe(nlp.create_pipe('merge_entities')).
2018-03-15 00:18:51 +01:00
ines
9ad5df41fe Fix whitespace 2018-03-15 00:11:18 +01:00
Matthew Honnibal
d7ce6527fb Use increasing batch sizes in ud-train 2018-03-14 20:15:28 +01:00
alldefector
f4e5904fc2 Fix Spanish noun_chunks failure caused by typo 2018-03-14 17:03:17 +01:00
Thomas Opsomer
fbf48b3f9f lemma property to return hash instead of unicode 2018-03-14 17:03:00 +01:00
Matthew Honnibal
8cefc58abc Fix Vectors pickling 2018-03-14 16:59:37 +01:00
DuyguA
be4f6da16b maybe not a good idea to remove also 2018-03-14 14:47:24 +01:00
DuyguA
1a513f71e3 removed also from lookup 2018-03-14 11:57:15 +01:00
DuyguA
cca66abf1e quick typo fix 2018-03-14 11:34:22 +01:00
Matthew Honnibal
7b755414eb Update call into thinc 2018-03-13 13:59:59 +01:00
Matthew Honnibal
e101f10ef0 Fix header 2018-03-13 02:12:16 +01:00
Matthew Honnibal
952c87409e Use openblas.sgemm in parser 2018-03-13 02:12:01 +01:00
Matthew Honnibal
d55620041b Switch parser to gemm from thinc.openblas 2018-03-13 02:10:58 +01:00
Matthew Honnibal
c2f4759257
Fix test for Python 2 2018-03-12 23:03:05 +01:00
Matthew Honnibal
9aeec9c242 Increment dev version 2018-03-11 01:58:21 +01:00
Matthew Honnibal
f49d71fa7c Merge branch 'master' of https://github.com/explosion/spaCy 2018-03-11 01:27:17 +01:00
Matthew Honnibal
5dddb30e5b Fix ud-train script 2018-03-11 01:26:45 +01:00
Matthew Honnibal
e42960bd14
Merge pull request #2012 from alldefector/patch-1
Fix Spanish noun_chunks failure caused by typo
2018-03-11 01:05:19 +01:00
Matthew Honnibal
2cab4d6517 Remove use of attr module in ud_train 2018-03-11 00:59:39 +01:00
Matthew Honnibal
fa9fd21620 Increment dev version 2018-03-11 00:41:54 +01:00
Matthew Honnibal
53b3249e06 Add tests for arc eager oracle 2018-03-10 23:42:56 +01:00
Matthew Honnibal
754ea1b2f7 Link in spaCy CoNLL commands 2018-03-10 23:42:15 +01:00
Matthew Honnibal
3478ea76d1 Add ud_train and ud_evaluate CLI commands 2018-03-10 23:41:55 +01:00
Matthew Honnibal
4b72c38556 Fix dropout bug in beam parser 2018-03-10 23:16:40 +01:00
Matthew Honnibal
9cc202d670 Fix Vectors pickling 2018-03-10 22:53:42 +01:00
Matthew Honnibal
3d6487c734 Support dropout in beam parse 2018-03-10 22:41:55 +01:00
Matthew Honnibal
31b156d60b Fix itershuffle 2018-03-10 22:32:59 +01:00
Matthew Honnibal
b59765ca9f Stream gold during spacy train 2018-03-10 22:32:45 +01:00
Matthew Honnibal
c3d168509a Stream the gold data during training, to reduce memory 2018-03-10 22:32:32 +01:00
DuyguA
cba63196f9 fixed typo 2018-03-09 10:54:18 +01:00
DuyguA
7a780476af added more abbreviations 2018-03-09 10:13:00 +01:00
DuyguA
cca87756d7 added Sti 2018-03-08 18:07:52 +01:00
DuyguA
3c994311c5 added abbrevs 2018-03-08 18:03:27 +01:00
DuyguA
56d6fb180e added like_num to lex 2018-03-08 15:25:25 +01:00
DuyguA
26ee0590a3 added some commonly used cases 2018-03-08 12:43:58 +01:00
DuyguA
ae6473e4d5 removed some words with negation particle. 2018-03-08 12:20:32 +01:00
DuyguA
6ed59a2198 removed number words to be carried to the lexical 2018-03-08 12:19:23 +01:00
DuyguA
04784a44a6 made alphabetical order for Turkish characters 2018-03-08 12:11:32 +01:00
DuyguA
af33e022a5 added example sentences for Turkish 2018-03-08 12:06:03 +01:00
Matthew Honnibal
a1be01185c Fix array out of bounds error in Span 2018-02-28 12:27:09 +01:00
Thomas Opsomer
8df9e52829 lemma property to return hash instead of unicode 2018-02-27 19:50:01 +01:00
Ines Montani
35634352fe
Merge pull request #2025 from dejanmarich/patch-1
Update stop_words.py for Croatian language
2018-02-26 18:22:32 +01:00
Matthew Honnibal
14f729c72a Add subtok label to parser 2018-02-26 12:26:35 +01:00
Matthew Honnibal
7137ad8b0b Make label filtering clearer for projectivisation 2018-02-26 12:02:01 +01:00
Matthew Honnibal
b8d52cb285 Fix inconsistent label freq cutoff for projectivisation 2018-02-26 12:01:44 +01:00
Matthew Honnibal
7b66ec896a Revert "Revert "Improve parser oracle around sentence breaks.""
This reverts commit 36e481c584.
2018-02-26 10:57:37 +01:00
Matthew Honnibal
36e481c584 Revert "Improve parser oracle around sentence breaks."
This reverts commit 50817dc9ad.
2018-02-26 10:53:55 +01:00
Matthew Honnibal
5faae803c6 Add option to not use Janome for Japanese tokenization 2018-02-26 09:39:46 +01:00
Matthew Honnibal
9b406181cd Add Chinese.Defaults.use_jieba setting, for UD 2018-02-25 15:12:38 +01:00
Matthew Honnibal
9ccd0c643b Add Vietnamese 2018-02-25 15:00:46 +01:00
Matthew Honnibal
d4fdb97c87 Fix alignment for words with spaces 2018-02-25 14:55:00 +01:00
Matthew Honnibal
6d2c1ef52c Fix SP tag in generic tag map 2018-02-24 16:04:56 +01:00
Matthew Honnibal
5cc3bd1c1d Update alignment tests 2018-02-24 16:03:58 +01:00
Matthew Honnibal
6138439469 Fix many-to-one alignment 2018-02-24 16:03:50 +01:00
Matthew Honnibal
4890ee1732 Fix scoring of tokenization for punct 2018-02-24 10:32:32 +01:00
Matthew Honnibal
12b39f87da Move cython declarations in matcher.pyx 2018-02-24 10:32:18 +01:00
Matthew Honnibal
01d1b7abdf Support many-to-one alignment in GoldParse 2018-02-24 10:17:01 +01:00
Matthew Honnibal
7865746574 Support many-to-one alignment 2018-02-24 02:09:53 +01:00
Matthew Honnibal
458710b831 Poke matcher test for appveyor 2018-02-23 23:53:48 +01:00
Matthew Honnibal
968dabdde4 Fix bug in multi-task objective 2018-02-23 23:48:09 +01:00
Matthew Honnibal
2c9c8b8d72 Try commenting out emoji test in matcher 2018-02-23 23:34:35 +01:00
Matthew Honnibal
980ad68cbe Try to find test that fails on appveyor 2018-02-23 21:27:53 +01:00
Matthew Honnibal
39de8cd4d3 Try to find test failing on appveyor 2018-02-23 20:59:21 +01:00
Matthew Honnibal
4492a33a9d Fix sent_start multi-task objective when alignment fails 2018-02-23 16:50:59 +01:00
Matthew Honnibal
5fa44e93f1 Set unicode_literals in matcher 2018-02-23 16:48:54 +01:00
Matthew Honnibal
12264f9296 Add multi-task objective for sentence segmentation 2018-02-23 16:25:57 +01:00
Matthew Honnibal
e7deadb519 Set version to 2.1.0.dev1 2018-02-23 16:22:24 +01:00
Matthew Honnibal
7b575a119e Try to reduce memory usage of test_matcher 2018-02-23 15:34:37 +01:00
Matthew Honnibal
24563f4026 Fix data typing in align 2018-02-23 15:08:06 +01:00
Matthew Honnibal
7a5ba20692 Fix integer typing in _align 2018-02-23 14:51:24 +01:00
Matthew Honnibal
875411b875 Set unicode types in _align.pyx and test 2018-02-23 14:35:38 +01:00
Matthew Honnibal
51d9679aa3 Fix broken span.as_doc test 2018-02-23 14:22:24 +01:00
dejanmarich
71c261d58b
Update stop_words.py
Added more words
2018-02-23 10:31:01 +01:00
Matthew Honnibal
3e6c1111b7 Remove obsolete test 2018-02-23 03:22:07 +01:00
Matthew Honnibal
a4fdec524a Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-gold 2018-02-22 21:44:28 +01:00
Matthew Honnibal
50817dc9ad Improve parser oracle around sentence breaks. 2018-02-22 19:22:26 +01:00
Matthew Honnibal
307aefe131 Increment version to v2.0.9 2018-02-22 17:07:53 +01:00
Feng Niu
1c60384bed return on empty doc 2018-02-21 15:39:04 -08:00
Feng Niu
7eb1cd100b unbound doc var 2018-02-21 15:05:37 -08:00
Feng Niu
8df75b229c fix unbound vars in es.syntax_iterators 2018-02-21 13:11:17 -08:00
alldefector
4244e285c2
Fix Spanish noun_chunks failure caused by typo 2018-02-21 12:43:21 -08:00
Matthew Honnibal
661873ee4c Randomize the rebatch size in parser 2018-02-21 21:02:07 +01:00
Matthew Honnibal
0872cf611d Don't lower-case lemmas of proper nouns 2018-02-21 16:01:16 +01:00
Matthew Honnibal
a0ddb803fd Make error when no label found more helpful 2018-02-21 16:00:59 +01:00
Matthew Honnibal
ea2fc5d45f Improve length and freq cutoffs in parser 2018-02-21 16:00:38 +01:00
Matthew Honnibal
e5757d4bf0 Add labels property to parser 2018-02-21 16:00:00 +01:00
Matthew Honnibal
eff4ae809a Fix nonproj label filter 2018-02-21 15:59:04 +01:00
Matthew Honnibal
e624405cda Temporarily remove cutoff when filtering labels in nonproj 2018-02-21 13:53:40 +01:00
Matthew Honnibal
f466f0186e Use new alignment implementation in GoldParse 2018-02-20 21:16:35 +01:00
Matthew Honnibal
c0734ba526 Make alignment work with strings 2018-02-20 17:51:49 +01:00
Matthew Honnibal
8180c84a98 Add tests for new Levenshtein alignment 2018-02-20 17:32:25 +01:00
Matthew Honnibal
930c980570 Add improved Levenshtein alignment implementation 2018-02-20 17:31:56 +01:00
Ines Montani
14e7e0f12a
Merge pull request #2000 from jimregan/polish-tag-map
Polish tag map
2018-02-18 19:05:58 +01:00
Jim O'Regan
664407de5d missing PrepCase attribute 2018-02-18 14:46:12 +00:00
Jim O'Regan
95f0673fbc fix typo/missing here too 2018-02-18 14:38:27 +00:00
Matthew Honnibal
2bccad8815 Fix incorrect matcher test 2018-02-18 14:56:12 +01:00
Matthew Honnibal
530172d57a Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-faster-matcher 2018-02-18 14:40:42 +01:00
Matthew Honnibal
cf0e320f2b Add doc.is_sentenced attribute, re #1959 2018-02-18 14:16:55 +01:00
Matthew Honnibal
1e5aeb4eec
Merge pull request #1987 from thomasopsomer/span-sent
Make span.sent work when only manual / custom sbd
2018-02-18 14:05:37 +01:00
Matthew Honnibal
1cf774bdc1 Add output options return_matches and as_tuples to Matcher 2018-02-18 14:00:45 +01:00
Matthew Honnibal
dd9b0945af Fix inconsistencies in the symbols table 2018-02-18 13:51:31 +01:00
Matthew Honnibal
66496ac8e1 Set version to v2.1.0.dev0 2018-02-18 13:48:39 +01:00
Matthew Honnibal
eb3040ce46
Merge pull request #1891 from fucking-signup/master
Fix issue #1889
2018-02-18 13:47:47 +01:00
Matthew Honnibal
3d7285870b Update matcher branch with v2.0.8 master 2018-02-18 13:42:58 +01:00
ines
6bba1db4cc Drop six and related hacks as a dependency 2018-02-18 13:29:56 +01:00
Matthew Honnibal
b30b09192a
Merge pull request #1665 from jimregan/animacy
typo in "inan", add "nhum"
2018-02-18 13:26:53 +01:00
Matthew Honnibal
1b3c98e01b Set version to v2.0.8 2018-02-18 12:16:31 +01:00
Matthew Honnibal
f9f46e5a07 Revert matcher fixes from GregDubbin 2018-02-18 10:59:28 +01:00
Matthew Honnibal
86405e4ad1 Fix CLI for multitask objectives 2018-02-18 10:59:11 +01:00
Matthew Honnibal
a34749b2bf Add multitask objectives options to train CLI 2018-02-17 22:03:54 +01:00
Matthew Honnibal
8f06903e09 Fix multitask objectives 2018-02-17 18:41:36 +01:00
Matthew Honnibal
d1246c95fb Fix model loading when using multitask objectives 2018-02-17 18:11:36 +01:00
Matthew Honnibal
262d0a3148 Fix overwriting of lexical attributes when loading vectors during training 2018-02-17 18:11:11 +01:00
Matthew Honnibal
c0caf7cf27 Fix LANG symbol 2018-02-17 18:10:50 +01:00
Matthew Honnibal
0bf2f6be29 Add missing symbol for LANG attr. Fixes inconsistent numeric ID 2018-02-17 17:37:02 +01:00
Matthew Honnibal
97a228a4ce Increment to v2.0.8.dev0 2018-02-17 16:54:36 +01:00
Matthew Honnibal
f7dc64d2a3 Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-faster-matcher 2018-02-17 16:47:35 +01:00
Aaron Marquez
ea571e8325 Merge branch 'master' into issue-1959 2018-02-16 15:14:09 -08:00
Matthew Honnibal
7d5c720fc3 Fix multitask objective when no pipeline provided 2018-02-15 23:50:21 +01:00
Aaron Marquez
f0d3672e17 Changed loading EN model 2018-02-15 14:28:38 -08:00
Aaron Marquez
3765d84d57 Fix issue #1959 2018-02-15 12:51:49 -08:00
Aaron Marquez
7ba4111554 Add test for issue-1959 2018-02-15 12:46:22 -08:00
Matthew Honnibal
59b7cf9db8 Add get_beam_parse method in ArcEager, for Prodigy 2018-02-15 21:03:16 +01:00
Matthew Honnibal
3e541de440 Merge branch 'master' of https://github.com/explosion/spaCy 2018-02-15 21:02:55 +01:00
Thomas Opsomer
5d24a81c0b add test for span.sent when doc not parsed 2018-02-15 16:59:16 +01:00
Thomas Opsomer
deab391cbf correct check on sent_start & raise if no boundaries 2018-02-15 16:58:30 +01:00
Matthew Honnibal
afbd46adfb Remove length cap in PhraseMatcher 2018-02-15 16:10:54 +01:00
Matthew Honnibal
4533c7408d Update matcher tests 2018-02-15 15:39:47 +01:00
Matthew Honnibal
1c19605426 Move matcher2.pyx to matcher.pyx 2018-02-15 15:27:03 +01:00
Matthew Honnibal
9ebf2fe7c3 Make helper function to get longest matches 2018-02-15 15:26:15 +01:00
Matthew Honnibal
4cb861e080
Merge pull request #1968 from DuyguA/is_currency
New lexical feature is_currency
2018-02-15 12:13:36 +01:00
Thomas Opsomer
b902731313 Find span sentence when only sentence boundaries (no parser) 2018-02-14 22:18:54 +01:00
Matthew Honnibal
d19dc67886 Make get_action nogil, for efficiency 2018-02-14 12:16:36 +01:00
Matthew Honnibal
7885b92b45 Refactor matcher2, hopefully making it faster 2018-02-14 12:11:17 +01:00
Matthew Honnibal
00261eea27 Make tests refer to matcher2 2018-02-14 12:10:51 +01:00
Claudiu-Vlad Ursache
e28de12cbd
Ensure files opened in from_disk are closed
Fixes [issue 1706](https://github.com/explosion/spaCy/issues/1706).
2018-02-13 20:49:43 +01:00
Matthew Honnibal
262cbe356e Remove caching, as doesn't seem to help for now. 2018-02-13 17:15:20 +01:00
Matthew Honnibal
f43d53f2c5 Remove print statement 2018-02-13 17:15:07 +01:00
Matthew Honnibal
dcd8d89aef Update test for 850, making it work with matcher2 2018-02-13 16:35:20 +01:00
Matthew Honnibal
9bdfa5cd4f Remove re comparisons tests, as matcher behaves differently 2018-02-13 16:28:52 +01:00
Matthew Honnibal
6d7986b0f1 Fix matcher test 2018-02-13 16:28:06 +01:00
Matthew Honnibal
9efda9e9ab Add PhraseMatcher in matcher2.pyx 2018-02-13 16:27:46 +01:00
Johannes Dollinger
012e874d09 Add contributor agreement for emulbreh 2018-02-13 13:40:33 +01:00
Johannes Dollinger
bf94c13382 Don't fix random seeds on import 2018-02-13 12:42:23 +01:00
Matthew Honnibal
0004331895 Update notes on matcher2 2018-02-13 11:45:45 +01:00
Matthew Honnibal
b4cc39eb74 Fix zero-width quantifiers. Passes test_matcher 2018-02-13 11:45:32 +01:00
Matthew Honnibal
1b01685f47 Fix ZERO_PLUS operator 2018-02-12 12:28:03 +01:00
Matthew Honnibal
9115c3ba0a Add TODO in notes 2018-02-12 12:06:48 +01:00
Matthew Honnibal
b00326a7fe Move pattern_id out of TokenPattern 2018-02-12 12:05:54 +01:00
Matthew Honnibal
d34c732635 Add Python notes for rethinking matcher 2018-02-12 10:19:29 +01:00
Matthew Honnibal
d7c9b53120 Pass kwargs into pipeline components during begin_training 2018-02-12 10:18:39 +01:00
Matthew Honnibal
fae5c0dc18 Work on matcher2 2018-02-12 10:17:43 +01:00
4altinok
ca8728035d added new lex feat to token 2018-02-11 18:55:48 +01:00
4altinok
edd7202a06 added new symbol 2018-02-11 18:55:32 +01:00
4altinok
ed1ac2969e added new lexical feat to lexeme 2018-02-11 18:51:48 +01:00
4altinok
94fb0b75e3 code for is_currency 2018-02-11 18:51:32 +01:00
4altinok
3deef1497a removed 18 and replaced 18 with is_currency 2018-02-11 18:51:09 +01:00
4altinok
471d3c9e23 added lex test for is_currency 2018-02-11 18:50:50 +01:00
ines
c63e99da8a Fix typo in glossary (resolves #1964)
Co-Authored-By: SThomasP <sthomasp@users.noreply.github.com>
2018-02-10 11:58:41 +01:00
Lyndon White
6ee5dff51c
Make python 3.4 compat module loading (fix #1733) 2018-02-09 23:03:35 +08:00
Matthew Honnibal
e361b4f82b Fix #1929: Incorrect NER when pre-set sentence boundaries. 2018-02-08 15:25:41 +01:00
Matthew Honnibal
fd9fd275c5 Make test for #1945 more precise 2018-02-07 02:06:11 +01:00
Matthew Honnibal
c087a14380 Merge branch 'master' of https://github.com/explosion/spaCy 2018-02-07 01:29:39 +01:00
Matthew Honnibal
76d89b2180 Add test for #1945: PhraseMatcher regression 2018-02-07 01:29:23 +01:00
Ines Montani
0954e15dda
Merge pull request #1913 from ohenrik/nb_syntax_iterator
Norwegian Language (nb) - Added French syntax iterator with explanation
2018-02-06 04:59:07 +01:00
Ole Henrik Skogstrøm
251a7805fe Copied French syntax iterator to simplify future changes 2018-02-05 14:45:05 +01:00
Matthew Honnibal
2e7391e627
Merge pull request #1916 from tokestermw/bug/fix-not-passing-in-model-cfg-in-nlp
Bug/fix not passing in model cfg in nlp
2018-02-05 01:19:40 +01:00
Ali Zarezade
9df9da34a3
Fix init_model issue
Fixing issue #1928
2018-02-03 17:21:34 +03:30
Matthew Honnibal
ebe84e45e5 Increment version to 2.0.7 2018-02-02 03:39:16 +01:00
Matthew Honnibal
e4b1f57599 Increment version 2018-02-02 02:33:23 +01:00
Matthew Honnibal
069531c351 Merge branch 'master' of https://github.com/explosion/spaCy 2018-02-02 02:32:58 +01:00
Matthew Honnibal
f74a802d09 Test and fix #1919: Error resuming training 2018-02-02 02:32:40 +01:00
ines
f1d3deffac Add Russian example sentences (see #1107) 2018-02-01 20:09:40 +01:00
Matthew Honnibal
6b1126c312 Merge branch 'master' of https://github.com/explosion/spaCy 2018-02-01 02:57:52 +01:00
ines
3c1fb9d02d Make validate command fail more gracefully if version not found
Mostly relevant during development when working with .dev versions
2018-01-31 22:06:28 +01:00
Motoki Wu
54062b7326 added tests for issue #1915 2018-01-30 18:30:19 -08:00
Motoki Wu
f4a7d1a423 make to sure pass in **cfg to each component when training 2018-01-30 18:29:54 -08:00
ines
4046823699 Only check component in factories if string (see #1911) 2018-01-30 16:29:07 +01:00
ines
ce10d320c4 Fix component check in self.factories (see #1911) 2018-01-30 16:09:37 +01:00
Ole Henrik Skogstrøm
e40465487c Added French syntax iterator with explanation 2018-01-30 15:44:29 +01:00
ines
8901814248 Improve error handling if pipeline component is not callable (resolves #1911)
Also add help message if user accidentally calls nlp.add_pipe() with a string of a built-in component name.
2018-01-30 15:43:03 +01:00
Matthew Honnibal
a437ba87a3 Set release=True 2018-01-29 21:26:04 +01:00
Adam Binford
9238749aaf Removed test to avoid network requests 2018-01-29 14:48:20 -05:00
Adam Binford
1a2c2f7d7f Fixed auto linking after download and added simple test to check 2018-01-29 14:25:21 -05:00
Matthew Honnibal
cb7110c22e
Merge pull request #1882 from ohenrik/nb_lemma_and_tag_map
Add Norwegian Bokmål ('nb') lemmatizer and tag_map
2018-01-29 18:18:50 +01:00
Matthew Honnibal
0c1e7f0c86
Merge pull request #1893 from azarezade/master
Add Persian language
2018-01-29 18:18:33 +01:00
Matthew Honnibal
cbdab75b36 Increment version 2018-01-28 23:46:22 +01:00
Matthew Honnibal
512e6adb08
Merge pull request #1896 from thomasopsomer/fix-sent
Fix sentence boundaries serialization (issue #1834)
2018-01-28 21:18:51 +01:00
Matthew Honnibal
f5b1ad4100 Limit parser model size, to hopefully reduce memory during CI tests 2018-01-28 21:00:32 +01:00
Thomas Opsomer
515e25910e fix sent_start in serialization 2018-01-28 19:50:42 +01:00
Thomas Opsomer
45d62561f7 add test for the issue 2018-01-28 19:49:56 +01:00
ines
6d978e5c35 Don't use deprecated Doc.merge call in displaCy
As reported here: https://stackoverflow.com/a/48464412/6400719
2018-01-27 11:25:05 +01:00
Ali Zarezade
bb6bd3d8ae add persian language 2018-01-27 13:27:26 +03:30
Ali Zarezade
d195675db5 add persian language 2018-01-27 13:21:38 +03:30
Kit
4b42267ba3
Fix issue #1889 2018-01-25 23:17:22 +01:00
Kit
52ef51f36e
Add test for issue #1889 2018-01-25 22:56:48 +01:00
Ole Henrik Skogstrøm
8e2c9f2475 Cleaned up nb tag_map comments 2018-01-25 11:09:28 +01:00
Ole Henrik Skogstrøm
1107e89fcf Updated doc string on nb tag_map module 2018-01-25 11:08:28 +01:00
Matthew Honnibal
6a8cb905aa
Merge pull request #1876 from GregDubbin/master
Pattern matcher fixes
2018-01-24 16:38:11 +01:00
Matthew Honnibal
38b260e0c3
Merge pull request #1879 from azarezade/master
Add Persian character and symbols
2018-01-24 16:34:22 +01:00
Matthew Honnibal
edb71a280e Add test for #1883: Unpickling Matcher 2018-01-24 15:42:33 +01:00
Matthew Honnibal
2ad050e668 Fix unpickling of Matcher. Also store correct data in matcher._patterns 2018-01-24 15:42:11 +01:00
Ole Henrik Skogstrøm
4058a7d579 Fix æøå characters in lemmatizer 2018-01-24 14:03:14 +01:00
Ole Henrik Skogstrøm
42248f423f Updated tag map 2018-01-24 13:50:33 +01:00
Ole Henrik Skogstrøm
74b430b49a Correct Lemmatizer 2018-01-24 13:26:33 +01:00
Ole Henrik Skogstrøm
b9b3a40c78 Add Norwegian lemmatizer and tag_map 2018-01-24 12:28:29 +01:00
Matthew Honnibal
42a18ef903 Add test for #1868: Vocab.__contains__ with ints 2018-01-23 23:27:05 +01:00
Matthew Honnibal
43f381ce36 Make Vocab.__contains__ work with ints. Fixes #1868 2018-01-23 23:26:47 +01:00
greg
85ab99e692 Correct test examples 2018-01-23 15:00:14 -05:00
greg
f50bb1aafc Restructure StateC to eliminate dependency on unordered_map 2018-01-23 14:40:03 -05:00
Matthew Honnibal
f3753c2453 Further model deserialization fixes re #1727 2018-01-23 19:16:05 +01:00
Matthew Honnibal
91e916cb67 Add comment to new test 2018-01-23 19:11:53 +01:00
Matthew Honnibal
fd187d71ad Add test for #1727 2018-01-23 19:11:01 +01:00
Matthew Honnibal
85c942a6e3 Don't overwrite pretrained_dims setting from cfg. Fixes #1727 2018-01-23 19:10:49 +01:00
Ali Zarezade
42349471bc
add ٪ as punctuation 2018-01-23 18:11:33 +03:30
Ali Zarezade
2bda582135
Add Persian character and symbols
Add Persian characters and the following:
- ٪ used instead of %
- ؟ used instead of ?
- ﷼ used instead of $
- ، used instead of ,
- ؛ used instead of ;
2018-01-23 13:20:36 +03:30
Matthew Honnibal
7e6dc283db Fix unicode import in test 2018-01-22 23:55:44 +01:00
greg
686735b94e Fix matcher import 2018-01-22 16:53:05 -05:00
greg
3a491093ee Import libcpp.map if libcpp.unordered_map doesn't exist 2018-01-22 16:46:25 -05:00
greg
d55992bdf0 Switch match dictionary to use final state pointer rather than ID 2018-01-22 15:36:47 -05:00
Matthew Honnibal
4ce7d24fd5 Add test for #1799: Set left and right edges (and thus sentences) in non-projective parses. 2018-01-22 20:18:38 +01:00
Matthew Honnibal
56164ab688 Set l_edge and r_edge correctly for non-projective parses. Fixes #1799 2018-01-22 20:18:04 +01:00
Matthew Honnibal
964aa1b384 Merge branch 'master' of https://github.com/explosion/spaCy 2018-01-22 19:18:46 +01:00
Matthew Honnibal
29897ed1b3 Allow vector loading to work on 1d data files. Fixes #1831 2018-01-22 19:18:26 +01:00
greg
490bc82c27 Add comments clarifying matcher logic for '*' 2018-01-22 10:03:12 -05:00
Matthew Honnibal
fe4748fc38
Merge pull request #1870 from avadhpatel/master
Model Load Performance Improvement by more than 5x
2018-01-22 00:05:15 +01:00
Avadh Patel
a517df55c8 Small fix
Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-21 15:20:45 -06:00
Avadh Patel
5b5029890d Merge branch 'perfTuning' into perfTuningMaster
Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-21 15:20:00 -06:00
Matthew Honnibal
203d2ea830 Allow multitask objectives to be added to the parser and NER more easily 2018-01-21 19:37:02 +01:00
Matthew Honnibal
4a7d524efb Merge branch 'master' of https://github.com/explosion/spaCy 2018-01-21 19:22:03 +01:00
Matthew Honnibal
61a051f2c0 Fix MultitaskObjective 2018-01-21 19:21:34 +01:00
Avadh Patel
75903949da Updated model building after suggestion from Matthew
Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-18 06:51:57 -06:00
Avadh Patel
fe879da2a1 Do not train model if it's going to be loaded from disk
This saves significant time in loading a model from disk.

Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-17 06:16:07 -06:00
Avadh Patel
2146faffee Do not train model if it's going to be loaded from disk
This saves significant time in loading a model from disk.

Signed-off-by: Avadh Patel <avadh4all@gmail.com>
2018-01-17 06:04:22 -06:00
greg
7072b395c9 Add greedy matcher tests 2018-01-16 15:46:13 -05:00
greg
441f490c1c Merge branch 'master' of github.com:GregDubbin/spaCy 2018-01-16 13:31:10 -05:00
greg
8bea62f26e Correct bugs for greedy matching and introduce ADVANCE_PLUS action 2018-01-16 13:21:43 -05:00
Matthew Honnibal
ccb51a9f36 Make .similarity() return 1.0 if all orth attrs match 2018-01-15 16:29:48 +01:00
Matthew Honnibal
82135d85b7 Fix test 2018-01-15 15:55:15 +01:00
Matthew Honnibal
4b09616b58 Add test for #1757: Comparison against None 2018-01-15 15:55:01 +01:00
Matthew Honnibal
b904d81e9a Fix rich comparison against None objects. Closes #1757 2018-01-15 15:51:25 +01:00
Matthew Honnibal
9e413449f6 Fix unicode error in new test 2018-01-15 15:39:00 +01:00
Matthew Honnibal
ab7c45b12d Fix error message and handling of doc.sents 2018-01-15 15:21:11 +01:00
Matthew Honnibal
6b215d2dd3 Add test for Issue #1537 2018-01-15 15:20:56 +01:00
ines
5babb7d6f6 Merge branch 'master' of https://github.com/explosion/spaCy 2018-01-14 17:31:09 +01:00
ines
793890cb4d Remove test for removed deprecation warning 2018-01-14 17:31:06 +01:00
Matthew Honnibal
465a6f6452 Add missing Span.vocab property. Closes #1633 2018-01-14 15:06:30 +01:00
Matthew Honnibal
0cb090e526 Fix infinite recursion in token.sent_start. Closes #1640 2018-01-14 15:02:15 +01:00
Matthew Honnibal
5cbe913b6f Don't raise deprecation warning in property. Closes #1813, #1712 2018-01-14 14:55:58 +01:00
Matthew Honnibal
1a1cca6052 Fix vectors.resize() on Py3. Closes #1539 2018-01-14 14:48:51 +01:00
Matthew Honnibal
0153220304 Make set_vector add word to vocab. Fixes #1807 2018-01-14 13:57:57 +01:00
Ines Montani
55754f0cee
Merge pull request #1836 from fucking-signup/master
Add tests for issue #1769
2018-01-13 00:23:35 +00:00
Kit
4ee97f20a0
Mark like_num tests as slow 2018-01-13 00:44:15 +01:00
Kit
855531537e
Rewrite tests for issue #1769 2018-01-12 23:49:51 +01:00
Kit
5b541cb5ec
Simplify tests for issue #1769 2018-01-12 23:34:27 +01:00
Kit
7a2adc4633
Remove some tests to see build status changes 2018-01-12 22:49:16 +01:00
Kit
0e62809a43
Rewrite tests for issue #1769 2018-01-12 22:26:06 +01:00
Ines Montani
36f426fe0a
Merge pull request #1808 from fucking-signup/master
Fix issue #1769
2018-01-12 21:12:02 +00:00
Kit
76f4eeca44
Remove tests to see build changes on Windows (Python 2.7) 2018-01-12 20:30:51 +01:00
Matthew Honnibal
7ca49c2061
Merge branch 'master' into feature-improve-model-download 2018-01-10 18:21:55 +01:00
Kit
7ec0956e8d
Add regression test (issue #1769) 2018-01-08 03:42:04 +01:00
Kit
701e7cc6aa
Rename variable to keep code consistent 2018-01-08 03:38:44 +01:00
Kit
ed0db95183
Find lowercased forms of ordinal words, where possible 2018-01-08 03:28:50 +01:00
Kit
9bc524982e
Find lowercased forms of numeric words 2018-01-08 03:25:08 +01:00
Søren Lind Kristiansen
62de5da1ff Remove unused dummy variable 2018-01-05 09:57:24 +01:00
Søren Lind Kristiansen
10dab8eef8 Remove dummy variable from function calls 2018-01-05 09:37:05 +01:00
Søren Lind Kristiansen
7f0ab145e9 Don't pass CLI command name as dummy argument 2018-01-04 21:33:47 +01:00
Ines Montani
6a008233b5
Merge pull request #1795 from textioHQ/issue1758 (resolves #1758)
english tokenizer: handle "would've"
2018-01-04 02:43:39 +00:00
Kevin Humphreys
597df5bf83 add test 2018-01-03 13:00:05 -08:00
Kevin Humphreys
7918fa4ef9 handle would've 2018-01-03 12:25:48 -08:00
ines
2c656f90fb Exit with 1 if incompatible models found (see #1714) 2018-01-03 21:20:35 +01:00
ines
dacfaa2ca4 Ensure that download command exits properly (resolves #1714) 2018-01-03 21:03:36 +01:00
Søren Lind Kristiansen
a9ff6eadc9 Prefix dummy argument names with underscore 2018-01-03 20:48:12 +01:00
ines
1081e08efb Fix formatting 2018-01-03 20:14:50 +01:00
ines
d8109964d6 Use --no-deps on model install
In general, it's nice for models to specify spaCy as a dependency. However, this tends to cause problems in conda environments, as pip will re-install spaCy and its dependencies (especially Thinc)
2018-01-03 17:40:37 +01:00
ines
319d754309 Fix overwriting of existing symlinks
Check for is_symlink() to also overwrite invalid and outdated symlinks. Also show better error message if link path exists but is not symlink (i.e. file or directory).
2018-01-03 17:39:36 +01:00
ines
8ba0dfd017 Make message on failed linking more clear 2018-01-03 17:38:09 +01:00
Søren Lind Kristiansen
d6327e8495 Fix handling case when vectors not specified 2018-01-03 12:20:49 +01:00
Søren Lind Kristiansen
bcc51d7d8b Fix shifted positional arguments 2018-01-03 12:19:47 +01:00
zqhZY
f27859fa99 add ChineseDefaults class for pickling 2017-12-28 17:13:58 +08:00
Ines Montani
ff9fc945ab
Merge pull request #1749 from sorenlind/da_ud_tokenization
Tune Danish tokenizer to more closely match Universal Dependencies
2017-12-22 16:00:49 +00:00
ines
26f313dabc Fix missing import 2017-12-22 16:21:44 +01:00
ines
8dc1c27841 Merge branch 'master' of https://github.com/explosion/spaCy 2017-12-22 16:01:00 +01:00
ines
b10ba848b8 xfail test that causes MemoryError on Python 2 on Windows
Need to investigate this further!
2017-12-22 16:00:58 +01:00
Søren Lind Kristiansen
bef735aef7 Fix Danish abbreviation 'm.h.t.' 2017-12-21 09:24:31 +01:00
Ines Montani
a3dd167d7f
Merge branch 'master' into da_ud_tokenization 2017-12-20 21:05:34 +00:00
Ines Montani
97f100f69f
Merge pull request #1742 from kimfalk/master
Two corrections in the da lan.
2017-12-20 21:02:00 +00:00
Ines Montani
d682a8803e
Merge pull request #1672 from cbilgili/master
Adds Turkish Lemmatization
2017-12-20 21:01:00 +00:00
Benjamin Peterson
9452134cd1 remove no-break spaces from Hindi example (fixes #1750) 2017-12-20 11:35:30 -08:00
Søren Lind Kristiansen
7a2f2f6f94 Fix formatting. 2017-12-20 18:37:37 +01:00
Søren Lind Kristiansen
15d13efafd Tune Danish tokenizer to more closely match tokenization in Universal Dependencies. 2017-12-20 17:36:52 +01:00
Kim FalkJørgensen
648dc60755 Remove the incorrect exception 'm.h.t' 2017-12-20 10:02:39 +01:00
Kim FalkJørgensen
9c9f4ef84a Fixing a translation error in examples.py
Adding an exception in the tokenizer_exceptions.py
2017-12-19 15:26:50 +01:00
ines
22dc744b48 Fix check for '@' in like_url (see #1715) 2017-12-16 13:48:43 +01:00
Ines Montani
9c1ee65268
Add regression test for #1698 2017-12-12 10:36:11 +01:00
Ines Montani
6455b574fc
Check for email address first 2017-12-12 10:25:13 +01:00
Bri-Will
d77361d76c
Update lex_attrs.py. Fix like_url from matching on e-mail 2017-12-11 14:13:28 -08:00
Søren Lind Kristiansen
5a9d377580 Remove abbreviation for positional plac argument 2017-12-11 11:08:29 +01:00
Isaac Sijaranamual
38021fbb00 Switch from python 3 only TemporaryDirectory to pytest's tmpdir 2017-12-11 00:16:04 +01:00
Isaac Sijaranamual
20ae0c459a Fixes "Error saving model" #1622 2017-12-10 23:07:13 +01:00
Isaac Sijaranamual
568130ce7c Adds regression test_issue1622 2017-12-10 23:00:48 +01:00
Isaac Sijaranamual
e188b61960 Make cli/train.py not eat exception 2017-12-10 22:53:08 +01:00
ines
020a7e5d52 Allow 'fine_grained' option in displaCy (see #1703)
Shows token.tag_ instead of token.pos_. Disabled by default, to not cause rendering issues for models with long fine-grained tags (e.g. merged morphological features).
2017-12-09 15:11:12 +01:00
Matthew Honnibal
3b17eb7c49 Merge branch 'master' of https://github.com/explosion/spaCy 2017-12-07 10:39:32 +01:00
Matthew Honnibal
a6b43729c6 Set version to v2.0.5 2017-12-07 10:39:14 +01:00
ines
5eaa61c2b8 Fix formatting 2017-12-07 10:23:09 +01:00
ines
24e80c51b8 Document init-model command 2017-12-07 10:14:37 +01:00
Matthew Honnibal
c91f451b0f Fix imports and CLI in init-model 2017-12-07 10:03:07 +01:00
ines
82e80ff928 Rename model command to init_model and fix formatting 2017-12-07 09:59:23 +01:00
Ines Montani
2feeb428d6
Merge pull request #1646 from GreenRiverRUS/master
Added model command to create models from raw data
2017-12-07 08:54:26 +00:00
Matthew Honnibal
6373d2580d Increment version to v2.0.5.dev0 2017-12-07 09:53:59 +01:00
Matthew Honnibal
36b47e3fa6 Fix (and test) vector pickling 2017-12-07 09:53:30 +01:00
Matthew Honnibal
05f41ff587 Set version to 2.0.4 2017-12-06 13:24:02 +01:00
Matthew Honnibal
04c38f7e87 Merge branch 'master' of https://github.com/explosion/spaCy 2017-12-06 12:15:52 +01:00
Matthew Honnibal
361944e512 If no rules are set, lemmatize by lookup 2017-12-06 12:12:11 +01:00
Matthew Honnibal
2ab0f2d186
Merge pull request #1664 from jimregan/italian-lemmatizer
BOM in Italian lemmatiser
2017-12-06 11:09:04 +01:00
Matthew Honnibal
3f247119d3
Merge pull request #1668 from sorenlind/da_morph
Add more Danish morph rules and clean up existing ones
2017-12-06 11:08:09 +01:00
Matthew Honnibal
b712de774e Fix vectors pickling 2017-12-05 12:45:24 +01:00
Matthew Honnibal
04650e38c7 Set version to 2.0.4.dev0 2017-12-05 10:52:31 +01:00
Matthew Honnibal
07acb43a85 Merge branch 'master' of https://github.com/explosion/spaCy 2017-12-04 14:42:52 +01:00
Thomas Werkmeister
94eac75b7c
fix setup.py spacy req string for packaging
Requirement should be `spacy>=2.0.2` instead of `spacy2.0.2`
2017-12-03 04:16:28 -06:00
ines
f2ea6d4713 Add Dutch example sentences (see #1107) 2017-12-01 23:36:05 +01:00
Canbey Bilgili
abe098b255 Adds Turkish Lemmatization 2017-12-01 17:04:32 +03:00
Søren Lind Kristiansen
d86b537a38 Enable morph rules for Danish 2017-11-30 15:58:02 +01:00
Søren Lind Kristiansen
13a988adc3 Remove 'Number[psor]' 2017-11-30 15:55:04 +01:00
Søren Lind Kristiansen
dd6fde18a9 Add more Danish morph rules and clean up existing ones 2017-11-30 11:17:19 +01:00
Vadim Mazaev
495eacf470 Merge branch 'model_command' 2017-11-30 12:30:26 +03:00
Vadim Mazaev
4ba7ddf651 Bugfixes 2017-11-30 12:29:38 +03:00
Jim O'Regan
a4ecdeadd4 aha 2017-11-29 23:43:25 +00:00
Jim O'Regan
2c7a9215d7 Merge branch 'master' into animacy 2017-11-29 23:31:12 +00:00
Jim O'Regan
c3e6cee17a use inan in polimorf tagset conversion 2017-11-29 23:15:47 +00:00
Jim O'Regan
b32575e78c imports 2017-11-29 23:03:41 +00:00
Jim O'Regan
3696ce6a7b add UD mapping 2017-11-29 22:59:19 +00:00
Jim O'Regan
f8e7082fe4 typo in "inan", add "nhum" 2017-11-29 22:40:47 +00:00
Matthew Honnibal
6bc0f4d29f
Merge pull request #1611 from fsonntag/master
Solving #1494
2017-11-29 23:11:23 +01:00
Matthew Honnibal
f9ed9ea529
Merge pull request #1624 from GreenRiverRUS/russian
Add support for Russian
2017-11-29 23:10:01 +01:00
Jim O'Regan
076a6fc60a symbols 2017-11-29 20:11:20 +00:00
Jim O'Regan
834ba3c69a (semi generated) Polimorf mapping 2017-11-29 20:08:24 +00:00
Jim O'Regan
ba6a23fd11 BOM in Italian lemmatiser 2017-11-29 17:40:07 +00:00
ines
a31506e060 Fix off-by-one error in nlp.add_pipe(after=name) (fixes #1654) 2017-11-28 20:37:55 +01:00
ines
b62739fbfe Add regression test for #1654 2017-11-28 20:27:54 +01:00
ines
2e50dbb9d7 Simplify test 2017-11-28 20:27:27 +01:00
Felix Sonntag
724ae7dc55 Fixed issue of infix capturing prefixes 2017-11-28 17:17:12 +01:00
Ines Montani
9052643e2c
Merge pull request #1653 from sorenlind/da_example_typo
Fix typo
2017-11-27 14:47:42 +00:00
Søren Lind Kristiansen
5fe58b885b Fix typo 2017-11-27 15:36:18 +01:00
Ines Montani
d52b1ab245
Add unicode_literals (hopefully fixes test failure on Python 2) 2017-11-27 15:16:54 +01:00
Søren Lind Kristiansen
0ffd27b0f6 Add several Danish alternative spellings 2017-11-27 13:35:41 +01:00
Ines Montani
6362024cf8
Merge pull request #1645 from GreenRiverRUS/fix_default_meta
Fixed spaCy version string in default meta
2017-11-27 11:58:02 +00:00
Vadim Mazaev
c332ffdde1 Added model command to create model from raw data:
words counts, brown clusters and vectors
2017-11-27 01:21:47 +03:00
Vadim Mazaev
59f03ab1d7 Fixed spacy version string in default meta 2017-11-26 23:02:07 +03:00
Vadim Mazaev
53e7c38637 Fixed tests that depend on pymorphy2 2017-11-26 21:04:44 +03:00
Vadim Mazaev
cacd859dcd Added tag map, fixed test failures, added more exceptions 2017-11-26 20:54:48 +03:00
Ines Montani
a7bb8f1b42
Merge pull request #1637 from sorenlind/da_tokenization
Improve Danish tokenization
2017-11-26 15:41:38 +00:00
ines
c699aec089 Add offsets_from_biluo_tags helper and tests (see #1626) 2017-11-26 16:38:01 +01:00
Søren Lind Kristiansen
ef03e9ea53 Remove unused import. 2017-11-25 13:04:02 +01:00
Søren Lind Kristiansen
6aa241bcec Add day of month tokenizer exceptions for Danish. 2017-11-24 15:03:24 +01:00
Søren Lind Kristiansen
0c276ed020 Add weekday abbreviations and remove ambiguous month abbreviations for Danish. 2017-11-24 14:43:29 +01:00
Søren Lind Kristiansen
056547e989 Add multiple tokenizer exceptions for Danish. 2017-11-24 11:51:26 +01:00
Søren Lind Kristiansen
8dc265ac0c Add test for tokenization of 'i.' for Danish. 2017-11-24 11:29:37 +01:00
Søren Lind Kristiansen
ac8116510d Fix tokenization of 'i.' for Danish. 2017-11-24 11:16:53 +01:00
Matthew Honnibal
79f11d4f85 Pickle vectors with vocab 2017-11-23 17:19:50 +01:00
Matthew Honnibal
f29c3925ee Fix more efficient nonproj 2017-11-23 12:48:00 +00:00
Matthew Honnibal
e10e9ad2c5 Improve efficiency of Doc.to_array 2017-11-23 12:33:27 +00:00
Matthew Honnibal
2acc907d55 Improve profiling 2017-11-23 12:33:03 +00:00
Matthew Honnibal
fa62427300 Remove lookup-based lemmatization 2017-11-23 12:32:22 +00:00
Matthew Honnibal
fb26b2cb12 Use lookup lemmatizer if lemma unset 2017-11-23 12:31:58 +00:00
Matthew Honnibal
db5c714ad2 Improve efficiency of deprojectivization 2017-11-23 12:31:34 +00:00
Matthew Honnibal
8fec7268eb Move string cleanup under a setting flag 2017-11-23 12:19:18 +00:00
Matthew Honnibal
5949777b12 Fix misleading multi-threading docstring 2017-11-23 12:18:59 +00:00
Matthew Honnibal
542e6fd4ea Don't remove entries from specials 2017-11-23 12:17:42 +00:00
Matthew Honnibal
30ba81f881
Merge pull request #1576 from ligser/master
Actually reset caches in pipe [wip]
2017-11-23 12:54:48 +01:00
ines
c90fe92e15 Fix displaCy test 2017-11-22 05:04:39 +01:00
ines
a6f33ac27d Fix displaCy test 2017-11-22 04:19:28 +01:00
ines
93b0be611a Merge branch 'master' of https://github.com/explosion/spaCy 2017-11-22 00:28:55 +01:00
ines
60b4915569 Use .pos_ instead of .tag_ in displaCy by default (see #1006) 2017-11-22 00:28:52 +01:00
Vadim Mazaev
81314f8659 Fixed tokenizer: added char classes; added first lemmatizer and
tokenizer tests
2017-11-21 22:23:59 +03:00
Vadim Mazaev
52ee1f9bf9 Updated Russian Language, added lemmatizer, norm exceptions and lex
attrs
2017-11-21 11:44:46 +03:00
Burton DeWilde
a5c6869b2d Fix bug where span.orth_ != span.text (see #1612) 2017-11-20 12:05:43 -06:00
Burton DeWilde
635792997c Add regression test for #1612 2017-11-20 12:05:35 -06:00
ines
9a63e32f21 Add noqa to Python 2 compat variables of built-ins (see #1617) 2017-11-20 14:03:42 +01:00
ines
d70a64d78b Fix syntax error and formatting in test (see #1617) 2017-11-20 14:01:25 +01:00
ines
17849dee4b Fix French test (see #1617) 2017-11-20 13:59:59 +01:00
Felix Sonntag
33b0f86de3 Changed tokenizer to add infix when infix_start is offset 2017-11-19 16:32:10 +01:00
Felix Sonntag
8be3392302 Added regression test for 1494 2017-11-19 16:30:35 +01:00
Motoki Wu
a52e195a0a Fixes Issue #1207 where noun_chunks of Span gives an error.
Make sure to reference `self.doc` when getting the noun chunks.

Same fix as 9750a0128c
2017-11-17 17:16:20 -08:00
Motoki Wu
b818afaa0e Added failing test for Issue #1207.
The noun chunk iterator should work for `Doc` but not for `Span`.
2017-11-17 17:04:27 -08:00
Vadim Mazaev
a0739a06d4 Returned russian support from v1.10 branch 2017-11-17 17:06:15 +03:00
yuukos
7401152289 updated Russian tokenizer
moved the attempt to import pymorph into __init__
2017-11-17 17:04:50 +03:00
yuukos
3aad66cf00 added russian language support 2017-11-17 17:04:22 +03:00
ines
a3d4dd1a5d Test adding of lots of pipeline components (see #1585)
Just to make sure that there's no error now or in the future with adding a large number of pipeline components.
2017-11-15 17:28:06 +01:00
Roman Domrachev
61d28d03e4 Try again to do selective remove cache 2017-11-15 19:11:12 +03:00
Roman Domrachev
b3311100c7 Merge branch 'master' of github.com:explosion/spaCy 2017-11-15 18:30:04 +03:00
Matthew Honnibal
b60d92aca8 Increment version 2017-11-15 16:14:46 +01:00
Roman Domrachev
505c6a2f2f Completely cleanup tokenizer cache
Tokenizer cache can have different keys than string

That modification can slow down the tokenizer and needs to be measured
2017-11-15 17:55:48 +03:00
Matthew Honnibal
cf0be62096 Increment version 2017-11-15 15:00:18 +01:00
ines
97a4f9362b Merge branch 'master' of https://github.com/explosion/spaCy 2017-11-15 14:24:00 +01:00
ines
8e65247886 Fix lex.id if vectors is None 2017-11-15 14:23:58 +01:00
Matthew Honnibal
437ad1a852
Merge pull request #1570 from explosion/feature/fix-beam-leak
Fix memory leak in beam parser
2017-11-15 14:15:05 +01:00
Matthew Honnibal
2f169fdb0a Set lex ID correctly for new tokens in Vocab 2017-11-15 13:58:03 +01:00
Matthew Honnibal
fe3c42a06b Fix caching in tokenizer 2017-11-15 13:55:46 +01:00
Matthew Honnibal
8d692771f6 Improve profiling 2017-11-15 13:51:25 +01:00
Matthew Honnibal
b797dca977 Merge branch 'master' of https://github.com/explosion/spaCy 2017-11-15 13:11:43 +01:00
ines
c9d72de0fb Add dummy serialization methods for Japanese and missing lang getter (resolves #1557) 2017-11-15 12:44:02 +01:00
Matthew Honnibal
d274d3a3b9 Let beam forward use minibatches 2017-11-15 00:51:42 +01:00
Matthew Honnibal
855872f872 Remove state hashing 2017-11-14 23:36:46 +01:00
Roman Domrachev
3e21680814 Use safer method to get string without hit 2017-11-14 22:58:46 +03:00
Roman Domrachev
a33d5a068d Try to hold origin data instead of restore it 2017-11-14 22:40:03 +03:00
Roman Domrachev
91e2fa6561 Clean all caches 2017-11-14 21:15:04 +03:00