Commit Graph

5915 Commits

Author SHA1 Message Date
svlandeg
19e8f339cb deduce entity freq from WP corpus and serialize vocab in WP test 2019-04-29 17:37:29 +02:00
svlandeg
387263d618 simplify chains 2019-04-29 13:58:07 +02:00
svlandeg
54d0cea062 unit test for KB serialization 2019-04-24 23:52:34 +02:00
svlandeg
3e0cb69065 KB aliases to and from file 2019-04-24 20:24:24 +02:00
svlandeg
ad6c5e581c writing and reading number of entries to/from header 2019-04-24 15:31:44 +02:00
svlandeg
6e3223f234 bulk loading in proper order of entity indices 2019-04-24 11:26:38 +02:00
svlandeg
694fea597a dumping all entryC entries + (inefficient) reading back in 2019-04-23 18:36:50 +02:00
svlandeg
8e70a564f1 custom reader and writer for _EntryC fields (first stab at it - not complete) 2019-04-23 16:33:40 +02:00
svlandeg
10ee8dfea2 poc with few entities and collecting aliases from the WP links 2019-04-18 14:12:17 +02:00
svlandeg
9a7d534b1b enable nogil for cython functions in kb.pxd 2019-04-10 17:25:10 +02:00
svlandeg
61a33f55d2 little fixes 2019-04-10 16:06:09 +02:00
Ines Montani
6ae3b5699e Make sure path is string (resolves #3546) 2019-04-08 12:53:41 +02:00
Ines Montani
d0f5e015cb Auto-format 2019-04-08 12:53:16 +02:00
Dobita21
8bf6967eb7 Update Thai stop words (#3545)
* test spaCy commit to git fri 04052019 10:54

* change Data format from my format to master format

* ทัทั้งนี้ ---> ทั้งนี้

* delete stop_word translate from Eng

* Adjust formatting and readability
2019-04-05 12:06:38 +02:00
jeannefukumaru
f67d881b30 fix typos in tag_map flagged by python -m debug-data (#3542)
## Checklist
- [ ] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.


Co-authored-by: Ines Montani <ines@ines.io>
2019-04-05 12:06:09 +02:00
Jeanne Choo
b6c9807431 Merge remote-tracking branch 'upstream/master' 2019-04-04 14:21:50 +08:00
Jeanne Choo
80e15af76c fixed tag_map.py merge conflict 2019-04-04 14:18:27 +08:00
jeannefukumaru
876ce01567 updated tag map with missing tags 2019-04-03 23:09:11 +08:00
Ines Montani
4faf62d515 Merge pull request #3530 from svlandeg/fix/issue_3521
Allow English stopwords with any type of apostrophe
2019-04-03 14:14:03 +02:00
Yves Peirsman
951825532c Improved Dutch language resources and Dutch lemmatization (#3409)
* Improved Dutch language resources and Dutch lemmatization

* Fix conftest

* Update punctuation.py

* Auto-format

* Format and fix tests

* Remove unused test file

* Re-add deleted test

* removed redundant infix regex pattern for ','; note: brackets + simple hyphen remains

* Cleaner lemmatization files
2019-04-03 14:13:26 +02:00
svlandeg
4ff786e113 addressed all comments by Ines 2019-04-03 13:50:33 +02:00
Ines Montani
6a4575a56c Don't make "settings" or "title" required in displaCy data (closes #3531) 2019-04-03 10:13:16 +02:00
Kamolsit Mongkolsrisawat
dcc67f3f51 Update Thai tokenizer_exception list (#3529)
* add tokenizer_exceptions word (ก-น) from https://goo.gl/JpJ2qq

* update tokenizer_exceptions word list

* add contributor file
2019-04-03 09:13:36 +02:00
svlandeg
85b4319f33 specify encoding in files 2019-04-02 15:05:31 +02:00
svlandeg
673c81bbb4 unicode string for python 2.7 2019-04-02 13:52:07 +02:00
svlandeg
eca9cc5417 fixing Issue #3521 by adding all hyphen variants for each stopword 2019-04-02 13:24:59 +02:00
svlandeg
e7062cf699 failing test for Issue #3521 2019-04-02 13:15:35 +02:00
svlandeg
1424b12b09 failing test for Issue #3449 2019-04-02 13:06:37 +02:00
jeannefukumaru
6cdb7b2e04 added tag_map for indonesian (#3515)
* added tag_map for indonesian

* changed tag map from .py to .txt to see if tests pass

* added symbols import

* added utf8 encoding flag

* added missing SCONJ symbol

* Auto-format

* Remove unused imports

* Make tag map available in Indonesian defaults
2019-04-01 12:27:48 +02:00
Ines Montani
c23e234d65 Auto-format 2019-04-01 12:11:27 +02:00
Ines Montani
0a0b1087b0 Make tag map available in Indonesian defaults 2019-04-01 11:46:51 +02:00
Ines Montani
5d9212c44c Remove unused imports 2019-04-01 11:46:25 +02:00
Ines Montani
8d6b544632 Auto-format 2019-04-01 11:45:43 +02:00
jeannefukumaru
6567f27849 added missing SCONJ symbol 2019-04-01 17:02:53 +08:00
jeannefukumaru
082a0a2232 added utf8 encoding flag 2019-04-01 16:37:11 +08:00
jeannefukumaru
a741bed7a7 added symbols import 2019-04-01 16:21:06 +08:00
jeannefukumaru
745cf0c914 changed tag map from .py to .txt to see if tests pass 2019-04-01 07:04:50 +08:00
jeannefukumaru
3cc897102f added tag_map for indonesian 2019-04-01 00:00:08 +08:00
Matthew Honnibal
e64b241f9c Merge branch 'master' of https://github.com/explosion/spaCy 2019-03-31 13:58:38 +02:00
Ines Montani
68900066e0 Merge pull request #3459 from svlandeg/feature/el-framework
Basic framework and APIs for entity linker
2019-03-29 14:02:22 +01:00
Hiromu Hota
914b9ff3d2 Tags are joined with a comma and padded with asterisks (#3491)

## Description

Fix a bug in the test of JapaneseTokenizer.
This PR may require @polm's review.

### Types of change

Bug fix

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-28 16:17:31 +01:00
Samuel Kane
06a1846379 fix(util): fix decaying function output (#3495)
* fix(util): fix decaying function output

* fix(util): better test and adhere to code standards

* fix(util): correct variable name, pytestify test, update website text
2019-03-28 13:24:47 +01:00
Duygu Altinok
5a7bc6b39d Fix/irreg adverbs extension (#3499)
* extended list of irreg adverbs

* added test to exceptions

* fixed typo
2019-03-28 13:23:33 +01:00
Bharat Raghunathan
1db3e47509 DOC: Update tokenizer docs to include default value for batch_size in pipe (#3492) 2019-03-28 12:48:02 +01:00
Matthew Honnibal
f77bf2bdb1 Fix GPU training for textcat. Closes #3473 2019-03-26 13:36:11 +01:00
Sofie
a4a6bfa4e1 Merge branch 'master' into feature/el-framework 2019-03-26 11:00:02 +01:00
svlandeg
8814b9010d entity as one field instead of both ID and name 2019-03-25 18:10:41 +01:00
Wannaphong Phatthiyaphaibun
297a051992 Update Thai tag map (#3480)
* Update Thai tag map

Update Thai tag map

* Create wannaphongcom.md
2019-03-25 16:53:26 +01:00
Matthew Honnibal
85dcd9477e Set version to v2.1.3 2019-03-23 16:47:57 +01:00
Matthew Honnibal
f436efd8a4 Small tweak to ensemble textcat model 2019-03-23 16:47:26 +01:00