Commit Graph

9978 Commits

Author SHA1 Message Date
Kamolsit Mongkolsrisawat
dcc67f3f51 Update Thai tokenizer_exception list (#3529)
* add tokenizer_exceptions word (ก-น) from https://goo.gl/JpJ2qq

* update tokenizer_exceptions word list

* add contributor file
2019-04-03 09:13:36 +02:00
ivigamberdiev
5e5641616d Update links and http -> https (#3532)
* update links and http -> https

* SCA
2019-04-02 17:36:22 +02:00
Ines Montani
24cecdb44f Update compatibility [ci skip] 2019-04-01 16:25:16 +02:00
jeannefukumaru
6cdb7b2e04 added tag_map for indonesian (#3515)
* added tag_map for indonesian

* changed tag map from .py to .txt to see if tests pass

* added symbols import

* added utf8 encoding flag

* added missing SCONJ symbol

* Auto-format

* Remove unused imports

* Make tag map available in Indonesian defaults
2019-04-01 12:27:48 +02:00
Ines Montani
c23e234d65 Auto-format 2019-04-01 12:11:27 +02:00
Ines Montani
5821b020d5 Merge branch 'spacy.io' 2019-04-01 11:47:59 +02:00
Matthew Honnibal
e64b241f9c Merge branch 'master' of https://github.com/explosion/spaCy 2019-03-31 13:58:38 +02:00
Ines Montani
b070e0caf7 Update landing.js 2019-03-30 22:26:46 +01:00
Ines Montani
9d1221943b Merge branch 'master' into spacy.io 2019-03-30 20:32:14 +01:00
Ines Montani
037ffdfd3f Add spaCy IRL to landing [ci skip] 2019-03-30 20:32:03 +01:00
Ines Montani
68900066e0 Merge pull request #3459 from svlandeg/feature/el-framework
Basic framework and APIs for entity linker
2019-03-29 14:02:22 +01:00
Hiromu Hota
914b9ff3d2 Tags are joined with a comma and padded with asterisks (#3491)

## Description

Fix a bug in the test of JapaneseTokenizer.
This PR may require @polm's review.

### Types of change

Bug fix

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-28 16:17:31 +01:00
Ines Montani
730f759b4f Merge branch 'master' into spacy.io 2019-03-28 15:26:17 +01:00
Ines Montani
7d033a7b89 Fix meta description in universe projects [ci skip] 2019-03-28 15:26:01 +01:00
Ines Montani
fe2cb642ac Merge branch 'master' into spacy.io 2019-03-28 15:13:39 +01:00
David
74e738dd4d adds textpipe to universe (#3500) [ci skip]
* Adds textpipe to universe

* signed contributor agreement

* Adjust formatting, code style and use "standalone" category
2019-03-28 15:13:19 +01:00
Ines Montani
04a9fb1a02 Merge branch 'master' into spacy.io 2019-03-28 13:34:46 +01:00
Samuel Kane
06a1846379 fix(util): fix decaying function output (#3495)
* fix(util): fix decaying function output

* fix(util): better test and adhere to code standards

* fix(util): correct variable name, pytestify test, update website text
2019-03-28 13:24:47 +01:00
Duygu Altinok
5a7bc6b39d Fix/irreg adverbs extension (#3499)
* extended list of irreg adverbs

* added test to exceptions

* fixed typo
2019-03-28 13:23:33 +01:00
Bharat Raghunathan
1db3e47509 DOC: Update tokenizer docs to include default value for batch_size in pipe (#3492) 2019-03-28 12:48:02 +01:00
Ines Montani
2ed16d82bf Fix social image 2019-03-26 18:27:40 +01:00
Matthew Honnibal
f77bf2bdb1 Fix GPU training for textcat. Closes #3473 2019-03-26 13:36:11 +01:00
Sofie
a4a6bfa4e1 Merge branch 'master' into feature/el-framework 2019-03-26 11:00:02 +01:00
svlandeg
8814b9010d entity as one field instead of both ID and name 2019-03-25 18:10:41 +01:00
Ines Montani
9e14b2b69f Add Estonian to docs [ci skip] (closes #3482) 2019-03-25 18:01:54 +01:00
Wannaphong Phatthiyaphaibun
297a051992 Update Thai tag map (#3480)
* Update Thai tag map

* Create wannaphongcom.md
2019-03-25 16:53:26 +01:00
Ines Montani
21ade53ef7 Merge branch 'master' into spacy.io 2019-03-25 13:05:00 +01:00
Ines Montani
db938ab0e3 Update favicon (closes #3475) [ci skip] 2019-03-25 13:04:47 +01:00
Ines Montani
c8c1baaea8 Update binderVersion 2019-03-25 12:17:03 +01:00
Matthew Honnibal
85dcd9477e Set version to v2.1.3 2019-03-23 16:47:57 +01:00
Matthew Honnibal
f436efd8a4 Small tweak to ensemble textcat model 2019-03-23 16:47:26 +01:00
Ines Montani
200d8bdb3c Merge branch 'spacy.io' [ci skip] 2019-03-23 16:46:34 +01:00
Ines Montani
1e5b917d75 Fix formatting [ci skip] 2019-03-23 16:45:50 +01:00
Matthew Honnibal
6c783f8045 Bug fixes and options for TextCategorizer (#3472)
* Fix code for bag-of-words feature extraction

The _ml.py module had a redundant copy of a function to extract unigram
bag-of-words features, except one had a bug that set values to 0.
Another function allowed extraction of bigram features. Replace all three
with a new function that supports arbitrary ngram sizes and also allows
control of which attribute is used (e.g. ORTH, LOWER, etc).

* Support 'bow' architecture for TextCategorizer

This allows efficient ngram bag-of-words models, which are better when
the classifier needs to run quickly, especially when the texts are long.
Pass architecture="bow" to use it. The extra arguments ngram_size and
attr are also available, e.g. ngram_size=2 means unigram and bigram
features will be extracted.

* Fix size limits in train_textcat example

* Explain architectures better in docs
2019-03-23 16:44:44 +01:00
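The message for 6c783f8045 says that ngram_size=2 extracts both unigram and bigram features, and that the attribute used (e.g. ORTH or LOWER) is configurable. A minimal sketch of that idea in plain Python follows. It is not spaCy's implementation: the function name `extract_ngrams` and the `lowercase` flag (standing in for the choice between ORTH and LOWER) are hypothetical, chosen only for illustration.

```python
from collections import Counter
from typing import List


def extract_ngrams(tokens: List[str], ngram_size: int = 1, lowercase: bool = False) -> Counter:
    """Count every n-gram of length 1..ngram_size.

    Hypothetical helper for illustration only, not part of spaCy.
    `lowercase=True` loosely mimics using LOWER instead of ORTH.
    """
    if ngram_size < 1:
        raise ValueError("ngram_size must be >= 1")
    if lowercase:
        tokens = [t.lower() for t in tokens]
    counts: Counter = Counter()
    # For each n up to ngram_size, slide a window of width n over the tokens.
    for n in range(1, ngram_size + 1):
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return counts


# ngram_size=2 yields unigrams and bigrams together.
features = extract_ngrams(["The", "cat", "sat"], ngram_size=2, lowercase=True)
for ngram, count in sorted(features.items()):
    print(ngram, count)
```

Keying the counts on tuples keeps unigrams and bigrams in one table without ambiguity. A real feature extractor would more likely hash the n-grams into a fixed-size vector, which this sketch does not attempt.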
Ines Montani
5944cf10c7 Add blog post to v2.1 page 2019-03-23 16:34:23 +01:00
Ines Montani
ffebdad08d Add cheat sheet to spaCy 101 2019-03-23 16:32:55 +01:00
Ines Montani
06bf130890 💫 Add better and serializable sentencizer (#3471)
* Add better serializable sentencizer component

* Replace default factory

* Add tests

* Tidy up

* Pass test

* Update docs
2019-03-23 15:45:02 +01:00
Matthew Honnibal
d9a07a7f6e 💫 Fix class mismap on parser deserializing (closes #3433) (#3470)
v2.1 introduced a regression when deserializing the parser after
parser.add_label() had been called. The code around the class mapping is
pretty confusing currently, as it was written to accommodate backwards
model compatibility. It needs to be revised when the models are next
retrained.

Closes #3433
2019-03-23 13:46:25 +01:00
Matthew Honnibal
444a3abfe5 Add xfail test for #3433. Improve test for add label. 2019-03-23 12:36:00 +01:00
Ines Montani
6b6e9b638e Fix test for #3468 2019-03-23 11:24:29 +01:00
Ines Montani
fbec72b4c3 Slightly modify test for #3468
Check for Token.is_sent_start first (which is serialized/deserialized correctly)
2019-03-23 11:22:44 +01:00
Ines Montani
02d9378d8c Add xfailing test for #3468 2019-03-23 11:19:11 +01:00
Ines Montani
ed91592726 Merge branch 'master' into spacy.io 2019-03-22 19:02:26 +01:00
Ines Montani
dcd6e06c47 Improve landing example [ci skip] 2019-03-22 19:02:15 +01:00
Ines Montani
c2bb39dcb4 Merge branch 'master' into spacy.io 2019-03-22 18:50:16 +01:00
Ines Montani
a841324034 Update landing example [ci skip] 2019-03-22 18:50:00 +01:00
Ines Montani
a9ad735241 Merge branch 'master' into spacy.io 2019-03-22 18:36:28 +01:00
Ines Montani
b532386a60 Fix typo [ci skip] 2019-03-22 18:36:17 +01:00
Ines Montani
7b5496027b Merge branch 'master' into spacy.io 2019-03-22 18:21:16 +01:00
Ines Montani
d8533f0149 Update Binder [ci skip] 2019-03-22 18:16:46 +01:00