Commit Graph

11708 Commits

Author SHA1 Message Date
Ines Montani
109849bd31 Fix and update universe.json [ci skip] 2020-07-07 21:12:28 +02:00
gandersen101
9097549227
Adding spaczz package to universe.json (#5717)
* Adding spaczz package to universe.json

* Adding contributor agreement.
2020-07-07 20:55:24 +02:00
Jonathan Besomi
546f3d10d4
Add texthero to universe.json (#5716)
* Add texthero to universe.json

* Add spaCy contributor Agreement
2020-07-07 20:54:22 +02:00
Mike Izbicki
7a2ca00794
fix bug in Korean language, resulting in 100x speedup by reducing overhead of mecab (#5701)
* speed up Korean nlp 100x by stopping mecab from reloading on each doc

* add contributor agreement

* rename variables to improve code readability
2020-07-06 17:03:33 +02:00
graue70
9860b8399e
Fix typo in test function docstring (#5696) 2020-07-05 15:49:06 +02:00
Matthew Honnibal
3e78e82a83
Experimental character-based pretraining (#5700)
* Use cosine loss in Cloze multitask

* Fix char_embed for gpu

* Call resume_training for base model in train CLI

* Fix bilstm_depth default in pretrain command

* Implement character-based pretraining objective

* Use chars loss in ClozeMultitask

* Add method to decode predicted characters

* Fix number characters

* Rescale gradients for mlm

* Fix char embed+vectors in ml

* Fix pipes

* Fix pretrain args

* Move get_characters_loss

* Fix import

* Fix import

* Mention characters loss option in pretrain

* Remove broken 'self attention' option in pretrain

* Revert "Remove broken 'self attention' option in pretrain"

This reverts commit 56b820f6af.

* Document 'characters' objective of pretrain
2020-07-05 15:48:39 +02:00
Adriane Boyd
86d13a9fb8
Set version to 2.3.1 (#5705) 2020-07-03 13:38:41 +02:00
Matthias Hertel
2fb9bd795d
Fixed vocabulary in the entity linker training example (#5676)
* entity linker training example: model loading changed according to issue 5668 (https://github.com/explosion/spaCy/issues/5668) + vocab_path is a required argument

* contributor agreement
2020-07-03 10:24:02 +02:00
Adriane Boyd
a77c4c3465
Add strings and ENT_KB_ID to Doc serialization (#5691)
* Add strings for all writeable Token attributes to `Doc.to/from_bytes()`.
* Add ENT_KB_ID to default attributes.
2020-07-02 17:11:57 +02:00
Adriane Boyd
971826a96d
Include git commit in package and model meta (#5694)
* Include git commit in package and model meta

* Rewrite to read file in setup

* Fix file handle
2020-07-02 17:10:27 +02:00
Adriane Boyd
2bd78c39e3
Fix multiple context manages in examples (#5690) 2020-07-02 10:36:07 +02:00
Ines Montani
6bc643d2e2 Update netlify.toml [ci skip] 2020-07-01 21:34:17 +02:00
Ines Montani
f2a932a60c Update netlify.toml [ci skip] 2020-07-01 13:34:35 +02:00
Álvaro Abella Bascarán
ff0dbe5c64
Fix in docs: pipe(docs) instead of pipe(texts) (#5680)
Very minor fix in docs, specifically in this part:

```
 matcher = PhraseMatcher(nlp.vocab)
>   for doc in matcher.pipe(texts, batch_size=50):
>       pass
```

`texts` suggests the input is an iterable of strings. I replaced it for `docs`.
2020-06-30 20:00:50 +02:00
Matthias Hertel
8b0f749606
Website: fixed the token span in the text about the rule-based matching example (#5669)
* fixed token span in pattern matcher example

* contributor agreement
2020-06-30 19:58:23 +02:00
Matthew Honnibal
2d715451a2
Revert "Convert custom user_data to token extension format for Japanese tokenizer (#5652)" (#5665)
This reverts commit 1dd38191ec.
2020-06-29 14:34:15 +02:00
Adriane Boyd
1dd38191ec
Convert custom user_data to token extension format for Japanese tokenizer (#5652)
* Convert custom user_data to token extension format

Convert the user_data values so that they can be loaded as custom token
extensions for `inflection`, `reading_form`, `sub_tokens`, and `lemma`.

* Reset Underscore state in ja tokenizer tests
2020-06-29 14:20:26 +02:00
Adriane Boyd
167df42cb6
Move lemmatizer is_base_form to language settings (#5663)
Move `Lemmatizer.is_base_form` to the language settings so that each
language can provide a language-specific method as
`LanguageDefaults.is_base_form`.

The existing English-specific `Lemmatizer.is_base_form` is moved to
`EnglishDefaults`.
2020-06-29 14:16:57 +02:00
Adriane Boyd
c4d0209472
Extend v2.3 migration guide (#5653)
* Extend preloaded vocab section

* Add section on tag maps
2020-06-26 14:12:29 +02:00
PluieElectrique
90c7eb0e2f
Reduce memory usage of Lookup's BloomFilter (#5606)
* Reduce memory usage of Lookup's BloomFilter

* Remove extra Table update
2020-06-26 14:09:10 +02:00
Adriane Boyd
b7107ac89f
Disregard special tag _SP in check for new tag map (#5641)
* Skip special tag  _SP in check for new tag map

In `Tagger.begin_training()` check for new tags aside from `_SP` in the
new tag map initialized from the provided gold tuples when determining
whether to reinitialize the morphology with the new tag map.

* Simplify _SP check
2020-06-26 09:23:21 +02:00
Adriane Boyd
fd4287c178
Fix backslashes in warnings config diff (#5640)
Fix backslashes in warnings config diff in v2.3 migration section.
2020-06-24 10:26:12 +02:00
Adriane Boyd
6fe6e761de
Skip vocab in component config overrides (#5624) 2020-06-23 23:21:11 +02:00
Adriane Boyd
7ce451c211
Extend what's new in v2.3 with vocab / is_oov (#5635) 2020-06-23 16:48:59 +02:00
Adriane Boyd
d94e961f14
Fix polarity of Token.is_oov and Lexeme.is_oov (#5634)
Fix `Token.is_oov` and `Lexeme.is_oov` so they return `True` when the
lexeme does **not** have a vector.
2020-06-23 13:29:51 +02:00
Richard Liaw
0ef78bad93
contribute (#5632) 2020-06-23 08:53:58 +02:00
Adriane Boyd
bc1cb30b21
Add warnings example in v2.3 migration guide (#5627) 2020-06-22 14:37:24 +02:00
Hiroshi Matsuda
150a39ccca
Japanese model: add user_dict entries and small refactor (#5573)
* user_dict fields: adding inflections, reading_forms, sub_tokens
deleting: unidic_tags
improve code readability around the token alignment procedure

* add test cases, replace fugashi with sudachipy in conftest

* move bunsetu.py to spaCy Universe as a pipeline component BunsetuRecognizer

* tag is space -> both surface and tag are spaces

* consider len(text)==0
2020-06-22 14:32:25 +02:00
Rameshh
c34420794a
Add Nepali Language (#5622)
* added support for nepali lang

* added examples and test files

* added spacy contributor agreement
2020-06-22 10:25:46 +02:00
Karen Hambardzumyan
66a4834e56
Some changes for Armenian (#5616)
* Fixing numericals

* We need a Armenian question sign to make the sentence a question
2020-06-22 08:50:34 +02:00
Karen Hambardzumyan
ff6a084e9c
Create mahnerak.md (#5615) 2020-06-20 11:14:26 +02:00
Marat M. Yavrumyan
8120b641cc
Update lex_attrs.py (#5608) 2020-06-19 20:00:34 +02:00
Marat M. Yavrumyan
ccd7edf04b
Create myavrum.md (#5612) 2020-06-19 18:34:27 +02:00
Adriane Boyd
931d80de72
Warning for sudachipy 0.4.5 (#5611) 2020-06-19 12:43:41 +02:00
Ines Montani
6d712f3e06
Merge pull request #5599 from adrianeboyd/docs/v2.3.0-minor 2020-06-16 13:49:25 -07:00
Adriane Boyd
02369f91d3 Fix spacy convert argument 2020-06-16 20:41:17 +02:00
Adriane Boyd
f0fd77648f Change example title to Dr.
Change example title to Dr. so the current model does exclude the title
in the initial example.
2020-06-16 20:36:21 +02:00
Adriane Boyd
a6abdfbc3c Fix numpy.zeros() dtype for Doc.from_array 2020-06-16 20:35:45 +02:00
Adriane Boyd
9aff317ca7 Update POS in tagging example 2020-06-16 20:26:57 +02:00
Adriane Boyd
457babfa0c Update alignment example for new gold.align 2020-06-16 20:22:03 +02:00
Ines Montani
41003a5117 Update Binder version [ci skip] 2020-06-16 17:41:23 +02:00
Ines Montani
fd89f44c0c Update Binder URL [ci skip] 2020-06-16 17:34:26 +02:00
Ines Montani
44af53bdd9 Add pkuseg warnings and auto-format [ci skip] 2020-06-16 17:13:35 +02:00
Ines Montani
a9e5b840ee Fix typos and auto-format [ci skip] 2020-06-16 16:38:45 +02:00
Ines Montani
1d3e8b7578
Merge pull request #5595 from explosion/v2.3.x 2020-06-16 07:37:10 -07:00
Ines Montani
e9d3e177f0 Merge branch 'master' into v2.3.x 2020-06-16 16:31:38 +02:00
Ines Montani
bb54f54369 Fix model accuracy table [ci skip] 2020-06-16 16:10:12 +02:00
Adriane Boyd
d5110ffbf2
Documentation updates for v2.3.0 (#5593)
* Update website models for v2.3.0

* Add docs for Chinese word segmentation

* Tighten up Chinese docs section

* Merge branch 'master' into docs/v2.3.0 [ci skip]

* Merge branch 'master' into docs/v2.3.0 [ci skip]

* Auto-format and update version

* Update matcher.md

* Update languages and sorting

* Typo in landing page

* Infobox about token_match behavior

* Add meta and basic docs for Japanese

* POS -> TAG in models table

* Add info about lookups for normalization

* Updates to API docs for v2.3

* Update adding norm exceptions for adding languages

* Add --omit-extra-lookups to CLI API docs

* Add initial draft of "What's New in v2.3"

* Add new in v2.3 tags to Chinese and Japanese sections

* Add tokenizer to migration section

* Add new in v2.3 flags to init-model

* Typo

* More what's new in v2.3

Co-authored-by: Ines Montani <ines@ines.io>
2020-06-16 15:37:35 +02:00
Matthew Honnibal
7ff447c5a0 Set version to v2.3.0 2020-06-15 18:22:25 +02:00
Adriane Boyd
0d8405aafa Updates to docstrings (#5589) 2020-06-15 14:58:36 +02:00