mirror of https://github.com/explosion/spaCy.git synced 2025-04-07 10:44:15 +03:00
11563 Commits

Author SHA1 Message Date
Mark Neumann
27a1cd3c63
fix meta serialization in train ()
Co-authored-by: Mark Neumann <markng@allenai.org>
2020-07-12 22:06:46 +02:00
Adriane Boyd
0a62098c5f
Fix lemmatizer is_base_form for python2.7 ()
* Fix lemmatizer init args for python2.7

* Move English is_base_form to a class method

* Skip test pickling PhraseMatcher for python2
2020-07-09 22:11:24 +02:00
Adriane Boyd
923affd091
Remove is_base_form from French lemmatizer ()
Remove English-specific is_base_form from French lemmatizer.
2020-07-09 22:11:13 +02:00
Ines Montani
3d83721551
Merge pull request from gandersen101/fix-spaczz-universe-typo 2020-07-08 11:35:40 +02:00
gandersen101
893133873d Fix quote issue in spaczz universe.json 2020-07-07 19:16:28 -05:00
Ines Montani
109849bd31 Fix and update universe.json [ci skip] 2020-07-07 21:12:28 +02:00
gandersen101
9097549227
Adding spaczz package to universe.json ()
* Adding spaczz package to universe.json

* Adding contributor agreement.
2020-07-07 20:55:24 +02:00
Jonathan Besomi
546f3d10d4
Add texthero to universe.json ()
* Add texthero to universe.json

* Add spaCy contributor Agreement
2020-07-07 20:54:22 +02:00
Mike Izbicki
7a2ca00794
fix bug in Korean language, resulting in 100x speedup by reducing overhead of mecab ()
* speed up Korean nlp 100x by stopping mecab from reloading on each doc

* add contributor agreement

* rename variables to improve code readability
2020-07-06 17:03:33 +02:00
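
The speedup in commit 7a2ca00794 is described as coming from not reloading mecab for each document. A minimal sketch of that general pattern follows. `FakeTagger`, `ReloadingTokenizer`, and `CachingTokenizer` are hypothetical names made up for illustration and are not spaCy or mecab APIs; the counter only stands in for the cost of constructing a real tagger.

```python
class FakeTagger:
    """Hypothetical stand-in for a tagger that is expensive to construct."""

    instances_created = 0  # counts constructions so the two patterns can be compared

    def __init__(self):
        FakeTagger.instances_created += 1

    def parse(self, text):
        # Whitespace splitting is a placeholder for real morphological analysis.
        return text.split()


class ReloadingTokenizer:
    """Slow pattern: a new tagger is constructed for every document."""

    def __call__(self, text):
        tagger = FakeTagger()
        return tagger.parse(text)


class CachingTokenizer:
    """Fast pattern: the tagger is constructed once and reused."""

    def __init__(self):
        self._tagger = FakeTagger()

    def __call__(self, text):
        return self._tagger.parse(text)
```

With this sketch, processing N documents constructs N taggers in the reloading version and exactly one in the caching version, which is where a large constant-factor saving would come from when construction dominates the per-document cost.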
graue70
9860b8399e
Fix typo in test function docstring () 2020-07-05 15:49:06 +02:00
Matthew Honnibal
3e78e82a83
Experimental character-based pretraining ()
* Use cosine loss in Cloze multitask

* Fix char_embed for gpu

* Call resume_training for base model in train CLI

* Fix bilstm_depth default in pretrain command

* Implement character-based pretraining objective

* Use chars loss in ClozeMultitask

* Add method to decode predicted characters

* Fix number characters

* Rescale gradients for mlm

* Fix char embed+vectors in ml

* Fix pipes

* Fix pretrain args

* Move get_characters_loss

* Fix import

* Fix import

* Mention characters loss option in pretrain

* Remove broken 'self attention' option in pretrain

* Revert "Remove broken 'self attention' option in pretrain"

This reverts commit 56b820f6af.

* Document 'characters' objective of pretrain
2020-07-05 15:48:39 +02:00
Adriane Boyd
86d13a9fb8
Set version to 2.3.1 () 2020-07-03 13:38:41 +02:00
Matthias Hertel
2fb9bd795d
Fixed vocabulary in the entity linker training example ()
* entity linker training example: model loading changed according to issue 5668 (https://github.com/explosion/spaCy/issues/5668) + vocab_path is a required argument

* contributor agreement
2020-07-03 10:24:02 +02:00
Adriane Boyd
a77c4c3465
Add strings and ENT_KB_ID to Doc serialization ()
* Add strings for all writeable Token attributes to `Doc.to/from_bytes()`.
* Add ENT_KB_ID to default attributes.
2020-07-02 17:11:57 +02:00
Adriane Boyd
971826a96d
Include git commit in package and model meta ()
* Include git commit in package and model meta

* Rewrite to read file in setup

* Fix file handle
2020-07-02 17:10:27 +02:00
Adriane Boyd
2bd78c39e3
Fix multiple context managers in examples () 2020-07-02 10:36:07 +02:00
Ines Montani
6bc643d2e2 Update netlify.toml [ci skip] 2020-07-01 21:34:17 +02:00
Ines Montani
f2a932a60c Update netlify.toml [ci skip] 2020-07-01 13:34:35 +02:00
Álvaro Abella Bascarán
ff0dbe5c64
Fix in docs: pipe(docs) instead of pipe(texts) ()
Very minor fix in docs, specifically in this part:

```
matcher = PhraseMatcher(nlp.vocab)
for doc in matcher.pipe(texts, batch_size=50):
    pass
```

`texts` suggests the input is an iterable of strings. I replaced it with `docs`.
2020-06-30 20:00:50 +02:00
Matthias Hertel
8b0f749606
Website: fixed the token span in the text about the rule-based matching example ()
* fixed token span in pattern matcher example

* contributor agreement
2020-06-30 19:58:23 +02:00
Matthew Honnibal
2d715451a2
Revert "Convert custom user_data to token extension format for Japanese tokenizer ()" ()
This reverts commit 1dd38191ec.
2020-06-29 14:34:15 +02:00
Adriane Boyd
1dd38191ec
Convert custom user_data to token extension format for Japanese tokenizer ()
* Convert custom user_data to token extension format

Convert the user_data values so that they can be loaded as custom token
extensions for `inflection`, `reading_form`, `sub_tokens`, and `lemma`.

* Reset Underscore state in ja tokenizer tests
2020-06-29 14:20:26 +02:00
Adriane Boyd
167df42cb6
Move lemmatizer is_base_form to language settings ()
Move `Lemmatizer.is_base_form` to the language settings so that each
language can provide a language-specific method as
`LanguageDefaults.is_base_form`.

The existing English-specific `Lemmatizer.is_base_form` is moved to
`EnglishDefaults`.
2020-06-29 14:16:57 +02:00
Adriane Boyd
c4d0209472
Extend v2.3 migration guide ()
* Extend preloaded vocab section

* Add section on tag maps
2020-06-26 14:12:29 +02:00
PluieElectrique
90c7eb0e2f
Reduce memory usage of Lookup's BloomFilter ()
* Reduce memory usage of Lookup's BloomFilter

* Remove extra Table update
2020-06-26 14:09:10 +02:00
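
The message for commit 90c7eb0e2f does not say how the memory was reduced. One common way to shrink a Bloom filter, shown here purely as an assumption, is to size it from the number of entries actually stored instead of a large fixed capacity. `SimpleBloom` and `optimal_bits` are hypothetical names for this sketch and are not the implementation used by spaCy's lookups.

```python
import hashlib
import math


def optimal_bits(n_items, error_rate):
    """Standard Bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits."""
    return max(1, math.ceil(-n_items * math.log(error_rate) / (math.log(2) ** 2)))


class SimpleBloom:
    """Toy Bloom filter sized for an expected number of items (illustrative only)."""

    def __init__(self, n_items, error_rate=1e-4):
        n_items = max(1, n_items)
        self.n_bits = optimal_bits(n_items, error_rate)
        # Optimal number of hash functions: k = (m / n) * ln 2
        self.n_hashes = max(1, round(self.n_bits / n_items * math.log(2)))
        self.bits = bytearray((self.n_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from two 64-bit hash values.
        digest = hashlib.sha256(key.encode("utf8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )
```

Under these assumptions, a filter built for a table of 100 keys allocates far fewer bytes than one built for a default of a million keys, while keys that were added are still always reported as present.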
Adriane Boyd
b7107ac89f
Disregard special tag _SP in check for new tag map ()
* Skip special tag _SP in check for new tag map

In `Tagger.begin_training()` check for new tags aside from `_SP` in the
new tag map initialized from the provided gold tuples when determining
whether to reinitialize the morphology with the new tag map.

* Simplify _SP check
2020-06-26 09:23:21 +02:00
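
The check described in commit b7107ac89f can be pictured roughly as below. This is a sketch under assumptions, not the code in `Tagger.begin_training()`: `has_new_tags` is a hypothetical helper, and the tag maps are plain dictionaries keyed by tag name.

```python
def has_new_tags(current_tag_map, gold_tags, ignore=("_SP",)):
    """Return True if gold_tags contains a tag that is missing from
    current_tag_map, disregarding the special tags listed in ignore."""
    new_tags = set(gold_tags) - set(current_tag_map) - set(ignore)
    return bool(new_tags)
```

In this sketch, gold data that only adds `_SP` would not count as introducing new tags, so nothing would need to be reinitialized, whereas any other unseen tag would.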
Adriane Boyd
fd4287c178
Fix backslashes in warnings config diff ()
Fix backslashes in warnings config diff in v2.3 migration section.
2020-06-24 10:26:12 +02:00
Adriane Boyd
6fe6e761de
Skip vocab in component config overrides () 2020-06-23 23:21:11 +02:00
Adriane Boyd
7ce451c211
Extend what's new in v2.3 with vocab / is_oov () 2020-06-23 16:48:59 +02:00
Adriane Boyd
d94e961f14
Fix polarity of Token.is_oov and Lexeme.is_oov ()
Fix `Token.is_oov` and `Lexeme.is_oov` so they return `True` when the
lexeme does **not** have a vector.
2020-06-23 13:29:51 +02:00
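
The polarity fixed in commit d94e961f14 can be illustrated with a toy model. `ToyLexeme` is a hypothetical class for this sketch and not spaCy's `Lexeme`; the only behavior taken from the commit message is that `is_oov` is `True` when no vector is available.

```python
class ToyLexeme:
    """Illustrative lexeme backed by a plain dict of word vectors."""

    def __init__(self, text, vectors):
        self.text = text
        self._vectors = vectors

    @property
    def has_vector(self):
        return self.text in self._vectors

    @property
    def is_oov(self):
        # Out-of-vocabulary means the lexeme does NOT have a vector.
        return not self.has_vector


vectors = {"apple": [0.1, 0.2]}
known = ToyLexeme("apple", vectors)
unknown = ToyLexeme("qwxzy", vectors)
```

Here `known.is_oov` is `False` and `unknown.is_oov` is `True`, which matches the corrected behavior the commit message describes.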
Richard Liaw
0ef78bad93
contribute () 2020-06-23 08:53:58 +02:00
Adriane Boyd
bc1cb30b21
Add warnings example in v2.3 migration guide () 2020-06-22 14:37:24 +02:00
Hiroshi Matsuda
150a39ccca
Japanese model: add user_dict entries and small refactor ()
* user_dict fields: adding inflections, reading_forms, sub_tokens
deleting: unidic_tags
improve code readability around the token alignment procedure

* add test cases, replace fugashi with sudachipy in conftest

* move bunsetu.py to spaCy Universe as a pipeline component BunsetuRecognizer

* tag is space -> both surface and tag are spaces

* consider len(text)==0
2020-06-22 14:32:25 +02:00
Rameshh
c34420794a
Add Nepali Language ()
* added support for nepali lang

* added examples and test files

* added spacy contributor agreement
2020-06-22 10:25:46 +02:00
Karen Hambardzumyan
66a4834e56
Some changes for Armenian ()
* Fixing numericals

* We need an Armenian question mark to make the sentence a question
2020-06-22 08:50:34 +02:00
Karen Hambardzumyan
ff6a084e9c
Create mahnerak.md () 2020-06-20 11:14:26 +02:00
Marat M. Yavrumyan
8120b641cc
Update lex_attrs.py () 2020-06-19 20:00:34 +02:00
Marat M. Yavrumyan
ccd7edf04b
Create myavrum.md () 2020-06-19 18:34:27 +02:00
Adriane Boyd
931d80de72
Warning for sudachipy 0.4.5 () 2020-06-19 12:43:41 +02:00
Ines Montani
6d712f3e06
Merge pull request from adrianeboyd/docs/v2.3.0-minor 2020-06-16 13:49:25 -07:00
Adriane Boyd
02369f91d3 Fix spacy convert argument 2020-06-16 20:41:17 +02:00
Adriane Boyd
f0fd77648f Change example title to Dr.
Change example title to Dr. so the current model does exclude the title
in the initial example.
2020-06-16 20:36:21 +02:00
Adriane Boyd
a6abdfbc3c Fix numpy.zeros() dtype for Doc.from_array 2020-06-16 20:35:45 +02:00
Adriane Boyd
9aff317ca7 Update POS in tagging example 2020-06-16 20:26:57 +02:00
Adriane Boyd
457babfa0c Update alignment example for new gold.align 2020-06-16 20:22:03 +02:00
Ines Montani
41003a5117 Update Binder version [ci skip] 2020-06-16 17:41:23 +02:00
Ines Montani
fd89f44c0c Update Binder URL [ci skip] 2020-06-16 17:34:26 +02:00
Ines Montani
44af53bdd9 Add pkuseg warnings and auto-format [ci skip] 2020-06-16 17:13:35 +02:00
Ines Montani
a9e5b840ee Fix typos and auto-format [ci skip] 2020-06-16 16:38:45 +02:00
Ines Montani
1d3e8b7578
Merge pull request from explosion/v2.3.x 2020-06-16 07:37:10 -07:00