Commit Graph

11787 Commits

Author SHA1 Message Date
Ines Montani
1d3e8b7578
Merge pull request #5595 from explosion/v2.3.x 2020-06-16 07:37:10 -07:00
Ines Montani
e9d3e177f0 Merge branch 'master' into v2.3.x 2020-06-16 16:31:38 +02:00
Ines Montani
bb54f54369 Fix model accuracy table [ci skip] 2020-06-16 16:10:12 +02:00
Adriane Boyd
d5110ffbf2
Documentation updates for v2.3.0 (#5593)
* Update website models for v2.3.0

* Add docs for Chinese word segmentation

* Tighten up Chinese docs section

* Merge branch 'master' into docs/v2.3.0 [ci skip]

* Merge branch 'master' into docs/v2.3.0 [ci skip]

* Auto-format and update version

* Update matcher.md

* Update languages and sorting

* Typo in landing page

* Infobox about token_match behavior

* Add meta and basic docs for Japanese

* POS -> TAG in models table

* Add info about lookups for normalization

* Updates to API docs for v2.3

* Update adding norm exceptions for adding languages

* Add --omit-extra-lookups to CLI API docs

* Add initial draft of "What's New in v2.3"

* Add new in v2.3 tags to Chinese and Japanese sections

* Add tokenizer to migration section

* Add new in v2.3 flags to init-model

* Typo

* More what's new in v2.3

Co-authored-by: Ines Montani <ines@ines.io>
2020-06-16 15:37:35 +02:00
Matthew Honnibal
7ff447c5a0 Set version to v2.3.0 2020-06-15 18:22:25 +02:00
Adriane Boyd
0d8405aafa Updates to docstrings (#5589) 2020-06-15 14:58:36 +02:00
Adriane Boyd
e867e9fa8f Fix and add warnings related to spacy-lookups-data (#5588)
* Fix warning message for lemmatization tables

* Add a warning when the `lexeme_norm` table is empty. (Given the
relatively lang-specific loading for `Lookups`, it seemed like too much
overhead to dynamically extract the list of languages, so for now it's
hard-coded.)
2020-06-15 14:58:29 +02:00
Arvind Srinivasan
f698007907 Added Tamil Example Sentences (#5583)
* Added Examples for Tamil Sentences

#### Description
This PR adds example sentences for the Tamil language, which were missing as per issue #1107.

#### Type of Change
This is an enhancement.

* Accepting spaCy Contributor Agreement

* Signed on my behalf as an individual
2020-06-15 14:58:21 +02:00
Adriane Boyd
c94f7d0e75
Updates to docstrings (#5589) 2020-06-15 14:56:51 +02:00
Adriane Boyd
c482f20778
Fix and add warnings related to spacy-lookups-data (#5588)
* Fix warning message for lemmatization tables

* Add a warning when the `lexeme_norm` table is empty. (Given the
relatively lang-specific loading for `Lookups`, it seemed like too much
overhead to dynamically extract the list of languages, so for now it's
hard-coded.)
2020-06-15 14:56:04 +02:00
Arvind Srinivasan
aa5b40fa64
Added Tamil Example Sentences (#5583)
* Added Examples for Tamil Sentences

#### Description
This PR adds example sentences for the Tamil language, which were missing as per issue #1107.

#### Type of Change
This is an enhancement.

* Accepting spaCy Contributor Agreement

* Signed on my behalf as an individual
2020-06-13 15:56:26 +02:00
theudas
3f5e2f9d99 Added Parameter to NEL to take n sentences into account (#5548)
* added setting for neighbour sentence in NEL

* added spaCy contributor agreement

* added multi sentence also for training

* made the try-except block smaller
2020-06-12 15:15:03 +02:00
adrianeboyd
4724fa4cf4 Expand Japanese requirements warning (#5572)
Include explicit install instructions in Japanese requirements warning.
2020-06-12 15:14:55 +02:00
adrianeboyd
44967a3f9c Update pytest conf for sudachipy with Japanese (#5574) 2020-06-12 15:14:47 +02:00
Matthew Honnibal
a1c5b694be Small fixes to train defaults 2020-06-12 02:22:13 +02:00
theudas
fa46e0bef2
Added Parameter to NEL to take n sentences into account (#5548)
* added setting for neighbour sentence in NEL

* added spaCy contributor agreement

* added multi sentence also for training

* made the try-except block smaller
2020-06-12 02:03:23 +02:00
Sofie Van Landeghem
c0f4a1e43b
train is from-config by default (#5575)
* verbose and tag_map options

* adding init_tok2vec option and only changing the tok2vec that is specified

* adding omit_extra_lookups and verifying textcat config

* wip

* pretrain bugfix

* add replace and resume options

* train_textcat fix

* raw text functionality

* improve UX when KeyError or when input data can't be parsed

* avoid unnecessary access to goldparse in TextCat pipe

* save performance information in nlp.meta

* add noise_level to config

* move nn_parser's defaults to config file

* multitask in config - doesn't work yet

* scorer offering both F and AUC options, need to be specified in config

* add textcat verification code from old train script

* small fixes to config files

* clean up

* set default config for ner/parser to allow create_pipe to work as before

* two more test fixes

* small fixes

* cleanup

* fix NER pickling + additional unit test

* create_pipe as before
2020-06-12 02:02:07 +02:00
Sofie Van Landeghem
18c6dc8093
removing label both on comment and on close 2020-06-11 14:09:40 +02:00
adrianeboyd
556895177e
Expand Japanese requirements warning (#5572)
Include explicit install instructions in Japanese requirements warning.
2020-06-11 13:47:37 +02:00
adrianeboyd
fe167fcf7d
Update pytest conf for sudachipy with Japanese (#5574) 2020-06-11 10:23:50 +02:00
Jones Martins
bab30e4ad2
Add "c'mon" token exception (#5570)
* Add "c'mon" exception

* Fix typo in "C'mon" exception
2020-06-10 21:54:06 +02:00
Jones Martins
28db7dd5d9
Add missing pronouns/determiners (#5569)
* Add missing pronouns/determiners

* Add test for missing pronouns

* Add contributor file
2020-06-10 18:47:04 +02:00
Sofie Van Landeghem
12c1965070
set delay to 7 days 2020-06-10 10:46:12 +02:00
adrianeboyd
0a70bd6281
Bump version to 2.3.0.dev1 (#5567) 2020-06-09 15:47:31 +02:00
adrianeboyd
b7e6e1b9a7
Disable sentence segmentation in ja tokenizer (#5566) 2020-06-09 12:00:59 +02:00
Sofie Van Landeghem
86112d2168
update issue manager's version 2020-06-09 08:57:38 +02:00
adrianeboyd
f162815f45
Handle empty and whitespace-only docs for Japanese (#5564)
Handle empty and whitespace-only docs in the custom alignment method
used by the Japanese tokenizer.
2020-06-08 21:09:23 +02:00
Martino Mensio
de00f967ce
adding spacy-universal-sentence-encoder (#5534)
* adding spacy-universal-sentence-encoder

* update affiliation

* updated code example
2020-06-08 20:26:30 +02:00
Sofie Van Landeghem
d1799da200
bot for answered issues (#5563)
* add tiangolo's issue manager

* fix formatting

* spaces, tabs, who knows

* formatting

* I'll get this right at some point

* maybe one more space ?
2020-06-08 19:47:32 +02:00
adrianeboyd
3bf111585d
Update Japanese tokenizer config and add serialization (#5562)
* Use `config` dict for tokenizer settings
* Add serialization of split mode setting
* Add tests for tokenizer split modes and serialization of split mode
setting

Based on #5561
2020-06-08 16:29:05 +02:00
Hiroshi Matsuda
456bf47f51
fix a bug causing mis-alignments (#5560) 2020-06-08 15:49:34 +02:00
adrianeboyd
009119fa66
Requirements/setup for Japanese (#5553)
* Add sudachipy and sudachidict_core to Makefile

* Switch ja requirements from fugashi to sudachipy
2020-06-06 00:22:18 +02:00
Ines Montani
d93cbeb14f
Add warning for loose version constraints (#5536)
* Add warning for loose version constraints

* Update wording [ci skip]

* Tweak error message

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-06-05 12:42:15 +02:00
adrianeboyd
1ac43d78f9
Avoid libc.stdint for UINT64_MAX (#5545) 2020-06-04 20:02:05 +02:00
Sofie Van Landeghem
4d1ba6feb4
add tag variant for 2.3 (#5542) 2020-06-04 19:16:33 +02:00
Paul O'Leary McCann
410fb7ee43
Add Japanese Model (#5544)
* Add more rules to deal with Japanese UD mappings

Japanese UD rules sometimes give different UD tags to tokens with the
same underlying POS tag. The UD spec indicates these cases should be
disambiguated using the output of a tool called "comainu", but rules are
enough to get the right result.

These rules are taken from Ginza at time of writing, see #3756.

* Add new tags from GSD

These are a few rare tags that aren't in Unidic but are in the GSD data.

* Add basic Japanese sentencization

This code is taken from Ginza again.

* Add sentencizer quote handling

Could probably add more paired characters but this will do for now. Also
includes some tests.

* Replace fugashi with SudachiPy

* Modify tag format to match GSD annotations

Some of the tests still need to be updated, but I want to get this up
for testing training.

* Deal with case with closing punct without opening

* refactor resolve_pos()

* change tag field separator from "," to "-"

* add TAG_ORTH_MAP

* add TAG_BIGRAM_MAP

* revise rules for 連体詞

* revise rules for 連体詞

* improve POS about 2%

* add syntax_iterator.py (not mature yet)

* improve syntax_iterators.py

* improve syntax_iterators.py

* add phrases including nouns and drop NPs consisting of STOP_WORDS

* First take at noun chunks

This works in many situations but still has issues in others.

If the start of a subtree has no noun, then nested phrases can be
generated.

    また行きたい、そんな気持ちにさせてくれるお店です。
    [そんな気持ち, また行きたい、そんな気持ちにさせてくれるお店]

For some reason て gets included sometimes. Not sure why.

    ゲンに連れ添って円盤生物を調査するパートナーとなる。
    [て円盤生物, ...]

Some phrases that look like they should be split are grouped together;
not entirely sure that's wrong. This whole thing becomes one chunk:

    道の駅遠山郷北側からかぐら大橋南詰現道交点までの1.060kmのみ開通済み

* Use new generic get_words_and_spaces

The new get_words_and_spaces function is simpler than what was used in
Japanese, so it's good to be able to switch to it. However, there was an
issue. The new function works just on text, so POS info could get out of
sync. Fixing this required a small change to the way dtokens (tokens
with POS and lemma info) were generated.

Specifically, multiple extraneous spaces now become a single token, so
when generating dtokens multiple space tokens should be created in a
row.

* Fix noun_chunks, should be working now

* Fix some tests, add naughty strings tests

Some of the existing tests changed because the tokenization mode of
Sudachi changed to the more fine-grained A mode.

Sudachi also has issues with some strings, so this adds a test against
the naughty strings.

* Remove empty Sudachi tokens

Not doing this creates zero-length tokens and causes errors in the
internal spaCy processing.

* Add yield_bunsetu back in as a separate piece of code

Co-authored-by: Hiroshi Matsuda <40782025+hiroshi-matsuda-rit@users.noreply.github.com>
Co-authored-by: hiroshi <hiroshi_matsuda@megagon.ai>
2020-06-04 19:15:43 +02:00
Matthew Honnibal
8411d4f4e6
Merge pull request #5543 from svlandeg/feature/pretrain-config
pretrain from config
2020-06-04 19:07:12 +02:00
svlandeg
3ade455fd3 formatting 2020-06-04 16:09:55 +02:00
svlandeg
776d4f1190 cleanup 2020-06-04 16:07:30 +02:00
svlandeg
6b027d7689 remove duplicate model definition of tok2vec layer 2020-06-04 15:49:23 +02:00
svlandeg
1775f54a26 small little fixes 2020-06-03 22:17:02 +02:00
svlandeg
07886a3de3 rename init_tok2vec to resume 2020-06-03 22:00:25 +02:00
svlandeg
4ed6278663 small fixes to pretrain config, init_tok2vec TODO 2020-06-03 19:32:40 +02:00
Ines Montani
d79964bcb1
Merge pull request #5535 from adrianeboyd/feature/model-spacy-version-check 2020-06-03 15:35:20 +02:00
Ines Montani
56a9d1b78c
Merge pull request #5479 from explosion/master-tmp 2020-06-03 15:31:27 +02:00
svlandeg
ddf8244df9 add normalize option to distance metric 2020-06-03 14:52:54 +02:00
svlandeg
ffe0451d09 pretrain from config 2020-06-03 14:45:00 +02:00
Ines Montani
a8875d4a4b Fix typo 2020-06-03 14:42:39 +02:00
Ines Montani
4e0610d0d4 Update warning codes 2020-06-03 14:37:09 +02:00
Ines Montani
810fce3bb1 Merge branch 'develop' into master-tmp 2020-06-03 14:36:59 +02:00