Commit Graph

6921 Commits

Author SHA1 Message Date
Paul O'Leary McCann
410fb7ee43
Add Japanese Model (#5544)
* Add more rules to deal with Japanese UD mappings

Japanese UD rules sometimes give different UD tags to tokens with the
same underlying POS tag. The UD spec indicates these cases should be
disambiguated using the output of a tool called "comainu", but rules are
enough to get the right result.

These rules are taken from Ginza at time of writing, see #3756.

* Add new tags from GSD

These are a few rare tags that aren't in Unidic but are in the GSD data.

* Add basic Japanese sentencization

This code is taken from Ginza again.

* Add sentencizer quote handling

Could probably add more paired characters but this will do for now. Also
includes some tests.

* Replace fugashi with SudachiPy

* Modify tag format to match GSD annotations

Some of the tests still need to be updated, but I want to get this up
for testing training.

* Deal with case with closing punct without opening

* refactor resolve_pos()

* change tag field separator from "," to "-"

* add TAG_ORTH_MAP

* add TAG_BIGRAM_MAP

* revise rules for 連体詞

* revise rules for 連体詞

* improve POS about 2%

* add syntax_iterator.py (not mature yet)

* improve syntax_iterators.py

* improve syntax_iterators.py

* add phrases including nouns and drop NPs consisting of STOP_WORDS

* First take at noun chunks

This works in many situations but still has issues in others.

If the start of a subtree has no noun, then nested phrases can be
generated.

    また行きたい、そんな気持ちにさせてくれるお店です。
    [そんな気持ち, また行きたい、そんな気持ちにさせてくれるお店]

For some reason て gets included sometimes. Not sure why.

    ゲンに連れ添って円盤生物を調査するパートナーとなる。
    [て円盤生物, ...]

Some phrases that look like they should be split are grouped together;
not entirely sure that's wrong. This whole thing becomes one chunk:

    道の駅遠山郷北側からかぐら大橋南詰現道交点までの1.060kmのみ開通済み

* Use new generic get_words_and_spaces

The new get_words_and_spaces function is simpler than what was used in
Japanese, so it's good to be able to switch to it. However, there was an
issue. The new function works just on text, so POS info could get out of
sync. Fixing this required a small change to the way dtokens (tokens
with POS and lemma info) were generated.

Specifically, multiple extraneous spaces now become a single token, so
when generating dtokens multiple space tokens should be created in a
row.

* Fix noun_chunks, should be working now

* Fix some tests, add naughty strings tests

Some of the existing tests changed because the tokenization mode of
Sudachi changed to the more fine-grained A mode.

Sudachi also has issues with some strings, so this adds a test against
the naughty strings.

* Remove empty Sudachi tokens

Not doing this creates zero-length tokens and causes errors in the
internal spaCy processing.

* Add yield_bunsetu back in as a separate piece of code

Co-authored-by: Hiroshi Matsuda <40782025+hiroshi-matsuda-rit@users.noreply.github.com>
Co-authored-by: hiroshi <hiroshi_matsuda@megagon.ai>
2020-06-04 19:15:43 +02:00
Adriane Boyd
8c758ed1eb Fix meta path 2020-06-03 12:11:57 +02:00
Adriane Boyd
a57bdeecac Test util.get_model_meta instead of util.load_model 2020-06-03 12:10:12 +02:00
Adriane Boyd
75f08ad62d Remove unnecessary check 2020-06-02 17:41:25 +02:00
Adriane Boyd
bbc1836581 Add rudimentary version checks on model load 2020-06-02 17:33:48 +02:00
Leo
925e938570
Spanish tokenizer exception and examples improvement (#5531)
* Spanish tokenizer exception additions. Added Spanish question examples

* erased slang tokenization examples
2020-06-01 18:18:34 +02:00
Matthew Honnibal
67af3a32b0
Merge pull request #5527 from adrianeboyd/bugfix/tagger-sp-tag-map
Preserve _SP when filtering tag map in Tagger
2020-06-01 12:00:21 +02:00
Leo
c21c308ecb
corrected issue #5524: changed <U+009C> 'STRING TERMINATOR' to <U+0153> 'LATIN SMALL LIGATURE OE' (#5526) 2020-05-31 22:08:12 +02:00
Adriane Boyd
a005ccd6d7 Preserve _SP when filtering tag map in Tagger
To allow "SP" as a tag (for Chinese OntoNotes), preserve "_SP" if
present as the reference `SPACE` POS in the tag map in
`Tagger.begin_training()`.
2020-05-31 19:57:54 +02:00
svlandeg
15134ef611 fix deserialization order 2020-05-30 12:53:32 +02:00
Matthew Honnibal
64adda3202 Revert "Remove peeking from Parser.begin_training (#5456)"
This reverts commit 9393253b66.

The model shouldn't need to see all examples, and actually in v3 there's
no equivalent step. All examples are provided to the component, for the
component to do stuff like figuring out the labels. The model just needs
to do stuff like shape inference.
2020-05-29 23:21:55 +02:00
Matthew Honnibal
85f1acfaa0
Merge pull request #5517 from adrianeboyd/bugfix/morph-repr
Remove MorphAnalysis __str__ and __repr__
2020-05-29 19:20:56 +02:00
svlandeg
291483157d prevent loading a pretrained Tok2Vec layer AND pretrained components 2020-05-29 17:38:33 +02:00
Adriane Boyd
e1b7cbd197 Remove MorphAnalysis __str__ and __repr__ 2020-05-29 14:33:47 +02:00
Matthew Honnibal
aecd1437cc
Merge pull request #5508 from adrianeboyd/bugfix/tag-map-sp-tag
Prefer _SP over SP for default tag map space attrs
2020-05-27 20:39:40 +02:00
Adriane Boyd
25de2a2191 Improve vector name loading from model meta 2020-05-27 14:48:54 +02:00
adrianeboyd
aad0610a85
Map NR to PROPN (#5512) 2020-05-26 22:30:53 +02:00
Adriane Boyd
b6b5908f5e Prefer _SP over SP for default tag map space attrs
If `_SP` is already in the tag map, use the mapping from `_SP` instead
of `SP` so that `SP` can be a valid non-space tag. (Chinese has a
non-space tag `SP` which was overriding the mapping of `_SP` to
`SPACE`.)
2020-05-26 14:57:13 +02:00
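The `_SP` preference described in the commit above can be sketched in a few lines. This is a minimal illustration, not spaCy's code: `space_attrs` and the fallback dict are hypothetical names, and the tag map is a plain dict.

```python
def space_attrs(tag_map):
    """Pick the attributes used for whitespace tokens (hypothetical helper).

    The reserved "_SP" entry wins over "SP", so a treebank can use "SP"
    as an ordinary, non-space tag.
    """
    for tag in ("_SP", "SP"):
        if tag in tag_map:
            return tag_map[tag]
    # Assumed fallback when neither entry exists.
    return {"pos": "SPACE"}


# A Chinese-style tag map where "SP" is a real tag, not whitespace.
zh_tag_map = {"SP": {"pos": "PART"}, "_SP": {"pos": "SPACE"}}
# An older tag map that only knows "SP".
legacy_tag_map = {"SP": {"pos": "SPACE"}, "NN": {"pos": "NOUN"}}
```

With `zh_tag_map`, the whitespace attributes come from `_SP` and the `SP` entry is left alone for real tokens.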
Adriane Boyd
1eed101be9 Fix Polish lemmatizer for deserialized models
Restructure Polish lemmatizer not to depend on lookups data in
`__init__` since the lemmatizer is initialized before the lookups data
is loaded from a saved model. The lookups tables are accessed first in
`__call__` instead once the data is available.
2020-05-26 09:56:12 +02:00
Ines Montani
24ef6680fa
Merge pull request #5499 from adrianeboyd/chore/bump-version-deps-v2.3.0 2020-05-25 13:25:45 +02:00
Adriane Boyd
3f727bc539 Switch to v2.3.0.dev0 2020-05-25 12:57:20 +02:00
Adriane Boyd
736f3cb5af Bump version and deps for v2.3.0
* spacy to v2.3.0
* thinc to v7.4.1
* spacy-lookups-data to v0.3.2
2020-05-25 12:03:49 +02:00
Adriane Boyd
e06ca7ea24 Switch to new add API in PhraseMatcher unpickle 2020-05-25 11:22:47 +02:00
Ines Montani
6728747f71
Merge pull request #5486 from explosion/fix/compat-py2 2020-05-22 15:47:21 +02:00
Matthew Honnibal
f6078d866a
Merge pull request #5121 from adrianeboyd/bugfix/revert-token-match
Revert token_match priority changes from #4374 and extend token match options
2020-05-22 14:42:51 +02:00
Ines Montani
c685ee734a Fix compat for v2.x branch 2020-05-22 14:22:36 +02:00
Adriane Boyd
e4a1b5dab1 Rename to url_match
Rename to `url_match` and update docs.
2020-05-22 12:41:03 +02:00
Adriane Boyd
730fa493a4 Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match 2020-05-22 12:18:00 +02:00
Adriane Boyd
71fe61fdcd Disallow merging 0-length spans 2020-05-22 10:14:34 +02:00
Matthew Honnibal
93c4d13588
Merge pull request #5264 from lfiedler/issue-5230
Fix ResourceWarnings during unittest
2020-05-22 00:31:07 +02:00
Matthew Honnibal
e1cb7e838b
Merge pull request #5481 from explosion/feature/blank-shortcut-v2
Add blank:{lang} shortcut support to util.load_model
2020-05-22 00:08:23 +02:00
Ines Montani
2250380816
Merge pull request #5482 from explosion/fix/backwards-compat-super 2020-05-21 21:51:46 +02:00
Ines Montani
891fa59009 Use backwards-compatible super() 2020-05-21 20:52:48 +02:00
Matthew Honnibal
5ce02c1b17
Merge pull request #5470 from svlandeg/bugfix/noun-chunks
Bugfix in noun chunks
2020-05-21 20:51:31 +02:00
Ines Montani
cb02bff0eb Add blank:{lang} shortcut to util.load_model 2020-05-21 20:24:07 +02:00
Ines Montani
0f1beb5ff2 Tidy up and avoid absolute spacy imports in core 2020-05-21 20:05:03 +02:00
svlandeg
51715b9f72 span / noun chunk has +1 because end is exclusive 2020-05-21 19:56:56 +02:00
svlandeg
84d5b7ad0a Merge remote-tracking branch 'upstream/master' into bugfix/noun-chunks
# Conflicts:
#	spacy/lang/el/syntax_iterators.py
#	spacy/lang/en/syntax_iterators.py
#	spacy/lang/fa/syntax_iterators.py
#	spacy/lang/fr/syntax_iterators.py
#	spacy/lang/id/syntax_iterators.py
#	spacy/lang/nb/syntax_iterators.py
#	spacy/lang/sv/syntax_iterators.py
2020-05-21 19:19:50 +02:00
svlandeg
f7d10da555 avoid unnecessary loop to check overlapping noun chunks 2020-05-21 19:15:57 +02:00
Ines Montani
c6ec19c844 Add missing declaration 2020-05-21 17:30:05 +02:00
Matthew Honnibal
884d9b060d
Merge pull request #5466 from adrianeboyd/feature/omit-extra-lexeme-info
Add option to omit extra lexeme tables in CLI
2020-05-21 16:40:02 +02:00
Matthew Honnibal
26cd6a0229
Merge pull request #5462 from adrianeboyd/feature/lemmatizer-all-upos
Extend lemmatizer rules for all UPOS tags
2020-05-21 16:05:31 +02:00
Matthew Honnibal
cad9b290a2
Merge branch 'master' into feature/omit-extra-lexeme-info 2020-05-21 16:04:24 +02:00
Matthew Honnibal
1f572ce89b
Merge pull request #5473 from explosion/fix/travis-tests
Fix Python 2.7 compat
2020-05-21 15:56:16 +02:00
Ines Montani
a9cb2882cb
Rename argument: doc_or_span/obj -> doclike (#5463)
* doc_or_span -> obj

* Revert "doc_or_span -> obj"

This reverts commit 78bb9ff5e0.

* obj -> doclike

* Refer to correct object
2020-05-21 15:17:39 +02:00
Ines Montani
bea863acd2 Fix naming conflict and formatting 2020-05-21 14:24:38 +02:00
Ines Montani
bd6353715a Merge branch 'master' into fix/travis-tests 2020-05-21 14:23:04 +02:00
Ines Montani
d8f3190c0a Tidy up and auto-format 2020-05-21 14:14:01 +02:00
Ines Montani
56de520afd Try to fix tests on Travis (2.7) 2020-05-21 14:04:57 +02:00
adrianeboyd
d45602bc11
Merge branch 'master' into feature/omit-extra-lexeme-info 2020-05-21 10:26:01 +02:00
svlandeg
b221bcf1ba fixing all languages 2020-05-21 00:17:28 +02:00
svlandeg
b509a3e7fc fix: use actual range in 'seen' instead of subtree 2020-05-20 23:06:39 +02:00
svlandeg
36a94c409a failing test to reproduce overlapping spans problem 2020-05-20 23:06:03 +02:00
adrianeboyd
49ef06d793
Add option for base model in init-model CLI (#5467)
Intended for languages like Chinese with a custom tokenizer.
2020-05-20 18:49:11 +02:00
Adriane Boyd
daaa7bf451 Add option to omit extra lexeme tables in CLI 2020-05-20 15:51:44 +02:00
Adriane Boyd
8cba0e41d8 Return lowercase form as default except for PROPN 2020-05-20 15:35:08 +02:00
adrianeboyd
9393253b66
Remove peeking from Parser.begin_training (#5456)
Inspect all instances in `Parser.begin_training` rather than only the
first 1000.
2020-05-20 15:18:06 +02:00
Adriane Boyd
4fa9670537 Extend lemmatizer rules for all UPOS tags 2020-05-20 10:15:43 +02:00
adrianeboyd
40e65d6f63
Fix most_similar for vectors with unused rows (#5348)
* Fix most_similar for vectors with unused rows

Address issues related to the unused rows in the vector table and
`most_similar`:

* Update `most_similar()` to search only through rows that are in use
according to `key2row`.

* Raise an error when `n` in `most_similar(n=n)` is larger than the number
of vectors in the table.

* Set and restore `_unset` correctly when vectors are added or
deserialized so that new vectors are added in the correct row.

* Set data and keys to the same length in `Vocab.prune_vectors()` to
avoid spurious entries in `key2row`.

* Fix regression test using `most_similar`

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-05-19 16:41:26 +02:00
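A rough sketch of the first two points in the commit above: search only rows referenced by `key2row`, and reject an `n` larger than the number of vectors in use. It assumes plain Python lists instead of the real vector table; `cosine` and `most_similar_rows` are illustrative names, not spaCy's API.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def most_similar_rows(query, data, key2row, n=1):
    """Return the n best (row, score) pairs, looking only at rows in use."""
    rows_in_use = sorted(set(key2row.values()))
    if n > len(rows_in_use):
        raise ValueError("n is larger than the number of vectors in use")
    scored = [(row, cosine(query, data[row])) for row in rows_in_use]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Row 2 is allocated but not referenced by any key; its stale contents
# would win the search if unused rows were not skipped.
data = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]
key2row = {"apple": 0, "pear": 1, "plum": 3}
best = most_similar_rows([1.0, 0.1], data, key2row, n=1)
```

Here `best` refers to row 0 rather than the stale row 2, and asking for `n=4` raises an error because only three rows are in use.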
adrianeboyd
70da1fd2d6
Add warning for misaligned character offset spans (#5007)
* Add warning for misaligned character offset spans

* Resolve conflict

* Filter warnings in example scripts

Filter warnings in example scripts to show warnings once, in particular
warnings about misaligned entities.

Co-authored-by: Ines Montani <ines@ines.io>
2020-05-19 16:01:18 +02:00
adrianeboyd
0061992d95
Update Polish tokenizer for UD_Polish-PDB (#5432)
Update Polish tokenizer for UD_Polish-PDB, which is a relatively major
change from the existing tokenizer. Unused exceptions files and
conflicting test cases removed.

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-05-19 15:59:55 +02:00
adrianeboyd
a5cd203284
Reduce stored lexemes data, move feats to lookups (#5238)
* Reduce stored lexemes data, move feats to lookups

* Move non-derivable lexemes features (`norm / cluster / prob`) to
`spacy-lookups-data` as lookups
  * Get/set `norm` in both lookups and `LexemeC`, serialize in lookups
  * Remove `cluster` and `prob` from `LexemeC`, get/set/serialize in
    lookups only
* Remove serialization of lexemes data as `vocab/lexemes.bin`
  * Remove `SerializedLexemeC`
  * Remove `Lexeme.to_bytes/from_bytes`
* Modify normalization exception loading:
  * Always create `Vocab.lookups` table `lexeme_norm` for
    normalization exceptions
  * Load base exceptions from `lang.norm_exceptions`, but load
    language-specific exceptions from lookups
  * Set `lex_attr_getter[NORM]` including new lookups table in
    `BaseDefaults.create_vocab()` and when deserializing `Vocab`
* Remove all cached lexemes when deserializing vocab to override
  existing normalizations with the new normalizations (as a replacement
  for the previous step that replaced all lexemes data with the
  deserialized data)

* Skip English normalization test

Skip English normalization test because the data is now in
`spacy-lookups-data`.

* Remove norm exceptions

Moved to spacy-lookups-data.

* Move norm exceptions test to spacy-lookups-data

* Load extra lookups from spacy-lookups-data lazily

Load extra lookups (currently for cluster and prob) lazily from the
entry point `lg_extra` as `Vocab.lookups_extra`.

* Skip creating lexeme cache on load

To improve model loading times, do not create the full lexeme cache when
loading. The lexemes will be created on demand when processing.

* Identify numeric values in Lexeme.set_attrs()

With the removal of a special case for `PROB`, also identify `float` to
avoid trying to convert it with the `StringStore`.

* Skip lexeme cache init in from_bytes

* Unskip and update lookups tests for python3.6+

* Update vocab pickle to include lookups_extra

* Update vocab serialization tests

Check strings rather than lexemes since lexemes aren't initialized
automatically, account for addition of "_SP".

* Re-skip lookups test because of python3.5

* Skip PROB/float values in Lexeme.set_attrs

* Convert is_oov from lexeme flag to lex in vectors

Instead of storing `is_oov` as a lexeme flag, `is_oov` reports whether
the lexeme has a vector.

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-05-19 15:59:14 +02:00
Ines Montani
a41e28ceba
Merge pull request #5436 from ilivans/fix_errors_with_codes 2020-05-18 10:45:56 +02:00
Ilkyu Ju
72a25c9cef
Very minor issues in Korean example sentences (#5446)
* Add contributor agreement

* Improve ko translation of example sentences

I fixed unnatural translations and word spacing errors.

* Update osori.md
2020-05-17 13:43:34 +02:00
adrianeboyd
908dea3939
Skip duplicate lexeme rank setting (#5401)
Skip duplicate lexeme rank setting within
`_fix_pretrained_vectors_name()`.
2020-05-14 18:26:12 +02:00
adrianeboyd
f49e2810e6
Add Polish lemmatizer (#5413)
* Add Polish lemmatizer

Contributed by @ryszardtuora

* Add missing import
2020-05-14 18:23:19 +02:00
adrianeboyd
e63880e081
Use Token.sent_start for Span.sent (#5439)
Use `Token.sent_start` for sentence boundaries in `Span.sent` so that
`Doc.sents` and `Span.sent` return the same sentence boundaries.
2020-05-14 18:22:51 +02:00
adrianeboyd
780b869345
Fix syntax iterators for Persian (#5437) 2020-05-14 16:51:03 +02:00
Ilia Ivanov
712d9d4820 fixup! Fix ErrorsWithCodes().__class__ return value 2020-05-14 15:45:58 +02:00
Ilia Ivanov
a987e9e45d Fix ErrorsWithCodes().__class__ return value 2020-05-14 14:14:15 +02:00
Vishnu Priya VR
9ce059dd06
Limiting noun_chunks for specific languages (#5396)
* Limiting noun_chunks for specific languages

* Limiting noun_chunks for specific languages

Contributor Agreement

* Addressing review comments

* Removed unused fixtures and imports

* Add fa_tokenizer in test suite

* Use fa_tokenizer in test

* Undo extraneous reformatting

Co-authored-by: adrianeboyd <adrianeboyd@gmail.com>
2020-05-14 12:58:06 +02:00
Sofie Van Landeghem
b04738903e
prevent None in gold fields (#5425)
* set gold fields to empty list instead of keeping them as None

* add unit test
2020-05-13 22:08:50 +02:00
adrianeboyd
113e7981d0
Check that row is within bounds when adding vector (#5430)
Check that row is within bounds for the vector data array when adding a
vector.

Don't add vectors with rank OOV_RANK in `init-model` (change is due to
shift from OOV as 0 to OOV as OOV_RANK).
2020-05-13 22:08:28 +02:00
adrianeboyd
07639dd6ac
Remove TAG from da/sv tokenizer exceptions (#5428)
Remove `TAG` value from Danish and Swedish tokenizer exceptions because
it may not be included in a tag map (and these settings are problematic
as tokenizer exceptions anyway).
2020-05-13 10:25:54 +02:00
adrianeboyd
24e7108f80
Modify array type to accommodate OOV_RANK (#5429)
Modify indices array type in `Vocab.prune_vectors` to accommodate
OOV_RANK index as max(uint64).
2020-05-13 10:25:05 +02:00
adrianeboyd
440b81bddc
Improve exceptions for 'd (would/had) in English (#5379)
Instead of treating `'d` in contractions like `I'd` as `would` in all
cases in the tokenizer exceptions, leave the tagging and lemmatization
up to later components.
2020-05-08 15:10:57 +02:00
adrianeboyd
c963e269ba
Add method to update / reset pkuseg user dict (#5404) 2020-05-08 11:21:46 +02:00
Samuel Rodríguez Medina
5e55bfa821
Fixed tests for Swedish that were written in Danish. (#5395) 2020-05-05 14:06:27 +02:00
Adriane Boyd
565e0eef73 Add tokenizer option for token match with affixes
To fix the slow tokenizer URL (#4374) and allow `token_match` to take
priority over prefixes and suffixes by default, introduce a new
tokenizer option for a token match pattern that's applied after prefixes
and suffixes but before infixes.
2020-05-05 10:35:33 +02:00
Adriane Boyd
792c8af8cf Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match 2020-05-05 09:25:57 +02:00
adrianeboyd
c045a9c7f6
Fix logic in train CLI timing eval on GPU (#5387)
Run CPU timing in first iteration only
2020-05-01 12:05:33 +02:00
Samuel Rodríguez Medina
148b036e0c
Spanish like num improvement (#5381)
* Add tests for Spanish like_num.

* Add missing numbers in Spanish lexical attributes for like_num.

* Modify Spanish test function name.

* Add contributor agreement.
2020-04-30 11:13:23 +02:00
Samuel Rodríguez Medina
8602daba85
Swedish like_num (#5371)
* Sign contributor agreement.

* Add like_num functionality to Swedish.

* Update spacy/tests/lang/sv/test_lex_attrs.py

Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update contributor agreement

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-04-29 21:25:22 +02:00
adrianeboyd
74da669326
Fix problems with lower and whitespace in variants (#5361)
* Initialize lower flag explicitly

* Handle whitespace words from GoldParse correctly when creating raw
text with orth variants

* Return the text with original casing if anything goes wrong
2020-04-29 13:01:25 +02:00
adrianeboyd
3f43c73d37
Normalize TokenC.sent_start values for Matcher (#5346)
Normalize TokenC.sent_start values to booleans for the `Matcher`.
2020-04-29 12:57:30 +02:00
adrianeboyd
bdff76dede
Various updates/additions to CLI scripts (#5362)
* `debug-data`: determine coverage of provided vectors

* `evaluate`: support `blank:lg` model to make it possible to just evaluate
tokenization

* `init-model`: add option to truncate vectors to N most frequent vectors
from word2vec file

* `train`:

  * if training on GPU, only run evaluation/timing on CPU in the first
    iteration

  * if training is aborted, exit with a non-0 exit status
2020-04-29 12:56:46 +02:00
Sofie Van Landeghem
cfdaf99b80
Fix passing of component configuration (#5374)
* add kwargs to to_disk methods in docs - otherwise crashes on 'exclude' argument

* add fix and test for Issue 5137
2020-04-29 12:56:17 +02:00
Ines Montani
efec28ce70
Merge pull request #5367 from adrianeboyd/feature/simplify-warnings-v2 2020-04-29 12:55:37 +02:00
Sofie Van Landeghem
f67343295d
Update NEL examples and documentation (#5370)
* simplify creation of KB by skipping dim reduction

* small fixes to train EL example script

* add KB creation and NEL training example scripts to example section

* update descriptions of example scripts in the documentation

* moving wiki_entity_linking folder from bin to projects

* remove test for wiki NEL functionality that is being moved
2020-04-29 12:53:53 +02:00
adrianeboyd
a6e521cd79
Add is_sent_end token property (#5375)
Reconstruction of the original PR #4697 by @MiniLau.

Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema
because the Matcher is only going to be able to support `IS_SENT_START`.
2020-04-29 12:53:16 +02:00
Ines Montani
eac47971f1
Merge pull request #5258 from mirfan899/master 2020-04-29 12:51:55 +02:00
adrianeboyd
d5f18f8307
Add missing import 2020-04-28 14:01:29 +02:00
adrianeboyd
ac40a8f7a5
Add missing import 2020-04-28 14:00:11 +02:00
Adriane Boyd
3a045572ed Add missing import 2020-04-28 13:48:37 +02:00
Adriane Boyd
bc39f97e11 Simplify warnings 2020-04-28 13:37:37 +02:00
adrianeboyd
f8ac5b9f56
bugfix in span similarity (#5155) (#5358)
* bugfix in span similarity

* also rewrite doc.pyx for clarity

* formatting

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-04-27 16:51:27 +02:00
Sofie Van Landeghem
9203d821ae
Add 2 ini files in tests/lang (#5359) 2020-04-27 13:01:54 +02:00
Punitvara
b2b7e1f37a
This PR adds Gujarati Language class along with (#5355)
* This PR adds Gujarati Language class along with
- stop words

* Add test for gu tokenizer
2020-04-27 11:07:37 +02:00
sabiqueqb
fc91660aa2
Gh 5339 language class for malayalam (#5342)
* Initialize Malayalam Language class

* Add lex_attrs and examples for Malayalam

* Add spaCy Contributor Agreement

* Add test for ml tokenizer
2020-04-27 09:45:08 +02:00
adrianeboyd
84e06f9fb7
Improve GoldParse NER alignment (#5335)
Improve GoldParse NER alignment by including all cases where the start
and end of the NER span can be aligned, regardless of internal
tokenization differences.

To do this, convert BILUO tags to character offsets, check start/end
alignment with `doc.char_span()`, and assign the BILUO tags for the
aligned spans. Alignment for `O/-` tags is handled through the
one-to-one and multi alignments.
2020-04-23 16:58:23 +02:00
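The alignment idea in the commit above can be sketched without spaCy. This is an assumption-laden illustration: tokens are plain `(start_char, end_char)` pairs, the function names are hypothetical, and a set-membership check on token boundaries stands in for `doc.char_span()`.

```python
def biluo_to_offsets(token_offsets, tags):
    """Convert BILUO tags over tokens into (start_char, end_char, label) spans."""
    spans = []
    start = None
    for (tok_start, tok_end), tag in zip(token_offsets, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "U":
            spans.append((tok_start, tok_end, label))
        elif prefix == "B":
            start = tok_start
        elif prefix == "L" and start is not None:
            spans.append((start, tok_end, label))
            start = None
    return spans


def alignable(spans, token_offsets):
    """Keep spans whose start and end both fall on token boundaries."""
    starts = {start for start, _ in token_offsets}
    ends = {end for _, end in token_offsets}
    return [span for span in spans if span[0] in starts and span[1] in ends]


# Text: "I like New York City."
gold_tokens = [(0, 1), (2, 6), (7, 10), (11, 15), (16, 20), (20, 21)]
gold_tags = ["O", "O", "B-GPE", "I-GPE", "L-GPE", "O"]
gold_spans = biluo_to_offsets(gold_tokens, gold_tags)

# "New York" merged into one token: the span edges still line up.
merged_tokens = [(0, 1), (2, 6), (7, 15), (16, 20), (20, 21)]
# "City." kept as one token: the span end no longer lines up.
clumped_tokens = [(0, 1), (2, 6), (7, 10), (11, 15), (16, 21)]
```

The entity survives the `merged_tokens` tokenization despite the internal difference, and is dropped for `clumped_tokens` because its end falls inside a token.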
adrianeboyd
521f361052
Switch to new gold.align method (#5334)
* Switch from original `_align` to new simpler alignment algorithm from
  #4526

* Remove alignment normalizations beyond whitespace and lowercasing
2020-04-21 19:31:03 +02:00
adrianeboyd
bf5c13d170
Modify jieba install message (#5328)
Modify jieba install message to instruct the user to use
`ChineseDefaults.use_jieba = False` so that it's possible to load
pkuseg-only models without jieba installed.
2020-04-20 22:06:53 +02:00
adrianeboyd
f7471abd82
Add pkuseg and serialization support for Chinese (#5308)
* Add pkuseg and serialization support for Chinese

Add support for pkuseg alongside jieba

* Specify model through `Language` meta:

  * split on characters (if no word segmentation packages are installed)

```
Chinese(meta={"tokenizer": {"config": {"use_jieba": False, "use_pkuseg": False}}})
```

  * jieba (remains the default tokenizer if installed)

```
Chinese()
Chinese(meta={"tokenizer": {"config": {"use_jieba": True}}}) # explicit
```

  * pkuseg

```
Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}}})
```

* The new tokenizer setting `require_pkuseg` is used to override
`use_jieba` default, which is intended for models that provide a pkuseg
model:

```
nlp_pkuseg = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "require_pkuseg": True}}})
nlp = Chinese() # has `use_jieba` as `True` by default
nlp.from_bytes(nlp_pkuseg.to_bytes()) # `require_pkuseg` overrides `use_jieba` when calling the tokenizer
```

Add support for serialization of tokenizer settings and pkuseg model, if
loaded

* Add sorting for `Language.to_bytes()` serialization of `Language.meta`
so that the (emptied, but still present) tokenizer metadata is in a
consistent position in the serialized data

Extend tests to cover all three tokenizer configurations and
serialization

* Fix from_disk and tests without jieba or pkuseg

* Load cfg first and only show error if `use_pkuseg`
* Fix blank/default initialization in serialization tests

* Explicitly initialize jieba's cache on init

* Add serialization for pkuseg pre/postprocessors

* Reformat pkuseg install message
2020-04-18 17:01:53 +02:00
Jakob Jul Elben
663333c3b2
Fixes #5413 (#5315)
* Fix 5314

* Add contributor

* Resolve requested changes

Co-authored-by: Jakob Jul Elben <jakob@datamaga.com>
2020-04-16 13:29:02 +02:00
Leander Fiedler
a3401b1194 issue5230 changed reference to function to anonymous function 2020-04-15 21:52:52 +02:00
Leander Fiedler
cef0c909b9 issue5230 changed reference to function to anonymous function 2020-04-15 19:28:33 +02:00
Paolo Arduin
1ca32d8f9c
Matcher support for Span as well as Doc (#5113)
* Matcher support for Span, as well as Doc #5056

* Removes an unused import

* Signed contributors agreement

* Code optimization and better test

* Add error message for bad Matcher call argument

* Fix merging
2020-04-15 13:51:33 +02:00
adrianeboyd
98c59027ed
Use max(uint64) for OOV lexeme rank (#5303)
* Use max(uint64) for OOV lexeme rank

* Add test for default OOV rank

* Revert back to thinc==7.4.0

Requiring the updated version of thinc was unnecessary.

* Define OOV_RANK in one place

Define OOV_RANK in one place in `util`.

* Fix formatting [ci skip]

* Switch to external definitions of max(uint64)

Switch to external definitions of max(uint64) and confirm that they are
equal.
2020-04-15 13:49:47 +02:00
adrianeboyd
3d2c308906
Add Doc init from list of words and text (#5251)
* Add Doc init from list of words and text

Add an option to initialize a `Doc` from a text and list of words where
the words may or may not include all whitespace tokens. If the text and
words are mismatched, raise an error.

* Fix error code

* Remove all whitespace before aligning words/text

* Move words/text init to util function

* Update error message

* Rename to get_words_and_spaces

* Fix formatting
2020-04-14 19:15:52 +02:00
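A simplified sketch of aligning a word list against a text, in the spirit of the commit above. It is written from the commit description only and is not the library's implementation: `words_and_spaces` is a hypothetical name, and the choice to emit leftover whitespace as its own token is an assumption.

```python
def words_and_spaces(words, text):
    """Align words to text; return (tokens, trailing_space_flags).

    Whitespace-only entries in `words` are ignored, and whitespace found in
    the text between words becomes a token of its own.
    """
    if "".join("".join(words).split()) != "".join(text.split()):
        raise ValueError("words and text do not match")
    tokens, spaces = [], []
    pos = 0
    for word in (w for w in words if not w.isspace()):
        offset = text[pos:].index(word)
        if offset > 0:
            # Extra whitespace before the word becomes a single token.
            tokens.append(text[pos : pos + offset])
            spaces.append(False)
            pos += offset
        tokens.append(word)
        spaces.append(False)
        pos += len(word)
        if pos < len(text) and text[pos] == " ":
            # One following space is stored as the token's trailing space.
            spaces[-1] = True
            pos += 1
    if pos < len(text):
        tokens.append(text[pos:])
        spaces.append(False)
    return tokens, spaces


tokens, spaces = words_and_spaces(["Hello", "world", "!"], "Hello  world!")
rebuilt = "".join(tok + (" " if sp else "") for tok, sp in zip(tokens, spaces))
```

The double space in the example yields one extra whitespace token, and joining tokens with their trailing spaces reproduces the original text.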
Paolo Arduin
8ce408d2e1
Comparison predicate handling for != (#5282)
* Fix #5281

* Optim test
2020-04-14 19:14:15 +02:00
Leander Fiedler
6700006830 issue5230 attempted fix of pytest segfault for python3.5 2020-04-12 09:34:54 +02:00
Leander Fiedler
d60e2d3ebf issue5230 added unit test for dumping and loading knowledgebase 2020-04-12 09:08:41 +02:00
Leander Fiedler
d2bb649227 issue5230 filter warnings in addition to filterwarnings to prevent deprecation warnings in python35(win) setup to pop up 2020-04-10 23:21:13 +02:00
Leander Fiedler
ca2a7a44db issue5230 store string values of warnings to remotely debug failing python35(win) setup 2020-04-10 22:26:55 +02:00
Leander Fiedler
88ca40a15d issue5230 raise warnings as errors to remotely debug failing python35(win) setup 2020-04-10 21:45:53 +02:00
Leander Fiedler
a7bdfe42e1 issue5230 added print statement to warnings filter to remotely debug failing python35(win) setup 2020-04-10 21:14:33 +02:00
Leander Fiedler
8c1d0d628f issue5230 writer now checks instance of loc parameter before trying to operate on it 2020-04-10 20:35:52 +02:00
Umar Butler
8952effcc4
Fixed Typo in Warning (#5284)
* Fixed typo in cli warning

Fixed a typo in the warning for the provision of exactly two labels, which have not been designated as binary, to textcat.

* Created and signed contributor form
2020-04-09 15:46:15 +02:00
adrianeboyd
cf579a398d
Add __init__.py to eu and hy tests (#5278) 2020-04-08 20:03:06 +02:00
adrianeboyd
ae4af52ce7
Add ideographic stops to sentencizer (#5263)
Add ideographic half- and fullwidth full stops to default sentencizer
punctuation.
2020-04-08 12:58:39 +02:00
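To illustrate why the commit above matters for text without Latin punctuation, here is a toy rule-based splitter. It is a sketch only: `split_sentences` and `PUNCT_CHARS` are hypothetical, and the two code points used (U+3002 and U+FF61) are an assumption about which ideographic stops are meant.

```python
# ".", "!", "?" plus ideographic full stop (U+3002) and
# halfwidth ideographic full stop (U+FF61).
PUNCT_CHARS = (".", "!", "?", "\u3002", "\uff61")


def split_sentences(text, punct_chars=PUNCT_CHARS):
    """Split text after each run of sentence-final punctuation."""
    sentences = []
    start = 0
    for i, char in enumerate(text):
        next_is_punct = i + 1 < len(text) and text[i + 1] in punct_chars
        if char in punct_chars and not next_is_punct:
            sentences.append(text[start : i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


japanese = split_sentences("今日は晴れです。明日は雨です。")
english = split_sentences("Really?! Yes.")
```

Without the ideographic stops in `punct_chars`, the Japanese example would come back as a single sentence.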
adrianeboyd
fa760010a5
Set rank for new vector in Vocab.set_vector (#5266)
Set `Lexeme.rank` for vectors added with `Vocab.set_vector` so that the
lexeme `ID` accessed by a model points to the right row for the new vector.
2020-04-07 12:04:51 +02:00
lfiedler
e1e25c7e30 issue5230: added unittest test case for completion 2020-04-06 21:36:02 +02:00
Leander Fiedler
cde96f6c64 issue5230: optimized unit test a bit 2020-04-06 20:51:12 +02:00
Leander Fiedler
71cc903d65 issue5230: replaced open statements on path objects so that serialization still works and files are closed 2020-04-06 20:30:41 +02:00
Leander Fiedler
273ed452bb issue5230: added unicode declaration at top of the file 2020-04-06 19:22:32 +02:00
Leander Fiedler
1cd975d4a5 issue5230: fixed resource warnings in language 2020-04-06 18:54:32 +02:00
Leander Fiedler
493c77462a issue5230: test cases
covering known sources of resource warnings
2020-04-06 18:46:51 +02:00
adrianeboyd
c981aa6684
Use inline flags in token_match patterns (#5257)
* Use inline flags in token_match patterns

Use inline flags in `token_match` patterns so that serializing does not
lose the flag information.

* Modify inline flag

* Modify inline flag
2020-04-06 13:19:04 +02:00
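The reason for the inline flags in the commit above can be shown with the standard `re` module. The sketch assumes that serialization stores only the pattern string; the URL-like pattern is a made-up example, not the tokenizer's actual `token_match`.

```python
import re

# Flag passed as an argument: it lives on the compiled object only.
with_arg = re.compile(r"https?://\S+", re.IGNORECASE)
# Flag written inline: it is part of the pattern string itself.
inline = re.compile(r"(?i)https?://\S+")

# Simulate a round trip that keeps only the pattern string.
restored_arg = re.compile(with_arg.pattern)
restored_inline = re.compile(inline.pattern)

url = "HTTP://EXAMPLE.COM"
arg_still_matches = restored_arg.match(url) is not None
inline_still_matches = restored_inline.match(url) is not None
```

Both original patterns match the uppercase URL, but after the round trip only the inline version still does.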
adrianeboyd
e8be15e9b7
Improve tokenization for UD Spanish AnCora (#5253) 2020-04-06 13:18:23 +02:00
adrianeboyd
f4ef64a526
Improve tokenization for UD Dutch corpora (#5259)
* Improve tokenization for UD Dutch corpora

Improve tokenization for UD Dutch Alpino and LassySmall.

* Format Dutch tokenizer exceptions
2020-04-06 13:18:07 +02:00
Muhammad Irfan
406d5748b3 add missing Urdu tags 2020-04-05 20:55:38 +05:00
YohannesDatasci
beef184e53
Armenian language support (#5246)
* add Armenian language and test cases

* agreement submission
2020-04-03 13:02:18 +02:00
Michael Leichtfried
2b14997b68
Remove duplicated branch in if/else-if statement (#5234)
* Remove duplicated branch in if-elif-statement

* Add contributor agreement for leicmi
2020-04-02 14:47:42 +02:00
adrianeboyd
d107afcffb
Raise error for inplace resize with new vector dim (#5228)
Raise an error if there is an attempt to resize the vectors in place with
a different vector dimension.
2020-04-02 10:43:13 +02:00
Jacob Lauritzen
0b76212831
Extend and fix Danish examples (#5227)
* Extend and fix Danish examples

This PR fixes two examples, adds additional examples translated from the English version, and adds punctuation.

The two changed examples are:
* "fortov" changed to "fortovet", which is more [used](https://www.google.com/search?client=firefox-b-d&sxsrf=ALeKk0143gEuPe4IbIUpzBBt-oU10OMVqA%3A1585549036477&ei=7I6BXuvJHMGOrwSqi46oCQ&q=l%C3%B8behjul+p%C3%A5+fortov&oq=l%C3%B8behjul+p%C3%A5+fortov&gs_lcp=CgZwc3ktYWIQAzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQR1DT8xZY0_MWYK_0FmgAcAZ4AIABAIgBAJIBAJgBAKABAaoBB2d3cy13aXo&sclient=psy-ab&ved=0ahUKEwjr7964xsHoAhVBx4sKHaqFA5UQ4dUDCAo&uact=5) and more natural. The Swedish and Norwegian examples also use this version of the word.
* "stor by" changed to "storby". In Danish we have a specific noun to describe a large, metropolitan city which is different from just describing a city as "large". In this sentence it would be much more natural to describe London as a "storby". Google even corrects a search for "London stor by" to "London storby".

* Sign contrib agreement
2020-04-02 10:42:35 +02:00
Nikhil Saldanha
4f27a24f5b
Add kannada examples (#5162)
* Add example sentences for Kannada

* sign contributor agreement
2020-03-29 13:54:42 +02:00
adrianeboyd
d47b810ba4
Fix exclusive_classes in textcat ensemble (#5166)
Pass the exclusive_classes setting to the bow model within the ensemble
textcat model.
2020-03-29 13:52:34 +02:00
adrianeboyd
963bd890c1
Modify Vector.resize to work with cupy and improve resizing (#5216)
* Modify Vector.resize to work with cupy

Modify `Vectors.resize` to work with cupy. Modify behavior when resizing
to a different vector dimension so that individual vectors are truncated
or extended with zeros instead of having the original values filled into
the new shape without regard for the original axes.

* Update spacy/tests/vocab_vectors/test_vectors.py

Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-03-29 13:51:20 +02:00
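The resizing behavior described in the commit above, per-row truncation or zero padding rather than refilling values into a new shape, can be sketched with nested lists. `resize_rows` and `naive_reshape` are hypothetical helpers for illustration and do not reflect how `Vectors.resize` is implemented.

```python
def resize_rows(rows, shape):
    """Resize a table row by row: truncate or zero-pad each vector."""
    n_rows, width = shape
    resized = []
    for i in range(n_rows):
        old = list(rows[i][:width]) if i < len(rows) else []
        resized.append(old + [0.0] * (width - len(old)))
    return resized


def naive_reshape(rows, shape):
    """Refill the flattened values into the new shape, ignoring the original axes."""
    n_rows, width = shape
    flat = [value for row in rows for value in row]
    flat += [0.0] * (n_rows * width - len(flat))
    return [flat[i * width : (i + 1) * width] for i in range(n_rows)]


table = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
truncated = resize_rows(table, (3, 2))
padded = resize_rows(table, (2, 4))
scrambled = naive_reshape(table, (3, 2))
```

In `truncated`, each original row keeps its first two values and the new third row is zeros; in `scrambled`, values from different vectors end up in the same row, which is the behavior the commit moves away from.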
adrianeboyd
8d3563f1c4
Minor bugfixes for train CLI (#5186)
* Omit per_type scores from model-best calculations

The addition of per_type scores to the included metrics (#4911) causes
errors when they're compared while determining the best model, so omit
them for this `max()` comparison.

* Add default speed data for interrupted train CLI

Add better speed meta defaults so that an interrupted iteration still
produces a best model.

Co-authored-by: Ines Montani <ines@ines.io>
2020-03-26 10:46:50 +01:00
adrianeboyd
a04f802099
Fix GoldParse init when token count differs (#5191)
Fix the `GoldParse` initialization when the number of tokens has changed
(due to merging subtokens with the parser).
2020-03-26 10:46:23 +01:00
adrianeboyd
d88a377bed
Remove Vectors.from_glove (#5209) 2020-03-26 10:45:47 +01:00
Ines Montani
828acffc12 Tidy up and auto-format 2020-03-25 12:28:12 +01:00
adrianeboyd
86c43e55fa
Improve Lithuanian tokenization (#5205)
* Improve Lithuanian tokenization

Modify Lithuanian tokenization to improve performance for
UD_Lithuanian-ALKSNIS.

* Update Lithuanian tokenizer tests
2020-03-25 11:28:12 +01:00
adrianeboyd
1a944e5976
Improve Italian tokenization (#5204)
Improve Italian tokenization for UD_Italian-ISDT.
2020-03-25 11:28:02 +01:00
adrianeboyd
923a453449
Modifications/updates to Portuguese tokenization (#5203)
Modifications to Portuguese tokenization for UD_Portuguese-Bosque.
Instead of splitting contractions as exceptions, they are kept as merged
tokens.
2020-03-25 11:27:53 +01:00
adrianeboyd
4117a5c705
Improve French tokenization (#5202)
Improve French tokenization for UD_French-Sequoia.
2020-03-25 11:27:42 +01:00
Ines Montani
a3d09ffe61
Merge pull request #5201 from adrianeboyd/feature/ud-tokenization-nb-v2
Improved tokenization for UD_Norwegian-Bokmaal
2020-03-25 11:27:31 +01:00
Adriane Boyd
09d442f5ad Merge remote-tracking branch 'upstream/master' into feature/ud-tokenization-da 2020-03-25 09:41:52 +01:00
Adriane Boyd
cba2d1d972 Disable failing abbreviation test
UD_Danish-DDT has (as far as I can tell) hallucinated periods after
abbreviations, so the changes are an artifact of the corpus and not due
to anything meaningful about Danish tokenization.
2020-03-25 09:39:26 +01:00
Adriane Boyd
79737adb90 Improved tokenization for UD_Norwegian-Bokmaal 2020-03-25 08:54:02 +01:00