15952 Commits

Author SHA1 Message Date
Adriane Boyd
c155f333bb Revert "Temporarily use v3.1.0 models in CI"
This reverts commit bd6433bbab.
2021-11-02 14:25:05 +01:00
Adriane Boyd
53a3523910 Revert "Temporarily ignore W095 in assemble CLI CI test (#9460)"
This reverts commit 8db574e0b5.
2021-11-02 14:24:54 +01:00
Adriane Boyd
4d5db737e9 Revert "Temporarily skip compat tests (#9594)"
This reverts commit 667572adca.
2021-11-02 14:24:06 +01:00
Adriane Boyd
667572adca
Temporarily skip compat tests (#9594) 2021-11-02 14:10:48 +01:00
Lj Miranda
f1bc655a38
Add initial Tagalog (tl) tests (#9582)
* Add tl_tokenizer to test fixtures

* Add tagalog tests
2021-11-02 08:35:49 +01:00
xxyzz
90ec820f05
Add WordDumb to spaCy Universe (#9572)
* Add WordDumb to spaCy Universe

* Add standalone category

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2021-11-01 18:38:41 +09:00
Bruce W. Lee (이웅성)
a4dcb68cf6
Adding LingFeat Software to spaCy Universe. (#9574)
* add lingfeat in universe

* Fix JSON

* Minor cleanup

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2021-11-01 18:38:14 +09:00
Vasundhara
5279c7c4ba
Fix broken link to mappings-exceptions (#9573) 2021-10-31 13:44:29 +09:00
svlandeg
87cf72d1c8 pass nO through 2021-10-29 17:38:11 +02:00
svlandeg
1cc0d05812 fixes 2021-10-29 17:10:07 +02:00
Adriane Boyd
bb26550e22
Fix StaticVectors after floret+mypy merge (#9566) 2021-10-29 16:25:43 +02:00
Adriane Boyd
322635e371
Set version to v3.2.0 (#9565) 2021-10-29 15:22:40 +02:00
svlandeg
dbaf68a439 formatting 2021-10-29 14:19:30 +02:00
svlandeg
87fb268f76 Merge remote-tracking branch 'upstream/master' into refactor/parser-gpu 2021-10-29 14:16:43 +02:00
Adriane Boyd
5e9db156c2
Merge pull request #9563 from adrianeboyd/chore/update-develop-from-master-v3.2-3
Update develop from master for v3.2
2021-10-29 14:08:14 +02:00
svlandeg
753f9ee685 cleanup 2021-10-29 13:25:15 +02:00
Adriane Boyd
2d430958e1 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-3 2021-10-29 12:18:15 +02:00
Paul O'Leary McCann
006df1ae1f
Clarify error when words are of wrong type (#9541)
* Clarify error when words are of wrong type

See #9437

* Update docs

* Use try/except

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-10-29 12:08:40 +02:00
Paul O'Leary McCann
2fd8d616e7
Add docs section for spacy.cli.train.train (#9545)
* Add section for spacy.cli.train.train

* Add link from training page to train function

* Ensure path in train helper

* Update docs

Co-authored-by: Ines Montani <ines@ines.io>
2021-10-29 10:36:34 +02:00
Adriane Boyd
5477453ea3
Docs for thinc-apple-ops (#9549)
* Docs for thinc-apple-ops

* Ignore thinc-apple-ops in reqs tests

* Fix install quickstart

* Add cupy cuda 113, 114 extras

* Remove draft section

Co-authored-by: Ines Montani <ines@ines.io>
2021-10-29 10:35:31 +02:00
Adriane Boyd
12974bf4d9
Add micro PRF for morph scoring (#9546)
* Add micro PRF for morph scoring

For pipelines where morph features are added by more than one component
and a reference training corpus may not contain all features, a micro
PRF score is more flexible than a simple accuracy score. An example is
the reading and inflection features added by the Japanese tokenizer.

* Use `morph_micro_f` as the default morph score for Japanese
morphologizers.

* Update docstring

* Fix typo in docstring

* Update Scorer API docs

* Fix results type

* Organize score list by attribute prefix
2021-10-29 10:29:29 +02:00
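
The difference between a per-token accuracy score and a micro PRF score for morph features (#9546) can be illustrated with a small standalone sketch. This is not spaCy's `Scorer` implementation: the function name `micro_prf` and the sample feature sets are hypothetical and only show how per-feature counts are pooled across tokens.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall and F over per-token feature sets.

    Hypothetical helper for illustration only, not part of spaCy.
    """
    tp = fp = fn = 0
    for gold_feats, pred_feats in zip(gold, pred):
        tp += len(gold_feats & pred_feats)   # features predicted and in gold
        fp += len(pred_feats - gold_feats)   # features predicted but not in gold
        fn += len(gold_feats - pred_feats)   # gold features that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Two tokens; the second prediction carries an extra feature that the
# reference annotation does not contain (made-up example values).
gold = [{"Case=Nom", "Number=Sing"}, {"Tense=Past"}]
pred = [{"Case=Nom", "Number=Plur"}, {"Tense=Past", "Reading=X"}]

precision, recall, f_score = micro_prf(gold, pred)
# Whole-token accuracy: a token counts only if all features match exactly.
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

In this made-up case neither token matches exactly, so whole-token accuracy is zero, while the micro scores still give credit for the individual features that were predicted correctly.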
Philip Vollet
76173b0866
fixed typo and URL (#9560) 2021-10-29 13:57:44 +09:00
Adriane Boyd
72dc63b3fb
Update for python 3.10 (#9519)
* Update for python 3.10

* Update mac image

* Update build constraints for python 3.10

* Add extras for cupy cuda 11.3-11.5

* Remove cupy-cuda115 extra

* Require thinc>=8.0.12

* Switch CI to windows-2019

* Skip mypy for python 3.10
2021-10-28 15:32:06 +02:00
Adriane Boyd
554fa414ec
Require spacy-transformers v1.1 in transformers extra (#9557)
So that the install/upgrade quickstart also upgrades
`spacy-transformers` with `pip install spacy[transformers]`, require
`spacy-transformers>=1.1.2` in the `transformers` extra.
2021-10-28 11:18:19 +02:00
Matthew Honnibal
79d5957c47 Xfail. 6 failures 2021-10-27 23:26:07 +02:00
Matthew Honnibal
6b5302cdf3 More xfail. 7 failures 2021-10-27 23:24:33 +02:00
Matthew Honnibal
7309e49286 Xfail beam stuff. 9 failures 2021-10-27 23:21:55 +02:00
Matthew Honnibal
880182afdb Work on parser. 15 tests failing 2021-10-27 23:02:29 +02:00
Matthew Honnibal
af9a30b192 Keep working through errors 2021-10-27 17:13:11 +02:00
Matthew Honnibal
b67dd0cf89 Keep working through errors 2021-10-27 17:10:33 +02:00
Adriane Boyd
c053f158c5
Add support for floret vectors (#8909)
* Add support for fasttext-bloom hash-only vectors

Overview:

* Extend `Vectors` to have two modes: `default` and `ngram`
  * `default` is the default mode and equivalent to the current
    `Vectors`
  * `ngram` supports the hash-only ngram tables from `fasttext-bloom`
* Extend `spacy.StaticVectors.v2` to handle both modes with no changes
  for `default` vectors
* Extend `spacy init vectors` to support ngram tables

The `ngram` mode **only** supports vector tables produced by this
fork of fastText, which adds an option to represent all vectors using
only the ngram buckets table and which uses the exact same ngram
generation algorithm and hash function (`MurmurHash3_x64_128`).
`fasttext-bloom` produces an additional `.hashvec` table, which can be
loaded by `spacy init vectors --fasttext-bloom-vectors`.

https://github.com/adrianeboyd/fastText/tree/feature/bloom

Implementation details:

* `Vectors` now includes the `StringStore` as `Vectors.strings` so that
  the API can stay consistent for both `default` (which can look up from
  `str` or `int`) and `ngram` (which requires `str` to calculate the
  ngrams).

* In ngram mode `Vectors` uses a default `Vectors` object as a cache
  since the ngram vectors lookups are relatively expensive.

  * The default cache size is the same size as the provided ngram vector
    table.

  * Once the cache is full, no more entries are added. The user is
    responsible for managing the cache in cases where the initial
    documents are not representative of the texts.

  * The cache can be resized by setting `Vectors.ngram_cache_size` or
    cleared with `vectors._ngram_cache.clear()`.

* The API ends up a bit split between methods for `default` and for
  `ngram`, so functions that only make sense for `default` or `ngram`
  include warnings with custom messages suggesting alternatives where
  possible.

* `Vocab.vectors` becomes a property so that the string stores can be
  synced when assigning vectors to a vocab.

* `Vectors` serializes its own config settings as `vectors.cfg`.

* The `Vectors` serialization methods have added support for `exclude`
  so that the `Vocab` can exclude the `Vectors` strings while serializing.

Removed:

* The `minn` and `maxn` options and related code from
  `Vocab.get_vector`, which does not work in a meaningful way for default
  vector tables.

* The unused `GlobalRegistry` in `Vectors`.

* Refactor to use reduce_mean

Refactor to use reduce_mean and remove the ngram vectors cache.

* Rename to floret

* Rename to floret in error messages

* Use --vectors-mode in CLI, vector init

* Fix vectors mode in init

* Remove unused var

* Minor API and docstrings adjustments

* Rename `--vectors-mode` to `--mode` in `init vectors` CLI
* Rename `Vectors.get_floret_vectors` to `Vectors.get_batch` and support
  both modes.
* Minor updates to Vectors docstrings.

* Update API docs for Vectors and init vectors CLI

* Update types for StaticVectors
2021-10-27 14:08:31 +02:00
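
The hash-only ngram lookup described in #8909 can be sketched roughly as follows. This is a simplified illustration under several assumptions, not the floret or spaCy implementation: `zlib.crc32` is a stand-in for the `MurmurHash3_x64_128` hash named in the commit, the `<`/`>` word boundary markers follow the usual fastText convention, and the names `char_ngrams` and `ngram_vector` are hypothetical. The averaging step mirrors the "reduce_mean" refactor mentioned in the commit message.

```python
import zlib


def char_ngrams(word, minn, maxn):
    """Character ngrams of length minn..maxn over the boundary-wrapped word."""
    wrapped = f"<{word}>"
    return [
        wrapped[i : i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def ngram_vector(word, table, minn=2, maxn=3):
    """Average the bucket rows that the word's ngrams hash to.

    `table` is a list of equal-length rows (the ngram buckets table).
    crc32 is only a stand-in hash for illustration.
    """
    width = len(table[0])
    rows = [
        table[zlib.crc32(ngram.encode("utf8")) % len(table)]
        for ngram in char_ngrams(word, minn, maxn)
    ]
    if not rows:
        return [0.0] * width
    return [sum(row[d] for row in rows) / len(rows) for d in range(width)]


# Toy table with 4 buckets of width 2 (made-up values).
table = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
vector = ngram_vector("ab", table)
```

Because every string maps to some set of buckets, a lookup of this kind needs the `str` form of the word rather than an integer ID, which matches the commit's note that the ngram mode requires `str` to calculate the ngrams.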
Adriane Boyd
0c97ed2746
Rename ja morph features to Inflection and Reading (#9520)
* Rename ja morph features to Inflection and Reading
2021-10-27 13:13:03 +02:00
Adriane Boyd
2ea9b58006
Ignore prefix in suffix matches (#9155)
* Ignore prefix in suffix matches

Ignore the currently matched prefix when looking for suffix matches in
the tokenizer. Otherwise a lookbehind in the suffix pattern may match
incorrectly due to the presence of the prefix in the token string.

* Move °[cfkCFK]. to a tokenizer exception

* Adjust exceptions for same tokenization as v3.1

* Also update test accordingly

* Continue to split . after °CFK if ° is not a prefix

* Exclude new ° exceptions for pl

* Switch back to default tokenization of "° C ."

* Revert "Exclude new ° exceptions for pl"

This reverts commit 952013a5b4.

* Add exceptions for °C for hu
2021-10-27 13:02:25 +02:00
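
The lookbehind issue behind #9155 can be reproduced with Python's `re` module alone. The sketch below is a simplified illustration, not spaCy's tokenizer code: the suffix pattern is an assumption modeled on the `°[cfkCFK].` case named in the commit message, and the variable names are hypothetical.

```python
import re

# Assumed suffix pattern: split off a final "." only when it follows "°C", "°F", etc.
suffix_re = re.compile(r"(?<=°[cfkCFK])\.$")

token = "°C."
prefix = "°"  # assume the tokenizer has already matched this as a prefix

# Searching the full string lets the lookbehind see the matched prefix.
with_prefix = suffix_re.search(token)

# Searching only the remainder hides the prefix from the lookbehind.
without_prefix = suffix_re.search(token[len(prefix):])
```

Here `with_prefix` matches the final period while `without_prefix` is `None`, so the suffix result depends on whether the already-matched prefix is still part of the searched string.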
Adriane Boyd
4170110ce7
Merge pull request #9540 from adrianeboyd/chore/update-develop-from-master-v3.2-1
Update develop from master for v3.2
2021-10-27 08:23:57 +02:00
Adriane Boyd
386dcada1c
Address random results in slow readers tests (#9544)
* Set random seed for dataset shuffling
* Use more dev examples for non-zero scores
2021-10-26 16:53:10 +02:00
Adriane Boyd
a803af9dfa Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1 2021-10-26 11:53:50 +02:00
Matthew Honnibal
c538eaf1c8 Work through tests 2021-10-26 01:21:51 +02:00
Matthew Honnibal
d765a4f8ee Cleaner handling of unseen classes 2021-10-25 22:34:29 +02:00
Matthew Honnibal
07a3581ff8 Support unseen classes in parser 2021-10-25 22:26:52 +02:00
Matthew Honnibal
4b5d1b53f6 Support unseen_classes in parser model 2021-10-25 22:21:17 +02:00
Matthew Honnibal
03018904ef Work on parser model 2021-10-25 16:11:58 +02:00
Matthew Honnibal
9c4a04d0c5 Uncython 2021-10-25 12:51:32 +02:00
Matthew Honnibal
1921e86813 Uncython ner.pyx and dep_parser.pyx 2021-10-25 12:51:14 +02:00
Matthew Honnibal
45ca12f07a Wire up parser model 2021-10-25 12:50:33 +02:00
Matthew Honnibal
71abe2e42d Wire up tb_framework to new parser model 2021-10-25 12:50:20 +02:00
Matthew Honnibal
0279aa036a Delete _precomputable_affine module 2021-10-25 12:28:57 +02:00
Matthew Honnibal
9b459f9ef2 Delete spacy.ml.parser_model 2021-10-25 12:28:31 +02:00
Matthew Honnibal
7b9c282469 Convert parser from cdef class 2021-10-25 12:28:13 +02:00
Matthew Honnibal
34aab9899f Prepare to remove parser_model.pyx 2021-10-25 12:22:46 +02:00
Matthew Honnibal
de8c88babb New progress on parser model refactor 2021-10-25 03:13:31 +02:00