Commit Graph

11548 Commits

Author SHA1 Message Date
Adriane Boyd
f1f9c8b417 Port train CLI updates
Updates from #5362 and fix from #5387:

* `train`:

  * if training on GPU, only run evaluation/timing on CPU in the first
    iteration

  * if training is aborted, exit with a non-0 exit status
2020-06-03 14:03:43 +02:00
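The exit-status behaviour described in this commit could look roughly like the sketch below. `TrainingAborted`, `run_training` and `main` are hypothetical names used only for illustration; this is not the actual `train` CLI code.

```python
class TrainingAborted(Exception):
    """Hypothetical signal that training stopped before finishing."""


def run_training(n_iter, abort_at=None):
    # Placeholder loop; a real trainer would update the model here.
    for i in range(n_iter):
        if abort_at is not None and i == abort_at:
            raise TrainingAborted(f"aborted at iteration {i}")
    return n_iter


def main(n_iter, abort_at=None):
    """Return a process exit status: 0 on success, 1 if training was aborted."""
    try:
        run_training(n_iter, abort_at=abort_at)
    except (TrainingAborted, KeyboardInterrupt):
        return 1
    return 0
```

A command-line entry point would pass the result to `sys.exit(...)`, so that shell scripts and CI jobs can detect the aborted run.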
Ines Montani
581bda9f98 Update senter test and auto-format 2020-05-21 20:17:14 +02:00
Adriane Boyd
132b2a6898 Merge remote-tracking branch 'upstream/master-tmp' into HEAD 2020-05-21 19:50:30 +02:00
Adriane Boyd
17ee9ab53a Fix _SP/POS=SPACE in strings serialization tests 2020-05-21 19:49:08 +02:00
Ines Montani
245f91df78 Fix merge issues 2020-05-21 19:42:13 +02:00
Ines Montani
631e20d0c6 Fix test and schemas 2020-05-21 19:01:02 +02:00
Ines Montani
d34fc0915e Remove serialization getter 2020-05-21 18:48:21 +02:00
Ines Montani
f44897e4c6 Update warning IDs 2020-05-21 18:39:11 +02:00
Ines Montani
24f72c669c Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
Ines Montani
c6ec19c844 Add missing declaration 2020-05-21 17:30:05 +02:00
Matthew Honnibal
884d9b060d
Merge pull request #5466 from adrianeboyd/feature/omit-extra-lexeme-info
Add option to omit extra lexeme tables in CLI
2020-05-21 16:40:02 +02:00
Matthew Honnibal
e6c4c1a507
Merge pull request #5468 from adrianeboyd/feature/cli-conllu-misc-ner
Improve handling of NER in CoNLL-U MISC
2020-05-21 16:39:46 +02:00
Matthew Honnibal
26cd6a0229
Merge pull request #5462 from adrianeboyd/feature/lemmatizer-all-upos
Extend lemmatizer rules for all UPOS tags
2020-05-21 16:05:31 +02:00
Matthew Honnibal
cad9b290a2
Merge branch 'master' into feature/omit-extra-lexeme-info 2020-05-21 16:04:24 +02:00
Matthew Honnibal
1f572ce89b
Merge pull request #5473 from explosion/fix/travis-tests
Fix Python 2.7 compat
2020-05-21 15:56:16 +02:00
Matthew Honnibal
7902ebc63c
Rename argument: doc_or_span/obj -> doclike (#5463)
* doc_or_span -> obj

* Revert "doc_or_span -> obj"

This reverts commit 78bb9ff5e0.

* obj -> doclike

* Refer to correct object
2020-05-21 15:17:54 +02:00
Ines Montani
a9cb2882cb
Rename argument: doc_or_span/obj -> doclike (#5463)
* doc_or_span -> obj

* Revert "doc_or_span -> obj"

This reverts commit 78bb9ff5e0.

* obj -> doclike

* Refer to correct object
2020-05-21 15:17:39 +02:00
Ines Montani
bea863acd2 Fix naming conflict and formatting 2020-05-21 14:24:38 +02:00
Ines Montani
bd6353715a Merge branch 'master' into fix/travis-tests 2020-05-21 14:23:04 +02:00
Ines Montani
e2fe83e35d Refer to correct object 2020-05-21 14:20:29 +02:00
Ines Montani
b1f45c9da3 obj -> doclike 2020-05-21 14:19:58 +02:00
Ines Montani
69fb4bedf2 Revert "doc_or_span -> obj"
This reverts commit 78bb9ff5e0.
2020-05-21 14:14:28 +02:00
Ines Montani
d8f3190c0a Tidy up and auto-format 2020-05-21 14:14:01 +02:00
Ines Montani
56de520afd Try to fix tests on Travis (2.7) 2020-05-21 14:04:57 +02:00
Ines Montani
f2a131bd9a
Merge pull request #5461 from kevinlu1248/master 2020-05-21 13:53:10 +02:00
adrianeboyd
d45602bc11
Merge branch 'master' into feature/omit-extra-lexeme-info 2020-05-21 10:26:01 +02:00
adrianeboyd
49ef06d793
Add option for base model in init-model CLI (#5467)
Intended for languages like Chinese with a custom tokenizer.
2020-05-20 18:49:11 +02:00
Adriane Boyd
4b229bfc22 Improve handling of NER in CoNLL-U MISC 2020-05-20 18:48:51 +02:00
Matthew Honnibal
609c0ba557
Fix accidentally quadratic runtime in Example.split_sents (#5464)
* Tidy up train-from-config a bit

* Fix accidentally quadratic perf in TokenAnnotation.brackets

When we're reading in the gold data, we had a nested loop where
we looped over the brackets for each token, looking for brackets
that start on that word. This is accidentally quadratic, because
we have one bracket per word (for the POS tags). So we had
an O(N**2) behaviour here that ended up being pretty slow.

To solve this I'm indexing the brackets by their starting word
on the TokenAnnotations object, and having a property to provide
the previous view.

* Fixes
2020-05-20 18:48:18 +02:00
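The quadratic-to-linear change described in this commit message can be illustrated with a minimal sketch. It assumes brackets are `(start, end, label)` tuples; the function names are hypothetical and do not reflect the real `TokenAnnotation` implementation.

```python
from collections import defaultdict


def brackets_quadratic(n_tokens, brackets):
    """For each token, scan every bracket: O(N * B) overall."""
    by_token = []
    for i in range(n_tokens):
        by_token.append([(end, label) for start, end, label in brackets if start == i])
    return by_token


def index_brackets_by_start(brackets):
    """Group brackets by their starting token in a single pass: O(B)."""
    by_start = defaultdict(list)
    for start, end, label in brackets:
        by_start[start].append((end, label))
    return by_start


def brackets_linear(n_tokens, brackets):
    """Same result as brackets_quadratic, using the index: O(N + B)."""
    by_start = index_brackets_by_start(brackets)
    return [by_start.get(i, []) for i in range(n_tokens)]
```

With one bracket per word, the first version does work proportional to N**2, while the indexed version stays linear in the number of tokens.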
Kevin Lu
c7c4cd5fe1
Changed pyate code example in universe.json 2020-05-20 09:11:32 -07:00
Adriane Boyd
daaa7bf451 Add option to omit extra lexeme tables in CLI 2020-05-20 15:51:44 +02:00
Adriane Boyd
8cba0e41d8 Return lowercase form as default except for PROPN 2020-05-20 15:35:08 +02:00
adrianeboyd
9393253b66
Remove peeking from Parser.begin_training (#5456)
Inspect all instances in `Parser.begin_training` rather than only the
first 1000.
2020-05-20 15:18:06 +02:00
Ines Montani
78bb9ff5e0 doc_or_span -> obj 2020-05-20 14:56:52 +02:00
Matthw Honnibal
fda7355508 Fix train-from-config 2020-05-20 12:30:21 +02:00
Matthw Honnibal
24efd54a42 Merge from develop 2020-05-20 12:27:31 +02:00
Sofie Van Landeghem
7f5715a081
Various fixes to NEL functionality, Example class etc (#5460)
* setting KB in the EL constructor, similar to how the model is passed on

* removing wikipedia example files - moved to projects

* throw an error when nlp.update is called with 2 positional arguments

* rewriting the config logic in create pipe to accommodate other objects (e.g. KB) in the config

* update config files with new parameters

* avoid training pipeline components that don't have a model (like sentencizer)

* various small fixes + UX improvements

* small fixes

* set thinc to 8.0.0a9 everywhere

* remove outdated comment
2020-05-20 11:41:12 +02:00
Adriane Boyd
4fa9670537 Extend lemmatizer rules for all UPOS tags 2020-05-20 10:15:43 +02:00
Kevin Lu
291b9ad7b9
Update CONTRIBUTOR_AGREEMENT.md 2020-05-19 20:29:53 -07:00
Kevin Lu
9a1a535215
Create kevinlu1248.md 2020-05-19 20:25:45 -07:00
Kevin Lu
a23b3a5a50
Update CONTRIBUTOR_AGREEMENT.md 2020-05-19 20:24:24 -07:00
Kevin Lu
0a5b140235
Update universe.json 2020-05-19 20:12:21 -07:00
Matthew Honnibal
664a3603b0 Set version to v3.0.0.dev8 2020-05-19 17:15:39 +02:00
adrianeboyd
40e65d6f63
Fix most_similar for vectors with unused rows (#5348)
* Fix most_similar for vectors with unused rows

Address issues related to the unused rows in the vector table and
`most_similar`:

* Update `most_similar()` to search only through rows that are in use
according to `key2row`.

* Raise an error when `most_similar(n=n)` is larger than the number of
vectors in the table.

* Set and restore `_unset` correctly when vectors are added or
deserialized so that new vectors are added in the correct row.

* Set data and keys to the same length in `Vocab.prune_vectors()` to
avoid spurious entries in `key2row`.

* Fix regression test using `most_similar`

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-05-19 16:41:26 +02:00
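As a rough illustration of the first two points in this commit message, the sketch below searches only rows referenced by `key2row` and raises an error when `n` is too large. It is a pure-Python toy, not the `Vectors.most_similar` implementation; the signature is hypothetical, and the sketch compares `n` against the number of rows in use.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query, data, key2row, n=1):
    """Return the n keys whose rows are most similar to query.

    Only rows referenced by key2row are searched; unused rows are ignored.
    """
    if n > len(key2row):
        raise ValueError(f"n={n} is larger than the number of vectors in use ({len(key2row)})")
    scored = [(cosine(query, data[row]), key) for key, row in key2row.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:n]]
```

In this toy, a row that exists in `data` but has no entry in `key2row` can never be returned, even if it happens to be the closest match to the query.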
Matthew Honnibal
a2830c3ef5 Use thinc 8.0.0a9 2020-05-19 16:23:11 +02:00
Sofie Van Landeghem
f00de445dd
default models defined in component decorator (#5452)
* move defaults to pipeline and use in component decorator

* black formatting

* relative import
2020-05-19 16:20:03 +02:00
adrianeboyd
70da1fd2d6
Add warning for misaligned character offset spans (#5007)
* Add warning for misaligned character offset spans

* Resolve conflict

* Filter warnings in example scripts

Filter warnings in example scripts to show warnings once, in particular
warnings about misaligned entities.

Co-authored-by: Ines Montani <ines@ines.io>
2020-05-19 16:01:18 +02:00
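Showing a repeated warning only once, as this commit does for the example scripts, can be sketched with the standard `warnings` module. `check_alignment` and `count_shown_warnings` are hypothetical helpers, and the warning text is made up for the example.

```python
import warnings


def check_alignment(start, end, token_boundaries):
    """Warn if a character span does not start and end on token boundaries."""
    aligned = start in token_boundaries and end in token_boundaries
    if not aligned:
        warnings.warn("Character span does not align with token boundaries", UserWarning)
    return aligned


def count_shown_warnings(spans, token_boundaries):
    """Check all spans under a 'once' filter and count the warnings shown."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("once")
        for start, end in spans:
            check_alignment(start, end, token_boundaries)
    return len(caught)
```

In a script, a call such as `warnings.filterwarnings("once", category=UserWarning)` near the top has a similar effect for the whole run.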
adrianeboyd
0061992d95
Update Polish tokenizer for UD_Polish-PDB (#5432)
Update Polish tokenizer for UD_Polish-PDB, which is a relatively major
change from the existing tokenizer. Unused exceptions files and
conflicting test cases removed.

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-05-19 15:59:55 +02:00
adrianeboyd
a5cd203284
Reduce stored lexemes data, move feats to lookups (#5238)
* Reduce stored lexemes data, move feats to lookups

* Move non-derivable lexemes features (`norm / cluster / prob`) to
`spacy-lookups-data` as lookups
  * Get/set `norm` in both lookups and `LexemeC`, serialize in lookups
  * Remove `cluster` and `prob` from `LexemesC`, get/set/serialize in
    lookups only
* Remove serialization of lexemes data as `vocab/lexemes.bin`
  * Remove `SerializedLexemeC`
  * Remove `Lexeme.to_bytes/from_bytes`
* Modify normalization exception loading:
  * Always create `Vocab.lookups` table `lexeme_norm` for
    normalization exceptions
  * Load base exceptions from `lang.norm_exceptions`, but load
    language-specific exceptions from lookups
  * Set `lex_attr_getter[NORM]` including new lookups table in
    `BaseDefaults.create_vocab()` and when deserializing `Vocab`
* Remove all cached lexemes when deserializing vocab to override
  existing normalizations with the new normalizations (as a replacement
  for the previous step that replaced all lexemes data with the
  deserialized data)

* Skip English normalization test

Skip English normalization test because the data is now in
`spacy-lookups-data`.

* Remove norm exceptions

Moved to spacy-lookups-data.

* Move norm exceptions test to spacy-lookups-data

* Load extra lookups from spacy-lookups-data lazily

Load extra lookups (currently for cluster and prob) lazily from the
entry point `lg_extra` as `Vocab.lookups_extra`.

* Skip creating lexeme cache on load

To improve model loading times, do not create the full lexeme cache when
loading. The lexemes will be created on demand when processing.

* Identify numeric values in Lexeme.set_attrs()

With the removal of a special case for `PROB`, also identify `float` to
avoid trying to convert it with the `StringStore`.

* Skip lexeme cache init in from_bytes

* Unskip and update lookups tests for python3.6+

* Update vocab pickle to include lookups_extra

* Update vocab serialization tests

Check strings rather than lexemes since lexemes aren't initialized
automatically, account for addition of "_SP".

* Re-skip lookups test because of python3.5

* Skip PROB/float values in Lexeme.set_attrs

* Convert is_oov from lexeme flag to lex in vectors

Instead of storing `is_oov` as a lexeme flag, `is_oov` reports whether
the lexeme has a vector.

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-05-19 15:59:14 +02:00
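The final point in this commit message, deriving `is_oov` from the vector table instead of a stored flag, might be pictured as below. `ToyVocab` is a hypothetical stand-in, and the sketch assumes that a word counts as out-of-vocabulary exactly when it has no vector.

```python
class ToyVocab:
    """Hypothetical stand-in for a vocabulary backed by a vector table."""

    def __init__(self, key2row):
        # Maps each word that has a vector to its row in the table.
        self.key2row = dict(key2row)

    def has_vector(self, word):
        return word in self.key2row

    def is_oov(self, word):
        # Computed from the vector table rather than stored per lexeme.
        return not self.has_vector(word)
```

Because nothing is stored per lexeme, adding or removing a vector changes the answer immediately, with no flag to keep in sync.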
Sofie Van Landeghem
0d94737857
Feature toggle_pipes (#5378)
* make disable_pipes deprecated in favour of the new toggle_pipes

* rewrite disable_pipes statements

* update documentation

* remove bin/wiki_entity_linking folder

* one more fix

* remove deprecated link to documentation

* few more doc fixes

* add note about name change to the docs

* restore original disable_pipes

* small fixes

* fix typo

* fix error number to W096

* rename to select_pipes

* also make changes to the documentation

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-05-18 22:27:10 +02:00