Commit Graph

384 Commits

Author SHA1 Message Date
Ines Montani
46568f40a7 Merge branch 'master' into tmp/sync 2020-03-26 13:38:14 +01:00
Ines Montani
828acffc12 Tidy up and auto-format 2020-03-25 12:28:12 +01:00
svlandeg
fba219f737 remove unnecessary itertools call 2020-03-16 08:31:36 +01:00
Sofie Van Landeghem
5847be6022
Tok2Vec: extract-embed-encode (#5102)
* avoid changing original config

* fix elif structure, batch with just int crashes otherwise

* tok2vec example with doc2feats, encode and embed architectures

* further clean up MultiHashEmbed

* further generalize Tok2Vec to work with extract-embed-encode parts

* avoid initializing the charembed layer with Docs (for now ?)

* small fixes for bilstm config (still does not run)

* rename to core layer

* move new configs

* walk model to set nI instead of using core ref

* fix senter overfitting test to be more similar to the training data (avoid flakey behaviour)
2020-03-08 13:23:18 +01:00
adrianeboyd
993758c58f
Remove unnecessary iterator in Language.pipe (#5101)
Remove iterator over `raw_texts` with `iterator.tee()` in
`Language.pipe` that is never consumed and consumes memory
unnecessarily.
2020-03-08 13:22:25 +01:00
adrianeboyd
c95ce96c44
Update sentence recognizer (#5109)
* Update sentence recognizer

* rename `sentrec` to `senter`
* use `spacy.HashEmbedCNN.v1` by default
* update to follow `Tagger` modifications
* remove component methods that can be inherited from `Tagger`
* add simple initialization and overfitting pipeline tests

* Update serialization test for senter
2020-03-06 14:45:02 +01:00
Sofie Van Landeghem
d307e9ca58
take care of global vectors in multiprocessing (#5081)
* restore load_nlp.VECTORS in the child process

* add unit test

* fix test

* remove unnecessary import

* add utf8 encoding

* import unicode_literals
2020-03-03 13:58:22 +01:00
Ines Montani
37691e6d5d Simplify warnings 2020-02-28 12:20:23 +01:00
Ines Montani
5da3ad682a Tidy up and auto-format 2020-02-28 11:57:41 +01:00
Sofie Van Landeghem
06f0a8daa0
Default settings to configurations (#4995)
* fix grad_clip naming

* cleaning up pretrained_vectors out of cfg

* further refactoring Model init's

* move Model building out of pipes

* further refactor to require a model config when creating a pipe

* small fixes

* making cfg in nn_parser more consistent

* fixing nr_class for parser

* fixing nn_parser's nO

* fix printing of loss

* architectures in own file per type, consistent naming

* convenience methods default_tagger_config and default_tok2vec_config

* let create_pipe access default config if available for that component

* default_parser_config

* move defaults to separate folder

* allow reading nlp from package or dir with argument 'name'

* architecture spacy.VocabVectors.v1 to read static vectors from file

* cleanup

* default configs for nel, textcat, morphologizer, tensorizer

* fix imports

* fixing unit tests

* fixes and clean up

* fixing defaults, nO, fix unit tests

* restore parser IO

* fix IO

* 'fix' serialization test

* add *.cfg to manifest

* fix example configs with additional arguments

* replace Morpohologizer with Tagger

* add IO bit when testing overfitting of tagger (currently failing)

* fix IO - don't initialize when reading from disk

* expand overfitting tests to also check IO goes OK

* remove dropout from HashEmbed to fix Tagger performance

* add defaults for sentrec

* update thinc

* always pass a Model instance to a Pipe

* fix piped_added statement

* remove obsolete W029

* remove obsolete errors

* restore byte checking tests (work again)

* clean up test

* further test cleanup

* convert from config to Model in create_pipe

* bring back error when component is not initialized

* cleanup

* remove calls for nlp2.begin_training

* use thinc.api in imports

* allow setting charembed's nM and nC

* fix for hardcoded nM/nC + unit test

* formatting fixes

* trigger build
2020-02-27 18:42:27 +01:00
Ines Montani
4440a072d2
Merge pull request #5006 from svlandeg/bugfix/multiproc-underscore
load Underscore state when multiprocessing
2020-02-25 14:46:02 +01:00
Ines Montani
e3f40a6a0f Tidy up and auto-format 2020-02-18 15:38:18 +01:00
Ines Montani
de11ea753a Merge branch 'master' into develop 2020-02-18 14:47:23 +01:00
Sofie Van Landeghem
72c964bcf4
define pretrained_dims which is used by build_text_classifier (#5004) 2020-02-16 17:21:17 +01:00
svlandeg
65f5b48b5d add comment 2020-02-12 12:06:27 +01:00
svlandeg
ecbb9c4b9f load Underscore state when multiprocessing 2020-02-12 11:50:42 +01:00
Sofie Van Landeghem
cabd60fa1e
Small fixes to as_example (#4957)
* label in span not writable anymore

* Revert "label in span not writable anymore"

This reverts commit ab442338c8.

* fixing yield - remove redundant list
2020-02-03 13:02:12 +01:00
Sofie Van Landeghem
569cc98982
Update spaCy for thinc 8.0.0 (#4920)
* Add load_from_config function

* Add train_from_config script

* Merge configs and expose via spacy.config

* Fix script

* Suggest create_evaluation_callback

* Hard-code for NER

* Fix errors

* Register command

* Add TODO

* Update train-from-config todos

* Fix imports

* Allow delayed setting of parser model nr_class

* Get train-from-config working

* Tidy up and fix scores and printing

* Hide traceback if cancelled

* Fix weighted score formatting

* Fix score formatting

* Make output_path optional

* Add Tok2Vec component

* Tidy up and add tok2vec_tensors

* Add option to copy docs in nlp.update

* Copy docs in nlp.update

* Adjust nlp.update() for set_annotations

* Don't shuffle pipes in nlp.update, decruft

* Support set_annotations arg in component update

* Support set_annotations in parser update

* Add get_gradients method

* Add get_gradients to parser

* Update errors.py

* Fix problems caused by merge

* Add _link_components method in nlp

* Add concept of 'listeners' and ControlledModel

* Support optional attributes arg in ControlledModel

* Try having tok2vec component in pipeline

* Fix tok2vec component

* Fix config

* Fix tok2vec

* Update for Example

* Update for Example

* Update config

* Add eg2doc util

* Update and add schemas/types

* Update schemas

* Fix nlp.update

* Fix tagger

* Remove hacks from train-from-config

* Remove hard-coded config str

* Calculate loss in tok2vec component

* Tidy up and use function signatures instead of models

* Support union types for registry models

* Minor cleaning in Language.update

* Make ControlledModel specifically Tok2VecListener

* Fix train_from_config

* Fix tok2vec

* Tidy up

* Add function for bilstm tok2vec

* Fix type

* Fix syntax

* Fix pytorch optimizer

* Add example configs

* Update for thinc describe changes

* Update for Thinc changes

* Update for dropout/sgd changes

* Update for dropout/sgd changes

* Unhack gradient update

* Work on refactoring _ml

* Remove _ml.py module

* WIP upgrade cli scripts for thinc

* Move some _ml stuff to util

* Import link_vectors from util

* Update train_from_config

* Import from util

* Import from util

* Temporarily add ml.component_models module

* Move ml methods

* Move typedefs

* Update load vectors

* Update gitignore

* Move imports

* Add PrecomputableAffine

* Fix imports

* Fix imports

* Fix imports

* Fix missing imports

* Update CLI scripts

* Update spacy.language

* Add stubs for building the models

* Update model definition

* Update create_default_optimizer

* Fix import

* Fix comment

* Update imports in tests

* Update imports in spacy.cli

* Fix import

* fix obsolete thinc imports

* update srsly pin

* from thinc to ml_datasets for example data such as imdb

* update ml_datasets pin

* using STATE.vectors

* small fix

* fix Sentencizer.pipe

* black formatting

* rename Affine to Linear as in thinc

* set validate explicitely to True

* rename with_square_sequences to with_list2padded

* rename with_flatten to with_list2array

* chaining layernorm

* small fixes

* revert Optimizer import

* build_nel_encoder with new thinc style

* fixes using model's get and set methods

* Tok2Vec in component models, various fixes

* fix up legacy tok2vec code

* add model initialize calls

* add in build_tagger_model

* small fixes

* setting model dims

* fixes for ParserModel

* various small fixes

* initialize thinc Models

* fixes

* consistent naming of window_size

* fixes, removing set_dropout

* work around Iterable issue

* remove legacy tok2vec

* util fix

* fix forward function of tok2vec listener

* more fixes

* trying to fix PrecomputableAffine (not succesful yet)

* alloc instead of allocate

* add morphologizer

* rename residual

* rename fixes

* Fix predict function

* Update parser and parser model

* fixing few more tests

* Fix precomputable affine

* Update component model

* Update parser model

* Move backprop padding to own function, for test

* Update test

* Fix p. affine

* Update NEL

* build_bow_text_classifier and extract_ngrams

* Fix parser init

* Fix test add label

* add build_simple_cnn_text_classifier

* Fix parser init

* Set gpu off by default in example

* Fix tok2vec listener

* Fix parser model

* Small fixes

* small fix for PyTorchLSTM parameters

* revert my_compounding hack (iterable fixed now)

* fix biLSTM

* Fix uniqued

* PyTorchRNNWrapper fix

* small fixes

* use helper function to calculate cosine loss

* small fixes for build_simple_cnn_text_classifier

* putting dropout default at 0.0 to ensure the layer gets built

* using thinc util's set_dropout_rate

* moving layer normalization inside of maxout definition to optimize dropout

* temp debugging in NEL

* fixed NEL model by using init defaults !

* fixing after set_dropout_rate refactor

* proper fix

* fix test_update_doc after refactoring optimizers in thinc

* Add CharacterEmbed layer

* Construct tagger Model

* Add missing import

* Remove unused stuff

* Work on textcat

* fix test (again :)) after optimizer refactor

* fixes to allow reading Tagger from_disk without overwriting dimensions

* don't build the tok2vec prematuraly

* fix CharachterEmbed init

* CharacterEmbed fixes

* Fix CharacterEmbed architecture

* fix imports

* renames from latest thinc update

* one more rename

* add initialize calls where appropriate

* fix parser initialization

* Update Thinc version

* Fix errors, auto-format and tidy up imports

* Fix validation

* fix if bias is cupy array

* revert for now

* ensure it's a numpy array before running bp in ParserStepModel

* no reason to call require_gpu twice

* use CupyOps.to_numpy instead of cupy directly

* fix initialize of ParserModel

* remove unnecessary import

* fixes for CosineDistance

* fix device renaming

* use refactored loss functions (Thinc PR 251)

* overfitting test for tagger

* experimental settings for the tagger: avoid zero-init and subword normalization

* clean up tagger overfitting test

* use previous default value for nP

* remove toy config

* bringing layernorm back (had a bug - fixed in thinc)

* revert setting nP explicitly

* remove setting default in constructor

* restore values as they used to be

* add overfitting test for NER

* add overfitting test for dep parser

* add overfitting test for textcat

* fixing init for linear (previously affine)

* larger eps window for textcat

* ensure doc is not None

* Require newer thinc

* Make float check vaguer

* Slop the textcat overfit test more

* Fix textcat test

* Fix exclusive classes for textcat

* fix after renaming of alloc methods

* fixing renames and mandatory arguments (staticvectors WIP)

* upgrade to thinc==8.0.0.dev3

* refer to vocab.vectors directly instead of its name

* rename alpha to learn_rate

* adding hashembed and staticvectors dropout

* upgrade to thinc 8.0.0.dev4

* add name back to avoid warning W020

* thinc dev4

* update srsly

* using thinc 8.0.0a0 !

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
2020-01-29 17:06:46 +01:00
adrianeboyd
e1b493ae85 Add sentrec shortcut to Language (#4890) 2020-01-08 16:51:24 +01:00
Sofie Van Landeghem
a1b22e90cd serialize ENT_ID (#4852)
* expand serialization test for custom token attribute

* add failing test for issue 4849

* define ENT_ID as attr and use in doc serialization

* fix few typos
2020-01-06 14:57:34 +01:00
Ines Montani
a892821c51 More formatting changes 2019-12-25 17:59:52 +01:00
Ines Montani
db55577c45
Drop Python 2.7 and 3.5 (#4828)
* Remove unicode declarations

* Remove Python 3.5 and 2.7 from CI

* Don't require pathlib

* Replace compat helpers

* Remove OrderedDict

* Use f-strings

* Set Cython compiler language level

* Fix typo

* Re-add OrderedDict for Table

* Update setup.cfg

* Revert CONTRIBUTING.md

* Revert lookups.md

* Revert top-level.md

* Small adjustments and docs [ci skip]
2019-12-22 01:53:56 +01:00
Ines Montani
158b98a3ef Merge branch 'master' into develop 2019-12-21 18:55:03 +01:00
adrianeboyd
392c4880d9 Restructure Example with merged sents as default (#4632)
* Switch to train_dataset() function in train CLI

* Fixes for pipe() methods in pipeline components

* Don't clobber `examples` variable with `as_example` in pipe() methods
* Remove unnecessary traversals of `examples`

* Update Parser.pipe() for Examples

* Add `as_examples` kwarg to `pipe()` with implementation to return
`Example`s

* Accept `Doc` or `Example` in `pipe()` with `_get_doc()` (copied from
`Pipe`)

* Fixes to Example implementation in spacy.gold

* Move `make_projective` from an attribute of Example to an argument of
`Example.get_gold_parses()`

* Head of 0 are not treated as unset

* Unset heads are set to self rather than `None` (which causes problems
while projectivizing)

* Check for `Doc` (not just not `None`) when creating GoldParses for
pre-merged example

* Don't clobber `examples` variable in `iter_gold_docs()`

* Add/modify gold tests for handling projectivity

* In JSON roundtrip compare results from `dev_dataset` rather than
`train_dataset` to avoid projectivization (and other potential
modifications)

* Add test for projective train vs. nonprojective dev versions of the
same `Doc`

* Handle ignore_misaligned as arg rather than attr

Move `ignore_misaligned` from an attribute of `Example` to an argument
to `Example.get_gold_parses()`, which makes it parallel to
`make_projective`.

Add test with old and new align that checks whether `ignore_misaligned`
errors are raised as expected (only for new align).

* Remove unused attrs from gold.pxd

Remove `ignore_misaligned` and `make_projective` from `gold.pxd`

* Restructure Example with merged sents as default

An `Example` now includes a single `TokenAnnotation` that includes all
the information from one `Doc` (=JSON `paragraph`). If required, the
individual sentences can be returned as a list of examples with
`Example.split_sents()` with no raw text available.

* Input/output a single `Example.token_annotation`

* Add `sent_starts` to `TokenAnnotation` to handle sentence boundaries

* Replace `Example.merge_sents()` with `Example.split_sents()`

* Modify components to use a single `Example.token_annotation`

  * Pipeline components
  * conllu2json converter

* Rework/rename `add_token_annotation()` and `add_doc_annotation()` to
`set_token_annotation()` and `set_doc_annotation()`, functions that set
rather then appending/extending.

* Rename `morphology` to `morphs` in `TokenAnnotation` and `GoldParse`

* Add getters to `TokenAnnotation` to supply default values when a given
attribute is not available

* `Example.get_gold_parses()` in `spacy.gold._make_golds()` is only
applied on single examples, so the `GoldParse` is returned saved in the
provided `Example` rather than creating a new `Example` with no other
internal annotation

* Update tests for API changes and `merge_sents()` vs. `split_sents()`

* Refer to Example.goldparse in iter_gold_docs()

Use `Example.goldparse` in `iter_gold_docs()` instead of `Example.gold`
because a `None` `GoldParse` is generated with ignore_misaligned and
generating it on-the-fly can raise an unwanted AlignmentError

* Fix make_orth_variants()

Fix bug in make_orth_variants() related to conversion from multiple to
one TokenAnnotation per Example.

* Add basic test for make_orth_variants()

* Replace try/except with conditionals

* Replace default morph value with set
2019-11-25 16:03:28 +01:00
Ines Montani
3bd15055ce
Fix bug in Language.evaluate for components without .pipe (#4662) 2019-11-16 20:20:37 +01:00
adrianeboyd
faaa832518 Generalize handling of tokenizer special cases (#4259)
* Generalize handling of tokenizer special cases

Handle tokenizer special cases more generally by using the Matcher
internally to match special cases after the affix/token_match
tokenization is complete.

Instead of only matching special cases while processing balanced or
nearly balanced prefixes and suffixes, this recognizes special cases in
a wider range of contexts:

* Allows arbitrary numbers of prefixes/affixes around special cases
* Allows special cases separated by infixes

Existing tests/settings that couldn't be preserved as before:

* The emoticon '")' is no longer a supported special case
* The emoticon ':)' in "example:)" is a false positive again

When merged with #4258 (or the relevant cache bugfix), the affix and
token_match properties should be modified to flush and reload all
special cases to use the updated internal tokenization with the Matcher.

* Remove accidentally added test case

* Really remove accidentally added test

* Reload special cases when necessary

Reload special cases when affixes or token_match are modified. Skip
reloading during initialization.

* Update error code number

* Fix offset and whitespace in Matcher special cases

* Fix offset bugs when merging and splitting tokens
* Set final whitespace on final token in inserted special case

* Improve cache flushing in tokenizer

* Separate cache and specials memory (temporarily)
* Flush cache when adding special cases
* Repeated `self._cache = PreshMap()` and `self._specials = PreshMap()`
are necessary due to this bug:
https://github.com/explosion/preshed/issues/21

* Remove reinitialized PreshMaps on cache flush

* Update UD bin scripts

* Update imports for `bin/`
* Add all currently supported languages
* Update subtok merger for new Matcher validation
* Modify blinded check to look at tokens instead of lemmas (for corpora
with tokens but not lemmas like Telugu)

* Use special Matcher only for cases with affixes

* Reinsert specials cache checks during normal tokenization for special
cases as much as possible
  * Additionally include specials cache checks while splitting on infixes
  * Since the special Matcher needs consistent affix-only tokenization
    for the special cases themselves, introduce the argument
    `with_special_cases` in order to do tokenization with or without
    specials cache checks
* After normal tokenization, postprocess with special cases Matcher for
special cases containing affixes

* Replace PhraseMatcher with Aho-Corasick

Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays
of the hash values for the relevant attribute. The implementation is
based on FlashText.

The speed should be similar to the previous PhraseMatcher. It is now
possible to easily remove match IDs and matches don't go missing with
large keyword lists / vocabularies.

Fixes #4308.

* Restore support for pickling

* Fix internal keyword add/remove for numpy arrays

* Add test for #4248, clean up test

* Improve efficiency of special cases handling

* Use PhraseMatcher instead of Matcher
* Improve efficiency of merging/splitting special cases in document
  * Process merge/splits in one pass without repeated token shifting
  * Merge in place if no splits

* Update error message number

* Remove UD script modifications

Only used for timing/testing, should be a separate PR

* Remove final traces of UD script modifications

* Update UD bin scripts

* Update imports for `bin/`
* Add all currently supported languages
* Update subtok merger for new Matcher validation
* Modify blinded check to look at tokens instead of lemmas (for corpora
with tokens but not lemmas like Telugu)

* Add missing loop for match ID set in search loop

* Remove cruft in matching loop for partial matches

There was a bit of unnecessary code left over from FlashText in the
matching loop to handle partial token matches, which we don't have with
PhraseMatcher.

* Replace dict trie with MapStruct trie

* Fix how match ID hash is stored/added

* Update fix for match ID vocab

* Switch from map_get_unless_missing to map_get

* Switch from numpy array to Token.get_struct_attr

Access token attributes directly in Doc instead of making a copy of the
relevant values in a numpy array.

Add unsatisfactory warning for hash collision with reserved terminal
hash key. (Ideally it would change the reserved terminal hash and redo
the whole trie, but for now, I'm hoping there won't be collisions.)

* Restructure imports to export find_matches

* Implement full remove()

Remove unnecessary trie paths and free unused maps.

Parallel to Matcher, raise KeyError when attempting to remove a match ID
that has not been added.

* Switch to PhraseMatcher.find_matches

* Switch to local cdef functions for span filtering

* Switch special case reload threshold to variable

Refer to variable instead of hard-coded threshold

* Move more of special case retokenize to cdef nogil

Move as much of the special case retokenization to nogil as possible.

* Rewrap sort as stdsort for OS X

* Rewrap stdsort with specific types

* Switch to qsort

* Fix merge

* Improve cmp functions

* Fix realloc

* Fix realloc again

* Initialize span struct while retokenizing

* Temporarily skip retokenizing

* Revert "Move more of special case retokenize to cdef nogil"

This reverts commit 0b7e52c797.

* Revert "Switch to qsort"

This reverts commit a98d71a942.

* Fix specials check while caching

* Modify URL test with emoticons

The multiple suffix tests result in the emoticon `:>`, which is now
retokenized into one token as a special case after the suffixes are
split off.

* Refactor _apply_special_cases()

* Use cdef ints for span info used in multiple spots

* Modify _filter_special_spans() to prefer earlier

Parallel to #4414, modify _filter_special_spans() so that the earlier
span is preferred for overlapping spans of the same length.

* Replace MatchStruct with Entity

Replace MatchStruct with Entity since the existing Entity struct is
nearly identical.

* Replace Entity with more general SpanC

* Replace MatchStruct with SpanC

* Add error in debug-data if no dev docs are available (see #4575)

* Update azure-pipelines.yml

* Revert "Update azure-pipelines.yml"

This reverts commit ed1060cf59.

* Use latest wasabi

* Reorganise install_requires

* add dframcy to universe.json (#4580)

* Update universe.json [ci skip]

* Fix multiprocessing for as_tuples=True (#4582)

* Fix conllu script (#4579)

* force extensions to avoid clash between example scripts

* fix arg order and default file encoding

* add example config for conllu script

* newline

* move extension definitions to main function

* few more encodings fixes

* Add load_from_docbin example [ci skip]

TODO: upload the file somewhere

* Update README.md

* Add warnings about 3.8 (resolves #4593) [ci skip]

* Fixed typo: Added space between "recognize" and "various" (#4600)

* Fix DocBin.merge() example (#4599)

* Replace function registries with catalogue (#4584)

* Replace functions registries with catalogue

* Update __init__.py

* Fix test

* Revert unrelated flag [ci skip]

* Bugfix/dep matcher issue 4590 (#4601)

* add contributor agreement for prilopes

* add test for issue #4590

* fix on_match params for DependencyMacther (#4590)

* Minor updates to language example sentences (#4608)

* Add punctuation to Spanish example sentences

* Combine multilanguage examples for lang xx

* Add punctuation to nb examples

* Always realloc to a larger size

Avoid potential (unlikely) edge case and cymem error seen in #4604.

* Add error in debug-data if no dev docs are available (see #4575)

* Update debug-data for GoldCorpus / Example

* Ignore None label in misaligned NER data
2019-11-13 21:24:35 +01:00
Sofie Van Landeghem
e48a09df4e Example class for training data (#4543)
* OrigAnnot class instead of gold.orig_annot list of zipped tuples

* from_orig to replace from_annot_tuples

* rename to RawAnnot

* some unit tests for GoldParse creation and internal format

* removing orig_annot and switching to lists instead of tuple

* rewriting tuples to use RawAnnot (+ debug statements, WIP)

* fix pop() changing the data

* small fixes

* pop-append fixes

* return RawAnnot for existing GoldParse to have uniform interface

* clean up imports

* fix merge_sents

* add unit test for 4402 with new structure (not working yet)

* introduce DocAnnot

* typo fixes

* add unit test for merge_sents

* rename from_orig to from_raw

* fixing unit tests

* fix nn parser

* read_annots to produce text, doc_annot pairs

* _make_golds fix

* rename golds_to_gold_annots

* small fixes

* fix encoding

* have golds_to_gold_annots use DocAnnot

* missed a spot

* merge_sents as function in DocAnnot

* allow specifying only part of the token-level annotations

* refactor with Example class + underlying dicts

* pipeline components to work with Example objects (wip)

* input checking

* fix yielding

* fix calls to update

* small fixes

* fix scorer unit test with new format

* fix kwargs order

* fixes for ud and conllu scripts

* fix reading data for conllu script

* add in proper errors (not fixed numbering yet to avoid merge conflicts)

* fixing few more small bugs

* fix EL script
2019-11-11 17:35:27 +01:00
Ines Montani
09cec3e41b
Replace function registries with catalogue (#4584)
* Replace functions registries with catalogue

* Update __init__.py

* Fix test

* Revert unrelated flag [ci skip]
2019-11-07 11:45:22 +01:00
Matthew Honnibal
4e43c0ba93 Fix multiprocessing for as_tuples=True (#4582) 2019-11-04 20:29:03 +01:00
Ines Montani
afe4a428f7
Fix pipeline analysis on remove pipe (#4557)
Validate *after* component is removed, not before
2019-10-30 19:04:17 +01:00
Ines Montani
c5e41247e8 Tidy up and auto-format 2019-10-28 12:43:55 +01:00
Matthew Honnibal
f8d740bfb1
Fix --gold-preproc train cli command (#4392)
* Fix get labels for textcat

* Fix char_embed for gpu

* Revert "Fix char_embed for gpu"

This reverts commit 055b9a9e85.

* Fix passing of cats in gold.pyx

* Revert "Match pop with append for training format (#4516)"

This reverts commit 8e7414dace.

* Fix popping gold parses

* Fix handling of cats in gold tuples

* Fix name

* Fix ner_multitask_objective script

* Add test for 4402
2019-10-27 21:58:50 +01:00
Sofie Van Landeghem
8e7414dace Match pop with append for training format (#4516)
* trying to fix script - not succesful yet

* match pop() with extend() to avoid changing the data

* few more pop-extend fixes

* reinsert deleted print statement

* fix print statement

* add last tested version

* append instead of extend

* add in few comments

* quick fix for 4402 + unit test

* fixing number of docs (not counting cats)

* more fixes

* fix len

* print tmp file instead of using data from examples dir

* print tmp file instead of using data from examples dir (2)
2019-10-27 16:01:32 +01:00
Ines Montani
a9c6104047 Component decorator and component analysis (#4517)
* Add work in progress

* Update analysis helpers and component decorator

* Fix porting of docstrings for Python 2

* Fix docstring stuff on Python 2

* Support meta factories when loading model

* Put auto pipeline analysis behind flag for now

* Analyse pipes on remove_pipe and replace_pipe

* Move analysis to root for now

Try to find a better place for it, but it needs to go for now to avoid circular imports

* Simplify decorator

Don't return a wrapped class and instead just write to the object

* Update existing components and factories

* Add condition in factory for classes vs. functions

* Add missing from_nlp classmethods

* Add "retokenizes" to printed overview

* Update assigns/requires declarations of builtins

* Only return data if no_print is enabled

* Use multiline table for overview

* Don't support Span

* Rewrite errors/warnings and move them to spacy.errors
2019-10-27 13:35:49 +01:00
Ines Montani
d2da117114 Also support passing list to Language.disable_pipes (#4521)
* Also support passing list to Language.disable_pipes

* Adjust internals
2019-10-25 16:19:08 +02:00
Ines Montani
2c96a5e5b0
Remove lemma attrs on BaseDefaults (#4468) 2019-10-19 23:18:09 +02:00
Ines Montani
692d7f4291 Fix formatting [ci skip] 2019-10-18 11:33:38 +02:00
Ines Montani
fb11852750 Remove unused imports 2019-10-18 11:06:41 +02:00
Sofie Van Landeghem
2d249a9502 KB extensions and better parsing of WikiData (#4375)
* fix overflow error on windows

* more documentation & logging fixes

* md fix

* 3 different limit parameters to play with execution time

* bug fixes directory locations

* small fixes

* exclude dev test articles from prior probabilities stats

* small fixes

* filtering wikidata entities, removing numeric and meta items

* adding aliases from wikidata also to the KB

* fix adding WD aliases

* adding also new aliases to previously added entities

* fixing comma's

* small doc fixes

* adding subclassof filtering

* append alias functionality in KB

* prevent appending the same entity-alias pair

* fix for appending WD aliases

* remove date filter

* remove unnecessary import

* small corrections and reformatting

* remove WD aliases for now (too slow)

* removing numeric entities from training and evaluation

* small fixes

* shortcut during prediction if there is only one candidate

* add counts and fscore logging, remove FP NER from evaluation

* fix entity_linker.predict to take docs instead of single sentences

* remove enumeration sentences from the WP dataset

* entity_linker.update to process full doc instead of single sentence

* spelling corrections and dump locations in readme

* NLP IO fix

* reading KB is unnecessary at the end of the pipeline

* small logging fix

* remove empty files
2019-10-14 12:28:53 +02:00
Ines Montani
f8f68bb062 Auto-format [ci skip] 2019-10-10 17:08:39 +02:00
tamuhey
650cbfe82d multiprocessing pipe (#1303) (#4371)
* refactor: separate formatting docs and golds in Language.update

* fix return typo

* add pipe test

* unpickleable object cannot be assigned to p.map

* passed test pipe

* passed test!

* pipe terminate

* try pipe

* passed test

* fix ch

* add comments

* fix len(texts)

* add comment

* add comment

* fix: multiprocessing of pipe is not supported in 2

* test: use assert_docs_equal

* fix: is_python3 -> is_python2

* fix: change _pipe arg to use functools.partial

* test: add vector modification test

* test: add sample ner_pipe and user_data pipe

* add warnings test

* test: fix user warnings

* test: fix warnings capture

* fix: remove islice import

* test: remove warnings test

* test: add stream test

* test: rename

* fix: multiproc stream

* fix: stream pipe

* add comment

* mp.Pipe seems to be able to use with relative small data

* test: skip stream test in python2

* sort imports

* test: add reason to skiptest

* fix: use pipe for docs communucation

* add comments

* add comment
2019-10-08 12:20:55 +02:00
Ines Montani
b6670bf0c2 Use consistent spelling 2019-10-02 10:37:39 +02:00
Ines Montani
cf65a80f36 Refactor lemmatizer and data table integration (#4353)
* Move test

* Allow default in Lookups.get_table

* Start with blank tables in Lookups.from_bytes

* Refactor lemmatizer to hold instance of Lookups

* Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk)
* Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency
* Remove old and unsupported Lemmatizer.load classmethod
* Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need

* Update tests and docs

* Fix more tests

* Fix lemmatizer

* Upgrade pytest to try and fix weird CI errors

* Try pytest 4.6.5
2019-10-01 21:36:03 +02:00
Ines Montani
e0cf4796a5 Move lookup tables out of the core library (#4346)
* Add default to util.get_entry_point

* Tidy up entry points

* Read lookups from entry points

* Remove lookup tables and related tests

* Add lookups install option

* Remove lemmatizer tests

* Remove logic to process language data files

* Update setup.cfg
2019-10-01 00:01:27 +02:00
tamuhey
b408b5b29e Refactor language update (#4316)
* refactor: separate formatting docs and golds in Language.update

* fix return typo
2019-09-27 16:20:21 +02:00
Ines Montani
00a8cbc306 Tidy up and auto-format 2019-09-18 20:27:03 +02:00
adrianeboyd
b5d999e510 Add textcat to train CLI (#4226)
* Add doc.cats to spacy.gold at the paragraph level

Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in
the spacy JSON training format at the paragraph level.

* `spacy.gold.docs_to_json()` writes `docs.cats`

* `GoldCorpus` reads in cats in each `GoldParse`

* Update instances of gold_tuples to handle cats

Update iteration over gold_tuples / gold_parses to handle addition of
cats at the paragraph level.

* Add textcat to train CLI

* Add textcat options to train CLI
* Add textcat labels in `TextCategorizer.begin_training()`
* Add textcat evaluation to `Scorer`:
  * For binary exclusive classes with provided label: F1 for label
  * For 2+ exclusive classes: F1 macro average
  * For multilabel (not exclusive): ROC AUC macro average (currently
relying on sklearn)
* Provide user info on textcat evaluation settings, potential
incompatibilities
* Provide pipeline to Scorer in `Language.evaluate` for textcat config
* Customize train CLI output to include only metrics relevant to current
pipeline
* Add textcat evaluation to evaluate CLI

* Fix handling of unset arguments and config params

Fix handling of unset arguments and model confiug parameters in Scorer
initialization.

* Temporarily add sklearn requirement

* Remove sklearn version number

* Improve Scorer handling of models without textcats

* Fixing Scorer handling of models without textcats

* Update Scorer output for python 2.7

* Modify inf in Scorer for python 2.7

* Auto-format

Also make small adjustments to make auto-formatting with black easier and produce nicer results

* Move error message to Errors

* Update documentation

* Add cats to annotation JSON format [ci skip]

* Fix tpl flag and docs [ci skip]

* Switch to internal roc_auc_score

Switch to internal `roc_auc_score()` adapted from scikit-learn.

* Add AUCROCScore tests and improve errors/warnings

* Add tests for AUCROCScore and roc_auc_score
* Add missing error for only positive/negative values
* Remove unnecessary warnings and errors

* Make reduced roc_auc_score functions private

Because most of the checks and warnings have been stripped for the
internal functions and access is only intended through `ROCAUCScore`,
make the functions for roc_auc_score adapted from scikit-learn private.

* Check that data corresponds with multilabel flag

Check that the training instances correspond with the multilabel flag,
adding the multilabel flag if required.

* Add textcat score to early stopping check

* Add more checks to debug-data for textcat

* Add example training data for textcat

* Add more checks to textcat train CLI

* Check configuration when extending base model
* Fix typos

* Update textcat example data

* Provide licensing details and licenses for data
* Remove two labels with no positive instances from jigsaw-toxic-comment
data.


Co-authored-by: Ines Montani <ines@ines.io>
2019-09-15 22:31:31 +02:00
Paul O'Leary McCann
7d8df69158 Bloom-filter backed Lookup Tables (#4268)
* Improve load_language_data helper

* WIP: Add Lookups implementation

* Start moving lemma data over to JSON

* WIP: move data over for more languages

* Convert more languages

* Fix lemmatizer fixtures in tests

* Finish conversion

* Auto-format JSON files

* Fix test for now

* Make sure tables are stored on instance

* Update docstrings

* Update docstrings and errors

* Update test

* Add Lookups.__len__

* Add serialization methods

* Add Lookups.remove_table

* Use msgpack for serialization to disk

* Fix file exists check

* Try using OrderedDict for everything

* Update .flake8 [ci skip]

* Try fixing serialization

* Update test_lookups.py

* Update test_serialize_vocab_strings.py

* Lookups / Tables now work

This implements the stubs in the Lookups/Table classes. Currently this
is in Cython but with no type declarations, so that could be improved.

* Add lookups to setup.py

* Actually add lookups pyx

The previous commit added the old py file...

* Lookups work-in-progress

* Move from pyx back to py

* Add string based lookups, fix serialization

* Update tests, language/lemmatizer to work with string lookups

There are some outstanding issues here:

- a pickling-related test fails due to the bloom filter
- some custom lemmatizers (fr/nl at least) have issues

More generally, there's a question of how to deal with the case where
you have a string but want to use the lookup table. Currently the table
allows access by string or id, but that's getting pretty awkward.

* Change lemmatizer lookup method to pass (orth, string)

* Fix token lookup

* Fix French lookup

* Fix lt lemmatizer test

* Fix Dutch lemmatizer

* Fix lemmatizer lookup test

This was using a normal dict instead of a Table, so checks for the
string instead of an integer key failed.

* Make uk/nl/ru lemmatizer lookup methods consistent

The mentioned tokenizers all have their own implementation of the
`lookup` method, which accesses a `Lookups` table. The way that was
called in `token.pyx` was changed so this should be updated to have the
same arguments as `lookup` in `lemmatizer.py` (specificially (orth/id,
string)).

Prior to this change tests weren't failing, but there would probably be
issues with normal use of a model. More tests should proably be added.

Additionally, the language-specific `lookup` implementations seem like
they might not be needed, since they handle things like lower-casing
that aren't actually language specific.

* Make recently added Greek method compatible

* Remove redundant class/method

Leftovers from a merge not cleaned up adequately.
2019-09-12 17:26:11 +02:00
Ines Montani
625ce2db8e Update Language docs [ci skip] 2019-09-12 13:03:38 +02:00
Ines Montani
655b434553 Merge branch 'master' into develop 2019-09-12 11:39:18 +02:00