Commit Graph

120 Commits

Author SHA1 Message Date
Ines Montani
e0ffe36e79 Update docstrings, docs and types 2020-07-29 11:36:42 +02:00
Ines Montani
412dbb1f38
Remove dead and/or deprecated code (#5710)
* Remove dead and/or deprecated code

* Remove n_threads

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-07-06 13:06:25 +02:00
Ines Montani
60c2695131 Remove deprecated methods 2020-07-01 22:33:39 +02:00
Ines Montani
810fce3bb1 Merge branch 'develop' into master-tmp 2020-06-03 14:36:59 +02:00
Ines Montani
5d3806e059 unicode -> str consistency 2020-05-24 17:20:58 +02:00
Ines Montani
24f72c669c Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
adrianeboyd
e63880e081
Use Token.sent_start for Span.sent (#5439)
Use `Token.sent_start` for sentence boundaries in `Span.sent` so that
`Doc.sents` and `Span.sent` return the same sentence boundaries.
2020-05-14 18:22:51 +02:00
Adriane Boyd
bc39f97e11 Simplify warnings 2020-04-28 13:37:37 +02:00
adrianeboyd
f8ac5b9f56
bugfix in span similarity (#5155) (#5358)
* bugfix in span similarity

* also rewrite doc.pyx for clarity

* formatting

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-04-27 16:51:27 +02:00
Sofie Van Landeghem
d6d95674c1
bugfix in span similarity (#5155)
* bugfix in span similarity

* also rewrite doc.pyx for clarity

* formatting
2020-03-29 13:56:07 +02:00
Ines Montani
b0cfab317f Merge branch 'develop' into refactor/simplify-warnings 2020-03-04 16:38:55 +01:00
Ines Montani
648f61d077
Tidy up compiler flags and imports (#5071) 2020-03-02 11:48:10 +01:00
Ines Montani
37691e6d5d Simplify warnings 2020-02-28 12:20:23 +01:00
Ines Montani
e3f40a6a0f Tidy up and auto-format 2020-02-18 15:38:18 +01:00
Ines Montani
de11ea753a Merge branch 'master' into develop 2020-02-18 14:47:23 +01:00
adrianeboyd
3b22eb651b
Sync Span __eq__ and __hash__ (#5005)
* Sync Span __eq__ and __hash__

Use the same tuple for `__eq__` and `__hash__`, including all attributes
except `vector` and `vector_norm`.

* Update entity comparison in tests

Update `assert_docs_equal()` test util to compare `Span` properties for
ents rather than `Span` objects.
2020-02-16 17:20:36 +01:00
Ines Montani
2ed49404e3
Improve setup.py and call into Cython directly (#4952)
* Improve setup.py and call into Cython directly

* Add numpy to setup_requires

* Improve clean helper

* Update setup.cfg

* Try if it builds without pyproject.toml

* Update MANIFEST.in
2020-02-11 17:46:18 -05:00
Sofie Van Landeghem
569cc98982
Update spaCy for thinc 8.0.0 (#4920)
* Add load_from_config function

* Add train_from_config script

* Merge configs and expose via spacy.config

* Fix script

* Suggest create_evaluation_callback

* Hard-code for NER

* Fix errors

* Register command

* Add TODO

* Update train-from-config todos

* Fix imports

* Allow delayed setting of parser model nr_class

* Get train-from-config working

* Tidy up and fix scores and printing

* Hide traceback if cancelled

* Fix weighted score formatting

* Fix score formatting

* Make output_path optional

* Add Tok2Vec component

* Tidy up and add tok2vec_tensors

* Add option to copy docs in nlp.update

* Copy docs in nlp.update

* Adjust nlp.update() for set_annotations

* Don't shuffle pipes in nlp.update, decruft

* Support set_annotations arg in component update

* Support set_annotations in parser update

* Add get_gradients method

* Add get_gradients to parser

* Update errors.py

* Fix problems caused by merge

* Add _link_components method in nlp

* Add concept of 'listeners' and ControlledModel

* Support optional attributes arg in ControlledModel

* Try having tok2vec component in pipeline

* Fix tok2vec component

* Fix config

* Fix tok2vec

* Update for Example

* Update for Example

* Update config

* Add eg2doc util

* Update and add schemas/types

* Update schemas

* Fix nlp.update

* Fix tagger

* Remove hacks from train-from-config

* Remove hard-coded config str

* Calculate loss in tok2vec component

* Tidy up and use function signatures instead of models

* Support union types for registry models

* Minor cleaning in Language.update

* Make ControlledModel specifically Tok2VecListener

* Fix train_from_config

* Fix tok2vec

* Tidy up

* Add function for bilstm tok2vec

* Fix type

* Fix syntax

* Fix pytorch optimizer

* Add example configs

* Update for thinc describe changes

* Update for Thinc changes

* Update for dropout/sgd changes

* Update for dropout/sgd changes

* Unhack gradient update

* Work on refactoring _ml

* Remove _ml.py module

* WIP upgrade cli scripts for thinc

* Move some _ml stuff to util

* Import link_vectors from util

* Update train_from_config

* Import from util

* Import from util

* Temporarily add ml.component_models module

* Move ml methods

* Move typedefs

* Update load vectors

* Update gitignore

* Move imports

* Add PrecomputableAffine

* Fix imports

* Fix imports

* Fix imports

* Fix missing imports

* Update CLI scripts

* Update spacy.language

* Add stubs for building the models

* Update model definition

* Update create_default_optimizer

* Fix import

* Fix comment

* Update imports in tests

* Update imports in spacy.cli

* Fix import

* fix obsolete thinc imports

* update srsly pin

* from thinc to ml_datasets for example data such as imdb

* update ml_datasets pin

* using STATE.vectors

* small fix

* fix Sentencizer.pipe

* black formatting

* rename Affine to Linear as in thinc

* set validate explicitely to True

* rename with_square_sequences to with_list2padded

* rename with_flatten to with_list2array

* chaining layernorm

* small fixes

* revert Optimizer import

* build_nel_encoder with new thinc style

* fixes using model's get and set methods

* Tok2Vec in component models, various fixes

* fix up legacy tok2vec code

* add model initialize calls

* add in build_tagger_model

* small fixes

* setting model dims

* fixes for ParserModel

* various small fixes

* initialize thinc Models

* fixes

* consistent naming of window_size

* fixes, removing set_dropout

* work around Iterable issue

* remove legacy tok2vec

* util fix

* fix forward function of tok2vec listener

* more fixes

* trying to fix PrecomputableAffine (not succesful yet)

* alloc instead of allocate

* add morphologizer

* rename residual

* rename fixes

* Fix predict function

* Update parser and parser model

* fixing few more tests

* Fix precomputable affine

* Update component model

* Update parser model

* Move backprop padding to own function, for test

* Update test

* Fix p. affine

* Update NEL

* build_bow_text_classifier and extract_ngrams

* Fix parser init

* Fix test add label

* add build_simple_cnn_text_classifier

* Fix parser init

* Set gpu off by default in example

* Fix tok2vec listener

* Fix parser model

* Small fixes

* small fix for PyTorchLSTM parameters

* revert my_compounding hack (iterable fixed now)

* fix biLSTM

* Fix uniqued

* PyTorchRNNWrapper fix

* small fixes

* use helper function to calculate cosine loss

* small fixes for build_simple_cnn_text_classifier

* putting dropout default at 0.0 to ensure the layer gets built

* using thinc util's set_dropout_rate

* moving layer normalization inside of maxout definition to optimize dropout

* temp debugging in NEL

* fixed NEL model by using init defaults !

* fixing after set_dropout_rate refactor

* proper fix

* fix test_update_doc after refactoring optimizers in thinc

* Add CharacterEmbed layer

* Construct tagger Model

* Add missing import

* Remove unused stuff

* Work on textcat

* fix test (again :)) after optimizer refactor

* fixes to allow reading Tagger from_disk without overwriting dimensions

* don't build the tok2vec prematuraly

* fix CharachterEmbed init

* CharacterEmbed fixes

* Fix CharacterEmbed architecture

* fix imports

* renames from latest thinc update

* one more rename

* add initialize calls where appropriate

* fix parser initialization

* Update Thinc version

* Fix errors, auto-format and tidy up imports

* Fix validation

* fix if bias is cupy array

* revert for now

* ensure it's a numpy array before running bp in ParserStepModel

* no reason to call require_gpu twice

* use CupyOps.to_numpy instead of cupy directly

* fix initialize of ParserModel

* remove unnecessary import

* fixes for CosineDistance

* fix device renaming

* use refactored loss functions (Thinc PR 251)

* overfitting test for tagger

* experimental settings for the tagger: avoid zero-init and subword normalization

* clean up tagger overfitting test

* use previous default value for nP

* remove toy config

* bringing layernorm back (had a bug - fixed in thinc)

* revert setting nP explicitly

* remove setting default in constructor

* restore values as they used to be

* add overfitting test for NER

* add overfitting test for dep parser

* add overfitting test for textcat

* fixing init for linear (previously affine)

* larger eps window for textcat

* ensure doc is not None

* Require newer thinc

* Make float check vaguer

* Slop the textcat overfit test more

* Fix textcat test

* Fix exclusive classes for textcat

* fix after renaming of alloc methods

* fixing renames and mandatory arguments (staticvectors WIP)

* upgrade to thinc==8.0.0.dev3

* refer to vocab.vectors directly instead of its name

* rename alpha to learn_rate

* adding hashembed and staticvectors dropout

* upgrade to thinc 8.0.0.dev4

* add name back to avoid warning W020

* thinc dev4

* update srsly

* using thinc 8.0.0a0 !

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
2020-01-29 17:06:46 +01:00
Sofie Van Landeghem
a1b22e90cd serialize ENT_ID (#4852)
* expand serialization test for custom token attribute

* add failing test for issue 4849

* define ENT_ID as attr and use in doc serialization

* fix few typos
2020-01-06 14:57:34 +01:00
Ines Montani
db55577c45
Drop Python 2.7 and 3.5 (#4828)
* Remove unicode declarations

* Remove Python 3.5 and 2.7 from CI

* Don't require pathlib

* Replace compat helpers

* Remove OrderedDict

* Use f-strings

* Set Cython compiler language level

* Fix typo

* Re-add OrderedDict for Table

* Update setup.cfg

* Revert CONTRIBUTING.md

* Revert lookups.md

* Revert top-level.md

* Small adjustments and docs [ci skip]
2019-12-22 01:53:56 +01:00
tamuhey
1707e77c5e add char_span to Span (#4793) 2019-12-13 15:54:58 +01:00
Sofie Van Landeghem
9be4d1c105 Allow copying of user_data in as_doc (#4282)
* Allow copying the user_data with as_doc + unit test

* add option to docs

* add typing

* import fix

* workaround to avoid bool clashing ...

* bint instead of bool
2019-09-12 17:08:14 +02:00
Matthew Honnibal
944a66c326 Add span.tensor and token.tensor attributes 2019-08-01 18:30:50 +02:00
Ines Montani
d5bce35fb1 Fix bug in Span.similarity when called via hook 2019-07-27 15:33:27 +02:00
Sofie Van Landeghem
ba02957c80 Fix dependency copy for as_doc (#3969)
* failing unit test for issue 3962

* attempt to fix Issue #3962

* create artificial unit test example

* using length instead of self.length

* sp

* reformat with black

* find better ancestor within span and use generic 'dep'

* attach to span.root if there is no appropriate ancestor

* comment span text

* clean up ancestor code

* reconstruct dep tree to keep same number of sentences
2019-07-23 18:28:54 +02:00
svlandeg
8608685543 ensure Span.as_doc keeps the entity links + unit test 2019-06-25 15:28:51 +02:00
svlandeg
5b1cd49222 error msg and unit tests for setting kb_id on span 2019-03-22 12:05:35 +01:00
svlandeg
feb71e15fd hash the entity name 2019-03-22 11:36:44 +01:00
svlandeg
735fc2a735 annotate kb_id through ents in doc 2019-03-22 11:36:44 +01:00
Matthew Honnibal
72889a16d5 Fix similarity calculation if vectors are on GPU (#3440) 2019-03-20 12:09:59 +01:00
Sofie
c45ed32c74 label in span not writable anymore (#3408)
* label in span not writable anymore

* more explicit unit test and error message for readonly label

* bit more explanation (view)

* error msg tailored to specific case

* fix None case
2019-03-15 00:46:45 +01:00
Matthew Honnibal
b0b990e405 Fix token.conjuncts (closes #795) (#3392)
* Implement conjuncts method

* Add span.conjuncts property

* Un-xfail token.conjuncts tests

* Update docs for token.conjuncts and span.conjuncts

* Fix merge error in token.conjuncts
2019-03-11 17:05:45 +01:00
Ines Montani
47e9c274ef Tidy up property code style (#3391)
Use decorator if properties only have a getter and existing syntax if there's getter and setter
2019-03-11 15:59:09 +01:00
Ines Montani
296446a1c8
Tidy up and improve docs and docstrings (#3370)
<!--- Provide a general summary of your changes in the title. -->

## Description
* tidy up and adjust Cython code to code style
* improve docstrings and make calling `help()` nicer
* add URLs to new docs pages to docstrings wherever possible, mostly to user-facing objects
* fix various typos and inconsistencies in docs

### Types of change
enhancement, docs

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-08 11:42:26 +01:00
Ines Montani
9d6ca18a10 Tidy up and only use self.vector once 2019-03-07 01:06:12 +01:00
Ines Montani
a8f1efd2f5 Merge branch 'master' into develop 2019-03-07 00:56:31 +01:00
Daniel King
5f40229397 Don't use numpy directly for similarity (#3362)
* Don't use numpy directly for similarity

* Contributor agreement
2019-03-06 22:58:38 +00:00
Ines Montani
5651a0d052 💫 Replace {Doc,Span}.merge with Doc.retokenize (#3280)
* Add deprecation warning to Doc.merge and Span.merge

* Replace {Doc,Span}.merge with Doc.retokenize
2019-02-15 10:29:44 +01:00
Ines Montani
fbf9f1edf1 Also raise error in Span.__reduce__ 2019-02-13 13:22:05 +01:00
Ines Montani
ea07f3022e Only run noun chunks iterator in Span if available (closes #3199) 2019-02-08 18:33:16 +01:00
Ines Montani
ff36b14cb2 Fix whitespace 2019-02-08 18:31:31 +01:00
Ines Montani
5d0b60999d Merge branch 'master' into develop 2019-02-07 20:54:07 +01:00
Álvaro Abella Bascarán
1cd8f9823f Correct docs of Token.subtree and Span.subtree (issue #3122) (#3124)
* solve inconsistency between docs and Span.subtree (issue #3122)

* solve inconsistency between docs and Token.subtree (issue #3122)
2019-01-09 03:11:15 +01:00
Matthew Honnibal
63b7accd74
💫 Make span.as_doc() return a copy, not a view. Closes #1537 (#3107)
Initially span.as_doc() was designed to return a view of the span's contents, as a Doc object. This was a nice idea, but it fails due to the token.idx property, which refers to the character offset within the string. In a span, the idx of the first token might not be 0. Because this data is different, we can't have a view --- it'll be inconsistent.

This patch changes span.as_doc() to instead return a copy. The docs are updated accordingly. Closes #1537

* Update test for span.as_doc()

* Make span.as_doc() return a copy. Closes #1537

* Document change to Span.as_doc()
2018-12-30 15:17:46 +01:00
Álvaro Abella Bascarán
9bc4cc1352 Fix issue 2396 (#3089)
* Test on #2396: bug in Doc.get_lca_matrix()

* reimplementation of Doc.get_lca_matrix(), (closes #2396)

* reimplement Span.get_lca_matrix(), and call it from Doc.get_lca_matrix()

* tests Span.get_lca_matrix() as well as Doc.get_lca_matrix()

* implement _get_lca_matrix as a helper function in doc.pyx; call it from Doc.get_lca_matrix and Span.get_lca_matrix

* use memory view instead of np.ndarray in _get_lca_matrix (faster)

* fix bug when calling Span.get_lca_matrix; return lca matrix as np.array instead of memoryview

* cleaner conditional, add comment
2018-12-29 18:05:52 +01:00
Álvaro Abella Bascarán
6fe276f85d Fix issue 2396 (#3089)
* Test on #2396: bug in Doc.get_lca_matrix()

* reimplementation of Doc.get_lca_matrix(), (closes #2396)

* reimplement Span.get_lca_matrix(), and call it from Doc.get_lca_matrix()

* tests Span.get_lca_matrix() as well as Doc.get_lca_matrix()

* implement _get_lca_matrix as a helper function in doc.pyx; call it from Doc.get_lca_matrix and Span.get_lca_matrix

* use memory view instead of np.ndarray in _get_lca_matrix (faster)

* fix bug when calling Span.get_lca_matrix; return lca matrix as np.array instead of memoryview

* cleaner conditional, add comment
2018-12-29 18:02:26 +01:00
Matthew Honnibal
2c2db0c492 💫 Allow Span to take text label (#3031)
Fixes #3027.

* Allow Span.__init__ to take unicode values for the `label` argument.
* Allow `Span.label_` to be writeable.

- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-12-08 13:08:41 +01:00
Matthew Honnibal
4336397ecb Update develop from master 2018-08-14 03:04:28 +02:00
Ole Henrik Skogstrøm
0473add369 Feature/span ents (#2599)
* Created Span.ents property

* Add tests for span.ents

* Add tests for start and end of sentence
2018-08-07 13:52:32 +02:00
Ines Montani
cae4457c38 💫 Add .similarity warnings for no vectors and option to exclude warnings (#2197)
* Add logic to filter out warning IDs via environment variable

Usage: SPACY_WARNING_EXCLUDE=W001,W007

* Add warnings for empty vectors

* Add warning if no word vectors are used in .similarity methods

For example, if only tensors are available in small models – should hopefully clear up some confusion around this

* Capture warnings in tests

* Rename SPACY_WARNING_EXCLUDE to SPACY_WARNING_IGNORE
2018-05-21 01:22:38 +02:00