Commit Graph

7178 Commits

Author SHA1 Message Date
Matthew Honnibal
cc477be952
Improve gold-standard alignment (#5711)
* Remove previous alignment

* Implement better alignment, using ragged data structure

* Use pytokenizations for alignment

* Fixes

* Fixes

* Fix overlapping entities in alignment

* Fix align split_sents

* Update test

* Commit align.py

* Try to appease setuptools

* Fix flake8

* use realistic entities for testing

* Update tests for better alignment

* Improve alignment heuristic

Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2020-07-06 17:39:31 +02:00
Ines Montani
fa261d09e8 Add alternative CLI option 2020-07-06 15:57:38 +02:00
Adriane Boyd
c67fc6aa5b
Make docs_to_json backwards-compatible with v2 (#5714)
* In `spacy convert -t json` output the JSON docs wrapped in a list

* Add back token-level `ner` alongside the doc-level `entities`
2020-07-06 14:15:00 +02:00
Ines Montani
412dbb1f38
Remove dead and/or deprecated code (#5710)
* Remove dead and/or deprecated code

* Remove n_threads

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-07-06 13:06:25 +02:00
Sofie Van Landeghem
fcbf899b08
Feature/example only (#5707)
* remove _convert_examples

* fix test_gold, raise TypeError if tuples are used instead of Example objects

* throwing proper errors when the wrong type of objects are passed

* fix deprecated format in tests

* fix deprecated format in parser tests

* fix tests for NEL, morph, senter, tagger, textcat

* update regression tests with new Example format

* use make_doc

* more fixes to nlp.update calls

* few more small fixes for rehearse and evaluate

* only import ml_datasets if really necessary
2020-07-06 13:02:36 +02:00
Matthew Honnibal
3f6f087113 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-07-04 23:52:12 +02:00
Matthew Honnibal
5642507823 Fix has_unknown_spaces in Doc.copy 2020-07-04 23:52:02 +02:00
Matthew Honnibal
8870a6ded7 Specify seeds in HashEmbed 2020-07-04 23:51:49 +02:00
Ines Montani
37c3bb35e2 Auto-format 2020-07-04 16:25:34 +02:00
Ines Montani
abd173937f Auto-format and update URL 2020-07-04 14:23:44 +02:00
Ines Montani
99aff16d60 Make argument shortcut consistent 2020-07-04 14:23:32 +02:00
Matthew Honnibal
2bd1bf81f1
Refactor pretrain and support character-based objective for v3 (#5706)
* Start adding character-based stuff

* Start adding character-based objective

* Start adding character-based stuff

* Start adding character-based objective

* Remove outdated comment

* Update pretraining models

* Add/fix character-based multi-task models

* Refactor pretrain and support character-based objective

* Update pretrain config

* Remove unused

* Fix flake8 errors

* Clean up imports

* Format

* Format

* Update Thinc version

* Raise error if vectors objective but no vectors
2020-07-03 17:57:28 +02:00
Ines Montani
84fb3a3fb3 Auto-format and fix tuple 2020-07-03 15:20:10 +02:00
Matthew Honnibal
e1b3e8ee11 Set version to v3.0.0a1 2020-07-03 13:21:08 +02:00
Matthew Honnibal
a902b5f217
Record whether Doc objects are built from known spacing (#5697)
* Tell convert CLI to store user data for Doc

* Remove assert

* Add has_unknown_spaces flag on Doc

* Do not tokenize docs with unknown spaces in Corpus

* Handle conversion of unknown spaces in Example

* Fixes

* Fixes

* Draft has_known_spaces support in DocBin

* Add test for serialize has_unknown_spaces

* Fix DocBin serialization when has_unknown_spaces

* Use serialization in test
2020-07-03 12:58:16 +02:00
Adriane Boyd
abad56db7d
Add conllu2docs converter (#5704)
Add conllu2docs converter adapted from conllu2json converter
2020-07-03 12:54:32 +02:00
Jan Jessewitsch
e4dcac4a4b
Merging multiple docs into one (#5032)
* Add static method to Doc to allow merging of multiple docs.

* Add error description for the error that occurs if docs with different
vocabs (from different languages) are merged in Doc.from_docs().

* Add test for Doc.from_docs() implementation.

* Fix using numpy's concatenate in Doc.from_docs.

* Replace typing's type annotations in from_docs.

* Simply remove type annotations in from_docs.

* Add documentation for Doc.from_docs to api.

* Simplify from_docs, its test and the api doc for codebase consistency.

* Fix merging of Doc objects that end with whitespace (achieved by simply not setting the SPACY attribute on whitespace tokens). Remove two unnecessary imports of attributes.

* Add merging of user data from Doc objects in from_docs. Add user data test case to corresponding test. Add applicable warning messages.

* Fix incorrect setting of tokens idx by using concatenated spaces (again). Add test case to corresponding test.

* Add MORPH to attrs

* Update warnings calls

* Remove outdated error from merge

* Rename space_delimiter to ensure_whitespace

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-07-03 11:32:42 +02:00
Sofie Van Landeghem
41b65fd0f8
fix to pretrain script (#5699)
* fix to pretrain script

* remove unnecessary import
2020-07-02 21:48:01 +02:00
Adriane Boyd
a723fa02a1
DocBin: add version number, missing attributes and strings (#5685)
* Add version number to DocBin

Add a version number to DocBin for future use.

* Add POS to all attributes in DocBin

* Add morph string to strings in DocBin

* Update DocBin API

* Add string for ENT_KB_ID in DocBin
2020-07-02 17:41:50 +02:00
Ines Montani
d36632553a
Merge pull request #5688 from explosion/remove-deprecated
Remove deprecated methods: Doc.print_tree, Doc.merge, Span.merge
2020-07-02 15:10:30 +02:00
Ines Montani
8a5b9a6d5f
Merge pull request #5693 from svlandeg/bugfix/nel-v3 2020-07-02 14:45:46 +02:00
Ines Montani
ee8a830248
Merge pull request #5687 from svlandeg/bugfix/init-model
Fixing init_model
2020-07-02 14:10:28 +02:00
svlandeg
04ed4d60a8 raise error when links are not aligned to tokens 2020-07-02 13:57:35 +02:00
svlandeg
f503817623 fix parsing entity links in new gold format 2020-07-02 13:48:11 +02:00
Ines Montani
60c2695131 Remove deprecated methods 2020-07-01 22:33:39 +02:00
Ines Montani
fe4cfd0632 Start updating website for v3 [ci skip] 2020-07-01 21:26:39 +02:00
svlandeg
a30bc77415 bugfixing prune_vectors and vectors_loc 2020-07-01 21:00:47 +02:00
Matthew Honnibal
94a0cf46fd Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-07-01 18:45:45 +02:00
Matthew Honnibal
6a0a27e5c2 Fix max_steps 2020-07-01 18:08:14 +02:00
Ines Montani
8d90e44d74 Fix title 2020-07-01 15:38:01 +02:00
Ines Montani
8fb574900a Update parent package and version 2020-07-01 15:35:23 +02:00
Matthew Honnibal
0ada186dda Set version to v3.0.0.dev14 2020-07-01 15:31:04 +02:00
Matthew Honnibal
cb51bb637b Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-07-01 15:17:27 +02:00
Matthew Honnibal
7734cbc34d Set batch size in begin_training 2020-07-01 15:16:59 +02:00
Matthew Honnibal
1f7709e9a6 Improve max length check in corpus 2020-07-01 15:16:43 +02:00
Matthew Honnibal
2fa56484b2 Fix eval batch size 2020-07-01 15:16:25 +02:00
Matthew Honnibal
c5d12d1a22 Allow batch size to be set for evaluation in spacy train 2020-07-01 15:04:36 +02:00
Matthew Honnibal
f5532757a3 Filter out 0-length examples in Corpus 2020-07-01 15:02:37 +02:00
Ines Montani
bc87ba97e0
Merge pull request #5681 from svlandeg/bugfix/exec-cwd 2020-07-01 14:13:19 +02:00
Matthew Honnibal
52338a07bb Set version to v3.0.0.dev13 2020-07-01 02:49:17 +02:00
Matthew Honnibal
fa6d473390 Fix parser maxout_pieces=1 2020-07-01 02:48:58 +02:00
Matthew Honnibal
35af5819e0 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-07-01 01:03:39 +02:00
Matthew Honnibal
0d6edf5397 Clean up debug code in transition_system 2020-07-01 01:03:20 +02:00
Matthew Honnibal
a1b6add4c8 Fix parser gold cutting and gradient normalization 2020-07-01 01:02:58 +02:00
Matthew Honnibal
8c5a88e777 Fix per-epoch shuffling 2020-07-01 01:02:35 +02:00
svlandeg
a7d547c65e small fix 2020-06-30 21:56:17 +02:00
svlandeg
8eca7e995e add try-except to git commands to get an informative warning 2020-06-30 21:53:40 +02:00
Ines Montani
b032943c34 Fix funny printing again 2020-06-30 21:33:41 +02:00
Matthew Honnibal
d525552979 Fix efficiency of parser backprop_nonlinearity 2020-06-30 21:22:54 +02:00
Ines Montani
d64644d9d1 Adjust auto-formatting 2020-06-30 20:36:30 +02:00