Commit Graph

8334 Commits

Author SHA1 Message Date
Adriane Boyd
0041dfbc7f
Use special matcher for exceptions with spaces (#6668)
Use the special cases phrase matcher for exceptions that include space
characters so that exceptions including spaces are supported.
2021-01-06 12:05:10 +08:00
Sofie Van Landeghem
afc5714d32
multi-label textcat component (#6474)
* multi-label textcat component

* formatting

* fix comment

* cleanup

* fix from #6481

* random edit to push the tests

* add explicit error when textcat is called with multi-label gold data

* fix error nr

* small fix
2021-01-06 13:07:14 +11:00
Bruno
1a77607036
spaCy v3 is not saving the best version in training loop (#6629)
* Save best only if is the best and also respect the average config

* Create bratao.md

* Update loop.py

* Remove average check

* Keep before_to_disk
2021-01-06 12:51:30 +11:00
Sofie Van Landeghem
29b59086f9
Prevent 0-length mem alloc (#6653)
* prevent 0-length mem alloc by adding asserts

* fix lexeme mem allocation
2021-01-06 12:50:17 +11:00
Ines Montani
6f83abb971
Merge pull request #6647 from svlandeg/feature/init_config_overwrite 2021-01-05 14:59:04 +11:00
Ines Montani
81f018fb67
Merge pull request #6671 from explosion/chore/tidy-autoformat
Tidy up and auto-format
2021-01-05 14:45:31 +11:00
Ines Montani
224a3590e9
Merge pull request #6654 from svlandeg/chore/tests-cleanup
Unskipping tests
2021-01-05 13:53:40 +11:00
Ines Montani
a9e845426f Use --force for consistency and add docs 2021-01-05 13:49:59 +11:00
Ines Montani
c4993f16d0
Merge pull request #6651 from svlandeg/bugfix/cli_info 2021-01-05 13:44:26 +11:00
Ines Montani
991669c934 Tidy up and auto-format 2021-01-05 13:41:53 +11:00
Adriane Boyd
b57be94c78
Fix memory issues in Language.evaluate (#6386)
* Fix memory issues in Language.evaluate

Reset annotation in predicted docs before evaluating and store all data
in `examples`.

* Minor refactor to docs generator init

* Fix generator expression

* Fix final generator check

* Refactor pipeline loop

* Handle examples generator in Language.evaluate

* Add test with generator

* Use make_doc
2020-12-31 10:45:50 +11:00
svlandeg
a6a68da673 unskipping tests with python >= 3.6 2020-12-30 18:46:43 +01:00
svlandeg
d5ff0fecf8 add docs 2020-12-30 14:01:13 +01:00
svlandeg
c74ab6a313 fix imports 2020-12-30 12:40:12 +01:00
svlandeg
712a78b74a add simple unit test 2020-12-30 12:35:26 +01:00
svlandeg
4347e6d39b fixes for CLI info command 2020-12-30 12:05:58 +01:00
svlandeg
62b4fe118f prevent overwriting existing config file 2020-12-29 15:40:22 +01:00
Adriane Boyd
5ca57d8221
Add logger warning when serializing user hooks (#6595)
Add a warning that user hooks are lost on serialization.

Add a `user_hooks` exclude to skip the warning with pickle.
2020-12-29 11:54:32 +01:00
Adriane Boyd
cabd4ae5b1
Use logger.warning instead of logger.warn (#6596)
Use `logger.warning` instead of deprecated `logger.warn`.
2020-12-21 08:25:10 +08:00
Sofie Van Landeghem
282a3b49ea
Fix parser resizing when there is no upper layer (#6460)
* allow resizing of the parser model even when upper=False

* update from spacy.TransitionBasedParser.v1 to v2

* bugfix
2020-12-18 18:56:57 +08:00
Sofie Van Landeghem
0a923a7915
Tagger robustness (#6580)
* require labels in taggers

* ensure tagger works with incomplete data
2020-12-18 18:51:47 +08:00
Adriane Boyd
e10295c9fd
Fix memory leak when adding empty morph (#6581)
Fix lookup of empty morph in the morphology table, which fixes a memory
leak where a new morphology tag was allocated each time the empty morph
tag was added.
2020-12-18 18:51:01 +08:00
Ines Montani
e9b0963827
Merge pull request #6333 from adrianeboyd/chore/python39 2020-12-17 22:11:57 +11:00
Ines Montani
47c1ec678b Merge branch 'develop' into pr/6333 2020-12-17 10:19:28 +11:00
Ines Montani
3f90bffa27
Merge pull request #6571 from adrianeboyd/bugfix/debug-data-missing-vectors
Fix alignment and vector checks in debug data
2020-12-17 10:10:47 +11:00
Thomas Bird
cbb8c66da3 prevent the root logger from inialising 2020-12-15 19:50:34 +00:00
Adriane Boyd
1ddf2f39c7
Switch converters to generator functions (#6547)
* Switch converters to generator functions

To reduce the memory usage when converting large corpora, refactor the
convert methods to be generator functions.

* Update tests
2020-12-15 16:47:16 +08:00
Adriane Boyd
20e18cc246 Fix alignment and vector checks in debug data
* Update token alignment check to use Example alignment
* Update missing vector check further related to changes in v3
2020-12-15 09:43:14 +01:00
Matthew Honnibal
8656a08777
Add beam_parser and beam_ner components for v3 (#6369)
* Get basic beam tests working

* Get basic beam tests working

* Compile _beam_utils

* Remove prints

* Test beam density

* Beam parser seems to train

* Draft beam NER

* Upd beam

* Add hypothesis as dev dependency

* Implement missing is-gold-parse method

* Implement early update

* Fix state hashing

* Fix test

* Fix test

* Default to non-beam in parser constructor

* Improve oracle for beam

* Start refactoring beam

* Update test

* Refactor beam

* Update nn

* Refactor beam and weight by cost

* Update ner beam settings

* Update test

* Add __init__.pxd

* Upd test

* Fix test

* Upd test

* Fix test

* Remove ring buffer history from StateC

* WIP change arc-eager transitions

* Add state tests

* Support ternary sent start values

* Fix arc eager

* Fix NER

* Pass oracle cut size for beam

* Fix ner test

* Fix beam

* Improve StateC.clone

* Improve StateClass.borrow

* Work directly with StateC, not StateClass

* Remove print statements

* Fix state copy

* Improve state class

* Refactor parser oracles

* Fix arc eager oracle

* Fix arc eager oracle

* Use a vector to implement the stack

* Refactor state data structure

* Fix alignment of sent start

* Add get_aligned_sent_starts method

* Add test for ae oracle when bad sentence starts

* Fix sentence segment handling

* Avoid Reduce that inserts illegal sentence

* Update preset SBD test

* Fix test

* Remove prints

* Fix sent starts in Example

* Improve python API of StateClass

* Tweak comments and debug output of arc eager

* Upd test

* Fix state test

* Fix state test
2020-12-13 09:08:32 +08:00
Ines Montani
513c4e332a
Include custom code via spacy package command (#6531) 2020-12-10 20:36:46 +08:00
Ines Montani
2a6043fabb
Merge pull request #6530 from explosion/feature/init-config-cpu-gpu 2020-12-10 09:38:46 +11:00
Ines Montani
9d32e839d3 Merge branch 'develop' into feature/init-config-cpu-gpu 2020-12-10 08:50:53 +11:00
Adriane Boyd
6ee6e41234 Update docstring for Language.evaluate 2020-12-09 10:21:39 +01:00
Adriane Boyd
fa8fa474a3 Add nlp.batch_size setting
Add a default `batch_size` setting for `Language.pipe` and
`Language.evaluate` as `nlp.batch_size`.
2020-12-09 09:13:26 +01:00
Ines Montani
f2571b5ec4
Merge pull request #6444 from adrianeboyd/chore/update-develop-from-master 2020-12-09 13:09:58 +11:00
Ines Montani
90171f2031
Merge pull request #6528 from svlandeg/feature/pipe_fill_config 2020-12-09 12:01:22 +11:00
Ines Montani
dfaef27f90
Merge pull request #6503 from adrianeboyd/feature/lemmatizer-rule-warning-pos
Warn on empty POS for the rule-based lemmatizer
2020-12-09 11:34:16 +11:00
Ines Montani
271923eaea Fix retokenizer 2020-12-09 11:29:55 +11:00
Ines Montani
b85bd63eca Fix test 2020-12-09 11:24:01 +11:00
Ines Montani
febf71af28 Fix test 2020-12-09 11:23:07 +11:00
Ines Montani
1da1568110 Remove tag map 2020-12-09 11:13:49 +11:00
Ines Montani
1980203229 Merge branch 'master' into pr/6444 2020-12-09 11:09:40 +11:00
Ines Montani
05a2812ae0 Merge branch 'develop' into pr/6444 2020-12-09 11:04:03 +11:00
Ines Montani
758ad6c3cd Make CPU the default for init config 2020-12-09 11:00:51 +11:00
Ines Montani
5d605d539d Remove output_file from init_config helper 2020-12-09 10:57:55 +11:00
Sofie Van Landeghem
cfc72c2995
Bugfix multi-label textcat reproducibility (#6481)
* add test for multi-label textcat reproducibility

* remove positive_label

* fix lengths dtype

* fix comments

* remove comment that we should not have forgotten :-)
2020-12-09 06:29:15 +08:00
Sofie Van Landeghem
de108ed3e8
Add specific error when StaticVectors can't read the vectors data (#6450) 2020-12-09 06:16:07 +08:00
Koichi Yasuoka
0afb54ac93
JapaneseTokenizer.pipe added (#6515)
* JapaneseTokenizer.pipe added

For [spacymoji](https://spacy.io/universe/project/spacymoji)  with `Japanese()`.

* DummyTokenizer.pipe added instead
2020-12-08 20:02:23 +01:00
svlandeg
8f8a7f1733 returning config in init_config 2020-12-08 17:37:20 +01:00
Ines Montani
8921364579
Merge pull request #6521 from explosion/feature/config-stdin
Allow reading config from stdin in spacy train
2020-12-08 22:07:43 +11:00