Sofie Van Landeghem
492d1ec5de
Prevent alignment when texts don't match (#5867)
...
* remove empty gold.pyx
* add alignment unit test (to be used in docs)
* ensure that Alignment is only used on equal texts
* additional test using example.alignment
* formatting
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-04 16:29:18 +02:00
Matthew Honnibal
ecb3c4e8f4
Create corpus iterator and batcher from registry during training (#5865)
...
* Move batchers into their own module (and registry)
* Update CLI
* Update Corpus and batcher
* Update tests
* Update one config
* Merge 'evaluation' block back under [training]
* Import batchers in gold __init__
* Fix batchers
* Update config
* Update schema
* Update util
* Don't assume train and dev are actually paths
* Update onto-joint config
* Fix missing import
* Format
* Format
* Update spacy/gold/corpus.py
Co-authored-by: Ines Montani <ines@ines.io>
* Fix name
* Update default config
* Fix get_length option in batchers
* Update test
* Add comment
* Pass path into Corpus
* Update docstring
* Update schema and configs
* Update config
* Fix test
* Fix paths
* Fix print
* Fix create_train_batches
* [training.read_train] -> [training.train_corpus]
* Update onto-joint config
Co-authored-by: Ines Montani <ines@ines.io>
2020-08-04 15:09:37 +02:00
Sofie Van Landeghem
82347110f5
Default empty KB in EL component (#5872)
...
* EL field documentation
* documentation consistent with docs
* default empty KB, initialize vocab separately
* formatting
* add test for changing the default entity vector length
* update comment
2020-08-04 14:34:09 +02:00
Adriane Boyd
b7e3018d97
Recalculate alignment if tokenization differs (#5868)
...
* Recalculate alignment if tokenization differs
* Refactor cached alignment data
2020-08-04 14:31:32 +02:00
Adriane Boyd
c62fd878a3
Allow Doc.char_span to snap to token boundaries (#5849)
...
* Allow Doc.char_span to snap to token boundaries
Add a `mode` option to allow `Doc.char_span` to snap to token
boundaries. The `mode` options:
* `strict`: character offsets must match token boundaries (default, same as
before)
* `inside`: all tokens completely within the character span
* `outside`: all tokens at least partially covered by the character span
Add a new helper function `token_by_char` that returns the token
corresponding to a character position in the text. Update
`token_by_start` and `token_by_end` to use `token_by_char` for more
efficient searching.
* Remove unused import
* Rename mode to alignment_mode
Rename `mode` to `alignment_mode` with the options
`strict`/`contract`/`expand`. Any unrecognized modes are silently
converted to `strict`.
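
The three modes can be illustrated with a minimal pure-Python sketch. It is not the library's implementation (which works on the `Doc` through `token_by_char`); the helper name `snap_char_span` and the `(start_char, end_char)` token representation are assumptions made only for this illustration.

```python
def snap_char_span(token_spans, start, end, alignment_mode="strict"):
    """Map a character span onto token indices (hypothetical helper).

    token_spans: sorted, non-overlapping (start_char, end_char) pairs, one per token.
    Returns (first_token, last_token_exclusive), or None if no tokens qualify.
    """
    # Unrecognized modes silently fall back to "strict", as the commit describes.
    if alignment_mode not in ("strict", "contract", "expand"):
        alignment_mode = "strict"

    if alignment_mode == "strict":
        # Both offsets must coincide exactly with token boundaries.
        starts = {s: i for i, (s, _) in enumerate(token_spans)}
        ends = {e: i for i, (_, e) in enumerate(token_spans)}
        if start in starts and end in ends and starts[start] <= ends[end]:
            return starts[start], ends[end] + 1
        return None

    if alignment_mode == "contract":
        # Keep only tokens that lie completely within the character span.
        hits = [i for i, (s, e) in enumerate(token_spans) if s >= start and e <= end]
    else:
        # "expand": keep every token at least partially covered by the span.
        hits = [i for i, (s, e) in enumerate(token_spans) if s < end and e > start]

    if not hits:
        return None
    return hits[0], hits[-1] + 1


# Token offsets for the text "I like New York"
tokens = [(0, 1), (2, 6), (7, 10), (11, 15)]
# Characters 7..13 cover "New Yo", which ends in the middle of "York".
print(snap_char_span(tokens, 7, 13, "strict"))
print(snap_char_span(tokens, 7, 13, "contract"))
print(snap_char_span(tokens, 7, 13, "expand"))
```

In this sketch, `contract` drops the partially covered token "York", while `expand` includes it.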
2020-08-04 13:36:32 +02:00
Adriane Boyd
b841248589
Add Span index boundary checks (#5861)
...
* Add Span index boundary checks
* Return Span-specific IndexError in all cases
* Simplify and fix if/else
2020-08-04 13:35:25 +02:00
Adriane Boyd
cd59979ab4
Fix span boundary handling in Spanish noun_chunks (#5860)
2020-08-03 13:53:15 +02:00
Ines Montani
934447a611
Merge pull request #5855 from svlandeg/fix/cli-debug
2020-08-03 13:09:20 +02:00
Li Zhe
296f8b65b4
fix the wrong hash url in adding-languages.md file (#5810)
...
* fix the wrong hash url in adding-languages.md file
change the #101 url hash path to #language-data
* filled in the spaCy Contributor Agreement
2020-08-02 23:15:56 +02:00
Ines Montani
4c055f0aa7
Add init CLI and init config (#5854)
...
* Add init CLI and init config draft
* Improve config validation
* Auto-format
* Don't export anything in debug config
* Update docs
2020-08-02 15:18:30 +02:00
svlandeg
6f4e46ee93
Merge remote-tracking branch 'upstream/develop' into fix/cli-debug
...
# Conflicts:
# pyproject.toml
# requirements.txt
# setup.cfg
2020-08-01 18:38:59 +02:00
Ines Montani
e393ebd78b
Merge pull request #5851 from explosion/feature/better-pipe-analysis
2020-08-01 14:20:27 +02:00
Ines Montani
b40f44419b
Simplify pipe analysis
...
- remove unused code
- don't print by default
- integrate attrs info into analysis output
2020-08-01 13:40:06 +02:00
Ines Montani
93144bde97
Update code block style [ci skip]
2020-07-31 18:55:55 +02:00
Ines Montani
98c6a85c8b
Update docs [ci skip]
2020-07-31 18:55:38 +02:00
Ines Montani
b68c53858c
Remove global
2020-07-31 18:37:58 +02:00
Ines Montani
30a76fcf6f
Integrate and simplify pipe analysis
2020-07-31 18:34:35 +02:00
svlandeg
9b719dfb1a
use divider in between steps
2020-07-31 18:06:48 +02:00
svlandeg
51ffc4a166
rename pipe_name to component
2020-07-31 17:58:55 +02:00
svlandeg
878327d38e
set printing of final predictions to False by default
2020-07-31 17:36:32 +02:00
Ines Montani
2d955fbf98
Fix linting [ci skip]
2020-07-31 17:05:28 +02:00
Ines Montani
e9e8fa2466
Update docs and types
2020-07-31 17:02:54 +02:00
Ines Montani
dab31426e1
Pin to latest Thinc
2020-07-31 17:00:14 +02:00
svlandeg
cc2f58a1b0
use data_validation context manager
2020-07-31 16:49:42 +02:00
Adriane Boyd
ac14ce7c30
Prefer earlier spans in EntityRuler (#5843)
...
Similar to #4414, update the sorting in EntityRuler to prefer the first
span in overlapping spans.
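
A rough sketch of this kind of tie-breaking is shown below. It assumes that longer spans are still preferred first and that the earlier start only decides between spans of equal length; that ordering, the helper name `filter_overlapping`, and the `(start, end)` tuple representation are all assumptions for illustration, not the EntityRuler code itself.

```python
def filter_overlapping(spans):
    """Drop overlapping spans, preferring longer ones, then earlier ones.

    spans: iterable of (start, end) token offsets with an exclusive end.
    Hypothetical helper, not part of the library.
    """
    # Sort by length (descending), then by start (ascending): with
    # reverse=True, the negated start makes the earliest span come first.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    seen_tokens = set()
    kept = []
    for start, end in ordered:
        # Keep a span only if none of its tokens is already claimed.
        if all(i not in seen_tokens for i in range(start, end)):
            kept.append((start, end))
            seen_tokens.update(range(start, end))
    return sorted(kept)


# (0, 2) and (1, 3) overlap and have equal length, so the earlier one wins.
print(filter_overlapping([(1, 3), (0, 2), (4, 5)]))
```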
2020-07-31 16:09:32 +02:00
svlandeg
5fa3235d06
set DATA_VALIDATION to False for debug_model (upgrade thinc)
2020-07-31 15:21:01 +02:00
svlandeg
08d3c36c20
bugfix in train CLI
2020-07-31 15:03:43 +02:00
Ines Montani
6365837ca9
Merge pull request #5833 from explosion/feature/scorer-adjustments
2020-07-31 14:00:39 +02:00
Ines Montani
5a221f79c2
Revert "Remove keyword-only from Scorer API docs" [ci skip]
...
This reverts commit 7a6ac47dc1.
2020-07-31 14:00:21 +02:00
Ines Montani
160f1a5f94
Update docs [ci skip]
2020-07-31 13:26:39 +02:00
Adriane Boyd
9b509aa87f
Move Language.evaluate scorer config to new arg
...
Move `Language.evaluate` scorer config from `component_cfg` to separate
argument `scorer_cfg`.
2020-07-31 11:05:16 +02:00
Adriane Boyd
901801b33b
Fix default arguments in DependencyParser.score
2020-07-31 10:55:44 +02:00
Adriane Boyd
9d79916792
Merge branch 'develop' into feature/scorer-adjustments
2020-07-31 10:48:14 +02:00
Sofie Van Landeghem
ca491722ad
The Parser is now a Pipe (2) (#5844)
...
* moving syntax folder to _parser_internals
* moving nn_parser and transition_system
* move nn_parser and transition_system out of internals folder
* moving nn_parser code into transition_system file
* rename transition_system to transition_parser
* moving parser_model and _state to ml
* move _state back to internals
* The Parser now inherits from Pipe!
* small code fixes
* removing unnecessary imports
* remove link_vectors_to_models
* transition_system to internals folder
* little bit more cleanup
* newlines
2020-07-30 23:30:54 +02:00
svlandeg
0b23594953
pipe_name instead of section in debug_model
2020-07-30 20:06:28 +02:00
holubvl3
d16c0f2c3a
Create holubvl3 (#5845)
...
* Create holubvl3
* Rename holubvl3 to holubvl3.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-07-30 17:40:31 +02:00
Rahul Gupta
f76fae0e8d
English: adds ordinal numbers (#5830)
2020-07-29 20:22:47 +02:00
Ines Montani
3449c45fd9
Update docs [ci skip]
2020-07-29 19:48:26 +02:00
Ines Montani
9c80cb673d
Update docs [ci skip]
2020-07-29 19:41:34 +02:00
Ines Montani
9f69afdd1e
Update docs [ci skip]
2020-07-29 19:09:44 +02:00
Ines Montani
7a21775cd0
Merge pull request #5834 from explosion/feature/vectors
2020-07-29 18:49:26 +02:00
Gustavo Zadrozny Leyendecker
90b958fd01
Fix EntityRenderer to support line breaks after the last entity (closes #5838)
2020-07-29 18:48:39 +02:00
Ines Montani
6a5c853edb
Fix docs [ci skip]
2020-07-29 18:45:12 +02:00
Ines Montani
158d8c1e48
Update docs [ci skip]
2020-07-29 18:44:10 +02:00
Matthew Honnibal
f7adc9d3b7
Start rewriting vectors docs
2020-07-29 17:10:06 +02:00
Ines Montani
b0f57a0cac
Update docs and consistency
2020-07-29 15:14:07 +02:00
Matthew Honnibal
a2d573c039
Merge branch 'feature/vectors' of https://github.com/explosion/spaCy into feature/vectors
2020-07-29 14:56:27 +02:00
Matthew Honnibal
ebdb3f5f04
Fix config
2020-07-29 14:56:11 +02:00
Matthew Honnibal
2af741d7e3
Fix train arg
2020-07-29 14:56:01 +02:00
Matthew Honnibal
c27309f839
Merge branch 'develop' into feature/vectors
2020-07-29 14:54:10 +02:00