Jacob Lauritzen
0b76212831
Extend and fix Danish examples ( #5227 )
...
* Extend and fix Danish examples
This PR fixes two examples, adds additional examples translated from the English version, and adds punctuation.
The two changed examples are:
* "fortov" changed to "fortovet", which is more commonly [used](https://www.google.com/search?client=firefox-b-d&sxsrf=ALeKk0143gEuPe4IbIUpzBBt-oU10OMVqA%3A1585549036477&ei=7I6BXuvJHMGOrwSqi46oCQ&q=l%C3%B8behjul+p%C3%A5+fortov&oq=l%C3%B8behjul+p%C3%A5+fortov&gs_lcp=CgZwc3ktYWIQAzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQR1DT8xZY0_MWYK_0FmgAcAZ4AIABAIgBAJIBAJgBAKABAaoBB2d3cy13aXo&sclient=psy-ab&ved=0ahUKEwjr7964xsHoAhVBx4sKHaqFA5UQ4dUDCAo&uact=5) and more natural. The Swedish and Norwegian examples also use this form of the word.
* "stor by" changed to "storby". Danish has a specific noun for a large, metropolitan city, which is different from just describing a city as "large". In this sentence it is much more natural to describe London as a "storby". Google even corrects a search for "London stor by" to "London storby".
* Sign contrib agreement
2020-04-02 10:42:35 +02:00
Sofie Van Landeghem
ab59f3124e
fix NEL overfitting test for GPU ( #5236 )
2020-04-02 10:32:52 +02:00
Sofie Van Landeghem
311133e579
Train textcat with config ( #5143 )
...
* bring back default build_text_classifier method
* remove _set_dims_ hack in favor of proper dim inference
* add tok2vec initialize to unit test
* small fixes
* add unit test for various textcat config settings
* logistic output layer does not have nO
* fix window_size setting
* proper fix
* fix W initialization
* Update textcat training example
* Use ml_datasets
* Convert training data to `Example` format
* Use `n_texts` to set proportionate dev size
* fix _init renaming on latest thinc
* avoid setting a non-existing dim
* update to thinc==8.0.0a2
* add BOW and CNN defaults for easy testing
* various experiments with train_textcat script, fix softmax activation in textcat bow
* allow textcat train script to work on other datasets as well
* have dataset as a parameter
* train textcat from config, with example config
* add config for training textcat
* formatting
* fix exclusive_classes
* fixing BOW for GPU
* bump thinc to 8.0.0a3 (not published yet so CI will fail)
* add in link_vectors_to_models which got deleted
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-03-29 19:40:36 +02:00
Ines Montani
09f8486eb1
Merge pull request #5223 from nikhilsaldanha/fix-entity-recognizer-docs
...
update docs for return type of EntityRecognizer.predict
2020-03-29 19:10:42 +02:00
Ines Montani
99da6e1d79
Merge branch 'master' into fix-entity-recognizer-docs
2020-03-29 19:10:18 +02:00
adrianeboyd
ce0e538068
Check whether doc is instantiated in Example.get_gold_parses() ( #5167 )
...
* Check whether doc is instantiated
When creating docs to pair with gold parses, modify test to check
whether a doc is unset rather than whether it contains tokens.
* Restore test of evaluate on an empty doc
* Set a minimal gold.orig for the scorer
Without a minimal gold.orig the scorer can't evaluate empty docs. This
is the v3 equivalent of #4925.
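The difference between "unset" and "contains no tokens" can be sketched as follows. This is only an illustration: `FakeDoc` and both helper functions are hypothetical names, not spaCy code, and the exact form of the earlier check is an assumption, since the log does not show it.

```python
class FakeDoc:
    """Hypothetical stand-in for a tokenized document; not spaCy's Doc."""

    def __init__(self, words):
        self.words = list(words)

    def __len__(self):
        return len(self.words)


def needs_doc_by_truthiness(doc):
    # Assumed old-style check: a doc with zero tokens is falsy,
    # so it is treated the same as a doc that was never created.
    return not doc


def needs_doc_by_identity(doc):
    # Check described in the commit: only an unset doc needs creating.
    return doc is None


empty_doc = FakeDoc([])
```

With the identity check, an empty but instantiated doc is left alone, which is what allows the evaluation test on an empty doc to be restored.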
2020-03-29 13:57:00 +02:00
Sofie Van Landeghem
d6d95674c1
bugfix in span similarity ( #5155 )
...
* bugfix in span similarity
* also rewrite doc.pyx for clarity
* formatting
2020-03-29 13:56:07 +02:00
Nikhil Saldanha
4f27a24f5b
Add kannada examples ( #5162 )
...
* Add example sentences for Kannada
* sign contributor agreement
2020-03-29 13:54:42 +02:00
adrianeboyd
d47b810ba4
Fix exclusive_classes in textcat ensemble ( #5166 )
...
Pass the exclusive_classes setting to the bow model within the ensemble
textcat model.
2020-03-29 13:52:34 +02:00
Tom Milligan
e904958115
Limit to cupy-cuda v8, so as not to pull in v9 automatically. ( #5194 )
2020-03-29 13:52:08 +02:00
adrianeboyd
963bd890c1
Modify Vectors.resize to work with cupy and improve resizing ( #5216 )
...
* Modify Vectors.resize to work with cupy
Modify `Vectors.resize` to work with cupy. Modify behavior when resizing
to a different vector dimension so that individual vectors are truncated
or extended with zeros instead of having the original values filled into
the new shape without regard for the original axes.
* Update spacy/tests/vocab_vectors/test_vectors.py
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
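The resizing behavior described above can be sketched in plain Python. This is a minimal illustration using lists of lists rather than numpy or cupy arrays; `resize_rows` and `naive_refill` are hypothetical names, and filling newly added rows with zeros is an assumption not stated in the commit message.

```python
def resize_rows(rows, new_shape):
    """Return a table of shape (n_rows, width), keeping each row's axis."""
    n_rows, width = new_shape
    resized = []
    for i in range(n_rows):
        # Rows beyond the original table start out empty (assumption).
        row = list(rows[i][:width]) if i < len(rows) else []
        # Pad with zeros when the new width exceeds the row length.
        resized.append(row + [0.0] * (width - len(row)))
    return resized


def naive_refill(rows, new_shape):
    """Flatten and refill in order, ignoring the original axes."""
    n_rows, width = new_shape
    flat = [value for row in rows for value in row]
    flat += [0.0] * (n_rows * width - len(flat))
    return [flat[i * width:(i + 1) * width] for i in range(n_rows)]


vectors = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```

Shrinking `vectors` to shape `(2, 2)` with `resize_rows` keeps the first two values of each vector, whereas `naive_refill` lets values from the first vector spill into the second row.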
2020-03-29 13:51:20 +02:00
Sofie Van Landeghem
1f9852abc3
Fix parser @ GPU ( #5210 )
...
* ensure self.bias is numpy array in parser model
* 2 more little bug fixes for parser on GPU
* removing testing GPU statement
* remove commented code
2020-03-28 23:09:35 +01:00
Nikhil Saldanha
be6d10517f
sign contributor agreement
2020-03-28 18:36:55 +01:00
Nikhil Saldanha
d1ddfa1cb7
update docs for EntityRecognizer.predict
...
return type was wrongly written as a tuple, changed to syntax.StateClass
2020-03-28 18:13:02 +01:00
Sofie Van Landeghem
9b412516e7
Fixing pickling of the parser ( #5218 )
...
* fix __reduce__ for pickling parser
* setting the move object as 'state' during pickling
* unskip test_issue4725 - works again
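For readers unfamiliar with the mechanism, the general `__reduce__` pattern of passing an object along as pickling state looks roughly like this. `ToyParser` and its attributes are hypothetical; this is not the parser's actual implementation, only a sketch of the standard-library protocol.

```python
import pickle


class ToyParser:
    """Hypothetical stand-in; not spaCy's parser class."""

    def __init__(self, name):
        self.name = name
        self.moves = None  # restored from the pickled state

    def __reduce__(self):
        # (callable, constructor args, state): on load, pickle calls
        # ToyParser(name) and then passes the state to __setstate__.
        return (ToyParser, (self.name,), self.moves)

    def __setstate__(self, state):
        self.moves = state


parser = ToyParser("parser")
parser.moves = ["SHIFT", "REDUCE"]
restored = pickle.loads(pickle.dumps(parser))
```

The third element of the returned tuple is what travels as state, so anything the constructor cannot rebuild on its own has to be placed there.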
2020-03-27 19:35:26 +01:00
Ines Montani
a0858ae761
Merge pull request #5213 from explosion/tmp/sync
...
Try master -> develop sync again (part 2)
2020-03-27 11:39:46 +01:00
Ines Montani
92b9b631ef
xfail -> skip
2020-03-27 10:51:32 +01:00
Ines Montani
ee4bb0e3b6
Fix import
2020-03-26 21:44:18 +01:00
Ines Montani
4fe2299586
xfail hanging test
2020-03-26 20:58:13 +01:00
Ines Montani
f12a46472c
Remove unicode declarations
2020-03-26 15:18:32 +01:00
Ines Montani
7453df79d1
Fix argument
2020-03-26 14:09:02 +01:00
Ines Montani
e7341db5dc
Add sent_start to pattern schema
2020-03-26 14:05:40 +01:00
Ines Montani
70ee4ef4fd
Fix small errors
2020-03-26 13:47:31 +01:00
Ines Montani
46568f40a7
Merge branch 'master' into tmp/sync
2020-03-26 13:38:14 +01:00
Tiljander
e53232533b
Describing priority rules for overlapping matches ( #5197 )
...
* Describing priority rules for overlapping matches
* Create Tiljander.md
* Describing priority rules for overlapping matches
* Update website/docs/api/entityruler.md
Co-authored-by: Ines Montani <ines@ines.io>
2020-03-26 13:13:22 +01:00
adrianeboyd
8d3563f1c4
Minor bugfixes for train CLI ( #5186 )
...
* Omit per_type scores from model-best calculations
The addition of per_type scores to the included metrics (#4911) causes
errors when they're compared while determining the best model, so omit
them for this `max()` comparison.
* Add default speed data for interrupted train CLI
Add better speed meta defaults so that an interrupted iteration still
produces a best model.
Co-authored-by: Ines Montani <ines@ines.io>
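Why nested per-type scores break a `max()` comparison can be sketched like this. The data layout, the `"ORG"` key, and the helper names are assumptions for illustration, not the train CLI's actual code.

```python
def numeric_only(scores):
    # Keep plain numbers; drop nested per-type dicts, which Python 3
    # cannot order with "<".
    return [s for s in scores if isinstance(s, (int, float))]


def find_best(runs):
    """runs: list of (scores, model_name) pairs; return the best name."""
    return max((numeric_only(scores), name) for scores, name in runs)[1]


runs = [
    ([0.81, {"ORG": 0.70}], "model0"),
    ([0.81, {"ORG": 0.75}], "model1"),
    ([0.79, {"ORG": 0.90}], "model2"),
]

try:
    max(runs)  # ties on 0.81 force a dict-vs-dict comparison
    unfiltered_ok = True
except TypeError:
    unfiltered_ok = False
```

The error only appears when the numeric scores tie, because list comparison then falls through to the dict entries, so it can go unnoticed on most runs.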
2020-03-26 10:46:50 +01:00
adrianeboyd
a04f802099
Fix GoldParse init when token count differs ( #5191 )
...
Fix the `GoldParse` initialization when the number of tokens has changed
(due to merging subtokens with the parser).
2020-03-26 10:46:23 +01:00
adrianeboyd
d88a377bed
Remove Vectors.from_glove ( #5209 )
2020-03-26 10:45:47 +01:00
Ines Montani
828acffc12
Tidy up and auto-format
2020-03-25 12:28:12 +01:00
adrianeboyd
b71dd44dbc
Improved Romanian tokenization for UD RRT ( #5206 )
...
Modifications to Romanian tokenization to improve tokenization for
UD_Romanian-RRT.
2020-03-25 11:28:19 +01:00
adrianeboyd
86c43e55fa
Improve Lithuanian tokenization ( #5205 )
...
* Improve Lithuanian tokenization
Modify Lithuanian tokenization to improve performance for
UD_Lithuanian-ALKSNIS.
* Update Lithuanian tokenizer tests
2020-03-25 11:28:12 +01:00
adrianeboyd
1a944e5976
Improve Italian tokenization ( #5204 )
...
Improve Italian tokenization for UD_Italian-ISDT.
2020-03-25 11:28:02 +01:00
adrianeboyd
923a453449
Modifications/updates to Portuguese tokenization ( #5203 )
...
Modifications to Portuguese tokenization for UD_Portuguese-Bosque.
Instead of splitting contractions as exceptions, they are kept as merged
tokens.
2020-03-25 11:27:53 +01:00
adrianeboyd
4117a5c705
Improve French tokenization ( #5202 )
...
Improve French tokenization for UD_French-Sequoia.
2020-03-25 11:27:42 +01:00
Ines Montani
a3d09ffe61
Merge pull request #5201 from adrianeboyd/feature/ud-tokenization-nb-v2
...
Improved tokenization for UD_Norwegian-Bokmaal
2020-03-25 11:27:31 +01:00
Ines Montani
0e8dfdf77e
Merge pull request #5065 from adrianeboyd/feature/ud-tokenization-da
...
Add a few more Danish tokenizer exceptions
2020-03-25 11:27:19 +01:00
Sofie Van Landeghem
218e1706ac
Bugfix linking vectors ( #5196 )
...
* restore call to _load_vectors
* bump to thinc 8.0.0a3
* bump to 3.0.0.dev4
2020-03-25 10:20:11 +01:00
Adriane Boyd
09d442f5ad
Merge remote-tracking branch 'upstream/master' into feature/ud-tokenization-da
2020-03-25 09:41:52 +01:00
Adriane Boyd
cba2d1d972
Disable failing abbreviation test
...
UD_Danish-DDT has (as far as I can tell) hallucinated periods after
abbreviations, so the changes are an artifact of the corpus and not due
to anything meaningful about Danish tokenization.
2020-03-25 09:39:26 +01:00
Adriane Boyd
79737adb90
Improved tokenization for UD_Norwegian-Bokmaal
2020-03-25 08:54:02 +01:00
Ines Montani
5f2afa0479
Merge pull request #5185 from adrianeboyd/bugfix/de-punctuation-style
...
Improve German tokenizer settings style
2020-03-24 16:38:32 +01:00
Ines Montani
3fc2309c48
Merge pull request #5174 from Baciccin/master
...
Add Ligurian language
2020-03-24 16:33:59 +01:00
Ines Montani
f434d6aaa9
Merge pull request #5190 from guerda/patch-1
...
Remove max_length parameter in PhraseMatcher example
2020-03-24 16:32:12 +01:00
Philip Gillißen
128acb9ee1
Update guerda.md
2020-03-24 10:42:30 +01:00
Philip Gillißen
5d067bcc5e
Add SCA for guerda
2020-03-24 10:42:10 +01:00
Philip Gillißen
f8b4407a29
Remove max_length parameter
...
The parameter max_length is deprecated in PhraseMatcher, as stated here: https://spacy.io/api/phrasematcher#init
2020-03-24 10:22:12 +01:00
Ines Montani
fcac1ace78
Update macOS image on Azure Pipelines
2020-03-23 22:55:47 +01:00
Ines Montani
494ec23adb
Merge pull request #5187 from adrianeboyd/update/azure-images
...
Update from macOS-10.13 to macOS-10.14
2020-03-23 20:47:49 +01:00
Adriane Boyd
30d862d4d8
Update from macOS-10.13 to macOS-10.14
2020-03-23 19:52:57 +01:00
Adriane Boyd
2897a73559
Improve German tokenizer settings style
2020-03-23 19:23:47 +01:00