adrianeboyd
3d2c308906
Add Doc init from list of words and text ( #5251 )
...
* Add Doc init from list of words and text
Add an option to initialize a `Doc` from a text and list of words where
the words may or may not include all whitespace tokens. If the text and
words are mismatched, raise an error.
* Fix error code
* Remove all whitespace before aligning words/text
* Move words/text init to util function
* Update error message
* Rename to get_words_and_spaces
* Fix formatting
2020-04-14 19:15:52 +02:00
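The alignment described in this commit can be sketched in plain Python. This is only an illustration under stated assumptions, not the code added in the commit: the function name `align_words_to_text` is hypothetical, each word is assumed to contain no internal whitespace, and a single trailing space is stored as a per-token flag.

```python
def align_words_to_text(words, text):
    """Hypothetical helper: return (words, spaces) that rebuild `text`."""
    # Compare with all whitespace removed; a mismatch is an error.
    if "".join("".join(words).split()) != "".join(text.split()):
        raise ValueError("words and text do not match")
    out_words, out_spaces = [], []
    pos = 0
    for word in words:
        if not word or word.isspace():
            continue  # whitespace tokens are re-derived from the text
        start = text.index(word, pos)  # raises ValueError if not found
        gap = text[pos:start]
        if gap.strip():
            raise ValueError("words and text do not match")
        if gap:
            # Extra whitespace in the text becomes its own token.
            out_words.append(gap)
            out_spaces.append(False)
        out_words.append(word)
        out_spaces.append(False)
        pos = start + len(word)
        # One following space is kept as the token's trailing-space flag.
        if pos < len(text) and text[pos] == " ":
            out_spaces[-1] = True
            pos += 1
    if pos < len(text):
        # Remaining trailing whitespace.
        out_words.append(text[pos:])
        out_spaces.append(False)
    return out_words, out_spaces


words, spaces = align_words_to_text(["Hello", "world", "!"], "Hello  world!")
rebuilt = "".join(w + (" " if s else "") for w, s in zip(words, spaces))
```

Joining each word with its optional trailing space should reproduce the original text, which is the property the error check protects.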
Paolo Arduin
8ce408d2e1
Comparison predicate handling for != ( #5282 )
...
* Fix #5281
* Optim test
2020-04-14 19:14:15 +02:00
Sofie Van Landeghem
a3965ec13d
tag-map-path since 2.2.4 instead of 2.2.3 ( #5289 )
2020-04-14 14:53:47 +02:00
Leander Fiedler
6700006830
issue5230 attempted fix of pytest segfault for python3.5
2020-04-12 09:34:54 +02:00
Leander Fiedler
d60e2d3ebf
issue5230 added unit test for dumping and loading knowledgebase
2020-04-12 09:08:41 +02:00
Marek Grzenkowicz
6a8a52650f
[ Closes #5292 ] Fix typo in option name "--n-save_every" ( #5293 )
...
* Sign contributor agreement for chopeen
* Fix typo in option name and close #5292
2020-04-11 23:35:01 +02:00
Leander Fiedler
d2bb649227
issue5230 filter warnings in addition to filterwarnings to prevent deprecation warnings in python35(win) setup from popping up
2020-04-10 23:21:13 +02:00
Leander Fiedler
ca2a7a44db
issue5230 store string values of warnings to remotely debug failing python35(win) setup
2020-04-10 22:26:55 +02:00
Leander Fiedler
88ca40a15d
issue5230 raise warnings as errors to remotely debug failing python35(win) setup
2020-04-10 21:45:53 +02:00
Leander Fiedler
a7bdfe42e1
issue5230 added print statement to warnings filter to remotely debug failing python35(win) setup
2020-04-10 21:14:33 +02:00
Leander Fiedler
8c1d0d628f
issue5230 writer now checks instance of loc parameter before trying to operate on it
2020-04-10 20:35:52 +02:00
Umar Butler
8952effcc4
Fixed Typo in Warning ( #5284 )
...
* Fixed typo in cli warning
Fixed a typo in the warning for the provision of exactly two labels, which have not been designated as binary, to textcat.
* Created and signed contributor form
2020-04-09 15:46:15 +02:00
adrianeboyd
cf579a398d
Add __init__.py to eu and hy tests ( #5278 )
2020-04-08 20:03:06 +02:00
adrianeboyd
ae4af52ce7
Add ideographic stops to sentencizer ( #5263 )
...
Add ideographic half- and fullwidth full stops to default sentencizer
punctuation.
2020-04-08 12:58:39 +02:00
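A rough sketch of what adding these characters means for a punctuation-based sentence splitter. The character set and the `split_sentences` function below are assumptions for illustration and do not reproduce the sentencizer's actual defaults or implementation.

```python
# Assumed subset of sentence-final punctuation. The last two entries are
# the ideographic full stop (U+3002) and its halfwidth form (U+FF61).
PUNCT_CHARS = {".", "!", "?", "\u3002", "\uff61"}


def split_sentences(text, punct_chars=PUNCT_CHARS):
    """Split after each run of sentence-final punctuation."""
    sentences, start = [], 0
    for i, ch in enumerate(text):
        at_end_of_run = i + 1 == len(text) or text[i + 1] not in punct_chars
        if ch in punct_chars and at_end_of_run:
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


japanese = split_sentences("今日は晴れ。明日は雨。")
english = split_sentences("Hi!! Ok?")
```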
Sofie Van Landeghem
7ad0fcf01d
fix json ( #5267 )
2020-04-08 12:58:09 +02:00
adrianeboyd
fa760010a5
Set rank for new vector in Vocab.set_vector ( #5266 )
...
Set `Lexeme.rank` for vectors added with `Vocab.set_vector` so that the
lexeme `ID` accessed by a model points to the right row for the new vector.
2020-04-07 12:04:51 +02:00
lfiedler
e1e25c7e30
issue5230: added unittest test case for completion
2020-04-06 21:36:02 +02:00
Leander Fiedler
b63871ceff
issue5230: added contributors agreement
2020-04-06 21:04:06 +02:00
Leander Fiedler
cde96f6c64
issue5230: optimized unit test a bit
2020-04-06 20:51:12 +02:00
Leander Fiedler
71cc903d65
issue5230: replaced open statements on path objects so that serialization still works and files are closed
2020-04-06 20:30:41 +02:00
Leander Fiedler
273ed452bb
issue5230: added unicode declaration at top of the file
2020-04-06 19:22:32 +02:00
Leander Fiedler
1cd975d4a5
issue5230: fixed resource warnings in language
2020-04-06 18:54:32 +02:00
Leander Fiedler
493c77462a
issue5230: test cases
...
covering known sources of resource warnings
2020-04-06 18:46:51 +02:00
adrianeboyd
c981aa6684
Use inline flags in token_match patterns ( #5257 )
...
* Use inline flags in token_match patterns
Use inline flags in `token_match` patterns so that serializing does not
lose the flag information.
* Modify inline flag
* Modify inline flag
2020-04-06 13:19:04 +02:00
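The reason inline flags survive serialization can be shown with the standard `re` module. The sketch assumes that only the pattern string is stored when a compiled pattern is serialized; the URL pattern is a made-up example, not one of the patterns changed in this commit.

```python
import re

# Flag passed separately: the pattern string alone does not record it.
compiled = re.compile(r"^https?://", re.IGNORECASE)
restored = re.compile(compiled.pattern)  # flag is lost on the round trip

# Inline flag: the flag is part of the pattern string itself.
inline = re.compile(r"(?i)^https?://")
restored_inline = re.compile(inline.pattern)  # flag is preserved

before = compiled.match("HTTP://example.com") is not None
after_lossy = restored.match("HTTP://example.com") is not None
after_inline = restored_inline.match("HTTP://example.com") is not None
```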
adrianeboyd
e8be15e9b7
Improve tokenization for UD Spanish AnCora ( #5253 )
2020-04-06 13:18:23 +02:00
adrianeboyd
f4ef64a526
Improve tokenization for UD Dutch corpora ( #5259 )
...
* Improve tokenization for UD Dutch corpora
Improve tokenization for UD Dutch Alpino and LassySmall.
* Format Dutch tokenizer exceptions
2020-04-06 13:18:07 +02:00
vincent d warmerdam
f329d5663a
add "whatlies" to spaCy universe ( #5252 )
...
* Add "whatlies"
We're releasing it on our side officially on the 16th of April. If possible, let's announce around the same time :)
* sign contributor thing
* Added fancy gif as the image
* Update universe.json
Spelling error and spaCy clarification.
2020-04-06 11:29:30 +02:00
Muhammad Irfan
406d5748b3
add missing Urdu tags
2020-04-05 20:55:38 +05:00
nlptechbook
ddf3c2430d
Update universe.json
2020-04-03 12:10:03 -04:00
YohannesDatasci
beef184e53
Armenian language support ( #5246 )
...
* add Armenian language and test cases
* agreement submission
2020-04-03 13:02:18 +02:00
Sofie Van Landeghem
1137420840
Small doc fixes ( #5250 )
...
* fix link
* torchtext instead of tochtext
2020-04-03 13:01:43 +02:00
Sofie Van Landeghem
9cf965c260
avoid enumerate to avoid long waiting at 0% ( #5159 )
2020-04-02 15:04:15 +02:00
Michael Leichtfried
2b14997b68
Remove duplicated branch in if/else-if statement ( #5234 )
...
* Remove duplicated branch in if-elif-statement
* Add contributor agreement for leicmi
2020-04-02 14:47:42 +02:00
adrianeboyd
d107afcffb
Raise error for inplace resize with new vector dim ( #5228 )
...
Raise an error if there is an attempt to resize the vectors in place with
a different vector dimension.
2020-04-02 10:43:13 +02:00
Jacob Lauritzen
0b76212831
Extend and fix Danish examples ( #5227 )
...
* Extend and fix Danish examples
This PR fixes two examples, adds additional examples translated from the English version, and adds punctuation.
The two changed examples are:
* "fortov" changed to "fortovet", which is more [used](https://www.google.com/search?client=firefox-b-d&sxsrf=ALeKk0143gEuPe4IbIUpzBBt-oU10OMVqA%3A1585549036477&ei=7I6BXuvJHMGOrwSqi46oCQ&q=l%C3%B8behjul+p%C3%A5+fortov&oq=l%C3%B8behjul+p%C3%A5+fortov&gs_lcp=CgZwc3ktYWIQAzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQR1DT8xZY0_MWYK_0FmgAcAZ4AIABAIgBAJIBAJgBAKABAaoBB2d3cy13aXo&sclient=psy-ab&ved=0ahUKEwjr7964xsHoAhVBx4sKHaqFA5UQ4dUDCAo&uact=5 ) and more natural. The Swedish and Norwegian examples also use this version of the word.
* "stor by" changed to "storby". In Danish we have a specific noun to describe a large, metropolitan city, which is different from just describing a city as "large". In this sentence it would be much more natural to describe London as a "storby". Google even corrects a search for "London stor by" to "London storby".
* Sign contrib agreement
2020-04-02 10:42:35 +02:00
Ines Montani
09f8486eb1
Merge pull request #5223 from nikhilsaldanha/fix-entity-recognizer-docs
...
update docs for return type of EntityRecognizer.predict
2020-03-29 19:10:42 +02:00
Ines Montani
99da6e1d79
Merge branch 'master' into fix-entity-recognizer-docs
2020-03-29 19:10:18 +02:00
Nikhil Saldanha
4f27a24f5b
Add Kannada examples ( #5162 )
...
* Add example sentences for Kannada
* sign contributor agreement
2020-03-29 13:54:42 +02:00
adrianeboyd
d47b810ba4
Fix exclusive_classes in textcat ensemble ( #5166 )
...
Pass the exclusive_classes setting to the bow model within the ensemble
textcat model.
2020-03-29 13:52:34 +02:00
Tom Milligan
e904958115
Limit to cupy-cuda v8, so as not to pull in v9 automatically. ( #5194 )
2020-03-29 13:52:08 +02:00
adrianeboyd
963bd890c1
Modify Vectors.resize to work with cupy and improve resizing ( #5216 )
...
* Modify Vectors.resize to work with cupy
Modify `Vectors.resize` to work with cupy. Modify behavior when resizing
to a different vector dimension so that individual vectors are truncated
or extended with zeros instead of having the original values filled into
the new shape without regard for the original axes.
* Update spacy/tests/vocab_vectors/test_vectors.py
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-03-29 13:51:20 +02:00
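The difference between the two resizing behaviors can be sketched with plain Python lists. Both function names are hypothetical and the code only approximates the idea described in the commit message; it is not the `Vectors.resize` implementation.

```python
def resize_rows(rows, new_dim):
    """Truncate each vector or pad it with zeros to `new_dim` values."""
    return [row[:new_dim] + [0.0] * max(0, new_dim - len(row)) for row in rows]


def refill_flat(rows, new_dim):
    """Roughly the old behavior: pour the flat values into the new shape."""
    n_rows = len(rows)
    flat = [value for row in rows for value in row]
    flat = (flat + [0.0] * (n_rows * new_dim))[: n_rows * new_dim]
    return [flat[i:i + new_dim] for i in range(0, len(flat), new_dim)]


vectors = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
truncated = resize_rows(vectors, 2)   # each row keeps its own leading values
extended = resize_rows(vectors, 4)    # each row is padded with zeros
scrambled = refill_flat(vectors, 2)   # second row now starts with 3.0
```

In the flat refill, values from the first vector spill into the second row, which is the mix-up across axes that the per-row approach avoids.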
Nikhil Saldanha
be6d10517f
sign contributor agreement
2020-03-28 18:36:55 +01:00
Nikhil Saldanha
d1ddfa1cb7
update docs for EntityRecognizer.predict
...
return type was wrongly written as a tuple, changed to syntax.StateClass
2020-03-28 18:13:02 +01:00
Tiljander
e53232533b
Describing priority rules for overlapping matches ( #5197 )
...
* Describing priority rules for overlapping matches
* Create Tiljander.md
* Describing priority rules for overlapping matches
* Update website/docs/api/entityruler.md
Co-authored-by: Ines Montani <ines@ines.io>
2020-03-26 13:13:22 +01:00
adrianeboyd
8d3563f1c4
Minor bugfixes for train CLI ( #5186 )
...
* Omit per_type scores from model-best calculations
The addition of per_type scores to the included metrics (#4911) causes
errors when they're compared while determining the best model, so omit
them for this `max()` comparison.
* Add default speed data for interrupted train CLI
Add better speed meta defaults so that an interrupted iteration still
produces a best model.
Co-authored-by: Ines Montani <ines@ines.io>
2020-03-26 10:46:50 +01:00
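A minimal sketch of why nested per-type scores break a `max()` comparison and how omitting them avoids the error. The metric names and the `comparable` helper are illustrative assumptions, not the train CLI's actual code.

```python
def comparable(scores):
    """Hypothetical helper: keep scalar metrics, drop dict-valued ones."""
    return [v for _, v in sorted(scores.items()) if not isinstance(v, dict)]


# Two epochs that tie on the scalar metric but differ in the nested scores.
epochs = [
    {"ents_f": 80.0, "ents_per_type": {"ORG": {"f": 70.0}}},
    {"ents_f": 80.0, "ents_per_type": {"ORG": {"f": 75.0}}},
]

# Comparing the unfiltered values falls through to dict-vs-dict ordering,
# which Python 3 does not support.
try:
    max(epochs, key=lambda scores: [v for _, v in sorted(scores.items())])
    unfiltered_failed = False
except TypeError:
    unfiltered_failed = True

# With the nested scores omitted, the comparison uses scalars only.
best = max(epochs, key=comparable)
```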
adrianeboyd
a04f802099
Fix GoldParse init when token count differs ( #5191 )
...
Fix the `GoldParse` initialization when the number of tokens has changed
(due to merging subtokens with the parser).
2020-03-26 10:46:23 +01:00
adrianeboyd
d88a377bed
Remove Vectors.from_glove ( #5209 )
2020-03-26 10:45:47 +01:00
Ines Montani
828acffc12
Tidy up and auto-format
2020-03-25 12:28:12 +01:00
adrianeboyd
b71dd44dbc
Improved Romanian tokenization for UD RRT ( #5206 )
...
Modifications to Romanian tokenization to improve tokenization for
UD_Romanian-RRT.
2020-03-25 11:28:19 +01:00
adrianeboyd
86c43e55fa
Improve Lithuanian tokenization ( #5205 )
...
* Improve Lithuanian tokenization
Modify Lithuanian tokenization to improve performance for
UD_Lithuanian-ALKSNIS.
* Update Lithuanian tokenizer tests
2020-03-25 11:28:12 +01:00