adrianeboyd
908dea3939
Skip duplicate lexeme rank setting ( #5401 )
...
Skip duplicate lexeme rank setting within
`_fix_pretrained_vectors_name()`.
2020-05-14 18:26:12 +02:00
adrianeboyd
f49e2810e6
Add Polish lemmatizer ( #5413 )
...
* Add Polish lemmatizer
Contributed by @ryszardtuora
* Add missing import
2020-05-14 18:23:19 +02:00
adrianeboyd
e63880e081
Use Token.sent_start for Span.sent ( #5439 )
...
Use `Token.sent_start` for sentence boundaries in `Span.sent` so that
`Doc.sents` and `Span.sent` return the same sentence boundaries.
2020-05-14 18:22:51 +02:00
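A rough sketch of the idea behind #5439, not spaCy's actual implementation: when both the document-level sentence list and the sentence around a span are derived from the same per-token start flags, they cannot disagree. The function names and the plain list of booleans (standing in for `Token.sent_start`) are hypothetical.

```python
def sentence_starts_to_spans(sent_starts):
    """Split token indices into (start, end) sentences using the start flags."""
    starts = [i for i, flag in enumerate(sent_starts) if flag or i == 0]
    ends = starts[1:] + [len(sent_starts)]
    return list(zip(starts, ends))


def sentence_for_span(sent_starts, start, end):
    """Return the (start, end) sentence around the token span [start, end)."""
    # Walk left to the nearest token flagged as a sentence start.
    left = start
    while left > 0 and not sent_starts[left]:
        left -= 1
    # Walk right from the span end to the next sentence start (or the doc end).
    right = end
    while right < len(sent_starts) and not sent_starts[right]:
        right += 1
    return left, right
```

For a span that lies inside one sentence, `sentence_for_span` returns one of the pairs produced by `sentence_starts_to_spans`, because both read the same flags.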
adrianeboyd
780b869345
Fix syntax iterators for Persian ( #5437 )
2020-05-14 16:51:03 +02:00
Ilia Ivanov
ee8fe37474
Add ilivans' contributor agreement
2020-05-14 15:59:06 +02:00
Ilia Ivanov
712d9d4820
fixup! Fix ErrorsWithCodes().__class__ return value
2020-05-14 15:45:58 +02:00
Ilia Ivanov
a987e9e45d
Fix ErrorsWithCodes().__class__ return value
2020-05-14 14:14:15 +02:00
Vishnu Priya VR
9ce059dd06
Limiting noun_chunks for specific languages ( #5396 )
...
* Limiting noun_chunks for specific languages
Contributor Agreement
* Addressing review comments
* Removed unused fixtures and imports
* Add fa_tokenizer in test suite
* Use fa_tokenizer in test
* Undo extraneous reformatting
Co-authored-by: adrianeboyd <adrianeboyd@gmail.com>
2020-05-14 12:58:06 +02:00
Sofie Van Landeghem
b04738903e
prevent None in gold fields ( #5425 )
...
* set gold fields to empty list instead of keeping them as None
* add unit test
2020-05-13 22:08:50 +02:00
adrianeboyd
113e7981d0
Check that row is within bounds when adding vector ( #5430 )
...
Check that row is within bounds for the vector data array when adding a
vector.
Don't add vectors with rank OOV_RANK in `init-model` (change is due to
shift from OOV as 0 to OOV as OOV_RANK).
2020-05-13 22:08:28 +02:00
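The bounds check described in #5430 can be illustrated with a minimal sketch. This is an assumption-laden stand-in, not spaCy's code: `add_vector` is a hypothetical helper and a list of lists plays the role of the vector data array.

```python
def add_vector(data, row, vector):
    """Write `vector` into `data[row]`, rejecting rows outside the table."""
    # Refuse to write past the end of the table instead of failing later.
    if row < 0 or row >= len(data):
        raise ValueError(
            f"row {row} is out of bounds for a table with {len(data)} rows"
        )
    if len(vector) != len(data[row]):
        raise ValueError("vector width does not match the table")
    data[row] = list(vector)
    return row
```

A sentinel rank such as `OOV_RANK` is far larger than any real table, so a check of this kind would reject it rather than index out of range.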
adrianeboyd
07639dd6ac
Remove TAG from da/sv tokenizer exceptions ( #5428 )
...
Remove `TAG` value from Danish and Swedish tokenizer exceptions because
it may not be included in a tag map (and these settings are problematic
as tokenizer exceptions anyway).
2020-05-13 10:25:54 +02:00
adrianeboyd
24e7108f80
Modify array type to accommodate OOV_RANK ( #5429 )
...
Modify indices array type in `Vocab.prune_vectors` to accommodate
OOV_RANK index as max(uint64).
2020-05-13 10:25:05 +02:00
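To see why the array type matters for #5429, the sketch below uses the standard-library `array` module rather than the array type spaCy actually uses. The value of `OOV_RANK` follows the commit message (max(uint64)); the `fits` helper is hypothetical, and the signed array appears only for contrast.

```python
from array import array

OOV_RANK = 2**64 - 1  # max(uint64), per the commit message


def fits(typecode, value):
    """Return True if `value` can be stored in an array of this typecode."""
    try:
        array(typecode, [value])
        return True
    except OverflowError:
        return False


signed_ok = fits("q", OOV_RANK)    # signed 64-bit: too small for OOV_RANK
unsigned_ok = fits("Q", OOV_RANK)  # unsigned 64-bit: large enough
```

Only the unsigned 64-bit typecode can hold the sentinel, which is the kind of constraint the type change addresses.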
Ines Montani
f333c2a011
Merge pull request #5386 from svlandeg/fix/nel-docs
2020-05-10 12:00:09 +02:00
adrianeboyd
440b81bddc
Improve exceptions for 'd (would/had) in English ( #5379 )
...
Instead of treating `'d` in contractions like `I'd` as `would` in all
cases in the tokenizer exceptions, leave the tagging and lemmatization
up to later components.
2020-05-08 15:10:57 +02:00
Travis Hoppe
d4cc18b746
Added author information for NLPre ( #5414 )
...
* Add author links for NLPre and update category
* Add contributor statement
2020-05-08 11:28:54 +02:00
adrianeboyd
c963e269ba
Add method to update / reset pkuseg user dict ( #5404 )
2020-05-08 11:21:46 +02:00
adrianeboyd
4a15b559ba
Clarify Token.pos as UPOS ( #5419 )
2020-05-08 10:36:25 +02:00
adrianeboyd
a2345618f1
Fix Token API docs from #5375 ( #5418 )
2020-05-08 10:25:02 +02:00
Samuel Rodríguez Medina
5e55bfa821
Fixed tests for Swedish that were written in Danish. ( #5395 )
2020-05-05 14:06:27 +02:00
Adriane Boyd
565e0eef73
Add tokenizer option for token match with affixes
...
To fix slow URL tokenization (#4374) and allow `token_match` to take
priority over prefixes and suffixes by default, introduce a new
tokenizer option for a token match pattern that's applied after prefixes
and suffixes but before infixes.
2020-05-05 10:35:33 +02:00
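The ordering described above can be sketched with a toy splitter. This is a simplified illustration, not spaCy's tokenizer: the regular expressions are made up, and the name `url_match` for the new option is an assumption here, since the commit message does not name it.

```python
import re

# Illustrative rules only; real tokenizer rules are far richer.
PREFIX = re.compile(r"^[(\"]")
SUFFIX = re.compile(r"[)\".,]$")
INFIX = re.compile(r"[-/]")
URL_LIKE = re.compile(r"^https?://\S+$")


def split_chunk(chunk, token_match=None, url_match=URL_LIKE.match):
    """Split one whitespace-delimited chunk into tokens."""
    prefixes, suffixes = [], []
    while chunk:
        # token_match takes priority over prefixes and suffixes.
        if token_match and token_match(chunk):
            break
        m = PREFIX.search(chunk)
        if m:
            prefixes.append(m.group())
            chunk = chunk[m.end():]
            continue
        m = SUFFIX.search(chunk)
        if m:
            suffixes.insert(0, m.group())
            chunk = chunk[:m.start()]
            continue
        break
    if not chunk:
        middle = []
    elif (token_match and token_match(chunk)) or url_match(chunk):
        # Applied after affix stripping but before infix splitting.
        middle = [chunk]
    else:
        middle = [p for p in re.split(f"({INFIX.pattern})", chunk) if p]
    return prefixes + middle + suffixes
```

With these toy rules, a URL wrapped in parentheses loses its brackets but keeps its internal `/` and `-`, while an ordinary hyphenated word is still split on the infix.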
Adriane Boyd
792c8af8cf
Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match
2020-05-05 09:25:57 +02:00
adrianeboyd
c045a9c7f6
Fix logic in train CLI timing eval on GPU ( #5387 )
...
Run CPU timing in first iteration only
2020-05-01 12:05:33 +02:00
Samuel Rodríguez Medina
148b036e0c
Spanish like num improvement ( #5381 )
...
* Add tests for Spanish like_num.
* Add missing numbers in Spanish lexical attributes for like_num.
* Modify Spanish test function name.
* Add contributor agreement.
2020-04-30 11:13:23 +02:00
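For readers unfamiliar with `like_num`, a minimal sketch of such a lexical attribute follows. It assumes the usual shape of these functions (digits, simple fractions, then a word list); the word set is a small illustrative subset, not the list added in #5381.

```python
# Illustrative subset of Spanish number words (not the full list).
_num_words = {"cero", "uno", "dos", "tres", "diez", "veinte", "cien", "mil"}


def like_num(text):
    """Return True if the token text looks like a number."""
    # Drop a leading sign or approximation marker.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Ignore thousands and decimal separators.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Accept simple fractions such as 3/4.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text.lower() in _num_words
```

Adding missing numbers, as the commit does, amounts to extending the word list that the final lookup consults.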
svlandeg
ebaed7dcfa
Few more updates to the EL documentation
2020-04-30 10:17:06 +02:00
Samuel Rodríguez Medina
8602daba85
Swedish like_num ( #5371 )
...
* Sign contributor agreement.
* Add like_num functionality to Swedish.
* Update spacy/tests/lang/sv/test_lex_attrs.py
Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update contributor agreement
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-04-29 21:25:22 +02:00
adrianeboyd
74da669326
Fix problems with lower and whitespace in variants ( #5361 )
...
* Initialize lower flag explicitly
* Handle whitespace words from GoldParse correctly when creating raw
text with orth variants
* Return the text with original casing if anything goes wrong
2020-04-29 13:01:25 +02:00
adrianeboyd
3f43c73d37
Normalize TokenC.sent_start values for Matcher ( #5346 )
...
Normalize TokenC.sent_start values to booleans for the `Matcher`.
2020-04-29 12:57:30 +02:00
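A small sketch of the normalization in #5346, assuming the three-valued encoding where 1 marks a sentence start, 0 means unknown and -1 means not a start. The helper name is hypothetical.

```python
def normalize_sent_start(sent_start):
    """Map a three-valued sentence-start flag to a plain boolean."""
    # Only an explicit 1 counts as a sentence start; 0 and -1 are both False.
    return sent_start == 1


flags = [1, -1, 0, 1]
normalized = [normalize_sent_start(f) for f in flags]
```

Without this step, a boolean pattern value compared against the raw flag would treat -1 and 0 differently even though neither marks a sentence start.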
adrianeboyd
bdff76dede
Various updates/additions to CLI scripts ( #5362 )
...
* `debug-data`: determine coverage of provided vectors
* `evaluate`: support `blank:lg` model to make it possible to just evaluate
tokenization
* `init-model`: add option to truncate vectors to N most frequent vectors
from word2vec file
* `train`:
* if training on GPU, only run evaluation/timing on CPU in the first
iteration
* if training is aborted, exit with a non-0 exit status
2020-04-29 12:56:46 +02:00
Sofie Van Landeghem
cfdaf99b80
Fix passing of component configuration ( #5374 )
...
* add kwargs to to_disk methods in docs - otherwise crashes on 'exclude' argument
* add fix and test for Issue 5137
2020-04-29 12:56:17 +02:00
Ines Montani
efec28ce70
Merge pull request #5367 from adrianeboyd/feature/simplify-warnings-v2
2020-04-29 12:55:37 +02:00
Ines Montani
63885c1836
Remove u string and auto-format [ci skip]
2020-04-29 12:54:57 +02:00
Sofie Van Landeghem
f67343295d
Update NEL examples and documentation ( #5370 )
...
* simplify creation of KB by skipping dim reduction
* small fixes to train EL example script
* add KB creation and NEL training example scripts to example section
* update descriptions of example scripts in the documentation
* moving wiki_entity_linking folder from bin to projects
* remove test for wiki NEL functionality that is being moved
2020-04-29 12:53:53 +02:00
adrianeboyd
a6e521cd79
Add is_sent_end token property ( #5375 )
...
Reconstruction of the original PR #4697 by @MiniLau.
Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema
because the Matcher is only going to be able to support `IS_SENT_START`.
2020-04-29 12:53:16 +02:00
Ines Montani
a77754120d
Merge pull request #5177 from nlptechbook/patch-5
2020-04-29 12:52:21 +02:00
Ines Montani
eac47971f1
Merge pull request #5258 from mirfan899/master
2020-04-29 12:51:55 +02:00
Ines Montani
1cbb272a6b
Update website/meta/universe.json
2020-04-29 12:51:44 +02:00
Ines Montani
732629b0dd
Update website/meta/universe.json
2020-04-29 12:51:37 +02:00
adrianeboyd
90ce34db42
Add cuda101 and cuda102 options to setup ( #5377 )
...
* Add cuda101 and cuda102 options to setup
* Update cudaNNN options in docs
2020-04-29 12:51:12 +02:00
Louis Guitton
a27c4014f5
Add mlflow to spaCy universe ( #5352 )
...
* Add mlflow to universe
* Use mlflow black logo
2020-04-29 10:18:03 +02:00
adrianeboyd
d5f18f8307
Add missing import
2020-04-28 14:01:29 +02:00
adrianeboyd
ac40a8f7a5
Add missing import
2020-04-28 14:00:11 +02:00
Adriane Boyd
3a045572ed
Add missing import
2020-04-28 13:48:37 +02:00
Adriane Boyd
bc39f97e11
Simplify warnings
2020-04-28 13:37:37 +02:00
Michael
5b5528ff2e
Add !=3.4.* to python_requires ( #5344 )
...
Missed in 80d554f2e2
2020-04-27 22:02:09 +02:00
adrianeboyd
792aa7b6ab
Remove references to textcat spans ( #5360 )
...
Remove references to unimplemented `TextCategorizer` span labels in
`GoldParse` and `Doc`.
2020-04-27 18:01:12 +02:00
adrianeboyd
f8ac5b9f56
bugfix in span similarity ( #5155 ) ( #5358 )
...
* bugfix in span similarity
* also rewrite doc.pyx for clarity
* formatting
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-04-27 16:51:27 +02:00
Sofie Van Landeghem
9203d821ae
Add 2 ini files in tests/lang ( #5359 )
2020-04-27 13:01:54 +02:00
Punitvara
b2b7e1f37a
This PR adds Gujarati Language class along with ( #5355 )
...
* This PR adds Gujarati Language class along with stop words
* Add test for gu tokenizer
2020-04-27 11:07:37 +02:00
adrianeboyd
90c754024f
Update nlp.vectors to nlp.vocab.vectors ( #5357 )
2020-04-27 10:53:05 +02:00
sabiqueqb
fc91660aa2
Gh 5339 language class for malayalam ( #5342 )
...
* Initialize Malayalam Language class
* Add lex_attrs and examples for Malayalam
* Add spaCy Contributor Agreement
* Add test for ml tokenizer
2020-04-27 09:45:08 +02:00