Commit Graph

11688 Commits

Author SHA1 Message Date
adrianeboyd
908dea3939
Skip duplicate lexeme rank setting (#5401)
Skip duplicate lexeme rank setting within
`_fix_pretrained_vectors_name()`.
2020-05-14 18:26:12 +02:00
adrianeboyd
f49e2810e6
Add Polish lemmatizer (#5413)
* Add Polish lemmatizer

Contributed by @ryszardtuora

* Add missing import
2020-05-14 18:23:19 +02:00
adrianeboyd
e63880e081
Use Token.sent_start for Span.sent (#5439)
Use `Token.sent_start` for sentence boundaries in `Span.sent` so that
`Doc.sents` and `Span.sent` return the same sentence boundaries.
2020-05-14 18:22:51 +02:00
adrianeboyd
780b869345
Fix syntax iterators for Persian (#5437) 2020-05-14 16:51:03 +02:00
Ilia Ivanov
ee8fe37474 Add ilivans' contributor agreement 2020-05-14 15:59:06 +02:00
Ilia Ivanov
712d9d4820 fixup! Fix ErrorsWithCodes().__class__ return value 2020-05-14 15:45:58 +02:00
Ilia Ivanov
a987e9e45d Fix ErrorsWithCodes().__class__ return value 2020-05-14 14:14:15 +02:00
Vishnu Priya VR
9ce059dd06
Limiting noun_chunks for specific languages (#5396)
* Limiting noun_chunks for specific langauges

* Limiting noun_chunks for specific languages

Contributor Agreement

* Addressing review comments

* Removed unused fixtures and imports

* Add fa_tokenizer in test suite

* Use fa_tokenizer in test

* Undo extraneous reformatting

Co-authored-by: adrianeboyd <adrianeboyd@gmail.com>
2020-05-14 12:58:06 +02:00
Sofie Van Landeghem
b04738903e
prevent None in gold fields (#5425)
* set gold fields to empty list instead of keeping them as None

* add unit test
2020-05-13 22:08:50 +02:00
adrianeboyd
113e7981d0
Check that row is within bounds when adding vector (#5430)
Check that row is within bounds for the vector data array when adding a
vector.

Don't add vectors with rank OOV_RANK in `init-model` (change is due to
shift from OOV as 0 to OOV as OOV_RANK).
2020-05-13 22:08:28 +02:00
adrianeboyd
07639dd6ac
Remove TAG from da/sv tokenizer exceptions (#5428)
Remove `TAG` value from Danish and Swedish tokenizer exceptions because
it may not be included in a tag map (and these settings are problematic
as tokenizer exceptions anyway).
2020-05-13 10:25:54 +02:00
adrianeboyd
24e7108f80
Modify array type to accommodate OOV_RANK (#5429)
Modify indices array type in `Vocab.prune_vectors` to accommodate
OOV_RANK index as max(uint64).
2020-05-13 10:25:05 +02:00
Ines Montani
f333c2a011
Merge pull request #5386 from svlandeg/fix/nel-docs 2020-05-10 12:00:09 +02:00
adrianeboyd
440b81bddc
Improve exceptions for 'd (would/had) in English (#5379)
Instead of treating `'d` in contractions like `I'd` as `would` in all
cases in the tokenizer exceptions, leave the tagging and lemmatization
up to later components.
2020-05-08 15:10:57 +02:00
Travis Hoppe
d4cc18b746
Added author information for NLPre (#5414)
* Add author links for NLPre and update category

* Add contributor statement
2020-05-08 11:28:54 +02:00
adrianeboyd
c963e269ba
Add method to update / reset pkuseg user dict (#5404) 2020-05-08 11:21:46 +02:00
adrianeboyd
4a15b559ba
Clarify Token.pos as UPOS (#5419) 2020-05-08 10:36:25 +02:00
adrianeboyd
a2345618f1
Fix Token API docs from #5375 (#5418) 2020-05-08 10:25:02 +02:00
Samuel Rodríguez Medina
5e55bfa821
Fixed tests for Swedish that were written in Danish. (#5395) 2020-05-05 14:06:27 +02:00
Adriane Boyd
565e0eef73 Add tokenizer option for token match with affixes
To fix the slow tokenizer URL (#4374) and allow `token_match` to take
priority over prefixes and suffixes by default, introduce a new
tokenizer option for a token match pattern that's applied after prefixes
and suffixes but before infixes.
2020-05-05 10:35:33 +02:00
Adriane Boyd
792c8af8cf Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match 2020-05-05 09:25:57 +02:00
adrianeboyd
c045a9c7f6
Fix logic in train CLI timing eval on GPU (#5387)
Run CPU timing in first iteration only
2020-05-01 12:05:33 +02:00
Samuel Rodríguez Medina
148b036e0c
Spanish like num improvement (#5381)
* Add tests for Spanish like_num.

* Add missing numbers in Spanish lexical attributes for like_num.

* Modify Spanish test function name.

* Add contributor agreement.
2020-04-30 11:13:23 +02:00
svlandeg
ebaed7dcfa Few more updates to the EL documentation 2020-04-30 10:17:06 +02:00
Samuel Rodríguez Medina
8602daba85
Swedish like_num (#5371)
* Sign contributor agreement.

* Add like_num functionality to Swedish.

* Update spacy/tests/lang/sv/test_lex_attrs.py

Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update contributor agreement

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-04-29 21:25:22 +02:00
adrianeboyd
74da669326
Fix problems with lower and whitespace in variants (#5361)
* Initialize lower flag explicitly

* Handle whitespace words from GoldParse correctly when creating raw
text with orth variants

* Return the text with original casing if anything goes wrong
2020-04-29 13:01:25 +02:00
adrianeboyd
3f43c73d37
Normalize TokenC.sent_start values for Matcher (#5346)
Normalize TokenC.sent_start values to booleans for the `Matcher`.
2020-04-29 12:57:30 +02:00
adrianeboyd
bdff76dede
Various updates/additions to CLI scripts (#5362)
* `debug-data`: determine coverage of provided vectors

* `evaluate`: support `blank:lg` model to make it possible to just evaluate
tokenization

* `init-model`: add option to truncate vectors to N most frequent vectors
from word2vec file

* `train`:

  * if training on GPU, only run evaluation/timing on CPU in the first
    iteration

  * if training is aborted, exit with a non-0 exit status
2020-04-29 12:56:46 +02:00
Sofie Van Landeghem
cfdaf99b80
Fix passing of component configuration (#5374)
* add kwargs to to_disk methods in docs - otherwise crashes on 'exclude' argument

* add fix and test for Issue 5137
2020-04-29 12:56:17 +02:00
Ines Montani
efec28ce70
Merge pull request #5367 from adrianeboyd/feature/simplify-warnings-v2 2020-04-29 12:55:37 +02:00
Ines Montani
63885c1836 Remove u string and auto-format [ci skip] 2020-04-29 12:54:57 +02:00
Sofie Van Landeghem
f67343295d
Update NEL examples and documentation (#5370)
* simplify creation of KB by skipping dim reduction

* small fixes to train EL example script

* add KB creation and NEL training example scripts to example section

* update descriptions of example scripts in the documentation

* moving wiki_entity_linking folder from bin to projects

* remove test for wiki NEL functionality that is being moved
2020-04-29 12:53:53 +02:00
adrianeboyd
a6e521cd79
Add is_sent_end token property (#5375)
Reconstruction of the original PR #4697 by @MiniLau.

Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema
because the Matcher is only going to be able to support `IS_SENT_START`.
2020-04-29 12:53:16 +02:00
Ines Montani
a77754120d
Merge pull request #5177 from nlptechbook/patch-5 2020-04-29 12:52:21 +02:00
Ines Montani
eac47971f1
Merge pull request #5258 from mirfan899/master 2020-04-29 12:51:55 +02:00
Ines Montani
1cbb272a6b
Update website/meta/universe.json 2020-04-29 12:51:44 +02:00
Ines Montani
732629b0dd
Update website/meta/universe.json 2020-04-29 12:51:37 +02:00
adrianeboyd
90ce34db42
Add cuda101 and cuda102 options to setup (#5377)
* Add cuda101 and cuda102 options to setup

* Update cudaNNN options in docs
2020-04-29 12:51:12 +02:00
Louis Guitton
a27c4014f5
Add mlflow to spaCy universe (#5352)
* Add mlflow to universe

* Use mlflow black logo
2020-04-29 10:18:03 +02:00
adrianeboyd
d5f18f8307
Add missing import 2020-04-28 14:01:29 +02:00
adrianeboyd
ac40a8f7a5
Add missing import 2020-04-28 14:00:11 +02:00
Adriane Boyd
3a045572ed Add missing import 2020-04-28 13:48:37 +02:00
Adriane Boyd
bc39f97e11 Simplify warnings 2020-04-28 13:37:37 +02:00
Michael
5b5528ff2e
Add !=3.4.* to python_requires (#5344)
Missed in 80d554f2e2
2020-04-27 22:02:09 +02:00
adrianeboyd
792aa7b6ab
Remove references to textcat spans (#5360)
Remove references to unimplemented `TextCategorizer` span labels in
`GoldParse` and `Doc`.
2020-04-27 18:01:12 +02:00
adrianeboyd
f8ac5b9f56
bugfix in span similarity (#5155) (#5358)
* bugfix in span similarity

* also rewrite doc.pyx for clarity

* formatting

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-04-27 16:51:27 +02:00
Sofie Van Landeghem
9203d821ae
Add 2 ini files in tests/lang (#5359) 2020-04-27 13:01:54 +02:00
Punitvara
b2b7e1f37a
This PR adds Gujarati Language class along with (#5355)
* This PR adds Gujarati Language class along with
- stop words

* Add test for gu tokenizer
2020-04-27 11:07:37 +02:00
adrianeboyd
90c754024f
Update nlp.vectors to nlp.vocab.vectors (#5357) 2020-04-27 10:53:05 +02:00
sabiqueqb
fc91660aa2
Gh 5339 language class for malayalam (#5342)
* Initialize Malayalam Language class

* Add lex_attrs and examples for Malayalam

* Add spaCy Contributor Agreement

* Add test for ml tokenizer
2020-04-27 09:45:08 +02:00