Commit Graph

15850 Commits

Author SHA1 Message Date
svlandeg
6fb6a8518c bump to 3.0.0.dev7 and thinc to 8.0.0a8 2020-05-15 13:25:54 +02:00
svlandeg
047f3d7d94 remove ops argument for Adam 2020-05-15 13:25:00 +02:00
svlandeg
79d4f196e5 pin flak8 to 3.5.0 2020-05-15 11:53:01 +02:00
svlandeg
e0fda2bd81 throw warning when model_cfg is None 2020-05-15 11:02:10 +02:00
adrianeboyd
908dea3939
Skip duplicate lexeme rank setting (#5401)
Skip duplicate lexeme rank setting within
`_fix_pretrained_vectors_name()`.
2020-05-14 18:26:12 +02:00
adrianeboyd
f49e2810e6
Add Polish lemmatizer (#5413)
* Add Polish lemmatizer

Contributed by @ryszardtuora

* Add missing import
2020-05-14 18:23:19 +02:00
adrianeboyd
e63880e081
Use Token.sent_start for Span.sent (#5439)
Use `Token.sent_start` for sentence boundaries in `Span.sent` so that
`Doc.sents` and `Span.sent` return the same sentence boundaries.
2020-05-14 18:22:51 +02:00
adrianeboyd
780b869345
Fix syntax iterators for Persian (#5437) 2020-05-14 16:51:03 +02:00
Ilia Ivanov
ee8fe37474 Add ilivans' contributor agreement 2020-05-14 15:59:06 +02:00
Ilia Ivanov
712d9d4820 fixup! Fix ErrorsWithCodes().__class__ return value 2020-05-14 15:45:58 +02:00
Ilia Ivanov
a987e9e45d Fix ErrorsWithCodes().__class__ return value 2020-05-14 14:14:15 +02:00
Vishnu Priya VR
9ce059dd06
Limiting noun_chunks for specific languages (#5396)
* Limiting noun_chunks for specific langauges

* Limiting noun_chunks for specific languages

Contributor Agreement

* Addressing review comments

* Removed unused fixtures and imports

* Add fa_tokenizer in test suite

* Use fa_tokenizer in test

* Undo extraneous reformatting

Co-authored-by: adrianeboyd <adrianeboyd@gmail.com>
2020-05-14 12:58:06 +02:00
Sofie Van Landeghem
b04738903e
prevent None in gold fields (#5425)
* set gold fields to empty list instead of keeping them as None

* add unit test
2020-05-13 22:08:50 +02:00
adrianeboyd
113e7981d0
Check that row is within bounds when adding vector (#5430)
Check that row is within bounds for the vector data array when adding a
vector.

Don't add vectors with rank OOV_RANK in `init-model` (change is due to
shift from OOV as 0 to OOV as OOV_RANK).
2020-05-13 22:08:28 +02:00
adrianeboyd
07639dd6ac
Remove TAG from da/sv tokenizer exceptions (#5428)
Remove `TAG` value from Danish and Swedish tokenizer exceptions because
it may not be included in a tag map (and these settings are problematic
as tokenizer exceptions anyway).
2020-05-13 10:25:54 +02:00
adrianeboyd
24e7108f80
Modify array type to accommodate OOV_RANK (#5429)
Modify indices array type in `Vocab.prune_vectors` to accommodate
OOV_RANK index as max(uint64).
2020-05-13 10:25:05 +02:00
svlandeg
102c8c7e2f fix fan_in renaming 2020-05-12 13:56:10 +02:00
svlandeg
9fe1e23512 update to thinc 8.0.0a6 2020-05-12 13:51:25 +02:00
Ines Montani
f333c2a011
Merge pull request #5386 from svlandeg/fix/nel-docs 2020-05-10 12:00:09 +02:00
adrianeboyd
440b81bddc
Improve exceptions for 'd (would/had) in English (#5379)
Instead of treating `'d` in contractions like `I'd` as `would` in all
cases in the tokenizer exceptions, leave the tagging and lemmatization
up to later components.
2020-05-08 15:10:57 +02:00
Travis Hoppe
afb26d788f Added author information for NLPre (#5414)
* Add author links for NLPre and update category

* Add contributor statement
2020-05-08 11:29:52 +02:00
Travis Hoppe
d4cc18b746
Added author information for NLPre (#5414)
* Add author links for NLPre and update category

* Add contributor statement
2020-05-08 11:28:54 +02:00
adrianeboyd
c963e269ba
Add method to update / reset pkuseg user dict (#5404) 2020-05-08 11:21:46 +02:00
adrianeboyd
b3969c1479 Clarify Token.pos as UPOS (#5419) 2020-05-08 10:38:21 +02:00
adrianeboyd
4a15b559ba
Clarify Token.pos as UPOS (#5419) 2020-05-08 10:36:25 +02:00
adrianeboyd
a2345618f1
Fix Token API docs from #5375 (#5418) 2020-05-08 10:25:02 +02:00
Samuel Rodríguez Medina
5e55bfa821
Fixed tests for Swedish that were written in Danish. (#5395) 2020-05-05 14:06:27 +02:00
Adriane Boyd
565e0eef73 Add tokenizer option for token match with affixes
To fix the slow tokenizer URL (#4374) and allow `token_match` to take
priority over prefixes and suffixes by default, introduce a new
tokenizer option for a token match pattern that's applied after prefixes
and suffixes but before infixes.
2020-05-05 10:35:33 +02:00
Adriane Boyd
792c8af8cf Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match 2020-05-05 09:25:57 +02:00
Matthew Honnibal
eb117e2fce Add load_config_from_str helper 2020-05-02 14:09:21 +02:00
adrianeboyd
c045a9c7f6
Fix logic in train CLI timing eval on GPU (#5387)
Run CPU timing in first iteration only
2020-05-01 12:05:33 +02:00
Samuel Rodríguez Medina
148b036e0c
Spanish like num improvement (#5381)
* Add tests for Spanish like_num.

* Add missing numbers in Spanish lexical attributes for like_num.

* Modify Spanish test function name.

* Add contributor agreement.
2020-04-30 11:13:23 +02:00
svlandeg
ebaed7dcfa Few more updates to the EL documentation 2020-04-30 10:17:06 +02:00
Sofie Van Landeghem
cafe94ee04 Update NEL examples and documentation (#5370)
* simplify creation of KB by skipping dim reduction

* small fixes to train EL example script

* add KB creation and NEL training example scripts to example section

* update descriptions of example scripts in the documentation

* moving wiki_entity_linking folder from bin to projects

* remove test for wiki NEL functionality that is being moved
# Conflicts:
#	bin/wiki_entity_linking/wikipedia_processor.py
2020-04-30 09:50:02 +02:00
Samuel Rodríguez Medina
8602daba85
Swedish like_num (#5371)
* Sign contributor agreement.

* Add like_num functionality to Swedish.

* Update spacy/tests/lang/sv/test_lex_attrs.py

Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update contributor agreement

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-04-29 21:25:22 +02:00
adrianeboyd
74da669326
Fix problems with lower and whitespace in variants (#5361)
* Initialize lower flag explicitly

* Handle whitespace words from GoldParse correctly when creating raw
text with orth variants

* Return the text with original casing if anything goes wrong
2020-04-29 13:01:25 +02:00
adrianeboyd
3f43c73d37
Normalize TokenC.sent_start values for Matcher (#5346)
Normalize TokenC.sent_start values to booleans for the `Matcher`.
2020-04-29 12:57:30 +02:00
adrianeboyd
bdff76dede
Various updates/additions to CLI scripts (#5362)
* `debug-data`: determine coverage of provided vectors

* `evaluate`: support `blank:lg` model to make it possible to just evaluate
tokenization

* `init-model`: add option to truncate vectors to N most frequent vectors
from word2vec file

* `train`:

  * if training on GPU, only run evaluation/timing on CPU in the first
    iteration

  * if training is aborted, exit with a non-0 exit status
2020-04-29 12:56:46 +02:00
Sofie Van Landeghem
cfdaf99b80
Fix passing of component configuration (#5374)
* add kwargs to to_disk methods in docs - otherwise crashes on 'exclude' argument

* add fix and test for Issue 5137
2020-04-29 12:56:17 +02:00
Ines Montani
efec28ce70
Merge pull request #5367 from adrianeboyd/feature/simplify-warnings-v2 2020-04-29 12:55:37 +02:00
Ines Montani
63885c1836 Remove u string and auto-format [ci skip] 2020-04-29 12:54:57 +02:00
Ines Montani
962bf12a20
Merge pull request #5312 from odaxiom/fix/website-documentation-spacy-lookup 2020-04-29 12:54:31 +02:00
Sofie Van Landeghem
f67343295d
Update NEL examples and documentation (#5370)
* simplify creation of KB by skipping dim reduction

* small fixes to train EL example script

* add KB creation and NEL training example scripts to example section

* update descriptions of example scripts in the documentation

* moving wiki_entity_linking folder from bin to projects

* remove test for wiki NEL functionality that is being moved
2020-04-29 12:53:53 +02:00
adrianeboyd
a6e521cd79
Add is_sent_end token property (#5375)
Reconstruction of the original PR #4697 by @MiniLau.

Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema
because the Matcher is only going to be able to support `IS_SENT_START`.
2020-04-29 12:53:16 +02:00
Ines Montani
a77754120d
Merge pull request #5177 from nlptechbook/patch-5 2020-04-29 12:52:21 +02:00
Ines Montani
eac47971f1
Merge pull request #5258 from mirfan899/master 2020-04-29 12:51:55 +02:00
Sofie Van Landeghem
1bf2082ac4
update is_new_osx function (#5376) 2020-04-29 12:51:49 +02:00
Ines Montani
1cbb272a6b
Update website/meta/universe.json 2020-04-29 12:51:44 +02:00
Ines Montani
732629b0dd
Update website/meta/universe.json 2020-04-29 12:51:37 +02:00
adrianeboyd
90ce34db42
Add cuda101 and cuda102 options to setup (#5377)
* Add cuda101 and cuda102 options to setup

* Update cudaNNN options in docs
2020-04-29 12:51:12 +02:00