Commit Graph

11345 Commits

Author SHA1 Message Date
adrianeboyd
74da669326
Fix problems with lower and whitespace in variants (#5361)
* Initialize lower flag explicitly

* Handle whitespace words from GoldParse correctly when creating raw
text with orth variants

* Return the text with original casing if anything goes wrong
2020-04-29 13:01:25 +02:00
adrianeboyd
3f43c73d37
Normalize TokenC.sent_start values for Matcher (#5346)
Normalize TokenC.sent_start values to booleans for the `Matcher`.
2020-04-29 12:57:30 +02:00
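The normalization described in 3f43c73d37 can be illustrated with a minimal sketch. It assumes the three-valued integer encoding of `sent_start` (1 for a sentence start, 0 for unset, -1 for explicitly not a start). The helper name `normalize_sent_start` is hypothetical and is not taken from the commit.

```python
def normalize_sent_start(value: int) -> bool:
    """Collapse a three-valued sent_start integer into a boolean.

    Assumption: only the value 1 marks a sentence start; both 0 (unset)
    and -1 (explicitly not a start) are treated as False.
    """
    return value == 1


# Each raw value mapped to the boolean a matcher would compare against.
normalized = {raw: normalize_sent_start(raw) for raw in (-1, 0, 1)}
```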
adrianeboyd
bdff76dede
Various updates/additions to CLI scripts (#5362)
* `debug-data`: determine coverage of provided vectors

* `evaluate`: support `blank:lg` model to make it possible to just evaluate
tokenization

* `init-model`: add option to truncate vectors to N most frequent vectors
from word2vec file

* `train`:

  * if training on GPU, only run evaluation/timing on CPU in the first
    iteration

  * if training is aborted, exit with a non-0 exit status
2020-04-29 12:56:46 +02:00
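The `init-model` truncation option in bdff76dede can be sketched roughly as follows. This assumes the word2vec text format (a header line with row count and dimension, then one word per line) and that rows are already sorted by descending frequency, so keeping the first N rows keeps the N most frequent words. The function `read_vectors` here is a hypothetical stand-alone helper, not the CLI's own code.

```python
from typing import Dict, Iterable, List


def read_vectors(lines: Iterable[str], truncate: int = 0) -> Dict[str, List[float]]:
    """Read word2vec-style text lines, keeping at most `truncate` rows.

    A `truncate` value below 1 means "keep everything".
    """
    it = iter(lines)
    n_rows, dim = (int(x) for x in next(it).split())
    if truncate >= 1:
        n_rows = min(n_rows, truncate)
    vectors: Dict[str, List[float]] = {}
    for line in it:
        if len(vectors) >= n_rows:
            break
        # Split from the right so that only the last `dim` fields are values.
        pieces = line.rstrip().rsplit(" ", dim)
        vectors[pieces[0]] = [float(v) for v in pieces[1:]]
    return vectors


sample = ["3 2", "the 0.1 0.2", "of 0.3 0.4", "zyzzyva 0.5 0.6"]
top_two = read_vectors(sample, truncate=2)
```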
Sofie Van Landeghem
cfdaf99b80
Fix passing of component configuration (#5374)
* add kwargs to to_disk methods in docs - otherwise crashes on 'exclude' argument

* add fix and test for Issue 5137
2020-04-29 12:56:17 +02:00
Ines Montani
efec28ce70
Merge pull request #5367 from adrianeboyd/feature/simplify-warnings-v2 2020-04-29 12:55:37 +02:00
Ines Montani
63885c1836
Remove u string and auto-format [ci skip] 2020-04-29 12:54:57 +02:00
Sofie Van Landeghem
f67343295d
Update NEL examples and documentation (#5370)
* simplify creation of KB by skipping dim reduction

* small fixes to train EL example script

* add KB creation and NEL training example scripts to example section

* update descriptions of example scripts in the documentation

* moving wiki_entity_linking folder from bin to projects

* remove test for wiki NEL functionality that is being moved
2020-04-29 12:53:53 +02:00
adrianeboyd
a6e521cd79
Add is_sent_end token property (#5375)
Reconstruction of the original PR #4697 by @MiniLau.

Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema
because the Matcher is only going to be able to support `IS_SENT_START`.
2020-04-29 12:53:16 +02:00
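One way to derive the property added in a6e521cd79 from existing sentence-start flags is sketched below. It assumes a token ends a sentence when it is the last token or when the following token starts a sentence, and that an unknown start (`None`) makes the end unknown as well. The free function `is_sent_end` over a plain list is a hypothetical stand-in for the token property.

```python
from typing import List, Optional


def is_sent_end(sent_starts: List[Optional[bool]], i: int) -> Optional[bool]:
    """Return whether token `i` ends a sentence, given per-token start flags.

    The last token always ends a sentence. Otherwise the answer mirrors the
    next token's start flag, including None when that flag is unknown.
    """
    if i + 1 == len(sent_starts):
        return True
    return sent_starts[i + 1]


starts = [True, False, False, True, False]
ends = [is_sent_end(starts, i) for i in range(len(starts))]
```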
Ines Montani
a77754120d
Merge pull request #5177 from nlptechbook/patch-5 2020-04-29 12:52:21 +02:00
Ines Montani
eac47971f1
Merge pull request #5258 from mirfan899/master 2020-04-29 12:51:55 +02:00
Ines Montani
1cbb272a6b
Update website/meta/universe.json 2020-04-29 12:51:44 +02:00
Ines Montani
732629b0dd
Update website/meta/universe.json 2020-04-29 12:51:37 +02:00
adrianeboyd
90ce34db42
Add cuda101 and cuda102 options to setup (#5377)
* Add cuda101 and cuda102 options to setup

* Update cudaNNN options in docs
2020-04-29 12:51:12 +02:00
Louis Guitton
a27c4014f5
Add mlflow to spaCy universe (#5352)
* Add mlflow to universe

* Use mlflow black logo
2020-04-29 10:18:03 +02:00
adrianeboyd
d5f18f8307
Add missing import 2020-04-28 14:01:29 +02:00
adrianeboyd
ac40a8f7a5
Add missing import 2020-04-28 14:00:11 +02:00
Adriane Boyd
3a045572ed
Add missing import 2020-04-28 13:48:37 +02:00
Adriane Boyd
bc39f97e11
Simplify warnings 2020-04-28 13:37:37 +02:00
Michael
5b5528ff2e
Add !=3.4.* to python_requires (#5344)
Missed in 80d554f2e2
2020-04-27 22:02:09 +02:00
adrianeboyd
792aa7b6ab
Remove references to textcat spans (#5360)
Remove references to unimplemented `TextCategorizer` span labels in
`GoldParse` and `Doc`.
2020-04-27 18:01:12 +02:00
adrianeboyd
f8ac5b9f56
bugfix in span similarity (#5155) (#5358)
* bugfix in span similarity

* also rewrite doc.pyx for clarity

* formatting

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-04-27 16:51:27 +02:00
Sofie Van Landeghem
9203d821ae
Add 2 ini files in tests/lang (#5359) 2020-04-27 13:01:54 +02:00
Punitvara
b2b7e1f37a
This PR adds Gujarati Language class along with (#5355)
* This PR adds Gujarati Language class along with stop words

* Add test for gu tokenizer
2020-04-27 11:07:37 +02:00
adrianeboyd
90c754024f
Update nlp.vectors to nlp.vocab.vectors (#5357) 2020-04-27 10:53:05 +02:00
sabiqueqb
fc91660aa2
Gh 5339 language class for malayalam (#5342)
* Initialize Malayalam Language class

* Add lex_attrs and examples for Malayalam

* Add spaCy Contributor Agreement

* Add test for ml tokenizer
2020-04-27 09:45:08 +02:00
adrianeboyd
84e06f9fb7
Improve GoldParse NER alignment (#5335)
Improve GoldParse NER alignment by including all cases where the start
and end of the NER span can be aligned, regardless of internal
tokenization differences.

To do this, convert BILUO tags to character offsets, check start/end
alignment with `doc.char_span()`, and assign the BILUO tags for the
aligned spans. Alignment for `O/-` tags is handled through the
one-to-one and multi alignments.
2020-04-23 16:58:23 +02:00
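The first step described in 84e06f9fb7, converting BILUO tags to character offsets, can be sketched in plain Python. The sketch assumes tokens are given as words plus trailing-space flags and that single spaces separate tokens. The function name `biluo_to_char_offsets` is hypothetical, and the later check against `doc.char_span()` is not reproduced here.

```python
from typing import List, Tuple


def biluo_to_char_offsets(
    words: List[str], spaces: List[bool], tags: List[str]
) -> List[Tuple[int, int, str]]:
    """Convert token-level BILUO tags to (start_char, end_char, label) spans."""
    # Character offsets of each token in the reconstructed text.
    offsets = []
    pos = 0
    for word, space in zip(words, spaces):
        offsets.append((pos, pos + len(word)))
        pos += len(word) + (1 if space else 0)

    spans = []
    start = None
    for (tok_start, tok_end), tag in zip(offsets, tags):
        if tag.startswith("U-"):
            spans.append((tok_start, tok_end, tag[2:]))
        elif tag.startswith("B-"):
            start = tok_start
        elif tag.startswith("L-") and start is not None:
            spans.append((start, tok_end, tag[2:]))
            start = None
        # "I-", "O" and "-" tags do not open or close a span.
    return spans


spans = biluo_to_char_offsets(
    ["New", "York", "rocks"], [True, True, False], ["B-GPE", "L-GPE", "O"]
)
```

Because the result is expressed in characters, a span can still be recovered when another tokenization splits "New York" differently, as long as the span's start and end fall on token boundaries.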
Mike
481574cbc8
[minor doc change] embedding vis. link is broken in website/docs/usage/examples.md (#5325)
* The embedding vis. link is broken

The first link seems to be reasonable for now unless someone has an updated embedding vis they want to share?

* contributor agreement

* Update Mlawrence95.md

* Update website/docs/usage/examples.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-04-21 20:35:12 +02:00
adrianeboyd
521f361052
Switch to new gold.align method (#5334)
* Switch from original `_align` to new simpler alignment algorithm from
  #4526

* Remove alignment normalizations beyond whitespace and lowercasing
2020-04-21 19:31:03 +02:00
adrianeboyd
bf5c13d170
Modify jieba install message (#5328)
Modify jieba install message to instruct the user to use
`ChineseDefaults.use_jieba = False` so that it's possible to load
pkuseg-only models without jieba installed.
2020-04-20 22:06:53 +02:00
Ines Montani
b919844fce
Tidy up and fix alignment of landing cards (#5317) 2020-04-20 20:33:13 +02:00
adrianeboyd
f7471abd82
Add pkuseg and serialization support for Chinese (#5308)
* Add pkuseg and serialization support for Chinese

Add support for pkuseg alongside jieba

* Specify model through `Language` meta:

  * split on characters (if no word segmentation packages are installed)

```
Chinese(meta={"tokenizer": {"config": {"use_jieba": False, "use_pkuseg": False}}})
```

  * jieba (remains the default tokenizer if installed)

```
Chinese()
Chinese(meta={"tokenizer": {"config": {"use_jieba": True}}}) # explicit
```

  * pkuseg

```
Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}}})
```

* The new tokenizer setting `require_pkuseg` overrides the `use_jieba`
default; it is intended for models that provide a pkuseg model:

```
nlp_pkuseg = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "require_pkuseg": True}}})
nlp = Chinese() # has `use_jieba` as `True` by default
nlp.from_bytes(nlp_pkuseg.to_bytes()) # `require_pkuseg` overrides `use_jieba` when calling the tokenizer
```

Add support for serialization of tokenizer settings and pkuseg model, if
loaded

* Add sorting for `Language.to_bytes()` serialization of `Language.meta`
so that the (emptied, but still present) tokenizer metadata is in a
consistent position in the serialized data

Extend tests to cover all three tokenizer configurations and
serialization

* Fix from_disk and tests without jieba or pkuseg

* Load cfg first and only show error if `use_pkuseg`
* Fix blank/default initialization in serialization tests

* Explicitly initialize jieba's cache on init

* Add serialization for pkuseg pre/postprocessors

* Reformat pkuseg install message
2020-04-18 17:01:53 +02:00
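The precedence between the tokenizer settings described in f7471abd82 can be summarized in a small sketch. It assumes the defaults `use_jieba=True`, `use_pkuseg=False` and `require_pkuseg=False`, that jieba is checked before pkuseg, and it ignores whether either package is actually installed. The function `choose_segmenter` is hypothetical and only models the decision, not the segmentation itself.

```python
from typing import Any, Dict


def choose_segmenter(config: Dict[str, Any]) -> str:
    """Return "jieba", "pkuseg" or "char" for a tokenizer config dict."""
    use_jieba = config.get("use_jieba", True)
    use_pkuseg = config.get("use_pkuseg", False)
    # `require_pkuseg` overrides the `use_jieba` default.
    if config.get("require_pkuseg", False):
        use_jieba = False
        use_pkuseg = True
    if use_jieba:
        return "jieba"
    if use_pkuseg:
        return "pkuseg"
    # Fall back to splitting on characters.
    return "char"


default_choice = choose_segmenter({})
required_choice = choose_segmenter({"pkuseg_model": "default", "require_pkuseg": True})
```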
laszabine
fb73d4943a
Amend documentation to Language.evaluate (#5319)
* Specified usage of arguments to Language.evaluate

* Created contributor agreement
2020-04-16 20:00:18 +02:00
Ines Montani
068146d4ca
Update netlify.toml [ci skip] 2020-04-16 14:45:25 +02:00
Jakob Jul Elben
663333c3b2
Fixes #5413 (#5315)
* Fix 5314

* Add contributor

* Resolve requested changes

Co-authored-by: Jakob Jul Elben <jakob@datamaga.com>
2020-04-16 13:29:02 +02:00
Sébastien Harinck
dac70f29eb
contrib: add contributor agreement for user sebastienharinck (#5316) 2020-04-16 11:32:09 +02:00
Paolo Arduin
1ca32d8f9c
Matcher support for Span as well as Doc (#5113)
* Matcher support for Span, as well as Doc #5056

* Removes an unused import

* Signed contributors agreement

* Code optimization and better test

* Add error message for bad Matcher call argument

* Fix merging
2020-04-15 13:51:33 +02:00
Thomas Thiebaud
1eef60c658
Add spacy_fastlang to universe (#5271)
* Add spacy_fastlang to universe

* Sign SCA
2020-04-15 13:50:46 +02:00
adrianeboyd
98c59027ed
Use max(uint64) for OOV lexeme rank (#5303)
* Use max(uint64) for OOV lexeme rank

* Add test for default OOV rank

* Revert back to thinc==7.4.0

Requiring the updated version of thinc was unnecessary.

* Define OOV_RANK in one place

Define OOV_RANK in one place in `util`.

* Fix formatting [ci skip]

* Switch to external definitions of max(uint64)

Switch to external definitions of max(uint64) and confirm that they are
equal.
2020-04-15 13:49:47 +02:00
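The idea in 98c59027ed of defining the OOV rank once as max(uint64) and confirming that independent definitions agree can be sketched with the standard library alone. The two helper functions are hypothetical and stand in for whatever external definitions the commit relies on.

```python
import ctypes
import struct

# Largest value representable in an unsigned 64-bit integer.
OOV_RANK = 2 ** 64 - 1


def max_uint64_from_ctypes() -> int:
    # ctypes integer types wrap around without overflow checking,
    # so -1 becomes the all-ones bit pattern.
    return ctypes.c_uint64(-1).value


def max_uint64_from_struct() -> int:
    # Eight 0xFF bytes decoded as one unsigned 64-bit integer.
    return struct.unpack("<Q", b"\xff" * 8)[0]


definitions_agree = (
    OOV_RANK == max_uint64_from_ctypes() == max_uint64_from_struct()
)
```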
adrianeboyd
3d2c308906
Add Doc init from list of words and text (#5251)
* Add Doc init from list of words and text

Add an option to initialize a `Doc` from a text and list of words where
the words may or may not include all whitespace tokens. If the text and
words are mismatched, raise an error.

* Fix error code

* Remove all whitespace before aligning words/text

* Move words/text init to util function

* Update error message

* Rename to get_words_and_spaces

* Fix formatting
2020-04-14 19:15:52 +02:00
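The alignment of a word list against a text described in 3d2c308906 can be approximated as below. This is a simplified sketch, not the library's implementation: the name `align_words_and_spaces` is hypothetical, whitespace-only words are dropped before aligning, and any whitespace in the text that a single trailing space cannot account for is emitted as its own token.

```python
from typing import List, Tuple


def align_words_and_spaces(words: List[str], text: str) -> Tuple[List[str], List[bool]]:
    """Return (tokens, trailing_space_flags) that reproduce `text` exactly."""
    # The words and the text must agree once all whitespace is removed.
    if "".join("".join(words).split()) != "".join(text.split()):
        raise ValueError("words and text are mismatched")
    tokens: List[str] = []
    spaces: List[bool] = []
    pos = 0
    for word in (w for w in words if not w.isspace()):
        offset = text[pos:].index(word)
        if offset > 0:
            # Whitespace not covered by a trailing space becomes a token.
            tokens.append(text[pos:pos + offset])
            spaces.append(False)
            pos += offset
        tokens.append(word)
        spaces.append(False)
        pos += len(word)
        if pos < len(text) and text[pos] == " ":
            spaces[-1] = True
            pos += 1
    if pos < len(text):
        tokens.append(text[pos:])
        spaces.append(False)
    return tokens, spaces


tokens, spaces = align_words_and_spaces(["Hello", "world", "!"], "Hello  world!")
rebuilt = "".join(t + (" " if s else "") for t, s in zip(tokens, spaces))
```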
Paolo Arduin
8ce408d2e1
Comparison predicate handling for != (#5282)
* Fix #5281

* Optimize test
2020-04-14 19:14:15 +02:00
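A comparison predicate that handles `!=` alongside the other operators, as in 8ce408d2e1, might look like the following sketch. The table `COMPARATORS` and the factory `make_predicate` are hypothetical names; the single-key dictionary form (for example `{"!=": 5}`) is assumed for illustration.

```python
import operator
from typing import Any, Callable, Dict

# Operator strings mapped to the corresponding comparison functions.
COMPARATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "==": operator.eq,
    "!=": operator.ne,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def make_predicate(spec: Dict[str, Any]) -> Callable[[Any], bool]:
    """Build a one-argument predicate from a spec such as {"!=": 5}."""
    if len(spec) != 1:
        raise ValueError("expected exactly one operator")
    (op, expected), = spec.items()
    if op not in COMPARATORS:
        raise ValueError(f"unsupported operator: {op}")
    compare = COMPARATORS[op]
    return lambda actual: compare(actual, expected)


not_five = make_predicate({"!=": 5})
```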
Sofie Van Landeghem
a3965ec13d
tag-map-path since 2.2.4 instead of 2.2.3 (#5289) 2020-04-14 14:53:47 +02:00
Marek Grzenkowicz
6a8a52650f
[Closes #5292] Fix typo in option name "--n-save_every" (#5293)
* Sign contributor agreement for chopeen

* Fix typo in option name and close #5292
2020-04-11 23:35:01 +02:00
Umar Butler
8952effcc4
Fixed Typo in Warning (#5284)
* Fixed typo in cli warning

Fixed a typo in the warning for the provision of exactly two labels, which have not been designated as binary, to textcat.

* Created and signed contributor form
2020-04-09 15:46:15 +02:00
adrianeboyd
cf579a398d
Add __init__.py to eu and hy tests (#5278) 2020-04-08 20:03:06 +02:00
adrianeboyd
ae4af52ce7
Add ideographic stops to sentencizer (#5263)
Add ideographic half- and fullwidth full stops to default sentencizer
punctuation.
2020-04-08 12:58:39 +02:00
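The effect of ae4af52ce7 can be illustrated with a character-level sketch that treats the fullwidth (U+3002) and halfwidth (U+FF61) ideographic full stops as sentence-final punctuation. This is an assumption-laden simplification: the actual sentencizer works on tokens, and `SENTENCE_FINAL` and `split_sentences` are hypothetical names.

```python
from typing import List

# Sentence-final characters: a few ASCII marks plus the ideographic
# full stop (U+3002) and its halfwidth form (U+FF61).
SENTENCE_FINAL = {".", "!", "?", "\u3002", "\uff61"}


def split_sentences(text: str) -> List[str]:
    """Split text after every sentence-final character."""
    sentences: List[str] = []
    current = ""
    for ch in text:
        current += ch
        if ch in SENTENCE_FINAL:
            sentences.append(current.strip())
            current = ""
    if current.strip():
        sentences.append(current.strip())
    return sentences


parts = split_sentences("今日は晴れです\u3002明日は雨です\uff61")
```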
Sofie Van Landeghem
7ad0fcf01d
fix json (#5267) 2020-04-08 12:58:09 +02:00
adrianeboyd
fa760010a5
Set rank for new vector in Vocab.set_vector (#5266)
Set `Lexeme.rank` for vectors added with `Vocab.set_vector` so that the
lexeme `ID` accessed by a model points to the right row for the new vector.
2020-04-07 12:04:51 +02:00
adrianeboyd
c981aa6684
Use inline flags in token_match patterns (#5257)
* Use inline flags in token_match patterns

Use inline flags in `token_match` patterns so that serializing does not
lose the flag information.

* Modify inline flag

* Modify inline flag
2020-04-06 13:19:04 +02:00
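The motivation given in c981aa6684 can be shown with Python's `re` module: if only the pattern string is serialized, flags passed through the `flags` argument are lost, while inline flags travel with the string. The URL-like pattern below is purely illustrative and is not the pattern used in the commit.

```python
import re

# Flag passed separately: it is not part of the pattern string.
external = re.compile(r"https?://\S+", flags=re.IGNORECASE)
# Flag written inline: it is part of the pattern string.
inline = re.compile(r"(?i)https?://\S+")

# Simulate serialization by keeping only the pattern string.
restored_external = re.compile(external.pattern)
restored_inline = re.compile(inline.pattern)

url = "HTTP://EXAMPLE.COM"
external_survives = restored_external.match(url) is not None
inline_survives = restored_inline.match(url) is not None
```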
adrianeboyd
e8be15e9b7
Improve tokenization for UD Spanish AnCora (#5253) 2020-04-06 13:18:23 +02:00
adrianeboyd
f4ef64a526
Improve tokenization for UD Dutch corpora (#5259)
* Improve tokenization for UD Dutch corpora

Improve tokenization for UD Dutch Alpino and LassySmall.

* Format Dutch tokenizer exceptions
2020-04-06 13:18:07 +02:00