Commit Graph

15677 Commits

Author SHA1 Message Date
adrianeboyd
b3969c1479 Clarify Token.pos as UPOS (#5419) 2020-05-08 10:38:21 +02:00
adrianeboyd
4a15b559ba
Clarify Token.pos as UPOS (#5419) 2020-05-08 10:36:25 +02:00
adrianeboyd
a2345618f1
Fix Token API docs from #5375 (#5418) 2020-05-08 10:25:02 +02:00
Samuel Rodríguez Medina
5e55bfa821
Fixed tests for Swedish that were written in Danish. (#5395) 2020-05-05 14:06:27 +02:00
Adriane Boyd
565e0eef73 Add tokenizer option for token match with affixes
To fix the slow tokenizer URL (#4374) and allow `token_match` to take
priority over prefixes and suffixes by default, introduce a new
tokenizer option for a token match pattern that's applied after prefixes
and suffixes but before infixes.
2020-05-05 10:35:33 +02:00
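The ordering described in this commit message (prefixes and suffixes first, then the new match pattern, then infixes) can be illustrated with a small standalone sketch. This is not spaCy's tokenizer: the regular expressions and the names `split_chunk` and `AFFIX_TOKEN_MATCH` are hypothetical stand-ins chosen for the example, and the real implementation handles many more cases.

```python
import re

# Hypothetical, deliberately tiny patterns; assumptions for illustration only.
PREFIX = re.compile(r"^[(\[{]")
SUFFIX = re.compile(r"[)\]}.,!?]$")
INFIX = re.compile(r"([-/:])")
# Stand-in for the "token match with affixes" pattern: a crude URL check.
AFFIX_TOKEN_MATCH = re.compile(r"^https?://\S+$")


def split_chunk(chunk):
    """Split one whitespace-delimited chunk into tokens."""
    prefixes, suffixes = [], []
    # 1. Peel off prefixes and suffixes one character at a time.
    while chunk:
        if PREFIX.match(chunk):
            prefixes.append(chunk[0])
            chunk = chunk[1:]
        elif SUFFIX.search(chunk):
            suffixes.append(chunk[-1])
            chunk = chunk[:-1]
        else:
            break
    if not chunk:
        middle = []
    elif AFFIX_TOKEN_MATCH.match(chunk):
        # 2. After the affixes are gone, a match protects the remainder
        #    from infix splitting.
        middle = [chunk]
    else:
        # 3. Otherwise split on infixes, keeping the separators as tokens.
        middle = [part for part in INFIX.split(chunk) if part]
    # Suffixes were collected from the outside in, so reverse them.
    return prefixes + middle + suffixes[::-1]
```

Under these assumptions, `split_chunk("(http://example.com/a-b).")` keeps the URL as one token between the bracket and punctuation tokens, while `split_chunk("[well-known]")` still splits on the hyphen.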
Adriane Boyd
792c8af8cf Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match 2020-05-05 09:25:57 +02:00
Matthew Honnibal
eb117e2fce Add load_config_from_str helper 2020-05-02 14:09:21 +02:00
adrianeboyd
c045a9c7f6
Fix logic in train CLI timing eval on GPU (#5387)
Run CPU timing in first iteration only
2020-05-01 12:05:33 +02:00
Samuel Rodríguez Medina
148b036e0c
Spanish like num improvement (#5381)
* Add tests for Spanish like_num.

* Add missing numbers in Spanish lexical attributes for like_num.

* Modify Spanish test function name.

* Add contributor agreement.
2020-04-30 11:13:23 +02:00
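For readers unfamiliar with `like_num`, the following is a minimal sketch of how such a lexical attribute is commonly written. The word list is a short, hypothetical sample (the commit itself adds the missing Spanish numbers, which are not reproduced here), and the separator handling is an assumption made for the example.

```python
# Hypothetical, abbreviated sample; the real Spanish list is much longer.
_NUM_WORDS = {"cero", "uno", "dos", "tres", "diez", "cien", "mil"}


def like_num(text: str) -> bool:
    """Return True if the text looks like a number."""
    # Ignore a single leading sign or approximation marker.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Drop thousands and decimal separators before the digit check.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Accept simple fractions such as 1/2.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Fall back to a case-insensitive lookup of number words.
    return text.lower() in _NUM_WORDS
```

Tests for a function of this shape typically pair a list of number-like strings with a list of ordinary words and check both directions.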
svlandeg
ebaed7dcfa Few more updates to the EL documentation 2020-04-30 10:17:06 +02:00
Sofie Van Landeghem
cafe94ee04 Update NEL examples and documentation (#5370)
* simplify creation of KB by skipping dim reduction

* small fixes to train EL example script

* add KB creation and NEL training example scripts to example section

* update descriptions of example scripts in the documentation

* moving wiki_entity_linking folder from bin to projects

* remove test for wiki NEL functionality that is being moved
# Conflicts:
#	bin/wiki_entity_linking/wikipedia_processor.py
2020-04-30 09:50:02 +02:00
Samuel Rodríguez Medina
8602daba85
Swedish like_num (#5371)
* Sign contributor agreement.

* Add like_num functionality to Swedish.

* Update spacy/tests/lang/sv/test_lex_attrs.py

Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update contributor agreement

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-04-29 21:25:22 +02:00
adrianeboyd
74da669326
Fix problems with lower and whitespace in variants (#5361)
* Initialize lower flag explicitly

* Handle whitespace words from GoldParse correctly when creating raw
text with orth variants

* Return the text with original casing if anything goes wrong
2020-04-29 13:01:25 +02:00
adrianeboyd
3f43c73d37
Normalize TokenC.sent_start values for Matcher (#5346)
Normalize TokenC.sent_start values to booleans for the `Matcher`.
2020-04-29 12:57:30 +02:00
adrianeboyd
bdff76dede
Various updates/additions to CLI scripts (#5362)
* `debug-data`: determine coverage of provided vectors

* `evaluate`: support `blank:lg` model to make it possible to just evaluate
tokenization

* `init-model`: add option to truncate vectors to N most frequent vectors
from word2vec file

* `train`:

  * if training on GPU, only run evaluation/timing on CPU in the first
    iteration

  * if training is aborted, exit with a non-0 exit status
2020-04-29 12:56:46 +02:00
Sofie Van Landeghem
cfdaf99b80
Fix passing of component configuration (#5374)
* add kwargs to to_disk methods in docs - otherwise crashes on 'exclude' argument

* add fix and test for Issue 5137
2020-04-29 12:56:17 +02:00
Ines Montani
efec28ce70
Merge pull request #5367 from adrianeboyd/feature/simplify-warnings-v2 2020-04-29 12:55:37 +02:00
Ines Montani
63885c1836 Remove u string and auto-format [ci skip] 2020-04-29 12:54:57 +02:00
Ines Montani
962bf12a20
Merge pull request #5312 from odaxiom/fix/website-documentation-spacy-lookup 2020-04-29 12:54:31 +02:00
Sofie Van Landeghem
f67343295d
Update NEL examples and documentation (#5370)
* simplify creation of KB by skipping dim reduction

* small fixes to train EL example script

* add KB creation and NEL training example scripts to example section

* update descriptions of example scripts in the documentation

* moving wiki_entity_linking folder from bin to projects

* remove test for wiki NEL functionality that is being moved
2020-04-29 12:53:53 +02:00
adrianeboyd
a6e521cd79
Add is_sent_end token property (#5375)
Reconstruction of the original PR #4697 by @MiniLau.

Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema
because the Matcher is only going to be able to support `IS_SENT_START`.
2020-04-29 12:53:16 +02:00
Ines Montani
a77754120d
Merge pull request #5177 from nlptechbook/patch-5 2020-04-29 12:52:21 +02:00
Ines Montani
eac47971f1
Merge pull request #5258 from mirfan899/master 2020-04-29 12:51:55 +02:00
Sofie Van Landeghem
1bf2082ac4
update is_new_osx function (#5376) 2020-04-29 12:51:49 +02:00
Ines Montani
1cbb272a6b
Update website/meta/universe.json 2020-04-29 12:51:44 +02:00
Ines Montani
732629b0dd
Update website/meta/universe.json 2020-04-29 12:51:37 +02:00
adrianeboyd
90ce34db42
Add cuda101 and cuda102 options to setup (#5377)
* Add cuda101 and cuda102 options to setup

* Update cudaNNN options in docs
2020-04-29 12:51:12 +02:00
Louis Guitton
a27c4014f5
Add mlflow to spaCy universe (#5352)
* Add mlflow to universe

* Use mlflow black logo
2020-04-29 10:18:03 +02:00
adrianeboyd
d5f18f8307
Add missing import 2020-04-28 14:01:29 +02:00
adrianeboyd
ac40a8f7a5
Add missing import 2020-04-28 14:00:11 +02:00
Adriane Boyd
3a045572ed Add missing import 2020-04-28 13:48:37 +02:00
Adriane Boyd
bc39f97e11 Simplify warnings 2020-04-28 13:37:37 +02:00
Michael
5b5528ff2e
Add !=3.4.* to python_requires (#5344)
Missed in 80d554f2e2
2020-04-27 22:02:09 +02:00
adrianeboyd
792aa7b6ab
Remove references to textcat spans (#5360)
Remove references to unimplemented `TextCategorizer` span labels in
`GoldParse` and `Doc`.
2020-04-27 18:01:12 +02:00
adrianeboyd
f8ac5b9f56
bugfix in span similarity (#5155) (#5358)
* bugfix in span similarity

* also rewrite doc.pyx for clarity

* formatting

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-04-27 16:51:27 +02:00
Sofie Van Landeghem
9203d821ae
Add 2 ini files in tests/lang (#5359) 2020-04-27 13:01:54 +02:00
Punitvara
b2b7e1f37a
This PR adds Gujarati Language class along with (#5355)
* This PR adds Gujarati Language class along with stop words

* Add test for gu tokenizer
2020-04-27 11:07:37 +02:00
adrianeboyd
90c754024f
Update nlp.vectors to nlp.vocab.vectors (#5357) 2020-04-27 10:53:05 +02:00
sabiqueqb
fc91660aa2
Gh 5339 language class for malayalam (#5342)
* Initialize Malayalam Language class

* Add lex_attrs and examples for Malayalam

* Add spaCy Contributor Agreement

* Add test for ml tokenizer
2020-04-27 09:45:08 +02:00
adrianeboyd
84e06f9fb7
Improve GoldParse NER alignment (#5335)
Improve GoldParse NER alignment by including all cases where the start
and end of the NER span can be aligned, regardless of internal
tokenization differences.

To do this, convert BILUO tags to character offsets, check start/end
alignment with `doc.char_span()`, and assign the BILUO tags for the
aligned spans. Alignment for `O/-` tags is handled through the
one-to-one and multi alignments.
2020-04-23 16:58:23 +02:00
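The alignment idea in this commit message (convert BILUO tags to character offsets, then keep the spans whose start and end fall on token boundaries) can be sketched in plain Python. The sketch below does not use spaCy; the helper names are hypothetical, and the boundary check only stands in for what the message says is done with `doc.char_span()`.

```python
def _token_bounds(words, spaces):
    """Return (start, end) character offsets for each token."""
    bounds, pos = [], 0
    for word, space in zip(words, spaces):
        bounds.append((pos, pos + len(word)))
        pos += len(word) + (1 if space else 0)
    return bounds


def biluo_to_char_offsets(words, spaces, tags):
    """Convert per-token BILUO tags to (start_char, end_char, label) spans."""
    bounds = _token_bounds(words, spaces)
    spans, open_start = [], None
    for (start, end), tag in zip(bounds, tags):
        if tag.startswith("U-"):
            spans.append((start, end, tag[2:]))
        elif tag.startswith("B-"):
            open_start = start
        elif tag.startswith("L-") and open_start is not None:
            spans.append((open_start, end, tag[2:]))
            open_start = None
    return spans


def align_spans(spans, words, spaces):
    """Keep spans whose start and end both land on token boundaries."""
    bounds = _token_bounds(words, spaces)
    starts = {start for start, _ in bounds}
    ends = {end for _, end in bounds}
    return [span for span in spans if span[0] in starts and span[1] in ends]
```

With this approach a span survives even when the two tokenizations split its interior differently, because only the outer boundaries are compared.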
Mike
481574cbc8
[minor doc change] embedding vis. link is broken in website/docs/usage/examples.md (#5325)
* The embedding vis. link is broken

The first link seems to be reasonable for now unless someone has an updated embedding vis they want to share?

* contributor agreement

* Update Mlawrence95.md

* Update website/docs/usage/examples.md

Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-04-21 20:35:12 +02:00
adrianeboyd
521f361052
Switch to new gold.align method (#5334)
* Switch from original `_align` to new simpler alignment algorithm from
  #4526

* Remove alignment normalizations beyond whitespace and lowercasing
2020-04-21 19:31:03 +02:00
Matthew Honnibal
b2ef6100af
Only run backprop once when shared tok2vec weights (#5331)
Previously, pipelines with shared tok2vec weights would call the
tok2vec backprop callback multiple times, once for each pipeline
component. This caused errors for PyTorch, and was inefficient.

Instead, accumulate the gradient for all but one component, and just
call the callback once.
2020-04-21 19:30:41 +02:00
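The accumulate-then-call-once pattern described above can be shown with a small standalone sketch. It uses plain Python lists instead of real tensors, and the class name `SharedBackprop` and its arguments are hypothetical; the actual change lives in spaCy's pipeline code and is not reproduced here.

```python
class SharedBackprop:
    """Wrap a backprop callback shared by several downstream components.

    Gradients from all but the last component are only accumulated; the
    wrapped callback runs once, on the summed gradient.
    """

    def __init__(self, backprop, n_users):
        self.backprop = backprop  # the expensive shared callback
        self.n_users = n_users    # number of components sharing the weights
        self.pending = None       # running sum of gradients
        self.seen = 0             # gradients received in this step

    def __call__(self, grad):
        if self.pending is None:
            self.pending = list(grad)
        else:
            self.pending = [a + b for a, b in zip(self.pending, grad)]
        self.seen += 1
        if self.seen < self.n_users:
            return None  # still waiting for the remaining components
        # Last component: reset state and run the real callback once.
        total, self.pending, self.seen = self.pending, None, 0
        return self.backprop(total)
```

Because gradients of a shared layer add up, calling the callback once on the sum gives the same update as calling it once per component, at a fraction of the cost.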
adrianeboyd
bf5c13d170
Modify jieba install message (#5328)
Modify jieba install message to instruct the user to use
`ChineseDefaults.use_jieba = False` so that it's possible to load
pkuseg-only models without jieba installed.
2020-04-20 22:06:53 +02:00
Matthew Honnibal
6918d99b6c
Improve GPU usage for train-with-config (#5330)
* Adjust for no ops in Optimizer

* Fix gpu in train-from-config

* Update train-from-config script

* Fix parser

* Fix GPU efficiency of padding backprop
2020-04-20 22:06:28 +02:00
Ines Montani
b919844fce
Tidy up and fix alignment of landing cards (#5317) 2020-04-20 20:33:13 +02:00
adrianeboyd
f7471abd82
Add pkuseg and serialization support for Chinese (#5308)
* Add pkuseg and serialization support for Chinese

Add support for pkuseg alongside jieba

* Specify model through `Language` meta:

  * split on characters (if no word segmentation packages are installed)

```
from spacy.lang.zh import Chinese  # import needed by this and the following snippets
Chinese(meta={"tokenizer": {"config": {"use_jieba": False, "use_pkuseg": False}}})
```

  * jieba (remains the default tokenizer if installed)

```
Chinese()
Chinese(meta={"tokenizer": {"config": {"use_jieba": True}}}) # explicit
```

  * pkuseg

```
Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}}})
```

* The new tokenizer setting `require_pkuseg` is used to override
`use_jieba` default, which is intended for models that provide a pkuseg
model:

```
nlp_pkuseg = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "require_pkuseg": True}}})
nlp = Chinese() # has `use_jieba` as `True` by default
nlp.from_bytes(nlp_pkuseg.to_bytes()) # `require_pkuseg` overrides `use_jieba` when calling the tokenizer
```

Add support for serialization of tokenizer settings and pkuseg model, if
loaded

* Add sorting for `Language.to_bytes()` serialization of `Language.meta`
so that the (emptied, but still present) tokenizer metadata is in a
consistent position in the serialized data

Extend tests to cover all three tokenizer configurations and
serialization

* Fix from_disk and tests without jieba or pkuseg

* Load cfg first and only show error if `use_pkuseg`
* Fix blank/default initialization in serialization tests

* Explicitly initialize jieba's cache on init

* Add serialization for pkuseg pre/postprocessors

* Reformat pkuseg install message
2020-04-18 17:01:53 +02:00
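The point about sorting `Language.meta` so that the tokenizer metadata sits in a consistent position can be illustrated generically. The sketch below uses the standard-library `json` module as a stand-in, which is an assumption: spaCy's `Language.to_bytes()` uses its own serialization helpers, and `meta_to_bytes` is a hypothetical name.

```python
import json


def meta_to_bytes(meta: dict) -> bytes:
    """Serialize a metadata dict with a stable key order."""
    # sort_keys makes the output independent of insertion order, so two
    # dicts with the same content always produce identical bytes.
    return json.dumps(meta, sort_keys=True).encode("utf8")


# The same metadata built in two different orders.
meta_a = {"lang": "zh", "tokenizer": {}}
meta_b = {"tokenizer": {}, "lang": "zh"}
same = meta_to_bytes(meta_a) == meta_to_bytes(meta_b)
```

Stable ordering of this kind is what lets serialization tests compare byte output directly.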
laszabine
fb73d4943a
Amend documentation to Language.evaluate (#5319)
* Specified usage of arguments to Language.evaluate

* Created contributor agreement
2020-04-16 20:00:18 +02:00
Ines Montani
51207c9417 Update netlify.toml [ci skip] 2020-04-16 14:45:52 +02:00
Ines Montani
068146d4ca Update netlify.toml [ci skip] 2020-04-16 14:45:25 +02:00