Commit Graph

1037 Commits

Author SHA1 Message Date
Ines Montani
f9af7d365c Update docs [ci skip] 2020-09-22 09:45:41 +02:00
Ines Montani
49e80dbcac
Merge pull request #6103 from explosion/chore/tidy-up-tests-docs-get-doc 2020-09-22 09:45:04 +02:00
Adriane Boyd
844db6ff12 Update architecture overview 2020-09-22 09:31:47 +02:00
Adriane Boyd
5fbb8dfcbc Merge remote-tracking branch 'upstream/develop' into docs/various-v3-2 2020-09-22 09:22:58 +02:00
Ines Montani
67fbcb3da5 Tidy up tests and docs 2020-09-21 20:43:54 +02:00
Ines Montani
e548654aca Update docs [ci skip] 2020-09-21 14:46:55 +02:00
Ines Montani
9d32cac736 Update docs [ci skip] 2020-09-21 10:55:36 +02:00
Adriane Boyd
cc71ec901f Fix typo in saving and loading usage docs 2020-09-21 09:08:55 +02:00
Ines Montani
012b3a7096 Update docs [ci skip] 2020-09-20 17:44:58 +02:00
Ines Montani
554c9a2497 Update docs [ci skip] 2020-09-20 12:30:53 +02:00
Sofie Van Landeghem
39872de1f6
Introducing the gpu_allocator (#6091)
* rename 'use_pytorch_for_gpu_memory' to 'gpu_allocator'

* --code instead of --code-path

* update documentation

* avoid querying the "system" section directly

* add explanation of gpu_allocator to TF/PyTorch section in docs

* fix typo

* fix typo 2

* use set_gpu_allocator from thinc 8.0.0a34

* default null instead of empty string
2020-09-19 01:17:02 +02:00
Ines Montani
a127fa475e
Merge pull request #6078 from svlandeg/fix/corpus 2020-09-18 14:44:21 +02:00
Ines Montani
a0b4389a38 Update docs [ci skip] 2020-09-17 19:24:48 +02:00
Matthew Honnibal
6efb7688a6 Draft pretrain usage 2020-09-17 18:17:03 +02:00
Ines Montani
a2c8cda26f Update docs [ci skip] 2020-09-17 17:12:51 +02:00
Matthew Honnibal
ec751068f3 Draft text for static vectors intro 2020-09-17 16:42:53 +02:00
svlandeg
c8c84f1ccd Merge remote-tracking branch 'upstream/develop' into fix/corpus 2020-09-17 15:43:04 +02:00
Ines Montani
c8fa2247e3 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-09-17 12:34:15 +02:00
Ines Montani
6761028c6f Update docs [ci skip] 2020-09-17 12:34:11 +02:00
svlandeg
0c35885751 generalize corpora, dot notation for dev and train corpus 2020-09-17 11:38:59 +02:00
svlandeg
781fae678b Merge remote-tracking branch 'upstream/develop' into fix/corpus 2020-09-17 09:24:36 +02:00
Adriane Boyd
7e4cd7575c
Refactor Docs.is_ flags (#6044)
* Refactor Docs.is_ flags

* Add derived `Doc.has_annotation` method

  * `Doc.has_annotation(attr)` returns `True` for partial annotation

  * `Doc.has_annotation(attr, require_complete=True)` returns `True` for
    complete annotation

* Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced`
and `is_nered`

* Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs
for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The
list is the `DocBin` attributes list plus `SPACY` and `LENGTH`.

Notes on `Doc.has_annotation`:

* `HEAD` is converted to `DEP` because heads don't have an unset state

* Accept `IS_SENT_START` as a synonym of `SENT_START`
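
The partial versus complete distinction described for `Doc.has_annotation` can be illustrated with a small stand-alone sketch. It does not call spaCy: the name `has_annotation_sketch` and the use of `None` for an unset per-token value are assumptions made for illustration, and empty inputs are not modeled.

```python
from typing import Optional, Sequence


def has_annotation_sketch(
    values: Sequence[Optional[str]], require_complete: bool = False
) -> bool:
    """Toy model of the partial vs. complete check (not spaCy's implementation).

    `values` holds one entry per token; None stands for "unset" (an assumption).
    """
    is_set = [value is not None for value in values]
    if require_complete:
        # Complete annotation: every token carries a value.
        return all(is_set)
    # Partial annotation: at least one token carries a value.
    return any(is_set)


partial = has_annotation_sketch(["nsubj", None, "dobj"])
complete = has_annotation_sketch(["nsubj", None, "dobj"], require_complete=True)
```

Here `partial` is `True` because some tokens are annotated, while `complete` is `False` because one token is unset.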

Additional changes:

* Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for
`DocBin`

* In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override
`SENT_START`

* In `Doc.from_array()` using `attrs` other than
`Doc._get_array_attrs()` (i.e., a user's custom list rather than our
default internal list) with both `HEAD` and `SENT_START` shows a warning
that `HEAD` will override `SENT_START`

* `set_children_from_heads` does not require dependency labels to set
sentence boundaries and sets `sent_start` for all non-sentence starts to
`-1`

* Fix call to set_children_from_heads

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-17 00:14:01 +02:00
svlandeg
51fa929f47 rewrite train_corpus to corpus.train in config 2020-09-15 21:58:04 +02:00
Ines Montani
b7faa38960 Update docs [ci skip] 2020-09-15 12:44:03 +02:00
Ines Montani
154752f9c2 Update docs and consistency [ci skip] 2020-09-15 00:32:49 +02:00
Ines Montani
85e5910102 Update docs [ci skip] 2020-09-13 23:09:19 +02:00
Ines Montani
5ebb2a2ac8 Update docs [ci skip] 2020-09-13 22:36:20 +02:00
Ines Montani
47acb45850 Update docs [ci skip] 2020-09-13 22:30:33 +02:00
Ines Montani
2e3d067a7b Update docs [ci skip] 2020-09-13 19:29:06 +02:00
Ines Montani
99b26fe492 Update docs [ci skip] 2020-09-13 17:59:38 +02:00
Ines Montani
1316071086 Update docs [ci skip] 2020-09-13 11:31:50 +02:00
Ines Montani
368ecf705a Update docs [ci skip] 2020-09-12 17:40:50 +02:00
Ines Montani
8b0dabe987 Update docs [ci skip] 2020-09-12 17:05:10 +02:00
Ines Montani
4fec8c39a3 Update project teaser [ci skip] 2020-09-10 13:23:03 +02:00
Ines Montani
763e302dcc Update project widgets and examples [ci skip] 2020-09-10 13:04:16 +02:00
Ines Montani
908f3a4494 Update default projects repo [ci skip] 2020-09-10 11:42:14 +02:00
Ines Montani
2e567a47c2 Update docs and formatting 2020-09-09 21:26:10 +02:00
svlandeg
aa27e3f1f2 PyTorch spelling 2020-09-09 16:27:21 +02:00
svlandeg
a8aa9a8068 document Pipe API details, crossreferences etc 2020-09-09 15:56:27 +02:00
svlandeg
9a7c6cc61a references to usage page on layers and architectures 2020-09-09 14:47:32 +02:00
svlandeg
e80898092b Merge branch 'feature/more-layers-docs' of https://github.com/svlandeg/spaCy into feature/more-layers-docs 2020-09-09 14:44:28 +02:00
svlandeg
4c080b3a98 details on Thinc shape inference 2020-09-09 13:57:05 +02:00
svlandeg
39aa740777 Merge remote-tracking branch 'upstream/develop' into feature/more-layers-docs 2020-09-09 11:59:34 +02:00
svlandeg
e39242c4e6 formatting 2020-09-09 11:25:35 +02:00
Ines Montani
24053d83ec Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-09-09 11:20:14 +02:00
Ines Montani
406aed78ee Update docs [ci skip] 2020-09-09 11:20:07 +02:00
Sofie Van Landeghem
8e7557656f
Renaming gold & annotation_setter (#6042)
* version bump to 3.0.0a16

* rename "gold" folder to "training"

* rename 'annotation_setter' to 'set_extra_annotations'

* formatting
2020-09-09 10:31:03 +02:00
svlandeg
a16afb79e3 add section on Thinc implementation details 2020-09-08 20:43:09 +02:00
svlandeg
1c476b4b41 how to register and use custom function 2020-09-08 20:22:20 +02:00
svlandeg
b35a26ea5d example wrapped Torch model and chaining with Thinc 2020-09-08 18:32:58 +02:00
svlandeg
bd8f9b188b small fixes 2020-09-08 17:24:36 +02:00
Ines Montani
d98ae9d918 Update docs [ci skip] 2020-09-08 10:33:48 +02:00
Ines Montani
c443c82722 Update docs [ci skip] 2020-09-05 13:41:10 +02:00
Ines Montani
b3e338d65e Update docs [ci skip] 2020-09-04 20:58:36 +02:00
Ines Montani
157caf4dfa WIP: update docs [ci skip] 2020-09-04 16:30:31 +02:00
Ines Montani
f174c7b1f3 Merge branch 'develop' into pr/6018 2020-09-04 15:54:49 +02:00
Ines Montani
864a697e63 Merge branch 'develop' into master-tmp 2020-09-04 13:15:36 +02:00
Adriane Boyd
b927893309
Merge branch 'develop' into feature/dependency-matcher-v3 2020-09-04 13:03:30 +02:00
Ines Montani
2189046869
Merge pull request #6024 from explosion/chore/registry-renaming 2020-09-04 10:54:10 +02:00
Ines Montani
b1eb98b15c Remove todos [ci skip] 2020-09-03 17:43:58 +02:00
Ines Montani
23b7d9cfa3 Prefix span getters 2020-09-03 17:37:06 +02:00
Ines Montani
5afe6447cd registry.assets -> registry.misc 2020-09-03 17:31:14 +02:00
Ines Montani
121809dd1e Fix anchor [ci skip] 2020-09-03 16:49:56 +02:00
Ines Montani
25a595dc10 Fix typos and wording [ci skip] 2020-09-03 16:37:45 +02:00
Ines Montani
b5a0657fd6 "model" terminology consistency in docs 2020-09-03 13:13:03 +02:00
Ines Montani
b02ad8045b Update docs [ci skip] 2020-09-03 10:10:13 +02:00
Ines Montani
1815c613c9 Update docs [ci skip] 2020-09-03 10:07:45 +02:00
Adriane Boyd
960d9cfadc Officially support DependencyMatcher
Add official support for the `DependencyMatcher`. Redesign the pattern
specification. Fix and extend operator implementations. Update API docs
and add usage docs.

Patterns
--------

Refactor pattern structure to:

```
{
  "LEFT_ID": str,
  "REL_OP": str,
  "RIGHT_ID": str,
  "RIGHT_ATTRS": dict,
}
```

The first node contains only `RIGHT_ID` and `RIGHT_ATTRS` and all
subsequent nodes contain all four keys.
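
As a rough illustration of the key layout described above, the following stand-alone sketch checks which keys each node carries. It is not part of spaCy: `check_pattern_keys`, the node names and the attribute values are hypothetical, and `>` is used only as an example operator string.

```python
from typing import Any, Dict, List

# Keys expected on the first (anchor) node and on every later node.
ANCHOR_KEYS = {"RIGHT_ID", "RIGHT_ATTRS"}
RELATION_KEYS = {"LEFT_ID", "REL_OP", "RIGHT_ID", "RIGHT_ATTRS"}


def check_pattern_keys(pattern: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems; an empty list means the key layout matches."""
    problems = []
    for index, node in enumerate(pattern):
        expected = ANCHOR_KEYS if index == 0 else RELATION_KEYS
        if set(node) != expected:
            problems.append(
                f"node {index}: expected keys {sorted(expected)}, got {sorted(node)}"
            )
    return problems


# Illustrative pattern: an anchor node followed by one related node.
example_pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    {
        "LEFT_ID": "verb",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
]

problems = check_pattern_keys(example_pattern)
```

The check only looks at key names; it does not verify that each `LEFT_ID` refers to a node defined earlier.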

New operators
-------------

Because of the way patterns are constructed from left to right, it's
helpful to have `follows` operators along with `precedes` operators. Add
operators for simple precedes / follows alongside immediate precedes /
follows.

* `.*`: precedes
* `;`: immediately follows
* `;*`: follows
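
The linear precedence operators can be sketched as comparisons of token positions. This is a simplified model, not the matcher's implementation: it assumes the operator is read as "A OP B" with A as the left node, that plain precedes/follows also covers the immediate case, and it leaves out the restriction to nodes within the same parse. `linear_op_holds` is a hypothetical helper name.

```python
from typing import Callable, Dict

# `a` is the position of the left node, `b` the position of the right node
# (an assumed reading of "A OP B").
LINEAR_CHECKS: Dict[str, Callable[[int, int], bool]] = {
    ".": lambda a, b: a == b - 1,   # A immediately precedes B
    ".*": lambda a, b: a < b,       # A precedes B
    ";": lambda a, b: a == b + 1,   # A immediately follows B
    ";*": lambda a, b: a > b,       # A follows B
}


def linear_op_holds(op: str, a: int, b: int) -> bool:
    """Check one linear precedence operator on two token positions."""
    return LINEAR_CHECKS[op](a, b)
```

For example, `linear_op_holds(";", 3, 2)` holds because position 3 comes directly after position 2.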

Operator fixes
--------------

* `<` and `<<` do not include the node itself
* Fix reversed order for all operators involving linear precedence (`.`,
  all sibling operators)
* Linear precedence operators do not match nodes outside the same parse

Additional fixes
----------------

* Use v3 Matcher API
* Support `get` and `remove`
* Support pickling
2020-09-02 17:45:29 +02:00
svlandeg
19298de352 small fix 2020-09-02 17:43:11 +02:00
svlandeg
bbaea530f6 sublayers paragraph 2020-09-02 17:36:22 +02:00
svlandeg
1be7ff02a6 swapping section 2020-09-02 15:26:07 +02:00
svlandeg
57e432ba2a editor tip as Accordion instead of Infobox 2020-09-02 14:26:57 +02:00
svlandeg
d19ec6c67b small rewrites in types paragraph 2020-09-02 14:25:18 +02:00
svlandeg
821b2d4e63 update examples 2020-09-02 14:15:50 +02:00
svlandeg
e29a33449d rewrite intro, simple Model example 2020-09-02 13:41:18 +02:00
svlandeg
422df9c2e2 Merge remote-tracking branch 'upstream/develop' into feature/docs-layers
# Conflicts:
#	website/docs/usage/layers-architectures.md
2020-09-02 13:17:11 +02:00
Ines Montani
70238543c8 Update layers/arch docs structure [ci skip] 2020-09-02 13:04:35 +02:00
svlandeg
6fd7f140ec custom-architectures section 2020-09-02 11:14:06 +02:00
svlandeg
3d9ae9286f small fixes 2020-09-02 10:46:38 +02:00
Ines Montani
690bd77669 Add todos [ci skip] 2020-09-01 14:04:36 +02:00
Ines Montani
70b226f69d Support ignore marker in project document [ci skip] 2020-09-01 12:49:04 +02:00
Ines Montani
9af82f3f11
Merge pull request #6003 from explosion/feature/matcher-as-spans 2020-08-31 17:50:56 +02:00
Sofie Van Landeghem
3ac620f09d
fix config example [ci skip] 2020-08-31 17:40:04 +02:00
Ines Montani
add9de5487 Deprecate (Phrase)Matcher.pipe 2020-08-31 17:01:24 +02:00
Ines Montani
bca6bf8dda Update docs [ci skip] 2020-08-31 16:39:53 +02:00
Ines Montani
db9f8896f5 Add docs [ci skip] 2020-08-31 16:10:41 +02:00
svlandeg
e47ea88aeb revert annotations refactor 2020-08-31 14:40:55 +02:00
svlandeg
13ee742fb4 example of custom logger 2020-08-31 14:24:41 +02:00
svlandeg
c18eb63483 Merge remote-tracking branch 'upstream/develop' into feature/vectors-docs
# Conflicts:
#	website/docs/usage/embeddings-transformers.md
2020-08-31 13:21:36 +02:00
Juan Gutiérrez
9002bea29f
Update suffixes example (#5989)
* Update suffixes example

The current example will throw `TypeError: can only concatenate list (not "tuple") to list`

* Signing Contributor Agreement
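
The error quoted above is ordinary Python behavior when a tuple is added to a list. A minimal reproduction, using stand-in values rather than spaCy's actual default suffixes (the variable names and patterns here are illustrative assumptions):

```python
# Stand-in for a list of default suffix patterns (not spaCy's real defaults).
base_suffixes = ["\\)", "\\]"]
# A tuple of extra patterns, which cannot be concatenated to a list.
extra_suffixes = ("-+$",)

try:
    combined = base_suffixes + extra_suffixes
except TypeError as err:
    message = str(err)

# Converting the tuple to a list (or writing a list literal) avoids the error.
fixed = base_suffixes + list(extra_suffixes)
```

After running this, `message` holds the text of the `TypeError` and `fixed` contains all three patterns.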
2020-08-31 12:44:56 +02:00
Sofie Van Landeghem
ec14744ee4
Rename Transformer listener (#6001)
* rename to spacy-transformers.TransformerListener

* add some more tok2vec tests

* use select_pipes

* fix docs - annotation setter was not changed in the end
2020-08-31 12:41:39 +02:00
Adriane Boyd
216efaf5f5 Restrict tokenizer exceptions to ORTH and NORM 2020-08-31 09:55:01 +02:00
Ines Montani
9b86312bab Update docs [ci skip] 2020-08-29 18:43:19 +02:00
Adriane Boyd
870774f475
Merge branch 'develop' into docs/morph-usage-v3 2020-08-29 16:00:50 +02:00
Ines Montani
45f46a5c85
Merge pull request #5993 from explosion/feature/disabled-components 2020-08-29 15:58:41 +02:00
Adriane Boyd
f9ed31a757 Update usage docs for lemmatization and morphology 2020-08-29 15:56:50 +02:00
Ines Montani
bc0730be3f Update docs [ci skip] 2020-08-29 12:53:14 +02:00
Ines Montani
450bf806b0
Merge pull request #5991 from adrianeboyd/docs/sent-usage-v3
Update sentence segmentation usage docs
2020-08-29 12:40:06 +02:00
Ines Montani
66d76f5126 Update docs 2020-08-29 12:36:05 +02:00
svlandeg
9f00a20ce4 proofreading and custom examples 2020-08-28 21:50:42 +02:00
svlandeg
5230529de2 add loggers registry & logger docs sections 2020-08-28 21:44:04 +02:00
Adriane Boyd
48df50533d Update sentence segmentation usage docs
Update sentence segmentation usage docs to incorporate `senter`.
2020-08-28 10:58:16 +02:00
svlandeg
8cde6ccb7d Merge remote-tracking branch 'upstream/develop' into feature/vectors-docs 2020-08-27 19:56:09 +02:00
svlandeg
556e975a30 various fixes 2020-08-27 19:24:44 +02:00
svlandeg
329e490560 small import fixes 2020-08-27 14:50:43 +02:00
svlandeg
28e4ba7270 fix references to TransformerListener 2020-08-27 14:33:28 +02:00
svlandeg
4d37ac3f33 configure_custom_sent_spans example 2020-08-27 14:14:16 +02:00
svlandeg
c68169f83f fix link 2020-08-27 10:19:43 +02:00
svlandeg
acc794c975 example of writing to other custom attribute 2020-08-27 10:10:10 +02:00
svlandeg
559b65f2e0 adjust references to null_annotation_setter to trfdata_setter 2020-08-27 09:43:32 +02:00
svlandeg
ec069627fe rename to TransformerListener 2020-08-26 13:31:01 +02:00
Ines Montani
627617a079 Tidy up and add docs [ci skip] 2020-08-26 13:24:55 +02:00
svlandeg
15902c5aa2 fix link 2020-08-26 11:51:57 +02:00
Ines Montani
f31c4462ca Update docs [ci skip] 2020-08-25 13:27:59 +02:00
Ines Montani
8ac5ef1284 Update docs 2020-08-25 11:54:37 +02:00
Matthew Honnibal
8038b87f04
Various small tweaks to project CLI (#5965)
* Fix up/download of http and local paths

* Support git_sparse_checkout for assets

* Fix scorer

* Handle already-present directories for git assets

* Improve convert command

* Fix support for existing files in git assets

* Support branches in git sparse checkout

* Format

* Fix git assets

* Document git block in assets

* Fix test

* Fix test

* Revert "Fix test"

This reverts commit cf3097260f.

* Revert "Fix test"

This reverts commit 964d636e27.

* Don't multiply p/r/f by 100

* Display scores * 100 during training
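
The last two items separate how scores are stored from how they are shown. A minimal sketch of that idea, assuming precision, recall and F-score are kept in the 0 to 1 range and scaled only for display; `prf` and `format_score` are hypothetical helpers, not the scorer's actual code:

```python
from typing import Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 as fractions between 0 and 1."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def format_score(score: float) -> str:
    """Scale a 0-1 score by 100 only at display time."""
    return f"{score * 100:.2f}"


p, r, f = prf(tp=8, fp=2, fn=2)
shown = format_score(p)
```

Keeping the stored values unscaled means any later computation works on plain fractions, and only the printed table uses percentages.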
2020-08-25 00:30:52 +02:00
Matthew Honnibal
e559867605
Allow spacy project to push and pull to/from remote storage (#5949)
* Add utils for working with remote storage

* WIP add remote_cache for project

* WIP add push and pull commands

* Use pathy in remote_cache

* Update util

* Update remote_cache

* Update util

* Update project assets

* Update pull script

* Update push script

* Fix type annotation in util

* Work on remote storage

* Remove site and env hash

* Fix imports

* Fix type annotation

* Require pathy

* Require pathy

* Fix import

* Add a util to handle project variable substitution

* Import push and pull commands

* Fix pull command

* Fix push command

* Fix tarfile in remote_storage

* Improve printing

* Fiddle with status messages

* Set version to v3.0.0a9

* Draft docs for spacy project remote storages

* Update docs [ci skip]

* Use Thinc config to simplify and unify template variables

* Auto-format

* Don't import Pathy globally for now

Causes slow and annoying Google Cloud warning

* Tidy up test

* Tidy up and update tests

* Update to latest Thinc

* Update docs

* variables -> vars

* Update docs [ci skip]

* Update docs [ci skip]

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-23 18:32:09 +02:00
Ines Montani
f27aecac14 Update formatting [ci skip] 2020-08-23 11:57:56 +02:00
Ines Montani
98a9e063b6 Update docs [ci skip] 2020-08-22 17:15:05 +02:00
Matthew Honnibal
8dfc4cbfe7 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-08-22 17:12:09 +02:00
Matthew Honnibal
048de64d4c Suggest edits 2020-08-22 17:11:28 +02:00
Ines Montani
adcf790b96 Update docs [ci skip] 2020-08-22 17:04:16 +02:00
Ines Montani
37ebff6997 Update docs [ci skip] 2020-08-22 16:47:03 +02:00
Matthew Honnibal
8685229891 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-08-22 16:06:59 +02:00
Matthew Honnibal
d97695d09d Update embeddings-transformers.md 2020-08-22 15:41:35 +02:00
Ines Montani
c7c9b0451f Update docs [ci skip] 2020-08-22 13:52:52 +02:00
Ines Montani
71aeae89c5
Merge pull request #5948 from svlandeg/feature/docs-docs-docs [ci skip] 2020-08-22 12:18:47 +02:00
Ines Montani
27f81109d6 Update docs [ci skip] 2020-08-21 20:02:18 +02:00
Ines Montani
f102164a1f Update docs [ci skip] 2020-08-21 19:34:06 +02:00
svlandeg
1b7cfa7347 Merge remote-tracking branch 'upstream/develop' into feature/docs-docs-docs 2020-08-21 18:36:18 +02:00
svlandeg
942adf0f4d comma 2020-08-21 18:36:02 +02:00
svlandeg
262552010d context manager with space (for consistency) 2020-08-21 18:34:02 +02:00
svlandeg
da48c6a2a2 several small updates 2020-08-21 18:25:26 +02:00
svlandeg
ad2332d4b7 alphabetize registries 2020-08-21 18:10:31 +02:00
svlandeg
c6659e37d8 small fixes 2020-08-21 18:02:20 +02:00
svlandeg
518a1f97f3 remove outdated TODOs 2020-08-21 17:55:15 +02:00
Ines Montani
2cc4640385 Update docs [ci skip] 2020-08-21 16:21:55 +02:00
Ines Montani
74cb6d39d0 Update docs [ci skip] 2020-08-21 16:11:38 +02:00
Ines Montani
aa6a7cd6e7 Update docs and consistency [ci skip] 2020-08-21 13:49:18 +02:00
Ines Montani
52bd3a8b48 Update docs [ci skip] 2020-08-21 13:22:59 +02:00
Ines Montani
e60442d83a Adjust label casing in displaCy NER visualizer (resolves #4866)
- Accept any case for label names in ents and colors option, even if actual predicted label uses different casing
- Don't apply `text-transform: uppercase` visually, in case it's important to users that the label is represented as-is in the UI
2020-08-21 11:51:31 +02:00
Ines Montani
04e4d59235 Update docs [ci skip] 2020-08-20 16:17:25 +02:00
Ines Montani
6ad59d59fe Merge branch 'develop' of https://github.com/explosion/spaCy into develop [ci skip] 2020-08-20 11:20:58 +02:00
Ines Montani
fb51b55eb9 Add comment [ci skip] 2020-08-20 11:20:43 +02:00
Ines Montani
2253d26b82 Update vectors and similarity docs [ci skip] 2020-08-19 21:18:26 +02:00
Ines Montani
15e6feed01 Update docs [ci skip] 2020-08-19 20:37:54 +02:00
svlandeg
d8f6abdc23 add linking TODO back in 2020-08-19 18:00:35 +02:00
svlandeg
169b5bcda0 Merge remote-tracking branch 'upstream/develop' into feature/update-docs
# Conflicts:
#	website/docs/usage/training.md
2020-08-19 17:58:25 +02:00
svlandeg
7119295a8a badgers intro 2020-08-19 17:53:22 +02:00
svlandeg
4906a2ae6c custom functions intro 2020-08-19 17:32:35 +02:00
svlandeg
7a2e6a96f5 fix typo 2020-08-19 16:54:16 +02:00
svlandeg
648499157a rename "custom models" to "custom functions" 2020-08-19 16:53:51 +02:00
Ines Montani
63921161c8 Update docs [ci skip] 2020-08-19 16:04:21 +02:00
svlandeg
d3a8321172 fix typos 2020-08-19 15:12:12 +02:00
Ines Montani
225f8866a1 Fix consistency 2020-08-19 12:47:57 +02:00
Ines Montani
9c25656ccc Update docs [ci skip] 2020-08-19 12:14:41 +02:00
Ines Montani
2285e59765
Merge pull request #5933 from svlandeg/feature/more-v3-docs [ci skip] 2020-08-19 11:29:02 +02:00
Ines Montani
13291e97ba Update docs [ci skip] 2020-08-19 00:28:37 +02:00
svlandeg
6ed67d495a format 2020-08-18 19:43:20 +02:00
svlandeg
f9fe5eb323 clean up example 2020-08-18 19:35:23 +02:00
svlandeg
a8acedd4ba example of custom reader and batcher 2020-08-18 19:15:16 +02:00
svlandeg
abba639565 Merge remote-tracking branch 'upstream/develop' into feature/more-v3-docs 2020-08-18 18:55:12 +02:00
Ines Montani
82f0e20318 Update docs and consistency [ci skip] 2020-08-18 14:39:40 +02:00
Matthew Honnibal
b72bd1767f Remove todo 2020-08-18 13:52:22 +02:00
Matthew Honnibal
574fd53289 Add precision/recall description 2020-08-18 13:51:08 +02:00
Matthew Honnibal
96a9c65f97 Add model architectures intro 2020-08-18 13:50:55 +02:00
svlandeg
f7b76d2d83 Merge remote-tracking branch 'upstream/develop' into feature/more-v3-docs 2020-08-18 11:57:52 +02:00
svlandeg
8dcda351ec typos and quick note on default values 2020-08-18 10:23:27 +02:00
Ines Montani
ef6cf3b276 Update docs [ci skip] 2020-08-18 01:29:34 +02:00
Ines Montani
728fec0194 Update docs [ci skip] 2020-08-18 00:49:19 +02:00
Ines Montani
9299166c75
Merge pull request #5925 from explosion/docs/vectors [ci skip]
Update the 'vectors' docs page
2020-08-17 21:45:09 +02:00
svlandeg
4fe4bab1c9 typo fixes 2020-08-17 17:10:15 +02:00
svlandeg
da80c18660 merge develop into branch 2020-08-17 16:57:18 +02:00
Ines Montani
3ae5e02f4f Update docs, types and API consistency 2020-08-17 16:45:24 +02:00
Matthew Honnibal
052d82aa4e Suggest vectors changes 2020-08-17 15:32:30 +02:00
svlandeg
961e818be6 p/r definitions 2020-08-17 15:02:39 +02:00
svlandeg
319692aa53 fix typos 2020-08-17 14:05:48 +02:00
Matthew Honnibal
be07567ac6 Update transformers page 2020-08-16 20:29:50 +02:00
Matthew Honnibal
8e5f99ee25 Update transformer docs intro. Also write system requirements 2020-08-16 20:13:24 +02:00
Ines Montani
a570c304df Update quickstart, template and docs 2020-08-15 14:50:29 +02:00
Ines Montani
950832f087
Tidy up pipes (#5906)
* Tidy up pipes

* Fix init, defaults and raise custom errors

* Update docs

* Update docs [ci skip]

* Apply suggestions from code review

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>

* Tidy up error handling and validation, fix consistency

* Simplify get_examples check

* Remove unused import [ci skip]

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-11 23:29:31 +02:00
Ines Montani
b7ec06e331 Update docs [ci skip] 2020-08-11 20:57:23 +02:00
Ines Montani
10f42e3a39 Update docs [ci skip] 2020-08-11 00:09:49 +02:00
Ines Montani
2778d04377 Update docs [ci skip] 2020-08-10 23:41:09 +02:00
Ines Montani
023ba7ae26 Update docs 2020-08-10 17:13:11 +02:00
Ines Montani
12052bd8f6 Update docs [ci skip] 2020-08-10 01:20:10 +02:00
Ines Montani
d611cbef43 Update docs [ci skip] 2020-08-10 00:42:26 +02:00
Ines Montani
c044460823 Update docs [ci skip] 2020-08-10 00:01:38 +02:00
Ines Montani
05dcab10aa Fix typo 2020-08-09 22:34:03 +02:00
Ines Montani
8d2baa153d Update tokenizer docs and add test 2020-08-09 15:24:01 +02:00
Ines Montani
3901b088ff Update graphics and 101 [ci skip] 2020-08-07 17:14:13 +02:00
Ines Montani
5e1421e5a6 Update docs [ci skip] 2020-08-07 16:23:12 +02:00
Ines Montani
b7e34c1451 Update docs [ci skip] 2020-08-07 16:13:13 +02:00
Ines Montani
e829d3bf14 Update docs [ci skip] 2020-08-07 15:46:20 +02:00
svlandeg
824f4b2107 casing consistent 2020-08-06 23:20:13 +02:00
Ines Montani
e5995904d6 Update docs 2020-08-06 19:30:43 +02:00
Ines Montani
5d417d3b19 WIP: Update docs [ci skip] 2020-08-06 13:10:15 +02:00
Ines Montani
06e80d95cd
Sync develop with nightly docs state (#5883)
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2020-08-06 00:28:14 +02:00
Ines Montani
50311a4d37 Update docs [ci skip] 2020-08-05 20:29:53 +02:00
Ines Montani
cdec46493f Update docs 2020-08-05 15:00:54 +02:00
Ines Montani
4c055f0aa7
Add init CLI and init config (#5854)
* Add init CLI and init config draft

* Improve config validation

* Auto-format

* Don't export anything in debug config

* Update docs
2020-08-02 15:18:30 +02:00
Ines Montani
b40f44419b Simplify pipe analysis
- remove unused code
- don't print by default
- integrate attrs info into analysis output
2020-08-01 13:40:06 +02:00
Ines Montani
98c6a85c8b Update docs [ci skip] 2020-07-31 18:55:38 +02:00
Ines Montani
e9e8fa2466 Update docs and types 2020-07-31 17:02:54 +02:00
Ines Montani
160f1a5f94 Update docs [ci skip] 2020-07-31 13:26:39 +02:00
Ines Montani
3449c45fd9 Update docs [ci skip] 2020-07-29 19:48:26 +02:00
Ines Montani
9c80cb673d Update docs [ci skip] 2020-07-29 19:41:34 +02:00
Ines Montani
9f69afdd1e Update docs [ci skip] 2020-07-29 19:09:44 +02:00
Ines Montani
7a21775cd0
Merge pull request #5834 from explosion/feature/vectors 2020-07-29 18:49:26 +02:00
Ines Montani
158d8c1e48 Update docs [ci skip] 2020-07-29 18:44:10 +02:00
Matthew Honnibal
f7adc9d3b7 Start rewriting vectors docs 2020-07-29 17:10:06 +02:00
Ines Montani
e0ffe36e79 Update docstrings, docs and types 2020-07-29 11:36:42 +02:00
Ines Montani
d8b519c23c API docs, docstrings and argument consistency 2020-07-27 18:11:45 +02:00
Ines Montani
7dd53d0964 Fix typo [ci skip] 2020-07-27 00:34:00 +02:00
Ines Montani
7adbaf9a5b Update docs [ci skip] 2020-07-27 00:29:45 +02:00
Matthew Honnibal
fb5dbe30b5 Trim training 101 2020-07-26 13:43:22 +02:00
Matthew Honnibal
e6a7deb7cc Edits to the training 101 section 2020-07-26 13:42:08 +02:00
Ines Montani
c288dba8e7 Update docs [ci skip] 2020-07-25 18:51:12 +02:00
Li Zhe
a69eb445dc
fix the wrong hash url in adding-languages.md file (#5810)
* fix the wrong hash url in adding-languages.md file

change the #101 url hash path to #language-data

* filled in the spaCy Contributor Agreement 

filled in the spaCy Contributor Agreement
2020-07-25 13:13:38 +02:00
Adriane Boyd
d3385f4be2 Add Morphology and MorphAnalysis to overview 2020-07-21 13:06:22 +02:00
Ines Montani
644074b954 Merge branch 'develop' into master-tmp 2020-07-20 14:58:04 +02:00
Adriane Boyd
39ebcd9ec9
Refactor Chinese tokenizer configuration (#5736)
* Refactor Chinese tokenizer configuration

Refactor `ChineseTokenizer` configuration so that it uses a single
`segmenter` setting to choose between character segmentation, jieba, and
pkuseg.

* replace `use_jieba`, `use_pkuseg`, `require_pkuseg` with the setting
`segmenter` with the supported values: `char`, `jieba`, `pkuseg`
* make the default segmenter plain character segmentation `char` (no
additional libraries required)

* Fix Chinese serialization test to use char default

* Warn if attempting to customize other segmenter

Add a warning if `Chinese.pkuseg_update_user_dict` is called when
another segmenter is selected.
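
The single-setting design and the warning described above can be sketched without spaCy. Everything below is a hypothetical stand-in: `resolve_segmenter` and `update_user_dict_sketch` are illustrative names, and only the supported values (`char`, `jieba`, `pkuseg`) and the `char` default come from the commit message.

```python
import warnings
from typing import Dict, List

SUPPORTED_SEGMENTERS = ("char", "jieba", "pkuseg")
DEFAULT_SEGMENTER = "char"  # plain character segmentation, no extra libraries


def resolve_segmenter(settings: Dict[str, str]) -> str:
    """Pick the segmenter from a settings dict, falling back to the default."""
    segmenter = settings.get("segmenter", DEFAULT_SEGMENTER)
    if segmenter not in SUPPORTED_SEGMENTERS:
        raise ValueError(
            f"Unsupported segmenter {segmenter!r}; "
            f"expected one of {SUPPORTED_SEGMENTERS}"
        )
    return segmenter


def update_user_dict_sketch(segmenter: str, words: List[str]) -> bool:
    """Warn and do nothing unless pkuseg is the selected segmenter."""
    if segmenter != "pkuseg":
        warnings.warn(
            f"User dictionary updates only apply to pkuseg, not {segmenter!r}"
        )
        return False
    # A real implementation would pass `words` on to pkuseg here.
    return True
```

A single string setting makes the three options mutually exclusive by construction, which separate boolean flags could not guarantee.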
2020-07-19 13:34:37 +02:00
Adriane Boyd
cd5af72c9a
Update pkuseg version (#5774)
* Update pkuseg version in Chinese tokenizer warnings
* Update pkuseg version in `Makefile`
* Remove warning about python3.8 wheels in docs
2020-07-19 11:09:49 +02:00
Ines Montani
872938ec76
Merge pull request #5747 from explosion/feature/refactor-config-args 2020-07-14 00:00:22 +02:00
Ines Montani
5f6f4ff594 Remove object subclassing 2020-07-12 14:03:23 +02:00
Ines Montani
3f948b9c74 Update docs 2020-07-12 12:32:28 +02:00
Ines Montani
7b5717cac3 Merge branch 'develop' into feature/refactor-config-args 2020-07-10 22:50:07 +02:00
Ines Montani
e6a6587a9a Update projects.md [ci skip] 2020-07-10 22:41:27 +02:00
Ines Montani
f2cd982e7b Update training.md 2020-07-10 22:34:27 +02:00
Ines Montani
52e9b5b472 Fix formatting 2020-07-09 23:25:58 +02:00
Ines Montani
28cdae898a Update projects.md 2020-07-09 22:35:54 +02:00
Ines Montani
7bcf9f7cfb Document new features 2020-07-09 21:10:36 +02:00
Ines Montani
ea01831f6a Update projects docs etc. 2020-07-09 19:43:25 +02:00
Ines Montani
2298e129e6 Update example and training docs 2020-07-07 20:30:12 +02:00
svlandeg
2b60e894cb fix component constructors, update, begin_training, reference to GoldParse 2020-07-07 19:17:19 +02:00
Ines Montani
bb3ee38cf9 Update WIP 2020-07-06 22:22:37 +02:00
Ines Montani
44790c1c32 Update docs and add keyword-only tag 2020-07-06 18:14:57 +02:00
Ines Montani
a35236e5f0 Update v3 docs WIP [ci skip] 2020-07-06 15:57:44 +02:00
Ines Montani
63247cbe87 Update v3 docs [ci skip] 2020-07-05 16:11:16 +02:00
Ines Montani
dc8c9d912f Update docs [ci skip] 2020-07-04 16:47:24 +02:00
Ines Montani
1e0d54edd1 Update docs 2020-07-04 14:23:10 +02:00
Ines Montani
06f1ecb308 Update v3 docs 2020-07-03 16:48:21 +02:00
Ines Montani
b5268955d7 Update matcher usage examples [ci skip] 2020-07-02 15:39:45 +02:00
Ines Montani
fe4cfd0632 Start updating website for v3 [ci skip] 2020-07-01 21:26:39 +02:00
Ines Montani
26df4efa94 Add new in v3.0 2020-07-01 13:02:17 +02:00
Ines Montani
414dc7ace1 Merge branch 'spacy.io' into spacy.io-develop 2020-07-01 11:47:47 +02:00
Matthias Hertel
305221f3e5 Website: fixed the token span in the text about the rule-based matching example (#5669)
* fixed token span in pattern matcher example

* contributor agreement
2020-06-30 19:58:55 +02:00
Matthias Hertel
8b0f749606
Website: fixed the token span in the text about the rule-based matching example (#5669)
* fixed token span in pattern matcher example

* contributor agreement
2020-06-30 19:58:23 +02:00
Adriane Boyd
d777d9cc38 Extend v2.3 migration guide (#5653)
* Extend preloaded vocab section

* Add section on tag maps
2020-06-26 14:13:01 +02:00
Adriane Boyd
c4d0209472
Extend v2.3 migration guide (#5653)
* Extend preloaded vocab section

* Add section on tag maps
2020-06-26 14:12:29 +02:00
Adriane Boyd
a2660bd9c6 Fix backslashes in warnings config diff (#5640)
Fix backslashes in warnings config diff in v2.3 migration section.
2020-06-24 10:26:57 +02:00
Adriane Boyd
fd4287c178
Fix backslashes in warnings config diff (#5640)
Fix backslashes in warnings config diff in v2.3 migration section.
2020-06-24 10:26:12 +02:00
Adriane Boyd
4f73ced914 Extend what's new in v2.3 with vocab / is_oov (#5635) 2020-06-23 16:50:43 +02:00
Adriane Boyd
7ce451c211
Extend what's new in v2.3 with vocab / is_oov (#5635) 2020-06-23 16:48:59 +02:00
Adriane Boyd
fcdecefacf Add warnings example in v2.3 migration guide (#5627) 2020-06-22 14:38:06 +02:00
Adriane Boyd
bc1cb30b21
Add warnings example in v2.3 migration guide (#5627) 2020-06-22 14:37:24 +02:00
Ines Montani
52728d8fa3 Merge branch 'develop' into master-tmp 2020-06-20 15:52:00 +02:00
Adriane Boyd
66889de166 Warning for sudachipy 0.4.5 (#5611) 2020-06-19 13:45:23 +02:00
Adriane Boyd
931d80de72
Warning for sudachipy 0.4.5 (#5611) 2020-06-19 12:43:41 +02:00
Ines Montani
6d712f3e06
Merge pull request #5599 from adrianeboyd/docs/v2.3.0-minor 2020-06-16 13:49:25 -07:00
Adriane Boyd
02369f91d3 Fix spacy convert argument 2020-06-16 20:41:17 +02:00
Adriane Boyd
f0fd77648f Change example title to Dr.
Change example title to Dr. so the current model does exclude the title
in the initial example.
2020-06-16 20:36:21 +02:00
Adriane Boyd
a6abdfbc3c Fix numpy.zeros() dtype for Doc.from_array 2020-06-16 20:35:45 +02:00
Adriane Boyd
9aff317ca7 Update POS in tagging example 2020-06-16 20:26:57 +02:00
Adriane Boyd
457babfa0c Update alignment example for new gold.align 2020-06-16 20:22:03 +02:00
Ines Montani
44af53bdd9 Add pkuseg warnings and auto-format [ci skip] 2020-06-16 17:13:35 +02:00
Adriane Boyd
d5110ffbf2
Documentation updates for v2.3.0 (#5593)
* Update website models for v2.3.0

* Add docs for Chinese word segmentation

* Tighten up Chinese docs section

* Merge branch 'master' into docs/v2.3.0 [ci skip]

* Merge branch 'master' into docs/v2.3.0 [ci skip]

* Auto-format and update version

* Update matcher.md

* Update languages and sorting

* Typo in landing page

* Infobox about token_match behavior

* Add meta and basic docs for Japanese

* POS -> TAG in models table

* Add info about lookups for normalization

* Updates to API docs for v2.3

* Update adding norm exceptions for adding languages

* Add --omit-extra-lookups to CLI API docs

* Add initial draft of "What's New in v2.3"

* Add new in v2.3 tags to Chinese and Japanese sections

* Add tokenizer to migration section

* Add new in v2.3 flags to init-model

* Typo

* More what's new in v2.3

Co-authored-by: Ines Montani <ines@ines.io>
2020-06-16 15:37:35 +02:00
Ines Montani
810fce3bb1 Merge branch 'develop' into master-tmp 2020-06-03 14:36:59 +02:00
Ines Montani
262d306eaa unicode -> str consistency 2020-05-24 17:23:00 +02:00
Ines Montani
5d3806e059 unicode -> str consistency 2020-05-24 17:20:58 +02:00
Jannis
aa53ce6996
Documentation Typo Fix (#5492)
* Fix typo

Change 'realize' to 'realise'

* Add contributor agreement
2020-05-22 19:50:26 +02:00
Adriane Boyd
e4a1b5dab1 Rename to url_match
Rename to `url_match` and update docs.
2020-05-22 12:41:03 +02:00
Adriane Boyd
730fa493a4 Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match 2020-05-22 12:18:00 +02:00
Ines Montani
24f72c669c Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
Sofie Van Landeghem
0d94737857
Feature toggle_pipes (#5378)
* make disable_pipes deprecated in favour of the new toggle_pipes

* rewrite disable_pipes statements

* update documentation

* remove bin/wiki_entity_linking folder

* one more fix

* remove deprecated link to documentation

* few more doc fixes

* add note about name change to the docs

* restore original disable_pipes

* small fixes

* fix typo

* fix error number to W096

* rename to select_pipes

* also make changes to the documentation

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-05-18 22:27:10 +02:00
Ines Montani
f333c2a011
Merge pull request #5386 from svlandeg/fix/nel-docs 2020-05-10 12:00:09 +02:00
adrianeboyd
4a15b559ba
Clarify Token.pos as UPOS (#5419) 2020-05-08 10:36:25 +02:00
Adriane Boyd
792c8af8cf Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match 2020-05-05 09:25:57 +02:00
svlandeg
ebaed7dcfa Few more updates to the EL documentation 2020-04-30 10:17:06 +02:00
Sofie Van Landeghem
cfdaf99b80
Fix passing of component configuration (#5374)
* add kwargs to to_disk methods in docs - otherwise crashes on 'exclude' argument

* add fix and test for Issue 5137
2020-04-29 12:56:17 +02:00
Sofie Van Landeghem
f67343295d
Update NEL examples and documentation (#5370)
* simplify creation of KB by skipping dim reduction

* small fixes to train EL example script

* add KB creation and NEL training example scripts to example section

* update descriptions of example scripts in the documentation

* moving wiki_entity_linking folder from bin to projects

* remove test for wiki NEL functionality that is being moved
2020-04-29 12:53:53 +02:00
adrianeboyd
90ce34db42
Add cuda101 and cuda102 options to setup (#5377)
* Add cuda101 and cuda102 options to setup

* Update cudaNNN options in docs
2020-04-29 12:51:12 +02:00
Mike
481574cbc8
[minor doc change] embedding vis. link is broken in website/docs/usage/examples.md (#5325)
* The embedding vis. link is broken

The first link seems to be reasonable for now unless someone has an updated embedding vis they want to share?

* contributor agreement

* Update Mlawrence95.md

* Update website/docs/usage/examples.md

Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-04-21 20:35:12 +02:00
Sofie Van Landeghem
1137420840
Small doc fixes (#5250)
* fix link

* torchtext instead of tochtext
2020-04-03 13:01:43 +02:00
Sofie Van Landeghem
9b412516e7
Fixing pickling of the parser (#5218)
* fix __reduce__ for pickling parser

* setting the move object as 'state' during pickling

* unskip test_issue4725 - works again
2020-03-27 19:35:26 +01:00
Ines Montani
46568f40a7 Merge branch 'master' into tmp/sync 2020-03-26 13:38:14 +01:00
Tiljander
e53232533b
Describing priority rules for overlapping matches (#5197)
* Describing priority rules for overlapping matches

* Create Tiljander.md

* Describing priority rules for overlapping matches

* Update website/docs/api/entityruler.md

Co-Authored-By: Ines Montani <ines@ines.io>

Co-authored-by: Ines Montani <ines@ines.io>
2020-03-26 13:13:22 +01:00
adrianeboyd
d88a377bed
Remove Vectors.from_glove (#5209) 2020-03-26 10:45:47 +01:00
Ines Montani
17bd9ed84f
Merge pull request #5153 from pinealan/fix/website-docs
Fix website typos and weird sentences
2020-03-16 15:03:01 +01:00
Alan Chan
36e3532475 Remove unfinished sentence 2020-03-15 03:45:17 +08:00
Mark Abraham
a0ffa346c0 Fix broken link in docs 2020-03-13 14:07:26 +01:00
Renaud Richardet
eccf6b1686
small typo in code sample 2020-03-09 14:49:11 +01:00
Adriane Boyd
0c31f03ec5 Update docs [ci skip] 2020-03-09 13:41:17 +01:00
Adriane Boyd
1139247532 Revert changes to token_match priority from #4374
* Revert changes to priority of `token_match` so that it has priority
over all other tokenizer patterns

* Add lookahead and potentially slow lookbehind back to the default URL
pattern

* Expand character classes in URL pattern to improve matching around
lookaheads and lookbehinds related to #4882

* Revert changes to Hungarian tokenizer

* Revert (xfail) several URL tests to their status before #4374

* Update `tokenizer.explain()` and docs accordingly
2020-03-09 12:09:41 +01:00
Ines Montani
de11ea753a Merge branch 'master' into develop 2020-02-18 14:47:23 +01:00
Kabir Khan
f6ed07b85c
Use nlp.pipe in EntityRuler for phrase patterns in add_patterns (#4931)
* Fix ent_ids and labels properties when id attribute used in patterns

* use set for labels

* sort ent_ids for comparison in entity_ruler tests

* fixing entity_ruler ent_ids test

* add to set

* Run make_doc optimistically if using phrase matcher patterns.

* remove unused coveragerc I was testing with

* format

* Refactor EntityRuler.add_patterns to use nlp.pipe for phrase patterns. Improves speed substantially.

* Removing old add_patterns function

* Fixing spacing

* Make sure token_patterns loaded as well, before generator was being emptied in from_disk
2020-02-16 18:17:47 +01:00
Julin S
479e81bafc
fix link (#4977) 2020-02-10 20:31:26 -05:00
Ines Montani
9c08d9baa3 Remove old sections [ci skip] (closes #4961) 2020-02-03 13:10:46 +01:00
Preston Badeer
b216ff43c9 Update vectors-similarity.md (#4889)
These links are broken on the website, due to quotes around the URLs.
2020-01-08 16:49:40 +01:00
Geoffrey Gordon Ashbrook
53929138d7 remove extra word typo (#4875)
"let you find you"
2020-01-06 12:37:42 +01:00
Ines Montani
400257a802 Update index.md [ci skip] 2020-01-04 01:52:18 +01:00
Ines Montani
db55577c45
Drop Python 2.7 and 3.5 (#4828)
* Remove unicode declarations

* Remove Python 3.5 and 2.7 from CI

* Don't require pathlib

* Replace compat helpers

* Remove OrderedDict

* Use f-strings

* Set Cython compiler language level

* Fix typo

* Re-add OrderedDict for Table

* Update setup.cfg

* Revert CONTRIBUTING.md

* Revert lookups.md

* Revert top-level.md

* Small adjustments and docs [ci skip]
2019-12-22 01:53:56 +01:00
Ines Montani
158b98a3ef Merge branch 'master' into develop 2019-12-21 18:55:03 +01:00
Ines Montani
1b838d1313 Divide models into core and starters [ci skip] 2019-12-21 14:10:22 +01:00
Nicolai Bjerre Pedersen
de5453cdcb Fix link to user hooks in docs (#4778)
* Fix link to user hooks in docs

* Update mr_bjerre.md

Mistake in contributor agreement

* Apparently hard to get it right (wrong name of sca)
2019-12-06 19:17:12 +01:00
Ines Montani
cbacb0f1a4 Update shape docs and examples (resolves #4615) [ci skip] 2019-11-23 17:16:55 +01:00
Ines Montani
235fe6fe3b Auto-format [ci skip] 2019-11-20 13:14:58 +01:00
adrianeboyd
2c876eb672 Add tokenizer explain() debugging method (#4596)
* Expose tokenizer rules as a property

Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)

Add tests and update Tokenizer API docs.

* Update Hungarian punctuation to remove empty string

Update Hungarian punctuation definitions so that `_units` does not match
an empty string.

* Use _load_special_tokenization consistently

Use `_load_special_tokenization()` and have it handle `None` checks.

* Fix precedence of `token_match` vs. special cases

Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.

* Add `make_debug_doc()` to the Tokenizer

Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.

Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.

* Update tokenization usage docs

Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.

Add more examples for customizing tokenizers while preserving the
existing defaults.

Minor edits / clarifications.

* Revert "Update Hungarian punctuation to remove empty string"

This reverts commit f0a577f7a5.

* Rework `make_debug_doc()` as `explain()`

Rework `make_debug_doc()` as `explain()`, which returns a list of
`(pattern_string, token_string)` tuples rather than a non-standard
`Doc`. Update docs and tests accordingly, leaving the visualization for
future work.

* Handle cases with bad tokenizer patterns

Detect when tokenizer patterns match empty prefixes and suffixes so that
`explain()` does not hang on bad patterns.

* Remove unused displacy image

* Add tokenizer.explain() to usage docs
2019-11-20 13:07:25 +01:00
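The commit above describes `explain()` as returning a list of `(pattern_string, token_string)` tuples. The following is a minimal toy sketch of that output shape, not spaCy's implementation: the rule tables (`SPECIAL_CASES`, `PREFIX_RE`, `SUFFIX_RE`) and the rule names are hypothetical, and the patterns are chosen so they can never match an empty string, which avoids the hang mentioned under "Handle cases with bad tokenizer patterns".

```python
import re

# Hypothetical toy rules for illustration only; spaCy's real patterns differ.
SPECIAL_CASES = {"don't": ["do", "n't"]}
PREFIX_RE = re.compile(r'^[\(\["]')      # exactly one leading bracket or quote
SUFFIX_RE = re.compile(r'[\)\]",.!?]$')  # exactly one trailing punctuation mark


def explain(text):
    """Return (rule_name, token_string) tuples for a whitespace-split text."""
    result = []
    for chunk in text.split():
        suffixes = []  # collected right-to-left, emitted after the core token
        while chunk:
            if chunk in SPECIAL_CASES:
                # Special cases are checked first on every pass.
                result.extend(("SPECIAL", piece) for piece in SPECIAL_CASES[chunk])
                chunk = ""
                continue
            prefix = PREFIX_RE.search(chunk)
            if prefix:
                result.append(("PREFIX", prefix.group()))
                chunk = chunk[prefix.end():]
                continue
            suffix = SUFFIX_RE.search(chunk)
            if suffix:
                suffixes.append(("SUFFIX", suffix.group()))
                chunk = chunk[:suffix.start()]
                continue
            # Nothing left to strip: the remainder is a plain token.
            result.append(("TOKEN", chunk))
            chunk = ""
        result.extend(reversed(suffixes))
    return result


for rule, token in explain("(don't stop!)"):
    print(rule, token)
```

Each tuple names the rule that produced the token, which is what makes the output useful for debugging a customized tokenizer.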
Ines Montani
e8b9cee6fd Make example consistent with model (closes #4587) [ci skip] 2019-11-18 12:41:48 +01:00
Ines Montani
e01a1a237f Auto-format [ci skip] 2019-11-18 12:41:31 +01:00
adrianeboyd
62e00fd9da Update tokenization usage docs (#4666)
Update pseudo-code and algorithm description to correspond to current
tokenizer behavior.

Add more examples for customizing tokenizers while preserving the
existing defaults.

Minor edits / clarifications.
2019-11-18 12:35:13 +01:00
Ines Montani
5adcb352e9 Adjust order of docs sections [ci skip] 2019-11-17 16:08:56 +01:00
Ines Montani
e30d08410a
Add CI for Python 3.8 (#4479)
* Add 3.8 classifier

* Update azure-pipelines.yml

* Remove 3.8 warning from docs [ci skip]
2019-11-15 01:13:48 +01:00
adrianeboyd
faaa832518 Generalize handling of tokenizer special cases (#4259)
* Generalize handling of tokenizer special cases

Handle tokenizer special cases more generally by using the Matcher
internally to match special cases after the affix/token_match
tokenization is complete.

Instead of only matching special cases while processing balanced or
nearly balanced prefixes and suffixes, this recognizes special cases in
a wider range of contexts:

* Allows arbitrary numbers of prefixes/affixes around special cases
* Allows special cases separated by infixes

Existing tests/settings that couldn't be preserved as before:

* The emoticon '")' is no longer a supported special case
* The emoticon ':)' in "example:)" is a false positive again

When merged with #4258 (or the relevant cache bugfix), the affix and
token_match properties should be modified to flush and reload all
special cases to use the updated internal tokenization with the Matcher.

* Remove accidentally added test case

* Really remove accidentally added test

* Reload special cases when necessary

Reload special cases when affixes or token_match are modified. Skip
reloading during initialization.

* Update error code number

* Fix offset and whitespace in Matcher special cases

* Fix offset bugs when merging and splitting tokens
* Set final whitespace on final token in inserted special case

* Improve cache flushing in tokenizer

* Separate cache and specials memory (temporarily)
* Flush cache when adding special cases
* Repeated `self._cache = PreshMap()` and `self._specials = PreshMap()`
are necessary due to this bug:
https://github.com/explosion/preshed/issues/21

* Remove reinitialized PreshMaps on cache flush

* Update UD bin scripts

* Update imports for `bin/`
* Add all currently supported languages
* Update subtok merger for new Matcher validation
* Modify blinded check to look at tokens instead of lemmas (for corpora
with tokens but not lemmas like Telugu)

* Use special Matcher only for cases with affixes

* Reinsert specials cache checks during normal tokenization for special
cases as much as possible
  * Additionally include specials cache checks while splitting on infixes
  * Since the special Matcher needs consistent affix-only tokenization
    for the special cases themselves, introduce the argument
    `with_special_cases` in order to do tokenization with or without
    specials cache checks
* After normal tokenization, postprocess with special cases Matcher for
special cases containing affixes

* Replace PhraseMatcher with Aho-Corasick

Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays
of the hash values for the relevant attribute. The implementation is
based on FlashText.

The speed should be similar to the previous PhraseMatcher. It is now
possible to easily remove match IDs and matches don't go missing with
large keyword lists / vocabularies.

Fixes #4308.

* Restore support for pickling

* Fix internal keyword add/remove for numpy arrays

* Add test for #4248, clean up test

* Improve efficiency of special cases handling

* Use PhraseMatcher instead of Matcher
* Improve efficiency of merging/splitting special cases in document
  * Process merge/splits in one pass without repeated token shifting
  * Merge in place if no splits

* Update error message number

* Remove UD script modifications

Only used for timing/testing, should be a separate PR

* Remove final traces of UD script modifications

* Update UD bin scripts

* Update imports for `bin/`
* Add all currently supported languages
* Update subtok merger for new Matcher validation
* Modify blinded check to look at tokens instead of lemmas (for corpora
with tokens but not lemmas like Telugu)

* Add missing loop for match ID set in search loop

* Remove cruft in matching loop for partial matches

There was a bit of unnecessary code left over from FlashText in the
matching loop to handle partial token matches, which we don't have with
PhraseMatcher.

* Replace dict trie with MapStruct trie

* Fix how match ID hash is stored/added

* Update fix for match ID vocab

* Switch from map_get_unless_missing to map_get

* Switch from numpy array to Token.get_struct_attr

Access token attributes directly in Doc instead of making a copy of the
relevant values in a numpy array.

Add unsatisfactory warning for hash collision with reserved terminal
hash key. (Ideally it would change the reserved terminal hash and redo
the whole trie, but for now, I'm hoping there won't be collisions.)

* Restructure imports to export find_matches

* Implement full remove()

Remove unnecessary trie paths and free unused maps.

Parallel to Matcher, raise KeyError when attempting to remove a match ID
that has not been added.

* Switch to PhraseMatcher.find_matches

* Switch to local cdef functions for span filtering

* Switch special case reload threshold to variable

Refer to variable instead of hard-coded threshold

* Move more of special case retokenize to cdef nogil

Move as much of the special case retokenization to nogil as possible.

* Rewrap sort as stdsort for OS X

* Rewrap stdsort with specific types

* Switch to qsort

* Fix merge

* Improve cmp functions

* Fix realloc

* Fix realloc again

* Initialize span struct while retokenizing

* Temporarily skip retokenizing

* Revert "Move more of special case retokenize to cdef nogil"

This reverts commit 0b7e52c797.

* Revert "Switch to qsort"

This reverts commit a98d71a942.

* Fix specials check while caching

* Modify URL test with emoticons

The multiple suffix tests result in the emoticon `:>`, which is now
retokenized into one token as a special case after the suffixes are
split off.

* Refactor _apply_special_cases()

* Use cdef ints for span info used in multiple spots

* Modify _filter_special_spans() to prefer earlier

Parallel to #4414, modify _filter_special_spans() so that the earlier
span is preferred for overlapping spans of the same length.

* Replace MatchStruct with Entity

Replace MatchStruct with Entity since the existing Entity struct is
nearly identical.

* Replace Entity with more general SpanC

* Replace MatchStruct with SpanC

* Add error in debug-data if no dev docs are available (see #4575)

* Update azure-pipelines.yml

* Revert "Update azure-pipelines.yml"

This reverts commit ed1060cf59.

* Use latest wasabi

* Reorganise install_requires

* add dframcy to universe.json (#4580)

* Update universe.json [ci skip]

* Fix multiprocessing for as_tuples=True (#4582)

* Fix conllu script (#4579)

* force extensions to avoid clash between example scripts

* fix arg order and default file encoding

* add example config for conllu script

* newline

* move extension definitions to main function

* few more encodings fixes

* Add load_from_docbin example [ci skip]

TODO: upload the file somewhere

* Update README.md

* Add warnings about 3.8 (resolves #4593) [ci skip]

* Fixed typo: Added space between "recognize" and "various" (#4600)

* Fix DocBin.merge() example (#4599)

* Replace function registries with catalogue (#4584)

* Replace functions registries with catalogue

* Update __init__.py

* Fix test

* Revert unrelated flag [ci skip]

* Bugfix/dep matcher issue 4590 (#4601)

* add contributor agreement for prilopes

* add test for issue #4590

* fix on_match params for DependencyMatcher (#4590)

* Minor updates to language example sentences (#4608)

* Add punctuation to Spanish example sentences

* Combine multilanguage examples for lang xx

* Add punctuation to nb examples

* Always realloc to a larger size

Avoid potential (unlikely) edge case and cymem error seen in #4604.

* Add error in debug-data if no dev docs are available (see #4575)

* Update debug-data for GoldCorpus / Example

* Ignore None label in misaligned NER data
2019-11-13 21:24:35 +01:00
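The "Replace PhraseMatcher with Aho-Corasick" item above describes matching over hash values of token attributes stored in a trie with a reserved terminal key. The sketch below illustrates the general idea under simplifying assumptions: it uses nested Python dicts rather than the `MapStruct` trie, Python's built-in `hash()` rather than spaCy's hashing, and a plain trie walk from every start position rather than true Aho-Corasick with failure links. The names `build_trie`, `find_matches` and `TERMINAL` are hypothetical.

```python
# Sentinel key marking "a phrase ends here". It is a string, so it cannot
# collide with the integer token hashes used as the other keys.
TERMINAL = "_TERMINAL_"


def build_trie(phrases):
    """Build a nested-dict trie from a {match_id: [token, ...]} mapping."""
    trie = {}
    for match_id, tokens in phrases.items():
        node = trie
        for token in tokens:
            # Key on the token's hash, mirroring matching over hash values.
            node = node.setdefault(hash(token), {})
        node.setdefault(TERMINAL, set()).add(match_id)
    return trie


def find_matches(trie, tokens):
    """Return (match_id, start, end) for every phrase occurrence in tokens."""
    matches = []
    for start in range(len(tokens)):
        node = trie
        for end in range(start, len(tokens)):
            node = node.get(hash(tokens[end]))
            if node is None:
                break  # no phrase continues with this token
            for match_id in sorted(node.get(TERMINAL, ())):
                matches.append((match_id, start, end + 1))
    return matches


trie = build_trie({"CITY": ["New", "York"], "PAPER": ["New", "York", "Times"]})
print(find_matches(trie, "I read the New York Times".split()))
```

Because whole tokens are compared, a phrase can never match part of a token, which is why the partial-match handling inherited from FlashText could be dropped.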
Ines Montani
9d5ff177c4 Work around Markdown rendering issue surfaced in #4600 [ci skip] 2019-11-11 17:12:08 +01:00
walterhenry
5563c42ef5 Fixed typo: Added space between "recognize" and "various" (#4600) 2019-11-06 23:06:36 +01:00
Ines Montani
828ef27a32 Add warnings about 3.8 (resolves #4593) [ci skip] 2019-11-05 18:30:11 +01:00
Ines Montani
4e1de85e43 Update syntax iterators [ci skip] 2019-10-30 14:31:40 +01:00
Ines Montani
493be8e9db Update new version identifier [ci skip] 2019-10-25 11:42:49 +02:00
Ines Montani
f31876154d Adjust formatting [ci skip] 2019-10-25 11:19:46 +02:00
Kabir Khan
93640373c7 Make entity_ruler ent_id resolution 2x faster and add docs for… (#4513)
* Update entityruler.py

* Making ent_id resolution 2x faster and adding docs

* Fixing newlines in docstrings

* Fixing newlines in docstrings
2019-10-25 11:16:42 +02:00
adrianeboyd
7fc39f124c Fix logic in rules+model entity example [ci skip] (#4510) 2019-10-23 14:41:21 +02:00
adrianeboyd
3195a8f170 Add Entity Linking to menu (#4489) 2019-10-21 12:17:30 +02:00
Ines Montani
573e543e4a Alphanumeric -> alphabetic [ci skip]
see ines/spacy-course#38
2019-10-06 13:30:01 +02:00
Ines Montani
e65dffd80b Clarify serialization of extension attributes (closes #4377) [ci skip] 2019-10-05 11:58:00 +02:00
Sofie Van Landeghem
4e7259c6cf Bugfix initializing DocBin with attributes (#4368)
* docbin init fix + documentation fix + unit tests

* newline

* try with zlib instead of gzip (python 2 incompatibilities)
2019-10-03 14:48:45 +02:00
Ines Montani
80cf385f65 Update v2-2.md [ci skip] 2019-10-02 16:58:21 +02:00
Ines Montani
b6670bf0c2 Use consistent spelling 2019-10-02 10:37:39 +02:00
Ines Montani
475e3188ce Add docs on filtering overlapping spans for merging (resolves #4352) [ci skip] 2019-10-01 21:59:50 +02:00
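A minimal sketch of what filtering overlapping spans before merging can look like. The preference rule used here (longest span first, and for spans of equal length the earlier one) is an assumption taken from the wording of the `_filter_special_spans()` note in #4259, not from the docs this commit adds. Spans are plain `(start, end)` token offsets and `filter_overlapping` is a hypothetical helper name.

```python
def filter_overlapping(spans):
    """Keep a non-overlapping subset of (start, end) spans.

    Assumed rule: prefer longer spans; for equal length, prefer the earlier one.
    """
    # Longest first; ties broken by the smaller start offset.
    ordered = sorted(spans, key=lambda span: (-(span[1] - span[0]), span[0]))
    kept = []
    seen_tokens = set()
    for start, end in ordered:
        # Skip any span that shares a token with one already kept.
        if not any(i in seen_tokens for i in range(start, end)):
            kept.append((start, end))
            seen_tokens.update(range(start, end))
    return sorted(kept)


print(filter_overlapping([(0, 2), (1, 3), (2, 5), (6, 7)]))
```

Merging only the filtered spans avoids asking the retokenizer to merge two spans that claim the same token.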
Ines Montani
0dd127bb00 Update v2-2.md [ci skip] 2019-10-01 21:37:06 +02:00
Ines Montani
bc7e7db208 Fix wording [ci skip] 2019-10-01 14:20:44 +02:00
Ines Montani
2a3a4565cd Update infobox [ci skip] 2019-10-01 14:19:34 +02:00
Ines Montani
66aa0d479f Update v2.2 page [ci skip] 2019-10-01 14:11:05 +02:00
Ines Montani
a8a1800f2a Update lemma data documentation [ci skip] 2019-10-01 13:22:13 +02:00
Ines Montani
932ad9cb91 Fix typos and formatting [ci skip] 2019-10-01 12:30:04 +02:00
Ines Montani
3d8fd4b461 Revert #4334 2019-09-29 17:32:12 +02:00
Ines Montani
3bd4da068e Fix link [ci skip] 2019-09-29 17:30:38 +02:00
Ines Montani
089f44cc56 Update serialization docs [ci skip] 2019-09-29 17:11:13 +02:00
Ines Montani
c9cd516d96 Move tests out of package (#4334)
* Move tests out of package

* Fix typo
2019-09-28 18:05:00 +02:00
Ines Montani
10742d3219 Update v2 docs [ci skip] 2019-09-28 15:57:22 +02:00
Ines Montani
59beab8405 Update v2-2.md [ci skip] 2019-09-27 18:10:43 +02:00
Ines Montani
685e4b2554 Update v2-2.md [ci skip] 2019-09-27 16:35:01 +02:00
Em Zhan
aafa091541 Fix typo in documentation (#4322)
* Fix typo 'probj' instead of 'pobj'

* Add spaCy contributor agreement for zqianem
2019-09-25 19:42:18 +02:00
Ines Montani
197406de1d Update v2-2.md [ci skip] 2019-09-19 14:33:58 +02:00
Ines Montani
ddc09b08ed Update v2-2.md [ci skip] 2019-09-19 00:58:30 +02:00
Ines Montani
9c940eab94 Update version in examples [ci skip] 2019-09-18 21:23:26 +02:00
Ines Montani
f873548f6c Add backwards incompatibility [ci skip] 2019-09-18 21:21:48 +02:00
Ines Montani
dd1810f05a Update DocBin and add docs 2019-09-18 20:23:21 +02:00
Ines Montani
d62690b3ba Update examples 2019-09-18 19:57:36 +02:00
Matthew Honnibal
931e96b6c7 DocPallet->DocBin in docs 2019-09-18 15:17:26 +02:00