Added more docs

This commit is contained in:
M1ha 2020-02-06 13:39:56 +05:00
parent c0afa7b53a
commit f2dc978634
15 changed files with 355 additions and 83 deletions

View File

@ -1 +1,2 @@
# django-clickhouse
Documentation is [here](docs/index.md)

View File

@ -1,9 +1,9 @@
# Basic information
## About
This project's goal is to integrate the [Yandex ClickHouse](https://clickhouse.yandex/) database into a [Django](https://www.djangoproject.com/) project.
It is based on [infi.clickhouse-orm](https://github.com/Infinidat/infi.clickhouse_orm) library.
## Features
* Multiple ClickHouse database configuration in [settings.py](https://docs.djangoproject.com/en/2.1/ref/settings/)
* ORM to create and manage ClickHouse models.
* ClickHouse migration system.
@ -11,26 +11,26 @@ It is based on [infi.clickhouse-orm](https://github.com/Infinidat/infi.clickhous
* Effective periodical synchronization of django models to ClickHouse without losing data.
* Synchronization process monitoring.
## Requirements
* [Python 3](https://www.python.org/downloads/)
* [Django](https://docs.djangoproject.com/) 1.7+
* [Yandex ClickHouse](https://clickhouse.yandex/)
* [infi.clickhouse-orm](https://github.com/Infinidat/infi.clickhouse_orm)
* [pytz](https://pypi.org/project/pytz/)
* [six](https://pypi.org/project/six/)
* [typing](https://pypi.org/project/typing/)
* [psycopg2](https://www.psycopg.org/)
* [celery](http://www.celeryproject.org/)
* [statsd](https://pypi.org/project/statsd/)
### Optional libraries
* [redis-py](https://redis-py.readthedocs.io/en/latest/) for [RedisStorage](storages.md#redisstorage)
* [django-pg-returning](https://github.com/M1hacka/django-pg-returning)
for optimizing registering updates in [PostgreSQL](https://www.postgresql.org/)
* [django-pg-bulk-update](https://github.com/M1hacka/django-pg-bulk-update)
for performing effective bulk update and create operations in [PostgreSQL](https://www.postgresql.org/)
## Installation
Install via pip:
`pip install django-clickhouse` ([not released yet](https://github.com/carrotquest/django-clickhouse/issues/3))
or via setup.py:

View File

@ -3,19 +3,18 @@
The library is configured in settings.py. All parameters start with the `CLICKHOUSE_` prefix.
The prefix can be changed using the `CLICKHOUSE_SETTINGS_PREFIX` parameter.
### CLICKHOUSE_SETTINGS_PREFIX
Defaults to: `'CLICKHOUSE_'`
You can change the `CLICKHOUSE_` prefix in settings using this parameter to anything you like.
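For example, a minimal sketch of changing the prefix (assuming, as the text above implies, that the library then reads its other parameters under the new prefix):
```python
# settings.py - hedged sketch: with the prefix changed, the library's other
# parameters are expected to be read under the new prefix (assumption).
CLICKHOUSE_SETTINGS_PREFIX = 'CH_'

CH_SYNC_BATCH_SIZE = 5000  # would be CLICKHOUSE_SYNC_BATCH_SIZE with the default prefix
```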
### CLICKHOUSE_DATABASES
Defaults to: `{}`
A dictionary defining databases in a django-like style.
<!--- TODO Add link --->
Key is an alias to communicate with this database in [connections]() and [using]().
Value is a configuration dict with parameters:
* [infi.clickhouse_orm database parameters](https://github.com/Infinidat/infi.clickhouse_orm/blob/develop/docs/class_reference.md#database)
<!--- TODO Add link --->
* `migrate: bool` - indicates if this database should be migrated. See [migrations](migrations.md).
Example:
```python
@ -24,22 +23,28 @@ CLICKHOUSE_DATABASES = {
'db_name': 'test',
'username': 'default',
'password': ''
},
'reader': {
'db_name': 'read_only',
'username': 'reader',
'readonly': True,
'password': ''
}
}
```
### CLICKHOUSE_DEFAULT_DB_ALIAS
Defaults to: `'default'`
<!--- TODO Add link --->
A database alias to use in [QuerySets]() if direct [using]() is not specified.
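A short hedged illustration (`ClickHouseUser` is the example model defined in the models and overview docs; `'reader'` is the read-only connection from the example above):
```python
from my_app.clickhouse_models import ClickHouseUser

# No explicit database: the query goes to CLICKHOUSE_DEFAULT_DB_ALIAS ('default')
ClickHouseUser.objects.filter(id__in=[1, 2, 3]).count()

# An explicit alias overrides the default
ClickHouseUser.objects.filter(id__in=[1, 2, 3]).using('reader').count()
```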
### CLICKHOUSE_SYNC_STORAGE
Defaults to: `'django_clickhouse.storages.RedisStorage'`
An intermediate storage class to use. Can be a string or class. [More info about storages](storages.md).
### CLICKHOUSE_REDIS_CONFIG
Defaults to: `None`
Redis configuration for [RedisStorage](storages.md#redisstorage).
If given, should be a dictionary of parameters to pass to [redis-py](https://redis-py.readthedocs.io/en/latest/#redis.Redis).
Example:
@ -52,45 +57,42 @@ CLICKHOUSE_REDIS_CONFIG = {
}
```
### CLICKHOUSE_SYNC_BATCH_SIZE
Defaults to: `10000`
Maximum number of operations fetched by the sync process from the intermediate storage per sync round.
### CLICKHOUSE_SYNC_DELAY
Defaults to: `5`
A delay in seconds between the starts of two sync rounds.
### CLICKHOUSE_MODELS_MODULE
Defaults to: `'clickhouse_models'`
Module name inside [django app](https://docs.djangoproject.com/en/3.0/intro/tutorial01/),
where [ClickHouseModel](models.md#clickhousemodel) classes are searched for during migrations.
### CLICKHOUSE_DATABASE_ROUTER
Defaults to: `'django_clickhouse.routers.DefaultRouter'`
A dotted path to a class representing the [database router](routing.md#router).
### CLICKHOUSE_MIGRATIONS_PACKAGE
Defaults to: `'clickhouse_migrations'`
A python package name inside [django app](https://docs.djangoproject.com/en/3.0/intro/tutorial01/),
where migration files are searched.
### CLICKHOUSE_MIGRATION_HISTORY_MODEL
Defaults to: `'django_clickhouse.migrations.MigrationHistory'`
A dotted name of a ClickHouseModel subclass (including module path),
representing [MigrationHistory model](migrations.md#migrationhistory-clickhousemodel).
### CLICKHOUSE_MIGRATE_WITH_DEFAULT_DB
Defaults to: `True`
A boolean flag enabling automatic ClickHouse migration,
when you call [`migrate`](https://docs.djangoproject.com/en/2.2/ref/django-admin/#django-admin-migrate) on the `default` database.
### CLICKHOUSE_STATSD_PREFIX
Defaults to: `clickhouse`
A prefix in [statsd](https://pythonhosted.org/python-statsd/) added to each library metric. See [monitoring](monitoring.md).
### CLICKHOUSE_CELERY_QUEUE
Defaults to: `'celery'`
The name of the queue used by celery to schedule the library's sync tasks.
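A consolidated sketch of the parameters described above, using their documented defaults (setting them explicitly is optional; the celery queue is given a dedicated name, as the overview recommends):
```python
# settings.py - values shown are the documented defaults unless noted otherwise
CLICKHOUSE_SYNC_STORAGE = 'django_clickhouse.storages.RedisStorage'
CLICKHOUSE_SYNC_BATCH_SIZE = 10000
CLICKHOUSE_SYNC_DELAY = 5
CLICKHOUSE_MODELS_MODULE = 'clickhouse_models'
CLICKHOUSE_DATABASE_ROUTER = 'django_clickhouse.routers.DefaultRouter'
CLICKHOUSE_MIGRATIONS_PACKAGE = 'clickhouse_migrations'
CLICKHOUSE_MIGRATION_HISTORY_MODEL = 'django_clickhouse.migrations.MigrationHistory'
CLICKHOUSE_MIGRATE_WITH_DEFAULT_DB = True
CLICKHOUSE_STATSD_PREFIX = 'clickhouse'
CLICKHOUSE_CELERY_QUEUE = 'clickhouse'  # default is 'celery'; a dedicated queue is recommended
```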

View File

@ -5,7 +5,9 @@
* [Features](basic_information.md#features)
* [Requirements](basic_information.md#requirements)
* [Installation](basic_information.md#installation)
* [Design motivation](motivation.md)
* Usage
* [Overview](overview.md)
* [Models](models.md)
* [DjangoModel](models.md#djangomodel)
* [ClickHouseModel](models.md#clickhousemodel)
@ -14,4 +16,6 @@
* [Migrations](migrations.md)
* [Synchronization](synchronization.md)
* [Storages](storages.md)
* [RedisStorage](storages.md#redisstorage)
* [Monitoring](monitoring.md)
* [Performance notes](performance.md)

View File

@ -5,7 +5,7 @@ but makes it a little bit more django-like.
## File structure
Each django app can have optional `clickhouse_migrations` package.
This is a default package name, it can be changed with [CLICKHOUSE_MIGRATIONS_PACKAGE](configuration.md#clickhouse_migrations_package) setting.
The package contains .py files whose names start with a 4-digit number.
This number defines the order in which migrations are applied.
@ -17,24 +17,27 @@ my_app
>>>> __init__.py
>>>> 0001_initial.py
>>>> 0002_add_new_field_to_my_model.py
>> clickhouse_models.py
>> urls.py
>> views.py
```
## Migration files
Each file must contain a `Migration` class, inherited from `django_clickhouse.migrations.Migration`.
The class should define an `operations` attribute - a list of operations to apply one by one.
Operation is one of [operations, supported by infi.clickhouse-orm](https://github.com/Infinidat/infi.clickhouse_orm/blob/develop/docs/schema_migrations.md).
```python
from django_clickhouse import migrations
from my_app.clickhouse_models import ClickHouseUser
class Migration(migrations.Migration):
operations = [
migrations.CreateTable(ClickHouseUser)
]
```
## MigrationHistory ClickHouseModel
This model stores information about applied migrations.
By default, the library uses the `django_clickhouse.migrations.MigrationHistory` model,
but this can be changed using the `CLICKHOUSE_MIGRATION_HISTORY_MODEL` setting.
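For example, a hedged sketch of overriding it (the dotted path points to a hypothetical subclass you would define yourself):
```python
# settings.py
CLICKHOUSE_MIGRATION_HISTORY_MODEL = 'my_app.clickhouse_models.MyMigrationHistory'
```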
@ -45,27 +48,30 @@ MigrationHistory model is stored in default database.
## Automatic migrations
When the library is installed, it tries to apply migrations every time
you call [django migrate](https://docs.djangoproject.com/en/3.0/ref/django-admin/#django-admin-migrate). If you want to disable this, use the [CLICKHOUSE_MIGRATE_WITH_DEFAULT_DB](configuration.md#clickhouse_migrate_with_default_db) setting.
By default, migrations are applied to all [CLICKHOUSE_DATABASES](configuration.md#clickhouse_databases) which have neither of the flags:
* `'migrate': False`
* `'readonly': True`
Note: migrations are only applied when the django `default` database is migrated.
So if you call `python manage.py migrate --database=secondary`, they won't be applied.
## Migration algorithm
- Get a list of databases from the `CLICKHOUSE_DATABASES` setting. Migrate them one by one.
- Find all django apps from the `INSTALLED_APPS` setting which have no `readonly=True` attribute and have the `migrate=True` attribute. Migrate them one by one.
* Iterate over `INSTALLED_APPS`, searching for a [clickhouse_migrations package](#file-structure)
* If the package is not found, skip the app.
* Get a list of applied migrations from the [MigrationHistory model](#migrationhistory-clickhousemodel)
* Get a list of unapplied migrations
* Get the [Migration class](#migration-files) from each migration and call its `apply()` method
* `apply()` iterates over operations, checking whether each one should be applied with the [router](routing.md)
* If migration should be applied, it is applied
* Mark migration as applied in [MigrationHistory model](#migrationhistory-clickhousemodel)
## Security notes
1) ClickHouse has no transaction system, unlike the relational databases django usually works with.
As a result, if a migration fails, it may be partially applied and there is no correct way to roll it back.
I recommend making migrations as small as possible, so it is easier to determine and correct the result if something goes wrong.
2) Unlike django, this library is unable to unapply migrations.
This functionality may be implemented in the future.

View File

@ -1,20 +1,20 @@
# Models
A model is a pythonic class representing a database table in your code.
It also defines an interface (methods) to perform operations on this table
and describes its configuration inside framework.
This library operates with 2 kinds of models:
* DjangoModel, describing tables in source relational database (PostgreSQL, MySQL, etc.)
* ClickHouseModel, describing models in [ClickHouse](https://clickhouse.yandex/docs/en) database
In order to distinguish them, I will refer to them as ClickHouseModel and DjangoModel in the rest of the documentation.
## DjangoModel
Django provides a [model system](https://docs.djangoproject.com/en/3.0/topics/db/models/)
to interact with relational databases.
In order to perform [synchronization](synchronization.md) we need to "catch" all [DML operations](https://en.wikipedia.org/wiki/Data_manipulation_language)
on source django model and save information about them in [storage](storages.md).
To achieve this, the library introduces the abstract `django_clickhouse.models.ClickHouseSyncModel` class.
Each model inherited from `ClickHouseSyncModel` will automatically save the information needed for syncing to the storage.
Read [synchronization](synchronization.md) section for more info.
@ -25,7 +25,7 @@ Read [synchronization](synchronization.md) section for more info.
* All queries of [django-pg-returning](https://pypi.org/project/django-pg-returning/) library
* All queries of [django-pg-bulk-update](https://pypi.org/project/django-pg-bulk-update/) library
You can also combine your custom django manager and queryset using mixins from `django_clickhouse.models` package:
**Important note**: Operations are saved in [transaction.on_commit()](https://docs.djangoproject.com/en/2.2/topics/db/transactions/#django.db.transaction.on_commit).
The goal is to avoid syncing operations that were not committed to the relational database (illustrated right after the example below).
@ -44,9 +44,12 @@ class User(ClickHouseSyncModel):
birthday = models.DateField()
# All operations will be registered to sync with ClickHouse models:
User.objects.create(first_name='Alice', age=16, birthday=date(2003, 6, 1))
User(first_name='Bob', age=17, birthday=date(2002, 1, 1)).save()
User.objects.update(first_name='Candy')
# Custom manager
```
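The sketch below illustrates the important note above; it shows the expected behaviour, not library internals (`User` is the `ClickHouseSyncModel` subclass from the example).
```python
from datetime import date

from django.db import transaction

from my_app.models import User  # the ClickHouseSyncModel subclass from the example above

with transaction.atomic():
    User.objects.create(first_name='Dave', age=30, birthday=date(1990, 1, 1))
    # Nothing is registered for sync yet: on_commit() callbacks run only after COMMIT

# The transaction has committed here, so the operation is now stored for syncing.
# If the block above had raised and rolled back, nothing would be queued for ClickHouse.
```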
## ClickHouseModel
@ -56,10 +59,10 @@ This kind of model is based on [infi.clickhouse_orm Model](https://github.com/In
You should define `ClickHouseModel` subclass for each table you want to access and sync in ClickHouse.
Each model should be inherited from `django_clickhouse.clickhouse_models.ClickHouseModel`.
By default, models are searched in `clickhouse_models` module of each django app.
You can change the module name using the [CLICKHOUSE_MODELS_MODULE](configuration.md#clickhouse_models_module) setting.
You can read more about creating models and fields [here](https://github.com/Infinidat/infi.clickhouse_orm/blob/develop/docs/models_and_databases.md#defining-models):
all capabilities are supported. At the same time, the django-clickhouse library adds:
* [routing attributes and methods](routing.md)
* [sync attributes and methods](synchronization.md)
@ -68,6 +71,8 @@ Example:
from django_clickhouse.clickhouse_models import ClickHouseModel
from django_clickhouse.engines import MergeTree
from infi.clickhouse_orm import fields
from my_app.models import User
class HeightData(ClickHouseModel):
django_model = User
@ -84,7 +89,7 @@ class AgeData(ClickHouseModel):
first_name = fields.StringField()
birthday = fields.DateField()
age = fields.UInt32Field()
engine = MergeTree('birthday', ('first_name', 'last_name', 'birthday'))
```
@ -97,6 +102,7 @@ You can read more in [sync](synchronization.md) section.
Example:
```python
from django_clickhouse.clickhouse_models import ClickHouseMultiModel
from my_app.models import User
class MyMultiModel(ClickHouseMultiModel):
django_model = User
@ -104,7 +110,13 @@ class MyMultiModel(ClickHouseMultiModel):
```
## Engines
Engine is a way of storing, indexing, replicating and sorting data in ClickHouse ([docs](https://clickhouse.yandex/docs/en/operations/table_engines/)).
The engine system is based on the [infi.clickhouse_orm engine system](https://github.com/Infinidat/infi.clickhouse_orm/blob/develop/docs/table_engines.md#table-engines).
This library extends the original engine classes, as each engine can have its own synchronization mechanics (see the sketch after the list below).
Engines are defined in `django_clickhouse.engines` module.
Currently supported engines (with all infi functionality, [more info](https://github.com/Infinidat/infi.clickhouse_orm/blob/develop/docs/table_engines.md#data-replication)):
* `MergeTree`
* `ReplacingMergeTree`
* `SummingMergeTree`
* `CollapsingMergeTree`
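As a sketch of one engine from the list above, the model below uses `CollapsingMergeTree`. The constructor arguments follow infi.clickhouse_orm (date column, ordering key, sign column); the `sign` field name and the need to declare it explicitly are assumptions, so check the library before copying.
```python
from django_clickhouse.clickhouse_models import ClickHouseModel
from django_clickhouse.engines import CollapsingMergeTree
from infi.clickhouse_orm import fields
from my_app.models import User


class ClickHouseUserState(ClickHouseModel):
    django_model = User

    id = fields.UInt32Field()
    first_name = fields.StringField()
    birthday = fields.DateField()
    sign = fields.Int8Field()  # assumed sign column for CollapsingMergeTree

    engine = CollapsingMergeTree('birthday', ('id',), 'sign')
```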

56
docs/monitoring.md Normal file
View File

@ -0,0 +1,56 @@
# Monitoring
In order to monitor the [synchronization](synchronization.md) process, [statsd](https://pypi.org/project/statsd/) is used.
Data from statsd can then be used by a [Prometheus exporter](https://github.com/prometheus/statsd_exporter)
or [Graphite](https://graphite.readthedocs.io/en/latest/).
## Configuration
The library expects statsd to be configured as described in the [statsd docs for django](https://statsd.readthedocs.io/en/latest/configure.html#in-django).
You can set a common prefix for all of this library's keys using the [CLICKHOUSE_STATSD_PREFIX](configuration.md#clickhouse_statsd_prefix) parameter.
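A hedged sketch of that configuration (host and port values are examples; the `STATSD_*` names come from the statsd package's django integration):
```python
# settings.py
STATSD_HOST = '127.0.0.1'
STATSD_PORT = 8125
STATSD_PREFIX = None

# Prefix added by django-clickhouse to its own metric keys
CLICKHOUSE_STATSD_PREFIX = 'myproject.clickhouse'
```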
## Exported metrics
## Gauges
* `<prefix>.sync.<model_name>.queue`
Number of elements in [intermediate storage](storages.md) queue waiting for import.
<!--- TODO Add link --->
The queue should not be big. Its size depends on the configured [sync_delay]() and the time needed to sync a single batch.
It is a good parameter to watch and alert on.
## Timers
All time is sent in milliseconds.
* `<prefix>.sync.<model_name>.total`
Total time of single batch task execution.
* `<prefix>.sync.<model_name>.steps.<step_name>`
`<step_name>` is one of `pre_sync`, `get_operations`, `get_sync_objects`, `get_insert_batch`, `get_final_versions`,
`insert`, `post_sync`. Read [here](synchronization.md) for more details.
Time of each sync step. Can be useful for debugging the reasons for a long sync process.
* `<prefix>.inserted_tuples.<model_name>`
Time of inserting batch of data into ClickHouse.
It excludes as much python code as possible to distinguish the real INSERT time from python data preparation.
* `<prefix>.sync.<model_name>.register_operations`
Time of inserting sync operations into storage.
## Counters
* `<prefix>.sync.<model_name>.register_operations.<op_name>`
`<op_name>` is one of `create`, `update`, `delete`.
Number of DML operations added to the sync queue by DjangoModel method calls.
* `<prefix>.sync.<model_name>.operations`
Number of operations, fetched from [storage](storages.md) for sync in one batch.
* `<prefix>.sync.<model_name>.import_objects`
Number of objects, fetched from relational storage (based on operations) in order to sync with ClickHouse models.
* `<prefix>.inserted_tuples.<model_name>`
Number of rows inserted to ClickHouse.
* `<prefix>.sync.<model_name>.lock.timeout`
Number of locks in [RedisStorage](storages.md#redisstorage), not acquired and skipped by timeout.
This value should be zero. If not, it means your model sync takes longer than the sync task call interval.
* `<prefix>.sync.<model_name>.lock.hard_release`
Number of locks in [RedisStorage](storages.md#redisstorage) released forcibly (because the process which acquired the lock has died).
This value should be zero. If not, it means your sync tasks are being killed during the sync process (by the OOM killer, for instance).

35
docs/motivation.md Normal file
View File

@ -0,0 +1,35 @@
# Design motivation
## Separate from django database setting, QuerySet and migration system
The ClickHouse SQL and DML language is close to the standard, but does not follow it exactly ([docs](https://clickhouse.tech/docs/en/introduction/distinctive_features/#sql-support)).
As a result, it cannot be easily integrated into the django query subsystem, which expects databases to support:
1. Transactions.
2. INNER/OUTER JOINS by condition.
3. Full featured updates and deletes.
4. Per database replication (ClickHouse has per table replication)
5. Other features, not supported in ClickHouse.
In order to have more functionality, [infi.clickhouse-orm](https://github.com/Infinidat/infi.clickhouse_orm)
is used as the base library for databases, querysets and migrations. Most of it is compatible and can be used without any changes.
## Sync over intermediate storage
This library has several goals which lead to intermediate storage:
1. Failure-resistant import, no matter what the failure reason is:
a ClickHouse failure, a network failure, or the import process being killed by the system (OOM, for instance).
2. ClickHouse does not like single-row inserts: [docs](https://clickhouse.tech/docs/en/introduction/performance/#performance-when-inserting-data).
So it's worth batching data somewhere before inserting it.
ClickHouse provides the Buffer engine for this, but it can lose data if ClickHouse fails - and no one will know about it.
3. Better scalability. Different intermediate storages may be implemented in the future, based on databases, queue systems or even BufferEngine.
## Replication and routing
In simple cases people just have a single database or a cluster with the same tables on each replica.
But as ClickHouse has per-table replication, a more complicated structure can be built:
1. Model A is stored on servers 1 and 2
2. Model B is stored on servers 2, 3 and 5
3. Model C is stored on servers 1, 3 and 4
Moreover, migration operations in ClickHouse can also be auto-replicated (`ALTER TABLE`, for instance) or not (`CREATE TABLE`).
In order to make replication scheme scalable:
1. Each model has its own read / write / migrate [routing configuration](routing.md#clickhousemodel-routing-attributes).
2. You can use a [router](routing.md#router), like django does, to set basic routing rules for all models or model groups.

141
docs/overview.md Normal file
View File

@ -0,0 +1,141 @@
# Usage overview
## Requirements
To begin with, I expect that you already have:
1. [ClickHouse](https://clickhouse.tech/docs/en/) (with [ZooKeeper](https://zookeeper.apache.org/), if you use replication)
2. Relational database used with [Django](https://www.djangoproject.com/). For instance, [PostgreSQL](https://www.postgresql.org/)
3. [Django database set up](https://docs.djangoproject.com/en/3.0/ref/databases/)
4. [Intermediate storage](storages.md) set up. For instance, [Redis](https://redis.io/).
## Configuration
Add required parameters to [Django settings.py](https://docs.djangoproject.com/en/3.0/topics/settings/):
1. [CLICKHOUSE_DATABASES](configuration.md#clickhouse_databases)
2. [Intermediate storage](storages.md) configuration. For instance, [RedisStorage](storages.md#redisstorage)
3. It's recommended to change [CLICKHOUSE_CELERY_QUEUE](configuration.md#clickhouse_celery_queue)
4. Add sync task to [celerybeat schedule](http://docs.celeryproject.org/en/v2.3.3/userguide/periodic-tasks.html).
Note that executing the planner every 2 seconds doesn't mean the sync is executed every 2 seconds.
Sync time depends on the model's `sync_delay` attribute value and the [CLICKHOUSE_SYNC_DELAY](configuration.md#clickhouse_sync_delay) configuration parameter.
You can read more in [sync section](synchronization.md).
You can also change other [configuration parameters](configuration.md) depending on your project.
#### Example
```python
# django-clickhouse library setup
CLICKHOUSE_DATABASES = {
# Connection name to refer in using(...) method
'default': {
'db_name': 'test',
'username': 'default',
'password': ''
}
}
CLICKHOUSE_REDIS_CONFIG = {
'host': '127.0.0.1',
'port': 6379,
'db': 8,
'socket_timeout': 10
}
CLICKHOUSE_CELERY_QUEUE = 'clickhouse'
# If you don't have any celerybeat tasks yet, define a new dictionary
# More info: http://docs.celeryproject.org/en/v2.3.3/userguide/periodic-tasks.html
from datetime import timedelta
CELERYBEAT_SCHEDULE = {
'clickhouse_auto_sync': {
'task': 'django_clickhouse.tasks.clickhouse_auto_sync',
'schedule': timedelta(seconds=2), # Every 2 seconds
'options': {'expires': 1, 'queue': CLICKHOUSE_CELERY_QUEUE}
}
}
```
## Adopting django model
Read [ClickHouseSyncModel](models.md#djangomodel) section.
Inherit all [django models](https://docs.djangoproject.com/en/3.0/topics/db/models/)
you want to sync with ClickHouse from `django_clickhouse.models.ClickHouseSyncModel` or sync mixins.
```python
from django_clickhouse.models import ClickHouseSyncModel
from django.db import models
class User(ClickHouseSyncModel):
first_name = models.CharField(max_length=50)
visits = models.IntegerField(default=0)
birthday = models.DateField()
```
## Create ClickHouseModel
1. Read [ClickHouseModel section](models.md#clickhousemodel)
2. Create `clickhouse_models.py` in your django app.
3. Add `ClickHouseModel` class there:
```python
from django_clickhouse.clickhouse_models import ClickHouseModel
from django_clickhouse.engines import MergeTree
from infi.clickhouse_orm import fields
from my_app.models import User
class ClickHouseUser(ClickHouseModel):
django_model = User
sync_delay = 5
id = fields.UInt32Field()
first_name = fields.StringField()
birthday = fields.DateField()
visits = fields.UInt32Field(default=0)
engine = MergeTree('birthday', ('birthday',))
```
## Migration to create table in ClickHouse
1. Read [migrations](migrations.md) section
2. Create `clickhouse_migrations` package in your django app
3. Create `0001_initial.py` file inside the created package. Result structure should be:
```
my_app
>> clickhouse_migrations
>>>> __init__.py
>>>> 0001_initial.py
>> clickhouse_models.py
>> models.py
```
4. Add content to file `0001_initial.py`:
```python
from django_clickhouse import migrations
from my_app.clickhouse_models import ClickHouseUser
class Migration(migrations.Migration):
operations = [
migrations.CreateTable(ClickHouseUser)
]
```
## Run migrations
Call [django migrate](https://docs.djangoproject.com/en/3.0/ref/django-admin/#django-admin-migrate)
to apply created migration and create table in ClickHouse.
## Set up and run celery sync process
Set up [celery worker](https://docs.celeryproject.org/en/latest/userguide/workers.html#starting-the-worker) for [CLICKHOUSE_CELERY_QUEUE](configuration.md#clickhouse_celery_queue) and [celerybeat](https://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html#starting-the-scheduler).
## Test sync and write analytics queries
1. Read [monitoring section](monitoring.md) in order to set up your monitoring system.
2. Read [query section](queries.md) to understand how to query database.
3. Create some data in the source table with django.
4. Check if it is synced.
#### Example
```python
import datetime
import time
from my_app.models import User
from my_app.clickhouse_models import ClickHouseUser
u = User.objects.create(first_name='Alice', birthday=datetime.date(1987, 1, 1), visits=1)
# Wait for the celery task to be executed at least once
time.sleep(6)
assert ClickHouseUser.objects.filter(id=u.id).count() == 1, "Sync is not working"
```
## Congratulations
Tune your integration to achieve better performance if needed: [docs](performance.md).

3
docs/performance.md Normal file
View File

@ -0,0 +1,3 @@
# Sync performance
TODO

View File

@ -1,4 +1,13 @@
# Making queries
## Motivation
ClickHouse SQL language is close to the standard, but does not follow it exactly ([docs](https://clickhouse.tech/docs/en/introduction/distinctive_features/#sql-support)).
It cannot be easily integrated into the django query subsystem, as it expects databases to support standard SQL features like transactions and INNER/OUTER JOINs by condition.
In order to fit it, the library's query system extends [infi.clickhouse-orm querysets](https://github.com/Infinidat/infi.clickhouse_orm/blob/develop/docs/querysets.md).
TODO
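While this section is a TODO, here is a minimal hedged sketch based on the queryset examples in the routing docs (`ClickHouseUser` is the example model from the models section, `'reader'` is the read-only connection from the configuration example):
```python
from my_app.clickhouse_models import ClickHouseUser

# Count matching rows, reading from the 'reader' connection
ClickHouseUser.objects.filter(id__in=[1, 2, 3]).using('reader').count()
```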

View File

@ -15,9 +15,9 @@ Unlike traditional relational databases, [ClickHouse](https://clickhouse.yandex/
3) To make the system more extensible we need default routing, per-model routing and a router class for complex cases.
## Introduction
All database connections are defined in [CLICKHOUSE_DATABASES](configuration.md#clickhouse_databases) setting.
Each connection has an alias name to refer to it with.
If no routing is configured, [CLICKHOUSE_DEFAULT_DB_ALIAS](configuration.md#clickhouse_default_db_alias) is used.
## Router
Router is a class, defining 3 methods:
@ -29,7 +29,7 @@ Router is a class, defining 3 methods:
Checks if migration `operation` should be applied in django application `app_label` on database `db_alias`.
The optional `model` parameter can be used to determine migrations for a concrete model.
By default [CLICKHOUSE_DATABASE_ROUTER](configuration.md#clickhouse_database_router) is used.
It gets routing information from the model attributes described below.
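A hedged sketch of a custom router with the three methods described above. The method names follow django's router convention; the exact signatures (and the parameter order of `allow_migrate`) are assumptions rather than the library's documented API.
```python
class ReadReplicaRouter:
    def db_for_read(self, model, **hints):
        # Send reads to the read-only connection from CLICKHOUSE_DATABASES
        return 'reader'

    def db_for_write(self, model, **hints):
        return 'default'

    def allow_migrate(self, db_alias, app_label, operation, model=None, **hints):
        # Apply ClickHouse migrations only on the 'default' connection
        return db_alias == 'default'
```
It would then be enabled with `CLICKHOUSE_DATABASE_ROUTER = 'my_app.routers.ReadReplicaRouter'` in settings.py (the module path is illustrative).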
## ClickHouseModel routing attributes
@ -54,7 +54,8 @@ class MyModel(ClickHouseModel):
```
## Setting database in QuerySet
<!--- TODO Add link --->
Database can be set in each [QuerySet]() explicitly by using one of methods:
* With [infi approach](https://github.com/Infinidat/infi.clickhouse_orm/blob/develop/docs/querysets.md#querysets): `MyModel.objects_in(db_object).filter(id__in=[1,2,3]).count()`
* With `using()` method: `MyModel.objects.filter(id__in=[1,2,3]).using(db_alias).count()`

View File

@ -49,18 +49,18 @@ Each method of abstract `Storage` class takes `kwargs` parameters, which can be
* `post_sync_failed(import_key: str, exception: Exception, **kwargs) -> None:`
Called if any exception has occurred during the import process. It cleans the storage after an unsuccessful import.
Note that if import process is hardly killed (with OOM killer, for instance) this method is not called.
* `flush() -> None`
*Dangerous*. Drops all data kept by the storage. It is used for cleaning up between tests.
## Predefined storages
### RedisStorage
This storage uses a [Redis database](https://redis.io/) as the intermediate storage.
To communicate with Redis it uses the [redis-py](https://redis-py.readthedocs.io/en/latest/) library.
It is not a required dependency of this library, but it must be installed in order to use RedisStorage.
In order to use RedisStorage you must also fill the [CLICKHOUSE_REDIS_CONFIG](configuration.md#clickhouse_redis_config) parameter.
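A minimal hedged sketch of the settings involved (connection values are examples, mirroring the configuration docs):
```python
# settings.py
CLICKHOUSE_SYNC_STORAGE = 'django_clickhouse.storages.RedisStorage'
CLICKHOUSE_REDIS_CONFIG = {
    'host': '127.0.0.1',
    'port': 6379,
    'db': 8,
    'socket_timeout': 10,
}
```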
Stored operation contains:
* Django database alias where original record can be found.

View File

@ -1 +1,3 @@
# Synchronization
TODO

View File

@ -188,7 +188,7 @@ class ClickHouseSyncModel(DjangoModel):
@receiver(post_save)
def post_save(sender, instance, **kwargs):
statsd.incr('%s.sync.post_save' % config.STATSD_PREFIX, 1)
if issubclass(sender, ClickHouseSyncModel):
instance.post_save(kwargs.get('created', False), using=kwargs.get('using'))