django-clickhouse/docs/synchronization.md
2020-02-07 13:05:19 +05:00

Synchronization

Design motivation

Read here.

Algorithm

  1. Celery beat schedules the django_clickhouse.tasks.clickhouse_auto_sync task approximately every second.
  2. Celery workers execute clickhouse_auto_sync. It searches for ClickHouseModel subclasses which need sync (those whose Model.need_sync() method returns True).
  3. django_clickhouse.tasks.sync_clickhouse_model task is scheduled for each ClickHouseModel which needs sync.
  4. sync_clickhouse_model saves the sync start time in storage and calls the ClickHouseModel.sync_batch_from_storage() method.
  5. ClickHouseModel.sync_batch_from_storage():
    • Gets the storage the model works with, using the ClickHouseModel.get_storage() method.
    • Calls Storage.pre_sync(import_key) for the model's storage. This may be used to prevent parallel execution with locks, or to perform some other operations before the sync.
    • Gets a list of operations to sync from storage.
    • Fetches objects from relational database calling ClickHouseModel.get_sync_objects(operations) method.
    • Forms a batch of tuples to insert into ClickHouse using ClickHouseModel.get_insert_batch(import_objects) method.
    • Inserts batch of tuples into ClickHouse using ClickHouseModel.insert_batch(batch) method.
    • Calls Storage.post_sync(import_key) method to clean up storage after syncing batch. This method also removes synced operations from storage.
    • If an exception occurs during execution, the Storage.post_sync_failed(import_key) method is called. Note that the process can be killed without raising an exception, for instance by the OOM killer. In that case this method will not be called.
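
The order of calls in one sync round can be sketched as follows. This is a minimal, self-contained illustration and not the library's code: InMemoryStorage, sync_batch, the operation format and the fetch/insert callables are hypothetical stand-ins for the storage, sync_batch_from_storage(), get_sync_objects() and insert_batch(). Only the sequence of steps is taken from the list above.

```python
from typing import Callable, Dict, List, Tuple

Operation = Tuple[str, int]  # (operation name, primary key); hypothetical format


class InMemoryStorage:
    """Hypothetical stand-in for the intermediate storage."""

    def __init__(self) -> None:
        self.operations: Dict[str, List[Operation]] = {}
        self.in_flight: Dict[str, int] = {}
        self.locked: Dict[str, bool] = {}

    def register_operation(self, import_key: str, name: str, pk: int) -> None:
        self.operations.setdefault(import_key, []).append((name, pk))

    def pre_sync(self, import_key: str) -> None:
        # Refuse to start a second sync round for the same key in parallel.
        if self.locked.get(import_key):
            raise RuntimeError("sync is already running for " + import_key)
        self.locked[import_key] = True

    def get_operations(self, import_key: str, count: int) -> List[Operation]:
        batch = self.operations.get(import_key, [])[:count]
        self.in_flight[import_key] = len(batch)
        return batch

    def post_sync(self, import_key: str) -> None:
        # Success: drop the synced operations and release the lock.
        done = self.in_flight.pop(import_key, 0)
        self.operations[import_key] = self.operations.get(import_key, [])[done:]
        self.locked[import_key] = False

    def post_sync_failed(self, import_key: str) -> None:
        # Failure: keep the operations so the import is retried later.
        self.in_flight.pop(import_key, None)
        self.locked[import_key] = False


def sync_batch(
    storage: InMemoryStorage,
    import_key: str,
    batch_size: int,
    fetch: Callable[[List[Operation]], List[dict]],
    insert: Callable[[List[tuple]], None],
) -> int:
    storage.pre_sync(import_key)
    try:
        operations = storage.get_operations(import_key, batch_size)
        objects = fetch(operations)                       # relational DB lookup
        batch = [tuple(obj.values()) for obj in objects]  # rows for ClickHouse
        insert(batch)
        storage.post_sync(import_key)
    except Exception:
        storage.post_sync_failed(import_key)
        raise
    return len(operations)


storage = InMemoryStorage()
for pk in (1, 2, 3):
    storage.register_operation("users", "insert", pk)

inserted: List[tuple] = []
synced = sync_batch(
    storage, "users", batch_size=2,
    fetch=lambda ops: [{"id": pk} for _, pk in ops],
    insert=inserted.extend,
)
```

In this sketch a failed insert leaves all operations in the storage, so the next round retries them. If the process is killed between pre_sync and the two post-sync hooks, the sketch's lock stays set, which is why a real lock would need some form of expiry.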

Configuration

Sync configuration can be set globally using Django settings.py parameters or redeclared for each ClickHouseModel class. ClickHouseModel configuration takes priority over settings configuration.
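
The priority rule can be illustrated with a small sketch. Everything here is an assumption made for illustration: ConfigurableModel and get_option are hypothetical names, the settings object is a stand-in for django.conf.settings, and the values are not the library's defaults.

```python
from types import SimpleNamespace

# Stand-in for django.conf.settings; values are illustrative, not library defaults.
settings = SimpleNamespace(CLICKHOUSE_SYNC_BATCH_SIZE=10000, CLICKHOUSE_SYNC_DELAY=5)


class ConfigurableModel:
    """Hypothetical base class: None means "not redeclared on the model"."""

    sync_batch_size = None
    sync_delay = None

    @classmethod
    def get_option(cls, attr: str, setting_name: str):
        value = getattr(cls, attr)
        # The model-level value wins; otherwise fall back to the global setting.
        return value if value is not None else getattr(settings, setting_name)


class ClickHouseUser(ConfigurableModel):
    sync_batch_size = 1000  # redeclared for this model only


batch_size = ClickHouseUser.get_option("sync_batch_size", "CLICKHOUSE_SYNC_BATCH_SIZE")
delay = ClickHouseUser.get_option("sync_delay", "CLICKHOUSE_SYNC_DELAY")
```

Here batch_size comes from the model class, while delay falls back to the global value because the model does not redeclare it.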

Settings configuration

ClickHouseModel configuration

Each ClickHouseModel subclass can define sync arguments and methods:

  • django_model: django.db.models.Model
    Required. Django model this ClickHouseModel class is synchronized with.

  • django_model_serializer: Type[Django2ClickHouseModelSerializer]
    Defaults to: django_clickhouse.serializers.Django2ClickHouseModelSerializer
    Serializer class to convert DjangoModel to ClickHouseModel.

  • sync_enabled: bool
    Defaults to: False
    Whether sync is enabled for this model.

  • sync_batch_size: int
    Defaults to: CLICKHOUSE_SYNC_BATCH_SIZE
    Maximum number of operations fetched from storage by the sync process per sync round.

  • sync_delay: float
    Defaults to: CLICKHOUSE_SYNC_DELAY
    Delay in seconds between the starts of two consecutive sync rounds.

  • sync_storage: Union[str, Storage]
    Defaults to: CLICKHOUSE_SYNC_STORAGE
    An intermediate storage class to use. Can be a string or class.

Example:

from django_clickhouse.clickhouse_models import ClickHouseModel
from django_clickhouse.engines import ReplacingMergeTree
from infi.clickhouse_orm import fields 
from my_app.models import User

class ClickHouseUser(ClickHouseModel):
    django_model = User
    sync_enabled = True
    sync_delay = 5
    sync_batch_size = 1000

    id = fields.UInt32Field()
    first_name = fields.StringField()
    birthday = fields.DateField()
    visits = fields.UInt32Field(default=0)

    engine = ReplacingMergeTree('birthday', ('birthday',))
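
How sync_enabled and sync_delay could gate a sync round (steps 1 to 3 of the algorithm) is sketched below. This is an assumption about the logic, not the library's implementation: the standalone need_sync function and its parameters are hypothetical, and the real method reads its inputs from the model class and the storage.

```python
import datetime
from typing import Optional


def need_sync(
    sync_enabled: bool,
    sync_delay: float,
    last_sync_start: Optional[datetime.datetime],
    now: datetime.datetime,
) -> bool:
    """Decide whether a new sync round should be scheduled (hypothetical logic)."""
    if not sync_enabled:
        return False
    if last_sync_start is None:
        # The model has never been synced.
        return True
    # sync_delay is measured between the starts of two sync rounds.
    return (now - last_sync_start).total_seconds() >= sync_delay


started = datetime.datetime(2020, 2, 7, 13, 0, 0)
too_early = need_sync(True, 5, started, started + datetime.timedelta(seconds=3))
due = need_sync(True, 5, started, started + datetime.timedelta(seconds=5))
```

With sync_delay = 5 as in the example above, a check 3 seconds after the previous round started is skipped, and a check at 5 seconds or later schedules the next round.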

Fail resistance

Fail resistance is based on several points:

  1. Storage should not lose data in any case. Keeping the storage stable is not this library's goal.
  2. Data is removed from storage only if the import succeeds. Otherwise the import attempt is repeated.
  3. It's recommended to use the ReplacingMergeTree or CollapsingMergeTree engines instead of plain MergeTree, so that duplicates are removed if a batch is imported twice.
  4. Each ClickHouseModel is synced in a separate process. If one model fails, it should not affect other models.
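
Points 2 and 3 together mean a batch may be delivered more than once, with the table engine expected to absorb the duplicates. The toy simulation below shows why a repeated import is harmless under replacing semantics. It is only an approximation: collapse_replacing and the row layout are hypothetical, and a real ReplacingMergeTree table deduplicates rows with the same sorting key during background merges, so duplicates can remain visible for some time.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, str, int]  # (id, first_name, version); hypothetical layout


def collapse_replacing(rows: Iterable[Row]) -> List[Row]:
    """Keep one row per id, preferring the highest version (later row wins ties)."""
    latest: Dict[int, Row] = {}
    for row in rows:
        current = latest.get(row[0])
        if current is None or row[2] >= current[2]:
            latest[row[0]] = row
    return sorted(latest.values())


batch: List[Row] = [(1, "Ann", 1), (2, "Bob", 1)]

imported_once = collapse_replacing(batch)
imported_twice = collapse_replacing(batch + batch)  # retry after a failed round
```

Both results contain the same two rows, whereas an append-only table would hold four rows after the retry.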