Synchronization
Design motivation
Read here.
Algorithm
- Celery beat schedules the django_clickhouse.tasks.clickhouse_auto_sync task every second or so.
- Celery workers execute clickhouse_auto_sync. It searches for ClickHouseModel subclasses that need sync (those whose Model.need_sync() method returns True).
- A django_clickhouse.tasks.sync_clickhouse_model task is scheduled for each ClickHouseModel that needs sync.
- sync_clickhouse_model saves the sync start time in storage and calls the ClickHouseModel.sync_batch_from_storage() method.
- ClickHouseModel.sync_batch_from_storage():
  - Gets the storage the model works with using the ClickHouseModel.get_storage() method.
  - Calls Storage.pre_sync(import_key) for the model storage. This may be used to prevent parallel execution with locks or to perform some other operations.
  - Gets a list of operations to sync from storage.
  - Fetches objects from the relational database by calling the ClickHouseModel.get_sync_objects(operations) method.
  - Forms a batch of tuples to insert into ClickHouse using the ClickHouseModel.get_insert_batch(import_objects) method.
  - Inserts the batch of tuples into ClickHouse using the ClickHouseModel.insert_batch(batch) method.
  - Calls the Storage.post_sync(import_key) method to clean up storage after syncing the batch. This method also removes synced operations from storage.
  - If an exception occurs during execution, the Storage.post_sync_failed(import_key) method is called. Note that the process can be killed without raising an exception, for instance by the OOM killer, in which case this method is not called.
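The batch sync steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the library's implementation: InMemoryStorage, sync_batch, fetch_users and the in-memory USERS table are hypothetical stand-ins, and only the hook names pre_sync, post_sync and post_sync_failed are taken from the list above.

```python
from typing import Callable, Dict, List, Tuple

Operation = Tuple[str, int]  # hypothetical shape: (operation name, primary key)
Row = Tuple[int, str]        # hypothetical ClickHouse row: (id, first_name)


class InMemoryStorage:
    """Hypothetical stand-in for an intermediate storage."""

    def __init__(self) -> None:
        self.operations: Dict[str, List[Operation]] = {}
        self.locked: Dict[str, bool] = {}
        self.failed: List[str] = []

    def register_operation(self, import_key: str, op: str, pk: int) -> None:
        self.operations.setdefault(import_key, []).append((op, pk))

    def pre_sync(self, import_key: str) -> None:
        # Refuse to run two syncs of the same model in parallel.
        if self.locked.get(import_key):
            raise RuntimeError('sync already running for %s' % import_key)
        self.locked[import_key] = True

    def get_operations(self, import_key: str, count: int) -> List[Operation]:
        return self.operations.get(import_key, [])[:count]

    def post_sync(self, import_key: str, synced: int) -> None:
        # Synced operations are removed only after a successful insert.
        self.operations[import_key] = self.operations.get(import_key, [])[synced:]
        self.locked[import_key] = False

    def post_sync_failed(self, import_key: str) -> None:
        # Operations stay in storage, so the next round retries them.
        self.failed.append(import_key)
        self.locked[import_key] = False


def sync_batch(storage: InMemoryStorage, import_key: str, batch_size: int,
               fetch_objects: Callable[[List[Operation]], List[dict]],
               insert_batch: Callable[[List[Row]], None]) -> int:
    storage.pre_sync(import_key)
    try:
        operations = storage.get_operations(import_key, batch_size)
        objects = fetch_objects(operations)           # relational database read
        batch = [(obj['id'], obj['first_name']) for obj in objects]
        insert_batch(batch)                           # ClickHouse insert
        storage.post_sync(import_key, len(operations))
        return len(batch)
    except Exception:
        storage.post_sync_failed(import_key)
        raise


# Demo with an in-memory "relational database" and "ClickHouse table".
USERS = {1: {'id': 1, 'first_name': 'Ann'}, 2: {'id': 2, 'first_name': 'Bob'}}


def fetch_users(operations: List[Operation]) -> List[dict]:
    return [USERS[pk] for _, pk in operations]


clickhouse_rows: List[Row] = []
storage = InMemoryStorage()
storage.register_operation('ClickHouseUser', 'insert', 1)
storage.register_operation('ClickHouseUser', 'update', 2)
synced = sync_batch(storage, 'ClickHouseUser', 1000, fetch_users, clickhouse_rows.extend)
```

The point of the sketch is the ordering: operations leave the storage only in post_sync, after the insert has succeeded, so a failed insert leaves them in place for the next round.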
Configuration
Sync configuration can be set globally using Django settings.py parameters or redeclared for each ClickHouseModel class.
ClickHouseModel configuration takes priority over settings configuration.
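The precedence rule can be illustrated with a small lookup helper. This is a minimal sketch, assuming that a missing (or None) class attribute means "not overridden"; GLOBAL_SETTINGS, get_sync_param and the two Fake classes are hypothetical names used only for illustration, not the library's API.

```python
# Hypothetical stand-in for values read from Django settings.py.
GLOBAL_SETTINGS = {
    'CLICKHOUSE_SYNC_BATCH_SIZE': 10000,
    'CLICKHOUSE_SYNC_DELAY': 5,
}


def get_sync_param(model_cls, name):
    """Return the per-model value if declared, else the global setting."""
    value = getattr(model_cls, name, None)
    if value is not None:
        return value
    return GLOBAL_SETTINGS['CLICKHOUSE_' + name.upper()]


class FakeEventModel:
    sync_batch_size = 1000  # redeclared on the model: wins over the global value


class FakeLogModel:
    pass  # nothing redeclared: global values apply


event_batch_size = get_sync_param(FakeEventModel, 'sync_batch_size')
event_delay = get_sync_param(FakeEventModel, 'sync_delay')
log_batch_size = get_sync_param(FakeLogModel, 'sync_batch_size')
```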
Settings configuration
- CLICKHOUSE_CELERY_QUEUE
  Defaults to: 'celery'
  The name of the queue Celery uses to schedule the library's sync tasks.
- CLICKHOUSE_SYNC_STORAGE
  Defaults to: 'django_clickhouse.storages.RedisStorage'
  The intermediate storage class to use. Can be a string or a class.
- CLICKHOUSE_SYNC_BATCH_SIZE
  Defaults to: 10000
  The maximum number of operations the sync process fetches from intermediate storage per sync round.
- CLICKHOUSE_SYNC_DELAY
  Defaults to: 5
  The delay in seconds between the starts of two sync rounds.
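As an illustration, a settings.py fragment that spells out the documented defaults, together with a Celery beat entry for the clickhouse_auto_sync task from the Algorithm section, might look like the sketch below. The beat entry is an assumption: the setting name (CELERYBEAT_SCHEDULE here) depends on the Celery version and on how the Celery app reads Django settings, and the one-second interval simply mirrors "every second or so".

```python
from datetime import timedelta

# Documented defaults, written out explicitly.
CLICKHOUSE_CELERY_QUEUE = 'celery'
CLICKHOUSE_SYNC_STORAGE = 'django_clickhouse.storages.RedisStorage'
CLICKHOUSE_SYNC_BATCH_SIZE = 10000
CLICKHOUSE_SYNC_DELAY = 5

# Assumed beat configuration: run the auto sync task about once per second.
CELERYBEAT_SCHEDULE = {
    'clickhouse_auto_sync': {
        'task': 'django_clickhouse.tasks.clickhouse_auto_sync',
        'schedule': timedelta(seconds=1),
    },
}
```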
ClickHouseModel configuration
Each ClickHouseModel subclass can define sync arguments and methods:
- django_model: django.db.models.Model
  Required. The Django model this ClickHouseModel class is synchronized with.
- django_model_serializer: Type[Django2ClickHouseModelSerializer]
  Defaults to: django_clickhouse.serializers.Django2ClickHouseModelSerializer
  The serializer class used to convert a Django model instance to a ClickHouseModel instance.
- sync_enabled: bool
  Defaults to: False
  Whether sync is enabled for this model.
- sync_batch_size: int
  Defaults to: CLICKHOUSE_SYNC_BATCH_SIZE
  The maximum number of operations the sync process fetches from storage per sync round.
- sync_delay: float
  Defaults to: CLICKHOUSE_SYNC_DELAY
  The delay in seconds between the starts of two sync rounds.
- sync_storage: Union[str, Storage]
  Defaults to: CLICKHOUSE_SYNC_STORAGE
  The intermediate storage class to use. Can be a string or a class.
Example:
from django_clickhouse.clickhouse_models import ClickHouseModel
from django_clickhouse.engines import ReplacingMergeTree
from infi.clickhouse_orm import fields
from my_app.models import User


class ClickHouseUser(ClickHouseModel):
    django_model = User
    sync_enabled = True
    sync_delay = 5
    sync_batch_size = 1000

    id = fields.UInt32Field()
    first_name = fields.StringField()
    birthday = fields.DateField()
    visits = fields.UInt32Field(default=0)

    engine = ReplacingMergeTree('birthday', ('birthday',))
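How sync_enabled and sync_delay could feed into the need_sync() decision can be sketched as follows. This is one plausible reading of the text (the sync start time is saved in storage, and sync_delay is measured between round starts), not the library's actual code; the standalone need_sync function and its parameters are hypothetical.

```python
import datetime
from typing import Optional


def need_sync(sync_enabled: bool, sync_delay: float,
              last_sync_start: Optional[datetime.datetime],
              now: datetime.datetime) -> bool:
    """Decide whether a new sync round should start (assumed logic)."""
    if not sync_enabled:
        return False  # sync is switched off for this model
    if last_sync_start is None:
        return True   # never synced before
    # A new round starts once sync_delay seconds have passed since the
    # previous round started.
    return (now - last_sync_start).total_seconds() >= sync_delay


started = datetime.datetime(2020, 1, 1, 12, 0, 0)
too_early = need_sync(True, 5, started, started + datetime.timedelta(seconds=3))
due = need_sync(True, 5, started, started + datetime.timedelta(seconds=5))
disabled = need_sync(False, 5, None, started)
```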
Fail resistance
Fail resistance is based on several points:
- Storage should not lose data under any circumstances. Keeping the storage stable is not the goal of this library.
- Data is removed from storage only if the import succeeds. Otherwise the import attempt is repeated.
- It's recommended to use the ReplacingMergeTree or CollapsingMergeTree engines instead of a plain MergeTree, so that duplicates are removed if a batch is imported twice.
- Each ClickHouseModel is synced in a separate process. If one model fails, it should not affect other models.
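Why a replacing engine makes a repeated import harmless can be shown with a toy simulation. This is plain Python, not ClickHouse: replacing_merge is a hypothetical helper that keeps the last row per key, which is roughly what ReplacingMergeTree does for rows sharing a sorting key. In ClickHouse itself this happens during background merges, so duplicates may remain visible until a merge runs.

```python
def replacing_merge(rows, key):
    """Keep only the last row seen for each key value."""
    merged = {}
    for row in rows:
        merged[key(row)] = row  # a later row replaces an earlier one
    return list(merged.values())


batch = [(1, 'Ann'), (2, 'Bob')]

# The same batch lands in the table twice, e.g. because the process was
# killed after the insert but before storage cleanup.
table = batch + batch

deduplicated = replacing_merge(table, key=lambda row: row[0])
```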