Added more docs

This commit is contained in:
M1ha 2020-02-07 13:05:19 +05:00
parent 5cb43ca6cd
commit 7d2d94336c
10 changed files with 199 additions and 21 deletions

View File

@ -38,7 +38,7 @@ A database alias to use in [QuerySets](queries.md) if direct [using](routing.md#
### CLICKHOUSE_SYNC_STORAGE
Defaults to: `'django_clickhouse.storages.RedisStorage'`
An intermediate storage class to use. Can be a string or class. [More info about storages](storages.md).
An [intermediate storage](storages.md) class to use. Can be a string or class.
### CLICKHOUSE_REDIS_CONFIG
Defaults to: `None`
@ -57,11 +57,11 @@ CLICKHOUSE_REDIS_CONFIG = {
### CLICKHOUSE_SYNC_BATCH_SIZE
Defaults to: `10000`
Maximum number of operations, fetched by sync process from intermediate storage per sync round.
Maximum number of operations fetched by the sync process from [intermediate storage](storages.md) per [sync](synchronization.md) round.
### CLICKHOUSE_SYNC_DELAY
Defaults to: `5`
A delay in seconds between two sync rounds start.
A delay in seconds between the starts of two consecutive [sync](synchronization.md) rounds.
### CLICKHOUSE_MODELS_MODULE
Defaults to: `'clickhouse_models'`

View File

@ -22,6 +22,9 @@ secondary = connections['secondary']
db_link = connections['default']
```
You can also get database objects from [QuerySet](queries.md) and [ClickHouseModel](models.md) instances by calling the `get_database(for_write: bool = False)` method.
The database returned may differ, depending on the [routing](routing.md#router) you use.
## Database object
Database class is based on [infi.clickhouse_orm Database object](https://github.com/Infinidat/infi.clickhouse_orm/blob/develop/docs/models_and_databases.md#models-and-databases),
but extends it with some extra attributes and methods:
@ -31,10 +34,4 @@ I expect this library's [migration system](migrations.md) to be used.
Direct database migration will lead to migration information errors.
### `insert_tuples` and `select_tuples` methods
[infi.clickhouse_orm](https://github.com/Infinidat/infi.clickhouse_orm) stores data rows in Model objects.
It works well on hundreds of records.
But when you sync 100k records in a batch, initializing 100k model instances will be slow.
To optimize this process, the `ClickHouseModel` class has a `get_tuple_class()` method.
It generates a [namedtuple](https://docs.python.org/3/library/collections.html#collections.namedtuple) class
with the same data fields as the model.
Initializing such tuples takes much less time than initializing Model objects.
These methods work with [ClickHouseModel namedtuples](models.md#clickhousemodel-namedtuple-form) instead of Model objects.
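For instance (a minimal sketch: `ClickHouseUser` is a hypothetical model with `id` and `first_name` fields, and the import path and exact signatures of `insert_tuples`/`select_tuples` should be checked against the library source):
```python
from django_clickhouse.database import connections
from my_app.clickhouse_models import ClickHouseUser  # hypothetical model

db = connections['default']
tuple_class = ClickHouseUser.get_tuple_class()

# Insert lightweight namedtuples instead of heavy Model instances
rows = (tuple_class(id=i, first_name='test') for i in range(1000))
db.insert_tuples(ClickHouseUser, rows)

# Read query results back as namedtuples
for row in db.select_tuples('SELECT * FROM $table', ClickHouseUser):
    print(row.id, row.first_name)
```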

View File

@ -6,7 +6,7 @@
* [Requirements](basic_information.md#requirements)
* [Installation](basic_information.md#installation)
* [Design motivation](motivation.md)
* Usage
* [Usage](overview.md)
* [Overview](overview.md)
* [Models](models.md)
* [DjangoModel](models.md#DjangoModel)

View File

@ -109,6 +109,15 @@ class MyMultiModel(ClickHouseMultiModel):
sub_models = [AgeData, HeightData]
```
## ClickHouseModel namedtuple form
[infi.clickhouse_orm](https://github.com/Infinidat/infi.clickhouse_orm) stores data rows in special Model objects.
It works well on hundreds of records.
But when you sync 100k records in a batch, initializing 100k model instances will be slow.
To optimize this process, the `ClickHouseModel` class has a `get_tuple_class()` method.
It generates a [namedtuple](https://docs.python.org/3/library/collections.html#collections.namedtuple) class
with the same data fields as the model.
Initializing such tuples takes much less time than initializing Model objects.
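A minimal sketch (`ClickHouseUser` and its fields are illustrative):
```python
from my_app.clickhouse_models import ClickHouseUser  # hypothetical model

# Generated once per model; behaves like a regular namedtuple class
tuple_class = ClickHouseUser.get_tuple_class()

# Cheap to create in large batches, unlike Model instances
row = tuple_class(id=1, first_name='Alice')
assert tuple(row) == (1, 'Alice')
```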
## Engines
An engine is a way of storing, indexing, replicating and sorting data in ClickHouse ([docs](https://clickhouse.yandex/docs/en/operations/table_engines/)).
Engine system is based on [infi.clickhouse_orm engine system](https://github.com/Infinidat/infi.clickhouse_orm/blob/develop/docs/table_engines.md#table-engines).
@ -120,3 +129,25 @@ Currently supported engines (with all infi functionality, [more info](https://gi
* `ReplacingMergeTree`
* `SummingMergeTree`
* `CollapsingMergeTree`
## Serializers
A serializer is a class which translates Django model instances into [namedtuples inserted into ClickHouse](#clickhousemodel-namedtuple-form).
`django_clickhouse.serializers.Django2ClickHouseModelSerializer` is used by default in all models.
All serializers must inherit this class and implement the following interface:
```python
from django_clickhouse.serializers import Django2ClickHouseModelSerializer
from django.db.models import Model as DjangoModel
from typing import Iterable, NamedTuple, Optional, Type


class CustomSerializer(Django2ClickHouseModelSerializer):
    def __init__(self, model_cls: Type['ClickHouseModel'], fields: Optional[Iterable[str]] = None,
                 exclude_fields: Optional[Iterable[str]] = None, writable: bool = False,
                 defaults: Optional[dict] = None) -> None:
        super().__init__(model_cls, fields=fields, exclude_fields=exclude_fields,
                         writable=writable, defaults=defaults)

    def serialize(self, obj: DjangoModel) -> NamedTuple:
        pass
```
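To plug a custom serializer into a model, set its `django_model_serializer` attribute (a sketch; `User` is a hypothetical Django model):
```python
from django_clickhouse.clickhouse_models import ClickHouseModel
from my_app.models import User  # hypothetical Django model

class ClickHouseUser(ClickHouseModel):
    django_model = User
    django_model_serializer = CustomSerializer
```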

View File

@ -11,8 +11,7 @@ You can set a common prefix for all keys in this library using [CLICKHOUSE_STATS
## Gauges
* `<prefix>.sync.<model_name>.queue`
Number of elements in [intermediate storage](storages.md) queue waiting for import.
<!--- TODO Add link --->
Queue should not be big. It depends on [sync_delay]() configured and time for syncing single batch.
The queue should not be big. Its size depends on the configured [sync_delay](synchronization.md#configuration) and the time needed to sync a single batch.
It is a good parameter to watch and alert on.
## Timers

View File

@ -76,7 +76,6 @@ from my_app.models import User
class ClickHouseUser(ClickHouseModel):
    django_model = User
    sync_delay = 5

    id = fields.UInt32Field()
    first_name = fields.StringField()

View File

@ -1,3 +1,46 @@
# Sync performance
Every real-life system may have its own performance problems.
They depend on:
* Your ClickHouse server configuration
* Number of ClickHouse instances in your cluster
* Your data formats
* Import speed
* Network
* etc
TODO
I recommend using [monitoring](monitoring.md) in order to understand where the bottleneck is and act accordingly.
This chapter gives a list of known problems which can slow down your import.
## ClickHouse tuning
Read this [doc](https://clickhouse.tech/docs/en/introduction/performance/#performance-when-inserting-data)
and tune your servers both for reads and writes.
## ClickHouse cluster
As ClickHouse is a [multimaster database](https://clickhouse.tech/docs/en/introduction/distinctive_features/#data-replication-and-data-integrity-support),
you can import and read from any node when you have a cluster.
To read from and import to multiple nodes, you can use [CHProxy](https://github.com/Vertamedia/chproxy)
or add multiple databases to the [routing configuration](routing.md#clickhousemodel-routing-attributes), as sketched below.
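A sketch of such a configuration (alias names, connection parameters and the routing attributes shown are assumptions here; see [routing](routing.md#clickhousemodel-routing-attributes) for the authoritative list):
```python
# settings.py: one connection alias per ClickHouse node
CLICKHOUSE_DATABASES = {
    'node1': {'db_name': 'analytics', 'db_url': 'http://clickhouse-1:8123/'},
    'node2': {'db_name': 'analytics', 'db_url': 'http://clickhouse-2:8123/'},
}

# clickhouse_models.py: spread reads and writes over both nodes
from django_clickhouse.clickhouse_models import ClickHouseModel

class ClickHouseUser(ClickHouseModel):
    read_db_aliases = ('node1', 'node2')
    write_db_aliases = ('node1', 'node2')
```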
## CollapsingMergeTree engine and previous versions
To reduce the amount of data stored in [intermediate storage](storages.md),
this library doesn't store old versions of data on update or delete.
Getting previous data versions from relational storage is also an expensive operation.
Engines like `CollapsingMergeTree` get old versions from ClickHouse:
1. Using `version_col`, if it is set in the engine's parameters.
   This is a special field which stores incremental row versions and is filled by the library.
   It can be of any unsigned integer type (depending on how many row versions you may have).
2. Using the `FINAL` query modifier.
   This way is much slower, but doesn't require an additional column.
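A sketch of a model using a version column (field names are illustrative; the `version_col` parameter follows the description above):
```python
from django_clickhouse.clickhouse_models import ClickHouseModel
from django_clickhouse.engines import CollapsingMergeTree
from infi.clickhouse_orm import fields

class ClickHouseUser(ClickHouseModel):
    id = fields.UInt32Field()
    birthday = fields.DateField()
    sign = fields.Int8Field()       # CollapsingMergeTree sign column
    version = fields.UInt32Field()  # incremental row version, filled by the library

    engine = CollapsingMergeTree('birthday', ('birthday',), 'sign', version_col='version')
```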
## Know your data
In the common case, the library user forms ClickHouse data from Python types.
The library is responsible for converting this data into the format ClickHouse expects to receive.
This leads to a great number of conversion operations when you import data in big batches.
To reduce this time, you can:
* Set `MyClickHouseModel.sync_formatted_tuples` to `True`
* Override the `MyClickHouseModel.get_insert_batch(import_objects: Iterable[DjangoModel])` method:
  it should get `cls.get_tuple_class()` and yield (it is a [generator](https://wiki.python.org/moin/Generators))
  tuples of string values, already prepared for insertion into ClickHouse; see the sketch after the note below.
**Important note**: `ClickHouseModel.get_insert_batch(...)` can perform additional functionality depending on the model's [engine](models.md#engines).
Be careful when overriding it.
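A sketch of this optimization (model fields and string formatting are illustrative; check what your engine's `get_insert_batch(...)` does before replacing it):
```python
from typing import Iterable

from django.db.models import Model as DjangoModel
from django_clickhouse.clickhouse_models import ClickHouseModel

class FastClickHouseUser(ClickHouseModel):
    sync_formatted_tuples = True

    @classmethod
    def get_insert_batch(cls, import_objects: Iterable[DjangoModel]):
        tuple_class = cls.get_tuple_class()
        for obj in import_objects:
            # Yield values already formatted as strings so the library
            # can skip per-field conversion at insert time
            yield tuple_class(id=str(obj.pk), first_name=obj.first_name)
```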

View File

@ -1,3 +1,105 @@
# Synchronization
TODO
## Design motivation
Read [here](motivation.md#sync-over-intermediate-storage).
## Algorithm
<!--- ![General scheme](https://octodex.github.com/images/yaktocat.png) --->
1. [Celery beat](https://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html) schedules the `django_clickhouse.tasks.clickhouse_auto_sync` task every second or so (a scheduling sketch follows this list).
2. [Celery workers](https://docs.celeryproject.org/en/latest/userguide/workers.html) execute `clickhouse_auto_sync`.
   It searches for `ClickHouseModel` subclasses which need sync (those whose `need_sync()` method returns `True`).
3. A `django_clickhouse.tasks.sync_clickhouse_model` task is scheduled for each `ClickHouseModel` which needs sync.
4. `sync_clickhouse_model` saves the sync start time in [storage](storages.md) and calls the `ClickHouseModel.sync_batch_from_storage()` method.
5. `ClickHouseModel.sync_batch_from_storage()`:
    * Gets the [storage](storages.md) the model works with, using the `ClickHouseModel.get_storage()` method
    * Calls `Storage.pre_sync(import_key)` for the model's [storage](storages.md).
      This may be used to prevent parallel execution with locks or for other preparations.
    * Gets a list of operations to sync from [storage](storages.md).
    * Fetches objects from the relational database by calling the `ClickHouseModel.get_sync_objects(operations)` method.
    * Forms a batch of tuples to insert into ClickHouse using the `ClickHouseModel.get_insert_batch(import_objects)` method.
    * Inserts the batch of tuples into ClickHouse using the `ClickHouseModel.insert_batch(batch)` method.
    * Calls the `Storage.post_sync(import_key)` method to clean up storage after syncing the batch.
      This method also removes synced operations from storage.
    * If an exception occurs during execution, the `Storage.post_sync_failed(import_key)` method is called.
      Note that the process can be killed without an exception (for instance, by the OOM killer); in that case this method will not be called.
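For reference, a sketch of the celery beat schedule step 1 assumes (interval and options are illustrative; the queue must match [CLICKHOUSE_CELERY_QUEUE](configuration.md#clickhouse_celery_queue)):
```python
# settings.py
from datetime import timedelta

CELERYBEAT_SCHEDULE = {
    'clickhouse_auto_sync': {
        'task': 'django_clickhouse.tasks.clickhouse_auto_sync',
        'schedule': timedelta(seconds=2),  # every second or so
        'options': {'queue': 'celery'}
    }
}
```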
## Configuration
Sync configuration can be set globally using django settings.py parameters or redeclared for each `ClickHouseModel` class.
`ClickHouseModel` configuration takes priority over settings configuration.
### Settings configuration
* [CLICKHOUSE_CELERY_QUEUE](configuration.md#clickhouse_celery_queue)
Defaults to: `'celery'`
The name of the queue used by celery to schedule the library's sync tasks.
* [CLICKHOUSE_SYNC_STORAGE](configuration.md#clickhouse_sync_storage)
Defaults to: `'django_clickhouse.storages.RedisStorage'`
An [intermediate storage](storages.md) class to use. Can be a string or class.
* [CLICKHOUSE_SYNC_BATCH_SIZE](configuration.md#clickhouse_sync_batch_size)
Defaults to: `10000`
Maximum number of operations fetched by the sync process from [intermediate storage](storages.md) per sync round.
* [CLICKHOUSE_SYNC_DELAY](configuration.md#clickhouse_sync_delay)
Defaults to: `5`
A delay in seconds between the starts of two consecutive sync rounds.
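Put together in `settings.py`, these parameters look like this (a sketch using the documented defaults):
```python
CLICKHOUSE_CELERY_QUEUE = 'celery'
CLICKHOUSE_SYNC_STORAGE = 'django_clickhouse.storages.RedisStorage'
CLICKHOUSE_SYNC_BATCH_SIZE = 10000
CLICKHOUSE_SYNC_DELAY = 5
```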
### ClickHouseModel configuration
Each `ClickHouseModel` subclass can define sync arguments and methods:
* `django_model: django.db.models.Model`
Required.
The Django model this ClickHouseModel class is synchronized with.
* `django_model_serializer: Type[Django2ClickHouseModelSerializer]`
Defaults to: `django_clickhouse.serializers.Django2ClickHouseModelSerializer`
A [serializer class](models.md#serializers) to convert DjangoModel objects to ClickHouseModel tuples.
* `sync_enabled: bool`
Defaults to: `False`.
Is sync for this model enabled?
* `sync_batch_size: int`
Defaults to: [CLICKHOUSE_SYNC_BATCH_SIZE](configuration.md#clickhouse_sync_batch_size)
Maximum number of operations fetched by the sync process from [storage](storages.md) per sync round.
* `sync_delay: float`
Defaults to: [CLICKHOUSE_SYNC_DELAY](configuration.md#clickhouse_sync_delay)
A delay in seconds between the starts of two consecutive sync rounds.
* `sync_storage: Union[str, Storage]`
Defaults to: [CLICKHOUSE_SYNC_STORAGE](configuration.md#clickhouse_sync_storage)
An [intermediate storage](storages.md) class to use. Can be a string or class.
Example:
```python
from django_clickhouse.clickhouse_models import ClickHouseModel
from django_clickhouse.engines import ReplacingMergeTree
from infi.clickhouse_orm import fields
from my_app.models import User
class ClickHouseUser(ClickHouseModel):
    django_model = User
    sync_enabled = True
    sync_delay = 5
    sync_batch_size = 1000

    id = fields.UInt32Field()
    first_name = fields.StringField()
    birthday = fields.DateField()
    visits = fields.UInt32Field(default=0)

    engine = ReplacingMergeTree('birthday', ('birthday',))
```
## Fail resistance
Fail resistance is based on several points:
1. [Storage](storages.md) should not lose data in any case. Keeping the storage itself stable is not this library's goal.
2. Data is removed from [storage](storages.md) only if import succeeds. Otherwise import attempt is repeated.
3. It's recommended to use ReplacingMergeTree or CollapsingMergeTree [engines](models.md#engines)
instead of simple MergeTree, so it removes duplicates if batch is imported twice.
4. Each `ClickHouseModel` is synced in a separate process.
If one model fails, it should not affect other models.

View File

@ -13,7 +13,7 @@ with open('requirements.txt') as f:
setup(
    name='django-clickhouse',
    version='0.0.1',
    version='1.0.0',
    packages=['django_clickhouse'],
    package_dir={'': 'src'},
    url='https://github.com/carrotquest/django-clickhouse',

View File

@ -1,3 +1,6 @@
import sys
from unittest import skipIf
from django.test import TestCase
from django_clickhouse.compatibility import namedtuple
@ -10,12 +13,16 @@ class NamedTupleTest(TestCase):
        self.assertTupleEqual((1, 2, 4), tuple(TestTuple(1, 2, 4)))
        self.assertTupleEqual((1, 2, 4), tuple(TestTuple(a=1, b=2, c=4)))

    def test_exceptions(self):
    @skipIf(sys.version_info < (3, 7),
            "On python < 3.7 this error is not raised, as not given defaults are filled by None")
    def test_no_required_value(self):
        TestTuple = namedtuple('TestTuple', ('a', 'b', 'c'), defaults=[3])
        # BUG On python < 3.7 this error is not raised, as not given defaults are filled by None
        # with self.assertRaises(TypeError):
        #     TestTuple(b=1, c=4)
        with self.assertRaises(TypeError):
            TestTuple(b=1, c=4)

    def test_duplicate_value(self):
        TestTuple = namedtuple('TestTuple', ('a', 'b', 'c'), defaults=[3])
        with self.assertRaises(TypeError):
            TestTuple(1, 2, 3, c=4)