1) Started writing the docs

2)
M1hacka 2019-04-21 10:04:41 +05:00
parent 10fae9220c
commit c2beabbc07
5 changed files with 213 additions and 0 deletions

35
docs/basic_information.md Normal file

@ -0,0 +1,35 @@
# Basic information
## <a name="about">About</a>
This project's goal is to integrate the [Yandex ClickHouse](https://clickhouse.yandex/) database into a [Django](https://www.djangoproject.com/) project.
It is based on the [infi.clickhouse-orm](https://github.com/Infinidat/infi.clickhouse_orm) library.
## <a name="features">Features</a>
* Multiple ClickHouse database configuration in [settings.py](https://docs.djangoproject.com/en/2.1/ref/settings/)
* ORM to create and manage ClickHouse models.
* ClickHouse migration system.
* Scalable serialization of Django model instances to ORM model instances.
* Efficient periodic synchronization of Django models to ClickHouse without losing data.
* Synchronization process monitoring.
## <a name="requirements">Requirements</a>
* [Python 3](https://www.python.org/downloads/)
* [Django](https://docs.djangoproject.com/) 1.7+
* [Yandex ClickHouse](https://clickhouse.yandex/)
* [infi.clickhouse-orm](https://github.com/Infinidat/infi.clickhouse_orm)
* pytz
* six
* typing
* psycopg2
* celery
* statsd
### Optional libraries
* [redis-py](https://redis-py.readthedocs.io/en/latest/) for [RedisStorage](storages.md#redis_storage)
* [django-pg-returning](https://travis-ci.com/M1hacka/django-pg-returning)
to optimize the registration of updates in [PostgreSQL](https://www.postgresql.org/)
## <a name="installation">Installation</a>
Install via pip:
`pip install django-clickhouse` ([not released yet](https://github.com/carrotquest/django-clickhouse/issues/3))
or via setup.py:
`python setup.py install`

96
docs/configuration.md Normal file

@ -0,0 +1,96 @@
# Configuration
The library is configured in settings.py. All parameters start with the `CLICKHOUSE_` prefix.
The prefix can be changed with the `CLICKHOUSE_SETTINGS_PREFIX` parameter.
### <a name="databases">CLICKHOUSE_SETTINGS_PREFIX</a>
Defaults to: `'CLICKHOUSE_'`
You can use this parameter to change the `CLICKHOUSE_` prefix in settings to anything you like.
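As an illustration, here is a minimal settings sketch. The `CH_` prefix is an arbitrary example value, and the sketch assumes that `CLICKHOUSE_SETTINGS_PREFIX` itself is always read under its fixed name while all other parameters are looked up under the new prefix.

```python
# settings.py (sketch; 'CH_' is an arbitrary example prefix)
CLICKHOUSE_SETTINGS_PREFIX = 'CH_'

# Assumption: with the prefix changed, the other library parameters
# are declared under 'CH_' instead of 'CLICKHOUSE_'
CH_DATABASES = {
    'default': {
        'db_name': 'test',
        'username': 'default',
        'password': ''
    }
}
```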
### <a name="databases">CLICKHOUSE_DATABASES</a>
Defaults to: `{}`
A dictionary defining databases in a Django-like style.
<!--- TODO Add link --->
Each key is an alias used to refer to this database in [connections]() and [using]().
Each value is a configuration dict with the following parameters:
* [infi.clickhouse_orm database parameters](https://github.com/Infinidat/infi.clickhouse_orm/blob/develop/docs/class_reference.md#database)
<!--- TODO Add link --->
* `migrate: bool` - indicates whether this database should be migrated. See [migrations]().
Example:
```python
CLICKHOUSE_DATABASES = {
    'default': {
        'db_name': 'test',
        'username': 'default',
        'password': ''
    }
}
```
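Several databases can be declared side by side. The sketch below is only an illustration: the `analytics` alias, its database name and its URL are hypothetical values, `db_url` is one of the infi.clickhouse_orm database parameters, and `migrate` is the flag described above.

```python
CLICKHOUSE_DATABASES = {
    'default': {
        'db_name': 'test',
        'username': 'default',
        'password': ''
    },
    # Hypothetical second database that is excluded from migrations
    'analytics': {
        'db_name': 'analytics',
        'db_url': 'http://127.0.0.1:8123/',
        'username': 'default',
        'password': '',
        'migrate': False
    }
}
```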
### <a name="default_db_alias">CLICKHOUSE_DEFAULT_DB_ALIAS</a>
Defaults to: `'default'`
<!--- TODO Add link --->
The database alias used in [QuerySets]() when [using]() is not specified explicitly.
### <a name="sync_storage">CLICKHOUSE_SYNC_STORAGE</a>
Defaults to: `'django_clickhouse.storages.RedisStorage'`
The intermediate storage class to use. Can be given as a string or as a class. [More info about storages](storages.md).
### <a name="redis_config">CLICKHOUSE_REDIS_CONFIG</a>
Defaults to: `None`
Redis configuration for [RedisStorage](storages.md#redis_storage).
If given, should be a dictionary of parameters to pass to [redis-py](https://redis-py.readthedocs.io/en/latest/#redis.Redis).
Example:
```python
CLICKHOUSE_REDIS_CONFIG = {
    'host': '127.0.0.1',
    'port': 6379,
    'db': 8,
    'socket_timeout': 10
}
```
### <a name="sync_batch_size">CLICKHOUSE_SYNC_BATCH_SIZE</a>
Defaults to: `10000`
The maximum number of operations the sync process fetches from intermediate storage per sync round.
### <a name="sync_delay">CLICKHOUSE_SYNC_DELAY</a>
Defaults to: `5`
The delay, in seconds, between the starts of two consecutive sync rounds.
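Taken together, the batch size and the delay bound how fast operations can be drained from intermediate storage. The following back-of-the-envelope sketch uses the documented defaults and assumes that each round fetches one batch and finishes before the next round starts; it is an estimate, not a measured figure.

```python
# Values mirror the documented defaults
CLICKHOUSE_SYNC_BATCH_SIZE = 10000  # operations fetched per sync round
CLICKHOUSE_SYNC_DELAY = 5           # seconds between round starts

# Upper bound on drained operations per second,
# assuming every round completes within the delay
max_ops_per_second = CLICKHOUSE_SYNC_BATCH_SIZE / CLICKHOUSE_SYNC_DELAY
print(max_ops_per_second)  # 2000.0
```

If operations are registered faster than this bound for a long time, the queue in intermediate storage would be expected to grow.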
### <a name="models_module">CLICKHOUSE_MODELS_MODULE</a>
Defaults to: `'clickhouse_models'`
<!--- TODO Add link --->
The name of the module inside a [Django app](https://docs.djangoproject.com/en/2.2/intro/tutorial01/)
where [ClickHouseModel]() classes are searched for during migrations.
### <a name="database_router">CLICKHOUSE_DATABASE_ROUTER</a>
Defaults to: `'django_clickhouse.routers.DefaultRouter'`
<!--- TODO Add link --->
A dotted path to the class representing the [database router]().
### <a name="migrations_package">CLICKHOUSE_MIGRATIONS_PACKAGE</a>
Defaults to: `'clickhouse_migrations'`
The name of the Python package inside a [Django app](https://docs.djangoproject.com/en/2.2/intro/tutorial01/)
where migration files are searched for.
### <a name="migration_history_model">CLICKHOUSE_MIGRATION_HISTORY_MODEL</a>
Defaults to: `'django_clickhouse.migrations.MigrationHistory'`
<!--- TODO Add link --->
The dotted name (including module path) of the ClickHouseModel subclass representing the [MigrationHistory]() model.
### <a name="migrate_with_default_db">CLICKHOUSE_MIGRATE_WITH_DEFAULT_DB</a>
Defaults to: `True`
A boolean flag that enables automatic ClickHouse migration
when you call [`migrate`](https://docs.djangoproject.com/en/2.2/ref/django-admin/#django-admin-migrate) on the default database.
### <a name="statd_prefix">CLICKHOUSE_STATSD_PREFIX</a>
Defaults to: `'clickhouse'`
<!--- TODO Add link --->
A [statsd](https://pythonhosted.org/python-statsd/) prefix added to each library metric. See [metrics]().
### <a name="celery_queue">CLICKHOUSE_CELERY_QUEUE</a>
Defaults to: `'celery'`
The name of the queue Celery uses to schedule the library's sync tasks.

10
docs/index.md Normal file

@ -0,0 +1,10 @@
# Table of contents
* [Basic information](basic_information.md)
  * [About](basic_information.md#about)
  * [Features](basic_information.md#features)
  * [Requirements](basic_information.md#requirements)
  * [Installation](basic_information.md#installation)
* Usage
  * [Storages](storages.md)
    * [RedisStorage](storages.md#redis_storage)

70
docs/storages.md Normal file

@ -0,0 +1,70 @@
# Storages
A storage class is a facade that stores information about operations which were performed on Django models.
It has three main purposes:
* Storage should be fast to insert single records. It forms a batch of data, which is then inserted into ClickHouse.
* Storage guarantees that no data is lost.
Intermediate data in storage is deleted only after the batch import finishes successfully.
If the import fails at some point, starting a new import process should import the failed data again.
* Storage keeps information about the sync process, for instance the last time a model sync was called.
To distinguish models from each other, storage uses `import_key`.
By default, it is generated by the `ClickHouseModel.get_import_key()` method and is equal to the class name.
Each method of the abstract `Storage` class takes `kwargs` parameters, which can be used in a concrete storage.
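The default `import_key` behaviour can be pictured with the sketch below. `ClickHouseModelSketch` is a stand-in written for this example, not the library class, and the overriding subclass with its `'orders_v2'` key is purely hypothetical.

```python
class ClickHouseModelSketch:
    """Stand-in for ClickHouseModel; NOT the library class."""

    @classmethod
    def get_import_key(cls) -> str:
        # Documented default: the import key equals the class name
        return cls.__name__


class ClickHouseUser(ClickHouseModelSketch):
    pass


class ClickHouseOrder(ClickHouseModelSketch):
    @classmethod
    def get_import_key(cls) -> str:
        # Hypothetical override that gives this model a custom key
        return 'orders_v2'
```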
## Storage methods
* `register_operations(import_key: str, operation: str, *pks: Any) -> int`
Saves new operations performed in the source database to storage. This method should be fast.
It is called after the source database transaction is committed.
The method returns the number of operations registered.
`operation` is one of `insert`, `update` or `delete`.
`pks` is an iterable of strings, each sufficient to select the needed record from the source database.
* `get_last_sync_time(import_key: str) -> Optional[datetime.datetime]`
Returns the last time a model sync was called. If no sync has been done, returns `None`.
* `set_last_sync_time(import_key: str, dt: datetime.datetime) -> None`
Saves the datetime at which a sync process was last called.
* `register_operations_wrapped(import_key: str, operation: str, *pks: Any) -> int`
A wrapper for `register_operations`. Its goal is to write metrics and logs.
* `pre_sync(import_key: str, **kwargs) -> None`
Called before import process starts. It initializes storage for importing new batch.
* `operations_count(import_key: str, **kwargs) -> int`
Counts how many operations are waiting in storage for import.
* `get_operations(import_key: str, count: int, **kwargs) -> List[Tuple[str, str]]`
Returns the next batch of operations to import. The `count` parameter gives the number of operations to return.
An operation is a tuple `(operation, primary_key)`, where `operation` is one of `insert`, `update` or `delete`
and `primary_key` is a string sufficient to select the record from the source database.
* `post_sync(import_key: str, **kwargs) -> None`
Called after the import process has finished. It cleans storage after importing a batch.
* `post_batch_removed(import_key: str, batch_size: int) -> None`
This method should be called by the `post_sync` method after data is removed from storage.
By default, it records the queue size metric.
* `post_sync_failed(import_key: str, exception: Exception, **kwargs) -> None`
Called if any exception has occurred during the import process. It cleans storage after an unsuccessful import.
Note that if the import process is killed abruptly (by the OOM killer, for instance), this method is not called.
* `flush() -> None`
*Dangerous*. Drops all data kept by storage. It is used for cleaning up between tests.
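To make the contract of the methods above more concrete, here is a minimal in-memory sketch. It is not part of the library and does not subclass the real `Storage` class: `InMemoryStorage`, `sync_round` and the `insert_batch` callback are names invented for this example, and the call order inside `sync_round` is an assumption derived from the method descriptions.

```python
import datetime
from typing import Any, Callable, Dict, List, Optional, Tuple

Operation = Tuple[str, str]  # (operation, primary_key)


class InMemoryStorage:
    """Illustrative storage; NOT part of the library."""

    def __init__(self) -> None:
        self._queue: Dict[str, List[Operation]] = {}
        self._batch: Dict[str, List[Operation]] = {}
        self._last_sync: Dict[str, datetime.datetime] = {}

    def register_operations(self, import_key: str, operation: str, *pks: Any) -> int:
        if operation not in ('insert', 'update', 'delete'):
            raise ValueError('Unknown operation: %s' % operation)
        queue = self._queue.setdefault(import_key, [])
        queue.extend((operation, str(pk)) for pk in pks)
        return len(pks)

    def operations_count(self, import_key: str, **kwargs) -> int:
        return len(self._queue.get(import_key, []))

    def pre_sync(self, import_key: str, **kwargs) -> None:
        # Start a new round with an empty "in progress" batch
        self._batch[import_key] = []

    def get_operations(self, import_key: str, count: int, **kwargs) -> List[Operation]:
        # Read without deleting: data stays queued until post_sync
        batch = self._queue.get(import_key, [])[:count]
        self._batch[import_key] = batch
        return batch

    def post_sync(self, import_key: str, **kwargs) -> None:
        # The batch was imported successfully, so it can be dropped now
        done = len(self._batch.pop(import_key, []))
        self._queue[import_key] = self._queue.get(import_key, [])[done:]

    def post_sync_failed(self, import_key: str, exception: Exception, **kwargs) -> None:
        # Forget the batch but keep the queue, so the next round retries it
        self._batch.pop(import_key, None)

    def get_last_sync_time(self, import_key: str) -> Optional[datetime.datetime]:
        return self._last_sync.get(import_key)

    def set_last_sync_time(self, import_key: str, dt: datetime.datetime) -> None:
        self._last_sync[import_key] = dt

    def flush(self) -> None:
        self._queue.clear()
        self._batch.clear()
        self._last_sync.clear()


def sync_round(storage: InMemoryStorage, import_key: str, count: int,
               insert_batch: Callable[[List[Operation]], None]) -> None:
    """One import round; the call order is an assumption, not library code."""
    storage.pre_sync(import_key)
    try:
        insert_batch(storage.get_operations(import_key, count))
        storage.post_sync(import_key)
    except Exception as exc:
        storage.post_sync_failed(import_key, exc)
        raise
```

In this sketch a failing `insert_batch` leaves the queue untouched, so the same operations are returned on the next round, which mirrors the "no data is lost" guarantee described above.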
## Predefined storages
### <a name="redis_storage">RedisStorage</a>
This storage uses a [Redis database](https://redis.io/) as intermediate storage.
To communicate with Redis it uses the [redis-py](https://redis-py.readthedocs.io/en/latest/) library.
redis-py is an optional dependency of the library, but it must be installed to use RedisStorage.
To use RedisStorage you must also set the [CLICKHOUSE_REDIS_CONFIG](configuration.md#redis_config) parameter.
Each stored operation contains:
* The Django database alias where the original record can be found.
* The record's primary key.
* The operation performed (insert, update or delete).
This storage does not allow multi-threaded sync.

2
docs/usage.md Normal file

@ -0,0 +1,2 @@
# Usage