Support for function-based DEFAULT values, not only literals #36

Itai Shirav 2020-02-08 12:05:48 +02:00
parent 0a94ac98a3
commit ffeed4a6a4
8 changed files with 211 additions and 109 deletions

@ -3,22 +3,22 @@ Field Types
See: [ClickHouse Documentation](https://clickhouse.yandex/docs/en/data_types/)
Currently the following field types are supported:
The following field types are supported:
| Class | DB Type | Pythonic Type | Comments
| ------------------ | ---------- | --------------------- | -----------------------------------------------------
| StringField | String | unicode | Encoded as UTF-8 when written to ClickHouse
| FixedStringField | String | unicode | Encoded as UTF-8 when written to ClickHouse
| StringField | String | str | Encoded as UTF-8 when written to ClickHouse
| FixedStringField | FixedString| str | Encoded as UTF-8 when written to ClickHouse
| DateField | Date | datetime.date | Range 1970-01-01 to 2105-12-31
| DateTimeField | DateTime | datetime.datetime | Minimal value is 1970-01-01 00:00:00; Always in UTC
| Int8Field | Int8 | int | Range -128 to 127
| Int16Field | Int16 | int | Range -32768 to 32767
| Int32Field | Int32 | int | Range -2147483648 to 2147483647
| Int64Field | Int64 | int/long | Range -9223372036854775808 to 9223372036854775807
| Int64Field | Int64 | int | Range -9223372036854775808 to 9223372036854775807
| UInt8Field | UInt8 | int | Range 0 to 255
| UInt16Field | UInt16 | int | Range 0 to 65535
| UInt32Field | UInt32 | int | Range 0 to 4294967295
| UInt64Field | UInt64 | int/long | Range 0 to 18446744073709551615
| UInt64Field | UInt64 | int | Range 0 to 18446744073709551615
| Float32Field | Float32 | float |
| Float64Field | Float64 | float |
| DecimalField | Decimal | Decimal | Pythonic values are rounded to fit the scale of the database field
@ -33,6 +33,113 @@ Currently the following field types are supported:
| ArrayField | Array | list | See below
| NullableField | Nullable | See below | See below
Field Options
----------------
All field types accept the following arguments:
- default
- alias
- materialized
- readonly
- codec
Note that `default`, `alias` and `materialized` are mutually exclusive - you cannot use more than one of them in a single field.
### default
Specifies a default value to use for the field. If not given, the field will have a default value based on its type: empty string for string fields, zero for numeric fields, etc.
The default value can be a Python value suitable for the field type, or an expression. For example:
```python
class Event(models.Model):
    name = fields.StringField(default="EVENT")
    repeated = fields.UInt32Field(default=1)
    created = fields.DateTimeField(default=F.now())
    engine = engines.Memory()
    ...
```
When creating a model instance, any fields you do not specify get their default value. Fields that use a default expression are assigned a sentinel value of `infi.clickhouse_orm.models.NO_VALUE` instead. For example:
```python
>>> event = Event()
>>> print(event.to_dict())
{'name': 'EVENT', 'repeated': 1, 'created': <NO_VALUE>}
```
:warning: Due to a bug in ClickHouse versions prior to 20.1.2.4, insertion of records with expressions for default values may fail.
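To illustrate why the sentinel matters at insert time: fields still holding `NO_VALUE` are left out of the serialized row, so ClickHouse evaluates the default expression itself. The sketch below mimics this in plain Python without the ORM. The `row_to_tskv` helper and the local `NO_VALUE` stand-in are hypothetical names used only for illustration, and real serialization also formats and escapes values per field type.

```python
class NoValue:
    """Stand-in for the ORM's sentinel (illustration only)."""
    def __repr__(self):
        return '<NO_VALUE>'

NO_VALUE = NoValue()

def row_to_tskv(values):
    """Serialize a dict as one TSKV line, skipping unassigned fields."""
    # Fields still holding the sentinel are omitted, so the server
    # fills them in from the column's DEFAULT expression.
    parts = ['%s=%s' % (name, value)
             for name, value in values.items()
             if value is not NO_VALUE]
    return '\t'.join(parts)

event = {'name': 'EVENT', 'repeated': 1, 'created': NO_VALUE}
line = row_to_tskv(event)
print(repr(line))
```

Because `created` is absent from the line, a key-value format such as TSKV is needed: a purely positional format would have no way to express a missing column.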
### alias / materialized
The `alias` and `materialized` attributes expect an expression that gets calculated by the database. The difference is that `alias` fields are calculated on the fly, while `materialized` fields are calculated when the record is inserted, and are stored on disk.
You can use any expression, and can refer to other model fields. For example:
```python
class Event(models.Model):
    created = fields.DateTimeField()
    created_date = fields.DateTimeField(materialized=F.toDate(created))
    name = fields.StringField()
    normalized_name = fields.StringField(alias=F.upper(F.trim(name)))
    engine = engines.Memory()
```
For backwards compatibility with older versions of the ORM, you can pass the expression as an SQL string:
```python
created_date = fields.DateTimeField(materialized="toDate(created)")
```
Both field types can't be inserted into the database directly, so they are ignored when using the `Database.insert()` method. ClickHouse does not return the field values if you use `"SELECT * FROM ..."` - you have to list these field names explicitly in the query.
Usage:
```python
obj = Event(created=datetime.now(), name='MyEvent')
db = Database('my_test_db')
db.insert([obj])
# All values will be retrieved from database
db.select('SELECT created, created_date, normalized_name, name FROM $db.event', model_class=Event)
# created_date and normalized_name will contain a default value
db.select('SELECT * FROM $db.event', model_class=Event)
```
When creating a model instance, any alias or materialized fields are assigned a sentinel value of `infi.clickhouse_orm.models.NO_VALUE` since their real values can only be known after insertion to the database.
### readonly
This attribute is set automatically for fields with `alias` or `materialized` attributes; you do not need to pass it yourself.
### codec
This attribute specifies the compression algorithm to use for the field (instead of the default data compression algorithm defined in server settings).
Supported compression algorithms:
| Codec | Argument | Comment
| -------------------- | -------------------------------------------| ----------------------------------------------------
| NONE | None | No compression.
| LZ4 | None | LZ4 compression.
| LZ4HC(`level`) | Possible `level` range: [3, 12]. | Default value: 9. Greater values give better compression at the cost of higher CPU usage. Recommended value range: [4, 9].
| ZSTD(`level`) | Possible `level` range: [1, 22]. | Default value: 1. Greater values give better compression at the cost of higher CPU usage. Levels >= 20 should be used with caution, as they require more memory.
| Delta(`delta_bytes`) | Possible `delta_bytes` values: 1, 2, 4, 8. | Default value for `delta_bytes` is `sizeof(type)` if it equals 1, 2, 4 or 8, and 1 otherwise.
Codecs can be combined by separating their names with commas. The default database codec is not included in the pipeline (if it should be applied to a field, you have to specify it explicitly in the pipeline).
Recommended usage for codecs:
- When the values of a particular metric do not differ significantly from point to point, delta-encoding can reduce disk space usage significantly.
- DateTime works well with a pipeline of Delta and ZSTD; the column can be compressed to 2-3% of its original size (given smooth datetime data).
- Numeric types usually enjoy best compression rates with ZSTD
- String types enjoy good compression rates with LZ4HC
Example:
```python
class Stats(models.Model):
    id = fields.UInt64Field(codec='ZSTD(10)')
    timestamp = fields.DateTimeField(codec='Delta,ZSTD')
    timestamp_date = fields.DateField(codec='Delta(4),ZSTD(22)')
    metadata_id = fields.Int64Field(codec='LZ4')
    status = fields.StringField(codec='LZ4HC(10)')
    calculation = fields.NullableField(fields.Float32Field(), codec='ZSTD')
    alerts = fields.ArrayField(fields.FixedStringField(length=15), codec='Delta(2),LZ4HC')
    engine = MergeTree('timestamp_date', ('id', 'timestamp'))
```
Note: This feature is supported on ClickHouse version 19.1.16 and above. Codec arguments will be ignored by the ORM for older versions of ClickHouse.
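As a rough illustration of how a comma-separated codec pipeline breaks down, the following sketch splits a codec string into its parts and checks each argument against the ranges listed in the table above. It is not part of the ORM, which passes the string on as given; `parse_codec` and `CODEC_ARGS` are hypothetical names, and the ranges are simply copied from this document.

```python
import re

# Allowed arguments, copied from the table above (None means "no argument").
CODEC_ARGS = {
    'NONE': None,
    'LZ4': None,
    'LZ4HC': range(3, 13),
    'ZSTD': range(1, 23),
    'Delta': (1, 2, 4, 8),
}

def parse_codec(spec):
    """Split a pipeline such as 'Delta(4),ZSTD(22)' into (name, arg) pairs."""
    result = []
    for part in spec.split(','):
        match = re.fullmatch(r'\s*(\w+)(?:\((\d+)\))?\s*', part)
        if not match or match.group(1) not in CODEC_ARGS:
            raise ValueError('Unknown codec: %r' % part)
        name, arg = match.group(1), match.group(2)
        allowed = CODEC_ARGS[name]
        if arg is not None:
            # Reject arguments for codecs that take none, or out-of-range values.
            if allowed is None or int(arg) not in allowed:
                raise ValueError('Bad argument for %s: %s' % (name, arg))
            arg = int(arg)
        result.append((name, arg))
    return result

pipeline = parse_codec('Delta(4),ZSTD(22)')
print(pipeline)
```

A check like this could catch a typo such as `ZSTD(30)` before the table is created, rather than relying on the server to reject it.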
DateTimeField and Time Zones
----------------------------
@ -45,8 +152,7 @@ A `DateTimeField` can be assigned values from one of the following types:
The assigned value always gets converted to a timezone-aware `datetime` in UTC. If the assigned value is a timezone-aware `datetime` in another timezone, it will be converted to UTC. Otherwise, the assigned value is assumed to already be in UTC.
DateTime values that are read from the database are also converted to UTC. ClickHouse formats them according to the timezone of the server, and the ORM makes the necessary conversions. This requires a ClickHouse
version which is new enough to support the `timezone()` function; otherwise the server is assumed to be using UTC. In any case, we recommend setting the server timezone to UTC in order to prevent confusion.
DateTime values that are read from the database are also converted to UTC. ClickHouse formats them according to the timezone of the server, and the ORM makes the necessary conversions. This requires a ClickHouse version which is new enough to support the `timezone()` function; otherwise the server is assumed to be using UTC. In any case, we recommend setting the server timezone to UTC in order to prevent confusion.
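The conversion rule for assigned values can be sketched with the standard library alone. This is only an approximation of the behavior described above, not the ORM's code (which relies on `pytz`); `to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

def to_utc(value):
    """Return a timezone-aware datetime in UTC."""
    if value.tzinfo is None:
        # Naive values are assumed to already be in UTC.
        return value.replace(tzinfo=timezone.utc)
    # Aware values are converted from their own timezone.
    return value.astimezone(timezone.utc)

naive = datetime(2020, 2, 8, 12, 5, 48)
aware = datetime(2020, 2, 8, 12, 5, 48, tzinfo=timezone(timedelta(hours=2)))
print(to_utc(naive))  # 2020-02-08 12:05:48+00:00
print(to_utc(aware))  # 2020-02-08 10:05:48+00:00
```

Note that the naive value keeps its clock time, while the aware value is shifted by its two-hour offset.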
Working with enum fields
------------------------
@ -89,36 +195,6 @@ data = SensorData(date=date.today(), temperatures=[25.5, 31.2, 28.7], humidity_l
Note that multidimensional arrays are not supported yet by the ORM.
Working with materialized and alias fields
------------------------------------------
ClickHouse provides an opportunity to create MATERIALIZED and ALIAS Fields.
See documentation [here](https://clickhouse.yandex/docs/en/query_language/queries/#default-values).
Both field types can't be inserted into the database directly, so they are ignored when using the `Database.insert()` method. ClickHouse does not return the field values if you use `"SELECT * FROM ..."` - you have to list these field names explicitly in the query.
Usage:
```python
class Event(models.Model):
    created = fields.DateTimeField()
    created_date = fields.DateTimeField(materialized='toDate(created)')
    name = fields.StringField()
    username = fields.StringField(alias='name')
    engine = engines.MergeTree('created_date', ('created_date', 'created'))
obj = Event(created=datetime.now(), name='MyEvent')
db = Database('my_test_db')
db.insert([obj])
# All values will be retrieved from database
db.select('SELECT created, created_date, username, name FROM $db.event', model_class=Event)
# created_date and username will contain a default value
db.select('SELECT * FROM $db.event', model_class=Event)
```
Working with nullable fields
----------------------------
[ClickHouse provides a NULL value support](https://clickhouse.yandex/docs/en/data_types/nullable).
@ -149,46 +225,6 @@ NOTE: `ArrayField` of `NullableField` is not supported. Also `EnumField` cannot
NOTE: Using `Nullable` almost always negatively affects performance, keep this in mind when designing your databases.
Working with field compression codecs
-------------------------------------
Besides default data compression, defined in server settings, per-field specification is also available.
Supported compression algorithms:
| Codec | Argument | Comment
| -------------------- | -------------------------------------------| ----------------------------------------------------
| NONE | None | No compression.
| LZ4 | None | LZ4 compression.
| LZ4HC(`level`) | Possible `level` range: [3, 12]. | Default value: 9. Greater values give better compression at the cost of higher CPU usage. Recommended value range: [4, 9].
| ZSTD(`level`) | Possible `level` range: [1, 22]. | Default value: 1. Greater values give better compression at the cost of higher CPU usage. Levels >= 20 should be used with caution, as they require more memory.
| Delta(`delta_bytes`) | Possible `delta_bytes` values: 1, 2, 4, 8. | Default value for `delta_bytes` is `sizeof(type)` if it equals 1, 2, 4 or 8, and 1 otherwise.
Codecs can be combined in a pipeline. The default table codec is not included in the pipeline (if it should be applied to a field, you have to specify it explicitly in the pipeline).
Recommended usage for codecs:
- Usually, the values of a particular metric do not differ significantly from point to point. Using delta-encoding can reduce disk space usage significantly.
- DateTime works well with a pipeline of Delta and ZSTD; the column can be compressed to 2-3% of its original size (given smooth datetime data).
- Numeric types usually enjoy best compression rates with ZSTD
- String types enjoy good compression rates with LZ4HC
Usage:
```python
class Stats(models.Model):
    id = fields.UInt64Field(codec='ZSTD(10)')
    timestamp = fields.DateTimeField(codec='Delta,ZSTD')
    timestamp_date = fields.DateField(codec='Delta(4),ZSTD(22)')
    metadata_id = fields.Int64Field(codec='LZ4')
    status = fields.StringField(codec='LZ4HC(10)')
    calculation = fields.NullableField(fields.Float32Field(), codec='ZSTD')
    alerts = fields.ArrayField(fields.FixedStringField(length=15), codec='Delta(2),LZ4HC')
    engine = MergeTree('timestamp_date', ('id', 'timestamp'))
```
Note: This feature is supported on ClickHouse version 19.1.16 and above. Codec arguments will be ignored by the ORM for older versions of ClickHouse.
Working with LowCardinality fields
----------------------------------
Starting with version 19.0 ClickHouse offers a new type of field to improve the performance of queries

@ -199,20 +199,19 @@ class Database(object):
fields_list = ','.join(
['`%s`' % name for name in first_instance.fields(writable=True)])
fmt = 'TSKV' if model_class.has_funcs_as_defaults() else 'TabSeparated'
query = 'INSERT INTO $table (%s) FORMAT %s\n' % (fields_list, fmt)
def gen():
buf = BytesIO()
query = 'INSERT INTO $table (%s) FORMAT TabSeparated\n' % fields_list
buf.write(self._substitute(query, model_class).encode('utf-8'))
first_instance.set_database(self)
buf.write(first_instance.to_tsv(include_readonly=False).encode('utf-8'))
buf.write('\n'.encode('utf-8'))
buf.write(first_instance.to_db_string())
# Collect lines in batches of batch_size
lines = 2
for instance in i:
instance.set_database(self)
buf.write(instance.to_tsv(include_readonly=False).encode('utf-8'))
buf.write('\n'.encode('utf-8'))
buf.write(instance.to_db_string())
lines += 1
if lines >= batch_size:
# Return the current batch of lines

@ -89,6 +89,8 @@ class Field(FunctionOperatorsMixin):
sql += ' ALIAS %s' % string_or_func(self.alias)
elif self.materialized:
sql += ' MATERIALIZED %s' % string_or_func(self.materialized)
elif isinstance(self.default, F):
sql += ' DEFAULT %s' % self.default.to_sql()
elif self.default:
default = self.to_db_string(self.default)
sql += ' DEFAULT %s' % default
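The branch order in this hunk can be summarized with a small standalone sketch: `ALIAS` and `MATERIALIZED` take precedence, an expression default is emitted as SQL for the server to evaluate, and a literal default is rendered as a value. This is a simplified illustration, not the ORM's code; `Expr` and `column_sql` are hypothetical names, and using `repr` for literals is a shortcut that only suits simple numbers.

```python
class Expr:
    """Hypothetical stand-in for an ORM function expression."""
    def __init__(self, sql):
        self.sql = sql

    def to_sql(self):
        return self.sql

def column_sql(name, db_type, default=None, alias=None, materialized=None):
    """Build a column definition, using at most one of the three options."""
    sql = '`%s` %s' % (name, db_type)
    if alias:
        sql += ' ALIAS %s' % alias
    elif materialized:
        sql += ' MATERIALIZED %s' % materialized
    elif isinstance(default, Expr):
        # Expressions are emitted as SQL and evaluated by the server.
        sql += ' DEFAULT %s' % default.to_sql()
    elif default is not None:
        # Literals are rendered as values (repr is a simplification).
        sql += ' DEFAULT %s' % repr(default)
    return sql

print(column_sql('created', 'DateTime', default=Expr('now()')))
print(column_sql('repeated', 'UInt32', default=1))
```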
@ -112,26 +114,6 @@ class Field(FunctionOperatorsMixin):
inner_field = getattr(inner_field, 'inner_field', None)
return False
# Support comparison operators (for use in querysets)
def __lt__(self, other):
return F.less(self, other)
def __le__(self, other):
return F.lessOrEquals(self, other)
def __eq__(self, other):
return F.equals(self, other)
def __ne__(self, other):
return F.notEquals(self, other)
def __gt__(self, other):
return F.greater(self, other)
def __ge__(self, other):
return F.greaterOrEquals(self, other)
class StringField(Field):

@ -9,11 +9,23 @@ import pytz
from .fields import Field, StringField
from .utils import parse_tsv
from .query import QuerySet
from .funcs import F
from .engines import Merge, Distributed
logger = getLogger('clickhouse_orm')
class NoValue:
    '''
    A sentinel for fields with an expression for a default value,
    that were not assigned a value yet.
    '''
    def __repr__(self):
        return '<NO_VALUE>'
NO_VALUE = NoValue()
class ModelBase(type):
'''
A metaclass for ORM models. It adds the _fields list to model classes.
@ -35,13 +47,23 @@ class ModelBase(type):
fields = sorted(fields.items(), key=lambda item: item[1].creation_counter)
# Build a dictionary of default values
defaults = {n: f.to_python(f.default, pytz.UTC) for n, f in fields}
        defaults = {}
        has_funcs_as_defaults = False
        for n, f in fields:
            if f.alias or f.materialized:
                defaults[n] = NO_VALUE
            elif isinstance(f.default, F):
                defaults[n] = NO_VALUE
                has_funcs_as_defaults = True
            else:
                defaults[n] = f.to_python(f.default, pytz.UTC)
attrs = dict(
attrs,
_fields=OrderedDict(fields),
_writable_fields=OrderedDict([f for f in fields if not f[1].readonly]),
_defaults=defaults
_defaults=defaults,
_has_funcs_as_defaults=has_funcs_as_defaults
)
model = super(ModelBase, cls).__new__(cls, str(name), bases, attrs)
@ -195,6 +217,14 @@ class Model(metaclass=ModelBase):
'''
return cls.__name__.lower()
    @classmethod
    def has_funcs_as_defaults(cls):
        '''
        Return True if some of the model's fields use a function expression
        as a default value. This requires special handling when inserting instances.
        '''
        return cls._has_funcs_as_defaults
@classmethod
def create_table_sql(cls, db):
'''
@ -249,6 +279,29 @@ class Model(metaclass=ModelBase):
fields = self.fields(writable=not include_readonly)
return '\t'.join(field.to_db_string(data[name], quote=False) for name, field in fields.items())
    def to_tskv(self, include_readonly=True):
        '''
        Returns the instance's column keys and values as a tab-separated line. A newline is not included.
        Fields that were not assigned a value are omitted.
        - `include_readonly`: if false, returns only fields that can be inserted into database.
        '''
        data = self.__dict__
        fields = self.fields(writable=not include_readonly)
        parts = []
        for name, field in fields.items():
            if data[name] != NO_VALUE:
                parts.append(name + '=' + field.to_db_string(data[name], quote=False))
        return '\t'.join(parts)
    def to_db_string(self):
        '''
        Returns the instance as a bytestring ready to be inserted into the database.
        '''
        s = self.to_tskv(False) if self._has_funcs_as_defaults else self.to_tsv(False)
        s += '\n'
        return s.encode('utf-8')
def to_dict(self, include_readonly=True, field_names=None):
'''
Returns the instance's column values as a dict.
@ -409,3 +462,5 @@ class DistributedModel(Model):
db.db_name, cls.table_name(), cls.engine.table_name),
'ENGINE = ' + cls.engine.create_table_sql(db)]
return '\n'.join(parts)

@ -3,7 +3,7 @@ import unittest
from datetime import date
from infi.clickhouse_orm.database import Database
from infi.clickhouse_orm.models import Model
from infi.clickhouse_orm.models import Model, NO_VALUE
from infi.clickhouse_orm.fields import *
from infi.clickhouse_orm.engines import *
@ -56,6 +56,10 @@ class AliasFieldsTest(unittest.TestCase):
with self.assertRaises(AssertionError):
StringField(alias='str_field', materialized='str_field')
def test_default_value(self):
instance = ModelWithAliasFields()
self.assertEqual(instance.alias_str, NO_VALUE)
class ModelWithAliasFields(Model):
int_field = Int32Field()

@ -1,8 +1,12 @@
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import unittest
import datetime
from infi.clickhouse_orm.database import ServerError, DatabaseException
from infi.clickhouse_orm.models import Model
from infi.clickhouse_orm.engines import Memory
from infi.clickhouse_orm.fields import *
from .base_test_with_data import *
@ -26,6 +30,19 @@ class DatabaseTestCase(TestCaseWithData):
def test_insert__medium_batches(self):
self._insert_and_check(self._sample_data(), len(data), batch_size=100)
    def test_insert__funcs_as_default_values(self):
        class TestModel(Model):
            a = DateTimeField(default=datetime.datetime(2020, 1, 1))
            b = DateField(default=F.toDate(a))
            c = Int32Field(default=7)
            d = Int32Field(default=c * 5)
            engine = Memory()
        self.database.create_table(TestModel)
        self.database.insert([TestModel()])
        t = TestModel.objects_in(self.database)[0]
        self.assertEqual(str(t.b), '2020-01-01')
        self.assertEqual(t.d, 35)
def test_count(self):
self.database.insert(self._sample_data())
self.assertEqual(self.database.count(Person), 100)

@ -3,7 +3,7 @@ import unittest
from datetime import date
from infi.clickhouse_orm.database import Database
from infi.clickhouse_orm.models import Model
from infi.clickhouse_orm.models import Model, NO_VALUE
from infi.clickhouse_orm.fields import *
from infi.clickhouse_orm.engines import *
@ -56,6 +56,10 @@ class MaterializedFieldsTest(unittest.TestCase):
with self.assertRaises(AssertionError):
StringField(materialized='str_field', alias='str_field')
def test_default_value(self):
instance = ModelWithMaterializedFields()
self.assertEqual(instance.mat_str, NO_VALUE)
class ModelWithMaterializedFields(Model):
int_field = Int32Field()

@ -3,9 +3,10 @@ import unittest
import datetime
import pytz
from infi.clickhouse_orm.models import Model
from infi.clickhouse_orm.models import Model, NO_VALUE
from infi.clickhouse_orm.fields import *
from infi.clickhouse_orm.engines import *
from infi.clickhouse_orm.funcs import F
class ModelTestCase(unittest.TestCase):
@ -18,6 +19,7 @@ class ModelTestCase(unittest.TestCase):
self.assertEqual(instance.str_field, 'dozo')
self.assertEqual(instance.int_field, 17)
self.assertEqual(instance.float_field, 0)
self.assertEqual(instance.default_func, NO_VALUE)
def test_assignment(self):
# Check that all fields are assigned during construction
@ -64,14 +66,16 @@ class ModelTestCase(unittest.TestCase):
"float_field": 7.0,
"datetime_field": datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=pytz.utc),
"alias_field": 0.0,
'str_field': 'dozo'
"str_field": "dozo",
"default_func": NO_VALUE
})
self.assertDictEqual(instance.to_dict(include_readonly=False), {
"date_field": datetime.date(1973, 12, 6),
"int_field": 100,
"float_field": 7.0,
"datetime_field": datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=pytz.utc),
'str_field': 'dozo'
"str_field": "dozo",
"default_func": NO_VALUE
})
self.assertDictEqual(
instance.to_dict(include_readonly=False, field_names=('int_field', 'alias_field', 'datetime_field')), {
@ -109,5 +113,6 @@ class SimpleModel(Model):
int_field = Int32Field(default=17)
float_field = Float32Field()
alias_field = Float32Field(alias='float_field')
default_func = Float32Field(default=F.sqrt(float_field) + 17)
engine = MergeTree('date_field', ('int_field', 'date_field'))