mirror of https://github.com/Infinidat/infi.clickhouse_orm.git synced 2025-07-11 00:22:24 +03:00

History

Itai Shirav 40a1e21348 Added usage examples		2020-06-26 17:53:39 +03:00
..
.gitignore	Added usage examples	2020-06-26 17:53:39 +03:00
download_ebooks.py	Added usage examples	2020-06-26 17:53:39 +03:00
load.py	Added usage examples	2020-06-26 17:53:39 +03:00
models.py	Added usage examples	2020-06-26 17:53:39 +03:00
README.md	Added usage examples	2020-06-26 17:53:39 +03:00
requirements.txt	Added usage examples	2020-06-26 17:53:39 +03:00
search.py	Added usage examples	2020-06-26 17:53:39 +03:00

README.md

Full Text Search

This example shows how ClickHouse might be used for searching for word sequences in texts. It's a nice proof of concept, but for production use there are probably better solutions, such as Elasticsearch.

Running the code

Create a virtualenv and install the required libraries:

virtualenv -p python3.6 env
source env/bin/activate
pip install -r requirements.txt

Run the download_ebooks script to download a dozen classical books from The Gutenberg Project:

python download_ebooks.py

Run the load script to populate the database with the downloaded texts:

python load.py

And finally, run the full text search:

 python search.py "cheshire cat"

Asterisks can be used as wildcards (each asterisk stands for one word):

 python search.py "much * than"

How it works

The models.py file defines an ORM model for storing each word in the indexed texts:

class Fragment(Model):

    language = LowCardinalityField(StringField(default='EN'))
    document = LowCardinalityField(StringField())
    idx      = UInt64Field()
    word     = StringField()
    stem     = StringField()

    # An index for faster search by document and fragment idx
    index    = Index((document, idx), type=Index.minmax(), granularity=1)

    # The primary key allows efficient lookup of stems
    engine   = MergeTree(order_by=(stem, document, idx), partition_key=('language',))

The document (name) and idx (running number of the word inside the document) fields identify the specific word. The word field stores the original word as it appears in the text, while the stem contains the word after normalization, and that's the field which is used for matching the search terms. Stemming the words makes the matching less strict, so that searching for "swallowed" will also find documents that mention "swallow" or "swallowing".

Here's what some records in the fragment table might look like:

language	document	idx	word	stem
EN	Moby Dick; or The Whale	4510	whenever	whenev
EN	Moby Dick; or The Whale	4511	it	it
EN	Moby Dick; or The Whale	4512	is	is
EN	Moby Dick; or The Whale	4513	a	a
EN	Moby Dick; or The Whale	4514	damp,	damp
EN	Moby Dick; or The Whale	4515	drizzly	drizzli
EN	Moby Dick; or The Whale	4516	November	novemb
EN	Moby Dick; or The Whale	4517	in	in
EN	Moby Dick; or The Whale	4518	my	my
EN	Moby Dick; or The Whale	4519	soul;	soul

Let's say we're looking for the terms "drizzly November". Finding the first in the sequence (after stemming it) is fast and easy:

query = Fragment.objects_in(db).filter(stem='drizzli').only(Fragment.document, Fragment.idx)

We're interested only in the document and idx fields, since they identify a specific word.

To find the next word in the search terms, we need a subquery similar to the first one, with an additional condition that its index will be one greater than the index of the first word:

subquery = Fragment.objects_in(db).filter(stem='novemb').only(Fragment.document, Fragment.idx)
query = query.filter(F.isIn((Fragment.document, Fragment.idx + 1), subquery))

And so on, by adding another subquery for each additional search term we can construct the whole sequence of words.

As for wildcard support, when encountering a wildcard in the search terms we simply skip it - it does not need a subquery (since it can match any word). It only increases the index count so that the query conditions will "skip" one word in the sequence.

The algorithm for building this compound query can be found in the build_query function.