Database workloads#
OLTP#
OLTP: On-Line Transaction Processing
Fast operations that involve reading and updating a small range of data.
OLAP#
OLAP: On-Line Analytical Processing
Supports complex analytical operations for decision support, etc.
HTAP#
Hybrid Transaction + Analytical Processing
A new workload that combines OLTP and OLAP.
Storage Models#
N-Ary Storage Model (NSM)#
In the n-ary storage model, the DBMS stores all attributes of a tuple contiguously in a single page.
This is ideal for OLTP.
Advantages:
- Fast insertion, update, and deletion
- Efficient for queries that require the entire tuple
Disadvantages:
- Inefficient for queries that require scanning the entire table or only need one attribute
Decomposition Storage Model (DSM)#
In the decomposition storage model, each attribute of all tuples is stored separately.
Also known as a "column store".
It is efficient for read-only OLAP operations, especially those that only require scanning certain attributes.
Advantages:
- Reduces I/O waste
- Better for querying and compressing data
Disadvantages:
- Slower for single-point insert, update, delete, and lookup operations, since one tuple's attributes are spread across pages
To implement this, there are two approaches:
- Use a fixed-length word for each attribute, so the value of tuple i can be located with offset arithmetic alone.
- A less common approach stores each value as a pair (id: pos), indicating that the value belonging to tuple id is stored at position pos.
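The fixed-length approach above can be sketched as follows. This is a minimal illustration, not a real page layout; the class name and format string are my own choices, assuming 32-bit little-endian integers.

```python
import struct

class FixedWidthColumn:
    """One column of a decomposition (column) store with fixed-width values."""

    def __init__(self, width_bytes=4, fmt="<i"):
        self.width = width_bytes   # every value occupies the same space
        self.fmt = fmt             # "<i": little-endian 32-bit signed int
        self.data = bytearray()

    def append(self, value):
        self.data += struct.pack(self.fmt, value)

    def get(self, i):
        # The value of tuple i sits at offset i * width -- no tuple
        # header to parse, just offset arithmetic.
        return struct.unpack_from(self.fmt, self.data, i * self.width)[0]

col = FixedWidthColumn()
for v in [10, 20, 30, 40]:
    col.append(v)
print(col.get(2))  # -> 30
```

Because every value has the same width, a scan over one attribute touches only this column's bytes and never reads the other attributes of the tuple.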
Database Compression#
I/O is time-consuming and often the bottleneck of the entire database, so DBMSs make extensive use of compression to improve performance.
We usually need to trade off between speed and compression ratio.
Compression Granularity#
- Block Level
- Tuple Level (NSM Only)
- Attribute Level
- Columnar Level
Naive Compression#
Using a general-purpose compression algorithm is a common solution. However, with this method the DBMS cannot interpret the data it is operating on until it has been decompressed.
Common examples: LZO (1996), LZ4 (2011), Snappy (2011), Oracle OZIP (2014), Zstd (2015).
To improve speed, we need compression schemes that let the DBMS operate on the data without decompressing it first.
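The opacity problem can be demonstrated with zlib from the Python standard library, used here only as a stand-in for the compressors named above (LZ4, Snappy, and Zstd are third-party packages); the row format is made up for the example.

```python
import zlib

rows = b"alice,25;bob,30;carol,25;"
block = zlib.compress(rows)

# The compressed block is opaque bytes: a predicate such as age == 25
# cannot be evaluated against `block` directly. The whole block must be
# decompressed before filtering.
restored = zlib.decompress(block)
matches = [r for r in restored.split(b";") if r.endswith(b",25")]
print(matches)  # -> [b'alice,25', b'carol,25']
```

The columnar schemes below avoid this round trip: their encodings preserve enough structure that queries can work on the compressed representation.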
Columnar Compression#
Run-Length Encoding (RLE)#
We can compress values that appear consecutively in the same column into triplets (value: pos: num), where:
- value is the repeated value
- pos is the starting position of the run
- num is the number of times the value is repeated
A drawback is that unsorted data produces many short runs; sorting the column first, so equal values become adjacent, yields far better compression.
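A minimal run-length encoder producing the (value, pos, num) triplets described above might look like this (function names are illustrative):

```python
def rle_encode(column):
    """Compress a column into (value, start_pos, run_length) triplets."""
    runs, i = [], 0
    while i < len(column):
        j = i
        while j < len(column) and column[j] == column[i]:
            j += 1                       # extend the current run
        runs.append((column[i], i, j - i))
        i = j
    return runs

def rle_decode(runs):
    out = []
    for value, _pos, num in runs:
        out.extend([value] * num)
    return out

col = ["M", "M", "M", "F", "F", "M"]
print(rle_encode(col))  # -> [('M', 0, 3), ('F', 3, 2), ('M', 5, 1)]
```

On this unsorted column the trailing "M" starts a third run; sorting the column first would collapse the data into just two triplets.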
Bit-Packing Encoding#
Some data is highly redundant, and we can reduce this redundancy by using bit-packing.
Converting int64 to int8, for example, cuts the required space by a factor of eight.
However, some values may not fit into int8. A common workaround is to pack the values that do fit and keep the rare outliers at full width in a separate patch table (sometimes called "mostly" encoding).
This only pays off when the outliers are few, so the additional storage stays minimal.
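A sketch of this "mostly int8" idea, assuming a dict keyed by position as the patch table (the function names and placeholder value 0 are my own choices):

```python
import array

def pack_mostly_int8(values):
    """Pack values into signed 8-bit slots; overflowing values go to a patch table."""
    packed = array.array("b")   # "b": signed 8-bit integers
    patches = {}                # position -> full-width outlier value
    for i, v in enumerate(values):
        if -128 <= v <= 127:
            packed.append(v)
        else:
            packed.append(0)    # placeholder; real value lives in `patches`
            patches[i] = v
    return packed, patches

def unpack(packed, patches):
    # Patch-table entries override the packed placeholder.
    return [patches.get(i, v) for i, v in enumerate(packed)]

vals = [3, 7, 99999, 5]
packed, patches = pack_mostly_int8(vals)
print(patches)  # -> {2: 99999}
```

If many values overflowed, the patch table would grow past the savings from the narrow array, which is why the scheme suits mostly-small data only.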
Bitmap Encoding#
When an attribute has only a few distinct values, we can use bitmap encoding: one bit vector per distinct value, where the i-th bit is set if tuple i holds that value.
For example, when a column contains only the two values F and M, two bit vectors suffice, one marking the F tuples and one marking the M tuples.
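The idea can be sketched with Python integers standing in for bit vectors (the function name is illustrative):

```python
def bitmap_encode(column):
    """One bitmap per distinct value; bit i is set when tuple i holds that value."""
    bitmaps = {}
    for i, v in enumerate(column):
        bitmaps[v] = bitmaps.get(v, 0) | (1 << i)
    return bitmaps

col = ["F", "M", "M", "F"]
bm = bitmap_encode(col)
print(bin(bm["F"]))  # -> 0b1001  (tuples 0 and 3 are F)
print(bin(bm["M"]))  # -> 0b110   (tuples 1 and 2 are M)
```

A predicate such as `sex = 'F'` then reduces to scanning one bitmap, and conjunctions become bitwise AND on the compressed form, with no decompression needed.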
Delta Encoding#
In many cases, such as room-temperature readings, values cluster densely within a narrow range.
Therefore, after recording a base value, each subsequent value can be stored as a delta from its predecessor, which needs far fewer bits than the full value.
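A minimal sketch, using temperatures stored as integer tenths of a degree (the data and function names are made up for the example):

```python
def delta_encode(values):
    """Keep the first value; store each later value as a difference."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)   # running sum rebuilds the originals
    return out

temps = [215, 216, 216, 214]      # tenths of a degree
print(delta_encode(temps))        # -> [215, 1, 0, -2]
```

After the first value, the deltas here all fit in a couple of bits, so delta encoding combines naturally with the bit-packing scheme above.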
Incremental Encoding#
We can also avoid repeating the common prefix/suffix that each value shares with its neighbor, storing only the length of the shared part plus the remainder; this works best on sorted data.
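A prefix variant can be sketched like this, assuming the column is already sorted (the function name is illustrative):

```python
def prefix_encode(sorted_strings):
    """Store (shared_prefix_len, remaining_suffix) against the previous entry."""
    out, prev = [], ""
    for s in sorted_strings:
        n = 0
        while n < min(len(prev), len(s)) and prev[n] == s[n]:
            n += 1                 # length of the prefix shared with prev
        out.append((n, s[n:]))
        prev = s
    return out

words = ["rob", "robbed", "robbing", "robot"]
print(prefix_encode(words))
# -> [(0, 'rob'), (3, 'bed'), (4, 'ing'), (3, 'ot')]
```

Decoding walks the list in order, splicing each suffix onto the retained prefix of the previous entry, which is why the scheme depends on sorted input.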
Dictionary Compression#
When the same values recur in many places throughout a column, we can replace each occurrence with a small fixed-width code and keep a dictionary that maps codes back to the original values.
This is the most commonly used compression method in practice.
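A minimal sketch of dictionary encoding (codes are assigned in order of first appearance; the function names are illustrative):

```python
def dict_encode(column):
    """Replace each value with a small integer code; return codes + dictionary."""
    dictionary, codes = {}, []
    for v in column:
        if v not in dictionary:
            dictionary[v] = len(dictionary)   # next unused code
        codes.append(dictionary[v])
    return codes, dictionary

col = ["Andy", "Bea", "Andy", "Andy", "Bea"]
codes, d = dict_encode(col)
print(codes)  # -> [0, 1, 0, 0, 1]

# Decoding inverts the dictionary and looks each code up.
decode = {c: v for v, c in d.items()}
assert [decode[c] for c in codes] == col
```

Real systems often assign codes in sorted value order, so that comparisons and range predicates can be evaluated directly on the codes without consulting the dictionary.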