Lecture number 07: Hash Tables

Please refer to the following resources for more information:

cover

1. Data Structure#

DBMS mainly includes the following types of data:

Internal Meta-Data: Records information about the database and system.
Core Data Storage: Records information about tuples.
Temporary Data Structures
Table Indices

The main considerations are:

Data organization
Data reliability

2. Hash Table#

hash_table

hash_table_target

Hash Function

Converts a value into a smaller key.
Balances between speed and collision rate.

Hashing Scheme

How to handle collisions.
Balancing between hash table size and additional maintenance information.

3. Hash Function#

A function that outputs a specific number regardless of the input key.

Hash_Function

Hash_Function bench_mark

4. Static Hashing Schemes#

4.1 Linear Probe Hashing#

A linear probing method is used to resolve hash collisions. If the position returned by the hash function is already occupied, the algorithm continues to search for the next available position until an empty slot is found.

In the case of deletion, a value may not be directly produced by the hash function, but obtained by continuously searching for the next position.

In this case, there are two methods to handle it:

Adjust the values affected by the offset by rehashing. This method is less efficient.
Mark the deleted values as "deleted". These values can be reused when new values need to be inserted. However, periodic garbage collection is required.

Storing the same values

NON-UNIQUE-KEY

When storing values with the same key but different values, there are two ways to handle it:

Allocate a separate space for each key and its corresponding value.
Hash all keys into one storage space using the hash function.

4.2 Robin Hood Hashing#

This is an upgraded version of linear probing that aims to solve the problem of uneven distribution. In the previous algorithm, some keys may cluster together after hash calculations, leading to a degradation of the hash algorithm. Robin Hood hashing solves collisions by redistributing some "rich" keys to "poor" keys.

In this method, each value is assigned an offset, and when searching downwards in the table, the current value needs to be compared with the values in the table. If the current offset is greater than the offset of the value in the table, the current value is swapped with the value in the table. This process continues until an empty slot is found.

void Robin_hood_Hashing(int key) {
     int val = hash(key);
     int offset = 0;
     for (int i = val ; ; i ++ , offset ++ ) {
          if(hash_table->is_empty(i)){ 
               hash_table->insert(i , val , offset);
               break;
          }

          if(hash_table->get_offset(i) < offset) {
               auto [n_val , n_offset] = hash_table->get_data(i);
               hash_table->insert(i , val , offset);
               val = n_val , offset = n_offset;
          }
          if(i == MAXSIZE) i = 0;
     }
}

4.3 Cuckoo Hashing#

This method of resolving collisions is achieved by using two hash functions and two hash tables.

For an input key, it calculates val[0] and val[1], and randomly chooses one of them as the final result. If one of them triggers a collision, it uses a hash function and hash table. If both of them collide, the collision is recalculated until there is no collision.

Using this method, the complexity of querying is always O(1), but the cost of insertion is increased.

Open-source CMU implementation

5 Dynamic Hashing Schemes#

Observation

The above hash tables are based on fixed-size elements, so dynamic hash tables are needed.

5.1 Chained Hashing#

Maintains a linked list for each hash value. All keys with the same hash value are stored in one bucket, and the buckets are connected in the form of linked lists.

Chained Hashing