Overview

The Adaptive Radix Tree (ART) is an efficient in-memory index data structure for modern architectures that beats read-only search trees on lookup performance while also supporting insertion and deletion. We implemented three synchronization protocols for the Adaptive Radix Tree and compared their performance under low and high contention. Our results show that our optimistic lock coupling implementation scales better than both the fine-grained lock coupling and the coarse-grained lock approach, under writer-only low-contention scenarios as well as mixed reader/writer high-contention scenarios. The results also suggest that using shared pointers for memory reclamation is inefficient in a parallel setting.

This project is developed as part of a final project for the 15-418/618 Parallel Computer Architecture and Programming class.

[Final Report] [Poster]

| Week | Work Item | Work Split | Status |
|------|-----------|------------|--------|
| 04/16 - 04/21 | Sequential implementation - insert, get | Yuchen | |
| 04/16 - 04/21 | Sequential implementation - remove | Hang | |
| 04/22 - 04/24 | Garbage collection and lock implementation | Yuchen | |
| 04/22 - 04/24 | Benchmark setup | Hang | |
| 04/25 - 04/30 | LC and OLC implementation - insert, get | Yuchen | |
| 04/25 - 04/30 | LC and OLC implementation - remove | Hang | |
| 05/01 - 05/05 | Final report/result collection and analysis | Together | |
| 05/06 | Poster Session | Together | |

Project Proposal

Authors: Yuchen Liang (yuchenl3) and Hang Shu (hangshu)

URL to project homepage: https://yliang412.github.io/cart

Proposal PDF: https://yliang412.github.io/cart/files/proposal.pdf

Summary

We aim to implement an efficient concurrent Adaptive Radix Tree (ART) in Rust. ART is one of the most performant main-memory index data structures on modern architectures, beating read-only search trees on lookup performance while also supporting insertion and deletion. We are interested in exploring and implementing one of the two synchronization techniques described by the authors, namely optimistic lock coupling and read-optimized write exclusion.

Background

On a parallel architecture, it is critical to have an efficient synchronization approach so that the performance of the data structure scales with the number of processors. Holding a single mutex over the entire data structure is clearly not a solution, as it would serialize all operations and severely limit the throughput of the system.

For a tree-like data structure, fine-grained locking is a traditional synchronization approach that is relatively easy to implement. For instance, on a B+ tree, one would place a reader-writer lock on each tree node, protecting the subtree beneath it. However, this approach does not scale well due to the overhead of acquiring and releasing locks on modern architectures. Lock-free data structures (e.g., skip lists), on the other hand, use atomic operations to let multiple threads access and modify the underlying content without waiting on a lock. Compared to fine-grained locking, designing a completely lock-free data structure is more difficult and requires additional indirection.
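To make the lock-coupling idea concrete, here is a minimal C++ sketch (hypothetical names, not any paper's API): the traversal acquires a child's latch before releasing the parent's, so a reader never crosses an edge that a writer is modifying. For simplicity the node is a Node256-style table with one child slot per key byte.

```cpp
#include <array>
#include <cstddef>
#include <shared_mutex>

// Hypothetical node: a Node256-style inner node with a per-node
// reader-writer latch. Field names are illustrative.
struct Node {
    std::shared_mutex latch;
    bool is_leaf = false;
    int value = 0;                     // payload, used by leaves only
    std::array<Node*, 256> children{}; // one slot per possible key byte
};

// Lock-coupled lookup: the child's latch is acquired before the parent's
// latch is released, so no thread ever observes a half-updated path.
bool lookup(Node* root, const unsigned char* key, size_t depth, int& out) {
    Node* node = root;
    node->latch.lock_shared();         // latch the root first
    while (node != nullptr) {
        if (node->is_leaf) {
            out = node->value;
            node->latch.unlock_shared();
            return true;
        }
        Node* child = node->children[key[depth]];
        if (child != nullptr)
            child->latch.lock_shared(); // couple: child before parent release
        node->latch.unlock_shared();
        node = child;
        ++depth;
    }
    return false;                       // key byte had no child
}
```

The scalability problem is visible right in the sketch: every traversal, even a pure read, performs two atomic lock operations per node, and the root's latch is touched by every single operation.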

We are interested in exploring two synchronization approaches that balance scalability and ease of use. The first approach is optimistic lock coupling. Instead of directly taking a lock during traversal, one optimistically assumes there are no concurrent modifications and detects conflicts with per-node version counters. The operation restarts if a conflict is detected, but in the common case this approach improves performance by avoiding the unnecessary cache misses caused by lock acquisition.
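The version-counter mechanism can be sketched as a single lock word, assuming the layout used in the ART synchronization paper's spirit (low bit marks "locked", the remaining bits hold the version); the struct and method names here are illustrative:

```cpp
#include <atomic>
#include <cstdint>

// Minimal sketch of an optimistic lock. Bit 0 = locked; the version
// (bits 1..63) advances on every write unlock, so readers can detect
// that a modification happened while they were traversing.
struct OptLock {
    std::atomic<uint64_t> word{0};

    // Reader entry: snapshot the version; fail if a writer holds the lock.
    bool read_lock(uint64_t& version) const {
        version = word.load(std::memory_order_acquire);
        return (version & 1) == 0;     // locked -> caller must restart
    }
    // Reader exit: validate that no write happened since the snapshot.
    bool read_unlock(uint64_t version) const {
        return word.load(std::memory_order_acquire) == version;
    }
    // Writer entry: CAS from the observed unlocked version to locked.
    bool write_lock(uint64_t version) {
        return word.compare_exchange_strong(version, version | 1);
    }
    // Writer exit: clear the lock bit and advance the version in one step.
    void write_unlock() {
        word.fetch_add(1, std::memory_order_release);
    }
};
```

A read traverses optimistically, calling `read_lock` at each node and `read_unlock` before acting on anything it read; a failed validation means a concurrent writer interfered and the whole operation restarts from the root.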

The second and more interesting approach is read-optimized write exclusion. This approach guarantees that reads never block or restart. In this scheme, a writer takes a lock to exclude other writers, while a reader never acquires any locks. Reader-writer exclusion is instead provided by using atomic operations to access and modify the internal data structures.
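The core mechanism can be illustrated with a single child slot (a simplified sketch, not the full ROWEX protocol; `Slot` and its methods are hypothetical names): writers serialize on a per-node mutex, while readers only perform atomic loads and therefore never wait or restart.

```cpp
#include <atomic>
#include <mutex>

// Simplified ROWEX-style slot: the mutex excludes other *writers* only.
// Readers see either the old or the new pointer, never a torn value.
struct Slot {
    std::mutex write_lock;
    std::atomic<int*> child{nullptr};

    int* read() const {                       // lock-free reader path
        return child.load(std::memory_order_acquire);
    }
    void write(int* p) {                      // writers take the lock
        std::lock_guard<std::mutex> g(write_lock);
        child.store(p, std::memory_order_release); // published atomically
    }
};
```

The real scheme must keep *every* intermediate state of a node consistent for readers (e.g., installing a key byte before publishing the child pointer), which is what makes ROWEX harder to implement than OLC.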

In addition to exploring these synchronization approaches, we are also interested in providing an efficient ART implementation for the Rust ecosystem while surveying the current language and library support for designing concurrent data structures.

Challenges

The most challenging part of making ART concurrent is that we want performance to scale while supporting both read and write operations at the same time. We will make extensive use of synchronization primitives and atomics to guarantee the integrity of the data structure while maintaining high performance.

Concurrent data structure debugging and testing

Handling and debugging data races and deadlocks can be very tricky in a concurrent setting, and verifying the correctness of the data structure is even harder. We think Rust's ownership system might help us here, but it will still be quite challenging given that we will likely need unsafe Rust.

Garbage Collection

In a concurrent setting, the memory of a deleted node cannot be reclaimed immediately because other readers might still be accessing it. Therefore, we want to defer memory reclamation until it is safe to do so. Unlike Java and other GC-based languages, where the language runtime takes care of memory reclamation, garbage collection in Rust needs to be handled by the programmer. Luckily, there are Rust libraries implementing epoch-based or similar reclamation schemes that can help with this task.

Resources

We will base our implementation heavily on the original paper [1] describing the data structure and the follow-up paper [2] on parallelizing it. We will likely implement garbage collection with the help of the crossbeam-epoch or the seize crate.

[1] V. Leis et al., "The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases," in ICDE, 2013. https://db.in.tum.de/~leis/papers/ART.pdf

[2] V. Leis et al., "The ART of Practical Synchronization," in DaMoN, 2016. https://db.in.tum.de/~leis/papers/artsync.pdf

Goals and Deliverables

75% Goals

  • Related works reading. Understand the ART data structure and the synchronization approaches described in the papers.
  • Sequential implementation of ART. Implement a version of ART with no synchronization. This version will help us understand the data structure better while serving as a reference to compare against.
  • Concurrent implementation of ART using lock coupling (LC). As shown in the paper, the code structures for lock coupling and optimistic lock coupling are very similar. The LC implementation of ART will also serve as an alternative synchronization approach to compare against.
  • Extensive testing on implemented sequential and concurrent ART implementations.

100% Goals

75% Goals, plus:

  • Scalability analysis. We will run microbenchmarks to investigate the scalability of each finished implementation under low contention.
  • Contention analysis. We will run microbenchmarks to measure the performance of each finished implementation under simultaneous read and write operations.
  • Concurrent implementation of ART using optimistic lock coupling (OLC). Modify the LC version to do OLC.
  • Benchmark against other concurrent data structures in Rust. Our goal is to have performance comparable to both unordered and ordered concurrent data structures (dashmap, crossbeam-skiplist).

125% Goals

100% Goals, plus:

  • Concurrent implementation of ART using read-optimized write exclusion (ROWEX). The ROWEX synchronization approach should further improve read performance since a read will not wait or restart in this scheme.
  • Run YCSB benchmark. In addition to the microbenchmarks, it might be beneficial to evaluate the performance on a synthetic workload. YCSB is a common set of workloads for evaluating the performance of different key-value stores, which fits our scope. We might need to port this benchmark to Rust since the workload generators and clients are written in Java.
  • SIMD optimizations. SIMD-based parallel comparison can be used to further improve lookup performance.
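For reference, the ART paper's Node16 child search is the canonical example of such a SIMD comparison: the search byte is compared against all 16 stored key bytes in one instruction. The sketch below assumes x86 SSE2 and a GCC/Clang builtin; `find_child_index` is an illustrative name.

```cpp
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// Compare `byte` against all 16 key slots of a Node16 at once and return
// the index of the first matching valid slot, or -1 if there is none.
// `count` is the number of valid entries stored in `keys`.
int find_child_index(const uint8_t keys[16], int count, uint8_t byte) {
    __m128i needle = _mm_set1_epi8(static_cast<char>(byte));  // byte x16
    __m128i stored =
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(keys));
    __m128i cmp  = _mm_cmpeq_epi8(needle, stored);            // 0xFF on match
    int mask = _mm_movemask_epi8(cmp) & ((1 << count) - 1);   // drop unused slots
    if (mask == 0) return -1;
    return __builtin_ctz(mask);   // position of the first matching slot
}
```

A scalar loop over 16 bytes does the same work in up to 16 compares and branches; the SIMD version is branch-free except for the final miss check.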

Demo

A test workload will be run on the sequential and parallel versions of our ART implementation, and we will specifically look at the speedup of the parallel versions with increasing thread count. Our poster will also show speedup graphs for our ART implementation.

Analysis

We will compare the speedup of our parallel implementations against our sequential implementation. We are also interested in the performance difference between traditional lock coupling and optimistic lock coupling. If time permits, we will add read-optimized write exclusion to the comparison between the parallel implementations.

Platform Choice

We will develop locally on our own laptops but use the GHC and PSC machines to test concurrency and run benchmark experiments with higher core counts.

Schedule

| Week | Work Item | Status |
|------|-----------|--------|
| 03/25 - 03/31 | Related work reading | |
| 04/01 - 04/07 | Sequential implementation | |
| 04/08 - 04/14 | LC implementation + Microbenchmark setup | |
| 04/15 - 04/21 | OLC implementation | |
| 04/22 - 04/28 | Evaluation + Optimizations | |
| 04/29 - 05/05 | Benchmarking + Final Report | |
| 05/06 | Poster Session | |

Milestone Report

Progress So Far

We familiarized ourselves with the Adaptive Radix Tree data structure and the ways to parallelize it by reading the two papers. However, we encountered some problems implementing the data structure in Rust. The most ergonomic way to implement the various ART node types is an enum (tagged union), which takes the size of its largest variant; under this layout we would lose the advantage of having smaller node types. Modifying the memory layout would require specifying our own allocation layout in unsafe Rust, which we don't feel comfortable doing. Also, the reference sequential implementation uses many memory-copying operations instead of swapping pointers. Thus, we decided to pivot to C++, a language both of us are more familiar with. It also gives us more flexibility in controlling the internal memory layout and working with raw pointers.
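To illustrate the layout issue: in C++, each node type can be its own struct behind a shared header, so a `Node4` is genuinely smaller than a `Node256`, whereas a Rust enum pads every variant to the size of the largest one. A simplified sketch with illustrative names:

```cpp
#include <cstdint>

// Each ART node type gets its own struct with a common header, so the
// allocator can hand out exactly as much memory as that type needs.
enum class NodeType : uint8_t { N4, N16, N48, N256 };

struct NodeHeader {
    NodeType type = NodeType::N4;
    uint8_t  num_children = 0;
};

struct Node4 : NodeHeader {      // 4 sorted key bytes + 4 child pointers
    uint8_t keys[4] = {};
    void*   children[4] = {};
};

struct Node256 : NodeHeader {    // one child slot per possible key byte
    void* children[256] = {};
};
```

In a Rust `enum Node { N4(...), N256(...) }`, every `N4` would occupy as much memory as an `N256`, defeating the adaptive node sizing that gives ART its space efficiency.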

New Goals and Deliverables

Compared to the goals listed in the proposal, we decided not to include the comparison between our implementation and other concurrent data structures due to time constraints. We will still have thorough scalability and contention analyses.

75% Goals

  • Related works reading. Understand the ART data structure and the synchronization approaches described in the papers.
  • Sequential implementation of ART. Implement a version of ART with no synchronization. This version will help us understand the data structure better while serving as a reference to compare against.
  • Concurrent implementation of ART using lock coupling (LC). As shown in the paper, the code structures for lock coupling and optimistic lock coupling are very similar. The LC implementation of ART will also serve as an alternative synchronization approach to compare against.
  • Extensive testing on implemented sequential and concurrent ART implementations.

100% Goals

75% Goals, plus:

  • Scalability analysis. We will run microbenchmarks to investigate the scalability of each finished implementation under low contention.
  • Contention analysis. We will run microbenchmarks to measure the performance of each finished implementation under simultaneous read and write operations.
  • Concurrent implementation of ART using optimistic lock coupling (OLC). Modify the LC version to do OLC.

125% Goals

100% Goals, plus:

  • Concurrent implementation of ART using read-optimized write exclusion (ROWEX). The ROWEX synchronization approach should further improve read performance since a read will not wait or restart in this scheme.
  • Run YCSB benchmark. In addition to the microbenchmarks, it might be beneficial to evaluate the performance on a synthetic workload. YCSB is a common set of workloads for evaluating the performance of different key-value stores, which fits our scope.
  • SIMD optimizations. SIMD-based parallel comparison can be used to further improve lookup performance.

Demo

A test workload will be run on the sequential and parallel versions of our ART implementation, and we will specifically look at the speedup of the parallel versions with increasing thread count. Our poster will also show speedup graphs for our ART implementation.

Analysis

We will compare the speedup of our parallel implementations against our sequential implementation. We are also interested in the performance difference between traditional lock coupling and optimistic lock coupling. If time permits, we will add read-optimized write exclusion to the comparison between the parallel implementations.

New Schedule

| Week | Work Item | Work Split | Status |
|------|-----------|------------|--------|
| 04/16 - 04/21 | Sequential implementation - insert, get | Yuchen | |
| 04/16 - 04/21 | Sequential implementation - remove | Hang | |
| 04/22 - 04/24 | Garbage collection and lock implementation | Yuchen | |
| 04/22 - 04/24 | Benchmark setup | Hang | |
| 04/25 - 04/30 | LC and OLC implementation - insert, get | Yuchen | |
| 04/25 - 04/30 | LC and OLC implementation - remove | Hang | |
| 05/01 - 05/05 | Final report/result collection and analysis | Together | |
| 05/06 | Poster Session | Together | |

Concerns

One minor concern we have is the time constraint on implementing both LC and OLC. The sequential version of ART will already take some time to implement, so there is a chance we may run out of time to implement both lock coupling and optimistic lock coupling; in that case, we will likely pick only one. However, this isn't a huge concern, as the implementation difference between LC and OLC should not be too big.

Final Report

[Final Report]

[Poster]