Delphix Replication
Customers use Delphix to efficiently deliver production data to application development and QA environments. As Delphix becomes an integral part of the application development life cycle, protecting this data against catastrophic failure becomes critical. Delphix provides a native service to replicate application data from one Delphix engine (source) to another (target). Application data is stored efficiently, with block-aware compression giving 2-4x savings in storage. Recovery is fast and can be done at a finer granularity than with any storage- or VM-based solution. For example, we can recover data from a single database or from a table within a database.
Today our customers use replication to service different use cases. Replication can be used to provide failover protection to critical application development environments. Replicating to a local standby Delphix engine provides protection against server or storage crashes. Replicating to a remote Delphix engine provides protection against site outages. Replication can also be used as a migration tool for infrastructure expansion, load balancing and storage migration. We have customers using Delphix as their primary data protection solution for application development environments.
In our latest release, 3.1, we improved the replication facilities to provide better value to our customers. The majority of these improvements came from using our in-house Delphix Session Protocol (DSP), developed by my colleague Peng Dai. Using DSP as the primary transport for our native Replication service enabled a number of enhancements to performance and functionality, and in this post I will touch on some of them. A thorough discussion of DSP is beyond the scope of this post; Peng will cover it in his upcoming blog post.
Improved Performance
The chart in Fig. 2 shows replication throughput between two Delphix engines. The source Delphix engine in this case synchronizes with three databases of different sizes, each supporting 12 Virtual Databases (VDBs). These VDBs handle read-heavy, write-heavy and OLTP-style workloads from their respective applications. The Delphix engine is replicated with all the databases synchronized and running their workloads. On our test platform, throughput improved by more than 3x between Delphix 3.0 and 3.1. Note that this improvement is currently tied to the underlying physical infrastructure available for Replication. For our upcoming release, we are investing in reducing the dependency on physical infrastructure through better compression and throughput throttling. The details will be covered in an upcoming blog post.
Transport Enhancements
With 3.1, we migrated the Replication transport from NDMP to DSP. DSP enables efficient use of network bandwidth and the ability to recover gracefully from transport errors. In 3.1, DSP allows us to overcome transport level outages by automatically resuming replication. This is especially useful for transferring large environments across a WAN.
Interoperability
Another major improvement to Replication in 3.1 is the ability to use the replication target as another full-fledged Delphix engine. Until now, the replication target was only used as a standby for another Delphix engine, providing HA/DR. With 3.1, customers can use the replication target to synchronize with other production databases and provision VDBs, while still replicating other environments. This will open more use cases and increase the value customers can draw from the second Delphix engine.
Enhanced Topologies
In 3.1, customers can configure complex topologies for replication. A Delphix engine can replicate different environments to multiple replication target engines. One replication target can receive replication streams from multiple primary engines. This enables customers to distribute replication services across different engines to get maximum protection with minimal overhead. Replication policies can then be orchestrated to suit the needs of the different application development environments.
With the offerings in our latest release, you can imagine a scenario where multiple Delphix engines are cross-purposed in a complex topology. Together they act as an extended protection layer with superior recovery time and granularity of recovery points, while also providing high availability for multiple application development environments.
Given the vantage point we occupy in our customers’ application development environments, Delphix becomes a natural and superior alternative to the backup/recovery services provided by VM- or storage-based solutions. The improvements in speed and functionality in our latest release are just the opening salvo in a series of enhancements we are planning to roll out in our upcoming releases.
ZFS Write Performance (Impact of fragmentation)
At Delphix we are constantly trying to improve performance. Some enhancements come from customer environments, where we see performance pathologies that need to be addressed, while others come from our own understanding of the bottlenecks in Delphix OS. In the past few months, George Wilson and Adam Leventhal made significant improvements to how writes are handled in ZFS. In this multi-part blog post, I will talk about a benchmark we created to measure ZFS write performance as we make changes to the OS. This first part covers the benchmark setup and run, and shows some results from the benchmark on Delphix OS. In part two, I will present data and analysis on the bottlenecks we discovered and how they are going to be addressed.
The Benchmark
The benchmark is a simple program that writes random data to random offsets in a file. I was pleasantly surprised at how much I was able to learn from this simple 20-line program. Hopefully there is information here that will help some of you as well. The program was thought up by Adam and George as a way to observe the behavior of ZFS when a zpool is fragmented.
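/* Source of random data for each block. */
randFd = open("/dev/urandom", O_RDONLY);
/* Target file. O_APPEND is omitted: on some platforms (Linux, for one) it makes pwrite() append and ignore the offset. */
fd = open(argv[2], O_CREAT|O_WRONLY, 0777);
.
.
while (1) {
    /* Fill the buffer with one block of random data. */
    read(randFd, buf, BLOCKSIZE);
    /* Overwrite a randomly chosen, block-aligned offset within the first test_size bytes. */
    pwrite(fd, buf, BLOCKSIZE, (lrand48() % (test_size / BLOCKSIZE)) * BLOCKSIZE);
}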
I ran the program on a clean zpool and observed the performance of the system using DTrace. The program fragments the pool as it writes to random offsets. I wanted to measure the steady state throughput achieved as a function of the zpool usage (10% full, 50% full etc.). This will demonstrate interesting ZFS pathologies and track our progress from release to release.
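For readers who want to try the idea, here is a minimal, bounded sketch of the same random-write loop. It is not the original 20-line program: the function names, the fixed 8K BLOCKSIZE (chosen to match the 8k record size used in the setup), the 0644 file mode and the write count limit are all assumptions made for illustration.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCKSIZE 8192 /* assumed: matches the 8k recordsize of the test filesystem */

/* Pick a block-aligned offset somewhere inside the first test_size bytes. */
off_t random_block_offset(off_t test_size)
{
    return (off_t)(lrand48() % (test_size / BLOCKSIZE)) * BLOCKSIZE;
}

/*
 * Overwrite `count` randomly chosen blocks of `path` with random data.
 * Returns the number of blocks written, or -1 if a file could not be opened.
 */
long write_random_blocks(const char *path, off_t test_size, long count)
{
    char buf[BLOCKSIZE];
    long written = 0;
    int rand_fd = open("/dev/urandom", O_RDONLY);
    int fd = open(path, O_CREAT | O_WRONLY, 0644);

    if (rand_fd < 0 || fd < 0) {
        if (rand_fd >= 0)
            close(rand_fd);
        if (fd >= 0)
            close(fd);
        return -1;
    }
    while (written < count) {
        /* One block of random data, so compression cannot shrink it. */
        if (read(rand_fd, buf, BLOCKSIZE) != BLOCKSIZE)
            break;
        /* Positioned write: lands at the chosen offset, not at end of file. */
        if (pwrite(fd, buf, BLOCKSIZE, random_block_offset(test_size)) != BLOCKSIZE)
            break;
        written++;
    }
    close(rand_fd);
    close(fd);
    return written;
}
```

Calling write_random_blocks() repeatedly, or with a very large count, approximates the endless loop of the original program; the bound only makes the sketch easy to run and check.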
Benchmark Setup
Before each run, I created a new zpool and a filesystem with record size of 8k, using the following commands.
zpool create domain0 c3t1d0
zfs create -o recordsize=8k -o compression=on domain0/tests
I used an 8 vCPU VM with 48GB of RAM and a 32GB pool for the Delphix server. I used a small pool so I could fill it quickly. While 32GB is far smaller than a typical customer’s pool, I believe that the results are still representative. I ran eight concurrent instances of the program to generate load. (I also tested the benchmark with fewer than eight writers; it showed similar characteristics, but took longer to reach steady state.)
Benchmark Run
Once I created a zpool, I populated it with random data using “dd”.
dd if=/dev/urandom of=/domain0/tests/testfileX.dat bs=8k count=Y
I monitored the file system using DTrace as the pool was filling up. The output below shows a histogram of IOs, by size, being performed at the backend storage while the above “dd” command is running. The majority of writes are 64K and 128K in size, even though I did 8k writes at the user level. This aggregation is performed at the IO layer in ZFS.
2013 Feb 16 18:45:42 (14.999s elapsed) /engineering/uday/scripts/io.d

bytes write
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@                                        309
            1024 |                                         217
            2048 |@                                        281
            4096 |@                                        230
            8192 |@                                        231
           16384 |                                         176
           32768 |                                         119
           65536 |@@@@@                                    2352
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          14321
          262144 |                                         0

           avg latency      iops   throughput
write           7311us    1302/s    142394k/s
total                     1302/s    142394k/s
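Two small calculations help in reading the histogram above. A DTrace quantize() aggregation files each value under the largest power of two that does not exceed it, so the 131072 row counts IOs of 128K up to (but not including) 256K. Dividing the reported throughput by the reported IOPS gives the average IO size. The sketch below shows both; the function names are hypothetical and are not part of io.d.

```c
#include <stdint.h>

/*
 * Row of a power-of-two histogram that a positive value falls into:
 * the largest power of two that is <= size. Returns 0 for size 0.
 */
uint64_t quantize_bucket(uint64_t size)
{
    uint64_t bucket = 1;

    if (size == 0)
        return 0;
    while (bucket <= size / 2)
        bucket *= 2;
    return bucket;
}

/* Average IO size in KB, given throughput in KB/s and IOs per second. */
uint64_t avg_io_kb(uint64_t throughput_kb_per_s, uint64_t iops)
{
    return iops == 0 ? 0 : throughput_kb_per_s / iops;
}
```

With the figures from the output above, avg_io_kb(142394, 1302) comes to 109, that is, roughly 109K per IO at the device for 8k writes at the user level.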
ZFS divides each IO device into a few hundred regions called “metaslabs”. Figure 1 below shows a visualization of the zpool after it is filled to 25% using the "dd" command above. Each cell represents one metaslab. The pool here has 256 metaslabs of 128MB each. The percentage value is the amount of free space in each metaslab. As you can see, around 32 metaslabs are full, accounting for 8GB of data.
At this point, the pool is compact: most metaslabs are either full or empty, and the data consumes minimal space on disk. When I started my program with the zpool in the above state, I saw an initial spurt of high write throughput followed by a steep fall, with throughput stabilizing after some time. Figure 2 below depicts the zpool after the program ran for 30 minutes. Note that the pool is still only 25% full. This image did not change even after running the program for a few hours longer; the benchmark had reached steady state.
As you can see, the pool is fairly fragmented at this point. The pool is no longer compact: 2.5 times as many metaslabs are in use, and the majority of them have regions of free space. So what is going on here? When the program started writing to random blocks in the zpool, ZFS tried to place the random blocks in contiguous regions of the pool by allocating new metaslabs. Due to copy-on-write, every time a block is overwritten it creates a hole somewhere in the pool. New data blocks were allocated to new metaslabs until ZFS found free regions within currently allocated metaslabs. At that point, ZFS started filling those holes before allocating new metaslabs. Hence the number of metaslabs used stopped increasing after a certain point in the test. The chart in Fig. 3 below shows write throughput during the benchmark at different usage levels of the zpool.
After the initial high write throughput, there is a precipitous drop, with the benchmark settling at a steady-state throughput. Steady-state throughput is computed as the average throughput over the final 50 samples in the experiment. The steady-state throughput is proportional to the amount of free space in the zpool at the beginning of the benchmark run. The chart in Fig. 4 below plots this throughput against zpool occupancy.
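The steady-state calculation described above is a mean over the tail of the sample series. A minimal sketch, assuming the throughput samples are already collected in an array (the function name and signature are hypothetical):

```c
#include <stddef.h>

/*
 * Mean of the last `window` samples, or of all samples if fewer exist.
 * Returns 0.0 for an empty series or a zero-length window.
 */
double steady_state_throughput(const double *samples, size_t n, size_t window)
{
    double sum = 0.0;

    if (n == 0 || window == 0)
        return 0.0;
    if (window > n)
        window = n;
    for (size_t i = n - window; i < n; i++)
        sum += samples[i];
    return sum / (double)window;
}
```

For the benchmark runs described here the window would be 50. Averaging only the tail keeps the initial burst of high throughput out of the reported number.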
Conclusion
The experiments show an interesting performance pathology as ZFS pools become increasingly full and fragmented. Visualizing the metaslabs and their usage through the benchmark run helped me understand the reasons behind the drop in performance. Fig. 4 above shows a tight correlation between the amount of free space in the pool and the random write throughput achieved. This test provides an interesting benchmark for future ZFS performance work. We can both work to push the curve to the right, delaying the point at which performance degrades, and push the curve up, improving the worst case. The simple, 20-line C program proved to be a fountain of useful information. In my next blog post I'll dissect the results in more detail and provide some insight into what ZFS is doing, and how we might improve it.
Delphix Version 3.0
Last week we released Delphix Version 3.0, which extends Delphix database virtualization technology to Microsoft SQL Server and Oracle Real Application Clusters (RAC). As with previous releases, the newest version of our server includes a number of innovative features; a comprehensive list can be found here. Along with our focus on rolling out new functionality, we continue to invest in boosting the performance of our server. This post is dedicated to highlighting some of the key performance improvements you can expect from Delphix 3.0.
Enhanced Caching
ARC is the Adaptive Replacement Cache, used by DelphixOS to improve the latency of reads. Accesses that are satisfied from the ARC are typically an order of magnitude faster than accesses to traditional storage. Delphix 3.0 features enhanced caching by allowing the ARC to be shared between all the Virtual Databases (VDBs) provisioned. This improves utilization of ARC space and reduces average latencies for read-only data.
Performance improvements from sharing the ARC stem from two major components. First, sharing the ARC between all the VDBs improves utilization of ARC space. In Delphix 2.7, every VDB had its own address space. This means that even if two VDBs were reading the same data, they maintained separate copies of it in the ARC, reducing its effective utilization. In 3.0, each unique block is allocated in memory only once; if you have 10 VDBs reading the same data, for example, you are getting up to 10 times the value out of your memory. Efficient use of memory increases the amount of data that can be cached overall.
The second component is improved latency. Because VDBs share an address space, they start seeing shared cache effects. For example, if 10 VDBs read the same data, only the first has to read it from disk; the rest can read it much faster from the ARC.
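Both effects can be illustrated with a toy model. The sketch below is not how the ARC is implemented: it has no eviction, and the VDB count, block count and names are hypothetical. It only contrasts a cache keyed by block alone (shared) with one keyed by VDB and block (private) when every VDB reads the same set of blocks once.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NUM_VDBS   10   /* hypothetical: VDBs provisioned from one source */
#define NUM_BLOCKS 100  /* hypothetical: blocks that each VDB reads */

struct cache_stats {
    size_t disk_reads;    /* reads that missed the cache */
    size_t cached_copies; /* blocks held in memory at the end */
};

/*
 * Every VDB reads the same NUM_BLOCKS blocks once. With a shared cache the
 * lookup key is the block alone; with private caches it is (VDB, block).
 */
struct cache_stats simulate_reads(bool shared)
{
    static bool cached[NUM_VDBS][NUM_BLOCKS];
    struct cache_stats stats = {0, 0};

    memset(cached, 0, sizeof(cached));
    for (size_t vdb = 0; vdb < NUM_VDBS; vdb++) {
        /* A shared cache maps every VDB onto the same slot. */
        size_t slot = shared ? 0 : vdb;

        for (size_t blk = 0; blk < NUM_BLOCKS; blk++) {
            if (!cached[slot][blk]) {
                /* Miss: fetch from disk and keep a copy in memory. */
                cached[slot][blk] = true;
                stats.disk_reads++;
                stats.cached_copies++;
            }
        }
    }
    return stats;
}
```

In this model the private caches hold 1,000 copies and incur 1,000 disk reads, while the shared cache holds 100 copies and incurs 100 disk reads. Real gains depend on how much data the VDBs actually have in common.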
Performance gains from sharing ARC will vary depending on the amount of sharing each VDB is getting from the data (locality of reference) plus the working set of all the VDBs combined in the non-shared scenario. We typically see our customers maxed out on ARC, so this feature is expected to show good performance gains.
In order to showcase the benefits of this feature, I constructed an experiment which is scaled down from a realistic scenario. I am using DelphixBench to run a shared ARC vs. non-shared ARC experiment. Non-Shared ARC mimics the behavior of Delphix 2.7. As you can see, for this scenario, sharing ARC space will produce close to 30% improvement in performance. For this particular workload, the working set size is ~9GB. I provisioned a Delphix virtual appliance with 4GB of memory. In customer environments, working set size is typically larger than the ARC space available. In this particular experiment, the benefits are coming both from ARC space utilization and prefetching between VDBs resulting in better average latency. Note that improvements from this feature are tightly dependent on customer environments. The more VDBs you have, the more benefit you see, and the more value you get out of each byte of memory you buy.
Faster Virtual Database Provisioning
Provisioning a VDB involves creating a clone of a snapshot of the production database and then applying all the logs generated since the snapshot was taken, to re-create a point-in-time consistent copy of the production database. The time taken to provision a VDB is independent of the size of the database, but it used to depend tightly on the amount of log activity that had to be recovered. Fast and consistent provisioning times improve workflows for customers. With version 3.0 we have reduced the dependency on redo/log activity to provide better provisioning times.
The chart below shows the absolute time taken to provision a VDB using versions 2.7 and 3.0 of Delphix. As you can see, time to provision is independent of the size of the database being used. You will also notice the big drop in time taken to provision a VDB from 2.7 to 3.0. This gain comes from efficiency improvements in the provisioning process.
We also reduced the dependence on the type of workload and the amount of log activity present in the source database. The chart above shows a big drop in provision time for three different types of database workloads. Redo/log activity increases going from the read-heavy workload, to OLTP, to the write-heavy workload. This enhancement is intended to make VDB provision time more predictable to the user, improving the usability of the product.
Streamlined Replication Service Workflow
The replication service enables continuous user data replication from one Delphix server (source) onto another one (target). This is an essential component of our HA/DR solution. With 3.0 we have completely revamped the workflows for the replication service. In-house experiments show close to a 2x improvement in replication speed. This comes from improvements in the network protocols used as well as enhancements to the replication service workflow. Customers will see improved utilization of the network link between the source and the target. Faster incremental replication will also mean reduced pressure on the source Delphix server.
Conclusion
I only touched upon some of the big performance enhancements made to the server in version 3.0. There are numerous smaller features added in the application stack and UI to improve the overall responsiveness of the server. Together, they make the product more usable for our customers. Even with all the new functionality in this release, 3.0 again shows our continued focus on boosting performance.
DelphixBench 1.0
Introduction
DelphixBench is an internal benchmarking framework developed to measure performance of various workflows in Delphix. The benchmark is designed to produce repeatable results in a consistent manner. This benchmark focuses on the workflows which are most relevant to our customers to ensure that the benchmark metrics are representative of customer experience.
Motivation
There are various tools and micro-benchmarks available to us to measure different aspects of the Delphix ecosystem. For example, FIO is a tool that can be used to measure IO and file-system performance. Other such tools exist today to evaluate performance, but they tend not to reflect our customers’ real-world performance, either because they are micro-benchmarks or because they are macro-benchmarks that are not relevant to our workflows. DelphixBench is an attempt at bridging that gap. DelphixBench also addresses the following needs:
- Provide a comprehensive framework for measuring performance of workflows relevant to our customers.
- Ensure a continued uptrend in workflow performance.
- Provide an efficient mechanism to evaluate performance trade-offs.
- Provide a tool to showcase our performance.
Benchmark Workload
DelphixBench is built on Swingbench, developed by Dominic Giles. DelphixBench uses the Order Entry schema from Swingbench, with the following three workload types:
- OLTP
  - This is the standard Transaction Processing style workload, which includes a mix of Browse, Query, and Update transactions.
  - This workload mimics TPC-C.
- Read
  - This is a custom workload developed to stress caching and prefetching aspects of the Delphix ecosystem.
  - The majority of transactions in this workload are Query and Browse.
  - This workload is sensitive to datafile read bandwidth and latency.
- Write
  - This is a custom workload developed on top of the Order Entry workload.
  - The majority of the transactions are “New Order,” which update the Order tables.
  - This workload generates a large amount of redo/log traffic, and is sensitive to redo/log write latency.
These three workloads are run against three different databases, resulting in nine benchmarks. All databases use the Order Entry schema, but are 1GB, 10GB and 60GB in size. The three databases (SOE1G, SOE10G and SOE60G) are populated a priori. Every benchmark run starts from a pre-defined snapshot of the database.
Benchmark Operation
The benchmark run is completely automated through the internal “blackbox” framework. Users can either choose one of the existing Delphix appliances to run the benchmark against, or point to their own appliance.
The benchmark operation will have an option to either set up Delphix from scratch, or use an existing stack. The following steps are carried out for each of the test cases.
- The source databases are obtained from their snapshot and are “linked” to the Delphix server as dSources.
- For each of these dSources, 12 different Virtual Databases (VDBs) are provisioned onto two target systems.
  - Twelve VDBs are used to generate enough load on the server to ensure consistent results.
- Once the VDBs are provisioned, three types of load are run against them, measuring Transactions Per Second.
- Other metrics are measured from different workflows, as described in Benchmark Metrics.
Benchmark Metrics
Benchmark metrics are defined by measuring typical workflows performed on the Delphix appliance by our customers. Emphasis is placed on those operations which provide the most value to customers.
Linking Performance
The Delphix server allows users to ‘link’ their databases in order to then create/provision virtual copies efficiently with minimal overhead. The amount of time it takes to link a fresh database into a new datasource is used as a metric of linking performance. This time depends closely on the amount of data that needs to be consumed, which includes the size of the database plus the redo/log data. The metric is the total time to link the database, normalized by the size of the database plus the log activity since the snapshot. This metric is reported for all three databases.
Provision Performance
This benchmark operation provisions 12 VDBs in parallel for each source database. All the VDBs are provisioned asynchronously. The time to finish provisioning all the VDBs is measured and normalized by the number of VDBs. Provision time typically depends on the amount of online log data in the source database at the time of provisioning. Since this is constant across all the databases in the benchmark, absolute time to provision is used as the metric. Absolute time also emphasizes that provision time is independent of database size and type of workload.
VDB Performance
Once the VDBs are provisioned, three different workloads, OLTP, READ, and WRITE, are run against them. Each run uses a separate set of VDBs. The performance metric evaluated here is the average Transactions Per Second (TPS), as measured by Swingbench. TPS is a popular metric for OLTP workloads. Each run produces a separate TPS rating.
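The three metrics described so far reduce to simple ratios. The sketch below spells them out. The function names and units (seconds, GB) are assumptions for illustration, and the TPS formula is the usual definition rather than Swingbench's internal calculation.

```c
/*
 * Linking: total link time normalized by the data consumed
 * (database size plus redo/log data), in seconds per GB.
 */
double link_time_per_gb(double link_seconds, double db_gb, double log_gb)
{
    double total_gb = db_gb + log_gb;

    return total_gb > 0.0 ? link_seconds / total_gb : 0.0;
}

/* Provisioning: time to finish all VDBs, normalized by the number of VDBs. */
double provision_time_per_vdb(double total_seconds, unsigned vdb_count)
{
    return vdb_count > 0 ? total_seconds / (double)vdb_count : 0.0;
}

/* VDB performance: average transactions per second over one run. */
double average_tps(unsigned long transactions, double run_seconds)
{
    return run_seconds > 0.0 ? (double)transactions / run_seconds : 0.0;
}
```

For instance, linking a 10GB database with 2GB of log data in 120 seconds gives 10 seconds per GB, and 12 VDBs provisioned in 240 seconds gives 20 seconds per VDB.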
Replication Performance
The replication service enables continuous user data replication from one Delphix server (source) onto another (target). The target Delphix server can be used as a backup server in case of catastrophic failure. The benchmark operation measures both full and incremental replication. At the end of the first benchmark run, the Delphix server with the source database, plus the 12 VDBs, is replicated onto the target server. After every subsequent test, more data sources are added and the server is replicated again. Note that subsequent replications are all incremental. The benchmark operation measures and reports the results of all replications.
Conclusion
DelphixBench is designed to measure the performance of various operations of the Delphix server that our customers typically perform. I used a popular Oracle benchmark as the basis for the workload, to ensure representative results. Metrics are defined to be representative of the performance relevant to our customers. As newer features are added, we will incorporate those workflows into the benchmark.
SQLIO vs. FIO
A few weeks prior to our first Microsoft SQL deployment, I was trying to benchmark a Virtual Database running MS SQL on the Delphix Appliance. I picked SQLIO as a starting point. SQLIO is an IO benchmark that Microsoft recommends to assess storage systems behind SQL installations. SQLIO creates a file on the target disk and measures its capacity/speed using reads and writes. It can perform IO with various block sizes, in sequential or random patterns, and so on.
One problem I discovered right away is that SQLIO relies on zero blocks to populate its datasets (it writes 0x0). This is generally fine for a naive storage system, but many modern storage systems identify this pattern and end up gaming the benchmark unintentionally. ZFS, for example, detects zero blocks and compresses them away, leaving only metadata writes to reach the disks, so the actual storage devices see only a fraction of the traffic that SQLIO generates. SQLIO reported numbers that were too good to be true. The screenshot below shows output from one of the runs. An 8K sequential write test resulted in ~21,000 IOPS, at 170 MB/sec of aggregate write bandwidth. The majority of the IOs were serviced in under 2ms! None of the caches in my stack were large enough to absorb this traffic and account for the latency; the phenomenal IOPS, bandwidth, and latency figures came from ZFS reducing the zero blocks to nothing. Add to that, ZFS does not take credit for the phenomenal storage savings achieved: the compression ratio does not account for the zero blocks eliminated.
ZFS is not the only storage system that exposes this problem with SQLIO. There are other storage processors and SSDs that do zero block detection as a way to improve storage utilization and performance. Using SQLIO to characterize such IO systems will result in gross mischaracterization of the underlying storage. Note that this problem exists regardless of whether you are running a read or a write test.
SQLIOSIM is another benchmark Microsoft recommends to evaluate storage systems. SQLIOSIM performs parallel IO operations to multiple files on the target system to mimic traffic generated during normal operation of a SQL database. This benchmark suffers from the same problem: it uses zero blocks to populate some of its datasets.
There are other alternatives for benchmarking IO that do not suffer from this problem. I found FIO to be the most useful among them. First, it uses random data to populate the datasets, so it is not affected by zero block detection. It has all of the capabilities of SQLIO and more. It is available for multiple environments, so you can compare results in a heterogeneous environment. It is open source, so you can customize it for your environment. If you are trying to benchmark your storage system, FIO is a good benchmark to use.
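To make the effect concrete, here is a minimal sketch of zero block detection. It is not the ZFS code path; the function names are hypothetical and the "storage layer" is imaginary. It only shows why a buffer full of zeros generates no data writes on such a system, while any non-zero content does.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if every byte of the block is zero. */
bool is_zero_block(const unsigned char *block, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (block[i] != 0)
            return false;
    }
    return true;
}

/*
 * Count how many blocks of a buffer would need actual data writes on a
 * hypothetical storage layer that elides all-zero blocks.
 */
size_t blocks_reaching_disk(const unsigned char *data, size_t nblocks,
                            size_t block_size)
{
    size_t count = 0;

    for (size_t i = 0; i < nblocks; i++) {
        if (!is_zero_block(data + i * block_size, block_size))
            count++;
    }
    return count;
}
```

A benchmark that writes only zeros would see blocks_reaching_disk() return 0 for every buffer it submits, so its reported IOPS say little about the disks underneath.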
If you need to evaluate your storage under database load, we have developed a wrapper for FIO that can be used. This wrapper will generate random reads, sequential reads and sequential writes, typical of database systems. More details at Kyle’s git repository here. This could be a good replacement for something like SQLIOSIM.
There are other nuances that you should be wary of when evaluating IO systems, especially those backing databases. I will talk about a couple of them in my next post. The best thing to do is to understand the limitations of your benchmark and question everything that does not look right. If the results look too good to be true, they probably are.








