Delphix Continuous Data Integration
“You never develop code without version control, why do you develop your database without it?”
Applications are the nexus of the modern enterprise. They simplify operations, speed execution, and drive competitive advantage. Accelerating the application lifecycle means accelerating the business. Organizations depend on rapid iteration as the foundation of agile development. This has led to a rich ecosystem of source code management tools that enable developers to work quickly within a private sandbox and push changes to a shared branch where continuous integration tools build and validate those changes on an ongoing basis.
This source code management agility has historically been stymied by data dependencies that are not easily moved, refreshed, or managed. This forces shared sandboxes, manual schema management, and stale data that degrades quality and slows development. Over the last decade, momentum has built around concepts such as Evolutionary Database Design and Database Continuous Integration, and an increasingly rich toolchain has emerged. While implementations vary, they share the same basic idea:
Structural (DDL) and data (DML) changes must be managed in concert with source code and applied in a rigorous and automated fashion.
These tools (flyway and liquibase being the most prominent) connect to arbitrary databases and manage the contents of the database, but make no attempt to manage the databases themselves. While they are capable of upgrading production data at any version, users typically operate on an empty database and populate it with synthetic data. While this is useful for unit tests that require repeatable results, it is a poor solution for functional, regression, and system test environments. Running those tests on synthetic, partial, or stale data can lead to bugs caught late in the development cycle, requiring costly process resets that could have been avoided if the developer had been using real data in their own sandbox. The result is longer application development cycles, expensive processes to manually push real data earlier into the lifecycle, and slower velocity for the business.
Continuous Integration with Delphix
Delphix is the engine that can accelerate these tools and provide a robust foundation for a new generation of continuous data integration processes. Database continuous integration tools provide the framework for DDL and DML change management, while Delphix efficiently provides fresh real data to developers at the point they need it. Imagine a developer writing a DML transformation based on an assumption about the format of some data. All of their manual, unit, and integration tests pass based on the synthetic data built on those same assumptions (or pulled from stale production data), but the application fails in UAT when confronted with an unknown format on the latest production data.
Figure: Bug Impact to Test Cycle Before Delphix
Figure: Bug Impact to Test Cycle After Delphix
Before Delphix, these bugs force the project team into an unpleasant situation: re-run the entire test cycle and lose weeks of development time, or run only a subset of tests after fixing the issue and jeopardize quality by risking bugs that might only be caught through a full test cycle. After Delphix, no one ever notices a bug that never escapes the developer environment.
Database continuous integration tools like flyway and liquibase are a perfect match for Delphix-driven provisioning and refresh. Each of these tools tracks ordered transformations using a table within the database to track what changes have been applied. This allows the software to bring any database up to the current developer version. There are no additional hooks required on the Delphix side, but there are ways Delphix can provide benefits beyond just fresh real data.
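To make the tracking-table idea concrete, here is a minimal sketch in Python using SQLite. It is not how flyway or liquibase are implemented; the table name `schema_history`, the `MIGRATIONS` list, and the `migrate` function are all hypothetical names chosen for illustration. The point is only that a table of applied versions is enough to bring a database at any older version up to the current one.

```python
import sqlite3

# Hypothetical migrations as (version, SQL) pairs. Real tools load these from
# files kept alongside the application source code.
MIGRATIONS = [
    (1, "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE customer ADD COLUMN email TEXT"),
    (3, "UPDATE customer SET email = lower(email)"),
]


def migrate(conn, migrations):
    """Apply, in version order, every migration not yet recorded as applied."""
    # The tracking table (name is an assumption) lives inside the database.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_history (version INTEGER PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_history")}
    newly_applied = []
    for version, sql in sorted(migrations):
        if version in applied:
            continue  # already run against this copy of the data
        conn.execute(sql)
        conn.execute("INSERT INTO schema_history (version) VALUES (?)", (version,))
        newly_applied.append(version)
    conn.commit()
    return newly_applied


conn = sqlite3.connect(":memory:")
first_run = migrate(conn, MIGRATIONS[:2])   # a copy that is one version behind
second_run = migrate(conn, MIGRATIONS)      # only the missing change is applied
```

Because the record of what has been applied travels inside the database, a freshly provisioned or refreshed copy needs no extra coordination: running the same function against it applies exactly the changes that copy is missing.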
Developer Reset
These DDL and DML transformations are not foolproof. In some cases, it is necessary to write custom product code to run transformations that cannot be accomplished via SQL (such as updating BLOBs), or assemble a particularly tricky set of SQL statements. The developer should be able to test and validate these transformations on real data prior to integrating the changes. With Delphix, developers can tag their branch after all the known transformations have been run, and then rapidly iterate on validating new transformations with real data, resetting on failure without needing to refresh or get a new copy of real data.
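The tag, iterate, and reset loop can be illustrated with a small analogy. The sketch below is not a Delphix API: it uses a SQLite savepoint as a stand-in for tagging a known-good state, and a rollback to that savepoint as a stand-in for a reset. The function name `try_transformation` and the sample table are hypothetical. Delphix operates on the whole virtual database rather than a single transaction, so it can also reset transformations that run outside SQL.

```python
import sqlite3


def try_transformation(conn, statements):
    """Run statements after a savepoint; reset to it if any statement fails."""
    conn.execute("SAVEPOINT known_good")        # stand-in for tagging the branch
    try:
        for sql in statements:
            conn.execute(sql)
    except sqlite3.Error:
        conn.execute("ROLLBACK TO known_good")  # stand-in for a reset
        conn.execute("RELEASE known_good")
        return False
    conn.execute("RELEASE known_good")
    return True


def balances(conn):
    return [row[0] for row in conn.execute("SELECT balance FROM account ORDER BY id")]


# isolation_level=None leaves transaction control to the explicit statements.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance TEXT)")
conn.execute("INSERT INTO account (balance) VALUES ('10.50'), ('n/a')")

# First attempt: the first statement succeeds, the second fails midway.
ok_bad = try_transformation(conn, [
    "UPDATE account SET balance = '0' WHERE balance = 'n/a'",
    "INSERT INTO missing_table VALUES (1)",
])
after_bad = balances(conn)    # data is back to the tagged state

# Second attempt: the corrected transformation runs to completion.
ok_good = try_transformation(conn, [
    "UPDATE account SET balance = '0' WHERE balance = 'n/a'",
])
after_good = balances(conn)
```

The failed attempt leaves no partial changes behind, so the developer can fix the transformation and try again against the same data without fetching a new copy.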
Project Branches
Developers rarely work in isolation. Project or release branches allow developers to share code changes by pushing to a common source code repository. With Delphix-enabled continuous data integration, it’s possible for developers to share not just a source code repository, but a data repository as well. While continuous integration tools can apply changes within each developer copy, these transformations may take time to run and consume additional storage depending on the size of the database and nature of the transformations. These changes can be automated within a project branch such that developers always get the latest transformed data whenever they refresh.
On refresh of a project branch, a script is invoked that transforms the production data using the latest DDL and DML changes pulled from the project source code repository. Combined with a refresh policy, this can keep the project branch up to date with fresh data without each developer needing to run the same transformations. This process can also be run through hooks whenever a DDL or DML change is pushed to the source code repository.
Summary
Agile development requires agile data, and Delphix is the agile data engine of the future. You can sync to your production or source databases, instantly provision virtual databases where they are needed using a minuscule amount of space, and provide each developer an independent copy of the data that can be refreshed and reset on demand. Combined with database continuous integration tools like flyway and liquibase, Delphix simplifies the development and testing of structural and data transformations. Providing efficient full data copies through Delphix that are always in sync with source code changes reduces bugs and accelerates application delivery while reducing IT and infrastructure overhead.
Agile Data Technology
The Requirements for Agile Data
Applications are the nexus of the modern enterprise. They simplify operations, speed execution, and drive competitive advantage. Accelerating the application lifecycle means accelerating the business. Increasingly, organizations turn to public and private clouds, SaaS offerings, and outsourcing to hasten development and reduce risk, only to find themselves held hostage by their data.
Applications are nothing without data. Enterprise applications have data anchored in infrastructure, tied down by production requirements, legacy architecture, and security regulations. But projects demand fresh data under the control of their developers and testers, requiring processes to work around these impediments. The suboptimal result leads to cost overruns, schedule delays, and poor quality.
Agile development requires agile data. Agile data empowers developers and testers to control their data on their schedule. It unburdens IT by efficiently providing data where it is needed independent of underlying infrastructure. And it accelerates application delivery by providing fresh and complete data whenever necessary. It grants its users super powers.
Many technologies can solve part of the agile data problem, but a partial solution still leaves you with suboptimal processes that impede your business. A complete agile data solution must embrace the following attributes.
Non Disruptive Synchronization
Production data is sensitive. The environment has been highly optimized and secured, and its continued operation is critical to the success of the business - introducing risk is unacceptable. An agile data solution must automatically synchronize with production data such that it can provide fresh and relevant data copies, but it cannot mandate changes to how the production environment is managed, nor can its operation jeopardize the performance or success of business critical activities.
Service Provisioning
Data is more than just a sequence of bits. Projects access data through relational databases, NoSQL databases, REST APIs, or other APIs. An agile data solution must move beyond copying the physical representation of the data by instantiating and configuring the systems to access that data. Leaving this process to the end users induces delays and amplifies risk.
Source Freedom
Data is pervasive. Efforts to mandate a single data representation, be it a particular relational or NoSQL system, rarely succeed and limit the ability of projects to choose the data representation most appropriate for their needs. As project needs diversify the data landscape, the ability to manage all data through a single experience becomes essential. This unified agile experience necessitates a solution not tied to a single data source.
Platform Independence
The premier storage, virtualization, and compute platforms of today may be next year’s legacy architecture. Solutions limited to a single platform inhibit the ability of organizations to capitalize on advances in the industry, be it a high performance flash array or new private cloud software. Agility over time requires a solution that is not tied to the implementation of a particular hardware or software platform.
Efficient Bridges
Enterprise infrastructure spans geography, platform, and technology. Data anchored in the world's best datacenters is useless to the developer that needs it in the cloud, a project that needs it migrated to a different platform, or the tester that needs a local copy halfway around the world. Data mobility through efficient transfer mechanisms is required to get data where it needs to be.
Efficient Copies
Storage costs money, and time costs the business. Agile development requires a proliferation of data copies for each developer and tester, magnifying these effects. Working around the issue with partial data leads to costly errors that are caught late in the application lifecycle, if at all. An agile solution must be able to create, refresh, and rollback copies of production data in minutes while consuming a fraction of the space required for a full copy.
Workflow Automation
Each development environment has its own application lifecycle workflow. Data may need to be masked, projects may need multiple branches with different schemas, or developers may need to restart services as data is refreshed. Pushing responsibility to the end user is error prone and impedes application delivery. An agile solution must provide stable interfaces for automation and customization such that it can adapt to any development workflow.
Self Service Data
Developers and testers dictate the pace of their assigned tasks, and each task affects the data. Agile development mandates that developers have the ability to transform, refresh, and roll back their data without interference. This experience should shield the user from the implementation details of the environment to limit confusion and reduce opportunity for error.
Resource Management
Each data copy consumes resources through storage, memory, and compute. Once developers experience the power of agile data, they will want more copies, run workloads on them for which they were not designed, and forget to delete them when they are through. As these resources become scarce, the failure modes (such as poor performance) become more expensive to diagnose and repair. Combatting this data sprawl requires visibility into performance and capacity usage, accountability through auditing and reports, and proactive resource constraints.
Delphix is the agile data platform of the future. You can sync to your production or source databases, instantly provision virtual databases where they are needed using a minuscule amount of space, and provide each developer an independent copy of the data that can be refreshed and rolled back on demand. This platform will become only more powerful over time as we add new data sources, provide richer workflows targeting specific applications and use cases, and streamline the self service model. An enterprise data strategy without Delphix is just a path to more data anchors, necessitating suboptimal processes that continue to slow application development and your business.
Enterprise Software Hackathons
At Delphix, we just concluded one of our recurring Engineering Kickoff events where we get everyone together for a few days of collaboration, discussion, idea sharing, and fun. In this case it included, for the first time, an all-day hackathon event. To be honest, it was a bit of an experiment and one where we were unsure of how it would be received. We had all read about, participated in, or heard praise of hackathons at other companies, but these companies were always more consumer-focused or had technologies that were more easily assembled into different creations. As an enterprise software company, we were concerned that even the simplest projects would be too complex to turn around over the course of a day. Given the potential benefit, however, it was clearly something we wanted to experiment with - any failure would also be a learning opportunity.
Some companies go big or go home when it comes to hackathons - week long activities, physical hacks, etc. We wanted to preserve freedom but be a little more targeted. The directive was simple: spend a day doing something unrelated to your normal day job that in some way connects to the business. People volunteered ideas and mentorship ahead of time so that even the newest engineers could meaningfully participate. The result was a resounding success. Whether people were able to give a demo, sketch on a whiteboard, or just speak to their ideas and the challenges they faced, everyone pushed themselves in new directions and walked away having learned something through the experience.
The set of activities covered a wide swath of engineering, including:
- Using D3.js for visualizing analytics data
- "zero copy" iSCSI in illumos
- web portal for customer data analysis
- "zpool dump" to store pool metadata for offline zdb(1M) use
- Real time engineering dashboard to aggregate commits, bugs, reviews, and more
- "D++" DTrace syntactic sugar: function elapsed time, unrolled while loops, callers array
- Mobile application to monitor Delphix alerts and faults
- Global symbol tab completion for MDB
- Network performance tool
- Speeding up unit tests
- Browser usage analytics
- 'zfs send' to a POSIX filesystem
- BTrace++ (a.k.a. CATrace) to make java tracing safe and easy
- New V2P (virtual to physical) mechanisms in Delphix
- Tools to more easily deploy changes to VMs
For myself, I put together a prototype of a hosted SSH/HTTP proxy for use by our support organization. This was my first real foray into the world of true PaaS cloud software - running node.js, redis, and cloudAMQP in a heroku instance, and it's been incredibly interesting to finally play with all these tools I've read about but never had a reason to use. I will post details (and hopefully code) once I get it into slightly better shape.
Only a fraction of these are really what I would consider a contribution to the product itself, which is where our initial trepidation around a hackathon went awry. No matter how complex your product or how high the barriers to entry, engineers will find a way to build cool things and try out new ideas in a hackathon setting. Everything that people did, from learning how to make changes to our OS to improving our quality of life as engineers to testing new product ideas, will provide real value to the engineering organization. On top of that, it was incredibly fun and a great way to get everyone working together in different ways.
It's something we'll certainly look at doing again, and I'd recommend that every company, organization, or group, find some way to get engineers together with the express purpose of working on ideas not directly related to their regular work. You'll end up with some cool ideas and prototypes, and everyone will learn new things while having fun doing it.
Behind the Music: The Delphix node.js CLI
As I indicated in a previous post, the new Delphix CLI is a node.js application that runs locally when the user logs into the system. User documentation can be found in the CLI User Guide, but I thought it would be interesting to explore how we ended up with the CLI we have today given the underlying web services upon which it is based.
Our web services are all defined by a set of schemas that I hope to one day describe in more detail. For those of you with a delphix server on hand, you can access the full schema at /api/json/delphix.json on any server. These schemas are loaded by the CLI to dynamically generate the content such that we don't need to update the CLI with every change made to the web services layer. Our APIs are generally RESTful with CRUD semantics, though they have a few oddities that are worth noting:
- We support singleton objects (i.e. NDMP Configuration) that export read/update on a global path.
- Our objects are polymorphic and leverage inheritance, so that we can have a set of operations (link, provision, delete, etc) that operate on a variety of types (Oracle, MSSQL, etc) without needing an entirely separate namespace (which would prevent consumers from generically iterating over objects).
- We support non-CRUD operations both per-object (start, stop, etc), and globally (link, provision, etc).
- We have persistent references to objects that are independent of the object name.
- Some operations execute asynchronously from the web service call and can take a while to run.
The CLI is very much inspired by Bryan's work on the Fishworks CLI (Fishworkers will undoubtedly take heart in the fact that the CLI docs still pay homage to dory and kiowa). We knew that a modal CLI based on filesystem-like navigation, property manipulation, and rich built-in help and tab completion would be vastly easier to use than a never-ending list of commands with esoteric --options-with-ridiculous-names. But we could also learn from the Fishworks CLI: because the underlying web service layer was not RESTful and well-formed, it meant that maintaining the CLI was quite expensive and semantics varied in subtle ways between contexts.
The first decision was to have the CLI namespace mimic that of the web services layer. If an API was rooted at /resources/json/delphix/service/smtp, it meant that the CLI location would be service smtp. This encourages both rational layout of APIs, and eliminates the need for extra translation. We allow for these paths to be typed in directly, but also support the cd command and UNIX-like paths for those familiar with those shells:
delphix> database
delphix database> cd /service/ndmp
delphix service ndmp> cd ..
delphix service>
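The mapping from web service paths to CLI contexts, and the handling of cd with UNIX-like paths, can be sketched in a few lines. This is an illustration in Python rather than the actual node.js implementation; the function names are hypothetical, and the API root is taken from the example path above.

```python
API_ROOT = "/resources/json/delphix"


def api_path_to_context(api_path):
    """Turn a web service path into the list of words naming a CLI context."""
    if api_path != API_ROOT and not api_path.startswith(API_ROOT + "/"):
        raise ValueError("not under the API root: " + api_path)
    return [part for part in api_path[len(API_ROOT):].split("/") if part]


def resolve(context, target):
    """Resolve a cd-style path (absolute, relative, or '..') against a context."""
    parts = [] if target.startswith("/") else list(context)
    for part in target.split("/"):
        if part in ("", "."):
            continue
        if part == "..":
            if parts:
                parts.pop()     # already at the root: stay there
        else:
            parts.append(part)
    return parts


def prompt(context):
    """Render the prompt for a context, in the style of the transcript above."""
    return " ".join(["delphix"] + context) + ">"


smtp = api_path_to_context("/resources/json/delphix/service/smtp")
ndmp = resolve(["database"], "/service/ndmp")
parent = resolve(ndmp, "..")
```

Here `smtp` is `["service", "smtp"]`, `prompt(ndmp)` gives `delphix service ndmp>`, and `prompt(parent)` gives `delphix service>`, matching the session shown above. Because the context is derived directly from the API path, no separate translation table has to be maintained.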
For groups of objects, we have a list command (as well as an ls alias that shows everything at the current context) that can display objects, and a select command to select an individual node:
delphix database> list
NAME PARENTCONTAINER DESCRIPTION
dory - -
delphix database> select dory
delphix database "dory"> cd /
delphix> database "dory"
delphix database "dory">
Now that we could move around the namespace, the next question was how to model operations: create, read, update, delete, and custom global and per-object operations. Each operation, regardless of HTTP implementation (PUT, POST, or DELETE) can take a single JSON object as input. Rather than specifying a complete object on the command line, we place the user into a context where they can interactively change properties and then elect to commit (or discard) the operation.
delphix database> link
delphix database link *> get container
type: OracleDatabaseContainer
name: (required)
description: (unset)
diagnoseNoLoggingFaults: true
group: (required)
masked: (unset)
performanceMode: (unset)
delphix database link *> set container.name=fluke
delphix database link *> set dbUser=delphix
The APIs support nested objects as input, so the CLI uses dot-delimited properties (and tab completion) for specifying input. For objects that support multiple types (through inheritance), the set of available properties changes whenever the type parameter is changed.
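As an illustration of how dot-delimited properties map onto nested input objects, here is a minimal Python sketch. The helper `set_property` is hypothetical, not the CLI's actual code, and it treats every value as a string; a schema-driven CLI would presumably convert values according to the property's declared type.

```python
import json


def set_property(obj, assignment):
    """Apply an 'a.b.c=value' assignment to a nested dict, creating levels as needed."""
    path, _, value = assignment.partition("=")
    keys = path.split(".")
    node = obj
    for key in keys[:-1]:
        node = node.setdefault(key, {})   # descend, creating nested objects
    node[keys[-1]] = value
    return obj


# Start from the defaults the CLI shows for the operation's input object.
params = {"container": {"type": "OracleDatabaseContainer"}}
set_property(params, "container.name=fluke")
set_property(params, "dbUser=delphix")

# On commit, the accumulated object becomes the single JSON input to the API.
body = json.dumps(params, sort_keys=True)
```

After the two assignments, `params` holds `container.name` and `dbUser` alongside the original `container.type`, and `body` is the JSON document that would be sent when the user commits the operation.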
For programmatic consumers, data can be output in JSON, and the 'trace' option can be turned on to see the HTTP calls made as part of an operation. Users can then experiment in the CLI, with its more natural interface, tab completion, and integrated help, and then switch to the raw web services when more advanced manipulation is required.
There are many more topics in the world of the CLI and web services, from object references in names to asynchronous jobs to integration with the illumos PAM stack to hook into SSH authentication. But for my next post I hope to dive into the structure of the web services schemas, and describe how we use them on the backend, GUI, documentation, and CLI to automate significant parts of the engineering process.
Engineer Anti-Patterns
The other week I had a particularly disappointing discussion with a potential new hire. I typically describe our engineering organization at Delphix as a bottoms-up meritocracy where smart people seize opportunities to impact the company through thoughtful execution and data-driven methodology (a.k.a. buzzword bingo gold). In this case, after hours of discussions, I couldn't shake this engineer's fixation with understanding how his title would affect his ability to have impact at Delphix. After deciding that it was not a good cultural fit, I spent some time thinking about what defines our engineering culture and what exactly it was that I felt was such a mismatch. Rather than writing some pseudo-recruiting material extolling the virtues of Delphix, I thought I'd take a cue from Bryan's presentation on corporate open source anti-patterns (video) and instead look at some engineering cultural anti-patterns that I've encountered in the past. What follows is a random collection of cultural engineering pathologies that I've observed in the past and have worked to eschew at Delphix.
The Thinker
This engineer believes his responsibility is to think up new ideas, while others actually execute those ideas. While there are certainly execution-oriented engineers with an architect title out there that do great work, at Delphix we intentionally don't have such a title because it can send the wrong message: that "architecting" is somehow separate from executing, and permission to architect is something to be given as a reward. The hardest part of engineering comes through execution - plotting an achievable path through a maze of code and possible deliverables while maintaining a deep understanding of the customer problem and constraints of the system. It's important to have people who can think big, deeply understand customer needs, and build complex architectures, but without a tangible connection to the code and engineering execution those ideas are rarely actionable.
The Talker
Often coinciding with "The Thinker", this engineer would rather talk in perpetuity than sit down and do actual work. Talkers will give plenty of excuses about why they can't get something done, but the majority of the time those problems could be solved simply by standing up and doing things themselves. Even more annoying is their penchant for refusing to concede any argument, resulting in orders of magnitude more verbiage with each ensuing email, despite attempts to bring the discussion to a close. In the worst case the talker will provide tacit agreement publicly but fume privately for inordinate amounts of time. In many cases the sum total of time spent talking about the problem exceeds the time it would take to simply fix the underlying issue.
The Entitled
This engineer believes that titles are granted to individuals in order to expand her influence in the company; that being a Senior Staff Engineer enables her to do something that cannot be accomplished as a Staff Engineer. Titles should be a reflection of the impact already achieved through hard work, not a license granted by a benevolent management. When someone is promoted, the reasons should be obvious to the organization as a whole, not a stroke of luck or the result of clever political maneuvering. Leadership is something earned by gaining the respect of your peers through execution, and people who would use their title to make up for a lack of execution and respect of their peers can do an incredible amount of damage within an enabling culture.
The Owner
This engineer believes that the path to greater impact in the organization is through "owning" ideas and swaths of technology. While ownership of execution is key to any successful engineering organization (clear responsibility and accountability are a necessity), ownership of ideas is toxic. This can lead to passive-aggressive counter-productive acts by senior engineering, and an environment where junior engineers struggle to get their ideas heard. The owner rarely takes code review comments well, bullies colleagues that encroach on her territory, and generally holds parts of the product hostage to her tirades. Metastasized in middle management, this leads to ever growing fiefdoms where technical decisions are made for entirely wrong organizational reasons. Ideas and innovation come from everywhere, and while different parts of the organization are better suited to execution of large projects based on their area of expertise, no one should be forbidden from articulating their ideas due to arbitrary assignment.
The Recluse
Also known as "the specialist", this engineer defines his role in the most narrow fashion possible, creating an ivory tower limited by his skill set and attitude. Good engineers seize a hard problem and drive it to completion, even when that problem pushes them well beyond their comfort zone. The recluse, however, will use any excuse to make something someone else's problem. Even when the problem falls within his limited domain, he will solve only the smallest portion of the problem, preferring to file a bug to have someone else finish the overall work. When confronted on architectural issues, he will often agree to do it your way, but then does it his way anyway. Months later it can turn out he never understood what you had said in the first place or discussed it in the interim, and by then it's too late to undo the damage done.
All of us have the potential for these anti-patterns in us. It's only through regular introspection and frank discussions with colleagues that we can hope to have enough self awareness to avoid going down these paths. Most importantly, we all need to work to create a strong engineering culture where it is impossible for these pathologies to thrive. Once these pathologies become a fixture in a culture, they breed similar mentalities as the organization grows and can be impossible to eradicate at scale.
A node.js CLI?
Over the past several months, one of the new features I've been working on for the next release is the development of the new CLI for our appliance. While the CLI is the epitome of a checkbox item to most users, as a completely different client-side consumer of web APIs it can have a staggering maintenance cost. Our upcoming release introduces a number of new concepts that required that we gut our web services and build them from the ground up - retrofitting the old CLI was simply not an option.
What we ended up building was a local node.js CLI that takes programmatically defined web service API definitions and dynamically constructs a structured environment for manipulating state through those APIs. Users can log in with their Delphix credentials over SSH or the console and be presented with this CLI via custom PAM integration. Whenever I describe this to people, however, I get more than a few confused looks and questions:
- Isn't node.js for writing massively scalable cloud applications?
- Does anyone care about a CLI?
- Are you high?
Yes, node.js is so hot right now. Yes, a bunch of Joyeurs (the epicenter of node.js) are former Fishworks colleagues. But the path here, as with all engineering at Delphix, is ultimately data-driven based on real problems, and this was simply the best solution to those problems. I hope to have more blog posts about the architecture in the future, but as I was writing up a README for our own developers, I thought the content would make a reasonable first post. What follows is an engineering FAQ. From here I hope to describe our schemas and some of the basic structure of the CLI, so stay tuned.
Why a local CLI?
The previous Delphix CLI was a client-side java application written in groovy. Running the CLI on the client both incurs a cost to users (they need to download and manage additional software) and makes it a nightmare to manage different versions across different servers. Nearly every appliance shipped today has an SSH interface; doing something different just increases customer confusion. The purported benefit (there is no native Windows SSH client) has proven to be insignificant in practice, and there are other more scalable solutions to this problem (such as distributing a java SSH client).
Why Javascript?
We knew that we were going to need to be manipulating a lot of dynamic state, and the scope of the CLI would remain relatively small. A dynamic scripting language makes for a far more compelling development environment for rapid development, at the cost of needing a more robust unit test framework to catch what would otherwise be compile time errors in a strongly typed statically compiled language. We explicitly chose javascript because our new GUI will be built in javascript, and this both keeps the number of languages and environments used in production to a minimum and allows these clients to share code where applicable.
Why node.js?
We knew v8 was the best-of-breed runtime when it comes to javascript, and we actually started with a custom v8 wrapper. As a single threaded environment, this was pretty straightforward. But once we started considering background tasks we knew we'd need to move to an asynchronous model. Between the cost of building infrastructure already provided by node (HTTP request handling, file manipulation, etc) and the desire to support async activity, node.js was the clear choice of runtime.
Why auto-generated?
Historically, the cost of maintaining the CLI at Delphix (and elsewhere) has been very high. CLI features lag behind the GUI, and developers face additional burden to port their APIs to multiple clients. We wanted to build a CLI that would be almost completely generated from shared schema. When developers change the schema in one place, we auto-generate the backend infrastructure (java objects and constants), the GUI data model bindings, and the complete CLI experience.
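A rough sketch of what generating the CLI from schema can look like is shown below, in Python for brevity. The schema entries here are a simplified, hypothetical shape invented for illustration (the real Delphix schema format is not described in this post), and the command names are likewise illustrative. What the sketch shows is that the set of contexts and commands falls out of the data, so adding an API means adding a schema entry rather than writing CLI code.

```python
# Hypothetical, simplified schema entries; field names are assumptions.
SCHEMAS = [
    {
        "root": "/database",
        "crud": ["list", "read", "update", "delete"],
        "rootOperations": ["link", "provision"],   # global non-CRUD operations
        "operations": ["start", "stop"],           # per-object operations
    },
    {
        "root": "/service/ndmp",
        "singleton": True,                         # read/update on a global path
        "crud": ["read", "update"],
    },
]

# Illustrative mapping from CRUD operations to CLI commands.
CRUD_TO_ROOT_COMMANDS = {"list": ["list", "select"], "create": ["create"]}
CRUD_TO_OBJECT_COMMANDS = {"read": ["get"], "update": ["update"], "delete": ["delete"]}


def build_commands(schemas):
    """Map each CLI context name to the commands generated for it."""
    contexts = {}
    for schema in schemas:
        name = " ".join(part for part in schema["root"].split("/") if part)
        crud = schema.get("crud", [])
        root_cmds = [c for op in crud for c in CRUD_TO_ROOT_COMMANDS.get(op, [])]
        object_cmds = [c for op in crud for c in CRUD_TO_OBJECT_COMMANDS.get(op, [])]
        if schema.get("singleton"):
            # A singleton has no collection, so object commands live at the root.
            contexts[name] = object_cmds + schema.get("operations", [])
        else:
            contexts[name] = root_cmds + schema.get("rootOperations", [])
            contexts[name + " <object>"] = object_cmds + schema.get("operations", [])
    return contexts


commands = build_commands(SCHEMAS)
```

With these two entries, the `database` context gets `list`, `select`, `link`, and `provision`; a selected database object gets `get`, `update`, `delete`, `start`, and `stop`; and the `service ndmp` singleton gets only `get` and `update`.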
Why a modal hierarchy?
The look and feel of the Delphix CLI is in many ways inspired by the Fishworks CLI. As engineers and users of many (bad) CLIs, our experience has led to the belief that a CLI with integrated help, tab completion, and a filesystem-like hierarchy promotes exploration and is more natural than a CLI with dozens of commands each with dozens of options. It also makes for a better representation of the actual web service APIs (and hence easier auto-generation), with user operations (list, select, update) that mirror the underlying CRUD API operations.
Data Replication: Building a better NDMP
In my previous post I outlined some of the challenges faced when building a data replication solution, how the first Delphix implementation missed the mark, and how we set out to build something better the second time around.
The first thing that became clear after starting on the new replication subsystem was that we needed a better NDMP implementation. A binary-only separate daemon with poor error semantics that routinely left the system in an inconsistent state was not going to cut it. NDMP is a protocol built for a singular purpose: backing up files using a file-specific format (dump or tar) over arbitrary topologies (direct attached tape, 3-way restore, etc.). Because it is so specific in its data semantics yet so general in its control protocol, we end up with the worst of both worlds: baked-in concepts (such as file history, complete with inode numbers) that prevent us from adequately expressing Delphix concepts, and a limited control protocol (lacking multiple streams or resumable streams) with terrible error semantics. While we will ultimately replace NDMP for replication, we knew that we still needed it for backup, and that we didn't have the time to replace both the implementation and the data protocol for the current release.
Illumos, the open source operating system our distribution is based on, provides an NDMP implementation, one that I had previously dealt with while at Fishworks (though Dave Pacheco was the one who did the actual NDMP integration). I spent some time looking at the implementation and came to the conclusion that it suffered from a number of fatal flaws:
- Poor error semantics - The strategy was "log everything and worry about it later". For an implementation shipped with a roll-your-own OS this was not a terrible strategy, but it was a deal breaker for an appliance implementation. We needed clear, concise failure modes that appeared integrated with our UI.
- Embedded data semantics - The notion of tar as a backup format (or raw zfs send) was built very deeply into the architecture. We needed our own data protocol, but replacing the data operations without major surgery was out of the question. While raw ZFS send seems appealing, it still assumes ownership and control of the filesystem namespace, something that wouldn't fly in the Delphix world.
- Unused code - There was tons of dead code, ranging from protocol varieties that were unnecessary (NDMPv2) to swaths of device handling code that did nothing.
- Standalone daemon - A standalone daemon makes it difficult to exchange data across the process boundary, and introduces new complex failure modes.
With this in mind I looked at the ndmp.org SDK implementation, and found it to suffer from the same pathologies (and a much worse implementation to boot). It was clear that the Solaris implementation was derived from the SDK, and that there was no mythical "great NDMP implementation" waiting to be found. I was going to have to suck it up and get back to my Solaris roots to eviscerate this beast.
The first thing I did was recast the daemon as a library, eliminating any code that dealt with daemonizing, running a door server to report statistics, and existing Solaris commands that communicated with the server. This allowed me to add a client-provided callback vector and a set of configuration options to control state. With this library in place, we could use JNA to easily call into C code from our Java management stack without having to worry about marshaling data to and from an external daemon.
The next step was to rip out all the data-handling functionality, instead creating a set of callback vectors in the library registration mechanism to start and stop backup. This left the actual implementation of the over-the-wire format up to the consumer. The sheer amount of code used to support tar and zfs send was staggering, and it had its tendrils all across the implementation. As I started to pull on the thread, more and more started to unravel. Data-specific operations would call into the "tape library management" code (which had very little to do with tape library management) that would then call back into common NDMP code, that would then do nothing.
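The callback-vector design can be sketched as follows. This is an illustrative Python model rather than the C library itself: the Callbacks and NdmpLibrary names are hypothetical, the message names are used only as plain strings, and no wire protocol is implemented.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Callbacks:
    """Consumer-provided vector; the library never handles data itself."""
    log: Callable[[str, str], None]        # (severity, message)
    start_backup: Callable[[dict], None]   # receives the request environment
    stop_backup: Callable[[], None]


class NdmpLibrary:
    """Stand-in for the protocol engine: it only routes requests."""

    def __init__(self, callbacks: Callbacks):
        self.cb = callbacks

    def handle(self, request: str, env: Optional[dict] = None):
        if request == "NDMP_DATA_START_BACKUP":
            self.cb.start_backup(env or {})
        elif request == "NDMP_DATA_STOP":
            self.cb.stop_backup()
        else:
            # Report through the consumer rather than a private log file.
            self.cb.log("error", f"unsupported request: {request}")


# The consumer decides what a backup is and where messages go.
events: List[str] = []
lib = NdmpLibrary(Callbacks(
    log=lambda severity, message: events.append(f"{severity}: {message}"),
    start_backup=lambda env: events.append(f"start {env}"),
    stop_backup=lambda: events.append("stop"),
))
lib.handle("NDMP_DATA_START_BACKUP", {"TYPE": "delphix"})
lib.handle("NDMP_DATA_STOP")
lib.handle("NDMP_TAPE_OPEN")
print(events)
```

The point of the shape is that the over-the-wire data format and the logging destination both live entirely on the consumer's side of the vector.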
With the data operations gone, I then had to finally address the hard part: making the code usable. The old error semantics were terrible. I had to go through every log call and non-zero return value, analyze its purpose, and restructure it to use the consumer-provided vector so that we could log such messages natively in the Delphix stack. While doing generic code cleanup, this led me to rip out huge swaths of unused code, from buffer management to NDMPv2 support (v3 has been in common use for more than a decade). This was rather painful, but the result has been quite a usable product. While the old Delphix implementation would have reported "NDMP server error CRITICAL: consult server log for details" (of course, there was no way for the customer to get to the "server log"), we would now get much more helpful messages like "NDMP client reported an error during data operation: out of space".
The final piece of the puzzle was something that surprised me. By choosing NDMP as the replication protocol (again, a temporary choice), we needed a way to drive the 3-way restore operation from within the Delphix stack. This meant that we wanted to act as a DMA. As I looked at the unbelievably awful 'ndmpcopy' implementation shipped with the NDMP SDK, I noticed a lot of similarity between what we needed on the client and what we had on the server (processing requests was identical, even if the set of expected requests was quite different). Rather than build an entirely separate implementation, I converted libndmp such that it could act as a server or a client. This allowed us to build an NDMP copy operation in Java, as well as simulate a remote DMA (an invaluable testing tool).
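One way to picture a single library acting as either server or client is a shared request-processing loop with a role-specific handler table. The sketch below is an assumption-laden illustration: the Endpoint class and its handlers are hypothetical, and the message names are simplified strings rather than real protocol encodings.

```python
class Endpoint:
    """Shared request processing; only the handler table differs by role."""

    def __init__(self, role, handlers):
        self.role = role
        self.handlers = handlers

    def process(self, message, payload=None):
        handler = self.handlers.get(message)
        if handler is None:
            return ("error", f"{self.role} does not handle {message}")
        return ("ok", handler(payload))


def make_server():
    # A server answers control requests issued by a DMA.
    return Endpoint("server", {
        "CONNECT_OPEN": lambda payload: "connected",
        "DATA_START_BACKUP": lambda payload: "backup started",
    })


def make_client():
    # A client (DMA) receives notifications posted by servers.
    return Endpoint("client", {
        "NOTIFY_DATA_HALTED": lambda payload: f"data halted: {payload}",
        "LOG_MESSAGE": lambda payload: f"log: {payload}",
    })


print(make_server().process("CONNECT_OPEN"))
print(make_client().process("NOTIFY_DATA_HALTED", "successful"))
print(make_client().process("CONNECT_OPEN"))
```

With both roles available in one process, a test can stand in for the remote DMA without any second machine.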
It took more than a month of solid hard work and several more months of cleanup here and there, but the result was worth it. The new implementation clocks in at just over 11,000 lines of code, while the original was a staggering 43,000 lines of code. Our implementation doesn't include any actual data handling, so it's perhaps an unfair comparison. But we also include the ability to act as a full-featured DMA client, something the illumos implementation lacks.
The results of this effort will be available on github as soon as we release the next Delphix version (within a few weeks). While interesting, it's unlikely to be useful to the general masses, and certainly not something that we'll try to push upstream. I encourage others looking for an open-source embedded NDMP implementation to fork and improve what we have in Delphix - it's a very flexible NDMP implementation that can be adapted for a variety of non-traditional NDMP scenarios. But with no built-in data processing, and no standalone daemon implementation, it's a long way from replacing what can be found in illumos. If someone were so inspired, they could build a daemon on top of the current library - one that provides support for tar, dump, ZFS, and whatever other formats are supported by the current illumos implementation. It would not be a small amount of work, but I am happy to lend advice (if not code) to anyone interested.
Next up will be a post whose working title is "Data Replication: Metadata + Data = Crazy Pain in My Ass".
Data Replication: Approaching the Problem
With our next Delphix release just around the corner, I wanted to spend some time discussing the engineering process behind one of the major new features: data replication between servers. The current Delphix version already has a replication solution, so how does this constitute a "new feature"? The reason is that it's an entirely new system, the result of an endeavor to create a more reliable, maintainable, and extensible system. How we got here makes for an interesting tale of business analysis, architecture, and implementation.
Where did we come from?
Before we begin looking at the current implementation, we need to understand why we started with a blank sheet of paper when we already had a shipping solution. The short answer is that what we had was unusable: it was unreliable, undebuggable, and unmaintainable. And when you're in charge of data consistency for disaster recovery, "unreliable" is not an acceptable state. While I had not written any of the replication infrastructure at Fishworks (my colleagues Adam Leventhal and Dave Pacheco deserve the credit for that), I had spent a lot of time in discussions with them, as well as thinking about how to build a distributed data architecture at Fishworks. So it seemed natural for me to take on this project at Delphix. As I started to unwind our current state, I found a series of decisions that, in hindsight, led to the untenable state we were in.
- Analysis of the business problem - At the core of the current replication system was the notion that its purpose was for disaster recovery. This is indeed a major use case of replication, but it's not the only one (geographical distribution of data being another strong contender). While picking one major problem to tackle first is a reasonable approach to constrain scope, by not correctly identifying future opportunities we ended up with a solution that could only be used for active/passive disaster recovery.
- Data protocol choice - There is another problem that is very similar to replication: offline backup/restore. Clearly, we want to leverage the same data format and serialization process, but do we want to use the same protocol? NDMP is the industry standard for backups, but it's tailored to a very specific use case (files and filesystems). By choosing to use NDMP for replication, we sacrificed features (resumable operations, multiple streams), usability (poor error semantics), and maintainability (unnecessarily complicated operation).
- Outsourcing of work - At the time this architecture was created, it was decided that NDMP was not part of the company's core competency, and we should contract with a third party to provide the NDMP solution. I'm a firm believer that engineering work should never be outsourced unless it's known ahead of time that the result will be thrown away. Otherwise, you're inevitably saddled with a part of your product that you have limited ability to change, debug, and support. In our case, this was compounded by the fact that the deliverable was binary objects - we didn't even have source available.
- Architectural design - By having a separate NDMP daemon we were forced to have an arcane communication mechanism (local HTTP) that lost information with each transition, resulting in a non-trivial amount of application logic resting in a binary we didn't control. This made it difficult to adapt to core changes in the underlying abstractions.
- Algorithmic design - There was a very early decision made that replication would be done on a per-group basis (Delphix arranges databases into logical groups). This was divorced from the reality of the underlying ZFS data dependencies, resulting in numerous oddities such as being unable to replicate non-self-contained groups or cyclic dependencies between groups. This abstraction was so deeply baked into the original architecture that it was impossible to fix there.
- Implementation - The implementation itself was built to be "isolated" from any other code in the system. When one is replicating the core representation of system metadata, this results in an unmaintainable and brittle mess. We had a completely separate copy of our object model that had to be maintained and updated along with the core model, and changes elsewhere in the system (such as deleting objects while replication was ongoing) could lead to obscure errors. The most egregious problems led to unrecoverable state - the target and source could get out of sync such that the only resolution was a new full replication from scratch.
- Test infrastructure - There was no unit test infrastructure, no automated functional test infrastructure, and no way to test the majority of functionality without manually setting up multi-machine replication or working with a remote DMA. As a result only the most basic functionality worked, and even then it was unreliable most of the time.
Ideals for a new system
Given this list of limitations, I (later joined by Matt) sat down with a fresh sheet of paper. The following were some of the core ideals we set forth as we built this new system:
- Separation of mechanism from protocol - Whatever choices we make in terms of protocol and replication topologies, we want the core serialization infrastructure to be entirely divorced from the protocol used to transfer the data.
- Support for arbitrary topologies - We should be able to replicate from a host to any number of other hosts and vice versa, as well as provision from replicated objects.
- Robust test infrastructure - We should be able to run protocol-level tests, simulate failures, and perform full replication within a single-system unit test framework.
- Integrated with core object model - There should be one place where object definitions are maintained, such that the replication system can't get out of sync with the primary source code.
- Resilient to failure - No matter what, the system must maintain a consistent state in the face of failure. This includes both catastrophic system failure and ongoing changes to the system (e.g. objects being created and deleted). At any point, we must be able to resume replication from a previously known good state without user intervention.
- Clear error messages - Failures, when they do occur, must present a clear indication of the nature of the problem and what actions must be taken by the user, if any, to fix the underlying problem.
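Several of these ideals can be illustrated together in a small sketch: serialization that knows nothing about the transport, an in-memory transport that lets a full send run inside a single-process test, a simulated failure, and resumption from the last known good position. All names here (LoopbackTransport, serialize, replicate) are hypothetical, and JSON stands in for whatever serialization format is actually used.

```python
import json


class LoopbackTransport:
    """In-memory transport so a full send/receive runs in one process."""

    def __init__(self, fail_after=None):
        self.delivered = []
        self.fail_after = fail_after  # simulate a failure after N records

    def send(self, record: bytes):
        if self.fail_after is not None and len(self.delivered) >= self.fail_after:
            raise ConnectionError("simulated transport failure")
        self.delivered.append(record)


def serialize(obj) -> bytes:
    """Serialization knows nothing about how the bytes are transported."""
    return json.dumps(obj, sort_keys=True).encode("utf-8")


def replicate(objects, transport, resume_from=0):
    """Send objects in order; return the index to resume from next time."""
    sent = resume_from
    try:
        for obj in objects[resume_from:]:
            transport.send(serialize(obj))
            sent += 1
    except ConnectionError:
        pass  # 'sent' still marks the last known good position
    return sent


objects = [{"id": 1}, {"id": 2}, {"id": 3}]

flaky = LoopbackTransport(fail_after=2)
checkpoint = replicate(objects, flaky)          # stops after two records

healthy = LoopbackTransport()
finished = replicate(objects, healthy, resume_from=checkpoint)
print(checkpoint, finished, healthy.delivered)
```

Swapping LoopbackTransport for a network-backed class with the same send method would leave serialize and replicate untouched, which is the separation the first ideal asks for.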
At the same time, we were forced to limit the scope of the project so we could deliver something in an appropriate timeframe. We stuck with NDMP as a protocol despite its inherent problems, as we needed to fix our backup/restore implementation as well. And we kept the active/passive deployment model so that we did not require any significant changes to the GUI.
Next, I'll discuss the first major piece of work: building a better NDMP implementation.
Your MDB fell into my DTrace!
Yesterday, several of us from Delphix, Nexenta, Joyent, and elsewhere, convened before the OpenStorage summit as part of an illumos hackathon. The idea was to get a bunch of illumos coders in a room, brainstorm a bunch of small project ideas, and then break off to go implement them over the course of the day. That was the idea, at least - in reality we didn't know what to expect or how it would turn out. Suffice to say that the hackathon was an amazing success. There were a lot of cool ideas, and a lot of great mentors in the room that could lead people through unfamiliar territory.
For my mini-project (suggested by ahl), I implemented MDB's ::print functionality in DTrace via a new print() action. Today, we have the trace() action, but the result is somewhat less than useful when dealing with structs, as it degenerates into tracemem():
# dtrace -qn 'BEGIN{trace(`p0); exit(0)}'
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
0: 00 00 00 00 00 00 00 00 60 02 c3 fb ff ff ff ff ........`.......
10: c8 c9 c6 fb ff ff ff ff 00 00 00 00 00 00 00 00 ................
20: b0 ad 14 c6 00 ff ff ff 00 00 00 00 02 00 00 00 ................
...
The results aren't pretty, and we end up throwing away all that useful proc_t type information. With a few tweaks to dtrace, and some cribbing from mdb_print.c, we can do much better:
# dtrace -qn 'BEGIN{print(`p0); exit(0)}'
proc_t {
struct vnode *p_exec = 0
struct as *p_as = 0xfffffffffbc30260
struct plock *p_lockp = 0xfffffffffbc6c9c8
kmutex_t p_crlock = {
void *[1] _opaque = [ 0 ]
}
struct cred *p_cred = 0xffffff00c614adb0
int p_swapcnt = 0
char p_stat = '02'
....
Much better! Now, how did we get there from here? The answer was an interesting journey through libdtrace, the kernel dtrace implementation, CTF, and the horrors of bitfields.
To action or not to action?
The first question I set out to answer is what the user-visible interface should be. It seemed clear that this should be an operation on the same level as trace(), allowing arbitrary D expressions, but simply preserving the type of the result and pretty-printing it later. After briefly considering printt() (for "print type"), I decided upon just print(), since this seemed like a logical choice. My first inclination was to create a new DTRACEACT_PRINT, but after some discussion with Adam, we decided this was extraneous - the behavior was identical to DTRACEACT_DIFEXPR (the internal name for trace), but just with type information.
Through the looking glass with types and formats
The real issue is that what we compile (dtrace statements) and what we consume (dtrace epids and records) are two very different things, and never the twain shall meet. At the time we go to generate the DIFEXPR statement in dt_cc.c, we have the CTF data in hand. We don't want to change the DIF we generate, simply do post-processing on the other side, so we just need some way to get back to that type information in dt_consume_cpu(). We can't simply hang it off our dtrace statement, as that would break anonymous tracing (and violate the rest of the DTrace architecture to boot).
Thankfully, this problem had already been solved for printf() (and related actions) because we need to preserve the original format string for the exact same reason. To do this, we take the action-specific integer argument, and use it to point into the DOF string table, where we stash the original format string. I simply had to hijack dtrace_dof_create() and have it do the same thing for the type information, right?
If only it were so simple. There were two complications. First, a lot of code explicitly treats these as printf strings and parses them into internal argv-style representations; pretending our types were just format strings would cause all kinds of problems in this code. So I had to modify libdtrace to treat this more explicitly as raw 'string data' that is (optionally) used with the DIFEXPR action. Second, even with that in place, the formats I was sending down were not making it back out of the kernel. Because the argument is action-specific, the kernel needed to be modified to recognize this new argument in dtrace_ecb_action_add. With that change in place, I was able to get the format string back in userland when consuming the CPU buffers.
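The general trick of carrying a string alongside an action as an integer offset into a string table can be sketched as follows. This models only the idea; it is not the DOF layout or the libdtrace API, and the StringTable class and action dictionaries are hypothetical.

```python
class StringTable:
    """Append-only table of NUL-terminated strings addressed by offset."""

    def __init__(self):
        self.data = bytearray(b"\0")  # offset 0 is reserved for "no string"

    def add(self, text: str) -> int:
        offset = len(self.data)
        self.data += text.encode("utf-8") + b"\0"
        return offset

    def get(self, offset: int) -> str:
        end = self.data.index(b"\0", offset)
        return self.data[offset:end].decode("utf-8")


strtab = StringTable()

# Compile time: each action carries only an integer argument.
actions = [
    {"kind": "printf", "arg": strtab.add("%s: %d\n")},  # format string
    {"kind": "difexpr", "arg": strtab.add("proc_t")},    # type name for print()
    {"kind": "difexpr", "arg": 0},                       # plain trace(), no string
]


def describe(action):
    """Consume time: recover the string from the integer argument alone."""
    return strtab.get(action["arg"]) if action["arg"] else None


for action in actions:
    print(action["kind"], repr(describe(action)))
```

The same slot serves both purposes, so the consumer has to know from the action kind whether the string is a printf format to parse or raw type data to look up, which is the distinction the first complication above is about.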
Bitfields, or why the D compiler cost me an hour of my life
With the trace data and type string in hand, I then proceeded to copy the mdb ::print code, first from apptrace (which turned out to be complete garbage) and then fixing it up bit by bit. Finally, after tweaking the code for an hour or two, I had it looking pretty much identical to ::print output. But when I fed it a klwp_t structure, I found that the user_desc_t structure bitfields weren't being printed correctly:
# dtrace -n 'BEGIN{print(*((user_desc_t*)0xffffff00cb0a4d90)); exit(0)}'
dtrace: description 'BEGIN' matched 1 probe
CPU ID FUNCTION:NAME
0 1 :BEGIN user_desc_t {
unsigned long usd_lolimit = 0xcff3000000ffff
unsigned long usd_lobase = 0xcff3000000
unsigned long usd_midbase = 0xcff300
unsigned long usd_type = 0xcff3
unsigned long usd_dpl :64 = 0xcff3
unsigned long usd_p :64 = 0xcff3
unsigned long usd_hilimit = 0xcf
unsigned long usd_avl :64 = 0xcf
unsigned long usd_long :64 = 0xcf
unsigned long usd_def32 :64 = 0xcf
unsigned long usd_gran :64 = 0xcf
unsigned long usd_hibase = 0
}
I spent an hour trying to debug this, only to find that the CTF IDs weren't matching what I expected from the underlying object. I finally tracked it down to the fact that the D compiler, by virtue of processing the /usr/lib/dtrace files, pulls in its own version of klwp_t from the system header files. But it botches the bitfields, leaving the user with subtly incorrect data. Switching the type to be genunix`user_desc_t fixed the problem.
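To see why the type description matters so much for bitfields, the sketch below decodes the same 64-bit word two ways. The member widths are an assumption taken from the standard x86 segment descriptor layout, and the second decoder is only a comparison that ignores member widths; it is not a claim about what the D compiler actually does internally.

```python
# (name, width in bits), lowest bit first: assumed x86 segment descriptor layout.
FIELDS = [
    ("usd_lolimit", 16), ("usd_lobase", 16), ("usd_midbase", 8),
    ("usd_type", 5), ("usd_dpl", 2), ("usd_p", 1),
    ("usd_hilimit", 4), ("usd_avl", 1), ("usd_long", 1),
    ("usd_def32", 1), ("usd_gran", 1), ("usd_hibase", 8),
]


def decode(word):
    """Extract each member using both its bit offset and its bit width."""
    out, offset = {}, 0
    for name, width in FIELDS:
        out[name] = (word >> offset) & ((1 << width) - 1)
        offset += width
    return out


def decode_ignoring_width(word):
    """Read from the containing byte with no mask, for comparison only."""
    out, offset = {}, 0
    for name, width in FIELDS:
        out[name] = word >> (offset // 8 * 8)
        offset += width
    return out


WORD = 0x00CFF3000000FFFF  # raw value shown on the usd_lolimit line above

for name, value in decode(WORD).items():
    print(f"{name:12} = {value:#x}")
```

With widths applied, the word decodes to a limit of 0xffff, a type of 0x13, a privilege level of 3, and the present, 32-bit default, and granularity bits set. Without them, decode_ignoring_width produces values such as 0xcff3 and 0xcf that happen to match the garbled output above, which is at least consistent with the bitfield sizes having been lost along the way.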
What's next
Given the usefulness of this feature, the next steps are to clean up the code, get it reviewed, and push to the illumos gate. It should hopefully be finding its way to an illumos distribution near you soon. Here's a final print() invocation to leave you with:
# dtrace -n 'zio_done:entry{print(*args[0]); exit(0)}'
dtrace: description 'zio_done:entry' matched 1 probe
CPU ID FUNCTION:NAME
0 42594 zio_done:entry zio_t {
zbookmark_t io_bookmark = {
uint64_t zb_objset = 0
uint64_t zb_object = 0
int64_t zb_level = 0
uint64_t zb_blkid = 0
}
zio_prop_t io_prop = {
enum zio_checksum zp_checksum = ZIO_CHECKSUM_INHERIT
enum zio_compress zp_compress = ZIO_COMPRESS_INHERIT
dmu_object_type_t zp_type = DMU_OT_NONE
uint8_t zp_level = 0
uint8_t zp_copies = 0
uint8_t zp_dedup = 0
uint8_t zp_dedup_verify = 0
}
zio_type_t io_type = ZIO_TYPE_NULL
enum zio_child io_child_type = ZIO_CHILD_VDEV
int io_cmd = 0
uint8_t io_priority = 0
uint8_t io_reexecute = 0
uint8_t [2] io_state = [ 0x1, 0 ]
uint64_t io_txg = 0
spa_t *io_spa = 0xffffff00c6806580
blkptr_t *io_bp = 0
blkptr_t *io_bp_override = 0
blkptr_t io_bp_copy = {
dva_t [3] blk_dva = [
dva_t {
uint64_t [2] dva_word = [ 0, 0 ]
},
...
Delphix illumos sources posted to github
With our first illumos-based distribution (2.6) out the door, we've posted the illumos-derived sources to github:
https://github.com/delphix/delphix-os-2.6
This repository contains the following types of changes from the illumos gate:
- Changes that are complete and generally useful to the illumos community. We have been (and will continue to be) proactive about pushing these changes to the illumos trunk ourselves. We missed a few this time around, so we'll be going back through to pick up anything we missed.
- Changes that are sufficient to meet the needs of our product, but are not complete or generally useful for the larger community. Our hope is that by pushing these changes to github, others can pick up such pieces of work and integrate them in a form that is acceptable to the illumos community at large.
- Changes that represent distro-specific changes unique to our product. It is unlikely that these will be of interest to anyone except the morbidly curious.
We will post updates with each release of the software. This allows us to make sure the code is fully baked and tested, while still allowing us to proactively push complete pieces of work more frequently.
If you have questions about any particular change, feel free to email the author for more information. You can also find us on the illumos developer mailing list and the #illumos IRC channel on freenode.



