One of the projects I’ve been working on recently at work is adapting a legacy Java application to fit into a modern, dynamic, parallelized framework. Legacy design aside, we chose this application for a few reasons:
- it’s open source and mature;
- the primary maintainer is easy to communicate with;
- it has a well-established academic basis.
Even so, I don’t believe the last point would have been enough if the first two weren’t true; even with the most rigorously established tenets, a project grounded in solid academic work can still falter when pushed to its limits in a production system.
As with many established legacy Java projects, this one was written before modularity entered the province of the professional programmer. It is a gigantic monolith of Java code that attempts to provide everything from a secure authentication system to a GUI for editing the data, all behind an API. As such, it’s been a challenge in some cases to incorporate it as an internal component of a broader system, and even more so to integrate it as a redundant component (it’s designed to be a single, running monolith, an adorable trait from the pre-EC2 days).
As we’ve worked with the system, we’ve found a few patterns helpful:
- Creating an API wrapper: putting a thin API in front of the main engine has done a huge amount to make this system work for us. The API layer can compensate for some of the more annoying features (e.g., the auth system, which is cumbersome in our fully private system), and it provides a place to standardize access to the system.
- Hooks to reflect the system’s internal state changes to other components: we were lucky that this particular system has some support for event hooks. We’ve taken this ability and combined it with a publishing mechanism, in our case a Kafka queue that acts as an RPC stream (more on this in a later post). Combined with our standardized API interface, this has allowed us to offload complex computations to an external system (see my post on Samza and JRuby) and then push the data back into the system.
- Database roles, row-level security, and configuring the database at startup to use this role (thank you PostgreSQL): this idea isn’t as clear-cut as the previous two. This system is written to behave as though it is the only system in the universe, with sole access to its database. If we were to scale it up to work in parallel (and we are considering thousands of parallel instances here), creating a new database for each instance would be a nightmare. Thankfully, PostgreSQL provides role-based, row-level security, which allows a single database to provide isolated access to many different users based on permissions. Whether you can use this will vary, but where it is possible, it can be a great boon when dealing with many hundreds of instances that don’t understand how to share amongst themselves.
- Incorporate an external service for cross-system synchronization: also not as hard-and-fast as the first and second items, this can be of great use when trying to work out which instance should do what, and when instance B is clear to follow instance A. Since we’re already incorporating Apache Samza into our stack, we already have ZooKeeper available. ZooKeeper provides some great primitives; for instance, its notion of ephemeral nodes is a great help for publishing service availability (this is how Kafka does it).
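To make the first two patterns a bit more concrete, here is a minimal sketch of how a thin wrapper and an event hook might fit together. Everything in it is hypothetical: `LegacyEngine`, `EngineApi`, `EventPublisher`, and `InMemoryEngine` are illustrative names, not the real application's API, and `EventPublisher` merely stands in for whatever actually writes to Kafka. The only assumptions are that the engine requires a login and lets callers register a change hook.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class LegacyWrapperSketch {

    /** Hypothetical change notification emitted by the engine's hook mechanism. */
    static final class ChangeEvent {
        final String key;
        final String value;
        ChangeEvent(String key, String value) { this.key = key; this.value = value; }
    }

    /** Hypothetical stand-in for the legacy engine, auth requirement included. */
    interface LegacyEngine {
        String login(String user, String password);          // returns a session token
        void put(String sessionToken, String key, String value);
        String get(String sessionToken, String key);
        void onChange(Consumer<ChangeEvent> hook);           // event hook registration
    }

    /** Hypothetical stand-in for the publishing side (e.g. a Kafka producer). */
    interface EventPublisher {
        void publish(String topic, String payload);
    }

    /** Thin wrapper: logs in once, hides auth from callers, forwards changes. */
    static final class EngineApi {
        private final LegacyEngine engine;
        private final String session;

        EngineApi(LegacyEngine engine, String serviceUser, String servicePassword,
                  EventPublisher publisher, String topic) {
            this.engine = engine;
            // Internal callers never see credentials; the wrapper owns the session.
            this.session = engine.login(serviceUser, servicePassword);
            // Reflect every internal state change to the outside world.
            engine.onChange(e -> publisher.publish(topic, e.key + "=" + e.value));
        }

        void write(String key, String value) { engine.put(session, key, value); }

        String read(String key) { return engine.get(session, key); }
    }

    /** In-memory fake so the sketch runs without the real system. */
    static final class InMemoryEngine implements LegacyEngine {
        private final Map<String, String> data = new HashMap<>();
        private final List<Consumer<ChangeEvent>> hooks = new ArrayList<>();

        public String login(String user, String password) { return "session-" + user; }

        public void put(String sessionToken, String key, String value) {
            requireSession(sessionToken);
            data.put(key, value);
            for (Consumer<ChangeEvent> hook : hooks) {
                hook.accept(new ChangeEvent(key, value));
            }
        }

        public String get(String sessionToken, String key) {
            requireSession(sessionToken);
            return data.get(key);
        }

        public void onChange(Consumer<ChangeEvent> hook) { hooks.add(hook); }

        private static void requireSession(String token) {
            if (token == null || !token.startsWith("session-")) {
                throw new IllegalStateException("not logged in");
            }
        }
    }

    public static void main(String[] args) {
        List<String> published = new ArrayList<>();
        EngineApi api = new EngineApi(new InMemoryEngine(), "svc", "secret",
                (topic, payload) -> published.add(topic + ":" + payload),
                "engine-changes");
        api.write("answer", "42");
        System.out.println(api.read("answer"));
        System.out.println(published);
    }
}
```

The point of the shape, rather than the details, is that callers only ever talk to `EngineApi`: the auth dance happens once in the constructor, and every write surfaces as a published event that an external system could consume and answer through the same API.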
We arrived at these principles after many months of slow integration. My feeling is that the first and second items are basically required when dealing with a legacy system like this; if you don’t have this basic flexibility, then everything else is going to be much harder.