Lately, technology platform integration has been a popular topic of discussion. These platforms usually include back-office software like a CRM or ERP, e-commerce websites, corporate portals, and the like. Because this is such a frequent and important architectural conversation when developing solutions, an explanation is warranted to help clear up confusion and misconceptions.
Data synchronization and how to achieve it
Probably the most common driver behind integration is the need to make decisions in one system based on data housed in another. You commonly see this when inventory or pricing resides in a back-office system but is required to keep an e-commerce website up to date. That same inventory data might also be surfaced inside the company’s extranet for dashboards used to inform and drive its sales. Adding another layer, transactional data affects inventory. So, you have inventory and pricing data driving e-commerce and traditional sales, both of which, in turn, affect inventory and pricing. You can see how important it is to have synchronized data.
These kinds of data synchronizations can be extremely complicated because of the vast array of technologies and platforms. Most of the larger platforms have a set of APIs (Application Programming Interfaces) that can be useful, but they aren’t a silver bullet. APIs commonly need some customization to make them work, and when dealing with small- to mid-sized platforms, you might find that they don’t exist at all.
To top it off, there are very few standards for these types of integrations, and the ones that DO exist (EDI, for example) are only fun for people who blog about system integrations for entertainment’s sake. If you’re not one of those people, then integration can be a tough go.
Most synchronization processes can be summed up in one of two words: Push or Pull.
For this discussion, I’ll use the terms “source” and “target.” “Source” is the originating system of the data being synchronized, and “target” is the proposed destination system.
Here’s what we need to understand: Is the source system pushing data to the target system or is the target system pulling data from the source system?
Push tactics should be considered first, allowing the source system to decide what data is sent and how often. After all, it’s the source system’s data. Who better to dictate this?
Push is most commonly:
- Event driven – This is when the source system alerts or notifies a target system that an action has taken place. The target system then handles the action as needed.
- Batch driven – This is similar to the event driven system, with a slight exception. In a batch-driven tactic, the source system will collect all of the events that have taken place over a set span of time and push all events at once to be handled by the target system. Examples include:
- FTP a CSV file with all orders created yesterday to a back office system.
- Write an XML file for each order that was placed in the past 15 minutes to a network file share, working directory or queuing mechanism of some sort.
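To make the batch push idea concrete, here is a minimal Python sketch of the CSV flavor. The order fields and the idea that the resulting payload would then be FTP’d to the back office are illustrative assumptions, not a prescribed format.

```python
import csv
import io
from datetime import date, timedelta

def export_yesterdays_orders(orders):
    """Batch push sketch: gather every order created yesterday into a
    single CSV payload, which the source system would then deliver
    (e.g., via FTP) to the target back-office system."""
    yesterday = date.today() - timedelta(days=1)
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["order_id", "total", "created"])
    for order in orders:
        if order["created"] == yesterday:
            writer.writerow([order["order_id"], order["total"],
                             order["created"].isoformat()])
    return buf.getvalue()
```

Note that the source system decides both the contents (yesterday’s orders) and the cadence (once a day); the target just handles whatever arrives.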
Pull tactics are usually batch driven, with variations in how the batch is acquired; there are few exceptions. This tactic is largely oriented around time-based polling: on a given schedule, a polling cycle occurs and all the data collected in that cycle is acted on.
A few examples:
- Every night at 1 a.m., a process looks at a directory (e.g., FTP, network share, local folder) and processes all the files inside it.
- Every 5 minutes, a process queries the e-commerce site’s database for all orders with a status of “New” and creates them in a back office system.
- Every 12 hours, a process calls out to a back office web service that yields all the orders that have been shipped and updates the order status in the e-commerce platform.
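The second example above can be sketched in a few lines of Python. The `fetch_new_orders` and `create_in_back_office` callables are hypothetical hooks that the real integration would supply; the point is the shape of the polling loop, not the specific systems.

```python
import time

def run_polling_cycles(fetch_new_orders, create_in_back_office,
                       interval_seconds=300, cycles=1):
    """Pull sketch: on each polling cycle, query the source system for
    orders with status "New" and create them in the target system.
    Both callables are hypothetical hooks the real integration supplies."""
    processed = []
    for cycle in range(cycles):
        for order in fetch_new_orders():        # pull from the source
            create_in_back_office(order)        # act on the target
            processed.append(order["order_id"])
        if cycle < cycles - 1:
            time.sleep(interval_seconds)        # wait out the polling interval
    return processed
```

Notice that the target system owns the loop here: it decides the schedule and the rate, which is the defining trait of a pull tactic.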
Each of the tactics mentioned above can be paired with another to form a hybrid.
- The source system uses an event push tactic to deliver data to the target system. The target system uses a batch pull tactic to process the data.
- The source system uses a batch push tactic, and the target system uses a batch pull tactic.
Whats, Hows, and Whys
These are great questions to ask at the outset of almost any business project. But with regard to data synchronization, where the answer to almost every other question is “it depends,” they are indispensable.
- … are the technical limitations of the source and target systems?
- … does the data look like that is being synchronized? Is it complex or relatively flat?
- … is the projected size?
- … type of security considerations need to be made?
- … will the data be obtained?
- … often does the data need to be synchronized?
- … stale can the data be?
- … does the data being synchronized satisfy the requirements of the other systems?
- … does this data need to be synchronized?
Asking these questions may lead to other questions, which waterfall into others and still others. But asking them is imperative because the flexibility of the synchronization and the data being synchronized become clearly defined in the process of answering them, enabling tactical decisions to be made.
The Myth of Real-Time Data
Data synchronizations usually start innocently enough, but as requirements change or the project rolls into Phase 2, the phrase “real-time data” will inevitably crop up. This type of synchronization DOES NOT exist and by definition is literally impossible.
To illustrate, here is a purely hypothetical story about a man named Greg.
Greg is a moderately bright, emerging day trader. He lives and dies by his charts, graphs, and “real-time” data his expensive tools provide him. However, he’ll soon find out how expensive his “real-time” data will be, because one day, Greg gets a (probably illegal) tip that a stock in which he has taken a large position will hit a phenomenal peak at around noon before bottoming out in spectacular fashion.
Greg is very excited about the possibility of making his millions over lunch, and fixes his attention on his “real-time” graphs. Just when Greg sees his stock reach the peak, he sells it all. But unfortunately, the second he sold was the second after the stock bottomed out. Greg is now sad and confused. It turns out that his “real-time” data was actually “near-real-time” data.
The lesson is simple: The only things that happen “real-time” have already happened.
To break down the fateful events of our friend Greg:
- He watched intently, and clicked a button. That button click isn’t instant. The browser takes time to translate the event to a request and then transfer the request over the internet.
- The web server handling the request then has to receive it, translate the requested action, and send the sale off to the stock exchange or broker.
For our purposes, the moral of the story is that anything data-driven takes time, especially when the data is travelling across networks and between systems. That time may only be nanoseconds (which may or may not matter for your purposes), but it is latency nonetheless. Real-time simply doesn’t exist; only “near-real-time” data is achievable. The question then becomes, “How far behind is your data?”
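That question can be answered with simple arithmetic. A minimal sketch for a polling tactic (the specific numbers are illustrative assumptions):

```python
def worst_case_staleness(poll_interval_s, transfer_s, processing_s):
    """Worst case under a polling tactic: a change lands just after a
    cycle begins, waits out a full interval, then still needs transfer
    and processing time before the target system sees it."""
    return poll_interval_s + transfer_s + processing_s

# A 5-minute poll with 10 s of transfer and 20 s of processing means
# the target's view of the data can be up to 330 seconds behind.
```

Running the numbers like this early in a project is a cheap way to set expectations before anyone utters the phrase “real-time.”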
What About The Whens?
The Whens come after the Whats, Hows and Whys. It’s where the rubber meets the road. In particular, the most important When is, “When should a particular tactic be used and why?”
Using the answers from the Whats, Hows and Whys, the Whens are pretty simple.
The Whens rules of thumb:
- If the data can be relatively stale (meaning it doesn’t have to be refreshed at a high frequency), or is relatively large and will take some time to process, a batch push or pull tactic would be best.
- If the frequency of updates is quick, and the data isn’t large, an event push tactic would be best.
- If the data needs guaranteed delivery, as with most transactional integrations, an event push tactic would be best.
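The rules of thumb above can be encoded as a small decision helper. The function name and the one-hour staleness cutoff are illustrative assumptions, not hard rules:

```python
def choose_tactic(max_staleness_s, payload_is_large, needs_guaranteed_delivery):
    """Sketch of the 'Whens' rules of thumb: guaranteed delivery and
    fast, small updates favor event push; large or stale-tolerant data
    favors a batch tactic."""
    if needs_guaranteed_delivery:
        return "event push"
    if payload_is_large or max_staleness_s >= 3600:
        return "batch push or pull"
    return "event push"
```

In practice the inputs come straight from the answers to the Whats, Hows and Whys, which is why those questions have to be asked first.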
Is Push or Pull Better For Batches?
So you’ve decided on batches, have you? Great. Now, do you push or pull? The answer depends on who wants to assume control and responsibility of the synchronization.
- In a push scenario, the source system has ultimate control of the schedule and rate at which the data is provided to the target system.
- In a pull scenario, the roles are reversed. The target system controls the schedule and rate at which the data is pulled from the source system.
Here’s something to keep in mind when using batch push tactics: the solution will likely end up a true hybrid, with a batch push on the source system and a batch pull on the target system.
That muddies the question of who is in control, because you’ll have two different polling cycles, which could make identifying problems more difficult.
Can Anything Be Done For Greg?
When the integration needs to be as close to “real-time” as possible, Greg’s story sheds light on how a few milliseconds can make a big difference.
That said, any form of batch tactic is instantly out because it necessarily involves some form of polling cycle, which leaves the event push tactic.
Event push can be harder to achieve, and in some systems it may even be impossible. But if it is possible in your scenario and you’re willing to work for it, it’s well worth the extra effort. It is by far the most flexible solution, allowing for “near-real-time” synchronization as well as the ability to leverage ad-hoc capabilities from both the source and target systems. Control of and responsibility for the synchronization remain with the source system, and the target system can provide a level of guarantee that specific data has been synchronized.
Because of the flexibility allowed, you can implement a hybrid model. The source system can change to use an event push tactic on a polling schedule, and the target system doesn’t need to change. The obvious drawback is the loss of the “near-real-time” synchronization, but it demonstrates the sheer flexibility and scalability of the event push tactic.
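That flexibility can be sketched with a simple in-process queue standing in for whatever transport the real systems would use; the class and method names here are assumptions for illustration.

```python
from queue import Queue

class OrderSync:
    """Event push with a queue between the systems: the source pushes
    each event as it happens, and the target can either consume events
    immediately (near-real-time) or drain them in batches (the hybrid)."""

    def __init__(self):
        self._queue = Queue()

    def push_event(self, order):
        """Source side: event push, one event per action."""
        self._queue.put(order)

    def drain_batch(self):
        """Target side: batch pull, emptying the queue on its own schedule."""
        batch = []
        while not self._queue.empty():
            batch.append(self._queue.get())
        return batch
```

Switching the target from per-event consumption to periodic `drain_batch` calls changes the tactic from event push to the hybrid without touching the source side at all.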
To sum it all up: Ask too many questions. Devise a well-thought-out plan. Favor push tactics for their elegance and flexibility. And learn from poor Greg’s financial ruin – don’t be fooled by the promise of “real-time” data.