Monitoring Integrations
With production data flowing through Open Integration Hub via many flows across many tenants, it’s important to have a strategy for monitoring flows for failures.
A flow can fail to complete successfully for a number of reasons, including:
- An infrastructure failure (lack of CPU or memory, etc.)
- A service or component failure (part of the OIH application goes down)
- An endpoint API failure (one of the integrated systems is down)
- Data-related problems
These failure types are listed from least to most likely to occur. Your triage process should consider the relative likelihood and impact of each failure type.
Failures should account for only a small minority of all flow executions, but they do happen. That makes it critical to have a strategy in place to monitor for and react to each type of failure. The following are recommendations for doing so.
Infrastructure Failure
Infrastructure failures occur when the cloud environment to which Open Integration Hub is deployed cannot support or sustain OIH’s operation. Typically this is caused by insufficient resources allocated to the services required to operate the platform, but these failures can also stem from other issues with the Kubernetes cluster or the underlying cloud computing environment.
How to Identify Infrastructure Failures
Infrastructure failures will most immediately impact data movement, and they will likely impact all active data flows. Most infrastructure failures can be identified via the Kubernetes management interface, and will surface as Kubernetes pods repeatedly restarting and/or settling in a failed state (a sketch for scanning pod health follows the list below).
The following are some specific places to look for failures:
- Pods are unschedulable due to resource constraints such as CPU or memory
- Pods are running but never reach the ready state
- Pods have multiple restarts
- RabbitMQ message queues are backing up
- OIH components are not receiving messages or there is a lack of messages in the pod logs
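If you want to automate this kind of check, below is a minimal TypeScript sketch that flags pods which are not running or are restarting repeatedly. It assumes the pre-1.0 @kubernetes/client-node API (where list calls resolve to an object with a body field); the namespace and restart threshold are placeholders for your own deployment.

```typescript
import * as k8s from '@kubernetes/client-node';

// Placeholders: adjust the namespace and threshold to your deployment.
const NAMESPACE = 'oih-dev-ns';
const RESTART_THRESHOLD = 3;

async function checkPodHealth(): Promise<void> {
  const kc = new k8s.KubeConfig();
  kc.loadFromDefault();
  const core = kc.makeApiClient(k8s.CoreV1Api);

  // List all pods in the OIH namespace and flag any that are not Running
  // or that have accumulated restarts across their containers.
  const res = await core.listNamespacedPod(NAMESPACE);
  for (const pod of res.body.items) {
    const phase = pod.status?.phase ?? 'Unknown';
    const restarts = (pod.status?.containerStatuses ?? [])
      .reduce((sum, c) => sum + c.restartCount, 0);
    if (phase !== 'Running' || restarts > RESTART_THRESHOLD) {
      console.warn(`${pod.metadata?.name}: phase=${phase}, restarts=${restarts}`);
    }
  }
}

checkPodHealth().catch((err) => {
  console.error('pod health check failed', err);
  process.exit(1);
});
```

The same check can be scheduled and wired into whatever alerting channel you already use.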
How to React to Infrastructure Failures
Reacting to infrastructure failures generally means adjusting the resource allocation and/or configurations related to your Kubernetes environment.
The following are some specific reactions that might be helpful:
- Allocate more CPU or memory resources to various pods by increasing the number of nodes available in the Kubernetes cluster.
- Did other infrastructure changes occur that were not intended to impact the Kubernetes environment?
- Did any of the integrations trigger a very large, unexpected load of data?
- Configure auto-scaling resource allocation for some/all components or services.
- Are the RabbitMQ instances hitting message size, queue size, or storage limits? This can be checked through the RabbitMQ management console, or by reviewing storage limits in the Kubernetes management console (a sketch of an automated queue check follows this list).
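As a starting point for the last item, the sketch below polls the RabbitMQ management HTTP API (GET /api/queues) and flags queues that are backing up or have no consumers. The management URL, credentials, and backlog threshold are placeholders, and the script assumes Node 18+ for the built-in fetch.

```typescript
// Placeholders: point these at your RabbitMQ management endpoint.
const MGMT_URL = process.env.RABBITMQ_MGMT_URL ?? 'http://localhost:15672';
const AUTH = Buffer.from(
  `${process.env.RABBITMQ_USER ?? 'guest'}:${process.env.RABBITMQ_PASS ?? 'guest'}`
).toString('base64');
const BACKLOG_THRESHOLD = 1000;

interface QueueInfo {
  name: string;
  vhost: string;
  messages: number;
  consumers: number;
}

async function checkQueueBacklogs(): Promise<void> {
  const res = await fetch(`${MGMT_URL}/api/queues`, {
    headers: { Authorization: `Basic ${AUTH}` },
  });
  if (!res.ok) throw new Error(`management API returned ${res.status}`);
  const queues = (await res.json()) as QueueInfo[];

  for (const q of queues) {
    // A deep queue with no consumers usually means a component is down
    // or cannot keep up with the inbound message rate.
    if (q.messages > BACKLOG_THRESHOLD || (q.messages > 0 && q.consumers === 0)) {
      console.warn(`${q.vhost}/${q.name}: ${q.messages} messages, ${q.consumers} consumers`);
    }
  }
}

checkQueueBacklogs().catch((err) => console.error('queue check failed', err));
```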
Application Failure
Application failures occur when an Open Integration Hub service or component fails to do its job while data is moving through the system. These are less likely than infrastructure failures to create a global outage, but they can still create widespread problems if the failing service or component cannot come back online and is used by many flows.
Application failures are most likely the result of a product defect or a product shortcoming where it is unable to handle a specific configuration. An example of the latter would be a large message size that the RabbitMQ queueing service is not able to consume safely. Some application failures can be addressed with environment configuration changes, but often they will require a code-based resolution.
How to Identify Application Failures
Application failures will likely impact many active flows and prevent data from moving, often causing a widespread outage. Some application failures affect only a smaller subset of flows, for example when a configuration is valid but the component or service cannot handle it.
The following are some specific places to look for failures:
- Certain Kubernetes pods for services or components are restarting repeatedly or settling in a failed state
- Workload logs for a component or service include errors related to the issue
- The RabbitMQ error queue has messages related to the failure
- The RabbitMQ dead letter queues contain messages from failed deliveries (the sketch after this list shows one way to inspect these queues)
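A minimal sketch for inspecting an error or dead letter queue with amqplib follows; the connection URI and queue name are placeholders, and the sampled message is requeued after inspection so nothing is lost.

```typescript
import * as amqp from 'amqplib';

// Placeholder queue name; use the error or dead letter queue you are investigating.
async function peekQueue(queueName: string): Promise<void> {
  const conn = await amqp.connect(process.env.RABBITMQ_URI ?? 'amqp://localhost');
  const ch = await conn.createChannel();

  // checkQueue fails if the queue does not exist, otherwise reports its depth.
  const info = await ch.checkQueue(queueName);
  console.log(`${queueName}: ${info.messageCount} messages waiting`);

  // Fetch one message for inspection, then requeue it so it is not lost.
  const msg = await ch.get(queueName, { noAck: false });
  if (msg) {
    console.log('sample payload:', msg.content.toString());
    console.log('headers:', JSON.stringify(msg.properties.headers));
    ch.nack(msg, false, true); // requeue
  }

  await ch.close();
  await conn.close();
}

peekQueue('my-component:error').catch((err) => console.error(err));
```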
How to React to Application Failures
Reacting to application failures starts with isolating the conditions under which the failure occurs. This helps you assess how widespread the failure is likely to be and determine the appropriate course of action.
The following are some specific reactions that might be helpful:
- Determine which component or service is failing, then identify the errors being logged
- If the error is flow-related, check the component orchestrator logs for potential errors
- Does the failure correlate to a version update? A new flow? An updated flow?
- Roll back versions and/or have the defects addressed
- Have environment variables changed?
- Are the containers pulling in the latest version or pinned to a specific version? (The sketch after this list shows one way to audit this.)
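For the last question, the sketch below lists the image each OIH deployment is running so you can spot containers that are not pinned to a specific version. It assumes the pre-1.0 @kubernetes/client-node API; the namespace is a placeholder.

```typescript
import * as k8s from '@kubernetes/client-node';

const NAMESPACE = 'oih-dev-ns'; // placeholder namespace

async function auditImageVersions(): Promise<void> {
  const kc = new k8s.KubeConfig();
  kc.loadFromDefault();
  const apps = kc.makeApiClient(k8s.AppsV1Api);

  // Print every container image per deployment and flag unpinned tags.
  const res = await apps.listNamespacedDeployment(NAMESPACE);
  for (const dep of res.body.items) {
    for (const container of dep.spec?.template.spec?.containers ?? []) {
      const image = container.image ?? '';
      const pinned = image.includes(':') && !image.endsWith(':latest');
      console.log(
        `${dep.metadata?.name}/${container.name}: ${image}${pinned ? '' : '  <-- not pinned'}`
      );
    }
  }
}

auditImageVersions().catch((err) => console.error(err));
```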
Endpoint Failure
Endpoint failures occur when a connected system cannot accept requests or produce appropriate responses via its API or interface. These issues tend to appear as integration-related failures (“data is not moving”), but are actually related to the data’s source or targeted destination.
How to Identify Endpoint Failures
Endpoint failures always show up in the logs and error queues of the components that speak to external systems (e.g. the REST component). They do not cause Kubernetes pods for services or components to fail; logging endpoint failures means the component is doing its job appropriately.
The following are some specific places to look for failures:
- RabbitMQ error or rebound queues for components that communicate externally
- Kubernetes pod logs for components that communicate externally
- The OIH Web UI flows view, or a request against the flow repository API, to check that a flow is active (a sketch follows this list)
- If available, the service status page from the endpoint provider
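To check a flow’s status programmatically rather than through the Web UI, a sketch along these lines can help. The base URL, token handling, and response shape (a data object with a status field) are assumptions; confirm the exact routes and fields against your deployment’s Flow Repository API.

```typescript
// Placeholders: base URL and token for your OIH installation.
const FLOW_REPO_URL = process.env.FLOW_REPO_URL ?? 'https://flow-repository.example.com';
const TOKEN = process.env.OIH_TOKEN ?? '';

async function isFlowActive(flowId: string): Promise<boolean> {
  const res = await fetch(`${FLOW_REPO_URL}/flows/${flowId}`, {
    headers: { Authorization: `Bearer ${TOKEN}` },
  });
  if (!res.ok) throw new Error(`flow repository returned ${res.status}`);
  // Assumed response shape: { data: { id, status, ... } }
  const body = (await res.json()) as { data: { id: string; status: string } };
  return body.data.status === 'active';
}

isFlowActive('my-flow-id')
  .then((active) => console.log(active ? 'flow is active' : 'flow is NOT active'))
  .catch((err) => console.error(err));
```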
How to React to Endpoint Failures
Unless the external system is your own product/API, you are typically limited in what you can do to react to an endpoint failure. Your goal should be to identify the issue, minimize any further data cleanup, and provide a path forward for your end users in accordance with whatever level of integration support you offer them.
The following are some specific reactions that might be helpful:
- Deactivate flows that you determine to be failing due to endpoint issues, to eliminate noisy logs and mitigate further data cleanup (a sketch for stopping and starting flows follows this list)
- Execute bulk updates or retrigger events after the endpoint failure is resolved to “catch up” users’ data
- Provide logged endpoint information to users or directly to the endpoint support team to notify them of such an issue (as the provider of an integration, you are uniquely positioned to help the endpoint see and resolve the issue)
- If a flow is stopped, start the flow and investigate possible reasons it was stopped
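The sketch below shows how stopping and later restarting a flow might look against the Flow Repository API. The /flows/{id}/stop and /flows/{id}/start routes and the base URL are assumptions; verify them against your deployment’s API documentation before relying on this.

```typescript
// Placeholders: base URL and token for your OIH installation.
const FLOW_REPO_URL = process.env.FLOW_REPO_URL ?? 'https://flow-repository.example.com';
const TOKEN = process.env.OIH_TOKEN ?? '';

async function setFlowState(flowId: string, action: 'start' | 'stop'): Promise<void> {
  // Assumed routes: POST /flows/{id}/start and POST /flows/{id}/stop
  const res = await fetch(`${FLOW_REPO_URL}/flows/${flowId}/${action}`, {
    method: 'POST',
    headers: { Authorization: `Bearer ${TOKEN}` },
  });
  if (!res.ok) throw new Error(`failed to ${action} flow ${flowId}: HTTP ${res.status}`);
  console.log(`${action} requested for flow ${flowId}`);
}

// Example: stop a flow while its endpoint is down, then start it again later.
setFlowState('my-flow-id', 'stop').catch((err) => console.error(err));
```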
Data-Related Failure
Data-related failures are the most common type of failure, and how you react to them will depend on the type of relationship you have with your users relative to the integration you provide them.
Data-related failures occur when everything works correctly, but the user has done something in an endpoint system that was not expected or is incompatible with the configured integration flow. When this occurs, the flow incorrectly transforms data, calls the wrong API, or isn’t configured to address a specific edge case. The technical root cause can vary.
Ideally, when designing a flow template, you should identify and mitigate any possible data-related failures, but it’s impossible to be perfect. Data failures have a higher risk of occurring in homegrown, “on-premise”, or highly customizable systems than in SaaS applications that have one API that works the same way for all users. These failures can occur in all cases, though.
How to Identify Data-Related Failures
Data-related failures can occur in any component used in a flow, and will manifest in that component’s logs and error queue. It may not always be obvious which component produced the failure, and the resolution may be upstream of where the failure occurs.
The most common places to find data-related failures are the components that communicate with external systems (e.g. the REST component), the JSONata transformation component, and the Code component, but failures can occur in any component.
Understanding and reacting to a data-related failure requires knowledge of the OIH environment as well as knowledge of the failing flow itself.
The following are some specific places to look for failures:
- Error queues for the components used in a specific flow (ideally you identified the highest-risk steps of the flow during its design; the sketch after this list shows one way to scan these queues)
- Kubernetes pod logs for the components that are used in a specific flow
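The sketch below scans the error queues for the steps of a single flow and reports any that contain messages. The queue names are placeholders; check the RabbitMQ management console to see how your deployment names its per-step error queues.

```typescript
import * as amqp from 'amqplib';

// Placeholder queue names for the steps of one flow.
const ERROR_QUEUES = [
  'flow-123:rest-adapter:error',
  'flow-123:jsonata-transform:error',
  'flow-123:code-component:error',
];

async function scanFlowErrorQueues(): Promise<void> {
  const conn = await amqp.connect(process.env.RABBITMQ_URI ?? 'amqp://localhost');
  for (const queue of ERROR_QUEUES) {
    // Use a fresh channel per queue, because checkQueue closes the channel
    // if the queue does not exist.
    const ch = await conn.createChannel();
    try {
      const info = await ch.checkQueue(queue);
      if (info.messageCount > 0) {
        console.warn(`${queue}: ${info.messageCount} failed messages`);
      }
      await ch.close();
    } catch {
      console.error(`queue not found or not inspectable: ${queue}`);
    }
  }
  await conn.close();
}

scanFlowErrorQueues().catch((err) => console.error(err));
```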
How to React to Data-Related Failures
Reacting to data-related failures involves one or both of the following: enhancing the logic of the flow to account for the previously unidentified data condition, and working with the end user to rectify the data problem.
The following are some specific reactions that might be helpful:
- Provide specific guidance to your customer about why the failure has occurred and what they can do to remedy the problem
- Consider that the issue was simply an oversight during design time and make an update to your flow (this could potentially prevent future occurrences for all users of that flow; see the validation sketch after this list)
- Decide to split that customer off to a unique version of an otherwise templated flow (this is not recommended unless you are required and able to support bespoke integrations for your customers)
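As an illustration of updating a flow to handle an unexpected data condition, the sketch below shows the kind of defensive validation you might add to a flow step (for example in a Code component) so that bad records fail early with a descriptive error. The record shape and rules are placeholders for your own data model.

```typescript
// Placeholder record shape; substitute the fields your flow actually handles.
interface ContactRecord {
  email?: string;
  firstName?: string;
  lastName?: string;
}

// Collect human-readable problems instead of failing on the first one,
// so the error message gives your customer actionable guidance.
function validateContact(payload: ContactRecord): string[] {
  const problems: string[] = [];
  if (!payload.email || !payload.email.includes('@')) {
    problems.push('missing or malformed email');
  }
  if (!payload.firstName && !payload.lastName) {
    problems.push('record has neither first nor last name');
  }
  return problems;
}

function processRecord(payload: ContactRecord): void {
  const issues = validateContact(payload);
  if (issues.length > 0) {
    // Throwing a descriptive error surfaces the problem early (e.g. in the
    // step's error queue) instead of letting a later step fail obscurely.
    throw new Error(`data validation failed: ${issues.join('; ')}`);
  }
  // ...continue with the normal transformation / API call...
}

processRecord({ email: 'not-an-email' }); // throws with a descriptive message
```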