Planning for Fault Tolerance
Fault tolerance, or “high availability”, is critical to any successful business operation. To ensure that requests are still processed in the event of failure, FME Server supports configuring fault tolerance at multiple levels of an integrated system. FME Server provides fault tolerance in the following ways:
- Recovery: Restarting components and jobs when crashes occur. FME Server provides component and job recovery automatically; no additional planning is needed.
- Failover: Ensuring there is no single point of failure. Two different configurations can be used to achieve this: Active-Passive or Active-Active. Failover is the primary consideration when deciding which installation architecture to implement and manage.
Recovery
Component Recovery
FME Server includes component recovery out of the box. This means that, even on a single system, FME Server monitors and restarts components that fail, including the FME Engines and the FME Server Core. This is achieved through the FME Server Process Monitor. The ability of FME Server to monitor its own components ensures reliable uptime and dependability.
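The monitor-and-restart idea described above can be illustrated with a minimal sketch. This is not the FME Server Process Monitor; the `supervise` function, the restart limit, and the stand-in "component" are all hypothetical and exist only to show the general pattern of restarting a process that exits abnormally.

```python
import subprocess
import sys


def supervise(command, max_restarts):
    """Run `command`, restarting it each time it exits with a non-zero code.

    Returns the number of restarts performed. Stops when the process exits
    cleanly (code 0) or when max_restarts (a hypothetical limit) is reached.
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(command)
        if exit_code == 0 or restarts >= max_restarts:
            return restarts
        restarts += 1


# Stand-in "component" that always crashes with exit code 1.
crashing_component = [sys.executable, "-c", "import sys; sys.exit(1)"]
restart_count = supervise(crashing_component, max_restarts=2)
```

A real process monitor would typically watch long-running services rather than short commands, but the loop of "detect exit, start again" is the same.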
Job Recovery
FME Server also includes the ability to restart a job when a crash occurs. As a result, jobs that experience temporary issues, such as a network hiccup, are resubmitted and run again.
After FME Server submits a translation request to an FME Engine, it monitors the connection to that engine until a response is returned.
FME Server can resubmit a failed job if:
- The connection to the engine is lost.
- The engine crashes.
FME Server continues to resubmit a translation up to a specified number of attempts. To prevent FME Server from indefinitely retrying a job that fails, the default setting is to resubmit a failed job up to three times. This setting is configurable and can be turned off entirely.
Ms Analyst says...
WARNING! A failed translation request may cause an FME Engine to shut down improperly. If no maximum limit is imposed, the translation is resent indefinitely, which may cause repeated FME Engine failures.
Resubmitted translations may also cause data duplication, such as when writing to database formats or when writing mid-translation with the FeatureWriter.
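The bounded resubmission behavior in this section can be sketched as a simple retry loop. This is an illustration only, not FME Server's implementation: `EngineConnectionLost`, `run_with_resubmission`, and `make_flaky_job` are hypothetical names, and the default of three resubmissions mirrors the default described above.

```python
class EngineConnectionLost(Exception):
    """Hypothetical error: the engine connection dropped or the engine crashed."""


def run_with_resubmission(submit, max_resubmits=3):
    """Call submit() and resubmit after an engine failure, up to max_resubmits times.

    Returns (result, resubmits_used). Re-raises the last failure once the
    limit is reached. max_resubmits=0 turns resubmission off.
    """
    resubmits = 0
    while True:
        try:
            return submit(), resubmits
        except EngineConnectionLost:
            if resubmits >= max_resubmits:
                raise
            resubmits += 1


def make_flaky_job(failures_before_success):
    """Build a stand-in job that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def job():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise EngineConnectionLost("simulated network hiccup")
        return "translation complete"

    return job


# Two simulated failures, then success on the third attempt.
result, resubmits_used = run_with_resubmission(make_flaky_job(2))
```

Because each resubmission runs the job again from the start, any writes made before the failure are repeated. That is why a job that writes to a database partway through can produce duplicate data unless it is designed to be safely rerun.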
Failover
The goal of a failover environment is to remove single points of failure so that a component can fail without taking the entire system offline. FME Server supports two approaches to failover: Active-Passive and Active-Active.
We typically recommend the Active-Passive architecture, which meets the needs of most enterprises. There are advantages and disadvantages to both approaches.
Keep in mind that with failover, the FME Server Core, Jobs, and Engines are looked after for you, but you are responsible for making the Database and File System fault tolerant.
Active-Passive
With the Active-Passive approach, when the Active system fails, the Passive system takes over its capabilities and assumes the role of the Active system. The failed system can then be investigated while the new Active system provides continued operation of FME Server. When the failed system becomes healthy again, it assumes the Passive role and remains in that role until another failure occurs on the Active system.
Failover is achieved through a heartbeat monitor between the Active and Passive systems. The types of failures that typically cause failover are hardware and operating system crashes, in which the primary system goes down completely.
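A heartbeat-based takeover can be sketched as follows. This is a simplified model, not how FME Server implements its heartbeat monitor: the `HeartbeatMonitor` class, the 10-second timeout, and the role names are assumptions chosen for illustration. Timestamps are passed in explicitly so the behavior is easy to follow.

```python
class HeartbeatMonitor:
    """Tracks heartbeats from the Active system on behalf of the Passive system."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds  # hypothetical threshold
        self.last_heartbeat = None
        self.role = "passive"

    def record_heartbeat(self, now):
        """Note the time (in seconds) at which the Active system last reported in."""
        self.last_heartbeat = now

    def check(self, now):
        """Promote this system to active if the heartbeat has been silent too long."""
        if self.role == "passive" and self.last_heartbeat is not None:
            if now - self.last_heartbeat > self.timeout_seconds:
                self.role = "active"
        return self.role


monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.record_heartbeat(now=0)
role_while_healthy = monitor.check(now=5)    # heartbeat is recent
role_after_silence = monitor.check(now=11)   # no heartbeat for more than 10 seconds
```

In this sketch the standby stays passive while heartbeats keep arriving and takes over only after the silence exceeds the timeout, which matches the kind of complete outage (hardware or operating system crash) described above.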
Keep in mind that after the Active Core fails, it takes 3-5 minutes for processing to resume on the Passive Core, which has now become the new Active Core. Schedules that would have triggered jobs during this window do not run.
Sister Intuitive says...
Clients of Notification Service publishers do not fail over. These clients must be manually reconfigured to connect to the new Active Core or, alternatively, the original Active Core must be restored.
The diagram below shows the structure of an FME Server system properly configured for Active-Passive failover:
In the Active-Passive architecture, the FME Server Web Application Server and FME Server System Share files are separated physically. The Database Server, File System, and Web Application Server should all be configured for fault tolerance. Fault tolerance for these components must be provided by the client or customer.
Active-Active
The Active-Active failover architecture duplicates complete FME Server installations on separate servers. In other words, all components of an installation reside on the same system, and additional systems are configured similarly and provide similar functionality. A third-party load balancer directs incoming traffic to one of the available systems. Each request is handled independently, by only the system it is directed to. This approach works well with a cloud-based computing environment, such as Amazon Web Services, in which machines can be cloned easily to expand capacity.
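The role of the load balancer can be illustrated with a small round-robin sketch. In practice this job is done by a third-party load balancer, not by code you write; the `RoundRobinBalancer` class and the system names `fme-a`, `fme-b`, and `fme-c` are hypothetical.

```python
class RoundRobinBalancer:
    """Directs each request to the next healthy system in turn."""

    def __init__(self, systems):
        self.systems = list(systems)
        self.healthy = set(self.systems)
        self._position = 0

    def mark_down(self, system):
        """Stop routing to a system that has failed."""
        self.healthy.discard(system)

    def mark_up(self, system):
        """Resume routing to a known system that has recovered."""
        if system in self.systems:
            self.healthy.add(system)

    def route(self):
        """Return the system that should handle the next request."""
        for _ in range(len(self.systems)):
            candidate = self.systems[self._position]
            self._position = (self._position + 1) % len(self.systems)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy systems available")


balancer = RoundRobinBalancer(["fme-a", "fme-b", "fme-c"])
before_failure = [balancer.route() for _ in range(3)]
balancer.mark_down("fme-b")  # simulate one system failing
after_failure = [balancer.route() for _ in range(3)]
```

After `fme-b` is marked down, requests continue to flow to the remaining two systems. The environment stays available, but with one fewer system sharing the work, which is the reduced processing capacity noted in the comparison that follows.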
Differences Between Active-Active and Active-Passive
Feature | Active-Active | Active-Passive |
Easy setup using Express Installation option | X | |
Publishing workspaces is a one-time task for the whole system | X | |
Requires administration of multiple FME Servers | X | |
Processing capacity is diminished when a system fails | X | |
May still require recovery/replication of the FME Server System Share for entire environment | X | |
Schedules automatically fail over | X |