By Yann Rapaport, 6WIND Customer Support and Service Manager

This is the third post of a series about High Availability capabilities for packet processing software. In the first post, HA requirements for high performance packet processing software were presented. The second post addressed the problem of the synchronization of Control Plane protocols. Now, let’s discuss how packet processing software should be monitored.

An HA system has to be monitored periodically to detect possible issues, either to prevent a complete shutdown or to anticipate a switch from the active to the inactive element. As with other elements, a Monitoring System has to be added to periodically check the health of the packet processing software components and report it to the HA framework, so that the HA framework can make the relevant decisions. For packet processing software, these decisions can be:

  • Restart a specific daemon,
  • Reboot the Control Plane but not the Fast Path,
  • Reboot the Control Plane and the Fast Path,
  • Reboot the whole system.
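
The four decisions above are ordered from least to most disruptive, so one way to picture the HA framework's choice is as picking the smallest action that covers every reported failure. The following is only a minimal sketch under that assumption; the `Recovery` names, the `SCOPE` table and the scope labels are hypothetical and do not come from 6WINDGate.

```python
from enum import IntEnum


class Recovery(IntEnum):
    """Recovery actions, ordered from least to most disruptive."""
    NONE = 0
    RESTART_DAEMON = 1
    REBOOT_CONTROL_PLANE = 2
    REBOOT_CONTROL_PLANE_AND_FAST_PATH = 3
    REBOOT_SYSTEM = 4


# Hypothetical mapping from the scope of a failure to the action that covers it.
SCOPE = {
    "daemon": Recovery.RESTART_DAEMON,
    "control_plane": Recovery.REBOOT_CONTROL_PLANE,
    "fast_path": Recovery.REBOOT_CONTROL_PLANE_AND_FAST_PATH,
    "system": Recovery.REBOOT_SYSTEM,
}


def decide(failed_scopes):
    """Return the least disruptive action that covers every reported failure."""
    return max((SCOPE[s] for s in failed_scopes), default=Recovery.NONE)
```

For example, `decide(["daemon", "fast_path"])` returns `Recovery.REBOOT_CONTROL_PLANE_AND_FAST_PATH`, because restarting a single daemon would not cover the Fast Path failure. A real policy would probably also escalate when the same action fails repeatedly.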

The problem for the Monitoring System is to monitor very different software components and abstract this heterogeneity from the HA framework. A convenient way to implement the monitoring services is to have a single daemon which monitors all the others and to use a library that hides software implementation details from the monitoring daemon. This library can, for example, use XML messages over UNIX sockets to communicate with the different components of the packet processing software.
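
To make the idea of XML messages over UNIX sockets more concrete, here is a small sketch of one query/report exchange. The message layout (`query` and `report` elements with `component`, `type` and `status` attributes) and the component name `ospfd` are assumptions made for illustration; the actual format used by the library is not described here. A connected pair of UNIX datagram sockets stands in for the real transport.

```python
import socket
import xml.etree.ElementTree as ET


def build_query(component):
    """Monitoring side: encode a health query as XML bytes (hypothetical schema)."""
    query = ET.Element("query", {"component": component, "type": "health"})
    return ET.tostring(query)


def handle_query(raw, status="ok"):
    """Component side: decode a query and encode the matching report."""
    query = ET.fromstring(raw)
    report = ET.Element("report", {"component": query.get("component"),
                                   "status": status})
    return ET.tostring(report)


def parse_report(raw):
    """Monitoring side: decode a report into (component, status)."""
    report = ET.fromstring(raw)
    return report.get("component"), report.get("status")


def demo():
    """Run one query/report round trip over a connected UNIX socket pair."""
    monitor_end, component_end = socket.socketpair(socket.AF_UNIX,
                                                   socket.SOCK_DGRAM)
    with monitor_end, component_end:
        monitor_end.send(build_query("ospfd"))
        component_end.send(handle_query(component_end.recv(4096)))
        return parse_report(monitor_end.recv(4096))
```

Because only `build_query`, `handle_query` and `parse_report` know the wire format, the monitoring daemon itself would deal only with component names and statuses, which is the kind of abstraction the library is meant to provide.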

Two types of software components have to be monitored:

  • Static components,
  • Dynamic components, whose activation and configuration are done by the management system. All routing protocols are examples of dynamic software components.

To be monitored, each software component has to:

  • Act on queries from the monitoring daemon,
  • Run an internal audit,
  • Report its internal status to the monitoring daemon,
  • Link with the library and register through it.
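
The four responsibilities listed above can be sketched from the component's point of view. This is an illustration only: `MonitoringLibrary`, `MonitoredComponent`, the `ping` and `status` actions and the status strings are all hypothetical names, and the library is reduced to an in-process object instead of a real transport.

```python
class MonitoringLibrary:
    """Hypothetical stand-in for the library that components link against."""

    def __init__(self):
        self.components = {}   # registered components, by name
        self.reports = []      # (name, status) pairs received so far

    def register(self, component):
        self.components[component.name] = component

    def report(self, name, status):
        self.reports.append((name, status))

    def query(self, name, action):
        """Forward a monitoring daemon query to the named component."""
        return self.components[name].on_query(action)


class MonitoredComponent:
    def __init__(self, name, library, audit_checks):
        self.name = name
        self.library = library
        self.audit_checks = audit_checks  # check name -> callable returning bool
        library.register(self)            # link and register with the library

    def audit(self):
        """Internal audit: run every check and summarise the result."""
        failed = [n for n, check in self.audit_checks.items() if not check()]
        return "ok" if not failed else "failed:" + ",".join(failed)

    def report(self):
        """Report the internal status to the monitoring daemon."""
        status = self.audit()
        self.library.report(self.name, status)
        return status

    def on_query(self, action):
        """Act on a query coming from the monitoring daemon."""
        if action == "ping":
            return "alive"
        if action == "status":
            return self.report()
        return "unsupported"
```

A component created as `MonitoredComponent("ripd", lib, {"peers": lambda: False})` would answer `lib.query("ripd", "status")` with `"failed:peers"` and leave the same status in `lib.reports`.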

As the monitoring daemon is a single point of failure, the Monitoring System also has to implement a discovery mechanism that allows recovery if the monitoring daemon crashes.
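
The discovery mechanism is not detailed here, so the following is just one possible interpretation: a restarted monitoring daemon keeps no state and rebuilds its view by asking every running component to register again. The `Bus`, `Component` and `MonitoringDaemon` classes are hypothetical, and the bus is an in-process stand-in for whatever transport the library would really use.

```python
class Component:
    """A running software component that outlives the monitoring daemon."""

    def __init__(self, name):
        self.name = name

    def on_discovery(self, daemon):
        # Answer a discovery request by registering with the new daemon.
        daemon.register(self.name)


class Bus:
    """Hypothetical transport reaching every running component."""

    def __init__(self):
        self.components = []

    def attach(self, component):
        self.components.append(component)

    def broadcast_discovery(self, daemon):
        for component in self.components:
            component.on_discovery(daemon)


class MonitoringDaemon:
    def __init__(self, bus):
        # A freshly started daemon knows nothing; discovery rebuilds its view.
        self.registered = set()
        bus.broadcast_discovery(self)

    def register(self, name):
        self.registered.add(name)
```

With this scheme, a second `MonitoringDaemon(bus)` started after the first one has crashed ends up with the same set of registered components, without any state having been saved by the daemon itself.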

The packet processing software is now monitored. The next step is to implement all the features required for a graceful restart…

More information about 6WINDGate architecture can be found here.

6WINDGate High Availability Architecture Overview is available here.

You can check 6WINDGate FAQ here.


3 Responses to “High Availability-Ready Packet Processing Software (Part 3) – Monitoring Services”

  • Dear Yann,

    Thank you for a very informative write-up on HA services. Looking forward to the remaining articles.

    One more strategy that could be applied for synchronization is to use Shared Memory. Let a “Domain” be the processor and the set of daemon processes that need to be monitored for synchronization.

    Each Domain would implement a Shared Memory, subdivided into predefined, fixed-size memory blocks allotted to each module/process for its control information. Between the Active Domain and the Passive Domain, the complete shared memory block could be synchronized (periodically) by hardware or software means.

    This would take care of the cases where the board has to reboot. For monitoring processes within a Domain, I agree with your suggestion to use a single parent daemon process, which would spawn all the other processes and continuously poll their PIDs (and then make a decision based on the outcome).

    Of course there are pros and cons. Still, let me know your thoughts.

    Regards

    Ramesh

    • Dear Ramesh,

      I think that the weak point of your solution is already in your proposal: “… the complete shared memory block could be synchronized (periodically)…”

      It leads to three issues which become almost impossible to control with the right level of granularity:
      – periodically, so when?
      – along with a periodic synchronization, you would need an on-demand one,
      – which data do you want to synchronize in each sync pass? If you synchronize all the data, it would be too slow and you would hit the issues Alex describes.

      Best regards,
      Vincent

  • Alex Bachmutsky:

    Ramesh,

    One of the problems with the shared memory approach is the probability of memory corruption by the failed process. Practically, there are two questions: a) do you trust the memory of the failed process at all (usually, the answer is negative; you can protect the shared memory with a CRC, but you still don’t know whether the memory was corrupted before the CRC calculation)? b) are you sure that the secondary process can reliably read the shared memory without causing a secondary crash? In the latter case, let us assume that the shared memory is corrupted, the protecting process parses the shared memory structure, and this parsing causes a crash because of a wrong length field, a wrong pointer value, etc. Of course, the system can be protected against such a case, but it is not an easy task and requires very strict rules about the kind of information held in this shared memory.

