
XII. High Availability Extensions

While GM automatically handles transient network errors such as dropped, corrupted, or misrouted packets, and while the GM mapper automatically reconfigures the network if links or nodes appear or disappear, GM cannot automatically handle catastrophic errors such as crashed hosts or loss of network connectivity without the cooperation of the client program.

When GM detects a catastrophic error, it temporarily disables delivery of all messages with the same sender port, target port, and priority as the message that experienced the error. GM informs the client of the error by passing a status other than GM_SUCCESS to the client's send completion callback routine. The client program is then expected to call either gm_resume_sending() or gm_drop_sends(), each of which re-enables delivery of messages with that sender port, target port, and priority. This mechanism preserves message order over the prioritized connection between the sending and receiving ports, while letting the client decide whether the other messages it has already enqueued on that connection should be transmitted or dropped.
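
As a rough illustration of how narrowly the disable is scoped, the sketch below models a prioritized connection as the tuple named above. Every name in it (toy_conn_key, toy_same_connection, toy_send_is_held) is a hypothetical stand-in and not part of the GM API. Modeling the target as a node ID plus a port ID is an assumption based on the arguments that gm_drop_sends() takes in the example program later in this section.

```c
#include <assert.h>

/* Hypothetical model of a prioritized connection; not part of gm.h. */
struct toy_conn_key
{
  unsigned int sender_port;
  unsigned int target_node;   /* assumption: target = node ID + port ID */
  unsigned int target_port;
  unsigned int priority;
};

/* Nonzero when both keys name the same prioritized connection. */
static int
toy_same_connection (const struct toy_conn_key *a,
                     const struct toy_conn_key *b)
{
  assert (a && b);
  return a->sender_port == b->sender_port
    && a->target_node == b->target_node
    && a->target_port == b->target_port
    && a->priority == b->priority;
}

/* Nonzero when a send with key `send` would be held back because the
   connection `failed` is currently disabled. */
static int
toy_send_is_held (const struct toy_conn_key *failed, int disabled,
                  const struct toy_conn_key *send)
{
  return disabled && toy_same_connection (failed, send);
}
```

In this model, a send that differs from the failed one in priority or in target is not held back; only sends on the identical connection wait for the client's fault response.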

Simpler GM programs, such as MPI programs, will typically treat GM send errors as fatal and exit when they see one. This is reasonable for applications running on small or physically robust clusters where errors are rare and where users can tolerate restarting jobs in the rare event of a network error. Poorly written GM programs may simply ignore the error codes, which causes the program to eventually hang with no error indication when a catastrophic error is encountered. This practice is strongly discouraged: developers should always check the send completion status. More sophisticated applications, such as high availability database applications, will respond to network faults, which appear to the client as send completion status codes other than GM_SUCCESS.

A complete list of send completion status codes can be found in gm.h and in section VIII, Sending Messages.

When the send completion status code indicates an error, a sophisticated client program may respond by calling gm_resume_sending() or gm_drop_sends(). Calling gm_resume_sending() causes GM to simply re-enable delivery of subsequent messages over the connection, including those that have already been enqueued. This would be the typical response of a distributed database that assumes the underlying network is unreliable and layers its own reliability protocol over GM. Calling gm_drop_sends() causes GM to drop all enqueued sends over the disabled connection, return them to the client with status GM_SEND_DROPPED, and re-enable the connection. This would be the typical response of a program that wishes to reorder subsequent communication over the connection in response to the error.
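
The difference between the two responses can be sketched with a small in-memory model of one connection's send queue. This is a toy simulation under stated assumptions, not GM code: the toy_ names and the TOY_ status values are hypothetical, and the model only mirrors the behavior described above (resume leaves enqueued sends in place, drop completes them with a "dropped" status).

```c
#include <assert.h>

/* Hypothetical toy model; none of these names are part of the GM API. */
enum toy_status { TOY_PENDING, TOY_SUCCESS, TOY_TIMED_OUT, TOY_DROPPED };

#define TOY_MAX_SENDS 8

struct toy_connection
{
  enum toy_status sends[TOY_MAX_SENDS]; /* enqueued sends, in order */
  int count;                            /* number of enqueued sends */
  int disabled;                         /* set when a send fails */
};

static void
toy_enqueue (struct toy_connection *c)
{
  assert (c->count < TOY_MAX_SENDS);
  c->sends[c->count++] = TOY_PENDING;
}

/* Complete pending sends in order until one fails.  fail_index names
   the send that hits a simulated error (-1 for none). */
static void
toy_progress (struct toy_connection *c, int fail_index)
{
  int i;
  for (i = 0; i < c->count && !c->disabled; i++)
    {
      if (c->sends[i] != TOY_PENDING)
        continue;
      if (i == fail_index)
        {
          c->sends[i] = TOY_TIMED_OUT;
          c->disabled = 1;
        }
      else
        c->sends[i] = TOY_SUCCESS;
    }
}

/* Models the "resume" response: pending sends stay queued. */
static void
toy_resume (struct toy_connection *c)
{
  c->disabled = 0;
}

/* Models the "drop" response: pending sends complete as dropped,
   then the connection is re-enabled. */
static void
toy_drop (struct toy_connection *c)
{
  int i;
  for (i = 0; i < c->count; i++)
    if (c->sends[i] == TOY_PENDING)
      c->sends[i] = TOY_DROPPED;
  c->disabled = 0;
}
```

With three sends enqueued and the second one failing, toy_resume() followed by toy_progress() lets the third send complete normally, whereas toy_drop() marks the third send as dropped so that the client can decide what to send next.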

Note that each of the fault response functions (gm_drop_sends() and gm_resume_sending()) requires a send token. This send token is implicitly returned to the caller when GM calls the callback function passed to gm_drop_sends() or gm_resume_sending().
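
One way a client might account for this is to keep its own count of available send tokens. The sketch below is hypothetical bookkeeping (the toy_tokens type and its helpers are not GM API): it takes a token before each send or fault response and gives one back whenever a completion callback runs.

```c
#include <assert.h>

/* Hypothetical bookkeeping kept by the client; not part of the GM API. */
struct toy_tokens
{
  unsigned int total;       /* tokens the client started with */
  unsigned int available;   /* tokens not currently in use */
};

/* Take a token before a send or a fault response.
   Returns 0 when no token is available. */
static int
toy_take_token (struct toy_tokens *t)
{
  if (t->available == 0)
    return 0;
  t->available--;
  return 1;
}

/* Call from every completion callback: the token comes back. */
static void
toy_return_token (struct toy_tokens *t)
{
  assert (t->available < t->total);
  t->available++;
}
```

Under this accounting, a client that has used all of its tokens cannot issue a fault response until some callback has run. Issuing the response from inside the failed send's own completion callback, as the example program in this section does, presumably works because that callback has just handed a token back.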

Here is an example program demonstrating the use of gm_drop_sends(). If the send callback reports an error, the corresponding message has definitely been discarded; both gm_resume_sending() and gm_drop_sends() affect only messages that were queued after the discarded message. In this example, there are no messages queued after the message that has just been discarded, so gm_drop_sends() and gm_resume_sending() are equivalent: they just re-enable the target subport for further gm_send_with_callback() calls.

#include <stdio.h>
#include <assert.h>

#include "gm.h"

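/* Counters updated by the callbacks and polled by the main loop. */
unsigned int received, sent;
/* Node ID of this host; the example sends messages to itself. */
unsigned int my_gm_node_id;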

static void test_send_callback (struct gm_port *port, void *context, 
                                gm_status_t status);

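/* Called by GM once gm_drop_sends() has re-enabled the connection;
   resend the message that was discarded. */
static void 
drop_send_callback (struct gm_port *port,
                    void *context,
                    gm_status_t status)
{
  (void) status;  /* not examined in this example */
  fprintf(stderr, "Got gm_drop_sends notification, start resending\n");
  gm_send_with_callback(port, context, 20, 1, GM_LOW_PRIORITY,
                        my_gm_node_id, 7, test_send_callback, context);
}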


static void 
test_send_callback (struct gm_port *port,
                    void *context,
                    gm_status_t status)
{
  switch (status)
    {
    case GM_SUCCESS:
      fprintf(stderr, "Send successfully delivered\n");
      sent += 1;
      break;
      
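    case GM_SEND_TIMED_OUT:
      /* The send failed and the connection is now disabled.  Provide
         receive buffers so the resent messages can be delivered, then
         ask GM to drop any queued sends and re-enable the connection. */
      fprintf(stderr, "Send timeout, provide buffers and initiate resend...\n");
      gm_provide_receive_buffer(port, context, 20, GM_LOW_PRIORITY);
      gm_provide_receive_buffer(port, context, 20, GM_LOW_PRIORITY);
      gm_drop_sends (port, GM_LOW_PRIORITY, my_gm_node_id, 7,
                     drop_send_callback, context);
      break;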

    case GM_SEND_DROPPED:
      fprintf(stderr, "Got DROPPED_SEND notification, resend\n");
      gm_send_with_callback(port, context, 20, 1, GM_LOW_PRIORITY,
                            my_gm_node_id, 7, test_send_callback, context);
      break;

    default:
      fprintf(stderr, "Something bad happened (status %d)\n", (int) status);
      assert (0);
    }
}


int main (void)
{
  struct gm_port *port;
  gm_recv_event_t *event;
  char *token;
  
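  /* Call gm_open() outside assert() so it still runs when NDEBUG is defined. */
  if (gm_open (&port, 0, 7, "Test Resume", GM_API_VERSION) != GM_SUCCESS)
    {
      fprintf(stderr, "gm_open failed\n");
      return 1;
    }
  token = gm_dma_malloc (port, sizeof (char));
  if (token == NULL)
    {
      fprintf(stderr, "gm_dma_malloc failed\n");
      gm_close (port);
      return 1;
    }

  received = 0;
  sent = 0;
  gm_get_node_id(port, &my_gm_node_id);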
  gm_send_with_callback (port, token, 20, sizeof (char), 
                         GM_LOW_PRIORITY, my_gm_node_id, 7, 
                         test_send_callback, token);
  gm_send_with_callback (port, token, 20, sizeof (char),
                         GM_LOW_PRIORITY, my_gm_node_id, 7, 
                         test_send_callback, token);
  
  while (received < 2 || sent < 2)
    {
      event = gm_receive (port);
      switch (gm_ntoh_u8 (event->recv.type))
        {
        case GM_NO_RECV_EVENT:
          break;
        case GM_RECV_EVENT:
          fprintf(stderr,"received message\n");
          received += 1;
          break;
        default:
          gm_unknown (port, event);
        }
    }

  gm_dma_free (port, token);
  gm_close (port);
  return 0;
}


Generated on Mon Nov 3 15:39:27 2003 for GM by doxygen1.2.15