
FLUSH protocol for default stack
================================

Author: Bela Ban
Version: $Id: FLUSH.txt,v 1.10 2006/10/20 18:49:37 vlada Exp $
See http://jira.jboss.com/jira/browse/JGRP-82

Overview
--------
Flushing means that all members of a group flush their pending messages. The process of flushing acquiesces the cluster
so that state transfer or a join can be done. It is also called the stop-the-world model, as nobody is able to
send messages while a flush is in progress.

When is it needed?
------------------
(1) On state transfer. When a member requests state transfer it tells everyone to stop sending messages and waits
for everyone's ack. Then it asks the application for its state and ships it back to the requester. After the
requester has received and set the state successfully, the requester tells everyone to resume sending messages.

(2) On view changes (e.g. a join). Before installing a new view V2, flushing would ensure that all messages *sent* in the
current view V1 are indeed *delivered* in V1, rather than in V2 (in all non-faulty members). This is essentially
Virtual Synchrony.

Design
------
Flushing is implemented as another protocol, FLUSH, which handles a FLUSH event. FLUSH sits just below the channel,
e.g. above STATE_TRANSFER and FC. The STATE_TRANSFER and GMS protocols request a flush by sending a SUSPEND event up
the stack, where it is handled by the FLUSH protocol. The SUSPEND_OK ack sent back by the FLUSH protocol lets the
caller know that the flush has completed.

When done (e.g. view was installed or state transferred), the protocol sends up a RESUME event, which will allow
everyone in the cluster to resume sending.
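As a rough sketch of this event flow (illustrative Python, not the JGroups API; class, event and method names are invented for this sketch, and the real protocol blocks on a mutex rather than raising):

```python
class FlushProtocol:
    """Sits just below the channel; blocks down() while a flush is active."""

    def __init__(self):
        self.flushing = False

    def up(self, event):
        if event == "SUSPEND":       # STATE_TRANSFER or GMS requests a flush
            self.flushing = True     # (real impl: run the flush rounds first)
            return "SUSPEND_OK"      # ack: the flush has completed
        if event == "RESUME":        # requester is done (view installed, etc.)
            self.flushing = False
            return "RESUMED"

    def down(self, msg):
        if self.flushing:
            raise RuntimeError("blocked: flush in progress")
        return "sent " + msg
```

Once SUSPEND_OK has been returned, every call into down() is refused until RESUME arrives.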

FLUSH protocol (state transfer)
-------------------------------
- State requester sends SUSPEND event: multicast START_FLUSH to everyone (including self)

- On reception of START_FLUSH:
  - send BLOCK event up the stack to application level
  - multicast a FLUSH_OK message

- On reception of FLUSH_OK at each member:
  - Record FLUSH_OK in vector (1 per member)
  - Once FLUSH_OK messages have been received from all members:
  	 - send unicast FLUSH_COMPLETED to flush requester 
  	 - block FLUSH.down(), all threads invoking FLUSH.down() will block on mutex

- After reception of FLUSH_COMPLETED (at flush requester) from all members - send down SUSPEND_OK event

- On RESUME event: multicast STOP_FLUSH message

- On reception of STOP_FLUSH message: 
	- unblock mutex at down() and send UNBLOCK event up the stack to application level (unless flush requester member)
	- multicast STOP_FLUSH_OK

- On reception of STOP_FLUSH_OK message: 	
	- unblock mutex at down() and send UNBLOCK event up the stack to application level (only flush requester member)
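The per-member bookkeeping in the steps above can be sketched as follows (illustrative Python; message sends are modeled as returned strings, and the FlushMember name is made up for this sketch):

```python
class FlushMember:
    """Per-member FLUSH_OK bookkeeping for one flush round."""

    def __init__(self, name, members):
        self.name = name
        self.members = set(members)
        self.flush_oks = set()    # records one FLUSH_OK per member
        self.blocked = False      # whether FLUSH.down() is blocked

    def on_start_flush(self):
        # deliver BLOCK to the application, then multicast FLUSH_OK
        return ["BLOCK -> app", "mcast FLUSH_OK"]

    def on_flush_ok(self, sender):
        self.flush_oks.add(sender)
        if self.flush_oks == self.members:   # heard from everyone
            self.blocked = True              # FLUSH.down() now blocks
            return "ucast FLUSH_COMPLETED -> requester"
        return None

    def on_stop_flush(self):
        self.blocked = False   # threads blocked in down() may proceed
        return ["UNBLOCK -> app", "mcast STOP_FLUSH_OK"]
```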



FLUSH protocol (view installation for join/crash/leave)
-------------------------------------------------------

- Coordinator initiates FLUSH prior to view V2 installation

- Coordinator sends SUSPEND event: multicast START_FLUSH to every surviving/existing member in previous view V1 (including self)

- On reception of START_FLUSH (existing members in V1):
  - send BLOCK event up the stack to application level
  - multicast a FLUSH_OK message

- On reception of FLUSH_OK at each existing member in V1:
  - Record FLUSH_OK in vector (1 per member)
  - Once FLUSH_OK messages have been received from all members:
  	 - send unicast FLUSH_COMPLETED to flush requester 
  	 - block FLUSH.down(), all threads invoking FLUSH.down() will block on mutex

- After reception of FLUSH_COMPLETED (at flush requester) from all members - send up/down SUSPEND_OK event

- On RESUME event: multicast STOP_FLUSH message

- On reception of STOP_FLUSH message (every member from V1): 
	- unblock mutex at down() and send UNBLOCK event up the stack to application level
	- multicast STOP_FLUSH_OK

- On reception of STOP_FLUSH_OK message (every member in V2 and not in V1): 	
	- unblock mutex at down() and send UNBLOCK event up the stack to application level
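The unblock rules above boil down to set arithmetic; the helper below (illustrative only, not part of the protocol) shows which members react to which message:

```python
def unblock_targets(v1, v2):
    """Which members unblock on which message during view installation."""
    v1, v2 = set(v1), set(v2)
    return {
        "STOP_FLUSH": v1,            # every member from V1
        "STOP_FLUSH_OK": v2 - v1,    # members in V2 but not in V1 (joiners)
    }
```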
	
	
	
Crash handling during FLUSH
----------------------------
- On suspect(P)
  - Do not expect any FLUSH message exchange from P (effectively exclude from FLUSH)
  - If P is the flush requester and Q is next member in view 
    - Run the FLUSH protocol from scratch with Q as flush requester

	
	

Notes
-----
*Multicasting* (as compared to unicasting) the FLUSH_OK message is required so we can be sure that the FLUSH_OK
message sent by P is the last message received from P before P stopped sending messages. Since we're waiting on
everyone's FLUSH_OK, we can be assured that everyone's previous messages were received (thanks to reliable mcast).


FLUSH on view change
--------------------
Whenever there is a view change, GMS needs to use FLUSH to acquiesce the group, before sending the view change message
and returning the new view to the joiner. Here's pseudo code of the coordinator A in group {A,B,C}:
- On reception of JOIN(D):
  - FLUSH(A,B,C) - wait until flush returns, we flush A, B and C, but obviously not D
  - Compute new view V2={A,B,C,D}
  - Multicast V2 to {A,B,C}, wait for all acks
  - Send JOIN_RSP with V2 to D, wait for ack from D
  - RESUME sending messages

This algorithm also works for a number of JOIN and LEAVE requests (view bundling):
- On handleJoinAndLeaveRequests(M, S, L, J), where M is the current membership, S the set of suspected members,
  L the set of leaving members and J the set of joined members:
  - FLUSH(M)
  - Compute new view V2=M - S - L + J
  - Multicast V2 to M-S, wait for all acks
  - Send LEAVE_RSP to L, wait for all acks
  - Send JOIN_RSP to J, wait for all acks
  - Resume sending messages
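The view computation above is plain set arithmetic; a small illustrative helper (names invented for this sketch):

```python
def bundle_view(m, s, l, j):
    """V2 = M - S - L + J; also returns the recipients of the V2 multicast."""
    m, s, l, j = set(m), set(s), set(l), set(j)
    v2 = (m - s - l) | j
    return v2, m - s     # multicast V2 to M - S
```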
  

FLUSH with join & state transfer
--------------------------------
http://jira.jboss.com/jira/browse/JGRP-236 (TODO)

- A new method Channel.connect(String group, boolean transfer_state, String state_id) will be added
- The JOIN_REQ contains the boolean
- The algorithm in GMS is the same as the two described above, except that before multicasting the new view and sending
  the JOIN_RSPs and LEAVE_RSPs, we ask the application for its state (GET_APPLSTATE) and when received (GET_APPLSTATE_OK),
  we send it back to the joining member(s), and then RESUME sending messages

Design change: flushing and block() / unblock() callbacks
---------------------------------------------------------

The change consists of moving the blocking of message sending in FLUSH.down() from the first phase to the second phase:

Phase #1: send START_FLUSH across the cluster, on reception everyone invokes block(). We do *not* (as previously done)
          block messages in FLUSH.down(). Method block() might send some messages, but when block() returns (or
          Channel.blockOk() is called), FLUSH_OK will be broadcast across the cluster. Because FIFO is in place, we
          can be sure that all *multicast* (group) messages from member P are received *before* P's FLUSH_OK message.

Phase #2: send FLUSH_OK across the cluster. When every member has received all FLUSH_OKs from all other members, every
          member blocks messages in FLUSH.down().

Phase #3: send FLUSH_COMPLETED to the initiator of the flush

Phase #4: send STOP_FLUSH across the cluster, e.g. when state has been transferred, or member has joined
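The key change - down() stays open through block() and closes only in phase #2 - can be sketched as (illustrative Python, invented names):

```python
class TwoPhaseFlush:
    """down() stays open during block(); it blocks only in phase #2."""

    def __init__(self, members):
        self.members = set(members)
        self.oks = set()
        self.down_blocked = False

    def on_start_flush(self, block_callback):
        block_callback()          # phase #1: app may still send messages here
        return "mcast FLUSH_OK"   # broadcast only after block() returns

    def on_flush_ok(self, sender):
        self.oks.add(sender)             # FIFO guarantees sender's earlier
        if self.oks == self.members:     # mcasts were delivered before this
            self.down_blocked = True     # phase #2: down() now blocks
```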


Use case: block() callback needs to complete transactions which are in the 2PC (two-phase commit) phase
- Members {A,B,C}
- C just joined and now starts a state transfer, to acquire the state from A
- A and B have transactions in PREPARED or PREPARING state, so these need to be completed
- C initiates the flush protocol
- START_FLUSH is broadcast across the cluster
- A and B have their block() method invoked; they broadcast PREPARE and/or COMMIT messages to complete existing
  prepared or preparing transactions, and block or roll back other transactions
- A and B have to wait for PREPARE or COMMIT acks from all members, they block until these are received (inside
  of the block() callback)
- Once all acks have been received, A and B return from block(), they now broadcast FLUSH_OK messages across the cluster
- Once everyone has received FLUSH_OK messages from A, B and C, we can be assured that we have also received all
  PREPARE/COMMIT/ROLLBACK messages and their acks from all members
- Now FLUSH.down() at A,B,C starts blocking sending of messages
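The block() callback in this use case might be sketched as follows (illustrative Python; commit/rollback are stand-ins for the transaction manager's broadcast-and-wait-for-acks logic):

```python
def block_callback(transactions, commit, rollback):
    """Complete in-flight 2PC transactions before FLUSH_OK may go out.

    transactions: dict of name -> state ("PREPARED", "PREPARING", ...);
    commit/rollback model the real broadcast-and-wait-for-acks calls.
    """
    for name, state in transactions.items():
        if state in ("PREPARED", "PREPARING"):
            commit(name)      # broadcast COMMIT, wait for all acks (elided)
        else:
            rollback(name)    # block or roll back everything else
    # only when this returns is FLUSH_OK broadcast across the cluster
```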

Issues:
- What happens if a member sends another message M *after* returning from block() but *before* that member has received
  all FLUSH_OK messages and therefore started blocking?
- This algorithm only works for multicast messages, not unicast messages because multicast and unicast messages are not
  ordered with respect to each other

