(April 6th, 2011, Koichi Suzuki)
Now we have GTM-Standby, which backs up all of the current GTM status synchronously and takes over when the master GTM fails. We now need a mechanism in GTM-Proxy to reconnect to the new GTM in that case.
When GTM-Standby takes over from the GTM, all of the status has already been copied to the Standby, so GTM-Proxy only has to disconnect from the old GTM, reconnect to the new one and register itself. If the last command sent to the old GTM got no response, it can be reissued to the new GTM to get a correct response.
Because the reconnect itself will be triggered by the XCM module, xcwatcher and the monitoring agent, this document provides an initial design of how to implement the reconnect in GTM-Proxy.
Reconnect can be initiated by invoking gtm_ctl with a new command, "reconnect". The syntax will be as follows:
gtm_ctl -S gtm_proxy reconnect -D dir -o "-s xxx -t xxx"
where the -D option specifies gtm_proxy's working directory, which must be the same as the one it was started with. -s and -t specify the address and the port number of the new GTM.
Also, at start time, gtm_ctl will have new options, namely -w and -n. They indicate how long and how many times GTM-Proxy waits when it detects that the connection from GTM has been closed. During this wait, users can issue gtm_ctl reconnect to GTM-Proxy.
Within gtm_ctl, these options are backed up to the gtm_proxy.opts file in GTM's working directory, merged with the existing options. Then gtm_ctl will prepare a gtm_proxy_sighup.opt file (the file name may change) to indicate the reconnect and send a SIGHUP signal to gtm_proxy.
The gtm_proxy SIGHUP signal handler will check gtm_proxy_sighup.opt and determine whether it should reconnect to the new GTM (at present, this is the only SIGHUP action), and update a flag to indicate the reconnect. Such a flag can be stored in a thread-specific structure.
Each thread checks this flag before it sends commands to GTM. If an error is detected while receiving a response from GTM, that is also the time to check this structure. If the flag is not set, the thread can wait for a little while and check again.
If the "should_lock" bit is set, the thread disconnects its current connection to the (old) GTM, reconnects to the new one, and can continue serving the coordinator or datanode.
Current Code Study
It was very straightforward code.
The point is: the new GTM host and port will be given as the -o option. This will be passed to the target GTM Proxy through a "newgtm" file under the -D option directory.
The only issue is that we have to leave the option argument check to the gtm proxy's signal handler, so the result of gtm_ctl can only be known by checking the log.
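Since the argument check is left to the proxy side, the following is a minimal sketch of what validating the contents of the "newgtm" file could look like. It assumes the file simply holds the -o string ("-s host -t port"); the function names, buffer sizes and the accepted port range are hypothetical and not taken from the gtm_proxy source.

```c
#include <stdlib.h>
#include <string.h>

/* Copy the next blank-separated token of p into out. Returns the position
 * just after the token, or NULL if there is no token or it does not fit. */
static const char *sketch_next_token(const char *p, char *out, size_t outlen)
{
    size_t n;

    p += strspn(p, " \t\n");            /* skip leading blanks */
    n = strcspn(p, " \t\n");            /* length of the token */
    if (n == 0 || n >= outlen)
        return NULL;
    memcpy(out, p, n);
    out[n] = '\0';
    return p + n;
}

/* Parse "-s host -t port" (in either order). Returns 0 when both options
 * are present and valid, -1 otherwise. Hypothetical helper. */
static int sketch_parse_newgtm(const char *opts, char *host, size_t hostlen,
                               int *port)
{
    char flag[8], val[128];
    int have_host = 0, have_port = 0;
    const char *p = opts;

    while (p[strspn(p, " \t\n")] != '\0')
    {
        if ((p = sketch_next_token(p, flag, sizeof(flag))) == NULL)
            return -1;
        if ((p = sketch_next_token(p, val, sizeof(val))) == NULL)
            return -1;                  /* option without a value */

        if (strcmp(flag, "-s") == 0)
        {
            if (strlen(val) >= hostlen)
                return -1;
            strcpy(host, val);
            have_host = 1;
        }
        else if (strcmp(flag, "-t") == 0)
        {
            char *end;
            long v = strtol(val, &end, 10);

            if (*end != '\0' || v < 1 || v > 65535)
                return -1;              /* not a usable port number */
            *port = (int) v;
            have_port = 1;
        }
        else
            return -1;                  /* unknown option */
    }
    return (have_host && have_port) ? 0 : -1;
}
```

A failed parse would be the case that has to be reported through the log, because gtm_ctl has already returned by then.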
The signal handler is straightforward too. I need a new signal handler because the current one just exits the whole process. GTMProxy_ThreadInfo contains all the information needed for each thread. We can add a reconnect flag (and the necessary lock) to this structure.
Because this thread_info is not exposed, I need to have one anchor in the heap area and chain all the thread info from it, so that the signal handler can find where to set the info (a lock is needed to write the value).
Each thread polls this value. Because the signal handler is the only writer and the write is done in one instruction, we don't need a lock to read it. When a thread recognizes that the flag is set, it reconnects to the new GTM and (after acquiring the lock) turns the flag off.
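The anchor, the chain and the polled flag can be sketched roughly as follows. SketchThreadInfo is a hypothetical stand-in for GTMProxy_ThreadInfo, and all names here are illustrative. The sketch also sidesteps the write lock by assuming the chain is fully built before the handler is installed, so the handler only stores to sig_atomic_t fields.

```c
#include <signal.h>
#include <stddef.h>

/* Hypothetical stand-in for GTMProxy_ThreadInfo; only the fields used here. */
typedef struct SketchThreadInfo
{
    volatile sig_atomic_t reconnect_requested;  /* set by the signal handler */
    int reconnect_count;                        /* reconnects done by this worker */
    struct SketchThreadInfo *next;              /* chain reachable from the anchor */
} SketchThreadInfo;

/* Anchor of the chain, so the handler can reach every worker's info. */
static SketchThreadInfo *sketch_thread_anchor = NULL;

/* Add one worker's info to the chain. Assumed to run before the handler
 * is installed, so the handler never sees a half-built chain. */
static void sketch_register_thread(SketchThreadInfo *info)
{
    info->reconnect_requested = 0;
    info->reconnect_count = 0;
    info->next = sketch_thread_anchor;
    sketch_thread_anchor = info;
}

/* Signal handler: the only writer that turns the flags on. */
static void sketch_reconnect_handler(int signo)
{
    SketchThreadInfo *p;

    (void) signo;
    for (p = sketch_thread_anchor; p != NULL; p = p->next)
        p->reconnect_requested = 1;
}

/* Called by a worker before it sends to GTM. Returns 1 if it reconnected. */
static int sketch_poll_reconnect(SketchThreadInfo *self)
{
    if (!self->reconnect_requested)
        return 0;

    /* Clear first, so a request arriving during the reconnect is not lost. */
    self->reconnect_requested = 0;

    /* Placeholder: close the old GTM connection and connect to the new one. */
    self->reconnect_count++;
    return 1;
}
```

Clearing the flag before reconnecting rather than after is a choice made in this sketch, not something the design above prescribes.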
worker thread main loop
GTMProxy_ThreadMain() is the main thread loop.
We need to add polling of the SIGUSR2 handler flag. If it is set, GTM-Proxy disconnects from the current GTM and reconnects to the new GTM (the promoted one).
How to detect gtm_ctl reconnect ?
It's not simple.
1. Need to back up commands from clients. --> Is the memory context safe?
2. Need to detect gtm_ctl reconnect while waiting for the reply of send() to GTM. send() may return with an error (disconnected) --> need error handling.
Send/receive call paths that need handling:
send() <-- gtmpqSendSome() <-- gtmpqFlush() <-- (many) <-- gtmpqWaitTimed() <-- (many) <-- internal_flush() <-- internal_putbytes() <-- pq_putbytes() <-- pq_putmessage() <-- pq_flush() <-- (many)
sendto() <-- (none)
sendmsg() <-- (none)
recv() <-- gtmpqReadData() <-- (many) <-- pq_recvbuf() <-- pq_getbyte() <-- pq_peekbyte() <-- pq_getbytes() <-- pq_getbytes() <-- pq_getstring()
recvfrom() <-- (none)
recvmsg() <-- (none)
Again, proxy message loop (proxy_main.c)
ProcessCommand()
  --> ProcessPGXCNodeCommand() --> (nothing sent to GTM)
  --> ProcessTransactionCommand() --> (nothing sent to GTM)
  --> ProcessSnapshotCommand() --> (Prepare msg to GTM ... ) --> GTMProxy_ProxyCommand() --> No writes to GTM --> Really?
  --> ProcessSequenceCommand() --> (Prepare msg to GTM ... ) --> (same)
GTMProxy_ProcessPendingCommands() --> (many writes to GTM())
Now all the incoming commands are buffered before calling GTMProxy_ProcessPendingCommands().
So we can check whether reconnect has been issued just before this call. If reconnect is issued after this check and before the end of the loop turn, the signal handler can longjmp() back to just before this call. All the incoming commands are then ready to go to the new GTM!
Next, GTMPQgetResult() is called to collect the response from GTM. If GTM fails and no gtm_ctl reconnect has been issued, should we fail or just wait? There is a choice. Maybe we can wait in a small loop for a while until gtm_ctl reconnect is issued, while reporting the GTM error. To do this, we may need to analyze the error in more detail.
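The "wait in a small loop" option might look like the sketch below. The parameters wait_secs and max_tries are assumed to correspond to the -w and -n options. The flag and function names are hypothetical, and error reporting is left as a comment.

```c
#include <signal.h>
#include <unistd.h>

/* Hypothetical flag, turned on by the reconnect signal handler. */
static volatile sig_atomic_t sketch_reconnect_flag = 0;

/* After a GTM error, wait up to max_tries times, wait_secs seconds each,
 * for a reconnect request. Returns 1 if a reconnect was requested and 0 if
 * the budget ran out. *tries_used reports how many waits were spent. */
static int sketch_wait_for_reconnect(unsigned int wait_secs, int max_tries,
                                     int *tries_used)
{
    int i;

    for (i = 0; i < max_tries; i++)
    {
        if (sketch_reconnect_flag)
        {
            *tries_used = i;
            return 1;
        }
        /* Placeholder: report the GTM error to the log here. */
        sleep(wait_secs);   /* also returns early if a signal arrives */
    }
    *tries_used = max_tries;
    return sketch_reconnect_flag ? 1 : 0;
}
```

If the function returns 0, the caller still has to decide whether to fail the pending commands or keep waiting, which is the open question above.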
I'm not sure that there are no writes to GTM in the GTMProxy_ProxyCommand() call. A safer (and less restrictive) way is to back up the commands from the incoming sockets and redo them when reconnect is detected.
Use of SIGNAL
GTM-Proxy will receive SIGUSR1 as a whole process. This signal will be delivered to the main thread, which distributes SIGUSR2 to all the worker threads.
In each worker thread, we control how the signal handler returns. In ThreadInfo, we have an option to allow longjmp() or not, as well as the stack context saved by setjmp().
In principle, longjmp() occurs in the signal handler only when the signal is detected while reading from or writing to GTM (asynchronous signal detection). In other cases, the signal handler just returns after turning on the flag that shows the signal was detected. The flag is tested when the worker thread has read all the incoming commands from the backend (synchronous detection). Incoming commands are backed up so that they can be retried after reconnecting to the new GTM. In both cases, we use longjmp() to jump to the point where commands are read. If SIGUSR2 is detected, the worker reconnects to the new GTM and continues command handling.
Order of Re-Register
After GTM-Standby is promoted and GTM-Proxy reconnects to the new GTM, the reconnect should be done in the following steps.
- GTM-Proxy main thread should register itself to new GTM,
- After GTM-Proxy registration has been done, each worker thread can register itself to the new GTM.
To do this, we need another pthread lock (a read/write lock, say, the reconnect control lock). First, when the GTM-Proxy main thread accepts SIGUSR1, it should validate the reconnect information. If it's valid, the main thread should acquire the write lock of the reconnect control lock and issue SIGUSR2 to the worker threads. The GTM-Proxy main thread then continues execution to reconnect itself to the new GTM. When it's done, the reconnect control lock is released.
The worker thread signal handler first tries to acquire the read lock of the reconnect control lock, which makes it wait until the main thread has reconnected and registered itself to the new GTM, and then connects and registers itself to the new GTM.
I found that worker threads just connect to GTM and do not register themselves. Is that okay?
Yet Another Issue to Handle
Order to start GTM-Standby
At present, GTM-Standby must be connected before GTM accepts any other connections. Connections established before GTM-Standby connects will not be backed up. This means that if GTM-Standby is promoted and we'd like another standby, we have to connect it to the promoted Standby before GTM-Proxy reconnects. Otherwise we have to shut down the whole cluster to make a GTM-Standby available.
Recovery *.node file
Somehow, GTM and GTM-Proxy refuse to register datanode/coordinator. Should identify the cause to fix it.
This section describes some of the message format for GTM (not exhaustive).
MSG_NODE_REGISTER (command: 4byte)
  node_type (GTM_PGXCNodeType) --> gtm_c.h (4byte)
  node_id (GTM_PGXCNodeId) --> gtm_c.h (4byte)
  ip_address (len:4byte + data)
  port#
  proxy_node_id (don't know if we need this. Should node_id be sufficient?)
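As an illustration of the layout above, the following sketch packs the fields into a byte buffer. Several details are assumptions, not taken from the GTM source: integers are written in network byte order, port# and proxy_node_id are taken to be 4 bytes each, and the numeric command code is passed in by the caller because its value is not given here. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write a 32-bit value at buf[off] in network byte order (assumed).
 * Returns the offset just after the written value. */
static size_t sketch_put_u32(unsigned char *buf, size_t off, uint32_t v)
{
    buf[off]     = (unsigned char) ((v >> 24) & 0xff);
    buf[off + 1] = (unsigned char) ((v >> 16) & 0xff);
    buf[off + 2] = (unsigned char) ((v >> 8) & 0xff);
    buf[off + 3] = (unsigned char) (v & 0xff);
    return off + 4;
}

/* Build a node-register message. Returns the number of bytes written,
 * or 0 if the buffer is too small. */
static size_t sketch_build_node_register(unsigned char *buf, size_t buflen,
                                         uint32_t command,      /* MSG_NODE_REGISTER code */
                                         uint32_t node_type,
                                         uint32_t node_id,
                                         const char *ip_address,
                                         uint32_t port,         /* width assumed */
                                         uint32_t proxy_node_id /* width assumed */)
{
    size_t iplen = strlen(ip_address);
    size_t need = 6 * 4 + iplen;    /* six 4-byte integers plus the address */
    size_t off = 0;

    if (need > buflen)
        return 0;

    off = sketch_put_u32(buf, off, command);
    off = sketch_put_u32(buf, off, node_type);
    off = sketch_put_u32(buf, off, node_id);
    off = sketch_put_u32(buf, off, (uint32_t) iplen);   /* len: 4 bytes */
    memcpy(buf + off, ip_address, iplen);               /* data, no terminator */
    off += iplen;
    off = sketch_put_u32(buf, off, port);
    off = sketch_put_u32(buf, off, proxy_node_id);
    return off;
}
```

If proxy_node_id turns out to be unnecessary, the message shrinks by its last four bytes and nothing else in the layout changes.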