HA Troubleshooting
1. What if the HA servers fail to start?
This may be caused by mismatched HA properties. Verify that the following properties are set to the same values on the primary and backup servers:
- ActiveLockReAcquisitionInterval
- PassiveLockAcquisitionInterval
- PingInterval
- GatewayServerIPAddress
- GatewayServerPort
- LockFile
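A quick way to spot a mismatch is to compare the two servers' values side by side. The sketch below is a hedged illustration. It assumes you have copied each server's HA property values into a simple key=value file. Fiorano keeps these values in its own server configuration, so the export format, the file names and the class name HaPropertyCheck are all hypothetical. It reports every listed property whose value differs between the servers or is missing on one of them.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Compares HA property values exported from the primary and backup servers (hypothetical export format). */
final class HaPropertyCheck {
    // Property names taken from the troubleshooting list above.
    static final List<String> KEYS = List.of(
            "ActiveLockReAcquisitionInterval",
            "PassiveLockAcquisitionInterval",
            "PingInterval",
            "GatewayServerIPAddress",
            "GatewayServerPort",
            "LockFile");

    /** Returns one message per listed property that is missing or differs between the two sets. */
    static List<String> mismatches(Properties primary, Properties backup) {
        List<String> problems = new ArrayList<>();
        for (String key : KEYS) {
            String p = primary.getProperty(key);
            String b = backup.getProperty(key);
            if (p == null || b == null) {
                String where = (p == null && b == null) ? "either server" : (p == null ? "primary" : "backup");
                problems.add(key + " is not set on " + where);
            } else if (!p.trim().equals(b.trim())) {
                problems.add(key + " differs: primary=" + p + ", backup=" + b);
            }
        }
        return problems;
    }

    /** Loads a key=value file into a Properties object. */
    static Properties load(Path file) throws IOException {
        Properties props = new Properties();
        try (Reader in = Files.newBufferedReader(file)) {
            props.load(in);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("Usage: java HaPropertyCheck.java primary-ha.properties backup-ha.properties");
            return;
        }
        List<String> problems = mismatches(load(Path.of(args[0])), load(Path.of(args[1])));
        if (problems.isEmpty()) {
            System.out.println("All listed HA properties match.");
        } else {
            problems.forEach(System.out::println);
        }
    }
}
```

Run it with the two exported files as arguments. An empty report only means the listed values agree. It does not prove that the values themselves are correct.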
2. While the HA profile is running, the following error message is printed to the console:
Lock file::null is not specified properly.
=======================================================================================
Unable to validate permissions on lock file::null.
Either file is not valid or lock file machine is not allowing to verify permissions on lock,
Unable to boot the server
=======================================================================================
This error occurs when the LockFile property is not specified. Make sure that LockFile is set to a valid path whose permissions can be verified from the server machine.
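Before you start the HA profile, you can check that the configured lock file path is set and usable from the server machine. The following is a minimal sketch built with java.nio.file. The class name LockFileCheck and the exact checks are assumptions based on the error message above. This is not Fiorano's own validation logic. A remote lock-file machine can also enforce permissions that this local check cannot see.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical pre-start check for the LockFile path; not Fiorano's own validation. */
final class LockFileCheck {
    /** Returns null if the path looks usable from this machine, otherwise a problem description. */
    static String problemWith(String lockFile) {
        if (lockFile == null || lockFile.isBlank()) {
            return "LockFile is not specified";
        }
        Path path = Paths.get(lockFile).toAbsolutePath();
        Path dir = path.getParent();
        if (dir == null || !Files.isDirectory(dir)) {
            return "Lock file directory does not exist: " + dir;
        }
        if (Files.exists(path)) {
            // An existing lock file must be readable and writable by this process.
            if (!Files.isReadable(path) || !Files.isWritable(path)) {
                return "Lock file is not readable and writable: " + path;
            }
        } else if (!Files.isWritable(dir)) {
            // A missing lock file must be creatable in its directory.
            return "Cannot create lock file in: " + dir;
        }
        return null;
    }

    public static void main(String[] args) {
        // Usage: java LockFileCheck.java <LockFile path>
        String problem = problemWith(args.length > 0 ? args[0] : null);
        System.out.println(problem == null ? "Lock file path looks usable" : problem);
    }
}
```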
3. What if the HA servers fail to start when launched as a service?
Check the runContainer_output_log and runContainer_error_log logs. These logs contain all the errors and messages produced at run time.
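Filtering large service logs for error lines is a quick first pass. The sketch below prints the lines that contain common error markers. The marker strings, the class name LogScan and the log paths you pass in are assumptions, because the exact format of runContainer_output_log and runContainer_error_log is not described here. The logs are read as ISO-8859-1 so that unexpected bytes never abort the scan.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Prints log lines that look like errors (hypothetical helper, marker list is an assumption). */
final class LogScan {
    // Assumed markers; adjust to whatever the server actually writes.
    static final List<String> MARKERS = List.of("exception", "error", "unable to");

    /** Returns the lines that contain any marker, ignoring case. */
    static List<String> suspiciousLines(List<String> lines) {
        return lines.stream()
                .filter(line -> {
                    String lower = line.toLowerCase(Locale.ROOT);
                    return MARKERS.stream().anyMatch(lower::contains);
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Usage: java LogScan.java <path to runContainer_error_log> [more log paths...]
        for (String arg : args) {
            Path log = Paths.get(arg);
            System.out.println("== " + log);
            suspiciousLines(Files.readAllLines(log, StandardCharsets.ISO_8859_1))
                    .forEach(System.out::println);
        }
    }
}
```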
4. A SocketBindException reports that the HA port is already bound.
This exception means that another program is using the HA port, or that the previous instance of the server was not shut down properly. You can either stop or kill the application holding the port and start the server again, or choose a different HA port. If you change the HA port, you must also update the Backup Server port in the backup server's configuration.
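Before you restart, you can confirm that the HA port is still held by trying to bind it yourself. This is a minimal sketch. The class name PortCheck is hypothetical, and the port argument is whatever your HA port is configured to. A successful bind only shows that the port was free on this machine at that moment.

```java
import java.io.IOException;
import java.net.ServerSocket;

/** Reports whether a local TCP port can currently be bound (hypothetical helper). */
final class PortCheck {
    /** Returns true if a server socket can be bound to the port on all local interfaces. */
    static boolean isFree(int port) {
        try (ServerSocket socket = new ServerSocket(port)) {
            return true; // bind succeeded; the socket is closed again by try-with-resources
        } catch (IOException e) {
            return false; // typically java.net.BindException: Address already in use
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.out.println("Usage: java PortCheck.java <HA port>");
            return;
        }
        int port = Integer.parseInt(args[0]);
        System.out.println("Port " + port + (isFree(port) ? " is free" : " is in use"));
    }
}
```

If the port shows as in use, an operating-system tool such as netstat or lsof can identify the process that holds it.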
5. The servers do not connect to each other.
One server is in the STANDALONE state, the other is in the WAITING state, and the two servers are unable to communicate.
This behavior means that the Backup Server IP address and port are configured incorrectly in both server configurations.
Example: server consoles when the servers cannot connect to each other.
The server consoles below illustrate a situation in which a server is unable to connect to its Backup Server. If the backup server is running, as shown, the problem lies in the configuration. The console message prints the IP address and port that the server is trying to connect to in order to establish the HA channel, so check whether the Backup Server is actually running at that IP address and port.
Primary Server Console
Secondary Server Console
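A plain TCP connection attempt is enough to check whether anything is listening at the IP address and port printed on the console. The sketch below assumes that the HA channel is a TCP socket, which is not stated here. The class name BackupReachability and the 3-second timeout are hypothetical. A successful connect only proves that some process is listening at that address, not that it is the correct backup server.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Checks whether something accepts TCP connections at host:port (hypothetical helper). */
final class BackupReachability {
    /** Tries a TCP connection within timeoutMillis; returns true if it succeeds. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused, timed out, or host could not be resolved
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.out.println("Usage: java BackupReachability.java <backup-ip> <backup-port>");
            return;
        }
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        boolean ok = canConnect(host, port, 3000);
        System.out.println(host + ":" + port + (ok ? " is accepting connections" : " is not reachable"));
    }
}
```

Run it from each server machine against the address printed by the other server's console. This separates network or firewall problems from wrong IP or port settings.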
6. Both servers go to the STANDALONE/ACTIVE state
Both servers go to the STANDALONE state (replicated mode) or the ACTIVE state (shared mode) if the network link between them is broken. This can happen if the servers do not refer to the same LockFile.
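The role of a common LockFile can be pictured as an exclusive file lock. The server that holds the lock is ACTIVE, and the other server waits as PASSIVE. The sketch below is a simplified model of that idea built on java.nio.channels.FileChannel.tryLock. It is not Fiorano's implementation, and the classes LockArbitration and Server are hypothetical. It shows why two servers that point at different lock files can both become active. Within one JVM, a second lock on the same file raises OverlappingFileLockException instead of returning null, so the sketch treats both outcomes as "held elsewhere". File-lock behavior on network file systems also varies by platform.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Simplified, hypothetical model of lock-file arbitration between two HA servers. */
final class LockArbitration {
    enum Role { ACTIVE, PASSIVE }

    static final class Server implements AutoCloseable {
        private final FileChannel channel; // stays open so a PASSIVE server could retry later
        private FileLock lock;

        Server(Path lockFile) throws IOException {
            channel = FileChannel.open(lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        }

        /** Tries once to take the exclusive lock: the holder is ACTIVE, anyone else PASSIVE. */
        Role tryAcquire() throws IOException {
            if (lock == null) {
                try {
                    lock = channel.tryLock(); // null if another process holds the lock
                } catch (OverlappingFileLockException e) {
                    lock = null; // another Server in this same JVM holds it
                }
            }
            return lock != null ? Role.ACTIVE : Role.PASSIVE;
        }

        @Override
        public void close() throws IOException {
            if (lock != null && lock.isValid()) {
                lock.release();
            }
            channel.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("ha-lock-demo");
        // Misconfiguration: each server refers to its own lock file.
        try (Server primary = new Server(dir.resolve("primary.lck"));
             Server backup = new Server(dir.resolve("backup.lck"))) {
            System.out.println("Separate lock files: " + primary.tryAcquire() + " / " + backup.tryAcquire());
        }
        // Intended configuration: both servers refer to the same lock file.
        try (Server primary = new Server(dir.resolve("shared.lck"));
             Server backup = new Server(dir.resolve("shared.lck"))) {
            System.out.println("Shared lock file:    " + primary.tryAcquire() + " / " + backup.tryAcquire());
        }
    }
}
```

Running main should print ACTIVE / ACTIVE for the separate lock files and ACTIVE / PASSIVE for the shared one. The first result is the split situation described above.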
7. Both servers print the message 'Both servers are in STANDALONE state. Please refer troubleshooting section of Fiorano HA guide or take backup of both server databases and contact fiorano support team.'
If this happens even though the servers are configured properly, back up both server databases, the server logs, and the console logs. Contact the Fiorano support team to report the situation, and follow the remaining steps with their help. Note that the following steps can cause message ordering issues.
To avoid downtime on production servers, you can move the secondary server database elsewhere, delete the log file, and then start both servers. The servers should now work properly and process all the data. If the primary server switches to ACTIVE, investigate the secondary server's data to avoid message loss. If the primary server switches to PASSIVE, investigate the data of both the primary and secondary servers to avoid message loss.
Next, investigate the server database (the contents of the run directory) to determine whether it contains recent messages that are not present in the other server's database. To do this, move the database into the run directory of a standalone FioranoMQ server and start that server. All pending messages can then be viewed with the fmq/samples/PTP/QueueBrowser and fmq/samples/PubSub/MessageBrowsers samples. Finally, either transfer the messages that are missing from the other server's data to the HA server with a custom application, or use a StandAloneBridge and StandAloneRepeater to transfer all messages to the HA server.
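Once the messages in each database have been listed, deciding which ones still need to be transferred is a set difference. This sketch assumes that you have exported the message IDs shown by the browser samples into plain string lists, one list per queue or topic. The export step, the ID values, and the class name MissingMessages are hypothetical. The FioranoMQ samples do not produce this output directly.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Finds message IDs present in one server's data but absent from the other's (hypothetical helper). */
final class MissingMessages {
    /** Returns IDs in the investigated database that the HA server lacks, keeping their original order. */
    static Set<String> missing(List<String> investigated, List<String> haServer) {
        Set<String> result = new LinkedHashSet<>(investigated);
        result.removeAll(Set.copyOf(haServer)); // hash set makes each lookup constant time
        return result;
    }

    public static void main(String[] args) {
        // Made-up IDs for illustration only.
        List<String> secondary = List.of("ID:1", "ID:2", "ID:3", "ID:4");
        List<String> primary = List.of("ID:1", "ID:2");
        System.out.println("Transfer to HA server: " + missing(secondary, primary));
    }
}
```

The order of the investigated list is kept so that the remaining messages can be replayed in their original sequence. This is relevant because of the ordering caveat noted above.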
Limitations
Client-level transactions do not span servers in the Enterprise Server when running in shared mode. If the primary server crashes, transacted sessions involving receivers are rolled back, so the messages delivered in those transactions are redelivered to the receivers when they connect to the backup server.
Distributed transactions that are executing during the transition phase become in-doubt transactions when the primary server goes down. These transactions are rolled back and can be recovered after the client reconnects to the secondary server.
A JMS Topic Requestor cannot receive its intended reply if failover occurs after a request is sent. This is because the Topic Requestor creates a non-durable subscriber, which can miss messages during failover. If a topic requestor instead creates a durable subscriber to listen for replies, it works even during failover.
If both HA servers (primary and backup) go down, the requestor receives a duplicate reply (with the redelivered flag set to true) for the first request made after failover.
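Rolled-back transacted sessions and duplicate replies flagged as redelivered both cause repeated deliveries. Receivers that cannot tolerate duplicates often deduplicate by message ID. The sketch below shows that idempotent-receiver pattern. It uses a plain Java record instead of javax.jms.Message so that it runs without a JMS provider. In real code, the ID and the flag would come from the JMSMessageID and JMSRedelivered headers. The class names and the bounded history size are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Idempotent-receiver sketch: skips messages whose ID has already been processed. */
final class DedupReceiver {
    /** Stand-in for a JMS message: id ~ JMSMessageID, redelivered ~ JMSRedelivered. */
    record Msg(String id, boolean redelivered) {}

    private final Map<String, Boolean> seen;

    DedupReceiver(int maxRemembered) {
        // Insertion-ordered map that forgets the oldest ID once the limit is exceeded.
        this.seen = new LinkedHashMap<>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxRemembered;
            }
        };
    }

    /** Returns true if the message should be processed, false if its ID was already seen. */
    boolean accept(Msg msg) {
        // put() returns the previous value, so non-null means a duplicate delivery.
        return seen.put(msg.id(), Boolean.TRUE) == null;
    }

    public static void main(String[] args) {
        DedupReceiver receiver = new DedupReceiver(10_000);
        Msg[] arrivals = {
            new Msg("ID:reply-1", false),
            new Msg("ID:reply-1", true), // duplicate reply after failover
            new Msg("ID:reply-2", false)
        };
        for (Msg m : arrivals) {
            String flag = m.redelivered() ? " (redelivered)" : "";
            System.out.println(m.id() + flag + (receiver.accept(m) ? ": process" : ": skip duplicate"));
        }
    }
}
```

Every ID is checked, not only the flagged ones, because the redelivered flag is a hint rather than a guarantee of duplication. The history is bounded, so a duplicate that arrives after its ID has been evicted is processed again.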