Arul’s Oracle Zone

~my references and inferences of Oracle

Weird RAC issue due to an incorrect setup

Posted by Arul Ramachandran on November 18, 2007

While setting up a 4-node Oracle 10.2.0.2 RAC on RHEL4, I installed Oracle Clusterware and the Oracle RAC binaries, created a clustered database, and then went ahead to open the instance on the first node and then the second node. The instance on the first node came up without any problem, but the instance on the second node took a while to come up. On further checking, we realized that the first instance was down by then: while the second instance was coming up, the first instance would crash. It all seemed weird, because each instance looked healthy on its own, yet merely bringing up one instance would crash the instance that was already open.

The alert log had the following messages:

Interface type 1 eth0 192.168.10.0 configured from OCR for use as a cluster interconnect
Interface type 1 eth3 192.168.11.0 configured from OCR for use as a cluster interconnect
WARNING 192.168.11.0 could not be translated to a network address error 1
Interface type 1 eth2 1#.1##.145.0 configured from OCR for use as a public interface
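To pull just these lines out of a long alert log, a small filter is enough. This is a minimal sketch: the function name scan_interconnect_lines is made up for illustration, and the patterns are simply the wording of the messages quoted above, which may differ in other Oracle versions.

```shell
# Hypothetical helper: print the interconnect-related lines from an alert log.
# $1 is the path to the alert log (the path is an assumption; it depends on
# your background_dump_dest setting).
scan_interconnect_lines() {
    # Match the "configured ... as a cluster interconnect" lines and the
    # "could not be translated" warning shown above.
    grep -E 'cluster interconnect|could not be translated' "$1"
}
```

Calling scan_interconnect_lines with the alert log path shows at a glance whether a subnet registered as a cluster interconnect also has a translation warning against it.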

I went ahead to check whether both interconnects were healthy.

A ping command on the primary interconnect revealed it was fine.

$ ping <Primary-NIC>
PING Primary-NIC (192.168.10.2) 56(84) bytes of data.
64 bytes from Primary-NIC (192.168.10.2): icmp_seq=0 ttl=64 time=0.035 ms
64 bytes from Primary-NIC (192.168.10.2): icmp_seq=1 ttl=64 time=0.045 ms

--- Primary-NIC ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.035/0.040/0.045/0.005 ms, pipe 2

A ping command on the secondary interconnect indicated there was a problem.

$ ping <Secondary-NIC>
PING Secondary-NIC (192.168.11.2) 56(84) bytes of data.

--- Secondary-NIC ping statistics ---
28 packets transmitted, 0 received, 100% packet loss, time 27018ms
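The two manual pings can be wrapped into one check that is easy to rerun on every node before starting the instances. This is only a sketch: check_interconnects is a made-up name, the addresses are passed in by you, and the -c (count) and -W (timeout in seconds) options are those of the Linux iputils ping.

```shell
# Hypothetical helper: ping each private address given as an argument and
# report OK or FAIL per address. Returns non-zero if any address failed.
check_interconnects() {
    failed=0
    for addr in "$@"; do
        # Send 2 probes and wait at most 2 seconds for each reply.
        if ping -c 2 -W 2 "$addr" >/dev/null 2>&1; then
            echo "OK   $addr"
        else
            echo "FAIL $addr"
            failed=1
        fi
    done
    return "$failed"
}
```

For the setup above it would be called as check_interconnects 192.168.10.2 192.168.11.2, and a non-zero return code would have flagged the broken secondary interconnect.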

Once the SysAdmin fixed the secondary interconnect, both instances came up without any issues. This reinforces that if you specify both a primary and a secondary interconnect during the install, both of them must be functioning. Note, however, that specifying two interconnects this way does not provide redundancy or the ability to fail over to the good interconnect if one of them fails.

The setup did not use NIC bonding; instead it used a primary and a secondary interconnect for the private interface (the secondary interconnect was intended for failover capability). I am positive both NICs must have been working during the install, since the install prerequisites were all met. But it is evident from the alert log messages that the second interconnect had some trouble.

Multiple interconnects allow Cache Fusion traffic to be distributed on all interconnects. However, if any one of the interconnects does not function, Oracle will assume the private network is down and will not open in cluster mode. Therefore it is highly recommended to use NIC bonding at the OS level for NIC failover and traffic sharing capability.
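Since every registered interconnect has to work, it helps to know exactly which interfaces are registered as cluster interconnects. The sketch below filters the output of oifcfg getif; it assumes each output line has the form "interface subnet scope type" (for example "eth0 192.168.10.0 global cluster_interconnect"), and the function name list_cluster_interconnects is made up for illustration.

```shell
# Hypothetical helper: read "oifcfg getif"-style lines on stdin and print the
# interface name and subnet of every entry typed cluster_interconnect.
# Assumed column layout: <interface> <subnet> <scope> <type>
list_cluster_interconnects() {
    awk '$4 == "cluster_interconnect" { print $1, $2 }'
}
```

It would be used as oifcfg getif | list_cluster_interconnects. If it prints more than one line, each listed interface needs to be healthy before the instances are started.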

Bottom line: this incident clearly indicates the configuration of the interconnect plays a crucial role in RAC. Should I even make this statement? :-)

P.S.: For the dirty details of setting up NIC bonding in Linux, check out this excellent link by Vivek.
