Arul’s Oracle Zone

~my references and inferences of Oracle

Archive for November, 2007

Weird RAC issue due to an incorrect setup

Posted by Arul Ramachandran on November 18, 2007

While setting up a 4-node Oracle 10.2.0.2 RAC on RHEL4, after installing Oracle Clusterware and the Oracle RAC binaries and creating a clustered database, I went ahead to open the instance on the first node and then the second node. The instance on the first node would come up without any problem, but the instance on the second node would take a while to come up. On further checking, I realized that the first instance was down by then: while the second instance was coming up, the first instance would crash. It all seemed weird, because both instances looked healthy on their own, yet merely bringing up one instance would crash the one that was already open.

The alert log had the following messages:

Interface type 1 eth0 192.168.10.0 configured from OCR for use as a cluster interconnect
Interface type 1 eth3 192.168.11.0 configured from OCR for use as a cluster interconnect
WARNING 192.168.11.0 could not be translated to a network address error 1
Interface type 1 eth2 1#.1##.145.0 configured from OCR for use as a public interface
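
As a rough illustration of how these messages can be checked mechanically, here is a minimal Python sketch that pairs each "cluster interconnect" line with any WARNING line for the same subnet. The function name and the regular expressions are my own assumptions, built only from the four lines shown above; real alert logs may word these messages differently.

```python
import re

# Sample lines copied from the alert log excerpt above.
ALERT_LOG = """\
Interface type 1 eth0 192.168.10.0 configured from OCR for use as a cluster interconnect
Interface type 1 eth3 192.168.11.0 configured from OCR for use as a cluster interconnect
WARNING 192.168.11.0 could not be translated to a network address error 1
Interface type 1 eth2 1#.1##.145.0 configured from OCR for use as a public interface
"""

# Patterns are assumptions derived from the excerpt, not an official log format.
INTERFACE_RE = re.compile(
    r"^Interface type \d+ (\S+) (\S+) configured from OCR for use as a (.+)$"
)
WARNING_RE = re.compile(
    r"^WARNING (\S+) could not be translated to a network address"
)


def suspect_interconnects(log_text):
    """Return (interface, subnet) pairs for interconnects that also have a WARNING."""
    interconnects = {}  # subnet -> interface name
    warned = set()      # subnets mentioned in WARNING lines
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        iface_match = INTERFACE_RE.match(line)
        if iface_match and iface_match.group(3) == "cluster interconnect":
            interconnects[iface_match.group(2)] = iface_match.group(1)
            continue
        warn_match = WARNING_RE.match(line)
        if warn_match:
            warned.add(warn_match.group(1))
    return [(iface, subnet) for subnet, iface in interconnects.items()
            if subnet in warned]


print(suspect_interconnects(ALERT_LOG))  # [('eth3', '192.168.11.0')]
```

Run against the excerpt, it flags only eth3 on 192.168.11.0, the second interconnect.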

I then checked whether both interconnects were healthy.

A ping command on the primary interconnect revealed it was fine.

$ping <Primary-NIC>
PING Primary-NIC (192.168.10.2) 56(84) bytes of data.
64 bytes from Primary-NIC (192.168.10.2): icmp_seq=0 ttl=64 time=0.035 ms
64 bytes from Primary-NIC (192.168.10.2): icmp_seq=1 ttl=64 time=0.045 ms

--- Primary-NIC ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.035/0.040/0.045/0.005 ms, pipe 2

A ping command on the secondary interconnect indicated there was a problem.

$ping <Secondary-NIC>
PING Secondary-NIC (192.168.11.2) 56(84) bytes of data.

--- Secondary-NIC ping statistics ---
28 packets transmitted, 0 received, 100% packet loss, time 27018ms
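
If you want to script this check across nodes, a minimal Python sketch along these lines could read the summary line of the ping output and compute the loss. The function name is hypothetical, and the pattern assumes the Linux summary wording seen in the two outputs above.

```python
import re

# Matches the summary line printed by ping, e.g.
# "28 packets transmitted, 0 received, 100% packet loss, time 27018ms"
SUMMARY_RE = re.compile(r"(\d+) packets transmitted, (\d+) received")


def packet_loss_percent(ping_output):
    """Return the packet loss percentage computed from a ping summary line."""
    match = SUMMARY_RE.search(ping_output)
    if match is None:
        raise ValueError("no ping summary line found")
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        raise ValueError("ping reported zero packets transmitted")
    return 100.0 * (sent - received) / sent


primary = "2 packets transmitted, 2 received, 0% packet loss, time 999ms"
secondary = "28 packets transmitted, 0 received, 100% packet loss, time 27018ms"

print(packet_loss_percent(primary))    # 0.0
print(packet_loss_percent(secondary))  # 100.0
```

Any non-zero loss on a private interconnect is worth investigating before starting the instances.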

Once the SysAdmin fixed the secondary interconnect, both instances came up without any issues. The lesson: if you specify both a primary and a secondary interconnect during the install, both must be working. Specifying two interconnects does not by itself provide redundancy or the ability to fail over to the good interconnect if one fails.

The setup did not use NIC bonding; instead it used a primary and a secondary interconnect for the private interface (the secondary was intended for failover capability). I am confident both NICs were working during the install, since all the install prerequisites were met. But it is evident from the alert log messages that the second interconnect had some trouble afterwards.

Multiple interconnects allow Cache Fusion traffic to be distributed on all interconnects. However, if any one of the interconnects does not function, Oracle will assume the private network is down and will not open in cluster mode. Therefore it is highly recommended to use NIC bonding at the OS level for NIC failover and traffic sharing capability.

Bottom line: this incident clearly indicates the configuration of the interconnect plays a crucial role in RAC. Should I even make this statement? :-)

P.S: for the dirty details of setting up NIC bonding in Linux check out this excellent link by Vivek.

Posted in RAC

RAC Hangs due to small cache size on SYS.AUDSES$

Posted by Arul Ramachandran on November 15, 2007

If you run Oracle RAC on 10.2.0.2 or 9i, watch out for the following.

On certain occasions of heavy system activity, I have seen all instances in a RAC hang; even connecting as sysdba was terribly slow. It turned out this was because the sequence SYS.AUDSES$ has its cache set to the default value of 20. Metalink note 395314.1 lists some of the symptoms:

- Checkpoint not completing on all RAC nodes
- Waits expire on row cache enqueue lock dc_sequences
- RAC hangs due to QMON deadlocking

The note also gives the fix, which is to increase the cache size of the sequence to a large value:

alter sequence sys.audses$ cache 10000;

Apparently this is fixed in 10.2.0.3.
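
To get a feel for why the cache size matters, here is a back-of-the-envelope Python sketch. It assumes, as I understand it, that SYS.AUDSES$ hands out one value per new session and that each time a sequence's cache runs out its dictionary entry has to be updated (the dc_sequences activity in the symptoms above). The logon count is a made-up figure for illustration only.

```python
import math


def cache_refills(values_needed, cache_size):
    """Number of cache refills needed to hand out `values_needed` sequence values."""
    return math.ceil(values_needed / cache_size)


LOGONS = 100_000  # hypothetical number of new sessions, not a measured figure

for cache_size in (20, 10_000):
    print(cache_size, cache_refills(LOGONS, cache_size))
# 20 5000
# 10000 10
```

Under these assumptions, the default cache of 20 needs 500 times as many dictionary updates as a cache of 10000 for the same number of logons, and in RAC every one of those updates has to be coordinated across instances.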

Posted in RAC

ORA-04030

Posted by Arul Ramachandran on November 12, 2007

The ORA-04030 error was reported on an Oracle 9.2.0.6 RAC database running on Solaris 9 with VCS SFRAC. Being somewhat familiar with this error message, I started by looking at the PGA and ulimit related settings.

Before I delve any further, here is a bit of background:

oerr ora 4030
04030, 00000, “out of process memory when trying to allocate %s bytes (%s,%s)”
// *Cause: Operating system process private memory has been exhausted
// *Action:

Automatic PGA had been set on this database:

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
workarea_size_policy                 string      AUTO
pga_aggregate_target                 big integer 1073741824

A pga_aggregate_target of 1 GB on a server with 16 GB of memory seemed quite reasonable for the nature of the database operations.

ulimit settings were as follows:

ulimit -a
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) 8192
coredump(blocks) unlimited
nofiles(descriptors) 4096
vmemory(kbytes) unlimited

On further checking, I found that processes and sessions were set as follows:

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
sessions                             integer     2500
processes                            integer     700

These settings were odd. It turned out that another DBA had changed these values to test a change to VCS SFRAC, and the old values were not restored after the test.

On changing these parameters back to the old values, processes=1000 and sessions=1105, the problem disappeared.
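
One way to see why the changed values looked odd: as far as I recall from the 9i and 10g reference documentation, the default for sessions is derived as (1.1 * processes) + 5. The small Python sketch below applies that formula; the function name is mine, and the formula should be verified against the documentation for your release.

```python
def derived_sessions(processes):
    """Default SESSIONS derived from PROCESSES, assuming (1.1 * processes) + 5."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return (11 * processes) // 10 + 5


print(derived_sessions(1000))  # 1105
print(derived_sessions(700))   # 775
```

The restored pair (processes=1000, sessions=1105) matches the derived default exactly, whereas sessions=2500 with processes=700 is far above the 775 the formula would give.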

The moral of the story: when troubleshooting, keep an eye on the less obvious causes as well.

Posted in PGA