Oracle very kindly offered to run through a Proof of Concept installation of their GoldenGate replication product with us, and seeing as how we’re not prone to looking gift horses in their mouths, we’ve accepted!
Quiet Friday’s are a nice treat but something too often squandered. This past Friday I found myself left to my own devices as the rest of the DBA team were working from another location. This left me free to get a few things about running our environment straight in my own mind with the most notable being was how to use CommVault for database backup and recovery beyond simply copying RMAN output to tape.
I have access to a test server for playing about with such things so I created a test database with the sample schemas to use for my backup and recovery testing. This also presented an ideal opportunity to look at the capabilities of OEM in and around monitoring databases while something bad happens to them (ahead of my recovery tests) so I first needed to install the OEM agent and the relevant options (host, databases, listener, ASM). The OEM agent went in without too much fuss (just a few things needed editing regarding SUDO and the DBSNMP database user account) and OEM got down to its monitoring duties quickly. This was a little unexpected as yesterday I had an issue with OEM getting stuck in “status pending” for one of our database instances (in that case the EMCTL command CLEARSTATE fixed it) so I was half expecting to hit a problem today. Our OEM system is configured with groups like “Database Instance” set up to send monitoring emails to the DBA team so the addition of my test system into the appropriate groups did trigger email notifications as I brought the database up and down as part of my CommVault tests.
On the CommVault front, the agent went onto the server also with surprising ease once you know to how to properly configure access permissions for the agent on Linux systems (I’ve seen this trip up administrators in the past so I knew that there were additional steps to be performed on the host beyond just deploying the agent). Once the database had the correct media policy applied it was a simple matter to run a backup. The hard part about CommVault is remembering that, with regards to Oracle, it is essentially running an RMAN backup, so dealing with database issues like dropped objects and schemas (which is what I was testing) need to be dealt with via a Point in Time recovery in order to be successful. Also, whenever CommVault throws an error, it’s best to analyse the associated RMAN log for the cause as the CommVault GUI (version 9 anyway) doesn’t reveal too much in the line of detail.
Last Friday thankfully wasn’t squandered and my initial foray into deploying OEM and CommVault now opens the door to some further adventures; I am particularly interested in deploying both of those systems to SQL Server (especially OEM!).
// Data Guard // Database Incarnations // Standby Logs applying in Alert Log but not in V$ARCHIVED_LOG
Over the past week I’ve been getting a great introduction to the practical workings of Data Guard. In the past I’ve worked a lot with Disaster Recovery systems built using Oracle standard edition and therefore not licensed to use Data Guard and in those circumstances a poor man’s version of the system was put in place by using RSYNC to synchronise archive logs between a production database and a DR database. With the logs in place, a script run on a schedule would recover the data from the logs. It turns out that the concept of Data Guard is pretty similar in that it’s basically about getting archive logs to the right place and setting the destination database into a mode where it can read those logs and be ready for when disaster strikes.
Using Data Guard in a production setting makes for interesting times as you get into the hows and whys of setting up the Data Guard source and destinations and then on into considerations like physical vs. logical DR databases, as well as the debate about the correct terms to use (e.g. is the DR database an “active standby” or “active data guard” system?).
Over the past couple of days I’ve been testing a procedure for switching between the production system and the DR system. The environment is made up of two geographic locations (actually situated in two different cities, which is just like something from the official documentation). On both sites there are two RAC nodes, but only on the production site are both nodes up and running – the standby is open (in read only) as a single instance, or more accurately, one node of a two-node RAC.
The procedure for the switch over is straight forward enough but during testing I did encounter something unusual with one of the scripts used to ensure archive logs are being applied to the standby. Here’s the query that caused the problem:
SELECT SEQUENCE#, TO_CHAR(FIRST_TIME, 'DD-MON-YY HH24:MI:SS') FIRST_TIME, TO_CHAR(NEXT_TIME, 'DD-MON-YY HH24:MI:SS') NEXT_TIME, APPLIED FROM V$ARCHIVED_LOG WHERE THREAD#=1 ORDER BY SEQUENCE#;
This query returns details of log application coming from thread #1, i.e. the first node of the source RAC – changing the thread number to 2 gets details of the logs from the other node. The end of the output of that query is:
SEQUENCE# FIRST_TIME NEXT_TIME APPLIED ---------- ------------------ ------------------ --------- 48192 08-JUL-13 22:33:16 09-JUL-13 00:30:17 YES 48193 09-JUL-13 00:30:17 09-JUL-13 00:56:27 NO 48193 09-JUL-13 00:30:17 09-JUL-13 00:56:27 NO 48193 09-JUL-13 00:30:17 09-JUL-13 00:56:27 NO 48193 09-JUL-13 00:30:17 09-JUL-13 00:56:27 YES
Never mind the Yes’s and No’s for now and instead focus your attention on the NEXT_TIME date, the 9th July 2013…. Today is the 24th of July! Where are today’s logs? Strangely, the alert log for the standby database is showing that everything is OK and that recovery of logs is proceeding as expected. It turns out that the problem is down to the Incarnation of the database and a misplaced ORDER BY clause.
For more on database incarnations you really should check out the official documentation, particularly the neat diagram that really explains what’s happening when you issue a RESETLOGS command, you can find all that here: http://docs.oracle.com/cd/E11882_01/backup.112/e10642/rcmrvcon.htm#BRADV1171
In my case, the incarnation had moved on (as a result of something we were trying out last week) and so log sequence numbers are now being reused. The query above orders its results by the SEQUENCE# and so the highest sequence number displays last, but those high log sequence numbers were generated by a previous incarnation of the database and so are not current. A simple change to the query reveals the truth of the situation:
SELECT SEQUENCE#, TO_CHAR(FIRST_TIME, 'DD-MON-YY HH24:MI:SS') FIRST_TIME, TO_CHAR(NEXT_TIME, 'DD-MON-YY HH24:MI:SS') NEXT_TIME, APPLIED FROM V$ARCHIVED_LOG WHERE THREAD#=1;
With the ORDER BY removed, the results of the query look a lot more reassuring:
SEQUENCE# FIRST_TIME NEXT_TIME APPLIED ---------- ------------------ ------------------ --------- 3208 24-JUL-13 14:11:52 24-JUL-13 14:11:54 YES 3209 24-JUL-13 14:11:54 24-JUL-13 14:11:59 YES 3210 24-JUL-13 14:11:59 24-JUL-13 14:12:02 YES 3211 24-JUL-13 14:12:02 24-JUL-13 14:13:57 YES 3212 24-JUL-13 14:13:57 24-JUL-13 14:14:50 YES
Note the correct date and the low, low log sequence numbers!
Keeping track of the time, so to speak, when the database incarnation moves on is something that can catch any DBA but is one of those fun little advanced topics, along with Data Guard, that adds a little spice to the day.