Wednesday, August 22, 2007

static code analysis to find cursor leak

One problem I've been fighting for years is ORA-01000, maximum open cursors exceeded. It is always caused by a developer forgetting to close a cursor. Recently I found a static java code analysis tool called FindBugs, and I think I can solve the problem once and for all. FindBugs scans your java binary code and detects potential problems; it has built-in detectors for all sorts of issues such as infinite loops, guaranteed null pointer dereferences, etc. The nice thing about it is that it is easy to add your own detector. I was excited about this tool the first time I saw it, and two things came to mind immediately: a detector for bind variables and a detector for unclosed cursors. The latter is relatively easier; below is sample code I wrote for it, assuming the cursor is always opened through an openCursor method call and closed with a closeCursor method call. Note there are two problems with the code: if the code closes the cursor in another method, the detector gives a false alarm; and the close should be in a finally block, which the detector below doesn't enforce.

import edu.umd.cs.findbugs.*;
import java.util.*;

import org.apache.bcel.classfile.Code;


public class CursorDetector extends BytecodeScanningDetector {

    private BugReporter bugReporter;

    int registerOnStack;          // local-variable slot of the cursor just loaded
    boolean isCursorReferenced;   // previous opcode loaded a tracked cursor variable
    boolean openCursor;           // previous opcode was the openCursor() call
    Hashtable hs;                 // register number -> PC where the cursor was stored

    public CursorDetector(BugReporter bugReporter) {
        this.bugReporter = bugReporter;
        hs = new Hashtable();
    }

    public void visit(Code code) {
        registerOnStack = -1;
        isCursorReferenced = false;
        openCursor = false;
        hs.clear();
        super.visit(code);
        // any cursor still tracked at the end of the method was never closed
        if (!hs.isEmpty()) {
            Enumeration e = hs.keys();
            Object obj;
            while (e.hasMoreElements()) {
                obj = e.nextElement();
                bugReporter.reportBug(
                    new BugInstance("CURSOR_NOT_CLOSED", HIGH_PRIORITY)
                        .addClassAndMethod(this)
                        .addSourceLine(this, ((Integer) hs.get(obj)).intValue()));
            }
            hs.clear();
        }
    }

    public void sawOpcode(int seen) {
        if (isCursorReferenced) {
            // a tracked cursor was just loaded; is this opcode the closeCursor() call?
            checkIfCloseCursorIsCalled(seen);
            isCursorReferenced = false;
        } else if (seen == INVOKEVIRTUAL
                && "MyClass".equals(getClassConstantOperand())
                && "openCursor".equals(getNameConstantOperand())) {
            openCursor = true;
        } else if (openCursor) {
            // the opcode right after openCursor() should store the result locally
            saveToLocalVariable(seen);
            openCursor = false;
        } else if (seen == ALOAD || (seen >= ALOAD_0 && seen <= ALOAD_3)) {
            if (!hs.isEmpty()) {
                registerOnStack = (seen == ALOAD) ? getRegisterOperand() : seen - ALOAD_0;
                if (hs.containsKey(new Integer(registerOnStack)))
                    isCursorReferenced = true;
            }
        }
    }

    void checkIfCloseCursorIsCalled(int seen) {
        if (seen == INVOKEVIRTUAL
                && "MyClass".equals(getClassConstantOperand())
                && "closeCursor".equals(getNameConstantOperand()))
            hs.remove(new Integer(registerOnStack)); // close cursor
    }

    void saveToLocalVariable(int seen) {
        switch (seen) {
        case ASTORE_0:
            hs.put(new Integer(0), new Integer(getPC()));
            break;
        case ASTORE_1:
            hs.put(new Integer(1), new Integer(getPC()));
            break;
        case ASTORE_2:
            hs.put(new Integer(2), new Integer(getPC()));
            break;
        case ASTORE_3:
            hs.put(new Integer(3), new Integer(getPC()));
            break;
        case ASTORE:
            hs.put(new Integer(getRegisterOperand()), new Integer(getPC()));
            break;
        }
        // doesn't handle the case where the cursor is stored in a field (global variable)
    }

}
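For illustration, here is the kind of application code the detector is meant to flag. MyClass, openCursor, and closeCursor are stand-ins for whatever cursor wrapper your codebase actually uses (the same names the detector above is hard-coded against):

// stub matching the convention the detector assumes (hypothetical API)
class MyClass {
    Object openCursor() { return new Object(); }
    void closeCursor(Object cursor) { }
}

public class CursorCallers {
    // flagged: the cursor is stored in a local variable but never closed,
    // so its register is still in the detector's table at the end of the method
    void leaky(MyClass db) {
        Object cursor = db.openCursor();
        // ... fetch rows ...
    }

    // not flagged: the load of the tracked variable followed by the
    // closeCursor() call removes the register from the table
    void clean(MyClass db) {
        Object cursor = db.openCursor();
        try {
            // ... fetch rows ...
        } finally {
            db.closeCursor(cursor);
        }
    }
}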

ITL lock analysis

This is something I wrote years back. It was written for Oracle 8i, but the principles should still apply to 10g.

Introduction
To take advantage of the processing power of multiple CPUs in an SMP machine, today's applications should be parallelized, especially batch jobs. A batch application spawns multiple threads/processes, each using the same logic and code to deal with a different subset of the work set. This architecture fully utilizes the CPUs in the hardware box. At the same time, it introduces various contentions, which should be considered and monitored during the design and implementation phases. Among them, how to avoid locking is one of the top issues a designer/architect should think about. Oracle is known for its superb row-level transaction locking mechanism, which minimizes the locking scope and makes developers' lives much easier. However, given that an RDBMS is such complicated software, it understandably has many locking mechanisms other than the row-level transaction lock. Unfortunately, some of these locks operate at block level rather than row level, which means they lock more rows, so the application has more chances of being blocked and deadlocks are more likely to occur. The ITL lock is such a block-level lock. It is common in multi-threaded applications, though we rarely see it elsewhere. This article will discuss the cause of ITL locks, how to identify them, and how to avoid them.

1. What is ITL?
In Oracle's data block, there is a data structure called the transaction header. It consists of 24 bytes of control information followed by an array of one or more transaction slots.
This array is called the interested transaction list (ITL), and each slot may contain an interested transaction entry (ITE). The initial number of ITL slots is set by the INITRANS parameter of the segment.
According to Steve Adams' excellent website http://www.ixora.com.au, this is what the ITL looks like in a block dump:


Itl Xid Uba Flag Lck Scn/Fsc
0x01 xid: 0x0000.00f.00000005 uba: 0x0040000b.0000.12 C--- 0 scn 0x0000.000000ac
0x02 xid: 0x0002.003.00000027 uba: 0x00c00004.0019.01 --U- 1 fsc 0x0000.00021b7c
Each ITL entry/slot contains the following fields.

transaction id (8 bytes): comprised of the rollback segment number, the slot number in the transaction table of that rollback segment, and the number of times use of that transaction table has wrapped.

undo block address (8 bytes): the location of the undo for the most recent change to this block by this transaction. This is comprised of the DBA of the rollback segment block, the sequence number, and the record number for the change in that undo block.

flags (1 nibble):
---- = transaction is active, or committed pending cleanout
C--- = transaction has been committed and locks cleaned out
-B-- = this undo record contains the undo for this ITL entry
--U- = transaction committed (maybe long ago); SCN is an upper bound (both fast commit and delayed cleanout can set this flag)
---T = transaction was still active at block cleanout SCN

locks (3 nibbles): the number of row-level locks held in the block by this transaction.

SCN or free space credit (6 bytes): if the transaction has been cleaned out, this is the commit SCN or an upper bound thereof; otherwise the leading two bytes contain the free space credit for the transaction, that is, the number of bytes freed in the block by the transaction.


Every transaction that INSERTs/DELETEs/UPDATEs rows in a block needs exactly one ITL slot in that block. When the transaction commits, the ITL slot is released and can serve other transactions. When the transaction rolls back, if the ITL slot was newly allocated for it, the 24 bytes it takes are returned as free space; otherwise, just as with a committed transaction, the ITL slot becomes available to serve other transactions. If all existing ITL slots are in use and a transaction that doesn't yet have an ITL slot in this block wants to INSERT/DELETE/UPDATE rows in it, Oracle will try to find free space in the block to use as this transaction's ITL slot. If it fails to find 24 bytes of free space, a lock occurs: the session owning the transaction waits on one of the transactions holding ITL slots in the block. To make things worse, even if some of the other ITL-holding transactions commit or roll back in the meantime, the blocked session keeps waiting as long as the particular transaction it is waiting on hasn't committed or rolled back. Obviously, this lock can lead to deadlock. Some relevant facts:
a) Indexes won't suffer ITL shortage. Due to the structure of an index, when there is no space, the block splits. Interestingly, the default initrans for an index is 2; don't be fooled into thinking this indicates indexes are more prone to ITL contention. It is because an index has to keep an extra slot for the recursive transaction that splits the block.
b) Inserts are unlikely to cause contention. If a block doesn't have free space, it is taken off the freelist. Oracle can logically insert a record into any block with room, whereas it can't choose the block when the DML is a delete or update, whether or not that block has enough free space for a new ITL slot.
c) Deletes and updates can cause ITL locks, and among them update is by far the most likely culprit. Frequent updates can eat up the free space (10% of the block by default) and eventually leave Oracle unable to find room for a new ITL slot.

This is a simple example of how to produce an ITL lock/deadlock.
a) create table test_itl(a number) pctfree 0;
b) insert into test_itl select rownum from all_objects where rownum<2000;
commit;
c) select min(a) from test_itl where dbms_rowid.rowid_block_number(rowid)<>
(select dbms_rowid.rowid_block_number(rowid) from test_itl where a=1);
d) open a session, session 1, in sqlplus:
update test_itl
set a=1
where a=1;
e) open another session, session 2, in sqlplus (below, <v> stands for the value returned by the query in step c):
update test_itl
set a=<v>
where a=<v>;
f) go back to session 1:
update test_itl
set a=<v>+1
where a=<v>+1;
You'll see the session hang.
g) go back to session 2:
update test_itl
set a=2
where a=2;
You'll see the session hang. Then after a while (about 3 seconds), you'll find a new trace file generated under $ORACLE_UDUMP, indicating a deadlock has occurred. (With pctfree 0 the blocks are packed full, so no block has room for a second ITL slot: each session holds the only ITL slot in one block and then requests one in the other session's block.)
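If you'd rather script the experiment than juggle two sqlplus windows, the same four steps can be driven from JDBC. This is only a sketch: the connect string and credentials are placeholders, and the value v must be supplied from the query in step c:

import java.sql.*;

public class ItlDeadlockDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("oracle.jdbc.driver.OracleDriver");
        final String url = "jdbc:oracle:thin:@host:1521:sid"; // placeholder
        final int v = Integer.parseInt(args[0]);              // value from step c
        final Connection s1 = DriverManager.getConnection(url, "scott", "tiger");
        final Connection s2 = DriverManager.getConnection(url, "scott", "tiger");
        s1.setAutoCommit(false);
        s2.setAutoCommit(false);

        // steps d and e: each session takes the only ITL slot in a different block
        s1.createStatement().executeUpdate("update test_itl set a=1 where a=1");
        s2.createStatement().executeUpdate(
                "update test_itl set a=" + v + " where a=" + v);

        // step f: session 1 needs an ITL slot in session 2's block and hangs
        Thread t = new Thread() {
            public void run() {
                try {
                    s1.createStatement().executeUpdate(
                            "update test_itl set a=" + (v + 1) + " where a=" + (v + 1));
                } catch (SQLException e) {
                    System.out.println("session 1: " + e.getMessage()); // ORA-00060 lands here or below
                    try { s1.rollback(); } catch (SQLException ignore) { }
                }
            }
        };
        t.start();
        Thread.sleep(2000); // give session 1 time to block first

        // step g: session 2 needs an ITL slot in session 1's block: deadlock
        try {
            s2.createStatement().executeUpdate("update test_itl set a=2 where a=2");
        } catch (SQLException e) {
            System.out.println("session 2: " + e.getMessage());
            s2.rollback();
        }
        t.join();
    }
}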



2. How to identify ITL lock problem:
An ITL lock is an enqueue wait. Every time it occurs, it increments the enqueue statistics in Oracle's v$ views.
However, many things can cause enqueue waits, for instance row-level locks, so we can't tell whether the enqueues we see in the v$ views are really caused by ITL locks. Fortunately deadlock, the worst case of locking, can help us identify the problem: it stands to reason that if you have severe ITL contention, you'll see some deadlocks. A deadlock (ORA-00060) triggers Oracle to dump a trace file under $ORACLE_UDUMP, and at the beginning of the trace file you can see the offending SQL statement, like:
*** SESSION ID:(10.20) 2004-05-21 10:37:57.716
DEADLOCK DETECTED
Current SQL statement for this session:
UPDATE TEST_ITL SET A=:b1 WHERE A = :b1

As we discussed earlier, ITL locks tend to be caused by update statements.

Below the offending SQL statement, there is a deadlock graph, like:

Deadlock graph:
---------Blocker(s)-------- ---------Waiter(s)---------
Resource Name process session holds waits process session holds waits
TX-0010001c-000067a1 33 129 X 153 212 S
TX-003a0048-0000616e 153 212 X 94 170 S
TX-003c0038-00007f15 94 170 X 33 129 S

Now here, you should pay attention to a couple of things:
a) Resource Name: it usually starts with either TX or TM. Normally, a deadlock caused by a missing index on a foreign key waits on a TM lock; since we see TX here, we can rule out a foreign key problem.
b) Blocker(s) holds and Waiter(s) waits: we know that for a deadlock caused by row-level locks, both the hold and the wait are in exclusive mode ('X'). Here the hold mode is exclusive ('X') but the wait mode is share ('S'); therefore a deadlock caused by row-level locks is ruled out.
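Incidentally, you don't have to wait for a deadlock to spot this pattern: a session stuck on an ITL slot shows up in v$lock as a TX enqueue requested in share mode (request=4). A minimal polling sketch; the connect string and credentials are placeholders, and the query flags the other share-mode TX scenarios listed next as well:

import java.sql.*;

public class TxShareWaitMonitor {
    public static void main(String[] args) throws Exception {
        Class.forName("oracle.jdbc.driver.OracleDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@host:1521:sid", "system", "manager"); // placeholders
        Statement st = conn.createStatement();
        // request=4 is a share-mode request; for type TX that is the
        // signature of ITL waits (and of the scenarios discussed below)
        ResultSet rs = st.executeQuery(
                "select sid, id1, id2 from v$lock where type = 'TX' and request = 4");
        while (rs.next())
            System.out.println("sid " + rs.getInt(1) + " waits on TX-"
                    + Integer.toHexString(rs.getInt(2)) + "-"
                    + Integer.toHexString(rs.getInt(3)) + " in share mode");
        conn.close();
    }
}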

According to Metalink Note 62354.1, there are three scenarios that can cause this share-mode TX wait on an exclusive-mode TX hold:
a) ITL shortage;
b) waits due to unique or primary key constraint enforcement.
For example, if a session deleted a parent record but hasn't committed, another session trying to insert into the child table a new record referencing the deleted parent row has to wait for the first session to commit or roll back;
c) waits due to rows being covered by the same bitmap index fragment.

Actually, I believe there can be other causes of this lock scenario. For example, if several sessions are inserting into a table whose primary key is a sequence-generated id, while one session is in the middle of splitting an index branch block (which would be the rightmost branch block), other sessions that want to write to that branch block have to wait in TX share mode. But obviously that situation won't lead to deadlock. To confirm a deadlock really was caused by ITL shortage, we have to look further into the trace file.

If you search for "type: 17" in the trace file, the search sometimes fails, but sometimes you'll find a dump of the block on which the lock occurred, like:
SO: 63d9cb508, type: 17, owner: 640f6f0d0, pt: 0, flag: INIT/-/-/0x00
(buffer) (CR) PR: 0x62bd55878 FLG: 0x100400
lock rls: 0x0, class bit: 0x0
kcbbfbp: [BH: 0x63bdc7210, LINK: 0x63d9cb540]
where: kdswh05: kdsgrp, why: 0
buffer tsn: 18 rdba: 0x044074bc (17/29884)
scn: 0x0000.119d2bc2 seq: 0x00 flg: 0x00 tail: 0x2bc20600
frmt: 0x02 chkval: 0x0000 type: 0x06=trans data

Block header dump: 0x044074bc
Object id on Block? Y
seg/obj: 0xcb61 csc: 0x00.119d2bc2 itc: 11 flg: - typ: 1 - DATA
fsl: 0 fnx: 0x0 ver: 0x01

Itl Xid Uba Flag Lck Scn/Fsc
0x01 xid: 0x003c.045.00007dc8 uba: 0x0a888023.00e9.17 C--- 0 scn 0x0000.11920ffc
0x02 xid: 0x000f.04b.000067aa uba: 0x0104b523.0078.08 C--- 0 scn 0x0000.11920f9a
.....
0x0a xid: 0x000f.000.000067ed uba: 0x0104b6f9.0078.12 C--- 0 scn 0x0000.11921023
0x0b xid: 0x003c.054.00007dac uba: 0x0a8882ab.00e9.22 C--- 0 scn 0x0000.11921084

data_block_dump
===============
tsiz: 0x1ec8
hsiz: 0x64
pbl: 0x5e72a0134
bdba: 0x044074bc
flag=-----------
ntab=1
nrow=41
frre=-1
fsbo=0x64
fseo=0x6b
avsp=0x7
tosp=0x7
0xe:pti[0] nrow=41 offs=0
0x12:pri[0] offs=0x1e2f
0x14:pri[1] offs=0x1d89
...
0x5e:pri[38] offs=0x249
0x60:pri[39] offs=0x142
0x62:pri[40] offs=0x6b
block_row_dump:
tab 0, row 0, @0x1e2f
tl: 153 fb: --H-FL-- lb: 0x0 cc: 12
col 0: [ 5] c4 08 25 0d 3c
col 1: [ 4] 49 4e 53 54
col 2: [ 5] c4 06 32 5c 55
col 3: [ 1] 45
col 4: [27]
58 45 5f 49 4e 53 54 52 55 4d 45 4e 54 2e 4c 4f 4e 47 5f 44 45 53 43 5f 54
58 54
col 5: [23]
4d 41 4e 45 4c 45 20 42 41 59 20 56 45 4e 54 55 52 45 53 20 49 4e 43
col 6: [23]
4d 41 4e 45 4c 45 20 42 41 59 20 56 45 4e 54 55 52 45 53 20 49 4e 43
col 7: [ 7] 78 66 0c 0e 0c 21 27
col 8: [13] 42 45 20 49 4e 53 54 52 55 4d 45 4e 54
col 9: [10] 32 30 33 30 33 33 35 38 35 35
col 10: [13] 42 45 20 49 4e 53 54 52 55 4d 45 4e 54
col 11: [ 7] 78 67 06 0c 18 33 04
tab 0, row 1, @0x1d89
tl: 166 fb: --H-FL-- lb: 0x0 cc: 12
col 0: [ 5] c4 08 25 0d 45
col 1: [ 4] 49 4e 53 54
col 2: [ 5] c4 06 32 5c 32
col 3: [ 1] 45
col 4: [27]
58 45 5f 49 4e 53 54 52 55 4d 45 4e 54 2e 53 48 52 54 5f 44 45 53 43 5f 54
58 54
col 5: [30]
43 41 4c 4c 20 4e 41 53 44 41 51 20 31 30 30 20 20 20 4a 41 4e 20 30 32 37
20 2a 2a 2a 2a
col 6: [30]
43 41 4c 4c 20 4e 41 53 44 41 51 20 31 30 30 20 20 20 4a 41 4e 20 30 32 37
20 2a 2a 2a 2a
col 7: [ 7] 78 66 0c 0e 0c 21 27
col 8: [13] 42 45 20 49 4e 53 54 52 55 4d 45 4e 54
col 9: [ 9] 33 36 32 31 33 36 30 38 37
col 10: [13] 42 45 20 49 4e 53 54 52 55 4d 45 4e 54
col 11: [ 7] 78 67 03 04 09 2a 2e
tab 0, row 2, @0x1cfa
...
end_of_block_dump


Here it tells us there are 11 ITL slots (itc: 11) in this block, and the free space is 7 bytes (tosp=0x7). If a 12th transaction tries to insert/update/delete in this block, there is no space to allocate a new ITL slot; therefore, it is very likely an ITL shortage caused this deadlock. On the other hand, because this is a CR (consistent read) image of the block from the session that detected the deadlock, changes made to the block by transactions that had not yet committed when the dump occurred aren't reflected in it. It is possible to see lots of free space in the block dump while there is actually no available space in the current-mode block, mainly because updates have grown the rows considerably. That is also why the flag of each ITL slot here is 'C': they all look committed and reusable. As mentioned earlier, sometimes Oracle doesn't dump the offending block at all. In that case, we can dump the block on our own. If the offending SQL is an insert statement, we can almost rule out ITL shortage. If it is an update or delete, we can find out which row it intended to change even if the SQL uses bind variables, because the bind values appear in the cursor section of the dump file (search for "Cursor Dump"). Knowing the row, we can get the file number and block number by selecting dbms_rowid.rowid_relative_fno(rowid) and dbms_rowid.rowid_block_number(rowid) for that row. Then we dump the block to see whether there really is plenty of free space, like: alter system dump datafile <file#> block min <block#> block max <block#>;
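For example, given the key of the row the statement intended to change, a bit of JDBC along these lines finds the file and block and issues the dump (the table, predicate, and connect details are stand-ins):

import java.sql.*;

public class DumpBlockForRow {
    public static void main(String[] args) throws Exception {
        Class.forName("oracle.jdbc.driver.OracleDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@host:1521:sid", "system", "manager");
        // locate the block holding the row we believe was being updated
        PreparedStatement ps = conn.prepareStatement(
                "select dbms_rowid.rowid_relative_fno(rowid), "
              + "dbms_rowid.rowid_block_number(rowid) "
              + "from test_itl where a = ?");
        ps.setInt(1, Integer.parseInt(args[0])); // key value recovered from the bind section
        ResultSet rs = ps.executeQuery();
        if (rs.next()) {
            int fno = rs.getInt(1);
            int blk = rs.getInt(2);
            // writes the block dump to a trace file under user_dump_dest
            conn.createStatement().execute("alter system dump datafile " + fno
                    + " block min " + blk + " block max " + blk);
            System.out.println("dumped file " + fno + ", block " + blk);
        }
        conn.close();
    }
}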

Another way to identify an ITL lock/deadlock is to look at the wait event history in the dump file generated by the deadlock: if the session previously waited on an ITL enqueue, the deadlock is likely on ITL as well.

The above covered how to identify whether a deadlock was caused by ITL shortage. A DBA should also be proactive in gauging how likely an ITL problem is to occur: if something is close to breaking, the DBA should take action immediately rather than waiting for a lock/deadlock to happen. The idea is to get the average number of ITL slots in each block and the average free space, and compare them with initrans and the initial free space (which is block_size*pctfree):

select (8192-66-(2+avg_row_len)*(num_rows/blocks)-avg_space)/24 avg_itl_slots,
avg_space,
ini_trans,
8192*pct_free/100 ini_free_space
from user_tables order by avg_space;
(8192 here is the block size; substitute your own db_block_size.)

avg_itl_slots is effectively the high-water mark of concurrent transactions on a block. If you find a table with little average free space whose avg_itl_slots is much bigger than ini_trans, the table tends to cause, or has already caused, ITL contention. One caveat: if you don't set freelists correctly and populate the table in parallel, you'd see a very high avg_itl_slots; that is fairly safe and you shouldn't worry about it. Also note that this formula only applies when the table's statistics were generated by the analyze utility. If the statistics were generated by the dbms_stats package, the formula should be modified to (8192-66-(2+3+avg_row_len)*(num_rows/blocks)-avg_space)/24 (again with 8192 standing in for the block size), since dbms_stats doesn't include the 3-byte row header when calculating the average row length.



3. How to prevent ITL lock?

In the previous section, we discussed how a DBA can identify ITL locks. Here we'll talk about how to fix the problem, and how to prevent it in the first place, i.e., in the design and implementation phases.

a) To be perfectly safe, you can set initrans to the smaller of the number of possible concurrent processes accessing each block and the number of rows each block can possibly hold. For example, if you have a pretty fat table with 400 bytes per row (you can get this by querying avg_row_len from user_tables) and your block size is 8192, you don't need to set initrans to anything more than 20. Since 8i, you can change initrans on the fly. However, the new setting only takes effect for newly allocated blocks, i.e., the existing blocks of the table will continue to suffer ITL shortage. So you should:
alter table <table_name> initrans <n>;
alter table <table_name> move;
rebuild all the indexes of the table;
analyze the table/indexes appropriately.

The problem with this solution is that you may waste some space, and if the client decides to increase the application's parallel degree, they have to remember to alter initrans and rebuild the table/indexes again.

b) Increase pctfree. Oracle then has a better chance of finding free space in the block when a transaction asks for a new ITL slot. As with raising initrans, you need to rebuild the table and indexes after increasing pctfree, and this solution has the same drawbacks.

c) Commit more frequently. Say your application spawns 20 threads/processes to work on a table with 100K blocks, each block holding 100 rows on average. If each session commits after updating 100 rows, the chance of 20 sessions updating the same block at the same time is far lower than if each session commits every 1000 rows. There has already been a lot of discussion of how often to commit, from the perspectives of performance and data integrity. To me, it is up to your application: benchmark it. If the benchmark shows that 1000 rows/commit doesn't buy you anything over 100 rows/commit (which I think would be the case for most applications), barring the test result being skewed by ITL locks, why insist on 1000 rows/commit? The drawback of this solution is that there is no guarantee against ITL shortage.

d) Reduce the parallel degree of the application. Obviously, 30 concurrent sessions are less likely to get in each other's way in one block than 60 concurrent sessions. Again, as in the discussion of commit frequency, you should test during the implementation phase. If it turns out that 30 concurrent sessions are no slower than 60, and you are sure ITL locking is a non-factor in the test, why insist on 60 threads/processes?

e) Make each thread/process deal with different blocks. Normally, when assigning work sets, we only need to make sure they won't cause row-level locks. Ideally, the application should try to make each thread/process update different blocks; ITL locks would then be eliminated, and the application would be more OPS/RAC ready. Of course, basing an application on Oracle's proprietary block structure (rowid) is extreme from a software design perspective, and depending on the application, it is sometimes simply impossible. For instance, you can't do it if your application needs to update a parent table and its child table together: even if you somehow partition the work so that each parent-table block is updated by only one thread, how can you ensure the child rows of those parent rows, which are surely scattered across many different blocks, are also each updated by a single thread? Another idea is to partition the related tables, but that is complex and, for the same reason as threading by block, isn't feasible on many occasions.

4. Miscellaneous.
Does an Index Organized Table have the ITL shortage problem? The answer is yes. You might be confused, because we just claimed that indexes don't have this problem thanks to block splits, and an IOT is a table organized like an index, splitting blocks when there is no free space; how can it have an ITL problem? Well, if the DML on the IOT changes the primary key, it will incur a block split should there be no free space. But consider an IOT with five columns where the first four are the primary key: if you only update the fifth column, it is just like a normal update on a normal table, which can wait on ITL and hence can deadlock due to ITL shortage.

5. Conclusion.
This article showed how ITL shortage can cause locks, even deadlocks, how to eliminate it in design and implementation, and how to identify it in production.

Tuesday, August 21, 2007

closeWithKey in Oracle 10g

closeWithKey is Oracle's own method call for explicit statement caching. I think explicit caching itself is over-engineering: why do we need explicit caching when there is already implicit caching? Anyway, the company I work with uses it in legacy code, and it works fine, well, sort of; there are some bugs on the application side, mainly associated with not using bind variables. Both implicit and explicit caching mean you mark the statement as eligible to be closed, i.e., if more SQLs need caching (primarily because of not using bind variables) than the cache can hold, some statements are aged out based on an LRU algorithm. Unfortunately, our application code uses it in a pattern like:
p = conn.prepareStatementWithKey(key);
rs = p.executeQuery();
p.closeWithKey(key); // statement handed back to the cache while rs is still open

while (rs.next())
...

This is flawed in itself, because during the fetch phase, i.e., rs.next(), if more SQLs need to be cached than the cache size allows, p will be aged out and rs.next() will get a "closed statement" SQLException. If the fetch phase is short, such as a single row in the result set, and the while loop doesn't issue further SQL, the "closed statement" error is unlikely to occur.

But all this changed in the 10g JDBC driver. In 10g, if you call closeWithKey (I'm not sure about the plain close() call used with implicit caching, but it should behave the same), the statement is marked closed immediately; to reopen it you have to call prepareStatementWithKey again, though the second prepare doesn't incur even a soft parse. Therefore, the previous code now fails immediately with a "closed statement" exception.
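The fix on the application side is simply to finish fetching before handing the statement back to the cache. Below is a sketch of the reordered pattern against Oracle's JDBC extensions (oracle.jdbc.OracleConnection.prepareStatementWithKey and oracle.jdbc.OraclePreparedStatement.closeWithKey); the key name and SQL are made up, and the assumption that prepareStatementWithKey returns null on a cache miss should be verified against your driver version:

import java.sql.ResultSet;
import java.sql.SQLException;
import oracle.jdbc.OracleConnection;
import oracle.jdbc.OraclePreparedStatement;

public class ExplicitCacheUsage {
    // conn must have explicit statement caching enabled
    static void query(OracleConnection conn) throws SQLException {
        OraclePreparedStatement p = (OraclePreparedStatement)
                conn.prepareStatementWithKey("acctQuery"); // hypothetical key
        if (p == null) // cache miss on first use: prepare normally
            p = (OraclePreparedStatement)
                    conn.prepareStatement("select a from test_itl where a = ?");
        p.setInt(1, 1);
        ResultSet rs = p.executeQuery();
        while (rs.next()) {
            // ... consume the whole result set first ...
        }
        rs.close();
        p.closeWithKey("acctQuery"); // only now hand the statement back to the cache
    }
}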

Wednesday, August 8, 2007

Horizontal to vertical query trick used in application

I've known this trick since I started using relational databases. The trick is how to convert a row of columns, such as column1 column2 column3 ... columnN, to
column1
column2
column3
...
columnN

Only recently did I find the perfect place to use it in an application.

In the application, we store certain account types as a parameter in a parameter table. For example, select value from parameter_table where parameter_cd='ACCT_PLAN_TYPES' returns 'RRSP,401K,RESP,LIF'. The common use of this information is that the account types form an in-list condition in queries. Imagine how developers will use it: they will query the parameter table, parse the string, strip out the commas and spaces, store the values in an array or some other collection, and then test query results against the collection. Or they can use dynamic SQL and simply append the list as an in-list to the query. Neither way seems nice. It is my belief that a database should not only be the place to store data; it should also be a development platform, i.e., it should facilitate programming and make it easier. This kind of information should be stored in a separate table, so all the logic can be expressed as a standard join or subquery, such as:
acct_type in (select acct_type from acct_plan_type)
Instead of arguing with the other parties to move the information into a separate table, I simply created a view using the horizontal-to-vertical trick mentioned above:

CREATE OR REPLACE VIEW acct_plan_type_vw (
acct_plan_type )
AS
SELECT ltrim(rtrim(acct_plan_type)) acct_plan_type
FROM (SELECT (CASE
WHEN rn = 1 THEN
SUBSTR (acct_plan_type,1,
case
when INSTR (acct_plan_type, ',', 1, 1) >0 then
INSTR (acct_plan_type, ',', 1, 1)-1
else
length(acct_plan_type)
end
)
WHEN INSTR (acct_plan_type, ',', 1, rn - 1) > 0 THEN
SUBSTR (acct_plan_type, INSTR (acct_plan_type, ',', 1, rn - 1) + 1,
CASE
WHEN INSTR (acct_plan_type, ',', 1, rn) > 0 THEN
INSTR (acct_plan_type, ',', 1, rn) - INSTR (acct_plan_type,',',1,rn - 1)- 1
ELSE
LENGTH (acct_plan_type) - INSTR (acct_plan_type, ',', 1, rn - 1)
END
)
ELSE
NULL
END
) acct_plan_type
FROM (
SELECT parameter_val acct_plan_type FROM parameter_table
WHERE parameter_cd = 'ACCT_PLAN_TYPES'
) acct_plan_types,
(SELECT ROWNUM rn FROM parameter_table --only because parameter_table has more than 100 rows
WHERE ROWNUM < 100) aux)
WHERE acct_plan_type IS NOT NULL
/
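With the view in place, application code needs no string parsing at all; the account-type list behaves like a lookup table. A quick JDBC sketch of how it might be consumed (the account table and connect details are made up):

import java.sql.*;

public class AcctTypeFilter {
    public static void main(String[] args) throws Exception {
        Class.forName("oracle.jdbc.driver.OracleDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@host:1521:sid", "scott", "tiger"); // placeholders
        // the CSV parameter now drives a standard subquery: no string
        // parsing in the application, and no dynamic in-list
        Statement st = conn.createStatement();
        ResultSet rs = st.executeQuery(
                "select acct_id from account "  // made-up account table
              + "where acct_type in (select acct_plan_type from acct_plan_type_vw)");
        while (rs.next())
            System.out.println(rs.getString(1));
        conn.close();
    }
}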

a bug in Oracle instant client 10.1.0.2

If you are familiar with Oracle's internal representation of numbers, you'd know the last byte can't be 101 (0x65): for a positive number, (last_byte - 1) represents the last two digits of the number, and 101-1=100 isn't a two-digit value. (For a negative number, the trailing byte is 102.) Judging by a JDBC bug I recently encountered, it seems Oracle nevertheless accepts 101 as the last byte and treats it as 01; maybe it is just a protection against data corruption, or because Oracle knows it has some buggy drivers that could insert 101 where 01 was meant.

To test it, you can hack the block directly and see whether Oracle recognizes the value. sqlplus seems OK with it, i.e., it treats the byte as 01, so the last two digits are 00, which may look weird: instead of, say, "6.51", it shows "6.5100". If you use other tools like SQL Navigator, you might get some funny characters. Or, if you don't want to hack the data block, you can use the buggy Oracle instant client 10.1.0.2, which is where I found the bug; last time I checked, it wasn't on Metalink. In a java program, use a PreparedStatement to insert a value from (0.0001, 0.0002, 0.0003, ... 0.0009, -0.0001, -0.0002, ..., -0.0009) into a table. Then select dump() on the column and you'll see 101 as the last byte. If you query the column and use getDouble() to get the value from the result set, it gives you the right value. But, and this is how I stumbled on the bug, our application for some reason always uses getString() to get values, and I got an ArrayIndexOutOfBoundsException.

So there appear to be two bugs in the JDBC driver here: one is that it inserts a malformed, or non-standard, number into the database; the other is that the getString() function doesn't handle this non-standard number format. To bypass the problem, either change getString() to getDouble(), or upgrade to 10.1.0.5 or 10.2.0.3, where the insert is correct in the first place.
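For the record, here is roughly how I'd reproduce it, assuming a scratch table num_test(x number) already exists and the program runs against the buggy 10.1.0.2 instant client through the OCI driver (with a fixed driver, the dump shows a well-formed trailing byte and getString() behaves):

import java.sql.*;

public class NumberBugRepro {
    public static void main(String[] args) throws Exception {
        Class.forName("oracle.jdbc.driver.OracleDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:oci:@mydb", "scott", "tiger"); // placeholders
        PreparedStatement ins = conn.prepareStatement("insert into num_test values (?)");
        ins.setDouble(1, 0.0001); // one of the values that triggered the malformed encoding
        ins.executeUpdate();
        conn.commit();

        Statement st = conn.createStatement();
        ResultSet rs = st.executeQuery("select x, dump(x) from num_test");
        while (rs.next()) {
            System.out.println(rs.getDouble(1)); // getDouble returns the right value
            System.out.println(rs.getString(2)); // dump() shows 101 as the last byte
            // rs.getString(1) is what blew up with ArrayIndexOutOfBoundsException
        }
        conn.close();
    }
}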

Tuesday, August 7, 2007

ora-07445 and its dump

Recently, there was a core dump with ORA-07445. The error was intermittent; it didn't seem that I could reproduce it easily, and it was urgent to find a workaround so the system could continue to function. I searched on Meatlink (not a typo, that is what I call it) and didn't find relevant info. As usual, Oracle support wasn't helpful at all. To my surprise, the analyst even disputed whether it was a genuine bug after he saw the 7445. As usual, he asked for a test case. If I could reproduce it in a test case, then I'd probably know under what circumstances the bug strikes, and I'd already have a workaround. I didn't have a test case, but there was still the dump trace and the Oracle error stack, and they have the source code; yet they had nothing to tell me and wouldn't even admit it is a bug? What a joke.

But I did need to find a workaround for the problem, so I set out to study the dump trace to see what I could find. Here is an excerpt of the dump trace:
ORA-07445: exception encountered: core dump [lxhasc()+20] [SIGSEGV] [Address not mapped to object] [0x000000060] [] []
Current SQL statement for this session:
SELECT T.LOGIN_TOKEN_ID, T.TOKEN_EFFECTIVE_DT, S.SUSPENDED_FL FROM XE_LOGIN_TOKEN T, XE_LOGIN S WHERE (1=1) AND S.SESSION_START_DT = TRUNC(SYSDATE) AND S.IP_NM = :B2 AND S.LEGAL_ENTITY_ID = ( SELECT LEGAL_ENTITY_ID FROM XE_LEGAL_ENTITY WHERE USER_NM = :B3 AND PRIVILEGE_CD <> 16384 AND LENTITY_STATUS_CD <> 'IN' ) AND T.LOGIN_ID = S.LOGIN_ID AND T.TOKEN_TXT = :B1
----- PL/SQL Call Stack -----
object line object
handle number name
47b652a18 1093 package body abc.login_pg
47b65be78 1 anonymous block
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedmp()+744 CALL ksedst() 000000780 ? 10608E2AC ?
000000000 ? 10608ADA0 ?
106089B08 ? 10608A508 ?
ssexhd()+1240 CALL ksedmp() 000105C00 ? 105F0CD24 ?
105F0C000 ? 000105F0C ?
000105C00 ? 105F0CD24 ?
sigacthandler()+44 PTR_CALL 0000000000000000 105F0A000 ? 106091EF0 ?
105F06CEC ? 000105F0A ?
00000000B ? 000000067 ?
lxhasc()+20 PTR_CALL 0000000000000000 00000000B ? 106091EF0 ?
106091C10 ? 000000014 ?
FFFFFFFF7AA2BE58 ?
105F1C618 ?
ldxsnf()+76 FRM_LESS lxhasc() FFFFFFFF7ADABA70 ?
105F1C598 ? 000000007 ?
000000014 ?
FFFFFFFF7AA2BE58 ?
105F1C618 ?
ldxgpc()+56 CALL ldxsnf() 105F16F28 ?
FFFFFFFF7FFF4FD8 ?
FFFFFFFF7FFF4FD0 ?
FFFFFFFF7FFF4FC8 ?
FFFFFFFF7ADABA70 ?
FFFFFFFF7FFF4DA0 ?
evadrn()+320 CALL ldxgpc() 105F16F28 ? 105A3F5A0 ?
000000002 ?
FFFFFFFF7AA292E0 ?
FFFFFFFF7AA2A310 ?
000000248 ?
evaopn2()+360 PTR_CALL 0000000000000000 4703D80B8 ? 1039271C0 ?
FFFFFFFF7AA2A068 ?
FFFFFFFF7AA2A078 ?
105F16F28 ? 000000D98 ?



I searched the function names on the error stack in Meatlink and found that functions starting with ldx are likely related to NLS. In the problematic SQL, the only thing that seems even remotely related to NLS is the TRUNC(SYSDATE). evadrn called ldxgpc, and the error stack shows the arguments. The first one, 105F16F28, is a pointer; at that location the value is 0x100000000000008B. The second one, 105A3F5A0, is also a pointer, and it turns out to point at "DD..." in memory. The third one is apparently a number, 2. Could it be the length of the second argument? The TRUNC function can take two arguments, where the second argument is the format, and trunc(d) is the same as trunc(d, 'DD'). It is possible that Oracle converts every trunc(d) to trunc(d, 'DD') internally, in which case ldxgpc should be the function that performs the TRUNC. If so, where is the input date? If it is supposed to be the first argument, then the date is somehow corrupted, since 0x100000000000008B isn't a valid date. According to the time shown at the top of the dump trace, it was 2007-07-31 10:48:17; in Oracle's date format that is 07D7071F0A301100, and there are indeed several occurrences of this string in the dump, but none of them has anything to do with 0x100000000000008B. So, in summary, Oracle seems to lose track of the date when it performs the TRUNC. Therefore, I replaced TRUNC(SYSDATE) with TO_DATE(TO_CHAR(SYSDATE,'YYYYMMDD'),'YYYYMMDD'), and there hasn't been a crash since. Of course, it is still a scary situation, since I don't know the root cause of the problem or whether it will strike somewhere else.

Monday, August 6, 2007

thin jdbc driver vs. OCI jdbc driver

When possible, one should choose the thin driver over the OCI driver. The benefits are:
1) Using the thin driver, you'll never have a Dr. Watson access violation error (c0000005). One of the reasons java is better than C is that you don't need to worry about the annoying malloc and dealloc anymore; in other words, in a java app, you won't see crashes such as Dr. Watson: access violation, or coredump: signal 11 segmentation violation. I have a client who was using Oracle 9.2.0.5 and experiencing intermittent Tomcat mid-tier crashes. The Dr. Watson log clearly showed that the access violation leading to the crash happened in an OCI call; the Oracle function name was something like "kpuhhalo", which looks like a function to allocate memory. Oracle reviewed it, didn't find a problem in the OCI code, and said we'd better upgrade to 9.2.0.8 because that is the supported version. So they went to 9.2.0.8 and still got crashes; the new offending function was "kpuhhfre", which looks like the function to free memory. After spending a month convincing Oracle support that this was an Oracle bug and getting a developer to review the problem, the answer was something like "I can investigate the problem for you, but don't count on me; it'll take a long time to diagnose." In the meantime, I ran into a similar crash in a test environment. Because it was a test environment, I had more control over it, so I changed the code to use the thin JDBC driver. The job still didn't succeed, but there was no crash, and it gave me a clear error message: "ORA-01000: maximum open cursors exceeded". So what happened previously must have been the OCI driver hitting some anomaly because too many cursors were open: before the session reached max open cursors, where Oracle would report a clear error message, OCI died in a memory allocation/deallocation routine, because C code is fragile, especially at boundary conditions such as, in our case, approaching the maximum number of open cursors.
2) Configuration and deployment are much easier: just one jar file. Otherwise, you have to install either the whole Oracle client or the Oracle instant client.

One major problem with the thin JDBC driver used to be that it didn't support the standard stream JDBC interface for LOBs; you had to use the LOB locator. In the 10g JDBC driver, the problem is solved by the introduction of the SetBigStringTryClob property.
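A sketch of how the property is used with the 10g thin driver; the URL, credentials, and doc_table are placeholders:

import java.sql.*;
import java.util.Properties;

public class ClobViaThin {
    public static void main(String[] args) throws Exception {
        Class.forName("oracle.jdbc.driver.OracleDriver");
        Properties props = new Properties();
        props.put("user", "scott");
        props.put("password", "tiger");
        props.put("SetBigStringTryClob", "true"); // let setString bind large strings to CLOBs
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@host:1521:sid", props);

        StringBuffer big = new StringBuffer();
        for (int i = 0; i < 40000; i++) // build a string too big for a plain VARCHAR2 bind
            big.append('x');
        PreparedStatement ps = conn.prepareStatement(
                "insert into doc_table (id, body) values (?, ?)");
        ps.setInt(1, 1);
        ps.setString(2, big.toString()); // no LOB locator needed
        ps.executeUpdate();
        conn.commit();
        conn.close();
    }
}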

Since I haven't seen a formal report claiming OCI is faster than thin, I guess that even if OCI does have a performance edge, it is marginal. I'd trade a marginal performance gain for fewer mysterious crashes and easier configuration/deployment any day.