Tuesday, August 7, 2007

ora-07445 and its dump

Recently, there was a core dump on ora 7445. The error was intermittent, it didn’t seem that I can reproduce it easily. It was urgent to get a workaround so that the system could continue to function. Searched on meatlink ( not a typo, this is what I call it), didn’t relevant info. As usual, Oracle support wasn’t helpful at all. To my surprise, the guy even disputed if it was a genuine bug after he saw the 7445. As usual, he asked for test case. If I can reproduce it in test case, then I probably know in what circumstance the bug will likely strike, then I will have a workaround already. I don’t have test case, but still there is dump trace and Oracle error stack, and they have source code, and yet they have nothing to tell and they don’t even admit it is a bug? What a joke.

But I did need to find a workaround for the problem. So I set out to study the dump trace to see what I can find. Here is a excerpt of dump trace:
ORA-07445: exception encountered: core dump [lxhasc()+20] [SIGSEGV] [Address not mapped to object] [0x000000060] [] []
Current SQL statement for this session:
SELECT T.LOGIN_TOKEN_ID, T.TOKEN_EFFECTIVE_DT, S.SUSPENDED_FL FROM XE_LOGIN_TOKEN T, XE_LOGIN S WHERE (1=1) AND S.SESSION_START_DT = TRUNC(SYSDATE) AND S.IP_NM = :B2 AND S.LEGAL_ENTITY_ID = ( SELECT LEGAL_ENTITY_ID FROM XE_LEGAL_ENTITY WHERE USER_NM = :B3 AND PRIVILEGE_CD <> 16384 AND LENTITY_STATUS_CD <> 'IN' ) AND T.LOGIN_ID = S.LOGIN_ID AND T.TOKEN_TXT = :B1
----- PL/SQL Call Stack -----
object line object
handle number name
47b652a18 1093 package body abc.login_pg
47b65be78 1 anonymous block
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedmp()+744 CALL ksedst() 000000780 ? 10608E2AC ?
000000000 ? 10608ADA0 ?
106089B08 ? 10608A508 ?
ssexhd()+1240 CALL ksedmp() 000105C00 ? 105F0CD24 ?
105F0C000 ? 000105F0C ?
000105C00 ? 105F0CD24 ?
sigacthandler()+44 PTR_CALL 0000000000000000 105F0A000 ? 106091EF0 ?
105F06CEC ? 000105F0A ?
00000000B ? 000000067 ?
lxhasc()+20 PTR_CALL 0000000000000000 00000000B ? 106091EF0 ?
106091C10 ? 000000014 ?
FFFFFFFF7AA2BE58 ?
105F1C618 ?
ldxsnf()+76 FRM_LESS lxhasc() FFFFFFFF7ADABA70 ?
105F1C598 ? 000000007 ?
000000014 ?
FFFFFFFF7AA2BE58 ?
105F1C618 ?
ldxgpc()+56 CALL ldxsnf() 105F16F28 ?
FFFFFFFF7FFF4FD8 ?
FFFFFFFF7FFF4FD0 ?
FFFFFFFF7FFF4FC8 ?
FFFFFFFF7ADABA70 ?
FFFFFFFF7FFF4DA0 ?
evadrn()+320 CALL ldxgpc() 105F16F28 ? 105A3F5A0 ?
000000002 ?
FFFFFFFF7AA292E0 ?
FFFFFFFF7AA2A310 ?
000000248 ?
evaopn2()+360 PTR_CALL 0000000000000000 4703D80B8 ? 1039271C0 ?
FFFFFFFF7AA2A068 ?
FFFFFFFF7AA2A078 ?
105F16F28 ? 000000D98 ?



I searched the method names appeared on error stack in meatlink, found out that the functions starts with ldx is likely related to NLS. In the problematic sql, the only thing seems to be remotely related to NLS is the trunk(sysdate). Evadrn called ldxgpc and the error stack shows the argument. The first one, 105F16F28 , is pointer, in that location, it is 0x100000000000008B. The second one ,105A3F5A0, is also a pointer, and it turns out to be “DD…” in that memory location. The third one is apparently a number, 2. Could it be the length of the second parameter? The trunc function can take 2 arguments where the the second argument is the format. And trunc() is the same as trunc(, ‘DD’). It is possible that Oracle convert every trunc() to trunc(, ‘DD’) internally, then ldxgpc should be the one that perform the TRUNC. If that is the case, where is the input date? If it is supposed be the first argument, then the date is somehow corrupted, since 0x100000000000008B isn’t a valid date. According to the time of dump shown on the top of dump trace, it was 2007-07-31 10:48:17. In Oracle’s date format, it is 07D7071F0A301100, there are indeed several occurrence of this string in the dump, but has nothing to do with 0x100000000000008B. So, in summary, Oracle seems to lose track of the date when he perform the TRUNC. Therefore, I tried to replace TRUNC(sysdate) with to_date(to_char(sysdate,’yyyymmdd’),’yyyymmdd’). And there haven’t been crash since. Of course, it is still a scary situation since I don’t know what the root cause of the problem and if it will strike somewhere else.

No comments: