Quite recently we have moved ourdatabase from SQL 6.5 to SQL2k. Since then we are experiencing very rare but extremely weird problems -- typically they are look like at one point zero appears instead of valid value. So far I have caught 4 such cases.
Here is the configuration:
- access -- through DBLib (ntwdblib.dll v2000.80.2039.0)
- connection string is DBMSSOCN,address,port
- server/client is Win2003 Server SP1 (server and client are different boxes)
- 2-processor systems with HT enabled
Case #1:
we have a C appliation that recalculates some data and updates two different tables simultaneously with almost the same set of data. It looks like this:
// prepare three data structures
...
// send data to DB
swInsFxSumPlTable(pData, 3, pDbProc);
swInsFxTradeExpoTable(pData, 3, pDbProc);
every function looks like:
for ( usCount=0; usCount<3; usCount++ )
{
sprintf (
szQuery+strlen(szQuery),
"exec sp_swFxedrq_Add_fx_sum_pl \n"
"'%s', '%s', '%s', \n"
"%f, %f, %f, %f \n",
pData[usCount].szPcId,
pData[usCount].szCcyPairId,
pData[usCount].szCcyLen,
pData[usCount].rfNpvTodPl,
pData[usCount].rfMtdPl,
pData[usCount].rfNomPos,
pData[usCount].rfShaPipHdg
);
}
fReturn = imExecMultiQuery (pDbproc, szQuery);
(no buffer overruns happen). The only difference between them is the name of called procedure and set of parameters. Both stored procedures are simple inserts, both tables are truncated beforehand. imExecMultiQuery is equivalent to:
dbcancel( dbproc );
dbfcmd ( dbproc, "%s", pszQuery );
dbsqlexec ( dbproc )
while ( dbresults ( dbproc ) != NO_MORE_RESULTS ) dbcanquery(dbproc);
(error handling is omitted for clarity). Problem: rfNomPos is different in only one row in resulting tables. I could not believe my eyes, I have checked extract from transaction log:
"INSERT","pc_id","ccy_pair_id","ccy_code_len","npv_tod_pl","mtd_pl","nom_nop","sha_plp_hdg"
"INSERTED","CRINST"," FJD ","3","0","0","0","0"
"INSERT","pc_id","ccy_pair_id","npv_tod_pl","mtd_pl","nom_nop","sha_plp_hdg","usd_eq_nop"
"INSERTED","CRINST","FJD ","0","0","-1.38099e+008","0","-7.92688e+007"
Red-marked values should be the same, I simply can't see any way for them to be different (except of bugs in ntwdblib.dll or SQL2k). Table and stored procedure uses type float, C code uses DBFLT8. Repeated execution produces correct results.
Case #2:
we have SQR report running every day (this is script language for report generation). It uses ODBC to connect to DB. It simply reads the table and dumps all values to text file. One row has value 0 instead of correct value. Happened only once, repeated calls produced correct results. No comments, I simply dropped my jaw when I saw it. We have zillions of SQR scripts running every day, we never had such problem before. I have checked this specific row -- nobody was touching it for ages.
Bad thing that this happens randomly in various parts of our (quite big) system, and every time we find the 'magic' place after costly and time consuming reconciliation process. Every time (whether it is reading or writing) we have similar behavior -- one column's value was replaced by 0.
I do really appreciate any suggestions, because I ran out of ideas. I do suspect bug in ntwdblib.dll related to multithreading. It is impossible to reproduce it so far... I'll try to get most fresh ntwdblib.dll, if it won't help -- I'll try one shipped with SQL6.5.
Bye.
Sincerely yours, Michael.
P.S. Sorry for font size -- looks like forum is buggy too... It looks fine in editor
Update. Problem happened again :-( . Here is the excerpt from DB tranlog (altered for readability):
"fxed_sum_pl_report","CRINST|XAUD/DEM|7","INSERT","pc_id","ccy_pair_id","ccy_code_len","npv_tod_pl","mtd_pl","nom_nop","sha_plp_hdg"
"INSERTED","CRINST","XAUD/DEM","7","249051","2.64878e+006","0","0"
"fxed_sum_pl_report","CRINST| DEM |3", "INSERT","pc_id","ccy_pair_id","ccy_code_len","npv_tod_pl","mtd_pl","nom_nop","sha_plp_hdg"
"INSERTED","CRINST"," DEM ","3","0","0","-1.59396e+008","0"
"fxed_trade_expo_report","CRINST|AUD/DEM","INSERT","pc_id","ccy_pair_id","npv_tod_pl","mtd_pl","nom_nop","sha_plp_hdg","usd_eq_nop"
"INSERTED","CRINST","AUD/DEM","249051","2.64878e+006","0","0","0"
"fxed_trade_expo_report","CRINST|AUD ","UPDATE","nom_nop","usd_eq_nop"
"UPDATED FROM","4.22946e+008","3.25015e+008"
"UPDATED TO","5.40672e+008","4.14675e+008"
"fxed_trade_expo_report","CRINST|DEM ","INSERT","pc_id","ccy_pair_id","npv_tod_pl","mtd_pl","nom_nop","sha_plp_hdg","usd_eq_nop"
"INSERTED","CRINST","DEM ","0","0","-1.59396e+008","0","-1.02418e+008"
There is no update for "fxed_sum_pl_report","CRINST| AUD |3". It is simply not there. Program is supposed to do 3 updates to fxed_sum_pl_report table and then almost the same updates are going to fxed_trade_expo_report table. Thje same data is used for both calls. Each call looks like:
dbfcmd("exec sp_swFxedrq_Add_fx_sum_pl <par1>, ... \n"
"exec sp_swFxedrq_Add_fx_sum_pl <par1>, ... \n"
"exec sp_swFxedrq_Add_fx_sum_pl <par1>, ... \n"
);
dbsqlexec ( dbproc );
while ( dbresults ( dbproc ) != NO_MORE_RESULTS ) dbcanquery(dbproc);
Somehow, middle 'exec' call did not leave any trace in the transaction log... no errors or messages were produced. :-(
Here is the code of sp_swFxedrq_Add_fx_sum_pl:
/****************************************************************************/
create procedure sp_swFxedrq_Add_fx_sum_pl
@.pc_id char(6),
@.ccy_pair_id char(8),
@.ccy_code_len char(1),
@.npv_tod_pl float,
@.mtd_pl float,
@.nom_nop float,
@.sha_plp_hdg float
as
if (@.ccy_code_len = '7')
begin
select @.ccy_pair_id = 'X'+@.ccy_pair_id
end
else
begin
select @.ccy_pair_id = ' '+@.ccy_pair_id
end
if (( select count(*)
from fxed_sum_pl_report
where pc_id = @.pc_id
and ccy_pair_id = @.ccy_pair_id
and ccy_code_len = @.ccy_code_len ) > 0 )
begin
update fxed_sum_pl_report
set npv_tod_pl = npv_tod_pl + @.npv_tod_pl,
mtd_pl = mtd_pl + @.mtd_pl,
nom_nop = nom_nop + @.nom_nop,
sha_plp_hdg = sha_plp_hdg+ @.sha_plp_hdg
where pc_id = @.pc_id
and ccy_pair_id = @.ccy_pair_id
and ccy_code_len = @.ccy_code_len
end
else
begin
insert into fxed_sum_pl_report
values(
@.pc_id,
@.ccy_pair_id,
@.ccy_code_len,
@.npv_tod_pl,
@.mtd_pl,
@.nom_nop,
@.sha_plp_hdg
)
end
So, possible ways to get this behaviour:
1. DBLib somehow lost middle 'exec' call and did not transfer it to SQL server
2. select count(*) returned smth >0 (as it should be -- insert happened quite a while before), but update fxed_sum_pl_report failed to update
3. Due to mysterious error in sprintf every float argument was 0 and update did not produce any entry in tranlog. Which is strange because I have quite a lot of entries in tranlog that state: "UPDATE","NO CHANGES DETECTED"
4. tranlog extraction utility is buggy and generates ***.
Any ideas? I am really stuck.
I am going to downgrade ntwdblib.dll soon, we'll see if it will help.
Bye.
Sincerely yours, Michael.
No comments:
Post a Comment