Guys 'n Gals:
I'd appreciate it if some of you, anyone who knows about threads on
Linux, could give me a hand here...
I'm trying to do my first multi-threaded program. I've gotten myself
into this, at work, by talking up how neatly we could do this big
project I'm responsible for as a threaded app. This is on RH 6.2, BTW.
Now when I'm trying my first test scenario, a short demo program
wrapped around one of the library/toolkits we're going to need to use,
I'm having bizarre problems with the demo program which I cannot
figure out.
I can't really show you the program here because it's over 2000 lines,
including the toolkit, but I can give some excerpts and some GDB output.
If anyone wants to see more source than I can put here I'll mail it to
said person. Just let me know.
Anyway, the toolkit was not originally designed for use with threads, but
upon examination it seems pretty clean: no statics (except for rcsid,
which I assume will harm no one since it's unreferenced). Several
functions are declared static, but that shouldn't be a problem.
There is a lot of malloc/calloc going on, but in every case it is
allocating memory whose address is returned to the caller.
The calling routine (remember it's a demo, so it's basically one main()
function making multiple calls into the toolkit) has no statics, and
everything is pointers to structs which are allocated by the toolkit,
said struct pointers all being automatic variables.
Here's an example of the code, from one of the places it frequently
dies:
    void HL7savSeg (char *seg, int *segNamI, InitCntxt * ic, char * err_buf)
    {                           /* Save a Segment def. by Allen Rueter */
        int i;
        MSGterm *pTmpTerm;
        HL7SegRule *pSR;
        DB printf ("%s ", seg);
        pTmpTerm = (MSGterm *) malloc (sizeof (MSGterm));
        pTmpTerm->typ = 0;      /* set type to seg (not []{}()<>) */
        pSR = ic->pFlvr->p1SegR;
==>     while (strncmp (pSR->azNam, seg, HL7_SEG_NAME_LEN) != 0)
        {
            pSR = pSR->nxt;
            if (pSR == 0)
            {
                sprintf (err_buf, "HL7Init() SavSeg() couldn't find %s\n", seg);
                return;
            }
        }
        pTmpTerm->pSegR = pSR;  /* point to type of segment */
        pTmpTerm->nxt = 0;      /* current end of rule */
        if (ic->pCurRule->p1term == 0)          /* 1st rule? */
            ic->pCurRule->p1term = pTmpTerm;    /* save start up rule list */
        else
            ic->pCurTerm->nxt = pTmpTerm;       /* add to end of rule list */
        ic->pCurTerm = pTmpTerm;
        *segNamI = 0;
    }
It frequently segfaults down inside the strncmp call on the marked line
(see ==> above). Upon investigation I find that the value in pSR is not
the same as the value in ic->pFlvr->p1SegR even though it certainly
seems obvious (to me, at least) that it should be. When it takes one of
those segfaults at this point, the stack trace from gdb looks like:
Program received signal SIGSEGV, Segmentation fault.
strncmp (s1=0x29 <Address 0x29 out of bounds>, s2=0xbf5fccdc "PD1", n=3)
at ../sysdeps/generic/strncmp.c:64
64 ../sysdeps/generic/strncmp.c: No such file or directory.
(gdb) where
#0 strncmp (s1=0x29 <Address 0x29 out of bounds>, s2=0xbf5fccdc "PD1", n=3)
at ../sysdeps/generic/strncmp.c:64
#1 0x804976f in HL7savSeg (seg=0xbf5fccdc "PD1", segNamI=0xbf5fccd8,
ic=0xbf5fcba8, err_buf=0xbf5fd144 "") at HS_hl7api.nomutexes.c:240
#2 0x804ad7c in HL7Init (tblPath=0x8090c59 "", tblVrsn=0x8090c54 ".v21",
err_buf=0xbf5fd144 "") at HS_hl7api.nomutexes.c:559
#3 0x80482bd in thrfunc (foo=0x0) at hl7demo3.c:128
#4 0x8051016 in pthread_start_thread (arg=0xbf5ffe40) at manager.c:241
(gdb)
Note that the first parameter of the strncmp is (in this case) 0x29.
In other cases it has been 0x41, 0x31, or 0x19, but it always seems to
be something totally bogus.
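One thing I do notice in the excerpt: the loop dereferences pSR before it
tests it against 0, so an empty rule list would also fault on the marked
line. Below is a minimal sketch of the same search with the test moved
ahead of the dereference. SegRule and find_seg are simplified stand-ins
made up for illustration, not the toolkit's real definitions, and
SEG_NAME_LEN is set to 3 only because the trace shows n=3.

```c
#include <stddef.h>
#include <string.h>

#define SEG_NAME_LEN 3  /* stand-in for HL7_SEG_NAME_LEN; value assumed from n=3 */

/* Simplified stand-in for HL7SegRule: only the two fields the loop uses. */
typedef struct SegRule {
    char azNam[SEG_NAME_LEN + 1];
    struct SegRule *nxt;
} SegRule;

/* Walk the list and return the rule whose name matches seg, or NULL.
 * The pointer is tested before every dereference, so an empty list
 * (head == NULL) is reported as "not found" instead of crashing. */
SegRule *find_seg(SegRule *head, const char *seg)
{
    SegRule *p;

    for (p = head; p != NULL; p = p->nxt) {
        if (strncmp(p->azNam, seg, SEG_NAME_LEN) == 0)
            return p;
    }
    return NULL;
}
```

This only guards against a NULL pointer. It would not catch a non-NULL
garbage value like 0x29, so it cannot be the whole story here.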
Other times it hangs, apparently forever (my patience wears thin after one
or two minutes). If I press ^C while in gdb, it then shows a stack trace
like:
Program received signal SIGINT, Interrupt.
0x80550e0 in __sigsuspend (set=0xbffff898)
at ../sysdeps/unix/sysv/linux/sigsuspend.c:48
48 ../sysdeps/unix/sysv/linux/sigsuspend.c: No such file or directory.
(gdb) where
#0 0x80550e0 in __sigsuspend (set=0xbffff898)
at ../sysdeps/unix/sysv/linux/sigsuspend.c:48
#1 0x8051d03 in __pthread_wait_for_restart_signal (self=0x809ee40)
at pthread.c:785
#2 0x80501cb in pthread_join (thread_id=2051, thread_return=0xbffff9f4)
at restart.h:26
#3 0x80481ed in main () at hl7demo3.c:74
(gdb)
According to the visible output, one thread is still near its beginning,
but this gdb output seems to imply that one of them is already being
joined. I haven't a clue why it would be hung here.
Other times it appears to run fine, at least no SIGSEGVs and no hangs.
Note that this example program starts only two threads. If I change
it to start only one thread it seems to work reliably. The more threads
the more often it croaks.
Also note that this toolkit is NOT new code, it has been running for
five or more years in threadless environments. While I don't accuse it
of being bug-free, it certainly has never exhibited problems like this
until I started calling it from 2 threads.
I've tried, in the demo program, wrapping every call into the toolkit
with a mutex lock/unlock. Makes no difference at all.
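To show what I mean by wrapping, here is a minimal sketch of the pattern.
It is not the real demo code: toolkit_call(), shared_count, and
run_demo() are hypothetical stand-ins, and the counter simply plays the
role of any state the toolkit might share between callers.

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 2
#define CALLS_PER_THREAD 100000

static pthread_mutex_t toolkit_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_count = 0;   /* stand-in for state the toolkit might share */

/* Hypothetical stand-in for a toolkit entry point that is not thread-safe. */
static void toolkit_call(void)
{
    shared_count++;
}

/* Every call into the "toolkit" goes through this wrapper. */
static void locked_toolkit_call(void)
{
    pthread_mutex_lock(&toolkit_lock);
    toolkit_call();
    pthread_mutex_unlock(&toolkit_lock);
}

static void *worker(void *arg)
{
    int i;

    (void)arg;
    for (i = 0; i < CALLS_PER_THREAD; i++)
        locked_toolkit_call();
    return NULL;
}

/* Start NTHREADS workers, join every thread that was started, and return
 * the final count, or -1 if a thread could not be created or joined. */
long run_demo(void)
{
    pthread_t tid[NTHREADS];
    int i, started = 0, failed = 0;

    shared_count = 0;
    for (i = 0; i < NTHREADS; i++) {
        if (pthread_create(&tid[i], NULL, worker, NULL) != 0) {
            failed = 1;
            break;
        }
        started++;
    }
    for (i = 0; i < started; i++) {
        if (pthread_join(tid[i], NULL) != 0)
            failed = 1;
    }
    return failed ? -1 : shared_count;
}
```

With the lock in place, run_demo() should return NTHREADS *
CALLS_PER_THREAD. A wrapper like this serializes the calls themselves; it
does nothing for data that is touched outside the wrapped calls.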
It's a bit harder to put a mutex inside all the entry points of the
toolkit because some of the entry points are also called by other routines,
so we end up needing recursive mutexes. I've tried it, but have not gotten
it to work at all.
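For anyone unfamiliar with the idea, this is a minimal sketch of a
recursive mutex guarding nested entry points. The names api_lock,
outer_entry(), and inner_entry() are hypothetical, not from the toolkit,
and the sketch assumes the POSIX spelling PTHREAD_MUTEX_RECURSIVE is
available; older LinuxThreads releases may use PTHREAD_MUTEX_RECURSIVE_NP
instead, which I have not verified.

```c
#define _XOPEN_SOURCE 700   /* ask <pthread.h> for PTHREAD_MUTEX_RECURSIVE */
#include <pthread.h>

static pthread_mutex_t api_lock;
static int depth_seen = 0;  /* counts how many times an entry point ran */

/* Initialize api_lock as a recursive mutex. Returns 0 on success. */
int api_lock_init(void)
{
    pthread_mutexattr_t attr;
    int rc;

    rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (rc == 0)
        rc = pthread_mutex_init(&api_lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Hypothetical entry point: callable directly or from outer_entry(). */
int inner_entry(void)
{
    int rc = pthread_mutex_lock(&api_lock);

    if (rc != 0)
        return rc;
    depth_seen++;
    return pthread_mutex_unlock(&api_lock);
}

/* Hypothetical entry point that calls another entry point while locked. */
int outer_entry(void)
{
    int rc = pthread_mutex_lock(&api_lock);

    if (rc != 0)
        return rc;
    depth_seen++;
    rc = inner_entry();     /* locks api_lock again from the same thread */
    pthread_mutex_unlock(&api_lock);
    return rc;
}
```

The nested lock in inner_entry() succeeds only because the mutex is
recursive; a plain non-recursive mutex would typically deadlock at that
point. Each lock must be matched by its own unlock before another thread
can get in.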
Right now I've removed all mutexes because I can't see anything in any
of the code that should require them, and am trying to understand what
is going wrong. My head is getting sore and the hole in the wall is
getting deep.
Can someone help enlighten me?
Thanks!
Fred
--
---- Fred Smith -- [EMAIL PROTECTED] ----------------------------
"Not everyone who says to me, 'Lord, Lord,' will enter the kingdom of
heaven, but only he who does the will of my Father who is in heaven."
------------------------------ Matthew 7:21 (niv) -----------------------------