If I run a test script enough times, it eventually freezes in this deadlock situation:
The client sends a command to a backend and waits for an answer. It waits forever, because the backend is not aware that the request has arrived and is itself waiting for the next command.

What happens in the loop is:

  SIInsertDataEntry: table is 70% full, signaling postmaster

In reaction, the postmaster sends to its children:

  SignalChildren: sending signal 31 to process <pid>

Most of the time this works, but at an unpredictable iteration it freezes.

The problem first appeared in a replication setup, so I reduced the number of components involved to get a simpler test case: a pgtcl script running a loop with:

  create table from another-table
  copy table to file
  drop table

The 'create table' regularly fires the '70% full' event, and at some point the 'copy' never gets answered.

I attached these files:

- test.tcl: the script to run. Change these values to match your environment:

    set srctable pgr_qryengine_log
    set dbname euronetUsers

  The source table can be any empty table. In my case, it is:

    CREATE TABLE public.pgr_qryengine_log (
      pgr_sid int4 NOT NULL,
      tablename varchar(50),
      pgr_gfid int8 NOT NULL,
      pgr_grid int8 NOT NULL,
      pgr_optype varchar(2),
      pgr_when timestamp,
      pgr_username varchar(30),
      qry_result text
    ) WITH OIDS;

- postmaster-ok.log: the traces of a successful iteration.

- postmaster-ko.log: the traces of the iteration that waits forever. The EOF is received on a Ctrl-C on the client side.

Comparing the traces shows that the signals are processed, but the backend doesn't start a StartTransactionCommand for the expected 'copy'.

I don't know the exact conditions under which the freeze occurs. I only noticed that it is more likely when many postgres.exe processes are alive. I could run 10000 iterations without any extra backends. So I opened a pgAdmin III session to get many connections (on multiple databases, with different accounts). With 7 to 10 processes, I reached the freeze after 3392, 2027, 6729, 272, and 1871 runs.

I tried to strace the postmaster, but never managed to reproduce the problem that way. I guess strace slows the system down too much. I only have an strace of a correct iteration.

Tested on:

- postgres 7.3.5, W2000 SP2, cygwin 1.5.5-1
- postgres 7.3.5, NT SP6, cygwin 1.5.7-1

I can't tell whether the source of the problem is in cygwin or in postgres, so I am posting to both lists. It would be helpful if anybody could reproduce the problem or offer advice on how to progress with the debugging.

Patrick
Attachments: test.tcl, postmaster-ok.log, postmaster-ko.log
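
For readers without the attachment, the loop described above (create table, copy to file, drop table) can be roughly sketched as follows. This is not the attached test.tcl: it is written in Python rather than pgtcl, and the exact SQL text, the work table name `pgr_test_copy`, the output path, and the `execute` callable are all assumptions made for illustration. `execute` stands in for whatever client call sends one statement to the backend, and is assumed to raise `TimeoutError` when no answer arrives.

```python
from typing import Callable, List


def iteration_statements(src_table: str, work_table: str, out_file: str) -> List[str]:
    """Return the three SQL statements of one loop iteration, in order.

    The statement text is an assumption; the post only names the steps.
    """
    return [
        f"CREATE TABLE {work_table} AS SELECT * FROM {src_table}",
        f"COPY {work_table} TO '{out_file}'",
        f"DROP TABLE {work_table}",
    ]


def run_loop(execute: Callable[[str], None], runs: int,
             src_table: str = "pgr_qryengine_log",
             work_table: str = "pgr_test_copy",               # hypothetical name
             out_file: str = "/tmp/pgr_test_copy.out") -> int:  # hypothetical path
    """Run the loop and return how many iterations completed.

    `execute` is a hypothetical callable that sends one statement and
    raises TimeoutError if the backend never answers.
    """
    completed = 0
    for _ in range(runs):
        try:
            for sql in iteration_statements(src_table, work_table, out_file):
                execute(sql)
        except TimeoutError:
            # The backend stopped answering: report the iteration count so far.
            break
        completed += 1
    return completed
```

Returning the count of completed iterations makes it easy to record at which run the freeze was reached, as in the figures quoted above, instead of leaving the client blocked indefinitely.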