Re: prblems rebuilding conversations db

Michael Menge Mon, 21 Jan 2019 05:35:32 -0800

Hi Ellie

Quoting ellie timoney <el...@fastmail.com>:

Hi Michael,

On Wed, Jan 16, 2019, at 10:12 PM, Michael Menge wrote:

Hi,

because conversations db seems to be required for search indexes, I
enabled this option
on our production servers today and tried to rebuild the conversations
db for all users with

ctl_conversationsdb -v -b UID

For most users this did take less than a second. But for some users
this process would
not finish. I did kill one process after about 30 Minutes (most others
after 3 Minutes).
The UserID.conversations has grown over 2 GB (the mailbox itself has
only ~700 MB of mails,
and the conversations files from finnished rebuilds are less then 20 MB)

So, a 2GB-and-growing conversations db for a 700MB mailbox is aratio of ~3:1, which seems obscene...


thanks for confirming that this is not expected.

For comparison, the Conversations.append_reply_200 cassandane test(two messages with 100 replies each) produces an 87Kb conversationsdb for a 1016Kb mailbox (ratio ~1:12, n.b inverted!), which makesyour ~3:1 and growing ratio seem even more obscene...
Obviously you can't share the bad conversations db becauseconversations db's are full of personally-identifying information.But I wonder if looking through it with `cyr_dbtool ... show` turnsup any unusual patterns, especially in comparison to one of the goodusers? I wonder if it's somehow repeating itself?


I was unable to run "cyr_dbtool ... show" or "ctl_conversationsdb -d"
while ctl_conversationsdb -v -b was still running
(waiting for a lock on conversations db file)

After aborting "ctl_conversationsdb -b" "cyr_dbtool show" didn't show any
duplicates. But the string command showed more occurrences than cyr_dbtool

did (i used an hash value (d8087b76083ec5e7) I found in the output ofcyr_dbtool

# cyr_dbtool /path-to/userid.conversations skiplist show | grepd8087b76083ec5e7 | wc -l

# cyr_dbtool /path-to/userid.conversations skiplist show | grepBd8087b76083ec5e7 | wc -l

1
# strings /path-to/userid.conversations | grep d8087b76083ec5e7 | wc -l
183
# strings /path-to/userid.conversations | grep Bd8087b76083ec5e7 | wc -l
91

ctl_conversationsdb -d did a cleanup because I did abort the"ctl_conversationsdb -b" process

skiplist recovery /path-to/userid.conversations: found partial txn,not replayingskiplist: recovered /path-to/userid.conversations (0 records, 144bytes) in 0 secondsskiplist: checkpointed /path-to/userid.conversations (0 records, 144bytes) in 0.010 sec



I did run ctl_conversationsd -b with strace and discovered that the inbox
(cyrus.index O_RDWR, cyrus.header O_RDONLY, cache  cyrus.O_RDWR ) of that user
was opened over and over again, no other folder for that user was accessed.

For other users the inbox was only opened once and then the otherfolders followed.So not stopping "ctl_conversationsd -b", it would have run till myfilesystem was

full.

Bron, does this sound like anything you've seen before, that maybegot fixed on master but not backported to 3.0?
Cyrus Logs show many "ctl_conversationsdb[933]: mailbox: longlock
user.UserID for 1.5 seconds"
every few seconds.
This message is logged when the lock is released, if that lock hadbeen held for >1s. It's reporting that it held the lock for longerthan it would like to have, but at least you know the lock has beenreleased. I think this is telling you that user-visible performanceon the mailbox would have been impacted while this was occurring(because imapd/lmtpd wouldn't be able to access their mailbox whilectl_conversationsdb was holding those locks), but also tells youthat ctl_conversationsdb was doing the right thing, and not justholding a single lock for the entire job.


I am not sure if any other processes had a chance to acquire a lock.

Though, that raises the question: was the user accessing the mailboxwhile the rebuild was in progress? I wonder if their client isdoing something funny and tripping things up?


We did this while the server was accessible (didn't want to have
several hour downtime for ~44000 users)

But as I did my recent testing on a test-server without user access,
only ctl_conversationsdb and cyr_dbtool did accessed the mailbox during
the conversation rebuild mentioned in this mail. So I don't think that
user access is part of the problem.


So for me it looks like we have two problems:

1. multiple entries for the same key in skiplist db files.
   As skiplist is a Key/Value store this should not be possible
   to have duplicate keys?

   Is twoskip a better alternative?

2. Why didn't ctl_conversationsdb continue to process
   the next mailbox but re-did the INBOX / the last mailbox


I will try to run a debugger on ctl_conversationsdb and
will test with twoskip as conversations db format

I did notice one other problem, recreating the conversations db
on the replica confused the sync protocol as the rebuild did increase
the modseq.

Many logs like "record mismatch with replica: %s more recent onreplica" on the sync client

and "higher modseq on replica" on the sync server

is this intended?

Regards

   Michael Menge





--------------------------------------------------------------------------------
M.Menge                                Tel.: (49) 7071/29-70316
Universität Tübingen                   Fax.: (49) 7071/29-5912

Zentrum für Datenverarbeitung mail:michael.me...@zdv.uni-tuebingen.de

Wächterstraße 76
72074 Tübingen

----
Cyrus Home Page: http://www.cyrusimap.org/
List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/
To Unsubscribe:
https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus

Re: prblems rebuilding conversations db

Reply via email to