How do you do.

>> Full backups every Sunday and incrementals daily. But
>> after some time one of the Bacula processes started to crash every morning
>> and 1 or more (or all) jobs were left not done. This situation lasted
>> for some weeks - it became clear to me that I need help.
>> and every morning Director's process bacula-dir is missing.

AL> That's bad.

Yes, the most obvious sign of a problem...

>> Last morning log's end looks like this
>> # less /var/db/bacula/log

AL> ... skip some log output...

>> 02-Nov 03:15 nfs4p-dir: Start Backup JobId 3960,
>> Job=sinux-oracle.2005-11-02_03.15.00
>> 02-Nov 03:15 sinux-fd: ClientRunBeforeJob: -su: line 8: ulimit: max user
>> processes: cannot modify limit: Operation not permitted

AL> That one indicates a problem, I guess. There seems to be a limit on the
AL> number of processes a user can have running. Some script or program
AL> tries to increase that limit.
AL> You should investigate the script that is called as Client Run Before
AL> Job script for the job sinux-oracle. Just to point out the obvious: That
AL> script is not on the director machine (probably nfs-4p) but on sinux.

That's an old (and by now not very important) Oracle server that's used
in development. It does something before and after backup. It worked OK
before and even now the backup job for sinux finishes with
"Termination: Backup OK". The guys that use that machine will look at
these strange messages anyway, thank you, Arno!

AL> That situation *could* indicate a serious security problem, even a
AL> compromised database server. Good luck.
AL> ...
AL> more output

>>
>> "That's all, folks!" (c) :-(

That's how it died on cgatex job last time.

02-Nov 03:21 nfs4p-dir: Begin pruning Jobs.
02-Nov 03:21 nfs4p-dir: No Jobs found to prune.
02-Nov 03:21 nfs4p-dir: Begin pruning Files.
02-Nov 03:21 nfs4p-dir: No Files found to prune.
02-Nov 03:21 nfs4p-dir: End auto prune.
02-Nov 07:05 nfs4p-dir: Start Backup JobId 3961, Job=cgatex-full.2005-11-02_07.05.00
02-Nov 07:05 cgatex-fd-fd: Since time adjusted by 0 seconds.
02-Nov 07:05 s10-sd: Volume "Vol0086" previously written, moving to end of data.
02-Nov 07:06 s10-sd: User defined maximum volume capacity 734,003,200 exceeded on device /d/0/bacula.
02-Nov 07:06 s10-sd: End of medium on Volume "Vol0086" Bytes=733,941,548 Blocks=11,378 at 02-Nov-2005 07:06.
02-Nov 07:06 nfs4p-dir: Recycled volume "Vol0087" ...
...
02-Nov 07:35 s10-sd: Recycled volume "Vol0091" on device "/d/0/bacula", all previous data lost.
02-Nov 07:35 s10-sd: New volume "Vol0091" mounted on device /d/0/bacula at 02-Nov-2005 07:35.
02-Nov 08:04 s10-sd: User defined maximum volume capacity 734,003,200 exceeded on device /d/0/bacula.
02-Nov 08:04 s10-sd: End of medium on Volume "Vol0091" Bytes=733,952,897 Blocks=11,377 at 02-Nov-2005 08:04.
03-Nov 01:05 nfs4p-dir: Start Backup JobId 3962, Job=nfs4p.2005-11-03_01.05.00
03-Nov 01:05 nfs4p-fd: Since time adjusted by -1095 seconds.

I don't understand the last line...

>>
>> I run
>>
>> # /etc/bacula/bconsole
>>
>> and see
>>
>> Connecting to Director 127.0.0.1:9101
>> 1000 OK: nfs4p-dir Version: 1.36.1 (26 November 2004)
>> Enter a period to cancel a command.
>> *status 1
>> Using default Catalog name=MyCatalog DB=bacula
>> Automatically selected Storage: File
>> Connecting to Storage daemon File at 10.253.4.15:9103
>>
>> s10-sd Version: 1.36.1 (26 November 2004) i386-pc-solaris2.10 solaris 5.10
>> Daemon started 02-Nov-05 20:10, 0 Jobs run since started.
>>
>> Running Jobs:
>> No Jobs running.
>> ====
>>
>> Terminated Jobs:
>> JobId  Level  Files          Bytes  Status  Finished         Name
>> ======================================================================
>>  3952  Incr   2,462      1,889,217  OK      02-Nov-05 01:21  ns02
>>  3953  Incr       1     33,512,324  OK      02-Nov-05 01:22  sinux
>>  3954  Incr      83     28,008,073  OK      02-Nov-05 01:23  dbh1-matroska
>>  3955  Incr       0              0  OK      02-Nov-05 01:23  dbh1-configs
>>  3956  Incr       0              0  OK      02-Nov-05 01:23  dbh1-home
>>  3957  Incr   1,418     84,707,857  OK      02-Nov-05 01:28  hpov-full
>>  3958  Incr      67    615,990,101  OK      02-Nov-05 01:37  dbh2-full
>>  3959  Full       1    186,384,929  OK      02-Nov-05 01:51  BackupCatalog
>>  3960  Full       5    181,932,043  OK      02-Nov-05 03:21  sinux-oracle
>>  3961  Incr   9,889  2,246,582,808  Cancel  02-Nov-05 10:22  cgatex-full
>> ====
>>
>> Device status:
>> Device "/d/0/bacula" is not open.
>> ====
>>
>> The last job is the most important - it's the mail server... :-(

AL> It looks like that job hasn't failed but got cancelled - that status
AL> should, as far as I know, only happen as a direct result of user
AL> intervention.

No one but me could intervene. I did not. There was no manual cancel...
that would be too simple... I wonder ... when the DIR goes down, what
happens then...

>> If I leave this console till next morning and try to enter any command
>> after the bacula-dir crashes it'll die also being unable to connect to
>> Director.

AL> Ok, the DIR dies during the night.
AL> You can do the following:
AL> Either run the director with debug output enabled and capture the
AL> output. You'd call it with something like "./bacula-dir -v -d 200 -c
AL> /etc/bacula/bacula-dir >>/var/log/bacula-dir.output". Adjust paths and
AL> debug level to your needs... a debug level of 100 gives a good overview
AL> of the program flow, 400 results in lots and lots of details, and 900
AL> gives you more than you will need to locate the problem, I guess.
AL> After the DIR crashes, you should investigate the last lines of the
AL> output, probably post it here. Perhaps it helps to locate the problem.
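
To pull out the last lines of the captured debug output after a crash, a small Python sketch along these lines might help. It assumes the output was appended to /var/log/bacula-dir.output as in the command quoted above; the helper name last_lines and the count of 50 lines are illustrative choices, not anything provided by Bacula.

```python
from collections import deque
from pathlib import Path


def last_lines(path, count=50):
    """Return the final `count` lines of a text file.

    The file is streamed through a bounded deque, so even a very large
    debug log is never held in memory as a whole.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


if __name__ == "__main__":
    # Assumed location, taken from the redirect in the quoted command;
    # adjust it to wherever your startup script writes the output.
    output_file = Path("/var/log/bacula-dir.output")
    if output_file.exists():
        for line in last_lines(output_file, 50):
            print(line)
    else:
        print(f"{output_file} not found")
```

With a high debug level the output file grows quickly, which is why the sketch reads it line by line instead of loading it all at once.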
Thank you very much for the advice! I've just adjusted the startup
script for Bacula. I think I'll start with -d 200 as you recommend and
see what happens ...

AL> The other possibility is to run the DIR under the debugger - there are
AL> some instructions in the manual. It would be best if you know a little
AL> about how to work with gdb, though.

Not too much, unfortunately.

AL> Finally, and I suspect that this would be something you'd end up with
AL> anyway, you could upgrade to the current release version 1.38. This
AL> version does fix some bugs, introduces some features, and requires only
AL> minor - if at all - configuration changes. It does require a catalog
AL> upgrade, so you will want to read the instructions carefully :-)

OK, I'll keep this in mind. But first of all I must try to revive what
I have got...

AL> I suspect that, if you found a bug in bacula, you will be forced to
AL> upgrade because it's unlikely that Kern will fix an older version.

I see...

AL> Well, start with the debug log and probably the debugger. That should
AL> help understanding what happens when Bacula crashes. Or upgrade to 1.38
AL> and see if that fixes your problem (which might easily happen). The
AL> upgrade itself is not a problem as long as you know how your installed
AL> version was built (options to configure)

I'll probably have to investigate it.

AL> and have the necessary toolchain and libraries installed. The
AL> catalog upgrade can be a problem as you can not easily revert to
AL> an older version...

That matters.

>> Thanks for your attention. I really need help to make my problem clear
>> and solve it. Any good advice will move things from bad to good.

AL> Well, and good luck for fixing your problems. Keep us informed, or post
AL> some more detailed information and I'm confident that can be fixed.

Certainly. I'll post more information as soon as I have it. Thank you
very much!

--
SY Vadim A.
Umanski
