Resending, with additional information.
On Jul 10, 2011, at 3:18 PM, Steve Costaras wrote:
>
> -----Original Message-----
> From: Dan Langille [mailto:[email protected]]
> Sent: Sunday, July 10, 2011 12:58 PM
> To: [email protected]
> Cc: [email protected]
> Subject: Re: [Bacula-users] Catastrophic error. Cannot write overflow block
> to device "LTO4"
>
> >>
> >> 2) since everything is spooled first, there should be NO error that should
> >> cancel a job. A tape drive could fail, a tape could burst into flame, all
> >> that would be needed was bacula to know that there was an issue and give
> >> the admin a simple statement do you want to fix the issue or cancel?, the
> >> admin to fix the problem, and then bacula told to restart from the last
> >> block that was stored successfully OR if need be from the beginning of
> >> the spooled data file.
>
> >This I do know. Although, at first glance it seems easy to do this, it is
> >not. If it was trivial to do, I assure you, it would already be in place.
>
> >> Canceling jobs that run for days for TB's of data is just screwed up.
>
> >I suggest running smaller jobs. I don't mean to sound trite, but that really
> >is the solution. Given that the alternative is non-trivial, the sensible
> >choice is, I'm afraid, cancel the job.
>
> I'm already kicking off 20+ jobs for a single system. This does not
> work when we're talking over the 100TB/nearly 200TB mark. And when these
> errors happen it does not matter how many jobs you have as /all/ outstanding
> jobs fail when you have concurrency (in this case all jobs that were queued and
> were not even writing to the same tape were canceled).
This sounds like a configuration issue. Queued jobs should not be cancelled
when a previous job cancels. FYI, I've never seen this happen on my systems.
I think this is something you need to follow up on.
> This does not happen with any other enterprise backup software not that they
> should be 100% mimicked.
> With the data sizes we have today I don't see why there are not better error
> handling checks/routines.
This is open source software. Stuff gets written because someone wants it.
Clearly, nobody who wants it has written it. That is why it does not exist.
But sorry, that's not helping you find a solution. James Harper has some good
points. :) I hope it leads somewhere.
--
Dan Langille - http://langille.org