On Mon, Dec 30, 2013 at 10:27 AM, Zach La Celle
<[email protected]> wrote:
> I'm reading through BackupPC documentation and trying to understand how
> it functions.  I'm a little confused about two parts:
>
> 1) Difference between full and incremental: full says it does all
> files.  I guess this means that all files are pulled down, then checked
> and pooled.  Incremental first checks modification, then only copies the
> files that are different, then pools.  However, if incremental/full are
> working correctly, they will (in the end) function semantically the
> same, because all files which would have been copied by full that would
> not have been copied by incremental would be pooled anyways, resulting
> in nothing new written to disk other than hard links.  So, why do a full
> backup ever after completing the first one?  Why not run 1 full, then
> always do incrementals?

There are two scenarios.  Non-rsync backups work strictly by timestamps
and will miss changes such as old files under renamed directories, or
copies that preserve old timestamps.  These need periodic fulls to
correct the tree and note deletions.  Rsync backups normally use the
previous full as the base for comparison, so as the target system
changes, the incrementals transfer more and more data.  An rsync full
only transfers the differences anyway: it does a block-checksum
comparison over the data to verify it and rebuilds the tree to serve
as the base for subsequent comparisons.
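A rough sketch of the block-checksum idea (simplified: real rsync uses a rolling checksum so matches can be found at any offset, and negotiates its own block size; this fixed-offset version is just for illustration):

```python
import hashlib

BLOCK = 4096  # illustrative block size; real rsync negotiates this

def block_sums(data: bytes):
    """Checksum each fixed-size block of the receiver's existing copy."""
    return [hashlib.md5(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

def blocks_to_transfer(old: bytes, new: bytes):
    """Return indices of blocks whose checksums differ; only these
    blocks would need to cross the wire on a 'full' pass -- everything
    else is verified in place."""
    old_sums = block_sums(old)
    changed = []
    for i in range(0, len(new), BLOCK):
        j = i // BLOCK
        csum = hashlib.md5(new[i:i + BLOCK]).hexdigest()
        if j >= len(old_sums) or old_sums[j] != csum:
            changed.append(j)
    return changed
```

So even when a file hasn't changed at all, the full pass still reads and checksums it end to end, which is why fulls take longer without transferring much.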

> 2) The order of operations is ping->dump->extract->link.  I'm trying to
> understand the file compare/extract/link part.  The available transfer
> methods are rsync, rsyncd, ftp, smb, and tar.  The documentation says
> that incoming data is extracted to __TOPDIR__/pc/$host/new (if it's
> compressed), with tarExtract checking the MD5 hash of files as they come
> in.  It only checks the first N bytes, meaning that the hash can be
> completed 100% in memory, before the file is written to disk.

I think this means that it can determine in the first N bytes if there
is a possible match in the pool.   If not, it has to create a new
entry.
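Something like this, I'd guess -- a digest over the file's length plus its first chunk is enough to narrow the pool down to a few candidates without reading the whole file (the exact chunk size and digest inputs here are assumptions, not BackupPC's actual scheme):

```python
import hashlib
import os

N = 1 << 20  # first-chunk size used for the partial digest (assumption)

def partial_digest(path: str) -> str:
    """Digest of the file length plus the first N bytes -- enough to
    find candidate pool matches without reading the whole file."""
    size = os.path.getsize(path)
    h = hashlib.md5(str(size).encode())
    with open(path, "rb") as f:
        h.update(f.read(N))
    return h.hexdigest()
```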

> However,
> it then says "BackupPC_tarExtract and rsync can handle arbitrarily large
> files and multiple candidate matching files without needing to write the
> file to disk in the case of a match."  This I don't quite understand: if
> the file is larger than my memory, it has to be stored while being
> compared bit-by-bit with other MD5-matching files to do the full compare.

I haven't followed that through myself, but as long as the new data
matches, you don't need to save a new instance while computing the
hash.  If there is a mismatch at some point, you can always
reconstruct the file using the portion that matched from the existing
earlier copy, then continue with the new content from the point where
it diverges.  And if it matches all the way through, you can link to
the existing copy instead of writing anything.
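In other words, something along these lines (a sketch of the idea, not BackupPC's actual code -- function names and chunk size are mine):

```python
def match_or_store(stream, candidate_path, new_path, chunk=65536):
    """Compare an incoming stream against an existing pool file, chunk
    by chunk.  On a full match, nothing is written and the caller can
    hard-link to the pool file.  On the first mismatch, copy the
    already-matched prefix out of the pool file, then keep writing the
    rest of the stream -- so the file never has to sit in memory."""
    with open(candidate_path, "rb") as cand:
        matched = 0
        while True:
            new_chunk = stream.read(chunk)
            old_chunk = cand.read(chunk)
            if new_chunk == old_chunk:
                if not new_chunk:      # both exhausted: full match
                    return True        # caller links to the pool copy
                matched += len(new_chunk)
                continue
            # Mismatch: rebuild the matched prefix from the pool copy,
            # then write the divergent chunk and the rest of the stream.
            with open(new_path, "wb") as out:
                cand.seek(0)
                out.write(cand.read(matched))
                out.write(new_chunk)
                while True:
                    rest = stream.read(chunk)
                    if not rest:
                        return False
                    out.write(rest)
```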

> Basically, for the extract->compare part:
> * The first N bytes of incoming files are written to memory, then the
> md5 is performed against all existing files, then what?  If it doesn't
> find a match, it's written to the pool, but if it does, it still has to
> write the file (there could be md5 collisions), correct?  Does it store
> a list of the MD5 collisions?  Because in the BackupPC_link
> documentation, it says it has to check against all files again, since
> there could have been some added by another link process.

Yes, there can be collisions, so there can be a list of possibly matching files.
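As I understand it, the pool stores colliding files as a chain under the same hash -- something like <digest>, <digest>_0, <digest>_1 (the exact naming here is my assumption for illustration), so a lookup walks the chain:

```python
import os

def pool_candidates(pool_dir: str, digest: str):
    """Yield every pool file sharing this digest.  Collisions are
    assumed to be stored as <digest>, <digest>_0, <digest>_1, ... and
    each one must be content-compared against the incoming file."""
    base = os.path.join(pool_dir, digest)
    if os.path.exists(base):
        yield base
    i = 0
    while True:
        p = f"{base}_{i}"
        if not os.path.exists(p):
            return
        yield p
        i += 1
```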

> * Rsync is special because it doesn't have to write a file to disk in
> case of a match?  How does this work with link?

Rsync uses the filenames in the pc tree as the comparison base and
makes new links against them for existing, matching files instead of
looking up the hashed filename in the pool, because they are already
linked there.  For new/changed files, the new pool filenames have to
be created.
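The link step itself amounts to replacing the pc-tree copy with a hard link so only one instance of the content lives on disk, roughly like this (a sketch; the real BackupPC_link also handles collision chains and compression):

```python
import os

def link_into_pool(pc_file: str, pool_file: str):
    """If matching content already exists in the pool, replace the
    pc-tree copy with a hard link to it; otherwise seed the pool with
    a new link to this file.  Either way, one copy on disk."""
    if os.path.exists(pool_file):
        os.unlink(pc_file)
        os.link(pool_file, pc_file)
    else:
        os.link(pc_file, pool_file)  # new content: seed the pool
```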

> * The above is not true for ssh, ftp, rsyncd, etc?

Rsync/rsyncd work the same way, matching filenames from the last full
first to avoid transferring matching data; the others always work with
the pool filenames, which are hashes of the content.

-- 
  Les Mikesell
     [email protected]

_______________________________________________
BackupPC-users mailing list
[email protected]
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/
