Hello, I am trying to identify a reasonable version control system for an unusual workflow. SVN is a major player in this space, so it is one of the systems I want to consider, but I've run into some problems. It is unclear to me whether these problems are specific to the SVN clients I have tested or whether they are a general consequence of the way SVN has been designed.
I would appreciate feedback on whether there are ways to make SVN work more effectively for my project, or alternatively, whether another version control system might be more suitable.

Workflow Specifications:

* ~1 million files under version control (> 99% are essentially text files, but a few are binary).
* Average file size ~50 kB, for a total archive of ~50 GB.
* Wide range of sizes: ~50% of files are less than 10 kB, but a couple are greater than 1 GB.
* Most updates occur through a batch process that changes ~10% of the files every two weeks (not the same 10% every time).
* A typical batch change modifies only a few percent of each affected file, so the total difference per batch update is only ~200 MB. (A back-of-envelope check of these numbers is in the P.S. below.)

Other Requirements:

* Must support random file / version access.
* Clients must run on Windows and Linux / Mac.
* Must allow for web-based repository viewing.
* Highly desirable to allow for partial checkout of subdirectories. (See the sparse-checkout sketch in the P.S.)

In my testing, SVN clients seem to behave badly when you throw very large numbers of files at them. TortoiseSVN, for example, can take hours for a relatively simple add operation on data samples that are only a fraction of the total intended size. Another of the SVN clients I tested (but won't bother naming) crashed outright when asked to work with 30,000 files. (A script for generating synthetic test trees like my samples is sketched in the P.S.)

Are there ways to use SVN with very large data sets that would improve its performance, for example alternative clients that might be better optimized for this workflow? I'd even consider recompiling a client if there were a simple way to get significant improvements.

My worry is that SVN may be designed in such a way that it will always perform poorly on a data set like mine, for example by requiring lots of additional file I/O to maintain all its records. Is that the case? (The P.S. includes a small script for measuring that overhead on a working copy.) If so, I would appreciate any recommendations for other version control systems that might be better tailored to working with very large data sets.

Thank you for your assistance.

-Robert A. Rohde
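P.S. A few supporting sketches, as promised above. First, a back-of-envelope check of the scale numbers in my specifications list, written out as a small Python script. Every value is just the estimate from the list (nothing here is measured), and the "few percent" per-file difference is taken as 4% for the arithmetic to come out at ~200 MB:

    # Back-of-envelope check of the scale described above.
    n_files = 1_000_000                      # files under version control
    avg_size_kb = 50                         # average file size

    total_gb = n_files * avg_size_kb / 1e6   # kB -> GB
    print(f"total archive: ~{total_gb:.0f} GB")          # ~50 GB

    # ~10% of files touched per batch update, assuming touched files
    # are roughly average-sized:
    touched_gb = total_gb * 0.10
    print(f"touched per update: ~{touched_gb:.0f} GB")   # ~5 GB

    # Only a few percent of each touched file actually differs, so the
    # real delta per update is small:
    delta_mb = touched_gb * 0.04 * 1000      # "a few percent" ~= 4%
    print(f"net delta per update: ~{delta_mb:.0f} MB")   # ~200 MB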
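Second, on partial checkouts: I gather that newer SVN releases (1.5 and later) support sparse working copies via the --depth and --set-depth options. If that's right, something like the following minimal sketch would cover the partial-checkout requirement; the repository URL and the subdirectory name are placeholders, not a real setup:

    import subprocess

    REPO = "https://svn.example.org/repo/trunk"   # placeholder URL
    WC = "workingcopy"                            # local working-copy path

    def svn(*args):
        # Run an svn command, raising an error if it fails.
        subprocess.run(["svn", *args], check=True)

    # Check out only the top level: subdirectories are created but left empty.
    svn("checkout", "--depth", "immediates", REPO, WC)

    # Later, deepen just the one subdirectory actually needed.
    svn("update", "--set-depth", "infinity", f"{WC}/subdir_of_interest")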
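Third, in case anyone wants to reproduce the client behavior I'm describing, this is roughly how one could generate a synthetic test tree like my samples. The counts and sizes are arbitrary parameters; the defaults are chosen to hit the 30,000-file mark where one client crashed for me:

    import os
    import random

    def make_tree(root, n_dirs=100, files_per_dir=300, avg_kb=50):
        # 100 dirs x 300 files = 30,000 files of mostly-text filler,
        # with sizes jittered around the 50 kB average.
        random.seed(0)                        # reproducible tree
        for d in range(n_dirs):
            dirpath = os.path.join(root, f"dir{d:04d}")
            os.makedirs(dirpath, exist_ok=True)
            for f in range(files_per_dir):
                size_kb = max(1, int(random.gauss(avg_kb, avg_kb / 2)))
                path = os.path.join(dirpath, f"file{f:05d}.txt")
                with open(path, "w") as fh:
                    fh.write("x" * (size_kb * 1024))

    make_tree("testdata")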
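Finally, on the bookkeeping I/O question: my understanding is that SVN working copies keep a pristine copy of every file inside per-directory .svn administrative areas, which would roughly double both disk usage and file I/O on a tree like mine. I haven't verified this against every client, but if it's right, a quick way to measure the overhead on an existing working copy would be:

    import os

    def svn_overhead(wc_root):
        # Sum bytes of real content vs. bytes under .svn directories.
        content = admin = 0
        for dirpath, dirnames, filenames in os.walk(wc_root):
            in_admin = ".svn" in dirpath.split(os.sep)
            for name in filenames:
                try:
                    size = os.path.getsize(os.path.join(dirpath, name))
                except OSError:
                    continue                  # file vanished or unreadable
                if in_admin:
                    admin += size
                else:
                    content += size
        return content, admin

    content, admin = svn_overhead("workingcopy")   # placeholder path
    print(f"content: {content/1e9:.2f} GB, "
          f".svn bookkeeping: {admin/1e9:.2f} GB")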