> Hi,
>
> Currently we are in the process of figuring out how to deal with
> millions of CSV files containing weather data (20+ million files). Each
> file is about 500 bytes in size.
> We want to calculate statistics on fields read from the file. For
> example, the standard deviation of wind speed across all 20+ million files.
> Processing speed isn't an important issue. The analysis routine can run
> for days, if needed.
>
> The StatsComponent (http://wiki.apache.org/solr/StatsComponent) for Solr
> appears to be able to calculate the statistics we are interested in.
>
> Will the StatsComponent in Solr do what we need with minimal configuration?
> Can the StatsComponent only be used on a subset of the data? For
> example, only look at data from certain months?
If I remember correctly, it cannot.

> Are there other free programs out there that can parse and analyze 20+
> million files?

Yes. If analyzing data like yours is all you do (not search, which is
where Solr's power lies), then you are most likely much better off not
using Solr and instead writing map/reduce programs for Apache Hadoop,
which is designed to analyze huge amounts of data. Hadoop can be quite
difficult to start with, so you could instead use the excellent Apache
CouchDB database, which supports map/reduce as well and is much easier
to begin with. If you transform a sample of your data to the JSON
format, install CouchDB, load your data, and write a simple map/reduce
function, you can have all of that done in about 8 hours. Loading and
processing all of the data will take a bit longer.

Cheers

> We are still very new to Solr and really appreciate all your help.
> Thanks,
> Fred
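To make the map/reduce idea concrete, here is a minimal single-process sketch of the computation such a job would perform. It assumes (hypothetically, since the message doesn't show the file layout) that each CSV file has a header row with a "wind_speed" column. The mapper boils one file down to a combinable partial (count, sum, sum of squares); the reducer merges the partials and derives the population standard deviation, so everything is done in one pass over the files.

```python
# Hedged sketch: the file layout and the "wind_speed" field name are
# assumptions, not taken from the original message.
import csv
import glob
import math
from typing import Iterable, Tuple

Partial = Tuple[int, float, float]  # (count, sum, sum of squares)

def map_file(path: str, field: str = "wind_speed") -> Partial:
    """Mapper: reduce one small CSV file to a combinable partial."""
    n, s, ss = 0, 0.0, 0.0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            x = float(row[field])
            n += 1
            s += x
            ss += x * x
    return n, s, ss

def reduce_partials(partials: Iterable[Partial]) -> float:
    """Reducer: merge partials into the population standard deviation."""
    n, s, ss = 0, 0.0, 0.0
    for pn, ps, pss in partials:
        n += pn
        s += ps
        ss += pss
    mean = s / n
    return math.sqrt(ss / n - mean * mean)

# Example driver (paths are illustrative):
# stddev = reduce_partials(map_file(p) for p in glob.glob("data/*.csv"))
```

Because the partials combine associatively, this maps directly onto Hadoop or a CouchDB view; restricting the analysis to certain months is then just a matter of selecting which files (or documents) feed the mapper.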