Take a look at this: http://vimeo.com/16102543
For that amount of data it isn't that easy :-)
We are looking into building a reporting feature and are investigating solutions that will allow us to search through our logs for downloads, searches and view history.

Each log item is relatively small.

download history

<add>
  <doc>
    <field name="uuid">item123-v1</field>
    <field name="market">photography</field>
    <field name="name">item 1</field>
    <field name="userid">1</field>
    <field name="version">1</field>
    <field name="downloadType">hires</field>
    <field name="itemId">123</field>
    <field name="timestamp">2009-11-07T14:50:54Z</field>
  </doc>
</add>

search history

<add>
  <doc>
    <field name="uuid">1</field>
    <field name="query">brand assets</field>
    <field name="userid">1</field>
    <field name="timestamp">2009-11-07T14:50:54Z</field>
  </doc>
</add>

view history

<add>
  <doc>
    <field name="uuid">1</field>
    <field name="itemId">123</field>
    <field name="userid">1</field>
    <field name="timestamp">2009-11-07T14:50:54Z</field>
  </doc>
</add>

We reckon that we could have around 10 - 30 million log records for each type (downloads, searches, views), so 70 million records in total, but obviously it must scale higher. Concurrent users will be around 10 - 20 (relatively low). New logs will be imported as a batch overnight.

Because we have some previous experience with SOLR, and because the interface needs full-text searching and filtering, we built a prototype using SOLR 4.0. We used the new field collapsing feature in SOLR 4.0 to collapse on groups of data. For example, view history needs to collapse on itemId. Each row then shows how many views the item has had, which is the number of documents that were grouped together.

The requirements for the solution are to be schemaless, to make adding new fields to new documents easier, and to have a powerful search interface, both of which SOLR can do.

QUESTIONS

Our prototype is working as expected, but I'm unsure about the following:

1. Has anyone got experience with using SOLR for log analysis?
2. SOLR can scale, but at what point should I start considering sharding the index? It should be fine with 100+ million records.
3. We are using a nightly build of SOLR for the "field collapsing" feature. Would it be possible to patch SOLR 1.4.1 with the SOLR-236 patch? Has anyone used this in production?

thanks
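To make the "collapse on itemId and show the view count per row" idea concrete, here is a rough Python sketch of what such a request and the count extraction might look like. It assumes the result grouping parameters (group, group.field, group.limit) and the "grouped" JSON response layout as found in Solr 4.0; a nightly build or a SOLR-236-patched 1.4.1 may use different parameter names. The base URL, the function names and the sample response are made up for illustration and are not taken from the prototype described above.

```python
from urllib.parse import urlencode


def build_view_history_query(base_url, text="*:*", rows=10):
    """Build a select URL that collapses view-history documents on itemId.

    Parameter names are an assumption based on Solr 4.0 result grouping.
    """
    params = {
        "q": text,
        "group": "true",          # switch result grouping on
        "group.field": "itemId",  # one group per item
        "group.limit": 1,         # one representative document per group
        "rows": rows,             # number of groups to return
        "wt": "json",
    }
    return base_url + "?" + urlencode(params)


def view_counts(response):
    """Map each itemId to the number of view-log documents in its group."""
    groups = response["grouped"]["itemId"]["groups"]
    return {g["groupValue"]: g["doclist"]["numFound"] for g in groups}


# Hand-written sample in the shape of a grouped response (not real data).
sample_response = {
    "grouped": {
        "itemId": {
            "matches": 4,
            "groups": [
                {"groupValue": "123",
                 "doclist": {"numFound": 3, "start": 0, "docs": [{"uuid": "1"}]}},
                {"groupValue": "456",
                 "doclist": {"numFound": 1, "start": 0, "docs": [{"uuid": "4"}]}},
            ],
        }
    }
}

# Hypothetical host and handler path; adjust to the actual deployment.
url = build_view_history_query("http://localhost:8983/solr/select")
counts = view_counts(sample_response)
```

Here numFound on each group's doclist plays the role of the per-item view frequency, so each row of the report can be filled without fetching every log document in the group.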
-- http://jetwick.com twitter search prototype