Hi all, I'm cross-posting this from the Lucene list, as I think I can get
better help here for this scenario.
Suppose I want to index 100GB+ of numeric data. I'm not yet sure of the
specifics, but I can expect the following:
- The data is expected to be in one gigantic table. Conceptually, it is like a
spreadsheet: rows are objects and columns are properties.
- Values are mostly floating-point numbers, and I expect them to be, let's say,
unique or discrete, or almost randomly distributed
(e.g. 1.89868776E+50, 1.434E-12).
- The data is read-only; it will never change.
Now I need to query this data, mostly with range queries on the
columns. Something like:
"SELECT * FROM Table WHERE (Col1 > 1.2E2 AND Col1 < 1.8E2) OR (Col3 == 0)"
which is basically "give me all the rows that satisfy these criteria".
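If I'm reading Solr's range-query syntax right, that should translate to
something like the following (col1 and col3 are made-up field names; the curly
braces mean exclusive bounds, matching the > and < above):

  q=*:*
  fq=(col1:{1.2E2 TO 1.8E2}) OR col3:0

Please correct me if a filter query isn't the right tool for this.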
I believe this could easily be done with a standard RDBMS, but I
would like to avoid that route.
While thinking about this, and assuming this could work well with Solr,
there are some things I couldn't answer:
- In this case, it seems to make total sense to store the data in the index:
if I'm going to index all the "columns" anyway, I might as well have the data
right there (see the schema sketch after this list).
- Does it make any sense to build the whole index once, offline, and then
upload only the finished index to the servers?
- I'm almost sure I will have to shard the index in some way, and this
isn't difficult. But what are the possible hardware requirements to
host this thing? I know this depends on lots of information I didn't
provide (searches/sec, for example), but can someone throw out a number? I
really have no idea...
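For what it's worth, this is roughly the schema I was picturing -- just a
sketch with made-up field names, assuming Solr's trie numeric fields
(TrieDoubleField) are the right choice for fast range queries on this kind of
data:

  <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8"
             omitNorms="true" positionIncrementGap="0"/>

  <field name="id"   type="string"  indexed="true" stored="true" required="true"/>
  <field name="col1" type="tdouble" indexed="true" stored="true"/>
  <field name="col3" type="tdouble" indexed="true" stored="true"/>

I went with doubles rather than floats because some of the example values
(1.89868776E+50) are outside single-precision range.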

Thanks
--
Pedro Ferreira

mobile: 00 44 7712 557303
skype: pedrosilvaferreira
email: psilvaferre...@gmail.com
linkedin: http://uk.linkedin.com/in/pedrosilvaferreira
