Round N + 1 of "it depends" <G>. This isn't a very big index as Solr indexes go; my first guess is that you can easily fit this on the machines you're talking about. But, as always, how you implement things may prove me wrong.
Really, about the only thing you can do is try it. Be aware that "the size of the index" is a tricky concept. For instance, if you store your data (stored="true"), the files in your index directory will NOT reflect the total memory requirements, since verbatim copies of your fields are held in the *.fdt files and really don't affect searching speed.

Here's what I claim:

1> You can index these 12M documents in a reasonable time. I index 1.9M documents (a Wikipedia dump) on my MacBook Pro in just a few minutes (< 10 as I remember). So you can "just try things".

2> Use a Master/Slave architecture. You can control how fast updates become available by the polling interval on the slave and how fast you commit. 2 hours is easy; 10 minutes is a reasonable goal here (rough sketch after this list).

3> Consider edismax-style handlers. The point here is that they allow you to tune relevance much more finely than a "bag of words" approach in which you index many fields into a single text field (sketch below).

4> You only really need to store the fields you intend to display as part of your search results. Assuming you're going to your system-of-record for the full document, your stored data may be very small.

5> Be aware that the first few queries will often be much slower than later queries, as there are certain caches that need to be filled up. See the various warming parameters on the caches and the "firstSearcher" and "newSearcher" entries in the config files (sketch below).

6> Create a mix of queries and use something like JMeter or SolrMeter to determine where your target hardware falls down. You have to take some care to create a reasonable query set, not just the same query over and over, or you'll just get cached results. Fire enough queries at the searcher that it starts to perform poorly and tweak from there.

7> Really, really get familiar with two things: a> the admin/analysis page for understanding the analysis process; b> adding &debugQuery=on to your queries when you don't understand what's happening. In particular, that will show you the parsed queries; you can defer digging into the scoring explanations for later (example below).

8> String types aren't what you want very often. They're really suitable for things like IDs, serial numbers, etc. But they are NOT tokenized. So if your input is "some stuff" and you search for "stuff", you won't get a match. This often confuses people. For tokenized processing, you'll probably want one of the "text" variants. String types are even case sensitive... (sketch below)

But all in all, I don't see what you've described as particularly difficult, although you'll doubtless run into things you don't expect.
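For 2>, a rough sketch of what the replication setup looks like in solrconfig.xml (3.x-style master/slave). The host name, port, and the 10-minute pollInterval are placeholders; adjust to your environment.

On the master:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <!-- publish a new index version to the slaves after every commit -->
      <str name="replicateAfter">commit</str>
      <str name="confFiles">schema.xml,stopwords.txt</str>
    </lst>
  </requestHandler>

On each slave:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <!-- placeholder URL, point at your actual master core -->
      <str name="masterUrl">http://master-host:8983/solr/replication</str>
      <!-- how often the slave polls for a new index version (HH:mm:ss) -->
      <str name="pollInterval">00:10:00</str>
    </lst>
  </requestHandler>

Your "2 hour" window is then roughly bounded by how often you commit on the master plus the poll interval on the slaves.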
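For 3>, a sketch of an edismax handler. The handler name, the boost numbers, and the fl list are made up for illustration; the field names just echo the ones you described, and you'd tune the weights against real queries.

  <requestHandler name="/search" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <!-- search several fields, weighting title/headline more heavily -->
      <str name="qf">title^5 headline^3 summary comments^0.5</str>
      <!-- extra boost when the whole phrase appears in the title -->
      <str name="pf">title^10</str>
      <!-- return only the handful of fields you actually display -->
      <str name="fl">id,title,status</str>
    </lst>
  </requestHandler>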
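For 5>, the warming entries also live in solrconfig.xml. The queries below are hypothetical, using your field names; the idea is to fire a few queries that look like real user traffic so the caches are filled before anyone hits the new searcher.

  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <!-- hypothetical warming queries; use queries your users actually run -->
      <lst><str name="q">status:reviewed</str></lst>
      <lst><str name="q">*:*</str></lst>
    </arr>
  </listener>
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">status:"pending review"</str></lst>
    </arr>
  </listener>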
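For 7>, debugQuery is just one more parameter on the request. Something like this (made-up host and field names) shows you exactly what your query parsed to:

  http://localhost:8983/solr/select?q=author:XYZ+AND+status:reviewed&debugQuery=on&rows=10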
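For 8>, the difference in schema.xml looks roughly like this. The field names are examples from your description, and "text_general" is the tokenized type from the example schema; use whatever text type fits your analysis chain.

  <!-- exact, untokenized, case-sensitive match: good for ids, status codes -->
  <field name="submitter" type="string" indexed="true" stored="false"/>

  <!-- tokenized: a search for "stuff" matches a title of "some stuff" -->
  <field name="title" type="text_general" indexed="true" stored="true"/>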
Hope that helps

Erick

On Sun, Sep 25, 2011 at 1:00 PM, Ikhsvaku S <ikhsv...@gmail.com> wrote:
> Hi List,
>
> We are pretty new to Solr & Lucene and have just starting indexing few 10K
> documents using Solr. Before we attempt anything bigger we want to see what
> should be the best approach..
>
> Documents: We have close to ~12 million XML docs, of varying sizes average
> size 20 KB. These documents have 150 fields, which should be searchable &
> indexed. Of which over 80% fixed length string fields and few strings are
> multivalued ones (e.g. title, headline, id, submitter, reviewers,
> suggested-titles etc), there other 15% who are date specific (added-on,
> reviewed-on etc). Rest are multivalued text documents, (E,g,
> description, summary, comments, notes etc). Some of the documents do have
> large number of these text fields (so we are leaning against storing these
> in index). Approximately ~6000 such documents are updated & 400-800 new ones
> are added each day
>
> Queries: A typical query would mainly be on string fields ~ 60% of queries
> e.g. a simple one would be find document ids of documents whose author is
> XYZ & submitted between [X-Z] & whose status is reviewed or pending review
> && title has this string etc... the results of which are exacting nature
> (found 300 docs). Rest of searches would include the text fields, where they
> search quoted snippets or phrases... Almost all queries have multiple
> operators. Also each one would want to grab as many result rows as possible
> (we are limiting this to 2000). The output shall contain only 1-5 fields.
> (No highlighting etc needed)
>
> Available hardware:
> Some of existing hardware we could find consists of existing ~300GB SAN each
> on 4 Boxes with ~96Gig each. We do couple of older HP DL380s (mainly want to
> use for offline indexing). All of this is on 10G Ethernet.
>
> Questions:
> Our priority is to provide results fast, and the new or updated documents
> should be indexed within 2 hour. Users are also known to use complex queries
> for data mining. Seeing all this any recommendations for indexing data,
> fields?
> How do we scale, what architecture should we follow here? Slave/master
> servers? Any possible issues we may hit?
>
> Thanks
>