Matthew,

With an index that small, you should be able to build a proof of concept on your own hardware and discover how it performs using something like SolrMeter:
Michael Della Bitta
------------------------------------------------
Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271
www.appinions.com
Where Influence Isn’t a Game

On Wed, Feb 13, 2013 at 12:21 PM, Matthew Shapiro <m...@mshapiro.net> wrote:
> Thanks for the reply.
>
>> If the main amount of searches are exactly the same (e.g. the empty
>> search), the result will be cached. If 5,683 searches/month is the real
>> count, this sounds like a very low number of searches in a very limited
>> corpus. Just about any machine should be fine. I guess I am missing
>> something here. Could you elaborate a bit? How large is a document, how
>> many do you expect to handle, what do you expect a query to look like,
>> and how should the result be presented?
>
> Sorry, I should clarify our current statistics. First of all, I meant 183k
> documents (not 183, whoops). Around 100k of those are full-fledged HTML
> articles (not web pages, but articles in our CMS with HTML content inside
> them); the rest of the data are more like key/value records with a lot of
> attached metadata for searching.
>
> Also, what I meant by a search without a search term is that probably 80%
> (hard to confirm due to the lack of stats given by the GSA) of our
> searches are done on pure metadata clauses, without any searching through
> the content itself. For example: "give me documents that have a content
> type of video, that are marked for client X, have a category of Y or Z,
> and were published to platform A, ordered by date published". The searches
> that do use a search term use the same kind of query as before, but also
> find all the documents that have the string "My Video" in their title and
> description.
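The metadata-only query Matthew describes maps naturally onto Solr filter queries (fq), which are cached independently of the main query string. A minimal sketch of building such a request in Python; the field names (content_type, client, category, platform, published_date) are assumptions for illustration, not taken from the thread:

```python
from urllib.parse import urlencode

# Hypothetical Solr field names; the actual schema is not given in the thread.
params = [
    ("q", "*:*"),                    # no search term: match all documents
    ("fq", "content_type:video"),    # metadata clauses as filter queries;
    ("fq", "client:X"),              # each fq is cached in the filterCache
    ("fq", "category:(Y OR Z)"),     # independently of the main query
    ("fq", "platform:A"),
    ("sort", "published_date desc"), # "ordered by date published"
    ("wt", "json"),
]
query_string = urlencode(params)
print("/solr/collection1/select?" + query_string)
```

Keeping the metadata clauses in fq rather than in q means the repeated "no search term" requests hit cached filter results rather than re-running the clauses each time.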
> From the way that the GSA provides us statistics (which are pretty bare),
> it appears that it does not count "no search term" searches as part of
> those statistics (the GSA is not really built for searching without search
> terms either, and we've had various issues using it this way because of
> that).
>
> The reason we are using the GSA for this, and not our MSSQL database, is
> that some of this data requires multiple, expensive joins, and we do need
> full-text search for when users want to use that option. Also for
> faceting.
>
> On Wed, Feb 13, 2013 at 11:24 AM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
>> Matthew Shapiro [m...@mshapiro.net] wrote:
>>
>> [Hardware for Solr]
>>
>>> What type of hardware (at a high level) should I be looking for? Are
>>> the main constraints disk I/O, memory size, processing power, etc.?
>>
>> That depends on what you are trying to achieve. Broadly speaking,
>> "simple" search and retrieval is mainly I/O bound. The easy way to handle
>> that is to use SSDs as storage. However, a lot of people like the old
>> school solution and compensate for the slow seeks of spinning drives by
>> adding RAM and warming up the searcher or index files. So either SSD or
>> RAM on the I/O side. If the corpus is non-trivial in size, that is, which
>> brings us to...
>>
>>> Right now we have about 183 documents stored in the GSA (which will go
>>> up a lot once we are on Solr, since the GSA is limiting). The search
>>> systems are used to display core information on several of our
>>> homepages, so our search traffic is pretty significant (the GSA reports
>>> 5,683 searches in the last month; however, I am 99% sure this is not
>>> correct and is not counting search requests without any search terms,
>>> which make up most of our search traffic).
>>
>> If the main amount of searches are exactly the same (e.g. the empty
>> search), the result will be cached.
>> If 5,683 searches/month is the real count, this sounds like a very low
>> number of searches in a very limited corpus. Just about any machine
>> should be fine. I guess I am missing something here. Could you elaborate
>> a bit? How large is a document, how many do you expect to handle, what do
>> you expect a query to look like, and how should the result be presented?
>>
>> Regards,
>> Toke Eskildsen
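Toke's points above, that repeated identical searches are served from cache and that spinning disks can be compensated for by warming the searcher, correspond to the cache and listener settings in solrconfig.xml. A sketch with illustrative values; the cache sizes and the warming query (including the published_date field) are assumptions, not from the thread:

```xml
<!-- Sketch only: sizes and the warming query are illustrative. -->
<query>
  <!-- Identical requests (e.g. the repeated "empty" metadata search)
       are answered from queryResultCache without re-executing. -->
  <queryResultCache class="solr.LRUCache" size="512"
                    initialSize="512" autowarmCount="128"/>
  <!-- Each fq clause is cached here and reused across requests. -->
  <filterCache class="solr.FastLRUCache" size="512"
               initialSize="512" autowarmCount="128"/>
  <!-- Run the common query once at startup so the first real user
       does not pay the cold-disk cost. -->
  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="sort">published_date desc</str>
      </lst>
    </arr>
  </listener>
</query>
```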