Thanks for the reply.
Sorry, I should clarify our current statistics. First of all, I meant 183k documents (not 183, whoops). Around 100k of those are full-fledged HTML articles (not web pages, but articles in our CMS with HTML content inside them); the rest of the data are more like key/value records with a lot of attached metadata for searching.

Also, what I meant by "search without a search term" is that probably 80% of our searches (hard to confirm due to the lack of stats given by the GSA) are done on pure metadata clauses without any searching through the content itself. For example: "give me documents that have a content type of video, that are marked for client X, have a category of Y or Z, and were published to platform A, ordered by date published". The searches that do use a search term look like the same query as in that example, but additionally ask for all documents that have the string "My Video" in their title and description (see the Solr sketch at the bottom of this mail).

From the way the GSA provides us statistics (which are pretty bare), it appears that "no search term" searches are not counted as part of those statistics (the GSA is not really built for searching without search terms either, and we've had various issues using it this way because of it).

The reason we are using the GSA for this, and not our MSSql database, is that some of this data requires multiple, and expensive, joins, and we do need full-text search for when users want to use that option. Also for faceting.

On Wed, Feb 13, 2013 at 11:24 AM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:

> Matthew Shapiro [m...@mshapiro.net] wrote:
> > [Hardware for Solr]
>
> > What type of hardware (at a high level) should I be looking for. Are the
> > main constraints disk I/O, memory size, processing power, etc...?
>
> That depends on what you are trying to achieve. Broadly speaking, "simple"
> search and retrieval is mainly I/O bound. The easy way to handle that is to
> use SSDs as storage. However, a lot of people like the old school solution
> and compensate for the slow seeks of spinning drives by adding RAM and
> doing warmup of the searcher or index files. So either SSD or RAM on the
> I/O side. If the corpus is non-trivial in size, that is, which brings us
> to...
>
> > Right now we have about 183 documents stored in the GSA (which will go up a
> > lot once we are on Solr since the GSA is limiting). The search systems are
> > used to display core information on several of our homepages, so our search
> > traffic is pretty significant (the GSA reports 5,683 searches in the last
> > month, however I am 99% sure this is not correct and is not counting search
> > requests without any search terms, which consists of most of our search
> > traffic).
>
> If the main amount of searches are the exact same (e.g. the empty search),
> the result will be cached. If 5,683 searches/month is the real count, this
> sounds like a very low amount of searches in a very limited corpus. Just
> about any machine should be fine. I guess I am missing something here.
> Could you elaborate a bit? How large is a document, how many do you expect
> to handle, what do you expect a query to look like, how should the result
> be presented?
>
> Regards,
> Toke Eskildsen
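
P.S. To make the example concrete, here is roughly what I picture that first "no search term" query looking like in Solr. The core name (articles) and field names (content_type, client, category, platform, date_published, title, description) are just placeholders for whatever our schema ends up being:

  /solr/articles/select
      ?q=*:*                       (no search term: match everything)
      &fq=content_type:video       (each metadata clause as a filter query,
      &fq=client:clientX            which Solr caches independently)
      &fq=category:(Y OR Z)
      &fq=platform:platformA
      &sort=date_published desc
      &facet=true                  (facet counts for the UI)
      &facet.field=category
      &facet.field=content_type

For the searches that do use a search term, the q parameter would carry it instead, e.g. q=title:"My Video" OR description:"My Video", while the fq clauses stay the same.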