Ooops: https://code.google.com/p/solrmeter/
Michael Della Bitta
------------------------------------------------
Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Wed, Feb 13, 2013 at 12:25 PM, Michael Della Bitta
<michael.della.bi...@appinions.com> wrote:
> Matthew,
>
> With an index that small, you should be able to build a proof of
> concept on your own hardware and discover how it performs using
> something like SolrMeter:
>
> On Wed, Feb 13, 2013 at 12:21 PM, Matthew Shapiro <m...@mshapiro.net> wrote:
>> Thanks for the reply.
>>
>>> If the main amount of searches are the exact same (e.g. the empty search),
>>> the result will be cached. If 5,683 searches/month is the real count, this
>>> sounds like a very low amount of searches in a very limited corpus. Just
>>> about any machine should be fine. I guess I am missing something here.
>>> Could you elaborate a bit? How large is a document, how many do you expect
>>> to handle, what do you expect a query to look like, how should the result
>>> be presented?
>>
>> Sorry, I should clarify our current statistics. First of all, I meant 183k
>> documents (not 183, whoops). Around 100k of those are full-fledged HTML
>> articles (not web pages, but articles in our CMS with HTML content inside
>> them); the rest of the data are more like key/value records with a lot of
>> attached metadata for searching.
>>
>> Also, what I meant by search without a search term is that probably 80%
>> (hard to confirm due to the lack of stats given by the GSA) of our searches
>> are done on pure metadata clauses, without any searching through the content
>> itself. For example: "give me documents that have a content type of video,
>> that are marked for client X, have a category of Y or Z, and were published
>> to platform A, ordered by date published". The searches that do use a search
>> term use the same query as in the example before, but also ask for all the
>> documents that have the string "My Video" in their title and description.
>> From the way the GSA provides us statistics (which are pretty bare), it
>> appears that it does not count "no search term" searches as part of those
>> statistics (the GSA is not really built for searching without terms either,
>> and we've had various issues using it this way because of that).
>>
>> The reason we are using the GSA for this, and not our MSSQL database, is
>> that some of this data requires multiple, and expensive, joins, and we do
>> need full-text search for when users want to use that option. Also for
>> faceting.
>>
>>
>> On Wed, Feb 13, 2013 at 11:24 AM, Toke Eskildsen
>> <t...@statsbiblioteket.dk> wrote:
>>
>>> Matthew Shapiro [m...@mshapiro.net] wrote:
>>>
>>> [Hardware for Solr]
>>>
>>> > What type of hardware (at a high level) should I be looking for? Are the
>>> > main constraints disk I/O, memory size, processing power, etc...?
>>>
>>> That depends on what you are trying to achieve. Broadly speaking, "simple"
>>> search and retrieval is mainly I/O bound. The easy way to handle that is to
>>> use SSDs as storage. However, a lot of people like the old-school solution
>>> and compensate for the slow seeks of spinning drives by adding RAM and
>>> doing warmup of the searcher or index files. So: either SSD or RAM on the
>>> I/O side. That is, if the corpus is non-trivial in size, which brings us
>>> to...
>>>
>>> > Right now we have about 183 documents stored in the GSA (which will go
>>> > up a lot once we are on Solr since the GSA is limiting). The search
>>> > systems are used to display core information on several of our
>>> > homepages, so our search traffic is pretty significant (the GSA reports
>>> > 5,683 searches in the last month, however I am 99% sure this is not
>>> > correct and is not counting search requests without any search terms,
>>> > which consists of most of our search traffic).
>>>
>>> If the main amount of searches are the exact same (e.g. the empty search),
>>> the result will be cached. If 5,683 searches/month is the real count, this
>>> sounds like a very low amount of searches in a very limited corpus. Just
>>> about any machine should be fine. I guess I am missing something here.
>>> Could you elaborate a bit? How large is a document, how many do you expect
>>> to handle, what do you expect a query to look like, how should the result
>>> be presented?
>>>
>>> Regards,
>>> Toke Eskildsen
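[Editor's note: the warmup Toke mentions can be configured in `solrconfig.xml` with a `QuerySenderListener`, which replays representative queries whenever a searcher opens so caches are warm before user traffic arrives. A sketch, with the query values being hypothetical examples of the filter-heavy workload above:]

```xml
<!-- In solrconfig.xml: warm new searchers with representative queries.
     The fq/sort values are hypothetical; use real production queries. -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="fq">content_type:video</str>
      <str name="sort">published_date desc</str>
    </lst>
  </arr>
</listener>
```

The same listener can be registered for the `firstSearcher` event to cover cold starts, where no previous searcher's caches exist to autowarm from.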
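[Editor's note: the "no search term" workload Matthew describes maps directly onto Solr filter queries (`fq`) with a match-all `q` and facets. A minimal sketch of the request parameters for the example query in the thread, assuming hypothetical field names (`content_type`, `client`, `category`, `platform`, `published_date`) — adjust to the real schema:]

```python
# Sketch of a pure-metadata Solr search: no free-text q term, only filters.
# All field names below are hypothetical examples, not a known schema.
from urllib.parse import urlencode

def metadata_query(content_type, client, categories, platform):
    """Build Solr request params for a filter-only search with faceting."""
    params = [
        ("q", "*:*"),                                   # match everything; fq does the work
        ("fq", f"content_type:{content_type}"),          # each fq is cached independently
        ("fq", f"client:{client}"),
        ("fq", "category:(%s)" % " OR ".join(categories)),
        ("fq", f"platform:{platform}"),
        ("sort", "published_date desc"),
        ("facet", "true"),
        ("facet.field", "category"),
    ]
    return urlencode(params)

qs = metadata_query("video", "clientX", ["Y", "Z"], "A")
# Append qs to something like http://localhost:8983/solr/<core>/select?
```

Because each `fq` clause is cached separately in Solr's filter cache, repeated filter-heavy traffic like this tends to be cheap after the first hit.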
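[Editor's note: alongside SolrMeter, a few lines of Python are enough to get rough latency numbers out of a proof-of-concept index. A sketch: the actual HTTP call is injected as a callable, since the Solr URL and query set are installation-specific, which also keeps the harness itself network-free:]

```python
import time

def measure(search_fn, queries):
    """Run each query through search_fn and report simple latency stats.

    search_fn is any callable executing one search (e.g. an HTTP GET
    against /solr/<core>/select with the query string appended); it is
    passed in rather than hard-coded so the harness stays testable.
    """
    latencies = []
    for q in queries:
        start = time.perf_counter()
        search_fn(q)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    return {
        "count": len(latencies),
        "median_s": latencies[len(latencies) // 2],
        "max_s": latencies[-1],
    }

# Stand-in search function for illustration; replace with a real request
# (e.g. urllib.request.urlopen on the full query URL) for an actual test.
stats = measure(lambda q: None, ["content_type:video", "client:clientX"])
```

Running the real query mix through this before and after warmup gives a quick read on whether the box is I/O bound, per Toke's point above.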