AWS provides some standard data sets, including an extract of all Wikipedia content:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2345&categoryID=249

Looks like it's not being updated often, so this or another AWS data
set could be a consistent basis for benchmarking?

-Peter

On Wed, Jul 15, 2009 at 2:21 PM, Jason Rutherglen<jason.rutherg...@gmail.com> wrote:
> Yeah that's what I was thinking of as an alternative, use enwiki
> and randomly generate facet data along with it. However for
> consistent benchmarking the random data would need to stay the
> same so that people could execute the same benchmark
> consistently in their own environment.
>
> On Tue, Jul 14, 2009 at 6:28 PM, Mark Miller<markrmil...@gmail.com> wrote:
>> Why don't you just randomly generate the facet data? Thats prob the best way
>> right? You can control the uniques and ranges.
>>
>> On Wed, Jul 15, 2009 at 1:21 AM, Grant Ingersoll <gsing...@apache.org>wrote:
>>
>>> Probably not as generated by the EnwikiDocMaker, but the WikipediaTokenizer
>>> in Lucene can pull out richer syntax which could then be Teed/Sinked to
>>> other fields. Things like categories, related links, etc. Mostly, though,
>>> I was just commenting on the fact that it isn't hard to at least use it for
>>> getting docs into Solr.
>>>
>>> -Grant
>>>
>>> On Jul 14, 2009, at 7:38 PM, Jason Rutherglen wrote:
>>>
>>>> You think enwiki has enough data for faceting?
>>>>
>>>> On Tue, Jul 14, 2009 at 2:56 PM, Grant Ingersoll<gsing...@apache.org> wrote:
>>>>
>>>>> At a min, it is trivial to use the EnWikiDocMaker and then send the doc
>>>>> over SolrJ...
>>>>>
>>>>> On Jul 14, 2009, at 4:07 PM, Mark Miller wrote:
>>>>>
>>>>>> On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
>>>>>>
>>>>>>> Is there a standard index like what Lucene uses for contrib/benchmark for
>>>>>>> executing faceted queries over? Or maybe we can randomly generate one that
>>>>>>> works in conjunction with wikipedia? That way we can execute real world
>>>>>>> queries against faceted data. Or we could use the Lucene/Solr mailing lists
>>>>>>> and other data (ala Lucid's faceted site) as a standard index?
>>>>>>>
>>>>>> I don't think there is any standard set of docs for solr testing - there is
>>>>>> not a real benchmark contrib - though I know more than a few of us have
>>>>>> hacked up pieces of Lucene benchmark to work with Solr - I think I've done
>>>>>> it twice now ;)
>>>>>>
>>>>>> Would be nice to get things going. I was thinking the other day: I wonder
>>>>>> how hard it would be to make Lucene Benchmark generic enough to accept Solr
>>>>>> impls and Solr algs?
>>>>>>
>>>>>> It does a lot that would suck to duplicate.
>>>>>>
>>>>>> --
>>>>>> --
>>>>>> - Mark
>>>>>>
>>>>>> http://www.lucidimagination.com
>>>>>>
>>>>>
>>>>> --------------------------
>>>>> Grant Ingersoll
>>>>> http://www.lucidimagination.com/
>>>>>
>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>>>>> Solr/Lucene:
>>>>> http://www.lucidimagination.com/search
>>>>>
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>>> Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>
>>
>> --
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>>
>

--
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia. Inc.
peter.wola...@acquia.com
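
For what it's worth, the "random data would need to stay the same" requirement from the quoted thread could be met with a fixed seed rather than a shipped data file. Below is a rough sketch only, not anything taken from Lucene's contrib/benchmark or Solr: the facet field names, the cardinalities, the base seed, and the function name `facets_for` are all made up for illustration. It derives each document's facet values from a seed plus the document id, so the values do not depend on the order in which documents are processed.

```python
import random
import zlib

# Hypothetical facet fields and number of unique values per field,
# chosen only for illustration ("control the uniques and ranges").
FACET_CARDINALITY = {"category": 50, "author": 1000, "year": 20}


def facets_for(doc_id: str, base_seed: int = 2009) -> dict:
    """Return reproducible pseudo-random facet values for one document."""
    # crc32 is stable across runs and machines, unlike the built-in
    # hash() on strings, which is randomized per process by default.
    rng = random.Random(base_seed ^ zlib.crc32(doc_id.encode("utf-8")))
    return {
        field: f"{field}_{rng.randrange(unique_values)}"
        for field, unique_values in FACET_CARDINALITY.items()
    }


if __name__ == "__main__":
    # The same title and seed yield the same facet values on every run.
    print(facets_for("Anarchism"))
```

Anyone running the benchmark with the same seed, the same document ids (for example enwiki titles), and the same Python version should get the same facet values, which is the property asked for above. Whether the generated distribution resembles real-world facet data is a separate question.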