The question that comes to mind is how it differs from http://people.apache.org/~gsingers/wikipedia/enwiki-20070527-pages-articles.xml.bz2
Guess we'd need to download it and take a look!

On Thu, Jul 16, 2009 at 8:33 PM, Peter Wolanin<peter.wola...@acquia.com> wrote:
> AWS provides some standard data sets, including an extract of all
> wikipedia content:
>
> http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2345&categoryID=249
>
> Looks like it's not being updated often, so this or another AWS data
> set could be a consistent basis for benchmarking?
>
> -Peter
>
> On Wed, Jul 15, 2009 at 2:21 PM, Jason
> Rutherglen<jason.rutherg...@gmail.com> wrote:
>> Yeah that's what I was thinking of as an alternative, use enwiki
>> and randomly generate facet data along with it. However for
>> consistent benchmarking the random data would need to stay the
>> same so that people could execute the same benchmark
>> consistently in their own environment.
>>
>> On Tue, Jul 14, 2009 at 6:28 PM, Mark Miller<markrmil...@gmail.com> wrote:
>>> Why don't you just randomly generate the facet data? Thats prob the best way
>>> right? You can control the uniques and ranges.
>>>
>>> On Wed, Jul 15, 2009 at 1:21 AM, Grant Ingersoll <gsing...@apache.org>wrote:
>>>
>>>> Probably not as generated by the EnwikiDocMaker, but the WikipediaTokenizer
>>>> in Lucene can pull out richer syntax which could then be Teed/Sinked to
>>>> other fields. Things like categories, related links, etc. Mostly, though,
>>>> I was just commenting on the fact that it isn't hard to at least use it for
>>>> getting docs into Solr.
>>>>
>>>> -Grant
>>>>
>>>> On Jul 14, 2009, at 7:38 PM, Jason Rutherglen wrote:
>>>>
>>>> You think enwiki has enough data for faceting?
>>>>>
>>>>> On Tue, Jul 14, 2009 at 2:56 PM, Grant Ingersoll<gsing...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> At a min, it is trivial to use the EnWikiDocMaker and then send the doc
>>>>>> over SolrJ...
>>>>>>
>>>>>> On Jul 14, 2009, at 4:07 PM, Mark Miller wrote:
>>>>>>
>>>>>> On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen <
>>>>>>> jason.rutherg...@gmail.com> wrote:
>>>>>>>
>>>>>>> Is there a standard index like what Lucene uses for contrib/benchmark
>>>>>>>> for executing faceted queries over? Or maybe we can randomly generate one
>>>>>>>> that works in conjunction with wikipedia? That way we can execute real world
>>>>>>>> queries against faceted data. Or we could use the Lucene/Solr mailing
>>>>>>>> lists and other data (ala Lucid's faceted site) as a standard index?
>>>>>>>>
>>>>>>> I don't think there is any standard set of docs for solr testing - there
>>>>>>> is not a real benchmark contrib - though I know more than a few of us have
>>>>>>> hacked up pieces of Lucene benchmark to work with Solr - I think I've
>>>>>>> done it twice now ;)
>>>>>>>
>>>>>>> Would be nice to get things going. I was thinking the other day: I
>>>>>>> wonder how hard it would be to make Lucene Benchmark generic enough to accept
>>>>>>> Solr impls and Solr algs?
>>>>>>>
>>>>>>> It does a lot that would suck to duplicate.
>>>>>>>
>>>>>>> --
>>>>>>> --
>>>>>>> - Mark
>>>>>>>
>>>>>>> http://www.lucidimagination.com
>>>>>>>
>>>>>>
>>>>>> --------------------------
>>>>>> Grant Ingersoll
>>>>>> http://www.lucidimagination.com/
>>>>>>
>>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>>>>>> Solr/Lucene:
>>>>>> http://www.lucidimagination.com/search
>>>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://www.lucidimagination.com/
>>>>
>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>>>> Solr/Lucene:
>>>> http://www.lucidimagination.com/search
>>>>
>>>
>>> --
>>> --
>>> - Mark
>>>
>>> http://www.lucidimagination.com
>>>
>
> --
> Peter M. Wolanin, Ph.D.
> Momentum Specialist, Acquia. Inc.
> peter.wola...@acquia.com
>
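
For what it's worth, the seeded-random idea from the quoted thread (generate facet data randomly, but keep it the same for everyone) could look roughly like the sketch below. This is only an illustration in Python, not anything from contrib/benchmark or Solr: the function name `generate_facets`, the field names `category` and `price`, and the default counts and ranges are all made up for the example.

```python
import random


def generate_facets(num_docs, seed=42, num_categories=50, price_range=(0, 1000)):
    """Yield (doc_id, facets) pairs with pseudo-random facet values.

    Hypothetical helper: field names and defaults are assumptions,
    not part of any Lucene or Solr API.
    """
    # A private generator with a fixed seed, so repeated runs produce
    # the same sequence and do not disturb the global random state.
    rng = random.Random(seed)
    # num_categories controls the number of unique facet values.
    categories = [f"cat_{i}" for i in range(num_categories)]
    for doc_id in range(num_docs):
        yield doc_id, {
            "category": rng.choice(categories),
            # price_range controls the numeric range (inclusive bounds).
            "price": rng.randint(*price_range),
        }


if __name__ == "__main__":
    for doc_id, facets in generate_facets(3):
        print(doc_id, facets)
```

The facet values could then be attached to each enwiki document before indexing. A fixed seed only guarantees identical output for the same generator implementation, so for a shared benchmark it may be safer to publish the generated data file itself rather than rely on everyone regenerating it.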