The question that comes to mind is how it's different from
http://people.apache.org/~gsingers/wikipedia/enwiki-20070527-pages-articles.xml.bz2

Guess we'd need to download it and take a look!
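
On the randomly generated facet data idea quoted below: one way the data could stay the same for everyone is to derive it from a fixed seed. This is only a minimal sketch. The make_facets name, the category and price fields, and the default seed are all hypothetical; none of it comes from Lucene's contrib/benchmark or from Solr.

```python
import random


def make_facets(num_docs, seed=42, num_categories=50, price_range=(0, 1000)):
    """Yield (doc_id, category, price) tuples derived from a fixed seed.

    All field names and defaults are illustrative assumptions.
    """
    # A private generator keeps the sequence independent of any other
    # code that uses the global random state.
    rng = random.Random(seed)
    low, high = price_range
    for doc_id in range(num_docs):
        # num_categories caps the number of unique facet values.
        category = "cat_%d" % rng.randrange(num_categories)
        # price_range controls the numeric range (inclusive on both ends).
        price = rng.randint(low, high)
        yield doc_id, category, price


# Two runs with the same seed produce identical facet data.
first = list(make_facets(1000))
second = list(make_facets(1000))
```

Seeded output from some random methods can differ between Python versions, so publishing the generated file alongside the seed is probably the safer route for a shared benchmark.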

On Thu, Jul 16, 2009 at 8:33 PM, Peter Wolanin<peter.wola...@acquia.com> wrote:
> AWS provides some standard data sets, including an extract of all
> wikipedia content:
>
> http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2345&categoryID=249
>
> Looks like it's not being updated often, so this or another AWS data
> set could be a consistent basis for benchmarking?
>
> -Peter
>
> On Wed, Jul 15, 2009 at 2:21 PM, Jason
> Rutherglen<jason.rutherg...@gmail.com> wrote:
>> Yeah that's what I was thinking of as an alternative, use enwiki
>> and randomly generate facet data along with it. However for
>> consistent benchmarking the random data would need to stay the
>> same so that people could execute the same benchmark
>> consistently in their own environment.
>>
>> On Tue, Jul 14, 2009 at 6:28 PM, Mark Miller<markrmil...@gmail.com> wrote:
>>> Why don't you just randomly generate the facet data? That's probably the best way,
>>> right? You can control the uniques and ranges.
>>>
>>> On Wed, Jul 15, 2009 at 1:21 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>>>
>>>> Probably not as generated by the EnwikiDocMaker, but the WikipediaTokenizer
>>>> in Lucene can pull out richer syntax which could then be Teed/Sinked to
>>>> other fields.  Things like categories, related links, etc.  Mostly, though,
>>>> I was just commenting on the fact that it isn't hard to at least use it for
>>>> getting docs into Solr.
>>>>
>>>> -Grant
>>>>
>>>> On Jul 14, 2009, at 7:38 PM, Jason Rutherglen wrote:
>>>>
>>>>> You think enwiki has enough data for faceting?
>>>>>
>>>>> On Tue, Jul 14, 2009 at 2:56 PM, Grant Ingersoll<gsing...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> At a min, it is trivial to use the EnwikiDocMaker and then send the doc
>>>>>> over SolrJ...
>>>>>>
>>>>>> On Jul 14, 2009, at 4:07 PM, Mark Miller wrote:
>>>>>>
>>>>>>> On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Is there a standard index like what Lucene uses for contrib/benchmark for
>>>>>>>> executing faceted queries over? Or maybe we can randomly generate one that
>>>>>>>> works in conjunction with wikipedia? That way we can execute real world
>>>>>>>> queries against faceted data. Or we could use the Lucene/Solr mailing lists
>>>>>>>> and other data (ala Lucid's faceted site) as a standard index?
>>>>>>>>
>>>>>>>>
>>>>>>> I don't think there is any standard set of docs for Solr testing - there is
>>>>>>> not a real benchmark contrib - though I know more than a few of us have
>>>>>>> hacked up pieces of Lucene benchmark to work with Solr - I think I've done
>>>>>>> it twice now ;)
>>>>>>>
>>>>>>> Would be nice to get things going. I was thinking the other day: I wonder
>>>>>>> how hard it would be to make Lucene Benchmark generic enough to accept Solr
>>>>>>> impls and Solr algs?
>>>>>>>
>>>>>>> It does a lot that would suck to duplicate.
>>>>>>>
>>>>>>> --
>>>>>>> --
>>>>>>> - Mark
>>>>>>>
>>>>>>> http://www.lucidimagination.com
>>>>>>>
>>>>>>
>>>>>> --------------------------
>>>>>> Grant Ingersoll
>>>>>> http://www.lucidimagination.com/
>>>>>>
>>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>>>>>> Solr/Lucene:
>>>>>> http://www.lucidimagination.com/search
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>
> --
> Peter M. Wolanin, Ph.D.
> Momentum Specialist, Acquia, Inc.
> peter.wola...@acquia.com
>
