Re: Wikipedia or reuters like index for testing facets?

Peter Wolanin Thu, 16 Jul 2009 20:33:43 -0700

AWS provides some standard data sets, including an extract of all
Wikipedia content:


http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2345&categoryID=249

Looks like it's not being updated often, so this or another AWS data
set could be a consistent basis for benchmarking?
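
If randomly generated facet fields are layered on top of such a data set
(as suggested in the quoted thread below), a fixed seed is one way to keep
them identical for everyone. The following is only a rough sketch under my
own assumptions: the function name, the field names ("category", "price")
and the default values are hypothetical and not part of Solr or Lucene
benchmark.

```python
import random


def generate_facet_docs(num_docs, seed=42, num_categories=50, price_range=(0, 1000)):
    """Return a list of dicts with pseudo-random facet values.

    All names and defaults here are illustrative assumptions.
    """
    # A private, seeded generator: the same seed yields the same sequence,
    # and the module-level random state is left untouched.
    rng = random.Random(seed)

    # num_categories controls how many unique facet values can occur.
    categories = ["cat_%d" % i for i in range(num_categories)]

    docs = []
    for doc_id in range(num_docs):
        docs.append({
            "id": doc_id,
            "category": rng.choice(categories),    # bounded number of uniques
            "price": rng.randint(*price_range),    # inclusive numeric range
        })
    return docs


if __name__ == "__main__":
    first = generate_facet_docs(1000, seed=2009)
    second = generate_facet_docs(1000, seed=2009)
    print(first == second)  # same seed, same data
```

The seed would have to be published along with the benchmark, and the
output is only guaranteed to match between runs on the same Python
version, so shipping the generated file itself may be the safer option.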

-Peter

On Wed, Jul 15, 2009 at 2:21 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
> Yeah that's what I was thinking of as an alternative, use enwiki
> and randomly generate facet data along with it. However for
> consistent benchmarking the random data would need to stay the
> same so that people could execute the same benchmark
> consistently in their own environment.
>
> On Tue, Jul 14, 2009 at 6:28 PM, Mark Miller<markrmil...@gmail.com> wrote:
>> Why don't you just randomly generate the facet data? That's probably the best
>> way, right? You can control the number of unique values and the ranges.
>>
>> On Wed, Jul 15, 2009 at 1:21 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>>
>>> Probably not as generated by the EnwikiDocMaker, but the WikipediaTokenizer
>>> in Lucene can pull out richer syntax which could then be Teed/Sinked to
>>> other fields.  Things like categories, related links, etc.  Mostly, though,
>>> I was just commenting on the fact that it isn't hard to at least use it for
>>> getting docs into Solr.
>>>
>>> -Grant
>>>
>>> On Jul 14, 2009, at 7:38 PM, Jason Rutherglen wrote:
>>>
>>>> You think enwiki has enough data for faceting?
>>>>
>>>> On Tue, Jul 14, 2009 at 2:56 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>>>>
>>>>> At a minimum, it is trivial to use the EnwikiDocMaker and then send the
>>>>> doc over SolrJ...
>>>>>
>>>>> On Jul 14, 2009, at 4:07 PM, Mark Miller wrote:
>>>>>
>>>>>> On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
>>>>>>
>>>>>>> Is there a standard index, like the one Lucene uses for contrib/benchmark,
>>>>>>> for executing faceted queries over? Or maybe we can randomly generate one
>>>>>>> that works in conjunction with Wikipedia? That way we can execute real-world
>>>>>>> queries against faceted data. Or we could use the Lucene/Solr mailing lists
>>>>>>> and other data (a la Lucid's faceted site) as a standard index?
>>>>>>>
>>>>>>>
>>>>>> I don't think there is any standard set of docs for Solr testing - there is
>>>>>> not a real benchmark contrib - though I know more than a few of us have
>>>>>> hacked up pieces of Lucene benchmark to work with Solr - I think I've done
>>>>>> it twice now ;)
>>>>>>
>>>>>> Would be nice to get things going. I was thinking the other day: I wonder
>>>>>> how hard it would be to make Lucene Benchmark generic enough to accept Solr
>>>>>> impls and Solr algs?
>>>>>>
>>>>>> It does a lot that would suck to duplicate.
>>>>>>
>>>>>> --
>>>>>> --
>>>>>> - Mark
>>>>>>
>>>>>> http://www.lucidimagination.com
>>>>>>
>>>>>
>>>>> --------------------------
>>>>> Grant Ingersoll
>>>>> http://www.lucidimagination.com/
>>>>>
>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>>>>> Solr/Lucene:
>>>>> http://www.lucidimagination.com/search
>>>>>
>>>>>
>>>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>
>>
>> --
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>>
>



-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia, Inc.
peter.wola...@acquia.com
