I saw the discussion about TeeSinkTokenFilter on java-user and was wondering how Solr implements copy fields. Couldn't Solr, by default, use a TeeSinkTokenFilter-like class for copying fields?
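My (possibly wrong) understanding is that copyField just copies the raw source value and each destination field re-analyzes it, whereas a tee/sink would analyze the text once and replay the cached tokens into the second field. Roughly something like this against the 2.9 tee/sink API (untested sketch, field names made up):

import java.io.StringReader;

import org.apache.lucene.analysis.TeeSinkTokenFilter;
import org.apache.lucene.analysis.TeeSinkTokenFilter.SinkTokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class TeeCopyFieldSketch {
  public static Document buildDoc(String text) {
    // Tokenize the text once; the tee caches token states as the main stream is consumed.
    TeeSinkTokenFilter tee =
        new TeeSinkTokenFilter(new WhitespaceTokenizer(new StringReader(text)));
    // The sink replays the cached tokens, so "body_copy" is populated without re-analysis.
    SinkTokenStream copy = tee.newSinkTokenStream();

    Document doc = new Document();
    // The field wrapping the tee must come first so it is consumed before the sink.
    doc.add(new Field("body", tee));
    doc.add(new Field("body_copy", copy));
    return doc;
  }
}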
> That link is meant to be stable for benchmarking purposes within Lucene.

The fields are different?

On Fri, Jul 17, 2009 at 9:57 AM, Grant Ingersoll <gsing...@apache.org> wrote:
> It's likely quite different. That link is meant to be stable for
> benchmarking purposes within Lucene.
>
> Note, one thing I wish I had time for:
> Hook in Tee/Sink capabilities into Solr such that one could use the
> WikipediaTokenizer and then Tee the Categories, etc. off to separate fields
> automatically for faceting, etc.
>
> -Grant
>
> On Jul 17, 2009, at 10:48 AM, Jason Rutherglen wrote:
>
>> The question that comes to mind is how it's different than
>>
>> http://people.apache.org/~gsingers/wikipedia/enwiki-20070527-pages-articles.xml.bz2
>>
>> Guess we'd need to download it and take a look!
>>
>> On Thu, Jul 16, 2009 at 8:33 PM, Peter Wolanin <peter.wola...@acquia.com> wrote:
>>>
>>> AWS provides some standard data sets, including an extract of all
>>> wikipedia content:
>>>
>>> http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2345&categoryID=249
>>>
>>> Looks like it's not being updated often, so this or another AWS data
>>> set could be a consistent basis for benchmarking?
>>>
>>> -Peter
>>>
>>> On Wed, Jul 15, 2009 at 2:21 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
>>>>
>>>> Yeah, that's what I was thinking of as an alternative: use enwiki
>>>> and randomly generate facet data along with it. However, for
>>>> consistent benchmarking the random data would need to stay the
>>>> same so that people could execute the same benchmark
>>>> consistently in their own environment.
>>>>
>>>> On Tue, Jul 14, 2009 at 6:28 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>
>>>>> Why don't you just randomly generate the facet data? That's prob the
>>>>> best way, right? You can control the uniques and ranges.
>>>>>
>>>>> On Wed, Jul 15, 2009 at 1:21 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>>>>>
>>>>>> Probably not as generated by the EnwikiDocMaker, but the
>>>>>> WikipediaTokenizer in Lucene can pull out richer syntax which could
>>>>>> then be Teed/Sinked to other fields. Things like categories, related
>>>>>> links, etc. Mostly, though, I was just commenting on the fact that it
>>>>>> isn't hard to at least use it for getting docs into Solr.
>>>>>>
>>>>>> -Grant
>>>>>>
>>>>>> On Jul 14, 2009, at 7:38 PM, Jason Rutherglen wrote:
>>>>>>
>>>>>>> You think enwiki has enough data for faceting?
>>>>>>>
>>>>>>> On Tue, Jul 14, 2009 at 2:56 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>>>>>>>
>>>>>>>> At a min, it is trivial to use the EnWikiDocMaker and then send the
>>>>>>>> doc over SolrJ...
>>>>>>>>
>>>>>>>> On Jul 14, 2009, at 4:07 PM, Mark Miller wrote:
>>>>>>>>
>>>>>>>>> On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen
>>>>>>>>> <jason.rutherg...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Is there a standard index like what Lucene uses for
>>>>>>>>>> contrib/benchmark for executing faceted queries over? Or maybe we
>>>>>>>>>> can randomly generate one that works in conjunction with wikipedia?
>>>>>>>>>> That way we can execute real world queries against faceted data. Or
>>>>>>>>>> we could use the Lucene/Solr mailing lists and other data (ala
>>>>>>>>>> Lucid's faceted site) as a standard index?
>>>>>>>>>
>>>>>>>>> I don't think there is any standard set of docs for solr testing -
>>>>>>>>> there is not a real benchmark contrib - though I know more than a
>>>>>>>>> few of us have hacked up pieces of Lucene benchmark to work with
>>>>>>>>> Solr - I think I've done it twice now ;)
>>>>>>>>>
>>>>>>>>> Would be nice to get things going. I was thinking the other day: I
>>>>>>>>> wonder how hard it would be to make Lucene Benchmark generic enough
>>>>>>>>> to accept Solr impls and Solr algs?
>>>>>>>>>
>>>>>>>>> It does a lot that would suck to duplicate.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> - Mark
>>>>>>>>>
>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>
>>>>>>>> --------------------------
>>>>>>>> Grant Ingersoll
>>>>>>>> http://www.lucidimagination.com/
>>>>>>>>
>>>>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>>>>>>>> using Solr/Lucene:
>>>>>>>> http://www.lucidimagination.com/search
>>>>>>
>>>>>> --------------------------
>>>>>> Grant Ingersoll
>>>>>> http://www.lucidimagination.com/
>>>>>>
>>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>>>>>> using Solr/Lucene:
>>>>>> http://www.lucidimagination.com/search
>>>>>
>>>>> --
>>>>> - Mark
>>>>>
>>>>> http://www.lucidimagination.com
>>>
>>> --
>>> Peter M. Wolanin, Ph.D.
>>> Momentum Specialist, Acquia. Inc.
>>> peter.wola...@acquia.com
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
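For Grant's WikipediaTokenizer idea above, I'd guess it looks something like this: tee the tokenizer output and give the sink a filter that only accepts tokens typed as categories, then index the sink as its own field for faceting. Untested sketch against the 2.9 contrib classes; the field names are just placeholders:

import java.io.StringReader;

import org.apache.lucene.analysis.TeeSinkTokenFilter;
import org.apache.lucene.analysis.TeeSinkTokenFilter.SinkFilter;
import org.apache.lucene.analysis.TeeSinkTokenFilter.SinkTokenStream;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.util.AttributeSource;
import org.apache.lucene.wikipedia.analysis.WikipediaTokenizer;

public class WikipediaCategoryTee {

  // Only let tokens the WikipediaTokenizer has typed as categories into the sink.
  static final SinkFilter CATEGORY_ONLY = new SinkFilter() {
    public boolean accept(AttributeSource source) {
      TypeAttribute type = source.addAttribute(TypeAttribute.class);
      return WikipediaTokenizer.CATEGORY.equals(type.type());
    }
  };

  public static Document buildDoc(String wikiMarkup) {
    // Run the wiki markup through the tokenizer once...
    TeeSinkTokenFilter tee =
        new TeeSinkTokenFilter(new WikipediaTokenizer(new StringReader(wikiMarkup)));
    // ...teeing the category tokens off into their own stream.
    SinkTokenStream categories = tee.newSinkTokenStream(CATEGORY_ONLY);

    Document doc = new Document();
    // The tee'd "body" field is consumed first, which feeds the "category" sink.
    doc.add(new Field("body", tee));
    doc.add(new Field("category", categories));
    return doc;
  }
}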
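And for the reproducible random facet data: if the generator uses a fixed seed (and the docs are fed in a stable order), everyone should end up with the same facet values on top of the same dump. Something along these lines over SolrJ (the URL, field names, seed, and cardinalities are made up; the real id/title/body would come from the enwiki dump):

import java.util.Random;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class RandomFacetFeeder {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    // Fixed seed so every run (and everyone running the benchmark) gets identical facet values.
    Random rnd = new Random(42L);

    for (int i = 0; i < 1000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      // In a real run the id/title/body would come from the wikipedia articles.
      doc.addField("id", "doc-" + i);
      doc.addField("title", "title " + i);
      // Low-cardinality facet field: one of 20 synthetic categories.
      doc.addField("facet_category", "cat" + rnd.nextInt(20));
      // Higher-cardinality facet field: one of 10,000 synthetic tags.
      doc.addField("facet_tag", "tag" + rnd.nextInt(10000));
      server.add(doc);
    }
    server.commit();
  }
}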