I saw the discussion about TeeSinkTokenFilter on java-user and was wondering how Solr implements copy fields. Couldn't Solr, by default, use a TeeSinkTokenFilter-like class for copying fields?
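My (possibly wrong) understanding is that copyField just copies the raw source value and each destination field re-analyzes it, whereas a tee/sink would analyze the text once and replay the cached tokens into the second field. Roughly something like this against the 2.9 tee/sink API (untested sketch, field names made up):

import java.io.StringReader;

import org.apache.lucene.analysis.TeeSinkTokenFilter;
import org.apache.lucene.analysis.TeeSinkTokenFilter.SinkTokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class TeeCopyFieldSketch {
  public static Document buildDoc(String text) {
    // Tokenize the text once; the tee caches token states as the main stream is consumed.
    TeeSinkTokenFilter tee =
        new TeeSinkTokenFilter(new WhitespaceTokenizer(new StringReader(text)));
    // The sink replays the cached tokens, so "body_copy" is populated without re-analysis.
    SinkTokenStream copy = tee.newSinkTokenStream();

    Document doc = new Document();
    // The field wrapping the tee must come first so it is consumed before the sink.
    doc.add(new Field("body", tee));
    doc.add(new Field("body_copy", copy));
    return doc;
  }
}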
> That link is meant to be stable for benchmarking purposes within Lucene.

The fields are different?

On Fri, Jul 17, 2009 at 9:57 AM, Grant Ingersoll <gsing...@apache.org> wrote:
> It's likely quite different. That link is meant to be stable for
> benchmarking purposes within Lucene.
>
> Note, one thing I wish I had time for:
> Hook in Tee/Sink capabilities into Solr such that one could use the
> WikipediaTokenizer and then Tee the Categories, etc. off to separate fields
> automatically for faceting, etc.
>
> -Grant
>
> On Jul 17, 2009, at 10:48 AM, Jason Rutherglen wrote:
>
>> The question that comes to mind is how it's different than
>>
>> http://people.apache.org/~gsingers/wikipedia/enwiki-20070527-pages-articles.xml.bz2
>>
>> Guess we'd need to download it and take a look!
>>
>> On Thu, Jul 16, 2009 at 8:33 PM, Peter Wolanin <peter.wola...@acquia.com> wrote:
>>>
>>> AWS provides some standard data sets, including an extract of all
>>> wikipedia content:
>>>
>>> http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2345&categoryID=249
>>>
>>> Looks like it's not being updated often, so this or another AWS data
>>> set could be a consistent basis for benchmarking?
>>>
>>> -Peter
>>>
>>> On Wed, Jul 15, 2009 at 2:21 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
>>>>
>>>> Yeah, that's what I was thinking of as an alternative: use enwiki
>>>> and randomly generate facet data along with it. However, for
>>>> consistent benchmarking the random data would need to stay the
>>>> same so that people could execute the same benchmark
>>>> consistently in their own environment.
>>>>
>>>> On Tue, Jul 14, 2009 at 6:28 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>
>>>>> Why don't you just randomly generate the facet data? That's prob the
>>>>> best way, right? You can control the uniques and ranges.
>>>>>
>>>>> On Wed, Jul 15, 2009 at 1:21 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>>>>>
>>>>>> Probably not as generated by the EnwikiDocMaker, but the
>>>>>> WikipediaTokenizer in Lucene can pull out richer syntax which could
>>>>>> then be Teed/Sinked to other fields. Things like categories, related
>>>>>> links, etc. Mostly, though, I was just commenting on the fact that it
>>>>>> isn't hard to at least use it for getting docs into Solr.
>>>>>>
>>>>>> -Grant
>>>>>>
>>>>>> On Jul 14, 2009, at 7:38 PM, Jason Rutherglen wrote:
>>>>>>
>>>>>>> You think enwiki has enough data for faceting?
>>>>>>>
>>>>>>> On Tue, Jul 14, 2009 at 2:56 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>>>>>>>
>>>>>>>> At a min, it is trivial to use the EnWikiDocMaker and then send the
>>>>>>>> doc over SolrJ...
>>>>>>>>
>>>>>>>> On Jul 14, 2009, at 4:07 PM, Mark Miller wrote:
>>>>>>>>
>>>>>>>>> On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen
>>>>>>>>> <jason.rutherg...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Is there a standard index like what Lucene uses for
>>>>>>>>>> contrib/benchmark for executing faceted queries over? Or maybe we
>>>>>>>>>> can randomly generate one that works in conjunction with wikipedia?
>>>>>>>>>> That way we can execute real world queries against faceted data. Or
>>>>>>>>>> we could use the Lucene/Solr mailing lists and other data (ala
>>>>>>>>>> Lucid's faceted site) as a standard index?
>>>>>>>>>
>>>>>>>>> I don't think there is any standard set of docs for solr testing -
>>>>>>>>> there is not a real benchmark contrib - though I know more than a
>>>>>>>>> few of us have hacked up pieces of Lucene benchmark to work with
>>>>>>>>> Solr - I think I've done it twice now ;)
>>>>>>>>>
>>>>>>>>> Would be nice to get things going. I was thinking the other day: I
>>>>>>>>> wonder how hard it would be to make Lucene Benchmark generic enough
>>>>>>>>> to accept Solr impls and Solr algs?
>>>>>>>>>
>>>>>>>>> It does a lot that would suck to duplicate.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> - Mark
>>>>>>>>>
>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>
>>>>>>>> --------------------------
>>>>>>>> Grant Ingersoll
>>>>>>>> http://www.lucidimagination.com/
>>>>>>>>
>>>>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>>>>>>>> using Solr/Lucene:
>>>>>>>> http://www.lucidimagination.com/search
>>>>>>
>>>>>> --------------------------
>>>>>> Grant Ingersoll
>>>>>> http://www.lucidimagination.com/
>>>>>>
>>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>>>>>> using Solr/Lucene:
>>>>>> http://www.lucidimagination.com/search
>>>>>
>>>>> --
>>>>> - Mark
>>>>>
>>>>> http://www.lucidimagination.com
>>>
>>> --
>>> Peter M. Wolanin, Ph.D.
>>> Momentum Specialist, Acquia. Inc.
>>> peter.wola...@acquia.com
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
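For Grant's WikipediaTokenizer idea above, I'd guess it looks something like this: tee the tokenizer output and give the sink a filter that only accepts tokens typed as categories, then index the sink as its own field for faceting. Untested sketch against the 2.9 contrib classes; the field names are just placeholders:

import java.io.StringReader;

import org.apache.lucene.analysis.TeeSinkTokenFilter;
import org.apache.lucene.analysis.TeeSinkTokenFilter.SinkFilter;
import org.apache.lucene.analysis.TeeSinkTokenFilter.SinkTokenStream;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.util.AttributeSource;
import org.apache.lucene.wikipedia.analysis.WikipediaTokenizer;

public class WikipediaCategoryTee {

  // Only let tokens the WikipediaTokenizer has typed as categories into the sink.
  static final SinkFilter CATEGORY_ONLY = new SinkFilter() {
    public boolean accept(AttributeSource source) {
      TypeAttribute type = source.addAttribute(TypeAttribute.class);
      return WikipediaTokenizer.CATEGORY.equals(type.type());
    }
  };

  public static Document buildDoc(String wikiMarkup) {
    // Run the wiki markup through the tokenizer once...
    TeeSinkTokenFilter tee =
        new TeeSinkTokenFilter(new WikipediaTokenizer(new StringReader(wikiMarkup)));
    // ...teeing the category tokens off into their own stream.
    SinkTokenStream categories = tee.newSinkTokenStream(CATEGORY_ONLY);

    Document doc = new Document();
    // The tee'd "body" field is consumed first, which feeds the "category" sink.
    doc.add(new Field("body", tee));
    doc.add(new Field("category", categories));
    return doc;
  }
}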
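And for the reproducible random facet data: if the generator uses a fixed seed (and the docs are fed in a stable order), everyone should end up with the same facet values on top of the same dump. Something along these lines over SolrJ (the URL, field names, seed, and cardinalities are made up; the real id/title/body would come from the enwiki dump):

import java.util.Random;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class RandomFacetFeeder {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    // Fixed seed so every run (and everyone running the benchmark) gets identical facet values.
    Random rnd = new Random(42L);

    for (int i = 0; i < 1000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      // In a real run the id/title/body would come from the wikipedia articles.
      doc.addField("id", "doc-" + i);
      doc.addField("title", "title " + i);
      // Low-cardinality facet field: one of 20 synthetic categories.
      doc.addField("facet_category", "cat" + rnd.nextInt(20));
      // Higher-cardinality facet field: one of 10,000 synthetic tags.
      doc.addField("facet_tag", "tag" + rnd.nextInt(10000));
      server.add(doc);
    }
    server.commit();
  }
}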