Ah, missing } ... doh! BTW, I still welcome any ideas on how to build an e-commerce test base. It doesn't have to be Amazon, that was just my approach. Anyone? Markus, your mock-XML-generator suggestion seems like the quickest route; I've put a rough sketch of it at the bottom of this mail, below the quoted thread.
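For anyone who hits the same MalformedURLException in the archives: the fix really was just the missing closing brace on the placeholder. The inner entity should read (field mappings still elided):

    <entity name="feed"
            pk="link"
            url="${amazonFeeds.rawLine}"
            processor="XPathEntityProcessor"
            forEach="/rss/channel | /rss/channel/item"
            transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">
      ...
    </entity>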
- Pulkit

On Thu, Sep 15, 2011 at 8:52 PM, Pulkit Singhal <pulkitsing...@gmail.com> wrote:
> Thanks for all the feedback thus far. Now to get a little technical about it :)
>
> I was thinking of putting all the Amazon tags that each yield roughly
> 50,000 results into a file, and then running my RSS DIH off of that. I
> came up with the following config, but something is amiss. Can someone
> please point out what is off about it?
>
> <document>
>   <entity name="amazonFeeds"
>           processor="LineEntityProcessor"
>           url="file:///xxx/yyy/zzz/amazonfeeds.txt"
>           rootEntity="false"
>           dataSource="myURIreader1"
>           transformer="RegexTransformer,DateFormatTransformer"
>           >
>     <entity name="feed"
>             pk="link"
>             url="${amazonFeeds.rawLine"
>             processor="XPathEntityProcessor"
>             forEach="/rss/channel | /rss/channel/item"
>             transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">
>       ...
>
> The rawLine should feed into the url attribute, but instead I get:
>
> Caused by: java.net.MalformedURLException: no protocol: null${amazonFeeds.rawLine
>         at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)
>
> Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2 rollback
> INFO: start rollback
>
> Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter rollback
> SEVERE: Exception while solr rollback.
>
> Thanks in advance!
>
> On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>> If we want to test with huge amounts of data, we feed portions of the
>> internet. The problem is that it takes a lot of bandwidth and a lot of
>> computing power to get to a `reasonable` size. On the positive side,
>> you deal with real text, so it's easier to tune for relevance.
>>
>> I think it's easier to create a simple XML generator with mock data:
>> prices, popularity rates, etc. It's fast to generate millions of mock
>> products, and once you have a large quantity of XML files you can
>> easily index, test, change the config or schema, and reindex.
>>
>> On the other hand, the sample data that comes with the Solr example is
>> a good set as well, as it proves the concepts nicely, especially with
>> the stock Velocity templates.
>>
>> We know Solr will handle enormous sets, but quantity is not always
>> part of a PoC.
>>
>>> Hello Everyone,
>>>
>>> I have a goal of populating Solr with a million unique products in
>>> order to create a test environment for a proof of concept. I started
>>> out by using DIH with Amazon RSS feeds, but I've quickly realized that
>>> there's no way I can glean a million products from one RSS feed. And
>>> I'd go mad if I just sat at my computer all day looking for feeds and
>>> punching them into the DIH config for Solr.
>>>
>>> Has anyone ever had to create large mock/dummy datasets for test
>>> environments or for PoCs/demos to convince folks that Solr was the
>>> wave of the future? Any tips would be greatly appreciated. I suppose
>>> it sounds a lot like crawling, even though it started out as innocent
>>> DIH usage.
>>>
>>> - Pulkit
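P.S. Here is the rough mock-product generator I mentioned up top, along the lines of what Markus described. It's only a sketch: the field names (id, name, price, popularity) and the vocabulary lists are invented, so rename them to match your actual schema. It writes a single Solr <add> XML file with a million unique products.

import random

ADJECTIVES = ["Red", "Blue", "Compact", "Deluxe", "Wireless"]
NOUNS = ["Widget", "Gadget", "Lamp", "Speaker", "Kettle"]

def field(name, value):
    # One Solr <field> element inside a <doc>.
    return '    <field name="%s">%s</field>' % (name, value)

def product(i):
    # One mock product document; the counter keeps ids and names unique.
    # Field names here are made up, adjust to your own schema.
    title = "%s %s #%d" % (random.choice(ADJECTIVES), random.choice(NOUNS), i)
    return "\n".join([
        "  <doc>",
        field("id", "MOCK-%07d" % i),
        field("name", title),
        field("price", "%.2f" % random.uniform(1, 500)),
        field("popularity", random.randint(0, 10)),
        "  </doc>",
    ])

out = open("mock-products.xml", "w")
out.write("<add>\n")
for i in range(1, 1000001):  # a million products
    out.write(product(i) + "\n")
out.write("</add>\n")
out.close()

Posting it should then just be the usual: curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-type:text/xml' --data-binary @mock-products.xml. For a file that size you'd probably want to split the output into chunks first, but that's a trivial change.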