Thanks for all the feedback thus far. Now to get a little technical about it :)

I was thinking of putting all of the Amazon tag feeds that yield roughly
50,000 results each into a file and then running my RSS DIH off of that
(one feed URL per line; a sample of the file is shown after the config).
I came up with the following config, but something is amiss; can someone
please point out what is off about it?

    <document>
        <entity name="amazonFeeds"
                processor="LineEntityProcessor"
                url="file:///xxx/yyy/zzz/amazonfeeds.txt"
                rootEntity="false"
                dataSource="myURIreader1"
                transformer="RegexTransformer,DateFormatTransformer"
                >
            <entity name="feed"
                    pk="link"
                    url="${amazonFeeds.rawLine"
                    processor="XPathEntityProcessor"
                    forEach="/rss/channel | /rss/channel/item"
                    transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">
...
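
For reference, each line of amazonfeeds.txt is just a single feed URL,
something along these lines (the tag names are made up, and the exact URL
pattern may differ):

    http://www.amazon.com/rss/tag/blu-ray/new
    http://www.amazon.com/rss/tag/science-fiction/new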

The rawLine value should feed into the url attribute, but instead I get:

Caused by: java.net.MalformedURLException: no protocol:
null${amazonFeeds.rawLine
        at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)

Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback

Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter rollback
SEVERE: Exception while solr rollback.

Thanks in advance!

On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma
<markus.jel...@openindex.io> wrote:
> If we want to test with huge amounts of data, we feed in portions of the internet.
> The problem is that it takes a lot of bandwidth and a lot of computing power to get
> to a `reasonable` size. On the positive side, you deal with real text, so it's
> easier to tune for relevance.
>
> I think it's easier to create a simple XML generator with mock data, prices,
> popularity ratings, etc. It's fast to generate millions of mock products, and once
> you have a large quantity of XML files, you can easily index, test, change the
> config or schema, and reindex.
>
> On the other hand, the sample data that comes with the Solr example is a good
> set as well, since it demonstrates the concepts nicely, especially with the stock
> Velocity templates.
>
> We know Solr will handle enormous sets, but quantity is not always part of a
> PoC.
>
>> Hello Everyone,
>>
>> I have a goal of populating Solr with a million unique products in
>> order to create a test environment for a proof of concept. I started
>> out by using DIH with Amazon RSS feeds but I've quickly realized that
>> there's no way I can glean a million products from one RSS feed. And
>> I'd go mad if I just sat at my computer all day looking for feeds and
>> punching them into DIH config for Solr.
>>
>> Has anyone ever had to create large mock/dummy datasets for test
>> environments or for POCs/Demos to convince folks that Solr was the
>> wave of the future? Any tips would be greatly appreciated. I suppose
>> it sounds a lot like crawling even though it started out as innocent
>> DIH usage.
>>
>> - Pulkit
>
