Thanks for all the feedback thus far. Now to get a little technical about it :)
I was thinking of putting all of the Amazon tags that yield roughly 50,000
results each into a file and then running my RSS DIH off of that. I came up
with the following config, but something is amiss; can someone please point
out what is off about it?

<document>
  <entity name="amazonFeeds"
          processor="LineEntityProcessor"
          url="file:///xxx/yyy/zzz/amazonfeeds.txt"
          rootEntity="false"
          dataSource="myURIreader1"
          transformer="RegexTransformer,DateFormatTransformer">
    <entity name="feed" pk="link"
            url="${amazonFeeds.rawLine"
            processor="XPathEntityProcessor"
            forEach="/rss/channel | /rss/channel/item"
            transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">
...

The rawLine should feed into the url attribute, but instead I get:

Caused by: java.net.MalformedURLException: no protocol: null${amazonFeeds.rawLine
        at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)
Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter rollback
SEVERE: Exception while solr rollback.

(My own guess at the culprit is in the PS below.)

Thanks in advance!

On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma
<markus.jel...@openindex.io> wrote:
> If we want to test with huge amounts of data, we feed portions of the
> internet. The problem is that it takes a lot of bandwidth and lots of
> computing power to get to a `reasonable` size. On the positive side, you
> deal with real text, so it's easier to tune for relevance.
>
> I think it's easier to create a simple XML generator with mock data,
> prices, popularity rates, etc. It's fast to generate millions of mock
> products, and once you have a large quantity of XML files you can easily
> index, test, change config or schema, and reindex.
>
> On the other hand, the sample data that comes with the Solr example is a
> good set as well, and it proves the concepts nicely, especially with the
> stock Velocity templates.
>
> We know Solr will handle enormous sets, but quantity is not always part
> of a PoC.
>
>> Hello Everyone,
>>
>> I have a goal of populating Solr with a million unique products in
>> order to create a test environment for a proof of concept. I started
>> out by using DIH with Amazon RSS feeds, but I quickly realized that
>> there's no way I can glean a million products from one RSS feed. And
>> I'd go mad if I just sat at my computer all day looking for feeds and
>> punching them into the DIH config for Solr.
>>
>> Has anyone ever had to create large mock/dummy datasets for test
>> environments or for POCs/demos to convince folks that Solr was the
>> wave of the future? Any tips would be greatly appreciated. I suppose
>> it sounds a lot like crawling, even though it started out as innocent
>> DIH usage.
>>
>> - Pulkit
>
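PS: Staring at the stack trace again, the literal, unresolved
${amazonFeeds.rawLine in the exception makes me suspect the placeholder is
simply missing its closing brace, i.e. that the inner entity should read

    url="${amazonFeeds.rawLine}"

but I haven't verified that this is the only thing wrong, so other eyes are
still welcome.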
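PPS: Markus, just to make sure I follow the XML generator idea, is something
like this what you had in mind? A rough sketch; the field names (id, name,
price, popularity) are made up here and would of course have to match the
schema:

    # generate_mock_products.py - write Solr <add> XML full of mock products
    import random

    FIELD = '<field name="%s">%s</field>'

    def doc(i):
        # one mock product per <doc>; values are random but plausible
        return '<doc>%s%s%s%s</doc>' % (
            FIELD % ('id', 'prod-%07d' % i),
            FIELD % ('name', 'Mock product %d' % i),
            FIELD % ('price', '%.2f' % random.uniform(1, 500)),
            FIELD % ('popularity', random.randint(0, 10)),
        )

    with open('mock-products.xml', 'w') as out:
        out.write('<add>\n')
        for i in range(1000000):
            out.write(doc(i) + '\n')
        out.write('</add>\n')

I'd probably split the output into several smaller files before posting them
to /update rather than generating one giant XML.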
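(And noted on the stock example data; if I understand the setup correctly,
the bundled docs can be indexed with something like

    cd example/exampledocs
    java -jar post.jar *.xml

and then browsed through the Velocity templates at
http://localhost:8983/solr/browse.)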