Ah, missing } ... doh! BTW, I still welcome any ideas on how to build an e-commerce test base. It doesn't have to be Amazon, that was just my approach. Anyone? Markus, your mock-XML-generator suggestion seems like the quickest route; I've put a rough sketch of it at the bottom of this mail, below the quoted thread.
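For anyone who hits the same MalformedURLException in the archives: the fix really was just the missing closing brace on the placeholder. The inner entity should read (field mappings still elided):

    <entity name="feed"
            pk="link"
            url="${amazonFeeds.rawLine}"
            processor="XPathEntityProcessor"
            forEach="/rss/channel | /rss/channel/item"
            transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">
      ...
    </entity>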
- Pulkit

On Thu, Sep 15, 2011 at 8:52 PM, Pulkit Singhal <pulkitsing...@gmail.com> wrote:
> Thanks for all the feedback thus far. Now to get a little technical about it :)
>
> I was thinking of putting all the Amazon tags that each yield roughly
> 50,000 results into a file, and then running my RSS DIH off of that. I
> came up with the following config, but something is amiss. Can someone
> please point out what is off about it?
>
> <document>
>   <entity name="amazonFeeds"
>           processor="LineEntityProcessor"
>           url="file:///xxx/yyy/zzz/amazonfeeds.txt"
>           rootEntity="false"
>           dataSource="myURIreader1"
>           transformer="RegexTransformer,DateFormatTransformer"
>           >
>     <entity name="feed"
>             pk="link"
>             url="${amazonFeeds.rawLine"
>             processor="XPathEntityProcessor"
>             forEach="/rss/channel | /rss/channel/item"
>             transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">
>       ...
>
> The rawLine should feed into the url attribute, but instead I get:
>
> Caused by: java.net.MalformedURLException: no protocol: null${amazonFeeds.rawLine
>         at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)
>
> Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2 rollback
> INFO: start rollback
>
> Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter rollback
> SEVERE: Exception while solr rollback.
>
> Thanks in advance!
>
> On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>> If we want to test with huge amounts of data, we feed portions of the
>> internet. The problem is that it takes a lot of bandwidth and a lot of
>> computing power to get to a `reasonable` size. On the positive side,
>> you deal with real text, so it's easier to tune for relevance.
>>
>> I think it's easier to create a simple XML generator with mock data:
>> prices, popularity rates, etc. It's fast to generate millions of mock
>> products, and once you have a large quantity of XML files you can
>> easily index, test, change the config or schema, and reindex.
>>
>> On the other hand, the sample data that comes with the Solr example is
>> a good set as well, as it proves the concepts nicely, especially with
>> the stock Velocity templates.
>>
>> We know Solr will handle enormous sets, but quantity is not always
>> part of a PoC.
>>
>>> Hello Everyone,
>>>
>>> I have a goal of populating Solr with a million unique products in
>>> order to create a test environment for a proof of concept. I started
>>> out by using DIH with Amazon RSS feeds, but I've quickly realized that
>>> there's no way I can glean a million products from one RSS feed. And
>>> I'd go mad if I just sat at my computer all day looking for feeds and
>>> punching them into the DIH config for Solr.
>>>
>>> Has anyone ever had to create large mock/dummy datasets for test
>>> environments or for PoCs/demos to convince folks that Solr was the
>>> wave of the future? Any tips would be greatly appreciated. I suppose
>>> it sounds a lot like crawling, even though it started out as innocent
>>> DIH usage.
>>>
>>> - Pulkit
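P.S. Here is the rough mock-product generator I mentioned up top, along the lines of what Markus described. It's only a sketch: the field names (id, name, price, popularity) and the vocabulary lists are invented, so rename them to match your actual schema. It writes a single Solr <add> XML file with a million unique products.

import random

ADJECTIVES = ["Red", "Blue", "Compact", "Deluxe", "Wireless"]
NOUNS = ["Widget", "Gadget", "Lamp", "Speaker", "Kettle"]

def field(name, value):
    # One Solr <field> element inside a <doc>.
    return '    <field name="%s">%s</field>' % (name, value)

def product(i):
    # One mock product document; the counter keeps ids and names unique.
    # Field names here are made up, adjust to your own schema.
    title = "%s %s #%d" % (random.choice(ADJECTIVES), random.choice(NOUNS), i)
    return "\n".join([
        "  <doc>",
        field("id", "MOCK-%07d" % i),
        field("name", title),
        field("price", "%.2f" % random.uniform(1, 500)),
        field("popularity", random.randint(0, 10)),
        "  </doc>",
    ])

out = open("mock-products.xml", "w")
out.write("<add>\n")
for i in range(1, 1000001):  # a million products
    out.write(product(i) + "\n")
out.write("</add>\n")
out.close()

Posting it should then just be the usual: curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-type:text/xml' --data-binary @mock-products.xml. For a file that size you'd probably want to split the output into chunks first, but that's a trivial change.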