On Tue, Nov 4, 2008 at 1:31 AM, Lance Norskog <[EMAIL PROTECTED]> wrote:
> Thank you for the "rootEntity" tip. Does this mean that the inner loop only
> walks the first item and breaks out of the loop? This is very good because it
> allows me to drill down a few levels without downloading 10,000 feeds.
> (Public API sites tend to dislike this behavior :)
>
Nope. It goes through each item in the inner loop and creates one document for each item.

> The URL is wrong because the streaming parser is iterating past the end of
> the element entries. It is an off-by-one bug of some sort in the DIH code.
>
> Thanks,
>
> Lance
>
> -----Original Message-----
> From: Noble Paul നോബിള് नोब्ळ् [mailto:[EMAIL PROTECTED]
> Sent: Saturday, November 01, 2008 7:44 PM
> To: solr-user@lucene.apache.org
> Subject: Re: DIH Http input bug - problem with two-level RSS walker
>
> If you wish to create 1 doc per inner entity, then set rootEntity="false" for
> the entity outer.
> The exception is because the url is wrong
>
> On Sat, Nov 1, 2008 at 10:30 AM, Lance Norskog <[EMAIL PROTECTED]> wrote:
>> I wrote a nested HttpDataSource RSS poller. The outer loop reads an
>> rss feed which contains N links to other rss feeds. The nested loop
>> then reads each one of those to create documents. (Yes, this is an
>> obnoxious thing to do.) Let's say the outer RSS feed gives 10 items.
>> Both feeds use the same
>> structure: /rss/channel with a <title> node and then N <item> nodes
>> inside the channel. This should create two separate XML streams with
>> two separate Xpath iterators, right?
>>
>> <entity name="outer" http stuff>
>>   <field column="name" xpath="/rss/channel/title" />
>>   <field column="url" xpath="/rss/channel/item/link"/>
>>
>>   <entity name="inner" http stuff url="${outer.url}" pk="title" >
>>     <field column="title" xpath="/rss/channel/item/title" />
>>   </entity>
>> </entity>
>>
>> This does indeed walk each url from the outer feed and then fetch the
>> inner rss feed. Bravo!
>>
>> However, I found two separate problems in xpath iteration. They may be
>> related. The first problem is that it only stores the first document
>> from each "inner" feed. Each feed has several documents with different
>> title fields but it only grabs the first.
>>
>> The other is an off-by-one bug.
>> The outer loop iterates through the 10
>> items and then tries to pull an 11th. It then gives this exception trace:
>>
>> INFO: Created URL to: [inner url]
>> Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.HttpDataSource getData
>> SEVERE: Exception thrown while getting data
>> java.net.MalformedURLException: no protocol: null/account.rss
>>     at java.net.URL.<init>(URL.java:567)
>>     at java.net.URL.<init>(URL.java:464)
>>     at java.net.URL.<init>(URL.java:413)
>>     at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:90)
>>     at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
>>     at org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:183)
>>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:210)
>>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:180)
>>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:285)
>>     ...
>> Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
>> SEVERE: Exception while processing: album document : SolrInputDocumnt[{name=name(1.0)={Groups of stuff}}]
>> org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url null Processing Document # 11
>>     at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:115)
>>     at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
>
> --
> --Noble Paul

--
--Noble Paul
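
For reference, the rootEntity="false" suggestion in this thread could look roughly like the data-config below. This is a minimal sketch, not the poster's actual configuration: the original message elides the connection settings as "http stuff", so the dataSource element, the processor attributes, the forEach expressions, and the example.com feed URL are all assumptions added for illustration.

```xml
<dataConfig>
  <!-- Assumption: a plain HttpDataSource with default settings -->
  <dataSource type="HttpDataSource" />
  <document>
    <!-- rootEntity="false": rows of "outer" do not become documents;
         each row of the nested "inner" entity becomes one document. -->
    <!-- Assumption: hypothetical feed URL and forEach path -->
    <entity name="outer"
            processor="XPathEntityProcessor"
            url="http://example.com/feeds.rss"
            forEach="/rss/channel/item"
            rootEntity="false">
      <field column="url" xpath="/rss/channel/item/link" />

      <!-- The inner feed is fetched once per outer row via ${outer.url} -->
      <entity name="inner"
              processor="XPathEntityProcessor"
              url="${outer.url}"
              forEach="/rss/channel/item"
              pk="title">
        <field column="title" xpath="/rss/channel/item/title" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

With forEach pointing at the item path, each <item> should produce one row, so ${outer.url} is resolved once per link. Whether this also avoids the null URL on the 11th iteration reported above is not confirmed anywhere in the thread.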