It seems the proper XPath statement to select the href of the link child when rel="self" is, for the root:

/feed/link[@rel='self']/string(@href)

and this should get the children:

/feed/entry/link[@rel='alternate']/string(@href)

But it doesn't work in the DIH, though it does work in other XPath query processors. Can the DIH handle compound XPath statements?

-----Original Message-----
From: Gora Mohanty [mailto:g...@mimirtech.com]
Sent: Friday, January 14, 2011 3:08 AM
To: solr-user@lucene.apache.org
Subject: Re: DataimportHandler development issue

On Fri, Jan 14, 2011 at 12:17 AM, Derek Werthmuller <dwert...@ctg.albany.edu> wrote:
> It's not clear why it's not working. Advice?
> Also, is this the best way to load data? We intend to load several
> thousand DocBook documents once we understand how this all works. We
> stuck with the rss/atom example since we didn't want to deal with
> schema changes yet.
> Thanks
> Derek
>
> example-DIH/solr/rss/conf/rss-data-config.xml modified source:
> <dataConfig>
>   <dataSource type="URLDataSource" />
>   <document>
>     <entity name="slashdot"
>             pk="link"
>             url="http://twitter.com/statuses/user_timeline/existdb.rss"
>             processor="XPathEntityProcessor"
>             forEach="/rss/channel | /rss/channel/item"
>             transformer="DateFormatTransformer">
>
>       <field column="source" xpath="/rss/channel/title" commonField="true" />
>       <field column="source-link" xpath="/rss/channel/link" commonField="true" />
>       <field column="subject" xpath="/rss/channel/subject" commonField="true" />
>
>       <field column="title" xpath="/rss/channel/item/title" />
>       <field column="link" xpath="/rss/channel/item/link" />
>       <field column="description" xpath="/rss/channel/item/description" />
>       <field column="creator" xpath="/rss/channel/item/creator" />
>       <field column="item-subject" xpath="/rss/channel/item/subject" />
>       <field column="date" xpath="/rss/channel/item/date"
>              dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
>       <field column="slash-department" xpath="/rss/channel/item/department" />
>       <field column="slash-section" xpath="/rss/channel/item/section" />
>       <field column="slash-comments" xpath="/rss/channel/item/comments" />
>     </entity>
>
>     <entity name="twitter"
>             pk="link"
>             url="http://twitter.com/statuses/user_timeline/ctg_ualbany.atom"
>             processor="XPathEntityProcessor"
>             forEach="/feed | /feed/entry"
>             transformer="DateFormatTransformer">
>
>       <field column="source" xpath="/feed/title" commonField="true" />
>       <field column="source-link" xpath="/feed/link" commonField="true" />
>       <field column="subject" xpath="/feed/subtitle" commonField="true" />
>
>       <field column="title" xpath="/feed/entry/title" />
>       <field column="link" xpath="/feed/entry/link" />
>       <field column="description" xpath="/feed/entry/description" />
>       <field column="creator" xpath="/feed/entry/creator" />
>       <field column="item-subject" xpath="/feed/entry/subject" />
>       <field column="date" xpath="/rss/channel/item/date"
>              dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
>       <field column="slash-department" xpath="/feed/entry/department" />
>       <field column="slash-section" xpath="/feed/entry/section" />
>       <field column="slash-comments" xpath="/feed/entry/comments" />
>     </entity>
>   </document>
> </dataConfig>

Your problem is the second entity in the DIH configuration file. The Solr schema defines the unique key to be the field "link". As noted in the comments in schema.xml, this means that this field is required. Solr is not able to populate the "link" field from the Atom feed. I have not tracked down why this is so, but it is probably because there is more than one link node under /feed/entry, and the "link" field is not multi-valued. Change the xpath to, say, "/feed/entry/id", and the import works.

Also, while this is not necessarily an issue, please note that several other fields have incorrect xpaths for this entity.

To answer your other question, this way of importing data should work fine.

Regards,
Gora
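
For illustration only, the suggested change might look roughly like the sketch below. It keeps the entity attributes from the quoted configuration and points the "link" column at /feed/entry/id, as proposed above. The remaining xpaths (content, author/name, updated) are assumptions based on the usual element names of the Atom format; they have not been checked against this particular feed or its schema, so treat them as placeholders to verify.

```xml
<entity name="twitter"
        pk="link"
        url="http://twitter.com/statuses/user_timeline/ctg_ualbany.atom"
        processor="XPathEntityProcessor"
        forEach="/feed | /feed/entry"
        transformer="DateFormatTransformer">

  <field column="source" xpath="/feed/title" commonField="true" />
  <field column="subject" xpath="/feed/subtitle" commonField="true" />

  <!-- From the reply above: use /feed/entry/id for the required unique key,
       since an entry may carry more than one link node. -->
  <field column="link" xpath="/feed/entry/id" />
  <field column="title" xpath="/feed/entry/title" />

  <!-- Assumption: typical Atom element names, not verified against this feed. -->
  <field column="description" xpath="/feed/entry/content" />
  <field column="creator" xpath="/feed/entry/author/name" />

  <!-- Assumption: only the xpath is changed here; the date pattern is copied
       from the quoted configuration and may need adjusting. -->
  <field column="date" xpath="/feed/entry/updated"
         dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
</entity>
```

The slash-* columns are left out of the sketch because nothing in the thread indicates a matching element in the Atom feed.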