DataimportHandler development issue

2011-01-13 Thread Derek Werthmuller
We're just getting started with Solr and are very interested in using Solr
for search applications.

I've got the RSS example working. Under 1.4.1 it didn't work out of the box,
but we figured it out and then found the fixes in svn. Anyway, we are
learning how to load data from RSS and Atom feeds into the Solr index. We are
trying to modify the rss-data-config.xml file so that we can import Atom
feeds as well, but for some reason they don't load. Our configuration is
included at the end of this message.

We've been using the DataImportHandler Development Console
http://localhost:8983/solr/rss/admin/dataimport.jsp?handler=/rssimport
to look at the status and the DocsNum, but only the RSS feed works.
If we remove the slashdot RSS entity entirely, the Atom example still doesn't
work. We've also tried creating a separate atom-data-config.xml file and
adding the proper entry to solrconfig.xml to support the extra dataimport.
That gave us the same results.

0
1
atom-data-config.xml
status
idle
1
0
0
2011-01-13 08:42:53
0
0:0:0.519
This response format is experimental. It is likely to change in the future.

It's not clear why it's not working. Any advice?
Also, is this the best way to load data? We intend to load several
thousand DocBook documents once we understand how this all works. We stuck
with the RSS/Atom example since we didn't want to deal with schema changes
yet.
Thanks,
Derek

example-DIH/solr/rss/conf/rss-data-config.xml  modified source:



<entity [...]
url="http://twitter.com/statuses/user_timeline/existdb.rss"
processor="XPathEntityProcessor"
forEach="/rss/channel | /rss/channel/item"
transformer="DateFormatTransformer">
[...]

<entity [...]
url="http://twitter.com/statuses/user_timeline/ctg_ualbany.atom"
processor="XPathEntityProcessor"
forEach="/feed | /feed/entry"
transformer="DateFormatTransformer">
[...]

RE: DataimportHandler development issue

2011-01-21 Thread Derek Werthmuller
It seems the proper XPath statement to select the href of the link child
when rel="self" is
/feed/link[@rel='self']/string(@href) for the root, and

/feed/entry/link[@rel='alternate']/string(@href) should get the children.

But it doesn't work in the DIH, although it does work in other XPath query
processors.

Can the DIH handle compound xpath statements?
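
For comparison outside the DIH, here is a minimal sketch of the same selection using Python's standard xml.etree.ElementTree, which, like the DIH's XPathEntityProcessor, implements only a subset of XPath. The feed contents and the helper name link_href are made up for illustration. ElementTree has no string() function, so the predicate picks the link element and the attribute is read separately:

```python
import xml.etree.ElementTree as ET

# Hypothetical Atom feed, trimmed to the elements discussed in this thread.
ATOM = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="self" href="http://example.org/feed.atom"/>
  <link rel="alternate" href="http://example.org/"/>
  <entry>
    <id>tag:example.org,2011:1</id>
    <link rel="alternate" href="http://example.org/entries/1"/>
    <link rel="image" href="http://example.org/entries/1.png"/>
  </entry>
  <entry>
    <id>tag:example.org,2011:2</id>
    <link rel="alternate" href="http://example.org/entries/2"/>
  </entry>
</feed>"""

# Atom elements live in a namespace, so paths need a prefix mapping.
NS = {"a": "http://www.w3.org/2005/Atom"}


def link_href(node, rel):
    """Return the href of the first child <link> with the given rel, or None."""
    link = node.find(f"a:link[@rel='{rel}']", NS)
    return None if link is None else link.get("href")


root = ET.fromstring(ATOM)
feed_self = link_href(root, "self")
entry_links = [link_href(e, "alternate") for e in root.findall("a:entry", NS)]

print(feed_self)
print(entry_links)
```

Whether the DIH accepts the equivalent predicate-plus-attribute path is something to verify against the XPathEntityProcessor documentation for the Solr version in use.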


 

-Original Message-
From: Gora Mohanty [mailto:g...@mimirtech.com] 
Sent: Friday, January 14, 2011 3:08 AM
To: solr-user@lucene.apache.org
Subject: Re: DataimportHandler development issue

On Fri, Jan 14, 2011 at 12:17 AM, Derek Werthmuller
 wrote:

> It's not clear why it's not working. Any advice?
> Also, is this the best way to load data? We intend to load several
> thousand DocBook documents once we understand how this all works. We
> stuck with the RSS/Atom example since we didn't want to deal with
> schema changes yet.
> Thanks
>        Derek
>
> example-DIH/solr/rss/conf/rss-data-config.xml  modified source:
>
> <entity [...] pk="link"
> url="http://twitter.com/statuses/user_timeline/existdb.rss"
> processor="XPathEntityProcessor"
> forEach="/rss/channel | /rss/channel/item"
> transformer="DateFormatTransformer">
>
> <field [...] />
> <field [...] commonField="true" />
> <field [...] xpath="/rss/channel/subject" commonField="true" />
>
> <field column="link" xpath="/rss/channel/item/link" />
> <field column="description" xpath="/rss/channel/item/description" />
> <field column="creator" xpath="/rss/channel/item/creator" />
> <field column="item-subject" xpath="/rss/channel/item/subject" />
> <field column="date" xpath="/rss/channel/item/date"
>  dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
> <field column="slash-department" xpath="/rss/channel/item/department" />
> [...]
>
> <entity [...] pk="link"
> url="http://twitter.com/statuses/user_timeline/ctg_ualbany.atom"
> processor="XPathEntityProcessor"
> forEach="/feed | /feed/entry"
> transformer="DateFormatTransformer">
>
> [...]
>
> <field column="link" xpath="/feed/entry/link" />
> <field [...] xpath="/feed/entry/description" />
> <field [...] xpath="/feed/entry/creator" />
> <field [...] xpath="/feed/entry/subject" />
> <field [...] xpath="/rss/channel/item/date"
>  dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
> <field column="slash-department" xpath="/feed/entry/department" />
> <field column="slash-section" xpath="/feed/entry/section" />
> <field column="slash-comments" xpath="/feed/entry/comments" />
> [...]

Your problem is the second entity in the DIH configuration file. The Solr
schema defines the unique key to be the field "link". As noted in the
comments in schema.xml, this means that this field is required.
Solr is not able to populate the "link" field from the Atom feed. I have not
tracked down why this is so, but it is probably because there is more than
one link node under /feed/entry, and the "link" field is not multi-valued.
Change the xpath to, say, "/feed/entry/id", and the import works. Also,
while this is not necessarily an issue, please note that several other
fields have incorrect xpaths for this entity.
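
The guess about repeated link nodes can be checked outside Solr by counting child elements per entry. This is a minimal sketch with Python's standard library; the sample entry and the helper name count_children are made up for illustration and are not taken from the Twitter feed:

```python
import xml.etree.ElementTree as ET

# Clark-notation prefix for elements in the Atom namespace.
ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Hypothetical Atom entry with one <id> and two <link> children.
ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>tag:example.org,2011:1</id>
  <link rel="alternate" href="http://example.org/entries/1"/>
  <link rel="image" href="http://example.org/entries/1.png"/>
</entry>"""


def count_children(entry, tag):
    """Count direct children of `entry` with the given Atom tag name."""
    return len(entry.findall(ATOM_NS + tag))


entry = ET.fromstring(ENTRY)
counts = {tag: count_children(entry, tag) for tag in ("id", "link")}
print(counts)
```

An element that occurs exactly once per entry, such as id in this sample, is a safer source for a single-valued unique key than one that can repeat.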

To answer your other question, this way of importing data should work fine.

Regards,
Gora


loading XML docbook files into solr

2011-02-26 Thread Derek Werthmuller
I've been working on this for a while and seem to have hit a wall. The error
messages aren't complete enough to explain why importing a sample DocBook
document into Solr is not working.
I'm using the curl tool to post the XML file and receive a non-error message,
but the document count doesn't increase and the *:* query still returns no
results.
The DocBook document has an attribute id, and this is mapped to the uniqueKey
in the schema.xml file, but it seems this may still be the issue. It's not
clear how the field names map to the XML. Do they only map to attributes, or
do they map to elements? How do you differentiate?
Can field names in the schema.xml file have XPath statements?

Are there other important sections of the solrconfig that could be keeping
this from working?

We want to maintain much of the document structure so we have more control
over the searching.

Here is what the docbook XML looks like:  (tried setting the uniquekey to id
and docid but no go either way)


245
Advancing Return on Investment Analysis for Government IT: A Public Value Framework

Advancing Return on Investment Analysis for Government IT: A Public Value Framework






Public Value Illustration



..

Here is the section of the schema.xml:

[...]
id
all_text

 
 



Load command results.

$ ./postfile.sh
056
015



Thanks
Derek


RE: loading XML docbook files into solr

2011-02-26 Thread Derek Werthmuller
Thank you, this clarifies a lot.
 

-Original Message-
From: Gora Mohanty [mailto:g...@mimirtech.com] 
Sent: Saturday, February 26, 2011 11:49 AM
To: solr-user@lucene.apache.org
Subject: Re: loading XML docbook files into solr

On Sat, Feb 26, 2011 at 9:10 PM, Derek Werthmuller 
wrote:
> I've been working on this for a while and seem to have hit a wall. The
> error messages aren't complete enough to explain why importing
> a sample DocBook document into Solr is not working.
> I'm using the curl tool to post the XML file and receive a non-error
> message, but the document count doesn't increase and the *:* query still
> returns no results.
[...]

Which curl tool? The post.sh included with Solr? You refer to a postfile.sh
below.

Unless I am missing something, it seems like you are trying to post a
standard XML file to Solr. You cannot do that. There are two ways to
proceed:
* Reformat the XML into Solr's format. See the .xml documents in
  the example/exampledocs directory of your Solr distribution, or see, e.g.,
  http://www.xml.com/pub/a/2006/08/09/solr-indexing-xml-with-lucene-andrest.html
* Write a DataImportHandler script with an XPathEntityProcessor. Please
  see http://wiki.apache.org/solr/DataImportHandler
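
As a rough illustration of the first option, the sketch below converts a small DocBook-like document into Solr's <add><doc> format with Python's standard library. The sample document, its element names, the title field, and the function name to_solr_add are all assumptions made for the example; the field names actually used must match what is declared in schema.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical DocBook-like input; the element names are assumptions.
DOCBOOK = """<article id="245">
  <title>Advancing Return on Investment Analysis for Government IT: A Public Value Framework</title>
  <section>
    <title>Public Value Illustration</title>
    <para>Example paragraph.</para>
  </section>
</article>"""


def to_solr_add(docbook_xml):
    """Build a Solr <add><doc> message from one DocBook-like document."""
    src = ET.fromstring(docbook_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    def field(name, value):
        # Solr identifies each value by the name attribute of <field>.
        elem = ET.SubElement(doc, "field", name=name)
        elem.text = value

    field("id", src.get("id"))            # taken from an XML attribute
    field("title", src.findtext("title"))  # taken from a child element
    return ET.tostring(add, encoding="unicode")


solr_xml = to_solr_add(DOCBOOK)
print(solr_xml)
```

In this format each Solr field is identified by the name attribute of a <field> element, so whether a value came from an element or an attribute of the source XML only matters to the conversion step.
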
> Load command results.
>
> $ ./postfile.sh
[...]

This is not the problem here, but the standard Solr post.sh takes filenames
to be posted as command-line arguments.

Regards,
Gora