Re: DIH load only selected documents with XPathEntityProcessor

Pulkit Singhal Tue, 13 Sep 2011 14:16:01 -0700

This solution doesn't seem to be working for me.

I am using Solr trunk and I have the same question as Bernd, with a small
twist: the field that should NOT be empty is a derived field called price.
See the config below:


<entity ...
  transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">

<field column="description"
       xpath="/rss/channel/item/description"
       />

<field column="price"
       regex=".*\$(\d*.\d*)"
       sourceColName="description"
       />
...
</entity>
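
As a sanity check on the pattern alone, the sketch below runs the same
regex in plain JavaScript outside of Solr. It assumes that JavaScript and
Java treat these particular regex constructs the same way, and that
RegexTransformer stores the first captured group in the price column. The
sample descriptions and the extractPrice helper are made up for
illustration; they are not part of DIH.

```javascript
// Same pattern as the regex attribute on the price field above.
var pricePattern = /.*\$(\d*.\d*)/;

// Hypothetical helper: returns the first captured group, or null when the
// description contains no dollar amount.
function extractPrice(description) {
    var match = pricePattern.exec(description);
    return match === null ? null : match[1];
}

// Hypothetical sample descriptions, not taken from a real feed.
console.log(extractPrice('Widget for $19.99'));      // 19.99
console.log(extractPrice('Widget, call for price')); // null
```

Note that the dot between the two digit groups is unescaped, so it matches
any character, not only a decimal point.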

I have also changed the sample script to check the price field instead of
the link field that was used as an example earlier in this thread:

    <script>
        <![CDATA[
        function skipRow(row) {
            var price = row.get( 'price' );
            if ( price == null || price == '' ) {
                row.put( '$skipRow', 'true' );
            }
            return row;
        }
        ]]>
    </script>
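
The script logic can also be exercised on its own, outside of DIH. The
sketch below is only an approximation: it assumes a JavaScript Map with a
put alias as a stand-in for the java.util.Map that DIH passes to the
function, and the makeRow helper and sample rows are hypothetical.

```javascript
// Hypothetical stand-in for a DIH row; a real row is a java.util.Map.
function makeRow(fields) {
    var row = new Map();
    Object.keys(fields).forEach(function (key) {
        row.set(key, fields[key]);
    });
    // DIH scripts call put(); alias it onto the JavaScript Map for this sketch.
    row.put = row.set;
    return row;
}

// Same function as in the <script> block above.
function skipRow(row) {
    var price = row.get('price');
    if (price == null || price == '') {
        row.put('$skipRow', 'true');
    }
    return row;
}

var priced = skipRow(makeRow({ description: 'Widget for $19.99', price: '19.99' }));
var unpriced = skipRow(makeRow({ description: 'Widget, call for price' }));

console.log(priced.has('$skipRow'));   // false
console.log(unpriced.get('$skipRow')); // true
```

If the function flags the unpriced row here, the check itself is fine, and
the remaining suspects are whether price is already populated when the
script runs and whether the script is invoked at all.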

Does anyone have any thoughts on what I'm missing?
Thanks!
- Pulkit

On Mon, Jan 10, 2011 at 3:06 AM, Bernd Fehling <bernd.fehl...@uni-bielefeld.de> wrote:

> Hi Gora,
>
> thanks a lot, very nice solution, works perfectly.
> I will dig more into ScriptTransformer, seems to be very powerful.
>
> Regards,
> Bernd
>
> On 08.01.2011 14:38, Gora Mohanty wrote:
> > On Fri, Jan 7, 2011 at 12:30 PM, Bernd Fehling
> > <bernd.fehl...@uni-bielefeld.de> wrote:
> >> Hello list,
> >>
> >> is it possible to load only selected documents with XPathEntityProcessor?
> >> While loading docs I want to drop/skip/ignore documents with missing URL.
> >>
> >> Example:
> >> <documents>
> >>    <document>
> >>        <title>first title</title>
> >>        <id>identifier_01</id>
> >>        <link>http://www.foo.com/path/bar.html</link>
> >>    </document>
> >>    <document>
> >>        <title>second title</title>
> >>        <id>identifier_02</id>
> >>        <link></link>
> >>    </document>
> >> </documents>
> >>
> >> The first document should be loaded, the second document should be ignored
> >> because it has an empty link (should also work for missing link field).
> > [...]
> >
> > You can use a ScriptTransformer, along with $skipRow/$skipDoc.
> > E.g., something like this for your data import configuration file:
> >
> > <dataConfig>
> >     <script><![CDATA[
> >       function skipRow(row) {
> >         var link = row.get( 'link' );
> >         if( link == null || link == '' ) {
> >           row.put( '$skipRow', 'true' );
> >         }
> >         return row;
> >       }
> >     ]]></script>
> >     <dataSource type="FileDataSource" />
> >     <document>
> >         <entity name="f" processor="FileListEntityProcessor"
> >                 baseDir="/home/gora/test" fileName=".*xml" newerThan="'NOW-3DAYS'"
> >                 recursive="true" rootEntity="false" dataSource="null">
> >             <entity name="top" processor="XPathEntityProcessor"
> >                     forEach="/documents/document" url="${f.fileAbsolutePath}"
> >                     transformer="script:skipRow">
> >                <field column="link" xpath="/documents/document/link"/>
> >                <field column="title" xpath="/documents/document/title"/>
> >                <field column="id" xpath="/documents/document/id"/>
> >             </entity>
> >         </entity>
> >     </document>
> > </dataConfig>
> >
> > Regards,
> > Gora
>
