Re: DIH load only selected documents with XPathEntityProcessor

Pulkit Singhal Tue, 13 Sep 2011 14:20:02 -0700

Oh, and I'm sure that I'm using Java 6 because the properties on the Solr
webpage show:


java.runtime.version = 1.6.0_26-b03-384-10M3425


On Tue, Sep 13, 2011 at 4:15 PM, Pulkit Singhal <pulkitsing...@gmail.com> wrote:

> This solution doesn't seem to be working for me.
>
> I am using Solr trunk and I have the same question as Bernd with a small
> twist: the field that should NOT be empty happens to be a derived field
> called price; see the config below:
>
> <entity ...
>   transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">
>
> <field column="description"
>           xpath="/rss/channel/item/description"
>           />
>
> <field column="price"
>          regex=".*\$(\d*.\d*)"
>          sourceColName="description"
>          />
> ...
> </entity>
>
> I have also changed the sample script to check the price field instead of
> the link field that was being used as an example earlier in this thread:
>
>
>     <script>
>         <![CDATA[
>         function skipRow(row) {
>             var price = row.get( 'price' );
>             if ( price == null || price == '' ) {
>
>                 row.put( '$skipRow', 'true' );
>             }
>             return row;
>         }
>         ]]>
>     </script>
>
> Does anyone have any thoughts on what I'm missing?
> Thanks!
> - Pulkit
>
>
> On Mon, Jan 10, 2011 at 3:06 AM, Bernd Fehling <
> bernd.fehl...@uni-bielefeld.de> wrote:
>
>> Hi Gora,
>>
>> thanks a lot, very nice solution, works perfectly.
>> I will dig more into ScriptTransformer, seems to be very powerful.
>>
>> Regards,
>> Bernd
>>
>> On 08.01.2011 14:38, Gora Mohanty wrote:
>> > On Fri, Jan 7, 2011 at 12:30 PM, Bernd Fehling
>> > <bernd.fehl...@uni-bielefeld.de> wrote:
>> >> Hello list,
>> >>
>> >> is it possible to load only selected documents with XPathEntityProcessor?
>> >> While loading docs I want to drop/skip/ignore documents with missing URL.
>> >>
>> >> Example:
>> >> <documents>
>> >>    <document>
>> >>        <title>first title</title>
>> >>        <id>identifier_01</id>
>> >>        <link>http://www.foo.com/path/bar.html</link>
>> >>    </document>
>> >>    <document>
>> >>        <title>second title</title>
>> >>        <id>identifier_02</id>
>> >>        <link></link>
>> >>    </document>
>> >> </documents>
>> >>
>> >> The first document should be loaded, the second document should be ignored
>> >> because it has an empty link (should also work for missing link field).
>> > [...]
>> >
>> > You can use a ScriptTransformer, along with $skipRow/$skipDoc.
>> > E.g., something like this for your data import configuration file:
>> >
>> > <dataConfig>
>> >     <script><![CDATA[
>> >       function skipRow(row) {
>> >         var link = row.get( 'link' );
>> >         if( link == null || link == '' ) {
>> >           row.put( '$skipRow', 'true' );
>> >         }
>> >         return row;
>> >       }
>> >     ]]></script>
>> >     <dataSource type="FileDataSource" />
>> >     <document>
>> >         <entity name="f" processor="FileListEntityProcessor"
>> > baseDir="/home/gora/test" fileName=".*xml" newerThan="'NOW-3DAYS'"
>> > recursive="true" rootEntity="false" dataSource="null">
>> >             <entity name="top" processor="XPathEntityProcessor"
>> > forEach="/documents/document" url="${f.fileAbsolutePath}"
>> > transformer="script:skipRow">
>> >                <field column="link" xpath="/documents/document/link"/>
>> >                <field column="title" xpath="/documents/document/title"/>
>> >                <field column="id" xpath="/documents/document/id"/>
>> >             </entity>
>> >         </entity>
>> >     </document>
>> > </dataConfig>
>> >
>> > Regards,
>> > Gora
>>
>
>
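
For anyone debugging the same setup, the skip logic can be exercised outside
of Solr. The following is a minimal sketch in plain JavaScript, not DIH
itself: makeRow and derivePrice are hypothetical helpers that only mimic the
row's get/put and the regex field from the quoted config, the sample
descriptions are made up, and it assumes that no price entry is added when
the pattern does not match.

```javascript
// Hypothetical stand-in for the DIH row (a Map in Solr); only get/put are mimicked.
function makeRow(fields) {
  var data = Object.assign({}, fields);
  return {
    get: function (key) {
      return Object.prototype.hasOwnProperty.call(data, key) ? data[key] : null;
    },
    put: function (key, value) {
      data[key] = value;
    }
  };
}

// Rough imitation of the regex field from the quoted config:
// derive 'price' from 'description' using the pattern copied verbatim.
// Assumption: when the pattern does not match, no 'price' entry is added.
function derivePrice(row) {
  var description = row.get('description');
  var match = description == null ? null : /.*\$(\d*.\d*)/.exec(description);
  if (match) {
    row.put('price', match[1]);
  }
  return row;
}

// Same logic as the skipRow script in the quoted message.
function skipRow(row) {
  var price = row.get('price');
  if (price == null || price == '') {
    row.put('$skipRow', 'true');
  }
  return row;
}

// Made-up sample rows: one with a dollar amount, one without.
var priced = skipRow(derivePrice(makeRow({ description: 'Blue widget for $19.99' })));
var unpriced = skipRow(derivePrice(makeRow({ description: 'Blue widget, call for price' })));
```

If the function flags the unpriced row here but such rows still get indexed
in Solr, the script logic is probably fine and the transformer wiring or the
derived field is the more likely place to look.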
