This solution doesn't seem to be working for me.
I am using Solr trunk and I have the same question as Bernd, with a small
twist: the field that should NOT be empty happens to be a derived field
called price. See the config below:
<entity ...
        transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">
    <field column="description"
           xpath="/rss/channel/item/description"
    />
    <field column="price"
           regex=".*\$(\d*.\d*)"
           sourceColName="description"
    />
    ...
</entity>
I have also changed the sample script to check the price field instead of
the link field that was used as the example earlier in this thread:
<script>
<![CDATA[
    function skipRow(row) {
        var price = row.get( 'price' );
        if ( price == null || price == '' ) {
            row.put( '$skipRow', 'true' );
        }
        return row;
    }
]]>
</script>
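To rule out the script logic itself, here is a rough standalone sketch of the same check that can be run outside Solr. It is only an approximation: makeRow is a hypothetical stand-in for the row map that DIH passes in, skipRowIfNoPrice is an illustrative name, and the extra trim for whitespace-only values is an assumption on my part, not something the config above does.

```javascript
// Hypothetical stand-in for the DIH row (a map-like object inside Solr),
// exposing only get/put so the check can run under a plain JavaScript engine.
function makeRow(fields) {
  var data = Object.assign({}, fields);
  return {
    get: function (key) {
      return Object.prototype.hasOwnProperty.call(data, key) ? data[key] : null;
    },
    put: function (key, value) {
      data[key] = value;
    }
  };
}

// Same idea as skipRow above, but also treats whitespace-only values as empty
// (an assumption added for illustration).
function skipRowIfNoPrice(row) {
  var price = row.get('price');
  // String(...) coerces whatever the row returns to a JavaScript string before trimming.
  if (price == null || String(price).trim() === '') {
    row.put('$skipRow', 'true');
  }
  return row;
}

// One row with a derived price, one row where no price was derived.
var kept = skipRowIfNoPrice(makeRow({ description: 'Widget for $19.99', price: '19.99' }));
var skipped = skipRowIfNoPrice(makeRow({ description: 'No price here' }));
console.log(kept.get('$skipRow'), skipped.get('$skipRow'));
```

With these sample rows, only the second one gets $skipRow set. If the check behaves this way in isolation, the problem is more likely in what the price column contains when the script runs than in the script itself.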
Does anyone have any thoughts on what I'm missing?
Thanks!
- Pulkit
On Mon, Jan 10, 2011 at 3:06 AM, Bernd Fehling <[email protected]> wrote:
> Hi Gora,
>
> thanks a lot, very nice solution, works perfectly.
> I will dig more into ScriptTransformer, seems to be very powerful.
>
> Regards,
> Bernd
>
> Am 08.01.2011 14:38, schrieb Gora Mohanty:
> > On Fri, Jan 7, 2011 at 12:30 PM, Bernd Fehling
> > <[email protected]> wrote:
> >> Hello list,
> >>
> >> is it possible to load only selected documents with XPathEntityProcessor?
> >> While loading docs I want to drop/skip/ignore documents with missing URL.
> >>
> >> Example:
> >> <documents>
> >>   <document>
> >>     <title>first title</title>
> >>     <id>identifier_01</id>
> >>     <link>http://www.foo.com/path/bar.html</link>
> >>   </document>
> >>   <document>
> >>     <title>second title</title>
> >>     <id>identifier_02</id>
> >>     <link></link>
> >>   </document>
> >> </documents>
> >>
> >> The first document should be loaded, the second document should be ignored
> >> because it has an empty link (should also work for missing link field).
> > [...]
> >
> > You can use a ScriptTransformer, along with $skipRow/$skipDoc.
> > E.g., something like this for your data import configuration file:
> >
> > <dataConfig>
> >   <script><![CDATA[
> >     function skipRow(row) {
> >       var link = row.get( 'link' );
> >       if( link == null || link == '' ) {
> >         row.put( '$skipRow', 'true' );
> >       }
> >       return row;
> >     }
> >   ]]></script>
> >   <dataSource type="FileDataSource" />
> >   <document>
> >     <entity name="f" processor="FileListEntityProcessor"
> >             baseDir="/home/gora/test" fileName=".*xml" newerThan="'NOW-3DAYS'"
> >             recursive="true" rootEntity="false" dataSource="null">
> >       <entity name="top" processor="XPathEntityProcessor"
> >               forEach="/documents/document" url="${f.fileAbsolutePath}"
> >               transformer="script:skipRow">
> >         <field column="link" xpath="/documents/document/link"/>
> >         <field column="title" xpath="/documents/document/title"/>
> >         <field column="id" xpath="/documents/document/id"/>
> >       </entity>
> >     </entity>
> >   </document>
> > </dataConfig>
> >
> > Regards,
> > Gora
>