This solution doesn't seem to be working for me. I am using Solr trunk and have the same question as Bernd, with a small twist: the field that should NOT be empty is a derived field called price. See the config below:

<entity ... transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer, script:skipRow">
  <field column="description" xpath="/rss/channel/item/description" />
  <field column="price" regex=".*\$(\d*.\d*)" sourceColName="description" />
  ...
</entity>

I have also changed the sample script to check the price field instead of the link field that was used as an example earlier in this thread:

<script>
<![CDATA[
function skipRow(row) {
  var price = row.get( 'price' );
  if ( price == null || price == '' ) {
    row.put( '$skipRow', 'true' );
  }
  return row;
}
]]>
</script>

Does anyone have any thoughts on what I'm missing?

Thanks!
- Pulkit

On Mon, Jan 10, 2011 at 3:06 AM, Bernd Fehling <bernd.fehl...@uni-bielefeld.de> wrote:

> Hi Gora,
>
> thanks a lot, very nice solution, works perfectly.
> I will dig more into ScriptTransformer, seems to be very powerful.
>
> Regards,
> Bernd
>
> On 08.01.2011 14:38, Gora Mohanty wrote:
> > On Fri, Jan 7, 2011 at 12:30 PM, Bernd Fehling
> > <bernd.fehl...@uni-bielefeld.de> wrote:
> >> Hello list,
> >>
> >> is it possible to load only selected documents with XPathEntityProcessor?
> >> While loading docs I want to drop/skip/ignore documents with missing URL.
> >>
> >> Example:
> >> <documents>
> >>   <document>
> >>     <title>first title</title>
> >>     <id>identifier_01</id>
> >>     <link>http://www.foo.com/path/bar.html</link>
> >>   </document>
> >>   <document>
> >>     <title>second title</title>
> >>     <id>identifier_02</id>
> >>     <link></link>
> >>   </document>
> >> </documents>
> >>
> >> The first document should be loaded, the second document should be ignored
> >> because it has an empty link (should also work for missing link field).
> > [...]
> >
> > You can use a ScriptTransformer, along with $skipRow/$skipDoc.
> > E.g., something like this for your data import configuration file:
> >
> > <dataConfig>
> >   <script><![CDATA[
> >     function skipRow(row) {
> >       var link = row.get( 'link' );
> >       if( link == null || link == '' ) {
> >         row.put( '$skipRow', 'true' );
> >       }
> >       return row;
> >     }
> >   ]]></script>
> >   <dataSource type="FileDataSource" />
> >   <document>
> >     <entity name="f" processor="FileListEntityProcessor"
> >             baseDir="/home/gora/test" fileName=".*xml" newerThan="'NOW-3DAYS'"
> >             recursive="true" rootEntity="false" dataSource="null">
> >       <entity name="top" processor="XPathEntityProcessor"
> >               forEach="/documents/document" url="${f.fileAbsolutePath}"
> >               transformer="script:skipRow">
> >         <field column="link" xpath="/documents/document/link"/>
> >         <field column="title" xpath="/documents/document/title"/>
> >         <field column="id" xpath="/documents/document/id"/>
> >       </entity>
> >     </entity>
> >   </document>
> > </dataConfig>
> >
> > Regards,
> > Gora
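
One way to narrow this down might be to exercise the price-based skipRow function from the question outside of Solr. The sketch below is only a rough harness under stated assumptions: makeRow is a hypothetical stand-in for the row object the data import handler passes to the script (assumed here to behave like a map with get/put), and the sample descriptions and prices are made up. It checks the script logic only. It says nothing about whether the price column has actually been populated by the time the script runs inside Solr.

```javascript
// Hypothetical stand-in for the row passed to the script; assumed to behave
// like a map with get/put. This is not Solr's real row object.
function makeRow(fields) {
  const data = new Map(Object.entries(fields));
  return {
    get: function (key) { return data.has(key) ? data.get(key) : null; },
    put: function (key, value) { data.set(key, value); }
  };
}

// Same logic as the skipRow function in the question.
function skipRow(row) {
  var price = row.get('price');
  if (price == null || price == '') {
    row.put('$skipRow', 'true');
  }
  return row;
}

// Made-up sample rows: missing price, empty price, and a populated price.
const missing = skipRow(makeRow({ description: 'no price here' }));
const empty = skipRow(makeRow({ description: 'still no price', price: '' }));
const priced = skipRow(makeRow({ description: 'Only $19.99', price: '19.99' }));

console.log(missing.get('$skipRow'));
console.log(empty.get('$skipRow'));
console.log(priced.get('$skipRow'));
```

If the first two rows come back flagged and the third does not, the function itself is probably fine, and the next thing to look at would be what the price column contains when the script is invoked.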