>On Fri, Sep 11, 2009 at 6:48 AM, venn hardy <venn.ha...@hotmail.com> wrote:
>>
>> Hi Fergus,
>>
>> When I debugged in the development console 
>> http://localhost:9080/solr/admin/dataimport.jsp?handler=/dataimport
>>
>> I had no problems. Each category/item seems to be only indexed once, and no 
>> parent fields are available (except the category name).
>>
>> I am not entirely sure how the forEach statement works, but my 
>> interpretation of forEach="/document/category/item | /document/category" is 
>> something like this:
>>
>> 1. Whenever DIH encounters a document/category it will extract the
>> /document/category/name field as a common field
>> 2. Whenever DIH encounters a document/category/item it will extract all of 
>> the item fields.
>> 3. When all fields have been encountered, save the document in solr and go 
>> to the next category/item
>
>/document/category/item | /document/category
>
>means there are two paths that trigger a new doc (it is possible to
>have more). Whenever it encounters the closing tag of that xpath, it
>emits all the fields it collected since the opening of the same tag.
>After that it clears all the fields it collected since the opening of
>the tag.
>
>If there are fields it collected before the opening of the same tag, it
>retains them.


Nice and clear, but that is not what I see.

With my test case, using forEach="/record | /record/mediaBlock",
I see that each /record/mediaBlock "document" indexed contains all fields
from the parent "/record" document as well. A search over mediaBlocks returns
lots of extra fields from the parent which did not have the commonField
attribute. I will try to produce a test case.
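
For reference, the setup Venn reported as working can be sketched by
combining the original dataConfig with the two-path forEach, both quoted
below. This is an untested assembly, not a confirmed fix: the entity is
trimmed to the attributes relevant here, and the stray ";" that appears
after the url in the quoted config has been dropped.

```xml
<dataConfig>
  <dataSource type="URLDataSource" />
  <document>
    <!-- Two forEach paths; per the description above, a row is emitted
         at the closing tag of each matching element. -->
    <entity name="archive" pk="id"
            url="http://localhost:9080/data/20090817070752.xml"
            processor="XPathEntityProcessor"
            forEach="/document/category/item | /document/category"
            stream="true">
      <!-- commonField="true" is meant to carry the category name over
           to the item rows that follow it. -->
      <field column="category" xpath="/document/category/name"
             commonField="true" />
      <field column="id" xpath="/document/category/item/id" />
      <field column="author" xpath="/document/category/item/author" />
    </entity>
  </document>
</dataConfig>
```

The /record | /record/mediaBlock case has the same shape, with /record in
place of /document/category and /record/mediaBlock in place of the item path.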


>>
>>
>>> Date: Thu, 10 Sep 2009 14:19:31 +0100
>>> To: solr-user@lucene.apache.org
>>> From: fer...@twig.me.uk
>>> Subject: RE: Extract info from parent node during data import
>>>
>>> >Hi Paul,
>>> >The forEach="/document/category/item | /document/category/name" didn't 
>>> >work (no categoryname was stored or indexed).
>>> >However forEach="/document/category/item | /document/category" seems to 
>>> >work well. I am not sure why category on its own works, but not 
>>> >category/name...
>>> >But thanks for the tip. It wasn't as painful as I thought it would be.
>>> >Venn
>>>
>>> Hmmm, I had bother with this. Although each occurrence of
>>> /document/category/item causes a new Solr document to be indexed, that
>>> document contained all the fields from the parent element as well.
>>>
>>> Did you see this?
>>>
>>> >
>>> >> From: noble.p...@corp.aol.com
>>> >> Date: Thu, 10 Sep 2009 09:58:21 +0530
>>> >> Subject: Re: Extract info from parent node during data import
>>> >> To: solr-user@lucene.apache.org
>>> >>
>>> >> try this
>>> >>
>>> >> add two xpaths in your forEach
>>> >>
>>> >> forEach="/document/category/item | /document/category/name"
>>> >>
>>> >> and add a field as follows
>>> >>
>>> >> <field column="categoryname" xpath="/document/category/name"
>>> >> commonField="true"/>
>>> >>
>>> >> Please try it out and let me know.
>>> >>
>>> >> On Thu, Sep 10, 2009 at 7:30 AM, venn hardy <venn.ha...@hotmail.com> 
>>> >> wrote:
>>> >> >
>>> >> > Hello,
>>> >> >
>>> >> >
>>> >> >
>>> >> > I am using Solr 1.4 (from a nightly build) and its URLDataSource in
>>> >> > conjunction with the XPathEntityProcessor. I have successfully 
>>> >> > imported XML content, but I think I may have found a limitation when 
>>> >> > it comes to the commonField attribute in the DataImportHandler.
>>> >> >
>>> >> >
>>> >> >
>>> >> > Before writing my own parser to read in a whole XML document, I 
>>> >> > thought I'd post the question here (since I got some great advice last 
>>> >> > time).
>>> >> >
>>> >> >
>>> >> >
>>> >> > The bulk of my content is contained within each <item> tag. However, 
>>> >> > each item has a parent called <category> and each category has a name 
>>> >> > which I would like to import. In my forEach loop I specify the 
>>> >> > /document/category/item as the collection of items I am interested in. 
>>> >> > Is there any way to extract an element from underneath a parent node?
>>> >> > To be more specific (see the example XML below), I would like to index
>>> >> > the following:
>>> >> >
>>> >> > - category: Category 1; id: 1; author: Author 1
>>> >> >
>>> >> > - category: Category 1; id: 2; author: Author 2
>>> >> >
>>> >> > - category: Category 2; id: 3; author: Author 3
>>> >> >
>>> >> > - category: Category 2; id: 4; author: Author 4
>>> >> >
>>> >> >
>>> >> >
>>> >> > Any ideas on how I can get to a parent node from within a child during 
>>> >> > data import? If it can't be done, what do you suggest would be the best
>>> >> > way so I can keep using the DataImportHandler... would XSLT be a good 
>>> >> > idea to 'flatten out' the structure a bit?
>>> >> >
>>> >> >
>>> >> >
>>> >> > Thanks
>>> >> >
>>> >> >
>>> >> >
>>> >> > This is what my XML document looks like:
>>> >> >
>>> >> > <document>
>>> >> > <category>
>>> >> > <name>Category 1</name>
>>> >> > <item>
>>> >> > <id>1</id>
>>> >> > <author>Author 1</author>
>>> >> > </item>
>>> >> > <item>
>>> >> > <id>2</id>
>>> >> > <author>Author 2</author>
>>> >> > </item>
>>> >> > </category>
>>> >> > <category>
>>> >> > <name>Category 2</name>
>>> >> > <item>
>>> >> > <id>3</id>
>>> >> > <author>Author 3</author>
>>> >> > </item>
>>> >> > <item>
>>> >> > <id>4</id>
>>> >> > <author>Author 4</author>
>>> >> > </item>
>>> >> > </category>
>>> >> > </document>
>>> >> >
>>> >> >
>>> >> >
>>> >> > And this is what my dataConfig looks like:
>>> >> > <dataConfig>
>>> >> > <dataSource type="URLDataSource" />
>>> >> > <document>
>>> >> > <entity name="archive" pk="id" 
>>> >> > url="http://localhost:9080/data/20090817070752.xml"
>>> >> > processor="XPathEntityProcessor" forEach="/document/category/item" 
>>> >> > transformer="DateFormatTransformer" stream="true" 
>>> >> > dataSource="dataSource">
>>> >> > <field column="category" xpath="/document/category/name" 
>>> >> > commonField="true" />
>>> >> > <field column="id" xpath="/document/category/item/id" />
>>> >> > <field column="author" xpath="/document/category/item/author" />
>>> >> > </entity>
>>> >> > </document>
>>> >> > </dataConfig>
>>> >> >
>>> >> >
>>> >> >
>>> >> > This is how I have specified my schema
>>> >> > <fields>
>>> >> > <field name="id" type="string" indexed="true" stored="true" 
>>> >> > required="true" />
>>> >> > <field name="author" type="string" indexed="true" stored="true"/>
>>> >> > <field name="category" type="string" indexed="true" stored="true"/>
>>> >> > </fields>
>>> >> >
>>> >> > <uniqueKey>id</uniqueKey>
>>> >> > <defaultSearchField>id</defaultSearchField>
>>> >> >

-- 

===============================================================
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================
