Re: Extract info from parent node during data import

Noble Paul നോബിള്‍ नोब्ळ् Sat, 12 Sep 2009 02:09:23 -0700

On Sat, Sep 12, 2009 at 12:24 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:
>>On Fri, Sep 11, 2009 at 6:48 AM, venn hardy <venn.ha...@hotmail.com> wrote:
>>>
>>> Hi Fergus,
>>>
>>> When I debugged in the development console 
>>> http://localhost:9080/solr/admin/dataimport.jsp?handler=/dataimport
>>>
>>> I had no problems. Each category/item seems to be only indexed once, and no 
>>> parent fields are available (except the category name).
>>>
>>> I am not entirely sure how the forEach statement works, but my 
>>> interpretation of forEach="/document/category/item | /document/category" is 
>>> something like this:
>>>
>>> 1. Whenever DIH encounters a /document/category, it will extract the
>>>    /document/category/name field as a common field.
>>> 2. Whenever DIH encounters a /document/category/item, it will extract all of
>>>    the item fields.
>>> 3. When all fields have been encountered, save the document in Solr and go
>>>    to the next category/item.
>>
>>/document/category/item | /document/category
>>
>>means there are two paths which trigger a new doc (it is possible to
>>have more). Whenever DIH encounters the closing tag of one of those
>>xpaths, it emits all the fields it collected since the opening of the
>>same tag. After that it clears all the fields it collected since the
>>opening of the tag.
>>
>>If there are fields it collected before the opening of the same tag,
>>it retains them.
>
>
> Nice and clear, but that is not what I see.
>
> With my test case with forEach="/record | /record/mediaBlock"
> I see that each /record/mediaBlock "document" indexed contains all
> fields from the parent "/record" document as well. A search over
> mediaBlocks returns lots of extra fields from the parent which did
> not have the commonField attribute. I will try to produce a test case.


Yes, it does. /record/mediaBlock will have all the fields collected
from /record as well. It is by design.
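
For reference, a minimal sketch of the configuration this thread converges on for the category/item sample quoted further down. It is the dataConfig from the original post with only the forEach attribute changed to the two-path form that was reported to work. Everything else (URL, entity name, transformer, dataSource reference) is copied from that post unchanged, so treat it as an illustration rather than a verified setup.

```xml
<dataConfig>
  <dataSource type="URLDataSource" />
  <document>
    <!-- forEach lists two xpaths separated by "|"; each one triggers a record. -->
    <entity name="archive" pk="id"
            url="http://localhost:9080/data/20090817070752.xml"
            processor="XPathEntityProcessor"
            forEach="/document/category/item | /document/category"
            transformer="DateFormatTransformer" stream="true"
            dataSource="dataSource">
      <!-- Collected when <category> opens, before any <item>, so it is
           still held when each item record is emitted. -->
      <field column="category" xpath="/document/category/name"
             commonField="true" />
      <!-- Collected inside <item>; cleared again after each </item>. -->
      <field column="id" xpath="/document/category/item/id" />
      <field column="author" xpath="/document/category/item/author" />
    </entity>
  </document>
</dataConfig>
```

Going by the rule quoted above, each closing item tag would emit id and author together with the category name collected earlier, which matches the four records asked for in the original post. How the extra record triggered by the closing category tag itself is handled is not covered in this thread.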
>
>
>>>
>>>
>>>> Date: Thu, 10 Sep 2009 14:19:31 +0100
>>>> To: solr-user@lucene.apache.org
>>>> From: fer...@twig.me.uk
>>>> Subject: RE: Extract info from parent node during data import
>>>>
>>>> >Hi Paul,
>>>> >The forEach="/document/category/item | /document/category/name" didn't 
>>>> >work (no categoryname was stored or indexed).
>>>> >However forEach="/document/category/item | /document/category" seems to 
>>>> >work well. I am not sure why category on its own works, but not 
>>>> >category/name...
>>>> >But thanks for the tip. It wasn't as painful as I thought it would be.
>>>> >Venn
>>>>
>>>> Hmmm, I had bother with this. Although each occurrence of
>>>> /document/category/item causes a new Solr document to be indexed,
>>>> that document contained all the fields from the parent element
>>>> as well.
>>>>
>>>> Did you see this?
>>>>
>>>> >
>>>> >> From: noble.p...@corp.aol.com
>>>> >> Date: Thu, 10 Sep 2009 09:58:21 +0530
>>>> >> Subject: Re: Extract info from parent node during data import
>>>> >> To: solr-user@lucene.apache.org
>>>> >>
>>>> >> try this
>>>> >>
>>>> >> add two xpaths in your forEach
>>>> >>
>>>> >> forEach="/document/category/item | /document/category/name"
>>>> >>
>>>> >> and add a field as follows
>>>> >>
>>>> >> <field column="categoryname" xpath="/document/category/name"
>>>> >> commonField="true"/>
>>>> >>
>>>> >> Please try it out and let me know.
>>>> >>
>>>> >> On Thu, Sep 10, 2009 at 7:30 AM, venn hardy <venn.ha...@hotmail.com> 
>>>> >> wrote:
>>>> >> >
>>>> >> > Hello,
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > I am using Solr 1.4 (from a nightly build) and its URLDataSource in 
>>>> >> > conjunction with the XPathEntityProcessor. I have successfully 
>>>> >> > imported XML content, but I think I may have found a limitation when 
>>>> >> > it comes to the commonField attribute in the DataImportHandler.
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > Before writing my own parser to read in a whole XML document, I 
>>>> >> > thought I'd post the question here (since I got some great advice 
>>>> >> > last time).
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > The bulk of my content is contained within each <item> tag. However, 
>>>> >> > each item has a parent called <category> and each category has a name 
>>>> >> > which I would like to import. In my forEach loop I specify the 
>>>> >> > /document/category/item as the collection of items I am interested 
>>>> >> > in. Is there any way to extract an element from underneath a parent 
>>>> >> > node? To be more specific (see the example XML below), I would like to 
>>>> >> > index the following:
>>>> >> >
>>>> >> > - category: Category 1; id: 1; author: Author 1
>>>> >> >
>>>> >> > - category: Category 1; id: 2; author: Author 2
>>>> >> >
>>>> >> > - category: Category 2; id: 3; author: Author 3
>>>> >> >
>>>> >> > - category: Category 2; id: 4; author: Author 4
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > Any ideas on how I can get to a parent node from within a child 
>>>> >> > during data import? If it can't be done, what do you suggest would be 
>>>> >> > the best way to keep using the DataImportHandler? Would XSLT 
>>>> >> > be a good idea to 'flatten out' the structure a bit?
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > Thanks
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > This is what my XML document looks like:
>>>> >> >
>>>> >> > <document>
>>>> >> >   <category>
>>>> >> >     <name>Category 1</name>
>>>> >> >     <item>
>>>> >> >       <id>1</id>
>>>> >> >       <author>Author 1</author>
>>>> >> >     </item>
>>>> >> >     <item>
>>>> >> >       <id>2</id>
>>>> >> >       <author>Author 2</author>
>>>> >> >     </item>
>>>> >> >   </category>
>>>> >> >   <category>
>>>> >> >     <name>Category 2</name>
>>>> >> >     <item>
>>>> >> >       <id>3</id>
>>>> >> >       <author>Author 3</author>
>>>> >> >     </item>
>>>> >> >     <item>
>>>> >> >       <id>4</id>
>>>> >> >       <author>Author 4</author>
>>>> >> >     </item>
>>>> >> >   </category>
>>>> >> > </document>
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > And this is what my dataConfig looks like:
>>>> >> > <dataConfig>
>>>> >> >   <dataSource type="URLDataSource" />
>>>> >> >   <document>
>>>> >> >     <entity name="archive" pk="id"
>>>> >> >             url="http://localhost:9080/data/20090817070752.xml"
>>>> >> >             processor="XPathEntityProcessor"
>>>> >> >             forEach="/document/category/item"
>>>> >> >             transformer="DateFormatTransformer" stream="true"
>>>> >> >             dataSource="dataSource">
>>>> >> >       <field column="category" xpath="/document/category/name"
>>>> >> >              commonField="true" />
>>>> >> >       <field column="id" xpath="/document/category/item/id" />
>>>> >> >       <field column="author" xpath="/document/category/item/author" />
>>>> >> >     </entity>
>>>> >> >   </document>
>>>> >> > </dataConfig>
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > This is how I have specified my schema
>>>> >> > <fields>
>>>> >> >   <field name="id" type="string" indexed="true" stored="true"
>>>> >> >          required="true" />
>>>> >> >   <field name="author" type="string" indexed="true" stored="true"/>
>>>> >> >   <field name="category" type="string" indexed="true" stored="true"/>
>>>> >> > </fields>
>>>> >> >
>>>> >> > <uniqueKey>id</uniqueKey>
>>>> >> > <defaultSearchField>id</defaultSearchField>
>>>> >> >
>
> --
>
> ===============================================================
> Fergus McMenemie               Email:fer...@twig.me.uk
> Techmore Ltd                   Phone:(UK) 07721 376021
>
> Unix/Mac/Intranets             Analyst Programmer
> ===============================================================
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com
