>On Fri, Sep 11, 2009 at 6:48 AM, venn hardy <venn.ha...@hotmail.com> wrote:
>>
>> Hi Fergus,
>>
>> When I debugged in the development console
>> http://localhost:9080/solr/admin/dataimport.jsp?handler=/dataimport
>>
>> I had no problems. Each category/item seems to be indexed only once, and no
>> parent fields are available (except the category name).
>>
>> I am not entirely sure how the forEach statement works, but my
>> interpretation of forEach="/document/category/item | /document/category" is
>> something like this:
>>
>> 1. Whenever DIH encounters a document/category it will extract the
>> /document/category/name field as a common field
>> 2. Whenever DIH encounters a document/category/item it will extract all of
>> the item fields.
>> 3. When all fields have been encountered, save the document in solr and go
>> to the next category/item
>
>/document/category/item | /document/category
>
>means there are two paths which trigger a new doc (it is possible to
>have more). Whenever it encounters the closing tag of that xpath, it
>emits all the fields it collected since the opening of the same tag.
>After that it clears all the fields it collected since the opening of
>the tag.
>
>If there are fields it collected before the opening of the same tag, it
>retains them.
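
As a reading aid, the forEach behaviour described in the quoted explanation
above can be modelled in a few lines of Python. This is only a toy sketch of
that description, not Solr's actual XPathEntityProcessor code, and it says
nothing about what Solr really does (the reply below reports different
behaviour). The names emit_rows, SAMPLE, FOR_EACH and FIELDS are made up for
this illustration; the XML and the xpaths are the ones posted in this thread.

```python
import xml.etree.ElementTree as ET

# Sample document from the original post in this thread.
SAMPLE = """<document>
  <category>
    <name>Category 1</name>
    <item><id>1</id><author>Author 1</author></item>
    <item><id>2</id><author>Author 2</author></item>
  </category>
  <category>
    <name>Category 2</name>
    <item><id>3</id><author>Author 3</author></item>
    <item><id>4</id><author>Author 4</author></item>
  </category>
</document>"""

# forEach="/document/category/item | /document/category"
FOR_EACH = {"/document/category/item", "/document/category"}

# xpath -> column name, taken from the posted dataConfig.
FIELDS = {
    "/document/category/name": "category",
    "/document/category/item/id": "id",
    "/document/category/item/author": "author",
}


def emit_rows(xml_text, for_each, fields):
    """Toy model (an assumption, not Solr code) of the described rule:
    at the closing tag of a forEach path, emit every field collected so far,
    then forget only the fields collected since that tag was opened."""
    rows = []
    collected = []  # (column, value) pairs in document order

    def walk(elem, parent_path):
        path = parent_path + "/" + elem.tag
        mark = len(collected)  # fields after this index belong to this tag
        if path in fields and elem.text and elem.text.strip():
            collected.append((fields[path], elem.text.strip()))
        for child in elem:
            walk(child, path)
        if path in for_each:  # "closing tag" of a forEach path
            rows.append(dict(collected))  # emit, including retained fields
            del collected[mark:]  # clear what was collected inside this tag

    walk(ET.fromstring(xml_text), "")
    return rows


if __name__ == "__main__":
    for row in emit_rows(SAMPLE, FOR_EACH, FIELDS):
        print(row)
```

Under this model the sample yields six rows: one per item, each carrying the
category name collected before the item was opened, plus one row per
category that holds only the category name.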

Nice and clear, but that is not what I see. In my test case, with
forEach="/record | /record/mediaBlock", each /record/mediaBlock "document"
that is indexed contains all the fields from the parent "/record" document
as well. A search over mediaBlocks returns lots of extra fields from the
parent which did not have the commonField attribute. I will try to produce
a test case.

>>
>>
>>> Date: Thu, 10 Sep 2009 14:19:31 +0100
>>> To: solr-user@lucene.apache.org
>>> From: fer...@twig.me.uk
>>> Subject: RE: Extract info from parent node during data import
>>>
>>>
>>> >Hi Paul,
>>> >The forEach="/document/category/item | /document/category/name" didn't
>>> >work (no categoryname was stored or indexed).
>>> >However forEach="/document/category/item | /document/category" seems to
>>> >work well. I am not sure why category on its own works, but not
>>> >category/name...
>>> >But thanks for the tip. It wasn't as painful as I thought it would be.
>>> >Venn
>>>
>>> Hmmm, I had bother with this. Although each occurrence of
>>> /document/category/item
>>> causes a new solr document to be indexed, that document contained all the
>>> fields from
>>> the parent element as well.
>>>
>>> Did you see this?
>>>
>>> >
>>> >> From: noble.p...@corp.aol.com
>>> >> Date: Thu, 10 Sep 2009 09:58:21 +0530
>>> >> Subject: Re: Extract info from parent node during data import
>>> >> To: solr-user@lucene.apache.org
>>> >>
>>> >> try this
>>> >>
>>> >> add two xpaths in your forEach
>>> >>
>>> >> forEach="/document/category/item | /document/category/name"
>>> >>
>>> >> and add a field as follows
>>> >>
>>> >> <field column="catgoryname" xpath ="/document/category/name"
>>> >> commonField="true"/>
>>> >>
>>> >> Please try it out and let me know.
>>> >>
>>> >> On Thu, Sep 10, 2009 at 7:30 AM, venn hardy <venn.ha...@hotmail.com>
>>> >> wrote:
>>> >> >
>>> >> > Hello,
>>> >> >
>>> >> > I am using SOLR 1.4 (from nightly build) and its URLDataSource in
>>> >> > conjunction with the XPathEntityProcessor. I have successfully
>>> >> > imported XML content, but I think I may have found a limitation when
>>> >> > it comes to the commonField attribute in the DataImportHandler.
>>> >> >
>>> >> > Before writing my own parser to read in a whole XML document, I
>>> >> > thought I'd post the question here (since I got some great advice last
>>> >> > time).
>>> >> >
>>> >> > The bulk of my content is contained within each <item> tag. However,
>>> >> > each item has a parent called <category> and each category has a name
>>> >> > which I would like to import. In my forEach loop I specify the
>>> >> > /document/category/item as the collection of items I am interested in.
>>> >> > Is there any way to extract an element from underneath a parent node?
>>> >> > To be more specific (see the example XML below), I would like to index
>>> >> > the following:
>>> >> >
>>> >> > - category: Category 1; id: 1; author: Author 1
>>> >> >
>>> >> > - category: Category 1; id: 2; author: Author 2
>>> >> >
>>> >> > - category: Category 2; id: 3; author: Author 3
>>> >> >
>>> >> > - category: Category 2; id: 4; author: Author 4
>>> >> >
>>> >> > Any ideas on how I can get to a parent node from within a child during
>>> >> > data import? If it can't be done, what do you suggest would be the best
>>> >> > way so I can keep using the DataImportHandler... would XSLT be a good
>>> >> > idea to 'flatten out' the structure a bit?
>>> >> >
>>> >> > Thanks
>>> >> >
>>> >> > This is what my XML document looks like:
>>> >> >
>>> >> > <document>
>>> >> >   <category>
>>> >> >     <name>Category 1</name>
>>> >> >     <item>
>>> >> >       <id>1</id>
>>> >> >       <author>Author 1</author>
>>> >> >     </item>
>>> >> >     <item>
>>> >> >       <id>2</id>
>>> >> >       <author>Author 2</author>
>>> >> >     </item>
>>> >> >   </category>
>>> >> >   <category>
>>> >> >     <name>Category 2</name>
>>> >> >     <item>
>>> >> >       <id>3</id>
>>> >> >       <author>Author 3</author>
>>> >> >     </item>
>>> >> >     <item>
>>> >> >       <id>4</id>
>>> >> >       <author>Author 4</author>
>>> >> >     </item>
>>> >> >   </category>
>>> >> > </document>
>>> >> >
>>> >> > And this is what my dataConfig looks like:
>>> >> > <dataConfig>
>>> >> >   <dataSource type="URLDataSource" />
>>> >> >   <document>
>>> >> >     <entity name="archive" pk="id"
>>> >> >       url="http://localhost:9080/data/20090817070752.xml"
>>> >> >       processor="XPathEntityProcessor" forEach="/document/category/item"
>>> >> >       transformer="DateFormatTransformer" stream="true"
>>> >> >       dataSource="dataSource">
>>> >> >       <field column="category" xpath="/document/category/name"
>>> >> >         commonField="true" />
>>> >> >       <field column="id" xpath="/document/category/item/id" />
>>> >> >       <field column="author" xpath="/document/category/item/author" />
>>> >> >     </entity>
>>> >> >   </document>
>>> >> > </dataConfig>
>>> >> >
>>> >> > This is how I have specified my schema
>>> >> > <fields>
>>> >> >   <field name="id" type="string" indexed="true" stored="true"
>>> >> >     required="true" />
>>> >> >   <field name="author" type="string" indexed="true" stored="true"/>
>>> >> >   <field name="category" type="string" indexed="true" stored="true"/>
>>> >> > </fields>
>>> >> >
>>> >> > <uniqueKey>id</uniqueKey>
>>> >> > <defaultSearchField>id</defaultSearchField>
>>> >> >

-- 
===============================================================
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021
Unix/Mac/Intranets             Analyst Programmer
===============================================================