On Sat, Sep 12, 2009 at 12:24 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:
>>On Fri, Sep 11, 2009 at 6:48 AM, venn hardy <venn.ha...@hotmail.com> wrote:
>>>
>>> Hi Fergus,
>>>
>>> When I debugged in the development console
>>> http://localhost:9080/solr/admin/dataimport.jsp?handler=/dataimport
>>> I had no problems. Each category/item seems to be indexed only once, and
>>> no parent fields are available (except the category name).
>>>
>>> I am not entirely sure how the forEach statement works, but my
>>> interpretation of forEach="/document/category/item | /document/category"
>>> is something like this:
>>>
>>> 1. Whenever DIH encounters a document/category it will extract the
>>>    /document/category/name field as a common field.
>>> 2. Whenever DIH encounters a document/category/item it will extract all
>>>    of the item fields.
>>> 3. When all fields have been encountered, save the document in Solr and
>>>    go to the next category/item.
>>
>>/document/category/item | /document/category
>>
>>means there are two paths which trigger a new doc (it is possible to
>>have more). Whenever it encounters the closing tag of one of those xpaths,
>>it emits all the fields it collected since the opening of the same tag.
>>After that it clears all the fields it collected since the opening of
>>the tag.
>>
>>If there are fields it collected before the opening of that tag, it
>>retains them.
>
> Nice and clear, but that is not what I see.
>
> With my test case with forEach="/record | /record/mediaBlock"
> I see that each /record/mediaBlock "document" indexed contains all
> fields from the parent "/record" document as well. A search over
> mediaBlocks returns lots of extra fields from the parent which did not
> have the commonField attribute. I will try and produce a test case.

Yes, it does. /record/mediaBlock will have all the fields collected from
/record as well. It is by design.

>
>>>
>>>> Date: Thu, 10 Sep 2009 14:19:31 +0100
>>>> To: solr-user@lucene.apache.org
>>>> From: fer...@twig.me.uk
>>>> Subject: RE: Extract info from parent node during data import
>>>>
>>>> >Hi Paul,
>>>> >The forEach="/document/category/item | /document/category/name" didn't
>>>> >work (no categoryname was stored or indexed).
>>>> >However forEach="/document/category/item | /document/category" seems to
>>>> >work well. I am not sure why category on its own works, but not
>>>> >category/name...
>>>> >But thanks for the tip. It wasn't as painful as I thought it would be.
>>>> >Venn
>>>>
>>>> Hmmm, I had bother with this. Although each occurrence of
>>>> /document/category/item causes a new Solr document to be indexed, that
>>>> document contained all the fields from the parent element as well.
>>>>
>>>> Did you see this?
>>>>
>>>> >
>>>> >> From: noble.p...@corp.aol.com
>>>> >> Date: Thu, 10 Sep 2009 09:58:21 +0530
>>>> >> Subject: Re: Extract info from parent node during data import
>>>> >> To: solr-user@lucene.apache.org
>>>> >>
>>>> >> try this
>>>> >>
>>>> >> add two xpaths in your forEach
>>>> >>
>>>> >> forEach="/document/category/item | /document/category/name"
>>>> >>
>>>> >> and add a field as follows
>>>> >>
>>>> >> <field column="catgoryname" xpath="/document/category/name"
>>>> >> commonField="true"/>
>>>> >>
>>>> >> Please try it out and let me know.
>>>> >>
>>>> >> On Thu, Sep 10, 2009 at 7:30 AM, venn hardy <venn.ha...@hotmail.com>
>>>> >> wrote:
>>>> >> >
>>>> >> > Hello,
>>>> >> >
>>>> >> > I am using Solr 1.4 (from a nightly build) and its URLDataSource in
>>>> >> > conjunction with the XPathEntityProcessor. I have successfully
>>>> >> > imported XML content, but I think I may have found a limitation when
>>>> >> > it comes to the commonField attribute in the DataImportHandler.
>>>> >> >
>>>> >> > Before writing my own parser to read in a whole XML document, I
>>>> >> > thought I'd post the question here (since I got some great advice
>>>> >> > last time).
>>>> >> >
>>>> >> > The bulk of my content is contained within each <item> tag. However,
>>>> >> > each item has a parent called <category> and each category has a name
>>>> >> > which I would like to import. In my forEach loop I specify
>>>> >> > /document/category/item as the collection of items I am interested
>>>> >> > in. Is there any way to extract an element from underneath a parent
>>>> >> > node? To be more specific (see the example XML below), I would like
>>>> >> > to index the following:
>>>> >> >
>>>> >> > - category: Category 1; id: 1; author: Author 1
>>>> >> > - category: Category 1; id: 2; author: Author 2
>>>> >> > - category: Category 2; id: 3; author: Author 3
>>>> >> > - category: Category 2; id: 4; author: Author 4
>>>> >> >
>>>> >> > Any ideas on how I can get to a parent node from within a child
>>>> >> > during data import? If it can't be done, what do you suggest would be
>>>> >> > the best way so I can keep using the DataImportHandler... would XSLT
>>>> >> > be a good idea to 'flatten out' the structure a bit?
>>>> >> >
>>>> >> > Thanks
>>>> >> >
>>>> >> > This is what my XML document looks like:
>>>> >> >
>>>> >> > <document>
>>>> >> >   <category>
>>>> >> >     <name>Category 1</name>
>>>> >> >     <item>
>>>> >> >       <id>1</id>
>>>> >> >       <author>Author 1</author>
>>>> >> >     </item>
>>>> >> >     <item>
>>>> >> >       <id>2</id>
>>>> >> >       <author>Author 2</author>
>>>> >> >     </item>
>>>> >> >   </category>
>>>> >> >   <category>
>>>> >> >     <name>Category 2</name>
>>>> >> >     <item>
>>>> >> >       <id>3</id>
>>>> >> >       <author>Author 3</author>
>>>> >> >     </item>
>>>> >> >     <item>
>>>> >> >       <id>4</id>
>>>> >> >       <author>Author 4</author>
>>>> >> >     </item>
>>>> >> >   </category>
>>>> >> > </document>
>>>> >> >
>>>> >> > And this is what my dataConfig looks like:
>>>> >> >
>>>> >> > <dataConfig>
>>>> >> >   <dataSource type="URLDataSource" />
>>>> >> >   <document>
>>>> >> >     <entity name="archive" pk="id"
>>>> >> >       url="http://localhost:9080/data/20090817070752.xml"
>>>> >> >       processor="XPathEntityProcessor" forEach="/document/category/item"
>>>> >> >       transformer="DateFormatTransformer" stream="true"
>>>> >> >       dataSource="dataSource">
>>>> >> >       <field column="category" xpath="/document/category/name"
>>>> >> >         commonField="true" />
>>>> >> >       <field column="id" xpath="/document/category/item/id" />
>>>> >> >       <field column="author" xpath="/document/category/item/author" />
>>>> >> >     </entity>
>>>> >> >   </document>
>>>> >> > </dataConfig>
>>>> >> >
>>>> >> > This is how I have specified my schema:
>>>> >> >
>>>> >> > <fields>
>>>> >> >   <field name="id" type="string" indexed="true" stored="true"
>>>> >> >     required="true" />
>>>> >> >   <field name="author" type="string" indexed="true" stored="true"/>
>>>> >> >   <field name="category" type="string" indexed="true" stored="true"/>
>>>> >> > </fields>
>>>> >> >
>>>> >> > <uniqueKey>id</uniqueKey>
>>>> >> > <defaultSearchField>id</defaultSearchField>
>>>> >> >
>
> --
>
> ===============================================================
> Fergus McMenemie               Email:fer...@twig.me.uk
> Techmore Ltd                   Phone:(UK) 07721 376021
>
> Unix/Mac/Intranets             Analyst Programmer
> ===============================================================
>

--
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com
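
For anyone reading this thread later, here is a minimal sketch of the dataConfig quoted above, changed only to use the forEach value that Venn reported as working. It is untested here and simply rearranges what is already quoted in this thread; the url, entity name, and column names are all Venn's. Going by the explanation in this thread, each item row should also carry the category name collected from the enclosing category element, and a separate row is triggered at the close of each category element as well.

```xml
<dataConfig>
  <dataSource type="URLDataSource" />
  <document>
    <!-- Everything except forEach is copied unchanged from the quoted config.
         forEach now lists two paths, as reported working in this thread. -->
    <entity name="archive" pk="id"
            url="http://localhost:9080/data/20090817070752.xml"
            processor="XPathEntityProcessor"
            forEach="/document/category/item | /document/category"
            transformer="DateFormatTransformer" stream="true"
            dataSource="dataSource">
      <!-- Collected inside <category> before any <item> opens, so per the
           thread it is retained for each child item row. -->
      <field column="category" xpath="/document/category/name" commonField="true" />
      <!-- Collected inside <item>; emitted and cleared when </item> closes. -->
      <field column="id" xpath="/document/category/item/id" />
      <field column="author" xpath="/document/category/item/author" />
    </entity>
  </document>
</dataConfig>
```

Since fields from the parent path are carried into child rows by design, any parent field that should not appear on the item documents would have to be left out of the entity's field list or handled separately; how best to do that is not covered in this thread.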