Hi Grant,

After comparing my solrconfig.xml with the one used by the example, I found
the key difference: I didn't have
<str name="captureAttr">true</str> in the defaults for the ERH.  Commenting
out this line in the example configuration causes the example to display the
same behaviour that I'm seeing.

I've added the option back in and it all works as expected, but this seems to
be a change in the configuration.  I didn't have captureAttr enabled because I
don't have it enabled in my 1.4 production environment (I'm just checking
the upgrade process at the moment) and this problem doesn't happen for me
there.  Is the change deliberate?
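
For reference, here is a sketch of the handler defaults with the option
restored.  It assumes the handler definition from my original message quoted
below; the captureAttr entry is the only addition, and the comment on it
reflects only what I saw in my own tests.

```xml
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content_type">attr_source_content_type</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <!-- Restored from the example configuration.  Without this line the
         metadata appeared inside the extracted content in my 3.1 tests. -->
    <str name="captureAttr">true</str>
  </lst>
</requestHandler>
```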

Thanks,
Liam

On 13 April 2011 23:25, Grant Ingersoll <grant.ingers...@gmail.com> wrote:

>
> On Apr 13, 2011, at 12:06 AM, Liam O'Boyle wrote:
>
> > Afternoon,
> >
> > After an upgrade to Solr 3.1 which has largely been very smooth and
> > painless, I'm having a minor issue with the ExtractingRequestHandler.
> >
> > The problem is that it's inserting metadata into the extracted
> > content, as well as mapping it to a dynamic field.  Previously the
> > same configuration only mapped it to a dynamic field and I'm not sure
> > how it's managing to add it into my content as well.
> >
> > The requestHandler configuration is as follows
> >
> > <requestHandler name="/update/extract"
> >              startup="lazy"
> >              class="solr.extraction.ExtractingRequestHandler" >
> > <lst name="defaults">
> >  <!-- All the main content goes into "text"... if you need to return
> >       the extracted text or do highlighting, use a stored field.
> > -->
> > <str name="fmap.content_type">attr_source_content_type</str>
> > <str name="lowernames">true</str>
> > <str name="uprefix">ignored_</str>
> > </lst>
> > </requestHandler>
> >
> > The schema has a dynamic field for attr_*, <dynamicField name="attr_*"
> > type="textgen" indexed="true" stored="true" multiValued="true" />.
> >
> > The request being submitted is (reformatted for readability, extracted
> > from the catalina log)
> >
> > literal.ib_extension=blarg
> > literal.ib_date=2010-09-09T21:41:30Z
> > literal.ib_custom2=custom2
> > resource.name=test.txt
> > literal.ib_custom3=custom3
> > literal.ib_authorid=1
> > literal.ib_custom1=custom1
> > literal.ib_custom6=custom6
> > literal.ib_custom7=custom7
> > literal.ib_custom4=custom4
> > literal.ib_linkid=1
> > literal.ib_custom5=custom5
> > literal.ib_tags=foo
> > literal.ib_tags=bar
> > literal.ib_tags=blarg
> > commit=true
> > literal.ib_permissionid=1
> > literal.ib_filters=1
> > literal.ib_filters=2
> > literal.ib_filters=3
> > literal.ib_description=My+Description
> > literal.ib_title=My+Title
> > json.nl=map
> > wt=json
> > literal.ib_realid=1
> > literal.ib_custom9=custom9
> > literal.ib_id=fb1
> > fmap.content=ib_content
> > literal.ib_custom8=custom8
> > literal.ib_type=foobar
> > uprefix=attr_
> > literal.ib_clientid=1
> >
> > After indexing, the ib_content field contains the contents of the
> > file, prefixed with "stream_content_type application/octet-stream
> > stream_size 971 Content-Encoding UTF-8 Content-Type text/plain
> > resourceName test.txt".  These have all been mapped to the dynamic
> > field, so I have attr_content_encoding, attr_source_content_type,
> > attr_stream_content_type and attr_stream_size all with their correct
> > values as well.
> >
> > There are no copyField parameters to add content from attr_* fields
> > into anything else and I've had no luck tracking down where this is
> > coming from.  Has there been some option added which controls this
> > behaviour?
>
>
>
> I'm not aware of anything changing here, other than we upgraded Tika.  Can
> you isolate the problem and share the test?  I tried it on trunk (I can get
> 3.1.0 if needed, but they should be the same in regards to the ERH) using
> the examples on the http://wiki.apache.org/solr/ExtractingRequestHandler page
> and I don't see the behavior.
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem docs using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>


-- 
Liam O'Boyle

IntelligenceBank Pty Ltd
Level 1, 31 Coventry Street Southbank, Victoria 3006, Australia
P:   +613 8618 7810   F:   +613 8618 7899   M: +61 403 88 66 44

*Awarded 2010 "Best New Business" and "Business of the Year" - Business3000
Awards*

