On Apr 13, 2011, at 12:06 AM, Liam O'Boyle wrote:

> Afternoon,
>
> After an upgrade to Solr 3.1 which has largely been very smooth and
> painless, I'm having a minor issue with the ExtractingRequestHandler.
>
> The problem is that it's inserting metadata into the extracted
> content, as well as mapping it to a dynamic field. Previously the
> same configuration only mapped it to a dynamic field and I'm not sure
> how it's managing to add it into my content as well.
>
> The requestHandler configuration is as follows
>
> <requestHandler name="/update/extract"
>     startup="lazy"
>     class="solr.extraction.ExtractingRequestHandler" >
>   <lst name="defaults">
>     <!-- All the main content goes into "text"... if you need to return
>       the extracted text or do highlighting, use a stored field.
>     -->
>     <str name="fmap.content_type">attr_source_content_type</str>
>     <str name="lowernames">true</str>
>     <str name="uprefix">ignored_</str>
>   </lst>
> </requestHandler>
>
> The schema has a dynamic field for attr_*, <dynamicField name="attr_*"
> type="textgen" indexed="true" stored="true" multiValued="true" />.
>
> The request being submitted is (reformatted for readability, extracted
> from the catalina log)
>
> literal.ib_extension=blarg
> literal.ib_date=2010-09-09T21:41:30Z
> literal.ib_custom2=custom2
> resource.name=test.txt
> literal.ib_custom3=custom3
> literal.ib_authorid=1
> literal.ib_custom1=custom1
> literal.ib_custom6=custom6
> literal.ib_custom7=custom7
> literal.ib_custom4=custom4
> literal.ib_linkid=1
> literal.ib_custom5=custom5
> literal.ib_tags=foo
> literal.ib_tags=bar
> literal.ib_tags=blarg
> commit=true
> literal.ib_permissionid=1
> literal.ib_filters=1
> literal.ib_filters=2
> literal.ib_filters=3
> literal.ib_description=My+Description
> literal.ib_title=My+Title
> json.nl=map
> wt=json
> literal.ib_realid=1
> literal.ib_custom9=custom9
> literal.ib_id=fb1
> fmap.content=ib_content
> literal.ib_custom8=custom8
> literal.ib_type=foobar
> uprefix=attr_
> literal.ib_clientid=1
>
> After indexing, the ib_content field contains the contents of the
> file, prefixed with "stream_content_type application/octet-stream
> stream_size 971 Content-Encoding UTF-8 Content-Type text/plain
> resourceName test.txt". These have all been mapped to the dynamic
> field, so I have attr_content_encoding, attr_source_content_type,
> attr_stream_content_type and attr_stream_size all with their correct
> values as well.
>
> There are no copyField parameters to add content from attr_* fields
> into anything else and I've had no luck tracking down where this is
> coming from. Has there been some option added which controls this
> behaviour?

I'm not aware of anything changing here, other than that we upgraded Tika. Can you isolate the problem and share the test? I tried it on trunk (I can get 3.1.0 if needed, but the two should be the same with regard to the ERH) using the examples on the http://wiki.apache.org/solr/ExtractingRequestHandler page, and I don't see the behavior.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search
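
One way to isolate the problem as requested above is to cut the request down to the parameters that matter and then check the indexed field for the metadata names from the report. The following is only a sketch under assumptions: the host, port, and path in SOLR_EXTRACT_URL are guesses (the report mentions only a catalina log), and build_extract_url and leaked_metadata are hypothetical helper names, not part of Solr or Tika.

```python
from urllib.parse import urlencode

# Assumption: Solr runs under Tomcat on its default port. Adjust as needed.
SOLR_EXTRACT_URL = "http://localhost:8080/solr/update/extract"

# Metadata names that showed up in front of the file contents in the report.
METADATA_NAMES = (
    "stream_content_type",
    "stream_size",
    "Content-Encoding",
    "Content-Type",
    "resourceName",
)


def build_extract_url(base_url=SOLR_EXTRACT_URL):
    """Build a reduced extract request using parameters from the report."""
    params = [
        ("literal.ib_id", "fb1"),
        ("resource.name", "test.txt"),
        ("fmap.content", "ib_content"),
        ("uprefix", "attr_"),
        ("commit", "true"),
        ("wt", "json"),
    ]
    return base_url + "?" + urlencode(params)


def leaked_metadata(indexed_content, original_text):
    """Return the metadata names found before the original text.

    If the original text is not found at all, the whole field is scanned.
    """
    idx = indexed_content.find(original_text)
    prefix = indexed_content if idx < 0 else indexed_content[:idx]
    return [name for name in METADATA_NAMES if name in prefix]


if __name__ == "__main__":
    # POST test.txt to this URL, then fetch ib_content and run the check.
    print(build_extract_url())
```

Posting test.txt to the printed URL and passing the stored ib_content value, together with the file's text, to leaked_metadata would show whether the prefix still appears with the reduced parameter set. An empty list would suggest that one of the dropped parameters is involved.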