On Apr 13, 2011, at 12:06 AM, Liam O'Boyle wrote:

> Afternoon,
> 
> After an upgrade to Solr 3.1 which has largely been very smooth and
> painless, I'm having a minor issue with the ExtractingRequestHandler.
> 
> The problem is that it's inserting metadata into the extracted
> content, as well as mapping it to a dynamic field.  Previously the
> same configuration only mapped it to a dynamic field and I'm not sure
> how it's managing to add it into my content as well.
> 
> The requestHandler configuration is as follows
> 
> <requestHandler name="/update/extract"
>              startup="lazy"
>              class="solr.extraction.ExtractingRequestHandler" >
> <lst name="defaults">
>  <!-- All the main content goes into "text"... if you need to return
>       the extracted text or do highlighting, use a stored field.
> -->
> <str name="fmap.content_type">attr_source_content_type</str>
> <str name="lowernames">true</str>
> <str name="uprefix">ignored_</str>
> </lst>
> </requestHandler>
> 
> The schema has a dynamic field for attr_*, <dynamicField name="attr_*"
> type="textgen" indexed="true" stored="true" multiValued="true" />.
> 
> The request being submitted is (reformatted for readability, extracted
> from the catalina log)
> 
> literal.ib_extension=blarg
> literal.ib_date=2010-09-09T21:41:30Z
> literal.ib_custom2=custom2
> resource.name=test.txt
> literal.ib_custom3=custom3
> literal.ib_authorid=1
> literal.ib_custom1=custom1
> literal.ib_custom6=custom6
> literal.ib_custom7=custom7
> literal.ib_custom4=custom4
> literal.ib_linkid=1
> literal.ib_custom5=custom5
> literal.ib_tags=foo
> literal.ib_tags=bar
> literal.ib_tags=blarg
> commit=true
> literal.ib_permissionid=1
> literal.ib_filters=1
> literal.ib_filters=2
> literal.ib_filters=3
> literal.ib_description=My+Description
> literal.ib_title=My+Title
> json.nl=map
> wt=json
> literal.ib_realid=1
> literal.ib_custom9=custom9
> literal.ib_id=fb1
> fmap.content=ib_content
> literal.ib_custom8=custom8
> literal.ib_type=foobar
> uprefix=attr_
> literal.ib_clientid=1
> 
> After indexing, the ib_content field contains the contents of the
> file, prefixed with "stream_content_type application/octet-stream
> stream_size 971 Content-Encoding UTF-8 Content-Type text/plain
> resourceName test.txt".  These have all been mapped to the dynamic
> field, so I have attr_content_encoding, attr_source_content_type,
> attr_stream_content_type and attr_stream_size all with their correct
> values as well.
> 
> There are no copyField parameters to add content from attr_* fields
> into anything else and I've had no luck tracking down where this is
> coming from.  Has there been some option added which controls this
> behaviour?



I'm not aware of anything changing here, other than that we upgraded Tika.  Can you
isolate the problem and share the test?  I tried it on trunk (I can get 3.1.0
if needed, but they should be the same with regard to the ERH) using the
examples on the http://wiki.apache.org/solr/ExtractingRequestHandler page, and I
don't see the behavior.
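
For reference, an isolated test of this kind could be sketched roughly as
follows. This is only an illustration, not a confirmed reproduction: the URL
assumes a stock example Solr on localhost:8983, the parameter list is a reduced
version of the one in the report, and the helper names (build_extract_url,
post_file, starts_with_metadata) are made up for this sketch.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: a stock example Solr install; adjust host, port and path as needed.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"

# Reduced parameter set taken from the report. A list of pairs (rather than a
# dict) keeps repeated keys such as literal.ib_tags.
PARAMS = [
    ("literal.ib_id", "fb1"),
    ("literal.ib_title", "My Title"),
    ("literal.ib_tags", "foo"),
    ("literal.ib_tags", "bar"),
    ("resource.name", "test.txt"),
    ("fmap.content", "ib_content"),
    ("uprefix", "attr_"),
    ("commit", "true"),
    ("wt", "json"),
]


def build_extract_url(base_url, params):
    """Return the extract URL with the parameters URL-encoded."""
    return base_url + "?" + urlencode(params)


def post_file(url, path, content_type="text/plain"):
    """POST the raw file body to Solr and return the response text."""
    with open(path, "rb") as f:
        body = f.read()
    req = Request(url, data=body, headers={"Content-Type": content_type})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8")


def starts_with_metadata(content):
    """True if a stored content value begins with the reported metadata prefix."""
    return content.lstrip().startswith("stream_content_type")
```

Calling post_file(build_extract_url(SOLR_EXTRACT_URL, PARAMS), "test.txt") and
then fetching the stored ib_content value for fb1 and passing it to
starts_with_metadata would show whether the prefix appears with this reduced
setup.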

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search
