Re: How can i instruct the Solr/ Solr Cell to output the original HTML document which was fed to it.?
Thank you for replying, sir!

I have two questions related to this:

1) Which request handler should I use in this case? 'ExtractingRequestHandler' strips the HTML content by default, and the default handler 'UpdateRequestHandler' does not accept HTML content.

2) How can I extract and index the META information in the HTML document separately?

Awaiting your reply.
Thank you!
Re: How can i instruct the Solr/ Solr Cell to output the original HTML document which was fed to it.?
Thank you for your help, Jack.

I just wanted to know whether there is any ready-made solution for this, because I really don't know how to extract meta information.

Awaiting your reply.
Thank you

On Tue, Feb 19, 2013 at 12:48 PM, Jack Krupansky wrote:

> Use the standard update handler and pass the entire HTML page as literal
> text in a Solr XML document for the field that has the HTML strip filter,
> but be sure to escape the HTML syntax (angle brackets, ampersands, etc.).
>
> You'll have to process meta information yourself.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Divyanand Tiwari
> Sent: Monday, February 18, 2013 10:52 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How can i instruct the Solr/ Solr Cell to output the original
> HTML document which was fed to it.?
>
> Thank you for replying sir !!!
>
> I have two queries related with this -
>
> 1) So in this case which request handler I have to use because
> 'ExtractingRequestHandler' by default strips the html content and the
> default handler 'UpdateRequestHandler' does not accept the HTML contents.
>
> 2) How can I 'Extract' & 'Index' META information in the HTML document
> separately.
>
> Awaiting your reply
> Thank you!!!

--
Regards,
Divyanand Tiwari
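For anyone reading this thread later, here is a minimal Python sketch of what Jack's suggestion quoted above could look like: escape the whole HTML page, put it into a Solr XML `<add>` document as literal text, and pull the `<meta>` tags out yourself. The field names `id`, `content` and the `meta_` prefix, as well as the helper names `MetaExtractor` and `build_solr_add_xml`, are assumptions made up for this illustration and are not from the thread; they would have to match the fields your schema actually defines. Treat it as a starting point, not a ready-made solution.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr


class MetaExtractor(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


def build_solr_add_xml(doc_id, html_text):
    """Return a Solr XML <add> message carrying the escaped HTML page.

    Field names ("id", "content", "meta_*") are hypothetical and must
    match whatever the schema really declares.
    """
    parser = MetaExtractor()
    parser.feed(html_text)
    parser.close()

    fields = [("id", doc_id), ("content", html_text)]
    fields += [("meta_" + name, value) for name, value in sorted(parser.meta.items())]

    lines = ["<add>", "  <doc>"]
    for name, value in fields:
        # escape() turns &, < and > into entities, so the HTML travels as plain text.
        lines.append("    <field name=%s>%s</field>" % (quoteattr(name), escape(value)))
    lines += ["  </doc>", "</add>"]
    return "\n".join(lines)


if __name__ == "__main__":
    page = ('<html><head><meta name="author" content="A & B"></head>'
            "<body><p>Hi</p></body></html>")
    print(build_solr_add_xml("doc1", page))
```

The resulting string is what you would send as the request body to the standard update handler. Because the page is stored as escaped text, the stored field value comes back as the original HTML once the XML is parsed.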
Re: How can i instruct the Solr/ Solr Cell to output the original HTML document which was fed to it.?
Hi Chris, thank you for replying.

My "content" field in the schema is stored="true" and indexed="false" because I am copying the "content" field into the "text" field, which is indexed="true" by default.

My problem is this: I am able to search the HTML documents I fed to Solr, but because the result returned by Tika/ExtractingRequestHandler is a stripped-down version of the HTML document, I am not able to present the document in its original format on my site. :(

Based on Jack's reply, I got the idea of writing my own request handler, and I am working on it. I'll post an update if I come up with a solution. Any help is most welcome!

Thank you all for your support!

On Fri, Feb 22, 2013 at 6:42 AM, Chris Hostetter wrote:

> : Hi everyone, i am new to solr technology and not getting a way to get back
> : the original HTML document with Hits highlighted into it. what
> : configuration and where i can do to instruct SolrCell/ Tika so that it does
> : not strips down the tags of HTML document in the content field.
>
> I _think_ what you want is simply to ensure that you have a "content"
> field in your schema which is stored="true" (and indexed="true" if you
> want to search on it directly) ... and then ExtractingRequestHandler will
> put the entire XHTML it generates from the documents you index into that
> field.
>
> http://wiki.apache.org/solr/ExtractingRequestHandler
>
> If that isn't what you had in mind, then you need to provide us with more
> details about what you've tried, what results you get, and how exactly
> those results differ from what you want to get.
>
> -Hoss

--
Regards,
Divyanand Tiwari