Re: Help storing + highlighting search results in PDF newspapers

Erick Erickson Fri, 11 Sep 2015 09:04:11 -0700

Yeah, there are a lot of moving parts to connect...

Let's see the highlighting configuration you're using. It should be
in your solrconfig.xml file, under the request handler you're
querying. Are you listing the field you want highlighted in the
hl.fl parameter?
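
To rule out the handler defaults, the highlighting parameters can also be
passed explicitly on the request. Below is a minimal Python sketch that
only builds such a request URL. The core name "newspapers", the host and
port, and the field name "content" are assumptions for illustration, not
taken from your setup. Note that the field named in hl.fl generally has to
be stored for the standard highlighter to return fragments.

```python
from urllib.parse import urlencode


def build_highlight_url(select_url, query, field, snippets=3, fragsize=150):
    """Return a Solr select URL with highlighting switched on explicitly."""
    params = {
        "q": query,
        "wt": "json",
        "hl": "true",             # enable highlighting for this request
        "hl.fl": field,           # field(s) to build highlight fragments from
        "hl.snippets": snippets,  # maximum fragments per field
        "hl.fragsize": fragsize,  # approximate fragment size in characters
    }
    return select_url + "?" + urlencode(params)


# Assumed endpoint and field name; adjust to the actual core and schema.
url = build_highlight_url(
    "http://localhost:8983/solr/newspapers/select",
    "content:flicka",
    "content",
)
print(url)
```

If fragments show up with explicit parameters but not without them, the
handler defaults in solrconfig.xml are the likely place to look.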



Unfortunately, getting specific fields populated is tricky, since
Tika has to deal with many file formats, each of which stores
metadata in its own way: Word is completely unrelated to PDF,
which is unrelated to (pick your file format here). But we can
deal with that once you have basic highlighting working.

I also tend to prefer doing my Tika parsing on a client. It gives
me more control over what happens and moves the load off the Solr
server. Here's a place to get started if you want to pursue that
avenue:

http://lucidworks.com/blog/indexing-with-solrj/
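
The linked post uses SolrJ (Java). As a rough illustration of the same
client-side idea, here is a Python sketch that builds one Solr document
from a PDF's filename plus text that has already been extracted elsewhere
(by Tika, for instance). The filename pattern
<title>_<YYYY-MM-DD>_v<volume>_n<number>.pdf, the field names, and the
example URL are all hypothetical; the text extraction itself is not shown.

```python
import json
import re
from pathlib import Path

# Hypothetical naming scheme: <title>_<YYYY-MM-DD>_v<volume>_n<number>.pdf
FILENAME_RE = re.compile(
    r"^(?P<title>.+)_(?P<date>\d{4}-\d{2}-\d{2})"
    r"_v(?P<volume>\d+)_n(?P<number>\d+)\.pdf$"
)


def build_solr_doc(pdf_path, text, base_url):
    """Build a Solr document dict from a filename and pre-extracted text."""
    name = Path(pdf_path).name
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError("filename does not follow the assumed pattern: " + name)
    return {
        "id": name,
        "title": match.group("title").replace("-", " "),
        "date": match.group("date") + "T00:00:00Z",  # Solr date format (UTC)
        "volume": int(match.group("volume")),
        "number": int(match.group("number")),
        "url": base_url.rstrip("/") + "/" + name,
        "content": text,  # plain text only, so no raw PDF data is indexed
    }


doc = build_solr_doc(
    "scans/Dagbladet_1923-05-14_v12_n87.pdf",  # made-up example file
    "text extracted on the client goes here",
    "http://example.org/pdfs",
)
payload = json.dumps([doc])  # JSON array body for Solr's update handler
print(payload)
```

The resulting JSON array can be POSTed to the core's /update handler with
Content-Type application/json, followed by a commit. Since only extracted
text goes into "content", the stored field holds text rather than PDF data.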

Best,
Erick


On Fri, Sep 11, 2015 at 8:47 AM, Colin 't Hart <co...@sharpheart.org> wrote:
> Hi,
>
> I'm having trouble negotiating the steep Solr learning curve...
>
> 1. I'm trying to store scanned and OCRed newspapers in PDF format into Solr
> for full-text searching.
> I've tried most (all?) of the examples and sample configurations that come
> with Solr 5.3.0 and I can upload the PDFs.
> Searching works, but for the life of me I can't get highlights in the
> results.
>
> I tried setting the "stored" attribute of the "_text_" and/or "content"
> fields to "true" but that didn't help -- just increased the size of the
> query response -- and lots of PDF data appeared in the response instead of
> just the text -- but the "highlighting" section of the response was still
> virtually empty (just lists matching documents, but no highlighted text
> fragments).
>
>
> Can someone point me in the direction of a sample config that will work?
>
>
> 2. After that's working I'd like to trim this down to a minimal schema with
> just
>
> * title
> * date
> * volume
> * number
> * URL (the PDFs themselves will be made available online for viewing using
> the same viewer.js that's embedded in Firefox)
>
> as metadata (as well as the required metadata such as id and _version_).
>
> I want to extract these metadata fields from the filenames -- I presume
> that's also possible? Can someone point me to how I would go about doing
> this too?
>
>
> 3. The newspapers are in Swedish. I've found the Swedish stopwords list;
> are there any other dictionaries etc. available to assist with queries
> where words have different forms, e.g. plurals such as "flicka" (girl)
> and "flickor" (girls)?
>
>
>
> Much thanks in advance!
>
> Regards,
>
> Colin
