Re: SolrCell and indexing HTML

Jack Krupansky Fri, 21 Mar 2014 10:20:41 -0700

The extractOnly option simply tells you what the raw metadata is, while normal non-extractOnly mode indexes the metadata exactly as you have requested it to be indexed. You haven't shown us any of the parameters that describe how you want the metadata indexed. If you didn't specify any mapping, it was probably all thrown away.

Read the tutorial on Solr Cell if you are not yet aware of how to map metadata:

https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

Or read that chapter in my e-book! It has lots of examples, especially for the various mapping parameters.
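
As a rough illustration of what a mapping might look like, here is a minimal sketch in plain Java that builds a set of Solr Cell request parameters for an indexing (non-extractOnly) call. The parameter names (literal.*, fmap.*, uprefix, lowernames, commit) are standard Solr Cell parameters, but the target field "text", the "attr_" prefix, and the class and method names are assumptions for illustration only; they must match whatever your own schema defines. In SolrJ, each entry would be passed with req.setParam(name, value), as in the code quoted below.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of SolrJ: collects Solr Cell parameters
// for a request to /update/extract that actually indexes the document.
public class SolrCellParams {

    // Builds the parameters for indexing one document with the given id.
    public static Map<String, String> indexingParams(String id) {
        Map<String, String> params = new LinkedHashMap<>();
        // Supply the unique key as a literal value.
        params.put("literal.id", id);
        // Map Tika's extracted body ("content") to a schema field.
        // ASSUMPTION: the schema has an indexed field named "text".
        params.put("fmap.content", "text");
        // Prefix metadata fields the schema does not define.
        // ASSUMPTION: the schema has a dynamic field matching "attr_*".
        params.put("uprefix", "attr_");
        // Normalize Tika metadata names to lowercase with underscores.
        params.put("lowernames", "true");
        // Commit so the document is visible to searches right away.
        params.put("commit", "true");
        // Note: extractOnly is deliberately NOT set here.
        return params;
    }

    // Renders the parameters as a URL query string, e.g. for testing with curl.
    public static String toQueryString(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
    }

    private static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> params = indexingParams("doc1");
        // Print the equivalent SolrJ calls and the equivalent request path.
        params.forEach((k, v) ->
                System.out.println("req.setParam(\"" + k + "\", \"" + v + "\")"));
        System.out.println("/update/extract?" + toQueryString(params));
    }
}
```

Without fmap.content (or a schema field literally named "content") and without uprefix or defaultField, extracted fields that do not match the schema have nowhere to go, which is consistent with a successful but apparently empty indexing request.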


-- Jack Krupansky

-----Original Message-----
From: Liz Sommers
Sent: Friday, March 21, 2014 12:56 PM
To: solr-user
Subject: SolrCell and indexing HTML

I am trying to write a POC for indexing URLs with Solr using SolrJ and
Solr Cell.  (The code is written in Groovy.)

The relevant code is here:

       ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract")

       req.setParam("literal.id",p.id.toString())
       req.setParam("extractOnly","true")
       URL url = new URL(p.url)
       ContentStream stream = new ContentStreamBase.URLStream(url)
       req.addContentStream(stream)

       def result = server.request(req)
       println "result: ${result}"

When I set extractOnly to true I get everything at the URL: all the tags,
all the stylesheets.  When I set it to false I get a response that has
nothing in it except

result: {responseHeader={status=0,QTime=19}}

When I test it with the admin tools, nothing from the URL has been indexed as
far as I can tell.
I know I am doing something wrong with the params, but I haven't figured
out what.  Can somebody please help me?

Thanks
Liz Sommers
lizzy...@gmail.com
