The extractOnly option simply shows you the raw content and metadata that
Tika extracts, without indexing anything, while normal (non-extractOnly)
mode indexes the metadata exactly as you have asked for it to be indexed.
You haven't shown us any parameters that describe how you want the metadata
indexed. If you didn't specify any mapping, it was probably all thrown away.
Read the tutorial on Solr Cell if you are not yet aware of how to map
metadata:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
Or read that chapter in my e-book! It has lots of examples, especially for
the various mapping parameters.
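As a rough sketch of what such mapping parameters can look like, the Java snippet below collects a few Solr Cell parameters (literal.id, fmap.content, uprefix, lowernames, commit) and prints them as a query string for /update/extract. The class and method names are hypothetical, and the target field "text" and the "attr_" prefix are assumptions about your schema, so adjust them to the fields you actually have.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: collects Solr Cell mapping parameters for /update/extract.
class ExtractParams {

    // The field names ("text", "attr_" prefix) are assumptions about the schema.
    static Map<String, String> mappingParams(String id) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", id);        // unique key for the indexed document
        params.put("fmap.content", "text");  // map the extracted body to the "text" field
        params.put("uprefix", "attr_");      // prefix metadata fields not defined in the schema
        params.put("lowernames", "true");    // lowercase metadata names, with underscores
        params.put("commit", "true");        // commit so the document becomes searchable
        return params;
    }

    // Joins the parameters into a URL-encoded query string, keeping insertion order.
    static String toQueryString(Map<String, String> params) {
        StringJoiner joined = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joined.add(encode(e.getKey()) + "=" + encode(e.getValue()));
        }
        return joined.toString();
    }

    private static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("/update/extract?" + toQueryString(mappingParams("doc1")));
    }
}
```

With SolrJ, each of these entries could instead be passed to req.setParam(name, value), with extractOnly left unset so the document is actually indexed.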
-- Jack Krupansky
-----Original Message-----
From: Liz Sommers
Sent: Friday, March 21, 2014 12:56 PM
To: solr-user
Subject: SolrCell and indexing HTML
I am trying to write a POC about indexing URLs with Solr using SolrJ and
Solr Cell. (The code is written in Groovy.)
The relevant code is here:
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
req.setParam("literal.id",p.id.toString())
req.setParam("extractOnly","true")
URL url = new URL(p.url)
ContentStream stream = new ContentStreamBase.URLStream(url)
req.addContentStream(stream)
def result = server.request(req)
println "result: ${result}"
When I set extractOnly to true I get everything in the URL: all the tags,
all the stylesheets. When I set it to false I get a response that has
nothing in it except:
result: {responseHeader={status=0,QTime=19}}
When I test it with the admin tools, nothing in the URL has been indexed as
far as I can tell.
I know I am doing something wrong with the params, but I haven't figured
out what. Can somebody please help me?
Thanks
Liz Sommers
lizzy...@gmail.com
lizswo...@gmail.com