Hello!
  I just started using Solr. My general use case is pushing a lot of data from 
HBase to Solr via an M/R job using SolrJ. I have lots of questions, but the 
ones I'd like to start with are:

(1)
I noticed this:
http://lucene.472066.n3.nabble.com/what-happens-to-docsPending-if-stop-solr-before-commit-td2781493.html

That would seem to indicate that pending documents are committed on restart. This is 
great! I also noticed that while there is a lag on startup if I have 
documents pending, it's only a few minutes or so. But if I issue a commit for 
the same number of documents, the server stays blocked for 20 minutes or so. It almost 
seems like it would be faster to add all my documents and restart the server 
rather than issue a commit. Am I doing something strange? Is this a valid 
conclusion?

(2)
I'm also getting a lot of errors about invalid UTF-8:

SEVERE: org.apache.solr.common.SolrException: Invalid UTF-8 character 0xffff at char #2380289, byte #2378666)
        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)

It could be that the values I have in some of my document fields are indeed 
invalid. My question is what this means when I'm submitting a batch of 
documents (specifically, I'm using SolrJ's StreamingUpdateSolrServer with a 
BinaryRequestWriter). Do I:

- lose the whole batch that has the bad document?
- lose the document?
- lose the one field?

I wish it was the third, hope it's the second, and I'm afraid it's the first...
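
Whatever the answer turns out to be, one way to sidestep the question is to 
strip the offending characters on the client before the documents are sent. 
The sketch below assumes the error comes from characters that XML 1.0 does not 
allow (U+FFFF is one of them, and the trace above goes through XMLLoader); 
that diagnosis is a guess, not something confirmed by Solr. The class and 
method names (XmlSanitizer, stripInvalidXmlChars) are hypothetical helpers, 
not part of SolrJ.

```java
public final class XmlSanitizer {
    private XmlSanitizer() {}

    /** True if the code point is allowed by the XML 1.0 "Char" production. */
    public static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Hypothetical helper: drops every code point XML 1.0 does not allow. */
    public static String stripInvalidXmlChars(String value) {
        if (value == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); ) {
            // codePointAt handles surrogate pairs; a lone surrogate comes
            // back as itself and is rejected by isValidXmlChar.
            int cp = value.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "abc" + (char) 0xFFFF + "def";
        System.out.println(stripInvalidXmlChars(dirty));
    }
}
```

Each field value would be passed through stripInvalidXmlChars in the M/R job 
before being added to the document. This silently drops data, so logging the 
affected document IDs alongside it is probably a good idea.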

Ooo.. and I guess a third question: I'm having trouble finding a document that 
describes the overall design/functionality of Solr, something that would help 
me reason about things like "what happens to pending documents when the server 
restarts" or "does a commit in one indexing thread commit previously added 
documents from another indexing thread". I've answered both of those to my 
satisfaction by looking over the Solr logs and mailing lists, but I'm wondering 
if there's some documentation I missed somehow.
For example, something like these:
http://hadoop.apache.org/common/docs/current/hdfs_design.html
http://hbase.apache.org/book.html#architecture

Thanks!

Take care,
  -stu
