Erick, I fixed the "missing content stream" issue as well, by making sure I am not adding an empty list. However, my very first issue of getting zero docs once in a while is still haunting me, even after using cursorMarks and disabling auto commit and soft commit. I ran the code twice, and as the log below shows, the count statement returns zero docs at random times.
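For reference, the guard amounts to this (a minimal sketch; client and inList are the same names as in the code further down):

    // Only send the batch when there is something in it; an update request
    // with an empty document list is what triggered "missing content stream".
    if (!inList.isEmpty()) {
        client.add(inList);
    }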
log.info("Indexed " + count + "/" + docList.getNumFound()); -bash-4.1$ tail -f reindexing.log 2015-09-26 01:44:40 INFO [a.b.c.AdhocCorrectUUID] - Indexed 6500/1440653 2015-09-26 01:44:44 INFO [a.b.c.AdhocCorrectUUID] - Indexed 7000/1439863 2015-09-26 01:44:48 INFO [a.b.c.AdhocCorrectUUID] - Indexed 7500/1439410 2015-09-26 01:44:56 INFO [a.b.c.AdhocCorrectUUID] - Indexed 8000/1438918 2015-09-26 01:45:01 INFO [a.b.c.AdhocCorrectUUID] - Indexed 8500/1438330 2015-09-26 01:45:01 INFO [a.b.c.AdhocCorrectUUID] - Indexed 8500/0 2015-09-26 01:45:06 INFO [a.b.c.AdhocCorrectUUID] - FINISHED !!! 2015-09-26 01:48:15 INFO [a.b.c.AdhocCorrectUUID] - Indexed 500/1437440 2015-09-26 01:48:19 INFO [a.b.c.AdhocCorrectUUID] - Indexed 1000/1437440 2015-09-26 01:48:19 INFO [a.b.c.AdhocCorrectUUID] - Indexed 1000/0 2015-09-26 01:48:22 INFO [a.b.c.AdhocCorrectUUID] - FINISHED !!! Thanks Ravi Kiran Bhaskar On Sat, Sep 26, 2015 at 1:17 AM, Ravi Solr <ravis...@gmail.com> wrote: > Erick as per your advise I used cursorMarks (see code below). It was > slightly better but Solr throws Exceptions randomly. Please look at the > code and Stacktrace below > > 2015-09-26 01:00:45 INFO [a.b.c.AdhocCorrectUUID] - Indexed 500/1453133 > 2015-09-26 01:00:49 INFO [a.b.c.AdhocCorrectUUID] - Indexed 1000/1453133 > 2015-09-26 01:00:54 INFO [a.b.c.AdhocCorrectUUID] - Indexed 1500/1452592 > 2015-09-26 01:00:58 INFO [a.b.c.AdhocCorrectUUID] - Indexed 2000/1452095 > 2015-09-26 01:01:03 INFO [a.b.c.AdhocCorrectUUID] - Indexed 2500/1451675 > 2015-09-26 01:01:10 INFO [a.b.c.AdhocCorrectUUID] - Indexed 3000/1450924 > 2015-09-26 01:01:15 INFO [a.b.c.AdhocCorrectUUID] - Indexed 3500/1450445 > 2015-09-26 01:01:19 INFO [a.b.c.AdhocCorrectUUID] - Indexed 4000/1449997 > 2015-09-26 01:01:24 INFO [a.b.c.AdhocCorrectUUID] - Indexed 4500/1449692 > 2015-09-26 01:01:28 INFO [a.b.c.AdhocCorrectUUID] - Indexed 5000/1449201 > 2015-09-26 01:01:28 ERROR [a.b.c.AdhocCorrectUUID] - Error indexing > org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: > Error from server at http://xx.xx.xx.xx:1111/solr/collection1: missing > content stream > at > org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:560) > at > org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:234) > at > org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:226) > at > org.apache.solr.client.solrj.impl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:376) > at > org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:328) > at > org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1085) > at > org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:856) > at > org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:799) > at > org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135) > at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:107) > at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:72) > at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:86) > at a.b.c.AdhocCorrectUUID.processDocs(AdhocCorrectUUID.java:97) > at a.b.c.AdhocCorrectUUID.main(AdhocCorrectUUID.java:37) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at 
>
> CODE
> ------------
> protected static void processDocs() {
>
>     try {
>         CloudSolrClient client = new CloudSolrClient("zk1:xxxx,zk2:xxxx,zk3.com:xxxx");
>         client.setDefaultCollection("collection1");
>
>         boolean done = false;
>         String cursorMark = CursorMarkParams.CURSOR_MARK_START;
>         Integer count = 0;
>
>         while (!done) {
>             SolrQuery q = new SolrQuery("*:*")
>                     .setRows(500)
>                     .addSort("publishtime", ORDER.desc)
>                     .addSort("uniqueId", ORDER.desc)
>                     .setFields(new String[]{"uniqueId", "uuid"});
>             q.addFilterQuery(new String[]{"uuid:[* TO *]", "uuid:sun.org.mozilla*"});
>             q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
>
>             QueryResponse resp = client.query(q);
>             String nextCursorMark = resp.getNextCursorMark();
>
>             SolrDocumentList docList = resp.getResults();
>
>             List<SolrInputDocument> inList = new ArrayList<SolrInputDocument>();
>             for (SolrDocument doc : docList) {
>
>                 SolrInputDocument iDoc = ClientUtils.toSolrInputDocument(doc);
>
>                 // This is my system's id
>                 String uniqueId = (String) iDoc.getFieldValue("uniqueId");
>
>                 /*
>                  * This is another system's unique id, which is what I want to
>                  * correct. It was messed up by the script transformer in the
>                  * DIH import via SolrEntityProcessor, e.g.
>                  * sun.org.mozilla.javascript.internal.NativeString:9cdef726-05dd-40b7-b1b2-c9bbce96741f
>                  */
>                 String uuid = (String) iDoc.getFieldValue("uuid");
>                 String sanitizedUUID = uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", "");
>                 Map<String, String> fieldModifier = new HashMap<String, String>(1);
>                 fieldModifier.put("set", sanitizedUUID);
>                 iDoc.setField("uuid", fieldModifier);
>
>                 inList.add(iDoc);
>             }
>             client.add(inList);
>
>             count = count + docList.size();
>             log.info("Indexed " + count + "/" + docList.getNumFound());
>
>             if (cursorMark.equals(nextCursorMark)) {
>                 done = true;
>                 client.commit(true, true);
>             }
>             cursorMark = nextCursorMark;
>         }
>
>     } catch (Exception e) {
>         log.error("Error indexing ", e);
>     }
> }
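>
> A side note on the approach: since this is an atomic update anyway,
> round-tripping the whole stored document through
> ClientUtils.toSolrInputDocument may not be necessary. A minimal update
> document carrying only the unique key and the "set" modifier should do
> (a sketch, untested, using the same field names as above):
>
>     SolrInputDocument iDoc = new SolrInputDocument();
>     iDoc.addField("uniqueId", uniqueId);      // unique key of the doc to patch
>     Map<String, String> fieldModifier = new HashMap<String, String>(1);
>     fieldModifier.put("set", sanitizedUUID);  // atomic "set" replaces the old value
>     iDoc.addField("uuid", fieldModifier);
>     inList.add(iDoc);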
My >> autocommit conf and code is given below, as you can see autocommit should >> fire every 100 docs anyway >> >> <autoCommit> >> <maxDocs>100</maxDocs> >> <maxTime>120000</maxTime> >> </autoCommit> >> >> <autoSoftCommit> >> <maxTime>30000</maxTime> >> </autoSoftCommit> >> </updateHandler> >> >> private static void processDocs() { >> >> try { >> CloudSolrClient client = new >> CloudSolrClient("zk1:xxxx,zk2:xxxx,zk3.com:xxxx"); >> client.setDefaultCollection("collection1"); >> >> //First initialize docs >> SolrDocumentList docList = getDocs(client, 100); >> Long count = 0L; >> >> while (docList != null && docList.size() > 0) { >> >> List<SolrInputDocument> inList = new >> ArrayList<SolrInputDocument>(); >> for(SolrDocument doc : docList) { >> >> SolrInputDocument iDoc = >> ClientUtils.toSolrInputDocument(doc); >> >> //This is my SOLR's Unique id >> String uniqueId = (String) >> iDoc.getFieldValue("uniqueId"); >> >> /* >> * This is another system's id which is what I want >> to correct. Was messed >> * because of script transformer in DIH import via >> SolrEntityProcessor >> * ex- >> sun.org.mozilla.javascript.internal.NativeString:9cdef726-05dd-40b7-b1b2-c9bbce96741f >> */ >> String uuid = (String) iDoc.getFieldValue("uuid"); >> String sanitizedUUID = >> uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", ""); >> Map<String,String> fieldModifier = new >> HashMap<String,String>(1); >> fieldModifier.put("set",sanitizedUUID); >> iDoc.setField("uuid", fieldModifier); >> >> inList.add(iDoc); >> log.info("added " + systemid); >> } >> >> client.add(inList); >> >> count = count + docList.size(); >> log.info("Indexed " + count + "/" + >> docList.getNumFound()); >> >> Thread.sleep(5000); >> >> docList = getDocs(client, docList.size()); >> log.info("Got Docs- " + docList.getNumFound()); >> } >> >> } catch (Exception e) { >> log.error("Error indexing ", e); >> } >> } >> >> private static SolrDocumentList getDocs(CloudSolrClient client, >> Integer rows) { >> >> >> SolrQuery q = new SolrQuery("*:*"); >> q.setSort("publishtime", ORDER.desc); >> q.setStart(0); >> q.setRows(rows); >> q.addFilterQuery(new String[] {"uuid:[* TO *]", >> "uuid:sun.org.mozilla*"}); >> q.setFields(new String[]{"uniqueId","uuid"}); >> SolrDocumentList docList = null; >> QueryResponse resp; >> try { >> resp = client.query(q); >> docList = resp.getResults(); >> } catch (Exception e) { >> log.error("Error querying " + q.toString(), e); >> } >> return docList; >> } >> >> >> Thanks >> >> Ravi Kiran Bhaskar >> >> On Fri, Sep 25, 2015 at 10:58 PM, Erick Erickson <erickerick...@gmail.com >> > wrote: >> >>> Wait, query again how? You've got to have something that keeps you >>> from getting the same 100 docs back so you have to be sorting somehow. >>> Or you have a high water mark. Or something. Waiting 5 seconds for any >>> commit also doesn't really make sense to me. I mean how do you know >>> >>> 1> that you're going to get a commit (did you explicitly send one from >>> the client?). >>> 2> all autowarming will be complete by the time the next query hits? >>> >>> Let's see the query you fire. There has to be some kind of marker that >>> you're using to know when you've gotten through the entire set. >>> >>> And I would use much larger batches, I usually update in batches of >>> 1,000 (excepting if these are very large docs of course). I suspect >>> you're spending a lot more time sleeping than you need to. I wouldn't >>> sleep at all in fact. This is one (rare) case I might consider >>> committing from the client. 
>>
>> Thanks
>>
>> Ravi Kiran Bhaskar
>>
>> On Fri, Sep 25, 2015 at 10:58 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>>> Wait, query again how? You've got to have something that keeps you
>>> from getting the same 100 docs back, so you have to be sorting somehow.
>>> Or you have a high-water mark. Or something. Waiting 5 seconds for any
>>> commit also doesn't really make sense to me. I mean, how do you know
>>>
>>> 1> that you're going to get a commit (did you explicitly send one from
>>> the client?).
>>> 2> all autowarming will be complete by the time the next query hits?
>>>
>>> Let's see the query you fire. There has to be some kind of marker that
>>> you're using to know when you've gotten through the entire set.
>>>
>>> And I would use much larger batches. I usually update in batches of
>>> 1,000 (excepting if these are very large docs, of course). I suspect
>>> you're spending a lot more time sleeping than you need to. I wouldn't
>>> sleep at all, in fact. This is one (rare) case I might consider
>>> committing from the client. If you specify the wait-for-searcher param
>>> (server.commit(true, true)), then it doesn't return until a new
>>> searcher is completely opened, so your previous updates will be
>>> reflected in your next search.
>>>
>>> Actually, what I'd really do is
>>> 1> turn off all auto commits
>>> 2> go ahead and query/change/update. But the query bits would be using
>>> the cursorMark.
>>> 3> do NOT commit
>>> 4> issue a commit when you were all done.
>>>
>>> I bet you'd get through your update a lot faster that way.
>>>
>>> Best,
>>> Erick
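>>>
>>> (In SolrJ terms those four steps are roughly the following. A sketch
>>> only: it assumes autoCommit and autoSoftCommit are first removed from
>>> solrconfig.xml, that q and client are built as in the code above, and
>>> buildUpdates() is a hypothetical helper applying the atomic "set":)
>>>
>>>     String cursorMark = CursorMarkParams.CURSOR_MARK_START;
>>>     while (true) {
>>>         q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);  // 2> page with cursorMark
>>>         QueryResponse resp = client.query(q);
>>>         SolrDocumentList page = resp.getResults();
>>>         if (!page.isEmpty()) {
>>>             client.add(buildUpdates(page));  // modify and re-add, no commit (3>)
>>>         }
>>>         String next = resp.getNextCursorMark();
>>>         if (cursorMark.equals(next)) break;  // cursor stopped moving: done
>>>         cursorMark = next;
>>>     }
>>>     client.commit(true, true);  // 4> one commit at the end, waitSearcher=true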
>>>
>>> On Fri, Sep 25, 2015 at 5:07 PM, Ravi Solr <ravis...@gmail.com> wrote:
>>> > Thanks for responding, Erick. I set the "start" to zero and "rows"
>>> > always to 100. I create a CloudSolrClient instance and use it to both
>>> > query as well as index. But I do sleep for 5 secs just to allow for
>>> > any auto commits.
>>> >
>>> > So query --> client.add(100 docs) --> wait --> query again
>>> >
>>> > But the weird thing I noticed was that after 8 or 9 batches, i.e.
>>> > 800/900 docs, the "query again" returns zero docs, causing my while
>>> > loop to exit... so I was trying to see if I was doing the right thing
>>> > or if there is an alternate way to do heavy indexing.
>>> >
>>> > Thanks
>>> >
>>> > Ravi Kiran Bhaskar
>>> >
>>> > On Friday, September 25, 2015, Erick Erickson <erickerick...@gmail.com> wrote:
>>> >
>>> >> How are you querying Solr? You say you query for 100 docs,
>>> >> update, then get the next set. What are you using for a marker?
>>> >> If you're using the start parameter, and somehow a commit is
>>> >> creeping in, things might be weird, especially if you're using any
>>> >> of the internal Lucene doc IDs. If you're absolutely sure no commits
>>> >> are taking place, even that should be OK.
>>> >>
>>> >> The "deep paging" stuff could be helpful here, see:
>>> >> https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>>> >>
>>> >> Best,
>>> >> Erick
>>> >>
>>> >> On Fri, Sep 25, 2015 at 3:13 PM, Ravi Solr <ravis...@gmail.com> wrote:
>>> >> > No problem Walter, it's all fun. Was just wondering if there was
>>> >> > some other good way that I did not know of, that's all 😀
>>> >> >
>>> >> > Thanks
>>> >> >
>>> >> > Ravi Kiran Bhaskar
>>> >> >
>>> >> > On Friday, September 25, 2015, Walter Underwood <wun...@wunderwood.org> wrote:
>>> >> >
>>> >> >> Sorry, I did not mean to be rude. The original question did not say
>>> >> >> that you don't have the docs outside of Solr. Some people jump to
>>> >> >> the advanced features and miss the simple ones.
>>> >> >>
>>> >> >> It might be faster to fetch all the docs from Solr and save them in
>>> >> >> files. Then modify them. Then reload all of them. No guarantee, but
>>> >> >> it is worth a try.
>>> >> >>
>>> >> >> Good luck.
>>> >> >>
>>> >> >> wunder
>>> >> >> Walter Underwood
>>> >> >> wun...@wunderwood.org
>>> >> >> http://observer.wunderwood.org/ (my blog)
>>> >> >>
>>> >> >> > On Sep 25, 2015, at 2:59 PM, Ravi Solr <ravis...@gmail.com> wrote:
>>> >> >> >
>>> >> >> > Walter, not in a mood for banter right now... It's 6:00 PM on a
>>> >> >> > Friday and I am stuck here trying to figure out reindexing
>>> >> >> > issues :-)
>>> >> >> > I don't have the source of the docs, so I have to query SOLR,
>>> >> >> > modify, and put them back, and that is seeming to be quite a task
>>> >> >> > in 5.3.0. I did reindex several times with 4.7.2 in a master-slave
>>> >> >> > env without any issue. Since then we have moved to cloud, and it
>>> >> >> > has been a pain all day.
>>> >> >> >
>>> >> >> > Thanks
>>> >> >> >
>>> >> >> > Ravi Kiran Bhaskar
>>> >> >> >
>>> >> >> > On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>>> >> >> >
>>> >> >> >> Sure.
>>> >> >> >>
>>> >> >> >> 1. Delete all the docs (no commit).
>>> >> >> >> 2. Add all the docs (no commit).
>>> >> >> >> 3. Commit.
>>> >> >> >>
>>> >> >> >> wunder
>>> >> >> >> Walter Underwood
>>> >> >> >> wun...@wunderwood.org
>>> >> >> >> http://observer.wunderwood.org/ (my blog)
>>> >> >> >>
>>> >> >> >>> On Sep 25, 2015, at 2:17 PM, Ravi Solr <ravis...@gmail.com> wrote:
>>> >> >> >>>
>>> >> >> >>> I have been trying to re-index the docs (about 1.5 million), as
>>> >> >> >>> one of the fields needed part of its string value removed
>>> >> >> >>> (accidentally introduced). I was issuing a query for 100 docs,
>>> >> >> >>> getting 4 fields, and updating the doc (atomic update with
>>> >> >> >>> "set") via the CloudSolrClient in batches. However, from time to
>>> >> >> >>> time the query returns 0 results, which exits the re-indexing
>>> >> >> >>> program.
>>> >> >> >>>
>>> >> >> >>> I can't understand why the cloud returns 0 results when there
>>> >> >> >>> are 1.4x million docs which have the "accidental" string in
>>> >> >> >>> them.
>>> >> >> >>>
>>> >> >> >>> Is there another way to do bulk massive updates?
>>> >> >> >>>
>>> >> >> >>> Thanks
>>> >> >> >>>
>>> >> >> >>> Ravi Kiran Bhaskar
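
(For completeness: Walter's delete-and-reload suggestion above, in SolrJ
terms. A sketch only; it presumes the docs can be rebuilt outside Solr,
which is exactly what is missing here, and allDocs is a hypothetical
collection of rebuilt SolrInputDocuments.)

    client.deleteByQuery("*:*");  // 1. delete all the docs, no commit
    client.add(allDocs);          // 2. re-add everything, no commit
    client.commit();              // 3. one commit at the very end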