Thank you for taking the time to help me out. Yes, I was not using cursorMark; I will try that next. This is what I was doing; it's a bit shabby code, but what can I say, my brain was fried :-) FYI, this is a side process just to correct a messed-up string. The actual indexing process was running the whole time, as our business owners are a bit petulant about stopping indexing. My autocommit config and code are given below; as you can see, autocommit should fire every 100 docs anyway.
<autoCommit>
  <maxDocs>100</maxDocs>
  <maxTime>120000</maxTime>
</autoCommit>
<autoSoftCommit>
  <maxTime>30000</maxTime>
</autoSoftCommit>
</updateHandler>

private static void processDocs() {
    try {
        CloudSolrClient client = new CloudSolrClient("zk1:xxxx,zk2:xxxx,zk3.com:xxxx");
        client.setDefaultCollection("collection1");

        // First initialize docs
        SolrDocumentList docList = getDocs(client, 100);
        Long count = 0L;
        while (docList != null && docList.size() > 0) {
            List<SolrInputDocument> inList = new ArrayList<SolrInputDocument>();
            for (SolrDocument doc : docList) {
                SolrInputDocument iDoc = ClientUtils.toSolrInputDocument(doc);

                // This is my SOLR's unique id
                String uniqueId = (String) iDoc.getFieldValue("uniqueId");

                /*
                 * This is another system's id, which is what I want to correct. It was messed up
                 * because of the script transformer in the DIH import via SolrEntityProcessor, e.g.
                 * sun.org.mozilla.javascript.internal.NativeString:9cdef726-05dd-40b7-b1b2-c9bbce96741f
                 */
                String uuid = (String) iDoc.getFieldValue("uuid");
                String sanitizedUUID = uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", "");

                // Atomic update: "set" replaces the uuid field with the sanitized value
                Map<String, String> fieldModifier = new HashMap<String, String>(1);
                fieldModifier.put("set", sanitizedUUID);
                iDoc.setField("uuid", fieldModifier);

                inList.add(iDoc);
                log.info("added " + uniqueId);
            }
            client.add(inList);
            count = count + docList.size();
            log.info("Indexed " + count + "/" + docList.getNumFound());
            Thread.sleep(5000);
            docList = getDocs(client, docList.size());
            log.info("Got Docs- " + docList.getNumFound());
        }
    } catch (Exception e) {
        log.error("Error indexing ", e);
    }
}

private static SolrDocumentList getDocs(CloudSolrClient client, Integer rows) {
    SolrQuery q = new SolrQuery("*:*");
    q.setSort("publishtime", ORDER.desc);
    q.setStart(0);
    q.setRows(rows);
    q.addFilterQuery(new String[] { "uuid:[* TO *]", "uuid:sun.org.mozilla*" });
    q.setFields(new String[] { "uniqueId", "uuid" });
    SolrDocumentList docList = null;
    QueryResponse resp;
    try {
        resp = client.query(q);
        docList = resp.getResults();
    } catch (Exception e) {
        log.error("Error querying " + q.toString(), e);
    }
    return docList;
}
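Just so it's written down, this is roughly what I am planning for the cursorMark version. It's a rough, untested sketch: the method name is made up, it reuses the same collection and fields as above, uses batches of 1000 like you suggested, and assumes autocommit is turned off so the one explicit commit at the end is the only commit.

private static void processDocsWithCursor(CloudSolrClient client) throws Exception {
    SolrQuery q = new SolrQuery("*:*");
    q.addFilterQuery("uuid:[* TO *]", "uuid:sun.org.mozilla*");
    q.setFields("uniqueId", "uuid");
    q.setRows(1000);
    // cursorMark requires a sort that includes the uniqueKey field as a tie-breaker
    q.setSort("publishtime", ORDER.desc);
    q.addSort("uniqueId", ORDER.asc);

    String cursorMark = CursorMarkParams.CURSOR_MARK_START;
    boolean done = false;
    while (!done) {
        q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
        QueryResponse resp = client.query(q);
        String nextCursorMark = resp.getNextCursorMark();

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (SolrDocument doc : resp.getResults()) {
            // Send only the id plus an atomic "set" on uuid instead of the whole document
            SolrInputDocument iDoc = new SolrInputDocument();
            iDoc.setField("uniqueId", doc.getFieldValue("uniqueId"));

            String uuid = (String) doc.getFieldValue("uuid");
            Map<String, String> fieldModifier = new HashMap<String, String>(1);
            fieldModifier.put("set", uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", ""));
            iDoc.setField("uuid", fieldModifier);
            batch.add(iDoc);
        }
        if (!batch.isEmpty()) {
            client.add(batch);
        }

        if (cursorMark.equals(nextCursorMark)) {
            done = true; // cursor did not move, so we have seen everything
        }
        cursorMark = nextCursorMark;
    }

    // single explicit commit at the end, waiting for the new searcher
    client.commit(true, true);
}

If I understand the cursorMark docs correctly, the filter on the old uuid values should stay stable for the whole run as long as nothing commits in between, so the loop should not miss or repeat documents.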
Thanks

Ravi Kiran Bhaskar

On Fri, Sep 25, 2015 at 10:58 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Wait, query again how? You've got to have something that keeps you from getting the same 100 docs back, so you have to be sorting somehow. Or you have a high water mark. Or something. Waiting 5 seconds for any commit also doesn't really make sense to me. I mean, how do you know
>
> 1> that you're going to get a commit (did you explicitly send one from the client?).
> 2> all autowarming will be complete by the time the next query hits?
>
> Let's see the query you fire. There has to be some kind of marker that you're using to know when you've gotten through the entire set.
>
> And I would use much larger batches, I usually update in batches of 1,000 (excepting if these are very large docs of course). I suspect you're spending a lot more time sleeping than you need to. I wouldn't sleep at all in fact. This is one (rare) case I might consider committing from the client. If you specify the wait-for-searcher param (server.commit(true, true)), then it doesn't return until a new searcher is completely opened, so your previous updates will be reflected in your next search.
>
> Actually, what I'd really do is
> 1> turn off all auto commits
> 2> go ahead and query/change/update. But the query bits would be using the cursorMark.
> 3> do NOT commit
> 4> issue a commit when you were all done.
>
> I bet you'd get through your update a lot faster that way.
>
> Best,
> Erick
>
> On Fri, Sep 25, 2015 at 5:07 PM, Ravi Solr <ravis...@gmail.com> wrote:
> > Thanks for responding Erick. I set the "start" to zero and "rows" always to 100. I create a CloudSolrClient instance and use it to both query as well as index. But I do sleep for 5 secs just to allow for any auto commits.
> >
> > So query --> client.add(100 docs) --> wait --> query again
> >
> > But the weird thing I noticed was that after 8 or 9 batches, i.e. 800/900 docs, the "query again" returns zero docs, causing my while loop to exit... so I was trying to see if I was doing the right thing or if there is an alternate way to do heavy indexing.
> >
> > Thanks
> >
> > Ravi Kiran Bhaskar
> >
> > On Friday, September 25, 2015, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >> How are you querying Solr? You say you query for 100 docs, update, then get the next set. What are you using for a marker? If you're using the start parameter, and somehow a commit is creeping in, things might be weird, especially if you're using any of the internal Lucene doc IDs. If you're absolutely sure no commits are taking place, even that should be OK.
> >>
> >> The "deep paging" stuff could be helpful here, see:
> >>
> >> https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
> >>
> >> Best,
> >> Erick
> >>
> >> On Fri, Sep 25, 2015 at 3:13 PM, Ravi Solr <ravis...@gmail.com> wrote:
> >> > No problem Walter, it's all fun. Was just wondering if there was some other good way that I did not know of, that's all 😀
> >> >
> >> > Thanks
> >> >
> >> > Ravi Kiran Bhaskar
> >> >
> >> > On Friday, September 25, 2015, Walter Underwood <wun...@wunderwood.org> wrote:
> >> >
> >> >> Sorry, I did not mean to be rude. The original question did not say that you don't have the docs outside of Solr. Some people jump to the advanced features and miss the simple ones.
> >> >>
> >> >> It might be faster to fetch all the docs from Solr and save them in files. Then modify them. Then reload all of them. No guarantee, but it is worth a try.
> >> >>
> >> >> Good luck.
> >> >>
> >> >> wunder
> >> >> Walter Underwood
> >> >> wun...@wunderwood.org
> >> >> http://observer.wunderwood.org/ (my blog)
> >> >>
> >> >> > On Sep 25, 2015, at 2:59 PM, Ravi Solr <ravis...@gmail.com> wrote:
> >> >> >
> >> >> > Walter, not in a mood for banter right now.... It's 6:00 pm on a Friday and I am stuck here trying to figure out reindexing issues :-)
> >> >> > I don't have the source of the docs, so I have to query SOLR, modify, and put it back, and that is seeming to be quite a task in 5.3.0. I did reindex several times with 4.7.2 in a master-slave env without any issue. Since then we have moved to cloud and it has been a pain all day.
> >> >> >
> >> >> > Thanks
> >> >> >
> >> >> > Ravi Kiran Bhaskar
> >> >> >
> >> >> > On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> >> >> >
> >> >> >> Sure.
> >> >> >>
> >> >> >> 1. Delete all the docs (no commit).
> >> >> >> 2. Add all the docs (no commit).
> >> >> >> 3. Commit.
> >> >> >>
> >> >> >> wunder
> >> >> >> Walter Underwood
> >> >> >> wun...@wunderwood.org
> >> >> >> http://observer.wunderwood.org/ (my blog)
> >> >> >>
> >> >> >>> On Sep 25, 2015, at 2:17 PM, Ravi Solr <ravis...@gmail.com> wrote:
> >> >> >>>
> >> >> >>> I have been trying to re-index the docs (about 1.5 million) as one of the fields needed part of a string value removed (accidentally introduced). I was issuing a query for 100 docs, getting 4 fields, and updating the docs (atomic update with "set") via the CloudSolrClient in batches. However, from time to time the query returns 0 results, which exits the re-indexing program.
> >> >> >>>
> >> >> >>> I can't understand why the cloud returns 0 results when there are 1.4x million docs which have the "accidental" string in them.
> >> >> >>>
> >> >> >>> Is there another way to do massive bulk updates?
> >> >> >>>
> >> >> >>> Thanks
> >> >> >>>
> >> >> >>> Ravi Kiran Bhaskar