Erick, I fixed the "missing content stream" issue as well, by making sure I am not adding an empty list. However, my very first issue of getting zero docs once in a while is still haunting me, even after using cursorMarks and disabling auto commit and soft commit. I ran the code twice, and as the log below shows, the count statement returns zero docs at random times.
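For reference, the guard amounts to this (a minimal sketch; client and inList are the same names as in the code further down):

    // Only send the batch when there is something in it; an update request
    // with an empty document list is what triggered "missing content stream".
    if (!inList.isEmpty()) {
        client.add(inList);
    }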
log.info("Indexed " + count + "/" + docList.getNumFound()); -bash-4.1$ tail -f reindexing.log 2015-09-26 01:44:40 INFO [a.b.c.AdhocCorrectUUID] - Indexed 6500/1440653 2015-09-26 01:44:44 INFO [a.b.c.AdhocCorrectUUID] - Indexed 7000/1439863 2015-09-26 01:44:48 INFO [a.b.c.AdhocCorrectUUID] - Indexed 7500/1439410 2015-09-26 01:44:56 INFO [a.b.c.AdhocCorrectUUID] - Indexed 8000/1438918 2015-09-26 01:45:01 INFO [a.b.c.AdhocCorrectUUID] - Indexed 8500/1438330 2015-09-26 01:45:01 INFO [a.b.c.AdhocCorrectUUID] - Indexed 8500/0 2015-09-26 01:45:06 INFO [a.b.c.AdhocCorrectUUID] - FINISHED !!! 2015-09-26 01:48:15 INFO [a.b.c.AdhocCorrectUUID] - Indexed 500/1437440 2015-09-26 01:48:19 INFO [a.b.c.AdhocCorrectUUID] - Indexed 1000/1437440 2015-09-26 01:48:19 INFO [a.b.c.AdhocCorrectUUID] - Indexed 1000/0 2015-09-26 01:48:22 INFO [a.b.c.AdhocCorrectUUID] - FINISHED !!! Thanks Ravi Kiran Bhaskar On Sat, Sep 26, 2015 at 1:17 AM, Ravi Solr <ravis...@gmail.com> wrote: > Erick as per your advise I used cursorMarks (see code below). It was > slightly better but Solr throws Exceptions randomly. Please look at the > code and Stacktrace below > > 2015-09-26 01:00:45 INFO [a.b.c.AdhocCorrectUUID] - Indexed 500/1453133 > 2015-09-26 01:00:49 INFO [a.b.c.AdhocCorrectUUID] - Indexed 1000/1453133 > 2015-09-26 01:00:54 INFO [a.b.c.AdhocCorrectUUID] - Indexed 1500/1452592 > 2015-09-26 01:00:58 INFO [a.b.c.AdhocCorrectUUID] - Indexed 2000/1452095 > 2015-09-26 01:01:03 INFO [a.b.c.AdhocCorrectUUID] - Indexed 2500/1451675 > 2015-09-26 01:01:10 INFO [a.b.c.AdhocCorrectUUID] - Indexed 3000/1450924 > 2015-09-26 01:01:15 INFO [a.b.c.AdhocCorrectUUID] - Indexed 3500/1450445 > 2015-09-26 01:01:19 INFO [a.b.c.AdhocCorrectUUID] - Indexed 4000/1449997 > 2015-09-26 01:01:24 INFO [a.b.c.AdhocCorrectUUID] - Indexed 4500/1449692 > 2015-09-26 01:01:28 INFO [a.b.c.AdhocCorrectUUID] - Indexed 5000/1449201 > 2015-09-26 01:01:28 ERROR [a.b.c.AdhocCorrectUUID] - Error indexing > org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: > Error from server at http://xx.xx.xx.xx:1111/solr/collection1: missing > content stream > at > org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:560) > at > org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:234) > at > org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:226) > at > org.apache.solr.client.solrj.impl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:376) > at > org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:328) > at > org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1085) > at > org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:856) > at > org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:799) > at > org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135) > at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:107) > at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:72) > at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:86) > at a.b.c.AdhocCorrectUUID.processDocs(AdhocCorrectUUID.java:97) > at a.b.c.AdhocCorrectUUID.main(AdhocCorrectUUID.java:37) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at 
>
> CODE
> ------------
> protected static void processDocs() {
>
>     try {
>         CloudSolrClient client = new CloudSolrClient("zk1:xxxx,zk2:xxxx,zk3.com:xxxx");
>         client.setDefaultCollection("collection1");
>
>         boolean done = false;
>         String cursorMark = CursorMarkParams.CURSOR_MARK_START;
>         Integer count = 0;
>
>         while (!done) {
>             SolrQuery q = new SolrQuery("*:*")
>                     .setRows(500)
>                     .addSort("publishtime", ORDER.desc)
>                     .addSort("uniqueId", ORDER.desc)
>                     .setFields(new String[]{"uniqueId", "uuid"});
>             q.addFilterQuery(new String[]{"uuid:[* TO *]", "uuid:sun.org.mozilla*"});
>             q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
>
>             QueryResponse resp = client.query(q);
>             String nextCursorMark = resp.getNextCursorMark();
>
>             SolrDocumentList docList = resp.getResults();
>
>             List<SolrInputDocument> inList = new ArrayList<SolrInputDocument>();
>             for (SolrDocument doc : docList) {
>
>                 SolrInputDocument iDoc = ClientUtils.toSolrInputDocument(doc);
>
>                 // This is my system's id
>                 String uniqueId = (String) iDoc.getFieldValue("uniqueId");
>
>                 /*
>                  * This is another system's unique id, which is what I want to
>                  * correct. It was messed up by the script transformer in the
>                  * DIH import via SolrEntityProcessor, e.g.
>                  * sun.org.mozilla.javascript.internal.NativeString:9cdef726-05dd-40b7-b1b2-c9bbce96741f
>                  */
>                 String uuid = (String) iDoc.getFieldValue("uuid");
>                 String sanitizedUUID = uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", "");
>                 Map<String, String> fieldModifier = new HashMap<String, String>(1);
>                 fieldModifier.put("set", sanitizedUUID);
>                 iDoc.setField("uuid", fieldModifier);
>
>                 inList.add(iDoc);
>             }
>             client.add(inList);
>
>             count = count + docList.size();
>             log.info("Indexed " + count + "/" + docList.getNumFound());
>
>             if (cursorMark.equals(nextCursorMark)) {
>                 done = true;
>                 client.commit(true, true);
>             }
>             cursorMark = nextCursorMark;
>         }
>
>     } catch (Exception e) {
>         log.error("Error indexing ", e);
>     }
> }
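>
> A side note on the approach: since this is an atomic update anyway,
> round-tripping the whole stored document through
> ClientUtils.toSolrInputDocument may not be necessary. A minimal update
> document carrying only the unique key and the "set" modifier should do
> (a sketch, untested, using the same field names as above):
>
>     SolrInputDocument iDoc = new SolrInputDocument();
>     iDoc.addField("uniqueId", uniqueId);      // unique key of the doc to patch
>     Map<String, String> fieldModifier = new HashMap<String, String>(1);
>     fieldModifier.put("set", sanitizedUUID);  // atomic "set" replaces the old value
>     iDoc.addField("uuid", fieldModifier);
>     inList.add(iDoc);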
My >> autocommit conf and code is given below, as you can see autocommit should >> fire every 100 docs anyway >> >> <autoCommit> >> <maxDocs>100</maxDocs> >> <maxTime>120000</maxTime> >> </autoCommit> >> >> <autoSoftCommit> >> <maxTime>30000</maxTime> >> </autoSoftCommit> >> </updateHandler> >> >> private static void processDocs() { >> >> try { >> CloudSolrClient client = new >> CloudSolrClient("zk1:xxxx,zk2:xxxx,zk3.com:xxxx"); >> client.setDefaultCollection("collection1"); >> >> //First initialize docs >> SolrDocumentList docList = getDocs(client, 100); >> Long count = 0L; >> >> while (docList != null && docList.size() > 0) { >> >> List<SolrInputDocument> inList = new >> ArrayList<SolrInputDocument>(); >> for(SolrDocument doc : docList) { >> >> SolrInputDocument iDoc = >> ClientUtils.toSolrInputDocument(doc); >> >> //This is my SOLR's Unique id >> String uniqueId = (String) >> iDoc.getFieldValue("uniqueId"); >> >> /* >> * This is another system's id which is what I want >> to correct. Was messed >> * because of script transformer in DIH import via >> SolrEntityProcessor >> * ex- >> sun.org.mozilla.javascript.internal.NativeString:9cdef726-05dd-40b7-b1b2-c9bbce96741f >> */ >> String uuid = (String) iDoc.getFieldValue("uuid"); >> String sanitizedUUID = >> uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", ""); >> Map<String,String> fieldModifier = new >> HashMap<String,String>(1); >> fieldModifier.put("set",sanitizedUUID); >> iDoc.setField("uuid", fieldModifier); >> >> inList.add(iDoc); >> log.info("added " + systemid); >> } >> >> client.add(inList); >> >> count = count + docList.size(); >> log.info("Indexed " + count + "/" + >> docList.getNumFound()); >> >> Thread.sleep(5000); >> >> docList = getDocs(client, docList.size()); >> log.info("Got Docs- " + docList.getNumFound()); >> } >> >> } catch (Exception e) { >> log.error("Error indexing ", e); >> } >> } >> >> private static SolrDocumentList getDocs(CloudSolrClient client, >> Integer rows) { >> >> >> SolrQuery q = new SolrQuery("*:*"); >> q.setSort("publishtime", ORDER.desc); >> q.setStart(0); >> q.setRows(rows); >> q.addFilterQuery(new String[] {"uuid:[* TO *]", >> "uuid:sun.org.mozilla*"}); >> q.setFields(new String[]{"uniqueId","uuid"}); >> SolrDocumentList docList = null; >> QueryResponse resp; >> try { >> resp = client.query(q); >> docList = resp.getResults(); >> } catch (Exception e) { >> log.error("Error querying " + q.toString(), e); >> } >> return docList; >> } >> >> >> Thanks >> >> Ravi Kiran Bhaskar >> >> On Fri, Sep 25, 2015 at 10:58 PM, Erick Erickson <erickerick...@gmail.com >> > wrote: >> >>> Wait, query again how? You've got to have something that keeps you >>> from getting the same 100 docs back so you have to be sorting somehow. >>> Or you have a high water mark. Or something. Waiting 5 seconds for any >>> commit also doesn't really make sense to me. I mean how do you know >>> >>> 1> that you're going to get a commit (did you explicitly send one from >>> the client?). >>> 2> all autowarming will be complete by the time the next query hits? >>> >>> Let's see the query you fire. There has to be some kind of marker that >>> you're using to know when you've gotten through the entire set. >>> >>> And I would use much larger batches, I usually update in batches of >>> 1,000 (excepting if these are very large docs of course). I suspect >>> you're spending a lot more time sleeping than you need to. I wouldn't >>> sleep at all in fact. This is one (rare) case I might consider >>> committing from the client. 
>>
>> Thanks
>>
>> Ravi Kiran Bhaskar
>>
>> On Fri, Sep 25, 2015 at 10:58 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>>> Wait, query again how? You've got to have something that keeps you
>>> from getting the same 100 docs back, so you have to be sorting somehow.
>>> Or you have a high-water mark. Or something. Waiting 5 seconds for any
>>> commit also doesn't really make sense to me. I mean, how do you know
>>>
>>> 1> that you're going to get a commit (did you explicitly send one from
>>> the client?).
>>> 2> all autowarming will be complete by the time the next query hits?
>>>
>>> Let's see the query you fire. There has to be some kind of marker that
>>> you're using to know when you've gotten through the entire set.
>>>
>>> And I would use much larger batches. I usually update in batches of
>>> 1,000 (excepting if these are very large docs, of course). I suspect
>>> you're spending a lot more time sleeping than you need to. I wouldn't
>>> sleep at all, in fact. This is one (rare) case I might consider
>>> committing from the client. If you specify the wait-for-searcher param
>>> (server.commit(true, true)), then it doesn't return until a new
>>> searcher is completely opened, so your previous updates will be
>>> reflected in your next search.
>>>
>>> Actually, what I'd really do is
>>> 1> turn off all auto commits
>>> 2> go ahead and query/change/update. But the query bits would be using
>>> the cursorMark.
>>> 3> do NOT commit
>>> 4> issue a commit when you were all done.
>>>
>>> I bet you'd get through your update a lot faster that way.
>>>
>>> Best,
>>> Erick
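>>>
>>> (In SolrJ terms those four steps are roughly the following. A sketch
>>> only: it assumes autoCommit and autoSoftCommit are first removed from
>>> solrconfig.xml, that q and client are built as in the code above, and
>>> buildUpdates() is a hypothetical helper applying the atomic "set":)
>>>
>>>     String cursorMark = CursorMarkParams.CURSOR_MARK_START;
>>>     while (true) {
>>>         q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);  // 2> page with cursorMark
>>>         QueryResponse resp = client.query(q);
>>>         SolrDocumentList page = resp.getResults();
>>>         if (!page.isEmpty()) {
>>>             client.add(buildUpdates(page));  // modify and re-add, no commit (3>)
>>>         }
>>>         String next = resp.getNextCursorMark();
>>>         if (cursorMark.equals(next)) break;  // cursor stopped moving: done
>>>         cursorMark = next;
>>>     }
>>>     client.commit(true, true);  // 4> one commit at the end, waitSearcher=true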
>>>
>>> On Fri, Sep 25, 2015 at 5:07 PM, Ravi Solr <ravis...@gmail.com> wrote:
>>> > Thanks for responding, Erick. I set the "start" to zero and "rows"
>>> > always to 100. I create a CloudSolrClient instance and use it to both
>>> > query as well as index. But I do sleep for 5 secs just to allow for
>>> > any auto commits.
>>> >
>>> > So query --> client.add(100 docs) --> wait --> query again
>>> >
>>> > But the weird thing I noticed was that after 8 or 9 batches, i.e.
>>> > 800/900 docs, the "query again" returns zero docs, causing my while
>>> > loop to exit... so I was trying to see if I was doing the right thing
>>> > or if there is an alternate way to do heavy indexing.
>>> >
>>> > Thanks
>>> >
>>> > Ravi Kiran Bhaskar
>>> >
>>> > On Friday, September 25, 2015, Erick Erickson <erickerick...@gmail.com> wrote:
>>> >
>>> >> How are you querying Solr? You say you query for 100 docs,
>>> >> update, then get the next set. What are you using for a marker?
>>> >> If you're using the start parameter, and somehow a commit is
>>> >> creeping in, things might be weird, especially if you're using any
>>> >> of the internal Lucene doc IDs. If you're absolutely sure no commits
>>> >> are taking place, even that should be OK.
>>> >>
>>> >> The "deep paging" stuff could be helpful here, see:
>>> >> https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>>> >>
>>> >> Best,
>>> >> Erick
>>> >>
>>> >> On Fri, Sep 25, 2015 at 3:13 PM, Ravi Solr <ravis...@gmail.com> wrote:
>>> >> > No problem Walter, it's all fun. Was just wondering if there was
>>> >> > some other good way that I did not know of, that's all 😀
>>> >> >
>>> >> > Thanks
>>> >> >
>>> >> > Ravi Kiran Bhaskar
>>> >> >
>>> >> > On Friday, September 25, 2015, Walter Underwood <wun...@wunderwood.org> wrote:
>>> >> >
>>> >> >> Sorry, I did not mean to be rude. The original question did not say
>>> >> >> that you don't have the docs outside of Solr. Some people jump to
>>> >> >> the advanced features and miss the simple ones.
>>> >> >>
>>> >> >> It might be faster to fetch all the docs from Solr and save them in
>>> >> >> files. Then modify them. Then reload all of them. No guarantee, but
>>> >> >> it is worth a try.
>>> >> >>
>>> >> >> Good luck.
>>> >> >>
>>> >> >> wunder
>>> >> >> Walter Underwood
>>> >> >> wun...@wunderwood.org
>>> >> >> http://observer.wunderwood.org/ (my blog)
>>> >> >>
>>> >> >> > On Sep 25, 2015, at 2:59 PM, Ravi Solr <ravis...@gmail.com> wrote:
>>> >> >> >
>>> >> >> > Walter, not in a mood for banter right now... It's 6:00 PM on a
>>> >> >> > Friday and I am stuck here trying to figure out reindexing
>>> >> >> > issues :-)
>>> >> >> > I don't have the source of the docs, so I have to query SOLR,
>>> >> >> > modify, and put them back, and that is seeming to be quite a task
>>> >> >> > in 5.3.0. I did reindex several times with 4.7.2 in a master-slave
>>> >> >> > env without any issue. Since then we have moved to cloud, and it
>>> >> >> > has been a pain all day.
>>> >> >> >
>>> >> >> > Thanks
>>> >> >> >
>>> >> >> > Ravi Kiran Bhaskar
>>> >> >> >
>>> >> >> > On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>>> >> >> >
>>> >> >> >> Sure.
>>> >> >> >>
>>> >> >> >> 1. Delete all the docs (no commit).
>>> >> >> >> 2. Add all the docs (no commit).
>>> >> >> >> 3. Commit.
>>> >> >> >>
>>> >> >> >> wunder
>>> >> >> >> Walter Underwood
>>> >> >> >> wun...@wunderwood.org
>>> >> >> >> http://observer.wunderwood.org/ (my blog)
>>> >> >> >>
>>> >> >> >>> On Sep 25, 2015, at 2:17 PM, Ravi Solr <ravis...@gmail.com> wrote:
>>> >> >> >>>
>>> >> >> >>> I have been trying to re-index the docs (about 1.5 million), as
>>> >> >> >>> one of the fields needed part of its string value removed
>>> >> >> >>> (accidentally introduced). I was issuing a query for 100 docs,
>>> >> >> >>> getting 4 fields, and updating the doc (atomic update with
>>> >> >> >>> "set") via the CloudSolrClient in batches. However, from time to
>>> >> >> >>> time the query returns 0 results, which exits the re-indexing
>>> >> >> >>> program.
>>> >> >> >>>
>>> >> >> >>> I can't understand why the cloud returns 0 results when there
>>> >> >> >>> are 1.4x million docs which have the "accidental" string in
>>> >> >> >>> them.
>>> >> >> >>>
>>> >> >> >>> Is there another way to do bulk massive updates?
>>> >> >> >>>
>>> >> >> >>> Thanks
>>> >> >> >>>
>>> >> >> >>> Ravi Kiran Bhaskar
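
(For completeness: Walter's delete-and-reload suggestion above, in SolrJ
terms. A sketch only; it presumes the docs can be rebuilt outside Solr,
which is exactly what is missing here, and allDocs is a hypothetical
collection of rebuilt SolrInputDocuments.)

    client.deleteByQuery("*:*");  // 1. delete all the docs, no commit
    client.add(allDocs);          // 2. re-add everything, no commit
    client.commit();              // 3. one commit at the very end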