Well, let's forget the cursorMark stuff for a bit. There's no reason you
should be getting all 1.4 million rows. Presumably you've been running this
program occasionally and blanking strings like
"sun.org.mozilla.javascript.internal.NativeString:" in the uuid field. Then
you turn around and run the program again with the fq clause like
fq=uuid:sun.org.mozilla*

and there aren't very many, just the ones that have been added since the
last run. It's possible that your program is running perfectly correctly.
After it runs, have you run a query against the system to see whether any
records still have a uuid that starts with sun.org.mozilla? If not,
everything's fine.

And there's a potential infinite loop in here as well. You're removing this
string: "sun.org.mozilla.javascript.internal.NativeString:" but searching on
anything like sun.org.mozilla. Say you have lots of records like
uuid:sun.org.mozilla.anything.else. They'll never be replaced by the code
you have, and they'll just be fetched over and over and over again.

BTW, having maxDocs set to 100 is very, very, very short for any kind of
bulk indexing operation and will lead to a lot of churn. Usually, people
either set a hard commit (autoCommit) with openSearcher=false or use
softCommit; using both is pretty odd. Here's a long blog on the subject:
https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Relying on commits to have everything re-ordered so you can just get the
first N, even with a delay, is not very robust. I'd take the special
settings you appear to have in the solrconfig file out and just let
autocommit work as usual, and instead concentrate on forming a query that
does what you want in the first place.

So let me see if I understand what you're doing here: any uuid with
"sun.org.mozilla.javascript.internal.NativeString:" should have that bit
removed, right? Instead of waiting for commits and all that, why not just
form queries like:

q=+uuid:sun.org.mozilla* +uniqueId:{marker TO *]&
sort=uniqueId asc&

where "marker" is * the first time you query, and the uniqueId from the
last record returned the previous time it ran? Then you wouldn't have to
worry about commits, waiting, or anything else.

And your commit interval is beating the crap out of your system by opening
new searchers all the time while you're indexing.

Best,
Erick
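For reference, here is a minimal, untested SolrJ sketch of the marker-based
loop described above. The field names (uuid, uniqueId) come from this
thread; the ZooKeeper string and collection name are placeholders, and it
assumes uniqueId is the collection's unique key and that its values contain
no query-special characters (escape them otherwise).

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrQuery.ORDER;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class MarkerLoop {

    private static final String PREFIX =
            "sun.org.mozilla.javascript.internal.NativeString:";

    public static void main(String[] args) throws Exception {
        // Placeholder ZK ensemble and collection name.
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
        client.setDefaultCollection("collection1");

        String marker = "*"; // "*" on the first pass, then the last uniqueId seen
        while (true) {
            // {marker TO *] is exclusive at the lower end, so each pass starts
            // strictly after the last document of the previous pass. No sleeps,
            // no waiting for commits.
            SolrQuery q = new SolrQuery(
                    "+uuid:sun.org.mozilla* +uniqueId:{" + marker + " TO *]")
                    .setRows(1000)
                    .addSort("uniqueId", ORDER.asc)
                    .setFields("uniqueId", "uuid");

            QueryResponse resp = client.query(q);
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (SolrDocument doc : resp.getResults()) {
                String uniqueId = (String) doc.getFieldValue("uniqueId");
                String uuid = (String) doc.getFieldValue("uuid");

                // Atomic update: only the uuid field is touched. Docs matching
                // sun.org.mozilla* but not PREFIX are skipped past, not
                // refetched forever, because the marker always advances.
                SolrInputDocument iDoc = new SolrInputDocument();
                iDoc.setField("uniqueId", uniqueId);
                Map<String, String> fieldModifier = new HashMap<String, String>(1);
                fieldModifier.put("set", uuid.replace(PREFIX, ""));
                iDoc.setField("uuid", fieldModifier);
                batch.add(iDoc);

                marker = uniqueId; // high-water mark for the next pass
            }
            if (batch.isEmpty()) {
                break; // done; also avoids calling add() with an empty list
            }
            client.add(batch);
        }
        client.commit(true, true); // single commit at the very end
        client.close();
    }
}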
On Fri, Sep 25, 2015 at 11:10 PM, Ravi Solr <ravis...@gmail.com> wrote:
> Erick, I fixed the "missing content stream" issue as well, by making sure
> I am not adding an empty list. However, my very first issue of getting
> zero docs once in a while is still haunting me, even after using
> cursorMarks, disabling auto commit and soft commit. I ran the code two
> times, and you can see the statement returns zero docs at random times.
>
> log.info("Indexed " + count + "/" + docList.getNumFound());
>
> -bash-4.1$ tail -f reindexing.log
> 2015-09-26 01:44:40 INFO [a.b.c.AdhocCorrectUUID] - Indexed 6500/1440653
> 2015-09-26 01:44:44 INFO [a.b.c.AdhocCorrectUUID] - Indexed 7000/1439863
> 2015-09-26 01:44:48 INFO [a.b.c.AdhocCorrectUUID] - Indexed 7500/1439410
> 2015-09-26 01:44:56 INFO [a.b.c.AdhocCorrectUUID] - Indexed 8000/1438918
> 2015-09-26 01:45:01 INFO [a.b.c.AdhocCorrectUUID] - Indexed 8500/1438330
> 2015-09-26 01:45:01 INFO [a.b.c.AdhocCorrectUUID] - Indexed 8500/0
> 2015-09-26 01:45:06 INFO [a.b.c.AdhocCorrectUUID] - FINISHED !!!
>
> 2015-09-26 01:48:15 INFO [a.b.c.AdhocCorrectUUID] - Indexed 500/1437440
> 2015-09-26 01:48:19 INFO [a.b.c.AdhocCorrectUUID] - Indexed 1000/1437440
> 2015-09-26 01:48:19 INFO [a.b.c.AdhocCorrectUUID] - Indexed 1000/0
> 2015-09-26 01:48:22 INFO [a.b.c.AdhocCorrectUUID] - FINISHED !!!
>
> Thanks
>
> Ravi Kiran Bhaskar
>
> On Sat, Sep 26, 2015 at 1:17 AM, Ravi Solr <ravis...@gmail.com> wrote:
>
>> Erick, as per your advice I used cursorMarks (see code below). It was
>> slightly better, but Solr throws exceptions randomly. Please look at the
>> code and stack trace below.
>>
>> 2015-09-26 01:00:45 INFO [a.b.c.AdhocCorrectUUID] - Indexed 500/1453133
>> 2015-09-26 01:00:49 INFO [a.b.c.AdhocCorrectUUID] - Indexed 1000/1453133
>> 2015-09-26 01:00:54 INFO [a.b.c.AdhocCorrectUUID] - Indexed 1500/1452592
>> 2015-09-26 01:00:58 INFO [a.b.c.AdhocCorrectUUID] - Indexed 2000/1452095
>> 2015-09-26 01:01:03 INFO [a.b.c.AdhocCorrectUUID] - Indexed 2500/1451675
>> 2015-09-26 01:01:10 INFO [a.b.c.AdhocCorrectUUID] - Indexed 3000/1450924
>> 2015-09-26 01:01:15 INFO [a.b.c.AdhocCorrectUUID] - Indexed 3500/1450445
>> 2015-09-26 01:01:19 INFO [a.b.c.AdhocCorrectUUID] - Indexed 4000/1449997
>> 2015-09-26 01:01:24 INFO [a.b.c.AdhocCorrectUUID] - Indexed 4500/1449692
>> 2015-09-26 01:01:28 INFO [a.b.c.AdhocCorrectUUID] - Indexed 5000/1449201
>> 2015-09-26 01:01:28 ERROR [a.b.c.AdhocCorrectUUID] - Error indexing
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> Error from server at http://xx.xx.xx.xx:1111/solr/collection1: missing
>> content stream
>>     at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:560)
>>     at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:234)
>>     at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:226)
>>     at org.apache.solr.client.solrj.impl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:376)
>>     at org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:328)
>>     at org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1085)
>>     at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:856)
>>     at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:799)
>>     at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
>>     at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:107)
>>     at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:72)
>>     at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:86)
>>     at a.b.c.AdhocCorrectUUID.processDocs(AdhocCorrectUUID.java:97)
>>     at a.b.c.AdhocCorrectUUID.main(AdhocCorrectUUID.java:37)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>     at com.simontuffs.onejar.Boot.run(Boot.java:306)
>>     at com.simontuffs.onejar.Boot.main(Boot.java:159)
>> 2015-09-26 01:01:28 INFO [a.b.c.AdhocCorrectUUID] - FINISHED !!!
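The "missing content stream" error above is what Solr returns when an update
request arrives carrying no documents, i.e. when client.add() is called with
an empty list (for example, on the final, empty cursor page). That is the
issue Ravi says he fixed in his follow-up higher up the thread. A minimal
guard, using the variable names from the code below:

    if (!inList.isEmpty()) {
        client.add(inList); // never send an empty update request
    }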
>>
>>
>> CODE
>> ------------
>> protected static void processDocs() {
>>
>>     try {
>>         CloudSolrClient client = new CloudSolrClient("zk1:xxxx,zk2:xxxx,zk3.com:xxxx");
>>         client.setDefaultCollection("collection1");
>>
>>         boolean done = false;
>>         String cursorMark = CursorMarkParams.CURSOR_MARK_START;
>>         Integer count = 0;
>>
>>         while (!done) {
>>             SolrQuery q = new SolrQuery("*:*")
>>                     .setRows(500)
>>                     .addSort("publishtime", ORDER.desc)
>>                     .addSort("uniqueId", ORDER.desc)
>>                     .setFields(new String[]{"uniqueId", "uuid"});
>>             q.addFilterQuery(new String[]{"uuid:[* TO *]", "uuid:sun.org.mozilla*"});
>>             q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
>>
>>             QueryResponse resp = client.query(q);
>>             String nextCursorMark = resp.getNextCursorMark();
>>
>>             SolrDocumentList docList = resp.getResults();
>>
>>             List<SolrInputDocument> inList = new ArrayList<SolrInputDocument>();
>>             for (SolrDocument doc : docList) {
>>
>>                 SolrInputDocument iDoc = ClientUtils.toSolrInputDocument(doc);
>>
>>                 // This is my system's id
>>                 String uniqueId = (String) iDoc.getFieldValue("uniqueId");
>>
>>                 /*
>>                  * This is another system's unique id, which is what I want
>>                  * to correct. It was messed up by the script transformer in
>>                  * the DIH import via SolrEntityProcessor, e.g.
>>                  * sun.org.mozilla.javascript.internal.NativeString:9cdef726-05dd-40b7-b1b2-c9bbce96741f
>>                  */
>>                 String uuid = (String) iDoc.getFieldValue("uuid");
>>                 String sanitizedUUID = uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", "");
>>                 Map<String, String> fieldModifier = new HashMap<String, String>(1);
>>                 fieldModifier.put("set", sanitizedUUID);
>>                 iDoc.setField("uuid", fieldModifier);
>>
>>                 inList.add(iDoc);
>>             }
>>             client.add(inList);
>>
>>             count = count + docList.size();
>>             log.info("Indexed " + count + "/" + docList.getNumFound());
>>
>>             if (cursorMark.equals(nextCursorMark)) {
>>                 done = true;
>>                 client.commit(true, true);
>>             }
>>             cursorMark = nextCursorMark;
>>         }
>>
>>     } catch (Exception e) {
>>         log.error("Error indexing ", e);
>>     }
>> }
>>
>>
>> Thanks
>>
>> Ravi Kiran Bhaskar
>>
>> On Sat, Sep 26, 2015 at 12:10 AM, Ravi Solr <ravis...@gmail.com> wrote:
>>
>>> Thank you for taking the time to help me out. Yes, I was not using
>>> cursorMark; I will try that next. This is what I was doing; it's a bit
>>> shabby coding, but what can I say, my brain was fried :-) FYI, this is
>>> a side process just to correct a messed-up string. The actual indexing
>>> process was working the whole time, as our business owners are a bit
>>> petulant about stopping indexing.
>>> My autocommit conf and code are given below; as you can see, autocommit
>>> should fire every 100 docs anyway.
>>>
>>> <autoCommit>
>>>     <maxDocs>100</maxDocs>
>>>     <maxTime>120000</maxTime>
>>> </autoCommit>
>>>
>>> <autoSoftCommit>
>>>     <maxTime>30000</maxTime>
>>> </autoSoftCommit>
>>> </updateHandler>
>>>
>>> private static void processDocs() {
>>>
>>>     try {
>>>         CloudSolrClient client = new CloudSolrClient("zk1:xxxx,zk2:xxxx,zk3.com:xxxx");
>>>         client.setDefaultCollection("collection1");
>>>
>>>         // First initialize docs
>>>         SolrDocumentList docList = getDocs(client, 100);
>>>         Long count = 0L;
>>>
>>>         while (docList != null && docList.size() > 0) {
>>>
>>>             List<SolrInputDocument> inList = new ArrayList<SolrInputDocument>();
>>>             for (SolrDocument doc : docList) {
>>>
>>>                 SolrInputDocument iDoc = ClientUtils.toSolrInputDocument(doc);
>>>
>>>                 // This is my SOLR's unique id
>>>                 String uniqueId = (String) iDoc.getFieldValue("uniqueId");
>>>
>>>                 /*
>>>                  * This is another system's id, which is what I want to
>>>                  * correct. It was messed up by the script transformer in
>>>                  * the DIH import via SolrEntityProcessor, e.g.
>>>                  * sun.org.mozilla.javascript.internal.NativeString:9cdef726-05dd-40b7-b1b2-c9bbce96741f
>>>                  */
>>>                 String uuid = (String) iDoc.getFieldValue("uuid");
>>>                 String sanitizedUUID = uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", "");
>>>                 Map<String, String> fieldModifier = new HashMap<String, String>(1);
>>>                 fieldModifier.put("set", sanitizedUUID);
>>>                 iDoc.setField("uuid", fieldModifier);
>>>
>>>                 inList.add(iDoc);
>>>                 log.info("added " + uniqueId);
>>>             }
>>>
>>>             client.add(inList);
>>>
>>>             count = count + docList.size();
>>>             log.info("Indexed " + count + "/" + docList.getNumFound());
>>>
>>>             Thread.sleep(5000);
>>>
>>>             docList = getDocs(client, docList.size());
>>>             log.info("Got Docs- " + docList.getNumFound());
>>>         }
>>>
>>>     } catch (Exception e) {
>>>         log.error("Error indexing ", e);
>>>     }
>>> }
>>>
>>> private static SolrDocumentList getDocs(CloudSolrClient client, Integer rows) {
>>>
>>>     SolrQuery q = new SolrQuery("*:*");
>>>     q.setSort("publishtime", ORDER.desc);
>>>     q.setStart(0);
>>>     q.setRows(rows);
>>>     q.addFilterQuery(new String[]{"uuid:[* TO *]", "uuid:sun.org.mozilla*"});
>>>     q.setFields(new String[]{"uniqueId", "uuid"});
>>>     SolrDocumentList docList = null;
>>>     QueryResponse resp;
>>>     try {
>>>         resp = client.query(q);
>>>         docList = resp.getResults();
>>>     } catch (Exception e) {
>>>         log.error("Error querying " + q.toString(), e);
>>>     }
>>>     return docList;
>>> }
>>>
>>>
>>> Thanks
>>>
>>> Ravi Kiran Bhaskar
>>>
>>> On Fri, Sep 25, 2015 at 10:58 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>
>>>> Wait, query again how? You've got to have something that keeps you
>>>> from getting the same 100 docs back, so you have to be sorting somehow.
>>>> Or you have a high-water mark. Or something. Waiting 5 seconds for any
>>>> commit also doesn't really make sense to me. I mean, how do you know
>>>>
>>>> 1> that you're going to get a commit (did you explicitly send one from
>>>> the client?)
>>>> 2> all autowarming will be complete by the time the next query hits?
>>>>
>>>> Let's see the query you fire. There has to be some kind of marker that
>>>> you're using to know when you've gotten through the entire set.
>>>>
>>>> And I would use much larger batches; I usually update in batches of
>>>> 1,000 (excepting if these are very large docs, of course). I suspect
>>>> you're spending a lot more time sleeping than you need to. I wouldn't
>>>> sleep at all, in fact. This is one (rare) case where I might consider
>>>> committing from the client. If you specify the wait-for-searcher param
>>>> (server.commit(true, true)), then it doesn't return until a new
>>>> searcher is completely opened, so your previous updates will be
>>>> reflected in your next search.
>>>>
>>>> Actually, what I'd really do is
>>>> 1> turn off all auto commits
>>>> 2> go ahead and query/change/update, but the query bits would be using
>>>> the cursorMark
>>>> 3> do NOT commit
>>>> 4> issue a commit when you are all done
>>>>
>>>> I bet you'd get through your update a lot faster that way.
>>>>
>>>> Best,
>>>> Erick
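A compact, untested sketch of that recipe: cursorMark paging, no commits
inside the loop, one blocking commit at the end. It assumes autoCommit and
autoSoftCommit have been disabled in solrconfig.xml and that uniqueId is the
collection's unique key; the ZooKeeper string is a placeholder, and the
atomic-update body would be built exactly as in the processDocs() code
earlier in the thread.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrQuery.ORDER;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorRecipe {
    public static void main(String[] args) throws Exception {
        // 1> assumes autoCommit/autoSoftCommit are turned off in solrconfig.xml
        CloudSolrClient client = new CloudSolrClient("zk1:xxxx,zk2:xxxx,zk3:xxxx");
        client.setDefaultCollection("collection1");

        String cursorMark = CursorMarkParams.CURSOR_MARK_START;
        while (true) {
            SolrQuery q = new SolrQuery("uuid:sun.org.mozilla*")
                    .setRows(1000)
                    .addSort("uniqueId", ORDER.asc) // cursors need a sort on the unique key
                    .setFields("uniqueId", "uuid");
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);

            QueryResponse resp = client.query(q);
            // 2> build the atomic updates from resp.getResults() and add() them
            //    here, as in processDocs() above (skipping the add when the
            //    page is empty).

            // 3> no commit inside the loop: no new searcher ever opens, so the
            //    cursor pages over one stable view of the index.
            String next = resp.getNextCursorMark();
            if (cursorMark.equals(next)) {
                break; // cursor exhausted
            }
            cursorMark = next;
        }
        client.commit(true, true); // 4> one commit; blocks until a new searcher is open
        client.close();
    }
}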
>>>> On Fri, Sep 25, 2015 at 5:07 PM, Ravi Solr <ravis...@gmail.com> wrote:
>>>> > Thanks for responding, Erick. I set the "start" to zero and "rows"
>>>> > always to 100. I create a CloudSolrClient instance and use it to
>>>> > both query as well as index. But I do sleep for 5 secs just to allow
>>>> > for any auto commits.
>>>> >
>>>> > So query --> client.add(100 docs) --> wait --> query again
>>>> >
>>>> > But the weird thing I noticed was that after 8 or 9 batches, i.e.
>>>> > 800/900 docs, the "query again" returns zero docs, causing my while
>>>> > loop to exit... so I was trying to see if I was doing the right
>>>> > thing, or if there is an alternate way to do heavy indexing.
>>>> >
>>>> > Thanks
>>>> >
>>>> > Ravi Kiran Bhaskar
>>>> >
>>>> > On Friday, September 25, 2015, Erick Erickson <erickerick...@gmail.com> wrote:
>>>> >
>>>> >> How are you querying Solr? You say you query for 100 docs,
>>>> >> update, then get the next set. What are you using for a marker?
>>>> >> If you're using the start parameter, and somehow a commit is
>>>> >> creeping in, things might be weird, especially if you're using any
>>>> >> of the internal Lucene doc IDs. If you're absolutely sure no
>>>> >> commits are taking place, even that should be OK.
>>>> >>
>>>> >> The "deep paging" stuff could be helpful here; see:
>>>> >> https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>>>> >>
>>>> >> Best,
>>>> >> Erick
>>>> >>
>>>> >> On Fri, Sep 25, 2015 at 3:13 PM, Ravi Solr <ravis...@gmail.com> wrote:
>>>> >> > No problem Walter, it's all fun. I was just wondering if there
>>>> >> > was some other good way that I did not know of, that's all 😀
>>>> >> >
>>>> >> > Thanks
>>>> >> >
>>>> >> > Ravi Kiran Bhaskar
>>>> >> >
>>>> >> > On Friday, September 25, 2015, Walter Underwood <wun...@wunderwood.org> wrote:
>>>> >> >
>>>> >> >> Sorry, I did not mean to be rude. The original question did not
>>>> >> >> say that you don't have the docs outside of Solr. Some people
>>>> >> >> jump to the advanced features and miss the simple ones.
>>>> >> >>
>>>> >> >> It might be faster to fetch all the docs from Solr and save them
>>>> >> >> in files. Then modify them. Then reload all of them. No
>>>> >> >> guarantee, but it is worth a try.
>>>> >> >>
>>>> >> >> Good luck.
>>>> >> >>
>>>> >> >> wunder
>>>> >> >> Walter Underwood
>>>> >> >> wun...@wunderwood.org
>>>> >> >> http://observer.wunderwood.org/ (my blog)
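A rough, untested sketch of the dump-then-reload idea above: page the broken
uniqueId/uuid pairs out to a local TSV file with a cursor, fix the values
offline, then send them back. The file name, batch size, and ZooKeeper
string are placeholders, and it deviates from a literal full-document reload
by sending atomic updates instead, so fields that were never fetched are not
clobbered.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrQuery.ORDER;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class DumpFixReload {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("zk1:xxxx"); // placeholder
        client.setDefaultCollection("collection1");

        // 1. Dump uniqueId/uuid pairs for the broken docs to a TSV file.
        Path dump = Paths.get("uuid-fix.tsv"); // placeholder path
        try (BufferedWriter w = Files.newBufferedWriter(dump, StandardCharsets.UTF_8)) {
            String cursor = CursorMarkParams.CURSOR_MARK_START;
            while (true) {
                SolrQuery q = new SolrQuery("uuid:sun.org.mozilla*")
                        .setRows(1000)
                        .addSort("uniqueId", ORDER.asc)
                        .setFields("uniqueId", "uuid");
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse resp = client.query(q);
                for (SolrDocument d : resp.getResults()) {
                    w.write(d.getFieldValue("uniqueId") + "\t" + d.getFieldValue("uuid") + "\n");
                }
                String next = resp.getNextCursorMark();
                if (cursor.equals(next)) break;
                cursor = next;
            }
        }

        // 2. Fix the values offline, and 3. reload them as atomic updates.
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        try (BufferedReader r = Files.newBufferedReader(dump, StandardCharsets.UTF_8)) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] cols = line.split("\t", 2);
                SolrInputDocument doc = new SolrInputDocument();
                doc.setField("uniqueId", cols[0]);
                Map<String, String> set = new HashMap<String, String>(1);
                set.put("set", cols[1].replace("sun.org.mozilla.javascript.internal.NativeString:", ""));
                doc.setField("uuid", set);
                batch.add(doc);
                if (batch.size() == 1000) { client.add(batch); batch.clear(); }
            }
        }
        if (!batch.isEmpty()) client.add(batch);
        client.commit(true, true); // one commit at the end
        client.close();
    }
}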
>>>> >> >>
>>>> >> >> > On Sep 25, 2015, at 2:59 PM, Ravi Solr <ravis...@gmail.com> wrote:
>>>> >> >> >
>>>> >> >> > Walter, not in a mood for banter right now... it's 6:00 pm on
>>>> >> >> > a Friday and I am stuck here trying to figure out reindexing
>>>> >> >> > issues :-)
>>>> >> >> > I don't have the source of the docs, so I have to query Solr,
>>>> >> >> > modify, and put it back, and that is seeming to be quite a
>>>> >> >> > task in 5.3.0. I did reindex several times with 4.7.2 in a
>>>> >> >> > master/slave env without any issue. Since then we have moved
>>>> >> >> > to cloud, and it has been a pain all day.
>>>> >> >> >
>>>> >> >> > Thanks
>>>> >> >> >
>>>> >> >> > Ravi Kiran Bhaskar
>>>> >> >> >
>>>> >> >> > On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>>>> >> >> >
>>>> >> >> >> Sure.
>>>> >> >> >>
>>>> >> >> >> 1. Delete all the docs (no commit).
>>>> >> >> >> 2. Add all the docs (no commit).
>>>> >> >> >> 3. Commit.
>>>> >> >> >>
>>>> >> >> >> wunder
>>>> >> >> >> Walter Underwood
>>>> >> >> >> wun...@wunderwood.org
>>>> >> >> >> http://observer.wunderwood.org/ (my blog)
>>>> >> >> >>
>>>> >> >> >>> On Sep 25, 2015, at 2:17 PM, Ravi Solr <ravis...@gmail.com> wrote:
>>>> >> >> >>>
>>>> >> >> >>> I have been trying to re-index the docs (about 1.5 million),
>>>> >> >> >>> as one of the fields needed part of its string value removed
>>>> >> >> >>> (accidentally introduced). I was issuing a query for 100
>>>> >> >> >>> docs, getting 4 fields, and updating the doc (atomic update
>>>> >> >> >>> with "set") via the CloudSolrClient in batches. However,
>>>> >> >> >>> from time to time the query returns 0 results, which exits
>>>> >> >> >>> the re-indexing program.
>>>> >> >> >>>
>>>> >> >> >>> I can't understand why the cloud returns 0 results when
>>>> >> >> >>> there are 1.4+ million docs which have the "accidental"
>>>> >> >> >>> string in them.
>>>> >> >> >>>
>>>> >> >> >>> Is there another way to do bulk massive updates?
>>>> >> >> >>>
>>>> >> >> >>> Thanks
>>>> >> >> >>>
>>>> >> >> >>> Ravi Kiran Bhaskar
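For completeness, Walter's three steps above as SolrJ calls, assuming the
full document set is available outside Solr; client and allDocs are
placeholder names:

    client.deleteByQuery("*:*"); // 1. delete all the docs (no commit)
    client.add(allDocs);         // 2. add all the docs (no commit)
    client.commit();             // 3. one commit at the end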