Thank you for taking the time to help me out. Yes, I was not using cursorMark; I will try that next. This is what I was doing; it's a bit shabby code, but what can I say, my brain was fried :-) FYI, this is a side process just to correct a messed-up string. The actual indexing process was running the whole time, as our business owners are a bit petulant about stopping indexing. My autocommit config and code are given below; as you can see, autocommit should fire every 100 docs anyway.
<autoCommit>
  <maxDocs>100</maxDocs>
  <maxTime>120000</maxTime>
</autoCommit>
<autoSoftCommit>
  <maxTime>30000</maxTime>
</autoSoftCommit>
</updateHandler>

private static void processDocs() {
    try {
        CloudSolrClient client = new CloudSolrClient("zk1:xxxx,zk2:xxxx,zk3.com:xxxx");
        client.setDefaultCollection("collection1");

        // First initialize docs
        SolrDocumentList docList = getDocs(client, 100);
        Long count = 0L;
        while (docList != null && docList.size() > 0) {
            List<SolrInputDocument> inList = new ArrayList<SolrInputDocument>();
            for (SolrDocument doc : docList) {
                SolrInputDocument iDoc = ClientUtils.toSolrInputDocument(doc);

                // This is my SOLR's unique id
                String uniqueId = (String) iDoc.getFieldValue("uniqueId");

                /*
                 * This is another system's id, which is what I want to correct. It was messed up
                 * because of the script transformer in the DIH import via SolrEntityProcessor, e.g.
                 * sun.org.mozilla.javascript.internal.NativeString:9cdef726-05dd-40b7-b1b2-c9bbce96741f
                 */
                String uuid = (String) iDoc.getFieldValue("uuid");
                String sanitizedUUID = uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", "");

                // Atomic update: "set" replaces the uuid field with the sanitized value
                Map<String, String> fieldModifier = new HashMap<String, String>(1);
                fieldModifier.put("set", sanitizedUUID);
                iDoc.setField("uuid", fieldModifier);

                inList.add(iDoc);
                log.info("added " + uniqueId);
            }
            client.add(inList);
            count = count + docList.size();
            log.info("Indexed " + count + "/" + docList.getNumFound());
            Thread.sleep(5000);
            docList = getDocs(client, docList.size());
            log.info("Got Docs- " + docList.getNumFound());
        }
    } catch (Exception e) {
        log.error("Error indexing ", e);
    }
}

private static SolrDocumentList getDocs(CloudSolrClient client, Integer rows) {
    SolrQuery q = new SolrQuery("*:*");
    q.setSort("publishtime", ORDER.desc);
    q.setStart(0);
    q.setRows(rows);
    q.addFilterQuery(new String[] { "uuid:[* TO *]", "uuid:sun.org.mozilla*" });
    q.setFields(new String[] { "uniqueId", "uuid" });
    SolrDocumentList docList = null;
    QueryResponse resp;
    try {
        resp = client.query(q);
        docList = resp.getResults();
    } catch (Exception e) {
        log.error("Error querying " + q.toString(), e);
    }
    return docList;
}
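Just so it's written down, this is roughly what I am planning for the cursorMark version. It's a rough, untested sketch: the method name is made up, it reuses the same collection and fields as above, uses batches of 1000 like you suggested, and assumes autocommit is turned off so the one explicit commit at the end is the only commit.

private static void processDocsWithCursor(CloudSolrClient client) throws Exception {
    SolrQuery q = new SolrQuery("*:*");
    q.addFilterQuery("uuid:[* TO *]", "uuid:sun.org.mozilla*");
    q.setFields("uniqueId", "uuid");
    q.setRows(1000);
    // cursorMark requires a sort that includes the uniqueKey field as a tie-breaker
    q.setSort("publishtime", ORDER.desc);
    q.addSort("uniqueId", ORDER.asc);

    String cursorMark = CursorMarkParams.CURSOR_MARK_START;
    boolean done = false;
    while (!done) {
        q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
        QueryResponse resp = client.query(q);
        String nextCursorMark = resp.getNextCursorMark();

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (SolrDocument doc : resp.getResults()) {
            // Send only the id plus an atomic "set" on uuid instead of the whole document
            SolrInputDocument iDoc = new SolrInputDocument();
            iDoc.setField("uniqueId", doc.getFieldValue("uniqueId"));

            String uuid = (String) doc.getFieldValue("uuid");
            Map<String, String> fieldModifier = new HashMap<String, String>(1);
            fieldModifier.put("set", uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", ""));
            iDoc.setField("uuid", fieldModifier);
            batch.add(iDoc);
        }
        if (!batch.isEmpty()) {
            client.add(batch);
        }

        if (cursorMark.equals(nextCursorMark)) {
            done = true; // cursor did not move, so we have seen everything
        }
        cursorMark = nextCursorMark;
    }

    // single explicit commit at the end, waiting for the new searcher
    client.commit(true, true);
}

If I understand the cursorMark docs correctly, the filter on the old uuid values should stay stable for the whole run as long as nothing commits in between, so the loop should not miss or repeat documents.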
Thanks

Ravi Kiran Bhaskar

On Fri, Sep 25, 2015 at 10:58 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Wait, query again how? You've got to have something that keeps you from getting the same 100 docs back, so you have to be sorting somehow. Or you have a high water mark. Or something. Waiting 5 seconds for any commit also doesn't really make sense to me. I mean, how do you know
>
> 1> that you're going to get a commit (did you explicitly send one from the client?).
> 2> all autowarming will be complete by the time the next query hits?
>
> Let's see the query you fire. There has to be some kind of marker that you're using to know when you've gotten through the entire set.
>
> And I would use much larger batches, I usually update in batches of 1,000 (excepting if these are very large docs of course). I suspect you're spending a lot more time sleeping than you need to. I wouldn't sleep at all in fact. This is one (rare) case I might consider committing from the client. If you specify the wait-for-searcher param (server.commit(true, true)), then it doesn't return until a new searcher is completely opened, so your previous updates will be reflected in your next search.
>
> Actually, what I'd really do is
> 1> turn off all auto commits
> 2> go ahead and query/change/update. But the query bits would be using the cursorMark.
> 3> do NOT commit
> 4> issue a commit when you were all done.
>
> I bet you'd get through your update a lot faster that way.
>
> Best,
> Erick
>
> On Fri, Sep 25, 2015 at 5:07 PM, Ravi Solr <ravis...@gmail.com> wrote:
> > Thanks for responding Erick. I set the "start" to zero and "rows" always to 100. I create a CloudSolrClient instance and use it to both query as well as index. But I do sleep for 5 secs just to allow for any auto commits.
> >
> > So query --> client.add(100 docs) --> wait --> query again
> >
> > But the weird thing I noticed was that after 8 or 9 batches, i.e. 800/900 docs, the "query again" returns zero docs, causing my while loop to exit... so I was trying to see if I was doing the right thing or if there is an alternate way to do heavy indexing.
> >
> > Thanks
> >
> > Ravi Kiran Bhaskar
> >
> > On Friday, September 25, 2015, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >> How are you querying Solr? You say you query for 100 docs, update, then get the next set. What are you using for a marker? If you're using the start parameter, and somehow a commit is creeping in, things might be weird, especially if you're using any of the internal Lucene doc IDs. If you're absolutely sure no commits are taking place, even that should be OK.
> >>
> >> The "deep paging" stuff could be helpful here, see:
> >>
> >> https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
> >>
> >> Best,
> >> Erick
> >>
> >> On Fri, Sep 25, 2015 at 3:13 PM, Ravi Solr <ravis...@gmail.com> wrote:
> >> > No problem Walter, it's all fun. Was just wondering if there was some other good way that I did not know of, that's all 😀
> >> >
> >> > Thanks
> >> >
> >> > Ravi Kiran Bhaskar
> >> >
> >> > On Friday, September 25, 2015, Walter Underwood <wun...@wunderwood.org> wrote:
> >> >
> >> >> Sorry, I did not mean to be rude. The original question did not say that you don't have the docs outside of Solr. Some people jump to the advanced features and miss the simple ones.
> >> >>
> >> >> It might be faster to fetch all the docs from Solr and save them in files. Then modify them. Then reload all of them. No guarantee, but it is worth a try.
> >> >>
> >> >> Good luck.
> >> >>
> >> >> wunder
> >> >> Walter Underwood
> >> >> wun...@wunderwood.org
> >> >> http://observer.wunderwood.org/ (my blog)
> >> >>
> >> >> > On Sep 25, 2015, at 2:59 PM, Ravi Solr <ravis...@gmail.com> wrote:
> >> >> >
> >> >> > Walter, not in a mood for banter right now.... It's 6:00 pm on a Friday and I am stuck here trying to figure out reindexing issues :-)
> >> >> > I don't have the source of the docs, so I have to query SOLR, modify, and put it back, and that is seeming to be quite a task in 5.3.0. I did reindex several times with 4.7.2 in a master-slave env without any issue. Since then we have moved to cloud and it has been a pain all day.
> >> >> >
> >> >> > Thanks
> >> >> >
> >> >> > Ravi Kiran Bhaskar
> >> >> >
> >> >> > On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> >> >> >
> >> >> >> Sure.
> >> >> >>
> >> >> >> 1. Delete all the docs (no commit).
> >> >> >> 2. Add all the docs (no commit).
> >> >> >> 3. Commit.
> >> >> >>
> >> >> >> wunder
> >> >> >> Walter Underwood
> >> >> >> wun...@wunderwood.org
> >> >> >> http://observer.wunderwood.org/ (my blog)
> >> >> >>
> >> >> >>> On Sep 25, 2015, at 2:17 PM, Ravi Solr <ravis...@gmail.com> wrote:
> >> >> >>>
> >> >> >>> I have been trying to re-index the docs (about 1.5 million) as one of the fields needed part of a string value removed (accidentally introduced). I was issuing a query for 100 docs, getting 4 fields, and updating the docs (atomic update with "set") via the CloudSolrClient in batches. However, from time to time the query returns 0 results, which exits the re-indexing program.
> >> >> >>>
> >> >> >>> I can't understand why the cloud returns 0 results when there are 1.4x million docs which have the "accidental" string in them.
> >> >> >>>
> >> >> >>> Is there another way to do massive bulk updates?
> >> >> >>>
> >> >> >>> Thanks
> >> >> >>>
> >> >> >>> Ravi Kiran Bhaskar