An update on how I ended up implementing the requirement, in case it helps 
others. There is a lot of other code I did not include, but the general logic 
is below.

While performance is still not great, it is 10x faster than atomic updates 
(because RealTimeGetComponent.getInputDocument() is not needed).


1. Wrote an update handler
   /myupdater?q=*:*&sort=fieldx+desc&fl=fieldx,fieldy&stream.file=exampledocs/oldvalueToNewValue.properties&update.chain=myprocessor


2. In the handler, read the map from the content stream and invoke the /export 
handler with the query params
   SolrRequestHandler handler = core.getRequestHandler("/export");
   core.execute(handler, req, rsp);
   numFound = (Integer) req.getContext().get("totalHits");
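
The oldvalueToNewValue.properties file from step 1 is just properties-format 
content, so reading it into a map is straightforward. A minimal, self-contained 
sketch of that part (the class and method names here are illustrative, not the 
actual handler code):

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class ValueMapLoader {

    // Load an old-value -> new-value map from properties-format content,
    // e.g. the body of the stream.file content stream.
    static Map<String, String> load(Reader reader) throws Exception {
        Properties props = new Properties();
        props.load(reader);
        Map<String, String> map = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            map.put(key, props.getProperty(key));
        }
        return map;
    }

    public static void main(String[] args) throws Exception {
        String content = "oldA=newA\noldB=newB\n";
        Map<String, String> map = load(new StringReader(content));
        System.out.println(map.get("oldA")); // newA
    }
}
```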


3. Iterate over the /export handler response, similar to the 
SortingResponseWriter.write() method
 
   List<LeafReaderContext> leaves = 
req.getSearcher().getTopReaderContext().leaves();
   for (int i = 0; i < leaves.size(); i++) {
     // sets[] holds the per-segment bitsets collected from the export response
     DocIdSetIterator it = new BitSetIterator(sets[i], 0);
     int docId;
     while ((docId = it.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        // get lucene doc
        Document luceneDoc = leaves.get(i).reader().document(docId);

        // update lucene doc with new values
        updateDoc(luceneDoc, oldValueToNewValuesMap);

        // post lucene doc to a linked blocking queue
        queue.add(luceneDoc);
     }
   }
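
updateDoc() above is not shown; its core is just a lookup-and-replace of the 
field's stored value against the map. A self-contained sketch of that mapping 
logic (using a plain Map in place of the Lucene Document, purely for 
illustration — the real code would operate on 
org.apache.lucene.document.Document):

```java
import java.util.HashMap;
import java.util.Map;

public class DocFieldRemapper {

    // Replace the value of `field` in `doc` when the current value appears
    // as a key in oldToNew; leave the doc untouched otherwise.
    static void updateDoc(Map<String, Object> doc, String field,
                          Map<String, String> oldToNew) {
        Object current = doc.get(field);
        if (current != null && oldToNew.containsKey(current.toString())) {
            doc.put(field, oldToNew.get(current.toString()));
        }
    }

    public static void main(String[] args) {
        Map<String, String> oldToNew = new HashMap<>();
        oldToNew.put("oldValue", "newValue");

        Map<String, Object> doc = new HashMap<>();
        doc.put("fieldx", "oldValue");
        doc.put("fieldy", "untouched");

        updateDoc(doc, "fieldx", oldToNew);
        System.out.println(doc.get("fieldx")); // newValue
    }
}
```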


4. Have N threads waiting on the queue for docs; each invokes the 
UpdateRequestProcessor chain specified by the update.chain param
   AddUpdateCommand cmd = new AddUpdateCommand(req);
   IndexSchema schema = req.getCore().getLatestSchema();
   while (true) {
      Document luceneDoc = queue.take();
      SolrInputDocument doc = toSolrInputDocument(luceneDoc, schema);

      cmd.solrDoc = doc;

      // set these fields as needed
      cmd.overwrite = false;
      cmd.setVersion(0);
      doc.removeField("_version_");

      // post doc
      updateProcessor.processAdd(cmd);
   }
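
The handoff between steps 3 and 4 is a standard LinkedBlockingQueue 
producer/consumer pattern. A self-contained sketch with plain strings standing 
in for Lucene documents (the poison-pill shutdown is my addition for the 
sketch — the code above loops forever):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class QueueConsumerSketch {

    static final String POISON = "__END__";

    // Produce nDocs items, consume them with nThreads workers, and return
    // how many were processed. The increment stands in for processAdd(cmd).
    static int runPipeline(int nDocs, int nThreads) throws Exception {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        AtomicInteger processed = new AtomicInteger();

        // N consumers: each take()s a doc and "posts" it.
        for (int i = 0; i < nThreads; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        String doc = queue.take();
                        if (POISON.equals(doc)) {
                            return; // shut down this worker
                        }
                        processed.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Producer: the segment-iteration loop from step 3 plays this role.
        for (int d = 0; d < nDocs; d++) {
            queue.add("doc-" + d);
        }
        for (int i = 0; i < nThreads; i++) {
            queue.add(POISON); // one pill per worker
        }

        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return processed.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runPipeline(1000, 4)); // 1000
    }
}
```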


-Mohsin


----- Original Message -----
From: jack.krupan...@gmail.com
To: solr-user@lucene.apache.org
Sent: Friday, March 18, 2016 6:55:17 AM GMT -08:00 US/Canada Pacific
Subject: Re: how to update billions of docs

That's another great example of a mode that Bulk Field Update (my mythical
feature) needs - switch a list of fields from stored to docvalues.

And maybe even the opposite since there are scenarios in which docValues is
worse than stored and you would only find that out after indexing...
billions of documents.

Being able to switch indexed mode of a field (or list of fields) is also a
mode needed for bulk update (reindex).


-- Jack Krupansky

On Fri, Mar 18, 2016 at 4:12 AM, Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> Hi Mohsin,
> There's some work in progress for in-place updates to docValued fields,
> https://issues.apache.org/jira/browse/SOLR-5944. Can you try the latest
> patch there (or ping me if you need a git branch)?
> It would be nice to know how fast the updates go for your usecase with that
> patch. Please note that for that patch, both the version field and the
> updated field needs to have stored=false, indexed=false, docValues=true.
> Regards,
> Ishan
>
>
> On Thu, Mar 17, 2016 at 10:55 PM, Jack Krupansky <jack.krupan...@gmail.com
> >
> wrote:
>
> > It would be nice to have a wiki/doc for "Bulk Field Update" that listed
> all
> > of these techniques and tricks.
> >
> > And, of course, it would be so much better to have an explicit Lucene
> > feature for this. It could work in the background like merge and process
> > one segment at a time as efficiently as possible.
> >
> > Have several modes:
> >
> > 1. Set a field of all documents to explicit value.
> > 2. Set a field of query documents to an explicit value.
> > 3. Increment by n.
> > 4. Add new field to all document, or maybe by query.
> > 5. Delete existing field for all documents.
> > 6. Delete field value for all documents or a specified query.
> >
> >
> > -- Jack Krupansky
> >
> > On Thu, Mar 17, 2016 at 12:31 PM, Ken Krugler <
> kkrugler_li...@transpac.com
> > >
> > wrote:
> >
> > > As others noted, currently updating a field means deleting and
> inserting
> > > the entire document.
> > >
> > > Depending on how you use the field, you might be able to create another
> > > core/container with that one field (plus the key field), and use join
> > > support.
> > >
> > > Note that https://issues.apache.org/jira/browse/LUCENE-6352 is an
> > > improvement, which looks like it's in the 5.x code line, though I don't
> > see
> > > a fix version.
> > >
> > > -- Ken
> > >
> > > > From: Mohsin Beg Beg
> > > > Sent: March 16, 2016 3:52:47pm PDT
> > > > To: solr-user@lucene.apache.org
> > > > Subject: how to update billions of docs
> > > >
> > > > Hi,
> > > >
> > > > I have a requirement to replace a value of a field in 100B's of docs
> in
> > > 100's of cores.
> > > > The field is multiValued=false docValues=true type=StrField
> stored=true
> > > indexed=true.
> > > >
> > > > Atomic Updates performance is on the order of 5K docs per sec per
> core
> > > in solr 5.3 (other fields are quite big).
> > > >
> > > > Any suggestions ?
> > > >
> > > > -Mohsin
> > >
> > >
> > > --------------------------
> > > Ken Krugler
> > > +1 530-210-6378
> > > http://www.scaleunlimited.com
> > > custom big data solutions & training
> > > Hadoop, Cascading, Cassandra & Solr
