Hi,

I really appreciate your quick help!

1) I want Solr not to cache any IndexReader (hopefully this is
possible), because our app is made up of many Lucene folders and none
of them is very large. From my previous tests, performance seems fine
if we simply create a new IndexReader each time. Hopefully, by doing it
this way, we have no sync issues?
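Concretely, the per-request pattern I have in mind looks roughly like
this (just a sketch against the Lucene 3.x API; the folder path and
query are placeholders):

```java
import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class PerRequestSearch {
    // Open a fresh read-only IndexReader for every query and close it
    // afterwards, so nothing is cached between requests.
    public static TopDocs search(File indexFolder, Query q) throws Exception {
        FSDirectory dir = FSDirectory.open(indexFolder);
        IndexReader reader = IndexReader.open(dir, true); // read-only
        try {
            IndexSearcher searcher = new IndexSearcher(reader);
            return searcher.search(q, 10);
        } finally {
            reader.close();
            dir.close();
        }
    }
}
```

Since each reader is opened and closed inside one request, a concurrent
reindex of the folder would simply be picked up by the next request.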

2) Our data lives mainly in an RDB (currently MySQL; we will move to
Cassandra later). My main concern is that with Solr we have to push a
rather large amount of data through the network layer via HTTP, which
could be a problem?

Best regards, Lisheng

-----Original Message-----
From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com]
Sent: Thursday, July 26, 2012 12:46 PM
To: solr-user@lucene.apache.org
Subject: Re: Bulk indexing data into solr


IIRC, a problem with this kind of scheme was discussed here about two
months ago, but I can't remember the exact details.
The scheme is generally correct. But you didn't say how you let Solr
know that it needs to reopen the new index generation after the indexer
fsyncs the segments.gen file.

btw, this might be a relevant issue:
https://lucene.apache.org/core/old_versioned_docs//versions/3_0_1/api/all/org/apache/lucene/index/IndexWriter.html#commit()
 Note that this operation calls Directory.sync on the index files. That
call should not return until the file contents & metadata are on stable
storage. For FSDirectory, this calls the OS's fsync. But, beware: some
hardware devices may in fact cache writes even during fsync, and return
before the bits are actually on stable storage, to give the appearance of
faster performance.

You should ensure that after segments.gen is fsync'ed, all the other
index files are fsync'ed and visible to the other processes too.
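To illustrate what that fsync guarantee means in plain java.nio (this
is not Lucene's actual code, just the OS-level call that
FSDirectory.sync ultimately boils down to; the file name is arbitrary):

```java
import java.io.File;
import java.io.RandomAccessFile;

public class SyncDemo {
    // Write bytes and force them (plus file metadata) to stable storage
    // before returning, i.e. the equivalent of an fsync on the file.
    public static void writeDurably(File f, byte[] data) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(f, "rw");
        try {
            raf.write(data);
            raf.getChannel().force(true); // true = also flush metadata
        } finally {
            raf.close();
        }
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("segments", ".tmp");
        writeDurably(f, "demo".getBytes("UTF-8"));
        System.out.println(f.length());
        f.delete();
    }
}
```

As the javadoc quoted above warns, even force(true) can be defeated by
hardware write caches, which is why the indexer must fsync every index
file, not just the segments file, before the searcher process reopens.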

Could you tell us more about your data:
What is the format?
Is it located locally, relative to the indexer?
And why can't you use remote streaming via Solr's update handler, or an
indexer client app with StreamingUpdateSolrServer?
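For example, a bulk-indexing client could look roughly like this
(SolrJ 3.x sketch; the URL, queue size, thread count, and field names
are all made up for illustration):

```java
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // Buffer up to 1000 docs; 4 background threads stream them to
        // Solr over a small number of long-lived HTTP connections,
        // instead of one request per document.
        StreamingUpdateSolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 4);
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("text", "row " + i);
            server.add(doc); // queued, not sent synchronously
        }
        server.commit(); // make the new documents searchable
    }
}
```

The streaming/batching is what removes most of the per-document HTTP
overhead you are worried about.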

On Thu, Jul 26, 2012 at 10:47 PM, Zhang, Lisheng <
lisheng.zh...@broadvision.com> wrote:

> Hi,
>
> I think that, at least before Lucene 4.0, only one process/thread can
> write to a given Lucene folder. Based on this fact my initial plan is:
>
> 1) There is one set of Lucene index folders.
> 2) The Solr server only performs queries on those folders.
> 3) A separate process (multi-threaded) indexes those Lucene folders
>    (each folder is a separate app). Only one thread will index any
>    given Lucene folder.
>
> Thanks very much for helps, Lisheng
>
>
> -----Original Message-----
> From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com]
> Sent: Thursday, July 26, 2012 10:15 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Bulk indexing data into solr
>
>
> Coming back to your original question. I'm a little puzzled.
> It's not clear where you want to call the Lucene API directly from.
> If you mean that you have a standalone indexer which writes the index
> files, then stops, and the files become available to the Solr process,
> it will work.
> Sharing an index between processes, or using EmbeddedSolrServer, is
> asking for trouble (even though Lucene has a lock mechanism, which I'm
> not completely familiar with).
> I conclude that the data you index is collocated with the Solr
> server. In that case consider
> http://wiki.apache.org/solr/ContentStream#RemoteStreaming
>
> Please give more details about your design.
>
> On Thu, Jul 26, 2012 at 1:22 PM, Zhang, Lisheng <
> lisheng.zh...@broadvision.com> wrote:
>
> >
> > Hi,
> >
> > I am starting to use Solr, and now I need to index a rather large
> > amount of data. It seems that passing the data to Solr through HTTP
> > is rather inefficient, so I am thinking of still calling the Lucene
> > API directly for bulk indexing while using Solr for search. Is this
> > design OK?
> >
> > Thanks very much for helps, Lisheng
> >
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Tech Lead
> Grid Dynamics
>
> <http://www.griddynamics.com>
>  <mkhlud...@griddynamics.com>
>



-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

<http://www.griddynamics.com>
 <mkhlud...@griddynamics.com>
