The large per-file overhead on writes suggests it would be beneficial to set
useCompoundFile to true (the default is false).

I think unlocking more write performance requires some sort of write-level
cache so that segment merges can use local copies of segment files that were
written recently.  It could be layered as a Directory wrapper (i.e. one that
extends FilterDirectory).  I've done some thinking about this lately.
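
A minimal sketch of the write-level cache idea, using only the JDK.  Note the
assumptions: FileStore, FakeRemoteStore and RecentWriteCache are hypothetical
names made up for illustration, and FileStore is a simplified stand-in, not
Lucene's Directory or FilterDirectory API.  It only shows the wrapper shape:
write through to the remote store, keep the most recently written files
locally, and serve reads of those files (e.g. from a merge) without a remote
round trip.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class WriteCacheSketch {

    /** Hypothetical stand-in for a remote store; NOT Lucene's Directory API. */
    interface FileStore {
        void write(String name, byte[] data);
        byte[] read(String name);
    }

    /** In-memory fake of slow cloud storage that counts remote reads. */
    static final class FakeRemoteStore implements FileStore {
        private final Map<String, byte[]> files = new HashMap<>();
        int remoteReads = 0;

        @Override public void write(String name, byte[] data) {
            files.put(name, data.clone());
        }

        @Override public byte[] read(String name) {
            remoteReads++;
            byte[] data = files.get(name);
            if (data == null) throw new IllegalArgumentException("no such file: " + name);
            return data.clone();
        }
    }

    /** Wrapper that writes through to the delegate and keeps the newest files locally. */
    static final class RecentWriteCache implements FileStore {
        private final FileStore delegate;
        private final LinkedHashMap<String, byte[]> recent;

        RecentWriteCache(FileStore delegate, final int capacity) {
            this.delegate = delegate;
            // Insertion-ordered map: once over capacity, drop the oldest written file.
            this.recent = new LinkedHashMap<String, byte[]>(16, 0.75f, false) {
                @Override protected boolean removeEldestEntry(Map.Entry<String, byte[]> e) {
                    return size() > capacity;
                }
            };
        }

        @Override public void write(String name, byte[] data) {
            delegate.write(name, data);     // remote copy stays authoritative
            recent.put(name, data.clone()); // local copy for upcoming merges
        }

        @Override public byte[] read(String name) {
            byte[] local = recent.get(name);
            return local != null ? local.clone() : delegate.read(name);
        }
    }

    public static void main(String[] args) {
        FakeRemoteStore remote = new FakeRemoteStore();
        FileStore store = new RecentWriteCache(remote, 2);
        store.write("_0.cfs", new byte[] {1, 2});
        store.write("_1.cfs", new byte[] {3});
        // A "merge" that reads both freshly written segments back.
        int merged = store.read("_0.cfs").length + store.read("_1.cfs").length;
        System.out.println("merged bytes=" + merged + ", remote reads=" + remote.remoteReads);
    }
}
```

A real implementation would have to bound the cache by bytes rather than by
file count, spill to local disk, and handle deletes and renames; none of that
is attempted here.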

The read side demands a cache for reasonable read performance.  Solr's HDFS
module includes not just the underlying HdfsDirectory but also
BlockDirectory -- a read cache.  I like that it has no entanglements with
HDFS, so it could be used creatively with, say, NIOFSDirectory on top of the
cloud storage NIO implementations Joel mentions.
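
To illustrate what a block-level read cache does in general, here is a small
JDK-only sketch.  It does not reproduce Solr's BlockDirectory; RangeReader,
ArraySource and CachedReader are hypothetical names, and the block size and
LRU policy are assumptions chosen for brevity.  Reads are rounded to
fixed-size blocks, each block is fetched from the slow source once, and the
least recently used block is evicted when the cache is full.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

final class BlockReadCacheSketch {

    /** Hypothetical stand-in for a slow, random-access remote file. */
    interface RangeReader {
        long length();
        byte[] readRange(long offset, int len);
    }

    /** Array-backed source, used here only to exercise the cache. */
    static final class ArraySource implements RangeReader {
        private final byte[] data;
        ArraySource(byte[] data) { this.data = data.clone(); }
        @Override public long length() { return data.length; }
        @Override public byte[] readRange(long offset, int len) {
            return Arrays.copyOfRange(data, (int) offset, (int) offset + len);
        }
    }

    /** Caches fixed-size blocks of the source with LRU eviction. */
    static final class CachedReader {
        private final RangeReader source;
        private final int blockSize;
        private final LinkedHashMap<Long, byte[]> blocks;
        int misses = 0; // number of block fetches from the source

        CachedReader(RangeReader source, int blockSize, final int maxBlocks) {
            this.source = source;
            this.blockSize = blockSize;
            // Access-ordered map: the eldest entry is the least recently used block.
            this.blocks = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
                @Override protected boolean removeEldestEntry(Map.Entry<Long, byte[]> e) {
                    return size() > maxBlocks;
                }
            };
        }

        byte readByte(long pos) {
            if (pos < 0 || pos >= source.length()) {
                throw new IndexOutOfBoundsException("pos=" + pos);
            }
            long blockIndex = pos / blockSize;
            byte[] block = blocks.get(blockIndex);
            if (block == null) {
                misses++;
                long start = blockIndex * blockSize;
                // The last block of the file may be shorter than blockSize.
                int len = (int) Math.min(blockSize, source.length() - start);
                block = source.readRange(start, len);
                blocks.put(blockIndex, block);
            }
            return block[(int) (pos % blockSize)];
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[10];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        CachedReader reader = new CachedReader(new ArraySource(data), 4, 2);
        reader.readByte(0);
        reader.readByte(3); // same block as position 0, so no second fetch
        System.out.println("block fetches so far: " + reader.misses);
    }
}
```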

It's not clear to me if the HDFS API is better suited than NIO.  There are
certainly a ton of dependencies to deal with for the Hadoop
ecosystem, which is a negative.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Apr 24, 2023 at 5:23 PM Kevin Risden <kris...@apache.org> wrote:

> Solr already supports reading and indexing on cloud storage today - ABFS,
> GCS, and S3 - using the Hadoop HDFS module. I assume the same works with
> HDFS backup/restore as well. I haven't checked if all the supporting
> libraries are included in the shipped Solr distribution, but the HDFS
> filesystem support includes cloud storage. I can't attest to the
> performance, but last I heard it works.
>
>
> https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html
> https://hadoop.apache.org/docs/stable/hadoop-azure/index.html
> https://github.com/GoogleCloudDataproc/hadoop-connectors
>
>
> Kevin Risden
>
>
> On Mon, Apr 24, 2023 at 5:15 PM Joel Bernstein <joels...@gmail.com> wrote:
>
> > As far as a Lucene/Solr directory on cloud storage goes, writes have a lot
> > of overhead per file, hundreds of millis. The read overhead is about half
> > as much. I believe the write is so expensive due to the strong consistency
> > of both GCS and S3. So I think the main bottleneck would be indexing and
> > merging lots of small segments, etc.
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> >
> > On Fri, Apr 21, 2023 at 3:27 AM Ishan Chattopadhyaya <
> > ichattopadhy...@gmail.com> wrote:
> >
> > > My colleague at SearchScale has tried S3FS, and running Solr indexes
> > > off S3. We can chat about it, if you're interested.
> > >
> > > On Fri, 21 Apr, 2023, 10:38 am David Smiley, <dsmi...@apache.org> wrote:
> > >
> > > > Cool!
> > > > I wonder if anyone has tried such things for a Lucene/Solr
> > > > "Directory" as well?
> > > >
> > > > ~ David Smiley
> > > > Apache Lucene/Solr Search Developer
> > > > http://www.linkedin.com/in/davidwsmiley
> > > >
> > > >
> > > > On Mon, Apr 17, 2023 at 1:14 PM Joel Bernstein <joels...@gmail.com> wrote:
> > > >
> > > > > I've been testing Java NIO providers for cloud storage. These two
> > > > > in particular worked for our use cases:
> > > > >
> > > > > https://github.com/googleapis/java-storage-nio
> > > > > https://github.com/carlspring/s3fs-nio
> > > > >
> > > > > I believe an Azure provider is available.
> > > > >
> > > > > We've been working on sponsoring getting the s3 provider into a
> > > > > public maven repo and I can update this thread when that's done.
> > > > >
> > > > >
> > > > >
> > > > > Joel Bernstein
> > > > > http://joelsolr.blogspot.com/
> > > > >
> > > > >
> > > > > On Mon, Apr 10, 2023 at 6:51 PM Ishan Chattopadhyaya <
> > > > > ichattopadhy...@gmail.com> wrote:
> > > > >
> > > > > > Oh thanks, Jan. I had missed it. It is a shame because it looks
> > > > > > like a very neat project.
> > > > > >
> > > > > > On Mon, 10 Apr, 2023, 23:53 Jan Høydahl, <jan....@cominvent.com> wrote:
> > > > > >
> > > > > > > Looks like a nice project. With the promise of low-hanging
> > > > > > > support for more providers than those three for free.
> > > > > > >
> > > > > > > However,
> > > > > > > https://lists.apache.org/thread/w61gzk2ohjtshbwcb5gy6wb2htv7fo0x
> > > > > > > does not look promising - they plan to move the project to the
> > > > > > > Attic, and no new releases have happened during the 6 months
> > > > > > > since the proposal...
> > > > > > >
> > > > > > > Jan
> > > > > > >
> > > > > > > > 10. apr. 2023 kl. 19:08 skrev Ishan Chattopadhyaya <
> > > > > > > ichattopadhy...@gmail.com>:
> > > > > > > >
> > > > > > > > I think we should deprecate both the modules for S3 and GCS,
> > > > > > > > and adopt the Apache jclouds project, which supports all three.
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
