Solr's HdfsDirectory may work over S3 directly if you use the Hadoop AWS
binding, s3a [1]. The idea is to replace hdfs:// with s3a://bucket/. Since
S3 is eventually consistent, the Hadoop AWS s3a project provides S3Guard to
help with consistent listing. If you are only doing queries (no indexing)
with Solr, you may not need to worry about the eventual consistency.
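As a rough sketch of that swap (bucket name and paths below are
placeholders, not anything from this thread), the same HdfsDirectoryFactory
system properties Solr already uses for HDFS can point at an s3a URI,
assuming the hadoop-aws jar, the AWS SDK, and s3a credentials are available
to Solr:

```shell
# Sketch: start SolrCloud with HdfsDirectory backed by s3a instead of hdfs.
# "my-bucket" and the confdir path are placeholders; the Hadoop config dir
# must contain the fs.s3a.* settings (endpoint, credentials provider, etc.).
bin/solr start -c \
  -Dsolr.directoryFactory=HdfsDirectoryFactory \
  -Dsolr.lock.type=hdfs \
  -Dsolr.hdfs.home=s3a://my-bucket/solr \
  -Dsolr.hdfs.confdir=/etc/hadoop/conf
```

The only change from a plain HDFS setup is the solr.hdfs.home URI scheme;
everything else is the standard HdfsDirectoryFactory configuration.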
There was some previous exploration in this area with Solr 6.x/7.x, but it
should be much better with Solr 8.x due to the upgraded Hadoop 3.x
dependency. I haven't done any stress testing of this, but I made sure it
could at least connect: I was able to index and query some small datasets
stored via s3a. Using the HdfsDirectory with s3a will most likely be
slower, as already pointed out. You might get reasonable performance
depending on the nodes used and on tuning the HdfsDirectory block cache.

[1] https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html

Kevin Risden

On Fri, Apr 24, 2020 at 1:19 PM dhurandar S <dhurandarg...@gmail.com> wrote:
> It's 10 PB of source data, but we do have indexes on most of the
> attributes, 80% or so. We need to support data at that scale, and we have
> needle-in-a-haystack use cases. Most of our users are used to a search
> query language or Solr in addition to SQL, so we would have both
> interfaces.
>
> We store the actual data in S3 in Parquet and have Presto query it using
> SQL (Presto is similar to Hive but much, much faster).
>
> We now also want to store the indexes in S3. We have leeway in query
> interactivity performance; the key things here are supporting the
> needle-in-a-haystack pattern and supporting really long-range data more
> cheaply.
>
> regards,
> Rahul
>
>
> On Thu, Apr 23, 2020 at 7:41 PM Walter Underwood <wun...@wunderwood.org>
> wrote:
>
> > It will be a lot more than 2x or 3x slower. Years ago, I accidentally
> > put Solr indexes on an NFS-mounted filesystem and it was 100x slower.
> > S3 would be a lot slower than that.
> >
> > Are you doing relevance-ranked searches on all that data? That is the
> > only reason to use Solr instead of some other solution.
> >
> > I'd use Apache Hive, or whatever has replaced it. That is what Facebook
> > wrote to do searches on their multi-petabyte logs.
> >
> > https://hive.apache.org
> >
> > More options:
> >
> > https://jethro.io/hadoop-hive
> > https://mapr.com/why-hadoop/sql-hadoop/sql-hadoop-details/
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/ (my blog)
> >
> > > On Apr 23, 2020, at 7:29 PM, Christopher Schultz <
> > > ch...@christopherschultz.net> wrote:
> > >
> > > Rahul,
> > >
> > > On 4/23/20 21:49, dhurandar S wrote:
> > >> Thank you for your reply. The reason we are looking at S3 is that
> > >> the volume is close to 10 petabytes. We are okay with higher
> > >> latency, say twice or three times that of placing data on local
> > >> disk, but we have a requirement to keep long-range data and provide
> > >> search capability on it. Every other storage option apart from S3
> > >> turned out to be very expensive at that scale.
> > >>
> > >> Basically I want to replace
> > >>
> > >> -Dsolr.directoryFactory=HdfsDirectoryFactory \
> > >>
> > >> with an S3-based implementation.
> > >
> > > Can you clarify whether you have 10 PiB of /source data/ or 10 PiB of
> > > /index data/?
> > >
> > > You can theoretically store your source data anywhere, of course. 10
> > > PiB sounds like a truly enormous index.
> > >
> > > -chris
> > >
> > >> On Thu, Apr 23, 2020 at 3:12 AM Jan Høydahl <jan....@cominvent.com>
> > >> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> Is your data partitioned in a way that makes it sensible to split
> > >>> it into multiple collections, with some arrangement that keeps
> > >>> only a few collections live at a time, loading index files from
> > >>> S3 on demand?
> > >>>
> > >>> I cannot see how an S3 directory would be able to effectively
> > >>> cache files in S3, or in what units the index files would be
> > >>> stored.
> > >>>
> > >>> Have you investigated EFS as an alternative?
> > >>> That would look like a normal filesystem to Solr and might be
> > >>> cheaper storage-wise, but much slower.
> > >>>
> > >>> Jan
> > >>>
> > >>>> 23. apr. 2020 kl. 06:57 skrev dhurandar S
> > >>>> <dhurandarg...@gmail.com>:
> > >>>>
> > >>>> Hi,
> > >>>>
> > >>>> I am looking to use S3 as the place to store indexes, just as
> > >>>> Solr uses HdfsDirectory to store the index and all the other
> > >>>> documents.
> > >>>>
> > >>>> We want to provide a search capability that is okay to be a
> > >>>> little slow but cheaper in terms of cost. We have close to 2
> > >>>> petabytes of data on which we want to provide search using Solr.
> > >>>>
> > >>>> Are there any open-source implementations of using S3 as the
> > >>>> Directory for Solr?
> > >>>>
> > >>>> Any recommendations on this approach?
> > >>>>
> > >>>> regards, Rahul