It's 10 PB of source data, but we do have indexes on most of the
attributes (80% or so).
We need to support data at this scale, and our use cases are
needle-in-a-haystack searches.
Most of our users are used to a search query language (Solr) in addition
to SQL, so we would offer both interfaces.

We store the actual data in S3 as Parquet and have Presto query it using
SQL (Presto is similar to Hive but much, much faster).
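
To make the SQL side concrete, here is a rough sketch of how a
needle-in-a-haystack lookup could be phrased for Presto. It only builds
the query string. The table name (events), the lookup column
(request_id) and the dt date partition column are hypothetical, not our
real schema, and the helper needle_query is made up for illustration.

```python
from datetime import date


def needle_query(table, column, value, start, end):
    """Build a Presto SQL string for a point lookup limited to a date range.

    Assumption: the Parquet-backed table is partitioned by a date column
    named `dt` (hypothetical), so the range predicate can prune partitions.
    """
    # Double any single quotes so the string literal stays well-formed SQL.
    safe = value.replace("'", "''")
    return (
        f"SELECT * FROM {table} "
        f"WHERE dt BETWEEN DATE '{start.isoformat()}' "
        f"AND DATE '{end.isoformat()}' "
        f"AND {column} = '{safe}'"
    )


sql = needle_query(
    "events", "request_id", "abc-123", date(2020, 1, 1), date(2020, 1, 31)
)
print(sql)
```

Restricting the scan by the partition column is what keeps a lookup like
this affordable without an index; the index in S3 is meant for the cases
where no such restriction is available.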

We now also want to store the indexes in S3. We have leeway on
interactive query performance; the key requirements are supporting the
needle-in-a-haystack pattern and retaining really long-range data more
cheaply.

regards,
Rahul


On Thu, Apr 23, 2020 at 7:41 PM Walter Underwood <wun...@wunderwood.org>
wrote:

> It will be a lot more than 2X or 3X slower. Years ago, I accidentally put
> Solr indexes on an NFS mounted filesystem and it was 100X slower. S3 would
> be a lot slower than that.
>
> Are you doing relevance-ranked searches on all that data? That is the only
> reason to use Solr instead of some other solution.
>
> I’d use Apache Hive, or whatever has replaced it. That is what Facebook
> wrote to do searches on their multi-petabyte logs.
>
> https://hive.apache.org
>
> More options.
>
> https://jethro.io/hadoop-hive
> https://mapr.com/why-hadoop/sql-hadoop/sql-hadoop-details/
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Apr 23, 2020, at 7:29 PM, Christopher Schultz <
> ch...@christopherschultz.net> wrote:
> >
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA256
> >
> > Rahul,
> >
> > On 4/23/20 21:49, dhurandar S wrote:
> >> Thank you for your reply. The reason we are looking for S3 is since
> >> the volume is close to 10 Petabytes. We are okay to have higher
> >> latency of say twice or thrice that of placing data on the local
> >> disk. But we have a requirement to have long-range data and
> >> providing Search capability on that.  Every other storage apart from
> >> S3 turned out to be very expensive at that scale.
> >>
> >> Basically I want to replace
> >>
> >> -Dsolr.directoryFactory=HdfsDirectoryFactory \
> >>
> >> with S3 based implementation.
> >
> > Can you clarify whether you have 10 PiB of /source data/ or 10 PiB of
> > /index data/?
> >
> > You can theoretically store your source data anywhere, of course. 10
> > PiB sounds like a truly enormous index.
> >
> > - -chris
> >
> >> On Thu, Apr 23, 2020 at 3:12 AM Jan Høydahl <jan....@cominvent.com>
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> Is your data so partitioned that it makes sense to consider
> >>> splitting up in multiple collections and make some arrangement
> >>> that will keep only a few collections live at a time, loading
> >>> index files from S3 on demand?
> >>>
> >>> I cannot see how an S3 directory would be able to effectively
> >>> cache files in S3 and what units the index files would be stored
> >>> as?
> >>>
> >>> Have you investigated EFS as an alternative? That would look like
> >>> a normal filesystem to Solr but might be cheaper storage wise,
> >>> but much slower.
> >>>
> >>> Jan
> >>>
> >>>> 23. apr. 2020 kl. 06:57 skrev dhurandar S
> >>>> <dhurandarg...@gmail.com>:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I am looking to use S3 as the place to store indexes. Just how
> >>>> Solr uses HdfsDirectory to store the index and all the other
> >>>> documents.
> >>>>
> >>>> We want to provide a search capability that is okay to be a
> >>>> little slow
> >>> but
> >>>> cheaper in terms of the cost. We have close to 2 petabytes of
> >>>> data on
> >>> which
> >>>> we want to provide the Search using Solr.
> >>>>
> >>>> Are there any open-source implementations around using S3 as
> >>>> the
> >>> Directory
> >>>> for Solr ??
> >>>>
> >>>> Any recommendations on this approach?
> >>>>
> >>>> regards, Rahul
> >>>
> >>>
> >>
> > -----BEGIN PGP SIGNATURE-----
> > Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/
> >
> > iQIyBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAl6iTwUACgkQHPApP6U8
> > pFjRaw/4sGbH286gZJe+wfKsLc4JPvyJZjjwVDCdpiR2SHt50IA23wYSK97R6xRj
> > dbWWReA7C3JNWp6x21i8Bb6sIeLDnotbc7IOSmOMuNep1BtVaYBMJ8wyW6uUtXf6
> > hQbY0Ew93ZhDlS9CWMJqbQtWfrQEqH51Xbz+4uqqvJU8Bq9o9Vv0rnuVp/5f73lV
> > ihek0sbA73oGle0gC5NFmrKItnn+14X8vIxUC8JRZlY4rDSiOdOcIil3DExxOQNQ
> > UodIvwKKhzALFY77PeGSSjKiy0X3JJ1rKzLeIBrW0JCNMprYLzL2CQjZ5F09MraZ
> > WxXdA64lEg2diEwHywNrsaaygbEZYTWd8gaeGA7kzCk78Y2KuhWuEQej6KmE3Iq2
> > AW+K7JgFakUpzB5oorCtKNLQOqFHX85ne57gCYKr42S3Htfxmf98pBdudQy4RvuT
> > +tJvGYx8NLqgeOoZN4u+G/8WunlzUC+u2vUxVcIoK3Ozz0usMioFDqn69vmOxxoH
> > cN2Y4T1ZZZGtndiAGZww1JXKAbVN0U41isXg2F8tHQV9dxaeoYDQ/xYbAoWEhhlM
> > SVtEdr76eMJ08T6h5711gtrhSK+RQFPD2Jbr8B/Xl063xPfN2TpqmcJCKXkucvpc
> > CEDLFqeKX6qIRZDgMf8EICmbFl6aF5knbDP0MkyYk4urB+uFaw==
> > =Y/6Y
> > -----END PGP SIGNATURE-----
>
>
