1. What tool they use to run Solr as a service on Windows.
>> Look into procrun. After all, Solr runs inside Jetty, so you should
have a way to invoke Jetty's Main class with the required parameters and
bundle that up as a procrun service.
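A rough sketch of what that registration could look like (service name,
paths, and solr.home are placeholders; verify each flag against the
Commons Daemon documentation, and in practice you would also mirror the
other JVM flags that bin\solr.cmd normally passes):

    prunsrv.exe //IS//Solr ^
      --DisplayName="Apache Solr 7.7" ^
      --Description="Solr running in Jetty as a Windows service" ^
      --Jvm=auto ^
      --StartMode=jvm ^
      --StartPath=C:\solr\server ^
      --Classpath=C:\solr\server\start.jar ^
      --StartClass=org.eclipse.jetty.start.Main ^
      ++JvmOptions=-Dsolr.solr.home=C:\solr\server\solr ^
      --LogPath=C:\solr\server\logs ^
      --StdOutput=auto --StdError=auto

Stopping cleanly needs extra wiring (e.g. Jetty's STOP.PORT/STOP.KEY via
the --Stop* options), otherwise the service manager just kills the JVM.
NSSM, which Manisha mentions below, simply wraps bin\solr.cmd and is the
lower-effort option.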
2. How to set up the disaster recovery?
>> You can back up your indexes at regular intervals. This can be done by
taking snapshots and backing them up, and then using the appropriate
snapshot name to restore a particular commit point. For more details,
please refer to this link:
https://lucene.apache.org/solr/guide/7_7/making-and-restoring-backups.html
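For illustration, in standalone mode the snapshots go through each core's
replication handler; a minimal sketch (the core name "mycore", snapshot
name "nightly", and backup location are placeholders for your setup):

    # take a named snapshot of the index
    curl "http://localhost:8983/solr/mycore/replication?command=backup&name=nightly&location=C:\solr\backups"

    # later, restore that snapshot (brings back the commit point it captured)
    curl "http://localhost:8983/solr/mycore/replication?command=restore&name=nightly&location=C:\solr\backups"

    # both commands run asynchronously -- poll for completion
    curl "http://localhost:8983/solr/mycore/replication?command=details"
    curl "http://localhost:8983/solr/mycore/replication?command=restorestatus"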
3. How to scale up the servers for better performance?
>> This is too open-ended a question and depends on a lot of factors
specific to your environment and use case :)

- Rahul

On Tue, Oct 6, 2020 at 4:26 PM Manisha Rahatadkar <
manisha.rahatad...@anjusoftware.com> wrote:

> Hi All
>
> First of all, thanks to Shawn, Rahul and Charlie for taking the time to
> reply to my questions with valuable information.
>
> I was very concerned about the size of each document, and on several
> follow-ups got more information that the documents which are 0.5GB in
> size are mp4 documents, and these are not synced to Solr.
>
> @Shawn Heisey recommended NOT to use Windows because of the Windows
> license cost, and because service installer testing is done on Linux.
> I agree with him. We are using the NSSM tool to run Solr as a service.
>
> Are there any members here using Solr on Windows? I look forward to
> hearing from them on:
>
> 1. What tool they use to run Solr as a service on Windows.
> 2. How to set up the disaster recovery?
> 3. How to scale up the servers for better performance?
>
> Thanks in advance, and I look forward to hearing about your experiences
> with Solr scale-up.
>
> Regards,
> Manisha Rahatadkar
>
> -----Original Message-----
> From: Rahul Goswami <rahul196...@gmail.com>
> Sent: Sunday, October 4, 2020 11:49 PM
> To: ch...@opensourceconnections.com; solr-user@lucene.apache.org
> Subject: Re: Solr 7.7 - Few Questions
>
> Charlie,
> Thanks for providing an alternate approach to doing this. It would be
> interesting to know how one could go about organizing the docs in this
> case? (Nested documents?) How would join queries perform on a large
> index (200+ million docs)?
>
> Thanks,
> Rahul
>
> On Fri, Oct 2, 2020 at 5:55 AM Charlie Hull <char...@flax.co.uk> wrote:
>
> > Hi Rahul,
> >
> > In addition to the wise advice below: remember that in Solr, a
> > 'document' is just the name for the thing that would appear as one of
> > the results when you search (analogous to a database record). It's not
> > the same conceptually as a 'Word document' or a 'PDF document'. If your
> > source documents are so big, consider how they might be broken into
> > parts, whether you really need to index all of them for retrieval
> > purposes, or what parts of them need to be extracted as text. Thus, the
> > Solr documents don't necessarily need to be as large as your source
> > documents.
> >
> > Consider an email of size 20KB with ten PDF attachments, each 20MB. You
> > probably shouldn't push all this data into a single Solr document, but
> > you *could* index them as 11 separate Solr documents, with metadata to
> > indicate that one is an email and ten are PDFs, and a shared ID of some
> > kind to indicate they're related. Then at query time there are various
> > ways for you to group these together, so for example if the query hit
> > one of the PDFs you could show the user the original email, plus the 9
> > other attachments, using the shared ID as a key.
> >
> > HTH,
> >
> > Charlie
> >
> > On 02/10/2020 01:53, Rahul Goswami wrote:
> > > Manisha,
> > > In addition to what Shawn has mentioned above, I would also like you
> > > to reevaluate your use case. Do you *need* to index the whole
> > > document? E.g. if it's an email, the body of the email *might* be
> > > more important than any attachments, in which case you could choose
> > > to only index the email body and ignore (or only partially index)
> > > the text from attachments. If you can afford to index the documents
> > > partially, you could consider Solr's "Limit token count filter"; see
> > > the link below.
> > >
> > > https://lucene.apache.org/solr/guide/7_7/filter-descriptions.html#limit-token-count-filter
> > >
> > > You'll need to configure it in the schema on the "index" analyzer of
> > > the data type of the field with large text.
> > > Indexing documents on the order of half a GB will definitely come
> > > back to hurt your operations, if not now, then later (think OOM,
> > > extremely slow atomic updates, long-running merges, etc.).
> > >
> > > - Rahul
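> > > To make that concrete, a minimal sketch of such a field type (the
> > > type name, tokenizer choice, and the 10000-token cap are only
> > > placeholders to illustrate the idea; tune maxTokenCount to your
> > > own data):
> > >
> > >     <fieldType name="text_limited" class="solr.TextField"
> > >                positionIncrementGap="100">
> > >       <analyzer type="index">
> > >         <tokenizer class="solr.StandardTokenizerFactory"/>
> > >         <!-- index only the first 10000 tokens; drop the rest -->
> > >         <filter class="solr.LimitTokenCountFilterFactory"
> > >                 maxTokenCount="10000"/>
> > >         <filter class="solr.LowerCaseFilterFactory"/>
> > >       </analyzer>
> > >       <analyzer type="query">
> > >         <tokenizer class="solr.StandardTokenizerFactory"/>
> > >         <filter class="solr.LowerCaseFilterFactory"/>
> > >       </analyzer>
> > >     </fieldType>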
> > >
> > > On Thu, Oct 1, 2020 at 7:06 PM Shawn Heisey <apa...@elyograg.org> wrote:
> > >
> > >> On 10/1/2020 6:57 AM, Manisha Rahatadkar wrote:
> > >>> We are using Apache Solr 7.7 on the Windows platform. The data is
> > >>> synced to Solr using a Solr.Net commit. The data is being synced to
> > >>> SOLR in batches. The document size is very huge (~0.5GB average)
> > >>> and Solr indexing is taking a long time. Total document size is
> > >>> ~200GB. As the Solr commit is done as part of the API, the API
> > >>> calls are failing as document indexing is not completed.
> > >>
> > >> A single document is five hundred megabytes? What kind of documents
> > >> do you have? You can't even index something that big without
> > >> tweaking configuration parameters that most people don't even know
> > >> about. Assuming you can even get it working, there's no way that
> > >> indexing a document like that is going to be fast.
> > >>
> > >>> 1. What is your advice on syncing such a large volume of data to
> > >>> Solr KB.
> > >>
> > >> What is "KB"? I have never heard of this in relation to Solr.
> > >>
> > >>> 2. Because of the search requirements, almost 8 fields are defined
> > >>> as Text fields.
> > >>
> > >> I can't figure out what you are trying to say with this statement.
> > >>
> > >>> 3. Currently SOLR_JAVA_MEM is set to 2gb. Is that enough for such
> > >>> a large volume of data?
> > >>
> > >> If just one of the documents you're sending to Solr really is five
> > >> hundred megabytes, then 2 gigabytes would probably be just barely
> > >> enough to index one document into an empty index ... and it would
> > >> probably be doing garbage collection so frequently that it would
> > >> make things REALLY slow. I have no way to predict how much heap you
> > >> will need. That will require experimentation. I can tell you that
> > >> 2GB is definitely not enough.
> > >>
> > >>> 4. How to set up Solr in production on Windows? Currently it's set
> > >>> up as a standalone engine, and the client is asked to take a backup
> > >>> of the drive. Is there any better way to do this? How to set up for
> > >>> disaster recovery?
> > >>
> > >> I would suggest NOT doing it on Windows. My reasons for that come
> > >> down to costs -- a Windows Server license isn't cheap.
> > >>
> > >> That said, there's nothing wrong with running on Windows, but you're
> > >> on your own as far as running it as a service. We only have a
> > >> service installer for UNIX-type systems. Most of the testing for
> > >> that is done on Linux.
> > >>
> > >>> 5. How to benchmark the system requirements for such huge data
> > >>
> > >> I do not know what all your needs are, so I have no way to answer
> > >> this. You're going to know a lot more about it than any of us do.
> > >>
> > >> Thanks,
> > >> Shawn
> >
> > --
> > Charlie Hull
> > OpenSource Connections, previously Flax
> >
> > tel/fax: +44 (0)8700 118334
> > mobile: +44 (0)7767 825828
> > web: www.o19s.com
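PS: On the SOLR_JAVA_MEM question in the quoted thread above -- on Windows
the heap is configured in bin\solr.in.cmd (the 8g below is only a
placeholder; as Shawn says, the right value takes experimentation with
your own data):

    rem bin\solr.in.cmd: fixed min/max heap so the JVM doesn't resize it
    set SOLR_JAVA_MEM=-Xms8g -Xmx8g

For a one-off run you can instead pass the size on the command line, e.g.
"bin\solr.cmd start -m 8g".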