1. What tool they use to run Solr as a service on Windows.
>> Look into procrun. After all, Solr runs inside Jetty, so you should
have a way to invoke Jetty's Main class with the required parameters and
bundle that up as a procrun service.
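A rough sketch of what that registration could look like (service name,
paths, and solr.home are placeholders; verify each flag against the
Commons Daemon documentation, and in practice you would also mirror the
other JVM flags that bin\solr.cmd normally passes):

    prunsrv.exe //IS//Solr ^
      --DisplayName="Apache Solr 7.7" ^
      --Description="Solr running in Jetty as a Windows service" ^
      --Jvm=auto ^
      --StartMode=jvm ^
      --StartPath=C:\solr\server ^
      --Classpath=C:\solr\server\start.jar ^
      --StartClass=org.eclipse.jetty.start.Main ^
      ++JvmOptions=-Dsolr.solr.home=C:\solr\server\solr ^
      --LogPath=C:\solr\server\logs ^
      --StdOutput=auto --StdError=auto

Stopping cleanly needs extra wiring (e.g. Jetty's STOP.PORT/STOP.KEY via
the --Stop* options), otherwise the service manager just kills the JVM.
NSSM, which Manisha mentions below, simply wraps bin\solr.cmd and is the
lower-effort option.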
2. How to set up the disaster recovery?
>> You can back up your indexes at regular intervals. This can be done by
taking snapshots and backing them up, and then using the appropriate
snapshot name to restore a particular commit point. For more details,
please refer to this link:
https://lucene.apache.org/solr/guide/7_7/making-and-restoring-backups.html
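For illustration, in standalone mode the snapshots go through each core's
replication handler; a minimal sketch (the core name "mycore", snapshot
name "nightly", and backup location are placeholders for your setup):

    # take a named snapshot of the index
    curl "http://localhost:8983/solr/mycore/replication?command=backup&name=nightly&location=C:\solr\backups"

    # later, restore that snapshot (brings back the commit point it captured)
    curl "http://localhost:8983/solr/mycore/replication?command=restore&name=nightly&location=C:\solr\backups"

    # both commands run asynchronously -- poll for completion
    curl "http://localhost:8983/solr/mycore/replication?command=details"
    curl "http://localhost:8983/solr/mycore/replication?command=restorestatus"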
3. How to scale up the servers for better performance?
>> This is too open-ended a question and depends on a lot of factors
specific to your environment and use case :)

- Rahul

On Tue, Oct 6, 2020 at 4:26 PM Manisha Rahatadkar <
manisha.rahatad...@anjusoftware.com> wrote:

> Hi All
>
> First of all, thanks to Shawn, Rahul and Charlie for taking the time to
> reply to my questions with valuable information.
>
> I was very concerned about the size of each document, and on several
> follow-ups got more information that the documents which are 0.5GB in
> size are mp4 documents, and these are not synced to Solr.
>
> @Shawn Heisey recommended NOT to use Windows because of the Windows
> license cost, and because service installer testing is done on Linux.
> I agree with him. We are using the NSSM tool to run Solr as a service.
>
> Are there any members here using Solr on Windows? I look forward to
> hearing from them on:
>
> 1. What tool they use to run Solr as a service on Windows.
> 2. How to set up the disaster recovery?
> 3. How to scale up the servers for better performance?
>
> Thanks in advance, and I look forward to hearing about your experiences
> with Solr scale-up.
>
> Regards,
> Manisha Rahatadkar
>
> -----Original Message-----
> From: Rahul Goswami <rahul196...@gmail.com>
> Sent: Sunday, October 4, 2020 11:49 PM
> To: ch...@opensourceconnections.com; solr-user@lucene.apache.org
> Subject: Re: Solr 7.7 - Few Questions
>
> Charlie,
> Thanks for providing an alternate approach to doing this. It would be
> interesting to know how one could go about organizing the docs in this
> case? (Nested documents?) How would join queries perform on a large
> index (200+ million docs)?
>
> Thanks,
> Rahul
>
> On Fri, Oct 2, 2020 at 5:55 AM Charlie Hull <char...@flax.co.uk> wrote:
>
> > Hi Rahul,
> >
> > In addition to the wise advice below: remember that in Solr, a
> > 'document' is just the name for the thing that would appear as one of
> > the results when you search (analogous to a database record). It's not
> > the same conceptually as a 'Word document' or a 'PDF document'. If your
> > source documents are so big, consider how they might be broken into
> > parts, whether you really need to index all of them for retrieval
> > purposes, or what parts of them need to be extracted as text. Thus, the
> > Solr documents don't necessarily need to be as large as your source
> > documents.
> >
> > Consider an email of size 20KB with ten PDF attachments, each 20MB. You
> > probably shouldn't push all this data into a single Solr document, but
> > you *could* index them as 11 separate Solr documents, with metadata to
> > indicate that one is an email and ten are PDFs, and a shared ID of some
> > kind to indicate they're related. Then at query time there are various
> > ways for you to group these together, so for example if the query hit
> > one of the PDFs you could show the user the original email, plus the 9
> > other attachments, using the shared ID as a key.
> >
> > HTH,
> >
> > Charlie
> >
> > On 02/10/2020 01:53, Rahul Goswami wrote:
> > > Manisha,
> > > In addition to what Shawn has mentioned above, I would also like you
> > > to reevaluate your use case. Do you *need* to index the whole
> > > document? E.g. if it's an email, the body of the email *might* be
> > > more important than any attachments, in which case you could choose
> > > to only index the email body and ignore (or only partially index)
> > > the text from attachments. If you can afford to index the documents
> > > partially, you could consider Solr's "Limit token count filter"; see
> > > the link below.
> > >
> > > https://lucene.apache.org/solr/guide/7_7/filter-descriptions.html#limit-token-count-filter
> > >
> > > You'll need to configure it in the schema on the "index" analyzer of
> > > the data type of the field with large text.
> > > Indexing documents on the order of half a GB will definitely come
> > > back to hurt your operations, if not now, then later (think OOM,
> > > extremely slow atomic updates, long-running merges, etc.).
> > >
> > > - Rahul
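> > > To make that concrete, a minimal sketch of such a field type (the
> > > type name, tokenizer choice, and the 10000-token cap are only
> > > placeholders to illustrate the idea; tune maxTokenCount to your
> > > own data):
> > >
> > >     <fieldType name="text_limited" class="solr.TextField"
> > >                positionIncrementGap="100">
> > >       <analyzer type="index">
> > >         <tokenizer class="solr.StandardTokenizerFactory"/>
> > >         <!-- index only the first 10000 tokens; drop the rest -->
> > >         <filter class="solr.LimitTokenCountFilterFactory"
> > >                 maxTokenCount="10000"/>
> > >         <filter class="solr.LowerCaseFilterFactory"/>
> > >       </analyzer>
> > >       <analyzer type="query">
> > >         <tokenizer class="solr.StandardTokenizerFactory"/>
> > >         <filter class="solr.LowerCaseFilterFactory"/>
> > >       </analyzer>
> > >     </fieldType>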
> > >
> > > On Thu, Oct 1, 2020 at 7:06 PM Shawn Heisey <apa...@elyograg.org> wrote:
> > >
> > >> On 10/1/2020 6:57 AM, Manisha Rahatadkar wrote:
> > >>> We are using Apache Solr 7.7 on the Windows platform. The data is
> > >>> synced to Solr using a Solr.Net commit. The data is being synced to
> > >>> SOLR in batches. The document size is very huge (~0.5GB average)
> > >>> and Solr indexing is taking a long time. Total document size is
> > >>> ~200GB. As the Solr commit is done as part of the API, the API
> > >>> calls are failing as document indexing is not completed.
> > >>
> > >> A single document is five hundred megabytes? What kind of documents
> > >> do you have? You can't even index something that big without
> > >> tweaking configuration parameters that most people don't even know
> > >> about. Assuming you can even get it working, there's no way that
> > >> indexing a document like that is going to be fast.
> > >>
> > >>> 1. What is your advice on syncing such a large volume of data to
> > >>> Solr KB.
> > >>
> > >> What is "KB"? I have never heard of this in relation to Solr.
> > >>
> > >>> 2. Because of the search requirements, almost 8 fields are defined
> > >>> as Text fields.
> > >>
> > >> I can't figure out what you are trying to say with this statement.
> > >>
> > >>> 3. Currently SOLR_JAVA_MEM is set to 2gb. Is that enough for such
> > >>> a large volume of data?
> > >>
> > >> If just one of the documents you're sending to Solr really is five
> > >> hundred megabytes, then 2 gigabytes would probably be just barely
> > >> enough to index one document into an empty index ... and it would
> > >> probably be doing garbage collection so frequently that it would
> > >> make things REALLY slow. I have no way to predict how much heap you
> > >> will need. That will require experimentation. I can tell you that
> > >> 2GB is definitely not enough.
> > >>
> > >>> 4. How to set up Solr in production on Windows? Currently it's set
> > >>> up as a standalone engine, and the client is asked to take a backup
> > >>> of the drive. Is there any better way to do this? How to set up for
> > >>> disaster recovery?
> > >>
> > >> I would suggest NOT doing it on Windows. My reasons for that come
> > >> down to costs -- a Windows Server license isn't cheap.
> > >>
> > >> That said, there's nothing wrong with running on Windows, but you're
> > >> on your own as far as running it as a service. We only have a
> > >> service installer for UNIX-type systems. Most of the testing for
> > >> that is done on Linux.
> > >>
> > >>> 5. How to benchmark the system requirements for such huge data
> > >>
> > >> I do not know what all your needs are, so I have no way to answer
> > >> this. You're going to know a lot more about it than any of us do.
> > >>
> > >> Thanks,
> > >> Shawn
> >
> > --
> > Charlie Hull
> > OpenSource Connections, previously Flax
> >
> > tel/fax: +44 (0)8700 118334
> > mobile: +44 (0)7767 825828
> > web: www.o19s.com
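PS: On the SOLR_JAVA_MEM question in the quoted thread above -- on Windows
the heap is configured in bin\solr.in.cmd (the 8g below is only a
placeholder; as Shawn says, the right value takes experimentation with
your own data):

    rem bin\solr.in.cmd: fixed min/max heap so the JVM doesn't resize it
    set SOLR_JAVA_MEM=-Xms8g -Xmx8g

For a one-off run you can instead pass the size on the command line, e.g.
"bin\solr.cmd start -m 8g".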