RE: Sql entity processor sortedmapbackedcache out of memory issue

2020-10-02 Thread Srinivas Kashyap
Hi Shawn,

Continuing with the older thread, I have implemented a WHERE clause on the inner
child entity. When the import runs, does it bring only the records matching the
WHERE condition into JVM memory, or does it pull the entire result of the SQL
(with the joined tables) into the JVM and apply the WHERE filter in memory?

Also, I have written a custom Java 'onImportEnd' event listener. Can I call the
destroy() method of the SortedMapBackedCache class from this event listener to
remove the cached entities? This is required because every import brings some
entities that are new and were not present in the previous run of the DIH cache.
My assumption is that calling destroy() would free up the JVM memory and avoid
the OOM.


Also, is there a way I can force garbage collection to run on the DIHCache every
time an import finishes on a core?
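
For reference, the listener I have in mind looks roughly like this (the class
name is my own placeholder; as far as I can tell the Context API doesn't expose
the SortedMapBackedCache instance, so I don't yet see where the destroy() call
would go, and an explicit System.gc() is only a hint to the JVM):

package com.example.dih;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.EventListener;

/**
 * Registered in data-config.xml as:
 *   <document onImportEnd="com.example.dih.ImportEndListener"> ... </document>
 */
public class ImportEndListener implements EventListener {

    @Override
    public void onEvent(Context ctx) {
        // This is where I would like to call destroy() on the DIH cache,
        // but I don't see a public way to reach the SortedMapBackedCache
        // instance from the Context.

        // Request a GC after the import finishes; this is only a hint and
        // the JVM may ignore it.
        System.gc();
    }
}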

P.S.: Ours is a standalone Solr server with 18 cores. Each core is kept in sync
by running a full-import on SortedMapBackedCache entities, with a WHERE clause
based on timestamp (last index time) on the child entities.

-Original Message-
From: Shawn Heisey 
Sent: 09 April 2019 13:27
To: solr-user@lucene.apache.org
Subject: Re: Sql entity processor sortedmapbackedcache out of memory issue

On 4/8/2019 11:47 PM, Srinivas Kashyap wrote:
> I'm using DIH to index the data and the structure of the DIH is like below 
> for solr core:
>
> 
> 16 child entities
> 
>
> During indexing, the number of requests being made to the database was high
> (17 queries to process one document), which used up most of the database
> connections and thereby blocked our web application.

If you have 17 entities, then one document will indeed take 17 queries.
That's the nature of multiple DIH entities.

> To tackle it, we implemented SortedMapBackedCache via the cacheImpl parameter
> to reduce the number of requests to the database.

When you use SortedMapBackedCache on an entity, you are asking Solr to store 
the results of the entire query in memory, even if you don't need all of the 
results.  If the database has a lot of rows, that's going to take a lot of 
memory.

In your excerpt from the config, your inner entity doesn't have a WHERE clause,
which means it's going to retrieve all of the rows of the ABC table for *EVERY*
single entry in the DEF table.  That's going to be exceptionally slow.  Normally
the SQL query on inner entities will have some kind of WHERE clause that limits
the results to rows that match the entry from the outer entity.

You may need to write a custom indexing program that runs separately from Solr, 
possibly on an entirely different server.  That might be a lot more efficient 
than DIH.
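
As a very rough sketch of what I mean, using SolrJ and plain JDBC (the JDBC URL,
the SQL, and the field names below are placeholders, since I don't know your
schema):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ExternalIndexer {
  public static void main(String[] args) throws Exception {
    // Placeholders -- use your real JDBC URL, credentials, and a single
    // joined query instead of one query per entity.
    String jdbcUrl = "jdbc:...";
    String sql = "SELECT d.ID, d.NAME, c.VALUE FROM DEF d JOIN ABC c ON c.DEF_ID = d.ID";

    try (Connection conn = DriverManager.getConnection(jdbcUrl, "user", "pass");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(sql);
         SolrClient solr = new HttpSolrClient.Builder(
             "http://localhost:8983/solr/yourcore").build()) {

      // NOTE: this sketch makes one Solr document per joined row.  If the
      // child rows should become multi-valued fields on one parent document
      // (as DIH does), ORDER BY the parent key and accumulate the rows for a
      // parent before adding its document.
      List<SolrInputDocument> batch = new ArrayList<>();
      while (rs.next()) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", rs.getString("ID"));
        doc.addField("name", rs.getString("NAME"));
        doc.addField("value", rs.getString("VALUE"));
        batch.add(doc);

        if (batch.size() == 1000) {   // send in batches to keep memory flat
          solr.add(batch);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        solr.add(batch);
      }
      solr.commit();
    }
  }
}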

Thanks,
Shawn



Re: Solr 7.7 - Few Questions

2020-10-02 Thread Charlie Hull

Hi Rahul,

In addition to the wise advice below: remember that in Solr, a 'document' is 
just the name for the thing that would appear as one of the results when 
you search (analogous to a database record). It's not the same 
conceptually as a 'Word document' or a 'PDF document'. If your source 
documents are so big, consider how they might be broken into parts, or 
whether you really need to index all of them for retrieval purposes, or 
what parts of them need to be extracted as text. Thus, the Solr 
documents don't necessarily need to be as large as your source documents.


Consider an email size 20kb with ten PDF attachments, each 20MB. You 
probably shouldn't push all this data into a single Solr document; instead, 
you *could* index them as 11 separate Solr documents, with metadata 
to indicate that one is an email and ten are PDFs, and a shared ID of 
some kind to indicate they're related. Then at query time there are 
various ways for you to group these together, so for example if the 
query hit one of the PDFs you could show the user the original email, 
plus the 9 other attachments, using the shared ID as a key.
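
A very rough SolrJ sketch of that idea (the field names are made up, and
grouping is only one option; "thread_id" would need to be a single-valued
string field, and you could equally fetch the family with a second query on
that field):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class EmailIndexingSketch {
  public static void main(String[] args) throws Exception {
    try (SolrClient solr = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/mycore").build()) {

      String threadId = "email-12345";   // shared key linking email + attachments
      List<SolrInputDocument> docs = new ArrayList<>();

      // One Solr document for the email body
      SolrInputDocument email = new SolrInputDocument();
      email.addField("id", threadId);
      email.addField("doc_type", "email");
      email.addField("thread_id", threadId);
      email.addField("content", "...email body text...");
      docs.add(email);

      // One Solr document per attachment, sharing the same thread_id
      for (int i = 1; i <= 10; i++) {
        SolrInputDocument att = new SolrInputDocument();
        att.addField("id", threadId + "-att-" + i);
        att.addField("doc_type", "attachment");
        att.addField("thread_id", threadId);
        att.addField("content", "...extracted PDF text...");
        docs.add(att);
      }

      solr.add(docs);
      solr.commit();

      // At query time, group results so a hit on any part brings back the family
      SolrQuery q = new SolrQuery("content:invoice");
      q.set("group", "true");
      q.set("group.field", "thread_id");
      System.out.println(solr.query(q));
    }
  }
}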


HTH,

Charlie

On 02/10/2020 01:53, Rahul Goswami wrote:

Manisha,
In addition to what Shawn has mentioned above, I would also like you to
reevaluate your use case. Do you *need to* index the whole document? For example:
If it's an email, the body of the email *might* be more important than any
attachments, in which case you could choose to only index the email body
and ignore (or only partially index) the text from attachments. If you
could afford to index the documents partially, you could consider Solr's
"Limit token count filter": See the link below.

https://lucene.apache.org/solr/guide/7_7/filter-descriptions.html#limit-token-count-filter

You'll need to configure it in the schema, in the "index" analyzer of the field
type used by the field with the large text.
Indexing documents on the order of half a GB will definitely come back to hurt
your operations, if not now then later (think OOM, extremely slow atomic
updates, long-running merges, etc.).
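
If changing the schema isn't convenient right away, roughly the same effect can
be had on the client side by trimming the text before it is sent to Solr. A
rough SolrJ sketch (the "content" field name and the character cap are arbitrary
placeholders, not a recommendation):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class TruncatingIndexer {
  // Arbitrary cap: characters, not tokens, so this only approximates the filter
  private static final int MAX_CHARS = 1_000_000;

  static SolrInputDocument buildDoc(String id, String fullText) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", id);
    String text = fullText.length() > MAX_CHARS
        ? fullText.substring(0, MAX_CHARS)
        : fullText;
    doc.addField("content", text);
    return doc;
  }

  public static void main(String[] args) throws Exception {
    try (SolrClient solr = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/mycore").build()) {
      solr.add(buildDoc("doc-1", "...very large extracted text..."));
      solr.commit();
    }
  }
}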

- Rahul



On Thu, Oct 1, 2020 at 7:06 PM Shawn Heisey  wrote:


On 10/1/2020 6:57 AM, Manisha Rahatadkar wrote:

We are using Apache Solr 7.7 on the Windows platform. The data is synced to
Solr using Solr.Net commits, in batches. The document size is huge (~0.5 GB on
average) and Solr indexing is taking a long time. Total document size is
~200 GB. As the Solr commit is done as part of an API call, the API calls are
failing because document indexing is not completed.

A single document is five hundred megabytes?  What kind of documents do
you have?  You can't even index something that big without tweaking
configuration parameters that most people don't even know about.
Assuming you can even get it working, there's no way that indexing a
document like that is going to be fast.


1.  What is your advice on syncing such a large volume of data to Solr KB.

What is "KB"?  I have never heard of this in relation to Solr.


2.  Because of the search requirements, almost 8 fields are defined as Text fields.

I can't figure out what you are trying to say with this statement.


3.  Currently SOLR_JAVA_MEM is set to 2gb. Is that enough for such a large volume of data?

If just one of the documents you're sending to Solr really is five
hundred megabytes, then 2 gigabytes would probably be just barely enough
to index one document into an empty index ... and it would probably be
doing garbage collection so frequently that it would make things REALLY
slow.  I have no way to predict how much heap you will need.  That will
require experimentation.  I can tell you that 2GB is definitely not enough.


4.  How to set up Solr in production on Windows? Currently it's set up as a
standalone engine and the client is asked to take a backup of the drive. Is
there any better way to do this? How to set up for disaster recovery?

I would suggest NOT doing it on Windows.  My reasons for that come down
to costs -- a Windows Server license isn't cheap.

That said, there's nothing wrong with running on Windows, but you're on
your own as far as running it as a service.  We only have a service
installer for UNIX-type systems.  Most of the testing for that is done
on Linux.


5.  How to benchmark the system requirements for such a huge volume of data?

I do not know what all your needs are, so I have no way to answer this.
You're going to know a lot more about it than any of us do.

Thanks,
Shawn



--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com