Hi Rahul,
In addition to the wise advice below: remember in Solr, a 'document' is
just the name for the thing that would appear as one of the results when
you search (analogous to a database record). It's not the same
conceptually as a 'Word document' or a 'PDF document'. If your source
documents are so big, consider how they might be broken into parts, or
whether you really need to index all of them for retrieval purposes, or
what parts of them need to be extracted as text. Thus, the Solr
documents don't necessarily need to be as large as your source documents.
Consider an email of 20 KB with ten PDF attachments, each 20 MB. You
probably shouldn't push all of this data into a single Solr document,
but you *could* index it as 11 separate Solr documents, with metadata
to indicate that one is an email and ten are PDFs, and a shared ID of
some kind to indicate they're related. Then at query time there are
various ways for you to group these together, so for example if the
query hit one of the PDFs you could show the user the original email,
plus the nine other attachments, using the shared ID as a key.
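As a rough sketch (the field names "doctype" and "thread_id" here are
purely illustrative, not anything Solr defines for you), the 11
documents might look like:

   { "id": "msg-1",        "doctype": "email", "thread_id": "t-1", "text": "..." }
   { "id": "msg-1-att-1",  "doctype": "pdf",   "thread_id": "t-1", "text": "..." }
   ...
   { "id": "msg-1-att-10", "doctype": "pdf",   "thread_id": "t-1", "text": "..." }

and at query time you could collapse the results on the shared key, e.g.

   q=your query&fq={!collapse field=thread_id}

(or use result grouping: group=true&group.field=thread_id), then fetch
everything with the same thread_id to display the email plus its
attachments.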
HTH,
Charlie
On 02/10/2020 01:53, Rahul Goswami wrote:
Manisha,
In addition to what Shawn has mentioned above, I would also like you to
re-evaluate your use case. Do you *need to* index the whole document?
E.g. if it's an email, the body of the email *might* be more important
than any attachments, in which case you could choose to only index the
email body and ignore (or only partially index) the text from the
attachments. If you can afford to index the documents partially, you
could consider Solr's "Limit token count filter": see the link below.
https://lucene.apache.org/solr/guide/7_7/filter-descriptions.html#limit-token-count-filter
You'll need to configure it in the schema, on the "index" analyzer of
the field type used for the field with the large text.
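As a rough sketch (the field type name and maxTokenCount value are only
examples; pick a limit that suits your data), the field type could look
like:

   <fieldType name="text_limited" class="solr.TextField" positionIncrementGap="100">
     <analyzer type="index">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10000"
               consumeAllTokens="false"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>

With this in place only the first 10,000 tokens of the field are
indexed and the rest are dropped.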
Indexing documents on the order of half a GB will definitely come back
to hurt your operations, if not now then later (think OOMs, extremely
slow atomic updates, long-running merges, etc.).
- Rahul
On Thu, Oct 1, 2020 at 7:06 PM Shawn Heisey <apa...@elyograg.org> wrote:
On 10/1/2020 6:57 AM, Manisha Rahatadkar wrote:
We are using Apache Solr 7.7 on the Windows platform. The data is
synced to Solr in batches using Solr.Net commits. The documents are
very large (~0.5 GB on average) and Solr indexing is taking a long
time. The total document size is ~200 GB. As the Solr commit is done as
part of the API, the API calls are failing because document indexing
has not completed.
A single document is five hundred megabytes? What kind of documents do
you have? You can't even index something that big without tweaking
configuration parameters that most people don't even know about.
Assuming you can even get it working, there's no way that indexing a
document like that is going to be fast.
1. What is your advice on syncing such a large volume of data to
Solr KB.
What is "KB"? I have never heard of this in relation to Solr.
2. Because of the search requirements, almost 8 fields are defined
as Text fields.
I can't figure out what you are trying to say with this statement.
3. Currently Solr_JAVA_MEM is set to 2 GB. Is that enough for such a
large volume of data?
If just one of the documents you're sending to Solr really is five
hundred megabytes, then 2 gigabytes would probably be just barely enough
to index one document into an empty index ... and it would probably be
doing garbage collection so frequently that it would make things REALLY
slow. I have no way to predict how much heap you will need. That will
require experimentation. I can tell you that 2GB is definitely not
enough.
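(For reference, on Windows the heap is set in solr.in.cmd, e.g.

   REM solr.in.cmd -- the 8g value is only an illustration, not a recommendation
   set SOLR_JAVA_MEM=-Xms8g -Xmx8g

but as noted above, the right number is something you'll only find by
experimenting.)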
4. How to set up Solr in production on Windows? Currently it's set
up as a standalone engine and the client is requested to take a backup
of the drive. Is there any better way to do this? How to set up for
disaster recovery?
I would suggest NOT doing it on Windows. My reasons for that come down
to costs -- a Windows Server license isn't cheap.
That said, there's nothing wrong with running on Windows, but you're on
your own as far as running it as a service. We only have a service
installer for UNIX-type systems. Most of the testing for that is done
on Linux.
5. How to benchmark the system requirements for such a huge volume
of data?
I do not know what all your needs are, so I have no way to answer this.
You're going to know a lot more about it than any of us are.
Thanks,
Shawn
--
Charlie Hull
OpenSource Connections, previously Flax
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.o19s.com