Nested docs would be one approach, result grouping might be another. Regarding JOINs, the only way you're going to know is by some representative testing.
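For what it's worth, the two query-time options could be sketched roughly like this (a sketch only; the field names `parent_id` and `doctype` are invented, and real performance on a 200M+ doc index would still need the representative testing mentioned above):

```python
# Sketch of two ways to relate an email and its attachments at query time.
# Field names (doctype, parent_id) are hypothetical; adapt to your schema.

def join_query(user_query: str) -> dict:
    """Find attachments matching the query, then map them to their parent
    emails via Solr's query-time join parser ({!join})."""
    return {
        "q": "{!join from=parent_id to=id}" + user_query,
        "fq": "doctype:email",
    }

def block_join_query(user_query: str) -> dict:
    """Same idea with nested (parent/child) documents and the
    {!parent} block-join query parser."""
    return {
        "q": "{!parent which='doctype:email'}" + user_query,
    }

print(join_query("body:invoice")["q"])
# {!join from=parent_id to=id}body:invoice
```

Block join is generally faster than query-time join but requires indexing parents and children together as one block, which constrains how you update them.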

Charlie

On 05/10/2020 05:49, Rahul Goswami wrote:
Charlie,
Thanks for providing an alternate approach to doing this. It would be
interesting to know how one could go about organizing the docs in this
case (nested documents?). How would join queries perform on a large
index (200 million+ docs)?

Thanks,
Rahul



On Fri, Oct 2, 2020 at 5:55 AM Charlie Hull <char...@flax.co.uk> wrote:

Hi Rahul,

In addition to the wise advice below: remember that in Solr, a 'document' is
just the name for the thing that would appear as one of the results when
you search (analogous to a database record). It's not the same
conceptually as a 'Word document' or a 'PDF document'. If your source
documents are so big, consider how they might be broken into parts,
whether you really need to index all of them for retrieval purposes, or
what parts of them need to be extracted as text. Thus, the Solr
documents don't necessarily need to be as large as your source documents.

Consider an email of 20KB with ten PDF attachments, each 20MB. You
probably shouldn't push all this data into a single Solr document, but
you *could* index them as 11 separate Solr documents, with metadata
to indicate that one is an email and ten are PDFs, and a shared ID of
some kind to indicate they're related. Then at query time there are
various ways for you to group these together; for example, if the
query hit one of the PDFs you could show the user the original email,
plus the nine other attachments, using the shared ID as a key.

HTH,

Charlie
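A minimal sketch of that 11-document layout and the query-time regrouping (Python; the field names and the in-memory "index" are invented purely for illustration):

```python
# One email plus ten PDF attachments, indexed as 11 flat Solr documents
# sharing a thread_id. Field names are hypothetical.

docs = [{"id": "email-1", "doctype": "email", "thread_id": "t-42"}]
docs += [
    {"id": f"pdf-{n}", "doctype": "pdf", "thread_id": "t-42"}
    for n in range(1, 11)
]

def siblings(hit_id: str, index: list) -> list:
    """Given one matching document, fetch everything that shares its
    thread_id -- the 'original email plus the other attachments' step."""
    thread = next(d["thread_id"] for d in index if d["id"] == hit_id)
    return [d for d in index if d["thread_id"] == thread]

related = siblings("pdf-3", docs)
print(len(related))  # 11
print(sum(1 for d in related if d["doctype"] == "email"))  # 1
```

In a real deployment the same regrouping can be pushed into Solr itself with result grouping (`group=true&group.field=thread_id`) or the collapse query parser, rather than a second fetch.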



On 02/10/2020 01:53, Rahul Goswami wrote:

Manisha,
In addition to what Shawn has mentioned above, I would also like you to
reevaluate your use case. Do you *need to* index the whole document? E.g.
if it's an email, the body of the email *might* be more important than any
attachments, in which case you could choose to only index the email body
and ignore (or only partially index) the text from attachments. If you
can afford to index the documents partially, consider Solr's
"limit token count filter"; see the link below:
https://lucene.apache.org/solr/guide/7_7/filter-descriptions.html#limit-token-count-filter

You'll need to configure it in the schema for the "index" analyzer of the
data type of the field with large text.

Indexing documents on the order of half a GB will definitely come back to
hurt your operations, if not now, then later (think OOM, extremely slow
atomic updates, long-running merges, etc.).

- Rahul
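For reference, the limit token count filter mentioned above is wired into the index analyzer of the relevant field type in the schema, roughly like this (a sketch; the type name and the 10000 limit are arbitrary placeholders):

```xml
<fieldType name="text_limited" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Keep only the first 10000 tokens of very large field values -->
    <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10000"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```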
On Thu, Oct 1, 2020 at 7:06 PM Shawn Heisey <apa...@elyograg.org> wrote:
On 10/1/2020 6:57 AM, Manisha Rahatadkar wrote:
We are using Apache Solr 7.7 on the Windows platform. The data is synced to
Solr using a Solr.Net commit, in batches. The document size is very large
(~0.5GB average) and Solr indexing is taking a long time. Total document
size is ~200GB. As the Solr commit is done as part of an API call, the API
calls are failing because document indexing is not completed.

A single document is five hundred megabytes?  What kind of documents do
you have?  You can't even index something that big without tweaking
configuration parameters that most people don't even know about.
Assuming you can even get it working, there's no way that indexing a
document like that is going to be fast.
     1.  What is your advice on syncing such a large volume of data to Solr KB.
What is "KB"?  I have never heard of this in relation to Solr.
     2.  Because of the search requirements, almost 8 fields are defined as Text fields.
I can't figure out what you are trying to say with this statement.
     3.  Currently Solr_JAVA_MEM is set to 2gb. Is that enough for such a large volume of data?
If just one of the documents you're sending to Solr really is five
hundred megabytes, then 2 gigabytes would probably be just barely enough
to index one document into an empty index ... and it would probably be
doing garbage collection so frequently that it would make things REALLY
slow.  I have no way to predict how much heap you will need.  That will
require experimentation.  I can tell you that 2GB is definitely not
enough.
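On Windows that setting lives in bin\solr.in.cmd; something like the following (the 8g figure is purely a starting point for experimentation, not a recommendation):

```bat
REM bin\solr.in.cmd -- heap size for Solr on Windows.
REM 8g is a guess to start experimenting from; measure before settling.
set SOLR_JAVA_MEM=-Xms8g -Xmx8g
```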

     4.  How to set up Solr in production on Windows? Currently it's set up as a standalone engine and the client is requested to take a backup of the drive. Is there any better way to do this? How to set up for disaster recovery?
I would suggest NOT doing it on Windows.  My reasons for that come down
to costs -- a Windows Server license isn't cheap.
That said, there's nothing wrong with running on Windows, but you're on
your own as far as running it as a service.  We only have a service
installer for UNIX-type systems.  Most of the testing for that is done
on Linux.
     5.  How to benchmark the system requirements for such a huge data set?
I do not know what all your needs are, so I have no way to answer this.
You're going to know a lot more about it than any of us are.
Thanks,
Shawn


--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



