Yeah, I'm indexing some PDF documents. I've extracted the text through Tika
(pre-indexing), and the largest field in my DB is 20MB. That's quite
extensive ;) My solution for the moment is to cut this text down to the first
500KB, which should be enough for a decent index and search capability.
Should I increase the buffer size (ramBufferSizeMB) for fields of this size
as well, or will 32MB suffice?
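
For reference, this is roughly how I cap the extracted text before it ever
reaches Solr (a minimal sketch using Tika's Java facade; my actual truncation
happens in the Drupal pre-processing step, so the class name and file
handling here are just illustrative):

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TruncatingExtractor {

    // Roughly the first 500KB of text; anything beyond this is dropped
    // before the document is sent to Solr.
    private static final int MAX_CHARS = 500 * 1024;

    public static String extract(File pdf) throws IOException, TikaException {
        Tika tika = new Tika();
        // Tika stops collecting text once this many characters have been
        // gathered, so a huge PDF never produces a 20MB field at all.
        tika.setMaxStringLength(MAX_CHARS);
        return tika.parseToString(pdf);
    }
}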

FYI, output of ulimit -a is
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 20
*file size               (blocks, -f) unlimited*
pending signals                 (-i) 16382
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited


Kind regards!
Bram
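
P.S. If I follow Shawn's suggestion below (ramBufferSizeMB 1024 and
autoCommit maxTime/maxDocs set to 0 for the initial full import), I assume
the indexing process itself has to issue the commit at the end. This is a
rough SolrJ sketch of what I have in mind; the URL and field names are
placeholders, since my real indexing goes through the Drupal module:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; the real core sits behind the Drupal setup.
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        for (int i = 0; i < 1500; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("body", "extracted text, capped at 500KB");
            // Documents are indexed but not searchable until a commit.
            solr.add(doc);
        }

        // With autoCommit disabled in solrconfig.xml, this single explicit
        // commit at the end is the only one during the full import.
        solr.commit();
    }
}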

On Fri, Apr 20, 2012 at 12:15 PM, Lance Norskog <goks...@gmail.com> wrote:

> Good point! Do you store the large file in your documents, or just index
> them?
>
> Do you have a "largest file" limit in your environment? Try this:
> ulimit -a
>
> What is the "file size"?
>
> On Thu, Apr 19, 2012 at 8:04 AM, Shawn Heisey <s...@elyograg.org> wrote:
> > On 4/19/2012 7:49 AM, Bram Rongen wrote:
> >>
> >> Yesterday I started indexing again, but this time on Solr 3.6. Again
> >> Solr is failing around the same time, but not exactly (now the largest
> >> fdt file is 4.8G). It's right after the moment I receive memory errors
> >> on the Drupal side, which makes me suspicious that it may have
> >> something to do with a huge document. Is that possible? I was indexing
> >> 1500 documents at once every minute. Drupal builds them all up in
> >> memory before submitting them to Solr. At some point it runs out of
> >> memory and I have to switch to 10/20 documents per minute for a while;
> >> then I can switch back to 1000 documents per minute.
> >>
> >> The disk is a software RAID1 over 2 disks. But I've also run into the
> >> same problem on another server. That was a VM server with only 1GB of
> >> RAM and 40GB of disk. On that server the merge-repeat happened at an
> >> earlier stage.
> >>
> >> I've also let Solr continue merging for about two days (in an earlier
> >> attempt), without submitting new documents. The merging kept repeating.
> >>
> >> Somebody suggested it could be because I'm using Jetty; could that be
> >> right?
> >
> >
> > I am using Jetty for my Solr installation and it handles very large
> > indexes without a problem.  I have created a single index with all my
> > data (nearly 70 million documents, total index size over 100GB).  Aside
> > from how long it takes to build and the fact that I don't have enough
> > RAM to cache it for good performance, Solr handled it just fine.  For
> > production I use a distributed index on multiple servers.
> >
> > I don't know why you are seeing a merge that continually restarts;
> > that's truly odd.  I've never used Drupal and don't know a lot about
> > it.  From my small amount of research just now, I assume that it uses
> > Tika, another tool that I have no experience with.  I am guessing that
> > you store the entire text of your documents in Solr, and that they are
> > indexed up to a maximum of 10,000 tokens (the default value of
> > maxFieldLength in solrconfig.xml), based purely on speculation about
> > the "body" field in your schema.
> >
> > A document that's 100MB in size, if the whole thing gets stored, will
> > completely overwhelm a 32MB buffer, and might even be enough to
> > overwhelm a 256MB buffer as well, because Solr will basically have to
> > build the entire index segment in RAM, with term vectors, indexed data,
> > and stored data for all fields.
> >
> > With such large documents, you may have to increase maxFieldLength, or
> > you won't be able to search the entire document text.  Depending on the
> > content of those documents, it may or may not be a problem that only
> > the first 10,000 tokens get indexed.  Large documents tend to be
> > repetitive, and there might not be any search value after the
> > introduction and initial words.  Your documents may be different, so
> > you'll have to make that decision.
> >
> > To test whether my current thoughts are right, I recommend that you try
> > the following settings during the initial full import: ramBufferSizeMB:
> > 1024 (or maybe higher), autoCommit maxTime: 0, autoCommit maxDocs: 0.
> > This means that unless the indexing process issues manual commits
> > (either in the middle of indexing or at the end), you will have to do a
> > manual one.  Once you have the initial index built and it is only doing
> > updates, you will probably be able to go back to using autoCommit.
> >
> > It's possible that I have no understanding of the real problem here,
> > and my recommendation above may result in no improvement.  General
> > recommendations, no matter what the current problem might be:
> >
> > 1) Get a lot more RAM.  Ideally you want enough free memory to cache
> > your entire index.  That may not be possible, but you want to get as
> > close to that goal as you can.
> > 2) If you can, see what you can do to increase your IOPS.  Mirrored
> > high-RPM SAS is an easy solution, and might be slightly cheaper than
> > SATA RAID10, which is my solution.  SSD is easy and very fast, but
> > expensive and not redundant -- I am currently not aware of any SSD RAID
> > solutions that have OS TRIM support.  RAID10 with high-RPM SAS would be
> > best, but very expensive.  On the extreme high end, you could go with a
> > high-performance SAN.
> >
> > Thanks,
> > Shawn
> >
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>
