Hmm, reading your reply again I see that Solr only indexes the first 10k tokens from each field, so field length should not be a problem per se. It could be that my documents contain very large or garbled tokens (the text is extracted from PDFs, after all); could that trip up Solr?
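For reference, I believe this is the setting in question, in the <indexDefaults> section of the stock Solr 3.x solrconfig.xml (10000 appears to be the default; I haven't touched it on my side):

  <!-- Maximum number of tokens indexed per field; anything beyond this
       limit is silently dropped at index time. -->
  <maxFieldLength>10000</maxFieldLength>

I have also sketched Shawn's suggested bulk-import settings below the quoted thread, for anyone reading this in the archives.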
On Fri, Apr 20, 2012 at 2:03 PM, Bram Rongen <m...@bramrongen.nl> wrote:
> Yeah, I'm indexing some PDF documents.. I've extracted the text through
> tika (pre-indexing).. and the largest field in my DB is 20MB. That's
> quite extensive ;) My Solution for the moment is to cut this text to the
> first 500KB, that should be enough for a decent index and search
> capabilities.. Should I increase the buffer size for these sizes as well
> or will 32MB suffice?
>
> FYI, output of ulimit -a is
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 20
> *file size              (blocks, -f) unlimited*
> pending signals                 (-i) 16382
> max locked memory       (kbytes, -l) 64
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 8192
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) unlimited
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> Kind regards!
> Bram
>
> On Fri, Apr 20, 2012 at 12:15 PM, Lance Norskog <goks...@gmail.com> wrote:
>
>> Good point! Do you store the large file in your documents, or just
>> index them?
>>
>> Do you have a "largest file" limit in your environment? Try this:
>> ulimit -a
>>
>> What is the "file size"?
>>
>> On Thu, Apr 19, 2012 at 8:04 AM, Shawn Heisey <s...@elyograg.org> wrote:
>> > On 4/19/2012 7:49 AM, Bram Rongen wrote:
>> >>
>> >> Yesterday I've started indexing again but this time on Solr 3.6..
>> >> Again Solr is failing around the same time, but not exactly (now the
>> >> largest fdt file is 4.8G).. It's right after the moment I receive
>> >> memory-errors at the Drupal side which make me suspicious that it
>> >> maybe has something to do with a huge document.. Is that possible? I
>> >> was indexing 1500 documents at once every minute. Drupal builds them
>> >> all up in memory before submitting them to Solr. At some point it
>> >> runs out of memory and I have to switch to 10/20 documents per
>> >> minute for a while.. then I can switch back to 1000 documents per
>> >> minute.
>> >>
>> >> The disk is a software RAID1 over 2 disks. But I've also run into
>> >> the same problem at another server.. This was a VM-server with only
>> >> 1GB ram and 40GB of disk. With this server the merge-repeat happened
>> >> at an earlier stage.
>> >>
>> >> I've also let Solr continue with merging for about two days before
>> >> (in an earlier attempt), without submitting new documents. The
>> >> merging kept repeating.
>> >>
>> >> Somebody suggested it could be because I'm using Jetty, could that
>> >> be right?
>> >
>> >
>> > I am using Jetty for my Solr installation and it handles very large
>> > indexes without a problem. I have created a single index with all my
>> > data (nearly 70 million documents, total index size over 100GB).
>> > Aside from how long it takes to build and the fact that I don't have
>> > enough RAM to cache it for good performance, Solr handled it just
>> > fine. For production I use a distributed index on multiple servers.
>> >
>> > I don't know why you are seeing a merge that continually restarts,
>> > that's truly odd. I've never used drupal, don't know a lot about it.
>> > From my small amount of research just now, I assume that it uses
>> > Tika, also another tool that I have no experience with. I am guessing
>> > that you store the entire text of your documents into solr, and that
>> > they are indexed up to a maximum of 10000 tokens (the default value
>> > of maxFieldLength in solrconfig.xml), based purely on speculation
>> > about the "body" field in your schema.
>> >
>> > A document that's 100MB in size, if the whole thing gets stored, will
>> > completely overwhelm a 32MB buffer, and might even be enough to
>> > overwhelm a 256MB buffer as well, because it will basically have to
>> > build the entire index segment in RAM, with term vectors, indexed
>> > data, and stored data for all fields.
>> >
>> > With such large documents, you may have to increase the
>> > maxFieldLength, or you won't be able to search on the entire document
>> > text. Depending on the content of those documents, it may or may not
>> > be a problem that only the first 10,000 tokens will get indexed.
>> > Large documents tend to be repetitive and there might not be any
>> > search value after the introduction and initial words. Your documents
>> > may be different, so you'll have to make that decision.
>> >
>> > To test whether my current thoughts are right, I recommend that you
>> > try with the following settings during the initial full import:
>> > ramBufferSizeMB: 1024 (or maybe higher), autoCommit maxTime: 0,
>> > autoCommit maxDocs: 0. This will mean that unless the indexing
>> > process issues manual commits (either in the middle of indexing or at
>> > the end), you will have to do a manual one. Once you have the initial
>> > index built and it is only doing updates, you will probably be able
>> > to go back to using autoCommit.
>> >
>> > It's possible that I have no understanding of the real problem here,
>> > and my recommendation above may result in no improvement. General
>> > recommendations, no matter what the current problem might be:
>> >
>> > 1) Get a lot more RAM. Ideally you want to have enough free memory
>> > to cache your entire index. That may not be possible, but you want
>> > to get as close to that goal as you can.
>> > 2) If you can, see what you can do to increase your IOPS. Using
>> > mirrored high RPM SAS is an easy solution, and might be slightly
>> > cheaper than SATA RAID10, which is my solution. SSD is easy and very
>> > fast, but expensive and not redundant -- I am currently not aware of
>> > any SSD RAID solutions that have OS TRIM support. RAID10 with high
>> > RPM SAS would be best, but very expensive. On the extreme high end,
>> > you could go with a high performance SAN.
>> >
>> > Thanks,
>> > Shawn
>> >
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>>
>
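For the archives, this is my reading of Shawn's suggested full-import settings as they would look in solrconfig.xml. It is only a sketch; I have not tested it yet, and the exact placement of the elements may differ between configs:

  <!-- In the <indexDefaults> (or <mainIndex>) section: a much larger RAM
       buffer for the initial bulk import. -->
  <ramBufferSizeMB>1024</ramBufferSizeMB>

  <!-- In <updateHandler>: leave autoCommit out (or commented) during the
       bulk load, which I understand to match Shawn's maxTime/maxDocs of 0;
       a single manual commit is then needed when indexing finishes. -->
  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- <autoCommit> intentionally omitted for the initial import -->
  </updateHandler>

The manual commit at the end can be issued against the update handler, e.g. http://localhost:8983/solr/update?commit=true (adjust host, port and path to your own setup).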