Hello, I know this has been discussed extensively in past posts. I have tried a bunch of suggestions and I still have a few questions.
I am using Solr 4.4 on Tomcat 7 with OpenJDK 1.7 and a single Solr core. I am trying to index a set of CSV files (total size 13 GB). Each CSV file contains a long list of bigram-frequency tuples of the form (word1 word2, frequency), e.g.:

    blue sky, 2500
    green grass, 300

My schema.xml is as simple as can be. I index just these two fields, of type string and long, and do not use any tokenizer or analyzer factories:

    <fields>
      <field name="_version_" type="long" indexed="true" stored="true" multiValued="false" omitNorms="true" />
      <field name="word" type="string" indexed="true" stored="true" multiValued="false" omitNorms="true" />
      <field name="frequency" type="long" indexed="true" stored="true" multiValued="false" omitNorms="true" />
    </fields>

In my solrconfig.xml, the RAM buffer size is 100 MB, the merge factor is 10, and maxIndexingThreads is 8.

I am using SolrJ with ConcurrentUpdateSolrServer (CUSS) to index. I have set the queue size to 10000 and the number of threads to 10, and I use the javabin format. I run my SolrJ program by giving it the path to the directory where the CSV files are stored. I start one instance of CUSS and have multiple threads reading from the various files simultaneously and writing into the CUSS threads simultaneously. I commit only after all the records have been indexed. My autocommit values for number of documents and commit time are also set to very large numbers.

I have tried indexing a test set of CSV files containing 1.44M records (total size 21 MB). All my tests have been on different types of Amazon EC2 instances, e.g. m1.xlarge (4 vCPU, 15 GB RAM) and m3.2xlarge (8 vCPU, 30 GB RAM). I have set my JVM heap size large enough and tuned the GC parameters as suggested on various forums.

Observations:

1. Indexing 1.44M records (i.e. rows in the CSV files) takes 240 s on the m1.xlarge instance and 160 s on the m3.2xlarge instance.

2. The indexing time is the same whether I have one large file with 1.44M rows or 2 files with 720K rows each.

3. The indexing time is independent of the number of threads and the queue size I specify for CUSS. I have set both parameters as low as 1 with no difference.

4. The indexing time is independent of the merge factor, the RAM buffer size and the number of indexing threads. I have tried various settings.

5. It appears that I am not really indexing my files in parallel if I use a single Solr core. Is this not possible? What exactly does maxIndexingThreads in solrconfig.xml control?

6. My concern is that my indexing is far slower than what I have seen claimed on various forums (e.g. 29 GB of Wikipedia in 13 minutes, 50 GB in 39 minutes), even with a single Solr core.

What am I doing wrong? How do I speed up my indexing? Any suggestions will be appreciated.

Thanks,
Vikram
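
P.S. To put the numbers above on a common scale, here is a rough sketch of the arithmetic. It only uses figures quoted in this post. The function names are made up for illustration, 1 GB is taken as 1000 MB, and the projection for the full 13 GB assumes the rate stays constant, which may well not hold. The forum figures are for much larger documents, so the MB/s comparison is only indicative.

```python
# Figures quoted in the post: 1.44M records, 21 MB, 240 s and 160 s.
RECORDS = 1_440_000
TEST_SET_MB = 21


def throughput(seconds):
    """Return (documents per second, MB per second) for the test set."""
    return RECORDS / seconds, TEST_SET_MB / seconds


def forum_rate_mb_per_s(gigabytes, minutes):
    """Rate implied by a claim like '29 GB in 13 minutes' (assumes 1 GB = 1000 MB)."""
    return gigabytes * 1000 / (minutes * 60)


def projected_hours(total_gb, mb_per_s):
    """Hours to index total_gb at a constant rate (linear scaling is an assumption)."""
    return total_gb * 1000 / mb_per_s / 3600


if __name__ == "__main__":
    for name, secs in [("m1.xlarge", 240), ("m3.2xlarge", 160)]:
        docs, mb = throughput(secs)
        print(f"{name}: {docs:.0f} docs/s, {mb:.4f} MB/s, "
              f"13 GB would take about {projected_hours(13, mb):.1f} h")
    for gb, mins in [(29, 13), (50, 39)]:
        print(f"forum claim {gb} GB in {mins} min: "
              f"{forum_rate_mb_per_s(gb, mins):.1f} MB/s")
```

So I am seeing roughly 6000 to 9000 documents per second, but only around 0.1 MB/s, because each document is a very short bigram record.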