Hi Fergie,

I haven't forgotten about you, but I've been traveling and then ran into some US holidays here.

To confirm my understanding: you are seeing a slowdown between a 1.3-dev build from April and one from September, right?

Can you produce an MD5 hash of the WAR file or something similar, so that I know I have the exact bits? Better yet, perhaps you can put those files up somewhere where they can be downloaded.
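
For reference, a minimal sketch of how such a hash could be produced. It assumes one of md5sum (typical on Linux), md5 (Mac OS X/BSD) or openssl is installed; the function name file_md5 is made up for illustration, and the WAR paths are the ones used in the script further down.

```shell
#!/bin/bash
# file_md5 FILE: print only the MD5 hex digest of FILE, using whichever
# tool happens to be available on this machine.
file_md5() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | awk '{print $1}'        # GNU coreutils: "<hash>  <file>"
  elif command -v md5 >/dev/null 2>&1; then
    md5 -q "$1"                           # BSD / Mac OS X: -q prints hash only
  else
    openssl md5 "$1" | awk '{print $NF}'  # "MD5(<file>)= <hash>"
  fi
}

# Hash each WAR that exists under the current directory (paths assumed
# from the test script; adjust to wherever the WAR files really live).
for war in apache-solr-bc/example/webapps/solr.war \
           apache-solr-1.3.0/example/webapps/solr.war \
           apache-solr-nightly/example/webapps/solr.war; do
  if [ -f "$war" ]; then
    echo "$(file_md5 "$war")  $war"
  fi
done
```

Comparing the printed digests on both ends is enough to tell whether two WAR files are byte-for-byte identical.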

Thanks,
Grant

On Nov 26, 2008, at 10:54 AM, Fergus McMenemie wrote:

Hello Grant,

Not much good with Java profilers (yet!) so I thought I
would send a script!

Details... details! I decided to produce a script to
replicate the 1.2 vs 1.3 speed problem, and the required rigor
revealed a lot more.

1) The faster version I have previously referred to as 1.2,
  was actually a "1.3-dev" I had downloaded as part of the
  solr bootcamp class at ApacheCon Europe 2008. The ID
  string in the CHANGES.txt document is:-
  $Id: CHANGES.txt 643465 2008-04-01 16:10:19Z gsingers $

2) I did actually download and speed test a version of 1.2
  from the internet. Its CHANGES.txt id is:-
  $Id: CHANGES.txt 543263 2007-05-31 21:19:02Z yonik $
  Speed wise it was about the same as 1.3 at 64min. It also
  had lots of char set issues and is ignored from now on.

3) The version I was planning to use, till I found this
  speed issue, was the "latest" official version:-
  $Id: CHANGES.txt 694377 2008-09-11 17:40:11Z klaas $
  I also verified the behavior with a nightly build.
  $Id: CHANGES.txt 712457 2008-11-09 01:24:11Z koji $
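
The Id strings above can be pulled out mechanically. The following is only a sketch: it assumes CHANGES.txt sits at the top level of each unpacked distribution directory, and the function name changes_id is invented for this example.

```shell
#!/bin/bash
# changes_id FILE: print the first Subversion "$Id: ... $" keyword found
# in FILE (normally a Solr CHANGES.txt).
changes_id() {
  grep -m 1 -o '\$Id: [^$]*\$' "$1"
}

# Report the Id for each distribution directory used by the test script.
# The location of CHANGES.txt inside each directory is an assumption.
for d in apache-solr-bc apache-solr-1.3.0 apache-solr-nightly; do
  if [ -f "$d/CHANGES.txt" ]; then
    echo "$d: $(changes_id "$d/CHANGES.txt")"
  fi
done
```

The third field of the Id line is the Subversion revision, so piping the result through awk '{print $3}' gives a number that is easy to compare between builds.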

Anyway, the following script indexes the content in 22min
for the 1.3-dev version and takes 68min for the newer releases
of 1.3. I took the conf directory from the 1.3-dev (bootcamp)
release and used it to replace the conf directory from the
official 1.3 release. The 3x slowdown was still there; it is
not a configuration issue!
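
One way to double-check that two instances really share the same configuration is a recursive diff of their conf directories. This is a sketch only; same_conf is a made-up helper name, and the directory names in the usage comment are the copies created by the script below.

```shell
#!/bin/bash
# same_conf DIR_A DIR_B: succeed (exit 0) only if both directories contain
# identical files; otherwise print which files differ and return non-zero.
same_conf() {
  diff -r -q "$1" "$2"
}

# Hypothetical usage, run from /usr/local/ts after the setup step:
#   same_conf solrbc/conf solr13/conf && echo "conf directories are identical"
```

If the helper reports no differences, any remaining speed gap cannot come from solrconfig.xml or schema.xml.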
=================================






#! /bin/bash

# This script assumes a /usr/local/tomcat link to whatever version
# of tomcat you have installed. I have "apache-tomcat-5.5.20" Also
# /usr/local/tomcat/conf/Catalina/localhost contains no solr.xml.
# All the following was done as root.


# I have a directory /usr/local/ts which contains four versions of solr. The
# "official" 1.2 along with two 1.3 releases and a version of 1.2 or a 1.3 beta
# I got while attending a solr bootcamp. I indexed the same content using the
# different versions of solr as follows:
cd /usr/local/ts
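# NB: [ "" ] is always false, so this one-time setup block is skipped;
# put any non-empty string inside the quotes to run it.
if [ "" ]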
then
  echo "Starting from a-fresh"
  sleep 5 # allow time for me to interrupt!
  cp -Rp apache-solr-bc/example/solr      ./solrbc  #bc = bootcamp
  cp -Rp apache-solr-nightly/example/solr ./solrnightly
  cp -Rp apache-solr-1.3.0/example/solr   ./solr13

  # the gaz is regularly updated and its name keeps changing :-) The page
  # http://earth-info.nga.mil/gns/html/namefiles.htm has a link to the latest
  # version.
  curl "http://earth-info.nga.mil/gns/html/geonames_dd_dms_date_20081118.zip" > geonames.zip
  unzip -q geonames.zip
  # delete corrupt blips!
  perl -i -n -e 'print unless
      ($. > 2128495 and $. < 2128505) or
      ($. > 5944254 and $. < 5944260)
      ;' geonames_dd_dms_date_20081118.txt
  #following was used to detect bad short records
#perl -a -F\\t -n -e ' print "line $. is bad with ",scalar(@F)," args\n" if (@F != 26);' geonames_dd_dms_date_20081118.txt

  # my set of fields and copyfields for the schema.xml
  fields='
  <fields>
  <field name="UNI" type="string" indexed="true" stored="true" required="true" />
  <field name="CCODE" type="string" indexed="true" stored="true"/>
  <field name="DSG" type="string" indexed="true" stored="true"/>
  <field name="CC1" type="string" indexed="true" stored="true"/>
  <field name="LAT" type="sfloat" indexed="true" stored="true"/>
  <field name="LONG" type="sfloat" indexed="true" stored="true"/>
  <field name="MGRS" type="string" indexed="false" stored="true"/>
  <field name="JOG" type="string" indexed="false" stored="true"/>
  <field name="FULL_NAME" type="string" indexed="true" stored="true"/>
  <field name="FULL_NAME_ND" type="string" indexed="true" stored="true"/>
  <!--field name="text" type="text" indexed="true" stored="false" multiValued="true"/-->
  <!--field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/-->
  '
  copyfields='
     </fields>
     <copyField source="FULL_NAME" dest="text"/>
     <copyField source="FULL_NAME_ND" dest="text"/>
  '

  # add in my fields and copyfields
  perl -i -p -e "print qq($fields) if s/<fields>//;" solr*/conf/schema.xml
  perl -i -p -e "print qq($copyfields) if s[</fields>][];" solr*/conf/schema.xml
  # change the unique key and mark the "id" field as not required
  perl -i -p -e "s/<uniqueKey>id/<uniqueKey>UNI/i;" solr*/conf/schema.xml
  perl -i -p -e 's/required="true"//i if m/<field name="id"/;' solr*/conf/schema.xml
  # enable remote streaming in solrconfig file
  perl -i -p -e 's/enableRemoteStreaming="false"/enableRemoteStreaming="true"/;' solr*/conf/solrconfig.xml
  fi

# some constants to keep the curl command shorter
skip="MODIFY_DATE,RC,UFI,DMS_LAT,DMS_LONG,FC,PC,ADM1,ADM2,POP,ELEV,CC2,NT,LC,SHORT_FORM,GENERIC,SORT_NAME"
file=`pwd`"/geonames.txt"

export JAVA_OPTS=" -Xmx512M -Xms512M -Dsolr.home=`pwd`/solr -Dsolr.solr.home=`pwd`/solr"

echo 'Getting ready to index the data set using solrbc (bc = bootcamp)'
/usr/local/tomcat/bin/shutdown.sh
sleep 15
if [ -n "`ps awxww | grep tomcat | grep -v grep`" ]
then
  echo "Tomcat would not shutdown"
  exit 1
fi
rm -r /usr/local/tomcat/webapps/solr*
rm -r /usr/local/tomcat/logs/*.out
rm -r /usr/local/tomcat/work/Catalina/localhost/solr
cp apache-solr-bc/example/webapps/solr.war /usr/local/tomcat/webapps
rm solr # rm the symbolic link
ln -s solrbc solr
rm -r solr/data
/usr/local/tomcat/bin/startup.sh
sleep 10 # give solr time to launch and setup
echo "Starting indexing at " `date` " with solrbc (bc = bootcamp)"
time curl "http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip"

echo "Getting ready to index the data set using solrnightly"
/usr/local/tomcat/bin/shutdown.sh
sleep 15
if [ -n "`ps awxww | grep tomcat | grep -v grep`" ]
then
  echo "Tomcat would not shutdown"
  exit 1
fi
rm -r /usr/local/tomcat/webapps/solr*
rm -r /usr/local/tomcat/logs/*.out
rm -r /usr/local/tomcat/work/Catalina/localhost/solr
cp apache-solr-nightly/example/webapps/solr.war /usr/local/tomcat/webapps
rm solr # rm the symbolic link
ln -s solrnightly solr
rm -r solr/data
/usr/local/tomcat/bin/startup.sh
sleep 10 # give solr time to launch and setup
echo "Starting indexing at " `date` " with solrnightly"
time curl "http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip"
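
The fixed "sleep 15" and "sleep 10" pauses in the script could be replaced by polling, so that a slow shutdown or startup does not skew a run. The sketch below is an assumption-laden illustration, not part of the original script: wait_until and tomcat_gone are invented names, and the /solr/admin/ping URL in the usage comment assumes the ping handler from the example solrconfig.xml is enabled.

```shell
#!/bin/bash
# wait_until TRIES DELAY CMD...: run CMD until it succeeds, at most TRIES
# times, sleeping DELAY seconds after each failed attempt.
# Returns 0 as soon as CMD succeeds, 1 if it never did.
wait_until() {
  local tries=$1 delay=$2 i
  shift 2
  for ((i = 0; i < tries; i++)); do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# True once no tomcat process is left (same ps/grep test as the script).
tomcat_gone() {
  [ -z "$(ps awxww | grep tomcat | grep -v grep)" ]
}

# Hypothetical usage in place of the fixed sleeps:
#   /usr/local/tomcat/bin/shutdown.sh
#   wait_until 30 1 tomcat_gone || { echo "Tomcat would not shutdown"; exit 1; }
#   /usr/local/tomcat/bin/startup.sh
#   wait_until 60 1 curl -s -f -o /dev/null "http://localhost:8080/solr/admin/ping"
```

Polling keeps the setup phase outside the timed curl call, so the measured indexing time is not affected by how long tomcat took to come up.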




On Nov 20, 2008, at 9:18 AM, Fergus McMenemie wrote:

Hello Grant,

Were you overwriting the existing index or did you also clean out the
Solr data directory, too?  In other words, was it a fresh index, or an
existing one?  And was that also the case for the 22 minute time?

No, in each case it was a new index. I store the indexes (the "data" dir)
outside the solr home directory. For the moment I rm -rf the index dir
after each edit to the solrconfig.xml or schema.xml file and reindex
from scratch. The relaunch of tomcat recreates the index dir.

Would it be possible to profile the two instances and see if you notice
anything different?

I don't understand this. Do you mean run a profiler against the tomcat
image as indexing takes place, or somehow compare the indexes?

Something like JProfiler or any other Java profiler.



I was thinking of making a short script that replicates the results,
and posting it here. Would that help?


Very much so.





Thanks,
Grant

On Nov 19, 2008, at 8:25 AM, Fergus McMenemie wrote:

Hello,

I have a CSV file with 6M records which took 22min to index with
solr 1.2. I then stopped tomcat, replaced the solr stuff inside
webapps with version 1.3, wiped my index and restarted tomcat.

Indexing the exact same content now takes 69min. My machine has
2GB of RAM and tomcat is running with $JAVA_OPTS -Xmx512M -Xms512M.

Are there any tweaks I can use to get the original index time
back? I read through the release notes and was expecting a
speed up. I saw the bit about increasing ramBufferSizeMB and set
it to 64MB; it had no effect.
--

--

===============================================================
Fergus McMenemie               Email:[EMAIL PROTECTED]
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ









