Yeah, I was going to reply to that thread, but then it slipped my mind. :)

Actually, we have two indexes: one used for searching and the other for highlighting. Their structures differ too. The first one has all the metadata plus the document contents indexed (just for searching). It has around 13 million rows. In the second one we mainly have the document PAGE contents indexed/stored with Term Vectors. It has around 130 million rows (since each row is a page).

What we do is search on the first index (around 150GB) and get document IDs based on the page size (20/50/100), and then search for just those document IDs on the second index (but on pages, as we need to show results by page number), with the query text for highlighting as well. The second index is around 700GB (it includes that 450GB TVF file I was talking about), but since it is only consulted for a small number of documents, that is mostly not an issue (some queries are slow there too, but its size is the main issue). On average, more than 90% of the query time is spent on the first index, in searching (and computing the total count as well).

The confusion I had was about the first index, which didn't have Term Vectors on any of the fields in the SOLR schema file but still had a TVF file. The reason, in the end, turned out to be Lucene indexing. Some of the initial documents were indexed through Lucene, and there one of the fields did have Term Vectors! Sorry for that...

*Keeping in mind the above description, are there any other ideas you would like to suggest? Thanks!!*

On Sat, Feb 5, 2011 at 7:40 AM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:

> Hi Salman,
>
> Ah, so in the end you *did* have TV enabled on one of your fields! :) (I think
> this was a problem we were trying to solve a few weeks ago here)
>
> How many docs you have in the index doesn't matter here - only N docs/fields
> that you need to display on a page with N results need to be reanalyzed for
> highlighting purposes, so follow Grant's advice, make a small index without TV,
> and compare highlighting speed with and without TV.
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
> ----- Original Message ----
> > From: Salman Akram <salman.ak...@northbaysolutions.net>
> > To: solr-user@lucene.apache.org
> > Sent: Fri, February 4, 2011 8:03:06 AM
> > Subject: Re: Highlighting with/without Term Vectors
> >
> > Basically Term Vectors are only on one main field i.e. Contents. Average
> > size of each document would be few KB's but there are around 130 million
> > documents so what do you suggest now?
> >
> > On Fri, Feb 4, 2011 at 5:24 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> >
> > > Salman,
> > >
> > > It also depends on the size of your documents. Re-analyzing 20 fields of 500
> > > bytes each will be a lot faster than re-analyzing 20 fields with 50 KB each.
> > >
> > > Otis
> > > ----
> > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > Lucene ecosystem search :: http://search-lucene.com/
> > >
> > > ----- Original Message ----
> > > > From: Grant Ingersoll <gsing...@apache.org>
> > > > To: solr-user@lucene.apache.org
> > > > Sent: Wed, January 26, 2011 10:44:09 AM
> > > > Subject: Re: Highlighting with/without Term Vectors
> > > >
> > > > On Jan 24, 2011, at 2:42 PM, Salman Akram wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Does anyone have any benchmarks how much highlighting speeds up with Term
> > > > > Vectors (compared to without it)? e.g. if highlighting on 20 documents take
> > > > > 1 sec with Term Vectors any idea how long it will take without them?
> > > > >
> > > > > I need to know since the index used for highlighting has a TVF file of
> > > > > around 450GB (approx 65% of total index size) so I am trying to see whether
> > > > > the decreasing the index size by dropping TVF would be more helpful for
> > > > > performance (less RAM, should be good for I/O too I guess) or keeping it is
> > > > > still better?
> > > > >
> > > > > I know the best way is try it out but indexing takes a very long time so
> > > > > trying to see whether its even worthy or not.
> > > >
> > > > Try testing on a smaller set. In general, you are saving the process of
> > > > re-analyzing the content, so, to some extent it is going to be dependent on how
> > > > fast your analyzer chain is. At the size you are at, I don't know if storing
> > > > TVs is worth it.
> >
> > --
> > Regards,
> >
> > Salman Akram

--
Regards,

Salman Akram
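
P.S. To make the two-index flow at the top of this message concrete, here is a rough Python sketch of the two requests involved. It only builds the request URLs and does not contact any server. The endpoint URLs and the field names (`doc_id`, `page_contents`) are made up for illustration, since the thread does not give the real ones; the request parameters (`q`, `fq`, `fl`, `rows`, `hl`, `hl.fl`, `wt`) are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Hypothetical endpoints and field names; the thread does not name the real ones.
SEARCH_INDEX = "http://localhost:8983/solr/search/select"
PAGE_INDEX = "http://localhost:8983/solr/pages/select"


def search_url(query, page_size=20):
    """Phase 1: query the search index and ask only for document IDs."""
    params = {"q": query, "rows": page_size, "fl": "doc_id", "wt": "json"}
    return SEARCH_INDEX + "?" + urlencode(params)


def highlight_url(query, doc_ids, rows=100):
    """Phase 2: rerun the query on the page index, restricted to those IDs."""
    id_filter = "doc_id:(" + " OR ".join(str(d) for d in doc_ids) + ")"
    params = {
        "q": query,
        "fq": id_filter,           # limit to the documents found in phase 1
        "rows": rows,
        "hl": "true",              # turn highlighting on
        "hl.fl": "page_contents",  # field to highlight (assumed name)
        "wt": "json",
    }
    return PAGE_INDEX + "?" + urlencode(params)


if __name__ == "__main__":
    print(search_url("term vectors", page_size=20))
    print(highlight_url("term vectors", ["D1", "D2", "D3"]))
```

The point of the split is that the large page index is only ever asked about the 20/50/100 documents returned by the first request, so its cost depends on the page size and not on the 130 million rows it holds.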