Yeah, I was going to reply to that thread, but then it just slipped my
mind. :)

Actually we have two indexes: one used for searching and the other for
highlighting. Their structure is different too. The 1st one has all the
metadata + document contents indexed (just for searching) and has around
13 million rows. The 2nd one mainly has the document PAGE contents
indexed/stored with Term Vectors. It has around 130 million rows (since
each row is a page).

What we do is search on the 1st index (around 150GB) and get document IDs
based on the result page size (20/50/100), and then search only for those
document IDs on the 2nd index (but on pages, as we need to show results by
page number), with the text for highlighting as well.
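
Roughly, the two-step flow looks like the sketch below. It only builds the
request parameters and does not talk to a server. The field names `doc_id`
and `page_contents` are made up for illustration (our real field names
differ), and it assumes the IDs are plain numbers that need no escaping.

```python
from urllib.parse import urlencode


def stage1_params(user_query, page_size, page_no=0):
    """Step 1: query the search index for one result page of document IDs."""
    return {
        "q": user_query,
        "fl": "doc_id",                 # hypothetical ID field; return IDs only
        "rows": page_size,              # 20 / 50 / 100
        "start": page_no * page_size,   # offset of the requested result page
    }


def stage2_params(user_query, doc_ids, rows):
    """Step 2: query the per-page index, restricted to the IDs from step 1."""
    # Assumes numeric IDs; string IDs would need query-syntax escaping.
    id_filter = "doc_id:(" + " OR ".join(str(d) for d in doc_ids) + ")"
    return {
        "q": user_query,
        "fq": id_filter,                # only pages of the selected documents
        "rows": rows,
        "hl": "true",                   # ask for highlighting
        "hl.fl": "page_contents",       # hypothetical highlighted field
    }


if __name__ == "__main__":
    print(urlencode(stage1_params("contract", 20)))
    print(urlencode(stage2_params("contract", [101, 102, 103], 50)))
```

The point of the split is that only the second request touches the big index
with Term Vectors, and only for the handful of documents on the current page.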

The 2nd index is around 700GB (it has that 450GB TVF file I was talking
about), but since it's only consulted for a small number of documents, that
is mostly not an issue (some queries are slow there too, but its size is
the main issue).

On average, more than 90% of the query time is spent searching the 1st
index (including getting the total count).

The confusion I had was about the 1st index, which didn't have Term
Vectors on any of the fields in the SOLR schema file but still had a TVF
file. The reason in the end turned out to be Lucene indexing. Some of the
initial documents were indexed through Lucene, and there one of the fields
did have Term Vectors! Sorry for that...
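
For the schema side of that check, something like the sketch below lists the
fields that declare `termVectors="true"`. The sample schema is invented for
illustration. It only looks at `<field>` elements, so a default set on a field
type would be missed, and it cannot see documents that were indexed through
Lucene outside the schema, which was the actual cause in our case.

```python
import xml.etree.ElementTree as ET


def fields_with_term_vectors(schema_xml):
    """Return names of <field> elements that declare termVectors="true"."""
    root = ET.fromstring(schema_xml)
    return [
        field.get("name")
        for field in root.iter("field")
        if field.get("termVectors", "false").lower() == "true"
    ]


# Invented sample schema, not our real one.
SAMPLE = """<schema name="example" version="1.3">
  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="contents" type="text" indexed="true" stored="false"
           termVectors="true"/>
  </fields>
</schema>"""

if __name__ == "__main__":
    print(fields_with_term_vectors(SAMPLE))
```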

*Keeping in mind the above description, are there any other ideas you would
like to suggest? Thanks!!*

On Sat, Feb 5, 2011 at 7:40 AM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:

> Hi Salman,
>
> Ah, so in the end you *did* have TV enabled on one of your fields! :) (I
> think this was a problem we were trying to solve a few weeks ago here)
>
> How many docs you have in the index doesn't matter here - only N docs/fields
> that you need to display on a page with N results need to be reanalyzed for
> highlighting purposes, so follow Grant's advice, make a small index without
> TV, and compare highlighting speed with and without TV.
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> ----- Original Message ----
> > From: Salman Akram <salman.ak...@northbaysolutions.net>
> > To: solr-user@lucene.apache.org
> > Sent: Fri, February 4, 2011 8:03:06 AM
> > Subject: Re: Highlighting with/without Term Vectors
> >
> > Basically Term Vectors are only on one main field i.e. Contents. Average
> > size of each document would be few KB's but there are around 130 million
> > documents so what do you suggest now?
> >
> > On Fri, Feb 4, 2011 at 5:24 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> >
> > > Salman,
> > >
> > > It also depends on the size of your documents. Re-analyzing 20 fields of
> > > 500 bytes each will be a lot faster than re-analyzing 20 fields with 50 KB
> > > each.
> > >
> > > Otis
> > > ----
> > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > Lucene ecosystem search :: http://search-lucene.com/
> > >
> > >
> > >
> > > ----- Original Message ----
> > > > From: Grant Ingersoll <gsing...@apache.org>
> > > > To: solr-user@lucene.apache.org
> > > > Sent: Wed, January 26, 2011 10:44:09 AM
> > > > Subject: Re: Highlighting with/without Term Vectors
> > > >
> > > >
> > > > On Jan 24, 2011, at 2:42 PM, Salman Akram wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Does anyone have any benchmarks how much highlighting speeds up with
> > > > > Term Vectors (compared to without it)? e.g. if highlighting on 20
> > > > > documents take 1 sec with Term Vectors any idea how long it will take
> > > > > without them?
> > > > >
> > > > > I need to know since the index used for highlighting has a TVF file of
> > > > > around 450GB (approx 65% of total index size) so I am trying to see
> > > > > whether the decreasing the index size by dropping TVF would be more
> > > > > helpful for performance (less RAM, should be good for I/O too I guess)
> > > > > or keeping it is still better?
> > > > >
> > > > > I know the best way is try it out but indexing takes a very long time
> > > > > so trying to see whether its even worthy or not.
> > > > >
> > > >
> > > > Try testing on a smaller set. In general, you are saving the process of
> > > > re-analyzing the content, so, to some extent it is going to be dependent
> > > > on how fast your analyzer chain is. At the size you are at, I don't know
> > > > if storing TVs is worth it.
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>



-- 
Regards,

Salman Akram
