Re: Shingle and Query Performance
Another interesting thing: all queries, whether one word or more, including phrase queries such as "barack obama", are slower in the shingle configuration. What am I doing wrong? Without shingles "barack obama" has a query time of 300 ms; with shingles, 780 ms.

On Sat, Aug 27, 2011 at 7:58 PM, Lord Khan Han wrote:
> Hi,
>
> What is the difference between Solr 3.3 and the trunk? I will try 3.3 and let you know the results.
>
> Here is the search handler configuration (values only):
>
> explicit
> 10
> mrank:[0 TO 100]
> explicit
> 10
> edismax
> title^1.05 url^1.2 content^1.7 m_title^10.0
> content^18.0 m_title^5.0
> 1
> 0
> 2<-25%
> true
> 5
> subobjective
> false
> true
>
> On Sat, Aug 27, 2011 at 5:31 PM, Erik Hatcher wrote:
>> I'm not sure what the issue could be at this point. I see you've got qt=search - what's the definition of that request handler?
>>
>> What is the parsed query (from the debugQuery response)?
>>
>> Have you tried this with Solr 3.3 to see if there's any appreciable difference?
>>
>> Erik
>>
>> On Aug 27, 2011, at 09:34, Lord Khan Han wrote:
>>
>>> With grouping off, the query time goes from 3567 ms to 1912 ms. Grouping increases the query time and makes caching useless. But the same config is still faster without shingles.
>>>
>>> We have a head-to-head test this Wednesday against this commercial search engine, so I am looking for all suggestions.
>>>
>>> On Sat, Aug 27, 2011 at 3:37 PM, Erik Hatcher wrote:
>>>
>>>> Please confirm whether this is caused by grouping. Turn grouping off - what's the query time like?
>>>>
>>>> On Aug 27, 2011, at 07:27, Lord Khan Han wrote:
>>>>
>>>>> On the other hand, we couldn't use the cache for the query types below. I think that's caused by grouping. In any case, we need to be sub-second without the cache.
>>>>>
>>>>> On Sat, Aug 27, 2011 at 2:18 PM, Lord Khan Han <khanuniver...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Thanks for the reply. Here is the Solr log capture:
>>>>>>
>>>>>> hl.fragsize=100&spellcheck=true&spellcheck.q=X&group.limit=5&hl.simple.pre=&hl.fl=content&spellcheck.collate=true&wt=javabin&hl=true&rows=20&version=2&fl=score,approved,domain,host,id,lang,mimetype,title,tstamp,url,category&hl.snippets=3&start=0&q=%2B+-"X"+-"X"+-"XX"+-"XX"+-"XX"+-+-"XX"+-XXX+-"X"+-+-+-"X"+-"X"+-"X"+-+-""+-"X"+-"XX"+-"X"+-"XX"+-"XX"+-+-"X"+-"XX"+-+-"X"+-"X"+-X+-"X"+-"X"+-"X"+-"X"+-X+-"XX"+-"XX"+-XX+-X+-"X"+"X"+"X"+"XX"++&group.field=host&hl.simple.post=&group=true&qt=search&fq=mrank:[0+TO+100]&fq=word_count:[70+TO+*]
>>>>>>
>>>>>> The X's stand in for the query words; every phrase "X" contains two words.
>>>>>>
>>>>>> The timings from debugQuery (in ms):
>>>>>> 8654.0 / 16.0 / 16.0 / 0.0 / 0.0 / 0.0 / 0.0 / 0.0 / 0.0 / 8638.0 / 4473.0 / 0.0 / 0.0 / 42.0 / 0.0 / 1.0 / 4122.0
>>>>>>
>>>>>> The funny thing is that if I remove the ShingleFilter from the "sh_text" field below and index normally, the query time is half that of the current shingled index! Shouldn't a shingled index be better for such heavy two-word phrase searches? I am confused.
>>>>>>
>>>>>> On the other hand, one of the off-the-shelf big FAT companies' search engines runs the same query on the same machine in 0.7/0.8 seconds without a cache. I am confident we can do better in Solr, but I couldn't find the way at the moment.
>>>>>>
>>>>>> Thanks for helping.
>>>>>>
>>>>>> On Sat, Aug 27, 2011 at 2:46 AM, Erik Hatcher <erik.hatc...@gmail.com> wrote:
>>>>>>>
>>>>>>> On Aug 26, 2011, at 17:49, Lord Khan Han wrote:
>>>>>>>> We are indexing news documents from various sites. Currently we have 200K docs indexed; total index size is 36 GB. There are also attachments to the news (PDFs, docs, etc.), so a document can be large (e.g. 10 MB).
>>>>>>>>
>>>>>>>> We are using some complex queries which include around 30-40 terms per query. 70% of these terms are two-word phrases. We use them in conjunction with + and - to pinpoint the exact result. There is also grouping, dismax, boosting, and term-vector highlighting.
>>>>>>>
>>>>>>> You're using a lot of componentry there, and hav
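[Not from the thread, but for context: a rough sketch of what shingling a field like "sh_text" means in Lucene 3.x terms. The ShingleFilter emits word bigrams (e.g. "barack obama") as extra tokens alongside the unigrams. The tokenizer choice and shingle size below are assumptions, not the poster's actual configuration.]

    import java.io.Reader;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.shingle.ShingleFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class ShingleDemo {
        // Analysis chain that emits word bigrams alongside the original unigrams.
        static final Analyzer SHINGLE_ANALYZER = new Analyzer() {
            @Override
            public TokenStream tokenStream(String fieldName, Reader reader) {
                TokenStream ts = new StandardTokenizer(Version.LUCENE_33, reader);
                ts = new LowerCaseFilter(Version.LUCENE_33, ts);
                return new ShingleFilter(ts, 2);  // max shingle size 2 => unigrams + bigrams
            }
        };

        public static void main(String[] args) throws Exception {
            TokenStream ts = SHINGLE_ANALYZER.tokenStream("sh_text",
                    new StringReader("barack obama speech"));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // Prints: barack, "barack obama", obama, "obama speech", speech
                System.out.println(term.toString());
            }
            ts.end();
            ts.close();
        }
    }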
Post Processing Solr Results
I need to post-process Solr results based on some access controls which are set up outside of Solr. Currently we've written something that extends SearchComponent, and in the prepare method I'm doing something like this:

    QueryWrapperFilter qwf = new QueryWrapperFilter(rb.getQuery());
    Filter filter = new CustomFilter(qwf);
    FilteredQuery fq = new FilteredQuery(rb.getQuery(), filter);
    rb.setQuery(fq);

Inside my CustomFilter I have a FilteredDocIdSet which checks whether each document should be returned. This works as I expect, but for some reason it is very, very slow. Even if I take out all the machinery that does any logic with the document and only return true in the FilteredDocIdSet's match method, the query still takes an inordinate amount of time compared to not including the custom filter. So my question: is this the most appropriate way of handling this, and what performance should be expected from such a setup? Any information/pointers would be greatly appreciated.
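[For reference, a minimal sketch of the kind of filter being described, written against the stable Lucene 3.x Filter/FilteredDocIdSet API; the poster appears to be on trunk, where getDocIdSet takes an AtomicReaderContext instead of an IndexReader. The "acl" stored field and the visibility rule are made-up placeholders for whatever the external access-control check really is.]

    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.FilteredDocIdSet;

    public class CustomFilter extends Filter {
        private final Filter innerFilter; // e.g. the QueryWrapperFilter built in prepare()

        public CustomFilter(Filter innerFilter) {
            this.innerFilter = innerFilter;
        }

        @Override
        public DocIdSet getDocIdSet(final IndexReader reader) throws IOException {
            // Restrict the inner filter's matches to documents that pass the ACL check.
            return new FilteredDocIdSet(innerFilter.getDocIdSet(reader)) {
                @Override
                protected boolean match(int docid) {
                    try {
                        // Loading the stored document for every candidate hit is the expensive part.
                        Document doc = reader.document(docid);
                        // "acl" field and the "public" rule are hypothetical placeholders.
                        return "public".equals(doc.get("acl"));
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                }
            };
        }
    }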
Re: commas in synonyms.txt are not escaping
Turns out this isn't a bug - I was just tripped up by the analysis changes to the example server. Gary, you are probably just hitting the same thing. The "text" fieldType is no longer used by any fields by default - for example the "text" field uses the "text_general" fieldType. This fieldType uses the standard tokenizer, which discards stuff like commas (hence the synonym will never match). -Yonik http://www.lucidimagination.com
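[A quick illustration of the point above, not from the thread: tokenize a comma-containing string with the standard analyzer (which wraps the same standard tokenizer) and the commas are simply gone, so a synonym entry containing an escaped comma never has anything to match. Lucene 3.x API; the input string is arbitrary.]

    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class CommaTokenDemo {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_33);
            TokenStream ts = analyzer.tokenStream("text", new StringReader("foo, bar"));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // Prints "foo" then "bar" - the comma never reaches the SynonymFilter.
                System.out.println(term.toString());
            }
            ts.end();
            ts.close();
        }
    }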
Re: Error while decoding %DC (Ü) from URL - results in ?
I double-checked all the code on that page and it looks like everything is in UTF-8 and works just fine. The problematic URLs are always requested by bots such as Googlebot, so it looks like they are operating with a different encoding. The page itself has a UTF-8 meta tag. So it seems I have to find a way to detect the encoding and decode appropriately. This should be a common Solr problem if all search engines treat UTF-8 that way, right? Any ideas how to fix it? Is there perhaps special Solr functionality for this?

2011/8/27 François Schiettecatte
> Merlin
>
> Ü encodes to two bytes in UTF-8 (C3 9C) and one in ISO-8859-1 (%DC), so it looks like there is a charset mismatch somewhere.
>
> Cheers
>
> François
>
> On Aug 27, 2011, at 6:34 AM, Merlin Morgenstern wrote:
>
>> Hello,
>>
>> I am having problems with searches issued by spiders that contain the URL-encoded character "Ü", for example in "Übersetzung".
>>
>> The Solr log shows the following query request: /suche/%DCbersetzung
>> which has been translated into the Solr query: q=?ersetzung
>>
>> If you enter the search term directly as a user into the search box, it results in /suche/Übersetzung, which returns perfect results.
>>
>> I am decoding the URL within PHP: $term = trim(urldecode($q));
>>
>> Somehow urldecode() translates the character Ü (%DC) into a ?, which is an illegal first character in Solr.
>>
>> I tried it without urldecode(), with rawurldecode(), and with utf8_decode(), but none of those helped.
>>
>> Thank you for any help or hint on how to solve this problem.
>>
>> Regards, Merlin
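[One common approach, not from the thread: percent-decode to raw bytes first, then try strict UTF-8 and fall back to ISO-8859-1 when the bytes are not valid UTF-8 (a lone %DC is not). The thread is using PHP; this is a self-contained Java sketch of the same fallback for illustration, and the class/method names are made up, not part of any Solr API.]

    import java.io.ByteArrayOutputStream;
    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CodingErrorAction;

    public class QueryTermDecoder {

        /** Percent-decode to raw bytes, then try strict UTF-8 and fall back to
         *  ISO-8859-1 when the bytes are not valid UTF-8 (e.g. a lone %DC). */
        public static String decode(String encoded) {
            byte[] raw = percentDecode(encoded);
            try {
                return Charset.forName("UTF-8").newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(raw)).toString();
            } catch (CharacterCodingException e) {
                return new String(raw, Charset.forName("ISO-8859-1"));
            }
        }

        // Minimal %XX / '+' decoder; assumes the un-escaped characters are ASCII.
        private static byte[] percentDecode(String s) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (c == '%' && i + 2 < s.length()) {
                    out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                    i += 2;
                } else if (c == '+') {
                    out.write(' ');
                } else {
                    out.write((byte) c);
                }
            }
            return out.toByteArray();
        }

        public static void main(String[] args) {
            System.out.println(decode("%DCbersetzung"));     // ISO-8859-1 fallback -> Übersetzung
            System.out.println(decode("%C3%9Cbersetzung"));  // valid UTF-8         -> Übersetzung
        }
    }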
schema design question
Hi there,

I have a question regarding how to set up a schema for some data. The data is basically parent-child data for different types of records: a bunch of records representing projects and subprojects, where each subproject has a parent project and a project has many child subprojects; another bunch of records representing projects and linked projects, with the same parent-child relationship; and another bunch representing projects and linked people.

There are two ways I was thinking this kind of data could be indexed:

1. Create a single store, say CollectionData. Use dynamic fields to post all of this different data, but add a type field to identify the type of record. For example, for project 123 I would post two docs: one of type LinkedProjects carrying the child project name, child project status, etc. plus the parent info, and one of type LinkedPeople carrying the child person name plus the parent info. From the same store I can then run queries for the different data while restricting the result set to one type of record using the fq param (see the sketch below).

2. Create a separate store for each type of record, with pretty much the same schema. The type field is no longer needed, because linked projects live in a linkedProjects store and linked people live in a linkedPeople store. The only drawback I see is that you end up with several stores.

My question to you is which approach makes more sense. I would appreciate any comments.

Thanks
Adeel
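[A minimal SolrJ sketch of approach 1: index two record types into one core and filter by type at query time. The core URL, field names, and values are made up for illustration.]

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrInputDocument;

    public class CollectionDataExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical single core holding all record types.
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/collectiondata");

            // One doc per relationship, discriminated by a "type" field.
            SolrInputDocument linkedProject = new SolrInputDocument();
            linkedProject.addField("id", "123-lp-1");
            linkedProject.addField("parent_id", "123");
            linkedProject.addField("type", "LinkedProjects");
            linkedProject.addField("child_name_s", "Child project name");   // dynamic field
            linkedProject.addField("child_status_s", "Active");

            SolrInputDocument linkedPerson = new SolrInputDocument();
            linkedPerson.addField("id", "123-pe-1");
            linkedPerson.addField("parent_id", "123");
            linkedPerson.addField("type", "LinkedPeople");
            linkedPerson.addField("child_name_s", "Child person name");

            server.add(linkedProject);
            server.add(linkedPerson);
            server.commit();

            // Query one record type only, restricting with fq as described above.
            SolrQuery query = new SolrQuery("parent_id:123");
            query.addFilterQuery("type:LinkedProjects");
            QueryResponse rsp = server.query(query);
            System.out.println(rsp.getResults().getNumFound() + " linked projects");
        }
    }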
multithreading, rebuilds, and updating.
I am planning to make my build system run with a couple of threads whenever there is a need for a full index rebuild. One thread will handle the normal update process - indexing new content, reinserting changed documents, and deletes. The other thread will handle rebuilds at the same time. I'm working on thread locks so the processes won't stomp on each other, but I've come across a question:

The rebuild process will take a few hours (running on all the build cores), but the update process will happen at least once every two minutes (running on the live cores). If a commit is underway on the live core, what will happen if another process asks Solr to swap the live core and the build core? Will it blow up in some way? Should my synchronization prevent these two processes from happening at the same time?

Because I will have just swapped the core out of the way of active queries, I don't actually care about the commit, but I do care about whether an exception will be thrown, and that the index on the core (which is now the build core) is intact and won't cause a future rebuild to fail.

Thanks,
Shawn
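[Not from the thread, but a minimal sketch of the kind of locking being described: a single lock shared by the updater and the rebuilder so a live-core commit and a core swap never overlap. The core names and the plain HTTP call to the CoreAdmin handler (action=SWAP) are assumptions for illustration.]

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.concurrent.locks.ReentrantLock;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class IndexCoordinator {
        // One lock serializes live-core commits and core swaps.
        private final ReentrantLock commitOrSwapLock = new ReentrantLock();
        private final SolrServer liveCore;

        public IndexCoordinator() throws Exception {
            // Hypothetical core name; adjust to the real live core URL.
            liveCore = new CommonsHttpSolrServer("http://localhost:8983/solr/live");
        }

        /** Called by the update thread every couple of minutes. */
        public void commitLiveCore() throws Exception {
            commitOrSwapLock.lock();
            try {
                liveCore.commit();
            } finally {
                commitOrSwapLock.unlock();
            }
        }

        /** Called by the rebuild thread once the build core is complete. */
        public void swapLiveAndBuildCores() throws Exception {
            commitOrSwapLock.lock();
            try {
                // Plain HTTP call to the CoreAdmin handler.
                URL url = new URL("http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=build");
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                if (conn.getResponseCode() != 200) {
                    throw new IllegalStateException("core swap failed: HTTP " + conn.getResponseCode());
                }
                InputStream in = conn.getInputStream();
                in.close();
            } finally {
                commitOrSwapLock.unlock();
            }
        }
    }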
Re: Post Processing Solr Results
Just a bit more information. Inside my class which extends FilteredDocIdSet, all of the time seems to be spent retrieving the document from the readerCtx, i.e. doing this:

    Document doc = readerCtx.reader.document(docid);

If I comment this out and just return true, things fly along as I expect. The query is also matching a total of about 2 million documents.

On Sun, Aug 28, 2011 at 11:39 AM, Jamie Johnson wrote:
> I have a need to post process Solr results based on some access
> controls which are setup outside of Solr, currently we've written
> something that extends SearchComponent and in the prepare method I'm
> doing something like this
>
> QueryWrapperFilter qwf = new QueryWrapperFilter(rb.getQuery());
> Filter filter = new CustomFilter(qwf);
> FilteredQuery fq = new FilteredQuery(rb.getQuery(), filter);
> rb.setQuery(fq);
>
> Inside my CustomFilter I have a FilteredDocIdSet which checks if the
> document should be returned. This works as I expect but for some
> reason is very very slow. Even if I take out any of the machinery
> which does any logic with the document and only return true in the
> FilteredDocIdSet's match method the query still takes an inordinate
> amount of time as compared to not including this custom filter. So my
> question, is this the most appropriate way of handling this? What
> should the performance out of such a setup be expected to be? Any
> information/pointers would be greatly appreciated.
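[Loading the stored document for every candidate hit means one stored-field lookup per document, which is what hurts at ~2 million matches. Not part of the thread, but one common alternative is to read the access-control value from the FieldCache instead of the stored document. A sketch against the Lucene 3.x FieldCache API, assuming a single-valued, untokenized indexed field named "acl" (both the field name and the "public" rule are placeholders):]

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.FilteredDocIdSet;

    public class FieldCacheAclFilter extends Filter {
        private final Filter innerFilter;

        public FieldCacheAclFilter(Filter innerFilter) {
            this.innerFilter = innerFilter;
        }

        @Override
        public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
            // One array lookup per doc instead of one stored-document load per doc.
            final String[] acls = FieldCache.DEFAULT.getStrings(reader, "acl");
            return new FilteredDocIdSet(innerFilter.getDocIdSet(reader)) {
                @Override
                protected boolean match(int docid) {
                    // "public" stands in for whatever the real access-control rule is.
                    return "public".equals(acls[docid]);
                }
            };
        }
    }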
how to update solr cache when i delete records from remote database?
Hi all,

How can I update my Solr cache when I delete records from the database? Those deleted records still come back when I search on a field. What is the procedure to clear the Solr cache when I update the remote database?

Thanks in advance.

Regards,
vighnesh
Does Solr flush to disk even before ramBufferSizeMB is hit?
Hi All,

I am trying to tune ramBufferSizeMB and the merge factor for my setup, so I enabled the Lucene IndexWriter's infoStream logging and started monitoring the data folder where index files are created. I started my test with the following:

Heap: 3 GB, Solr 1.4.1, index size = 20 GB, ramBufferSizeMB = 856, merge factor = 25

I ran my test with 30 concurrent threads writing to Solr. My jobs delete roughly 6 records by issuing a deleteByQuery command and then proceed to write data. Commit is done at the end of the writing process.

The results are a bit surprising to me and I need some help understanding them. I noticed that even though the infoStream does not mention that data is being flushed to disk, new segment files were created on the server, and their size kept growing even though there was enough heap available and the 856 MB RAM buffer was not even used. Is Lucene flushing to disk even before ramBufferSizeMB is hit? If so, why is the infoStream not logging it? According to the infoStream it flushes at the end, but the files are created much before that.

Here is what the infoStream says. Please note that it indicates a new segment being flushed at 12:58 AM, but the files were created at 12:53 AM and kept growing:

Aug 29, 2011 12:46:00 AM IW 0 [main]: setInfoStream: dir=org.apache.lucene.store.NIOFSDirectory@/opt/gid/solr/ecom/data/index autoCommit=false mergePolicy=org.apache.lucene.index.LogByteSizeMergePolicy@4552a64d mergeScheduler=org.apache.lucene.index.ConcurrentMergeScheduler@35242cc9 ramBufferSizeMB=856.0 maxBufferedDocs=-1 maxBuffereDeleteTerms=-1 maxFieldLength=1 index=_3l:C2151995
Aug 29, 2011 12:57:35 AM IW 0 [web-1]: now flush at close
Aug 29, 2011 12:57:35 AM IW 0 [web-1]: flush: now pause all indexing threads
Aug 29, 2011 12:57:35 AM IW 0 [web-1]: flush: segment=_3m docStoreSegment=_3m docStoreOffset=0 flushDocs=true flushDeletes=true flushDocStores=true numDocs=60788 numBufDelTerms=60788
Aug 29, 2011 12:57:35 AM IW 0 [web-1]: index before flush _3l:C2151995
Aug 29, 2011 12:57:35 AM IW 0 [web-1]: DW: flush postings as segment _3m numDocs=60788
Aug 29, 2011 12:57:35 AM IW 0 [web-1]: DW: closeDocStore: 2 files to flush to segment _3m numDocs=60788
Aug 29, 2011 12:57:40 AM IW 0 [web-1]: DW: DW.recycleIntBlocks count=9 total now 9
Aug 29, 2011 12:57:40 AM IW 0 [web-1]: DW: DW.recycleByteBlocks blockSize=32768 count=182 total now 182
Aug 29, 2011 12:57:40 AM IW 0 [web-1]: DW: DW.recycleCharBlocks count=49 total now 49
Aug 29, 2011 12:57:40 AM IW 0 [web-1]: DW: DW.recycleIntBlocks count=7 total now 16
Aug 29, 2011 12:57:40 AM IW 0 [web-1]: DW: DW.recycleByteBlocks blockSize=32768 count=145 total now 327
Aug 29, 2011 12:57:40 AM IW 0 [web-1]: DW: DW.recycleCharBlocks count=37 total now 86
Aug 29, 2011 12:57:40 AM IW 0 [web-1]: DW: DW.recycleIntBlocks count=9 total now 25
Aug 29, 2011 12:57:40 AM IW 0 [web-1]: DW: DW.recycleByteBlocks blockSize=32768 count=208 total now 535
Aug 29, 2011 12:57:40 AM IW 0 [web-1]: DW: DW.recycleCharBlocks count=52 total now 138
Aug 29, 2011 12:57:40 AM IW 0 [web-1]: DW: DW.recycleIntBlocks count=7 total now 32
Aug 29, 2011 12:57:40 AM IW 0 [web-1]: DW: DW.recycleByteBlocks blockSize=32768 count=136 total now 671
Aug 29, 2011 12:57:40 AM IW 0 [web-1]: DW: DW.recycleCharBlocks count=39 total now 177
Aug 29, 2011 12:57:40 AM IW 0 [web-1]: DW: DW.recycleIntBlocks count=3 total now 35
Aug 29, 2011 12:57:40 AM IW 0 [web-1]: DW: DW.recycleByteBlocks blockSize=32768 count=58 total now 729
Aug 29, 2011 12:57:40 AM IW 0 [web-1]: DW: DW.recycleCharBlocks count=16 total now 193
Aug 29, 2011 12:57:41 AM IW 0 [web-1]: DW: oldRAMSize=50469888 newFlushedSize=161169038 docs/MB=395.491 new/old=319.337%
Aug 29, 2011 12:57:41 AM IFD [web-1]: now checkpoint "segments_1x" [2 segments ; isCommit = false]
Aug 29, 2011 12:57:41 AM IW 0 [web-1]: DW: apply 60788 buffered deleted terms and 0 deleted docIDs and 1 deleted queries on 2 segments.
Aug 29, 2011 12:57:42 AM IFD [web-1]: now checkpoint "segments_1x" [2 segments ; isCommit = false]
Aug 29, 2011 12:57:42 AM IFD [web-1]: now checkpoint "segments_1x" [2 segments ; isCommit = false]
Aug 29, 2011 12:57:42 AM IW 0 [web-1]: LMP: findMerges: 2 segments
Aug 29, 2011 12:57:42 AM IW 0 [web-1]: LMP: level 6.6799455 to 7.4299455: 1 segments
Aug 29, 2011 12:57:42 AM IW 0 [web-1]: LMP: level 5.1209826 to 5.8709826: 1 segments
Aug 29, 2011 12:57:42 AM IW 0 [web-1]: CMS: now merge
Aug 29, 2011 12:57:42 AM IW 0 [web-1]: CMS: index: _3l:C2151995 _3m:C60788
Aug 29, 2011 12:57:42 AM IW 0 [web-1]: CMS: no more merges pending; now return
Aug 29, 2011 12:57:42 AM IW 0 [web-1]: CMS: now merge
Aug 29, 2011 12:57:42 AM IW 0 [web-1]: CMS: index: _3l:C2151995 _3m:C60788
Aug 29, 2011 12:57:42 AM IW 0 [web-1]: CMS: no more merges pending; now return
Aug 29, 2011 12:57:42 AM IW 0 [web-1]: now call final commit()
Aug 29, 2011 12:57:42 AM IW 0 [web-1]: startCommit(): start sizeInBytes=0
Aug 29, 2011 12:57:42
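[Not part of the original message: a minimal SolrJ sketch of the indexing job described above (delete by query, write documents, commit once at the end), so the flush timing in the log can be related to client calls. The core URL, query, and field names are placeholders. Note that in this Lucene version stored fields go to the shared doc store files as documents are added, which can grow files on disk before ramBufferSizeMB is reached.]

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexingJob {
        public static void main(String[] args) throws Exception {
            // Hypothetical core URL.
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/ecom");

            // 1. Delete the handful of records this job replaces (placeholder query).
            server.deleteByQuery("batch_id:42");

            // 2. Write the new data.
            List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 1000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "42-" + i);
                doc.addField("batch_id", 42);
                doc.addField("name_s", "document " + i);
                docs.add(doc);
            }
            server.add(docs);

            // 3. Commit once, at the end of the writing process.
            server.commit();
        }
    }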