Re: Using IDF to find Collocations and SIPs . . ?

2010-01-03 Thread Siddhartha Pahade
Please unsubscribe me.

On 12/28/09, Subscriptions  wrote:
>
> I am trying to write a query analyzer to pull:
>
>
>
> 1.  Common phrases (also known as Collocations) within a query
>
>
>
> 2.  Highly unusual phrases (also known as Statistically Improbable
> Phrases or SIPs) within a query
>
>
>
> The Collocations would be similar to facets, except I am also trying to get
> multi-word phrases as well as single terms. So I suppose I could write
> something that does a chained query off the facet query, looking for words
> in proximity. Conceptually (as I understand it) this should just be a
> question of using the IDF (inverse document frequency, i.e. a measure of how
> rarely the term appears across the index).
>
>
>
> * Has anyone tried to write an analyzer that looks for the words
> that typically occur within a given proximity of another word?
>
>
>
> The highly unusual phrases, on the other hand, require getting a handle on
> the IDF, which at present only appears to be available via the explain
> function of debugging.
>
>
>
> * Has anyone written something to go directly after the IDF score
> only?
>
>
>
> * If I do have to go down the path of writing this from scratch is
> the org.apache.lucene.search.Similarity class the one to leverage?
>
>
>
> Most grateful for any feedback or insights,
>
>
>
> Christopher
>
>
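
To make the IDF idea in the question above concrete, here is a minimal sketch in Python. It assumes the classic formula used by Lucene's default similarity of that era, 1 + ln(numDocs / (docFreq + 1)), as I understand it. The function names and the document-frequency counts are hypothetical and only for illustration; this is not how Solr exposes IDF.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """IDF under the assumed classic formula: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def rank_terms_by_idf(doc_freqs: dict, num_docs: int) -> list:
    """Return (term, idf) pairs, rarest (highest IDF) terms first."""
    scored = [(term, idf(df, num_docs)) for term, df in doc_freqs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical document frequencies in an index of 1000 documents.
    sample = {"the": 900, "solr": 120, "collocation": 3}
    for term, score in rank_terms_by_idf(sample, 1000):
        print(f"{term}: {score:.3f}")
```

Terms at the top of this ranking are candidates for "improbable" terms; finding phrases rather than single terms would additionally need phrase (for example shingle) frequencies, which this sketch does not cover.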


Re: solrJ and spell check queries

2010-01-03 Thread Sascha Szott

Hi,

Jay Fisher wrote:

> I'm trying to find a way to formulate the following query in solrJ. This is
> the only way I can get the desired result but I can't figure out how to get
> solrJ to generate the same query string. It always generates a url that
> starts with select and I need it to start with spell. If there is an
> alternative url string that will work please let me know.
>
> http://solr-server/spell/?indent=on&q=shert&wt=json&spellcheck=true&spellcheck.collate=true

In case you hook SpellCheckComponent directly into the standard request 
handler, i.e., /select,


http://solr-server/select?indent=on&q=shert&wt=json&spellcheck=true&spellcheck.collate=true

should work.

-Sascha
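
For reference, the two URLs in this thread differ only in the handler path; the parameters are identical. A minimal Python sketch that assembles either URL from the same parameter set is shown below. The helper name build_solr_url is hypothetical, and the host name is the placeholder used in the thread.

```python
from urllib.parse import urlencode


def build_solr_url(base: str, handler: str, params: dict) -> str:
    """Join a Solr base URL, a request handler path and query parameters."""
    return f"{base.rstrip('/')}/{handler.strip('/')}?{urlencode(params)}"


# Parameters taken from the URLs quoted in the thread.
SPELLCHECK_PARAMS = {
    "indent": "on",
    "q": "shert",
    "wt": "json",
    "spellcheck": "true",
    "spellcheck.collate": "true",
}

if __name__ == "__main__":
    print(build_solr_url("http://solr-server", "select", SPELLCHECK_PARAMS))
    print(build_solr_url("http://solr-server", "spell", SPELLCHECK_PARAMS))
```

The /select variant only returns suggestions if the spellcheck component is actually attached to that handler in solrconfig.xml, as described above.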




Re: solrJ and spell check queries

2010-01-03 Thread Jay Fisher
Thank you. That did it.

~ Jay

On Sun, Jan 3, 2010 at 7:21 AM, Sascha Szott  wrote:

> Hi,
>
>
> Jay Fisher wrote:
>
>> I'm trying to find a way to formulate the following query in solrJ. This
>> is the only way I can get the desired result but I can't figure out how to
>> get solrJ to generate the same query string. It always generates a url that
>> starts with select and I need it to start with spell. If there is an
>> alternative url string that will work please let me know.
>>
>>
>> http://solr-server/spell/?indent=on&q=shert&wt=json&spellcheck=true&spellcheck.collate=true
>>
> In case you hook SpellCheckComponent directly into the standard request
> handler, i.e., /select,
>
>
> http://solr-server/select?indent=on&q=shert&wt=json&spellcheck=true&spellcheck.collate=true
>
> should work.
>
> -Sascha
>
>
>


Re: SOLR: Replication

2010-01-03 Thread Yonik Seeley
On Sat, Jan 2, 2010 at 11:35 PM, Fuad Efendi  wrote:
> I tried... I set APR to improve performance... server is slow while replica;
> but "top" shows only 1% of I/O wait... it is probably environment specific;

So you're saying that stock tomcat (non-native APR) was also 10 times slower?

> but the same happened in my home-based network, rsync was 10 times faster...
> I don't know details of HTTP-replica, it could be base64 or something like
> that; RAM-buffer, flush to disk, etc.

The HTTP replication is using binary.
If you look here, it was benchmarked to be nearly as fast as rsync:
http://wiki.apache.org/solr/SolrReplication

It does do a fsync to make sure that the files are on disk after
downloading, but that shouldn't make too much difference.

-Yonik
http://www.lucidimagination.com


Tokenizing problem with numbers in query

2010-01-03 Thread Bernd Brod
Hello,

When searching for the string "asdf5qwerty", Solr tokenizes it into
"asdf", "5", "qwerty" and displays documents matching any of these tokens.

How can I stop this behaviour and make it search for just the plain
"asdf5qwerty"?

thanks in advance.
Bernd


RE: SOLR: Replication

2010-01-03 Thread Fuad Efendi
Thank you Yonik, excellent wiki! I'll try without APR; I believe it's an
environmental issue. A 100 Mbps switched network should be about 10 times
faster (the current replication speed is 1 MB/s).


> -Original Message-
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
> Seeley
> Sent: January-03-10 10:03 AM
> To: solr-user@lucene.apache.org
> Subject: Re: SOLR: Replication
> 
> On Sat, Jan 2, 2010 at 11:35 PM, Fuad Efendi  wrote:
> > I tried... I set APR to improve performance... server is slow while
> replica;
> > but "top" shows only 1% of I/O wait... it is probably environment
> specific;
> 
> So you're saying that stock tomcat (non-native APR) was also 10 times
> slower?
> 
> > but the same happened in my home-based network, rsync was 10 times
> faster...
> > I don't know details of HTTP-replica, it could be base64 or something
> like
> > that; RAM-buffer, flush to disk, etc.
> 
> The HTTP replication is using binary.
> If you look here, it was benchmarked to be nearly as fast as rsync:
> http://wiki.apache.org/solr/SolrReplication
> 
> It does do a fsync to make sure that the files are on disk after
> downloading, but that shouldn't make too much difference.
> 
> -Yonik
> http://www.lucidimagination.com
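
The "10 times faster" estimate in the reply above follows from a simple unit conversion, sketched here in Python. The function name is hypothetical, and the calculation ignores protocol overhead, so the real ceiling would be somewhat lower.

```python
def mbps_to_mbytes_per_sec(megabits_per_sec: float) -> float:
    """Convert a link speed in megabits/s to megabytes/s (8 bits per byte)."""
    return megabits_per_sec / 8.0


if __name__ == "__main__":
    link_capacity = mbps_to_mbytes_per_sec(100)  # theoretical ceiling of the link
    observed = 1.0                               # MB/s, as reported in the thread
    print(f"link ceiling: {link_capacity} MB/s")
    print(f"headroom over observed speed: {link_capacity / observed}x")
```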




Re: Tokenizing problem with numbers in query

2010-01-03 Thread Ahmet Arslan

> when searching for a string: "asdf5qwerty" solr will
> tokenize it to:
> "asdf", "5", "qwerty" and display documents matching either
> string.
> 
> How can i stop this behaviour and make it just search for
> plain
> "asdf5qwerty"?

What is the type of your field? If you have solr.WordDelimiterFilterFactory in 
your analysis chain, remove it. In admin/analysis.jsp you can see which 
tokenizer/tokenfilter is breaking "asdf5qwerty" into "asdf", "5", "qwerty".
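
To illustrate the kind of splitting described above, here is a simplified Python emulation of breaking a token at letter/digit boundaries. This is only a sketch of the observed behaviour, not the actual WordDelimiterFilterFactory implementation, which has many more options; the function names are hypothetical.

```python
import re


def split_on_letter_digit_boundaries(token: str) -> list:
    """Split a token into runs of ASCII letters and runs of digits."""
    return re.findall(r"[A-Za-z]+|[0-9]+", token)


def keep_whole(token: str) -> list:
    """What the analysis chain yields once no filter splits on such boundaries."""
    return [token]


if __name__ == "__main__":
    print(split_on_letter_digit_boundaries("asdf5qwerty"))
    print(keep_whole("asdf5qwerty"))
```

Whatever change is made to the field type has to apply to both the index-time and query-time analyzers, and the documents need to be reindexed afterwards.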


  


Re: Tokenizing problem with numbers in query

2010-01-03 Thread Erick Erickson
This is an *extremely* useful page for figuring out what various
tokenizers/filters are doing. The javadocs for the classes
referenced can also provide some additional details.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Erick

On Sun, Jan 3, 2010 at 11:26 AM, Bernd Brod  wrote:

> Hello,
>
> when searching for a string: "asdf5qwerty" solr will tokenize it to:
> "asdf", "5", "qwerty" and display documents matching either string.
>
> How can i stop this behaviour and make it just search for plain
> "asdf5qwerty"?
>
> thanks in advance.
> Bernd
>


Re: SOLR: Replication

2010-01-03 Thread Peter Wolanin
Related to the difference between rsync and native Solr replication -
we are seeing issues with Solr 1.4 where search queries that come in
during a replication request hang for an excessive amount of time (up to
hundreds of seconds for a result that normally takes ~50 ms).

We are replicating pretty often (every 90 sec for multiple cores to
one slave server), but still did not think that replication would make
the master server unable to handle search requests.  Is there some
configuration option we are missing which would handle this situation
better?

Thanks,

Peter

On Sun, Jan 3, 2010 at 11:27 AM, Fuad Efendi  wrote:
> Thank you Yonik, excellent WIKI! I'll try without APR, I believe it's
> environmental issue; 100Mbps switched should do 10 times faster (current
> replica speed is 1Mbytes/sec)
>
>
>> -Original Message-
>> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
>> Seeley
>> Sent: January-03-10 10:03 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: SOLR: Replication
>>
>> On Sat, Jan 2, 2010 at 11:35 PM, Fuad Efendi  wrote:
>> > I tried... I set APR to improve performance... server is slow while
>> replica;
>> > but "top" shows only 1% of I/O wait... it is probably environment
>> specific;
>>
>> So you're saying that stock tomcat (non-native APR) was also 10 times
>> slower?
>>
>> > but the same happened in my home-based network, rsync was 10 times
>> faster...
>> > I don't know details of HTTP-replica, it could be base64 or something
>> like
>> > that; RAM-buffer, flush to disk, etc.
>>
>> The HTTP replication is using binary.
>> If you look here, it was benchmarked to be nearly as fast as rsync:
>> http://wiki.apache.org/solr/SolrReplication
>>
>> It does do a fsync to make sure that the files are on disk after
>> downloading, but that shouldn't make too much difference.
>>
>> -Yonik
>> http://www.lucidimagination.com
>
>
>



-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
peter.wola...@acquia.com


Re: SOLR Performance Tuning: Pagination

2010-01-03 Thread Peter Wolanin
At the NOVA Apache Lucene/Solr Meetup last May, one of the speakers
from Near Infinity (Aaron McCurry I think) mentioned that he had a
patch for lucene that enabled unlimited depth memory-efficient paging.
 Is anyone in contact with him?

-Peter

On Thu, Dec 24, 2009 at 11:27 AM, Grant Ingersoll  wrote:
>
> On Dec 24, 2009, at 11:09 AM, Fuad Efendi wrote:
>
>> I used pagination for a while till found this...
>>
>>
>> I have filtered query ID:[* TO *] returning 20 millions results (no
>> faceting), and pagination always seemed to be fast. However, fast only with
>> low values for start=12345. Queries like start=28838540 take 40-60 seconds,
>> and even cause OutOfMemoryException.
>
> Yeah, deep pagination in Lucene/Solr can be problematic due to the Priority 
> Queue management.  See http://issues.apache.org/jira/browse/LUCENE-2127 and 
> the linked discussion on java-dev.
>
>>
>> I use highlight, faceting on nontokenized "Country" field, standard handler.
>>
>>
>> It even seems to be a bug...
>>
>>
>> Fuad Efendi
>> +1 416-993-2060
>> http://www.linkedin.com/in/liferay
>>
>> Tokenizer Inc.
>> http://www.tokenizer.ca/
>> Data Mining, Vertical Search
>>
>>
>>
>>
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene: 
> http://www.lucidimagination.com/search
>
>



-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
peter.wola...@acquia.com
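
The priority-queue cost mentioned in the quoted reply can be illustrated with a generic top-k collector. The Python sketch below is not Lucene's collector code; it only shows, under the assumption that the queue must hold start + rows entries, why a request with start=28838540 is so much more expensive than one with a small start. All names are hypothetical.

```python
import heapq


def top_page(scores, start, rows):
    """Return the page of hits ranked start..start+rows-1 by descending score,
    plus the number of entries the queue had to hold to produce it."""
    needed = start + rows
    heap = []  # min-heap of the best `needed` (score, doc_id) pairs seen so far
    for doc_id, score in enumerate(scores):
        if len(heap) < needed:
            heapq.heappush(heap, (score, doc_id))
        elif (score, doc_id) > heap[0]:
            # Better than the weakest retained hit: swap it in.
            heapq.heapreplace(heap, (score, doc_id))
    ranked = sorted(heap, reverse=True)
    return ranked[start:start + rows], len(heap)


if __name__ == "__main__":
    page, queue_size = top_page([0.1, 0.9, 0.5, 0.7, 0.3], start=2, rows=2)
    print(page, queue_size)
```

Only `rows` hits are returned, but the queue grows with `start + rows`, which is where the memory and time go for very deep pages.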


Re: Remove the deleted docs from the Solr Index

2010-01-03 Thread Ravi Gidwani
Lance:
  At times we don't have the freedom to make these database changes.
I am currently in this situation, hence the requirement on the DIH.

~Ravi.


On Sat, Jan 2, 2010 at 3:44 PM, Lance Norskog  wrote:

> The other option is to have a 'deleted' column in your table, and have
> the application 'delete' operation set that field. In the DIH you
> query this column with 'deletedPkQuery'.
>
> Or, you can use triggers to maintain a new table with the IDs of
> deleted rows. This will allow you to have a batch job that deletes all
> IDs from this list.
>
> On Tue, Dec 29, 2009 at 10:40 AM, Mohamed Parvez  wrote:
> > Ditto. There should have been a DIH command to re-sync the index with
> > the DB.
> > Right now it looks like a one-way street from DB to index.
> >
> >
> > On Tue, Dec 29, 2009 at 3:07 AM, Ravi Gidwani wrote:
> >
> >> Hi Shalin:
> >>
> >> >   I get your point about not knowing what has been deleted from
> >> > the database. So this is what I am looking for:
> >> >
> >> > 0) A document (id=100) is currently part of the Solr index.
> >> > 1) Let's say the application deleted a record with id=100 from the
> >> > database.
> >> >
> >> > 2) Now I need to execute some DIH command to say "remove the document
> >> > where id=100". I don't expect the DIH to automatically detect what has
> >> > been deleted, but I am looking for a DIH command/special command to
> >> > request deletion from the index.
> >> >
> >> > Is that possible? As an alternative solution, is it possible to build
> >> > the index using DIH, and use the solr.XmlUpdateRequestHandler request
> >> > handler to delete/update these one-off documents?
> >> > Is this something you would recommend?
> >> >
> >> > Thanks,
> >> > ~Ravi Gidwani.
> >> >
> >> > On Tue, Dec 29, 2009 at 3:03 AM, Mohamed Parvez 
> >> wrote:
> >> >
> >> > > I have looked at that thread earlier. But there is no option there
> >> > > for a solution from the Solr side.
> >> > >
> >> > > I mean the two other options there are:
> >> > > 1] Use database triggers instead of DIH to manage updating the index:
> >> > > This is out of the question, as we can't run 1000-odd triggers every
> >> > > hour to delete.
> >> > >
> >> > > 2] Use some sort of ORM and its interception:
> >> > > This is also out of the question, as the deletes happen from an
> >> > > external system or directly on the database, not through our
> >> > > application.
> >> > >
> >> > > In short, Solr should have something to keep the index synced
> >> > > with the database. As of now it's a one-way street: updated rows in
> >> > > the DB will go to the index, but rows deleted in the DB will not be
> >> > > deleted from the index.
> >> >
> >> > >
> >> > >
> >> > How can Solr figure out what has been deleted? Should it go through
> >> > each row and compare it against each doc? Even then some things are not
> >> > possible (think indexed fields). It would be far more efficient to just
> >> > do a full-import each time instead.
> >> >
> >> > --
> >> > Regards,
> >> > Shalin Shekhar Mangar.
> >> >
> >> >
> >>
> >
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>
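
As a rough illustration of the "delete these one-off documents through the XML update handler" idea raised in this thread, the Python sketch below builds delete-by-id messages in Solr's XML update format. How the list of deleted IDs is obtained (a 'deleted' column, a trigger-maintained table, or something else) is left open; the function name and the sample IDs are hypothetical, and the messages would still have to be POSTed to the update handler and followed by a commit.

```python
from xml.sax.saxutils import escape


def build_delete_messages(ids):
    """Build one <delete> update message per document ID."""
    return [f"<delete><id>{escape(str(doc_id))}</id></delete>" for doc_id in ids]


if __name__ == "__main__":
    # Hypothetical IDs of rows that were removed from the database.
    for message in build_delete_messages([100, 101]):
        print(message)
```

One message per ID is used here to stay on the conservative side of what the update handler accepts.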


Any way to modify result ranking using an integer field?

2010-01-03 Thread Andy
Is there any way to modify result ranking using an integer field?

I have documents that have an integer field "popularity".

I want to rank results by a combination of normal fulltext search
relevance and popularity. It's kinda like search in Digg - result
ranking is based on the search relevance as well as how many diggs a
posting has.
I don't have any specific ranking algorithm in mind. But is this
something that can be done with Solr?





  

Re: Any way to modify result ranking using an integer field?

2010-01-03 Thread Ahmet Arslan

> Is there any way to modify result
> ranking using an integer field?
> 
> I have documents that have an integer field "popularity".
> 
> I want to rank results by a combination of normal fulltext
> search 
>  relevance and popularity. It's kinda like search in digg -
> result 
>  ranking is based on the search relevance as well as how
> many digs a 
>  posting has. 
>  I don't have any specific ranking algorithm in mind. But
> is this 
>  something that can be done with solr? 

Yes. 
http://lucene.apache.org/solr/api/org/apache/solr/search/BoostQParserPlugin.html
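
As I understand the boost query parser, it multiplies the relevance score of the wrapped query by the value of a function given in the b parameter. The Python sketch below shows that arithmetic and how such a query string could be assembled. It assumes Solr's log() function is base 10 and that popularity is greater than zero; the function names and the sample query are hypothetical.

```python
import math


def boosted_score(relevance: float, popularity: int) -> float:
    """Multiplicative boost: relevance * log10(popularity); needs popularity > 0."""
    return relevance * math.log10(popularity)


def boost_query(user_query: str, boost_function: str = "log(popularity)") -> str:
    """Wrap a user query in the boost parser's local-params syntax."""
    return "{!boost b=%s}%s" % (boost_function, user_query)


if __name__ == "__main__":
    print(boost_query("ipod"))
    print(boosted_score(2.0, 100))
```

Note that a popularity of 1 gives a multiplier of 0 under this function, so a real setup would likely need an offset or a different function.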





Indexing the latest MS Office documents

2010-01-03 Thread Roland Villemoes
Hi All,

Does anyone know how to index the latest MS Office documents like .docx and
.xlsx?

From searching, it seems that Tika only supports the earlier formats .doc and
.xls.



med venlig hilsen/best regards

Roland Villemoes
Tel: (+45) 22 69 59 62
E-Mail: mailto:r...@alpha-solutions.dk



Re: SOLR: Replication

2010-01-03 Thread Yonik Seeley
On Sun, Jan 3, 2010 at 2:55 PM, Peter Wolanin  wrote:
> Related to the difference between rsync and native Solr replication -
> we are seeing issues with Solr 1.4 where search queries that come in
> during a replication request hang for excessive amount of time (up to
> 100's of seconds for a result normally that takes ~50 ms).
>
> We are replicating pretty often (every 90 sec for multiple cores to
> one slave server), but still did not think that replication would make
> the master server unable to handle search requests.  Is there some
> configuration option we are missing which would handle this situation
> better?

Hmmm, any other clues about what's happening during this time?
If it's not a bug, it could simply be that reading a large index to
serve it to a slave could throw out the important parts of the OS
cache that caused searches to be faster.

If it is a bug, well then we certainly want to get to the bottom of it!

-Yonik
http://www.lucidimagination.com


Re: Indexing the latest MS Office documents

2010-01-03 Thread Mattmann, Chris A (388J)
Hi Roland,

You probably want to send your email to tika-u...@lucene.apache.org.

Best of luck!

Cheers,
Chris



On 1/3/10 4:00 PM, "Roland Villemoes"  wrote:

> Hi All,
> 
> Anyone who knows how to index the latest MS office documents like .docx and
> .xlsx  ?
> 
> From searching it seems like Tika only supports the earlier formats .doc and
> .xls
> 
> 
> 
> med venlig hilsen/best regards
> 
> Roland Villemoes
> Tel: (+45) 22 69 59 62
> E-Mail: mailto:r...@alpha-solutions.dk
> 
> 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++




Rules engine and Solr

2010-01-03 Thread Avlesh Singh
I have a Solr (version 1.3) powered search server running in production.
Search is keyword driven and is supported by custom fields and tokenizers.

I am planning to build a rules engine on top of search. The rules are database
driven and can't be stored inside Solr indexes. These rules would ultimately
do two things:

   1. Change the order of Lucene hits.
   2. Add/remove some results to/from the Lucene hits.

What should be my starting point? Custom search handler?

Cheers
Avlesh


Re: performance question

2010-01-03 Thread A. Steven Anderson
> Sorting and index norms have space penalties.
> Sorting on a field creates an array of Java ints, one for every
> document in the index. Index norms (used for boosting documents and
> other things) create an array of bytes in the Lucene index files, one
> for every document in the index.
> If you sort on many of your dynamic fields your memory use will
> explode, and the same with index norms and disk space.


Thanks for the info.  In general, I knew sorting was expensive, but I didn't
realize that dynamic fields made it worse.

-- 
A. Steven Anderson
Independent Consultant
st...@asanderson.com


Re: performance question

2010-01-03 Thread Chris Hostetter

: > If you sort on many of your dynamic fields your memory use will
: > explode, and the same with index norms and disk space.

: Thanks for the info.  In general, I knew sorting was expensive, but I didn't
: realize that dynamic fields made it worse.

Dynamic fields don't make it worse ... the number of actual field names
you sort on makes it worse.

If you sort on 100 fields, the cost is the same regardless of whether all
100 of those fields exist because of a single <dynamicField/> declaration,
or 100 distinct <field/> declarations.


-Hoss
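
A back-of-the-envelope Python sketch of the costs described in this thread: sorting needs one Java int (4 bytes) per document for each distinct field sorted on, and norms need one byte per document for each field that keeps them. The function names and the index size in the example are hypothetical, and the estimate covers only these arrays, not any other per-field overhead.

```python
def sort_cache_bytes(num_docs: int, num_sort_fields: int, bytes_per_value: int = 4) -> int:
    """Memory for sort arrays: one value per document per distinct sort field."""
    return num_docs * num_sort_fields * bytes_per_value


def norms_bytes(num_docs: int, num_fields_with_norms: int) -> int:
    """Space for index norms: one byte per document per field with norms."""
    return num_docs * num_fields_with_norms


if __name__ == "__main__":
    docs = 1_000_000  # hypothetical index size
    print(sort_cache_bytes(docs, 100) / 1e6, "MB for sorting on 100 fields")
    print(norms_bytes(docs, 100) / 1e6, "MB of norms for 100 fields")
```

The count that matters is the number of distinct field names actually sorted on, regardless of how those fields were declared.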



Re: performance question

2010-01-03 Thread A. Steven Anderson
>
> Dynamic fields don't make it worse ... the number of actual field names
> you sort on makes it worse.
>
> If you sort on 100 fields, the cost is the same regardless of whether all
> 100 of those fields exist because of a single <dynamicField/> declaration,
> or 100 distinct <field/> declarations.
>

Ahh...thanks for the clarification.

So, in general, there is no *significant* performance difference with using
dynamic fields. Correct?


-- 
A. Steven Anderson
Independent Consultant
st...@asanderson.com


Re: Any way to modify result ranking using an integer field?

2010-01-03 Thread Andy
Thanks Ahmet.

Do I need to do anything to enable BoostQParserPlugin in Solr, or is it already 
enabled?

--- On Sun, 1/3/10, Ahmet Arslan  wrote:

From: Ahmet Arslan 
Subject: Re: Any way to modify result ranking using an integer field?
To: solr-user@lucene.apache.org
Date: Sunday, January 3, 2010, 5:45 PM


> Is there any way to modify result
> ranking using an integer field?
> 
> I have documents that have an integer field "popularity".
> 
> I want to rank results by a combination of normal fulltext
> search 
>  relevance and popularity. It's kinda like search in digg -
> result 
>  ranking is based on the search relevance as well as how
> many digs a 
>  posting has. 
>  I don't have any specific ranking algorithm in mind. But
> is this 
>  something that can be done with solr? 

Yes. 
http://lucene.apache.org/solr/api/org/apache/solr/search/BoostQParserPlugin.html






  

Search algorithm used in Solr

2010-01-03 Thread abhishes
Hello everyone,

Is there an article which explains (on a high level) the algorithm of search in 
Solr?

How does Solr search approach compare to the "inverted index" technique?

Regards,
Abhishek

RE: Reverse sort facet query [SOLR-1672]

2010-01-03 Thread Chris Hostetter

: Yes, I thought about adding some 'new syntax', but I opted for a separate
: 'facet.sortorder' parameter, mainly because I'm not familiar enough with
: the codebase to know what effect this might have on backward
: compatibility. It would be easy enough to modify the patch I created
: to do it this way.

It shouldn't really affect anything -- it wouldn't really be new syntax,
just extending the existing "sort" param syntax to apply to the
"facet.sort" param.  The only back compat concern is making sure we
continue to support true/false as aliases, and having the default order
match the current behavior if asc/desc aren't specified.


-Hoss



Re: Any way to modify result ranking using an integer field?

2010-01-03 Thread Andy
What I meant was: is there any way to make {!boost b=log(popularity)} the
default query type, so that every query uses it?

From: Andy 
Subject: Re: Any way to modify result ranking using an integer field?
To: solr-user@lucene.apache.org
Date: Monday, January 4, 2010, 1:08 AM

Thanks Ahmet.

Do I need to do anything to enable BoostQParserPlugin in Solr, or is it already 
enabled?