Re: Using IDF to find Collocations and SIPs?
Please unsubscribe me.

On 12/28/09, Subscriptions wrote:
> I am trying to write a query analyzer to pull:
>
> 1. Common phrases (also known as collocations) within a query
>
> 2. Highly unusual phrases (also known as Statistically Improbable Phrases, or SIPs) within a query
>
> The collocations would be similar to facets, except I am also trying to get multi-word
> phrases as well as single terms. So I suppose I could write something that does a chained
> query off the facet query, looking for words in proximity. Conceptually (as I understand it)
> this should just be a question of using the IDF (inverse document frequency, i.e. a measure
> of how rare a term is across the index).
>
> * Has anyone tried to write an analyzer that looks for the words that typically occur
> within a given proximity of another word?
>
> The highly unusual phrases, on the other hand, require getting a handle on the IDF, which
> at present only appears to be available via the explain function of debugging.
>
> * Has anyone written something to go directly after the IDF score only?
>
> * If I do have to go down the path of writing this from scratch, is the
> org.apache.lucene.search.Similarity class the one to leverage?
>
> Most grateful for any feedback or insights,
>
> Christopher
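For reference, the classic Lucene IDF (as computed by DefaultSimilarity in that era) can be sketched in plain Java; a large value flags a rare term, which is exactly the signal wanted for SIP detection. This is an illustrative sketch, not Solr API code:

```java
// Sketch of Lucene's classic IDF formula:
//   idf(t) = 1 + ln(numDocs / (docFreq + 1))
// A high value means the term is rare across the index.
public class IdfSketch {
    /** numDocs = documents in the index; docFreq = documents containing the term. */
    static double idf(int numDocs, int docFreq) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        // A term in 4 of 1000 docs scores much higher than one in 499 of 1000:
        System.out.println(idf(1000, 4));   // rare term  -> large IDF
        System.out.println(idf(1000, 499)); // common term -> small IDF
    }
}
```

Ranking candidate phrases by a combined IDF of their terms would be one rough way to surface "statistically improbable" ones.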
Re: solrJ and spell check queries
Hi, Jay Fisher wrote: I'm trying to find a way to formulate the following query in solrJ. This is the only way I can get the desired result but I can't figure out how to get solrJ to generate the same query string. It always generates a url that starts with select and I need it to start with spell. If there is an alternative url string that will work please let me know. http://solr-server/spell/?indent=on&q=shert&wt=json&spellcheck=true&spellcheck.collate=true In case you hook SpellCheckComponent directly into the standard request handler, i.e., /select, http://solr-server/select?indent=on&q=shert&wt=json&spellcheck=true&spellcheck.collate=true should work. -Sascha
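Sascha's suggestion of hooking the SpellCheckComponent into the standard handler is done in solrconfig.xml; a sketch following the 1.4 example config (it assumes a searchComponent named "spellcheck" is already defined there):

```xml
<!-- Sketch for solrconfig.xml: attach the existing "spellcheck"
     searchComponent to the standard handler via last-components,
     so /select accepts spellcheck.* parameters. -->
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="spellcheck.dictionary">default</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```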
Re: solrJ and spell check queries
Thank you. That did it. ~ Jay On Sun, Jan 3, 2010 at 7:21 AM, Sascha Szott wrote: > Hi, > > > Jay Fisher wrote: > >> I'm trying to find a way to formulate the following query in solrJ. This >> is >> the only way I can get the desired result but I can't figure out how to >> get >> solrJ to generate the same query string. It always generates a url that >> starts with select and I need it to start with spell. If there is an >> alternative url string that will work please let me know. >> >> >> http://solr-server/spell/?indent=on&q=shert&wt=json&spellcheck=true&spellcheck.collate=true >> >> In case you hook SpellCheckComponent directly into the standard request > handler, i.e., /select, > > > http://solr-server/select?indent=on&q=shert&wt=json&spellcheck=true&spellcheck.collate=true > > should work. > > -Sascha > > >
Re: SOLR: Replication
On Sat, Jan 2, 2010 at 11:35 PM, Fuad Efendi wrote:
> I tried... I set APR to improve performance... server is slow while replicating;
> but "top" shows only 1% of I/O wait... it is probably environment specific;

So you're saying that stock Tomcat (non-native APR) was also 10 times slower?

> but the same happened in my home-based network, rsync was 10 times faster...
> I don't know details of HTTP replication, it could be base64 or something like
> that; RAM buffer, flush to disk, etc.

The HTTP replication uses a binary format. If you look here, it was benchmarked to be nearly as fast as rsync: http://wiki.apache.org/solr/SolrReplication

It does an fsync to make sure that the files are on disk after downloading, but that shouldn't make too much difference.

-Yonik
http://www.lucidimagination.com
Tokenizing problem with numbers in query
Hello, when searching for the string "asdf5qwerty", Solr will tokenize it to "asdf", "5", "qwerty" and display documents matching any of those tokens. How can I stop this behaviour and make it search for the literal "asdf5qwerty"? Thanks in advance. Bernd
RE: SOLR: Replication
Thank you Yonik, excellent wiki! I'll try without APR; I believe it's an environmental issue. A 100Mbps switched network should be about 10 times faster (current replication speed is 1 MB/sec).

> -----Original Message-----
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
> Sent: January-03-10 10:03 AM
> To: solr-user@lucene.apache.org
> Subject: Re: SOLR: Replication
>
> On Sat, Jan 2, 2010 at 11:35 PM, Fuad Efendi wrote:
> > I tried... I set APR to improve performance... server is slow while replica;
> > but "top" shows only 1% of I/O wait... it is probably environment specific;
>
> So you're saying that stock tomcat (non-native APR) was also 10 times slower?
>
> > but the same happened in my home-based network, rsync was 10 times faster...
> > I don't know details of HTTP-replica, it could be base64 or something like
> > that; RAM-buffer, flush to disk, etc.
>
> The HTTP replication is using binary.
> If you look here, it was benchmarked to be nearly as fast as rsync:
> http://wiki.apache.org/solr/SolrReplication
>
> It does do a fsync to make sure that the files are on disk after
> downloading, but that shouldn't make too much difference.
>
> -Yonik
> http://www.lucidimagination.com
Re: Tokenizing problem with numbers in query
> when searching for a string: "asdf5qwerty" solr will > tokenize it to: > "asdf", "5", "qwerty" and display documents matching either > string. > > How can i stop this behaviour and make it just search for > plain > "asdf5qwerty"? What is the type of your field? If you have solr.WordDelimiterFilterFactory in your analysis chain, remove it. In admin/analysis.jsp you can see which tokenizer/tokenfilter is breaking "asdf5qwerty" into "asdf", "5", "qwerty".
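A field type that leaves "asdf5qwerty" intact might look like this in schema.xml; the key point is that there is no WordDelimiterFilterFactory in the analysis chain (the type name and remaining filters are illustrative, and depend on your other requirements):

```xml
<!-- Sketch for schema.xml: whitespace tokenization plus lowercasing only,
     so "asdf5qwerty" survives as a single token at index and query time. -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```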
Re: Tokenizing problem with numbers in query
This is an *extremely* useful page for figuring out what various tokenizers/filters are doing. The javadocs for the classes referenced can also provide some additional details http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters Erick On Sun, Jan 3, 2010 at 11:26 AM, Bernd Brod wrote: > Hello, > > when searching for a string: "asdf5qwerty" solr will tokenize it to: > "asdf", "5", "qwerty" and display documents matching either string. > > How can i stop this behaviour and make it just search for plain > "asdf5qwerty"? > > thanks in advance. > Bernd >
Re: SOLR: Replication
Related to the difference between rsync and native Solr replication: we are seeing issues with Solr 1.4 where search queries that come in during a replication request hang for an excessive amount of time (hundreds of seconds for a result that normally takes ~50 ms).

We are replicating fairly often (every 90 sec for multiple cores to one slave server), but still did not expect that replication would make the master server unable to handle search requests. Is there some configuration option we are missing which would handle this situation better?

Thanks,
Peter

On Sun, Jan 3, 2010 at 11:27 AM, Fuad Efendi wrote:
> Thank you Yonik, excellent wiki! I'll try without APR, I believe it's an
> environmental issue; 100Mbps switched should do 10 times faster (current
> replica speed is 1 MB/sec)

--
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia. Inc.
peter.wola...@acquia.com
Re: SOLR Performance Tuning: Pagination
At the NOVA Apache Lucene/Solr Meetup last May, one of the speakers from Near Infinity (Aaron McCurry I think) mentioned that he had a patch for lucene that enabled unlimited depth memory-efficient paging. Is anyone in contact with him? -Peter On Thu, Dec 24, 2009 at 11:27 AM, Grant Ingersoll wrote: > > On Dec 24, 2009, at 11:09 AM, Fuad Efendi wrote: > >> I used pagination for a while till found this... >> >> >> I have filtered query ID:[* TO *] returning 20 millions results (no >> faceting), and pagination always seemed to be fast. However, fast only with >> low values for start=12345. Queries like start=28838540 take 40-60 seconds, >> and even cause OutOfMemoryException. > > Yeah, deep pagination in Lucene/Solr can be problematic due to the Priority > Queue management. See http://issues.apache.org/jira/browse/LUCENE-2127 and > the linked discussion on java-dev. > >> >> I use highlight, faceting on nontokenized "Country" field, standard handler. >> >> >> It even seems to be a bug... >> >> >> Fuad Efendi >> +1 416-993-2060 >> http://www.linkedin.com/in/liferay >> >> Tokenizer Inc. >> http://www.tokenizer.ca/ >> Data Mining, Vertical Search >> >> >> >> > > -- > Grant Ingersoll > http://www.lucidimagination.com/ > > Search the Lucene ecosystem using Solr/Lucene: > http://www.lucidimagination.com/search > > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Remove the deleted docs from the Solr Index
Lance: At times we don't have the freedom to make these database changes. Currently I am in this situation. Hence the requirement on the DIH. ~Ravi.

On Sat, Jan 2, 2010 at 3:44 PM, Lance Norskog wrote:
> The other option is to have a 'deleted' column in your table, and have
> the application 'delete' operation set that field. In the DIH you
> query this column with 'deletedPkQuery'.
>
> Or, you can use triggers to maintain a new table with the IDs of
> deleted rows. This will allow you to have a batch job that deletes all
> IDs from this list.
>
> On Tue, Dec 29, 2009 at 10:40 AM, Mohamed Parvez wrote:
> > Ditto. There should have been a DIH command to re-sync the index with the
> > DB. Right now it looks like a one-way street from DB to index.
> >
> > On Tue, Dec 29, 2009 at 3:07 AM, Ravi Gidwani wrote:
> >> Hi Shalin:
> >>
> >> I get your point about not knowing what has been deleted from the
> >> database. So this is what I am looking for:
> >>
> >> 0) A document (id=100) is currently part of the Solr index.
> >> 1) Let's say the application deleted a record with id=100 from the database.
> >> 2) Now I need to execute some DIH command to say "remove document where
> >> id=100". I don't expect the DIH to automatically detect what has been
> >> deleted, but I am looking for a DIH command/special command to request
> >> deletion from the index.
> >>
> >> Is that possible? Also, as an alternate solution, is it possible to build
> >> the index using DIH, and use the solr.XmlUpdateRequestHandler request
> >> handler to delete/update these one-off documents? Is this something you
> >> would recommend?
> >>
> >> Thanks,
> >> ~Ravi Gidwani.
> >>
> >> On Tue, Dec 29, 2009 at 3:03 AM, Mohamed Parvez wrote:
> >> > I have looked in that thread earlier. But there is no option there for
> >> > a solution from the Solr side.
> >> >
> >> > I mean the two more options there are:
> >> > 1] Use database triggers instead of DIH to manage updating the index:
> >> > This is out of the question, as we can't run 1000-odd triggers every
> >> > hour to delete.
> >> >
> >> > 2] Some sort of ORM with interception:
> >> > This is also out of the question, as the deletes happen from an external
> >> > system or directly on the database, not through our application.
> >> >
> >> > To say it in short, Solr should have something to keep the index synced
> >> > with the database. As of now it's a one-way street: updated rows in the
> >> > DB will go to the index, but deleted rows in the DB will not be deleted
> >> > from the index.
> >>
> >> How can Solr figure out what has been deleted? Should it go through each
> >> row and compare against each doc? Even then some things are not possible
> >> (think indexed fields). It would be far more efficient to just do a
> >> full-import each time instead.
> >>
> >> --
> >> Regards,
> >> Shalin Shekhar Mangar.

--
Lance Norskog
goks...@gmail.com
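Lance's 'deleted' column suggestion maps onto DIH's data-config.xml roughly as follows; the table and column names are illustrative. On a delta-import, DIH runs deletedPkQuery and removes the returned primary keys from the index:

```xml
<!-- Sketch for data-config.xml (names are illustrative):
     deletedPkQuery tells delta-import which ids to purge from the index. -->
<entity name="item" pk="id"
        query="SELECT id, title FROM item WHERE deleted = 0"
        deltaQuery="SELECT id FROM item
                    WHERE last_modified > '${dataimporter.last_index_time}'"
        deletedPkQuery="SELECT id FROM item WHERE deleted = 1"/>
```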
Any way to modify result ranking using an integer field?
Is there any way to modify result ranking using an integer field? I have documents that have an integer field "popularity". I want to rank results by a combination of normal fulltext search relevance and popularity. It's kinda like search in digg - result ranking is based on the search relevance as well as how many digs a posting has. I don't have any specific ranking algorithm in mind. But is this something that can be done with solr?
Re: Any way to modify result ranking using an integer field?
> Is there any way to modify result > ranking using an integer field? > > I have documents that have an integer field "popularity". > > I want to rank results by a combination of normal fulltext > search > relevance and popularity. It's kinda like search in digg - > result > ranking is based on the search relevance as well as how > many digs a > posting has. > I don't have any specific ranking algorithm in mind. But > is this > something that can be done with solr? Yes. http://lucene.apache.org/solr/api/org/apache/solr/search/BoostQParserPlugin.html
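A typical request using the boost query parser, assuming an indexed "popularity" field (the query term is illustrative). Note that log(0) is undefined, so it's safer to guard zero values with something like log(max(popularity,1)):

```
http://solr-server/select?q={!boost b=log(popularity)}ipod
```

This multiplies the normal relevance score by the function value, which matches the digg-style "relevance times popularity" ranking described above.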
Indexing the latest MS Office documents
Hi All, Does anyone know how to index the latest MS Office documents like .docx and .xlsx? From searching, it seems like Tika only supports the earlier formats .doc and .xls. med venlig hilsen/best regards Roland Villemoes Tel: (+45) 22 69 59 62 E-Mail: r...@alpha-solutions.dk
Re: SOLR: Replication
On Sun, Jan 3, 2010 at 2:55 PM, Peter Wolanin wrote: > Related to the difference between rsync and native Solr replication - > we are seeing issues with Solr 1.4 where search queries that come in > during a replication request hang for excessive amount of time (up to > 100's of seconds for a result normally that takes ~50 ms). > > We are replicating pretty often (every 90 sec for multiple cores to > one slave server), but still did not think that replication would make > the master server unable to handle search requests. Is there some > configuration option we are missing which would handle this situation > better? Hmmm, any other clues about what's happening during this time? If it's not a bug, it could simply be that reading a large index to serve it to a slave could throw out the important parts of the OS cache that caused searches to be faster. If it is a bug, well then we certainly want to get to the bottom of it! -Yonik http://www.lucidimagination.com
Re: Indexing the latest MS Office documents
Hi Roland, You probably want to send your email to tika-u...@lucene.apache.org. Best of luck! Cheers, Chris On 1/3/10 4:00 PM, "Roland Villemoes" wrote: > Hi All, > > Anyone who knows how to index the latest MS office documents like .docx and > .xlsx ? > > From searching it seems like Tika only supports the earlier formats .doc and > .xls > > > > med venlig hilsen/best regards > > Roland Villemoes > Tel: (+45) 22 69 59 62 > E-Mail: mailto:r...@alpha-solutions.dk > > ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.mattm...@jpl.nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Rules engine and Solr
I have a Solr (version 1.3) powered search server running in production. Search is keyword driven and is supported using custom fields and tokenizers. I am planning to build a rules engine on top of search. The rules are database-driven and can't be stored inside Solr indexes. These rules would ultimately do two things: 1. Change the order of Lucene hits. 2. Add/remove some results to/from the Lucene hits. What should be my starting point? A custom search handler? Cheers Avlesh
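One common starting point is a post-processing pass over the hits Solr returns, before they reach the client; a custom SearchComponent or handler would be the deeper integration point. The sketch below is plain Java with hypothetical Hit/rule shapes, not any Solr API, just to show the two operations (re-order, add/remove) the rules must perform:

```java
import java.util.*;

// Hypothetical post-processing pass: Solr returns the top-N hits, then
// database-driven rules adjust membership and ordering before the results
// go back to the client. Hit and the rule inputs are illustrative shapes.
public class RulesPostProcessor {
    static final class Hit {
        final String id;
        final float score;
        Hit(String id, float score) { this.id = id; this.score = score; }
    }

    /** Drop banned ids (rule: remove), then re-sort by score plus a per-id boost (rule: re-order). */
    static List<Hit> apply(List<Hit> hits, Set<String> banned, Map<String, Float> boosts) {
        List<Hit> out = new ArrayList<>();
        for (Hit h : hits) {
            if (banned.contains(h.id)) continue;                       // rule 2: remove
            out.add(new Hit(h.id, h.score + boosts.getOrDefault(h.id, 0f))); // rule 1: re-score
        }
        out.sort((a, b) -> Float.compare(b.score, a.score));           // highest adjusted score first
        return out;
    }
}
```

The obvious limitation of post-processing is that it only sees the page of hits Solr returned; rules that should pull in documents outside the top N need to run inside the query itself (e.g. as added clauses or function boosts).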
Re: performance question
> Sorting and index norms have space penalties. > Sorting on a field creates an array of Java ints, one for every > document in the index. Index norms (used for boosting documents and > other things) create an array of bytes in the Lucene index files, one > for every document in the index. > If you sort on many of your dynamic fields your memory use will > explode, and the same with index norms and disk space. Thanks for the info. In general, I knew sorting was expensive, but I didn't realize that dynamic fields made it worse. -- A. Steven Anderson Independent Consultant st...@asanderson.com
Re: performance question
: > If you sort on many of your dynamic fields your memory use will
: > explode, and the same with index norms and disk space.
: Thanks for the info. In general, I knew sorting was expensive, but I didn't
: realize that dynamic fields made it worse.

dynamic fields don't make it worse ... the number of actual field names you sort on makes it worse.

If you sort on 100 fields, the cost is the same regardless of whether all 100 of those fields exist because of a single declaration, or 100 distinct declarations.

-Hoss
Re: performance question
> dynamic fields don't make it worse ... the number of actual field names
> you sort on makes it worse.
>
> If you sort on 100 fields, the cost is the same regardless of whether all
> 100 of those fields exist because of a single declaration,
> or 100 distinct declarations.

Ahh... thanks for the clarification. So, in general, there is no *significant* performance difference with using dynamic fields. Correct?

--
A. Steven Anderson
Independent Consultant
st...@asanderson.com
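The cost described in this thread is easy to put rough numbers on: one Java int per document per distinct field name sorted on. A back-of-the-envelope sketch (the index size and field count are illustrative, and this ignores string sort fields and other field-cache overhead):

```java
// Rough memory math for field-cache sorting, per the thread above:
// one 4-byte int per document, per distinct sorted field name.
public class SortCacheMath {
    static long sortCacheBytes(long numDocs, int sortedFields) {
        return numDocs * 4L * sortedFields;
    }

    public static void main(String[] args) {
        long bytes = sortCacheBytes(20_000_000L, 100); // 20M docs, 100 sorted fields
        System.out.println((bytes / (1024 * 1024)) + " MB of heap just for sort caches");
    }
}
```

So sorting on 100 distinct field names over a 20M-document index costs on the order of 8 GB of heap, whether those names come from one dynamicField declaration or 100 explicit ones.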
Re: Any way to modify result ranking using an integer field?
Thanks Ahmet. Do I need to do anything to enable BoostQParserPlugin in Solr, or is it already enabled? --- On Sun, 1/3/10, Ahmet Arslan wrote: From: Ahmet Arslan Subject: Re: Any way to modify result ranking using an integer field? To: solr-user@lucene.apache.org Date: Sunday, January 3, 2010, 5:45 PM > Is there any way to modify result > ranking using an integer field? > > I have documents that have an integer field "popularity". > > I want to rank results by a combination of normal fulltext > search > relevance and popularity. It's kinda like search in digg - > result > ranking is based on the search relevance as well as how > many digs a > posting has. > I don't have any specific ranking algorithm in mind. But > is this > something that can be done with solr? Yes. http://lucene.apache.org/solr/api/org/apache/solr/search/BoostQParserPlugin.html
Search algorithm used in Solr
Hello everyone, Is there an article which explains (at a high level) the search algorithm used in Solr? How does Solr's search approach compare to the "inverted index" technique? Regards, Abhishek
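To the question above: Solr's search *is* the inverted-index technique; Lucene underneath maintains a mapping from each term to the list of documents containing it, and a query is answered by looking up its terms and merging or intersecting those postings lists (plus scoring, e.g. TF-IDF). A toy sketch of the idea, nothing like Lucene's actual on-disk format:

```java
import java.util.*;

// Toy inverted index: term -> sorted set of doc ids containing it.
// An AND query intersects the postings lists of its terms.
public class MiniInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+"))
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
    }

    /** AND query: ids of docs containing every term. */
    Set<Integer> search(String... terms) {
        Set<Integer> result = null;
        for (String t : terms) {
            Set<Integer> docs = postings.getOrDefault(t.toLowerCase(), new TreeSet<>());
            if (result == null) result = new TreeSet<>(docs); // first term seeds the set
            else result.retainAll(docs);                      // later terms intersect it
        }
        return result == null ? Set.of() : result;
    }
}
```

Lucene adds compressed on-disk postings, positions (for phrases), and relevance scoring on top of this basic structure, but the lookup-and-intersect idea is the same.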
RE: Reverse sort facet query [SOLR-1672]
: Yes, I thought about adding some 'new syntax', but I opted for a separate 'facet.sortorder' parameter,
: mainly because I'm not familiar enough with the codebase to know what effect this might have on
: backward compatibility. It would be easy enough to modify the patch I created to do it this way.

it shouldn't really affect anything -- it wouldn't really be new syntax, just extending the existing "sort" param syntax to apply to the "facet.sort" param. The only back-compat concern is making sure we continue to support true/false as aliases, and having the default order match the current behavior if asc/desc aren't specified.

-Hoss
Re: Any way to modify result ranking using an integer field?
What I meant was: is there any way to make {!boost b=log(popularity)} the default query type, so that every query will use it?

From: Andy
Subject: Re: Any way to modify result ranking using an integer field?
To: solr-user@lucene.apache.org
Date: Monday, January 4, 2010, 1:08 AM

Thanks Ahmet. Do I need to do anything to enable BoostQParserPlugin in Solr, or is it already enabled?
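A similar "boost every query" effect can be configured as request-handler defaults rather than a prefix on every q. One sketch, using dismax whose bf parameter applies a function boost to every request (bf is additive, unlike the multiplicative {!boost}; the handler name and qf field below are illustrative):

```xml
<!-- Sketch for solrconfig.xml: every request to this handler gets the
     log(popularity) boost without clients changing their q parameter. -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">text</str>
    <str name="bf">log(popularity)</str>
  </lst>
</requestHandler>
```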