RE: solr highlighting
Thanks, Mike. Sorry I was busy with something else.

What does it mean "field F must have an analyzer defined"?

My F defined as:

text is defined as:

Do you see anything wrong there?

Thanks,
- Kevin

-Original Message-
From: Mike Klaas [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 14, 2008 12:03 PM
To: solr-user@lucene.apache.org
Subject: Re: solr highlighting

The minimum "stuff" needed to highlight term X in field F is:

- field F must be 'stored'
- field F must have an analyzer defined
- a query with term X is sent (e.g., q=X) with parameters hl=true (or 'on'), hl.fl=F

Try it on the example:

1. get the example running
2. cd example/exampledocs
3. ./post.sh *.xml
4. execute a query:
http://localhost:8983/solr/select?indent=on&version=2.2&q=solr&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&hl=on&hl.fl=features

-Mike

On 14-May-08, at 9:39 AM, Kevin Xiao wrote:

> Thanks Christian. I did try many options indicated in wiki, didn't
> work. So I want to see if the basics work, i.e. only define hl=true
> and a field for hl.fl. Do I need to include something global to make
> hl settings work?
>
> Thanks,
> - Kevin
>
> -Original Message-
> From: Christian Vogler [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, May 14, 2008 5:55 AM
> To: solr-user@lucene.apache.org
> Subject: Re: solr highlighting
>
> On Wednesday 14 May 2008 09:21:36 Kevin Xiao wrote:
>> Hi there,
>>
>> I am new to solr. I want search term to be highlighted on the
>> results. I thought it is pretty simple, but could not make it work.
>> I read a lot of solr documents and mail archives (I wish there is a
>> search function for this, we are talking about solr, aren’t we? ☺).
>
> Take a look at hl.fragsize, hl.snippets, and hl.mergeContiguous, as
> per http://wiki.apache.org/solr/HighlightingParameters.
>
> In particular, setting hl.fragsize to 0 might be what you want if I
> understand your question correctly.
>
> Best regards
> - Christian
> --
> Christian Vogler, Ph.D.
> Institute for Language and Speech Processing, Athens, Greece
> http://gri.gallaudet.edu/~cvogler/
> [EMAIL PROTECTED]
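
As a rough illustration of Mike's recipe, here is a minimal Python sketch that builds a highlighting request URL. The base URL and the `features` field come from his example query; the helper name `highlight_url` and the `rows` default are assumptions made for this sketch, not part of Solr.

```python
from urllib.parse import urlencode

def highlight_url(base, term, field, rows=10):
    """Build a select URL asking Solr to highlight `term` in `field`.

    Hypothetical helper: only q, hl and hl.fl are taken from the thread.
    """
    params = {
        "q": term,       # a query containing the term to highlight
        "hl": "true",    # switch highlighting on ('on' is also accepted per the thread)
        "hl.fl": field,  # field to highlight; it must be stored and have an analyzer
        "rows": rows,
    }
    return base + "?" + urlencode(params)

url = highlight_url("http://localhost:8983/solr/select", "solr", "features")
print(url)
```

If highlighting still comes back empty with a request this small, the field definition (stored, analyzed) is the next thing to check, as the message above suggests.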
Indexing HTML Content
Hello, In my application I wish to index articles which are stored in HTML format. Upon indexing these the html gets stored along with the content of the article, which is undesirable. Do you know of any common way of parsing the text content from HTML before adding to SOLR? I understand SOLR 1.3 has an HTML analyser, but I am using SOLR 1.2 and won't use 1.3 until it's stable, so looking for a solution to work on a batch of files before being added to SOLR. Thanks, John
Re: Indexing HTML Content
Hi,

Maybe this one?
http://htmlparser.sourceforge.net/

/Jimi

Quoting "McBride, John" <[EMAIL PROTECTED]>:

> Hello,
>
> In my application I wish to index articles which are stored in HTML
> format.
>
> Upon indexing these the html gets stored along with the content of the
> article, which is undesirable.
>
> Do you know of any common way of parsing the text content from HTML
> before adding to SOLR? I understand SOLR 1.3 has an HTML analyser, but
> I am using SOLR 1.2 and won't use 1.3 until it's stable, so looking for
> a solution to work on a batch of files before being added to SOLR.
>
> Thanks,
> John
Re: Indexing HTML Content
Actually, it's very easy:

http://us2.php.net/strip_tags

I also store the data in a separate field with the html intact for display. In that case, I use urlencode on the string.

David

McBride, John wrote:

> Hello,
>
> In my application I wish to index articles which are stored in HTML
> format.
>
> Upon indexing these the html gets stored along with the content of the
> article, which is undesirable.
>
> Do you know of any common way of parsing the text content from HTML
> before adding to SOLR? I understand SOLR 1.3 has an HTML analyser, but
> I am using SOLR 1.2 and won't use 1.3 until it's stable, so looking for
> a solution to work on a batch of files before being added to SOLR.
>
> Thanks,
> John

--
They must find it difficult, those who have taken authority as truth, rather than truth as authority. - Gerald Massey
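
For a batch job outside PHP, the same strip-before-indexing idea can be sketched with Python's standard library. This is only an illustration under assumptions: the class name `TextExtractor` and the function `strip_html` are made up for this example, and skipping `<script>` and `<style>` content is a design choice of the sketch, not something the thread specifies.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turn &amp; etc. into characters
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)

def strip_html(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

sample = ("<html><body><h1>Title</h1><p>Hello &amp; <b>welcome</b> back</p>"
          "<script>var x = 1;</script></body></html>")
print(strip_html(sample))  # Title Hello & welcome back
```

The stripped text would go into the indexed field, while the original markup can be kept in a separate stored field for display, as David describes.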
RE: SOLR OOM (out of memory) problem
Hi Rong,

My cache hit ratios are:

filtercache: 0.96
documentcache: 0.51
queryresultcache: 0.58

Thanx
Pravesh

Yongjun Rong-2 wrote:
>
> I had the same problem some weeks before. You can try these:
> 1. Check the hit ratio for the cache via the solr/admin/stats.jsp. If
> the hit ratio is very low. Just disable those cache. It will save you
> some memory.
> 2. set -Xms and -Xmx to the same size will help improve GC performance.
> 3. Check what's GC do you use? Default will be parallel. You can try use
> concurrent GC which will help a lot.
> 4. This is my sun hotspot jvm startup options: -XX:+UseConcMarkSweepGC
> -XX:CMSInitiatingOccupancyFraction=50 -XX:-UseGCOverheadLimit
> The above cannot solve the OOM forever. But they help a lot.
> Wish this can help.
>
> -Original Message-
> From: Mike Klaas [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, May 21, 2008 2:23 PM
> To: solr-user@lucene.apache.org
> Subject: Re: SOLR OOM (out of memory) problem
>
>
> On 21-May-08, at 4:46 AM, gurudev wrote:
>
>>
>> Just to add more:
>>
>> The JVM heap allocated is 6GB with initial heap size as 2GB. We use
>> quadro(which is 8 cpus) on linux servers for SOLR slaves.
>> We use facet searches, sorting.
>> document cache is set to 7 million (which is total documents in index)
>> filtercache 1
>
> You definitely don't have enough memory to keep 7 million document,
> fully realized in java-object form, in memory.
>
> Nor would you want to. The document cache should aim to keep the most
> frequently-occurring documents in memory (in the thousands, perhaps 10's
> of thousands). By devoting more memory to the OS disk cache, more of
> the 12GB index can be cached by the OS and thus speed up all document
> retrieval.
>
> -Mike
>
>

--
View this message in context: http://www.nabble.com/SOLR-OOM-%28out-of-memory%29-problem-tp17364146p17402234.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to limit number of pages per domain
Sorry, but I can't really understand the difference with facets.

On Thu, May 22, 2008 at 2:09 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

> Actually, the best documentation are really the comments in the JIRA issue
> itself.
> Is there anyone actually using Solr with this patch?
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> - Original Message
> > From: Koji Sekiguchi <[EMAIL PROTECTED]>
> > To: solr-user@lucene.apache.org
> > Sent: Wednesday, May 21, 2008 6:26:48 PM
> > Subject: Re: How to limit number of pages per domain
> >
> > There is a documentation:
> >
> > http://wiki.apache.org/solr/FieldCollapsing
> >
> > Koji
> >
> > Jonathan Ariel wrote:
> > > Sorry. But how field collapsing works? Is there documentation about this
> > > anywhere? Thanks!
Re: [poll] Change logging to SLF4J?
On May 6, 2008, at 10:40 AM, Ryan McKinley wrote:

[ ] Keep solr logging as it is. (JDK Logging)
[X ] Use SLF4J.

But you already knew that...
Re: [poll] Change logging to SLF4J?
Ryan McKinley wrote:
>
>> [ ] Keep solr logging as it is. (JDK Logging)
>> [X ] Use SLF4J.
>

Can't "keep as is" since this strictly precludes configuring logging in a container agnostic way.

--
View this message in context: http://www.nabble.com/-poll--Change-logging-to-SLF4J--tp17084684p17405410.html
Sent from the Solr - User mailing list archive at Nabble.com.
RE: SOLR OOM (out of memory) problem
Those hit ratios look good, so it is worth keeping those caches; they will help your search performance. Try the concurrent GC and see if you get better results. Please let me know how it goes.

Best,
Yongjun Rong

-Original Message-
From: gurudev [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 22, 2008 7:28 AM
To: solr-user@lucene.apache.org
Subject: RE: SOLR OOM (out of memory) problem

Hi Rong,

My cache hit ratio are:

filtercache: 0.96
documentcache:0.51
queryresultcache:0.58

Thanx
Pravesh
Is example-solr-home.jar synchronized with DataImportHandler documentation?
I downloaded example-solr-home.jar and was experimenting with
'${dataimporter.functions.escapeSql(item.ID)}'

It didn't work, so I looked in dataimporter.jar and noticed that it didn't include classes for EvaluatorBag et al.

I'm assuming example-solr-home.jar on http://wiki.apache.org/solr/DataImportHandler is out of sync with the documentation on that page.

--
View this message in context: http://www.nabble.com/Is-example-solr-home.jar-synchronized-with-DataImportHandler-documentation--tp17407305p17407305.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Is example-solr-home.jar synchronized with DataImportHandler documentation?
Hi Alan,

Yes, it is a bit out of date. Please try using the SOLR-469.patch directly from the jira issue.

On Thu, May 22, 2008 at 9:23 PM, alan. <[EMAIL PROTECTED]> wrote:
>
> I downloaded example-solr-home.jar and was experimenting with
> '${dataimporter.functions.escapeSql(item.ID)}'
>
> It didn't work so I looked in dataimporter.jar and noticed that it didn't
> include classes for EvaluatorBag etal
>
> I'm assuming example-solr-home.jar on
> http://wiki.apache.org/solr/DataImportHandler is out
> of sync with the documentation on that page.

--
Regards,
Shalin Shekhar Mangar.
Re: How to limit number of pages per domain
I think I'll give it a try. I haven't done this before. Are there any instructions regarding how to apply the patch? I see 9 files, some displayed in gray links, some in blue links; some named as .diff, some .patch; one has 1.3 in file name, one has 1.3, I suppose the other files are for both versions. Should I apply all of them?

https://issues.apache.org/jira/browse/SOLR-236

> Actually, the best documentation are really the comments in the JIRA issue
> itself.
> Is there anyone actually using Solr with this patch?
>
> Otis
RE: Is example-solr-home.jar synchronized with DataImportHandler documentation?
Question on the status of the DataImportHandler.
For the time being we are applying the recent patch.

What are the plans for incorporating it as part of the nightly build or at least part of the subversion tree?

I just want to make sure that I get updates/fixes/enhancements to this module when they occur as I update my tree.

Thanks for all your work.

** julio

-Original Message-
From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 22, 2008 8:56 AM
To: solr-user@lucene.apache.org
Subject: Re: Is example-solr-home.jar synchronized with DataImportHandler documentation?

Hi Alan,

Yes, it is a bit out of date. Please try using the SOLR-469.patch directly from the jira issue
Re: Is example-solr-home.jar synchronized with DataImportHandler documentation?
It is scheduled to be released with the next release of Solr. Shouldn't be too long before it becomes part of the trunk/nightly code. If you find it useful, please do tell us here or vote/comment in the Jira issue. Bug reports are welcome too :)

You can also add yourself as a watcher to the SOLR-469 issue. That way, you'll be notified by email for all changes. You'd need to be registered on Jira before you can become a watcher.

On Thu, May 22, 2008 at 10:10 PM, Julio Castillo <[EMAIL PROTECTED]> wrote:
> Question on the status of the DataImportHandler.
> For the time being we are applying the recent patch.
>
> What are the plans for incorporating it as part of the nightly build or at
> least part of the subversion tree?
>
> I just want to make sure that I get updates/fixes/enhancements to this
> module when they occur as I update my tree.
>
> Thanks for all your work.
>
> ** julio

--
Regards,
Shalin Shekhar Mangar.
Re: Is example-solr-home.jar synchronized with DataImportHandler documentation?
I've updated the example-solr-home.jar on the DataImportHandler wiki page with the latest code. Please let us know if you find any issues.

On Thu, May 22, 2008 at 10:19 PM, Shalin Shekhar Mangar <[EMAIL PROTECTED]> wrote:
> It is scheduled to be released with the next release of Solr.
> Shouldn't be too long before it becomes part of the trunk/nightly
> code. If you find it useful, please do tell us here or vote/comment in
> the Jira issue. Bug reports are welcome too :)
>
> You can also add yourself as a watcher to the SOLR-469 issue. That
> way, you'll be notified by email for all changes. You'd need to be
> registered on Jira before you can become a watcher.

--
Regards,
Shalin Shekhar Mangar.
Re: How to limit number of pages per domain
I don't know yet, so I asked directly in that JIRA issue :)

Applying patches is done something like this:

Ah, just added it to the Solr FAQ on the Wiki for everyone:

http://wiki.apache.org/solr/FAQ#head-bd01dc2c65240a36e7c0ee78eaef88912a0e4030

Can you provide feedback about this particular patch once you try it? I'd like to get it on Solr 1.3, actually, so any feedback would help.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message
> From: Jack <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, May 22, 2008 12:35:28 PM
> Subject: Re: How to limit number of pages per domain
>
> I think I'll give it a try. I haven't done this before. Are there any
> instructions regarding how to apply the patch? I see 9 files, some
> displayed in gray links, some in blue links; some named as .diff, some
> .patch; one has 1.3 in file name, one has 1.3, I suppose the other
> files are for both versions. Should I apply all of them?
> https://issues.apache.org/jira/browse/SOLR-236
>
> > Actually, the best documentation are really the comments in the JIRA issue
> > itself.
> > Is there anyone actually using Solr with this patch?
> >
> > Otis
RE: Is example-solr-home.jar synchronized with DataImportHandler documentation?
Thanks Shalin,
I will add my vote for it to become an integral part of Solr ASAP. I don't know why indexing DB content is not the highest on the project's list.

I've seen other messages regarding the status of SOLR 1.3, so I'm not holding my breath there. My request is that it makes it to the nightly builds.

Thanks again

** julio

-Original Message-
From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 22, 2008 9:50 AM
To: solr-user@lucene.apache.org
Subject: Re: Is example-solr-home.jar synchronized with DataImportHandler documentation?

It is scheduled to be released with the next release of Solr. Shouldn't be too long before it becomes part of the trunk/nightly code. If you find it useful, please do tell us here or vote/comment in the Jira issue. Bug reports are welcome too :)

You can also add yourself as a watcher to the SOLR-469 issue. That way, you'll be notified by email for all changes. You'd need to be registered on Jira before you can become a watcher.
Re: How to limit number of pages per domain
You mean you don't understand the difference. Here is an example of each:

1) field collapsing:
http://www.google.com/search?q=lucene+in+action

Note how Google figures out that the first 2 hits are from the same site (manning.com) and, after showing those 2 hits, offers "More results from www.manning.com »". That's field collapsing in action. If it didn't collapse hits, it might have to show many more hits from manning.com in a row on that results page, and that would translate to a bad user experience (users want diversity, too, not just pure relevance).

2) facets:
http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=morricone

Note how the results are broken down by category on the left side of the page. Besides the names of all categories that the results appear in, facets also show the number of items in each category. This helps with browsing/navigation.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message
> From: Jonathan Ariel <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, May 22, 2008 7:53:51 AM
> Subject: Re: How to limit number of pages per domain
>
> Sorry, but I can't really understand the difference with facets.
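
The distinction Otis draws can be sketched on a toy result list. This is a plain Python illustration of the two ideas, not Solr's implementation or the SOLR-236 patch; the hit data, the limit of 2 per domain, and the function names `collapse` and `facet_counts` are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical, already-ranked hits: (url, domain, category).
hits = [
    ("manning.com/lia", "manning.com", "books"),
    ("manning.com/lia2", "manning.com", "books"),
    ("manning.com/errata", "manning.com", "support"),
    ("amazon.com/lia", "amazon.com", "books"),
    ("lucene.apache.org", "apache.org", "docs"),
]

def collapse(hits, max_per_domain=2):
    """Field collapsing: keep at most max_per_domain hits per domain, in rank order."""
    seen = defaultdict(int)
    kept, hidden = [], Counter()
    for url, domain, _category in hits:
        if seen[domain] < max_per_domain:
            kept.append(url)
        else:
            hidden[domain] += 1  # would back a "More results from ..." link
        seen[domain] += 1
    return kept, hidden

def facet_counts(hits):
    """Faceting: count hits per category; the result list itself is untouched."""
    return Counter(category for _url, _domain, category in hits)

kept, hidden = collapse(hits)
print(kept)                      # third manning.com hit is collapsed away
print(dict(hidden))              # {'manning.com': 1}
print(dict(facet_counts(hits)))  # {'books': 3, 'support': 1, 'docs': 1}
```

Collapsing changes which hits are shown; faceting only summarizes the hits so the user can narrow them down.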
Re: SOLR OOM (out of memory) problem
Hi,

Seriously, try making that monster document cache smaller. Sure, there will be more evictions and more cache misses, but at least you will be less likely to get OOMs :).

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message
> From: gurudev <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, May 22, 2008 7:27:44 AM
> Subject: RE: SOLR OOM (out of memory) problem
>
> Hi Rong,
>
> My cache hit ratio are:
>
> filtercache: 0.96
> documentcache:0.51
> queryresultcache:0.58
>
> Thanx
> Pravesh
Re: Indexing HTML Content
John,

Solr already has some of this stuff:

$ ff \*HTML\*java
./src/test/org/apache/solr/analysis/HTMLStripReaderTest.java
./src/java/org/apache/solr/analysis/HTMLStripStandardTokenizerFactory.java
./src/java/org/apache/solr/analysis/HTMLStripReader.java
./src/java/org/apache/solr/analysis/HTMLStripWhitespaceTokenizerFactory.java

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message
> From: "McBride, John" <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, May 22, 2008 4:44:23 AM
> Subject: Indexing HTML Content
>
> Hello,
>
> In my application I wish to index articles which are stored in HTML
> format.
>
> Upon indexing these the html gets stored along with the content of the
> article, which is undesirable.
>
> Do you know of any common way of parsing the text content from HTML
> before adding to SOLR? I understand SOLR 1.3 has an HTML analyser, but
> I am using SOLR 1.2 and won't use 1.3 until it's stable, so looking for
> a solution to work on a batch of files before being added to SOLR.
>
> Thanks,
> John
Re: Is example-solr-home.jar synchronized with DataImportHandler documentation?
Julio, no worries, I'm 99% sure DIH is going to be in 1.3 and be in a nightly in a week or two.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message
> From: Julio Castillo <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, May 22, 2008 1:04:54 PM
> Subject: RE: Is example-solr-home.jar synchronized with DataImportHandler
> documentation?
>
> Thanks Shalin,
> I will add my vote to it that it becomes an integral part of the Solr ASAP.
> I don't know why indexing dB content is not the highest on the project's
> list.
>
> I've seen other messages regarding the status of SOLR 1.3, so I'm not
> holding my breath there. My request is that it makes it to the nightly
> builds.
>
> Thanks again
>
> ** julio
Re: SOLR OOM (out of memory) problem
On 22-May-08, at 4:27 AM, gurudev wrote:

> Hi Rong,
>
> My cache hit ratios are:
>
> filtercache: 0.96
> documentcache: 0.51
> queryresultcache: 0.58

Note that you may be able to reduce the _size_ of the document cache without materially affecting the hit rate, since typically some documents are much more frequently accessed than others. I'd suggest starting with 700k, which I would still consider a large cache.

-Mike
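
To illustrate the point about cache size versus hit rate, here is a minimal simulation sketch. It is not Solr code: it assumes an LRU eviction policy and a made-up skewed workload (the functions `lru_hit_rate` and `skewed_accesses`, the 1/(i+1) weighting, and all sizes are illustrative assumptions). Under such a workload the hit rate tends to grow slowly with capacity, which is why a smaller cache may cost little.

```python
import random
from collections import OrderedDict


def lru_hit_rate(accesses, capacity):
    """Replay a sequence of document ids through an LRU cache; return the hit rate."""
    cache = OrderedDict()
    hits = 0
    for doc_id in accesses:
        if doc_id in cache:
            hits += 1
            cache.move_to_end(doc_id)  # mark as most recently used
        else:
            cache[doc_id] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / len(accesses)


def skewed_accesses(num_docs, num_requests, seed=0):
    """Hypothetical workload: document i is requested with weight 1/(i+1)."""
    rng = random.Random(seed)
    weights = [1.0 / (i + 1) for i in range(num_docs)]
    return rng.choices(range(num_docs), weights=weights, k=num_requests)


if __name__ == "__main__":
    trace = skewed_accesses(num_docs=10_000, num_requests=50_000)
    for capacity in (500, 1_000, 2_000, 4_000):
        print(capacity, round(lru_hit_rate(trace, capacity), 3))
```

Running the script prints one hit rate per capacity, so you can see how much hit rate each doubling of the cache actually buys for this synthetic trace. Real query logs will behave differently, so the numbers are only a guide to the shape of the trade-off.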
RE: Indexing HTML Content
The HTMLStripReader tool worked very well for us. It handles garbled HTML well. The only hole we found was that it does not find alt-text attributes for images.

Also, note that this code is written as a Java Reader class rather than a Solr class. This makes it useful for other projects. Given the amount of string processing it does, the fact that it is a Reader probably does not affect its performance.

Cheers,
Lance

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 22, 2008 10:14 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing HTML Content

John,

Solr already has some of this stuff:

$ ff \*HTML\*java
./src/test/org/apache/solr/analysis/HTMLStripReaderTest.java
./src/java/org/apache/solr/analysis/HTMLStripStandardTokenizerFactory.java
./src/java/org/apache/solr/analysis/HTMLStripReader.java
./src/java/org/apache/solr/analysis/HTMLStripWhitespaceTokenizerFactory.java

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
> From: "McBride, John" <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, May 22, 2008 4:44:23 AM
> Subject: Indexing HTML Content
>
> Hello,
>
> In my application I wish to index articles which are stored in HTML
> format.
>
> Upon indexing these the html gets stored along with the content of the
> article, which is undesirable.
>
> Do you know of any common way of parsing the text content from HTML
> before adding to SOLR? I understand SOLR 1.3 has an HTML analyser,
> but I am using SOLR 1.2 and won't use 1.3 until it's stable, so
> looking for a solution to work on a batch of files before being added to SOLR.
>
> Thanks,
> John
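
For the batch pre-processing case John describes (stripping HTML before the documents ever reach Solr), a standalone script is one option. The following is only a sketch using Python's standard `html.parser` module, not HTMLStripReader itself; the names `TextExtractor` and `strip_html` are made up for illustration. Unlike the behaviour Lance reports, this sketch also keeps image alt text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and image alt text, skipping script/style bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Pieces are joined with a space, so text split by inline tags
    # (e.g. "Hel<b>lo</b>") comes out as two tokens.
    return " ".join(" ".join(parser.parts).split())
```

A batch job would call `strip_html` on each article body and send the returned text in the field to be indexed. The space-joining choice favours keeping words in adjacent block elements apart over preserving words that are split by inline markup.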
Re: DocSet to BitSet
: One of the primary reasons that I was doing it this way is because I am
: sending several filters, one is a big docset and others are BooleanQuery
: objects (products in stock, etc.).
: Since, the interface for SolrIndexSearcher.getDocListAndSet supports
: only (Query, DocSet,...) or (Query, List,...), I was going to

Just use SolrIndexSearcher.getDocSet(List) to compute a DocSet for your query "filters" and then intersect that with your existing DocSet.

: give it a list of filters. I haven't investigated further to see if
: patching the Solr code to allow both methods (Query, List,
: DocSet) would cause any problems. My guess is that it was done this way
: for a reason.

The code is a bit hairy, but deep down in a private getDocListC method there is a note about how the method can only be used with either a "DocSet filter" or a "List filterList" but not both .. i don't remember why.

: Barring that solution, I will probably use the Query, DocSet method. I
: have my DocSet for my bit-based filters in a single DocSet. And then I
: can take my previous list of filter queries and add them onto the main
: Query object that was created by the front-end.

I'm not sure what this assuming those other queries are fairly orthogonal, generating a separate DocSet for them (or one DocSet for each of them) will probably give you better cache hit ratios.

-Hoss
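
The "compute a DocSet per filter, then intersect" idea can be sketched independently of Solr. The following toy model uses plain Python sets in place of DocSet objects; `doc_set`, `cached_doc_set`, `intersect_all`, and the sample index are all hypothetical names and data, not Solr API. It also shows why separate per-filter sets may cache well: each one can be reused by any later request that includes the same filter.

```python
from functools import reduce

filter_cache = {}  # toy stand-in for a filter cache, keyed by filter name


def doc_set(index, predicate):
    """Toy stand-in for a DocSet: the ids of all documents matching one filter."""
    return {doc_id for doc_id, doc in index.items() if predicate(doc)}


def cached_doc_set(name, index, predicate):
    """Compute each filter's doc set once and reuse it afterwards."""
    if name not in filter_cache:
        filter_cache[name] = doc_set(index, predicate)
    return filter_cache[name]


def intersect_all(doc_sets):
    """Intersect one or more doc sets (the list must not be empty)."""
    return reduce(set.intersection, doc_sets)


index = {
    1: {"in_stock": True, "category": "book"},
    2: {"in_stock": False, "category": "book"},
    3: {"in_stock": True, "category": "dvd"},
    4: {"in_stock": True, "category": "book"},
}

existing = {1, 2, 4}  # e.g. a large precomputed doc set
in_stock = cached_doc_set("in_stock", index, lambda d: d["in_stock"])
books = cached_doc_set("books", index, lambda d: d["category"] == "book")
combined = intersect_all([existing, in_stock, books])
```

Here `combined` contains only the documents present in all three sets. Folding the filters into the main query instead would produce a single combined result that is only reusable when the exact same combination recurs.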
Re: DocSet to BitSet
That is more or less what I did. Once I found that function, it just took a small patch to expose that functionality, and then the problem was solved.

----- Original Message -----
From: Chris Hostetter <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, May 22, 2008 12:32:56 PM
Subject: Re: DocSet to BitSet

: One of the primary reasons that I was doing it this way is because I am
: sending several filters, one is a big docset and others are BooleanQuery
: objects (products in stock, etc.).
: Since, the interface for SolrIndexSearcher.getDocListAndSet supports
: only (Query, DocSet,...) or (Query, List,...), I was going to

Just use SolrIndexSearcher.getDocSet(List) to compute a DocSet for your query "filters" and then intersect that with your existing DocSet.

: give it a list of filters. I haven't investigated further to see if
: patching the Solr code to allow both methods (Query, List,
: DocSet) would cause any problems. My guess is that it was done this way
: for a reason.

The code is a bit hairy, but deep down in a private getDocListC method there is a note about how the method can only be used with either a "DocSet filter" or a "List filterList" but not both .. i don't remember why.

: Barring that solution, I will probably use the Query, DocSet method. I
: have my DocSet for my bit-based filters in a single DocSet. And then I
: can take my previous list of filter queries and add them onto the main
: Query object that was created by the front-end.

I'm not sure what this assuming those other queries are fairly orthogonal, generating a separate DocSet for them (or one DocSet for each of them) will probably give you better cache hit ratios.

-Hoss
Re: DocSet to BitSet
: That is more or less what I did. Once I found that function, it just
: took a small patch to expose that functionality, and then the problem
: was solved.

I'm not sure why you needed a patch at all ... SolrIndexSearcher.getDocSet(List) and getDocSet(Query) are both public methods, as is DocSet.intersection(DocSet).

-Hoss
Re: DocSet to BitSet
In v1.3, it is public. In v1.2, it is still protected.

----- Original Message -----
From: Chris Hostetter <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, May 22, 2008 1:50:22 PM
Subject: Re: DocSet to BitSet

: That is more or less what I did. Once I found that function, it just
: took a small patch to expose that functionality, and then the problem
: was solved.

I'm not sure why you needed a patch at all ... SolrIndexSearcher.getDocSet(List) and getDocSet(Query) are both public methods, as is DocSet.intersection(DocSet).

-Hoss
Re[2]: the time factor
: I'm not quite understanding how boost query works though. How does it
: "influence" the score exactly? Does it just simply append to the "q"
: param? From the wiki:

Essentially yes, but documents must match at least one clause of the "q"; matching the "bq" is optional (and when it happens, it will result in a score increase accordingly).

: If this is how it works, it sounds like the bq will be used first
: to get a result set, then the result set will be sorted by q
: (relevance)?

No. bq doesn't influence what matches -- that's q -- bq only influences the scores of existing matches if they also match the bq.

-Hoss
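
The distinction between matching (q) and boosting (bq) can be shown with a toy model. This is a deliberately simplified sketch, not Lucene's scoring formula: the `search` function, the term-count score, and the fixed `bq_boost` value are all illustrative assumptions.

```python
def search(docs, q_terms, bq_terms, bq_boost=2.0):
    """Toy model: q decides which documents match; bq only adds to their score."""
    results = []
    for doc_id, terms in docs.items():
        q_hits = len(q_terms & terms)
        if q_hits == 0:
            continue  # bq alone never brings a document into the result set
        score = float(q_hits)
        if bq_terms & terms:
            score += bq_boost  # optional clause: raises the score when it matches
        results.append((doc_id, score))
    # Highest score first; ties broken by document id for a stable order.
    return sorted(results, key=lambda r: (-r[1], r[0]))


docs = {
    "a": {"solr", "old"},
    "b": {"solr", "recent"},
    "c": {"lucene", "recent"},
}
ranked = search(docs, q_terms={"solr"}, bq_terms={"recent"})
```

In this example document "c" matches the bq term but not q, so it is absent from `ranked`, while "b" matches both and sorts ahead of "a".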
Re: Is example-solr-home.jar synchronized with DataImportHandler documentation?
Thanks for the new jar. I ended up building solr+dataimporthandler but an updated jar is a blessing for folks trying DataImportHandler.

I've updated the example-solr-home.jar on the DataImportHandler wiki page with the latest code. Please let us know if you find any issues.

--
View this message in context: http://www.nabble.com/Is-example-solr-home.jar-synchronized-with-DataImportHandler-documentation--tp17407305p17416063.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re[3]: the time factor
Hello Chris,

> : If this is how it works, it sounds like the bq will be used first
> : to get a result set, then the result set will be sorted by q
> : (relevance)?
> no. bq doesn't influence what matches -- that's q -- bq only influence
> the scores of existing matches if they also match the bq.

Hmm. Then it really works in a way that's similar to the recip function. I wonder why the bq works a lot better. One reason could be that with the bq query, the boost is directly linked to dates, instead of leaving it to the recip function to figure out, which requires fine tuning.

Thanks,
Jack
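
For readers wondering what "fine tuning" the recip function involves: Solr's recip(x,m,a,b) computes a/(m*x+b). The sketch below evaluates it for the parameters (1, 1000, 1000) that appear elsewhere in this thread as recip(rord(DATE_1),1,1000,1000). The input values are arbitrary, and the reading that small inputs correspond to the most recent documents is an assumption about how rord orders dates.

```python
def recip(x, m, a, b):
    """Reciprocal function a / (m*x + b), the formula behind Solr's recip(x,m,a,b)."""
    return a / (m * x + b)


if __name__ == "__main__":
    # Arbitrary sample inputs; with rord, small values are assumed to mean recent dates.
    for x in (1, 100, 1_000, 10_000, 100_000):
        print(x, round(recip(x, 1, 1000, 1000), 4))
```

With these parameters the boost falls to half its maximum at x = 1000 and then flattens out, so how sharply recency matters depends heavily on the choice of m, a, and b relative to the number of distinct dates in the index.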
solr sorting problem
I have problem sorting solr results. Here is my solr config

search query

select/?&rows=100&start=0&q=artistId:100346%20AND%20type:track&sort=alphaTrackSort%20desc&fl=track

does not sort track.

Don't understand what is missing from config

--
View this message in context: http://www.nabble.com/solr-sorting-problem-tp17417394p17417394.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr sorting problem
I forgot to mention that I made changes to schema after indexing.

pmg wrote:
>
> I have problem sorting solr results. Here is my solr config
>
> stored="true"/>
>
> search query
>
> select/?&rows=100&start=0&q=artistId:100346%20AND%20type:track&sort=alphaTrackSort%20desc&fl=track
>
> does not sort track.
>
> Don't understand what is missing from config
>

--
View this message in context: http://www.nabble.com/solr-sorting-problem-tp17417394p17417408.html
Sent from the Solr - User mailing list archive at Nabble.com.
RE: solr highlighting
Just in case anyone wants to know: I figured out that you have to set uniqueKey stored="true" for highlighting to work. Thanks for everyone's help.

Thanks,
- Kevin

-----Original Message-----
From: Kevin Xiao [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 13, 2008 11:22 PM
To: solr-user@lucene.apache.org
Subject: solr highlighting

Hi there,

I am new to solr. I want the search term to be highlighted in the results. I thought it would be pretty simple, but could not make it work. I read a lot of solr documents and mail archives (I wish there were a search function for this, we are talking about solr, aren't we? ☺).

Solrconfig.xml

explicit 100 PMID AUTH ARTICLE_prefix_token ABST ARTICLE 1 recip(rord(DATE_1),1,1000,1000) true ABST

We use a Java client, which is nothing but CommonsHttpSolrServer("http://server:port/solr"), and populates the result to some data structure. Without the two lines for highlighter, I have results coming back, but after I add the two lines, the result is empty.

I also used the admin utility, and I saw that the ABST values are unchanged, but at the end of the document this was added:

− − − -gestational anemia as a consequence of a reduction in the number of primitive erythroid cells. GATA-1 mRNA is …

The content of ABST in the highlighting is much smaller than that of the original. I am guessing it tries to find the highlighted term's position. So what should I do to get the highlighted ABST?

Thanks,
- Kevin
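
As a companion to the question above, here is a small sketch of building a highlighting request by hand. It uses only parameters mentioned in this thread (hl=true, hl.fl, and hl.fragsize, where 0 was suggested for getting the whole field rather than a short snippet). The helper name `highlight_url`, the host and port, and the sample query term are assumptions for illustration.

```python
from urllib.parse import urlencode


def highlight_url(base_url, query, field, fragsize=None):
    """Build a /select URL that asks Solr to highlight `query` in `field`."""
    params = {"q": query, "hl": "true", "hl.fl": field}
    if fragsize is not None:
        # hl.fragsize=0 was suggested in this thread to avoid short fragments.
        params["hl.fragsize"] = str(fragsize)
    return base_url.rstrip("/") + "/select?" + urlencode(params)


if __name__ == "__main__":
    print(highlight_url("http://localhost:8983/solr", "anemia", "ABST", fragsize=0))
```

The highlighted text comes back in a separate section at the end of the response rather than replacing the stored field values, which matches what Kevin saw in the admin utility.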