RE: solr highlighting

2008-05-22 Thread Kevin Xiao
Thanks, Mike. Sorry, I was busy with something else. What does "field F
must have an analyzer defined" mean?

My F is defined as:

text is defined as:

  





  
  





  


Do you see anything wrong there?

Thanks,
- Kevin

-Original Message-
From: Mike Klaas [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 14, 2008 12:03 PM
To: solr-user@lucene.apache.org
Subject: Re: solr highlighting

The minimum "stuff" needed to highlight term X in field F is:

field F must be 'stored'
field F must have an analyzer defined
a query with term X is sent (e.g., q=X)
with parameters hl=true (or 'on'), hl.fl=F

Try it on the example:
1. get the example running
2. cd example/exampledocs
3. ./post.sh *.xml
4. execute a query:

http://localhost:8983/solr/select?indent=on&version=2.2&q=solr&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&hl=on&hl.fl=features
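
A minimal sketch of the same request built programmatically, assuming the example server from the steps above is listening on localhost:8983. The helper name build_highlight_url is hypothetical; q, hl and hl.fl are the parameters listed above, and only Python's standard library is used.

```python
from urllib.parse import urlencode


def build_highlight_url(base_url, term, field):
    """Build a select URL that asks Solr to highlight `term` in `field`."""
    params = [
        ("q", term),       # a query containing the term to highlight
        ("hl", "true"),    # turn highlighting on ('on' works as well)
        ("hl.fl", field),  # the stored, analyzed field to highlight
    ]
    return base_url + "?" + urlencode(params)


url = build_highlight_url("http://localhost:8983/solr/select", "solr", "features")
print(url)
```

The other parameters in the URL above (indent, rows, fl and so on) only shape the response; the three in the sketch are the ones highlighting depends on.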

-Mike

On 14-May-08, at 9:39 AM, Kevin Xiao wrote:

> Thanks, Christian. I did try many of the options listed in the wiki, but
> none of them worked. So I want to see if the basics work, i.e. only define
> hl=true and a field for hl.fl. Do I need to include something global to
> make the hl settings work?
>
> Thanks,
> - Kevin
>
> -Original Message-
> From: Christian Vogler [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, May 14, 2008 5:55 AM
> To: solr-user@lucene.apache.org
> Subject: Re: solr highlighting
>
> On Wednesday 14 May 2008 09:21:36 Kevin Xiao wrote:
>> Hi there,
>>
>> I am new to Solr. I want the search term to be highlighted in the
>> results. I thought it would be pretty simple, but could not make it
>> work. I have read a lot of Solr documents and mail archives (I wish
>> there were a search function for this; we are talking about Solr,
>> aren’t we? ☺).
>
> Take a look at hl.fragsize, hl.snippets, and hl.mergeContiguous, as per
> http://wiki.apache.org/solr/HighlightingParameters.
>
> In particular, setting hl.fragsize to 0 might be what you want if I
> understand your question correctly.
>
> Best regards
> - Christian
> --
> Christian Vogler, Ph.D.
> Institute for Language and Speech Processing, Athens, Greece
> http://gri.gallaudet.edu/~cvogler/
> [EMAIL PROTECTED]



Indexing HTML Content

2008-05-22 Thread McBride, John
Hello,

In my application I wish to index articles which are stored in HTML
format.

Upon indexing these the html gets stored along with the content of the
article, which is undesirable.

Do you know of any common way of parsing the text content from HTML
before adding to SOLR?  I understand SOLR 1.3 has an HTML analyser, but
I am using SOLR 1.2 and won't use 1.3 until it's stable, so looking for
a solution to work on a batch of files before being added to SOLR.

Thanks,
John


Re: Indexing HTML Content

2008-05-22 Thread solr

Hi,

Maybe this one?

http://htmlparser.sourceforge.net/

/Jimi

Quoting "McBride, John" <[EMAIL PROTECTED]>:


> Hello,
>
> In my application I wish to index articles which are stored in HTML
> format.
>
> Upon indexing these the html gets stored along with the content of the
> article, which is undesirable.
>
> Do you know of any common way of parsing the text content from HTML
> before adding to SOLR?  I understand SOLR 1.3 has an HTML analyser, but
> I am using SOLR 1.2 and won't use 1.3 until it's stable, so looking for
> a solution to work on a batch of files before being added to SOLR.
>
> Thanks,
> John






Re: Indexing HTML Content

2008-05-22 Thread David Arpad Geller

Actually, it's very easy: http://us2.php.net/strip_tags

I also store the data in a separate field with the html intact for 
display.  In that case, I use urlencode on the string.
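
For anyone not on PHP, here is a rough sketch of the same idea in Python, using only the standard library. The names TextExtractor and strip_tags and the field names text and html are hypothetical; this is a naive tag stripper for illustration, not what Solr's HTML analyser does.

```python
from html.parser import HTMLParser
from urllib.parse import quote_plus, unquote_plus


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text of `html` with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps "<h1>A</h1><p>B</p>" from becoming "AB";
    # the price is that inline tags inside a word also introduce a space.
    return " ".join(" ".join(parser.parts).split())


article = "<h1>Solr</h1><p>Fast <b>search</b> &amp; highlighting.</p>"
doc = {
    "text": strip_tags(article),  # plain text, for indexing
    "html": quote_plus(article),  # original markup, URL-encoded, for display
}
restored = unquote_plus(doc["html"])  # decode again before display
```

Note that script and style contents are still passed through as text by this sketch, so real articles may need extra filtering.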


David

McBride, John wrote:

> Hello,
>
> In my application I wish to index articles which are stored in HTML
> format.
>
> Upon indexing these the html gets stored along with the content of the
> article, which is undesirable.
>
> Do you know of any common way of parsing the text content from HTML
> before adding to SOLR?  I understand SOLR 1.3 has an HTML analyser, but
> I am using SOLR 1.2 and won't use 1.3 until it's stable, so looking for
> a solution to work on a batch of files before being added to SOLR.
>
> Thanks,
> John
  


--
They must find it difficult, those who have taken authority as truth, rather 
than truth as authority. - Gerald Massey



RE: SOLR OOM (out of memory) problem

2008-05-22 Thread gurudev

Hi Rong,

My cache hit ratios are:

filtercache: 0.96
documentcache: 0.51
queryresultcache: 0.58

Thanx
Pravesh


Yongjun Rong-2 wrote:
> 
> I had the same problem a few weeks ago. You can try these:
> 1. Check the cache hit ratios via solr/admin/stats.jsp. If the hit
> ratio is very low, just disable those caches. It will save you some
> memory.
> 2. Setting -Xms and -Xmx to the same size will help improve GC performance.
> 3. Check which GC you use. The default will be parallel. You can try the
> concurrent GC, which will help a lot.
> 4. These are my Sun HotSpot JVM startup options: -XX:+UseConcMarkSweepGC
> -XX:CMSInitiatingOccupancyFraction=50 -XX:-UseGCOverheadLimit
> The above cannot solve the OOM forever, but they help a lot.
> I hope this helps.
> 
> -Original Message-
> From: Mike Klaas [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, May 21, 2008 2:23 PM
> To: solr-user@lucene.apache.org
> Subject: Re: SOLR OOM (out of memory) problem
> 
> 
> On 21-May-08, at 4:46 AM, gurudev wrote:
> 
>>
>> Just to add more:
>>
>> The JVM heap allocated is 6GB with initial heap size as 2GB. We use 
>> quadro(which is 8 cpus) on linux servers for SOLR slaves.
>> We use facet searches, sorting.
>> document cache is set to 7 million (which is total documents in index)
> 
>> filtercache 1
> 
> You definitely don't have enough memory to keep 7 million documents,
> fully realized in java-object form, in memory.
> 
> Nor would you want to.  The document cache should aim to keep the most
> frequently occurring documents in memory (in the thousands, perhaps 10's
> of thousands).  By devoting more memory to the OS disk cache, more of
> the 12GB index can be cached by the OS and thus speed up all document
> retrieval.
> 
> -Mike
> 
> 

-- 
View this message in context: 
http://www.nabble.com/SOLR-OOM-%28out-of-memory%29-problem-tp17364146p17402234.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to limit number of pages per domain

2008-05-22 Thread Jonathan Ariel
Sorry, but I can't really understand the difference with facets.

On Thu, May 22, 2008 at 2:09 AM, Otis Gospodnetic <
[EMAIL PROTECTED]> wrote:

> Actually, the best documentation is really the comments in the JIRA issue
> itself.
> Is there anyone actually using Solr with this patch?
>
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> - Original Message 
> > From: Koji Sekiguchi <[EMAIL PROTECTED]>
> > To: solr-user@lucene.apache.org
> > Sent: Wednesday, May 21, 2008 6:26:48 PM
> > Subject: Re: How to limit number of pages per domain
> >
> > There is documentation:
> >
> > http://wiki.apache.org/solr/FieldCollapsing
> >
> > Koji
> >
> > Jonathan Ariel wrote:
> > > Sorry. But how does field collapsing work? Is there documentation
> > > about this anywhere? Thanks!
> > >
>
>


Re: [poll] Change logging to SLF4J?

2008-05-22 Thread Grant Ingersoll


On May 6, 2008, at 10:40 AM, Ryan McKinley wrote:



[  ] Keep solr logging as it is.  (JDK Logging)
[X  ] Use SLF4J.


But you already knew that...



Re: [poll] Change logging to SLF4J?

2008-05-22 Thread Henrib


Ryan McKinley wrote:
> 
>> [  ] Keep solr logging as it is.  (JDK Logging)
>> [X  ] Use SLF4J.
> 
Can't "keep as is", since this strictly precludes configuring logging in a
container-agnostic way.
-- 
View this message in context: 
http://www.nabble.com/-poll--Change-logging-to-SLF4J--tp17084684p17405410.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: SOLR OOM (out of memory) problem

2008-05-22 Thread Yongjun Rong
Those hit ratios look good, so it makes sense to keep using those caches.
Keeping them will help your search performance. Try the concurrent GC and
see if you get better results. Please let me know how it goes.
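
To make the "check the hit ratio" step concrete, here is a small sketch under stated assumptions: the ratios are the ones reported in this thread, while the 0.10 cut-off and the helper name caches_to_disable are hypothetical (the thread only says "very low" and gives no number).

```python
# Hit ratios reported in this thread.
hit_ratios = {
    "filtercache": 0.96,
    "documentcache": 0.51,
    "queryresultcache": 0.58,
}

# Hypothetical cut-off; pick one that suits your own traffic.
LOW_HIT_RATIO = 0.10


def caches_to_disable(ratios, threshold=LOW_HIT_RATIO):
    """Return the names of caches whose hit ratio is below the threshold."""
    return sorted(name for name, ratio in ratios.items() if ratio < threshold)


print(caches_to_disable(hit_ratios))
```

With the numbers above nothing falls below the cut-off, which matches the advice to keep all three caches.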
  Best,
  Yongjun Rong

-Original Message-
From: gurudev [mailto:[EMAIL PROTECTED] 
Sent: Thursday, May 22, 2008 7:28 AM
To: solr-user@lucene.apache.org
Subject: RE: SOLR OOM (out of memory) problem


Hi Rong,

My cache hit ratios are:

filtercache: 0.96
documentcache: 0.51
queryresultcache: 0.58

Thanx
Pravesh


Yongjun Rong-2 wrote:
> 
> I had the same problem a few weeks ago. You can try these:
> 1. Check the cache hit ratios via solr/admin/stats.jsp. If the hit
> ratio is very low, just disable those caches. It will save you some
> memory.
> 2. Setting -Xms and -Xmx to the same size will help improve GC performance.
> 3. Check which GC you use. The default will be parallel. You can try the
> concurrent GC, which will help a lot.
> 4. These are my Sun HotSpot JVM startup options: -XX:+UseConcMarkSweepGC
> -XX:CMSInitiatingOccupancyFraction=50 -XX:-UseGCOverheadLimit
> The above cannot solve the OOM forever, but they help a lot.
> I hope this helps.
> 
> -Original Message-
> From: Mike Klaas [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, May 21, 2008 2:23 PM
> To: solr-user@lucene.apache.org
> Subject: Re: SOLR OOM (out of memory) problem
> 
> 
> On 21-May-08, at 4:46 AM, gurudev wrote:
> 
>>
>> Just to add more:
>>
>> The JVM heap allocated is 6GB with initial heap size as 2GB. We use 
>> quadro(which is 8 cpus) on linux servers for SOLR slaves.
>> We use facet searches, sorting.
>> document cache is set to 7 million (which is total documents in 
>> index)
> 
>> filtercache 1
> 
> You definitely don't have enough memory to keep 7 million documents,
> fully realized in java-object form, in memory.
> 
> Nor would you want to.  The document cache should aim to keep the most
> frequently occurring documents in memory (in the thousands, perhaps
> 10's of thousands).  By devoting more memory to the OS disk cache,
> more of the 12GB index can be cached by the OS and thus speed up all
> document retrieval.
> 
> -Mike
> 
> 

--
View this message in context:
http://www.nabble.com/SOLR-OOM-%28out-of-memory%29-problem-tp17364146p17402234.html
Sent from the Solr - User mailing list archive at Nabble.com.



Is example-solr-home.jar synchronized with DataImportHandler documentation?

2008-05-22 Thread alan.

I downloaded example-solr-home.jar and was experimenting with
'${dataimporter.functions.escapeSql(item.ID)}'

It didn't work, so I looked in dataimporter.jar and noticed that it didn't
include classes for EvaluatorBag et al.

I'm assuming example-solr-home.jar on
http://wiki.apache.org/solr/DataImportHandler is out
of sync with the documentation on that page.

-- 
View this message in context: 
http://www.nabble.com/Is-example-solr-home.jar-synchronized-with-DataImportHandler-documentation--tp17407305p17407305.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Is example-solr-home.jar synchronized with DataImportHandler documentation?

2008-05-22 Thread Shalin Shekhar Mangar
Hi Alan,

Yes, it is a bit out of date. Please try using the SOLR-469.patch
directly from the Jira issue.

On Thu, May 22, 2008 at 9:23 PM, alan. <[EMAIL PROTECTED]> wrote:
>
> I downloaded example-solr-home.jar and was experimenting with
> '${dataimporter.functions.escapeSql(item.ID)}'
>
> It didn't work so I looked in dataimporter.jar and noticed that it didn't
> include classes for EvaluatorBag etal
>
> I'm assuming example-solr-home.jar on
> http://wiki.apache.org/solr/DataImportHandler is out
> of sync with the documentation on that page.
>
> --
> View this message in context: 
> http://www.nabble.com/Is-example-solr-home.jar-synchronized-with-DataImportHandler-documentation--tp17407305p17407305.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
Regards,
Shalin Shekhar Mangar.


Re: How to limit number of pages per domain

2008-05-22 Thread Jack
I think I'll give it a try. I haven't done this before. Are there any
instructions regarding how to apply the patch? I see 9 files, some
displayed in gray links, some in blue links; some named as .diff, some
.patch; one has 1.3 in file name, one has 1.3, I suppose the other
files are for both versions. Should I apply all of them?
https://issues.apache.org/jira/browse/SOLR-236

> Actually, the best documentation is really the comments in the JIRA issue
> itself.
> Is there anyone actually using Solr with this patch?
>
>
> Otis


RE: Is example-solr-home.jar synchronized with DataImportHandler documentation?

2008-05-22 Thread Julio Castillo
Question on the status of the DataImportHandler.
For the time being we are applying the recent patch.

What are the plans for incorporating it as part of the nightly build or at
least part of the subversion tree?

I just want to make sure that I get updates/fixes/enhancements to this
module when they occur as I update my tree.

Thanks for all your work.

** julio

-Original Message-
From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED] 
Sent: Thursday, May 22, 2008 8:56 AM
To: solr-user@lucene.apache.org
Subject: Re: Is example-solr-home.jar synchronized with DataImportHandler
documentation?

Hi Alan,

Yes, it is a bit out of date. Please try using the SOLR-469.patch directly
from the jira issue

On Thu, May 22, 2008 at 9:23 PM, alan. <[EMAIL PROTECTED]> wrote:
>
> I downloaded example-solr-home.jar and was experimenting with 
> '${dataimporter.functions.escapeSql(item.ID)}'
>
> It didn't work so I looked in dataimporter.jar and noticed that it 
> didn't include classes for EvaluatorBag etal
>
> I'm assuming example-solr-home.jar on
> http://wiki.apache.org/solr/DataImportHandler is out of sync with the 
> documentation on that page.
>
> --
> View this message in context: 
> http://www.nabble.com/Is-example-solr-home.jar-synchronized-with-DataImportHandler-documentation--tp17407305p17407305.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



--
Regards,
Shalin Shekhar Mangar.



Re: Is example-solr-home.jar synchronized with DataImportHandler documentation?

2008-05-22 Thread Shalin Shekhar Mangar
It is scheduled to be released with the next release of Solr.
Shouldn't be too long before it becomes part of the trunk/nightly
code. If you find it useful, please do tell us here or vote/comment in
the Jira issue. Bug reports are welcome too :)

You can also add yourself as a watcher to the SOLR-469 issue. That
way, you'll be notified by email for all changes. You'd need to be
registered on Jira before you can become a watcher.

On Thu, May 22, 2008 at 10:10 PM, Julio Castillo
<[EMAIL PROTECTED]> wrote:
> Question on the status of the DataImportHandler.
> For the time being we are applying the recent patch.
>
> What are the plans for incorporating it as part of the nightly build or at
> least part of the subversion tree?
>
> I just want to make sure that I get updates/fixes/enhancements to this
> module when they occur as I update my tree.
>
> Thanks for all your work.
>
> ** julio
>
> -Original Message-
> From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED]
> Sent: Thursday, May 22, 2008 8:56 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Is example-solr-home.jar synchronized with DataImportHandler
> documentation?
>
> Hi Alan,
>
> Yes, it is a bit out of date. Please try using the SOLR-469.patch directly
> from the jira issue
>
> On Thu, May 22, 2008 at 9:23 PM, alan. <[EMAIL PROTECTED]> wrote:
>>
>> I downloaded example-solr-home.jar and was experimenting with
>> '${dataimporter.functions.escapeSql(item.ID)}'
>>
>> It didn't work so I looked in dataimporter.jar and noticed that it
>> didn't include classes for EvaluatorBag etal
>>
>> I'm assuming example-solr-home.jar on
>> http://wiki.apache.org/solr/DataImportHandler is out of sync with the
>> documentation on that page.
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Is-example-solr-home.jar-synchronized-with-DataImportHandler-documentation--tp17407305p17407305.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
>



-- 
Regards,
Shalin Shekhar Mangar.


Re: Is example-solr-home.jar synchronized with DataImportHandler documentation?

2008-05-22 Thread Shalin Shekhar Mangar
I've updated the example-solr-home.jar on the DataImportHandler wiki
page with the latest code. Please let us know if you find any issues.

On Thu, May 22, 2008 at 10:19 PM, Shalin Shekhar Mangar
<[EMAIL PROTECTED]> wrote:
> It is scheduled to be released with the next release of Solr.
> Shouldn't be too long before it becomes part of the trunk/nightly
> code. If you find it useful, please do tell us here or vote/comment in
> the Jira issue. Bug reports are welcome too :)
>
> You can also add yourself as a watcher to the SOLR-469 issue. That
> way, you'll be notified by email for all changes. You'd need to be
> registered on Jira before you can become a watcher.
>
> On Thu, May 22, 2008 at 10:10 PM, Julio Castillo
> <[EMAIL PROTECTED]> wrote:
>> Question on the status of the DataImportHandler.
>> For the time being we are applying the recent patch.
>>
>> What are the plans for incorporating it as part of the nightly build or at
>> least part of the subversion tree?
>>
>> I just want to make sure that I get updates/fixes/enhancements to this
>> module when they occur as I update my tree.
>>
>> Thanks for all your work.
>>
>> ** julio
>>
>> -Original Message-
>> From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED]
>> Sent: Thursday, May 22, 2008 8:56 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Is example-solr-home.jar synchronized with DataImportHandler
>> documentation?
>>
>> Hi Alan,
>>
>> Yes, it is a bit out of date. Please try using the SOLR-469.patch directly
>> from the jira issue
>>
>> On Thu, May 22, 2008 at 9:23 PM, alan. <[EMAIL PROTECTED]> wrote:
>>>
>>> I downloaded example-solr-home.jar and was experimenting with
>>> '${dataimporter.functions.escapeSql(item.ID)}'
>>>
>>> It didn't work so I looked in dataimporter.jar and noticed that it
>>> didn't include classes for EvaluatorBag etal
>>>
>>> I'm assuming example-solr-home.jar on
>>> http://wiki.apache.org/solr/DataImportHandler is out of sync with the
>>> documentation on that page.
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Is-example-solr-home.jar-synchronized-with-DataImportHandler-documentation--tp17407305p17407305.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
>>
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>



-- 
Regards,
Shalin Shekhar Mangar.


Re: How to limit number of pages per domain

2008-05-22 Thread Otis Gospodnetic
I don't know yet, so I asked directly in that JIRA issue :)

Applying patches is done something like this:

Ah, just added it to the Solr FAQ on the Wiki for everyone:

http://wiki.apache.org/solr/FAQ#head-bd01dc2c65240a36e7c0ee78eaef88912a0e4030

Can you provide feedback about this particular patch once you try it?  I'd like
to get it into Solr 1.3, actually, so any feedback would help.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Jack <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, May 22, 2008 12:35:28 PM
> Subject: Re: How to limit number of pages per domain
> 
> I think I'll give it a try. I haven't done this before. Are there any
> instructions regarding how to apply the patch? I see 9 files, some
> displayed in gray links, some in blue links; some named as .diff, some
> .patch; one has 1.3 in file name, one has 1.3, I suppose the other
> files are for both versions. Should I apply all of them?
> https://issues.apache.org/jira/browse/SOLR-236
> 
> > Actually, the best documentation is really the comments in the JIRA issue
> > itself.
> > Is there anyone actually using Solr with this patch?
> >
> >
> > Otis



RE: Is example-solr-home.jar synchronized with DataImportHandler documentation?

2008-05-22 Thread Julio Castillo
Thanks Shalin,
I will add my vote for it to become an integral part of Solr ASAP.
I don't know why indexing DB content is not highest on the project's
list.

I've seen other messages regarding the status of SOLR 1.3, so I'm not
holding my breath there. My request is that it makes it to the nightly
builds.

Thanks again

** julio

-Original Message-
From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED] 
Sent: Thursday, May 22, 2008 9:50 AM
To: solr-user@lucene.apache.org
Subject: Re: Is example-solr-home.jar synchronized with DataImportHandler
documentation?

It is scheduled to be released with the next release of Solr.
Shouldn't be too long before it becomes part of the trunk/nightly code. If
you find it useful, please do tell us here or vote/comment in the Jira
issue. Bug reports are welcome too :)

You can also add yourself as a watcher to the SOLR-469 issue. That way,
you'll be notified by email for all changes. You'd need to be registered on
Jira before you can become a watcher.

On Thu, May 22, 2008 at 10:10 PM, Julio Castillo <[EMAIL PROTECTED]>
wrote:
> Question on the status of the DataImportHandler.
> For the time being we are applying the recent patch.
>
> What are the plans for incorporating it as part of the nightly build 
> or at least part of the subversion tree?
>
> I just want to make sure that I get updates/fixes/enhancements to this 
> module when they occur as I update my tree.
>
> Thanks for all your work.
>
> ** julio
>
> -Original Message-
> From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED]
> Sent: Thursday, May 22, 2008 8:56 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Is example-solr-home.jar synchronized with 
> DataImportHandler documentation?
>
> Hi Alan,
>
> Yes, it is a bit out of date. Please try using the SOLR-469.patch 
> directly from the jira issue
>
> On Thu, May 22, 2008 at 9:23 PM, alan. <[EMAIL PROTECTED]> wrote:
>>
>> I downloaded example-solr-home.jar and was experimenting with 
>> '${dataimporter.functions.escapeSql(item.ID)}'
>>
>> It didn't work so I looked in dataimporter.jar and noticed that it 
>> didn't include classes for EvaluatorBag etal
>>
>> I'm assuming example-solr-home.jar on 
>> http://wiki.apache.org/solr/DataImportHandler is out of sync with the 
>> documentation on that page.
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Is-example-solr-home.jar-synchronized-with-DataImportHandler-documentation--tp17407305p17407305.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
>



--
Regards,
Shalin Shekhar Mangar.



Re: How to limit number of pages per domain

2008-05-22 Thread Otis Gospodnetic
You mean you don't understand the difference.  Here is an example of each:

1) field collapsing: http://www.google.com/search?q=lucene+in+action

Note how Google figures out that the first 2 hits are from the same site
(manning.com) and, after showing those 2 hits, offers "More results from
www.manning.com »".  That's field collapsing in action.  If it didn't collapse
hits, it might have to show many more hits from manning.com in a row on that
results page, and that would translate to a bad user experience (users want
diversity, too, not just pure relevance).

2) facets: 
http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=morricone

Note how the results are broken down by category on the left side of the page.  
Besides names of all categories that the results appear in, facets also show 
the number of items in each category.  This helps with browsing/navigation.
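
The two ideas can be sketched on a toy result list. This is plain Python for illustration only, not how Solr or the SOLR-236 patch implements either feature; the hits, the function names collapse and facet_counts, and the limit of 2 per site are all made up.

```python
from collections import Counter

# Hypothetical, already-ranked hits: (site, title).
hits = [
    ("manning.com", "Lucene in Action"),
    ("manning.com", "Lucene in Action, sample chapter"),
    ("manning.com", "Lucene in Action, source code"),
    ("apache.org", "Apache Lucene"),
    ("amazon.com", "Lucene in Action (paperback)"),
]


def collapse(hits, max_per_site=2):
    """Field collapsing: keep at most `max_per_site` hits per site, in rank order.

    Returns the kept hits and, per site, how many hits were hidden
    (the "More results from ..." count).
    """
    seen = Counter()
    kept = []
    for site, title in hits:
        if seen[site] < max_per_site:
            kept.append((site, title))
        seen[site] += 1
    hidden = {s: n - max_per_site for s, n in seen.items() if n > max_per_site}
    return kept, hidden


def facet_counts(hits):
    """Faceting: count how many hits fall under each value of the field."""
    return Counter(site for site, _ in hits)


kept, hidden = collapse(hits)
counts = facet_counts(hits)
```

Collapsing changes which hits are shown (the third manning.com hit is hidden), while faceting leaves the hits alone and only reports counts per value.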

Otis
 --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Jonathan Ariel <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, May 22, 2008 7:53:51 AM
> Subject: Re: How to limit number of pages per domain
> 
> Sorry, but I can't really understand the difference with facets.
> 
> On Thu, May 22, 2008 at 2:09 AM, Otis Gospodnetic <
> [EMAIL PROTECTED]> wrote:
> 
> > Actually, the best documentation is really the comments in the JIRA issue
> > itself.
> > Is there anyone actually using Solr with this patch?
> >
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> > - Original Message 
> > > From: Koji Sekiguchi 
> > > To: solr-user@lucene.apache.org
> > > Sent: Wednesday, May 21, 2008 6:26:48 PM
> > > Subject: Re: How to limit number of pages per domain
> > >
> > > > There is documentation:
> > >
> > > http://wiki.apache.org/solr/FieldCollapsing
> > >
> > > Koji
> > >
> > > Jonathan Ariel wrote:
> > > > > Sorry. But how does field collapsing work? Is there documentation
> > > > > about this anywhere? Thanks!
> > > >
> >
> >



Re: SOLR OOM (out of memory) problem

2008-05-22 Thread Otis Gospodnetic
Hi,

Seriously, try making that monster document cache smaller.  Sure, there will be 
more evictions and more cache misses, but at least you will be less likely to 
get OOMs :).
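
To illustrate the trade-off, here is a tiny bounded cache with least-recently-used eviction. It is a sketch of the general technique only, not Solr's cache code; the class name, the counters and the sizes are made up for the example.

```python
from collections import OrderedDict


class LRUCache:
    """A bounded cache that evicts the least recently used entry when full."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0
        self.evictions = 0

    def get(self, key, load):
        """Return the cached value for `key`, calling `load(key)` on a miss."""
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        value = load(key)
        self.entries[key] = value
        if len(self.entries) > self.max_size:
            self.entries.popitem(last=False)  # drop the least recently used
            self.evictions += 1
        return value


cache = LRUCache(max_size=2)
for doc_id in [1, 2, 1, 3, 1, 2]:
    cache.get(doc_id, load=lambda i: "doc-%d" % i)
```

Memory use is capped by max_size no matter how many distinct documents are requested; the cost shows up as extra misses and evictions, which is the trade described above.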


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: gurudev <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, May 22, 2008 7:27:44 AM
> Subject: RE: SOLR OOM (out of memory) problem
> 
> 
> Hi Rong,
> 
> My cache hit ratios are:
> 
> filtercache: 0.96
> documentcache: 0.51
> queryresultcache: 0.58
> 
> Thanx
> Pravesh
> 
> 
> Yongjun Rong-2 wrote:
> > 
> > I had the same problem a few weeks ago. You can try these:
> > 1. Check the cache hit ratios via solr/admin/stats.jsp. If the hit
> > ratio is very low, just disable those caches. It will save you some
> > memory.
> > 2. Setting -Xms and -Xmx to the same size will help improve GC performance.
> > 3. Check which GC you use. The default will be parallel. You can try the
> > concurrent GC, which will help a lot.
> > 4. These are my Sun HotSpot JVM startup options: -XX:+UseConcMarkSweepGC
> > -XX:CMSInitiatingOccupancyFraction=50 -XX:-UseGCOverheadLimit
> > The above cannot solve the OOM forever, but they help a lot.
> > I hope this helps.
> > 
> > -Original Message-
> > From: Mike Klaas [mailto:[EMAIL PROTECTED] 
> > Sent: Wednesday, May 21, 2008 2:23 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: SOLR OOM (out of memory) problem
> > 
> > 
> > On 21-May-08, at 4:46 AM, gurudev wrote:
> > 
> >>
> >> Just to add more:
> >>
> >> The JVM heap allocated is 6GB with initial heap size as 2GB. We use 
> >> quadro(which is 8 cpus) on linux servers for SOLR slaves.
> >> We use facet searches, sorting.
> >> document cache is set to 7 million (which is total documents in index)
> > 
> >> filtercache 1
> > 
> > You definitely don't have enough memory to keep 7 million documents,
> > fully realized in java-object form, in memory.
> > 
> > Nor would you want to.  The document cache should aim to keep the most
> > frequently occurring documents in memory (in the thousands, perhaps 10's
> > of thousands).  By devoting more memory to the OS disk cache, more of
> > the 12GB index can be cached by the OS and thus speed up all document
> > retrieval.
> > 
> > -Mike
> > 
> > 
> 
> -- 
> View this message in context: 
> http://www.nabble.com/SOLR-OOM-%28out-of-memory%29-problem-tp17364146p17402234.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Indexing HTML Content

2008-05-22 Thread Otis Gospodnetic
John,

Solr already has some of this stuff:

$ ff \*HTML\*java
./src/test/org/apache/solr/analysis/HTMLStripReaderTest.java
./src/java/org/apache/solr/analysis/HTMLStripStandardTokenizerFactory.java
./src/java/org/apache/solr/analysis/HTMLStripReader.java
./src/java/org/apache/solr/analysis/HTMLStripWhitespaceTokenizerFactory.java


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: "McBride, John" <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, May 22, 2008 4:44:23 AM
> Subject: Indexing HTML Content
> 
> Hello,
> 
> In my application I wish to index articles which are stored in HTML
> format.
> 
> Upon indexing these the html gets stored along with the content of the
> article, which is undesirable.
> 
> Do you know of any common way of parsing the text content from HTML
> before adding to SOLR?  I understand SOLR 1.3 has an HTML analyser, but
> I am using SOLR 1.2 and won't use 1.3 until it's stable, so looking for
> a solution to work on a batch of files before being added to SOLR.
> 
> Thanks,
> John



Re: Is example-solr-home.jar synchronized with DataImportHandler documentation?

2008-05-22 Thread Otis Gospodnetic
Julio, no worries, I'm 99% sure DIH is going to be in 1.3 and be in a nightly 
in a week or two.

 
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Julio Castillo <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, May 22, 2008 1:04:54 PM
> Subject: RE: Is example-solr-home.jar synchronized with DataImportHandler 
> documentation?
> 
> Thanks Shalin,
> I will add my vote to it that it becomes an integral part of the Solr ASAP.
> I don't know why indexing dB content is not the highest on the project's
> list.
> 
> I've seen other messages regarding the status of SOLR 1.3, so I'm not
> holding my breath there. My request is that it makes it to the nightly
> builds.
> 
> Thanks again
> 
> ** julio
> 
> -Original Message-
> From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, May 22, 2008 9:50 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Is example-solr-home.jar synchronized with DataImportHandler
> documentation?
> 
> It is scheduled to be released with the next release of Solr.
> Shouldn't be too long before it becomes part of the trunk/nightly code. If
> you find it useful, please do tell us here or vote/comment in the Jira
> issue. Bug reports are welcome too :)
> 
> You can also add yourself as a watcher to the SOLR-469 issue. That way,
> you'll be notified by email for all changes. You'd need to be registered on
> Jira before you can become a watcher.
> 
> On Thu, May 22, 2008 at 10:10 PM, Julio Castillo 
> wrote:
> > Question on the status of the DataImportHandler.
> > For the time being we are applying the recent patch.
> >
> > What are the plans for incorporating it as part of the nightly build 
> > or at least part of the subversion tree?
> >
> > I just want to make sure that I get updates/fixes/enhancements to this 
> > module when they occur as I update my tree.
> >
> > Thanks for all your work.
> >
> > ** julio
> >
> > -Original Message-
> > From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, May 22, 2008 8:56 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Is example-solr-home.jar synchronized with 
> > DataImportHandler documentation?
> >
> > Hi Alan,
> >
> > Yes, it is a bit out of date. Please try using the SOLR-469.patch 
> > directly from the jira issue
> >
> > On Thu, May 22, 2008 at 9:23 PM, alan. wrote:
> >>
> >> I downloaded example-solr-home.jar and was experimenting with 
> >> '${dataimporter.functions.escapeSql(item.ID)}'
> >>
> >> It didn't work so I looked in dataimporter.jar and noticed that it 
> >> didn't include classes for EvaluatorBag et al.
> >>
> >> I'm assuming example-solr-home.jar on 
> >> http://wiki.apache.org/solr/DataImportHandler is out of sync with the 
> >> documentation on that page.
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Is-example-solr-home.jar-synchronized-with-DataImportHandler-documentation--tp17407305p17407305.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
> >>
> >
> >
> >
> > --
> > Regards,
> > Shalin Shekhar Mangar.
> >
> >
> 
> 
> 
> --
> Regards,
> Shalin Shekhar Mangar.



Re: SOLR OOM (out of memory) problem

2008-05-22 Thread Mike Klaas


On 22-May-08, at 4:27 AM, gurudev wrote:



> Hi Rong,
>
> My cache hit ratios are:
>
> filtercache: 0.96
> documentcache: 0.51
> queryresultcache: 0.58


Note that you may be able to reduce the _size_ of the document cache  
without materially affecting the hit rate, since typically some  
documents are much more frequently accessed than others.


I'd suggest starting with 700k, which I would still consider a large  
cache.
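The point about cache size versus hit rate can be illustrated with a toy simulation. This is only a sketch, assuming a plain LRU cache and a made-up skewed access pattern (document i requested with weight 1/(i+1)); it is not Solr's cache implementation, and the names lru_hit_rate and skewed_accesses are hypothetical.

```python
from collections import OrderedDict
import random


def lru_hit_rate(accesses, capacity):
    """Replay document ids through an LRU cache and return the hit rate."""
    if not accesses:
        return 0.0
    cache = OrderedDict()
    hits = 0
    for doc_id in accesses:
        if doc_id in cache:
            hits += 1
            cache.move_to_end(doc_id)      # mark as most recently used
        else:
            cache[doc_id] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(accesses)


def skewed_accesses(num_docs, num_requests, seed=0):
    """Assumed workload: a few documents receive most of the requests."""
    rng = random.Random(seed)
    weights = [1.0 / (i + 1) for i in range(num_docs)]
    return rng.choices(range(num_docs), weights=weights, k=num_requests)


accesses = skewed_accesses(num_docs=10_000, num_requests=50_000)
capacities = (500, 1_000, 2_000, 4_000)
rates = [lru_hit_rate(accesses, c) for c in capacities]
for capacity, rate in zip(capacities, rates):
    print(capacity, round(rate, 3))
```

With a skewed workload like this one, each doubling of the cache size tends to buy a smaller gain in hit rate, which is why shrinking a very large document cache may cost little. Real traffic should be measured rather than assumed.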


-Mike



RE: Indexing HTML Content

2008-05-22 Thread Lance Norskog
The HTMLStripReader tool worked very well for us. It handles garbled HTML
well. The only hole we found was that it does not find alt-text attributes
for images.

Also, note that this code is written as a Java Reader class rather than a
Solr class. This makes it useful for other projects. Given the amount of
string processing it does, the fact that it is a Reader probably does not
affect its performance.
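For the batch pre-processing John asked about, the markup can also be stripped outside Solr before the documents are posted. The following is a minimal sketch using Python's standard html.parser module; it is not the HTMLStripReader code, and the names TextExtractor and strip_html are made up for illustration. It also keeps image alt text, the gap noted above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img":
            # Keep alt text so image descriptions stay searchable.
            alt = dict(attrs).get("alt")
            if alt:
                self._chunks.append(alt)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = '<p>Hello <b>world</b> &amp; all</p><script>var x = 1;</script><img src="a.png" alt="logo">'
print(strip_html(sample))
```

One caveat of this design: because chunks are joined with spaces, a word split by inline markup (such as `w<b>or</b>d`) comes out as separate tokens.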

Cheers,

Lance

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Thursday, May 22, 2008 10:14 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing HTML Content

John,

Solr already has some of this stuff:

$ ff \*HTML\*java
./src/test/org/apache/solr/analysis/HTMLStripReaderTest.java
./src/java/org/apache/solr/analysis/HTMLStripStandardTokenizerFactory.java
./src/java/org/apache/solr/analysis/HTMLStripReader.java
./src/java/org/apache/solr/analysis/HTMLStripWhitespaceTokenizerFactory.java


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: "McBride, John" <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, May 22, 2008 4:44:23 AM
> Subject: Indexing HTML Content
> 
> Hello,
> 
> In my application I wish to index articles which are stored in HTML 
> format.
> 
> Upon indexing these the html gets stored along with the content of the 
> article, which is undesirable.
> 
> Do you know of any common way of parsing the text content from HTML 
> before adding to SOLR?  I understand SOLR 1.3 has an HTML analyser, 
> but I am using SOLR 1.2 and won't use 1.3 until it's stable, so 
> looking for a solution to work on a batch of files before being added to
> SOLR.
> 
> Thanks,
> John




Re: DocSet to BitSet

2008-05-22 Thread Chris Hostetter

: One of the primary reasons that I was doing it this way is because I am 
: sending several filters, one is a big docset and others are BooleanQuery 
: objects (products in stock, etc.).

: Since the interface for SolrIndexSearcher.getDocListAndSet supports 
: only (Query, DocSet,...) or (Query, List<Query>,...), I was going to 

Just use SolrIndexSearcher.getDocSet(List<Query>) to compute a DocSet for 
your query "filters" and then intersect that with your existing DocSet.

: give it a list of filters. I haven't investigated further to see if 
: patching the Solr code to allow both methods (Query, List<Query>, 
: DocSet) would cause any problems. My guess is that it was done this way 
: for a reason.

The code is a bit hairy, but deep down in a private getDocListC method 
there is a note about how the method can only be used with either a 
"DocSet filter" or a "List<Query> filterList" but not both ... I don't 
remember why.

: Barring that solution, I will probably use the Query, DocSet method. I 
: have my DocSet for my bit-based filters in a single DocSet. And then I 
: can take my previous list of filter queries and add them onto the main 
: Query object that was created by the front-end. I'm not sure what this 

Assuming those other queries are fairly orthogonal, generating a separate 
DocSet for them (or one DocSet for each of them) will probably give you 
better cache hit ratios.
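The idea of computing one set per filter and intersecting it with an existing set can be sketched with plain Python sets. This is a toy model only: FilterCache, doc_set and apply_filters are hypothetical names, and none of it is the SolrIndexSearcher API. It is meant to show why separately cached filters get reused across requests.

```python
def doc_set(index, predicate):
    """Return the ids of all documents that satisfy one filter."""
    return frozenset(doc_id for doc_id, doc in index.items() if predicate(doc))


class FilterCache:
    """Caches one document set per named filter and counts hits and misses."""

    def __init__(self, index):
        self._index = index
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, name, predicate):
        if name in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[name] = doc_set(self._index, predicate)
        return self._cache[name]


def apply_filters(cache, existing, filters):
    """Intersect an existing document set with each cached filter set."""
    result = existing
    for name, predicate in filters:
        result = result & cache.get(name, predicate)
    return result


index = {
    1: {"in_stock": True, "brand": "acme"},
    2: {"in_stock": False, "brand": "acme"},
    3: {"in_stock": True, "brand": "other"},
    4: {"in_stock": True, "brand": "acme"},
}
filters = [
    ("in_stock", lambda doc: doc["in_stock"]),
    ("brand:acme", lambda doc: doc["brand"] == "acme"),
]
cache = FilterCache(index)
first = apply_filters(cache, frozenset({1, 2, 4}), filters)
second = apply_filters(cache, frozenset({2, 3, 4}), filters)
print(sorted(first), sorted(second), cache.hits, cache.misses)
```

Because each filter is cached under its own key, the second request reuses both sets even though its starting set differs. Folding the filters into the main query would instead produce a new combined query per request.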



-Hoss



Re: DocSet to BitSet

2008-05-22 Thread Kevin Osborn
That is more or less what I did. Once I found that function, it just took a 
small patch to expose that functionality, and then the problem was solved.


- Original Message 
From: Chris Hostetter <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, May 22, 2008 12:32:56 PM
Subject: Re: DocSet to BitSet


: One of the primary reasons that I was doing it this way is because I am 
: sending several filters, one is a big docset and others are BooleanQuery 
: objects (products in stock, etc.).

: Since the interface for SolrIndexSearcher.getDocListAndSet supports 
: only (Query, DocSet,...) or (Query, List<Query>,...), I was going to 

Just use SolrIndexSearcher.getDocSet(List<Query>) to compute a DocSet for 
your query "filters" and then intersect that with your existing DocSet.

: give it a list of filters. I haven't investigated further to see if 
: patching the Solr code to allow both methods (Query, List<Query>, 
: DocSet) would cause any problems. My guess is that it was done this way 
: for a reason.

The code is a bit hairy, but deep down in a private getDocListC method 
there is a note about how the method can only be used with either a 
"DocSet filter" or a "List<Query> filterList" but not both ... I don't 
remember why.

: Barring that solution, I will probably use the Query, DocSet method. I 
: have my DocSet for my bit-based filters in a single DocSet. And then I 
: can take my previous list of filter queries and add them onto the main 
: Query object that was created by the front-end. I'm not sure what this 

Assuming those other queries are fairly orthogonal, generating a separate 
DocSet for them (or one DocSet for each of them) will probably give you 
better cache hit ratios.



-Hoss

Re: DocSet to BitSet

2008-05-22 Thread Chris Hostetter

: That is more or less what I did. Once I found that function, it just 
: took a small patch to expose that functionality, and then the problem 
: was solved.

I'm not sure why you needed a patch at all ... 
SolrIndexSearcher.getDocSet(List<Query>) and getDocSet(Query) are both 
public methods, as is DocSet.intersection(DocSet).


-Hoss



Re: DocSet to BitSet

2008-05-22 Thread Kevin Osborn
In v1.3, it is public. In v1.2, it is still protected.


- Original Message 
From: Chris Hostetter <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, May 22, 2008 1:50:22 PM
Subject: Re: DocSet to BitSet


: That is more or less what I did. Once I found that function, it just 
: took a small patch to expose that functionality, and then the problem 
: was solved.

I'm not sure why you needed a patch at all ... 
SolrIndexSearcher.getDocSet(List<Query>) and getDocSet(Query) are both 
public methods, as is DocSet.intersection(DocSet).


-Hoss

Re[2]: the time factor

2008-05-22 Thread Chris Hostetter

: I'm not quite understanding how boost query works though. How does it
: "influence" the score exactly? Does it just simply append to the "q"
: param? From the wiki:

Essentially yes, but documents must match at least one clause of 
the "q"; matching the "bq" is optional (and when it happens, will result 
in a score increase accordingly).

: If this is how it works, it sounds like the bq will be used first
: to get a result set, then the result set will be sorted by q
: (relevance)?

No. bq doesn't influence what matches -- that's q -- bq only influences 
the scores of existing matches if they also match the bq.
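The distinction can be sketched with a toy scorer. This is an illustration only, assuming one point per matching q term and a flat bonus per matching bq term. Real Lucene scoring is more involved, and the function name search and the bq_boost value are made up.

```python
def search(docs, q_terms, bq_terms, bq_boost=2.0):
    """Toy model: q decides what matches, bq only adds to the score."""
    results = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        q_score = sum(1.0 for term in q_terms if term in words)
        if q_score == 0:
            continue  # no q clause matched, so bq alone never admits a document
        bq_score = sum(bq_boost for term in bq_terms if term in words)
        results.append((doc_id, q_score + bq_score))
    # Highest score first; ties broken by document id for a stable order.
    return sorted(results, key=lambda pair: (-pair[1], pair[0]))


docs = {
    "a": "solr highlighting",
    "b": "solr recent news",
    "c": "recent news only",
}
ranked = search(docs, q_terms=["solr"], bq_terms=["recent"])
print(ranked)
```

Document "c" matches the bq term but not q, so it is absent from the results. Document "b" matches both and ranks above "a".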



-Hoss



Re: Is example-solr-home.jar synchronized with DataImportHandler documentation?

2008-05-22 Thread alan.

Thanks for the new jar.  I ended up building solr+dataimporthandler but
an updated jar is a blessing for folks trying DataImportHandler.


I've updated the example-solr-home.jar on the DataImportHandler wiki
page with the latest code. Please let us know if you find any issues.

-- 
View this message in context: 
http://www.nabble.com/Is-example-solr-home.jar-synchronized-with-DataImportHandler-documentation--tp17407305p17416063.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re[3]: the time factor

2008-05-22 Thread JLIST
Hello Chris,

> : If this is how it works, it sounds like the bq will be used first
> : to get a result set, then the result set will be sorted by q
> : (relevance)?

> No. bq doesn't influence what matches -- that's q -- bq only influences
> the scores of existing matches if they also match the bq.

Hmm. Then it really works in a way that's similar to the recip
function. I wonder why the bq works a lot better. One reason
could be that with the bq query, the boost is directly linked
to dates, instead of leaving it to the recip function to work
out, which requires fine tuning.
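To see why the recip approach needs tuning, it helps to evaluate it by hand. The sketch below assumes Solr's documented definition recip(x,m,a,b) = a/(m*x+b), and that rord assigns 1 to the most recent value. The constants 1, 1000, 1000 and the sample dates are illustrative only.

```python
def recip(x, m, a, b):
    """Reciprocal function as Solr defines it: a / (m*x + b)."""
    return a / (m * x + b)


def rord(values):
    """Reverse ordinal: the largest (most recent) distinct value gets 1."""
    distinct = sorted(set(values), reverse=True)
    rank = {value: position + 1 for position, value in enumerate(distinct)}
    return [rank[value] for value in values]


dates = ["2008-05-20", "2008-05-22", "2008-05-21"]
ranks = rord(dates)
boosts = [recip(r, 1, 1000, 1000) for r in ranks]
for date, rank, boost in zip(dates, ranks, boosts):
    print(date, rank, round(boost, 4))

# With these constants the boost only halves at the 1000th distinct date.
print(recip(1000, 1, 1000, 1000))
```

With constants like these, neighbouring dates get nearly identical boosts, so the function barely separates recent documents from slightly older ones unless m, a and b are tuned to the index. A bq clause on a date range applies its boost directly to the documents in that range.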

Thanks,
Jack



solr sorting problem

2008-05-22 Thread pmg

I have a problem sorting Solr results. Here is my Solr config:

   
   
   


   
   
  


search query 

select/?&rows=100&start=0&q=artistId:100346%20AND%20type:track&sort=alphaTrackSort%20desc&fl=track

does not sort track.

Don't understand what is missing from config
-- 
View this message in context: 
http://www.nabble.com/solr-sorting-problem-tp17417394p17417394.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: solr sorting problem

2008-05-22 Thread pmg

I forgot to mention that I made changes to schema after indexing. 


pmg wrote:
> 
> I have a problem sorting Solr results. Here is my Solr config:
> 
>
>
> stored="true"/>
> 
> 
>
>
>   
> 
> 
> search query 
> 
> select/?&rows=100&start=0&q=artistId:100346%20AND%20type:track&sort=alphaTrackSort%20desc&fl=track
> 
> does not sort track.
> 
> Don't understand what is missing from config
> 

-- 
View this message in context: 
http://www.nabble.com/solr-sorting-problem-tp17417394p17417408.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: solr highlighting

2008-05-22 Thread Kevin Xiao
Just in case anyone wants to know: I figured out that you have to set uniqueKey 
stored="true" for highlighting to work. Thanks for everyone's help.
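For anyone checking the same setup, a highlighting request can be assembled as below. This is a minimal sketch: the hl, hl.fl and hl.fragsize parameters and the localhost URL come from earlier in this thread, while the helper name highlight_query_url and the query term are made up for illustration.

```python
from urllib.parse import urlencode


def highlight_query_url(base_url, query, field, fragsize=None):
    """Build a /select URL that asks Solr to highlight `field`."""
    params = [("q", query), ("hl", "true"), ("hl.fl", field)]
    if fragsize is not None:
        # hl.fragsize=0 was suggested earlier in the thread as a way to
        # get the whole field back instead of a short snippet.
        params.append(("hl.fragsize", str(fragsize)))
    return base_url + "/select?" + urlencode(params)


url = highlight_query_url("http://localhost:8983/solr", "anemia", "ABST", fragsize=0)
print(url)
```

The highlighted text comes back in a separate highlighting section of the response keyed by each document's uniqueKey value, which would explain why that field has to be stored.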

Thanks,
- Kevin

-Original Message-
From: Kevin Xiao [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 13, 2008 11:22 PM
To: solr-user@lucene.apache.org
Subject: solr highlighting

Hi there,

I am new to solr. I want search term to be highlighted on the results. I 
thought it is pretty simple, but could not make it work. I read a lot of solr 
documents and mail archives (I wish there is a search function for this, we are 
talking about solr, aren’t we? ☺).

Solrconfig.xml
  

 
   explicit
   100
   
PMID AUTH ARTICLE_prefix_token ABST
   
   
ARTICLE
   
   1
   
recip(rord(DATE_1),1,1000,1000)
   

  
 true
 ABST

 
  

We use a Java client, which is nothing but 
CommonsHttpSolrServer(“http://server:port/solr”), and populates the result to 
some data structure. Without the two lines for highlighter, I have results 
coming back, but after I add the two lines, the result is empty.

I also used the admin utility, but I saw the ABST values are unchanged, but at 
the end of document added:

−
−
−
-gestational anemia as a consequence of a reduction in the number of primitive 
erythroid cells. GATA-1 mRNA is



…

The content of ABST of highlighting is much smaller than that of the original. 
I am guessing it tries to find the highlighted term’s position. So what should 
I do to get the highlighted ABST?

Thanks,
- Kevin