Does Solr Have?

2007-10-04 Thread Robert Young
Hi,

We're just about to start work on a project in Solr and there are a
couple of points which I haven't been able to find out from the wiki
which I'm interested in.

1. Is there a REST interface for getting index stats? I would
particularly like access to terms and their document frequencies,
preferably filtered by a query.

2. Is it possible to use different synonym sets for different queries
OR is it possible to search across multiple indexes with a single
query?

3. Is it possible to change stopword and synonym sets at runtime?

I'm sure I'll have lots more questions as time goes by and, hopefully,
I'll be able to answer others' questions in the future.

Thanks
Rob


Re: Searching combined English-Japanese index

2007-10-04 Thread Maximilian Hütter
You were right, the indexing is already wrong. I debugged Solr and saw
that the indexwriter gets the wrong values. That was because of the
missing Content-Type in the update-requests. It was just text/xml
without the charset=utf-8. So it was interpreted as ISO-8859-1, I think.
Changing the charset to utf-8 fixed the index. The xml had the encoding
set but Solr seems to ignore that.

Limo really just seems to convert it back correctly by chance.

Thanks for the help! Now I just have to figure out how to correctly
encode the query string...
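For reference, the failure mode described above (and the query-string encoding still to be figured out) can be sketched in a few lines of Python; the example word is arbitrary:

```python
from urllib.parse import urlencode

# UTF-8 bytes misread as ISO-8859-1 produce mojibake -- this is what
# happens when an update request omits charset=utf-8 in Content-Type.
mojibake = "Hütter".encode("utf-8").decode("iso-8859-1")
assert mojibake == "HÃ¼tter"   # the index would now hold garbage

# On the query side, percent-encode the UTF-8 bytes of each parameter:
assert urlencode({"q": "Hütter"}, encoding="utf-8") == "q=H%C3%BCtter"
```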

Best regards,

Max

-- 
Maximilian Hütter
blue elephant systems GmbH
Wollgrasweg 49
D-70599 Stuttgart

Tel:  (+49) 0711 - 45 10 17 578
Fax:  (+49) 0711 - 45 10 17 573
e-mail :  [EMAIL PROTECTED]
Registered office:  Stuttgart, Amtsgericht Stuttgart, HRB 24106
Managing directors:  Joachim Hörnle, Thomas Gentsch, Holger Dietrich


how to make sure a particular query is ALWAYS cached

2007-10-04 Thread Britske

I want a couple of costly queries to be cached at all times in the
queryResultCache. (unless I have a new searcher of course) 

As far as I know the only parameters to be supplied to the
LRU-implementation of the queryResultCache are size-related, which doesn't
give me this guarantee. 

what would be my best bet to implement this functionality with the least
impact?
1. use User/Generic-cache. This would result in a separate coding path in
the application, which I would like to avoid. 
2. extend LRU-cache, and extend request-handler so that a query can be
extended with a parameter indicating that it should be cached at all times.
However, this seems like a lot of cluttering-up these interfaces, for a
relatively small change. 
3. another option..

best regards,
Geert-Jan
-- 
View this message in context: 
http://www.nabble.com/how-to-make-sure-a-particular-query-is-ALWAYS-cached-tf4566711.html#a13035381
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Does Solr Have?

2007-10-04 Thread Erik Hatcher


On Oct 4, 2007, at 4:38 AM, Robert Young wrote:

1. Is there a REST interface for getting index stats? I would
particularly like access to terms and their document frequencies,
preferably filtered by a query.


Yes, the Luke request handler provides deeper views into the index
information:

  <http://wiki.apache.org/solr/LukeRequestHandler>


2. Is it possible to use different synonym sets for different queries


Not exactly, but it is possible to have different synonym configurations  
for different field types configured in the schema.



OR is it possible to search across multiple indexes with a single
query?


Not with the current version of Solr, but there is a  
"federated" ("distributed" is a better term, I think) Solr patch in  
JIRA.



3. Is it possible to change stopword and synonym sets at runtime?


Only if the underlying text file is changed.


I'm sure I'll have lots more questions as time goes by and, hopefully,
I'll be able to answer others' questions in the future.


Welcome!

Erik



Re: Does Solr Have?

2007-10-04 Thread Robert Young
Brilliant, thank you, that LukeRequestHandler looks very useful.

On 10/4/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> > 3. Is it possible to change stopword and synonym sets at runtime?
>
> Only if the underlying text file is changed.

Will Solr automatically reload the file if it changes or does it have
to be informed of the change? Is changing the underlying file while
Solr is running dangerous?

Cheers
Rob


Re: how to make sure a particular query stays cached (and is not overwritten)

2007-10-04 Thread Britske

the title of my original post was misleading. 

// Geert-Jan


Britske wrote:
> 
> I want a couple of costly queries to be cached at all times in the
> queryResultCache. (unless I have a new searcher of course) 
> 
> As far as I know the only parameters to be supplied to the
> LRU-implementation of the queryResultCache are size-related, which doesn't
> give me this guarantee. 
> 
> what would be my best bet to implement this functionality with the least
> impact?
> 1. use User/Generic-cache. This would result in a separate coding path in
> the application, which I would like to avoid. 
> 2. extend LRU-cache, and extend request-handler so that a query can be
> extended with a parameter indicating that it should be cached at all
> times. However, this seems like a lot of cluttering-up these interfaces,
> for a relatively small change. 
> 3. another option..
> 
> best regards,
> Geert-Jan
> 

-- 
View this message in context: 
http://www.nabble.com/how-to-make-sure-a-particular-query-is-ALWAYS-cached-tf4566711.html#a13037820
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Does Solr Have?

2007-10-04 Thread Erik Hatcher


On Oct 4, 2007, at 6:10 AM, Robert Young wrote:


Brilliant, thank you, that LukeRequestHandler looks very useful.

On 10/4/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:

3. Is it possible to change stopword and synonym sets at runtime?


Only if the underlying text file is changed.


Will Solr automatically reload the file if it changes or does it have
to be informed of the change?


I'll expose my confusion here and say that I don't know for sure, but  
I'm pretty sure that once it's been loaded it won't get reloaded  
without bouncing Solr altogether.  solr-devs, please correct me with  
some code pointers if I'm wrong.  In my brief IntelliJ surfing this  
morning, it seems that once a SolrCore has been instantiated, that  
instance sticks around forever, and it holds the IndexSchema instance.


I'm guessing that the new multi-core stuff makes this a bit more  
dynamically controllable.



Is changing the underlying file while
Solr is running dangerous?


No - it only reads it once as far as I can tell.

Caveat to all technical details above: I'm still learning the ropes  
with the core of Solr so if I misspoke on any of this lifecycle  
stuff, let me know.  By trying to answer questions, though, I learn  
quicker :)


Erik





Replication

2007-10-04 Thread Eric Treece
Hello All,

I am interested in some of the joys, tribulations and processes of running a 
replicated Solr environment. Can anyone point to any particular links, 
documents and/or personal experiences.

Thanks,
Eric Treece
[EMAIL PROTECTED]



Re: Does Solr Have?

2007-10-04 Thread Ryan McKinley

Robert Young wrote:

Hi,

We're just about to start work on a project in Solr and there are a
couple of points which I haven't been able to find out from the wiki
which I'm interested in.

1. Is there a REST interface for getting index stats? I would
particularly like access to terms and their document frequencies,
preferably filtered by a query.



have you checked the luke request handler?
http://wiki.apache.org/solr/LukeRequestHandler



2. Is it possible to use different synonym sets for different queries


the synonyms are linked to each field -- if different queries access 
different fields, they will use different synonyms.


To automatically index things with different field types, check the 
<dynamicField> stuff.



OR is it possible to search across multiple indexes with a single
query?



not currently.



3. Is it possible to change stopword and synonym sets at runtime?



By default no - but it is not hard to write a custom field type that 
would.  (I have one that loads synonyms from a SQL table -- it can be 
updated dynamically at runtime)




I'm sure I'll have lots more questions as time goes by and, hopefully,
I'll be able to answer others' questions in the future.



great!

ryan



Re: Replication

2007-10-04 Thread Erik Hatcher
Eric - there is tons here <http://wiki.apache.org/solr/CollectionDistribution>


Start there and hit us up here if anything is amiss.

Erik


On Oct 4, 2007, at 9:05 AM, Eric Treece wrote:


Hello All,

I am interested in some of the joys, tribulations and processes of  
running a replicated Solr environment. Can anyone point to any  
particular links, documents and/or personal experiences.


Thanks,
Eric Treece
[EMAIL PROTECTED]





Re: Does Solr Have?

2007-10-04 Thread Ryan McKinley

dooh, should check all my email first!



Will Solr automatically reload the file if it changes or does it have
to be informed of the change?


I'll expose my confusion here and say that I don't know for sure, but 
I'm pretty sure that once it's been loaded it won't get reloaded without 
bouncing Solr altogether.  


Correct.  The StopFilterFactory is initialized at startup, any changes 
to the file won't take effect 'till solr restarts.


but you can write a custom FilterFactory based on StopFilterFactory that 
lets you change it dynamically.  Most likely this would also require 
writing a custom RequestHandler to manipulate it.
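The reload-on-change idea behind such a custom factory can be sketched language-agnostically; the Python below only illustrates the pattern (a real Solr FilterFactory would be Java, and none of the names here are Solr APIs):

```python
import os

class ReloadableStopwords:
    """Sketch of a reload-on-change stop-word source: re-read the
    stop-word file whenever its modification time changes."""

    def __init__(self, path: str):
        self.path = path
        self.mtime = 0.0
        self.words = set()
        self.refresh()

    def refresh(self):
        mtime = os.path.getmtime(self.path)
        if mtime != self.mtime:
            # one word per line, blank lines ignored
            with open(self.path, encoding="utf-8") as f:
                self.words = {line.strip() for line in f if line.strip()}
            self.mtime = mtime

    def is_stopword(self, token: str) -> bool:
        self.refresh()   # one stat() per call; could be throttled
        return token.lower() in self.words
```

As the note below says, this would only change query-time behaviour; already-indexed documents keep whatever analysis they got at index time.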


note - changing the stop words at runtime will only affect queries, the 
index will keep whatever was there at index time.


ryan


Re: Solr live at Netflix

2007-10-04 Thread Otis Gospodnetic
I'm curious about this one.  I'm assuming Porter stemmer would stem Gamers and 
Gamera to the same stem (Game?).  If the stems are different, which stemmer are 
you using?  A smarter custom morphological stemmer?

Thanks,
Otis

- Original Message 
From: Tom Hill <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, October 2, 2007 8:16:18 PM
Subject: Re: Solr live at Netflix

Nice!

And there seem to be some improvements. For example, "Gamers" and "Gamera"
no longer stem to the same word :-)

Tom

On 10/2/07, Walter Underwood <[EMAIL PROTECTED]> wrote:
>
> Here at Netflix, we switched over our site search to Solr two weeks ago.
> We've seen zero problems with the server. We average 1.2 million
> queries/day on a 250K item index. We're running four Solr servers
> with simple round-robin HTTP load-sharing.
>
> This is all on 1.1. I've been too busy tuning to upgrade.
>
> Thanks everyone, this is a great piece of software.
>
> wunder
> --
> Walter Underwood
> Search Guy, Netflix
>
>





RE: Solr live at Netflix

2007-10-04 Thread Wagner,Harry
Otis,
Take a look at KStem:
http://ciir.cs.umass.edu/cgi-bin/downloads/downloads.cgi  It's less
aggressive than Porter.  I modified the Lucene version to work with
Solr, but don't know if it was adopted into the Solr source.  Let me
know if you are interested and I'll send you a jar file.

Cheers!
harry

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Thursday, October 04, 2007 10:36 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr live at Netflix

I'm curious about this one.  I'm assuming Porter stemmer would stem
Gamers and Gamera to the same stem (Game?).  If the stems are different,
which stemmer are you using?  A smarter custom morphological stemmer?

Thanks,
Otis

- Original Message 
From: Tom Hill <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, October 2, 2007 8:16:18 PM
Subject: Re: Solr live at Netflix

Nice!

And there seem to be some improvements. For example, "Gamers" and
"Gamera"
no longer stem to the same word :-)

Tom

On 10/2/07, Walter Underwood <[EMAIL PROTECTED]> wrote:
>
> Here at Netflix, we switched over our site search to Solr two weeks
ago.
> We've seen zero problems with the server. We average 1.2 million
> queries/day on a 250K item index. We're running four Solr servers
> with simple round-robin HTTP load-sharing.
>
> This is all on 1.1. I've been too busy tuning to upgrade.
>
> Thanks everyone, this is a great piece of software.
>
> wunder
> --
> Walter Underwood
> Search Guy, Netflix
>
>





Re: Solr live at Netflix

2007-10-04 Thread Walter Underwood
Gamera and Gamers do not stem to the same word, but the old Netflix
engine did conflate those two words. The Metaphones for those are
KMR and KMRS, respectively, and the old engine did fuzzy matching
on Metaphones, something I don't recommend. It also matched "skiing"
to "sings".
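To make the conflation concrete, here is a small illustration; the Metaphone keys are the ones quoted above, while the edit-distance fuzzy match merely stands in for whatever the old engine actually did:

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

keys = {"gamera": "KMR", "gamers": "KMRS"}

# Exact key comparison keeps the words apart ...
assert keys["gamera"] != keys["gamers"]

# ... but any fuzzy match within edit distance 1 conflates them,
# because phonetic keys are short and information-poor.
assert edit_distance(keys["gamera"], keys["gamers"]) == 1
```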

wunder

On 10/4/07 7:35 AM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:

> I'm curious about this one.  I'm assuming Porter stemmer would stem Gamers and
> Gamera to the same stem (Game?).  If the stems are different, which stemmer
> are you using?  A smarter custom morphological stemmer?
> 
> Thanks,
> Otis
> 
> - Original Message 
> From: Tom Hill <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, October 2, 2007 8:16:18 PM
> Subject: Re: Solr live at Netflix
> 
> Nice!
> 
> And there seem to be some improvements. For example, "Gamers" and "Gamera"
> no longer stem to the same word :-)
> 
> Tom
> 
> On 10/2/07, Walter Underwood <[EMAIL PROTECTED]> wrote:
>> 
>> Here at Netflix, we switched over our site search to Solr two weeks ago.
>> We've seen zero problems with the server. We average 1.2 million
>> queries/day on a 250K item index. We're running four Solr servers
>> with simple round-robin HTTP load-sharing.
>> 
>> This is all on 1.1. I've been too busy tuning to upgrade.
>> 
>> Thanks everyone, this is a great piece of software.
>> 
>> wunder
>> --
>> Walter Underwood
>> Search Guy, Netflix




Re: Does Solr Have?

2007-10-04 Thread Robert Young
Is there, or are there plans to start, a plugin and extension repository?

Cheers
Rob

On 10/4/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> dooh, should check all my email first!
>
> >>
> >> Will Solr automatically reload the file if it changes or does it have
> >> to be informed of the change?
> >
> > I'll expose my confusion here and say that I don't know for sure, but
> > I'm pretty sure that once it's been loaded it won't get reloaded without
> > bouncing Solr altogether.
>
> Correct.  The StopFilterFactory is initialized at startup, any changes
> to the file won't take effect 'till solr restarts.
>
> but you can write a custom FilterFactory based on StopFilterFactory that
> lets you change it dynamically.  Most likely this would also require
> writing a custom RequestHandler to manipulate it.
>
> note - changing the stop words at runtime will only affect queries, the
> index will keep whatever was there at index time.
>
> ryan
>


Re: how to make sure a particular query is ALWAYS cached

2007-10-04 Thread Chris Hostetter

: I want a couple of costly queries to be cached at all times in the
: queryResultCache. (unless I have a new searcher of course) 

first off: you can ensure that certain queries are in the cache, even if 
there is a newSearcher, just configure a newSearcher Event Listener that 
forcibly warms the queries you care about.

(this is particularly handy to ensure FieldCache gets populated before any 
user queries are processed)
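The listener mentioned above is configured in solrconfig.xml; this sketch follows the shape of the example config that ships with Solr (the query values are placeholders):

```xml
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- each <lst> is one warming query run against the new searcher -->
    <lst>
      <str name="q">your costly query here</str>
      <str name="start">0</str>
      <str name="rows">10</str>
    </lst>
  </arr>
</listener>
```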

Second: if i understand correctly, you want a way to put an object in the 
cache, and guarantee that it's always in the cache, even if other objects 
are more frequently used or more recently used?

that's kind of a weird use case ... can you elaborate a little more on 
what exactly your end goal is?


the most straightforward approach i can think of would be a new cache 
implementation that "permanently" stores the first N items you put in it.  
that in combination with the newSearcher warming i described above should 
work.
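A minimal sketch of that idea in Python (the names and sizes are made up; an actual Solr cache would be a Java class implementing Solr's cache interface, and thread safety is ignored here):

```python
from collections import OrderedDict

class PinnedFirstNLRUCache:
    """Permanently keeps the first `pin` entries inserted;
    everything else lives in an ordinary LRU section."""

    def __init__(self, pin: int, size: int):
        self.pin = pin
        self.pinned = {}            # first `pin` entries, never evicted
        self.lru = OrderedDict()    # remaining entries, LRU-evicted
        self.size = size            # max size of the LRU section

    def put(self, key, value):
        if len(self.pinned) < self.pin and key not in self.lru:
            self.pinned[key] = value
            return
        self.lru[key] = value
        self.lru.move_to_end(key)
        if len(self.lru) > self.size:
            self.lru.popitem(last=False)   # evict least recently used

    def get(self, key):
        if key in self.pinned:
            return self.pinned[key]
        if key in self.lru:
            self.lru.move_to_end(key)      # refresh recency
            return self.lru[key]
        return None
```

Combined with a newSearcher warming listener that inserts the costly queries first, those entries would then survive any amount of unrelated traffic.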

: 1. use User/Generic-cache. This would result in a separate coding path in
: the application, which I would like to avoid. 
: 2. extend LRU-cache, and extend request-handler so that a query can be
: extended with a parameter indicating that it should be cached at all times.
: However, this seems like a lot of cluttering-up these interfaces, for a
: relatively small change. 

#1 wouldn't really accomplish what you want without #2 as well.




-Hoss



Re: Re: Replication

2007-10-04 Thread ycrux
Hi Eric !

I can help on that if you know, even If I'm new to Solr.
What are you planning to do ?

cheers
Younès

Original message
>From: Erik Hatcher 
>Subject: Re: Replication
>Date: Thu, 4 Oct 2007 09:09:03 -0400
>To: solr-user@lucene.apache.org
>
>Eric - there is tons here <http://wiki.apache.org/solr/CollectionDistribution>
>
>Start there and hit us up here if anything is amiss.
>
>   Erik
>
>
>On Oct 4, 2007, at 9:05 AM, Eric Treece wrote:
>
>> Hello All,
>>
>> I am interested in some of the joys, tribulations and processes of  
>> running a replicated Solr environment. Can anyone point to any  
>> particular links, documents and/or personal experiences.
>>
>> Thanks,
>> Eric Treece
>> [EMAIL PROTECTED]
>>
>
>



Real-time replication

2007-10-04 Thread John Reuning
Apologies if this has been covered.  I searched the archives and didn't 
see a thread on this topic.


Has anyone experimented with a near real-time replication scheme similar 
to RDBMS replication?  There's a large efficiency gain in using rsync to copy 
the lucene index files to slaves, but what if you want index changes to 
propagate in a few seconds instead of a few minutes?


Is it feasible to make a solr manager take update requests and send them 
to slaves as it receives them?  (I guess maybe they're not really slaves 
in this case.)  The manager might issue commits every 10-30 seconds to 
reduce the write load.  Write overhead still exists on all read servers, 
but at least the read requests are spread across the pool.


Thanks,

-John R.


Re: Does Solr Have?

2007-10-04 Thread Matthew Runo
How does one set up the LukeRequestHandler? I didn't see a document  
in the wiki about how to add new handlers, and my install (a 1.1  
install upgraded to 1.2) does not have this handler available.


I'd like to see what we're talking about, it sounds very  
interesting.. but I can't find how to turn on this request handler.


++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++


On Oct 4, 2007, at 6:11 AM, Ryan McKinley wrote:


Robert Young wrote:

Hi,
We're just about to start work on a project in Solr and there are a
couple of points which I haven't been able to find out from the wiki
which I'm interested in.
1. Is there a REST interface for getting index stats? I would
particularly like access to terms and their document frequencies,
preferably filtered by a query.


have you checked the luke request handler?
http://wiki.apache.org/solr/LukeRequestHandler



2. Is it possible to use different synonym sets for different queries


the synonyms are linked to each field -- if different queries  
access different fields, they will use different synonyms.


To automatically index things with different field types, check the  
<dynamicField> stuff.



OR is it possible to search across multiple indexes with a single
query?


not currently.



3. Is it possible to change stopword and synonym sets at runtime?


By default no - but it is not hard to write a custom field type  
that would.  (I have one that loads synonyms from a SQL table -- it  
can be updated dynamically at runtime)



I'm sure I'll have lots more questions as time goes by and,  
hopefully,

I'll be able to answer others' questions in the future.


great!

ryan





Re: Real-time replication

2007-10-04 Thread Matthew Runo
The only problem that I see possibly happening is that you may end up  
committing more often than SOLR can open/prewarm new searchers. This  
happens in the peak of the day on our servers - leaving us with 5-10  
searchers just hanging out waiting for prewarming to finish - only to be  
closed as soon as they're registered because there's already another  
searcher waiting behind them.


That said, I need to tune my cache. A lot.

++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++


On Oct 4, 2007, at 9:07 AM, John Reuning wrote:

Apologies if this has been covered.  I searched the archives and  
didn't see a thread on this topic.


Has anyone experimented with a near real-time replication scheme  
similar to RDBMS replication?  There's large efficiency in using  
rsync to copy the lucene index files to slaves, but what if you  
want index changes to propagate in a few seconds instead of a few  
minutes?


Is it feasible to make a solr manager take update requests and send  
them to slaves as it receives them?  (I guess maybe they're not  
really slaves in this case.)  The manager might issue commits every  
10-30 seconds to reduce the write load.  Write overhead still  
exists on all read servers, but at least the read requests are  
spread across the pool.


Thanks,

-John R.





Re: Real-time replication

2007-10-04 Thread Walter Underwood
We don't use Solr replication. Each server is independent and
does its own indexing. This has several advantages:

* all installations are identical
* no single point of failure
* no inter-server version or config dependencies
* we can run a different version or config on one server for testing

The drawbacks are:

* 4X the DB accesses to get content
* each server has a CPU spike during indexing (we stagger that)

When we finally move to Solr 1.2 (or 1.3 if we wait long enough),
we can install it on one server and watch the performance. No need
to worry about different versions of Lucene.

Matthew: If your Searchers are only open for a short while, don't pre-warm.
Pre-warming is an optimization, not a necessity.

wunder


On 10/4/07 9:32 AM, "Matthew Runo" <[EMAIL PROTECTED]> wrote:

> The only problem that I see possibly happening is that you may end up
> committing more often than SOLR can open/prewarm new searchers. This
> happens in the peak of the day on our servers - leaving us with 5-10
> searchers just hanging out waiting for prewarming to finish - only to be
> closed as soon as they're registered because there's already another
> searcher waiting behind them.
> 
> That said, I need to tune my cache. A lot.
> 
> ++
>   | Matthew Runo
>   | Zappos Development
>   | [EMAIL PROTECTED]
>   | 702-943-7833
> ++
> 
> 
> On Oct 4, 2007, at 9:07 AM, John Reuning wrote:
> 
>> Apologies if this has been covered.  I searched the archives and
>> didn't see a thread on this topic.
>> 
>> Has anyone experimented with a near real-time replication scheme
>> similar to RDBMS replication?  There's large efficiency in using
>> rsync to copy the lucene index files to slaves, but what if you
>> want index changes to propagate in a few seconds instead of a few
>> minutes?
>> 
>> Is it feasible to make a solr manager take update requests and send
>> them to slaves as it receives them?  (I guess maybe they're not
>> really slaves in this case.)  The manager might issue commits every
>> 10-30 seconds to reduce the write load.  Write overhead still
>> exists on all read servers, but at least the read requests are
>> spread across the pool.
>> 
>> Thanks,
>> 
>> -John R.
>> 
> 



Re: Does Solr Have?

2007-10-04 Thread Ryan McKinley

add:
  <requestHandler name="/admin/luke"
       class="org.apache.solr.handler.admin.LukeRequestHandler" />


to your solrconfig.xml

It is in the example solrconfig.xml that comes with 1.2


Matthew Runo wrote:
How does one set up the LukeRequestHandler? I didn't see a document in 
the wiki about how to add new handlers, and my install (a 1.1 install 
upgraded to 1.2) does not have this handler available.


I'd like to see what we're talking about, it sounds very interesting.. 
but I can't find how to turn on this request handler.


++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++


On Oct 4, 2007, at 6:11 AM, Ryan McKinley wrote:


Robert Young wrote:

Hi,
We're just about to start work on a project in Solr and there are a
couple of points which I haven't been able to find out from the wiki
which I'm interested in.
1. Is there a REST interface for getting index stats? I would
particularly like access to terms and their document frequencies,
preferably filtered by a query.


have you checked the luke request handler?
http://wiki.apache.org/solr/LukeRequestHandler



2. Is it possible to use different synonym sets for different queries


the synonyms are linked to each field -- if different queries access 
different fields, they will use different synonyms.


To automatically index things with different field types, check the 
<dynamicField> stuff.



OR is it possible to search across multiple indexes with a single
query?


not currently.



3. Is it possible to change stopword and synonym sets at runtime?


By default no - but it is not hard to write a custom field type that 
would.  (I have one that loads synonyms from a SQL table -- it can be 
updated dynamically at runtime)




I'm sure I'll have lots more questions as time goes by and, hopefully,
I'll be able to answer others' questions in the future.


great!

ryan








Re: Does Solr Have?

2007-10-04 Thread Matthew Runo
Boo, thank you for the reply. That's what I get for customizing it  
and taking out all the other code I guess. Sorry about that.


++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++


On Oct 4, 2007, at 9:47 AM, Ryan McKinley wrote:


add:
  <requestHandler name="/admin/luke"
       class="org.apache.solr.handler.admin.LukeRequestHandler" />


to your solrconfig.xml

It is in the example solrconfig.xml that comes with 1.2


Matthew Runo wrote:
How does one set up the LukeRequestHandler? I didn't see a  
document in the wiki about how to add new handlers, and my install  
(a 1.1 install upgraded to 1.2) does not have this handler available.
I'd like to see what we're talking about, it sounds very  
interesting.. but I can't find how to turn on this request handler.

++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++
On Oct 4, 2007, at 6:11 AM, Ryan McKinley wrote:

Robert Young wrote:

Hi,
We're just about to start work on a project in Solr and there are a
couple of points which I haven't been able to find out from the  
wiki

which I'm interested in.
1. Is there a REST interface for getting index stats? I would
particularly like access to terms and their document frequencies,
preferably filtered by a query.


have you checked the luke request handler?
http://wiki.apache.org/solr/LukeRequestHandler


2. Is it possible to use different synonym sets for different  
queries


the synonyms are linked to each field -- if different queries  
access different fields, they will use different synonyms.


To automatically index things with different field types, check  
the <dynamicField> stuff.



OR is it possible to search across multiple indexes with a single
query?


not currently.



3. Is it possible to change stopword and synonym sets at runtime?


By default no - but it is not hard to write a custom field type  
that would.  (I have one that loads synonyms from a SQL table --  
it can be updated dynamically at runtime)



I'm sure I'll have lots more questions as time goes by and,  
hopefully,

I'll be able to answer others' questions in the future.


great!

ryan







facet and field collapse

2007-10-04 Thread Xuesong Luo
Hi, there,

Our index stores employee working history information. For each
employee, there could be multiple index records. The requirement is:

1.  The search result should be sorted on score.
2.  Each employee should only appear once regardless of how many matches
are found.
3.  The result should support pagination.

 

For example, if the original search result is:

 

doc=1, id=A, score=100
doc=2, id=A, score=90
doc=3, id=B, score=80
doc=4, id=A, score=70
doc=5, id=B, score=69
doc=6, id=B, score=68
doc=7, id=C, score=60
doc=8, id=C, score=59
doc=9, id=D, score=59
doc=10, id=E, score=58
...
doc=206, id=B, score=40
.
 
We want the final result to be:
 
Doc=1, id=A
Doc=3, id=B
Doc=7, id=C
Doc=9, id=D
Doc=10, id=E
 
If the user wants to see record 2-4, he/she should get
 
Doc=3, id=B
Doc=7, id=C
Doc=9, id=D
 
 
I tried both facet and field collapse; it seems neither satisfies our
requirement.
The problem with facets is that they can only be sorted on count or
alphabetically. If sorted on counts, B will be returned before
A. 
The problem with field collapse is that its pagination is based on docs,
not on groups. In my pagination example, docs 2-4 will be returned instead
of 3, 7, 9. 
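For what it's worth, the desired collapsing can be sketched client-side using the example data above; it also shows why doc-based pagination hurts — you must fetch enough raw rows to fill one page of groups:

```python
def collapse_by_id(results):
    """Keep only the first (highest-scoring) doc per id.
    `results` must already be sorted by descending score."""
    seen, collapsed = set(), []
    for doc, id_, score in results:
        if id_ not in seen:
            seen.add(id_)
            collapsed.append((doc, id_))
    return collapsed

# the example result list from the message above
results = [
    (1, "A", 100), (2, "A", 90), (3, "B", 80), (4, "A", 70),
    (5, "B", 69), (6, "B", 68), (7, "C", 60), (8, "C", 59),
    (9, "D", 59), (10, "E", 58),
]

collapsed = collapse_by_id(results)
print(collapsed)       # first doc per id, in score order
print(collapsed[1:4])  # "records 2-4" paginated over groups, not docs
```

The catch, of course, is that doing this in the client means over-fetching from Solr, since there is no way to know in advance how many raw docs cover a given page of groups.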
 
Does anyone have similar experience before? Any suggestions? 
 
Thanks
Xuesong
 

 



Solr - Lucene Query

2007-10-04 Thread Jae Joo




In the schema.xml, this field is defined by
<field name="trade1" type="text" indexed="true" />


Is there any way to find the document by querying - The Appraisal Station?


Thanks,
Jae


RE: question about bi-gram analysis on query

2007-10-04 Thread Teruhiko Kurosaka
Hello David,
> And if I do a search in Luke and the solr analysis page 
> for 美聯, I get a hit.  But on the actual search, I don't.

I think you need to tell us what you mean by "actual search"
and your code that interfaces with Solr.

-kuro


Re: Indexing HTML

2007-10-04 Thread Mike Klaas

On 3-Oct-07, at 3:26 AM, Ravish Bhagdev wrote:



Because of this I cannot present the resulting html in a webpage.  Is
it possible to strip out all HTML tags completely in result set?
Would you recommend sending stripped out text to solr instead?  But
doesn't Solr use HTML features while searching (anchors/titles etc).

Why is there no documentation about indexing HTML specifically using
solr.  How does nutch do it?  does it strip out html in the snippets
it returns?


Solr isn't a web search engine, and doesn't do any special processing  
of html (although you can ask it to strip html if you want).


I recommend stripping the html yourself, and putting titles, anchors,  
etc in separate fields.


I believe that it would be possible to write this as a Solr update- 
handler plugin, if you wanted it to all run in one place.


-Mike


Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Mike Klaas


On 2-Oct-07, at 12:52 AM, Ravish Bhagdev wrote:





I see that you're using the HTML analyzer.  Unfortunately that does  
not play very well with highlighting at the moment. You may get  
garbled output.


-Mike


Re: Solr - Lucene Query

2007-10-04 Thread Mike Klaas

On 4-Oct-07, at 11:07 AM, Jae Joo wrote:






In the schema.xml, this field is defined by <field name="trade1" type="text"
indexed="true"  />



Is there any way to find the document by querying - The Appraisal  
Station?


sure: if you query trade1:(the appraisal station), that document  
should be found.  If you want _only_ that document to match, you  
should try something like a phrase query with a bit of slop:


trade1:"the appraisal station"~10
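Assembled as an HTTP request parameter, that query would look roughly like this (host, port, and handler path are assumptions for illustration):

```python
from urllib.parse import urlencode

# percent-encode the colon, quotes, and spaces of the slop query
params = urlencode({"q": 'trade1:"the appraisal station"~10'})
url = "http://localhost:8983/solr/select?" + params
print(url)
```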

-Mike


Re: how to make sure a particular query is ALWAYS cached

2007-10-04 Thread Britske


hossman wrote:
> 
> 
> : I want a couple of costly queries to be cached at all times in the
> : queryResultCache. (unless I have a new searcher of course) 
> 
> first off: you can ensure that certain queries are in the cache, even if 
> there is a newSearcher, just configure a newSearcher Event Listener that 
> forcibly warms the queries you care about.
> 
> (this is particularly handy to ensure FieldCache gets populated before any 
> user queries are processed)
> 
> Second: if i understand correctly, you want a way to put an object in the 
> cache, and guarantee that it's always in the cache, even if other objects 
> are more frequently used or more recently used?
> 
> that's kind of a weird use case ... can you elaborate a little more on 
> what exactly your end goal is?
> 
> 

Sure.  Actually I got the idea from another thread posted by Thomas, to which
you gave a reply a few days ago: 
http://www.nabble.com/Result-grouping-options-tf4522284.html#a12900630. 
I quote the relevant bits below, although I think you remember: 


hossman wrote:
> 
> : Is it possible to use faceting to not only get the facet count but also
> the
> : top-n documents for every facet
> : directly? If not, how hard would it be to implement this as an
> extension?
> 
> not hard ... a custom request handler could subclass
> StandardRequestHandler, call super.handleRequest, and then pull the field
> faceting info out of the response object, and fetch a DocList for each of
> the top N field constraints.
> 

I have about a dozen queries that I want to have permanently cached, each
corresponding to a particular navigation page. Each of these pages has up to
about 10 top-N lists which are populated as discussed above. These lists are
pretty static (updated once a day, together with the index). 

The above would enable me to populate all the lists on a single page in 1
pass. Correct? 
Although I haven't tried it yet, I can't imagine that this request returns in
sub-second time, which is what I want (having an index of about 1M docs with
6000 fields/doc and about 10 complex facet queries per request). 

The navigation pages are pretty important for, eh well, navigation ;-) and
although I can rely on frequent access of these pages most of the time, it
is not guaranteed (so neither is the caching).


hossman wrote:
> 
> the most straightforward approach i can think of would be a new cache 
> implementation that "permanently" stores the first N items you put in it.  
> that in combination with the newSearcher warming i described above should 
> work.
> 
> : 1. use User/Generic-cache. This would result in a separate coding path in
> : the application which I would like to avoid. 
> : 2. extend LRU-cache, and extend request-handler so that a query can be
> : extended with a parameter indicating that it should be cached at all
> times.
> : However, this seems like a lot of cluttering-up these interfaces, for a
> : relatively small change. 
> 
> #1 wouldn't really accomplish what you want without #2 as well.
> 
> 
> 
> -Hoss
> 
> 
> 

regarding #1. 
Wouldn't making a user-cache for the sole purpose of storing these queries
be enough? I could then reference this user-cache by name and extract the
correct query result (at least that's how I read the documentation; I have
no previous experience with the user-cache mechanism).  In that case I don't
need #2, right? Or is this for another reason not a good way to handle
things? 

//Geert-Jan

-- 
View this message in context: 
http://www.nabble.com/how-to-make-sure-a-particular-query-is-ALWAYS-cached-tf4566711.html#a13048285
Sent from the Solr - User mailing list archive at Nabble.com.
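Hoss's suggestion to forcibly warm specific queries on newSearcher is configured in solrconfig.xml via a QuerySenderListener. A sketch, with placeholder query parameters standing in for the costly navigation-page queries:

```xml
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- one <lst> per costly query to pre-warm; values below are placeholders -->
    <lst>
      <str name="q">your costly navigation query</str>
      <str name="start">0</str>
      <str name="rows">10</str>
    </lst>
  </arr>
</listener>
```

The same listener can be registered for the firstSearcher event so the queries are also warmed on startup, before any user request arrives.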



Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Mike Klaas

On 2-Oct-07, at 12:52 AM, Ravish Bhagdev wrote:


I have tried very hard to follow documentation and forums that try to
answer questions about how to return snippets with highlights for
relevant searched term using Solr (as nutch does with such ease).

I will be really grateful if someone can guide me with basics, i have
made sure that the field to be highlighted is "stored" in index etc.
Still I can't figure out why it doesn't return the snippet and instead
returns the whole document.

I have tried all different highlight parameters with variations, but
no idea what's happening.  Can I test highlighting using given
application using "full search interface" option?  How, it just
returns xml with full document between field tag at the moment.


Note that the highlighting data is _not_ returned in the <result>  
section of the response.  Getting the whole document back is probably  
due to asking for all fields (coupled with having stored the main  
text field).


You can play with the highlighting in the admin ui.  Besides having a  
few parameters directly present, the others can be added directly to  
the url for testing.


The minimum required for highlighting is:
 1. hl=true
 2. hl.fl=myfield

_If_ that field matches one of the query terms, you should see  
snippets in the generated response.  Even if not, you should see a  
highlighting section of the response (it will be empty).


regards,
-Mike


RE: question about bi-gram analysis on query

2007-10-04 Thread Keene, David
Hi,

Thanks for responding.  I should have been clearer..

By "actual search" I meant hitting the search demo page on the solr admin page. 
 So I get no results on this query:

/solr/select/?q=%E7%BE%8E%E8%81%AF&version=2.2&start=0&rows=10&indent=on

But the same query (with the data in my index) on the analysis page shows me a 
hit (and the same search in Luke gets me a hit too).

I've tried this on 1.1, 1.2 and nightly as of yesterday. I assume that I am 
missing something really obvious..

-Dave


-Original Message-
From: Teruhiko Kurosaka [mailto:[EMAIL PROTECTED] 
Sent: Thursday, October 04, 2007 12:44 PM
To: Keene, David
Cc: solr-user@lucene.apache.org
Subject: RE: question about bi-gram analysis on query

Hello David,
> And if I do a search in Luke and the solr analysis page 
> for 美聯, I get a hit.  But on the actual search, I don't.

I think you need to tell us what you mean by "actual search"
and your code that interfaces with Solr.

-kuro
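For reference, the percent-encoded bytes in Dave's URL are exactly the UTF-8 encoding of 美聯. A quick check with the Java standard library (the URL path and parameters are copied from his message):

```java
import java.net.URLEncoder;

public class Utf8Query {
    public static void main(String[] args) throws Exception {
        // Percent-encode the CJK query term as UTF-8 bytes, as Solr expects.
        String q = URLEncoder.encode("美聯", "UTF-8");
        System.out.println(q); // %E7%BE%8E%E8%81%AF
        System.out.println("/solr/select/?q=" + q + "&version=2.2&start=0&rows=10&indent=on");
    }
}
```

If the analysis page shows a hit but /select does not, the usual culprit is the query string reaching Solr in a different encoding (e.g. ISO-8859-1), which yields different bytes than the ones that were indexed.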


RE: how to make sure a particular query is ALWAYS cached

2007-10-04 Thread Lance Norskog
You could make these filter queries. Filters are a separate cache and as
long as you have more cache than queries they will remain pinned in RAM.
Your code has to remember these special queries in special-case code, and
create dummy query strings to fetch the filter query.  "field:[* TO *]" will
do nicely.

Cheers,

Lance Norskog 




Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Adrian Sutton
I see that you're using the HTML analyzer.  Unfortunately that does  
not play very well with highlighting at the moment. You may get  
garbled output.


Is it the HTML analyzer or the fact that it's HTML content? If it's  
just the analyzer you could always just copy the HTML content to  
another field with a different analyzer and use that for highlighting  
(but search on the original field). Would this work, and if so, which  
analyzer would be suitable for the second field?


Adrian Sutton
http://www.symphonious.net


Handling empty query

2007-10-04 Thread Guangwei Yuan
Hi,

Does Solr support empty queries? It'll be nice if Solr can return all
results if q is null. Otherwise, I guess I'll have to write a customized
request handler. Any thoughts?

Thanks in advance.

- Guangwei


RE: Handling empty query

2007-10-04 Thread Lance Norskog
If a field is required, and always has data, this query will enumerate all
documents:

field:[* TO *] 




Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Mike Klaas

On 4-Oct-07, at 3:19 PM, Adrian Sutton wrote:

I see that you're using the HTML analyzer.  Unfortunately that  
does not play very well with highlighting at the moment. You may  
get garbled output.


Is it the HTML analyzer or the fact that it's HTML content? If it's  
just the analyzer you could always just copy the HTML content to  
another field with a different analyzer and use that for  
highlighting (but search on the original field). Would this work,  
and if so, which analyzer would be suitable for the second field?


the HTML analyzer strips HTML but doesn't update the offsets correctly  
(the highlighter uses these to determine where to insert the <em> tags).


If you use a "normal" analyzer (like WordDelimiterFilter) directly on  
the HTML, the offsets will be correct but you will get HTML tags  
returned in your output, which you will have to be careful to strip
(which means you couldn't use the default '<em>' as highlighting  
markers).


In general, I don't recommend indexing HTML content straight to  
Solr.  None of the Solr contributors do this so the use case hasn't  
received a lot of love.


I'm actually somewhat surprised that several people are interested in  
this but none have been sufficiently interested to implement a  
solution to contribute:


http://issues.apache.org/jira/browse/SOLR-42

-Mike


RE: how to make sure a particular query is ALWAYS cached

2007-10-04 Thread Britske

I need the documents in order, so FilterCache is no use. Moreover, I already
use a lot of the filtercache for other fq-queries. About 99% of the 6000
fields I mentioned have their values separately in the filtercache. There
must be room for optimization there, but that's a different story ;-)

//Geert-Jan


Lance Norskog wrote:
> 
> You could make these filter queries. Filters are a separate cache and as
> long as you have more cache than queries they will remain pinned in RAM.
> Your code has to remember these special queries in special-case code, and
> create dummy query strings to fetch the filter query.  "field:[* TO *]"
> will
> do nicely.
> 
> Cheers,
> 
> Lance Norskog 
> 

-- 
View this message in context: 
http://www.nabble.com/how-to-make-sure-a-particular-query-is-ALWAYS-cached-tf4566711.html#a13050087
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Handling empty query

2007-10-04 Thread Mike Klaas

On 4-Oct-07, at 3:25 PM, Guangwei Yuan wrote:


Does Solr support empty queries? It'll be nice if Solr can return all
results if q is null. Otherwise, I guess I'll have to write a customized
request handler. Any thoughts?


The dismax handler has a "q.alt" parameter which is used as the  
query if the query string is empty.


To return all documents, set "q.alt=*:*"

-Mike
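To make this the default behaviour, the q.alt parameter (that is its spelling in solrconfig.xml) can be set in the handler's defaults. A sketch, assuming the stock dismax handler from the example config:

```xml
<requestHandler name="dismax" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <!-- fall back to matching all documents when q is empty -->
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>
```

With this in place, a request to /select?qt=dismax with no q parameter returns the full result set, which is handy for faceted browsing pages.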



Re: Handling empty query

2007-10-04 Thread Christopher Triggs
Have a look at  
http://www.mail-archive.com/solr-user@lucene.apache.org/msg03394.html


The thread goes on to describe that just using q=*:* is efficient and  
very useful for getting facets for browsing / navigation.


Regards,
Triggsie




--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.





This message was sent using IMP, the Internet Messaging Program.





Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Adrian Sutton

On 05/10/2007, at 8:45 AM, Mike Klaas wrote:
In general, I don't recommend indexing HTML content straight to  
Solr.  None of the Solr contributors do this so the use case hasn't  
received a lot of love.


We're indexing XHTML straight to Solr and it's working great so far.

I'm actually somewhat surprised that several people are interested  
in this but none have have been sufficiently interested to  
implement a solution to contribute:


http://issues.apache.org/jira/browse/SOLR-42


Didn't know there was a problem to solve. We're a fair way off  
actually playing with highlighting but I'll keep an eye on this for  
when we get to it.



-Mike


Thanks,

Adrian Sutton
http://www.symphonious.net



Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Chris Hostetter

: In general, I don't recommend indexing HTML content straight to Solr.  None of
: the Solr contributors do this so the use case hasn't received a lot of love.

I second that comment ... the HTML Stripping code was never intended to be 
an "HTML Parser"; it was designed to be a workaround for dealing with 
"dirty data" where people had unwanted HTML tags in what should be plain 
text.  Indexing it as-is with some analyzers would result in words like 
"script", "strong", and "class" matching lots of docs where the words 
never really appear in the text.

If you have well-formed HTML documents, use an HTML parser to extract the 
real content.



-Hoss
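One way to do that extraction before sending documents to Solr is the HTML parser that ships with the JDK. This is just a self-contained illustration, not a parser anyone in the thread recommends by name:

```java
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class StripHtml {
    // Collect only the text nodes of an HTML document, dropping all tags.
    static String strip(String html) throws Exception {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.append(data).append(' ');
            }
        };
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return out.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(strip("<p>The <strong>Appraisal</strong> Station</p>"));
    }
}
```

The extracted plain text can then be indexed into a normal text field, so highlighting offsets line up with exactly what was analyzed.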



Re: Does Solr Have?

2007-10-04 Thread Chris Hostetter

: Is there, or are there plans to start, a plugin and extension repository?

Strictly speaking, there is no reason why Solr plugins would have to live 
in an Apache repository.  If people write plugins that would be generally 
useful to several people, and they wish to contribute them to Apache, 
they can be committed into the Solr repository -- but people could also 
host/distribute a plugin on some other site (SourceForge, etc...) however 
they see fit (using pretty much any license they see fit, I believe).  If 
people start doing that, then we should definitely have a nice wiki page 
for people to list those things.

Note: there was another thread recently on solr-dev about adding a 
"contrib" section to the solr repository and bundling them in individual 
jars as an alternative to including them in the core solr.war...

http://www.nabble.com/Solr-contrib-tf4517606.html#a12886773



-Hoss



Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Walter Underwood
Wow, well-formed HTML. That's a rare beast. --wunder

On 10/4/07 7:08 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> if you have well-formed HTML documents, use an HTML parser to extract the
> real content.



Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread J.J. Larrea
At 3:45 PM -0700 10/4/07, Mike Klaas wrote:
>I'm actually somewhat surprised that several people are interested in this but 
>none have been sufficiently interested to implement a solution to 
>contribute:
>
>http://issues.apache.org/jira/browse/SOLR-42

I just devised a workaround earlier in the week and was planning on posting it; 
thanks to your nudge I just did (to SOLR-42).  Hopefully it may be of use to 
someone else.

It uses a PatternTokenizerFactory with a RegEx that swallows runs of HTML- or 
XML-like tags:

  (?:\s*\s]+))?)\s*|\s*)/?>\s*)|\s

and it will treat runs of "things that look like HTML/XML open or close tags 
with optional attributes, optionally preceded or followed by spaces" 
identically to "runs of one or more spaces" as token delimiters, and swallow 
them up, so the previous and following tokens have the correct offsets.

Of course this is just a hack: It doesn't have any real understanding of HTML 
or XML syntax, so something invalid like  will get matched. 
On the other hand, < and > in text will be left alone.

Also note it doesn't decode XML or HTML numeric or symbolic entity references, 
as HTMLStripReader does (my indexer is pre-decoding the entity references 
before sending the text to Solr for indexing).

So fixing HTMLStripReader and its dependent HTMLStripXXXTokenizers to do the 
right thing with offsets would still be a worthy task.  I wonder whether 
recasting HTMLStripReader using the 
org.apache.lucene.analysis.standard.CharStream interface would make sense for 
this?

(I just added the above to the Jira comment, please pardon the redundancy)

- J.J.
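J.J.'s workaround plugs into schema.xml as a field type. A hedged sketch (the field type name is made up here, the pattern attribute is abbreviated and should hold the full tag-swallowing regex from his SOLR-42 comment, and group="-1" makes the pattern act as a token delimiter rather than the token itself):

```xml
<fieldType name="text_htmlhack" class="solr.TextField">
  <analyzer>
    <!-- hypothetical field type; substitute the full regex from SOLR-42 -->
    <tokenizer class="solr.PatternTokenizerFactory"
               pattern="(your tag-swallowing regex)|\s+"
               group="-1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the tags are consumed as delimiters rather than stripped beforehand, the surviving tokens keep their original character offsets, which is what the highlighter needs.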


Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Ravish Bhagdev
Thanks all for help.

Just to make sure I understand correctly, am I right in summarizing
it this way, then?

No significance to using HTML: unlike Nutch, Solr doesn't parse HTML,
so it ignores the anchors, titles etc. and is not good for
PageRank-esque indexing.

HTMLAnalyser (by which you probably mean HTMLStripWhitespaceTokenizer?):
Its main purpose is to allow users to index HTML code; it will strip the
HTML tags and index the contents, but if used for getting snippets in
results the <em> tags may be in the wrong locations.

To avoid using HTMLAnalyser, strip out the tags yourself and only send
text to Solr for indexing using one of the "normal" analysers.
Highlighting should be accurate in this case.

(A question, especially for Adrian:)

If you are indexing XHTML, do you replace tags with entities before
giving it to Solr? If so, when you get back snippets do you get tags
or entities, or do you convert back to tags for presentation?  What's
the best way out?  It would help me a lot if you briefly explained your
configuration.

Do let me know if my assumptions are wrong!

Cheers,
Ravish
