Indexing XML

2007-10-05 Thread PAUWELS Benoit
Hi,

 

I wish to index well-formed XML documents as they are.

I have a database filled with MARCXML records. An example of these looks like 
this:

 

<record ns0:schemaLocation="http://www.loc.gov/MARC21/slim
http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd"
xmlns="http://www.loc.gov/MARC21/slim"
xmlns:ns0="http://www.w3.org/2001/XMLSchema-instance">
  <leader>0nam  22  a 4500</leader>
  <controlfield tag="001">00050</controlfield>
  <controlfield tag="005">20050826220257.0</controlfield>
  <controlfield tag="008">000710s1998xx  r 000 0 dut d</controlfield>
  <datafield tag="040" ind1=" " ind2=" ">
    <subfield code="a">Univ</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="a">van Wetten, J. W.</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">De positie van vrouwen in de asielprocedure /</subfield>
    <subfield code="c">J.W. van Wetten, N. Dijkhof, F. Heide.</subfield>
  </datafield>
</record>
 

The idea is to create Lucene indexes on specific MARC fields and store the 
complete MARC record in Lucene 'as is'. In the presentation layer of my 
application I would then have this complete MARC record at hand, and as such 
have full flexibility on which MARC fields to display. So I want to create the 
following record through XSLT and feed this to SOLR. 

 



<add>
<doc>
  <field name="title">De positie van vrouwen in de asielprocedure</field>
  <field name="author">van Wetten, J. W.</field>
  ...
  <field name="originalRecord">
  <record ns0:schemaLocation="http://www.loc.gov/MARC21/slim
  http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd"
  xmlns="http://www.loc.gov/MARC21/slim"
  xmlns:ns0="http://www.w3.org/2001/XMLSchema-instance">
    <leader>0nam  22  a 4500</leader>
    <controlfield tag="001">00050</controlfield>
    <controlfield tag="005">20050826220257.0</controlfield>
    <controlfield tag="008">000710s1998xx  r 000 0 dut d</controlfield>
    <datafield tag="040" ind1=" " ind2=" ">
      <subfield code="a">UGent</subfield>
    </datafield>
    <datafield tag="100" ind1=" " ind2=" ">
      <subfield code="a">van Wetten, J. W.</subfield>
    </datafield>
    <datafield tag="245" ind1=" " ind2=" ">
      <subfield code="a">De positie van vrouwen in de asielprocedure /</subfield>
      <subfield code="c">J.W. van Wetten, N. Dijkhof, F. Heide.</subfield>
    </datafield>
  </record>
  </field>
</doc>
</add>
 

I have the following in my schema.xml:

<field name="title" type="text" indexed="true" stored="true" termVectors="true"/>
<field name="author" type="text" indexed="true" stored="true" termVectors="true"/>
<field name="originalRecord" type="text" indexed="false" stored="true"/>
 

 

SOLR has of course a problem with the XML in the 'originalRecord' field. 

Is there a solution to this? Has anyone done this before? 

 

Thanks a lot.

Benoit.

 

 

=

PAUWELS Benoit

Université Libre de Bruxelles - Libraries

Head of Automation

Av. F.D. Roosevelt 50, CP 180

1050 BRUSSELS

Belgium

Tel: + 32 2 650 23 91

Fax: + 32 2 650 23 91

=

 

 



Re: Indexing XML

2007-10-05 Thread Pieter Berkel
> SOLR has of course a problem with the XML in the 'originalRecord' field.
> Is there a solution to this? Has anyone done this before?


I would suggest changing the field type of "originalRecord" to "string"
rather than "text", and if you're still having trouble with the XML data,
simply encapsulate the data in a CDATA section:

<field name="originalRecord"><![CDATA[<record>...</record>]]></field>
cheers,
Pieter


Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Adrian Sutton

One last one: when you send HTML to Solr, do you too replace special
chars and tags with named entities?  I did this and HTMLStripper
doesn't seem to recognise the tags :-S  While if I try to input
HTML as-is, the indexer throws exceptions (as having tags within XML tags
is obviously not valid).  How to do this part?


We didn't do anything at all to the HTML, the editor returns valid  
XHTML (using numeric entities, never named entities which aren't  
valid in XML and don't tend to work in XHTML) and we do string  
concatenation to build up the /update request body like:


requestBody += "<field name=\"content\">" + xhtmlContent + "</field>";

Solr seems to handle it. From what people are suggesting though you'd  
be better off converting to plain text before indexing it with Solr.  
Something like JTidy (http://jtidy.sf.net) can parse most HTML that's  
around and you can iterate over the DOM to extract the text from there.
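
In rough Java the JTidy route might look like this (an untested sketch
from memory -- check the JTidy docs for the exact method names):

import java.io.InputStream;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.tidy.Tidy;

public class HtmlTextExtractor {
    // Parse possibly-messy HTML into a DOM with JTidy, then collect text.
    public static String extract(InputStream html) {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);           // suppress the parse report
        tidy.setShowWarnings(false);
        Document doc = tidy.parseDOM(html, null);
        StringBuilder sb = new StringBuilder();
        collectText(doc, sb);
        return sb.toString();
    }

    // Depth-first walk appending the value of every text node.
    private static void collectText(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue()).append(' ');
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectText(c, sb);
        }
    }
}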


Regards,

Adrian Sutton
http://www.symphonious.net


Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Ravish Bhagdev
Thanks Adrian.  I'm very new to Solr myself, so I'm struggling a bit in
the initial stages...

One last one: when you send HTML to Solr, do you too replace special
chars and tags with named entities?  I did this and HTMLStripper
doesn't seem to recognise the tags :-S  While if I try to input
HTML as-is, the indexer throws exceptions (as having tags within XML tags
is obviously not valid).  How to do this part?

Ravish

On 10/5/07, Adrian Sutton <[EMAIL PROTECTED]> wrote:
> [...]


Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Adrian Sutton

On 05/10/2007, at 4:07 PM, Ravish Bhagdev wrote:

(Query esp. Adrian):

If you are indexing XHTML, do you replace tags with entities before
giving it to Solr? If so, when you get back snippets, do you get tags
or entities, or do you convert back to tags for presentation?  What's
the best way out?  It would help me a lot if you briefly explain your
configuration.


We happen to develop an HTML editor, so we know 100% for certain that
the XHTML is valid XML. Given that, we just throw the raw XHTML at
Solr, which uses the HTMLStripWhitespaceTokenizer. However, at this
stage we haven't configured highlighting at all, so our index is used
for search and retrieving a document ID. At some point I'd like to
add highlighting, and it sounds like the best way to do so would be to
index the document text instead of the HTML.


Beyond that, we also use Solr as an optimization for extracting  
information such as what content was most recently changed, which  
pages link to others etc. On the page linking, we actually identify  
what pages are linked to prior to indexing and store them as a  
separate field - Solr itself has no understanding of the linking.


Oh and I should note, I'm very new to Solr so I'm probably not doing  
things the best way, but I'm getting great results anyway.


Regards,

Adrian Sutton
http://www.symphonious.net



Re: Indexing XML

2007-10-05 Thread Alan Rykhus
Hello Benoit,

An additional thing to check out is the work being done on fac-back-opac.
They have a parser that will parse native MARC records. 

I would assume that if you can extract your records in MARC XML you can
extract them in native MARC.

I've used the parser and it works well.

al

On Fri, 2007-10-05 at 02:44 -0500, PAUWELS Benoit wrote:
> [...]
-- 
Alan Rykhus
PALS, A Program of the Minnesota State Colleges and Universities 
(507)389-1975
[EMAIL PROTECTED]

---

"You and I as individuals can, by borrowing, live beyond our means, but
only for a limited period of time. Why should we think that
collectively, as a nation, we are not bound by that same limitation?"
-- Ronald Reagan



Re: Indexing XML

2007-10-05 Thread Wayne Graham
Benoit,

Are you familiar with the Vufind project (http://www.vufind.org)? Have a
look at the PHP code in the import folder to see how the indexing works
(there's an XSL transformation that then updates the index).
I've also written some initial code to use embedded Solr to do this
indexing directly from MARC format files, including holding the entire
MARCXML format record in the index.

You can contact me off-list if you have questions...

Wayne

Walter Underwood wrote:
> Solr is not an XML engine (or a MARC engine). It uses XML as an input format
> for fielded data. It does not index or search arbitrary XML. You need to
> convert your XML into Solr's format.
>
> [...]


-- 
/**
 * Wayne Graham
 * Earl Gregg Swem Library
 * PO Box 8794
 * Williamsburg, VA 23188
 * 757.221.3112
 * http://swem.wm.edu/blogs/waynegraham/
 */



Re: Indexing XML

2007-10-05 Thread Walter Underwood
Solr is not an XML engine (or a MARC engine). It uses XML as an input format
for fielded data. It does not index or search arbitrary XML. You need to
convert your XML into Solr's format.

I would recommend expressing MARC in a Solr schema, then working on the
input XML. The input XML depends on the schema.

If you need an XML engine, I'd recommend MarkLogic (commercial), a very
good product.

wunder

On 10/5/07 12:44 AM, "PAUWELS  Benoit" <[EMAIL PROTECTED]> wrote:

> [...]



Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Walter Underwood
That is one seriously manly regex, but I'd recommend using the Tag Soup
parser instead:

  http://ccil.org/~cowan/XML/tagsoup/

wunder

On 10/4/07 10:11 PM, "J.J. Larrea" <[EMAIL PROTECTED]> wrote:

> It uses a PatternTokenizerFactory with a RegEx that swallows runs of HTML- or
> XML-like tags:
> 
>   (?:\s*\s]+))?)\s*|\s*)/?>\s*)|\s



Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Steven Rowe
Adrian Sutton wrote:
> We didn't do anything at all to the HTML, the editor returns valid XHTML
> (using numeric entities, never named entities which aren't valid in XML
> and don't tend to work in XHTML) [...]

Named entity references are valid in XML.  They just need to be declared
before they are used[1], unless they are one of the builtin named
entities &lt; &gt; &apos; &quot; or &amp; -- these are always valid when
parsing with an XML parser.
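
For example, a document can declare a named entity in its internal DTD
subset, after which it is legal anywhere in that document:

<?xml version="1.0"?>
<!DOCTYPE doc [
  <!ENTITY nbsp "&#160;">
]>
<doc>A non-breaking&nbsp;space.</doc>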

XHTML is XML, so if parsed by an XML parser, XML's builtin named
entities are available, and if the parser doesn't ignore external
entities, then the same set of (roughly) 250 named entities defined in
HTML are available as well[2].

Steve

[1] XML well-formedness constraint - entities must be declared:


[2] Named entities defined in XHTML 1.0



Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread J.J. Larrea
At 9:32 PM +1000 10/5/07, Adrian Sutton wrote:
>From what people are suggesting though you'd be better off converting to plain 
>text before indexing it with Solr. Something like JTidy (http://jtidy.sf.net) 
>can parse most HTML that's around and you can iterate over the DOM to extract 
>the text from there.

It depends entirely on the use-case.  You can fire HTML or XML at a Solr field 
(possibly wrapping it in a CDATA block as just suggested by Pieter Berkel) and 
have it stored verbatim, then what happens at index-time is entirely dependent 
on the Analyzer chain: Treat tags and attributes as if they were text, remove 
them entirely, etc.  You can strip the markup before sending the data and so 
store and/or index just the text content.  You can use XSLT or other means to 
extract data to be indexed in specific fields.  And, as Benoit Pauwels just 
wrote, a combination of these techniques might be the most appropriate for a 
particular application, e.g. field-specific search yielding marked-up documents.

The HTMLStripXXX tokenizers appear to do a fine job of entity conversion and 
tag stripping, and so if highlighting is not a consideration then it makes the 
markup stripping very convenient, allowing storage of the document with markup 
and indexing of just the text content.
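
In schema.xml that is a field type along these lines (a sketch -- the
analyzer chain is illustrative):

<fieldType name="htmlText" class="solr.TextField">
  <analyzer>
    <!-- strips tags and converts entities while tokenizing -->
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>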

The primary issue with HTMLStripXXX is for the use-case when one wants to 
return the stored HTML/XML content with highlighting markup inserted around the 
text content, but preserving the original markup.  For example, have
<span>Paris</span>
highlighted as
<span><em>Paris</em></span>

For that the original marked-up version (rather than stripped) must be stored, 
a markup-stripped version should probably (but not necessarily) be indexed, and 
the offsets of the indexed tokens must properly point to the locations of those 
tokens in the stored version.  The HTMLStripXXX tokenizers ignore the offset of 
the stripped content (both tags and attributes, but also when entities are 
converted to characters) and so the token /paris/ in the example above is given 
the offset of the opening <, and the highlighting falls within (and thus 
destroys) the  tag.  The PatternTokenizer workaround posted to SOLR-42 
will fulfill this use-case.

But a different use-case might be for the highlighting to encompass the markup 
rather than just the text, e.g.
<em><span>Paris</span></em>
which would have to be accomplished some other way.

- J.J.


Merging Fields

2007-10-05 Thread Jae Joo
Is there any way to merge fields at indexing time?

I have field1 and field2 and would like to combine these fields and make
field3.
In the document, there are field1 and field2, and I may build field3 using
CopyField.

Thanks,

Jae


RE: Merging Fields

2007-10-05 Thread Keene, David
Jae,

The easiest way to do this is with CopyField.  

These entries in your schema will accomplish that:

<field name="field3" type="text" indexed="true" stored="true" multiValued="true"/>
<copyField source="field1" dest="field3"/>
<copyField source="field2" dest="field3"/>

Field 3 will have the tokens from both field 1 and 2 in it.

If you want to merge those 2 fields for display, I would just concat
them at display time.

Dave



-Original Message-
From: Jae Joo [mailto:[EMAIL PROTECTED] 
Sent: Friday, October 05, 2007 9:22 AM
To: solr-user
Subject: Merging Fields

[...]


Re: how to make sure a particular query is ALWAYS cached

2007-10-05 Thread Chris Hostetter

: Although I haven't tried yet, I can't imagine that this request returns in
: sub-zero seconds, which is what I want (having a index of about 1M docs with
: 6000 fields/ doc and about 10 complex facetqueries / request). 

i wouldn't necessarily assume that :)  

If you have a request handler which does a query with a facet.field, and 
then does a followup query for the top N constraints in that facet.field, 
the time needed to execute that handler on a cold index should primarily 
depend on the faceting aspect and how many unique terms there are in that 
field.  try it and see.
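
for example, a bare faceting request looks like this (field name made up):

/solr/select?q=*:*&rows=0&facet=true&facet.field=category&facet.limit=10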

: The navigation-pages are pretty important for, eh well, navigation ;-) and
: although I can rely on frequent access of these pages most of the time, it
: is not guaranteed (so neither is the caching)

if i were in your shoes: i wouldn't worry about it.  i would setup 
"cold cache warming" of the important queries using a firstSearcher event 
listener, i would setup autowarming on the caches, i would setup explicit 
warming of queries using sort fields i care about in a newSearcher event 
listener, and i would make sure to tune my caches so that they were big 
enough to contain a much larger number of entries than are used by my 
custom request handler for the queries i care about (especially if my index 
only changes a few times a day, the caches become a huge win in that case, 
so throw everything you've got at them)
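
for example, in solrconfig.xml (the queries themselves are placeholders):

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <!-- warm an important navigation query, with its facets -->
      <str name="q">your important navigation query</str>
      <str name="facet">true</str>
      <str name="facet.field">category</str>
    </lst>
  </arr>
</listener>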

and for the record: i've been in your shoes.

From a purely theoretical standpoint: if enough other requests are coming 
in fast enough to expunge the objects used by your "important" navigation 
pages from the caches ... then those pages aren't that important (at least 
not to your end users as an aggregate)

on the other hand: if you've got discrete pools of users (like say: 
customers who do searches, vs your boss who thinks navigation pages are 
really important) then another approach is to have two ports serving 
queries -- one that you send your navigation type queries to (with the 
caches tuned appropriately) and one that you send other traffic to (with 
caches tuned appropriately) ... i do that for one major index, it makes a 
lot of sense when you have very distinct usage profiles and you want to 
get the most bang for your buck cache wise.


: > #1 wouldn't really accomplish what you want without #2 as well.

: regarding #1. 
: Wouldn't making a user-cache for the sole-purpose of storing these queries
: be enough? I could then reference this user-cache by name, and extract the

only if you also write a custom request handler ... that was my point 
before it was clear that you were already doing that no matter what (you 
had custom request handler listed in #2)

you could definitely make sure to explicitly put all of your DocLists in 
your own usercache, that will certainly work.  but frankly, based on 
what you've described about your use case, and how often your data 
changes, it would probably be easier to set up a layer of caching in front 
of Solr (since you are concerned with ensuring *all* of the data 
for these important pages gets cached) ... something like an HTTP reverse 
proxy cache (aka: accelerator proxy) would help you ensure that these whole 
pages were getting cached.

i've never tried it, but in theory: you could even setup a newSearcher 
event listener to trigger a little script to ping your proxy with a 
request that forced it to revalidate the query when your index changes.
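
something along these lines (the script name is invented):

<listener event="newSearcher" class="solr.RunExecutableListener">
  <!-- whatever little script pokes your proxy -->
  <str name="exe">./bin/ping-proxy.sh</str>
  <bool name="wait">false</bool>
</listener>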



-Hoss



strange sorting problem

2007-10-05 Thread Kevin Lewandowski
I'm having a problem with sorting on a certain field. In my schema.xml
it's defined as a string (not analyzed, indexed/stored verbatim). But
when I look at my results (sorted on that field ascending) I get
things like the following:

Yr City's A Sucker
Movement b/w Yr City's A Sucker
X, Y & Sometimes Z
Move On Up - (Lisa Marie Experience / Footlclub / Z Factor Mixes)
Zakamucyo

Does anyone know what could be the problem?

thanks,
Kevin


Re: strange sorting problem

2007-10-05 Thread Chris Hostetter

can you post...

 * the fieldtype declaration from your schema.xml
 * the field declaration from your schema
 * the full URL that generated that ordering
 * the full XML output from that URL

(you can set the "fl" param to just be the field you are sorting on and 
score if the XML response is really big)


: Date: Fri, 5 Oct 2007 11:21:48 -0700
: From: Kevin Lewandowski <[EMAIL PROTECTED]>
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: strange sorting problem
: 
: I'm having a problem with sorting on a certain field. In my schema.xml
: it's defined as a string (not analyzed, indexed/stored verbatim). But
: when I look at my results (sorted on that field ascending) I get
: things like the following:
: 
: Yr City's A Sucker
: Movement b/w Yr City's A Sucker
: X, Y & Sometimes Z
: Move On Up - (Lisa Marie Experience / Footlclub / Z Factor Mixes)
: Zakamucyo
: 
: Does anyone know what could be the problem?
: 
: thanks,
: Kevin
: 



-Hoss



Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Ravish Bhagdev
Thanks all for the very valuable contributions; I understand these aspects
of Solr much better now.

but...

> But a different use-case might be for the highlighting to encompass
> the markup rather than just the text, e.g.
>   <em><span>Paris</span></em>
> which would have to be accomplished some other way.

Yes, exactly.  And I think nutch handles this somehow as I remember
using it for indexing HTML and then returning snippets with accurate
highlighting placed within html snippets.

Is there a potential for code reuse from nutch?  Maybe this is a topic
for the solr developer list?  Or has it already been considered?

Bests,
Ravish


query syntax for complement set

2007-10-05 Thread Doug Daniels

Hi,

I'm trying to find a way to express a certain query and wondering if 
anyone could help.


The query is against a schema that stores the user_ids who have worked 
on each document in a multi-value integer field called 'user_ids'.  I'd 
like to query solr for all documents that anyone other than a few users 
have worked on.


For instance, say the user group I'm working with is user_ids 1, 3, and 
6.  I'd like to get back the documents that any other users have worked 
on--the complement set of users.  This would be too many users to list 
out individually, I imagine.


This would be easier if I were trying to simply exclude documents that 
users 1,3, and 6 had worked on, but I'm really looking to "include" 
documents that this complement set of users worked on.


Wondering if there's any way to write this query without listing out 
each of the IDs in the complement set individually or (slightly better) 
creating range queries to express the complement set.


Thanks,
Doug


Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Mike Klaas

On 5-Oct-07, at 11:59 AM, Ravish Bhagdev wrote:


But a different use-case might be for the highlighting to encompass
the markup rather than just the text, e.g.
  <em><span>Paris</span></em>
which would have to be accomplished some other way.


Yes, exactly.  And I think nutch handles this somehow as I remember
using it for indexing HTML and then returning snippets with accurate
highlighting placed within html snippets.

Is there a potential for code reuse from nutch?  Maybe this is topic
for solr developer list?  Or has it been already considered?


Last time I looked at the nutch highlighter I don't remember seeing
anything about handling this correctly (which would involve a
considerable amount of HTML finagling to get perfect).


Also, I don't see the use case for web docs: you absolutely never
want to serve up the raw HTML from an unknown page.


I'm not against improving Solr's handling of HTML data, but it is the  
type of thing that is unlikely to happen unless someone who cares  
about it steps up.


Patches welcome :)

-Mike


RE: Merging Fields

2007-10-05 Thread Lance Norskog
A gotcha here is that copyField creates multiple values. Each field copied
in becomes a separate value. If you want a single-valued field, this will
not work.

Lance Norskog 

-Original Message-
From: Keene, David [mailto:[EMAIL PROTECTED] 
Sent: Friday, October 05, 2007 10:50 AM
To: solr-user@lucene.apache.org
Subject: RE: Merging Fields

[...]



Re: strange sorting problem

2007-10-05 Thread Kevin Lewandowski
Sorry, user error. In the example I posted the field type was actually
not string. But I was getting confused on another field because I
didn't realize that string was case sensitive. Too many fields to
think about! :)

On 10/5/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> [...]


Best way to change weighting based on the presence of a field

2007-10-05 Thread Kyle Banerjee
Howdy all,

We are attempting to provide access to about 8 million records of
highly variable quality and length. In a nutshell, we are trying to
find a way to deprioritize "suspect" records without discriminating
against useful records that happen to be short. We do not wish to
eliminate suspect records from the results -- just deprioritize them a
bit.

We have been indexing a field that marks a record as likely to be good
or bad, and I'm trying to figure out the most efficient way to use it
(should I be trying this at all?). As a newbie, my first inclination
was to OR the search terms with the same terms combined with a "good
record marker" with a modest boost.

However, this method seems really clunky, and I'm wondering if there's
a better way to accomplish what we're trying to do. Thanks,

kyle


Re: Best way to change weighting based on the presence of a field

2007-10-05 Thread Mike Klaas

On 5-Oct-07, at 2:06 PM, Kyle Banerjee wrote:


[...]


If you know at index time that the document is shady, the easiest way  
to de-emphasize it globally is to set the document boost to some  
value other than one.


...
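
In the update message that's just (boost value purely illustrative):

<add>
  <!-- a boost below 1.0 de-emphasizes the whole document -->
  <doc boost="0.5">
    <field name="id">...</field>
  </doc>
</add>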

cheers,
-Mike


Re: Best way to change weighting based on the presence of a field

2007-10-05 Thread Kyle Banerjee
> If you know at index time that the document is shady, the easiest way
> to de-emphasize it globally is to set the document boost to some
> value other than one.
>
> ...

I considered that, but assumed we'd get the values wrong at first and
have to do a lot of tinkering before we got it right. Is there a good
way to do this at query time, or do you really need to do this when
loading? It would be feasible to boost at load time, but recovery
times from bad decisions are longer than I was hoping for.

kyle


Re: Best way to change weighting based on the presence of a field

2007-10-05 Thread Mike Klaas

On 5-Oct-07, at 3:01 PM, Kyle Banerjee wrote:


[...]


The other option is to use a function query on the value stored in a  
field (which could represent a range of 'badness').  This can be used  
directly in the dismax handler using the bf (boost function) query  
parameter.
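
e.g. (field name and weight invented):

/solr/select?qt=dismax&q=foo&bf=quality^0.5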


-Mike


Re: Best way to change weighting based on the presence of a field

2007-10-05 Thread Yonik Seeley
On 10/5/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> The other option is to use a function query on the value stored in a
> field (which could represent a range of 'badness').  This can be used
> directly in the dismax handler using the bf (boost function) query
> parameter.

In the near future, you can do a real query-time boost (score multiplication)
by another field or function
https://issues.apache.org/jira/browse/SOLR-334

And even quickly update all the values of the field being used as the boost:
https://issues.apache.org/jira/browse/SOLR-351

-Yonik


Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Adrian Sutton
Named entity references are valid in XML.  They just need to be declared
before they are used[1], unless they are one of the builtin named
entities &lt; &gt; &apos; &quot; or &amp; -- these are always valid when
parsing with an XML parser.


Correct, it was an offhand comment and I skipped over all the  
details. In general named entities other than the built-ins aren't  
declared at the top of the file and many parsers don't bother to read  
in external DTDs so any entities declared there aren't read and are  
therefore considered invalid.



XHTML is XML, so if parsed by an XML parser, XML's builtin named
entities are available, and if the parser doesn't ignore external
entities, then the same set of (roughly) 250 named entities defined in
HTML are available as well[2].


Except that no browser that I know of actually reads in the XHTML DTD  
when in standards compliant mode, so none of those entities are  
actually viable to be used unless you include the declarations for  
them at the top of every XHTML document (which is ludicrous).


The bottom line is that it's far, far better to use numeric entities  
in XML and simply ignore all but the built-in named entities if you  
want to have any confidence that the document will be parsed  
correctly - hence my offhand comment.
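
For example, &#160; (or &#xA0;) parses everywhere, whereas &nbsp; only
works if the parser has actually read a DTD that declares it.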


Regards,

Adrian Sutton
http://www.symphonious.net


Re: question about bi-gram analysis on query

2007-10-05 Thread Otis Gospodnetic
Dave,

Have you tried using &debugQuery=true ? :)

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: "Keene, David" <[EMAIL PROTECTED]>
To: Teruhiko Kurosaka <[EMAIL PROTECTED]>
Cc: solr-user@lucene.apache.org
Sent: Thursday, October 4, 2007 4:44:59 PM
Subject: RE: question about bi-gram analysis on query

Hi,

Thanks for responding.  I should have been clearer..

By "actual search" I meant hitting the search demo page on the solr admin page. 
 So I get no results on this query:

/solr/select/?q=%E7%BE%8E%E8%81%AF&version=2.2&start=0&rows=10&indent=on

But the same query (with the data in my index) on the analysis page shows me a 
hit (and the same search in Luke gets me a hit too).

I've tried this on 1.1, 1.2 and nightly as of yesterday. I assume that I am 
missing something really obvious..

-Dave


-Original Message-
From: Teruhiko Kurosaka [mailto:[EMAIL PROTECTED] 
Sent: Thursday, October 04, 2007 12:44 PM
To: Keene, David
Cc: solr-user@lucene.apache.org
Subject: RE: question about bi-gram analysis on query

Hello David,
> And if I do a search in Luke and the solr analysis page 
> for 美聯, I get a hit.  But on the actual search, I don't.

I think you need to tell us what you mean by "actual search"
and your code that interfaces with Solr.

-kuro