Re: Field Types Question

2008-08-12 Thread Erik Hatcher


On Aug 11, 2008, at 9:28 PM, Jake Conk wrote:

I was wondering what are the differences in certain field types? For
instance what's the difference between the following?

integer / sint
float / sfloat


The difference is the internal representation of the String value of  
the term representing the numbers.  The "s" prefix means the terms are  
sortable, in that they are in ascending _numerical_ (not textual)  
order within the index.
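
A tiny, self-contained illustration of the textual-vs-numerical ordering described above (plain Java, nothing Solr-specific):

-8<-
public class TermOrder {
  public static void main(String[] args) {
    // As index terms, plain "integer" values sort as strings, so "10" comes before "2"...
    System.out.println("10".compareTo("2") < 0);  // true  (textual order)
    // ...even though numerically 10 is the larger value.
    System.out.println(10 < 2);                   // false (numerical order)
  }
}
-8<-

The sortable types store an encoded form of the number whose string order matches numeric order, which is what makes sorting and range queries behave.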



text / textzh


Looks like maybe you're picking up an acts_as_solr schema (which uses  
text_zh, though).  "zh" is the language code for Chinese... and that  
field type is likely configured to use a Chinese-savvy analyzer.



Also, if I have two dynamic fields for instance *_facet and *_facet_mv
which both have the type set to string does it really matter which one
I use?


If the field types are identical, then no it won't matter which you  
use - the same thing will happen internally.


Erik



Re: Best strategy for dates in solr-ruby

2008-08-12 Thread Erik Hatcher


On Aug 11, 2008, at 3:03 PM, Ian Connor wrote:

I originally used a Ruby Date class for my dates, but found when I set
the type to solr.DateField in the solrconfig.xml, it returned a parse
error. After that, I switched to Time and it worked fine.

However, I now have some dates that are out of the Time range (e.g.
1865) so Date would work better here than time.

What is the best strategy here:
1. Use Dates and treat it as a solr.String;
2. Customize the Date class to output a valid solr.DateField string;  
or

3. Treat it as a string in ruby and handle to/from Date in my model?


I've thought about this some myself and I the proposal I have is to  
enhance Solr's Ruby output to provide real DateTime objects rather  
than strings.   I've added this as a TODO here: , but haven't made time to do anything with it myself yet.


As far as a specific recommendation for your use here - if you aren't  
leveraging any date sorting or (range) querying capabilities then it  
would be fine to treat it as a string within Solr.  How you configure  
Solr's types depends entirely on what you want to do with those fields  
from a query and storage perspective.


Erik
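
For reference, a rough Java sketch of the canonical string form solr.DateField parses (the same shape as the "T23:59:59Z" workaround that shows up later in this thread); the class and method names are only illustrative:

-8<-
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class SolrDateFormat {
  // Format a java.util.Date as the full ISO-8601 UTC form, e.g. 2008-08-12T00:00:00Z
  public static String toSolrDate(Date d) {
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
    fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // DateField values are UTC
    return fmt.format(d);
  }
}
-8<-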



query parsing

2008-08-12 Thread Stefan Oestreicher
Hi,

I need to modify the query to search through all fields if no explicit field
has been specified. I know there's the dismax handler but I'd like to use
the standard query syntax.
I implemented that with my own QParserPlugin and QParser and for simple term
queries it works great. I'm using the SolrQueryParser which I get from the
schema to parse the query with an impossible field name as the default field
and then I rewrite the query accordingly.
Unfortunately this doesn't work with phrase queries; the SolrQueryParser
always returns a TermQuery instead of a phrase query.

What am I missing? Is this even a viable approach?

This is a code snippet from a test case (extending AbstractSolrTestCase)
which I used to verify that it's not returning a PhraseQuery:

-8<-
SolrQueryParser parser = h.getCore().getSchema().getSolrQueryParser(null);
Query q = parser.parse("baz \"foo bar\"");
assertTrue( q instanceof BooleanQuery );
BooleanQuery bq = (BooleanQuery)q;
BooleanClause[] cl = bq.getClauses();
assertEquals(2, cl.length);
//this assertion fails
assertTrue(cl[1].getQuery() instanceof PhraseQuery);
-8<-

I'm using solr 1.3, r685085.

TIA,
 
Stefan Oestreicher



Re: Lower Case Filter Factory

2008-08-12 Thread Erik Hatcher


On Aug 11, 2008, at 1:51 PM, swarag wrote:

When I query:
http://localhost:8983/solr/select?q=p*
I get results back, but when I query as
http://localhost:8983/solr/select?q=P*

I get no results. Is there anything wrong I'm doing?


Wildcard expressions are _not_ analyzed when parsed.  Their case is  
left as-is too.  There is a switch to Lucene's query parser that  
allows it to lowercase wildcard and fuzzy terms  
(setLowercaseExpandedTerms), but unfortunately that switch is not  
currently controllable from Solr.


One option is to have your querying client do the lowercasing for now.

Erik
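
A minimal sketch of that client-side workaround (the host, core path, and parameter handling are placeholders; it assumes the field's analyzer lowercases at index time, and a real client would avoid lowercasing field names and boolean operators):

-8<-
import java.net.URLEncoder;
import java.util.Locale;

public class LowercasedQuery {
  // Lowercase the raw query before sending it, so a wildcard like P* matches
  // the lowercased index terms.
  public static String buildSelectUrl(String userQuery) throws Exception {
    String q = userQuery.toLowerCase(Locale.ENGLISH);
    return "http://localhost:8983/solr/select?q=" + URLEncoder.encode(q, "UTF-8");
  }
}
-8<-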



Re: query parsing

2008-08-12 Thread Erik Hatcher
Solr/Lucene QueryParser returns a TermQuery for "phrases" that end up  
only as a single term.  This could happen, for example, if it was  
using Solr's "string" field type (which has effectively no analyzer).


I'd guess that you'd want to re-analyze TermQuery's?  (though that  
sounds problematic for many cases)  Or possibly use your own  
SolrQueryParser subclass and override #getFieldQuery.


Erik
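
A rough sketch of the subclass route against the Solr 1.3 classes already in this thread; the "impossible" default field marker, the field list, and the constructor details are assumptions to adapt:

-8<-
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.search.SolrQueryParser;

public class AllFieldsQueryParser extends SolrQueryParser {
  private static final String MAGIC_DEFAULT = "__all__";   // the "impossible" default field
  private final String[] searchFields;

  public AllFieldsQueryParser(IndexSchema schema, String[] searchFields) {
    super(schema, MAGIC_DEFAULT);
    this.searchFields = searchFields;
  }

  @Override
  protected Query getFieldQuery(String field, String queryText) throws ParseException {
    if (!MAGIC_DEFAULT.equals(field)) {
      return super.getFieldQuery(field, queryText);         // explicit field: leave it alone
    }
    // No explicit field: expand the clause (term or phrase) across all configured fields.
    BooleanQuery combined = new BooleanQuery();
    for (String f : searchFields) {
      Query q = super.getFieldQuery(f, queryText);
      if (q != null) {
        combined.add(q, BooleanClause.Occur.SHOULD);
      }
    }
    return combined;
  }
}
-8<-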

On Aug 12, 2008, at 5:26 AM, Stefan Oestreicher wrote:


Hi,

I need to modify the query to search through all fields if no  
explicit field
has been specified. I know there's the dismax handler but I'd like  
to use

the standard query syntax.
I implemented that with my own QParserPlugin and QParser and for  
simple term
queries it works great. I'm using the SolrQueryParser which I get  
from the
schema to parse the query with an impossible field name as the  
default field

and then I rewrite the query accordingly.
Unfortunately this doesn't work with phrase queries, the  
SolrQueryParser

always returns a TermQuery instead of a phrase query.

What am I missing? Is this even a viable approach?

This is a code snippet from a test case (extending  
AbstractSolrTestCase)

which I used to verify that it's not returning a PhraseQuery:

-8<-
SolrQueryParser parser =  
h.getCore().getSchema().getSolrQueryParser(null);

Query q = parser.parse("baz \"foo bar\"");
assertTrue( q instanceof BooleanQuery );
BooleanQuery bq = (BooleanQuery)q;
BooleanClause[] cl = bq.getClauses();
assertEquals(2, cl.length);
//this assertion fails
assertTrue(cl[1].getQuery() instanceof PhraseQuery);
-8<-

I'm using solr 1.3, r685085.

TIA,

Stefan Oestreicher




Re: number of matching documents incorrect during postOptimize

2008-08-12 Thread Shalin Shekhar Mangar
Hi Tom,

There will be no clean way of doing this with scripts. A way would be to
write a newSearcher event listener in Java and use one of the
SolrIndexSearcher#getDocList methods to do the query. If successful, execute
your scripts for snapshoot.

However, I think it should be fine if you want to execute snapshoot on the
postOptimize hook. DataImportHandler will not call commit/optimize unless
the import is successful. If you ever find a commit happening even when the
import failed, then it must be a bug.
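
A rough sketch of such a listener against the Solr 1.3 APIs; the sanity-check query, the minimum count, and the snapshooter path are placeholders, and the exact getDocList overload may need adjusting:

-8<-
import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.SolrEventListener;
import org.apache.solr.search.DocList;
import org.apache.solr.search.SolrIndexSearcher;

public class MinDocCountSnapshooter implements SolrEventListener {
  private static final int MIN_DOCS = 100000;               // sanity threshold, placeholder

  public void init(NamedList args) {}
  public void postCommit() {}

  public void newSearcher(SolrIndexSearcher newSearcher, SolrIndexSearcher currentSearcher) {
    try {
      Query q = new TermQuery(new Term("type", "article")); // the sanity-check query, placeholder
      DocList docs = newSearcher.getDocList(q, (Query) null, null, 0, 1);
      if (docs.matches() >= MIN_DOCS) {
        Runtime.getRuntime().exec("/opt/solr/bin/snapshooter"); // only snapshoot a sane index
      }
    } catch (IOException e) {
      // log and skip the snapshot; don't break searcher registration
    }
  }
}
-8<-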

On Tue, Aug 12, 2008 at 6:31 AM, Tom Morton <[EMAIL PROTECTED]> wrote:

> Hi all,
>   I'm trying to check that an import using the dataImportHandler was clean
> before I take a snapshot of the index to be pulled via snappuller to query
> nodes.  One of the checks I do is verify that a certain minimum number of
> documents are returned for a query.  I do this in a script that I'm calling
> via the postOptimize hook.  However, after a full import the numFound
> results from the query are not accurate until after the postOptimize code
> completes and so my checks are failing.
>
> Glancing at the code this looks non-trivial to "fix" as the hook call is
> pretty deep in the call stack.
> org.apache.solr.handler.dataimport.DataImporter.doFullImport execute
> eventually calls
> org.apache.solr.update.UpdateHandler.callPostOptimizeCallbacks
>
> One option would be to spawn and background a new job to check the status
> with an initial sleep to wait for the postOptimize that spawned it to
> finish.  This is pretty ugly and could lead to some race conditions but
> will
> probably work.
>
> Any better recommendations on how to achieve this functionality?
>
> Thanks...Tom
>



-- 
Regards,
Shalin Shekhar Mangar.


RE: query parsing

2008-08-12 Thread Stefan Oestreicher
Ah, yes, the FieldType I used was not the one I needed. I completely missed
that. Thank you very much, it's working perfectly now.

thanks,

Stefan Oestreicher

> -Original Message-
> From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, August 12, 2008 11:46 AM
> To: solr-user@lucene.apache.org
> Subject: Re: query parsing
> 
> Solr/Lucene QueryParser returns a TermQuery for "phrases" 
> that end up only as a single term.  This could happen, for 
> example, if it was using Solr's "string" field type (which 
> has effectively no analyzer).
> 
> I'd guess that you'd want to re-analyze TermQuery's?  (though 
> that sound problematic for many cases)  Or possibly use your 
> own SolrQueryParser subclass and override #getFieldQuery.
> 
>   Erik
> 
> On Aug 12, 2008, at 5:26 AM, Stefan Oestreicher wrote:
> 
> > Hi,
> >
> > I need to modify the query to search through all fields if 
> no explicit 
> > field has been specified. I know there's the dismax handler but I'd 
> > like to use the standard query syntax.
> > I implemented that with my own QParserPlugin and QParser and for 
> > simple term queries it works great. I'm using the SolrQueryParser 
> > which I get from the schema to parse the query with an impossible 
> > field name as the default field and then I rewrite the query 
> > accordingly.
> > Unfortunately this doesn't work with phrase queries, the 
> > SolrQueryParser always returns a TermQuery instead of a 
> phrase query.
> >
> > What am I missing? Is this even a viable approach?
> >
> > This is a code snippet from a test case (extending
> > AbstractSolrTestCase)
> > which I used to verify that it's not returning a PhraseQuery:
> >
> > -8<-
> > SolrQueryParser parser =
> > h.getCore().getSchema().getSolrQueryParser(null);
> > Query q = parser.parse("baz \"foo bar\""); assertTrue( q instanceof 
> > BooleanQuery ); BooleanQuery bq = (BooleanQuery)q; 
> BooleanClause[] cl 
> > = bq.getClauses(); assertEquals(2, cl.length); //this 
> assertion fails
> > assertTrue(cl[1].getQuery() instanceof PhraseQuery);
> > -8<-
> >
> > I'm using solr 1.3, r685085.
> >
> > TIA,
> >
> > Stefan Oestreicher
> 
> 



Re: Best strategy for dates in solr-ruby

2008-08-12 Thread Ian Connor
I like your suggestion. To keep the index working the same, I have
just switched to dates and added the string "T23:59:59Z" and it seems
to make the solr.DateField happy when this string is passed.

It will mean I will have to mess with it on the way back in - but it
should be okay. To have solr-ruby do this to dates for me would be
ideal.

# Append a time-of-day so the value matches the format solr.DateField expects
if field.class == Date
  field = field.to_s + "T23:59:59Z"
end

On Tue, Aug 12, 2008 at 5:11 AM, Erik Hatcher
<[EMAIL PROTECTED]> wrote:
>
> On Aug 11, 2008, at 3:03 PM, Ian Connor wrote:
>>
>> I originally used a Ruby Date class for my dates, but found when I set
>> the type to solr.DateField in the solrconfig.xml, it returned a parse
>> error. After that, I switched to Time and it worked fine.
>>
>> However, I now have some dates that are out of the Time range (e.g.
>> 1865) so Date would work better here than time.
>>
>> What is the best strategy here:
>> 1. Use Dates and treat it as a solr.String;
>> 2. Customize the Date class to output a valid solr.DateField string; or
>> 3. Treat it as a string in ruby and handle to/from Date in my model?
>
> I've thought about this some myself and I the proposal I have is to enhance
> Solr's Ruby output to provide real DateTime objects rather than strings.
> I've added this as a TODO here:
> , but haven't made time to
> do anything with it myself yet.
>
> As far as a specific recommendation for your use here - if you aren't
> leveraging any date sorting or (range) querying capabilities then it would
> be fine to treat it as a string within Solr.  How you configure Solr's types
> depend entirely on what you want to do with those fields from a query and
> storage perspective.
>
>Erik
>
>



-- 
Regards,

Ian Connor


Problems using saxon for XSLT transforms

2008-08-12 Thread Norberto Meijome
hi :)
I'm trying to use Saxon instead of the default XSLT processor. I was pretty sure I
had it running fine on 1.2, but when I repeated the same steps (as per the
wiki) on the latest nightly build, I cannot see any sign of it being loaded or used,
although the classpath seems to be pointing to the Saxon jars (see below).

In my logs, I see:
INFO: created xslt: org.apache.solr.request.XSLTResponseWriter
Aug 12, 2008 11:20:07 PM org.apache.solr.request.XSLTResponseWriter init
INFO: xsltCacheLifetimeSeconds=5

which is the RH itself. Then, on a hit that triggers the transform: 
Aug 12, 2008 11:21:25 PM org.apache.solr.util.xslt.TransformerProvider 
WARNING: The TransformerProvider's simplistic XSLT caching mechanism is not
appropriate for high load scenarios, unless a single XSLT transform is used and
xsltCacheLifetimeSeconds is set to a sufficiently high value.

This is where I would expect to see saxon...right?

I'm running Solr 1.3, nightly from 2008-08-11, under FreeBSD 7 (stable), JDK
1.6. I have 4 cores defined in this test environment. 

I start my service with:

java -Xms64m -Xmx1024m -server
-Djavax.xml.transform.TransformerFactory=net.sf.saxon.TransformerFactoryImpl
-jar start.jar


the  /admin/get-properties.jsp shows

[]

javax.xml.transform.TransformerFactory = net.sf.saxon.TransformerFactoryImpl
java.specification.version = 1.6
[...]
java.class.path
= 
/solrhome:/solrhome/lib/saxon9-s9api.jar:/solrhome/lib/jetty-6.1.11.jar:/solrhome/lib/saxon9-jdom.jar:/solrhome/lib/saxon9-sql.jar:/solrhome/lib/servlet-api-2.5-6.1.11.jar:/solrhome/lib/saxon9-xqj.jar:/solrhome/lib/saxon9.jar:/solrhome/lib/jetty-util-6.1.11.jar:/solrhome/lib/saxon9-xom.jar:/solrhome/lib/saxon9-dom4j.jar:/solrhome/lib/saxon9-xpath.jar:/solrhome/lib/saxon9-dom.jar:/solrhome/lib/jsp-2.1/core-3.1.1.jar:/solrhome/lib/jsp-2.1/ant-1.6.5.jar:/solrhome/lib/jsp-2.1/jsp-2.1.jar:/solrhome/lib/jsp-2.1/jsp-api-2.1.jar:/solrhome/lib/management/jetty-management-6.1.11.jar:/solrhome/lib/naming/jetty-naming-6.1.11.jar:/solrhome/lib/naming/activation-1.1.jar:/solrhome/lib/naming/mail-1.4.jar:/solrhome/lib/plus/jetty-plus-6.1.11.jar:/solrhome/lib/xbean/jetty-xbean-6.1.11.jar:/solrhome/lib/annotations/geronimo-annotation_1.0_spec-1.0.jar:/solrhome/lib/annotations/jetty-annotations-6.1.11.jar:/solrhome/lib/ext/jetty-java5-threadpool-6.1.11.jar:/solrhome/lib/ext/jetty-sslengine-6.1.11.jar:/solrhome/lib/ext/jetty-servlet-tester-6.1.11.jar:/solrhome/lib/ext/jetty-ajp-6.1.11.jar:/solrhome/lib/ext/jetty-setuid-6.1.11.jar:/solrhome/lib/ext/jetty-client-6.1.11.jar:/solrhome/lib/ext/jetty-html-6.1.11.jar

[...]

Any pointers on where I should check to confirm Saxon is being used, or on how
to address the problem, will be greatly appreciated.

TIA,
B
_
{Beto|Norberto|Numard} Meijome

"Nature doesn't care how smart you are. You can still be wrong."
  Richard Feynman

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.
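
One quick way to confirm which factory the JVM actually picks up is a plain JAXP check run with the same -D flag (nothing Solr-specific; the class name is only illustrative):

-8<-
import javax.xml.transform.TransformerFactory;

public class WhichXsltFactory {
  public static void main(String[] args) {
    // With -Djavax.xml.transform.TransformerFactory=net.sf.saxon.TransformerFactoryImpl
    // on the command line this should print net.sf.saxon.TransformerFactoryImpl;
    // otherwise you get the JDK's built-in factory.
    System.out.println(TransformerFactory.newInstance().getClass().getName());
  }
}
-8<-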


Re: Highlighting Output

2008-08-12 Thread Martin Owens
I tried to post it myself, got the address wrong, thanks for re-posting.

The problem we have with highlighting outside of the indexer is that the
systems we use that store co-ords are... based on the term string (in one
case) and the specific term offset in another, both of which break
horribly when trying to do interesting things with Solr/Lucene.

The only real way to do it is to store that term-based data with the
index. Otherwise we'd have to use the Lucene query parser to reparse
the search string and write our own searcher to search our custom XML
co-ord files. Most unsatisfactory.

P.S. I noticed that my original email had way too many spelling
mistakes, sorry about that.

Best Regards, Martin Owens

On Mon, 2008-08-11 at 17:43 -0600, Tricia Williams wrote:
> Martin,
> 
> I've been over some of the same thoughts you present here in the last 
> few years.  The path of least resistance ended up being to deal with the 
> highlighting portion of OCRed images outside of Solr.  That's not to say 
> it couldn't or shouldn't be done differently.  I briefly even pursued a 
> similar course of action evident in 
> https://issues.apache.org/jira/browse/SOLR-386.  This would make it 
> easier if you wanted to write your own highlighter.
> 
> I'm interested to see what others think of your suggestions.  I've 
> forwarded this to the solr-user list.
> 
> Tricia



Bug in admin center JSP?

2008-08-12 Thread Matthew Runo

Hello!

I've noticed that the admin center of SVN head seems to report two  
open searchers recently, though they appear to be the same searcher...


Example:

name:[EMAIL PROTECTED] main
class:  org.apache.solr.search.SolrIndexSearcher
version:1.0
description:index searcher
stats:  searcherName : [EMAIL PROTECTED] main
caching : true
numDocs : 157474
maxDoc : 467325
readerImpl : MultiSegmentReader
readerDir : org.apache.lucene.store.FSDirectory@/opt/solr/data/index
indexVersion : 1205944089163
openedAt : Tue Aug 12 06:48:41 PDT 2008
registeredAt : Tue Aug 12 06:48:42 PDT 2008
warmupTime : 1190

name:   searcher
class:  org.apache.solr.search.SolrIndexSearcher
version:1.0
description:index searcher
stats:  searcherName : [EMAIL PROTECTED] main
caching : true
numDocs : 157474
maxDoc : 467325
readerImpl : MultiSegmentReader
readerDir : org.apache.lucene.store.FSDirectory@/opt/solr/data/index
indexVersion : 1205944089163
openedAt : Tue Aug 12 06:48:41 PDT 2008
registeredAt : Tue Aug 12 06:48:42 PDT 2008
warmupTime : 1190


Note that the "stats: 	searcherName : [EMAIL PROTECTED] main" line is  
the same for both - leading me to think that this is just a display  
issue. Is anyone else seeing this?


--Matthew


Re: Bug in admin center JSP?

2008-08-12 Thread Yonik Seeley
I believe this is intentional.
Searchers are now also registered under their unique names, so one
can tell if there is a searcher "leak".  "searcher" is also still used
for the main registered searcher so things like JMX can look it up.

-Yonik

On Tue, Aug 12, 2008 at 10:58 AM, Matthew Runo <[EMAIL PROTECTED]> wrote:
> Hello!
>
> I've noticed that the admin center of SVN head seems to report two open
> searches recently, though they appear to be the same searcher..
>
> Example:
>
> name:[EMAIL PROTECTED] main
> class:  org.apache.solr.search.SolrIndexSearcher
> version:1.0
> description:index searcher
> stats:  searcherName : [EMAIL PROTECTED] main
> caching : true
> numDocs : 157474
> maxDoc : 467325
> readerImpl : MultiSegmentReader
> readerDir : org.apache.lucene.store.FSDirectory@/opt/solr/data/index
> indexVersion : 1205944089163
> openedAt : Tue Aug 12 06:48:41 PDT 2008
> registeredAt : Tue Aug 12 06:48:42 PDT 2008
> warmupTime : 1190
>
> name:   searcher
> class:  org.apache.solr.search.SolrIndexSearcher
> version:1.0
> description:index searcher
> stats:  searcherName : [EMAIL PROTECTED] main
> caching : true
> numDocs : 157474
> maxDoc : 467325
> readerImpl : MultiSegmentReader
> readerDir : org.apache.lucene.store.FSDirectory@/opt/solr/data/index
> indexVersion : 1205944089163
> openedAt : Tue Aug 12 06:48:41 PDT 2008
> registeredAt : Tue Aug 12 06:48:42 PDT 2008
> warmupTime : 1190
>
>
> Note that the "stats:   searcherName : [EMAIL PROTECTED] main" line is the
> same for both - leading me to think that this is just a display issue. Is
> anyone else seeing this?
>
> --Matthew
>


Re: Bug in admin center JSP?

2008-08-12 Thread Shalin Shekhar Mangar
They are both the same searcher. The reason for displaying them twice is to
see the current searcher separately (named "searcher") and any other
searchers that are still open for any reason. The name attribute was
specially added so that one can verify that both are indeed the same.

On Tue, Aug 12, 2008 at 8:28 PM, Matthew Runo <[EMAIL PROTECTED]> wrote:

> Hello!
>
> I've noticed that the admin center of SVN head seems to report two open
> searches recently, though they appear to be the same searcher..
>
> Example:
>
> name:[EMAIL PROTECTED] main
> class:  org.apache.solr.search.SolrIndexSearcher
> version:1.0
> description:index searcher
> stats:  searcherName : [EMAIL PROTECTED] main
> caching : true
> numDocs : 157474
> maxDoc : 467325
> readerImpl : MultiSegmentReader
> readerDir : org.apache.lucene.store.FSDirectory@/opt/solr/data/index
> indexVersion : 1205944089163
> openedAt : Tue Aug 12 06:48:41 PDT 2008
> registeredAt : Tue Aug 12 06:48:42 PDT 2008
> warmupTime : 1190
>
> name:   searcher
> class:  org.apache.solr.search.SolrIndexSearcher
> version:1.0
> description:index searcher
> stats:  searcherName : [EMAIL PROTECTED] main
> caching : true
> numDocs : 157474
> maxDoc : 467325
> readerImpl : MultiSegmentReader
> readerDir : org.apache.lucene.store.FSDirectory@/opt/solr/data/index
> indexVersion : 1205944089163
> openedAt : Tue Aug 12 06:48:41 PDT 2008
> registeredAt : Tue Aug 12 06:48:42 PDT 2008
> warmupTime : 1190
>
>
> Note that the "stats:   searcherName : [EMAIL PROTECTED] main" line is the
> same for both - leading me to think that this is just a display issue. Is
> anyone else seeing this?
>
> --Matthew
>



-- 
Regards,
Shalin Shekhar Mangar.


Re: concurrent optimize and update

2008-08-12 Thread Jason Rennie
On Mon, Aug 11, 2008 at 6:41 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:

> It's safe... the adds will block until the commit or optimize has finished.
>

By block, do you mean that the update connection(s) will be held open?  Our
optimizes take many minutes to complete.  I'm thinking that this could cause
a large pile of threads to accumulate if we're not careful...

Jason


Re: Bug in admin center JSP?

2008-08-12 Thread Matthew Runo
Ah, that makes sense. I just wanted to point it out in case it wasn't  
intentional since it wasn't apparent from the front end as to why they  
were listed twice.


Thanks for taking a moment to reply =)

Matthew Runo
Software Engineer, Zappos.com
[EMAIL PROTECTED] - 702-943-7833

On Aug 12, 2008, at 8:06 AM, Shalin Shekhar Mangar wrote:

They are both the same searcher. The reason for displaying them  
twice is to

see the current searcher separately (named "searcher") and any other
searchers that are still open due to any reasons. The name attribute  
was

specially added so that one can verify that both are indeed the same.

On Tue, Aug 12, 2008 at 8:28 PM, Matthew Runo <[EMAIL PROTECTED]>  
wrote:



Hello!

I've noticed that the admin center of SVN head seems to report two  
open

searches recently, though they appear to be the same searcher..

Example:

name:[EMAIL PROTECTED] main
class:  org.apache.solr.search.SolrIndexSearcher
version:1.0
description:index searcher
stats:  searcherName : [EMAIL PROTECTED] main
caching : true
numDocs : 157474
maxDoc : 467325
readerImpl : MultiSegmentReader
readerDir : org.apache.lucene.store.FSDirectory@/opt/solr/data/index
indexVersion : 1205944089163
openedAt : Tue Aug 12 06:48:41 PDT 2008
registeredAt : Tue Aug 12 06:48:42 PDT 2008
warmupTime : 1190

name:   searcher
class:  org.apache.solr.search.SolrIndexSearcher
version:1.0
description:index searcher
stats:  searcherName : [EMAIL PROTECTED] main
caching : true
numDocs : 157474
maxDoc : 467325
readerImpl : MultiSegmentReader
readerDir : org.apache.lucene.store.FSDirectory@/opt/solr/data/index
indexVersion : 1205944089163
openedAt : Tue Aug 12 06:48:41 PDT 2008
registeredAt : Tue Aug 12 06:48:42 PDT 2008
warmupTime : 1190


Note that the "stats:   searcherName : [EMAIL PROTECTED] main" line  
is the
same for both - leading me to think that this is just a display  
issue. Is

anyone else seeing this?

--Matthew





--
Regards,
Shalin Shekhar Mangar.




RE: Best way to index without diacritics

2008-08-12 Thread Steven A Rowe
Hi Alejandro,

Solr is Unicode aware.  The ISOLatin1AccentFilterFactory handles diacritics for 
the ISO Latin-1 section of the Unicode character set.  UTF (do you mean UTF-8?) 
is a (set of) Unicode serialization(s), and once Solr has deserialized it, it 
is just Unicode characters (Java's in-memory UTF-16 representation).

So as long as you're only concerned about removing diacritics from the set of 
Unicode characters that overlaps ISO Latin-1, and not about other Unicode 
characters, then ISOLatin1AccentFilterFactory should work for you.

Steve
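
For diacritics beyond the Latin-1 range, one option outside Solr's shipped filters is Java 6's java.text.Normalizer; a rough sketch (this is not a Solr analyzer component, and it would have to be wrapped in a custom TokenFilter to run at analysis time):

-8<-
import java.text.Normalizer;

public class StripDiacritics {
  // Decompose to NFD, then drop the combining marks; this covers much more of
  // Unicode than the Latin-1 block.
  public static String strip(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFD)
                     .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
  }

  public static void main(String[] args) {
    System.out.println(strip("Niño en México")); // prints: Nino en Mexico
  }
}
-8<-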

On 08/11/2008 at 7:22 PM, Alejandro Garza Gonzalez wrote:
> I have UTF-8 content that I want to index; however, I want searches
> without diacritics to return results.
> 
> For example, a document with the words "nino en mexico" should return
> results like a document with the phrase "Niño en México".
> 
> Ideally, exact diacritic matches should score higher (searching for
> "niño" exactly should make a document with "niño" score higher than a
> document with "nino")
> 
> Any pointers on how to do this? I found out about the
> solr.ISOLatin1AccentFilterFactory but it seems to only strip
> diacritics from ISO Latin-1 characters. How about UTF diacritics? --
> _ ___ _ _ _ _ _ _ _ *Ing. Alejandro Garza González*
> Director, Tecnología e Innovación, Biblioteca Tecnológico de Monterrey,
> Campus Monterrey
> 
> Tel.: 52(81) 8358-1400 ext. 4037 Fax: 52(81) 8328-4067
> Enlace Intercampus: 80 689 4037
> http://biblioteca.mty.itesm.mx
> 
> El contenido de este mensaje de datos no se considera oferta, propuesta
> o acuerdo, sino hasta que sea confirmado en documento por escrito que
> contenga la firma autógrafa del apoderado legal del ITESM. El contenido
> de este mensaje de datos es confidencial y se entiende dirigido y para
> uso exclusivo del destinatario, por lo que no podrá distribuirse y/o
> difundirse por ningún medio sin la previa autorización del emisor
> original. Si usted no es el destinatario, se le prohíbe su utilización
> total o parcial para cualquier fin.
> 
> The content of this data transmission must not be considered
> an offer,
> proposal, understanding or agreement unless it is confirmed in a
> document signed by a legal representative of ITESM. The
> content of this
> data transmission is confidential and is intended to be
> delivered only
> to the addressees. Therefore, it shall not be distributed and/or
> disclosed through any means without the authorization of the original
> sender. If you are not the addressee, you are forbidden from
> using it,
> either totally or partially, for any purpose.
> 
>

 



Re: concurrent optimize and update

2008-08-12 Thread Yonik Seeley
On Tue, Aug 12, 2008 at 11:19 AM, Jason Rennie <[EMAIL PROTECTED]> wrote:
> On Mon, Aug 11, 2008 at 6:41 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>
>> It's safe... the adds will block until the commit or optimize has finished.
>>
>
> By block, do you mean that the update connection(s) will be held open?

HTTP calls are synchronous, so yes it will hold a connection open
(unless the container is configured to time out responses after a
while).

> Our
> optimizes take many minutes to complete.  I'm thinking that this could cause
> a large pile of threads to accumulate if we're not careful...

Many HTTP clients block once a certain number of connections to the
same server are open, acting as a natural throttling mechanism.

-Yonik
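
A rough sketch of capping connections on the client side with commons-httpclient 3.x, which SolrJ's CommonsHttpSolrServer can be handed; the limit of 4 and the constructor variant are assumptions to adapt:

-8<-
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class ThrottledSolrClient {
  public static CommonsHttpSolrServer create(String url) throws Exception {
    MultiThreadedHttpConnectionManager mgr = new MultiThreadedHttpConnectionManager();
    mgr.getParams().setDefaultMaxConnectionsPerHost(4); // at most 4 concurrent requests to Solr
    mgr.getParams().setMaxTotalConnections(4);
    // Further adds issued while an optimize ties up those 4 connections simply wait in the client.
    return new CommonsHttpSolrServer(url, new HttpClient(mgr));
  }
}
-8<-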


Re: Highlighting Output

2008-08-12 Thread Tricia Williams

Martin,

You may want to follow Mark Miller's effort 
https://issues.apache.org/jira/browse/LUCENE-1286 as it develops -- 
perhaps even help with it.  He's developing a Lucene highlighter which 
would "run through query terms by using their offsets" making 
highlighting large documents much more time efficient.  I would be 
interested to see something like this end up as a Solr highlighting option.


Revisiting some of your original thoughts:

What I see though is that the highlighting functionality is heavily tied
to the fragment (highlight context) functionality. This actually makes
it interesting to write a plain highlight method that just returns meta
data (so some other process can do the actual highlighting in some
custom fashion).

So is it worthwhile to make sure that Solr is able to do multiple
different kinds of highlighting, even if it means passing meta data back
in the request? Should we have standard ways to index and read back
payload information if we're dealing with pages, books, co-ordinates
(for highlighting images) and other meta data which is used for
highlights (char offset, term offset, etcetera)? I also noticed much of
the highlighting code to do with fragments being duplicated in custom
code.
My idea for highlighting based on 
https://issues.apache.org/jira/browse/SOLR-380 was to include the 
coordinates for highlighting images as just another attribute in the 
input xml.  Then the PayloadComponent will give the coordinates 
associated with a given query as part of the xpath.  I have written some 
code beyond what is posted there that takes some extra parameters and 
reconstructs the xpath into useful results based on the granularity of 
the information that is requested (roughly based on xquery).  Is that a 
"standard" enough way or is there something else you're thinking about?


If you find anything thing I've contributed useful feel free to improve 
it for the benefit of those that use Solr and Lucene.


Tricia


NOTICE: multicore.xml changed to solr.xml, format changes as well

2008-08-12 Thread Chris Hostetter


If you've been using the trunk (and/or nightly builds) and you take 
advantage of the MultiCore features in Solr, please be aware...


As of r685244 (committed a few moments ago) Solr no longer looks for a 
"multicore.xml" file.  It instead looks for a "solr.xml" file.


solr.xml supports all of the options that multicore.xml supported; however, 
they have been "tweaked" slightly (in some cases renamed, in other 
cases attributes have been moved from one XML tag to another).


A detailed example can be seen in example/multicore/solr.xml...

http://svn.apache.org/viewvc/lucene/solr/trunk/example/multicore/solr.xml?view=markup

For more information, please see SOLR-689...
https://issues.apache.org/jira/browse/SOLR-689

Volunteers to help update the wiki documentation would be appreciated.


-Hoss



Re: Count of facet count

2008-08-12 Thread Chris Hostetter

: > : how I can get count of distinct facet_fields ?
: > : 
: > : like numFacetFound in this example:
: > 
: > There's currently no way to do that.

: I need to do the same thing. Any pointers on how one would go about
: implementing that? (in Java) Thanks.

The change would be in the SimpleFacets class, and there are a couple of 
different code paths to worry about (because two different heuristics are 
used depending on the field type), but the first step would be to define 
what the count represents: is it just the number of terms being returned? 
the number of terms that have a non-zero count? or all of the terms in the 
field?

The first and the last are pretty trivial; the middle one requires 
maintaining a new count as the terms are scanned (and if I'm not mistaken, 
there's an optimization in there to stop once we know we won't find any 
terms better than the ones we already have, and in order to return that 
count you'd need to prevent that optimization).


-Hoss



Re: adds / delete within same 'transaction'..

2008-08-12 Thread Mike Klaas

On 11-Aug-08, at 10:48 PM, Norberto Meijome wrote:


Hello :)

I *think* I know the answer, but I'd like to confirm:

Say I have
1old

already indexed and committed (i.e., 'live')

What happens if I issue:

1
1new


will delete happen first, and then the add, or could it be that the  
add happens before delete, in which case I end up with no more doc  
id=1?


As long as you are sending these requests on the same thread, they  
will occur in order.


-Mike
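
For reference, a rough SolrJ sketch of the delete-then-add sequence above on a single thread (the server URL and field names are placeholders):

-8<-
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ReplaceDocument {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    server.deleteById("1");               // sent first...

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "1");
    doc.addField("name", "new");
    server.add(doc);                      // ...then the replacement, on the same thread

    server.commit();                      // both become visible together
  }
}
-8<-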


2nd Hadoop Get Together Berlin

2008-08-12 Thread idrost
Hello,

I would like to announce that on Monday September 8, 2008  at 5:00pm in the 
newthinking store (Tucholskystr. 48, Berlin) the second Hadoop Get Together 
in Berlin is going to take place. Just like last time there will be slots of 
20min each for talks on your Hadoop topic. After each talk there will be a 
lot of time to discuss.

You can order drinks directly at the bar in the newthinking store. If you 
like, you can order pizza. There are quite a few good restaurants nearby, so 
we can go there after the official part.

Talks scheduled so far:

Marc Hofer will talk about his experiences bringing UIMA to Hadoop.

Rasmus Hahn is going to share his experiences with Hadoop from the perspective 
of his projects at neofonie (http://www.neofonie.de).

We would like to invite you, the visitor, to also tell your Hadoop story; if 
you like, you can bring slides - there will be a beamer (projector).

A big Thanks goes to the newthinking store for providing a room in the center 
of Berlin for us.

Further details on talks scheduled, people attending and instructions on how 
to find the location can be found on upcoming:

 http://upcoming.yahoo.com/event/1005510/?ps=5


Hope to see you there,
Isabel

-- 
You will attract cultured and artistic people to your home.
  |\  _,,,---,,_   Web:   
  /,`.-'`'-.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  




Re: Field Types Question

2008-08-12 Thread Jake Conk
Thanks Erik!

On Tue, Aug 12, 2008 at 1:58 AM, Erik Hatcher
<[EMAIL PROTECTED]> wrote:
>
> On Aug 11, 2008, at 9:28 PM, Jake Conk wrote:
>>
>> I was wondering what are the differences in certain field types? For
>> instance what's the difference between the following?
>>
>> integer / sint
>> float / sfloat
>
> The difference is the internal representation of the String value of the
> term representing the numbers.  The "s" prefix means the terms are sortable,
> in that they are in ascending _numerical_ (not textual) order within the
> index.
>
>> text / textzh
>
> Looks like maybe you're picking up an acts_as_solr schema (which uses
> text_zh, though).  "zh" is the country code for China... and that field type
> is likely configured to use a Chinese-savvy analyzer.
>
>> Also, if I have two dynamic fields for instance *_facet and *_facet_mv
>> which both have the type set to string does it really matter which one
>> I use?
>
> If the field types are identical, then no it won't matter which you use -
> the same thing will happen internally.
>
>Erik
>
>


Searching Questions

2008-08-12 Thread Jake Conk
1) I want to search only within a specific field, for instance
`category`. Is there a way to do this?

2) When searching for multiple results, are the following identical,
since "*_facet" and "*_facet_mv" both have their types set to string?

/select?q=tag_facet:%22John+McCain%22+OR+tag_facet:%22Barack+Obama%22
/select?q=tag_facet_mv:%22John+McCain%22+OR+tag_facet_mv:%22Barack+Obama%22

3) If I'm searching for something that is in a text field but I
specify it as a facet string rather than a text type would it still
search within text fields or would it just limit the search to string
fields?

4) Is there a page that will show me different querying combinations
or can someone post some more examples?

5) Has anyone else noticed that data returned in PHP format (&wt=phps)
doesn't unserialize? I am using PHP 5.3 w/ a nightly copy of Solr from
last week.

Thanks,
- Jake


Solr1.3 Freeze

2008-08-12 Thread Andrew Nagy
I read on the Solr 1.3 wiki page that there is a code freeze as of today; is 
this still accurate?  Moreover, does this mean that Solr 1.3 will most likely 
ship with Lucene 2.4-dev, or is there any plan to wait for Lucene 2.4 to be 
released?

I know scheduling questions are annoying, but I am curious as to how to better 
manage a project that uses solr and how releases should be scheduled around 
that.

Thanks!
Andrew


Re: [jira] Commented: (SOLR-693) IntFieldSource incompatible with sint field type

2008-08-12 Thread Yonik Seeley
Switching to solr-user.
Jerry, what type of function are you trying to do that Solr won't do
for you out of the box?

-Yonik


On Tue, Aug 12, 2008 at 5:13 PM, Jerry Quinn (JIRA) <[EMAIL PROTECTED]> wrote:
>
>[ 
> https://issues.apache.org/jira/browse/SOLR-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621980#action_12621980
>  ]
>
> Jerry Quinn commented on SOLR-693:
> --
>
> I found SortableIntFieldSource but it's not publicly accessible.  It's also 
> in org.apache.solr.schema instead of org.apache.solr.search.function like all 
> the other ValueSource objects.
>
>> IntFieldSource incompatible with sint field type
>> 
>>
>> Key: SOLR-693
>> URL: https://issues.apache.org/jira/browse/SOLR-693
>> Project: Solr
>>  Issue Type: Bug
>>  Components: search
>>Affects Versions: 1.3
>> Environment: RHEL 5, java6, builtin jetty container
>>Reporter: Jerry Quinn
>>
>> I'm trying to create a custom scoring query in Solr to implement a date 
>> bias.  I have a custom query parser that I'm using, that does nothing but 
>> wrap a BoostedQuery around the original query, which works in general.
>> I'm indexing and storing the day number in an sint field.  To implement my 
>> query, I extract the contents using 
>> org.apache.solr.search.function.IntFieldSource.  Unfortunately, this throws 
>> an exception when it executes:
>> HTTP ERROR: 500
>> For input string: "€?"
>> java.lang.NumberFormatException: For input string: "€?"
>>   at 
>> java.lang.NumberFormatException.forInputString(NumberFormatException.java:61)
>>   at java.lang.Integer.parseInt(Integer.java:460)
>>   at java.lang.Integer.parseInt(Integer.java:510)
>>   at 
>> org.apache.lucene.search.FieldCacheImpl$3.parseInt(FieldCacheImpl.java:148)
>>   at 
>> org.apache.lucene.search.FieldCacheImpl$7.createValue(FieldCacheImpl.java:262)
>>   at 
>> org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)
>>   at 
>> org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:245)
>>   at 
>> org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:239)
>>   at 
>> org.apache.solr.search.function.IntFieldSource.getValues(IntFieldSource.java:50)
>>   at 
>> org.apache.solr.search.function.FunctionQuery$AllScorer.<init>(FunctionQuery.java:103)
>>   at 
>> org.apache.solr.search.function.FunctionQuery$FunctionWeight.scorer(FunctionQuery.java:81)
>>   at 
>> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
>>   at org.apache.lucene.search.Searcher.search(Searcher.java:126)
>>   at org.apache.lucene.search.Searcher.search(Searcher.java:105)
>>   at 
>> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:966)
>>   at 
>> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:838)
>>   at 
>> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:269)
>>   at 
>> org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:160)
>>   at 
>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:175)
>>   at 
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1151)
>> I ran into exactly the same problem when I tried to use the CustomScoreQuery 
>> and IntFieldSource classes from Lucene.
>> I've tracked the problem down to the fact that IntFieldSource expects the 
>> contents of the field to actually be an integer as returned by 
>> FieldCache.getInts().  However, Solr converts a sortable int using 
>> NumberUtils.int2sortablestr().
>> If I change my code to create a custom FieldCache.IntParser that applies 
>> NumberUtils.SortableStr2int before returning the value, my query works as 
>> expected.  For example:
>> class MyIntParser implements FieldCache.IntParser {
>>   public int parseInt(String val) { return NumberUtils.SortableStr2int(val, 0, 
>> val.length()); }
>> }
>> Query q = new BoostedQuery(qry, new IntFieldSource("myfield", new 
>> MyIntParser()));
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>


RE: NOTICE: multicore.xml changed to solr.xml, format changes as well

2008-08-12 Thread Andrew Nagy
Chris - thanks for the alert.  Can you please clarify the usage of the default 
attribute that is documented to be used in the "core" node?  SOLR-545 has a 
note about this being removed and it is not shown in the new example solr.xml 
file.

Thanks
Andrew

> -Original Message-
> From: Chris Hostetter [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, August 12, 2008 2:01 PM
> To: solr-user@lucene.apache.org
> Subject: NOTICE: multicore.xml changed to solr.xml, format changes as
> well
>
>
> If you've bene using the trunk (and/or nightly builds) and you take
> advantage of the MultiCore features in Solr pelase be aware...
>
> As of r685244 (committed a few moments ago) Solr no longer looks for a
> "multicore.xml" file.  It instead looks for a "solr.xml" file.
>
> solr.xml supports all of the options that multicore.xml supported,
> however
> they have been "tweaked" slightly (in some cases renamed, in other
> attributes have been moved from one XML tag to another).
>
> A detailed example can be seen in example/multicore/solr.xml...
>
> http://svn.apache.org/viewvc/lucene/solr/trunk/example/multicore/solr.x
> ml?view=markup
>
> For more information, please see SOLR-689...
> https://issues.apache.org/jira/browse/SOLR-689
>
> Volunteers to help update the wiki documentation would be appreciated.
>
>
> -Hoss



Re: adds / delete within same 'transaction'..

2008-08-12 Thread Norberto Meijome
On Tue, 12 Aug 2008 11:21:50 -0700
Mike Klaas <[EMAIL PROTECTED]> wrote:

> > will delete happen first, and then the add, or could it be that the  
> > add happens before delete, in which case i end up with no more doc  
> > id=1 ?  
> 
> As long as you are sending these requests on the same thread, they  
> will occur in order.
> 
> -Mike

right, that is GREAT to know then :)

cheers,
b

_
{Beto|Norberto|Numard} Meijome

Life is not measured by the number of breaths we take, but by the moments that 
take our breath away.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Administrative questions

2008-08-12 Thread Jon Drukman
1. How do people deal with having solr start when system reboots, manage 
the log output, etc.  Right now I run it manually under a unix 'screen' 
command with a wrapper script that takes care of restarts when it 
crashes.  That means that only my user can connect to it, and it can't 
happen when the system starts up... But I don't see any other way to 
control the process easily.


2. Is there any way to modify a schema without stopping the process, 
destroying the existing index, then restarting and reloading all the 
data?  It doesn't take that long and we're not in production yet, but 
once we're live I can't see that being feasible.


-jsd-



Re: adds / delete within same 'transaction'..

2008-08-12 Thread Yonik Seeley
On Tue, Aug 12, 2008 at 1:48 AM, Norberto Meijome <[EMAIL PROTECTED]> wrote:
> What happens if I issue:
>
> 1
> 1new
> 
>
> will delete happen first, and then the add, or could it be that the add 
> happens before delete

Doesn't matter... it's an implementation detail.  Solr used to buffer
deletes, and if it crashed at the right time one could get duplicates.
 Now, Lucene does the buffering of deletes (internally lucene does the
adds first and buffers the deletes until a segment flush) and it
should be impossible to see more than one "1" or no "1" at all.

-Yonik


Static Fields vs Dynamic Fields

2008-08-12 Thread Jake Conk
Is there a performance difference when using fields that are defined
in my schema vs dynamic fields?


Re: Solr1.3 Freeze

2008-08-12 Thread Chris Hostetter

: I read on the Solr 1.3 wiki page that there is a code freeze as of 
: today, is this still accurate?  Moreover - does this mean that Solr1.3 
: will most likely ship with Lucene 2.4-dev or is there any plan to wait 
: for lucene 2.4 to be released?

People who are interested in following/discussing the release process 
should keep tabs on solr-dev ... Grant volunteered to act as the Release 
Manager for 1.3, and (to paraphrase his comments from a few hours ago) he 
does not feel we are quite ready for a feature freeze.


-Hoss



Re: adds / delete within same 'transaction'..

2008-08-12 Thread Norberto Meijome
On Tue, 12 Aug 2008 20:53:12 -0400
"Yonik Seeley" <[EMAIL PROTECTED]> wrote:

> On Tue, Aug 12, 2008 at 1:48 AM, Norberto Meijome <[EMAIL PROTECTED]> wrote:
> > What happens if I issue:
> >  
> > 1  
> > 1new  
> >   
> >
> > will delete happen first, and then the add, or could it be that the add 
> > happens before delete  
> 
> Doesn't matter... it's an implementation detail.  Solr used to buffer
> deletes, and if it crashed at the right time one could get duplicates.
>  Now, Lucene does the buffering of deletes (internally lucene does the
> adds first and buffers the deletes until a segment flush) and it
> should be impossible to see more than one "1" or no "1" at all.

Thanks Yonik. I wasn't asking about the specific details, but about the 
consequence. I seem to remember (incorrectly, or v1.2 only maybe) that if 
one wanted assurances that the case above happened in the right order, one had 
to commit after the deletes, and once more after the adds. 

This not being the case, I am happy :) 

Thanks again,
B
_
{Beto|Norberto|Numard} Meijome

"He has Van Gogh's ear for music."
  Billy Wilder

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Re: Best way to index without diacritics

2008-08-12 Thread Norberto Meijome
On Tue, 12 Aug 2008 11:44:42 -0400
"Steven A Rowe" <[EMAIL PROTECTED]> wrote:

> Solr is Unicode aware.  The ISOLatin1AccentFilterFactory handles diacritics 
> for the ISO Latin-1 section of the Unicode character set.  UTF (do you mean 
> UTF-8?) is a (set of) Unicode serialization(s), and once Solr has 
> deserialized it, it is just Unicode characters (Java's in-memory UTF-16 
> representation).
> 
> So as long as you're only concerned about removing diacritics from the set of 
> Unicode characters that overlaps ISO Latin-1, and not about other Unicode 
> characters, then ISOLatin1AccentFilterFactory should work for you.

hi,
do you know if anyone has implemented a similar filter using ICU and mapping (a 
lot more of) UTF-8 to ASCII? 

B

_
{Beto|Norberto|Numard} Meijome

"He has the attention span of a lightning bolt."
  Robert Redford

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


RE: NOTICE: multicore.xml changed to solr.xml, format changes as well

2008-08-12 Thread Chris Hostetter

: Chris - thanks for the alert.  Can you please clarify the usage of the 
: default attribute that is documented to be used in the "core" node.  
: Solr-545 has a note about this being removed and it is not shown in the 
: new example solr.xml file.

Any attribute that was in the old example multicore.xml has a 
corresponding attribute in the example solr.xml ... 
  
https://svn.apache.org/viewvc/lucene/solr/trunk/example/multicore/solr.xml?r1=650331&r2=685244
 

...no functionality was changed at all in this commit, it was just 
renamed.

I don't know anything about a "default" attribute, other than the fact 
that the previous commit to that file (r650331) had the message "default 
is no longer a multicore concept"

https://svn.apache.org/viewvc/lucene/solr/trunk/example/multicore/solr.xml

-Hoss