Problem instantiating CommonsHttpSolrServer using solrj

2010-08-11 Thread bijeet singh
Hi all,

I'm trying to use solrj for indexing in solr, but when I try to instantiate
the server, using :

SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr");

 I get the following runtime error:

Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/solr/client/solrj/SolrServerException
Caused by: java.lang.ClassNotFoundException:
org.apache.solr.client.solrj.SolrServerException
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)


I am following this link: http://wiki.apache.org/solr/Solrj and have included
all the jar files specified there in the classpath.


Do help me out with this, thanks in advance

Bijeet


Delta-import with solrj client

2010-08-11 Thread Hando420

Greetings. I have a SolrJ client for fetching data from a database, using
delta-import. If a column is changed in the database, the timestamp-based
delta-import indexes the latest version of the row, but the index also keeps a
duplicate of the same row with the older data. This works if I clean the index,
but I want to update the index without cleaning it. Is there a way to update
the index with the updated column without getting duplicate values? I'd
appreciate any feedback.

Hando
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Delta-import-with-solrj-client-tp1085763p1085763.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr query result not read the latest xml file

2010-08-11 Thread Jan Høydahl / Cominvent
Hi,

Yes, this is normal behavior. This is because Solr is *document* based; it does 
not know about *files*.
What happens here is that your source database (or whatever) has had deletions 
within this category in addition to updates, and you need to relay those to 
Solr.

The best way to integrate with your source system is through some connector 
which picks up deletes as well as adds (an update is just a special case of an 
add). If your source data is in a database, have a look at DataImportHandler, 
which can be set up to do things like this.

If your source data lives only in files on a file system, you'll have to write 
some scripts that take care of all of this, e.g. by first issuing the delete 
and then the add (tip: try -Dcommit=no on the delete request and -Dcommit=yes 
on the following add to avoid temporary loss of data).
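
As a rough sketch, the same delete-then-add flow in SolrJ instead of post.jar
(class names as in Solr 1.4; the category query, id, and field values below
are only examples):

import java.util.ArrayList;
import java.util.Collection;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ReindexCategory {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Delete the stale documents for the category, without committing yet,
        // so searchers never see the category empty in the meantime.
        server.deleteByQuery("ITEM_CAT:817");

        // Re-add the current documents for the category...
        Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "817-1");        // example id and fields only
        doc.addField("ITEM_CAT", "817");
        docs.add(doc);
        server.add(docs);

        // ...and make both the delete and the adds visible in a single commit.
        server.commit();
    }
}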

You need to think about what happens if a whole category is deleted. How would 
you know by simply looking at the file system?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 11. aug. 2010, at 04.10, e8en wrote:

> 
> thanks for your response Jan,
> I just learned that post.jar is only an example tool,
> so what should I use instead of post.jar for production?
> 
> btw, I already tried using this command:
> java -Durl=http://localhost:8983/search/update -jar post.jar cat_817.xml
> 
> and IT WORKS !!
> the cat_817.xml is reflected directly in the solr query after I commit it;
> this is the url:
> http://localhost:8983/search/select/?q=ITEM_CAT:817&version=2.2&start=0&rows=10&indent=on
> 
> the problem is that this only works if the old xml contains fewer docs than
> the new xml. For example, if the old cat_817.xml contains 2 docs and the new
> cat_817.xml contains 10 docs, then I just have to re-index (java
> -Durl=http://localhost:8983/search/update -jar post.jar cat_817.xml) and the
> query result will be correct (10 docs), but it doesn't work vice versa.
> If the old cat_817.xml contains 10 docs and the new one contains 2 docs,
> then I have to delete the index first (java -Ddata=args -Dcommit=yes -jar
> post.jar "<delete><query>ITEM_CAT:817</query></delete>") and re-index
> (java -Durl=http://localhost:8983/search/update -jar post.jar cat_817.xml)
> to make the query result correct (2 docs).
> 
> is it a normal process or something wrong with my solr?
> 
> once again thanks again Jan, your help really make my day brighter :)
> and I believe your answer will help many solr newbie especially me
> -- 
> View this message in context: 
> http://lucene.472066.n3.nabble.com/solr-query-result-not-read-the-latest-xml-file-tp1066785p1081802.html
> Sent from the Solr - User mailing list archive at Nabble.com.



timestamp field

2010-08-11 Thread Frederico Azeiteiro
Hi,

 

I have on my schema:

<field name="timestamp" type="date" ... default="NOW" />

This field is returned as 

2010-08-11T10:11:03.354Z

 

For an article added at 2010-08-11T11:11:03.354Z!

 

And the server has the time of 2010-08-11T11:11:03.354Z...

 

This is a w2003 server using solr 1.4. 

 

Any guess of what could be wrong here?

 

Thanks,

Frederico

 

 



Re: timestamp field

2010-08-11 Thread Jan Høydahl / Cominvent
Hi,

Which time zone are you located in? Do you have DST?

Solr uses UTC internally for dates, which means that "NOW" will be the time in 
London right now :) Does that appear to be right 4 u?
Also see this thread: http://search-lucene.com/m/hqBed2jhu2e2/
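
As a side note, converting the UTC value back to local time is a client-side
concern; a minimal Java sketch (the field value and time zone below are just
examples):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class UtcToLocal {
    public static void main(String[] args) throws Exception {
        // Solr returns dates in UTC, ISO-8601, with a trailing 'Z'.
        SimpleDateFormat utc = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
        utc.setTimeZone(TimeZone.getTimeZone("UTC"));
        Date d = utc.parse("2010-08-11T10:11:03.354Z");

        // Render it in the local zone (Portugal is UTC+1 in summer).
        SimpleDateFormat local = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss z");
        local.setTimeZone(TimeZone.getTimeZone("Europe/Lisbon"));
        System.out.println(local.format(d));   // 2010-08-11 11:11:03 WEST
    }
}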

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 11. aug. 2010, at 13.02, Frederico Azeiteiro wrote:

> Hi,
> 
> 
> 
> I have on my schema:
> 
> <field name="timestamp" type="date" ... default="NOW" />
> This field is returned as 
> 
> 2010-08-11T10:11:03.354Z
> 
> 
> 
> For an article added at 2010-08-11T11:11:03.354Z!
> 
> 
> 
> And the server has the time of 2010-08-11T11:11:03.354Z...
> 
> 
> 
> This is a w2003 server using solr 1.4. 
> 
> 
> 
> Any guess of what could be wrong here?
> 
> 
> 
> Thanks,
> 
> Frederico
> 
> 
> 
> 
> 



RE: timestamp field

2010-08-11 Thread Frederico Azeiteiro
Hi Jan,

Dah, I didn't know that :(

I always thought it used the servertime. 

Anyway, just out of curiosity: the hour is UTC but NOT the time in London right 
now.

London is UTC+1 (same as here in Portugal) :).

So, London solr users should have the same "problem".
Well, I must be careful when using this field.

Thanks for your answer,
Frederico

-Original Message-
From: Jan Høydahl / Cominvent [mailto:jan@cominvent.com] 
Sent: Wednesday, 11 August 2010 12:17
To: solr-user@lucene.apache.org
Subject: Re: timestamp field

Hi,

Which time zone are you located in? Do you have DST?

Solr uses UTC internally for dates, which means that "NOW" will be the time in 
London right now :) Does that appear to be right 4 u?
Also see this thread: http://search-lucene.com/m/hqBed2jhu2e2/

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 11. aug. 2010, at 13.02, Frederico Azeiteiro wrote:

> Hi,
> 
> 
> 
> I have on my schema:
> 
> <field name="timestamp" type="date" ... default="NOW" />
> This field is returned as 
> 
> 2010-08-11T10:11:03.354Z
> 
> 
> 
> For an article added at 2010-08-11T11:11:03.354Z!
> 
> 
> 
> And the server has the time of 2010-08-11T11:11:03.354Z...
> 
> 
> 
> This is a w2003 server using solr 1.4. 
> 
> 
> 
> Any guess of what could be wrong here?
> 
> 
> 
> Thanks,
> 
> Frederico
> 
> 
> 
> 
> 



Re: Delta-import with solrj client

2010-08-11 Thread kenf_nc

Short answer is no, there isn't a way. Solr doesn't have the concept of an
'update' to an indexed document. You need to re-add the full document (all
'columns') each time any one field changes. If doing that in your
DataImportHandler logic is difficult, you may need to write a separate update
service that does the following (see the sketch after this list):

1) Read UniqueID, UpdatedColumn(s)  from database
2) Using UniqueID Retrieve document from Solr
3) Add/Update field(s) with updated column(s)
4) Add document back to Solr
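
A minimal SolrJ sketch of that read-modify-write loop, assuming every field is
stored (the URL, document id, and column name are hypothetical):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class UpdateService {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // 1-2) Fetch the current document by its unique id.
        SolrDocument old = server.query(new SolrQuery("id:12345"))
                                 .getResults().get(0);

        // 3) Copy every stored field (multi-valued fields would need
        //    getFieldValues) and overwrite the changed column.
        SolrInputDocument updated = new SolrInputDocument();
        for (String field : old.getFieldNames()) {
            updated.addField(field, old.getFieldValue(field));
        }
        updated.setField("updatedColumn", "new value");

        // 4) Re-adding with the same unique id replaces the old document.
        server.add(updated);
        server.commit();
    }
}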

Although, if you use DIH to do a full import, using the same query in your
Delta-Import to get the whole document shouldn't be that difficult.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Delta-import-with-solrj-client-tp1085763p1086173.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: timestamp field

2010-08-11 Thread Mark Allan
For what it's worth, London and the rest of the UK is currently  
observing British Summer Time (called Daylight Savings Time in other  
parts of the world) which is why we appear to be UTC+1 between the  
last Sunday in March and the last Sunday in October.


Mark

On 11 Aug 2010, at 12:36 pm, Frederico Azeiteiro wrote:


Hi Jan,

Dah, I didn't know that :(

I always thought it used the servertime.

Anyway, just out of curiosity: the hour is UTC but NOT the time in
London right now.


London is UTC+1 (same as here in Portugal) :).

So, London solr users should have the same "problem".
Well, I must be careful when using this field.

Thanks for your answer,
Frederico

-Original Message-
From: Jan Høydahl / Cominvent [mailto:jan@cominvent.com]
Sent: Wednesday, 11 August 2010 12:17
To: solr-user@lucene.apache.org
Subject: Re: timestamp field

Hi,

Which time zone are you located in? Do you have DST?

Solr uses UTC internally for dates, which means that "NOW" will be  
the time in London right now :) Does that appear to be right 4 u?

Also see this thread: http://search-lucene.com/m/hqBed2jhu2e2/

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 11. aug. 2010, at 13.02, Frederico Azeiteiro wrote:


Hi,



I have on my schema:

<field name="timestamp" type="date" ... default="NOW" />

This field is returned as

2010-08-11T10:11:03.354Z



For an article added at 2010-08-11T11:11:03.354Z!



And the server has the time of 2010-08-11T11:11:03.354Z...



This is a w2003 server using solr 1.4.



Any guess of what could be wrong here?



Thanks,

Frederico











--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: Delta-import with solrj client

2010-08-11 Thread Jan Høydahl / Cominvent
Hi,

Make sure you use a proper "ID" field, which does *not* change even if the 
content in the database changes. In this way, when your delta-import fetches 
changed rows to index, they will update the existing rows in your index.
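
The replacement happens via the <uniqueKey> declared in schema.xml. A tiny
SolrJ illustration of the effect (given a SolrServer named server; the id and
field values are hypothetical):

// Two adds with the same uniqueKey value: the second replaces the first.
SolrInputDocument v1 = new SolrInputDocument();
v1.addField("id", "story-42");    // stable primary key from the database
v1.addField("title", "old title");
server.add(v1);

SolrInputDocument v2 = new SolrInputDocument();
v2.addField("id", "story-42");    // same key, so the old document is replaced
v2.addField("title", "new title");
server.add(v2);
server.commit();                  // the index now holds one doc, the new one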

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 11. aug. 2010, at 12.49, Hando420 wrote:

> 
> Greetings. I have a SolrJ client for fetching data from a database, using
> delta-import. If a column is changed in the database, the timestamp-based
> delta-import indexes the latest version of the row, but the index also keeps
> a duplicate of the same row with the older data. This works if I clean the
> index, but I want to update the index without cleaning it. Is there a way to
> update the index with the updated column without getting duplicate values?
> I'd appreciate any feedback.
> 
> Hando
> -- 
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Delta-import-with-solrj-client-tp1085763p1085763.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr 1.4 - stats page slow

2010-08-11 Thread Yonik Seeley
FYI, I opened https://issues.apache.org/jira/browse/SOLR-2036
for this.

-Yonik
http://www.lucidimagination.com

On Tue, Aug 10, 2010 at 8:35 PM, entdeveloper
 wrote:
>
> Apologies if this was resolved, but we just deployed Solr 1.4.1 and the stats
> page takes over a minute to load for us as well and began causing
> OutOfMemory errors so we've had to refrain from hitting the page. From what
> I gather, it is the fieldCache part that's causing it.
>
> Was there ever an official fix or recommendation on how to disable the stats
> page from calculating the fieldCache entries? If we could just ignore it, I
> think we'd be good to go since I find this page very useful otherwise.


DataImportHandler in Solr 1.4.1: exception handling in FileListEntityProcessor

2010-08-11 Thread Sascha Szott

Hi folks,

why does FileListEntityProcessor ignore onError="continue" and abort 
indexing if a directory or a file does not exist?


I'm using both XPathEntityProcessor and FileListEntityProcessor with 
onError set to continue. If a directory or file is not present, an 
exception is thrown and indexing stops immediately.


Below you can find a stack trace that is generated in case the directory 
/home/doe/foo does not exist:


SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' 
value: /home/doe/foo/bar.xml is not a directory Processing Document # 3
at 
org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:122)
at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)


How should I configure both processors so that missing directories and 
files are ignored and the indexing process does not stop immediately?


Best,
Sascha


Re: DataImportHandler in Solr 1.4.1: exception handling in FileListEntityProcessor

2010-08-11 Thread Sascha Szott

Sorry, there was a mistake in the stack trace. The correct one is:

SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' 
value: /home/doe/foo is not a directory Processing Document # 3
at 
org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:122)
at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370) 



-Sascha

On 11.08.2010 15:18, Sascha Szott wrote:

Hi folks,

why does FileListEntityProcessor ignore onError="continue" and abort
indexing if a directory or a file does not exist?

I'm using both XPathEntityProcessor and FileListEntityProcessor with
onError set to continue. If a directory or file is not present, an
exception is thrown and indexing stops immediately.

Below you can find a stack trace that is generated in case the directory
/home/doe/foo does not exist:

SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir'
value: /home/doe/foo/bar.xml is not a directory Processing Document # 3
at
org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:122)

at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)

at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)

at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)

at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)

at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)

at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)

at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)


How should I configure both processors so that missing directories and
files are ignored and the indexing process does not stop immediately?

Best,
Sascha


Re: Solr Doc Lucene Doc !?

2010-08-11 Thread stockii

I have a question about the Solr indexing mechanism with DIH ...

I am trying to understand how Solr indexes a doc, and at which points in the
code Solr uses Lucene.

This is my understanding so far:
DIH uses the SolrWriter to add a doc.
To create a SolrInputDocument, SolrWriter uses an AddUpdateCommand.
This command and doc are put through the UpdateRequestProcessorChain. In this
chain Solr creates a Lucene Document with DocumentBuilder and puts it back into
the chain !?!? Is this right?

Then the UpdateHandler gets the update command and manages the index changes
!?

So, I don't understand how the UpdateHandler works. Can anyone give me some
tips?

SolrIndexWriter is used by the UpdateHandler, and SolrIndexWriter uses the
IndexWriter from Lucene?

thx for your help =)=)
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Doc-Lucene-Doc-tp995922p1088334.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: PDF file

2010-08-11 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Thanks so much for your help! I got a "Remote Streaming is disabled" error. Would 
you please tell me if I missed something?

Thanks, 

-Original Message-
From: Jayendra Patil [mailto:jayendra.patil@gmail.com] 
Sent: Tuesday, August 10, 2010 8:51 PM
To: solr-user@lucene.apache.org
Subject: Re: PDF file

Try ...

curl "
http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?stream.file=
/pub2009001.pdf&literal.id=777045&commit=true"

stream.file - specify full path
literal. - specify any extra params if needed

Regards,
Jayendra

On Tue, Aug 10, 2010 at 4:49 PM, Ma, Xiaohui (NIH/NLM/LHC) [C] <
xiao...@mail.nlm.nih.gov> wrote:

> Thanks so much for your help! I tried to index a pdf file and got the
> following. The command I used is
>
> curl '
> http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?map.content=text&map.stream_name=id&commit=true'
> -F "fi...@pub2009001.pdf"
>
> Did I do something wrong? Do I need modify anything in schema.xml or other
> configuration file?
>
> 
> [xiao...@lhcinternal lhc]$ curl '
> http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?map.content=text&map.stream_name=id&commit=true'
> -F "fi...@pub2009001.pdf"
> 
> Error 404
> 
> HTTP ERROR: 404 NOT_FOUND
> RequestURI=/solr/lhc/update/extract
> Powered by Jetty://
> 
> ***
>
> -Original Message-
> From: Sharp, Jonathan [mailto:jsh...@coh.org]
> Sent: Tuesday, August 10, 2010 4:37 PM
> To: solr-user@lucene.apache.org
> Subject: RE: PDF file
>
> Xiaohui,
>
> You need to add the following jars to the lib subdirectory of the solr
> config directory on your server.
>
> (path inside the solr 1.4.1 download)
>
> /dist/apache-solr-cell-1.4.1.jar
> plus all the jars in
> /contrib/extraction/lib
>
> HTH
>
> -Jon
> 
> From: Ma, Xiaohui (NIH/NLM/LHC) [C] [xiao...@mail.nlm.nih.gov]
> Sent: Tuesday, August 10, 2010 11:57 AM
> To: 'solr-user@lucene.apache.org'
> Subject: RE: PDF file
>
> Does anyone have any experience with PDF file? I really appreciate your
> help!
> Thanks so much in advance.
>
> -Original Message-
> From: Ma, Xiaohui (NIH/NLM/LHC) [C]
> Sent: Tuesday, August 10, 2010 10:37 AM
> To: 'solr-user@lucene.apache.org'
> Subject: PDF file
>
> I have a lot of pdf files. I am trying to import pdf files to solr and
> index them. I added ExtractingRequestHandler to solrconfig.xml.
>
> Please tell me if I need download some jar files.
>
> In the Solr 1.4 Enterprise Search Server book, the following command is used
> to import a file named mccm.pdf.
>
> curl '
> http://localhost:8983/solr/solr-home/update/extract?map.content=text&map.stream_name=id&commit=true'
> -F "fi...@mccm.pdf"
>
> Please tell me if there is a way to import pdf files from a directory.
>
> Thanks so much for your help!
>
>
>
>
>


Re: Solr Doc Lucene Doc !?

2010-08-11 Thread stockii

oh, i see that i mixed DIH classes with other Solr classes ^^
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Doc-Lucene-Doc-tp995922p1088738.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: PDF file

2010-08-11 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Thanks, I found out how to enable remote streaming. But now I get another error: 
ERROR:unknown field 'metadata_trapped'. 

Does anyone know how to map the SolrCell metadata fields? I found the following 
in schema.xml, but I don't know what changes to make for PDF.



I really appreciate your help!
Thanks,

-Original Message-
From: Ma, Xiaohui (NIH/NLM/LHC) [C] 
Sent: Wednesday, August 11, 2010 10:36 AM
To: solr-user@lucene.apache.org
Cc: 'jayendra.patil@gmail.com'
Subject: RE: PDF file

Thanks so much for your help! I got a "Remote Streaming is disabled" error. Would 
you please tell me if I missed something?

Thanks, 

-Original Message-
From: Jayendra Patil [mailto:jayendra.patil@gmail.com] 
Sent: Tuesday, August 10, 2010 8:51 PM
To: solr-user@lucene.apache.org
Subject: Re: PDF file

Try ...

curl "
http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?stream.file=
/pub2009001.pdf&literal.id=777045&commit=true"

stream.file - specify full path
literal. - specify any extra params if needed

Regards,
Jayendra

On Tue, Aug 10, 2010 at 4:49 PM, Ma, Xiaohui (NIH/NLM/LHC) [C] <
xiao...@mail.nlm.nih.gov> wrote:

> Thanks so much for your help! I tried to index a pdf file and got the
> following. The command I used is
>
> curl '
> http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?map.content=text&map.stream_name=id&commit=true'
> -F "fi...@pub2009001.pdf"
>
> Did I do something wrong? Do I need modify anything in schema.xml or other
> configuration file?
>
> 
> [xiao...@lhcinternal lhc]$ curl '
> http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?map.content=text&map.stream_name=id&commit=true'
> -F "fi...@pub2009001.pdf"
> 
> Error 404
> 
> HTTP ERROR: 404 NOT_FOUND
> RequestURI=/solr/lhc/update/extract
> Powered by Jetty://
> 
> ***
>
> -Original Message-
> From: Sharp, Jonathan [mailto:jsh...@coh.org]
> Sent: Tuesday, August 10, 2010 4:37 PM
> To: solr-user@lucene.apache.org
> Subject: RE: PDF file
>
> Xiaohui,
>
> You need to add the following jars to the lib subdirectory of the solr
> config directory on your server.
>
> (path inside the solr 1.4.1 download)
>
> /dist/apache-solr-cell-1.4.1.jar
> plus all the jars in
> /contrib/extraction/lib
>
> HTH
>
> -Jon
> 
> From: Ma, Xiaohui (NIH/NLM/LHC) [C] [xiao...@mail.nlm.nih.gov]
> Sent: Tuesday, August 10, 2010 11:57 AM
> To: 'solr-user@lucene.apache.org'
> Subject: RE: PDF file
>
> Does anyone have any experience with PDF file? I really appreciate your
> help!
> Thanks so much in advance.
>
> -Original Message-
> From: Ma, Xiaohui (NIH/NLM/LHC) [C]
> Sent: Tuesday, August 10, 2010 10:37 AM
> To: 'solr-user@lucene.apache.org'
> Subject: PDF file
>
> I have a lot of pdf files. I am trying to import pdf files to solr and
> index them. I added ExtractingRequestHandler to solrconfig.xml.
>
> Please tell me if I need download some jar files.
>
> In the Solr 1.4 Enterprise Search Server book, the following command is used
> to import a file named mccm.pdf.
>
> curl '
> http://localhost:8983/solr/solr-home/update/extract?map.content=text&map.stream_name=id&commit=true'
> -F "fi...@mccm.pdf"
>
> Please tell me if there is a way to import pdf files from a directory.
>
> Thanks so much for your help!
>
>
>
>
>


SolrException log

2010-08-11 Thread Bastian Spitzer
Hi,

we are using solr 1.4.1 in a master-slave setup with replication,
requests are loadbalanced to both instances. this is just working fine,
but the slave
behaves strange sometimes with a "SolrException log" (trace below). We
are using 1.4.1 for weeks now, and this has happened only a few times
so far, and it only occured on the Slave. The Problem seemed to be gone
when we added a cron-job to send a periodic  (once a day)
to the master, but today it did happen again. The Index contains 55
files right now, after optimize there are only 10. So it seems its a
problem when
the index is spread among a lot files. The Slave wont ever recover once
this Exception shows up, the only thing that helps is a restart. 

Is this a known issue? Only workaround would be to track the
commit-counts and send additional  requests after a certain
amount of
commits, but id prefer solving this problem rather than building a
workaround..

Any hints/thoughts on this issue are verry much appreciated, thanks in
advance for your help.

cheers Bastian.

Aug 11, 2010 4:51:58 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select
params={fl=media_id,keyword_1004&sort=priority_1000+desc,+score+desc&ind
ent=off&start=0&q=mandant_id:1000+AND+partner_id:1000+AND+active_1000:tr
ue+AND+cat_id_path_1000:7231/7258*+AND+language_id:1004&rows=24&version=
2.2} status=500 QTime=2
Aug 11, 2010 4:51:58 PM org.apache.solr.common.SolrException log
SEVERE: java.io.IOException: read past EOF
at
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.jav
a:151)
at
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.j
ava:38)
at
org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:78)
at
org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:112)
at
org.apache.lucene.search.FieldCacheImpl$IntCache.createValue(FieldCacheI
mpl.java:461)
at
org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:22
4)
at
org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:430)
at
org.apache.lucene.search.FieldCacheImpl$IntCache.createValue(FieldCacheI
mpl.java:445)
at
org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:22
4)
at
org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:430)
at
org.apache.lucene.search.FieldComparator$IntComparator.setNextReader(Fie
ldComparator.java:332)
at
org.apache.lucene.search.TopFieldCollector$MultiComparatorNonScoringColl
ector.setNextReader(TopFieldCollector.java:435)
at
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:249)
at org.apache.lucene.search.Searcher.search(Searcher.java:171)
at
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.
java:988)
at
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.j
ava:884)
at
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:3
41)
at
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.
java:182)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(Search
Handler.java:195)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerB
ase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.ja
va:338)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.j
ava:241)
at
org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(Web
ApplicationHandler.java:821)
at
org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationH
andler.java:471)
at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:568)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1530)
at
org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationCon
text.java:633)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1482)
at org.mortbay.http.HttpServer.service(HttpServer.java:909)
at
org.mortbay.http.HttpConnection.service(HttpConnection.java:820)
at
org.mortbay.http.ajp.AJP13Connection.handleNext(AJP13Connection.java:295
)
at
org.mortbay.http.HttpConnection.handle(HttpConnection.java:837)
at
org.mortbay.http.ajp.AJP13Listener.handleConnection(AJP13Listener.java:2
12)
at
org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
at
org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)


RE: Improve Query Time For Large Index

2010-08-11 Thread Burton-West, Tom
Hi Peter,

Can you give a few more examples of slow queries?
Are they phrase queries? Boolean queries? Prefix or wildcard queries?
If one-word queries are your slow queries, then CommonGrams won't help.
CommonGrams will only help with phrase queries.

How are you using term vectors?  That may be slowing things down.  I don't have 
experience with term vectors, so someone else on the list might speak to that.

When you say the query time for common terms stays slow, do you mean that if you 
re-issue the exact query, the second query is not faster?  That seems very 
strange.  You might restart Solr and send a first query (the first query 
always takes a relatively long time).  Then pick one of your slow queries and 
send it twice.  The second time you send the query it should be much faster 
due to the Solr caches, and you should be able to see the cache hit in the Solr 
admin panel.  If you send the exact query a second time (without enough 
intervening queries to evict data from the cache), the Solr queryResultCache 
should get hit and you should see a response time in the .01-5 millisecond 
range.
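
As a rough SolrJ sketch of that test (the URL and query term are just
examples):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CacheCheck {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery("http");   // one of the slow terms

        for (int run = 1; run <= 2; run++) {
            QueryResponse rsp = server.query(query);
            // QTime is Solr's server-side time in ms; the second run should
            // drop to almost nothing if the queryResultCache is hit.
            System.out.println("run " + run + ": QTime=" + rsp.getQTime()
                    + " hits=" + rsp.getResults().getNumFound());
        }
    }
}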

What settings are you using for your Solr caches?

How much memory is on the machine?  If your bottleneck is disk i/o for frequent 
terms, then you want to make sure you have enough memory for the OS disk cache. 
 

I assume that http is not in your stopwords.  Again, CommonGrams will only help 
with phrase queries.
CommonGrams was committed and is in Solr 1.4.  If you decide to use CommonGrams 
you definitely need to re-index, and you also need to use both the index-time 
filter and the query-time filter.  Your index will be larger.

Tom
-Original Message-
From: Peter Karich [mailto:peat...@yahoo.de] 
Sent: Tuesday, August 10, 2010 3:32 PM
To: solr-user@lucene.apache.org
Subject: Re: Improve Query Time For Large Index

Hi Tom,

my index is around 3GB in size and I am using 2GB of RAM for the JVM, although
some more is available.
If I am looking into the RAM usage while a slow query runs (via
jvisualvm) I see that only 750MB of the JVM RAM is used.

> Can you give us some examples of the slow queries?

for example the empty query solr/select?q=
takes very long or solr/select?q=http
where 'http' is the most common term

> Are you using stop words?  

yes, a lot. I stored them into stopwords.txt

> http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2

this looks interesting. I read through
https://issues.apache.org/jira/browse/SOLR-908 and it seems to be in 1.4.
I only need to enable it via something like:

<filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"/>

right? Do I need to reindex?

Regards,
Peter.

> Hi Peter,
>
> A few more details about your setup would help list members to answer your 
> questions.
> How large is your index?  
> How much memory is on the machine and how much is allocated to the JVM?
> Besides the Solr caches, Solr and Lucene depend on the operating system's 
> disk caching for caching of postings lists.  So you need to leave some memory 
> for the OS.  On the other hand if you are optimizing and refreshing every 
> 10-15 minutes, that will invalidate all the caches, since an optimized index 
> is essentially a set of new files.
>
> Can you give us some examples of the slow queries?  Are you using stop words? 
>  
>
> If your slow queries are phrase queries, then you might try either adding the 
> most frequent terms in your index to the stopwords list  or try CommonGrams 
> and add them to the common words list.  (Details on CommonGrams here: 
> http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2)
>
> Tom Burton-West
>
> -Original Message-
> From: Peter Karich [mailto:peat...@yahoo.de] 
> Sent: Tuesday, August 10, 2010 9:54 AM
> To: solr-user@lucene.apache.org
> Subject: Improve Query Time For Large Index
>
> Hi,
>
> I have 5 Million small documents/tweets (=> ~3GB) and the slave index
> replicates itself from master every 10-15 minutes, so the index is
> optimized before querying. We are using solr 1.4.1 (patched with
> SOLR-1624) via SolrJ.
>
> Now the search speed is slow >2s for common terms which hits more than 2
> mio docs and acceptable for others: <0.5s. For those numbers I don't use
> highlighting or facets. I am using the following schema [1] and from
> luke handler I know that numTerms =~20 mio. The query for common terms
> stays slow if I retry again and again (no cache improvements).
>
> How can I improve the query time for the common terms without using
> Distributed Search [2] ?
>
> Regards,
> Peter.
>
>
> [1]
> <field name="..." type="..." required="true" />
> <field name="..." type="..."
> termVectors="true" termPositions="true" termOffsets="true"/>
>
> [2]
> http://wiki.apache.org/solr/DistributedSearch
>
>
>   


-- 
http://karussell.wordpress.com/



Re: how to support "implicit trailing wildcards"

2010-08-11 Thread yandong yao
Hi Jan,

It seems q=mount OR mount* gives a different sort order than q=mount for those
documents containing mount.
I changed to q=mount^100 OR (mount?* -mount)^1.0, and it tests well.

Thanks very much!

2010/8/10 Jan Høydahl / Cominvent 

> Hi,
>
> You don't need to duplicate the content into two fields to achieve this.
> Try this:
>
> q=mount OR mount*
>
> The exact match will always get higher score than the wildcard match
> because wildcard matches uses "constant score".
>
> Making this work for multi term queries is a bit trickier, but something
> along these lines:
>
> q=(mount OR mount*) AND (everest OR everest*)
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>
> On 10. aug. 2010, at 09.38, Geert-Jan Brits wrote:
>
> > you could satisfy this by making 2 fields:
> > 1. exactmatch
> > 2. wildcardmatch
> >
> > use copyfield in your schema to copy 1 --> 2 .
> >
> > q=exactmatch:mount+wildcardmatch:mount*&q.op=OR
> > this would score exact matches above (solely) wildcard matches
> >
> > Geert-Jan
> >
> > 2010/8/10 yandong yao 
> >
> >> Hi Bastian,
> >>
> >> Sorry for not making it clear: I also want exact matches to have a higher
> >> score than wildcard matches. That means: if searching 'mount', documents
> >> with 'mount' will have a higher score than documents with 'mountain',
> >> while 'mount*' seems to treat 'mount' and 'mountain' the same.
> >>
> >> besides, I also want the query to be processed with the analyzer, while from
> >>
> >>
> http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F
> >> ,
> >> Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer.
> >> The rationale is that if I search 'mounted', I also want documents with
> >> 'mount' to match.
> >>
> >> So it seems built-in wildcard search cannot satisfy my requirements, if I
> >> understand correctly.
> >>
> >> Thanks very much!
> >>
> >>
> >> 2010/8/9 Bastian Spitzer 
> >>
> >>> Wildcard-Search is already built in, just use:
> >>>
> >>> ?q=umoun*
> >>> ?q=mounta*
> >>>
> >>> -----Original Message-----
> >>> From: yandong yao [mailto:yydz...@gmail.com]
> >>> Sent: Monday, 9 August 2010 15:57
> >>> To: solr-user@lucene.apache.org
> >>> Subject: how to support "implicit trailing wildcards"
> >>>
> >>> Hi everyone,
> >>>
> >>>
> >>> How to support 'implicit trailing wildcard *' using Solr, eg: using
> >> Google
> >>> to search 'umoun', 'umount' will be matched , search 'mounta',
> 'mountain'
> >>> will be matched.
> >>>
> >>> From my point of view, there are several ways, both with disadvantages:
> >>>
> >>> 1) Using EdgeNGramFilterFactory, thus 'umount' will be indexed with
> 'u',
> >>> 'um', 'umo', 'umou', 'umoun', 'umount'. The disadvantages are: a) the
> >> index
> >>> size increases dramatically, b) it will match even where there is no
> >>> relationship, e.g. 'mount' will also match 'mountain'.
> >>>
> >>> 2) Using two pass searching: first pass searches term dictionary
> through
> >>> TermsComponent using given keyword, then using the first matched term
> >> from
> >>> term dictionary to search again. eg: when user enter 'umoun',
> >> TermsComponent
> >>> will match 'umount', then use 'umount' to search. The disadvantages are:
> >> a)
> >>> need to parse query string so that could recognize meta keywords such
> as
> >>> 'AND', 'OR', '+', '-', '"' (this makes more complex as I am using PHP
> >>> client), b) the returned hit count is not for the original search string,
> >> thus
> >>> will influence other components such as auto-suggest component based on
> >> user
> >>> search history and hit counts.
> >>>
> >>> 3) Write custom SearchComponent, while have no idea where/how to start
> >>> with.
> >>>
> >>> Is there any other way in Solr to do this, any feedback/suggestion are
> >>> welcome!
> >>>
> >>> Thanks very much in advance!
> >>>
> >>
>
>


Re: Improve Query Time For Large Index

2010-08-11 Thread Robert Muir
On Wed, Aug 11, 2010 at 11:47 AM, Burton-West, Tom wrote:

> Hi Peter,
>
> Can you give a few more examples of slow queries?
> Are they phrase queries? Boolean queries? prefix or wildcard queries?
> If one word queries are your slow queries, than CommonGrams won't help.
>  CommonGrams will only help with phrase queries.
>

Since the example given was "http" being slow, it's worth mentioning that if
queries are "one word" URLs [for example http://lucene.apache.org] these
will actually form slow phrase queries by default.

Because your content is very tiny documents, it's probably good to disable
this, since the phrases likely won't help the results at all but make things
unbearably slow. In Solr 3_x and trunk, you can disable these automatic
phrase queries in schema.xml with autoGeneratePhraseQueries="false":

<fieldType name="text" class="solr.TextField" autoGeneratePhraseQueries="false" ...>

then the system won't form phrase queries unless the user explicitly puts
double quotes around it.

-- 
Robert Muir
rcm...@gmail.com


Re: Need help with facets

2010-08-11 Thread Moazzam Khan
That's awesome.

Thanks Ahmet!

On Wed, Aug 11, 2010 at 1:50 AM, Ahmet Arslan  wrote:
>
>
> --- On Wed, 8/11/10, Moazzam Khan  wrote:
>
>> From: Moazzam Khan 
>> Subject: Re: Need help with facets
>> To: solr-user@lucene.apache.org
>> Date: Wednesday, August 11, 2010, 1:32 AM
>> Thanks Ahmet that worked!
>>
>> Here's another issues I have :
>>
>> Like I said before, I have these fields in Solr documents
>>
>> FirstName
>> LastName
>> RecruitedDate
>> VolumeDate (just added this in this email)
>> VolumeDone (just added this in this email)
>>
>>
>> Now I have to get sum of all VolumeDone (integer field) for
>> this month
>> by everyone, then take 25% of that number and get all
>> people whose
>> volume was more than that. Is there a way to do this? :D
>
> You need to execute two queries for that. The Stats Component can give you the sum.
> q=VolumeDate:[NOW-1MONTH TO NOW]&stats=true&stats.field=VolumeDone
>
> http://wiki.apache.org/solr/StatsComponent
>
> Then second query
> q=VolumeDate:[NOW-1MONTH TO NOW]&fq=VolumeDone:[sumComesAbove TO *]
>
> But you need to use the tint type instead of int for VolumeDone, so that range 
> queries work correctly.
>
>
>
>
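
For completeness, a SolrJ sketch of Ahmet's two-step flow (given a SolrServer
named server; getFieldStatsInfo is available from SolrJ 1.4; field names are
taken from the thread):

// Step 1: get the sum of VolumeDone for the last month via StatsComponent.
SolrQuery stats = new SolrQuery("VolumeDate:[NOW-1MONTH TO NOW]");
stats.set("stats", true);
stats.set("stats.field", "VolumeDone");
QueryResponse rsp = server.query(stats);
double sum = (Double) rsp.getFieldStatsInfo().get("VolumeDone").getSum();

// Step 2: fetch everyone whose volume is above 25% of that sum.
SolrQuery top = new SolrQuery("VolumeDate:[NOW-1MONTH TO NOW]");
top.addFilterQuery("VolumeDone:[" + (0.25 * sum) + " TO *]");
System.out.println(server.query(top).getResults().getNumFound() + " people");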


Analysing SOLR logfiles

2010-08-11 Thread Jay Flattery
Hi there,


Just wondering what tools people use to analyse SOLR log files.

We're looking to do things like extracting common queries, calculating average 
QTime and hits, returning particularly slow/expensive queries, etc.

Would prefer not to code something (completely) from scratch.

Thanks!


  



Filter Performance in Solr 1.3

2010-08-11 Thread Bargar, Matthew B
Hi there, I have a question about filter (fq) performance in Solr 1.3.
After doing some testing it seems as though adding a filter increases
search time. From what I've read here
http://www.derivante.com/2009/06/23/solr-filtering-performance-increase/

and here
http://www.lucidimagination.com/blog/2009/05/27/filtered-query-performan
ce-increases-for-solr-14/ 

it seems as though upgrading to 1.4 would solve this problem. My
question is whether there is anything that can be done in 1.3 to help
alleviate the problem, before upgrading to 1.4? It becomes an issue
because the majority of searches that are done on our site need some
content type excluded or filtered for. Does it make sense to use the fq
parameter in this way, or is there some better approach since filters
are almost always used?

Thank you!


Re: Filter Performance in Solr 1.3

2010-08-11 Thread Geert-Jan Brits
fq's are the preferred way to filter when the same filter is often used
(since the filter set can be cached separately).

as to your direct question:
> My question is whether there is anything that can be done in 1.3 to
help alleviate the problem, before upgrading to 1.4?

I don't think so (perhaps some patches that I'm not aware of) .

When are you seeing increased search time?

is it the first time the filter is used? If that's the case, that's expected,
since the filter needs to be built.
(fq) filters only show their strength (as said above) when you use them
repeatedly.

If on the other hand you're consistently seeing slower response times with an
fq filter applied than for the same queries without it, something strange must
be going on, since this really shouldn't happen in normal situations.
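
Given a SolrServer named server, a small SolrJ sketch of reusing the same
filter across different queries (field name and terms are hypothetical):

// The fq is computed and cached in the filterCache on first use, then
// reused for later queries, independently of the main query string.
SolrQuery q1 = new SolrQuery("mountain bike");
q1.addFilterQuery("type:video");   // first use: builds and caches the filter

SolrQuery q2 = new SolrQuery("helmet");
q2.addFilterQuery("type:video");   // same filter: served from the cache

server.query(q1);
server.query(q2);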

Geert-Jan





2010/8/11 Bargar, Matthew B 

> Hi there, I have a question about filter (fq) performance in Solr 1.3.
> After doing some testing it seems as though adding a filter increases
> search time. From what I've read here
> http://www.derivante.com/2009/06/23/solr-filtering-performance-increase/
>
> and here
> http://www.lucidimagination.com/blog/2009/05/27/filtered-query-performan
> ce-increases-for-solr-14/
>
> it seems as though upgrading to 1.4 would solve this problem. My
> question is whether there is anything that can be done in 1.3 to help
> alleviate the problem, before upgrading to 1.4? It becomes an issue
> because the majority of searches that are done on our site need some
> content type excluded or filtered for. Does it make sense to use the fq
> parameter in this way, or is there some better approach since filters
> are almost always used?
>
> Thank you!
>


Data Import Handler Query

2010-08-11 Thread Manali Joshi
Hi,



I have installed solr 1.4 and am trying to use the Data Import Handler to
import data from a database. I have 2 tables which share a 1 to many
relation (1 Story to Many Images).



I want my index to contain attributes regarding “Story” and also all
“Images” that it has. Based on the DIH documentation, I have setup the
data-config.xml as follows:





<document>
  <entity name="story"
          query="..."
          deltaImportQuery="... '${dataimporter.last_index_time}'"
          deltaQuery="select story_id from story where time >
                      '${dataimporter.last_index_time}'">
    <field column="..." name="..." />
    <entity name="images" query="...">
      <field column="..." name="..." />
    </entity>
  </entity>
</document>

However, when I query the index, I find that it imports only the first
images record that it finds for a story. E.g., if I have a story with 3
images, the index only has information about the first one. Is it possible
to get the data for all of a story's images into the same index document? If
so, what am I missing in the data config?



Thanks.


RE: Filter Performance in Solr 1.3

2010-08-11 Thread Bargar, Matthew B
The search with the filter takes longer than a search for the same term
with no filter, even after repeated searches, once the cache should have come
into play. To be more specific, this happens with filters that exclude
very few results from the overall set.

For instance, type:video returns few results and as one would expect,
returns much quicker than a search without that filter. 

-type:video, on the other hand returns a lot of results and excludes
very few, and actually takes longer than a search without any filter at
all.

Is this what one might expect when using a filter that excludes few
results, or does it still seem like something strange might be
happening?

Thanks,
Matt 

-Original Message-
From: Geert-Jan Brits [mailto:gbr...@gmail.com] 
Sent: Wednesday, August 11, 2010 2:55 PM
To: solr-user@lucene.apache.org
Subject: Re: Filter Performance in Solr 1.3

fq's are the preferred way to filter when the same filter is often used
(since the filter set can be cached separately).

as to your direct question:
> My question is whether there is anything that can be done in 1.3 to
help alleviate the problem, before upgrading to 1.4?

I don't think so (perhaps some patches that I'm not aware of) .

When are you seeing increased search time?

is it the first time the filter is used? If that's the case, that's
expected, since the filter needs to be built.
(fq) filters only show their strength (as said above) when you use them
repeatedly.

If on the other hand you're consistently seeing slower response times with
an fq filter applied than for the same queries without it, something
strange must be going on, since this really shouldn't happen in normal
situations.

Geert-Jan





2010/8/11 Bargar, Matthew B 

> Hi there, I have a question about filter (fq) performance in Solr 1.3.
> After doing some testing it seems as though adding a filter increases 
> search time. From what I've read here 
> http://www.derivante.com/2009/06/23/solr-filtering-performance-increas
> e/
>
> and here
> http://www.lucidimagination.com/blog/2009/05/27/filtered-query-perform
> an
> ce-increases-for-solr-14/
>
> it seems as though upgrading to 1.4 would solve this problem. My 
> question is whether there is anything that can be done in 1.3 to help 
> alleviate the problem, before upgrading to 1.4? It becomes an issue 
> because the majority of searches that are done on our site need some 
> content type excluded or filtered for. Does it make sense to use the 
> fq parameter in this way, or is there some better approach since 
> filters are almost always used?
>
> Thank you!
>


Re: how to support "implicit trailing wildcards"

2010-08-11 Thread Jan Høydahl / Cominvent
I guess q=mount OR (mount*)^0.01 would work equally as well, i.e. diminishing 
the effect of wildcard matches.
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 11. aug. 2010, at 17.53, yandong yao wrote:

> Hi Jan,
> 
> It seems q=mount OR mount* gives a different sort order than q=mount for those
> documents containing mount.
> I changed to q=mount^100 OR (mount?* -mount)^1.0, and it tests well.
> 
> Thanks very much!
> 
> 2010/8/10 Jan Høydahl / Cominvent 
> 
>> Hi,
>> 
>> You don't need to duplicate the content into two fields to achieve this.
>> Try this:
>> 
>> q=mount OR mount*
>> 
>> The exact match will always get higher score than the wildcard match
>> because wildcard matches uses "constant score".
>> 
>> Making this work for multi term queries is a bit trickier, but something
>> along these lines:
>> 
>> q=(mount OR mount*) AND (everest OR everest*)
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Training in Europe - www.solrtraining.com
>> 
>> On 10. aug. 2010, at 09.38, Geert-Jan Brits wrote:
>> 
>>> you could satisfy this by making 2 fields:
>>> 1. exactmatch
>>> 2. wildcardmatch
>>> 
>>> use copyfield in your schema to copy 1 --> 2 .
>>> 
>>> q=exactmatch:mount+wildcardmatch:mount*&q.op=OR
>>> this would score exact matches above (solely) wildcard matches
>>> 
>>> Geert-Jan
>>> 
>>> 2010/8/10 yandong yao 
>>> 
 Hi Bastian,
 
 Sorry for not making it clear: I also want exact matches to have a higher
 score than wildcard matches. That means: if searching 'mount', documents
 with 'mount' will have a higher score than documents with 'mountain',
 while 'mount*' seems to treat 'mount' and 'mountain' the same.
 
 besides, I also want the query to be processed with the analyzer, while from
 
 
>> http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F
 ,
 Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer.
 The rationale is that if I search 'mounted', I also want documents with
 'mount' to match.
 
 So it seems built-in wildcard search cannot satisfy my requirements, if I
 understand correctly.
 
 Thanks very much!
 
 
 2010/8/9 Bastian Spitzer 
 
> Wildcard-Search is already built in, just use:
> 
> ?q=umoun*
> ?q=mounta*
> 
> -----Original Message-----
> From: yandong yao [mailto:yydz...@gmail.com]
> Sent: Monday, 9 August 2010 15:57
> To: solr-user@lucene.apache.org
> Subject: how to support "implicit trailing wildcards"
> 
> Hi everyone,
> 
> 
> How to support 'implicit trailing wildcard *' using Solr, eg: using
 Google
> to search 'umoun', 'umount' will be matched , search 'mounta',
>> 'mountain'
> will be matched.
> 
> From my point of view, there are several ways, both with disadvantages:
> 
> 1) Using EdgeNGramFilterFactory, thus 'umount' will be indexed with
>> 'u',
> 'um', 'umo', 'umou', 'umoun', 'umount'. The disadvantages are: a) the
 index
> size increases dramatically, b) will matches even has no relationship,
 such
> as such 'mount' will match 'mountain' also.
> 
> 2) Using two pass searching: first pass searches term dictionary
>> through
> TermsComponent using given keyword, then using the first matched term
 from
> term dictionary to search again. eg: when user enter 'umoun',
 TermsComponent
> will match 'umount', then use 'umount' to search. The disadvantages are:
 a)
> need to parse query string so that could recognize meta keywords such
>> as
> 'AND', 'OR', '+', '-', '"' (this makes more complex as I am using PHP
> client), b) the returned hit count is not for the original search string,
 thus
> will influence other components such as auto-suggest component based on
 user
> search history and hit counts.
> 
> 3) Write custom SearchComponent, while have no idea where/how to start
> with.
> 
> Is there any other way in Solr to do this, any feedback/suggestion are
> welcome!
> 
> Thanks very much in advance!
> 
 
>> 
>> 



Re: Data Import Handler Query

2010-08-11 Thread kenf_nc

It may not be the data config. Are the fields in schema.xml that
the image data goes into set to multiValued="true"?

Although I would think the last image would be stored, not the first, but I
haven't really tested this.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Data-Import-Handler-Query-tp1092010p1092917.html
Sent from the Solr - User mailing list archive at Nabble.com.


bug or feature???

2010-08-11 Thread Jean-Sebastien Vachon
Hi,

Can someone tell me why the two following queries do not return the same 
results?
Is that a bug or a feature?

http://localhost:8983/jobs/select?fq=title:(NOT janitor)&fq=description:(NOT 
janitor)&q=*:*

http://localhost:8983/jobs/select?q=title:(NOT janitor) AND description:(NOT 
janitor)


The second query returns no result while the first one returns 6097276 documents

Thanks


General questions about distributed solr shards

2010-08-11 Thread JohnRodey

1) Is there any information on preferred maximum sizes for a single solr
index.  I've read some people say 10 million, some say 80 million, etc... 
Is there any official recommendation or has anyone experimented with large
datasets into the tens of billions?

2) Is there any down side to running multiple solr shard instances on a
single machine rather than one shard instance with a larger index per
machine?  I would think that having 5 instances with 1/5 the index would
return results approx 5 times faster.

3) Say you have a solr configuration with multiple shards.  If you attempt
to query while one of the shards is down you will receive a HTTP 500 on the
client due to a connection refused on the server.  Is there a way to tell
the server to ignore this and return as many results as possible?  In other
words if you have 100 shards, it is possible that occasionally a process may
die, but I would still like to return results from the active shards.

Thanks
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/General-questions-about-distributed-solr-shards-tp1095117p1095117.html
Sent from the Solr - User mailing list archive at Nabble.com.


Indexing and ExtractingRequestHandler

2010-08-11 Thread Harry Hochheiser
I'm trying to use Solr to index the contents of an Excel file, using
the ExtractingRequestHandler (CSV handler won't work for me - I need
to consider the whole spreadsheet as one document), and I'm running
into some trouble.

Is there any way to see what's going on during the indexing process?
I'm concerned that I may be losing some terms, and I'd like to see if
I can snoop on the terms that are added to the index as they go along.
How might I do this?

Barring that, how can I inspect the index after the fact?  I have tried to
use luke to see what's in the index, but I get an error: "Unknown
format version -10". Is it possible to get luke to work?

My solr build is straight out of SVN.

thanks,

harry


Re: Analysing SOLR logfiles

2010-08-11 Thread Jan Høydahl / Cominvent
Have a look at www.splunk.com

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 11. aug. 2010, at 19.34, Jay Flattery wrote:

> Hi there,
> 
> 
> Just wondering what tools people use to analyse SOLR log files.
> 
> We're looking to do things like extracting common queries, calculating 
> averaging 
> 
> Qtime and hits, returning particularly slow/expensive queries, etc.
> 
> Would prefer not to code something (completely) from scratch.
> 
> Thanks!
> 
> 
> 
> 



Re: bug or feature???

2010-08-11 Thread Jan Høydahl / Cominvent
Your syntax looks a bit funny.

Which version of Solr are you using? Pure negative queries are not supported, 
try q=(*:* -title:janitor) instead.
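
Applied to your second query, that would be something like (untested):

q=(*:* -title:janitor) AND (*:* -description:janitor)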

Also, for debugging what's going on, please add &debugQuery=true and share the 
parsed query for both cases with us.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 11. aug. 2010, at 22.28, Jean-Sebastien Vachon wrote:

> Hi,
> 
> Can someone tell me why the two following queries do not return the same 
> results?
> Is that a bug or a feature?
> 
> http://localhost:8983/jobs/select?fq=title:(NOT janitor)&fq=description:(NOT 
> janitor)&q=*:*
> 
> http://localhost:8983/jobs/select?q=title:(NOT janitor) AND description:(NOT 
> janitor)
> 
> 
> The second query returns no results, while the first one returns 6097276 
> documents.
> 
> Thanks



Re: Data Import Handler Query

2010-08-11 Thread Manali Joshi
I tried setting the schema fields that hold the image data to
multiValued="true", but it still gets only the first image's data. It doesn't
have information about all the images.
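
(For reference, the schema change tried was along these lines -- the field
name here is hypothetical:)

<field name="image" type="string" indexed="true" stored="true" multiValued="true"/>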




On Wed, Aug 11, 2010 at 1:15 PM, kenf_nc wrote:

>
> It may not be the data config. Are the fields in the schema.xml that
> the image data is going into set to multiValued="true"?
>
> Although, I would think the last image would be stored, not the first, but
> haven't really tested this.
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Data-Import-Handler-Query-tp1092010p1092917.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Indexing and ExtractingRequestHandler

2010-08-11 Thread Jan Høydahl / Cominvent
Hi,

You can try the Tika command line to parse your Excel file; then you will see the 
exact textual output from it, which is what will be indexed into Solr, and thus inspect 
whether something is missing.
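
For example (assuming the standalone tika-app jar; adjust the jar version
and file name to your setup):

java -jar tika-app-0.8.jar --text spreadsheet.xls > extracted.txt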

Are you sure you are using a version of Luke that supports your version of Lucene?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 11. aug. 2010, at 23.33, Harry Hochheiser wrote:

> I'm trying to use Solr to index the contents of an Excel file, using
> the ExtractingRequestHandler (CSV handler won't work for me - I need
> to consider the whole spreadsheet as one document), and I'm running
> into some trouble.
> 
> Is there any way to see what's going on during the indexing process?
> I'm concerned that I may be losing some terms, and I'd like to see if
> I can snoop on the terms that are added to the index as they go along.
> How might I do this?
> 
> Barring that, how can I inspect the index after the fact?  I have tried to
> use Luke to see what's in the index, but I get an error: "Unknown
> format version -10". Is it possible to get Luke to work?
> 
> My solr build is straight out of SVN.
> 
> thanks,
> 
> harry



Re: DIH transformer script size limitations with Jetty?

2010-08-11 Thread harrysmith

To follow up on my own question, it appears this is only an issue when using
the DataImport console debugging tools. It looks like when submitting the
debugging request, the data-config.xml is sent via a GET request, which
fails once the config grows large (hitting the URL length limit).  However,
using the exact same data-config.xml via a full-import operation (i.e. not a
dry run debug), the request is sent via POST and the import works fine.
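
(For example, a full import can be triggered with a plain request -- host,
port and handler path assumed to match the default DIH setup:)

curl "http://localhost:8983/solr/dataimport?command=full-import"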
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-transformer-script-size-limitations-with-Jetty-tp1091246p1100285.html
Sent from the Solr - User mailing list archive at Nabble.com.


DIH - Insert another record After first load

2010-08-11 Thread Girish
Hi,

I loaded the data with DIH, and the initial load is done. Now I want to
add new records dynamically, as and when they are received.

Use cases:

   1. I did the initial load of 7MM records and everything is working fine.
   2. A new record is received, and now I want to add this new record to the
indexed data. Here is the difference in the processing and the logic:
  * The initial data load is done from an Oracle materialized view
  * The new record is added to the underlying tables from which the view is
created, and is not yet visible in the view
  * Now I want to add this new record to the index. I have a Java
bean loaded with the data, including the index column.
  * I looked at the indexed file and it is all encoded.
   3. How do I get this loaded Java bean into the index?

An example would really help.

Thanks
Girish
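
A minimal SolrJ sketch of one way to index such a bean (class and field
names hypothetical; the @Field names must match your schema, and the URL
your deployment):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.beans.Field;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class AddNewRecord {
    // Hypothetical bean; annotate each property with its schema field name.
    public static class Record {
        @Field("id") public String id;
        @Field("name") public String name;
    }

    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        Record r = new Record();
        r.id = "12345";
        r.name = "a new record";
        server.addBean(r);  // adds, or replaces by uniqueKey, this one document
        server.commit();    // make the addition visible to searchers
    }
}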


How to "OR" facet queries

2010-08-11 Thread Frank A
Hi,  I have 3 facet fields (A,B,C) the values of each facet field will
be shown as check boxes to users:

Field A
[x]  Val1a
[x]  Val2a
[]  Val3a

Field B
[x] Val1b
[] Val2b
[] Val3b

Within a field, if the user selects two items, I want the query to be
an "OR" query.  Currently I'm generating something like:

&fq=FieldA%3AVal1a&fq=FieldA%3AVal2a&fq=FieldB%3AVal1b

This is not working, as the two filter queries on FieldA are being ANDed
together.  What is the proper syntax to accomplish what I'm trying to do?

Thanks.


Re: How to "OR" facet queries

2010-08-11 Thread Geek Gamer
On Thu, Aug 12, 2010 at 7:12 AM, Frank A wrote:

> Hi,  I have 3 facet fields (A,B,C) the values of each facet field will
> be shown as check boxes to users:
>
> Field A
> [x]  Val1a
> [x]  Val2a
> []  Val3a
>
> Field B
> [x] Val1b
> [] Val2b
> [] Val3b
>
> Within a field, if the user selects two items, I want the query to be
> an "OR" query.  Currently I'm generating something like:
>
> &fq=FieldA%3AVal1a&fq=FieldA%3AVal2a&fq=FieldB%3AVal1b
>
&fq=FieldA%3AVal1a%20OR%20FieldA%3AVal2a&fq=FieldB%3AVal1b

>
> This is not working, as the two filter queries on FieldA are being ANDed
> together.  What is the proper syntax to accomplish what I'm trying to do?
>
> Thanks.
>
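
(Decoded, the filter suggested above reads:
fq=FieldA:Val1a OR FieldA:Val2a&fq=FieldB:Val1b
-- values within a single fq are ORed, while separate fq parameters still
intersect.)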


Re: DIH transformer script size limitations with Jetty?

2010-08-11 Thread Girish Pandit

Have you tried bumping the -Xmx value, e.g. to -Xmx1300m?

I had a similar problem with DIH loading the data, and when I bumped the 
memory everything worked fine!
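
(For example, when running the bundled Jetty example -- start script and
path assumed:)

java -Xmx1300m -jar start.jar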


harrysmith wrote:

To follow up on my own question, it appears this is only an issue when using
the DataImport console debugging tools. It looks like when submitting the
debugging request, the data-config.xml is sent via a GET request, which
fails once the config grows large (hitting the URL length limit).  However,
using the exact same data-config.xml via a full-import operation (i.e. not a
dry run debug), the request is sent via POST and the import works fine.
  




Re: Indexing and ExtractingRequestHandler

2010-08-11 Thread Harry Hochheiser
Thanks.

I've run Tika from the command line to parse the Excel file, and I see
content in it that doesn't appear to be indexed. I've also tried using Tika
to parse the Excel file and then using the ExtractingRequestHandler to index
the resulting text, and that doesn't work either.

As far as Luke goes, I've built it from scratch and it still bombs. Is it
possible that it's not compatible with Lucene builds based on trunk?

thanks,


-harry

On Wed, Aug 11, 2010 at 6:48 PM, Jan Høydahl / Cominvent wrote:
> Hi,
>
> You can try the Tika command line to parse your Excel file; then you will see the 
> exact textual output from it, which is what will be indexed into Solr, and thus 
> inspect whether something is missing.
>
> Are you sure you are using a version of Luke that supports your version of Lucene?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>
> On 11. aug. 2010, at 23.33, Harry Hochheiser wrote:
>
>> I'm trying to use Solr to index the contents of an Excel file, using
>> the ExtractingRequestHandler (CSV handler won't work for me - I need
>> to consider the whole spreadsheet as one document), and I'm running
>> into some trouble.
>>
>> Is there any way to see what's going on during the indexing process?
>> I'm concerned that I may be losing some terms, and I'd like to see if
>> I can snoop on the terms that are added to the index as they go along.
>> How might I do this?
>>
>> Barring that, how can I inspect the index after the fact?  I have tried to
>> use Luke to see what's in the index, but I get an error: "Unknown
>> format version -10". Is it possible to get Luke to work?
>>
>> My solr build is straight out of SVN.
>>
>> thanks,
>>
>> harry
>
>


Re: Schema Definition Question

2010-08-11 Thread Lance Norskog
Can you do a DB join on OurID? That makes the association in the
database, before it gets to the DataImportHandler.

On Sun, Aug 8, 2010 at 6:17 PM, Frank A wrote:
> Hi,
>
> I have a db handler with the following definition:
>
>   <entity name="place"
>           query="select OurID,Name,City,State,lat,lng,cost from place"
>           deltaQuery="select OurID from destinations where
>                       OurID='${dataimporter.request.did}'"
>           deltaImportQuery="select OurID,Name,City,State,lat,lng,cost
>                             from place where OurID='${dataimporter.delta.id}'">
>
>       <entity name="feature"
>               query="select label,f.FeatureID from features f, featureplace fp
>                      where fp.PlaceID='${place.OurID}' and fp.FeatureID=f.FeatureID">
>           <field column="label" name="features"/>
>           <field column="FeatureID" name="FeatureIDs"/>
>       </entity>
>   </entity>
>
> In my schema I have:
>
>   <field name="features" type="string" indexed="true" stored="true" multiValued="true"/>
>   <field name="FeatureIDs" type="string" indexed="true" stored="true" multiValued="true"/>
>
> This yields results that have a list of feature labels and a separate
> list of FeatureIDs with now real connection between the two.  Is there
> a better way to represent this?
>
> Thanks.
>



-- 
Lance Norskog
goks...@gmail.com


In multicore env, can I make it access core0 by default

2010-08-11 Thread Chengyang
That is, accessing http://localhost/solr/select?q=*:* would be equivalent to 
http://localhost/solr/core0/select?q=*:*.
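
One approach, if your Solr version supports it, is the defaultCoreName
attribute on the cores element in solr.xml (a sketch only; verify the
attribute against your version's multicore documentation):

<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="core0">
    <core name="core0" instanceDir="core0"/>
    <core name="core1" instanceDir="core1"/>
  </cores>
</solr>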




Re: Schema Definition Question

2010-08-11 Thread harrysmith

I think I know where you're headed; I was struggling with the same issue. In
my case, using results from Solr, I link to a detailed profile using an ID
but display the String value. I was looking for something like:



<doc>
  <field name="id">12345</field>
  <features>
    <feature>
      <label>Feature 1 label</label>
      <id>1</id>
    </feature>
    <feature>
      <label>Feature 2 label</label>
      <id>2</id>
    </feature>
  </features>
</doc>

...or something similar, some way of linking child items together.
Unfortunately, this isn't how Solr works.

This issue is addressed in the Solr 1.4 book by Smiley and Pugh. This
related snippet is from Chapter 2, page 36, dealing with an example
application with a Music artist's name, and a related id.

"...If we only record the name, then it is problematic to do things like
have links in the UI from a band member to that member's detail page... This
means that we'll need to have an additional multi-valued field for the
member's ID. Multi-valued fields maintain ordering so that the two fields
would have corresponding values at a given index. Beware, there can be a
tricky case when one of the values can be blank, and you need to come up
with a placeholder. The client code would have to know about this
placeholder."

So it seems we are assured that the multi-valued fields will stay in the
same order, and we can use the same index number. This seems clunky to
me, but I have not come across any other solutions.
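
For example (field names hypothetical), a document could be added with
parallel multi-valued fields:

<add>
  <doc>
    <field name="id">12345</field>
    <field name="features">Feature 1 label</field>
    <field name="features">Feature 2 label</field>
    <field name="FeatureIDs">1</field>
    <field name="FeatureIDs">2</field>
  </doc>
</add>

so that, on retrieval, the client treats features[i] and FeatureIDs[i] as a
pair.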
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Schema-Definition-Question-tp1049966p1105593.html
Sent from the Solr - User mailing list archive at Nabble.com.