Distributed search component.

2011-04-04 Thread Rok Rejc
Hi all,

I am trying to create a distributed search component in Solr, which is quite
difficult (at least for me, because I am new to Solr and Java). Anyway, I have
looked into the Solr source (FacetComponent, TermsComponent...) and created
my own search component (it extends SearchComponent), but I still have two
questions (for now):

1.) In the prepare method I have the following code:

String shards = params.get(ShardParams.SHARDS);
if (shards != null) {
    List<String> lst = StrUtils.splitSmart(shards, ",", true);
    rb.shards = lst.toArray(new String[lst.size()]);
    rb.isDistrib = true;
}

If I remove the "rb.isDistrib = true;" line, the distributed methods are not
called. But to set isDistrib my code must live in the
"org.apache.solr.handler.component" package (because the field is not visible
from outside). Is this the correct procedure/behaviour/design?
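
For reference, a minimal sketch of how such a custom component is typically
wired into solrconfig.xml (the component and handler names here are hypothetical):

<searchComponent name="mycomponent"
                 class="org.apache.solr.handler.component.MyComponent" />

<requestHandler name="/mysearch" class="solr.SearchHandler">
  <arr name="last-components">
    <str>mycomponent</str>
  </arr>
</requestHandler>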

2.) The methods (process, distributedProcess, handleResponses...) are all
called properly. I can read the partial responses in handleResponses, but I
don't know how to build the "final" response. I see that, for example,
TermsComponent has a helper in the ResponseBuilder which collects all the
terms. Is this the only way (editing the ResponseBuilder source), or can I
achieve that without editing Solr's source?
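
One way to build the final response without touching ResponseBuilder is to
stash the intermediate merge state in the request context and emit the result
in finishStage(). A minimal sketch, assuming a component that returns per-key
counts under a hypothetical "mycounts" response key:

package org.apache.solr.handler.component;

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.common.util.NamedList;

public class MyComponent extends SearchComponent {

    // prepare()/process()/distributedProcess() and the SolrInfoMBean
    // methods that SearchComponent requires are omitted for brevity.

    @Override
    public void handleResponses(ResponseBuilder rb, ShardRequest sreq) {
        // Accumulate partial per-shard results in the request context
        // instead of adding a helper field to ResponseBuilder.
        @SuppressWarnings("unchecked")
        Map<String, Long> merged =
                (Map<String, Long>) rb.req.getContext().get("mycounts");
        if (merged == null) {
            merged = new HashMap<String, Long>();
            rb.req.getContext().put("mycounts", merged);
        }
        for (ShardResponse srsp : sreq.responses) {
            NamedList<?> partial = (NamedList<?>)
                    srsp.getSolrResponse().getResponse().get("mycounts");
            if (partial == null) continue;
            for (int i = 0; i < partial.size(); i++) {
                String key = partial.getName(i);
                long val = ((Number) partial.getVal(i)).longValue();
                Long prev = merged.get(key);
                merged.put(key, prev == null ? val : prev + val);
            }
        }
    }

    @Override
    public void finishStage(ResponseBuilder rb) {
        // finishStage() runs once per stage; remove() ensures the merged
        // result is emitted only once, after the shard responses arrived.
        @SuppressWarnings("unchecked")
        Map<String, Long> merged =
                (Map<String, Long>) rb.req.getContext().remove("mycounts");
        if (merged != null) {
            NamedList<Object> out = new NamedList<Object>();
            for (Map.Entry<String, Long> e : merged.entrySet()) {
                out.add(e.getKey(), e.getValue());
            }
            rb.rsp.add("mycounts", out);
        }
    }
}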

Many thanks,

Rok


Re: Faceting on multivalued field

2011-04-04 Thread Kaushik Chakraborty
Are you suggesting I change the DB query of the nested entity which fetches
the comments (the query is in my post), or can something be done during
indexing, e.g. using Transformers?

Thanks,
Kaushik


On Mon, Apr 4, 2011 at 8:07 AM, Erick Erickson wrote:

> Why not count them on the way in and just store that number along
> with the original e-mail?
>
> Best
> Erick
>
> On Sun, Apr 3, 2011 at 10:10 PM, Kaushik Chakraborty wrote:
>
> > Ok. My expectation was: since "comment_post_id" is a MultiValued field,
> > it would appear multiple times (i.e. once for each comment), and hence when I
> > facet on that field it would also give me the count of those many
> > documents where comment_post_id appears.
> >
> > My requirement is getting the total for every document, i.e. finding the
> > number of comments per post in the whole corpus. To explain it more clearly,
> > I'm getting a result XML something like this:
> >
> > (result XML, tags stripped by the mail archive; the surviving values show
> > post_id 46, title "Hello World", the multivalued fields
> > comment_post_id = [46, 46] and comment_text = ["Hello - from World", "Hi"],
> > a few unlabeled numeric fields, and a facet count of *1* for comment_post_id 46)
> >
> > I need the count to be 2 as the post 46 has 2 comments.
> >
> > What other way can I approach this?
> >
> > Thanks,
> > Kaushik
> >
> >
> > On Mon, Apr 4, 2011 at 4:29 AM, Erick Erickson wrote:
> >
> > > Hmmm, I think you're misunderstanding faceting. It's counting the
> > > number of documents that have a particular value. So if you're
> > > faceting on "comment_post_id", there is one and only one document
> > > with that value (assuming that the comment_post_ids are unique).
> > > Which is what's being reported. This will be quite expensive on a
> > > large corpus, BTW.
> > >
> > > Is your task to show the totals for *every* document in your corpus or
> > > just the ones in a display page? Because if the latter, your app could
> > > just count up the number of elements in the XML returned for the
> > > multiValued comments field.
> > >
> > > If that's not relevant, could you explain a bit more why you need this
> > > count?
> > >
> > > Best
> > > Erick
> > >
> > > On Sun, Apr 3, 2011 at 2:31 PM, Kaushik Chakraborty <kaych...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > My index contains a root entity "Post" and a child entity "Comments".
> > > > Each post can have multiple comments. data-config.xml:
> > > >
> > > > <document>
> > > >   <entity name="posts" dataSource="jdbc" query="...">
> > > >     ...field mappings stripped by the mail archive...
> > > >     <entity name="comments" query="select * from comments where post_id = ${posts.post_id}">
> > > >       ...field mappings stripped by the mail archive...
> > > >     </entity>
> > > >   </entity>
> > > > </document>
> > > >
> > > > The schema has all columns of the "comment" entity as "MultiValued" fields,
> > > > and all fields are indexed & stored. My requirement is to count the number
> > > > of comments for each post. The approach I'm taking is to query on "*:*" and
> > > > facet the result on "comment_post_id" so that it gives the count of
> > > > comments for that post.
> > > >
> > > > But I'm getting an incorrect result, e.g. if a post has 2 comments, the
> > > > multivalued fields are populated alright but the facet count comes out
> > > > as 1 (for that post_id). What else do I need to do?
> > > >
> > > >
> > > > Thanks,
> > > > Kaushik
> > > >
> > >
> >
>


Using MLT feature

2011-04-04 Thread Frederico Azeiteiro
Hi,

 

I would like to hear your opinion about the MLT feature and whether it's a
good solution for what I need to implement.

 

My index has fields like: headline, body and medianame.

What I need to do is, before adding a new doc, verify whether a similar doc
already exists for this media.

 

My idea is to use the MoreLikeThisHandler
(http://wiki.apache.org/solr/MoreLikeThisHandler) in the following way:

 

For each new doc, perform an MLT search with q=medianame and
stream.body=headline+bodytext.

If no similar docs are found, then I can safely add the doc.

 

Is this feasible using the MLT handler? Is it a good approach? Is there
a better way to perform this comparison?
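
For what it's worth, a sketch of such a request against the MoreLikeThisHandler
(the handler path, field names and example value are assumptions; stream.body
carries the new article's text, fq restricts candidates to the exact medianame,
and the URL is shown unencoded for readability):

http://localhost:8983/solr/mlt?fq=medianame:"The Daily News"
    &mlt.fl=headline,body&mlt.mintf=1&mlt.mindf=1&rows=1
    &stream.body=...headline+and+body+text+of+the+new+doc...

If numFound comes back as 0, the doc can be added safely.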

 

Thank you for your help.

 

Best regards,



Frederico Azeiteiro

 



Re: Using MLT feature

2011-04-04 Thread Chris Fauerbach
Do you want to skip indexing when something is similar, or only when it is an
exact duplicate? I would look into a hash code of the document if you want to
catch exact duplicates. Similarity, though, I think has to be based off a
document already in the index.
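
A minimal sketch of the hash-code idea on the client side (the class and the
field handling are illustrative): compute a digest over the fields that define
an exact duplicate, index it in a string field, and query for it before adding.

import java.math.BigInteger;
import java.security.MessageDigest;

public class ExactDuplicateCheck {

    // MD5 over the concatenated fields that define an "exact duplicate".
    public static String signature(String headline, String body, String medianame)
            throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(
                (headline + "\u0000" + body + "\u0000" + medianame).getBytes("UTF-8"));
        return new BigInteger(1, digest).toString(16);
    }
}

A query like q=signature:<hash>&rows=0 then tells you, via numFound, whether an
exact duplicate is already in the index.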

On Apr 4, 2011, at 5:16, Frederico Azeiteiro wrote:

> Hi,
> 
> 
> 
> I would like to hear your opinion about the MLT feature and whether it's a
> good solution for what I need to implement.
> 
> My index has fields like: headline, body and medianame.
> 
> What I need to do is, before adding a new doc, verify whether a similar doc
> already exists for this media.
> 
> My idea is to use the MoreLikeThisHandler
> (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following way:
> 
> For each new doc, perform an MLT search with q=medianame and
> stream.body=headline+bodytext.
> 
> If no similar docs are found, then I can safely add the doc.
> 
> Is this feasible using the MLT handler? Is it a good approach? Is there
> a better way to perform this comparison?
> 
> 
> 
> Thank you for your help.
> 
> 
> 
> Best regards,
> 
> 
> 
> Frederico Azeiteiro
> 
> 
> 


Mongo REST interface and full data import

2011-04-04 Thread andrew_s
Hi everyone,

I'm trying to make a simple data import from MongoDB into Solr using the REST
interface.

As a test example I've created a schema.xml like:

<schema name="..." version="...">
   ...field type and field definitions stripped by the mail archive...
   <uniqueKey>isbn</uniqueKey>
   <defaultSearchField>title</defaultSearchField>
</schema>

and a data-import.xml whose contents were entirely stripped by the mail archive.

Unfortunately it's not working and I'm stuck at this point.

Could you please advise how to correctly parse the JSON-format data?


Data format looks like:
{
  "offset" : 0,
  "rows": [
{ "_id" : { "$oid" : "4d9829412c8bd1064400" }, "isbn" : "716739356",
"title" : "Proteins", "description" : "" } ,
{ "_id" : { "$oid" : "4d9829412c8bd1064401" }, "isbn" :
"144433056X", "title" : "How to Assess Doctors and Health Professionals",
"description" : "" } ,
{ "_id" : { "$oid" : "4d9829412c8bd1064402" }, "isbn" :
"1406208159", "title" : "Freestyle: Time Travel Guides: Pack B",
"description" : "Takes you on a trip through history to visit the great
ancient civilisations." } ],
  "total_rows" : 3 ,
  "query" : {} ,
  "millis" : 0
}


Thank you.



RE: Using MLT feature

2011-04-04 Thread Frederico Azeiteiro
Hi,

The idea is: don't index if something similar (headline+bodytext) exists for
the same exact medianame.

Do you mean I would need to index the doc first (maybe in a temp index)
and then use the MLT feature to find similar docs before adding to the final
index?

Thanks,
Frederico


-Original Message-
From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com] 
Sent: segunda-feira, 4 de Abril de 2011 10:22
To: solr-user@lucene.apache.org
Subject: Re: Using MLT feature

Do you want to not index if something similar? Or don't index if exact.
I would look into a hash code of the document if you don't want to index
exact.Similar though, I think has to be based off a document in the
index.   

On Apr 4, 2011, at 5:16, Frederico Azeiteiro wrote:

> Hi,
> 
> 
> 
> I would like to hear your opinion about the MLT feature and whether it's a
> good solution for what I need to implement.
> 
> My index has fields like: headline, body and medianame.
> 
> What I need to do is, before adding a new doc, verify whether a similar doc
> already exists for this media.
> 
> My idea is to use the MoreLikeThisHandler
> (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following way:
> 
> For each new doc, perform an MLT search with q=medianame and
> stream.body=headline+bodytext.
> 
> If no similar docs are found, then I can safely add the doc.
> 
> Is this feasible using the MLT handler? Is it a good approach? Is there
> a better way to perform this comparison?
> 
> 
> 
> Thank you for your help.
> 
> 
> 
> Best regards,
> 
> 
> 
> Frederico Azeiteiro
> 
> 
> 


Re: Spellchecking Escaped Queries

2011-04-04 Thread Colin Vipurs
Thanks Chris, 

The field used for indexing and spellcheck is the same and is configured
like this:

<fieldtype name="..." class="solr.TextField">
   <analyzer>
      <!-- tokenizer definition stripped by the mail archive -->
      <filter class="solr.SynonymFilterFactory" synonyms="..." ignoreCase="true" expand="true"/>
      <filter class="solr.PatternReplaceFilterFactory"
              pattern="^([^!]+)\!([^!]+)$"
              replacement="$1i$2"
              replace="all"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
              generateNumberParts="1" catenateWords="1" catenateNumbers="0"
              catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
   </analyzer>
</fieldtype>

I use the pattern replace filter to swap all instances of "!" within a
word to "i". I know this part is working, as searching behaves as expected.

The spellcheck is initialized like this:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
   <str name="queryAnalyzerFieldType">title</str>
   <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">searchfield</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
      <str name="buildOnCommit">false</str>
   </lst>
</searchComponent>

And it is attached as a component to my search handler, with spellchecking
done inline with the queries.

Thanks,

Colin


> : I'm having an issue performing a spellcheck on some information and
> : search of the archive isn't helping.
> 
> For this type of question, there's not much feedback anyone can offer w/o
> knowing exactly what analyzers you have configured for the various
> fieldtypes (both the field you index/search and the fieldtype used for
> spellchecking)
> 
> it's also fairly critical to know how you have the spellcheck component 
> configured.
> 
> off the cuff: i'd guess that maybe WordDelimiterFilter is being used in a 
> wonky way given your usecase -- but like i said: would need to see the 
> configs to make a guess.
> 
> 
> -Hoss
> 


-- 
Colin Vipurs
Server Team Lead
Shazam Entertainment Ltd

Re: Spellchecking Escaped Queries

2011-04-04 Thread Colin Vipurs
Apologies for the duplicate post; I'm having Evolution problems.


> Thanks Chris, 
> 
> The field used for indexing and spellcheck is the same and is
> configured like this:
> 
> 
> <fieldtype name="..." class="solr.TextField">
>    <analyzer>
>       <!-- tokenizer definition stripped by the mail archive -->
>       <filter class="solr.SynonymFilterFactory" synonyms="..." ignoreCase="true" expand="true"/>
>       <filter class="solr.PatternReplaceFilterFactory"
>               pattern="^([^!]+)\!([^!]+)$"
>               replacement="$1i$2"
>               replace="all"/>
>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>               generateNumberParts="1" catenateWords="1" catenateNumbers="0"
>               catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
>    </analyzer>
> </fieldtype>
> 
> I use the pattern replace filter to swap all instances of "!" within a
> word to "i".  I know this part is working correctly as performing a
> search works correctly.
> 
> The spellcheck is initialized like this:
> 
> <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
>    <str name="queryAnalyzerFieldType">title</str>
>    <lst name="spellchecker">
>       <str name="name">default</str>
>       <str name="field">searchfield</str>
>       <str name="spellcheckIndexDir">./spellchecker</str>
>       <str name="buildOnCommit">false</str>
>    </lst>
> </searchComponent>
> 
> 
> And it is attached as a component to my search handler.
> 
> Thanks,
> 
> Colin
> 
> 
> > : I'm having an issue performing a spellcheck on some information and
> > : search of the archive isn't helping.
> > 
> > For this type of question, there's not much feedback anyone can offer w/o
> > knowing exactly what analyzers you have configured for the various
> > fieldtypes (both the field you index/search and the fieldtype used for
> > spellchecking)
> > 
> > it's also fairly critical to know how you have the spellcheck component 
> > configured.
> > 
> > off the cuff: i'd guess that maybe WordDelimiterFilter is being used in a 
> > wonky way given your usecase -- but like i said: would need to see the 
> > configs to make a guess.
> > 
> > 
> > -Hoss
> > 
> 
> 
> -- 
> Colin Vipurs
> Server Team Lead
> Shazam Entertainment Ltd


-- 
Colin Vipurs
Server Team Lead
Shazam Entertainment Ltd

Re: Using MLT feature

2011-04-04 Thread Markus Jelsma
http://wiki.apache.org/solr/Deduplication

On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
> Hi,
> 
> The idea is: don't index if something similar (headline+bodytext) exists for
> the same exact medianame.
> 
> Do you mean I would need to index the doc first (maybe in a temp index)
> and then use the MLT feature to find similar docs before adding to final
> index?
> 
> Thanks,
> Frederico
> 
> 
> -Original Message-
> From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
> Sent: segunda-feira, 4 de Abril de 2011 10:22
> To: solr-user@lucene.apache.org
> Subject: Re: Using MLT feature
> 
> Do you want to not index if something similar? Or don't index if exact.
> I would look into a hash code of the document if you don't want to index
> exact.Similar though, I think has to be based off a document in the
> index.
> 
> On Apr 4, 2011, at 5:16, Frederico Azeiteiro wrote:
> > Hi,
> > 
> > 
> > 
> > I would like to hear your opinion about the MLT feature and whether it's a
> > good solution for what I need to implement.
> > 
> > My index has fields like: headline, body and medianame.
> > 
> > What I need to do is, before adding a new doc, verify whether a similar doc
> > already exists for this media.
> > 
> > My idea is to use the MoreLikeThisHandler
> > (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following way:
> > 
> > For each new doc, perform an MLT search with q=medianame and
> > stream.body=headline+bodytext.
> > 
> > If no similar docs are found, then I can safely add the doc.
> > 
> > Is this feasible using the MLT handler? Is it a good approach? Is there
> > a better way to perform this comparison?
> > 
> > 
> > 
> > Thank you for your help.
> > 
> > 
> > 
> > Best regards,
> > 
> > 
> > 
> > Frederico Azeiteiro

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


help with Jetty log message

2011-04-04 Thread Matthieu Huin

Greetings all,

I am currently using Solr as the backend behind a log aggregation and
search system my team is developing. All was well and good until I
noticed a test server crashing quite unexpectedly. We'd like to dig more
into the incident, but none of us has much experience with Jetty crash
logs - not to mention that our Java is very rusty.


The crash log is attached.

Could anyone help us understand what went wrong there?

Also, would it be possible and/or wise to automatically restart the
server in case of such a crash?

Thanks for your help. If you need any extra info about the case, do not
hesitate to ask!



Matthieu Huin


#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7f051a618105, pid=5033, tid=1092958544
#
# JRE version: 6.0_18-b18
# Java VM: OpenJDK 64-Bit Server VM (16.0-b13 mixed mode linux-amd64 )
# Derivative: IcedTea6 1.8.3
# Distribution: Debian GNU/Linux 5.0.8 (lenny), package 6b18-1.8.3-2~lenny1
# Problematic frame:
# V  [libjvm.so+0x5dc105]
#
# If you would like to submit a bug report, please include
# instructions how to reproduce the bug and visit:
#   http://icedtea.classpath.org/bugzilla
#

---  T H R E A D  ---

Current thread (0x0207d800):  GCTaskThread [stack: 0x41153000,0x41254000] [id=5036]

siginfo:si_signo=SIGSEGV: si_errno=0, si_code=128 (), si_addr=0x

Registers:
RAX=0x, RBX=0x7f04acba89a8, RCX=0x020d85d8, RDX=0x0030002e00300031
RSP=0x41252eb0, RBP=0x41252f20, RSI=0x, RDI=0x0030002e00300041
R8 =0x04a3523e2a33, R9 =0x7f051aae7188, R10=0x0001, R11=0x41252da0
R12=0x7f04f15b4368, R13=0x0035003000360034, R14=0x41252f50, R15=0x020d8070
RIP=0x7f051a618105, EFL=0x00010246, CSGSFS=0x0033, ERR=0x
  TRAPNO=0x000d

Top of Stack: (sp=0x41252eb0)
0x41252eb0:   04a3523e2a01 7f051aae7188
0x41252ec0:   04a00c960001e082 0004
0x41252ed0:   04a3523e2a33 0400
0x41252ee0:   04a3523e2a32 
0x41252ef0:   4097fb58 7f04acba89a8
0x41252f00:   020d8020 
0x41252f10:   41252f50 41252f5c
0x41252f20:   41252f90 7f051a61cb78
0x41252f30:   02196810 020d8070
0x41252f40:   0207d800 7f051a5a6f3b
0x41252f50:   7f04acba89a8 7b6e9b2f0207cf00
0x41252f60:   41252f90 02196810
0x41252f70:   0207d800 7f051a75254f
0x41252f80:    0207da90
0x41252f90:   41253070 7f051a3b4a10
0x41252fa0:   0207d800 41252fd0
0x41252fb0:   41253030 0207dac0
0x41252fc0:   0207dad0 0207dea8
0x41252fd0:   0207d800 0207deb0
0x41252fe0:   0207dee0 0207def0
0x41252ff0:   0207e2c8 41253000
0x41253000:   0207d800 0207deb0
0x41253010:   0207dee0 0207def0
0x41253020:   0207e2c8 0207e2d0
0x41253030:    
0x41253040:   0207ec30 
0x41253050:   0207ec30 0207eb50
0x41253060:   0207d800 1000
0x41253070:   41253140 7f051a5ce090
0x41253080:    
0x41253090:    
0x412530a0:     

Instructions: (pc=0x7f051a618105)
0x7f051a6180f5:   f6 0f 85 d4 00 00 00 49 8b 54 24 08 48 8d 7a 10
0x7f051a618105:   8b 4f 08 83 f9 00 0f 8e e4 00 00 00 89 c8 c1 f8 

Stack: [0x41153000,0x41254000],  sp=0x41252eb0,  free space=3ff0018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x5dc105]
V  [libjvm.so+0x5e0b78]
V  [libjvm.so+0x378a10]
V  [libjvm.so+0x592090]


---  P R O C E S S  ---

Java Threads: ( => current thread )
  0x0540f000 JavaThread "btpool0-12" [_thread_blocked, id=6839, stack(0x42623000,0x42724000)]
  0x0234a800 JavaThread "btpool0-11" [_thread_blocked, id=6796, stack(0x42522000,0x42623000)]
  0x02754000 JavaThread "btpool0-10" [_thread_blocked, id=6761, stack(0x42421000,0x42522000)]
  0x0246e800 JavaThread "TimeLimitedCollector timer thread" daemon [_thread_blocked, id=5307, stack(0x4232,0x42421000)]
  0x02317800 JavaThread "MultiThreadedHttpConnectionManager cleanup" daemon [_thread_blocked, id=5306, stack(0x40261000,

Re: help with Jetty log message

2011-04-04 Thread Upayavira
This is not Solr crashing, per se, it is your JVM. I personally haven't
generally had much success debugging these kinds of failure - see
whether it happens again, and if it does, try updating your
JVM/switching to another/etc.

Anyone have better advice?

Upayavira

On Mon, 04 Apr 2011 11:59 +0200, "Matthieu Huin" wrote:
> Greetings all,
> 
> I am currently using solr as the backend behind a log aggregation and 
> search system my team is developing. All was well and good until I 
> noticed a test server crashing quite unexpectedly. We'd like to dig more 
> into the incident but none of us has much experience with Jetty crash 
> logs - not to mention that our Java is very rusty.
> 
> The crash log is attached.
> 
> Could anyone help us understand what went wrong there?
> 
> Also, would it be possible and/or wise to automatically restart the
> server in case of such a crash?
> 
> Thanks for your help. If you need any extra info about the case, do not
> hesitate to ask!
> 
> 
> Matthieu Huin
> 
> 
> 
> Email had 1 attachment:
> + hs_err_pid5033.log
>   26k (text/x-log)
--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source



RE: Using MLT feature

2011-04-04 Thread Frederico Azeiteiro
Thank you Markus, it looks great.

But the wiki is not very detailed on this. 
Do you mean if I:

1. Create:

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureField">signature</str>
    <str name="fields">headline,body,medianame</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

2. Set this chain as the default for update requests.
3. Add a "signature" indexed field to my schema.
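
For step 3, a plain string field is enough, e.g.:

<field name="signature" type="string" indexed="true" stored="true" multiValued="false" />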

Then,
when adding a new doc to my index, it is only added if it is not considered a
duplicate (judged by the Lookup3Signature over the fields defined)?
All duplicates are ignored and not added to my index?
Is it as simple as that?

Does it work even if the medianame must be an exact match (not a similarity
match, as the headline and bodytext are)?

Thank you for your help,


Frederico Azeiteiro
Developer
 


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: segunda-feira, 4 de Abril de 2011 10:48
To: solr-user@lucene.apache.org
Subject: Re: Using MLT feature

http://wiki.apache.org/solr/Deduplication

On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
> Hi,
> 
> The idea is: don't index if something similar (headline+bodytext) exists for
> the same exact medianame.
> 
> Do you mean I would need to index the doc first (maybe in a temp index)
> and then use the MLT feature to find similar docs before adding to final
> index?
> 
> Thanks,
> Frederico
> 
> 
> -Original Message-
> From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
> Sent: segunda-feira, 4 de Abril de 2011 10:22
> To: solr-user@lucene.apache.org
> Subject: Re: Using MLT feature
> 
> Do you want to not index if something similar? Or don't index if exact.
> I would look into a hash code of the document if you don't want to index
> exact.Similar though, I think has to be based off a document in the
> index.
> 
> On Apr 4, 2011, at 5:16, Frederico Azeiteiro wrote:
> > Hi,
> > 
> > 
> > 
> > I would like to hear your opinion about the MLT feature and whether it's a
> > good solution for what I need to implement.
> > 
> > My index has fields like: headline, body and medianame.
> > 
> > What I need to do is, before adding a new doc, verify whether a similar doc
> > already exists for this media.
> > 
> > My idea is to use the MoreLikeThisHandler
> > (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following way:
> > 
> > For each new doc, perform an MLT search with q=medianame and
> > stream.body=headline+bodytext.
> > 
> > If no similar docs are found, then I can safely add the doc.
> > 
> > Is this feasible using the MLT handler? Is it a good approach? Is there
> > a better way to perform this comparison?
> > 
> > 
> > 
> > Thank you for your help.
> > 
> > 
> > 
> > Best regards,
> > 
> > 
> > 
> > Frederico Azeiteiro

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Faceting on multivalued field

2011-04-04 Thread Erick Erickson
I hadn't thought that far. But if you can change your query to
sum the fields that'd be easiest.

Mostly, I was thinking that since that information is known up
front, storing it with the document makes sense and would
avoid costly Solr work.

I don't know of any transformers that would do this for you,
it's almost an introspection transformation you'd want to do.

You could also consider using SolrJ/jdbc to query your database,
but I'd try for the SQL query first.

Best
Erick
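
A minimal sketch of that SQL approach in the DIH config (table and column names
assumed from the thread): compute the count in the parent entity's query so it
is stored with each post, e.g.

<entity name="posts" dataSource="jdbc"
        query="select p.*,
                      (select count(*) from comments c where c.post_id = p.post_id)
                        as comment_count
               from posts p">
  <field column="comment_count" name="comment_count" />
  <!-- nested comments entity as before -->
</entity>

With comment_count indexed as an integer field, no faceting is needed to display it.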

On Mon, Apr 4, 2011 at 4:18 AM, Kaushik Chakraborty wrote:

> Are you suggesting I change the DB query of the nested entity which fetches
> the comments (the query is in my post), or can something be done during
> indexing, e.g. using Transformers?
>
> Thanks,
> Kaushik
>
>
> On Mon, Apr 4, 2011 at 8:07 AM, Erick Erickson wrote:
>
> > Why not count them on the way in and just store that number along
> > with the original e-mail?
> >
> > Best
> > Erick
> >
> > On Sun, Apr 3, 2011 at 10:10 PM, Kaushik Chakraborty wrote:
> >
> > > Ok. My expectation was: since "comment_post_id" is a MultiValued field,
> > > it would appear multiple times (i.e. once for each comment), and hence when I
> > > facet on that field it would also give me the count of those many
> > > documents where comment_post_id appears.
> > >
> > > My requirement is getting the total for every document, i.e. finding the
> > > number of comments per post in the whole corpus. To explain it more clearly,
> > > I'm getting a result XML something like this:
> > >
> > > (result XML, tags stripped by the mail archive; the surviving values show
> > > post_id 46, title "Hello World", the multivalued fields
> > > comment_post_id = [46, 46] and comment_text = ["Hello - from World", "Hi"],
> > > a few unlabeled numeric fields, and a facet count of *1* for comment_post_id 46)
> > >
> > > I need the count to be 2 as the post 46 has 2 comments.
> > >
> > > What other way can I approach this?
> > >
> > > Thanks,
> > > Kaushik
> > >
> > >
> > > On Mon, Apr 4, 2011 at 4:29 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> > >
> > > > Hmmm, I think you're misunderstanding faceting. It's counting the
> > > > number of documents that have a particular value. So if you're
> > > > faceting on "comment_post_id", there is one and only one document
> > > > with that value (assuming that the comment_post_ids are unique).
> > > > Which is what's being reported. This will be quite expensive on a
> > > > large corpus, BTW.
> > > >
> > > > Is your task to show the totals for *every* document in your corpus
> or
> > > > just the ones in a display page? Because if the latter, your app
> could
> > > > just count up the number of elements in the XML returned for the
> > > > multiValued comments field.
> > > >
> > > > If that's not relevant, could you explain a bit more why you need
> this
> > > > count?
> > > >
> > > > Best
> > > > Erick
> > > >
> > > > On Sun, Apr 3, 2011 at 2:31 PM, Kaushik Chakraborty <kaych...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > My index contains a root entity "Post" and a child entity "Comments".
> > > > > Each post can have multiple comments. data-config.xml:
> > > > >
> > > > > <document>
> > > > >   <entity name="posts" dataSource="jdbc" query="...">
> > > > >     ...field mappings stripped by the mail archive...
> > > > >     <entity name="comments" query="select * from comments where post_id = ${posts.post_id}">
> > > > >       ...field mappings stripped by the mail archive...
> > > > >     </entity>
> > > > >   </entity>
> > > > > </document>
> > > > >
> > > > > The schema has all columns of the "comment" entity as "MultiValued" fields,
> > > > > and all fields are indexed & stored. My requirement is to count the number
> > > > > of comments for each post. The approach I'm taking is to query on "*:*" and
> > > > > facet the result on "comment_post_id" so that it gives the count of
> > > > > comments for that post.
> > > > >
> > > > > But I'm getting an incorrect result, e.g. if a post has 2 comments, the
> > > > > multivalued fields are populated alright but the facet count comes out
> > > > > as 1 (for that post_id). What else do I need to do?
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Kaushik
> > > > >
> > > >
> > >
> >
>


Re: Mongo REST interface and full data import

2011-04-04 Thread Erick Erickson
I'm having trouble seeing your schema files, etc. I don't
know if Gmail is stripping them on my end or whether
your e-mail client stripped them on upload - is anyone else seeing this?

But to your question: what version are you using? Solr 3.1
is the first version with JSON support for updates.

See: http://wiki.apache.org/solr/UpdateJSON

Best
Erick
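
For reference, the 3.1 JSON update handler expects an array of flat documents,
so the Mongo REST output shown above (the "rows" wrapper, the "_id"/"$oid"
objects and the bookkeeping keys) would need to be reshaped before posting.
A sketch of a valid request, reusing one of the rows above:

curl 'http://localhost:8983/solr/update/json?commit=true' \
     -H 'Content-type: application/json' \
     --data-binary '[{"isbn":"716739356","title":"Proteins","description":""}]'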

On Mon, Apr 4, 2011 at 5:31 AM, andrew_s  wrote:

> Hi everyone,
>
> I'm trying to make a simple data import from MongoDB into Solr using the REST
> interface.
>
> As a test example I've created a schema.xml like:
>
> <schema name="..." version="...">
>    ...field type and field definitions stripped by the mail archive...
>    <uniqueKey>isbn</uniqueKey>
>    <defaultSearchField>title</defaultSearchField>
> </schema>
>
> and a data-import.xml whose contents were entirely stripped by the mail archive.
>
> Unfortunately it's not working and I'm stuck at this point.
>
> Could you please advise how to correctly parse the JSON-format data?
>
>
> Data format looks like:
> {
>  "offset" : 0,
>  "rows": [
>{ "_id" : { "$oid" : "4d9829412c8bd1064400" }, "isbn" : "716739356",
> "title" : "Proteins", "description" : "" } ,
>{ "_id" : { "$oid" : "4d9829412c8bd1064401" }, "isbn" :
> "144433056X", "title" : "How to Assess Doctors and Health Professionals",
> "description" : "" } ,
>{ "_id" : { "$oid" : "4d9829412c8bd1064402" }, "isbn" :
> "1406208159", "title" : "Freestyle: Time Travel Guides: Pack B",
> "description" : "Takes you on a trip through history to visit the great
> ancient civilisations." } ,
>  "total_rows" : 3 ,
>  "query" : {} ,
>  "millis" : 0
> }
>
>
> Thank you.
>
>


RE: Faceting on multivalued field

2011-04-04 Thread Jonathan Rochkind
Is there a kind of function query that can count the number of values in a
multi-valued field on a given document? I do not know.

From: Erick Erickson [erickerick...@gmail.com]
Sent: Sunday, April 03, 2011 10:37 PM
To: solr-user@lucene.apache.org
Subject: Re: Faceting on multivalued field

Why not count them on the way in and just store that number along
with the original e-mail?

Best
Erick

On Sun, Apr 3, 2011 at 10:10 PM, Kaushik Chakraborty wrote:

> Ok. My expectation was: since "comment_post_id" is a MultiValued field,
> it would appear multiple times (i.e. once for each comment), and hence when I
> facet on that field it would also give me the count of those many
> documents where comment_post_id appears.
>
> My requirement is getting the total for every document, i.e. finding the number
> of comments per post in the whole corpus. To explain it more clearly, I'm
> getting a result XML something like this:
>
> (result XML, tags stripped by the mail archive; the surviving values show
> post_id 46, title "Hello World", the multivalued fields
> comment_post_id = [46, 46] and comment_text = ["Hello - from World", "Hi"],
> a few unlabeled numeric fields, and a facet count of *1* for comment_post_id 46)
>
> I need the count to be 2 as the post 46 has 2 comments.
>
> What other way can I approach this?
>
> Thanks,
> Kaushik
>
>
> On Mon, Apr 4, 2011 at 4:29 AM, Erick Erickson wrote:
>
> > Hmmm, I think you're misunderstanding faceting. It's counting the
> > number of documents that have a particular value. So if you're
> > faceting on "comment_post_id", there is one and only one document
> > with that value (assuming that the comment_post_ids are unique).
> > Which is what's being reported. This will be quite expensive on a
> > large corpus, BTW.
> >
> > Is your task to show the totals for *every* document in your corpus or
> > just the ones in a display page? Because if the latter, your app could
> > just count up the number of elements in the XML returned for the
> > multiValued comments field.
> >
> > If that's not relevant, could you explain a bit more why you need this
> > count?
> >
> > Best
> > Erick
> >
> > On Sun, Apr 3, 2011 at 2:31 PM, Kaushik Chakraborty wrote:
> >
> > > Hi,
> > >
> > > My index contains a root entity "Post" and a child entity "Comments".
> > > Each post can have multiple comments. data-config.xml:
> > >
> > > <document>
> > >   <entity name="posts" dataSource="jdbc" query="...">
> > >     ...field mappings stripped by the mail archive...
> > >     <entity name="comments" query="select * from comments where post_id = ${posts.post_id}">
> > >       ...field mappings stripped by the mail archive...
> > >     </entity>
> > >   </entity>
> > > </document>
> > >
> > > The schema has all columns of the "comment" entity as "MultiValued" fields,
> > > and all fields are indexed & stored. My requirement is to count the number
> > > of comments for each post. The approach I'm taking is to query on "*:*" and
> > > facet the result on "comment_post_id" so that it gives the count of
> > > comments for that post.
> > >
> > > But I'm getting an incorrect result, e.g. if a post has 2 comments, the
> > > multivalued fields are populated alright but the facet count comes out
> > > as 1 (for that post_id). What else do I need to do?
> > >
> > >
> > > Thanks,
> > > Kaushik
> > >
> >
>


Re: Solrj performance bottleneck

2011-04-04 Thread rahul
Hi All,

I just want to share some findings which clearly identified the reason
for our performance bottleneck. We had looked into several areas for
optimization, mostly directed at Solr configurations, stored fields,
highlighting, JVM, OS cache etc. But it turned out that the "main" culprit
was elsewhere. We were using the terms component for auto-suggestion, and
while examining the Firebug output for time taken during searches, we
detected that multiple requests were being spawned for autosuggestion as we
typed in the keyword to search (one request per character typed), and this
in turn cost us a great delay in getting the search results. Once we turned
auto-suggestion off, the performance was remarkably better and came down to
a second or so (compared to the 8-10 seconds registered earlier).

If anybody has suggestions/experience on how to leverage autosuggestion
without affecting search performance much, please do share them.

Once again, thanks for your inputs in analyzing our issues.

Thanks,



Re: Solrj performance bottleneck

2011-04-04 Thread Stefan Matheis
rahul,

On Mon, Apr 4, 2011 at 4:18 PM, rahul  wrote:
> if anybody has some suggestions/experience on how to leverage autosuggestion
> without affecting search performance much, please do share them.

we use JavaScript intervals for autosuggestion: regularly check the
value of the monitored input field and, if it changed, trigger a new
request. This covers both cases, slow-typing users and also
ten-finger guys (who type much faster). A new request for every
added character is indeed too much, even if your backend responds
within a few ms.

Regards
Stefan


RE: Using MLT feature

2011-04-04 Thread Frederico Azeiteiro
Hi again,
I guess I was wrong in my earlier post... there's no automated way to avoid
indexing the duplicate doc.

I guess I have 2 options:

1. Create a temp index with signatures and then have an app that, for each new
doc, verifies whether the signature exists in my primary index.
If not, add the article.

2. Before adding the doc, create a signature (using the same algorithm that
Solr uses) in my indexing app and then verify whether the signature exists before adding.

Am I thinking the right way here? :)

Thank you,
Frederico 
 


-Original Message-
From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com] 
Sent: segunda-feira, 4 de Abril de 2011 11:59
To: solr-user@lucene.apache.org
Subject: RE: Using MLT feature

Thank you Markus, it looks great.

But the wiki is not very detailed on this. 
Do you mean if I:

1. Create:

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureField">signature</str>
    <str name="fields">headline,body,medianame</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

2. Set this chain as the default for update requests.
3. Add a "signature" indexed field to my schema.

Then,
when adding a new doc to my index, it is only added if it is not considered a
duplicate (judged by the Lookup3Signature over the fields defined)?
All duplicates are ignored and not added to my index?
Is it as simple as that?

Does it work even if the medianame must be an exact match (not a similarity
match, as the headline and bodytext are)?

Thank you for your help,


Frederico Azeiteiro
Developer
 


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: segunda-feira, 4 de Abril de 2011 10:48
To: solr-user@lucene.apache.org
Subject: Re: Using MLT feature

http://wiki.apache.org/solr/Deduplication

On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
> Hi,
> 
> The idea is: don't index if something similar (headline+bodytext) exists for
> the same exact medianame.
> 
> Do you mean I would need to index the doc first (maybe in a temp index)
> and then use the MLT feature to find similar docs before adding to final
> index?
> 
> Thanks,
> Frederico
> 
> 
> -Original Message-
> From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
> Sent: segunda-feira, 4 de Abril de 2011 10:22
> To: solr-user@lucene.apache.org
> Subject: Re: Using MLT feature
> 
> Do you want to not index if something similar? Or don't index if exact.
> I would look into a hash code of the document if you don't want to index
> exact.Similar though, I think has to be based off a document in the
> index.
> 
> On Apr 4, 2011, at 5:16, Frederico Azeiteiro wrote:
> > Hi,
> > 
> > 
> > 
> > > I would like to hear your opinion about the MLT feature and whether it's a
> > > good solution for what I need to implement.
> > > 
> > > My index has fields like: headline, body and medianame.
> > > 
> > > What I need to do is, before adding a new doc, verify whether a similar doc
> > > already exists for this media.
> > > 
> > > My idea is to use the MoreLikeThisHandler
> > > (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following way:
> > > 
> > > For each new doc, perform an MLT search with q=medianame and
> > > stream.body=headline+bodytext.
> > > 
> > > If no similar docs are found, then I can safely add the doc.
> > > 
> > > Is this feasible using the MLT handler? Is it a good approach? Is there
> > > a better way to perform this comparison?
> > 
> > 
> > 
> > Thank you for your help.
> > 
> > 
> > 
> > Best regards,
> > 
> > 
> > 
> > Frederico Azeiteiro

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Using MLT feature

2011-04-04 Thread Markus Jelsma

> Hi again,
> I guess I was wrong in my earlier post... there's no automated way to avoid
> indexing the duplicate doc.

Yes there is: try setting overwriteDupes to true and documents yielding the same
signature will be overwritten. If you need both fuzzy and exact matching,
then add a second update processor inside the chain and create another
signature field.
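
A sketch of that two-processor idea (field names hypothetical): an exact
signature over medianame next to a fuzzy TextProfileSignature over the text
fields, each written to its own field:

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">true</bool>
    <str name="signatureField">exact_signature</str>
    <str name="fields">medianame</str>
    <str name="signatureClass">org.apache.solr.update.processor.MD5Signature</str>
  </processor>
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">true</bool>
    <str name="signatureField">fuzzy_signature</str>
    <str name="fields">headline,body</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Keep in mind that with overwriteDupes on both processors a document is replaced
when either signature matches; whether the two should instead be folded into one
signature depends on how strictly "similar text for the same exact medianame"
must hold.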

> 
> I guess I have 2 options:
> 
> 1. Create a temp index with signatures and then have an app that, for each
> new doc, verifies whether the signature exists in my primary index. If not,
> add the article.
> 
> 2. Before adding the doc, create a signature (using the same algorithm that
> Solr uses) in my indexing app and then verify whether the signature exists
> before adding.
> 
> Am I thinking the right way here? :)
> 
> Thank you,
> Frederico
>  
> 
> 
> -Original Message-
> From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com]
> Sent: segunda-feira, 4 de Abril de 2011 11:59
> To: solr-user@lucene.apache.org
> Subject: RE: Using MLT feature
> 
> Thank you Markus, it looks great.
> 
> But the wiki is not very detailed on this.
> Do you mean if I:
> 
> 1. Create:
> 
> <updateRequestProcessorChain name="dedupe">
>   <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
>     <bool name="enabled">true</bool>
>     <bool name="overwriteDupes">false</bool>
>     <str name="signatureField">signature</str>
>     <str name="fields">headline,body,medianame</str>
>     <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
> 
> 2. Set this chain as the default for update requests.
> 3. Add a "signature" indexed field to my schema.
> 
> Then,
> when adding a new doc to my index, it is only added if it is not considered a
> duplicate (judged by the Lookup3Signature over the fields defined)? All duplicates
> are ignored and not added to my index?
> Is it as simple as that?
> 
> Does it work even if the medianame must be an exact match (not a similarity
> match, as the headline and bodytext are)?
> 
> Thank you for your help,
> 
> 
> Frederico Azeiteiro
> Developer
>  
> 
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: segunda-feira, 4 de Abril de 2011 10:48
> To: solr-user@lucene.apache.org
> Subject: Re: Using MLT feature
> 
> http://wiki.apache.org/solr/Deduplication
> 
> On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
> > Hi,
> > 
> > The idea is: don't index if something similar (headline+bodytext) exists for
> > the same exact medianame.
> > 
> > Do you mean I would need to index the doc first (maybe in a temp index)
> > and then use the MLT feature to find similar docs before adding to final
> > index?
> > 
> > Thanks,
> > Frederico
> > 
> > 
> > -Original Message-
> > From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
> > Sent: segunda-feira, 4 de Abril de 2011 10:22
> > To: solr-user@lucene.apache.org
> > Subject: Re: Using MLT feature
> > 
> > Do you want to not index if something similar? Or don't index if exact.
> > I would look into a hash code of the document if you don't want to index
> > exact.Similar though, I think has to be based off a document in the
> > index.
> > 
> > On Apr 4, 2011, at 5:16, Frederico Azeiteiro wrote:
> > > Hi,
> > > 
> > > 
> > > 
> > > I would like to hear your opinion about the MLT feature and whether it's a
> > > good solution for what I need to implement.
> > > 
> > > My index has fields like: headline, body and medianame.
> > > 
> > > What I need to do is, before adding a new doc, verify whether a similar doc
> > > already exists for this media.
> > > 
> > > My idea is to use the MoreLikeThisHandler
> > > (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following way:
> > > 
> > > For each new doc, perform an MLT search with q=medianame and
> > > stream.body=headline+bodytext.
> > > 
> > > If no similar docs are found, then I can safely add the doc.
> > > 
> > > Is this feasible using the MLT handler? Is it a good approach? Is there
> > > a better way to perform this comparison?
> > > 
> > > 
> > > 
> > > Thank you for your help.
> > > 
> > > 
> > > 
> > > Best regards,
> > > 
> > > 
> > > 
> > > Frederico Azeiteiro


Re: Solrj performance bottleneck

2011-04-04 Thread openvictor Open
Dear Rahul,

Stefan has the right solution: the autosuggest must be throttled both in
JavaScript and in your backend. For JavaScript there are some really nice tools
to do that, such as jQuery UI, which implements an autosuggest with a tunable
delay. It also has highlighting, you can add additional information, etc.
It is actually quite impressive. Here is the address:
http://jqueryui.com/demos/autocomplete/#remote-jsonp. It's open source, so
you can just copy what they have done or see the method they used.
For the backend, limit the number of requests per second per IP or session,
and/or cache results. As for caching, Solr normally caches common requests,
but I don't know about the terms component.

Hope this helps you!

Victor

2011/4/4 Stefan Matheis 

> rahul,
>
> On Mon, Apr 4, 2011 at 4:18 PM, rahul  wrote:
> > if anybody has some suggestions/experience on how to leverage
> autosuggestion
> > without affecting search performance much, please do share them.
>
> we use JavaScript intervals for autosuggestion: regularly check the
> value of the monitored input field and, if it changed, trigger a new
> request. This covers both cases, slow-typing users and also
> ten-finger guys (who type much faster). A new request for every
> added character is indeed too much, even if your backend responds
> within a few ms.
>
> Regards
> Stefan
>


dismax "boost query" not useful?

2011-04-04 Thread Smiley, David W.
As I was reviewing the boosting capabilities of the dismax & edismax query 
parsers, it's not clear to me that the "boost query" has much use.  The value 
of boost functions, particularly with a multiplied boost that edismax supports, 
is very clear -- there are a variety of uses.  But I can't think of a useful 
case when I want to both *add* a component to the ultimate score, and for that 
component to be a non-function query (i.e. use the lucene query parser).

Also, you can basically get the same effect as a boost query via boost
functions: bf=query($mybq)&mybq=...  and note you will probably multiply
this via product(10,query($mybq)) to scale it to an appropriate number.
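
As a concrete sketch (field and value hypothetical), an additive boost query such as

  q=ipod&defType=dismax&bq=category:premium^10

could instead be routed through the boost-function path as

  q=ipod&defType=dismax&bf=product(10,query($mybq))&mybq=category:premium

which adds ten times the score of the mybq query to each matching document's score.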

~ David Smiley

Problems indexing very large set of documents

2011-04-04 Thread Brandon Waterloo
 Hey everybody,

I've been running into some issues indexing a very large set of documents.
There are about 4000 PDF files, ranging in size from 160MB down to 10KB. Obviously
this is a big task for Solr. I have a PHP script that iterates over the
directory and uses PHP cURL to ask Solr to index each file. For now, commit
is set to false to speed up the indexing, and I'm assuming that Solr should be
auto-committing as necessary. I'm using the default solrconfig.xml file
included in apache-solr-1.4.1\example\solr\conf. Once all the documents have
been processed, the PHP script asks Solr to commit.

The main problem is that after a few thousand documents (around 2000 last time 
I tried), nearly every document begins causing Java exceptions in Solr:

Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.pdf.PDFParser@11d329d
at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at 
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at 
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at 
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal 
IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
... 23 more
Caused by: java.io.IOException: expected='endobj' firstReadAttempt='' 
secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
... 25 more

As far as I know there's nothing special about these documents, so I'm wondering
if it's not properly autocommitting. What would be appropriate settings in
solrconfig.xml for this particular application? I'd like it to autocommit as
soon as it needs to, but no more often than that, for the sake of efficiency.
Obviously it takes long enough to index 4000 documents and there's no reason to
make it take longer. Thanks for your help!
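
On the autocommit assumption above: the stock solrconfig.xml in 1.4.1 ships with
the <autoCommit> block commented out, so Solr does not commit anything until a
client asks it to. A sketch of enabling it inside the existing <updateHandler>
element (the thresholds are illustrative):

<autoCommit>
  <maxDocs>1000</maxDocs>   <!-- commit after this many buffered docs -->
  <maxTime>60000</maxTime>  <!-- or after this many ms, whichever comes first -->
</autoCommit>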

~Brandon Waterloo


Re: Problems indexing very large set of documents

2011-04-04 Thread Anuj Kumar
This is related to Apache Tika. Which version are you using?
Please see this thread for more details:
http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

Hope it helps.

Regards,
Anuj

On Mon, Apr 4, 2011 at 11:30 PM, Brandon Waterloo <brandon.water...@matrix.msu.edu> wrote:

>  Hey everybody,
>
> I've been running into some issues indexing a very large set of documents.
>  There's about 4000 PDF files, ranging in size from 160MB to 10KB.
>  Obviously this is a big task for Solr.  I have a PHP script that iterates
> over the directory and uses PHP cURL to query Solr to index the files.  For
> now, commit is set to false to speed up the indexing, and I'm assuming that
> Solr should be auto-committing as necessary.  I'm using the default
> solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once
> all the documents have been finished the PHP script queries Solr to commit.
>
> The main problem is that after a few thousand documents (around 2000 last
> time I tried), nearly every document begins causing Java exceptions in Solr:
>
> Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>     at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>     at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>     at org.mortbay.jetty.Server.handle(Server.java:285)
>     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>     at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
>     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
>     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
>     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>     at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>     at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
>     ... 23 more
> Caused by: java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
>     at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
>     at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
>     at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
>     at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
>     ... 25 more
>
> As far as I know there's nothing special about these documents so I'm
> wondering if it's not properly autocommitting.  What would be appropriate
> settings in solrconfig.xml for this particular application?  I'd like it to
> autocommit as soon as it needs to but no more often than that for the sake
> of efficiency.  Obviously it takes long enough to index 4000 documents and
> there's no reason to make it take longer.  Thanks for your help!
>
> ~Brandon Waterloo
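On the autocommit question itself: the stock 1.4.1 example solrconfig.xml ships
with the autoCommit block commented out, so nothing is committed until the
explicit commit at the end. A minimal sketch of enabling it (the thresholds
here are illustrative, not recommended values):

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- commit after 1000 buffered docs or 60 seconds,
           whichever comes first -->
      <autoCommit>
        <maxDocs>1000</maxDocs>
        <maxTime>60000</maxTime>
      </autoCommit>
    </updateHandler>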

RE: Problems indexing very large set of documents

2011-04-04 Thread Brandon Waterloo
Looks like I'm using Tika 0.4:
apache-solr-1.4.1/contrib/extraction/lib/tika-core-0.4.jar
.../tika-parsers-0.4.jar

~Brandon Waterloo


From: Anuj Kumar [anujs...@gmail.com]
Sent: Monday, April 04, 2011 2:12 PM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

This is related to Apache TIKA. Which version are you using?
Please see this thread for more details-
http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

Hope it helps.

Regards,
Anuj

On Mon, Apr 4, 2011 at 11:30 PM, Brandon Waterloo <
brandon.water...@matrix.msu.edu> wrote:

>  Hey everybody,
>
> [rest of original message and stack trace snipped -- quoted in full above]

Re: Problems indexing very large set of documents

2011-04-04 Thread Anuj Kumar
In the log messages, are you able to locate the file at which it fails? It
looks like Tika is unable to parse one of your PDF files. We need to hunt
that one down.

Regards,
Anuj

On Mon, Apr 4, 2011 at 11:57 PM, Brandon Waterloo <
brandon.water...@matrix.msu.edu> wrote:

> Looks like I'm using Tika 0.4:
> apache-solr-1.4.1/contrib/extraction/lib/tika-core-0.4.jar
> .../tika-parsers-0.4.jar
>
> ~Brandon Waterloo
>
> 
> From: Anuj Kumar [anujs...@gmail.com]
> Sent: Monday, April 04, 2011 2:12 PM
> To: solr-user@lucene.apache.org
> Cc: Brandon Waterloo
> Subject: Re: Problems indexing very large set of documents
>
> This is related to Apache TIKA. Which version are you using?
> Please see this thread for more details-
> http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
>
> Hope it helps.
>
> Regards,
> Anuj
>
> On Mon, Apr 4, 2011 at 11:30 PM, Brandon Waterloo <
> brandon.water...@matrix.msu.edu> wrote:
>
> >  Hey everybody,
> >
> > [rest of original message and stack trace snipped -- quoted in full earlier in the thread]
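A quick way to hunt the offending file down, outside of Solr, is to run Tika
directly over the directory and log which PDFs throw. A minimal sketch,
assuming the Tika 0.x three-argument parse() signature; the class name,
directory argument, and the DefaultHandler choice are illustrative:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.xml.sax.helpers.DefaultHandler;

    public class FindBadPdf {
        public static void main(String[] args) throws Exception {
            // args[0] = directory containing the PDFs
            for (File f : new File(args[0]).listFiles()) {
                InputStream in = new FileInputStream(f);
                try {
                    // discard the extracted text; we only care whether parsing throws
                    new AutoDetectParser().parse(in, new DefaultHandler(), new Metadata());
                } catch (Exception e) {
                    System.out.println("FAILED: " + f.getName() + " -- " + e);
                } finally {
                    in.close();
                }
            }
        }
    }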

Re: Matching the beginning of a word within a term

2011-04-04 Thread Brian Lamb
Thank you both for your replies. It looks like EdgeNGramFilter will do the
job nicely. Time to reindex...again.

On Fri, Apr 1, 2011 at 8:31 AM, Jan Høydahl  wrote:

> Check out
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
> Don't know if it works with phrases though
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> On 31. mars 2011, at 16.49, Brian Lamb wrote:
>
> > No, I don't really want to break down the words into subwords. In the
> > example I provided, I would not want "kind" to match either record
> because
> > it is not at the beginning of the word even though "kind" appears in both
> > records as part of a word.
> >
> > On Wed, Mar 30, 2011 at 4:42 PM, lboutros  wrote:
> >
> >> Do you want to tokenize subwords based on dictionaries? A bit like
> >> disagglutination of German words?
> >>
> >> If so, something like this could help:
> >> DictionaryCompoundWordTokenFilter
> >>
> >> http://search.lucidimagination.com/search/document/CDRG_ch05_5.8.8
> >>
> >> Ludovic
> >>
> >>
> >>
> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html
> >>
> >> 2011/3/30 Brian Lamb [via Lucene] <
> >> ml-node+2754668-300063934-383...@n3.nabble.com>
> >>
> >>> Hi all,
> >>>
> >>> I have a field set up like this:
> >>>
> >>> <field name="common_names" type="text" multiValued="true" indexed="true"
> >>> stored="true" required="false" />
> >>>
> >>> And I have some records:
> >>>
> >>> RECORD1
> >>> 
> >>> companion to mankind
> >>> pooch
> >>> 
> >>>
> >>> RECORD2
> >>> 
> >>> companion to womankind
> >>> man's worst enemy
> >>> 
> >>>
> >>> I would like to write a query that will match the beginning of a word
> >>> within
> >>> the term. Here is the query I would use as it exists now:
> >>>
> >>>
> >>> http://localhost:8983/solr/search/?q=*:*&fq={!q.op=AND%20df=common_names}"companion man"~10
> >>>
> >>> In the above example, I would want to return only RECORD1.
> >>>
> >>> The query as it exists right now is designed to match only records where
> >>> both words are present in the same term. So if I changed man to mankind in
> >>> the query, RECORD1 would be returned.
> >>>
> >>> Even though the phrases companion and man exist in the same term in
> >>> RECORD2,
> >>> I do not want RECORD2 to be returned because 'man' is not at the
> >> beginning
> >>> of the word.
> >>>
> >>> How can I achieve this?
> >>>
> >>> Thanks,
> >>>
> >>> Brian Lamb
> >>>
> >>
> >> -
> >> Jouve
> >> France.
> >> --
> >> View this message in context:
> >>
> http://lucene.472066.n3.nabble.com/Matching-the-beginning-of-a-word-within-a-term-tp2754668p2755561.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
>
>
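For the record, the EdgeNGramFilterFactory route looks roughly like this in
schema.xml. A minimal sketch only: the type name, gram sizes, and the rest of
the analyzer chain are illustrative, not a drop-in config:

    <fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- "mankind" is indexed as man, mank, manki, ..., mankind, so a
             query term like "man" matches only at the start of a word -->
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
      </analyzer>
      <analyzer type="query">
        <!-- no n-gramming at query time: the raw term is matched
             against the indexed prefixes -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>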


Re: Matching on a multi valued field

2011-04-04 Thread Brian Lamb
I just noticed Juan's response and I find that I am encountering that very
issue in a few cases. Boosting is a good way to put the more relevant
results at the top, but is it possible to have only the correct results
returned?

On Wed, Mar 30, 2011 at 11:51 AM, Brian Lamb
wrote:

> Thank you all for your responses. The field had already been set up with
> positionIncrementGap=100 so I just needed to add in the slop.
>
>
> On Tue, Mar 29, 2011 at 6:32 PM, Juan Pablo Mora wrote:
>
>> >> A multiValued field
>> >> is actually a single field with all data separated with
>> positionIncrement.
>> >> Try setting that value high enough and use a PhraseQuery.
>>
>>
>> That is true but you cannot do things like:
>>
>> q="bar* foo*"~10 with default query search.
>>
>> and if you use dismax you will have the same problems with multivalued
>> fields. Imagine the situation:
>>
>> Doc1:
>>field A: ["foo bar","dooh"] 2 values
>>
>> Doc2:
>>field A: ["bar dooh", "whatever"] Another 2 values
>>
>> the query:
>>qt=dismax & qf= fieldA & q = ( bar dooh )
>>
>> will return both Doc1 and Doc2. The only thing you can do in this
>> situation is boost phrase query in Doc2 with parameter pf in order to get
>> Doc2 in the first position of the results:
>>
>> pf = fieldA^1
>>
>>
>> Thanks,
>> JP.
>>
>>
>> On 29/03/2011, at 23:14, Markus Jelsma wrote:
>>
>> > orly, all replies came in while sending =)
>> >
>> >> Hi,
>> >>
>> >> Your filter query is looking for a match of "man's friend" in a single
>> >> field. Regardless of analysis of the common_names field, all terms are
>> >> present in the common_names field of both documents. A multiValued
>> field
>> >> is actually a single field with all data separated with
>> positionIncrement.
>> >> Try setting that value high enough and use a PhraseQuery.
>> >>
>> >> That should work
>> >>
>> >> Cheers,
>> >>
>> >>> Hi all,
>> >>>
>> >>> I have a field set up like this:
>> >>>
>> >>> > indexed="true"
>> >>> stored="true" required="false" />
>> >>>
>> >>> And I have some records:
>> >>>
>> >>> RECORD1
>> >>> 
>> >>>
>> >>>  man's best friend
>> >>>  pooch
>> >>>
>> >>> 
>> >>>
>> >>> RECORD2
>> >>> 
>> >>>
>> >>>  man's worst enemy
>> >>>  friend to no one
>> >>>
>> >>> 
>> >>>
>> >>> Now if I do a search such as:
>> >>> http://localhost:8983/solr/search/?q=*:*&fq={!q.op=AND
>> >>> df=common_names}man's friend
>> >>>
>> >>> Both records are returned. However, I only want RECORD1 returned. I
>> >>> understand why RECORD2 is returned but how can I structure my query so
>> >>> that only RECORD1 is returned?
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Brian Lamb
>>
>>
>
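To make the slop trick from this thread concrete: with
positionIncrementGap="100" on the field type, terms from different values of a
multiValued field sit at least 100 positions apart, so a phrase query with
slop below the gap can only match within a single value. A minimal sketch
(the type name and analyzer chain are illustrative; the field and query follow
the thread's example):

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="common_names" type="text" multiValued="true"
           indexed="true" stored="true" required="false"/>

    http://localhost:8983/solr/search/?q=*:*&fq={!df=common_names}"man's friend"~99

With slop 99 (below the gap of 100), "man's" and "friend" must fall inside one
value, so RECORD1 matches and RECORD2 does not.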


Re: Matching on a multi valued field

2011-04-04 Thread Juan Pablo Mora
I have not found any solution to this. The only option is to denormalize your
multivalued field into several docs, each with a single-valued field.

Try ComplexPhraseQueryParser (https://issues.apache.org/jira/browse/SOLR-1604)
if you are using Solr 1.4.


On 04/04/2011, at 21:21, Brian Lamb wrote:

I just noticed Juan's response and I find that I am encountering that very
issue in a few cases. Boosting is a good way to put the more relevant results
at the top, but is it possible to have only the correct results returned?

On Wed, Mar 30, 2011 at 11:51 AM, Brian Lamb
<brian.l...@journalexperts.com> wrote:
Thank you all for your responses. The field had already been set up with 
positionIncrementGap=100 so I just needed to add in the slop.


On Tue, Mar 29, 2011 at 6:32 PM, Juan Pablo Mora <jua...@informa.es> wrote:
>> A multiValued field
>> is actually a single field with all data separated with positionIncrement.
>> Try setting that value high enough and use a PhraseQuery.


That is true but you cannot do things like:

q="bar* foo*"~10 with default query search.

and if you use dismax you will have the same problems with multivalued fields. 
Imagine the situation:

Doc1:
   field A: ["foo bar","dooh"] 2 values

Doc2:
   field A: ["bar dooh", "whatever"] Another 2 values

the query:
   qt=dismax & qf= fieldA & q = ( bar dooh )

will return both Doc1 and Doc2. The only thing you can do in this situation is 
boost phrase query in Doc2 with parameter pf in order to get Doc2 in the first 
position of the results:

pf = fieldA^1


Thanks,
JP.


On 29/03/2011, at 23:14, Markus Jelsma wrote:

> orly, all replies came in while sending =)
>
>> Hi,
>>
>> Your filter query is looking for a match of "man's friend" in a single
>> field. Regardless of analysis of the common_names field, all terms are
>> present in the common_names field of both documents. A multiValued field
>> is actually a single field with all data separated with positionIncrement.
>> Try setting that value high enough and use a PhraseQuery.
>>
>> That should work
>>
>> Cheers,
>>
>>> Hi all,
>>>
>>> I have a field set up like this:
>>>
>>> <field name="common_names" type="text" multiValued="true" indexed="true"
>>> stored="true" required="false" />
>>>
>>> And I have some records:
>>>
>>> RECORD1
>>> 
>>>
>>>  man's best friend
>>>  pooch
>>>
>>> 
>>>
>>> RECORD2
>>> 
>>>
>>>  man's worst enemy
>>>  friend to no one
>>>
>>> 
>>>
>>> Now if I do a search such as:
>>> http://localhost:8983/solr/search/?q=*:*&fq={!q.op=AND
>>> df=common_names}man's friend
>>>
>>> Both records are returned. However, I only want RECORD1 returned. I
>>> understand why RECORD2 is returned but how can I structure my query so
>>> that only RECORD1 is returned?
>>>
>>> Thanks,
>>>
>>> Brian Lamb






RE: Using the Data Import Handler with SQLite

2011-04-04 Thread Zac Smith
I was able to resolve this issue by using a different jdbc driver: 
http://www.xerial.org/trac/Xerial/wiki/SQLiteJDBC


-Original Message-
From: Zac Smith [mailto:z...@trinkit.com] 
Sent: Friday, April 01, 2011 5:56 PM
To: solr-user@lucene.apache.org
Subject: Using the Data Import Handler with SQLite

I hope this question is being directed to the right place ...

I am trying to use SQLite (v3) as a source for the Data Import Handler. I am
using a SQLite JDBC driver (link below), and this works with only one entity.
As soon as I add a sub-entity it falls over with a locked-DB error:
"java.sql.SQLException: database is locked".
Now I realize that you can only have one connection open to SQLite at a time,
so I assume that the first query is leaving a connection open before it moves
on to the sub-query. I am not sure if the issue is in the JDBC driver or in
the DIH. It works fine with SQL Server.

Is this a bug? Or something that just isn't possible with SQLite?

Here is a sample of my data config file:

[data-config.xml sample lost in archiving -- the XML tags were stripped]


SQLite JDBC driver: http://www.zentus.com/sqlitejdbc/
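For reference, a minimal sketch of the kind of config in question. The table,
column, and path names are hypothetical (the original sample was lost above);
the driver class is org.sqlite.JDBC for both the Zentus and Xerial drivers:

    <dataConfig>
      <dataSource driver="org.sqlite.JDBC" url="jdbc:sqlite:/path/to/data.db"/>
      <document>
        <entity name="item" query="SELECT id, title FROM item">
          <!-- the sub-entity opens a second statement while the outer
               resultset is still being iterated, which is where SQLite's
               single-connection locking can bite -->
          <entity name="tag" query="SELECT tag FROM tag WHERE item_id = '${item.id}'"/>
        </entity>
      </document>
    </dataConfig>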


Re: Mongo REST interface and full data import

2011-04-04 Thread andrew_s
Hi Erick,

Thanks for your reply.

I'm using the latest stable version (1.4).
About using JSON updates ... I have some experience setting up data-import
and delta-import from a DB (it's pretty straightforward).
Not sure how it will work with updates from JSON.
Should I specify data-config.xml?
How can delta import be used for it?

BTW ... I've asked the same question on stackoverflow
http://stackoverflow.com/questions/5536770/mongo-rest-interface-and-full-data-import,
but no luck there. Anyway, it's possible to see data-config.xml and
schema.xml there.

Thanks for your help.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Mongo-REST-interface-and-full-data-import-tp2774479p2776870.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: does overwrite=false work with json

2011-04-04 Thread David Murphy
I tried it with the example json documents, and even if I add overwrite=false 
to the URL, it still overwrites.

Do this twice:
curl 'http://localhost:8983/solr/update/json?commit=true&overwrite=false' 
--data-binary @books.json -H 'Content-type:application/json'

Then do this query:
curl 'http://localhost:8983/solr/select?q=title:monsters&wt=json&indent=true'

--Dave
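One thing that may be worth trying (hedged -- I have not verified how 3.1's
JSON loader treats the URL parameter): the JSON update syntax also accepts an
overwrite flag inside each add command, in the request body rather than on the
URL. The id/name values below are illustrative, loosely based on the example
books.json:

    curl 'http://localhost:8983/solr/update/json?commit=true' \
      -H 'Content-type:application/json' --data-binary '{
        "add": {
          "overwrite": false,
          "doc": { "id": "978-0641723445", "name": "The Lightning Thief" }
        }
      }'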


Re: Question about http://wiki.apache.org/solr/Deduplication

2011-04-04 Thread eks dev
Thanks Hoss,

Externalizing this part is exactly the path we are exploring now, not
only for this reason.

We already started testing Hadoop SequenceFile as a write-ahead log (WAL)
for updates/deletes. SequenceFile supports append now (simply great!). It
was a pain to have to add Hadoop into the mix for "mortal" collection sizes
(around 200 million), but on the other hand, having Hadoop around offers
huge flexibility.
The write-ahead log catches update commands (all Solr slaves fronting
clients accept updates, but only to forward them to the WAL). The Solr
master tries to catch up with the update stream, indexing asynchronously,
and finally the Solr slaves chase the master index with standard Solr
replication.
Overnight we run simple map-reduce jobs to consolidate, normalize, and sort
the update stream, and reindex at the end.
Deduplication and collection sorting are only an optimization for us if
done reasonably often, like once per day/week, but if we do not do it, it
doubles our hardware resources.

Imo, native WAL support in Solr would definitely be one nice "nice to have"
(for HA, update scalability...). The charm of a WAL is that updates never
wait or disappear; if there is too much traffic, we only get slightly
higher update latency, but updates definitely get processed. Some basic
primitives on the WAL (consolidation, replaying the update stream on Solr,
etc.) should be supported in this case -- a sort of "smallish subset of
Hadoop features for Solr clusters", but nothing oversized.

Cheers,
eks









On Sun, Apr 3, 2011 at 1:05 AM, Chris Hostetter
 wrote:
>
> : Is it possible in Solr to have a multivalued "id"? Or do I need to make my
> : own "mv_ID" for this? Any ideas how to achieve this efficiently?
>
> This isn't something the SignatureUpdateProcessor is going to be able to
> help you with -- it does the deduplication by changing the low-level
> "update" (implemented as a delete then add) so that the key used to delete
> the older documents is based on the signature field instead of the id
> field.
>
> In order to do what you are describing, you would need to query the index
> for matching signatures, then add the resulting ids to your document
> before doing that "update".
>
> You could possibly do this in a custom UpdateProcessor, but you'd have to
> do something tricky to ensure you didn't overlook docs that had been added
> but not yet committed when checking for dups.
>
> I don't have a good suggestion for how to do this internally in Solr -- it
> seems like the type of bulk-processing logic that would be better suited
> to an external process before you ever start indexing (much like link
> analysis for back references).
>
> -Hoss
>
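For readers curious what the writer side of such a WAL might look like, here
is a minimal sketch against the classic Hadoop SequenceFile API. The class
name, key/value types, and per-append sync policy are illustrative, not eks's
actual code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class UpdateLog {
        private final SequenceFile.Writer writer;

        public UpdateLog(Configuration conf, String path) throws Exception {
            FileSystem fs = FileSystem.get(conf);
            // key = arrival timestamp, value = serialized update command
            writer = SequenceFile.createWriter(fs, conf, new Path(path),
                    LongWritable.class, Text.class);
        }

        public synchronized void log(String updateCommand) throws Exception {
            writer.append(new LongWritable(System.currentTimeMillis()),
                          new Text(updateCommand));
            writer.sync();  // write a sync marker so readers can recover up to here
        }
    }

Overnight map-reduce jobs can then read the log back in timestamp order to
consolidate and replay it, as described above.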


Re: Mongo REST interface and full data import

2011-04-04 Thread andrew_s
Sorry for mistake with Solr version ... I'm using Solr 3.1

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Mongo-REST-interface-and-full-data-import-tp2774479p2777319.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Matching on a multi valued field

2011-04-04 Thread Jonathan Rochkind

On 4/4/2011 3:21 PM, Brian Lamb wrote:

I just noticed Juan's response and I find that I am encountering that very
issue in a few cases. Boosting is a good way to put the more relevant
results at the top, but is it possible to have only the correct results
returned?


Only what's already been said in the thread.  You can simulate a
non-phrase, non-wildcard search, forced to match all terms within the same
value of a multi-valued field, by using phrase queries with slop.  And it
will only return hits that have all terms within the same value -- it's not
a boosting solution.


But if you need wildcards, or you need to find an actual phrase in the 
same value as additional term(s) or phrase(s), no, you are out of luck 
in Solr.


That is exactly what Juan said earlier in the thread.

If someone can think of a clever way to write some Java to do this in a new
query component, that would be useful.  I am not entirely sure how possible
that is.  I guess you'd have to make sure that ALL matching tokens or
phrases are within the positionIncrementGap of each other; not sure how
feasible that is, as I'm not too familiar with the Solr/Lucene source.  But
at any rate, there's no way to do it out of the box with Solr, no.




Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-04 Thread Jens Mueller
Hello Experts,



I am a Solr newbie but have read quite a lot of docs. I still do not understand
what would be the best way to set up very large scale deployments:



Goal (threoretical):

 A) Index size: 1 petabyte (one document is about 5 KB)

 B) Queries: 10 queries per second

 C) Updates: 10 updates per second




Solr offers:

1.) Replication => scales well for B), BUT A) and C) are not satisfied.


2.) Sharding => scales well for A), BUT B) and C) are not satisfied (as I
understand the sharding approach, everything goes through a central server
that dispatches the updates and assembles the queries retrieved from the
different shards; but this central server also has some capacity limits...)




What is the right approach to handle such large deployments? I would be
thankful for just a rough sketch of the concepts so I can experiment/search
further…


Maybe I am missing something very trivial, as I think some of the “Solr
Users/Use Cases” on the homepage are deployments of that kind. How are
they implemented?



Thank you very much!!!

Jens
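For reference on point 2: distributed search in stock Solr is driven by a
query parameter, and whichever node receives the request acts as the
aggregator, so the "central server" can be any node (or several behind a load
balancer). Host names below are illustrative:

    http://host0:8983/solr/select?q=solr&shards=host1:8983/solr,host2:8983/solr,host3:8983/solr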


Re: Using EmbeddedSolrServer with static documents

2011-04-04 Thread vinodreddyr17
You can unmarshal the XML docs using JAXB and use the POJO-adding
capabilities of SolrJ to index the docs. You may need to generate the classes
from the schema using the xjc tool.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Using-EmbeddedSolrServer-with-static-documents-tp2767614p2778823.html
Sent from the Solr - User mailing list archive at Nabble.com.
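A minimal sketch of that flow with SolrJ's bean support. The Doc class, its
field names, and the XML file name are hypothetical; any SolrServer works,
including an EmbeddedSolrServer:

    import java.io.File;
    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.annotation.XmlRootElement;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.beans.Field;

    public class StaticDocIndexer {

        @XmlRootElement(name = "doc")
        public static class Doc {
            @Field public String id;     // SolrJ maps this onto the Solr "id" field
            @Field public String title;
        }

        public static void index(SolrServer server, File xml) throws Exception {
            // JAXB turns the static XML file into a POJO...
            Doc doc = (Doc) JAXBContext.newInstance(Doc.class)
                                       .createUnmarshaller().unmarshal(xml);
            // ...and SolrJ's bean support indexes it
            server.addBean(doc);
            server.commit();
        }
    }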