Re: Solr Cloud Bulk Indexing Questions

2014-01-22 Thread Andre Bois-Crettez

The 1 node having more load should be the leader (because of the extra work
of receiving and distributing updates), but my experience shows only a
bit more CPU usage, and no difference in disk IO.

A suggestion would be to hard commit much less often, i.e. every 10
minutes, and see if there is a change.
How much system RAM ? JVM Heap ? Enough space in RAM for system disk cache ?
What is the size of your documents ? A few KB, MB, ... ?
Ah, and what about network IO ? Could that be a limiting factor ?


André

On 2014-01-21 23:40, Software Dev wrote:

Any other suggestions?


On Mon, Jan 20, 2014 at 2:49 PM, Software Dev wrote:


4.6.0


On Mon, Jan 20, 2014 at 2:47 PM, Mark Miller wrote:


What version are you running?

- Mark

On Jan 20, 2014, at 5:43 PM, Software Dev 
wrote:


We also noticed that disk IO shoots up to 100% on 1 of the nodes. Do all
updates get sent to one machine or something?


On Mon, Jan 20, 2014 at 2:42 PM, Software Dev <static.void@gmail.com> wrote:

We have a soft commit every 5 seconds and a hard commit every 30. As far as
docs/second, I would guess around 200/sec, which doesn't seem that high.


On Mon, Jan 20, 2014 at 2:26 PM, Erick Erickson <erickerick...@gmail.com> wrote:

Questions: How often do you commit your updates? What is your
indexing rate in docs/second?

In a SolrCloud setup, you should be using a CloudSolrServer. If the
server is having trouble keeping up with updates, switching to CUSS
probably wouldn't help.

So I suspect there's something not optimal about your setup that's
the culprit.

Best,
Erick
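
(For reference, a minimal SolrJ sketch of the indexing path Erick describes: batching documents through a CloudSolrServer pointed at ZooKeeper, and leaving commits to the server-side autoCommit settings instead of committing per request. The ZooKeeper addresses, collection name, field names and batch size are placeholders, not values from this thread.)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws IOException, SolrServerException {
        // Connect through ZooKeeper so each update is routed to the current shard leader.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 10000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("content", "body of document " + i);
            batch.add(doc);
            if (batch.size() == 250) {       // send batches, not one document per request
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        // No explicit commit here: let autoCommit / autoSoftCommit in solrconfig.xml handle visibility.
        server.shutdown();
    }
}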

On Mon, Jan 20, 2014 at 4:00 PM, Software Dev <static.void@gmail.com> wrote:

We are testing our shiny new Solr Cloud architecture but we are
experiencing some issues when doing bulk indexing.

We have 5 solr cloud machines running and 3 indexing machines (separate
from the cloud servers). The indexing machines pull ids off a queue, then
they index and ship over a document via a CloudSolrServer. It appears that
the indexers are too fast, because the load (particularly disk io) on the
solr cloud machines spikes through the roof, making the entire cluster
unusable. It's kind of odd because the total index size is not even large,
i.e. < 10GB. Are there any optimizations/enhancements I could try to help
alleviate these problems?

I should note that for the above collection we only have 1 shard that's
replicated across all machines, so all machines have the full index.

Would we benefit from switching to a ConcurrentUpdateSolrServer where all
updates get sent to 1 machine and 1 machine only? We could then remove this
machine from the part of our cluster that handles user requests.

Thanks for any input.






--
André Bois-Crettez

Software Architect
Search Developer
http://www.kelkoo.com/




Highlighting not working

2014-01-22 Thread Fatima Issawi
Hello,

I'm trying to highlight content that is returned from a Solr query, but I can't 
seem to get it working.

I would like to highlight the "documentname" and the "pagetext" or "content" 
results, but when I run the search I don't get anything returned. I thought 
that the "content" field is supposed to be used for highlighting? And that 
[termVectors="true" termPositions="true" termOffsets="true"] needs to be added 
to the fields that need to be highlighted? Is there something else I'm missing?


Here is my schema:

   
   
   
   
  
   
   
   />
   

   

   

   
   
   
   
   
   
   


Thanks,
Fatima


RE: Highlighting not working

2014-01-22 Thread Fatima Issawi
Also my highlighting defaults...

  
 

   
   on
   content documentname
   html
   
   
   0
   documentname
   3
   200
   content
   750

> -Original Message-
> From: Fatima Issawi [mailto:issa...@qu.edu.qa]
> Sent: Wednesday, January 22, 2014 11:34 AM
> To: solr-user@lucene.apache.org
> Subject: Highlighting not working
> 
> Hello,
> 
> I'm trying to highlight content that is returned from a Solr query, but I 
> can't
> seem to get it working.
> 
> I would like to highlight the "documentname" and the "pagetext" or
> "content" results, but when I run the search I don't get anything returned. I
> thought that the "content" field is supposed to be used for hightlighting?
> And that [termVectors="true" termPositions="true" termOffsets="true"]
> needs to be added to the fields that need to be highlighted? Is there
> something else I'm missing?
> 
> 
> Here is my schema:
> 
> required="true" multiValued="false" />
> omitNorms="true"/>
> stored="true" termVectors="true"  termPositions="true"
> termOffsets="true"/>
>
>   
>
>
>/>
> termVectors="true" termPositions="true" termOffsets="true"/>
> 
> multiValued="true" termVectors="true" termPositions="true"
> termOffsets="true"/>
> 
> multiValued="true"/>
> 
>
>
>
>
>
>
>
> 
> 
> Thanks,
> Fatima


Re: Solr middle-ware?

2014-01-22 Thread Lajos

I always go for SolrJ as the intermediate layer, usually in a Spring app.

I have sometimes proxied directly to Solr itself, but since we use a lot 
of Ajax, I'm not comfortable with exposing the Solr URIs directly, even 
if controlled via a proxy.


Having it go through a webapp gives me a layer I can use to validate 
input; if ever the situation warranted, I could use a filter to check 
for anything malicious. I can also layer security on top as well.
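
A rough sketch of such a validating layer: a servlet that whitelists what the browser may ask for and forwards the query via SolrJ. The Solr URL, field names and limits here are illustrative only.

import java.io.IOException;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SearchProxyServlet extends HttpServlet {

    // Solr stays on the private network; only this servlet is reachable from the browser.
    private final HttpSolrServer solr =
            new HttpSolrServer("http://localhost:8983/solr/collection1");

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String q = req.getParameter("q");
        if (q == null || q.trim().isEmpty() || q.length() > 200) {
            resp.sendError(HttpServletResponse.SC_BAD_REQUEST, "invalid query");
            return;
        }
        SolrQuery query = new SolrQuery(q);
        query.setRows(10);                        // never let the client ask for huge pages
        query.setFields("id", "title", "score");  // expose only whitelisted fields
        try {
            QueryResponse rsp = solr.query(query);
            resp.setContentType("text/plain;charset=UTF-8");
            for (SolrDocument doc : rsp.getResults()) {
                resp.getWriter().println(doc.getFieldValue("id") + "\t" + doc.getFieldValue("title"));
            }
        } catch (SolrServerException e) {
            resp.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR, "search failed");
        }
    }
}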


Cheers,

Lajos


On 22/01/2014 06:45, Alexandre Rafalovitch wrote:

So, everybody so far is exposing Solr directly to the web, but with
proxy/rewriting. Which means the html/JS libraries are Solr
query-format aware as well?

Is anybody using Solr clients (SolrNet, SolrJ) as a base?

Regards,
Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Tue, Jan 21, 2014 at 9:05 PM, Artem Karpenko  wrote:

Hello. Not really middle-ware but might be of interest concerning possible
ways implementing security.

We use custom built Solr with web.xml including Spring Security filter and
appropriate infrastructure classes for authentication added as a dependency
into project. We pass token from frontend in each request. If it's accepted
in security filter then later user role (identified from token) is used in
custom request handler that modifies query according to role permissions.

Regards,
Artem.

21.01.2014 15:08, Markus Jelsma wrote:


Hi - We use Nginx to expose the index to the internet. It comes down to
putting some limitations on input parameters and on-the-fly rewrite of
queries using embedded Perl scripting. Limitations and rewrites are usually
just a bunch of regular expressions, so it is not that hard.

Cheers
Markus
 -Original message-


From:Alexandre Rafalovitch 
Sent: Tuesday 21st January 2014 14:01
To: solr-user@lucene.apache.org
Subject: Solr middle-ware?

Hello,

All the Solr documents talk about not running Solr directly to the
cloud. But I see people keep asking for a thin secure layer in front
of Solr they can talk from JavaScript to, perhaps with some basic
extension options.

Has anybody actually written one? Open source or in a community part
of larger project? I would love to be able to point people at
something.

Is there something particularly difficult about writing one? Does
anybody has a story of aborted attempt or mid-point reversal? I would
like to know.

Regards,
 Alex.
P.s. Personal context: I am thinking of doing a series of lightweight
examples of how to use Solr. Like I did for a book, but with a bit
more depth and something that can actually be exposed to the live web
with live data. I don't want to reinvent the wheel of the thin Solr
middleware.
P.p.s. Though I keep thinking that Dart could make an interesting
option for the middleware as it could have the same codebase on the
server and in the client. Like NodeJS, but with saner syntax.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)





Re: Solr reload trigger when a configuration file is changed

2014-01-22 Thread Mohit Jain
Thanks Shawn. I appreciate you sharing the philosophy behind Solr's
implementation. I absolutely agree with the design principle and the fact
that it helps to debug unknown issues. Moreover it definitely gives more
control over the software.

However, there are a *small* number of applications that might benefit from
such an *optional* feature, accepting the additional risk of auto-reload
issues. One can think of a scenario where a job generates universal
stopword/porn-word lists at regular intervals, but does not have any idea
about the Solr hosts/collections. Instead of creating one more component to
coordinate with the generator and reload all the affected collections, it
might be a good idea to let the collections take care of it.
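
For what it's worth, a sketch of how the generating job itself could trigger the pickup explicitly, with no auto-reload involved: after writing the new stopword file into the core's conf directory (or uploading it to ZooKeeper), it calls the CoreAdmin RELOAD action through SolrJ. The host and core name are placeholders; in SolrCloud the Collections API RELOAD action would be the equivalent.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class StopwordPublisher {
    public static void main(String[] args) throws Exception {
        // Step 1 (omitted): write the regenerated stopword list into the core's
        // conf directory, or upload it to ZooKeeper when running SolrCloud.

        // Step 2: explicitly ask Solr to pick up the change, one core at a time.
        SolrServer admin = new HttpSolrServer("http://localhost:8983/solr");
        CoreAdminRequest.reloadCore("collection1", admin);
        admin.shutdown();
    }
}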

Thanks
Mohit


On Fri, Jan 17, 2014 at 10:29 PM, Shawn Heisey  wrote:

> On 1/17/2014 7:25 AM, Mohit Jain wrote:
>
>> Bingo !! Tomcat was the one which was keeping track of changes in his own
>> config/bin dirs. Once the timestamp of those dirs are changed it issued
>> reload on all wars, resulting reload of solr cores.
>>
>> By the way it will be good to have a similar configurable feature in Solr.
>>
>
> When not running in SolrCloud mode, Solr currently doesn't do anything
> unless a very definite action triggers it.  Typically, this means the Solr
> admin, a user, or an application must send a request to Solr.  Debugging
> problems is easier when you know for sure that the software cannot decide
> to do something on its own. This design principle is also part of why Solr
> doesn't have a scheduler built into the dataimport handler.
>
> When running in SolrCloud mode, Solr will react to random events like
> another Solr server going down, certain changes in zookeeper, or a problem
> with zookeeper, but it will not initiate any action on its own.  This is a
> requirement for a robust cluster. The principle is the same, but there are
> additional request sources.
>
> The current behavior with config files (whether on-disk or in zookeeper)
> allows you to update the config on a running Solr server and delay the
> activation of that config until later, at your own discretion.
>
> I think it's a very bad idea to change this behavior so that it
> automatically reloads when a config file is updated.  A less evil idea
> would be to make auto-reloads *optional*, as long as that feature is not
> turned on by default and is not turned on in the example config.  If such a
> feature is created, the solr log needs to clearly state at startup that
> it's enabled (possibly at WARN level), and each time a core is
> auto-reloaded due to a config change.
>
> Thanks,
> Shawn
>
>


Re: Optimizing index on Slave

2014-01-22 Thread Salman Akram
We do. We have a lot of updates/deletes every day, and a weekly optimization
definitely gives a considerable improvement, so we don't see a downside to it
except the complete replication part, which is not an issue on a local
network.
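
For reference, the weekly optimize can be as simple as a cron-driven SolrJ call like the sketch below; the URL is a placeholder, and in a master/slave setup the call would typically go to the master, with the optimized index then replicating in full to the slaves.

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class WeeklyOptimize {
    public static void main(String[] args) throws Exception {
        // Run once a week from cron; slaves pick up the optimized index on
        // their next replication cycle.
        HttpSolrServer master = new HttpSolrServer("http://localhost:8983/solr/collection1");
        master.optimize();   // merges down to a single segment by default
        master.shutdown();
    }
}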


Re: SOLR 4 - Query Issue in Common Grams with Surround Query Parser

2014-01-22 Thread Salman Akram
Apologies for the late response as this mail was lost somewhere in filters.

The issue was that CommonGramsQueryFilterFactory should be used for searching
and CommonGramsFilterFactory for indexing. We were using
CommonGramsFilterFactory for both, due to which it was not dropping single
tokens for common grams in a phrase query.

I will go through the link you sent and see if it needs any explanation.
Thanks!


Solr Cloud on HDFS

2014-01-22 Thread Lajos

Hi all,

I've been running Solr on HDFS, and that's fine.

But I have a Cloud installation I thought I'd try on HDFS. I uploaded 
the configs for the core that runs in standalone mode already on HDFS 
(on another cluster). I specify the HdfsDirectoryFactory, HDFS data dir, 
solr.hdfs.home, and HDFS update log path:


<dataDir>hdfs://master:9000/solr/test/data</dataDir>

<directoryFactory class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://master:9000/solr</str>
</directoryFactory>

<updateLog>
  <str name="dir">hdfs://master:9000/solr/test/ulog</str>
</updateLog>

Question is: should I create my collection differently than I would a 
normal collection?


If I just try that, Solr will initialise the directory in HDFS as if it 
were a single core. It will create shard directories on my nodes, but 
not actually put anything in there. And then it will complain mightily 
about not being able to forward updates to other nodes. (This same 
cluster hosts regular collections, and everything is working fine).


Am I missing a step? Do I have to manually create HDFS directories for 
each replica?


Thanks,

L


Re: Solr Cloud on HDFS

2014-01-22 Thread Lajos
Uugh. I just realised I should have taken out the data dir and update log 
definitions! Now it works fine.


Cheers,

L


On 22/01/2014 11:47, Lajos wrote:

Hi all,

I've been running Solr on HDFS, and that's fine.

But I have a Cloud installation I thought I'd try on HDFS. I uploaded
the configs for the core that runs in standalone mode already on HDFS
(on another cluster). I specify the HdfsDirectoryFactory, HDFS data dir,
solr.hdfs.home, and HDFS update log path:

   hdfs://master:9000/solr/test/data

   
  hdfs://master:9000/solr
   

   
 
   hdfs://master:9000/solr/test/ulog
 
   

Question is: should I create my collection differently than I would a
normal collection?

If I just try that, Solr will initialise the directory in HDFS as if it
were a single core. It will create shard directories on my nodes, but
not actually put anything in there. And then it will complain mightily
about not being able to forward updates to other nodes. (This same
cluster hosts regular collections, and everything is working fine).

Am I missing a step? Do I have to manually create HDFS directories for
each replica?

Thanks,

L


dismax request handler will give wrong result in solr 4.3

2014-01-22 Thread Viresh Modi
When I use the dismax query type handler in *Solr 1.4* and then the same for
*Solr 4.3*, they give different numFound counts, even though both have the same
index profile. Solr 1.4 gives 9 records and Solr 4.3 gives 99 records.

*My Query is:*

start=0&rows=10&hl=true&hl.fl=content&qt=dismax
&q=system admin
&fl=id,application,timestamp,name,score,metaData,metaDataDate
&fq=application:PunitR3_8
&fq= NOT ( id:OnlineR3_8_page_410_0 OR id:OnlineR3_8_page_411_0 OR
id:OnlineR3_8_page_628_0 )
&fq=(metaData:channelId/100 OR metaData:channelId/10 OR
metaData:channelId/160 OR  metaData:channelId/12)&sort=score desc

*and In Solr Config.xml has handler::*



 dismax
 explicit
 
 content name
 
 1000
 
 true
 content
 
 150
 
 name
 regex 

  


-- 

Regards,
Viresh Modi


Replication and conf files

2014-01-22 Thread Andrea Gazzarini
Hi all,
Reading here

http://wiki.apache.org/solr/SolrReplication#How_are_configuration_files_replicated.3F

I don't understand what is the observed behaviour in case

- confFiles contains schema.xml
- schema doesn't change between replication cycles

I mean, I read that the file is physically replaced on slaves only if the
checksum doesn't match, but what about the COMMIT or RELOAD command? I would
expect a RELOAD on slaves only if the schema is in confFiles AND it effectively
changes; instead it seems a RELOAD is always issued regardless of changes...is
that the expected behaviour?

"If a replication involved downloading of at least one conf file a core
reload is issued instead of a 'commit' command."

That would mean that just declaring the schema.xml in confFiles (i.e.
downloading the schema from master to slaves) will trigger a RELOAD. Are
things working in this way? And if so, what is the reason behind that
behaviour?

P.S: I'm using SOLR 3.6.1, with 1 master and two slaves

Best,
Andrea


How to run a subsequent update query to documents indexed from a dataimport query

2014-01-22 Thread Dileepa Jayakody
Hi All,

I have a Solr requirement to send all the documents imported from a
/dataimport query to go through another update chain as a separate
background process.

Currently I have configured my custom update chain in the /dataimport
handler itself. But since my custom update process needs to connect to an
external enhancement engine (Apache Stanbol) to enhance the documents with
some NLP fields, it has a negative impact on the /dataimport process.
The solution would be to have a separate update process running to enhance
the content of the documents imported from /dataimport.

Currently I have configured my custom Stanbol Processor as below in my
/dataimport handler.



data-config.xml
stanbolInterceptor

   




  


What I need now is to separate the 2 processes of dataimport and
stanbol-enhancement.
So this is like running a separate re-indexing process periodically over the
documents imported from /dataimport for Stanbol fields.

The question is how to trigger my Stanbol update process for the documents
imported from /dataimport?
In Solr, to trigger an /update query we need to know the id and the fields of
the document to be updated. In my case I need to run all the documents
imported from the previous /dataimport process through a stanbol
update.chain.

Is there a way to keep track of the documents ids imported from
/dataimport?
Any advice or pointers will be really helpful.
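
One possible approach, sketched below under the assumption that every field is stored so documents can be re-sent as-is: have the dataimport stamp each document with an indexed timestamp, then let a background job fetch everything newer than its last run (paging omitted here) and push those documents through the Stanbol chain only. The field name "indexed_at", the URL and the row count are assumptions; "stanbolInterceptor" is the chain name from the config above.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocument;

public class StanbolReindexJob {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Select documents imported since the job's last run ("indexed_at" is an
        // assumed timestamp field populated during the dataimport).
        SolrQuery query = new SolrQuery("indexed_at:[2014-01-22T00:00:00Z TO *]");
        query.setRows(500);
        QueryResponse rsp = solr.query(query);

        // Re-send the stored documents through the enhancement chain only.
        UpdateRequest update = new UpdateRequest();
        update.setParam("update.chain", "stanbolInterceptor");
        for (SolrDocument doc : rsp.getResults()) {
            update.add(ClientUtils.toSolrInputDocument(doc));
        }
        update.process(solr);
        solr.commit();
    }
}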

Thanks,
Dileepa


Re: dismax request handler will give wrong result in solr 4.3

2014-01-22 Thread Ahmet Arslan
Hi Viresh,

A couple of things:

1) The / character is a special query parser character now. It wasn't before.
It is used for regular expression searches.

http://lucene.apache.org/core/4_6_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Regexp_Searches


What happens when you use fq=metaData:("channelId/100" OR "channelId/10" OR 
"channelId/160" OR "channelId/12")

OR

fq={!lucene df=metaData q.op=OR}"channelId/100" "channelId/10" "channelId/160" 
"channelId/12"

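Either quoting as above or escaping the slash should work. If the filter is built in client code, a rough SolrJ sketch (field name and values taken from the query above, everything else assumed) could look like this:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.util.ClientUtils;

public class ChannelFilterExample {
    public static void main(String[] args) {
        SolrQuery query = new SolrQuery("system admin");
        query.set("defType", "dismax");

        // Escape each value so the 4.x parser does not treat '/' as a regex delimiter.
        String[] channels = {"channelId/100", "channelId/10", "channelId/160", "channelId/12"};
        StringBuilder fq = new StringBuilder("metaData:(");
        for (int i = 0; i < channels.length; i++) {
            if (i > 0) fq.append(" OR ");
            fq.append(ClientUtils.escapeQueryChars(channels[i]));
        }
        fq.append(')');
        query.addFilterQuery(fq.toString());

        // Prints: metaData:(channelId\/100 OR channelId\/10 OR channelId\/160 OR channelId\/12)
        System.out.println(fq);
    }
}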






On Wednesday, January 22, 2014 2:30 PM, Viresh Modi  
wrote:
When i use dismax query type handler in *SOLR 1.4 *and then same for *SOLR
4.3 *then both give different numFound record both have same index profile
as well.
means Solr 1.4 gives 9 records
    and Solr 4.3 gives  99 records.

*My Query is:*

start=0&rows=10&hl=true&hl.fl=content&qt=dismax
&q=system admin
&fl=id,application,timestamp,name,score,metaData,metaDataDate
&fq=application:PunitR3_8
&fq= NOT ( id:OnlineR3_8_page_410_0 OR id:OnlineR3_8_page_411_0 OR
id:OnlineR3_8_page_628_0 )
&fq=(metaData:channelId/100 OR metaData:channelId/10 OR
metaData:channelId/160 OR  metaData:channelId/12)&sort=score desc

*and In Solr Config.xml has handler::*


    
     dismax
     explicit
     
         content name
     
     1000
     
     true
     content
     
     150
     
     name
     regex 
    
  


-- 

Regards,
Viresh Modi



Re: Highlighting not working

2014-01-22 Thread Ahmet Arslan
Hi Fatima,

To enable highlighting (both standard and FastVector) you need to make the
fields stored="true".

Term vectors may speed up the standard highlighter. Plus they are mandatory for
the FastVectorHighlighter.

https://cwiki.apache.org/confluence/display/solr/Field+Properties+by+Use+Case

Ahmet
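
As a quick illustration, a small SolrJ sketch of requesting highlighting once the fields are stored; the core URL and the query term are assumed, and the snippet/fragsize values simply mirror the defaults quoted further down.

import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class HighlightExample {
    public static void main(String[] args) throws SolrServerException {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery query = new SolrQuery("pagetext:history");   // assumed query
        query.setHighlight(true);                 // hl=true
        query.addHighlightField("content");       // hl.fl=content,documentname
        query.addHighlightField("documentname");
        query.setHighlightSnippets(3);            // hl.snippets=3
        query.setHighlightFragsize(200);          // hl.fragsize=200

        QueryResponse rsp = solr.query(query);
        // Highlighting is returned per document id, then per field.
        Map<String, Map<String, List<String>>> hl = rsp.getHighlighting();
        for (Map.Entry<String, Map<String, List<String>>> perDoc : hl.entrySet()) {
            System.out.println(perDoc.getKey() + " -> " + perDoc.getValue());
        }
    }
}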





On Wednesday, January 22, 2014 10:44 AM, Fatima Issawi  
wrote:
Also my highlighting defaults...

  
     

       
       on
       content documentname
       html
       
       
       0
       documentname
       3
       200
       content
       750


> -Original Message-
> From: Fatima Issawi [mailto:issa...@qu.edu.qa]
> Sent: Wednesday, January 22, 2014 11:34 AM
> To: solr-user@lucene.apache.org
> Subject: Highlighting not working
> 
> Hello,
> 
> I'm trying to highlight content that is returned from a Solr query, but I 
> can't
> seem to get it working.
> 
> I would like to highlight the "documentname" and the "pagetext" or
> "content" results, but when I run the search I don't get anything returned. I
> thought that the "content" field is supposed to be used for hightlighting?
> And that [termVectors="true" termPositions="true" termOffsets="true"]
> needs to be added to the fields that need to be highlighted? Is there
> something else I'm missing?
> 
> 
> Here is my schema:
> 
>     required="true" multiValued="false" />
>     omitNorms="true"/>
>     stored="true" termVectors="true"  termPositions="true"
> termOffsets="true"/>
>    
>   
>    
>    
>    />
>     termVectors="true" termPositions="true" termOffsets="true"/>
> 
>     multiValued="true" termVectors="true" termPositions="true"
> termOffsets="true"/>
> 
>     multiValued="true"/>
> 
>    
>    
>    
>    
>    
>    
>    
> 
> 
> Thanks,
> Fatima


Re: Solr middle-ware?

2014-01-22 Thread Shawn Heisey
On 1/22/2014 12:25 AM, Raymond Wiker wrote:
> Speaking for myself, I avoid using "client apis" like SolrNet, SolrJ and
> FAST DSAPI for the simple reason that I feel that the abstractions they
> offer are so thin that I may just as well talk directly to the HTTP
> interface. Doing that also lets me build web applications that maintain
> their own state, which makes for more responsive and more robust
> applications (although I'm sure there will be differing opinions on this).

If you have the programming skill, this is absolutely a great way to go.
 It does require a lot of knowledge and expertise, though.

If you want to hammer out a quick program and be reasonably sure it's
right, a client API handles a lot of the hard stuff for you.  When
something changes in a new version of Solr that breaks a client API,
just upgrading the client API is often enough to make the same code work
again.

I love SolrJ.  It's part of Solr itself, used internally for SolrCloud,
and probably replication too.  It's thoroughly tested with the Solr test
suite, and if used correctly, it's pretty much guaranteed to be
compatible with the same version of Solr.  In most cases, it will work
with other versions too.

Thanks,
Shawn



Re: Solr Cloud on HDFS

2014-01-22 Thread Mark Miller
Right - solr.hdfs.home is the only setting you should use with SolrCloud.  

The documentation should probably be improved.  

If you set the data dir or ulog location in solrconfig.xml explicitly, it will 
be the same for every collection. SolrCloud shares the solrconfig.xml across 
SolrCore’s, and this will not work out.  

By setting solr.hdfs.home and leaving the relative defaults, all of the 
locations are correctly set for each different collection under solr.hdfs.home 
without any effort on your part.

- Mark  



On Jan 22, 2014, 7:22:22 AM, Lajos  wrote: Uugh. I just 
realised I should have take out the data dir and update log
definitions! Now it works fine.

Cheers,

L


On 22/01/2014 11:47, Lajos wrote:
> Hi all,
>
> I've been running Solr on HDFS, and that's fine.
>
> But I have a Cloud installation I thought I'd try on HDFS. I uploaded
> the configs for the core that runs in standalone mode already on HDFS
> (on another cluster). I specify the HdfsDirectoryFactory, HDFS data dir,
> solr.hdfs.home, and HDFS update log path:
>
> hdfs://master:9000/solr/test/data
>
>  class="solr.HdfsDirectoryFactory">
> hdfs://master:9000/solr
> 
>
> 
> 
> hdfs://master:9000/solr/test/ulog
> 
> 
>
> Question is: should I create my collection differently than I would a
> normal collection?
>
> If I just try that, Solr will initialise the directory in HDFS as if it
> were a single core. It will create shard directories on my nodes, but
> not actually put anything in there. And then it will complain mightily
> about not being able to forward updates to other nodes. (This same
> cluster hosts regular collections, and everything is working fine).
>
> Am I missing a step? Do I have to manually create HDFS directories for
> each replica?
>
> Thanks,
>
> L


Re: Solr reload trigger when a configuration file is changed

2014-01-22 Thread Mark Miller
Yonik has brought up this feature a few times as well. I’ve always felt about 
the same as Shawn. I’m fine with it being optional, default to off. A cluster 
reload can be a fairly heavy operation.

- Mark  



On Jan 22, 2014, 4:36:19 AM, Mohit Jain  wrote: Thanks 
Shawn. I appreciate you sharing the philosophy behind Solr's
implementation. I absolutely agree with the design principle and the fact
that it helps to debug unknown issues. Moreover it definitely gives more
control over the software.

However there are *small* number of applications that might get benefitted
from *optional* feature with additional risk of auto-reloads issues. One
can think of a scenario where a job generates universal stopwords/porn
words at regular intervals, but do not have any idea about the Solr
host/collections. Instead of creating one more component to coordinate with
generator and reload all the affecting collections, it might be a good idea
to let collections take care of it.

Thanks
Mohit


On Fri, Jan 17, 2014 at 10:29 PM, Shawn Heisey  wrote:

> On 1/17/2014 7:25 AM, Mohit Jain wrote:
>
>> Bingo !! Tomcat was the one which was keeping track of changes in his own
>> config/bin dirs. Once the timestamp of those dirs are changed it issued
>> reload on all wars, resulting reload of solr cores.
>>
>> By the way it will be good to have a similar configurable feature in Solr.
>>
>
> When not running in SolrCloud mode, Solr currently doesn't do anything
> unless a very definite action triggers it. Typically, this means the Solr
> admin, a user, or an application must send a request to Solr. Debugging
> problems is easier when you know for sure that the software cannot decide
> to do something on its own. This design principle is also part of why Solr
> doesn't have a scheduler built into the dataimport handler.
>
> When running in SolrCloud mode, Solr will react to random events like
> another Solr server going down, certain changes in zookeeper, or a problem
> with zookeeper, but it will not initiate any action on its own. This is a
> requirement for a robust cluster.The principle is the same, but there are
> additional request sources.
>
> The current behavior with config files (whether on-disk or in zookeeper)
> allows you to update the config on a running Solr server and delay the
> activation of that config until later, at your own discretion.
>
> I think it's a very bad idea to change this behavior so that it
> automatically reloads when a config file is updated. A less evil idea
> would be to make auto-reloads *optional*, as long as that feature is not
> turned on by default and is not turned on in the example config. If such a
> feature is created, the solr log needs to clearly state at startup that
> it's enabled (possibly at WARN level), and each time a core is
> auto-reloaded due to a config change.
>
> Thanks,
> Shawn
>
>


AIOOBException on trunk since 21st or 22nd build

2014-01-22 Thread Markus Jelsma
Hi - this likely belongs to an existing open issue. We're seeing the stuff 
below on a build of the 22nd. Until just now we used builds of the 20th and 
didn't have the issue. Is this a bug, or did some data format in Zookeeper 
change? Until now only two cores of the same shard throw the error; all other 
nodes in the cluster are clean.

2014-01-22 15:32:48,826 ERROR [solr.core.SolrCore] - [http-8080-exec-5] - : 
java.lang.ArrayIndexOutOfBoundsException: 1
at 
org.apache.solr.common.cloud.CompositeIdRouter$KeyParser.getHash(CompositeIdRouter.java:291)
at 
org.apache.solr.common.cloud.CompositeIdRouter.sliceHash(CompositeIdRouter.java:58)
at 
org.apache.solr.common.cloud.HashBasedRouter.getTargetSlice(HashBasedRouter.java:33)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:218)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.processDelete(DistributedUpdateProcessor.java:961)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processDelete(UpdateRequestProcessor.java:55)
at 
org.apache.solr.handler.loader.XMLLoader.processDelete(XMLLoader.java:347)
at 
org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:278)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
at 
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1915)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:785)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:203)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at 
org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
at 
org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
at 
org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)


Upgrading from SolrCloud 4.x to 4.y - as if you had used 4.y all along

2014-01-22 Thread Per Steffensen
If you are upgrading from SolrCloud 4.x to a later version 4.y, and 
basically want your end-system to seem as if it had been running 4.y (no 
legacy mode or anything) all along, you might find some inspiration here


http://solrlucene.blogspot.dk/2014/01/upgrading-from-solrcloud-4x-to-4y-as-if.html 



RE: Indexing URLs from websites

2014-01-22 Thread Teague James
Markus,

With some help from another user on the Nutch list I did a dump and found that 
the URLs I am trying to capture are in Nutch. However, when I index them with 
Solr I am not getting them. What I get in the dump is this:

http://www.example.com/pdfs/article1.pdf
Status: 2 (db_fetched)
Fetch time: [date/time stamp]
Modified time: [date/time stamp]
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0010525313
Signature: null
Metadata: Content-Type: application/pdf_pst_: success(1), lastModified=0

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Tuesday, January 21, 2014 3:09 PM
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites

Hi, are you getting pdfs at all? Sounds like a problem with url filters, those 
also work on the linkdb. You should also try dumping the linkdb and inspect it 
for urls.

Btw, I noticed this is on the solr list; it's best to open a new discussion on 
the nutch user mailing list.

Cheers

Teague James wrote:

What I'm getting is just the anchor text. In cases where there are multiple 
anchors I am getting a comma-separated list of anchor text - which is fine. 
However, I am not getting all of the anchors that are on the page, nor am I 
getting any of the URLs. The anchors I am getting back never include anchors 
that lead to documents - which is the primary objective. So on a page that 
looks something like:

Article 1 text blah blah blah [Read more] Article 2 text blah blah blah [Read 
more] Download the [PDF]

Where each [Read more] links to a page where the rest of the article is stored 
and [PDF] links to a PDF document (these are relative links). What I get back 
in the anchor field is "[Read more]","[Read more]"

I am not getting the "[PDF]" anchor and I am not getting any of the URLs that 
those anchors point to - like "/Article 1", "/Article 2", and 
"/documents/Article 1.pdf"

How can I get these URLs?

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Monday, January 20, 2014 9:08 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites

Well it is hard to get a specific anchor because there is usually more than 
one. The content of the anchors field should be correct. What would you expect 
if there are multiple anchors? 

-Original message-
> From:Teague James 
> Sent: Friday 17th January 2014 18:13
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
> 
> Progress!
> 
> I changed the value of that property in nutch-default.xml and I am getting 
> the anchor field now. However, the stuff going in there is a bit random and 
> doesn't seem to correlate to the pages I'm crawling. The primary objective is 
> that when there is something on the page that is a link to a file 
> ...href="/blah/somefile.pdf">Get the PDF!<... (using ... to prevent actual 
> code in the email) I want to capture that URL and the anchor text "Get the 
> PDF!" into field(s).
> 
> Am I going in the right direction on this?
> 
> Thank you so much for sticking with me on this - I really appreciate your 
> help!
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Friday, January 17, 2014 6:46 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
> 
> 
> 
>  
>  
> -Original message-
> > From:Teague James 
> > Sent: Thursday 16th January 2014 20:23
> > To: solr-user@lucene.apache.org
> > Subject: RE: Indexing URLs from websites
> > 
> > Okay. I had used that previously and I just tried it again. The following 
> > generated no errors:
> > 
> > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
> > crawl/linkdb -dir crawl/segments/
> > 
> > Solr is still not getting an anchor field and the outlinks are not 
> > appearing in the index anywhere else.
> > 
> > To be sure I deleted the crawl directory and did a fresh crawl using:
> > 
> > bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > 
> > Then
> > 
> > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
> > crawl/linkdb -dir crawl/segments/
> > 
> > No errors, but no anchor fields or outlinks. One thing in the response from 
> > the crawl that I found interesting was a line that said:
> > 
> > LinkDb: internal links will be ignored.
> 
> Good catch! That is likely the problem. 
> 
> > 
> > What does that mean?
> 
> <property>
>   <name>db.ignore.internal.links</name>
>   <value>true</value>
>   <description>If true, when adding new links to a page, links from
>   the same host are ignored.  This is an effective way to limit the
>   size of the link database, keeping only the highest quality
>   links.
>   </description>
> </property>
> 
> 
> So change the property, rebuild the linkdb and try reindexing once 
> again :)
> 
> > 
> > -Original Message-
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: Thursday, January 16, 2014 11:08 AM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Indexing URLs from websites
> 

Re: AIOOBException on trunk since 21st or 22nd build

2014-01-22 Thread Mark Miller
Looking at the list of changes on the 21st and 22nd, I don’t see a smoking gun.

- Mark  



On Jan 22, 2014, 11:13:26 AM, Markus Jelsma  wrote: 
Hi - this likely belongs to an existing open issue. We're seeing the stuff 
below on a build of the 22nd. Until just now we used builds of the 20th and 
didn't have the issue. This is either a bug or did some data format in 
Zookeeper change? Until now only two cores of the same shard through the error, 
all other nodes in the cluster are clean.

2014-01-22 15:32:48,826 ERROR [solr.core.SolrCore] - [http-8080-exec-5] - : 
java.lang.ArrayIndexOutOfBoundsException: 1
at 
org.apache.solr.common.cloud.CompositeIdRouter$KeyParser.getHash(CompositeIdRouter.java:291)
at 
org.apache.solr.common.cloud.CompositeIdRouter.sliceHash(CompositeIdRouter.java:58)
at 
org.apache.solr.common.cloud.HashBasedRouter.getTargetSlice(HashBasedRouter.java:33)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:218)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.processDelete(DistributedUpdateProcessor.java:961)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processDelete(UpdateRequestProcessor.java:55)
at org.apache.solr.handler.loader.XMLLoader.processDelete(XMLLoader.java:347)
at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:278)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
at 
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1915)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:785)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:203)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at 
org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
at 
org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
at 
org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)


Re: Solr Cloud on HDFS

2014-01-22 Thread Lajos

Thanks Mark ... indeed, some doc updates would help.

Regarding what seems to be a popular question on sharding. It seems that 
it would be a Good Thing that the shards for a collection running HDFS 
essentially be pointers to the HDFS-replicated index. Is that what your 
thinking is?


I've been following your work recently, would be interested in helping 
out on this if there's the chance.


Is there a JIRA yet on this issue?

Thanks,

lajos


On 22/01/2014 16:57, Mark Miller wrote:

Right - solr.hdfs.home is the only setting you should use with SolrCloud.

The documentation should probably be improved.

If you set the data dir or ulog location in solrconfig.xml explicitly, it will 
be the same for every collection. SolrCloud shares the solrconfig.xml across 
SolrCore’s, and this will not work out.

By setting solr.hdfs.home and leaving the relative defaults, all of the 
locations are correctly set for each different collection under solr.hdfs.home 
without any effort on your part.

- Mark



On Jan 22, 2014, 7:22:22 AM, Lajos  wrote: Uugh. I just 
realised I should have take out the data dir and update log
definitions! Now it works fine.

Cheers,

L


On 22/01/2014 11:47, Lajos wrote:

Hi all,

I've been running Solr on HDFS, and that's fine.

But I have a Cloud installation I thought I'd try on HDFS. I uploaded
the configs for the core that runs in standalone mode already on HDFS
(on another cluster). I specify the HdfsDirectoryFactory, HDFS data dir,
solr.hdfs.home, and HDFS update log path:

hdfs://master:9000/solr/test/data


hdfs://master:9000/solr




hdfs://master:9000/solr/test/ulog



Question is: should I create my collection differently than I would a
normal collection?

If I just try that, Solr will initialise the directory in HDFS as if it
were a single core. It will create shard directories on my nodes, but
not actually put anything in there. And then it will complain mightily
about not being able to forward updates to other nodes. (This same
cluster hosts regular collections, and everything is working fine).

Am I missing a step? Do I have to manually create HDFS directories for
each replica?

Thanks,

L




Searching and scoring with block join

2014-01-22 Thread dev

Hello again,

I'm using the solr block-join feature to index a journal and all of  
its articles.

Here a short example:



527fcbf8-c140-4ae6-8f51-68cd2efc1343
Sozialmagazin
8
2008
0340-8469
...
juventa
...
true

527fcb34-4570-4a86-b9e7-68cd2efc1343
A World out of Balance
62
Amthor
...
...


527fcbf8-84ec-424f-9d58-68cd2efc1343
Die Philosophie des 
Helfens
50
Keck
...
...




I read about the search syntax in this article:  
http://blog.griddynamics.com/2013/09/solr-block-join-support.html
Yet I'm wondering how to use it properly. If I want to make a "fulltext"  
search over all journals and their articles, getting the journals with the  
highest score as the result, what should my query look like?
I know that I can't just make a query like this: {!parent  
which=is_parent:true}+Term; most likely I'll get this error: child  
query must only match non-parent docs, but parent docID= matched  
childScorer=class org.apache.lucene.search.TermScorer


So, how do I make a query that searches in both journals and articles,  
giving me the journals ordered by their score? How do I get the scores of  
the child documents added to the score of the parent document?
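
Not a full answer, but on the error itself: the inner query of {!parent} must match only child documents, so restricting it to a field that exists only on articles (or excluding is_parent:true) avoids the "child query must only match non-parent docs" message. Below is a hedged SolrJ sketch, with "journal_title" and "article_title" as made-up field names, combining a journal-level clause with a block-join clause through the _query_ hook. Note that, as far as I can tell, the stock 4.x {!parent} parser does not fold child scores into the parent, so the ranking here comes from the parent-level clause.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class JournalSearch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // The child clause only touches an article-level field, so it can never
        // match a parent (journal) document.
        String childClause = "{!parent which='is_parent:true'}article_title:balance";

        // Search journal fields directly, and article fields through the block join.
        SolrQuery query = new SolrQuery(
                "journal_title:balance OR _query_:\"" + childClause + "\"");
        query.setFields("id", "journal_title", "score");

        QueryResponse rsp = solr.query(query);
        System.out.println(rsp.getResults().getNumFound() + " journals matched");
    }
}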


Thank you for your help.

- Gesh




RE: Interesting search question! How to match documents based on the least number of fields that match all query terms?

2014-01-22 Thread Petersen, Robert
Hi Daniel,

How about trying something like this (you'll have to play with the boosts to 
tune it): search all the fields with all the terms using edismax and use the 
minimum-should-match parameter, but require all terms to match in the 
allMetadatas field.
https://wiki.apache.org/solr/ExtendedDisMax#mm_.28Minimum_.27Should.27_Match.29

Lucene query syntax below to give you the general idea, but this query would 
require all terms to be in one of the metadata fields to get the boost.

metadata1:(term1 AND ... AND termN)^2
metadata2:(term1 AND ... AND termN)^2
.
metadataN:(term1 AND ... AND termN)^2
allMetadatas :(term1 AND ... AND termN)^0.5

That should do approximately what you want,
Robi

-Original Message-
From: Daniel Shane [mailto:sha...@lexum.com] 
Sent: Tuesday, January 21, 2014 8:42 AM
To: solr-user@lucene.apache.org
Subject: Interesting search question! How to match documents based on the least 
number of fields that match all query terms?

I have an interesting solr/lucene question, and it's quite possible that some 
new features in solr might make this much easier than what I am about to try. 
If anyone has a clever idea on how to do this search, please let me know!

Basically, let's state that I have an index in which each document has a 
content field and several metadata fields.

Document Fields:

content
metadata1
metadata2
.
metadataN
allMetadatas (all the terms indexed in metadata1...N are concatenated in this 
field) 

Assuming that I am searching for documents that contain a certain number of 
terms (term1 to termN) in their metadata fields, I would like to build a search 
query that will return documents that satisfy these requirements:

a) All search terms must be present in a metadata field. This is quite easy, we 
can simply search in the field allMetadatas and that will work fine.

b) Now for the hard part: we prefer documents in which we found the metadatas in 
the *least number of different fields*. So if one document contains all the 
search terms in 10 different fields, but another document contains all search 
terms in only 8 fields, we would like the latter to sort first.

My first idea was to index terms in the allMetadatas field using payloads. Each 
indexed term would also carry the specific metadataN field from which it 
originates. Then I can write a scorer to score based on these payloads.

However, if there is a way to do this without payloads I'm all ears!

-- 
Daniel Shane
Lexum (www.lexum.com)
sha...@lexum.com



Re: Solr Cloud Bulk Indexing Questions

2014-01-22 Thread Software Dev
A suggestion would be to hard commit much less often, ie every 10
minutes, and see if there is a change.

- Will try this

How much system RAM ? JVM Heap ? Enough space in RAM for system disk cache ?

- We have 18G of RAM, 12 dedicated to Solr, but as of right now the total
index size is only 5GB

What is the size of your documents ? A few KB, MB, ... ?

- Under 1MB

Ah, and what about network IO ? Could that be a limiting factor ?

- Again, total index size is only 5GB so I don't know if this would be a
problem.






On Wed, Jan 22, 2014 at 12:26 AM, Andre Bois-Crettez
wrote:

> 1 node having more load should be the leader (because of the extra work
> of receiving and distributing updates, but my experiences show only a
> bit more CPU usage, and no difference in disk IO).
>
> A suggestion would be to hard commit much less often, ie every 10
> minutes, and see if there is a change.
> How much system RAM ? JVM Heap ? Enough space in RAM for system disk cache
> ?
> What is the size of your documents ? A few KB, MB, ... ?
> Ah, and what about network IO ? Could that be a limiting factor ?
>
>
> André


Re: Trying to config solr cloud

2014-01-22 Thread svante karlsson
Thank you very much!!

Just to recap.

My solrconfig.xml had the tvComponent and when I removed that it works as
expected although not as fast as I had hoped. I'll do some more reading on
best practices and probably ask a new question later...



 
  tvComponent



- Svante







2014/1/22 Mark Miller 

> If that is the case, we could probably use a JIRA issue Svante. The
> component should really give a nice user error in this scenerio.
>
> - Mark
>
>
>
> On Jan 21, 2014, 8:00:55 PM, Tim Potter 
> wrote: Hi Svante,
>
> It seems like the TermVectorComponent is in the search component chain of
> your /select search handler but you haven't indexed docs with term vectors
> enabled (at least from what's in the schema you provided). Admittedly, the
> NamedList code could be a little more paranoid but I think the key is to
> check the component chain of your /select handler to make sure tvComponent
> isn't included (or re-index with term vectors enabled).
>
> Cheers,
>
> Timothy Potter
> Sr. Software Engineer, LucidWorks
> www.lucidworks.com
>
> 
> From: saka.csi...@gmail.com  on behalf of svante
> karlsson 
> Sent: Tuesday, January 21, 2014 4:20 PM
> To: solr-user@lucene.apache.org
> Subject: Trying to config solr cloud
>
> I've been playing around with solr 4.6.0 for some weeks and I'm trying to
> get a solrcloud configuration running.
>
> I've installed two physical machines and I'm trying to set up 4 shards on
> each.
>
I installed a zookeeper on each host as well
>
> I uploaded a config to zookeeper with
> /opt/solr-4.6.0/example/cloud-scripts/zkcli.sh -cmd upconfig -zkhost
> 192.168.0.93:2181 -confdir /opt/solr/om5/conf/ -confname om5
>
> The /opt/solr/om5 was where I kept my normal solr and I'm trying to reuse
> that config.
>
>
> now I start two hosts (one on each server)
> java -DzkHost=192.168.0.93:2181,192.168.0.94:2181 -Dhost=192.168.0.93 -jar
> start.jar
> java -DzkHost=192.168.0.93:2181,192.168.0.94:2181 -Dhost=192.168.0.94 -jar
> start.jar
>
> and finally I'll run
> curl '
>
> http://192.168.0.93:8983/solr/admin/collections?action=CREATE&name=om5&numShards=8&replicationFactor=1&maxShardsPerNode=4
> '
>
> This gets me 8 shard in the web gui
> http://192.168.0.94:8983/solr/#/~cloud
>
Now I add documents to this and that seems to work. I pushed 97 million
docs during the night. (Each shard reports an 8th of the documents.)
>
But all queries return HTTP 500 in variants of the below result. I get
correct data in the body but always an error trace after that...
>
> http://192.168.0.93:8983/solr/om5/select?q=*:*&rows=1&fl=id
>
> returns
>
> 
> 
> 500
> 32
> 
> 
> 
> b1e5865c-3b01---0471b12d16ac
> 
> 
> 
> 
> java.lang.NullPointerException at
>
> org.apache.solr.common.util.NamedList.nameValueMapToList(NamedList.java:114)
> at org.apache.solr.common.util.NamedList.(NamedList.java:80) at
>
> org.apache.solr.handler.component.TermVectorComponent.finishStage(TermVectorComponent.java:453)
> at
>
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:317)
> at
>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859) at
>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:710)
> at
>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
> at
>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
> at
>
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
> at
>
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
> at
>
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> at
>
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
> at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
> at
>
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
> at
>
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
> at
>
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> at
>
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> at
>
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
> at
>
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> at org.eclipse.jetty.server.Server.handle(Server.java:368) at
>
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
> at
>
> org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
> at
>
> org.eclipse.

Fuzzy 2 search results wrong

2014-01-22 Thread Lou Foster
I am using the fuzzy search functionality with solr 4.1 and am having
problems with the fuzzy search results when fuzzy level 2 is used.

Here is a description of the issue;

I have an index that consists of one main core that is generated by merging
many other cores together.

If I fuzzy search within the cores prior to merging, the results are as
expected. Exact match yields a number of hits, fuzzy 1 yields more and
fuzzy 2 even more. At each search I have verified that the words are being
matched using the correct edit distances.

The problem occurs after merging, and not with all cores. Sometimes the
results get capped out at the number of results returned by the fuzzy 1
search.

For example, exact returns 100 hits, and fuzzy 1 and fuzzy 2 both return
1200. I can see that the words matched are still the correct edit distance,
so I would expect the fuzzy 2 to have many more hits.

Why is this happening, and what can I do to troubleshoot and/or solve this
problem? It almost feels like a bug in solr.

Thanks!


Re: Interesting search question! How to match documents based on the least number of fields that match all query terms?

2014-01-22 Thread Mikhail Khludnev
Hello Daniel,

I have an idea to try to use coord() here. Check
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html and
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/package-summary.html

So, if you can override the similarity to ignore all scoring factors, leaving
only coord() meaningful, and form a query like
metadata1:(a b c...) metadata2:(a b c...) metadata3:(a b c...)...
you can count the number of hits across the metadata# fields. Mind that you
might need to disable coord via new BooleanQuery(true) for the nested
disjunctions (a b c...).
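
A rough sketch of the kind of Similarity override described above: every per-term factor flattened to 1.0 so that coord(), the fraction of optional clauses that matched, dominates the score. Hooking it in would be done with a <similarity> element in schema.xml; the class name is made up.

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

/**
 * Scores documents (almost) purely by the coord() factor, i.e. by how many of
 * the optional metadata clauses matched, ignoring tf, idf and length norms.
 */
public class CoordOnlySimilarity extends DefaultSimilarity {

    @Override
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;   // term present or not; frequency ignored
    }

    @Override
    public float idf(long docFreq, long numDocs) {
        return 1.0f;                     // rare terms score the same as common ones
    }

    @Override
    public float lengthNorm(FieldInvertState state) {
        return 1.0f;                     // field length ignored
    }

    @Override
    public float queryNorm(float sumOfSquaredWeights) {
        return 1.0f;
    }

    // coord(overlap, maxOverlap) is inherited unchanged: overlap / maxOverlap.
}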

A related question was discussed in
http://www.youtube.com/watch?v=1ZRmqtPoAj4 but it mostly covers norms
(which are a sort of primitive form of payloads).



On Tue, Jan 21, 2014 at 8:41 PM, Daniel Shane  wrote:

> I have an interesting solr/lucene question and its quite possible that
> some new features in solr might make this much easier that what I am about
> to try. If anyone has a clever idea on how to do this search, please let me
> know!
>
> Basically, lets state that I have an index in which each documents has a
> content and several metadata fields.
>
> Document Fields:
>
> content
> metadata1
> metadata2
> .
> metadataN
> allMetadatas (all the terms indexed in metadata1...N are concatenated in
> this field)
>
> Assuming that I am searching for documents that contains a certain number
> of terms (term1 to termN) in their metadata fields, I would like to build a
> search query that will return document that satisfy these requirement:
>
> a) All search terms must be present in a metadata field. This is quite
> easy, we can simply search in the field allMetadatas and that will work
> fine.
>
> b) Now for the hard part, we prefer document in which we found the
> metadatas in the *least number of different fields*. So if one document
> contains all the search terms in 10 different fields, but another document
> contains all search terms but in only 8 fields, we would like those to sort
> first.
>
> My first idea was to index terms in the allMetadatas using payloads. Each
> indexed term would also have the specific metadataN field from which they
> originate. Then I can write a scorer to score based on these payloads.
>
> However, if there is a way to do this without payloads I'm all ears!
>
> --
> Daniel Shane
> Lexum (www.lexum.com)
> sha...@lexum.com
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Re: Solr middle-ware?

2014-01-22 Thread Jorge Luis Betancourt González
I would love to see some proxy-like application implemented in Go (partly out of
my desire to find time to check out Go).

- Original Message -
From: "Shawn Heisey" 
To: solr-user@lucene.apache.org
Sent: Wednesday, January 22, 2014 10:38:34 AM
Subject: Re: Solr middle-ware?

On 1/22/2014 12:25 AM, Raymond Wiker wrote:
> Speaking for myself, I avoid using "client apis" like SolrNet, SolrJ and
> FAST DSAPI for the simple reason that I feel that the abstractions they
> offer are so thin that I may just as well talk directly to the HTTP
> interface. Doing that also lets me build web applications that maintain
> their own state, which makes for more responsive and more robust
> applications (although I'm sure there will be differing opinions on this).

If you have the programming skill, this is absolutely a great way to go.
 It does require a lot of knowledge and expertise, though.

If you want to hammer out a quick program and be reasonably sure it's
right, a client API handles a lot of the hard stuff for you.  When
something changes in a new version of Solr that breaks a client API,
just upgrading the client API is often enough to make the same code work
again.

I love SolrJ.  It's part of Solr itself, used internally for SolrCloud,
and probably replication too.  It's thoroughly tested with the Solr test
suite, and if used correctly, it's pretty much guaranteed to be
compatible with the same version of Solr.  In most cases, it will work
with other versions too.
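
For instance, a complete query round trip in SolrJ is only a few lines (a sketch;
the core URL and query are placeholders):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class QuickQuery {
        public static void main(String[] args) throws Exception {
            // SolrJ takes care of the HTTP plumbing and response parsing.
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            QueryResponse rsp = solr.query(new SolrQuery("title:test"));
            System.out.println("Found " + rsp.getResults().getNumFound() + " documents");
            solr.shutdown();
        }
    }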

Thanks,
Shawn


III Escuela Internacional de Invierno en la UCI del 17 al 28 de febrero del 
2014. Ver www.uci.cu


Re: Solr Cloud on HDFS

2014-01-22 Thread Mark Miller
I just created a JIRA issue for the first bit I’ll be working on:  



- Mark  



On Jan 22, 2014, 12:57:46 PM, Lajos  wrote: Thanks Mark ... 
indeed, some doc updates would help.

Regarding what seems to be a popular question on sharding. It seems that
it would be a Good Thing that the shards for a collection running HDFS
essentially be pointers to the HDFS-replicated index. Is that what your
thinking is?

I've been following your work recently, would be interested in helping
out on this if there's the chance.

Is there a JIRA yet on this issue?

Thanks,

lajos


On 22/01/2014 16:57, Mark Miller wrote:
> Right - solr.hdfs.home is the only setting you should use with SolrCloud.
>
> The documentation should probably be improved.
>
> If you set the data dir or ulog location in solrconfig.xml explicitly, it 
> will be the same for every collection. SolrCloud shares the solrconfig.xml 
> across SolrCore’s, and this will not work out.
>
> By setting solr.hdfs.home and leaving the relative defaults, all of the 
> locations are correctly set for each different collection under 
> solr.hdfs.home without any effort on your part.
>
> - Mark
>
>
>
> On Jan 22, 2014, 7:22:22 AM, Lajos  wrote: Uugh. I just 
> realised I should have taken out the data dir and update log
> definitions! Now it works fine.
>
> Cheers,
>
> L
>
>
> On 22/01/2014 11:47, Lajos wrote:
>> Hi all,
>>
>> I've been running Solr on HDFS, and that's fine.
>>
>> But I have a Cloud installation I thought I'd try on HDFS. I uploaded
>> the configs for the core that runs in standalone mode already on HDFS
>> (on another cluster). I specify the HdfsDirectoryFactory, HDFS data dir,
>> solr.hdfs.home, and HDFS update log path:
>>
>> <dataDir>hdfs://master:9000/solr/test/data</dataDir>
>>
>> <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
>>   <str name="solr.hdfs.home">hdfs://master:9000/solr</str>
>> </directoryFactory>
>>
>> <updateHandler class="solr.DirectUpdateHandler2">
>>   <updateLog>
>>     <str name="dir">hdfs://master:9000/solr/test/ulog</str>
>>   </updateLog>
>> </updateHandler>
>>
>> Question is: should I create my collection differently than I would a
>> normal collection?
>>
>> If I just try that, Solr will initialise the directory in HDFS as if it
>> were a single core. It will create shard directories on my nodes, but
>> not actually put anything in there. And then it will complain mightily
>> about not being able to forward updates to other nodes. (This same
>> cluster hosts regular collections, and everything is working fine).
>>
>> Am I missing a step? Do I have to manually create HDFS directories for
>> each replica?
>>
>> Thanks,
>>
>> L
>


Re: Solr Cloud on HDFS

2014-01-22 Thread Mark Miller
Whoops, hit the send keyboard shortcut.  

I just created a JIRA issue for the first bit I’ll be working on:

SOLR-5656: When using HDFS, the Overseer should have the ability to reassign 
the cores from failed nodes to running nodes.  

- Mark  



On Jan 22, 2014, 12:57:46 PM, Lajos  wrote: Thanks Mark ... 
indeed, some doc updates would help.

Regarding what seems to be a popular question on sharding. It seems that
it would be a Good Thing that the shards for a collection running HDFS
essentially be pointers to the HDFS-replicated index. Is that what your
thinking is?

I've been following your work recently, would be interested in helping
out on this if there's the chance.

Is there a JIRA yet on this issue?

Thanks,

lajos


On 22/01/2014 16:57, Mark Miller wrote:
> Right - solr.hdfs.home is the only setting you should use with SolrCloud.
>
> The documentation should probably be improved.
>
> If you set the data dir or ulog location in solrconfig.xml explicitly, it 
> will be the same for every collection. SolrCloud shares the solrconfig.xml 
> across SolrCore’s, and this will not work out.
>
> By setting solr.hdfs.home and leaving the relative defaults, all of the 
> locations are correctly set for each different collection under 
> solr.hdfs.home without any effort on your part.
>
> - Mark
>
>
>
> On Jan 22, 2014, 7:22:22 AM, Lajos  wrote: Uugh. I just 
> realised I should have taken out the data dir and update log
> definitions! Now it works fine.
>
> Cheers,
>
> L
>
>
> On 22/01/2014 11:47, Lajos wrote:
>> Hi all,
>>
>> I've been running Solr on HDFS, and that's fine.
>>
>> But I have a Cloud installation I thought I'd try on HDFS. I uploaded
>> the configs for the core that runs in standalone mode already on HDFS
>> (on another cluster). I specify the HdfsDirectoryFactory, HDFS data dir,
>> solr.hdfs.home, and HDFS update log path:
>>
>> <dataDir>hdfs://master:9000/solr/test/data</dataDir>
>>
>> <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
>>   <str name="solr.hdfs.home">hdfs://master:9000/solr</str>
>> </directoryFactory>
>>
>> <updateHandler class="solr.DirectUpdateHandler2">
>>   <updateLog>
>>     <str name="dir">hdfs://master:9000/solr/test/ulog</str>
>>   </updateLog>
>> </updateHandler>
>>
>> Question is: should I create my collection differently than I would a
>> normal collection?
>>
>> If I just try that, Solr will initialise the directory in HDFS as if it
>> were a single core. It will create shard directories on my nodes, but
>> not actually put anything in there. And then it will complain mightily
>> about not being able to forward updates to other nodes. (This same
>> cluster hosts regular collections, and everything is working fine).
>>
>> Am I missing a step? Do I have to manually create HDFS directories for
>> each replica?
>>
>> Thanks,
>>
>> L
>


Re: Searching and scoring with block join

2014-01-22 Thread Mikhail Khludnev
On Wed, Jan 22, 2014 at 10:17 PM,  wrote:

> I know that I can't just make a query like this: {!parent
> which=is_parent:true}+Term, most likely I'll get this error: child query
> must only match non-parent docs, but parent docID= matched
> childScorer=class org.apache.lucene.search.TermScorer
>

Hello Gesh,

As it states there, the child clause should not match any parent docs, but the
query +Term matches them because it is applied against some default field which,
I believe, belongs to the parent docs.

That blog has an example of searching across both 'scopes'
q=+BRAND_s:Nike +_query_:"{!parent which=type_s:parent}+COLOR_s:Red
+SIZE_s:XL"
Mind the exact fields specified for both scopes. In your case you need to
switch from conjunction ('+') to disjunction.
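
Concretely, dropping the '+' from the two top-level clauses turns them into a
disjunction (same field names as the blog example; substitute your own):

    q=BRAND_s:Nike _query_:"{!parent which=type_s:parent}+COLOR_s:Red +SIZE_s:XL"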

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Optimizing index on Slave

2014-01-22 Thread Michael Della Bitta
Salman,

To my knowledge, there's not a great way of doing this.

Perhaps if your dataset were based on a time series, you could shard by
date, and then only a smaller segment of your data would be updated and
therefore need to be sent each week?


Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions  | g+:
plus.google.com/appinions
w: appinions.com 


On Wed, Jan 22, 2014 at 4:48 AM, Salman Akram <
salman.ak...@northbaysolutions.net> wrote:

> We do. We have a lot of updates/deletes every day and a weekly optimization
> definitely gives a considerable improvement so don't see a downside to it
> except the complete replication part which is not an issue on local
> network.
>


dataimport handler

2014-01-22 Thread tom
Hi,
I am trying to use the DataImportHandler (Solr 4.6) with an Oracle database, but I
have some issues mapping the data.
I have 3 columns in the test_table,
 column1,
 column2,
 id

dataconfig.xml

  


   


The issue is:
- if I remove the id column from the table, indexing fails; Solr looks for an
id column even though it is not mapped in dataconfig.xml.
- if I add it, Solr directly maps the id column from the db to the Solr id and
ignores column1, even though column1 is mapped.

My problem is that I don't have an ID in every table; I should be able to map
whichever column I choose from the table to the Solr id. Any solution will be
greatly appreciated.

`Tom




--
View this message in context: 
http://lucene.472066.n3.nabble.com/dataimport-handler-tp4112830.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Cloud on HDFS

2014-01-22 Thread Lajos

Cool Mark, I'll keep an eye on this one.

L


On 22/01/2014 22:36, Mark Miller wrote:

Whoops, hit the send keyboard shortcut.

I just created a JIRA issue for the first bit I’ll be working on:

SOLR-5656: When using HDFS, the Overseer should have the ability to reassign 
the cores from failed nodes to running nodes.

- Mark



On Jan 22, 2014, 12:57:46 PM, Lajos  wrote: Thanks Mark ... 
indeed, some doc updates would help.

Regarding what seems to be a popular question on sharding. It seems that
it would be a Good Thing that the shards for a collection running HDFS
essentially be pointers to the HDFS-replicated index. Is that what your
thinking is?

I've been following your work recently, would be interested in helping
out on this if there's the chance.

Is there a JIRA yet on this issue?

Thanks,

lajos


On 22/01/2014 16:57, Mark Miller wrote:

Right - solr.hdfs.home is the only setting you should use with SolrCloud.

The documentation should probably be improved.

If you set the data dir or ulog location in solrconfig.xml explicitly, it will 
be the same for every collection. SolrCloud shares the solrconfig.xml across 
SolrCore’s, and this will not work out.

By setting solr.hdfs.home and leaving the relative defaults, all of the 
locations are correctly set for each different collection under solr.hdfs.home 
without any effort on your part.

- Mark



On Jan 22, 2014, 7:22:22 AM, Lajos  wrote: Uugh. I just 
realised I should have taken out the data dir and update log
definitions! Now it works fine.

Cheers,

L


On 22/01/2014 11:47, Lajos wrote:

Hi all,

I've been running Solr on HDFS, and that's fine.

But I have a Cloud installation I thought I'd try on HDFS. I uploaded
the configs for the core that runs in standalone mode already on HDFS
(on another cluster). I specify the HdfsDirectoryFactory, HDFS data dir,
solr.hdfs.home, and HDFS update log path:

<dataDir>hdfs://master:9000/solr/test/data</dataDir>

<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://master:9000/solr</str>
</directoryFactory>

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">hdfs://master:9000/solr/test/ulog</str>
  </updateLog>
</updateHandler>

Question is: should I create my collection differently than I would a
normal collection?

If I just try that, Solr will initialise the directory in HDFS as if it
were a single core. It will create shard directories on my nodes, but
not actually put anything in there. And then it will complain mightily
about not being able to forward updates to other nodes. (This same
cluster hosts regular collections, and everything is working fine).

Am I missing a step? Do I have to manually create HDFS directories for
each replica?

Thanks,

L






shard merged into a another shard as replica

2014-01-22 Thread Utkarsh Sengar
I am not sure what happened, I updated merchant collection and then
restarted all the solr machines.

This is what I see right now: http://i.imgur.com/4bYuhaq.png

The merchant collection looks fine. But the deals and prodinfo collections should
have a total of 3 shards, yet somehow shard1 has been converted into a replica of
shard2.

This is running in production, so how can I fix it without dumping the
whole zk data?

-- 
Thanks,
-Utkarsh


Solr/Lucene Faceted Search Too Many Unique Values?

2014-01-22 Thread Bing Hua
Hi,

I am going to evaluate some Lucene/Solr capabilities for handling faceted
queries, in particular with a single facet field that contains a large number
(say up to 1 million) of distinct values. Does anyone have experience with
how Lucene performs in this scenario?

e.g. 
Doc1 has tags A B C D 
Doc2 has tags B C D E 
etc etc millions of docs and there can be millions of distinct tag values.

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Lucene-Faceted-Search-Too-Many-Unique-Values-tp4112860.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr/Lucene Faceted Search Too Many Unique Values?

2014-01-22 Thread Yago Riveiro
You will need to use DocValues if you want to facet on this number of terms
without blowing the heap.

I have facets with ~39M unique terms; the response time is about 10-40
seconds, which in my case is not a problem.
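
For reference, the request itself is just an ordinary facet query over the tag
field; the important part is that the field is declared with docValues="true" in
schema.xml so faceting does not have to un-invert it onto the heap. A SolrJ sketch
(field and core names are assumptions):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;

    public class TagFacets {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);
            q.setFacet(true);
            q.addFacetField("tags");  // assumed field name
            q.setFacetLimit(20);      // a UI rarely needs more than the top values
            q.setFacetMinCount(1);
            for (FacetField.Count c : solr.query(q).getFacetField("tags").getValues()) {
                System.out.println(c.getName() + " (" + c.getCount() + ")");
            }
            solr.shutdown();
        }
    }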

-- 
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, January 22, 2014 at 10:59 PM, Bing Hua wrote:

> Hi,
> 
> I am going to evaluate some Lucene/Solr capabilities on handling faceted
> queries, in particular, with a single facet field that contains large number
> (say up to 1 million) of distinct values. Does anyone have some experience
> on how lucene performs in this scenario?
> 
> e.g. 
> Doc1 has tags A B C D 
> Doc2 has tags B C D E 
> etc etc millions of docs and there can be millions of distinct tag values.
> 
> Thanks
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-Lucene-Faceted-Search-Too-Many-Unique-Values-tp4112860.html
> Sent from the Solr - User mailing list archive at Nabble.com 
> (http://Nabble.com).
> 
> 




Re: shard merged into a another shard as replica

2014-01-22 Thread Mark Miller
What version of Solr are you running?

- Mark 



On Jan 22, 2014, 5:42:30 PM, Utkarsh Sengar  wrote: I am 
not sure what happened, I updated merchant collection and then
restarted all the solr machines.

This is what I see right now: http://i.imgur.com/4bYuhaq.png

merchant collection looks fine. But deals and prodinfo collections should
have a total of 3 shards. But someone shard1 has converted to replica of
shard2.

This is running in production, so how can I fix it without dumping the
whole zk data?

--
Thanks,
-Utkarsh


Re: shard merged into a another shard as replica

2014-01-22 Thread Utkarsh Sengar
solr 4.4.0


On Wed, Jan 22, 2014 at 3:12 PM, Mark Miller  wrote:

> What version of Solr are you running?
>
> - Mark
>
>
>
> On Jan 22, 2014, 5:42:30 PM, Utkarsh Sengar 
> wrote: I am not sure what happened, I updated merchant collection and then
> restarted all the solr machines.
>
> This is what I see right now: http://i.imgur.com/4bYuhaq.png
>
> merchant collection looks fine. But deals and prodinfo collections should
> have a total of 3 shards. But someone shard1 has converted to replica of
> shard2.
>
> This is running in production, so how can I fix it without dumping the
> whole zk data?
>
> --
> Thanks,
> -Utkarsh
>



-- 
Thanks,
-Utkarsh


Re: shard merged into a another shard as replica

2014-01-22 Thread Mark Miller
Hopefully an issue that has been fixed then. We should look into that. 

You should be able to fix it by directly modifying the clusterstate.json in 
ZooKeeper. Remember to back it up first! 
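
If it helps, the backup step can also be scripted with SolrJ's ZooKeeper client
(a sketch; the ZooKeeper address is a placeholder and the actual JSON edit is
left out):

    import java.io.FileOutputStream;
    import org.apache.solr.common.cloud.SolrZkClient;

    public class ClusterStateBackup {
        public static void main(String[] args) throws Exception {
            SolrZkClient zk = new SolrZkClient("zk1:2181", 30000); // placeholder address
            byte[] state = zk.getData("/clusterstate.json", null, null, true);
            FileOutputStream out = new FileOutputStream("clusterstate.json.bak");
            out.write(state); // keep this backup before touching anything
            out.close();
            // ...edit the JSON offline, then push it back with:
            // zk.setData("/clusterstate.json", editedBytes, true);
            zk.close();
        }
    }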

There are a variety of tools you can use to work with ZooKeeper - I like the 
eclipse plug-in that you can google for. 

Many, many SolrCloud bug fixes (we are about to release 4.6.1) since 4.4, so 
you might consider an upgrade if possible at some point soon.

- Mark 



On Jan 22, 2014, 6:14:10 PM, Utkarsh Sengar  wrote: solr 
4.4.0


On Wed, Jan 22, 2014 at 3:12 PM, Mark Miller  wrote:

> What version of Solr are you running?
>
> - Mark
>
>
>
> On Jan 22, 2014, 5:42:30 PM, Utkarsh Sengar 
> wrote: I am not sure what happened, I updated merchant collection and then
> restarted all the solr machines.
>
> This is what I see right now: http://i.imgur.com/4bYuhaq.png
>
> merchant collection looks fine. But deals and prodinfo collections should
> have a total of 3 shards. But someone shard1 has converted to replica of
> shard2.
>
> This is running in production, so how can I fix it without dumping the
> whole zk data?
>
> --
> Thanks,
> -Utkarsh
>



--
Thanks,
-Utkarsh


Re: Searching and scoring with block join

2014-01-22 Thread dev


Zitat von Mikhail Khludnev :


On Wed, Jan 22, 2014 at 10:17 PM,  wrote:


I know that I can't just make a query like this: {!parent
which=is_parent:true}+Term, most likely I'll get this error: child query
must only match non-parent docs, but parent docID= matched
childScorer=class org.apache.lucene.search.TermScorer



Hello Gesh,

As it's state there child clause should not match any parent docs, but the
query +Term matches them because it applies some default field which, I
believe belongs to parent docs.

That blog has an example of searching across both 'scopes'
q=+BRAND_s:Nike +_query_:"{!parent which=type_s:parent}+COLOR_s:Red
+SIZE_s:XL"
mind exact fields specified for both scopes. In your case you need to
switch from conjunction '+' to disjunction.



Hello Mikhail,

Yes, that's correct.

I also already tried the query you gave as an example, but I have
problems with the scoring.
I'm using edismax as defType, but I'm not quite sure how to use it
with a {!parent} query.


For example, if I do this query, the score is always 0
{!parent which=is_parent:true}+content_de:Test

The blog says: ToParentBlockJoinQuery supports a few modes of score  
calculations. {!parent} parser has None mode hardcoded.
So, can I change the hardcoded mode somehow? I didn't find any further  
documentation about the parameters of {!parent}.
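
For context: the mode in question is the ScoreMode argument that Lucene's
ToParentBlockJoinQuery takes when it is built, and {!parent} builds it with
ScoreMode.None, which is why the score comes back as 0. A rough sketch of what a
parser would have to construct to propagate child scores (field names as above;
this is not something the stock {!parent} parser exposes):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;
    import org.apache.lucene.search.join.ScoreMode;
    import org.apache.lucene.search.join.ToParentBlockJoinQuery;

    public class BlockJoinScoring {
        public static Query build() {
            // Filter identifying the parent document of each block
            // (the indexed term value depends on the field type).
            Filter parents = new CachingWrapperFilter(
                    new QueryWrapperFilter(new TermQuery(new Term("is_parent", "true"))));
            // Child-level query whose score should be kept.
            Query children = new TermQuery(new Term("content_de", "Test"));
            // ScoreMode.Max (or Avg/Total) propagates child scores to the parent;
            // ScoreMode.None -- what {!parent} hardcodes -- discards them.
            return new ToParentBlockJoinQuery(children, parents, ScoreMode.Max);
        }
    }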


If I'm doing this request, the score seems to be calculated only from the
results found in "title".

title:Test _query_:"{!parent which=is_parent:true}+content_de:Test"

Sorry if I'm asking stupid questions, but I have just started to work with
Solr and some techniques are not yet familiar to me.


Thanks
-Gesh



Re: shard merged into a another shard as replica

2014-01-22 Thread Utkarsh Sengar
Thanks Mark. I tried updating the clusterstate manually, and things went haywire :).
So to fix it, I had to take 30secs-1min of downtime where I stopped Solr and ZK,
deleted the "/zookeeper_data/version-2" directory and restarted everything
again.

I have automated these commands via fabric, so I was easily able to recover
from the downtime.

Thanks,
-Utkarsh


On Wed, Jan 22, 2014 at 3:18 PM, Mark Miller  wrote:

> Hopefully an issue that has been fixed then. We should look into that.
>
> You should be able to fix it by directly modifying the clusterstate.json
> in ZooKeeper. Remember to back it up first!
>
> There are a variety of tools you can use to work with ZooKeeper - I like
> the eclipse plug-in that you can google for.
>
> Many, many SolrCloud bug fixes (we are about to release 4.6.1) since 4.4,
> so you might consider an upgrade if possible at some point soon.
>
> - Mark
>
>
>
> On Jan 22, 2014, 6:14:10 PM, Utkarsh Sengar 
> wrote: solr 4.4.0
>
>
> On Wed, Jan 22, 2014 at 3:12 PM, Mark Miller 
> wrote:
>
> > What version of Solr are you running?
> >
> > - Mark
> >
> >
> >
> > On Jan 22, 2014, 5:42:30 PM, Utkarsh Sengar 
> > wrote: I am not sure what happened, I updated merchant collection and
> then
> > restarted all the solr machines.
> >
> > This is what I see right now: http://i.imgur.com/4bYuhaq.png
> >
> > merchant collection looks fine. But deals and prodinfo collections should
> > have a total of 3 shards. But someone shard1 has converted to replica of
> > shard2.
> >
> > This is running in production, so how can I fix it without dumping the
> > whole zk data?
> >
> > --
> > Thanks,
> > -Utkarsh
> >
>
>
>
> --
> Thanks,
> -Utkarsh
>



-- 
Thanks,
-Utkarsh


Re: Solr middle-ware?

2014-01-22 Thread Alexandre Rafalovitch
I thought about Go, but that does not give the advantages of spanning
client and server like Dart and Node/Javascript. Which is why Dart
felt a bit more interesting, especially with tree-shaking of unused
code.

But then, neither language has enough adoption to be an answer to my
original question right now (existing middleware for new people to
pick). So, that's a more theoretical part of the discussion.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Thu, Jan 23, 2014 at 4:29 AM, Jorge Luis Betancourt González
 wrote:
> I would love to see some proxy-like application implemented in go (partly for 
> my desire of having time to check out go).
>
> - Original Message -
> From: "Shawn Heisey" 
> To: solr-user@lucene.apache.org
> Sent: Wednesday, January 22, 2014 10:38:34 AM
> Subject: Re: Solr middle-ware?
>
> On 1/22/2014 12:25 AM, Raymond Wiker wrote:
>> Speaking for myself, I avoid using "client apis" like SolrNet, SolrJ and
>> FAST DSAPI for the simple reason that I feel that the abstractions they
>> offer are so thin that I may just as well talk directly to the HTTP
>> interface. Doing that also lets me build web applications that maintain
>> their own state, which makes for more responsive and more robust
>> applications (although I'm sure there will be differing opinions on this).
>
> If you have the programming skill, this is absolutely a great way to go.
>  It does require a lot of knowledge and expertise, though.
>
> If you want to hammer out a quick program and be reasonably sure it's
> right, a client API handles a lot of the hard stuff for you.  When
> something changes in a new version of Solr that breaks a client API,
> just upgrading the client API is often enough to make the same code work
> again.
>
> I love SolrJ.  It's part of Solr itself, used internally for SolrCloud,
> and probably replication too.  It's thoroughly tested with the Solr test
> suite, and if used correctly, it's pretty much guaranteed to be
> compatible with the same version of Solr.  In most cases, it will work
> with other versions too.
>
> Thanks,
> Shawn
>
> 
> III Escuela Internacional de Invierno en la UCI del 17 al 28 de febrero del 
> 2014. Ver www.uci.cu


Re: Solr middle-ware?

2014-01-22 Thread lianyi
I've been thinking of using nodejs as a thin layer between the client and Solr
servers. It seems pretty handy for adding features like throttling, load
balancing and basic authentication. -lianyi

On Wed, Jan 22, 2014 at 7:36 PM, Alexandre Rafalovitch 
wrote:

> I thought about Go, but that does not give the advantages of spanning
> client and server like Dart and Node/Javascript. Which is why Dart
> felt a bit more interesting, especially with tree-shaking of unused
> code.
> But then, neither language has enough adoption to be an answer to my
> original question right now (existing middleware for new people to
> pick). So, that's a more theoretical part of the discussion.
> Regards,
>Alex.
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> book)
> On Thu, Jan 23, 2014 at 4:29 AM, Jorge Luis Betancourt González
>  wrote:
>> I would love to see some proxy-like application implemented in go (partly 
>> for my desire of having time to check out go).
>>
>> - Original Message -
>> From: "Shawn Heisey" 
>> To: solr-user@lucene.apache.org
>> Sent: Wednesday, January 22, 2014 10:38:34 AM
>> Subject: Re: Solr middle-ware?
>>
>> On 1/22/2014 12:25 AM, Raymond Wiker wrote:
>>> Speaking for myself, I avoid using "client apis" like SolrNet, SolrJ and
>>> FAST DSAPI for the simple reason that I feel that the abstractions they
>>> offer are so thin that I may just as well talk directly to the HTTP
>>> interface. Doing that also lets me build web applications that maintain
>>> their own state, which makes for more responsive and more robust
>>> applications (although I'm sure there will be differing opinions on this).
>>
>> If you have the programming skill, this is absolutely a great way to go.
>>  It does require a lot of knowledge and expertise, though.
>>
>> If you want to hammer out a quick program and be reasonably sure it's
>> right, a client API handles a lot of the hard stuff for you.  When
>> something changes in a new version of Solr that breaks a client API,
>> just upgrading the client API is often enough to make the same code work
>> again.
>>
>> I love SolrJ.  It's part of Solr itself, used internally for SolrCloud,
>> and probably replication too.  It's thoroughly tested with the Solr test
>> suite, and if used correctly, it's pretty much guaranteed to be
>> compatible with the same version of Solr.  In most cases, it will work
>> with other versions too.
>>
>> Thanks,
>> Shawn
>>
>> 
>> III Escuela Internacional de Invierno en la UCI del 17 al 28 de febrero del 
>> 2014. Ver www.uci.cu

Re: Solr Cloud Bulk Indexing Questions

2014-01-22 Thread Erick Erickson
When you're doing hard commits, is it with openSearcher = true or
false? It should probably be false...

Here's a rundown of the soft/hard commit consequences:

http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

I suspect (but, of course, can't prove) that you're over-committing
and hitting segment
merges without meaning to...
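
In code, the indexer side usually ends up looking something like this (a sketch;
the ZooKeeper address, collection name and batch size are placeholders): batch the
adds and issue no explicit commits at all, leaving hard commits (openSearcher=false)
and soft commits to the autoCommit settings in solrconfig.xml:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkIndexer {
        public static void main(String[] args) throws Exception {
            CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            solr.setDefaultCollection("mycollection");
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 100000; i++) {       // stand-in for the queue of ids
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                batch.add(doc);
                if (batch.size() == 500) {           // send batches, not single docs
                    solr.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) solr.add(batch);
            // No solr.commit() here: commits are left entirely to autoCommit/softAutoCommit.
            solr.shutdown();
        }
    }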

FWIW,
Erick

On Wed, Jan 22, 2014 at 1:46 PM, Software Dev  wrote:
> A suggestion would be to hard commit much less often, ie every 10
> minutes, and see if there is a change.
>
> - Will try this
>
> How much system RAM ? JVM Heap ? Enough space in RAM for system disk cache ?
>
> - We have 18GB of RAM, 12GB dedicated to Solr, but as of right now the total
> index size is only 5GB
>
> What is the size of your documents ? A few KB, MB, ... ?
>
> - Under 1MB
>
> Ah, and what about network IO ? Could that be a limiting factor ?
>
> - Again, the total index size is only 5GB so I don't know if this would be a
> problem
>
>
>
>
>
>
> On Wed, Jan 22, 2014 at 12:26 AM, Andre Bois-Crettez
> wrote:
>
>> 1 node having more load should be the leader (because of the extra work
>> of receiving and distributing updates, but my experiences show only a
>> bit more CPU usage, and no difference in disk IO).
>>
>> A suggestion would be to hard commit much less often, ie every 10
>> minutes, and see if there is a change.
>> How much system RAM ? JVM Heap ? Enough space in RAM for system disk cache
>> ?
>> What is the size of your documents ? A few KB, MB, ... ?
>> Ah, and what about network IO ? Could that be a limiting factor ?
>>
>>
>> André
>>
>>
>> On 2014-01-21 23:40, Software Dev wrote:
>>
>>> Any other suggestions?
>>>
>>>
>>> On Mon, Jan 20, 2014 at 2:49 PM, Software Dev 
>>> wrote:
>>>
>>>  4.6.0


 On Mon, Jan 20, 2014 at 2:47 PM, Mark Miller >>> >wrote:

  What version are you running?
>
> - Mark
>
> On Jan 20, 2014, at 5:43 PM, Software Dev 
> wrote:
>
>  We also noticed that disk IO shoots up to 100% on 1 of the nodes. Do
>> all
>> updates get sent to one machine or something?
>>
>>
>> On Mon, Jan 20, 2014 at 2:42 PM, Software Dev <
>>
> static.void@gmail.com>wrote:
>
>> We commit have a soft commit every 5 seconds and hard commit every 30.
>>>
>> As
>
>> far as docs/second it would guess around 200/sec which doesn't seem
>>>
>> that
>
>> high.
>>>
>>>
>>> On Mon, Jan 20, 2014 at 2:26 PM, Erick Erickson <
>>>
>> erickerick...@gmail.com>wrote:
>
>> Questions: How often do you commit your updates? What is your
 indexing rate in docs/second?

 In a SolrCloud setup, you should be using a CloudSolrServer. If the
 server is having trouble keeping up with updates, switching to CUSS
 probably wouldn't help.

 So I suspect there's something not optimal about your setup that's
 the culprit.

 Best,
 Erick

 On Mon, Jan 20, 2014 at 4:00 PM, Software Dev <

>>> static.void@gmail.com>
>
>> wrote:

> We are testing our shiny new Solr Cloud architecture but we are
> experiencing some issues when doing bulk indexing.
>
> We have 5 solr cloud machines running and 3 indexing machines
>
 (separate
>
>> from the cloud servers). The indexing machines pull off ids from a
>
 queue
>
>> then they index and ship over a document via a CloudSolrServer. It
>
 appears

> that the indexers are too fast because the load (particularly disk
>
 io)
>
>> on

> the solr cloud machines spikes through the roof making the entire
>
 cluster

> unusable. It's kind of odd because the total index size is not even
> large..ie, < 10GB. Are there any optimization/enhancements I could
>
 try
>
>> to

> help alleviate these problems?
>
> I should note that for the above collection we have only have 1
> shard
>
 thats

> replicated across all machines so all machines have the full index.
>
> Would we benefit from switching to a ConcurrentUpdateSolrServer
> where
>
 all

> updates get sent to 1 machine and 1 machine only? We could then
>
 remove
>
>> this

> machine from our cluster than that handles user requests.
>
> Thanks for any input.
>

>>>
>
>>> --
>>> André Bois-Crettez
>>>
>>> Software Architect
>>> Search Developer
>>> http://www.kelkoo.com/
>>>
>>
>> Kelkoo SAS
>> Société par Actions Simplifiée
>> Au capital de € 4.168.964,30
>> Siège social : 8, rue du Sentier 75002 Paris
>> 425 093 069 RCS Paris

Re: Solr/Lucene Faceted Search Too Many Unique Values?

2014-01-22 Thread Erick Erickson
A legitimate question that only you can answer is
"what's the value of faceting on fields with so many unique values?"

Consider the ridiculous case of faceting on your uniqueKey field. There's
almost exactly zero value in faceting on it, since all counts will be 1.

By analogy, with millions of tag values, will there ever be more than a very
small count for any facet value? And will showing those be useful to the
user?

They may be, and Yago has a use-case where the answer is "yes". Before
trying to make Solr perform in this instance, though, I'd review the use-case
to see if it makes sense.

Erick

On Wed, Jan 22, 2014 at 5:09 PM, Yago Riveiro  wrote:
> You will need to use DocValues if you want to use facets with this amount of 
> terms and not blow the heap.
>
> I have facets with ~39M of unique terms, the response time is about 10 ~ 40 
> seconds, in my case is not a problem.
>
> --
> Yago Riveiro
> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
>
>
> On Wednesday, January 22, 2014 at 10:59 PM, Bing Hua wrote:
>
>> Hi,
>>
>> I am going to evaluate some Lucene/Solr capabilities on handling faceted
>> queries, in particular, with a single facet field that contains large number
>> (say up to 1 million) of distinct values. Does anyone have some experience
>> on how lucene performs in this scenario?
>>
>> e.g.
>> Doc1 has tags A B C D 
>> Doc2 has tags B C D E 
>> etc etc millions of docs and there can be millions of distinct tag values.
>>
>> Thanks
>>
>>
>>
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/Solr-Lucene-Faceted-Search-Too-Many-Unique-Values-tp4112860.html
>> Sent from the Solr - User mailing list archive at Nabble.com 
>> (http://Nabble.com).
>>
>>
>
>


RE: Highlighting not working

2014-01-22 Thread Fatima Issawi
Hi,

I have stored=true for my "content" field, but I get an error saying there is a 
mismatch of settings on that field (I think) because of the "term*=true"  
settings.
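
For reference, the request being tested boils down to this (a SolrJ sketch; the
core URL and query term are placeholders):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class HighlightCheck {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            SolrQuery q = new SolrQuery("pagetext:someword");
            q.setHighlight(true);
            q.addHighlightField("content");
            q.addHighlightField("documentname");
            q.setHighlightSnippets(3);
            QueryResponse rsp = solr.query(q);
            // null/empty here usually means highlighting never ran; per-document empty
            // lists mean the highlighter found nothing in the stored text.
            System.out.println(rsp.getHighlighting());
            solr.shutdown();
        }
    }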

Thanks again,
Fatima



> -Original Message-
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: Wednesday, January 22, 2014 5:02 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Highlighting not working
> 
> Hi Fatima,
> 
> To enable highlighting (both standard and fastvector) you need to make the field
> stored="true".
> 
> Term vectors may speed up standard highlighter. Plus they are mandatory
> for FastVectorHighligher.
> 
> https://cwiki.apache.org/confluence/display/solr/Field+Properties+by+Use+
> Case
> 
> Ahmet
> 
> 
> 
> 
> 
> On Wednesday, January 22, 2014 10:44 AM, Fatima Issawi
>  wrote:
> Also my highlighting defaults...
> 
>   
>      
> 
>        
>        on
>        content documentname
>        html
>        
>        
>        0
>        documentname
>        3
>        200
>        content
>        750
> 
> 
> > -Original Message-
> > From: Fatima Issawi [mailto:issa...@qu.edu.qa]
> > Sent: Wednesday, January 22, 2014 11:34 AM
> > To: solr-user@lucene.apache.org
> > Subject: Highlighting not working
> >
> > Hello,
> >
> > I'm trying to highlight content that is returned from a Solr query,
> > but I can't seem to get it working.
> >
> > I would like to highlight the "documentname" and the "pagetext" or
> > "content" results, but when I run the search I don't get anything
> > returned. I thought that the "content" field is supposed to be used for
> hightlighting?
> > And that [termVectors="true" termPositions="true" termOffsets="true"]
> > needs to be added to the fields that need to be highlighted? Is there
> > something else I'm missing?
> >
> >
> > Here is my schema:
> >
> >     > required="true" multiValued="false" />
> >     > omitNorms="true"/>
> >     > stored="true" termVectors="true"  termPositions="true"
> > termOffsets="true"/>
> >    
> >   
> >     >stored="true"/>
> >    
> >     >stored="true"/>/>
> >     > termVectors="true" termPositions="true" termOffsets="true"/>
> >
> >     > multiValued="true" termVectors="true" termPositions="true"
> > termOffsets="true"/>
> >
> >     > multiValued="true"/>
> >
> >    
> >    
> >    
> >    
> >    
> >    
> >    
> >
> >
> > Thanks,
> > Fatima


Re: Optimizing index on Slave

2014-01-22 Thread Salman Akram
Unfortunately we can't do sharding right now.

If we optimize on master and slave separately, the file names and sizes are the
same. I think it's just the version number that is different. Maybe if there
were a way to copy the master version to the slave, that would resolve this issue?


Re: dataimport handler

2014-01-22 Thread Shalin Shekhar Mangar
I'm guessing that "id" in your schema.xml is also a unique key field.
If so, each document must have an id field or Solr will refuse to
index them.

DataImportHandler will map the id field in your table to Solr schema's
id field only if you have not specified a mapping.

On Thu, Jan 23, 2014 at 3:01 AM, tom  wrote:
> Hi,
> I am trying to use dataimporthandler(Solr 4.6) from oracle database, but I
> have some issues in mapping the data.
> I have 3 columns in the test_table,
>  column1,
>  column2,
>  id
>
> dataconfig.xml
>
>query="select * from test_table" >
> 
> 
> 
>
> Issue is,
> - if I remove the id column from the table, index fails, solr is looking for
> id column even though it is not mapped in dataconfig.xml.
> - if I add, it directly maps the id column form the db to solr id, it
> ignores the column1, even though it is mapped.
>
> my problem is I don't have ID in every table, I should be able to map the
> column I choose from the table to solr Id,  any solution will be greatly
> appreciated.
>
> `Tom
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/dataimport-handler-tp4112830.html
> Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Regards,
Shalin Shekhar Mangar.


Re: How to run a subsequent update query to documents indexed from a dataimport query

2014-01-22 Thread Dileepa Jayakody
Hi All,

I did some research on this and found some alternatives that may be useful for my
use case. Please give your ideas.

Can I update all documents indexed after a /dataimport query using the
last_index_time in dataimport.properties?
If so, can anyone please give me some pointers?
What I currently have in mind is something like below;

1. Store the indexing timestamp of the document as a field
eg: 

2. Read the last_index_time from the dataimport.properties

3. Query all document id's indexed after the last_index_time and send them
through the Stanbol update processor.
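
A rough sketch of step 3 in SolrJ (the field name "index_timestamp", the paging and
the core URL are assumptions; whatever the Stanbol chain needs beyond the id is left
to the custom processor):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class StanbolReprocess {
        // lastIndexTime would be read from dataimport.properties and converted to the
        // ISO date form Solr expects, e.g. "2014-01-22T10:15:00Z".
        public static void reprocess(HttpSolrServer solr, String lastIndexTime) throws Exception {
            SolrQuery q = new SolrQuery("index_timestamp:[" + lastIndexTime + " TO *]");
            q.setFields("id");
            q.setRows(1000); // page through properly for large result sets
            UpdateRequest update = new UpdateRequest();
            update.setParam("update.chain", "stanbolInterceptor");
            for (SolrDocument hit : solr.query(q).getResults()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", hit.getFirstValue("id"));
                // NOTE: a plain add with only the id would normally replace the document;
                // the custom chain is assumed to re-read and enrich the rest of it.
                update.add(doc);
            }
            update.process(solr);
        }
    }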

But I have a question here:
Does the last_index_time refer to when the dataimport is
started (onImportStart) or when the dataimport is finished (onImportEnd)?
If it's the onImportEnd timestamp, then this solution won't work, because the
timestamp indexed in the document field will satisfy:
onImportStart < doc-index-timestamp < onImportEnd.


Another alternative I can think of is triggering an update chain via an
EventListener configured to run after a dataimport is processed
(onImportEnd).
In this case, can the context in DIH give the list of document ids processed
in the /dataimport request? If so, I can send those doc ids with an /update
query to run the Stanbol update process.

Please give me your ideas and suggestions.

Thanks,
Dileepa




On Wed, Jan 22, 2014 at 6:14 PM, Dileepa Jayakody  wrote:

> Hi All,
>
> I have a Solr requirement to send all the documents imported from a
> /dataimport query to go through another update chain as a separate
> background process.
>
> Currently I have configured my custom update chain in the /dataimport
> handler itself. But since my custom update process need to connect to an
> external enhancement engine (Apache Stanbol) to enhance the documents with
> some NLP fields, it has a negative impact on /dataimport process.
> The solution will be to have a separate update process running to enhance
> the content of the documents imported from /dataimport.
>
> Currently I have configured my custom Stanbol Processor as below in my
> /dataimport handler.
>
> 
> 
>  data-config.xml
> stanbolInterceptor
>  
>
>
> 
>   class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
> 
>   
>
>
> What I need now is to separate the 2 processes of dataimport and
> stanbol-enhancement.
> So this is like runing a separate re-indexing process periodically over
> the documents imported from /dataimport for Stanbol fields.
>
> The question is how to trigger my Stanbol update process to the documents
> imported from /dataimport?
> In Solr to trigger /update query we need to know the id and the fields of
> the document to be updated. In my case I need to run all the documents
> imported from the previous /dataimport process through a stanbol
> update.chain.
>
> Is there a way to keep track of the documents ids imported from
> /dataimport?
> Any advice or pointers will be really helpful.
>
> Thanks,
> Dileepa
>


Solr Filter Query is not working

2014-01-22 Thread kumar
Hi,


I have some product details; when I am looking for different products at the
same time, it is not working.

I am using edismax and have configured the filter queries in the following way:

{!edismax v=$c}
_query_:"{!field f=product v=$p}"


For example, I am using the following query for filtering results:

http://localhost:8080/solr/CoreName/select?c=abc&p=(book or pen) .. Here
"abc" is available in both products.

It should show results from the "book" product as well as the "pen" product,
but it does not.

For the individual queries below, it shows the expected results:

http://localhost:8080/solr/CoreName/select?c=abc&p=book 

http://localhost:8080/solr/CoreName/select?c=abc&p=pen 


But with the OR condition it is not working. Can anybody help me?







--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Filter-Query-is-not-working-tp4112915.html
Sent from the Solr - User mailing list archive at Nabble.com.