date:20150831

Hi,

I've tried to split my collection from 1 shard to 2 shards using the
command:
http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1

The shard was split successfully with all the index intact. The search and
highlight gives the same results before and after the split.

However, when I tried to query the clustering, I found that the speed to
generate the cluster label become much slower after the split (used to be
average about 2 seconds, but now it takes about 5 seconds).
Also, the clustering labels produced before and after the split are
different. What could be the reason?

Below is my clustering handler for reference. I'm using Solr 5.2.1.

  


   explicit
   
  200
  edismax
   json
   true
  text
  null

  true
  true
 default

 id
  
  
 title
  
  url
  
  
 content
  
  true

 100
 2

  
  20
  
  false
 30

  


  clustering

  


Regards,
Edwin

Re: Get distinct results in Solr

2015-08-31 Thread Jan Høydahl

Hi

Check out the CollapsingQParser 
(https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results). 
As long as you have a field that will be the same for all duplicates, you can 
“collapse” on that field. If you not have a “group id”, you can create one 
using e.g. an MD5 signature of the identical body text 
(https://cwiki.apache.org/confluence/display/solr/De-Duplication).

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 31. aug. 2015 kl. 12.03 skrev Zheng Lin Edwin Yeo :
> 
> Hi,
> 
> I'm using Solr 5.2.1, and I would like to find out, what is the best way to
> get Solr to return only distinct results?
> 
> Currently, I've indexed several exact similar documents into Solr, with
> just different id and title, but the content is exactly the same. When I do
> a search, Solr will return all these documents several time in the list.
> 
> What is the most suitable way to get Solr to return only one of the
> document during the search?
> I understand that there is result grouping and faceting, but I'm not sure
> if that is the best way.
> 
> Regards,
> Edwin

Re: .nabble.com is indexing each post, is it possible to delete my post or hide email id

Apache only removes or modifies posts when personal information is
revealed, such as social security numbers. Email addresses and phone
numbers are not considered such. Apache has no control over Nabble and
such third party services.

I would suggest you resubscribe with a different email address that is
of less value to you (e.g. gmail or yahoo). Then continue to
participate, and any older posts will be so low down in search results
as to not matter.

Upayavira

On Mon, Aug 31, 2015, at 09:57 AM, Roshan Agarwal wrote:
> .nabble.com is indexing each post, is it possible to delete my post or
> hide
> email id
> 
> On Mon, Aug 10, 2015 at 11:24 AM, Roshan Agarwal 
> wrote:
> 
> > Dear All,
> >
> > Can any one let us know how to implement plagiarism Checker with solr,
> > how to index content with shingles and what to send in queries
> >
> > Roshan
> >
> > --
> >
> > Siddhast Ip innovation (P) ltd
> > 907 chandra vihar colony
> > Jhansi-284002
> > M:+919871549769
> > M:+917376314900
> >
> 
> 
> 
> -- 
> 
> Roshan Agarwal
> Director sales
> Siddhast Ip innovation (P) ltd
> 907 chandra vihar colony
> Jhansi-284002
> M:+919871549769
> M:+917376314900

Sorting parent documents based on a field from children

2015-08-31 Thread Florin Mandoc


Hi,

I am trying to model am index from a relational database and i have 3 
main entity types: products, buyers and sellers.
I am using nested documents for sellers and buyers, as i have many 
sellers and many buyers for one product:


{ "Active" : "true",
  "CategoryID" : 59,
  "CategoryName" : "Produce",
  "Id" : "227686",
  "ManufacturerID" : 322,
  "ManufacturerName" : "---",
  "Name" : "product name",
  "ProductID" : "227686",
  "SKU" : "DAFA2A1F047E438B8462667F987D80A5",
  "ShortDescription" : "s description",
  "type" : "product",
  "UOM" : "Unit",
  "UomSize" : "48",
  "_childDocuments_" : [ { "BuyerID" : 83,
"DisplayOrder" : 0,
"ProductID" : "227686",
"id" : "227686_83",
"type" : "buyer"
  },
  { "BuyerID" : 86,
"DisplayOrder" : 10,
"ProductID" : "227686",
"id" : "227686_86",
"type" : "buyer"
  },
  { "BuyerID" : 83,
"ProductID" : "227686",
"SellerID" : 84,
"SellerName" : "Seller 84",
"id" : "227686_83_84",
"type" : "seller"
  },
  { "BuyerID" : 83,
"ProductID" : "227686",
"SellerID" : 89,
"SellerName" : "Seller 89",
"id" : "227686_83_89",
"type" : "seller"
  }
],
  "_version_" : 1509403723402575872
}

To query i use:
http://localhost:8983/solr/dine/select?q=Name:"product 
name"&fq={!parent%20which=type:product v="type:buyer AND 
BuyerID=83"}&wt=json&indent=true&fl=*,[child%20parentFilter=type:product%20childFilter=%22((type:buyer%20AND%20BuyerSiteID:83)%20OR%20(type:seller%20AND%20BuyerSiteID:83))]&rows=1000


and i get the product, buyer and sellers details, but i want to have the 
products of BuyerID:83 sorted by DisplayOrder field.


Is this possible to achieve this, and how?

Thank you

Re: Sorting parent documents based on a field from children

2015-08-31 Thread Mikhail Khludnev

Florin,

I disclosure some details in the recent post
http://blog.griddynamics.com/2015/08/scoring-join-party-in-solr-53.html.
Let me know if you have further questions afterwards.
I also notice that you use "obvious" syntax: BuyerID=83 but it's hardly
ever possible. There is a good habit of debugQuery=true, which allows to
reconcile query interpretation.

On Mon, Aug 31, 2015 at 2:40 PM, Florin Mandoc  wrote:

> Hi,
>
> I am trying to model am index from a relational database and i have 3 main
> entity types: products, buyers and sellers.
> I am using nested documents for sellers and buyers, as i have many sellers
> and many buyers for one product:
>
> { "Active" : "true",
>   "CategoryID" : 59,
>   "CategoryName" : "Produce",
>   "Id" : "227686",
>   "ManufacturerID" : 322,
>   "ManufacturerName" : "---",
>   "Name" : "product name",
>   "ProductID" : "227686",
>   "SKU" : "DAFA2A1F047E438B8462667F987D80A5",
>   "ShortDescription" : "s description",
>   "type" : "product",
>   "UOM" : "Unit",
>   "UomSize" : "48",
>   "_childDocuments_" : [ { "BuyerID" : 83,
> "DisplayOrder" : 0,
> "ProductID" : "227686",
> "id" : "227686_83",
> "type" : "buyer"
>   },
>   { "BuyerID" : 86,
> "DisplayOrder" : 10,
> "ProductID" : "227686",
> "id" : "227686_86",
> "type" : "buyer"
>   },
>   { "BuyerID" : 83,
> "ProductID" : "227686",
> "SellerID" : 84,
> "SellerName" : "Seller 84",
> "id" : "227686_83_84",
> "type" : "seller"
>   },
>   { "BuyerID" : 83,
> "ProductID" : "227686",
> "SellerID" : 89,
> "SellerName" : "Seller 89",
> "id" : "227686_83_89",
> "type" : "seller"
>   }
> ],
>   "_version_" : 1509403723402575872
> }
>
> To query i use:
> http://localhost:8983/solr/dine/select?q=Name:"product
> name"&fq={!parent%20which=type:product v="type:buyer AND
> BuyerID=83"}&wt=json&indent=true&fl=*,[child%20parentFilter=type:product%20childFilter=%22((type:buyer%20AND%20BuyerSiteID:83)%20OR%20(type:seller%20AND%20BuyerSiteID:83))]&rows=1000
>
> and i get the product, buyer and sellers details, but i want to have the
> products of BuyerID:83 sorted by DisplayOrder field.
>
> Is this possible to achieve this, and how?
>
> Thank you
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

Re: testing with EmbeddedSolrServer

2015-08-31 Thread Mikhail Khludnev

Endre,

As I suggested before, consider to avoid test framework, just put all code
interacting with EmbeddedSolrServer into main() method.

On Mon, Aug 31, 2015 at 12:15 PM, Moen Endre  wrote:

> Hi Mikhail,
>
> Im trying to read 7-8 xml files of data that contain realistic data from
> our production server. Then I would like to read this data into
> EmbeddedSolrServer to test for edge cases for our custom date search. The
> use of EmbeddedSolrServer is purely to separate the data testing from any
> environment that might change over time.
>
> I would also like to avoid writing plumbing-code to import each field from
> the xml since I already have a working DIH.
>
> I tried adding synchronous=true but it doesn’t look like it makes solr
> complete the import before doing a search.
>
> Looking at the log it doesn’t seem process the import request:
> [searcherExecutor-6-thread-1-processing-{core=nmdc}] DEBUG
> o.apache.solr.core.SolrCore.Request - [nmdc] webapp=null path=null
> params={q=static+firstSearcher+warming+in+solrconfig.xml&distrib=false&event=firstSearcher}
> ...
> [TEST-TestSolrEmbeddedServer.testNodeConfigConstructor-seed#[41C3C11DE20DD5CE]]
> INFO  org.apache.solr.core.CoreContainer - registering core: nmdc
> 10:48:31.613
> [TEST-TestSolrEmbeddedServer.testNodeConfigConstructor-seed#[41C3C11DE20DD5CE]]
> INFO  o.apache.solr.core.SolrCore.Request - [nmdc] webapp=null
> path=/dataimport2
> params={qt=%2Fdataimport2&command=full-import%26clean%3Dtrue%26synchronous%3Dtrue}
> status=0 QTime=1
>
> {responseHeader={status=0,QTime=1},initArgs={defaults={config=dih-config.xml}},command=full-import&clean=true&synchronous=true,status=idle,importResponse=,statusMessages={}}
> [TEST-TestSolrEmbeddedServer.testNodeConfigConstructor-seed#[41C3C11DE20DD5CE]]
> DEBUG o.apache.solr.core.SolrCore.Request - [nmdc] webapp=null path=/select
> params={q=*%3A*}
> [TEST-TestSolrEmbeddedServer.testNodeConfigConstructor-seed#[41C3C11DE20DD5CE]]
> DEBUG o.a.s.h.component.QueryComponent - process:
> q=*:*&df=text&rows=10&echoParams=explicit
> [searcherExecutor-6-thread-1-processing-{core=nmdc}] DEBUG
> o.a.s.h.component.QueryComponent - process:
> q=static+firstSearcher+warming+in+solrconfig.xml&distrib=false&df=text&event=firstSearcher&rows=10&echoParams=explicit
> [searcherExecutor-6-thread-1-processing-{core=nmdc}] DEBUG
> o.a.s.search.stats.LocalStatsCache - ## GET
> {q=static+firstSearcher+warming+in+solrconfig.xml&distrib=false&df=text&event=firstSearcher&rows=10&echoParams=explicit}
> [searcherExecutor-6-thread-1-processing-{core=nmdc}] INFO
> o.apache.solr.core.SolrCore.Request - [nmdc] webapp=null path=null
> params={q=static+firstSearcher+warming+in+solrconfig.xml&distrib=false&event=firstSearcher}
> hits=0 status=0 QTime=36
> [searcherExecutor-6-thread-1-processing-{core=nmdc}] INFO
> org.apache.solr.core.SolrCore - QuerySenderListener done.
> [searcherExecutor-6-thread-1-processing-{core=nmdc}] INFO
> org.apache.solr.core.SolrCore - [nmdc] Registered new searcher
> Searcher@28be2785[nmdc]
> main{ExitableDirectoryReader(UninvertingDirectoryReader())}
> ...
> [TEST-TestSolrEmbeddedServer.testNodeConfigConstructor-seed#[41C3C11DE20DD5CE]]
> INFO  org.apache.solr.update.SolrCoreState - Closing SolrCoreState
> [TEST-TestSolrEmbeddedServer.testNodeConfigConstructor-seed#[41C3C11DE20DD5CE]]
> INFO  o.a.solr.update.DefaultSolrCoreState - SolrCoreState ref count has
> reached 0 - closing IndexWriter
> [TEST-TestSolrEmbeddedServer.testNodeConfigConstructor-seed#[41C3C11DE20DD5CE]]
> INFO  o.a.solr.update.DefaultSolrCoreState - closing IndexWriter with
> IndexWriterCloser
> [TEST-TestSolrEmbeddedServer.testNodeConfigConstructor-seed#[41C3C11DE20DD5CE]]
> DEBUG o.apache.solr.update.SolrIndexWriter - Closing Writer
> DirectUpdateHandler2
>
> Cheers
> Endre
>
> -Original Message-
> From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com]
> Sent: 25. august 2015 19:43
> To: solr-user
> Subject: Re: testing with EmbeddedSolrServer
>
> Hello,
>
> I'm trying to guess what are you doing. It's not clear so far.
> I found http://stackoverflow.com/questions/11951695/embedded-solr-dih
> My conclusion, if you play with DIH and EmbeddedSolrServer you'd better to
> avoid the third beast, you don't need to bother with tests.
> I guess that main() is over while DIH runs in background thread. You need
> to loop status command until import is over. or add synchronous=true
> parameter to full-import command it should switch to synchronous mode:
>
> https://github.com/apache/lucene-solr/blob/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DataImportHandler.java#L199
>
> Take care
>
>
> On Tue, Aug 25, 2015 at 4:41 PM, Moen Endre  wrote:
>
> > Is there an example of integration-testing with EmbeddedSolrServer
> > that loads data from a data importhandler - then queries the data? Ive
> > tried doing this based on
> > org.apache.solr.client.solrj.embedded.TestEmbeddedSolrServerConstructors.
> >
> > But n

Re: solrcloud and core swapping

It doesn't matter which node you do it on. And, you can replace an
existing alias by just creating another one with the same name.

Upayavira

On Mon, Aug 31, 2015, at 02:04 PM, Bill Au wrote:
> Thank, Shawn.  So I only need to issue the command to update the alias on
> one of the node in the SolrCloud cluster, right?  Should I do it on the
> leader or it doesn't matter which node I do it one?
> 
> Bill
> 
> On Fri, Aug 28, 2015 at 10:35 AM, Shawn Heisey 
> wrote:
> 
> > On 8/28/2015 8:25 AM, Shawn Heisey wrote:
> > > Instead, use collection aliasing. Create collections named something
> > > like foo_0 and foo_1, and update the alias "foo" to point to whichever
> > > of them is currently live. Your queries and update requests will never
> > > need to know about foo_0 and foo_1 ... only the coordinating part of
> > > your system, where you would normally do your core swapping, needs to
> > > know about those.
> >
> > You might also want to have a foo_build alias pointing to the *other*
> > collection for any "full rebuild" functionality, so it can also use a
> > static collection name.
> >
> > Thanks,
> > Shawn
> >
> >

Slow Replication between Shard & Replica

2015-08-31 Thread Maulin Rathod

We are using solrcloud 5.2 with 1 shard (in UK Data Center) and 1 replica
(in Australia Data Center). We observed that data inserted/updated in shard
(UK Data center) is replicated very slowly to Replica in AUSTRALIA Data
Center (Due to high latency between UK and AUSTRALIA). We are looking to
improve the speed of data replication from shard to replica. Can we use
some sort of compression before sending data to replica? Please let me know
if any other alternative is available to improve data replication speed
from shard to replica?

Re: Using join vs flattening structure

For 1-3, test and see. The problem I often see is that it is _assumed_ that
flattening the data will cost a lot in terms of index size and maintenance.
Test that assumption before going down the relational road.

You haven't talked about how many documents you have, how much data
would have to be replicated in each if you denormalized etc., so there's
not much guidance we can give.

I'll skip 4

5 probably another month or two in Solr 5.4

Best,
Erick

On Sun, Aug 30, 2015 at 6:59 PM, Brian Narsi  wrote:
> I have read a lot about using flattened structures in solr (instead of
> relational). Looks like it is preferable to use flattened structure. But in
> our case we have to consider  using (sort of) relational structure to keep
> index maintenance cost low.
>
> Does anyone have deeper insight into this?
>
> 1) When should we definitely use relational type of structure and use join?
> (instead of flattened structure)
>
> 2) When should we definitely use flattened structure (instead of
> relational)?
>
> 3) What are the signs that one has made a wrong choice of flattened vs
> relational?
>
> 4) Any best practices when relational structure and join is used?
>
> 5) I understand that parallel sql (in solr) will have more relational
> functionality support? Any ETA on when the parallel sql will support joins?
>
> Thanks for your help!

Re: Slow Replication between Shard & Replica

On Mon, Aug 31, 2015, at 02:23 PM, Maulin Rathod wrote:
> We are using solrcloud 5.2 with 1 shard (in UK Data Center) and 1 replica
> (in Australia Data Center). We observed that data inserted/updated in
> shard
> (UK Data center) is replicated very slowly to Replica in AUSTRALIA Data
> Center (Due to high latency between UK and AUSTRALIA). We are looking to
> improve the speed of data replication from shard to replica. Can we use
> some sort of compression before sending data to replica? Please let me
> know
> if any other alternative is available to improve data replication speed
> from shard to replica?

What sort of replication are you using? SolrCloud? I believe that
Zookeeper doesn't like high-latency setups, thus SolrCloud isn't, of
itself, a good way to replicate across datacentres. 

Traditional master/slave should be fine in that setup, or the
cross-datacentre replication (CDCR) work that has been done recently
might help you. Not so sure what state that is in - I'm sure Erick can
say more!

Upayavira

Re: 'missing content stream' issuing expungeDeletes=true

If you really must expunge deletes, use optimize. That will merge all
index segments into one, and in the process will remove any deleted
documents.

Why do you need to expunge deleted documents anyway? It is generally
done in the background for you, so you shouldn't need to worry about it.

Upayavira

On Mon, Aug 31, 2015, at 06:46 AM, davidphilip cherian wrote:
> Hi,
> 
> The below curl command worked without error, you can try.
> 
> curl http://localhost:8983/solr/techproducts/update?commit=true -H
> "Content-Type: text/xml" --data-binary ' expungeDeletes="true"/>'
> 
> However, after executing this, I could still see same deleted counts on
> dashboard.  Deleted Docs:6
> I am not sure whether that means,  the command did not take effect or it
> took effect but did not reflect on dashboard view.
> 
> 
> 
> 
> 
> On Mon, Aug 31, 2015 at 8:51 AM, Derek Poh 
> wrote:
> 
> > Hi
> >
> > I tried doing a expungeDeletes=true with the following but get the message
> > 'missing content stream'. What am I missing? I need to provide additional
> > parameters?
> >
> > curl 'http://127.0.0.1:8983/solr/supplier/update/json?expungeDeletes=true
> > ';
> >
> > Thanks,
> > Derek
> >
> > --
> > CONFIDENTIALITY NOTICE
> > This e-mail (including any attachments) may contain confidential and/or
> > privileged information. If you are not the intended recipient or have
> > received this e-mail in error, please inform the sender immediately and
> > delete this e-mail (including any attachments) from your computer, and you
> > must not use, disclose to anyone else or copy this e-mail (including any
> > attachments), whether in whole or in part.
> > This e-mail and any reply to it may be monitored for security, legal,
> > regulatory compliance and/or other appropriate reasons.
> >
> >

Re: Solr 4.6.1 Cloud Stops Replication

2015-08-31 Thread Rallavagu

Erick,

Apologies for missing out on status on indexing (replication) issues as
I have originally started this thread. After implementing
CloudSolrServer instead of CouncurrentUpdateSolrServer things were much
better. I simply wanted to follow up on understanding the memory
behavior better though we have tuned both heap and physical memory a
while ago.

Thanks

On 8/24/15 9:09 AM, Erick Erickson wrote:

bq: As a follow up, the default is set to "NRTCachingDirectoryFactory"
for DirectoryFactory but not MMapDirectory. It is mentioned that
NRTCachingDirectoryFactory "caches small files in memory for better
NRT performance".

NRTCachingDirectoryFactory also uses MMapDirectory under the covers as
well as "caches small files in memory"
so you really can't separate out the two.

I didn't mention this explicitly, but your original problem should
_not_ be happening in a well-tuned
system. Why your nodes go into a down state needs to be understood.
The connection timeout is
the only clue so far, and the usual reason here is that very long GC
pauses are happening. If this
continually happens, you might try turning on GC reporting options.

Best,
Erick

On Mon, Aug 24, 2015 at 2:47 AM, Rallavagu wrote:

As a follow up, the default is set to "NRTCachingDirectoryFactory" for
DirectoryFactory but not MMapDirectory. It is mentioned that
NRTCachingDirectoryFactory "caches small files in memory for better NRT
performance".

Wondering if the this would also consume physical memory to the amount of
MMap directory. Thoughts?

On 8/18/15 9:29 AM, Erick Erickson wrote:

Couple of things:

1> Here's an excellent backgrounder for MMapDirectory, which is
what makes it appear that Solr is consuming all the physical memory

http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

2> It's possible that your transaction log was huge. Perhaps not likely,
but possible. If Solr abnormally terminates (kill -9 is a prime way to do
this),
then upon restart the transaction log is replayed. This log is rolled over
upon
every hard commit (openSearcher true or false doesn't matter). So, in the
scenario where you are indexing a whole lot of stuff without committing,
then
it can take a very long time to replay the log. Not only that, but as you
do
replay the log, any incoming updates are written to the end of the tlog..
That
said, nothing in your e-mails indicates this could be a problem and it's
frankly not consistent with the errors you _do_ report but I thought
I'd mention it.
See:
https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
You can avoid the possibility of this by configuring your autoCommit
interval
to be relatively short (say 60 seconds) with openSearcher=false

3> ConcurrentUpdateSolrServer isn't the best thing for bulk loading
SolrCloud,
CloudSolrServer (renamed CloudSolrClient in 5.x) is better. CUSS sends all
the docs to some node, and from there that node figures out which
shard each doc belongs on and forwards the doc (actually in batches) to
the
appropriate leader. So doing what you're doing creates a lot of cross
chatter
amongst nodes. CloudSolrServer/Client figures that out on the client side
and
only sends packets to each leader that consist of only the docs the belong
on
that shard. You can getnearly linear throughput with increasing numbers of
shards this way.

Best,
Erick

On Tue, Aug 18, 2015 at 9:03 AM, Rallavagu wrote:

Thanks Shawn.

All participating cloud nodes are running Tomcat and as you suggested
will
review the number of threads and increase them as needed.

Essentially, what I have noticed was that two of four nodes caught up
with
"bulk" updates instantly while other two nodes took almost 3 hours to
completely in sync with "leader". I have "tickled" other nodes by sending
an
update thinking that it would initiate the replication but not sure if
that
caused other two nodes to eventually catch up.

On similar note, I was using "CouncurrentUpdateSolrServer" directly
pointing
to leader to bulk load Solr cloud. I have configured the chunk size and
thread count for the same. Is this the right practice to bulk load
SolrCloud?

Also, the maximum number of connections per host parameter for
"HttpShardHandler" is in solrconfig.xml I suppose?

Thanks

On 8/18/15 8:28 AM, Shawn Heisey wrote:

On 8/18/2015 8:18 AM, Rallavagu wrote:

Thanks for the response. Does this cache behavior influence the delay
in catching up with cloud? How can we explain solr cloud replication
and what are the option to monitor and take proactive action (such as
initializing, pausing etc) if needed?

I don't know enough about your setup to speculate.

I did notice this exception in a previous reply:

org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for
connection from pool

I can think of two things that would cause this.

One cause is that your servlet container is limiting the number of
available threads. A typical jetty or tomcat de

DataImportHandler scheduling

2015-08-31 Thread Troy Edwards

I am having a hard time finding documentation on DataImportHandler
scheduling in SolrCloud. Can someone please post a link to that? I have a
requirement that the DIH should be initiated at a specific time Monday
through Friday.

Thanks!

Re: DataImportHandler scheduling

2015-08-31 Thread Ahmet Arslan

Hi Troy,

I think folks use corncobs (with curl utility) provided by the Operating System.

Ahmet



On Monday, August 31, 2015 8:26 PM, Troy Edwards  
wrote:
I am having a hard time finding documentation on DataImportHandler
scheduling in SolrCloud. Can someone please post a link to that? I have a
requirement that the DIH should be initiated at a specific time Monday
through Friday.

Thanks!

RE: DataImportHandler scheduling

2015-08-31 Thread Davis, Daniel (NIH/NLM) [C]

So, I think corncobs is not a utility, but a pattern - you have cron run curl 
to invoke something on your web application on the localhost (and elsewhere), 
and it runs the job if the job needs running, thus the webapp keeps the state.

There's a utility cronlock (https://github.com/kvz/cronlock) that runs on top 
of Redis.   I was thinking that a common pattern would be something similar 
written in python using the kazoo module to dialog with zookeeper.   No point 
writing much Java for a cronjob, but python should be OK.   What I don't like 
about cronlock is that it isn't "run once", but instead avoids overlap, so 
there's good reason to write something specific to that case.

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] 
Sent: Monday, August 31, 2015 1:35 PM
To: solr-user@lucene.apache.org
Subject: Re: DataImportHandler scheduling

Hi Troy,

I think folks use corncobs (with curl utility) provided by the Operating System.

Ahmet



On Monday, August 31, 2015 8:26 PM, Troy Edwards  
wrote:
I am having a hard time finding documentation on DataImportHandler scheduling 
in SolrCloud. Can someone please post a link to that? I have a requirement that 
the DIH should be initiated at a specific time Monday through Friday.

Thanks!

Re: replication and HDFS

2015-08-31 Thread Joseph Obernberger

Thank you Erick.  What about cache size?  If we add replicas to our 
cluster and each replica has nGBytes of RAM allocated for HDFS caching, 
would that help performance?  Specifically the performance we want to 
increase is time to facet data, time to cluster data and search time.  
While we index a lot of data (~4 million docs per day), we do not 
perform that many searches of the data (~250 searches per day).


-Joe

On 8/20/2015 4:21 PM, Erick Erickson wrote:

Yes. Maybe. It Depends (tm).

Details matter (tm).

If you're firing just a few QPS at the system, then improved
throughput by adding replicas is unlikely. OTOH, if you're firing lots
of simultaneous queries at Solr and are pegging the processors, then
adding replication will increase aggregate QPS.

If your soft commit interval is very short and you're not doing proper
warming, it won't help at all in all probability.

Replication in Solr is about increasing the number of instances
available to serve queries. The two types of replication (HDFS or
Solr) are really orthogonal, the first is about data integrity and the
second is about increasing the number of Solr nodes available to
service queries.

Best,
Erick

On Thu, Aug 20, 2015 at 9:23 AM, Joseph Obernberger
 wrote:

Hi - we currently have a multi-shard setup running solr cloud without
replication running on top of HDFS.  Does it make sense to use replication
when using HDFS?  Will we expect to see a performance increase in searches?
Thank you!

-Joe

Custom merge logic in SolrCloud.

2015-08-31 Thread Mohan gupta

Hi Folks,

I need to merge docs received from multiple shards via a custom logic, a
straightforward score based priority queue doesn't work for my scenario (I
need to maintain a blend/distribution of docs).

How can I plugin my custom merge logic? One way might be to fully implement
the QueryComponent but that seems like a lot of work, is there a simpler
way?

I need my custom logic to kick-in in very specific cases and most of the
cases can still use default QueryComponent, was there a reason to make
merge functionality private (non-overridable) in the  QueryComponent class?

-- 
Regards ,
Mohan Gupta

Re: Using join vs flattening structure

2015-08-31 Thread Brian Narsi

We have about 15 million items. Each item has 10 attributes that we are
indexing at this time. We are planning on adding 15 more attributes in
future.

We have about 1 customers. Each of the items mentioned above can have
special pricing, etc for each of the customers. There are 6 attributes of
item that are different for each customer.

Erick - you have mentioned testing. What would be a good test scenario to
determine using flattened structure or relational?

Best,

On Mon, Aug 31, 2015 at 9:50 AM, Erick Erickson 
wrote:

> For 1-3, test and see. The problem I often see is that it is _assumed_ that
> flattening the data will cost a lot in terms of index size and maintenance.
> Test that assumption before going down the relational road.
>
> You haven't talked about how many documents you have, how much data
> would have to be replicated in each if you denormalized etc., so there's
> not much guidance we can give.
>
> I'll skip 4
>
> 5 probably another month or two in Solr 5.4
>
> Best,
> Erick
>
> On Sun, Aug 30, 2015 at 6:59 PM, Brian Narsi  wrote:
> > I have read a lot about using flattened structures in solr (instead of
> > relational). Looks like it is preferable to use flattened structure. But
> in
> > our case we have to consider  using (sort of) relational structure to
> keep
> > index maintenance cost low.
> >
> > Does anyone have deeper insight into this?
> >
> > 1) When should we definitely use relational type of structure and use
> join?
> > (instead of flattened structure)
> >
> > 2) When should we definitely use flattened structure (instead of
> > relational)?
> >
> > 3) What are the signs that one has made a wrong choice of flattened vs
> > relational?
> >
> > 4) Any best practices when relational structure and join is used?
> >
> > 5) I understand that parallel sql (in solr) will have more relational
> > functionality support? Any ETA on when the parallel sql will support
> joins?
> >
> > Thanks for your help!
>

Solr 5.3 Faceting on Children with Block Join Parser

2015-08-31 Thread Tom Devel

Apologies for cross posting a question from SO here.

I am very interested in the new faceting on child documents feature of Solr
5.3 and would like to know if somebody has figured out how to do it as
asked in the question on
http://stackoverflow.com/questions/32212949/solr-5-3-faceting-on-children-with-block-join-parser

Thanks for any hints,
Tom

The question is:

Solr 5.3 supports faceting on nested documents [1], with a great tutorial
from Yonik [2].

In the tutorial example, the query to get the documents for faceting is
directly performed on the child documents:

$ curl http://localhost:8983/solr/demo/query -d '
q=author_s:yonik&fl=id,comment_t&
json.facet={
genres : {
type: terms,
field: cat_s,
domain: { blockParent : "type_s:book" }
}
}'

What I do not know is how to facet on child documents returned from a Block
Join Parent Query Parser [3] and provided through ExpandComponent [4].

What I have working so far is the same as in the example from the
ExpandComponent [4]: Query the child fields to return the parent documents
(see 1.), then expand the result to get the relevant child documents (see
2.)


1. q={!parent which="type_s:parent" v='text_t:solr'}

2. &expand=true&expand.field=ISBN_s&expand.q=*:*

What I need:

Having steps 1.) and 2.) already working, how can we facet on some field
(does not matter which) of the returned child documents from (2.) ?

  [1]: http://yonik.com/solr-5-3/
  [2]: http://yonik.com/solr-nested-objects/
  [3]: https://cwiki.apache.org/confluence/display/solr/Other+Parsers
  [4]: http://heliosearch.org/expand-block-join/

Re: Using join vs flattening structure

Mostly just do the most naive data-flattening you can and see
how big the index is. You really have to generate the index then
run representative queries at it.

But naively flattening the data in this case approaches
15B documents, which is a problem, you're sharding over quite a
few shards etc.

Before even going there though, you need to pin your data model
down, the whole question about "what to flatten" is premature IMO.

For instance, how are you going to search this data? Do you
require searches like
"show me all the red things from customer X sort by price"?

Really, start from the _requirements_ and create your data model from
there _then_ start worrying about what needs to happen to make it fit
rather than worry about the index size/structure first.

Best,
Erick

On Mon, Aug 31, 2015 at 1:02 PM, Brian Narsi  wrote:
> We have about 15 million items. Each item has 10 attributes that we are
> indexing at this time. We are planning on adding 15 more attributes in
> future.
>
> We have about 1 customers. Each of the items mentioned above can have
> special pricing, etc for each of the customers. There are 6 attributes of
> item that are different for each customer.
>
> Erick - you have mentioned testing. What would be a good test scenario to
> determine using flattened structure or relational?
>
> Best,
>
> On Mon, Aug 31, 2015 at 9:50 AM, Erick Erickson 
> wrote:
>
>> For 1-3, test and see. The problem I often see is that it is _assumed_ that
>> flattening the data will cost a lot in terms of index size and maintenance.
>> Test that assumption before going down the relational road.
>>
>> You haven't talked about how many documents you have, how much data
>> would have to be replicated in each if you denormalized etc., so there's
>> not much guidance we can give.
>>
>> I'll skip 4
>>
>> 5 probably another month or two in Solr 5.4
>>
>> Best,
>> Erick
>>
>> On Sun, Aug 30, 2015 at 6:59 PM, Brian Narsi  wrote:
>> > I have read a lot about using flattened structures in solr (instead of
>> > relational). Looks like it is preferable to use flattened structure. But
>> in
>> > our case we have to consider  using (sort of) relational structure to
>> keep
>> > index maintenance cost low.
>> >
>> > Does anyone have deeper insight into this?
>> >
>> > 1) When should we definitely use relational type of structure and use
>> join?
>> > (instead of flattened structure)
>> >
>> > 2) When should we definitely use flattened structure (instead of
>> > relational)?
>> >
>> > 3) What are the signs that one has made a wrong choice of flattened vs
>> > relational?
>> >
>> > 4) Any best practices when relational structure and join is used?
>> >
>> > 5) I understand that parallel sql (in solr) will have more relational
>> > functionality support? Any ETA on when the parallel sql will support
>> joins?
>> >
>> > Thanks for your help!
>>

Re: replication and HDFS

Yes, No, Maybe.

bq; Specifically the performance we want to increase is time to facet
data, time to cluster data and search time

Well, that about covers everything ;)

You cannot talk about this without also taking about cache warming. Given your
setup, I'm guessing you have very few searches on the same Solr
searcher. Every time
you commit (hard with openSearcher=true or soft), you get a new searcher and
your top-level caches are  thrown away. The next request in will not
have any benefit
from the caches unless you've also done autowarming, look at the
counts for filterCache,
queryResultsCache and the newSearch and firstSearcher events.

So talking about significantly increasing cache size is premature until you know
you _use_ the caches.

And don't go wild with the autowarm counts for your caches, start
quite low in the
20-30 range IMO.

You'll particularly want to make newSearcher searches that exercise
your faceting and
reference all the fields you care about at least once.

Best,
Erick

On Mon, Aug 31, 2015 at 12:41 PM, Joseph Obernberger
 wrote:
> Thank you Erick.  What about cache size?  If we add replicas to our cluster
> and each replica has nGBytes of RAM allocated for HDFS caching, would that
> help performance?  Specifically the performance we want to increase is time
> to facet data, time to cluster data and search time.  While we index a lot
> of data (~4 million docs per day), we do not perform that many searches of
> the data (~250 searches per day).
>
> -Joe
>
> On 8/20/2015 4:21 PM, Erick Erickson wrote:
>>
>> Yes. Maybe. It Depends (tm).
>>
>> Details matter (tm).
>>
>> If you're firing just a few QPS at the system, then improved
>> throughput by adding replicas is unlikely. OTOH, if you're firing lots
>> of simultaneous queries at Solr and are pegging the processors, then
>> adding replication will increase aggregate QPS.
>>
>> If your soft commit interval is very short and you're not doing proper
>> warming, it won't help at all in all probability.
>>
>> Replication in Solr is about increasing the number of instances
>> available to serve queries. The two types of replication (HDFS or
>> Solr) are really orthogonal, the first is about data integrity and the
>> second is about increasing the number of Solr nodes available to
>> service queries.
>>
>> Best,
>> Erick
>>
>> On Thu, Aug 20, 2015 at 9:23 AM, Joseph Obernberger
>>  wrote:
>>>
>>> Hi - we currently have a multi-shard setup running solr cloud without
>>> replication running on top of HDFS.  Does it make sense to use
>>> replication
>>> when using HDFS?  Will we expect to see a performance increase in
>>> searches?
>>> Thank you!
>>>
>>> -Joe
>
>

Re: Solr 4.6.1 Cloud Stops Replication

OK, thanks for wrapping this up!

On Mon, Aug 31, 2015 at 10:08 AM, Rallavagu  wrote:
> Erick,
>
> Apologies for missing out on status on indexing (replication) issues as I
> have originally started this thread. After implementing CloudSolrServer
> instead of CouncurrentUpdateSolrServer things were much better. I simply
> wanted to follow up on understanding the memory behavior better though we
> have tuned both heap and physical memory a while ago.
>
> Thanks
>
> On 8/24/15 9:09 AM, Erick Erickson wrote:
>>
>> bq: As a follow up, the default is set to "NRTCachingDirectoryFactory"
>> for DirectoryFactory but not MMapDirectory. It is mentioned that
>> NRTCachingDirectoryFactory "caches small files in memory for better
>> NRT performance".
>>
>> NRTCachingDirectoryFactory also uses MMapDirectory under the covers as
>> well as "caches small files in memory"
>> so you really can't separate out the two.
>>
>> I didn't mention this explicitly, but your original problem should
>> _not_ be happening in a well-tuned
>> system. Why your nodes go into a down state needs to be understood.
>> The connection timeout is
>> the only clue so far, and the usual reason here is that very long GC
>> pauses are happening. If this
>> continually happens, you might try turning on GC reporting options.
>>
>> Best,
>> Erick
>>
>>
>> On Mon, Aug 24, 2015 at 2:47 AM, Rallavagu  wrote:
>>>
>>> As a follow up, the default is set to "NRTCachingDirectoryFactory" for
>>> DirectoryFactory but not MMapDirectory. It is mentioned that
>>> NRTCachingDirectoryFactory "caches small files in memory for better NRT
>>> performance".
>>>
>>> Wondering if the this would also consume physical memory to the amount of
>>> MMap directory. Thoughts?
>>>
>>> On 8/18/15 9:29 AM, Erick Erickson wrote:

 Couple of things:

 1> Here's an excellent backgrounder for MMapDirectory, which is
 what makes it appear that Solr is consuming all the physical memory

 http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

 2> It's possible that your transaction log was huge. Perhaps not likely,
 but possible. If Solr abnormally terminates (kill -9 is a prime way to
 do
 this),
 then upon restart the transaction log is replayed. This log is rolled
 over
 upon
 every hard commit (openSearcher true or false doesn't matter). So, in
 the
 scenario where you are indexing a whole lot of stuff without committing,
 then
 it can take a very long time to replay the log. Not only that, but as
 you
 do
 replay the log, any incoming updates are written to the end of the
 tlog..
 That
 said, nothing in your e-mails indicates this could be a problem and it's
 frankly not consistent with the errors you _do_ report but I thought
 I'd mention it.
 See:

 https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
 You can avoid the possibility of this by configuring your autoCommit
 interval
 to be relatively short (say 60 seconds) with openSearcher=false

 3> ConcurrentUpdateSolrServer isn't the best thing for bulk loading
 SolrCloud,
 CloudSolrServer (renamed CloudSolrClient in 5.x) is better. CUSS sends
 all
 the docs to some node, and from there that node figures out which
 shard each doc belongs on and forwards the doc (actually in batches) to
 the
 appropriate leader. So doing what you're doing creates a lot of cross
 chatter
 amongst nodes. CloudSolrServer/Client figures that out on the client
 side
 and
 only sends packets to each leader that consist of only the docs the
 belong
 on
 that shard. You can getnearly linear throughput with increasing numbers
 of
 shards this way.

 Best,
 Erick

 On Tue, Aug 18, 2015 at 9:03 AM, Rallavagu  wrote:
>
>
> Thanks Shawn.
>
> All participating cloud nodes are running Tomcat and as you suggested
> will
> review the number of threads and increase them as needed.
>
> Essentially, what I have noticed was that two of four nodes caught up
> with
> "bulk" updates instantly while other two nodes took almost 3 hours to
> completely in sync with "leader". I have "tickled" other nodes by
> sending
> an
> update thinking that it would initiate the replication but not sure if
> that
> caused other two nodes to eventually catch up.
>
> On similar note, I was using "CouncurrentUpdateSolrServer" directly
> pointing
> to leader to bulk load Solr cloud. I have configured the chunk size and
> thread count for the same. Is this the right practice to bulk load
> SolrCloud?
>
> Also, the maximum number of connections per host parameter for
> "HttpShardHandler" is in solrconfig.xml I suppose?
>
> Thanks
>
>
>
> On 8/18/15 8:28 AM, Shawn Heise

Re: Lucene/Solr 5.0 and custom FieldCahe implementation

2015-08-31 Thread Tomás Fernández Löbbe

Sorry Jamie, I totally missed this email. There was no Jira that I could
find. I created SOLR-7996

On Sat, Aug 29, 2015 at 5:26 AM, Jamie Johnson  wrote:

> This sounds like a good idea, I'm assuming I'd need to make my own
> UnInvertingReader (or subclass) to do this right?  Is there a way to do
> this on the 5.x codebase or would I still need the solrindexer factory work
> that Tomás mentioned previously?
>
> Tomás, is there a ticket for the SolrIndexer factory?  I'd like to follow
> it's work to know what version of 5.x (or later) I should be looking for
> this in.
>
> On Thu, Aug 27, 2015 at 1:06 PM, Yonik Seeley  wrote:
>
> > UnInvertingReader makes indexed fields look like docvalues fields.
> > The caching itself is still done in FieldCache/FieldCacheImpl
> > but you could perhaps wrap what is cached there to either screen out
> > stuff or construct a new entry based on the user.
> >
> > -Yonik
> >
> >
> > On Thu, Aug 27, 2015 at 12:55 PM, Jamie Johnson 
> wrote:
> > > I think a custom UnInvertingReader would work as I could skip the
> process
> > > of putting things in the cache.  Right now in Solr 4.x though I am
> > caching
> > > based but including the users authorities in the key of the cache so
> > we're
> > > not rebuilding the UnivertedField on every request.  Where in 5.x is
> the
> > > object actually cached?  Will this be possible in 5.x?
> > >
> > > On Thu, Aug 27, 2015 at 12:32 PM, Yonik Seeley 
> > wrote:
> > >
> > >> The FieldCache has become implementation rather than interface, so I
> > >> don't think you're going to see plugins at that level (it's all
> > >> package protected now).
> > >>
> > >> One could either subclass or re-implement UnInvertingReader though.
> > >>
> > >> -Yonik
> > >>
> > >>
> > >> On Thu, Aug 27, 2015 at 12:09 PM, Jamie Johnson 
> > wrote:
> > >> > Also in this vein I think that Lucene should support factories for
> the
> > >> > cache creation as described @
> > >> > https://issues.apache.org/jira/browse/LUCENE-2394.  I'm not
> endorsing
> > >> the
> > >> > patch that is provided (I haven't even looked at it) just the
> concept
> > in
> > >> > general.
> > >> >
> > >> > On Thu, Aug 27, 2015 at 12:01 PM, Jamie Johnson 
> > >> wrote:
> > >> >
> > >> >> That makes sense, then I could extend the SolrIndexSearcher by
> > creating
> > >> a
> > >> >> different factory class that did whatever magic I needed.  If you
> > >> create a
> > >> >> Jira ticket for this please link it here so I can track it!  Again
> > >> thanks
> > >> >>
> > >> >> On Thu, Aug 27, 2015 at 11:59 AM, Tomás Fernández Löbbe <
> > >> >> tomasflo...@gmail.com> wrote:
> > >> >>
> > >> >>> I don't think there is a way to do this now. Maybe we should
> > separate
> > >> the
> > >> >>> logic of creating the SolrIndexSearcher to a factory. Moving this
> > logic
> > >> >>> away from SolrCore is already a win, plus it will make it easier
> to
> > >> unit
> > >> >>> test and extend for advanced use cases.
> > >> >>>
> > >> >>> Tomás
> > >> >>>
> > >> >>> On Wed, Aug 26, 2015 at 8:10 PM, Jamie Johnson  >
> > >> wrote:
> > >> >>>
> > >> >>> > Sorry to poke this again but I'm not following the last comment
> of
> > >> how I
> > >> >>> > could go about extending the solr index searcher and have the
> > >> extension
> > >> >>> > used.  Is there an example of this?  Again thanks
> > >> >>> >
> > >> >>> > Jamie
> > >> >>> > On Aug 25, 2015 7:18 AM, "Jamie Johnson" 
> > wrote:
> > >> >>> >
> > >> >>> > > I had seen this as well, if I over wrote this by extending
> > >> >>> > > SolrIndexSearcher how do I have my extension used?  I didn't
> > see a
> > >> way
> > >> >>> > that
> > >> >>> > > could be plugged in.
> > >> >>> > > On Aug 25, 2015 7:15 AM, "Mikhail Khludnev" <
> > >> >>> mkhlud...@griddynamics.com>
> > >> >>> > > wrote:
> > >> >>> > >
> > >> >>> > >> On Tue, Aug 25, 2015 at 2:03 PM, Jamie Johnson <
> > jej2...@gmail.com
> > >> >
> > >> >>> > wrote:
> > >> >>> > >>
> > >> >>> > >> > Thanks Mikhail.  If I'm reading the SimpleFacets class
> > >> correctly,
> > >> >>> out
> > >> >>> > >> > delegates to DocValuesFacets when facet method is FC, what
> > used
> > >> to
> > >> >>> be
> > >> >>> > >> > FieldCache I believe.  DocValuesFacets either uses
> DocValues
> > or
> > >> >>> builds
> > >> >>> > >> then
> > >> >>> > >> > using the UninvertingReader.
> > >> >>> > >> >
> > >> >>> > >>
> > >> >>> > >> Ah.. got it. Thanks for reminding this details.It seems like
> > even
> > >> >>> > >> docValues=true doesn't help with your custom implementation.
> > >> >>> > >>
> > >> >>> > >>
> > >> >>> > >> >
> > >> >>> > >> > I am not seeing a clean extension point to add a custom
> > >> >>> > >> UninvertingReader
> > >> >>> > >> > to Solr, would the only way be to copy the FacetComponent
> and
> > >> >>> > >> SimpleFacets
> > >> >>> > >> > and modify as needed?
> > >> >>> > >> >
> > >> >>> > >> Sadly, yes. There is no proper extension point. Also,
> consider
> > >> >>> > overriding
> > >> >>> > >> SolrIndexSearcher.wrapRe

Re: Issue Using Solr 5.3 Authentication and Authorization Plugins

2015-08-31 Thread Kevin Lee

Anyone else running into any issues trying to get the authentication and 
authorization plugins in 5.3 working?

> On Aug 29, 2015, at 2:30 AM, Kevin Lee  wrote:
> 
> Hi,
> 
> I’m trying to use the new basic auth plugin for Solr 5.3 and it doesn’t seem 
> to be working quite right.  Not sure if I’m missing steps or there is a bug.  
> I am able to get it to protect access to a URL under a collection, but am 
> unable to get it to secure access to the Admin UI.  In addition, after 
> stopping the Solr and Zookeeper instances, the security.json is still in 
> Zookeeper, however Solr is allowing access to everything again like the 
> security configuration isn’t in place.
> 
> Contents of security.json taken from wiki page, but edited to produce valid 
> JSON.  Had to move comma after 3rd from last “}” up to just after the last 
> “]”.
> 
> {
> "authentication":{
>   "class":"solr.BasicAuthPlugin",
>   "credentials":{"solr":"IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= 
> Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="}
> },
> "authorization":{
>   "class":"solr.RuleBasedAuthorizationPlugin",
>   "permissions":[{"name":"security-edit",
>  "role":"admin"}],
>   "user-role":{"solr":"admin"}
> }}
> 
> Here are the steps I followed:
> 
> Upload security.json to zookeeper
> ./zkcli.sh -z localhost:2181,localhost:2182,localhost:2183 -cmd putfile 
> /security.json ~/solr/security.json
> 
> Use zkCli.sh from Zookeeper to ensure the security.json is in Zookeeper at 
> /security.json.  It is there and looks like what was originally uploaded.
> 
> Start Solr Instances
> 
> Attempt to create a permission, however get the following error:
> {
>  "responseHeader":{
>"status":400,
>"QTime":0},
>  "error":{
>"msg":"No authorization plugin configured",
>"code":400}}
> 
> Upload security.json again.
> ./zkcli.sh -z localhost:2181,localhost:2182,localhost:2183 -cmd putfile 
> /security.json ~/solr/security.json
> 
> Issue the following to try to create the permission again and this time it’s 
> successful.
> // Create a permission for mysearch endpoint
>curl --user solr:SolrRocks -H 'Content-type:application/json' -d 
> '{"set-permission": {"name":"mycollection-search","collection": 
> “mycollection","path":”/mysearch","role": "search-user"}}' 
> http://localhost:8983/solr/admin/authorization
>
>{
>  "responseHeader":{
>"status":0,
>"QTime":7}}
>
> Issue the following commands to add users
> curl --user solr:SolrRocks http://localhost:8983/solr/admin/authentication -H 
> 'Content-type:application/json' -d '{"set-user": {"admin" : “password" }}’
> curl --user solr:SolrRocks http://localhost:8983/solr/admin/authentication -H 
> 'Content-type:application/json' -d '{"set-user": {"user" : “password" }}'
> 
> Issue the following command to add permission to users
> curl -u solr:SolrRocks -H 'Content-type:application/json' -d '{ 
> "set-user-role" : {"admin": ["search-user", "admin"]}}' 
> http://localhost:8983/solr/admin/authorization
> curl -u solr:SolrRocks -H 'Content-type:application/json' -d '{ 
> "set-user-role" : {"user": ["search-user"]}}' 
> http://localhost:8983/solr/admin/authorization
> 
> After executing the above, access to /mysearch is protected until I restart 
> the Solr and Zookeeper instances.  However, the admin UI is never protected 
> like the Wiki page says it should be once activated.
> 
> https://cwiki.apache.org/confluence/display/solr/Rule-Based+Authorization+Plugin
>  
> 
> 
> Why does the authentication and authorization plugin not stay activated after 
> restart and why is the Admin UI never protected?  Am I missing any steps?
> 
> Thanks,
> Kevin

Re: Slow Replication between Shard & Replica

2015-08-31 Thread Shawn Heisey

On 8/31/2015 7:23 AM, Maulin Rathod wrote:
> We are using solrcloud 5.2 with 1 shard (in UK Data Center) and 1 replica
> (in Australia Data Center). We observed that data inserted/updated in shard
> (UK Data center) is replicated very slowly to Replica in AUSTRALIA Data
> Center (Due to high latency between UK and AUSTRALIA). We are looking to
> improve the speed of data replication from shard to replica. Can we use
> some sort of compression before sending data to replica? Please let me know
> if any other alternative is available to improve data replication speed
> from shard to replica?

SolrCloud replicates data differently than many people expect,
especially if they are familiar with how replication worked prior to
SolrCloud's introduction in Solr 4.0.  The original document is sent to
all replicas and each one indexes it independently.  This is HTTP
traffic, containing the document data after the initial update
processors are finished with it.

TCP connections across international lines, and oceans in particular,
are slow, because of the high latency.  The physical distance covered by
the speed of light is one problem, but international links usually
involve a number of additional routers, which also slows it down.  My
employer has been dealing with this problem for years when copying files
from one location to another.  One of the things available to help with
this problem is modern TCP stacks that scale the TCP window effectively,
so fewer acknowledgements are required.

If you are running Solr on Linux machines that are running any recent
kernel version (2.6 definitely qualifies, but I think 2.4 does as well),
and you haven't turned on SYN cookies or explicitly disabled the
scaling, you should be automatically scaling your TCP window.  If you
are on Windows Server 2008 or Windows 7 (or versions later than these)
and haven't poked around in the TCP tuning options, then you would also
be OK.  If either end of the communication is Windows XP, Server 2003,
or an older version of Windows, you're out of luck and will need to
upgrade the operating system.

The requests involved in SolrCloud indexing may be too short-lived to
benefit much from scaling, though.  Window scaling typically only helps
when the TCP connection lives for more than a few seconds, like an FTP
data transfer.  Each individual inter-server indexing request is likely
only transmitting 10 documents.

Even when TCP window scaling is present, if there is *ANY* packet loss
anywhere in a high-latency path, transfer speed will drop dramatically. 
In the lab, I built a simulated setup emulating our connection to our UK
office.  Even with 130 milliseconds of round-trip latency added by the
Linux router impersonating the Internet, transfer speeds of photo-sized
files on a modern TCP stack were good ... until I also introduced packet
loss.  Transfer speeds were BADLY affected by even one tenth of one
percent packet loss, which is the lowest amount I tested.

SolrCloud is highly optimized for the way it is usually installed -- on
multiple machines connected together with one or more LAN switches. 
This is why it uses lots of little connections.  The new cross-data
center replication (CDCR) feature is an attempt to better utilize
high-latency WAN links.

In Solr 5.x, the web server is more firmly under the control of the Solr
development team, so compression and other improvements may be possible,
but latency is the major problem here, not a lack of features.  I'm not
sure whether the number of documents per update (currently defaulting to
10) is configurable, but with a modern TCP stack, increasing that number
could make the transfer more efficient, assuming the communication link
is clean.

Thanks,
Shawn

Re: DataImportHandler scheduling

2015-08-31 Thread Shawn Heisey

On 8/31/2015 11:26 AM, Troy Edwards wrote:
> I am having a hard time finding documentation on DataImportHandler
> scheduling in SolrCloud. Can someone please post a link to that? I have a
> requirement that the DIH should be initiated at a specific time Monday
> through Friday.

Every modern operating system (and most of the previous versions of
every modern OS) has a built-in task scheduling system.  For Windows,
it's literally called Task Scheduler.  For most other operating systems,
it's called cron.

Including dataimport scheduling capability in Solr has been discussed,
and I think someone even wrote a working version ... but since every OS
already has scheduling capability that has had years of time to mature,
why should Solr reinvent the wheel and take the risk that the
implementation will have bugs?

Currently virtually all updates to Solr's index must be initiated
outside of Solr, and there is good reason to make sure that Solr doesn't
ever modify the index without outside input.  The only thing I know of
right now that can update the index automatically is Document
Expiration, but the expiration time is decided when the document is
indexed, and the original indexing action is external to Solr.

https://lucidworks.com/blog/document-expiration/

Thanks,
Shawn

Overseer Leader gone

2015-08-31 Thread Rishi Easwaran

Hi All,

I have a cluster that has the overseer leader gone. This is on Solr 4.10.3 
version.
Its completely gone from zookeeper and bouncing any instance does not start a 
new election process.
Anyone experience this issue before and any ideas to fix this.

Thanks,
Rishi.

Re: 'missing content stream' issuing expungeDeletes=true

2015-08-31 Thread Derek Poh

Hi Upayavira

In fact we are using optimize currently but was advised to use expunge
deletes as it is less resource intensive.
So expunge deletes will only remove deleted documents, it will not merge
all index segments into one?

If we don't use optimize, the deleted documents in the index will affect
the scores (with docFreq=2) of the matched documents which will affect
the relevancy of the search result.

Derek

On 9/1/2015 12:05 AM, Upayavira wrote:

If you really must expunge deletes, use optimize. That will merge all
index segments into one, and in the process will remove any deleted
documents.

Why do you need to expunge deleted documents anyway? It is generally
done in the background for you, so you shouldn't need to worry about it.

Upayavira

On Mon, Aug 31, 2015, at 06:46 AM, davidphilip cherian wrote:

Hi,

The below curl command worked without error, you can try.

curl http://localhost:8983/solr/techproducts/update?commit=true -H
"Content-Type: text/xml" --data-binary ''

However, after executing this, I could still see same deleted counts on
dashboard. Deleted Docs:6
I am not sure whether that means, the command did not take effect or it
took effect but did not reflect on dashboard view.

On Mon, Aug 31, 2015 at 8:51 AM, Derek Poh
wrote:

I tried doing a expungeDeletes=true with the following but get the message
'missing content stream'. What am I missing? I need to provide additional
parameters?

curl 'http://127.0.0.1:8983/solr/supplier/update/json?expungeDeletes=true
';

Thanks,
Derek

--
CONFIDENTIALITY NOTICE
This e-mail (including any attachments) may contain confidential and/or
privileged information. If you are not the intended recipient or have
received this e-mail in error, please inform the sender immediately and
delete this e-mail (including any attachments) from your computer, and you
must not use, disclose to anyone else or copy this e-mail (including any
attachments), whether in whole or in part.
This e-mail and any reply to it may be monitored for security, legal,
regulatory compliance and/or other appropriate reasons.

--
CONFIDENTIALITY NOTICE

This e-mail (including any attachments) may contain confidential and/or privileged information. If you are not the intended recipient or have received this e-mail in error, please inform the sender immediately and delete this e-mail (including any attachments) from your computer, and you must not use, disclose to anyone else or copy this e-mail (including any attachments), whether in whole or in part.

This e-mail and any reply to it may be monitored for security, legal,
regulatory compliance and/or other appropriate reasons.

Re: Get distinct results in Solr

Thanks Jan.

But I read that the field that is being collapsed on must be a single
valued String, Int or Float. As I'm required to get the distinct results
from "content" field that was indexed from a rich text document, I got the
following error:

  "error":{
"msg":"java.io.IOException: 64 bit numeric collapse fields are not
supported",
"trace":"java.lang.RuntimeException: java.io.IOException: 64 bit
numeric collapse fields are not supported\r\n\tat


Is it possible to collapsed on fields which has a long integer of data,
like content from a rich text document?

Regards,
Edwin


On 31 August 2015 at 18:59, Jan Høydahl  wrote:

> Hi
>
> Check out the CollapsingQParser (
> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results).
> As long as you have a field that will be the same for all duplicates, you
> can “collapse” on that field. If you not have a “group id”, you can create
> one using e.g. an MD5 signature of the identical body text (
> https://cwiki.apache.org/confluence/display/solr/De-Duplication).
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 31. aug. 2015 kl. 12.03 skrev Zheng Lin Edwin Yeo  >:
> >
> > Hi,
> >
> > I'm using Solr 5.2.1, and I would like to find out, what is the best way
> to
> > get Solr to return only distinct results?
> >
> > Currently, I've indexed several exact similar documents into Solr, with
> > just different id and title, but the content is exactly the same. When I
> do
> > a search, Solr will return all these documents several time in the list.
> >
> > What is the most suitable way to get Solr to return only one of the
> > document during the search?
> > I understand that there is result grouping and faceting, but I'm not sure
> > if that is the best way.
> >
> > Regards,
> > Edwin
>
>

Re: Get distinct results in Solr

2015-08-31 Thread Alexandre Rafalovitch

Can't you just treat it as String?

Also, do you actually want those documents in your index in the first
place? If not, have you looked at De-duplication:
https://cwiki.apache.org/confluence/display/solr/De-Duplication

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 31 August 2015 at 22:00, Zheng Lin Edwin Yeo  wrote:
> Thanks Jan.
>
> But I read that the field that is being collapsed on must be a single
> valued String, Int or Float. As I'm required to get the distinct results
> from "content" field that was indexed from a rich text document, I got the
> following error:
>
>   "error":{
> "msg":"java.io.IOException: 64 bit numeric collapse fields are not
> supported",
> "trace":"java.lang.RuntimeException: java.io.IOException: 64 bit
> numeric collapse fields are not supported\r\n\tat
>
>
> Is it possible to collapsed on fields which has a long integer of data,
> like content from a rich text document?
>
> Regards,
> Edwin
>
>
> On 31 August 2015 at 18:59, Jan Høydahl  wrote:
>
>> Hi
>>
>> Check out the CollapsingQParser (
>> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results).
>> As long as you have a field that will be the same for all duplicates, you
>> can “collapse” on that field. If you not have a “group id”, you can create
>> one using e.g. an MD5 signature of the identical body text (
>> https://cwiki.apache.org/confluence/display/solr/De-Duplication).
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>>
>> > 31. aug. 2015 kl. 12.03 skrev Zheng Lin Edwin Yeo > >:
>> >
>> > Hi,
>> >
>> > I'm using Solr 5.2.1, and I would like to find out, what is the best way
>> to
>> > get Solr to return only distinct results?
>> >
>> > Currently, I've indexed several exact similar documents into Solr, with
>> > just different id and title, but the content is exactly the same. When I
>> do
>> > a search, Solr will return all these documents several time in the list.
>> >
>> > What is the most suitable way to get Solr to return only one of the
>> > document during the search?
>> > I understand that there is result grouping and faceting, but I'm not sure
>> > if that is the best way.
>> >
>> > Regards,
>> > Edwin
>>
>>

Re: Get distinct results in Solr

Hi Alexandre,

Will treating it as String affect the search or other functions like
highlighting?

Yes, the content must be in my index, unless I do a copyField to do
de-duplication on that field.. Will that help?

Regards,
Edwin


On 1 September 2015 at 10:04, Alexandre Rafalovitch 
wrote:

> Can't you just treat it as String?
>
> Also, do you actually want those documents in your index in the first
> place? If not, have you looked at De-duplication:
> https://cwiki.apache.org/confluence/display/solr/De-Duplication
>
> Regards,
>Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 31 August 2015 at 22:00, Zheng Lin Edwin Yeo 
> wrote:
> > Thanks Jan.
> >
> > But I read that the field that is being collapsed on must be a single
> > valued String, Int or Float. As I'm required to get the distinct results
> > from "content" field that was indexed from a rich text document, I got
> the
> > following error:
> >
> >   "error":{
> > "msg":"java.io.IOException: 64 bit numeric collapse fields are not
> > supported",
> > "trace":"java.lang.RuntimeException: java.io.IOException: 64 bit
> > numeric collapse fields are not supported\r\n\tat
> >
> >
> > Is it possible to collapsed on fields which has a long integer of data,
> > like content from a rich text document?
> >
> > Regards,
> > Edwin
> >
> >
> > On 31 August 2015 at 18:59, Jan Høydahl  wrote:
> >
> >> Hi
> >>
> >> Check out the CollapsingQParser (
> >>
> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
> ).
> >> As long as you have a field that will be the same for all duplicates,
> you
> >> can “collapse” on that field. If you not have a “group id”, you can
> create
> >> one using e.g. an MD5 signature of the identical body text (
> >> https://cwiki.apache.org/confluence/display/solr/De-Duplication).
> >>
> >> --
> >> Jan Høydahl, search solution architect
> >> Cominvent AS - www.cominvent.com
> >>
> >> > 31. aug. 2015 kl. 12.03 skrev Zheng Lin Edwin Yeo <
> edwinye...@gmail.com
> >> >:
> >> >
> >> > Hi,
> >> >
> >> > I'm using Solr 5.2.1, and I would like to find out, what is the best
> way
> >> to
> >> > get Solr to return only distinct results?
> >> >
> >> > Currently, I've indexed several exact similar documents into Solr,
> with
> >> > just different id and title, but the content is exactly the same.
> When I
> >> do
> >> > a search, Solr will return all these documents several time in the
> list.
> >> >
> >> > What is the most suitable way to get Solr to return only one of the
> >> > document during the search?
> >> > I understand that there is result grouping and faceting, but I'm not
> sure
> >> > if that is the best way.
> >> >
> >> > Regards,
> >> > Edwin
> >>
> >>
>

Re: Get distinct results in Solr

2015-08-31 Thread Alexandre Rafalovitch

Re-read the question. You want to de-dupe on the full text-content.

I would actually try to use the dedupe chain as per the link I gave
but put results into a separate string field. Then, you group on that
field. You cannot actually group on the long text field, that would
kill any performance. So a signature is your proxy.

Regards,
   Alex

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 31 August 2015 at 22:26, Zheng Lin Edwin Yeo  wrote:
> Hi Alexandre,
>
> Will treating it as String affect the search or other functions like
> highlighting?
>
> Yes, the content must be in my index, unless I do a copyField to do
> de-duplication on that field.. Will that help?
>
> Regards,
> Edwin
>
>
> On 1 September 2015 at 10:04, Alexandre Rafalovitch 
> wrote:
>
>> Can't you just treat it as String?
>>
>> Also, do you actually want those documents in your index in the first
>> place? If not, have you looked at De-duplication:
>> https://cwiki.apache.org/confluence/display/solr/De-Duplication
>>
>> Regards,
>>Alex.
>> 
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>>
>>
>> On 31 August 2015 at 22:00, Zheng Lin Edwin Yeo 
>> wrote:
>> > Thanks Jan.
>> >
>> > But I read that the field that is being collapsed on must be a single
>> > valued String, Int or Float. As I'm required to get the distinct results
>> > from "content" field that was indexed from a rich text document, I got
>> the
>> > following error:
>> >
>> >   "error":{
>> > "msg":"java.io.IOException: 64 bit numeric collapse fields are not
>> > supported",
>> > "trace":"java.lang.RuntimeException: java.io.IOException: 64 bit
>> > numeric collapse fields are not supported\r\n\tat
>> >
>> >
>> > Is it possible to collapsed on fields which has a long integer of data,
>> > like content from a rich text document?
>> >
>> > Regards,
>> > Edwin
>> >
>> >
>> > On 31 August 2015 at 18:59, Jan Høydahl  wrote:
>> >
>> >> Hi
>> >>
>> >> Check out the CollapsingQParser (
>> >>
>> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
>> ).
>> >> As long as you have a field that will be the same for all duplicates,
>> you
>> >> can “collapse” on that field. If you not have a “group id”, you can
>> create
>> >> one using e.g. an MD5 signature of the identical body text (
>> >> https://cwiki.apache.org/confluence/display/solr/De-Duplication).
>> >>
>> >> --
>> >> Jan Høydahl, search solution architect
>> >> Cominvent AS - www.cominvent.com
>> >>
>> >> > 31. aug. 2015 kl. 12.03 skrev Zheng Lin Edwin Yeo <
>> edwinye...@gmail.com
>> >> >:
>> >> >
>> >> > Hi,
>> >> >
>> >> > I'm using Solr 5.2.1, and I would like to find out, what is the best
>> way
>> >> to
>> >> > get Solr to return only distinct results?
>> >> >
>> >> > Currently, I've indexed several exact similar documents into Solr,
>> with
>> >> > just different id and title, but the content is exactly the same.
>> When I
>> >> do
>> >> > a search, Solr will return all these documents several time in the
>> list.
>> >> >
>> >> > What is the most suitable way to get Solr to return only one of the
>> >> > document during the search?
>> >> > I understand that there is result grouping and faceting, but I'm not
>> sure
>> >> > if that is the best way.
>> >> >
>> >> > Regards,
>> >> > Edwin
>> >>
>> >>
>>

Re: Get distinct results in Solr

Thank you for your advice Alexandre.

Will try out the de-duplication from the link you gave.

Regards,
Edwin


On 1 September 2015 at 10:34, Alexandre Rafalovitch 
wrote:

> Re-read the question. You want to de-dupe on the full text-content.
>
> I would actually try to use the dedupe chain as per the link I gave
> but put results into a separate string field. Then, you group on that
> field. You cannot actually group on the long text field, that would
> kill any performance. So a signature is your proxy.
>
> Regards,
>Alex
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 31 August 2015 at 22:26, Zheng Lin Edwin Yeo 
> wrote:
> > Hi Alexandre,
> >
> > Will treating it as String affect the search or other functions like
> > highlighting?
> >
> > Yes, the content must be in my index, unless I do a copyField to do
> > de-duplication on that field.. Will that help?
> >
> > Regards,
> > Edwin
> >
> >
> > On 1 September 2015 at 10:04, Alexandre Rafalovitch 
> > wrote:
> >
> >> Can't you just treat it as String?
> >>
> >> Also, do you actually want those documents in your index in the first
> >> place? If not, have you looked at De-duplication:
> >> https://cwiki.apache.org/confluence/display/solr/De-Duplication
> >>
> >> Regards,
> >>Alex.
> >> 
> >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> >> http://www.solr-start.com/
> >>
> >>
> >> On 31 August 2015 at 22:00, Zheng Lin Edwin Yeo 
> >> wrote:
> >> > Thanks Jan.
> >> >
> >> > But I read that the field that is being collapsed on must be a single
> >> > valued String, Int or Float. As I'm required to get the distinct
> results
> >> > from "content" field that was indexed from a rich text document, I got
> >> the
> >> > following error:
> >> >
> >> >   "error":{
> >> > "msg":"java.io.IOException: 64 bit numeric collapse fields are not
> >> > supported",
> >> > "trace":"java.lang.RuntimeException: java.io.IOException: 64 bit
> >> > numeric collapse fields are not supported\r\n\tat
> >> >
> >> >
> >> > Is it possible to collapsed on fields which has a long integer of
> data,
> >> > like content from a rich text document?
> >> >
> >> > Regards,
> >> > Edwin
> >> >
> >> >
> >> > On 31 August 2015 at 18:59, Jan Høydahl 
> wrote:
> >> >
> >> >> Hi
> >> >>
> >> >> Check out the CollapsingQParser (
> >> >>
> >>
> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
> >> ).
> >> >> As long as you have a field that will be the same for all duplicates,
> >> you
> >> >> can “collapse” on that field. If you not have a “group id”, you can
> >> create
> >> >> one using e.g. an MD5 signature of the identical body text (
> >> >> https://cwiki.apache.org/confluence/display/solr/De-Duplication).
> >> >>
> >> >> --
> >> >> Jan Høydahl, search solution architect
> >> >> Cominvent AS - www.cominvent.com
> >> >>
> >> >> > 31. aug. 2015 kl. 12.03 skrev Zheng Lin Edwin Yeo <
> >> edwinye...@gmail.com
> >> >> >:
> >> >> >
> >> >> > Hi,
> >> >> >
> >> >> > I'm using Solr 5.2.1, and I would like to find out, what is the
> best
> >> way
> >> >> to
> >> >> > get Solr to return only distinct results?
> >> >> >
> >> >> > Currently, I've indexed several exact similar documents into Solr,
> >> with
> >> >> > just different id and title, but the content is exactly the same.
> >> When I
> >> >> do
> >> >> > a search, Solr will return all these documents several time in the
> >> list.
> >> >> >
> >> >> > What is the most suitable way to get Solr to return only one of the
> >> >> > document during the search?
> >> >> > I understand that there is result grouping and faceting, but I'm
> not
> >> sure
> >> >> > if that is the best way.
> >> >> >
> >> >> > Regards,
> >> >> > Edwin
> >> >>
> >> >>
> >>
>

Re: Get distinct results in Solr