Re: Solr on HDFS

2019-08-02 Thread Kevin Risden
> > If you think about it, having a shard with 3 replicas on top of a file system that does 3x replication seems a little excessive! https://issues.apache.org/jira/browse/SOLR-6305 should help here. I can take a look at merging the patch since looks like it has been helpful to others. Kevin Ri

Re: Solr on HDFS

2019-08-02 Thread Joe Obernberger
Hi Kyle - Thank you. Our current index is split across 3 solr collections; our largest collection is 26.8TBytes (80.5TBytes when 3x replicated in HDFS) across 100 shards.  There are 40 machines hosting this cluster. We've found that when dealing with large collections having no replicas (but l
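
A rough sketch of the storage arithmetic behind this message, and behind the remark in this thread that 3 Solr replicas on top of 3x HDFS replication seems excessive. The function names are hypothetical, chosen only for this illustration. The 26.8 TB figure is taken from the message; 26.8 x 3 gives 80.4 TB, which agrees with the quoted 80.5 TB only up to rounding of the index size.

```python
def physical_copies(solr_replicas: int, hdfs_replication: int) -> int:
    """Number of on-disk copies of each shard's index.

    Each Solr replica keeps its own index directory in HDFS, and HDFS
    stores every block of that directory hdfs_replication times.
    """
    return solr_replicas * hdfs_replication


def raw_footprint_tb(index_tb: float, solr_replicas: int, hdfs_replication: int) -> float:
    """Raw disk consumed across the cluster, in the same unit as index_tb."""
    return index_tb * physical_copies(solr_replicas, hdfs_replication)


# One Solr replica per shard on 3x HDFS (the setup described in the message).
single_replica_tb = raw_footprint_tb(26.8, solr_replicas=1, hdfs_replication=3)

# Three Solr replicas per shard on 3x HDFS: nine copies of every shard.
triple_replica_tb = raw_footprint_tb(26.8, solr_replicas=3, hdfs_replication=3)

print(f"1 replica  x 3x HDFS: {single_replica_tb:.1f} TB")
print(f"3 replicas x 3x HDFS: {triple_replica_tb:.1f} TB")
```

The multiplication is the whole point: lowering either factor shrinks the footprint, which is presumably why SOLR-6305 (mentioned in this thread) is of interest to HDFS users.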

Re: Solr on HDFS

2019-08-02 Thread lstusr 5u93n4
Hi Joe, We fought with Solr on HDFS for quite some time, and faced issues similar to those you're seeing. (See this thread, for example: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201812.mbox/%3cCABd9LjTeacXpy3FFjFBkzMq6vhgu7Ptyh96+w-KC2p=-rqk...@mail.gmail.com%3e ) The Solr lock files

Re: Solr on HDFS

2019-08-02 Thread Joe Obernberger
Thank you.  No, while the cluster is using Cloudera for HDFS, we do not use Cloudera to manage the Solr cluster.  If it is a configuration/architecture issue, what can I do to fix it?  I'd like a system where servers can come and go, but the indexes stay available and recover automatically.  I

Re: Solr on HDFS

2019-08-01 Thread Angie Rabelero
I don’t think you’re using Cloudera or Ambari, but Ambari has an option to delete the locks. This seems more a configuration/architecture issue than a reliability issue. You may want to spin up an alias while you bring down, clear locks and directories, recreate and index the affected collectio

Re: Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Erick Erickson
bq: We also had an HDFS setup already so it looked like a good option to not lose data. Earlier we had a few cases where we lost the machines so HDFS looked safer for that. Right, that's one of the places where using HDFS to back Solr makes a lot of sense. The other approach is to just have replic

Re: Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Hendrik Haddorp
We actually use no auto warming. Our collections are pretty small and the query performance is not really a problem so far. We are using lots of collections and most Solr caches seem to be per core and not global so we also have a problem with caching. I have to test the HDFS cache some more as

Re: Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Erick Erickson
In my experience, for relatively static indexes the performance is roughly similar. Once the data is read from whatever data source, it's in memory, and where the data came from is (largely) secondary in importance. In cases where there's a lot of I/O I expect HDFS to be slower; this fits Hendrik's obs

Re: Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Greenhorn Techie
Hendrik, Thanks for your response. Regarding "But this seems to greatly depend on how your setup looks like and what actions you perform." May I know which factors influence this and what considerations should be taken into account? Thanks On Wed, 22 Nov 2017 at 14:16 Hendrik Haddorp

Re: Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Hendrik Haddorp
We did some testing and the performance was strangely even better with HDFS than with the local file system. But this seems to greatly depend on how your setup looks like and what actions you perform. We now had a pattern with lots of small updates and commits and that seems to be quite a bi

Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-02-22 Thread Hendrik Haddorp
I'm also not really an HDFS expert but I believe it is slightly different: The HDFS data is replicated, let's say 3 times, between the HDFS data nodes but for an HDFS client it looks like one directory and it is hidden that the data is replicated. Every client should see the same data. Just lik

Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-02-22 Thread Erick Erickson
bq: in the non-HDFS case that sounds logical but in the HDFS case all the index data is in the shared HDFS file system That's not really the point, and it's not quite true. The Solr index is unique _per replica_. So replica1 points to an HDFS directory (that's triply replicated to be sure). replica2

Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-02-21 Thread Hendrik Haddorp
Hi Erick, in the non-HDFS case that sounds logical but in the HDFS case all the index data is in the shared HDFS file system. Even the transaction logs should be in there. So the node that once had the replica should not really have more information than any other node, especially if legacyC

Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-02-21 Thread Erick Erickson
Hendrik: bq: Not really sure why one replica needs to be up though. I didn't write the code so I'm guessing a bit, but consider the situation where you have no replicas for a shard up and add a new one. Eventually it could become the leader but there would have been no chance for it to check if i

Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-02-21 Thread Hendrik Haddorp
Hi, I had opened SOLR-10092 (https://issues.apache.org/jira/browse/SOLR-10092) for this a while ago. I was now able to get this feature working with a very small code change. After a few seconds Solr reassigns the replica to a different Solr instance as long as one replica is still up. Not rea

Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-19 Thread Hendrik Haddorp
HDFS is like a shared filesystem so every Solr Cloud instance can access the data using the same path or URL. The clusterstate.json looks like this: "shards":{"shard1":{ "range":"8000-7fff", "state":"active", "replicas":{ "core_node1":{ "core

Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-19 Thread Shawn Heisey
On 1/19/2017 4:09 AM, Hendrik Haddorp wrote: > Given that the data is on HDFS it shouldn't matter if any active > replica is left as the data does not need to get transferred from > another instance but the new core will just take over the existing > data. Thus a replication factor of 1 should also

Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-19 Thread Hendrik Haddorp
Hi, I'm seeing the same issue on Solr 6.3 using HDFS and a replication factor of 3, even though I believe a replication factor of 1 should work the same. When I stop a Solr instance this is detected and Solr actually wants to create a replica on a different instance. The command for that does

Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-13 Thread Shawn Heisey
On 1/13/2017 5:46 PM, Chetas Joshi wrote: > One of the things I have observed is: if I use the collection API to > create a replica for that shard, it does not complain about the config > which has been set to ReplicationFactor=1. If replication factor was > the issue as suggested by Shawn, shouldn

Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-13 Thread Chetas Joshi
Erick, I have not changed any config. I have autoaddReplica = true for individual collection config as well as the overall cluster config. Still, it does not add a replica when I decommission a node. Adding a replica is overseer's job. I looked at the logs of the overseer of the solrCloud but coul

Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-12 Thread Erick Erickson
Hmmm, have you changed any of the settings for autoAddReplica? There are several parameters that govern how long before a replica would be added. But I suggest you use the Cloudera resources for this question, not only did they write this functionality, but Cloudera support is deeply embedded in H

Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-12 Thread Shawn Heisey
On 1/11/2017 7:14 PM, Chetas Joshi wrote: > This is what I understand about how Solr works on HDFS. Please correct me > if I am wrong. > > Although solr shard replication Factor = 1, HDFS default replication = 3. > When the node goes down, the solr server running on that node goes down and > hence

Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-11 Thread Chetas Joshi
Hi Shawn, This is what I understand about how Solr works on HDFS. Please correct me if I am wrong. Although solr shard replication Factor = 1, HDFS default replication = 3. When the node goes down, the solr server running on that node goes down and hence the instance (core) representing the repli

Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-11 Thread Shawn Heisey
On 1/11/2017 1:47 PM, Chetas Joshi wrote: > I have deployed a SolrCloud (solr 5.5.0) on hdfs using cloudera 5.4.7. The > cloud has 86 nodes. > > This is my config for the collection > > numShards=80 > ReplicationFactor=1 > maxShardsPerNode=1 > autoAddReplica=true > > I recently decommissioned a nod
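
For readers unfamiliar with how such a collection is defined, here is a minimal sketch that builds a Solr Collections API CREATE request carrying the settings quoted above. The base URL and collection name are placeholders invented for this example, and other parameters a real deployment may need (such as the config set) are left out. Note that the Collections API spells the last parameter autoAddReplicas. The code only constructs the URL and sends nothing.

```python
from urllib.parse import urlencode


def create_collection_url(base_url: str, name: str, num_shards: int,
                          replication_factor: int, max_shards_per_node: int,
                          auto_add_replicas: bool) -> str:
    """Build a Collections API CREATE URL (no request is sent)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "maxShardsPerNode": max_shards_per_node,
        # Solr expects the literal strings "true" / "false".
        "autoAddReplicas": "true" if auto_add_replicas else "false",
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Placeholder host and collection name; the numbers mirror the quoted config.
url = create_collection_url(
    "http://localhost:8983/solr",
    "example_collection",
    num_shards=80,
    replication_factor=1,
    max_shards_per_node=1,
    auto_add_replicas=True,
)
print(url)
```

With replicationFactor=1 there is only one Solr-level copy of each shard, so when a node is lost its shard has no live replica left. That is the situation the rest of this thread discusses.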

Re: Solr on HDFS: Streaming API performance tuning

2016-12-19 Thread Joel Bernstein
I took another look at the stack trace and I'm pretty sure the issue is with NULL values in one of the sort fields. The null pointer is occurring during the comparison of sort values. See line 85 of: https://github.com/apache/lucene-solr/blob/branch_5_5/solr/solrj/src/java/org/apache/solr/client/so

Re: Solr on HDFS: Streaming API performance tuning

2016-12-19 Thread Chetas Joshi
Hi Joel, I don't have any solr documents that have NULL values for the sort fields I use in my queries. Thanks! On Sun, Dec 18, 2016 at 12:56 PM, Joel Bernstein wrote: > Ok, based on the stack trace I suspect one of your sort fields has NULL > values, which in the 5x branch could produce null

Re: Solr on HDFS: Streaming API performance tuning

2016-12-18 Thread Joel Bernstein
Ok, based on the stack trace I suspect one of your sort fields has NULL values, which in the 5x branch could produce null pointers if a segment had no values for a sort field. This is also fixed in the Solr 6x branch. Joel Bernstein http://joelsolr.blogspot.com/ On Sat, Dec 17, 2016 at 2:44 PM, C
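
To illustrate the failure mode described here, the sketch below shows a comparator that tolerates missing sort values. It is a Python analogy written for this listing, not the code of Solr's FieldComparator. The choice to sort missing values first is an assumption made for the example, and the field name ts is made up.

```python
from functools import cmp_to_key


def compare_nullable(a, b) -> int:
    """Three-way compare in which None sorts before any real value.

    Without the None checks, comparing None with an int raises an
    error in Python, much as an unguarded comparison of a null sort
    value raises a NullPointerException in Java.
    """
    if a is None and b is None:
        return 0
    if a is None:
        return -1
    if b is None:
        return 1
    return (a > b) - (a < b)


# Hypothetical documents; "ts" stands in for a sort field with a gap.
docs = [
    {"id": "c", "ts": 3},
    {"id": "a", "ts": None},
    {"id": "b", "ts": 1},
]

ordered = sorted(
    docs,
    key=cmp_to_key(lambda x, y: compare_nullable(x["ts"], y["ts"])),
)
print([d["id"] for d in ordered])
```

Whether missing values should sort first or last is a policy decision; the point is only that the comparator has to decide explicitly instead of dereferencing the value.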

Re: Solr on HDFS: Streaming API performance tuning

2016-12-17 Thread Chetas Joshi
Here is the stack trace. java.lang.NullPointerException at org.apache.solr.client.solrj.io.comp.FieldComparator$2.compare(FieldComparator.java:85) at org.apache.solr.client.solrj.io.comp.FieldComparator.compare(FieldComparator.java:92) at org.apache.solr.client.solrj.io.

Re: Solr on HDFS: Streaming API performance tuning

2016-12-16 Thread Reth RM
If you could provide the JSON parse exception stack trace, it might help to diagnose the issue. On Fri, Dec 16, 2016 at 5:52 PM, Chetas Joshi wrote: > Hi Joel, > > The only NON alpha-numeric characters I have in my data are '+' and '/'. I > don't have any backslashes. > > If the special charac

Re: Solr on HDFS: Streaming API performance tuning

2016-12-16 Thread Chetas Joshi
Hi Joel, The only NON alpha-numeric characters I have in my data are '+' and '/'. I don't have any backslashes. If the special characters were the issue, I should get the JSON parsing exceptions every time irrespective of the index size and irrespective of the available memory on the machine. That

Re: Solr on HDFS: Streaming API performance tuning

2016-12-16 Thread Joel Bernstein
The Streaming API may have been throwing exceptions because the JSON special characters were not escaped. This was fixed in Solr 6.0. Joel Bernstein http://joelsolr.blogspot.com/ On Fri, Dec 16, 2016 at 4:34 PM, Chetas Joshi wrote: > Hello, > > I am running Solr 5.5.0. > It is a solrCloud

Re: Solr on HDFS: increase in query time with increase in data

2016-12-16 Thread Shawn Heisey
On 12/16/2016 11:58 AM, Chetas Joshi wrote: > How different the index data caching mechanism is for the Streaming > API from the cursor approach? Solr and Lucene do not handle that caching. Systems external to Solr (like the OS, or HDFS) handle the caching. The cache effectiveness will be a comb

Re: Solr on HDFS: increase in query time with increase in data

2016-12-16 Thread Chetas Joshi
Thank you everyone. I would add nodes to the SolrCloud and split the shards. Shawn, Thank you for explaining why putting index data on local file system could be a better idea than using HDFS. I need to find out how HDFS caches the index files in a resource constrained environment. I would also

Re: Solr on HDFS: increase in query time with increase in data

2016-12-16 Thread Shawn Heisey
On 12/14/2016 11:58 AM, Chetas Joshi wrote: > I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have > the following config. > maxShardsperNode: 1 > replicationFactor: 1 > > I have been ingesting data into Solr for the last 3 months. With increase > in data, I am observing increa

Re: Solr on HDFS: increase in query time with increase in data

2016-12-16 Thread Piyush Kunal
I think 70GB is too huge for a shard. How much memory does the system have? In case Solr does not have sufficient memory to load the indexes, it will use only the amount of memory defined in your Solr caches. Although you are on HDFS, Solr performance will be really bad if it has to do disk IO a

Re: Solr on HDFS: increase in query time with increase in data

2016-12-15 Thread Reth RM
I think the shard index size is huge and should be split. On Wed, Dec 14, 2016 at 10:58 AM, Chetas Joshi wrote: > Hi everyone, > > I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have > the following config. > maxShardsperNode: 1 > replicationFactor: 1 > > I have been ingest

Re: Solr on HDFS: adding a shard replica

2016-09-14 Thread Erick Erickson
The core_node name is largely irrelevant, you should have names more descriptive in the state.json file like collection1_shard1_replica1. You happen to see 19 because you have only one replica per shard. Exactly how are you creating the replica? What version of Solr? If you're using the "core admi

Re: Solr on HDFS: adding a shard replica

2016-09-13 Thread Chetas Joshi
Is this happening because I have set replicationFactor=1? So even if I manually add replica for the shard that's down, it will just create a dataDir but would not copy any of the data into the dataDir? On Tue, Sep 13, 2016 at 6:07 PM, Chetas Joshi wrote: > Hi, > > I just started experimenting wi

Re: Solr on HDFS in a Hadoop cluster

2015-01-08 Thread Charles VALLEE
tique (EEI) 32 avenue Pablo Picasso 92000 Nanterre charles.val...@edf.fr Tel.: + (0) 1 78 66 69 81 A simple gesture for the environment: print this message only if you need it. From: otis.gospodne...@gmail.com To: solr-user@lucene.apache.org Date: 06/

Re: Solr on HDFS in a Hadoop cluster

2015-01-06 Thread Otis Gospodnetic
Oh, and https://issues.apache.org/jira/browse/SOLR-6743 Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & Elasticsearch Support * http://sematext.com/ On Tue, Jan 6, 2015 at 12:52 PM, Otis Gospodnetic < otis.gospodne...@gmail.com> wrote: > Hi Charles, > > See

Re: Solr on HDFS in a Hadoop cluster

2015-01-06 Thread Otis Gospodnetic
Hi Charles, See http://search-lucene.com/?q=solr+hdfs and https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & Elasticsearch Support * http://sematext.com/ On Tue, Jan 6, 2015 at 11:02 AM, Cha

Re: SOLR on hdfs

2014-07-08 Thread shlash
Hi all, I am new to Solr and HDFS. I am trying to index text content extracted from binary files (PDF, MS Office, etc.) that are stored on HDFS (single node). So far I have Solr running on HDFS and have created the core, but I couldn't send the files to Solr for indexing. Can someone please

Re: SOLR on hdfs

2013-03-07 Thread Otis Gospodnetic
Hi Joseph, I believe Nutch can index into Solr/SolrCloud just fine. Sounds like that is the approach you should take. Otis -- Solr & ElasticSearch Support http://sematext.com/ On Thu, Mar 7, 2013 at 12:10 AM, Joseph Lim wrote: > Hi Amit, > > Currently I am designing a Learning Management

Re: SOLR on hdfs

2013-03-06 Thread Joseph Lim
Hi Amit, Currently I am designing a Learning Management System based on Hadoop and HBase. Right now I want to integrate Nutch with Solr in it as part of the crawler module, so that users will only be able to search relevant documents from specific sources. And since crawling and indexing t

Re: SOLR on hdfs

2013-03-06 Thread Amit Nithian
Joseph, Doing what Otis said will do literally what you want which is copying the index to HDFS. It's no different than copying it to a different machine which btw is what Solr's master/slave replication scheme does. Alternatively, I think people are starting to set up new Solr instances with SolrC

Re: SOLR on hdfs

2013-03-06 Thread Joseph Lim
Hi Amit, so you mean that if I just want to get redundancy for Solr in HDFS, the best way to do it is, as Otis suggested, to use the following command: hadoop fs -copyFromLocal URI Ok, let me try out SolrCloud, as I will need to make sure it works well with Nutch too. Thanks for th

Re: SOLR on hdfs

2013-03-06 Thread Amit Nithian
Why wouldn't SolrCloud help you here? You can set up shards and replicas etc to have redundancy b/c HDFS isn't designed to serve real time queries as far as I understand. If you are using HDFS as a backup mechanism to me you'd be better served having multiple slaves tethered to a master (in a non-cl

Re: SOLR on hdfs

2013-03-06 Thread Joseph Lim
Hi Upayavira, sure, let me explain. I am setting up Nutch and Solr in a Hadoop environment. Since I am using HDFS, in the event of a crash on the localhost (running Solr), I will still have the shards of data stored in HDFS. Thank you so much =) On Thu, Mar 7, 2013 at 1:19 AM, U

Re: SOLR on hdfs

2013-03-06 Thread Upayavira
What are you actually trying to achieve? If you can share what you are trying to achieve maybe folks can help you find the right way to do it. Upayavira On Wed, Mar 6, 2013, at 02:54 PM, Joseph Lim wrote: > Hello Otis , > > Is there any configuration where it will index into hdfs instead? > > I

Re: SOLR on hdfs

2013-03-06 Thread Joseph Lim
Hello Otis, Is there any configuration where it will index into hdfs instead? I tried crawlzilla and lily but I hope to update a specific package, such as Hadoop only or Nutch only, when there are updates. That's why I would prefer to install them separately. Thanks so much. Looking forward to your repl

Re: SOLR on hdfs

2013-03-06 Thread Otis Gospodnetic
Hello Joseph, You can certainly put them there, as in: hadoop fs -copyFromLocal URI But searching such an index will be slow. See also: http://katta.sourceforge.net/ Otis -- Solr & ElasticSearch Support http://sematext.com/ On Wed, Mar 6, 2013 at 7:50 AM, Joseph Lim wrote: > Hi, > Woul
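
A small sketch of the copy step Otis describes, assuming the Hadoop FileSystem shell usage `hadoop fs -copyFromLocal <localsrc> URI`. Both paths are placeholders invented for this example. The code only assembles the command line; running it would need a working `hadoop` client on the PATH.

```python
import shlex


def copy_from_local_cmd(local_index_dir: str, hdfs_uri: str) -> list:
    """Return the argv for copying a local index directory into HDFS."""
    return ["hadoop", "fs", "-copyFromLocal", local_index_dir, hdfs_uri]


# Placeholder paths; substitute the real Solr data dir and HDFS target.
cmd = copy_from_local_cmd(
    "/var/solr/data/collection1/data/index",
    "hdfs://namenode:8020/backups/collection1/index",
)
print(shlex.join(cmd))

# To actually run it (requires a configured Hadoop client):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

As the message notes, this yields a copy of the index in HDFS, which is useful as a backup; searching such a copy directly is expected to be slow.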

Re: SOLR on hdfs

2013-03-06 Thread Joseph Lim
Hi, I would like to know how I can put the indexed Solr shards into HDFS. Thanks.. Joseph On Mar 6, 2013 7:28 PM, "Otis Gospodnetic" wrote: > Hi Joseph, > > What exactly are you looking to do? > See http://incubator.apache.org/blur/ > > Otis > -- > Solr & ElasticSearch Support > http://sematext.c

Re: SOLR on hdfs

2013-03-06 Thread Otis Gospodnetic
Hi Joseph, What exactly are you looking to do? See http://incubator.apache.org/blur/ Otis -- Solr & ElasticSearch Support http://sematext.com/ On Wed, Mar 6, 2013 at 2:39 AM, Joseph Lim wrote: > Hi I am running hadoop distributed file system, how do I put my output of > the solr dir into h