>
> If you think about it, having a shard with 3 replicas on top of a file
system that does 3x replication seems a little excessive!
https://issues.apache.org/jira/browse/SOLR-6305 should help here. I can
take a look at merging the patch, since it looks like it has been helpful to
others.
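In the meantime, the HDFS-side replication can be dialed down on an existing
index with a standard HDFS command; a minimal sketch, assuming the Solr index
lives under /solr (the path is hypothetical):

    # Reduce HDFS replication to 1 for everything under the collection's
    # index path; -w waits for the change to complete.
    hdfs dfs -setrep -R -w 1 /solr/collection1

Note this only rewrites the replication factor of existing files; newly
written segment files still use whatever dfs.replication the HDFS client
Solr uses is configured with.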
Kevin Ri
Hi Kyle - Thank you.
Our current index is split across 3 Solr collections; our largest
collection is 26.8 TBytes (80.5 TBytes when 3x replicated in HDFS) across
100 shards. There are 40 machines hosting this cluster. We've found
that when dealing with large collections having no replicas (but l
Hi Joe,
We fought with Solr on HDFS for quite some time, and faced similar issues
as you're seeing. (See this thread, for example:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201812.mbox/%3cCABd9LjTeacXpy3FFjFBkzMq6vhgu7Ptyh96+w-KC2p=-rqk...@mail.gmail.com%3e
)
The Solr lock files
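For anyone hitting the same lock-file problem: HdfsLockFactory leaves a
write.lock file in each index directory in HDFS, and after an unclean
shutdown it has to be removed by hand before the core will load again. A
sketch, assuming solr.hdfs.home=/solr (paths hypothetical):

    # Find leftover lock files, then remove the stale one for the affected core:
    hdfs dfs -ls -R /solr | grep write.lock
    hdfs dfs -rm /solr/collection1/core_node1/data/index/write.lock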
Thank you. No, while the cluster is using Cloudera for HDFS, we do not
use Cloudera to manage the Solr cluster. If it is a
configuration/architecture issue, what can I do to fix it? I'd like a
system where servers can come and go, but the indexes stay available and
recover automatically. I
I don’t think you’re using Cloudera or Ambari, but Ambari has an option to
delete the locks. This seems more a configuration/architecture issue than a
reliability issue. You may want to spin up an alias while you bring down, clear
locks and directories, recreate, and reindex the affected collection
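If it helps, the alias trick is just the CREATEALIAS collections API call; a
minimal sketch with hypothetical collection and alias names:

    # Point the alias that clients query at a healthy collection while the
    # affected one is rebuilt, then switch it back afterwards:
    curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=events&collections=events_rebuilt"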
bq: We also had an HDFS setup already so it looked like a good option
to not lose data. Earlier we had a few cases where we lost the
machines, so HDFS looked safer for that.
right, that's one of the places where using HDFS to back Solr makes a
lot of sense. The other approach is to just have replicas
We actually use no auto-warming. Our collections are pretty small and
the query performance is not really a problem so far. We are using lots
of collections, and most Solr caches seem to be per core and not global,
so we also have a problem with caching. I have to test the HDFS cache
some more as
In my experience, for relatively static indexes the performance is
roughly similar. Once the data is read from whatever data source, it's
in memory, and where the data came from is (largely) secondary in
importance.
In cases where there's a lot of I/O I expect HDFS to be slower; this
fits Hendrik's observations
Hendrik,
Thanks for your response.
Regarding "But this seems to greatly depend on how your setup looks like
and what actions you perform." May I know what are the factors influence
and what considerations are to be taken in relation to this?
Thanks
On Wed, 22 Nov 2017 at 14:16 Hendrik Haddorp
We did some testing, and the performance was strangely even better with
HDFS than with the local file system. But this seems to greatly
depend on what your setup looks like and what actions you perform. We now
had a pattern with lots of small updates and commits, and that seems to be
quite a bi
I'm also not really an HDFS expert, but I believe it is slightly different:
the HDFS data is replicated, let's say 3 times, between the HDFS data
nodes, but to an HDFS client it looks like one directory, and the
replication is hidden. Every client should see the same
data. Just lik
bq: in the non-HDFS case that sounds logical but in the HDFS case all
the index data is in the shared HDFS file system
That's not really the point, and it's not quite true. The Solr index is
unique _per replica_. So replica1 points to an HDFS directory (that's
triply replicated, to be sure). replica2
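In other words, each replica gets its own directory tree in HDFS; the layout
is roughly like this (naming varies a bit by version, paths hypothetical):

    /solr/collection1/core_node1/data/index/  <- replica1's index, 3x replicated by HDFS
    /solr/collection1/core_node1/data/tlog/   <- replica1's transaction log
    /solr/collection1/core_node2/data/index/  <- replica2's separate, independently built index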
Hi Erick,
in the non-HDFS case that sounds logical, but in the HDFS case all the
index data is in the shared HDFS file system. Even the transaction logs
should be in there. So the node that once had the replica should not
really have more information than any other node, especially if
legacyCloud
Hendrik:
bq: Not really sure why one replica needs to be up though.
I didn't write the code so I'm guessing a bit, but consider the
situation where you have no replicas for a shard up and add a new one.
Eventually it could become the leader but there would have been no
chance for it to check if i
Hi,
I had opened SOLR-10092
(https://issues.apache.org/jira/browse/SOLR-10092) for this a while ago.
I was now able to get this feature working with a very small code change.
After a few seconds Solr reassigns the replica to a different Solr
instance, as long as one replica is still up. Not really sure why one
replica needs to be up though.
HDFS is like a shared filesystem so every Solr Cloud instance can access
the data using the same path or URL. The clusterstate.json looks like this:
"shards":{"shard1":{
"range":"8000-7fff",
"state":"active",
"replicas":{
"core_node1":{
"core
On 1/19/2017 4:09 AM, Hendrik Haddorp wrote:
> Given that the data is on HDFS, it shouldn't matter if any active
> replica is left, as the data does not need to get transferred from
> another instance; the new core will just take over the existing
> data. Thus a replication factor of 1 should also
Hi,
I'm seeing the same issue on Solr 6.3 using HDFS and a replication
factor of 3, even though I believe a replication factor of 1 should work
the same. When I stop a Solr instance, this is detected, and Solr actually
wants to create a replica on a different instance. The command for that
does
On 1/13/2017 5:46 PM, Chetas Joshi wrote:
> One of the things I have observed is: if I use the collection API to
> create a replica for that shard, it does not complain about the config
> which has been set to ReplicationFactor=1. If replication factor was
> the issue as suggested by Shawn, shouldn
Erick, I have not changed any config. I have autoAddReplica = true for
the individual collection config as well as the overall cluster config. Still,
it does not add a replica when I decommission a node.
Adding a replica is the overseer's job. I looked at the logs of the overseer of
the SolrCloud but coul
Hmmm, have you changed any of the settings for autoAddReplica? There
are several parameters that govern how long before a replica would be
added.
But I suggest you use the Cloudera resources for this question: not
only did they write this functionality, but Cloudera support is deeply
embedded in H
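For reference, the timing parameters in question live in the <solrcloud>
section of solr.xml; a sketch with what I believe were the 5x/6x defaults
(check the ref guide for your version):

    <solrcloud>
      <!-- how long after a node's ZK session expires before its replicas
           are considered for failover (ms) -->
      <int name="autoReplicaFailoverWaitAfterExpiration">30000</int>
      <!-- how often the overseer checks for failed replicas (ms) -->
      <int name="autoReplicaFailoverWorkLoopDelay">10000</int>
      <!-- how long a bad node is remembered (ms) -->
      <int name="autoReplicaFailoverBadNodeExpiration">60000</int>
    </solrcloud>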
On 1/11/2017 7:14 PM, Chetas Joshi wrote:
> This is what I understand about how Solr works on HDFS. Please correct me
> if I am wrong.
>
> Although the Solr shard replicationFactor = 1, the HDFS default replication = 3.
> When the node goes down, the solr server running on that node goes down and
> hence
Hi Shawn,
This is what I understand about how Solr works on HDFS. Please correct me
if I am wrong.
Although the Solr shard replicationFactor = 1, the HDFS default replication = 3.
When the node goes down, the solr server running on that node goes down and
hence the instance (core) representing the repli
On 1/11/2017 1:47 PM, Chetas Joshi wrote:
> I have deployed a SolrCloud (Solr 5.5.0) on HDFS using Cloudera 5.4.7. The
> cloud has 86 nodes.
>
> This is my config for the collection
>
> numShards=80
> ReplicationFactor=1
> maxShardsPerNode=1
> autoAddReplica=true
>
> I recently decommissioned a nod
I took another look at the stack trace and I'm pretty sure the issue is
with NULL values in one of the sort fields. The null pointer is occurring
during the comparison of sort values. See line 85 of:
https://github.com/apache/lucene-solr/blob/branch_5_5/solr/solrj/src/java/org/apache/solr/client/so
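To illustrate the failure mode (a hypothetical sketch, not the actual Solr
code): comparing sort values without a null check blows up exactly like the
trace above when a document has no value for the sort field, and a null-safe
comparator in the spirit of the 6x fix avoids it.

    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.Map;

    public class NullSortSketch {
        public static void main(String[] args) {
            Map<String, Object> a = new HashMap<>();     // doc missing the sort field
            Map<String, Object> b = Map.of("price", 10); // doc with a value

            // Naive comparison NPEs, matching the reported trace:
            // ((Integer) a.get("price")).compareTo((Integer) b.get("price"));

            // Null-safe comparison (nulls sort last):
            Comparator<Map<String, Object>> safe = Comparator.comparing(
                    m -> (Integer) m.get("price"),
                    Comparator.nullsLast(Comparator.naturalOrder()));
            System.out.println(safe.compare(a, b));      // prints 1: null sorts after 10
        }
    }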
Hi Joel,
I don't have any Solr documents that have NULL values for the sort fields I
use in my queries.
Thanks!
On Sun, Dec 18, 2016 at 12:56 PM, Joel Bernstein wrote:
> Ok, based on the stack trace I suspect one of your sort fields has NULL
> values, which in the 5x branch could produce null
Ok, based on the stack trace I suspect one of your sort fields has NULL
values, which in the 5x branch could produce null pointers if a segment had
no values for a sort field. This is also fixed in the Solr 6x branch.
Joel Bernstein
http://joelsolr.blogspot.com/
On Sat, Dec 17, 2016 at 2:44 PM, C
Here is the stack trace.
java.lang.NullPointerException
        at org.apache.solr.client.solrj.io.comp.FieldComparator$2.compare(FieldComparator.java:85)
        at org.apache.solr.client.solrj.io.comp.FieldComparator.compare(FieldComparator.java:92)
        at org.apache.solr.client.solrj.io.
If you could provide the JSON parse exception stack trace, it might help to
pinpoint the issue.
On Fri, Dec 16, 2016 at 5:52 PM, Chetas Joshi
wrote:
> Hi Joel,
>
> The only non-alphanumeric characters I have in my data are '+' and '/'. I
> don't have any backslashes.
>
> If the special charac
Hi Joel,
The only non-alphanumeric characters I have in my data are '+' and '/'. I
don't have any backslashes.
If the special characters were the issue, I should get the JSON parsing
exceptions every time, irrespective of the index size and irrespective of
the available memory on the machine. That
The Streaming API may have been throwing exceptions because the JSON
special characters were not escaped. This was fixed in Solr 6.0.
Joel Bernstein
http://joelsolr.blogspot.com/
On Fri, Dec 16, 2016 at 4:34 PM, Chetas Joshi
wrote:
> Hello,
>
> I am running Solr 5.5.0.
> It is a solrCloud
On 12/16/2016 11:58 AM, Chetas Joshi wrote:
> How different is the index data caching mechanism for the Streaming
> API compared to the cursor approach?
Solr and Lucene do not handle that caching. Systems external to Solr
(like the OS, or HDFS) handle the caching. The cache effectiveness will
be a comb
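On the HDFS side, the relevant cache is the HdfsDirectoryFactory block cache,
configured in solrconfig.xml; a minimal sketch (the solr.hdfs.home value is
hypothetical):

    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
      <bool name="solr.hdfs.blockcache.enabled">true</bool>
      <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
      <!-- one slab = blocksperbank x 8KB blocks = 128MB of off-heap memory -->
      <int name="solr.hdfs.blockcache.slab.count">1</int>
      <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
      <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
    </directoryFactory>

With direct memory allocation enabled, the JVM also needs
-XX:MaxDirectMemorySize set large enough to hold all the slabs.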
Thank you everyone. I would add nodes to the SolrCloud and split the shards.
Shawn,
Thank you for explaining why putting index data on the local file system could
be a better idea than using HDFS. I need to find out how HDFS caches the
index files in a resource constrained environment.
I would also
On 12/14/2016 11:58 AM, Chetas Joshi wrote:
> I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have
> the following config.
> maxShardsPerNode: 1
> replicationFactor: 1
>
> I have been ingesting data into Solr for the last 3 months. With increase
> in data, I am observing increa
I think 70GB is too huge for a shard.
How much memory does the system have?
In case Solr does not have sufficient memory to load the indexes, it will
use only the amount of memory defined in your Solr caches.
Although you are on HDFS, Solr performance will be really bad if it has to do
disk IO a
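A rough back-of-envelope with the numbers from this thread: each HDFS block
cache slab holds 128MB, so caching even 10% of a 70GB shard (~7GB) needs
about 56 slabs of off-heap memory per node, on top of the JVM heap.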
I think the shard index size is huge and should be split.
On Wed, Dec 14, 2016 at 10:58 AM, Chetas Joshi
wrote:
> Hi everyone,
>
> I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have
> the following config.
> maxShardsPerNode: 1
> replicationFactor: 1
>
> I have been ingest
The core_node name is largely irrelevant; you should see more
descriptive names in the state.json file, like collection1_shard1_replica1.
You happen to see 19 because you have only one replica per shard.
Exactly how are you creating the replica? What version of Solr? If
you're using the "core admi
Is this happening because I have set replicationFactor=1?
So even if I manually add a replica for the shard that's down, it will just
create a dataDir but would not copy any of the data into the dataDir?
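For what it's worth, manually adding a replica is a collections API call like
the following (collection, shard, and node names are hypothetical):

    curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=collection1&shard=shard1&node=host2:8983_solr"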
On Tue, Sep 13, 2016 at 6:07 PM, Chetas Joshi
wrote:
> Hi,
>
> I just started experimenting wi
Oh, and https://issues.apache.org/jira/browse/SOLR-6743
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/
On Tue, Jan 6, 2015 at 12:52 PM, Otis Gospodnetic <
otis.gospodne...@gmail.com> wrote:
> Hi Charles,
>
> See
Hi Charles,
See http://search-lucene.com/?q=solr+hdfs and
https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/
On Tue, Jan 6, 2015 at 11:02 AM, Cha
Hi all,
I am new to Solr and HDFS. Actually, I am trying to index text content
extracted from binary files like PDF, MS Office, etc., which are stored on
HDFS (single node). So far I've got Solr running on HDFS and created the core,
but I couldn't send the files to Solr for indexing.
Can someone please
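One approach that might work, assuming the core is called mycore (names and
paths hypothetical): stream each file out of HDFS into Solr's
ExtractingRequestHandler, which runs Tika to pull out the text:

    hdfs dfs -cat /data/docs/report.pdf \
      | curl "http://localhost:8983/solr/mycore/update/extract?literal.id=report1&commit=true" \
             -H "Content-Type: application/pdf" --data-binary @-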
Hi Joseph,
I believe Nutch can index into Solr/SolrCloud just fine. Sounds like that
is the approach you should take.
Otis
--
Solr & ElasticSearch Support
http://sematext.com/
On Thu, Mar 7, 2013 at 12:10 AM, Joseph Lim wrote:
> Hi Amit,
>
> Currently I am designing a Learning Management
Hi Amit,
Currently I am designing a Learning Management System based on
Hadoop and HBase. Right now I want to integrate Nutch with Solr as
part of the crawler module, so that users will only be able to search relevant
documents from a specific source. And since crawling and indexing t
Joseph,
Doing what Otis said will do literally what you want, which is copying the
index to HDFS. It's no different than copying it to a different machine,
which, btw, is what Solr's master/slave replication scheme does.
Alternatively, I think people are starting to set up new Solr instances with
SolrC
Hi Amit,
so you mean that if I just want to get redundancy for Solr in HDFS, the
best way to do it is, as per what Otis suggested, to use the following
command:
hadoop fs -copyFromLocal URI
Ok, let me try out SolrCloud, as I will need to make sure it works well with
Nutch too.
Thanks for th
Thanks for th
Why wouldn't SolrCloud help you here? You can set up shards and replicas etc.
to have redundancy, because HDFS isn't designed to serve real-time queries as
far as I understand. If you are using HDFS as a backup mechanism, to me
you'd be better served having multiple slaves tethered to a master (in a
non-cl
Hi Upayavira,
sure, let me explain. I am setting up Nutch and Solr in a Hadoop environment.
Since I am using HDFS, in the event of any crash of the
localhost (running Solr), I will still have the shards of data stored
in HDFS.
Thank you so much =)
On Thu, Mar 7, 2013 at 1:19 AM, U
What are you actually trying to achieve? If you can share what you are
trying to achieve maybe folks can help you find the right way to do it.
Upayavira
On Wed, Mar 6, 2013, at 02:54 PM, Joseph Lim wrote:
> Hello Otis ,
>
> Is there any configuration where it will index into HDFS instead?
>
> I
Hello Otis ,
Is there any configuration where it will index into HDFS instead?
I tried Crawlzilla and Lily, but I hope to update a specific package, such as
Hadoop only or Nutch only, when there are updates.
That's why I would prefer to install them separately.
Thanks so much. Looking forward to your repl
Hello Joseph,
You can certainly put them there, as in:
hadoop fs -copyFromLocal URI
But searching such an index will be slow.
See also: http://katta.sourceforge.net/
Otis
--
Solr & ElasticSearch Support
http://sematext.com/
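Concretely, that copy looks something like this (paths hypothetical):

    # Copy a core's on-disk index up to HDFS for safe keeping:
    hadoop fs -copyFromLocal /var/solr/data/collection1/data/index \
        hdfs://namenode:8020/solr-backup/collection1/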
On Wed, Mar 6, 2013 at 7:50 AM, Joseph Lim wrote:
> Hi,
> Woul
Hi,
Would like to know how I can put the indexed Solr shards into HDFS?
Thanks.
Joseph
On Mar 6, 2013 7:28 PM, "Otis Gospodnetic"
wrote:
> Hi Joseph,
>
> What exactly are you looking to do?
> See http://incubator.apache.org/blur/
>
> Otis
> --
> Solr & ElasticSearch Support
> http://sematext.c
Hi Joseph,
What exactly are you looking to do?
See http://incubator.apache.org/blur/
Otis
--
Solr & ElasticSearch Support
http://sematext.com/
On Wed, Mar 6, 2013 at 2:39 AM, Joseph Lim wrote:
> Hi, I am running the Hadoop distributed file system. How do I put the output of
> the Solr dir into h