Hi Tom

It might be worth restarting the DataNode process? I didn't think you could disable the DataNode Web UI as such, but I could be wrong on this point. Out of interest, what does hdfs-site.xml say with regards to dfs.datanode.http.address/dfs.datanode.https.address?
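For reference, a sketch of what those entries typically look like in hdfs-site.xml — the values shown are my understanding of the Hadoop 3 / CDH 6 defaults (0.0.0.0:9864 for HTTP, 0.0.0.0:9865 for HTTPS); your cluster may well override them, which is exactly what would be worth checking:

```xml
<!-- Assumed Hadoop 3 defaults shown; verify against your cluster's actual config -->
<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:9864</value>
</property>
<property>
  <name>dfs.datanode.https.address</name>
  <value>0.0.0.0:9865</value>
</property>
```

If the addresses have been set to something non-default (or to a port that's firewalled), that would explain the endpoints not responding.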

Regarding the logs, a quick look on GitHub suggests there may be a couple of useful log messages:

https://github.com/apache/hadoop/blob/88a9f42f320e7c16cf0b0b424283f8e4486ef286/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockScanner.java

For example, LOG.warn("Periodic block scanner is not running") or LOG.info("Initialized block scanner with targetBytesPerSec {}"). Of course, you'd need to make sure those LOG statements are present in the Hadoop version included with CDH 6.3. Git blame suggests the LOG statements were added 6 years ago, so chances are you have them...

Thanks

Austin

> On 22 Oct 2020, at 14:44, TomK <[email protected]> wrote:
> 
> Thanks Austin. However none of these are open on a standard Cloudera 6.3 build.
> 
> # netstat -pnltu | grep -Ei "9866|1004|9864|9865|1006|9867"
> #
> 
> Would there be anything in the logs to indicate whether or not the block / volume scanner is running?
> 
> Thx,
> TK
> 
> On 10/22/2020 3:09 AM, Austin Hackett wrote:
>> Hi Tom
>>
>> I'm not too familiar with the CDH distribution, but this page has the default ports used by the DataNode:
>>
>> https://docs.cloudera.com/documentation/enterprise/latest/topics/cdh_ports.html
>>
>> I believe it's the settings for dfs.datanode.http.address/dfs.datanode.https.address that you're interested in (9864/9865).
>>
>> Since the data block scanner related config parameters are not set, the defaults of 3 weeks and 1MB should be applied.
>>
>> Thanks
>>
>> Austin
>>
>>> On 22 Oct 2020, at 06:35, TomK <[email protected]> wrote:
>>>
>>> Hey Austin, Sanjeev,
>>>
>>> Thanks once more! Took some time to review the pages. That was certainly very helpful. Appreciated!
>>>
>>> However, I tried to access https://dn01/blockScannerReport on a test Cloudera 6.3 cluster.
>>> Didn't work. Tried the following as well:
>>>
>>> http://dn01:50075/blockscannerreport?listblocks
>>>
>>> https://dn01:50075/blockscannerreport
>>>
>>> https://dn01:10006/blockscannerreport
>>>
>>> Checked that port 50075 is up ( netstat -pnltu ). There's no service on that port on the workers. Checked the pages:
>>>
>>> https://docs.cloudera.com/documentation/enterprise/5-14-x/topics/cdh_ig_ports_cdh5.html
>>>
>>> It is defined on the pages. Checked if the following is set:
>>>
>>> The following 2 configurations in hdfs-site.xml are the most used for block scanners:
>>>
>>> dfs.block.scanner.volume.bytes.per.second to throttle the scan bandwidth to configurable bytes per second. Default value is 1M. Setting this to 0 will disable the block scanner.
>>> dfs.datanode.scan.period.hours to configure the scan period, which defines how often a whole scan is performed. This should be set to a long enough interval to really take effect, for the reasons explained above. Default value is 3 weeks (504 hours). Setting this to 0 will use the default value. Setting this to a negative value will disable the block scanner.
>>>
>>> These are NOT explicitly set. Checked hdfs-site.xml. Nothing defined there. Checked the Configuration tab in the cluster. It's not defined either.
>>>
>>> Does this mean that the defaults are applied, OR does it mean that the block / volume scanner is disabled? I see the pages detail what values for these settings mean, but I didn't see any notes pertaining to the situation where both values are not explicitly set.
>>>
>>> Thx,
>>> TK
>>>
>>> On 10/21/2020 1:34 PM, संजीव (Sanjeev Tripurari) wrote:
>>>> Yes Austin,
>>>>
>>>> You are right: every datanode will do its block verification, which is sent as a health check report to the namenode.
>>>>
>>>> Regards
>>>> -Sanjeev
>>>>
>>>> On Wed, 21 Oct 2020 at 21:53, Austin Hackett <[email protected]> wrote:
>>>> Hi Tom
>>>>
>>>> It is my understanding that in addition to block verification on client reads, each data node runs a DataBlockScanner in a background thread that periodically verifies all the blocks stored on the data node. The dfs.datanode.scan.period.hours property controls how often this verification occurs.
>>>>
>>>> I think the reports are available via the data node /blockScannerReport HTTP endpoint, although I'm not sure I ever actually looked at one (add ?listblocks to get the verification status of each block).
>>>>
>>>> More info here:
>>>> https://blog.cloudera.com/hdfs-datanode-scanners-and-disk-checker-explained/
>>>>
>>>> Thanks
>>>>
>>>> Austin
>>>>
>>>>> On 21 Oct 2020, at 16:47, TomK <[email protected]> wrote:
>>>>>
>>>>> Hey Sanjeev,
>>>>>
>>>>> Alright. Thank you once more. This is clear.
>>>>>
>>>>> However, this poses an issue then. If during the two years disk drives develop bad blocks, but do not necessarily fail to the point that they cannot be mounted, the checksum would have changed since those filesystem blocks can no longer be read. However, from an HDFS perspective, since no checks are done regularly, that is not known. So HDFS still reports that the file is fine, in other words, no missing blocks. For example, if a disk is going bad, but those files are not read for two years, the system won't know that there is a problem.
>>>>> Even when removing a data node temporarily and re-adding the datanode, HDFS isn't checking, because that HDFS file isn't read.
>>>>>
>>>>> So let's assume this scenario. Data nodes dn01 to dn10 exist. Each data node has 10 x 10TB drives. And let's assume that there is one large file on those drives and it's replicated with a factor of 3.
>>>>>
>>>>> If during the two years the file isn't read, and 10 of those drives develop bad blocks or other underlying hardware issues, then it is possible that HDFS will still report everything fine, even with a replication factor of 3. Because with 10 disks failing, it's possible a block or sector has failed under each of the 3 copies of the data. But HDFS would NOT know, since nothing triggered a read of that HDFS file. Based on everything below, corruption is very much possible even with a replication factor of 3. At this point the file is unreadable, but HDFS still reports no missing blocks.
>>>>>
>>>>> Similarly, if once I take a data node out I adjust one of the files on the data disks, HDFS will not know and will still report everything fine. That is, until someone reads the file.
>>>>>
>>>>> Sounds like this is a very real possibility.
>>>>>
>>>>> Thx,
>>>>> TK
>>>>>
>>>>> On 10/21/2020 10:26 AM, संजीव (Sanjeev Tripurari) wrote:
>>>>>> Hi Tom
>>>>>>
>>>>>> Therefore, if I write a file to HDFS but access it two years later, then the checksum will be computed only twice, at the beginning of the two years and again at the end when a client connects? Correct? As long as no process ever accesses the file between now and two years from now, the checksum is never redone and compared to the two year old checksum in the fsimage?
>>>>>>
>>>>>> Yes, exactly: unless data is read, the checksum is not verified
>>>>>> (when data is written and when the data is read).
>>>>>> If the checksum is mismatched, there is no way to correct it; you will have to re-write that file.
>>>>>>
>>>>>> When the datanode is added back in, there is no real read operation on the files themselves. The datanode just reports the blocks but doesn't really read the blocks that are there to re-verify the files and ensure consistency?
>>>>>>
>>>>>> Yes, exactly: the datanode maintains the list of files and their blocks, which it reports, along with total disk size and used size.
>>>>>> The namenode only has the list of blocks; unless the datanode is connected, it won't know where the blocks are stored.
>>>>>>
>>>>>> Regards
>>>>>> -Sanjeev
>>>>>>
>>>>>> On Wed, 21 Oct 2020 at 18:31, TomK <[email protected]> wrote:
>>>>>> Hey Sanjeev,
>>>>>>
>>>>>> Thank you very much again. This confirms my suspicion.
>>>>>>
>>>>>> Therefore, if I write a file to HDFS but access it two years later, then the checksum will be computed only twice, at the beginning of the two years and again at the end when a client connects? Correct? As long as no process ever accesses the file between now and two years from now, the checksum is never redone and compared to the two year old checksum in the fsimage?
>>>>>>
>>>>>> When the datanode is added back in, there is no real read operation on the files themselves. The datanode just reports the blocks but doesn't really read the blocks that are there to re-verify the files and ensure consistency?
>>>>>>
>>>>>> Thx,
>>>>>> TK
>>>>>>
>>>>>> On 10/21/2020 12:38 AM, संजीव (Sanjeev Tripurari) wrote:
>>>>>>> Hi Tom,
>>>>>>>
>>>>>>> Every datanode sends a heartbeat to the namenode with the list of blocks it has.
>>>>>>>
>>>>>>> When a datanode that has been disconnected for a while reconnects, it will send a heartbeat to the namenode with the list of blocks it has (until then, the namenode will show under-replicated blocks).
>>>>>>> As soon as the datanode is connected to the namenode, it will clear the under-replicated blocks.
>>>>>>>
>>>>>>> When a client connects to read or write a file, it will run a checksum to validate the file.
>>>>>>>
>>>>>>> There is no independent process running to do checksums, as it would be a heavy process on each node.
>>>>>>>
>>>>>>> Regards
>>>>>>> -Sanjeev
>>>>>>>
>>>>>>> On Wed, 21 Oct 2020 at 00:18, Tom <[email protected]> wrote:
>>>>>>> Thank you. That part I understand and am OK with it.
>>>>>>>
>>>>>>> What I would like to know next is when the CRC32C checksum is run again and checked against the fsimage to confirm that the block file has not changed or become corrupted?
>>>>>>>
>>>>>>> For example, if I take a datanode out, and within 15 minutes plug it back in, does HDFS rerun the CRC32C on all data disks on that node to make sure the blocks are OK?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> TK
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>>> On Oct 20, 2020, at 1:39 PM, संजीव (Sanjeev Tripurari) <[email protected]> wrote:
>>>>>>>>
>>>>>>>> It's done as soon as a file is stored on disk.
>>>>>>>>
>>>>>>>> Sanjeev
>>>>>>>>
>>>>>>>> On Tuesday, 20 October 2020, TomK <[email protected]> wrote:
>>>>>>>> Thanks again.
>>>>>>>>
>>>>>>>> At what points is the checksum validated (checked) after that? For example, is it done on a daily basis or is it done only when the file is accessed?
>>>>>>>>
>>>>>>>> Thx,
>>>>>>>> TK
>>>>>>>>
>>>>>>>> On 10/20/2020 10:18 AM, संजीव (Sanjeev Tripurari) wrote:
>>>>>>>>> As soon as the file is written the first time, the checksum is calculated and updated in the fsimage (first in the edit logs), and the same is replicated to the other replicas.
>>>>>>>>>
>>>>>>>>> On Tue, 20 Oct 2020 at 19:15, TomK <[email protected]> wrote:
>>>>>>>>> Hi Sanjeev,
>>>>>>>>>
>>>>>>>>> Thank you. It does help.
>>>>>>>>>
>>>>>>>>> At what points is the checksum calculated?
>>>>>>>>>
>>>>>>>>> Thx,
>>>>>>>>> TK
>>>>>>>>>
>>>>>>>>> On 10/20/2020 3:03 AM, संजीव (Sanjeev Tripurari) wrote:
>>>>>>>>>> For missing blocks and corrupted blocks, do check that all the datanode services are up, that all of the disks where HDFS data is stored are accessible and have no issues, and that the hosts are reachable from the namenode.
>>>>>>>>>>
>>>>>>>>>> If you are able to re-generate the data and write it, great; otherwise Hadoop cannot correct itself.
>>>>>>>>> Could you please elaborate on this? Does it mean I have to continuously access a file for HDFS to be able to detect corrupt blocks and correct itself?
>>>>>>>>>
>>>>>>>>>> "Does HDFS check that the data node is up, data disk is mounted, path to the file exists and file can be read?"
>>>>>>>>>> -- Yes, only after it fails will it say missing blocks.
>>>>>>>>>>
>>>>>>>>>> "Or does it also do a filesystem check on that data disk as well as perhaps a checksum to ensure block integrity?"
>>>>>>>>>> -- Yes, every file checksum is maintained and cross-checked; if it fails, it will say corrupted blocks.
>>>>>>>>>>
>>>>>>>>>> Hope this helps.
>>>>>>>>>>
>>>>>>>>>> -Sanjeev
>>>>>>>>>>
>>>>>>>>>> On Tue, 20 Oct 2020 at 09:52, TomK <[email protected]> wrote:
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> HDFS Missing Blocks / Corrupt Blocks Logic: What are the specific checks done to determine a block is bad and needs to be replicated?
>>>>>>>>>>
>>>>>>>>>> Does HDFS check that the data node is up, data disk is mounted, path to the file exists and file can be read?
>>>>>>>>>>
>>>>>>>>>> Or does it also do a filesystem check on that data disk as well as perhaps a checksum to ensure block integrity?
>>>>>>>>>>
>>>>>>>>>> I've googled this quite a bit. I don't see the exact answer I'm looking for. I would like to know exactly what happens during file integrity verification that then constitutes missing blocks or corrupt blocks in the reports.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thank You,
>>>>>>>>>> TK.
>>>>>>>>>>
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thx,
>>>>>>>> TK.
>>>>>>
>>>>>> --
>>>>>> Thx,
>>>>>> TK.
>>>>>
>>>>> --
>>>>> Thx,
>>>>> TK.
>>>
>>> --
>>> Thx,
>>> TK.
>
> --
> Thx,
> TK.
