As the total number of files on our server was exploding (~2.5 million files / 1 TB), I wrote a simple shell script that used find to tell me how many files each user has. So far so good.
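For context, it's roughly along these lines (a simplified sketch rather than the exact script; /home just stands in for our data root):

  # Tally file count and total bytes per owner under a tree.
  find /home -type f -printf '%u %s\n' 2>/dev/null |
    awk '{ files[$1]++; bytes[$1] += $2 }
         END { for (u in files)
                 printf "%-12s %10d files %14d bytes\n", u, files[u], bytes[u] }' |
    sort -k2,2 -rn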
But I want to drill down further:

* Are there lots of duplicate files? I suspect so: things like job submission scripts that users copy rather than link, etc. (fdupes seems puny for a job of this scale.)
* What is the most common file (or filename)?
* A distribution of file types (executables, NetCDF, movies, text) and their prevalence.
* A distribution of file age and prevalence (to know how much of this material is archivable). The same for frequency of access, i.e. perhaps the last-access timestamp.
* A file-size versus file-count plot, i.e. is 20% of the space occupied by 80% of the files? etc.

I've used cushion plots in the past (SequoiaView, pydirstat), but those seem more desktop-oriented than suited to a job of this scale. Essentially I want to data-mine my file usage to strategize. Are there any tools for this? Writing a new find each time seems laborious.

I suspect forensics might also help identify anomalies in usage across users which could be indicative of other maladies, e.g. a user whose runaway job wrote a 500 GB file. Essentially, are there any "filesystem metadata mining tools"?

--
Rahul
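P.S. In the meantime, to avoid writing a new find for every question, I'm considering dumping the metadata once into a flat file and mining that instead. A rough sketch (untested; the paths and field list are just guesses at what would be useful):

  # One pass over the tree: owner, size, mtime, atime, path (tab-separated).
  # Caveat: paths containing tabs or newlines would break this format.
  find /home -type f -printf '%u\t%s\t%T@\t%A@\t%p\n' > /tmp/fsmeta.tsv

  # Example query against the dump: file-age distribution (years since
  # last modification), to get a feel for how much is archivable.
  now=$(date +%s)
  awk -F'\t' -v now="$now" \
      '{ age = int((now - $3) / 31536000); hist[age]++ }
       END { for (a in hist) print a " yr: " hist[a] " files" }' \
      /tmp/fsmeta.tsv | sort -n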