Apologies, I left out 3 critical lines of code after the randomized
sample dataframe is created:
group_a <- d[ which(d$label =='A'), ]
group_b <- d[ which(d$label =='B'), ]
group_c <- d[ which(d$label =='C'), ]
On 2021-08-03 18:56, Tom Woolman wrote:
# Resending this message since the original email was held in queue by
the listserv software because of a "suspicious" subject line, and/or
because of the attached .png histogram charts. I'm guessing
that the listserv software doesn't like multiple image file
attachments.
Hi everyone. I'm working on a research model now that is calculating
anomaly scores (RMSE values) for three distinct groups within a large
dataset. The anomaly scores are a continuous data type and are quite
small, ranging from approximately 1e-04 to 1e-07 across a population
of approximately 1 million observations.
I have all of the summary and descriptive statistics for each of the
anomaly score distributions across each group label in the dataset,
and I am able to create some useful histograms showing how each of the
three groups is uniquely distributed across the range of scores.
However, because of the large variance within the frequency of score
values and the high density peaks within much of the anomaly scores, I
need to use a log transformation within the histogram to show both the
log frequency count of each binned observation range (y-axis) and a
log transformation of the binned score values (x-axis) to be able to
appropriately illustrate the distributions within the data and make it
more readily understandable.
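
To make the double log transformation concrete, here is a minimal base-R sketch of binning on log-spaced breaks and then taking the log of the bin counts. The simulated `scores` vector, the break spacing, and the `binned` data frame are assumptions made for illustration; they are not the real anomaly scores.

```r
# Hypothetical stand-in for the real anomaly scores:
# values spread log-uniformly between 1e-07 and 1e-04.
set.seed(42)
scores <- 10^runif(1000, min = -7, max = -4)

# Bin edges equally spaced in log10 space (12 bins, 4 per decade).
breaks <- 10^seq(-7, -4, by = 0.25)

# Count the observations in each bin without drawing a plot.
h <- hist(scores, breaks = breaks, plot = FALSE)

# Log-transform the counts; empty bins become NA instead of -Inf.
log_counts <- ifelse(h$counts > 0, log10(h$counts), NA)

binned <- data.frame(
  bin_lower   = head(breaks, -1),
  bin_upper   = tail(breaks, -1),
  count       = h$counts,
  log10_count = log_counts
)
print(binned)
```

Equal-width bins in log space keep the small and large score ranges equally resolved, and the log of the counts compresses the high-density peaks so that sparse bins stay visible.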
Fortunately, ggplot2 makes it easy to create attractive histograms
with log-transformed x and y axes.
However, I cannot figure out a way to create the log transformed
histograms to show each of my three groups by color within the same
histogram. I would like it to look like the plot below, but with a log
transformation on each axis. That plot shows the 3 groups in
one histogram but uses the default untransformed (linear) axes.
For log transformed axis values, the best I can do so far is produce
three separate histograms, one for each group.
Below is sample R code to illustrate my problem with a
randomly-generated example dataset and the ggplot2 approaches that I
have taken so far:
# Sample R code below:
library(ggplot2)
library(dplyr)
library(hrbrthemes)
# I created some simple random sample data to produce an example
# dataset.
# This produces an example dataframe called d, which contains a class
# label IV of either A, B or C for each observation. The target variable
# is the anomaly_score continuous value for each observation.
# There are 300 rows of dummy data in this dataframe.
DV_score_generator = round(runif(300,0.001,0.999), 3)
d <- data.frame( label = sample( LETTERS[1:3], 300, replace=TRUE,
prob=c(0.65, 0.30, 0.05) ), anomaly_score = DV_score_generator)
# First, I use ggplot to create the untransformed histogram that
# shows all 3 groups on the same plot, by color.
# Please note that with this small set of randomized sample data it
# doesn't appear to be necessary to use an x and y-axis log
# transformation to show the distribution patterns, but it does become
# an issue with my vastly larger and more complex score values in the DV
# of the actual data.
p <- d %>%
ggplot( aes(x=anomaly_score, fill=label)) +
geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity') +
scale_fill_manual(values=c("#69b3a2", "blue", "#404080")) +
theme_ipsum() +
labs(fill="")
p
# Produces a normal multiclass histogram.
# Now produce a series of x and y-axis log-transformed histograms,
# producing one histogram for each distinct label class in the dataset:
# Group A, log transformed
ggplot(group_a, aes(x = anomaly_score)) +
  geom_histogram(aes(y = ..count..), binwidth = 0.05,
                 colour = "darkgoldenrod1", fill = "darkgoldenrod2") +
  scale_x_continuous(name = "Log-scale Anomaly Score", trans="log2") +
  scale_y_continuous(trans="log2", name="Log-transformed Frequency Counts") +
  ggtitle("Transformed Anomaly Scores - Group A Only")
# Group A transformed histogram is produced here.
# Group B, log transformed
ggplot(group_b, aes(x = anomaly_score)) +
  geom_histogram(aes(y = ..count..), binwidth = 0.05,
                 colour = "green", fill = "darkgreen") +
  scale_x_continuous(name = "Log-scale Anomaly Score", trans="log2") +
  scale_y_continuous(trans="log2", name="Log-transformed Frequency Counts") +
  ggtitle("Transformed Anomaly Scores - Group B Only")
# Group B transformed histogram is produced here.
# Group C, log transformed
ggplot(group_c, aes(x = anomaly_score)) +
  geom_histogram(aes(y = ..count..), binwidth = 0.05,
                 colour = "red", fill = "darkred") +
  scale_x_continuous(name = "Log-scale Anomaly Score", trans="log2") +
  scale_y_continuous(trans="log2", name="Log-transformed Frequency Counts") +
  ggtitle("Transformed Anomaly Scores - Group C Only")
# Group C transformed histogram is produced here.
# End.
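
For reference, a minimal sketch of what a single combined plot might look like: it maps `fill` to the label, as in the first multi-group histogram, and adds the same two log2 scales used in the per-group plots. The data frame `d_example`, the seed, and the binwidth of 0.5 (in log2 units) are assumptions chosen for illustration, and this has not been checked against the real anomaly scores.

```r
library(ggplot2)

# Hypothetical example data with the same structure as the dataframe d above.
set.seed(1)
d_example <- data.frame(
  label = sample(LETTERS[1:3], 300, replace = TRUE,
                 prob = c(0.65, 0.30, 0.05)),
  anomaly_score = round(runif(300, 0.001, 0.999), 3)
)

# One histogram, coloured by group, with log2-transformed x and y axes.
# With a transformed x scale, binwidth is measured in log2 units.
p_log <- ggplot(d_example, aes(x = anomaly_score, fill = label)) +
  geom_histogram(binwidth = 0.5, colour = "#e9ecef", alpha = 0.6,
                 position = "identity") +
  scale_fill_manual(values = c("#69b3a2", "blue", "#404080")) +
  scale_x_continuous(name = "Log-scale Anomaly Score", trans = "log2") +
  scale_y_continuous(name = "Log-transformed Frequency Counts",
                     trans = "log2") +
  labs(fill = "") +
  ggtitle("Transformed Anomaly Scores - All Groups")
p_log
```

One caveat with a log-transformed count axis: bins with a count of zero have no finite log value, so ggplot2 will typically warn about them and drop them, and bins with a count of exactly 1 sit at log2(1) = 0 and may be hard to see. Overlapping bars with `position = "identity"` rely on the alpha setting to stay readable.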
Thanks in advance, everyone!
- Tom
Thomas A. Woolman, PhD Candidate (Indiana State University), MBA, MS,
MS
On Target Technologies, Inc.
Virginia, USA
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.