> On Jan 4, 2017, at 7:40 PM, David Wolfskill <r...@catwhisker.org> wrote: > > I have (time series) data extracted from a repository that is stored > such that each record is for an hour, but each record contains an > ordered set of values throughout the hour. In the following exmaple, > I'll show sets of 4, depicting 0, 15, 30, and 45 minutes after each > "start" point (respectively): > > test_data <- structure(list(start = c(1482793200L, 1482793200L, 1482793200L, > 1482793200L, 1482793200L, 1482793200L, 1482793200L, 1482793200L, > 1482793200L, 1482793200L, 1482793200L, 1482793200L, 1482793200L, > 1482793200L, 1482793200L), hostname = c("c001.example.net", > "c001.example.net", > "c001.example.net", "c001.example.net", "c001.example.net", > "c001.example.net", > "c001.example.net", "c001.example.net", "c001.example.net", > "c001.example.net", > "c001.example.net", "c161.example.net", "c161.example.net", > "c161.example.net", > "c161.example.net"), mtype = c("health", "health", "net", "net", > "net", "net", "net", "net", "net", "sys", "sys", "net", "sys", > "sys", "sys"), limit_type = c("fill", "serve", "", "", "", "", > "", "", "", "", "", "", "", "", ""), hw = c(1.16, 1.16, 1.16, > 1.16, 1.16, 1.16, 1.16, 1.16, 1.16, 1.16, 1.16, 1.21, 1.21, 1.21, > 1.21), fw = c("2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", > "2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", > "2017Q1.1.1", "2017Q1.1.1", "2016Q4.2.13", "2016Q4.2.13", "2016Q4.2.13", > "2016Q4.2.13"), tcp_state = c("", "", "", "", "closed", "closing", > "fin_wait_2", "last_ack", "syn_rcvd", "", "", "", "", "", ""), > value_type = c("limit", "limit", "", "", "", "", "", "", > "", "", "", "", "", "", ""), nic = c("all", "all", "", "", > "", "", "", "", "", "", "", "mce0", "", "", ""), name = > c("in_download_window", > "in_download_window", "tcpOutSegs", "tcpRetransSegs", > "tcp_connection_count", > "tcp_connection_count", "tcp_connection_count", "tcp_connection_count", > "tcp_connection_count", "CpuSystem", "CpuUser", "HCOutOctets", > "CpuIdle", "CpuSystem", "CpuUser"), values = c("[0.0, 0.0, 0.0, 0.0]", > "[0.0, 0.0, 0.0, 0.0]", "[260410.94547698632, 258469.54433635762, > 260579.2186617577, 258763.2815145043]", > "[18436.311524062934, 18248.952271420356, 18201.62259198662, > 17818.39529178736]", > "[5.0, 3.0, 3.0, 3.0]", "[3.0, 3.0, 2.0, 2.0]", "[670.0, 677.0, 685.0, > 729.0]", > "[1162.0, 1192.0, 1148.0, 1110.0]", "[25.0, 60.0, 71.0, 33.0]", > "[11.0, 10.0, 11.0, 10.0]", "[2.0, 2.0, 2.0, 2.0]", "[7.873191635959294E9, > 7.7377184658927E9, 7.876630519328283E9, 7.714521544912713E9]", > "[70.0, 70.0, 70.0, 70.0]", "[27.0, 26.0, 27.0, 26.0]", "[4.0, 4.0, 4.0, > 4.0]" > )), .Names = c("start", "hostname", "mtype", "limit_type", > "hw", "fw", "tcp_state", "value_type", "nic", "name", "values" > ), class = "data.frame", row.names = c(NA, -15L)) > > > So of the 15 rows in the above example, row 8 (which depicts the TCP > connection counts in the "last ACK" state) has the values: > > * 1162.0 > * 1192.0 > * 1148.0 > * 1110.0 > > It seems to me that what is wanted is for each of the existing rows to be > replaced by a set of 4 (in this case) rows, where the other columns are > the same (save for "start", which is a timestamp, and should be adjusted > for the respective times).
Perhaps something like this: # function to read the values in 'values': parse_values <- function(x) {scan(text= gsub( "\\[|\\]","",x), sep=",") } # the apply function reads line-by-line new_dat <- apply(test_data, 1, function(d) data.frame( as.list(d[!names(d) %in% "values"]), nvals <- parse_values(d['values']) ) ) Read 4 items Read 4 items Read 4 items Read 4 items Read 4 items Read 4 items Read 4 items Read 4 items Read 4 items Read 4 items Read 4 items Read 4 items Read 4 items Read 4 items Read 4 items # Could suppress the report from scan by adding quiet = TRUE # now take this list of 4 line data.frames and "rbind" them # If you wanted these to remain character you would use stringsAsFactors=FALSE in the data.frame call > new_df <- do.call("rbind", new_dat) > head(new_df) start hostname mtype limit_type hw fw tcp_state value_type nic 1 1482793200 c001.example.net health fill 1.16 2017Q1.1.1 limit all 2 1482793200 c001.example.net health fill 1.16 2017Q1.1.1 limit all 3 1482793200 c001.example.net health fill 1.16 2017Q1.1.1 limit all 4 1482793200 c001.example.net health fill 1.16 2017Q1.1.1 limit all 5 1482793200 c001.example.net health serve 1.16 2017Q1.1.1 limit all 6 1482793200 c001.example.net health serve 1.16 2017Q1.1.1 limit all name nvals....parse_values.d..values... 1 in_download_window 0 2 in_download_window 0 3 in_download_window 0 4 in_download_window 0 5 in_download_window 0 6 in_download_window 0 str(new_df) 'data.frame': 60 obs. of 11 variables: $ start : Factor w/ 1 level "1482793200": 1 1 1 1 1 1 1 1 1 1 ... $ hostname : Factor w/ 2 levels "c001.example.net",..: 1 1 1 1 1 1 1 1 1 1 ... $ mtype : Factor w/ 3 levels "health","net",..: 1 1 1 1 1 1 1 1 2 2 ... $ limit_type : Factor w/ 3 levels "fill","serve",..: 1 1 1 1 2 2 2 2 3 3 ... $ hw : Factor w/ 2 levels "1.16","1.21": 1 1 1 1 1 1 1 1 1 1 ... $ fw : Factor w/ 2 levels "2017Q1.1.1","2016Q4.2.13": 1 1 1 1 1 1 1 1 1 1 ... $ tcp_state : Factor w/ 6 levels "","closed","closing",..: 1 1 1 1 1 1 1 1 1 1 ... $ value_type : Factor w/ 2 levels "limit","": 1 1 1 1 1 1 1 1 2 2 ... $ nic : Factor w/ 3 levels "all","","mce0": 1 1 1 1 1 1 1 1 2 2 ... $ name : Factor w/ 8 levels "in_download_window",..: 1 1 1 1 1 1 1 1 2 2 ... $ nvals....parse_values.d..values...: num 0 0 0 0 0 ... > > I'm fairly sure I can write code to do that, but it would end up being > something like Perl implemented in R, which seems fairly grotesque: I > can't help but think that there ought to be a ... more elegant approach > in R (which is why I am sking for help). > > (I will also end up collecting all of the records for a given timestamp > and hostname, and creating one very wide record with all of the data > from the set of records thus found. I already have (yes, Perl) code to > do this -- though if there's a reasonable way to avoid that, I'm > interested.) I thought you wanted the data in long form. > > Once that's all done, I'll be examining various columns, subsetting > by attributes of the systems being compared -- but I already have > code to do that (that makes use of a different -- and rather more > fragile -- approach for extracting the data from its repository). > > Thanks! > > Peace, > david > -- > David H. Wolfskill r...@catwhisker.org > Epistemology for post-truthers: How do we select parts of reality to ignore? > > See http://www.catwhisker.org/~david/publickey.gpg for my public key. > ______________________________________________ > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. David Winsemius Alameda, CA, USA ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.