Thomas, I'm curious - what OS are you running this on, and how much memory does the computer have?
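If it's handy, something along these lines from a fresh R session would show the OS, R version, core count, and available RAM (the `ps` call is optional and assumes that package is installed; the shell command is Linux-only):

  Sys.info()[c("sysname", "release")]   # operating system and kernel release
  R.version.string                      # R version
  parallel::detectCores()               # logical cores visible to R
  # ps::ps_system_memory()              # total/available RAM, if the 'ps' package is installed
  # system("free -h")                   # same information from the shell on Linux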
Also - let me know if the code from my last message worked out as I hoped.

regards,
gregg

On Wednesday, December 11th, 2024 at 6:51 AM, Deramus, Thomas Patrick <tdera...@mgb.org> wrote:

> About to try this implementation.
>
> As a follow-up, this is the exact error:
>
> Lost warning messages
> Error: no more error handlers available (recursive errors?); invoking 'abort' restart
> Execution halted
> Error: cons memory exhausted (limit reached?)
> Error: cons memory exhausted (limit reached?)
> Error: cons memory exhausted (limit reached?)
> Error: cons memory exhausted (limit reached?)
>
> From: Gregg Powell <g.a.pow...@protonmail.com>
> Sent: Tuesday, December 10, 2024 7:52 PM
> To: Deramus, Thomas Patrick <tdera...@mgb.org>
> Cc: r-help@r-project.org <r-help@r-project.org>
> Subject: Re: [R] Cores hang when calling mcapply
>
> Hello Thomas,
>
> Consider that the primary bottleneck may be tied to memory usage and the complexity of pivoting extremely large datasets into wide formats with tens of thousands of unique values per column. Extremely large expansions of columns inherently stress both memory and CPU, and splitting into 110k separate data frames before pivoting and combining them again is likely causing resource overhead and system instability.
>
> Perhaps evaluate whether the presence/absence transformation can be done in a more memory-efficient manner without pivoting all at once. Since you are dealing with extremely large data, a more incremental or streaming approach may be necessary. Instead of splitting into thousands of individual data frames and trying to pivot each in parallel, consider a method that processes segments of data to incrementally build a large sparse matrix or a compressed representation, then combines the results at the end.
>
> It's probably better to move away from `pivot_wider()` on a massive scale and attempt a data.table-based approach, which is often more memory-efficient and faster for large-scale operations in R.
>
> Alternatively, data.table's `dcast()` can handle large data more efficiently, and data.table's in-memory operations often reduce overhead compared to tidyverse pivoting functions.
>
> Also, consider using data.table's `fread()` or `arrow::open_dataset()` directly with `as.data.table()` to keep everything in a data.table format. For example, you can do a large `dcast()` operation to create presence/absence columns by group. If your categories are extremely large, consider an approach that processes categories in segments, as I mentioned earlier, writes intermediate results to disk, and then combines/merges the results at the end.
>
> Limit parallelization when dealing with massive reshapes. Instead of trying to parallelize the entire pivot across thousands of subsets, run a single parallelized chunking approach that processes manageable subsets and writes out intermediate results (for example, using `fwrite()` for each subset). After processing, load and combine these intermediate results. This manual segmenting approach can circumvent the "zombie" processes you mentioned, which I think arise from overly complex parallel nesting and excessive memory utilization.
>
> If the presence/absence indicators are ultimately sparse (many zeros and few ones), consider storing the result in a sparse matrix format (for example, the `Matrix` package in R). Instead of creating thousands of columns as dense integers, using a sparse matrix representation should dramatically reduce memory. After processing the data into a sparse format, you can then save it in a suitable file format and only convert to a dense format if absolutely necessary.
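>
> Just to make that idea concrete, here is a minimal, untested sketch of the sparse-matrix approach for a single column. It reuses the `dt`, `ID_Key`, and `column1` names from the code further down; the intermediate objects (`sub1`, `ids`, `cats`, `presence`) are purely illustrative names:
>
> library(Matrix)
>
> # keep the non-missing (ID_Key, column1) pairs, as in Step C below
> sub1 <- dt[!is.na(column1), .(ID_Key, column1)]
> ids  <- factor(sub1$ID_Key)
> cats <- factor(sub1$column1)
>
> # one row per ID_Key, one column per category;
> # sparseMatrix() sums duplicate (i, j) pairs, giving a count per cell
> presence <- sparseMatrix(
>     i = as.integer(ids),
>     j = as.integer(cats),
>     x = 1,
>     dimnames = list(levels(ids), levels(cats))
> )
>
> # collapse the counts to strict 0/1 (logical) presence indicators
> presence <- presence > 0
>
> The same pattern would apply to column2, and the sparse result can be saved without ever densifying it (for example with `Matrix::writeMM()`).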
>
> Below is a reworked code segment using data.table for a more scalable approach. Note that this is a conceptual template; in practice, adapt the chunk sizes and filtering operations to your workflow. The idea is to avoid creating 110k separate data frames and to handle the pivot in a data.table manner that's more robust and less memory intensive. Here, presence/absence encoding is done by grouping and casting directly rather than repeatedly splitting and row-binding.
>
> library(data.table)
> library(arrow)
>
> # Step A: Load data efficiently as a data.table
> dt <- as.data.table(
>     open_dataset(
>         sources = input_files,
>         format = 'csv',
>         unify_schemas = TRUE,
>         col_types = schema(
>             "ID_Key" = string(),
>             "column1" = string(),
>             "column2" = string()
>         )
>     ) |>
>         collect()
> )
>
> # Step B: Clean the category strings once (they become column names after dcast)
> # Assume `crewjanitormakeclean` essentially standardizes these values
> dt[, column1 := janitor::make_clean_names(column1, allow_dupes = TRUE)]
> dt[, column2 := janitor::make_clean_names(column2, allow_dupes = TRUE)]
>
> # Step C: Create presence/absence indicators using data.table
> # dcast() with fun.aggregate = length gives a count per (ID_Key, category) cell;
> # any value > 0 means "present". For very many unique values, chunk if needed.
> out1 <- dcast(dt[!is.na(column1)], ID_Key ~ column1,
>               fun.aggregate = length, value.var = "column1")
> out2 <- dcast(dt[!is.na(column2)], ID_Key ~ column2,
>               fun.aggregate = length, value.var = "column2")
>
> # Step D: Align the two wide tables so they share the same columns,
> # filling the columns missing from either table with 0
> all_cols <- unique(c(names(out1), names(out2)))
> out1_missing <- setdiff(all_cols, names(out1))
> out2_missing <- setdiff(all_cols, names(out2))
>
> # Add missing columns with 0
> for (col in out1_missing) out1[, (col) := 0]
> for (col in out2_missing) out2[, (col) := 0]
>
> # Ensure column order alignment
> setcolorder(out1, all_cols)
> setcolorder(out2, all_cols)
>
> # Stack the two tables (they now share the same columns)
> final_dt <- rbindlist(list(out1, out2), use.names = TRUE, fill = TRUE)
>
> # Step E: Sum the indicators across the stacked rows for each ID_Key
> final_result <- final_dt[, lapply(.SD, sum, na.rm = TRUE),
>                          by = ID_Key,
>                          .SDcols = setdiff(names(final_dt), "ID_Key")]
>
> # final_result now holds per-ID_Key counts; values > 0 indicate presence
>
> Hope this helps!
> gregg
> somewhere in Arizona
>
> The information in this e-mail is intended only for the person to whom it is addressed. If you believe this e-mail was sent to you in error and the e-mail contains patient information, please contact the Mass General Brigham Compliance HelpLine at https://www.massgeneralbrigham.org/complianceline .
>
> Please note that this e-mail is not secure (encrypted). If you do not wish to continue communication over unencrypted e-mail, please notify the sender of this message immediately. Continuing to send or respond to e-mail after receiving this message means you understand and accept this risk and wish to continue to communicate over unencrypted e-mail.