Thomas,
I'm curious - what OS are you running this on, and how much memory does the 
computer have? 

Let me know if that code worked out as I hoped.

regards,
gregg


On Wednesday, December 11th, 2024 at 6:51 AM, Deramus, Thomas Patrick 
<tdera...@mgb.org> wrote:

> About to try this implementation.
> 

> As a follow-up, this is the exact error:
> 

> Lost warning messages
> Error: no more error handlers available (recursive errors?); invoking 'abort' 
> restart
> Execution halted
> Error: cons memory exhausted (limit reached?)
> Error: cons memory exhausted (limit reached?)
> Error: cons memory exhausted (limit reached?)
> Error: cons memory exhausted (limit reached?)
> 

> 

> 

> From: Gregg Powell <g.a.pow...@protonmail.com>
> Sent: Tuesday, December 10, 2024 7:52 PM
> To: Deramus, Thomas Patrick <tdera...@mgb.org>
> Cc: r-help@r-project.org <r-help@r-project.org>
> Subject: Re: [R] Cores hang when calling mcapply
> 

> Hello Thomas,
> 

> Consider that the primary bottleneck may be tied to memory usage and the 
> complexity of pivoting extremely large datasets into wide formats with tens 
> of thousands of unique values per column. Extremely large expansions of 
> columns inherently stress both memory and CPU, and splitting into 110k 
> separate data frames before pivoting and combining them again is likely 
> causing resource overhead and system instability.
> 

> Perhaps evaluate whether the presence/absence transformation can be done in a
> more memory-efficient manner without pivoting all at once. Since you are
> dealing with extremely large data, a more incremental or streaming approach
> may be necessary. Instead of splitting into thousands of individual data
> frames and trying to pivot each in parallel, consider a method that processes
> segments of data to incrementally build a large sparse matrix or a compressed
> representation, then combines the results at the end.
> 
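> As a rough illustration of that idea (an untested sketch, assuming the inputs
> are the CSV files in `input_files` and the columns are named `ID_Key` and
> `column1` as in the code further down), each file can be reduced to its
> distinct ID/category pairs before anything is pivoted:
> 
> > library(data.table)
> >
> > # Sketch: process one file at a time, keep only the distinct
> > # ID/category pairs (all that presence/absence needs), then pivot once
> > # at the end over the much smaller combined table.
> > pair_list <- vector("list", length(input_files))
> > for (i in seq_along(input_files)) {
> >   chunk <- fread(input_files[i], select = c("ID_Key", "column1"))
> >   pair_list[[i]] <- unique(chunk[!is.na(column1)])
> >   rm(chunk); gc()
> > }
> > pairs <- unique(rbindlist(pair_list))
> >
> > wide <- dcast(pairs, ID_Key ~ column1, fun.aggregate = length,
> >               value.var = "column1")
> 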

> It's probably better to move away from `pivot_wider()` at this scale and to
> use a data.table-based approach, which is often more memory-efficient and
> faster for large-scale operations in R.
> 

> As an alternative, data.table's `dcast()` can handle large data more
> efficiently, and data.table's in-memory operations often reduce overhead
> compared to tidyverse pivoting functions.
> 

> Also - consider using data.table's `fread()` or `arrow::open_dataset()`
> directly with `as.data.table()` to keep everything in a data.table format.
> For example, you can do a large `dcast()` operation to create
> presence/absence columns by group. If your categories are extremely numerous,
> consider an approach that processes categories in segments, as I mentioned
> earlier, and writes intermediate results to disk, then combines/merges the
> results at the end.
> 

> Limit parallelization when dealing with massive reshapes. Instead of trying
> to parallelize the entire pivot across thousands of subsets, use a single
> parallelized chunking approach that processes manageable subsets and writes
> out intermediate results (for example, using `fwrite()` for each subset).
> After processing, load and combine these intermediate results. This manual
> segmenting approach can circumvent the "zombie" processes you mentioned,
> which I think arise from overly complex parallel nesting and excessive memory
> utilization.
> 
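> Something along these lines could work (only a sketch - the chunk count, file
> names, and core count are placeholders, and it assumes the data are already
> loaded into a data.table `dt` as in the code below):
> 
> > library(data.table)
> > library(parallel)
> >
> > # Sketch: split the ID_Key values into a modest number of chunks,
> > # pivot each chunk, write it to disk, then read back and combine.
> > ids       <- unique(dt$ID_Key)
> > n_chunks  <- 20                          # placeholder; tune to your RAM
> > id_groups <- split(ids, cut(seq_along(ids), n_chunks, labels = FALSE))
> >
> > chunk_files <- mclapply(seq_along(id_groups), function(i) {
> >   sub  <- dt[ID_Key %chin% id_groups[[i]] & !is.na(column1)]
> >   wide <- dcast(sub, ID_Key ~ column1, fun.aggregate = length,
> >                 value.var = "column1")
> >   path <- sprintf("chunk_%03d.csv", i)   # placeholder file name
> >   fwrite(wide, path)
> >   path
> > }, mc.cores = 4)                         # keep the core count modest
> >
> > # fill = TRUE aligns columns that only appear in some chunks
> > final_wide <- rbindlist(lapply(unlist(chunk_files), fread), fill = TRUE)
> > setnafill(final_wide, fill = 0,
> >           cols = setdiff(names(final_wide), "ID_Key"))
> 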

> If the presence/absence indicators are ultimately sparse (many zeros and few
> ones), consider storing the result in a sparse matrix format (for example,
> with the `Matrix` package in R). Instead of creating thousands of columns as
> dense integers, a sparse matrix representation should dramatically reduce
> memory use. After processing the data into a sparse format, you can save it
> in a suitable file format and only convert to a dense format if absolutely
> necessary.
> 
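> For the sparse route, something like this (again only a sketch, assuming
> `dt`, `ID_Key`, and `column1` as in the code below) keeps the indicators in
> sparse form throughout:
> 
> > library(Matrix)
> >
> > # Sketch: build a sparse presence/absence matrix directly from the
> > # ID/category pairs instead of materializing a dense wide table.
> > sub  <- dt[!is.na(column1)]
> > ids  <- factor(sub$ID_Key)
> > cats <- factor(sub$column1)
> >
> > presence <- sparseMatrix(
> >   i = as.integer(ids),
> >   j = as.integer(cats),
> >   x = 1,
> >   dims = c(nlevels(ids), nlevels(cats)),
> >   dimnames = list(levels(ids), levels(cats))
> > )
> >
> > # duplicate pairs are summed by sparseMatrix(); collapse to 0/1
> > presence@x[presence@x > 0] <- 1
> >
> > # keep it compact on disk; densify only if absolutely necessary
> > saveRDS(presence, "presence_column1.rds")   # placeholder file name
> 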

> Below is a reworked code segment using data.table for a more scalable
> approach. Note that this is a conceptual template; in practice, adapt the
> chunk sizes and filtering operations to your workflow. The idea is to avoid
> creating 110k separate data frames and to handle the pivot in a data.table
> manner that's more robust and less memory intensive. Here, presence/absence
> encoding is done by grouping and casting directly rather than by repeatedly
> splitting and row-binding.
> 

> > library(data.table)
> > library(arrow)
> >
> > # Step A: Load data efficiently as a data.table
> > dt <- as.data.table(
> >   open_dataset(
> >     sources = input_files,
> >     format = 'csv',
> >     unify_schemas = TRUE,
> >     col_types = schema(
> >       "ID_Key" = string(),
> >       "column1" = string(),
> >       "column2" = string()
> >     )
> >   ) |>
> >     collect()
> > )
> >
> > # Step B: Clean names once
> > # Assume `crewjanitormakeclean` essentially standardizes column names
> > dt[, column1 := janitor::make_clean_names(column1, allow_dupes = TRUE)]
> > dt[, column2 := janitor::make_clean_names(column2, allow_dupes = TRUE)]
> >
> > # Step C: Create presence/absence indicators using data.table
> > # dcast() pivots wide; fun.aggregate = length gives a count per
> > # ID_Key/category combination (0 = absent). For very many unique
> > # values, consider the chunked approach described above.
> > out1 <- dcast(dt[!is.na(column1)], ID_Key ~ column1,
> >               fun.aggregate = length, value.var = "column1")
> > out2 <- dcast(dt[!is.na(column2)], ID_Key ~ column2,
> >               fun.aggregate = length, value.var = "column2")
> >
> > # Step D: Merge the two wide tables by ID_Key
> > # Fill missing columns with 0 using data.table on-the-fly operations
> > all_cols <- unique(c(names(out1), names(out2)))
> > out1_missing <- setdiff(all_cols, names(out1))
> > out2_missing <- setdiff(all_cols, names(out2))
> >
> > # Add missing columns with 0
> > for (col in out1_missing) out1[, (col) := 0]
> > for (col in out2_missing) out2[, (col) := 0]
> >
> > # Ensure column order alignment if needed
> > setcolorder(out1, all_cols)
> > setcolorder(out2, all_cols)
> >
> > # Combine by ID_Key (since they share the same columns now)
> > final_dt <- rbindlist(list(out1, out2), use.names = TRUE, fill = TRUE)
> >
> > # Step E: If needed, summarize across ID_Key to sum presence indicators
> > final_result <- final_dt[, lapply(.SD, sum, na.rm = TRUE),
> >                          by = ID_Key,
> >                          .SDcols = setdiff(names(final_dt), "ID_Key")]
> >
> > # final_result now contains summed presence counts per ID_Key
> > # (0 = absent, >= 1 = present)
> 

> Hope this helps!
> gregg
> somewhere in Arizona
> 



