[R] Help merging large datasets in R

Deelen, M. (Mirjam) via R-help Wed, 07 May 2025 13:24:52 -0700

Hi guys,
For my MSc thesis I am using RStudio. The goal is to merge six relatively
large datasets (minimum of 200,000 and maximum of 2 million rows). I have now
been able to do so, but I think something might be going wrong in my code.
For reference: dataset 1 has 200,000 rows and dataset 2 has 600,000 rows.
Dataset 3 (2 million) and dataset 4 (2 million) were combined into one dataset
of 4 million rows, and dataset 5 (4 million) and dataset 6 (4 million) into one
dataset of 8 million rows.
What I have done so far is the following:


  *   Merged dataset 1 and dataset 2 using the following code:
merged_1 <- dataset_2[dataset_1, nomatch = NA]
This results in a dataset of 600,000 rows (which looks right).
  *   Merged the dataset merged 1 and datasets 3/4 using the following code:
merged_2 <- dataset_34[merged_1, nomatch = NA, allow.cartesian = TRUE]
This results in a dataset of 21 million rows (as expected). To this I have
applied an additional criterion (dates in datasets 3/4 should be within 365
days of the dates in merged 1), which reduces merged 2 to around 170,000 rows.
  *   Merged the dataset merged 2 and datasets 5/6 using the following code:
merged_3 <- dataset_56[merged_2, nomatch = NA, allow.cartesian = TRUE]
Again, this results in a dataset of 8 million rows (as expected). And again, to
this I have applied an additional criterion (dates in datasets 5/6 should be
within 365 days of the dates in merged 2), which reduces merged 3 to around
50,000 rows.
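
The first two steps above can be sketched on toy data as follows. This is only
a minimal illustration, not the actual thesis code: the key column `id`, the
date columns `date_1` and `date_34`, the `on = "id"` argument and all toy
values are assumptions made for the example.

```r
library(data.table)

# Hypothetical toy data; column names and values are assumptions.
dataset_1 <- data.table(
  id = 1:3,
  date_1 = as.Date(c("2020-01-01", "2020-06-01", "2021-01-01"))
)
dataset_2 <- data.table(id = c(1L, 1L, 2L, 3L), value_2 = c("a", "b", "c", "d"))
dataset_34 <- data.table(
  id = c(1L, 1L, 2L, 4L),
  date_34 = as.Date(c("2020-03-01", "2022-03-01", "2020-07-01", "2020-01-01"))
)

# Step 1: keep every row of dataset_1, matched to its rows in dataset_2.
merged_1 <- dataset_2[dataset_1, on = "id", nomatch = NA]

# Step 2: keep every row of merged_1; multiple matches multiply the rows,
# and ids without a match (id 3 here) get NA in date_34.
merged_2 <- dataset_34[merged_1, on = "id", nomatch = NA, allow.cartesian = TRUE]

# Date criterion: data.table treats NA in a logical filter as FALSE,
# so rows with a missing date_34 are dropped here as well.
day_gap <- merged_2[, abs(as.numeric(date_34 - date_1, units = "days"))]
merged_2_filtered <- merged_2[day_gap <= 365]

# Distinct cases before and after the filter.
cases_before <- uniqueN(merged_1$id)
cases_after <- uniqueN(merged_2_filtered$id)
print(c(before = cases_before, after = cases_after))
```

In this toy example the case with no match in datasets 3/4 survives the join
(as a row with NA) but disappears once the date criterion is applied.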

What I'm now wondering is how the merging plus the additional criteria can lead
to such a loss of cases. The first merge, of dataset 1 and dataset 2, results
in a number of rows that I think should be the final number of cases. I
understand that adding an additional criterion reduces the number of possible
matches when merging datasets 3/4 and 5/6, but I'm not sure this should lead to
SUCH a loss. Besides this, the additional criteria were added to reduce the
duplication of information that happens when merging datasets 3/4 and 5/6.
All cases appear once in dataset 1, but can appear several more times in the
following datasets (say twice in dataset 2, four times in datasets 3/4 and
eight times in datasets 5/6). This results in a 1 x 2 x 4 x 8 duplication of
information when merging the datasets without additional criteria.
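
The 1 x 2 x 4 x 8 multiplication can be reproduced with a single made-up case.
The sketch below assumes a shared key column `id`; the table names `d1`, `d2`,
`d34`, `d56` and the value columns are hypothetical.

```r
library(data.table)

# One case that appears 1, 2, 4 and 8 times in the successive tables.
d1 <- data.table(id = 1L)
d2 <- data.table(id = rep(1L, 2), v2 = 1:2)
d34 <- data.table(id = rep(1L, 4), v34 = 1:4)
d56 <- data.table(id = rep(1L, 8), v56 = 1:8)

m1 <- d2[d1, on = "id"]                           # 1 x 2 rows
m2 <- d34[m1, on = "id", allow.cartesian = TRUE]  # 2 x 4 rows
m3 <- d56[m2, on = "id", allow.cartesian = TRUE]  # 8 x 8 rows

print(c(m1 = nrow(m1), m2 = nrow(m2), m3 = nrow(m3)))
```

The row counts grow as 2, 8 and 64, while the number of distinct cases stays
at one.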
To sum this up, my questions are:

  *   Are there any tips on how to avoid this duplication? (Then I could drop
the additional criteria, and the final number of cases would probably
increase.)
  *   Or are there any tips on how to figure out where in these steps cases
are lost?
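
As an illustration of the second question, one possible way to trace lost
cases is to count rows and distinct cases after each step and to list the ids
that drop out with an anti-join. This is a sketch under assumptions: the key
column `id` and the tiny stand-in tables are hypothetical, not the real data.

```r
library(data.table)

# Hypothetical stand-ins for the real tables; only the id column matters here.
merged_1 <- data.table(id = c(1L, 1L, 2L, 3L))
merged_2 <- data.table(id = c(1L, 1L, 2L))  # after the first date criterion
merged_3 <- data.table(id = 1L)             # after the second date criterion

steps <- list(merged_1 = merged_1, merged_2 = merged_2, merged_3 = merged_3)

# Rows versus distinct cases at each step.
counts <- data.table(
  step = names(steps),
  rows = vapply(steps, nrow, integer(1)),
  cases = vapply(steps, function(d) uniqueN(d$id), numeric(1))
)
print(counts)

# Anti-join: ids present in merged_1 that have no row left in merged_2.
lost_in_step_2 <- unique(merged_1[!merged_2, on = "id"], by = "id")
print(lost_in_step_2)
```

Comparing `rows` with `cases` separates duplication (rows growing) from actual
loss of cases (distinct ids shrinking).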

Thanks!
Mirjam



______________________________________________
[email protected] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
