The Wikipedia entry for the "subset sum problem," which I noted in my prior comment in this thread, describes several (exponential time, of course) algorithms for solving it. I think these should be useful to those pursuing code for the problem discussed here.
-- Bert "An educated person is one who can entertain new ideas, entertain others, and entertain herself." On Tue, Apr 15, 2025 at 2:09 PM Ebert,Timothy Aaron <teb...@ufl.edu> wrote: > It is not clear if the problem is a generic example or if the requester is > only interested in the case where sample size is ten and the sum of the > sample is 20. If the true population size is 100 (as in the example) then > problem is trivial. However, there was described a "denied population, > where population members are marked by non-negative integers" that is > missing from population of 100. My assumption is that this is a dummy > example. The inefficiency in looking at all possible combinations to only > find a few successes is spending hours programming to save yourself a few > milliseconds in one run of the program. Take the short simple (but > inefficient) path as in the first program. It will work. If there is more > data, or if the example was not exactly like the real problem, then the > last program may help. > > 1) Poster stated that he wants to look at the "denied" subpopulation > within his data. Start by extracting these as the rest of the data are not > relevant. > 2) Poster stated that the denied population consisted of non-negative > integers (values >= 0). The distribution of these integers was not > specified. > 3) The smaller the problem the more likely it can be solved. Since the > sample size seems fixed, and the target sum is probably fixed the only > other approach is to reduce the quantity of data to process. The smaller > the set of data, the fewer resources it will take to get an answer. > A) Unless it is important to know the ratio of accepted to > rejected samples, discard all values that cannot produce a valid result. If > the target sum is 20, reject all values in the data that are greater than > this. > B) Look for other opportunities to reject data. The case in 3A is > just an obvious example. > C) How large is your population of values that have the potential > to result in a valid set? > 4) Are there other options? We do not know how this will be used? > A) can you use a subsample of all possible results? > > set.seed(2257) > values <- rnorm(10000, 0, 1000) # My population > values <- values[values >= 0] # Only non-negative values > values <- values[values < 16] # Only values that might possibly work > values <- floor(values) # Convert to integers > k <- 5 # Number of values to sum > target <- 15 # Desired total > > combos <- combn(values, k) > valid_combos <- combos[, colSums(combos) == target] > print(valid_combos) > > # From 100,000 values one run had 66 cases that could satisfy conditions. > # These 61 candidates generated a matrix using 257.5 MB. > # Of these there were 36985 valid combinations. > # This is without replacement. > > In contrast this version took 4 minutes to run, and a few tries at > increasing k resulted in exceeding memory. > set.seed(2257) > k <- 10 # Number of values to sum > target <- 8 # Desired total > values <- rnorm(10000, 0, 1000) # My population > values <- values[values >= 0] # Only non-negative values > values <- values[values < target+1] # Only values that might possibly > work > values <- floor(values) # Convert to integers > > combos <- combn(values, k) > valid_combos <- combos[, colSums(combos) == target] > print(valid_combos) > > > Here is a recursive approach that includes testing for cases where the > combination cannot possibly result in a valid answer. It is much faster. > However, increasing the size of values much beyond this level results in a > program that took over 2 hours without finishing. Clear your environment > between runs. This version returns 16,445 valid combinations. > > set.seed(2257) > k <- 10 # Number of values to sum > target <- 20 # Desired total > values <- floor(rnorm(50000, mean = 0, sd = 1000)) > values <- values[!is.na(values) & values >= 0 & values <= target] > #Data from this source cannot have NA, but the real data might > > find_combos <- function(values, k, target, start = 1, current = > integer(0), current_sum = 0, results = list()) { > if (length(values) == 0 || any(is.na(values))) return(results) > > # Base case > if (length(current) == k) { > if (current_sum == target) { > results[[length(results) + 1]] <- current > } > return(results) > } > > n <- length(values) > for (i in start:n) { > v <- values[i] > new_sum <- current_sum + v > remaining_slots <- k - length(current) - 1 > > if (is.na(new_sum)) next # Skip if v or current_sum is NA > > if (new_sum > target) break > if (remaining_slots > 0) { > if (new_sum + remaining_slots * min(values) > target) next > if (new_sum + remaining_slots * max(values) < target) next > } > > results <- find_combos(values, k, target, i + 1, c(current, v), > new_sum, results) > } > > return(results) > } > > valid_combos <- find_combos(values, k, target) > print(valid_combos) > > This could be refined further, but I have no idea if this is even close to > what was requested. I also have no idea if the values I have chosen are > realistic. Likely the variance I used in rnorm() is entirely wrong. > However, this should give a better idea of the sorts of things to consider > when attempting what at first looks like an insurmountable problem. You can > try jumping over the mountain but also consider reducing the size of the > mountain to make the jump easier. Also consider redefining the mountain. Do > not jump the entire mountain if you only need a small part. You could add > another test such that the program stops once it has found "this many" > valid solutions. > > I am terrible at recursive programming. I understand how it works, but my > brain does not work that way natively. The code is from ChatGPT, but a few > simple cases were checked using the first code. ChatGPT made a few errors > in initial tries. There could be more constraints that were not considered. > I only worked with the vector. If the vector is part of a data frame where > I need to know that "this zero" came from Bob, and the other zero came from > Samantha, then the code will not work as written. My solution would be to > ask ChatGPT more questions and keep checking to see that each improvement > gave a correct answer. ChatGPT is more likely to give junk answers as > program complexity increases, or maybe I have more difficulty in asking the > right questions as complexity increases. > > Tim > > > -----Original Message----- > From: Jeff Newmiller <jdnew...@dcn.davis.ca.us> > Sent: Tuesday, April 15, 2025 3:16 AM > To: r-help@r-project.org; Ebert,Timothy Aaron <teb...@ufl.edu>; Bert > Gunter <bgunter.4...@gmail.com>; Duncan Murdoch <murdoch.dun...@gmail.com> > Cc: r-help@r-project.org > Subject: Re: [R] Drawing a sample based on certain condition > > [External Email] > > Tim, do you know what you are talking about? If so, please do share sample > code so we can follow along. If you pick nine uniformly distributed > variables (the tenth is constrained) and reject sets that do not add to the > desired sum then only (1/10)^9 or so of the draws will not be rejected. > This is a vast amount of computation for a tiny fraction of valid draws. If > you progressively pick from yet-unallocated values that don't exceed the > total then the distributions of the variables will be very small for later > variables considered (unequal distributions). There may indeed be slick > algorithms for surmounting these problems but I cannot understand if you > are outlining such a solution from your description. > > On April 14, 2025 2:49:57 PM PDT, "Ebert,Timothy Aaron" <teb...@ufl.edu> > wrote: > > > >1) select only denied individuals from the original data. This is S. > >2) There is a fixed sample size of exactly t > >3) There is a fixed target sum T such that sum(t values from S) = T > > > >You can reduce the problem. All large values where the max(S) + (t-1 > smallest values) > T can be eliminated from S. Likewise if (t-1 max S) + > min(S) < T then the smallest values can be eliminated. > > > >You can estimate the number of options by asking how many ways can I draw > t objects from the size of S with replacement. This is the size of S raised > to the t power. If Joe has a score of 6 and Kim has a score of 6 these are > different outcomes. If not, then S can be reduced to the number of unique > values. > > > >Could you randomly sample S, filter for the sum=T and then use the random > sample? Do you need every case when there are millions of cases (or more)? > > > >Is a long execution time important, or what do you consider a long > execution time? Are you planning on looking at all t and T within a range? > > > >Tim > > > >-----Original Message----- > >From: R-help <r-help-boun...@r-project.org> On Behalf Of Bert Gunter > >Sent: Monday, April 14, 2025 4:16 PM > >To: Duncan Murdoch <murdoch.dun...@gmail.com> > >Cc: r-help@r-project.org > >Subject: Re: [R] Drawing a sample based on certain condition > > > >[External Email] > > > >Just a comment: > > > >You wish to draw subsets of size n with or without replacement -- and I > suspect without replacement is simpler than with -- from a set of positive > integers that sum to a fixed value T. This sounds related to the so-called > subset sum problem in computational complexity theory: Given a set S of > positive integers and positive total, T, is there *any* subset of S (i.e. a > sample without replacement) whose sum is T? This is known to be > NP-complete. So this means that listing all such subsets must be > NP-complete. I don't know whether specifying that, as you have, the size of > the subsets must be a fixed number (obviously?) simplifies the problem > sufficiently to make it computationally tractable; computational wizards or > suitable web searches might tell you this. But for these reasons, your > query about how to sample the set of all such subsets -- both those drawn > with or without replacement -- seems computationally difficult to me. > > > > > >Cheers, > >Bert > > > > > > > >"An educated person is one who can entertain new ideas, entertain others, > and entertain herself." > > > > > > > >On Mon, Apr 14, 2025 at 8:37 AM Duncan Murdoch > ><murdoch.dun...@gmail.com> > >wrote: > > > >> On 2025-04-14 7:26 a.m., Brian Smith wrote: > >> > Hi, > >> > > >> > For my analytical work, I need to draw a sample of certain sample > >> > size from a denied population, where population members are marked > >> > by non-negative integers, such that sum of sample members if fixed. > >> > For example, > >> > > >> > Population = 0:100 > >> > Sample_size = 10 > >> > Sample_Sum = 20 > >> > > >> > Under this setup if my sample members are X1, X2, ..., X10 then I > >> > should have X1+X2+...+X10 = 20 > >> > > >> > Sample drawing scheme may be with/without replacement > >> > > >> > Is there any R function to achieve this? One possibility is to > >> > employ naive trial-error approach, but this doesnt seem to be > >> > practical as it would take long time to get the final sample with > desired properties. > >> > > >> > Any pointer would be greatly appreciated. > >> > >> One general way to think of this problem is that you are defining a > >> distribution on the space of all possible samples of size 10, such > >> that the probability of a sample is X if the sum is 20, and zero > >> otherwise, and you want to sample from this distribution. > >> > >> There's probably a slick method to do that for your example, but if > >> you've got a general population instead of that special one, I doubt it. > >> > >> What I would do is the following: > >> > >> Define another distribution on samples that has probabilities that > >> depend on the sum of the sample, with the highest probabilities > >> attached to ones with the correct sum, and probabilities for other > >> sums declining with distance from the sum. For example, maybe > >> > >> P(sum) = Y/(1 + abs(sum - 20)) > >> > >> for some constant Y. > >> > >> You can use MCMC to sample from that distribution and then only keep > >> the samples where the sum is exactly equal to the target sum. If you > >> do that, you don't need to care about the value of Y. but you do need > >> to think about how proposed moves are made, and you probably need to > >> use a different function than the example above for acceptable > efficiency. > >> > >> Duncan Murdoch > >> > >> ______________________________________________ > >> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > >> https://stat/ > >> .ethz.ch%2Fmailman%2Flistinfo%2Fr-help&data=05%7C02%7Ctebert%40ufl.ed > >> u > >> %7C722a00e77d1343db507d08dd7b914147%7C0d4da0f84a314d76ace60a62331e1b8 > >> 4 > >> %7C0%7C0%7C638802585985433701%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGk > >> i > >> OnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyf > >> Q > >> %3D%3D%7C0%7C%7C%7C&sdata=KNgaZ7e92Zfs51%2F%2B%2B9%2BQkGI5hiSj9OzhDtF > >> z > >> iWXx2D0%3D&reserved=0 > >> PLEASE do read the posting guide > >> https://www/. > >> r-project.org%2Fposting-guide.html&data=05%7C02%7Ctebert%40ufl.edu%7C > >> 7 > >> 22a00e77d1343db507d08dd7b914147%7C0d4da0f84a314d76ace60a62331e1b84%7C > >> 0 > >> %7C0%7C638802585985456014%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnR > >> y > >> dWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D > >> % > >> 3D%7C0%7C%7C%7C&sdata=Vhraq88ySxlq4IUEqvoeRvNdhjvliYdz7VDDvmQzmkE%3D& > >> r > >> eserved=0 and provide commented, minimal, self-contained, > >> reproducible code. > >> > > > > [[alternative HTML version deleted]] > > > >______________________________________________ > >R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > >https://stat/. > >ethz.ch%2Fmailman%2Flistinfo%2Fr-help&data=05%7C02%7Ctebert%40ufl.edu%7 > >Cc623a7b217674da8b1ee08dd7bed5bfe%7C0d4da0f84a314d76ace60a62331e1b84%7C > >0%7C0%7C638802981570144984%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRy > >dWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3 > >D%7C0%7C%7C%7C&sdata=FESBnk%2Bs0euSW8a9WXBB4pW7W3ChgGGzujmZoD23zDE%3D&r > >eserved=0 PLEASE do read the posting guide > >https://www.r/ > >-project.org%2Fposting-guide.html&data=05%7C02%7Ctebert%40ufl.edu%7Cc62 > >3a7b217674da8b1ee08dd7bed5bfe%7C0d4da0f84a314d76ace60a62331e1b84%7C0%7C > >0%7C638802981570159758%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUs > >IlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C > >0%7C%7C%7C&sdata=u32xfqbftfIm%2FzZxlH9NvD1SXw6lRZulE0FoPBUJOvA%3D&reser > >ved=0 and provide commented, minimal, self-contained, reproducible > >code. > >______________________________________________ > >R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > >https://stat/. > >ethz.ch%2Fmailman%2Flistinfo%2Fr-help&data=05%7C02%7Ctebert%40ufl.edu%7 > >Cc623a7b217674da8b1ee08dd7bed5bfe%7C0d4da0f84a314d76ace60a62331e1b84%7C > >0%7C0%7C638802981570168291%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRy > >dWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3 > >D%7C0%7C%7C%7C&sdata=TOtcwGxtU%2ByIYZ1GEjneNEAUVqudwX9%2BA6FU3AkYPh4%3D > >&reserved=0 PLEASE do read the posting guide > >https://www.r/ > >-project.org%2Fposting-guide.html&data=05%7C02%7Ctebert%40ufl.edu%7Cc62 > >3a7b217674da8b1ee08dd7bed5bfe%7C0d4da0f84a314d76ace60a62331e1b84%7C0%7C > >0%7C638802981570176634%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUs > >IlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C > >0%7C%7C%7C&sdata=gPSpdirwgJOb0AoECt3Vd0DGdPLCq9dZcXoc8lYbxB4%3D&reserve > >d=0 and provide commented, minimal, self-contained, reproducible code. > > -- > Sent from my phone. Please excuse my brevity. > [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide https://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.