Re: [R] Drawing a sample based on certain condition

Bert Gunter Tue, 15 Apr 2025 14:30:03 -0700

The Wikipedia entry for the "subset sum problem," which I noted in my prior
comment in this thread, describes several (exponential time, of course)
algorithms for solving it. I think these should be useful to those pursuing
code for the problem discussed here.


-- Bert

"An educated person is one who can entertain new ideas, entertain others,
and entertain herself."



On Tue, Apr 15, 2025 at 2:09 PM Ebert,Timothy Aaron <teb...@ufl.edu> wrote:

> It is not clear if the problem is a generic example or if the requester is
> only interested in the case where sample size is ten and the sum of the
> sample is 20. If the true population size is 100 (as in the example) then
> problem is trivial. However, there was described a "denied population,
> where population members are marked by non-negative integers" that is
> missing from population of 100. My assumption is that this is a dummy
> example. The inefficiency in looking at all possible combinations to only
> find a few successes is spending hours programming to save yourself a few
> milliseconds in one run of the program. Take the short simple (but
> inefficient) path as in the first program. It will work. If there is more
> data, or if the example was not exactly like the real problem, then the
> last program may help.
>
> 1) Poster stated that he wants to look at the "denied" subpopulation
> within his data. Start by extracting these as the rest of the data are not
> relevant.
> 2) Poster stated that the denied population consisted of non-negative
> integers (values >= 0). The distribution of these integers was not
> specified.
> 3) The smaller the problem the more likely it can be solved. Since the
> sample size seems fixed, and the target sum is probably fixed the only
> other approach is to reduce the quantity of data to process. The smaller
> the set of data, the fewer resources it will take to get an answer.
>         A) Unless it is important to know the ratio of accepted to
> rejected samples, discard all values that cannot produce a valid result. If
> the target sum is 20, reject all values in the data that are greater than
> this.
>         B) Look for other opportunities to reject data. The case in 3A is
> just an obvious example.
>         C) How large is your population of values that have the potential
> to result in a valid set?
> 4) Are there other options? We do not know how this will be used?
>         A) can you use a subsample of all possible results?
>
> set.seed(2257)
> values <- rnorm(10000, 0, 1000)  # My population
> values <- values[values >= 0]    # Only non-negative values
> values <- values[values < 16]    # Only values that might possibly work
> values <- floor(values)          # Convert to integers
> k <- 5            # Number of values to sum
> target <- 15      # Desired total
>
> combos <- combn(values, k)
> valid_combos <- combos[, colSums(combos) == target]
> print(valid_combos)
>
> # From 100,000 values one run had 66 cases that could satisfy conditions.
> # These 61 candidates generated a matrix using 257.5 MB.
> # Of these there were 36985 valid combinations.
> # This is without replacement.
>
> In contrast this version took 4 minutes to run, and a few tries at
> increasing k resulted in exceeding memory.
> set.seed(2257)
> k <- 10            # Number of values to sum
> target <- 8      # Desired total
> values <- rnorm(10000, 0, 1000)  # My population
> values <- values[values >= 0]    # Only non-negative values
> values <- values[values < target+1]    # Only values that might possibly
> work
> values <- floor(values)          # Convert to integers
>
> combos <- combn(values, k)
> valid_combos <- combos[, colSums(combos) == target]
> print(valid_combos)
>
>
> Here is a recursive approach that includes testing for cases where the
> combination cannot possibly result in a valid answer. It is much faster.
> However, increasing the size of values much beyond this level results in a
> program that took over 2 hours without finishing. Clear your environment
> between runs. This version returns 16,445 valid combinations.
>
> set.seed(2257)
> k <- 10            # Number of values to sum
> target <- 20      # Desired total
> values <- floor(rnorm(50000, mean = 0, sd = 1000))
> values <- values[!is.na(values) & values >= 0 & values <= target]
> #Data from this source cannot have NA, but the real data might
>
> find_combos <- function(values, k, target, start = 1, current =
> integer(0), current_sum = 0, results = list()) {
>   if (length(values) == 0 || any(is.na(values))) return(results)
>
>   # Base case
>   if (length(current) == k) {
>     if (current_sum == target) {
>       results[[length(results) + 1]] <- current
>     }
>     return(results)
>   }
>
>   n <- length(values)
>   for (i in start:n) {
>     v <- values[i]
>     new_sum <- current_sum + v
>     remaining_slots <- k - length(current) - 1
>
>     if (is.na(new_sum)) next  # Skip if v or current_sum is NA
>
>     if (new_sum > target) break
>     if (remaining_slots > 0) {
>       if (new_sum + remaining_slots * min(values) > target) next
>       if (new_sum + remaining_slots * max(values) < target) next
>     }
>
>     results <- find_combos(values, k, target, i + 1, c(current, v),
> new_sum, results)
>   }
>
>   return(results)
> }
>
> valid_combos <- find_combos(values, k, target)
> print(valid_combos)
>
> This could be refined further, but I have no idea if this is even close to
> what was requested. I also have no idea if the values I have chosen are
> realistic. Likely the variance I used in rnorm() is entirely wrong.
> However, this should give a better idea of the sorts of things to consider
> when attempting what at first looks like an insurmountable problem. You can
> try jumping over the mountain but also consider reducing the size of the
> mountain to make the jump easier. Also consider redefining the mountain. Do
> not jump the entire mountain if you only need a small part. You could add
> another test such that the program stops once it has found "this many"
> valid solutions.
>
> I am terrible at recursive programming. I understand how it works, but my
> brain does not work that way natively. The code is from ChatGPT, but a few
> simple cases were checked using the first code. ChatGPT made a few errors
> in initial tries. There could be more constraints that were not considered.
> I only worked with the vector. If the vector is part of a data frame where
> I need to know that "this zero" came from Bob, and the other zero came from
> Samantha, then the code will not work as written. My solution would be to
> ask ChatGPT more questions and keep checking to see that each improvement
> gave a correct answer. ChatGPT is more likely to give junk answers as
> program complexity increases, or maybe I have more difficulty in asking the
> right questions as complexity increases.
>
> Tim
>
>
> -----Original Message-----
> From: Jeff Newmiller <jdnew...@dcn.davis.ca.us>
> Sent: Tuesday, April 15, 2025 3:16 AM
> To: r-help@r-project.org; Ebert,Timothy Aaron <teb...@ufl.edu>; Bert
> Gunter <bgunter.4...@gmail.com>; Duncan Murdoch <murdoch.dun...@gmail.com>
> Cc: r-help@r-project.org
> Subject: Re: [R] Drawing a sample based on certain condition
>
> [External Email]
>
> Tim, do you know what you are talking about? If so, please do share sample
> code so we can follow along. If you pick nine uniformly distributed
> variables (the tenth is constrained) and reject sets that do not add to the
> desired sum then only (1/10)^9 or so of the draws will not be rejected.
> This is a vast amount of computation for a tiny fraction of valid draws. If
> you progressively pick from yet-unallocated values that don't exceed the
> total then the distributions of the variables will be very small for later
> variables considered (unequal distributions). There may indeed be slick
> algorithms for surmounting these problems but I cannot understand if you
> are outlining such a solution from your description.
>
> On April 14, 2025 2:49:57 PM PDT, "Ebert,Timothy Aaron" <teb...@ufl.edu>
> wrote:
> >
> >1) select only denied individuals from the original data. This is S.
> >2) There is a fixed sample size of exactly t
> >3) There is a fixed target sum T such that sum(t values from S) = T
> >
> >You can reduce the problem. All large values where the max(S) + (t-1
> smallest values) > T can be eliminated from S. Likewise if (t-1 max S) +
> min(S) < T then the smallest values can be eliminated.
> >
> >You can estimate the number of options by asking how many ways can I draw
> t objects from the size of S with replacement. This is the size of S raised
> to the t power. If Joe has a score of 6 and Kim has a score of 6 these are
> different outcomes. If not, then S can be reduced to the number of unique
> values.
> >
> >Could you randomly sample S, filter for the sum=T and then use the random
> sample? Do you need every case when there are millions of cases (or more)?
> >
> >Is a long execution time important, or what do you consider a long
> execution time? Are you planning on looking at all t and T within a range?
> >
> >Tim
> >
> >-----Original Message-----
> >From: R-help <r-help-boun...@r-project.org> On Behalf Of Bert Gunter
> >Sent: Monday, April 14, 2025 4:16 PM
> >To: Duncan Murdoch <murdoch.dun...@gmail.com>
> >Cc: r-help@r-project.org
> >Subject: Re: [R] Drawing a sample based on certain condition
> >
> >[External Email]
> >
> >Just a comment:
> >
> >You wish to draw subsets of size n with or without replacement -- and I
> suspect without replacement is simpler than with -- from a set of positive
> integers that sum to a fixed value T. This sounds related to the so-called
> subset sum problem in computational complexity theory: Given a set S of
> positive integers and positive total, T, is there *any* subset of S (i.e. a
> sample without replacement) whose sum is T? This is known to be
> NP-complete. So this means that listing all such subsets must be
> NP-complete. I don't know whether specifying that, as you have, the size of
> the subsets must be a fixed number (obviously?) simplifies the problem
> sufficiently to make it computationally tractable; computational wizards or
> suitable web searches might tell you this. But for these reasons,  your
> query about how to sample the set of all such subsets -- both those drawn
> with or without replacement -- seems computationally difficult to me.
> >
> >
> >Cheers,
> >Bert
> >
> >
> >
> >"An educated person is one who can entertain new ideas, entertain others,
> and entertain herself."
> >
> >
> >
> >On Mon, Apr 14, 2025 at 8:37 AM Duncan Murdoch
> ><murdoch.dun...@gmail.com>
> >wrote:
> >
> >> On 2025-04-14 7:26 a.m., Brian Smith wrote:
> >> > Hi,
> >> >
> >> > For my analytical work, I need to draw a sample of certain sample
> >> > size from a denied population, where population members are marked
> >> > by non-negative integers, such that sum of sample members if fixed.
> >> > For example,
> >> >
> >> > Population = 0:100
> >> > Sample_size = 10
> >> > Sample_Sum = 20
> >> >
> >> > Under this setup if my sample members are X1, X2, ..., X10 then I
> >> > should have X1+X2+...+X10 = 20
> >> >
> >> > Sample drawing scheme may be with/without replacement
> >> >
> >> > Is there any R function to achieve this? One possibility is to
> >> > employ naive trial-error approach, but this doesnt seem to be
> >> > practical as it would take long time to get the final sample with
> desired properties.
> >> >
> >> > Any pointer would be greatly appreciated.
> >>
> >> One general way to think of this problem is that you are defining a
> >> distribution on the space of all possible samples of size 10, such
> >> that the probability of a sample is X if the sum is 20, and zero
> >> otherwise, and you want to sample from this distribution.
> >>
> >> There's probably a slick method to do that for your example, but if
> >> you've got a general population instead of that special one, I doubt it.
> >>
> >> What I would do is the following:
> >>
> >> Define another distribution on samples that has probabilities that
> >> depend on the sum of the sample, with the highest probabilities
> >> attached to ones with the correct sum, and probabilities for other
> >> sums declining with distance from the sum.  For example, maybe
> >>
> >>   P(sum) = Y/(1 + abs(sum - 20))
> >>
> >> for some constant Y.
> >>
> >> You can use MCMC to sample from that distribution and then only keep
> >> the samples where the sum is exactly equal to the target sum.  If you
> >> do that, you don't need to care about the value of Y. but you do need
> >> to think about how proposed moves are made, and you probably need to
> >> use a different function than the example above for acceptable
> efficiency.
> >>
> >> Duncan Murdoch
> >>
> >> ______________________________________________
> >> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> https://stat/
> >> .ethz.ch%2Fmailman%2Flistinfo%2Fr-help&data=05%7C02%7Ctebert%40ufl.ed
> >> u
> >> %7C722a00e77d1343db507d08dd7b914147%7C0d4da0f84a314d76ace60a62331e1b8
> >> 4
> >> %7C0%7C0%7C638802585985433701%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGk
> >> i
> >> OnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyf
> >> Q
> >> %3D%3D%7C0%7C%7C%7C&sdata=KNgaZ7e92Zfs51%2F%2B%2B9%2BQkGI5hiSj9OzhDtF
> >> z
> >> iWXx2D0%3D&reserved=0
> >> PLEASE do read the posting guide
> >> https://www/.
> >> r-project.org%2Fposting-guide.html&data=05%7C02%7Ctebert%40ufl.edu%7C
> >> 7
> >> 22a00e77d1343db507d08dd7b914147%7C0d4da0f84a314d76ace60a62331e1b84%7C
> >> 0
> >> %7C0%7C638802585985456014%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnR
> >> y
> >> dWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D
> >> %
> >> 3D%7C0%7C%7C%7C&sdata=Vhraq88ySxlq4IUEqvoeRvNdhjvliYdz7VDDvmQzmkE%3D&
> >> r
> >> eserved=0 and provide commented, minimal, self-contained,
> >> reproducible code.
> >>
> >
> >        [[alternative HTML version deleted]]
> >
> >______________________________________________
> >R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >https://stat/.
> >ethz.ch%2Fmailman%2Flistinfo%2Fr-help&data=05%7C02%7Ctebert%40ufl.edu%7
> >Cc623a7b217674da8b1ee08dd7bed5bfe%7C0d4da0f84a314d76ace60a62331e1b84%7C
> >0%7C0%7C638802981570144984%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRy
> >dWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3
> >D%7C0%7C%7C%7C&sdata=FESBnk%2Bs0euSW8a9WXBB4pW7W3ChgGGzujmZoD23zDE%3D&r
> >eserved=0 PLEASE do read the posting guide
> >https://www.r/
> >-project.org%2Fposting-guide.html&data=05%7C02%7Ctebert%40ufl.edu%7Cc62
> >3a7b217674da8b1ee08dd7bed5bfe%7C0d4da0f84a314d76ace60a62331e1b84%7C0%7C
> >0%7C638802981570159758%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUs
> >IlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C
> >0%7C%7C%7C&sdata=u32xfqbftfIm%2FzZxlH9NvD1SXw6lRZulE0FoPBUJOvA%3D&reser
> >ved=0 and provide commented, minimal, self-contained, reproducible
> >code.
> >______________________________________________
> >R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >https://stat/.
> >ethz.ch%2Fmailman%2Flistinfo%2Fr-help&data=05%7C02%7Ctebert%40ufl.edu%7
> >Cc623a7b217674da8b1ee08dd7bed5bfe%7C0d4da0f84a314d76ace60a62331e1b84%7C
> >0%7C0%7C638802981570168291%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRy
> >dWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3
> >D%7C0%7C%7C%7C&sdata=TOtcwGxtU%2ByIYZ1GEjneNEAUVqudwX9%2BA6FU3AkYPh4%3D
> >&reserved=0 PLEASE do read the posting guide
> >https://www.r/
> >-project.org%2Fposting-guide.html&data=05%7C02%7Ctebert%40ufl.edu%7Cc62
> >3a7b217674da8b1ee08dd7bed5bfe%7C0d4da0f84a314d76ace60a62331e1b84%7C0%7C
> >0%7C638802981570176634%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUs
> >IlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C
> >0%7C%7C%7C&sdata=gPSpdirwgJOb0AoECt3Vd0DGdPLCq9dZcXoc8lYbxB4%3D&reserve
> >d=0 and provide commented, minimal, self-contained, reproducible code.
>
> --
> Sent from my phone. Please excuse my brevity.
>

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Drawing a sample based on certain condition

Reply via email to