Ah, I see - PCSA can keep track of unique IPs without actually revealing them. Your first link cleared it up a lot for me. PCSA is a really cool technique!
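Just to check that I actually follow the mechanics, here is a rough sketch of PCSA as I understand it, in Python. The hash function, the number of bitmaps, and the constants below are my own illustrative choices, not anything from the ticket or your links:

import hashlib

PHI = 0.77351    # Flajolet-Martin correction factor
NUM_MAPS = 64    # number of bitmaps used for stochastic averaging

def hash64(ip):
    # Any well-distributed hash works; truncated SHA-256 is just convenient here.
    return int.from_bytes(hashlib.sha256(ip.encode()).digest()[:8], "big")

def rho(x):
    # Index of the least significant '1' bit, i.e. the length of the run of
    # consecutive zeros counted from the right.
    return (x & -x).bit_length() - 1 if x else 56

class PCSA:
    def __init__(self):
        self.bitmaps = [0] * NUM_MAPS

    def add(self, ip):
        h = hash64(ip)
        bucket = h % NUM_MAPS    # low bits pick which bitmap to update
        rest = h // NUM_MAPS     # remaining bits feed rho()
        self.bitmaps[bucket] |= 1 << rho(rest)

    def estimate(self):
        # R_j = index of the lowest '0' bit in bitmap j; average them and scale.
        total_r = 0
        for bm in self.bitmaps:
            r = 0
            while bm & (1 << r):
                r += 1
            total_r += r
        return NUM_MAPS / PHI * 2 ** (total_r / NUM_MAPS)

counter = PCSA()
for ip in ("1.2.3.4", "5.6.7.8", "1.2.3.4"):    # the duplicate is counted once
    counter.add(ip)
print(round(counter.estimate()))

The bitmaps never store the IPs themselves, only which zero-run lengths have been observed, and the estimate only becomes accurate once the count is reasonably large (with 64 bitmaps the standard error should be around 10%).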
I'd love to work on this as a GSoC project. I'll write up a proposal and
send it out soon.

On Tue, Mar 28, 2017 at 2:55 PM, Jaskaran Singh <jvsg1...@gmail.com> wrote:
> Hi Samir,
>
> Brute force does affect Bloom filters/hashed values as you rightly
> mentioned, but not Probabilistic Counting by Stochastic Averaging (PCSA).
>
> PCSA works on the principle that, in a (hashed) input, the probability of
> seeing 'n' consecutive '0' bits from the left followed by a '1' (it could
> be from the right as well, but for now assume the left) is 2^(-(n+1)).
> Bit 'i' of the bitmap (which is our main data structure) is set if the
> number of consecutive zeros (from the left) is 'i'.
>
> We keep repeating this for every input (IP address). We then end up with
> a bitmap whose most significant '1' can be used to compute an approximate
> number of inputs that must have gone into the algorithm.
>
> In simple words, if I tell you that I have seen the value 1010000 (four
> consecutive zeros, counting from the right) among a total of 'x' values I
> examined, you could guess that I had examined roughly 2^5 values before I
> saw that particular value.
>
> We would tweak the algorithm to store only the most significant '1' in
> the bitmap instead of storing a '1' at every iteration. This would mean
> that all the adversary could get hold of is a bitmap in which just one
> bit is '1'.
>
> For example, the adversary might get a data structure that looks like:
> 000001000000
> and would have no way to tell what IP addresses were used as input.
>
> This was just the basic idea behind PCSA. The actual PCSA makes use of a
> complicated-looking formula to get the approximate number of unique IP
> addresses in order to keep the error rate low.
>
> I hope this makes sense.
>
> For some more information and a simulation, please check
>
> [0] https://research.neustar.biz/2013/04/02/sketch-of-the-day-probabilistic-counting-with-stochastic-averaging-pcsa/
> [1] http://content.research.neustar.biz/blog/runs.html
>
> Regards,
> Jaskaran
>
> On Wednesday 29 March 2017 01:54 AM, samir menon wrote:
> > This ticket [1] was suggested as a GSoC project, but I think there
> > might be an issue with the security model/perceived threat.
> >
> > To summarize the ticket and its child [1], basically, we currently
> > store all the IPs seen by a node so that we can count unique IPs. The
> > idea is that this is dangerous; if a node is compromised, then all of
> > those IP addresses can be retrieved from memory. Therefore, a variety
> > of mitigation methods have been proposed (most prominently, the
> > 'Probabilistic Counting Algorithm' from [2]).
> >
> > Here's my issue: what about brute force?
> >
> > No matter what method we use, we will arrive at a data structure that
> > should be able to, given an IP address, tell us whether it is new (and
> > we should increment the unique counter) or old (and we should leave
> > the unique counter the same), with some reasonably small false
> > positive rate. Basically, we're supposed to use some kind of
> > Bloom-filter-like structure.
> >
> > Then can't that structure be brute-forced, offline, by an attacker?
> > IPv4 addresses are 32 bits (~4.3 billion of them), so an attacker
> > could just run whatever method we use to check membership over and
> > over, and then recover the set of IPs. The same happens if we hash the
> > IPs beforehand.
> >
> > So, is this attack acceptable? The only mitigation I've seen is the
> > one referenced by 'Aaron' in the ticket, which is the system that git
> > uses, cryptolog; there, they have a random salt that changes daily.
> > Then, an attacker can only learn the IPs for one day. This sounds like
> > a reasonable compromise to me, but then the implementation becomes
> > rather simple; just hash the IPs with a random salt that changes daily
> > before putting them in the set.
> >
> > IPv6 also solves this (128 bits), but there again, the solution is
> > just to hash the IPs before storing them - the Bloom
> > filter/'Probabilistic Counting Algorithm' is unnecessary.
> >
> > I think I must be missing something about how the 'Probabilistic
> > Counting Algorithm' works - somehow, it needs to keep track of the
> > number of unique IPs without knowing (with a high probability) whether
> > any one individual IP has been seen.
> >
> > Any help/pointing out of errors in my reasoning would be useful.
> >
> > Thanks,
> > Samir Menon
> > menon.sa...@gmail.com
> > sam...@stanford.edu
> >
> > [1] https://trac.torproject.org/projects/tor/ticket/7532
> > [2] http://www.mathcs.emory.edu/~cheung/papers/StreamDB/Probab/1985-Flajolet-Probabilistic-counting.pdf
> >
> > _______________________________________________
> > tor-dev mailing list
> > tor-dev@lists.torproject.org
> > https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
> >
>
> --
> Jaskaran Veer Singh (jvsg)
> jvsg1303 at gmail dot com
> PGP 2814 3FB7 A32D 429B 092E 27F0 8AA3 C532 9E1A 6AD8
>
> _______________________________________________
> tor-dev mailing list
> tor-dev@lists.torproject.org
> https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
>
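One more sketch, appended below the quoted thread for concreteness: this is roughly how I picture the cryptolog-style daily-salt mitigation mentioned above. HMAC-SHA-256, the class name, and the rotation policy here are my own assumptions for illustration, not anything specified in the ticket:

import hashlib
import hmac
import os
from datetime import date

class DailySaltedIPSet:
    # Remember which IPs have been seen today, keyed by a salt that is
    # regenerated (and the old set discarded) once a day.

    def __init__(self):
        self._day = None
        self._salt = None
        self._seen = set()

    def _rotate_if_needed(self):
        today = date.today()
        if today != self._day:
            self._day = today
            self._salt = os.urandom(32)    # old salt is simply thrown away
            self._seen.clear()

    def observe(self, ip):
        # Returns True the first time an IP is seen since the last rotation,
        # so the caller can increment its unique-IP counter.
        self._rotate_if_needed()
        digest = hmac.new(self._salt, ip.encode(), hashlib.sha256).digest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True

An attacker who dumps this from memory still holds the current salt, so they can brute-force the ~2^32 IPv4 space against it - but, as noted above, that only exposes the IPs seen since the last rotation; once the salt is discarded, the stored digests are useless.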
_______________________________________________
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev