On Tue, 05 Aug 2014 18:00:32 +0200 Karsten Loesing <kars...@torproject.org> wrote:
> On 05/08/14 17:24, Philipp Winter wrote: > > On Tue, Aug 05, 2014 at 11:37:45AM +0200, Karsten Loesing wrote: > >> Started looking into better algorithms to detect Sybil attacks on the > >> Tor network. Current thinking is that we should define relay similarity > >> metrics like common IP address prefix length or time between first seen > >> in a consensus, go throw the consensus archives, and see which relays > >> look similar but are not defined to be in the same family. > > > > Do you already have some code or more details on that? > > Details, yes, see below. Code, not really, but give me a few hours > tomorrow to clean it up and put it online. > > > I'm quite > > interested in this topic and I'm wondering if it wouldn't be best to > > start with something simple like cosine similarity [0]. We would have > > to transform a relay descriptor into a numerical vector and then > > calculate the cosine similarity to other relay vectors. However, this > > might scale poorly as we would need (n^2)/2 comparisons. It can be O(n^2 log n). If you concatenate the many vectors into a single one, the many cosine dot products are a single convolution operation, which is O(n log n) with the FFT. Correlate the big vector (haystack) with the search one (needle) -- indices of peaks give matches. This is cross-correlation. (As Andrea mentioned, there are conditions. I think they're met.) So if you want to find matches for some vector X = x1, x2, ... in the set of vectors A, B, ... A = a1, a2, a3, ... B = b1, b2, ... etc. then let Z = a1, a2, a3, b1, b2, ... and compute xcorr(Z, X). Hopefully the result is almost entirely noise, with a few peaks. If there is a positive peak at some index i, then X matches the vector corresponding to the one with index i in Z. The most useful vectors for correlation are ones which assign zero for "no measurement" (like inactivity) and which have a large variance, because that will bring down false positives significantly. Bandwidth per hour/day, for example, might be useful. This is basically a kind of passive traffic confirmation attack. It would be funny to use it to detect traffic confirmation attacks. > Sounds great. I'm good at transforming relay data into numerical > vectors (well, .csv files), but I have little to no experience with > the subsequent analysis. Help much appreciated! I could try what I described above, with the data "bandwidth over time" for some small set of relays, two of which were known to be colluding. (I apologize if that's already available, it's been a while since I used the Tor metrics.) Mansour _______________________________________________ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev