Detection and analysis of paedophile activity in P2P networks
Raphaël Fournier-S’niehotta
Réseaux et individus, informatique et sciences sociales, LIAFA
December 3rd, 2012

Outline
  1. Paedophile queries
  2. Paedophile users
  3. Dynamicity
  4. Conclusion

Large sets of queries
  - Interaction between users and search engines
  - Applications: traditional (system improvements), original (Google Flu)
  - Set of queries: $q_i = (t, u, k_1, k_2, \ldots, k_n)$
      $t$: timestamp
      $u$: user information (IP address, connection port)
      $(k_1, k_2, \ldots, k_n)$: sequence of keywords

Rationale
  - Children victimization
  - Danger for innocent users
  - Societal problem
  - Very little is known

Goals: increase knowledge of paedophile activity in P2P systems
  Detection
  - Create an automatic tagging tool
  - Elaborate a generic methodology
  Analysis
  - Rigorous quantification of queries
  - Study users

Challenges
  - Appropriate data collection: size, dynamicity, poorly documented protocols
  - Automatic detection tool: hidden activity, several languages
  - Rigorous statistical inference: low amount of paedophile queries

Datasets: eDonkey (eMule, MLDonkey, Shareaza), duly anonymised
                 2007           09-12            2009
  Duration       10 weeks       147 weeks        28 weeks
  Nb queries     107,226,021    1,290,377,956    205,228,820
  Nb IP          23,892,531     82,264,897       24,413,195
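The query tuple $q_i = (t, u, k_1, \ldots, k_n)$ from the "Large sets of queries" slide can be represented as a small record type. The following is a minimal Python sketch, not the author's code: the class name `Query`, the function `parse_query` and the tab-separated log format are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Tuple


@dataclass(frozen=True)
class Query:
    """One search query q = (t, u, k_1, ..., k_n)."""
    timestamp: datetime          # t
    ip: str                      # part of u (anonymised in the datasets)
    port: int                    # part of u (connection port)
    keywords: Tuple[str, ...]    # (k_1, ..., k_n)

    @property
    def user_key(self) -> Tuple[str, int]:
        # The slides later approximate a user by (IP address, port).
        return (self.ip, self.port)


def parse_query(line: str) -> Query:
    """Parse a HYPOTHETICAL log line: '<unix_ts>\t<ip>\t<port>\t<keywords>'."""
    ts, ip, port, keywords = line.rstrip("\n").split("\t", 3)
    return Query(
        timestamp=datetime.fromtimestamp(int(ts), tz=timezone.utc),
        ip=ip,
        port=int(port),
        keywords=tuple(keywords.lower().split()),
    )


example = parse_query("1196640000\tanon42\t4662\tSome Music Album")
```

Keeping the user information as a separate `(ip, port)` key makes it easy to switch between the "user = IP" and "user = (IP, port)" hypotheses discussed in the user section.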

Outline: 1. Paedophile queries
  - Tool design
  - Tool assessment
  - Fraction of paedophile queries

Tool design
  4 categories of paedophile queries. A query is tagged as paedophile if it passes any of these tests:
  1. matches explicit?
  2. matches child and sex?
  3. matches familyparents and familychild and sex?
  4. matches agesuffix with age < 17 and (sex or child)?
  (The slide illustrates each category with one example query.)

Quality
  - False positive: “sexy daddy destinys child” contains “sexy”, “daddy” and “child” but is most likely a music-related query
  - False negative: a query may contain paedophile keywords that the tool does not search for
  - How to estimate false positive and false negative rates?

Tool assessment: survey
  - Set of 21 volunteering experts (Europol, national authorities, NGOs)
  - Set of 3,000 randomly selected queries of three kinds: paedophile, not paedophile, neighbours (submitted within the 2 previous or next hours of a paedophile query by the same user)
  - Experts tag each query as: paedophile, probably paedophile, probably not paedophile, not paedophile, or I don't know
  - Answer table:
      ...
      paedophile: 1174 | probably paedophile: 111 | I don't know: 20 | probably not: 64 | not paedophile: 789 | total: 2158 | relevance: 99.1
      ...
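
The "neighbours" sample used in the survey (queries submitted by the same user within two hours before or after a tagged query) can be selected as sketched below. This is a minimal Python sketch under stated assumptions: the record type `TaggedQuery`, its field names and the function `neighbours` are hypothetical, and it assumes that tagged queries themselves are excluded from the neighbour set.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Iterable, List, NamedTuple


class TaggedQuery(NamedTuple):
    timestamp: datetime
    user: str      # anonymised user identifier
    text: str
    tagged: bool   # True if the tool tagged the query


def neighbours(queries: Iterable[TaggedQuery],
               window: timedelta = timedelta(hours=2)) -> List[TaggedQuery]:
    """Return untagged queries within `window` of a tagged query by the same user."""
    by_user = defaultdict(list)
    for q in queries:
        by_user[q.user].append(q)

    result = []
    for user_queries in by_user.values():
        tagged_times = [q.timestamp for q in user_queries if q.tagged]
        for q in user_queries:
            if q.tagged:
                continue  # assumption: tagged queries are not their own neighbours
            if any(abs(q.timestamp - t) <= window for t in tagged_times):
                result.append(q)
    return result


t0 = datetime(2009, 1, 1, 12, 0)
sample = [
    TaggedQuery(t0, "u1", "a", True),
    TaggedQuery(t0 + timedelta(hours=1), "u1", "b", False),     # within 2 h
    TaggedQuery(t0 + timedelta(hours=3), "u1", "c", False),     # too far
    TaggedQuery(t0 + timedelta(minutes=30), "u2", "d", False),  # other user
]
selected = [q.text for q in neighbours(sample)]
```

Sampling neighbours gives the experts untagged queries that are more likely than average to be paedophile, which is what makes a false negative rate measurable when such queries are rare overall.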

Assessment results: limited filter precision
  - False negatives (among paedophile queries): our tool is correct on 75.5% and wrong on 24.5%
  - False positives (among the queries our tool tags as paedophile): 98.61% are correct and 1.39% are wrong

Fraction of paedophile queries
  [Figure: fraction of paedophile queries (0 to 0.0025) versus measurement duration (weeks, 0 to 30), for the 2007 and 2009 datasets]
  Result
  - Detected queries: slightly above 0.19% for both datasets
  - After correction: 2.5 queries out of 1,000 are paedophile
  - 1 paedophile query every 33 seconds

Outline: 2. Paedophile users
  - Distinguishing different users
  - Fraction of paedophile users

Distinguishing users
  Classical hypothesis: user ∼ IP address
  Problems
  - gateway/firewall (NAT) IP addresses
  - dynamic address allocation
  - several users per computer
  - several computers per user
  Improvements
  - user ∼ IP address + connection port
  - measurement duration
  - sessions

User: IP vs (IP, port)
  [Figure: fraction of paedophile users (0 to 0.0045) versus time (weeks, 0 to 10); curves: 2007 (IP, port), 2007 IP, 2009 IP]
  - Hypothesis: a user is tagged as paedophile after one such query
  - Pollution: all dynamic/public IP addresses may be considered as paedophile after some time
  - Convergence when considering (IP, port)
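
The corrected figures on the "Fraction of paedophile queries" slide can be checked with a few lines of arithmetic. The sketch below rests on assumptions that are not spelled out on the slides: that 24.5% is the share of paedophile queries the tool misses, that 1.39% is the share of tagged queries that are wrong, and that the "33 seconds" figure refers to the 2009 dataset. These readings happen to reproduce the reported numbers, but they are not a statement of the author's exact computation.

```python
# Values taken from the slides.
detected_fraction = 0.0019     # "slightly above 0.19%" of queries are tagged
false_negative_rate = 0.245    # assumed: share of paedophile queries missed
false_positive_rate = 0.0139   # assumed: share of tagged queries that are wrong


def corrected_fraction(detected: float, fp_rate: float, fn_rate: float) -> float:
    """Estimate the true fraction from the detected one.

    Tagged queries that are truly paedophile: detected * (1 - fp_rate).
    They represent only a share (1 - fn_rate) of all paedophile queries.
    """
    return detected * (1.0 - fp_rate) / (1.0 - fn_rate)


corrected = corrected_fraction(
    detected_fraction, false_positive_rate, false_negative_rate
)

# Average time between paedophile queries, assuming the 2009 dataset
# (205,228,820 queries over 28 weeks) and the rounded 2.5 per 1,000 figure.
queries_2009 = 205_228_820
duration_s = 28 * 7 * 24 * 3600
interval_s = duration_s / (0.0025 * queries_2009)

print(f"corrected fraction: {corrected * 1000:.1f} per 1,000 queries")
print(f"one paedophile query every {interval_s:.0f} seconds")
```

Under the same assumptions, the 2007 dataset (107,226,021 queries over 10 weeks) would give a shorter interval, which is why the 33-second figure is attributed here to the 2009 measurement.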

User: sessions
  [Diagram: a user's queries t1 to t6 on a time axis, grouped into two sessions]
  [Figure: fraction of paedophile sessions (0 to 0.0035) versus δ (hours, 0 to 12); curves: 2007 (IP, port), 2007 IP, 2009 IP]

Fraction of paedophile users
  False positive/negative rate on users:
    $p(u \in U^{+} \mid u \in V(n, 0)) = 1 - (1 - f'^{-})^{n}$
    $p(u \in U^{-} \mid u \in V(n, k)) = (f'^{+})^{k} (1 - f'^{-})^{n-k}$
  Result
  - Fraction of paedophile users close to 0.22% for both datasets

Outline: 3. Dynamicity
  - Long-term evolution of paedophile activity
  - Daily evolution of paedophile activity

Long-term evolution
  [Figure: number of queries (millions, 0 to 12) per week, all queries, 2009 to 2012]
  - Stability of global traffic over 3 years
  - Fraction of paedophile queries strongly increasing
  - Fraction of paedophile users also increasing
  [Figure: fraction of queries (in %) that are paedophile, per week, 2009 to 2012]
  [Figure: fraction of IP addresses (in %) that are paedophile, per week, 2009 to 2012]

Daily evolution
  [Figure: average number of queries (0 to 90,000) by hour of the day, all queries]
  [Figure: average fraction of queries (in %) by hour of the day; curves: BR+AR, FR]
  - Circadian cycle (day/night effect)
  - Fraction of paedophile queries peaks at 6 AM
  - Paedopornography and traditional pornography differ
  [Figure: average fraction of queries (in %) by hour of the day; curves: paedophile queries, pornographic queries]

Outline: 4. Conclusion

Conclusion
  General approach for detecting rare contents
  Contributions
  - Automatic detection tool
  - Large set of paedophile queries
  - Rigorous quantification: 2.5 queries out of 1,000 are paedophile
  - User identification
  - Quantification of paedophile users
  Other contributions
  - Comparing eDonkey and KAD

Perspectives
  Tool improvement
  - previous/next queries
  - languages, word order, categories
  - machine learning
  Analysis
  - different threshold for being considered paedophile
  - geolocation
  - community detection (graph topology)
  - detailed study of sequences of queries

Contact
  Thank you for your attention.
  [email protected]

KAD network
  - Completely distributed protocol of clients
  - No server for file indexing
  - Some peers are in charge of some files and keywords
  Principle
  - Precise and targeted injection of peers into the network to control files or keywords
  - Peers catch queries and control replies
  Applications
  - Which files are published for a given keyword?
  - Which peers share them?
  - Eclipse: prevent peers from accessing content

Geo-location: statistics
  country   # queries   # paedo   ratio
  IT        19569361    15426     0.08 %
  ES         8881405     5177     0.06 %
  FR         7583815     8059     0.11 %
  BR         2795090     4849     0.17 %
  IL         2139697     2618     0.12 %
  DE         2093106    11238     0.54 %
  KR         1386799      336     0.02 %
  US         1053183     6184     0.59 %
  PL          975170     1178     0.12 %
  AR          810466     1465     0.18 %
  CN          635392      337     0.05 %
  PT          513327      434     0.08 %
  IE          511185       54     0.01 %
  TW          417893      138     0.03 %
  BE          402565      646     0.16 %
  CH          320054     1710     0.53 %
  GB          319386     1698     0.53 %
  NL          243646     1131     0.46 %
  CA          241460     1233     0.51 %
  SI          239572      167     0.07 %
  MX          210504     1098     0.52 %
  RU          200958     2712     1.35 %
  AT          184248      977     0.53 %
  Biased by: language knowledge, decoding problems

Geo-location: maps
  [Map: # queries]
  [Map: ratio # paedophile queries / # queries]