Detection and analysis of paedophile activity in P2P networks -
Transcription
Detection and analysis of paedophile activity in P2P networks
Raphaël Fournier-S’niehotta
JINO, LIFO, January 18th, 2013
Sections: Introduction, Queries, Users, Dynamics, Conclusion

Context
- Large sets of queries: interaction between users and search engines
- Applications: traditional (system improvements) and original (Google Flu)
- Paedophile activity in P2P systems:
  - victimization of children
  - danger for innocent users
  - societal problem
  - very little is known

Goals
Increase knowledge of paedophile activity in P2P systems.
- Detection: create an automatic tagging tool; elaborate a generic methodology
- Analysis: rigorous quantification of queries; study of users
- General methodology for a rare topic

Challenges
- Appropriate data collection: size, dynamics, poorly documented protocols
- Automatic detection tool: hidden activity, several languages
- Rigorous statistical inference: low number of paedophile queries
- User identification: partial and unreliable information

Datasets
Queries submitted to an eDonkey search engine:
- 2007: 10 weeks, 100 million queries, 24 million IP addresses
- 2009: 147 weeks, 1.3 billion queries, 82 million IP addresses
Each query has the form q_i = (t, u, k_1, k_2, ..., k_n), where:
- t is the timestamp
- u is the user information (IP address, connection port)
- (k_1, k_2, ..., k_n) is the sequence of keywords
The data is duly anonymised.

Outline
1. Paedophile queries: tool design, tool assessment, fraction of paedophile queries
2. Paedophile users
3. Temporal dynamics
4. Conclusion

Tool design
- Set of rules based on law-enforcement knowledge
- Manual inspection of our datasets
- Improve until changes become negligible
4 categories of paedophile queries. A query is tagged as paedophile if it:
1. matches explicit, or
2. matches child and sex, or
3. matches familyparents and familychild and sex, or
4. matches agesuffix with age < 17 and (sex or child).
Example queries shown on the slide: "raygold little girl", "porno infantil", "incest mom son video", "12yo fuck video".

Quality
- False positive: "sexy daddy destinys child" contains "sexy", "daddy" and "child", but is most likely a music-related query.
- False negative: "pjk 12yo" contains paedophile keywords that the tool does not search for.
How to estimate the false positive and false negative rates?
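The four-rule cascade from the tool design slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' tool: the function names, the `\d{1,2}yo` form of the age suffix, and the keyword sets (neutral placeholder tokens here) are all hypothetical; the real keyword lists come from law-enforcement knowledge and are not given in the slides.

```python
import re
from typing import List, Set

# Assumed shape of an age suffix, modelled on the slide's "12yo" example.
AGE_SUFFIX = re.compile(r"^(\d{1,2})yo$")


def has_young_age_suffix(words: List[str], limit: int = 17) -> bool:
    """True if some keyword is an age suffix with age < limit."""
    for word in words:
        match = AGE_SUFFIX.match(word)
        if match and int(match.group(1)) < limit:
            return True
    return False


def is_tagged(query: str, explicit: Set[str], child: Set[str], sex: Set[str],
              family_parents: Set[str], family_child: Set[str]) -> bool:
    """Apply the four rules in order; any single match tags the query."""
    words = query.lower().split()
    found = set(words)
    if found & explicit:                                        # rule 1
        return True
    if found & child and found & sex:                           # rule 2
        return True
    if found & family_parents and found & family_child and found & sex:  # rule 3
        return True
    if has_young_age_suffix(words) and bool(found & sex or found & child):  # rule 4
        return True
    return False


# Placeholder categories (hypothetical tokens, not real keyword lists).
DEMO = dict(
    explicit={"termx"},
    child={"kidword"},
    sex={"adultword"},
    family_parents={"parentword"},
    family_child={"offspringword"},
)

print(is_tagged("12yo adultword video", **DEMO))
print(is_tagged("kidword music", **DEMO))
```

Exact whole-word matching, as used here, is the simplest choice; it would miss the spelling variants and multiple languages that the slides list among the challenges.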
Tool assessment: survey
- 21 volunteer experts (Europol, national authorities, NGOs)
- 3,000 randomly selected queries from three groups:
  - paedophile
  - not paedophile
  - neighbours (submitted by the same user within the 2 hours before or after a paedophile query)
- Each query is tagged as paedophile, probably paedophile, probably not paedophile, not paedophile, or "I don't know"

Tool assessment: survey results
One row per participant (numbered here in slide order); counts are numbers of queries, relevance is in %.

expert  paedo  prob. paedo  don't know  prob. not  not paedo  total  relevance
1       1530   149          25          66         1230       3000   99.5
2       1381   247          125         580        667        3000   98.5
3       1679   89           2           113        1117       3000   99.1
4       1603   201          99          174        923        3000   99.0
5       1598   5            15          1          1381       3000   98.8
6       128    81           1           26         124        360    100.0
7       216    154          0           142        132        644    98.4
8       1624   126          16          165        581        2512   99.8
9       351    16           2           16         27         412    100.0
10      647    119          71          40         439        1316   98.4
11      1174   111          20          64         789        2158   99.1
12      335    17           1           70         166        589    97.5
13      641    383          4           112        753        1893   97.8
14      1071   546          2           453        928        3000   88.4
15      1554   197          28          327        894        3000   97.6
16      1506   120          6           25         393        2050   98.3
17      305    270          24          89         181        869    99.0
18      371    1017         496         570        546        3000   95.7
19      976    936          405         594        89         3000   96.6
20      344    12           10          70         156        592    98.3
21      845    139          323         175        182        1664   97.9

Relevance rate: indicates adequate knowledge of the specific context.

Assessment results
- Paedophile queries: our tool is correct on 75.5% and wrong on 24.5%.
- Queries our tool tags as paedophile: 98.61% are correct and 1.39% are wrong.

The fraction of paedophile queries follows from the fraction of tagged queries:

$$\frac{|P^{+}|}{|D|} = \frac{|T^{+}|\,(1 - f'^{+})}{(1 - f^{-})\,|D|}$$

- P+: paedophile queries
- T+: tagged paedophile queries
- f'+: false positive rate
- f−: false negative rate

Fraction of detected paedophile queries
[Figure: fraction of paedophile queries (0 to 0.0025) versus measurement duration (0 to 30 weeks); curves: 2007, 2009]

Fraction of paedophile queries: result
- Detected queries: slightly above 0.19% for both datasets
- After correction: 2.5 queries out of 1,000 are paedophile
- 1 paedophile query every 33 seconds

Matthieu Latapy, Clémence Magnien, and Raphaël Fournier. Quantifying paedophile queries in a large P2P system. In IEEE International Conference on Computer Communications (INFOCOM) Mini-Conference, 2011.
Matthieu Latapy, Clémence Magnien, and Raphaël Fournier. Quantifying paedophile activity in a large P2P system. Information Processing and Management, in press, 2012.
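The correction step behind the "2.5 out of 1,000" figure can be checked numerically. The sketch below assumes the formula reads |P+|/|D| = (|T+|/|D|) (1 − f'+) / (1 − f−), with f'+ = 1.39% (wrong among tagged queries) and f− = 24.5% (paedophile queries missed by the tool), as read from the assessment slide; the function name is hypothetical.

```python
def corrected_fraction(detected_fraction: float,
                       false_positive_rate: float,
                       false_negative_rate: float) -> float:
    """Estimate the true fraction of paedophile queries from the tagged fraction.

    false_positive_rate: share of tagged queries that are not paedophile (f'+).
    false_negative_rate: share of paedophile queries the tool misses (f-).
    """
    if not 0.0 <= false_positive_rate <= 1.0:
        raise ValueError("false_positive_rate must be in [0, 1]")
    if not 0.0 <= false_negative_rate < 1.0:
        raise ValueError("false_negative_rate must be in [0, 1)")
    # Remove wrongly tagged queries, then scale up for the missed ones.
    return detected_fraction * (1.0 - false_positive_rate) / (1.0 - false_negative_rate)


# Values taken from the slides: 0.19% detected, f'+ = 1.39%, f- = 24.5%.
estimate = corrected_fraction(0.0019, 0.0139, 0.245)
print(f"corrected fraction: {estimate:.2%}")
```

With these inputs the estimate comes out close to 0.25%, which is consistent with the 2.5 queries per 1,000 reported on the result slide.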
Outline
1. Paedophile queries
2. Paedophile users: distinguishing different users, fraction of paedophile users
3. Temporal dynamics
4. Conclusion

Distinguishing users
Possible approximation: user ∼ IP address.
Problems:
- Gateway/firewall (NAT) IP addresses
- Dynamic address allocation
- Several users per computer
- Several computers per user

Distinguishing users: paedophile user
- A user is considered paedophile after one paedophile query.
- All dynamic/public IP addresses may be considered paedophile after some time.
3 approaches:
- User ∼ IP address + connection port
- Measurement duration
- Sessions

User: IP vs (IP, port)
[Figure: fraction of users detected as paedophile (%, 0 to 0.45) versus time (weeks, 0 to 10); curves: 2007 (IP, port), 2007 IP, 2009 IP]
(IP, port) reduces pollution (bias).

User: sessions
[Figure: timeline of query times t1 to t6 grouped into two sessions]
[Figure: fraction of sessions detected as paedophile (0 to 0.3) versus δ (hours, 0 to 12); curves: 2007 (IP, port), 2007 IP, 2009 IP]

Fraction of paedophile users
False positive/negative rate on users:

$$p\big(u \in U^{+} \mid u \in V(n, 0)\big) = 1 - (1 - f'^{-})^{n}$$

$$p\big(u \in U^{-} \mid u \in V(n, k)\big) = (f'^{+})^{k}\,(1 - f'^{-})^{n-k}$$

- U+, U−: sets of paedophile / not paedophile users
- V+, V−: sets of users detected as paedophile / not paedophile
- n: number of queries of a user
- k: number of queries detected as paedophile for a user

$$\frac{|U^{+} \cap V^{+}|}{|D|} = \sum_{n=1}^{N} \sum_{k=1}^{n} \Big(1 - (f'^{+})^{k}\,(1 - f'^{-})^{n-k}\Big)\,\frac{|V(n,k)|}{|D|}$$

Fraction of paedophile users: result
- Fraction of paedophile users close to 0.22% for both datasets
- 1 paedophile user out of 450

Matthieu Latapy, Clémence Magnien, and Raphaël Fournier. Quantifying paedophile queries in a large P2P system. In IEEE International Conference on Computer Communications (INFOCOM) Mini-Conference, 2011.
Matthieu Latapy, Clémence Magnien, and Raphaël Fournier. Quantifying paedophile activity in a large P2P system. Information Processing and Management, in press, 2012.
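The user-level double sum can be evaluated directly once the users are grouped by (n, k). The sketch below rests on assumptions that the slides do not state outright: V(n, k) is read as the set of users with n queries of which k are detected as paedophile, f'+ and f'− are per-query error rates applied independently, and the counts and the f'− value in the example are made up for illustration.

```python
from typing import Dict, Tuple


def prob_user_not_paedophile(n: int, k: int, fp_rate: float, fn_rate: float) -> float:
    """p(u in U- | u in V(n, k)): all k detections are false positives and
    none of the n - k undetected queries is a missed paedophile query."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    return (fp_rate ** k) * ((1.0 - fn_rate) ** (n - k))


def expected_true_detected_users(counts: Dict[Tuple[int, int], int],
                                 fp_rate: float, fn_rate: float) -> float:
    """Expected size of U+ ∩ V+, summing over groups with k >= 1."""
    total = 0.0
    for (n, k), users in counts.items():
        if k >= 1:  # only users detected as paedophile belong to V+
            total += (1.0 - prob_user_not_paedophile(n, k, fp_rate, fn_rate)) * users
    return total


# Hypothetical group sizes |V(n, k)|, for illustration only.
demo_counts = {(1, 1): 100, (3, 0): 50}
# fp_rate reuses the 1.39% from the assessment slide; fn_rate is a made-up value.
expected = expected_true_detected_users(demo_counts, fp_rate=0.0139, fn_rate=0.001)
print(expected)
```

Dividing the returned value by the total number of users gives the fraction on the left-hand side of the sum; the (3, 0) group contributes nothing because the sum starts at k = 1.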
Outline
1. Paedophile queries
2. Paedophile users
3. Temporal dynamics: long-term evolution of paedophile activity, daily evolution of paedophile activity
4. Conclusion

Global traffic on server
[Figure: number of queries (millions, 2 to 12) versus time (weeks, 2009 to 2012); curve: all queries]
Stability of global traffic over 3 years.

Fraction of paedophile queries
[Figure: fraction of queries (%, 0.1 to 0.6) versus time (weeks, 2009 to 2012); curve: paedophile queries]
The fraction of paedophile queries is strongly increasing.

Fraction of paedophile users
[Figure: fraction of IP addresses (%, 0.2 to 0.9) versus time (weeks, 2009 to 2012); curve: paedophile IP addresses]
The fraction of paedophile users is also increasing.

Daily traffic
[Figure: average number of queries (0 to 90,000) versus hour of the day (0 to 22); curve: all queries]
Circadian cycle (day/night effect).

Fraction of paedophile activity
[Figure: average fraction of queries (%, 0.3 to 0.9) versus hour of the day (0 to 22); curve: paedophile queries]
The fraction of paedophile queries peaks at 6 AM.

Pornography vs paedophile activity
[Figure: average fraction of queries (%, 0.3 to 0.9) versus hour of the day (0 to 22); curves: paedophile queries, pornographic queries]
Paedopornography and traditional pornography differ.

Evolution of activity: result
- Substantial growth of paedophile activity between 2009 and 2012
- Fraction of paedophile queries peaks at 6 AM
- Qualitative contribution with a quantitative approach

Outline
1. Paedophile queries
2. Paedophile users
3. Temporal dynamics
4. Conclusion

Conclusion (1/2)
1. Paedophile queries
   - automatic detection tool
   - set of paedophile queries
   - estimated fraction of paedophile queries
2. Paedophile users
   - general study of user identification
   - quantification of paedophile users

Conclusion (2/2)
3. Temporal dynamics
   - three-year study
   - user social integration
4. Comparing KAD and eDonkey
   - adequate methodology
   - analysis with partial information

R. Fournier, T. Cholez, M. Latapy, C. Magnien, I. Chrisment, I. Daniloff and O. Festor. Comparing paedophile activity in different P2P systems. Submitted.
Perspectives
- Tool improvement:
  - previous/next queries
  - languages, word order, categories
  - machine learning
- Analysis:
  - different thresholds for paedophile users
  - community detection (graph topology)
  - detailed study of sequences of queries
  - file exchanges (supply)
- Apply the methodology to other contexts

Contact
Thank you for your attention.
[email protected]

KAD network
- Completely distributed protocol of clients
- No server for file indexing
- Some peers are in charge of some files and keywords
Principle:
- Precise and targeted injection of peers into the network to control files or keywords
- Peers catch queries and control replies
Applications:
- Which files are published for a given keyword?
- Which peers share them?
- Eclipse: prevent peers from accessing content

Geo-location: statistics

country  # queries  # paedo  ratio
IT       19569361   15426    0.08 %
ES       8881405    5177     0.06 %
FR       7583815    8059     0.11 %
BR       2795090    4849     0.17 %
IL       2139697    2618     0.12 %
DE       2093106    11238    0.54 %
KR       1386799    336      0.02 %
US       1053183    6184     0.59 %
PL       975170     1178     0.12 %
AR       810466     1465     0.18 %
CN       635392     337      0.05 %
PT       513327     434      0.08 %
IE       511185     54       0.01 %
TW       417893     138      0.03 %
BE       402565     646      0.16 %
CH       320054     1710     0.53 %
GB       319386     1698     0.53 %
NL       243646     1131     0.46 %
CA       241460     1233     0.51 %
SI       239572     167      0.07 %
MX       210504     1098     0.52 %
RU       200958     2712     1.35 %
AT       184248     977      0.53 %

Biased by:
- language knowledge
- decoding problems

Geo-location: maps
[Figure: map of # queries]
[Figure: map of the ratio # paedophile queries / # queries]