Ike Antkare - Com`Eau Labo
Ike Antkare: Genesis and Echoes
Cyril Labbé, Université Joseph Fourier - LIG
October 2014
C. Labbé (UJF-LIG), Ike Antkare & Co, October 2014

Table of Contents
1 Preliminaries
    Scientometrics
    SCIgen, a Probabilistic Context-Free Grammar
2 Ike Antkare, one of the great stars in the scientific firmament
3 Detection of SCIgen papers: June 2012
    Google Search
    Automatic classification

Preliminaries: Scientometrics
Ranking scientists and journals by number of citations

Definition (h-index [Hirsch, 2005]): A scientist has index h if h of his or her Np papers have at least h citations each and the other (Np - h) papers have at most h citations each.
[Figure: axes Papers (0 to Np) and Citations, with h marked.]

Definition (Impact Factor): Average number of citations to papers published by the journal over the last two years. Computed since 1975.
[Figure: axis Time after publication, with a 2-year window marked.]

Preliminaries: Scientometrics
Tools that count citations

Toll-based tools
  - Provided by publishers (Elsevier, Thomson Reuters);
  - Based on publishers' catalogs (ACM, IEEE, Springer, Elsevier, ...);
  - Selected venues only (all peer reviewed).
Free tools: Google Scholar, CiteSeerX, ...
  - Crawling the web and/or selected catalogs and/or added by users;
  - Social media (Google+, Scholarometer, Microsoft Academics, ...).
Free tools that compute indicators
  - Publish or Perish;
  - Scholarometer;
  - Microsoft Academics;
  - Google+;
  - and many more...
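The two definitions above can be turned into a few lines of Python. This is only a sketch: the function names `h_index` and `two_year_average` and the sample numbers are made up for illustration, and the second function is a simplified reading of the Impact Factor definition on the slide, not the exact procedure used by any citation database.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations < rank:
            break
        h = rank
    return h


def two_year_average(citations_received, papers_published):
    """Average number of citations per paper over a two-year window."""
    if papers_published == 0:
        raise ValueError("no papers published in the window")
    return citations_received / papers_published


example_counts = [10, 8, 5, 4, 3]  # made-up citation counts for one author
print(h_index(example_counts))     # 4
print(two_year_average(150, 60))   # 2.5
```

With the made-up counts, four papers have at least four citations each and the fifth has fewer than five, so h = 4.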
Preliminaries: Scientometrics
Chronos
[Figure: timeline from 2004 to 2015. Labels: Web of Science (Thomson Reuters), Scopus (Elsevier), Google Scholar, h-index, PoP V1.0, "Abiteboul by the administrator of the Collège de France".]
Tools to generate publications.

Preliminaries: SCIgen, a Probabilistic Context-Free Grammar
PCFG: Probabilistic Context-Free Grammar

Sets of symbols
  Set of non-terminal symbols: N = {SP, S, V, P}
  Set of terminal symbols: Σ = {".", sing, dance, flight, seas, oceans, air, streets, hills, fields}

Set of rules R_i
  R1:      SP → S.                                               p(R1) = 1
  R2:      S  → We shall V in the P                              p(R2) = 1/4
  R3:      S  → We shall V in the P, S                           p(R3) = 1/2
  R4:      S  → We shall V in the P and in the P, S              p(R4) = 1/4
  R5..7:   V  → sing | dance | flight                            p(Ri) = 1/3, i = 5..7
  R8..13:  P  → seas | oceans | air | streets | hills | fields   p(Ri) = 1/6, i = 8..13

Annotation next to rules R2 to R4: non-zero probability to ∞ (R3 and R4 are recursive, so a generated string can be arbitrarily long).

Terminal string example:
  s: We shall sing in the air and in the hills, We shall dance in the fields.
  $p(s) = \prod_j p(R_j)$, the product running over the rules R_j used to derive s.

Preliminaries: SCIgen, a Probabilistic Context-Free Grammar
SCIgen: 2005, by J. Stribling, M. Krohn & D. Aguayo
"... maximize amusement, rather than coherence ..."

[Figure: structure of a generated paper: Title, Abstract, Introduction (Intro_A, Intro_A2, Intro_A3, Intro_closing), Model, Impl, Eval, RelatedWork, Concl, References.]

  Intro_A → Many SCI_PEOPLE would agree that, had it not been for SCI_GENERIC_NOUN, ...
  Intro_A → In recent years, much research has been devoted to the SCI_ACT; ...
  Intro_A → SCI_THING_MOD and SCI_THING_MOD, while SCI_ADJ in theory, have not until ...
  Intro_A → The SCI_ACT is a SCI_ADJ SCI_PROBLEM.
  Intro_A → The SCI_ACT has SCI_VERBED SCI_THING_MOD, and current trends ...
  Intro_A → The implications of SCI_BUZZWORD_ADJ SCI_BUZZWORD_NOUN have ...
  Intro_A → ...
  ...
  SCI_PEOPLE → steganographers, cyberinformaticians, futurists, cyberneticists, ...
  SCI_BUZZWORD_ADJ → omniscient, introspective, peer to peer, ambimorphic, ...
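The toy grammar with rules R1 to R13 can be sampled mechanically, and the probability of a derivation is the product of the probabilities of the rules it uses. The sketch below encodes that toy grammar in a hypothetical `GRAMMAR` dictionary (the data layout and the function names are assumptions for illustration, not SCIgen's actual format) and recomputes p(s) for the terminal string example.

```python
import random
from fractions import Fraction

# Toy grammar from the slide: symbol -> list of (probability, right-hand side).
# Any string that is not a key of GRAMMAR is treated as a terminal.
GRAMMAR = {
    "SP": [(Fraction(1), ["S", "."])],                                   # R1
    "S": [
        (Fraction(1, 4), ["We shall", "V", "in the", "P"]),              # R2
        (Fraction(1, 2), ["We shall", "V", "in the", "P", ",", "S"]),    # R3
        (Fraction(1, 4), ["We shall", "V", "in the", "P",
                          "and in the", "P", ",", "S"]),                 # R4
    ],
    "V": [(Fraction(1, 3), [w]) for w in ("sing", "dance", "flight")],   # R5..7
    "P": [(Fraction(1, 6), [w]) for w in
          ("seas", "oceans", "air", "streets", "hills", "fields")],      # R8..13
}


def generate(symbol="SP", rng=random):
    """Expand a symbol at random; return (tokens, probability of the derivation)."""
    if symbol not in GRAMMAR:
        return [symbol], Fraction(1)
    rules = GRAMMAR[symbol]
    weights = [float(p) for p, _ in rules]
    prob, rhs = rng.choices(rules, weights=weights, k=1)[0]
    tokens = []
    for part in rhs:
        sub_tokens, sub_prob = generate(part, rng)
        tokens.extend(sub_tokens)
        prob *= sub_prob
    return tokens, prob


def derivation_probability(derivation):
    """Product of p(R_j) over a list of (symbol, rule index) pairs."""
    prob = Fraction(1)
    for symbol, rule_index in derivation:
        prob *= GRAMMAR[symbol][rule_index][0]
    return prob


# Derivation of: "We shall sing in the air and in the hills,
#                 We shall dance in the fields."
EXAMPLE_DERIVATION = [
    ("SP", 0),  # R1
    ("S", 2),   # R4
    ("V", 0),   # sing
    ("P", 2),   # air
    ("P", 4),   # hills
    ("S", 0),   # R2
    ("V", 1),   # dance
    ("P", 5),   # fields
]

print(derivation_probability(EXAMPLE_DERIVATION))  # 1/31104

tokens, prob = generate(rng=random.Random(0))
print(" ".join(tokens).replace(" ,", ",").replace(" .", "."), prob)
```

For the example string, the product is 1 × 1/4 × 1/3 × 1/6 × 1/6 × 1/4 × 1/3 × 1/6 = 1/31104. Exact fractions are used so that the result is not blurred by floating-point rounding.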
Preliminaries: SCIgen, a Probabilistic Context-Free Grammar
SCIgen example
[Figure: SCIgen example.]

Preliminaries: SCIgen, a Probabilistic Context-Free Grammar
Chronos
[Figure: the same timeline from 2004 to 2015, with SCIgen added.]

Ike Antkare, one of the great stars in the scientific firmament

SCIgen texts citing SCIgen texts [Labbé, 2010]
[Figure: Ike Antkare's 101 documents (nodes labeled 0, 1, ..., 100), generated with a modified SCIgen, and real documents.]

Ike Antkare's h-index according to Google Scholar (2010)

Chronos
[Figure: the same timeline from 2004 to 2015, with SCIgen and Ike Antkare added.]

Get cited or Perish
Conclusion

                            Completeness   Accuracy      Robustness
  Google Scholar (free)     Good           Good enough   Spammable
  WoK / Scopus (fee-based)  Incomplete     Error free    Excellent

A scholar/scientist would never commit fraud like that...
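Why a citation counter that indexes everything is "spammable" can be illustrated with a small simulation. The sketch below assumes, purely for illustration, that every fake document cites all the other fake documents; the slides do not spell out the exact citation pattern of the Ike Antkare documents, so `citation_ring` is a hypothetical stand-in and the resulting number is not the value reported for Ike Antkare.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations < rank:
            break
        h = rank
    return h


def citation_ring(n_documents):
    """Hypothetical pattern: each document cites every other document."""
    return {
        doc: [other for other in range(n_documents) if other != doc]
        for doc in range(n_documents)
    }


def citations_received(references):
    """Count incoming citations from a mapping document -> cited documents."""
    received = {doc: 0 for doc in references}
    for cited_documents in references.values():
        for cited in cited_documents:
            received[cited] += 1
    return received


ring = citation_ring(101)
counts = citations_received(ring)
print(h_index(counts.values()))  # 100
```

Under this assumed pattern, each of the 101 documents receives 100 citations, so a tool that counts them all would report h = 100 for an author with no real publications.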
Detection of SCIgen papers: June 2012

Google Search
Phrase search and More Like This (IEEE, http://www.computer.org)
  Many SCI_PEOPLE would agree that, had it not been for SCI_GENERIC_NOUN, ...
  In recent years, much research has been devoted to the SCI_ACT; ...
  SCI_THING_MOD and SCI_THING_MOD, while SCI_ADJ in theory, have not until ...
  The SCI_ACT has SCI_VERBED SCI_THING_MOD, and current trends ...
  The implications of SCI_BUZZWORD_ADJ SCI_BUZZWORD_NOUN have ...
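The phrase-search idea relies on the fixed wording that survives in every sentence produced from a given grammar rule. A minimal sketch of that idea follows; the helper names, the placeholder pattern `SCI_[A-Z_]+` and the sample sentence are assumptions made for illustration, and this is not the query procedure actually used for the slides.

```python
import re

# Assumption: every grammar placeholder looks like SCI_SOMETHING.
PLACEHOLDER = re.compile(r"SCI_[A-Z_]+")

TEMPLATE = ("Many SCI_PEOPLE would agree that, "
            "had it not been for SCI_GENERIC_NOUN,")

# Made-up sentence in the style of the template above.
SENTENCE = ("Many cyberinformaticians would agree that, had it not been "
            "for the Internet, the study of robots might never have occurred.")


def template_to_regex(template):
    """Keep the literal wording, let each placeholder match any text."""
    literal_parts = PLACEHOLDER.split(template)
    pattern = r".+?".join(re.escape(part) for part in literal_parts)
    return re.compile(pattern)


def fixed_phrases(template, min_words=3):
    """Literal fragments long enough to serve as quoted search phrases."""
    parts = [part.strip(" ,;.") for part in PLACEHOLDER.split(template)]
    return [part for part in parts if len(part.split()) >= min_words]


print(fixed_phrases(TEMPLATE))
print(template_to_regex(TEMPLATE).search(SENTENCE) is not None)
```

The fixed fragments can be pasted as quoted phrases into a search engine, while the regular expression can be used to confirm a hit on a downloaded text.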
Corpus collected with phrase search and More Like This (MLT):

  Corpus name   Downloaded from   Years       Type of papers   Number of papers   Acceptance rate   Corpus size
  MLT IEEE      ieee.org          2008-2010   various          122                NA                122

Detection of SCIgen papers: June 2012
Automatic classification
Intertextual distance [Labbé and Labbé, 2006]

  A: {le le chat}, relative frequencies of (chat, le, un) = (1/3, 2/3, 0/3)
  B: {un chat chat}, relative frequencies of (chat, le, un) = (2/3, 0/3, 1/3)
  [Figure: word-frequency diagrams for A and B.]

Intertextual distance:
  $D(A,B) = \frac{1}{2} \sum_{i \in (A \cup B)} |f_{i,A} - f_{i,B}| = \frac{1}{2} \left( \frac{1}{3} + \frac{2}{3} + \frac{1}{3} \right) = \frac{2}{3}$
Interpretation: D(A,B) is the proportion of word tokens that are different in the two texts.

Detection of SCIgen papers: June 2012
Automatic classification
SCIgen detection: proposed method (http://scigendetection.imag.fr)

  Corpus           Downloaded        Years   Field              Corpus size
  arXiv (1)        arxiv.org         08–10   Computer Science   15338
  MLT              ieee.org          08–10   Computer Science   122
  SCIgen-Origin    Original SCIgen   –       Computer Science   236
  SCIgen-Physics   Modified SCIgen   –       Physics            414
  (1) Open repository for scholarly papers.

Let t be a text under test and Fake_t be the distance between t and the nearest fake.
  If Fake_t < 0.55, then SCIgen origin must be seriously considered (misclassification risk < 10^{-5}).
  Else (Fake_t > 0.55), non-SCIgen origin must be seriously considered.

[Figure: dendrogram of the Corpus Z, MLT and SCIgen texts; vertical axis: distance, from 0.0 to 0.7.]
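The distance formula and the nearest-fake rule from the two slides above can be sketched in a few lines. The code below uses relative frequencies exactly as in the {le le chat} / {un chat chat} example, so it is only a simplified sketch: the function names are made up, the treatment of texts of very different lengths in [Labbé and Labbé, 2006] is not reproduced here, and both texts are assumed to be non-empty.

```python
from collections import Counter
from fractions import Fraction


def intertextual_distance(tokens_a, tokens_b):
    """Half the sum of absolute differences between relative word frequencies."""
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    len_a, len_b = len(tokens_a), len(tokens_b)
    vocabulary = set(count_a) | set(count_b)
    total = sum(
        abs(Fraction(count_a[word], len_a) - Fraction(count_b[word], len_b))
        for word in vocabulary
    )
    return total / 2


def nearest_fake_distance(tokens, fake_corpus):
    """Distance between a text and the closest text of a known-fake corpus."""
    return min(intertextual_distance(tokens, fake) for fake in fake_corpus)


def scigen_origin_suspected(tokens, fake_corpus, threshold=Fraction(55, 100)):
    """Decision rule from the slide: suspect SCIgen when Fake_t < 0.55."""
    return nearest_fake_distance(tokens, fake_corpus) < threshold


text_a = "le le chat".split()
text_b = "un chat chat".split()
print(intertextual_distance(text_a, text_b))  # 2/3
```

A distance of 0 means identical word-frequency profiles and a distance of 1 means no word in common, which is why a text that sits closer than 0.55 to a known fake deserves a second look.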
Detection of SCIgen papers: June 2012
Automatic classification
Chronos
[Figure: the same timeline from 2004 to 2015, with SCIgen and Ike Antkare, and with "Scopus, WoK, ..." and "Nature" added.]

Detection of SCIgen papers: June 2012
Automatic classification
Related/Ongoing Work

  Spoofing [Beel and Gipp, 2010; Lopez-Cozar et al., 2012]; academic optimization [Beel et al., 2010].
  Detecting methods: bibliography-based [Xiong and Huang, 2009]; compression [Dalkilic et al., 2006]; ad hoc distance [Lavoie and Krishnamoorthy, 2010]; phrase search [Springer, 2014].
  No SCIgen paper in arXiv (Computer Science). Image borrowed from [Ginsparg, 2014]: PCA, only stop-words. Supposed non-Zipfian.

Detection of SCIgen papers: June 2012
Automatic classification
Conclusion and Future/Ongoing Work

Publication procedures, models and habits
  - Why fake papers were accepted, published and ... sold.
  - Traditional publishers vs open access.
  - Knowledge diffusion: better and less... or as much as possible.
  - Automatic knowledge extraction/detection/generation.
Blind management rules...
  - ... are an incentive to malpractice: slicing, plagiarism, faked data, ...
Automatic detection of new generators
  - Hand-written PCFG: find dense clusters inside a population.
  - Study other kinds of generators (language models).
On the web today
  - How to separate the wheat from the chaff... and scale up!
Thanks

References

Beel, J. and Gipp, B. (2010). Academic search engine spam and Google Scholar's resilience against it. Journal of Electronic Publishing, 13(3).
Beel, J., Gipp, B., and Wilde, E. (2010). Academic search engine optimization (ASEO). Journal of Scholarly Publishing, 41(2):176–190.
Dalkilic, M. M., Clark, W. T., Costello, J. C., and Radivojac, P. (2006). Using compression to identify classes of inauthentic texts. In Proceedings of the 2006 SIAM Conference on Data Mining.
Ginsparg, P. (2014). Automated screening: ArXiv screens spot fake papers. Nature, 508(7494):44–44.
Hirsch, J. E. (2005). An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences, 102:16569–16572.
Labbé, C. (2010). Ike Antkare, one of the great stars in the scientific firmament. International Society for Scientometrics and Informetrics Newsletter, 6(2):48–52.
Labbé, C. and Labbé, D. (2006). A tool for literary studies: intertextual distance and tree classification. Literary and Linguistic Computing, 21(3):311–326.
Labbé, C. and Labbé, D. (2013). Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science? Scientometrics, 94(1):379–396.
Lavoie, A. and Krishnamoorthy, M. (2010). Algorithmic detection of computer generated text. ArXiv e-prints.
Lopez-Cozar, E. D., Robinson-García, N., and Torres-Salinas, D. (2012). Manipulating Google Scholar Citations and Google Scholar Metrics: simple, easy and tempting. arXiv preprint arXiv:1212.0638.
Xiong, J. and Huang, T. (2009). An effective method to identify machine automatically generated paper. In Knowledge Engineering and Software Engineering, 2009. KESE '09. Pacific-Asia Conference on, pages 101–102.