screen - Christophe Lalanne
Getting Started with Stata
Measures and tests of association
Christophe Lalanne, www.aliquote.org (d2e5ca9)

Synopsis
• Tests for comparing two means
• Tests for comparing k means
• Tests for comparing two proportions
• Analysis of a contingency table
• Measures of association in epidemiology

Illustration data

German socio-economic survey conducted in 2009: "GSOEP" (3).

Socio-demographic data
  ybirth     year of birth
  hhnr2009   household
  sex        sex
  mar        marital status
  edu        level of education
  yedu       number of years of education
  voc        secondary level or university
Employment and income
  emp        type of employment
  egp        socio-professional category
  income     income (€)
  hhinc      household income (€)
Housing
  size       size of the dwelling
  hhsize     number of persons in the dwelling

Data file: gsoep09.dta

. use data/gsoep09
(SOEP 2009 (Kohler/Kreuter))

Pre-processing:

. gen age = 2009 - ybirth
. mvdecode income, mv(0=.c)
      income: 1369 missing values generated
. gen lincome = log(income)
(2001 missing values generated)

Tests for comparing two means

Comparing two means

Student's t test, via the ttest command, is used to compare means for one sample (H0: µ = 0) or for two samples (independent or paired).
Illustration: does mean income differ by sex?

. bysort sex: summarize lincome

-> sex = Male

    Variable |     Obs        Mean   Std. Dev.        Min        Max
-------------+------------------------------------------------------
     lincome |    1746    10.08129   1.083648   3.828641   13.70765

-> sex = Female

    Variable |     Obs        Mean   Std. Dev.        Min        Max
-------------+------------------------------------------------------
     lincome |    1664    9.443893   1.073004    5.09375   13.32572
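The group summaries printed by summarize are all that a two-sample t test needs. As a minimal sketch (not part of the original slides), Stata's immediate command ttesti accepts the sample sizes, means and standard deviations directly. The figures below are copied from the summarize output above, so the result should agree with a ttest run on the raw data up to rounding of the inputs.

```stata
* Immediate form: ttesti #obs1 #mean1 #sd1 #obs2 #mean2 #sd2
* Values copied from the summarize output by sex (Male first, Female second)
ttesti 1746 10.08129 1.083648 1664 9.443893 1.073004

* Same comparison without assuming equal variances
ttesti 1746 10.08129 1.083648 1664 9.443893 1.073004, unequal
```

This form is handy when only published summary statistics are available and the raw data are not.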
. graph box lincome, over(sex) ytitle("Income (log)")

[Figure: box plots of log income for Male and Female; y-axis from 4 to 14.]

Student's t test

Statistics > Summaries, tables, and tests > Classical tests of hypotheses > t test

. ttest lincome, by(sex)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    Male |    1746    10.08129    .0259338    1.083648    10.03043    10.13216
  Female |    1664    9.443893    .0263042    1.073004      9.3923    9.495486
---------+--------------------------------------------------------------------
combined |    3410    9.770257    .0192551    1.124407    9.732504    9.808009
---------+--------------------------------------------------------------------
    diff |            .6374003    .0369475                .5649587    .7098419
------------------------------------------------------------------------------
    diff = mean(Male) - mean(Female)                              t =  17.2515
Ho: diff = 0                                     degrees of freedom =     3408

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

Student's t test (continued)

Without assuming equal population variances (Welch's approximation, option welch; the option unequal on its own applies Satterthwaite's approximation) (5):

. ttest lincome, by(sex) welch

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    Male |    1746    10.08129    .0259338    1.083648    10.03043    10.13216
  Female |    1664    9.443893    .0263042    1.073004      9.3923    9.495486
---------+--------------------------------------------------------------------
combined |    3410    9.770257    .0192551    1.124407    9.732504    9.808009
---------+--------------------------------------------------------------------
    diff |            .6374003    .0369388                .5649759    .7098247
------------------------------------------------------------------------------
    diff = mean(Male) - mean(Female)                              t =  17.2556
Ho: diff = 0                             Welch's degrees of freedom =  3405.02

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

To actually compare two variances, the sdtest command uses the same syntax as ttest.

Confidence intervals

The ci command builds confidence intervals for a given confidence level (option level()):

. bysort sex: ci lincome

-> sex = Male

    Variable |     Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+------------------------------------------------------------
     lincome |    1746    10.08129    .0259338        10.03043    10.13216

-> sex = Female

    Variable |     Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+------------------------------------------------------------
     lincome |    1664    9.443893    .0263042          9.3923    9.495486

Additional command: mean (same result; the limits it reports coincide with the Student-based limits from ci).

. mean lincome if sex == 1

Mean estimation                     Number of obs    =    1746
--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
     lincome |   10.08129   .0259338      10.03043    10.13216
--------------------------------------------------------------

Manually, with the normal approximation (lower limit):

. local zc = invnormal(0.975)
. display 10.08129 - `zc' * .0259338
10.030461

To build confidence intervals based on a Student distribution, use invt instead (tprob returns probabilities rather than quantiles):

. display 10.08129 - invt(1745, 0.975) * .0259338
10.030425

Non-parametric alternative

The Wilcoxon rank-sum test (not to be confused with the median command) is a non-parametric alternative to Student's t test.

. ranksum lincome, by(sex)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

         sex |      obs    rank sum    expected
-------------+---------------------------------
        Male |     1746   3551869.5     2977803
      Female |     1664   2263885.5     2837952
-------------+---------------------------------
    combined |     3410     5815755     5815755

unadjusted variance    8.258e+08
adjustment for ties   -16.745225
                      ----------
adjusted variance      8.258e+08

Ho: lincome(sex==Male) = lincome(sex==Female)
             z =  19.976
    Prob > |z| =   0.0000

Tests for comparing k means

One-way analysis of variance

Analysis of variance (ANOVA) is used to compare more than two means (H0: µ1 = µ2 = · · · = µk). Stata offers two dedicated commands (without going through the linear model): oneway and anova.
Illustration: does mean income differ by type of employment?

. recode egp (1/2=1) (3/5=2) (8/9=3) (15/18=.), ///
      gen(egp4)
(4435 differences between egp and egp4)
. label define egp4 1 "Service class 1/2" ///
      2 "Non-manuals & self-employed" 3 "Manuals"
. label values egp4 egp4

Distributions by group

. histogram lincome, by(egp4, col(3)) freq

[Figure: frequency histograms of lincome in three panels (Service class 1/2, Non-manuals & self-employed, Manuals); x-axis: lincome, y-axis: Frequency. Graphs by RECODE of egp (Social Class (EGP)).]
. twoway (kdensity lincome), by(egp4)

[Figure: kernel density estimates of lincome in three panels (Service class 1/2, Non-manuals & self-employed, Manuals); x-axis: x, y-axis: kdensity lincome. Graphs by RECODE of egp (Social Class (EGP)).]

. graph box lincome, over(egp4) ytitle("Income (log)")

[Figure: box plots of log income for Service class 1/2, Non-manuals & self-employed and Manuals; y-axis from 4 to 14.]

Conditional means

. tabstat lincome, by(egp4) stats(mean sd count)

Summary for variables: lincome
     by categories of: egp4 (RECODE of egp (Social Class (EGP)))

            egp4 |      mean        sd         N
-----------------+------------------------------
Service class 1/ |  10.29525  .9454878      1085
Non-manuals & se |  9.776857  .9735212       868
         Manuals |  9.615197  1.002863      1102
-----------------+------------------------------
           Total |  9.902652  1.018826      3055
------------------------------------------------

ANOVA table

Statistics > Linear models and related > ANOVA/MANOVA > One-way ANOVA

. oneway lincome egp4

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      272.026782      2   136.013391    143.24     0.0000
 Within groups       2898.0461   3052   .949556388
------------------------------------------------------------------------
    Total           3170.07288   3054   1.03800684

Bartlett's test for equal variances:  chi2(2) =   3.7888  Prob>chi2 = 0.150

oneway response_var factor_var [if] [in] [, options]
• tabulate: displays means, standard deviations and group sizes
• bonferroni: pairwise comparison of means with Bonferroni correction

Checking the assumptions

• independence of the observations
• normality of the residuals
• equality of the (population) variances

Normality of the residuals

The swilk command provides the Shapiro-Wilk test. As a general rule, however, graphical methods are preferable:

. quietly: anova lincome egp4
. predict r, resid
(2356 missing values generated)
. qnorm r

[Figure: normal quantile plot of the residuals; x-axis: Inverse Normal, y-axis: Residuals.]

Equality of variances

Stata reports Bartlett's test for equality of variances as part of the oneway output. Levene's test is obtained with the robvar command (statistic W0):

. robvar lincome, by(egp4)

  RECODE of |
egp (Social |
      Class |       Summary of lincome
     (EGP)) |        Mean   Std. Dev.       Freq.
------------+------------------------------------
  Service c |   10.295247   .94548776        1085
  Non-manua |   9.7768571   .97352115         868
    Manuals |   9.6151967   1.0028632        1102
------------+------------------------------------
      Total |   9.9026521   1.0188262        3055

W0  = 12.5051486   df(2, 3052)     Pr > F = 0.00000390
W50 =  7.9388574   df(2, 3052)     Pr > F = 0.00036403
W10 = 10.6968625   df(2, 3052)     Pr > F = 0.00002348

Pairwise comparison of means

Correction options for post-hoc tests: bonferroni, scheffe or sidak.

. oneway lincome egp4, bonferroni noanova

      Comparison of lincome by RECODE of egp (Social Class (EGP))
                              (Bonferroni)
Row Mean-|
Col Mean |    Service   Non-manu
---------+----------------------
Non-manu |    -.51839
         |      0.000
         |
 Manuals |   -.680051    -.16166
         |      0.000      0.001

Similar conclusions are reached by applying the Bonferroni correction to the results of plain Student's t tests, that is, by multiplying each p-value by the number of comparisons (3 here).

. quietly: ttest lincome if egp4 != 1, by(egp4)
. display r(p)*3
.0009856

Alternative to oneway

The oneway command is limited to a single explanatory factor. The anova command is more general and covers:
• factorial and nested designs,
• balanced and unbalanced designs (cf. the computation of sums of squares),
• repeated measures,
• analysis of covariance.
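To make the list of designs above more concrete, here is a hedged sketch of how two of them might be written with anova. These calls are not in the original slides. They assume the variables sex, egp4, lincome and age created earlier in this deck, and no output is shown because they were not run here.

```stata
* Two-factor design: ## expands to both main effects plus their interaction
anova lincome egp4##sex

* Analysis of covariance: the c. prefix declares age as a continuous covariate
anova lincome egp4 c.age
```

With unbalanced data such as these, the reported partial sums of squares depend on which other terms are in the model, which is the point of the remark about sums of squares above.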
. anova lincome egp4

                  Number of obs =    3055     R-squared     =  0.0858
                  Root MSE      = .974452     Adj R-squared =  0.0852

    Source |  Partial SS    df       MS           F     Prob > F
-----------+----------------------------------------------------
     Model |  272.026782     2  136.013391     143.24     0.0000
           |
      egp4 |  272.026782     2  136.013391     143.24     0.0000
           |
  Residual |   2898.0461  3052  .949556388
-----------+----------------------------------------------------
     Total |  3170.07288  3054  1.03800684

Multiple comparisons

After anova, pairwise comparisons of means are obtained with pwcompare, a more general command than pwmean. The correction options (mcompare()) additionally include tukey, snk, duncan and dunnett.

. pwcompare egp4, cformat(%3.2f)

Pairwise comparisons of marginal linear predictions

Margins      : asbalanced

------------------------------------------------------------------------------
                             |                            Unadjusted
                             |   Contrast   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
                        egp4 |
 Non-manuals & self-employed |
                          vs |
           Service class 1/2 |      -0.52       0.04        -0.61       -0.43
                     Manuals |
                          vs |
           Service class 1/2 |      -0.68       0.04        -0.76       -0.60
                     Manuals |
                          vs |
[the rest of this output is cut off in the source]

Tests for comparing two proportions

Exact and approximate tests of proportions

Besides Pearson's χ2 test for the cross-tabulation of two binary variables, Stata provides the commands bitest (binomial test) and prtest (test based on the normal approximation). In the univariate case, the binary variable must be coded 0/1. Several types of confidence intervals are available (4).
Illustration: are the two sexes equally represented in the sample?

. generate sexb = sex - 1
. tabulate sexb

       sexb |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      2,585       47.77       47.77
          1 |      2,826       52.23      100.00
------------+-----------------------------------
      Total |      5,411      100.00

Binomial test

Statistics > Summaries, tables, and tests > Classical tests of hypotheses > Binomial probability test

. bitest sexb == 0.5

    Variable |        N   Observed k   Expected k   Assumed p   Observed p
-------------+------------------------------------------------------------
        sexb |     5411       2826       2705.5       0.50000      0.52227

  Pr(k >= 2826)              = 0.000551  (one-sided test)
  Pr(k <= 2826)              = 0.999500  (one-sided test)
  Pr(k <= 2585 or k >= 2826) = 0.001102  (two-sided test)

. ci sexb, binomial

                                                     -- Binomial Exact --
    Variable |     Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+------------------------------------------------------------
        sexb |    5411    .5222695    .0067905         .508859    .5356559

One-sample test of proportion

Statistics > Summaries, tables, and tests > Classical tests of hypotheses > Proportion test

. prtest sexb == 0.5

One-sample test of proportion                   sexb: Number of obs =     5411
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.                     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        sexb |   .5222695   .0067905                      .5089604    .5355785
------------------------------------------------------------------------------
    p = proportion(sexb)                                          z =   3.2763
Ho: p = 0.5

     Ha: p < 0.5                 Ha: p != 0.5                   Ha: p > 0.5
 Pr(Z < z) = 0.9995         Pr(|Z| > |z|) = 0.0011          Pr(Z > z) = 0.0005

Two-sample test of proportions

. generate egpb = egp4 == 1
. prtest egpb, by(sexb)

Two-sample test of proportions                     0: Number of obs =     2585
                                                   1: Number of obs =     2826
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           0 |   .2201161   .0081491                      .2041441     .236088
           1 |   .1854211   .0073107                      .1710923    .1997498
-------------+----------------------------------------------------------------
        diff |    .034695   .0109478                      .0132376    .0561523
             |  under Ho:   .0109269     3.18   0.001
------------------------------------------------------------------------------
        diff = prop(0) - prop(1)                                  z =   3.1752
    Ho: diff = 0

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 0.9993         Pr(|Z| > |z|) = 0.0015          Pr(Z > z) = 0.0007

Immediate commands

Several Stata commands have an "immediate" form, which takes summary figures typed on the command line instead of variables.

prtesti #obs1 #p1 #obs2 #p2 [, level(#) count]

Statistics > Summaries, tables, and tests > Classical tests of hypotheses > Proportion test calculator

. prtesti 2585 0.2201 2826 0.1854

The count option lets you work with observed counts rather than relative frequencies.

Analysis of a contingency table

Building a two-way table

Statistics > Summaries, tables, and tests > Frequency tables > Two-way table with measures of association

The tabulate command (two-way form) builds a table of counts or relative frequencies and has options for the Pearson and Fisher statistics (1).

. tabulate sex egp4

                     |   RECODE of egp (Social Class
                     |             (EGP))
              Gender | Service c  Non-manua    Manuals |     Total
---------------------+---------------------------------+----------
                Male |       569        290        717 |     1,576
              Female |       524        592        396 |     1,512
---------------------+---------------------------------+----------
               Total |     1,093        882      1,113 |     3,088

Row and column profiles
. tabulate sex egp4, row

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

                     |   RECODE of egp (Social Class
                     |             (EGP))
              Gender | Service c  Non-manua    Manuals |     Total
---------------------+---------------------------------+----------
                Male |       569        290        717 |     1,576
                     |     36.10      18.40      45.49 |    100.00
---------------------+---------------------------------+----------
              Female |       524        592        396 |     1,512
                     |     34.66      39.15      26.19 |    100.00
---------------------+---------------------------------+----------
               Total |     1,093        882      1,113 |     3,088
                     |     35.40      28.56      36.04 |    100.00

χ2 test of association

. tabulate sex egp4, chi

                     |   RECODE of egp (Social Class
                     |             (EGP))
              Gender | Service c  Non-manua    Manuals |     Total
---------------------+---------------------------------+----------
                Male |       569        290        717 |     1,576
              Female |       524        592        396 |     1,512
---------------------+---------------------------------+----------
               Total |     1,093        882      1,113 |     3,088

          Pearson chi2(2) = 196.5961   Pr = 0.000

Expected counts

The expected option displays the counts expected under independence.

. tabulate sex egp4, expected

+--------------------+
| Key                |
|--------------------|
|     frequency      |
| expected frequency |
+--------------------+

                     |   RECODE of egp (Social Class
                     |             (EGP))
              Gender | Service c  Non-manua    Manuals |     Total
---------------------+---------------------------------+----------
                Male |       569        290        717 |     1,576
                     |     557.8      450.1      568.0 |   1,576.0
---------------------+---------------------------------+----------
              Female |       524        592        396 |     1,512
                     |     535.2      431.9      545.0 |   1,512.0
---------------------+---------------------------------+----------
               Total |     1,093        882      1,113 |     3,088
                     |   1,093.0      882.0    1,113.0 |   3,088.0

Fisher's exact test
. tabulate sex egp4, exact

Enumerating sample-space combinations:
stage 3:  enumerations = 1
stage 2:  enumerations = 351
stage 1:  enumerations = 0

                     |   RECODE of egp (Social Class
                     |             (EGP))
              Gender | Service c  Non-manua    Manuals |     Total
---------------------+---------------------------------+----------
                Male |       569        290        717 |     1,576
              Female |       524        592        396 |     1,512
---------------------+---------------------------------+----------
               Total |     1,093        882      1,113 |     3,088

           Fisher's exact =                 0.000

Measures of association in epidemiology

Risk measures

Statistics > Epidemiology and related > Tables for epidemiologists

Stata offers a wide range of association tests and risk measures commonly used in epidemiology.

Odds ratio

The tabodds command is used for case-control or cross-sectional studies. It computes the odds ratio and its asymptotic confidence interval (other options: cornfield or woolf), and tests the homogeneity of the odds ratios across strata (Mantel-Haenszel test).
Other available commands: cc and mcc (case-control studies), ir (cohort studies). All of these commands have an alternative "immediate" form.
Manual: [ST] epitab

Illustration data

Study of birth weights (2).

  low     birth weight < 2.5 kg
  age     mother's age
  lwt     mother's weight (pounds) at last menstrual period
  race    mother's ethnicity ("w", "b", "o")
  smoke   mother's smoking status during pregnancy
  ht      history of hypertension
  ui      presence of uterine irritability
  ftv     number of visits to the gynecologist during the first trimester
  ptl     number of previous premature labours
  bwt     baby's weight (grams)

. clear all
. webuse lbw
(Hosmer & Lemeshow data)
. list in 1/5

     +-----------------------------------------------------------------------+
     | id   low   age   lwt    race       smoke   ptl   ht   ui   ftv    bwt |
     |-----------------------------------------------------------------------|
  1. | 85     0    19   182   black   nonsmoker     0    0    1     0   2523 |
  2. | 86     0    33   155   other   nonsmoker     0    0    0     3   2551 |
  3. | 87     0    20   105   white      smoker     0    0    0     1   2557 |
  4. | 88     0    21   108   white      smoker     0    0    1     2   2594 |
  5. | 89     0    18   107   white      smoker     0    0    1     0   2600 |
     +-----------------------------------------------------------------------+

Calculating the odds ratio

. tabodds low smoke, or

---------------------------------------------------------------------------
       smoke |  Odds Ratio       chi2       P>chi2     [95% Conf. Interval]
-------------+-------------------------------------------------------------
   nonsmoker |    1.000000          .            .            .           .
      smoker |    2.021944       4.90       0.0269     1.069897    3.821169
---------------------------------------------------------------------------
Test of homogeneity (equal odds): chi2(1)  =     4.90
                                  Pr>chi2  =   0.0269

Score test for trend of odds:     chi2(1)  =     4.90
                                  Pr>chi2  =   0.0269

. cc low smoke, woolf

                 | smoked during pregnancy|             Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |        30          29  |         59       0.5085
        Controls |        44          86  |        130       0.3385
-----------------+------------------------+------------------------
           Total |        74         115  |        189       0.3915
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         2.021944       |     1.08066    3.783112 (Woolf)
 Attr. frac. ex. |         .5054264       |    .0746392    .7356673 (Woolf)
 Attr. frac. pop |         .2569965       |
                 +-------------------------------------------------
                               chi2(1) =     4.92  Pr>chi2 = 0.0265

Calculating the relative risk
. cs low smoke

                 | smoked during pregnancy|
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+------------
           Cases |        30          29  |         59
        Noncases |        44          86  |        130
-----------------+------------------------+------------
           Total |        74         115  |        189
                 |                        |
            Risk |  .4054054    .2521739  |   .3121693
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
 Risk difference |         .1532315       |    .0160718    .2903912
      Risk ratio |         1.607642       |    1.057812    2.443262
 Attr. frac. ex. |          .377971       |    .0546528    .5907112
 Attr. frac. pop |         .1921887       |
                 +-------------------------------------------------
                               chi2(1) =     4.92  Pr>chi2 = 0.0265

References

1. I Campbell. Chi-squared and Fisher-Irwin tests of two-by-two tables with small sample recommendations. Statistics in Medicine, 26(19):3661–3675, 2007.
2. D Hosmer and S Lemeshow. Applied Logistic Regression. New York: Wiley, 1989.
3. U Kohler and F Kreuter. Data Analysis Using Stata. College Station: Stata Press, 2012.
4. RG Newcombe. Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine, 17(8):857–872, 1998.
5. BL Welch. On the comparison of several mean values: An alternative approach. Biometrika, 38:330–336, 1951.

Command index

anova, bitest, bysort, cc, ci, clear, cs, display, generate, graph box, histogram, invt, kdensity, label define, label values, list, local, log, mean, mvdecode, normal, oneway, predict, prtest, prtesti, pwcompare, pwmean, qnorm, quietly, ranksum, recode, robvar, sqrt, summarize, tabodds, tabstat, tabulate, ttest, twoway, use, webuse
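The epidemiology slides note that cc, cs and related commands have an "immediate" form. As a hedged sketch (these calls are not in the original slides), the cs and cc tables for low by smoke could be reproduced from the four cell counts alone, and the two point estimates checked by hand with display. The counts are copied from the tables shown earlier.

```stata
* Cell order for both commands:
*   a = exposed cases,    b = unexposed cases,
*   c = exposed noncases, d = unexposed noncases (controls)

* Cohort-style table (risk difference, risk ratio): csi #a #b #c #d
csi 30 29 44 86

* Case-control table with Woolf confidence limits: cci #a #b #c #d
cci 30 29 44 86, woolf

* Point estimates by hand
* Risk ratio = (a/(a+c)) / (b/(b+d))
display (30/74) / (29/115)
* Odds ratio = (a*d) / (b*c)
display (30*86) / (29*44)
```

The two display lines should return about 1.6076 and 2.0219, matching the risk ratio and odds ratio reported by cs and cc.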