
Exercise 1

Based on the data:

    Contains data
      obs:           200
     vars:             7                          10 Feb 2001 16:27
     size:         6,400 (99.8% of memory free)
    ---------------------------------------------------------------------
      1. id        float  %9.0g
      2. female    float  %9.0g  fl    sex: 0 - male; 1 - female
      3. ses       float  %9.0g  sl    socio-economic status (SES)
      4. lang      float  %9.0g        language test score
      5. math      float  %9.0g        math score
      6. science   float  %9.0g        science score
      7. honors    float  %9.0g
    ---------------------------------------------------------------------

Socio-economic status (hereafter SES) is measured by the parents' education. That SES shapes a child's educational career is widely known and has been confirmed by numerous studies.

    summarize

    Variable |     Obs        Mean   Std. Dev.       Min        Max
    ---------+-----------------------------------------------------
          id |     200       100.5   57.87918          1        200
      female |     200        .545   .4992205          0          1
         ses |     200       2.055   .7242914          1          3
        lang |     200       52.23   10.25294         28         76
        math |     200      52.645   9.368448         33         75
     science |     200       51.85   9.900891         26         74
      honors |     200        .265   .4424407          0          1

    -> tabulation of honors

         honors |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        147       73.50       73.50
              1 |         53       26.50      100.00
    ------------+-----------------------------------
          Total |        200      100.00

    -> tabulation of female

         female |      Freq.     Percent        Cum.
    ------------+-----------------------------------
       0 - male |         91       45.50       45.50
     1 - female |        109       54.50      100.00
    ------------+-----------------------------------
          Total |        200      100.00

    tabulate ses

            ses |      Freq.     Percent        Cum.
    ------------+-----------------------------------
            low |         47       23.50       23.50
         middle |         95       47.50       71.00
           high |         58       29.00      100.00
    ------------+-----------------------------------
          Total |        200      100.00

it was decided to examine what influences the award of honors. Hypotheses are tested at the 0.05 significance level.

1. A linear probability model was estimated:

    regress honors lang female

          Source |       SS       df       MS          Number of obs =     200
    -------------+------------------------------      F(  2,   197) =   35.85
           Model |  10.3957196     2  5.19785982      Prob > F      =  0.0000
        Residual |  28.5592804   197  .144970966      R-squared     =  0.2669
    -------------+------------------------------      Adj R-squared =  0.2594
           Total |      38.955   199  .195753769      Root MSE      =  .38075

          honors |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+-------------------------------------------------------------
            lang |   .0214989   .0026362     8.16   0.000     .0163001   .0266977
          female |   .1467375    .054142     2.71   0.007     .0399652   .2535098
           _cons |  -.9378584   .1448623    -6.47   0.000    -1.223538  -.6521786

a) What problems will arise in the model above?
b) Which assumption of the classical linear regression model (CLRM) will not be satisfied, and how can this problem be solved? Can the results above be used to test which variables are significant?
c) Interpret the parameter estimates.

2. A logit model was estimated:

    xi: logit honors lang math female i.ses, nolog
    i.ses             _Ises_1-3           (naturally coded; _Ises_1 omitted)

    Logit estimates                                   Number of obs =     200
                                                      LR chi2(5)    =
                                                      Prob > chi2   =  0.0000
    Log likelihood = -71.994756                       Pseudo R2     =  0.3774

          honors |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+-------------------------------------------------------------
            lang |   .0687277   .0287044     2.39   0.017     .0124681   .1249873
            math |   .1358904   .0336874     4.03   0.000     .0698642   .2019166
          female |   1.145726   .4513589     2.54   0.011     .2610792   2.030374
         _Ises_2 |  -1.040402   .5791511    -1.80   0.072    -2.175517    .094713
         _Ises_3 |   .0541296   .5945439     0.09   0.927    -1.111155   1.219414
           _cons |  -12.55332   1.838493    -6.83   0.000     -16.1567  -8.949939

a) List the assumptions of the logit model.
b) Test the hypothesis of joint significance of the variables.
c) Which variables are significant at the 0.05 significance level?
d) Carry out the test of joint significance of the explanatory variables. Please perform this test and interpret its result, given the frequency table of the dependent variable:

    tabulate honors

    -> tabulation of honors

         honors |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        147       73.50       73.50
              1 |         53       26.50      100.00
    ------------+-----------------------------------
          Total |        200      100.00

and the quantiles of the chi-square distribution:

Hint: find the maximum likelihood estimator for the logit model that contains only a constant, then compute the value of the log-likelihood function at that estimate.

    logit honors

    Logit estimates                                   Number of obs =     200
                                                      LR chi2(0)    =   -0.00
                                                      Prob > chi2   =       .
    Log likelihood = -115.64441                       Pseudo R2     = -0.0000

          honors |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+-------------------------------------------------------------
           _cons |  -1.020141   .1602206    -6.37   0.000    -1.334167   -.706114

Note that the log likelihood for iteration 0 is LL_0, i.e. it is the log likelihood when there are no explanatory variables in the model and only the constant term is included. The last log likelihood reported is LL_M.

    a_1 = ln(53/147) = -1.020141

f) Based on the parameter estimates, assess the direction (negative or positive) of the effect of each explanatory variable on the probability of being awarded honors.

3. A classification table was computed for a cutoff probability of 0.5. Please compute the sensitivity, the specificity, the count R2 and the adjusted count R2.

    lstat

    Logistic model for honors

                  -------- True --------
    Classified |         D            ~D  |      Total
    -----------+--------------------------+-----------
         +     |        31            10  |         41
         -     |        22           137  |        159
    -----------+--------------------------+-----------
       Total   |        53           147  |        200

Answer: Sensitivity = 31/53 = .58490566. Specificity = 137/147 = .93197279. Count R2 = (31+137)/200 = .84, adjusted count R2 = ((31+137) - 147)/(200-147) = 21/53 ≈ .396.

4. Interpret the results of the link test (linktest). What does it test?

    linktest, nolog   /* general test of model specification */

    Logit estimates                                   Number of obs =     200
                                                      LR chi2(2)    =   87.39
                                                      Prob > chi2   =  0.0000
    Log likelihood = -71.950384                       Pseudo R2     =  0.3778

          honors |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+-------------------------------------------------------------
            _hat |   .9703938    .172421     5.63   0.000     .6324549   1.308333
          _hatsq |  -.0236258   .0801491    -0.29   0.768    -.1807151   .1334635
           _cons |    .041176   .2706831     0.15   0.879    -.4893531   .5717051
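The hint for part d) and the answer to part 3 can be checked numerically. The sketch below is written in Python rather than Stata, purely as an illustration. It assumes the counts 53 and 147 from the frequency table, the full-model log likelihood -71.994756 from the logit output, and the usual tabulated 0.95 quantile of the chi-square distribution with 5 degrees of freedom (about 11.07). All variable names are made up for this example.

```python
import math

# Frequencies of the dependent variable (from the tabulation of honors).
n1, n0 = 53, 147
n = n1 + n0

# Intercept-only logit: the ML estimate of the constant is the log odds.
b0 = math.log(n1 / n0)

# Log likelihood of the intercept-only model at that estimate.
p_hat = n1 / n
ll_0 = n1 * math.log(p_hat) + n0 * math.log(1 - p_hat)

# Log likelihood of the full model, read off the logit output.
ll_m = -71.994756

# Likelihood-ratio statistic; chi-square with 5 df under H0
# (all five slope coefficients equal to zero).
lr = 2 * (ll_m - ll_0)
crit_95_df5 = 11.07  # tabulated 0.95 quantile of chi2(5)
reject_h0 = lr > crit_95_df5

# Classification table at cutoff 0.5 (from lstat).
tp, fp, fn, tn = 31, 10, 22, 137
sensitivity = tp / (tp + fn)   # share of true 1s classified as 1
specificity = tn / (tn + fp)   # share of true 0s classified as 0

print(f"b0 = {b0:.6f}, LL_0 = {ll_0:.5f}, LR = {lr:.2f}, reject H0: {reject_h0}")
print(f"sensitivity = {sensitivity:.4f}, specificity = {specificity:.4f}")
```

Because the statistic is far above the critical value, the hypothesis that all slope coefficients are jointly zero would be rejected at the 0.05 level under these assumptions.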

5. It was decided to carry out a goodness-of-fit test. What is the null hypothesis of this test? Two versions of the test were used:

- subsamples are defined as all possible combinations of the independent variables (covariate patterns):

    lfit

    Logistic model for honors, goodness-of-fit test

       number of covariate patterns =       199
                  Pearson chi2(194) =    164.86
                        Prob > chi2 =    0.9365

Hosmer and Lemeshow suggest that when the number of covariate patterns is large relative to the number of observations, their index of fit is more appropriate.

- the version proposed by Hosmer and Lemeshow:

    lfit, group(10)

    Logistic model for honors, goodness-of-fit test
    (Table collapsed on quantiles of estimated probabilities)

             number of observations =       200
                   number of groups =        10
            Hosmer-Lemeshow chi2(8) =      8.25
                        Prob > chi2 =    0.4095

Which version of the test should we use, and why? What is the result of the test?

6. Interpret the odds ratios.

    logit, or

    Logit estimates                                   Number of obs =     200
                                                      LR chi2(5)    =   87.30
                                                      Prob > chi2   =  0.0000
    Log likelihood = -71.994756                       Pseudo R2     =  0.3774

      honors | Odds Ratio   Std. Err.       z     P>|z|      [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
        lang |   1.071145   .0307466     2.394   0.017      1.012546   1.133134
        math |   1.145556   .0385909     4.034   0.000      1.072363   1.223746
      female |   3.144725     1.4194     2.538   0.011       1.29833   7.616932
        ses1 |   .9473093    .563217    -0.091   0.927      .2954031   3.037865
        ses2 |   .3346963   .1617908    -2.264   0.024      .1297728   .8632135

7. Interpret the R2 measures.

    fitstat, saving(mod1)

    Measures of Fit for logit of honors

    Log-Lik Intercept Only:     -115.644     Log-Lik Full Model:        -73.643
    D(195):                      147.286     LR(4):                      84.003
                                             Prob > LR:                   0.000
    McFadden's R2:                 0.363     McFadden's Adj R2:           0.320
    Maximum Likelihood R2:         0.343     Cragg & Uhler's R2:          0.500
    McKelvey and Zavoina's R2:     0.560     Efron's R2:                  0.388
    Variance of y*:                7.485     Variance of error:           3.290
    Count R2:                      0.840     Adj Count R2:                0.396
    AIC:                           0.786     AIC*n:                     157.286
    BIC:                        -885.886     BIC':                      -62.810

    (Indices saved in matrix fs_mod1)
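For item 6, an odds ratio is the exponential of the corresponding logit coefficient. The short Python sketch below (illustrative only; the dictionary names are invented for this example) reproduces three of the reported odds ratios from the coefficients of the logit output in item 2, and converts each into a percentage change in the odds.

```python
import math

# Logit coefficients for lang, math and female, copied from the logit output.
coef = {"lang": 0.0687277, "math": 0.1358904, "female": 1.145726}

# Odds ratio = exp(coefficient).
odds_ratio = {name: math.exp(b) for name, b in coef.items()}

# Percentage change in the odds for a one-unit increase in the regressor
# (for female: the odds for women relative to men), other variables fixed.
pct_change = {name: 100 * (ratio - 1) for name, ratio in odds_ratio.items()}

for name in coef:
    print(f"{name}: odds ratio {odds_ratio[name]:.4f}, "
          f"odds change {pct_change[name]:.1f}%")
```

Read this way, one additional point on the language test multiplies the odds of receiving honors by about 1.07, holding the other variables constant.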

Deviance

Deviance compares a given model to a fully saturated one. It reflects the error that remains after the predictors are included in the model, so it concerns the significance of the unexplained variance in the response variable. One wants the deviance to be non-significant, that is, its p-value should be greater than .05. In many respects deviance in categorical models functions the way SSresid functions in OLS regression: the smaller the deviance, the better the model fits the data.

Pseudo R2

As discussed in an earlier unit, the R2 in OLS regression can take on several different meanings: the proportion of variance accounted for, the squared correlation between observed and predicted values, and a transformation of the F-statistic. In categorical models there is no single index that fills all of these roles. Instead, a number of pseudo-R2 measures have been developed to help in assessing fit.

McFadden's R2

This is also known as the likelihood-ratio index. It compares the likelihood of the intercept-only model to the likelihood of the model with the predictors. McFadden's R2 can be as low as zero but can never equal one.

Adjusted McFadden's R2

The adjusted version of McFadden's R2 subtracts K, the number of parameters in the model, from the log likelihood of the full model. Thus, the adjusted McFadden's R2 is to McFadden's R2 as the adjusted R2 is to R2 in OLS regression.

Maximum Likelihood R2

The maximum likelihood R2 expresses the model fit as a transformation of the likelihood-ratio chi-square, in a way analogous to the R2 in OLS regression, which can be thought of as a transformation of the F-statistic. The maximum likelihood R2 can reach a maximum of $1 - L(M_{\text{Intercept}})^{2/N}$.
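The three likelihood-based measures described above can be recomputed from the two log likelihoods printed by fitstat. The Python sketch below is illustrative only; it assumes K = 5 (the constant plus four predictors, as implied by LR(4) and D(195) in the fitstat output) and N = 200, and uses the rounded log likelihoods from that output, so the results agree with fitstat only to about three decimals.

```python
import math

ll_0 = -115.644   # log likelihood, intercept-only model (fitstat)
ll_m = -73.643    # log likelihood, full model (fitstat)
k = 5             # assumed number of estimated parameters, incl. the constant
n = 200           # number of observations

# McFadden's R2 (likelihood-ratio index).
mcfadden = 1 - ll_m / ll_0

# Adjusted McFadden's R2: penalise the full-model log likelihood by K.
mcfadden_adj = 1 - (ll_m - k) / ll_0

# Maximum likelihood R2 as a transformation of the LR chi-square.
lr = 2 * (ll_m - ll_0)
ml_r2 = 1 - math.exp(-lr / n)

# Upper bound of the ML R2: 1 - L(M_Intercept)^(2/N).
ml_r2_max = 1 - math.exp(2 * ll_0 / n)

print(f"McFadden {mcfadden:.3f}, adjusted {mcfadden_adj:.3f}, "
      f"ML {ml_r2:.3f} (max {ml_r2_max:.3f})")
```

The upper bound below one is what motivates the rescaled index discussed next.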

Cragg & Uhler's R2

Because of the limitation on the maximum value of the maximum likelihood R2, Cragg and Uhler proposed a relative index that can reach one.

McKelvey and Zavoina's R2

The McKelvey and Zavoina R2 is an attempt to measure model fit as the proportion of variance accounted for. In this case, we are attempting to explain the variance of the latent variable. The variance of the latent variable can be estimated by $\widehat{\operatorname{Var}}(y^*) = \hat\beta' \, \widehat{\operatorname{Var}}(x) \, \hat\beta + \operatorname{Var}(\varepsilon)$, where $\operatorname{Var}(\varepsilon) = \pi^2/3$ in the logit model.

Efron's R2

Efron's R2 is another model fit index based on the proportion of variance accounted for.

Count R2

The count R2, as discussed above, is the proportion of correctly classified observations.

Adjusted Count R2

The count R2 can give misleading values under certain circumstances. In a binary model it is possible to correctly categorize at least 50% of the cases, without using any information from the predictors, by always choosing the outcome with the largest percentage. The count R2 therefore needs to be adjusted by the largest marginal total of the observed outcome. In our example, the adjusted count R2 = ((31+137) - 147)/(200-147). Thus, the adjusted count R2 is the proportion of correct guesses beyond the number that would be obtained by always guessing the largest marginal.
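As a rough check of the remaining indices, the following Python sketch recomputes Cragg and Uhler's R2, McKelvey and Zavoina's R2 and both count measures from numbers printed earlier (the fitstat log likelihoods, the reported variance of y*, and the classification table). It assumes the logit error variance is π²/3; the variable names are illustrative.

```python
import math

ll_0, ll_m, n = -115.644, -73.643, 200   # from the fitstat output

# Cragg & Uhler: ML R2 divided by its maximum attainable value.
ml_r2 = 1 - math.exp(2 * (ll_0 - ll_m) / n)
ml_r2_max = 1 - math.exp(2 * ll_0 / n)
cragg_uhler = ml_r2 / ml_r2_max

# McKelvey & Zavoina: explained share of the latent-variable variance.
var_ystar = 7.485               # "Variance of y*" reported by fitstat
var_error = math.pi ** 2 / 3    # assumed logit error variance
mckelvey_zavoina = (var_ystar - var_error) / var_ystar

# Count R2 and adjusted count R2 from the classification table.
correct = 31 + 137              # correctly classified observations
largest_marginal = 147          # size of the most frequent observed outcome
count_r2 = correct / n
adj_count_r2 = (correct - largest_marginal) / (n - largest_marginal)

print(f"Cragg-Uhler {cragg_uhler:.3f}, McKelvey-Zavoina {mckelvey_zavoina:.3f}")
print(f"count R2 {count_r2:.3f}, adjusted count R2 {adj_count_r2:.3f}")
```

The adjusted count R2 of roughly 0.40 says that the model removes about 40% of the errors made by always predicting the most frequent outcome.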

Information Indices

The pseudo-R2 measures are limited in that they can only be used to compare nested models. Model fit can also be based on measures of information. Akaike's information criterion (AIC) and the Bayesian information criterion (BIC) are two commonly used measures. One advantage of information criteria is that they can be used to compare non-nested models. For these measures, smaller is better.

AIC & AIC*n

$$\mathrm{AIC} = \frac{-2 \ln L(M_k) + 2P}{N}$$

where $L(M_k)$ is the likelihood of the model and $P$ is the number of parameters ($K+1$). Some researchers use AIC multiplied by $N$, which fitstat calls AIC*n. Regardless, smaller is better.

BIC & BIC'

The BIC is based on the deviance, while the BIC' uses the likelihood-ratio chi-square:

$$\mathrm{BIC}_k = D(M_k) - df_k \ln N, \qquad \mathrm{BIC}'_k = -G^2(M_k) + df'_k \ln N$$

For BIC the term $df_k$ is the degrees of freedom of the deviance, and in the BIC' equation $df'_k$ is the number of predictors in the model. When two models are compared, the difference in their BIC values is the same as the difference in their BIC' values. The table below can assist in interpreting the difference between two models. As above, the smaller BIC or BIC' is better.

Interpreting BIC and BIC'

    Absolute Difference    Evidence
    0-2                    Weak
    2-6                    Positive
    6-10                   Strong
    >10                    Very Strong
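The information measures can be recomputed from the same fitstat output. The Python sketch below assumes the definitions used by fitstat, namely AIC = (-2 ln L + 2P)/N, BIC = D - df ln N and BIC' = -LR + df' ln N, with P = 5 parameters and df' = 4 predictors as implied by D(195) and LR(4). Since the printed log likelihoods are rounded, the results match the output only to about two decimals.

```python
import math

n = 200
ll_0, ll_m = -115.644, -73.643   # from the fitstat output
n_params = 5                     # assumed: constant + 4 predictors
n_predictors = 4                 # assumed from LR(4)

deviance = -2 * ll_m             # D(195)
lr = 2 * (ll_m - ll_0)           # LR(4)

# AIC per observation and its unscaled version (AIC*n).
aic = (deviance + 2 * n_params) / n
aic_n = aic * n

# BIC from the deviance, BIC' from the LR chi-square.
df_deviance = n - n_params
bic = deviance - df_deviance * math.log(n)
bic_prime = -lr + n_predictors * math.log(n)

print(f"AIC {aic:.3f}, AIC*n {aic_n:.3f}, BIC {bic:.3f}, BIC' {bic_prime:.3f}")
```

When two candidate models are fitted to the same data, the difference in either BIC or BIC' can then be read against the evidence table above.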

Exercise 2

Next, a probit model was estimated.

    probit honors lang math female ses1 ses2

    Probit estimates                                  Number of obs =     200
                                                      LR chi2(5)    =   88.28
                                                      Prob > chi2   =  0.0000
    Log likelihood = -71.503442                       Pseudo R2     =  0.3817

      honors |      Coef.   Std. Err.       z     P>|z|      [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
        lang |   .0439894   .0162434     2.708   0.007      .0121528   .0758259
        math |   .0760789    .018053     4.214   0.000      .0406958    .111462
      female |   .6752606   .2523046     2.676   0.007      .1807526   1.169769
        ses1 |  -.0275906   .3397904    -0.081   0.935     -.6935676   .6383864
        ses2 |  -.6179796   .2723557    -2.269   0.023     -1.151787  -.0841724
       _cons |  -7.334563   1.056422    -6.943   0.000     -9.405111  -5.264015

a) List the assumptions of the probit model.
b) Test the hypothesis of joint significance of the variables.
c) Which variables are significant at the 0.05 significance level?
d) Based on the parameter estimates, assess the direction (negative or positive) of the effect of each explanatory variable on the probability of being awarded honors.
e) Explain why interactions between lang and female and between math and female were included in the model.

    for var lang math: generate fxX = female*X

    probit honors lang math female ses1 ses2 fxlang fxmath

    Probit estimates                                  Number of obs =     200
                                                      LR chi2(7)    =   89.08
                                                      Prob > chi2   =  0.0000
    Log likelihood = -71.104283                       Pseudo R2     =  0.3851

      honors |      Coef.   Std. Err.       z     P>|z|      [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
        lang |   .0325027   .0233381     1.393   0.164     -.0132391   .0782445
        math |   .0717692   .0254528     2.820   0.005      .0218825   .1216559
      female |  -.9346668    1.92794    -0.485   0.628     -4.713361   2.844027
        ses1 |   -.003803   .3424154    -0.011   0.991     -.6749249    .667319
        ses2 |  -.5965207   .2774592    -2.150   0.032     -1.140331  -.0527107
      fxlang |   .0203053   .0323945     0.627   0.531     -.0431868   .0837974
      fxmath |   .0081221   .0363954     0.223   0.823     -.0632115   .0794558
       _cons |  -6.427969   1.443015    -4.455   0.000     -9.256227  -3.599711

f) Based on the parameter estimates, assess the direction (negative or positive) of the effect of each explanatory variable on the probability of being awarded honors. Keep in mind that the model includes interactions.
g) Interpret the marginal effects reported in the table below.
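One way to judge whether the two interaction terms add anything is a likelihood-ratio comparison of the two probit models, which are nested. The exercise does not ask for this test explicitly, so the Python sketch below is only an illustration. It uses the two log likelihoods printed in the probit outputs and the standard tabulated 0.95 quantile of the chi-square distribution with 2 degrees of freedom (about 5.991).

```python
# Log likelihoods copied from the two probit outputs.
ll_restricted = -71.503442     # without fxlang and fxmath
ll_unrestricted = -71.104283   # with fxlang and fxmath

# LR statistic for H0: both interaction coefficients equal zero.
lr = 2 * (ll_unrestricted - ll_restricted)

# Two restrictions, so the reference distribution is chi2(2).
crit_95_df2 = 5.991            # tabulated 0.95 quantile of chi2(2)
reject_h0 = lr > crit_95_df2

print(f"LR = {lr:.4f}, reject H0 at 0.05: {reject_h0}")
```

The statistic equals the difference between the two reported LR chi2 values (89.08 - 88.28) up to rounding, and it is well below the critical value, which is consistent with the individually insignificant z statistics of fxlang and fxmath.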

Exercise 3

    describe

    Contains data from bbdm13.dta
      obs:           200
     vars:            15                          5 Nov 2001 19:37
     size:        10,400 (96.1% of memory free)
    ---------------------------------------------------------------------
                  storage  display     value
    variable name   type   format      label      variable label
    ---------------------------------------------------------------------
    str             float  %9.0g                  stratum
    obs             float  %9.0g
    agmt            float  %9.0g                  age at interview
    fndx            float  %9.0g                  final diagnosis
    chk             float  %9.0g                  regular check-ups
    agp1            float  %9.0g                  age at 1st preg
    agmn            float  %9.0g                  age at menarche
    nlv             float  %9.0g                  number stillbirths
    liv             float  %9.0g                  number live births
    wt              float  %9.0g                  wt at interview
    mst             float  %9.0g                  marital status
    mar             byte   %8.0g                  married
    mod             byte   %8.0g                  div or sep
    wid             byte   %8.0g                  widowed
    nvmr            byte   %8.0g                  never married
    ---------------------------------------------------------------------

    summarize

        Variable |     Obs        Mean   Std. Dev.       Min        Max
    -------------+-----------------------------------------------------
             str |     200        25.5   14.46708          1         50
             obs |     200         2.5    1.12084          1          4
            agmt |     200      46.185   10.29323         27         68
            fndx |     200         .25   .4340993          0          1
             chk |     200       1.405   .4921239          1          2
            agp1 |     178    23.57865    4.05847         14         40
            agmn |     200       12.95   1.744338          8         17
             nlv |     178    .5168539   .9638946          0          7
             liv |     178    2.853933   1.544449          0         11
              wt |     200     143.715   31.92994         80        280
             mst |     200       1.655   1.234339          1          5
             mar |     200        .725   .4476348          0          1
             mod |     200         .13   .3371474          0          1
             wid |     200        .085   .2795815          0          1
            nvmr |     200         .06   .2380828          0          1

    clogit fndx chk agmn wt mod wid nvmr, group(str) nolog

    Conditional (fixed-effects) logistic regression   Number of obs =     200
                                                      LR chi2(6)    =   48.20
                                                      Prob > chi2   =  0.0000
    Log likelihood = -45.214824                       Pseudo R2     =  0.3477

            fndx |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+-------------------------------------------------------------
             chk |  -1.121849   .4474471    -2.51   0.012    -1.998829  -.2448688
            agmn |   .3561333   .1291722     2.76   0.006     .1029605   .6093061
              wt |  -.0283565   .0099776    -2.84   0.004    -.0479122  -.0088009
             mod |  -.2030472   .6472909    -0.31   0.754    -1.471714    1.06562
             wid |  -.4915826   .8173094    -0.60   0.548     -2.09348   1.110314
            nvmr |   1.472195   .7582064     1.94   0.052    -.0138621   2.958252

Exercise 4

We will illustrate Poisson regression using the lahigh data set. In particular, we would like to know whether there is a gender difference in days absent, and what the relation is between language NCE test scores and days absent. Note that for gender, 0 is female and 1 is male.

[Figure: histogram of days absent]

    summarize gender langnce daysabs

        Variable |     Obs        Mean   Std. Dev.       Min        Max
    -------------+-----------------------------------------------------
          gender |     316    .4873418   .5006325          0          1
         langnce |     316    50.06379   17.93921   1.007114   98.99289
         daysabs |     316    5.810127   7.449003          0         45

    summarize daysabs, detail

                             days absent
    -------------------------------------------------------------
          Percentiles      Smallest
     1%            0              0
     5%            0              0
    10%            0              0       Obs                 316
    25%            1              0       Sum of Wgt.         316

    50%            3                      Mean           5.810127
                            Largest       Std. Dev.      7.449003
    75%            8             35
    90%           14             35       Variance       55.48764
    95%           23             41       Skewness       2.250587
    99%           35             45       Kurtosis       8.949302

    poisson daysabs gender langnce

    Poisson regression                                Number of obs =     316
                                                      LR chi2(2)    =  171.50
                                                      Prob > chi2   =  0.0000
    Log likelihood = -1549.8567                       Pseudo R2     =  0.0524

         daysabs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+-------------------------------------------------------------
          gender |  -.4093528   .0482192    -8.49   0.000    -.5038606  -.3148449
         langnce |    -.01467   .0012934   -11.34   0.000    -.0172051  -.0121349
           _cons |   2.646977   .0697764    37.94   0.000     2.510217   2.783736

1. State the assumptions of the Poisson model.
2. Explain why a model for ordered variables should not be used in this case.
3. Test the hypothesis of joint significance of the explanatory variables.
4. Which variables are significant at the 0.05 significance level?
5. Interpret the parameters.
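For questions 1 and 5, two quick calculations may help. The Python sketch below (illustrative only, with invented variable names) turns the Poisson coefficients into incidence-rate ratios by exponentiating them, and compares the sample mean and variance of daysabs from the summarize output, since the Poisson model assumes that the conditional mean and variance are equal.

```python
import math

# Coefficients copied from the poisson output.
coef = {"gender": -0.4093528, "langnce": -0.01467}

# Incidence-rate ratio: multiplicative change in the expected count
# for a one-unit increase in the regressor.
irr = {name: math.exp(b) for name, b in coef.items()}
pct_change = {name: 100 * (ratio - 1) for name, ratio in irr.items()}

# Unconditional mean and variance of daysabs (from summarize, detail).
mean_days, var_days = 5.810127, 55.48764
dispersion_ratio = var_days / mean_days

for name in coef:
    print(f"{name}: IRR {irr[name]:.4f} ({pct_change[name]:.1f}% change)")
print(f"variance / mean = {dispersion_ratio:.2f}")
```

Under these numbers the expected number of days absent for boys (gender = 1) is about a third lower than for girls, other things equal. The variance-to-mean ratio far above one is only an unconditional comparison, but it suggests that the equidispersion assumption deserves scrutiny.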

Exercise 5

Consider the situation in which we have a measure of academic aptitude (scaled 200-800) which we want to model using reading and math test scores and whether the student is enrolled in a public or private school. The problem here is that students who answer all questions on the academic aptitude test correctly receive a score of 800, even though it is likely that these students are not "truly" equal in aptitude.

Description of the Data

The academic aptitude variable is apt; the reading and math test scores are read and math respectively. The variable public is a zero-one variable, with the ones indicating a public school student.

    summarize

        Variable |     Obs        Mean   Std. Dev.       Min        Max
    -------------+-----------------------------------------------------
              id |     200       100.5   57.87918          1        200
             apt |     200      651.06   101.4404        420        800
            read |     200       52.23   10.25294         28         76
            math |     200      52.645   9.368448         33         75
          public |     200        .545   .4992205          0          1

    histogram apt, normal bin(10) xline(800)

    tabulate public

         public |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |         91       45.50       45.50
              1 |        109       54.50      100.00
    ------------+-----------------------------------
          Total |        200      100.00

    correlate read math public apt
    (obs=200)

                 |     read     math   public      apt
    -------------+------------------------------------
            read |   1.0000
            math |   0.6623   1.0000
          public |  -0.0531  -0.0293   1.0000
             apt |   0.5971   0.6171   0.2567   1.0000

    graph matrix read math apt, half jitter(2)

    tobit apt read math public, ul(800)

    Tobit regression                                  Number of obs =     200
                                                      LR chi2(3)    =  149.03
                                                      Prob > chi2   =  0.0000
    Log likelihood = -1072.2469                       Pseudo R2     =  0.0650

             apt |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+-------------------------------------------------------------
            read |   3.681712   .6873387     5.36   0.000     2.326225   5.037198
            math |   4.557839   .7538493     6.05   0.000     3.071189   6.044489
          public |    62.1633   10.57346     5.88   0.000     41.31159   83.01501
           _cons |   188.3943   32.74961     5.75   0.000     123.8095   252.9791
    -------------+-------------------------------------------------------------
          /sigma |   73.63244   3.873908                      65.99279   81.27209

    Obs. summary:          0  left-censored observations
                         185     uncensored observations
                          15 right-censored observations at apt>=800

1. State the assumptions of the tobit model.
2. What unusual feature will the distribution of the dependent variable most likely have?
3. Why will a linear regression estimated on the whole observed sample most likely produce fitted values that cannot be interpreted for some of the observations?

4. Interpret the measures of fit.

    fitstat

    Measures of Fit for tobit of apt

    Log-Lik Intercept Only:    -1146.761     Log-Lik Full Model:      -1072.247
    D(195):                     2144.494     LR(3):                     149.028
                                             Prob > LR:                   0.000
    McFadden's R2:                 0.065     McFadden's Adj R2:           0.061
    ML (Cox-Snell) R2:             0.525     Cragg-Uhler(Nagelkerke) R2:  0.525
    McKelvey & Zavoina's R2:       0.531
    Variance of y*:            11565.884     Variance of error:        5421.737
    AIC:                          10.772     AIC*n:                    2154.494
    BIC:                        1111.322     BIC':                     -133.133
    BIC used by Stata:          2170.985     AIC used by Stata:        2154.494
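Several of these measures can be traced back to the tobit output. The Python sketch below is illustrative: it assumes that the reported "Variance of error" is the squared /sigma estimate and that the adjusted McFadden measure penalises the log likelihood by five estimated parameters (four coefficients plus sigma, consistent with D(195)). Printed values are rounded, so agreement is only to about three decimals.

```python
ll_0 = -1146.761      # log likelihood, intercept-only model (fitstat)
ll_m = -1072.247      # log likelihood, full model (fitstat)
sigma = 73.63244      # /sigma from the tobit output
var_ystar = 11565.884 # "Variance of y*" reported by fitstat
n_params = 5          # assumed: 4 coefficients + sigma

# Error variance of the latent regression.
var_error = sigma ** 2

# McKelvey & Zavoina: explained share of the latent-variable variance.
mckelvey_zavoina = 1 - var_error / var_ystar

# McFadden's R2 and its adjusted version.
mcfadden = 1 - ll_m / ll_0
mcfadden_adj = 1 - (ll_m - n_params) / ll_0

print(f"Var(error) {var_error:.3f}, McKelvey-Zavoina {mckelvey_zavoina:.3f}")
print(f"McFadden {mcfadden:.3f}, adjusted {mcfadden_adj:.3f}")
```

The contrast between a McFadden R2 of about 0.07 and a McKelvey and Zavoina R2 of about 0.53 illustrates that the pseudo-R2 measures are on different scales and should not be compared with one another.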