OpenPsych forums

Full Version: [ODP] An update on the narrowing of the black-white gap in the Wordsum
(2014-Nov-28, 00:30:33)Meng Hu Wrote: I would like to encourage some experts to review it, but given my extremely low response rate to my very modest, short questions (e.g., what's the consequence of ignoring slope-intercept covariance; can we use an effect size measure for the random component in multilevel regression), I don't think they will spend more time reviewing a 21-page article. If someone accepts reviewing it, I will keep you informed.

The analysis has grown beyond my technical competence, too. Perhaps you should consider submitting it to another journal, one with reviewers who are more familiar with the specific methods. Regardless, I will be happy to concept-check your method when you are done. Just make a list of your procedures and their justifications, with citations added.
I'm very late. Two reasons. The big one: I had to read tons of books (4-5) and articles (20-30? I don't remember) on multilevel regression, and along the way I may have lost my motivation, so I was spacing out all the time. The small one: I dislike R, especially when I'm forced to use it.

I have re-examined all numbers in my spreadsheet and my Stata syntax, and updated them, due to new analyses. But I will publish them when I'm certain that my submitted version is going to be the last one. Unless you really want to see them now.

Anyway, here's the 3rd version.
https://osf.io/9tgmi/
I don't know what's happening with OSF; the overview on the first page doesn't display for me, but it should still be possible to download it.

In short, I added multilevel analyses, extended my tobit regressions by using continuous cohort/year variables instead of dummies, and modified the variables family16 and reg16 that Emil found bothersome due to their original coding. I have also added an appendix with the R syntax, so that everyone can reproduce what I did with a free (but so damn ugly) piece of software.

As always, to make things easier for you, in yellow are the changes I've made (from 2nd to 3rd version). Thus, it goes without saying that if you haven't looked at the changes made from version 1 to version 2, you won't see them in yellow in this 3rd version.

I'm really exhausted, so in this post I will not write out all the important paragraphs from the books I have read. I have already noted all the important pages in my article. Even if you don't own these books physically, you already know where you can get them.
Meng Hu,

You did a lot of work on this, I can see.

Again, you should not put the R code in the PDF. Copying from the PDF is bad (it screws up formatting) and it inflates the page count, since almost all readers do not care about the syntax. Please put the source code in the OSF repository. The same applies to footnote 1. Put all the data, syntax, files, etc. in the OSF repository, so that it is not necessary to contact the authors. You know how slow they are, if they even reply.

The figures are still not placed in the text. This means that the reader has to jump back and forth between the end of the paper and the text. You should move them up so they are near the text where they are discussed.

Quote:The variable "reg16" has been dichotomized as 0=0, 1/4=1, 5/7=0, 8/9=1, so that the values of the new variable is divided into South(=0)/North(=1) or, equivalently, into Wordsum score below 6 (value=0) or above 6 (value=1).

This doesn't sound right. Those living in the south all received scores below <6 and everybody in the north >6?

Bold text means there is something that needs attention. Either grammar or wrong word or something.

Quote:However, I will focuse on age and

Quote:I decide to model cohort as random effects

Quote:The respective numbers for Blacks were 58.9, 50.5, 41.5, 36.6, 31.3, 24.8, respectively.

Quote:These numbers are also suggestive that the wordsum narrowing has ceased somewhat.

Quote:Footnote 14: The Shapiro-Wilk W value should be close to 0.99 to indicate normality.

You mean, should be above .99 to indicate normality.

Is that following my test results? http://emilkirkegaard.dk/en/?p=4452 If so, cite it. If you got it from some other reference, cite that.

Quote:homoscedasticity in the underlying latent dependentvariable

Quote:The Cronbach's alpha analysis of the ten items of the Wordsum (worda to wordj) yielded an alpha of 0.6365 for Blacks (N=3,550) and 0.7015 for Whites (N=18,606). The alpha for the entire sample is 0.7137 (N=23,817). This is quite comparable to the early estimates.

Odd that merging two subgroups with α's of .70 and .64 increases the merged group α to .71? Increased variation increases α?

Quote:Hopefully, Yang (2006) addressed this question

In the source code

Quote:m <- vglm(wordsum ~ bw1 + agec + agec2 + agec3 + bw1agec + bw1agec2 + bw1agec3, tobit(Upper=10), data=d, weights=d$weight)
summary(m)
d$wordsumpredictedage <- 5.210807+1.355356*d$bw1+-.0051079*d$agec+-.0016632*d$agec2+.000017*d$agec3+.0169752*d$bw1agec+.0002868*d$bw1agec2+.0000104*d$bw1agec3

You can extract the values from the model object so you don't need to type them manually. This is better as it reduces data errors and increases precision. You can find the attribute you need by using attributes(m).
I didn't know the syntax wouldn't survive the PDF. I always copy-pasted from my document file.

Emil Wrote:This doesn't sound right. Those living in the south all received scores below <6 and everybody in the north >6?

That's the truth. When I do a cross tabulation of wordsum per region, I get wordsum higher than 6 for regions in the North, and less than 6 for regions in the South.

The pictures are not embedded in the text this time, because there are too many of them. In particular, figure 7 is too big. If I shrink it, it becomes worthless, because the nine graphs of figure 7 are already small; in that case, I should just delete it. If you want me to put it with the text, you get a big empty space across several pages, and I don't like that. Also, I have 4 graphs for the tobit regressions, and the "result" section for tobit has only 2 pages. In other words, I would have to insert the third graph in section 2 and the fourth graph in section 3.2, i.e., multilevel regression. That doesn't sound right to me.

I attached the file as a document. If you want, you can try to rearrange the graphs as you like. But when I tried, I was not satisfied with the result, especially with figure 7.

I see no problem in "However, I will focuse on age and". Concerning "I decide to model cohort as random effects": as far as I can tell, most people write "random effects" with an "s", and rarely without, because the random effects are the parameter values for each value of your level-2 variable. I have 21 categories in my cohort variable, so 21 random effects, i.e., variations around the average slope estimated in the fixed component of the multilevel regression.

I removed "respectively", wrote "indicate" instead of "are also suggestive", and I will think about "hopefully" later, although I don't understand the problem. Concerning "homoscedasticity in the underlying latent dependent variable", I don't see the problem.

Quote:Odd that merging two subgroups with α's of .70 and .64 increases the merged group α to .71? Increased variation increases α?

I don't know why. Cronbach's alpha is supposed to be affected by two factors: 1) the strength of the inter-item correlations and 2) the number of items. But 2) does not change here, so it must be 1).
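A quick simulation (made-up data, not the GSS) illustrates why pooling can raise alpha: a group difference in means is shared by every item, so it adds to all the inter-item correlations.

```r
# Sketch: Cronbach's alpha for two mean-shifted groups vs. the pooled sample.
# Simulated items = latent score + group shift + noise; data are made up.
set.seed(1)
k <- 10; n <- 2000
theta <- rnorm(n)                         # latent ability
shift <- rep(c(0, 1), each = n / 2)       # group B scores 1 SD higher
items <- sapply(1:k, function(i) theta + shift + rnorm(n))

alpha <- function(x) {                    # standard alpha formula
  k <- ncol(x)
  k / (k - 1) * (1 - sum(apply(x, 2, var)) / var(rowSums(x)))
}

alpha(items[shift == 0, ])                # group A alone
alpha(items[shift == 1, ])                # group B alone
alpha(items)                              # pooled sample: higher than either
```

The pooled alpha exceeds both within-group alphas because the between-group variance inflates the item intercorrelations, consistent with explanation 1).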

Quote:You mean, should be above .99 to indicate normality.

Is that following my test results? http://emilkirkegaard.dk/en/?p=4452 If so, cite it. If you got it from some other reference, cite that.

I don't think it's above .99. In several instances, some authors even said that >.95 is sufficient to indicate normality. But I didn't cite these references (found in Google Books); I want a good reference, and I couldn't find one. Even the original paper of Shapiro and Wilk (1965) didn't mention which value you should be seeking. If you find a good reference, I will add it, of course.

By the way, regarding this comment

Quote:You can extract the values from the model object so you don't need to type them manually. This is better as it reduces data errors and increases precision. You can find the attribute you need by using attributes(m).

You have to tell me what syntax you have in mind. Is this the one you mean?

yhat<-fitted(m)

Because if this is the case, I think it's wrong. If you don't believe me, just try to make a scatterplot of the object yhat against, say, cohort, age, year, or any other variable, and see the result. This object "yhat" is not an actual variable, so you have to create one yourself. That's why R is pretty stupid. When you save your predicted Y (Y-hat) in SPSS or Stata, you get a new variable, and you can correlate it. But not in R, or at least not with the fitted() function. Concerning possible mistakes in the numbers I have typed: I do not type them myself; I copy-pasted them from Stata (or R). However, I read the lines carefully (twice) to make sure there is no mistake.

Also, because the object yhat is not an actual variable, I don't know how to remove the undesirable values from it, because in all the software I have tried (Stata, SPSS, R), the predicted Y and/or residuals are calculated even for the observations that are missing from your regression model. If you correlate this yhat, that is a problem.
Quote: I didn't know the syntax wouldn't survive the PDF. I always copy-pasted from my document file.

Copying from PDF causes problems with whitespace, including line breaks. There is no good reason to put code in a PDF unless it is very short. Source code should be in the supplementary material on OSF.

Quote: That's the truth. When I do a cross tabulation of wordsum per region, I get wordsum higher than 6 for regions in the North, and less than 6 for regions in the South.

EVERYBODY in the south received scores below 6, and EVERYBODY in the north above? That is clearly not true. You must mean something else.

Quote: The pictures are not embedded in the text this time, because there are too many of them. In particular, figure 7 is too big. If I shrink it, it becomes worthless, because the nine graphs of figure 7 are already small; in that case, I should just delete it. If you want me to put it with the text, you get a big empty space across several pages, and I don't like that. Also, I have 4 graphs for the tobit regressions, and the "result" section for tobit has only 2 pages. In other words, I would have to insert the third graph in section 2 and the fourth graph in section 3.2, i.e., multilevel regression. That doesn't sound right to me.

I attached the file as a document. If you want, you can try to rearrange the graphs as you like. But when I tried, I was not satisfied with the result, especially with figure 7.

I don't understand what the problem is, aside from the hassle of writing in word processors like this. I moved the figures to the nearest text location. Figure 4 wasn't mentioned in the text at all, so I left it at the end.

Why can't you do something like what I did?

Quote:I see no problem in "However, I will focuse on age and".

"focuse" is a spelling error of "focus".

Quote:"I decide to model cohort as random effects"

Congruence error (singular and plural).

Quote: I removed "respectively", wrote "indicate" instead of "are also suggestive", and I will think about "hopefully" later, although I don't understand the problem. Concerning "homoscedasticity in the underlying latent dependent variable", I don't see the problem.

In your (private) email to me just now, you used "hopefully" wrong again. You mean "luckily". Luckily = It was good luck that it happened. Hopefully = hope it will happen in the future.

Quote: I don't think it's above .99. In several instances, some authors even said that >.95 is sufficient to indicate normality. But I didn't cite these references (found in Google Books); I want a good reference, and I couldn't find one. Even the original paper of Shapiro and Wilk (1965) didn't mention which value you should be seeking. If you find a good reference, I will add it, of course.

They are clearly wrong. Look at my simulations. Anything below .99 is hopelessly non-normal.

Quote: You have to tell me what syntax you have in mind. Is this the one you mean?

yhat<-fitted(m)

No. I simply told you to extract the values from the object instead of typing them off in the code. You can use attributes(object) to see which attributes an object in R has.

For instance, if one uses factor analysis with the fa() function (from psych package), one can extract the scores and loadings from the factor analysis object. E.g.

Code:
object.fa = fa(object) #factor analysis on object
object.fa$loadings #factor loadings vector
object.fa$scores #factor scores vector

Quote: Because if this is the case, I think it's wrong. If you don't believe me, just try to make a scatterplot of the object yhat against, say, cohort, age, year, or any other variable, and see the result. This object "yhat" is not an actual variable, so you have to create one yourself. That's why R is pretty stupid. When you save your predicted Y (Y-hat) in SPSS or Stata, you get a new variable, and you can correlate it. But not in R, or at least not with the fitted() function. Concerning possible mistakes in the numbers I have typed: I do not type them myself; I copy-pasted them from Stata (or R). However, I read the lines carefully (twice) to make sure there is no mistake.

Also, because the object yhat is not an actual variable, I don't know how to remove the undesirable values from it, because in all the software I have tried (Stata, SPSS, R), the predicted Y and/or residuals are calculated even for the observations that are missing from your regression model. If you correlate this yhat, that is a problem.

This is not a flaw in R. It is because you are not using it correctly...

In the case of extracting model parameters:

Code:
>object.lm = lm(variable1 ~ variable2, data.frame) #fit the model
>attributes(object.lm)
$names
[1] "coefficients"  "residuals"     "fitted.values" "effects"       "weights"       "rank"        
[7] "assign"        "qr"            "df.residual"   "na.action"     "xlevels"       "call"        
[13] "terms"         "model"        

$class
[1] "lm"

You can extract various values from it. I don't know which ones you want. From your syntax it looks like you want coefficients? Or fitted values? Coefficients are in

Code:
object.lm$coefficients #unstd. coefficients
predict(object.lm) #fitted values
fitted(object.lm) #same
object.lm$fitted.values #same

This also appears to be what your syntax does, but in a silly way.
(2014-Dec-13, 03:58:29)Emil Wrote: EVERYBODY in the south received scores below 6, and EVERYBODY in the north above? That is clearly not true. You must mean something else.

Where did I say that? I said the mean of wordsum is <6 in the south and >6 in the north.

Quote:I don't understand what the problem is, aside from the hassle of writing in Word Processors like this. I moved the Figures to the nearest text location. Figure 4 wasn't mentioned in the text at all, so I left it in the end.

Figure 4 was mentioned: "Figures 3 and 4 display the White and Black trends in Wordsum score derived from model 2 (i.e., with age and sex partialled out) for birth cohort and survey year".

I sent you the file as a document for a reason. If you can rearrange the pictures and avoid the big empty space caused by figure 7, I will take it. But you have to make sure that figures 3-6 are inserted only in the section belonging to the tobit analysis (i.e., section 3.1, which has only 2 pages). I repeat, I tried it before, but that big empty space right in the middle of the page is annoying. Not even mentioning graphs 5-6: where should I put them? If you want me to put the graphs exactly where you want them, I would have to remove figures 5, 6 and 7, because they are the source of the trouble.

I will correct the errors in grammar later.

Quote:They are clearly wrong. Look at my simulations. Anything below .99 is hopelessly non-normal.

I disagree again. See here. Anyway, I would be surprised if they just came up with this cutoff of >0.95 without knowing what they are talking about. I think they have some references, and that's why they decided on this cutoff. But where are these references mentioning that we could use that cutoff value? I can't find them.

Quote:
Code:
object.lm$coefficients #unstd. coefficients
predict(object.lm) #fitted values
fitted(object.lm) #same
object.lm$fitted.values #same

This also appears to be what your syntax does, but in a silly way.

About how to get fitted values, let me ask you a question: did you try the syntax I suggested? The answer is probably no, because if you did, you would see that it won't work. Like I said, SPSS and Stata also have the same flaw; residuals and/or predicted Y values are computed even for the observations not included in your model. But in those programs you can remove the values for the observations that are absent from your regression model. In Stata, you just do:

Code:
predict predicted_values if e(sample), xb

instead of just :

Code:
predict predicted_values, xb

But R is not Stata. I know of no other way to remove these undesirable values other than this :

Code:
d$wordsumpredictedage<- 5.210807+1.355356*d$bw1+-.0051079*d$agec+-.0016632*d$agec2+.000017*d$agec3+.0169752*d$bw1agec+.0002868*d$bw1agec2+.0000104*d$bw1agec3

d$wordsumpredictedage<-d$wordsumpredictedage+d$wordsum+d$bw1+d$age
d$wordsumpredictedage<-d$wordsumpredictedage-d$wordsum-d$bw1-d$age

Since age and bw1 have a much larger N (i.e., more values other than NA) than wordsum and wordsumpredictedage, when you do this arithmetic, the NAs in bw1 and age become NAs in the variable wordsumpredictedage. It's a weird way of doing it, but I don't care. It works perfectly.
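For what it's worth, the same NA propagation can be done in one step by masking directly on missingness. A minimal sketch with a made-up toy data frame and placeholder coefficients (not the paper's estimates):

```r
# Toy stand-in for the GSS subset; values and coefficients are placeholders.
d <- data.frame(wordsum = c(6, NA, 8),
                bw1     = c(1, 0, NA),
                age     = c(30, 40, 50))
d$yhat <- 5.2 + 1.36 * d$bw1 - 0.005 * (d$age - 40.62)  # hand-computed Y-hat

# Instead of adding and then subtracting the variables, blank out the
# prediction wherever any model variable is missing:
miss <- is.na(d$wordsum) | is.na(d$bw1) | is.na(d$age)
d$yhat[miss] <- NA
```

This produces the same NAs as the add-then-subtract trick, without the detour.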
Mh,

Quote: Where did I say that? I said the mean of wordsum is <6 in the south and >6 in the north.

In the paper:

Draft #3 Wrote:The variable "reg16" has been dichotomized as 0=0, 1/4=1, 5/7=0, 8/9=1, so that the values of the new variable is divided into South(=0)/North(=1) or, equivalently, into Wordsum score below 6 (value=0) or above 6 (value=1).

And just before:

Emil, before Wrote:This doesn't sound right. Those living in the south all received scores below <6 and everybody in the north >6?

MH, in reply Wrote:That's the truth. When I do a cross tabulation of wordsum per region, I get wordsum higher than 6 for regions in the North, and less than 6 for regions in the South.

Notice the "all" in my quote above. You seem to have missed it. That's why I made it extra clear with "EVERYBODY".

What I think you meant:

The mean score of the southern regions is below 6. The mean score of the northern regions is above 6.

What you actually stated:

Everybody in the southern regions scored below 6. Everybody in the northern regions scored above 6.

Quote:Figure 4 was mentioned: "Figures 3 and 4 display the White and Black trends in Wordsum score derived from model 2 (i.e., with age and sex partialled out) for birth cohort and survey year".

My mistake. I searched for it using "Figure 4", but since you wrote "Figures 3 and 4", it did not turn up in my search.

Quote:I sent you the file as a document for a reason. If you can rearrange the pictures and avoid the big empty space caused by figure 7, I will take it. But you have to make sure that figures 3-6 are inserted only in the section belonging to the tobit analysis (i.e., section 3.1, which has only 2 pages). I repeat, I tried it before, but that big empty space right in the middle of the page is annoying. Not even mentioning graphs 5-6: where should I put them? If you want me to put the graphs exactly where you want them, I would have to remove figures 5, 6 and 7, because they are the source of the trouble.

I made a version of your paper where I moved them around. I forgot to attach it! It is attached now.

Quote:I disagree again. See here. Anyway, I would be surprised if they just came up with this cutoff of >0.95 without knowing what they are talking about. I think they have some references, and that's why they decided on this cutoff. But where are these references mentioning that we could use that cutoff value? I can't find them.

Look at what they actually talk about: p values, not the W value. Their W value for the non-normal distribution is .98, above the .95 cutoff you propose. For it to be normal, it should be above .99, as in my tests. Do some more tests in R if you want to see.

Quote:About how to get fitted values, let me ask you a question: did you try the syntax I suggested? The answer is probably no, because if you did, you would see that it won't work. Like I said, SPSS and Stata also have the same flaw; residuals and/or predicted Y values are computed even for the observations not included in your model. But in those programs you can remove the values for the observations that are absent from your regression model. In Stata, you just do:

So, you just need to remove them. Which observations are not included in your model?

Quote:Since age and bw1 have a much larger N (i.e., more values other than NA) than wordsum and wordsumpredictedage, when you do this arithmetic, the NAs in bw1 and age become NAs in the variable wordsumpredictedage. It's a weird way of doing it, but I don't care. It works perfectly.

In R, whenever you add or multiply variables together and one of them is NA, you will get an NA result.

As far as I can understand without running the code myself, the trouble is that when you use fitted/predict etc., it will calculate predictions for the observations that have missing values using the rest of the weights. Since you don't want this, you have resorted to calculating them manually in a very bad way. If you wanted to calculate them manually, you should still extract the coefficients from the model object.

In any case, what you want to do is just remove the cases with missing values before fitting the model to them. Then you can use the predict() function. Right?

To remove cases with missing values, you can use the complete.cases() function. I don't know which variables you want to include exactly, so I cannot provide the code for you.
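A minimal sketch of that approach, with made-up data and lm() standing in for vglm() (the variable names are assumptions, not the paper's code):

```r
# Fit on the listwise-complete subset, so the fitted values line up
# row-for-row with the data and can be stored as an ordinary column.
set.seed(1)
d <- data.frame(wordsum = c(rnorm(8, mean = 6), NA, 7),   # made-up data
                bw1     = c(NA, rbinom(9, 1, 0.5)),
                agec    = rnorm(10))
vars <- c("wordsum", "bw1", "agec")          # variables used in the model
dc   <- d[complete.cases(d[, vars]), ]       # drop rows with any NA in vars
m    <- lm(wordsum ~ bw1 + agec, data = dc)
dc$yhat <- fitted(m)                         # a real column, no NAs, same length as dc
```

Because the model is fitted on dc itself, dc$yhat can be plotted or correlated against any other column of dc directly.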
Emil Wrote:Look at what they actually talk about: p values, not the W value. Their W value for the non-normal distribution is .98, above the .95 cutoff you propose. For it to be normal, it should be above .99, as in my tests. Do some more tests in R if you want to see.

You missed the QQ plot: perfectly normal, but W is below 0.99. Also, you didn't answer my objection. See here for yet another example. It's not true that below 0.99 the distribution is necessarily non-normal. There is no proof of what you say.

Emil Wrote:So, you just need to remove them. Which observations are not included in your model?

I said it before. But here's an illustration :

Code:
y     x1     x2     x3

1     .     .     .
2     .     11     88
3     7     28     77
4     .     22     .
5     4     12     93
6     2     17     90
7     3     .      81
8     5     19     74
9     5     19     93
10    5     15     76

A listwise analysis here would involve 6 people out of 10. But when you compute Y_hat, probably all statistical software becomes silly and computes Y_hat for all 10 observations (because one variable here has 10 observations). Anyway, I said already that I found a solution. It's just that it is not elegant, because I couldn't find a better way.

Emil Wrote:In R, whenever you add or multiply variables together and one of them is NA, you will get an NA result.

I know. But that's precisely why I did it that way. Concerning the complete.cases() function: I'm aware of it. But how can you use it when you do this:

yhat<- 5.21 + 1.35*x1 + -.0051*x2 + -.00166*x3 + .000017*x4 + .01697*x5 + .0002868*x6 + .0000104*x7

Again, I don't know.

Emil Wrote:Since you don't want this, you have resorted to calculating them manually in a very bad way. If you wanted to calculate them manually, you should still extract the coefficients from the model object.

In any case, what you want to do is just remove the cases with missing values before fitting the model to them. Then you can use the predict() function. Right?

No. I repeat, you don't understand, and you probably never tried it. I said that predict() or fitted() will never give you an actual variable. Just do this.

Use this :

http://openpsych.net/datasets/GSSsubset.7z

And type this :

Code:
setwd("c:\\Users\\ ............. \\")
entiredata<-read.csv("GSSsubset.csv")
d<- subset(entiredata, age<70)

d$weight<- d$wtssall*d$oversamp # interaction variable

d$bw<- rep(NA) # generate variable with only missing values
d$bw[d$race==1] <- 1 # assign a value of "1" under the condition specified in bracket
d$bw[d$race==2] <- 0 # assign a value of "0" under the condition specified in bracket
d$bw0<- rep(NA)
d$bw0[d$year>=2000 & d$race==1 & d$hispanic==1] <- 1
d$bw0[d$year>=2000 & d$race==2 & d$hispanic==1] <- 0
d$bw00<- rep(NA)
d$bw00[d$year<2000 & d$race==1] <- 1
d$bw00[d$year<2000 & d$race==2] <- 0
d$bw1<-d$bw0 # combine the two vars, by incorporating the first var
d$bw1[is.na(d$bw0)]<-d$bw00[is.na(d$bw0)] # then the second, without Nas

d$agec<- d$age-40.62 # mean-centering of age
d$agec2<- d$agec^2 # squared term of age
d$agec3<- d$agec^3 # cubic term of age
d$bw1agec = d$bw1*d$agec
d$bw1agec2 = d$bw1*d$agec2
d$bw1agec3 = d$bw1*d$agec3

d$wordsum[d$wordsum==-1] <- NA
d$wordsum[d$wordsum==99] <- NA

require(VGAM)

I_hate_you_R<- vglm(wordsum ~ bw1 + agec + agec2 + agec3 + bw1agec + bw1agec2 + bw1agec3, tobit(Upper=10), data=d, weights=d$weight)
summary(I_hate_you_R)
yhat<-fitted(I_hate_you_R)

Now, you have your yhat. If you still don't believe me, then use this :

Code:
View(d)

And see... where is your yhat object? There is no column for the yhat object. As I said, it's not an actual variable, and correlating it with an actual variable listed in your data window either causes an error message or produces weird graphs.

Just compare :

Code:
d$wordsumpredictedage<- 5.210807+1.355356*d$bw1+-.0051079*d$agec+-.0016632*d$agec2+.000017*d$agec3+.0169752*d$bw1agec+.0002868*d$bw1agec2+.0000104*d$bw1agec3

require(lattice)

xyplot(yhat~ d$age, data=d, groups=bw1, pch=19, type=c("p"), col = c('red', 'blue'), grid=TRUE, ylab="Wordsum predicted by tobit", xlab="age", key=list(text=list(c("Black", "White")), points=list(pch=c(19,19), col=c("red", "blue")), columns=2))

xyplot(d$wordsumpredictedage ~ d$age, data=d, groups=bw1, pch=19, type=c("p"), col = c('red', 'blue'), grid=TRUE, ylab="Wordsum predicted by tobit", xlab="age", key=list(text=list(c("Black", "White")), points=list(pch=c(19,19), col=c("red", "blue")), columns=2))

Nonetheless, the yhat object has not been corrected by removing the missing data. But my variable d$wordsumpredictedage also has this problem; I didn't remove the NAs. Now let's do this:

Code:
d$wordsumpredictedage1<-d$wordsumpredictedage+d$wordsum+d$bw1+d$age
d$wordsumpredictedage1<-d$wordsumpredictedage1-d$wordsum-d$bw1-d$age

Let's redraw the graph :

Code:
xyplot(d$wordsumpredictedage1 ~ d$age, data=d, groups=bw1, pch=19, type=c("p"), col = c('red', 'blue'), grid=TRUE, ylab="Wordsum predicted by tobit", xlab="age", key=list(text=list(c("Black", "White")), points=list(pch=c(19,19), col=c("red", "blue")), columns=2))

You see that the graph is pretty much the same. The only difference is that it must have fewer data points (because a lot of observations are dropped). I'm not sure about its effect on the graph; I had the impression the other day that the Y values were smaller (or bigger?) but that the pattern of the curves was the same. Still, you get a graph that is readable, with a nice regression line. You can't do that with the yhat object. The scatter you obtain from the yhat object is similar to what you get if you plot wordsum against age: a cloud of points.

Imagine you do :

wordsum=age+race

You plot Y_hat by computing it the usual way :

new_variable = intercept + beta*age + beta*race

What you will have is a scatter of wordsum versus age that is identical for Blacks and Whites. But why? Because you did not include an age*race interaction; that is, the age effect on wordsum was not allowed to vary across races.

What I mean is that you should get a neat graph with a "regression line" rather than a "cloud" of points scattered everywhere. This shows that fitted() or predict() can't be used to obtain a true Y_hat variable, at least not in the usual way. In most software you can save the predicted Y variable and correlate it (after removing the undesirable missing data), but not in R.

But even with Stata, SPSS, or even SAS (which I haven't tried, but maybe I will, because it's free now), most of the time you will have to compute Y_hat by hand. The reason is that the Y_hat you request from your software will compute the predicted Y from all the variables in your model. Imagine I have added income, educ, degree, sibs, region, family, etc. The requested Y_hat will have data points locating each person at each value of sibs, each value of income, each value of educ, etc., around each value of your Y variable. But if I only want to plot the predicted Y based on age and race while controlling for SES/background variables, I don't need to add the coefficients of those variables to my Y_hat variable, because they will add "variation" around the plotted line. The graph becomes impossible to read, because the spread of the data points does not have the form of a single regression line.

If my explanation is too abstract: I planned (a month ago) that when my article is published, I will publish 3 articles in a row: 1) how to compute Y_hat and make "predicted plots", and how to avoid the pitfalls in doing so, 2) tobit and truncated regression, 3) multilevel regression. But if you want it now, I have attached a graph, based on the same regression above, but with all SES/background variables added. I requested the Y_hat directly from Stata and plotted it against age. The syntax was:

Code:
tobit wordsum bw1 agec agec2 agec3 bw1agec bw1agec2 bw1agec3 logincomec educc degreec gender sibs residence reg family [pweight = weight], ll ul
predict wordsumpredictedage if e(sample), xb
label variable wordsumpredictedage "Predicted Wordsum scores"
graph twoway (scatter wordsumpredictedage age if bw1==0, msymbol(Oh)) (scatter wordsumpredictedage age if bw1==1, msymbol(O)), legend(label(1 black) label(2 white))

You see the difficulty with that graph. Anyway, this discussion is pointless. If anyone here suspects I mistyped some of my numbers, just run the analysis in R; I checked the numbers at least 3 times, so the probability of a misreport (even a little one) is close to zero. Furthermore, what does it mean to say that I calculate it in a bad way? That this is not how Y_hat should be computed? In that case I disagree, obviously, because it's exactly what all the textbooks say to do. The formula is Y_hat = intercept + beta*(var1) + beta*(var2) ... etc.

Finally, the document you sent has a problem: you have shrunk Figure 7. I don't want that; I would rather delete it, because its 9 graphs were already small, and by reducing figure 7 they all become tiny. I will try to rearrange the figures as you want, even if the inclusion of Figure 4 will be tedious, but I don't think it's reasonable to reduce the size of Figure 7. A graph is useless if you can't see precisely what's happening in it, so they must be reasonably big.
MH,

Quote:You missed the QQ plot: perfectly normal, but W is below 0.99. Also, you didn't answer my objection. See here for yet another example. It's not true that below 0.99 the distribution is necessarily non-normal. There is no proof of what you say.

The reason W is below .99 in this case is the small sample size. In my tests, N = 5k in all cases; in the example above, n = 20. If you use rnorm with a larger sample, the W metric behaves as in my simulations. With small N's, one can of course get strange results.

In your case, N is very large, so these small-scale simulations are not relevant.
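A quick toy check (R's shapiro.test accepts between 3 and 5000 observations):

```r
# Toy check: Shapiro-Wilk W on truly normal draws at small vs large N
# (R's shapiro.test accepts between 3 and 5000 observations)
set.seed(1)
w_small <- shapiro.test(rnorm(20))$statistic    # can fall well below .99 by chance
w_large <- shapiro.test(rnorm(5000))$statistic  # essentially always above .99
```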

Quote:A listwise analysis would involve here 6 people, out of 10. But when you compute Y_hat, probably all statistical software become silly and compute Y_hat for all the 10 observations (because one variable here has 10 observations). Anyway, I said already that I found the solution. It's just that it is not elegant because I couldn't find a better way.

Just remove the observations with missing values before fitting, then you won't have this problem.
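A minimal sketch with toy data; na.action = na.exclude is an alternative if one wants fitted values padded back to the full length:

```r
# Sketch with toy data: keeping fitted values aligned with the data frame
d <- data.frame(y = c(1, 2, NA, 4, 5),
                x = c(1, NA, 3, 4, 5))
# Option 1: drop incomplete rows first, then fit -- lengths always match
d_cc <- d[complete.cases(d), ]
fit1 <- lm(y ~ x, data = d_cc)
length(fitted(fit1)) == nrow(d_cc)              # TRUE
# Option 2: keep all rows, but fit with na.action = na.exclude so that
# fitted()/predict() are NA-padded back to the original number of rows
fit2 <- lm(y ~ x, data = d, na.action = na.exclude)
length(fitted(fit2)) == nrow(d)                 # TRUE
```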

Quote:I know. But it's precisely why I did it that way. Concerning complete.cases() function. I'm aware of it. But how can you use it when you do this :
yhat<- 5.21 + 1.35*x1 + -.0051*x2 + -.00166*x3 + .000017*x4 + .01697*x5 + .0002868*x6 + .0000104*x7
Again, I don't know.

You don't use it when doing things like that, because the above is a very poor way of doing it. As I said, if you want to calculate the predicted values manually as above, then you should use the $coefficients attribute rather than retyping rounded numbers.
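For instance (toy data, illustrative names):

```r
# Toy sketch: build Y_hat from the stored coefficients instead of retyping
# rounded numbers from the printed output
set.seed(7)
x1  <- rnorm(50)
x2  <- rnorm(50)
y   <- 1 + 2 * x1 - 3 * x2 + rnorm(50)
fit <- lm(y ~ x1 + x2)
b   <- coef(fit)                               # same values as fit$coefficients
yhat_manual <- b[1] + b[2] * x1 + b[3] * x2
# matches the software's own predictions up to floating-point error
all.equal(unname(yhat_manual), unname(fitted(fit)))   # TRUE
```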

In any case,

There are many ways to handle missing values. One simple way is just to make a new data.frame (df) and keep only the variables you want.

E.g.
Code:
# this makes a new data.frame with the 5 variables of interest
df.complete.data = data.frame(myVariable1, myVariable2, myVariable3, myVariable4, myVariable5)
# retain only cases with full data (imputation would be preferable)
df.complete.data = df.complete.data[complete.cases(df.complete.data), ]
# now run analyses on it

Perhaps the easiest way is to use the subset function (see below).

Quote:No. I repeat again you don't understand. And you probably never tried it before. I said that predict() or fitted() will never give you an actual variable. Just do this.

Of course I tried it. I tested it prior to telling you. :)

Here's proof that it does return values as I say:

[Image: fitted.png]

One can plot them vs. the predicted data too if one makes sure there are no missing values:

Code:
#predict IQs using all ancestry
#make subset of relevant vars (Population is needed for the weights)
test.df.for.mh = subset(DF.supermega, select=c(LV2012estimatedIQ,AmergenomeFuerst2014,AfrgenomeFuerst2014,Population))
test.df.for.mh = test.df.for.mh[complete.cases(test.df.for.mh),] #keep only complete cases
#fit model on the complete-case subset so fitted values line up with its rows
lm.IQ12.ancestry.wt = lm(LV2012estimatedIQ ~ AmergenomeFuerst2014+AfrgenomeFuerst2014, test.df.for.mh, weights=sqrt(Population))
lm.IQ12.ancestry.wt.summary = summary(lm.IQ12.ancestry.wt) #summary
lm.beta(lm.IQ12.ancestry.wt) #std betas, = -.54 Amer and -1.23 Afro
test.df.for.mh["fitted"] = predict(lm.IQ12.ancestry.wt) #fitted values
main.title = paste0("Predicted national IQ based on genomic ancestry for the Americas\nAdj.r is ",round(sqrt(lm.IQ12.ancestry.wt.summary$adj.r.squared),2),", n=",nrow(test.df.for.mh),", weighted by sqrt(pop)")
scatterplot(LV2012estimatedIQ ~ fitted, test.df.for.mh,
            smoother=FALSE,
            labels=rownames(test.df.for.mh),id.n=nrow(test.df.for.mh),
            main = main.title, xlab ="Predicted IQ",ylab="National IQ")

[Image: plot_for_mh.png]
(correct, it is actually sqrt(pop))
(2014-Dec-15, 05:17:03)Emil Wrote: [ -> ]The reason W is below .99 in this case is the small sample size. In my tests, N = 5k in all cases; in the example above, n = 20. If you use rnorm with a larger sample, the W metric behaves as in my simulations. With small N's, one can of course get strange results.

In your case, N is very large, so these small-scale simulations are not relevant.

Your main claim was that anything below .99 is not normal. You did not specify any condition; now you specify one: that N should not be small. Either way, this means your first statement was simply wrong.

Anyway, I tried this on the GSS data:

Code:
keep if born==2
keep if age<70
histogram age
swilk age

histogram cohort
swilk cohort

The sample sizes are large and the histograms show normal distributions, but the W values are all below 0.99: age has 0.970 and cohort has 0.984.

I could also have pointed out another problem with your W > 0.99 criterion: in my multilevel analysis, the W values for the level-2 residuals were around 0.94-0.95, yet the histograms showed nearly normal distributions. Same thing with the Q-Q plots; there was just a small-to-modest deviation from normality.

Quote:Just remove the observations with missing values before fitting, then you won't have this problem.

I will try that.

But first... why are you so nitpicky? I already said my own method actually works. If you don't like it, it's like saying "Meng, you use Times New Roman but I prefer Arial, so unless you switch to Arial, I must disapprove of your paper". If this were merely advice, I would thank you for it, because you are a fast learner, unlike me, and I may need help with R (sometimes). However, I hope it's not a requirement for publication, because all of your objections (except the "swilk" thing) are of the kind "change the placement of the graphs", "make the syntax look more stylish", "don't type the numbers of your parameters", etc. All of them are unreasonable requirements, in my opinion. It is, after all, just a question of taste; nothing to do with the content of the article (method, theory, etc.).

Quote:There are many of ways to handle missing values. One simple way is just to make a new data.frame (df) and only add the variables you want.

I know that. But I didn't want to use it at first. I wanted to "keep" only cases with age<70, but not to remove cases with missing values on wordsum, because they may contain some useful information (e.g., people who have a wordsum score may differ on certain variables from people who don't). I prefer to use a function like "do if". In SAS, Stata and SPSS you can easily do it.
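For what it's worth, the subset= argument is the closest R analogue I have found to Stata's "do if" (toy data; lm() here for simplicity, though I believe vglm() also accepts a subset argument):

```r
# Toy sketch: fit on a condition without deleting rows from the data frame,
# similar in spirit to Stata's "do if" (lm() here for simplicity)
d <- data.frame(wordsum = c(5, NA, 7, 8, 6),
                age     = c(25, 40, 55, 72, 30))
fit <- lm(wordsum ~ age, data = d,
          subset = age < 70 & !is.na(wordsum))
nrow(d)      # still 5: nothing was deleted
nobs(fit)    # 3: only the rows meeting the condition entered the fit
```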

Quote:Here's proof that it does return values as I say:

It works for you because you removed all the missing data beforehand. I prefer to run the regression using a statement like "do if not missing x1 x2 x3 etc".

However, you use your own data. Not mine.

Let's do it.

Code:
d1= subset(d, select=c(wordsum, bw1, age, agec, agec2, agec3, bw1agec, bw1agec2, bw1agec3, weight))
d1 = d1[complete.cases(d1),] #keep only complete cases

Now, look at your data window. How many rows?

Code:
View(d1)

Now, we do this:

Code:
library(VGAM) # vglm() function
library(lattice) # xyplot() function

still_hate_R<- vglm(wordsum ~ bw1 + agec + agec2 + agec3 + bw1agec + bw1agec2 + bw1agec3, tobit(Upper=10), data=d1, weights=d1$weight)
summary(still_hate_R)

d1$wordsumpredictedage<- 5.210807 + 1.355356*d1$bw1 - 0.0051079*d1$agec - 0.0016632*d1$agec2 + 0.000017*d1$agec3 + 0.0169752*d1$bw1agec + 0.0002868*d1$bw1agec2 + 0.0000104*d1$bw1agec3

d1$fitted = predict(still_hate_R) #fitted values

xyplot(d1$fitted ~ d1$age, data=d1, groups=bw1, pch=19, type=c("p"), col = c('red', 'blue'), grid=TRUE, ylab="Wordsum predicted by tobit", xlab="age", key=list(text=list(c("Black", "White")), points=list(pch=c(19,19), col=c("red", "blue")), columns=2))

The plot is nicer than before. But why are there all these values at zero on the Y axis, across all values of X? If you want the answer, do this:

Code:
View(d1)

Compare the columns fitted and wordsumpredictedage, then scroll down to the end: how many rows do you have? Anyway, even if you can fix that, it's not the problem here, because you have ignored my most important comment. I will not repeat it again; I don't have time for this. I will just ask you to do this:

Code:
d$educ[d$educ>20] <- NA

d11 = subset(d, select=c(wordsum, bw1, age, agec, agec2, agec3, bw1agec, bw1agec2, bw1agec3, educ, weight))
d11 = d11[complete.cases(d11),] #keep only complete cases

R_go_to_hell<- vglm(wordsum ~ bw1 + agec + agec2 + agec3 + bw1agec + bw1agec2 + bw1agec3 + educ, tobit(Upper=10), data=d11, weights=d11$weight)
summary(R_go_to_hell)

d11$fitted = predict(R_go_to_hell) #fitted values

xyplot(d11$fitted ~ d11$age, data=d11, groups=bw1, pch=19, type=c("p"), col = c('red', 'blue'), grid=TRUE, ylab="Wordsum predicted by tobit", xlab="age", key=list(text=list(c("Black", "White")), points=list(pch=c(19,19), col=c("red", "blue")), columns=2))

Like I said: a cloud of points. And again, with data points at Y=0 along the X axis? So use View(d11) again and look at the values of the predicted Y. But whatever; the scatter plot illustrates what I'm saying. My method is not wrong, you just don't understand how Y_hat is computed when requested directly from the software: it takes all the variables, but as I already said, the control variables should not be taken into account in computing Y_hat.
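One plausible explanation for the extra points, assuming VGAM's tobit() works as documented with two linear predictors (the mean and the log of the standard deviation): predict() on such a fit returns a two-column matrix, so assigning it to a single column mixes the log-sd predictor in with the predicted means. A toy sketch:

```r
# Toy sketch: predict() on a vglm tobit fit returns one column per
# linear predictor (assumes the VGAM package is installed)
library(VGAM)   # vglm() and tobit()
set.seed(3)
x   <- rnorm(200)
y   <- pmin(1 + 0.5 * x + rnorm(200), 2)   # toy outcome, censored above at 2
fit <- vglm(y ~ x, tobit(Upper = 2))
lp  <- predict(fit)      # two-column matrix: mu and the log-sd predictor
mu  <- lp[, 1]           # keep only the predicted mean for plotting
```

If so, predict(...)[, 1] (or fitted()) would give a single set of predicted means to plot.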