2016-Aug-06, 10:26:14

Approved

Perhaps I missed something, but this paper seems to me to be a continuation of your current research, i.e., correlations between g and social variables. Thus I'm somewhat disconcerted by the title of the paper and Section 1 (Introduction), which is only about data sharing. Why is there nothing about your previous research on g and social variables and your S factor? Even in the discussion section, there is still no word about your previous research. So, what is the purpose of the paper? To illustrate the importance of data sharing? Or to show that this data has some use for psychological research (which is debatable considering its limitations)?

Quote:To examine the effect of using only a smaller number of items to increase the sample with complete data, we also created tests with 2-13 items.

I'm skeptical. I wouldn't qualify as a test, one test composed of only 2-4 items. I think you should write about it in the limitation section.

Quote:It can be seen that there is strong stability of estimates across different test compositions

Stability means that the correlation between a variable measured at time1 and the same variable measured at time2 is high.

Quote:The difference between the most and least religious is -.67 d

That would be much clearer if you write "between the most and least religious groups".

Quote:We see a linear negative relationship between the rated importance of religion/God in life and cognitive ability.

Even if the graph (and some of the following ones) suggests this conclusion, I won't use such wording "linear relationship" when the variable is not nominal. If you have a 3-category variable, 1 "no", 2 "neither", 3 "yes", a line that looks linear shouldn't be qualified as a linear relationship in my opinion.

Quote:As expected, we see that people willing to help out more had higher cognitive ability. It's not possible to calculate the latent correlation because the order of the options is not clear: is time or money the greater sacrifice?

What do you mean ?

Quote:it is possible that there are effects of time of birth in the year

"time of birth in the year" ?

I will leave another comment later, I think, because I don't understand something about section 5.3. Which is, the use of p-values...

By the way, can you explain what the null hypothesis is about ? NH of what ? Specify it in your paper, also. It helps to clarify things.

2016-Sep-04, 04:21:40

Hi Meng,

Thanks for reviewing. The quotes below are from you unless otherwise specified.

Quote:Perhaps I missed something, but this paper seems to me to be a continuation of your current research, i.e., correlations between g and social variables. Thus I'm somewhat disconcerted by the title of the paper and Section 1 (Introduction), which is only about data sharing. Why is there nothing about your previous research on g and social variables and your S factor? Even in the discussion section, there is still no word about your previous research. So, what is the purpose of the paper? To illustrate the importance of data sharing? Or to show that this data has some use for psychological research (which is debatable considering its limitations)?

The paper is about presenting a new dataset. This is why the introduction mentions this topic, the title is about this and we don't cite any of the research related to S factor, no S factor analysis was carried out, nor were any of the typical socioeconomic data analyzed (such as education or income or criminality).

The analyses presented in the paper are only presented to showcase what kind of analyses one can do with the dataset and show that one finds known results when doing so (successful calibration).

I don't understand how this was not clear to you. Let me know if you have any suggestions for how to make this more clear if you think it should be.

Quote:I'm skeptical. I wouldn't qualify as a test, one test composed of only 2-4 items. I think you should write about it in the limitation section.

There are 14 useable questions that can be used as items in a test. The matrix shows the intercorrelations between using tests with different numbers of these items. The trade-off is that using more of the questions results in more missing data but also more precise measurement. IRT is able to estimate scores for persons with missing data, so the trade-off is less grave than it would have been if one had used a method that required full data (such as ordinary factor analysis). If you are interested in the items, you can find them in the supplementary materials (data/test_items.csv).

I added a paragraph in the limitations section about the items:

The cognitive ability data is limited to about 14 items with a sufficient amount of data. This necessarily limits the reliability of the measurement. Furthermore, as far as we know, these items have not been validated against known test batteries or used in any other studies.

Let me know if this is satisfactory to you.

Quote:That would be much clearer if you write "between the most and least religious groups".

I have added groups.

Quote:Even if the graph (and some of the following ones) suggests this conclusion, I won't use such wording "linear relationship" when the variable is not nominal. If you have a 3-category variable, 1 "no", 2 "neither", 3 "yes", a line that looks linear shouldn't be qualified as a linear relationship in my opinion.

I think you meant to say continuous. I think it's alright to say it's linear if the scale is a Likert or similar which is plausibly interpreted as being close to interval. I think this is the case for the analyses we present. For instance, I think the 4 point scale in Figure 6 is pretty plausibly interpreted as being interval scale or close to:

- Extremely important

- Somewhat important

- Not very important

- Not at all important

Note that a violation of interval scale would be unlikely to result in a linear relationship as seen. It's easier to make a relationship non-linear than linear.

Furthermore, note that the analysis in Figure 8 does not display a linear relationship despite using the same answer options. Thus, it's possible to get both linear and non-linear looking results with these answer options.

Quote:What do you mean ?

To calculate a correlation, one must be able to rank the possible values. However, how should one rank the answers "I would donate time" and "I would donate money"? It's not clear which one is the greatest sacrifice.
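A minimal sketch of why the rank ordering matters, using made-up answers and scores (not values from the OKCupid dataset). The same data give different correlations depending on whether "time" or "money" is coded as the greater sacrifice. The `pearson` helper and both codings are illustrative assumptions; the paper refers to a latent correlation, which is not computed here.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length numeric lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Hypothetical respondents: answer to the donation question and an ability score.
answers = ["none", "time", "money", "none", "time", "money", "time", "money"]
scores = [95, 104, 99, 93, 106, 101, 103, 100]

# Two equally defensible codings of the same nominal answers.
time_above_money = {"none": 0, "money": 1, "time": 2}
money_above_time = {"none": 0, "time": 1, "money": 2}

r_time_top = pearson([time_above_money[a] for a in answers], scores)
r_money_top = pearson([money_above_time[a] for a in answers], scores)
print(round(r_time_top, 2), round(r_money_top, 2))
```

With these invented numbers the first coding yields the larger correlation, but nothing in the answer options themselves says which coding is the right one.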

I note that one should probably reorder the groups on the plot so that the None-answer is on the left. This was already done for the plots found in the supplementary materials, but the figure in the paper was not updated. This has been done now.

Quote:"time of birth in the year" ?

Effects of when time of birth falls within a year, e.g. January vs. February. The last clause is necessary because otherwise one might think it includes the difference between being born in 1962 vs. 1970, a cohort or age effect.

Quote:I will leave another comment later, I think, because I don't understand something about section 5.3. Which is, the use of p-values...

By the way, can you explain what the null hypothesis is about ? NH of what ? Specify it in your paper, also. It helps to clarify things.

What do you not understand about it?

The null hypothesis for a chi square test of this kind is that the samples come from populations with the same category proportions, so it seems redundant to specify it explicitly. However, because you requested it, I have done it. The text now reads:

It is possible to do a large-scale test of astrology using the OKCupid dataset by examining whether Zodiac sign is related to every question in the dataset. Zodiac sign is arguably a nominal variable and the questions are either ordinal (possibly interval-like) or nominal. Thus, to use all the questions, a test that can handle nominal x nominal variables was needed. We settled on using the standard chi square test because the goal was to look for any signal at all, not estimate effect sizes. This is a strong test because it is possible that there are effects of time of birth within a given year which are unrelated to Zodiac sign. For instance, being born in summer may be related to which kind of activities one takes part in at age 3 due to limitations of the weather, and the experiences from these activities may have a causal impact on one’s later personality.

To clarify, the null hypothesis tested by the chi square test here is that the answers have the same frequency for all the 12 Zodiac populations. Figure 11 shows a density-histogram of the p-values.
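As a rough illustration of this null hypothesis, the sketch below computes the Pearson chi square statistic for one hypothetical question with three answer options across 12 signs. The counts are invented, and the closed-form tail probability is valid only for an even number of degrees of freedom (here 11 x 2 = 22). It is not the code used for the paper.

```python
import math

def chi_square_stat(table):
    """Pearson chi square statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def chi_square_p_even_df(stat, df):
    """Upper-tail probability of a chi square variable (closed form, even df only)."""
    if df % 2 != 0:
        raise ValueError("this closed form needs an even number of degrees of freedom")
    half = stat / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (stat/2)^i / i!
        total += term
    return math.exp(-half) * total

# Hypothetical counts: 12 signs (rows) x 3 answer options (columns),
# with exactly the same answer frequencies in every sign.
same_everywhere = [[30, 50, 20] for _ in range(12)]
stat, df = chi_square_stat(same_everywhere)
p_value = chi_square_p_even_df(stat, df)
print(stat, df, p_value)  # 0.0 22 1.0
```

When the answer frequencies are identical across signs, the statistic is zero and the p-value is 1; any departure from equal frequencies increases the statistic and lowers the p-value.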

Let me know if this is satisfactory.

---

I noted that there was some odd whitespace on page 9. I have fixed this.

I have added page numbers.

--

A new version will be uploaded shortly.

2016-Sep-06, 23:36:07

I also added the names of the reviewers (Piffer, Hu).

The files have been updated.

2016-Sep-09, 04:40:25

(2016-Sep-04, 04:21:40)Emil Wrote: To calculate a correlation, one must be able to rank the possible values. However, how should one rank the answers "I would donate time" and "I would donate money"? It's not clear which one is the greatest sacrifice.

Ok, I understand a little bit more now. Maybe try to describe the problem another way : "order of the options" wasn't clear enough. But something like, say, "rank ordering of the answers" is better. I think the problem with this variable is that it is just not a linear/continuous one.

Quote:Effects of when time of birth falls within a year, e.g. January vs. February. The last clause is necessary because otherwise one might think it includes the difference between being born in 1962 vs. 1970, a cohort or age effect.

Understood. What about "time of birth within a year" ?

Quote:What do you not understand about it?

I thought you would know pretty well my opinion on this, given how many times I said it in the past. P-value is a mixture of sample size and effect size, thus it adds nothing at all above what information is provided by both sample size and effect size. If your research is about "examining whether Zodiac sign is related to every question in the dataset", i.e., "yes" or "no" there is a relationship, then p-value is no more informative than an effect size. And the effect size doesn't have the problem of the p-value, which depends on the sample size. Effect size and p-values can provide different answers sometimes. But I think you should already know that.
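This dependence on sample size can be illustrated with a small sketch. The 2 x 2 tables are hypothetical, and Cramér's V is used as the effect size purely as an assumption for the example. Multiplying every count by 50 leaves V unchanged while the chi square statistic and its p-value change drastically.

```python
import math

def chi_square_stat(table):
    """Pearson chi square statistic for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def cramers_v(table):
    """Cramér's V: chi square rescaled by sample size and table dimensions."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi_square_stat(table) / (n * k))

def p_value_df1(stat):
    """Upper-tail probability of a chi square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical 2 x 2 tables with identical proportions but different sample sizes.
small = [[12, 8], [8, 12]]                                  # n = 40
large = [[count * 50 for count in row] for row in small]    # n = 2000

for name, table in (("small", small), ("large", large)):
    stat = chi_square_stat(table)
    print(name, round(cramers_v(table), 3), round(stat, 1), p_value_df1(stat))
```

Both tables have V = 0.2, yet only the larger one falls below the conventional 0.05 threshold.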

I don't see why people continue to rely on the p-values (whatever the research and studied questions are). That's totally useless. And of course, I strongly disagree with your following statement : "This (i.e., the significance test) is a stronger test because it is possible that there are effects of time of birth in the year which would be unrelated to Zodiac sign".

There are some comments I didn't answer, but that's because I don't have much to say (e.g., no objection or equivocal).

Quote:The correlation matrix can found in the supplementary materials

can be found

2016-Sep-13, 19:44:13

Hi MH,

(2016-Sep-09, 04:40:25)Meng Hu Wrote: Ok, I understand a little bit more now. Maybe try to describe the problem another way : "order of the options" wasn't clear enough. But something like, say, "rank ordering of the answers" is better. I think the problem with this variable is that it is just not a linear/continuous one.

I have changed it to:

It's not possible to calculate the latent correlation because the rank ordering of the answers is not clear: is time or money the greater sacrifice?

(2016-Sep-09, 04:40:25)Meng Hu Wrote: Understood. What about "time of birth within a year" ?

I have changed it to:

This is a strong test because it is possible that there are effects of time of birth within a given year (e.g. spring vs. summer) which are unrelated to Zodiac sign. For instance, being born in summer may be related to which kind of activities one takes part in at age 3 due to limitations of the weather, and the experiences from these activities may have a causal impact on one’s later personality (for a possible example of something of this sort, see Gobet and Chassy (2008)).

(2016-Sep-09, 04:40:25)Meng Hu Wrote: I thought you would know pretty well my opinion on this, given how many times I said it in the past. P-value is a mixture of sample size and effect size, thus it adds nothing at all above what information is provided by both sample size and effect size. If your research is about "examining whether Zodiac sign is related to every question in the dataset", i.e., "yes" or "no" there is a relationship, then p-value is no more informative than an effect size. And the effect size doesn't have the problem of the p-value, which depends on the sample size. Effect size and p-values can provide different answers sometimes. But I think you should already know that.

I don't see why people continue to rely on the p-values (whatever the research and studied questions are). That's totally useless. And of course, I strongly disagree with your following statement : "This (i.e., the significance test) is a stronger test because it is possible that there are effects of time of birth in the year which would be unrelated to Zodiac sign".

There are some comments I didn't answer, but that's because I don't have much to say (e.g., no objection or equivocal).

While in general I dislike the use of NHST, I think this is a case where they are used well. We are not trying to measure the effect size of Zodiac sign, we are trying to test the null hypothesis that there are no effects at all. Such a test when carried out on many variables leads to a very clear prediction about what the distribution of p values should look like, i.e. uniform. This is also the observed distribution to a close approximation.
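The prediction of a uniform p-value distribution under the global null can be checked with a small simulation. This is a sketch under stated assumptions, not the analysis from the paper: signs and answers are generated independently at random, every simulated question has three answer options (so that the degrees of freedom, 22, are even and a closed-form tail probability applies), and the sample size of 1,200 is arbitrary.

```python
import math
import random

def chi_square_p_value(signs, answers, n_signs, n_options):
    """Chi square test of independence; closed-form tail, so df must be even."""
    n = len(signs)
    table = [[0] * n_options for _ in range(n_signs)]
    for s, a in zip(signs, answers):
        table[s][a] += 1
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i in range(n_signs):
        for j in range(n_options):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    df = (n_signs - 1) * (n_options - 1)
    half, term, total = stat / 2.0, 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

rng = random.Random(2016)
signs = [rng.randrange(12) for _ in range(1200)]  # hypothetical Zodiac signs

# 500 simulated questions whose answers are unrelated to sign by construction.
p_values = []
for _ in range(500):
    answers = [rng.randrange(3) for _ in signs]
    p_values.append(chi_square_p_value(signs, answers, 12, 3))

share_below_05 = sum(p < 0.05 for p in p_values) / len(p_values)
decile_counts = [0] * 10
for p in p_values:
    decile_counts[min(int(p * 10), 9)] += 1
print(share_below_05, decile_counts)
```

Under the null, roughly 5% of the p-values should fall below 0.05 and the ten decile counts should be roughly equal; a real signal would show up as an excess of small p-values.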

As mentioned in the text, since we are dealing with nominal x nominal variables, it is not easy to calculate an effect size. Most effect sizes require that one can rank order the options, which one by definition cannot do with nominal data.

How would you test the null hypothesis here? One complicated idea is to find some kind of effect size that works, calculate it for all the questions and note some summary statistics about the distribution of effects. Then simulate null hypothesis data many times with the same sample sizes and calculate summary statistics of these distributions. Then finally, compare the summary statistics of the real data with those from simulated null data. This would provide just about the same evidence as the current method used I think.
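The simulation idea described above could be sketched roughly as follows. Several choices here are assumptions made only for illustration: Cramér's V as the effect size, its mean across questions as the summary statistic, shuffling the sign labels as the way to generate null data, and randomly generated answers with no missing values in place of the real dataset.

```python
import math
import random
from collections import Counter

def cramers_v(signs, answers):
    """Cramér's V for two equal-length lists of nominal labels."""
    n = len(signs)
    sign_counts = Counter(signs)
    answer_counts = Counter(answers)
    cell_counts = Counter(zip(signs, answers))
    stat = 0.0
    for s, s_n in sign_counts.items():
        for a, a_n in answer_counts.items():
            expected = s_n * a_n / n
            stat += (cell_counts[(s, a)] - expected) ** 2 / expected
    k = min(len(sign_counts), len(answer_counts)) - 1
    return math.sqrt(stat / (n * k))

def mean_v(signs, questions):
    """Summary statistic: mean Cramér's V across all questions."""
    return sum(cramers_v(signs, q) for q in questions) / len(questions)

def null_distribution(signs, questions, n_sims, rng):
    """Mean V after shuffling sign labels, which breaks any sign-answer link."""
    shuffled = list(signs)
    sims = []
    for _ in range(n_sims):
        rng.shuffle(shuffled)
        sims.append(mean_v(shuffled, questions))
    return sims

rng = random.Random(1)
signs = [rng.randrange(12) for _ in range(600)]                    # hypothetical signs
questions = [[rng.randrange(k) for _ in signs] for k in (2, 3, 4)]  # 2-4 answer options

observed = mean_v(signs, questions)
sims = null_distribution(signs, questions, 200, rng)
share_as_large = sum(v >= observed for v in sims) / len(sims)
print(round(observed, 3), share_as_large)
```

The final share plays the same role as a p-value for the global null, which is in line with the remark above that this route would provide about the same evidence as the p-value distribution.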

(2016-Sep-09, 04:40:25)Meng Hu Wrote: can be found

Fixed.

---

The files were updated.

2016-Sep-14, 20:20:49

Ok with your changes.

(2016-Sep-13, 19:44:13)Emil Wrote: While in general I dislike the use of NHST, I think this is a case where they are used well. We are not trying to measure the effect size of Zodiac sign, we are trying to test the null hypothesis that there are no effects at all.

As mentioned in the text, since we are dealing with nominal x nominal variables, it is not easy to calculate an effect size. Most effect sizes require that one can rank order the options, which one by definition cannot do with nominal data.

How would you test the null hypothesis here? One complicated idea is to find some kind of effect size that works, calculate it for all the questions and note some summary statistics about the distribution of effects. Then simulate null hypothesis data many times with the same sample sizes and calculate summary statistics of these distributions. Then finally, compare the summary statistics of the real data with those from simulated null data. This would provide just about the same evidence as the current method used I think.

Concerning the above, I said that I know you don't care about effect size, but remember: the p-value is a mixture of sample size and effect size. Also, as I've said, p-values can lead to different conclusions than those produced by effect sizes. For instance, in my most recent research, on MGCFA testing of Spearman's Hypothesis and internal bias, p-values always show significant changes, while indices such as RMSEA, Mc, and CFI don't. It's not possible that p-values can be reliable.

How would I test the null hypothesis? It depends on which principle it relies upon. If it requires the use of the p-value, and its corresponding "higher than 0.05 being not significant", the NH is not even worth testing. The p-value has never been reliable. And if you're interested in the distribution of p-values, why, again, would it be more useful than calculating the distribution of effect sizes? After all, effect size is a component of the p-value.

Also, if you think getting effect sizes such as a correlation or d for categorical data is a little bit problematic, you should know there are other types of effect sizes, such as the odds ratio, which is appropriate for categorical data. Instead of measuring the strength of the relationship with a correlation (r), you measure the odds of answering #2 as opposed to #1. In any case, the p-value is not a better alternative to an effect size such as r or d. If r and d are both inappropriate for categorical data, the p-values resulting from r and d estimates should always be wrong as well.

I can't think of a better way to test whether Zodiac sign has any predictive validity for this dataset across all the questions, than to look at the p-curve. If you can think of one, then please try yours yourself and report back what you find. The data are public. I don't think there is anything statistically wrong with the present analysis.

Odds ratios do not work well for nominal x nominal data with >2 levels for both variables. For instance, for the questions with 4 answer options, this results in 4 x 12 probabilities being calculated. The data are also problematically hierarchical when analyzed this way because the questions have varying numbers of answers (2-4), which results in different numbers of probabilities: 2 x 12, 3 x 12, 4 x 12. One will have to aggregate within question before aggregating across questions. More complications for little gain...

Could you suggest changes regarding this section that you think are mandatory before you would approve the paper?

2016-Sep-18, 12:52:12

You do not answer my comment here. I said that I don't see why the p-value is better than the effect size in answering the question of whether there is an effect or not, since effect size is a component of the p-value, and the p-value is biased by sample size.

Concerning ORs, I don't know what you mean by "x 12 probabilities". What I was thinking, looking at your article, is that since the IQ variable is continuous, you can use OLS with IQ as the dependent variable and the various categorical independent variables entered as dummy variables. If a categorical variable has 4 categories, you end up with 3 independent variables to enter in your regression equation (if answer #1 is the reference category, you'll get dummy variables #1vs#2, #1vs#3, #1vs#4).
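A minimal sketch of this dummy coding, with invented scores and a single hypothetical 4-category question. For one dummy-coded predictor, the OLS solution is the reference-group mean (intercept) plus each other group's difference from it. The code computes the coefficients that way and then checks that they satisfy the OLS normal equations, rather than calling a regression library.

```python
from statistics import mean

def dummy_code(answers, levels):
    """Rows of [1, d2, d3, ...]: an intercept plus one 0/1 column per non-reference level."""
    others = levels[1:]  # levels[0] is the reference category
    return [[1] + [1 if a == level else 0 for level in others] for a in answers]

def ols_single_factor(scores, answers, levels):
    """OLS coefficients for one dummy-coded factor: reference mean, then mean differences."""
    group_means = [
        mean(s for s, a in zip(scores, answers) if a == level) for level in levels
    ]
    return [group_means[0]] + [m - group_means[0] for m in group_means[1:]]

# Hypothetical data: answer option (1-4) and an ability score per respondent.
answers = [1, 1, 2, 2, 3, 3, 4, 4]
scores = [100, 104, 98, 100, 95, 99, 90, 94]
levels = [1, 2, 3, 4]

X = dummy_code(answers, levels)
coefs = ols_single_factor(scores, answers, levels)

# Normal equations: every column of X must be orthogonal to the residuals.
residuals = [y - sum(b * x for b, x in zip(coefs, row)) for y, row in zip(scores, X)]
normal_eq = [sum(row[j] * r for row, r in zip(X, residuals)) for j in range(len(coefs))]
print(coefs, normal_eq)
```

The three slope coefficients are the #1vs#2, #1vs#3 and #1vs#4 contrasts; with several categorical predictors entered together, a full least-squares solver would be needed instead of this shortcut.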

There is nothing mandatory here. One or two years ago, I think I would have made answering (and dealing with!) my question of p-value vs. effect size mandatory before giving approval. So at this step I would have disapproved if the author didn't answer my questions. But today, I have time constraints (having many other things to do) and I don't want to make a fuss anymore about something that many other reviewers won't really care about (i.e., p-value and effect size). Furthermore, although I don't agree with you on the two issues mentioned in this post, I think there is no big, fatal flaw in the paper (even if there are obviously some errors and ways to improve the paper).

So, if you think you don't want to modify anything and think it's OK, then I give my approval.

2016-Sep-18, 15:40:40

I have answered it twice. However, let me try again. The reason to use p-values over effect sizes is that using p-values allows for a direct test of the global null hypothesis. Using effect sizes does not allow for a direct test.

Please state if you can think of any other issues, aside from the p-value one.
