[OQSPS] Name features and social status

#7
Thanks to reviewers for the detailed and constructive criticism. Many changes were made. Specific replies are given below. Files on OSF and Rpubs are updated.

Dr. g,


Quote:I wish there were more links to the literature on naming practices. There is a great deal of literature about naming practices and the information that names communicate (e.g., Abel & Kruger, 2007; Edwards & Caballero, 2008; Fryer & Levitt, 2004; Lieberson & Mikelson, 1995; Varnum & Kitayama, 2011)


Both the introduction and the discussion cite a number of other studies. I generally prefer to avoid long literature review sections in papers; the reader can consult the literature as necessary. Nevertheless, I looked at the ones you suggested:

Abel, E. L., & Kruger, M. L. (2007). Symbolic significance of initials on longevity. Perceptual and motor skills, 104(1), 179-182.

This was a false positive, see reply:

http://journals.sagepub.com/doi/10.2466/....1.211-216

Edwards, R., & Caballero, C. (2008). What's in a name? An exploration of the significance of personal naming of ‘mixed’ children for parents from different racial, ethnic and faith backgrounds. The Sociological Review, 56(1), 39-60.

This was not a quantitative study, just some interviews.

Fryer Jr, R. G., & Levitt, S. D. (2004). The causes and consequences of distinctively black names. The Quarterly Journal of Economics, 119(3), 767-805.

Was already cited in the discussion.

Lieberson, S., & Mikelson, K. S. (1995). Distinctive African American names: An experimental, historical, and linguistic analysis of innovation. American Sociological Review, 928-946.

Is somewhat similar, but was only related to inferring the sex of offspring with new/rare names among African Americans.

Varnum, M. E., & Kitayama, S. (2011). What’s in a name? Popular names are less common on frontiers. Psychological science, 22(2), 176-183.

Is interesting, but does not involve research on social status, and is based on regional data.

Quote:Page 1: “The results showed strong evidence of validity.” This sentence is vague. Do you mean the results showed strong evidence that linguistic characteristics of names correlated with socioeconomic indicators?

I meant to refer to the findings from the t-tests, which were an initial test for signal in the data. I have amended the sentence to “The results showed strong evidence of signal in the data.”

Quote:I’m confused by the claim that one can infer approximate ancestry from first names (page 2). Please provide some detail (e.g., percentage of non-immigrants who have a name matching their ethnicity) and a citation. (Perhaps I’m skeptical because none of my children have names matching my or my spouse’s predominant ethnicities.)

Non-immigrants only have one ethnicity (Danish), so one cannot supply such data. The point is that one can look up a given name in a database of first names and see where it comes from. We used behindthename.com. For instance, if we look up my name (https://www.behindthename.com/name/emil), we can see it is tagged as “Swedish, Norwegian, Danish, German, Romanian, Bulgarian, Czech, Slovak, Polish, Russian, Slovene, Serbian, Croatian, Macedonian, Hungarian, Icelandic, English”. I guess they are in order of strength of association, and we can see that the 3 Scandinavian countries come first on the list. Based on my personal impressions, I’d say they are also right that the primary origin is likely Sweden. If you check your name (https://www.behindthename.com/name/russell), it is just given as “English”, which is not so informative considering the various countries that trace their population to English settlers (including the USA, of course).

The results in Figure 5 suggest that this estimation method is highly accurate because the estimates of social status are highly congruent with the known ones from official data (r = .72).

Quote:Page 2: The author states that income, criminal convictions, house ownership, and unemployment are all positively correlated. Shouldn’t criminal convictions and unemployment be negatively correlated with the other two variables?

Good catch. I added the clause “when negative outcomes were reversed”.

Quote:The biggest problem with the manuscript is that it does not make it clear to a non-expert how the name scoring procedure resulted in a variable value for a name. For example, was a point value for assigned to each pattern and then these were summed for each name? Did the process create a unique score for each name? Please add a few sentences to clarify this point.

You are right an example should have been given. I have added the following example:

“For instance, the name Peter would be scored as having the following n-grams: p, e, t, r, pe, et, te, er, pet, ete, ter, as well as their initial and ending variants. It would furthermore have a vowel fraction of 2/5, stop sound fraction of 2/5, nasal sound fraction of 0, and be negative for presence of a dash. All the other features would be negative. Thus, each name has 1,099 features associated with it, of which 1,995 are binary, and 4 are numeric.”
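As a rough illustration of the scoring in the quoted example, here is a minimal Python sketch (not the code used for the paper). The underscore notation for initial variants follows the `_abd` style used elsewhere in this thread; the trailing underscore for ending variants, the function names, and the letter sets for stop and nasal sounds are assumptions made for illustration.

```python
# Minimal sketch of the name scoring described above (not the paper's code).
VOWELS = set("aeiouæøå")  # vowel set quoted elsewhere in this thread
STOPS = set("ptkbdg")     # assumption: letters counted as stop sounds
NASALS = set("mn")        # assumption: letters counted as nasal sounds


def ngram_features(name, max_n=3):
    """Return all 1- to max_n-grams plus initial ("_x") and ending ("x_") variants."""
    s = name.lower()
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(s) - n + 1):
            g = s[i:i + n]
            grams.add(g)
            if i == 0:
                grams.add("_" + g)  # n-gram at the start of the name
            if i + n == len(s):
                grams.add(g + "_")  # n-gram at the end of the name (assumed notation)
    return grams


def summary_features(name):
    """Return the sound fractions and the dash indicator for one name."""
    s = name.lower()
    return {
        "vowel_fraction": sum(c in VOWELS for c in s) / len(s),
        "stop_fraction": sum(c in STOPS for c in s) / len(s),
        "nasal_fraction": sum(c in NASALS for c in s) / len(s),
        "has_dash": "-" in s,
    }


peter = ngram_features("Peter")
plain = sorted((g for g in peter if "_" not in g), key=lambda g: (len(g), g))
print(plain)
print(summary_features("Peter"))
```

In a full feature matrix, each n-gram seen in any name would become one binary column (present or absent for a given name), with the fractions kept as numeric columns.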

Quote:Don’t say that a p-value distribution is uniform “by chance” (p. 4). This is vague. Say that this is the expected distribution of p-values if the null hypothesis were perfectly true. I suggest making a similar change on p. 13.

I have added “(i.e. if the null hypothesis of no signal is true)” to each.
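The expected uniform distribution under the null can be checked with a small simulation. The sketch below is only an illustration: it uses a z-test with known standard deviation rather than the t-tests from the paper, and the sample sizes and seed are arbitrary choices.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_null_pvalues(n_tests=10_000, n_per_test=30, seed=1):
    """Simulate p-values from z-tests where the null (mean = 0) is true."""
    rng = random.Random(seed)
    pvals = []
    for _ in range(n_tests):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_per_test)]
        # z statistic for the sample mean, using the known SD of 1
        z = (sum(sample) / n_per_test) * math.sqrt(n_per_test)
        pvals.append(two_sided_p(z))
    return pvals


pvals = simulate_null_pvalues()

# Count p-values in ten equal-width bins; under the null each bin
# should hold roughly one tenth of the tests.
bins = [0] * 10
for p in pvals:
    bins[min(int(p * 10), 9)] += 1
print(bins)
```

A clear excess of p-values in the lowest bin, relative to this flat pattern, is what would indicate signal in the data.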

Quote:It’s not clear that, “This can be inferred because only rare names would tend to produce very large effect sizes” (p. 4). What is your logic here? Please explain.

The rare names would have rare linguistic patterns because they belong to small immigrant populations. Generally, large effect sizes would tend to involve rare patterns since it is difficult to get a large effect size if the number of persons with such a name in the population is large. The exact reasoning here is somewhat too long for me to elaborate on in the section, as it is a side remark. If desired, I could produce a long footnote with it.

Quote:For Table 1, add two columns that would let your reader know the percentage of adults in Denmark with each of these first names. That would help provide some context. (Ignore this suggestion if this information is not available.)

Data for adults only are not available as far as I know, but I have added the number of persons with each name in 2012. The Danish population was about 5.6 million at that time, so the number of males was about 2.75 million. Every name is therefore only a tiny fraction of the total; the most popular one (7279) was about 0.26%.

Quote:Adding a mean, SD, and median S factor score for Danish and non-Danish names (perhaps in the caption of Figure 3) would be helpful.

Added “The mean/sd for Danish names was 0.37/0.72, and for non-Danish -0.81/1.04.” to the caption of Figure 4.

Quote:Please provide evidence or a citation that high status secular Turks would give their children Turkish names instead of Muslim names. (Sounds plausible, but some supporting evidence is needed.)

This was meant as a hypothesis, not established fact. I searched a bit, and one can find literature in this direction, though not entirely satisfactory: https://www.tandfonline.com/doi/full/10....mobileUi=0 https://www.cairn-int.info/resume.php?ID...C_602_0018

I have amended the sentence to "e.g. high status, secular Turks might give their children Turkish names while most others give their children Muslim names".

Quote:Yes, the paragraph on pp. 11-12 stating that the skew in Figure 3 is caused by the inclusion of non-Danish names is almost certainly true. Figure 4 indicates that three foreign groups of names in particular (Arabic, Polish, and Turkish) scored MUCH lower on the S factor—on average—than the overall mean. This might be worth mentioning.

Not sure exactly what you are suggesting I add. Figure 5 shows the average status of the origin groups whether identified by name or by official data.

Quote:Correct a few instances of awkward writing, vague language, or grammatical errors:
Eliminate the use of “we” in a general sense (p. 2).

I replaced the plural first person with singular first person. The paper originally had multiple authors and the text reflected that fact. It now has the somewhat unusual singular seen in economics papers.

Quote:The second-to-last sentence on the first paragraph on p. 2 is very hard to understand.

I have split up the long sentence on page 2.

Quote:The brackets at the end of p. 2 are confusing. At first, I thought that “aeiouæøå” was an example of a “fraction vowel.”

I have added separating commas to the brackets. The bracket notation is standard in linguistics, but I see why it can be confusing to readers without a background in that field (as I have).

Quote:Whose expectations are you referring to on p. 7?

Expectations readers would have based on reading other material about immigrant groups in Denmark, origin countries' well-being, and the world at large. Generally speaking, Northwest Europeans do best, then other Europeans, then various non-Western groups except for a few Asian countries. I have amended the sentence to “The relative ranking of the top 10 origin groups corresponds fairly well to expectations based on the origin countries’ well-being (Kirkegaard, 2014).”

Quote:Page 8: You use the term “significant.” Please specify whether this is statistical, practical, or clinical significance. (Never use “significant” alone.)

Good catch on the word significant. I generally avoid using it altogether because of these confusions. I have changed it to “substantial”.

Quote:The phrase “we subset” is awkward because (1) there is only 1 author, and (2) I’m not sure “subset” is a verb.

Subset is a verb too (https://en.wiktionary.org/wiki/subset#Verb). This usage is somewhat unusual, but not at all without precedent, e.g.: http://www2.sas.com/proceedings/sugi22/C...APER79.PDF https://www.r-bloggers.com/5-ways-to-sub...rame-in-r/ http://adv-r.had.co.nz/Subsetting.html

Jose,


Quote:Given that the feature are 3-grams, it would be interesting to see the coefficients on each, to try to pick out the ones most (independently) associated with S. I guess you will probably see "Abd". This would be helpful insofar as there are language-typical letter triplets, which may help pinpoint language-level associations, in addition to the manual testing done with the behindthename data.

The features are also 1- and 2-grams. There are too many of them to list them in the paper, but one can of course inspect the strongest features from the t-tests. As you guessed, the strongest feature, or rather set of features, are 3 equivalent ones that tag the same set of 7 names (bd, abd, _abd) with a d value of -2.44 (p = 2.35e-5). The most positive feature is _lau (d = 1.70, p = 9.17e-5) as this tags 7 high ranking names, including #1 and #6 seen in Table 1. I did save the results to files, but only in RDS format. I have added XLSX versions as well for the non-R readers.
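For readers unfamiliar with the d values above, here is a minimal sketch of how a standardized mean difference for one binary feature could be computed. It assumes the pooled-SD form of Cohen's d, which may differ from the exact formula used in the analysis, and the S scores are made up for illustration.

```python
from statistics import mean, variance


def cohens_d(with_feature, without_feature):
    """Standardized mean difference using the pooled sample SD (assumed form)."""
    n1, n0 = len(with_feature), len(without_feature)
    pooled_var = (
        (n1 - 1) * variance(with_feature) + (n0 - 1) * variance(without_feature)
    ) / (n1 + n0 - 2)
    return (mean(with_feature) - mean(without_feature)) / pooled_var ** 0.5


# Hypothetical S scores: names that carry a given n-gram vs. names that do not.
tagged = [-2.1, -1.8, -2.5, -1.6]
untagged = [0.4, -0.3, 0.9, 0.1, -0.6, 0.5]
d = cohens_d(tagged, untagged)
print(round(d, 2))
```

A negative d means names with the feature score lower on S than names without it, as with the abd-type features.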

Quote:“p 8. Clarify what is meant by performance in that context”

Amended to “Because the Muslim countries generally perform poorly (correlation between Muslim% in origin country and general social status = -.63 (Kirkegaard & Fuerst, 2014)),”.

Quote:It would be interesting to show an R2 curve showing the changes in R2 with number of predictors, showing what % of features can be retained with minimal loss in accuracy. This could be done by selecting the best predictor triplet from a bivariate regression, then running N-1 regressions and picking the next one, and so on.

One could also just increase the penalty in the lasso. However, this was not an objective of the current study and would take a substantial amount of new coding to include.
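To illustrate why a larger lasso penalty retains fewer features, here is a minimal sketch for the special case of orthonormal predictors, where the lasso solution is the soft-thresholded least-squares coefficient. This is a textbook simplification, not the fitting procedure used in the paper, and the coefficient values are hypothetical.

```python
def soft_threshold(beta, lam):
    """Lasso solution for one coefficient when predictors are orthonormal."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0


def retained(ols_coefs, lam):
    """Return the features whose coefficient stays non-zero at penalty lam."""
    shrunk = {name: soft_threshold(b, lam) for name, b in ols_coefs.items()}
    return {name: b for name, b in shrunk.items() if b != 0.0}


# Hypothetical least-squares coefficients for five features.
ols_coefs = {"f1": -2.4, "f2": 1.7, "f3": 0.3, "f4": -0.1, "f5": 0.05}

for lam in (0.0, 0.2, 1.0, 2.0, 3.0):
    kept = retained(ols_coefs, lam)
    print(lam, len(kept), sorted(kept))
```

Plotting cross-validated accuracy against the number of retained features along such a penalty path would give a curve similar in spirit to the one the reviewer suggests.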

Quote:There is a slight mismatch between the alphas that are used and the ones that the paper mentions. The ones used are 1,0.325, 0.55, and 0.775. This does not affect the conclusions.

Good catch. Yes, I recoded these at some point after writing the text.

Quote:Given that you are willing to run 5k CV iterations, why not do a straight LOOCV? This would get results as good or better as 5K CVs with fewer model runs. Also, switching to K folds (With K higher than usual, say 100) would also enable faster iteration, possible allowing for an optimisation of alpha in addition to lambda. Ideally one would use bayesian optimization here, but this feels excessive. Overall, I don't expect the results to change substantially were these methods tried.

One could probably improve upon the specific scheme used in this paper in a number of ways. As you note, however, this would be very unlikely to affect the conclusions. I don’t recall exactly why CV was chosen over LOOCV, aside from the usual reasons (high variance in LOOCV).
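For reference, the difference between the schemes in terms of model fits can be sketched as follows. The number of names, the seed, and the fold counts are hypothetical; this only shows how the splits are formed, not the actual cross-validation code.

```python
import random


def kfold_splits(n, k, rng):
    """Shuffle indices 0..n-1 and return k (train, test) index pairs."""
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal-sized folds
    all_idx = set(idx)
    return [(sorted(all_idx - set(fold)), sorted(fold)) for fold in folds]


def loocv_splits(n):
    """Leave-one-out: every observation is its own test set."""
    return [([j for j in range(n) if j != i], [i]) for i in range(n)]


n_names = 200  # hypothetical number of names
rng = random.Random(1)

ten_fold = kfold_splits(n_names, 10, rng)
hundred_fold = kfold_splits(n_names, 100, rng)
loo = loocv_splits(n_names)

# Each split corresponds to one model fit.
print(len(ten_fold), len(hundred_fold), len(loo))
```

Repeating the K-fold procedure with new shuffles multiplies the number of fits accordingly, whereas leave-one-out is deterministic and needs exactly one fit per name.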



Quote:Typos:
  • "The list predictors" -> predictors
  • "The featured" -> features

Fixed.
Messages In This Thread
[OQSPS] Name features and social status - by Emil - 2016-Nov-24, 05:07:03
RE: [OQSPS] Name features and social status - by hvc - 2017-Jan-03, 14:24:57
RE: [OQSPS] Name features and social status - by Emil - 2017-Jan-09, 01:41:00
RE: [OQSPS] Name features and social status - by Emil - 2018-Sep-03, 20:50:33
RE: [OQSPS] Name features and social status - by Dr. g - 2018-Nov-01, 06:07:47
RE: [OQSPS] Name features and social status - by JoseL - 2018-Nov-03, 22:48:31
RE: [OQSPS] Name features and social status - by Emil - 2018-Nov-04, 04:47:08
RE: [OQSPS] Name features and social status - by Dr. g - 2018-Nov-08, 17:52:35
RE: [OQSPS] Name features and social status - by JoseL - 2018-Nov-10, 23:12:34
RE: [OQSPS] Name features and social status - by Emil - 2018-Nov-11, 00:10:27
RE: [OQSPS] Name features and social status - by Dr. g - 2018-Nov-18, 04:24:02
 