Data Files for "Caught in the crossfire: Fears of Chinese–American scientists" in PNAS 2023

This dataset comprises the two sets of data analyzed in the following study, namely the Asian American Scholar Forum survey data and the Microsoft Academic Graph bibliometrics data:

Yu Xie, Xihong Lin, Ju Li, Qian He, Junming Huang, Caught in the crossfire: Fears of Chinese-American scientists, Proceedings of the National Academy of Sciences, 120(27) (2023). DOI: 10.1073/pnas.2216248120

These data are available at yuxie.com and Princeton DataSpace.

Survey data:

The first part of the dataset comprises survey data collected through the Asian American Scholar Forum survey. To protect the privacy of the survey respondents, the raw survey data are confidential and are not publicly released. Researchers interested in accessing the data are encouraged to contact the authors directly for an authorized copy. Summary statistics derived from the survey data are available in the Supplementary Materials and are sufficient to replicate the results presented in the paper.

Bibliometrics data:

The second part of the dataset consists of bibliometric data obtained from the Microsoft Academic Graph, which indexed 208,440,142 scientists from 27,077 institutions authoring 205,203,354 scientific publications dated up to December 2021. The database was sourced from the publicly available snapshot retrieved from https://openalex.org/ in early 2022, after Microsoft Academic Graph was retired in December 2021.

We identified Chinese-descent scientists by their surnames. We first collected 832 common Chinese surnames from Wikipedia (https://en.wikipedia.org/wiki/List_of_common_Chinese_surnames), both in Chinese characters and in romanized form, covering Hanyu Pinyin (the system of Chinese romanization mostly used by mainland Chinese scientists) and Wade-Giles (the system mostly used by Cantonese-speaking and Taiwanese scientists). We then searched for those surnames in the authors’ full names recorded in Microsoft Academic Graph to identify Chinese-descent scientists. This method misses Chinese-descent scientists who have changed their surnames (usually women after marriage), leading to an undercount. To retain a high degree of reliability in individual identification, we removed scientists with a gap of more than 5 years between consecutive publications, which we believed were false results in which Microsoft Academic Graph’s name disambiguation algorithm incorrectly merged multiple individuals. This left 25,202 Chinese-descent scientists who published their first papers with US affiliations, later dropped their US affiliations, and subsequently published at least one paper affiliated with China.
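
The surname match and the publication-gap filter described above can be sketched as follows. This is a minimal illustration, not the authors' code: the surname set is a tiny hypothetical sample rather than the 832-entry list, the rule that the surname is the last token of the full name is an assumption, and all function names are invented for the example.

```python
# Hypothetical sample only; the study used 832 surnames collected from Wikipedia.
SAMPLE_SURNAMES = {"wang", "li", "zhang", "chen", "huang"}


def has_listed_surname(full_name, surnames=SAMPLE_SURNAMES):
    """Return True if the last token of the name is in the surname set.

    Assumption: the surname is the final whitespace-separated token.
    """
    tokens = full_name.strip().lower().split()
    return bool(tokens) and tokens[-1] in surnames


def max_publication_gap(years):
    """Largest difference, in years, between consecutive publication years."""
    ordered = sorted(set(years))
    return max((b - a for a, b in zip(ordered, ordered[1:])), default=0)


def keep_author(full_name, years, max_gap=5):
    """Keep an author whose surname matches and whose publication record
    has no gap of more than max_gap years between consecutive papers."""
    return has_listed_surname(full_name) and max_publication_gap(years) <= max_gap
```

For example, keep_author("Wei Zhang", [2001, 2003, 2007]) keeps the record, while the same name with publication years [2001, 2008] is dropped because of the 7-year gap.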

We used the Google Maps API to parse all 27,077 institution names in Microsoft Academic Graph and retrieve their country labels, which allowed us to label every Chinese-descent scientist’s working country in any publishing year. Specifically, we focused on Chinese-descent scientists leaving the US, i.e., those who were trained in the US (first paper affiliated with the US) and who subsequently moved from the US to China (i.e., stopped using US affiliations and started to use Chinese affiliations). For each such scientist, we determined the year ranges of the papers affiliated with the US and with China, and annotated the leaving year as the year of the scientist’s first subsequent paper after the most recent use of a US affiliation. This was more accurate than simply using the last year with a US affiliation, which might produce false positives by counting Chinese-descent scientists who are still based in the US.
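
The leaving-year rule above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the record layout (a list of (year, country code) pairs), the "US" label, and the function name are hypothetical, and "subsequent" is read as a strictly later publication year.

```python
def leaving_year(records):
    """Return the year of the first paper published after the most recent
    US-affiliated paper, or None if no such paper exists.

    records: list of (year, country_code) pairs, one per paper affiliation.
    Assumption: "subsequent" means a strictly later publication year.
    """
    us_years = [year for year, country in records if country == "US"]
    if not us_years:
        return None  # never used a US affiliation
    last_us_year = max(us_years)
    later_years = [year for year, _ in records if year > last_us_year]
    # No later paper means the scientist may still be US-based.
    return min(later_years) if later_years else None
```

A scientist whose most recent paper still carries a US affiliation yields None, so that scientist is not counted as having left.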

We further identified two groups of interest among US-based Chinese-descent scientists: “junior” scientists, who had published their first papers in the US, started publishing with Chinese affiliations within 5 years thereafter, and finally left the US within 7 years thereafter; and “experienced” scientists, who had published over 25 papers in their whole career and outperformed 97% of scientists.
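
The two group definitions above can be sketched as simple predicates. This is a hedged illustration only: the function and parameter names are hypothetical, both "within" windows are assumed to be measured from the year of the first US paper, and the 97% criterion is not modeled because the text does not specify how it was computed.

```python
def is_junior(first_us_year, first_cn_year, left_us_year):
    """Junior: first Chinese affiliation within 5 years of the first US paper,
    and departure from the US within 7 years of the first US paper.

    Assumption: both windows are counted from the first US paper.
    """
    return (
        0 <= first_cn_year - first_us_year <= 5
        and 0 <= left_us_year - first_us_year <= 7
    )


def is_experienced(n_papers, threshold=25):
    """Experienced: more than `threshold` papers over the whole career."""
    return n_papers > threshold
```

For example, a scientist with a first US paper in 2005, a first Chinese affiliation in 2009, and a leaving year of 2011 satisfies is_junior, while a scientist with exactly 25 papers does not satisfy is_experienced.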

[Chinese-descent-scientists-destination.csv] provides the destination country or region for each of the 25,202 Chinese-descent scientists, along with their respective discipline labels. Scientists migrating to mainland China, Hong Kong, and Taiwan are recorded separately.

[Chinese-descent-scientists-destination-count.csv] reports the number of Chinese-descent scientists who migrated to China, categorized by year, discipline, and stage (junior/experienced). Due to the small sample size, scientists labeled in the "Statistics" discipline were excluded from the count.

For additional information on the processing of the survey data and bibliometric data, please refer to the Supplementary Materials accompanying this paper.


Data Publisher

The survey data are administered by the Asian American Scholar Forum. The bibliometrics data are published by Microsoft under the Open Data Commons Attribution License (ODC-By).


Citation

Please cite this paper if you use this dataset for research purposes.

Yu Xie, Xihong Lin, Ju Li, Qian He, Junming Huang, Caught in the Crossfire: Fears of Chinese-American Scientists, Proceedings of the National Academy of Sciences, 120 (27) e2216248120 (2023).

Please cite Microsoft Academic Graph if you use their data.

Arnab Sinha et al., An Overview of Microsoft Academic Service (MAS) and Applications, in Proceedings of the 24th International Conference on World Wide Web (WWW ’15 Companion), ACM, New York, NY, 243-246 (2015). DOI: 10.1145/2740908.2742839