Unlocking the Cultural Digital Footprint from Natural Language at Scale

The growth of social media yields an unprecedented ability for a populace to passively report cultural data

By: USGIF | January 25, 2019


By H. Andrew Schwartz, Ph.D., Stony Brook University; Brenda Curtis, Ph.D., University of Pennsylvania; Christine DeLorenzo, Ph.D., Stony Brook University; Salvatore Giorgi, M.S., University of Pennsylvania; and Peter Small, M.D., Rockefeller Foundation Fellow

The growth of social media yields an unprecedented ability for a populace to passively report cultural data such as:

Behavior (e.g., exercise, smoking, drinking, and food consumption).
Psychological characteristics (e.g., mental health, sense of community, beliefs, and engagement in life).
Socioeconomic markers (e.g., education, commerce, real estate, and work).

Historically, understanding the “cultural digital footprint” from limited datasets has relied on qualitative research techniques such as manually reading and summarizing cultural data. However, new data science techniques emerging from the intersection of natural language processing, machine learning, and computational social science allow the conversion of unstructured information from social media into quantitative spatial and temporal data, automating the understanding of the cultural footprint of communities.

Figure 1. Social media and web content are mapped to U.S. counties whereby patterns of language can be encoded as a representation of each geographic area. Colors indicate greater (red) or less (blue) frequency of mention.

Recently, we have been exploring the advantage of the cultural digital footprint by evaluating its predictive power against typical structured geospatial data, and what we are finding is quite striking. Cultural characteristics derived from Tweets often provide more predictive power for rates of health, psychological, and economic outcomes than standard socioeconomic and demographic variables.

The general idea, as depicted in Figure 1, is that unstructured language data is mapped to its geographic origin and then natural language processing routines can be run to turn the unstructured text into a structured, quantitative representation of the geographic area. For example, the structured representation could contain the frequency with which particular words are mentioned. Because words are our primary form of communication, often the contents of these representations are also interpretable. For example, Figure 1 depicts a representation of a county in New Jersey by the frequency with which specific topics are mentioned. Talking about sleep may be frequent, while talking about the training class at the gym is less so. In this way, digital footprints in language can unlock geographically structured insights into cultural and psychological factors that were previously accessible only through expensive surveying techniques. Early work with such data looked directly at linguistic differences by region.^[1] In this article, we discuss more recent work that takes the next step of relating geographic difference to health, psychological, and economic outcomes.
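As a rough illustration of the mapping described above, the following sketch turns posts that have already been assigned to a county into per-county relative word frequencies. It is a minimal example under stated assumptions, not the authors' pipeline: the function name `county_word_frequencies`, the placeholder county identifiers, and the whitespace tokenization are all hypothetical simplifications.

```python
from collections import Counter, defaultdict


def county_word_frequencies(posts):
    """Convert (county_id, text) pairs into per-county relative word frequencies."""
    counts = defaultdict(Counter)
    for county_id, text in posts:
        # Assumption: naive lowercase whitespace tokenization stands in for
        # a proper social-media-aware tokenizer.
        counts[county_id].update(text.lower().split())

    freqs = {}
    for county_id, counter in counts.items():
        total = sum(counter.values())
        # Relative frequency: share of all tokens in the county that are this word.
        freqs[county_id] = {word: n / total for word, n in counter.items()}
    return freqs


# Toy posts with placeholder county identifiers (illustrative only).
posts = [
    ("34001", "need more sleep"),
    ("34001", "sleep all day"),
    ("36103", "gym class today"),
]
freqs = county_word_frequencies(posts)
print(round(freqs["34001"]["sleep"], 3))  # share of tokens in county 34001 that are "sleep"
```

Each county ends up as a dictionary of word shares that sum to one, which is the kind of structured, quantitative representation the paragraph describes.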

This article is part of USGIF’s 2019 State & Future of GEOINT Report.

Current Research on Geographic Language

Prior to the growth of web and social media data, relating health outcomes across a large number of communities to cultural factors typically relied on expensive and limited surveys (a notable exception being crude behavioral proxies such as number of fast food restaurants, bike trails, etc.). The digital footprint of culture offers a novel and potentially more powerful perspective. Using geographic language representations along with machine learning techniques, one can often predict county mortality rates in the U.S.

Figure 2. Prediction of U.S. county cancer mortality in 2013-2015 using digital footprints derived from 30 billion Tweets. Green indicates Twitter performance above and beyond standard geographic predictors.

For example, when predicting atherosclerotic heart disease mortality, geographic language from Twitter contained more predictive power by itself than 10 standard variables including demographics, socioeconomics, and standard risk factors such as smoking, diabetes, and obesity.^[2] More recently, we found encodings from Twitter show predictive power beyond 15 standard structural variables (covering demographic, socioeconomic, geographic, and surveyed behavioral and psychological information) for seven out of America’s top 10 causes of death. Figure 2 shows prediction results for cancer mortality rates.

One might find these results particularly striking when considering that, for the most part, the people Tweeting are not those dying of cancer. Rather, the Tweeters are more like canaries that together provide a powerful characterization of a community. In fact, the users on Twitter are not even perfectly representative of their community: they skew young, among a number of other minor deviations.^[3]

Still, the outcomes evaluated against are in fact representative, demonstrating that a biased sample of community language can be mapped to unbiased outcomes.

To better understand how geographic language can represent a community, researchers have also considered psychological outcomes and economic metrics. Using the same data as the mortality study (counties covering more than 90 percent of the U.S. population), we attempted to predict the life satisfaction scores of those communities derived from surveys.^[4] Compared to standard demographic and socioeconomic data previously available, current methods (involving techniques for integrating heterogeneous variable types: language and census demographics) are able to increase the variance explained by 22 percent in predicting surveyed life satisfaction.^[5]
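For readers less familiar with the term, "variance explained" is commonly measured with the coefficient of determination, R squared. The sketch below computes it for two sets of predictions and reports the gain both in absolute and in relative terms. The numbers are toy values invented for illustration, and the variable names are hypothetical. The sketch does not assume either reading of the 22 percent figure above.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_total = sum((y - mean) ** 2 for y in actual)
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total


# Toy values only: surveyed scores for four hypothetical counties, plus
# predictions from a demographics-only model and a demographics+language model.
actual = [5.0, 6.0, 7.0, 8.0]
baseline_pred = [6.0, 6.0, 7.0, 7.0]
combined_pred = [5.5, 6.0, 7.0, 7.5]

r2_baseline = r_squared(actual, baseline_pred)
r2_combined = r_squared(actual, combined_pred)

absolute_gain = r2_combined - r2_baseline
relative_gain = absolute_gain / r2_baseline
print(round(r2_baseline, 2), round(r2_combined, 2))
print(round(absolute_gain, 2), round(relative_gain, 2))
```

In practice the predictions would come from models evaluated on held-out counties, so that the comparison reflects out-of-sample performance.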

Social media can also provide a window into economic outcomes. When looking at change in the median sale price of homes across a community, geographic language once again provided a significant improvement over demographic, social, and economic variables.^[6] Together, this suggests the breadth of social media to represent a community spans information about health, psychological, and even economic factors.

Prediction often isn’t the end game when it comes to geographic language. Instead, researchers often seek insights—potentially novel relationships between community attributes and outcomes. This is often done by looking at the language patterns that are most correlated with outcomes. For example, tied in with the psychological literature, words relating to outdoor activities, spiritual meaning, exercise, and good jobs correlate with increased life satisfaction, while words signifying disengagement such as “bored” and “tired” show a negative association. We looked at community alcohol consumption data along with language patterns in Twitter as compared to demographic and socioeconomic information. Through open-vocabulary analyses—those unrestricted to specific linguistic categories—nearly unbounded numbers of predictive patterns emerge.
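The idea of finding the language patterns most correlated with an outcome can be sketched as follows. This is a simplified illustration with toy numbers, not the analysis from the cited studies: the helper names `pearson` and `rank_words_by_correlation` are hypothetical, and a real analysis would typically also control for covariates and correct for the large number of words being tested.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def rank_words_by_correlation(county_freqs, outcome):
    """Rank words by how strongly their county-level frequency tracks the outcome."""
    counties = sorted(outcome)
    ys = [outcome[c] for c in counties]
    vocab = {word for c in counties for word in county_freqs[c]}
    scores = {}
    for word in vocab:
        xs = [county_freqs[c].get(word, 0.0) for c in counties]
        if len(set(xs)) > 1:  # skip words whose frequency never varies
            scores[word] = pearson(xs, ys)
    # Most positively correlated words first, most negatively correlated last.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: relative word frequencies and a surveyed outcome for three
# hypothetical counties.
county_freqs = {
    "A": {"hiking": 0.03, "bored": 0.01},
    "B": {"hiking": 0.02, "bored": 0.02},
    "C": {"hiking": 0.01, "bored": 0.03},
}
life_satisfaction = {"A": 7.0, "B": 6.0, "C": 5.0}

ranked = rank_words_by_correlation(county_freqs, life_satisfaction)
print(ranked[0][0], ranked[-1][0])
```

Because nothing restricts the vocabulary in advance, the same loop works for any word or topic that appears in the data, which is the sense in which such analyses are open-vocabulary.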

As another example, Figure 3 shows the topics (clusters of semantically related words) that are most predictive of geographic areas with high (top) and low (bottom) rates of excessive drinking.^[7] Mediation analysis identified topics that explained much of the relationship between socioeconomics and excessive alcohol consumption. Social media language contains key pieces of information public health officials can use to monitor behavior and identify people and communities most in need of intervention.

For all the contrasting we have done between the value of surveys and geographic language, it is worth noting much of the approach to geographic language is inspired by survey techniques. Delving deeper into methods and data, we found it is helpful to statistically model each community as a collection of people whose digital footprint can be measured over multiple Tweets rather than simply counting the words within a community. This mirrors aggregating individual survey takers into a community. Words are aggregated from Tweets to users and then from users to U.S. counties, giving accurate measures of the people in the community as opposed to the Tweets in the community. This method has been shown to achieve state-of-the-art prediction accuracies on four different U.S. county-level tasks spanning demographic, health, and psychological outcomes.
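The difference between counting words within a community and aggregating from Tweets to users to counties can be made concrete with a short sketch. It is a simplified illustration, not the published method: the function name `user_level_county_features` is hypothetical, each user is assumed to belong to exactly one county, and details such as minimum activity thresholds are omitted.

```python
from collections import Counter, defaultdict


def user_level_county_features(tweets):
    """Aggregate (county, user, text) records: Tweets -> users -> counties.

    Returns county -> word -> mean relative frequency across the county's users.
    """
    user_counts = defaultdict(Counter)
    user_county = {}
    for county, user, text in tweets:
        user_counts[user].update(text.lower().split())
        user_county[user] = county  # assumption: one county per user

    county_sums = defaultdict(lambda: defaultdict(float))
    users_per_county = Counter()
    for user, counter in user_counts.items():
        total = sum(counter.values())
        county = user_county[user]
        users_per_county[county] += 1
        for word, n in counter.items():
            # Each user contributes a relative frequency, so a prolific
            # user counts no more than a quiet one.
            county_sums[county][word] += n / total

    return {
        county: {word: s / users_per_county[county] for word, s in words.items()}
        for county, words in county_sums.items()
    }


# Toy data: one very active user and two occasional users in county "A".
tweets = [
    ("A", "u1", "bored bored bored bored"),
    ("A", "u1", "bored bored bored bored"),
    ("A", "u2", "hiking today"),
    ("A", "u3", "church today"),
]
features = user_level_county_features(tweets)
print(round(features["A"]["bored"], 3))
```

In this toy county, pooling all tokens directly would put "bored" at 8 of 12 tokens, while the user-level average puts it at one third, since only one of the three users ever uses the word. That is the sense in which the aggregation measures the people in a community rather than the Tweets.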

Researchers, including some of the authors of this article, have recently been working to make aggregate geographic language data more accessible. While the social media data typically used for geographic studies is technically publicly accessible, sharing the raw data separately is often impractical or violates terms of service. However, aggregate community data is much smaller in size than raw text, and it is typically anonymized at the individual level. To that end, we recently released on GitHub^[8] a large community-level aggregate dataset, the County Tweet Lexical Bank, derived from 37 billion Tweets, more than one billion of which were mapped to 2,041 U.S. counties.^[9] This dataset spans 2009-2015 and includes multiple language features aggregated over various time spans.

The Future of Geographic Language

“A therapist, the joke goes, knows in great detail how a patient is doing every Thursday at 3 o’clock.” – David Dobbs, The Atlantic

The power in geographic language patterns lies in their ability to capture everyday concerns. They are not a one-time snapshot of a community, but rather an ongoing (perhaps biased) window into culture. Measures obtained in snapshots and set intervals can introduce many biases, such as recall bias (e.g., bias in recalling a recent state due to current subjective feeling) or variability in mood across hours/days. These measures often require assessment outside of naturalistic circumstances that can introduce biases.^[10]

Much of the work thus far with such data has largely neglected the temporal dimension (and for good reason—simply establishing a connection between the data and real-world outcomes was needed and the time dimension can overcomplicate analyses), but we believe the future of such data and its grandest utility involve utilization of space and time.

One promising avenue for bringing in the time dimension to language-based geographic studies is the application to mental health epidemics. Dr. Thomas Insel, former director of the National Institute of Mental Health, described digital behavior measures as providing “a more objective, textured picture of people’s lives.”^[11] Daily behaviors assessed through technology such as social media offer unique insight into mental health status. Developing new platforms to understand mental health is critical because the traditional U.S. mental health care infrastructure is drastically overburdened, leaving many without care.

Approximately one-third of Americans with serious mental illness receive no treatment, and those who are treated often receive inadequate care, with increasing gaps in service.^[12] This unmet need is greatest in traditionally underserved groups, including those with limited incomes, without insurance, and living in rural areas.^[13] Even with economic setbacks, such folks and their communities are often well represented online.^[14] Numerous studies of self-reported conditions related to mental health, including depression, anxiety, PTSD, and suicidality, now provide strong evidence for the use of social media for psychiatric assessment.^[15]^,[16] Practical utilization is still on the horizon, with prediction of mental illness rates by community being an obvious potential application.

Let’s consider one of the current mental health epidemics, drug overdose deaths, which are now the leading cause of injury-related death in the U.S. In 2016, drug overdoses accounted for more than 63,000 deaths, with nearly two-thirds of these deaths involving a prescription or illicit opioid.^[17] Geographic language can capture and quantify the types of dialogue on social media associated with time of drug use relapse, opioid overdoses, and addiction treatment dropout. In addition, one can examine the amount and patterns of dialogue on social media with respect to opioid addiction treatment need, emerging synthetic opioids, and risk and protective factors for drug use. These results would demonstrate the robustness of social media language analysis and enable public health practitioners to craft algorithms adapted to the characteristics of each population.

The future of geographic language also appears propitious for socioeconomic applications. Social media has a long history of use for tracking opinions and sentiment.^[18] Applications for tracking sentiment often relate to product reviews^[19]^,[20] and political concerns such as links between sentiment and public opinion polls.^[21]^,[22] However, unlike the previous applications of social media that neglected time, these applications have mostly neglected geography. Researchers are beginning to use these same methods to track beliefs in climate change and other environmental issues,^[23]^,[24] but the integration of geography is largely unexplored. One can easily imagine these beliefs being tracked at the community level in the same way various corporate and government agencies use standard polling and surveys, but with more frequent updates at a fraction of the cost.

At first, it may seem like using social media data for geospatial intelligence is jumping on the Twitter bandwagon. However, it’s hard to imagine a resource that can capture such a large variety of cultural phenomena—public digital footprints from millions of individuals across thousands of communities. Of course, unlocking the information can be non-trivial. Like many forms of data science, studying geographic language is often a multidisciplinary endeavor involving trial and error.

Experts such as computer scientists are needed to design and implement the data analyses while social scientists or domain experts help drive the beneficial questions and interpret the results. Still, more and more experts are beginning to leverage such data across a variety of fields. As a result, more tools are becoming available along with aggregate processed datasets, such as our County Tweet Lexical Bank, reducing the barrier to entry and enabling new applications. We have seen predictive power and insights emerge from health to psychological and economic outcomes. However, one insight that eludes us is just how geographic intelligence from social media will be used next.

Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, and Eric P. Xing. “A Latent Variable Model for Geographic Lexical Variation.” In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2010. p 1277-1287.
Johannes C. Eichstaedt, H. Andrew Schwartz, Margaret L. Kern, Gregory Park, Darwin R. Labarthe, Raina M. Merchant, Sneha Jha, et al. “Psychological Language on Twitter Predicts County-Level Heart Disease Mortality.” Psychological Science, 2015:26(2):159-169.
Andrew Perrin. “Social Media Usage in 2018.” Pew Research Center. 2018.
H. Andrew Schwartz, Johannes C. Eichstaedt, Margaret L. Kern, Lukasz Dziurzynski, Richard E. Lucas, Megha Agrawal, Gregory J. Park, et al. “Characterizing Geographic Variation in Well-Being Using Tweets.” In ICWSM. 2013. p 583-591.
Mohammadzaman Zamani, H. Andrew Schwartz, Veronica E. Lynn, Salvatore Giorgi, and Niranjan Balasubramanian. “Residualized Factor Adaptation for Community Social Media Prediction Tasks.” In EMNLP-2018. 2018.
Mohammadzaman Zamani and H. Andrew Schwartz. “Using Twitter Language to Predict the Real Estate Market.” In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, vol. 2. 2017. p 28-33.
Brenda Curtis, Salvatore Giorgi, Anneke EK Buffone, Lyle H. Ungar, Robert D. Ashford, Jessie Hemmons, Dan Summers, Casey Hamilton, and H. Andrew Schwartz. “Can Twitter Be Used to Predict County Excessive Alcohol Consumption Rates?” PloS One, 2018:13(4): e0194290.
github.com/wwbp/county_tweet_lexical_bank
Salvatore Giorgi, Daniel Preotiuc-Pietro, Anneke Buffone, Daniel Rieman, Lyle H. Ungar, and H. Andrew Schwartz. “The Remarkable Benefit of User-Level Aggregation for Lexical-based Population-Level Predictions.” In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018.
David A. Axelson, Michele A. Bertocci, Daniel S. Lewin, Laura S. Trubnick, Boris Birmaher, Douglas E. Williamson, Neal D. Ryan, and Ronald E. Dahl. “Measuring Mood and Complex Behavior in Natural Environments: Use of Ecological Momentary Assessment in Pediatric Affective Disorders.” Journal of Child and Adolescent Psychopharmacology, 2003:13(3):253-266.
David Dobbs. “The Smartphone Psychiatrist.” The Atlantic, 2017:320:78-86.
Mark Olfson, Carlos Blanco, and Steven C. Marcus. “Treatment of Adult Depression in the United States.” JAMA Internal Medicine, 2016:176(10):1482-1491.
P.S. Wang, M. Lane, M. Olfson, H.A. Pincus, K.B. Wells, and R.C. Kessler. “Twelve-Month Use of Mental Health Services in the United States: Results from the National Comorbidity Survey Replication.” Arch Gen Psychiatry, 2005:62(6):629-40.
Andrew Perrin. “Social Media Usage in 2018.” Pew Research Center. 2018.
Munmun De Choudhury, Michael Gamon, Scott Counts, and Eric Horvitz. “Predicting Depression Via Social Media.” ICWSM13, 2013:1-10.
Glen Coppersmith, Mark Dredze, and Craig Harman. “Quantifying Mental Health Signals in Twitter.” In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality. 2014. p 51-60.
Rebecca Ahrnsbrak, J. Bose, S.L. Hedden, R.N. Lipari, and E. Park-Lee. “Key Substance Use and Mental Health Indicators in the United States: Results from the 2016 National Survey on Drug Use and Health.” Center for Behavioral Health Statistics and Quality, Substance Abuse and Mental Health Services Administration: Rockville, MD, USA. 2017.
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. “Thumbs Up? Sentiment Classification Using Machine Learning Techniques.” In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing. Volume 10. Association for Computational Linguistics. 2002. p 79-86.
Bing Liu. “Sentiment Analysis and Opinion Mining.” Synthesis Lectures on Human Language Technologies, 2012:5(1):1-167.
Kushal Dave, Steve Lawrence, and David M. Pennock. “Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews.” In Proceedings of the 12th International Conference on World Wide Web. ACM. 2003. p 519-528.
Minqing Hu and Bing Liu. “Mining and Summarizing Customer Reviews.” In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2004. p 168-177.
Bi Chen, Leilei Zhu, Daniel Kifer, and Dongwon Lee. “What Is an Opinion About? Exploring Political Standpoints Using Opinion Scoring Model.” In AAAI. 2010.
Brendan O’Connor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. “From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series.” In ICWSM. 2010. p 122-129.
Maurice Lineman, Yuno Do, Ji Yoon Kim, and Gea-Jae Joo. “Talking About Climate Change and Global Warming.” PloS One, 2015:10(9):e0138996.
Ji Yoon Kim, Yuno Do, Ran-Young Im, Gu-Yeon Kim, and Gea-Jae Joo. “Use of Large Web-Based Data to Identify Public Interest and Trends Related to Endangered Species.” Biodiversity and Conservation, 2014:23(12):2961-2984.