Data-Driven Insights into Personality, Health, and Well-Being through Language Analysis
With more than a quarter of the world's population communicating actively on social media (e.g. Facebook, Instagram, Twitter), researchers have an unprecedented resource for understanding who we are, including our health and well-being. Using natural language processing and machine learning techniques, we seek to discover new behavioral and psychological factors of health and well-being as manifested through language in social media. However, with few exceptions, existing work has mostly applied methods that were originally developed for modeling documents or words rather than characteristics of people. We apply a range of novel as well as tried-and-true language analysis techniques to answer big questions about human health, personality, and well-being.
Developing Methods of Human-Centered Language Processing
While most tasks in Natural Language Processing are concerned with characterizing or transforming elements of documents (e.g. part-of-speech tagging, syntactic parsing, machine translation, and sentiment analysis), the goal of human-centered analysis is to measure or understand the users or communities generating those documents. This changes the way we can analyze language: (1) it calls for additional variables representing the human and/or social context, (2) it suggests accounting for change in language over time, and (3) it motivates understanding language beyond the word context that modern transformers model.
Time-series and Longitudinal Language Analysis
Tackling change in language opens the door to impactful applications such as forecasting psychological states or health events (e.g. suicide attempt, panic attack, heart disease onset). Longitudinal studies and time series analyses have a long history, which has yielded a vast repertoire of methods to build on (Zeger and Liang, 1986; Frees, 2004). Likewise, NLP has a long history of tackling sequences of characters, words, or phrases. However, combining the two, that is, tracking sequences of language across time, brings new key challenges. The goal is often not to predict language itself but to predict a latent mental state of an individual, for which posts on social media serve as noisy evidence. Further, posting frequency is highly irregular: some people post hourly while others post only weekly or monthly, and even a single individual may post dozens of times in one week and not at all in another.
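One simple way to put irregular posting histories on a common time axis is to bin posts into fixed windows and keep empty windows explicitly marked as missing. The following is a minimal sketch, not the method used in this project: it assumes each post has already been assigned a numeric score by some upstream model, and the function name `weekly_series`, the weekly window size, and the example values are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def weekly_series(posts, start, n_weeks):
    """Bin irregular (timestamp, score) posts into fixed weekly windows.

    Returns one (mean_score, post_count) pair per week. Weeks with no
    posts get a mean of None so a downstream model can treat them as
    missing rather than as a score of zero.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, score in posts:
        week = (timestamp - start).days // 7
        if 0 <= week < n_weeks:  # ignore posts outside the study window
            sums[week] += score
            counts[week] += 1
    return [
        (sums[w] / counts[w] if counts[w] else None, counts[w])
        for w in range(n_weeks)
    ]


# Hypothetical user who posts twice in week 0, then once in week 2.
example_posts = [
    (datetime(2024, 1, 1, 9), 0.2),
    (datetime(2024, 1, 3, 9), 0.4),
    (datetime(2024, 1, 20), 1.0),
]
example_series = weekly_series(example_posts, datetime(2024, 1, 1), n_weeks=4)
```

Keeping the post count next to each mean lets a later model weight well-supported weeks more heavily than weeks backed by a single post.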
Effective Integration of Extra-Linguistic Controls in Human-Centered Language Processing
Many problems in NLP rely on making predictions from textual data alone, but other information about people is often available and useful. In fact, sometimes such information is necessary. For example, early NLP work in mental health classification failed to consider the age and gender of the users being classified, and it turned out that a substantial portion of the early systems' prediction accuracy was due simply to distinguishing language by age (Preotiuc-Pietro et al., 2015). Other fields studying humans have adopted a standard expectation that a study will control for basic human traits, such as age and gender, as well as theoretically related variables (Gazzaniga and Heatherton, 2015), such as socioeconomic variables for health outcomes or, to take an NLP example, political ideology in the case of stance detection. A key challenge in integrating these extra-linguistic variables, referred to as controls, is handling how their characteristics differ from those of linguistic features: controls tend to be much less sparse and less noisy than the signal from linguistic features.
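To illustrate what controlling for a trait means in practice, the sketch below removes the linear effect of a single control (age) from both the outcome and a language feature, then measures the association that remains. This is a generic two-stage residualization with one control and one feature, offered as an assumption-laden illustration rather than the integration method developed in this project; the helper names `fit_line` and `residualize` and the toy numbers are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residualize(control, values):
    """Return what is left of `values` after removing a linear fit on `control`."""
    slope, intercept = fit_line(control, values)
    return [v - (slope * c + intercept) for c, v in zip(control, values)]


# Toy data (hypothetical): the outcome depends on age and on a language feature.
ages = [20, 30, 40, 50, 60]
feature = [0.1, 0.5, 0.2, 0.4, 0.3]  # e.g. relative frequency of some word
outcome = [2 * a + 3 * f for a, f in zip(ages, feature)]

# Slope that ignores age: mixes the effect of age into the language feature.
naive_slope, _ = fit_line(feature, outcome)

# Slope after residualizing both variables on age: the partial association.
partial_slope, _ = fit_line(residualize(ages, feature), residualize(ages, outcome))
```

In this toy setup the naive slope is inflated because the feature happens to correlate with age, while the partial slope recovers the coefficient of 3 that was built into the data. With real data, controls are dense and low-noise while linguistic features are sparse and noisy, which is why simply concatenating the two kinds of variables often underuses one of them.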
Multi-level Modeling of Human Communication
Analyzing language about people introduces an inherent multi-level structure in the data: communities contain multiple people, who each generate multiple documents of personal discourse. A key challenge is that linguistic variables tend to be distributed very differently at each level of analysis (Almodaresi et al., 2017). For example, "love" is sparse at the document level, closer to log-normal at the user level, and closer to normal at the community level. In addition, aggregation by location can introduce systematic biases (e.g. "love" may be mentioned multiple orders of magnitude more often in Love, Texas). We seek to account for these differences and enable models to be applied at multiple levels of analysis.
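The document, user, and community levels can be made concrete with a small aggregation sketch. It assumes documents arrive as (community, user, tokens) triples and aggregates a word's relative frequency to the user level first, then averages users within each community, so that one prolific user does not dominate the community estimate. The function names and toy data are hypothetical, and this is only one of several reasonable aggregation orders.

```python
from collections import Counter, defaultdict


def user_relative_frequency(docs, word):
    """Relative frequency of `word` per (community, user), pooled over that user's documents."""
    hits = Counter()
    totals = Counter()
    for community, user, tokens in docs:
        key = (community, user)
        hits[key] += sum(1 for t in tokens if t == word)
        totals[key] += len(tokens)
    return {key: hits[key] / totals[key] for key in totals if totals[key] > 0}


def community_mean(user_freqs):
    """Average the user-level frequencies within each community."""
    by_community = defaultdict(list)
    for (community, _user), freq in user_freqs.items():
        by_community[community].append(freq)
    return {c: sum(v) / len(v) for c, v in by_community.items()}


# Toy corpus (hypothetical): u1 writes much more than u2 in community "A".
docs = [
    ("A", "u1", ["love", "you"]),
    ("A", "u1", ["i", "love", "love", "it"]),
    ("A", "u2", ["no", "way"]),
    ("B", "u3", ["love"]),
]
user_level = user_relative_frequency(docs, "love")
community_level = community_mean(user_level)
```

Here community "A" averages to 0.25 across its two users, whereas pooling all of its tokens directly would give 3/8 = 0.375 because u1 contributes most of the text. The choice of aggregation order is itself a modeling decision at each level.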