hlab.cs.stonybrook.edu

Data-Driven Insights into Personality, Health, and Well-Being through Language Analysis

With more than a quarter of the world’s population communicating actively on social media (e.g. Facebook, Instagram, Twitter), researchers have an unprecedented resource for understanding who we are, including our health and well-being. Using natural language processing and machine learning techniques, we seek to discover new behavioral and psychological factors of health and well-being as manifested through language in social media. However, with few exceptions, existing work has mostly applied methods that were originally developed for modeling documents or words rather than characteristics of people. We apply a range of novel as well as tried-and-true language analysis techniques to answer big questions about human health, personality, and well-being.

Data-Driven Content Analysis of Social Media: A Systematic Overview of Automated Methods
Adaptive Language-based Mental Health Assessment with Item-Response Theory
Natural language analysis and the psychology of verbal behavior: The past, present, and future states of…
Robust language-based mental health assessments in time and space through social media
Incremental Validity of Language-Based Assessments of Personality in World Trade Center Responders
Archetypes and Entropy: Theory-Driven Extraction of Evidence for Suicide Risk
Beyond rating scales: With targeted evaluation, language models are poised for psychological assessment
Comparing Human-Centered Language Modeling: Is it Better to Model Groups, Individual Traits, or Both?
World Trade Center responders in their own words: Predicting PTSD symptom trajectories with AI-based lan…
Evaluating Contextual Embeddings and their Extraction Layers for Depression Assessment
Characterizing empathy and compassion using computational linguistic analysis
AI-based analysis of social media language predicts addiction treatment dropout at 90 days
Using Facebook language to predict and describe excessive alcohol use
Linguistic predictors from Facebook postings of substance use disorder treatment retention versus discon…
Depression and anxiety on Twitter during the COVID-19 stay-at-home period in 7 major US cities
Detecting dissonant stance in social media: The role of topic exposure
Estimating geographic subjective well-being from Twitter: A comparison of dictionary and data-driven lan…
Discourse-level representations can improve prediction of degree of anxiety
The language of well-being: Tracking fluctuations in emotion experience through everyday speech.
Quantifying Community Characteristics of Maternal Mortality Using Social Media
Hierarchical models of Twitter language suggest that richer neighbors make you less happy
Using social media to track geographic variability in language about diabetes: Infodemiology analysis
The self-congruity effect of music

Developing Methods of Human-Centered Language Processing

While most tasks in Natural Language Processing are concerned with characterizing or transforming elements of documents (e.g. part-of-speech tagging, syntactic parsing, machine translation, and sentiment analysis), the goal of human-centered analysis is to measure or understand the users or communities generating those documents. This changes how we can analyze language: it (1) calls for additional variables representing the human and/or social context, (2) suggests considering change in language over time, and (3) motivates understanding language beyond the word-level context that modern transformers rely on.

Human Centered NLP with User-Factor Adaptation
Predictive Biases in Natural Language Processing Models: A Conceptual Framework and Overview
Transfer and Active Learning for Dissonance Detection: Addressing the Rare-Class Challenge
Comparing Human-Centered Language Modeling: Is it Better to Model Groups, Individual Traits, or Both?
Human-centered metrics for dialog system evaluation
Systematic Evaluation of GPT-3 for Zero-Shot Personality Estimation
Large Human Language Models: A Need and the Challenges
Human Language Modeling
A Human-Centered Hierarchical Framework for Dialogue System Construction and Evaluation
Closed-and open-vocabulary approaches to text analysis: A review, quantitative comparison, and recommend…
Empirical evaluation of pre-trained transformers for human-level NLP: The role of sample size and dimens…
Discourse relation embeddings: Representing the relations between discourse segments in social media
Modeling latent dimensions of human beliefs
MeLT: Message-level transformer with masked document representations as pre-training for stance detection
The Text-package: An R-package for Analyzing and Visualizing Human Language Using Natural Language Proce…
Contrastive Lexical Diffusion Coefficient: Quantifying the Stickiness of the Ordinary
Closed and Open Vocabulary Approaches to Text Analysis: A Review, Quantitative Comparison, and Recommend…
Autoregressive Affective Language Forecasting: A Self-Supervised Task
Characterizing social spambots by their human traits

Time-series and Longitudinal Language Analysis

Tackling change in language opens the door to impactful applications such as forecasting psychological states or health events (e.g. suicide attempt, panic attack, heart disease onset). Longitudinal studies and time series analyses have a long history, which has yielded a vast repertoire of methods to build on (Zeger and Liang, 1986; Frees, 2004). NLP, for its part, has a long history of tackling sequences of characters, words, or phrases. However, combining the two, tracking sequences in language across time, brings new key challenges. The goal is often not to predict language itself but to estimate a latent mental state of an individual, for which posts on social media serve as a source of noisy evidence. Further, some people post hourly while others post only weekly or monthly, and even a single individual's posting can vary from dozens of posts in one week to none at all.

Opioid death projections with AI-based forecasts using social media language
Using Daily Language to Understand Drinking: Multi-Level Longitudinal Differential Language Analysis
Predicting adolescent depression and anxiety from multi-wave longitudinal data using machine learning
The where and when of COVID-19: Using ecological and Twitter-based assessments to examine impacts in a t…
Using Ecological and Twitter-Based Assessments to Examine Impacts in Temporal and Community Context
Artificial intelligence language predictors of two-year trauma-related outcomes

County Covid19 Tracker

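One simple way to cope with the irregular posting rates described in the time-series section is to average a per-post score within fixed weekly windows and mark empty weeks explicitly. The following is a minimal sketch under stated assumptions, not a method taken from the listed publications: the function name `weekly_series` and the idea of a numeric per-post score (for example, an affect estimate) are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def weekly_series(posts, start, n_weeks):
    """Average a per-post score within fixed weekly windows.

    posts: iterable of (timestamp, score) pairs, in any order.
    Returns a list of length n_weeks. Weeks with no posts are None
    rather than 0, so silence is not mistaken for a neutral score.
    """
    buckets = defaultdict(list)
    for ts, score in posts:
        week = (ts - start).days // 7
        if 0 <= week < n_weeks:  # ignore posts outside the study window
            buckets[week].append(score)
    return [
        sum(buckets[w]) / len(buckets[w]) if w in buckets else None
        for w in range(n_weeks)
    ]


# Hypothetical scores: two posts in week 0, none in weeks 1-2, one in week 3.
start = datetime(2020, 1, 6)
posts = [
    (datetime(2020, 1, 6, 9), 1.0),
    (datetime(2020, 1, 8, 21), 3.0),
    (datetime(2020, 1, 27, 12), 5.0),
]
print(weekly_series(posts, start, 4))  # [2.0, None, None, 5.0]
```

Keeping empty weeks as `None` leaves the choice of how to treat gaps (imputation, masking, or modeling the gap itself) to the downstream model.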
Effective Integration of Extra-Linguistic Controls in Human-Centered Language Processing

Many problems in NLP rely on making predictions from textual data alone, but other information about people is often available and useful. In fact, sometimes such information is necessary. For example, early NLP work in mental health classification failed to consider the age and gender of the users being classified, but it turned out that a substantial portion of early system prediction accuracy was due simply to distinguishing language by age (Preotiuc-Pietro et al., 2015). Other fields studying humans have adopted a standard expectation that a study will control for basic human traits, such as age and gender, as well as theoretically related variables (Gazzaniga and Heatherton, 2015), such as socioeconomic variables for health outcomes or, to take an NLP example, political ideology in the case of stance detection. A key challenge in integrating these extra-linguistic variables, referred to as controls, is handling their different characteristics compared to linguistic features: controls tend to be much less sparse and less noisy than the signal from linguistic features.

Residualized Factor Adaptation for Community Social Media Prediction Tasks
Human Centered NLP with User-Factor Adaptation
I slept like a baby: using human traits to characterize deceptive ChatGPT and human text
The Role of Negative Affect in Shaping Populist Support: Converging Evidence from the Field
Using Twitter to Predict the Real Estate Market

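To illustrate why dense controls and sparse language features may deserve different treatment, here is a minimal two-stage sketch: ordinary least squares on the controls first, then ridge regression of the leftover residual on language features. It is an illustrative assumption, not the procedure of any publication listed above; the function names and the `ridge` penalty value are hypothetical.

```python
import numpy as np


def fit_residualized(controls, language, y, ridge=1.0):
    """Stage 1: least squares on dense controls (plus an intercept).
    Stage 2: ridge regression of the stage-1 residuals on language features,
    so the noisier language signal only explains what controls do not."""
    C = np.column_stack([np.ones(len(y)), controls])
    beta_c, *_ = np.linalg.lstsq(C, y, rcond=None)
    resid = y - C @ beta_c
    L = language
    # Closed-form ridge solution: (L'L + ridge * I)^-1 L' resid
    beta_l = np.linalg.solve(L.T @ L + ridge * np.eye(L.shape[1]), L.T @ resid)
    return beta_c, beta_l


def predict(controls, language, beta_c, beta_l):
    """Sum the control-based prediction and the language-based correction."""
    C = np.column_stack([np.ones(len(controls)), controls])
    return C @ beta_c + language @ beta_l


# Synthetic data (made up for illustration): 2 dense controls, 5 sparse counts.
rng = np.random.default_rng(0)
n = 200
controls = rng.normal(size=(n, 2))
language = rng.poisson(0.1, size=(n, 5)).astype(float)
y = controls @ np.array([1.0, -2.0]) + language @ np.full(5, 0.5) + rng.normal(scale=0.1, size=n)

beta_c, beta_l = fit_residualized(controls, language, y)
y_hat = predict(controls, language, beta_c, beta_l)
```

Fitting the controls without a penalty and regularizing only the language weights is one way to respect the difference in sparsity and noise between the two feature types; other designs are possible.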
Multi-level Modeling of Human Communication

Analyzing language about people introduces an inherent multi-level structure in the data: communities contain multiple people, who each generate multiple documents of personal discourse. A key challenge is that linguistic variables tend to be distributed very differently at each level of analysis (Almodaresi et al., 2017) (e.g. “love” is sparse at the document level, more log-normal at the user level, and more normal at the community level), and aggregation by location can introduce systematic biases (e.g. “love” may be mentioned multiple orders of magnitude more often in Love, Texas). We seek to account for these differences and enable models to be applied at multiple levels of analysis.

Using Daily Language to Understand Drinking: Multi-Level Longitudinal Differential Language Analysis
WWBP-SQT-lite: Difference Embeddings and Multi-level Models for Moments of Change Identification in Ment…
Regional personality assessment through social media language
Understanding weekly COVID-19 concerns through dynamic content-specific LDA topic modeling
Cultural differences in Tweeting about drinking across the US
Autoregressive affective language forecasting: A self-supervised task
Suicide Risk Assessment with Multi-level Dual-Context Language and BERT
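The document, user, and community levels described in the multi-level section can be made concrete with a small aggregation sketch. It assumes, hypothetically, that messages arrive as (community, user, tokens) triples; the function names are illustrative and the user-then-community averaging is one possible design, not the lab's stated procedure.

```python
from collections import defaultdict


def relative_freq(tokens, word):
    """Share of tokens equal to `word` (0.0 for an empty token list)."""
    return tokens.count(word) / len(tokens) if tokens else 0.0


def aggregate(messages, word):
    """messages: iterable of (community, user, tokens) triples.

    Pools tokens per user first, then averages user-level frequencies
    within each community, so every user counts equally regardless of
    how much they post.
    """
    user_tokens = defaultdict(list)
    for community, user, tokens in messages:
        user_tokens[(community, user)].extend(tokens)
    user_level = {k: relative_freq(t, word) for k, t in user_tokens.items()}

    per_community = defaultdict(list)
    for (community, _user), freq in user_level.items():
        per_community[community].append(freq)
    community_level = {c: sum(f) / len(f) for c, f in per_community.items()}
    return user_level, community_level


# Made-up messages: u1 posts twice, u2 and u3 once each.
msgs = [
    ("A", "u1", ["love", "you"]),
    ("A", "u1", ["love", "it"]),
    ("A", "u2", ["nice", "day"]),
    ("B", "u3", ["love", "this", "sunny", "day"]),
]
users, communities = aggregate(msgs, "love")
print(users)        # {('A', 'u1'): 0.5, ('A', 'u2'): 0.0, ('B', 'u3'): 0.25}
print(communities)  # {'A': 0.25, 'B': 0.25}
```

Pooling all of community A's tokens directly would instead give 2/6, which shows how the choice of aggregation level changes the value a downstream model sees.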