Adaptive Information Extraction from Social Media for Actionable Inferences in Public Health

Computer Science Department
Columbia University

Project Summary

Social media is a major source of non-curated, user-generated feedback on virtually all products and services. Users increasingly rely on social media to disclose serious real-life incidents, such as food poisoning at a restaurant, rather than reporting them through official government channels. This valuable user-generated information, if identified reliably, may have a dramatic positive impact on critical applications related to public health (the family of applications of interest in this project) and beyond. For example, a local health department might launch an investigation of a potential foodborne illness outbreak at a restaurant if compelling evidence supporting the investigation can be inferred from social media. This project addresses fundamental research challenges associated with processing social media data to produce actionable inferences, where the output of the process leads to concrete actions in the real world. In addition to producing broadly applicable research results, the project has as its centerpiece a critical public health application, namely, detecting and acting on foodborne illness outbreaks in restaurants.

Overall, this project develops (1) strategies for entity-centric modeling and selection of social media, to cover the vast volumes of user-produced content across sources; (2) non-traditional information extraction strategies over informal, noisy, and ungrammatical text, as well as learning-based approaches to produce actionable, entity-centric inferences for public health applications; and (3) methods for general online active learning and search that are tuned for detecting the rare occurrences required for actionable inferences. Furthermore, this project centers on (4) an application, detecting and acting on foodborne illness outbreaks, in a collaboration between Columbia University and the New York City Department of Health and Mental Hygiene (DOHMH). This collaboration provides a robust, real-world platform for a continuous, end-to-end evaluation of the novel research results as applied to a large-scale data science problem, a rare opportunity in the evaluation of Computer Science research. This collaboration includes the development and deployment of a system with a direct impact on public health and society. A proof-of-concept prototype is already in use at DOHMH and has helped identify and act on several previously unknown outbreaks. The public health findings from the project are shared across governmental agencies, following DOHMH's best practices. Developed code and annotated datasets will be shared with other researchers and agencies.

Acknowledgments: This research is supported by the National Science Foundation under Grant No. IIS-15-63785. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Our colleagues at the New York City DOHMH are supported by the Alfred P. Sloan Foundation under Grant No. G‐2015‐14017 managed by the Fund for Public Health in New York, Inc.

We thank Yelp for generously providing us with access to its raw feed of business reviews for New York City and Los Angeles County.

People at Columbia

Luis Gravano (PI)
Daniel Hsu (Co-PI)

Ivy Cao (undergraduate student, graduated)
Alan Chung (high-school student, graduated)
Dean Deng (undergraduate student, graduated)
Sam Deng (PhD student)
Tom Effland (PhD student, graduated)
Jacob Fisher (high-school student, graduated)
Lampros Flokas (PhD student, graduated)
Yogesh Garg (MS student, graduated)
Haolin Guo (undergraduate student)
Howard Hong (MS student, graduated)
Mohip Joarder (MS student, graduated)
Max Kaliner (undergraduate student, graduated)
Giannis Karamanolakis (PhD student, graduated)
Anna Lawson (undergraduate student, graduated)
Ziyi Liu (undergraduate student, graduated)
Zizhou Liu (undergraduate and MS student, graduated)
Divyang Mittal (MS student)
Ken Miura (undergraduate student, graduated; MS student)
Fotis Psallidas (PhD student, graduated)
Alden Quimby (undergraduate student, graduated)
Samuel Raab (undergraduate student, graduated)
Vipul Raheja (MS student, graduated)
Eden Shaveet (graduate student, graduated)
Crystal Su (undergraduate student)
Henri Stern (undergraduate student, graduated)
Keyang Xu (PhD student)

Collaborators at the New York City Department of Health and Mental Hygiene

Sharon Balter (through 2017; now at Los Angeles County Department of Public Health)
Katelynn Devinney
Lenka Malec
Vasudha Reddy
Haena Waechter
and many others

Collaborators at the Los Angeles County Department of Public Health

Sharon Balter
Melody Brown
Rebecca Fisher
and many others

Publications

Interactive Machine Teaching by Labeling Rules and Instances, G. Karamanolakis, D. Hsu, and L. Gravano, in Transactions of the Association for Computational Linguistics, 12:1441–1459, 2024
Geospatial and Geosocial Dimensions of Foodborne Illness as Reflected in Yelp Restaurant Reviews, E. Shaveet, S. Chowdhury, D. Hsu, and L. Gravano, in International Conference on Social Media & Society, 2024
Representational Strengths and Limitations of Transformers, C. Sanford, D. Hsu, and M. Telgarsky, in Advances in Neural Information Processing Systems 36, 2023
Intrinsic Dimensionality and Generalization Properties of the R-norm Inductive Bias, N. Ardeshir, D. Hsu, and C. Sanford, in Thirty-Sixth Annual Conference on Learning Theory, 2023
Efficient Machine Teaching Frameworks for Natural Language Processing, G. Karamanolakis, PhD Dissertation, Columbia University, 2022
Masked Prediction: A Parameter Identifiability View, B. Liu, D. Hsu, P. Ravikumar, and A. Risteski, in Advances in Neural Information Processing Systems 35, 2022
Simple and Near-Optimal Algorithms for Hidden Stratification and Multi-Group Learning, C. Tosh and D. Hsu, in Thirty-Ninth International Conference on Machine Learning, 2022
Near-Optimal Statistical Query Lower Bounds for Agnostically Learning Intersections of Halfspaces with Gaussian Marginals, D. Hsu, C. Sanford, R. Servedio, and E.-V. Vlatakis-Gkaragkounis, in Thirty-Fifth Annual Conference on Learning Theory, 2022
Learning Tensor Representations for Meta-Learning, S. Deng, Y. Guo, D. Hsu, and D. Mandal, in Twenty-Fifth International Conference on Artificial Intelligence and Statistics, 2022
Quantifying the Effects of COVID-19 on Restaurant Reviews, I. Cao, Z. Liu, G. Karamanolakis, D. Hsu, and L. Gravano, in Proc. of the 9th International Workshop on Natural Language Processing for Social Media (SocialNLP@NAACL 2021), 2021
Cross-Lingual Text Classification with Minimal Resources by Transferring a Sparse Teacher, G. Karamanolakis, D. Hsu, and L. Gravano, in Proc. of Findings of the 2020 Conference on Empirical Methods in Natural Language Processing (Findings of EMNLP 2020), 2020
Detecting Foodborne Illness Complaints in Multiple Languages Using English Annotations Only, Z. Liu, G. Karamanolakis, D. Hsu, and L. Gravano, in Proc. of the 11th International Workshop on Health Text Mining and Information Analysis (LOUHI@EMNLP 2020), 2020
Leveraging Just a Few Keywords for Fine-Grained Aspect Detection Through Weakly Supervised Co-Training, G. Karamanolakis, D. Hsu, and L. Gravano, in Proc. of the 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), 2019
Weakly Supervised Attention Networks for Fine-Grained Opinion Mining and Public Health, G. Karamanolakis, D. Hsu, and L. Gravano, in the 5th Workshop on Noisy User-Generated Text (W-NUT 2019), 2019
Training Neural Networks for Aspect Extraction Using Descriptive Keywords Only, G. Karamanolakis, D. Hsu, and L. Gravano, in 2nd Learning from Limited Labeled Data Workshop (LLD 2019), 2019
Learning Single-Index Models in Gaussian Space, R. Dudeja and D. Hsu, in 31st Annual Conference on Learning Theory (COLT 2018), 2018
Discovering Foodborne Illness in Online Restaurant Reviews, T. Effland, A. Lawson, S. Balter, K. Devinney, V. Reddy, H. Waechter, L. Gravano, and D. Hsu, in Journal of the American Medical Informatics Association, vol. 25, no. 12, pages 1586–1592, Dec. 2018
Linear Regression Without Correspondence, D. Hsu, K. Shi, and X. Sun, in Advances in Neural Information Processing Systems 30, 2017
Kernel Ridge vs. Principal Component Regression: Minimax Bounds and Adaptability of Regularization Operators, L. Dicker, D. Foster, and D. Hsu, in Electronic Journal of Statistics, vol. 11, no. 1, pages 1022-1047, 2017
Correspondence Retrieval, A. Andoni, D. Hsu, K. Shi, and X. Sun, in Thirtieth Annual Conference on Learning Theory, 2017
Using Online Reviews by Restaurant Patrons to Identify Unreported Cases of Foodborne Illness — New York City, 2012–2013, C. Harrison, M. Jorder, H. Stern, F. Stavinsky, V. Reddy, H. Hanson, H. Waechter, L. Lowe, L. Gravano, and S. Balter, in Centers for Disease Control and Prevention Morbidity and Mortality Weekly Report (CDC MMWR), vol. 63, no. 20, pages 441-445, May 2014
Detecting Foodborne Disease Outbreaks Using Social Media (demonstration), F. Psallidas, L. Gravano, and many others, in NYC Media Lab's Annual Summit, 2014
Information Extraction from Social Media for Public Health, N. Elhadad, L. Gravano, D. Hsu, S. Balter, V. Reddy, and H. Waechter, in KDD at Bloomberg Workshop, Data Frameworks Track (KDD 2014), 2014

Datasets

Dataset for SocialNLP@NAACL 2021 paper

Presentations

Devinney K., Bekbay A., Effland T., Gravano L., Howell D., Hsu D., O'Halloran D., Padhy C., Reddy V., Stavinsky F., Waechter H., Balter S. Evaluating Twitter as a Data Source for Foodborne Outbreak Detection in New York City. International Society for Disease Surveillance 2018 Annual Conference; January 31-February 2, 2018; Orlando, Florida
Devinney K., Bekbay A., Effland T., Gravano L., Howell D., Hsu D., O'Halloran D., Padhy C., Reddy V., Stavinsky F., Waechter H., Balter S. Evaluating Twitter as a Data Source for Foodborne Outbreak Detection in New York City. Integrated Foodborne Outbreak Response and Management Conference; November 6-9, 2017; Garden Grove, California
Devinney K., Bekbay A., Effland T., Gravano L., Howell D., Hsu D., O'Halloran D., Padhy C., Reddy V., Stavinsky F., Waechter H., Balter S. Evaluating Twitter as a Data Source for Foodborne Outbreak Detection in New York City. Northeast Epidemiology Conference; October 18-20, 2017; Northampton, Massachusetts
Devinney K. Improving Foodborne Complaint and Outbreak Detection Using Social Media, New York City. Arizona Infectious Disease Training; July 18-20, 2017; Phoenix, Arizona
Devinney K., Bekbay A., Effland T., Gravano L., Howell D., Hsu D., O'Halloran D., Padhy C., Reddy V., Stavinsky F., Waechter H., Balter S. Evaluating Twitter as a Data Source for Foodborne Outbreak Detection in New York City. Council of State and Territorial Epidemiologists Conference; June 4-8, 2017; Boise, Idaho
Devinney K., Bekbay A., Howell D., O'Halloran D., Padhy C., Reddy V., Stavinsky F., Waechter H., Balter S. An Estimation of Restaurant-Associated Foodborne Illness Incidents in New York City, 2014-2016. Council of State and Territorial Epidemiologists Conference; June 4-8, 2017; Boise, Idaho
Devinney K. Improving Foodborne Complaint and Outbreak Detection Using Social Media, New York City. New York Conference-Quarterly Training, Central Atlantic States Association of Food Safety; March 21, 2017; Jamaica, New York
Devinney K., Bekbay A., Howell D., O'Halloran D., Padhy C., Reddy V., Stavinsky F., Waechter H., Balter S. An Estimation of Restaurant-Associated Foodborne Illness Incidents in New York City, 2014-2016. Northeast Epidemiology Conference; October 20-21, 2016; Saratoga Springs, New York
Reddy V. Online Chatter for Foodborne Outbreak Detection: Using Social Media Data to Identify Unreported Complaints and Outbreaks. Bureau of Communicable Disease Cross-cutting Data Analysis Meeting; December 2, 2015; New York, New York
Mansour R., Harris J., Reddy V., Gravano L., Elhadad N., Hawkins J., Southern K., Stevens J. Identifying Unreported Foodborne Disease Using Social Media Data. American Public Health Association Annual Conference; October 31-November 4, 2015; Chicago, Illinois
Reddy V. Using Yelp Reviews to Identify Unreported Cases of Foodborne Illness in New York City. International Association for Food Protection Annual Meeting; July 25-28, 2015; Portland, Oregon

Press, Press Releases, Tweets, etc.

Code

X (formerly Twitter) Account, Etc.

@NYCFoodborne
Devinney K. #foodpoisoning: Using Social Media to Detect Outbreaks. CSTE Features; February 7, 2017

Last updated: February 2, 2024