Adaptive Information Extraction from Social Media for
Actionable Inferences in Public Health
Project Summary
Social media is a major source for non-curated, user-generated
feedback on virtually all products and services. Users increasingly
rely on social media to disclose serious real-life incidents, such as
a food poisoning incident at a restaurant, rather than reporting to
official government channels. This valuable user-generated
information, if identified reliably, may have a dramatic positive
impact on critical applications related to public health—the
family of applications of interest in this project—and
beyond. For example, a local health department might launch an
investigation of a potential foodborne illness outbreak at a
restaurant if compelling evidence supporting the investigation can be
inferred from social media. This project addresses fundamental
research challenges associated with processing social media data to
produce actionable inferences, where the output of the process leads
to concrete actions in the real world. In addition to producing
broadly applicable research results, the project has as its
centerpiece a critical public health application,
namely, detecting and acting on foodborne illness outbreaks in
restaurants.
Overall, this project develops (1) strategies for entity-centric
modeling and selection of social media, to cover the vast volumes of
user-produced content across sources; (2) non-traditional information
extraction strategies over informal, noisy, and ungrammatical text, as
well as learning-based approaches to produce actionable,
entity-centric inferences for public health applications; and (3)
methods for general online active learning and search that are tuned
for detecting the rare and infrequent occurrences required for
actionable inferences. Furthermore, this project centers on (4) an
application, detecting and acting on foodborne illness outbreaks, in a
joint collaboration between Columbia University and the New
York City Department of Health and Mental Hygiene
(DOHMH). This collaboration provides a robust, real-world platform for
a continuous, end-to-end evaluation of the novel research results as
applied to a large-scale data science problem, a rare opportunity in
the evaluation of Computer Science research. The collaboration also
includes the development and deployment of a system with a direct
impact on public health and society. A proof-of-concept prototype is
already in use at DOHMH and has helped identify and act on several
previously unknown outbreaks. The public health findings from the
project are shared across governmental agencies, following DOHMH's
best practices. Developed code and annotated datasets will be shared
with other researchers and agencies.
Acknowledgments: This research is supported by the National
Science Foundation under Grant No. IIS-15-63785. Any opinions,
findings, and conclusions or recommendations expressed in this
material are those of the authors and do not necessarily reflect the
views of the National Science Foundation.
Our colleagues at the New York City DOHMH are supported by the
Alfred P. Sloan Foundation under Grant No. G-2015-14017 managed by the
Fund for Public Health in New York, Inc.
We thank Yelp for generously providing us with
access to its raw feed of business reviews for New York City and Los
Angeles County.
People at Columbia
- Ivy Cao (undergraduate student, graduated)
- Alan Chung (high-school student, graduated)
- Dean Deng (undergraduate student, graduated)
- Sam Deng (PhD student)
- Tom Effland (PhD student, graduated)
- Jacob Fisher (high-school student, graduated)
- Lampros Flokas (PhD student, graduated)
- Yogesh Garg (MS student, graduated)
- Haolin Guo (undergraduate student)
- Howard Hong (MS student, graduated)
- Mohip Joarder (MS student, graduated)
- Max Kaliner (undergraduate student, graduated)
- Giannis Karamanolakis (PhD student, graduated)
- Anna Lawson (undergraduate student, graduated)
- Ziyi Liu (undergraduate student, graduated)
- Zizhou Liu (undergraduate and MS student, graduated)
- Divyang Mittal (MS student)
- Ken Miura (undergraduate student, graduated; MS student)
- Fotis Psallidas (PhD student, graduated)
- Alden Quimby (undergraduate student, graduated)
- Samuel Raab (undergraduate student, graduated)
- Vipul Raheja (MS student, graduated)
- Eden Shaveet (graduate student)
- Henri Stern (undergraduate student, graduated)
- Keyang Xu (PhD student)
Collaborators at the New York City Department of Health and Mental
Hygiene
- Sharon Balter (through 2017; now at Los Angeles County
Department of Public Health)
- Katelynn Devinney
- Lenka Malec
- Vasudha Reddy
- Haena Waechter
- and many others
Collaborators at the Los Angeles County Department of Public Health
- Sharon Balter
- Melody Brown
- Rebecca Fisher
- and many others
Publications
- Representational
Strengths and Limitations of Transformers, C. Sanford, D.
Hsu, and M. Telgarsky, in Advances in Neural Information
Processing Systems 36, 2023
- Intrinsic
Dimensionality and Generalization Properties of the R-norm
Inductive Bias, N. Ardeshir, D. Hsu, and C. Sanford, in
Thirty-Sixth Annual Conference on Learning Theory, 2023
- Efficient
Machine Teaching Frameworks for Natural Language Processing,
G. Karamanolakis, PhD Dissertation, Columbia University, 2022
- Masked Prediction: A
Parameter Identifiability View, B. Liu, D. Hsu, P. Ravikumar,
and A. Risteski, in Advances in Neural Information Processing
Systems 35, 2022
- Simple
and Near-Optimal Algorithms for Hidden Stratification and
Multi-Group Learning, C. Tosh and D. Hsu, in Thirty-Ninth
International Conference on Machine Learning, 2022
- Near-Optimal
Statistical Query Lower Bounds for Agnostically Learning
Intersections of Halfspaces with Gaussian Marginals, D. Hsu,
C. Sanford, R. Servedio, and E.-V. Vlatakis-Gkaragkounis, in
Thirty-Fifth Annual Conference on Learning Theory, 2022
- Learning Tensor
Representations for Meta-Learning, S. Deng, Y. Guo, D. Hsu, and
D. Mandal, in Twenty-Fifth International Conference on Artificial
Intelligence and Statistics, 2022
- Quantifying
the Effects of COVID-19 on Restaurant Reviews, I. Cao, Z. Liu,
G. Karamanolakis,
D. Hsu, and L. Gravano, in Proc. of the 9th International Workshop
on Natural Language Processing for Social Media (SocialNLP@NAACL
2021), 2021
- Cross-Lingual
Text Classification with Minimal Resources by Transferring a Sparse
Teacher, G. Karamanolakis, D. Hsu, and L. Gravano, in Proc. of
Findings of the 2020 Conference on Empirical Methods in Natural
Language Processing (Findings of EMNLP 2020), 2020
- Detecting
Foodborne Illness Complaints in Multiple Languages Using English
Annotations Only, Z. Liu, G. Karamanolakis, D. Hsu, and
L. Gravano, in Proc. of the 11th International Workshop on
Health Text Mining and Information Analysis (LOUHI@EMNLP 2020),
2020
- Leveraging
Just a Few Keywords for Fine-Grained Aspect Detection Through Weakly
Supervised Co-Training, G. Karamanolakis, D. Hsu, and
L. Gravano, in Proc. of the 2019 Conference on Empirical Methods in
Natural Language Processing and 9th International Joint Conference
on Natural Language Processing (EMNLP-IJCNLP 2019), 2019
- Weakly
Supervised Attention Networks for Fine-Grained Opinion Mining and
Public Health, G. Karamanolakis, D. Hsu, and L. Gravano, in the 5th
Workshop on Noisy User-Generated Text (W-NUT 2019), 2019
- Training Neural Networks for Aspect Extraction Using Descriptive Keywords Only,
G. Karamanolakis, D. Hsu, and L. Gravano, in 2nd Learning from Limited Labeled Data
Workshop (LLD 2019), 2019
- Learning Single-Index Models in Gaussian Space,
R. Dudeja and D. Hsu, in 31st Annual Conference on Learning Theory (COLT 2018),
2018
- Discovering Foodborne Illness in Online Restaurant Reviews,
T. Effland, A. Lawson, S. Balter, K. Devinney, V. Reddy,
H. Waechter, L. Gravano, and D. Hsu, in Journal of the American
Medical Informatics Association, vol. 25, no. 12, pages 1586-1592,
Dec. 2018
- Linear Regression
Without Correspondence, D. Hsu, K. Shi, and X. Sun, in Advances
in Neural Information Processing Systems 30, 2017
- Kernel
Ridge vs. Principal Component Regression: Minimax Bounds and
Adaptability of Regularization Operators, L. Dicker, D. Foster,
and D. Hsu, in Electronic Journal of Statistics, vol. 11, no. 1,
pages 1022-1047, 2017
- Correspondence
Retrieval, A. Andoni, D. Hsu, K. Shi, and X. Sun, in Thirtieth
Annual Conference on Learning Theory, 2017
- Using
Online Reviews by Restaurant Patrons to Identify Unreported Cases of
Foodborne Illness — New York City, 2012–2013,
C. Harrison, M. Jorder, H. Stern, F. Stavinsky, V. Reddy, H. Hanson,
H. Waechter, L. Lowe, L. Gravano, and S. Balter, in Centers for
Disease Control and Prevention Morbidity and Mortality Weekly Report
(CDC MMWR), vol. 63, no. 20, pages 441-445, May 2014
- Detecting Foodborne Disease Outbreaks Using Social Media
(demonstration), F. Psallidas, L. Gravano, and many others, in NYC
Media Lab's Annual Summit, 2014
- Information
Extraction from Social Media for Public Health, N. Elhadad,
L. Gravano, D. Hsu, S. Balter, V. Reddy, and H. Waechter, in KDD at
Bloomberg Workshop, Data Frameworks Track (KDD 2014), 2014
Datasets
Presentations
- Devinney K., Bekbay A., Effland T., Gravano L., Howell D., Hsu
D., O'Halloran D., Padhy C., Reddy V., Stavinsky F., Waechter H.,
Balter S. Evaluating Twitter as a Data Source for Foodborne Outbreak
Detection in New York City. International Society of Disease
Surveillance 2018 Annual Conference; January 31-February 2, 2018;
Orlando, Florida
- Devinney K., Bekbay A., Effland T., Gravano L., Howell D., Hsu
D., O'Halloran D., Padhy C., Reddy V., Stavinsky F., Waechter H.,
Balter S. Evaluating Twitter as a Data Source for Foodborne Outbreak
Detection in New York City. Integrated Foodborne Outbreak Response
and Management Conference; November 6-9, 2017; Garden Grove,
California
- Devinney K., Bekbay A., Effland T., Gravano L., Howell D., Hsu
D., O'Halloran D., Padhy C., Reddy V., Stavinsky F., Waechter H.,
Balter S. Evaluating Twitter as a Data Source for Foodborne Outbreak
Detection in New York City. Northeast Epidemiology Conference;
October 18-20, 2017; Northampton, Massachusetts
- Devinney K. Improving Foodborne Complaint and Outbreak Detection
Using Social Media, New York City. Arizona Infectious Disease
Training; July 18-20, 2017; Phoenix, Arizona
- Devinney K., Bekbay A., Effland T., Gravano L., Howell D., Hsu
D., O'Halloran D., Padhy C., Reddy V., Stavinsky F., Waechter H.,
Balter S. Evaluating Twitter as a Data Source for Foodborne Outbreak
Detection in New York City. Council of State and Territorial
Epidemiologists Conference; June 4-8, 2017; Boise, Idaho
- Devinney K., Bekbay A., Howell D., O'Halloran D., Padhy C.,
Reddy V., Stavinsky F., Waechter H., Balter S. An Estimation of
Restaurant-Associated Foodborne Illness Incidents in New York City,
2014-2016. Council of State and Territorial Epidemiologists
Conference; June 4-8, 2017; Boise, Idaho
- Devinney K. Improving Foodborne Complaint and Outbreak Detection
Using Social Media, New York City. New York Conference-Quarterly
Training, Central Atlantic States Association of Food Safety; March
21, 2017; Jamaica, New York
- Devinney K., Bekbay A., Howell D., O'Halloran D., Padhy C.,
Reddy V., Stavinsky F., Waechter H., Balter S. An Estimation of
Restaurant-Associated Foodborne Illness Incidents in New York City,
2014-2016. Northeast Epidemiology Conference; October 20-21,
2016; Saratoga Springs, New York
- Reddy V. Online Chatter for Foodborne Outbreak Detection: Using
Social Media Data to Identify Unreported Complaints and
Outbreaks. Bureau of Communicable Disease Cross-cutting Data
Analysis Meeting; December 2, 2015; New York, New York
- Mansour R., Harris J., Reddy V., Gravano L., Elhadad N., Hawkins
J., Southern K., Stevens J. Identifying Unreported Foodborne
Disease Using Social Media
Data. American Public Health Association Annual Conference;
October 31-November 4, 2015; Chicago, Illinois
- Reddy V. Using Yelp Reviews to Identify Unreported Cases of
Foodborne Illness in New York City. International Association for
Food Protection Annual Meeting; July 25-28, 2015; Portland,
Oregon
Press, Press Releases, Tweets, etc.
Code
X (formerly Twitter) Account, Etc.
Last updated: February 2, 2024