Teaching
Data governance / Responsible data management - Spring 2024 [DRAFT - SUBJECT TO CHANGE]
Short Intro
This course will explore common data collection, management, and sharing practices in information technology and emerging technologies. Students will examine the human, social, and ethical impact of these practices and work on group projects to design data systems centered on broader impact and social responsibility.
Long Intro
This course will explore common data collection, management, and sharing practices in information technology and emerging technologies, such as search engines and AI systems. Students will read papers and engage in discussions about the pros and cons of established data practices and learn about the three main components of responsible data management: 1) consent and ownership, 2) privacy and anonymity, and 3) broader impact.
Students will also practice designing ethical data-driven products through group projects, taking on the roles of UX designers, researchers, and data scientists.
The course will bring in interdisciplinary perspectives through guest speakers from archival science, engineering, and responsible AI to provide a holistic view of broader data ecosystems and infrastructures.
Objective
Students will learn the pros and cons of different data collection, management, and sharing practices.
Students will gain hands-on experience with designing data-driven products or systems as UX designers, researchers, and data scientists.
Students will also be exposed to interdisciplinary research on important ethical considerations about data, such as privacy and consent.
Format
Student-led discussion
Reading list [PRELIMINARY DRAFT]
Wk1: Data and crowdwork
Mark Díaz, Ian Kivlichan, Rachel Rosen, Dylan Baker, Razvan Amironesei, Vinodkumar Prabhakaran, and Emily Denton. 2022. CrowdWorkSheets: Accounting for Individual and Collective Identities Underlying Crowdsourced Dataset Annotation. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT '22). Association for Computing Machinery, New York, NY, USA, 2342–2351. https://doi-org.ezproxy.lib.utexas.edu/10.1145/3531146.3534647
Milagros Miceli and Julian Posada. 2022. The Data-Production Dispositif. Proc. ACM Hum.-Comput. Interact. 6, CSCW2, Article 460 (November 2022), 37 pages. https://doi-org.ezproxy.lib.utexas.edu/10.1145/3555561
Wk2: Labor
Yuling Sun, Xiaojuan Ma, Silvia Lindtner, and Liang He. 2023. Data Work of Frontline Care Workers: Practices, Problems, and Opportunities in the Context of Data-Driven Long-Term Care. Proc. ACM Hum.-Comput. Interact. 7, CSCW1, Article 42 (April 2023), 28 pages. https://doi-org.ezproxy.lib.utexas.edu/10.1145/35
Hanlin Li, Nicholas Vincent, Stevie Chancellor, and Brent Hecht. 2023. The Dimensions of Data Labor: A Road Map for Researchers, Activists, and Policymakers to Empower Data Producers. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency (FAccT '23). Association for Computing Machinery, New York, NY, USA, 1151–1161. https://doi-org.ezproxy.lib.utexas.edu/10.1145/3593013.3594070
Naja Holten Møller, Claus Bossen, Kathleen H. Pine, Trine Rask Nielsen, and Gina Neff. 2020. Who does the work of data? interactions 27, 3 (May - June 2020), 52–55. https://doi-org.ezproxy.lib.utexas.edu/10.1145/3386389
Wk3: Collection
Freelon, D. (2018). Computational research in the post-API age. Political Communication, 35(4), 665-668.
Zimmer, M. (2010). “But the data is already public”: on the ethics of research in Facebook. Ethics and information technology, 12(4), 313-325.
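Paullada, A., Raji, I. D., Bender, E. M., Denton, E., & Hanna, A. (2021). Data and its (dis)contents: A survey of dataset development and use in machine learning research. Patterns, 2(11), 100336.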
Wk4: Subjectivity and biases
Teanna Barrett, Quanze Chen, and Amy Zhang. 2023. Skin Deep: Investigating Subjectivity in Skin Tone Annotations for Computer Vision Benchmark Datasets. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency (FAccT '23). Association for Computing Machinery, New York, NY, USA, 1757–1771. https://doi-org.ezproxy.lib.utexas.edu/10.1145/3593013.3594114
Isaac L. Johnson, Yilun Lin, Toby Jia-Jun Li, Andrew Hall, Aaron Halfaker, Johannes Schöning, and Brent Hecht. 2016. Not at Home on the Range: Peer Production and the Urban/Rural Divide. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI '16). Association for Computing Machinery, New York, NY, USA, 13–25. https://doi-org.ezproxy.lib.utexas.edu/10.1145/2858036.2858123
Wk5: Impact
Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M Aroyo. 2021. “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI '21). Association for Computing Machinery, New York, NY, USA, Article 39, 1–15. https://doi-org.ezproxy.lib.utexas.edu/10.1145/3411764.3445518
Mark Diaz, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle. 2018. Addressing Age-Related Bias in Sentiment Analysis. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI '18). Association for Computing Machinery, New York, NY, USA, Paper 412, 1–14. https://doi-org.ezproxy.lib.utexas.edu/10.1145/3173574.3173986
Wk6: Values
Catherine D'Ignazio and Lauren F. Klein. Data Feminism, "The Power Chapter."
https://data-feminism.mitpress.mit.edu/pub/vi8obxh7/release/4
Wk7: Documentation
Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86-92.
Bandy, J., & Vincent, N. (2021, June). Addressing "documentation debt" in machine learning: A retrospective datasheet for BookCorpus. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).
ArtSheets for Art datasets: https://openreview.net/pdf?id=K7ke_GZ_6N
Wk8: Midterm presentations
Wk9: Sharing and deprecation
Peng, K., Mathur, A., & Narayanan, A. (2021). Mitigating dataset harms requires stewardship: Lessons from 1000 papers. ArXiv, abs/2108.02922.
Alexandra Sasha Luccioni, Frances Corry, Hamsini Sridharan, Mike Ananny, Jason Schultz, and Kate Crawford. 2022. A Framework for Deprecating Datasets: Standardizing Documentation, Identification, and Communication. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT '22). Association for Computing Machinery, New York, NY, USA, 199–212. https://doi-org.ezproxy.lib.utexas.edu/10.1145/3531146.3533086
Wk10: Governance
Carroll, S. R., Garba, I., Figueroa-Rodríguez, O. L., Holbrook, J., Lovett, R., Materechera, S., ... & Hudson, M. (2020). The CARE principles for indigenous data governance. Data Science Journal, 19, 43-43.
Yacine Jernite, Huu Nguyen, Stella Biderman, Anna Rogers, Maraim Masoud, Valentin Danchev, Samson Tan, Alexandra Sasha Luccioni, Nishant Subramani, Isaac Johnson, Gerard Dupont, Jesse Dodge, Kyle Lo, Zeerak Talat, Dragomir Radev, Aaron Gokaslan, Somaieh Nikpoor, Peter Henderson, Rishi Bommasani, and Margaret Mitchell. 2022. Data Governance in the Age of Large-Scale Data-Driven Language Technology. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT '22). Association for Computing Machinery, New York, NY, USA, 2206–2222. https://doi-org.ezproxy.lib.utexas.edu/10.1145/3531146.3534637
Wk11: AI, LLMs, Computer vision
Liang, W., Tadesse, G.A., Ho, D. et al. Advances, challenges and opportunities in creating data for trustworthy AI. Nat Mach Intell 4, 669–677 (2022). https://doi.org/10.1038/s42256-022-00516-1
Morgan Klaus Scheuerman, Katy Weathington, Tarun Mugunthan, Emily Denton, and Casey Fiesler. 2023. From Human to Data to Dataset: Mapping the Traceability of Human Subjects in Computer Vision Datasets. Proc. ACM Hum.-Comput. Interact. 7, CSCW1, Article 55 (April 2023), 33 pages. https://doi-org.ezproxy.lib.utexas.edu/10.1145/3579488
Wk12: Pricing
J. Pei, "A Survey on Data Pricing: From Economics to Data Science," in IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 10, pp. 4586-4608, 1 Oct. 2022, doi: 10.1109/TKDE.2020.3045927.
Tiziano Piccardi, Miriam Redi, Giovanni Colavizza, and Robert West. 2021. On the Value of Wikipedia as a Gateway to the Web. In Proceedings of the Web Conference 2021 (WWW '21). Association for Computing Machinery, New York, NY, USA, 249–260. https://doi-org.ezproxy.lib.utexas.edu/10.1145/3442381.3450136
Wk13: Protests and legal issues
Nicholas Vincent, Hanlin Li, Nicole Tilly, Stevie Chancellor, and Brent Hecht. 2021. Data Leverage: A Framework for Empowering the Public in its Relationship with Technology Companies. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT '21). Association for Computing Machinery, New York, NY, USA, 215–227. https://doi-org.ezproxy.lib.utexas.edu/10.1145/3442188.3445885
Min, S., Gururangan, S., Wallace, E., Hajishirzi, H., Smith, N. A., & Zettlemoyer, L. (2023). SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore. arXiv preprint arXiv:2308.04430.
Wk14: Final presentations