Analyzing Artificial Intelligence Methods in Digital Preservation Workflow

16 May 2024 6:09 PM

by Jill K. Sadler

Digital technology has greatly changed preservation management in the archival field, presenting both opportunities and challenges. A significant amount of research in the last five years has investigated how artificial intelligence (AI) and its sub-fields, machine learning (ML) and natural language processing (NLP), can assist archivists in their workflow by addressing privacy concerns and improving access. This article is not about ChatGPT (that’s another conversation for another, very imminent, time); rather, it examines how archivists can ethically steward records using emerging technologies. Ultimately, AI methods are helpful, but archivists remain necessary to provide context and supervision.

Discussions of efficient and meaningful archival processing have been ongoing for decades. Discussions of the archival backlog continue, especially as technology progresses and digitized and born-digital records accumulate beyond what human archivists can address. Digital records don’t need a warehouse for storage: millions of email records can be stored on a USB flash drive or on cloud servers far from a physical archival institution. The vulnerability of this unprocessed data is compounded by technological obsolescence. Without ethical stewardship of these records, privacy and access concerns arise, along with substantial risk to records’ creators, records’ subjects, and institutional trust.

AI methods such as machine learning and natural language processing can process records and text at a scale that surpasses human capability, and so could potentially address the archival backlog. A machine learning algorithm builds a model from training data rather than from rules programmed directly by a human. Natural language processing describes a computer’s ability to process text and speech in a way similar to how humans do. Both methods, combined with ever-increasing computing power, can help archivists analyze and describe records; however, there is still the matter of a computer’s ability to infer context. When considering AI solutions, one should continually ask: can a computer understand contextual nuance and value as well as a human can?

It is important to critically assess how AI tools work and the opportunities and challenges these new tools present to the archival field. One way to do this is to examine AI tools against the archival framework of radical empathy. Michelle Caswell and Marika Cifor’s “From Human Rights to Feminist Ethics: Radical Empathy in the Archives” examines archivists’ responsibilities in their relationships with records’ creators, records’ subjects, records’ users, and larger communities. A radical empathy approach considers positions of power and, more specifically, notes how the archival tradition of preserving records for legal reasons ignores oppression and is not itself a radical empathy approach to recordkeeping.

In “Balancing Care and Authenticity in Digital Collections: A Radical Empathy Approach to Working with Disk Images,” Monique Lassere and Jess M. Whyte provide a balanced perspective on archival processing with AI tools. They specifically place disk image preservation issues within a radical empathy framework, outlining how preserving disk images may harm records’ subjects, donors, and other stakeholders due to privacy concerns. Preserving disk images provides an authentic record with clear provenance and authority; however, this approach considers the disk image from a sociolegal viewpoint. There is also institutional risk in preserving a disk image that contains sensitive data, which may not be processed in a timely fashion or treated with care. The question is one of balancing privacy with access and assessing the risk of jeopardizing one or both. AI tools can help redact sensitive data on disk images, thus protecting the records’ subjects and lowering institutional risk. Once sensitive data has been redacted, the records can potentially be made available to the public per the archival institution’s mission.
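As a rough illustration of the redaction step described above, the following Python sketch works on text that has already been extracted from a disk image. It is not taken from Lassere and Whyte: it uses two simple regular expressions as a stand-in for the AI-based detection they discuss, and the pattern list, the `redact` function, and the `Redaction` record are all hypothetical names chosen for this example. The point it tries to show is that each removal can be logged, so that an archivist can later see what kind of content was redacted and where.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical patterns for illustration only. A real tool would need a much
# broader, domain-specific detector and human review of its output.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


@dataclass
class Redaction:
    """One logged removal: what kind of data, and where it sat in the original."""
    kind: str
    start: int
    length: int


def redact(text: str) -> tuple[str, list[Redaction]]:
    """Replace each match with a labelled placeholder and return an audit log."""
    # Collect every match against the ORIGINAL text so offsets stay meaningful.
    matches = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            matches.append((m.start(), m.end(), kind))
    matches.sort()

    pieces, log, pos = [], [], 0
    for start, end, kind in matches:
        if start < pos:
            # Overlaps a span that was already redacted; skip it.
            continue
        pieces.append(text[pos:start])
        pieces.append(f"[REDACTED:{kind}]")
        log.append(Redaction(kind, start, end - start))
        pos = end
    pieces.append(text[pos:])
    return "".join(pieces), log


if __name__ == "__main__":
    redacted, log = redact("Contact jane@example.org or 613-555-0100.")
    print(redacted)
    for entry in log:
        print(entry)
```

The log records only the type, position, and length of each removal, not the removed value itself, so the report does not reintroduce the sensitive data it documents. Whether that level of detail is sufficient is a decision for the archivist, not the tool.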

Lassere and Whyte make a number of recommendations for archivists who want to use AI tools in their workflow. From a technical standpoint, these tools need to be transparent in how they work: they need clear documentation and reporting of their functionality and usage, clear evidence of content deletion, and ease of adoption and use. However, the authors clearly underscore that radical empathy doesn’t mesh well with automation. Tools can help archivists, but archivists still need training and time to perform acceptable acquisition and oversee the use of these tools. In other words, in a radical empathy framework, archival processes need humans to work alongside technology so that decisions can be made slowly, given their subjective, contextual, and shifting nature. Most importantly, archivists need to collaborate with IT departments, the institution’s administration, records’ subjects, donors, and researchers with the primary goal of mitigating risk for all parties. Perhaps archivists need to acknowledge their acceptable level of risk and collect less: if you can’t ethically steward a record without risk, don’t acquire it.

In “After the Digital Revolution: Working With Emails and Born-digital Records in Literary and Publishers’ Archives,” Lise Jaillant draws similar conclusions when considering dark digital archives that are intended to be public but are locked away due to their sensitive nature. Regarding access, Jaillant recommends collaborating with donors, advocating for open data and improved text mining tools, and improving graduate student training in digital curation and AI. Engaging donors fits into a radical empathy approach: not only is it collaborative, building relationships between archivists and records’ creators and subjects, but it also gives donors agency in determining their own balance between privacy and risk. Jaillant also advocates for researchers working alongside archivists instead of waiting for archivists to process records and make them available. This requires a code of ethics on the part of the researcher and requires that the researcher understand the research context.

In “Unlocking Digital Archives: Cross-disciplinary Perspectives on AI and Born-digital Data,” Lise Jaillant and Annalina Caputo continue to explore how archival institutions prioritize closed archives over access out of risk aversion and a wish to avoid legal challenges. Limiting access to archives in this way, especially on a rights-based argument, doesn’t easily align with a radical empathy framework, as it potentially obscures the oppressed and marginalized voices that may be present in the archives. It potentially limits the power of voices in the archives, limits the experiences of the archival user, and may negatively impact the larger community. The authors advocate for working alongside AI tools as “human scrutiny is not replaced by their algorithmic counterpart, but boosted… [there is] value of digital assisted sensitivity review on both speed and quantity” (2022). Jaillant and Caputo underscore that “archivists do not have to be proficient in technical aspects of AI… but they need to actively participate in this process of ‘assisted review’ of archival documents” (2022). They call for close examination of AI and machine learning tools because the exact ways in which these tools and their algorithms are created and function are typically obscured.
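One way to picture the “assisted review” idea is as a queue that a tool orders and a human works through. The Python sketch below is an assumption-laden illustration, not Jaillant and Caputo’s method: a small, hypothetical term list stands in for a trained classifier, and the names `SENSITIVE_TERMS`, `ReviewItem`, `score_document`, and `build_review_queue` are invented for this example. The tool never releases or withholds anything on its own; it only sorts documents so that an archivist sees the likeliest sensitive items first, and it records why each item was scored as it was.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical term list; it stands in for the output of a trained classifier.
SENSITIVE_TERMS = {"diagnosis", "salary", "password"}


@dataclass
class ReviewItem:
    doc_id: str
    score: float  # share of the hypothetical terms found in the document
    reasons: list[str] = field(default_factory=list)  # which terms matched


def score_document(doc_id: str, text: str) -> ReviewItem:
    """Score one document and keep the matched terms as a visible rationale."""
    words = {w.strip(".,;:!?").lower() for w in text.split()}
    hits = sorted(words & SENSITIVE_TERMS)
    return ReviewItem(doc_id, len(hits) / len(SENSITIVE_TERMS), hits)


def build_review_queue(docs: dict[str, str]) -> list[ReviewItem]:
    """Order every document for human review, highest score first.

    No document is dropped or auto-approved: the archivist still decides.
    """
    items = [score_document(doc_id, text) for doc_id, text in docs.items()]
    return sorted(items, key=lambda item: item.score, reverse=True)


if __name__ == "__main__":
    queue = build_review_queue({
        "a": "Lunch on Friday?",
        "b": "Her salary and diagnosis are attached.",
    })
    for item in queue:
        print(item.doc_id, round(item.score, 2), item.reasons)
```

Keeping the matched terms alongside each score is a small gesture toward the transparency the authors call for: a reviewer can see why an item was ranked highly rather than trusting an opaque number.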

Jaillant and Caputo’s analysis segues nicely into an analysis made by Stephanie Decker et al. in “Finding Light in Dark Archives: Using AI to Connect Context and Content in Email.” Having humans work alongside machines to help provide context is key, especially with complex email archives, which seem to be particularly troublesome due to their inherent privacy risks, their networked nature, the ease with which information can be decontextualized, and the volume of data.

Decker et al. provide a comprehensive, technical description of how AI methods can be applied to email archives to aid in searching, contextualization, and, ultimately, access, but note that success depends on whether the archives have been rendered in a machine-readable form. Lassere and Whyte also emphasize the importance of domain-specific AI tools, especially ones designed for archives, as a tool may be less effective if it cannot properly contextualize the data it is meant to analyze. Decker et al. emphasize the need for researchers to help provide access to records, and for cross-disciplinary collaboration to improve or provide meaningful access. This is yet another reason why interdisciplinary collaboration between archivists and AI scientists is so crucial. Already, just in this article, one can see that scholars who are interested in this intersection of AI and archives come from varied backgrounds: Caputo is a computer scientist, Decker is a business scholar, Jaillant is a digital humanist, Lassere is a digital archivist, and Whyte is a digital assets librarian. Given the multidisciplinary nature of the AI field, one can expect to see more collaborations among experts in history, anthropology, linguistics, business, computer science, archives, and more.

Recent scholarship shows how artificial intelligence methods can be applied in digital preservation workflows and provides a perspective on ethical stewardship of records while using emerging technologies. Jaillant and Caputo expertly summarize what a radical empathy approach to using AI tools in archives looks like: “a framework of AI governance informed by well-developed language and procedures of consent, power, inclusivity, transparency… cross-disciplinary collaborations, [and] close attention to ethical principles” (2022). With intentional and patient management and supervision, implementing AI methods in digital preservation workflows can help connect archivists with records’ creators, subjects, and users, as well as larger communities, in a responsible manner.

References

Caswell, M. & Cifor, M. (2016). From Human Rights to Feminist Ethics: Radical Empathy in the Archives. Archivaria 81, 23–43. https://www.muse.jhu.edu/article/687705

Decker, S., Kirsch, D.A., Kuppili Venkata, S. et al. (2022). Finding Light in Dark Archives: Using AI to Connect Context and Content in Email. AI & Soc 37, 859–872. https://doi.org/10.1007/s00146-021-01369-9

Jaillant, L. (2019). After the Digital Revolution: Working with Emails and Born-digital Records in Literary and Publishers’ Archives. Archives & Manuscripts, 47(3), 285–304. https://doi.org/10.1080/01576895.2019.1640555

Jaillant, L. & Caputo, A. (2022). Unlocking Digital Archives: Cross-disciplinary Perspectives on AI and Born-digital Data. AI & Soc 37, 823–835. https://doi.org/10.1007/s00146-021-01367-x

Lassere, M. & Whyte, J. M. (2021). Balancing Care and Authenticity in Digital Collections: A Radical Empathy Approach to Working with Disk Images. Journal of Critical Library and Information Studies 3, 1–25.



Copyright © 2022 - The Association of Canadian Archivists