A journey from the Land of 10,000 Lakes

Lessons learned and strategies for data lakes at the Minnesota Department of Health

The Minnesota Department of Health began building data lakes in 2020 for different streams of data, such as COVID-19 test and case data, immunization, syndromic surveillance, and claims data, to give public health professionals better access to the many sources of data they need.

Promising practices

  • Develop governance strategies, such as defining areas of ownership, before implementing a data lake.
  • Invest in tools to create accessible user views for a data lake. This will lessen the need for formal training and help people learn to engage with a data lake quickly.
  • Start with a solid use case or pilot for a data lake to illustrate the value and culture of sharing data in a data lake.
  • When starting out, think big and build small. Envision what data can do, then try to build something small.

Getting started

For years, sharing diverse kinds of public health data across the Minnesota Department of Health (MDH) involved time-consuming, manual processes. Access to data from across the health department’s many programs was limited, and combining data was cumbersome. The agency knew there was a better way to share and use data, and in 2018 it launched the Office of Data Strategy and Interoperability to advance enterprise data strategies and data exchange across the health department’s programs and with external partners.

The data strategy team began considering data lakes, centralized systems that can store and process diverse kinds of structured and unstructured data. In early 2020, the agency sent a technical team to Seattle to work with Amazon Web Services on creating a data lake for combining trauma data with death records. This pilot proved helpful when the team had to pivot quickly and build a new lake to handle the volume of COVID data coming in. The data lakes allowed MDH to quickly connect immunization and COVID case data and find pockets of the population where cases were high and immunization coverage was low.
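To make the idea of connecting two data streams concrete, here is a minimal sketch in Python. It is not MDH's implementation: the region names, figures, thresholds, and the `flag_priority_regions` helper are all hypothetical and chosen only for illustration. It shows how case rates and immunization coverage keyed by the same region could be combined to flag areas where cases are high and coverage is low.

```python
# Hypothetical, made-up figures for illustration only; this is not MDH data.
case_rates = {"Region A": 420.0, "Region B": 150.0, "Region C": 510.0}  # cases per 100,000
coverage = {"Region A": 0.48, "Region B": 0.71, "Region C": 0.82}       # share immunized


def flag_priority_regions(case_rates, coverage, min_case_rate, max_coverage):
    """Return regions found in both datasets with high case rates and low coverage."""
    shared_regions = case_rates.keys() & coverage.keys()  # only regions present in both
    return sorted(
        region
        for region in shared_regions
        if case_rates[region] >= min_case_rate and coverage[region] <= max_coverage
    )


# The thresholds below are assumptions, not public health guidance.
priority = flag_priority_regions(case_rates, coverage, min_case_rate=400.0, max_coverage=0.60)
print(priority)
```

In a real data lake the two inputs would be tables queried from separate streams rather than in-memory dictionaries, but the step is the same: join on a shared key, then filter on both measures.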

Aasa Dahlberg Schmit, who served as the director of the Office of Data Strategy and Interoperability at MDH from 2019 to March 2023, says, “I don’t even think we could have done what we did for COVID without the data lake. I can’t even wrap my head around how we would’ve managed the pandemic if we hadn’t had a data lake, just with all the volume of data, how we needed to combine the data, and all the people that needed access to it.”


Data lakes are designed to allow connections between them, which MDH staff call “boundary waters.” In Minnesota, all but one of these separate streams of data are connected.

Governing data lakes

When MDH began implementing the data lakes, there was no time to put formal governance structures in place to set policies for who could access what data. Schmit became the owner of the data lakes, receiving requests from staff and confirming on an ad hoc basis who could retrieve and combine the data. She says the arrangement was not perfect, but it worked because the technology needed an overarching owner. “The data lakes grew faster than there was time to do governance,” she says.

In 2022, the team developed governance questions for the data lakes around ownership, access, content, infrastructure and support. They have begun working on answers to these questions, along with a roadmap and implementation plan. Schmit advises public health agencies to have some governance structures in place before creating data lakes—including determining who decides access to the lake and who determines what goes in the data lake. She advises thinking through questions like “How do you want to do the ownership? How do you decide if you’re going to change the technology in the data lakes? What are the rules if you start combining data?”
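One way to picture these governance questions is as a small data structure that records the answers. The Python sketch below is purely illustrative and does not describe MDH's tooling: the `LakeGovernance` class, its fields, and the role and lake names are hypothetical. It captures three of the questions above: who owns a lake, who decides access, and which lakes may be combined.

```python
from dataclasses import dataclass, field


@dataclass
class LakeGovernance:
    """Hypothetical record of governance decisions for one data lake."""
    name: str
    owner: str                                                # who decides access
    approved_readers: set[str] = field(default_factory=set)
    combinable_with: set[str] = field(default_factory=set)    # lakes this one may be joined with

    def grant_access(self, requester: str, approved_by: str) -> None:
        # Only the designated owner may approve a request.
        if approved_by != self.owner:
            raise PermissionError(f"{approved_by} does not own {self.name}")
        self.approved_readers.add(requester)

    def can_read(self, user: str) -> bool:
        return user == self.owner or user in self.approved_readers

    def can_combine(self, user: str, other: "LakeGovernance") -> bool:
        # Combining requires a rule on both lakes and read access to both.
        rules_allow = other.name in self.combinable_with and self.name in other.combinable_with
        return rules_allow and self.can_read(user) and other.can_read(user)


# Hypothetical lakes and roles, for illustration only.
covid = LakeGovernance("covid_cases", owner="data_director", combinable_with={"immunization"})
immunization = LakeGovernance("immunization", owner="data_director", combinable_with={"covid_cases"})
covid.grant_access("epi_analyst", approved_by="data_director")
print(covid.can_combine("epi_analyst", immunization))  # analyst lacks immunization access
```

Writing the rules down, even in a form this simple, forces the decisions the questions ask for: every lake needs a named owner, and combining data needs an explicit rule on both sides.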

Community impact

Schmit says it was helpful that she was located in the executive office and had access to top leadership who supported the data lakes. She hopes that there will be sustainable funding for data modernization, especially for staff. Without knowing whether long-term funding is available, it has been challenging to hire for permanent positions.

Despite these concerns, she says the data lakes have had a major impact. MDH can now overlay different datasets to gain new insights, and the agency was able to respond rapidly to COVID and other outbreaks.

She envisions a time when public health agencies can expand data lakes with newer tools, such as AI tools to find new information that will affect communities. “Utilizing technology to do more of the work will free up our epidemiologists to learn from the data, and take action based on their findings…if done right, data modernization will free up people to learn and use data in a different way, to draw conclusions, to tell stories with data, instead of just massaging, manipulating, fixing and entering data.”

Key Takeaways

  • Data lakes are central repositories designed to store, process and secure data in different formats. They can make data easier to find, access and combine.
  • Governance structures should ideally be in place before implementing a data lake. In time-constrained situations, at a minimum, determine who will act as the owner of the data lake.
  • Having buy-in from leadership and starting with a solid use case help staff embrace a new culture of sharing data in a data lake.
The Public Health Informatics Institute logo