How three journalists got answers when there were no public datasets to be had

Published on
December 3, 2020

Easily searchable, online databases don’t always exist, particularly when it comes to government data. So, many news outlets nowadays are creating their own.

Three veteran data journalists shared how they do that this week at the online 2020 Data Fellowship, stressing the importance of transparency, flexibility and collaboration.

Investigative data reporter Jill Castellano of inewsource, a news nonprofit in San Diego, said she set out to build her database after learning that the federal government’s independent watchdog, the Office of Special Counsel, had ruled that the Department of Veterans Affairs had insufficiently investigated an allegation that a VA researcher was collecting liver samples from sick veterans without their consent.

She wondered how common these inadequate investigations were, but there was no database of the Office of Special Counsel’s findings. So she made her own, by going through hundreds of documents dating back years on the agency’s website and categorizing the results in a spreadsheet.

Castellano had to decide how to label the agency’s findings when they weren’t always black and white. She also had to change some of her data categories partway through the project.

“I knew I was making judgment calls,” she told this year’s data fellows via Zoom on Tuesday.

But she emphasized transparency with her data methodology, a critical step for this type of journalism. She was clear on why exactly inewsource listed investigations as reasonable vs. unreasonable. She used that openness to refute criticism about her reporting from the VA. Her work eventually prompted Congress to investigate the VA for shoddy internal investigations.

She said that even though her data work took time, she didn’t worry about getting scooped.

“My reporting is so specialized, and I know that I've carved out a niche,” she said. “Especially if I'm building my own data set. Once I get far enough along that process, I know no one can catch up to me. ... So that's what I love about working with data. It's way less easy to come in and swoop in and steal a story.”

Nicole Hayden, health reporter for The Desert Sun in Palm Springs, wanted to track the efficacy of California’s Project Roomkey program, which has sought to get homeless people into hotels and motels during the pandemic and eventually into permanent housing.

She learned the state wasn’t tracking how many of these individuals are connected with permanent housing, so she decided to survey counties herself and create her own data set.

“We accounted for 6,000 more people who received Roomkey placements that the state didn't have record of,” she told the fellows Tuesday. “So that meant 6,000 additional people that could end up back on the street right around Christmastime.” (The program is winding down.)

Hayden was open with her readers about the fact that only about three-fourths of counties in California responded to her survey request. “You want to be very clear about how you compiled (the data),” she said.

Hayden has also used data from surveys to report on the problem of sexual harassment at the Coachella and Stagecoach music festivals and the health needs of homeless people, consulting with experts to draft her survey questions.

While she said she typically aims for a large enough sample size to get a margin of error less than 5%, she noted that journalists can be more “squishy” than academic researchers as long as they fully disclose their methodology.
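For readers who want to see the arithmetic behind a target like that, here is a minimal sketch in Python. It assumes a simple random sample, a 95% confidence level (z = 1.96) and the most conservative proportion of 50%; those assumptions and the function names are illustrative and are not drawn from Hayden's own methodology.

```python
import math


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample.

    Assumes a 95% confidence level (z = 1.96) and, by default, the
    worst-case proportion of 0.5, which gives the widest margin.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


def min_sample_size(target_margin, proportion=0.5, z=1.96):
    """Smallest sample size whose margin of error is at or below the target."""
    return math.ceil(z ** 2 * proportion * (1 - proportion) / target_margin ** 2)


# Respondents needed for a margin of error of 5 percentage points
print(min_sample_size(0.05))

# Margin of error for a survey with 385 respondents
print(round(margin_of_error(385), 4))
```

Under these assumptions, a survey needs several hundred respondents to reach a 5% margin. The formula does not account for small populations or for surveys where respondents choose whether to take part, which is one reason disclosing the methodology matters.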

Visual journalist and developer Katie Park of The Marshall Project, a nonprofit newsroom that covers criminal justice, explained how she helped construct an interactive database of the number of COVID-19 cases and deaths at prisons across the country.

“When you look at the conditions that allow for heightened disease transmission, (prisoners) have less access to hygiene, denser populations, less access to medical care,” she said. “So we knew that this was going to be a problem for prisons.”

Her news organization partnered with The Associated Press to assign a reporter to each state department of corrections as well as federal prisons to get weekly COVID updates.

“With covering the criminal justice system, there aren't a lot of nationwide, comprehensive data sources out there,” she said. “We talk about the U.S. criminal justice system. But what we're really talking about is 50 different states with completely different systems and the Federal Bureau of Prisons.”

So the data had to be translated so it could be comparable. Some states were more forthcoming with the information than others, and her newsroom even caught a flaw in one state’s data.

The project is ongoing. The Marshall Project uses its own open-source reporting tool, Klaxon, to track updates in each of the states, and programming scripts to continuously process and verify the data. The website always states there has been “at least” a given number of coronavirus cases and deaths to acknowledge the data could be incomplete.
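As a rough illustration of what one such verification script could look like, the Python sketch below flags weeks in which a cumulative total goes down, which usually points to a correction or a reporting error, and formats a total with the "at least" wording. This is a hypothetical example, not The Marshall Project's code; the function names and the sample numbers are invented for illustration.

```python
def find_decreases(weekly_totals):
    """Return (week_index, previous, current) for each drop in a cumulative series.

    Cumulative case counts should never fall from one week to the next,
    so any drop is worth a call to the agency that reported it.
    """
    flagged = []
    for i in range(1, len(weekly_totals)):
        if weekly_totals[i] < weekly_totals[i - 1]:
            flagged.append((i, weekly_totals[i - 1], weekly_totals[i]))
    return flagged


def at_least(total):
    """Format a count in a way that acknowledges the data may be incomplete."""
    return f"at least {total:,}"


# Made-up weekly cumulative totals for one hypothetical state
totals = [10, 25, 22, 40]
print(find_decreases(totals))
print(at_least(totals[-1]))
```

A check like this does not prove the numbers are right. It only surfaces entries that need a human to follow up.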

“We, like many news organizations, have really had to learn as we go along,” Park said. “And I am sure that everyone is aware of the kind of data issues that we see with COVID testing in general. So … trying to find meaningful metrics to help us tell a story has been a learning process. We've had to kind of think about new ways of categorizing data as time goes on.”

Moderator MaryJo Webster, data editor for the Star Tribune in Minneapolis, noted that journalists like Park, who has a background in programming and interactive graphics, have a skill set that’s in high demand. “Being able to build data visualizations, particularly ones that are interactive or animated, it requires a special set of coding skills that not enough people have in journalism, and every news organization wants them very badly right now,” Webster said.

Webster asked the three data journalists what they would change if they could go back in time and start their projects over.

“I would say factor in more time than you think you need,” Castellano said. “Make sure you also factor in time for going back and starting over once you realize that some of what you did doesn't quite fit and you know that you've got to change your definitions a little bit. ... And factor in however much time you're going to need for fact checking. … We had someone else basically repeat everything that I had done.

“Don't get yourself in a position where you're rushing at the end and you're making last-minute decisions that could put your project at risk.”

Hayden agreed that reporters should allot extra time for a project. She also recommended partnering with other reporters in the newsroom on tasks like fact-checking.

Said Park: “I would tell myself to prepare for this project to be going on much longer than what I initially thought. I know that might sound specific to collecting data about an ongoing pandemic, but I think it speaks to the sort of flexibility that you need when you're thinking about starting out with a data project.”