Words from Stephen Lindemann, PhD.

“Whether they should let me be a law.
How I hope and pray that they will,
But today I am still just a bill.”
- “I’m Just a Bill,” Schoolhouse Rock
As every child of the 80s (and much of the 70s and 90s) learned while watching Saturday morning cartoons, the process by which a bill becomes a law is fraught with peril: a series of steps, at any of which the bill might face sudden death or long languishing in purgatory. A similar minefield must be traversed by data on its path to becoming knowledge.
My last entry focused on how data is born, which we saw is a messy process that depends greatly on its midwife (that is, the researcher collecting it). However, data do not just “grow up” into new knowledge. Data are typically just a bunch of numbers on their own, with the occasional image thrown in (which, in digital form, is also just a matrix of numbers). Knowledge does not arise from them on their own, no matter how many data we collect – there they silently sit and wait, in notebooks, spreadsheets, or wherever else they were recorded. Turning data into knowledge requires analysis, and that analysis takes a multitude of forms depending upon the data type and the question. Analogously to a bill becoming a law, data must be carefully shepherded through this process of becoming interpretable to humans as knowledge; it is at this point that we learn something new about the world.
For most, the idea that analyzed data lead to insights is natural. We live in a world awash with data, and commonplace analysis of those data connects us with friends and sells us things. However, although the process seems intuitive (What have you clicked on? You get more of that.), few of us could say anything about how those data link together to drive “the algorithm” to make the next suggestion. And, frankly, perhaps even fewer of us care. With respect to scientific data, however, careful attention to this process is vital to our confidence in the knowledge that we glean. We expect that this knowledge arises according to standard practices, well-regulated by the scientific community through peer review (more on that in future entries).
“There are three kinds of lies: lies, damn lies, and statistics.”
- Mark Twain, probably
In reality, though statistics has a dense network of formal rules and procedures for determining whether an experiment yields a significant result, the proper application of those rules sometimes requires judgment calls. Much of the time, proper application is clear – a Student’s t test here, a Tukey multiple-comparison test or a Bonferroni correction there. In these cases, misapplication of statistics is very frequently caught by reviewers and corrected. Though Twain may have famously hated statistics, they routinely deliver very trustworthy results; his point, however, is that they can also be intentionally or unintentionally misleading, especially to those who do not understand them well. The field of statistics itself has had to evolve to account for the dimensionality of ‘omics data (like genomics, transcriptomics, proteomics, metabolomics), which measure thousands of parameters (or more) simultaneously – a problem that co-evolves with technical innovation in sequencers and mass spectrometers. Rather than the single result given by, say, a spectrophotometer, these instruments can give trillions of data points per run. That’s not something you can just run through an automatic analysis tool in Excel.
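To make the multiple-comparisons problem concrete, here is a minimal sketch in Python, using simulated numbers rather than any real ‘omics measurements: when thousands of features are tested at once, spurious “significant” results appear unless a correction such as Bonferroni is applied.

```python
# A minimal sketch of the multiple-comparisons problem in high-dimensional data.
# All numbers here are simulated; no real 'omics measurements are involved.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_features = 10_000          # e.g., transcripts or metabolites measured per sample
n_samples = 6                # samples per group

# Two groups drawn from the SAME distribution: no feature is truly different.
group_a = rng.normal(size=(n_features, n_samples))
group_b = rng.normal(size=(n_features, n_samples))

# One Student's t test per feature.
_, p_values = stats.ttest_ind(group_a, group_b, axis=1)

alpha = 0.05
naive_hits = np.sum(p_values < alpha)                    # expect ~5% false positives
bonferroni_hits = np.sum(p_values < alpha / n_features)  # Bonferroni-corrected threshold

print(f"'Significant' features without correction: {naive_hits}")
print(f"'Significant' features with Bonferroni:    {bonferroni_hits}")
```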
“Sometimes these scripts are provided alongside manuscripts in peer review or in published papers. Sometimes not.”
Instead, files this large commonly require automated processing – pre-processing, subsampling, or formatting – before being fed into specialized statistical tools for analysis. The size of these files makes it impossible to do otherwise than process their millions of lines algorithmically – no human can double-check each line. Some of that processing is semi-standardized through bioinformatic tools shared across the community, with some variance across laboratories. Typically, multiple competing tools circulate that differ slightly in theory or application and yield slightly different outcomes; many are regarded by the community as giving valid results, though scientists have their favorites. Some of this code is written and used by individual laboratories or small networks thereof. Some of it is written by individuals who can code, for themselves. Sometimes these scripts are provided alongside manuscripts in peer review or in published papers. Sometimes not. In either case, they may be referenced as a “custom script” in a manuscript, or not (I’ve written this very thing in previous manuscripts). Sometimes scripts may be shared across laboratories or networks thereof. Along the way, the code can be modified. Those who modify it may provide comments in the code that explain what parts of it do or what modifications have been made over time. Previous versions of scripts may or may not be maintained, either publicly or privately.
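For a sense of what such a lab-written “custom script” can look like, here is a hypothetical sketch in Python; the file names, column layout, and abundance cutoff are all invented, and whether a script like this is commented, versioned, or shared is entirely up to its author.

```python
# custom_filter.py -- hypothetical example of a lab's "custom script".
# v2 (2021-03): lowered the abundance cutoff from 10 to 5 reads (see lab notebook).
# Input/output paths and the cutoff below are illustrative, not from any real pipeline.
import csv

MIN_COUNT = 5  # features seen fewer than this many times are dropped

def filter_counts(in_path: str, out_path: str) -> None:
    """Keep rows whose total count across samples meets MIN_COUNT."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t")
        header = next(reader)
        writer.writerow(header)
        for row in reader:
            counts = [int(x) for x in row[1:]]   # column 0 is the feature ID
            if sum(counts) >= MIN_COUNT:
                writer.writerow(row)

if __name__ == "__main__":
    filter_counts("raw_counts.tsv", "filtered_counts.tsv")
```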
None of this is inherently bad or wrong. The people who wrote the scripts and executed them may have done everything perfectly. The people with whom they are shared may understand exactly how they work and modify them in helpful ways, providing good comments so that the next generation (or their future selves) can easily understand what they do. Labs may maintain scripts in public repositories like GitHub, complete with tight version control and documentation. The real question is whether the current system is reproducible. That is, does the identical dataset, analyzed by different scientists in the same or in different ways, yield the same insights, the same knowledge? In one sense, this is a question of focus – scientists report on the aspects of their datasets they personally find interesting. In another sense, it is a question of numbers and tests – different statistical approaches may provide different confidence levels that two values are, indeed, different from one another. In both regards, the outputs from high-dimensional biological datasets may be very different across individual scientists, labs, and analysis regimes.
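To see how the choice of statistical approach alone can shift confidence, consider a small sketch (again with simulated data) in which the same two groups are compared with a parametric test and a rank-based one; the two p-values will generally differ, and near a significance threshold that difference can change the reported conclusion.

```python
# Same (simulated) dataset, two defensible tests, two different p-values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.lognormal(mean=0.0, sigma=0.8, size=8)   # skewed data, as 'omics
group_b = rng.lognormal(mean=0.5, sigma=0.8, size=8)   # abundances often are

_, p_ttest = stats.ttest_ind(group_a, group_b)                    # parametric
_, p_mwu = stats.mannwhitneyu(group_a, group_b,
                              alternative="two-sided")            # rank-based

print(f"Student's t test p-value:    {p_ttest:.3f}")
print(f"Mann-Whitney U test p-value: {p_mwu:.3f}")
# Neither choice is "wrong", but near a 0.05 threshold the choice of test
# can determine whether a difference is reported as significant.
```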
“large amounts of published bioinformatics algorithms become… impossible to use after publication”
We submit that the path forward is for the raw data – alongside the computation used to process and analyze them – always to be recorded and made available to the scientific community when those data are publicly shared. The goal is that another scientist should be able to analyze the same dataset in the same way the original authors did and arrive at the same statistically significant conclusions. All of the scripts and computational tools should be accessible. Further, another scientist should have the metadata required to analyze the dataset in a different way, for a different purpose – potentially leading to novel insights and new knowledge. To get there, we submit we will need substantially improved documentation of the analysis pipelines being used. We will need tight and useful metadata standards. We will need large repositories of code, with versioning, so we know which version was run for each analysis. Much of this already exists and is maintained by incredibly conscientious scientists who are harsher critics of themselves than anyone else is of them. But these efforts are, to a small or large degree, disconnected from one another across laboratories, and connecting across labs to align these practices remains challenging. Simply maintaining the code in useful form as time marches on is difficult enough – large amounts of published bioinformatics algorithms become unfindable, difficult, or impossible to use after publication.
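One lightweight step in this direction – sketched here with invented field names rather than any existing metadata standard – is to write, next to every analysis output, a small provenance record capturing the exact input, script version, parameters, and software environment used to produce it.

```python
# A minimal, hypothetical provenance record written alongside each analysis output.
# The field names are invented for illustration; they follow no existing standard.
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone

def sha256_of(path: str) -> str:
    """Content hash, so the exact input file and script version can be identified later."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_provenance(input_path: str, script_path: str, params: dict, out_path: str) -> None:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python_version": sys.version,
        "platform": platform.platform(),
        "input_file": {"path": input_path, "sha256": sha256_of(input_path)},
        "script": {"path": script_path, "sha256": sha256_of(script_path)},
        "parameters": params,
    }
    with open(out_path, "w") as f:
        json.dump(record, f, indent=2)

# Hypothetical usage alongside the filtering script sketched earlier:
# write_provenance("raw_counts.tsv", "custom_filter.py", {"min_count": 5},
#                  "filtered_counts.provenance.json")
```

A record like this does not by itself align practices across labs, but it makes “which version was run for each analysis” answerable after the fact.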
We submit that what is needed for the future of ‘omics is a new data ecosystem that facilitates the sharing of data, metadata, and the code used in analyses. Thus, the first step is making it easier to perform reproducible – and reusable – ‘omics analyses. At Liminal, our goal is to make this process more straightforward and less demanding of human time – so that, hopefully, we as a community will do a better job of it.