Evaluating information & data: Assess original dataset

Assess the original dataset

When we find quantitative information such as a number, a table, or a whole dataset, we want to be able to judge its quality so that we can select the right information.

Quality needs to be defined more precisely. There are several aspects:

  • Creation - If data is created by a measuring device, the quality of the data depends upon the accuracy of the measuring device. If data is produced by measuring human behaviour, the quality depends upon the way the experiment is conducted (methodological issues).
  • Research data management - Research data management is about metadata, file formats and access.
    • Metadata: data describing the data. Examples are a log, descriptions of variables, questionnaires, readme files about folder structures, etc.
    • File formats: the file format you choose affects how you can use your data. It can determine which software you need to open your files, and it influences how easy the files will be to use in, say, 10 years. A data repository like DANS has preferred file formats. Choosing one of these preferred formats guarantees support for converting files in the future if necessary.
    • Access: ownership and privacy can be issues.
  • Usefulness - Assessing the quality of data in terms of its usefulness for research purposes (scholarly merit) is difficult, since it basically depends on your research questions. Peer review of a dataset might be an indication of academic status. Very practical indicators of a dataset's popularity are download statistics and data citations.

The problem with “original data”

One of the biggest hurdles you will encounter when trying to assess the original data is actually getting access to the raw data. The facts and figures that appear in peer-reviewed journal articles often do not include the raw data, which would show how the data was entered into the system, what was included or excluded, whether the data was manipulated in any way before being plotted or graphed, or even whether some of the data was faked. The lack of access to raw data is one of the leading issues in science, as Professor Tsuyoshi Miyakawa of Fujita Health University, Editor-in-Chief of the journal Molecular Brain, discussed in a 2020 article:


I have handled 180 manuscripts since early 2017 and have made 41 editorial decisions categorized as “Revise before review,” requesting that the authors provide raw data. Surprisingly, among those 41 manuscripts, 21 were withdrawn without providing raw data, indicating that requiring raw data drove away more than half of the manuscripts. I rejected 19 out of the remaining 20 manuscripts because of insufficient raw data. Thus, more than 97% of the 41 manuscripts did not present the raw data supporting their results when requested by an editor, suggesting a possibility that the raw data did not exist from the beginning, at least in some portions of these cases (Miyakawa 2020: 1).
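The percentages in this quotation can be checked from the counts it reports. The short Python sketch below only redoes that arithmetic; the variable names and the `share` helper are illustrative and do not come from the article.

```python
# Counts taken from the Miyakawa (2020) quotation above.
requested = 41   # manuscripts for which the editor requested raw data
withdrawn = 21   # withdrawn without providing raw data
rejected = 19    # rejected because the raw data was insufficient


def share(part, whole):
    """Return part/whole as a percentage (illustrative helper)."""
    return 100 * part / whole


remaining = requested - withdrawn          # manuscripts not withdrawn
without_raw_data = withdrawn + rejected    # no adequate raw data presented

print(f"Remaining after withdrawals: {remaining}")
print(f"Withdrawn: {share(withdrawn, requested):.1f}%")
print(f"No adequate raw data: {share(without_raw_data, requested):.1f}%")
```

The withdrawn share comes out just above 51%, matching "more than half", and the combined share of withdrawn and rejected manuscripts comes out above 97%, matching the figure in the quotation.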


Data can be manipulated in several ways to make it appear significant, or it can even be faked, but such manipulation and fakery are only visible if one has access to the raw data. What this means for you is that when you assess the data from an article, even if you follow the steps above, judging its quality is still no easy task. What you are looking at is unlikely to be the raw data, and measuring the trustworthiness of that data is a problem the scientific community as a whole currently faces.


Source: Miyakawa, T. (2020). No raw data, no science: another possible source of the reproducibility crisis. Molecular Brain, 13(1), 1-6.

Exercise: Assess the original dataset

In this exercise you assess the original dataset.