Data quality in research: Gold for one, poison for another


If anyone knows what quality data means, it is Prof. Dr. Carsten Oliver Schmidt. As the director of "Quality in health research" in the "Study of Health in Pomerania – clinical-epidemiological research" (SHIP-KEF) department at the Greifswald University medical faculty, he has to answer questions about data quality for different scientific fields again and again. In this interview with VIEW he shares his knowledge and experience and explains why the perfect data set is a myth.

VIEW: Prof. Schmidt, what makes good data?

Carsten Oliver Schmidt: Look at the ISO 8000 standard first of all. It gives a general definition of the quality of data as follows: a set of inherent features of data that meets the requirements. This definition encapsulates one of the most important points of data quality: whether I can use the data for my purpose. And, depending on the requirement, the features that define high data quality may be very different. For example, are the data to be used to make a diagnostic or therapeutic decision? Or are the data to be used for billing for services or to answer a scientific problem? A data set may be excellent for accounting purposes but completely useless otherwise.

VIEW: But are there not features that every data set should fulfil regardless of the requirements or questions?

Carsten Oliver Schmidt: I can describe how we view the quality of data. We basically consider three levels. First, integrity, i.e. the question of whether data are even available in a structure with which we can work. The syntactic accuracy of the data also falls into this category. If it is not given, I can proceed no further. The second level is completeness. Here we distinguish in some cases between completeness and availability, i.e. the question of whether the data are complete for an observational unit and whether the observational units themselves are available. At the third level we look at whether the data are correct. Here we distinguish between consistency, i.e. the question of whether formal requirements are met, such as no violations of value ranges, and accuracy, which primarily refers to metrological accuracy.

Most quality models are based on this general classification into three categories. However, the underlying levels are conceptually very different. We have developed a concept tree with different levels for observational studies, which we consider useful. However, we make no claim to universal applicability.
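The three levels described above lend themselves to a sequential check: if integrity fails, the later checks cannot even run. The following is a minimal illustrative sketch of that idea, not Prof. Schmidt's actual software; the field names, value ranges, and plausibility rule are hypothetical examples.

```python
# Illustrative sketch of the three-level quality check described in the
# interview. Record structure and rules are hypothetical examples.

records = [
    {"id": 1, "systolic_bp": 128, "diastolic_bp": 82},
    {"id": 2, "systolic_bp": None, "diastolic_bp": 79},  # incomplete
    {"id": 3, "systolic_bp": 65, "diastolic_bp": 110},   # implausible
]

def check_integrity(rec):
    """Level 1: data exist in a workable structure with the expected fields."""
    return isinstance(rec, dict) and {"id", "systolic_bp", "diastolic_bp"} <= rec.keys()

def check_completeness(rec):
    """Level 2: all values for this observational unit are present."""
    return all(v is not None for v in rec.values())

def check_consistency(rec):
    """Level 3 (consistency): formal requirements such as value ranges and
    plausibility rules (systolic above diastolic) are met."""
    s, d = rec["systolic_bp"], rec["diastolic_bp"]
    return 50 <= s <= 250 and 30 <= d <= 150 and s > d

# The sequence matters: each level is only checked once the previous one holds.
for rec in records:
    if not check_integrity(rec):
        print(rec.get("id"), "fails integrity")      # cannot proceed further
    elif not check_completeness(rec):
        print(rec["id"], "fails completeness")
    elif not check_consistency(rec):
        print(rec["id"], "fails consistency")
    else:
        print(rec["id"], "passes all checks")
```

The point of the ordering is exactly what the interview stresses: a consistency check on a record that fails integrity or completeness is meaningless, so the earlier level short-circuits the later ones.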

VIEW: What do you base your concept tree on?

Carsten Oliver Schmidt: The concept tree was developed based on a number of principles. These include a search of the literature for concepts of data quality in our field. One important point of reference was also the TMF guideline for data quality in medical research, which is primarily relevant for registries.
We evaluated this guideline with representatives of various epidemiological cohort studies and determined what could be improved. Another point: our section is one of the very few that has implemented an automated quality report feedback cycle. This is a software-based approach that enables us to assess quality very thoroughly and easily. And finally, we have drawn on our experience to decide which workflows must be retained. This is also why, when analyzing data quality, it is very important to follow a specific sequence of the quality aspects under review. This sequence is mapped in the concept tree.

VIEW: Is it possible to derive "good" data from all data?

Carsten Oliver Schmidt: The quality of the data that can be derived subsequently depends on where and which errors occur. Do they occur during production of the data, such as when measuring blood pressure? Or do they arise in data processing, such as during documentation, archiving and utilization? If errors occur during production of the data, the measured values themselves will be incorrect, and there is little to be done. Particularly with serious systematic errors, no algorithm and no amount of big data will help. This means: garbage in, garbage out. However, deficiencies in data processing can often be corrected. One example is histological results from pathology, which consist primarily of free text. That is very useful for the doctors who base their treatment on the results. In science, however, free text is simply useless. With thousands of results it can easily take months to derive comparable classifications from free text. Nevertheless, it is possible in principle to improve specific aspects of data quality with suitable preparation.
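As a toy illustration of the free-text problem mentioned above, one could derive a coarse, comparable classification from report text by keyword matching. The categories and keywords below are hypothetical; real pathology reports require far more careful, often manual or NLP-based, coding.

```python
import re

# Toy sketch: mapping free-text findings to comparable categories.
# Labels and keyword rules are hypothetical examples, not a real coding scheme.
RULES = [
    ("malignant", re.compile(r"\b(carcinoma|adenocarcinoma|malignant)\b", re.I)),
    ("benign",    re.compile(r"\b(benign|no evidence of malignancy)\b", re.I)),
]

def classify(report: str) -> str:
    """Return the first matching category, or flag the report for review."""
    for label, pattern in RULES:
        if pattern.search(report):
            return label
    return "unclassified"  # would require manual review

print(classify("Invasive adenocarcinoma of the colon."))  # malignant
print(classify("Benign hyperplastic polyp."))             # benign
print(classify("Fragmented tissue, see comment."))        # unclassified
```

The "unclassified" fallback is the crucial part: any rule set this simple leaves a large residue of reports that still need human coding, which is exactly why the interview says such preparation can take months.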

VIEW: In that case you must be pleased about the advances in structured diagnostics...

Carsten Oliver Schmidt: Yes, where structured diagnostics is properly implemented, it is certainly an advantage. However, the question of whether the classifications and measurements are correct still remains open. And this is where we come back to the question of the requirements. In the hospital the requirements for a measurement are different from those in research. We often have limited time for measurements in the hospital. Items such as blood pressure must be recorded quickly. In contrast, we have much more time and resources in research for obtaining reproducible values. For example, we can measure blood pressure after a defined rest period and take at least two, preferably three readings one after the other. That is not possible in the hospital routine.

VIEW: What would you wish for from clinical data?

Carsten Oliver Schmidt: As a scientist I would most like to have clinical data that are immediately re-usable. However, data in the hospital are recorded under different requirements from those of science. The fact that they are not immediately useful for research is initially not a problem for hospitals. A very positive development is that initiatives such as the Medical Informatics Initiative or NFDI4Health are currently addressing precisely these topics of making patient and research data easier to use, and they are developing innovative solutions in this field. If I could wish for more than this, it would be that data protection be implemented in a more practical fashion and that greater weight be placed on the potential benefit to society of the scientific use of data.

VIEW: Thank you for the interview!