Quality control in the evaluation reports

The usefulness of an evaluation depends on its credibility, which in turn relies on the transparency of the process and on the quality of the evaluation.

The quality of the evaluation has to be checked at four levels: the terms of reference, the evaluation process, the evaluation report, and the dissemination and feedback of the evaluation.

The present paper deals only with the quality of the evaluation report.

Origin of the quality grid

In 1999, the Directorate General Regio of the European Commission proposed, in its guidelines on evaluation (the MEANS collection), a quality grid for assessing the quality of evaluation reports. Also in 1999, the Directorate General Agriculture decided to use this quality grid systematically for assessing its evaluation reports. In 2001, the use of the grid was made mandatory for all European Commission services, as one of the standards to be applied by the evaluation function in each Directorate General. The External Relations Directorates have applied the quality grid described below to all evaluations launched from 2002 onwards.

Content of the quality grid

The quality grid is filled in twice: a first time to assess the draft final report, and a second time to assess the final report in its publication form. The quality grid is published on the website alongside the evaluation report.

The quality assessment ensures that the evaluation report complies with professional standards and meets the information needs of the intended users. The quality grid is presented in the terms of reference, so the external evaluators know from the start the basis on which they will be assessed.

The quality assessment of the final report has three main aims:

  • Ensuring that the external evaluation team has fulfilled the commissioning service's requirements.
  • Distinguishing the conclusions that are valid and well grounded from those that are less so and must therefore be used with caution.
  • Ensuring that the evaluation withstands the criticism inevitably generated by value judgements on successes and failures.

Nine quality criteria

The quality grid is based on nine criteria which are independent of each other and are examined for their own sake:

1. Meeting needs:

does the report adequately address the information needs of the commissioning body and fit the terms of reference?

2. Relevant scope:

are the rationale of the intervention and its set of outputs, results and impacts examined fully, including both intended and unexpected policy interactions and consequences?

3. Defensible design:

is the evaluation design appropriate and adequate to ensure that the full set of findings, along with methodological limitations, is made accessible for answering the main evaluation questions?

4. Reliable data:

are the primary and secondary data selected adequate? Are they sufficiently reliable for their intended use? When statistical data are missing, are qualitative data meaningful? When only weak data are available, have the evaluators explained their weaknesses and the limits of their use?

5. Sound data analysis:

is the analysis of quantitative and qualitative data appropriately and systematically done so that evaluation questions and judgement criteria are informed in a valid way? Are cause and effect links between the intervention and its results explained? Are external factors correctly taken into consideration? Are comparisons made explicit?

6. Credible findings:

do the findings follow logically from, and are they justified by, the data analysis? Are findings based on carefully described assumptions and rationale?

7. Valid conclusions:

are conclusions linked to the findings? Are conclusions based on the judgement criteria? Are conclusions clear, clustered and prioritised?

8. Useful recommendations:

are the recommendations linked to the conclusions? Are they fair, unbiased by personal or stakeholders' views, and sufficiently detailed to be operationally applicable? Are they clustered and prioritised?

9. Clear report:

does the report describe the intervention being evaluated, including its context and purpose, together with the process and findings of the evaluation? Is the report easy to read, and does it have a short but comprehensive summary? Does the report contain graphs and tables?

How to fill out the quality assessment grid?

Each of the nine criteria of the quality grid is rated on a five-level scale: excellent, very good, good, poor and unacceptable.

The quality grid for each evaluation report is filled in by two persons, who discuss the nine criteria so as to produce a single grid that is sent to the external evaluators.

The quality grids of the various evaluation reports are filled in and rated by different persons within the evaluation team. It is important that the quality assessment remains consistent regardless of who carries it out and when. To ensure that the quality grids of the reports are comparable, it has been necessary to develop a reference framework for the evaluation managers to use when assessing the quality of evaluation reports. Moreover, the head of the evaluation unit checks each quality assessment to further ensure consistency.

The evaluation unit of External Relations within the European Commission has set up a comprehensive reference framework for rating the nine criteria.

Criterion 2: Relevant scope

Good: The report deals with the whole intervention in its temporal, geographic and regulatory dimensions. The main intended and unintended effects have been identified.

Very good: In addition to the previous point, the evaluation has examined interactions with other EC policies, other donors' interventions and the partner government(s)' policies. Unintended effects have been addressed.

Poor: One of the three dimensions of the intervention and/or one major effect has been inadequately or insufficiently addressed.

Unacceptable: Several dimensions of the intervention and/or several major effects have been inadequately or insufficiently addressed.

Excellent: In addition to the remarks for a "very good" rating, the report has systematically examined the unintended effects in detail.

Criterion 3: Defensible design

Good: The evaluation method is clearly explained and has actually been applied throughout the process. The methodological choices were appropriate to meet the requirements of the terms of reference.

Very good: The limitations inherent in the evaluation method have been clearly specified, and the choices made have been discussed and justified against other possible options.

Poor: Upon reading the evaluation report, the methodological choices seem to have been made without being either explained or defended.

Unacceptable: Either there is no evaluation method, or the methodological choices are not in line with the results sought.

Excellent: In addition to meeting the expectations for a "very good" rating, the evaluation team submits a critique of the method and the methodological choices. The report points out the risks that might have been incurred if other methodological options had been adopted.

Criterion 4: Reliable data

This criterion does not concern the intrinsic validity of existing data but the way in which the evaluation team has collected and used the data.

Good: Both quantitative and qualitative data are identified through explicit sources. The evaluation team has tested and discussed the reliability of the data. The data collection tools have been clearly explained and are suited to the data sought.

Very good: Data have been systematically cross-checked by relying upon sources or data collection tools that are independent of one another. Limitations pertaining to the reliability of data or to data collection tools are made explicit.

Poor: The quantitative and qualitative data provided are not very reliable with regard to the questions asked. The data collection tools are questionable (for instance, insufficient samples or off-target case studies).

Unacceptable: Certain data are manifestly distorted. The data collection tools have not been applied correctly or else they provide biased or useless information.

Excellent: All biases deriving from the information provided are analysed and rectified by means of recognised techniques.

Criterion 5: Sound data analysis

Good: The quantitative and/or qualitative data analysis is done rigorously, following the recognised steps relevant to the type of data analysed. Cause-and-effect links between the intervention and its consequences are explained. Comparisons (for example: before / after, beneficiaries / non-beneficiaries, with / without) are made explicit as well. Data on external factors are analysed.

Very good: The analysis approaches are made explicit and their validity limitations are specified. Underlying cause-and-effect assumptions are explained. Validity limitations of comparisons made are pointed out.

Poor: One of the three elements (analysis approach, cause-and-effect relations, comparisons) is dealt with inadequately.

Unacceptable: Two or more of the three elements are dealt with inadequately.

Excellent: Every analysis bias (across the three elements: analysis approach, cause-and-effect relations, and comparisons) has been systematically reviewed and presented, including its consequences for the validity of the analysis. The influence of external factors on the analysis of the relevant data is made explicit.

Criterion 6: Credible findings

Good: The findings derived from the analysis seem both reliable and balanced, especially in view of the context in which the intervention is being assessed. The interpretations and extrapolations made are acceptable.
The findings acceptably reflect, on the one hand, the reality described by the recorded data and evidence and, on the other hand, the reality of the intervention as perceived by the actors and the beneficiaries.

Very good: The limitations applying to interpretations and extrapolations are explained and discussed.
The effects of the intervention under evaluation are isolated from the external factors and contextual constraints.
Both internal validity (absence of analysis bias) and external validity (generalisability of findings) are satisfactory.

Poor: Findings seem imbalanced. The context is not made explicit. Neither the extrapolations nor the generalisations made from the analysis are relevant.

Unacceptable: The credibility of the findings is very poor. Some assertions in the text cannot be sustained. Neither the extrapolations nor the generalisations made from the analysis are relevant.

Excellent: Imbalances between the internal and external validity of the findings are systematically analysed, and their consequences for the evaluation are made explicit.
Contextual factors have been identified and their influence has been demonstrated. The biases involved in the choice of interpretative assumptions and in the extrapolations are analysed and their consequences are made explicit.

Criterion 7: Valid conclusions

This criterion does not assess the conclusions' intrinsic substance but the way in which the conclusions have been reached.

Good: Conclusions derive from findings. Conclusions are grounded on both facts and analysis that are easily identifiable throughout the report. They are linked to explicit judgement criteria. The limitations to conclusions' validity are pointed out as well as the context in which the analysis was done.

Very good: Conclusions are organised into clusters and prioritised. They are discussed in relation to the context in which the analysis was done. The limitations to the conclusions' validity are made explicit and well grounded.

Poor: Conclusions stem from a hasty generalisation of some of the findings. The limitations to conclusions' validity are not pointed out.

Unacceptable: Conclusions are not backed up by relevant and thorough findings. Conclusions are partial because they reflect the evaluator's preconceived ideas rather than the analysis of the facts.

Excellent: Conclusions are reached in relation to the global nature of the intervention under evaluation. They take into account the intervention's connection with the context in which it takes place, considering in particular other programmes or connected public policies.

Criterion 8: Useful recommendations

This criterion does not judge the recommendations' intrinsic substance but the way in which they are articulated and whether they really derive from the conclusions.

Good: The recommendations follow logically from the conclusions. They are impartial.

Very good: In addition to the previous points, the recommendations are prioritised and clustered. They are presented in the form of options for possible actions.

Poor: The recommendations are not very clear, or they are mere truisms without any added value; their operability is arguable. The connection with the conclusions is not clear.

Unacceptable: The recommendations are disconnected from the conclusions. They are biased because they mostly reflect certain players' or beneficiaries' viewpoints or the evaluation team's preconceived ideas.

Excellent: In addition to meeting the requirements for a "very good" rating, the recommendations are tested and the validity limitations are pointed out.

Criterion 9: Clear report

Good: The report is easy to read and its structure is logical.
The summary is brief and reflects the report.
Specific concepts and technical explanations are presented in an annex with clear references throughout the body of the text.

Very good: The body of the report is short, concise and easy to read.
Its structure is easy to memorise.
The summary is clear and presents the main conclusions and recommendations in a balanced and unbiased way.

Poor: The report is hard to read and/or its structure is complex.
Cross-references are hard to follow or make reading difficult.
The summary is too long or does not reflect the body of the report.

Unacceptable: Absence of summary.
Illegible report and/or disorganised structure.
Lack of conclusion (and recommendations) chapter.

Excellent: The report can be read "like a novel" and its structure has an unquestionable logic.
The summary is operational in itself.

Overall assessment

The general quality of the report results from the ratings assigned to each of the nine criteria.
If there are at least three "unacceptable" ratings, the report must be considered unacceptable overall.
For a "very good" or "poor" rating, at least two examples of good or bad practice have to be highlighted.
For an "excellent" or "unacceptable" rating, three examples are necessary.
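
As an illustration of this aggregation rule, the following minimal sketch applies it to a set of per-criterion ratings. It is an assumption-laden example, not part of the official grid: the function name, the return values and the dictionary structure are all invented for the illustration.

```python
# Hypothetical sketch of the overall-assessment rule; names and return
# values are illustrative, not prescribed by the EC quality grid.

RATING_LEVELS = ("excellent", "very good", "good", "poor", "unacceptable")

def overall_assessment(ratings):
    """Apply the aggregation rule to the per-criterion ratings.

    `ratings` maps each of the nine quality criteria to one of the
    five rating levels.
    """
    assert len(ratings) == 9, "the grid has nine criteria"
    assert all(level in RATING_LEVELS for level in ratings.values())
    # A report with three or more "unacceptable" ratings is
    # unacceptable overall.
    if list(ratings.values()).count("unacceptable") >= 3:
        return "unacceptable"
    # Otherwise the overall quality is a qualitative synthesis of the
    # nine ratings (see the recommendations below), not a computed score.
    return "qualitative synthesis required"
```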

Recommendations
  • Consider the context when using the assessment criteria, rather than applying them in absolute terms. In a given situation, it is possible and useful to specify the quality criteria in order to take particular demands and/or constraints into account.
  • Write a qualitative synthesis of the nine criteria as a way to assess the overall quality of the report. Another, less appropriate, option is to give each criterion a score and a weight and to compute a weighted average score (a sketch of this option follows the list).
  • Do not wait until the draft final report for quality assurance. The quality assurance process should start from the outset. In particular, it is worth doing quality assessments at two important stages: inception report and first phase report (desk).
  • Attach the quality criteria to the terms of reference.
  • Have the quality assessment made by the evaluation manager and double-checked by a second person.
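
As a companion to the second recommendation above, here is a minimal sketch of the weighted-average option. The numeric scores and the example weights are arbitrary assumptions; the grid prescribes neither a scoring scale nor weights, which is one reason the qualitative synthesis is preferred.

```python
# Illustrative only: the numeric scores below are an assumption; the EC
# quality grid does not prescribe a scoring scale or weights.
SCORES = {"excellent": 5, "very good": 4, "good": 3, "poor": 2, "unacceptable": 1}

def weighted_average_score(ratings, weights):
    """Weighted average of the per-criterion ratings.

    `ratings` maps criterion name -> rating level;
    `weights` maps criterion name -> numeric weight.
    """
    total_weight = sum(weights[criterion] for criterion in ratings)
    weighted_sum = sum(SCORES[level] * weights[criterion]
                       for criterion, level in ratings.items())
    return weighted_sum / total_weight

# Example with two criteria and equal weights (nine in practice):
print(weighted_average_score(
    {"meeting needs": "good", "relevant scope": "very good"},
    {"meeting needs": 1, "relevant scope": 1},
))  # 3.5
```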

Use of the term criterion: A warning!

The term "criterion" is used here in the sense of a quality criterion and should not be mistaken for evaluation criteria (effectiveness, efficiency, etc.), or with judgement criteria (also called "reasoned assessment criteria")