Related Projects

Full title

Empirical research on spontaneous diversity in programs


Diversity, programming language, dependability, fault tolerance, run-time checks, software, programming contest


The University of Valladolid has a website with more than 1,500 specifications for programs. Everyone can write programs to these specifications and send them to Valladolid over the internet. The website automatically assesses every submission and gives feedback to the author whether it is right or wrong. In total, over fifty thousand authors have submitted more than three million programs!

With this large resource of diverse programs, it is possible to perform empirical studies of the impact of a range of diversity techniques on program reliability [1, 2, 3, 4]. Some of the results of these empirical studies are shown below.

What does fault tolerance in 1-out-of-2 systems give us?

1-out-of-2 systems have been presented as a means to significantly improve dependability. It is however hard to determine how much improvement is likely because it is in general difficult to get a large number programs written to the same specification. However, given the many thousands of programs for each on-line contest specification, we can easily measure the probability of undetected failure on demand averaged over all possible pairs in the program population.

An example of our measurements is the following:

The horizontal axis gives the average probability of failure on demand of a single program in the population. The average pfd is reduced by removing the most unreliable programs from the population. The vertical axis gives the reliability improvement a 1-out-of-2 pair gives over a single program drawn from that population. We can observe that for unreliable programs, the probability of failure of the pair is approximately the product of the probabilities of the individual programs (close to the independence assumption). We also observe that the reliability improvement reaches a “plateau” at about a factor of a hundred. This plateau has often been predicted, but has never been observed before!

Diverse programming languages

Forced diversity has been advocated as a means of increasing program diversity (different development methods, languages, etc). In this study we examined the impact of diverse languages (C, C++ and Pascal) on the reliability improvement of a 1-out-of-2 pair by choosing different languages for the two programs in the pairs.
An example of these measurements is shown below.

On the horizontal axis, we again have the average probability of failure on demand of the program population, on the vertical axis the reliability improvement of the pair of program over a single version. It can be seen that the reliability improvement is always better if the second program of the pair uses a different language from the first (which is written in C). The three curves correspond to a C/C, a C/C++ and a C/Pascal pair. We can see that the C/Pascal pair is significantly better. Experiments on other contest problems confirm this, though the improvement is often less than the example shown above.

Diverse run-time checks

A last example of our experiments concerns run-time checks. Run-time checks are a diverse means of checking the output of a program to detect program failure. Run-time checks evaluate one or more properties of the program output and flag the output as incorrect if it fails the check.

We have been able to measure the effectiveness of run-time checks for various specifications. An example is:

The horizontal axis gives the average probability of failure on demand of the program population, and the vertical axis gives the improvement in undetected failures using various run-time checks. The box in the graphs describes the various run-time checks in our analysis; we will not go into detail about that here. We can observe a maximum improvement of about a factor of three, but only in a limited range. The average improvement is much lower than that. This indicates that run-time checks can give some improvement, but the reliability improvement does not appear to be large for the most reliable programs in the population.




Bentley, J.G.W., Bishop, P.G., Meulen, M.J.P. van der, An Empirical Exploration of the Difficulty Function, Safecomp, Potsdam, 21-24 September 2004, Potsdam, Germany, pp. 60-71, Springer Verlag, 2004.

Meulen, M.J.P. van der, Bishop, P.G., Revilla, M., An Exploration of Software Faults and Failure Behaviour in a Large Population of Programs, ISSRE 2004, Rennes, France, pp. 101-12, 2004.

Meulen, M.J.P. van der, Revilla, M., The effectiveness of choice of programming language as a diversity seeking decision, EDCC 2005, Budapest, pp.199-209.

Meulen, M.J.P. van der, Strigini, L., Revilla, M., On the Effectiveness of Run-Time Checks, Safecomp 2005, Halden, Norway, to be published.


Meine van der Meulen and Peter Bishop (City)


Page Maintainer: Credits      Project Members only Last Modified: 18 August, 2005