© Lee Harvey 2012–2020

Page updated 29 April, 2020

Citation reference: Harvey, L., 2012–2020, Researching the Real World, available at
All rights belong to author.


A Guide to Methodology

1. Basics

1.9 Reliability and accuracy
1.9.1 Introduction
1.9.2 Accuracy


1.9.1 Introduction
Reliability is about undertaking research in a consistent and replicable manner.

Reliability is primarily about whether data is being collected in the same way each time. It asks, for example, whether interviewers are asking different respondents the same set of questions. If a test is applied (such as an IQ test, an aptitude test or an attitude survey), are the conditions under which it is applied the same (for example, the time to complete the test) and does the test yield consistent results? Does a statistical measure of the same phenomenon in different settings result in the same scores?

Thus, a measure (a test or a survey, for example) is considered reliable if it gives the same result repeatedly (provided that what is being measured is not changing).

As with validity, the concept is primarily a positivistic notion. It assumes minimum impact of the researcher or the data collection process on the research subjects and again assumes an objective world (see part 1.7) that can be addressed neutrally.

Activity 1.9.1
Working in pairs, undertake a short piece of participant observation (similar to Student Activities 6.3 or 6.8) with both of you participating or observing the same group or organisation. Compare your observations and account for any differences in conclusions that you come to. Do the results of this activity lead you to regard participant observation as unreliable? If so, in what ways? What do you feel is the consequence or importance of any unreliability you might have noted?

This activity involves recording observations.
This activity needs to be done out of class in pairs and would probably take about an hour (excluding any travel).

Means of assessing reliability
There are four key means by which researchers (mainly those using quantitative methods) can explore the reliability of a research study.

Inter-observer reliability
If, for example, the research involves people making judgements about observable events or actions, it is possible to assess whether the research method is reliable by comparing the results when different people assess the same data. This only works if the data is measurable in some quantitative way through some form of rating, even if it is just a simple dichotomy, such as: X is/isn't an instance of Y.

The inter-observer reliability coefficient is a simple way to assess whether there are consistent results among testers or coders who are rating the same information.

The inter-observer reliability coefficient is: Total Number of Agreements divided by Total Number of Observations.

A good rule of thumb is that if (Total agreements) / (Total observations) > 0.80, the data are said to have inter-observer reliability.
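The coefficient and rule of thumb above can be sketched in a few lines of Python. This is a minimal illustration, not a standard library routine: the function name `agreement_coefficient` and the two lists of codings are hypothetical, and the example assumes the simplest case of two observers making a dichotomous judgement on each observation.

```python
def agreement_coefficient(ratings_a, ratings_b):
    """Proportion of observations on which two observers agree."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("Both observers must rate the same observations")
    if not ratings_a:
        raise ValueError("At least one observation is required")
    # Count the observations where both observers gave the same rating.
    agreements = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return agreements / len(ratings_a)


# Hypothetical data: two observers code ten events as an
# instance of Y (True) or not an instance of Y (False).
observer_1 = [True, True, False, True, False, False, True, True, False, True]
observer_2 = [True, True, False, True, True, False, True, True, False, True]

coefficient = agreement_coefficient(observer_1, observer_2)
print(f"Agreement: {coefficient:.2f}")
print("Meets the 0.80 rule of thumb:", coefficient > 0.80)
```

Here the observers disagree on one event out of ten, so the coefficient is 0.90 and the codings would pass the rule of thumb.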

For more examples see Trochim (2006).

Test-retest
Reliability can be established by using the same instrument to measure something at two different times (provided no treatment has been applied and there has been no other significant change in circumstances in between). A reliable measurement instrument should yield the same results on both occasions.

To assess test-retest reliability we correlate the first set of results against the second. A high correlation suggests high reliability. Various published questionnaires report good test-retest results, such as the Mathematics Anxiety Rating Scale reported to have a test-retest reliability of 0.97 (Richardson and Suinn, 1972).

However, in practice, it is very difficult to be sure that circumstances that would affect the two sets of results haven't changed. The longer the time gap between the two tests, the lower the correlation is likely to be. Thus, to assess the reliability of a test instrument, it is important to use the same subjects and to repeat the test very soon after the initial application.

Even so, simply administering the first test can affect the results of the second. The first application is likely to be novel; by the second, subjects are familiar with the test, which may lead them to see it in a different light. They may find it easier, become more cynical about the repetition, or take the second test less seriously.

Peters, Greenbaum, Steinberg, Carter, Ortiz, Fry and Valle (2000) examined the effectiveness of several screening instruments in detecting substance use disorders among prisoners. On reapplication, for example, the TCU Drug Screen fared extremely well on this measure, obtaining a test-retest reliability of +0.95 (using the Pearson product-moment correlation coefficient, which has a range of -1 to +1).
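As an illustration of how a test-retest correlation of this kind might be calculated, the following Python sketch computes the Pearson product-moment correlation between two applications of the same instrument. The scores are invented for the example and the function name `pearson_r` is hypothetical; the data do not come from any of the studies cited.

```python
from math import sqrt


def pearson_r(first, second):
    """Pearson product-moment correlation between two paired lists of scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("Need two paired lists with at least two scores each")
    n = len(first)
    mean_1 = sum(first) / n
    mean_2 = sum(second) / n
    # Sum of cross-products of deviations from each mean.
    covariation = sum((x - mean_1) * (y - mean_2) for x, y in zip(first, second))
    # Sums of squared deviations for each set of scores.
    spread_1 = sum((x - mean_1) ** 2 for x in first)
    spread_2 = sum((y - mean_2) ** 2 for y in second)
    if spread_1 == 0 or spread_2 == 0:
        raise ValueError("Correlation is undefined when all scores in a list are identical")
    return covariation / sqrt(spread_1 * spread_2)


# Hypothetical scores for the same six respondents on two occasions.
test_scores = [12, 15, 9, 20, 17, 11]
retest_scores = [13, 14, 10, 19, 18, 11]

r = pearson_r(test_scores, retest_scores)
print(f"Test-retest correlation: {r:+.2f}")
```

A value close to +1 would suggest that respondents kept much the same relative position on both occasions. It would not, by itself, rule out the familiarity effects discussed above.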

See CASE STUDY Test-retest reliability for further examples of the use of this process.

Split-half reliability
One way of assessing reliability of something like an attitude survey is to divide the items in half and see whether the two halves generate the same results. There are various ways of dividing the items. They could be divided in a systematic way, such as odd numbered items versus even numbered items. They could be divided at random into two groups. In both cases this would only work if the questions are all instances of the same phenomenon, which is not so likely in most questionnaires. In most attitude surveys, for example, a phenomenon about which opinions are being sought will have several dimensions and questions will relate to different aspects of the phenomenon.

For example, a survey of attitudes about the UK Royal Family might explore people's views about whether there should be a monarchy, who should pay for it, and whether the monarchy has done a good job. There may be a range of attitude questions addressing each of these dimensions. To compare items randomly across dimensions would not give any indication of reliability. One would need to split items within each dimension.

The most effective approach to split half would be to ensure the operationalisation of a specific attitude resulted in two related items. These would be matching but not identical pairs that would usually be randomly distributed through the questionnaire. The reliability would be assessed by comparing the results on the matched sets of items.
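A split-half check of the odd-versus-even kind described above might look like the following Python sketch. It assumes, as the text requires, that all the items address a single dimension. The response data and the function names `pearson_r` and `split_half_correlation` are hypothetical, introduced only for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two paired lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def split_half_correlation(responses):
    """Correlate each respondent's odd-item total with their even-item total."""
    odd_totals = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return pearson_r(odd_totals, even_totals)


# Hypothetical data: five respondents answer six items (scored 1 to 5)
# that are all assumed to address the same dimension.
responses = [
    [4, 5, 4, 4, 5, 4],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 3],
    [5, 5, 5, 4, 5, 5],
    [1, 2, 1, 1, 2, 2],
]

print(f"Split-half correlation: {split_half_correlation(responses):.2f}")
```

For a questionnaire with several dimensions, the same function would be applied separately to the items belonging to each dimension rather than to the questionnaire as a whole.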

For more examples see Trochim (2006).

Parallel-forms reliability
The parallel-forms approach to testing reliability essentially applies two different versions of the same test instrument to the same group of people (randomised so that half answer the first instrument first) and then compares the results. A high correlation indicates high reliability.

The problem with this, apart from having to construct two instruments with different items, is that you need to be certain that they are, indeed, parallel. This is a judgement call and no amount of randomisation will really determine that you are checking like with like.

Parallel forms are similar to split half, although the latter is the division of a single instrument while parallel forms requires two independent instruments, supposedly measuring the same thing.
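The randomisation of order mentioned for parallel forms (half the group answering one version first, half the other) can be sketched as follows. This is only an illustration under stated assumptions: the labels "A" and "B", the participant identifiers and the function name `assign_form_order` are all hypothetical. The scores from the two forms would then be correlated in the usual way.

```python
import random


def assign_form_order(participants, seed=None):
    """Randomly split participants so half take form A first and half take form B first."""
    rng = random.Random(seed)  # a seed makes the allocation repeatable
    shuffled = list(participants)
    rng.shuffle(shuffled)
    midpoint = len(shuffled) // 2
    order = {}
    for person in shuffled[:midpoint]:
        order[person] = ("A", "B")
    for person in shuffled[midpoint:]:
        order[person] = ("B", "A")
    return order


participants = ["P1", "P2", "P3", "P4", "P5", "P6"]
order = assign_form_order(participants, seed=1)
for person in participants:
    print(person, "->", " then ".join(order[person]))
```

Randomising the order in this way balances out any effect of taking one form first. It does nothing to establish that the two forms are genuinely parallel, which remains a judgement call.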

Overview of positivist approaches to reliability
The positivist approach to reliability assumes that there is something existing that can be measured. A reliable measure is one that measures that something on a consistent basis, with little or no variation.

A ruler is a consistent measure of someone's height, provided it is used with care, and produces the same result every time (unless the person has grown!). Other measures are less easy to devise and there may be more error in the application. In effect, this means that most sociological measurement tools have a degree of error built in and the aim of a reliable measure is to get as close to the 'real' or 'true' measure as possible on a consistent basis.

Of the approaches mentioned above, inter-observer reliability is useful when observing instances of a certain phenomenon or when using a team of data-collecting researchers. The test-retest approach potentially assesses reliability but only after the event (unless you are doing it as part of the pilot). If the correlation between the test and retest is low then the research is questionable. This approach works best when you have an experimental design with a control group who are not exposed to any experimental stimulus. The parallel forms approach uses two forms as alternate measures of the same thing and is applied in limited circumstances. Split half is relatively straightforward but it does require that you can split the questionnaire items in a meaningful way, dependent on those in each half measuring the same concept or dimension.

The problem with the positivist approach to reliability is the assumption that there is a 'true' measure and that it somehow remains constant (at least over a specified period) rather than continuously changing: in which case it would be impossible to assess the reliability of a measurement instrument using any of the devices outlined above.

Phenomenological approaches to reliability
Although phenomenological approaches do not suggest that reliability is statistically measurable, there are those who argue that methods used in phenomenological research must be demonstrably reproducible and consistent (Hancock, 2002). This is done by:

  1. describing the approach to, and procedures for, data analysis;
  2. justifying why this approach is appropriate in the context of the study;
  3. clearly documenting the process of generating the themes, concepts, categories (of concepts) and theories emerging from the data (an audit trail);
  4. referring to external evidence, including other qualitative and quantitative work to test conclusions from the analysis as being appropriate;
  5. adopting data triangulation (see section 1.15), which is gathering and analysing data from more than one source. Evidence that the researcher has used triangulation in this way and has effectively drawn the analysis of different forms of data together demonstrates rigour, rather than simply the use of different sources. However, this approach owes much to a positivist perspective, implying that the data triangulation process makes the results robust enough to generalise (see section 1.10.1).

Lincoln & Guba (1985, p. 300) prefer the use of 'dependability' rather than reliability in non-quantitative research. Once again, they align this with the notion of trustworthiness.

An alternative view, found in critical approaches as well as phenomenological ones, is that reliability is unimportant and something of a diversion from the real issue, which is validity. The positivist's research tool might consistently and reliably measure the variables, but if what is being measured is invalid then the whole process is rather pointless.

For example, intelligence tests may be reliable measures of intelligence quotients but whether this has anything to do with intelligence is debatable. It has been argued that IQ is an invalid measure of intelligence because it is biased towards middle-class, white educational skills.

Observation and reliability
Observation studies are, in particular, regarded as unreliable because there is no consistent way of measuring the data. For example, what is observed depends on the degree and type of involvement of the researcher. The role adopted by the observer exacerbates the situation. Participant observers, for example, are rarely able to make their procedures explicit and there is no way of replicating a study to check the reliability of its findings.

The apparent unreliability of observation studies, particularly participant observation, is, however, seen as a 'red herring' by non-positivists. The positivistic view of reliability is one-sided. It relates to the ability of the data collection tool to measure predefined social phenomena. The point of naturalistic observation research, for example, is that it is not interested in making measures of 'variables' but is concerned with grasping social processes or cultural meanings (see Section 3: Observation).

Furthermore, social surveyors are no more 'reliable' in what they do than observers. Just because they provide a questionnaire so that the study can be replicated does not mean that the method is reliable. It is difficult, if not impossible, to check the reliability of a questionnaire (see Section 8: Surveys). Moreover, providing a schedule or questionnaire does not mean that the researcher has made explicit the procedures used in selecting the questions. Surveyors rarely, if ever, provide a rationale as to why they asked the particular questions. Nor do they provide the criteria by which the particular questions on the questionnaire were selected in preference to other feasible alternative questions. In addition, a questionnaire or interview schedule has to be completed, and while it is often taken for granted that the interviewer is a neutral agent, this is not the case.

Taking up the issue of the relevance of reliability, people such as Westwood (1984) argue that the data from participant observation is more valid than that from other forms of data collection, whether or not it is more (or less) reliable.

Reliability and validity
Validity is about the researcher constructing conceptualisations and measures of the theoretical idea that they intended to research (see section 1.8). Reliability is about consistency and replicability of data collection.

It is argued, by positivists, that a valid measure of a concept has to be reliable but that reliability of measurement does not ensure validity. A measure could measure something consistently but may be measuring the wrong thing.

As noted above, non-positivists do not subscribe to the view that reliability is a necessary condition of validity. For them, reliability of measurement is independent of the validity of conceptualisation and interpretation or understanding of the phenomenon.

