





© Lee Harvey 2012–2020

Page updated 29 April, 2020

Citation reference: Harvey, L., 2012–2020, Researching the Real World, available at
All rights belong to author.


A Guide to Methodology

8. Surveys

8.1 Introduction to surveys
8.2 Methodological approaches
8.3 Doing survey research

8.3.1 Aims and purpose
8.3.2 Background to the research
8.3.3 Feasibility
8.3.4 Hypotheses
8.3.5 Operationalisation
  Preliminary enquiry
  Operationalisation and validity
  Scaling
    Thurstone scaling
      The method of equal appearing intervals
      The method of successive intervals and the method of paired comparisons
    Likert or Summative Scales
      Introduction to Likert scaling
      Constructing a Likert scale
      Issues in developing a Likert scale
      Possible distortion from the use of Likert scales
    Guttman or Cumulative Scales
  Interchangeability of indicators

8.3.6 How will data be collected and what are the key relationships
8.3.7 Designing the research instrument
8.3.8 Pilot survey
8.3.9 Sampling
8.3.10 Questionnaire distribution and interviewing
8.3.11 Coding data
8.3.12 Response rate

8.4 Statistical analysis
8.5 Summary and conclusion

Scaling
There is often a range of indicators for a variable and the researcher is faced with four options.

First, treat the variable as unidimensional and select a single item to represent the concept. For example, an individual's social class might be operationalised by selecting the social class category of their occupation.

Second, treat the variable as multi-dimensional and select a single indicator for each dimension. Then combine these into an index. Such a combination may, for example, involve simply adding scores together for each dimension, or it may involve weighting the different dimensions to accord their importance in the overall concept.

For example, attempts to provide a ranking of universities use several dimensions and identify a single variable to represent each: ranging from research outputs, through grants obtained, to Nobel Prize winners, expert opinion and student satisfaction ratings. Different ranking systems use different combinations of these and other dimensions and give different weights to each component based on how important they think they are in providing an overall assessment of universities. The final rank is the (weighted) average or total of the indicator for each dimension.
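The weighted combination described above can be sketched in a few lines. This is only an illustration: the dimension names, the scores and the weights are made-up values, and the sketch assumes each dimension has already been rescaled to a common 0 to 100 range.

```python
def weighted_index(scores, weights):
    """Combine one score per dimension into a single weighted average."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    return sum(scores[d] * weights[d] for d in scores) / total_weight


# Hypothetical figures for one university (not real ranking data).
university = {"research_outputs": 80, "grants": 60, "student_satisfaction": 90}
# Hypothetical weights expressing how important each dimension is judged to be.
weights = {"research_outputs": 5, "grants": 3, "student_satisfaction": 2}

index = weighted_index(university, weights)  # (80*5 + 60*3 + 90*2) / 10
```

Changing the weights changes the index, which is why different ranking systems can order the same institutions differently.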

Third, treat the variable as unidimensional and select several indicators to represent the concept and from that compile a single index. The construction of an index (combining different indicators into a single quantitative measure) is known as scaling.

Scaling evolved from various attempts to measure complex concepts such as authoritarianism or self-esteem. Thus, scaling in social science attempts to measure abstract concepts.

There are various ways to construct a scale. The three most used unidimensional approaches (named after their creators) are:

  • Thurstone or equal-appearing interval scaling
  • Likert or summative scaling
  • Guttman or cumulative scaling

These are discussed below.

Fourth, treat the variable as multi-dimensional and select several indicators to represent each dimension of the concept and from that compile a single index. This process reduces a complex multidimensional concept into a single item represented by the index. This would involve applying a unidimensional approach (such as Guttman, Likert or Thurstone scaling) to each dimension and then combining each sub-index together to form an overall index.

Sometimes this reduction to a single index can be very unsatisfactory, as the combination process loses the integrity of the concept; in such circumstances multi-dimensional scaling needs to be undertaken. This is rather more complex and is well explained at: (accessed 29 April 2020)

Thurstone scaling
Louis Thurstone's scale was the first formal attitude measuring technique and was developed in 1928 to measure attitudes towards religion. Thurstone developed three different versions of his scaling method: equal-appearing intervals; successive intervals; and paired comparisons.

The method of equal-appearing intervals
Having identified the aims and purpose, assessed feasibility, explored the background, drafted some preliminary hypotheses and identified the key concepts, it is time to address these key concepts systematically.

First, define them as specifically as possible. The Thurstone method only applies if the concept is unidimensional.

Second, generate a selection of statements that are indicative of the concept (the indicators). The statements need to be formulated in a similar manner, such as statements that one could agree or disagree with (rather than, for example, yes/no questions mixed in with agree/disagree questions).

Third, decide which items work best to illustrate aspects of the concept being operationalised. How do you do this?

One way to select a final set of statements is to use a group of respondents (judges) to rate each of the initial items on a scale from 'least favourable' to 'most favourable' indicator of the concept. Then compute the median and interquartile range of the ratings for each item and rank the items in order of their medians. Finally, select items whose medians fall at regular intervals along the scale, choosing at each interval the item with the smallest variability (smallest interquartile range).

Note that the judges are rating items as least favourable through to most favourable indicator of the concept, not rating each item on a personal basis as though they were answering the questionnaire.
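The selection step can be sketched as follows. This is a minimal illustration, not a definitive procedure: the item labels and ratings are invented, "regular intervals" is interpreted here as one item per whole scale point, and the quartiles use the default method of Python's `statistics.quantiles` (other quartile conventions give slightly different interquartile ranges).

```python
from statistics import median, quantiles


def summarise(ratings):
    """Return (median, interquartile range) of the judges' ratings for one item."""
    q1, _, q3 = quantiles(ratings, n=4)
    return median(ratings), q3 - q1


def select_items(judge_ratings):
    """For each scale point, keep the item whose ratings vary least."""
    best = {}
    for item, ratings in judge_ratings.items():
        med, iqr = summarise(ratings)
        point = round(med)  # simplification: medians such as 2.5 are rounded
        if point not in best or iqr < best[point][1]:
            best[point] = (item, iqr)
    return {point: item for point, (item, _) in sorted(best.items())}


# Hypothetical ratings by five judges on a 1-to-11 scale.
judge_ratings = {
    "A": [1, 1, 2, 1, 1],
    "B": [1, 3, 1, 2, 5],     # median 2, but judges disagree widely
    "C": [2, 2, 2, 3, 2],     # median 2, judges largely agree
    "D": [10, 11, 11, 11, 9],
}

selected = select_items(judge_ratings)
```

Items B and C both have a median of 2; C is kept because its ratings are far less spread out.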

William Trochim (2006a) provided an example of a scale to measure attitudes that people might have towards persons with AIDS. A long list of possible statements was rated on a scale from 1 (least favourable attitude towards people with AIDS) to 11 (most favourable). The long list was reduced to the following list (slightly adapted in this example); the values in parentheses are their scale points and all had a very small interquartile range. On the questionnaire these items would be listed in random order.

  • People with AIDS deserve what they got (1).
  • AIDS is good because it helps control the population (2).
  • AIDS will never happen to me (3).
  • I can't get AIDS if I'm in a monogamous relationship (4).
  • Because AIDS is preventable, we should focus our resources on prevention instead of curing (5).
  • People with AIDS are like my parents (6).
  • If you have AIDS, you can still lead a normal life (8).
  • AIDS doesn't have a preference, anyone can get it (9).
  • AIDS affects us all (10).
  • People with AIDS should be treated just like everybody else (11).

This is the scale; note that in this case no item came out with a median of 7.

The scale is included in a questionnaire and administered to the sample, and respondents indicate whether they agree or disagree with each item. A respondent's score is the average of the scale values of the items they agree with. If, for example, a respondent agreed with three items that had scale scores of 1, 3 and 5, the respondent's score would be 3. If another respondent agreed with six items that had scale scores of 4, 5, 6, 8, 9 and 10, the respondent's score would be 7.
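The two worked figures can be reproduced with a short sketch. It assumes the score is the arithmetic mean of the scale values of the agreed items, which is consistent with both examples; the function name is illustrative.

```python
from statistics import mean


def thurstone_score(agreed_scale_values):
    """Mean scale value of the items a respondent agreed with.

    Returns None when the respondent agreed with no items, since no
    score can be computed in that case.
    """
    if not agreed_scale_values:
        return None
    return mean(agreed_scale_values)


first = thurstone_score([1, 3, 5])              # (1 + 3 + 5) / 3
second = thurstone_score([4, 5, 6, 8, 9, 10])   # 42 / 6
```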

The method of successive intervals and the method of paired comparisons
The method of successive intervals and the method of paired comparisons devised by Thurstone are no different in the final product presented to the sample: it is still agree/disagree with a set of statements that each have a scale rating. They differ in the way that the scale is constructed.

The method of successive intervals is very much like the method of equal-appearing intervals but does not assume that rating categories or intervals are of equal width.

The method of paired comparisons requires each judge to determine which of a pair of statements is the more favourable. This is done for all possible pairs and so is only feasible when the list of potential indicators is quite small. For example, a list of 20 statements would require 190 comparisons (20 × 19 / 2) to be made by each judge, which would be very time consuming and tedious.
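The figure of 190 follows from counting unordered pairs, n(n - 1)/2. A small sketch (with placeholder statement labels) shows how quickly the workload grows:

```python
from itertools import combinations
from math import comb


def count_comparisons(n_statements):
    """Number of distinct pairs each judge must compare."""
    return comb(n_statements, 2)


# Placeholder labels standing in for 20 candidate statements.
statements = [f"statement {i}" for i in range(1, 21)]
pairs = list(combinations(statements, 2))  # every pair, each listed once

workload = {n: count_comparisons(n) for n in (10, 20, 40)}
```

Doubling the list from 20 to 40 statements roughly quadruples the number of comparisons (190 to 780).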

Likert or Summative Scales

Introduction to Likert scaling
Likert Scaling is a unidimensional scaling method developed by Rensis Likert, a psychologist, in 1932. A Likert scale is commonly used to measure attitudes, values, perceptions, knowledge and behavioral changes.

A Likert-type scale assumes that the strength or intensity of experience is on a linear continuum from, for example, strongly agree to strongly disagree. Respondents may be offered a choice of five, seven or even nine pre-coded responses with the neutral point being neither agree nor disagree. In most cases, Likert scales use five points to allow the individual to express how much they agree or disagree with a particular statement (for example, strongly disagree, disagree, neither agree nor disagree, agree, strongly agree).

Likert-like scales have also been used to ask respondents to indicate the importance, frequency, quality, likelihood and relevance as well as agreement.

An example of a Likert item might be:

Ecological concerns are the most important issues facing humanity.
(1) strongly disagree (2) disagree (3) undecided (4) agree (5) strongly agree (0) don't know/can't answer/missing.

A respondent's score for the whole Likert-like scale would usually be the sum total of the scores for each item in the scale (or average score for the items answered). So in the example above if there were 20 such items in the scale, a respondent who provided 20 strongly agree scores would have a total of 100. At the other extreme the lowest total score would be 20. If the situation arises where there are missing values (0) then it is probably best to take the mean score of the items answered.
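The scoring rule, including the treatment of missing values, can be sketched as follows. The coding (1 to 5, with 0 for don't know/missing) follows the example item above; the function name is illustrative.

```python
def likert_scores(responses, missing=0):
    """Return (total, mean) over the items actually answered.

    Responses coded as `missing` are left out, so the mean stays
    comparable between respondents who skipped different numbers of items.
    """
    answered = [r for r in responses if r != missing]
    if not answered:
        return None, None
    total = sum(answered)
    return total, total / len(answered)


highest = likert_scores([5] * 20)       # strongly agree on all 20 items
lowest = likert_scores([1] * 20)        # strongly disagree on all 20 items
partial = likert_scores([4, 5, 0, 3])   # one item unanswered
```

For the respondent with one missing item, the total of 12 is not comparable with totals from complete questionnaires, whereas the mean of 4.0 is.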

In this example, the assumption is that each point on the scale is equal distance apart, so the difference between strongly agree and agree is the same as between agree and undecided, and so on. It also assumes that all 20 items have the same weight because the scores on each item have just been added together to make the index score.

It is possible to weight the individual items when constructing the index, and one could give the responses within each item scores that do not reflect equal intervals, such as: (1) strongly disagree (3) disagree (4) undecided (5) agree (7) strongly agree. However, this would not be a Likert scale, which presupposes equal intervals.

Other examples of Likert scales are:

1. Definitely not, 2. Undecided, 3. Definitely will

1. Not aware, 2. Somewhat aware, 3. Usually aware, 4. Very much aware

1. Hardly ever, 2. Occasionally, 3. Sometimes, 4. Frequently, 5. Almost always

1. Very slow, 2. Slow, 3. Average, 4. Fast, 5. Very fast

1. Exceptionally unfavorable, 2. Unfavorable, 3. Somewhat unfavorable, 4. Somewhat favorable, 5. Favorable, 6. Exceptionally favorable

1. Excellent, 2. Very good, 3. Good, 4. Satisfactory, 5. Poor, 6. Very poor, 7. Unacceptable

Constructing a Likert scale
The items for a Likert scale are derived in much the same way as in Thurstone scaling (discussed earlier in this section) but, as shown above, the scale allows grading of the answers and not just agree/disagree. The process works as follows.

First, define what it is that is being measured.

Second, create the set of potential scale items either by using your own knowledge or by engaging others (experts or people familiar with the concept) to help. This may be by brainstorming (as a group or through virtual means). A large group of items should be generated at this stage.

Third, the large number of items would be rated by judges on a 1-to-5 rating scale, ranging from (1) strongly unfavorable to the concept, through (2) unfavorable to the concept, (3) unsure and (4) favorable to the concept, to (5) strongly favorable to the concept. Then intercorrelate all pairs of items, based on the ratings of the judges. Discard items that have a low correlation with the total (summed) score across all items (Item-Total correlation). There is no fixed discard rule but discarding items with a correlation of less than 0.6 would normally be a good starting point. Most statistics packages can easily compute Item-Total correlation. To do so, create a new variable that is the sum of all of the individual items for each respondent (here, each judge), and add this variable into the correlation matrix computation.
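Where no statistics package is to hand, the Item-Total correlation can be sketched directly. The ratings below are invented, the item names are placeholders, and the sketch assumes Pearson's correlation coefficient with the 0.6 starting threshold mentioned above.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (nan if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return float("nan")
    return cov / (spread_x * spread_y)


def item_total_correlations(ratings):
    """ratings maps item -> list of ratings, one per judge, in the same judge order."""
    totals = [sum(per_judge) for per_judge in zip(*ratings.values())]
    return {item: pearson(values, totals) for item, values in ratings.items()}


# Hypothetical ratings of three items by five judges (1-to-5 scale).
ratings = {
    "item_a": [1, 2, 3, 4, 5],
    "item_b": [2, 2, 3, 5, 5],
    "item_c": [3, 3, 3, 2, 3],
}

correlations = item_total_correlations(ratings)
kept = [item for item, r in correlations.items() if r >= 0.6]
```

Note that the total here includes the item itself, as in the procedure described above; with only a few items this inflates each correlation somewhat.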

Fourth, identify which of the remaining items best discriminate between high and low scores (by judges) of the item. The aim is to have items that correlate highly with overall average ratings but also have high discrimination. For each item, compute the average rating for the top quarter of judges and the bottom quarter. Then, do a t-test [a test of significance] of the differences between the mean value for the item for the top and bottom quarter judges. The higher the t value the bigger the difference and the better the item is at discriminating, so use these items. Your judgement will be needed as to the best items to retain. Keep between 10 and 20 items, preferably all with high Item-Total correlations and high discrimination (high t-values).
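The discrimination check can be sketched as below. Several choices are assumptions rather than part of the procedure as stated: judges are ranked by their total score across all items to form the top and bottom quarters, the t value is Welch's (unequal-variance) statistic, and the data are invented. A statistics package would also report the significance level, which this sketch omits.

```python
from math import sqrt
from statistics import mean, variance


def discrimination_t(item_ratings, totals):
    """t value for the gap in mean item rating between top- and bottom-quarter judges."""
    if len(totals) < 4 or len(item_ratings) != len(totals):
        raise ValueError("need matching lists covering at least four judges")
    order = sorted(range(len(totals)), key=lambda j: totals[j])
    k = max(2, len(order) // 4)          # at least two judges per group
    bottom = [item_ratings[j] for j in order[:k]]
    top = [item_ratings[j] for j in order[-k:]]
    gap = mean(top) - mean(bottom)
    std_error = sqrt(variance(top) / len(top) + variance(bottom) / len(bottom))
    if std_error == 0:
        return 0.0 if gap == 0 else float("inf")
    return gap / std_error


# Hypothetical total scores for eight judges, and their ratings of two items.
totals = [10, 12, 15, 18, 20, 22, 25, 28]
strong_item = [1, 2, 2, 3, 3, 4, 4, 5]   # rises with the total score
weak_item = [3, 2, 3, 3, 2, 3, 2, 3]     # unrelated to the total score

t_strong = discrimination_t(strong_item, totals)
t_weak = discrimination_t(weak_item, totals)
```

The first item separates high-scoring from low-scoring judges and yields a large t value; the second does not discriminate at all.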

Issues in developing a Likert scale
Likert scaling is a bipolar scaling method, measuring either positive or negative response to a statement. Sometimes an even-point scale is used, where the middle option of 'Neither agree nor disagree' is not available. This is sometimes called a 'forced choice' method, since the neutral option is removed (see Allen & Seaman (2007)).

The neutral option can be seen as an easy option to take when a respondent is unsure, and so whether it is a true neutral option is questionable. Robert Armstrong (1987) found negligible differences between the use of 'undecided' and 'neutral' as the middle option in a five-point Likert scale.

There is disagreement as to whether individual Likert items can be considered as interval-level data or as ordered-categorical data, see for example Susan Jamieson (2004) and Geoff Norman (2010).

There are two primary considerations in this disagreement. First, Likert scales are arbitrary. The value assigned to a Likert item is simply determined by the researcher designing the survey, who makes the decision based on a desired level of detail. However, by convention Likert items tend to be assigned progressive positive integer values. Likert scales are typically 5 or 7-point scales and the implication is that a higher response category indicates a 'better' response than the preceding value (or 'worse' if the scale is reverse constructed from better to worse).

The second, more important, issue is whether the difference between each successive point on the item is equivalent. For example, is the difference between 'strongly agree' and 'agree' the same as between 'agree' and 'neutral'? An equidistant item response is important, otherwise the analysis may be biased. For example, a four-point Likert item with categories 'poor', 'average', 'good' and 'very good' is unlikely to have all equidistant categories since there is only one category that can receive a below-average rating. This would arguably bias any result in favour of a positive outcome. However, even if researchers present what they consider to be equidistant categories, they may not be interpreted as such by the respondent.

Possible distortion from the use of Likert scales
Likert scales may be subject to distortion from several causes. Respondents may:

  • avoid using extreme response categories (central tendency bias), especially out of a desire to avoid being perceived as having extremist views; or may be restrained early on in the questionnaire but become more extreme later;
  • agree with statements as presented (acquiescence bias), with this effect especially strong among persons, such as children, developmentally disabled persons, and the elderly or infirm, who are subjected to a culture of institutionalisation that encourages compliance;
  • respond in a (neutral) way that would avoid perceived negative consequences should their answers be used against them;
  • provide answers that they believe will be evaluated as indicating strength or lack of dysfunction;
  • try to portray themselves or their organisation in a light that they consider might be taken more favorably than their true beliefs (social desirability bias).

Guttman or Cumulative Scales
Guttman scaling, named after Louis Guttman, is also sometimes known as cumulative scaling or scalogram analysis. The purpose of Guttman scaling is to establish a one-dimensional continuum to measure a concept. This means finding a set of items or statements ordered so that a respondent who agrees with any specific question in the list will also agree with all previous questions.

For example, imagine a ten-item cumulative scale. If the respondent scores a four, that should mean that the respondent agreed with the first four statements. Similarly, a score of eight should mean the respondent agreed with the first eight items. The object is to find a set of items that perfectly matches this pattern. In practice, this is rarely possible.

Constructing a Guttman scale works as follows. First, define the concept being investigated, for example, attitudes to immigration. Be clear in the definition about what kinds of immigration are being investigated (legal, illegal, refugee, economic, and so on).

Second, develop a large set of items that reflect the concept (with the help of others if required).

Third, get a group of judges to rate the items, for example, as favourable or not favourable towards immigration. (The judges are not being asked for their views on immigration, just whether the item is favourable or not favourable to immigration.)

Fourth, construct a matrix or table that shows the responses of all the judges on all of the items. The rows of the table would have the judges who identify more favourable items at the top and those with fewer favourable items at the bottom. The columns would be the items with more favourable responses to the left and least favourable to the right.

The matrix will show how cumulative the scale is. A perfect scale would look something like this (where + is judged a favourable item and - is judged an unfavourable item):

Judge   Item 5  Item 1  Item 4  Item 3  Item 7  Item 6  Item 2
1       +       +       +       +       +       +       +
4       +       +       +       +       +       +       -
3       +       +       +       +       +       -       -
6       +       +       +       -       -       -       -
2       +       +       -       -       -       -       -
7       +       -       -       -       -       -       -
5       -       -       -       -       -       -       -

So, in the example, Judge 4 regards Item 6 as favourable and also all items to the left of it (i.e. Items 5, 1, 4, 3, 7). Judge 2 regards Item 1 as favourable and all items to the left (i.e. Item 5).
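Ordering the matrix and checking whether it is perfectly cumulative can be sketched as follows, using the judgements from the example table. The function names are illustrative, and the check only tests for a perfect pattern; it does not measure how far an imperfect matrix departs from one.

```python
def sort_matrix(judgements):
    """Order judges (rows) and items (columns) by number of favourable ratings.

    judgements maps judge -> {item: True if rated favourable, else False}.
    """
    items = list(next(iter(judgements.values())))
    item_order = sorted(items, key=lambda i: -sum(row[i] for row in judgements.values()))
    judge_order = sorted(judgements, key=lambda j: -sum(judgements[j].values()))
    return judge_order, item_order


def is_perfectly_cumulative(judgements):
    """True if, with items ordered as above, no row has a '+' after a '-'."""
    _, item_order = sort_matrix(judgements)
    for row in judgements.values():
        pattern = [row[i] for i in item_order]
        if pattern != sorted(pattern, reverse=True):
            return False
    return True


# The example table, entered with judges and items in plain numerical order.
header = [5, 1, 4, 3, 7, 6, 2]
rows = {1: "+++++++", 4: "++++++-", 3: "+++++--", 6: "+++----",
        2: "++-----", 7: "+------", 5: "-------"}
judgements = {
    judge: {item: rows[judge][header.index(item)] == "+" for item in sorted(header)}
    for judge in sorted(rows)
}

judge_order, item_order = sort_matrix(judgements)
perfect = is_perfectly_cumulative(judgements)
```

Sorting recovers the row and column order shown in the table, and the pattern is confirmed as perfectly cumulative.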

In practice, things will not work out perfectly and judgement is needed to identify the best cumulative scale. This may require the aid of statistical techniques, in particular scalogram analysis.

When administering the scale, each item has a scale score depending on how favourable it has been judged to be. A respondent's score would be the sum of the scale values of every item they agree with.
