© Lee Harvey 2012–2020

Page updated 29 April, 2020

Citation reference: Harvey, L., 2012–2020, Researching the Real World, available at
All rights belong to author.


A Guide to Methodology

7. Secondary data


7.4.3 Big data
Big data is data collected as the result of the increasing digitisation of society. Every on-line transaction, every swipe of a store loyalty card, every utterance on Twitter, search on Google, comment on Facebook and potentially (for government security services at least) every email sent constitutes the basis of swathes of big data. Some of this is collected, analysed and used automatically: for example, when you access a page on a website, the advertisement seems to reflect something you recently looked up or enquired about online.

This data is both structured and unstructured and much of it is collected by large corporations to better inform business decisions. Some big data is the result of continuous data streaming such as information on financial transactions or on-line browsing of sites. Social media is another, usually unstructured, form of data, which poses analysis challenges. There are also huge, publicly available data sets including open data sources such as the US government's, the CIA World Factbook or the European Union Open Data Portal.

What makes big data different, apart from the sheer quantity, is the speed at which it is generated and collected. This has knock-on effects for processing speed. Big data also takes a wide variety of formats including text documents, email, video, audio, numeric data in traditional databases and financial transactions. Unlike, for example, the computation of the retail price index, which happens at regular intervals using similar amounts of data collected in a consistent manner, big data flows are inconsistent, with peaks and troughs depending on what, for example, is capturing the interest of those using social media or the rise and fall of retail sales throughout the year. Furthermore, as big data exists irrespective of whether anyone uses it for research purposes, the post hoc exploration of data from a plethora of sources is difficult: the data are hard to connect and correlate because of the different formats and the need to establish the veracity of the material (see Section 7.5).

An example, from business, of the use of big data is the way UPS, a United States delivery service, reduced the mileage of its delivery drivers. As a company with many pieces and parts constantly in motion, UPS receives a huge amount of data, primarily from sensors in its vehicles. The data has two key uses. First, it is used to monitor the daily performance of delivery drivers. Second, it provides the input for a very large operations research project called On-Road Integrated Optimization and Navigation (ORION), which uses online map data to rearrange a driver's route in real time. The project has claimed to have saved about eight and a half million gallons of fuel by cutting 85 million miles off daily routes. The company estimates that saving only one daily mile per driver saves the company US$50 million (UPS, 2017).

Today, ORION can solve an individual route in seconds and is constantly running in the background evaluating routes before drivers even leave the facility. This level of route evaluation conducted through the ORION program requires extensive hardware and architectural provisions. Running on a bank of servers in Mahwah, NJ, ORION is constantly evaluating the best way for a route to run based on real-time information. While most of America is sleeping, ORION is solving tens of thousands of route optimizations per minute.

Little of this big data is currently (as of 2017) analysed by social researchers, which is not too surprising given the technology required to process it and that it is data that is not specifically determined by a research objective.

Instead, faceless technocrats employed by huge corporations mine the data, on the one hand, for innocuous marketing information and, on the other, for more intrusive or social engineering purposes. For example, Facebook, in 2014, undertook an analysis of its users' responses to altered newsfeeds. A study, supposedly to assess 'emotional contagion', explored how the altered content was reproduced and shared by Facebook users. The resultant outcry about Facebook experimenting on its users without their consent has simply meant that such activities are unlikely to be published in the future, rather than that they will cease to occur.

This is indicative of a considerable access problem. Apart from a lack of incentive, private corporations are often reluctant to share data for legal, reputational or competitiveness reasons.

Engaging with appropriate partners in the public and private sectors to access non-public data entails putting in place non-trivial legal arrangements in order to secure (1) reliable access to data streams and (2) get access to back up data for retrospective analysis and data training purposes. There are other technical challenges of inter-comparability of data and inter-operability of systems, but these might be relatively less problematic to deal with than getting formal access or agreement on licensing issues around data.

A culture of secrecy pervades the large global corporations. Ironically, large corporations want their own privacy but are content to infringe the privacy of individuals.

Privacy is a sensitive issue when it comes to using or collecting big data. Privacy is a basic freedom and fundamental human right. It is defined by the International Telecommunication Union as the right of individuals to control or influence what information related to them may be disclosed. It is likely that, in many cases, individuals routinely consent to the collection and use of web-generated data by simply ticking a box without fully realising how their data might be used or misused. It is also unclear whether bloggers and Twitter users, for instance, actually consent to their data being analysed. In addition, research has shown that it is 'possible to "de-anonymise" previously anonymised datasets' (UN, 2012, p. 24).

The wealth of individual-level information that Google, Facebook, and a few mobile phone and credit card companies would jointly hold if they ever were to pool their information is in itself concerning. Because privacy is a pillar of democracy, we must remain alert to the possibility that it might be compromised by the rise of new technologies, and put in place all necessary safeguards.... Any initiative in the field ought to fully recognise the salience of the privacy issues and the importance of handling data in ways that ensure that privacy is not compromised. These concerns must nurture and shape on-going debates around data privacy in the digital age in a constructive manner in order to devise strong principles and strict rules—backed by adequate tools and systems—to ensure “privacy-preserving analysis.” (UN, 2012, pp. 24–25) 

That worrisome use of big data by non-social researchers apart, there are studies that make use of 'big data'. The main characteristic of such things as Twitter and Facebook posts is that they are in effect large repositories of extant qualitative data that can be used as qualitative data or categorised and analysed quantitatively.

The issue for most researchers is how to access such data and, once accessed, how to mine it in a way that provides a basis for research and analysis: how, for example, to construct a meaningful sample rather than just a convenience one.

In addition, there is the problem of analysing such huge datasets, which can be counted in terabytes rather than gigabytes. Large corporations have pioneered methods of analysis, such as MapReduce, and tools that implement them, such as Hadoop. MapReduce works by defining a way of sorting data into groups (mapping it) and then doing something to the mapped groups, such as averaging the data.
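
The map-then-reduce idea can be sketched in a few lines of plain Python. This is a toy, single-machine illustration and not Hadoop's API: the records, the spending figures and the function names `map_record`, `reduce_group` and `map_reduce` are all invented for the example.

```python
from collections import defaultdict

# Hypothetical records: (region, amount spent) pairs invented for illustration.
records = [
    ("north", 12.0),
    ("south", 8.0),
    ("north", 18.0),
    ("south", 10.0),
    ("east", 5.0),
]

def map_record(record):
    """Map step: emit a (key, value) pair for one input record."""
    region, amount = record
    return region, amount

def reduce_group(values):
    """Reduce step: do something to one mapped group, here averaging."""
    return sum(values) / len(values)

def map_reduce(data):
    # Collect the mapped values under their key (the 'shuffle').
    groups = defaultdict(list)
    for record in data:
        key, value = map_record(record)
        groups[key].append(value)
    # Reduce each group independently; on a real cluster the groups
    # would be spread across many machines.
    return {key: reduce_group(values) for key, values in groups.items()}

averages = map_reduce(records)
print(averages)
```

Because each group is reduced without reference to any other, the work can in principle be divided among many machines, which is what makes the approach suitable for datasets too large for one computer.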

How is 'big data' being used for social research?
It is far from clear, as of 2017, how social, business studies and health research uses big data. There is increasing use of big data but how and where it is being used and how it is being accessed and analysed is the subject of enquiry in its own right (Oxford Internet Institute, 2016; ESRC, 2017).

In principle, big data provides an opportunity for social research to augment targeted sample-based enquiry with real-time transactional data derived from whole populations, which increases the potential for detailed analysis of social issues.

In the United States of America, President Obama's Administration announced the well-funded 'Big Data Research and Development Initiative' in March 2012. Its intention was to improve the ability of researchers to extract knowledge and insights from large and complex collections of digital data.

In May 2012, the United Nations published the White Paper 'Big Data for Development: Opportunities and Challenges', which highlighted the opportunities and challenges of using big data in the field of international development.

Turning Big Data—call logs, mobile-banking transactions, online user-generated content such as blog posts and Tweets, online searches, satellite images, etc.—into actionable information requires using computational techniques to unveil trends and patterns within and between these extremely large socioeconomic datasets. New insights gleaned from such data mining should complement official statistics, survey data, and information generated by Early Warning Systems, adding depth and nuances on human behaviours and experiences—and doing so in real time, thereby narrowing both information and time gaps.

However, as the report states:

It is important to recognise that Big Data and real-time analytics are no modern panacea for age-old development challenges. That said, the diffusion of data science to the realm of international development nevertheless constitutes a genuine opportunity to bring powerful new tools to the fight against poverty, hunger and disease. (UN, 2012, p. 4)

There are, though, the report suggests, questions about the analytical value and policy relevance of big data, 'including concerns over the relevance of the data in developing country contexts, its representativeness, its reliability' (UN, 2012, p. 4), as well as the issue of privacy when utilising personal data.

Exploring how big data is being used is the purpose of the 'Accessing and Using Big Data to Advance Social Science Knowledge' project of the Oxford Internet Institute (2016), which aims to:

arrive at robust insights with practical implications about how big data about people and their social interactions is accessed, and how big data enables the discovery of new knowledge about society and behaviour: in short, what are the social and scientific implications of large-scale 'big data' as it becomes more widely available to social scientists in academia, public institutions, and the private sector? The project will rely on in-depth studies of exemplar cases to understand how social scientists in academia, industry, and government are accessing and using big data to answer old questions at larger scales as well as asking and answering new questions about society and human behaviour. The main objectives of the project are to:

  • Undertake case studies of social science uses of big data with a focus on means and modes of access.
  • Support the development and documentation of new methodologies for working with big social science data, such as access, data management, analysis, and visualization techniques.
  • Facilitate engagement with social scientists working with big data through workshops and other events.
  • Organize a conference on big data in the social sciences.
  • Produce findings that report the project's evidence and make policy recommendations.

Meanwhile, ESRC (2017) has established and is funding (until December 2019) a 'Big Data Network', which it hopes will 'shape our knowledge of society and help us prepare and evaluate better government policies in the future'. The three-phase network, exploring the 'enormous volume and complexity of data that is now being collected by government departments', has, as phase one, the development of the Administrative Data Research Network (ADRN) that 'will provide access to de-identified administrative data collected by government departments for research use, businesses and other organisations'. ESRC consider that such data, duly anonymised, 'will provide a robust evidence-base to inform research, and policy development, implementation and evaluation'. The idea of the second phase is to support the establishment of centres that will make data, routinely collected by business and local government organisations, accessible for academics. In so doing it is hoped that the ensuing research will shape public policies and make business, voluntary bodies and other organisations more effective, as well as shaping wider society. The third phase 'will focus primarily on third sector data and social media data'.

Examples of social research using 'big data': two studies of Twitter use

Ramine Tinati et al. (2014, p. 664) noted the appeal of big data as a source for social research. However, they were aware of accessibility issues: 'for reasons of privacy and/or commercial sensitivity, many of these datasets remain in the hands of governments and private corporations'. Access to Twitter posts, by contrast, has little by way of restriction, and they used these in their analysis of the protest against the imposition of student fees in the United Kingdom.

Twitter content, they explain, is:

visible to anyone who chooses to search and follow users, and available via Twitter's own Application Programming Interface (API), which—depending on the methods used—allows access to (1) a small selection of the tweets via the search or streaming service, (2) the 'garden-hose', a 10 per cent random sample, or (3) the 'firehose' of all tweets made. Not surprisingly, Twitter has generated a considerable amount of interest amongst social scientists: since its launch in 2008, there have been over 110 scholarly publications about Twitter [International Bibliography of Social Sciences (IBSS), accessed 8 October 2012]. Whilst little of this has been published in mainstream sociology (Murthy, 2012), there is much here to interest sociologists, for instance in attention to practices of impression management, micro-celebrity and personal branding (Hargittai and Litt, 2011; Jackson and Lilleker, 2011; Marwick and boyd, 2011); and to questions of participatory democracy and political mobilisation (Grant et al., 2010; Larsson and Moe, 2011; Segerberg and Bennett, 2011; Tufekci and Wilson, 2012). (Tinati et al., 2014, p. 664)
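
The difference between the 'garden-hose' and the 'firehose' is essentially one of sampling rate. The following is a minimal sketch of that idea only, assuming a simulated stream of tweet identifiers; it does not call Twitter's API, and the function name `sample_stream` and its parameters are invented for illustration.

```python
import random

def sample_stream(stream, rate, seed=0):
    """Yield each item of `stream` with probability `rate`.

    rate=1.0 behaves like a 'firehose' (everything is kept);
    rate=0.1 behaves like a 10 per cent 'garden-hose'.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    for item in stream:
        if rng.random() < rate:
            yield item

# Simulated stream of 10,000 tweet identifiers (invented data).
tweets = range(10_000)

firehose = list(sample_stream(tweets, rate=1.0))
garden_hose = list(sample_stream(tweets, rate=0.1))

print(len(firehose), len(garden_hose))
```

A random sample of this kind preserves overall proportions reasonably well, but it cannot guarantee that both sides of any particular exchange between users are captured.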

Tinati et al. (2014) suggest that big data tends to be approached by social researchers in the same way as standard collected data. To make it manageable, small samples are taken from big data, thus defeating the point of big data. For example, Waters and Williams (2011) analysed 30 tweets from each of 60 Twitter accounts and Jackson and Lilleker (2011) undertook in-depth analysis of the Twitter stream of 51 MPs.

Tinati et al. point out that much is lost in doing this; the latter analysis, for example, permitted no possibility of understanding where and how this content or these users are positioned within the broader Twitter stream. Furthermore, data is taken as a snapshot, which also bypasses the dynamic, emerging nature of the data and the network connections. 'Whilst the key characteristics of Big Data are its scale, proportionality, dynamism and relationality, the methods used in social science have fallen short of enabling us to explore this' (Tinati et al., 2014, p. 667).

Tinati et al. note that the nature of big data is leading to considerable work among computer scientists to create ways of dealing with and analysing it. While much of this is about the technical process, computer scientists are also asking social questions:

Indeed, there is a stream of such research on Twitter from computer science exploring, for instance, friendship networks (Macskassy and Michelson, 2011), political orientations (Conover et al., 2011) and the diffusion of information (Bakshy et al., 2012). At first sight, this might seem to support Savage and Burrows' (2007) claim that the availability of new forms of data is moving the centre of gravity for social research away from sociology, although it is important to note that attention is more often to observing patterns and network structures per se rather than exploring meaning or explanation. Where claims to social knowledge are made these take the form of 'big' claims about the patterns in Twitter, for example using Natural Language Programming and sentiment analysis to search for key words to determine the 'happiness' of a tweet (Dodds et al., 2011), or an individual's political affiliations (Rao et al., 2010). Notably, these approaches favour computational techniques over theoretically informed or conceptually nuanced sociological analysis, let alone fine-grained qualitative analysis, and tend to treat the data as 'naturally occurring' rather than paying any attention to their social and technical constitution. (Tinati et al., 2014, p. 667)
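
Keyword-based scoring of the kind criticised in this extract can be sketched as follows. The word list, the scores and the function name `happiness` are invented for illustration and are not the instrument used by Dodds et al. (2011).

```python
import re

# Hypothetical word scores on a 1 (sad) to 9 (happy) scale, invented here.
WORD_SCORES = {"love": 8, "happy": 8, "great": 7, "hate": 2, "sad": 2, "fees": 3}

def happiness(tweet):
    """Average the scores of the scored words in a tweet.

    Returns None when the tweet contains no scored word.
    """
    words = re.findall(r"[a-z']+", tweet.lower())
    scores = [WORD_SCORES[w] for w in words if w in WORD_SCORES]
    if not scores:
        return None
    return sum(scores) / len(scores)

# Invented example tweets.
examples = [
    "I love this great march",
    "Not happy about fees",
    "On my way to the library",
]
for text in examples:
    print(text, "->", happiness(text))
```

The second example shows the limitation the authors point to: the word 'happy' raises the score even though the tweet negates it, because the method counts key words without attending to context or meaning.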

Tinati et al. suggest a new approach to analysing Twitter data. It combines quantitative and qualitative analysis within a broader methodological approach that draws on what they call 'wide data'. They refer to John Scott (2008), who argued that the power of social network analysis 'would be improved if it were to move beyond static metrics and statistical measures of network structures and connectivity, to expose the temporal nature of the data' (Tinati et al., 2014, p. 668). They present a new software tool developed to meet these challenges, which is outlined in CASE STUDY Twitter analysis.

An alternative approach to analysis of Twitter data can be found in Hilde Stephansen and Nick Couldry's (2014) case study of how teachers and students at an English sixth-form college used Twitter to help construct a community of practice.

They noted that Twitter was intended as a tool for information dissemination, not for community building, although research has demonstrated that it is used for the latter. However, unlike research that examines how communities form on Twitter, Stephansen and Couldry investigated how an already existing community of teachers and students used Twitter to reinforce existing 'offline' social bonds. For them, a 'community of practice' is a joint enterprise that is achieved through sustained interaction and shared practice and goals.

The study used a mixed methodology of interviews and detailed analysis of a departmental Twitter account. This they called 'small data', in contradistinction to big data approaches that use standardised metrics to analyse Twitter.

Although undoubtedly useful, Big Data approaches are also problematic. boyd and Crawford (2012, p. 670) outline a set of 'critical questions' for Big Data, two of which are particularly relevant here. First, Big Data changes the definition of knowledge. By privileging large-scale quantitative approaches, it sidelines other forms of analysis and limits the kinds of questions that can be asked: this has important normative and political consequences. Second, 'Big Data loses its meaning when taken out of context'. Although network analysis can reveal connections and patterns, it has little to say about their meaning and context; nor are such networks necessarily equivalent to personal and social networks...a limitation that also applies to studies that define 'communities' on Twitter in terms of their morphology. (Stephansen and Couldry, 2014, p. 1215)

The analysis of the community of practice was part of a wider project in which:

we conducted around 70 hours of participant observation alongside a total of 22 staff interviews, four student focus groups and 18 individual student interviews. Of these, one student focus group (with two boys and two girls aged 17–18) and two staff interviews (with a teacher and the head of department) focused specifically on the departmental Twitter account.
We analysed a corpus of 4546 tweets, captured from the home page of the departmental Twitter account ('CollegeDept' hereafter) using the NCapture add-on for NVivo, containing tweets and retweets made by CollegeDept between 10 March 2012 and 10 March 2013. This was supplemented by a second dataset captured using the search string 'From:CollegeDept OR @CollegeDept', containing a total of 1753 tweets sent from and to this account between 24 October 2012 and 10 March 2013. The shorter time span covered by the second data set is due partly to limitations of the Twitter API (this only returns tweets sent in the last seven days in response to search queries, whereas up to the last 3200 tweets can be captured from the home page of any given user). An additional obstacle was a technical problem with the NCapture application, which prior to 24 October 2012 prevented the capture of search results. These obstacles are illustrative of the many constraints faced by researchers seeking to capture publicly available Twitter data (boyd & Crawford, 2012). Our Twitter data set is therefore partial, as we only have complete access to tweets sent by students to CollegeDept from 24 October 2012. However, contextual data indicates that teachers carried out a stated intention (interview with teacher) to retweet tweets received from students; therefore we can plausibly claim that our larger data set provides a reasonably comprehensive view of staff-student interactions....
Adopting a purposive, theoretically informed sampling approach, we first chose to focus on particular weeks that could be seen as either typical (weeks with an average number of tweets) or exceptional (weeks with a higher than average number of tweets), our assumption being that this would provide insights into both routine and unusual communication patterns....Clues (Alasuutari, 1995) arising from this initial analysis were used to search out other, related interactions. Additionally, we identified particular Twitter exchanges on the basis of information provided in interviews, and were thus able to triangulate the two types of data.
This particular approach to sampling and analysis was chosen over more conventional methods of qualitative Twitter research such as extracting a random sample of individual tweets and coding such tweets to establish broader categories. Our interest in understanding context, narrative exchange and process precluded such an approach. We attempt instead a 'thick description' (Geertz, 1973) to provide nuance and richness not available from a Big Data perspective. (Stephansen and Couldry, 2014, pp. 1215–6)
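
The first sampling step described in this extract, identifying typical and exceptional weeks by tweet volume, can be sketched as follows. The tweet dates, the helper names `weekly_counts` and `classify_weeks` and the cut-off of 1.5 times the mean are assumptions made for illustration; Stephansen and Couldry do not report the thresholds they used.

```python
from collections import Counter
from datetime import date

def weekly_counts(dates):
    """Count tweets per ISO (year, week number)."""
    return Counter(d.isocalendar()[:2] for d in dates)

def classify_weeks(counts, factor=1.5):
    """Label a week 'exceptional' if its count exceeds factor x the mean.

    The factor is a hypothetical cut-off chosen for this sketch.
    """
    mean = sum(counts.values()) / len(counts)
    return {
        week: "exceptional" if n > factor * mean else "typical"
        for week, n in counts.items()
    }

# Invented tweet dates: three quiet weeks and one busy week in November 2012.
tweet_dates = (
    [date(2012, 11, d) for d in (5, 6, 7)]
    + [date(2012, 11, d) for d in (12, 13)]
    + [date(2012, 11, d) for d in (19, 20, 21)]
    + [date(2012, 11, 26 + i % 5) for i in range(12)]
)

counts = weekly_counts(tweet_dates)
labels = classify_weeks(counts)
for week in sorted(labels):
    print(week, counts[week], labels[week])
```

A count-based step of this kind only selects where to look; the interpretive work of reading the exchanges within the selected weeks remains qualitative.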

Their analysis shows that teachers and students used Twitter to create a shared space for dialogue beyond curriculum and classroom, despite anxieties about professionalism and privacy. Shared meanings, values and identities that facilitated community building were developed through sustained Twitter interactions. 'Perhaps the most significant aspect of the CollegeDept Twitter account is how it was used by teachers to acknowledge students as knowledge sources and contributors to debates, particularly through their practice of retweeting' (Stephansen and Couldry, 2014, p. 1224). They conclude:

While quantitative metrics can provide important insights into the form that online communities might take and the extent of their interactions, an ethnographic and hermeneutic approach is needed to understand how Twitter and other digital platforms become embedded within particular contexts and used by social agents for their own purposes. (Stephansen and Couldry, 2014, p. 1224)

