Year : 2014  |  Volume : 2  |  Issue : 3  |  Page : 100-104

Understanding the process of statistical methods for effective data analysis

Aamir Omair

Department of Medical Education, College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia

Date of Web Publication: 31-Jul-2014

Correspondence Address:
Aamir Omair
Department of Medical Education, College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh
Saudi Arabia

DOI: 10.4103/1658-600X.137882


The purpose of this article is to provide a basic understanding of the statistical methods for conducting effective data analysis. Quantitative research involves the collection and analysis of different types of variables in the form of raw data, which need to be cleaned before the data analysis is started. A biostatistician must be involved from the planning stages of the research process to ensure the validity of the sampling process and the collected data.
The statistical analysis includes descriptive analysis for summarizing the data and inferential statistics for comparing subgroups to determine whether there is a statistically significant association. The relevant statistical tests must be applied and the results appropriately reported using P-values and confidence intervals. The possibility of type I and type II errors should be considered during the final interpretation of the results, and the clinical significance of the results should also be considered even when the P-values are statistically significant.

Keywords: Statistical analysis, data cleaning, tests of significance, types of errors, clinical significance

How to cite this article:
Omair A. Understanding the process of statistical methods for effective data analysis. J Health Spec 2014;2:100-4

How to cite this URL:
Omair A. Understanding the process of statistical methods for effective data analysis. J Health Spec [serial online] 2014 [cited 2020 Dec 2];2:100-4. Available from: https://www.thejhs.org/text.asp?2014/2/3/100/137882

Introduction

Quantitative research in health sciences involves the collection of data from different sources. [1] This data includes demographic variables (e.g. age, gender), biological variables (e.g. weight, blood pressure), risk factors for disease (e.g. smoking status, obesity), outcome variables (e.g. survival data, length of hospital stay), etc. The purpose of statistical analysis is to process this 'raw' data into an organized form so that it provides the required information in summary (descriptive statistics). [2] The inferential statistical analysis makes generalizations about the population based on a 'representative sample' taken from the study population; it also compares the results between different subgroups of the data sampled to determine any difference or association between the predictor (independent) and outcome (dependent) variables based on the objectives/hypotheses of the study. The purpose of inferential statistics is to determine with a degree of confidence whether the observed differences are statistically significant or may be due to chance alone using the P-value/confidence interval. [3]

Data Management

It is a common mistake to first collect the data and then take it to the (bio)statistician for analysis. It is strongly recommended that the biostatistician be involved from the initial stage of planning and development of the research study. [4] This will ensure that the data collected can be processed more efficiently and produce results in a more comprehensive manner without losing any data related to important variables. The data collected undergo a series of editing, entry, cleaning and management steps (recoding and transformation) before they are suitable for analysis. [5] Attempting the analysis before ensuring that the data are clean leads to loss of time, wastage of data, inappropriate output and a lot of frustration. A common source of error is a mistake in data collection or data entry, which produces 'outlier' values in numerical variables or values outside the range of the codes defined for categorical variables. [6] For example, a height of 173 cm may be wrongly entered as 73 cm or 730 cm, which would distort the mean from its actual value, especially if the sample size is relatively small. Categorical variables may also be wrongly entered, e.g. entering a value of 3 or 11 for gender where the coding is '1' for male and '2' for female. Another common mistake is to enter a '2' for a question coded as '1' for Yes and '0' for No. These errors need to be identified and corrected (where possible) before the analysis is done, otherwise they will lead to incorrect results. [7]
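The range and code checks described above can be sketched in a few lines of Python. This is only an illustration, not a procedure prescribed in this article: the records, the field names (`height_cm`, `gender`, `smoker`) and the plausible height range are all assumptions chosen to mirror the examples in the text.

```python
# Hypothetical records; field names, codes and values are illustrative assumptions.
records = [
    {"id": 1, "height_cm": 173, "gender": 1, "smoker": 1},
    {"id": 2, "height_cm": 73, "gender": 2, "smoker": 0},    # 173 mistyped as 73
    {"id": 3, "height_cm": 730, "gender": 3, "smoker": 2},   # three entry errors
    {"id": 4, "height_cm": 165, "gender": 11, "smoker": 0},  # invalid gender code
]

HEIGHT_RANGE_CM = (120, 220)  # assumed plausible range for adult height
VALID_CODES = {
    "gender": {1, 2},   # 1 = male, 2 = female
    "smoker": {0, 1},   # 1 = Yes, 0 = No
}


def find_entry_errors(rows):
    """Return (record id, field, value) for each out-of-range or miscoded value."""
    problems = []
    low, high = HEIGHT_RANGE_CM
    for row in rows:
        if not low <= row["height_cm"] <= high:
            problems.append((row["id"], "height_cm", row["height_cm"]))
        for field, allowed in VALID_CODES.items():
            if row[field] not in allowed:
                problems.append((row["id"], field, row[field]))
    return problems


errors = find_entry_errors(records)
for record_id, field, value in errors:
    print(f"record {record_id}: check {field} = {value}")
```

A check of this kind only flags suspicious values; whether a flagged value can be corrected still depends on going back to the original data collection form.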

Once the data have been appropriately cleaned and recoded, they are saved as a separate file for analysis. It is a good idea to keep the initial data file as a backup in case the original data are required at a future stage. [8] If the data are in the required format, then the basic statistical analysis can generally be completed in less than 4 hours. But in most cases (where the data have not been appropriately managed) at least another 6-8 hours are required for cleaning the data and transforming them into a suitable file for analysis. In certain cases the data cleaning and transformation has taken up to 16-20 hours. [9]

Statistical Analysis

The initial step in statistical analysis is to produce the descriptive statistics, which summarize and present the data in the form of numerical summaries and graphs. [10] Categorical data (e.g. gender, education level, disease status) are presented as frequencies and percentages; numerical variables (e.g. age, BMI, lab results, hospital stay) are presented as mean ± standard deviation (SD) if normally distributed or as median and interquartile range for skewed data. Some examples of how the data are normally presented are given in [Table 1]. The descriptive analysis is most appropriately presented in table form as the first table in the manuscript. If required, some descriptive statistics may be presented as graphs for oral presentations and posters (but generally not in manuscripts for publication). It is recommended to use bar charts for categorical variables and box plots for numerical data [Figure 1] and [Figure 2]. [11] Pie charts and histograms are generally not recommended for use in scientific presentations. [12],[13]
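As a rough illustration of these summaries, the sketch below computes frequencies and percentages for a categorical variable, mean ± SD for a roughly symmetric numerical variable, and median with quartiles for a skewed one. It uses only the Python standard library; the sample values and variable names are hypothetical and are not taken from [Table 1].

```python
import statistics
from collections import Counter

# Hypothetical sample; variable names and values are illustrative only.
gender = ["male", "female", "female", "male", "female"]
age_years = [34, 41, 29, 38, 45]        # roughly symmetric
hospital_stay_days = [2, 3, 3, 4, 21]   # skewed by one long stay


def frequency_table(values):
    """Return {category: (count, percentage)} for a categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {category: (n, round(100 * n / total, 1)) for category, n in counts.items()}


def mean_sd(values):
    """Mean and sample standard deviation, for roughly normal data."""
    return statistics.mean(values), statistics.stdev(values)


def median_iqr(values):
    """Median with first and third quartiles, for skewed data."""
    # Quartile conventions differ between software packages; this uses
    # the default method of statistics.quantiles.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return statistics.median(values), q1, q3


for category, (count, percent) in frequency_table(gender).items():
    print(f"{category}: {count} ({percent}%)")

mean_age, sd_age = mean_sd(age_years)
print(f"Age: {mean_age:.1f} ± {sd_age:.1f} years")

median_stay, q1_stay, q3_stay = median_iqr(hospital_stay_days)
print(f"Hospital stay: median {median_stay} days (IQR {q1_stay} to {q3_stay})")
```

In the hypothetical hospital stay data a single long admission pulls the mean well above the typical value, which is why the median and interquartile range are the more informative summary for skewed variables.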

The inferential statistical analysis should be performed first for the stated aim and objectives of the study. This process involves testing the hypothesis and reaching a conclusion for the predetermined study objectives. A hypothesis is a statement regarding the research question to be studied [14] and is stated as the Null (Ho) and Alternate (Ha) hypotheses. The null hypothesis generally states that there is no difference (the opposite of the proposed research objective), while the alternate hypothesis states the research question to be studied in statement form. [15] For example, if the research question is to determine whether doctors are more likely to develop heart disease than teachers, then the null and alternate hypotheses would be stated as:
Ho: There is no difference in the occurrence of heart disease between doctors and teachers.
Ha: The occurrence of heart disease is higher in doctors than in teachers.

Figure 1: Presenting distribution of categorical data using bar chart
Figure 2: Presenting distribution of numerical data using box plots
Table 1: Presenting descriptive statistics of categorical and numerical data
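To make the doctors-versus-teachers hypotheses concrete, the sketch below tests them with a one-sided, normal-approximation z-test for two proportions. The z-test is used here only as a simple stand-in that needs no external library; it is not necessarily the test this article's tables would recommend. The counts are hypothetical and do not come from any study.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Test Ha: proportion in group A is higher than in group B.

    Returns (z statistic, one-sided P-value) using the pooled proportion
    under Ho (no difference between the groups).
    """
    p_a = events_a / n_a
    p_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    p_value = 1 - NormalDist().cdf(z)  # upper tail only, because Ha is one-sided
    return z, p_value


# Hypothetical counts: heart disease in 30 of 200 doctors and 18 of 200 teachers.
z, p_value = one_sided_two_proportion_z_test(30, 200, 18, 200)
print(f"z = {z:.2f}, one-sided P = {p_value:.3f}")
if p_value < 0.05:
    print("Reject Ho: the data support a higher occurrence in doctors.")
else:
    print("Do not reject Ho: the data do not show a significant difference.")
```

The one-sided test matches the directional wording of Ha; if the research question were simply whether the two groups differ, a two-sided P-value would be the appropriate choice.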

Inferential statistics involves making an inference for the general population based on the results from a sample of the study population. [16] It is important that the required sample size is determined before starting the study in order to ensure that the study has the required power to identify a difference, if present, or to determine the expected outcome variable within a required margin of accuracy. [17] The requirements for determining the required sample size will be discussed in detail in a following article in this series on 'sample size and sampling technique'. It is essential that the relevant statistical tests be applied in order to ensure the validity of the results. A summary of the commonly used tests for statistical analysis is presented in [Table 2]. [18] Further statistical analysis may need to be performed such as multivariable analysis or logistic regression in order to remove the effects of confounding variables and to establish the true effect of the predictor variables on the outcome of the study. [19],[20]
Table 2: Choosing the appropriate commonly used statistical tests according to type of grouping and outcome variables

Statistical Significance

It is important to understand the concepts of the P-value and the confidence interval when reporting the results of inferential statistics. A P-value is the probability of obtaining a difference or association at least as large as the one observed if chance alone were responsible, that is, if the null hypothesis were true. [3] As shown in [Table 3], the P-values of 0.004 and 0.02 (marked with *) indicate that the probability of obtaining such a difference by chance alone is 0.4% and 2% respectively, which is considered statistically significant. The P-values of 0.07 and 0.35 in the table, however, indicate that differences of this size would arise by chance alone 7% and 35% of the time respectively, and so these should not be reported as significant differences between the two groups. [21]

It is also important to consider that there is a potential for making an error based on the results of any study. The types of errors are classified as Type-I (α error) or Type-II (β error) and are based on the conclusion of the statistical tests. [22] A Type-I error is incorrectly concluding that there is a difference when there is actually none. The P-value indicates the risk of making such an error, and a P-value of <0.05 is generally considered statistically significant. This means that if it is concluded that there is a difference between the groups (or between the sample and a standard value), the probability of obtaining such a result by chance alone, when no true difference exists, is less than 5%. On the other hand, a P-value of ≥0.05 should not be taken to mean that 'there is no difference between the groups'. [23] Concluding that there is no difference when one actually exists constitutes a Type-II error, and its probability is generally not reported in the statistical results. So whenever the P-value is ≥0.05, it is more appropriate to state that 'the results of this study do not show a significant difference' than to state that the groups are similar or not different. The P-values in [Table 3] indicate the chance of making a Type-I error if it is concluded that 'there is a difference in the attitudes regarding research between the pre-clinical and the clinical students'. This conclusion would be relevant for the two P-values that are less than 0.05, but it would be inappropriate to state that 'the attitudes regarding research are not different (or the same)' for the two statements that have a P-value of more than 0.05. While such a statement may appear reasonable for 'mandatory research time in the curriculum', which has a P-value of 0.35, it would be wrong to assert that there is no difference between the two groups with regard to their interest in research (P = 0.07). This is because in both cases a difference may exist, but this study may not have had the required sample size (that is, power) to identify it. [24]
Table 3: Interpreting the significance of reported P-values
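The link between sample size, power and the Type-II error can be illustrated numerically. The sketch below approximates the power of a two-sided, two-sample z-test with equal group sizes; it is a simplified textbook approximation, not a substitute for a formal sample size calculation. The true difference of 5 units and the SD of 17 are hypothetical values chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(true_difference, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with equal group sizes.

    Assumes a common, known SD in both groups. The small probability of
    rejecting Ho in the wrong direction is ignored.
    """
    standard_normal = NormalDist()
    z_critical = standard_normal.inv_cdf(1 - alpha / 2)
    standard_error = sd * sqrt(2 / n_per_group)
    return standard_normal.cdf(abs(true_difference) / standard_error - z_critical)


# Hypothetical scenario: a true difference of 5 units with an SD of 17.
for n in (20, 50, 100, 200, 400):
    power = approximate_power(true_difference=5, sd=17, n_per_group=n)
    print(f"n per group = {n:3d}: power = {power:.2f}, Type-II error = {1 - power:.2f}")
```

With small groups the same true difference is usually missed, which is why a P-value of ≥0.05 from an underpowered study cannot be read as evidence that the groups are the same.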

It is also important to remember that, whatever the magnitude of the difference between two or more groups, it cannot be considered statistically significant if the P-value is not less than 0.05, since it may be due to chance alone. On the other hand, a statistically significant difference is not necessarily 'clinically' (or practically) significant. [25] For example, suppose a clinical trial on a group of 200 hypertensive patients shows that the mean systolic blood pressure is reduced from 154 ± 18 mmHg to 149 ± 16 mmHg with a P-value of 0.04. This difference may be 'statistically significant', but it needs to be considered whether it is also 'clinically significant'. A reduction of 5 mmHg in this case is not sufficient to consider the intervention clinically effective, since the mean blood pressure is still not within the desired levels for control of hypertension. [23]
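The two separate questions in the blood pressure example can be written out as a small check. The P-value and the post-treatment mean come from the example above; the target of 140 mmHg for controlled systolic blood pressure is an assumption introduced here for illustration, since the article does not state a specific threshold.

```python
def interpret_result(p_value, mean_after, target, alpha=0.05):
    """Report statistical and clinical significance as two separate judgments."""
    statistically_significant = p_value < alpha
    clinically_sufficient = mean_after <= target
    if statistically_significant and clinically_sufficient:
        return "statistically significant and clinically sufficient"
    if statistically_significant:
        return "statistically significant but not clinically sufficient"
    return "not statistically significant"


# Values from the example: P = 0.04, mean systolic pressure after treatment 149 mmHg.
ASSUMED_TARGET_MMHG = 140  # hypothetical control target, not stated in the article
conclusion = interpret_result(p_value=0.04, mean_after=149, target=ASSUMED_TARGET_MMHG)
print(conclusion)
```

In practice clinical significance is a matter of clinical judgment rather than a single cut-off, so a rule like this can only prompt the question, not settle it.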

Summary

In order to ensure the validity of the data collected and the efficiency of the data analysis process, it is important to consult a biostatistician from the initial stages of the research process. Once collected, the data must be edited and cleaned before the analysis is started in order to avoid reporting spurious results. Data analysis involves descriptive statistics for summarizing the general distribution of the data as well as inferential statistics for making inferences about the population and comparing the results between different subgroups of the study population. The important considerations for statistical analysis include having the required sample size and a representative sample, as well as selecting the appropriate statistical test for the different types of variables. The significance of the statistical results is based on the P-values and confidence intervals. The possibility of type I and type II errors, and whether the results are clinically significant, should be carefully considered in the final interpretation of the results.

References

1. Fraser Health. Introduction to statistics and quantitative research methods. Available from: http://research.fraserhealth.ca/media/Introduction-to-Statistics-and-Quantitative-Research-Methods.pdf. [3rd Jul, 2014].
2. American Health Information Management Association. Health data analysis toolkit. 2011. Available from: http://library.ahima.org/xpedio/groups/public/documents/ahima/bok1_048618.pdf. [3rd Jul, 2014].
3. Davies HT, Crombie IK. What are confidence intervals and p-values? Hayward Medical Communications 2009. Available from: http://www.medicine.ox.ac.uk/bandolier/painres/download/whatis/what_are_conf_inter.pdf. [3rd Jul, 2014].
4. American Statistical Association. When you consult a statistician. What to expect. Section on Statistical Consulting 2003. Available from: http://www.amstat.org/sections/cnsl/brochures/SCSbrochure.pdf. [3rd Jul, 2014].
5. Mitchel JT, Kim YJ, Choi J, Park G, Cappi S, Horn D, et al. Evaluation of data entry errors and data changes to an electronic data capture clinical trial database. Drug Inf J 2011;45:421-30.
6. Schoenbach VJ. Data analysis and interpretation. Available from: http://www.epidemiolog.net/evolving/DataAnalysis-and-interpretation.pdf. [3rd Jul, 2014].
7. Hellerstein JM. Quantitative data cleaning for large databases. 2008. Available from: http://db.cs.berkeley.edu/jmh/papers/cleaning-unece.pdf. [3rd Jul, 2014].
8. San Jose State University. Data management. Available from: http://www.sjsu.edu/faculty/gerstman/StatPrimer/dataentry.PDF. [3rd Jul, 2014].
9. Wickham H. Tidy data. J Stat Software. Available from: http://vita.had.co.nz/papers/tidy-data.pdf. [3rd Jul, 2014].
10. Pearson Higher Education. Descriptive statistics. Available from: http://www.pearsonhighered.com/sullivan/sul_alg_trig/Ch2_org_summ_data_.pdf. [4th Jul, 2014].
11. Streit M, Gehlenborg N. Bar charts and box plots. Nat Methods 2014;11:117.
12. Wisconsin Hospital Association Quality Center. Using graphs to display data. Available from: http://www.whaqualitycenter.org/Portals/0/Tools%20to%20Use/Making%20Sense%20of%20Data/Using%20Graphs%20to%20Display%20Data%20R%202-12.pdf. [4th Jul, 2014].
13. Kelly D, Jasperse J, Westbrooke I. Designing science graphs for data analysis and presentation. The bad, the good and the better. Department of Conservation Technical Series 32. Science & Technical Publishing. Wellington: New Zealand. 2005. Available from: http://www.doc.govt.nz/Documents/science-and-technical/docts32entire.pdf. [4th Jul, 2014].
14. Cherry K. What is a hypothesis. About.com Psychology. 2014. Available from: http://psychology.about.com/od/hindex/g/hypothesis.htm. [4th Jul, 2014].
15. Davis RB, Mukamal KJ. Hypothesis testing: Means. Circulation 2006;114:1078-82.
16. Gabrenya WK. Inferential statistics: Basic concepts. Available from: http://my.fit.edu/~gabrenya/IntroMethods/eBook/inferentials.pdf. [4th Jul, 2014].
17. Suresh KP, Chandrashekara S. Sample size estimation and power analysis for clinical research studies. J Hum Reprod Sci 2012;5:7-13.
18. Omair A. Presenting your results-II: Inferential statistics. J Pak Med Assoc 2012;62:1254-7.
19. BMJ. Study design and choosing a statistical test. Available from: http://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/13-study-design-and-choosing-statisti. [4th Jul, 2014].
20. Gunawardana N. Choosing the correct statistical test made easy. Available from: http://www.med.cmb.ac.lk/SMJ/VOLUME%203%20DOWNLOADS/Page%2033-37%20-%20Choosing%20the%20correct%20statistical%20test%20made%20easy.pdf. [4th Jul, 2014].
21. Biau DJ, Jolles BM, Porcher R. P value and the theory of hypothesis testing: An explanation for new researchers. Clin Orthop Relat Res 2010;468:885-92.
22. University of California, Berkeley. Multiple hypothesis testing and false discovery rate. Available from: http://www.stat.berkeley.edu/users/hhuang/STAT141/Lecture-FDR.pdf. [4th Jul, 2014].
23. McCluskey A, Lalkhen AG. Statistics IV. Interpreting the results of statistical tests. Contin Educ Anaesth Crit Care Pain 2007;7:208-12.
24. Johnson DH. The insignificance of statistical significance testing. University of Nebraska - Lincoln 1999. Available from: http://digitalcommons.unl.edu/cgi/viewcontent.cgi?article=1225&context=usgsnpwrc. [4th Jul, 2014].
25. Sedgwick P. Clinical significance versus statistical significance. BMJ 2014;348:g2130.

