

ORIGINAL ARTICLE 

Year: 2014, Volume: 2, Issue: 3, Page: 100-104

Understanding the process of statistical methods for effective data analysis
Aamir Omair
Department of Medical Education, College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
Date of Web Publication: 31-Jul-2014
Correspondence Address: Aamir Omair, Department of Medical Education, College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
DOI: 10.4103/1658-600X.137882
The purpose of this article is to provide a basic understanding of the statistical methods for conducting effective data analysis. Quantitative research involves the collection and analysis of different types of variables in the form of raw data, which needs to be cleaned before the data analysis is started. A biostatistician must be involved from the planning stages of the research process to ensure the validity of the sampling process and the collected data. The statistical analysis includes descriptive analysis for summarizing the data and inferential statistics for comparing subgroups to determine whether there is a statistically significant association. The relevant statistical tests must be applied and the results appropriately reported using P-values and confidence intervals. The possibility of type I and type II errors should be considered during the final interpretation of the results, as should the clinical significance of the results even when the P-values are statistically significant.
Keywords: Statistical analysis, data cleaning, tests of significance, types of errors, clinical significance
How to cite this article: Omair A. Understanding the process of statistical methods for effective data analysis. J Health Spec 2014;2:100-4
Introduction   
Quantitative research in health sciences involves the collection of data from different sources. ^{[1]} This data includes demographic variables (e.g. age, gender), biological variables (e.g. weight, blood pressure), risk factors for disease (e.g. smoking status, obesity), outcome variables (e.g. survival data, length of hospital stay), etc. The purpose of statistical analysis is to process this 'raw' data into an organized form so that it provides the required information in summary (descriptive statistics). ^{[2]} Inferential statistical analysis makes generalizations about the population based on a 'representative sample' taken from the study population; it also compares the results between different subgroups of the sampled data to determine any difference or association between the predictor (independent) and outcome (dependent) variables, based on the objectives/hypotheses of the study. The purpose of inferential statistics is to determine, with a stated degree of confidence and using the P-value or confidence interval, whether the observed differences are statistically significant or may be due to chance alone. ^{[3]}
Data Management   
It is a common mistake to first collect the data and then take it to the (bio)statistician for analysis. It is strongly recommended that the biostatistician be involved from the initial stage of planning and development of the research study. ^{[4]} This ensures that the data collected can be processed more efficiently and can produce more comprehensive results without losing any data related to important variables. The data collected undergoes a series of editing, entry, cleaning and management steps (recoding and transformation) before it is suitable for analysis. ^{[5]} Attempting the analysis before ensuring that the data is clean leads to loss of time, wastage of data, inappropriate output and a great deal of frustration. A common source of error is a mistake in data collection or data entry, which produces 'outlier' values in numerical variables or values outside the range of the coded values for categorical variables. ^{[6]} For example, a height of 173 cm may be wrongly entered as 73 cm or 730 cm, which would distort the mean away from its actual value, especially if the sample size is relatively small. Categorical variables may also be wrongly entered, e.g. entering a value of 3 or 11 for gender where the coding is '1' for male and '2' for female. Another common mistake is to enter a '2' for a question coded as '1' for Yes and '0' for No. These errors need to be identified and corrected (where possible) before the analysis, otherwise they will lead to incorrect results. ^{[7]}
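The range and code checks described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the records, the variable names, and the plausible height range of 120 to 220 cm are assumptions made for the example, while the coding schemes for gender and the Yes/No question follow the text.

```python
# Hypothetical data-entry records (not taken from the article).
records = [
    {"id": 1, "height_cm": 173, "gender": 1, "smoker": 0},
    {"id": 2, "height_cm": 73, "gender": 2, "smoker": 1},
    {"id": 3, "height_cm": 730, "gender": 3, "smoker": 2},
    {"id": 4, "height_cm": 165, "gender": 11, "smoker": 1},
]

# Validation rules. The height limits are an assumed plausible adult range;
# the allowed codes mirror the coding schemes mentioned in the text.
RULES = {
    "height_cm": lambda v: 120 <= v <= 220,
    "gender": lambda v: v in {1, 2},   # 1 = male, 2 = female
    "smoker": lambda v: v in {0, 1},   # 1 = Yes, 0 = No
}


def find_entry_errors(rows, rules):
    """Return (record id, variable, value) for every value that breaks a rule."""
    problems = []
    for row in rows:
        for name, is_valid in rules.items():
            if not is_valid(row[name]):
                problems.append((row["id"], name, row[name]))
    return problems


errors = find_entry_errors(records, RULES)
for rec_id, name, value in errors:
    print(f"record {rec_id}: {name} = {value} needs checking against the source form")
```

Flagged values are only listed, not changed, so that each one can be checked against the original data collection form before any correction is made.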
Once the data has been appropriately cleaned and recoded, it is saved as a separate file for analysis. It is a good idea to keep the initial data file as a backup in case the original data is required at a future stage. ^{[8]} If the data is in the required format then the basic statistical analysis process can generally be completed in less than 4 hours. But in most cases (where the data has not been appropriately managed) it requires at least another 6-8 hours for cleaning the data and transforming it into a suitable file for data analysis. In certain cases the data cleaning and transformation has taken up to 16-20 hours. ^{[9]}
Statistical Analysis   
The initial step in statistical analysis is to produce the descriptive statistics, which summarize and present the data in the form of numerical summaries and graphs. ^{[10]} Categorical data (e.g. gender, education level, disease status) are presented as frequencies and percentages; numerical variables (e.g. age, BMI, lab results, hospital stay) are presented as mean ± standard deviation (SD) if normally distributed or as median and interquartile range for skewed data. Some examples of how the data are normally presented are given in [Table 1]. The descriptive analysis is most appropriately presented in table form as the first table in the manuscript. If required, some descriptive statistics may be presented as graphs for oral presentations and posters (but generally not in manuscripts for publication). It is recommended to use bar charts for categorical variables and box plots for numerical data [Figure 1] and [Figure 2]. ^{[11]} Pie charts and histograms are generally not recommended for use in scientific presentations. ^{[12],[13]}
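As a minimal sketch of these summaries, the following Python example computes frequencies with percentages for a categorical variable, and mean ± SD alongside median and interquartile range for numerical variables. All values are hypothetical, and the quartiles use the default method of `statistics.quantiles`, which may differ slightly from the method used by a given statistics package.

```python
import statistics
from collections import Counter


def describe_categorical(values):
    """Frequency and percentage (one decimal place) for each category."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, round(100 * n / total, 1)) for cat, n in counts.items()}


def describe_numerical(values):
    """Mean and sample SD, plus median and quartiles for skewed data."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample SD (n - 1 denominator)
        "median": median,
        "iqr": (q1, q3),
    }


# Hypothetical sample data.
gender = ["male", "female", "female", "male", "female"]
age_years = [34, 41, 29, 50, 46]                   # roughly symmetric
hospital_stay_days = [2, 3, 3, 4, 4, 5, 6, 9, 14, 30]  # right-skewed

gender_summary = describe_categorical(gender)
age_summary = describe_numerical(age_years)
stay_summary = describe_numerical(hospital_stay_days)

for category, (n, pct) in gender_summary.items():
    print(f"{category}: {n} ({pct}%)")
print(f"Age: {age_summary['mean']:.1f} ± {age_summary['sd']:.1f} years")
q1, q3 = stay_summary["iqr"]
print(f"Hospital stay: median {stay_summary['median']} days (IQR {q1} to {q3})")
```

For the skewed hospital stay values the mean (8 days) is pulled upwards by the single stay of 30 days, which is why the median and interquartile range are the preferred summary in that case.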
Table 1: Presenting descriptive statistics of categorical and numerical data
The inferential statistical analysis should be performed first for the stated aim and objectives of the study. This process is related to testing the hypothesis and reaching a conclusion for the predetermined study objectives. A hypothesis is a statement regarding the research question to be studied ^{[14]} and is stated as the Null (Ho) and Alternate (Ha) hypotheses. The null hypothesis generally states that there is no difference (the opposite of the proposed research objective), while the alternate hypothesis states the research question to be studied in statement form. ^{[15]} For example, if the research question is to determine whether doctors are more likely to develop heart disease than teachers, then the null and alternate hypotheses would be stated as:
Ho: There is no difference in the occurrence of heart disease between doctors and teachers.
Ha: The occurrence of heart disease is higher in doctors than in teachers.
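To make the pair of hypotheses concrete, the sketch below tests Ho against the one-sided Ha using entirely hypothetical counts (30 of 200 doctors and 18 of 200 teachers with heart disease). It uses a normal-approximation z-test for two proportions, chosen here only because it needs nothing beyond the Python standard library; it is an illustrative substitute and not necessarily the test the article's tables recommend for this design.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_two_proportion_test(events_a, n_a, events_b, n_b):
    """z statistic and one-sided P-value for Ha: proportion in A > proportion in B."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    # Under Ho both groups share one proportion, so the counts are pooled.
    pooled = (events_a + events_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Upper-tail probability: how often chance alone gives a z this large.
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Hypothetical counts: doctors (group A) versus teachers (group B).
z, p_value = one_sided_two_proportion_test(30, 200, 18, 200)
print(f"z = {z:.2f}, one-sided P = {p_value:.3f}")
```

With these made-up counts the P-value falls below 0.05, so Ho would be rejected in favour of Ha; with smaller groups and the same proportions it would not.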
Inferential statistics involves making an inference about the general population based on the results from a sample of the study population. ^{[16]} It is important that the required sample size is determined before starting the study, to ensure that the study has the required power to identify a difference, if present, or to determine the expected outcome variable within a required margin of accuracy. ^{[17]} The requirements for determining the sample size will be discussed in detail in a following article in this series on 'sample size and sampling technique'. It is essential that the relevant statistical tests be applied in order to ensure the validity of the results. A summary of the commonly used tests for statistical analysis is presented in [Table 2]. ^{[18]} Further statistical analysis, such as multivariable analysis or logistic regression, may need to be performed in order to remove the effects of confounding variables and to establish the true effect of the predictor variables on the outcome of the study. ^{[19],[20]}
Table 2: Choosing the appropriate commonly used statistical tests according to type of grouping and outcome variables
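As a rough illustration of why the sample size has to be fixed in advance, the sketch below applies a widely used normal-approximation formula for comparing two proportions. The formula, the expected proportions (15% versus 9%), the two-sided 5% significance level and the 80% power are all assumptions chosen for the example; they do not come from the article, and a real study should have its sample size calculated with a biostatistician.

```python
from math import ceil
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate number of participants needed in EACH group.

    Uses n = (z_(1-alpha/2) + z_(power))^2 * [p1(1-p1) + p2(1-p2)] / (p1 - p2)^2,
    a standard normal-approximation formula without continuity correction.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return ceil(n)


# Hypothetical expected proportions of the outcome in the two groups.
n_per_group = sample_size_two_proportions(0.15, 0.09)
print(f"Required sample size: {n_per_group} per group")
```

Halving the expected difference roughly quadruples the required sample size, which is why an over-optimistic estimate of the effect leaves a study underpowered.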
Statistical Significance   
It is important to understand the concepts of the P-value and the confidence interval for reporting the results of inferential statistics. A P-value is the probability of observing a difference or association at least as large as the one found if chance alone were operating, that is, if the null hypothesis were true. ^{[3]} As shown in [Table 3], the P-values of 0.004 and 0.02 (marked with *) indicate that the probability of obtaining such a difference by chance alone is 0.4% and 2% respectively, which is considered statistically significant. The P-values of 0.07 and 0.35 in the table, however, indicate a 7% and 35% probability respectively of obtaining such a difference by chance alone, so these differences should not be reported as significant between the two groups. ^{[21]}
It is also important to consider that there is a potential for making an error based on the results of any study. The errors are classified as Type I (α error) or Type II (β error) according to the conclusion drawn from the statistical tests. ^{[22]} A Type I error is incorrectly stating that there is a difference when there is actually no difference. The P-value indicates the risk of making such an error. A P-value of <0.05 is generally considered statistically significant. This means that if it is concluded that there is a difference between the groups (or between the sample and a standard value), the probability of making a Type I error is less than 5%. On the other hand, a P-value of ≥0.05 should not be taken to mean that 'there is no difference between the groups'. ^{[23]} Drawing that conclusion when a difference actually exists is what constitutes a Type II error, and its probability is generally not reported in the statistical results. So whenever the P-value is ≥0.05 it is more appropriate to state that 'the results of this study do not show a significant difference' than to state that the groups are similar or not different. The P-values in [Table 3] indicate the chance of making a Type I error if it is concluded that 'there is a difference in the attitudes regarding research between the preclinical and the clinical students'. Such a conclusion is appropriate for the two P-values that are less than 0.05, but it would be inappropriate to state that 'the attitudes regarding research are not different (or the same)' for the two statements that have a P-value of more than 0.05. While this may appear reasonable for the statement about 'mandatory research time in the curriculum', which has a P-value of 0.35, it would be wrong to assert that there is no difference between the two groups with regard to their interest in research (P = 0.07). This is because in both cases a difference may exist, but this study did not have the required sample size (that is, power) to identify it. ^{[24]}
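The link between sample size, power and the risk of a Type II error can be sketched numerically. The example below assumes a true difference between two proportions (15% versus 9%, hypothetical values) and uses a normal approximation to estimate how often a two-sided test at the 5% level would detect it for several group sizes. The function name and all figures are illustrative assumptions, not results from the study discussed in the text.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test comparing two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    # Probability that the test statistic exceeds the critical value
    # when the true difference really is p1 - p2.
    return NormalDist().cdf(abs(p1 - p2) / se - z_alpha)


# Hypothetical true proportions; only the group size changes.
power_by_n = {n: power_two_proportions(0.15, 0.09, n) for n in (50, 100, 200, 500)}
for n, power in power_by_n.items():
    print(f"n per group = {n:3d}: power = {power:.2f}, "
          f"Type II error risk = {1 - power:.2f}")
```

Under these assumptions a study with 100 participants per group would miss a real difference about three times out of four, so a non-significant P-value from such a study says little about whether the groups truly differ.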
It is also important to remember that, no matter how large the difference between two or more groups, it cannot be considered significant if the P-value is not less than 0.05, since it may be due to chance alone. On the other hand, a statistically significant difference does not mean that the difference is 'clinically' (or practically) significant. ^{[25]} For example, suppose a clinical trial on a group of 200 hypertensive patients shows that the mean systolic blood pressure is reduced from 154 ± 18 mmHg to 149 ± 16 mmHg with a P-value of 0.04. This difference may be 'statistically significant', but it needs to be considered whether it is also 'clinically significant'. A reduction of 5 mmHg in this case is not sufficient to consider the intervention clinically effective, since the mean blood pressure is still not within the desired levels for control of hypertension. ^{[23]}
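A confidence interval helps with this judgement because it shows the range of reductions compatible with the data. The text does not give the design details needed for an exact interval (for a before-and-after comparison the SD of the within-patient changes would be required), so the sketch below makes a loud assumption purely for illustration: it treats the two means as coming from two independent groups of 100 patients each and uses a normal approximation. Under that assumption the P-value comes out close to the quoted 0.04, but the interval should be read as an illustration only.

```python
from math import sqrt
from statistics import NormalDist


def difference_with_ci(mean_a, sd_a, n_a, mean_b, sd_b, n_b, level=0.95):
    """Difference in means, its confidence interval and two-sided P-value.

    Normal approximation for two INDEPENDENT groups (an assumption here).
    """
    diff = mean_a - mean_b
    se = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se))
    return diff, (diff - z_crit * se, diff + z_crit * se), p_value


# Means and SDs from the text; the split into 100 + 100 patients is assumed.
diff, (low, high), p_value = difference_with_ci(154, 18, 100, 149, 16, 100)
print(f"Reduction: {diff} mmHg (95% CI {low:.1f} to {high:.1f}), P = {p_value:.2f}")
```

Under these assumptions the interval runs from well under 1 mmHg to almost 10 mmHg, so the data are compatible with a reduction that is clinically negligible even though the P-value is below 0.05.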
Summary   
In order to ensure the validity of the data collected and the efficiency of the data analysis process, it is important to consult a biostatistician from the initial stages of the research process. Once collected, the data must be edited and cleaned before the data analysis is started in order to avoid reporting spurious results. Data analysis involves descriptive statistics for summarizing the general distribution of the data as well as inferential statistics for making inferences about the population and comparing the results between different subgroups of the study population. The important considerations for statistical analysis include having the required sample size and a representative sample, as well as selecting the appropriate statistical test for the different types of variables. The significance of the statistical results is based on the P-values and confidence intervals. The possibility of type I and type II errors, and whether or not the results are clinically significant, should be carefully considered in the final interpretation of the results.
References   
1. Fraser Health. Introduction to statistics and quantitative research methods. Available from: http://research.fraserhealth.ca/media/IntroductiontoStatisticsandQuantitativeResearchMethods.pdf. [3rd Jul, 2014].
2. American Health Information Management Association. Health data analysis toolkit. 2011. Available from: http://library.ahima.org/xpedio/groups/public/documents/ahima/bok1_048618.pdf. [3rd Jul, 2014].
3. Davies HT, Crombie IK. What are confidence intervals and p-values? Hayward Medical Communications 2009. Available from: http://www.medicine.ox.ac.uk/bandolier/painres/download/whatis/what_are_conf_inter.pdf. [3rd Jul, 2014].
4. American Statistical Association. When you consult a statistician. What to expect. Section on Statistical Consulting 2003. Available from: http://www.amstat.org/sections/cnsl/brochures/SCSbrochure.pdf. [3rd Jul, 2014].
5. Mitchel JT, Kim YJ, Choi J, Park G, Cappi S, Horn D, et al. Evaluation of data entry errors and data changes to an electronic data capture clinical trial database. Drug Inf J 2011;45:421-30.
6. Schoenbach VJ. Data analysis and interpretation. Available from: http://www.epidemiolog.net/evolving/DataAnalysisandinterpretation.pdf. [3rd Jul, 2014].
7. Hellerstein JM. Quantitative data cleaning for large databases. 2008. Available from: http://db.cs.berkeley.edu/jmh/papers/cleaningunece.pdf. [3rd Jul, 2014].
8. San Jose State University. Data management. Available from: http://www.sjsu.edu/faculty/gerstman/StatPrimer/dataentry.PDF. [3rd Jul, 2014].
9. Wickham H. Tidy data. J Stat Software. Available from: http://vita.had.co.nz/papers/tidydata.pdf. [3rd Jul, 2014].
10. Pearson Higher Education. Descriptive statistics. Available from: http://www.pearsonhighered.com/sullivan/sul_alg_trig/Ch2_org_summ_data_.pdf. [4th Jul, 2014].
11. Streit M, Gehlenborg N. Bar charts and box plots. Nat Methods 2014;11:117.
12. Wisconsin Hospital Association Quality Center. Using graphs to display data. Available from: http://www.whaqualitycenter.org/Portals/0/Tools%20to%20Use/Making%20Sense%20of%20Data/Using%20Graphs%20to%20Display%20Data%20R%20212.pdf. [4th Jul, 2014].
13. Kelly D, Jasperse J, Westbrooke I. Designing science graphs for data analysis and presentation. The bad, the good and the better. Department of Conservation Technical Series 32. Science & Technical Publishing. Wellington: New Zealand. 2005. Available from: http://www.doc.govt.nz/Documents/scienceandtechnical/docts32entire.pdf. [4th Jul, 2014].
14. Cherry K. What is a hypothesis. About.com Psychology. 2014. Available from: http://psychology.about.com/od/hindex/g/hypothesis.htm. [4th Jul, 2014].
15. Davis RB, Mukamal KJ. Hypothesis testing: Means. Circulation 2006;114:1078-82.
16. Gabrenya WK. Inferential statistics: Basic concepts. Available from: http://my.fit.edu/~gabrenya/IntroMethods/eBook/inferentials.pdf. [4th Jul, 2014].
17. Suresh KP, Chandrashekara S. Sample size estimation and power analysis for clinical research studies. J Hum Reprod Sci 2012;5:7-13.
18. Omair A. Presenting your results-II: Inferential statistics. J Pak Med Assoc 2012;62:1254-7.
19. BMJ. Study design and choosing a statistical test. Available from: http://www.bmj.com/aboutbmj/resourcesreaders/publications/statisticssquareone/13studydesignandchoosingstatisti. [4th Jul, 2014].
20. Gunawardana N. Choosing the correct statistical test made easy. Available from: http://www.med.cmb.ac.lk/SMJ/VOLUME%203%20DOWNLOADS/Page%203337%20%20Choosing%20the%20correct%20statistical%20test%20made%20easy.pdf. [4th Jul, 2014].
21. Biau DJ, Jolles BM, Porcher R. P value and the theory of hypothesis testing: An explanation for new researchers. Clin Orthop Relat Res 2010;468:885-92.
22. University of California, Berkeley. Multiple hypothesis testing and false discovery rate. Available from: http://www.stat.berkeley.edu/users/hhuang/STAT141/LectureFDR.pdf. [4th Jul, 2014].
23. McCluskey A, Lalkhen AG. Statistics IV. Interpreting the results of statistical tests. Contin Educ Anaesth Crit Care Pain 2007;7:208-12.
24. Johnson DH. The insignificance of statistical significance testing. University of Nebraska-Lincoln 1999. Available from: http://digitalcommons.unl.edu/cgi/viewcontent.cgi?article=1225&context=usgsnpwrc. [4th Jul, 2014].
25. Sedgwick P. Clinical significance versus statistical significance. BMJ 2014;348:g2130.
[Figure 1], [Figure 2]
[Table 1], [Table 2], [Table 3]
