
The treatment phase of identified erroneous data involves correcting, deleting, or leaving the error unchanged. For the purposes of this study, impossible or missing values will have to be deleted, as the retrospective and secondary nature of the data leaves no way of correcting such values. For data points that are true extremes, the influence of these points on the analysis, individually and collectively, will be examined further before determining whether the data point will be deleted or left unchanged.

It is important to deal with missing data because missing data can create bias. First, an exploratory analysis will be performed to examine frequencies or percentages of missing data and to identify how much data is missing. Next, an analysis of the mechanisms, or types, of missingness will be performed to identify whether the missing data are missing completely at random (MCAR), missing at random (MAR), or not missing at random (NMAR), using statistical tests such as Little's test for MCAR. Following this, an analysis for patterns of missingness will be performed using a missing value pattern chart. Two patterns may potentially be observed: 1) a monotone pattern, where data are missing systematically, or 2) an arbitrary pattern, where data are missing at random. While these analyses are not definitive, they can bring attention to blatant anomalies in the missingness of data and help inform decisions on missing data handling procedures.

There are a variety of methods that can be utilized to deal with missing data. The method chosen will depend on the percentage of missing data present and cannot be specified beforehand. Simple methods, such as listwise or pairwise deletion, are helpful when the percentage of missing data is less than 5%. Listwise deletion, also known as complete-case analysis, removes all data for a case with one or more missing values.
In other words, that case is omitted completely.
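As a concrete illustration, the exploratory missingness summary and the two simple deletion strategies can be sketched in Python. The records and variable names below are hypothetical and purely for illustration, not drawn from the NTDB; in practice these steps would be run in the statistical software.

```python
# Hypothetical patient records; None marks a missing value.
# Variable names (age, gcs, thc) are illustrative assumptions only.
records = [
    {"age": 34,   "gcs": 7,    "thc": 1},
    {"age": None, "gcs": 5,    "thc": 0},
    {"age": 51,   "gcs": None, "thc": 1},
    {"age": 28,   "gcs": 6,    "thc": None},
]

def percent_missing(rows):
    """Exploratory step: percent of missing values per variable."""
    n = len(rows)
    return {var: 100.0 * sum(r[var] is None for r in rows) / n
            for var in rows[0]}

def listwise(rows):
    """Complete-case analysis: drop any case with one or more missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise_values(rows, var):
    """Available-case analysis: keep every non-missing value of one variable."""
    return [r[var] for r in rows if r[var] is not None]

print(percent_missing(records))              # each variable is 25% missing here
print(len(listwise(records)))                # only 1 complete case remains
print(len(pairwise_values(records, "age")))  # but 3 usable values for age
```

The contrast in the last two lines is the motivation for pairwise deletion: listwise deletion discards three of the four cases, while pairwise deletion retains three observed values for each variable.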

A disadvantage of listwise deletion is that it can reduce the sample size. On the other hand, pairwise deletion, also known as available-case analysis, aims to minimize the loss of data incurred with listwise deletion; an advantage of pairwise deletion over listwise deletion is that it can help preserve statistical power. Pairwise deletion still uses a case when analyzing other variables with non-missing values; it excludes only the missing value itself. However, pairwise deletion does have its disadvantages, in that most software packages use the average sample size across analyses, which can create over- or underestimation. If the percentage of missing data is greater than 5%, then more advanced methods of dealing with missing data, such as imputation, can be utilized. The imputation method will depend on the pattern of missingness identified and the type of variable requiring imputation. In patterns where missing data are systematic or monotone, methods such as regression, predictive mean matching, or propensity scoring are helpful. In patterns where missing data are arbitrary or at random, methods such as multiple imputation, which uses maximum likelihood regression methods to predict missing values based on observed values, and sensitivity analyses, which simulate the results over a range of plausible values, can be used.

Aim 1. For Aim 1, the objective is to determine the prevalence of marijuana exposure in patients with moderate or severe TBI. Analyses will be conducted using the Statistical Package for the Social Sciences (SPSS) software. The proportion of TBI patients who have marijuana present on admission will be reported. Unadjusted prevalence will be determined through a 2×2 table, and prevalence rates will be calculated for the total number of TBIs.

Aim 2. For Aim 2, the objective is to determine the correlates associated with the presence of marijuana exposure at the time of injury. The correlates included in Aim 2 will also be collected for the sample of participants without marijuana exposure at the time of injury.
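The unadjusted prevalence calculation for Aim 1 reduces to a simple proportion over the 2×2 table. A minimal sketch, with counts that are assumptions for illustration only (not NTDB results):

```python
# Hypothetical 2x2 counts (TBI severity x marijuana exposure);
# the numbers are assumed for illustration, not actual NTDB data.
counts = {
    ("moderate", "exposed"): 120, ("moderate", "unexposed"): 480,
    ("severe",   "exposed"): 90,  ("severe",   "unexposed"): 310,
}

total = sum(counts.values())
exposed = sum(n for (severity, exposure), n in counts.items()
              if exposure == "exposed")

# Unadjusted prevalence of marijuana exposure among all TBI patients.
prevalence = exposed / total
print(f"{prevalence:.1%}")  # 21.0% with these assumed counts
```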
Measures of central tendency and dispersion, including means, proportions, ranges, and standard deviations, will be calculated. These basic summary statistics will be calculated for continuous variables and binary categorical variables. Continuous variables will be plotted to assess normality; assessments of normality will include examination of skewness and kurtosis.
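SPSS reports these statistics directly; as a sketch of what is being computed, the population-moment forms of skewness and excess kurtosis are shown below (SPSS applies small-sample corrections, so its values differ slightly for small n):

```python
import math

def skewness(xs):
    """Population skewness: E[(x - mean)^3] / sd^3 (0 for symmetric data)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum((x - mean) ** 3 for x in xs) / n / sd ** 3

def excess_kurtosis(xs):
    """Population excess kurtosis: E[(x - mean)^4] / var^2 - 3 (0 for a normal)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return sum((x - mean) ** 4 for x in xs) / n / var ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
print(skewness(symmetric))  # 0.0 for perfectly symmetric data
```

Values of skewness and excess kurtosis far from zero flag departures from normality and would steer the analysis toward non-parametric tests.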

If the data are normally distributed, then parametric statistics will be utilized; if the data are not normally distributed, then non-parametric statistics will be utilized. Frequency distributions, including numbers and percentages, will be generated for each of the categorical variables/correlates, and scatterplots will be created so that outliers can be identified. All correlate variables presented in Table 6 will be examined; all of the variables but one are categorical. Categorical variables will be cross-tabulated against the presence of marijuana exposure and TBI severity to determine whether significant differences are present across each of the categories. Tests to determine significant differences across categories include the chi-square test or, where expected cell counts are small, Fisher's exact test. The variable of age is continuous. The literature suggests that the relationship between age and drug exposure is not linear, so this relationship will be tested in this study: a bar plot of age against marijuana exposure will be used to determine whether a linear relationship exists. If the relationship is not linear, the variable will be categorized. Correlates identified as significant will become covariates in the adjusted prevalence analysis; prior to that analysis, these covariates will be examined for multicollinearity.

Aim 3. For Aim 3, the objective is to determine the relationship between marijuana exposure at the time of injury, the mechanism of injury, and TBI severity. The null hypothesis is that no relationship exists between marijuana exposure at the time of injury, the mechanism of injury, and the severity of TBI. As illustrated in the conceptual framework, mechanism of injury is considered a mediating variable; it potentially mediates the relationship between marijuana exposure at the time of injury and TBI severity.
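For the cross-tabulations above, the chi-square statistic for a 2×2 table can be sketched as follows; the table counts are hypothetical, and in practice SPSS computes both the statistic and its p-value (the 3.841 threshold below is the 0.05 critical value for 1 degree of freedom):

```python
def chi2_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, obs_row in enumerate(table):
        for j, obs in enumerate(obs_row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
    return chi2

# Hypothetical counts: rows = exposed / unexposed, cols = moderate / severe.
table = [[120, 90],
         [480, 310]]
stat = chi2_2x2(table)            # about 0.90 for these assumed counts
significant = stat > 3.841        # 0.05 critical value, 1 df
```

Fisher's exact test would replace this when any expected cell count is small (a common rule of thumb is below 5), which is unlikely here given the anticipated sample size.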
First, an estimate of the effect of marijuana exposure on TBI severity will be obtained without the mediator variable of mechanism of injury. To test for mediation, several regression analyses that include the mediator variable will be conducted, and the significance of the coefficients will be examined at each step to assess direct and indirect effects. First, I will test for a direct relationship between marijuana exposure and TBI severity. Assuming there is a significant relationship between the two variables, I will then conduct an analysis to determine whether marijuana exposure affects mechanism of injury. Assuming there is a significant effect, I will then conduct an analysis to determine whether mechanism of injury affects TBI severity, and whether the mediation effect is complete or partial. To determine whether the mediation effect is statistically significant, I will use either the Sobel test or bootstrapping methods.
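The Sobel test mentioned above combines the two regression coefficients along the mediated path into a single z statistic. A minimal sketch, with coefficient values that are purely hypothetical:

```python
import math

def sobel_z(a, se_a, b, se_b):
    """Sobel test statistic for an indirect (mediated) effect a*b.

    a, se_a: coefficient and standard error, exposure -> mediator.
    b, se_b: coefficient and standard error, mediator -> outcome
             (controlling for exposure).
    """
    return (a * b) / math.sqrt(b ** 2 * se_a ** 2 + a ** 2 * se_b ** 2)

# Hypothetical coefficients purely for illustration:
z = sobel_z(a=0.5, se_a=0.1, b=0.4, se_b=0.1)
# |z| > 1.96 would indicate a significant indirect effect at alpha = 0.05.
```

Bootstrapping is often preferred over the Sobel test because the sampling distribution of the product a*b is not normal in small samples; with the large sample anticipated here, both approaches are reasonable.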

All analyses will be conducted unadjusted and then adjusted for covariates and confounders identified a priori and via Aim 2. The analyses will use logistic regression modeling because the dependent variable, TBI severity, is dichotomous, with only two levels: moderate or severe TBI. While TBI severity could be treated as a continuous variable using the numerical GCS score, a binary variable will be used because it is easier for clinicians to interpret: clinicians base treatment decisions not on subtle degrees of TBI severity but on whether the injury is moderate or severe according to GCS threshold cut-offs. Dummy variables will be used to enter non-binary categorical variables into the analysis. Given the predicted large sample size, and understanding the potentially significant confounding effects of certain variables such as other drugs, I hope to create binary variables for each drug listed in the NTDB database. If this is not feasible, another approach would be to code all drug use into 3 categories: a value of 0 assigned for 'no drug use', a value of 1 for 'stimulants' only.

Observational studies offer valuable methods for studying various problems within healthcare where other study designs, such as randomized controlled designs, may not be feasible or may even be unethical. High-quality observational studies can render invaluable and credible results that positively impact healthcare when studying clinically relevant topics in patient populations of interest to practicing clinicians. Despite this, observational studies can be subject to potential problems within the design and analytical phases that may leave results highly compromised. Potential problems that will be encountered in this study design are selection bias, information bias, and confounding. Possible countermeasures to address these problems will be discussed in this section.
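The dummy-coding of non-binary categorical predictors described above can be sketched as follows; SPSS performs this recoding internally, and the mechanism-of-injury levels used here are hypothetical:

```python
def dummy_code(values, reference):
    """One-hot encode a categorical variable, dropping the reference level.

    A variable with k levels yields k - 1 indicator (dummy) variables,
    so coefficients are interpreted relative to the reference level.
    """
    levels = sorted({v for v in values if v != reference})
    return [{f"is_{lvl}": int(v == lvl) for lvl in levels} for v in values]

# Hypothetical mechanism-of-injury values; 'fall' chosen as the reference.
mechanisms = ["fall", "mvc", "assault", "mvc"]
coded = dummy_code(mechanisms, reference="fall")
# coded[0] -> {'is_assault': 0, 'is_mvc': 0}   (the reference level)
# coded[1] -> {'is_assault': 0, 'is_mvc': 1}
```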
A potential problem regarding selection bias is present in the current study. The target study population is comprised of a purposive sample of patients registered in the NTDB. The NTDB is a centralized national trauma registry developed by the American College of Surgeons and is the largest repository of trauma-related data and metrics, reported by 65% of trauma centers across the U.S. and Canada. The main advantage of utilizing such a registry for this study is that it constitutes the largest trauma database in the U.S. Furthermore, the NTDB allows for risk-adjusted analyses, which can be important when evaluating outcomes in trauma. Despite its potential for informing trauma-related research, the selection of participants from the NTDB is not without its own biases. The reporting of data into the NTDB is done on a voluntary basis by participating trauma centers, rendering a convenience sample that may not be representative of all trauma patients, and may also not be representative of all trauma centers across the U.S. This creates the problem of selection bias. A further limitation is that the NTDB includes a larger number of trauma centers with typically more severely injured patients, potentially underrepresenting patients with milder traumatic injuries and lower injury scores. Additionally, patients who are traumatically injured but not admitted to a participating trauma center will not be included in the NTDB, nor will trauma patients who died on scene before being transported. Another consideration is that participating hospitals may differ in their criteria for which patients to include in the database, specifically patients who are dead on arrival or those who die in the Emergency Department. This discrepancy in inclusion and exclusion criteria between hospitals regarding specific injuries makes representative comparisons potentially difficult.

Lastly, it is important to mention that large databases such as the NTDB are subject to missing or disparate data. This is often due to a multitude of factors; for example, various demographic data points, test results, and other key information, such as procedures, may not be documented in the health record and are therefore omitted from the database. Missing data often contribute to information bias; however, they can also contribute to selection bias, because one method of dealing with missing data is excluding participants for whom data are missing, thereby creating potential selection bias. Missing data may undermine the ability to make valid inferences; therefore, steps will be taken throughout the design and operational stages of this study to avoid or minimize missing data. Methods to reduce information bias that can lead to selection bias will be discussed in the analysis section of this paper. Due to the methods by which data are collected and inputted into the NTDB, potential problems are encountered in terms of data accuracy. Underreporting of variables obtained from the NTDB has often been noted as a problem owing to inconsistent data extraction by participating hospitals. The data are self-reported and typically entered by staff dedicated to data collection; a major variance between participating hospitals is that those with more resources are more likely to have such dedicated staff. This can lead to information bias when hospitals that are more compliant in reporting data metrics are compared with others that are not. For example, hospital registries that have incomplete data on complications may appear to deliver better care than hospitals that consistently record all complications.