The tuning parameters involved in these models were selected using leave-one-out cross-validation (LOOCV).
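As a rough illustration of how a tuning parameter can be selected by leave-one-out cross-validation, the following Python sketch picks the number of neighbours k for a k-nearest-neighbour classifier. It is not the authors' code (the models in this study were fit in R); the toy one-dimensional data, the candidate values of k, and the function names `knn_predict`, `loocv_accuracy`, and `select_k` are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points closest to x (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loocv_accuracy(X, y, k):
    """Hold out each subject once, train on the rest, and score the held-out prediction."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += knn_predict(train_X, train_y, X[i], k) == y[i]
    return hits / len(X)


def select_k(X, y, candidates):
    """Return the candidate k with the highest LOOCV accuracy (first one wins on ties)."""
    scores = {k: loocv_accuracy(X, y, k) for k in candidates}
    best = max(candidates, key=lambda k: scores[k])
    return best, scores


# Hypothetical toy data: two clusters plus one mislabelled point at 0.25.
X = [(0.0,), (0.1,), (0.2,), (0.25,), (0.3,), (0.4,),
     (1.0,), (1.1,), (1.2,), (1.3,), (1.4,)]
y = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1]

best_k, scores = select_k(X, y, candidates=[1, 3, 5])
print("LOOCV accuracy by k:", scores)
print("selected k:", best_k)
```

Odd values of k are used so that a two-class vote cannot tie. On this toy data the mislabelled point hurts k = 1 the most, so a larger k is selected.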

We chose these techniques because they work even when the sample size is small relative to the number of predictors, as is the case here. Moreover, in keeping with common practice, the performance of these models was evaluated by examining their prediction accuracy as measured using overall accuracy, sensitivity, and specificity. Further, due to the lack of independent test data, the performance measures were computed using LOOCV. By protecting against overfitting, the LOOCV-based measures provide a more accurate assessment of model performance on future unseen data than those computed directly from the training data. By default, the models use 0.5 as the cutoff for probability, that is, a study subject is classified as having CUD if their probability of CUD exceeds 0.5. If the cutoff is increased, the sensitivity will decrease and specificity will increase. To evaluate the overall model performance, we used the receiver operating characteristic (ROC) curve, a plot of sensitivity against 1-specificity obtained by varying the cutoffs, and computed the corresponding area under the curve (AUC). The models were fit using the statistical software system R with the following specific packages: glmnet for LASSO logistic regression, knn, e1071 for SVM, randomForest, gbm for gradient boosting, caret for LOOCV, and pROC for AUC.

Fig. 1 presents the variables with non-zero regression coefficients from the LASSO logistic regression model and the top seven variables based on variable importance measures for the other models. The seven variables selected by LASSO, namely, age, level of enjoyment from initial smoking, ImpSS-T, BIS-I, NEO-N, NEO-O, and NEO-C, were also found to be important by the other models. In particular, except for ImpSS-T and NEO-O, the remaining five were selected as important by all other models.

Moreover, ImpSS-T was chosen as an important predictor by KNN, random forest, and SVM while NEO-O was indicated to be important by random forest and gradient boosting. Table 2 presents the accuracy, sensitivity, and specificity of the models based on the 0.5 cutoff as computed using LOOCV as well as the AUC of the models. The associated ROC curves and the plots of accuracy versus cutoff are provided in Supplementary Materials. Although the various models performed similarly, which is reassuring, overall we may conclude that LASSO and gradient boosting outperformed the others. For example, the two are tied for the highest AUC. Nevertheless, an advantage of LASSO is that it provides estimates of regression coefficients and hence odds ratios. This allows easy interpretation of the effects of the risk factors. This important and desirable feature is not available in other models. Therefore, we chose the LASSO logistic regression model as our final model. It may be of interest to quantify the advantage of this model over a random guess classifier that predicts CUD with probability 0.617, the proportion of CUD cases in the data. The accuracy, sensitivity, and specificity of this classifier can be calculated to be 0.527, 0.617, and 0.383, respectively. These are much lower than the corresponding values reported in Table 2 for the LASSO model. The final LASSO model predicted the CUD status with 66% accuracy. Its sensitivity and specificity were 0.81 and 0.42, respectively. Thus, it does a much better job of correctly identifying the CUD cases than the non-CUD controls at the probability cutoff of 0.5. This cutoff may not be appropriate in all clinical settings. The appropriate cutoff can be chosen by examining its ROC curve, presented in Supplementary Materials, for the trade-off between sensitivity and specificity. Its AUC is 0.65.
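The performance measures discussed above can be sketched in a few lines of Python. This is only an illustration and not the authors' analysis (which used R with caret and pROC): the labels and predicted probabilities are hypothetical toy values, and the function names are assumptions. The sketch shows how accuracy, sensitivity, and specificity depend on the cutoff, how an AUC can be computed as the proportion of case-control pairs ranked correctly, and how the random guess baseline follows from the 0.617 proportion of CUD cases.

```python
def rates_at_cutoff(y_true, probs, cutoff=0.5):
    """Accuracy, sensitivity, specificity when a subject is called a case if prob > cutoff."""
    tp = fp = tn = fn = 0
    for truth, p in zip(y_true, probs):
        pred = 1 if p > cutoff else 0
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity


def auc(y_true, probs):
    """Share of (case, control) pairs in which the case has the higher score; ties count 1/2."""
    cases = [p for t, p in zip(y_true, probs) if t == 1]
    controls = [p for t, p in zip(y_true, probs) if t == 0]
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))


def random_guess_baseline(prevalence):
    """Classifier that predicts 'case' with probability `prevalence`, ignoring the predictors."""
    sensitivity = prevalence          # P(predict case | true case)
    specificity = 1 - prevalence      # P(predict control | true control)
    accuracy = prevalence * sensitivity + (1 - prevalence) * specificity
    return accuracy, sensitivity, specificity


# Hypothetical toy data: 5 cases (1) and 3 controls (0) with made-up predicted probabilities.
y_true = [1, 1, 1, 1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.7, 0.3, 0.2]

at_050 = rates_at_cutoff(y_true, probs, cutoff=0.5)
at_075 = rates_at_cutoff(y_true, probs, cutoff=0.75)
toy_auc = auc(y_true, probs)
baseline = random_guess_baseline(0.617)

print("cutoff 0.50 (accuracy, sensitivity, specificity):", at_050)
print("cutoff 0.75 (accuracy, sensitivity, specificity):", at_075)
print("toy AUC:", toy_auc)
print("random guess baseline:", tuple(round(v, 3) for v in baseline))
```

Raising the cutoff from 0.5 to 0.75 on the toy data lowers sensitivity and raises specificity, which is the trade-off described in the text. With a prevalence of 0.617 the baseline function returns the 0.527, 0.617, and 0.383 quoted for the random guess classifier, since its accuracy is 0.617² + 0.383².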
The seven variables selected by this model together with their estimated coefficients and the associated odds ratios are shown in Table 3. A higher probability of CUD was associated with younger age, lower level of enjoyment from initial smoking, higher score on impulsivity, greater cognitive instability, higher neuroticism, i.e., more prone to experience negative feelings, greater openness to new experiences, and lower conscientiousness. To illustrate the model, we considered two subjects from the data who had the largest and the smallest predicted probability of CUD. Their true status is CUD and non-CUD, respectively. The first subject was young; received little enjoyment from initial smoking; had high scores on impulsivity, cognitive instability, and neuroticism; was quite open to new experiences; and had low conscientiousness.

The predicted probability of CUD for this subject was 0.93. The second subject was 49 years old; received much enjoyment from initial smoking; had low scores on impulsivity, cognitive instability, and neuroticism; was also quite open to new experiences; and had high conscientiousness. The predicted probability of CUD for this subject was 0.15.

Substance use disorders are a growing public health problem and cannabis is the most commonly used illicit substance in the world. The legalization of cannabis for medical and recreational purposes worldwide has increased cannabis use and CUD. Therefore, there is a growing need for a CUD risk prediction tool. In this study, we built a preliminary model by identifying risk factors with the help of several statistical and machine learning algorithms. We eventually chose the LASSO logistic regression model as the final model for two reasons. First, there was no major difference among the top performing models. Second, LASSO allows the effects of risk factors to be interpreted quantitatively, a feature unavailable in the other methods. The LASSO model gave seven risk factors with non-zero coefficients. We had also explored the possibility of adding interaction terms to this model but did not eventually add any because the model with interactions had lower predictive accuracy than this model. The risk factors identified by our model are consistent with the literature. In particular, previous findings indicate that younger people are more likely to develop CUD. Using ImpSS and BIS scales, numerous studies have shown that high impulsivity is prevalent among users of nicotine, cocaine, and alcohol. We also found that higher ImpSS-T is associated with a higher likelihood of dependence on cannabis. The positive association between cognitive instability and CUD status that we found is also known. Similarly, the relationship of CUD with personality trait risk factors based on NEO is consistent with the previous findings.
For example, cannabis users have higher openness and lower conscientiousness compared to nonusers. Generally, high neuroticism is reported in nicotine-only users and average neuroticism is reported in cannabis-only users. We found that higher neuroticism is associated with higher likelihood of CUD, which is not surprising because our sample consists of co-morbid marijuana and nicotine users.

We also found that less enjoyment from initial smoking is associated with increased likelihood of becoming cannabis dependent. This is in line with the findings from a nationally representative longitudinal study, which was conducted to identify the risk factors associated with different stages of cannabis use. This study found that greater quantity of cigarette use decreased the likelihood of reinitiation of cannabis use among participants who were cannabis users prior to reaching adolescence. Even though our overall findings are consistent with the literature, we did not find several risk factors for CUD that have been previously reported in the literature. Some of the risk factors, such as childhood depression and conduct disorder symptoms, were not available in these data. Others, such as early exposure to traumatic events, had substantial missing data and were therefore excluded. Yet others may not have been identified due to limitations of the study as described in the following. Our study’s first limitation is its cross-sectional and observational nature, because of which it is difficult to establish a causal relationship between a risk factor and CUD, especially for the factors that can vary over time. To mitigate the latter issue, we only used risk factors that remain relatively stable over time. However, even then we need to be cautious about drawing any conclusion about causation as this is an observational study. The second is that the number of subjects is not large and the participating subjects came from a specific metro area in the US, which may not be representative of the entire population of cannabis users. The third is due to missing values on the variables. When the risk factors are jointly analyzed in a multivariate model, this leads to a loss of some subjects as those with missing values in any of the multiple variables are discarded.
We tried to balance the loss of sample size with the inclusion of risk factors. Moreover, to mitigate the issue of small sample size, we chose the statistical and machine learning methods that work even when the sample size is small relative to the number of predictors. Nonetheless, availability of complete data on more subjects would have provided higher power for identifying association. We also acknowledge that the data used for this study were acquired in 2007–2010 and may be limited in their generalizability to current cannabis use impacts. Nonetheless, New Mexico’s cannabis policies may be more historically representative of current national policies given that medically indicated cannabis was legalized in New Mexico in 2007, coinciding with the study’s data collection. Thus, our findings may provide insights into future trends related to continued changes in cannabis legislation in the US. Also importantly, there has been no change in the rate of current marijuana use in New Mexico in recent years, although the rate has remained significantly higher than the US rate. Thus, cannabis use in New Mexico has been stable and should not limit the impact of the current findings. Lastly, the mechanisms that underlie the risk for CUD likely remained relatively unchanged in the last 10 years. Despite its limitations, this study represents a novel attempt to build a CUD risk prediction tool.

To address the limitations, we are working towards building a risk prediction model using longitudinal data from a large number of subjects spread throughout the US. In addition, some people may be dependent on more than one substance and, in fact, there may be common risk factors for several substance disorders. Therefore, it would be of interest to model jointly the relationship between multiple substance disorders and potential risk factors. Finally, inclusion of genetic and/or imaging factors can also provide a more personalized model.

Cannabis use is common, but most users do not progress to cannabis use disorders. About 50–70% of liability to cannabis use disorders is due to genetic factors.1 Three genome-wide association studies of cannabis use disorders2–4 have identified variants reaching genome-wide significance, but inadequate sample sizes and heterogeneity among samples have contributed to a paucity of replicable findings: only one locus, tagged by a cis-eQTL for CHRNA2, has been robustly identified. A GWAS of lifetime cannabis use identified eight genome-wide significant loci and 35 significant genes. Twin studies suggest high genetic correlations between early stages of cannabis experimentation and later cannabis use disorder.6 However, casual cannabis use is affected by a variety of socio-environmental influences and age-period-cohort effects, whereas progression to cannabis use disorder is related to other psychopathologies. Findings have suggested partially distinct genetic causes underlying alcohol consumption and alcohol use disorder, including different genetic associations with other psychiatric disorders and traits.7,8 Thus, in addition to examining the genomic liability for cannabis use disorder, we tested whether the genetic influences underlying cannabis use and cannabis use disorder diverge with respect to behavioural and brain measures.
Investigators for each contributing study obtained informed consent from their participants and received ethics approvals from their respective review boards in accordance with applicable regulations. Personal identifiers associated with phenotypic information and samples from deCODE were encrypted using a third-party encryption system. The iPSYCH group used pseudonymised unique identifications. Psychiatric Genomics Consortium cases met criteria for a lifetime diagnosis of DSM-IV cannabis abuse or dependence11 derived from clinician ratings or semi-structured interviews. Cases from the iPSYCH sample had ICD-10 codes of F12.1 or F12.2, or both, in the Danish Psychiatric Central Research Register; the remaining individuals in the sample were used as controls.