Big Data in Psychiatry and Neurology
Ebook · 738 pages · 16 hours

About this ebook

Big Data in Psychiatry and Neurology provides an up-to-date overview of achievements in the field of big data in psychiatry and medicine, including applications of big data methods to aging disorders (e.g., Alzheimer’s disease and Parkinson’s disease), mood disorders (e.g., major depressive disorder), and drug addiction. This book will help researchers, students, and clinicians implement new methods for collecting big datasets from various patient populations. Further, it will demonstrate how to use several algorithms and machine learning methods to analyze big datasets, thus enabling individualized treatment for psychiatric and neurological patients.

As big data analytics is gaining traction in psychiatric research, it is an essential component in providing predictive models for both clinical practice and public health systems. As compared with traditional statistical methods that provide primarily average group-level results, big data analytics allows predictions and stratification of clinical outcomes at an individual subject level.

  • Discusses longitudinal big data and risk factors surrounding the development of psychiatric disorders
  • Analyzes methods in using big data to treat psychiatric and neurological disorders
  • Describes the role machine learning can play in the analysis of big data
  • Demonstrates the various methods of gathering big data in medicine
  • Reviews how to apply big data to genetics
Language: English
Release date: Jun 11, 2021
ISBN: 9780128230022

    Book preview

    Big Data in Psychiatry and Neurology - Ahmed Moustafa

    Chapter 1: Best practices for supervised machine learning when examining biomarkers in clinical populations

    Benjamin G. Schultz (a); Zaher Joukhadar (b); Usha Nattala (b); Maria del Mar Quiroga (b); Francesca Bolk (c); Adam P. Vogel (a,d)

    a Centre for Neuroscience of Speech, The University of Melbourne, Melbourne, VIC, Australia

    b Melbourne Data Analytics Platform, The University of Melbourne, Melbourne, VIC, Australia

    c Murdoch Children’s Research Institute, The University of Melbourne, Melbourne, VIC, Australia

    d Redenlab, Melbourne, VIC, Australia

    Abstract

    Machine learning approaches are increasingly used in health research. Applications range from identifying disease onset and classifying disease severity to predicting epileptic seizures. Although machine learning can be a powerful tool, there is potential for misuse; model performance can be inflated through overfitting and, consequently, will not generalize to the greater population. The risk of misuse increases when the number of variables extracted from continuous data is almost unlimited, as is the case for neural, movement, and acoustic (e.g., speech and music) data. Given that health studies often have small sample sizes, and that outcome variables can be noisier for clinical populations, there are important points that should be considered before using machine learning. We suggest best practices in machine learning, including data formatting, reducing data dimensionality, model selection and evaluation, and other steps within the machine learning process. We further discuss some common pitfalls in applying machine learning to small sample sizes and high-dimensional data (e.g., speech biomarkers, neural and imaging data). We advocate for parsimonious approaches that include selecting the simplest machine learning method that best describes the data, preventing redundancy and overfitting through variable elimination, and ensuring that certain variables or approaches do not inflate machine learning outcomes. We further consider approaches that can identify the best predictors (or combinations thereof), as well as black-box machine learning methods (e.g., deep learning). Finally, we discuss the limitations of current machine learning methods and pose future directions to broaden the applicability of machine learning tools and ensure the outcomes are robust against random factors.

    Keywords

    Machine learning; Best practices; Big data; Health; Artificial intelligence

    1: Introduction

    Machine learning is a powerful tool for predicting outcomes as it can simultaneously consider multiple features to identify and delineate classes (e.g., healthy or unhealthy). This approach has several advantages over traditional univariate statistical approaches (see Bzdok, Altman, & Krzywinski, 2018; Bzdok & Meyer-Lindenberg, 2018), with broad scope for machine learning in health and medicine. Machine learning has been used to identify and distinguish medical conditions using genetic (cf. Libbrecht & Noble, 2015; Pattichis & Schizas, 1996), speech (cf. Hegde, Shetty, Rai, & Dodderi, 2019), neural (Craik, He, & Contreras-Vidal, 2019; Kassraian-Fard, Matthis, Balsters, Maathuis, & Wenderoth, 2016), imaging (Thrall et al., 2018), and movement (cf. Figueiredo, Santos, & Moreno, 2018; Kubota, Chen, & Little, 2016) data. There are several decisions made by researchers undertaking machine learning that are seldom explicitly reported (or known) including sample size estimation, variable screening and data reduction, selection of machine learning algorithm(s), selection of training and test datasets, and parameter adjustment. This chapter discusses current best practice for supervised machine learning applications in health and medicine, and describes some of the common pitfalls that researchers may encounter.

    2: Data formatting

    One of the most important aspects of machine learning, especially across multiple experiments, is data formatting. There are very few guidelines for how data should be formatted in this context and data sharing more generally. While data formatting may seem trivial to some, incorrect data formats can provide misleading results. Following standardized formatting conventions facilitates open data sharing and streamlines collaboration (Ellis & Leek, 2018). Correctly formatting data is important for machine learning in medicine where datasets typically consist of data collected from different sites and studies. Here we describe the best practices for data formatting that work across a range of different software with a focus on tidy data formats and the use of headers (or embedded metadata) for case-wise information (Chen, 2017; Ellis & Leek, 2018; Wickham, 2014; Wickham & Grolemund, 2016). Although there are other formatting styles, here we only discuss formats that are readable/importable for most statistical software that can perform machine learning (e.g., Python, R, SPSS, SAS, STATA).

    Best practice 1:

    Use tidy data formats for machine learning and data sharing

    Tidy datasets have a specific structure where each column contains one variable, rows contain one observation, and tables contain one observational unit (e.g., sample, group, or experiment) (Wickham, 2014). There are several advantages of tidy datasets relating to standardized practices in data visualization, exploration, screening, transformations, and analysis. For example, Tables 1–3 show three ways to arrange the same data where Tables 1 and 2 are considered messy data and Table 3 is considered Tidy data (i.e., long format). To perform most machine learning applications for data in Tables 1 and 2, information would need to be combined into a format that can be read and interpreted by statistical software. In Table 1, data for each participant is in a different subtable. In order to perform any visualization or analysis, these tables would need to be combined using additional steps, such as melting and casting. Furthermore, note that each subtable contains header information that is not, strictly speaking, inside the table itself (i.e., ID and Group) and would require some scripting to extract automatically. Header information can be included in a multitude of different ways. Embedded metadata, often found within the file properties, can provide information that may require specialized scripts or functions to extract. Information may also be contained in the filename itself (e.g., Part01_control, Part02_control, Part03_disease, Part04_disease) and may require string-splitting to extract this information. In more complicated cases, this header information may be missing or formatted in different ways (e.g., control_Part01, Part02_control, Part03_take2_disease, disease_Part04), requiring manual recoding and/or idiosyncratic code to standardize the data (e.g., sequences of if this then that conditional statements to fix deviant file naming conventions).

    Table 1

    Table 2

    Table 3

    Table 2 is also considered messy data, which may be surprising to SPSS (or similar software) users, for whom within-subject designs typically require this format (i.e., wide format). This can be considered messy because the same variable is scattered across different columns representing different conditions. This makes it difficult to perform certain analyses (e.g., checking for a normal distribution) across all conditions without first transforming the data into long format; the format does not imply that the four columns (TaskA_Time1, TaskB_Time1, TaskA_Time2, TaskB_Time2) contain the same variable under different conditions (Tasks A and B, Times 1 and 2). To perform machine learning, these data need to be explicitly defined as belonging to these conditions. Tidy Data is "a standardized data structure that maps the meaning of a dataset to its structure" (Wickham, 2014, p. 4), where the data structure informs the data interpreter of related categorical and continuous variables across conditions. Data for machine learning diverge from one rule of Tidy Data (Wickham, 2014); relational data on a single subject or participant that is used as a feature or the target category (or categories) being classified must be repeated instead of being placed in a separate table (see Table 4).

    Table 4
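    As a minimal illustration (not from the chapter), a wide-format layout like Table 2 can be reshaped into a long, tidy format like Table 3 with a few lines of Python. The column names and values below are hypothetical stand-ins for the tables shown above.

```python
import pandas as pd

# Hypothetical wide-format data resembling Table 2: one row per participant,
# one column per Task x Time condition.
wide = pd.DataFrame({
    "ID": ["Part01", "Part02", "Part03", "Part04"],
    "Group": ["control", "control", "disease", "disease"],
    "TaskA_Time1": [0.81, 0.75, 0.62, 0.58],
    "TaskB_Time1": [0.79, 0.72, 0.60, 0.55],
    "TaskA_Time2": [0.84, 0.78, 0.63, 0.57],
    "TaskB_Time2": [0.82, 0.74, 0.61, 0.54],
})

# Melt to long (tidy) format: one observation per row, one variable per column.
long = wide.melt(id_vars=["ID", "Group"], var_name="condition", value_name="score")

# Split the combined condition label into explicit Task and Time variables.
long[["Task", "Time"]] = long["condition"].str.split("_", expand=True)
long = long.drop(columns="condition")
print(long.head())
```

    Splitting the combined condition label into separate Task and Time columns makes the conditions explicit variables that statistical and machine learning software can use directly.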

    3: Statistical assumptions

    Although machine learning is qualitatively different from other parametric tests and linear statistical methods, many approaches have similar statistical assumptions. Only when these assumptions are upheld do machine learning approaches provide reliable outcomes. For example, many data reduction techniques (e.g., principal components analysis) assume linear relationships between variables and deviations from linearity must be normally distributed, as per parametric assumptions. Some machine learning methods can be robust to mild violations of their statistical assumptions. Regardless of the machine learning method used, researchers must assess their model for gross violations of statistical assumptions as these violations can lead to biased, inaccurate, and unreliable predictions. Model assessments provide statistical support for the reliability of results and the appropriateness of the chosen machine learning approach. These assessments may also reveal latent unmodeled information from the data that could be included in the machine learning model to improve the results. Linear regression is a basic machine learning method and is used to model linear relationships between a dependent variable (output, e.g., class or group) and multiple predictor variables (input features). Here we discuss statistical assumptions in the context of general linear regression models that also apply to other linear machine learning approaches.

    The assumption of linearity requires that the relationship between the dependent variable and the fitted values of features is linear when the other variables are held constant (Ernst & Albers, 2017). The fitted values, also called the predicted values, are the outcomes that are generated when fitting the linear model. Deviations from linearity can undermine the model and render it unsuitable for the data. This can happen when the data contain hidden nonlinear relationships that are unmodeled. The assumption of linearity can be assessed by viewing plots of the observed by predicted values or the residuals by the predicted values; there should be a linear trend between the observed and expected probabilities. Violations of the assumption of linearity can sometimes be mitigated through transformations or normalization (e.g., logarithmic or exponential transforms) of the original input features but may require the use of nonlinear models.

    The assumption of homoscedasticity (meaning same variance) requires that the error (i.e., noise) of the relationship between the dependent variable and features is similar across all features. When this assumption is not upheld, the standard errors in the output are not reliable (Yang, Tu, & Chen, 2019). Homoscedasticity can be assessed by examining relationships between observed residuals and predicted values. If the plot shows a relationship or linear trend between residuals and features, then the linear model is not appropriate. Fig. 1 shows four examples of residual plots where Fig. 1A fulfills the assumption of linearity because residuals are randomly dispersed. Nonrandom residual dispersions, such as linear trends (Fig. 1B), U-shaped (Fig. 1C), or inverted U-shaped (Fig. 1D) patterns suggest nonlinear models are more appropriate for the data (e.g., nonlinear regression, see Bates & Watts, 1988). If the assumption of homoscedasticity is violated, heteroscedasticity-corrected errors can be calculated that account for the heteroscedasticity present in the model (see Hoechle, 2007).

    Fig. 1

    Fig. 1 Examples of residual dispersions that (A) fulfill the assumption of homoscedasticity or (B) violate this assumption through heteroscedasticity, or (C) nonlinear u-shaped or (D) inverted u-shaped distributions.
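    As a hedged sketch of how such diagnostics might be produced in practice, the following snippet fits an ordinary least squares model to synthetic data with statsmodels and plots residuals against fitted values; random scatter around zero, as in Fig. 1A, is the desired pattern. The data and variable names are illustrative only.

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Synthetic example data (not from the chapter): one outcome, two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Fit an ordinary least squares model and inspect residuals against fitted values.
model = sm.OLS(y, sm.add_constant(X)).fit()

plt.scatter(model.fittedvalues, model.resid, s=10)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Random scatter around zero (cf. Fig. 1A) supports the assumptions")
plt.show()
```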

    The assumption of independence requires that values within features are not correlated or derived from the same source. For example, data collected from the same participant over time or in different tasks are connected. If these values are serially correlated, the estimates of their variances will not be reliable. These effects can be mitigated by specifying known dependencies in the model and accounting for these random effects using linear mixed-effects models, or similar (Bates & DebRoy, 2004; Bates & Watts, 1988).
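    A minimal sketch of specifying such a known dependency, using statsmodels' mixed-effects interface with a random intercept per participant; the long-format data below are synthetic and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: repeated observations per participant (ID).
rng = np.random.default_rng(1)
n_subj, n_obs = 20, 8
ids = np.repeat([f"P{i:02d}" for i in range(n_subj)], n_obs)
task = np.tile(["A", "B"], n_subj * n_obs // 2)
subj_effect = np.repeat(rng.normal(scale=0.5, size=n_subj), n_obs)
score = 1.0 + 0.3 * (task == "B") + subj_effect + rng.normal(scale=0.2, size=n_subj * n_obs)
long = pd.DataFrame({"ID": ids, "Task": task, "score": score})

# A random intercept per participant models the within-subject dependence.
mixed = smf.mixedlm("score ~ Task", data=long, groups=long["ID"]).fit()
print(mixed.summary())
```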

    The assumption of normality requires that errors are normally distributed. This is sometimes referred to as a weak assumption as, if it is violated, unreliable results will only be obtained if the dataset is small. This assumption does not need to be upheld when dealing with large datasets (Schmidt & Finan, 2018). Normality can be assessed through visual inspection of normal probability or normal quantile plots to ensure the distribution of errors has a linear relationship with the theoretical standardized residual. Alternatively, normality can be tested using the Anderson-Darling test, the Jarque-Bera test, the Kolmogorov-Smirnov test, or the Shapiro-Wilk test. To mitigate violations in normality, the sample size should be increased to more than 10 observations per variable (Schmidt & Finan, 2018).
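    The following sketch illustrates both approaches on synthetic residuals: a normal quantile (Q-Q) plot for visual inspection and two of the formal tests mentioned above. The residuals are generated purely for illustration.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Residuals from a fitted model; drawn synthetically here for illustration.
rng = np.random.default_rng(2)
residuals = rng.normal(scale=0.5, size=200)

# Visual check: points should fall close to the reference line.
sm.qqplot(residuals, line="s")
plt.title("Normal Q-Q plot of residuals")
plt.show()

# Formal tests; interpret p-values cautiously with very large samples.
print("Shapiro-Wilk:", stats.shapiro(residuals))
print("Anderson-Darling:", stats.anderson(residuals, dist="norm"))
```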

    The assumption of linearly unrelated features (i.e., absence of multicollinearity) assumes features are not highly correlated with one another. Multicollinearity prevents a model from accurately associating variance in the dependent variable with the most informative predictor variable(s) and may lead to incorrect interpretations of the model. One can detect multicollinearity by looking at correlations between the input variables and collinearity statistics (e.g., tolerance and variable inflation factor). Multicollinearity can be mitigated through the data reduction techniques discussed later in the chapter (see Section 6).
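    A short example of computing variance inflation factors with statsmodels, using a hypothetical feature table that contains two nearly collinear predictors.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical feature table with two deliberately correlated predictors.
rng = np.random.default_rng(3)
f1 = rng.normal(size=100)
features = pd.DataFrame({
    "feature_1": f1,
    "feature_2": 0.9 * f1 + rng.normal(scale=0.1, size=100),  # nearly collinear
    "feature_3": rng.normal(size=100),
})
design = add_constant(features)

# VIF values well above ~5-10 are a common rule-of-thumb flag for multicollinearity.
vif = pd.Series(
    [variance_inflation_factor(design.values, i) for i in range(design.shape[1])],
    index=design.columns,
)
print(vif)
```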

    Best practice 2:

    Check statistical assumptions for machine learning processes and hidden steps

    Some machine learning approaches do not necessarily need to meet parametric assumptions (e.g., decision trees and neural networks; Breiman, 2001b; Brownlee, 2016). However, some software contain hidden data reduction steps prior to machine learning that still require parametric assumptions to be met (e.g., Gao et al., 2018). We recommend that researchers specify the assumptions required for their chosen approach. If the assumptions are unknown due to proprietary algorithms, then a conservative approach should meet parametric assumptions even if this is not strictly necessary. We further suggest that any procedures used to manipulate the data are reported.

    4: Sample size estimation

    Building supervised machine learning models that are both accurate and generalizable requires large datasets. Health data often contain a large number of variables (i.e., high-dimensionality) from a small number of participants (e.g., fewer than 40). It is challenging to collect high-quality (i.e., low-noise, artifact-free) data from large clinical cohorts; patient populations tend to experience fatigue more rapidly leading to poorer performance over time within testing sessions or missing data due to premature session termination. Most clinical studies are acquired from a single site (e.g., clinic or hospital), or the population is rare. There are several initiatives around the world aiming to link health data from multiple locations and experiments to create large corpora of data (big data). These initiatives assist in building more reliable machine learning models for health applications (Farinelli, Barcellos De Almeida, & Linhares De Souza, 2015). However, many health studies that use machine learning are still conducted using very small numbers of participants, some less than 10 (Halilaj et al., 2018).

    A common question when performing machine learning is what is an appropriate sample size? Unfortunately, there is no single answer to this question as there are several factors to consider, including the number of features/variables relative to the number of cases. From a theoretical perspective, training a model with a larger feature set relative to sample size will lead to a more idiosyncratic model that may not accurately classify cases when applied to new data. This will not always eventuate in practice; some variables can hinder classification, some are redundant in the presence of other variables, and others may have a negligible effect on classification accuracy. As discussed later, it is important to assess the usefulness of variables within any model (machine learning or otherwise) and variables that harm the model (or do not help the model) should be discarded.

    Sample size calculators have been developed for some machine learning approaches, such as predictive logistic regression (Hsieh, Bloch, & Larsen, 1998; Palazón-Bru, Folgado-De La Rosa, Cortés-Castell, López-Cascales, & Gil-Guillén, 2017). Some have suggested other rules, such as 10–20 samples per feature (Concato, Peduzzi, Holford, & Feinstein, 1995; Peduzzi, Concato, Feinstein, & Holford, 1995; Peduzzi, Concato, Kemper, Holford, & Feinstein, 1996), or a minimum sample size of between 100 (Palazón-Bru et al., 2017) and 500 (Bujang, Sa’at, & Bakar, 2018) regardless of the number of features. Although models with small-to-moderate sample sizes typically overestimate relationships between variables using categorical classification (i.e., the finite-sample bias; King & Zeng, 2001), even large samples of 150 cases per feature may not completely attenuate this bias (van Smeden et al., 2016). There are certain machine learning approaches that are better suited to small sample sizes (Sharma & Paliwal, 2015) and/or large feature sets that reduce data dimensionality (Jollife & Cadima, 2016), as discussed in the following sections. There are also dynamic resampling techniques that can estimate appropriate sample sizes for training sets and measure classification performance with different sample sizes that are not discussed here (see Byrd, Chin, Nocedal, & Wu, 2012; Figueroa, Zeng-Treitler, Kandula, & Ngo, 2012). However, there are no universally accepted practices for a priori determination of sample sizes, with the exception that larger sample sizes provide less biased estimates (suggested N = 75–100, Beleites, Neugebauer, Bocklitz, Krafft, & Popp, 2013). Instead, we discuss the best practices to estimate error variance and model fit, and advocate for data sharing and transparency when reporting results and methods to increase replicability and enable meta-analyses.
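    One empirical way to gauge whether a sample is adequate, in the spirit of the resampling techniques cited above, is to inspect a learning curve: train the model on progressively larger subsets and track cross-validated performance. The sketch below uses scikit-learn and synthetic data purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic two-class data standing in for a clinical dataset (illustrative only).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

for n, tr, te in zip(train_sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"n={n:>4}: training accuracy={tr:.2f}, cross-validated accuracy={te:.2f}")
```

    If cross-validated performance is still climbing at the largest training size, a larger sample would likely improve the model; a large gap between training and cross-validated accuracy suggests overfitting.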

    Best practice 3:

    Be cautious when interpreting results from small samples and use appropriate machine learning approaches

    5: Choosing parsimonious models

    It can be difficult to choose a machine learning approach, especially when starting out in the field. Some researchers may choose an approach based on common practice in their field or on what is within their capabilities. This decision-making strategy is understandable given the range of options for performing machine learning. There is no catch-all solution in machine learning and every approach has advantages and disadvantages (i.e., the "no free lunch" theorem; Wolpert, 1996). However, given that machine learning provides a situation where a near-infinite number of variables can be used to discriminate between categories using various approaches, it is prudent to discuss the concept of parsimony in model selection (Vandekerckhove, Matzke, & Wagenmakers, 2015). Parsimony follows the principle of Ockham’s razor: "Entities are not to be multiplied beyond necessity" (William of Ockham, c. 1287–1347).

    When choosing the parameters and features that are necessary in machine learning, it is considered best practice to select the simplest model that best explains the data (Zellner, Keuzenkamp, & McAleer, 2001). There are some metrics that can inform how well a model explains the data relative to the model’s complexity including Akaike’s information criterion (AIC) and Bayesian information criterion (BIC). These metrics penalize more complex models (i.e., models with more variables and interactions) relative to the goodness of fit, to achieve a tradeoff between simplicity and information loss (Burnham & Anderson, 2004). Lower AIC or BIC values indicate a more parsimonious model relative to other models with the same (or similar) feature sets; BIC penalizes complexity more than AIC and a correction can be applied to the AIC for small sample sizes (AICc). The choice of whether to use AIC or BIC depends on whether the researcher favors information retention or simplicity, respectively. However, AIC and BIC are rarely used when selecting variables and model parameters for machine learning (but see Demyanov, Bailey, Ramamohanarao, & Leckie, 2012).
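    As an illustration of how AIC and BIC penalize needless complexity, the following sketch compares a simple and a more complex logistic regression fitted to synthetic data in which only one predictor carries signal; the data and variable names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data in which only x1 truly predicts the binary outcome.
rng = np.random.default_rng(4)
n = 300
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-0.8 * df["x1"])))

simple = smf.logit("y ~ x1", data=df).fit(disp=0)
complex_model = smf.logit("y ~ x1 + x2 + x1:x2", data=df).fit(disp=0)

# Lower AIC/BIC indicates a more parsimonious description of the data.
print(f"simple : AIC={simple.aic:.1f}  BIC={simple.bic:.1f}")
print(f"complex: AIC={complex_model.aic:.1f}  BIC={complex_model.bic:.1f}")
```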

    The absence of parsimony metrics means that researchers must pay attention when selecting/removing variables and adjusting model parameters to ensure that the model is generalizable (i.e., does not overfit the data) and not needlessly complex. We advocate that metrics of parsimony be considered when selecting variables and model parameters, and that these values are reported for baseline and saturated models, in addition to the final model. With parsimony in mind, there are several aspects of machine learning that impact model complexity and generalizability.

    The flexibility of the learning algorithm is an important aspect of machine learning. In some instances, a machine learning approach might overfit the sample data and not generalize to other datasets or the population from which the sample was drawn. Most supervised machine learning approaches can be adjusted (automatically or through set parameters) to provide a tradeoff between bias and variance (e.g., GridSearch or Genetic Programming; Nagarajah & Poravi, 2019). This ensures that learning algorithms are not so flexible (i.e., low bias) that they produce a different fit for each training set (i.e., high variance). We recommend that, where applicable, these parameters are reported to improve replicability and so the reader can assess the complexity and level of bias of a model.
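    A brief sketch of such a parameter search using scikit-learn's GridSearchCV, tuning the regularization parameter C of a support vector classifier and reporting the selected value with its cross-validated accuracy; the data are synthetic and purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; in practice use the study's feature matrix and labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# C controls the bias-variance tradeoff of the support vector classifier.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(pipeline, {"svc__C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)

# Report the selected parameters alongside cross-validated performance.
print("Best parameters:", grid.best_params_)
print(f"Mean CV accuracy: {grid.best_score_:.2f}")
```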

    Best practice 4:

    Report adjustments to model parameters and any algorithms used to attain these values

    Another aspect to consider is the function complexity and whether there are complex interactions between the input variables. Complex models require more data for the training set and a learning algorithm with low bias. Bayesian variable elimination methods are sensitive to complex interactions. They are useful in identifying whether simple or complex functions exist within a dataset and whether they explain enough variance to be considered parsimonious (Zhang, 2016). It is difficult to predict the function complexity prior to performing machine learning in the absence of preestablished theoretical or data-driven models that describe relationships between features (e.g., path models with mediator variables). While this problem cannot be completely avoided, we advocate that variables be sufficiently described for machine learning applications and that their inclusion is justified even for exploratory analyses.

    An additional issue when identifying the most parsimonious model is the number of variables or features. High-dimensional data does not necessarily guarantee a better outcome when using machine learning; some variables may confuse algorithms and decrease classification accuracy. Here we discuss several methods for feature selection/elimination that can identify the relevant and irrelevant features for classifying a target.

    6: Reduction of data dimensionality

    Applying machine learning methods to datasets with small sample sizes and high dimensionality without careful consideration can lead to overfitting. Overfitting occurs when models perform well when predicting the training samples but fail when introduced to new data (e.g., test data or new samples). When building supervised machine learning models, feature engineering and dimension reduction are crucial steps that mitigate the risk of overfitting while maintaining high levels of accuracy. Before applying machine learning methods, it is critical researchers use their domain knowledge and construct feature sets with a justified number of features to reduce the dimensions of data while retaining sufficient relevant information.

    Speech data, for example, undergo signal processing to arrive at a set of acoustic features that, theoretically, represent the quality of speech and potential deficits in speech articulators (Noffs et al., 2020; Vogel et al., 2017; Vogel et al., 2020; Vogel, Fletcher, & Maruff, 2010). There are multiple parameters and algorithms that can be used to measure voice attributes, and a plethora of acoustic features that show potential for classifying diseases that affect speech (cf. Hegde et al., 2019). This creates the possibility of high-dimensional data that may contain redundant variables or overfit the data when using machine learning approaches. Similar instances of high dimensionality also occur for other types of data including fMRI (cf. Kassraian-Fard et al., 2016), EEG (cf. Craik et al., 2019; Sun & Zhou, 2014), movement and motion capture (cf. Figueiredo et al., 2018; Kubota et al., 2016), and genetics (cf. Libbrecht & Noble, 2015). To remedy this, researchers can reduce data dimensions through variable selection and/or automated dimensionality reduction algorithms (Mares, Wang, & Guo, 2016).
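    As one common automated approach, principal components analysis (PCA) can compress a high-dimensional feature matrix into a much smaller set of components while retaining most of the variance. The sketch below is illustrative only; the feature matrix is randomly generated to stand in for, e.g., a large set of acoustic measures.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical high-dimensional feature matrix: 60 participants, 200 features.
rng = np.random.default_rng(5)
X = rng.normal(size=(60, 200))

# Standardize first, then keep enough components to explain 95% of the variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print("Original dimensions:", X.shape[1])
print("Retained components:", pca.n_components_)
print("Explained variance:", pca.explained_variance_ratio_.sum().round(3))
```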

    6.1: Scaling

    Before applying data reduction methods and (most) machine learning methods, variables should be scaled so all features have comparable ranges and distributions (Zheng & Casari, 2018). Scaling is particularly important for methods that use distance measures (e.g., Euclidean distance), such as linear regression, k-means, and k-nearest neighbors. In the absence of scaling, results may be unreliable. The most common scaling methods are z-scores (i.e., standard normalization) and range normalization so the values fall into a range between 0 and 1.
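    A minimal example of both scaling methods using scikit-learn; the two features below are hypothetical measures on very different scales (e.g., age in years and reaction time in milliseconds).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical features on very different scales.
X = np.array([[25.0, 310.0], [40.0, 455.0], [63.0, 520.0], [71.0, 640.0]])

# z-scores: each feature rescaled to mean 0 and standard deviation 1.
X_z = StandardScaler().fit_transform(X)

# Range normalization: each feature rescaled to fall between 0 and 1.
X_minmax = MinMaxScaler().fit_transform(X)

print(X_z.round(2))
print(X_minmax.round(2))
```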

    6.2: Variable selection

    One way to reduce the number of variables in high-dimensional data is through variable selection and/or elimination. These methods examine covariation between variables to identify potential redundancies and discard variables that explain the least variance when distinguishing between target categories. For example, there are stepwise variable selection procedures based on parsimony metrics (AIC, BIC) that determine which variables account for the most variance and retain them and/or which variables account for the least variance and eliminate them (see Zhang, 2016). These procedures can also be performed with interactions between variables to determine complex relationships between features that may explain additional variance. However, increasing model parsimony may decrease classification accuracy due to the removal of variables that explain little variance relative to added model
