Data Science Applied to Sustainability Analysis
Ebook, 634 pages, 14 hours

About this ebook

Data Science Applied to Sustainability Analysis focuses on the methodological considerations associated with applying this tool in analysis techniques such as lifecycle assessment and materials flow analysis. As sustainability analysts need examples of applications of big data techniques that are defensible and practical in sustainability analyses and that yield actionable results that can inform policy development, corporate supply chain management strategy, or non-governmental organization positions, this book helps answer underlying questions. In addition, it addresses the need of data science experts looking for routes to apply their skills and knowledge to domain areas.

  • Presents data sources that are available for application in sustainability analyses, such as market information, environmental monitoring data, social media data and satellite imagery
  • Includes considerations sustainability analysts must evaluate when applying big data
  • Features case studies illustrating the application of data science in sustainability analyses
Language: English
Release date: May 11, 2021
ISBN: 9780128179772

    Book preview

    Data Science Applied to Sustainability Analysis - Jennifer Dunn

    Chapter 1

    Overview of data science and sustainability analysis

    Prasanna Balaprakash (a, b), Jennifer B. Dunn (a, c, d)

    (a) Northwestern-Argonne Institute of Science and Engineering, Evanston, IL

    (b) Math and Computer Science Division, Argonne National Laboratory, Lemont, IL

    (c) Center for Engineering Sustainability and Resilience, Northwestern University, Evanston, IL

    (d) Chemical and Biological Engineering, Northwestern University, Evanston, IL

    Chapter Outlines

    Data science is central to advances in sustainability

    Types of sustainability analyses

    Data science tools

    Supervised learning

    Unsupervised learning

    Reinforcement learning

    Tools

    Overview of case studies in data science in sustainability

    Abstract

    Globally, challenges related to sustainability abound, including improving air and water quality, reducing food and water consumption, decreasing waste, enhancing energy efficiency and the share of renewable energy, and conserving ecologically valuable lands. One of the most pressing sustainability-related challenges is reducing greenhouse gas emissions that contribute to climate change while developing environmentally-sound adaptation strategies. Simultaneously, advancing the societal aspect of sustainability is critical, but challenging as large portions of the world's population live below the International Poverty Line. Data science, including different statistical machine learning techniques, is a tool that will see increasing use in efforts to tackle sustainability challenges. Leveraging the growing volumes of data such as satellite imagery, continuous sensor data from industrial processes, social media data, and data from environmental sensors, requires such techniques. This book provides case studies and examples at the intersection of data science and sustainability in the areas of environmental quality and sustainability, energy and water, sustainable systems analysis, and society and policy.

    Key Words

    Data science; Sustainability; Energy and water; Systems analysis; Policy

    Data science is central to advances in sustainability

    The frequently-used term sustainability is often defined per the Brundtland Report's definition of sustainable development:

    Sustainable development is development that meets the needs of the present without compromising the ability of future generations to meet their own needs. (World Commission on Environment and Development, Our Common Future, 1987)

    For the purposes of this book, we intend the term sustainability to mean the potential for achieving a high quality of life in human, social, environmental, and economic systems, both today and in the future.

    For this potential to be realized, society must reach a point where air quality, water quality, and soil health are robust and do not pose a threat to ecosystem or human health. Air quality remains unsatisfactory globally: air pollution was the fifth leading risk factor for global mortality in 2017, associated with 4.9 million deaths (Fig. 1.1) and 147 million healthy life years lost. (Health Effects Institute 2019) In addition, water quality globally is a challenge, with only half of water bodies exhibiting good quality per United Nations Sustainable Development Goal monitoring initiatives (Fig. 1.2). (UN Environment, 2018) In addition to air and water quality, soil quality has significant implications for human health yet can suffer from pollution from industry, mining, or waste disposal. In Europe, sites with likely soil contamination number 340,000, with only one-third of these undergoing detailed study and only 15 percent of those remediated. (Food and Agriculture Organization of the United Nations 2015) In the US, the Environmental Protection Agency has remediated 9.3 million ha of contaminated land, with 160 contaminated sites on the Superfund National Priorities List remaining to be evaluated and 49 new sites proposed to be added to the list. In addition to contamination, other challenges to soil sustainability include erosion and loss of organic carbon. (Food and Agriculture Organization of the United Nations 2015)

    Fig. 1.1 Number of deaths attributable to air pollution in 2017. Data source: Global Burden of Disease Study 2017. IHME, 2018. (Health Effects Institute 2019)

    Fig. 1.2 Proportion of water bodies with good ambient water quality (percent) in 2017 (UN Environment 2018).

    Similarly, an abundance of energy, preferably produced from minimally polluting and renewable resources, and of clean water is essential for society's survival. Like the quality of air, water, and soil, energy and water use are closely tied to human activity. The world has nearly tripled its energy consumption since 1971 (Fig. 1.3). (International Energy Agency. IEA, 2020) Fossil fuels (coal, oil, and natural gas) continue to dominate the production of energy, with the attendant impacts from mining and extraction and from subsequent combustion, which releases into the air greenhouse gases containing carbon that had long been sequestered in the earth. Combustion of fossil fuels also diminishes air quality. In addition to considering total energy production, it is important to evaluate energy efficiency, which improved by only 1.2 percent from 2017 to 2018. Improving energy efficiency is one of the best strategies available for cutting energy consumption and associated pollution. (International Energy Agency 2019) The production of energy, along with many other activities, including agriculture, consumes water. Correspondingly, as the population has increased and clean water supplies have diminished, water scarcity is a reality for approximately one-half of the global population. (Boretti and Rosa, 2019)

    Fig. 1.3 World total energy supply by source (million tons of oil equivalent). (International Energy Agency. IEA, 2020)

    Challenges faced in achieving sustainability include improving food and water security, maintaining biodiversity, reducing air and water pollution, reducing greenhouse gas emissions, increasing reuse and recycling, and increasing system-level efficiencies in energy, urban, agricultural and industrial systems.

    Furthermore, extracting and disposing of the materials we need to make the equipment, devices, and food that run our society can be unsustainable, generating pollution and operating without concern for the long-term availability of critical materials. In fact, the very technologies society is relying on to address climate change, including wind turbines, solar panels, and lithium-ion batteries, rely on metals (cobalt, nickel, copper, rare earths) that are mined, often in developing countries where environmental regulations are frequently insufficient to protect populations from exposure to pollution in the air, water, and soil. (Sovacool et al., 2020)

    Conserving natural lands is an important part of ensuring a healthy and productive future for human society. Natural lands such as grasslands, wetlands, and forests provide innumerable ecosystem services such as mitigating floods, sequestering carbon, and enhancing biodiversity. Targeted conservation initiatives are required to slow the pace and extent of extinction, improve environmental quality, and retain the inspirational value of nature. (Balvanera, 2019)

    Economic and consumer preference drivers often can favor technology and societal developments that advance towards sustainability, but law and policy are important drivers as well. (Ashford and Hall, 2011) For example, one reason energy efficiency gains have faltered (Fig. 1.4) is a lack of clear policy to advance energy efficiency. (International Energy Agency 2019)

    Fig. 1.4 Global Improvements in Primary Energy Intensity 2000–2018.⁶

    Finally, social well-being, in part as indicated by the portion of the world's population that can viably provide food and other basic needs for themselves and their families, is an important element of sustainability. Global levels of poverty declined between 2013 and 2015 across all regions of World Bank analysis, yet the percent of people living at the International Poverty Line of $1.90 per day stayed relatively constant in many of these regions, showing a decrease most notably in South Asia (Fig. 1.5). The over 700 million people globally living below this poverty line in 2015 are an unignorable indication that sustainability has not yet been attained.

    Fig. 1.5 Number of people and percent of population at the International Poverty Line of $1.90/day (2011 PPP). (World Bank 2018)

    Undoubtedly, the breadth of earth, industrial, and societal systems that contribute to sustainability is immense. Developing technology, societal, and policy approaches to address each facet of sustainability can be guided by analyses that point the way, for example, towards pollution or water scarcity hotspots, the most impactful energy efficiency technologies, or regionally-specific conservation strategies. These analyses can make use of ever-growing volumes of data including satellite imagery, continuous sensor data from industrial processes, social media data, and environmental sensors, to name only a few. As a result, data science techniques have become central to addressing sustainability challenges and this role will only expand in the future.

    Accordingly, we have assembled this book with the contribution of co-authors who are addressing sustainability challenges in the spheres of environmental health, energy and water, sustainable industrial systems, and society and policy. Our intention is to provide a well-rounded set of case studies addressing different challenges using varying types of sustainability analysis and data to serve as a reference for analysts who seek to employ data science in their work and for data scientists looking to apply their skills to sustainability challenges. Another audience for this book will be policy makers who rely on sustainability analyses as a decision making tool to evaluate how governments could collect data that would support these efforts and use the results of these analyses in policymaking. Additionally, this book could be used in data science and systems analysis classrooms to provide case study examples, especially at the graduate student level.

    In the remainder of this introductory chapter, we review different types of analyses that guide our understanding of and action towards increasing sustainability. We provide an overview of data science tools that can be used in sustainability analyses. Finally, we introduce the different case studies readers will encounter in the remaining chapters. We note that the concluding chapter of this book summarizes data gaps and research needs for the further building of data science applications in sustainability analysis.

    Types of sustainability analyses

    The term sustainability analysis is meant to be broad for this book to capture a wide range of analyses that can address or evaluate society's advancement towards sustainable resource management and wellbeing. We summarize some examples of analysis types that fall under the term sustainability analysis in Fig. 1.6. As one example, natural systems modeling, which improves our understanding of the geoscience, bioscience, and social science that underpin natural systems, relies on data analysis and modeling, with an effort to move towards prediction. Hydrologic modeling, for example, based on land characteristics and precipitation data, can help predict the location and effects of flooding from major precipitation events. Soil carbon modeling that explores the influence of agricultural management practices on levels of carbon storage in soils is another example of analysis that could be grouped under sustainability analysis. Furthermore, modeling of air pollution dispersion could also be categorized under this umbrella. All of these natural systems models contain parameters that must be estimated based on evaluation of data sets.

    Fig. 1.6 Examples of sustainability analysis types that are increasingly using data science techniques.

    Furthermore, many types of analyses can enhance the design of industrial, energy, and water systems that offer sustainability improvements over the status quo. As one primary example, machine learning can be used to speed up the design of new materials for any number of important sustainability applications, from membranes that exhibit less fouling in water treatment, thereby reducing the energy and chemicals used in wastewater treatment, to next-generation lithium-ion battery chemistries. Additionally, as the Industrial Internet of Things continues to expand, analysts will apply data science techniques to identify opportunities to improve the energy, water, and material efficiency of industrial processes. Finally, evaluating the progress of consumers' adoption of technology that is more energy or water efficient, for example, is another important type of sustainability analysis. This type of analysis could be based on earth observation data in the case of adoption of large infrastructure or based on social media posts that indicate shifts in technology use in the home, on the road, or in the workplace.

    Two mainstays of sustainable systems analysis are life cycle assessment (LCA) and materials flow analysis (MFA). Whereas LCA evaluates the environmental effects of a product or process – from fuels to electronics to foods – MFA tracks the flows of commodities within a system boundary, which could be a city, a region, or a nation. LCA and MFA are at the very beginning of applying data science techniques, in general because datasets are often insufficiently large to allow data science approaches to offer value. As the data revolution continues, these two analysis types have many opportunities to leverage data science techniques.

    Finally, evaluations of social well-being are another important pillar of sustainability analyses because sustainability is often described as having three pillars – economic, social, and environmental. One expanding enabler of using data science approaches in social well-being evaluations is satellite imagery, which provides a bird's eye view of living conditions for Earth's inhabitants. While these data can show us these conditions, they cannot identify what has caused them. This second and critical step will require the linkage of image interpretation and causal analysis.

    Regardless of analysis type, data availability is a cornerstone of all of these analyses. In some instances that remain data sparse, the use of data science techniques in these areas is anticipatory rather than widespread. Furthermore, the examples we provide here are not all-encompassing and the list of types of sustainability analyses that benefit from data science approaches today and into the future will evolve and grow.

    Data science tools

    Broadly speaking, data science is an interdisciplinary field that adopts data collection, pre-processing, meaningful and useful feature extraction methods, data exploration methods, and predictive models to extract knowledge from a wide range of structured and unstructured data. Given the structure, size, heterogeneity, and complexity of the data sets, a wide range of data science tools and techniques have been developed. Among them, statistical machine learning (ML) is a prominent class of methods that are used and adopted for many data science tasks. Next, we will review three widely used subclasses of ML methods.

    Supervised learning

    Supervised learning is used to model the functional relationship between the output variables and one or more independent input variables. Typically, the original functional relationship is unknown and/or hard to derive in an analytical form. The approach starts with a set of training data given as a large set of input-output pairs. The goal is to find a surrogate function for the original functional relationship such that the difference between the prediction from the surrogate function and the observed value is minimal for all input-output pairs in the training data and the unseen testing data. Several supervised learning algorithms exist in the ML literature. Based on their functionality, one can group them as follows: regularization, instance-based methods, recursive partitioning, kernel-based methods, artificial neural networks, bagging, and boosting methods. Often, the best method depends on the data and the type of modeling task, including the volume of data, the variety of data, and the speed required for training and inference. Here, we cover several widely adopted algorithms that span these groups. We will review them from a regression perspective (predicting a scalar value). Without loss of generality, most of these methods also handle classification (predicting a class).

    Multivariate linear regression (Bishop, 2006) is one of the simplest methods for modeling the functional relationship between inputs and output. It models the functional relationship using a linear equation, given by the sum of the products of each input with a scaling factor, plus a bias term. Fitting a multivariate linear regression involves finding the scaling factors and the bias. It is one of the best understood methods and is often preferred for its interpretability and simplicity. It is important that data science practitioners adopt this method as a baseline for comparison with other methods.
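
    As a minimal sketch of this idea (not code from the book), the scaling factors and the bias can be found by solving the normal equations of the least-squares problem. The function names `solve_linear_system`, `fit_linear_regression`, and `predict`, as well as the toy data, are hypothetical and chosen only for illustration.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear_regression(inputs, outputs):
    """Return (scaling_factors, bias) that minimize the squared error."""
    # Append a constant 1 to each input so the bias is learned as a weight.
    rows = [list(x) + [1.0] for x in inputs]
    d = len(rows[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * y for r, y in zip(rows, outputs)) for i in range(d)]
    w = solve_linear_system(xtx, xty)
    return w[:-1], w[-1]


def predict(scaling_factors, bias, x):
    """Sum of each input times its scaling factor, plus the bias."""
    return sum(w * xi for w, xi in zip(scaling_factors, x)) + bias


# Hypothetical toy data generated from y = 2*x1 + 3*x2 + 1 (no noise).
inputs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
outputs = [1.0, 3.0, 4.0, 6.0, 8.0]
scaling_factors, bias = fit_linear_regression(inputs, outputs)
```

    In practice a library routine would normally replace the hand-written solver; it appears here only to keep the example self-contained.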

    Ridge regression (Hoerl and Kennard, 1970) is a regularization algorithm that is designed to reduce the model complexity so that the model does not overfit the training data. Overfitting occurs in supervised learning when the model learns small variations and/or noise in the training set and consequently loses prediction accuracy on the testing data. To avoid it, in addition to minimizing the error between predicted and actual observations, the method penalizes the training objective with respect to the input coefficients and achieves a tradeoff between minimizing the error and minimizing the sum of the squares of the coefficients.
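
    The tradeoff can be illustrated with a simplified sketch that assumes a single input variable and an unpenalized bias, a special case in which the penalized coefficient has a closed form. The function name `fit_ridge_1d`, the `penalty` parameter, and the data are hypothetical.

```python
def fit_ridge_1d(xs, ys, penalty):
    """Minimize sum((y - w*x - b)^2) + penalty * w^2 for one input variable.

    The bias b is not penalized, so the data can be centered first and the
    penalized slope has the closed form Sxy / (Sxx + penalty).
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / (sxx + penalty)
    b = y_mean - w * x_mean
    return w, b


# Hypothetical noisy observations of a roughly linear relationship.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 1.9, 4.2, 5.8, 8.0]

slope_ols, _ = fit_ridge_1d(xs, ys, penalty=0.0)     # ordinary least squares
slope_ridge, _ = fit_ridge_1d(xs, ys, penalty=10.0)  # coefficient shrunk toward zero
```

    With a penalty of zero the fit reduces to ordinary least squares; increasing the penalty shrinks the coefficient toward zero at the cost of a larger training error.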

    k-nearest-neighbor regression (Bishop, 2006) belongs to the class of instance-based methods, where the training data is stored in memory and the model is built only during testing. Given a testing point, the method first finds k nearest input points in the training data and returns the prediction as the average of k outputs. Typically, k and the nearest distance metric are user defined hyperparameters.
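
    A minimal sketch of this procedure, assuming Euclidean distance as the metric and a hypothetical function name `knn_regress`, could look as follows.

```python
import math


def knn_regress(train_inputs, train_outputs, query, k=3):
    """Predict by averaging the outputs of the k training points nearest to query."""
    # No model is fit in advance: all the work happens at prediction time.
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_inputs, train_outputs)
    )
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical one-dimensional training data.
train_inputs = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
train_outputs = [0.0, 1.0, 2.0, 3.0, 10.0]

prediction = knn_regress(train_inputs, train_outputs, query=(1.1,), k=3)
```

    Here the three nearest inputs to 1.1 are 1.0, 2.0, and 0.0, so the prediction is the average of their outputs.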

    Support vector machine (Drucker et al., 1996) is a widely used kernel-based method. It uses a kernel function to project the input space onto a higher-dimensional feature space; a linear regression is performed in the transformed space. The training is formulated as a convex quadratic optimization problem, for which efficient optimization algorithms are available. The effectiveness of this method depends on a good choice of kernel type and its hyperparameters.

    Decision tree regression (Breiman, 1984) belongs to the class of recursive partitioning methods. It recursively splits the multidimensional input space of training points into regions such that inputs with similar outputs fall within the same region. The splits give rise to a set of if-else rules. For each region, an average over the output values is computed and stored at the end of each rule. Given a new testing point, the decision tree follows the if-else rules and returns the stored value as the predicted value.
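
    The recursive splitting can be sketched as below. This is an illustrative simplification, assuming that splits are chosen to minimize the summed squared error of the two resulting regions; the names `build_tree` and `predict`, the dictionary layout, and the data are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(inputs, outputs, max_depth=2, min_samples=2):
    # Leaf: store the average output of the region.
    if max_depth == 0 or len(outputs) < min_samples or len(set(outputs)) == 1:
        return {"value": mean(outputs)}
    best = None
    for feature in range(len(inputs[0])):
        for threshold in sorted({x[feature] for x in inputs}):
            left = [i for i, x in enumerate(inputs) if x[feature] <= threshold]
            right = [i for i, x in enumerate(inputs) if x[feature] > threshold]
            if not left or not right:
                continue
            cost = sse([outputs[i] for i in left]) + sse([outputs[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, feature, threshold, left, right)
    if best is None:  # no valid split exists, so fall back to a leaf
        return {"value": mean(outputs)}
    _, feature, threshold, left, right = best
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([inputs[i] for i in left], [outputs[i] for i in left],
                           max_depth - 1, min_samples),
        "right": build_tree([inputs[i] for i in right], [outputs[i] for i in right],
                            max_depth - 1, min_samples),
    }


def predict(tree, x):
    # Follow the if-else rules until a leaf with a stored value is reached.
    while "value" not in tree:
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["value"]


# Hypothetical data with two clearly separated regions.
inputs = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
outputs = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
tree = build_tree(inputs, outputs)
```

    For this data the tree learns a single rule that separates inputs at or below 3.0 from the rest, and each leaf stores the average output of its region.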

    Random forest (Breiman, 2001) is a bagging approach that considers random subsamples of the training dataset and builds a decision tree on each subsample. Given a new test data point, the predictions from all trees are averaged to obtain the predicted value.
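
    The bagging idea can be sketched as follows. To keep the example short it assumes one input variable, depth-one trees (stumps), and bootstrap samples drawn with replacement; full random forest implementations grow deeper trees and also subsample features. All names and data are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def fit_stump(xs, ys):
    """Depth-one regression tree on one input: (threshold, left_value, right_value)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not right:
            continue
        lm, rm = mean(left), mean(right)
        cost = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or cost < best[0]:
            best = (cost, threshold, lm, rm)
    if best is None:  # all inputs identical: predict the overall mean
        m = mean(ys)
        return (xs[0], m, m)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_forest(xs, ys, n_trees=25, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: n indices drawn with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    # Average the predictions of all trees.
    return mean([predict_stump(stump, x) for stump in forest])


# Hypothetical data with a low-output and a high-output region.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
forest = fit_forest(xs, ys)
low = predict_forest(forest, 1.0)
high = predict_forest(forest, 12.0)
```

    Because each tree sees a different subsample, individual trees disagree slightly, and averaging them smooths out that variation.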

    Gradient boosting regression (Friedman, 2002) is similar to random forest, but the trees are constructed sequentially, each on a random subsample. The key idea is to build each tree to reduce the error left by the previous trees.
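
    A minimal sketch of the sequential construction is shown below. It assumes squared-error loss, one input variable, and depth-one trees fit to the current residuals; the random subsampling mentioned above is omitted to keep the example deterministic. The names, the `learning_rate` parameter, and the data are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def fit_stump(xs, targets):
    """Depth-one regression tree on one input: (threshold, left_value, right_value)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        if not right:
            continue
        lm, rm = mean(left), mean(right)
        cost = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or cost < best[0]:
            best = (cost, threshold, lm, rm)
    if best is None:  # all inputs identical: predict the overall mean
        m = mean(targets)
        return (xs[0], m, m)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosting(xs, ys, n_rounds=20, learning_rate=0.5):
    base = mean(ys)  # start from a constant prediction
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # Each new tree is fit to the errors left by the trees built so far.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def predict_boosting(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def training_mse(model, xs, ys):
    return mean([(predict_boosting(model, x) - y) ** 2 for x, y in zip(xs, ys)])


# Hypothetical data from y = x^2, which a single stump cannot fit well.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0]
mse_1_round = training_mse(fit_boosting(xs, ys, n_rounds=1), xs, ys)
mse_20_rounds = training_mse(fit_boosting(xs, ys, n_rounds=20), xs, ys)
```

    Comparing the two training errors shows the effect of adding trees sequentially: each round reduces the residual error that remains after the previous rounds.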

    Deep neural networks (Goodfellow et al., 2016) belong to the class of artificial neural networks. They are characterized by stacked layers, where each layer is composed of a number of units. Each unit receives inputs from units in previous layers, which are combined in a weighted linear fashion and passed through a nonlinear function. The first layer receives the training points and the predictions are obtained from the last layer of the stack. The training phase consists of modifying the weights of the stacked layers to minimize the prediction error on the training data set. This is typically done with the stochastic gradient descent optimization method, which computes the gradients of the objective function with respect to all the weights in the network and uses them to update the weights.
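
    The training loop can be sketched with a deliberately tiny network: one input, one hidden layer of tanh units, and one linear output, trained by stochastic gradient descent on squared error. This is an illustration under those assumptions rather than a deep network; real models are built with packages that compute gradients automatically. All names, sizes, and hyperparameters here are hypothetical.

```python
import math
import random


def init_network(n_hidden, rng):
    return {
        "w1": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],  # input -> hidden
        "b1": [0.0] * n_hidden,
        "w2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],  # hidden -> output
        "b2": 0.0,
    }


def forward(net, x):
    # Each hidden unit: weighted linear combination passed through a nonlinearity.
    hidden = [math.tanh(w * x + b) for w, b in zip(net["w1"], net["b1"])]
    output = sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"]
    return hidden, output


def mean_squared_error(net, xs, ys):
    return sum((forward(net, x)[1] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def sgd_step(net, x, y, learning_rate):
    hidden, output = forward(net, x)
    error = output - y  # gradient of 0.5 * (output - y)^2 with respect to output
    for j, h in enumerate(hidden):
        grad_w2 = error * h
        grad_pre = error * net["w2"][j] * (1.0 - h * h)  # tanh'(z) = 1 - tanh(z)^2
        net["w2"][j] -= learning_rate * grad_w2
        net["w1"][j] -= learning_rate * grad_pre * x
        net["b1"][j] -= learning_rate * grad_pre
    net["b2"] -= learning_rate * error


def train(net, xs, ys, epochs=500, learning_rate=0.05, seed=0):
    rng = random.Random(seed)
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the training points in random order
        for i in order:
            sgd_step(net, xs[i], ys[i], learning_rate)


# Hypothetical training data from y = x^2 on [-1, 1].
xs = [i / 4.0 - 1.0 for i in range(9)]
ys = [x * x for x in xs]
net = init_network(4, random.Random(0))
loss_before = mean_squared_error(net, xs, ys)
train(net, xs, ys)
loss_after = mean_squared_error(net, xs, ys)
```

    Comparing `loss_before` with `loss_after` shows the effect of the weight updates on the training error.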

    Unsupervised learning

    Traditionally, unsupervised learning methods were used for exploratory analysis. (Bishop, 2006) Notably, clustering and dimension reduction methods were adopted for a wide range of data science tasks. The former computes the distances between the points in the given data using a distance metric, which is then used to group similar points. The latter is often employed to project high-dimensional data into a low-dimensional embedding space for visualization. In recent years, autoencoders, a class of deep neural networks, have received significant attention for dimension reduction due to their ability to perform effective nonlinear dimension reduction and handle large amounts of data. Another key advancement in the area of unsupervised learning is generative modeling, which has the potential to understand and explain the underlying structure of the input data when there are few, or even no, labels. A promising generative modeling approach that has received much recent attention is generative adversarial networks (GANs). (Goodfellow et al., 2014) The basic idea in a GAN is to train two deep neural networks simultaneously, capture the domain-specific features and representations from the unlabeled data, and deploy them as labeled data becomes available. For example, GANs can produce high-quality synthetic images of real-world objects without having any explicit labels of what those objects are. By automatically extracting the underlying structure of the inputs without labels, GANs can empower supervised learning methods to understand the context of the domain in which they operate.
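
    As one concrete instance of distance-based clustering, the sketch below implements k-means, a technique chosen here for illustration and not named in the text above. It assumes Euclidean distance and, for determinism, initializes the centroids with the first k points; the function name and data are hypothetical.

```python
import math


def kmeans(points, k, n_iterations=10):
    """Group points into k clusters by alternating assignment and centroid updates."""
    # Simplified initialization: use the first k points as starting centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return labels, centroids


# Hypothetical two-dimensional data with two well-separated groups.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
```

    No labels are supplied; the grouping emerges only from the distances between points.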

    Reinforcement learning

    Reinforcement learning is an approach concerned with training agents for autonomous design and control. (Sutton and Barto, 2018) The agents interact with an environment, receive rewards, and use them to improve their actions iteratively in training settings. Once trained, the agents can be deployed for control in test settings.
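
    The interaction loop can be sketched with tabular Q-learning, one common reinforcement learning algorithm chosen here for illustration. The environment is a hypothetical five-state corridor in which the agent starts at the left end and receives a reward of 1 on reaching the right end; all names and hyperparameters are assumptions.

```python
import random

N_STATES = 5        # states 0..4; state 4 is the goal
ACTIONS = (-1, +1)  # action 0 moves left, action 1 moves right


def step(state, action):
    """Environment: move along the corridor, with reward 1.0 on reaching the goal."""
    next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def choose_action(q_row, epsilon, rng):
    # Explore with probability epsilon, and break ties randomly.
    if rng.random() < epsilon or q_row[0] == q_row[1]:
        return rng.randrange(2)
    return 0 if q_row[0] > q_row[1] else 1


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # estimated value of each (state, action)
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on the number of steps per episode
            action = choose_action(q[state], epsilon, rng)
            next_state, reward, done = step(state, action)
            # Q-learning update: move the estimate toward reward + discounted best next value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train()
# Greedy policy after training: the preferred action in each non-goal state.
policy = [0 if q_table[s][0] > q_table[s][1] else 1 for s in range(N_STATES - 1)]
```

    After training, the greedy policy read from the table can be deployed without further exploration, which mirrors the split between training and test settings described above.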

    Tools

    The data science software ecosystem is vibrant, with a wide range of software tools, many of them open source. Scikit-learn (Pedregosa et al., 2018) is one of the most widely used packages for numerous data science tasks. It has implementations of preprocessing, unsupervised, and supervised learning methods that are an integral part of many data science pipelines. Similarly, the R project for statistical computing (R Core Team, 2021) provides a number of libraries to build data science pipelines with minimal effort. Jupyter Notebook (Kluyver et al., 2016) and RStudio (Allaire, 2012) are productivity-centric integrated development environments for interactive data science code development. TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2019) are packages for differentiable computing and are widely used for the design and development of deep neural network models. The Python and R ecosystems provide a number of libraries for data visualization (for example, Matplotlib (Hunter, 2007) and ggplot2 (Wickham, 2011)). RapidMiner (Mierswa and Klinkenberg, 2018), Weka (Hall et al., 2009), and KNIME (Berthold et al., 2009) are software tools designed for users with minimal programming experience. They provide easy-to-use interfaces to build data science pipelines but do not offer the flexibility and configurability of a programming-intensive software stack.

    Overview of case studies in data science in sustainability

    Data science techniques have been applied to numerous domains within the sustainability field. For example, social media data have been analyzed with data science techniques to inform an understanding of urban sustainability, including aspects like mobility and economic development (Ilieva and McPhearson, 2018), and even waste minimization in the beef supply chain. (Mishra and Singh, 2018) In general, the agricultural sector holds much promise for applications of data science to improve farming sustainability, such as reducing the use of fertilizer and irrigation. (Kamilaris et al., 2017) Considering the social side of sustainability, predictive analytics and data visualization have been used to study and improve the humanitarian supply chain. (Gupta et al., 2019)

    With such an expansive space of intersection between data science and sustainability, this book covers only a subset of an ever-growing field. We have focused on the broad topics of environmental quality and sustainability, energy and water, sustainable systems analysis, and society and policy. Fig. 1.7 places each chapter in this book in one of these topics.

    Fig. 1.7 Organization of case studies in this book.

    Environmental Quality and Sustainability focuses on how we can better understand natural ecosystems and design strategies to protect them and improve air, water, and soil quality. Swami (Chapter 2) examines the many ways artificial intelligence can contribute to conservation efforts. Bui (Chapter 3) describes the application of machine learning techniques including supervised pattern recognition, random forests, support vector machines, and deep learning to investigate spatial patterns such as species distribution, streamflow, and land use within Australian Critical Zones. For this application, machine learning techniques have proven helpful to predict spatial patterns, identify regions vulnerable to factors such as erosion or soil organic carbon loss, and find the drivers of spatial patterns. Pauwels et al. (Chapter 4) describe several methods that have been used to improve hydrologic modeling to inform water resources for Australia. The methods described include Bayesian techniques and Monte Carlo methods that demonstrate improvements in parameter estimation over other methods and can better predict flooding.

    As described in Section 1.1, energy use and water consumption continue to rise. Kapousouz et al. (Chapter 5) use clustering to explore spatial and temporal use patterns in energy and water consumption at the state level. Identifying patterns can lead to technology and policy development to reduce resource consumption. Developing and deploying technology to harness renewable energy is one important approach to minimizing the influence of energy production. Devereux and Cole (Chapter 7) explore avenues for using machine learning to develop solar photovoltaic cells that move this technology to optimal performance. One example is to use machine learning to predict property based on structure. Another relevant application of machine learning is to carry out high-throughput computational screening. A final option is to automatically generate property databases. One helpful tool in this case is using natural language processing to mine the technical literature for property information that can be included in such a database. As new technology develops, it is helpful to evaluate how it is being deployed and used in the real world. One reality of solar photovoltaic deployment is that it tends to be limited in areas of lower socio-economic status. Castellanos et al. (Chapter 6) integrated Google Project Sunroof and United States Census data and carried out regression analysis and bootstrapping to explore racial and ethnic disparities of rooftop solar photovoltaic technology. With this information in hand, interventions can be better designed to increase solar PV
