
Big Data Analytics for Cyber-Physical Systems: Machine Learning for the Internet of Things

Ebook · 727 pages · 6 hours

About this ebook

Big Data Analytics in Cyber-Physical Systems: Machine Learning for the Internet of Things examines sensor signal processing, IoT gateways, optimization and decision-making, intelligent mobility, and implementation of machine learning algorithms in embedded systems. This book focuses on the interaction between IoT technology and the mathematical tools used to evaluate the extracted data of those systems. Each chapter provides the reader with a broad list of data analytics and machine learning methods for multiple IoT applications. Additionally, this volume addresses the educational transfer needed to incorporate these technologies into our society by examining new platforms for IoT in schools, new courses and concepts for universities and adult education on IoT and data science.
  • Bridges the gap between IoT, CPS, and mathematical modelling
  • Features numerous use cases that discuss how concepts are applied in different domains and applications
  • Provides "best practices", "winning stories" and "real-world examples" to complement innovation
  • Includes highlights of mathematical foundations of signal processing and machine learning in CPS and IoT
Language: English
Release date: Jul 15, 2019
ISBN: 9780128166468


    Book preview

    Big Data Analytics for Cyber-Physical Systems - Guido Dartmann


    Introduction

    Cyber-physical systems (CPS) and the Internet of things (IoT) are developing rapidly, and this technology is now transforming our economy and society. The key features of this disruptive technological revolution are smart algorithms based on data science. In the last decade, progress was driven mainly by network concepts, embedded systems, and cloud technology. Now we are entering a new era that exploits the availability of artificial intelligence (AI) as a key technology for CPS. AI enables systems to make decisions based on measured data and to transform data into new business ideas. A key to success in the design of new AI-based ideas is the interplay of novel applications and mathematical methods. This book addresses technological advances in machine learning, data science, and optimization in combination with applications in IoT and CPS, for example, mobility, industry, environmental systems, and medicine. This includes fundamentals of (sensor) signal processing together with data analytics and machine learning (e.g., smart sensors and IoT gateways), optimization and decision-making in smart systems (e.g., intelligent mobility), and the implementation of new machine learning algorithms in embedded systems.

    These skills will become a central tool for the qualification of future engineers. In this book, we first introduce the fundamentals of data analytics and machine learning. Then, we present hardware platform aspects and applications in IoT, and finally, we discuss future demands in education for big data analytics in CPS.

    To introduce basic concepts of data science, Chapters 1 and 2 of the book present fundamentals of data analytics, statistics, and processing platforms in CPS. Chapter 3 investigates an application where clustering techniques are used for object detection in smart cities.

    To integrate smart IoT into our industry, knowledge of machine learning and data science needs to be combined with expertise in networks and embedded systems. In particular, IoT requires secure regional network platforms to provide a variety of IoT services; these are presented in Chapter 4.

    For the pervasive establishment of smart CPS in our economy, new algorithms have to be developed in combination with hardware components. Chapter 5 presents inference techniques for IoT applied in a complete IoT software and hardware framework embedded in a smart city infrastructure. Furthermore, efficient, real-time capable hardware is essential for applying new machine learning techniques in autonomous driving cars. Therefore, Chapter 6 presents new aspects of the design of heterogeneous hardware platforms for autonomous driving. Finally, Chapter 7 presents an overview of AI-based sensor platforms for smart cities and gives a broad view of how the different aspects (sensors, gateways, cloud, communication standards, actuators, and algorithms) work together to establish a smart IoT system. It further gives an overview of different IEEE standards and the need for future standardization for IoT.

    The next part of the book shows how data analytics and machine learning can solve different challenges such as energy saving, autonomous driving, air quality, and public health. In Chapter 8, machine learning tools are used to predict the energy consumption in buildings.

    Chapter 9 presents concepts of reinforcement learning for autonomous driving and gives an overview of a possible simulation framework in which AI algorithms for autonomous driving can be evaluated. In IoT, the localization of sensors and agents is an important aspect. Chapter 10 presents an evolutionary algorithm for the localization of sensory agents for infrastructure monitoring. AI can also be used to warn people in smart cities of dangerous gases or to monitor air quality. Progress in the design of gas sensors allows a low-cost artificial nose, which is presented in Chapter 11 in combination with machine learning techniques to classify different gases.

    Besides the environment, traffic, and autonomous driving, machine learning can revolutionize the future health system. In this book, Chapter 12 presents how basic algorithms based on continuous-time Markov chains can be used to classify different types of patients and diseases.

    AI and machine learning also pose new risks: as these algorithms become more and more powerful, they can easily find patterns in user data. This poses a considerable risk to people's privacy. Therefore, Chapter 13 presents the societal aspects of citizens in future urban environments, and Chapter 14 presents the theoretical foundations of metrics to quantify privacy in communication systems.

    Despite these technological advances, new concepts of education, especially in the fields of machine learning and data science, are required as well. This book therefore also addresses new concepts of education to transfer the described technology to our society. In particular, small and midsize companies need qualified employees to create new business models with IoT applications. People with knowledge of data analytics and machine learning together with practical experience in IoT and CPS are rare, and those who have this knowledge may prefer to apply to big players rather than to classical companies in mechanical engineering or other domains with interfaces to novel IoT technology. Therefore, Chapters 15 and 16 present concepts and blueprints for how this technology can be successfully integrated into our education systems.

    Chapter 1

    Data analytics and processing platforms in CPS

    Claudia Chitu⁎; Houbing Song†    ⁎ Faculty of Automatic Control and Computer Science, University Politehnica of Bucharest, Bucharest, Romania

    † Department of Electrical, Computer, Software and Systems Engineering, Embry-Riddle Aeronautical University, Daytona Beach, FL, United States

    Abstract

    The speed of new developments in IoT and CPS poses new challenges for data scientists and business owners seeking smarter insights, demanding real-time dashboards of information extracted from data in motion. Businesses are built on top of Big Data analytics, and event prediction returns a high percentage of revenue. Since many business areas can benefit from it, Big Data analytics as a research topic faces multiple challenges: a fundamental understanding of models, architectures, security, and privacy, but also the data science skills and mentality to accommodate big data-driven decisions. These challenges arise from issues such as inference from multiple data sources, observation measuring, missing events, surrogate variables, and incomplete information. This chapter therefore aims to present a broad overview of the most common methods and techniques, including data processing platforms, used for analytics applied to large volumes of unstructured, semistructured, and structured data arriving with high velocity. Moreover, the tutorial character of this material helps the reader develop capabilities in analyzing small data sets, to be followed up with massive amounts of data.

    Keywords

    Analytics; Dashboard; Machine Learning; Processing

    1 Open source versus proprietary software

    Cyber-physical systems (CPSs) are taking advantage of and growing with improvements in smart manufacturing industries and intelligent services, in which a key role is played by the evolution of Big Data. This evolution brings challenges and new trends in analyzing data (Yin and Kaynak, 2015). Historical data is examined with analytics tools and modeled for prediction, while actionable intelligence is extracted from information systems. Because of its high importance in business, there are many players on the market that automatically deliver data collection, cleansing, and analysis in near real time, and even predictions. Big Data ingested from thousands of robots, machines, customers, and combined information systems is turned into rewarding outcomes with predictive analytics. In CPS, analytics is a fundamental component and carries high weight in the decision and control process: from prediction to the incorporation of dynamics, analytics serves as a framework for identifying capability gaps and as a roadmap for opportunities to improve quality. Popular techniques such as the support vector machine (SVM) have been rewritten to take advantage of parallel computation and server farms that can grow organically. In this way, the infrastructure supports larger volumes of data coming from a very dynamic system. Proper data analytics based on near real-time big data streams makes it possible for a digital twin to optimize the working conditions, mode of operation, consumption, and maintenance of physical systems and manufacturing. As stated in Khaitan and McCalley (2015), CPSs have broad applications in many domains: vehicular systems and transportation, medical and health-care systems, smart homes and buildings (Shih et al., 2016), scheduling, thermal management, power grid systems, industrial process control, aerospace and air traffic management, etc.
Thus, the broad area of CPS applications demands extensive expertise in statistics and data analytics. This chapter presents a comprehensive list of the most widely used data analytics tools, with some hands-on examples. The knowledge transmitted within this chapter is designed to create a good picture of the fundamentals of statistics, developing the capability to understand software results and implement algorithms, to be followed up with larger and more complex data sets. One of the most challenging issues in CPS is data reliability; it is crucial to understand the correctness of collected observations as well as data validity. For instance, in Liu et al. (2017), based on big data analytics of the spatial distribution characteristics of location data loss events, the authors proposed novel data-driven methodologies to increase data validity in transportation systems for smart cities, and there are more similar applications for smart and interconnected communities (Kambatla et al., 2014; Rathore et al., 2016; Sun et al., 2016). As data complexity grows and the available techniques and software have to process more data, produce more insights, and generate decisions, we present in the first table open source and freeware software versus proprietary tools used to analyze and predict data.

    Table 1 presents only a few of the tools used in industry and academia to analyze and predict outcomes from billions of intelligent devices. When applied to near real-time industrial internet data streams, big data analytics eases the detection of critical failures, supports the exploration of challenging anomalies, and provides predictive maintenance alerts. More details about some of the previously mentioned tools can be found in Research (2018). Besides demanding special processing techniques, large data sets also need a special infrastructure, usually relying on cloud storage and resources. Kiran et al. (2015) present a general architecture for online analysis applied to big data sets to unlock previously unknown behaviors, using a data-handling back-end on Amazon EC2. Certainly, Big Data is an immense source of knowledge and information about systems, situations, and opportunities. Moreover, a complex computing stack architecture is required to enable processing of data at such a scale (Fox et al., 2015). Smart cities can benefit from real-time data collection, data processing, and visualization in a cloud-based data analysis service for information intelligence and to support decision making (Khan et al., 2013).

    Table 1

    2 Data types

    A first step in the journey of understanding data is to distinguish the data types. This is important when implementing machine learning (ML) algorithms, in order to use an implementation properly. Variables can be of categorical or numerical type. Numerical data are measurements or counts and can be of two kinds: continuous (any value, e.g., temperature, height, etc.) and discrete (integer values, e.g., the number of car accidents in a city in 1 year, the number of failures of air conditioning equipment, etc.). The complementary data type is the categorical one, which is divided into two subcategories: ordinal (with an obvious order: A, B, C, etc.) and nominal (no meaningful order, e.g., gender, color, etc.).
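    As a toy illustration, this four-way distinction can be sketched in Python (an illustrative sketch, not from the chapter; the helper function and its rule-of-thumb classification are our own assumptions):

```python
# Illustrative sketch (not from the chapter): tagging example CPS variables
# with the data types described above. The helper and its rules are our own.

def variable_kind(values, ordered=False):
    """Classify a variable as the chapter does: numerical (continuous or
    discrete) versus categorical (ordinal or nominal)."""
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        # Numerical: all-integer counts are discrete, measurements continuous
        return "discrete" if all(isinstance(v, int) for v in values) else "continuous"
    # Categorical: an explicit order makes it ordinal, otherwise nominal
    return "ordinal" if ordered else "nominal"

print(variable_kind([21.5, 22.1, 19.8]))             # temperature -> continuous
print(variable_kind([0, 2, 1, 3]))                   # accident counts -> discrete
print(variable_kind(["A", "B", "C"], ordered=True))  # grades -> ordinal
print(variable_kind(["red", "blue"]))                # colors -> nominal
```
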

    Considering these variable data types, there are a few categories of data analysis:

    1. Qualitative analysis (examination of nonquantifiable data, widely used in environmental chemistry or the oil industry, for instance)

    2. Quantitative analysis (statistical, mathematical, or numerical analysis applied to objective measurements)

    3. Spatial analysis (analysis of the location of objects or phenomena being observed, e.g., analyzing data on a map)

    4. Hierarchical analysis (data with parent-child relationships)

    5. Graph analysis (analysis of relations between objects)

    6. Textual data analysis (finding patterns and word usage in documents and text-based data sources)

    3 Easy data visualization using code

    As an introduction to data analytics, an image of how the data looks can be very helpful, especially if the volume of data is high. For example, using an open data set from ITU (2015) on the percentage of individuals using the internet, selecting only the values collected for 2016, and adding the country codes and coordinates of countries from Tamosauskas (2018), we created a map with bubble plots to see how the data is distributed across the world, and where the values are higher, using different gradients of blue (different gradients of gray in print versions; Fig. 1).

    Fig. 1 Distribution of data for internet usage percentage around the globe in 2016.

    The R code used to create the previous figure is shown in the following code snippet:

    library(ggplot2)
    library(dplyr)
    dataset <- read.csv('C:/Users/Claudia/Desktop/individuals_using_internet.csv',
                        header = TRUE, sep = ',')
    ggplot(data = dataset) +
      borders(database = 'world', colour = 'grey60', fill = 'grey90') +
      geom_point(aes(y = lat_avg, x = long_Avg, size = percent, color = percent)) +
      scale_size_area(max_size = 1) +
      ggtitle('Percentage of individuals using internet in 2016') +
      xlab('') + ylab('') +
      labs(color = 'percent of individuals') +
      theme(panel.background = element_blank(),
            axis.title.x = element_blank(), axis.text.x = element_blank(),
            axis.ticks.x = element_blank(),
            axis.title.y = element_blank(), axis.text.y = element_blank(),
            axis.ticks.y = element_blank())

    Similar graphics can be created in Python using the dedicated geopandas package, or plotly for creating choropleth maps (Plotly, 2018). For plotting geographical maps, R implements the dplyr and ggplot2 packages with dedicated functions for geographical representation. We refer to the R and Python programming languages because they are very popular among developers and data scientists, and several exploration and data visualization tools, such as VisIt, CDAT, and VisTrails, have been built using Python (Anderson et al., 2010). The importance of the Python programming language in solving problems that involve large data sets in different formats and computational systems has been highlighted since 2010 (Perez et al., 2011). From the same category of data visualization techniques comes the choropleth map, a thematic map in which areas are colored according to population density or the measurement of the statistical variable shown on the map. Although this type of map has some limitations, such as the difficulty of making comparisons or ranking countries by looking at the map alone, it is very popular because it is easily understood by the audience. Choropleth maps are indicated when the data set has a continuous statistical surface (the phenomenon can be measured, and data collected, anywhere on the map, even for an entire country) and the data is standardized to show percentages or ratios and illustrate relative differences. This type of representation is becoming more and more frequent in discussions at a global data scale as society is transformed by interconnected technologies, devices, and machines.
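    Independent of the plotting library, the data-preparation step described above (joining usage percentages with country coordinates by country code) can be sketched in plain Python. This is illustrative only; the country codes, sample values, and column names are hypothetical:

```python
# Illustrative sketch of the join described above: combining internet-usage
# percentages with country coordinates by country code before plotting.
# Sample values and column names are made up for illustration.

usage_2016 = {"USA": 76.2, "DEU": 84.4, "ROU": 59.5}   # percent using internet
coords = {
    "USA": (39.8, -98.6),   # (lat_avg, long_avg)
    "DEU": (51.2, 10.4),
    "ROU": (45.9, 24.9),
}

# Inner join on country code, keeping only countries present in both tables
rows = [
    {"code": c, "percent": p, "lat_avg": coords[c][0], "long_avg": coords[c][1]}
    for c, p in usage_2016.items()
    if c in coords
]

# The joined rows are what a plotting call (bubble or choropleth) consumes
for r in sorted(rows, key=lambda r: r["percent"], reverse=True):
    print(r["code"], r["percent"])
```
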

    4 Statistical measurements in CPS data

    Analyzing your data starts with a profile: briefly summarizing, or even charting, the data to better see patterns and outliers. The mean and standard deviation fit well for continuous, normally distributed data, while the median (middle value) and interquartile range (IQR) are more suitable for skewed data sets. A big difference between the mean and the median is an indicator of skewed data. Using RStudio, these statistics can be obtained in a few ways: the summary command gives a fuller picture, and the functions mean, median, and sd (standard deviation) can be used individually. We explore them in the following exercise. First, load a data set on which to apply these functions and inspect the results. Download a free data set to your desktop from Arel-Bundock (2018). For this example, the amis data set is used: a data set of 8437 rows and 4 columns called Car Speeding and Warning Signs. The data frame contains data from a study conducted by the Cambridgeshire County Council on locations chosen to account for factors such as traffic volume and type of road. The study observes the effect warning signs have on speeding patterns. Speed measurements were taken before the erection of a sign, shortly after sign placement, and a third time after the sign had been in place for a while. These measurements correspond to the column called period, which takes the values 1, 2, and 3, respectively, for each of the speed measurements. The other columns are: speed of cars (miles/h), warning (1 or 2, indicating whether a warning sign was present), and pair (from 1 to 14, corresponding to the 14 locations). To load the data, follow the next step; then use the commands to find the summary shown in Fig. 2.

    Fig. 2 Summary results in R studio for car speeding and warning signs data set.

    The median is commonly used to measure the properties of a data set and can be more informative than the mean (average), giving a better idea of a typical value. The mean can be skewed by a small number of extremely high values, whereas the median better suggests what a typical value is. It is therefore often better to use the median as a measure of central tendency, since it is not much affected by extreme values. The mean value is expressed as

       x̄ = (1/N) ∑_{i=1}^{N} x_i    (1)

    Standard deviation is a measure quantifying the amount of variation in a data set. A low value indicates that the data points tend to be close to the mean, and it is used to measure confidence in statistical conclusions.

       s = √( (1/(N − 1)) ∑_{i=1}^{N} (x_i − x̄)² )    (2)

    with N being the number of observations and x̄ their mean. This is called the sample standard deviation, and it is useful for showing how the results obtained in a study generalize, in contrast with the population standard deviation, which, as its name suggests, is applied to a whole population.
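    These three statistics can be computed directly with Python's standard library (an illustrative sketch, not from the chapter; the speed values are made up):

```python
# Illustrative sketch (not from the chapter): mean, median, and sample
# standard deviation, showing how one extreme value skews the mean while
# barely moving the median. Speeds are made-up values in miles/h.
import statistics

speeds = [28, 30, 31, 29, 30, 32, 30, 95]  # one outlier at 95 mph

mean = statistics.mean(speeds)
median = statistics.median(speeds)
s = statistics.stdev(speeds)  # sample standard deviation, divides by N - 1

print(f"mean={mean:.3f} median={median:.1f} sample sd={s:.2f}")
# The outlier pulls the mean well above the median, flagging skewed data.
```
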

    5 Statistical methods, models, and techniques: Brief introduction

    Regardless of where the border between statistics and ML lies, very powerful methods in the field are: linear regression for predicting a target variable, classification techniques for assigning categories to collections of data, subset selection for identifying a small number of features that explain the target response, and dimension reduction for data sets with more than about 10 dimensions, usually applied a priori, before the KNN algorithm. A classical technique for dimension reduction is principal component analysis (PCA), often applied to sparse data (Franke et al., 2016).
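    As a minimal illustration of PCA (not from the chapter; the toy two-feature data set is our own construction), the principal components can be obtained from the eigendecomposition of the data's covariance matrix:

```python
# Illustrative PCA sketch (not from the chapter): the principal directions
# are the eigenvectors of the covariance matrix. The toy data set is made
# up; real use would start from measured CPS features.
import numpy as np

rng = np.random.default_rng(0)
# 200 samples of 2 strongly correlated features (e.g., two related sensors)
x = rng.normal(size=200)
data = np.column_stack([x, 2.0 * x + 0.1 * rng.normal(size=200)])

centered = data - data.mean(axis=0)             # PCA works on centered data
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)          # eigh: ascending eigenvalues
order = np.argsort(eigvals)[::-1]               # reorder descending
explained = eigvals[order] / eigvals.sum()      # variance explained per component

print("variance explained per component:", np.round(explained, 3))
# With this construction, the first component captures nearly all variance.
```
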

    Regression is a statistical method that describes a statistical relationship, modeled by a mathematical function, between a predictor variable and a response variable. Statistical modeling is the second step in a data analysis cycle, the first being exploratory analysis. Models for this stage of the analysis life cycle are built with supervised or unsupervised techniques, depending on the situation. The output of this stage is reporting and visualization; the goal is to transfer information to decision makers. A second option in this phase is a backward path, identifying new data to be fed into the system to complement the existing data. A more complete presentation of learning algorithms and stochastic techniques, in terms of accuracy, speed of learning, and risks of over-fitting, is given in Singh et al. (2016). Methods can be understood as systems of concepts and procedures that together realize certain insights, while techniques are the practical approaches used to implement these methods.

    Methods, techniques, and models are used to develop and implement solutions in a large spectrum of domains, combining statistical knowledge with learning algorithms. Moreover, the most complicated situations challenge experts to apply this set of tools to solve critical situations in real-time systems, CPS, and the systems derived from them, as illustrated in Wang et al. (2017) and Yavanoglu and Aydos (2017).

    6 Analytics and statistics versus ML techniques

    Deciding on appropriate statistical methods for research implies first defining the types of measurements (variables) and then the relationships between them (dependent vs independent variables). Analysis of scale or binary independent variables can be done with regressions. Before going into more detail on regressions, we present in Table 2 a list of statistical methods, models, and techniques, and a list of ML techniques suitable for exploitation in Big Data analysis from CPS.

    Table 2

    ML focuses intensively on prediction, learning (supervised and unsupervised), and computational methodology, while statistical analysis explores design, sampling, estimation, regression, and classification more deeply than ML does. The process of using many modeling methods from statistics and ML to best predict the probability of an outcome (failure of a robot, maintenance of a train/machine/car, etc.), with phases of implementing, training, testing, and validating a model, is called a predictive modeling process.
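    A minimal sketch of that predictive modeling loop follows (illustrative only; the made-up failure data and the trivial majority-class "model" stand in for a real learner):

```python
# Illustrative sketch (not from the chapter): the train/test phases of a
# predictive modeling process, with made-up failure labels and a trivial
# majority-class predictor standing in for a real ML model.
import random

random.seed(42)
# Toy labeled data: 1 = machine failure, 0 = normal operation
data = [(i, 1 if i % 5 == 0 else 0) for i in range(100)]

random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]  # 80/20 holdout split

# "Train": predict the majority class observed in the training data
majority = round(sum(label for _, label in train) / len(train))

# "Test": measure accuracy on held-out data the model never saw
accuracy = sum(1 for _, label in test if label == majority) / len(test)
print(f"majority class={majority}, holdout accuracy={accuracy:.2f}")
```
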

    Good practices of statistical modeling and thinking are the fundamentals of Big Data projects; high-quality data, correct relationship modeling, and the right algorithms and strategies are the keys to success in the petabyte age. Many relationships in the world are, or tend to be, linear; so linear regression is a very powerful tool for building exploratory models and predicting relationships between variables. A proper example is presented in Flynn et al. (2009), where the authors use regression as a tool to perform a comparative analysis exploring the performance capability of measurement systems. The participants in a linear regression are called the predictor variable and the response variable, x and y, where x is an independent variable and y is a dependent one. The independent variable is the one manipulated during the experiment in order to observe the behavior of the outcome, y. They are also called the exploratory variable, independent variable, regressor/risk factor, or feature/attribute, and, respectively, the dependent variable or regressand. In computer science, x is very often referred to as a feature or
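    The simple linear regression just described can be sketched in a few lines of Python (an illustrative sketch using the standard closed-form least-squares estimates; the data points are made up):

```python
# Illustrative sketch (not from the chapter): fitting the simple linear
# regression y = a + b*x by ordinary least squares, using the closed-form
# estimates b = cov(x, y)/var(x) and a = mean(y) - b*mean(x).
def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Slope: sample covariance over sample variance of the predictor
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx  # intercept, so the line passes through (mx, my)
    return a, b

# Noise-free example: points on y = 1 + 2x are recovered exactly
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # → 1.0 2.0
```
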
