Hands-on Data Analysis and Visualization with Pandas: Engineer, Analyse and Visualize Data, Using Powerful Python Libraries

About this ebook

The book starts with quick introductions to Python and its ecosystem of data science libraries, such as JupyterLab, NumPy, Pandas, SciPy, Matplotlib, and Seaborn.

It will help you learn Python data structures and the essential concepts required for data engineering, such as functions, lambdas, list comprehensions, and datetime objects. It also builds an in-depth understanding of the Python data science packages: JupyterLab is used as an IDE for writing, documenting, and executing Python code; NumPy for numerical computation; and Pandas for cleaning and reorganizing data, handling large datasets, and merging DataFrames to get meaningful insights. You will then go through statistics to understand the relationships between variables using SciPy, and build visualization charts using the Matplotlib and Seaborn libraries.
Language: English
Release date: Aug 13, 2020
ISBN: 9789389845655

    Book preview

    Hands-on Data Analysis and Visualization with Pandas - Purna Chander Rao. Kathula

    CHAPTER 1

    Introduction to Data Analysis

    Data analysis is as much an art as it is a science of extracting insights from silos of data. This chapter introduces you to data and its ecosystem components, the different stages of the data analysis process, why Python is useful for data analysis, and the different data science libraries/modules along with their installation process.

    Structure

    Inspiration for data analysis

    What is data science?

    Domain expertise

    Maths and statistics

    Artificial intelligence

    Machine learning

    Data infrastructure

    Data analysis process

    Business requirements

    Data collection

    Data cleansing

    Data exploration and visualization

    Data modeling

    Model validation and testing

    Deployment

    Why Python for data analysis?

    Python libraries for data analysis

    Objective

    This chapter guides you through the different processes of data analysis and the concepts, such as maths and statistics, that make up this discipline. The concepts covered here are a heads-up for the coming chapters, where they will be applied in the form of Python code with different data-related libraries.

    Inspiration for data analysis

    In this chapter, we will cover various factors and trends that influence data analysis. In the current world of digitalization, a huge amount of data is produced by IoT devices such as sensors, by diagnosis reports from the healthcare and wellness industry, by social network portals such as Facebook, YouTube, LinkedIn, and Instagram, and by e-commerce sites like Alibaba, Amazon, or Flipkart. Every time you upload an audio or video, post a comment, add a like or an emoji, make a bank transaction online, withdraw money from an ATM kiosk, or buy something on an e-commerce site, you generate more of it.

    This raw data is not, by itself, useful information. Information is the result of processing: taking a certain set of data and extracting conclusions from it that can be used in different ways. This process of extracting information from raw data is data analysis, and it becomes the foundation for building predictive models and drawing data visualization charts around the data.

    Without Big Data and analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway.

    - Geoffrey Moore, author and consultant

    What is data science?

    Data science is the study of data. It is a multidisciplinary field that involves maths, statistics, algorithms, domain expertise, processes, and systems to extract insights from data. This data might be structured, semi-structured, or unstructured. Figure 1.1 shows the different structures of data:

    Figure 1.1: Different structures of data

    Structured data

    Tabular rows and columns (Databases)

    DWH (Teradata systems) and BI systems

    Text files such as comma-separated (.csv), tab-separated (.tsv).

    Semi-structured data

    Excel, XML, JSON, Logs.

    Unstructured data

    Audio, Video, Images.
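
    As a minimal illustration of the first two categories, the following sketch reads a structured CSV file and a semi-structured JSON file into Pandas DataFrames. The file names here are hypothetical placeholders, not files supplied with the book:

    import pandas as pd

    # Structured data: tabular rows and columns in delimited text files.
    sales = pd.read_csv("sales.csv")               # comma-separated
    visits = pd.read_csv("visits.tsv", sep="\t")   # tab-separated variant

    # Semi-structured data: nested records stored as JSON.
    orders = pd.read_json("orders.json")

    print(sales.head())    # first five rows of the structured table
    print(orders.dtypes)   # column types inferred from the JSON data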

    Domain expertise

    Domain expertise or domain knowledge is expertise in a particular field such as healthcare, insurance, or banking. A domain expert may or may not have a technology background, but has in-depth knowledge of a particular industry, its trends, and the practices that impact it. The process of data analysis requires not only good expertise in tools and computational techniques but also a good understanding of the data. In short, the data analyst must know how to search not only for data but also for information, and how to treat that information to get valid insights from it.

    For example, suppose you are asked to build an application for the e-commerce, banking, or insurance domain. The application has to complement the industry and its various dimensions. The technical team would not know the industry norms or the required application features; this is where the domain expert and domain knowledge come into the picture.

    Maths and statistics

    This is the study of statistics from a mathematical point of view. Data analysis requires a good amount of maths, and good knowledge of statistics is also required because statistical methods are applied to the analysis and interpretation of the data. Python provides a good number of libraries to solve these mathematical and statistical problems, but you should still have a good idea of how those libraries work.
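
    For instance, a quick statistical summary and a correlation test can be computed with NumPy and SciPy. The numbers below are made-up sample values, used only to show the calls:

    import numpy as np
    from scipy import stats

    # Hypothetical paired measurements, e.g. advertising spend vs. sales.
    spend = np.array([10, 12, 15, 18, 22, 25, 30])
    sales = np.array([95, 101, 110, 118, 130, 136, 150])

    print("mean spend:", np.mean(spend))
    print("std of sales:", np.std(sales, ddof=1))

    # Pearson correlation coefficient and its two-sided p-value.
    r, p_value = stats.pearsonr(spend, sales)
    print(f"correlation r={r:.3f}, p={p_value:.4f}")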

    Artificial intelligence

    Artificial intelligence is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans. Artificial intelligence is a superset of data science and one of the advanced areas built on data analysis. It is the study of training computers to do jobs that are currently done by humans. The term itself is made up of two words: artificial, meaning something that is not natural or is human-made, and intelligence, meaning the ability to think or understand.

    AI is already widespread in the market, and you interact with it on a daily basis. Here are a few examples of artificial intelligence:

    Search engines like Google internally use gigantic algorithms to deliver better search results.

    Self-driving cars where the vehicles can completely navigate their way from one point to another.

    Chatbots help as online messengers to assist customers immediately and effectively.

    Voice searches on smartphones use AI to determine the best result for those long-tail keywords and conversational queries.

    Online Ads use AI to target specific customers based on past behavior, interest, and search queries.

    Machine learning

    Machine learning is an algorithm-driven discipline that makes computers capable of learning from their own previous experience and improving their performance at a task. It is a subset of artificial intelligence: a study of machines that learn by themselves without being explicitly programmed. Suppose you are asked to write a program for speech recognition software that converts speech to text while handling accent, grammar, pronunciation, and vocabulary. Writing such rules by hand would be a gigantic task, but it can be handled far more easily with machine learning.

    Technically machine learning is divided into three parts, explained as follows:

    Supervised learning

    In this type of learning, we ask the machine questions, compare its answers with the actual answers, and instruct the machine to minimize the errors. Supervised machine learning can do things such as the following (a toy sketch follows this list):

    Weather forecasting.

    Detecting online frauds.

    Market forecasting.

    Image classification.
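
    As a minimal sketch of the supervised idea (labelled examples in, a model that minimizes its errors out), here is a toy classifier built with scikit-learn. The features and labels are entirely made up for illustration:

    from sklearn.tree import DecisionTreeClassifier

    # Made-up training data: [amount, hour_of_day] for card transactions,
    # labelled 1 for fraudulent and 0 for legitimate.
    X_train = [[20, 14], [35, 10], [900, 3], [15, 16], [1200, 2], [40, 12]]
    y_train = [0, 0, 1, 0, 1, 0]

    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train, y_train)       # learn from question/answer pairs

    # Predict labels for unseen transactions.
    print(model.predict([[25, 13], [1000, 4]]))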

    Unsupervised learning

    In this type of learning, you give the machine huge chunks of data and instruct it to find some sort of patterns; based on these patterns, the machine accomplishes certain tasks. Unsupervised machine learning can do things such as the following (a small sketch follows this list):

    Build recommendation engines

    Targeted marketing

    Customer segmentation
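
    A minimal sketch of the unsupervised idea, using scikit-learn's KMeans to segment made-up customers by annual spend and visit frequency (no labels are supplied; the algorithm finds the groups itself):

    from sklearn.cluster import KMeans

    # Made-up customer features: [annual_spend, visits_per_month].
    customers = [[200, 1], [220, 2], [5000, 15], [4800, 14],
                 [1500, 6], [1600, 7]]

    # Ask for three segments; KMeans discovers them without labelled examples.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = kmeans.fit_predict(customers)

    print(labels)                    # cluster index assigned to each customer
    print(kmeans.cluster_centers_)   # centre of each discovered segment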

    Reinforcement learning

    In this type of learning, the machine is left in an environment where something is happening; there is a reward if the machine does what we want, and a penalty if it performs incorrectly. We instruct the machine to maximize its reward, and eventually it learns to do the things we want it to do (a toy sketch follows this list). Reinforcement learning works on:

    Games

    Bidding and advertising

    Training self-driven cars
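
    As a highly simplified sketch of the reward/penalty loop, here is tabular Q-learning on a toy five-cell corridor where the agent earns a reward only for reaching the rightmost cell. Everything here (the environment, rewards, and parameters) is illustrative, not taken from the book:

    import random

    n_states, n_actions = 5, 2          # cells 0..4; actions: 0 = left, 1 = right
    q_table = [[0.0, 0.0] for _ in range(n_states)]
    alpha, gamma, epsilon = 0.5, 0.9, 0.2

    for episode in range(200):
        state = 0
        while state != n_states - 1:    # run until the goal cell is reached
            # Explore occasionally, otherwise act greedily on current estimates.
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = 0 if q_table[state][0] >= q_table[state][1] else 1
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == n_states - 1 else -0.01  # small step penalty
            # Q-learning update: nudge the estimate toward reward + discounted future value.
            q_table[state][action] += alpha * (
                reward + gamma * max(q_table[next_state]) - q_table[state][action]
            )
            state = next_state

    print(q_table)   # after training, "right" should score higher in every cell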

    Data infrastructure

    Generally, people refer to infrastructure as the things that support what they are doing. For example, the roads used for transportation, sewage systems, and bridges are all considered infrastructure. The role of data infrastructure is to protect, preserve, process, move, secure, and serve data and its applications for information service delivery. Data infrastructure includes software, hardware, cloud or managed services, servers, storage, and so on.

    Thanks to the Big Data world, a humongous amount of information is generated that needs to be processed. Sometimes normal desktop systems or servers do not have enough computational power to read, process, or analyze it, and we need systems with large amounts of RAM or a good amount of disk space to store the data. Cloud platforms such as Amazon Web Services (AWS), GCP, and Azure help us meet these challenges through resource allocation and virtualization.

    Data analysis process

    Data analysis is a series of steps in which raw data is transformed and processed in order to produce insights about the data and to make predictions. The processing includes mathematical and statistical approaches as well as charts or graphs for data visualization. Data analysis can therefore be schematized as a process chain consisting of the following sequence of stages, as shown in Figure 1.2:

    Figure 1.2: Stages of the data analysis process

    Let's discuss these processes in detail.

    Business requirements

    Data analysis starts with a problem to be solved, which needs to be clearly defined: predicting the stock price of a company, identifying fraudulent credit card transactions, detecting tumors based on health data, and so on.

    Data collection

    The data must be chosen with the basic purpose of building a predictive model in mind, and collecting it is often the most tedious task: to analyze anything, we first need data. Mostly, data will be shared by clients in the form of comma-separated, tab-delimited, or pipe-delimited files. Not all data is available in files or databases; it may exist as HTML pages, and the process of collecting such data is called web scraping. Python libraries such as Scrapy, Beautiful Soup, and Requests help in scraping data from web pages.
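
    A minimal scraping sketch with Requests and Beautiful Soup; the URL and the "price" class name are hypothetical placeholders, not a real endpoint from the book:

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical page listing product prices.
    response = requests.get("https://example.com/products")
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # Pull the text of every element marked with the (hypothetical) "price" class.
    prices = [tag.get_text(strip=True) for tag in soup.find_all(class_="price")]
    print(prices)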

    Data cleansing

    This stage seems less problematic than others but requires more resources and time to complete. The data collected may come from different sources such as Excel, CSV, JSON, or Parquet files, or be scraped from a web page, and each source will represent data differently: a date field might arrive as a string, or an integer might be read as a float. All of this data needs to be cleaned before analysis. Cleansing deals with invalid data, ambiguous or missing values, and outliers in the data.
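
    A short cleansing sketch in Pandas, assuming a hypothetical extract with a date column read in as strings and some missing or mistyped numeric values:

    import pandas as pd

    # Hypothetical raw extract with mixed-quality fields.
    df = pd.read_csv("raw_orders.csv")

    # Parse a date column that arrived as plain strings; bad values become NaT.
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

    # Force a numeric column that may have been read as text or float.
    df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")

    # Drop exact duplicates and fill missing quantities with the column median.
    df = df.drop_duplicates()
    df["quantity"] = df["quantity"].fillna(df["quantity"].median())

    print(df.dtypes)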

    Data exploration and visualization

    Exploration is the process of using graphical and statistical representations to find patterns, connections, and relationships between variables in the data. Python libraries such as Matplotlib and Seaborn help us visualize the data, and chart types like heatmaps, box plots, violin plots, and scatter plots help us understand patterns, outliers, and relationships better (a short plotting sketch follows this list). Exploration also includes one or more of the following activities:

    Grouping the data

    Summarizing the data

    Construction of regression models to find the deviation of data
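
    A short plotting sketch using Seaborn's sample "tips" dataset (fetched on first use); any DataFrame with a numeric and a categorical column would work the same way:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Small sample dataset bundled with Seaborn's examples.
    tips = sns.load_dataset("tips")

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))

    # Box plot: distribution of the bill per day, useful for spotting outliers.
    sns.boxplot(data=tips, x="day", y="total_bill", ax=axes[0])

    # Scatter plot: relationship between bill size and tip.
    sns.scatterplot(data=tips, x="total_bill", y="tip", ax=axes[1])

    plt.tight_layout()
    plt.show()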

    Data modeling

    Data modeling is the process of choosing a suitable statistical model to predict a result. After data exploration, we need to develop a mathematical model that encodes the relationships in the data. These models are divided according to the result they produce:

    Classification: If the result obtained by the model is categorical.

    Regression: If the result obtained by the model is numerical.

    Clustering: It involves grouping data points to gain valuable insights.

    Python's scikit-learn library provides methods such as linear regression, logistic regression, classification trees, SVM, AdaBoost, and k-nearest neighbors to generate these models.
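
    For example, a linear regression model can be fitted in a few lines with scikit-learn; the arrays here are made-up numbers used only to illustrate the calls:

    from sklearn.linear_model import LinearRegression

    # Made-up data: years of experience vs. salary in thousands.
    X = [[1], [2], [3], [4], [5], [6]]
    y = [35, 42, 50, 58, 64, 71]

    model = LinearRegression()
    model.fit(X, y)

    print("slope:", model.coef_[0])
    print("intercept:", model.intercept_)
    print("prediction for 8 years:", model.predict([[8]])[0])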

    Model validation and testing

    Validation of the model is divided into a training phase and a testing phase. The data is randomly split, typically 70 percent for training and 30 percent for testing. The model is trained on the 70 percent of the data, and its predictions are then compared against the remaining 30 percent of test data. There are several techniques to validate the effectiveness of a model; the most popular is k-fold cross-validation.
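
    A minimal sketch of the 70/30 split and a k-fold check with scikit-learn, reusing an extended version of the made-up regression data from the previous sketch:

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score, train_test_split

    X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
    y = [35, 42, 50, 58, 64, 71, 79, 85, 93, 100]

    # Random 70/30 split into a training set and a held-out test set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    model = LinearRegression().fit(X_train, y_train)
    print("R^2 on the 30% test data:", model.score(X_test, y_test))

    # 5-fold cross-validation: train and validate on five different splits.
    scores = cross_val_score(LinearRegression(), X, y, cv=5)
    print("k-fold scores:", scores)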
