Data Science Fundamentals and Practical Approaches: Understand Why Data Science Is the Next
About this ebook

This book introduces the fundamental concepts of Data Science, which has proved to be a major game-changer in solving business problems.
Topics covered in the book include fundamentals of Data Science, data preprocessing, data plotting and visualization, statistical data analysis, machine learning for data analysis, time-series analysis, deep learning for Data Science, social media analytics, business analytics, and Big Data analytics. The content of the book describes the fundamentals of each of the Data Science related topics together with illustrative examples as to how various data analysis techniques can be implemented using different tools and libraries of Python programming language.
Each chapter contains numerous examples and illustrative output to explain the important basic concepts. Questions are presented at the end of each chapter for self-assessing the conceptual understanding, and the references at the end of every chapter will help readers explore each topic further.
Language: English
Release date: Jun 2, 2020
ISBN: 9789389845679

    Book preview

    Data Science Fundamentals and Practical Approaches - Rupam Kumar Sharma

    CHAPTER 1

    Fundamentals of Data Science

    The goal is to turn data into information, and information into insight

    — Carly Fiorina

    Data, in today's technology-driven world, is vital in decision making. The rate at which data is generated every day is tremendous. Every company uses data to understand its customers better. Data science and data analytics can extract meaningful insights that help companies identify possible areas of growth, streamline costs, spot better product opportunities, and make effective decisions. Data analysis can make an impact in every sector, be it healthcare, medicine, the stock market, academic institutes, and so on. Undoubtedly, data will keep growing for the next few decades, and IT jobs are steadily expanding to deal with the bulk of Big Data whose analysis has become the need of the hour.

    This chapter elaborately discusses data science, one of the most in-demand careers of the 21st century. The world of data science may comprise simple tasks such as estimating the sales of products in the coming year and viewing the trend of products in the market, or complex tasks such as predicting disease based on a complex neural network model or classifying and recommending products based on fuzzy logic theory. Josh Wills, Director of Data Engineering at Slack, defined a data scientist as a "person who is better at statistics than any software engineer and better at software engineering than any statistician." Thus, a data scientist plays a pivotal role in data analysis, a currently much-in-demand area of study being explored at an exponential rate to gain hidden insights for better decision making.

    Structure

    The next few sections in this chapter will discuss the following topics:

    Introduction to data science

    Why learn data science?

    Data analytics lifecycle

    Types of data analysis

    Types of jobs in data analytics

    Data science tools

    Fundamental areas of study in data science

    Role of SQL in data science

    Pros and cons of data science

    Conclusion

    References

    Points to remember

    Exercises

    Objectives

    After studying this chapter, you should be able to:

    Understand the concept and need for data science.

    Discuss the various phases in the data analytics lifecycle.

    Learn the various types of data analytics and the important tools applied in data science.

    Analyze the fundamental areas of study in data science.

    1.1. Introduction to data science

    Data science is the task of scrutinizing and processing raw data to reach meaningful conclusions. Data is mined and classified to detect and study behavioral patterns, and the techniques used for this may vary according to the requirements. All data that is available for analysis can be classified into four types: nominal data, ordinal data, interval data, and ratio data. A common and useful acronym for these four types is NOIR (Nominal, Ordinal, Interval, Ratio), which happens to mean 'black' in French. A detailed description of each of these types of data is provided in Chapter 2: Data Preprocessing.
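
    To make the four levels concrete, the short sketch below builds a toy pandas DataFrame (all column names and values are hypothetical) with one column per NOIR level; pandas' ordered categoricals capture the nominal/ordinal distinction directly.

```python
# A minimal sketch (hypothetical toy data) of the four NOIR measurement levels.
import pandas as pd

df = pd.DataFrame({
    # Nominal: categories with no inherent order (labels only).
    "blood_group": pd.Categorical(["A", "B", "O", "AB"]),
    # Ordinal: categories with a meaningful order but unequal spacing.
    "satisfaction": pd.Categorical(
        ["low", "high", "medium", "high"],
        categories=["low", "medium", "high"], ordered=True),
    # Interval: equal spacing but no true zero (0 C does not mean "no heat").
    "temperature_c": [20.5, 31.0, 25.2, 28.9],
    # Ratio: equal spacing and a true zero, so ratios are meaningful.
    "monthly_income": [42000, 0, 58000, 61000],
})

print(df.dtypes)
print(df["satisfaction"].min())  # ordering makes min/max meaningful
```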

    For data collection, there are two major sources of data: primary and secondary. Primary data is data that has never been collected before; it can be gathered in a variety of ways such as participatory or non-participatory observation, conducting interviews, and collecting data through questionnaires or schedules. Secondary data, on the other hand, is data that has already been gathered and can be easily accessed and used by other users. Secondary data can come from existing case studies, government reports, newspapers, journals, books, and also from many popular dedicated websites that provide datasets. A few standard websites for downloading datasets include the UCI Machine Learning Repository, the Kaggle datasets, the IMDB datasets, and the Stanford Large Network Dataset Collection. Though there are clear benefits to using readily available secondary data, its authenticity and validity must be verified.
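
    As a small illustration of working with secondary data, the sketch below loads the classic Iris dataset, which originated in the UCI Machine Learning Repository and ships with scikit-learn, so no download is needed; a CSV obtained from Kaggle or UCI would be read with pandas in the same way.

```python
# A minimal sketch of acquiring secondary data: the Iris dataset (originally
# hosted on the UCI Machine Learning Repository) ships with scikit-learn.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)      # returns the data as pandas objects
df = iris.frame                      # features plus the 'target' column
print(df.shape)                      # (150, 5)
print(df.head())

# A CSV downloaded from Kaggle or UCI would be loaded the same way, e.g.:
# df = pd.read_csv("path/to/downloaded_dataset.csv")
```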

    It is said that we are all data analysts to varying degrees in our everyday lives. We analyze the need for and working principle of an electronic gadget before purchasing it, or we predict the demand for a particular course over the next few years in terms of job prospects before enrolling our children in it. One does not need to be an exceptional expert in analytics to do analysis. Over the years, however, the need for complex data analysis has been felt immensely in major business sectors and companies seeking to discover historical patterns that improve future business performance.

    1.2. Why learn data science?

    There has been a revolutionary change in the behavioral pattern of customers in case of online purchases, stock market investment, advertising products to other customers, and so on. Each of these activities requires an in-depth analysis of existing relevant data which makes data science a promising field of study in today’s fast-growing data-driven world.

    A few of the industry verticals where data science has found prominence and is used for operational and strategic decision making are discussed below:

    Ecommerce: Ecommerce sites rely heavily on data science to maximize revenue and profitability. These sites analyze the shopping and purchasing behavior of customers and accordingly recommend products for further online purchases.

    Finance: The finance market is an emerging field in the data industry. Financial analytics takes care of risk analysis, fraud detection, shareholders' upcoming share status, working capital management, and so on.

    Retail: Retail industries take a 360-degree view of customers and their feedback reviews. Retail analytics examines customers' purchasing trends and demands in order to offer products matched to customers' liking. Retail industries use data science for optimal pricing, personalized offers, better marketing strategies, market basket analysis, stock management, and so on.

    Healthcare: The healthcare sector nowadays relies heavily on analytics of patient data to predict diseases and health issues. Healthcare industries analyze data for patient quality care, improved patient care, classification of patients' symptoms, predicted health deficiencies, and so on.

    Education: The sources of data in education are vast, ranging from student-centric data and enrollment in various courses to scholarship and fee details, examination results, and so on. Education analytics plays a major role in academic institutions in improving admissions, empowering students toward successful examination results, and supporting all-round student performance.

    Human Resource (HR): HR analytics involves HR-related data that can be used for building strong leadership, employee acquisition, employee retention, workforce optimization, and performance management.

    Sports: Nowadays, sports analytics is often used in international tournaments to analyze the performance of players, predict scores, prevent injuries, and estimate the possibility of a particular team winning or losing a match.

    The use of data science is nowadays found in every prominent domain, a few of which have been addressed above. A few other sectors that deserve mention are telecom, sales, supply chain management, risk monitoring, manufacturing, and IT. In today's competitive environment, businesses no longer consider data science an optional requirement; they hire data analysts and data scientists to mine massive hidden data, provide meaningful results, and generate reports that drive profit-making decisions. Recent trends in the job market show that data analysts, data scientists, and data engineers are in huge demand in IT companies, and this demand will continue over the next decade. Hence, a career as a data analyst, data scientist, or data engineer can uplift your job profile, and the demand will be witnessed in many companies in the years to come.

    1.3. Data analytics lifecycle

    While the terms data science and data analytics are often used interchangeably, the two terms differ considerably in scope. Data science is an umbrella term comprising a large variety of fields, whereas data analytics is more focused and can be considered a subset of data science. Hence, to understand data science thoroughly, let us first try to understand the various phases in the data analytics lifecycle.

    Data analytics mainly involves six important phases carried out in a cycle: data discovery, data preparation, model planning, model building, communication of results, and operationalization. Figure 1.1 illustrates the six phases of the data analytics lifecycle, which are followed one after another to complete one cycle. It is interesting to note that the six phases are iterative and allow both forward and backward movement between phases. The lifecycle provides a framework for performing each phase well, from the creation of a project until its completion, and was shaped by the care and experimentation of many practicing data scientists. The key stakeholders in data science projects are business analysts, data engineers, database administrators, project managers, executive project sponsors, and data scientists.

    Figure 1.1: The Data Analytics Life Cycle

    Let us now briefly discuss all the six phases of the data analytics lifecycle followed in any data science projects:

    1.3.1. Data discovery

    In this first phase of data analytics, the stakeholders regularly perform the following tasks: examining business trends, making case studies of similar data analytics projects, and studying the domain of the business industry. The entire team assesses the in-house resources, the in-house infrastructure, the total time involved, and the technology requirements. Once these assessments and evaluations are completed, the stakeholders formulate an initial hypothesis for resolving the business challenges in terms of the current market scenario.

    1.3.2. Data preparation

    In the second phase, following data discovery, data is prepared by transforming it from a legacy system into a data analytics form using a sandbox platform. A sandbox is a scalable platform commonly used by data scientists for data preprocessing; it typically provides substantial CPU power, high-capacity storage, and high I/O capacity. The IBM Netezza 1000 is one such data sandbox platform, used by IBM for handling data marts. The stakeholders involved in this phase mostly preprocess data to obtain preliminary results using a standard sandbox platform.

    1.3.3. Model planning

    The third phase of the lifecycle is model planning, where the data analytics team plans the methods to be adopted and the workflows to be followed during the subsequent model building phase. At this stage, work is divided among the team members to clearly define each person's workload. The data prepared in the previous phase is explored further to understand the various features and their relationships, and feature selection is performed for applying to the model.

    1.3.4. Model building

    The next phase of the lifecycle is model building, in which the team develops datasets for training, testing, and production purposes. The model is then executed based on the planning made in the previous phase. The kind of environment needed for executing the model is decided and prepared, so that if a more robust environment is required, it can be provisioned accordingly.

    1.3.5. Communicate results

    Phase five of the lifecycle examines the results of the project to determine whether it is a success or a failure. The results are scrutinized by the entire team along with its stakeholders to draw inferences on the key findings and summarize the work done. Business value is also quantified, and an elaborate narrative on the key findings is prepared and discussed among the various stakeholders.

    1.3.6. Operationalization

    In phase six, a final report is prepared by the team along with the briefings, source codes, and related documents. The last phase also involves running a pilot project to implement the model and test it in a real-time environment. As data analytics helps build models that lead to better decision making, it in turn adds value to individuals, customers, business sectors, and other organizations. While proceeding through these six phases, the various stakeholders that can be involved in planning, implementation, and decision-making are data analysts, business intelligence analysts, database administrators, data engineers, executive project sponsors, project managers, and data scientists. All these stakeholders are rigorously involved in the proper planning and completion of the project, keeping in mind the various crucial factors for its success.

    1.4. Types of data analysis

    There are many different ways to analyze data. Some forms are more complex than others, and on this basis data analysis has been broadly divided into four types, namely descriptive analysis, diagnostic analysis, predictive analysis, and prescriptive analysis. Figure 1.2 shows the level of complexity of each of these four types of data analysis.

    Figure 1.2: Four types of data analysis based on the level of complexity

    Let us briefly discuss each of the four types of data analysis and find how each of these types differs from one another:

    1.4.1. Descriptive analysis

    Descriptive analysis is the simplest and the most common type of data analysis used by companies and other sectors. This type of data analysis is mostly used in businesses to generate monthly revenue reports, sales leads, and key performance indicators (KPI) dashboards. It describes the main aspects of the data being analyzed. The data dealt with are large in volume and often include the entire population. The results or reports generated are based on data that are already available.

    The main emphasis in descriptive analysis is on 'what has happened?', answered by analyzing valuable information found in available past data. For example, with descriptive analysis, a data analyst will be able to generate the statistical results of the performance of the cricket players of team India. For generating such results, the data may need to be integrated from multiple data sources to gain meaningful insights through statistical analysis.
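
    A minimal sketch of descriptive analysis is shown below, using a hypothetical table of players' runs; pandas' describe() and groupby() produce exactly the kind of summary reports described above.

```python
# A minimal sketch of descriptive analysis on hypothetical player statistics.
import pandas as pd

scores = pd.DataFrame({
    "player": ["A", "A", "B", "B", "C", "C"],
    "runs":   [45, 77, 12, 56, 98, 34],
})

print(scores["runs"].describe())                 # count, mean, std, quartiles
print(scores.groupby("player")["runs"].mean())   # average runs per player
```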

    1.4.2. Diagnostic analysis

    Diagnostic analysis differs from descriptive analysis by emphasizing not only 'what has happened?' but also 'why did it happen?' This type of data analysis tries to gain a deeper understanding of the reasons behind the patterns found in past data. Here, business intelligence comes into play by digging down to find the root cause of the pattern or nature of the data obtained. For example, with diagnostic analysis, a data analyst will be able to find why the performance of each player of the Indian cricket team has improved (or degraded) over the past six months.

    The diagnostic analysis deals with the critical task of finding the reason behind a particular change or phenomenon. This is a major task in the field of data analysis, as an analyst has to be critical and correct in identifying the cause of an occurrence in order to gain or profit in various fields. For this purpose, an analyst often uses machine learning techniques and business intelligence for a deeper understanding of the given problem.
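
    The sketch below illustrates one common diagnostic technique, a correlation drill-down, on hypothetical performance data; a strong correlation is only a candidate explanation, not proof of cause.

```python
# A minimal sketch of a diagnostic drill-down: checking which factors
# correlate with a drop in performance (all columns are hypothetical).
import pandas as pd

df = pd.DataFrame({
    "runs_scored":    [80, 65, 50, 42, 30, 25],
    "matches_played": [4, 6, 9, 11, 14, 16],
    "rest_days":      [10, 8, 6, 5, 3, 2],
})

# A strong negative correlation between workload and runs would be one
# candidate explanation for "why performance degraded".
print(df.corr())
```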

    1.4.3. Predictive analysis

    Predictive analysis, as the name suggests, deals with prediction of the future based on available current and past data. The main emphasis in predictive analysis is on 'what is likely to happen?', utilizing previous data to estimate the future outcome. For example, with predictive analysis, a data analyst will be able to predict the performance of each player of the Indian cricket team for the upcoming international cricket world cup. Such prediction can help the Board of Control for Cricket in India (BCCI) decide on players' selection for the upcoming international tournament.

    Predictive analysis is applied in many domains such as risk management, sales forecasting, weather forecasting, and prediction of team performance. Though descriptive and diagnostic analyses are more common in nature, data analysts are also hired in large numbers to predict future trends in businesses and other marketing sectors. In most cases, prediction is made by dividing the available dataset into a training set and a testing set; a machine learning algorithm is trained and then checked for its level of prediction accuracy. If the accuracy is found satisfactory, the algorithm is then used to predict future data. However, it is important to remember that the predicted solution provides an approximate forecast that may vary from the actual result, as a hundred percent accuracy is never guaranteed. This train/test workflow is sketched below.
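
    The following sketch uses scikit-learn with the Iris dataset and a logistic regression model as illustrative stand-ins; the split proportion and model choice are assumptions, not a prescription.

```python
# A minimal sketch of the train/test workflow described above.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")  # approximate, never guaranteed
```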

    1.4.4. Prescriptive analysis

    The final type of data analysis, which is the highest in terms of complexity, is called prescriptive analysis. In this type of data analysis, the insights gained from the other three types of analysis are combined to determine the kind of action to be taken to resolve a given situation. Prescriptive analysis prescribes the steps needed to avoid a future problem. It involves a high degree of responsibility, time, and complexity to reach informed decisions. Thus, prescriptive analysis makes recommendations based on the forecasting done in predictive analysis.

    To summarize the four main types of data analytics: descriptive analysis explains what has happened to date, diagnostic analysis emphasizes finding why it happened in a particular way, predictive analysis forecasts what might happen in the near future, and prescriptive analysis emphasizes recommending actions based on the forecast. All these types of analyses are usually carried out by a data analyst or data scientist to deal with the given data and produce a meaningful outcome based on the type of analysis required.

    1.5. Types of jobs in data analytics

    The various key stakeholders in any data analysis project include the data analyst, the data scientist, the data engineer, the database administrator, and the analytics manager. Each stakeholder has a clear role to play for a business problem right from understanding the essentials of the problem, proper planning, implementation of the project, analyzing the various outcomes of the project, solving the bottlenecks visible in the outcomes, and generating reports by drawing inferences about the success of the project. Figure 1.3 shows some of the key stakeholders involved in any data analytics-based project.

    Figure 1.3: Some of the key stakeholders in data analytics projects

    Though a big team may involve many other stakeholders such as analytics specialists, business intelligence consultants, chief creative officers, ETL developers, project sponsors, and many more, a few prominent roles play a pivotal part in bringing success to a project. The leader of any business project clearly defines the role of each stakeholder and the estimated timeline of each assigned task. Let us briefly discuss six such main stakeholders involved in business analytics, namely the data analyst, the data scientist, the data engineer, the database administrator, the data architect, and the analytics manager.

    1.5.1. Data analyst

    The main role of a data analyst is to extract data and interpret the information obtained from it to analyze the outcome of a given business problem. In this process, the analyst also discovers the various bottlenecks found in the results and provides possible solutions for the same. Extraction of information from existing data is done using one or more standard methodologies such as data cleaning, data transformation, data visualization, and data modeling. Using these methodologies, a data analyst is able to make careful data-driven decisions.

    The major skills required to be a data analyst are Python and/or R programming skills, Structured Query Language (SQL), Statistical Analysis Software (SAS), SAS Miner, Microsoft Excel and/or Tableau. The key areas and techniques which a data analyst should be well-versed with include the following:

    Data preprocessing, which is an important step in data analysis, involves data cleaning, data integration, data transformation, and data reduction. The task of data preprocessing is discussed elaborately in Chapter 2: Data Preprocessing.

    Data visualization, which is the graphical representation of data that can make information easy to analyze and understand. The task of data visualization is discussed elaborately in Chapter 3: Data Plotting and Visualization.

    Statistical modeling, which mainly involves two important kinds, descriptive or summary statistics and inferential statistics. The task of statistical data analysis is discussed elaborately in Chapter 4: Statistical Data Analysis.

    Programming skills, for which a data analyst may thoroughly practice and learn R and/or Python, the programming languages mainly used in data analysis.

    Communication and presentation skills, which are required for communicating with the team regarding the various reports and outcomes obtained after proper data analysis.

    To summarize, a few of the major tasks a data analyst is involved in are data acquisition, data management, data cleaning and filtering, data interpretation using statistical analysis, improving data quality and statistical efficiency, data visualization, and analytics reporting.

    1.5.2. Data scientist

    A data scientist possesses all the skills of a data analyst, with the additional skills of data wrangling, complex machine learning, Big Data tools, and software engineering. It is observed that both data analysts and data scientists use the same tools and practices; however, the scope and nature of the problems addressed by a data scientist differ from those of a data analyst. Data scientists mainly deal with large and complex data that can be of high dimension, and apply appropriate machine learning and visualization tools to convert the complex data into easily interpretable, meaningful information.

    Some of the fundamental prerequisites that a data scientist should be thorough with are as follows:

    Statistics: Statistics is the most essential prerequisite in the area of data science. Data science is largely about statistics, and to master data science, good knowledge of statistics is mandatory. The two kinds of statistics mostly used in data science are descriptive statistics and inferential statistics.

    Mathematics: To enhance one's skills in machine learning, a data scientist should have a profound knowledge of mathematics. The two most important mathematical topics used in data science are linear algebra and calculus. While linear algebra is the study of vectors and linear functions, calculus is the mathematical study of continuous change. Concepts of linear algebra such as tensors and vectors are used in many areas of machine learning; similarly, calculus is required in areas such as optimization techniques.

    Computer programming: A data scientist should love programming. Beyond basic computer application skills such as mastery of Microsoft Excel, a data scientist should be able to easily write code in Python or R for any given data science project. MS Excel can serve a beginner in data science as a basic tool, as it easily handles complex numerical calculations and allows plotting of data visualization graphs. Both Python and R are considered excellent programming tools for statistical analysis and machine learning.

    Database handling: A data scientist also often has to deal with data stored in databases. In the case of Relational Database Management Systems (RDBMS), a data scientist should be able to handle database queries using SQL commands. As data extraction is a primary task in data science, SQL is an important tool for accessing and manipulating data maintained in databases; a minimal sketch follows.
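
    The sketch below illustrates such database handling using Python's built-in sqlite3 module; the customers table and its columns are hypothetical.

```python
# A minimal sketch of SQL-based data extraction with Python's sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, city TEXT, spend REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Delhi", 120.0), (2, "Mumbai", 80.5), (3, "Delhi", 45.0)])

# A typical extraction query a data scientist might run before analysis.
for row in conn.execute(
        "SELECT city, COUNT(*), AVG(spend) FROM customers GROUP BY city"):
    print(row)
conn.close()
```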

    Data scientists can be engineers but are usually not involved in maintaining data architecture. The primary task of a data scientist is to use machine learning and deep learning-based techniques to make an in-depth analysis of input data. This is where a data analyst typically falls short, as an analyst may not possess much skill in machine learning or deep learning.

    1.5.3. Data engineer

    The job of a data engineer comes first; the data is then handed over to a data analyst or data scientist for analysis. Thus, the role of a data engineer is not to analyze data but rather to prepare, manage, and convert data into a form that can be readily used by a data analyst or data scientist. The advanced skills required by a data engineer are also quite different from those of the other two.

    With special training, a data engineer can design, build, integrate, and maintain data from multiple (homogeneous or heterogeneous) sources. A few of the prominent tasks a data engineer is involved in include the following:

    Developing and maintaining data architectures.

    Aligning data architectures with the business or project requirements.

    Improving data quality and raising data efficiency.

    Performing predictive and prescriptive modeling for given input data.

    Determining activities that can be automated.

    Engaging oneself with the other stakeholders to explain the details of the converted data so that it can be used by the data analyst or data scientist for further analysis.

    The major skills required to be a data engineer are Ruby, Java, C++, Python and/or R programming skills, Hive, NoSQL, MapReduce technologies, and MATLAB. Good knowledge of ETL tools and some popular APIs is an added benefit to the profile. Data engineers have a demanding role in data analytics, as they ensure that data is made available in a form that can be easily used for analysis and interpretation. If the raw data were not first handled by a data engineer, no machine learning or deep learning model would be able to handle the complex, bulky raw data initially received by the team for business analysis.

    1.5.4. Database administrator

    The Database Administrator (DBA), as the name suggests, operates and administers the database. The technical skills required by a DBA are SQL, scripting, database performance tuning, and system and network design. The backup and recovery of databases are also handled by a DBA. This job is critical, as a business functions properly only when the database is stored and managed well. A few of the prominent tasks a database administrator is involved in include the following:

    Database designing as per end-user requirements.

    Providing (or revoking) rights to (or from) database end-users.

    Enabling efficient data backup and data recovery mechanisms.

    Database related training to end-users.

    Ensuring data privacy and security.

    Managing data integrity for end-users.

    Monitoring the performances of the database.

    The proper functioning of databases is solely the responsibility of a DBA. If at any point the database fails, the DBA should be able to quickly and efficiently run data recovery mechanisms to restore its functioning. Thorough knowledge of SQL and related scripting languages equips a DBA to manage any database queries that the various end-users of a database need handled.

    1.5.5. Data architect

    The data architect provides the support of various tools and platforms required by data engineers to carry out various tests with precision. Data architects should be well equipped with knowledge of data modeling and data warehousing. The other additional skills required by a data architect are Extraction, Transformation, and Load (ETL), and knowledge of Hive, Pig, and Spark. A few of the prominent tasks a data architect is involved in include the following:

    Designing data models.

    Developing database solutions.

    Providing structural requirements for new software applications.

    Managing data migration and optimization of database systems.

    Providing Management Information System (MIS) support.

    Administering system performance by troubleshooting, testing, and assimilating new elements.

    The main task of data architects is to design and implement database systems, data models, and components of data architecture. Data architects also have wide knowledge of the various kinds of data available, both offline and in cloud environments, and possess the capability of managing data warehouses and ETL operations.

    1.5.6. Analytics manager

    The analytics manager is involved in the overall management of the various data analytics operations discussed in this section. For each of the stakeholder groups mentioned, the analytics manager deals with the team leader of each group and monitors and manages the work of each team. The major skills required to be an analytics manager are Python and/or R programming skills, Structured Query Language (SQL), and Statistical Analysis Software (SAS). An analytics manager should also have good leadership and social skills. A few of the prominent tasks an analytics manager is involved in include the following:

    Leading the data analysts’ team.

    Having a thorough understanding of the business requirements and objectives.

    Configuring and implementing data analytics solutions.

    Ensuring the quality results of the reports developed by every team.

    Keeping up to date with recent industry and business trends.

    The analytics manager should have the out-of-the-box thinking skills to lead and direct every team towards effective result generation. With leadership skills, the analytics manager skillfully manages the team and thoroughly studies the needs of a project to develop the best solutions for it.

    1.6. Data science tools

    There are many popular tools and techniques used by data scientists and data analysts. Most of these tools are user-friendly, freely available, and perform well in the field of data science. Let us discuss eight such tools that can be learned and adopted by any beginner or researcher who wants to explore the field of data science.

    1.6.1. Python programming

    Python is an open-source, object-oriented scripting language. It was created in the late 1980s by Guido van Rossum and is famous for implementing data preprocessing, statistical analysis, machine learning, and deep learning, which are the core tasks in any data science project. Python is versatile and can run on platforms such as UNIX, Windows, and Mac operating systems. It can also connect to database platforms such as SQL Server or a MongoDB database. This book contains illustrative Python code throughout the rest of the chapters for carrying out data analysis for various purposes. The rich set of libraries (more than 200,000) available for Python makes a Python programmer's life easy and interesting, and this is one of the core strengths of the language. Many types of visualization graphs can also be plotted using Python (explained in Chapter 3: Data Plotting and Visualization), which makes data interpretation easy for a data analyst.
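
    The short sketch below gives a flavor of this ecosystem, combining two of the most used third-party libraries, pandas for tabular data and matplotlib for plotting; the sales figures are made up for illustration.

```python
# A minimal sketch of the typical Python data science stack in action.
import matplotlib.pyplot as plt
import pandas as pd

sales = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Apr"],
                      "revenue": [120, 135, 128, 160]})

print(sales.describe())              # quick statistical summary
sales.plot(x="month", y="revenue", kind="bar", legend=False)
plt.ylabel("Revenue")
plt.tight_layout()
plt.show()
```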

    1.6.2. R programming

    R is also an open-source tool often used for data science. It was developed by Ross Ihaka and Robert Gentleman, both of whose first names start with the letter R, hence the name 'R'. Data handling and manipulation are easily done using R. It is versatile and can run on platforms such as UNIX, Windows, and Mac operating systems. R also has a rich collection of libraries (more than 11,556) that can be easily installed as required. This makes R popular and widely used in data analytics for major tasks such as classical statistical tests, time-series forecasting, and machine learning tasks such as classification and regression. Basic visualization graphs can also be plotted effortlessly in R, which makes data interpretation easy. While comparisons are often made between the two programming languages most used in data science, R and Python, data scientists generally conclude that neither has a clear-cut advantage over the other. Rather, a good grasp of both languages lets a data scientist switch between them based on the needs and requirements of the project.

    1.6.3. SAS

    SAS (Statistical Analysis System) is a programming environment and language used for advanced data handling in areas such as criminal investigation, business intelligence, and predictive analysis. It was initially released in 1976 and is written in C. It is supported on various operating systems such as Windows, Unix/Linux, and IBM mainframes. It is mainly used for integrating data from multiple sources and generating statistical results from the input data fed into the environment. SAS output can be generated in a wide variety of formats such as PDF, HTML, Excel, and many more. The software has more than 200 components, each dedicated to a specific task. For instance, the SAS/STAT component handles statistical analysis, the SAS/QC component deals with quality control, and the SAS/INSIGHT component manages data handling.

    1.6.4. Tableau Public

    Tableau is data visualization software whose free version is named Tableau Public. It was developed in 2003 by four founders from the United States. Its interface allows connectivity to both local and cloud-based data sources. The preparation, analysis, and presentation of input data can all be done in Tableau with drag-and-drop features and easily available menus. Tableau is well suited for big-data analytics and generates powerful data visualization graphs, which makes it very popular in the data analytics market. A very interesting functionality of Tableau is its ability to plot latitude and longitude coordinates for geospatial data and generate graphical maps from these coordinate values.

    1.6.5. Microsoft Excel

    Microsoft Excel is a data analytics tool widely used for its simplicity and easy handling of complex data analytical tasks. It was released in 1987 by Microsoft to handle numerical calculations efficiently. It is a spreadsheet application that can handle complex numerical calculations, generate pivot tables, and display graphics. An analyst who uses R, Python, SAS, or Tableau will often still use MS Excel for its simplicity and efficient data modeling capabilities. However, it is not open-source and can be used on machines running Windows, macOS, or Android.

    1.6.6. RapidMiner

    RapidMiner is a data science software platform developed by the RapidMiner Company in the year 2006. It is written in the Java language and has a GUI that is used for designing and executing workflows related to data analytics. It also has template-based frameworks that can handle several data analysis tasks such as data preprocessing, data mining, machine learning, ETL handling, and data visualization. The RapidMiner Studio Free Edition has one logical processor and can be used by a beginner who wants to master the software for data analysis.

    1.6.7. Knime

    Knime (Konstanz Information Miner) Analytics platform is an open-source data analytics and reporting platform. Knime was developed in 2004 by a team of software engineers from Germany. It is mainly used for applying statistical analysis, data mining, ETL handling, and machine learning. The Knime workbench has several components such as Knime Explorer, Workflow editor, Workflow Coach, Node Repository, Description, Outline, Knime Hub Search, and Console. Here, the individual tasks are represented as nodes which are displayed as colored boxes and also have input ports, output ports, and status. The interconnected nodes form a workflow that can be used in a data analytics project for performing various tasks such as reading a file, data transformations, and creating visualizations. The core architecture of Knime is designed in such a way that it practically has almost no limitations on the input data fed into the system. This is a big advantage of using Knime as a data science tool as large volumes of data are needed to be dealt with for analysis in data science.

    1.6.8. Apache Spark

    Apache Spark is open-source software that became a top-level Apache project in 2014. It is versatile and can run on platforms such as UNIX, Windows, and Mac operating systems. Spark has the remarkable advantage of high speed when dealing with large datasets and is found to be more efficient than the MapReduce technique used in a Hadoop framework. Apache Spark mainly consists of the Spark Core, a distributed execution engine. Many libraries built on top of the Spark Core enable data analysis tasks such as handling SQL queries, drawing visualization graphs, and machine learning. Other than the Spark Core, the components available in Apache Spark are Spark SQL, Spark Streaming, MLlib (Machine Learning Library), and GraphX.
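
    A minimal PySpark sketch is given below; it assumes the pyspark package is installed and that a hypothetical sales.csv file with region and amount columns exists.

```python
# A minimal sketch of the Spark DataFrame API built on the Spark Core
# (assumes pyspark is installed and sales.csv is a hypothetical local file).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("quick-demo").getOrCreate()

df = spark.read.csv("sales.csv", header=True, inferSchema=True)
df.groupBy("region").agg(F.sum("amount").alias("total")).show()

spark.stop()
```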

    There are several other data science tools used by data scientists. The eight tools listed above are very popular, as they are freely downloadable and can be explored to learn their utilities. In reality, a data scientist will not work with only one tool but will use a combination of analytics tools based on efficiency and the requirements of the project.

    1.7. Fundamental areas of study in data science

    Data science is a broad term that encompasses multiple disciplines. It is a rapidly growing field of study that uses scientific methods to extract meaningful insights from given input data. The rapid growth of the field has encouraged interested researchers to explore the multiple disciplines that data science encompasses. Let us discuss a few of these broad areas, which are fundamental aspects to be covered for mastering data science.

    1.7.1. Machine learning

    Both machine learning and data science are buzzwords in today's technical world. Though data science includes machine learning as one of its fundamental areas of study, machine learning in itself is a vast research area that requires good skills and experience to master. The basic idea of machine learning is to allow machines (computers) to learn independently from the wealth of data fed into them. To master machine learning, a learner needs in-depth knowledge of computer fundamentals, programming skills, data modeling and evaluation skills, probability, and statistics.

    With the advancement of new technology, machines are being trained to emulate human decision-making capability. In doing so, it is necessary to automate decisions that machines can infer through interaction with the environment and understanding of past knowledge. The field of machine learning deals with all those algorithms that help machines train themselves in this process. Machine learning techniques are broadly categorized into three types: supervised machine learning, unsupervised machine learning, and reinforcement learning. To master data science, it is good to be thorough with all the types of machine learning, which a data scientist uses extensively to extract meaningful output from input data.
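
    The sketch below contrasts two of the three categories on the Iris data: a supervised classifier that learns from labels, and an unsupervised clustering algorithm that ignores them; the specific estimators are illustrative choices.

```python
# A minimal sketch contrasting supervised learning (labels available)
# with unsupervised learning (no labels).
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Supervised: learn a mapping from features X to known labels y.
clf = KNeighborsClassifier().fit(X, y)
print("predicted class:", clf.predict(X[:1]))

# Unsupervised: discover structure in X without using y at all.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster of first sample:", km.labels_[0])
```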

    1.7.2. Deep learning

    Deep learning is often used in data science as it is computationally very capable compared to traditional machine learning methods, which require human intervention (such as manual feature engineering) before training. Big players in the market such as Google, Microsoft, and Amazon deal with large volumes of data daily for business analysis and effective decision-making. Deep learning helps analyze bulk amounts of data through a hierarchical learning process. The data generated in these companies is massive, raw, and unstructured, and deep learning approaches are used to generate meaningful results from it.

    Deep learning approaches have proven to outperform other machine learning techniques, especially in image and speech recognition. A deep learning network performs representation learning, incorporating multiple levels of representation. In a simple sense, the higher levels of the network amplify the input aspects relevant to classification while suppressing irrelevant features that do not contribute to the classification process. The interesting fact is that these layers of features in a deep network are not designed by human engineers but are learned from data using general-purpose learning procedures.
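
    A minimal sketch of such a multi-level network is shown below using the Keras API (assuming TensorFlow is installed); the synthetic data, layer sizes, and epoch count are arbitrary illustrative choices.

```python
# A minimal sketch of a small multi-layer network; each Dense layer
# learns its own level of representation from the data.
import numpy as np
import tensorflow as tf

X = np.random.rand(200, 10)                  # 200 synthetic samples
y = (X.sum(axis=1) > 5).astype("float32")    # toy binary target

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),    # deeper representation
    tf.keras.layers.Dense(1, activation="sigmoid"),  # classification output
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)
print(model.evaluate(X, y, verbose=0))       # [loss, accuracy]
```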

    1.7.3. Natural Language Processing (NLP)

    Natural Language Processing (NLP) remains a standard requirement in the field of data science. NLP is a branch of artificial intelligence, just like machine learning. NLP focuses on bridging the gap between human communication and computer understanding. Nowadays, thanks to NLP, it is possible to analyze language-based data much as humans do: reading text, understanding speech, measuring sentiment in text, and extracting valuable text from bulk material. The field of NLP is highly beneficial for resolving ambiguity in the various languages spoken worldwide and is a key area of study for text analytics as well as speech recognition.

    NLP, as an important branch of data science, plays a vital role in extracting insights from input text. Industry experts predict that the demand for NLP in data science will grow immensely in the years to come. One key area where NLP plays a pivotal role in data science is dealing with multi-channel data such as mobile data or social media data. Through NLP, these multi-channel data are assessed and evaluated to understand customer sentiments, moods, and priorities. NLP has already emerged as a game-changer in the field of data science and business analytics.
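
    The toy sketch below shows the simplest possible form of sentiment measurement, a lexicon lookup over naively tokenized text; the tiny lexicon is hypothetical, and real NLP systems use far richer linguistic processing.

```python
# A purely illustrative sketch of lexicon-based sentiment scoring;
# the lexicon below is a hypothetical toy example.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2}

def sentiment(text: str) -> int:
    """Sum the scores of known words after naive whitespace tokenization."""
    tokens = text.lower().split()
    return sum(LEXICON.get(token, 0) for token in tokens)

print(sentiment("the delivery was great and I love the product"))  # positive
print(sentiment("terrible support and bad packaging"))             # negative
```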

    1.7.4. Statistical data analysis

    Statistics is a branch of mathematics that includes the collection, analysis, interpretation, and validation of stored data. Statistical data analysis allows the execution of statistical operations using quantitative approaches. A few important concepts in statistical data analysis include descriptive statistics, data distributions, conditional probability, hypothesis testing, and regression. Statistical analysis is an essential area of study in data analytics, as it provides tools and techniques for analyzing and drawing inferences from the provided data. It is an excellent discipline for handling data that needs to be analyzed, and for dealing with uncertainty by quantifying results.

    There are two main kinds of statistics: descriptive statistics and inferential statistics. While descriptive statistics are mainly used for presenting, organizing, and summarizing the data of a given dataset, inferential statistics are used to draw conclusions about a population based on data observed in a sample. Statistical data analysis also deals with data that is essentially of two types, namely continuous data and discrete data. The fundamental difference between the two is that continuous data do not have separate distinct values and cannot be counted, whereas discrete data are distinct and can be counted.
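
    The sketch below pairs the two kinds of statistics: descriptive means summarize two hypothetical samples, while SciPy's two-sample t-test draws an inference about whether their population means differ.

```python
# A minimal sketch of descriptive versus inferential statistics
# on hypothetical samples from two groups.
from scipy import stats

group_a = [23.1, 25.3, 24.8, 26.0, 24.5]
group_b = [21.0, 22.4, 20.9, 23.1, 21.7]

# Descriptive statistics summarize each sample...
print("means:", sum(group_a) / len(group_a), sum(group_b) / len(group_b))

# ...while the t-test infers whether the population means likely differ.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # small p suggests a real difference
```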

    1.7.5. Knowledge discovery and data mining

    Data mining, a major step in Knowledge Discovery from Data (KDD), has evolved into a prominent field over the years, as the demand for discovering meaningful patterns in data has given rise to meaningful outputs for data analysis. We are living in a data age where immense volumes of data are generated every second. However, we may be data-rich yet information-poor if these data are not utilized properly. Data alone makes no sense in the analysis world until it is converted and interpreted into some meaningful form, and this is done through the process of data mining in KDD.

    A few prominent applications of data mining include target marketing, customer relationship management, loan approval decision-making in banking, identifying customer behavior in retail industries, and fraud detection in financial and other sectors. KDD includes a series of clearly defined steps: data selection, data cleaning, data integration, data transformation, data mining, and pattern evaluation. Data mining tasks are either descriptive or predictive. Descriptive data mining tasks help find human-interpretable patterns that describe the data; a few examples include sequential pattern discovery, clustering, and association rule mining. Predictive data mining tasks, on the other hand, use some variables to predict unknown or future values of other variables; a few examples include classification, regression, and deviation detection.
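
    As a small taste of descriptive mining, the sketch below counts the support of item pairs over a handful of hypothetical market-basket transactions, which is the core counting step behind association rule mining.

```python
# A minimal sketch of frequent pair discovery over hypothetical transactions.
from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support = fraction of transactions containing the pair.
for pair, count in pair_counts.most_common(3):
    print(pair, "support =", count / len(transactions))
```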

    1.7.6. Text mining

    Text mining is similar to text analytics and involves deriving high-quality information from text. It is a variation of data mining that derives high-quality information by discovering patterns and trends using methods such as statistical pattern learning. Some prominent text mining tasks include text clustering, document summarization, sentiment analysis, text categorization, and concept extraction. In data science, text mining broadly involves treating text as input data and then applying various analyses, such as lexical analysis or pattern recognition, to interpret the information gathered from the given text.

    Text analytics may involve statistical and machine learning techniques for mining textual sources of data. Text analytics is extensively used for research in data science, business intelligence, and exploratory data analysis. The seven main steps involved in text analytics are language identification, tokenization, sentence breaking, part-of-speech tagging, chunking, syntax parsing, and sentence chaining. While the term text mining was widely used initially in the context of data mining, the term text analytics is more often used nowadays, being a promising area in the field of data science.
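
    The sketch below shows one standard first step of text mining, converting raw documents into a TF-IDF document-term matrix with scikit-learn; the three example documents are made up.

```python
# A minimal sketch of turning raw text into mineable numeric features.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "data science extracts insight from data",
    "text mining derives information from text",
    "machine learning models learn from data",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)     # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(tfidf.shape)                         # (3 documents, n terms)
```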

    1.7.7. Recommender systems

    Various web services such as Amazon, YouTube, and Netflix, and e-commerce sites such as Flipkart and Snapdeal, use recommender systems to suggest new and relevant items to online users. The items suggested (such as videos, music, appliances, or books) are based on the types of items the user accesses on a particular website. This indirectly provides a pleasant user experience and drastically increases the revenue of these businesses. In a typical recommender system, a dataset containing customer and product information is fed as input to a filtering technique. There are many standard filtering techniques applied in recommender systems; four widely used ones are collaborative filtering, content-based filtering, demographic filtering, and hybrid filtering. The choice of filtering technique largely depends on the type of data the recommender system will process and the type of recommendations it needs to generate. After filtering, a memory-based or model-based recommendation method is applied to predict items for a list of users, and finally top-N recommendations are given as output for each user.
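
    A minimal sketch of user-based collaborative filtering is given below: users are compared by cosine similarity over a hypothetical rating matrix, and unseen items are scored by similarity-weighted ratings to produce a top-N style recommendation.

```python
# A minimal sketch of user-based collaborative filtering on a hypothetical
# user-item rating matrix (rows: users, columns: items, 0 = unrated).
import numpy as np

ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target = 0  # recommend for the first user
sims = np.array([cosine(ratings[target], r) for r in ratings])
sims[target] = 0                       # ignore self-similarity

# Score unseen items by similarity-weighted ratings of the other users.
scores = sims @ ratings
scores[ratings[target] > 0] = -np.inf  # mask items already rated
print("top recommendation: item", int(np.argmax(scores)))
```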

    Nowadays, building an efficient recommender system is part and parcel of every online business, as it indirectly helps generate a huge amount of revenue and makes the business flourish relative to its competitors. There are, however, several noteworthy challenges that all recommender system techniques face in generating top recommendations for a site's customers, which can be summarized as follows:

    A recommender system may have thousands, lakhs, or even millions of distinct products as well as visiting customers on an e-commerce site, all of which have to be considered when providing recommendations.

    The cold-start problem arises for first-time customers who have never visited the e-commerce site before, so no information about previous activities can be fetched to provide recommendations.

    Older customers, on the other hand, may have an abundance of stored information based on the purchases and ratings they have made as frequent visitors.

    The most challenging task is to generate recommendations in a real-time setup, which demands that the recommender technique provide quick results in not more than half a second while also maintaining optimum accuracy of recommendations.

    Recommender systems have become essential in every industry, business and service sectors, and, hence, have received much attention in recent years. The three main phases involved
