Data Democracy: At the Nexus of Artificial Intelligence, Software Development, and Knowledge Engineering
Ebook, 476 pages, 4 hours

About this ebook

Data Democracy: At the Nexus of Artificial Intelligence, Software Development, and Knowledge Engineering provides a manifesto for data democracy. After reading the chapters of this book, you are informed and suitably warned! You are already part of the data republic, and you (and all of us) need to ensure that our data fall into the right hands. Everything you click, buy, swipe, try, sell, drive, or fly is a data point. But who owns the data? At this point, not you! You do not even have access to most of it. The next best empire of our planet is one that owns and controls the world’s best dataset. If you consume or create data, if you are a citizen of the data republic (willingly or grudgingly), and if you are interested in making a decision or finding the truth through data-driven analysis, this book is for you. A group of experts, academics, data science researchers, and industry practitioners gathered to write this manifesto about data democracy.

  • The future of the data republic, life within a data democracy, and our digital freedoms
  • An in-depth analysis of open science, open data, open source software, and their future challenges
  • A comprehensive review of data democracy's implications within domains such as: healthcare, space exploration, earth sciences, business, and psychology
  • The democratization of Artificial Intelligence (AI), and data issues such as: Bias, imbalance, context, and knowledge extraction
  • A systematic review of AI methods applied to software engineering problems
Language: English
Release date: Jan 21, 2020
ISBN: 9780128189399
Author

Feras A. Batarseh

Feras A. Batarseh is an Associate Professor with the Department of Biological Systems Engineering at Virginia Tech (VT) and the Director of A3 (AI Assurance and Applications) Lab. His research spans the areas of AI Assurance, Cyberbiosecurity, AI for Agriculture and Water, and Data-Driven Public Policy. His work has been published in various prestigious journals and at international conferences. Additionally, Dr. Batarseh has published multiple chapters and books; his two most recent books are "Federal Data Science" and "Data Democracy", both by Elsevier’s Academic Press. Dr. Batarseh is a senior member of the Institute of Electrical and Electronics Engineers (IEEE), the Agricultural and Applied Economics Association (AAEA), and the Association for the Advancement of Artificial Intelligence (AAAI). He has taught AI and Data Science courses at multiple universities including George Mason University (GMU), University of Maryland - Baltimore County (UMBC), Georgetown University, and George Washington University (GWU). Dr. Batarseh obtained his Ph.D. and M.Sc. in Computer Engineering from the University of Central Florida (UCF) (2007, 2011), a Juris Masters of Law from GMU (2022), and a Graduate Certificate in Project Leadership from Cornell University (2016). He currently holds courtesy appointments with the Center for Advanced Innovation in Agriculture (CAIA), National Security Institute (NSI), and the Department of Electrical and Computer Engineering at VT.

    Book preview

    Data Democracy

    At the Nexus of Artificial Intelligence, Software Development, and Knowledge Engineering

    Editors

    Feras A. Batarseh

    Ruixin Yang

    Table of Contents

    Cover image

    Title page

    Copyright

    Dedication

    To: Aaron Swartz—the creator of the Open Access Manifesto.

    Contributors

    A note from the editors

    Foreword

    Preface

    Section I. The data republic

    1. Data democracy for you and me (bias, truth, and context)

    1. What is data democracy?

    2. Incompleteness and winning an election

    3. The story and the alternative story

    4. Nothing else matters

    2. Data citizens: rights and responsibilities in a data republic

    1. Introduction

    2. A paradigm for discussing the cyclical nature of data–technology evolution

    3. Use cases explaining the black–red–white paradigm of data–technology evolution

    4. Preparing for a future data democratization

    5. Practical actions toward good data citizenry

    6. Conclusion

    3. The history and future prospects of open data and open source software

    1. Introduction to the history of open source

    2. Open source software's relationship to corporations

    3. Open source data science tools

    4. Open source and AI

    5. Revolutionizing business: avoiding data silos through open data

    6. Future prospects of open data and open source in the United States

    4. Mind mapping in artificial intelligence for data democracy

    1. Information overload

    2. Mind mapping and other types of visualization

    3. Conclusions

    5. Foundations of data imbalance and solutions for a data democracy

    1. Motivation and introduction

    2. Imbalanced data basics

    3. Statistical assessment metrics

    4. How to deal with imbalanced data

    5. Other methods

    6. Conclusion

    Section II. Implications of a data democracy

    6. Data openness and democratization in healthcare: an evaluation of hospital ranking methods

    1. Introduction

    2. Healthcare within a data democracy—thesis

    3. Motivation

    4. Related works

    5. Hospitals' quality of service through open data

    6. Hospital ranking—existing systems

    7. Top ranked hospitals

    8. Proposed hospital ranking: experiment and results

    9. Conclusions and future work

    7. Knowledge formulation in the health domain: a semiotics-powered approach to data analytics and democratization

    1. Introduction

    2. Conceptual foundations

    3. A semiotics-centered conceptual framework for data democratization

    4. Conclusion

    8. Landsat's past paves the way for data democratization in earth science

    1. Introduction

    2. Landsat overview

    3. Machine learning for satellite data

    4. Satellite images on the cloud

    5. Landsat data policy

    6. Conclusion

    9. Data democracy for psychology: how do people use contextual data to solve problems and why is that important for AI systems?

    1. Introduction and motivation

    2. Understanding context

    3. Cognitive psychology and context

    4. The importance of understanding linguistic acquisitions in intelligence

    5. Context and data, how important?

    6. Neuroscience and contextual understanding

    7. Context and artificial intelligence

    8. Conclusion

    10. The application of artificial intelligence in software engineering: a review challenging conventional wisdom

    1. Introduction and motivation

    2. Applying AI to SE lifecycle phases

    3. Summary of the review

    4. Insights, dilemmas, and the path forward

    Index

    Copyright

    Academic Press is an imprint of Elsevier

    125 London Wall, London EC2Y 5AS, United Kingdom

    525 B Street, Suite 1650, San Diego, CA 92101, United States

    50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States

    The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

    Copyright © 2020 Elsevier Inc. All rights reserved.

    No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

    This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

    Notices

    Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

    Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

    To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

    Library of Congress Cataloging-in-Publication Data

    A catalog record for this book is available from the Library of Congress

    British Library Cataloguing-in-Publication Data

    A catalogue record for this book is available from the British Library

    ISBN: 978-0-12-818366-3

    For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

    Publisher: Mara Conner

    Acquisition Editor: Chris Katsaropoulos

    Editorial Project Manager: Gabriela Capille

    Production Project Manager: Punithavathy Govindaradjane

    Cover Designer: Matthew Limbert

    Typeset by TNQ Technologies

    Dedication

    To the sons and daughters of the digital prison; as we may give you freedom through a data democracy, you must not inherit our thoughts or our ways.

    To: Aaron Swartz—the creator of the Open Access Manifesto.

    I remember one day visiting the John Crerar Library in Chicago. My father (Aaron's grandfather) had spoken to me about it many times, about how he had done research there when he was a young man. The library is now on the campus of the University of Chicago. It is an interesting place, as it is a science library whose mission is that it be open to the public. Although the University of Chicago does not encourage public use any more, if you are assertive, they will let you in.

    Aaron's grandfather taught me how to do library research when I was very young. We had many reference books at home and used our local public library often. To him, the ability to do research was a fundamental skill to be passed on from father to son.

    So I took Aaron to the Crerar Library and showed him around and showed him the stacks and all the books that were there. I remember clearly taking a random book off the shelf and discovering that it was from the 19th century and explaining to him how important it was to have access to the world's knowledge. Aaron understood the importance of written knowledge and, just as Crerar wanted his library open to the public, how vital it was that everyone should be able to easily access the world's research and knowledge. As Wikipedia points out: "Because the library was incorporated under the 1891 special law, court approval was required for the merger; a condition of the merger was that the combined library would also remain free to the public." We forget that in the last century, all the world's knowledge was available in the libraries, where books and journals were accessible and open to everyone. Aaron fought so that in this world of bits and bytes we could once again return to a place where everyone could have access to the world's knowledge and research.

    Robert Swartz (Aaron's father)

    2019

    Contributors

    Feras A. Batarseh,     Graduate School of Arts & Sciences, Data Analytics Program, Georgetown University, Washington, D.C., United States

    Justin Bui,     Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, United States

    Deri Chong,     Volgenau School of Engineering, George Mason University, Fairfax, VA, United States

    Sam Eisenberg,     Department of Mathematics, University of Virginia, Charlottesville, VA, United States

    Jay Gendron,     United Services Automobile Association (USAA), Chesapeake, VA, United States

    José M. Guerrero,     Infoseg, Barcelona, Spain

    Debra Hollister,     Valencia College – Lake Nona Campus, Orlando, FL, United States

    Dan Killian,     Massachusetts Institute of Technology Operations Research Center, Cambridge, MA, United States

    Erik W. Kuiler,     George Mason University, Arlington, VA, United States

    Ajay Kulkarni,     The Department of Computational and Data Sciences, College of Science, George Mason University, Fairfax, VA, United States

    Abhinav Kumar,     Volgenau School of Engineering, George Mason University, Fairfax, VA, United States

    Kelly Lewis,     College of Science, George Mason University, Fairfax, VA, United States

    Connie L. McNeely,     George Mason University, Arlington, VA, United States

    Rasika Mohod,     Volgenau School of Engineering, George Mason University, Fairfax, VA, United States

    Patrick O'Neil

    College of Science, George Mason University, Fairfax, VA, United States

    BlackSky Inc., Herndon, VA, United States

    Chau Pham,     College of Science, George Mason University, Fairfax, VA, United States

    Diego Torrejon

    College of Science, George Mason University, Fairfax, VA, United States

    BlackSky Inc., Herndon, VA, United States

    Ruixin Yang,     Geography and GeoInformation Science, College of Science, George Mason University, Fairfax, VA, United States

    Karen Yuan,     College of Science, George Mason University, Fairfax, VA, United States

    A note from the editors

    If you consume or create data, if you are a citizen of the data republic (willingly or grudgingly), and if you are interested in making a decision or finding the truth through data-driven analysis, this book is for you. A group of experts, academics, data science researchers, and industry practitioners gathered to write this book about data democracy.

    Multiple books have been published in the areas of data science, open data, artificial intelligence, machine learning, and knowledge engineering. This book, however, is at the nexus of these topics. We invite you to explore it and join us in our efforts to advance a major cause that we ought to debate.

    The chapters of this book provide a manifesto for data democracy. After reading this book, you are informed and suitably warned! You are already part of the data republic, and you (and all of us) need to ensure that our data fall into the right hands. Everything you click, buy, swipe, try, sell, drive, or fly is a data point. But who owns/should own that data? At this point, not you! You do not even have access to most of it. The next best empire of our planet is one that owns and controls the world's best dataset.

    This book presents the data republic (in Section 1), introduces methods for democratizing data (in Section 2), provides examples of the benefits of open data (for healthcare, earth science, and psychology), and describes the path forward. Data democracy is an inevitable pursuit; let us begin now.

    Feras A. Batarseh,     Assistant Teaching Professor, Graduate School of Arts & Sciences, Data Analytics Program, Georgetown University, Washington, D.C., United States,     Research Assistant Professor, College of Science, George Mason University, Fairfax, VA, United States

    Ruixin Yang,     Geography and GeoInformation Science, College of Science, George Mason University, Fairfax, VA, United States

    2019

    Foreword

    The data crisis is an operational and ethical litmus test which the monolithic technology giants have badly failed. The corporations whose profit models depend on data—Facebook, Google, Amazon, and others—have proven inept at safeguarding consumers' personal data and have so outraged the public by sharing and selling personal information that politicians can credibly advocate that they should be torn apart like Ma Bell, a tech monopoly from an earlier era. At the same time, most of the artificial intelligence (AI)–centered digital corporations that dominate the tech industry regard the data they collect as protected intellectual property. They go to great pains to collect the data and consider their aggregation a more than fair trade for free services such as Internet search, social networking, and shopping. They will neither acknowledge that consumers have a legitimate claim to their own data nor give up their proprietary stake and make the data open and available to everyone; but they have proven again and again that they are not able, or fit, to manage it.

    The short-term penalty for their data mishandling is lost customers. Generation Z-ers, the first truly native digital citizens, are abandoning social media; 34% say they will leave it entirely, while 64% say they are taking a break. Privacy concerns are high on their list of reasons why [1]. The longer-term penalties are much worse and threaten us all; if the tech giants cannot ethically manage or control services using machine learning, how will they safeguard the world's most sensitive technology as machine intelligence ineluctably grows? And how will they do so when we share the planet with computers that can outsmart us all?

    How did we get to this dangerous state of affairs? To understand, we have to go back a decade, when big data was all the rage. Then, companies with a lot of transactional data could analyze it using data-mining tools and extract useful information like inefficiencies and fraud. Wall Street used big data algorithms to seek out investment opportunities and make trading decisions. A big data technique called affinity analysis let companies discover relationships among consumers and products and offer suggestions for movies, shoes, and other goods. Big data still serves up these enterprise-friendly insights. But around 2009, something really big happened to big data.

    Three scientists (Hinton, LeCun, and Bengio, all of whom would later win the prestigious A. M. Turing Award) revealed that training learning algorithms on big data yields predictive abilities that exceed hand-coded programs [2]. Soon this technique, called deep learning, fueled amazing breakthroughs in speech recognition, computer vision, and self-driving cars. Corporations everywhere caught on. Since 2009, thanks in large part to deep learning, investment in AI has doubled each year, and now stands at about $30 billion. AI implementation in enterprise grew 270% over the last four years, mostly, again, thanks to deep learning applications. By 2030, AI will add an estimated $15 trillion to global GDP [2].

    Just the way that electricity powered the 20th century, this century's economic opportunities are driven by AI.

    To get the latest AI applications to work, high-quality datasets are mandatory. While hackneyed, the aphorism "Data is the new oil" gets truer all the time. As data gain value, their acquisition, ownership, and use grow more controversial. To understand why, we must consider what data are, and where data come from.

    Data are discrete pieces of information, such as numbers, words, photographs, measurements, and descriptions. Big data refers to a collection of data so large that it cannot be stored or processed with traditional database or software techniques. For example, Snapchat users share 527,760 photos every minute. Also every minute, 456,000 tweets are sent on Twitter. All these data require hundreds of thousands of terabytes of storage (1 terabyte equals 1024 gigabytes; 1 gigabyte equals 1024 megabytes; 1 megabyte equals 1024 kilobytes; 1 kilobyte equals 1024 bytes). It's estimated that Google, Amazon, Microsoft, and Facebook together store 1.2 million terabytes among them [3].
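
    To give a sense of the scale involved, the unit conversions above can be worked through in a few lines. The following is only an illustrative sketch, assuming the binary convention in which each unit is 1024 times the one below it (bytes, kilobytes, megabytes, gigabytes, terabytes); the function name is hypothetical, and the 1.2 million terabyte figure is the estimate quoted in the text.

```python
# Binary storage units: each step up is a factor of 1024 (assumed convention).
KILOBYTE = 1024                # bytes
MEGABYTE = 1024 * KILOBYTE     # bytes
GIGABYTE = 1024 * MEGABYTE     # bytes
TERABYTE = 1024 * GIGABYTE     # bytes


def terabytes_to_bytes(terabytes: int) -> int:
    """Convert a whole number of terabytes to bytes (hypothetical helper)."""
    return terabytes * TERABYTE


# The estimate quoted in the text: 1.2 million terabytes held by four companies.
estimated_terabytes = 1_200_000
estimated_bytes = terabytes_to_bytes(estimated_terabytes)

print(f"1 terabyte = {TERABYTE:,} bytes")
print(f"{estimated_terabytes:,} terabytes = {estimated_bytes:,} bytes")
```

    Under this convention a single terabyte is already more than a trillion bytes, so the quoted estimate corresponds to a figure on the order of a billion billion bytes.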

    Who produces all these data? You! Or rather, your use of the Internet, social media, digital photos, communications like phone calls and texts, and the IoT, or Internet of Things. Your digital activity generates mountains of data, more than half a gigabyte per day for an average user [4]. AI's recent boom can be partly explained by the fact that for the first time enough large-scale data are available for high functioning machine learning systems (the other two drivers of the AI revolution are GPU and AI-specific processor chips, and key insights, i.e., deep learning).

    How do the tech giants profit from data? In two ways. First, companies including Facebook, Amazon, and Google make money by offering their clients curated ad positioning. Based on your digital profile, they target you for their clients' ads. Second, whenever you buy their product or use their service, the tech giants gather data about you and your web activity. These data feed the development of profitable, data-hungry applications and products. Whenever you comment on an Amazon product, tweet on Twitter, or like a notice on your tennis league's Facebook page, you are helping mint cash for the world's richest corporations.

    Because we, the users, generate the data that are the lifeblood of these companies, we should be paid for its use, right? Don't you own your data?

    On the Internet, you do not own your data in a traditional sense, the way, say, a photographer owns her photograph. If you publish a photograph on Facebook, it's still yours for personal use, of course. But by electronically signing Facebook's Terms of Use, you give Facebook permission to use your photograph as they see fit and to share your photograph with their business partners and other entities. And there is a lot to share. For each user, on average, Facebook has as much as 400,000 MS Word documents' worth of data. Google has much more, about 3 million MS Word documents per user [5].

    Google makes the case that they divorce your identity from your data, and so your privacy is safe with them. Their advertising clients target your digital identity with ads, without ever knowing your name or other personal information. In essence, who you are does not matter to Google. What matters are your photos, texts, and browsing history, and they highly prize their access to it. They do not want you to have data rights that will restrict their unimpeded use, and they certainly do not want to make your data open and free to anyone.

    And despite the tech corporations' promises, they do shamefully little to secure your data or to honor their own Terms of Use. The greatest example so far of how low companies can stoop is a tale of big data, foreign intrigue, and the most important election in a decade: Facebook's Cambridge Analytica scandal.

    Briefly, in 2011, due to past failures in keeping user data private, Facebook made an agreement with the Federal Trade Commission (FTC). It required that, among other things, Facebook receive prior affirmative consent from users before it shared their data with third parties. This consent decree was to last 20 years. Three years later, in 2014, Facebook allowed an app developer to access the personal data of some 87 million users and their Facebook friends. The developer worked closely with the election consulting firm Cambridge Analytica, which acquired these data and then created an algorithm that could determine personality traits connected to voting behavior for the affected users. In conjunction with a Russian firm, Cambridge Analytica targeted users with ad and news campaigns meant to impact their vote in the United States' 2016 Presidential Election.

    For breaches of its 2011 agreement with the FTC, Facebook may be fined up to $5 billion [6]. The previous record for fines related to privacy violations belongs to Google; in 2012, it paid $22.5 million to settle FTC charges that it misrepresented privacy assurances to users of Apple's Safari Internet browser. More recently, in early 2019, France fined Google $57 million for failing to tell users how their data were being collected and failing to get users' consent to target them with personalized ads.

    For Facebook and Google, which in 2018 earned $55 billion and $136 billion, respectively, these fines are little more than slaps on the wrist [7]. The tech giants' own history tells us that modifying their behavior is difficult indeed. Google, now Alphabet, would apparently rather be sued than change its business practices or protect user privacy. Alphabet employs some 400 lawyers because, among other things, it has been sued in 20 countries for everything from privacy and copyright violations to predatory business practices. In the United States, 38 states sued then-Google when it was discovered that the cars working in its Street View mapping project did more than take pictures. Without permission they hoovered up emails, passwords, and other personal information from computers in houses they passed [8].

    Facebook is of course no better. Just weeks after April 2018, when founder Mark Zuckerberg answered questions on Capitol Hill about the Cambridge Analytica scandal and promised to impose harsh new restrictions on third-party use of user data, Facebook shared more user data with at least 50 device manufacturers, including four Chinese companies. The manufacturers were able to access personal data even if the Facebook user denied permission to share their data with third parties [9].

    No business entities in world history have possessed wealth comparable to that of the tech giants. The profits they earn put them in a unique category of human enterprise somewhere between corporations and nations. If they were nations, Alphabet and Facebook would rank in the top richest 30% and 41%, respectively. Consequently, they behave with nation-like arrogance, flying above normal corporate constraints of ethics and law, and paying taxes in low-cost tax havens instead of where they make their wealth. As sole providers of their respective services, they act monopolistically. In fact, they are de facto utilities and should be subject to stringent regulations aimed at preserving competition and innovation, or broken up, as the Bell System of telephone companies was in 1983.

    The tech giants profit from personal data and bulldoze the competition, but their greatest transgression still lies ahead. They are setting up the human race for an AI disaster many have seen coming for years: the intelligence explosion.

    The formula for the intelligence explosion was laid out in 1963 by English statistician I.J. Good. He wrote:

    Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion,' and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make, provided that the machine is docile enough to tell us how to keep it under control [10].

    I like to put Good's theorem in a contemporary context. We have already created machines that are better than humans at chess, Go, Jeopardy!, and many tasks such as navigation, search, theorem proving, and more. Scientists are rapidly developing resources that fuel AI design, including AI-specific processors, large datasets, and key insights such as, but not limited to, deep learning and evolutionary algorithms. Eventually, scientists will create machines that are better at AI research and development than humans are. At that point, they will be able to improve their own capabilities very quickly. These machines will match human-level intelligence, then become superintelligent (smarter in a rational, mathematical sense than any human) in a matter of days or weeks, in a recursive loop of self-improvement [11].

    Many experts who consider the future of AI, including myself, have argued that the intelligence explosion is not merely possible, but probable, and will occur in this century [12]. A great number of factors have gone into that conclusion, including the durability of Moore's Law, potential defeaters of AI development, the limitations of existing AI techniques, and much more. It seems inescapable that barring a cataclysmic disaster or war, scientists will create the basic ingredients of the intelligence explosion, a smarter-than-human machine, in the normal course of developing AI. Its cost will limit the competitors for this dangerous distinction. OpenAI, a nonprofit founded to create beneficial general intelligence free of market pressures, recently revealed its best estimate of the price of this endeavor, and how long it will take: at least $2 billion, and more than 10 years [13].

    The intelligence explosion will be the most sensitive event in human history for the simple reason that we have no experience with machines that can outwit us; we cannot be sure their development would not be disastrous. Computer scientists and philosophers refer to this with the masterfully understated term "the control problem."

    I think there is ample evidence to conclude the tech giants are not fit to guide the development of superintelligence to a safe
