Entity Information Life Cycle for Big Data: Master Data Management and Information Integration
By John R. Talburt and Yinle Zhou
About this ebook
Entity Information Life Cycle for Big Data walks you through the ins and outs of managing entity information so you can successfully achieve master data management (MDM) in the era of big data. This book explains big data’s impact on MDM and the critical role of an entity information management system (EIMS) in successful MDM. Expert authors Dr. John R. Talburt and Dr. Yinle Zhou provide a thorough background in the principles of managing the entity information life cycle, practical tips and techniques for implementing an EIMS, strategies for exploiting distributed processing to handle big data in an EIMS, and examples from real applications. Additional material on the theory of EIIM and methods for assessing and evaluating EIMS performance also makes this book appropriate for use as a textbook in courses on entity and identity management, data management, customer relationship management (CRM), and related topics.
- Explains the business value and impact of an entity information management system (EIMS) and directly addresses the problem of EIMS design and operation, a critical issue organizations face when implementing MDM systems
- Offers practical guidance to help you design and build an EIM system that will successfully handle big data
- Details how to measure and evaluate entity integrity in MDM systems and explains the principles and processes that comprise EIM
- Provides an understanding of the features and functions an EIM system should have, to assist in evaluating commercial EIM systems
- Includes chapter review questions, exercises, tips, and free downloads of demonstrations that use the OYSTER open source EIM system
- Executable code (Java .jar files), control scripts, and synthetic input data illustrate various aspects of the CSRUD life cycle, such as identity capture, identity update, and assertions
John R. Talburt
Dr. John R. Talburt is Professor of Information Science at the University of Arkansas at Little Rock (UALR), where he is the Coordinator for the Information Quality Graduate Program and the Executive Director of the UALR Center for Advanced Research in Entity Resolution and Information Quality (ERIQ). He is also the Chief Scientist for Black Oak Partners, LLC, an information quality solutions company. Prior to his appointment at UALR he was the leader for research and development and product innovation at Acxiom Corporation, a global leader in information management and customer data integration. Professor Talburt holds several patents related to customer data integration, has written numerous articles on information quality and entity resolution, and is the author of Entity Resolution and Information Quality (Morgan Kaufmann, 2011). He also holds the IAIDQ Information Quality Certified Professional (IQCP) credential.
Entity Information Life Cycle for Big Data
Master Data Management and Information Integration
John R. Talburt
Yinle Zhou
Table of Contents
Cover image
Title page
Copyright
Foreword
Preface
Acknowledgements
Chapter 1. The Value Proposition for MDM and Big Data
Definition and Components of MDM
The Business Case for MDM
Dimensions of MDM
The Challenge of Big Data
MDM and Big Data – The N-Squared Problem
Concluding Remarks
Chapter 2. Entity Identity Information and the CSRUD Life Cycle Model
Entities and Entity References
Managing Entity Identity Information
Entity Identity Information Life Cycle Management Models
Concluding Remarks
Chapter 3. A Deep Dive into the Capture Phase
An Overview of the Capture Phase
Building the Foundation
Understanding the Data
Data Preparation
Selecting Identity Attributes
Assessing ER Results
Data Matching Strategies
Concluding Remarks
Chapter 4. Store and Share – Entity Identity Structures
Entity Identity Information Management Strategies
Dedicated MDM Systems
The Identity Knowledge Base
MDM Architectures
Concluding Remarks
Chapter 5. Update and Dispose Phases – Ongoing Data Stewardship
Data Stewardship
The Automated Update Process
The Manual Update Process
Asserted Resolution
EIS Visualization Tools
Managing Entity Identifiers
Concluding Remarks
Chapter 6. Resolve and Retrieve Phase – Identity Resolution
Identity Resolution
Identity Resolution Access Modes
Confidence Scores
Concluding Remarks
Chapter 7. Theoretical Foundations
The Fellegi-Sunter Theory of Record Linkage
The Stanford Entity Resolution Framework
Entity Identity Information Management
Concluding Remarks
Chapter 8. The Nuts and Bolts of Entity Resolution
The ER Checklist
Cluster-to-Cluster Classification
Selecting an Appropriate Algorithm
Concluding Remarks
Chapter 9. Blocking
Blocking
Blocking by Match Key
Dynamic Blocking versus Preresolution Blocking
Blocking Precision and Recall
Match Key Blocking for Boolean Rules
Match Key Blocking for Scoring Rules
Concluding Remarks
Chapter 10. CSRUD for Big Data
Large-Scale ER for MDM
The Transitive Closure Problem
Distributed, Multiple-Index, Record-Based Resolution
An Iterative, Nonrecursive Algorithm for Transitive Closure
Iteration Phase: Successive Closure by Reference Identifier
Deduplication Phase: Final Output of Components
ER Using the Null Rule
The Capture Phase and IKB
The Identity Update Problem
Persistent Entity Identifiers
The Large Component and Big Entity Problems
Identity Capture and Update for Attribute-Based Resolution
Concluding Remarks
Chapter 11. ISO Data Quality Standards for Master Data
Background
Goals and Scope of the ISO 8000-110 Standard
Four Major Components of the ISO 8000-110 Standard
Simple and Strong Compliance with ISO 8000-110
ISO 22745 Industrial Systems and Integration
Beyond ISO 8000-110
Concluding Remarks
Appendix A. Some Commonly Used ER Comparators
References
Index
Copyright
Acquiring Editor: Steve Elliot
Editorial Project Manager: Amy Invernizzi
Project Manager: Priya Kumaraguruparan
Cover Designer: Matthew Limbert
Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA
Copyright © 2015 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
ISBN: 978-0-12-800537-8
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress
For information on all MK publications visit our website at www.mkp.com
Foreword
In July of 2015 the Massachusetts Institute of Technology (MIT) will celebrate the 20th anniversary of the International Conference on Information Quality. My journey to information and data quality has had many twists and turns, but I have always found it interesting and rewarding. For me the most rewarding part of the journey has been the chance to meet and work with others who share my passion for this topic. I first met John Talburt in 2002 when he was working in the Data Products Division of Acxiom Corporation, a data management company with global operations. John had been tasked by leadership to answer the question, What is our data quality?
Looking for help on the Internet he found the MIT Information Quality Program and contacted me. My book Quality Information and Knowledge (Huang, Lee, & Wang, 1999) had recently been published. John invited me to Acxiom headquarters, at that time in Conway, Arkansas, to give a one-day workshop on information quality to the Acxiom Leadership team.
This was the beginning of John’s journey to data quality, and we have been traveling together on that journey ever since. After I helped him lead Acxiom’s effort to implement a Total Data Quality Management program, he in turn helped me to realize one of my long-time goals of seeing a U.S. university start a degree program in information quality. Through the largess of Acxiom Corporation, led at that time by Charles Morgan, and the academic entrepreneurship of Dr. Mary Good, Founding Dean of the Engineering and Information Technology College at the University of Arkansas at Little Rock, the world’s first graduate degree program in information quality was established in 2006. John has been leading this program at UALR ever since. Initially created around a Master of Science in Information Quality (MSIQ) degree (Lee et al., 2007), it has since expanded to include a Graduate Certificate in IQ and an IQ PhD degree. As of this writing the program has graduated more than 100 students.
The second part of this story began in 2008. In that year, Yinle Zhou, an e-commerce graduate from Nanjing University in China, came to the U.S. and was admitted to the UALR MSIQ program. After finishing her MS degree, she entered the IQ PhD program with John as her research advisor. Together they developed a model for entity identity information management (EIIM) that extends entity resolution in support of master data management (MDM), the primary focus of this book. Dr. Zhou is now a Software Engineer and Data Scientist for IBM InfoSphere MDM Development in Austin, Texas, and an Adjunct Assistant Professor of Electrical and Computer Engineering at the University of Texas at Austin. And so the torch was passed and another journey began.
I have also been fascinated to see how the landscape of information technology has changed over the past 20 years. During that time IT has experienced a dramatic shift in focus. Inexpensive, large-scale storage and processors have changed the face of IT. Organizations are exploiting cloud computing, software-as-a-service, and open source software, as alternatives to building and maintaining their own data centers and developing custom solutions. All of these trends are contributing to the commoditization of technology. They are forcing companies to compete with better data instead of better technology. At the same time, more and more data are being produced and retained, from structured operational data to unstructured, user-generated data from social media. Together these factors are producing many new challenges for data management, and especially for master data management.
The complexity of the new data-driven environment can be overwhelming. Organizations must deal with data governance and policy, data privacy and security, data quality, MDM, reference data management (RDM), information risk management, regulatory compliance, and more. Just as John and Yinle started their journeys as individuals, now we see that entire organizations are embarking on journeys to data and information quality. The difference is that an organization needs a leader to set the course, and I strongly believe this leader should be the Chief Data Officer (CDO).
The CDO is an emerging role in modern organizations, leading the company’s journey to strategically use data for regulatory compliance, performance optimization, and competitive advantage. The MIT CDO Forum recognizes the growing criticality of the CDO’s role and has developed a series of events where leaders come together for bidirectional sharing and collaboration to accelerate the identification and establishment of best practices in strategic data management.
I and others have been conducting the MIT Longitudinal Study on the Chief Data Officer and hosting events for senior executives to advance CDO research and practice. We have published research results in leading academic journals, as well as the proceedings of the MIT CDO Forum, MIT CDOIQ Symposium, and the International Conference on Information Quality (ICIQ). For example, we have developed a three-dimensional cubic framework to describe the emerging role of the Chief Data Officer in the context of Big Data (Lee et al., 2014).
I believe that CDOs, MDM architects and administrators, and anyone involved with data governance and information quality will find this book useful. MDM is now considered an integral component of a data governance program. The material presented here clearly lays out the business case for MDM and a plan to improve the quality and performance of MDM systems through effective entity information life cycle management. It not only explains the technical aspects of the life cycle, it also provides guidance on the often overlooked tasks of MDM quality metrics and analytics and MDM stewardship.
Richard Wang, MIT Chief Data Officer and Information Quality Program
Preface
The Changing Landscape of Information Quality
Since the publication of Entity Resolution and Information Quality (Morgan Kaufmann, 2011), a lot has been happening in the field of information and data quality. One of the most important developments is how organizations are beginning to understand that the data they hold are among their most important assets and should be managed accordingly. As many of us know, this is by no means a new message, only that it is just now being heeded. Leading experts in information and data quality such as Rich Wang, Yang Lee, Tom Redman, Larry English, Danette McGilvray, David Loshin, Laura Sebastian-Coleman, Rajesh Jugulum, Sunil Soares, Arkady Maydanchik, and many others have been advocating this principle for many years.
Evidence of this new understanding can be found in the dramatic surge of the adoption of data governance (DG) programs by organizations of all types and sizes. Conferences, workshops, and webinars on this topic are overflowing with attendees. The primary reason is that DG provides organizations with an answer to the question, If information is really an important organizational asset, then how can it be managed at the enterprise level?
One of the primary benefits of a DG program is that it provides a framework for implementing a central point of communication and control over all of an organization’s data and information.
As DG has grown and matured, its essential components have become more clearly defined. These components generally include central repositories for data definitions, business rules, metadata, data-related issue tracking, regulations and compliance, and data quality rules. Two other key components of DG are master data management (MDM) and reference data management (RDM). Consequently, the increasing adoption of DG programs has brought a commensurate increase in focus on the importance of MDM.
Certainly this is not the first book on MDM. Several excellent books include Master Data Management and Data Governance by Alex Berson and Larry Dubov (2011), Master Data Management in Practice by Dalton Cervo and Mark Allen (2011), Master Data Management by David Loshin (2009), Enterprise Master Data Management by Allen Dreibelbis, Eberhard Hechler, Ivan Milman, Martin Oberhofer, Paul van Run, and Dan Wolfson (2008), and Customer Data Integration by Jill Dyché and Evan Levy (2006). However, MDM is an extensive and evolving topic. No single book can explore every aspect of MDM at every level.
Motivation for This Book
Numerous things have motivated us to contribute yet another book. However, the primary reason is this. Based on our experience in both academia and industry, we believe that many of the problems that organizations experience with MDM implementation and operation are rooted in the failure to understand and address certain critical aspects of entity identity information management (EIIM). EIIM is an extension of entity resolution (ER) with the goal of achieving and maintaining the highest level of accuracy in the MDM system. Two key terms are achieving and maintaining.
Having a goal and defined requirements is the starting point for every information and data quality methodology, from the MIT TDQM (Total Data Quality Management) to the Six Sigma DMAIC (Define, Measure, Analyze, Improve, and Control). Unfortunately, when it comes to MDM, many organizations have not defined any goals. Consequently, these organizations have no way to know whether they have achieved their goal. They leave many questions unanswered. What is our accuracy? Now that a proposed program or procedure change has been implemented, is the system performing better or worse than before? Few MDM administrators can provide accurate estimates of even the most basic metrics, such as false positive and false negative rates or the overall accuracy of their system. In this book we have emphasized the importance of objective and systematic measurement and provided practical guidance on how these measurements can be made.
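To make the measurement idea concrete, the sketch below (a hypothetical illustration, not code from the book) computes pairwise false positive and false negative rates and overall accuracy by comparing an ER system's linked record pairs against a truth set of known equivalent pairs:

```python
from itertools import combinations

def pairwise_metrics(truth_pairs, linked_pairs, all_records):
    """Compare an ER system's linked pairs against a truth set.

    truth_pairs  -- set of frozensets: pairs that truly refer to the same entity
    linked_pairs -- set of frozensets: pairs the ER system actually linked
    all_records  -- record identifiers in the benchmark sample

    Enumerating every pair is only feasible on a small truth set or
    benchmark sample, which is how such truth sets are typically used.
    """
    total_pairs = {frozenset(p) for p in combinations(all_records, 2)}
    tp = len(truth_pairs & linked_pairs)   # correctly linked
    fp = len(linked_pairs - truth_pairs)   # linked, but not truly equivalent
    fn = len(truth_pairs - linked_pairs)   # truly equivalent, but missed
    tn = len(total_pairs) - tp - fp - fn   # correctly left unlinked
    return {
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
        "accuracy": (tp + tn) / len(total_pairs),
    }

# Toy truth set: among records a-d, (a, b) and (c, d) are true duplicates.
truth = {frozenset(("a", "b")), frozenset(("c", "d"))}
linked = {frozenset(("a", "b")), frozenset(("b", "c"))}  # one hit, one error
print(pairwise_metrics(truth, linked, ["a", "b", "c", "d"]))
```

The same counts also yield the precision and recall measures used later in the book's discussion of blocking.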
To help organizations better address the maintaining of high levels of accuracy through EIIM, the majority of the material in the book is devoted to explaining the CSRUD five-phase entity information life cycle model. CSRUD is an acronym for capture, store and share, resolve and retrieve, update, and dispose. We believe that following this model can help any organization improve MDM accuracy and performance.
Finally, no modern day IT book can be complete without talking about Big Data. Seemingly rising up overnight, Big Data has captured everyone’s attention, not just in IT, but even the man on the street. Just as DG seems to be getting up a good head of steam, it now has to deal with the Big Data phenomenon. The immediate question is whether Big Data simply fits right into the current DG model, or whether the DG model needs to be revised to account for Big Data.
Regardless of one’s opinion on this topic, one thing is clear: Big Data is bad news for MDM. The reason is a simple mathematical fact: MDM relies on entity resolution, entity resolution relies primarily on pair-wise record matching, and the number of pairs of records to match increases as the square of the number of records. For this reason, ordinary data (millions of records) is already a challenge for MDM, so Big Data (billions of records) seems almost insurmountable. Fortunately, Big Data is not just a matter of more data; it is also ushering in a new paradigm for managing and processing large amounts of data. Big Data is bringing with it new tools and techniques. Perhaps the most important technique is how to exploit distributed processing. However, it is easier to talk about Big Data than to do something about it. We wanted to avoid that and include in our book some practical strategies and designs for using distributed processing to solve some of these problems.
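The quadratic growth is easy to see with a little arithmetic. A brute-force matcher must compare n(n-1)/2 distinct pairs, as this small illustrative sketch shows:

```python
def naive_comparisons(n):
    """Distinct record pairs a brute-force matcher must compare: n(n-1)/2."""
    return n * (n - 1) // 2

# Going from millions to billions of records multiplies the pair count
# by roughly a million, not a thousand.
for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"{n:>13,} records -> {naive_comparisons(n):,} comparisons")
```

A million records already generates roughly half a trillion candidate pairs, which is why the blocking techniques of Chapter 9 and the distributed algorithms of Chapter 10 are essential rather than optional.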
Audience
It is our hope that both IT professionals and business professionals interested in MDM and Big Data issues will find this book helpful. Most of the material focuses on issues of design and architecture, making it a resource for anyone evaluating an installed system, comparing proposed third-party systems, or for an organization contemplating building its own system. We also believe that it is written at a level appropriate for a university textbook.
Organization of the Material
Chapters 1 and 2 provide the background and context of the book. Chapter 1 provides a definition and overview of MDM. It includes the business case, dimensions, and challenges facing MDM and also starts the discussion of Big Data and its impact on MDM. Chapter 2 defines and explains the two primary technologies that support MDM – ER and EIIM. In addition, Chapter 2 introduces the CSRUD Life Cycle for entity identity information. This sets the stage for the next four chapters.
Chapters 3, 4, 5, and 6 are devoted to an in-depth discussion of the CSRUD life cycle model. Chapter 3 is an in-depth look at the Capture Phase of CSRUD. As part of the discussion, it also covers the techniques of truth set building, benchmarking, and problem sets as tools for assessing entity resolution and MDM outcomes. In addition, it discusses some of the pros and cons of the two most commonly used data matching techniques – deterministic matching and probabilistic matching.
Chapter 4 explains the Store and Share Phase of CSRUD. This chapter introduces the concept of an entity identity structure (EIS) that forms the building blocks of the identity knowledge base (IKB). In addition to discussing different styles of EIS designs, it also includes a discussion of the different types of MDM architectures.
Chapter 5 covers two closely related CSRUD phases, the Update Phase and the Dispose Phase. The Update Phase discussion covers both automated and manual update processes and the critical roles played by clerical review indicators, correction assertions, and confirmation assertions. Chapter 5 also presents an example of an identity visualization system that assists MDM data stewards with the review and assertion process.
Chapter 6 covers the Resolve and Retrieve Phase of CSRUD. It also discusses some design considerations for accessing identity information, and a simple model for a retrieved identifier confidence score.
Chapter 7 introduces two of the most important theoretical models for ER, the Fellegi-Sunter Theory of Record Linkage and the Stanford Entity Resolution Framework or SERF Model. Chapter 7 is inserted here because some of the concepts introduced in the SERF Model are used in Chapter 8, The Nuts and Bolts of ER. The chapter concludes with a discussion of how EIIM relates to each of these models.
Chapter 8 describes a deeper level of design considerations for ER and EIIM systems. It discusses in detail the three levels of matching in an EIIM system: attribute-level, reference-level, and cluster-level matching.
Chapter 9 covers the technique of blocking as a way to increase the performance of ER and MDM systems. It focuses on match key blocking, the definition of match-key-to-match-rule alignment, and the precision and recall of match keys. Preresolution blocking and transitive closure of match keys are discussed as a prelude to Chapter 10.
Chapter 10 discusses the problems in implementing the CSRUD Life Cycle for Big Data. It gives examples of how the Hadoop Map/Reduce framework can be used to address many of these problems using a distributed computing environment.
Chapter 11 covers the new ISO 8000-110 data quality standard for master data. This standard is not well understood outside of a few industry verticals, but it has potential implications for all industries. This chapter covers the basic requirements of the standard and how organizations can become ISO 8000 compliant, and perhaps more importantly, why organizations would want to be compliant.
Finally, to reduce ER discussions in Chapters 3 and 8, Appendix A goes into more detail on some of the more common data comparison algorithms.
This book also includes a website with exercises, tips and free downloads of demonstrations that use a trial version of the HiPER EIM system for hands-on learning. The website includes control scripts and synthetic input data to illustrate how the system handles various aspects of the CSRUD life cycle such as identity capture, identity update, and assertions. You can access the website here: http://www.BlackOakAnalytics.com/develop/HiPER/trial.
Acknowledgements
This book would not have been possible without the help of many people and organizations. First of all, Yinle and I would like to thank Dr. Rich Wang, Director of the MIT Information Quality Program, for starting us on our journey to data quality and for writing the foreword for our book, and Dr. Scott Schumacher, Distinguished Engineer at IBM, for his support of our research and collaboration. We would also like to thank our employers, IBM Corporation, University of Arkansas at Little Rock, and Black Oak Analytics, Inc., for their support and encouragement during its writing.
It has been a privilege to be a part of the UALR Information Quality Program and to work with so many talented students and gifted faculty members. I would especially like to acknowledge several of my current students for their contributions to this work. These include Fumiko Kobayashi, identity resolution models and confidence scores in Chapter 6; Cheng Chen, EIS visualization tools and confirmation assertions in Chapter 5 and Hadoop map/reduce in Chapter 10; Daniel Pullen, clerical review indicators in Chapter 5 and Hadoop map/reduce in Chapter 10; Pei Wang, blocking for scoring rules in Chapter 9, Hadoop map/reduce in Chapter 10, and the demonstration data, scripts, and exercises on the book’s website; Debanjan Mahata, EIIM for unstructured data in Chapter 1; Melody Penning, entity-based data integration in Chapter 1; and Reed Petty, IKB structure for HDFS in Chapter 10. In addition I would like to thank my former student Dr. Eric Nelson for introducing the null rule concept and for sharing his expertise in Hadoop map/reduce in Chapter 10. Special thanks go to Dr. Laura Sebastian-Coleman, Data Quality Leader at Cigna, and Joshua Johnson, UALR Technical Writing Program, for their help in editing and proofreading. Finally I want to thank my teaching assistants, Fumiko Kobayashi, Khizer Syed, Michael Greer, Pei Wang, and Daniel Pullen, and my administrative assistant, Nihal Erian, for giving me the extra time I needed to complete this work.
I would also like to take this opportunity to acknowledge several organizations that have supported my work for many years. Acxiom Corporation under Charles Morgan was one of the founders of the UALR IQ program and continues to support the program under Scott Howe, the current CEO, and Allison Nicholas, Director of College Recruiting and University Relations. I am grateful for my experience at Acxiom and the opportunity to learn about Big Data entity resolution in a distributed computing environment from Dr. Terry Talley and the many other world-class data experts who work there.
The Arkansas Research Center, under the direction of Dr. Neal Gibson and Dr. Greg Holland, was the first to support my work on the OYSTER open source entity resolution system. The Arkansas Department of Education – in particular former Assistant Commissioner Jim Boardman and his successor, Dr. Cody Decker, along with Arijit Sarkar in the IT Services Division – gave me the opportunity to build a student MDM system that implements the full CSRUD life cycle as described in this book.
The Translational Research Institute (TRI) at the University of Arkansas for Medical Sciences has given me and several of my students the opportunity for hands-on experience with MDM systems in the healthcare environment. I would like to thank Dr. William Hogan, the former Director of TRI, for teaching me about referent tracking, and also Dr. Umit Topaloglu, the current Director of Informatics at TRI, who, along with Dr. Mathias Brochhausen, continues this collaboration.
Last but not least are my business partners at Black Oak Analytics. Our CEO, Rick McGraw, has been a trusted friend and business advisor for many years. Because of Rick and our COO, Jonathan Askins, what was only a vision has become a reality.
John R. Talburt and Yinle Zhou
Chapter 1
The Value Proposition for MDM and Big Data
Abstract
This chapter gives a definition of master data management (MDM) and describes how it generates value for organizations. It also provides an overview of Big Data and the challenges it brings to MDM.
Keywords
Master data; master data management; MDM; Big Data; reference data management; RDM
Definition and Components of MDM
Master Data as a Category of Data
Modern information systems use four broad categories of data: master data, transaction data, metadata, and reference data. Master data are data held by an organization that describe the entities that are both independent of and fundamental to the organization’s operations. In some sense, master data are the nouns in the grammar of data and information. They describe the persons, places, and things that are critical to the operation of an organization, such as its customers, products, employees, materials, suppliers, services, shareholders, facilities, equipment, and rules and regulations. Exactly what is considered master data depends on the viewpoint of the organization.
If master data are the nouns of data and information, then transaction data can be thought of as the verbs. They describe the actions that take place in the day-to-day operation of the organization, such as the sale of a product in a business or the admission of a patient to a hospital. Transactions relate master data in a meaningful way. For example, a credit card transaction relates two entities that are represented by master data. The first is the issuing bank’s credit card account, identified by the credit card number, where the master data contain the information the issuing bank requires about that specific account. The second is the accepting bank’s merchant account, identified by the merchant number, where the master data contain the information the accepting bank requires about that specific merchant.
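This division of labor can be sketched in code: a transaction record carries only the identifiers of the entities it relates, while the descriptive attributes live in the master records. The following is a minimal, hypothetical illustration; the class and field names are assumptions for this sketch, not structures from OYSTER or any commercial MDM product.

```java
import java.util.HashMap;
import java.util.Map;

public class TransactionDemo {
    // Master data: descriptive attributes keyed by an entity identifier
    record Account(String cardNumber, String holderName, String issuingBank) {}
    record Merchant(String merchantNumber, String merchantName, String acceptingBank) {}

    // Transaction data: relates two master entities by their identifiers only
    record CardTransaction(String cardNumber, String merchantNumber, double amount) {}

    public static void main(String[] args) {
        Map<String, Account> accounts = new HashMap<>();
        accounts.put("4111-0001", new Account("4111-0001", "A. Smith", "First Bank"));

        Map<String, Merchant> merchants = new HashMap<>();
        merchants.put("M-9042", new Merchant("M-9042", "Corner Grocery", "Second Bank"));

        // The transaction itself stores no names or bank details
        CardTransaction t = new CardTransaction("4111-0001", "M-9042", 25.50);

        // Resolving the transaction back to its master records via the identifiers
        Account a = accounts.get(t.cardNumber());
        Merchant m = merchants.get(t.merchantNumber());
        System.out.println(a.holderName() + " paid " + m.merchantName() + " $" + t.amount());
    }
}
```

The point of the sketch is that the transaction is meaningful only because each identifier resolves to exactly one master record, which is precisely what MDM is responsible for guaranteeing.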
Master data management (MDM) and reference data management (RDM) systems are both systems of record (SOR). An SOR is a system charged with keeping the most complete or trustworthy representation of a set of entities (Sebastian-Coleman, 2013). The records in an SOR are sometimes called golden records or certified records because they provide a single point of reference for a particular type of information. In the context of MDM, the objective is to provide a single point of reference for each entity under management. In the case of master data, the intent is to have only one information structure and one identifier for each entity under management. In the credit card example, each entity would be a credit card account.
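The "one structure and one identifier per entity" goal can be made concrete with a toy example: several source records referring to the same credit card account are collapsed into a single golden record per identifier. The survivorship rule used here (keep the longest name seen) is purely an illustrative assumption, not a method prescribed by this book or any MDM product.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GoldenRecordDemo {
    // A raw source record that may duplicate another record for the same account
    record SourceRecord(String cardNumber, String holderName) {}

    // Build one golden record per entity identifier, merging duplicates
    static Map<String, String> buildGoldenRecords(List<SourceRecord> records) {
        Map<String, String> golden = new LinkedHashMap<>();
        for (SourceRecord r : records) {
            // Toy survivorship rule: retain the most complete (longest) name
            golden.merge(r.cardNumber(), r.holderName(),
                (oldName, newName) -> newName.length() > oldName.length() ? newName : oldName);
        }
        return golden;
    }

    public static void main(String[] args) {
        List<SourceRecord> input = List.of(
            new SourceRecord("4111-0001", "A. Smith"),
            new SourceRecord("4111-0001", "Alice Smith"),  // same account, fuller name
            new SourceRecord("4111-0002", "B. Jones"));

        // Three source records resolve to two entities, one golden record each
        System.out.println(buildGoldenRecords(input));
    }
}
```

In a real MDM system the hard part is deciding that two records refer to the same entity in the first place; here the shared card number makes that trivial, which is rarely the case in practice.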
Metadata are simply data about data. Metadata are critical to understanding the meaning of both master and transactional data. They provide the definitions, specifications, and other descriptive information about the operational data. Data standards, data definitions, data requirements, data quality information, data provenance, and business rules are all forms of metadata.
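A small structural sketch may help fix the distinction: metadata describe a data element rather than hold its values. The record below captures a few of the metadata forms just listed (definition, data type, provenance, business rules) for a single field; its shape and contents are hypothetical, invented for this illustration.

```java
import java.util.List;

public class MetadataDemo {
    // Metadata: data describing an operational data element, not the data itself
    record FieldMetadata(String fieldName, String definition, String dataType,
                         String provenance, List<String> businessRules) {}

    public static void main(String[] args) {
        FieldMetadata cardNumber = new FieldMetadata(
            "cardNumber",
            "Primary account number assigned by the issuing bank",
            "String (16 digits)",
            "Issuing bank card management system",
            List.of("Must pass Luhn check", "Must be unique per account"));

        System.out.println(cardNumber.fieldName() + ": " + cardNumber.definition());
    }
}
```

Note that no actual card number appears anywhere in this structure; everything in it is "data about data."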
Reference data share characteristics with both master data and metadata. Reference data are standard, agreed-upon codes that help