Ebook, 902 pages, 11 hours

Applied Logistic Regression

Name: Applied Logistic Regression
Brand: Wiley
Rating: 4.5 (2 reviews)

By David W. Hosmer, Jr., Stanley Lemeshow and Rodney X. Sturdivant

About this ebook

A new edition of the definitive guide to logistic regression modeling for health science and other applications

This thoroughly expanded Third Edition provides an easily accessible introduction to the logistic regression (LR) model and highlights the power of this model by examining the relationship between a dichotomous outcome and a set of covariables.

Applied Logistic Regression, Third Edition emphasizes applications in the health sciences and handpicks topics that best suit the use of modern statistical software. The book provides readers with state-of-the-art techniques for building, interpreting, and assessing the performance of LR models. New and updated features include:

A chapter on the analysis of correlated outcome data
A wealth of additional material for topics ranging from Bayesian methods to assessing model fit
Rich data sets from real-world studies that demonstrate each method under discussion
Detailed examples and interpretation of the presented results as well as exercises throughout

Applied Logistic Regression, Third Edition is a must-have guide for professionals and researchers who need to model nominal or ordinal scaled outcome variables in public health, medicine, and the social sciences as well as a wide range of other fields and disciplines.

Language: English
Publisher: Wiley
Release date: Feb 26, 2013
ISBN: 9781118548356
Author: David W. Hosmer, Jr.

5 min read
Measuring Attribution: What’s Working?
NZ Marketing
Article
Measuring Attribution: What’s Working?
Sep 21, 2022
9 min read
Measuring Performance For Nature Recovery
Landscape Architecture Australia
Article
Measuring Performance For Nature Recovery
Jan 29, 2024
5 min read
How Mature Is Your Organisation With Regards To Digital And Web Analytics?
NZ Marketing
Article
How Mature Is Your Organisation With Regards To Digital And Web Analytics?
Jun 9, 2021
1 min read
Strategy + Design Thinking = Stakeholder-Centric Design
Rotman Management
Article
Strategy + Design Thinking = Stakeholder-Centric Design
Sep 1, 2018
OVER THE PAST 15 YEARS, design thinking has had an explosive impact on innovation and commercialization, especially within established firms. The Rotman School’s former Dean, Roger Martin, has contributed mightily to these advances, notably through h
6 min read
Strategic Drivers FOR THE POST-PANDEMIC ERA
The European Business Review
Article
Strategic Drivers FOR THE POST-PANDEMIC ERA
Feb 25, 2021
10 min read
How Spooky Science Helps Us Peer Inside The Planets
All About Space
Article
How Spooky Science Helps Us Peer Inside The Planets
Dec 3, 2020
An assistant professor of computational science at the EPFL research centre in Lausanne, Switzerland, involved in the current research on metallic hydrogen. Could you explain how the machine-learning techniques used in your research work? Why were th
1 min read
A Continuously Improving Workplace
Artichoke
Article
A Continuously Improving Workplace
Aug 27, 2017
3 min read
Web App Security
Linux Format
Article
Web App Security
Jun 29, 2021
8 min read
Top 10 Excel Functions That Everyone Should Know
Techfastly
Article
Top 10 Excel Functions That Everyone Should Know
Feb 4, 2021
5 min read
Social Media In B2b Supply Chain Management
The European Business Review
Article
Social Media In B2b Supply Chain Management
Aug 1, 2022
8 min read
DESIGN THINKING: Eight Mistakes to Avoid
The European Business Review
Article
DESIGN THINKING: Eight Mistakes to Avoid
Feb 4, 2019
3 min read
Why Fixing The Planet Is Also About Seizing Business Opportunities
The European Business Review
Article
Why Fixing The Planet Is Also About Seizing Business Opportunities
Feb 25, 2021
6 min read
Sustainability Tools: The Regenerative Compass
Rotman Management
Article
Sustainability Tools: The Regenerative Compass
Jan 1, 2024
We are well into what climate experts are calling ‘the decisive decade’ for sustainability and Net Zero commitments. And yet, significant action and momentum are missing in most organizations. Even in companies that have made bold commitments for 203
4 min read
Triple A.i. Supply Chains
The European Business Review
Article
Triple A.i. Supply Chains
Jun 1, 2022
15 min read
Scrum Project Management: The Ideal Agile Practice
Techfastly
Article
Scrum Project Management: The Ideal Agile Practice
May 3, 2021
7 min read
The Tech Trends Every Leader Needs to Understand
Rotman Management
Article
The Tech Trends Every Leader Needs to Understand
Sep 1, 2023
11 min read
Quantum Leap
Marketing
Article
Quantum Leap
Jul 11, 2019
6 min read
5 Tools To Help Your Remote-work Business Click
TechLife News
Article
5 Tools To Help Your Remote-work Business Click
Aug 14, 2021
3 min read
Corporate Foresight In An Ever-turbulent Era
The European Business Review
Article
Corporate Foresight In An Ever-turbulent Era
Sep 30, 2020
13 min read
Mathematics Packages
Linux Format
Article
Mathematics Packages
Sep 22, 2020
1 min read

Related categories

Skip carousel

Reviews for Applied Logistic Regression

Rating: 4.5 out of 5 stars

4.5/5

2 ratings, 1 review

Rating: 4 out of 5 stars
4/5
A good book that certainly has practical application. It details the rise in use of this particular technique, and where it is applicable. Also details multiple varieties including multinomial and others. This is definitely a mathematics text that is worth the time to take a look at.

Book preview

Applied Logistic Regression - David W. Hosmer, Jr.

Preface to the Third Edition

This third edition of Applied Logistic Regression comes 12 years after the 2000 publication of the second edition. During this interval there has been considerable effort researching statistical aspects of the logistic regression model—particularly when the outcomes are correlated. At the same time, capabilities of computer software packages to fit models grew impressively to the point where they now provide access to nearly every aspect of model development a researcher might need. As is well-recognized in the statistical community, the inherent danger of this easy-to-use software is that investigators have at their disposal powerful computational tools, about which they may have only limited understanding. It is our hope that this third edition will help bridge the gap between the outstanding theoretical developments and the need to apply these methods to diverse fields of inquiry.

As was the case in the first two editions, the primary objective of the third edition is to provide an introduction to the underlying theory of the logistic regression model, with a major focus on the application, using real data sets, of the available methods to explore the relationship between a categorical outcome variable and a set of covariates. The materials in this book have evolved over the past 12 years as a result of our teaching and consulting experiences. We have used this book to teach parts of graduate level survey courses, quarter- or semester-long courses, as well as focused short courses to working professionals. We assume that students have a solid foundation in linear regression methodology and contingency table analysis. The positive feedback we have received from students or professionals taking courses using this book or using it for self-learning or reference provides us with some assurance that the approach we used in the first two editions worked reasonably well; therefore, we have followed that approach in this new edition.

The approach we take is to develop the logistic regression model from a regression analysis point of view. This is accomplished by approaching logistic regression in a manner analogous to what would be considered good statistical practice for linear regression. This differs from the approach used by other authors who have begun their discussion from a contingency table point of view. While the contingency table approach may facilitate the interpretation of the results, we believe that it obscures the regression aspects of the analysis. Thus, discussion of the interpretation of the model is deferred until the regression approach to the analysis is firmly established.

To a large extent, there are no major differences between the many software packages that include logistic regression modeling. When a particular approach is available in a limited number of packages, it will be noted in this text. In general, analyses in this book have been performed using STATA [Stata Corp. (2011)]. This easy-to-use package combines excellent graphics and analysis routines; is fast; is compatible across Macintosh, Windows and UNIX platforms; and interacts well with Microsoft Word. Other major statistical packages employed at various points during the preparation of this text include SAS [SAS Institute Inc. (2009)], OpenBUGS [Lunn et al. (2009)] and R [R Development Core Team (2010)]. For all intents and purposes the results produced were the same regardless of which package we used. Reported numeric results have been rounded from figures obtained from computer output and thus may differ slightly from those that would be obtained in a replication of our analyses or from calculations based on the reported results. When features or capabilities of the programs differed in an important way, we noted them by the names given rather than by their bibliographic citation.

We feel that this new edition benefits greatly from the addition of a number of key topics. These include the following:

1. An expanded presentation of numerous new techniques for model-building, including methods for determining the scale of continuous covariates and assessing model performance.

2. An expanded presentation of regression modeling of complex sample survey data.

3. An expanded development of the use of logistic regression modeling in matched studies, as well as with multinomial and ordinal scaled responses.

4. A new chapter dealing with models and methods for correlated categorical response data.

5. A new chapter developing a number of important applications either missing or expanded from the previous editions. These include propensity score methods, exact methods for logistic regression, sample size issues, Bayesian logistic regression, and other link functions for binary outcome regression models. This chapter concludes with sections dealing with the epidemiologic concepts of mediation and additive interaction.

As was the case for the second edition, all of the data sets used in the text are available at a web site at John Wiley & Sons, Inc. http://wiley.mpstechnologies.com/wiley/BOBContent/searchLPBobContent.do

In addition, the data may also be found, by permission of John Wiley & Sons Inc., in the archive of statistical data sets maintained at the University of Massachusetts at http://www.umass.edu/statdata/statdata in the logistic regression section.

We would like to express our sincere thanks and appreciation to our colleagues, students, and staff at all of the institutions we have been fortunate to have been affiliated with since the first edition was conceived more than 25 years ago. This includes not only our primary university affiliations but also the locations where we spent extended sabbatical leaves and special research assignments. For this edition we would like to offer special thanks to Sharon Schwartz and Melanie Wall from Columbia University who took the lead in writing the two final sections of the book dealing with mediation and additive interaction. We benefited greatly from their expertise in applying these methods in epidemiologic settings. We greatly appreciate the efforts of Danielle Sullivan, a PhD candidate in biostatistics at Ohio State, for assisting in the preparation of the index for this book. Colleagues in the Division of Biostatistics and the Division of Epidemiology at Ohio State were helpful in their review of selected sections of the book. These include Bo Lu for his insights on propensity score methods and David Murray, Sigrún Alba Jóhannesdóttir, and Morten Schmidt for their thoughts concerning the sections on mediation analysis and additive interaction. Data sets form the basis for the way we present our materials and these are often hard to come by. We are very grateful to Karla Zadnik, Donald O. Mutti, Loraine T. Sinnott, and Lisa A. Jones-Jordan from The Ohio State University College of Optometry as well as to the Collaborative Longitudinal Evaluation of Ethnicity and Refractive Error (CLEERE) Study Group for making the myopia data available to us. We would also like to acknowledge Cynthia A. Fontanella from the College of Social Work at Ohio State for making both the Adolescent Placement and the Polypharmacy data sets available to us. 
A special thank you to Gary Phillips from the Center for Biostatistics at OSU for helping us identify these valuable data sets (that he was the first one to analyze) as well as for his assistance with some programming issues with Stata. We thank Gordon Fitzgerald of the Center for Outcomes Research (COR) at the University of Massachusetts / Worcester for his help in obtaining the small subset of data used in this text from the Global Longitudinal Study of Osteoporosis in Women (GLOW) Study's main data set. In addition, we thank him for his many helpful comments on the use of propensity scores in logistic regression modeling. We thank Turner Osler for providing us with the small subset of data obtained from a large data set he abstracted from the National Burn Repository 2007 Report, that we used for the burn injury analyses. In many instances the data sets we used were modified from the original data sets in ways to allow us to illustrate important modeling techniques. As such, we issue a general disclaimer here, and do so again throughout the text, that results presented in this text do not apply to the original data.

Before we began this revision, numerous individuals reviewed our proposal anonymously and made many helpful suggestions. They confirmed that what we planned to include in this book would be of use to them in their research and teaching. We thank these individuals and, for the most part, addressed their comments. Many of these reviewers suggested that we include computer code to run logistic regression in a variety of packages, especially R. We decided not to do this for two reasons: we are not statistical computing specialists and did not want to have to spend time responding to email queries on our code. Also, capabilities of computer packages change rapidly and we realized that whatever we decided to include here would likely be out of date before the book was even published. We refer readers interested in code specific to various packages to a web site maintained by Academic Technology Services (ATS) at UCLA where they use a variety of statistical packages to replicate the analyses for the examples in the second edition of this text as well as numerous other statistical texts. The link to this web site is http://www.ats.ucla.edu/stat/.

Finally, we would like to thank Steve Quigley, Susanne Steitz-Filler, Sari Friedman and the production staff at John Wiley & Sons Inc. for their help in bringing this project to completion.

David W. Hosmer, Jr.

Stanley Lemeshow

Rodney X. Sturdivant¹

Stowe, Vermont

Columbus, Ohio

West Point, New York

January 2013

¹ The views expressed in this book are those of the author and do not reflect the official policy or position of the Department of the Army, Department of Defense, or the U.S. Government.

Chapter 1: Introduction to the Logistic Regression Model

1.1 Introduction

Regression methods have become an integral component of any data analysis concerned with describing the relationship between a response variable and one or more explanatory variables. Quite often the outcome variable is discrete, taking on two or more possible values. The logistic regression model is the most frequently used regression model for the analysis of these data.

Before beginning a thorough study of the logistic regression model it is important to understand that the goal of an analysis using this model is the same as that of any other regression model used in statistics, that is, to find the best fitting and most parsimonious, clinically interpretable model to describe the relationship between an outcome (dependent or response) variable and a set of independent (predictor or explanatory) variables. The independent variables are often called covariates. The most common example of modeling, and one assumed to be familiar to the readers of this text, is the usual linear regression model where the outcome variable is assumed to be continuous.

What distinguishes a logistic regression model from the linear regression model is that the outcome variable in logistic regression is binary or dichotomous. This difference between logistic and linear regression is reflected both in the form of the model and its assumptions. Once this difference is accounted for, the methods employed in an analysis using logistic regression follow, more or less, the same general principles used in linear regression. Thus, the techniques used in linear regression analysis motivate our approach to logistic regression. We illustrate both the similarities and differences between logistic regression and linear regression with an example.

Example 1: Table 1.1 lists the age in years (AGE), and presence or absence of evidence of significant coronary heart disease (CHD) for 100 subjects in a hypothetical study of risk factors for heart disease. The table also contains an identifier variable (ID) and an age group variable (AGEGRP). The outcome variable is CHD, which is coded with a value of 0 to indicate that CHD is absent, or 1 to indicate that it is present in the individual. In general, any two values could be used, but we have found it most convenient to use zero and one. We refer to this data set as the CHDAGE data.

Table 1.1 Age, Age Group, and Coronary Heart Disease (CHD) Status of 100 Subjects


It is of interest to explore the relationship between AGE and the presence or absence of CHD in this group. Had our outcome variable been continuous rather than binary, we probably would begin by forming a scatterplot of the outcome versus the independent variable. We would use this scatterplot to provide an impression of the nature and strength of any relationship between the outcome and the independent variable. A scatterplot of the data in Table 1.1 is given in Figure 1.1.

Figure 1.1 Scatterplot of presence or absence of coronary heart disease (CHD) by AGE for 100 subjects.


In this scatterplot, all points fall on one of two parallel lines representing the absence of CHD ($y = 0$) or the presence of CHD ($y = 1$). There is some tendency for the individuals with no evidence of CHD to be younger than those with evidence of CHD. While this plot does depict the dichotomous nature of the outcome variable quite clearly, it does not provide a clear picture of the nature of the relationship between CHD and AGE.

The main problem with Figure 1.1 is that the variability in CHD at all ages is large. This makes it difficult to see any functional relationship between AGE and CHD. One common method of removing some variation, while still maintaining the structure of the relationship between the outcome and the independent variable, is to create intervals for the independent variable and compute the mean of the outcome variable within each group. We use this strategy by grouping age into the categories (AGEGRP) defined in Table 1.1. Table 1.2 contains, for each age group, the frequency of occurrence of each outcome, as well as the percent with CHD present.
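The grouping-and-averaging strategy described above can be sketched in a few lines of Python. This is only an illustration: the `records` list and the `intervals` below are hypothetical values made up for the sketch, not the CHDAGE data or the AGEGRP categories of Table 1.1, and `group_means` is a helper name introduced here.

```python
from collections import defaultdict

# Hypothetical (age, chd) pairs for illustration only; NOT the CHDAGE data.
records = [(22, 0), (27, 0), (31, 0), (34, 1), (38, 0), (42, 1),
           (45, 0), (48, 1), (53, 1), (57, 1), (61, 0), (66, 1)]

# Hypothetical age intervals given as (lower, upper) bounds, both inclusive.
intervals = [(20, 39), (40, 59), (60, 69)]


def group_means(records, intervals):
    """Return {interval: (n, number with CHD, proportion with CHD)}."""
    counts = defaultdict(lambda: [0, 0])  # interval -> [n, number with CHD]
    for age, chd in records:
        for lo, hi in intervals:
            if lo <= age <= hi:
                counts[(lo, hi)][0] += 1
                counts[(lo, hi)][1] += chd
                break
    return {iv: (n, k, k / n) for iv, (n, k) in counts.items()}


summary = group_means(records, intervals)
for (lo, hi), (n, k, p) in sorted(summary.items()):
    midpoint = (lo + hi) / 2  # value one would plot on the horizontal axis
    print(f"{lo}-{hi} (midpoint {midpoint}): n={n}, CHD present={k}, mean={p:.2f}")
```

Because the outcome is coded 0 or 1, the mean of the outcome within an interval is simply the proportion of subjects in that interval with CHD present.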

Table 1.2 Frequency Table of Age Group by CHD


By examining this table, a clearer picture of the relationship begins to emerge. It shows that as age increases, the proportion (mean) of individuals with evidence of CHD increases. Figure 1.2 presents a plot of the percent of individuals with CHD versus the midpoint of each age interval. This plot provides considerable insight into the relationship between CHD and AGE in this study, but the functional form for this relationship needs to be described. The plot in this figure is similar to what one might obtain if this same process of grouping and averaging were performed in a linear regression. We note two important differences.

Figure 1.2 Plot of the percentage of subjects with CHD in each AGE group.


The first difference concerns the nature of the relationship between the outcome and independent variables. In any regression problem the key quantity is the mean value of the outcome variable, given the value of the independent variable. This quantity is called the conditional mean and is expressed as $E(Y \mid x)$, where $Y$ denotes the outcome variable and $x$ denotes a specific value of the independent variable. The quantity $E(Y \mid x)$ is read "the expected value of $Y$, given the value $x$." In linear regression we assume that this mean may be expressed as an equation linear in $x$ (or some transformation of $x$ or $Y$), such as

$$E(Y \mid x) = \beta_0 + \beta_1 x.$$

This expression implies that it is possible for $E(Y \mid x)$ to take on any value as $x$ ranges between $-\infty$ and $+\infty$.

The column labeled Mean in Table 1.2 provides an estimate of $E(Y \mid x)$. We assume, for purposes of exposition, that the estimated values plotted in Figure 1.2 are close enough to the true values of $E(Y \mid x)$ to provide a reasonable assessment of the functional relationship between CHD and AGE. With a dichotomous outcome variable, the conditional mean must be greater than or equal to zero and less than or equal to one (i.e., $0 \le E(Y \mid x) \le 1$). This can be seen in Figure 1.2. In addition, the plot shows that this mean approaches zero and one gradually. The change in $E(Y \mid x)$ per unit change in $x$ becomes progressively smaller as the conditional mean gets closer to zero or one. The curve is said to be S-shaped and resembles a plot of the cumulative distribution of a continuous random variable. Thus, it should not seem surprising that some well-known cumulative distributions have been used to provide a model for $E(Y \mid x)$ in the case when $Y$ is dichotomous. The model we use is based on the logistic distribution.

Many distribution functions have been proposed for use in the analysis of a dichotomous outcome variable. Cox and Snell (1989) discuss some of these. There are two primary reasons for choosing the logistic distribution. First, from a mathematical point of view, it is an extremely flexible and easily used function. Second, its model parameters provide the basis for clinically meaningful estimates of effect. A detailed discussion of the interpretation of the model parameters is given in Chapter 3.

In order to simplify notation, we use the quantity $\pi(x) = E(Y \mid x)$ to represent the conditional mean of $Y$ given $x$ when the logistic distribution is used. The specific form of the logistic regression model we use is:

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} \tag{1.1}$$

A transformation of $\pi(x)$ that is central to our study of logistic regression is the logit transformation. This transformation is defined, in terms of $\pi(x)$, as:

$$g(x) = \ln\left[\frac{\pi(x)}{1 - \pi(x)}\right] = \beta_0 + \beta_1 x.$$

The importance of this transformation is that $g(x)$ has many of the desirable properties of a linear regression model. The logit, $g(x)$, is linear in its parameters, may be continuous, and may range from $-\infty$ to $+\infty$, depending on the range of $x$.
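A minimal Python sketch may help make the relationship between the model in equation 1.1 and its logit concrete. It assumes the single-covariate form, in which the conditional mean is e^(b0 + b1 x) / (1 + e^(b0 + b1 x)) and the logit is the log of the odds. The coefficient values and the function names `logistic_mean` and `logit` are hypothetical choices for illustration, not estimates from the text.

```python
import math


def logistic_mean(x, b0, b1):
    """Conditional mean under the logistic model of equation 1.1."""
    eta = b0 + b1 * x
    # 1 / (1 + e^(-eta)) is algebraically equal to e^eta / (1 + e^eta).
    return 1.0 / (1.0 + math.exp(-eta))


def logit(p):
    """Logit transformation: natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, chosen only to show the S-shaped curve.
b0, b1 = -4.0, 0.1

for x in (0, 20, 40, 60, 80):
    p = logistic_mean(x, b0, b1)
    # p always lies strictly between 0 and 1; logit(p) recovers b0 + b1 * x.
    print(f"x={x:2d}  mean={p:.4f}  logit={logit(p):.4f}  linear={b0 + b1 * x:.4f}")
```

The last two printed columns agree for every value of x: on the logit scale the model is linear in the parameters, even though the conditional mean itself is bounded between zero and one.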

The second important difference between the linear and logistic regression models concerns the conditional distribution of the outcome variable. In the linear regression model we assume that an observation of the outcome variable may be expressed as $y = E(Y \mid x) + \varepsilon$. The quantity $\varepsilon$ is called the error and expresses an observation's deviation from the conditional mean. The most common assumption is that $\varepsilon$ follows a normal distribution with mean zero and some variance that is constant across levels of the independent variable. It follows that the conditional distribution of the outcome variable given $x$ is normal with mean $E(Y \mid x)$, and a variance that is constant. This is not the case with a dichotomous outcome variable. In this situation, we may express the value of the outcome variable given $x$ as $y = \pi(x) + \varepsilon$. Here the quantity $\varepsilon$ may assume one of two possible values. If $y = 1$ then $\varepsilon = 1 - \pi(x)$ with probability $\pi(x)$, and if $y = 0$ then $\varepsilon = -\pi(x)$ with probability $1 - \pi(x)$. Thus, $\varepsilon$ has a distribution with mean zero and variance equal to $\pi(x)[1 - \pi(x)]$. That is, the conditional distribution of the outcome variable follows a binomial distribution with probability given by the conditional mean, $\pi(x)$.
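The mean and variance of the error in the dichotomous case follow directly from its two-point distribution. As a short worked check, writing the conditional mean as $\pi(x)$, the error equals $1 - \pi(x)$ with probability $\pi(x)$ and $-\pi(x)$ with probability $1 - \pi(x)$, so that

$$E(\varepsilon) = [1 - \pi(x)]\,\pi(x) + [-\pi(x)]\,[1 - \pi(x)] = 0$$

and, because the mean is zero,

$$\operatorname{Var}(\varepsilon) = [1 - \pi(x)]^2\,\pi(x) + [\pi(x)]^2\,[1 - \pi(x)] = \pi(x)[1 - \pi(x)]\{[1 - \pi(x)] + \pi(x)\} = \pi(x)[1 - \pi(x)].$$

Unlike the linear regression case, this variance is not constant: it depends on $x$ through the conditional mean.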

In summary, we have shown that in a regression analysis when the outcome variable is dichotomous:

1. The model for the conditional mean of the regression equation must be bounded between zero and one. The logistic regression model, $\pi(x)$, given in equation 1.1, satisfies this constraint.

2. The binomial, not the normal, distribution describes the distribution of the errors and is the statistical distribution on which the analysis is based.

3. The principles that guide an analysis using linear regression also guide us in logistic regression.

1.2 Fitting the Logistic Regression Model

Suppose we have a sample of $n$ independent observations of the pair $(x_i, y_i)$, $i = 1, 2, \ldots, n$, where $y_i$ denotes the value of a dichotomous outcome variable and $x_i$ is the value of the independent variable for the $i$th subject. Furthermore, assume that the outcome variable has been coded as 0 or 1, representing the absence or the presence of the characteristic, respectively. This coding for a dichotomous outcome is used throughout the text. Fitting the logistic regression model in equation 1.1 to a set of data requires that we estimate the values of $\beta_0$ and $\beta_1$, the unknown parameters.

In linear regression, the method used most often for estimating unknown parameters is least squares. In that method we choose those values of $\beta_0$ and $\beta_1$ that minimize the sum-of-squared deviations of the observed values of $Y$ from the predicted values based on the model. Under the usual assumptions for linear regression the method of least squares yields estimators with a number of desirable statistical properties. Unfortunately, when the method of least squares is applied to a model with a dichotomous outcome, the estimators no longer have these same properties.

The general method of estimation that leads to the least squares function under the linear regression model (when the error terms are normally distributed) is called maximum likelihood. This method provides the foundation for our approach to estimation with the logistic regression model throughout this text. In a general sense, the method of maximum likelihood yields values for the unknown parameters that maximize the probability of obtaining the observed set of data. In order to apply this method we must first construct a function, called the likelihood function. This function expresses the probability of the observed data as a function of the unknown parameters. The maximum likelihood estimators of the parameters are the values that maximize this function. Thus, the resulting estimators are those that agree most closely with the observed data. We now describe how to find these values for the logistic regression model.
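The claim that maximum likelihood leads to least squares under normal errors can be verified in one step. As a sketch, assume the linear model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ with independent errors that are normal with mean zero and constant variance $\sigma^2$ (the symbol $\sigma^2$ is introduced here for the illustration). The log of the likelihood of the $n$ observations is then

$$-\frac{n}{2}\ln\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right)^2,$$

where $\pi$ in the first term is the mathematical constant, not a conditional mean. The first term does not involve $\beta_0$ or $\beta_1$, and the second is a negative multiple of the sum-of-squared deviations. For any fixed $\sigma^2$, the values of $\beta_0$ and $\beta_1$ that maximize this expression are therefore exactly those that minimize the sum of squares.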

If $Y$ is coded as 0 or 1 then the expression for $\pi(x)$ given in equation 1.1 provides (for an arbitrary value of $\boldsymbol{\beta} = (\beta_0, \beta_1)$, the vector of parameters) the conditional probability that $Y$ is equal to 1 given $x$. This is denoted as $\Pr(Y = 1 \mid x)$. It follows that the quantity $1 - \pi(x)$ gives the conditional probability that $Y$ is equal to zero given $x$, $\Pr(Y = 0 \mid x)$. Thus, for those pairs $(x_i, y_i)$, where $y_i = 1$, the contribution to the likelihood function is $\pi(x_i)$, and for those pairs where $y_i = 0$, the contribution to the likelihood function is $1 - \pi(x_i)$, where the quantity $\pi(x_i)$ denotes the value of $\pi(x)$ computed at $x_i$. A convenient way to express the contribution to the likelihood function for the pair $(x_i, y_i)$ is through the expression

$$\pi(x_i)^{y_i}\,[1 - \pi(x_i)]^{1 - y_i}. \tag{1.2}$$

As the observations are assumed to be independent, the likelihood function is obtained as the product of the terms given in equation 1.2 as follows:

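$$l(\boldsymbol{\beta}) = \prod_{i=1}^{n} \pi(x_i)^{y_i}\,[1 - \pi(x_i)]^{1 - y_i}. \tag{1.3}$$

The principle of maximum likelihood states that we use as our estimate of $\boldsymbol{\beta}$ the value that maximizes the expression in equation 1.3. However, it is easier mathematically to work with the log of equation 1.3. This expression, the log-likelihood, is defined as

$$L(\boldsymbol{\beta}) = \ln[l(\boldsymbol{\beta})] = \sum_{i=1}^{n} \left\{ y_i \ln[\pi(x_i)] + (1 - y_i)\ln[1 - \pi(x_i)] \right\}. \tag{1.4}$$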
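To make the likelihood and log-likelihood concrete, the following Python sketch evaluates both for a small data set. It assumes the standard single-covariate logistic model, in which each observation contributes its fitted probability raised to the power y times one minus that probability raised to the power 1 - y. The data in `xs` and `ys` and the trial coefficient values are hypothetical, and the function names are introduced here for illustration.

```python
import math


def logistic_mean(x, b0, b1):
    """Conditional probability that Y = 1 given x under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def likelihood(xs, ys, b0, b1):
    """Product over observations of p^y * (1 - p)^(1 - y)."""
    value = 1.0
    for x, y in zip(xs, ys):
        p = logistic_mean(x, b0, b1)
        value *= p ** y * (1.0 - p) ** (1 - y)
    return value


def log_likelihood(xs, ys, b0, b1):
    """Sum over observations of y*ln(p) + (1 - y)*ln(1 - p)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = logistic_mean(x, b0, b1)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Hypothetical data for illustration only; NOT the CHDAGE data.
xs = [25, 30, 35, 40, 45, 50, 55, 60]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

# Two hypothetical trial values of the parameter vector.
for b0, b1 in [(0.0, 0.0), (-4.0, 0.1)]:
    print(b0, b1, likelihood(xs, ys, b0, b1), log_likelihood(xs, ys, b0, b1))
```

With both coefficients set to zero every fitted probability is 0.5, so the log-likelihood is n times ln(0.5). A parameter value that agrees better with the data produces a larger (less negative) log-likelihood, which is the quantity maximum likelihood estimation seeks to maximize.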

To find the value of $\boldsymbol{\beta}$ that maximizes $L(\boldsymbol{\beta})$ we differentiate $L(\boldsymbol{\beta})$ with respect to $\beta_0$ and $\beta_1$ and set the resulting expressions equal to zero. These equations, known as the likelihood equations, are

$$\sum [y_i - \pi(x_i)] = 0 \tag{1.5}$$

and

$$\sum x_i\,[y_i - \pi(x_i)] = 0. \tag{1.6}$$

In equations 1.5 and 1.6 it is understood that the summation is over $i$ varying from 1 to $n$. (The practice of suppressing the index and range of summation, when these are clear, is followed throughout this text.)

In linear regression, the likelihood equations, obtained by differentiating the sum-of-squared deviations function with respect to $\boldsymbol{\beta}$, are linear in the unknown parameters and thus are easily solved. For logistic regression the expressions in equations 1.5 and 1.6 are nonlinear in $\beta_0$ and $\beta_1$, and thus require special methods for their solution. These methods are iterative in nature and have been programmed into logistic regression software. For the moment, we need not be concerned about these iterative methods and view them as a computational detail that is taken care of for us. The interested reader may consult the text by McCullagh and Nelder (1989) for a general discussion of the methods used by most programs. In particular, they show that the solution to equations 1.5 and 1.6 may be obtained using an iterative weighted least squares procedure.
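For readers curious about what such an iterative solution looks like, here is a minimal Python sketch. It uses Newton-Raphson iteration on the two likelihood equations, which for the logistic model leads to the same updates as iterative weighted least squares. This is an illustration under stated assumptions, not the algorithm of any particular package: the data are hypothetical, the starting values and tolerance are arbitrary choices, and no safeguards (step halving, detection of separated data) are included.

```python
import math


def logistic_mean(x, b0, b1):
    """Conditional probability that Y = 1 given x under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def fit_logistic(xs, ys, tol=1e-10, max_iter=25):
    """Solve the two likelihood equations by Newton-Raphson iteration."""
    b0, b1 = 0.0, 0.0  # arbitrary starting values
    for _ in range(max_iter):
        ps = [logistic_mean(x, b0, b1) for x in xs]
        # Left-hand sides of the likelihood equations (the score).
        u0 = sum(y - p for y, p in zip(ys, ps))
        u1 = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))
        # Information matrix entries, built from the weights p * (1 - p).
        ws = [p * (1.0 - p) for p in ps]
        i00 = sum(ws)
        i01 = sum(x * w for x, w in zip(xs, ws))
        i11 = sum(x * x * w for x, w in zip(xs, ws))
        # Newton step: solve the 2x2 linear system (information) * d = score.
        det = i00 * i11 - i01 * i01
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


# Hypothetical data for illustration only; NOT the CHDAGE data.
xs = [25, 30, 35, 40, 45, 50, 55, 60]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0_hat, b1_hat = fit_logistic(xs, ys)
fitted = [logistic_mean(x, b0_hat, b1_hat) for x in xs]

# At the solution both likelihood equations are satisfied (up to rounding).
score0 = sum(y - p for y, p in zip(ys, fitted))
score1 = sum(x * (y - p) for x, y, p in zip(xs, ys, fitted))
print("estimates:", b0_hat, b1_hat)
print("likelihood equations:", score0, score1)
print("sum of observed:", sum(ys), "sum of fitted:", sum(fitted))
```

Because the first likelihood equation holds at the solution, the sum of the fitted probabilities equals the sum of the observed outcomes. In practice one would rely on established statistical software, which adds convergence diagnostics and standard errors.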

The value of $\boldsymbol{\beta}$ given by the solution to equations 1.5 and 1.6 is called the maximum likelihood estimate and is denoted as $\hat{\boldsymbol{\beta}}$. In general, the use of the symbol "^" denotes the maximum likelihood estimate of the respective quantity. For example, $\hat{\pi}(x_i)$ is the maximum likelihood estimate of $\pi(x_i)$. This quantity provides an estimate of the conditional probability that $Y$ is equal to 1, given that $x$ is equal to $x_i$. As such, it represents the fitted or predicted value for the logistic regression model. An interesting consequence of equation 1.5 is that

$$\sum y_i = \sum \hat{\pi}(x_i).$$

That is, the sum of the observed values of $y$ is equal to the sum of the predicted (expected) values. We use this property in later chapters when we discuss assessing the fit of the model.

As an example, consider the data given in Table 1.1. Use of a logistic regression software package, with continuous variable AGE as the independent variable, produces the output in Table 1.3.

Table 1.3 Results of Fitting the Logistic Regression Model to the CHDAGE Data, n = 100


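The maximum likelihood estimates of $\beta_0$ and $\beta_1$ are $\hat{\beta}_0 = -5.309$ and $\hat{\beta}_1 = 0.111$. The fitted values are given by the equation

$$\hat{\pi}(x) = \frac{e^{-5.309 + 0.111 \times \text{AGE}}}{1 + e^{-5.309 + 0.111 \times \text{AGE}}} \tag{1.7}$$

and the estimated logit, $\hat{g}(x)$, is given by the equation

$$\hat{g}(x) = -5.309 + 0.111 \times \text{AGE}. \tag{1.8}$$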

The log-likelihood given in Table 1.3 is the value of equation 1.4 computed using $\hat{\beta}_0$ and $\hat{\beta}_1$.

Three additional columns are present in Table 1.3. One contains estimates of the standard errors of the estimated coefficients, the next column displays the ratios of the estimated coefficients to their estimated standard errors, and the last column displays a $p$-value. These quantities are discussed in the next section.

Following the fitting of the model we begin to evaluate its adequacy.

1.3 Testing for the Significance of the Coefficients

In practice, the modeling of a set of data, as we show in Chapters 4, 7, and 8, is a much more complex process than one of simply fitting and testing. The methods we present in this section, while simplistic, do provide essential building blocks for the more complex process.

After estimating the coefficients, our first look at the fitted model commonly concerns an assessment of the significance of the variables in the model. This usually involves formulation and testing of a statistical hypothesis to determine whether the independent variables in the model are significantly related to the outcome variable. The method for performing this test is quite general, and differs from one type of model to the next only in the specific details. We begin by discussing the general approach for a single independent variable. The multivariable case is considered in Chapter 2.

One approach to testing for the significance of the coefficient of a variable in any model relates to the following question. Does the model that includes the variable in question tell us more about the outcome (or response) variable than a model that does not include that variable? This question is answered by comparing the observed values of the response variable to those predicted by each of two models; the first with, and the second without, the variable in question. The mathematical function used to compare the observed and predicted values depends on the particular problem. If the predicted values with the variable in the model are better, or more accurate in some sense, than when the variable is not in the model, then we feel that the variable in question is significant. It is important to note that we are not considering the question of whether the predicted values are an accurate representation of the observed values in an absolute sense (this is called goodness of fit). Instead, our question is posed in a relative sense. The assessment of goodness of fit is a more complex question that is discussed in detail in Chapter 5.

The general method for assessing significance of variables is easily illustrated in the linear regression model, and its use there motivates the approach used for logistic regression. A comparison of the two approaches highlights the differences between modeling continuous and dichotomous response variables.

In linear regression, one assesses the significance of the slope coefficient by forming what is referred to as an analysis of variance table. This table partitions the total sum-of-squared deviations of observations about their mean into two parts: (1) the sum-of-squared deviations of observations about the regression line, SSE (or residual sum-of-squares), and (2) the sum-of-squares of predicted values, based on the regression model, about the mean of the dependent variable, SSR (or sum-of-squares due to regression). This is just a convenient way of displaying the comparison of observed to predicted values under two models. In linear regression, the comparison of observed and predicted values is based on the square of the distance between the two. If $y_i$ denotes the observed value and $\hat{y}_i$ denotes the predicted value for the $i$th individual under the model, then the statistic used to evaluate this comparison is

$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$

Under the model not containing the independent variable in question the only parameter is $\beta_0$, and $\hat{\beta}_0 = \bar{y}$, the mean of the response variable. In this case, $\hat{y}_i = \bar{y}$ and SSE is equal to the total sum-of-squares. When we include the independent variable in the model, any decrease in SSE is due to the fact that the slope coefficient for the independent variable is not zero. The change in the value of SSE is due to the regression source of variability, denoted SSR. That is,

$$\text{SSR} = \left[\sum_{i=1}^{n} (y_i - \bar{y})^2\right] - \left[\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\right].$$

In linear regression, interest focuses on the size of SSR. A large value suggests that the independent variable is important, whereas a small value suggests that the independent variable is not helpful in predicting the response.
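The partition of the total sum-of-squares can be checked numerically. The following is a small sketch on made-up data; the function name `anova_partition` and the five data points are hypothetical and serve only to show that SSR is the drop in SSE when the slope is added.

```python
def anova_partition(x, y):
    """Return (total SS, SSE, SSR) for a simple least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # least squares slope
    b0 = y_bar - b1 * x_bar             # least squares intercept
    y_hat = [b0 + b1 * xi for xi in x]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    total = sum((yi - y_bar) ** 2 for yi in y)  # SSE of the intercept-only model
    return total, sse, total - sse

# Made-up illustrative data.
total_ss, sse, ssr = anova_partition([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(total_ss, sse, ssr)
```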

The guiding principle with logistic regression is the same: compare observed values of the response variable to predicted values obtained from models, with and without the variable in question. In logistic regression, comparison of observed to predicted values is based on the log-likelihood function defined in equation 1.4. To better understand this comparison, it is helpful conceptually to think of an observed value of the response variable as also being a predicted value resulting from a saturated model. A saturated model is one that contains as many parameters as there are data points. (A simple example of a saturated model is fitting a linear regression model when there are only two data points, $n = 2$.)

The comparison of observed to predicted values using the likelihood function is based on the following expression:

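$$D = -2\ln\left[\frac{\text{likelihood of the fitted model}}{\text{likelihood of the saturated model}}\right]. \tag{1.9}$$

The quantity inside the large brackets in the expression above is called the likelihood ratio. Using minus twice its log is necessary to obtain a quantity whose distribution is known and can therefore be used for hypothesis testing purposes. Such a test is called the likelihood ratio test. Using equation 1.4, equation 1.9 becomes

$$D = -2\sum_{i=1}^{n}\left[y_i \ln\left(\frac{\hat{\pi}_i}{y_i}\right) + (1 - y_i)\ln\left(\frac{1 - \hat{\pi}_i}{1 - y_i}\right)\right], \tag{1.10}$$

where $\hat{\pi}_i = \hat{\pi}(x_i)$.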

The statistic, $D$, in equation 1.10 is called the deviance, and for logistic regression, it plays the same role that the residual sum-of-squares plays in linear regression. In fact, the deviance as shown in equation 1.10, when computed for linear regression, is identically equal to the SSE.

Furthermore, in a setting as shown in Table 1.1, where the values of the outcome variable are either 0 or 1, the likelihood of the saturated model is identically equal to 1.0. Specifically, it follows from the definition of a saturated model that $\hat{\pi}_i = y_i$ and the likelihood is

$$l(\text{saturated model}) = \prod_{i=1}^{n} y_i^{y_i} \times (1 - y_i)^{(1 - y_i)} = 1.0.$$

Thus it follows from equation 1.9 that the deviance is

$$D = -2\ln(\text{likelihood of the fitted model}). \tag{1.11}$$

Some software packages report the value of the deviance in equation 1.11 rather than the log-likelihood for the fitted model. In the context of testing for the significance of a fitted model, we want to emphasize that we think of the deviance in the same way that we think of the residual sum-of-squares in linear regression.

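In particular, to assess the significance of an independent variable we compare the value of $D$ with and without the independent variable in the equation. The change in $D$ due to the inclusion of the independent variable in the model is:

$$G = D(\text{model without the variable}) - D(\text{model with the variable}).$$

This statistic, $G$, plays the same role in logistic regression that the numerator of the partial $F$-test does in linear regression. Because the likelihood of the saturated model is always common to both values of $D$ being differenced, $G$ can be expressed as

$$G = -2\ln\left[\frac{\text{likelihood without the variable}}{\text{likelihood with the variable}}\right]. \tag{1.12}$$

For the specific case of a single independent variable, it is easy to show that when the variable is not in the model, the maximum likelihood estimate of $\beta_0$ is $\ln(n_1/n_0)$, where $n_1 = \sum y_i$ and $n_0 = \sum (1 - y_i)$, and the predicted probability for all subjects is constant and equal to $n_1/n$. In this setting, the value of $G$ is:

$$G = -2\ln\left[\frac{\left(\dfrac{n_1}{n}\right)^{n_1}\left(\dfrac{n_0}{n}\right)^{n_0}}{\prod_{i=1}^{n}\hat{\pi}_i^{y_i}(1 - \hat{\pi}_i)^{(1 - y_i)}}\right] \tag{1.13}$$

or

$$G = 2\left\{\sum_{i=1}^{n}\left[y_i\ln(\hat{\pi}_i) + (1 - y_i)\ln(1 - \hat{\pi}_i)\right] - \left[n_1\ln(n_1) + n_0\ln(n_0) - n\ln(n)\right]\right\}. \tag{1.14}$$

Under the hypothesis that $\beta_1$ is equal to zero, the statistic $G$ follows a chi-square distribution with 1 degree of freedom. Additional mathematical assumptions are needed; however, for the above case they are rather nonrestrictive, and involve having a sufficiently large sample size, $n$, and enough subjects with both $y = 0$ and $y = 1$. We discuss in later chapters that, as far as sample size is concerned, the key determinant is $\min(n_0, n_1)$.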

As an example, we consider the model fit to the data in Table 1.1, whose estimated coefficients and log-likelihood are given in Table 1.3. For these data the sample size is sufficiently large as $n_1 = 43$ and $n_0 = 57$. Evaluating $G$ as shown in equation 1.14 yields

$$G = 2\left\{-53.677 - \left[43\ln(43) + 57\ln(57) - 100\ln(100)\right]\right\} = 29.31.$$

The first term in this expression is the log-likelihood from the model containing age (see Table 1.3), and the remainder of the expression simply substitutes $n_1$ and $n_0$ into the second part of equation 1.14. We use the symbol $\chi^2(\nu)$ to denote a chi-square random variable with $\nu$ degrees of freedom. Using this notation, the $p$-value associated with this test is $P[\chi^2(1) > 29.31] < 0.001$; thus, we have convincing evidence that AGE is a significant variable in predicting CHD. This is merely a statement of the statistical evidence for this variable. Other important factors to consider before concluding that the variable is clinically important would include the appropriateness of the fitted model, as well as inclusion of other potentially important variables.

As all logistic regression software packages report either the value of the log-likelihood or the value of $D$, it is easy to check for the significance of the addition of new terms to the model or to verify a reported value of $G$. In the simple case of a single independent variable, we first fit a model containing only the constant term. Next, we fit a model containing the independent variable along with the constant. This gives rise to another log-likelihood. The likelihood ratio test is obtained by multiplying the difference between these two values by $-2$.

In the current example, the log-likelihood for the model containing only a constant term is $-68.331$. Fitting a model containing the independent variable (AGE) along with the constant term results in the log-likelihood shown in Table 1.3 of $-53.677$. Multiplying the difference in these log-likelihoods by $-2$ gives

$$-2 \times [-68.331 - (-53.677)] = 29.31.$$

This result, along with the associated $p$-value for the chi-square distribution, is commonly reported in logistic regression software packages.
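The arithmetic of the likelihood ratio test is easy to reproduce by hand. The sketch below assumes the example's inputs are $n_1 = 43$, $n_0 = 57$ and a fitted log-likelihood of $-53.677$; these inputs, and the helper names `null_loglik` and `lr_test`, are assumptions made for illustration. The $p$-value uses the identity $P[\chi^2(1) > g] = \operatorname{erfc}(\sqrt{g/2})$, which holds for 1 degree of freedom only.

```python
import math

def null_loglik(n1, n0):
    """Log-likelihood of the constant-only model, where every fitted value is n1/n."""
    n = n1 + n0
    return n1 * math.log(n1 / n) + n0 * math.log(n0 / n)

def lr_test(loglik_without, loglik_with):
    """Likelihood ratio statistic G and its chi-square(1) p-value."""
    g = -2.0 * (loglik_without - loglik_with)
    p_value = math.erfc(math.sqrt(g / 2.0))  # valid for 1 degree of freedom only
    return g, p_value

# Assumed inputs for the AGE example: 43 events, 57 non-events,
# and a fitted log-likelihood of -53.677.
l0 = null_loglik(43, 57)
g_stat, p_value = lr_test(l0, -53.677)
print(round(l0, 3), round(g_stat, 2), p_value)
```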

There are two other statistically equivalent tests: the Wald test and the score test. The assumptions needed for each of these are the same as those of the likelihood ratio test in equation 1.14. A more complete discussion of these three tests and their assumptions may be found in Rao (1973).

The Wald test is equal to the ratio of the maximum likelihood estimate of the slope parameter, $\hat{\beta}_1$, to an estimate of its standard error. Under the null hypothesis and the sample size assumptions, this ratio follows a standard normal distribution. While we have not yet formally discussed how the estimates of the standard errors of the estimated parameters are obtained, they are routinely printed out by computer software. For example, the Wald test for the coefficient for AGE in Table 1.3 is provided in the column headed $z$ and is

$$W = \frac{\hat{\beta}_1}{\widehat{\text{SE}}(\hat{\beta}_1)} = \frac{0.111}{0.0241} = 4.61.$$

The two-tailed $p$-value, provided in the last column of Table 1.3, is $P(|z| > 4.61) < 0.001$, where $z$ denotes a random variable following the standard normal distribution. Some software packages display the statistic $W^2 = z^2$, which is distributed as chi-square with 1 degree of freedom. Hauck and Donner (1977) examined the performance of the Wald test and found that it behaved in an aberrant manner, often failing to reject the null hypothesis when the coefficient was significant using the likelihood ratio test. Thus, they recommended (and we agree) that the likelihood ratio test is preferred. We note that while the assertions of Hauck and Donner are true, we have never seen huge differences in the values of $G$ and $W^2$. In practice, the more troubling situation is when the values are close, and one test has $p < 0.05$ and the other has $p > 0.05$. When this occurs, we use the $p$-value from the likelihood ratio test.
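As a quick check of the Wald calculation, the sketch below divides an estimate by its standard error and converts the ratio to a two-tailed normal $p$-value. The inputs 0.111 and 0.0241 are taken as the AGE coefficient and its standard error, which is an assumption about the contents of Table 1.3, and `wald_test` is a hypothetical helper name.

```python
import math

def wald_test(estimate, std_error):
    """Wald ratio and its two-tailed standard normal p-value."""
    z = estimate / std_error
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2.0))  # P(|Z| > |z|)
    return z, p_two_tailed

# Assumed inputs for the AGE coefficient.
z_age, p_age = wald_test(0.111, 0.0241)
print(round(z_age, 2), p_age)
```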

A test for the significance of a variable that does not require computing the estimate of the coefficient is the score test. Proponents of the score test cite this reduced computational effort as its major advantage. Use of the test is limited by the fact that it is not available in many software packages. The score test is based on the distribution theory of the derivatives of the log-likelihood. In general, this is a multivariate test requiring matrix calculations that are discussed in Chapter 2.

In the univariate case, this test is based on the conditional distribution of the derivative in equation 1.6, given the derivative in equation 1.5. In this case, we can write down an expression for the score test. The test uses the value of equation 1.6 computed using $\beta_0 = \ln(n_1/n_0)$ and $\beta_1 = 0$. As noted earlier, under these parameter values, $\hat{\pi} = n_1/n = \bar{y}$ and the left-hand side of equation 1.6 becomes $\sum x_i(y_i - \bar{y})$. It may be shown that the estimated variance is $\bar{y}(1 - \bar{y})\sum (x_i - \bar{x})^2$. The test statistic for the score test (ST) is

$$\text{ST} = \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sqrt{\bar{y}(1 - \bar{y})\sum_{i=1}^{n}(x_i - \bar{x})^2}}.$$

As an example of the score test, consider the model fit to the data in Table 1.1. The value of the test statistic for this example is

$$\text{ST} = \frac{296.66}{\sqrt{3333.742}} = 5.14$$

and the two-tailed $p$-value is $P(|z| > 5.14) < 0.001$. We note that, for this example, the values of the three test statistics are nearly the same (note: $\sqrt{G} = 5.41$).
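Because the score test needs no fitted coefficients, it can be computed directly from the data. The following sketch applies the univariate formula to a made-up data set; the function name `score_test` and the ten observations are hypothetical and are not the data of Table 1.1.

```python
import math

def score_test(x, y):
    """Univariate score test statistic and two-tailed normal p-value."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Equation 1.6 evaluated at beta_1 = 0, where every fitted value is y_bar.
    numerator = sum(xi * (yi - y_bar) for xi, yi in zip(x, y))
    variance = y_bar * (1.0 - y_bar) * sum((xi - x_bar) ** 2 for xi in x)
    st = numerator / math.sqrt(variance)
    p_two_tailed = math.erfc(abs(st) / math.sqrt(2.0))
    return st, p_two_tailed

# Made-up illustrative data.
x_demo = [20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 65.0]
y_demo = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
st_demo, p_demo = score_test(x_demo, y_demo)
print(st_demo, p_demo)
```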

In summary, the method for testing the significance of the coefficient of a variable in logistic regression is similar to the approach used in linear regression; however, it is based on the likelihood function for a dichotomous outcome variable under the logistic regression model.

1.4 Confidence Interval Estimation

An important adjunct to testing for significance of the model, discussed in Section 1.3, is calculation and interpretation of confidence intervals for parameters of interest. As is the case in linear regression we can obtain these for the slope, intercept and the line (i.e., the logit). In some settings it may be of interest to provide interval estimates for the fitted values (i.e., the predicted probabilities).

The basis for construction of the interval estimators is the same statistical theory we used to formulate the tests for significance of the model. In particular, the confidence interval estimators for the slope and intercept are, most often, based on their respective Wald tests and are sometimes referred to as Wald-based confidence intervals. The endpoints of a $100(1 - \alpha)\%$ confidence interval for the slope coefficient are

$$\hat{\beta}_1 \pm z_{1-\alpha/2}\,\widehat{\text{SE}}(\hat{\beta}_1) \tag{1.15}$$

and for the intercept they are

$$\hat{\beta}_0 \pm z_{1-\alpha/2}\,\widehat{\text{SE}}(\hat{\beta}_0) \tag{1.16}$$

where $z_{1-\alpha/2}$ is the upper $100(1 - \alpha/2)\%$ point from the standard normal distribution and $\widehat{\text{SE}}(\cdot)$ denotes a model-based estimator of the standard error of the respective parameter estimator. We defer discussion of the actual formula used for calculating the estimators of the standard errors to Chapter 2. For the moment, we use the fact that estimated values are provided in the output following the fit of a model and, in addition, many packages also provide the endpoints of the interval estimates.

As an example, consider the model fit to the data in Table 1.1 regressing the presence or absence of CHD on AGE. The results are presented in Table 1.3. The endpoints of a 95 percent confidence interval for the slope coefficient from equation 1.15 are $0.111 \pm 1.96 \times 0.0241$, yielding the interval $(0.064, 0.158)$. We defer a detailed discussion of the interpretation of these results to Chapter 3. Briefly, the results suggest that the change in the log-odds of CHD per one-year increase in age is 0.111 and the change could be as little as 0.064 or as much as 0.158 with 95 percent confidence.

As is the case with any regression model, the constant term provides an estimate of the response at $x = 0$ unless the independent variable has been centered at some clinically meaningful value. In our example, the constant provides an estimate of the log-odds of CHD at zero years of age. As a result, the constant term, by itself, has no useful clinical interpretation. In any event, from equation 1.16, the endpoints of a 95 percent confidence interval for the constant are $-5.309 \pm 1.96 \times 1.1337$, yielding the interval $(-7.531, -3.087)$.
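The Wald-based endpoints are simple enough to verify directly. The sketch below assumes the estimates and standard errors for the example are 0.111 (0.0241) for AGE and $-5.309$ (1.1337) for the constant; these inputs and the helper name `wald_ci` are assumptions made for illustration.

```python
from statistics import NormalDist

def wald_ci(estimate, std_error, level=0.95):
    """Endpoints of a Wald-based confidence interval for one coefficient."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)  # about 1.96 for 95 percent
    return estimate - z * std_error, estimate + z * std_error

# Assumed estimates and standard errors for the slope and the constant.
slope_ci = wald_ci(0.111, 0.0241)
const_ci = wald_ci(-5.309, 1.1337)
print([round(v, 3) for v in slope_ci], [round(v, 3) for v in const_ci])
```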

The logit is the linear part of the logistic regression model and, as such, is most similar to the fitted line in a linear regression model. The estimator of the logit is

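$$\hat{g}(x) = \hat{\beta}_0 + \hat{\beta}_1 x. \tag{1.17}$$

The estimator of the variance of the estimator of the logit requires obtaining the variance of a sum. In this case it is

$$\widehat{\text{Var}}[\hat{g}(x)] = \widehat{\text{Var}}(\hat{\beta}_0) + x^2\,\widehat{\text{Var}}(\hat{\beta}_1) + 2x\,\widehat{\text{Cov}}(\hat{\beta}_0, \hat{\beta}_1). \tag{1.18}$$

In general, the variance of a sum is equal to the sum of the variance of each term and twice the covariance of each possible pair of terms formed from the components of the sum. The endpoints of a $100(1 - \alpha)\%$ Wald-based confidence interval for the logit are

$$\hat{g}(x) \pm z_{1-\alpha/2}\,\widehat{\text{SE}}[\hat{g}(x)] \tag{1.19}$$

where $\widehat{\text{SE}}[\hat{g}(x)]$ is the positive square root of the variance estimator in equation 1.18.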

The estimated logit for the fitted model in Table 1.3 is shown in equation 1.8. In order to evaluate equation 1.18 for a specific age we need the estimated covariance matrix. This matrix can be obtained from the output from all logistic regression software packages. How it is displayed varies from package to package, but the triangular form shown in Table 1.4 is a common one.

Table 1.4 Estimated Covariance Matrix of the Estimated Coefficients in Table 1.3

The estimated logit from equation 1.8 for a subject of age 50 is

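$$\hat{g}(50) = -5.31 + 0.111 \times 50 = 0.240,$$

the estimated variance, using equation 1.18 and the results in Table 1.4, is

$$\widehat{\text{Var}}[\hat{g}(50)] = 1.28517 + (50)^2 \times 0.000579 + 2 \times 50 \times (-0.026677) = 0.0650,$$

and the estimated standard error is $\widehat{\text{SE}}[\hat{g}(50)] = 0.2549$. Thus the end points of a 95 percent confidence interval for the logit at age 50 are

$$0.240 \pm 1.96 \times 0.2549 = (-0.260, 0.740).$$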

We discuss the interpretation and use of the estimated logit in providing estimates of odds ratios in Chapter 3.

The estimator of the logit and its confidence interval provide the basis for the estimator of the fitted value, in this case the logistic probability, and its associated confidence interval. In particular, using equation 1.7 at age 50 the estimated logistic probability is

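$$\hat{\pi}(50) = \frac{e^{\hat{g}(50)}}{1 + e^{\hat{g}(50)}} = \frac{e^{0.240}}{1 + e^{0.240}} = 0.560 \tag{1.20}$$

and the endpoints of a 95 percent confidence interval are obtained from the respective endpoints of the confidence interval for the logit. The endpoints of the $100(1 - \alpha)\%$ Wald-based confidence interval for the fitted value are

$$\frac{e^{\hat{g}(x) \pm z_{1-\alpha/2}\,\widehat{\text{SE}}[\hat{g}(x)]}}{1 + e^{\hat{g}(x) \pm z_{1-\alpha/2}\,\widehat{\text{SE}}[\hat{g}(x)]}}. \tag{1.21}$$

Using the example at age 50 to demonstrate the calculations, the lower limit is

$$\frac{e^{-0.260}}{1 + e^{-0.260}} = 0.435,$$

and the upper limit is

$$\frac{e^{0.740}}{1 + e^{0.740}} = 0.677.$$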
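The chain of calculations from the covariance matrix to the interval for the fitted probability can be traced in a few lines. The sketch below assumes the covariance entries 1.28517 (constant), 0.000579 (AGE) and $-0.026677$ (covariance), and uses the intercept rounded to $-5.31$; all of these inputs are assumptions about the fitted model, and `logit_interval` and `expit` are hypothetical helper names.

```python
import math

def expit(t):
    """Inverse of the logit: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-t))

def logit_interval(x, b0, b1, var_b0, var_b1, cov_b0_b1, z=1.96):
    """Estimated logit at x, its variance (equation 1.18) and Wald endpoints."""
    g_hat = b0 + b1 * x
    var_g = var_b0 + x ** 2 * var_b1 + 2.0 * x * cov_b0_b1
    se_g = math.sqrt(var_g)
    return g_hat, var_g, (g_hat - z * se_g, g_hat + z * se_g)

# Assumed coefficient estimates and covariance entries for the age example.
g50, var50, (g_lo, g_hi) = logit_interval(
    50, b0=-5.31, b1=0.111,
    var_b0=1.28517, var_b1=0.000579, cov_b0_b1=-0.026677,
)

# Transform the logit and its endpoints to the probability scale (equation 1.21).
p50, p_lo, p_hi = expit(g50), expit(g_lo), expit(g_hi)
print(round(p50, 3), round(p_lo, 3), round(p_hi, 3))
```

Transforming the endpoints of the logit interval, rather than building a symmetric interval around the fitted probability, keeps both limits between 0 and 1.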

We have found that a major mistake often made by data analysts new to logistic regression modeling is to try to apply estimates on the probability scale to individual subjects. The fitted value computed in equation 1.20 is analogous to a particular point on the line obtained from a linear regression. In linear regression each point on the fitted line provides an estimate of the mean of the dependent variable in a population of subjects with covariate value $x$. Thus the value of 0.56 in equation 1.20 is an estimate of the mean (i.e., proportion) of 50-year-old subjects in the population sampled that have evidence of CHD. An individual 50-year-old subject either does or does not have evidence of CHD. The confidence interval suggests that this mean could be between 0.435 and 0.677 with 95 percent confidence. We discuss the use and interpretation of fitted values in greater detail in Chapter 3.

One application of fitted logistic regression models that has received a lot of attention in the subject matter literature is using model-based fitted values similar to the one in equation 1.20 to predict the value of a binary dependent variable in individual subjects. This process is called classification and has a long history in statistics where it is referred to as discriminant analysis. We discuss the classification problem in detail in Chapter 4. We also discuss discriminant analysis within the context of a method for obtaining estimators of the coefficients in the next section.

The coverage of the Wald-based confidence interval estimators in equations 1.15 and 1.16 depends on the assumption that the distribution of the maximum likelihood estimators is normal. Potential sensitivity to this assumption is the main reason that the likelihood ratio test is recommended over the Wald test for assessing the significance of individual coefficients, as well as for the overall model. In settings where the number of events and/or the sample size is small the normality assumption is suspect and a log-likelihood function-based confidence interval can have better coverage. Until recently routines to compute these intervals were not available in most software packages. Cox and Snell (1989, pp. 179–183) discuss the theory behind likelihood intervals, and Venzon and Moolgavkar (1988) describe an efficient way to calculate the end points. Royston (2007) describes a STATA [StataCorp (2011)] routine that implements the Venzon and Moolgavkar method that we use for the examples in this text. The SAS package's logistic regression procedure [SAS Institute Inc. (2009)] has the option to obtain likelihood confidence intervals.

The likelihood-based confidence interval estimator for a coefficient can be concisely described as the interval of values, $\beta^*$, for which the likelihood ratio test would fail to reject the hypothesis, $H_0\colon \beta = \beta^*$, at the stated $100\alpha$ percent significance level. The two end points, $\beta_{\text{lower}}$ and $\beta_{\text{upper}}$, of this interval for a coefficient are defined as follows:

$$2\left[L(\hat{\boldsymbol{\beta}}) - L_p(\beta_{\text{lower}})\right] = 2\left[L(\hat{\boldsymbol{\beta}}) - L_p(\beta_{\text{upper}})\right] = \chi^2_{1-\alpha}(1) \tag{1.22}$$

where $L(\hat{\boldsymbol{\beta}})$ is the value of the log-likelihood of the fitted model and $L_p(\beta)$ is the value of the profile log-likelihood. A value of the profile log-likelihood is computed by first specifying/fixing a value for the coefficient of interest, for example the slope coefficient for age, and then finding the value of the intercept coefficient, using the Venzon and Moolgavkar method, that maximizes the log-likelihood. This process is repeated over a grid of values of the specified coefficient, for example, values of $\beta^*$, until the solutions to equation 1.22 are found. The results can be presented graphically or in standard interval form. We illustrate both in the example below.

As an example, we show in Figure 1.3 a plot of the profile log-likelihood for the coefficient for AGE using the CHDAGE data in Table 1.1. The end points of the 95 percent likelihood interval are $\beta_{\text{lower}} = 0.067$ and $\beta_{\text{upper}} = 0.162$ and are shown in the figure where the two vertical lines intersect the $x$ axis. The horizontal line in the figure is drawn at the value

$$L(\hat{\boldsymbol{\beta}}) - \frac{3.8416}{2} = L(\hat{\boldsymbol{\beta}}) - 1.9208,$$

where $L(\hat{\boldsymbol{\beta}})$ is the value of the log-likelihood of the fitted model from Table 1.3 and 3.8416 is the 95th percentile of the chi-square distribution with 1 degree of freedom.

Figure 1.3 Plot of the profile log-likelihood for the coefficient for AGE in the CHDAGE data.


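The quantity "Asymmetry" in Figure 1.3 is a measure of asymmetry of the profile log-likelihood: the difference between the length of the upper part of the interval, $\beta_{\text{upper}} - \hat{\beta}$, and the length of the lower part, $\hat{\beta} - \beta_{\text{lower}}$, as a percent of the total length, $\beta_{\text{upper}} - \beta_{\text{lower}}$. In the example the value is

$$A = 100 \times \frac{(0.1620 - 0.1109) - (0.1109 - 0.0669)}{0.1620 - 0.0669} \approx 7.5\%.$$

As the upper and lower endpoints of the Wald-based confidence interval in equation 1.15 are equidistant from the maximum likelihood estimator, it has asymmetry $A = 0\%$.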

In this example, the Wald-based confidence interval for the coefficient for age is $(0.064, 0.158)$. The likelihood interval is $(0.067, 0.162)$, which is only 1.1% wider than the Wald-based interval. So there is not a great deal of pure numeric difference in the two intervals and the asymmetry is small. In settings where there is greater asymmetry in the likelihood-based interval there can be more substantial differences between the two intervals. We return to this point in Chapter 3 where we discuss the interpretation of estimated coefficients. In addition, we include an exercise at the end of this chapter where there is a pronounced difference between the Wald and likelihood confidence interval estimators.
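To make the profile log-likelihood construction concrete, the following sketch computes a 95 percent likelihood interval and its asymmetry for a single slope. It is not the Venzon and Moolgavkar algorithm: it substitutes plain bisection for every root-finding step, assumes the slope estimate lies in $[-1, 1]$, and runs on made-up data. All function names and the data are hypothetical.

```python
import math

CHI2_95_1DF = 3.8416  # 95th percentile of chi-square with 1 degree of freedom

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def loglik(b0, b1, x, y):
    # Log-likelihood written as y*t - log(1 + exp(t)) for numerical stability.
    total = 0.0
    for xi, yi in zip(x, y):
        t = b0 + b1 * xi
        total += yi * t - (max(t, 0.0) + math.log1p(math.exp(-abs(t))))
    return total

def best_intercept(b1, x, y):
    # With b1 fixed, sum(y - p) decreases in b0, so bisection finds its root.
    lo = -max(b1 * xi for xi in x) - 30.0
    hi = -min(b1 * xi for xi in x) + 30.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if sum(yi - sigmoid(mid + b1 * xi) for xi, yi in zip(x, y)) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def profile_loglik(b1, x, y):
    return loglik(best_intercept(b1, x, y), b1, x, y)

def fit_slope(x, y, lo=-1.0, hi=1.0):
    # The slope of the profile log-likelihood decreases in b1; bisect on the
    # assumed bracket [lo, hi] to locate the maximum likelihood estimate.
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        b0 = best_intercept(mid, x, y)
        if sum(xi * (yi - sigmoid(b0 + mid * xi)) for xi, yi in zip(x, y)) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def likelihood_interval(x, y, b1_hat, l_hat, step=0.01):
    def excess(b1):
        return 2.0 * (l_hat - profile_loglik(b1, x, y)) - CHI2_95_1DF
    ends = []
    for direction in (-1.0, 1.0):
        inner, outer = b1_hat, b1_hat + direction * step
        while excess(outer) < 0:      # walk outward until the test rejects
            inner, outer = outer, outer + direction * step
        for _ in range(60):           # refine the crossing point by bisection
            mid = 0.5 * (inner + outer)
            if excess(mid) < 0:
                inner = mid
            else:
                outer = mid
        ends.append(0.5 * (inner + outer))
    return ends[0], ends[1]

def asymmetry(lower, estimate, upper):
    return 100.0 * ((upper - estimate) - (estimate - lower)) / (upper - lower)

# Made-up illustrative data, not the CHDAGE data.
x_demo = [20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 65.0]
y_demo = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b1_hat = fit_slope(x_demo, y_demo)
l_hat = loglik(best_intercept(b1_hat, x_demo, y_demo), b1_hat, x_demo, y_demo)
lower, upper = likelihood_interval(x_demo, y_demo, b1_hat, l_hat)
print(lower, b1_hat, upper, asymmetry(lower, b1_hat, upper))
```

The outward walk terminates only when the data are not completely separated, since otherwise the profile log-likelihood never drops far enough on one side.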

Methods to extend the likelihood intervals to functions of more than one coefficient such as the estimated logit function and probability are not available in current software packages.

1.5 Other Estimation Methods

The method of maximum likelihood described in Section 1.2 is the estimation method used in the logistic regression routines of the major software packages. However, two other methods have been and may still be used for estimating the coefficients. These methods are: (1) noniterative weighted least squares, and (2) discriminant function analysis.

A linear models approach to the analysis of categorical data proposed by Grizzle et al. (1969) [Grizzle, Starmer, and Koch (GSK) method] uses estimators based on noniterative weighted least squares. They demonstrate that the logistic regression model is an example of a general class of models that can be handled by their methods. We should add that the maximum likelihood estimators are usually calculated using an iterative reweighted least squares algorithm, and are also technically least squares estimators. The GSK method requires one iteration and is used in SAS's GENMOD procedure to fit a logistic regression model containing only categorical covariates.

A major limitation of the GSK method is that we must have an estimate of $\pi(x)$ that is not zero or 1 for most values of $x$. An example where we could use both maximum likelihood and GSK's noniterative weighted least squares is the data in Table 1.2. In cases such as this, the two methods are asymptotically equivalent, meaning that as $n$ gets large, the distributional properties of the two estimators become identical. The GSK method could not be used with the data in Table 1.1.

The discriminant function approach to estimation of the coefficients is of historical importance as it was popularized by Cornfield (1962) in some of the earliest work on logistic regression. These estimators take their name from the fact that the posterior probability in the usual discriminant function model is the logistic regression function given in equation 1.1. More precisely, if the independent variable, $X$, follows a normal distribution within each of two groups (subpopulations) defined by the two values of $Y$ and has different means and the same variance, then the conditional distribution of $Y$ given $X = x$ is the logistic regression model. That is, if

$$X \mid Y = j \sim N(\mu_j, \sigma^2), \quad j = 0, 1,$$

then $P(Y = 1 \mid x) = \pi(x)$. The symbol "$\sim$" is read "is distributed" and the "$N(\mu, \sigma^2)$" denotes the normal distribution with mean equal to $\mu$ and variance equal to $\sigma^2$. Under these assumptions it is easy to show [Lachenbruch (1975)] that the logistic coefficients are

$$\beta_0 = \ln\left(\frac{\theta_1}{\theta_0}\right) - 0.5\,\frac{\mu_1^2 - \mu_0^2}{\sigma^2} \tag{1.23}$$

and

$$\beta_1 = \frac{\mu_1 - \mu_0}{\sigma^2}, \tag{1.24}$$

where $\theta_1 = P(Y = 1)$ and $\theta_0 = 1 - \theta_1$. The discriminant function estimators of $\beta_0$ and $\beta_1$ are obtained by substituting estimators for $\mu_1$, $\mu_0$, $\theta_1$ and $\sigma^2$ into the above equations. The estimators usually used are $\hat{\mu}_1 = \bar{x}_1$, the mean of $x$ in the subgroup defined by $y = 1$, $\hat{\mu}_0 = \bar{x}_0$, the mean of $x$ with $y = 0$, $\hat{\theta}_1 = n_1/n$ and

$$\hat{\sigma}^2 = \frac{(n_0 - 1)s_0^2 + (n_1 - 1)s_1^2}{n_0 + n_1 - 2},$$

where $s_j^2$ is the unbiased estimator of $\sigma^2$ computed within the subgroup of the data defined by $y = j$, $j = 0, 1$. The above expressions are for a single variable and multivariable expressions are presented in Chapter 2.
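The discriminant function estimators require only group means, a pooled variance and the outcome proportion, as the following sketch shows. It plugs sample quantities into equations 1.23 and 1.24 for a made-up data set; the function name `discriminant_estimates` and the data are hypothetical.

```python
import math
from statistics import mean, variance

def discriminant_estimates(x, y):
    """Plug sample estimates into equations 1.23 and 1.24."""
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    n1, n0 = len(x1), len(x0)
    mu1, mu0 = mean(x1), mean(x0)
    # Pooled variance from the unbiased within-group variances.
    s2 = ((n0 - 1) * variance(x0) + (n1 - 1) * variance(x1)) / (n0 + n1 - 2)
    theta1 = n1 / (n1 + n0)
    b1 = (mu1 - mu0) / s2
    b0 = math.log(theta1 / (1.0 - theta1)) - 0.5 * (mu1 ** 2 - mu0 ** 2) / s2
    return b0, b1

# Made-up illustrative data.
x_demo = [20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 65.0]
y_demo = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
b0_df, b1_df = discriminant_estimates(x_demo, y_demo)
print(b0_df, b1_df)
```

No iteration is involved, which is the computational appeal of the method; the cost is its reliance on within-group normality.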

It is natural to ask why, if the discriminant function estimators are so easy to compute, they are not used in place of the maximum likelihood estimators? Halpern et al. (1971) and Hosmer et al. (1983) compared the two methods when the model contains a mixture of continuous and discrete variables, with the general conclusion that the discriminant function estimators are sensitive to the assumption of normality. In particular, the estimators of the coefficients for non-normally distributed variables are biased away from zero when the coefficient is, in fact, different from zero. The practical implication of this is that for dichotomous independent variables (that occur in many situations), the discriminant function estimators overestimate the magnitude of the coefficient. Lyles et al. (2009) describe a clever linear regression-based approach to compute the discriminant function estimator of the coefficient for a single continuous variable that, when their assumptions of normality hold, has better statistical properties than the maximum likelihood estimator. We discuss their multivariable extension and some of its practical limitations in Chapter 2.

At this point it may be helpful to delineate more carefully the various uses of the term maximum likelihood, as it applies to the estimation of the logistic regression coefficients. Under the assumptions of the discriminant function model stated above, the estimators obtained from equations 1.23 and 1.24 are maximum likelihood estimators. The estimators obtained from equations 1.5 and 1.6 are based on the conditional distribution of $Y$ given $X$ and, as such, are technically conditional maximum likelihood estimators. It is common practice to drop the word conditional when describing the estimators given in equations 1.5 and 1.6. In this text, we use the word conditional to describe estimators in logistic regression with matched data as discussed in Chapter 7.

In summary, there are alternative methods of estimation for some data configurations that are computationally quicker; however, we use the maximum likelihood method described in Section 1.2 throughout the rest of this text.

1.6 Data Sets Used in Examples and Exercises

A number of different data sets are used in the examples as well as the exercises for the purpose of demonstrating various aspects of logistic regression modeling. Six of the data sets used throughout the text are described below. Other data sets are introduced as needed in later chapters. Some of the data sets were used in the previous editions of this text, for example the ICU and Low Birth Weight data, while others are new to this edition. All data sets used in this text may be obtained from links to web sites at John Wiley & Sons Inc. and the University of Massachusetts given in the Preface.

1.6.1 The ICU Study

The ICU study data set consists of a sample of 200 subjects who were part of a much larger study on survival of patients following admission to an adult intensive care unit (ICU). The major goal of this study was to develop a logistic regression model to predict the probability of survival to hospital discharge of these patients. A number of publications have appeared that have focused on various facets of this problem. The reader wishing to learn more about the clinical aspects of this study should start with Lemeshow et al. (1988). For a more up-to-date discussion of modeling the outcome of ICU patients the reader is referred to Lemeshow and Le Gall (1994) and to Lemeshow et al. (1993). The actual observed variable values have been modified to protect subject confidentiality. A code sheet for the variables to be considered in this text is given in Table 1.5. We refer to this data set as the ICU data.

Table 1.5 Code Sheet for the Variables in the ICU Data


1.6.2 The Low Birth Weight Study

Low birth weight, defined as birth weight less than 2500 grams, is an outcome that has been of concern to physicians for years.
