Wednesday 15 November 2017

Missing data imputation for binary outcomes


Imputation strategies for missing binary outcomes in cluster randomized trials

Background: Attrition, which leads to missing data, is a common problem in cluster randomized trials (CRTs), where groups of patients rather than individuals are randomized. Standard multiple imputation (MI) strategies may not be appropriate for imputing missing data from CRTs because they assume that the data are independent. In this paper, under the assumptions of missing completely at random and covariate-dependent missingness, we compared six MI strategies that account for the intra-cluster correlation for missing binary outcomes in CRTs with the standard imputation strategies and the complete case analysis approach, using a simulation study. We considered three within-cluster and three across-cluster MI strategies for missing binary outcomes in CRTs. The three within-cluster MI strategies are the logistic regression method, the propensity score method, and the Markov chain Monte Carlo (MCMC) method, which apply the standard MI strategies within each cluster. The three across-cluster MI strategies are the propensity score method, the random-effects (RE) logistic regression method, and logistic regression with cluster as a fixed effect. Based on the community hypertension assessment trial (CHAT), which has complete data, we designed a simulation study to investigate the performance of the above MI strategies. The estimated treatment effect and its 95% confidence interval (CI) from the generalized estimating equation (GEE) model based on the complete CHAT dataset are 1.14 (0.76, 1.70). When 30% of the binary outcome is missing completely at random, the simulation study shows that the estimated treatment effects and the corresponding 95% CIs from the GEE model are 1.15 (0.76, 1.75) if complete case analysis is used, 1.12 (0.72, 1.73) if the within-cluster MCMC method is used, 1.21 (0.80, 1.81) if the across-cluster RE logistic regression is used, and 1.16 (0.82, 1.64) if standard logistic regression that does not take clustering into account is used. Conclusion: When the percentage of missing data is low or the intra-cluster correlation coefficient is small, different approaches for handling missing binary outcome data generate quite similar results. When the percentage of missing data is large, standard MI strategies, which do not take the intra-cluster correlation into account, underestimate the variance of the treatment effect. Within-cluster and across-cluster MI strategies (except for the random-effects logistic regression MI strategy), which take the intra-cluster correlation into account, appear more appropriate for handling the missing outcome from CRTs. Under the same imputation strategy and percentage of missingness, the estimates of the treatment effect from the GEE and RE logistic regression models are similar.

1. Introduction

Cluster randomized trials (CRTs), in which groups of participants rather than individuals are randomized, are increasingly used in health promotion and health services research [1]. This randomization strategy is commonly used to minimize potential treatment contamination between intervention and control participants when participants must be managed within the same setting, such as a hospital, community, or family practice. It is also used when individual-level randomization may be inappropriate, unethical, or infeasible [2]. The main consequence of the cluster randomized design is that participants cannot be assumed to be independent, because of the similarity of participants from the same cluster. This similarity is quantified by the intra-cluster correlation coefficient (ICC).
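In display form, the ICC (a standard definition, consistent with the interpretation given next) is

\[ \rho = \mathrm{ICC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2}, \]

where \(\sigma_b^2\) is the between-cluster variance and \(\sigma_w^2\) is the within-cluster variance of the outcome.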
Given the two components of variation in the outcome, the between-cluster and within-cluster variation, the ICC can be interpreted as the proportion of the total variation in the outcome that is explained by the between-cluster variation [3]. It can also be interpreted as the correlation between the outcomes of two participants in the same cluster. It is well established that failure to account for the intra-cluster correlation in the analysis can increase the risk of obtaining statistically significant but spurious findings [4]. The risk of attrition can be quite high in some CRTs because of the lack of direct contact with individual participants and the long follow-up [5]. In addition to missing individuals, entire clusters may be missing, which further complicates the handling of missing data in CRTs. The effect of missing data on the results of the statistical analysis depends on the mechanism that caused the data to be missing and on how the missing data are handled. The default approach to this problem is complete case analysis (also called listwise deletion), that is, excluding participants with missing data from the analysis. Although this approach is easy to use and is the default option in most statistical packages, it can substantially weaken the statistical power of the trial and may also lead to biased results, depending on the mechanism of the missing data. In general, the nature or type of missingness falls into four categories: missing completely at random (MCAR), missing at random (MAR), covariate-dependent (CD) missing, and missing not at random (MNAR) [6]. Understanding these categories is important because the appropriate solutions may differ depending on the type of missingness. MCAR means that the missing data mechanism, that is, the probability of being missing, does not depend on the observed or unobserved data. Both the MAR and CD mechanisms indicate that the causes of the missing data are unrelated to the missing values themselves but may be related to the observed values. In the context of longitudinal data, where serial measurements are taken for each individual, MAR means that the probability of a missing response at a given visit is related to either the observed responses at previous visits or the covariates, while CD missingness, a special case of MAR, means that the probability of a missing response depends only on the covariates. MNAR means that the probability that data are missing depends on the unobserved data. It commonly occurs when people drop out of a study because of poor or good health outcomes. An important distinction between these categories is that MNAR is non-ignorable, while the other three (MCAR, CD, and MAR) are ignorable. Under ignorable missingness, imputation strategies such as mean imputation, hot deck, last observation carried forward, or multiple imputation (MI), which replace each missing value with one or more plausible values, can produce a complete dataset that is not biased [8, 9]. Non-ignorable missing data are more challenging and require a different approach [10]. Two main approaches to handling missing outcomes are likelihood-based analyses and imputation [10]. In this paper we focus on MI strategies, which take into account the variability or uncertainty of the missing data, to impute the missing binary outcome in CRTs. Under the assumption of MAR, MI strategies replace each missing value with a set of plausible values to create multiple imputed datasets, usually ranging in number from 3 to 10 [11].
These multiple imputed datasets are analyzed using standard procedures for complete data, and the results from the imputed datasets are then combined to generate the final inference. Standard MI procedures are available in many widely used statistical software packages such as SAS (Cary, NC), SPSS (Chicago, IL), and Stata (College Station, TX). However, these procedures assume that observations are independent and may not be appropriate for CRTs because they do not take the intra-cluster correlation into account. To our knowledge, limited investigation has been done on imputation strategies for missing binary or categorical outcomes in CRTs. Yi and Cook reported marginal methods for missing longitudinal data arising in clusters [12]. Hunsberger et al. [13] described three strategies for missing continuous data in CRTs: 1) a multiple imputation procedure in which the missing values are replaced by values sampled from the observed data; 2) a median procedure, based on the Wilcoxon rank sum test, that assigns the missing data in the intervention group the worst ranks; and 3) a multiple imputation procedure in which the missing values are replaced by the predicted values from a regression equation. Nixon et al. [14] presented strategies for imputing missing endpoints from a surrogate. In the analysis of a continuous outcome from the Community Intervention Trial for Smoking Cessation (COMMIT), Green et al. stratified individual participants into groups that were more homogeneous with respect to the predicted outcome, and within each stratum they imputed the missing outcome using the observed data [15, 16]. Taljaard et al. [17] compared several imputation strategies for missing continuous outcomes in CRTs under the assumption of missing completely at random; these strategies include cluster mean imputation, within-cluster MI using the approximate Bayesian bootstrap (ABB) method, pooled MI using the ABB method, standard regression MI, and mixed-effects regression MI. As pointed out by Kenward et al., if a substantive model that reflects the data structure, such as a generalized linear mixed model, is to be used, it is important that the imputation model also reflect this structure [18]. The objectives of this paper are to: i) investigate the performance of different imputation strategies for missing binary outcomes in CRTs under different percentages of missingness, assuming a missing completely at random or covariate-dependent missing mechanism; ii) compare the agreement between the complete dataset and the imputed datasets obtained from the different imputation strategies; and iii) compare the robustness of the results from two commonly used statistical analysis approaches, generalized estimating equations (GEE) and random-effects (RE) logistic regression, under the different imputation strategies.

2. Methods

In this paper we consider three within-cluster and three across-cluster MI strategies for missing binary outcomes in CRTs. The three within-cluster MI strategies are the logistic regression method, the propensity score method, and the MCMC method, which are standard MI strategies carried out within each cluster. The three across-cluster MI strategies are the propensity score method, the random-effects logistic regression method, and logistic regression with cluster as a fixed effect. Based on the complete dataset from the community hypertension assessment trial (CHAT), we conducted a simulation study to investigate the performance of the above MI strategies. We used the Kappa statistic to compare the agreement between the imputed datasets and the complete dataset.
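For reference, the Kappa statistic used for this agreement check can be written in its standard form (the notation is generic, not taken from the paper) as

\[ \kappa = \frac{p_o - p_e}{1 - p_e}, \]

where \(p_o\) is the observed proportion of agreement between the imputed and true outcome values and \(p_e\) is the proportion of agreement expected by chance alone.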
We also used the estimated treatment effects obtained from the GEE and RE logistic regression models [19] to assess the robustness of the results under different percentages of missing binary outcome, under the MCAR and CD missing assumptions.

2.1. Complete case analysis

With this approach, only patients with complete data are included in the analysis, while patients with missing data are excluded. When data are MCAR, the complete case analysis approach, using either a likelihood-based analysis such as RE logistic regression or a marginal model such as the GEE approach, is valid for the analysis of a binary outcome from a CRT, because the missing data mechanism is independent of the outcome. When data are CD missing, both the RE logistic regression and GEE approaches are valid provided the known covariates associated with the missing data mechanism are adjusted for. This can be implemented using the GENMOD and NLMIXED procedures in SAS.

2.2. Standard multiple imputation

Assuming the observations are independent, we can apply the standard MI procedures provided by any mainstream statistical software, such as SAS. Three widely used MI methods are the predictive model method (the logistic regression method for binary data), the propensity score method, and the MCMC method [20]. In general, both the propensity score method and the MCMC method are recommended for imputing continuous variables [21]. A dataset is said to have a monotone missing pattern when the fact that a measurement Y_j is missing for an individual implies that all subsequent measurements Y_k, k > j, are also missing for that individual. When data are missing in a monotone pattern, any of the parametric predictive models, the non-parametric method using propensity scores, or the MCMC method is appropriate [21]. For an arbitrary missing data pattern, an MCMC method assuming multivariate normality can be used [10]. These MI strategies are implemented using the MI, MIANALYZE, GENMOD, and NLMIXED procedures in SAS, separately for each intervention group.

2.2.1. Logistic regression method

In this method, a logistic regression model is fitted using the observed outcome and covariates [21]. Based on the parameter estimates and the associated covariance matrix, the posterior predictive distribution of the parameters is constructed. A new logistic regression model is then simulated from the posterior predictive distribution of the parameters and used to impute the missing values.

2.2.2. Propensity score method

The propensity score is the predicted probability of being missing given the observed data. It can be estimated using a logistic regression model with a binary outcome indicating whether or not the data are missing. The observations are then stratified into a number of strata based on these propensity scores, and the ABB procedure [22] is applied within each stratum. The ABB imputation first draws with replacement from the observed data to create a new dataset, which is a non-parametric analogue of drawing parameters from the posterior predictive distribution of the parameters, and then randomly draws imputed values with replacement from this new dataset.

2.2.3. Markov chain Monte Carlo method

In the MCMC method, pseudo-random samples are drawn from a target probability distribution [21]. When the missing data have a non-monotone pattern, the target distribution is the joint conditional distribution of Y_mis and θ given Y_obs, where Y_mis and Y_obs denote the missing and observed data, respectively, and θ denotes the unknown parameters.
The MCMC method proceeds as follows: replace Y_mis by some assumed values, then simulate θ from the resulting complete-data posterior distribution P(θ | Y_obs, Y_mis). Let θ^(t) be the current simulated value of θ; then Y_mis^(t+1) can be drawn from the conditional predictive distribution, Y_mis^(t+1) ~ P(Y_mis | Y_obs, θ^(t)). Conditioning on Y_mis^(t+1), the next simulated value of θ can be drawn from its complete-data posterior distribution, θ^(t+1) ~ P(θ | Y_obs, Y_mis^(t+1)). Repeating this procedure generates a Markov chain that converges in distribution to P(θ, Y_mis | Y_obs). The method is attractive because it avoids complicated analytical calculation of the posterior distribution of θ and Y_mis. However, convergence in distribution is an issue that researchers must address. In addition, the method is based on the assumption of multivariate normality. When it is used to impute binary variables, the imputed values can be any real number; most of the imputed values lie between 0 and 1, but some fall outside this interval. We round an imputed value to 0 if it is less than 0.5 and to 1 otherwise. This multiple imputation method is implemented using the MI procedure in SAS. We use a single chain and a non-informative prior for all imputations, and the expectation-maximization (EM) algorithm to find maximum likelihood estimates in parametric models for incomplete data and to derive parameter estimates from a posterior mode. The iterations are considered to have converged when the change in the parameter estimates between iteration steps is less than 0.0001 for each parameter.

2.3. Within-cluster multiple imputation

Standard MI strategies are inappropriate for handling missing data from CRTs because of their assumption of independent observations. For within-cluster imputation, we perform the standard MI described above, using the logistic regression method, the propensity score method, and the MCMC method, separately within each cluster. The missing values are thus imputed based on the observed data from the same cluster as the missing values. Given that subjects within the same cluster are more likely to resemble each other than subjects from different clusters, within-cluster imputation can be viewed as a strategy that imputes the missing values while taking the intra-cluster correlation into account. These MI strategies are implemented using the MI, MIANALYZE, GENMOD, and NLMIXED procedures in SAS.

2.4. Across-cluster multiple imputation

2.4.1. Propensity score method

Compared with standard multiple imputation using the propensity score method, we added cluster as one of the covariates used to obtain the propensity score for each observation. Consequently, patients in the same cluster are more likely to be categorized into the same propensity score stratum, so the intra-cluster correlation is taken into account when the ABB procedure is applied within each stratum to generate the imputed values for the missing data. This multiple imputation strategy is implemented using the MI, MIANALYZE, GENMOD, and NLMIXED procedures in SAS.

2.4.2. Random-effects logistic regression

Compared with the predictive model using the standard logistic regression method, we assume that the binary outcome is modelled by the random-effects logistic model logit(Pr(Y_ijl = 1)) = X_ijl β + U_ij, where Y_ijl is the binary outcome of patient l in cluster j of intervention group i, X_ijl is the matrix of fully observed individual-level or cluster-level covariates, U_ij ~ N(0, σ_B²) represents the cluster random effect, and σ_B² represents the between-cluster variance.
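Written out in display form, together with the latent-scale ICC conventionally attached to a random-intercept logistic model (the ICC expression is a standard result and is not stated in the paper), the imputation model is

\[ \operatorname{logit}\{\Pr(Y_{ijl} = 1)\} = X_{ijl}\beta + U_{ij}, \qquad U_{ij} \sim N(0, \sigma_B^2), \qquad \rho \approx \frac{\sigma_B^2}{\sigma_B^2 + \pi^2/3}. \]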
σ_B² can be estimated by fitting the random-effects logistic regression model using the observed outcome and covariates. The MI strategy using the random-effects logistic regression method obtains the imputed values in three steps: (1) fit a random-effects logistic regression model as described above using the observed outcome and covariates; (2) based on the estimates of β and σ_B obtained in step (1) and the associated covariance matrix, construct the posterior predictive distribution of these parameters; (3) fit a new random-effects logistic regression using the parameters simulated from the posterior predictive distribution and the observed covariates to obtain the imputed missing outcomes. The MI strategy using random-effects logistic regression takes the between-cluster variance into account, which is ignored in the MI strategy using standard logistic regression, and may therefore be valid for imputing missing binary data in CRTs. We provide the SAS code for this method in Appendix A.

2.4.3. Logistic regression with cluster as a fixed effect

Compared with the predictive model using the standard logistic regression method, we add cluster as a fixed effect to account for the clustering effect. This multiple imputation strategy is implemented using the MI, MIANALYZE, GENMOD, and NLMIXED procedures in SAS.

3. Simulation study

3.1. Community hypertension assessment trial

The CHAT study has been reported in detail elsewhere [23]. In brief, it was a cluster randomized controlled trial evaluating the effectiveness of pharmacy-based blood pressure (BP) clinics led by peer health educators, with feedback to family physicians (FPs), in the management and monitoring of BP among patients 65 years or older. The FP was the unit of randomization, and patients from the same FP received the same intervention. In total, 28 FPs participated in the study: 14 were randomly allocated to the intervention arm (pharmacy BP clinics) and 14 to the control arm (no BP clinics offered). Fifty-five patients were randomly selected from each FP roster, so 1540 patients participated in the study. All eligible patients in both the intervention and control groups received usual health care at their FP's office. Patients in the practices allocated to the intervention group were invited to visit the community BP clinics, where peer health educators assisted them in measuring their BP and reviewing their cardiovascular risk factors. Research nurses conducted the baseline and end-of-trial (12 months after randomization) audits of the health records of the 1540 patients in the study. The primary outcome of the CHAT study was a binary variable indicating whether or not a patient's BP was controlled at the end of the trial. A patient's BP was considered controlled if, at the end of the trial, systolic BP was at most 140 mmHg and diastolic BP at most 90 mmHg for patients without diabetes or target organ damage, or systolic BP was at most 130 mmHg and diastolic BP at most 80 mmHg for patients with diabetes or target organ damage. In addition to the intervention group, the other predictors included in this paper were age (continuous variable), sex (binary variable), diabetes at baseline (binary variable), heart disease at baseline (binary variable), and whether the patient's BP was controlled at baseline (binary variable). At the end of the trial, 55% of the patients had their BP controlled.
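As an illustration only, the composite end-of-trial control rule could be coded along the following lines; the variable names are hypothetical and the CHAT data were actually analyzed in SAS, so this is just a sketch of the logic:

* hypothetical variable names; sketch of the end-of-trial BP control rule
gen byte bp_controlled = 0
replace bp_controlled = 1 if diabetes == 0 & organ_damage == 0 & sbp_end <= 140 & dbp_end <= 90
replace bp_controlled = 1 if (diabetes == 1 | organ_damage == 1) & sbp_end <= 130 & dbp_end <= 80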
Without including any other predictors in the model, the treatment effects and their 95% confidence intervals (CIs) estimated from the GEE and RE models were 1.14 (0.72, 1.80) and 1.10 (0.65, 1.86), respectively, and the estimated ICC was 0.077. After adjusting for the variables mentioned above, the treatment effects and their CIs estimated from the GEE and RE models were 1.14 (0.76, 1.70) and 1.12 (0.72, 1.76), respectively, and the estimated ICC was 0.055. Since there are no missing data in the CHAT dataset, it provides a suitable platform for designing a simulation study in which the imputed values can be compared with the observed values, and for further investigating the performance of the different multiple imputation strategies under different missing data mechanisms and percentages of missingness.

3.2. Generating datasets with missing binary outcome

Using the CHAT study dataset, we investigated the performance of the different MI strategies for the missing binary outcome under the MCAR and CD missing mechanisms. Under the MCAR assumption, we generated datasets with a given percentage of missing values of the binary outcome indicating whether or not BP was controlled at the end of the trial for each patient. The probability of being missing was completely random for each patient, that is, it did not depend on any observed or unobserved CHAT data. Under the CD missing assumption, we considered sex, treatment group, and whether or not the patient's BP was controlled at baseline, which are commonly associated with outcomes in clinical trials and observational studies [24-26], to be associated with the probability of being missing. We further assumed that male patients were 1.2 times more likely to have a missing outcome than female patients, that patients allocated to the control group were 1.3 times more likely to have a missing outcome than patients in the intervention group, and that patients whose BP was not controlled at baseline were 1.4 times more likely to have a missing outcome than patients whose BP was controlled at baseline.

3.3. Design of the simulation study

First, we compared the agreement between the values of the imputed outcome variable and the true values of the outcome variable using the Kappa statistic. The Kappa statistic is the most commonly used statistic for assessing agreement between two observers or methods; it takes into account the fact that they will sometimes agree or disagree simply by chance [27]. It is calculated from the difference between how much agreement is actually present and how much agreement would be expected by chance alone. A Kappa of 1 indicates perfect agreement, and 0 indicates agreement equivalent to chance. The Kappa statistic has been used extensively by researchers to evaluate the performance of different imputation techniques for imputing missing categorical data [28, 29]. Second, under both MCAR and CD missingness, we compared the effect estimates from the RE and GEE approaches in the following scenarios: 1) excluding the missing values from the analysis, that is, complete case analysis; 2) applying the standard multiple imputation strategies that do not take the intra-cluster correlation into account; 3) applying the within-cluster imputation strategies; and 4) applying the across-cluster imputation strategies. We designed the simulation study according to the following steps.
1) Generate 5%, 10%, 15%, 20%, 30%, and 50% missing outcomes under both the MCAR and CD missing assumptions; these amounts of missingness were chosen to cover the range of missingness likely to occur in practice [30].
2) Apply the above multiple imputation strategies to generate m = 5 imputed datasets.
According to Rubin, the relative efficiency of MI does not increase much when more than 5 imputed datasets are generated [11].
3) Calculate the Kappa statistic to assess the agreement between the values of the imputed outcome variable and the true values of the outcome variable.
4) Obtain the treatment effect estimate by combining the effect estimates from the 5 imputed datasets, using the GEE and RE models.
5) Repeat the above four steps 1000 times, that is, conduct 1000 simulation runs.
6) Calculate the overall Kappa statistic by averaging the Kappa statistics from the 1000 simulation runs.
7) Calculate the overall treatment effect and its standard error by averaging the treatment effects and their standard errors from the 1000 simulation runs.

4. Results

4.1. Results when data are missing completely at random

With 5%, 10%, 15%, 20%, 30%, or 50% missingness under the MCAR assumption, the estimated Kappa for all of the imputation strategies is slightly above 0.95, 0.90, 0.85, 0.80, 0.70, and 0.50, respectively. The estimated Kappa for the different imputation strategies at the different percentages of missingness under the MCAR assumption is presented in detail in Table 1.

Table 1: Kappa statistics for the different imputation strategies when missingness is completely at random
Treatment effect estimated by random-effects logistic regression when 30% of the data are covariate-dependent missing

5. Discussion

In this paper, under the assumptions of MCAR and CD missingness, we compared six MI strategies that account for the intra-cluster correlation for missing binary outcomes in CRTs with the standard imputation strategies and the complete case analysis approach, using a simulation study. Our results show, first, that when the percentage of missing data is low or the intra-cluster correlation coefficient is small, the different imputation strategies and the complete case analysis approach generate quite similar results. Second, the standard MI strategies, which do not take the intra-cluster correlation into account, underestimate the variance of the treatment effects; they can therefore lead to statistically significant but spurious conclusions when used to handle missing data from CRTs. Third, under the MCAR and CD missing assumptions, the point estimates (ORs) are quite similar across the different ways of handling the missing data, except for the random-effects logistic regression MI strategy. Fourth, both the within-cluster and across-cluster MI strategies take the intra-cluster correlation into account and give more conservative treatment effect estimates than the MI strategies that ignore the clustering effect. Fifth, the within-cluster imputation strategies lead to wider CIs than the across-cluster imputation strategies, especially when the percentage of missingness is high; this may be because the within-cluster strategies use only a fraction of the data, which leads to greater variability in the estimated treatment effect. Sixth, a larger estimated Kappa, indicating higher agreement between the imputed values and the observed values, is associated with better performance of an MI strategy in the sense of producing an estimated treatment effect and 95% CI closer to those obtained from the complete CHAT dataset. Seventh, under the same imputation strategy and percentage of missingness, the estimates of the treatment effect from the GEE and RE logistic regression models are similar. To our knowledge, limited work has been done on comparing different multiple imputation strategies for missing binary outcomes in CRTs.
Taljaard et al. [17] compared four MI strategies (pooled ABB, within-cluster ABB, standard regression, and mixed-effects regression) for missing continuous outcomes in CRTs when the missingness is completely at random; their results are similar to ours. It should be noted that within-cluster MI strategies can only be applied when the cluster size is sufficiently large and the percentage of missingness is relatively small. In the CHAT study there were 55 patients in each cluster, which provided sufficient data to conduct the within-cluster imputation strategies using the propensity score and MCMC methods. However, the within-cluster logistic regression method failed when the percentage of missingness was high. This was because, for some clusters, all patients with a binary outcome of 0 had their outcome set to missing when a large percentage (20%) of missing outcomes was generated, so the logistic regression model failed for those particular clusters. In addition, our results show that the complete case analysis approach performs relatively well even with 50% missingness. We believe that, because of the intra-cluster correlation, the missing values cannot be expected to have a large impact as long as a large portion of each cluster is still present; however, further investigation of this issue in a dedicated simulation study would be helpful. Our results also show that the across-cluster random-effects logistic regression strategy leads to a potentially biased estimate, especially when the percentage of missingness is high. As described in Section 2.4.2, we assume that the cluster random effects follow a normal distribution, that is, U_ij ~ N(0, σ_B²). Researchers have shown that misspecification of the shape of the random-effects distribution has little impact on inference about the fixed effects [31]. Incorrectly assuming that the distribution of the random effects is independent of the cluster size can affect inference about the intercept, but does not seriously affect inference about the regression parameters. However, incorrectly assuming that the distribution of the random effects is independent of the covariates can seriously affect inference about the regression parameters [32, 33]. In our dataset, the mean of the random-effects distribution could be associated with a covariate, or the variance of the random-effects distribution could be associated with a covariate, which may explain the potential bias of the across-cluster random-effects logistic regression strategy. In contrast, the imputation strategy using logistic regression with cluster as a fixed effect performs better; however, it can only be applied when the cluster size is large enough to provide a stable estimate of the cluster effect. For multiple imputation, the overall variance of the estimated treatment effect consists of two parts: the within-imputation variance U and the between-imputation variance B. The total variance T is calculated as T = U + (1 + 1/m) B, where m is the number of imputed datasets [10]. Since the standard MI strategies ignore the between-cluster variance and fail to account for the intra-cluster correlation, the within-imputation variance may be underestimated, which can lead to underestimation of the total variance and consequently to narrower confidence intervals. In addition, the adequacy of the standard MI strategies depends on the ICC. In our study, the ICC of the CHAT dataset is 0.055 and the cluster effect in the random-effects model is statistically significant.
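For completeness, Rubin's combining rules behind this formula can be written out as follows (standard MI notation, consistent with the total-variance expression above):

\[ \bar{Q} = \frac{1}{m}\sum_{k=1}^{m}\hat{Q}_k, \qquad U = \frac{1}{m}\sum_{k=1}^{m}\hat{U}_k, \qquad B = \frac{1}{m-1}\sum_{k=1}^{m}\bigl(\hat{Q}_k - \bar{Q}\bigr)^2, \qquad T = U + \left(1 + \frac{1}{m}\right)B, \]

where \(\hat{Q}_k\) is the point estimate (here, the treatment effect) and \(\hat{U}_k\) its estimated variance from the k-th imputed dataset.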
Among the three imputation methods, the predictive model (logistic regression method), the propensity score method, and the MCMC method, the last is the most popular method for multiple imputation of missing data and is the default method implemented in SAS. Although this method is widely used to impute binary and polytomous data, there are concerns about the consequences of violating the normality assumption. Experience has repeatedly shown that multiple imputation using the MCMC method tends to be quite robust even when the real data depart from the multivariate normal distribution [20]. Therefore, when handling missing binary or ordered categorical variables, it is acceptable to impute under a normality assumption and then round off the continuous imputed values to the nearest category. For example, the imputed values for a missing binary variable can be any real value rather than being restricted to 0 and 1. We rounded the imputed values so that values greater than or equal to 0.5 were set to 1, and values less than 0.5 were set to 0 [34]. Horton et al. [35] showed that such rounding may produce biased estimates of proportions when the true proportion is near 0 or 1, but does well under most other conditions. The propensity score method was originally designed to impute missing values on the response variables from randomized experiments with repeated measures [21]. Since it uses only the covariate information associated with the missingness but ignores the correlations among variables, it may produce badly biased estimates of regression coefficients when data on predictor variables are missing. In addition, with small sample sizes and a relatively large number of propensity score groups, application of the ABB method is problematic, especially for binary variables; in this case, a modified version of the ABB should be used [36]. There are some limitations that need to be acknowledged and addressed regarding the present study. First, the simulation study is based on a real dataset, which has a relatively large cluster size and a small ICC; further research should investigate the performance of the different imputation strategies under other design settings. Second, the scenario of missing an entire cluster is not investigated in this paper; the proposed within-cluster and across-cluster MI strategies may not apply to that scenario. Third, we investigate the performance of the different MI strategies assuming MCAR and CD missing data mechanisms, so the results cannot be generalized to MAR or MNAR scenarios. Fourth, since the estimated treatment effects are similar under the different imputation strategies, we only presented the OR and 95% CI for each simulation scenario; estimates of standardized bias and coverage would be more informative and would also provide a quantitative guideline for assessing the adequacy of the imputations [37].

6. Conclusions

When the percentage of missing data is low or the intra-cluster correlation coefficient is small, different imputation strategies and the complete case analysis approach generate quite similar results. When the percentage of missing data is high, standard MI strategies, which do not take into account the intra-cluster correlation, underestimate the variance of the treatment effect. Within-cluster and across-cluster MI strategies (except for the random-effects logistic regression MI strategy), which take the intra-cluster correlation into account, seem to be more appropriate for handling the missing outcome from CRTs.
Under the same imputation strategy and percentage of missingness, the estimates of the treatment effect from the GEE and RE logistic regression models are similar.

Appendix A: SAS code for across-cluster random-effects logistic regression method

%let maximum = 1000;
ods listing close;
proc nlmixed data=mcar&percent&index cov;
    parms b0=-0.0645 bgroup=-0.1433 bdiabbase=-0.04 bhdbase=0.1224 bage=-0.0066
          bbasebpcontrolled=1.1487 bsex=0.0873 s2u=0.5;

References

1. Campbell MK, Grimshaw JM: Cluster randomised trials: time for improvement. The implications of adopting a cluster design are still largely being ignored. BMJ. 1998, 317(7167): 1171-1172.
2. COMMIT Research Group: Community Intervention Trial for Smoking Cessation (COMMIT): I. Cohort results from a four-year community intervention. Am J Public Health. 1995, 85: 183-192.
3. Donner A, Klar N: Design and Analysis of Cluster Randomisation Trials in Health Research. 2000, London: Arnold.
4. Cornfield J: Randomization by group: a formal analysis. Am J Epidemiol. 1978, 108(2): 100-102.
5. Donner A, Brown KS, Brasher P: A methodological review of non-therapeutic intervention trials employing cluster randomization, 1979-1989. Int J Epidemiol. 1990, 19(4): 795-800.
6. Rubin DB: Inference and missing data. Biometrika. 1976, 63: 581-592.
7. Allison PD: Missing Data. 2001, SAGE Publications Inc.
8. Schafer JL, Olsen MK: Multiple imputation for multivariate missing-data problems: a data analyst's perspective. Multivariate Behavioral Research. 1998, 33: 545-571.
9. McArdle JJ: Structural factor analysis experiments with incomplete data. Multivariate Behavioral Research. 1994, 29: 409-454.
10. Little RJA, Rubin DB: Statistical Analysis with Missing Data. Second edition. 2002, New York: John Wiley.
11. Rubin DB: Multiple Imputation for Nonresponse in Surveys. 1987, New York, NY: John Wiley & Sons, Inc.
12. Yi GYY, Cook RJ: Marginal methods for incomplete longitudinal data arising in clusters. Journal of the American Statistical Association. 2002, 97(460): 1071-1080.
13. Hunsberger S, Murray D, Davis CE, Fabsitz RR: Imputation strategies for missing data in a school-based multi-centre study: the Pathways study. Stat Med. 2001, 20(2): 305-316.
14. Nixon RM, Duffy SW, Fender GR: Imputation of a true endpoint from a surrogate: application to a cluster randomized controlled trial with partial information on the true endpoint. BMC Med Res Methodol. 2003, 3: 17.
15. Green SB, Corle DK, Gail MH, Mark SD, Pee D, Freedman LS, Graubard BI, Lynn WR: Interplay between design and analysis for behavioral intervention trials with community as the unit of randomization. Am J Epidemiol. 1995, 142(6): 587-593.
16. Green SB: The advantages of community-randomized trials for evaluating lifestyle modification. Control Clin Trials. 1997, 18(6): 506-513; discussion 514-516.
17. Taljaard M, Donner A, Klar N: Imputation strategies for missing continuous outcomes in cluster randomized trials. Biom J. 2008, 50(3): 329-345.
18. Kenward MG, Carpenter J: Multiple imputation: current perspectives. Stat Methods Med Res. 2007, 16(3): 199-218.
19. Dobson AJ: An Introduction to Generalized Linear Models. 2nd edition. 2002, Boca Raton: Chapman & Hall/CRC.
20. Schafer JL: Analysis of Incomplete Multivariate Data. 1997, London: Chapman and Hall.
21. SAS Publishing: SAS/STAT 9.1 User's Guide.
22. Rubin DB, Schenker N: Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. Journal of the American Statistical Association. 1986, 81(394): 366-374.
23. Ma J, Thabane L, Kaczorowski J, Chambers L, Dolovich L, Karwalajtys T, Levitt C: Comparison of Bayesian and classical methods in the analysis of cluster randomized controlled trials with a binary outcome: the Community Hypertension Assessment Trial (CHAT). BMC Med Res Methodol. 2009, 9: 37.
24. Levin KA: Study design VII. Randomised controlled trials. Evid Based Dent. 2007, 8(1): 22-23.
25. Matthews FE, Chatfield M, Freeman C, McCracken C, Brayne C, MRC CFAS: Attrition and bias in the MRC cognitive function and ageing study: an epidemiological investigation. BMC Public Health. 2004, 4: 12.
26. Ostbye T, Steenhuis R, Wolfson C, Walton R, Hill G: Predictors of five-year mortality in older Canadians: the Canadian Study of Health and Aging. J Am Geriatr Soc. 1999, 47(10): 1249-1254.
27. Viera AJ, Garrett JM: Understanding interobserver agreement: the kappa statistic. Fam Med. 2005, 37(5): 360-363.
28. Laurenceau JP, Stanley SM, Olmos-Gallo A, Baucom B, Markman HJ: Community-based prevention of marital dysfunction: multilevel modeling of a randomized effectiveness study. J Consult Clin Psychol. 2004, 72(6): 933-943.
29. Shrive FM, Stuart H, Quan H, Ghali WA: Dealing with missing data in a multi-question depression scale: a comparison of imputation methods. BMC Med Res Methodol. 2006, 6: 57.
30. Elobeid MA, Padilla MA, McVie T, Thomas O, Brock DW, Musser B, Lu K, Coffey CS, Desmond RA, St-Onge MP, Gadde KM, Heymsfield SB, Allison DB: Missing data in randomized clinical trials for weight loss: scope of the problem, state of the field, and performance of statistical methods. PLoS One. 2009, 4(8): e6624.
31. McCulloch CE, Neuhaus JM: Prediction of random effects in linear and generalized linear models under model misspecification. Biometrics.
32. Neuhaus JM, McCulloch CE: Separating between- and within-cluster covariate effects using conditional and partitioning methods. Journal of the Royal Statistical Society, Series B. 2006, 68: 859-872.
33. Heagerty PJ, Kurland BF: Misspecified maximum likelihood estimates and generalised linear mixed models. Biometrika. 2001, 88(4): 973-985.
34. Christopher FA: Rounding after multiple imputation with non-binary categorical covariates. SAS Focus Session, SUGI 30. 2004.
35. Horton NJ, Lipsitz SR, Parzen M: A potential for bias when rounding in multiple imputation. American Statistician. 2003, 57: 229-232.
36. Li X, Mehrotra DV, Barnard J: Analysis of incomplete longitudinal binary data using multiple imputation. Stat Med. 2006, 25(12): 2107-2124.
37. Collins LM, Schafer JL, Kam CM: A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychol Methods. 2001, 6(4): 330-351.

Ma et al.; licensee BioMed Central Ltd. 2011. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Multiple Imputation in Stata: Imputing

This is part four of the Multiple Imputation in Stata series. For a list of topics covered by this series, see the Introduction. This section will talk you through the details of the imputation process. Be sure you've read at least the previous section, Creating Imputation Models, so you have a sense of what issues can affect the validity of your results.

Example Data

To illustrate the process, we'll use a fabricated data set. Unlike those in the examples section, this data set is designed to have some resemblance to real-world data.

female (binary)
race (categorical, three values)
urban (binary)
edu (ordered categorical, four values)
exp (continuous)
wage (continuous)

Missingness: Each value of all the variables except female has a 10% chance of being missing completely at random, but of course in the real world we won't know that it is MCAR ahead of time. Thus we will check whether it is MCAR or MAR (MNAR cannot be checked by looking at the observed data) using the procedure outlined in Deciding to Impute:

unab numvars: *
unab missvars: urban-wage
misstable sum, gen(miss)
foreach var of local missvars {
    local covars: list numvars - var
    display _newline(3) "logit missingness of `var' on `covars'"
    logit miss`var' `covars'
    foreach nvar of local covars {
        display _newline(3) "ttest of `nvar' by missingness of `var'"
        ttest `nvar', by(miss`var')
    }
}

See the log file for results. Our goal is to regress wages on sex, race, education level, and experience. To see the "right" answers, open the do file that creates the data set and examine the gen command that defines wage. Complete code for the imputation process can be found in the accompanying do file. The imputation process creates a lot of output; we'll put highlights on this page, and a complete log file including the associated graphs is also available. Each section of this article has links to the relevant section of the log. Click "back" in your browser to return to this page.

Setting up

The first step in using mi commands is to mi set your data. This is somewhat similar to svyset, tsset, or xtset. The mi set command tells Stata how it should store the additional imputations you'll create. We suggest using the wide format, as it is slightly faster. On the other hand, mlong uses slightly less memory.
To have Stata use the wide data structure, type:

mi set wide

To have Stata use the mlong (marginal long) data structure, type:

mi set mlong

The wide vs. long terminology is borrowed from reshape and the structures are similar. However, they are not equivalent, and you would never use reshape to change the data structure used by mi. Instead, type mi convert wide or mi convert mlong (add ", clear" if the data have not been saved since the last change). Most of the time you don't need to worry about how the imputations are stored: the mi commands figure out automatically how to apply whatever you do to each imputation. But if you need to manipulate the data in a way mi can't do for you, then you'll need to learn about the details of the structure you're using. You'll also need to be very, very careful. If you're interested in such things (including the rarely used flong and flongsep formats), run the companion do file and read the comments it contains while examining the data browser to see what the data look like in each form.

Registering Variables

The mi commands recognize three kinds of variables:

Imputed variables are variables that mi is to impute or has imputed.

Regular variables are variables that mi is not to impute, either by choice or because they are not missing any values.

Passive variables are variables that are completely determined by other variables. For example, log wage is determined by wage, or an indicator for obesity might be determined by a function of weight and height. Interaction terms are also passive variables, though if you use Stata's interaction syntax you won't have to declare them as such. Passive variables are often problematic: the examples on transformations, non-linearity, and interactions show how using them inappropriately can lead to biased estimates. If a passive variable is determined by regular variables, then it can be treated as a regular variable since no imputation is needed. Passive variables only have to be treated as such if they depend on imputed variables.

Registering a variable tells Stata what kind of variable it is. Imputed variables must always be registered:

mi register imputed varlist

where varlist should be replaced by the actual list of variables to be imputed. Regular variables often don't have to be registered, but it's a good idea:

mi register regular varlist

Passive variables must be registered:

mi register passive varlist

However, passive variables are more often created after imputing. Do so with mi passive and they'll be registered as passive automatically. In our example data, all the variables except female need to be imputed. The appropriate mi register command is:

mi register imputed race-wage

(Note that you cannot use * as your varlist even if you have to impute all your variables, because that would include the system variables added by mi set to keep track of the imputation structure.) Registering female as regular is optional, but a good idea:

mi register regular female

Checking the Imputation Model

Based on the types of the variables, the obvious imputation methods are:

race (categorical, three values): mlogit
urban (binary): logit
edu (ordered categorical, four values): ologit
exp (continuous): regress
wage (continuous): regress

female does not need to be imputed, but should be included in the imputation models, both because it is in the analysis model and because it's likely to be relevant. Before proceeding to impute we will check each of the imputation models.
Always run each of your imputation models individually, outside the mi impute chained context, to see if they converge and (insofar as it is possible) verify that they are specified correctly. Code to run each of these models is:

mlogit race i.urban exp wage i.edu i.female
logit urban i.race exp wage i.edu i.female
ologit edu i.urban i.race exp wage i.female
regress exp i.urban i.race wage i.edu i.female
regress wage i.urban i.race exp i.edu i.female

Note that when categorical variables (ordered or not) appear as covariates, i. expands them into sets of indicator variables. As we'll see later, the output of the mi impute chained command includes the commands for the individual models it runs. Thus a useful shortcut, especially if you have a lot of variables to impute, is to set up your mi impute chained command with the dryrun option to prevent it from doing any actual imputing, run it, and then copy the commands from the output into your do file for testing.

Convergence Problems

The first thing to note is that all of these models run successfully. Complex models like mlogit may fail to converge if you have large numbers of categorical variables, because that often leads to small cell sizes. To pin down the cause of the problem, remove most of the variables, make sure the model works with what's left, and then add variables back one at a time or in small groups until it stops working. With some experimentation you should be able to identify the problem variable or combination of variables. At that point you'll have to decide whether you can combine categories, drop variables, or make other changes in order to create a workable model.

Perfect Prediction

Perfect prediction is another problem to watch for. The imputation process cannot simply drop the perfectly predicted observations the way logit can. You could drop them before imputing, but that seems to defeat the purpose of multiple imputation. The alternative is to add the augment (or just aug) option to the affected methods. This tells mi impute chained to use the "augmented regression" approach, which adds fake observations with very low weights in such a way that they have a negligible effect on the results but prevent perfect prediction. For details see the section "The issue of perfect prediction during imputation of categorical data" in the Stata MI documentation.

Checking for Misspecification

You should also try to evaluate whether the models are specified correctly. A full discussion of how to determine whether a regression model is specified correctly or not is well beyond the scope of this article, but use whatever tools you find appropriate. Here are some examples.

Residual vs. Fitted Value Plots

For continuous variables, residual vs. fitted value plots (easily done with rvfplot) can be useful; several of the examples use them to detect problems. Consider the plot for experience:

regress exp i.urban i.race wage i.edu i.female
rvfplot

Note how a number of points are clustered along a line in the lower left, and no points are below it. This reflects the constraint that experience cannot be less than zero, which means that the fitted values must always be greater than or equal to the negative of the residuals, or alternatively that the residuals must be greater than or equal to the negative of the fitted values. (If the graph had the same scale on both axes, the constraint line would be a 45 degree line.)
If all the points were below a similar line rather than above it, this would tell you that there was an upper bound on the variable rather than a lower bound. The y-intercept of the constraint line tells you the limit in either case. You can also have both a lower bound and an upper bound, putting all the points in a band between them. The "obvious" model, regress, is inappropriate for experience because it won't apply this constraint. It's also inappropriate for wages for the same reason. Alternatives include truncreg, ll(0) and pmm (we'll use pmm).

Adding Interactions

In this example, it seems plausible that the relationships between variables may vary between race, gender, and urban/rural groups. Thus one way to check for misspecification is to add interaction terms to the models and see whether they turn out to be important. For example, we'll compare the obvious model:

regress exp i.race wage i.edu i.urban i.female

with one that includes interactions:

regress exp (i.race i.urban i.female)##(c.wage i.edu)

We'll run similar comparisons for the models of the other variables. This creates a great deal of output, so see the log file for results. Interactions between female and other variables are significant in the models for exp, wage, edu, and urban. There are a few significant interactions between race or urban and other variables, but not nearly as many (and keep in mind that with this many coefficients we'd expect some false positives using a significance level of .05). We'll thus impute the men and women separately. This is an especially good option for this data set because female is never missing. If it were, we'd have to drop the observations that are missing female because they could not be placed in one group or the other. In the imputation command this means adding the by(female) option. When testing models, it means starting the commands with the by female: prefix (and removing female from the lists of covariates). The improved imputation models are thus:

bysort female: reg exp i.urban i.race wage i.edu
by female: logit urban exp i.race wage i.edu
by female: mlogit race exp i.urban wage i.edu
by female: reg wage exp i.urban i.race i.edu
by female: ologit edu exp i.urban i.race wage

pmm itself cannot be run outside the imputation context, but since it's based on regression you can use regular regression to test it. These models should be tested again, but we'll omit that process.

The basic syntax for mi impute chained is:

mi impute chained (method1) varlist1 (method2) varlist2 ... = regvars

Each method specifies the method to be used for imputing the following varlist. The possibilities for method are regress, pmm, truncreg, intreg, logit, ologit, mlogit, poisson, and nbreg. regvars is a list of regular variables to be used as covariates in the imputation models but not imputed (there may not be any). The basic options are:

add(N) rseed(R) savetrace(tracefile, replace)

N is the number of imputations to be added to the data set. R is the seed to be used for the random number generator; if you do not set this you'll get slightly different imputations each time the command is run. The tracefile is a dataset in which mi impute chained will store information about the imputation process; we'll use this dataset to check for convergence. Options that are relevant to a particular method go with the method, inside the parentheses but following a comma (e.g. (mlogit, aug)).
Options that are relevant to the imputation process as a whole (like by(female)) go at the end, after the comma. For our example, the command would be:

mi impute chained (logit) urban (mlogit) race (ologit) edu (pmm) exp wage, add(5) rseed(4409) by(female)

Note that this does not include a savetrace() option. As of this writing, by() and savetrace() cannot be used at the same time, presumably because it would require one trace file for each by group. Stata is aware of this problem and we hope this will be changed soon. For purposes of this article, we'll remove the by() option when it comes time to illustrate use of the trace file. If this problem comes up in your research, talk to us about work-arounds.

Choosing the Number of Imputations

There is some disagreement among authorities about how many imputations are sufficient. Some say 3-10 in almost all circumstances, the Stata documentation suggests at least 20, while White, Royston, and Wood argue that the number of imputations should be roughly equal to the percentage of cases with missing values. However, we are not aware of any argument that increasing the number of imputations ever causes problems (just that the marginal benefit of another imputation asymptotically approaches zero). Increasing the number of imputations in your analysis takes essentially no work on your part: just change the number in the add() option to something bigger. On the other hand, it can be a lot of work for the computer; multiple imputation has introduced many researchers into the world of jobs that take hours or days to run. You can generally assume that the amount of time required will be proportional to the number of imputations used (e.g. if a do file takes two hours to run with five imputations, it will probably take about four hours to run with ten imputations). So here's our suggestion:

Start with five imputations (the low end of what's broadly considered legitimate).
Work on your research project until you're reasonably confident you have the analysis in its final form. Be sure to do everything with do files so you can run it again at will.
Note how long the process takes, from imputation to final analysis.
Consider how much time you have available and decide how many imputations you can afford to run, using the rule of thumb that the time required is proportional to the number of imputations. If possible, make the number of imputations roughly equal to the percentage of cases with missing data (a high-end estimate of what's required). Allow time to recover if things go wrong, as they generally do.
Increase the number of imputations in your do file and start it. Do something else while the do file runs, like write your paper.

Adding imputations shouldn't change your results significantly, and in the unlikely event that they do, consider yourself lucky to have found that out before publishing.

Speeding up the Imputation Process

Multiple imputation has introduced many researchers into the world of jobs that take hours, days, or even weeks to run. Usually it's not worth spending your time to make Stata code run faster, but multiple imputation can be an exception. Use the fastest computer available to you. For SSCC members that means learning to run jobs on Linstat, the SSCC's Linux computing cluster. Linux is not as difficult as you may think: Using Linstat has instructions. Multiple imputation involves more reading and writing to disk than most Stata commands. Sometimes this includes writing temporary files in the current working directory, so where that directory lives can matter (see the sketch below).
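As a rough illustration of that point only (the paths below are hypothetical, not SSCC's actual directories), a do file might point the working directory at fast local scratch space while the master copy of the data stays on network storage:

* hypothetical paths: Z:\project is network storage, C:\temp is fast local disk
cd "C:\temp"                                     // temporary files go to local disk
use "Z:\project\mydata.dta", clear               // master data stay on the network
* ... mi set, mi register, and mi impute chained would run here ...
save "Z:\project\mydata_imputed.dta", replace    // save the results back to network storage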
Use the fastest disk space available to you, both for your data set and for the working directory. In general local disk space will be faster than network disk space, and on Linstat ramdisk (a "directory" that is actually stored in RAM) will be faster than local disk space. On the other hand, you would not want to permanently store data sets anywhere but network disk space. So consider having your do file copy the data set to fast local disk space (or ramdisk) at the start, work there, and save permanent results back to network disk space at the end. This applies when you're using imputed data as well: if your data set is large enough that working with it after imputation is slow, the above procedure may help.

Checking for Convergence

MICE is an iterative process. In each iteration, mi impute chained first estimates the imputation model, using both the observed data and the imputed data from the previous iteration. It then draws new imputed values from the resulting distributions. Note that as a result, each iteration has some autocorrelation with the previous iteration. The first iteration must be a special case: in it, mi impute chained first estimates the imputation model for the variable with the fewest missing values based only on the observed data and draws imputed values for that variable. It then estimates the model for the variable with the next fewest missing values, using both the observed values and the imputed values of the first variable, and proceeds similarly for the rest of the variables. Thus the first iteration is often atypical, and because iterations are correlated it can make subsequent iterations atypical as well. To avoid this, mi impute chained by default goes through ten iterations for each imputed data set you request, saving only the results of the tenth iteration. The first nine iterations are called the burn-in period. Normally this is plenty of time for the effects of the first iteration to become insignificant and for the process to converge to a stationary state. However, you should check for convergence and, if necessary, increase the number of iterations with the burnin() option to ensure it.

To check, examine the trace file saved by mi impute chained. It contains the mean and standard deviation of each imputed variable in each iteration. These will vary randomly, but they should not show any trend. An easy way to check is with tsline, but it requires reshaping the data first. Our preferred imputation model uses by(), so it cannot save a trace file; thus we'll remove by() for the moment. We'll also increase the burnin() option to 100 so it's easier to see what a stable trace looks like. We'll then use reshape and tsline to check for convergence:

preserve
mi impute chained (logit) urban (mlogit) race (ologit) edu (pmm) exp wage female, add(5) rseed(88) savetrace(extrace, replace) burnin(100)
use extrace, replace
reshape wide *mean *sd, i(iter) j(m)
tsset iter
tsline exp_mean*, title("Mean of Imputed Values of Experience") note("Each line is for one imputation") legend(off)
graph export conv1.png, replace
tsline exp_sd*, title("Standard Deviation of Imputed Values of Experience") note("Each line is for one imputation") legend(off)
graph export conv2.png, replace
restore

The resulting graphs do not show any obvious problems. If you do see signs that the process may not have converged after the default ten iterations, increase the number of iterations performed before saving imputed values with the burnin() option. If convergence is never achieved, this indicates a problem with the imputation model.
Checking the Imputed Values

After imputing, you should check to see if the imputed data resemble the observed data. Unfortunately there's no formal test to determine what's "close enough." Of course if the data are MAR but not MCAR, the imputed data should be systematically different from the observed data. Ironically, the fewer missing values you have to impute, the more variation you'll see between the imputed data and the observed data (and between imputations).

For binary and categorical variables, compare frequency tables. For continuous variables, comparing means and standard deviations is a good starting point, but you should look at the overall shape of the distribution as well. For that we suggest kernel density graphs or perhaps histograms. Look at each imputation separately rather than pooling all the imputed values so you can see if any one of them went wrong.

The mi xeq: prefix tells Stata to apply the subsequent command to each imputation individually. It also applies to the original data, the "zeroth imputation." Thus:

mi xeq: tab race

will give you six frequency tables: one for the original data, and one for each of the five imputations. However, we want to compare the observed data to just the imputed data, not the entire data set. This requires adding an if condition to the tab commands for the imputations, but not the observed data. Add a number or numlist to have mi xeq act on particular imputations:

mi xeq 0: tab race
mi xeq 1/5: tab race if missrace

This creates frequency tables for the observed values of race and then the imputed values in all five imputations. If you have a significant number of variables to examine you can easily loop over them:

foreach var of varlist urban race edu {
    mi xeq 0: tab `var'
    mi xeq 1/5: tab `var' if miss`var'
}

For results see the log file. Running summary statistics on continuous variables follows the same process, but creating kernel density graphs adds a complication: you need to either save the graphs or give yourself a chance to look at them. mi xeq: can carry out multiple commands for each imputation: just place them all in one line with a semicolon (;) at the end of each. (This will not work if you've changed the general end-of-command delimiter to a semicolon.) The sleep command tells Stata to pause for a specified period, measured in milliseconds.

mi xeq 0: kdensity wage; sleep 1000
mi xeq 1/5: kdensity wage if misswage; sleep 1000

Again, this can all be automated:

foreach var of varlist wage exp {
    mi xeq 0: sum `var'
    mi xeq 1/5: sum `var' if miss`var'
    mi xeq 0: kdensity `var'; sleep 1000
    mi xeq 1/5: kdensity `var' if miss`var'; sleep 1000
}

Saving the graphs turns out to be a bit trickier, because you need to give the graph from each imputation a different file name. Unfortunately you cannot access the imputation number within mi xeq. However, you can do a forvalues loop over imputation numbers, then have mi xeq act on each of them:

forval i = 1/5 {
    mi xeq `i': kdensity exp if missexp
    graph export exp`i'.png, replace
}

Integrating this with the previous version gives:

foreach var of varlist wage exp {
    mi xeq 0: sum `var'
    mi xeq 1/5: sum `var' if miss`var'
    mi xeq 0: kdensity `var'
    graph export chk`var'0.png, replace
    forval i = 1/5 {
        mi xeq `i': kdensity `var' if miss`var'
        graph export chk`var'`i'.png, replace
    }
}

For results, see the log file. It's troublesome that in all imputations the mean of the imputed values of wage is higher than the mean of the observed values of wage, and the mean of the imputed values of exp is lower than the mean of the observed values of exp.
We did not find evidence that the data is MAR but not MCAR, so we'd expect the means of the imputed data to be clustered around the means of the observed data. There is no formal test to tell us definitively whether this is a problem or not. However, it should raise suspicions, and if the final results with these imputed data are different from the results of complete case analysis, it raises the question of whether the difference is due to problems with the imputation model.

Last Revised: 8/23/2012

Statistical Computing Seminars: Missing Data in SAS, Part 1

Introduction

Missing data is a common issue, and more often than not, we deal with the matter of missing data in an ad hoc fashion. The purpose of this seminar is to discuss commonly used techniques for handling missing data and common issues that could arise when these techniques are used. In particular, we will focus on one of the most popular methods, multiple imputation. We are not advocating in favor of any one technique to handle missing data; depending on the type of data and model you will be using, other techniques such as direct maximum likelihood may better serve your needs. We have chosen to explore multiple imputation through an examination of the data, a careful consideration of the assumptions needed to implement this method, and a clear understanding of the analytic model to be estimated. We hope this seminar will help you to better understand the scope of the issues you might face when dealing with missing data using this method.

The data set hsbmar.sas7bdat, which is based on hsb2.sas7bdat, is used for this seminar and can be downloaded by following the link. The SAS code for this seminar was developed using SAS 9.4 and SAS/STAT 13.1. Some of the variables have value labels (formats) associated with them; here is the setup for reading the value labels correctly.

Goals of statistical analysis with missing data:

- Minimize bias
- Maximize use of available information
- Obtain appropriate estimates of uncertainty

Exploring missing data mechanisms

The missing data mechanism describes the process that is believed to have generated the missing values. Missing data mechanisms generally fall into one of three main categories. There are precise technical definitions for these terms in the literature; the following explanation necessarily contains simplifications.

Missing completely at random (MCAR)

A variable is missing completely at random if neither the variables in the dataset nor the unobserved value of the variable itself predict whether a value will be missing. Missing completely at random is a fairly strong assumption and may be relatively rare. One relatively common situation in which data are missing completely at random occurs when a subset of cases is randomly selected to undergo additional measurement; this is sometimes referred to as "planned missing."
For example, in some health surveys, some subjects are randomly selected to undergo more extensive physical examination; therefore only a subset of participants will have complete information for these variables. Missing completely at random also allows for missingness on one variable to be related to missingness on another, e.g. var1 is missing whenever var2 is missing. For example, a husband and wife are both missing information on height.

Missing at random (MAR)

A variable is said to be missing at random if other variables (but not the variable itself) in the dataset can be used to predict missingness on a given variable. For example, in surveys, men may be more likely to decline to answer some questions than women (i.e. gender predicts missingness on another variable). MAR is a less restrictive assumption than MCAR. Under this assumption the probability of missingness does not depend on the true values after controlling for the observed variables. MAR is also related to ignorability: the missing data mechanism is said to be ignorable if the data are missing at random and the probability of missingness does not depend on the missing information itself. The assumption of ignorability is needed for optimal estimation of missing information and is a required assumption for both of the missing data techniques we will discuss.

Missing not at random (MNAR)

Finally, data are said to be missing not at random if the value of the unobserved variable itself predicts missingness. A classic example of this is income: individuals with very high incomes are more likely to decline to answer questions about their income than individuals with more moderate incomes.

An understanding of the missing data mechanism(s) present in your data is important because different types of missing data require different treatments. When data are missing completely at random, analyzing only the complete cases will not result in biased parameter estimates (e.g. regression coefficients). However, the sample size for an analysis can be substantially reduced, leading to larger standard errors. In contrast, analyzing only complete cases for data that are either missing at random or missing not at random can lead to biased parameter estimates. Multiple imputation and other modern methods such as direct maximum likelihood generally assume that the data are at least MAR, meaning that these procedures can also be used on data that are missing completely at random. Statistical models have also been developed for modeling MNAR processes; however, these models are beyond the scope of this seminar. For more information on missing data mechanisms please see: Allison, 2002; Enders, 2010; Little & Rubin, 2002; Rubin, 1976; Schafer & Graham, 2002.

Full data: Below is a regression model predicting read using the complete data set (hsb2) used to create hsbmar. We will use these results for comparison.

Common techniques for dealing with missing data

In this section, we are going to discuss some common techniques for dealing with missing data and briefly discuss their limitations:

- Complete case analysis (listwise deletion)
- Available case analysis (pairwise deletion)
- Mean imputation
- Single imputation
- Stochastic imputation

1. Complete Case Analysis: This method involves deleting cases in a particular dataset that are missing data on any variable of interest. It is a common technique because it is easy to implement and works with any type of analysis. Below we look at some of the descriptive statistics of the data set hsbmar,
which contains test scores, as well as demographic and school information, for 200 high school students. Note that although the dataset contains 200 cases, six of the variables have fewer than 200 observations. The missing information varies between 4.5% (read) and 9% (female and prog) of cases depending on the variable. This doesn't seem like a lot of missing data, so we might be inclined to try to analyze the observed data as they are, a strategy sometimes referred to as complete case analysis. Below is a regression model where the dependent variable read is regressed on write, math, female and prog. Notice that the default behavior of proc glm is complete case analysis (also referred to as listwise deletion). Looking at the output, we see that only 130 cases were used in the analysis; in other words, more than one third of the cases in our dataset (70/200) were excluded from the analysis because of missing data. The reduction in sample size (and statistical power) alone might be considered a problem, but complete case analysis can also lead to biased estimates. Specifically, you will see below that the estimates for the intercept, write, math and prog are different from the regression model on the complete data. Also, the standard errors are all larger due to the smaller sample size, resulting in the parameter estimate for female almost becoming non-significant. Unfortunately, unless the mechanism of missing data is MCAR, this method will introduce bias into the parameter estimates.

2. Available Case Analysis: This method involves estimating means, variances and covariances based on all available non-missing cases, meaning that a covariance (or correlation) matrix is computed where each element is based on the full set of cases with non-missing values for that pair of variables. This method became popular because the loss of power due to missing information is not as substantial as with complete case analysis. Below we look at the pairwise correlations between the outcome read and each of the predictors write, prog, female, and math. Depending on the pairwise comparison examined, the sample size will change based on the amount of missingness present in one or both variables. Because proc glm does not accept covariance matrices as data input, the following example will be done with proc reg. This requires us to create dummy variables for our categorical predictor prog, since there is no class statement in proc reg. By default proc corr uses pairwise deletion to estimate the correlation table. The options on the proc corr statement, cov and outp, will output a variance-covariance matrix based on pairwise deletion that will be used in the subsequent regression model. The first thing you should see is the note that SAS prints to your log file stating "N not equal across variables in data set. This may not be appropriate. The smallest value will be used." One of the main drawbacks of this method is the lack of a consistent sample size. You will also notice that the parameter estimates presented here are different from the estimates obtained from the analysis of the full data and from the listwise deletion approach. For instance, the variable female had an estimated effect of -2.7 with the full data but was attenuated to -1.85 in the available case analysis. Unless the mechanism of missing data is MCAR, this method will introduce bias into the parameter estimates. Therefore, this method is not recommended.
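As a rough sketch of the two approaches just described (the class statement, the solution option, and the dummy variable names prog2 and prog3 are assumptions on our part, not the seminar's exact code):

* Complete case analysis: proc glm silently drops any case with a missing value;
proc glm data = hsbmar;
  class female prog;
  model read = write math female prog / solution;
run;
quit;

* Available case analysis: pairwise-deletion covariance matrix fed to proc reg;
* (prog2 and prog3 are hypothetical dummy variables created from prog beforehand);
proc corr data = hsbmar cov outp = hsb_cov noprint;
  var read write math female prog2 prog3;
run;

proc reg data = hsb_cov;
  model read = write math female prog2 prog3;
run;
quit;

Because proc reg is fed only the pairwise covariance matrix, each element of that matrix may be based on a different number of cases, which is exactly the inconsistency SAS warns about in the log.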
3. Unconditional Mean Imputation: This method involves replacing the missing values for an individual variable with its overall estimated mean from the available cases. While this is a simple and easily implemented method for dealing with missing values, it has some unfortunate consequences. The most important problem with mean imputation, also called mean substitution, is that it results in an artificial reduction in variability, because you are imputing values at the center of the variable's distribution. This also has the unintended consequence of changing the magnitude of correlations between the imputed variable and other variables. We can demonstrate this phenomenon in our data. Below are tables of the means and standard deviations of the four variables in our regression model BEFORE and AFTER a mean imputation, as well as their corresponding correlation matrices. We will again utilize the prog dummy variables we created previously. You will notice that there is very little change in the mean (as you would expect); however, the standard deviation is noticeably lower after substituting in mean values for the observations with missing information. This is because you reduce the variability in your variables when you impute everyone at the mean. Moreover, you can see in the table of "Pearson Correlation Coefficients" that the correlations between each of our predictors of interest (write, math, female, and prog), as well as between the predictors and the outcome read, have now been attenuated. Therefore, regression models that seek to estimate the associations between these variables will also see their effects weakened.

4. Single or Deterministic Imputation: A slightly more sophisticated type of imputation is regression or conditional mean imputation, which replaces missing values with predicted scores from a regression equation. The strength of this approach is that it uses complete information to impute values. The drawback is that all the predicted values fall directly on the regression line, once again decreasing variability, just not as much as with unconditional mean imputation. Moreover, statistical models cannot distinguish between observed and imputed values and therefore do not incorporate into the model the error or uncertainty associated with an imputed value. Additionally, you will see that this method also inflates the associations between variables because it imputes values that are perfectly correlated with one another. Unfortunately, even under the assumption of MCAR, regression imputation will upwardly bias correlations and R-squared statistics. Further discussion and an example of this can be found in Craig Enders' book "Applied Missing Data Analysis" (2010).

5. Stochastic Imputation: In recognition of the problems with regression imputation and the reduced variability associated with that approach, researchers developed a technique to incorporate or "add back" lost variability. A residual term, randomly drawn from a normal distribution with mean zero and variance equal to the residual variance from the regression model, is added to the predicted scores from the regression imputation, thus restoring some of the lost variability. This method is superior to the previous methods as it will produce unbiased coefficient estimates under MAR. However, the standard errors produced during regression estimation, while less biased than with the single imputation approach, will still be attenuated.
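To reproduce the kind of before-and-after comparison described for mean imputation above, one option (a sketch under our own assumptions, not the seminar's code) is proc stdize with the reponly option, which simply fills each missing value with that variable's mean; proc corr then prints means, standard deviations and correlations for each version so the shrinkage is easy to see:

* Means, SDs and correlations before mean imputation (pairwise N);
proc corr data = hsbmar;
  var read write math female prog2 prog3;
run;

* Replace missing values on the test scores with the variable means;
proc stdize data = hsbmar out = hsb_meanimp method = mean reponly;
  var read write math;
run;

* Means, SDs and correlations after mean imputation: SDs and correlations shrink;
proc corr data = hsb_meanimp;
  var read write math female prog2 prog3;
run;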
While you might be inclined to use one of these more traditional methods, consider this statement: "Missing data analyses are difficult because there is no inherently correct methodological procedure. In many (if not most) situations, blindly applying maximum likelihood estimation or multiple imputation will likely lead to a more accurate set of estimates than using one of the previously mentioned missing data handling techniques" (p. 344, Applied Missing Data Analysis, 2010).

Multiple Imputation

Multiple imputation is essentially an iterative form of stochastic imputation. However, instead of filling in a single value, the distribution of the observed data is used to estimate multiple values that reflect the uncertainty around the true value. These values are then used in the analysis of interest, such as an OLS model, and the results combined. Each imputed value includes a random component whose magnitude reflects the extent to which the other variables in the imputation model cannot predict its true value (Johnson and Young, 2011; White et al., 2010), thus building into the imputed values a level of uncertainty around the "truthfulness" of the imputed values. A common misconception of missing data methods is the assumption that imputed values should represent "real" values. The purpose when addressing missing data is to correctly reproduce the variance-covariance matrix we would have observed had our data not had any missing information. MI has three basic phases:

1. Imputation or fill-in phase: The missing data are filled in with estimated values and a complete data set is created. This fill-in process is repeated m times.
2. Analysis phase: Each of the m complete data sets is then analyzed using a statistical method of interest (e.g. linear regression).
3. Pooling phase: The parameter estimates (e.g. coefficients and standard errors) obtained from each analyzed data set are then combined for inference.

The imputation method you choose depends on the pattern of missing information as well as the type of variable(s) with missing information.

Imputation Model, Analytic Model and Compatibility: When developing your imputation model, it is important to assess whether your imputation model is "congenial" or consistent with your analytic model. Consistency means that your imputation model includes (at the very least) the same variables that are in your analytic or estimation model. This includes any transformations of variables that will be needed to assess your hypothesis of interest, such as log transformations, interaction terms, or recodes of a continuous variable into a categorical form, if that is how it will be used in later analysis. The reason for this relates back to the earlier comments about the purpose of multiple imputation. Since we are trying to reproduce the proper variance-covariance matrix for estimation, all relationships between our analytic variables should be represented and estimated simultaneously. Otherwise, you are imputing values assuming they have a correlation of zero with the variables you did not include in your imputation model. This would result in underestimating the associations between parameters of interest in your analysis and a loss of power to detect properties of your data that may be of interest, such as non-linearities and statistical interactions. For additional reading on this particular topic see: von Hippel, 2009; von Hippel, 2013; White et al., 2010.
Preparing to conduct MI

First step: Examine the number and proportion of missing values among your variables of interest. The proc means procedure in SAS has an option called nmiss that will count the number of missing values for the variables specified. You can also create missing data flags, or indicator variables, for the missing information to assess the proportion of missingness.

Second step: Examine the missing data patterns. The "Missing Data Patterns" table can be requested without actually performing a full imputation by specifying the option nimpute=0 (zero imputed datasets to be created) on the proc mi statement line. Each "group" represents a set of observations in the data set that share the same pattern of missing information. For example, group 1 represents the 130 observations in the data that have complete information on all 5 variables of interest. This procedure also provides means for each variable within each group. You can see that there are a total of 12 patterns for the specified variables. The estimated means associated with each missing data pattern can also give you an indication of whether the assumption of MCAR or MAR is appropriate. If you begin to observe that those with certain missing data patterns appear to have a very different distribution of values, this is an indication that your data may not be MCAR. Moreover, depending on the nature of the data, you may recognize patterns such as monotone missingness, which can be observed in longitudinal data when an individual drops out at a particular time point and therefore all data after that point are missing. Additionally, you may identify skip patterns that were missed in your original review of the data that should then be dealt with before moving forward with the multiple imputation.

Third step: If necessary, identify potential auxiliary variables. Auxiliary variables are variables in your data set that are either correlated with a missing variable(s) (the recommendation is r > 0.4) or are believed to be associated with missingness. These are factors that are not of particular interest in your analytic model, but they are added to the imputation model to increase power and/or to help make the assumption of MAR more plausible. These variables have been found to improve the quality of imputed values generated from multiple imputation. Moreover, research has demonstrated their particular importance when imputing a dependent variable and/or when you have variables with a high proportion of missing information (Johnson and Young, 2011; Young and Johnson, 2010; Enders, 2010). You may know a priori of several variables you believe would make good auxiliary variables based on your knowledge of the data and subject matter. Additionally, a good review of the literature can often help identify them as well. However, if you are not sure which variables in the data would be potential candidates (this is often the case when conducting secondary data analysis), you can use some simple methods to help identify candidates. One way to identify these variables is by examining associations between write, read, female, and math and other variables in the dataset. For example, let's take a look at the correlation matrix between our 4 variables of interest and two other test score variables, science and socst. Science and socst both appear to be good auxiliaries because they are well correlated (r > 0.4) with all the other test score variables of interest. You will also notice that they are not well correlated with female.
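A minimal sketch of these preparation steps, assuming the hsbmar variable names used throughout the seminar (the flag variable miss_math and the exact variable lists are our own illustrative choices, not the seminar's code):

* Step 1: count missing values and build a missing-data flag;
proc means data = hsbmar nmiss n;
  var read write math female prog;
run;

data hsbmar;
  set hsbmar;
  miss_math = missing(math);   * 1 if math is missing, 0 otherwise;
run;

* Step 2: missing data pattern table without imputing (nimpute=0);
proc mi data = hsbmar nimpute = 0;
  var read write math female prog;
run;

* Step 3: screen candidate auxiliary variables by their correlations;
proc corr data = hsbmar;
  var read write math female;
  with science socst;
run;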
A good auxiliary does not have to be correlated with every variable to be used. You will also notice that science has missing information of its own; a good auxiliary is not required to have complete information to be valuable. It can have missing values and still be effective in reducing bias (Enders, 2010). One area that is still under active research is whether it is beneficial to include a variable as an auxiliary if it does not pass the 0.4 correlation threshold with any of the variables to be imputed. Some researchers believe that including these types of items introduces unnecessary error into the imputation model (Allison, 2012), while others do not believe that there is any harm in this practice (Enders, 2010). Thus, we leave it up to you as the researcher to use your best judgment.

Good auxiliary variables can also be correlates or predictors of missingness. Let's use the missing data flags we made earlier to help us identify some variables that may be good correlates. We examine whether our potential auxiliary variable socst also appears to predict missingness. Below are a set of t-tests testing whether the mean socst or science scores differ significantly between those with missing information and those without. The only significant difference was found when examining missingness on math with socst: the mean socst score is significantly lower among the respondents who are missing on math. This suggests that socst is a potential correlate of missingness (Enders, 2010) and may help us satisfy the MAR assumption for multiple imputation if we include it in our imputation model.

Example 1: MI using the multivariate normal distribution (MVN)

When choosing to impute one or many variables, one of the first decisions you will make is the type of distribution under which you want to impute your variable(s). One method available in SAS uses Markov Chain Monte Carlo (MCMC), which assumes that all the variables in the imputation model have a joint multivariate normal distribution. This is probably the most common parametric approach for multiple imputation. The specific algorithm used is called the data augmentation (DA) algorithm, which belongs to the family of MCMC procedures. The algorithm fills in missing data by drawing from a conditional distribution, in this case a multivariate normal, of the missing data given the observed data. In most cases, simulation studies have shown that assuming an MVN distribution leads to reliable estimates even when the normality assumption is violated, given a sufficient sample size (Demirtas et al. 2008; KJ Lee, 2010). However, biased estimates have been observed when the sample size is relatively small and the fraction of missing information is high. Note: since we are using a multivariate normal distribution for imputation, decimal and negative values are possible. These values are not a problem for estimation; however, we will need to create dummy variables for the nominal categorical variables so the parameter estimates for each level can be interpreted.

Imputation in SAS requires 3 procedures. The first is proc mi, where the user specifies the imputation model to be used and the number of imputed datasets to be created. The second procedure runs the analytic model of interest (here a linear regression using proc glm) within each of the imputed datasets. The third step runs a procedure called proc mianalyze, which combines all the estimates (coefficients and standard errors) across the imputed datasets and outputs one set of parameter estimates for the model of interest.
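As a hedged sketch of these three steps (the dummy variables prog2 and prog3, the auxiliary variables on the var statement, the seed value, and the use of an ODS OUTPUT statement to capture the estimates are assumptions on our part, not necessarily the seminar's exact code):

* Step 1: impute, assuming joint multivariate normality (MCMC data augmentation is the proc mi default);
proc mi data = hsbmar nimpute = 10 out = mimvn seed = 54321;
  var read write math female prog2 prog3 science socst;
run;

* Step 2: fit the analytic model in each imputed dataset;
proc glm data = mimvn;
  model read = write math female prog2 prog3;
  by _Imputation_;
  ods output ParameterEstimates = amvn;
run;
quit;

* Step 3: pool the coefficients and standard errors across the 10 imputations;
proc mianalyze parms = amvn;
  modeleffects Intercept write math female prog2 prog3;
run;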
On the proc mi statement we use the nimpute option to specify the number of imputations to be performed. The imputed datasets are output using the out option and are stored appended, or "stacked," together in a dataset called "mimvn". An indicator variable called _Imputation_ is automatically created by the procedure to number each new imputed dataset. On the var statement, all the variables for the imputation model are specified, including all the variables in the analytic model as well as any auxiliary variables. The seed option is not required, but since MI is designed to be a random process, setting a seed will allow you to obtain the same imputed datasets each time.

The second procedure estimates the linear regression model for each imputed dataset individually, using a by statement with the indicator variable created previously. You will observe in the Results Viewer that SAS outputs the parameter estimates for each of the 10 imputations. The ODS output statement stores the parameter estimates from the regression models in the dataset named "amvn"; this dataset will be used in the next step of the process, the pooling phase.

Proc mianalyze uses the dataset "amvn", which contains the parameter estimates and associated covariance matrices for each imputation. The variance-covariance matrix is needed to estimate the standard errors. This step combines the parameter estimates into a single set of statistics that appropriately reflect the uncertainty associated with the imputed values. The pooled coefficients are simply an arithmetic mean of the individual coefficients estimated for each of the 10 regression models. Averaging the parameter estimates dampens the variation, thus increasing efficiency and decreasing sampling variation. Estimation of the standard error for each variable is a little more complicated and is discussed in the next section. If you compare these estimates to those from the complete data, you will observe that they are, in general, quite comparable. The variables write, female and math are significant in both sets of data. You will also observe a small inflation in the standard errors, which is to be expected since the multiple imputation process is designed to build additional uncertainty into our estimates.

2. Imputation Diagnostics: Above the "Parameter Estimates" table in the SAS output you will see a table called "Variance Information". It is important to examine the output from proc mianalyze, as several pieces of the information can be used to assess how well the imputation performed. Below we discuss each piece:

Variance Between (V_B): This is a measure of the variability in the parameter estimates (coefficients) obtained from the 10 imputed datasets. For example, if you took all 10 of the parameter estimates for write and calculated their variance, this would equal V_B = 0.000262. This variability estimates the additional variation (uncertainty) that results from missing data.

Variance Within (V_W): This is simply the arithmetic mean of the sampling variances (squared standard errors) from each of the 10 imputed datasets. For example, if you squared the standard errors for write for all 10 imputations, summed them and divided by 10, this would equal V_W = 0.006014.
This estimates the sampling variability that we would have expected had there been no missing data.

Variance Total (V_T): The primary usefulness of MI comes from how the total variance is estimated. The total variance is the sum of multiple sources of variance. While regression coefficients are just averaged across imputations, Rubin's formula (Rubin, 1987) partitions the variance into a "within imputation" part capturing the expected uncertainty and a "between imputation" part capturing the estimation variability due to missing information (Graham, 2007; White et al., 2010). The total variance is the sum of 3 sources of variance: the between-imputation variance, the within-imputation variance, and an additional source of sampling variance. For example, the total variance for the variable write would be calculated like this: V_T = V_B + V_W + V_B/m = 0.000262 + 0.006014 + 0.000262/10 = 0.006302. The additional sampling variance is literally the variance between divided by m. This value represents the sampling error associated with the overall or average coefficient estimates and serves as a correction factor for using a specific number of imputations. It becomes smaller as more imputations are conducted; the idea is that the larger the number of imputations, the more precise the parameter estimates will be. Bottom line: the main difference between multiple imputation and the single imputation methods is in the estimation of the variances. The SE for each parameter estimate is the square root of its V_T.

Degrees of Freedom (DF): Unlike analysis with non-imputed data, sample size does not directly influence the estimate of DF; DF actually continues to increase as the number of imputations increases. The standard formula used to calculate DF can result in fractional estimates as well as estimates that far exceed the DF that would have resulted had the data been complete. By default the DF = infinity. Note: starting in SAS v8, a formula to adjust for the problem of inflated DF has been implemented (Barnard and Rubin, 1999). Use the EDF option on the proc mianalyze line to indicate to SAS the proper adjusted DF. Bottom line: the standard formula assumes that the estimator has a normal distribution, i.e. a t-distribution with infinite degrees of freedom. In large samples this is not usually an issue, but it can be with smaller sample sizes; in that case, the corrected formula should be used (Lipsitz et al. 2002).

Relative Increase in Variance (RIV/RVI): The proportional increase in total sampling variance that is due to missing information, (V_B + V_B/m) / V_W. For example, the RVI for write is 0.048; this means that the estimated sampling variance for write is 4.8% larger than its sampling variance would have been had the data on write been complete. Bottom line: variables with large amounts of missing information and/or that are weakly correlated with other variables in the imputation model will tend to have high RVIs.

Fraction of Missing Information (FMI): Directly related to the RVI, this is the proportion of the total sampling variance that is due to missing data, (V_B + V_B/m) / V_T. It is estimated based on the percentage missing for a particular variable and how correlated this variable is with other variables in the imputation model. The interpretation is similar to an R-squared: an FMI of 0.046 for write means that 4.6% of the total sampling variance is attributable to missing data. The accuracy of the FMI estimate increases as the number of imputations increases because the variance estimates become more stable. This is especially important in the presence of a variable(s) with a high proportion of missing information. If convergence of your imputation model is slow, examine the FMI estimates for each variable in your imputation model; a high FMI can indicate a problematic variable. Bottom line: if FMI is high for any particular variable(s), consider increasing the number of imputations. A good rule of thumb is to have the number of imputations (at least) equal the highest FMI percentage.
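For reference, the quantities above can be written compactly. This is our summary of Rubin's combining rules as applied by proc mianalyze, with FMI given in the simplified form that matches the ratios quoted above and the relative efficiency formula anticipated from the next section:

\[
V_T = V_W + V_B + \frac{V_B}{m},
\qquad \mathrm{SE}_{\text{pooled}} = \sqrt{V_T}
\]
\[
\mathrm{RVI} = \frac{V_B + V_B/m}{V_W},
\qquad \mathrm{FMI} \approx \frac{V_B + V_B/m}{V_T},
\qquad \mathrm{RE} = \frac{1}{1 + \mathrm{FMI}/m}
\]

Plugging in the write example: V_T = 0.006014 + 0.000262 + 0.000262/10 = 0.006302, RVI = 0.000288/0.006014 = 0.048, and FMI = 0.000288/0.006302 = 0.046, matching the values reported above.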
Relative Efficiency (RE): The relative efficiency of an imputation (how well the true population parameters are estimated) is related to both the amount of missing information and the number (m) of imputations performed. When the amount of missing information is very low, adequate efficiency may be achieved with only a few imputations (the minimum number given in most of the literature is 5). However, when there is a high amount of missing information, more imputations are typically necessary to achieve adequate efficiency for parameter estimates. You can obtain relatively good efficiency even with a small m; however, this does not mean that the standard errors will be estimated well. More imputations are often necessary for proper standard error estimation, as the variability between imputed datasets incorporates the necessary amount of uncertainty around the imputed values. The direct relationship between RE, m and FMI is RE = 1/(1 + FMI/m). This formula represents the relative efficiency of using m imputations versus an infinite number of imputations. To get an idea of what this looks like in practice, take a look at the figure in the SAS documentation plotting this relationship, where m is the number of imputations and lambda is the FMI. Bottom line: it may appear that you can get good RE with a few imputations; however, it often takes more imputations to get good estimates of the variances than good estimates of parameters like means or regression coefficients.

After performing an imputation it is also useful to look at means, frequencies and box plots comparing observed and imputed values to assess whether the range appears reasonable. You may also want to examine plots of residuals and outliers for each imputed dataset individually. If anomalies are evident in only a small number of imputations, this indicates a problem with the imputation model (White et al., 2010). You should also assess convergence of your imputation model. This should be done for different imputed variables, but especially for those variables with a high proportion of missing information (e.g. a high FMI). Convergence of the proc mi procedure means that the DA algorithm has reached an appropriate stationary posterior distribution. Convergence for each imputed variable can be assessed using trace plots, which can be requested with the plots= option on the mcmc statement of the proc mi procedure. Long-term trends in trace plots and high serial dependence are indicative of slow convergence to stationarity. A stationary process has a mean and variance that do not change over time. By default SAS will provide trace plots of the estimated means for each variable, but you can also request them for the standard deviations. You can take a look at examples of good and bad trace plots in the SAS User's Guide section on "Assessing Markov Chain Convergence". Above is an example of a trace plot for the mean social studies score. There are two main things you want to note in a trace plot. First, assess whether the algorithm appears to have reached a stable posterior distribution by examining the plot to see whether the mean remains relatively constant and there is an absence of any sort of trend (indicating a sufficient amount of randomness in the means between iterations). In our case, this looks to be true. Second, examine the plot to see how long it takes to reach this stationary phase. In the above example it appears to happen almost immediately, indicating good convergence. The dotted lines mark the iterations at which each imputed dataset is drawn. By default the burn-in period (the number of iterations before the first set of imputed values is drawn) is 200; this can be increased, if it appears that proper convergence is not achieved, using the nbiter option on the mcmc statement.

Another plot that is very useful for assessing convergence is the autocorrelation plot, also specified on the mcmc statement using plots=acf. This helps us assess possible autocorrelation of parameter values between iterations. Say you noticed a trend in the mean social studies scores in the previous trace plot; you might then want to assess the magnitude of the observed dependency of scores across iterations, and the autocorrelation plot will show you that. In the plot below, you will see that the correlation is perfect when the MCMC algorithm starts but quickly goes to near zero after a few iterations, indicating almost no correlation between iterations and therefore no correlation between values in adjacent imputed datasets. By default SAS draws an imputed dataset every 100 iterations; if correlation appears high for longer than that, you will need to increase the number of iterations between imputed datasets using the niter option. Take a look at the SAS 9.4 proc mi documentation for more information about this and other options. Note: the amount of time it takes to get to zero (or near zero) correlation is an indication of convergence time (Enders, 2010). For more information on these and other diagnostic tools, please see Enders, 2010 and Rubin, 1987.
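A short sketch of how these diagnostics might be requested (assuming the ODS Graphics plots= request on the mcmc statement; the nimpute, seed and iteration counts are arbitrary illustrative values, not the seminar's code):

* Request trace and autocorrelation plots from the MCMC data augmentation run;
proc mi data = hsbmar nimpute = 10 out = mimvn seed = 54321;
  mcmc plots = (trace acf) nbiter = 500 niter = 200;   * longer burn-in and more iterations between draws;
  var read write math female prog2 prog3 science socst;
run;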
Example 2: MI using fully conditional specification (FCS), also known as imputation by chained equations (ICE) or sequential generalized regression

A second method available in SAS imputes missing variables using the fully conditional specification (FCS), which does not assume a joint distribution but instead uses a separate conditional distribution for each imputed variable. This specification may be necessary if you are imputing a variable that must only take on specific values, such as a binary outcome for a logistic model or a count variable for a Poisson model. In simulation studies (Lee & Carlin, 2010; Van Buuren, 2007), FCS has been shown to produce estimates that are comparable to the MVN method. Later we will discuss some diagnostic tools that can be used to assess whether convergence was reached when using FCS.

The FCS methods available in SAS are the discriminant function and logistic regression for binary/categorical variables, and linear regression and predictive mean matching for continuous variables. If you do not specify a method, by default the discriminant function and regression are used. Some interesting properties of each of these options are:

1. The discriminant function method allows the user to specify prior probabilities of group membership. In the discriminant function method, only continuous variables can be covariates by default; to change this default use the classeffects option.
2. The logistic regression method assumes an ordering of class variables if there are more than two levels.

3. The default imputation method for continuous variables is regression. The regression method allows the use of ranges and rounding for imputed values. These options are problematic and typically introduce bias (Horton et al. 2003; Allison, 2005). Take a look at the "Other Issues" section below for further discussion of this topic.

4. The predictive mean matching method will provide imputed values that are consistent with observed values. If plausible values are necessary, this is a better choice than using bounds or rounding values produced from regression.

For more information on these methods and the options associated with them, see the SAS Help and Documentation on the FCS statement. The basic set-up for conducting an imputation is shown below. The var statement includes all the variables that will be used in the imputation model. If you want to impute these variables using a method different from the default, you can specify which variable(s) are to be imputed and by what method on the FCS statement. In this example we are imputing the binary variable female and the categorical variable prog using the discriminant function method. Since they are both categorical, we also list female and prog on the class statement. Note: because we are using the discriminant function method to impute prog, we no longer need to create dummy variables. Additionally, we use the classeffects=include option so that all continuous and categorical variables will be used as predictors when imputing female and prog. All the other variables on the var statement will be imputed using regression, since a different distribution was not specified.

The ordering of variables on the var statement controls the order in which variables will be imputed. With multiple imputation using FCS, a single imputation is conducted during an initial fill-in stage. After the initial stage, the variables with missing values are imputed in the order specified on the var statement, with each subsequent variable imputed using observed and imputed values from the variables that preceded it. For more information on this see White et al. 2010. Also, as in the previous proc mi example using MVN, we can specify the number of burn-in iterations using the nbiter option.

The FCS statement also allows users to specify which variables to use as predictors; if no covariates are given for an imputed variable, then SAS assumes that all the variables on the var statement are to be used to predict all other variables. Multiple conditional distributions can be specified in the same FCS statement. One specification imputes female and prog under a generalized logit distribution, which is appropriate for non-ordered categorical variables, instead of the default cumulative logit that is appropriate for ordered variables. A second specification imputes female and prog under a generalized logit distribution and uses predictive mean matching to impute math, read and write instead of the default regression method. A third specification indicates that prog and female should be imputed using different sets of predictors.

2. Analysis and Pooling Phase

Once the 20 multiply imputed datasets have been created, we can run our linear regression using proc genmod. Since we imputed female and prog under a distribution appropriate for categorical outcomes, the imputed values will now be true integer values.
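A hedged sketch of this FCS set-up together with the analysis and pooling steps described next (the seed, the auxiliary variables on the var statement, and the commented-out variant are assumptions on our part; the dataset names mifcs and gmfcs follow the text):

* Step 1: FCS imputation - discriminant function for female and prog, regression for the rest;
proc mi data = hsbmar nimpute = 20 out = mifcs seed = 54321;
  class female prog;
  fcs nbiter = 100 discrim(female prog / classeffects = include);
  var read write math female prog science socst;
run;

* One possible variant: generalized logit for the categorical variables and
* predictive mean matching for the test scores;
* fcs logistic(female prog / link = glogit) regpmm(math read write);

* Step 2: analytic model in each imputed dataset (class variables can be used directly);
proc genmod data = mifcs;
  class female prog;
  model read = write math female prog;
  by _Imputation_;
  ods output ParameterEstimates = gmfcs;
run;

* Step 3: pool, telling proc mianalyze which column identifies the class levels;
proc mianalyze parms(classvar = level) = gmfcs;
  class female prog;
  modeleffects Intercept write math female prog;
run;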
Take a look at the results of proc freq for female and prog in the second imputed dataset as compared to the original data with missing values. As you can see, the FCS method has imputed "real" values for our categorical variables. Prog and female can now be used in the class statement below, and we no longer need to create dummy variables for prog. As with the previous example using MVN, we will run our model on each imputed dataset stored in mifcs. We will also use an ODS output statement to save the parameter estimates from our 20 regressions. Below is a proc print of what the parameter estimates in gmfcs look like for the first two imputed datasets. "_Imputation_" indicates which imputed dataset each set of parameter estimates belongs to. "Level1" indicates the levels or categories of our class variables.

The mianalyze procedure now requires some additional specification in order to properly combine the parameter estimates. You can see above that the parameter estimates for variables used in our model's class statement have one row for each level. Additionally, a column called "Level1" specifies the name or label associated with each category. In order for mianalyze to combine the estimates appropriately for the class variables, we need to add some options to the proc mianalyze line. As before, parms refers to the input SAS data set that contains the parameter estimates computed from each imputed data set. However, we also need to add the classvar option, which is only appropriate when the model effects contain classification variables. Since proc genmod names the column indicator for classification "Level1", we specify classvar=level. Note: different procedures in SAS require different classvar options.

If you compare these estimates to those from the full data (below), you will see that the magnitude of the write, female, and math parameter estimates using the FCS data are very similar to the results from the full data. Additionally, the overall significance or non-significance of specific variables remains unchanged. As with the MVN model, the SEs are larger due to the incorporation of uncertainty around the parameter estimates, but these SEs are still smaller than those we observed in the complete case analysis.

4. Imputation Diagnostics: Like the previous imputation method using MVN, the FCS statement will output trace plots. These can be examined for the mean and standard deviation of each continuous variable in the imputation model. As before, the dashed vertical line indicates the final iteration, where the imputation occurred. Each line represents a different imputation, so all 20 imputation chains are overlaid on top of one another. Autocorrelation plots are only available with the mcmc statement when assuming a joint multivariate normal distribution; this plot is not available when using the FCS statement.

Other Issues

1. Why auxiliary variables? One question you may be asking yourself is why auxiliary variables are necessary or even important. First, they can help improve the likelihood of meeting the MAR assumption (White et al, 2011; Johnson and Young, 2011; Allison, 2012). Remember, a variable is said to be missing at random if other variables in the dataset can be used to predict missingness on a given variable. So you want your imputation model to include all the variables you think are associated with or predict missingness in your variables in order to fulfill the assumption of MAR.
Second, including auxiliaries has been shown to help yield more accurate and stable estimates and thus reduce the estimated standard errors in analytic models (Enders, 2010; Allison, 2012; von Hippel and Lynch, 2013). This is especially true in the case of missing outcome variables. Third, including these variables can also help to increase power (Reis and Judd, 2000; Enders, 2010). In general, there is almost always a benefit to adopting a more "inclusive analysis strategy" (Enders, 2010; Allison, 2012).

2. Selecting the number of imputations (m): Historically, the recommendation was for three to five MI datasets. Relatively low values of m may still be appropriate when the fraction of missing information is low and the analysis techniques are relatively simple. Recently, however, larger values of m are often being recommended. To some extent, this change in the recommended number of imputations is based on the radical increase in the computing power available to the typical researcher, making it more practical to create, run and analyze multiply imputed datasets with a larger number of imputations. Recommendations for m vary: for example, five to 20 imputations for low fractions of missing information, and as many as 50 (or more) imputations when the proportion of missing data is relatively high. Remember that estimates of coefficients stabilize at much lower values of m than estimates of variances and covariances of error terms (i.e. standard errors). Thus, in order to get appropriate estimates of these parameters, you may need to increase m. A larger number of imputations may also allow hypothesis tests with less restrictive assumptions (i.e. that do not assume equal fractions of missing information for all coefficients). Multiple runs of m imputations are recommended to assess the stability of the parameter estimates.

Graham et al. (2007) conducted a simulation demonstrating the effect on power, efficiency and parameter estimates across different fractions of missing information as m is decreased. The authors found that: 1. mean square error and standard error increased; 2. power was reduced, especially when the FMI is greater than 50% and the effect size is small, even for a large number of imputations (20 or more); and 3. the variability of the estimate of FMI increased substantially. In general, the estimation of FMI improves with an increased m.

Another factor to consider is the importance of reproducibility between analyses using the same data. White et al. (2010), assuming the true FMI for any variable would be less than or equal to the percentage of cases that are incomplete, use the rule that m should equal the percentage of incomplete cases. Thus if the FMI for a variable is 20% then you need 20 imputed datasets. A similar analysis by Bodner (2008) makes a similar recommendation. White et al. (2010) also found that, when making this assumption, the error associated with estimating the regression coefficients, standard errors and the resulting p-values was considerably reduced, resulting in an adequate level of reproducibility.

3. Maximum, minimum and rounding: This issue often comes up in the context of using MVN to impute variables that normally have integer values or bounds. Intuitively speaking, it makes sense to round values or incorporate bounds to give "plausible" values. However, these methods have been shown to decrease efficiency and increase bias by altering the correlations or covariances between variables estimated during the imputation process.
Additionally, these changes will often result in an underestimation of the uncertainty around imputed values. Remember, imputed values are NOT equivalent to observed values and serve only to help estimate the covariances between variables needed for inference (Johnson and Young, 2011). Leaving the imputed values as they are in the imputation model is perfectly fine for your analytic models. If plausible values are needed to perform a specific type of analysis, then you may want to use a different imputation algorithm such as FCS.

Isn't multiple imputation just making up data? No. This argument can be made about the missing data methods that use a single imputed value, because that value will be treated like observed data, but it is not true of multiple imputation. Unlike single imputation, multiple imputation builds the uncertainty and error associated with the missing data into the model, so the process and subsequent estimation never depend on a single value. Additionally, another method for dealing with missing data, maximum likelihood, produces almost identical results to multiple imputation and does not require the missing information to be filled in.

What is passive imputation? Passive variables are functions of imputed variables. For example, suppose we have a variable X with missing information, but in our analytic model we will need to use X squared. In passive imputation we would impute X and then use those imputed values to create the quadratic term. This method is called "impute then transform" (von Hippel, 2009). While this appears to make sense, additional research (Seaman et al. 2012; Bartlett et al. 2014) has shown that using this method is actually a misspecification of your imputation model and will lead to biased parameter estimates in your analytic model. There are better ways of dealing with transformations.

How do I treat variable transformations such as logs, quadratics and interactions? Most of the current literature on multiple imputation supports the method of treating variable transformations as "just another variable." For example, suppose you know that in your subsequent analytic model you are interested in looking at the modifying effect of Z on the association between X and Y (i.e. an interaction between X and Z). This is a property of your data that you want to be maintained in the imputation. Using something like passive imputation, where the interaction is created after you impute X and/or Z, means that the filled-in values are imputed under a model assuming that Z is not a moderator of the association between X and Y. Thus, your imputation model is now misspecified.

Should I include my dependent variable (DV) in my imputation model? Yes, an emphatic YES, unless you would like to impute independent variables (IVs) assuming they are uncorrelated with your DV (Enders, 2010), thus causing the estimated associations between your DV and IVs to be biased toward the null (i.e. underestimated). Additionally, using imputed values of your DV is considered perfectly acceptable when you have good auxiliary variables in your imputation model (Enders, 2010; Johnson and Young, 2011; White et al. 2010). However, if good auxiliary variables are not available, then you should still INCLUDE your DV in the imputation model and then later restrict your analysis to only those observations with an observed DV value. Research has shown that imputing DVs when auxiliary variables are not present can add unnecessary random variation into your imputed values (Allison, 2012).
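As a small illustration of the "just another variable" approach described above (a sketch under our own assumptions; x, z, their product xz, and the dataset names are hypothetical, not from the seminar's data), the transformation is computed first and imputed alongside the raw variables rather than being rebuilt from imputed values afterward:

* Create the interaction before imputing, so its relationships with x, z and y are preserved;
data mydata2;
  set mydata;
  xz = x * z;   * hypothetical interaction term needed in the analytic model;
run;

* Impute the interaction as just another variable in the imputation model;
proc mi data = mydata2 nimpute = 20 out = mi_jav seed = 12345;
  var y x z xz;
run;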
How much missing data can I have and still get good estimates using MI?

Simulations have indicated that MI can perform well, under certain circumstances, even with up to 50% missing observations (Allison, 2002). However, the larger the amount of missing information, the higher the chance that you will run into estimation problems during the imputation process, and the lower the chance of meeting the MAR assumption unless the missingness was planned (Johnson and Young, 2011). Additionally, as discussed above, the higher the FMI, the more imputations are needed to reach good relative efficiency for effect estimates, especially standard errors.

What should I report in my methods about my imputation?

Most papers mention that they performed multiple imputation but give few, if any, details of how the method was implemented. In general, a basic description should include:
- Which statistical program was used to conduct the imputation.
- The type of imputation algorithm used (i.e., MVN or FCS).
- Some justification for choosing a particular imputation method.
- The number of imputed datasets (m) created.
- The proportion of missing observations for each imputed variable.
- The variables used in the imputation model, and why, so your audience will know whether you used a more inclusive strategy. This is particularly important when using auxiliary variables.
This may seem like a lot, but it would probably not require more than 4-5 sentences. Enders (2010) provides some examples of write-ups for particular scenarios. Additionally, Mackinnon (2010) discusses the reporting of MI procedures in medical journals.

Main takeaways from this seminar:
- Multiple imputation is always superior to any of the single imputation methods because a single imputed value is never used and the variance estimates reflect the appropriate amount of uncertainty surrounding parameter estimates.
- There are several decisions to be made before performing a multiple imputation, including the distribution, the auxiliary variables and the number of imputations, all of which can affect the quality of the imputation.
- Multiple imputation is not magic: while it can help increase power, it should not be expected to produce "significant" effects when other techniques like listwise deletion fail to find significant associations.
- Multiple imputation is one tool for researchers to address the very common problem of missing data.

References

Allison (2002). Missing Data. Sage Publications.
Allison (2005). Imputation of Categorical Variables with PROC MI. SUGI 30 Proceedings, Philadelphia, Pennsylvania, April 10-13, 2005.
Allison (2012). Handling Missing Data by Maximum Likelihood. SAS Global Forum: Statistics and Data Analysis.
Barnard and Rubin (1999). Small-sample degrees of freedom with multiple imputation. Biometrika, 86(4): 948-955.
Bartlett et al. (2014). Multiple imputation of covariates by fully conditional specification: Accommodating the substantive model. Stat Methods Med Res.
Bodner (2008). What Improves with Increased Missing Data Imputations? Structural Equation Modeling: A Multidisciplinary Journal, 15(4): 651-675.
Demirtas et al. (2008). Plausibility of a multivariate normality assumption when multiply imputing non-Gaussian continuous outcomes: a simulation assessment. Journal of Statistical Computation & Simulation, 78(1).
Enders (2010). Applied Missing Data Analysis. The Guilford Press.
Graham et al. (2007). How Many Imputations are Really Needed? Some Practical Clarifications of Multiple Imputation Theory. Prev Sci, 8: 206-213.
Horton et al. (2003). A potential for bias when rounding in multiple imputation. The American Statistician, 57: 229-232.
Johnson and Young (2011). Towards Best Practices in Analyzing Datasets with Missing Data: Comparisons and Recommendations. Journal of Marriage and Family, 73(5): 926-945.
Lee and Carlin (2010). Multiple Imputation for Missing Data: Fully Conditional Specification versus Multivariate Normal Imputation. Am J Epidemiol, 171(5): 624-632.
Lipsitz et al. (2002). A Degrees-of-Freedom Approximation in Multiple Imputation. J Statist Comput Simul, 72(4): 309-318.
Little and Rubin (2002). Statistical Analysis with Missing Data, 2nd edition. New York: John Wiley.
Mackinnon (2010). The use and reporting of multiple imputation in medical research: a review. J Intern Med, 268: 586-593.
Reis and Judd (Eds.) (2000). Handbook of Research Methods in Social and Personality Psychology.
Rubin (1976). Inference and Missing Data. Biometrika, 63(3): 581-592.
Rubin (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.
Schafer and Graham (2002). Missing data: our view of the state of the art. Psychol Methods, 7(2): 147-177.
Seaman et al. (2012). Multiple imputation of missing covariates with non-linear effects: an evaluation of statistical methods. BMC Medical Research Methodology, 12(46).
van Buuren (2007). Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16: 219-242.
von Hippel (2009). How to impute interactions, squares and other transformed variables. Sociol Methodol, 39: 265-291.
von Hippel (2013). Should a Normal Imputation Model Be Modified to Impute Skewed Variables? Sociological Methods & Research, 42(1): 105-138.
von Hippel and Lynch (2013). Efficiency Gains from Using Auxiliary Variables in Imputation. Cornell University Library.
White et al. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4): 377-399.
Young and Johnson (2011). Imputing the Missing Y's: Implications for Survey Producers and Survey Users. Proceedings of the AAPOR Conference Abstracts, pp. 6242-6248.

Imputation of categorical and continuous data: multivariate normal vs. chained equations

Question: Generally speaking, would you say that standard methods of multiple imputation (e.g., those available in PROC MI) have difficulty handling models with mixed (continuous and categorical) data? Or would you think, generally, that the multivariate normality assumption is robust in the context of MI for handling continuous and categorical missing data?

Answer: Opinion on this is somewhat mixed. A fair bit of work has been done on how to impute categorical data using the MVN model, and some papers have shown you can do quite well, provided you use so-called adaptive rounding methods for rounding the continuous imputed data. For more on this, see: CA Bernaards, TR Belin, JL Schafer. Robustness of a multivariate normal approximation for imputation of incomplete binary data. Statistics in Medicine 2007; 26: 1368-1382. Lee and Carlin found that both chained equations and imputation via an MVN model worked well, even with some binary and ordinal variables:
KJ Lee and JB Carlin. Multiple Imputation for Missing Data: Fully Conditional Specification Versus Multivariate Normal Imputation. American Journal of Epidemiology 2010; 171: 624-632. In contrast, a paper by van Buuren concluded that the chained-equations approach (also known as fully conditional specification, FCS) is preferable in situations with a mixture of continuous and categorical data: S van Buuren. Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research 2007; 16: 219-242. My personal opinion is that the chained equations approach is preferable with a mixture of continuous and categorical data.
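To show what the "adaptive rounding" mentioned in the answer above can look like in practice, here is a small sketch. The cutoff is derived from the normal approximation to the binomial, in the spirit of the Bernaards et al. approach, but the code and the example values are illustrative assumptions rather than a reproduction of their exact procedure.

```python
import numpy as np
from scipy.stats import norm

def adaptive_round(imputed_continuous):
    """Dichotomize continuous imputations of a 0/1 variable using an adaptive
    cutoff chosen so that the rounded values reproduce the estimated marginal
    proportion under a normal approximation, instead of a fixed 0.5 threshold."""
    v = np.asarray(imputed_continuous, dtype=float)
    omega = v.mean()                                           # estimated marginal proportion
    cutoff = omega - norm.ppf(omega) * np.sqrt(omega * (1 - omega))
    return (v >= cutoff).astype(int)

# Hypothetical continuous imputations of a binary indicator from an MVN model.
vals = np.array([0.12, 0.48, 0.55, 0.71, 0.33, 0.90, 0.05, 0.62])
print(adaptive_round(vals))          # compare with naive rounding: (vals >= 0.5).astype(int)
```

When the marginal proportion is far from 0.5, the adaptive cutoff can differ noticeably from naive rounding, which is exactly the situation where naive rounding biases the imputed prevalence.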
