Nov 2020 Dataset Updates - Kidney Health Education and Research Group

Nov 2020 Dataset Updates

Hi all,
Hope everyone is making the most of the remote work and preparing for the colder temperatures ahead.
With a lot of analyses & papers ongoing, we’ve been working closely with the dataset to identify and resolve issues. Several enhancements have been implemented in the November 2020 MergedSet.dta:

  1. Cat domain counter
    We noticed that the *_cnt variables were buggy and, in some instances, showed that patients had answered more than 12 questions in a cat domain.
  • The prior method worked as follows:
    Let B = number of non-missing questions in the template format dep__6w
    Let A = number of non-missing questions in the naming format dep

    Number of questions answered by each patient on the depression questionnaire = B – A

This would in theory work, but as we found, there were other variables (outside of the expected depression questionnaire) with similar variable names, which were being included in the counter.

Why was this method employed?
There are 700+ variables across the domains. In theory this method would have allowed us to get around specifying all variables within each domain grouping to be counted.

  • Updated method:
    We are now able to pull detailed PROMIS cat extraction data from dados for all patients. This includes running standard errors and running thetas. This data is in long format, which makes it a better choice for designing an algorithm to programmatically count the exact number of questions within each domain, without needing to specify all questions.

The method has been deployed & tested. We confirm that the maximum count is now 12.

  • Patients with only 1 question answered will be flagged and passed to dados to inspect, since it should not have been possible for only 1 question to be administered.
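The updated counting approach can be sketched as follows. This is only an illustration, assuming a long-format extract with one row per administered question; the field names (`patient_id`, `domain`, `response`) and the sample rows are hypothetical and are not the actual dados export layout.

```python
from collections import Counter

MAX_QUESTIONS = 12  # upper bound on questions per cat domain, as stated above


def count_answered(rows):
    """Count non-missing responses per (patient_id, domain) from long-format rows."""
    counts = Counter()
    for row in rows:
        if row["response"] is not None:
            counts[(row["patient_id"], row["domain"])] += 1
    return dict(counts)


def flag_suspect(counts):
    """Return (patient_id, domain) pairs with exactly 1 answer or more than the maximum."""
    return sorted(key for key, n in counts.items() if n == 1 or n > MAX_QUESTIONS)


# Hypothetical sample rows, for illustration only.
rows = [
    {"patient_id": "P1", "domain": "depression", "item": "q1", "response": 2},
    {"patient_id": "P1", "domain": "depression", "item": "q2", "response": 1},
    {"patient_id": "P1", "domain": "anxiety", "item": "q1", "response": 3},
    {"patient_id": "P2", "domain": "depression", "item": "q1", "response": None},
]
counts = count_answered(rows)
flagged = flag_suspect(counts)
```

Because the rows already carry the domain, no variable-name matching is needed, which is what caused the over-counting in the prior method.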
  2. Integrated patients’ time to completion for individual cat domains
  • This was previously completed, but we identified a bug in the data injection process.
    The bug has been resolved. The variables follow the naming convention ttc_* and represent the time, in minutes, that patients took to complete each individual domain.
  • Related to this variable, we also injected a new variable response_time_to_comp
    This measures the overall time that the patient took to complete the entire study in minutes.
  3. Patients missing cat scores
  • There are 2 data end points of extraction from dados:
    (i) Subject data extraction
        This was the original data mechanism we’ve been working with for the past 5 years.
    (ii) PROMIS-specific data extraction
        This is a new end point where we can now source domain-level data regarding our patients across all studies,
        e.g. the exact order of questions answered, running standard error & running theta.
  • We identified gaps where patients’ thetas were provided in method (ii) but not method (i). To ensure all data is made available to researchers, we designed an algorithm which combines data from these sources to fill in any missing final theta values.
  • Important note: No existing data was changed. This is an addition of data.
  • Theta values for 12 patients’ anxiety & 9 patients’ depression domains were recovered & made available with this fix. Similar metrics will be reported for the other domains.
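A minimal sketch of the fill-in step is shown below. It assumes each source can be represented as a mapping from (patient, domain) to a final theta, with `None` marking a missing value; the function name, keys and numbers are hypothetical and only illustrate the "add, never overwrite" rule.

```python
def fill_missing_thetas(subject, promis):
    """Fill gaps in the subject extraction using the PROMIS-specific extraction.

    Values already present in `subject` are never overwritten.
    Returns the merged mapping and the sorted list of recovered keys.
    """
    merged = dict(subject)
    recovered = []
    for key, theta in promis.items():
        if merged.get(key) is None and theta is not None:
            merged[key] = theta
            recovered.append(key)
    return merged, sorted(recovered)


# Hypothetical values, for illustration only.
subject = {("P1", "anxiety"): -0.4, ("P2", "anxiety"): None}
promis = {
    ("P1", "anxiety"): 9.9,      # differs from subject value; must be ignored
    ("P2", "anxiety"): 0.7,      # fills a gap
    ("P3", "depression"): 1.1,   # absent from the subject extraction entirely
}
merged, recovered = fill_missing_thetas(subject, promis)
```

The length of `recovered`, grouped by domain, would give the kind of per-domain recovery counts reported above.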
  4. CAT response pattern detection
  • We identified a group of patients who floored the anxiety & depression cat domains, as well as a few others.
    On closer inspection of the data, it was difficult to distinguish patients whose responses reflected their true anxiety/depression levels from those who simply indicated “never” to all questions within the domain, potentially as a result of patient burden.
  • We implemented an algorithm which inspects each patient’s overall domain response patterns and identifies the highest-frequency response patterns:
  • A response pattern is tagged if it is among the 5 most common patterns within a domain.
  • If a patient floors both the depression and anxiety domains, and either (a) took less than 15 minutes or more than 80 minutes to complete the overall study, or (b) has a tagged response pattern in at least 5 domains, then the patient’s status is marked as “Enrolled but cat tagged”.
  • Exactly 23 patients across all studies were identified under this algorithm.
    We will be looking into a future enhancement to integrate the time taken on the individual domain into the algorithm.
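The tagging rule can be sketched roughly as below. This reads the rule as "floors both domains AND (unusual overall timing OR tagged patterns in at least 5 domains)", which is one interpretation of the description above. How flooring is determined is not covered here, so it is passed in as a boolean; all function and parameter names are hypothetical.

```python
from collections import Counter

TOP_N = 5                # a pattern is tagged if among the 5 most common in its domain
MIN_MINUTES = 15         # overall completion faster than this is suspicious
MAX_MINUTES = 80         # overall completion slower than this is suspicious
MIN_TAGGED_DOMAINS = 5   # tagged patterns needed across domains


def tagged_patterns(patterns_by_patient, top_n=TOP_N):
    """Return the set of the top_n most common response patterns within one domain.

    `patterns_by_patient` maps patient_id -> tuple of responses for that domain.
    Ties at the cut-off are broken by first appearance (Counter.most_common).
    """
    freq = Counter(patterns_by_patient.values())
    return {pattern for pattern, _ in freq.most_common(top_n)}


def is_cat_tagged(floors_dep, floors_anx, total_minutes, n_tagged_domains):
    """Apply the rule: floors both domains AND (odd timing OR enough tagged domains)."""
    if not (floors_dep and floors_anx):
        return False
    odd_timing = total_minutes < MIN_MINUTES or total_minutes > MAX_MINUTES
    return odd_timing or n_tagged_domains >= MIN_TAGGED_DOMAINS


# Hypothetical single-domain example with top_n=1.
domain_patterns = {"P1": (1, 1, 1), "P2": (1, 1, 1), "P3": (1, 2, 3)}
common = tagged_patterns(domain_patterns, top_n=1)
```

A patient for whom `is_cat_tagged(...)` returns `True` would have their status set to “Enrolled but cat tagged”.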
  5. Age & gender missing for patients
  • We’ve implemented an algorithm to detect patients whom recruiters incorrectly marked as Enrolled/Completed the questionnaire, but who answered fewer than 5% of the overall questions administered to them. These patients were easiest to find as those showing up in the dataset with missing ages and genders.
  • Once a patient is identified, the algorithm toggles their enrollment status to ‘Declined’, as their incomplete data should not be used for research.
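As an illustration of the status correction, here is a small sketch. The status labels come from the description above; the function name, the threshold constant and the handling of patients with zero administered questions (left unchanged) are assumptions made for the example.

```python
MIN_ANSWERED_FRACTION = 0.05  # below 5% answered, the patient is treated as declined


def corrected_status(status, n_answered, n_administered):
    """Return 'Declined' for Enrolled/Completed patients who answered under 5%."""
    if status not in {"Enrolled", "Completed"}:
        return status  # other statuses are left untouched
    if n_administered == 0:
        return status  # assumption: nothing to evaluate, leave for manual review
    if n_answered / n_administered < MIN_ANSWERED_FRACTION:
        return "Declined"
    return status
```

For example, `corrected_status("Enrolled", 2, 100)` yields `"Declined"`, while a patient who answered exactly 5% keeps their status because the rule is strictly "less than 5%".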
  6. OMI enhancement
  • In our last update on the Ontario Marginalization Index (OMI), we provided factor and quintile scores for patients whose 6-digit postal code we were able to source from OTTR, alongside the original variables based on the 3-digit postal code that the patient provided in the questionnaire. We also provided scores at the Ontario, GTA and Toronto levels. However, the 6-digit postal code variables only covered the UHN patients, which left those variables with a large amount of missingness.

To have one variable available for researchers’ analyses, we combined the 6-digit and 3-digit variables within each geographic level,
i.e. if the patient’s 6-digit postal code was sourced and valid, it was used for the OMI GTA ethnic concentration; otherwise, the 3-digit postal code mapping was used.
This generates one variable for the GTA ethnic concentration, which represents an amalgamation of the most accurate data available.

  • This generates 3 groups of variables, with OMI data successfully mapped for the following share of our enrolled patients:
  • Within Ontario – 74%
  • Within GTA – 66%
  • Within Toronto – 39%
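The fallback logic and the coverage figures can be sketched as follows. This assumes a missing or invalid score is represented as `None`; the function names and sample values are hypothetical and do not reflect the actual variable names in MergedSet.dta.

```python
def combined_omi(score_6digit, score_3digit):
    """Prefer the 6-digit-based score; fall back to the 3-digit-based score."""
    return score_6digit if score_6digit is not None else score_3digit


def coverage(values):
    """Fraction of patients with a non-missing combined value."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


# Hypothetical (6-digit, 3-digit) quintile pairs for four patients.
pairs = [(3, 5), (None, 5), (None, None), (2, None)]
combined = [combined_omi(six, three) for six, three in pairs]
share_mapped = coverage(combined)
```

Applying `coverage` to each geographic level's combined variable would give percentages of the kind listed above.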
  7. CAT reliability testing
  • We’ve made T-scores and simulated reliability scores available for all domains, showing what they would have been had patients been administered only 4, 6 or 8 questions at baseline
    e.g.
    anx_tscore4 rel_anxiety_sim4;
    anx_tscore6 rel_anxiety_sim6;
    anx_tscore8 rel_anxiety_sim8 etc.

Latest dataset available at U:\data\dump\2020-11

Please feel free to reach out if anything is unclear.

Thanks,
Nathaniel Edwards
