CSH Data Integration Tool
Partner(s): Corporation for Supportive Housing (CSH), Boone County (Missouri), Clark County (Nevada), McLean County (Illinois), Salt Lake County (Utah)
Status: Transitioning data integration tool to County partners
GitHub Repo: https://github.com/dssg/matching-tool
Documentation Website: https://dssg.github.io/matching-tool/
Team: Erika Salomon, Tristan Crockett, Eddie Lin, Adolfo De Unanue, Christina Sung
DSaPP has partnered with the Corporation for Supportive Housing (CSH) and four communities to build a web-based tool that integrates homeless services and criminal justice data. The tool matches clients common to the two systems, allowing communities to identify frequent users of both.
Over the past few years, a new effort has focused on scaling innovative, data-driven criminal justice reform practices pioneered at the local level to help meet the needs of a key population: people who repeatedly cycle through multiple systems, including jails, hospital emergency rooms, shelters, and other services. Sometimes called “super-utilizers,” they are often chronically homeless individuals with mental illness, substance abuse, and health problems. Communities often lack the resources to combine information across multiple data systems, an essential step in identifying super-utilizers and offering them support. With funding from the Laura and John Arnold Foundation and the Corporation for National and Community Service, DSaPP collaborated with CSH to develop a matching tool that integrates data from the criminal justice system and the Homeless Management Information System (HMIS).
The goal of this project was to help communities identify high utilizers of both the criminal justice system and homeless services by integrating data from the two. Using the integrated data, communities can identify those most in need of an intervention and allocate resources more effectively.
After appropriate data use agreements were executed, DSaPP asked each county to send a data dump of at least one HMIS data set and one jail data set, and used these to create a standardized data specification schema. Because each county performs its own ETL process, the schema ensures that data uploaded to the tool meets specific requirements, such as which fields must be populated and which values are valid for each field. The validated data was then used to develop a matching algorithm that identifies unique individuals within one data source and across both. After the data goes through the matching process, the tool displays results from the matched data, showing overlaps between the two populations and frequent utilizers of both systems.
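The schema check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, required flags, allowed values, and date format below are hypothetical and are not taken from the tool's actual specification, which is documented in the repository.

```python
import csv
import io
from datetime import datetime

# Hypothetical schema: field names, required flags, and allowed values are
# illustrative assumptions, not the tool's actual specification.
SCHEMA = {
    "internal_person_id": {"required": True},
    "first_name": {"required": False},
    "last_name": {"required": False},
    "dob": {"required": False, "date_format": "%Y-%m-%d"},
    "gender": {"required": False, "allowed": {"F", "M", "O"}},
    "jail_entry_date": {"required": True, "date_format": "%Y-%m-%d"},
}


def validate_row(row, schema=SCHEMA):
    """Return a list of human-readable problems found in one row."""
    errors = []
    for field, rules in schema.items():
        value = (row.get(field) or "").strip()
        if not value:
            # Empty values are only a problem for required fields.
            if rules.get("required"):
                errors.append(f"{field}: required field is empty")
            continue
        allowed = rules.get("allowed")
        if allowed is not None and value not in allowed:
            errors.append(f"{field}: {value!r} is not a valid value")
        date_format = rules.get("date_format")
        if date_format is not None:
            try:
                datetime.strptime(value, date_format)
            except ValueError:
                errors.append(f"{field}: {value!r} is not a valid date")
    return errors


def validate_csv(text, schema=SCHEMA):
    """Validate every data row; return {row_number: [errors]} for bad rows."""
    reader = csv.DictReader(io.StringIO(text))
    report = {}
    for row_number, row in enumerate(reader, start=1):
        errors = validate_row(row, schema)
        if errors:
            report[row_number] = errors
    return report
```

Calling `validate_csv` on the text of an uploaded file returns an empty dictionary when every row conforms, and otherwise maps each offending row number to its list of problems, which a county could use to correct its ETL output before uploading again.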
This project used data from the criminal justice system and the Homeless Management Information System (HMIS). At minimum, the tool expects the user to upload jail bookings and homeless service stays. The tool also accepts three other criminal justice data schemas and two other homeless service data schemas.
When a user successfully uploads a new data file to the tool, a new matching job begins. First, the matcher loads all of the data files passed by the webtool and combines them into a single data frame, retaining only the information that identifies the person (i.e., discarding information about the criminal justice or homeless services interaction). Before the algorithmic matching process begins, all rows that are exact duplicates of another row (e.g., that match exactly on first name, last name, date of birth, social security number, gender, and race) are dropped. The deduplicated data are then reformatted for matching. From here, the matching algorithm breaks the data into smaller subsets based on groups likely to contain matches (blocking), compares the records within each subset, and then clusters records into matched groups. Once the record linkage step is done, the matcher has a data frame of deduplicated and matched records. To complete the process, the matcher reloads the original source data and joins the matched IDs to the original records.
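The steps above (keep identifying fields, drop exact duplicates, block, compare, cluster, join back) can be sketched as follows. This is a minimal illustration, not the tool's implementation: the set of identifying fields, the blocking rule (same date of birth), the similarity measure (mean `difflib` ratio), the 0.85 threshold, and the union-find clustering are all assumptions chosen to keep the example short.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical subset of identifying fields.
ID_FIELDS = ("first_name", "last_name", "dob", "ssn")


def identity(record):
    """Keep only the identifying fields, lightly normalized."""
    return tuple((record.get(f) or "").strip().lower() for f in ID_FIELDS)


def block_key(ident):
    """Assumed blocking rule: records sharing a date of birth."""
    return ident[2]


def similarity(a, b):
    """Mean string similarity across identifying fields (assumed metric)."""
    scores = [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]
    return sum(scores) / len(scores)


def match(records, threshold=0.85):
    """Return one matched ID per input record, in input order."""
    idents = [identity(r) for r in records]
    unique = sorted(set(idents))          # drop exact duplicates
    parent = {u: u for u in unique}       # union-find forest for clustering

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    # Blocking: only records in the same block are ever compared.
    blocks = defaultdict(list)
    for u in unique:
        blocks[block_key(u)].append(u)

    # Compare pairs within each block and merge those that are similar enough.
    for members in blocks.values():
        for a, b in combinations(members, 2):
            if similarity(a, b) >= threshold:
                parent[find(a)] = find(b)

    # Number the clusters and join the IDs back onto the original records.
    roots = sorted({find(u) for u in unique})
    cluster_id = {root: i for i, root in enumerate(roots)}
    return [cluster_id[find(i)] for i in idents]
```

Because `match` returns IDs in input order, attaching them to the original rows is a simple `zip(records, match(records))`. Records from the jail and HMIS files that receive the same ID are treated as the same person, which is what makes cross-system overlap counts possible.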
Future Plans and Areas for Improvement
DSaPP originally hosted the data integration tool and is now helping the counties host it themselves. Currently, two of the four counties self-host the tool and use the integrated data to construct cost-benefit analyses of investments in supportive housing from local and federal partners, modeled on CSH’s FUSE framework. Other uses of the tool include using the matched utilization data to support decision-making in local homeless coordinated entry processes, as well as improving systems coordination at local jails.