Combining datasets and performing large aggregate analyses is a powerful way to improve service across large populations. Critically important to this task is the deduplication of identities across multiple datasets that were rarely designed to work together. Inconsistent data entry, typographical errors, and real-world identity changes pose significant challenges to this process. To help, we have built a tool called pgdedupe.
This page provides links to several resources to help you understand and use pgdedupe:
- Download a copy of our white paper.
- Download the sample data used in the white paper.
- Generate your own sample data of arbitrary size using the testing scripts in the repository.
- Initialize a database using the `initialize_db.py` script:
  - Create a YAML file with your PostgreSQL database credentials.
  - Pass it along with the CSV to the initialize script: `python initialize_db.py --db database.yaml --csv records.csv`
- Install pgdedupe with `pip install pgdedupe`.
- Run pgdedupe with the database credential file and a configuration file: `pgdedupe --db database.yaml --config config.yaml`
- Download the sample labels referenced in the configuration file (`tests/dedup_postgres_training.json`).
- We have posted the code on GitHub. You can download, use, and contribute to the code, and submit issues.
- pgdedupe builds on dedupe, the library written by our friends at DataMade. We extend special thanks to Forrest Gregg for his help with this project.
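To make the database-setup steps above concrete, here is a minimal sketch of creating the credentials file and running the two commands. The YAML field names (`host`, `port`, `database`, `user`, `password`) are assumptions based on typical PostgreSQL connection parameters; check the pgdedupe documentation for the exact schema your version expects.

```shell
# Write a database.yaml credentials file for pgdedupe.
# NOTE: field names are assumptions -- verify against the pgdedupe docs.
cat > database.yaml <<'EOF'
host: localhost
port: 5432
database: dedupe_example
user: postgres
password: example-password
EOF

# Then, as described above (requires a running PostgreSQL instance
# and the records.csv / config.yaml files):
#   python initialize_db.py --db database.yaml --csv records.csv
#   pgdedupe --db database.yaml --config config.yaml
```

The commands are left as comments here because they require a live PostgreSQL server and the sample CSV to actually run.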