Database Dumps

Partners who choose to give us copies of their data typically dump their entire databases and give the dumps to us. We can work with dumps from almost any database system. Here we provide directions for dumping databases in some of the most common systems.

PostgreSQL

There are at least two ways to dump a Postgres database:

  1. Linux Command Line
    Replace [brackets] with the appropriate values; a filled-in example follows this list. -Fc uses Postgres’ compressed custom format:
    pg_dump -h [server] -U [user] -d [database] -Fc 1> [database dump filename] 2> [error log filename]
  2. pgAdmin
    Right-click on your database and choose the “Backup” option:
    [Screenshot: pgAdmin database dump dialog]
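
For example, assuming a hypothetical server db.example.org, user analyst, and database acme, the command-line dump would look like this; the resulting custom-format file can later be restored with pg_restore:
    pg_dump -h db.example.org -U analyst -d acme -Fc 1> acme.dump 2> acme_errors.log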

SQL Server

  • In SQL Server Management Studio, right-click on the database you’d like to dump
  • Choose Tasks -> Back Up (a scripted alternative is sketched below)
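
If a script is more convenient than the GUI, a minimal T-SQL sketch along these lines (with a hypothetical database name and output path) produces an equivalent backup file:
    BACKUP DATABASE [AcmeDB]
    TO DISK = 'C:\backups\AcmeDB.bak';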

Oracle

  1. Extract the number of rows and columns in your database’s tables so we can make sure all the data loads correctly on our side. We provide a script here.
  2. The database administrator (DBA) logs into the database from an interactive command-line session (e.g. SQL*Plus).
  3. The DBA creates a directory object:
    CREATE DIRECTORY dpump_dir1 AS '[directory on disk]';
    An example of [directory on disk] is /mnt/data/. Oracle needs read and write permissions on that directory.
  4. Grant the database user read and write access to the directory created in step 3:
    GRANT READ, WRITE ON DIRECTORY dpump_dir1 TO [database user];
  5. Run the following command from the operating-system shell, where [database user] is the same user as in step 4 (a filled-in example of steps 3-5 follows this list):
    expdp [database user]/[password] full=Y DIRECTORY=dpump_dir1 dumpfile=[dump filename] logfile=[log filename]
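
For concreteness, here is a minimal sketch of steps 3-5 with hypothetical values (dump directory /mnt/data/, database user acme_user, dump file acme_full.dmp); substitute your own names and paths. In the interactive SQL session, as the DBA:
    CREATE DIRECTORY dpump_dir1 AS '/mnt/data/';
    GRANT READ, WRITE ON DIRECTORY dpump_dir1 TO acme_user;
Then, from the operating-system shell:
    expdp acme_user/[password] full=Y DIRECTORY=dpump_dir1 dumpfile=acme_full.dmp logfile=acme_full.log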

Direct Access to Your System

Some partners prefer that we do our work on their systems. From the partner’s perspective, this can have significant benefits: the partner retains control of the data, and it is easier to deploy our work at the end of the project.

Partners who choose this approach need to provide us with the computational resources necessary to run our machine-learning pipeline. While your project’s needs may vary, for short-run projects (e.g. three-month projects at DSSG) we often use 128 cores, 1+ TB of RAM, and about 1 TB of disk space. The more computational resources you give us access to, the faster we can build good models. We can do reasonably well with fewer cores and less RAM (e.g. 32 cores and 256 GB), but the work will generally take longer, which may affect the project timeline.

All of the software we use is free, including the following:

  • Linux command-line tools (e.g. bash)
  • Python (numpy, pandas, scipy, scikit-learn at a minimum)
  • Postgres. We can use other database systems, but it will slow the work because we won’t be able to re-use our standard code base (e.g. triage).