Partners who choose to give us copies of their data typically dump their entire databases and give it to us. We can deal with dumps from almost any database. Here we provide directions on how to dump databases in some of the most common systems.
There are at least two ways to dump a Postgres database:
- Linux Command Line
Replace [brackets] with the appropriate values. -Fc uses Postgres’ compressed format:
pg_dump -h [server] -U [user] -d [database] -Fc 1> [database dump filename] 2> [error log filename]
Right click on your database and choose the “backup” option:
- In SQL Server Management Studio, right click on the database you’d like to dump
- Choose tasks -> backup
- Extract the number of rows and columns in your database so we can make sure all the data loads correctly on our side. We provided a script here.
- The database administrator logs into the database using the interactive command line mode.
- The DBA creates a directory object:
CREATE DIRECTORY dpump_dir1 AS ‘[directory on disk]’;
An example of [directory on disk] is /mnt/data/. Oracle needs read and write permissions on that directory.
- Grant the database user read and write access to the directory created in step 2:
GRANT READ,WRITE ON DIRECTORY dpump_dir1 TO [database user];
- Run the following command, where [database user] is the same as in step 3:
expdp [database user]/[password] full=Y DIRECTORY=dpump_dir1 dumpfile=[dump filename] logfile=[log filename]
Some partners prefer we do our work on their systems. From the partner’s perspective, this may have significant benefits: The partner retains control of the data, and it is easier to deploy our work at the end of the project.
Partners who choose this approach need to provide us with the computational resources necessary to handle our machine-learning pipeline. While your project’s needs may vary, we often use 128 cores, 1+TB of RAM, and a TB of disk space for short-run projects (e.g. three-month projects at DSSG). But the more computational resources you give access to, the faster we can build good models. We can do reasonably well with fewer cores and less RAM (e.g. 32/256), but it will generally take longer, which may affect our timeline.
We use all free software, including the following: