Partners who choose to give us copies of their data typically dump their entire databases and give the dumps to us. We can deal with dumps from almost any database. Here we provide directions on how to dump databases in some of the most common systems.
There are at least two ways to dump a Postgres database:
- Linux Command Line
Replace [brackets] with the appropriate values. -Fc selects Postgres’ custom archive format, which is compressed by default:
pg_dump -h [server] -U [user] -d [database] -Fc > [database dump filename] 2> [error log filename]
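For partners who would rather script the dump, the command above can be wrapped in Python. This is a minimal sketch, assuming pg_dump is on the PATH and that authentication is handled outside the script (for example through a ~/.pgpass file). The function names are illustrative and not part of any existing tooling.

```python
import subprocess


def build_pg_dump_cmd(server, user, database):
    """Return the pg_dump argument list for a custom-format (-Fc) dump."""
    return ["pg_dump", "-h", server, "-U", user, "-d", database, "-Fc"]


def run_pg_dump(server, user, database, dump_path, log_path):
    """Run pg_dump, sending the dump to dump_path and errors to log_path.

    Assumes pg_dump is installed and can authenticate without prompting.
    Returns pg_dump's exit code (0 means success).
    """
    cmd = build_pg_dump_cmd(server, user, database)
    # Binary mode: the custom format is not plain text.
    with open(dump_path, "wb") as dump, open(log_path, "wb") as log:
        return subprocess.run(cmd, stdout=dump, stderr=log).returncode
```

Calling run_pg_dump with hypothetical values such as ("db.example.org", "alice", "sales", "sales.dump", "sales.err") mirrors the shell redirections above. A non-zero return value means the error log should be checked before the dump is sent.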
Right click on your database and choose the “backup” option:
- In SQL Server Management Studio, right click on the database you’d like to dump
- Choose Tasks -> Back Up...
- Extract the number of rows and columns in your database so we can make sure all the data loads correctly on our side. We provided a script here.
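The script referenced above is not reproduced in this text. As a stand-in, the sketch below shows one way to collect row and column counts per table through any Python DB-API connection. The function name is hypothetical, and the table names are assumed to come from a trusted source because they are interpolated into the SQL. The demonstration runs against an in-memory SQLite database only so that the example is self-contained; for SQL Server, a driver such as pyodbc would supply the connection instead.

```python
import sqlite3


def table_shape(conn, tables):
    """Return {table: (row_count, column_count)} for each named table.

    Table names cannot be bound as query parameters, so they are
    interpolated directly; pass only names you trust.
    """
    shapes = {}
    cur = conn.cursor()
    for table in tables:
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        rows = cur.fetchone()[0]
        # A query that returns no rows still describes its columns.
        cur.execute(f"SELECT * FROM {table} WHERE 1 = 0")
        shapes[table] = (rows, len(cur.description))
    return shapes


# Self-contained demonstration on an in-memory SQLite database.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE visits (id INTEGER, patient TEXT, seen_on TEXT)")
demo.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [(1, "a", "2020-01-01"), (2, "b", "2020-01-02")],
)
print(table_shape(demo, ["visits"]))
```

Running the same function on the source database before the dump and on the restored copy afterwards gives two dictionaries that can be compared directly.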
- The database administrator logs into the database using the interactive command line mode.
- The DBA creates a directory object:
CREATE DIRECTORY dpump_dir1 AS '[directory on disk]';
An example of [directory on disk] is /mnt/data/. Oracle needs read and write permissions on that directory.
- Grant the database user read and write access to the directory created in step 2:
GRANT READ,WRITE ON DIRECTORY dpump_dir1 TO [database user];
- Run the following command, where [database user] is the same as in step 3:
expdp [database user]/[password] full=Y DIRECTORY=dpump_dir1 dumpfile=[dump filename] logfile=[log filename]
- On the Linux command line (replace [brackets] with the appropriate values; note that there is no space between -p and the password):
mysqldump -u [user] -p[password] [database name] > [dump filename] 2> [error log filename]
Some partners prefer we do our work on their systems. From the partner’s perspective, this may have significant benefits: The partner retains control of the data, and it is easier to deploy our work at the end of the project.
Partners who choose this approach need to provide us with the computational resources necessary to handle our machine-learning pipeline. For most projects, we can do well with 16 cores, 128 GB of RAM, and at least 1 TB of disk space. The more computational resources we get, the faster we can build good models.
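A quick way to see whether a candidate machine meets these figures is sketched below. It is an illustration rather than a required check. It assumes a Linux host (RAM is read through os.sysconf), uses decimal units, and measures the total capacity of the filesystem containing the given path, which is one reading of "disk space". Reported figures are usually a little below nominal hardware sizes, so a near miss is a prompt to look closer rather than a hard failure. All names are hypothetical.

```python
import os
import shutil

# Minimums quoted in the text above.
MIN_CORES = 16
MIN_RAM_GB = 128
MIN_DISK_TB = 1


def shortfalls(cores, ram_gb, disk_tb):
    """Return one human-readable message for each unmet minimum."""
    problems = []
    if cores < MIN_CORES:
        problems.append(f"cores: {cores} < {MIN_CORES}")
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM: {ram_gb:.0f} GB < {MIN_RAM_GB} GB")
    if disk_tb < MIN_DISK_TB:
        problems.append(f"disk: {disk_tb:.2f} TB < {MIN_DISK_TB} TB")
    return problems


def measure(path="/"):
    """Measure this machine (Linux only: RAM is read via os.sysconf)."""
    cores = os.cpu_count() or 0
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    disk_bytes = shutil.disk_usage(path).total
    # Decimal units: 1 GB = 10**9 bytes, 1 TB = 10**12 bytes.
    return cores, ram_bytes / 10**9, disk_bytes / 10**12
```

On the target machine, print(shortfalls(*measure("/mnt/data"))) lists whatever falls short, with the path replaced by wherever the project data would live; an empty list means all three minimums are met.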
All the software we use is free, including the following:
- Linux command-line tools (e.g. drake)
- Python (numpy, pandas, scipy, scikit-learn at a minimum)
- Postgres. We can use other database systems, but it will slow the work.
Your project’s specific needs may vary.
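To see which of the Python packages listed above are already installed on a partner machine, a check along these lines can help. It is a minimal sketch: the function name is hypothetical, and it only tests whether each package can be found for import, not which version is present.

```python
import importlib.util

# Import names for the packages listed above; note that
# scikit-learn is imported as "sklearn".
REQUIRED = ("numpy", "pandas", "scipy", "sklearn")


def missing_packages(names=REQUIRED):
    """Return the subset of names that cannot be found for import."""
    return [n for n in names if importlib.util.find_spec(n) is None]


print("missing:", missing_packages())
```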