Recently I’ve been working on a mechanism to mirror a dataset from a local filesystem to Google Nearline cloud storage in an encrypted format. The cost is really the compelling factor. While I could buy some hard drives and put them at another location (the colo-buddy system!), it just made sense not to have to deal with the logistics, maintenance, power, network, etc…

Of course, being me, it’s a PHP CLI script, because well – me. If you’re looking for something like this, feel free to use it and modify it for your purposes. This one is built for Linux, but it wouldn’t be terribly hard to port it to Windows.

Note that the first step will be to install gsutil, which is provided by Google; they have all the documentation on that front. Once it’s working you will want to create your Nearline bucket for storage. Make sure you specify the storage class of NL (Nearline) and the location where you want to host it. For me, something like:

gsutil mb -c NL -p backup-storage-3144254 -l US-CENTRAL1 gs://offline-storage

Of course you will have to change offline-storage to the name you want your bucket to have (and note these names are unique across all of Google’s customers, so you have to come up with something no one else is using). The -p parameter is your project ID, which you can get from your Google Cloud console.
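
To sanity-check that the bucket came up with the right storage class and location, you can ask gsutil for the bucket metadata; the storage class should show up in the listing:

gsutil ls -L -b gs://offline-storage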

Next is the script. The basic syntax is simply:

gbackup -d /usr/local/files -u gs://offline-storage

The script will first list all the files at the URL and directory specified (so a gsutil ls -l gs://offline-storage/usr/local/files/**) and then compare that against all the files in the local directory to see if the local file is newer or doesn’t exist in the bucket yet. Then the copying begins.
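
For reference, each line of gsutil ls -l output looks roughly like the following – a size in bytes, an ISO 8601 timestamp, and the object URL, with a TOTAL summary line at the end that the script skips (the file name here is just an example):

     48102  2015-08-14T02:11:09Z  gs://offline-storage/usr/local/files/notes.txt.aes
TOTAL: 1 objects, 48102 bytes (46.97 KiB)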

In the script below you will want to set variables for:

  • $key – the location of your aescrypt key file (see the note after this list on generating one)
  • $aescrypt – the location of your aescrypt executable
  • $gsutil – the location of gsutil
  • $workingdir – the directory to use for temp files created from encrypted copies (aescrypt output)
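
If you don’t already have a key file, the AES Crypt package for Linux ships an aescrypt_keygen utility that can generate one; something along these lines should work (the key length and output path here are just examples):

aescrypt_keygen -g 64 /home/user/crypt-cipher

And here’s the script itself:
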
#!/usr/bin/php
<?php

// Todo:
//   Validate deleted files and mark a reference file to follow up with cloud deletion
//   Build an exclude file for directories and files to ignore
//   Find a workaround for files containing [ (gsutil treats [ ] as wildcards)
//   Enable pthreads support in transfer()


$path = "";
$gsurl = "";
$key = "/home/user/crypt-cipher";
$aescrypt = "/usr/bin/aescrypt";
$gsutil = "/usr/bin/gsutil";
$workingdir = "/mnt/storage/";
$tempfile = sprintf("temp.%d.aes", getmypid());
$bucketfiles = array();
$localfiles = array();

#
# Setup options
#

date_default_timezone_set("GMT");
$options = getopt("d:u:v");
foreach($options as $option => $value)
{
  if($option == "v") { echo "gbackup v0.1\n"; exit; }
  if($option == "d") $path = realpath($value);
  if($option == "u") $gsurl = rtrim($value, "/");
}
if(!strlen($path)) { echo "No path given. Exiting.\n"; exit(1); }
if(!strlen($gsurl)) { echo "No url given. Exiting.\n"; exit(1); }
$workingdir = rtrim($workingdir, "/");


#
# Process files in directory
#

// SKIP_DOTS keeps the "." and ".." entries out of the iteration
$objects = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($path, FilesystemIterator::SKIP_DOTS));
foreach($objects as $name => $object){
    $mtime = filemtime($name);
    $localfiles[$name] = array("size" => filesize($name), "date" => date("c", $mtime), "timestamp" => $mtime);
}

#
# Process files in bucket
#

$cmd = sprintf("%s ls -l %s%s/**", $gsutil, $gsurl, $path);
echo $cmd . "\n";
exec($cmd, $bucketlist);
foreach($bucketlist as $fileurl)
{
  if(strstr($fileurl, "TOTAL: ")) continue;
  if(!preg_match("/\s+(\d+)\s+(\S+)\s+gs:\/\/\S+?\/(.*)/", $fileurl, $data)) continue;
  $bucketfiles[$data[3]] = array("size" => $data[1], "date" => $data[2], "timestamp" => strtotime($data[2]));
}

#
# Encrypt and upload files to bucket
#

function transfer($file)
{
  global $aescrypt, $gsutil, $tempfile, $workingdir, $gsurl, $key;
  // Encrypt to a temp file in the working directory, then upload the result
  $cmd = sprintf("%s -e -k %s -o %s/%s %s 2>&1", $aescrypt, $key, $workingdir, $tempfile, escapeshellarg($file));
  exec($cmd, $cap, $ret);
  //printf("crypt: %s - %s (%s)\n", $cmd, implode("-", $cap), $ret);
  if($ret)
  {
    printf("FAILURE: encrypt %s: %s\n", $file, implode(" ", $cap));
  } else {
    $cap = array(); // exec() appends to $cap, so reset it between calls
    $cmd = sprintf("%s cp %s/%s %s 2>&1", $gsutil, $workingdir, $tempfile, escapeshellarg($gsurl . $file . ".aes"));
    exec($cmd, $cap, $ret);
    if($ret) printf("FAILURE: transfer %s: %s\n", $file, implode(" ", $cap));
    //printf("gsutil: %s - %s (%s)\n", $cmd, implode("-", $cap), $ret);
    unlink(sprintf("%s/%s", $workingdir, $tempfile));
  }
}

foreach($localfiles as $file => $values)
{
  printf("Processing %s\n", $file);
  $remote = ltrim($file, "/") . ".aes";
  if(array_key_exists($remote, $bucketfiles))
  {
    // Compare before Encrypt and copy
    if($values["timestamp"] > $bucketfiles[$remote]["timestamp"])
    {
      // File is newer
      transfer($file);
    }
    // else skip file

  } else {
    transfer($file);
  }
}
?>
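
To run it unattended, save the script somewhere like /usr/local/bin/gbackup, mark it executable, and add a crontab entry – the paths and schedule below are just an example:

chmod +x /usr/local/bin/gbackup
0 3 * * * /usr/local/bin/gbackup -d /usr/local/files -u gs://offline-storage >> /var/log/gbackup.log 2>&1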

I’m still copying about 1.5TB of data. (Need to add pthread support because this is, well… not fast.) However, after some testing I’ll report back on the final costs. (Note, Google charges pennies for just about everything you do: Class A requests, storage, early deletion… etc etc etc.)
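
Until that TODO gets done, here’s a rough sketch of how the fan-out could work with pcntl_fork instead (which, unlike pthreads, doesn’t need a thread-safe PHP build). This is just an illustration, not part of the script above: it assumes a hypothetical $needed array holding the files the comparison loop decided to upload, plus the transfer() function as written.

// Hypothetical sketch: split $needed across worker processes with pcntl_fork
// (requires the pcntl extension, Linux only)
$workers = 4;                               // arbitrary degree of parallelism
$queues = array_fill(0, $workers, array());
foreach($needed as $i => $file)
  $queues[$i % $workers][] = $file;         // round-robin across the workers

$pids = array();
foreach($queues as $queue)
{
  $pid = pcntl_fork();
  if($pid == -1) { echo "fork failed\n"; exit(1); }
  if($pid == 0)
  {
    // Child: regenerate $tempfile so the workers don't clobber each other
    $tempfile = sprintf("temp.%d.aes", getmypid());
    foreach($queue as $file) transfer($file);
    exit(0);
  }
  $pids[] = $pid;                           // parent: remember the child pid
}
foreach($pids as $pid) pcntl_waitpid($pid, $status);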

In other words, YMMV. Best of luck!