Bulk import of CSV data into HANA using Node.js and xargs

Nov 25, 2021·
Dirk Kemper
Dirk Kemper
· 4 min read

Recently I dusted off some prior work on analyzing NYC yellow cab rides which I described in the following blog post: Timelapse data exploration of NYC Taxi rides

A large chunk of time in setting up the solution from that post sits in loading the data into HANA using the Eclipse “Data from local file” import screen. Although it works fine from a functional perspective the performance lags behind quite a bit, I even remember it being available in the very first versions of HANA Studio 10 years ago. I am not sure if this has received any attention since then.

In this post I will present a quicker solution for loading the CSV data into HANA by using Node.js. In the last year I have been working with Node.js as a generic command line programming tool I have noticed that as it has been optimized for fast I/O many problems are really faster when converted into JavaScript and run in Node.js.

Although the scenario will be tailored towards the NYC Taxi trips CSVs it should be straightforward to convert it to your own scenario.

The example has been tested on Node.js v16. Make sure to install the following packages using npm:

  • fast-csv
  • @sap/hana-client

The program will look as below. Save it in a file called ‘load-trips.js’.

const fs = require('fs');
const fastcsv = require("fast-csv");
var hana = require('@sap/hana-client');
var hanaStream = require('@sap/hana-client/extension/Stream');

const BATCH_SIZE = 100000;

let inputFile = '';
if(process.argv[2] == undefined) {
    console.log("Usage: load-trips <filename>");
    process.exit(1);
} else {
    inputFile = process.argv[2];
}

var hanaConn = hana.createConnection();

var connOptions = {
    serverNode: 'hxehost:39015',
    uid: 'ML_USER',
    password: 'Password1',
    sslValidateCertificate: 'false'
};

function processCsv() {
    console.time("processCsv");
    console.log("opening CSV: " + inputFile)
    let stream = fs.createReadStream(inputFile);
    let csvData = [];
    let firstLineProcessed = false;
    let numLines = 0;
    let csvStream = fastcsv
      .parse()
      .on("data", function(data) {
        if(firstLineProcessed) {
            if(data[0] != undefined)    // do not add empty fields
                csvData.push(data);
            numLines = numLines + 1;
            // If buffer is full insert it into HANA
            if(numLines == BATCH_SIZE) {
                //console.log(csvData);
                stmt = hanaConn.prepare("INSERT INTO YELLOW_TRIPDATA_STAGING VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)");
                result = stmt.execBatch(csvData);

                // start over with new batch
                csvData = [];
                numLines = 0;
            }
        } else {
            // just skip first line
            firstLineProcessed = true;
        }
      })
      .on("end", function() {
        console.timeEnd("processCsv");
      });
    
    stream.pipe(csvStream);
};


hanaConn.connect(connOptions, function(err) {
    if (err) {
        console.error(err);
        process.exit(1);
    }
    
    processCsv();
});

Please have a look at the database connection parameters and update as required.

The program will read in a CSV file supplied on the command line and buffer these in memory into batches of 100,000 and commit these into the HANA database using the execBatch() function. This brought down the import time for a single CSV down from 1.5 hours to 14 minutes on my dated hardware.

As I am using Windows I did a quick comparison between the native Node.js Windows binary and a Node.js instance running inside a Ubuntu WSL-2 (Windows Subsystem for Linux) host. It appeared that the Ubuntu host only took 30 seconds longer, which is a huge improvement over former experiences I had with WSL-1.

Running in parallel

I now wanted to make use of the 4 cores in my machine to run the import process in parallel. As the Node.js script is meant to take a CSV as input we are able to perform a very simple but effective parallelization from the command line using xargs. This approach is also suitable for many other problems or programs.

Lets assume the CSV files you want to import are all stored in folder /mnt/c/nyc-csv, then just use the following command to start 4 instances:

ls /mnt/c/nyc-csv/*.csv | xargs --max-procs=4 -L1 node load-trips

This command can be split into the following functionality:

  • The part before the pipe (|) symbol lists the CSVs from the chosen folder and pipes them into xargs
  • xargs just appends the filename to the end of the command, so a single command will look like ‘node load-trips yellow_tripdata_2014-01.csv’
  • The L1 parameter is telling xargs that each individual argument (from the ls command) is on a single line
  • Now for the best feature: the –max-procs flag tells xargs to spawn 4 processes and keep as much of these alive as possible. If one of the processes finishes, it immediately spawns the next one until all CSVs have been imported. Obviously this should be tweaked for the number of cores your CPU has available

This means that a 12-core (or more) machine can run the import in minutes. Now that’s a great improvement if you have set off to import 10+ years of these files!