TL;DR:
If you pipe your Node.js streams and want to pause the flow, make sure you pause the last one in the chain.
const stream1 = s3.getObject( params ).createReadStream();
const stream2 = fastCsv.fromStream( stream1 ); // This does the piping behind the scenes.
// If you want to pause the streams, pause the last one in the chain
stream2.pause();
The longer story:
We’re building a Node.js application for a client of ours that ingests data from multiple data sources. Since the client is quite big both in size and in user base (we’re going to process data for ~50M users coming from tens of systems), the ingested CSV files are also relatively large (several GB each).
We’re using AWS S3 as the glue for the data: the systems upload their files there and we monitor for new data to ingest. We use the aws-sdk Node package to read the files as streams, parse them with fastCsv, and create an audit log and snapshots for each user in a PostgreSQL database.
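To make the setup concrete, here is a minimal sketch of the read side of that pipeline. It assumes the aws-sdk v2 client and the older fast-csv API that exposes fromStream (as in the TL;DR above); the bucket name, key, and row handlers are placeholders, not our actual code.

const AWS = require( 'aws-sdk' );
const fastCsv = require( 'fast-csv' );

const s3 = new AWS.S3();

// Read the uploaded CSV straight from S3 as a stream...
const s3Stream = s3.getObject( { Bucket: 'ingest-bucket', Key: 'users/export.csv' } ).createReadStream();

// ...and hand it to fast-csv, which pipes s3Stream into its parser internally.
const csvStream = fastCsv.fromStream( s3Stream, { headers: true } );

csvStream
    .on( 'data', ( row ) => { /* collect the row into the current batch */ } )
    .on( 'end', () => { /* flush whatever is left in the last batch */ } )
    .on( 'error', ( err ) => console.error( err ) );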
We batch the inserts and pause the data stream while each batch is being written, so we don’t end up with a back-pressure problem.
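A sketch of that batching logic, assuming a hypothetical insertBatch() helper that writes a batch of rows to PostgreSQL and returns a promise; the batch size is arbitrary and only for illustration.

const BATCH_SIZE = 1000; // arbitrary, for illustration only
let batch = [];

csvStream.on( 'data', ( row ) => {
    batch.push( row );

    if ( batch.length >= BATCH_SIZE ) {
        // Pause the parsed stream (the last one in the chain) while we write.
        csvStream.pause();

        const rows = batch;
        batch = [];

        insertBatch( rows ) // hypothetical helper that inserts the rows into PG
            .then( () => csvStream.resume() )
            .catch( ( err ) => csvStream.emit( 'error', err ) );
    }
} );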
While testing the ingestion of the big files we noticed something peculiar: we thought we had paused the stream, but it kept pushing data as if .pause() had never been invoked.
The mistake we made turned out to be quite common when working with streams: we called the .pause() method of the S3 stream, which we had pipe()-d into another stream, the fastCsv one. In that setup, whenever the fastCsv stream drained, pipe() called resume() on the S3 stream again, undoing our pause.
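That behaviour is not a bug; it is how pipe() manages back-pressure. Here is a simplified view of what pipe() effectively wires up between a source and a destination (not the actual Node.js internals, just the part relevant to our problem):

source.on( 'data', ( chunk ) => {
    if ( !destination.write( chunk ) ) {
        source.pause(); // destination buffer is full, stop the source for now
    }
} );

destination.on( 'drain', () => {
    source.resume(); // <- this is what silently undid our manual pause() on the S3 stream
} );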
In order to pause the whole chain, one must call pause() on the last piped stream (in our case, the fastCsv one).
More on back-pressure: while researching our issue, I found a very extensive article about back-pressure in Node.js.