AWS Closes S3 Read Stream Unexpectedly

I’m continuing with my notes on transferring big files to and from AWS S3 with node.js.

If you are reading a file from an S3 bucket using a stream that you occasionally pause, keep in mind that the read stream will be closed after 60 minutes.

If you cannot process the file within that period, you’ll still receive a final ‘data’ and an ‘end’ event, even though you haven’t finished processing the file.

One possible solution is to download the file locally before starting the import, process it, and delete it once it is no longer needed.

// Assuming the usual setup:
const AWS = require( 'aws-sdk' );
const fastCsv = require( 'fast-csv' );
const fs = require( 'fs' );
const path = require( 'path' );

const s3 = new AWS.S3();

// So instead of:
const s3Stream = s3.getObject( params ).createReadStream();
const csvStream = fastCsv.fromStream( s3Stream, csvParams );
/* Do your processing of the csvStream */


// Store your file to the file system
const s3Stream = s3.getObject( params ).createReadStream();
const localFileWriteStream = fs.createWriteStream( path.resolve( 'tmp', 'big.csv' ) );
s3Stream.pipe( localFileWriteStream );

localFileWriteStream.on( 'close', () => {
    const localReadStream = fs.createReadStream( path.resolve( 'tmp', 'big.csv' ) );

    const csvStream = fastCsv.fromStream( localReadStream, csvParams );

    csvStream.on( 'data', ( data ) => {
        /* Do your processing of the csvStream */
    });

    csvStream.on( 'end', () => {
        // Delete the tmp file once we're done with it
        fs.unlink( path.resolve( 'tmp', 'big.csv' ), ( err ) => {
            if ( err ) console.error( err );
        });
    });
});

Node.js Streams and why sometimes they don’t pause()

TL;DR:

If you pipe your node.js streams, make sure you pause the last one in the chain.

const stream1 = s3.getObject( params ).createReadStream();
const stream2 = fastCsv.fromStream( stream1 ); // This pipes stream1 behind the scenes.
// If you want to pause the streams, pause the last in the chain
stream2.pause();

The longer story:

We’re building a node.js application that ingests data from multiple data sources for a client of ours. Since the client is quite big in size and in user base (we’re going to process data for ~50M users from tens of systems), the ingested CSV files are also relatively big – several GB each.


We’re using AWS S3 as the glue for the data – the systems upload their data there and we monitor for new files to ingest. We use the aws-sdk node package to read the files as streams, parse them with fastCsv, and create an audit log and snapshots for each user in a PG database.

We batch the inserts and pause the data stream while each batch is being written, so we don’t end up with a back-pressure problem.
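
The batching pattern looks roughly like this – a minimal sketch, where insertBatch() is a hypothetical helper that writes the accumulated rows to PG and returns a promise, and the batch size is only illustrative:

const BATCH_SIZE = 1000; // illustrative value
let batch = [];

csvStream.on( 'data', ( row ) => {
    batch.push( row );

    if ( batch.length >= BATCH_SIZE ) {
        // Pause the csv stream (the last one in the chain) while the batch is inserted
        csvStream.pause();

        insertBatch( batch ).then( () => {
            batch = [];
            csvStream.resume();
        });
    }
});

csvStream.on( 'end', () => {
    // Flush whatever is left in the last (partial) batch
    if ( batch.length > 0 ) {
        insertBatch( batch );
    }
});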

While testing the ingestion of the big files we noticed something peculiar. We thought we had paused the stream, but it continued to push data as if .pause() had not been invoked.

The mistake we made turned out to be quite common when working with streams – we called the .pause() method of the s3Stream, which we had pipe()-d to another stream – the fastCsv one. In this scenario, when the fastCsv stream drained, it called the resume() method of the s3Stream.

In order to pause the streams, one must pause() the last piped one (in our case the fastCsv one).
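
To make the failure mode concrete – a sketch reusing the stream setup from the first section:

const s3Stream = s3.getObject( params ).createReadStream();
const csvStream = fastCsv.fromStream( s3Stream, csvParams ); // pipes s3Stream internally

// Doesn't work as expected: pipe()'s internal 'drain' handler
// resumes s3Stream again as soon as csvStream can accept more data
s3Stream.pause();

// Works: pause the last stream in the chain...
csvStream.pause();
// ...and resume it once the current batch has been processed
csvStream.resume();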

More on back-pressure: while researching our issue, I also found a very extensive article about back-pressure in node.js.

Testing with Jest in a node and ReactJS monorepo (and getting rid of environment.teardown error)

A big number of the applications we develop have at least one ReactJS UI held in one repo and an API held in another. If we need to reuse some part of the code, we do so by moving it to a separate repository and adding it as a git submodule.

For our latest project we decided to give the monorepo approach a try (we haven’t yet come to a conclusion on whether it better fits our needs). The project is a node.js API with a ReactJS app based on create-react-app.

The first issue we faced was with testing the node app – tests ran just fine in the React application (/app/), but if you tried to run them for the server, you’d get the following error:

● Test suite failed to run
TypeError: environment.teardown is not a function

  at ../node_modules/jest-runner/build/run_test.js:230:25

In our package.json we had the trivial test definition – just running jest:

"scripts": {
    ...
    "test": "jest"
    ...
}

We hadn’t had this issue with a node API that had no CRA app in it – as it turned out, we had to explicitly indicate that the test environment is node.

To do so, we added a testconfig.json and referenced it in the test script in package.json.

testconfig.json

{
	"testEnvironment": "node"
}

package.json

{
	"scripts": {
		"test": "jest --config=testconfig.json"
	}
}

If you want jest to watch your files and re-run the tests on changes, change the "test" script to "jest --watchAll --config=testconfig.json".
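
Alternatively, you could keep both scripts side by side – the test:watch name below is just a suggestion:

{
	"scripts": {
		"test": "jest --config=testconfig.json",
		"test:watch": "jest --watchAll --config=testconfig.json"
	}
}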