Creating an API endpoint that streams data from the DB
I am working on a microservice, and one of the biggest pain points I’ve run into is CSV downloads for users with over 1 million rows. It should be easy to make CSV downloads available, right?
Existing Architecture
Here is the existing architecture. Before you ask, yes it’s a microservice, though the “micro” part is becoming debatable. This service “aggregates” the data and gives users the option to export their activity as a CSV.

On top of that, the “CSV” (note the quotes) is returned as JSON of all things. Why? Probably because GraphQL doesn’t really support cursor-based responses. Streaming in GraphQL is a relatively new feature that I have not had a chance to really look into.
After spending an extensive amount of time optimizing the database, it seemed like nothing I did would work: for the user with 700k+ rows, it still took 30+ seconds to pull the data out of PostgreSQL. On top of that, all the data lives in JSONB. To make matters worse, we sort by the JSONB fields.
Sort Speed
For example:
CREATE INDEX table_updatedAt ON table (data_to_timestamp(data->>'updated'::text) DESC)
Then any query that wants to use that index has to sort by the exact same expression:
SELECT * FROM table ORDER BY data_to_timestamp(data->>'updated'::text) DESC
Async Flow
Due to the size of the data, the proposed solution was something called the “async flow”: the user requests a download, gets added to some kind of queue, and eventually a link to the file is emailed to them.
If we were to draw this out, it would look approximately like this:

The user makes a request to get their CSV and then has to wait an undetermined amount of time to receive the file, or a link to it, in an email.
In summary:
- The user makes a request for the file
- They get added to some sort of queue
- We show them their position in the queue so they know they’re not forgotten (or that the export is not broken)
- When it is the user’s turn to be processed, the file gets generated and uploaded to S3
- When complete, we notify the user by email (a rough sketch of this worker side follows below)
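To make the moving parts concrete, the worker side of that flow would look something like the sketch below. Every name in it (`exportQueue`, `generateCsv`, `uploadToS3`, `sendEmail`) is a hypothetical stand-in for whatever queue, storage, and mailer end up being used; this is not our actual code.

```js
// Hypothetical worker for the async flow; all helpers are stand-ins.
const exportQueue = require("./queue")                   // some job queue
const { generateCsv, uploadToS3 } = require("./exports") // CSV generation + storage
const { sendEmail } = require("./mailer")                // notification

exportQueue.process(async (job) => {
  const { userId, email, filters } = job.data

  // Build the entire CSV up front, then ship it off and notify the user.
  const filePath = await generateCsv(userId, filters)
  const url = await uploadToS3(filePath)
  await sendEmail(email, `Your export is ready: ${url}`)
})
```

Each of those steps is another thing that can fail, which is exactly the maintenance and debugging surface the next section gets into.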
Streaming
The “async” flow doesn’t sit well with me. It feels over-complicated and prone to maintenance and debugging nightmares. After spending some time reading up on how to deal with data-heavy applications, streaming came up as a very common theme. The architecture would look like this, and I am sure it will look familiar:

There is one problem: we’re returning the data as CSV and not JSON, and the GraphQL server is not going to like that.
After spending some time playing around with it, I discovered that we can create new endpoints/routes on the server and, in essence, skip GraphQL altogether. We don’t want to get rid of the server entirely because it has some important logic that the endpoint should reuse. We want the GraphQL server to act as an API gateway.
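For illustration, here is roughly what that looks like when the gateway is a Koa app using @koa/router. The `/export/csv` path, `exportCsvHandler`, `app`, and `graphqlMiddleware` are placeholders rather than our actual setup; the handler itself is sketched further down.

```js
const Router = require("@koa/router")

const router = new Router()

// A plain REST route that lives next to the GraphQL endpoint and bypasses it entirely.
router.get("/export/csv", exportCsvHandler)

app.use(router.routes())
// The existing GraphQL middleware keeps handling everything else as before.
app.use(graphqlMiddleware)
```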
So how do we go about actually streaming the data? Well, I got lucky on that part.
Our underlying ORM is massive.js and it supports “streaming”. The ORM is basically a wrapper over node-pg-cursor.
Seeing as massive.js took care of most of the “plumbing” so to speak, it made sense for me to stick with this.
CSV Server
The handler on the CSV service boils down to this:

const stream = await db.query(`SELECT * FROM table WHERE ...`, { stream: true })

stream.on("data", (row) => {
  // res is the Koa response object each handler receives.
  // If the connection is broken/closed, we stop writing (and close the db connection as well).
  if (!res.finished) {
    res.write(toCSV(row))
  }
})

// Some cleanup will be needed here, but it's not important for this example
stream.on("end", cleanup)
For Koa, we’d need to create a piped stream object and just pass it to ctx.body; Koa will take care of the rest. Over on the gateway, the proxy handler is little more than this:
await verifyAuth(request)
...
// responseType: "stream" tells axios to hand back a readable stream in .data
const upstream = await axios.get("/data/csv_path", { responseType: "stream" })
upstream.data.on("data", (chunk) => res.write(chunk))
...
Basically, we’re creating a “proxy” endpoint that can deal with gateway-related concerns like rate limits and whatnot.
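Putting those pieces together, the gateway route might look roughly like the sketch below. It assumes the gateway is a Koa app, reuses the `verifyAuth` check and the placeholder `/data/csv_path` and `/export/csv` paths from above, and is a starting point rather than the final implementation.

```js
const { PassThrough } = require("stream")
const axios = require("axios")

// Sketch of the proxy handler registered as /export/csv on the gateway.
async function exportCsvHandler(ctx) {
  await verifyAuth(ctx.request) // existing gateway logic (auth, rate limits, ...) stays in front

  // Ask the CSV service for a streamed response instead of a buffered one.
  const upstream = await axios.get("/data/csv_path", { responseType: "stream" })

  // The "piped stream" handed to ctx.body; Koa pipes it out to the client for us.
  const body = new PassThrough()
  ctx.set("Content-Type", "text/csv")
  ctx.set("Content-Disposition", 'attachment; filename="export.csv"') // filename is illustrative
  ctx.body = body

  // Forward chunks as they arrive so the full file is never buffered in memory.
  upstream.data.pipe(body)
}
```

Since nothing along the way buffers the whole result, memory use on the gateway stays flat no matter how many rows the user exports.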
Pros and Cons
Streaming
| Pros | Cons |
|---|---|
| The user gets the file right away instead of waiting for an email | Needs a dedicated non-GraphQL endpoint on the gateway |
| No queue, S3 upload, or email step to build and maintain | The request and database connection stay open for the whole download |
Async
| Pros | Cons |
|---|---|
| The export runs in the background, so the request itself stays short | Many moving parts (queue, S3 upload, email) to maintain and debug |
| Works no matter how long a large export takes | The user waits an undetermined amount of time for an email with the link |