Scaling async workloads: Building a high-performance architecture with BullMQ and AWS
Managing high-traffic portals that require users to upload massive files—some exceeding 500MB—demands a robust architectural approach to ensure both responsiveness and reliability.
The requirements were clear: the UI had to stay responsive, uploads couldn't block the server, and the processing needed to happen asynchronously across several third-party services.
When you're dealing with this level of scale, you can't just POST a file to an Express endpoint and hope for the best. You'll hit memory limits, timeout issues, and eventually, your server will just fall over.
Here is how I architected the solution using BullMQ and AWS to ensure 99.9% reliability and a seamless user experience.
The architecture overview
The core philosophy here is "Offload everything." We want the web server to do as little as possible.
Front-end and authentication
We use a React-based front-end. For the login, I prefer using a managed service like AWS Cognito or a simple JWT-based flow. The key here is that the front-end never sends the file bytes to our API. Instead, it asks the API for permission to talk to the storage layer directly.
The direct-to-S3 upload (The secret sauce)
Instead of streaming bytes through your Node.js process, use S3 Presigned URLs.
The Flow: The user fills out the form. The client calls a small /get-upload-url endpoint. The server generates a temporary, signed URL for an S3 bucket and sends it back.
The Benefit: The client uploads the large file directly to S3. Your server's CPU and RAM stay at 0% usage during the actual upload. For files over 100MB, I recommend using S3 Multipart Uploads to allow for retries of failed chunks.
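The 100MB threshold above can be encoded in a small client-side helper. This is a hedged sketch with hypothetical names (`planUpload`, `PART_SIZE`); I assume 10MB parts here, well above S3's 5MB multipart minimum:

```typescript
// Hypothetical client-side helper: decide between a single presigned PUT
// and a multipart upload, based on the 100MB threshold discussed above.
const MULTIPART_THRESHOLD = 100 * 1024 * 1024; // 100MB
const PART_SIZE = 10 * 1024 * 1024; // 10MB per part (S3 minimum is 5MB)

interface UploadPlan {
  strategy: 'single' | 'multipart';
  partCount: number;
}

function planUpload(fileSizeBytes: number): UploadPlan {
  if (fileSizeBytes <= MULTIPART_THRESHOLD) {
    return { strategy: 'single', partCount: 1 };
  }
  return {
    strategy: 'multipart',
    partCount: Math.ceil(fileSizeBytes / PART_SIZE),
  };
}
```

A 500MB file would be split into 50 parts; each part gets its own presigned URL, so a dropped connection only costs you one 10MB chunk, not the whole upload.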
The job orchestrator: BullMQ + ElastiCache
Once the upload finishes, the client notifies the API. This is where BullMQ comes in. We don't process the file yet. We just "queue" it.
I chose AWS ElastiCache (Redis OSS) as the backend for BullMQ. It is a managed service that handles the scaling and persistence of our task list. BullMQ uses Lua scripts to ensure that job state transitions are atomic, which is why we need a robust Redis provider like ElastiCache rather than a basic in-memory store.
The worker tier
We run a separate cluster of Node.js workers. These workers "subscribe" to the BullMQ queue.
Isolation: If a processing task crashes, it doesn't take down the website.
Scalability: If the queue grows, we can spin up more workers (Auto Scaling Groups) without touching the web tier.
Deep dive: Concurrency and sandboxed processors
When handling heavy file processing, you need to manage how many jobs a single worker handles. Node.js is single-threaded, so a CPU-intensive task (like image resizing or PDF parsing) can block the event loop.
I solve this using Sandboxed Processors. BullMQ allows you to run your processing code in a separate child process or worker thread. This ensures that even if a job gets stuck in an infinite loop, your main worker remains responsive to heartbeats and new jobs.
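The processor file is just a module that default-exports an async function, which BullMQ invokes once per job in the sandbox. A minimal sketch (the `processS3Object` helper is a hypothetical placeholder for the real download-and-parse logic; the job's `data` is whatever the producer enqueued):

```typescript
// processor.ts — runs in a separate worker thread / child process.
// BullMQ calls the default export once per job.

interface FileJobData {
  userId: string;
  s3Key: string;
}

// Hypothetical placeholder for the real S3 download + CPU-heavy work
// (image resizing, PDF parsing, etc.).
async function processS3Object(s3Key: string): Promise<string> {
  return `processed:${s3Key}`;
}

export default async function processor(job: { data: FileJobData }) {
  const { userId, s3Key } = job.data;
  const result = await processS3Object(s3Key);
  // The return value is stored by BullMQ as the job's returnvalue.
  return { userId, result };
}
```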
// worker.ts
import { Worker } from 'bullmq';
import path from 'path';
const worker = new Worker('file-processing', path.join(__dirname, 'processor.js'), {
  connection: { host: 'your-elasticache-endpoint', port: 6379 },
  concurrency: 5, // Process 5 files in parallel per container
  useWorkerThreads: true, // Run the sandboxed processor in a worker thread instead of a child process
});

Why this specific AWS stack?
Amazon S3: It is the industry standard for "infinitely" scalable storage. We use S3 Lifecycle policies to move old processed files to Glacier to save costs.
Amazon ElastiCache (Redis): BullMQ relies on Redis features for atomicity. ElastiCache is the most stable way to run this in AWS. I recommend setting maxmemory-policy to noeviction so Redis never silently evicts job data when memory fills up; BullMQ expects this setting, and any other eviction policy can lose jobs.
AWS ECS Fargate: For the workers, I prefer ECS Fargate. It allows for long-running processes (necessary for big file processing) without managing servers.
Implementation: The producer pattern
In my experience, the most common mistake is passing the whole file object in the queue. Don't do that. Only pass the S3 Key.
import { Queue } from 'bullmq';
import IORedis from 'ioredis';
const connection = new IORedis(process.env.REDIS_URL, {
  maxRetriesPerRequest: null, // BullMQ requires this so commands aren't dropped during reconnects
});
const processingQueue = new Queue('file-processing', { connection });
async function queueProcessingTask(userId: string, s3Key: string) {
  await processingQueue.add('process-large-file', {
    userId,
    s3Key,
    timestamp: Date.now(),
  }, {
    attempts: 5, // Retry up to 5 times before marking the job as failed
    backoff: {
      type: 'exponential',
      delay: 2000, // 2s, 4s, 8s, ... between retries
    },
  });
}

The heuristic: When to use BullMQ vs SQS?
I often get asked why I don't just use AWS SQS. It is a great service, but BullMQ has a specific advantage for complex apps.
The Rule: Use SQS if you have a simple "fire and forget" message. Use BullMQ if you need complex features like job priorities, parent-child dependencies (Flows), or a real-time dashboard to see exactly what is stuck.
Final thoughts on the dashboard
To show the user a dashboard of their "pending" tasks, we don't query BullMQ directly every time. When a worker finishes, it updates a status in our main database (PostgreSQL/RDS). The front-end then polls that record or receives a WebSocket update.
This separation of concerns ensures that even if we are processing 10,000 files a minute, the user's dashboard remains lightning fast.
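The pattern above can be sketched with a tiny status store. Here an in-memory Map stands in for the PostgreSQL table, and the function names (`markQueued`, `markComplete`, etc.) are hypothetical:

```typescript
// Hypothetical status store: in production this is a PostgreSQL/RDS table,
// written by the producer and workers, and read by the dashboard API.
type JobStatus = 'queued' | 'processing' | 'complete' | 'failed';

const statusStore = new Map<string, JobStatus>();

// Called by the API when the job is enqueued.
function markQueued(s3Key: string): void {
  statusStore.set(s3Key, 'queued');
}

// Called by the worker as it picks up a job.
function markProcessing(s3Key: string): void {
  statusStore.set(s3Key, 'processing');
}

// Called by the worker when the job finishes (or exhausts its retries).
function markComplete(s3Key: string, ok: boolean): void {
  statusStore.set(s3Key, ok ? 'complete' : 'failed');
}

// Called by the dashboard API; never touches Redis or BullMQ.
function getStatus(s3Key: string): JobStatus {
  return statusStore.get(s3Key) ?? 'queued';
}
```

Because the dashboard reads only this table, queue pressure never slows the UI.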
