Recently I was working on a task that involved processing millions of records. I had a module whose job was to periodically read user info from the database and push it to a queue system.

It was a Spring Boot project backed by MongoDB, with RabbitMQ as the queue system.

Basic Approach

I didn’t think about the big picture and just implemented the fastest, simplest approach: a scheduler that periodically reads user info from the database and pushes it to the queue.

.....

@Scheduled(cron = "cron expression")
public void doWork() {
    // fetch user info from MongoDB and publish it to RabbitMQ
}
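
Fleshed out, the scheduler looked roughly like this. UserRepository (a Spring Data MongoDB repository), the RabbitTemplate, and the exchange/routing key names below are my own placeholders, not the exact code from the project:

import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class UserPublishScheduler {

    private final UserRepository userRepository;   // assumed Spring Data MongoDB repository
    private final RabbitTemplate rabbitTemplate;   // Spring AMQP template used to publish

    public UserPublishScheduler(UserRepository userRepository, RabbitTemplate rabbitTemplate) {
        this.userRepository = userRepository;
        this.rabbitTemplate = rabbitTemplate;
    }

    @Scheduled(cron = "${user.publish.cron}")
    public void doWork() {
        // read every user from MongoDB and publish each one to the queue, one at a time
        for (User user : userRepository.findAll()) {
            rabbitTemplate.convertAndSend("user-exchange", "user.info", user);
        }
    }
}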

Problem with this approach

Assume there are 1 million users and the scheduler pushes 10 users’ info per second to the queue. That is 1,000,000 / 10 = 100,000 seconds, so it would take roughly 27–28 hours to push all the user info.

Enhanced Approach

Multithreading can improve the throughput of this solution. I used an ExecutorService to get a thread pool, and benchmarked to find an appropriate thread count.
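
A sketch of how the work can be fanned out over a fixed thread pool; the batch size, repository and template names here are illustrative assumptions rather than the exact code from the project:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class ParallelUserPublisher {

    private static final int THREAD_COUNT = 450;   // picked from the benchmark below
    private static final int BATCH_SIZE = 1_000;   // illustrative batch size

    private final UserRepository userRepository;
    private final RabbitTemplate rabbitTemplate;
    private final ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);

    public ParallelUserPublisher(UserRepository userRepository, RabbitTemplate rabbitTemplate) {
        this.userRepository = userRepository;
        this.rabbitTemplate = rabbitTemplate;
    }

    @Scheduled(cron = "${user.publish.cron}")
    public void doWork() {
        List<User> users = userRepository.findAll();
        List<Callable<Void>> tasks = new ArrayList<>();

        // split the users into batches; each batch is published by one pool thread
        for (int from = 0; from < users.size(); from += BATCH_SIZE) {
            List<User> batch = users.subList(from, Math.min(from + BATCH_SIZE, users.size()));
            tasks.add(() -> {
                for (User user : batch) {
                    rabbitTemplate.convertAndSend("user-exchange", "user.info", user);
                }
                return null;
            });
        }

        try {
            executor.invokeAll(tasks);   // blocks until every batch has been published
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}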

Benchmarking

RAM : 8 GB

No of records : 50,000

Threads	       Run 1	             Run 2	           Run 3
10	       00:00:20.90	     00:00:20.51	   00:00:19.67
20	       00:00:18.10	     00:00:14.79	   00:00:17.93
30	       00:00:18.94	     00:00:17.23	   00:00:18.92
40	       00:00:19.61	     00:00:17.16	   00:00:16.40
50	       00:00:18.91	     00:00:17.99	   00:00:16.13
70	       00:00:20.68	     00:00:16.49	   00:00:17.25
100	       00:00:18.59	     00:00:18.66	   00:00:14.74
150	       00:00:14.32	     00:00:18.12	   00:00:15.28
200	       00:00:13.91	     00:00:13.32	   00:00:12.80
250	       00:00:14.47	     00:00:13.86	   00:00:16.62
300	       00:00:13.40	     00:00:11.56	   00:00:11.99
350	       00:00:13.31	     00:00:10.19	   00:00:12.67
400	       00:00:13.83	     00:00:21.25	   00:00:13.50
450	       00:00:10.84	     00:00:10.65	   00:00:09.05
500	       00:00:15.29	     00:00:10.94	   00:00:13.43

Times are in hh:mm:ss.SS format.

After this I chose 450 threads, as it looked more promising than the other counts. With that pool, the earlier task of pushing 1 million users’ info took around 3 minutes.

Problem with this approach

If the user base grows exponentially, a single node will no longer be able to keep up. And if I shorten the periodic cycle, runs could start overlapping and collide on the same data.

Final Approach

So I decided to run multiple nodes of this task. Each node processes its own chunk of records as fast as possible, and if the user base grows, more nodes can be added to keep the same throughput.

For example:

If one node takes ~10 seconds to process 50,000 user records, then the number of nodes needed to keep that throughput is:

No. of nodes required = (No. of records) / (Records per node)

In this case that is ~20 nodes, which together process 1 million records in about 10 seconds. With fewer nodes, ~10 nodes take around 20 seconds and ~5 nodes around 40–50 seconds.
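
The same formula as a small helper, using ceiling division so a partial chunk still gets its own node (the method name is my own):

// number of nodes needed so that each node handles at most recordsPerNode records
static long requiredNodes(long totalRecords, long recordsPerNode) {
    return (totalRecords + recordsPerNode - 1) / recordsPerNode;   // ceiling division
}

// requiredNodes(1_000_000, 50_000)  -> 20 nodes, ~10 seconds in total
// requiredNodes(1_000_000, 100_000) -> 10 nodes, ~20 seconds
// requiredNodes(1_000_000, 200_000) ->  5 nodes, ~40 seconds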

Note: Every node reads its own chunk of data, and chunks must not overlap. For example, if one node is processing records 0 to 50,000, no other node should process those same records; another node can process records 50,001 to 100,000.
How to partition the data among nodes is something I will be sharing in an upcoming post.
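
Until then, purely as an illustration of what non-overlapping chunks could look like, each node could read its own page of the user collection, for example with Spring Data paging (the node index, page size, and sort key are assumptions):

// node 0 reads records 0–49,999, node 1 reads 50,000–99,999, and so on;
// sorting on a stable key keeps the pages from shifting between reads
Page<User> chunk = userRepository.findAll(PageRequest.of(nodeIndex, 50_000, Sort.by("_id")));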
