I built an ETL pipeline to process terabytes of data. To achieve that goal, I set up a Spark cluster (Scala) and a MinIO server for object data storage.
I can process and save 200 gigabytes in roughly 30 minutes using 10 virtual machines for Spark processing.
The issue is that I am not able to scale that processing: if I double the number of Spark virtual machines, the processing time does not change.
I need a Data Architect who has enough expertise to help me identify the bottleneck and fix the issue.
ARCHITECTURE SUMMARY.
• I use virtual machines set up on-premises using VMware ESXi 6
• Physical machines (which host VMs) are on a 1 GB network.
• There is no overcommitment of vCPU or RAM
• Spark VMs: 16 vCPU, 64 GB RAM
• MinIO (storage): 16 vCPU, 64 GB RAM, configured using RAID 0
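As a rough sanity check on the figures above, the average data rate implied by 200 gigabytes in 30 minutes can be compared with the network link. This is a minimal sketch in plain Scala, not a diagnosis: it assumes "1 GB network" means a 1 Gbit/s link, it counts the 200 GB as crossing the network once, and the names `ThroughputCheck` and `gbitPerSecond` are made up for illustration.

```scala
object ThroughputCheck {
  /** Average throughput in gigabits per second for `gigabytes` moved in `minutes`. */
  def gbitPerSecond(gigabytes: Double, minutes: Double): Double =
    gigabytes * 8.0 / (minutes * 60.0)

  def main(args: Array[String]): Unit = {
    // Figures from the description: 200 GB processed and saved in about 30 minutes.
    val observed = gbitPerSecond(200.0, 30.0)

    // ASSUMPTION: "1 GB network" is read here as a 1 Gbit/s link.
    val linkGbit = 1.0

    println(f"observed average: $observed%.2f Gbit/s")
    println(f"share of one link: ${observed / linkGbit * 100}%.0f%%")
  }
}
```

Under those assumptions the average works out to roughly 0.89 Gbit/s. If the single MinIO server sits behind one such link, adding Spark VMs would not be expected to shorten the run, but that would need to be confirmed by measuring network utilisation on the MinIO host.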
SOME DETAILS ABOUT DATA PROCESSING
The process is straightforward:
• Read data from 2 sources on MinIO,
• Union the data from the two sources,
• Filter out the rows with empty values in one column of the resulting dataset,
• Apply 2 groupBy operations on that column (we save the intermediate values after the first groupBy),
• Union the dataset obtained after the groupBy operations with the rows whose value in that column is empty,
• Save the whole result back to MinIO
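The steps above could look roughly like the following in Spark (Scala). This is only a sketch under stated assumptions, not the actual job: the column names `key`, `sub` and `value`, the `sum` aggregation, the Parquet format and the `s3a://example-bucket/...` paths are all hypothetical placeholders.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

object PipelineSketch {
  final case class Result(intermediate: DataFrame, output: DataFrame)

  // ASSUMPTION: both sources share the hypothetical columns "key", "sub", "value".
  def transform(sourceA: DataFrame, sourceB: DataFrame): Result = {
    // Union the two sources, matching columns by name.
    val all = sourceA.unionByName(sourceB)

    // Split rows by whether the grouping column is empty.
    val nonEmpty = all.filter(col("key").isNotNull && col("key") =!= "")
    val empty    = all.filter(col("key").isNull || col("key") === "")

    // First groupBy (its result is what gets saved as the intermediate values).
    val firstGroup = nonEmpty.groupBy("key", "sub").agg(sum("value").as("value"))

    // Second groupBy on the same column.
    val secondGroup = firstGroup.groupBy("key").agg(sum("value").as("value"))

    // Union the grouped result with the rows whose grouping column is empty.
    val output = secondGroup.unionByName(empty.select("key", "value"))
    Result(firstGroup, output)
  }

  def run(spark: SparkSession): Unit = {
    // Hypothetical MinIO locations, accessed through the s3a connector.
    val a = spark.read.parquet("s3a://example-bucket/source-a/")
    val b = spark.read.parquet("s3a://example-bucket/source-b/")
    val result = transform(a, b)

    // Each write is a separate Spark action; without caching, the first
    // groupBy is evaluated once for each of the two writes.
    result.intermediate.write.mode("overwrite").parquet("s3a://example-bucket/intermediate/")
    result.output.write.mode("overwrite").parquet("s3a://example-bucket/output/")
  }
}
```

Keeping the transformation in `transform`, separate from the reads and writes in `run`, makes it possible to time the computation on a small local sample independently of MinIO I/O.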
Hi there, I am excited to share my expertise and skills in data engineering and big data, which I have acquired over the past 3 years. I am confident that I can meet your requirements. I would be delighted to work with you and I look forward to hearing more about the project if you are interested.
PS: my services are satisfaction guaranteed.
PS 2: I can communicate with you in French.
Hi there,
How are you? I have gone through your project details.
I have extensive experience with VMware, Spark, data engineering, big data and Amazon S3.
Please start a chat with me so that we can discuss your project, "CANNOT SCALE BIG DATA PROCESSING".
You can check on my profile that I have a 100% completion rate on my projects, so it would be my pleasure to build a long-term relationship with you.
All my skills are related to this particular project.
Hoping to hear from you soon. Cheers.
Rashid Amjad.
Hi Saint Denis,
I am a Data Engineer with 7+ years of experience. I would like to offer my help to fix this issue.
Please let me know if we can connect.
Hi,
I have 10 years of experience in this.
I would like to work for you, as I have already done similar tasks and supported many projects and people in the same way. I would like to hear from you.
Thank you for
Hi,
I am a data engineer with 5 years of experience. I have designed and built large-scale Spark pipelines for use cases similar to yours. Unfortunately, as you might be aware, there is no straightforward answer to your problem. The bottleneck could be anywhere.
It will require understanding the existing system and all its parts, followed by some experimentation; then we can come up with a strategy to solve the issue.