Spark QA Engineer

DataPelago

Quality Assurance
Hyderabad, Telangana, India
Posted on Jan 29, 2025

We at DataPelago are hiring for many roles! Our Universal Data Processing Engine enables businesses to fully realize the promise of analytics and GenAI while leveraging any computing engine, including open source, on any hardware and with any type of data. If you have cut your teeth in this domain, join us and be part of a world-class team setting a new standard for data processing in the era of accelerated computing.

Job Title: Spark QA Engineer

As a Spark QA Engineer, you will play a key role in ensuring the accuracy, performance, and reliability of data-driven applications and large-scale data processing systems. This role demands a solid understanding of distributed query engines such as Spark, Trino, and Presto.

Key Responsibilities:

Testing Distributed Query Engines: Design and implement test cases for Spark, Trino, and Presto engines, focusing on their unique execution engines, query planners, and optimizers.

Query Profiling and Performance Tuning: Profile and analyze query execution plans to identify bottlenecks and optimize Spark, Trino, and Presto jobs for efficiency and scalability.

Performance Benchmarking: Conduct benchmarking on distributed query engines (e.g., TPC-DS, TPC-H) to evaluate system performance under different loads and optimize resource allocation.

Data Generation for Testing: Create and maintain synthetic data generation strategies for various test scenarios, including edge cases, large-scale datasets, and TPC-DS or TPC-H benchmark data.

Automation Development: Develop and maintain automated test frameworks using PySpark, Scala, or SQL for continuous testing (a minimal PySpark sketch of such a check appears after this list).

Defect Management and Troubleshooting: Identify, document, and prioritize defects, and collaborate with developers to diagnose and resolve issues in Spark, Trino, and Presto environments.
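To ground the profiling and automation responsibilities above, here is a minimal, hypothetical PySpark sketch of the kind of check involved: it inspects a query's physical plan and asserts a result against a baseline. The fixture path, column names, and expected count are placeholder assumptions, not DataPelago specifics.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("qa-smoke-test").getOrCreate()

# Load a hypothetical Parquet dataset used as a test fixture.
orders = spark.read.parquet("/data/test_fixtures/orders")
orders.createOrReplaceTempView("orders")

# A simple aggregation query under test.
result = spark.sql("""
    SELECT o_orderstatus, COUNT(*) AS cnt
    FROM orders
    GROUP BY o_orderstatus
""")

# Inspect the physical plan to spot unexpected shuffles or full scans.
result.explain(mode="formatted")

# Compare the total row count against an expected baseline (placeholder value).
actual = result.agg({"cnt": "sum"}).collect()[0][0]
expected = 1_500_000
assert actual == expected, f"row count mismatch: {actual} != {expected}"
```

In practice, checks of this shape would live inside an automated test framework and run against each engine under test, with the same queries exercised on Spark, Trino, and Presto for cross-engine comparison.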

Requirements:

Experience: 7+ years in Quality Assurance with a focus on Big Data, distributed query engines, and large-scale data processing.

Technical Proficiency:

Proficiency in Apache Spark and experience with other query engines such as Trino and Presto.

Understanding of Spark, Trino, and Presto internals, including query planning and optimization techniques.

Experience with data generation techniques, particularly for large-scale synthetic test data and TPC benchmark data.

Familiarity with query profiling tools and methods for analyzing execution plans in Spark, Trino, and Presto.

Experience with cloud-based Spark implementations (e.g., AWS EMR, GCP Dataproc) and Infrastructure as Code (IaC) tools such as Terraform or Ansible.

Familiarity with open-source data formats such as Apache Parquet and Avro, and table formats such as Apache Hudi, Delta Lake, and Iceberg.

Programming and Scripting: Proficiency in PySpark, Scala, SQL, or Java.

Performance Testing: Background in benchmarking methodologies and tools, such as TPC-DS or TPC-H, to assess and enhance distributed query performance; a rough timing sketch follows below.
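As a rough illustration of the benchmarking and data-generation background described above, the following hypothetical PySpark sketch generates a small synthetic dataset and times a TPC-H-flavoured aggregation. The row count, columns, and query are illustrative assumptions only, not part of any DataPelago harness.

```python
import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bench-sketch").getOrCreate()

# Generate a simple synthetic fact table with spark.range plus derived columns.
lineitem = (
    spark.range(0, 10_000_000)
    .withColumn("l_quantity", (F.col("id") % 50) + 1)
    .withColumn("l_extendedprice", F.rand(seed=42) * 1000.0)
    .withColumn("l_returnflag", F.when(F.col("id") % 3 == 0, "R").otherwise("N"))
)

# Time a representative aggregation (loosely modelled on TPC-H Q1).
start = time.perf_counter()
summary = (
    lineitem.groupBy("l_returnflag")
    .agg(F.sum("l_quantity").alias("sum_qty"),
         F.avg("l_extendedprice").alias("avg_price"))
)
rows = summary.collect()  # collect() forces execution of the lazy plan
elapsed = time.perf_counter() - start
print(f"rows={len(rows)} elapsed={elapsed:.2f}s")
```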

Please connect with me or share your resume at rashmi.joshi@datapelago.com.