- Managed a 200+ node CDH 5.3.8 cluster holding 14 petabytes of data, using Cloudera Manager (CM) 5.5.1 on CentOS 6.5.
- Installed and configured Cloudera Manager to simplify management of the existing Hadoop cluster.
- Deployed a cluster in AWS.
- Enhanced and optimized the product's Spark code for aggregation, grouping, and data mining tasks.
- Performed integration testing of Hadoop packages for ingestion, transformation, and loading of large volumes of structured and unstructured data into a benchmark cube.
- Helped develop and test solutions to analyze large data sets efficiently.
- Designed several test scenarios and test cases.
- Worked with Quality Center, SQL Server, and J2EE.
- Helped create Hive tables, and loaded and analyzed data using Hive queries.
- Managed quality risks and several testing issues.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data.
- Tested and improved the performance of Hive and Pig queries.
- Implemented a proof of concept using Kafka and Storm; worked on Spark SQL and Spark Streaming.
- Wrote scripts in Redshift and shifted the project scope to Redshift because the data was relational.
- Used EMR to analyze data stored in S3.
Environment: HDFS, Hive, MapReduce, Pig, Sqoop, AWS, EMR, Redshift, Kafka, Storm, Spark, Scala, SQL.