Big Data Engineer, 03/2014 to Current, Redbox LLC, Apache Storm/Apache Pig – Chicago, IL
Python Keywords: loops, list, dictionary, tuple, list comprehension, exception handling, module, method chaining, data persistence, serialization, deserialization, REST API Services.
Personal Projects/Articles: The following articles, available via my LinkedIn profile, demonstrate my coding ability with different Apache APIs. Title: 'Total Pig Latin'. Key contents: Java UDFs, Python processing logic, Twitter's Elephant Bird library for JsonLoader, HBaseStorage, HCatalog, and the Algebraic/Accumulator interfaces (a minimal sketch of the Algebraic pattern appears below).
Title: 'Hortonworks_Tech_Exercise_Solution'. This article consists of Pig Latin scripts, Java UDFs, Apache Storm topology code, Hive code, and Python data validation logic that I wrote while working with the Hortonworks team on a short-term engagement.
Title: 'How To Beat The Tricks During Oracle Certified Associate Java SE8 Programmer I: Exam 1Z0-808'. This article walks through code snippets designed to trick test takers on the Oracle certification exam.
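The Algebraic pattern referenced above splits a Pig UDF into Initial, Intermed, and Final stages so Pig can use the MapReduce combiner for partial aggregation. The sketch below is a minimal, hypothetical COUNT-style UDF illustrating that structure; the package and class names are illustrative and not taken from the article.

```java
package com.example.pig;

import java.io.IOException;
import org.apache.pig.Algebraic;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class AlgebraicCount extends EvalFunc<Long> implements Algebraic {

    private static final TupleFactory FACTORY = TupleFactory.getInstance();

    // Plain (non-combined) evaluation path.
    @Override
    public Long exec(Tuple input) throws IOException {
        DataBag bag = (DataBag) input.get(0);
        return bag.size();
    }

    // Algebraic contract: Pig instantiates the three stages by class name.
    @Override public String getInitial()  { return Initial.class.getName(); }
    @Override public String getIntermed() { return Intermed.class.getName(); }
    @Override public String getFinal()    { return Final.class.getName(); }

    public static class Initial extends EvalFunc<Tuple> {
        @Override
        public Tuple exec(Tuple input) throws IOException {
            DataBag bag = (DataBag) input.get(0);
            return FACTORY.newTuple(bag.size());   // partial count on the map side
        }
    }

    public static class Intermed extends EvalFunc<Tuple> {
        @Override
        public Tuple exec(Tuple input) throws IOException {
            return FACTORY.newTuple(sum(input));   // combine partial counts
        }
    }

    public static class Final extends EvalFunc<Long> {
        @Override
        public Long exec(Tuple input) throws IOException {
            return sum(input);                     // final total on the reduce side
        }
    }

    private static long sum(Tuple input) throws IOException {
        DataBag partials = (DataBag) input.get(0);
        long total = 0;
        for (Tuple t : partials) {
            total += (Long) t.get(0);
        }
        return total;
    }
}
```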
The project consisted of two phases of processing: an initial big data batch-processing phase built with Pig Latin, followed by a real-time stream-processing framework built on Apache Storm.
The business process was movie streaming via Redbox Instant.
The initial phase of the project was to create a Hadoop-based batch-processing framework in Pig Latin involving HDFS and RDBMS sources.
Source files were batched from Cisco-based access point devices located across the US and Canada.
Key metrics populated in HDFS included 'engagement duration' for streamed movies, performance, monitoring, and popularity measures.
The Hadoop cluster grew from 12 to 18 nodes, moving from version 1.x to 2.x.
The platform was Hortonworks HDP, with Apache Oozie as the job scheduler.
The project later expanded to include a real-time streaming framework built on Apache Storm (v0.8 to v0.10).
As the business expanded, data centers were located across Europe (UK, Sweden), Canada, the US, and South America.
Avro logs were published to small Kafka clusters (6 nodes) in the regional data centers and then forwarded to a large central Kafka cluster (20 nodes).
Key use cases and campaigns were covered by four Storm topologies: trial-customer monitoring/location, movie trending/recommendation, and performance metrics.
A 22-node Storm cluster based in San Jose, CA was set up, processing roughly 10,000 to 13,000 tuples per second across all topologies.
Tasks Accomplished: Wrote Pig Latin scripts to process batch data.
Implemented the Algebraic interface within a Java UDF.
Used EXPLAIN operator to diagnose MapReduce plan comprehensively.
Assessed MapReduce parallelism, partitioner functions, script optimization, and performance across all environments.
Used HCatalog to interact with existing Apache Hive MetaStore via Pig Latin.
Used Twitter's Elephant Bird library to load nested JSON data via JsonLoader.
Used HBaseStorage function to interact with Apache HBase data using Pig Latin.
Assessed Apache Oozie jobs for action nodes, decision nodes, and their directed acyclic graphs (DAGs).
Designed workflow.xml and coordinator.xml for an Apache Oozie application running MapReduce and Pig jobs on the Oozie server.
Assessed existing Kafka topics populated by Avro log parsers written in Python and Java.
Studied Apache ZooKeeper coordination between Storm Supervisors and Nimbus in terms of heartbeat transfer performance.
Designed the KafkaSpout configuration with appropriate broker hosts objects, Kafka configuration elements, partition/offset information, and topic-level settings (a wiring sketch follows this task list).
Wrote a PlaybackInstance filter bolt to drop invalid heartbeat records.
Designed a metadata-adder bolt that issues a GET request to enrich each incoming tuple with metadata.
Implemented async and batch RPC (Remote Procedure Call) with separate bolt and network threads, so the network thread schedules the metadata GET requests and a callback places the enriched record back on the queue.
Assessed the feasibility of Redis and Memcached for In-Memory processing.
Wrote a rule-formulation bolt to segregate tuples based on rules and route them to different tasks of the subscriber bolt via fields grouping.
Used an LRU cache to bound memory usage and reduce JVM garbage-collection pressure.
Wrote a privacy bolt to issue GET requests for user-preference data, modeled its async RPC after the metadata-adder bolt, and heap-cached selected tuples.
Implemented HDFS bolt to transfer messages to HDFS.
Used Apache Maven (as well as direct classpath imports) to build the Storm project, packaging external dependencies with maven-shade-plugin.
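A minimal wiring sketch of the KafkaSpout and filter-bolt pattern described in the tasks above, assuming the Storm 0.9/0.10-era APIs (backtype.storm and storm.kafka packages); topic names, ZooKeeper hosts, and class names are illustrative placeholders rather than the production code.

```java
package com.example.storm;

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class HeartbeatTopology {

    // Drops records that do not look like valid heartbeats,
    // in the spirit of the PlaybackInstance filter bolt.
    public static class HeartbeatFilterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String record = tuple.getString(0);
            if (record != null && record.contains("\"heartbeat\"")) {
                collector.emit(new Values(record));
            }
            // invalid records are simply not emitted; BaseBasicBolt acks automatically
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("record"));
        }
    }

    public static void main(String[] args) throws Exception {
        // KafkaSpout configuration: ZooKeeper-backed broker discovery,
        // topic name, ZK root for offsets, and a consumer id.
        ZkHosts hosts = new ZkHosts("zk1.example.com:2181");
        SpoutConfig spoutConfig =
                new SpoutConfig(hosts, "playback-heartbeats", "/kafka-offsets", "heartbeat-spout");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 4);
        builder.setBolt("filter-bolt", new HeartbeatFilterBolt(), 8)
               .shuffleGrouping("kafka-spout");
        // downstream bolts (metadata adder, rule formulation, privacy, HDFS sink)
        // would be attached here, e.g. with fieldsGrouping on a rule key

        Config conf = new Config();
        conf.setNumWorkers(4);
        StormSubmitter.submitTopology("heartbeat-topology", conf, builder.createTopology());
    }
}
```

BaseBasicBolt is used here so that acking happens automatically; the real topologies' enrichment and sink bolts would hang off the filter bolt in the same way.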
Analytics Engineer, 01/2012 to 02/2014, Ernst & Young, Hadoop/Tibco Spotfire – Alpharetta, GA
The analytics project involved financial data relevant to auditing and accounting.
The back-end framework was a Hortonworks-based platform comprising Hadoop batch processing of big data.
The major languages were Pig Latin and Apache Hive.
The analytics platform was Tibco Spotfire.
The key metrics involved financial data entities.
HCatalog was used to communicate between Hive and Pig.
Other sources included SQL Server relational data plus various file formats arriving from external clients.
Talend was used as the data integration tool.
Agile methodology was the backbone of the SDLC.
Tibco Spotfire connected to a single source - Apache Hive Metastore.
HiveQL-based Information Links were the primary data-loading mechanism.
Spotfire's in-memory technology was heavily utilized, with both manual and automatic caching and loading.
The development and testing Hadoop cluster comprised 10 nodes.
The server environment was also clustered.
Tibco Spotfire servers were fronted by load balancers.
The Hadoop cluster was administered by Hortonworks, while the Tibco Spotfire platform was set up by the Tibco and Infomatix vendors.
Tasks Accomplished: Used Pig Latin to write MapReduce scripts sourcing data via JsonStorage (from the Elephant Bird packages), PigStorage, HCatLoader, and SequenceFiles.
Sourced Hive metastore data via Pig Latin using HCatalog and sank the results for consumption by the Tibco Spotfire analytics platform.
Used Hive Java UDFs to convert data values between integers, strings, and other primitive types (a sample UDF is sketched after this task list).
Wrote HiveQL statements to load data into the Spotfire in-memory engine and designed on-demand data-handling patterns.
Used Pig Latin diagnosis tools to evaluate logical, physical and MapReduce plans.
Optimized Pig Latin scripts to produce performance-oriented scripts.
Analyzed log files to diagnose errors involving UDF loading, data validation and accuracy.
Used Pig Latin operators, Load/Store functions and Built-in functions to process data for aggregation purposes.
Supported software testers in testing batch-processing code on a single-node Hortonworks cluster.
Analyzed the efficiency of various file formats - JSON, Text, Sequence Files, ORC, and Avro.
Used UNIX commands to operate on files and derive statistics.
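A minimal sketch of the kind of Hive Java UDF mentioned above (string-to-integer conversion), using the classic org.apache.hadoop.hive.ql.exec.UDF base class; the class name and package are hypothetical.

```java
package com.example.hive;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public final class StringToInt extends UDF {
    // Converts a string column to an integer, returning NULL on bad input
    // instead of failing the whole query.
    public IntWritable evaluate(Text value) {
        if (value == null) {
            return null;
        }
        try {
            return new IntWritable(Integer.parseInt(value.toString().trim()));
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

In HiveQL, such a UDF would be registered with ADD JAR and CREATE TEMPORARY FUNCTION before being applied in SELECT statements.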
Analytics Engineer, 07/2010 to 12/2011, L'Oreal Corporate, Python/Java/Hadoop – Berkeley Heights, NJ
The inventory management project focused on creating a well-organized data storage system.
The initial plan comprised cloud storage via AWS.
NoSQL storage via HBase was assessed for transactional record keeping.
Pig Latin was the key MapReduce language, processing files in JSON and ORC formats for better performance.
A RESTful services architecture was being implemented for web requests returning JSON and XML objects (a request sketch follows this project overview).
The key metrics analyzed included production-related metrics and inventory-management measures.
Dimensional attributes were mainly location- and time-based.
Incoming client files arrived over HTTP and FTP in various formats.
Pig Latin formed the centralized ETL pipeline, while HBase and AWS were the planned storage systems.
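For illustration, a minimal sketch of the REST pull pattern described above; it is written in Java for consistency with the other sketches here (the actual scripts, per the task list below, were in Python), and the endpoint URL is a placeholder.

```java
package com.example.rest;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class MetricsPuller {
    public static void main(String[] args) throws Exception {
        // Issue a GET against a placeholder metrics endpoint and read the JSON body.
        URL url = new URL("https://inventory.example.com/api/metrics?format=json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");

        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line);
            }
        }
        System.out.println(conn.getResponseCode() + " " + body);
        // the JSON/XML payload would be parsed and the key metrics extracted here
    }
}
```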
Tasks Accomplished: Assessed the Amazon Web Services (AWS) S3 object storage system with regard to its storage classes, including S3 Standard, S3 Standard-IA, and Amazon Glacier, and wrote a specification for automatic migration of files across these classes.
Assessed file security policies within the AWS S3 system covering access control lists (ACLs), bucket policies, and query string authentication.
Remediated the Tibco Spotfire visualization platform by automating visuals through extensive Python scripting over the .NET-based Spotfire API.
Wrote a Java UDF for Pig Latin to convert unstructured string literals in the files into structured string types (sketched after this task list).
Accessed data in Hive Metastore via HCatalog and HBase via HBaseStorage function using Pig Latin.
Assessed the NoSQL data model design with HBase.
Used Apache Maven to package Hadoop JARs, ensuring availability of all dependencies and plugins.
Used a Python script to send web requests pulling JSON and XML objects for a few key metrics via RESTful services.
Assessed the feasibility of the columnar ORC file format, in addition to the JSON-schema-based Avro design, for better performance.
Used the Elephant Bird library (built from its Git source) to load JSON data via Twitter's JsonLoader function.
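A minimal sketch of the string-normalizing Pig UDF described above, assuming a simple EvalFunc<String> that strips stray quotes and collapses whitespace; names are hypothetical.

```java
package com.example.pig;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class NormalizeString extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String raw = input.get(0).toString();
        // strip surrounding quotes, collapse internal whitespace, trim the ends
        return raw.replaceAll("^[\"']+|[\"']+$", "")
                  .replaceAll("\\s+", " ")
                  .trim();
    }
}
```

In a Pig Latin script, the JAR would be registered with REGISTER and the function applied inside a FOREACH ... GENERATE.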
Data Analytics Engineer, 02/2009 to 06/2010, T. Rowe Price – Owings Mills, MD
Project Description: The Asset Management project comprised a centralized data warehouse based on Oracle and DB2.
Oracle hosted the final star-schema data warehouse, while the initial staging and ODS layers were on DB2.
The financial data were processed at the transactional level, with derivation of key metrics such as monthly position balance.
Business Objects was the key reporting platform with Web Intelligence and Desktop Intelligence in practice.
Tasks Accomplished: Monitored weekly AutoSys Jobs in Dev, Qual and PROD regions.
Created Sprint Logs for Agile mode of work.
Created job files (in XML) to automate the reporting tasks.
Created Technical Specifications for modification of Informatica mappings.
Created Technical specification to list data profile including constraints, indexes, views, sequences, synonyms, cardinality/optionality, measures (metrics), dimensions, referential integrity, aggregations, calculations, transformations, statistical modeling in order to pre-design data load from source to target.
Monitored Informatica session logs in order to assess and rectify data cycle errors to address data quality issues.
Assessed existing data marts and designed bridge tables to connect different data tables.
Modified Source Qualifier queries in Informatica mappings.
Used analytic and statistical functions in SQL to analyze and process data at the aggregate level, including roll-ups, cubes, and grouping sets (a sample query is sketched after this task list).
Used DML and DDL statements via SQL to manipulate and define data sets.
Used joins, subqueries, and set operators via SQL to cross-analyze data across multiple tables.
Used Substitute Functions, Character Functions, Numeric Functions, Conversion Functions, and Date-Time Functions, Pivots, Hierarchies, Aggregation via SQL to modify data column values.
Used SQL data control (security) and transaction control statements.
Wrote test strategies and test cases to assess Informatica code changes, including initial code debugging.
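A minimal sketch of the kind of roll-up aggregation mentioned above, run through JDBC against the Oracle warehouse; the table, columns, and connection details are placeholders.

```java
package com.example.sql;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RollupReport {
    public static void main(String[] args) throws Exception {
        // ROLLUP produces per-(region, account_type) totals, per-region subtotals,
        // and a grand total in a single pass.
        String sql =
            "SELECT region, account_type, SUM(position_balance) AS total_balance " +
            "FROM monthly_position " +
            "GROUP BY ROLLUP (region, account_type)";

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//dbhost:1521/DWH", "report_user", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("%s | %s | %s%n",
                        rs.getString("region"),
                        rs.getString("account_type"),
                        rs.getBigDecimal("total_balance"));
            }
        }
    }
}
```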
Certifications:
Programming For Everybody Specialization - University of Michigan (Link)
Course I: Getting Started With Python (Link)
Course II: Python Data Structures (Link)
Course III: Using Python To Access Web Data (Link)
Course IV: Using Databases With Python (Link)
Course V: Capstone: Retrieving, Processing and Visualizing Data with Python (Link)
IBM Explorer Badges (Big Data University):
Big Data Spark Foundations (Link)
Big Data Programming (Link)
Introduction to Solr (certificate available in hard copy format)
Big Data Administration (Link)
MBA: Information Technology, 2015, Goldey-Beacom College
Master of Science: 2010