Large Scale Data Analytics Driven Systems Design

Every senior software engineer, module lead, team lead and project manager needs in-depth conceptual knowledge and the skills to handle future applications that will generate very large volumes of data. Knowledge of data alone is not sufficient; one must also understand the end-to-end hardware and software stack on which the data resides and moves.

The mantra is – Scale, Speed & Security.

This course enables participants to independently conceptualize, design and implement the internet-scale applications of the future.

Course Director

Prof Kamakoti Veezhinathan,
Course Coordinator, IITM Pravartak Technologies Foundation,
Director, IIT Madras.

Course Brief

2 Semester Course

300 hours of sessions

5 hours of training on each training day

Sat and Sun, alternate weekends

Online classes (after the Covid-19 pandemic, in-person classes at the IITM premises for those willing to attend in Chennai)

80% attendance is mandatory

Basic understanding of one programming language and one database required

2 ungraded assignments, 2 graded assignments and 2 exams

Fee ₹ 1.5 Lacs + GST

1. Introduction to Data Science

Concepts of business intelligence and business analytics
Simple data retrieval vs data processing
Role of data scientist vs data analysts
Hidden facts in data, unearthing the facts
Evolution of data science, popular techniques & algorithms
Concepts of predictive and prescriptive analytics
Representing the data in a common model
Concepts of data clusters, data distribution, time series data
Text processing - textalizer, sentiment analyzer, tokenizer

2. Introduction to Data Management

Identifying the OLTP data to be used for analytics (Core banking, ATMs, Net banking, loans, cards etc.)
Collection of TV viewership data, collection of router data, Call Detail Records
News feeds, sports data in real time, etc.
Sizing the transactions in terms of number of records and size
Ingestion of data to data lake (structured and unstructured data from subsystems)
ETL process, initial data load and incremental data load
Abstraction of data, use of metadata
Typical architecture of a BA solution for enterprise
Hardware and software required for managing large data
Denormalization, regular integrity checks, moving to Big Data and NoSQL databases
Scaling out through hardware/clusters
Introduction to the Postgres database – create database, tables, indexes (to bring participants without a development background up to a level playing field)
Postgres – select, insert, update, delete statements, conditions, groupings, sorting
Postgres – joins, aggregate functions, load from file, dump to file
Basic usage of ClickHouse, a fast, new-generation column-oriented database for big data

3. Tools and Techniques for Data Visualization and Communication

Getting data into a data mart for faster visuals (connect to Metabase and sync)
Plotting data based on classifications/ buckets, time, trending
Drill downs to provide more details
Tableau – basic visuals, filters, groups, aggregates
Tableau – drill downs, dashboards
Tableau – maps, sheets
Tableau – security controls

4. Introduction to Business Metrics

Need for business metrics (for different roles of people in enterprise)
Real time metrics, near real time metrics, offline metrics – enabling spot decisions and approvals
Decision enabling and decision making based on data
Analysis of past data of account holders, pattern of the transactions, volume of transactions, linking account holder-relatives-business
Samples of 100+ business metrics for any specific domain
How to arrive at business metrics from raw master and transaction data, exploratory data analysis
Aggregate metrics - count of transactions, sum of transactions, distribution of transactions based on types
How to arrive at a set of business metrics for a specific domain, design the metrics, implement the metrics, getting feedback
Security data analytics, Security event logs and critical data
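
As a flavour of the "aggregate metrics" topic above, here is a minimal Python sketch that derives a count, a sum and a per-type distribution from raw transaction records. The sample rows, the field layout and the `aggregate_metrics` helper are illustrative assumptions, not course material; in practice the rows would come from a database query.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: (account_id, transaction_type, amount)
transactions = [
    ("A001", "ATM", 2000.0),
    ("A001", "NET_BANKING", 15000.0),
    ("A002", "ATM", 500.0),
    ("A003", "CARD", 1200.0),
    ("A002", "CARD", 800.0),
]


def aggregate_metrics(rows):
    """Return count, total amount and per-type distribution of transactions."""
    count = len(rows)
    total_amount = sum(amount for _, _, amount in rows)

    # Number of transactions per type
    count_by_type = Counter(tx_type for _, tx_type, _ in rows)

    # Sum of amounts per type
    amount_by_type = defaultdict(float)
    for _, tx_type, amount in rows:
        amount_by_type[tx_type] += amount

    # Share of all transactions that falls into each type
    share_by_type = (
        {t: n / count for t, n in count_by_type.items()} if count else {}
    )

    return {
        "count": count,
        "total_amount": total_amount,
        "count_by_type": dict(count_by_type),
        "amount_by_type": dict(amount_by_type),
        "share_by_type": share_by_type,
    }


metrics = aggregate_metrics(transactions)
for name, value in metrics.items():
    print(name, value)
```

The same aggregates can usually be expressed as a SQL `GROUP BY` query; computing them in Python is just one option for small extracts.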

5. Introduction to Applied Business Statistics

Customer-based metrics, employee-based metrics, brand/service-based metrics, geography-based metrics
Patterns of new account openings, account closures, payments, defaults, churn
Customer profiling - age buckets, profession buckets etc., gradation of customers using metrics
Campaign-based metrics on products, measuring campaign success, measuring new-sale vs upsell success
Employee training metrics, customer education metrics
Metrics on technology improvement success
Anonymizing data by two-level translation and encryption
Customer feedback analytics
Trend analysis - transactions, customers, products etc.
Internet of Things (IoT)/Sensors metrics – Pollution data, plant sensor data
Edge processing metrics
Call Detail Records in telecom industry

6. Python Programming

Comments, variables
Operations – numeric, string, logical
Conditions, loops, Lists, Tuples, dictionaries, sets
File handling – read, write
Date operations, Exceptions
Database operations – fetching data for analytics only (no insert/update/delete)
Functions, parameters
Basic NumPy operations – mean, median, mode, standard deviation, plotting, random numbers
Using other third-party libraries, e.g. NumPy, Matplotlib
Linear algebra functions – matrices, determinants, inverses
Logic building tips
Plotting – pie, bar, scatter, line, multi charts

7. Statistics Using Python

Preparing and loading data
Identifying the influencing data and resultant data
Identifying specific packages to run on core bank transaction data, CDR data, payment data etc.
Examples of exploratory analysis - numerical summary, rule, plots, multivariate
Examples of standard mathematical analysis - mean/median/percentile/variance, distribution based on frequency/Poisson/sampling
Examples of statistical analysis – descriptive, inferential, correlation, error types
Running regular statistics packages for average, 98th percentile, etc. on core transactions data
Running linear regression on ATM, TV viewership data
Population parameter estimation (concept and code)
Confidence interval estimation
Hypothesis testing
t-test and z-test
Mathematics concepts behind analytics/ML – Linear algebra, matrices, determinants, probability
Maths of linear regression (line of best fit)
Data security aspects
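
To illustrate the kind of computation this module covers (average, 98th percentile and the line of best fit), here is a minimal sketch using only the Python standard library. The daily ATM withdrawal counts are made-up sample values, and `fit_line` is a hypothetical helper written for this example; the course itself may use NumPy or other packages instead.

```python
import statistics

# Hypothetical sample: withdrawals observed at one ATM over five days
days = [1, 2, 3, 4, 5]
withdrawals = [120, 135, 150, 160, 185]

# Descriptive statistics
mean_withdrawals = statistics.mean(withdrawals)

# quantiles(n=100) returns 99 cut points; index 97 is the 98th percentile.
# method="inclusive" treats the sample as covering the full range, so the
# result always stays between the minimum and maximum observed values.
p98 = statistics.quantiles(withdrawals, n=100, method="inclusive")[97]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    # slope = covariance(x, y) / variance(x), written out as sums
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(days, withdrawals)

# Use the fitted line to project the next day's volume
forecast_day_6 = slope * 6 + intercept

print("mean:", mean_withdrawals)
print("98th percentile:", p98)
print("slope:", slope, "intercept:", intercept)
print("forecast for day 6:", forecast_day_6)
```

The slope formula here is the closed-form least squares solution for a single predictor; with more predictors one would normally switch to a matrix-based solver.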

8. Systems Design and Architecture

Evolution of system architecture over the last 3 decades
Distributed system components, layers, n-tier architecture
Micro-services
Databases, distributed databases, federated databases
System interfaces
Challenges in large scale systems
Derive design from architecture
Boundaries, Abstraction
Dependencies, Reusability
Hardware components
Cloud based architecture
Standard architectures for consumer based mobile apps, enterprise apps, SME apps, social apps, SaaS apps
Flexibility and extensibility

9. Performance Monitoring and Management

Load, stress, volume, capacity, reliability testing aspects
Load generation – 1000s of concurrent users – Apache JMeter (open source)
Performance monitoring using APM – Appedo (open source APM)
System resources – CPU, memory, disk I/O, network I/O
App server performance counters
Database performance counters
Commonly seen bottlenecks
Availability monitoring

1. Regression and Classification for Business Applications

Introduction to CART
Random Forest
Bagging and boosting
Decision tree classification
Bayesian classification
Logistic regression
Linear regression
Regularization - bias, variance, lasso, ridge and elastic net

2. Machine Learning

What cannot be done by manual analysis?
Supervised learning
Training data, test data - banking customer footprints, ATM and net access data
Applying Clustering, representation learning
Unsupervised learning
Introduction to artificial neural networks
Natural Language Processing (NLP) - text processing basics
Algorithms behind sentence recognition, extracting semantic meaning, text mining algorithms
Image processing – auto cropping/orientation, identification, tuning

3. Time Series Modeling

Concepts and examples of Curve fitting, segmentation, classification
Preparing raw data with the right unit of time
Models - autoregressive, moving average, integrated
Apply time series modeling on ATM data, viewership data, call data
Prediction and forecasting


4. Deep Learning

Setting up a TensorFlow virtual environment for Python
Creating Neural Network with Keras
Constructing Models in Keras
Employing Layers in Keras Models
Building Convolutional NN with Keras
Introduction to deep learning

5. System and Data Security

Basic security principles – authentication, authorization
Encryption – data in transit, data at rest
OWASP top 10 security aspects
Vulnerability test concepts, essential techniques
Penetration test concepts, essential techniques
Mobile app security concepts, essential techniques
Basics of Firewalls
Cloud security
Personal data protection – need to know, right to disclose
Enterprise security policies

6. Project Management

Checklist for Analytics project
Gaining Subject Matter Expertise
OKR methodology
Information gathering Techniques
Key Process Indicators for different industries
Typical task list for an analytics project
Entrepreneurship – Art of the start, market analysis, Crunchbase/Tracxn/Gartner, storytelling
Entrepreneurship – Pitch deck, investor updates, market size, product feature analysis, assumptions, funding stories
Entrepreneurship – Importance of sales, negative bottom lines of start-ups, branding before meeting VCs, repeat sale strategy
Latest trends in analytics - correlational analytics, DevSecOps analytics, automating the ML process itself

7. Project Work - 40 hours - Outside Class Hours

Ingest large sets of data into Postgres or ClickHouse
Carry out exploratory analytics
Build aggregates and data marts
Build visuals and dashboards using Tableau
Present the business metrics
Create a prediction model
