What I learnt as a Big4 AI/ML engineer

What I learnt working as an ML engineer at a Big4

In this post I will share my experience of learning ML/DL versus working as a machine learning engineer at a Big4 over the last two years.

When one starts learning a new technology, it is generally through free resources, in order to explore, tinker around, and find out whether it is actually one’s calling. This brings me to my first point.

Don’t fall for “XYZ full course in 10 Hrs”

If you are a beginner and are just getting your hands dirty in machine learning for the first time, don’t fall prey to clickbait titles on YouTube that claim to make you an expert within a week.

My first mistake during my college years was assuming that I had a good amount of knowledge and was ready to apply for internships after watching a few 4-5 hour long videos.

These videos do little more than show you the “syllabus”; they are not the content itself.

You can find ideal resources for the theory side in Stanford/MIT open lectures. YouTubers like Krish Naik and StatQuest were go-to resources for me.

This brings me to my second point.

Find a balance between Theory and Practical Projects

When you get into a field like machine learning, you will find that people from backgrounds like physics, statistics, and economics are also making a mark; it is not restricted to folks with CS majors.

The reason is the sheer breadth as well as depth of topics that fall under the wide umbrella of AI.

Talking of job interviews, it’s important to remember that your resume projects are valued only when you pass the prior rounds and/or are able to tackle questions about why and how your project showcases your skills.

Let’s jump straight to the third point.

Machine Learning/Data Science is NOT always a Jupyter Notebook

While learning through Udemy, Coursera, or YouTube courses, one gets accustomed to working in Jupyter Notebooks/JupyterLab. In production this is seldom the case. One should not restrict one’s knowledge to just loading a dataset, performing some feature engineering, and training a model.

What I have learnt during my time in the industry is that understanding the underlying infrastructure, security protocols, costs incurred, latency, and storage is equally important.

You won’t always get a “classic CSV” served to you

Quite often the problem you have to solve is tremendously open ended. Never mind not having the data; the ask might be something as vague as “How can I use AI to reduce my costs?”, and you have to start by asking yourself good questions.

It’s no longer a game of finding answers; life just played a UNO reverse card.

Cloud is your friend

Upskilling is a must. When you choose to be part of fast-moving technologies like ML or blockchain, you must realize that certain organizations are dedicated to pure innovation and research (unlike the ones building applied solutions for customers).

Hence, every now and then, there are updates, and new tech stacks and libraries are released. Personally, I feel that adding knowledge of cloud services like AWS, GCP, and Azure helps a lot. But there is one major flaw in the way most YouTube tutorials are designed. Say you look up “EC2 for beginners” on YouTube: what I have observed is that after explaining what an EC2 instance is, they leave the security group, VPC, and almost every other setting at “default”, as if they don’t matter. Yet over my last year of deploying applications on the cloud, these are the things that have confused me the most. In production, things like access restrictions and open ports matter a lot. DON’T ignore these aspects while learning about a service.

Writing clean code and understanding cost of ML based solutions

If you look at the cost of deploying ML solutions in production, you will realize that the actual game starts with monitoring your production pipeline.

Interview questions like these also test this ability:

  • How often do I need to retrain models?
  • How do I decide a threshold for my use case?
  • If model performance drops, what is to blame?
  • Since real-life scenarios won’t come with “ground truth labels”, how do I even keep track of performance?

You only realize the importance of writing clean and readable code when you have to read someone else’s code.

Read the Docs

Whatever algorithm you are using or learning, it is really important to go through the documentation and understand every hyperparameter that can be used. I have seen people (including myself) randomly tweak hyperparameters in the hope of increasing model performance.

Remember that in production this might be an expensive operation to do.

So whenever you get time, try to have a detailed understanding of the parameters.

Don’t ignore DSA

Although the classic “I have never had to invert a binary tree at my job” argument exists, it’s not OK to stay unaware of something that is so omnipresent in interviews, code optimization, and building software solutions.

This was something that I neglected during the early phase of my college years.

Even if you don’t consider yourself a DSA wizard, it’s OK to at least start with topics like basic data structures, traversal algorithms, the most popular coding interview questions, and dynamic programming (it never hurts to solve two questions a day, right?).

See where you find yourself after 5-6 months.

Learn when to not use AI

Using deep learning to add two numbers is not an act of wisdom. When problems can be solved without the use of “AI”, it’s better to do so.

Remember that AI is here to reduce our problems, even if it means taking itself out of the picture.

For interviews focus on the Basics

I have seen fresh undergrads like myself focus too much on the latest deep learning algorithms and ignore basic statistics, just to impress the interviewer with “advanced” terminology, only to fall short when asked the most basic questions. Having now been on the other side of the equation, where I witnessed first-hand how profiles are shortlisted and interviews are set up, remember: as a fresher, they only want to check that your basics are clear and that you are a quick learner.

The Generative AI trap

I have been working closely with GenAI technologies over the last 6-8 months. The gap between understandable and “black box” AI models is widening day by day; just compare the mathematical foundation required for linear regression with that of a generative model like Stable Diffusion. This is causing a sense of FOMO and confusing the folks who are trying to enter the field. Remember that you probably won’t be creating your own ChatGPT: the cost of training these huge models has handed a monopoly to the big sharks who can invest millions and afford to host them for inference 24/7 at scale. What you as a student can learn is the basic idea/math behind the algorithms and how to integrate cloud-based APIs into an existing solution.

Adding Edge AI to your arsenal

A simple yet effective way to enhance your skills in the domain of AI is to learn how edge devices like the NVIDIA Jetson work. You can solve a lot of use cases using frameworks like DeepStream. There are also plenty of benefits from a production point of view for business cases that require extremely low latency and minimal hardware.

Learn to communicate

It’s important that you are able to demystify AI for stakeholders who are not academics. You can’t talk math in a sales pitch.

Summary

It’s all about learning with a balance of theory and practical application. And things take time; one cannot become an expert overnight.

I have listed a few things I learnt from my experience. So keep your learning rate high and learn from my errors!
If you want to connect with me on LinkedIn:

https://www.linkedin.com/in/siddhartha-samant-6a52111a7/

AI for schools

How one should approach AI in schools

Recently, CBSE introduced Artificial Intelligence as part of the curriculum, and this has left administrators, students, and “industry experts” divided in opinion. What was once reserved for undergrads to explore is now being pushed as a subject for high school and pre-high school students. This article tries to cover how much compression of the syllabus is required, what the definition of an “AI course” at school should be, the desired outcomes, and what a student gains by taking up such a curriculum. With that in mind, I want to make seven major points:

Don’t make it seem Mysterious

If one is making a sci-fi movie or a catchy YouTube thumbnail, it seems fair to create an “aura” of unimaginable mystery and talking humanoids; but from an academic perspective it’s important to make students realize that it is just some math, with assumptions and flaws of its own. Any fantastic tech that you see can be boiled down to some intelligently crafted algorithm.

An Analogy with Physics

I will draw an analogy with something that people (students and teachers) are already familiar with: the way we teach the development of atomic theory in high school. We don’t expect the students to come up with theories of their own, nor do we provide mathematical proofs of every statement we make. The obvious reason is the lack of mathematical skills developed at that age. The same goes for AI; the first couple of years should not focus on “numerics”, but rather on creating an understanding of what AI tends to achieve rather than “how” it achieves it.

It’s not just about running a program

One of the biggest problems I observe when students participate in a robotics, AI, or any tech-related workshop is that the quality of the workshop is judged by whether, at the end of the day, a student has been able to build a running prototype, even if it means rushing the concepts. Don’t rush these things; it takes two minutes to download some random code and run it on your system, but it might take two days to understand how that same code works.

AI Business cases/use cases

Another important aspect students should be taught is that there are different expectations from an academic versus an engineering venture. AI as an engineering marvel is easy for young students to witness in their daily lives: the websites they visit, the online stores they buy from, and the streaming services they use. Analyzing what problems these platforms solve and why “AI” has lately become indispensable to them is one of the best ways students can develop their understanding of AI.

This method of learning will also increase awareness of the omnipresent suite of online services and build data literacy, something that is extremely important in a digital world where privacy is almost a myth.

Make students aware of scaling issues

It’s a brilliant opportunity for students to understand the problem of “scaling up” any software or business, and to see that resource optimization is a skill of its own. This will help them understand AI as a product and how data can be used to draw beautiful insights.

I hope that students make the most of this amazing opportunity where for the first time in a while they are getting a chance to deal with a subject that belongs to these past few decades and not something that is centuries old.

Where not to use AI

AI shouldn’t be forced everywhere just because it sounds cool. It has its own costs and downsides. And no, by downsides or dangers I don’t mean an invasion of humanity; I mean that AI can be stupid and unreliable at times, so it is of utmost importance to understand what is at stake whenever we ask a machine to make decisions on our behalf.

You don’t need AI to make software that adds two numbers. When you use Amazon for shopping or Google for search results, as a consumer you don’t care about the database they use or which cloud services are utilized; you just want your parcel on time. AI should be a brilliant weapon in your arsenal of problem-solving skills, but it’s not wise to cut apples with a sword.

A sad reality

While some of today’s kids can boast of a custom-assembled PC, there are still places where even basic computer literacy hasn’t fully arrived. I hope the inclusion of skill-based subjects will narrow these differences in the coming years.

College students and interested school students with a good mathematical background can check out this three-video syllabus for AI beginners.

SQL for beginners

SQL For Analytics

About the author

Aparna Mishra

Data Science Analyst

Linkedin: https://www.linkedin.com/in/aparna-mishra-a934b6212

SQL stands for Structured Query Language, often pronounced “sequel”. It is the language for communication between a user and a database. Common SQL databases include Oracle, MS SQL Server, MySQL, etc.

Sublanguages in SQL:

  • Data Definition Language (DDL)
  • Data Manipulation language (DML)
  • Data Query Language (DQL)
  • Data Control Language (DCL)
  • Transaction Control language (TCL)

Importing Data from File:

Syntax :

COPY {table_name} (col1, col2, col3, ..., colN)

FROM '{path/location of file}' DELIMITER ',' CSV HEADER;

OR

BULK INSERT Orders

FROM 'C:\Users\aparna\Downloads\orders.csv'   -- dummy file path / location

WITH

(

    FIRSTROW = 2,              -- skip the 1st row, as it is the header

    FIELDTERMINATOR = ',',     -- delimiter

    ROWTERMINATOR = '\n',      -- shifts control to the next row

    TABLOCK

);

OR

COPY {table_name} FROM '{path/location of file}' DELIMITER ',';

DDL Commands and Uses :

  • CREATE : It is used to create a table and define the columns and their data types.

CREATE TABLE { TABLE NAME }

( column1 data type,

column2 data type,

column3 data type,

... columnN data type );

Example :
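For instance, the student table used in later examples might be created as follows (the column data types here are assumptions for the sake of illustration):

CREATE TABLE student (
    student_id INT,
    first_name VARCHAR(20),
    std VARCHAR(10),
    age INT
);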

  •   ALTER : It is used to modify an existing database table.
  1. Adding a new column – ALTER TABLE {table_name} ADD {column_name} VARCHAR(20);

2. Dropping an existing column – ALTER TABLE {table_name} DROP COLUMN {column_name};

3. Changing a column's data type – ALTER TABLE {table_name} ALTER COLUMN {column_name} char(20);

  • DROP : Drops the table permanently from the database.

DROP TABLE {table_name};

  • TRUNCATE : Removes the data and not the structure of the table.

TRUNCATE TABLE {table_name} ;

DML Commands and Uses :

  • INSERT : The INSERT INTO statement is used to add new records to an existing table.

INSERT INTO { TABLE NAME }

( column1, column2, ... ) VALUES ( value1, value2, ... );

Example :

INSERT INTO student ( column1, column2 ) VALUES ( value1, value2 );

  • UPDATE :  Used to modify the existing records in a table.

UPDATE { TABLE NAME }

SET column1 = ‘value1’,

column2 =  ‘value2’,

……..

WHERE { condition } ;

Example :
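For instance, using the sample student table (the class value 'VIII' below is purely illustrative):

UPDATE student
SET std = 'VIII'
WHERE student_id = 4;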

  • DELETE :  Used to delete the existing records from a table.

DELETE FROM { TABLE NAME } WHERE { condition } ;

    DELETE FROM student WHERE student_id = 6;

DQL Commands and Uses :

  • SELECT : Used to fetch data from a database table.

    SELECT * from { TABLE NAME } WHERE { condition };

  1. SELECT student_id, first_name, std FROM student ;

2. SELECT student_id FROM student WHERE Class = 'VII' ;

3. SELECT cust_id, name, address FROM customer WHERE age > 40 ;

  • SELECT DISTINCT : The DISTINCT keyword is used with SELECT to eliminate all the duplicate records and fetch only the unique ones.

     SELECT DISTINCT { column name } FROM {TABLE NAME };

  1. SELECT DISTINCT address FROM customer ;
  2. SELECT DISTINCT first_name FROM student ;

DCL Commands and Uses :

  • GRANT : Used to provide any user access or privileges for the database.

GRANT CREATE ANY TABLE TO Username ;

GRANT DROP ANY TABLE TO Username ;

  • REVOKE :  Used to take back privileges / access for the database.

REVOKE CREATE ANY TABLE FROM Username ;

REVOKE DROP ANY TABLE FROM Username;

TCL Commands and Uses :

  • COMMIT – It is used to make the transaction permanent.

UPDATE { TABLE NAME } SET column1 = ‘value1’ WHERE { condition };

COMMIT ;

  1. UPDATE student SET Name = ‘XYZ’ WHERE Name = ‘ABC’ ;

COMMIT;

2. UPDATE orders SET order_date = '2020-09-18'

WHERE order_id = 15 ;

COMMIT;

  • ROLLBACK – It is used to restore the database to its last committed state, or to a named savepoint.

ROLLBACK TO savept_name;

  • SAVEPOINT – This command is used to save a transaction temporarily so that we can rollback to that point whenever we need to.

     SAVEPOINT savepoint_name ;

HOW TO USE COMMIT,  ROLLBACK AND SAVEPOINT ?

  1. Create a table and insert records into it.
  2. To use the TCL commands in SQL, we need to first initiate a transaction by using the BEGIN / START TRANSACTION command.

BEGIN TRANSACTION ;

  3. Updating the table and using COMMIT –

UPDATE student SET age = 14 WHERE student_id = 4 ;

  COMMIT;

Using COMMIT makes sure that the command will be saved successfully.

The output after COMMIT will give us a table with saved changes.

  4. Using ROLLBACK –

DELETE FROM orders WHERE order_id = 17;

ROLLBACK;

The ROLLBACK ensures that the changes made by the DELETE command are reverted.

5. Using SAVEPOINT – We can create as many SAVEPOINTs as we want after doing some changes to the table.

    SAVEPOINT {savepoint_name};

UPDATE orders SET amount = 12900 WHERE order_id = 15;

SAVEPOINT upd;

DELETE FROM orders WHERE order_id = 12;

SAVEPOINT del;

6. Now, if we want to revert the table to its state before the DELETE, we can ROLLBACK TO upd, the savepoint taken just before the DELETE. To undo the UPDATE as well and go all the way back to the state when the records were first inserted, we would need a savepoint created right after the inserts and roll back to that one.

ROLLBACK TO upd;

SELECT * FROM orders;

Running these commands returns the records as they were before the DELETE was executed.

Aggregation functions in SQL –

Aggregate functions can appear in the SELECT list or in the HAVING clause (not in the WHERE clause).

  • AVG( ) – Returns the average value (as a floating-point number).

SELECT AVG (column_name) FROM TABLE;

  • COUNT( ) – Returns the number of rows, as an integer value.
  1. SELECT COUNT(*) FROM TABLE;

This query counts and returns the number of records in the table.

2. SELECT COUNT(*) FROM TABLE WHERE { condition };

  • MAX( ) – Returns the maximum value.

SELECT MAX (column_name) FROM TABLE;

  • MIN( ) – Returns the minimum value.

     SELECT MIN (column_name) FROM TABLE;

  • SUM( ) – Returns the total sum.

SELECT SUM (column_name) FROM TABLE;

  • ROUND( ) – Not an aggregate function itself, but it is often wrapped around an aggregate to specify the precision after the decimal point.

SELECT ROUND(AVG ( column_name ),3 ) FROM TABLE;
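Putting a few of these together on the sample orders table from the TCL examples above (the amount column is the one used there):

SELECT COUNT(*), MIN(amount), MAX(amount), ROUND(AVG(amount), 2)
FROM orders;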

Use of ORDER BY :

It is used to sort rows based on a specific column either in ascending / descending order.

Syntax :

  1. SELECT Column1, Column2 FROM TABLE ORDER BY column1 ASC / DESC ;

2. SELECT column1, column2 FROM TABLE WHERE { condition } ORDER BY column1 DESC;
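For instance, using the sample student table (its age column appears in the TCL examples above), we can list students from oldest to youngest:

SELECT first_name, age
FROM student
ORDER BY age DESC;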

Use of LIMIT :

It allows us to limit the number of rows returned by a query (LIMIT is supported in MySQL/PostgreSQL; SQL Server uses TOP instead).

Syntax :

SELECT * FROM TABLE

ORDER BY column_name DESC/ ASC

LIMIT 10 ;
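For example, to fetch the ten largest orders from the sample orders table:

SELECT * FROM orders
ORDER BY amount DESC
LIMIT 10;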

Use of BETWEEN  Operator:

It can be used to get a range of values.

Syntax :

  1. Value BETWEEN LOW and  HIGH
  2. Value NOT BETWEEN LOW and HIGH.
  3. Value BETWEEN ‘YYYY-MM-DD’ AND ‘YYYY-MM-DD’

SELECT COUNT(*) FROM orders WHERE amount BETWEEN 800 AND 2000;

This statement will return the number of orders whose amount is between 800 and 2000 (BETWEEN is inclusive of both endpoints).

So we need to provide a range for BETWEEN clauses.

Use of IN Operator :

It checks if the value is included in the list of options provided.

Syntax :

  1. SELECT * FROM TABLE WHERE column_name IN (option1, option2, option3, ...) ;
  2. SELECT * FROM orders WHERE amount IN (200, 800, 9000) ;
  3. SELECT * FROM orders WHERE order_id NOT IN (12,13,14) ;

This way the IN / NOT IN operators can be used to fetch the required records from the table; they act as a filter, keeping only the records whose values appear in the list of options.

Use of LIKE Operator :

It is helpful in pattern matching against string data with the use of wildcard characters.

LIKE is case-sensitive in most databases (PostgreSQL, for example), although MySQL’s default collations make it behave case-insensitively.

  1. All names that begin with 'D'.

SELECT * FROM student WHERE first_name LIKE 'D%' ;

2. All names that end with 'a'.

SELECT * FROM orders WHERE first_name LIKE '%a' ;

3. All names whose second and third letters are 'on'.

SELECT * FROM student WHERE first_name LIKE '_on%' ;

4. All first names with 'ar' in them.

SELECT * FROM customer WHERE first_name LIKE '%ar%' ;

This way we can use LIKE to find / match patterns.

Use of Group By Clause:

GROUP BY allows us to aggregate rows according to some category. The GROUP BY clause must appear right after FROM or WHERE.

Syntax :

SELECT column1, column2, SUM(column3)

FROM table_name

WHERE column2 IN (option1, option2)

GROUP BY column1, column2 ;

SELECT COUNT(cust_id) , address

FROM customer

GROUP BY address;

Use of HAVING clause :

It allows us to use an aggregate result as a filter, alongside GROUP BY.

Syntax :

SELECT COUNT(cust_id), address

FROM customer

GROUP BY address

HAVING COUNT(cust_id) > 4;

Use of AS :

The AS keyword is used to assign an alias to a column or a table.

Syntax :

SELECT { column_name } AS { column_alias } FROM { table_name };
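For example, on the customer table used earlier, an alias gives the aggregated column a readable name (the alias itself is just illustrative):

SELECT COUNT(cust_id) AS customer_count
FROM customer;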

USE OF VIEW :

A view is a virtual table, not a physical one. It is created to save a query whose result set we want to look at again and again; it gives us a reusable overview of that result set.

Syntax :

CREATE OR REPLACE VIEW { view_name } AS

SELECT { column1, column2, column3, ..., columnN }

FROM { table_name }

WHERE { conditions } ;
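As an illustration, a view over the sample orders table could collect only the high-value orders (the view name and the amount threshold below are assumptions for the example):

CREATE OR REPLACE VIEW high_value_orders AS
SELECT order_id, order_date, amount
FROM orders
WHERE amount > 2000;

SELECT * FROM high_value_orders;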

Joins :

Types of Joins :

  • Inner Join
  • Left Join
  • Right Join
  • Full Join

INNER JOIN : It compares each row of table A with each row of table B and returns the rows that satisfy the join predicate.

Syntax:

SELECT columns

FROM tableA

INNER JOIN tableB

ON tableA.required_column = tableB.required_column ;
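As a concrete (hypothetical) example, assuming the orders table also stores the cust_id of the customer who placed each order, an inner join pairs every order with its customer:

SELECT customer.name, orders.amount
FROM customer
INNER JOIN orders
ON customer.cust_id = orders.cust_id;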

LEFT JOIN : Returns all the records from the left table, even if there are no matches in the right table; it shows NULL where the records are not matched.

Syntax :

SELECT tableA.column1 , tableB.column2

FROM tableA

LEFT JOIN tableB ON tableA.common_field = tableB.common_field

ORDER BY tableA.column1 ;

RIGHT JOIN : Returns all the records from the right table, even if there are no matches in the left table; it shows NULL where the records are not matched.

Syntax:

SELECT tableA.col1 , tableB.col2

FROM tableA

RIGHT JOIN tableB

ON tableA.required_column = tableB.required_column ;

FULL JOIN : Combines the results of the left and right joins; it returns all records from both tables, with NULL where there is no match.

Syntax :

SELECT tableA.col1 , tableB.col2

FROM tableA

FULL JOIN tableB

ON tableA.required_column = tableB.required_column ;


Hypothesis testing and p values

About the author

Aparna Mishra

Linkedin: https://www.linkedin.com/in/aparna-mishra-a934b6212

Hypothesis testing, p-values, and the statistics involved are among the most commonly asked topics in interviews, so it’s always better to be prepared with the finer mathematical details as well as real-life examples and business scenarios that help you express your point better. This post aims to help you do exactly that. Let’s begin!

What is a Hypothesis?

A hypothesis is a statement that can be tested by experiment or observation, provided we have past data. For example, we can make a statement like “Cristiano Ronaldo is the best footballer”, and we can test that statement against the data from past football matches.

Steps involved in Hypothesis testing :

  • Formulating a Hypothesis.
  • Finding the right test for the hypothesis (the outcome of a test refers to the population parameter rather than the sample statistic).
  • Executing the test.
  • Making a decision on the basis of the test.

What cannot be a Hypothesis?

A statement is not a hypothesis if it cannot be tested and we have no data regarding it.

There are two hypotheses:

  • Null Hypothesis – denoted by H0
  • Alternate Hypothesis – denoted by H1

The Null Hypothesis is the statement we are trying to reject. It represents the present state of affairs, while the Alternate Hypothesis represents the claim we personally want to establish.

Null Hypothesis

A Null Hypothesis is the hypothesis that is tested for rejection after assuming it to be true. The concept is similar to “innocent until proven guilty”: it is considered true until it is rejected.

Alternate Hypothesis :

The Alternate Hypothesis is the opposite of the Null Hypothesis: whatever we assume the Null Hypothesis to be, the Alternate Hypothesis is the complement of that assumption.

Simple and Composite Hypothesis :

Simple Hypothesis is when the Hypothesis statement has an exact value of the parameter.

Example – A textile company claims that it makes exactly $10,000 per month from its exports (H0: μ = 10,000).

Composite Hypothesis is when we have a range of values in the Hypothesis statement.

For example – the average height of girls in the class is greater than 5 feet (H1: μ > 5).

Two Tailed Test :

If the Alternate Hypothesis allows the parameter to deviate from the value in the Null Hypothesis in either direction (for example, H1: μ ≠ 10,000), it is called a two-tailed test, and the rejection region is split across both tails of the distribution.

One Tailed Test :

If the Alternate Hypothesis gives the alternative in only one direction for the parameter specified in the Null Hypothesis (for example, H1: μ > 10,000), it is called a one-tailed test.

Critical region :

Also called the rejection region. It is the set of values of the test statistic for which the Null Hypothesis is rejected: if the observed test statistic falls in the critical region, we reject the Null Hypothesis and accept the Alternative Hypothesis.

Confidence Interval :

A confidence interval is the range of values within which we expect the population parameter to lie with a chosen level of confidence. A 95% confidence level corresponds to a significance level of α = 0.05.

Type I and Type II error :

Type I error :

Rejecting the Null Hypothesis when it is actually true (a false positive). The probability of a Type I error equals the significance level α.

Type II error :

Failing to reject the Null Hypothesis when it is actually false (a false negative). Its probability is denoted by β, and 1 − β is called the power of the test.

P- value :

The p-value is the smallest level of significance at which the Null Hypothesis can be rejected. This is why most tests report a p-value: it is generally preferred because it carries more information than a simple comparison against the critical value.

The smaller the p-value, the stronger the evidence that we should reject the null hypothesis.

  • If p > .10 → “not significant”
  • If p ≤ .10 → “marginally significant”
  • If p ≤ .05 → “significant”
  • If p ≤ .01 → “highly significant.”
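As a quick, purely hypothetical worked example: suppose we test H0: μ = 100 against H1: μ ≠ 100 using a sample of n = 25 observations with sample mean x̄ = 103 and a known population standard deviation σ = 10. The test statistic is

z = (x̄ − μ0) / (σ / √n) = (103 − 100) / (10 / 5) = 1.5

and the two-tailed p-value is 2 · P(Z > 1.5) ≈ 0.13. Since 0.13 is greater than 0.10, the result would be reported as “not significant” by the scale above, and we fail to reject the Null Hypothesis at the 5% significance level.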