AI on cloud, AI on AWS

AI: Leveraging AWS and GCP AI Services

Introduction

Artificial Intelligence (AI) has emerged as a transformative force, revolutionizing industries and shaping the way we approach problem-solving and innovation. Cloud service providers like Amazon Web Services (AWS) and Google Cloud Platform (GCP) have made AI accessible to businesses of all sizes, empowering them with advanced AI technologies to drive growth and efficiency. In this comprehensive guide, we will explore the diverse AI services offered by both AWS and GCP, discuss their potential applications, and understand how businesses can seamlessly integrate these powerful tools with their backend systems and APIs.

AI Services Offered by AWS

  1. Amazon Rekognition: Advanced Image and Video Analysis

Amazon Rekognition, a flagship AI service from AWS, enables businesses to analyze images and videos using cutting-edge computer vision algorithms. With features like facial recognition, object detection, and scene understanding, Rekognition finds applications in various industries. E-commerce companies can utilize it for product tagging and personalized recommendations, while security systems can enhance surveillance with real-time face recognition.
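As a rough illustration of how this plugs into a backend, here is a minimal Python sketch using the boto3 SDK to detect labels in an image stored in S3; the bucket and object names are placeholders, and AWS credentials are assumed to be configured.

import boto3

# Rekognition client (assumes AWS credentials and region are configured)
rekognition = boto3.client("rekognition", region_name="us-east-1")

# Detect objects/scenes in an image stored in S3
# "my-bucket" and "products/shoe.jpg" are placeholder names
response = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "my-bucket", "Name": "products/shoe.jpg"}},
    MaxLabels=10,
    MinConfidence=80,
)

for label in response["Labels"]:
    print(label["Name"], round(label["Confidence"], 1))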

  2. Amazon Polly: Natural Text-to-Speech Conversion

Amazon Polly empowers businesses with the capability to convert written text into lifelike speech in multiple languages. Audiobook creation, language translation, and voice-enabled applications are just a few examples of Polly’s versatile applications, enhancing accessibility and user engagement.
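A minimal sketch of calling Polly from Python with boto3 is shown below; the voice, text, and output file name are arbitrary examples, and AWS credentials are assumed to be configured.

import boto3

# Polly client (assumes AWS credentials and region are configured)
polly = boto3.client("polly", region_name="us-east-1")

# Convert a short piece of text to MP3 audio; voice and text are examples only
response = polly.synthesize_speech(
    Text="Welcome to our store. How can we help you today?",
    OutputFormat="mp3",
    VoiceId="Joanna",
)

# The audio comes back as a stream that can be written to a file
with open("welcome.mp3", "wb") as f:
    f.write(response["AudioStream"].read())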

  3. Amazon Lex: Building Intelligent Conversational Interfaces

Amazon Lex simplifies the development of interactive chatbots and conversational interfaces that offer personalized customer experiences. Leveraging natural language understanding (NLU) capabilities, Lex engages users with human-like interactions, enhancing customer support and satisfaction.

  4. Amazon Comprehend: Extracting Insights from Unstructured Text

Amazon Comprehend harnesses the power of Natural Language Processing (NLP) to extract valuable insights from vast amounts of unstructured text. Businesses can perform sentiment analysis, entity recognition, language detection, and topic modeling, thereby understanding customer feedback, market trends, and brand sentiment.
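As a quick sketch, the snippet below sends a made-up customer review to Comprehend for sentiment and entity detection via boto3; credentials are assumed to be configured.

import boto3

# Comprehend client (assumes AWS credentials and region are configured)
comprehend = boto3.client("comprehend", region_name="us-east-1")

review = "The checkout was slow, but the support team resolved my issue quickly."

sentiment = comprehend.detect_sentiment(Text=review, LanguageCode="en")
entities = comprehend.detect_entities(Text=review, LanguageCode="en")

print(sentiment["Sentiment"], sentiment["SentimentScore"])
print([e["Text"] for e in entities["Entities"]])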

  5. Amazon SageMaker: Simplifying Machine Learning Workflows

Amazon SageMaker offers a fully-managed environment for building, training, and deploying machine learning models. This service streamlines the process of developing intelligent applications, allowing data scientists and developers to efficiently collaborate.

  6. Amazon Forecast: Predicting Future Trends with Machine Learning

Amazon Forecast enables businesses to make data-driven decisions by predicting future trends and demands. Leveraging AI and machine learning, this service optimizes inventory management, sales forecasting, and resource allocation.

  7. Amazon Personalize: Delivering Tailored Recommendations

Amazon Personalize enables businesses to offer personalized product and content recommendations based on user behavior and historical data. This personalization enhances customer engagement, satisfaction, and conversion rates.

AI Services Offered by GCP

  1. Cloud Vision API: Powerful Image Analysis

Google Cloud’s Vision API allows businesses to integrate AI-powered image analysis capabilities into their applications. With features like label detection, face detection, and OCR (optical character recognition), Vision API finds use in e-commerce, healthcare, and content management industries.
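For comparison with the Rekognition example above, here is a minimal sketch of label detection with the google-cloud-vision client library; the local file path is a placeholder and GCP credentials are assumed to be set up.

from google.cloud import vision

# Vision client (assumes GOOGLE_APPLICATION_CREDENTIALS is set)
client = vision.ImageAnnotatorClient()

# "product.jpg" is a placeholder local image file
with open("product.jpg", "rb") as f:
    image = vision.Image(content=f.read())

response = client.label_detection(image=image)
for label in response.label_annotations:
    print(label.description, round(label.score, 2))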

  2. Text-to-Speech API: Natural-Sounding Speech Synthesis

The Text-to-Speech API from GCP converts written text into natural-sounding speech, offering multilingual support and a range of voice options. Applications include voice-enabled applications, language translation, and accessibility features for visually impaired users.

  3. Dialogflow: Building Conversational Interfaces

Dialogflow enables businesses to create chatbots and conversational interfaces with NLU capabilities. The platform supports multiple messaging platforms, making it suitable for various use cases, including customer support and virtual assistants.

  4. Natural Language API: Understanding Textual Data

Google Cloud’s Natural Language API provides sentiment analysis, entity recognition, and syntax analysis to gain insights from unstructured text. Businesses can use it for customer feedback analysis, content classification, and trend monitoring.
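A minimal sketch of sentiment analysis with the google-cloud-language client library follows; the review text is made up and GCP credentials are assumed to be configured.

from google.cloud import language_v1

# Natural Language client (assumes GOOGLE_APPLICATION_CREDENTIALS is set)
client = language_v1.LanguageServiceClient()

document = language_v1.Document(
    content="Delivery was late, but the product quality is excellent.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

sentiment = client.analyze_sentiment(request={"document": document}).document_sentiment
print(f"score={sentiment.score:.2f}, magnitude={sentiment.magnitude:.2f}")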

  5. AI Platform: Scalable Machine Learning

GCP’s AI Platform offers a scalable infrastructure for machine learning model development and deployment. Data scientists and developers can collaborate effectively to build intelligent applications.

  6. Cloud AI Platform Notebooks: Collaborative Data Science

The AI Platform Notebooks facilitate collaborative data science, providing a managed Jupyter Notebook environment for data exploration, model development, and visualization.

Conclusion

AWS and GCP offer comprehensive suites of AI services that enable businesses to harness the power of AI for growth and efficiency. By leveraging services like Rekognition, Polly, Lex, Comprehend, SageMaker, Forecast, and Personalize from AWS, or counterparts like the Vision API, Text-to-Speech API, Dialogflow, the Natural Language API, AI Platform, and AI Platform Notebooks from GCP, organizations can gain valuable insights, optimize operations, and deliver personalized experiences to their customers.

Integrating these AI services with backend systems and APIs is achievable through the use of respective SDKs and APIs, making it easier for businesses to embrace AI and unlock its full potential.

In upcoming blogs, we will discuss Azure and each of these services in detail.

SQL for beginners

SQL For Analytics

About the author

Aparna Mishra

Data Science Analyst

Linkedin: https://www.linkedin.com/in/aparna-mishra-a934b6212

SQL stands for Structured Query Language, also pronounced 'Sequel'. It is the language used for communication between a user and a database. Popular SQL databases include Oracle, MS SQL Server, and MySQL, among others.

Sublanguages in SQL  :

  • Data Definition Language (DDL)
  • Data Manipulation language (DML)
  • Data Query Language (DQL)
  • Data Control Language (DCL)
  • Transaction Control language (TCL)

Importing Data from File:

Syntax :

-- PostgreSQL
COPY {table_name} (col1, col2, col3, ..., colN)
FROM 'path/to/file.csv' DELIMITER ',' CSV HEADER ;

OR

-- SQL Server
BULK INSERT Orders
FROM 'C:\Users\aparna\Downloads\orders.csv'    -- dummy file path / location
WITH
(
    FIRSTROW = 2,              -- as the 1st row is the header
    FIELDTERMINATOR = ',',     -- delimiter
    ROWTERMINATOR = '\n',      -- marks the end of each row
    TABLOCK
) ;

OR

COPY {table_name} FROM 'path/to/file.csv' DELIMITER ',' ;

DDL Commands and Uses :

  • CREATE : It is used to create a table and define the columns and their data types.

CREATE TABLE { TABLE NAME }
(
    column1 datatype,
    column2 datatype,
    column3 datatype,
    ...,
    columnN datatype
) ;

Example :

CREATE TABLE student
(
    student_id INT,
    first_name VARCHAR(20),
    std VARCHAR(5),
    age INT
) ;

  • ALTER : It is used to modify the structure of an existing table.

  1. Adding a new column – ALTER TABLE {table_name} ADD {column_name} VARCHAR(20);

  2. Dropping an existing column – ALTER TABLE {table_name} DROP COLUMN {column_name};

  3. Changing a column's data type – ALTER TABLE {table_name} ALTER COLUMN {column_name} CHAR(20);

  • DROP : Drops the table permanently from the database.

DROP TABLE {table_name};

  • TRUNCATE : Removes the data and not the structure of the table.

TRUNCATE TABLE {table_name} ;

DML Commands and Uses :

  • INSERT : The INSERT INTO statement is used to add new records to an existing table.

INSERT INTO { TABLE NAME }

( column1 , column2, ……) VALUES ( value1, value2,……) ;

Example :

INSERT INTO student ( student_id, first_name, std, age ) VALUES ( 6, 'Dev', 'VII', 13 );

  • UPDATE :  Used to modify the existing records in a table.

UPDATE { TABLE NAME }

SET column1 = ‘value1’,

column2 =  ‘value2’,

……..

WHERE { condition } ;

Example :

UPDATE student SET age = 14 WHERE student_id = 4 ;

  • DELETE :  Used to delete the existing records from a table.

DELETE FROM { TABLE NAME } WHERE { condition } ;

    DELETE FROM student WHERE student_id = 6;

DQL Commands and Uses :

  • SELECT : Used to fetch data from a database table.

    SELECT * from { TABLE NAME } WHERE { condition };

  1. SELECT student_id, first_name, std FROM student ;

  2. SELECT student_id FROM student WHERE std = 'VII' ;

  3. SELECT cust_id, name, address FROM customer WHERE age > 40 ;

  • SELECT DISTINCT : The DISTINCT keyword is used with SELECT to eliminate all the duplicate records and fetch only the unique ones.

     SELECT DISTINCT { column name } FROM {TABLE NAME };

  1. SELECT DISTINCT address FROM customer ;
  2. SELECT DISTINCT first_name FROM student ;

DCL Commands and Uses :

  • GRANT : Used to provide any user access or privileges for the database.

GRANT CREATE ANY TABLE TO Username ;

GRANT DROP ANY TABLE TO Username ;

  • REVOKE :  Used to take back privileges / access for the database.

REVOKE CREATE ANY TABLE FROM Username ;

REVOKE DROP ANY TABLE FROM Username;

TCL Commands and Uses :

  • COMMIT – It is used to make the transaction permanent.

UPDATE { TABLE NAME } SET column1 = ‘value1’ WHERE { condition };

COMMIT ;

  1. UPDATE student SET Name = ‘XYZ’ WHERE Name = ‘ABC’ ;

COMMIT;

2. UPDATE orders SET order_date = '2020-09-18'

WHERE order_id = 15 ;

COMMIT;

  • ROLLBACK – It is used to restore the database to its last committed state, or to a previously created savepoint.

ROLLBACK TO savept_name;

  • SAVEPOINT – This command marks a point within a transaction so that we can roll back to that point whenever we need to.

     SAVEPOINT savepoint_name ;

HOW TO USE COMMIT,  ROLLBACK AND SAVEPOINT ?

  1. Create a table and insert records into it.
  2. To use the TCL commands in SQL, we need to first initiate a transaction by using the BEGIN / START TRANSACTION command.

BEGIN TRANSACTION ;

  3. Updating the table and using COMMIT –

UPDATE student SET age = 14 WHERE student_id = 4 ;

  COMMIT;

Using COMMIT ensures that the change is saved permanently; querying the table after COMMIT shows the updated records.

  4. Using ROLLBACK – rolling back undoes the changes made since the last COMMIT (or since a savepoint).

DELETE FROM orders WHERE order_id = 17;

ROLLBACK;

The ROLLBACK ensures that the changes made by the DELETE command are reverted.

5. Using SAVEPOINT – We can create as many SAVEPOINTs as we want after doing some changes to the table.

    SAVEPOINT {savepoint_name};

UPDATE orders SET amount = 12900 WHERE order_id = 15;

SAVEPOINT upd;

DELETE FROM orders WHERE order_id = 12;

SAVEPOINT del;

6. Now, if we want to undo the DELETE and return the table to the state it was in just after the UPDATE, we can ROLLBACK TO upd; rolling back to del would take us to the state just after the DELETE. If we want to go back to the state when the records were first inserted, we would need a savepoint created right after the INSERT and could roll back to that one.

ROLLBACK TO upd;

SELECT * FROM orders;

Running these commands shows the records as they were before the DELETE was applied.

Aggregation functions in SQL –

Aggregate functions can be used only in the SELECT list or in the HAVING clause.

  • AVG( ) – Returns the average of a numeric column as a floating-point value.

SELECT AVG (column_name) FROM TABLE;

  • COUNT( ) – Returns the number of rows which is an integer value.
  1. SELECT COUNT(*) FROM TABLE;

This query counts and returns the number of records in the table

2. SELECT COUNT( * ) FROM TABLE WHERE { condition };

  • MAX( ) – Returns the maximum value.

SELECT MAX (column_name) FROM TABLE;

  • MIN( ) – Returns the minimum value.

     SELECT MIN (column_name) FROM TABLE;

  • SUM( ) – Returns the total sum.

SELECT SUM (column_name) FROM TABLE;

  • ROUND( ) – Used to round a value to a specified number of decimal places (often combined with AVG).

SELECT ROUND(AVG ( column_name ),3 ) FROM TABLE;

Use of ORDER BY :

It is used to sort rows based on a specific column either in ascending / descending order.

Syntax :

  1. SELECT Column1, Column2 FROM TABLE ORDER BY column1 ASC / DESC ;

2. SELECT column1, column2 FROM TABLE WHERE { condition } ORDER BY column1 DESC ;

Use of LIMIT :

It allows us to limit the number of rows returned for a query.

Syntax :

SELECT * FROM TABLE

ORDER BY column_name DESC/ ASC

LIMIT 10 ;

Use of BETWEEN  Operator:

It can be used to get a range of values.

Syntax :

  1. Value BETWEEN LOW and  HIGH
  2. Value NOT BETWEEN LOW and HIGH.
  3. Value BETWEEN ‘YYYY-MM-DD’ AND ‘YYYY-MM-DD’

SELECT COUNT(*) FROM orders WHERE amount BETWEEN 800 AND 2000;

This statement will return the number of orders with an amount between 800 and 2000 (inclusive).

So we need to provide a range for BETWEEN clauses.

Use of IN Operator :

It checks if the value is included in the list of options provided.

Syntax :

  1. SELECT * FROM TABLE WHERE column_name IN (option1, option2, option3, .....) ;
  2. SELECT * FROM orders WHERE amount IN (200, 800, 9000) ;
  3. SELECT * FROM orders WHERE order_id NOT IN (12, 13, 14) ;

This way, the IN / NOT IN operators can be used to fetch the required records from the table; they act as a filter, returning only the records whose values appear in the list of options.

Use of LIKE Operator :

It is helpful in pattern matching against string data with the use of wildcard characters.

LIKE is case-sensitive in some databases (such as PostgreSQL), while in others (such as MySQL with its default collation) it is not.

  1. Get all the records where the name begins with 'D'.

SELECT * FROM student WHERE first_name LIKE ‘D%’  ;

  2. All names that end with 'a'.

SELECT * FROM student WHERE first_name LIKE '%a' ;

3. All names with 'on' in the second and third positions.

SELECT * FROM student WHERE first_name LIKE '_on%' ;

4. All first names with 'ar' somewhere in the middle.

SELECT * FROM customer WHERE first_name LIKE '%ar%' ;

This way we can use LIKE to find / match patterns.

Use of Group By Clause:

GROUP BY allows us to aggregate rows according to some category. The GROUP BY clause must appear right after the FROM or WHERE clause.

Syntax :

SELECT column1, column2, SUM(column3)
FROM table_name
WHERE column2 IN (option1, option2)
GROUP BY column1, column2 ;

SELECT COUNT(cust_id) , address

FROM customer

GROUP BY address;

Use of HAVING clause :

It allows us to use an aggregate result as a filter along with GROUP BY.

Syntax :

SELECT COUNT(cust_id), address

FROM customer

GROUP BY address

HAVING COUNT(cust_id) > 4;

Use of AS :

‘AS’ keyword is used to assign alias names to a column or a table.

Syntax :

SELECT { column_name } AS { column_alias } FROM { table_name };

USE OF VIEW :

A view is a virtual table, not a physical table. It is created from a query so that a frequently used result set can be viewed easily.

Syntax :

CREATE { OR  REPLACE } VIEW { view_name } AS

SELECT { column1 , column2 , column3 ,….. columnN }

FROM { table_name }

WHERE { conditions } ;

Joins :

Types of Joins :

  • Inner Join
  • Left Join
  • Right Join
  • Full Join

INNER JOIN : It compares each row of table A with each row of table B and returns only the rows that satisfy the join predicate.

Syntax:

SELECT columns

FROM tableA

INNER JOIN tableB

ON tableA.required_column = tableB.required_column ;

LEFT JOIN : Returns all the records from the left table even if there are no matches in the right table; it shows NULL where records are not matched.

Syntax :

SELECT tableA.column1 , tableB.column2

FROM tableA

LEFT JOIN tableB ON tableA.common_field = tableB.common_field

ORDER BY tableA.column1 ;

RIGHT JOIN : Returns all the records from the right table even if there are no matches in the left table; it shows NULL where records are not matched.

Syntax:

SELECT tableA.col1 , tableB.col2

FROM tableA

RIGHT JOIN tableB

ON tableA.required_column = tableB.required_column ;

FULL JOIN : Combines the results of the left and right joins, returning all records from both tables and showing NULL where there is no match.

Syntax :

SELECT tableA.col1 , tableB.col2

FROM tableA

FULL JOIN tableB

ON tableA.required_column = tableB.required_column ;


Hypothesis testing and p values

About the author

Aparna Mishra

Linkedin: https://www.linkedin.com/in/aparna-mishra-a934b6212

Hypothesis testing, p-values, and the statistics involved are among the most commonly asked topics in interviews, so it is always better to be prepared with the mathematical details as well as real-life examples and business scenarios that help you express your point better. This post aims to help you do exactly that. Let's begin!

What is a Hypothesis?

A Hypothesis is a statement that can be tested either by experiment or observation, provided we have data. For example, we can make a statement like "Cristiano Ronaldo is the best footballer" and test it against data from past football matches.

Steps involved in Hypothesis testing :

  • Formulating a Hypothesis.
  • Finding the right test for the hypothesis (the outcome of a test refers to the population parameter rather than the sample statistic).
  • Executing the test.
  • Making a decision on the basis of the test.

What cannot be a Hypothesis?

A statement is not a hypothesis if it cannot be tested and we have no data regarding it.

There are two hypotheses:

  • Null Hypothesis – denoted by H0
  • Alternate Hypothesis – denoted by H1

The Null Hypothesis is the statement we are trying to reject. It typically represents the present state of affairs, while the Alternate Hypothesis is the claim we are trying to establish.

Null Hypothesis

A Null Hypothesis is the hypothesis that is tested for rejection, after assuming it to be true. The concept is similar to "innocent until proven guilty": it is considered true until it is rejected.

Alternate Hypothesis :

Alternate Hypothesis is the opposite of Null Hypothesis. Whatever we assume our Null Hypothesis to be , the Alternate Hypothesis is the complement of that assumption.

Simple and Composite Hypothesis :

A Simple Hypothesis is one in which the hypothesis statement specifies an exact value of the parameter.

Example – A textile company claims that its exports earn exactly $10,000 per month.

A Composite Hypothesis is one in which the hypothesis statement specifies a range of values.

For example – the average height of girls in the class is greater than 5 feet.

Two Tailed Test :

If the Alternate Hypothesis allows the parameter to differ from the value stated in the Null Hypothesis in either direction (greater or smaller), it is called a Two-tailed test.

One Tailed Test :

If the Alternate Hypothesis gives the alternate in only one direction for the parameter specified in the Null Hypothesis, it is called a One-tailed test.

Critical region :

Also called the Rejection Region, it is the set of values of the test statistic for which the Null Hypothesis is rejected: if the observed test statistic falls in the critical region, we reject the Null Hypothesis and accept the Alternative Hypothesis.

Confidence Interval :

A confidence interval is the range of values within which the population parameter is expected to lie with a given level of confidence (for example, a 95% confidence interval).

Type I and Type II error :

Type I error :

Rejecting the Null Hypothesis when it is actually true (a false positive). Its probability is the significance level, denoted by alpha.

Type II error :

Failing to reject the Null Hypothesis when it is actually false (a false negative). Its probability is denoted by beta.

P- value :

The p-value is the smallest level of significance at which the null hypothesis can be rejected. This is why most tests report a p-value: it gives us more information than simply comparing the test statistic with a critical value. (A short worked example follows the significance cut-offs below.)

The smaller the p-value, the stronger the evidence that we should reject the null hypothesis.

  • If p > .10 → “not significant”
  • If p ≤ .10 → “marginally significant”
  • If p ≤ .05 → “significant”
  • If p ≤ .01 → “highly significant.”
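To make this concrete, here is a minimal Python sketch of a one-sample t-test using SciPy; the sample (monthly sales figures) and the hypothesised mean of 10 are purely illustrative.

import numpy as np
from scipy import stats

# Hypothetical sample: monthly sales (in thousands of dollars)
sample = np.array([10.2, 9.8, 10.5, 11.1, 9.9, 10.7, 10.4, 10.0])

# H0: the true mean is 10; H1: the true mean is not 10 (two-tailed test)
t_stat, p_value = stats.ttest_1samp(sample, popmean=10.0)

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value <= alpha:
    print("Reject H0 at the 5% significance level")
else:
    print("Fail to reject H0")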

Gaussian NAIVE BAYES, continuous features

There are different variants of Naive Bayes, such as Bernoulli and Gaussian. This article assumes you are familiar with the basic idea behind Naive Bayes and how it works on categorical data.

Here we discuss one of the approaches used for handling continuous variables when it comes to naive bayes.

Suppose we have the following dataset, where the target variable is whether a movie is a hit or a flop, and the feature variables are the action rating and the story rating (values between 1 and 10):

ACTION RATING (AR)   STORY RATING (SR)   HIT/FLOP
7.2                  5.8                 HIT
3.4                  6.3                 FLOP
3.5                  7.3                 FLOP
8.5                  8.0                 HIT
6.9                  2.8                 FLOP
7.0                  5.3                 HIT
9.0                  3.8                 HIT

Now let's suppose we have a test point: action rating = 7.6, story rating = 5.7. These are the probabilities we need to estimate:

P(HIT| AR= 7.6, SR=5.7) and P(FLOP| AR=7.6, SR=5.7)

Let's start by considering the first probability expression. With a purely frequency-based approach we would look for training points with AR = 7.6 and SR = 5.7, but no such point is present in the dataset. So should we set this probability to zero, and likewise for the second expression? That would mean any unseen point always drives both probabilities to zero. So how do we resolve this issue? Let's get there.

GAUSSIAN DISTRIBUTION

There are three terms that need to be evaluated in the expression below:

P(HIT | AR = 7.6, SR = 5.7) ∝ P(AR = 7.6 | HIT) * P(SR = 5.7 | HIT) * P(HIT)

The P(HIT) calculation is straightforward: it is (total number of hits) / (total number of hits and flops), which is 4/7 for this dataset.

For calculating the two remaining conditional probabilities, we assume that the values in the dataset are sampled from a Gaussian distribution whose mean and variance are estimated from the sample points. To recall, the Gaussian probability density function is f(x) = (1 / (sigma * sqrt(2*pi))) * exp(-(x - mu)^2 / (2 * sigma^2)).

Now, once we have the Gaussian distribution for a feature column, we can get the pdf value for any point, whether or not it is present in our dataset.

IMPORTANT POINTS TO BE NOTED:

  1. While calculating P(HIT | AR = 7.6, SR = 5.7), the Gaussian distributions are fitted only on the data points where the output is HIT.
  2. A separate distribution is fitted for every (feature, class) pair, so here four distributions are used: AR for HIT, AR for FLOP, SR for HIT, and SR for FLOP (the sketch right after this list implements these points).
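Here is a minimal Python sketch of the calculation described above, using the seven movies from the table and the test point AR = 7.6, SR = 5.7; it fits one Gaussian per (feature, class) pair with SciPy (using the population standard deviation for simplicity) and compares the two unnormalised posteriors.

import numpy as np
from scipy.stats import norm

# Ratings from the table above, grouped by class
ar = {"HIT": [7.2, 8.5, 7.0, 9.0], "FLOP": [3.4, 3.5, 6.9]}
sr = {"HIT": [5.8, 8.0, 5.3, 3.8], "FLOP": [6.3, 7.3, 2.8]}
prior = {"HIT": 4 / 7, "FLOP": 3 / 7}

test_ar, test_sr = 7.6, 5.7  # the unseen test point

scores = {}
for cls in ("HIT", "FLOP"):
    # One Gaussian per (feature, class) pair: four distributions in total
    p_ar = norm.pdf(test_ar, loc=np.mean(ar[cls]), scale=np.std(ar[cls]))
    p_sr = norm.pdf(test_sr, loc=np.mean(sr[cls]), scale=np.std(sr[cls]))
    scores[cls] = p_ar * p_sr * prior[cls]

print(scores)                       # unnormalised posteriors
print(max(scores, key=scores.get))  # predicted class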

WHAT TO CHECK? BOX-COX TRANSFORM – AN IMPORTANT TOOL

For Gaussian Naive Bayes we assumed that, within each feature, the points come from a Gaussian distribution. But what if that is not the case? Here are a few points to keep in mind:

  1. You can always plot a distribution plot of the feature and check whether it looks Gaussian or not.
  2. Before applying Gaussian Naive Bayes you can use the Box-Cox transform to make a skewed distribution closer to normal.
  3. If a column deviates strongly from a Gaussian you can model it with a different distribution instead, such as a log-normal or a power law (a Box-Cox transform with lambda = 0 is simply a log transform, which is what turns log-normal data into normal data). The general Box-Cox expression is y(lambda) = (x^lambda - 1) / lambda for lambda ≠ 0, and y(0) = ln(x); you can see how lambda = 0 reduces it to a log transform. A minimal sketch of the transform follows this list.
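A minimal sketch of the Box-Cox transform with SciPy, applied to made-up right-skewed data (Box-Cox requires strictly positive values):

import numpy as np
from scipy import stats

# Made-up right-skewed feature values (must be strictly positive)
x = np.random.lognormal(mean=0.0, sigma=0.8, size=1000)

# boxcox() also estimates the lambda that makes the data closest to Gaussian
x_transformed, best_lambda = stats.boxcox(x)
print("estimated lambda:", best_lambda)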

With the above points in mind, you are ready to use Gaussian Naive Bayes!

More to come!


Splitting nodes in DT for continuous features (classification)

Splitting nodes in decision trees (DT) when a given feature is categorical is done using concepts like entropy, Information Gain and Gini impurity.

But when the features are continuous, how does one split the nodes of the decision tree? I assume you are familiar with the concept of entropy.

Suppose we have a training dataset of n sample points. Let us consider one particular feature, f1, which is continuous in nature.

Approach for splitting nodes

  1. We need to evaluate candidate split points across the sample values of the feature.
  2. We sort the f1 column in ascending order.
  3. Then, taking every value in f1 as a threshold, we calculate the entropy of the resulting split and the corresponding Information Gain.
  4. We select the threshold with the highest Information Gain and make the split there.
  5. We then repeat the same procedure on the leaf nodes until either max_depth is reached or a node has fewer samples than the minimum required for a split (a code sketch of the threshold search appears after the worked example below).

Let's try to understand the above with an example:

Let the following be the f1 feature column, and let's say it is a two-class classification problem:

F1 (NUMERICAL FEATURE)   TARGET VARIABLE / LABEL
5.4                      YES
2.8                      NO
3.9                      NO
8.5                      YES
7.6                      YES
5.9                      YES
6.8                      NO

We start by sorting the feature values in increasing order:

SORTED F1   TARGET VARIABLE / LABEL
2.8         NO
3.9         NO
5.4         YES
5.9         YES
6.8         NO
7.6         YES
8.5         YES

The sorted feature column

Now we will choose each point as a threshold, one by one: 2.8, 3.9, and so on. Consider the split for one point, say 5.4: the left child gets the rows with f1 <= 5.4 (2.8 NO, 3.9 NO, 5.4 YES) and the right child gets the rows with f1 > 5.4 (5.9 YES, 6.8 NO, 7.6 YES, 8.5 YES).

We perform similar splits for all the data points, and whichever threshold gives us the maximum Information Gain (IG) becomes our first splitting point. If you cannot recall what IG is: it is the entropy of the parent node minus the weighted average entropy of the child nodes.

Now, for further splits, a similar approach is repeated on the leaf nodes.
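Here is a minimal Python sketch of the threshold search described above, run on the sorted f1 column from the table (YES/NO encoded as 1/0); it computes the information gain for every candidate threshold and keeps the best one.

import numpy as np

# Sorted f1 values and labels from the table above (1 = YES, 0 = NO)
f1 = np.array([2.8, 3.9, 5.4, 5.9, 6.8, 7.6, 8.5])
y = np.array([0, 0, 1, 1, 0, 1, 1])

def entropy(labels):
    # Shannon entropy of a 0/1 label array
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=2) / len(labels)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

parent_entropy = entropy(y)
best_threshold, best_gain = None, -1.0

for threshold in f1:
    left, right = y[f1 <= threshold], y[f1 > threshold]
    # Information gain = parent entropy - weighted average child entropy
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
    gain = parent_entropy - weighted
    if gain > best_gain:
        best_threshold, best_gain = threshold, gain

print("best threshold:", best_threshold, "information gain:", round(best_gain, 3))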

DISADVANTAGE

There is one disadvantage of the above process: if your dataset is large, the computational requirements increase significantly. Imagine performing the above operation on millions of records with max_depth = 10.

We can, however, mitigate the problem by binning the feature, i.e., converting the numerical feature into a categorical one.

More to come!