
K-Means Clustering in Machine Learning

The k-means clustering algorithm used in unsupervised machine learning

We have seen how classification problems are tackled using logistic regression. Here we discuss an algorithm that helps us group things into multiple classes. The interesting part is that there are no labelled tags associated with the data points to tell us which class a certain data instance belongs to (k-means clustering is not to be confused with k-nearest neighbours, where we need labelled data), making it an unsupervised machine learning problem. Let's make this point clear by considering a real-life example where we as humans have classified numerous unlabelled data. Any living creature is classified as an animal or a plant. Further, we associate thousands of features to make classifications like the kingdom, class, order and family. But notice how no animal has a tag on it saying "I belong to such and such a category". So when we encounter a new species, how do we decide which class it belongs to?

Moreover, the level of classification required depends on the problem statement. Someone like a researcher might be interested in the full root levels of classification, while for others the difference between being a reptile or a bird is enough. This leads to a major conclusion: depending on how fine-grained our classes are, we can have points that fall together in a certain class at one level of classification but change classes as the complexity increases.

For example, a pigeon and a rabbit fall under the same class if the division is based just on whether a certain animal lives in water or not. But they fall into different classes if further details are considered.

What Does "K" Signify?

The difficulty/complexity of a problem lies in how many classes the data instances have to be distributed into.

In machine learning this is the basic idea behind k-means clustering. The value of k is the number of "classes" we are considering, in other words the number of centroids our algorithm will use. Hence a larger k makes the classification stricter. Theoretically one can have as many classes as there are data points in the data set; that would be being so strict that every object becomes a class as well as the only member of that class!
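As a concrete illustration, here is a minimal sketch using scikit-learn's KMeans, which exposes k as the n_clusters parameter (the data here is made up):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data: two loose blobs of 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

for k in (2, 5, len(X)):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Inertia (within-cluster variance) always shrinks as k grows;
    # at k = len(X) every point is its own class and inertia is 0.
    print(k, round(km.inertia_, 2))
```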

How to Measure "Closeness": Distance and Its Types

Obviously, things that are similar or closely related tend to fall within the same classes. Mathematically, closeness refers to the distance between two points. Distances are of the following types (see the sketch after this list):

  1. Euclidean
  2. Manhattan
  3. Minkowski
  4. Hamming
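To make these metrics concrete, here is a minimal sketch of each distance for a pair of made-up vectors (plain NumPy, nothing k-means-specific):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))          # straight-line distance
manhattan = np.sum(np.abs(a - b))                  # sum of absolute differences
p = 3
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)  # generalisation: p=2 gives Euclidean, p=1 gives Manhattan

# Hamming distance is defined on equal-length sequences: the count of mismatched positions.
s, t = "karolin", "kathrin"
hamming = sum(c1 != c2 for c1, c2 in zip(s, t))
```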
[Figure: plot of k-means clustering. The black points are the centroids; 3 centroids result in classification into 3 groups.]

In k-means clustering we use the well-known Euclidean distance metric. Let's see the algorithm:

  1. You have the (unlabelled) data set plotted.
  2. Choose the value of k, the number of classes you want.
  3. Randomly draw k points on the plot (these are the k centroids).
  4. For every point, calculate the k distances (the distance from each centroid).
  5. Associate the point with the centroid to which it has the minimum distance.
  6. Now you have divided the data set points into k sets; each set has the points that are nearest to a particular centroid.
  7. Now suppose a particular set S has m points; calculate the mean coordinate of these m points.
  8. This mean coordinate is the new centroid. Do this for all k sets, giving k updated centroids.
  9. Repeat from step 4 until, in some iteration, none of the points change their set. (A from-scratch sketch of these steps follows below.)
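Here is a minimal from-scratch sketch of the steps above in NumPy (the function and variable names are my own; a library implementation such as scikit-learn's KMeans adds smarter initialisation and multiple restarts):

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    """Alternate assignment and centroid updates until no point changes its set."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 3: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iters):
        # Steps 4-5: Euclidean distance from every point to every centroid,
        # then assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 9: stop when no point changes its set.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Steps 7-8: each new centroid is the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```

Calling labels, centroids = k_means(X, 3) on an (n, d) array X returns each point's class and the final centroids; the objective J from the next section is then ((X - centroids[labels]) ** 2).sum().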
[Figure: an algorithmic definition of the k-means approach.]

Following is a more mathematical definition for readers who want a deeper understanding:

K-Means Objective Function

The within-cluster sum of squares that k-means minimises is

\[
J = \sum_{j=1}^{k} \sum_{x_i \in S_j} \lVert x_i - \mu_j \rVert^2 ,
\]

where \(S_j\) is the set of points assigned to the j-th centroid and \(\mu_j\) is the mean of the points in \(S_j\).

How Do We Decide the Best K Value for Our Data Set?

Not all data sets are the same; some may be easily separable, in which case k = 2 would be enough, but in many cases this is not possible. The right k also varies with the complexity of the problem. We use the elbow method to decide the ideal value of k for a particular data set.

This is to ensure that the model doesn't overfit. Adding more classes will always reduce the within-cluster variance, but if we keep on adding classes we will soon be overfitting, and eventually each object in the data set would be a class of its own!

In the elbow method we plot the variance against the number of classes. The graph turns out to look like "an elbow": decreasing sharply at first and then flattening into an "L"-shaped curve. The sharpness of the bend depends on the particular data set, and this sharp "bend" point corresponds to the ideal k. Why? Because beyond it, every further introduction of a new class changes the classification only minimally. Take a look at the graph below; things will get clearer with it, and a short sketch for producing such a plot follows the figure.

[Figure: k-means elbow method, a plot of average dispersion (variance) vs the number of classes (k).]
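A minimal sketch for producing such a plot, assuming scikit-learn and matplotlib are available (inertia_ is scikit-learn's name for the within-cluster variance plotted here; the data is made up):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Made-up data: three well-separated blobs, so the elbow should appear at k = 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in (0, 4, 8)])

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of classes (k)")
plt.ylabel("within-cluster variance (inertia)")
plt.show()
```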

Happy classifying!
