The K-Means Clustering Algorithm in Machine Learning
We have seen how classification problems are tackled using logistic regression. Here we discuss an algorithm that helps us group things into multiple classes. The interesting part is that there are no labelled tags attached to the data points to tell us which class a certain data instance belongs to (K-means clustering is not to be confused with K-nearest neighbours, where we need labelled data), making this an unsupervised machine learning problem.

Let's make this point clear with a real-life example where we as humans have classified an enormous amount of unlabelled data. Any living creature is classified as an animal or a plant. Further, we associate thousands of features to make finer classifications, like the kingdom, class, order and family. But notice how no animal has a tag on it saying "I belong to such-and-such category". So when we encounter a new species, how do we decide which class it belongs to?
Moreover, the level of classification required depends on the problem statement. Someone like a researcher might be interested in the full, root-level classification, while for others the difference between being a reptile or a bird is enough. This leads to a major conclusion: depending on how fine-grained our classes are, points that fall together in a certain class at one level of classification may change classes as the complexity increases.
For example, a pigeon and a rabbit fall under the same class if the division is based only on whether an animal lives in water or not, but they fall in different classes once further details are considered.
What Does "K" Signify?
The difficulty/complexity of a problem lies in how many classes the data instances have to be distributed into.
In machine learning, this is the basic idea behind K-means clustering. The value of K is the number of "classes" we are considering, in other words the number of centroids our algorithm will use. Hence a larger K implies a stricter classification. Theoretically, one can have as many classes as there are data points in the data set; that would be so strict that every object becomes a class as well as the only member of that class!
How to Measure "Closeness": Distance and Its Types
Obviously, things that are similar or closely related tend to fall within the same classes. Mathematically, closeness refers to the distance between two points. Distances come in the following types (a short sketch of each metric follows the list):
- Euclidean
- Manhattan
- Minkowski
- Hamming
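
Here is a minimal NumPy sketch of these four metrics. The original post contains no code, so the function names below are purely illustrative:

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared differences.
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # City-block distance: sum of absolute coordinate differences.
    return np.sum(np.abs(a - b))

def minkowski(a, b, p=3):
    # Generalization of the two above: p=1 gives Manhattan, p=2 gives Euclidean.
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def hamming(a, b):
    # Number of positions at which the two vectors differ.
    return np.sum(a != b)

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(euclidean(a, b), manhattan(a, b), minkowski(a, b))
print(hamming(np.array([1, 0, 1, 1]), np.array([1, 1, 0, 1])))  # -> 2
```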

[Figure] The black points are the centroid points; 3 centroids result in classification into 3 groups.
In K-means clustering we use the well-known Euclidean distance metric. Let's see the algorithm (a runnable sketch follows the steps):
1. You have the (unlabelled) data set plotted.
2. Choose the value of K, the number of classes you want.
3. Randomly draw K points on the plot (these are the K centroids).
4. For every point, calculate the K distances (the distance from each centroid).
5. Associate the point with the centroid to which it has the minimum distance.
6. You have now divided the data set points into K sets; each set contains the points that are nearest to a particular centroid.
7. Now suppose a particular set S contains M points; calculate the mean coordinate of these M points.
8. This mean coordinate is the new centroid. Do this for all K sets to get K updated centroids.
9. Repeat from step 4 until, in some iteration, none of the points change their set.
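
The following from-scratch sketch implements these steps with NumPy. It is an illustration under one assumption of my own: the initial centroids are sampled from the data points themselves (a common choice, slightly safer than drawing arbitrary random points on the plot):

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 3: pick K initial centroids (here, K distinct random data points).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for it in range(max_iters):
        # Steps 4-5: Euclidean distance from every point to every centroid,
        # then assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 9: stop once no point changes its set.
        if it > 0 and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Steps 7-8: the mean coordinate of each set becomes the new centroid.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Usage: three blob-like groups of 2-D points, clustered with K = 3.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2))
               for loc in ([0, 0], [5, 5], [0, 5])])
centroids, labels = k_means(X, k=3)
print(centroids)
```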

An Algorithmic Definition of the K-Means Approach
Following is a more mathematical definition for people who want a deeper understanding:
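In its standard form, K-means seeks the partition $S = \{S_1, \dots, S_k\}$ of the data points that minimizes the within-cluster sum of squared Euclidean distances, where $\boldsymbol{\mu}_i$ is the mean (centroid) of the points in $S_i$:

$$
\operatorname*{arg\,min}_{S} \; \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_i} \left\lVert \mathbf{x} - \boldsymbol{\mu}_i \right\rVert^2,
\qquad
\boldsymbol{\mu}_i = \frac{1}{\lvert S_i \rvert} \sum_{\mathbf{x} \in S_i} \mathbf{x}
$$

The assignment and update steps of the algorithm above each decrease this objective, which is why the iteration is guaranteed to converge (though possibly to a local minimum).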

How Do We Decide the Best K Value for Our Data Set?
Not all data sets are the same; some may be easily separable into two groups, so K = 2 would be enough, but in many cases this is not possible. The ideal K also varies with the complexity of the problem. We use the elbow method to decide the ideal value of K for a particular data set.
This is to ensure that the model doesn't overfit. Surely, adding more classes will keep reducing the within-cluster variance, but if we keep on adding classes we will soon be overfitting, and eventually each object in the data set would be a class of its own!
In the elbow method we plot the variance against the number of classes. The graph turns out to look like "an elbow": it decreases sharply at first and then flattens into an "L"-shaped curve (the sharpness of the bend depends on the particular dataset). This sharp "bend" point corresponds to the ideal K. Why? Because beyond it, every further introduction of a new class changes the classification only minimally. Take a look at the graph below; things will become clear with it:

[Figure] A plot of average dispersion (variance) vs. number of classes (K).
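
A sketch of how such a plot can be produced, assuming scikit-learn and matplotlib are available (scikit-learn's `inertia_` attribute is its name for the within-cluster sum of squares, the quantity on the y-axis of the figure above):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Synthetic data: three blob-like groups, so the elbow should appear at K = 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.6, size=(60, 2))
               for loc in ([0, 0], [6, 6], [0, 6])])

# Fit K-means for a range of K values and record the dispersion of each fit.
ks = range(1, 10)
dispersions = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
               for k in ks]

plt.plot(ks, dispersions, marker="o")
plt.xlabel("Number of classes (K)")
plt.ylabel("Average dispersion (variance)")
plt.title("Elbow method: the bend marks the ideal K")
plt.show()
```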
Happy classifying!