Oct 5 - Oct 17 Journal

Clustering:
These two weeks, I researched the k-means clustering algorithm after the group split the different clustering algorithms amongst ourselves. I already knew the basics of k-means: each data point is assigned to its nearest centroid, and the centroids are recomputed and shifted until the assignments stop changing. One of the key issues with the k-means algorithm, however, is deciding exactly how many clusters the algorithm should create, because the user must input the "k" value, the number of clusters. I found that the "Elbow Method" (pictured below) is the most common way to determine the "k" value. The Elbow Method computes the within-cluster sum of squared distances between each centroid and its cluster's data points for a range of "k" values, then picks the "k" where that sum stops dropping sharply (the "elbow" of the curve). I also learned how the k-means algorithm can be applied: it can cluster customer purchases, personality test respondents, or even typical YouTube recommendations.


[Figure: k-means clustering animation]
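
To make the Elbow Method concrete, here's a minimal sketch using scikit-learn. The synthetic make_blobs data and the range of "k" values are just placeholders for illustration; you'd swap in your own feature matrix.

```python
# A minimal sketch of the Elbow Method with scikit-learn's KMeans.
# The synthetic dataset below is a stand-in for real data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# For each candidate k, record the within-cluster sum of squares (inertia).
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# The "elbow" is the k where inertia stops dropping sharply.
for k, inertia in zip(range(1, 10), inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```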

During this time, I also learned about DBSCAN clustering from Will and hierarchical clustering from Connie. Edmond also touched a bit on Gaussian clustering, but his algorithm is quite complicated. The group and I discussed when each of our algorithms is most appropriate to use and what the limitations of our respective algorithms are. An advantage of DBSCAN clustering, for example, is that it can form irregularly shaped clusters; in Will's example drawn out on the board, he showed how DBSCAN can make "Cheeto" clusters (a quick sketch of this contrast follows the figure below). Connie's hierarchical clustering works by making clusters out of clusters, grouping similar clusters together and producing a family-tree-like graph (a dendrogram) of the data points. Puja also talked about the Gini Index back when we were still covering Decision Trees; it measures how mixed a node's classes are and is used to determine the threshold values for splits.


DBSCAN "Cheeto" clusters

Research Project:
We started trying out a possible research project (either the main one or a side one) by collecting student responses on a Google Form we created. We brainstormed over 20 possible questions together but narrowed them down to a select set. While not knowing exactly what to expect, our goal in this side(?) project is to collect the data and see if there are any correlations between, say, playing an instrument and iPhone vs. Android preference. We also hope to have our teachers get students to take our survey. One task that I'm not exactly sure how we are going to deal with is how to transform our survey answers into a data set that can be clustered and/or analyzed (a rough idea is sketched below). Our goal is to predict how a certain person would answer one of the questions given that we know the other information in the survey (possibly through Decision Trees?).
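
One rough, hypothetical way to do that transformation is to one-hot encode the categorical answers and then fit a decision tree on everything except the question we want to predict. The column names and values below are made-up stand-ins for our real Google Form questions:

```python
# Hypothetical sketch: encode survey answers, then predict one answer
# from the rest with a decision tree.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

responses = pd.DataFrame({
    "plays_instrument": ["yes", "no", "yes", "no"],
    "plays_sport":      ["yes", "yes", "no", "no"],
    "phone":            ["iPhone", "Android", "iPhone", "Android"],
})

# One-hot encode the answer columns so they become numeric 0/1 features
# that clustering algorithms and decision trees can both consume.
X = pd.get_dummies(responses.drop(columns=["phone"]))
y = responses["phone"]

# Fit a decision tree to predict the held-out question from the others.
tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict(X.iloc[[0]]))  # predicted phone preference for row 0
```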


Caltech:
Dr. Hassibi was out of town on the Thursday we went to Caltech, so one of his grad students came in as a substitute (he was 15 minutes late, though). In the meantime, we discussed our concept maps with each other; then we talked more about graph clustering and how the data points form nodes and edges. The grad student explored directed versus undirected edges between nodes (the points). An example of a directed edge is following a Twitter account, because you can follow someone without them following you back. An example of undirected edges is Facebook, where being friends means both accounts share information with one another. The concept of centrality was brought up, but because of the limited time, we only got the equations for degree, closeness, and betweenness centrality (written out below for reference). The group and I plan on making centrality the topic of our next concept map, though, so check out the next entry for more information on that!
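
Since we only copied those equations down quickly, here are the standard normalized forms for reference, for a graph with n nodes, where d(u, v) is the shortest-path distance between u and v, σ_st is the number of shortest paths from s to t, and σ_st(v) counts those passing through v:

```latex
C_D(v) = \frac{\deg(v)}{n - 1}
\qquad
C_C(v) = \frac{n - 1}{\sum_{u \neq v} d(u, v)}
\qquad
C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}
```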



