April 10 Technical Journal
Code:
Since the last technical journal on March 13, I have been acting as something of a middleman between the Q-learning and code sides, so I helped out a bit with the code as well. While I am still mainly focused on the Q-learning side, I worked with Edmond on improving user interaction with our game by changing how lines are entered. The current game requires players to type in pairs of coordinate points to create lines; our new idea is to assign each coordinate pairing a single number (depicted below). For example, the line connecting (0,0) and (1,0) would be line 1, and the numbering would follow a zig-zag pattern across the board.
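To make the idea concrete, here is a small Python sketch of one way the zig-zag numbering could be generated; the board size and the exact ordering of horizontal and vertical lines are assumptions on my part, not the final scheme Edmond and I settled on.

# Sketch of one possible zig-zag numbering for the lines, not our final code.
# Horizontal lines along a row of dots are numbered first, then the vertical
# lines dropping down to the next row, and so on down the board.
def build_line_numbers(rows, cols):
    number = 1
    line_to_number = {}
    for r in range(rows + 1):
        for c in range(cols):                      # horizontal lines in this dot row
            line_to_number[((c, r), (c + 1, r))] = number
            number += 1
        if r < rows:
            for c in range(cols + 1):              # vertical lines below this dot row
                line_to_number[((c, r), (c, r + 1))] = number
                number += 1
    return line_to_number

# On a board with 2x2 boxes, the line connecting (0,0) and (1,0) comes out as line 1.
print(build_line_numbers(2, 2)[((0, 0), (1, 0))])  # prints 1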
Q Learning Algorithm:
Continuing our reinforcement learning work on the Q-learning algorithm, Connie and I worked with code we found on GitHub that stores Q values in a Q-table file. Running it over the class period, it printed a status line of the form "Kolo #,#". The two numbers started out very unbalanced, with the first at 0 and the second at 99. Over time the first number climbed to around 20 while the second decreased, but then both simply fluctuated around that level. After talking with Dr. Hassibi about the time required, and hearing from him that covering all cases of Dots and Boxes would take several days, we understood why our algorithm stopped improving as quickly once it reached that point. We had tried letting the algorithm run over the weekend, but a planned power outage on Sunday disrupted that plan.
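For reference, here is a rough sketch of what we believe the GitHub code is doing with the Q-table file and the "Kolo" printout; the file name, the storage format, and my reading of the two numbers as a win/loss tally are assumptions, not documented behavior.

import pickle
from collections import defaultdict

Q_FILE = "q_table.pkl"   # hypothetical file name; the real code uses its own

def load_q_table():
    # Reload the saved table if it exists, otherwise start fresh with all zeros.
    try:
        with open(Q_FILE, "rb") as f:
            return defaultdict(float, pickle.load(f))
    except FileNotFoundError:
        return defaultdict(float)

def save_q_table(q_table):
    # Write the table back out so training can continue across runs.
    with open(Q_FILE, "wb") as f:
        pickle.dump(dict(q_table), f)

def report(wins, losses):
    # Mirrors the "Kolo #,#" status line we watched during class.
    print(f"Kolo {wins},{losses}")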
Since the last technical journal, I also learned more about the Q-table and how it influences the algorithm. Before, I was familiar with the reward and feedback structure but not with the exact numbers in the Q-table. I learned that the Q-table starts with all values at 0 and that the algorithm picks a random action to begin the learning process. Whether the computer wins or loses a game adjusts the Q value of the specific move it made, and positive Q values make the algorithm favor that move (depicted below). I also learned that the algorithm needs to work through all the different game states for each possible move, which explains the slow learning rate. We also talked about the presentation and how it will be set up, and collected our ideas so we can condense a year of learning into it.
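As a minimal sketch of the update I learned about, something like the following tabular Q-learning loop matches the description above; the learning rate, discount factor, exploration rate, and reward values are assumptions rather than the exact settings of the code we are running.

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1     # assumed learning, discount, and exploration rates
q_table = defaultdict(float)              # every (state, action) value starts at 0

def choose_action(state, legal_actions):
    # Occasionally explore with a random move; otherwise take the highest-Q move.
    if random.random() < EPSILON:
        return random.choice(legal_actions)
    return max(legal_actions, key=lambda a: q_table[(state, a)])

def update(state, action, reward, next_state, next_legal_actions):
    # Nudge Q(s, a) toward reward + GAMMA * max Q(s', a'); a win's positive
    # reward raises the value of that move, a loss's negative reward lowers it.
    best_next = max((q_table[(next_state, a)] for a in next_legal_actions), default=0.0)
    q_table[(state, action)] += ALPHA * (reward + GAMMA * best_next - q_table[(state, action)])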
Hassibi In-Class Visits:
Dr. Hassibi visited class on Monday of this week and again today. On Monday, many of us did not know he would be coming, so class time was mostly catching up and explaining what we had been doing. Then today, Wednesday, we explored more of our questions, specifically how the reward function for the Q-learning algorithm works. We also calculated the number of game states in our game versus tic-tac-toe and analyzed how that scale will affect the Q-table and the learning rate. Some time was also spent looking at the Monte Carlo tree search that Will is working on. Talking with Dr. Hassibi got us thinking about how the Q-table changes with new inputs as it learns, and about how information on future game states is not carried by Qmax alone.
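The state-count comparison we worked through can be reproduced with a quick back-of-the-envelope calculation like the one below; these are upper bounds that include unreachable positions, and the 3x3-box board size is an assumption on my part.

tic_tac_toe_states = 3 ** 9                        # each of 9 cells is X, O, or empty
print(tic_tac_toe_states)                          # 19683

rows = cols = 3                                    # assumed 3x3-box Dots and Boxes board
num_lines = rows * (cols + 1) + cols * (rows + 1)  # 24 lines on that board
dots_and_boxes_states = 2 ** num_lines             # each line is either drawn or not
print(dots_and_boxes_states)                       # 16777216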


