June 10, 2019

I am making some progress on the second lab for 6.009. I have realized that my life will be MUCH MUCH easier if I can implement some helper functions to help with my get_actors_with_bacon_number function. The first helper generates a dictionary mapping each actor to the set of actors they have acted with. The second helper accepts as inputs that dictionary, the current set of actors, and the set of actors with Bacon number i-1. Using these parameters, I devised an algorithm to generate the set of actors with Bacon number i.

Utilizing both of these functions, I believe that we can find the actors with Bacon number n by looping over range(n) and at each iteration calling the second helper function, which returns the set of actors with the next Bacon number. The parameters here are the acted_with dictionary generated by the first helper, the current set of actors (initially just Kevin Bacon), and the root from which all actors with the next Bacon number are generated: a dictionary mapping 4724 (Kevin Bacon's ID in the database) to None – {4724: None}. Roughly, it looks like the sketch below. I am still getting some testing errors when I run the test.py file in the lab, but I believe I have made somewhat of a breakthrough here. All that's left for me is to fix the logical bugs in my code.
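Something like this (my own rough sketch of the idea, not the lab's actual code – the record format and helper names here are assumptions):

def build_acted_with(data):
    # data: a list of (actor_id_1, actor_id_2, film_id) records
    acted_with = {}
    for a1, a2, film in data:
        acted_with.setdefault(a1, set()).add(a2)
        acted_with.setdefault(a2, set()).add(a1)
    return acted_with

def next_bacon_set(acted_with, seen, frontier):
    # frontier: the set of actors with Bacon number i-1; return everyone
    # they acted with who hasn't been assigned a Bacon number yet
    result = set()
    for actor in frontier:
        for costar in acted_with.get(actor, set()):
            if costar not in seen:
                result.add(costar)
    return result

def actors_with_bacon_number(data, n):
    acted_with = build_acted_with(data)
    frontier = {4724}    # Kevin Bacon's ID; he alone has Bacon number 0
    seen = {4724}
    for _ in range(n):
        frontier = next_bacon_set(acted_with, seen, frontier)
        seen |= frontier
    return frontier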

I spent the better part of three hours today on leetcode solving more problems. Namely, I finally did the integer to roman numeral converter, which was not that much harder than the reverse converter. The secret here was nesting a while loop inside of a for loop, which hadn't occurred to me when I tried this problem earlier (there's a sketch of the trick at the end of this entry). I did two mock interviews as well; the first one was actually incredibly easy for me, and I finished it in 15 minutes despite the allotted time of 90 minutes. The second one I guess I 'failed', as I was only able to solve one out of the two questions. I got hung up on the following question: Given n pairs of parentheses, write a function to generate all combinations of well-formed parentheses.

For example, given n = 3, a solution set is:

[
  "((()))",
  "(()())",
  "(())()",
  "()(())",
  "()()()"
]

I'll try to solve this tomorrow; I ran out of time in the mock interview (the first question took me ~30 minutes, which left only ~30 minutes for this one – sadly not enough time). Hopefully I can find a creative way to solve it tomorrow!
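As for the roman numeral trick I mentioned above, the while-inside-a-for structure looks roughly like this (a sketch, not my exact submission):

def int_to_roman(num):
    # Value/symbol pairs from largest to smallest, including the
    # subtractive forms (CM, CD, XC, XL, IX, IV)
    pairs = [(1000, 'M'), (900, 'CM'), (500, 'D'), (400, 'CD'),
             (100, 'C'), (90, 'XC'), (50, 'L'), (40, 'XL'),
             (10, 'X'), (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I')]
    result = ''
    for value, symbol in pairs:
        while num >= value:    # the nested while: repeat each symbol as many times as it fits
            result += symbol
            num -= value
    return result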

June 8th, 2019 & June 9th, 2019

Well, I finished the first lab of 6.009 and have uploaded it to my GitHub page. It passes all the test cases and seems to work as described in the assignment. Lab 2 is MUCH harder so far; I've been grappling with implementing the "get_bacon_number" function. Essentially, we are provided with a large database of actors and movies; the task is to find the set of actors with Bacon number n. Bacon number n is defined as follows: every actor who has acted with Kevin Bacon in a movie is assigned a Bacon number of 1, every actor who acted with someone who acted with Kevin Bacon is given a Bacon number of 2, and so on. Kevin Bacon himself has Bacon number 0. Honestly, this is a lot harder than any of the other projects I've done so far, so I will probably spend the rest of the week completing it. Syntactically, there is nothing new or challenging here, as the focus is on sets & dictionaries, which I am comfortable with. Conceptually, it is more difficult than other projects I've done. I'm a little stuck on this project at the moment, so I've decided to pivot away from it and learn other topics in CS.

I watched the 3rd and 4th lectures for 6.006 and have done the first two parts of homework one. These are the easiest sections of the homework; all I had to do was rank given expressions in order of their asymptotic growth, and evaluate the big-O time complexity of given recurrence relations (for example, T(n) = 2T(n/2) + O(n) resolves to O(n log n)). Nothing too bad, really.

I'm pausing on 6.005 for the moment, as the only reason I took that class up was that I couldn't find 6.009 on OCW. Taking 6.006 and 6.009 concurrently will be rigorous enough, and I hope to have completed both of these courses in the next 6-8 weeks (I'm being generous with my timeline here).

Spent some more time on leetcode solving mostly Binary Tree traversal problems. I feel it has given me a more solid understanding of recursion in general. All of the problems solved so far have been ‘easy’ but still challenging.

June 6, 2019 & June 7, 2019

I spent the bulk of my time yesterday solving problems on SQLzoo.net and leetcode.com. I built a program that converts Roman Numerals to integers & also one that converts integers to Roman Numerals. I watched the third lecture of 6.006 (Algorithms and Data Structures), where we talked about heaps, max & min heaps, max-heapify, and heapsort. I think the professor is trying to build up to introducing algorithms that can sort certain data structures in time faster than O(n log n). Heapsort doesn't quite accomplish this; however, it is getting us closer to that goal. I spent some more time trying to do lab 1 for 6.009, but came across a logical bug in my code that was stumping me, so I decided to walk away from the problem for a bit. Fortunately, that decision turned out to be prudent, as today I solved it!

As the picture above demonstrates, I was unable to pass the test_blur test cases, and the reason is that I had improperly implemented the correlate method.

Here is the properly implemented correlate method for image processing
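In outline, the idea is the following (a rough sketch of my approach rather than the exact lab code; get_pixel here is a simplified stand-in for the lab's pixel accessor):

def get_pixel(image, x, y):
    # Clamp out-of-range coordinates to the nearest edge pixel
    x = min(max(x, 0), image['width'] - 1)
    y = min(max(y, 0), image['height'] - 1)
    return image['pixels'][y * image['width'] + x]

def correlate(image, kernel):
    # offset = distance from the center of the (square) kernel to its edge
    offset = len(kernel) // 2
    result = {'height': image['height'], 'width': image['width'], 'pixels': []}
    for y in range(image['height']):
        for x in range(image['width']):
            total = 0
            for kernx in range(len(kernel)):
                for kerny in range(len(kernel[kernx])):
                    # The fix: subtract offset rather than a hard-coded 1
                    value = kernel[kernx][kerny]
                    total += value * get_pixel(image, x - offset + kerny, y - offset + kernx)
            result['pixels'].append(total)
    return result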

My logical error lay in the first line of the innermost nested for loop (the line 'for kerny in range(len(kernel[kernx]))'). I had originally been trying to retrieve the pixel (x-1+kerny, y-1+kernx), but after scrutinizing this code for the better part of 2 hours today, I realized that this would only work on a 3×3 kernel. To generalize, I had to retrieve the pixel at (x - len(kernel)//2 + kerny, y - len(kernel)//2 + kernx), since len(kernel)//2 is the distance from the center of the kernel to its edge. It turns out that for 3×3 kernels this number is always 1. I had conceptualized this method on 3×3 kernels, but didn't realize that this offset needed to be generalized for larger kernels as well. This took a good deal of reasoning and time for me to solve; I was almost convinced at one point that it was the test cases that were broken, and not my code. You can probably imagine my feelings of vindication when I ran the tests and saw the following message:

Success! Once I had the correlate method working properly, the rest was pretty straightforward. All that’s left for me to implement is the test_edges method, which I would have done today had I not run out of time (it is a Friday night after all, and I am allowed to have some fun, right?!)

My copy of "The Self-Taught Programmer" also came in the mail today and I began reading it. I skipped through the first 15 chapters or so, since they're just basic Python syntax rules. The sections on reading and writing files are nicely prepared though, with a thorough introduction to the with statement used to open files for reading or writing, and also a discussion of how to handle CSV files. I began reading up on regular expressions and printed 'The Zen of Python' poem, a pretty logical starting point for practicing them. I'm still trying to wrap my head around this a bit more, so I decided not to spend much more time delving deeper into regular expressions.
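The file-handling pattern from those sections boils down to something like this (a minimal sketch; the filename and rows are made up):

import csv

# 'with' closes the file automatically, even if an error occurs mid-write
with open('scores.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'score'])
    writer.writerow(['Ethan', 95])

# Read the rows back
with open('scores.csv', newline='') as f:
    for row in csv.reader(f):
        print(row)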

Side note: in this book, they talk about the FizzBuzz program and how it's a solid interview question for filtering out candidates. Umm, is this real? I made this program on my third day of learning Python; it seemed trivially easy. This makes me doubt either the industry or the author of this book, and I don't doubt the industry.

That’s all for today!

June 5th, 2019

Welp, I didn’t have time to update the blog yesterday so I’m doing it now. I went to NYC yesterday, planned on writing my blog post when I came home but I came home wayyyy later than anticipated and just passed out.

Didn't have the most productive day yesterday; since I left for NYC at 12, I only had time to do a few hours of coding. I found a repository for 6.009 readings and projects, so I've started PSet 0 for 6.009 and completed some readings for the course. I implemented the invert method, which reflects pixels about a middle gray value (0 becomes 255, 255 becomes 0).
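The core of invert is a single arithmetic step per pixel; roughly (a sketch, assuming the image is represented as a dict with height, width, and a flat list of pixels):

def invert(image):
    # Reflect each pixel value about the midpoint of the 0-255 range
    return {'height': image['height'],
            'width': image['width'],
            'pixels': [255 - p for p in image['pixels']]}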

That’s basically it for yesterday. I spent the bulk of my time yesterday at a networking event for WalMart e-Commerce, then went out to a few bars with my friends in NYC. Overall, a very fun day, but not the most productive with respect to coding.

June 4, 2019

I've watched the second lecture of 6.006. This lecture discussed different sorting algorithms and the advantages of each; the fastest covered was merge sort, which sorts a list in O(n log n) time. This lecture was mainly a review of sorting as we learned it in 6.0001, the main difference being the analysis done on each algorithm and the conversations that followed on complexity and recurrence trees. I need to review some of the math behind solving recurrence relations, but other than that, this lecture was pretty comfortable and interesting.
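For my own reference, the basic shape of merge sort (my sketch, not the lecture's code):

def merge_sort(arr):
    # Split in half, sort each half recursively, then merge in linear time
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left, right = merge_sort(arr[:mid]), merge_sort(arr[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]

The recurrence here is T(n) = 2T(n/2) + O(n), which is where the n log n comes from.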

It appears that there are no lecture recordings for 6.005, so I have to make do with the readings posted on OCW. I'm hoping that will be enough for me to complete the assignments. The second reading focused on "Basic Java." This reading continued to highlight some of the main differences between Python and Java and discussed the following methods for lists: .size(), .add(), and .isEmpty() (the Python analogs are len(), .append(), and checking if not lst). A short discussion on mutable and immutable types followed, with the relevant snapshot diagrams. This was still mainly review from 6.0001, so nothing too bad here.

Following these concepts was an introduction to more Java data structures. A Map is to Java as a dictionary is to Python. We use the following methods on maps:

map.put(key, val) <—> map[key] = val in Python

map.get(key) <—> map[key] in Python

map.containsKey(key) <—> key in map in Python

map.remove(key) <—> del map[key] in Python

How to declare generic types:

List<String> cities;

Set<Integer> numbers;

Map<String, Integer> hash;

Notice how the types of each basic data structure are capitalized, and how we must declare which variable types are contained in the generic type. This is very different from Python.

This concludes the 2nd reading for 6.005. I'm going to move on to learning some basic SQL syntax next to complement my Python and aid my ambitions to build and manipulate databases.

I have started using sqlzoo.net to learn some of the basic syntax rules of SQL. I've been introduced to the following keywords and aggregate functions: SELECT, FROM, WHERE, SUM, and COUNT. This website has some pretty basic problems you can solve to get you coding in SQL right away. Here is a basic query I wrote to select the countries with the largest area in each continent. The table here is called 'world.'

-- Keep each country whose area is at least as large as
-- every area in its own continent
SELECT continent, name, area FROM world x
  WHERE area >= ALL
    (SELECT area FROM world y
      WHERE y.continent = x.continent
        AND population > 0)

I’ve spent the last hour or so on leetcode solving problems. I’ve solved two leetcode questions today; not amazing but not bad either. I am getting better at these, but a lot of the easy questions still require me to think pretty hard about them. It’s good practice though.

Here is the code for one of the problems I solved today, which implements a dynamic programming solution using the bottom-up approach.
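To give the flavor of bottom-up DP, here's the pattern sketched on the classic climbing-stairs problem (an illustrative stand-in, not necessarily the problem in the screenshot):

def climb_stairs(n):
    # dp[i] = number of distinct ways to reach step i, taking 1 or 2 steps
    # at a time; built up from the base cases instead of recursing top-down
    if n <= 2:
        return n
    dp = [0] * (n + 1)
    dp[1], dp[2] = 1, 2
    for i in range(3, n + 1):
        dp[i] = dp[i - 1] + dp[i - 2]
    return dp[n]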

The following parts of this blog post are only tangentially related to programming. I decided to update my resume to include the MIT OCW classes I’ve completed as well as some of the more substantial projects I’ve done. I am also beginning to reach out to alumni in my network to pick their brains and ask for guidance in this journey. That’s all for today!

June 3rd, 2019

Today is June 3rd, 2019 and I am updating my blog for the first time. So far I have watched two 6.0002 lectures on MIT OCW, lectures 13 and 14.

Lecture 13 focuses on machine learning and label classification, or predicting a label associated with a feature vector. The first method we looked at for classifying data was the nearest neighbor method, which assigns to a data point the same label as its nearest neighbor. From there we transitioned naturally to the method of K nearest neighbors, which is a bit more complex.

K nearest neighbors takes a parameter K, and for each data point X it looks at the K nearest neighbors to assign X its label. For example, if K=3 and we are trying to label whether X is Republican or Democrat, and X's three nearest neighbors are R, R, D, then K nearest neighbors assigns X the label R. This has the advantage of working quickly and not requiring a separate training phase; however, with noisy data it can give the wrong answer. The lecture then talked about how accuracy isn't a reliable way of determining the strength of a model. For example, if 0.1% of a population has a disease, we can just make a model that assigns "no disease" to every patient, and it would be accurate 99.9% of the time, yet tell us nothing useful. This led to a conversation about two different testing methodologies: 'leave one out' and 'repeated random sampling'.

In leave one out, starting with n data points, we train on n-1 data points and test on the excluded data point. This process is repeated n times so that we have tested on each data point individually, using the other n-1 data points as training data each time. We then average the errors generated across iterations to obtain the mean error.

In repeated random sampling, we shuffle the data randomly and partition it into subsets. If we have n subsets, we train on n-1 subsets and test on the excluded subset, taking the error for each iteration and averaging the n errors at the end.
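A toy version of leave one out, using a 1-nearest-neighbor classifier (my own sketch, not the lecture's code):

def nearest_neighbor_label(train, test_point):
    # train: a list of (point, label) pairs, with points as (x, y) tuples
    def sq_dist(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return min(train, key=lambda pair: sq_dist(pair[0], test_point))[1]

def leave_one_out_error(data):
    # Hold out each point once, train on the rest, and count mistakes
    errors = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if nearest_neighbor_label(rest, point) != label:
            errors += 1
    return errors / len(data)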

Lastly, the lecture looked at logistic regression, which finds a relationship between features and the probability of an outcome.

Lecture 14 talked about the limitations of logistic regression and how to measure its performance. Logistic regression is chiefly limited by the fact that features are often correlated, and therefore we can't read too much into the weights they yield. Since values are not independent, the question becomes "is label 1 good, or is it simply that labels 2 and 3 are bad?" Logistic regression is measured by its AUROC, the area under the receiver operating characteristic curve. A higher AUROC means a more accurate model. This wrapped up the discussion of machine learning for 6.0002.

Statistical sins were covered in the remaining half of lecture 14, starting with an introduction to Anscombe's Quartet, four small data sets of x & y values. Each x column and each y column has the same mean and variance, and all four data sets yield the same linear regression: y = 0.5x + 3. The data sets are not the same though, and the conclusion drawn from Anscombe's Quartet is that statistics != data: use visualization tools to actually look at the data. Errors in data can also lead to dangerous conclusions if those errors arise systematically as opposed to randomly. Data scientists must understand how the data they are looking at was collected and whether the assumptions in the analysis are satisfied. Failure to meet either condition is grounds for wariness.

I have begun reviewing the BeautifulSoup and Requests modules and learned some basic HTML/CSS syntax such as tags, classes, and style selectors. I made a simple program using the requests module to download the picture of Guido van Rossum from Wikipedia.

Code for a simple program that downloads Guido van Rossum's picture from Wikipedia
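It boils down to something like this (a sketch; the URL below is a stand-in for the actual image address):

import requests

# Stand-in URL: the real script pointed at the image's Wikimedia address
url = 'https://upload.wikimedia.org/wikipedia/commons/e/e2/Guido-portrait-2014.jpg'

response = requests.get(url)
response.raise_for_status()   # stop here if the download failed

# Write the raw bytes to a local file
with open('guido.jpg', 'wb') as f:
    f.write(response.content)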

I've also solved the maxSubArray problem after grappling with it for a few days. I made use of functools.lru_cache to memoize the recursion, giving O(n) complexity, and am working on an explicit bottom-up version going forward. Here is the code for that project:

maxSubArray implemented in O(n) time.
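The recurrence behind it, sketched with lru_cache (along the lines of what I described, though the screenshot has the real code):

from functools import lru_cache

def max_sub_array(nums):
    @lru_cache(maxsize=None)
    def best_ending_at(i):
        # Largest sum over all subarrays that end exactly at index i
        if i == 0:
            return nums[0]
        return max(nums[i], best_ending_at(i - 1) + nums[i])
    # The answer is the best subarray ending at some index; note the
    # recursion depth grows with len(nums), so this suits moderate inputs
    return max(best_ending_at(i) for i in range(len(nums)))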

I'd like to start a web scraping project either today or tomorrow, but before I do so I want to learn about SQLite and MySQL databases. Also, it has occurred to me that 6.009 isn't available on MIT OCW, which is a bit of a problem since I was planning on completing that course next. Later tonight I will watch the first lecture of 6.006. Tomorrow, instead of 6.009, I will start 6.005 (Software Construction). I am excited for both courses.

I have just read the first assigned reading corresponding to Lecture 1 for 6.005. The reading highlighted some of the main differences between Java and Python – Java requires a ; at the end of each statement, if & while clauses need parentheses, and Java is statically typed, meaning the types of variables are known before runtime (Python is dynamically typed). Errors in Java are caught via static checking. In Java, the number of elements an array can hold must be declared up front and cannot change. If unsure about how many elements to store, we use a List instead. When declaring a variable in Java, we must specify its type at the time of declaration (ex: int a = 5 assigns a value of 5 to the integer variable a; in Python, we can simply write a = 5). Variables can be made immutable by use of the keyword 'final.'

It's now time to watch lecture 1 of 6.006. This post will be updated afterwards, and that will conclude the day for me.

The course starts with some basic introductions of concepts that will be discussed later. We then look at different ways to define 'distance' between documents. Different algorithms for computing document distance are provided and evaluated based on their efficiency and time complexity. The algorithm starts by using a list to count word frequencies and, after four variations of the same algorithm, eventually uses a dictionary to count these occurrences. Since dictionaries achieve constant-time lookup, the time complexity simplifies to O(n).
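The dictionary-based counting step, roughly (my own sketch):

def count_frequencies(words):
    # One pass over the words; each dictionary lookup/insert is O(1)
    # on average, so the whole count is O(n)
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts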

(On the reading side, I also saw how to use a for loop in Java, which is a bit different from Python's.) We then looked at peak-finding algorithms in both one-dimensional and two-dimensional arrays, combining the straightforward linear scan with binary search to speed up the runtime and improve its complexity.
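The one-dimensional peak finder with binary search, roughly (my own sketch):

def find_peak_1d(arr):
    # A peak is an element at least as large as its neighbors. If the right
    # neighbor of the midpoint is larger, some peak must exist on the right
    # half (and symmetrically for the left), so we discard half the array
    # each step: O(log n) instead of the O(n) linear scan.
    lo, hi = 0, len(arr) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] < arr[mid + 1]:
            lo = mid + 1   # climb toward the larger right neighbor
        else:
            hi = mid
    return lo   # index of a peak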

Welcome! Who I am & the purpose of this blog

My name is Ethan King and I graduated from McGill University on May 31, 2019 with a degree in Mathematics. Beginning in the middle of April, I became fascinated by the idea of a career in computer science and started to teach myself how to code in Python.

Thecodingjunkie.com serves to chronicle my progress, achievements, frustrations, obstacles and goals. I will post here what I’ve learned, programs I have made, problems I’ve encountered and small victories or accomplishments that I feel are worth sharing.

To give some context as to where I am in my programming journey right now: it is June 2nd, 2019 and I am auditing MIT classes through OpenCourseWare. I have just finished the last problem set, PSet 5, for 6.0002 (Introduction to Computational Thinking and Data Science). I have watched the lectures and completed the psets for 6.0001 and 6.189. My plan going forward is to finish the last 3 lectures of 6.0002 and then take 6.009 (Fundamentals of Programming) and 6.004 (Computation Structures) concurrently.

Problem set 5 asked us to model climate data and evaluate which models best explained the data in order to make predictions about future temperatures. Models were generated from training intervals and compared by their R^2 values. I then ran these models on testing intervals to see which could best predict the behavior of new data. As I expected, the model of degree 20 had a great R^2 of .997 but badly overfit the data, leading to a high RMSE of 1.4. In other words: while the degree-20 model did a fantastic job of explaining the data it was generated on, it did a poor job of explaining and making predictions for data it had not yet encountered. The best model, as determined both by how well it could explain the testing data (smallest RMSE) and by its R^2 value on the training interval, was the linear model generated from the five-year moving average of yearly temperatures.
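The evaluation machinery boils down to something like this (a sketch using numpy; the pset defines its own helper names):

import numpy as np

def fit_model(x, y, degree):
    # Fit a degree-d polynomial to the training interval
    return np.polyfit(x, y, degree)

def estimate(coeffs, x):
    # Evaluate the fitted polynomial at the points x
    return np.polyval(coeffs, x)

def r_squared(y, estimated):
    # Fraction of the variance in y that the model explains
    residual = ((y - estimated) ** 2).sum()
    total = ((y - y.mean()) ** 2).sum()
    return 1 - residual / total

def rmse(y, estimated):
    # Root-mean-square error on the testing interval: lower is better
    return np.sqrt(((y - estimated) ** 2).mean())

A high training R^2 paired with a high testing RMSE is exactly the overfitting signature the degree-20 model showed.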