June 3rd, 2019

Today is June 3rd, 2019 and I am updating my blog for the first time. So far I have watched two 6.0002 lectures on MIT OCW, lectures 13 and 14.

Lecture 13 focuses on Machine Learning and label classification, or predicting a label associated with a feature vector. The first method we looked at for classifying data was the nearest neighbor method, which assigns to a data point the label of its nearest neighbor. From there we transitioned naturally to the method of K nearest neighbors, which is a bit more complex.

K nearest neighbors takes a parameter K, and for each data point X it looks at the K nearest neighbors to assign X its label. For example, if K=3 and we are trying to label X as either Republican or Democrat, and X’s three nearest neighbors are R, R, D, then K nearest neighbors assigns X the label R. This has the advantage of being fast and requiring no separate training phase; however, noisy data can lead it to the wrong answer. The lecture then talked about how accuracy isn’t a reliable way of determining the strength of a model. For example, if 0.1% of a population has a disease, we could make a model that assigns “no disease” to every patient and it would be accurate 99.9% of the time, yet it would be useless as a predictor. This led to a conversation about two different testing methodologies: ‘leave one out’ and ‘repeated random sampling’.
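A minimal sketch of the idea in Python (the distance metric, the tiny dataset, and the function name are my own illustrative choices, not the lecture’s code):

```python
from collections import Counter

def knn_label(point, labeled_points, k=3):
    """Label `point` by majority vote among its k nearest neighbors.

    labeled_points: list of (feature_vector, label) pairs.
    Uses squared Euclidean distance; ties are broken by Counter order.
    """
    by_distance = sorted(
        labeled_points,
        key=lambda pl: sum((a - b) ** 2 for a, b in zip(pl[0], point)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy example: the three nearest neighbors are R, R, D -> predict R
voters = [((1, 1), "R"), ((1, 2), "R"), ((2, 2), "D"), ((8, 9), "D")]
print(knn_label((1.5, 1.5), voters, k=3))  # -> "R"
```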

In leave one out, starting with n data points, we train on n-1 data points and test on the excluded point. This process is repeated n times so that each data point is tested on exactly once, using the other n-1 points as training data. We then average the error from each iteration to obtain the mean error.

In repeated random sampling, we shuffle the data randomly and partition it into subsets. If we have n subsets, we train on n-1 of them and test on the excluded subset, recording the error for each iteration and averaging the n errors at the end.
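A rough sketch of both procedures. The toy majority-label "model" and the dataset below are my own stand-ins just to make the loops runnable; the lecture does not prescribe this particular model:

```python
import random
from collections import Counter

def train_and_test(training, testing):
    """Toy stand-in for a real model: predict the training set's majority label
    and return the error rate on the held-out examples."""
    majority = Counter(label for _, label in training).most_common(1)[0][0]
    return sum(1 for _, label in testing if label != majority) / len(testing)

def leave_one_out(data):
    """Hold out each example once, train on the rest, and average the n errors."""
    errors = [train_and_test(data[:i] + data[i + 1:], [data[i]])
              for i in range(len(data))]
    return sum(errors) / len(errors)

def repeated_random_sampling(data, num_splits=5):
    """Shuffle, cut into num_splits subsets, hold each subset out once, average the errors."""
    shuffled = data[:]
    random.shuffle(shuffled)
    size = len(shuffled) // num_splits
    subsets = [shuffled[i * size:(i + 1) * size] for i in range(num_splits)]
    errors = [train_and_test([x for j, s in enumerate(subsets) if j != i for x in s], test)
              for i, test in enumerate(subsets)]
    return sum(errors) / len(errors)

# Tiny made-up dataset of (feature_vector, label) pairs
data = [((x,), "R" if x > 3 else "D") for x in range(12)]
print(leave_one_out(data), repeated_random_sampling(data, num_splits=4))
```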

Lastly the lecture looked at logistic regression, which finds a relationship between features and the probability of an outcome.
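The core of the idea, as I understand it, is that a weighted sum of the features is pushed through the sigmoid function to get a probability. A tiny sketch (the weights and feature values here are made up purely for illustration):

```python
import math

def predict_probability(features, weights, bias):
    """Logistic regression prediction: a weighted sum of the features,
    passed through the sigmoid, gives the probability of the positive label."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-z))

# Made-up weights and features, just to show the mapping to a probability
print(predict_probability([2.0, 0.5], weights=[1.2, -0.7], bias=-1.0))
```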

Lecture 14 talked about the limitations of logistic regression and how to measure its performance. Logistic regression is chiefly limited by the fact that features are often correlated, so we can’t read too much into the weights it yields. Since the features are not independent, the question becomes “is feature 1 genuinely good, or is it simply that features 2 and 3 are bad?” Logistic regression is measured by its AUROC, the area under the receiver operating characteristic curve. A higher AUROC means a more accurate model. This wrapped up the discussion of machine learning for 6.0002.
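For my own reference, computing AUROC is a one-liner with scikit-learn (the labels and scores below are made-up numbers; the lecture’s exact code may differ):

```python
from sklearn.metrics import roc_auc_score

# True labels and the model's predicted probabilities (made-up numbers)
y_true = [0, 0, 1, 1, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]

# 1.0 is a perfect ranking of positives above negatives, 0.5 is random guessing
print(roc_auc_score(y_true, y_scores))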

Statistical sins were covered in the remaining half of lecture 14, starting with an introduction to Anscombe’s Quartet, four small datasets of x and y values. Each x column and each y column has the same mean and variance, and every dataset yields the same linear regression: y = 0.5x + 3. The data are not the same, though, and the conclusion drawn from Anscombe’s Quartet is that statistics != data: use visualization tools to actually look at the data. Errors in data can also lead to dangerous conclusions if those errors are systematic rather than random. Data scientists must understand how the data they are looking at was collected and whether the assumptions in the analysis are satisfied. Failure to meet either condition is grounds for wariness.
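This isn’t what the lecture did, but it’s an easy point to check for yourself, since seaborn ships the quartet as a sample dataset:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Four datasets that share nearly identical summary statistics
quartet = sns.load_dataset("anscombe")

for name, group in quartet.groupby("dataset"):
    print(name, round(group.x.mean(), 2), round(group.y.mean(), 2),
          round(group.x.var(), 2), round(group.y.var(), 2))

# The summary numbers match, but the scatter plots look nothing alike
sns.lmplot(data=quartet, x="x", y="y", col="dataset")
plt.show()
```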

I have begun reviewing the BeautifulSoup and Requests modules and learned some basic HTML/CSS syntax such as tags, classes, and style selectors. I made a simple program using the requests module to download the picture of Guido van Rossum from Wikipedia.

Code for a simple program that downloads Guido van Rossum’s picture from Wikipedia
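The original code was embedded as an image, so here is a minimal reconstruction along the same lines (the URL below is a placeholder for the image address copied from the Wikipedia article, and the output filename is my own choice):

```python
import requests

# Placeholder: the direct address of the image, copied from the Wikipedia article
IMAGE_URL = "https://upload.wikimedia.org/wikipedia/commons/example/Guido-portrait.jpg"

response = requests.get(IMAGE_URL)
response.raise_for_status()  # stop early if the download failed

with open("guido.jpg", "wb") as f:
    f.write(response.content)  # write the raw bytes of the image to disk
```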

I’ve also solved the maxSubArray problem after grappling with it for a few days. I made use of functools.lru_cache to get the complexity down to O(n) and am working on a memoized solution going forward. Here is the code for that project:

maxSubArray implemented in O(n) time.
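That code was also embedded as an image; the sketch below is my reconstruction of one way to get O(n) with lru_cache, caching the best subarray sum ending at each index, and not necessarily the exact code I wrote:

```python
from functools import lru_cache

def max_sub_array(nums):
    """Return the largest sum of any contiguous subarray of nums."""

    @lru_cache(maxsize=None)
    def best_ending_at(i):
        # Best sum of a subarray that ends exactly at index i:
        # either extend the best subarray ending at i - 1, or start fresh at i.
        if i == 0:
            return nums[0]
        return max(nums[i], best_ending_at(i - 1) + nums[i])

    return max(best_ending_at(i) for i in range(len(nums)))

print(max_sub_array([-2, 1, -3, 4, -1, 2, 1, -5, 4]))  # 6, from the subarray [4, -1, 2, 1]
```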

I’d like to start a web scraping project either today or tomorrow, but before I do so I want to learn about SQLite and MySQL databases. Also, it has occurred to me that 6.009 isn’t available on MIT OCW, which is a bit of a problem since I was planning on completing that course next. Later tonight I will watch the first lecture of 6.006. Tomorrow, instead of 6.009, I will start 6.005 (Software Construction). I am excited for both courses.

I have just read the first assigned reading corresponding to Lecture 1 for 6.005. The reading highlighted some of the main differences between Java and Python: Java requires a ; at the end of each statement, if and while clauses need parentheses, and Java is statically typed, meaning the types of variables are known before runtime (Python is dynamically typed). Errors in Java are caught via static checking. In Java, the length of an array must be declared up front and cannot change; if we’re unsure how many elements we need to store, we use a List instead. When declaring a variable in Java, we must specify its type at the time of declaration (ex: int a = 5 assigns the value 5 to the integer variable a; in Python, we can simply write a = 5). Variables can be made immutable by use of the keyword ‘final.’

It’s time to now watch lecture 1 of 6.006. This post will be updated afterwards and that will conclude the day for me.

The course starts with some basic introductions of concepts that will be discussed later. We then look at different ways to define ‘distance’ between documents. Different algorithms for computing document distance are provided and evaluated based on their efficiency and time complexity. The algorithm starts by using a list to count word frequencies and eventually, after four variations of the same algorithm, uses a dictionary to count these occurrences. Since dictionary lookups take constant time on average, the time complexity simplifies to O(N).
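A condensed sketch of the dictionary-based version of the idea, using the angle between word-count vectors as the distance. The helper names and the punctuation handling are my own simplifications, not the course’s exact code:

```python
import math
import string

def word_counts(text):
    """Count word frequencies with a dictionary, so each lookup is O(1) on average."""
    cleaned = text.lower().translate(
        str.maketrans(string.punctuation, " " * len(string.punctuation)))
    counts = {}
    for word in cleaned.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def document_distance(text1, text2):
    """Angle between the two word-count vectors: 0 means an identical mix of words."""
    c1, c2 = word_counts(text1), word_counts(text2)
    dot = sum(c1[w] * c2.get(w, 0) for w in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return math.acos(dot / (norm1 * norm2))

print(document_distance("the cat sat on the mat", "the cat sat on the sofa"))
```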

We then looked at peak-finding algorithms in both 1-dimensional and 2-dimensional arrays, combining our algorithm with binary search to speed up the runtime and improve its complexity. The 6.005 reading also covered how to write a for loop in Java, which is a bit different from Python’s.
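Going back to peak finding: a small sketch of the 1-D binary-search version, assuming the lecture’s definition of a peak as an element at least as large as its neighbors (the example array is made up):

```python
def find_peak(a):
    """Return the index of some peak: an element no smaller than its neighbors.
    Each step looks at the middle element and moves toward a larger neighbor,
    so the runtime is O(log n) instead of the O(n) linear scan."""
    lo, hi = 0, len(a) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] < a[mid + 1]:
            lo = mid + 1  # a peak must exist to the right
        else:
            hi = mid      # a[mid] >= a[mid + 1], so a peak exists at mid or to the left
    return lo

print(find_peak([1, 3, 4, 3, 2, 1]))  # index 2: the value 4 is a peak
```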