Study Thoughts

Published:

This post shares some of my thoughts on systematically studying statistics, causal inference, SQL, machine learning, and data structures and algorithms. So far, I have learned most of the fundamentals. The next step is to deepen my understanding and to practice with real-world applications.

I have benefited a lot from my applied econ background, though I recognize that there are significant differences between economics, statistics, and data science. In economics, the priority is to examine whether some change in x causes a corresponding change in y, holding everything else constant. In other words, the goal is to identify the causal relationship between some x and y. If the model is valid, we can draw explanations and inferences from the estimated coefficients, i.e., the economic behavioral parameters. Normally, the model is grounded in economic theory and built with math, statistics, and logic. We then use sample data to test our hypotheses about a given economic question.
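As a concrete illustration of this kind of coefficient-based inference, here is a minimal sketch in Python using statsmodels. The demand-style example, the variable names, and the simulated "true" parameters are all my own assumptions, purely for demonstration:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical demand example: quantity as a function of price and income.
rng = np.random.default_rng(0)
n = 500
price = rng.uniform(1, 10, n)
income = rng.normal(50, 10, n)
# Assumed behavioral parameters for the simulation: -2.0 for price, 0.5 for income.
quantity = 100 - 2.0 * price + 0.5 * income + rng.normal(0, 5, n)

X = sm.add_constant(np.column_stack([price, income]))  # add an intercept term
model = sm.OLS(quantity, X).fit()

# If the model is valid (exogeneity, correct specification, etc.), the estimated
# coefficients can be read as behavioral parameters, and their confidence
# intervals support hypothesis testing about the economic question.
print(model.params)       # point estimates: intercept, price effect, income effect
print(model.conf_int())   # 95% confidence intervals
```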

For any applied field, data is vitally important. In economics, data reflect not only the consequences of economic behavior but also the incentives behind it. We cannot simply assume that sample data are representative of the population at large. We need to think deeply about the underlying incentives so that we can design the market or policy regime for the targeted population, or for different subpopulations. We also need to consider whether the sampled data suffer from some sort of selection bias that could distort the market or policy implications - that is, whether selection into a certain subpopulation was influenced by the design itself. Based on economic theory, we make reasonable assumptions about the model and the data: how to specify the model, how to choose variables, and how to handle missing data, aggregated or disaggregated data, proxy variables, and so on.

Data science/machine learning (DSML), in contrast, places more emphasis on prediction. It cares less about individual parameters and more about the final forecast of the future. Understanding how different features contribute to the final prediction is valuable, but with more complex model algorithms the path from features to prediction is rarely linear - a (generalized) linear relationship is always easier to interpret. Nevertheless, selecting an appropriate set of features has been shown to improve model performance, and when choosing features we should keep in mind both statistical and practical considerations.
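For instance, here is a small sketch of one common approach to feature selection in Python with scikit-learn. The synthetic dataset and the choice of a simple univariate filter (SelectKBest) are my own assumptions, not a prescription:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data: 20 features, only 5 of which actually carry signal.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=42)

# Baseline: logistic regression on all features.
baseline = make_pipeline(LogisticRegression(max_iter=1000))
print("all features :", cross_val_score(baseline, X, y, cv=5).mean())

# Same model with a univariate filter keeping the 5 strongest features.
selected = make_pipeline(SelectKBest(f_classif, k=5),
                         LogisticRegression(max_iter=1000))
print("top-5 features:", cross_val_score(selected, X, y, cv=5).mean())
```

In practice the statistical filter is only a starting point; the practical considerations mentioned above (cost of collecting a feature, interpretability, leakage) still drive the final choice.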

For some data science projects, the goal is to measure the impact of a specific feature through experimental design and A/B testing. For example, does the size of the action button affect users' intention to click? Does the price range affect users' interest in continuing to browse? In these cases, hypothesis testing and statistical inference play a bigger role. Given the specific problem at hand, we can decide how to set up the experiment, how to sample data (randomized controls vs. network effects), how much data is needed, how long the test should run, and so on.
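As a rough sketch of what that looks like in practice, the snippet below uses statsmodels to size a two-proportion test and then evaluate hypothetical results. The baseline click rate, the minimum detectable lift, and the observed counts are all made-up numbers:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize, proportions_ztest

# How much data do we need? Suppose a 10% baseline click rate and we want to
# detect a lift to 12% at 5% significance with 80% power (assumed numbers).
effect = proportion_effectsize(0.10, 0.12)
n_per_group = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                           power=0.80, alternative='two-sided')
print(f"users needed per group: {n_per_group:.0f}")

# After the experiment: hypothetical clicks out of users shown each button size
# (control vs. treatment), compared with a two-proportion z-test.
clicks = [430, 512]
users = [4000, 4000]
z_stat, p_value = proportions_ztest(clicks, users)
print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")
```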

Another thing that DSML emphasizes more is generalization error. A good model should generalize well to new data; it is useless if it performs perfectly on training data but poorly on test data. This tension is often framed as the bias-variance tradeoff. In comparison, applied economics is more prone to overfitting, since the objective is to estimate the treatment effect of a policy instrument while controlling for as many explanatory/confounding variables as possible. As a result, it is harder to generalize the same data and model structure to new policy evaluations.
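To make the idea concrete, here is a tiny sketch on a synthetic dataset I made up: an unconstrained decision tree nearly memorizes the training data yet generalizes poorly, while a depth-limited tree trades some training fit for a better test score:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic regression data, held out 30% for testing.
X, y = make_regression(n_samples=300, n_features=10, noise=30.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# An unconstrained tree memorizes the training set (low bias, high variance).
deep = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
print("deep tree    train R^2:", round(deep.score(X_train, y_train), 2),
      " test R^2:", round(deep.score(X_test, y_test), 2))

# A depth-limited tree accepts more bias in exchange for lower variance.
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)
print("shallow tree train R^2:", round(shallow.score(X_train, y_train), 2),
      " test R^2:", round(shallow.score(X_test, y_test), 2))
```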

Last fall, I had the chance to join the Insight Data Science Fellowships Program to experience real life as a data scientist, if only for a limited amount of time. During the program, I consulted for a startup to solve their data/business problem and provided suggestions to improve their business growth. I visited many companies and talked to many talented colleagues. I enjoy the working style at tech (and fintech) companies - creative, collaborative, and self-fulfilling - and the excitement of tackling new challenges and learning new things. More importantly, I find myself well suited to data science work: turning data into actionable insights. However, I quickly realized that I still have a lot to learn along the way, especially computer science-based content such as programming, algorithms, big data infrastructure, and other related topics. I'm still on my way; sooner or later I will succeed.
