Data Scientist’s Guide (regularly updated!)
This guide was written in preparation for a data science interview. Assuming our data is already loaded into our script or notebook, we will move on to the next steps in the data science process. This guide goes over the key concepts we should know. We don’t cover each concept in full detail, but we make sure you understand it, at least in layman's terms.
First, look at the data itself and the format of each column. Descriptive statistics then give us a quick idea of what the data contains.
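As a minimal sketch of this first look, assuming the data sits in a pandas DataFrame (the column names and values below are made up for illustration), we might do something like:

```python
import pandas as pd

# Hypothetical toy dataset; in practice this is whatever was loaded earlier.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "city": ["Austin", "Boston", "Austin", "Denver"],
    "income": [48000.0, 61000.0, 83000.0, 90000.0],
})

print(df.head())    # first rows: what the records look like
print(df.dtypes)    # format of each column (integer, float, text, ...)

# Descriptive statistics for the numeric columns:
# count, mean, std, min, quartiles, max.
summary = df.describe()
print(summary)
```

By default `describe()` only summarizes the numeric columns here; passing `include="all"` is one way to also get counts and most frequent values for text columns.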
EDA Phase:
EDA (exploratory data analysis) is a good starting point for learning about a dataset. By exploring the data we can find insights that we might not have grasped from statistical tests alone. Some examples include looking at distributions visually, making box plots and scatterplots, and looking at relationships between variables.
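The plots mentioned above each have a numeric counterpart, which can be handy in an interview setting. The sketch below uses made-up data and assumes pandas; it computes the bin counts a histogram would draw, applies the 1.5 * IQR rule that a box plot uses to mark outliers, and measures the linear relationship a scatterplot would show.

```python
import pandas as pd

# Hypothetical data: hours studied vs. exam score, with one unusual score.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [52, 55, 61, 64, 70, 73, 79, 20],
})

# Distribution: count how many scores fall into each of 4 equal-width bins
# (the same information a histogram draws).
bin_counts = pd.cut(df["score"], bins=4).value_counts().sort_index()
print(bin_counts)

# Box-plot rule: flag points more than 1.5 * IQR outside the quartiles.
q1 = df["score"].quantile(0.25)
q3 = df["score"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["score"] < q1 - 1.5 * iqr) | (df["score"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]
print(outliers)

# Relationship between variables: Pearson correlation,
# a one-number summary of what a scatterplot shows.
corr = df["hours"].corr(df["score"])
print(corr)
```

To see the plots themselves, `df["score"].plot.hist()`, `df.boxplot(column="score")`, and `df.plot.scatter(x="hours", y="score")` are the usual pandas entry points, assuming matplotlib is installed.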
Missing values & NaNs:
If there are missing values, we can approach them in a few ways: 1. Drop the missing values, 2. Impute them with some number (such as the mean or another estimate), 3. Impute them and, for each column with missing entries in the original dataset, add a new column that shows which entries were imputed.
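The three approaches above can be sketched as follows. This assumes a pandas DataFrame with made-up values, and the `_was_missing` suffix is just an illustrative naming choice:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 51.0],
    "income": [48000.0, 61000.0, np.nan, 90000.0],
})

# 1. Drop every row that contains a missing value.
dropped = df.dropna()

# 2. Impute each missing value with the mean of its column.
imputed = df.fillna(df.mean())

# 3. Impute as in option 2, but first record which entries were missing
#    in one extra indicator column per affected column.
cols_with_missing = [c for c in df.columns if df[c].isna().any()]
flagged = df.copy()
for col in cols_with_missing:
    flagged[col + "_was_missing"] = df[col].isna()
    flagged[col] = df[col].fillna(df[col].mean())

print(dropped)
print(imputed)
print(flagged)
```

Dropping is the simplest option but throws away whole rows. The indicator columns in option 3 let a model learn from the fact that a value was missing, which plain imputation hides.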
If we have categorical data, we can use either 1. Label encoding (works well with ordinal data), or 2. One-hot encoding (works well with nominal variables).
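A minimal sketch of both encodings, assuming pandas and two made-up columns: `size` has a natural order (ordinal), while `color` does not (nominal). The explicit `size_order` mapping is an illustrative choice that keeps the integer codes in the order of the categories.

```python
import pandas as pd

# Hypothetical data: one ordinal column and one nominal column.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "color": ["red", "green", "blue", "green"],
})

# 1. Label encoding: map each category to an integer.
#    For ordinal data the integers should follow the category order.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)

# 2. One-hot encoding: one indicator column per category,
#    so no artificial order is implied between colors.
one_hot = pd.get_dummies(df["color"], prefix="color")

encoded = pd.concat([df[["size_encoded"]], one_hot], axis=1)
print(encoded)
```

Label encoding a nominal column would suggest an order, for example that one color is "greater" than another, which is why one-hot encoding is usually preferred there.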