Having spent close to four years in data-centric roles across several industries, I have seen how differently each company approaches data, and those differences have helped me build a solid framework for conducting data analyses. I will attempt to explain this framework in this post.
First things first… what is an exploratory data analysis?
- Exploratory Data Analysis (EDA for short) refers to the preliminary steps taken while solving a data problem. In practice, this usually involves creating visualizations or identifying patterns: things such as outliers, missing data and correlated features (see the sketch after this list).
- EDAs are important because they help one come up with potential hypotheses for why the aforementioned patterns occur in a given dataset.
- However, it is important to note that EDAs are not finished products: you won’t see them in any dashboard, report or data product. EDAs help inform these finished products.
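To make those patterns concrete, here is a minimal pandas sketch of the kinds of checks an early EDA might run. The file name data.csv is a placeholder, and these calls are just one way of surfacing each pattern:

```python
import pandas as pd

# Load the dataset ("data.csv" is a placeholder path).
df = pd.read_csv("data.csv")

# Missing data: null counts per column.
print(df.isna().sum())

# Correlated features: pairwise correlations between numeric columns.
print(df.corr(numeric_only=True))

# Outliers: summary statistics make extreme minimums and maximums easy to spot.
print(df.describe())
```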
Enough chit-chat, can I dive into my dataset already?
Unfortunately, no. To extract the most valuable patterns from a given dataset, one needs to spend some time understanding the available data at a high level before exploring it.
From a business standpoint, this is a great time to talk to one's stakeholders and gather as much background as possible. This is crucial because it helps one understand where to focus. Questions to ask here include:
- Is this analysis being performed just to explore what we can potentially do?
- Are we trying to infer or predict some kind of trend out of this dataset?
The answers to these questions could point you in starkly contrasting directions. Once you have the answers to your questions, you are ready to start working with your dataset.
Diving into the Data
This is the most important and most interesting step. Depending on the nature of the available dataset, it could be easy or difficult. Actions taken during this step include:
- Deducing the size of the available dataset: It is helpful to know how many samples and features are contained within the dataset. While this might seem trivial (data_frame_name.shape in pandas, for example), it is useful in a couple of ways (a short sketch follows this list):
- Knowing the size of the dataset tells you what kinds of problems you could potentially run into from a computing-resources perspective.
- You get to know whether you will ever have to sub-sample your data (applicable when the dataset is just too big to work with in full).
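As a minimal sketch of these checks (the file name and the one-million-row cutoff below are placeholders, not recommendations):

```python
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder path

# (number of samples, number of features)
n_rows, n_cols = df.shape
print(f"{n_rows} rows x {n_cols} columns")

# Rough in-memory footprint, to anticipate computing-resource issues.
print(f"~{df.memory_usage(deep=True).sum() / 1e6:.1f} MB in memory")

# Sub-sample if the dataset is way too big to explore comfortably.
if n_rows > 1_000_000:  # arbitrary threshold, for illustration only
    df = df.sample(n=100_000, random_state=42)
```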