THE CODE USED IN THIS PROJECT IS AVAILABLE ON GITHUB HERE.
This data analysis explains the steps I took to solve a take-home assessment I was offered as part of the interview pipeline for a Data Consultant gig at YipitData, a company that analyzes data to provide accurate, granular insights to over 480 investment funds and innovative companies, thereby assisting them in making better business decisions.
Note: I did not use any libraries other than Pandas to perform this task since that would render this exercise pointless.
Prompt:
At YipitData, we collect and analyze unique data sets that allow us to develop detailed insights about many companies. Through these insights we provide competitive intelligence and market research that allows our clients, to better understand the markets in which they are competing, and their own performance therein.
For the purpose of this exercise, consider that one of the companies we cover is Groupon. For Groupon, the main metric we track is called gross billings. Every quarter, Groupon reports gross billings in their financial statements. We use our proprietary data to estimate gross billings across different segments, in order for Groupon’s competitors to benchmark their own performance as well as for strategic insights.
Pretending it is January 2014, before Groupon reports 4Q13 earnings, your goal is to use the attached data file to estimate Groupon’s 4Q13 North America gross billings by segment (Local, Travel, and Goods). Additionally, please describe the performance of each segment. The data is based on real data we collected for Groupon in 2013, and you will need to overcome real challenges we faced back then in order to arrive at your estimate. Remember, we are pretending it is January 2014, so you can’t use any information that is after January 2014.
Background Information:
- This dataset is an estimate of gross billings and units sold for each deal that was active in Groupon's North America segment in Q4 2013.
- Each row represents a Groupon deal that was active for some or all of Q4 2013. For each row we've provided data on units sold during Q4 2013, gross billings during Q4 2013, the date that the deal started, the URL of the deal page, the product segment of the deal (Local, Travel, or Goods), and the inventory type of the good (first-party means Groupon owns the inventory).
- Groupon offers thousands of deals on its platform, with new ones starting each day. Some deals stay active for weeks, months, or even years, while other deals are only active for a day or a few days. In the data file, we include the start date of each deal that was active in Q4 2013. Remember, these dates are not necessarily the date the billings occurred, but the date that the deal launched. Since deals can be active for many days, or even weeks or months, these deals can have billings on many days, not just the date that the deal started.
- This data is collected by finding all deals on Groupon's platform, and then tracking the quantity sold and price information that is available on each deal webpage. Since we have the price and quantity for each deal, we can generate companywide estimates of gross billings.
- However, our system that finds the deals on the Groupon North America website and adds them to our database broke from October 20 to October 30 2013 (inclusive) for the Local segment, so we did not add any Local segment deals that started between those dates to our database (you'll notice this in the dataset). This means we missed all billings from deals that started during this period, even if they remained active after the period. In other words, our dataset includes zero Local deals that started from October 20 to October 30, 2013 inclusive, and 100% of all other Local deals that were active in Q4 2013. You'll need to adjust the data to deal with this outage as you make an estimate for North America Q4 2013 billings bysegment.
- Note that purchasers of Groupon deals can also return deals if they decide they don’t want to use them. The data we present in the spreadsheet is net of returns. For example, let’s say a row of data has $100 in billings. This means there were $100 gross billings for this deal in Q4 2013, which could mean there was $100 of sales and $0 of returns, it could mean $110 of sales and $10 of returns, it could mean $500 of sales and $400 of returns. We don’t breakout the data by purchases and returns, we just provide the net number, which is called “gross billings”. This is in line with how Groupon reports Gross billings in their financial statements. Note that some rows in our data have negative gross billings. This means that these deals had more returns than purchases in Q4 2013.
- You'll notice that units sold are often in decimals. This is because we are employing estimation techniques behind the scenes - you can ignore the methodology behind these estimations and just take the data as given.
- There is a tab in the Q4 2013 North America Groupon Data XLSX file called "Historical Data," which shows you YipitData estimates from previous quarters. This tab may be helpful in understanding to what extent YipitData estimates can accurately estimate Groupon reported gross billings, which is a number that investors care about. Although we broke out the historical figures on a monthly basis, the raw data we provided for Q4 2013 does not break gross billings out by month. Additionally, we do not provide the raw data that was used to generate the historical estimates, only the Q4 2013 raw data.
- Data is not always as clean as you would hope. As such, there may be inaccuracies in the data in addition to the ones described above. If you find any, you should determine their impact and adjust accordingly if you think an adjustment is warranted. As a reminder, please describe any adjustments you made and the rationale for those adjustments.
My approach to solving this problem