Machine Learning Approach for Cloud Data Analytics in IoT
3.4.1 Case Study
To illustrate the implementation of AI in the retail industry, the authors consider a case study based on a dataset from a retail store. The dataset comprises observations over a duration of 4 years, from 2011 to 2015, and was taken from Kaggle (https://www.kaggle.com/jr2ngb/superstore-data, username: jr2ngb). It has 16 variables: 10 categorical features, 5 numerical features, and 1 date feature, as follows.
# | Feature Name | Non-Null Count | Dtype |
---|---|---|---|
0 | Order Date | 51290 | datetime64[ns] |
1 | Customer_Name | 51290 | object |
2 | Segment | 51290 | object |
3 | City | 51290 | object |
4 | State | 51290 | object |
5 | Country | 51290 | object |
6 | Category | 51290 | object |
7 | Sub-Category | 51290 | object |
8 | Product Name | 51290 | object |
9 | Sales | 51290 | float64 |
10 | Quantity | 51290 | int64 |
11 | Discount | 51290 | float64 |
12 | Profit | 51290 | float64 |
13 | year | 51290 | int64 |
14 | month | 51290 | int64 |
15 | Day | 51290 | object |
The dataset contains 51,290 observations. The retail store deals broadly in three types of products: office supplies, technology, and furniture.
First, the authors attempt to understand the correlation among the various features of the dataset. To do so, they employ Pearson’s correlation, which measures the strength of the linear relationship between two variables. Its value lies between −1 and +1: a negative value indicates a negative linear correlation, 0 signifies no linear correlation, and a positive value indicates a positive linear correlation, with −1 and +1 corresponding to perfect linear relationships. The Pearson’s correlation among the various attributes of the dataset is shown in Figure 3.4.
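As a minimal sketch of how such a correlation matrix could be computed, the following uses pandas on a handful of made-up rows. The column names follow the feature table above, but the values are hypothetical and the authors' actual code is not given in the source.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the feature table.
df = pd.DataFrame({
    "Sales":    [100.0, 250.0, 80.0, 400.0, 150.0],
    "Quantity": [2, 5, 1, 7, 3],
    "Discount": [0.0, 0.1, 0.5, 0.0, 0.2],
    "Profit":   [20.0, 60.0, -15.0, 120.0, 25.0],
})

# Pairwise Pearson correlation between the numerical columns.
corr = df.corr(method="pearson")
print(corr.round(2))
```

On the full dataset, which also contains text and date columns, one would probably restrict the computation to the numerical features first, for example with `df.corr(numeric_only=True)` in recent pandas versions.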
Further, the authors demonstrate how this dataset can be used to understand how the retailer's customers are distributed across a country. This helps the retailer see where its largest market share lies and thus focus on its weaker markets. It can be done through region-wise analysis, as shown in Figure 3.5, which plots the frequency of customers across the states of India. Figure 3.5 shows that Maharashtra has the highest number of customers in the country, followed by Uttar Pradesh. By contrast, Manipur, Tripura, Chandigarh, and Pondicherry have the lowest numbers of customers.
The analysis can be drilled down further to find the best- and worst-performing cities in a state, so as to identify the specific region or branch. Such a drilled-down histogram is shown in Figure 3.6. For Maharashtra, it shows that the top-performing cities are Mumbai, Pune, Thane, and Nagpur.
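The state-level and city-level counts behind such histograms could be sketched as follows. The rows are made up, and counting distinct `Customer_Name` values per region is an assumption; the source does not say whether customers or orders were counted.

```python
import pandas as pd

# Hypothetical rows; column names follow the feature table, values are made up.
orders = pd.DataFrame({
    "Customer_Name": ["A", "B", "C", "A", "D", "E"],
    "City":    ["Mumbai", "Mumbai", "Pune", "Mumbai", "Lucknow", "Imphal"],
    "State":   ["Maharashtra", "Maharashtra", "Maharashtra",
                "Maharashtra", "Uttar Pradesh", "Manipur"],
    "Country": ["India"] * 6,
})

india = orders[orders["Country"] == "India"]

# Country level: distinct customers per state, largest first.
by_state = (india.groupby("State")["Customer_Name"]
            .nunique()
            .sort_values(ascending=False))

# Drill down: distinct customers per city within the top state.
top_state = by_state.index[0]
by_city = (india[india["State"] == top_state]
           .groupby("City")["Customer_Name"]
           .nunique()
           .sort_values(ascending=False))

print(by_state)
print(by_city)
```

With matplotlib installed, `by_state.plot(kind="bar")` would draw a bar chart of these counts.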
Figure 3.4 Pearson’s correlation among various attributes of dataset.
Figure 3.5 Histogram plot for the frequency of customers in country level (India).
It is evident from the above two graphs that Mumbai has the highest number of customers. The retailer is therefore interested in finding the best-performing product in the city, which calls for a histogram along the product dimension, shown in Figure 3.7. The office supplies category is the most in demand in the city. Within that category, the sub-categories with the highest demand are storage and labels, followed by art supplies and other stationery products such as envelopes, binders, and paper.
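A breakdown along the product dimension for one city might look like the sketch below. The rows are hypothetical, and treating each row as one order line is an assumption.

```python
import pandas as pd

# Hypothetical order lines; column names follow the feature table.
orders = pd.DataFrame({
    "City": ["Mumbai", "Mumbai", "Mumbai", "Mumbai", "Mumbai", "Mumbai", "Pune"],
    "Category": ["Office Supplies", "Office Supplies", "Office Supplies",
                 "Technology", "Furniture", "Office Supplies", "Technology"],
    "Sub-Category": ["Storage", "Labels", "Storage",
                     "Phones", "Chairs", "Art", "Phones"],
})

mumbai = orders[orders["City"] == "Mumbai"]

# Frequency of each product category in the city.
by_category = mumbai["Category"].value_counts()
top_category = by_category.idxmax()

# Frequency of each sub-category within the leading category.
by_sub = mumbai.loc[mumbai["Category"] == top_category,
                    "Sub-Category"].value_counts()

print(by_category)
print(by_sub)
```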
Figure 3.6 Histogram plot for the customers’ frequency at city level in Maharashtra.
Additionally, a box plot represents the minimum, maximum, and median of sales in each category for every segment. As shown in Figure 3.8, the highest median for the technology category comes from the consumer segment. Similarly, the highest median for the furniture category comes from the corporate segment. The home office segment has the maximum sales in office supplies, while the highest median for office supplies comes from the consumer segment.
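The summary statistics that such a box plot displays can also be tabulated directly. The following is a sketch on made-up rows; only the column names are taken from the feature table.

```python
import pandas as pd

# Hypothetical sales rows for a few segment/category combinations.
df = pd.DataFrame({
    "Segment":  ["Consumer", "Consumer", "Consumer",
                 "Corporate", "Corporate",
                 "Home Office", "Home Office"],
    "Category": ["Technology", "Technology", "Technology",
                 "Furniture", "Furniture",
                 "Office Supplies", "Office Supplies"],
    "Sales":    [100.0, 300.0, 200.0, 50.0, 150.0, 20.0, 80.0],
})

# Minimum, median, and maximum of sales per (segment, category) pair.
summary = (df.groupby(["Segment", "Category"])["Sales"]
           .agg(["min", "median", "max"]))
print(summary)
```

With matplotlib installed, `df.boxplot(column="Sales", by=["Segment", "Category"])` would draw the corresponding boxes. Note that many plotting defaults draw whiskers at 1.5 times the interquartile range rather than at the strict minimum and maximum.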
Figure 3.7 Histogram plot for Mumbai along the product dimension.
Figure 3.8 Box plot for products across consumer segment.
To analyze which days see the highest and lowest sales, the authors suggest using a pivot table, as shown in Figure 3.9. Figure 3.9 shows that every Saturday of August from 2011 to 2015 experiences the maximum sales, whereas the minimum sales are recorded on every Monday of November from 2011 to 2015. This gives the retailer a basis for its sales forecast.
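A pivot table of this kind could be built as in the sketch below. The rows are made up, and deriving `month` and `Day` from `Order Date` and summing `Sales` are assumptions about how the table in Figure 3.9 was produced.

```python
import pandas as pd

# Hypothetical orders; the dates were chosen to fall on Saturdays and Mondays.
df = pd.DataFrame({
    "Order Date": pd.to_datetime(["2011-08-06", "2011-08-13", "2011-11-07",
                                  "2011-11-14", "2011-08-08"]),
    "Sales": [500.0, 700.0, 40.0, 60.0, 120.0],
})
df["month"] = df["Order Date"].dt.month
df["Day"] = df["Order Date"].dt.day_name()

# Weekday-by-month pivot table of total sales.
pivot = df.pivot_table(values="Sales", index="Day",
                       columns="month", aggfunc="sum")
print(pivot)

# Same totals in long form, to read off the extreme (weekday, month) pairs.
totals = df.groupby(["Day", "month"])["Sales"].sum()
best = totals.idxmax()
worst = totals.idxmin()
print("highest:", best, "lowest:", worst)
```

Combinations with no orders appear as missing values in the pivot table, so the long-form totals are used to find the extremes among combinations that actually occur.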
Finally, the heatmap in Figure 3.10 shows sales by country across the globe. The United States records the maximum sales of any country, followed by France and Australia. This analysis helps the retailer understand that there is huge potential for increasing sales in the Southeast Asian region and in Oceania.
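The per-country totals that a world heatmap would be colored by can be sketched as follows. The figures are made up and only mirror the ranking described above, and aggregating by summing `Sales` is an assumption.

```python
import pandas as pd

# Hypothetical sales rows; values are illustrative only.
df = pd.DataFrame({
    "Country": ["United States", "United States", "France",
                "Australia", "India"],
    "Sales":   [900.0, 600.0, 700.0, 500.0, 200.0],
})

# Total sales per country, largest first.
sales_by_country = (df.groupby("Country")["Sales"]
                    .sum()
                    .sort_values(ascending=False))
print(sales_by_country)
```

A mapping or plotting library could then shade each country by its total to obtain a figure similar to the heatmap.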
Figure 3.9 Pivot table.
Figure 3.10 Heatmap of the world.
Thus, the above case study shows that data analytics can be quite helpful to the retail industry, where it has huge potential alongside various other promising fields.