SQL Projects for Data Analysis — Sales Analysis Superstore 2023

Ridwan Puadi
12 min readOct 7, 2023

--

Introduction

SQL (Structured Query Language) is a powerful data analysis and manipulation tool, playing a crucial role in drawing valuable insights from large datasets in data science

Objective

The main goal of this data analysis project is to conduct an in-depth analysis of sales data to gain valuable insights into sales performance, customer segmentation, hot product types, and identify emerging trends, then develop data-driven business strategies to improve decision-making.

Dataset Overview and Data Processing

Superstore data sets include transactional information, customer demographics, product details, and sales and profit information that is important for sales analysis.

SQL Queries for Analysis

These queries involve aggregating sales data, calculating key performance metrics such as revenue, profit, and sales growth, and grouping data based on dimensions like time, region, or product category.

Key Insights and Findings

The sales analysis yields valuable and actionable insights for decision-making. It uncovers sales performance trends over time, pinpoints best-selling products or categories, and highlights underperforming regions. Analyzing customer demographics aids in identifying target segments for personalized marketing strategies. Additionally, the analysis may reveal seasonality effects, correlations between sales and external factors, and opportunities for cross-selling and upselling. With these insights, businesses can make informed decisions, optimize operations, and drive growth and success.

Analytics Workflow

This analysis will be carried out with SQL, Python Pandas, and Tableau. SQL is used as the overall analysis and this time I used Big Query, with a little help from Python Pandas to check that the data was appropriate and cleaned, while Tableau will provide a clear display of the findings. The following superstore sales analysis will follow 6 Data Analysis frameworks, namely:

Documentation from Revou

1. Business Understanding:

Determine business objectives and data analysis to produce project plans as needed.

In this step, I will define a business problem, a business question I got from the Kaggle site. I then subtracted, added, and modified questions to target customer segments, regions, and product categories that the Superstore avoided to increase profitability.

Question from Kaggle site from superstore dataset

a. Business Objectives:

  • Are existing customer segments and product categories optimizing profits?
  • What trends can we get from sales identification?
  • How can we leverage these insights to make recommendations?

b. Deliverables:

  • A clear summary of the business objectives.
  • Complete documentation of all data cleaning, manipulation, and analysis.
  • Dashboards with visualizations and key insights.
  • Business recommendations based on insights and analysis results.

2. Data Exploration:

Describing and verifying data quality to ensure it contains the data we need.

In this step, I will identify the superstore dataset

  • The data is publicly available through data. world under https://data.world/ehughes/superstore-sales-2023
  • It comes with 9994 rows. data recorded between January 3, 2020 (first order date) to December 30, 2023 (last order date), so we will use the order date range to represent our 4 years of business)
  • The data contains 21 columns namely; Row ID, Order ID, Order Date, Ship Date, Ship Mode, Customer ID, Customer Name, Segment, Country, City, State, Postal Code, Region, Product ID, Category, Sub-Category, Product Name, Sales, Quantity, Discounts and Profits
  • Moving on to the data processing, we will use Python Pandas for cleaning

3. Data Preparation:

Clean and organize data to get useful information, then integrate the data.

I will process and clean the superstore data with the help of Python Pandas. After the data is cleaned and saved in CSV file format, I will upload the file into Big Query for the data manipulation and analysis process by answering the questions that were asked at the beginning.

Data Cleaning Step by Step:

  • Fill/erase missing value
  • Inconsistent data standardization
  • Remove/set outliers
  • Remove duplicate data
Python Pandas help analyze results

While exploring the dataset, we may do and notice the following;

  • Our data looks correct and consistent. Everything looks well structured.
  • With the ‘remove duplicates’ command, there is no duplicate data with all the correct parameters for the customer in all columns. So all rows have some variation so as not to produce duplicate data.
  • We ensure the Order Date and Ship Date are formatted well by formatting them to (existing) dates i.e. ‘YY-MM-DD’ and formatting the Sales and Profit columns from numbers to currency since we are talking about finances. and the discount column will be formatted from number to percentage.

Now the superstore dataset is ideal for us to analyze to find relationships, trends, and patterns that will give us a competitive advantage and fully solve business goals.

4. Exploratory Analysis / Modeling:

An insight/analysis, choosing a modeling technique, building a model, and assessing the model built.

In the analysis step, to answer business goals I will organize the questions into three parts, namely: Customer analysis, demography analysis, and product analysis.

a. Customers Analysis

Profile customers based on their purchase frequency — calculate the purchase frequency of top customers?

SELECT Customer_ID, Customer_Name, Segment, SUM (Quantity) purchase_frequency
FROM `ridwanpuadi.project.superstore23`
GROUP BY Customer_ID, Customer_Name, Segment
ORDER BY 4 DESC
LIMIT 10;

The code above produces the following:

Total top purchase frequency of each customer

Following are customer profiles based on their purchasing frequency. It is known that there are only 8 customers who have made repeat purchases more than 120 times

Do high-frequency customers contribute more revenue?

SELECT Customer_ID, Customer_Name, SUM (Quantity) purchase_frequency, SUM (Profit) total_revenue
FROM `ridwanpuadi.project.superstore23`
GROUP BY Customer_ID, Customer_Name
ORDER BY 3 DESC
LIMIT 10;

This produces the following results:

Results of customer purchase frequency with total revenue

From the results of this query, we can see that customers who have repeat purchases do not necessarily contribute the most revenue to the company.
I can see more clearly which customers provide the highest revenue margin by answering the next question

Are they also profitable — what is the profit margin across the group?

This is solved with the code below:

SELECT Customer_ID, Customer_Name, SUM (Quantity) purchase_frequency, ROUND (SUM(Profit)/SUM (Sales)*100, 2) profit_margin
FROM `ridwanpuadi.project.superstore23`
GROUP BY Customer_ID, Customer_Name
ORDER BY 4 DESC
LIMIT 10;
Top customers who provide the greatest margin based on purchase frequency

It turns out that customers who provide the largest revenue margins are those who purchase less than 50 times.
At this point, an important insight is that the total purchasing frequency of top customers does not necessarily provide a large profit margin.

Which customer segments maximize profits and sales?

Let’s run the query

SELECT Segment,  SUM (Sales) total_sales, SUM (Profit) total_profit 
FROM `ridwanpuadi.project.superstore23`
GROUP BY Segment
ORDER BY total_profit DESC;

The customer segment that provides profits and sales comes from consumers, followed by corporate and home officers.

The following is the total number of customers it has

SELECT COUNT(DISTINCT customer_id) total_customers
FROM `ridwanpuadi.project.superstore23`
Total number of customers

The company had 793 customers between 2020 and 2023.

b. Demography Analysis

What are the total sales and profits each year?

These years are grouped by order date for four consecutive years, so we can observe data from 2020 to 2023.

SELECT DATE_TRUNC(DATE(DATE(Order_Date)), YEAR) year,
SUM(Sales) AS total_sales,
SUM(profit) AS total_profit
FROM `ridwanpuadi.project.superstore23`
GROUP BY year
ORDER BY year ASC;

This query produced the following result:

Total sales and Total profits for each year

The data above shows how profits over the years have continued to increase with each year being more profitable than the next despite a decline in sales in 2021.

What is the highest total profit and total sales for the region?

SELECT Region, SUM (Sales) total_sales, SUM (Profit) total_profit
FROM `ridwanpuadi.project.superstore23`
GROUP BY 1
ORDER BY 2 DESC;
Total Profits and Sales by Region

We can observe above that the West region is the one with the most sales and brings us the highest profits. The East region is pretty good looking for our company too. Those 2 regions are areas of interest if we want to maximize our profits and expand our business. Concerning the South region, we do not gain a lot of revenue but still, the profits are there. It is the Central region that is quite alarming as we generate way more revenue than the South region but do not make at least the same profits over there. The Central region should be on our watchlist as we could start to think about how we could maybe put our resources in the other regions instead. Let’s observe each region’s profit margins for further analysis with the following code:

SELECT Region, ROUND (SUM(Profit)/SUM (Sales)*100, 2) profit_margin
FROM `ridwanpuadi.project.superstore23`
GROUP BY 1
ORDER BY 2 DESC;
Profit margins by region

Profit margin is a measure of a company’s profitability and is expressed as the percentage of revenue that a company keeps as profit. So we see that the West and East are very good. The Southern region despite selling less than half of the revenue of the Western region has a good profit margin of 11.93% which is very good. However, the Central region is still not convincing.

What is the highest total profit and total sales for the state?

Can be obtained by query;

SELECT State, SUM (Sales) total_sales, SUM (Profit) total_profit
FROM `ridwanpuadi.project.superstore23`
GROUP BY State
ORDER BY 3 DESC
LIMIT 10

Here we find out which state is the highest and lowest, we take the 10 highest states first, and the results are as follows;

Top 10 State’s total sales and profits with their profit margins

The decision was to include profit margins to see this through a different lens. The data shows the top 10 most profitable states. Besides we can see the total sales and profit margins. Profit margins are important and it allows us to think long-term as an investor to see potential big markets. In terms of profits, California, New York, and Washington are our most profitable markets and most present ones, especially in terms of sales. Which, are so high that it would take so much for the profit margins to be higher. However, the profits are great and the total sales show that we have the best part of our business share at those points so we need to boost our resources and customer service in those top states.

let’s run a query on the 10 states that have the lowest sales and profits

SELECT State, SUM (Sales) total_sales, SUM (Profit) total_profit
FROM `ridwanpuadi.project.superstore23`
GROUP BY State
ORDER BY 3 ASC
LIMIT 10
Bottom 10 State’s total sales and profits

Our least profitable markets are listed above. The top 3 are Texas, Ohio, and Pennsylvania. Texas and Pennsylvania are especially alarming as they have more than 100,000 in sales with Texas having more sales than Washington (which made $33402.70 in profits) but made a loss of $25729.29.

What is the highest total profit and total sales for the city?

SELECT City, SUM (Sales) total_sales, SUM (Profit) total_profit
FROM `ridwanpuadi.project.superstore23`
GROUP BY City
ORDER BY 3 DESC
LIMIT 10
Top 10 Cities’ total sales and profits with their profit margins

The top 5 cities that we should focus on are New York City, Los Angeles, Seattle, San Francisco, and Detroit.

The bottom 10 cities are:


SELECT City, SUM (Sales) total_sales, SUM (Profit) total_profit
FROM `ridwanpuadi.project.superstore23`
GROUP BY City
ORDER BY 3 ASC
LIMIT 10
Bottom 10 Cities’ total sales and profits with their profit margins

The bottom 5 are Philadelphia, Houston, San Antonio, Lancaster, and Chicago. We have 2 cities from Texas in our top 5so it is clear that we have started redesigning some strategies and how we operate in those cities.

c. Product Analysis

5 types of subcategories that generate the highest sales and profits?

SELECT Sub_Category, SUM (Sales) total_sales, SUM (Profit) total_profit, 
ROUND (SUM(Profit)/SUM (Sales)*100, 2) profit_margin
FROM `ridwanpuadi.project.superstore23`
GROUP BY Sub_Category
ORDER BY total_profit DESC
LIMIT 5;
Subcategories with their total sales, total profits, and profit margins

Top 5 subcategories, our biggest profits come from Copiers, Telephones, Accessories, Paper and blinders. The profits and profit margins on Copiers and Paper are especially attractive in the long term.

Now we check the bottom 5 subcategories;

SELECT Sub_Category, SUM (Sales) total_sales, SUM (Profit) total_profit, 
ROUND (SUM(Profit)/SUM (Sales)*100, 2) profit_margin
FROM `ridwanpuadi.project.superstore23`
GROUP BY Sub_Category
ORDER BY total_profit ASC
LIMIT 5;
Subcategories with their total sales, total profits, and profit margins

The company’s losses came from tables, bookshelves, and equipment which the company was unable to break even. All three should be reviewed further as the sales are there, (except Inventory) but we cannot make a profit from them.

Categories that generate the highest sales and profits?

SELECT Category, SUM (Sales) total_sales, SUM (Profit) total_profit, 
ROUND (SUM(Profit)/SUM (Sales)*100, 2) profit_margin
FROM `ridwanpuadi.project.superstore23`
GROUP BY Category
ORDER BY total_profit
Categories with their total sales, total profits, and profit margins

Out of the 3, it is clear that Technology and Office Supplies are the best in terms of profits. Plus they seem like a good investment because of their profit margins. Furniture is still making profits but does not convert well overall.

How is the price for the product category based on average sales — Is there an increase in profit margin with price reductions/discounts?

First, let’s observe the correlation between discounts and average sales to understand how impactful one is to the other.

SELECT Discount, AVG(Sales) avg_penjualan
FROM `ridwanpuadi.project.superstore23`
GROUP BY 1
ORDER BY 1
Discount vs Avg Sales

They almost have no linear relationship. This is noted by the correlation coefficient of -0.3 and the shape of the graph. However, we can at least observe that at a 50% discount, (0.5 * 100 to convert it to percentage) our average sales are the highest it can be. Maybe it is a psychology technique or it’s just the right product category that is discounted.

Which product categories and sub-categories get the most discounts?

SELECT Category, SUM (Discount) total_discount
FROM `ridwanpuadi.project.superstore23`
GROUP BY 1
ORDER BY 2 DESC;
Most discounted Categories

Office equipment is the most discounted, followed by furniture and then technology
Now look at the top subcategories that get the most discounts.

SELECT Sub_Category, SUM (Discount) total_discount
FROM `ridwanpuadi.project.superstore23`
GROUP BY 1
ORDER BY 2 DESC;
LIMIT 10
Most discounted subcategories (product type)

Binders, Phones, and Furnishings are the most discounted items. But the gap between binders and the others is drastic. We should check the sales and profits for the binders and other items on the list. But first, let’s move on to the categories per state.

Look for the 5 most profitable and least profitable product names for the company?

The most profitable product names;

SELECT Product_Name, SUM (profit) total_profit
From `ridwanpuadi.project.superstore23`
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5;
Top 5 most profitable products

These Copiers, Machines, and Printers are the main foundation of the company’s profits. The Canon imageClass 2200 Advanced Copier, the Fellowes PB500 Electric Blow Plastic Comb Binding Machine with Manual Binding, and the Hewlett Packard LaserJet 3310 Copier are the company’s top 3. Companies have to maintain stock with this.

Least profitable product name;

SELECT Product_Name, SUM (profit) total_profit
From `ridwanpuadi.project.superstore23`
GROUP BY 1
ORDER BY 2 ASC
LIMIT 5;
Top 5 less profitable products

The Cubify CubeX 3D Printer Double Head Print, Lexmark MX611dhe Monochrome Laser Printer, and the Cubify CubeX 3D Printer Triple Head Print are the products that operate the most at a loss. We should take this into account if we are thinking about modifying our stock

5. Data evaluation

evaluates test case results and reviews error coverage in this phase.

In this analysis, I did not validate the data. let’s go to the next step.

6. Visualization

Create interactive charts, diagrams, or dashboards. communicate your findings.

To share the results of my analysis, below will be an interactive dashboard image representing key KPIs and information about the Superstore data collected. I demonstrated it with the help of Tableau.

a. Dashboard

b. Insight and Business Recommendation

--

--

Ridwan Puadi
Ridwan Puadi

Written by Ridwan Puadi

0 Followers

nobody, who dreams of becoming an independent data scientist

No responses yet