Capstone Projects in SQL

As part of my professional development and data analysis interests, I have completed various lessons and projects in SQL. A number of them are highlighted below, but for a more complete collection of SQL skills I have developed, please click here.

Social Media Analytics

-Window Functions and Subqueries-

Introduction

A digital marketing agent is looking to study the ad campaigns used by some of the most-followed Instagram accounts in the world. To achieve this task, assistance is required to analyse monthly account data including the change in the number of followers as well as the amount of content posted by each account.

The database consists of one table:

  • social_media

The database schema is shown below:

table: social_media

Column Description
username Username of Instagram account
month Month number for which data is recorded
posts Number of posts by each account in a given month
followers_change Change in account followers for a given month

Queries

SQL queries are used to answer the following questions:

1. Explore the dataset

SELECT *
FROM social_media;

Output:

username month posts followers_change
ladygaga 1 32 6.27
ladygaga 2 39 4.71
ladygaga 3 39 5.73
ladygaga 4 41 5.43
ladygaga 5 45 4.99
ladygaga 6 5 2.52
ladygaga 7 41 4.61
ladygaga 8 44 3.91
cristiano 1 21 5.33
cristiano 2 19 4.93
cristiano 3 7 4.3
cristiano 4 17 5.74
cristiano 5 20 5.81
cristiano 6 19 5.74
cristiano 7 15 5.99
cristiano 8 21 4.6
arianagrande 1 12 3.69
arianagrande 2 22 4.4
arianagrande 3 14 4.93
arianagrande 4 22 4.12
arianagrande 5 48 5.47
arianagrande 6 14 4.01
arianagrande 7 23 3.71
arianagrande 8 24 4.02
therock 1 44 3.75
therock 2 44 3.15
therock 3 36 3.5
therock 4 56 4.4
therock 5 70 3.77
therock 6 19 3.25
therock 7 62 3.54
therock 8 56 4.32
aliaabhatt 1 9 1.15
aliaabhatt 2 14 1.16
aliaabhatt 3 25 1.52
aliaabhatt 4 9 1.57
aliaabhatt 5 9 1.47
aliaabhatt 6 7 -0.76
aliaabhatt 7 5 1.09
aliaabhatt 8 13 1
theellenshow 1 129 1.76
theellenshow 2 123 1.43
theellenshow 3 126 1.88
theellenshow 4 104 1.41
theellenshow 5 121 1.73
theellenshow 6 72 1.26
theellenshow 7 30 1.16
theellenshow 8 0 0.79
leomessi 1 2 0.03
leomessi 2 4 0.01
leomessi 3 1 0.02
leomessi 4 3 0.01
leomessi 5 2 0.01
leomessi 6 1 0.01
leomessi 7 0 0.03
leomessi 8 2 0.05

2. Select all the monthly data stored in the dataset, but only for accounts with a total number of posts greater than 300 over the 8 months. Add a column to the table showing the running total of followers after each month

WITH filtered_table AS(
  SELECT *
  FROM social_media
  WHERE username IN(
    SELECT username
    FROM social_media
    GROUP BY username
    HAVING SUM(posts) > 300
    )
)
SELECT username, month, posts, followers_change,
  ROUND(SUM(followers_change) OVER (
    PARTITION BY username 
    ORDER BY month
  ), 2) AS 'followers_running_tot'
FROM filtered_table;

Output:

username month posts followers_change followers_running_tot
theellenshow 1 129 1.76 1.76
theellenshow 2 123 1.43 3.19
theellenshow 3 126 1.88 5.07
theellenshow 4 104 1.41 6.48
theellenshow 5 121 1.73 8.21
theellenshow 6 72 1.26 9.47
theellenshow 7 30 1.16 10.63
theellenshow 8 0 0.79 11.42
therock 1 44 3.75 3.75
therock 2 44 3.15 6.9
therock 3 36 3.5 10.4
therock 4 56 4.4 14.8
therock 5 70 3.77 18.57
therock 6 19 3.25 21.82
therock 7 62 3.54 25.36
therock 8 56 4.32 29.68

Visualisation of Results

The results were imported into RStudio (data analysis in R programming language) to visualise the results using the ggplot2 library. There was particular interest in how the total number of followers for each account changed throughout the months analysed:

Funnel for Mattresses and More

-Joins and Aggregates for Business Analytics-

Introduction

Mattresses and More is a bedding essentials e-commerce store. Their purchase funnel is:

  1. User browses products and adds them to shopping cart

  2. User views shopping cart and proceeds to checkout page

  3. Use enters payment details and purchases the products

The Management Team is requesting to see the conversion of users from the browse to the checkout stage, and from the checkout to the purchase stage. As a sales analyst, I am tasked to provide this information, as well as to examine the trend in the data on the approach to Christmas Day.

The database consists of three tables:

  • browse

  • checkout

  • purchase

The database schema is shown below:

table: browse

Column Description
user_id A unique identifier for each user
browse_date The date when the user browsed the product
item_id A unique identifier for each product
 

table: checkout

Column Description
user_id A unique identifier for each user
checkout_date The date when the user entered the checkout page
item_id A unique identifier for each product
 

table: purchase

Column Description
user_id A unique identifier for each user
purchase_date The date when the user purchased the product
item_id A unique identifier for each product

Queries

SQL queries are used to answer the following question:

1. Create a funnel for Mattresses and More which displays the number of users at each stage (browse, checkout and purchase) for each date available

WITH funnels AS (
  SELECT DISTINCT b.browse_date,
     b.user_id,
     c.user_id IS NOT NULL AS 'is_checkout',
     p.user_id IS NOT NULL AS 'is_purchase'
  FROM browse AS 'b'
  LEFT JOIN checkout AS 'c'
    ON c.user_id = b.user_id
  LEFT JOIN purchase AS 'p'
    ON p.user_id = c.user_id)
SELECT browse_date, COUNT(*) AS 'num_browse',
   SUM(is_checkout) AS 'num_checkout',
   SUM(is_purchase) AS 'num_purchase',
   ROUND(1.0 * SUM(is_checkout) / COUNT(user_id), 3) AS 'browse_to_checkout',
   ROUND(1.0 * SUM(is_purchase) / SUM(is_checkout), 3) AS 'checkout_to_purchase'
FROM funnels
GROUP BY browse_date
ORDER BY browse_date;

Output:

browse_date num_browse num_checkout num_purchase browse_to_checkout
checkout_to_purchase
2017-12-20
100
20 16 0.2 0.8
2017-12-21 150 33 28 0.22 0.848
2017-12-22 250 62 55 0.248 0.887
2017-12-23 275 68 64 0.247 0.941

Analysis of Results

As suspected, there was an increase in sales on the approach to Christmas Day (look at the increase in checkout_to_purchase rate). This analysis also shows that once a user has entered the checkout page, they are more likely to complete the purchase compared to the rate of users who enter the checkout page after only browsing (for example 0.941 vs 0.247 on 2017-12-23).

Songify Music Streaming

-Basic and Advanced Queries, Joins and Date Functions-

Introduction

A music streaming company offers both a free product and a premium paid product. Assistance is needed to analyse their customer database which consists of six tables:

  • plans

  • users

  • premium_users

  • songs

  • months

  • plays

The database schema is shown below:

table: plans

Column Description
id A unique identifier for each plan
price The monthly cost of the plan
description A description of the plan
 

table: users

Column Description
id A unique identifier for each user
first_name The first name of the user
last_name The last name of the user
age The age of the user
gender The gender of the user
 

table: premium_users

Column Description
user_id A unique identifier for each user. Matches users.id
membership_plan_id An ID for which payment plan that user is on. Matches plans.id
purchase_date The date when the user purchased their premium plan
cancel_date The date when the user canceled their plan (which can be `NULL` if they haven't canceled yet)
 

table: songs

Column Description
id A unique identifier for each song
title The title of the song
artist The artist who recorded the song
year The year that the song was released
 

table: plays

Column Description
user_id A unique identifier for each user. Matches users.id
song_id An ID for which payment plan that user is on. Matches songs.id
play_date The date when the user played this song
play_hour The hour of day (0 - 23) when the user played this song
 

table: months

Column Description
months The first date of a month

Queries

SQL queries are used to answer the following questions:

1. Which plans are used by which premium members?

SELECT premium_users.user_id, plans.description
FROM premium_users
JOIN plans
  ON premium_users.membership_plan_id = plans.id;

Output:

user_id description
e9005106-b5d3-49d8-bb80-c6dac246140b
unlimited songs
a4c787c5-fa70-42a9-a139-b288a6d6c8ec limited songs per month
42038fb3-b538-4c25-9c76-cabf57c29e16 unlimited songs
c89db18c-94c2-45e3-8cb2-a5c5df2d75f7 unlimited songs
... ...

2. Which titles of songs are played by each user?

SELECT plays.user_id, plays.play_date, songs.title
FROM plays
JOIN songs
  ON plays.song_id = songs.id;

Output:

user_id play_date title
220201d9-21ad-4380-94f2-148ce930be06
2017-01-03
What About Love?
0fd581d4-e543-4651-8aa0-0a8e84d18405 2017-01-03
Illegal Smile
220201d9-21ad-4380-94f2-148ce930be06 2017-01-03
What's Your Name
42038fb3-b538-4c25-9c76-cabf57c29e16 2017-01-04
Small Town
... ... ...

3. Which users aren’t premium users?

SELECT users.id
FROM users
LEFT JOIN premium_users
  ON users.id = premium_users.user_id
WHERE premium_users.user_id IS NULL;

Output:

id
90038787-ceed-41db-9bb6-4a3d1d644bbe
9244d64e-8dbb-4382-a4a5-5318661da692
aac29bd6-61be-4cbd-8891-f05e703dff01
8a08151d-ff10-47a0-be3f-2164459981fa
...

4. Find which users played songs in January, but not February

WITH january AS (
  SELECT *
  FROM plays
  WHERE strftime("%m", play_date) = '01'
),
february AS (
  SELECT *
  FROM plays
  WHERE strftime("%m", play_date) = '02'

)
SELECT january.user_id
FROM january
LEFT JOIN february
  ON january.user_id = february.user_id
WHERE february.user_id IS NULL;

Output:

user_id
220201d9-21ad-4380-94f2-148ce930be06
0fd581d4-e543-4651-8aa0-0a8e84d18405
220201d9-21ad-4380-94f2-148ce930be06
42038fb3-b538-4c25-9c76-cabf57c29e16
...

5. For each month, we want to know if each user in 'premium_users' was active or canceled.

SELECT premium_users.user_id,
  months.months,
  CASE
    WHEN (
      premium_users.purchase_date <= months.months
      )
      AND
      (
        premium_users.cancel_date >= months.months
        OR
        premium_users.cancel_date IS NULL
      )
    THEN 'active'
    ELSE 'not_active'
  END AS 'status'
FROM premium_users
CROSS JOIN months;

Output:

user_id months status
e9005106-b5d3-49d8-bb80-c6dac246140b
2017-01-01
not_active
e9005106-b5d3-49d8-bb80-c6dac246140b 2017-02-01
active
e9005106-b5d3-49d8-bb80-c6dac246140b 2017-03-01
not_active
a4c787c5-fa70-42a9-a139-b288a6d6c8ec 2017-01-01
not_active
... ... ...

6. Songify has added some new songs to their catalog. Combine the existing songs with the new bonus songs.

SELECT *
FROM songs
UNION 
SELECT *
FROM bonus_songs;

Output:

index id title artist year
0
0
Caught Up in You
.38 Special  1982
0 2229
Can't Stop the Feeling
Justin Timberlake  2016
1 1
Fantasy Girl
.38 Special  NULL
1 2230
Send My Love (To Your New Lover)
Adele 2016
... ... ... ... ...

7. Find how many times each song was played, displaying the title, artist and play count.

WITH play_count AS(
  SELECT song_id,
  COUNT(*) AS 'times_played'
  FROM plays
  GROUP BY song_id
)
SELECT songs.title, songs.artist, play_count.times_played
FROM songs
JOIN play_count
ON songs.id = play_count.song_id;

Output:

title artist times_played
Ballroom Blitz
Sweet
7
Karn Evil 9 Emerson, Lake & Palmer
10
Hello It's Me Jefferson Starship
8
Jane Slaughter
9
... ... ...
Next
Next

Multimatic Projects