Data Analysis in SQL

Below are examples of various SQL analyses I have carried out, organised by the different functions and concepts I implemented.

Window Functions

Historical U.S. Climate Change Investigation

Climate data is available for each state in the United States dating back to 1895. Window functions will be used in this investigation to draw insights about the differences in temperature by region and by year:
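Before diving into the full dataset, the running-average idea below can be sanity-checked with a minimal sketch using Python's sqlite3 module (requires SQLite 3.25+ for window functions). The state, years and temperatures here are invented for illustration only:

```python
import sqlite3

# Toy 'state_climate' rows (invented temperatures) to show a running average window.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state_climate (state TEXT, year INTEGER, tempc REAL)")
conn.executemany("INSERT INTO state_climate VALUES (?, ?, ?)", [
    ("Alabama", 1895, 17.0), ("Alabama", 1896, 18.0), ("Alabama", 1897, 19.0),
])

# ORDER BY inside OVER makes AVG cumulative: each row averages all years up to it.
rows = conn.execute("""
    SELECT year, tempc,
           AVG(tempc) OVER (PARTITION BY state ORDER BY year) AS running_avg_temp
    FROM state_climate
""").fetchall()

print(rows)  # [(1895, 17.0, 17.0), (1896, 18.0, 17.5), (1897, 19.0, 18.0)]
```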

--Dataset: Historical U.S. Climate Change Investigation table 'state_climate'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Explore the dataset
SELECT *
FROM state_climate;

--Calculate the running average temperature over time in each state
SELECT state, year, tempf, tempc, AVG(tempc) OVER (
  PARTITION BY state
  ORDER BY year
) AS 'running_avg_temp'
FROM state_climate;

--Display the lowest temperature for each state alongside the other columns
SELECT state, year, tempf, tempc, FIRST_VALUE(tempc) OVER (
  PARTITION BY state
  ORDER BY tempc
) AS 'lowest_temp'
FROM state_climate;

--Display the highest temperature for each state alongside the other columns
SELECT state, year, tempf, tempc, LAST_VALUE(tempc) OVER (
  PARTITION BY state
  ORDER BY tempc
  RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS 'highest_temp'
FROM state_climate;

--Look at how the temperature changed year-on-year in each state
SELECT state, year, tempf, tempc, tempc - LAG(tempc, 1, tempc) OVER (
  PARTITION BY state
  ORDER BY year
) AS 'change_in_tempc'
FROM state_climate
ORDER BY ABS(change_in_tempc) DESC;

--Display a rank of the coldest temperatures on record
SELECT year, state, tempf, tempc, RANK() OVER (
  ORDER BY tempc
) AS 'rank'
FROM state_climate;

--Display a rank of the warmest temperatures for each state on record
SELECT year, state, tempf, tempc, RANK() OVER (
  PARTITION BY state
  ORDER BY tempc DESC
) AS 'rank'
FROM state_climate;

--Assign each state's average yearly temperatures to quartiles
SELECT year, state, tempf, tempc, NTILE(4) OVER (
  PARTITION BY state
  ORDER BY tempc
) AS 'quartile'
FROM state_climate;

--Assign the average yearly temperatures to quintiles across the whole country
SELECT year, state, tempf, tempc, NTILE(5) OVER (
  ORDER BY tempc
) AS 'quintile'
FROM state_climate;

Social Media Analytics

Aggregate functions provide useful metrics by collapsing several rows of a table into a single value (e.g. the average or sum of a column). However, data scientists often need to perform a similar calculation while still displaying the original rows of data alongside the new calculated value. Window functions are used in this project to perform more complex analysis of data from various social media platforms:
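The aggregate-versus-window distinction can be seen side by side in a minimal sketch (Python's sqlite3 with invented follower counts; requires SQLite 3.25+ for window functions):

```python
import sqlite3

# Toy table standing in for 'social_media'; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE social_media (username TEXT, month INTEGER, change_in_followers INTEGER)")
conn.executemany(
    "INSERT INTO social_media VALUES (?, ?, ?)",
    [("instagram", 1, 100), ("instagram", 2, 50), ("instagram", 3, 150)],
)

# Aggregate: collapses all rows into a single value.
total, = conn.execute("SELECT SUM(change_in_followers) FROM social_media").fetchone()

# Window: the same SUM, but every original row is kept and gains a running total.
rows = conn.execute("""
    SELECT month, change_in_followers,
           SUM(change_in_followers) OVER (ORDER BY month) AS running_total
    FROM social_media
""").fetchall()

print(total)  # 300
print(rows)   # [(1, 100, 100), (2, 50, 150), (3, 150, 300)]
```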

--Dataset: Social Media Analytics table 'social_media', 'streams'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Display the running total and the running average of followers for the 'instagram' user
SELECT 
   month,
   change_in_followers,
   ROUND(SUM(change_in_followers) OVER (
      ORDER BY month
   ), 2) AS 'running_total',
   ROUND(AVG(change_in_followers) OVER (
      ORDER BY month
   ), 2) AS 'running_avg'
FROM
   social_media
WHERE
   username = 'instagram';

--Display the running total and the running average of followers for all recorded accounts
SELECT 
    username,
    month,
    change_in_followers,
    ROUND(SUM(change_in_followers) OVER (
      PARTITION BY username 
      ORDER BY month
    ), 2) AS 'running_total_followers_change',
    ROUND(AVG(change_in_followers) OVER (
      PARTITION BY username 
      ORDER BY month
    ), 2) AS 'running_average_followers_change'
FROM
    social_media;

--Display the fewest posts recorded in any month, per user, alongside their monthly number of posts
SELECT
   username,
   posts,
   FIRST_VALUE (posts) OVER (
      PARTITION BY username 
      ORDER BY posts
   ) fewest_posts
FROM
   social_media;

--Display the highest number of posts recorded in any month, per user, alongside their monthly number of posts
SELECT
   username,
   posts,
   LAST_VALUE (posts) OVER (
      PARTITION BY username 
      ORDER BY posts
      RANGE BETWEEN UNBOUNDED PRECEDING AND 
      UNBOUNDED FOLLOWING
    ) most_posts
FROM social_media;

--Display Lady Gaga's monthly change in number of streams and monthly change in chart position. Use LAG() functions
SELECT
   artist,
   week,
   streams_millions,
   ROUND(streams_millions - LAG(streams_millions, 1, streams_millions) OVER (
      PARTITION BY artist
      ORDER BY week
   ), 2) AS 'streams_millions_change',
   chart_position,
   LAG(chart_position, 1, chart_position) OVER ( 
    PARTITION BY artist
    ORDER BY week 
) - chart_position AS chart_position_change
FROM
   streams 
WHERE
   artist = 'Lady Gaga';

--Alter the previous query to calculate the same new columns but now for all artists and using LEAD() functions
SELECT
   artist,
   week,
   streams_millions,
  ROUND((LEAD(streams_millions, 1) OVER (
      PARTITION BY artist
      ORDER BY week
   ) - streams_millions), 2) AS 'streams_millions_change',
   chart_position,
   chart_position - LEAD(chart_position, 1) OVER (
    PARTITION BY artist
    ORDER BY week
   ) AS 'chart_position_change'
FROM
   streams;

--Use window functions to add row numbers to an ordered list of weekly streams per artist. Number 1 in the list should be the artist (and week) with the least streams
SELECT 
   ROW_NUMBER() OVER (
      ORDER BY streams_millions
   ) AS 'row_num', 
   artist, 
   week,
   streams_millions
FROM
   streams;

--Alter the previous query to display rank instead of row number. Tied numbers of streams should display the same rank
SELECT 
   RANK() OVER (
      ORDER BY streams_millions
   ) AS 'rank', 
   artist, 
   week,
   streams_millions
FROM
   streams;

--Alter the previous query to display the ranked list of artists partitioned by week so we can see the most streamed artist of each week
SELECT 
   RANK() OVER (
      PARTITION BY week
      ORDER BY streams_millions DESC
   ) AS 'rank', 
   artist, 
   week,
   streams_millions
FROM
   streams;

--Break the data into quintiles according to their number of streams per week
SELECT 
   NTILE(5) OVER (
      ORDER BY streams_millions DESC
   ) AS 'quintile', 
   artist, 
   week,
   streams_millions
FROM
   streams;

--Break the data into quartiles for each week according to the number of streams, i.e. each quartile contains two artists because the dataset has eight artists per week
SELECT 
   NTILE(4) OVER (
      PARTITION BY week
      ORDER BY streams_millions DESC
   ) AS 'quartile', 
   artist, 
   week,
   streams_millions
FROM
   streams;

Marketing Attribution

CoolTShirts Marketing Campaign

A t-shirt company recently started various marketing campaigns to increase website traffic and revenue. They would like assistance in mapping their customers’ journeys through the website using touch attribution:

--Dataset: CoolTShirts Marketing Campaign table 'page_visits'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Become familiar with the data. Find the number of distinct campaigns
SELECT COUNT(DISTINCT utm_campaign)
FROM page_visits;

--Find the number of distinct UTM sources
SELECT COUNT(DISTINCT utm_source)
FROM page_visits;

--Find how many campaigns there are per source
SELECT utm_source, COUNT(DISTINCT utm_campaign) AS 'campaigns_per_source'
FROM page_visits
GROUP BY utm_source;

--Identify the pages that are on the CoolTShirts website
SELECT DISTINCT page_name
FROM page_visits;

--How many first touches is each campaign responsible for?
WITH first_touch AS (
    SELECT user_id,
        MIN(timestamp) as first_touch_at
    FROM page_visits
    GROUP BY user_id)
SELECT pv.utm_campaign, COUNT(*)
FROM first_touch ft
JOIN page_visits pv
    ON ft.user_id = pv.user_id
    AND ft.first_touch_at = pv.timestamp
GROUP BY pv.utm_campaign;


--How many last touches is each campaign responsible for?
WITH last_touch AS (
    SELECT user_id,
        MAX(timestamp) as last_touch_at
    FROM page_visits
    GROUP BY user_id)
SELECT pv.utm_campaign, COUNT(*)
FROM last_touch lt
JOIN page_visits pv
    ON lt.user_id = pv.user_id
    AND lt.last_touch_at = pv.timestamp
GROUP BY pv.utm_campaign;

--How many visitors make a purchase? (the purchase page is labeled '4 - purchase')
SELECT COUNT(DISTINCT user_id)
FROM page_visits
WHERE page_name LIKE '4%';

--How many last touches on the purchase page is each campaign responsible for?
WITH last_touch AS (
    SELECT user_id,
        MAX(timestamp) as last_touch_at
    FROM page_visits
    WHERE page_name LIKE '4%'
    GROUP BY user_id)
SELECT pv.utm_campaign, COUNT(*)
FROM last_touch lt
JOIN page_visits pv
    ON lt.user_id = pv.user_id
    AND lt.last_touch_at = pv.timestamp
GROUP BY pv.utm_campaign;

First and Last Touch Attribution with UTM Parameters

Web developers add UTM parameters to the ads they display on other websites to determine the source of the traffic to their own website. For an e-commerce website advertising on social media, for example, it is helpful to know where a user was browsing when they first clicked on one of its ads (‘first-touch attribution’). That user may then have left the online store, only to return and complete their purchase after clicking on a later ad. This final visit and the source of its traffic are also recorded (‘last-touch attribution’). Analyse these results to improve the company’s marketing and online presence:
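The MIN-then-join pattern used in the queries below can be sketched end to end with toy data (Python's sqlite3; the user ids, timestamps and sources are invented):

```python
import sqlite3

# Hypothetical page_visits rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_visits (user_id INTEGER, timestamp TEXT, utm_source TEXT)")
conn.executemany("INSERT INTO page_visits VALUES (?, ?, ?)", [
    (1, "2021-01-01 10:00", "facebook"),  # user 1's first touch
    (1, "2021-01-05 18:30", "email"),     # user 1's last touch
    (2, "2021-01-02 09:15", "google"),    # user 2's only (first and last) touch
])

# First touch: earliest visit per user, joined back to recover its UTM source.
first_touch = conn.execute("""
    WITH first_touch AS (
        SELECT user_id, MIN(timestamp) AS first_touch_at
        FROM page_visits
        GROUP BY user_id)
    SELECT ft.user_id, pv.utm_source
    FROM first_touch ft
    JOIN page_visits pv
      ON ft.user_id = pv.user_id AND ft.first_touch_at = pv.timestamp
    ORDER BY ft.user_id
""").fetchall()

print(first_touch)  # [(1, 'facebook'), (2, 'google')]
```

Swapping MIN for MAX yields the last-touch version of the same query.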

--Dataset: First and Last Touch Attribution with UTM Parameters table 'page_visits'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Write a first-touch attribution query for the users stored in the database. Display the associated UTM source
WITH first_touch AS (
    SELECT user_id,
       MIN(timestamp) AS 'first_touch_at'
    FROM page_visits
    GROUP BY user_id)
SELECT ft.user_id,
   ft.first_touch_at,
   pv.utm_source
FROM first_touch AS 'ft'
JOIN page_visits AS 'pv'
   ON ft.user_id = pv.user_id
   AND ft.first_touch_at = pv.timestamp;

--Write a last-touch attribution query for the users stored in the database. Display the associated UTM source
WITH last_touch AS (
    SELECT user_id,
       MAX(timestamp) AS 'last_touch_at'
    FROM page_visits
    GROUP BY user_id)
SELECT lt.user_id,
   lt.last_touch_at,
   pv.utm_source
FROM last_touch AS 'lt'
JOIN page_visits AS 'pv'
   ON lt.user_id = pv.user_id
   AND lt.last_touch_at = pv.timestamp;

Churn Rate

Codeflix Video Streaming SaaS Churn Rate

A common revenue model for SaaS (software as a service) companies is to charge a monthly subscription fee for access to their product. Churn rate is the number of subscribers who cancel within a certain period, usually a month, divided by the total number of active subscribers at the start of that period. Calculate this value for the required months for the video streaming company Codeflix:
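The definition can be checked on a toy subscriptions table (Python's sqlite3; ids and dates are invented, and a NULL subscription_end means the subscriber is still active):

```python
import sqlite3

# Invented subscriptions for a January 2017 churn calculation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscriptions (id INTEGER, subscription_start TEXT, subscription_end TEXT)")
conn.executemany("INSERT INTO subscriptions VALUES (?, ?, ?)", [
    (1, "2016-11-10", None),          # active all of January
    (2, "2016-12-02", "2017-01-15"),  # cancels mid-January
    (3, "2016-12-20", "2017-02-10"),  # active through January, cancels later
    (4, "2017-01-05", None),          # joined during January: not in the denominator
])

# Churn = cancellations during the month / subscribers active at its start.
(churn,) = conn.execute("""
    SELECT ROUND(1.0 *
        SUM(CASE WHEN subscription_end BETWEEN '2017-01-01' AND '2017-01-31' THEN 1 ELSE 0 END)
        / COUNT(*), 3)
    FROM subscriptions
    WHERE subscription_start < '2017-01-01'
      AND (subscription_end >= '2017-01-01' OR subscription_end IS NULL)
""").fetchone()

print(churn)  # 1 canceler out of 3 active subscribers -> 0.333
```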

--Dataset: Codeflix Video Streaming SaaS Churn Rate table 'subscriptions'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Calculate the churn for the month of January 2017
SELECT ROUND(1.0 * 
(
  SELECT COUNT(*)
  FROM subscriptions
  WHERE subscription_start < '2017-01-01'
  AND (
    subscription_end
    BETWEEN '2017-01-01'
    AND '2017-01-31'
  )
) / (
  SELECT COUNT(*) 
  FROM subscriptions 
  WHERE subscription_start < '2017-01-01'
  AND (
    (subscription_end >= '2017-01-01')
    OR (subscription_end IS NULL)
  )
), 3)
AS result;

--Calculate the churn for the month of January 2017, now using a system that can then be scaled to multiple months
WITH enrollments AS
(SELECT *
FROM subscriptions
WHERE subscription_start < '2017-01-01'
AND (
  (subscription_end >= '2017-01-01')
  OR (subscription_end IS NULL)
)),
status AS 
(SELECT
CASE
  WHEN (subscription_end > '2017-01-31')
    OR (subscription_end IS NULL) THEN 0
    ELSE 1
  END as is_canceled,
CASE
    WHEN subscription_start < '2017-01-01'
      AND (
        (subscription_end >= '2017-01-01')
        OR (subscription_end IS NULL)
      ) THEN 1
    ELSE 0
  END as is_active
FROM enrollments
)
SELECT ROUND((1.0 * SUM(is_canceled) / SUM(is_active)), 3) AS 'churn_january_2017'
FROM status;

--Calculate the churn rate for multiple months (January 2017, February 2017 and March 2017) by adapting the previous query
WITH months AS
(SELECT
  '2017-01-01' as first_day,
  '2017-01-31' as last_day
UNION
SELECT
  '2017-02-01' as first_day,
  '2017-02-28' as last_day
UNION
SELECT
  '2017-03-01' as first_day,
  '2017-03-31' as last_day
),
cross_join AS
(SELECT *
FROM subscriptions
CROSS JOIN months),
status AS
(SELECT id, first_day as month,
CASE
  WHEN (subscription_start < first_day)
    AND (
      subscription_end > first_day
      OR subscription_end IS NULL
    ) THEN 1
  ELSE 0
END as is_active,
CASE 
  WHEN subscription_end BETWEEN first_day AND last_day THEN 1
  ELSE 0
END as is_canceled
FROM cross_join),
status_aggregate AS
(SELECT
  month,
  SUM(is_active) as active,
  SUM(is_canceled) as canceled
FROM status
GROUP BY month)
SELECT month, ROUND((1.0 * canceled/active), 3) AS churn_rate
FROM status_aggregate;

A/B Testing User Segments

A company provides a subscription based service to a range of customers. Customers were acquired through two distinct subscription channels (labeled ‘87’ and ‘30’). The marketing department for the company would like to compare the two user segments by analysing their churn rates for the months provided in the dataset:

--Dataset: A/B Testing User Segments table 'subscriptions'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Explore the dataset
SELECT *
FROM subscriptions
LIMIT 20;

--Determine the range of months of data provided
SELECT MIN(subscription_start), MAX(subscription_start), MIN(subscription_end), MAX(subscription_end)
FROM subscriptions;

--Calculate the churn rate for each user segment and for multiple months (January 2017, February 2017 and March 2017)
WITH 
 months AS (
  SELECT
  '2017-01-01' as first_day,
  '2017-01-31' as last_day
  UNION
  SELECT
  '2017-02-01' as first_day,
  '2017-02-28' as last_day
  UNION
  SELECT
  '2017-03-01' as first_day,
  '2017-03-31' as last_day
 ),
cross_join AS(
  SELECT *
  FROM subscriptions
  CROSS JOIN months
),
status AS (
  SELECT id, first_day AS month, 
  CASE
    WHEN segment = '87' AND subscription_start < first_day AND (subscription_end > first_day OR subscription_end IS NULL) THEN 1
    ELSE 0
  END AS 'is_active_87',
  CASE
    WHEN segment = '87' AND subscription_end BETWEEN first_day AND last_day THEN 1
    ELSE 0
  END AS 'is_canceled_87',
  CASE
    WHEN segment = '30' AND subscription_start < first_day AND (subscription_end > first_day OR subscription_end IS NULL) THEN 1
    ELSE 0
  END AS 'is_active_30',
  CASE
    WHEN segment = '30' AND subscription_end BETWEEN first_day AND last_day THEN 1
    ELSE 0
  END AS 'is_canceled_30'
  FROM cross_join
),
status_aggregate AS (
  SELECT 
    month,
    SUM(is_active_87) AS 'sum_active_87',
    SUM(is_active_30) AS 'sum_active_30',
    SUM(is_canceled_87) AS 'sum_canceled_87',
    SUM(is_canceled_30) AS 'sum_canceled_30'
  FROM status
  GROUP BY month
)
SELECT month, ROUND((1.0 * sum_canceled_30 / sum_active_30), 3) AS 'churn_30', ROUND((1.0 * sum_canceled_87 / sum_active_87), 3) AS 'churn_87'
FROM status_aggregate;

Funnels

Usage Funnels with Warby Parker

Warby Parker is a transformative lifestyle brand in the eyeglasses and sunglasses industry. Analyse different Warby Parker marketing funnels in order to provide them with actionable insights:

--Dataset: Usage Funnels with Warby Parker tables 'survey', 'quiz', 'home_try_on', 'purchase'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Join the three tables to determine which users move on from the quiz stage to the try-on stage, and then to the purchase stage. Use 'IS NOT NULL' as a 'marker' for which customers remain at each stage
SELECT q.user_id, h.user_id IS NOT NULL AS 'is_home_try_on', h.number_of_pairs, p.user_id IS NOT NULL AS 'is_purchase'
 FROM quiz q
 LEFT JOIN home_try_on h
  ON h.user_id = q.user_id
LEFT JOIN purchase p
  ON p.user_id = h.user_id;

--Calculate the conversion of customers from one stage to the next. Write this as a standalone query
WITH funnel_calcs_table AS (
  SELECT q.user_id, h.user_id IS NOT NULL AS 'is_home_try_on', h.number_of_pairs, p.user_id IS NOT NULL AS 'is_purchase'
 FROM quiz q
 LEFT JOIN home_try_on h
  ON h.user_id = q.user_id
 LEFT JOIN purchase p
  ON p.user_id = h.user_id
)
SELECT COUNT(user_id) AS 'users_quizzed' , SUM(is_home_try_on) AS 'users_tried_on', SUM(is_purchase) AS 'users_purchased'
FROM funnel_calcs_table;

--CUSTOMER SURVEY FUNNEL
--Review table
SELECT *
FROM survey
LIMIT 10;

--Create a funnel for Warby Parker which displays the number of users that answer each question of a survey
SELECT question, COUNT(DISTINCT user_id)
FROM survey
GROUP BY question;

Funnel for Mattresses and More

The management team would like to see the conversion of users from the browse stage to the checkout stage, and from checkout to purchase. Use SQL queries to analyse the data and provide the required insight:
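The LEFT JOIN funnel trick used below, where `IS NOT NULL` marks users who reached the next stage, can be sketched with toy data (Python's sqlite3; the user ids are invented):

```python
import sqlite3

# Invented browse/checkout/purchase user ids to illustrate the funnel counts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE browse   (user_id INTEGER);
    CREATE TABLE checkout (user_id INTEGER);
    CREATE TABLE purchase (user_id INTEGER);
    INSERT INTO browse   VALUES (1), (2), (3), (4);
    INSERT INTO checkout VALUES (1), (2);
    INSERT INTO purchase VALUES (1);
""")

# Each LEFT JOIN keeps every browser; 'IS NOT NULL' flags who reached the next stage.
row = conn.execute("""
    SELECT COUNT(*) AS num_browse,
           SUM(c.user_id IS NOT NULL) AS num_checkout,
           SUM(p.user_id IS NOT NULL) AS num_purchase
    FROM browse b
    LEFT JOIN checkout c ON c.user_id = b.user_id
    LEFT JOIN purchase p ON p.user_id = c.user_id
""").fetchone()

print(row)  # (4, 2, 1): 4 browsed, 2 checked out, 1 purchased
```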

--Dataset: Mattresses and More tables 'browse', 'checkout', 'purchase'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Create a funnel for Mattresses and More which displays the number of users at each stage (browse, checkout and purchase) for each date available
WITH funnels AS (
  SELECT DISTINCT b.browse_date,
     b.user_id,
     c.user_id IS NOT NULL AS 'is_checkout',
     p.user_id IS NOT NULL AS 'is_purchase'
  FROM browse AS 'b'
  LEFT JOIN checkout AS 'c'
    ON c.user_id = b.user_id
  LEFT JOIN purchase AS 'p'
    ON p.user_id = c.user_id)
SELECT browse_date, COUNT(*) AS 'num_browse',
   SUM(is_checkout) AS 'num_checkout',
   SUM(is_purchase) AS 'num_purchase',
   ROUND(1.0 * SUM(is_checkout) / COUNT(user_id), 3) AS 'browse_to_checkout',
   ROUND(1.0 * SUM(is_purchase) / SUM(is_checkout), 3) AS 'checkout_to_purchase'
FROM funnels
GROUP BY browse_date
ORDER BY browse_date;

A/B Testing Modal Popups

A company uses modal popups that show up for new users to their website. The product team has created a new popup design and they would like to test how this new variant compares to the control version of the popup in terms of its impact on the funnel of users going through to the last page of the website (the checkout stage):

--Dataset: A/B Testing Modal Popups table 'onboarding_modals'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Familiarise yourself with the table
SELECT *
FROM onboarding_modals
LIMIT 10;

--Calculate the number of distinct user ids for each modal text displayed. This calculates the number of users at each step of the funnel
SELECT modal_text, COUNT(DISTINCT user_id)
FROM onboarding_modals
GROUP BY modal_text
ORDER BY modal_text;

--Calculate the number of users at each step of the funnel, but this time separate them into the 'control' and 'variant' classes
SELECT modal_text,
  COUNT(DISTINCT CASE
    WHEN ab_group = 'control' THEN user_id
    END) AS 'control_clicks',
  COUNT(DISTINCT CASE
    WHEN ab_group = 'variant' THEN user_id
    END) AS 'variant_clicks'
FROM onboarding_modals
GROUP BY 1
ORDER BY 1;

Subqueries

VR Startup Talent Pool

As a member of the Project Completion team, I am responsible for analysing our current talent pool to ensure each project can be completed effectively:

--Dataset: VR Startup Talent Pool tables 'employees', 'projects'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Display the names of projects that were not chosen by any employees
SELECT project_name
FROM projects
WHERE project_id NOT IN (
  SELECT current_project
  FROM employees
  WHERE current_project IS NOT NULL
);

--What is the name of the project chosen by the most employees?
SELECT projects.project_name, COUNT(*) AS 'No. of employees who chose project'
FROM employees
JOIN projects
ON employees.current_project = projects.project_id
WHERE employees.current_project IS NOT NULL
GROUP BY projects.project_name
ORDER BY 2 DESC
LIMIT 1;

--Which projects were chosen by multiple employees?
SELECT projects.project_name, COUNT(*) AS 'No. of employees who chose project'
FROM employees
JOIN projects
ON employees.current_project = projects.project_id
WHERE employees.current_project IS NOT NULL
GROUP BY projects.project_name
HAVING COUNT(*) > 1
ORDER BY 2 DESC;

--If each project needs at least 2 developers, how many available project positions are there for developers?
SELECT (COUNT(*) * 2) - (
  SELECT COUNT(*)
  FROM employees
  WHERE current_project IS NOT NULL
    AND position = 'Developer') AS 'Count'
FROM projects;

--When employees are hired, they are given the Myers-Briggs personality test. Which personality is the most common among the employees?
SELECT personality, COUNT(*)
FROM employees
GROUP BY personality
ORDER BY 2 DESC;

--What are the names of projects chosen by employees with the most common personality type?
SELECT project_name
FROM projects
WHERE project_id IN (
  SELECT current_project
  FROM employees
  WHERE current_project IS NOT NULL AND personality IN(
    SELECT personality
    FROM employees
    GROUP BY personality
    ORDER BY COUNT(*) DESC
    LIMIT 1
  )
);

--Find the personality type most represented by employees with a selected project
SELECT first_name, last_name, personality, projects.project_name
FROM employees
JOIN projects
  ON employees.current_project = projects.project_id
WHERE personality = (
   SELECT personality 
   FROM employees
   WHERE current_project IS NOT NULL
   GROUP BY personality
   ORDER BY COUNT(personality) DESC
   LIMIT 1);

--For each employee, provide their name, personality, the names of any projects they've chosen, and the number of incompatible co-workers (according to Myers-Briggs, some personality types are incompatible with others)
SELECT last_name, first_name, personality, project_name,
CASE 
   WHEN personality = 'INFP' OR personality = 'ENFP' OR personality = 'INFJ' OR personality = 'ENFJ'
   THEN (SELECT COUNT(*)
      FROM employees 
      WHERE personality IN ('ISFP', 'ESFP', 'ISTP', 'ESTP', 'ISFJ', 'ESFJ', 'ISTJ', 'ESTJ'))
   WHEN personality = 'ISFP' 
   THEN (SELECT COUNT(*)
      FROM employees 
      WHERE personality IN ('INFP', 'ENFP', 'INFJ'))
   WHEN personality = 'ESFP' OR personality = 'ISTP' OR personality = 'ESTP' OR personality = 'ISFJ' OR personality = 'ESFJ' OR personality = 'ISTJ' OR personality = 'ESTJ'
   THEN (SELECT COUNT(*)
      FROM employees 
      WHERE personality IN ('INFP', 'ENFP', 'INFJ', 'ENFJ'))
   ELSE 0
END AS 'No. of Incompatible Colleagues'
FROM employees
LEFT JOIN projects on employees.current_project = projects.project_id;

Welp Reviews

There’s an exciting new app called Welp which people are using to leave reviews of the businesses they have visited. Analysis is required to draw insights from these reviews:

--Dataset: Welp Reviews tables 'places', 'reviews'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Familiarise yourself with the tables
SELECT *
FROM places;

SELECT *
FROM reviews;

--Find all places that cost $20 or less (each '$' sign represents $10)
SELECT *
FROM places
WHERE price_point = '$$' OR price_point = '$';

--Show all reviews for restaurants that have at least one review
SELECT * 
FROM places 
INNER JOIN reviews 
   ON places.id = reviews.place_id;

--Find the id and name of all the places without reviews in the database
SELECT id, name
FROM places
WHERE id NOT IN(
  SELECT place_id
  FROM reviews
);

--Select all the reviews that happened in 2020
WITH temporary_table AS(
  SELECT *
  FROM reviews
  WHERE strftime('%Y', review_date) = '2020'
)
SELECT *
FROM places
JOIN temporary_table
  ON places.id = temporary_table.place_id;

--Find the reviewers with the most reviews that are rated below the average rating for that place
WITH full_temp_table AS(
  SELECT *
  FROM places
  JOIN reviews
    ON places.id = reviews.place_id
)
SELECT username, COUNT(*)
FROM full_temp_table
WHERE rating < average_rating
GROUP BY username
ORDER BY 2 DESC;

Multiple Tables

Songify Music Streaming

A music streaming company offers both a free product and a premium paid product. Assistance is needed to analyse their customer database which consists of six tables.

--Dataset: Songify Music Streaming tables 'plans', 'users', 'premium_users', 'songs', 'months', 'plays'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Let's see which plans are used by which premium members!
SELECT premium_users.user_id, plans.description
FROM premium_users
JOIN plans
  ON premium_users.membership_plan_id = plans.id;

--Let's see the titles of songs that were played by each user!
SELECT plays.user_id, plays.play_date, songs.title
FROM plays
JOIN songs
  ON plays.song_id = songs.id;

--Which users aren't premium users?
SELECT users.id
FROM users
LEFT JOIN premium_users
  ON users.id = premium_users.user_id
WHERE premium_users.user_id IS NULL;

--Find which users played songs in January, but not February
WITH january AS (
  SELECT *
  FROM plays
  WHERE strftime('%m', play_date) = '01'
),
february AS (
  SELECT *
  FROM plays
  WHERE strftime('%m', play_date) = '02'

)
SELECT january.user_id
FROM january
LEFT JOIN february
  ON january.user_id = february.user_id
WHERE february.user_id IS NULL;

--For each month, we want to know if each user in 'premium_users' was active or canceled.
SELECT premium_users.user_id,
  months.months,
  CASE
    WHEN (
      premium_users.purchase_date <= months.months
      )
      AND
      (
        premium_users.cancel_date >= months.months
        OR
        premium_users.cancel_date IS NULL
      )
    THEN 'active'
    ELSE 'not_active'
  END AS 'status'
FROM premium_users
CROSS JOIN months;

--Songify has added some new songs to their catalog. Combine the existing songs with the new bonus songs
SELECT *
FROM songs
UNION 
SELECT *
FROM bonus_songs
LIMIT 10;

--Find how many times each song was played, displaying the title, artist and play count
WITH play_count AS(
  SELECT song_id,
   COUNT(*) AS 'times_played'
FROM plays
GROUP BY song_id
)
SELECT songs.title, songs.artist, play_count.times_played
FROM songs
JOIN play_count
ON songs.id = play_count.song_id;

Fictional Reddit Analytics

Reddit is organised into subreddits, or topics, where users can contribute through posts. A database of fictional data with multiple tables needs to be analysed through the use of queries, subqueries and joins:

--Dataset: Fictional Reddit Analytics tables 'users', 'posts', 'subreddits'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Join the data from the different tables to find how many posts each user has made
SELECT username, COUNT(posts.id)
FROM users
LEFT JOIN posts
  ON users.id = posts.user_id
GROUP BY username
ORDER BY 2 DESC;

--Join the data to show only existing posts where users are still active
SELECT *
FROM posts
INNER JOIN users
  ON posts.user_id = users.id;

--Some new posts have been added to Reddit. Combine them with the old posts in a single result
SELECT *
FROM posts
UNION
SELECT *
FROM posts2;

--Find which subreddits have the most popular posts (i.e. posts with a score of at least 5000)
WITH popular_posts AS(
  SELECT *
  FROM posts
  WHERE score >= 5000
)
SELECT *
FROM subreddits
INNER JOIN popular_posts
  ON subreddits.id = popular_posts.subreddit_id
ORDER BY popular_posts.score DESC;

--Now find the most popular post for each subreddit
SELECT subreddits.name AS 'subreddit_name', title AS 'post_highest_score', MAX(score) AS 'score'
FROM posts
INNER JOIN subreddits
  ON posts.subreddit_id = subreddits.id
GROUP BY subreddits.name;

--Calculate the average score of all the posts for each subreddit
SELECT subreddits.name AS 'subreddit_name', ROUND(AVG(score),0) AS 'avg. score'
FROM posts
INNER JOIN subreddits
  ON posts.subreddit_id = subreddits.id
GROUP BY subreddits.name;

Aggregate Functions

Codeflix Video Streaming

A new streaming video company needs assistance analysing their user database:

--Dataset: Codeflix Video Streaming tables 'users', 'payments', 'watch_history'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Determine the number of users that have an email ending in '.com'
SELECT COUNT(*)
FROM users
WHERE email LIKE '%.com';

--What are the most popular first names on Codeflix?
SELECT first_name, COUNT(*) AS 'Count'
FROM users
GROUP BY first_name
ORDER BY 2 DESC;

--The UX Research team wants to see a distribution of watch durations
SELECT ROUND(watch_duration_in_minutes) AS 'duration', COUNT(*) AS 'count'
FROM watch_history
GROUP BY 1
ORDER BY 1;

--Find all the users that have successfully made a payment to Codeflix and find their total amount paid
SELECT user_id, SUM(amount) AS 'total per user'
FROM payments
WHERE status='paid'
GROUP BY user_id
ORDER BY 2 DESC;

--Generate a table of user ids and total watch duration for users who watched more than 400 minutes of content
SELECT user_id, SUM(watch_duration_in_minutes) AS 'total watch time'
FROM watch_history
GROUP BY user_id
HAVING SUM(watch_duration_in_minutes) > 400;

--To the nearest minute, how many minutes of content were streamed on Codeflix?
SELECT ROUND(SUM(watch_duration_in_minutes), 0)
FROM watch_history;

--Which days in this period did Codeflix collect the most money?
SELECT pay_date, SUM(amount) AS 'total per day'
FROM payments
WHERE status = 'paid'
GROUP BY pay_date
ORDER BY 2 DESC;

--When users successfully pay Codeflix, what is the average payment amount?
SELECT AVG(amount)
FROM payments
WHERE status='paid';

--Of all the events in the 'watch_history' table, what is the duration of the longest individual watch event? What is the duration of the shortest?
SELECT MAX(watch_duration_in_minutes) AS 'max', MIN(watch_duration_in_minutes) AS 'min'
FROM watch_history;

Hacker News Trends

Analyse trends in the news stories submitted since 2007 to Hacker News, a tech news website run by Y Combinator:

--Dataset: Hacker News table 'hacker_news'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Find which of GitHub, Medium and The New York Times feeds Hacker News the most stories
SELECT CASE
    WHEN url LIKE '%github%' THEN 'GitHub'
    WHEN url LIKE '%medium%' THEN 'Medium'
    WHEN url LIKE '%nytimes%' THEN 'New York Times'
    ELSE 'Other'
  END AS 'Source',
  COUNT(*)
FROM hacker_news
GROUP BY 1;

--What's the best time to post a story?
SELECT strftime('%H', timestamp) AS 'Hours', ROUND(AVG(score),1) AS 'AVG Score', COUNT(*) AS 'No. of Stories'
FROM hacker_news
WHERE timestamp IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC;

The Metropolitan Museum of Art

The Met is one of the world’s largest art museums, and assistance is needed to analyse the museum’s collection database of American Decorative Arts:

--Dataset: The Metropolitan Museum of Art table 'met'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Familiarise yourself with the table
SELECT *
FROM met
LIMIT 10;

--How many pieces are in the American Decorative Art collection?
SELECT COUNT(*)
FROM met
WHERE department = 'American Decorative Arts';

--Count the number of pieces where the category includes 'celery'
SELECT COUNT(*)
FROM met
WHERE category LIKE '%celery%';

--Find the title and medium of the oldest piece(s) in the collection
SELECT title, medium, date
FROM met
ORDER BY date
LIMIT 5;

--Results show oldest pieces are from the 1600s. Display all from this era
SELECT title, medium, date
FROM met
WHERE date LIKE '%1600%';

--Find the top 10 countries with the most pieces in the collection
SELECT country, COUNT(*)
FROM met
WHERE country IS NOT NULL
GROUP BY country
ORDER BY 2 DESC
LIMIT 10;

--Find the categories HAVING more than 100 pieces
SELECT category, COUNT(*)
FROM met
WHERE category IS NOT NULL
GROUP BY category
HAVING COUNT(*) > 100
ORDER BY 2 DESC;

--Count the number of pieces where the medium contains 'gold' or 'silver'
SELECT CASE
   WHEN medium LIKE '%gold%'   THEN 'Gold'
   WHEN medium LIKE '%silver%' THEN 'Silver'
   ELSE NULL
  END AS 'Bling',
  COUNT(*)
FROM met
WHERE Bling IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC;
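SQLite accepts the `Bling` alias in the WHERE clause, but most other databases do not; a portable sketch wraps the CASE expression in a subquery so the alias exists before it is filtered on:

```sql
--Portable variant: filter on the alias via a subquery
SELECT Bling, COUNT(*)
FROM (
  SELECT CASE
     WHEN medium LIKE '%gold%'   THEN 'Gold'
     WHEN medium LIKE '%silver%' THEN 'Silver'
     ELSE NULL
    END AS Bling
  FROM met
)
WHERE Bling IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC;
```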

Basic Queries

New York Restaurants

Use SQL queries to find the best dinner places in your area:

--Dataset: New York Restaurants table 'nomnom'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Review nomnom table
SELECT *
FROM nomnom;

--Find distinct neighbourhoods
SELECT DISTINCT neighborhood
FROM nomnom;

--Find distinct cuisine types
SELECT DISTINCT cuisine
FROM nomnom;

--Find a list of Chinese restaurants in the dataset
SELECT *
FROM nomnom
WHERE cuisine = 'Chinese';

--Find all restaurants rated 4-stars and above
SELECT *
FROM nomnom
WHERE review >= 4;

--Return a list of all fancy ('$$$') Italian restaurants
SELECT *
FROM nomnom
WHERE cuisine = 'Italian' AND price = '$$$';

--Find the restaurant which you remember had the word 'meatball' in its name
SELECT *
FROM nomnom
WHERE name LIKE '%meatball%';

--Find all your nearby restaurants by location
SELECT *
FROM nomnom
WHERE neighborhood = 'Midtown' OR neighborhood = 'Downtown' OR neighborhood = 'Chinatown';
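The chained OR conditions can be written more compactly with the IN operator; this equivalent query returns the same rows:

```sql
--Equivalent query using IN instead of chained ORs
SELECT *
FROM nomnom
WHERE neighborhood IN ('Midtown', 'Downtown', 'Chinatown');
```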

--Only display restaurants which are pending a health rating
SELECT *
FROM nomnom
WHERE health IS NULL;

--Display the top 10 restaurants by review score
SELECT *
FROM nomnom
ORDER BY review DESC
LIMIT 10;

--Display the ratings converted from numeric scores to words
SELECT name,
 CASE
  WHEN review > 4.5 THEN 'Extraordinary'
  WHEN review > 4 THEN 'Excellent'
  WHEN review > 3 THEN 'Good'
  WHEN review > 2 THEN 'Fair'
  ELSE 'Poor'
 END AS 'Review'
FROM nomnom;

Reputable Product Agency - Fraud Detection and Customer Segmentation

Help the finance department of the RPA to identify potentially fraudulent transactions. Then assist the marketing department in segmenting the company's users according to different criteria:

--Dataset: Reputable Product Agency (RPA) tables 'transaction_data' and 'users'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

-- FRAUD DETECTION
-- What are the column names?
SELECT *
FROM transaction_data
LIMIT 10;

-- Find the full_names and emails
-- of the transactions listing 20252 as the zip code.
SELECT full_name, email
FROM transaction_data
WHERE zip=20252;

-- Find the names and emails of transactions made under the alias
-- 'Art Vandelay' or with 'der' as a middle name.
SELECT full_name, email
FROM transaction_data
WHERE full_name = 'Art Vandelay' OR full_name LIKE '% der %';

-- Find the ip_addresses and emails listed with these transactions.
SELECT ip_address, email
FROM transaction_data
WHERE ip_address LIKE '10.%';

-- Find the emails in transaction_data with
-- 'temp_email.com' as a domain.
SELECT email
FROM transaction_data
WHERE email LIKE '%temp_email.com';

-- The finance department is looking for a specific transaction. 
-- They know that the transaction occurred from an ip address starting 
-- with '120.' and their full name starts with 'John'.

-- Can you find the transaction?
SELECT *
FROM transaction_data
WHERE ip_address LIKE '120%' AND full_name LIKE 'John%';

-- ///////////////////////////////////////////////////////////////////

-- CUSTOMER SEGMENTATION
-- What are the column names?
SELECT *
FROM users
LIMIT 20;
 
-- Find the email addresses and birthdays of users whose 
-- birthday is between 1980-01-01 and 1989-12-31.
SELECT email, birthday
FROM users
WHERE birthday BETWEEN '1980-01-01' AND '1989-12-31';
   
-- Find the emails and creation date of users 
-- whose created_at date matches this condition.
SELECT email, created_at
FROM users
WHERE created_at < '2017-05-01';

-- Find the emails of the users who received the 'bears' test.
SELECT email
FROM users
WHERE test = 'bears';

-- Find all the emails of all users who 
-- received a campaign on website BBB.
SELECT email
FROM users
WHERE campaign LIKE 'BBB%';

-- Find all the emails of all users who received ad copy 2 in 
-- their campaign.
SELECT email
FROM users
WHERE campaign LIKE '%-2';

-- Find the emails for all users who received both a campaign and a test. 
-- These users will have non-empty entries in the 
-- campaign and test columns.
SELECT email
FROM users
WHERE campaign IS NOT NULL AND test IS NOT NULL;

-- One of the members of the marketing team had an idea of calculating
-- how old users were when they signed up.
SELECT email, (created_at - birthday) AS age_at_creation
FROM users
LIMIT 20;
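Because SQLite coerces a 'YYYY-MM-DD' string to its leading year when subtracting, the query above counts the difference in calendar years only. A sketch of a more precise age at sign-up, using SQLite's julianday function (assuming both columns hold 'YYYY-MM-DD' text):

```sql
--Approximate age in whole years at sign-up
SELECT email,
  CAST((julianday(created_at) - julianday(birthday)) / 365.25 AS INTEGER) AS age_at_creation
FROM users
LIMIT 20;
```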

Davie’s Burgers Subway Ad

Help Davie’s Burgers to search through their past orders to find interesting material that can be used for a marketing campaign:

--Dataset: Davie's Burgers Restaurant table 'orders'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

-- What are the column names?
SELECT *
FROM orders
LIMIT 20;

-- How recent is this data?
SELECT DISTINCT order_date
FROM orders
ORDER BY order_date DESC;

-- Instead of selecting all the columns using *, 
-- write a query that selects only the special_instructions column.
-- Limit the result to 20 rows.
SELECT special_instructions
FROM orders
LIMIT 20;

-- Can you edit the query so that we are only 
-- returning the special instructions that are not empty?
SELECT special_instructions
FROM orders
WHERE special_instructions IS NOT NULL;
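Depending on how the data was loaded, "empty" instructions may be stored as NULL or as empty strings; a defensive filter (an assumption about the data, not verified against this table) covers both cases:

```sql
--Filter out both NULL and empty-string instructions
SELECT special_instructions
FROM orders
WHERE special_instructions IS NOT NULL
  AND special_instructions != '';
```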

-- Let's go even further and sort the instructions 
-- in alphabetical order (A-Z).
SELECT special_instructions
FROM orders
WHERE special_instructions IS NOT NULL
ORDER BY special_instructions;


-- Let's search for special instructions that have the word 'sauce'.
SELECT special_instructions
FROM orders
WHERE special_instructions LIKE '%sauce%';

-- Let's search for special instructions that have the word 'door'.
-- Any funny or interesting ones?
SELECT special_instructions
FROM orders
WHERE special_instructions LIKE '%door%';

-- Let's search for special instructions that have the word 'box'.
-- Any funny or interesting ones?
SELECT special_instructions
FROM orders
WHERE special_instructions LIKE '%box%';

-- Instead of just returning the special instructions, also return their order ids.
-- For more readability:
-- Rename id as '#'
-- Rename special_instructions as 'Notes'
SELECT id AS '#', special_instructions AS 'Notes'
FROM orders
WHERE special_instructions LIKE '%box%';