Data Analysis in SQL
Below are examples of various SQL analyses I have carried out, organised by the different functions and concepts I implemented.
Use the links below to jump to your desired section:
Window Functions | Marketing Attribution | Churn Rate | Funnels | Subqueries | Multiple Tables | Aggregate Functions | Basic Queries
Window Functions
Historical U.S. Climate Change Investigation
Climate data is available for each state in the United States dating back to 1895. Window functions will be used in this investigation to draw insights about the differences in temperature by region and by year:
--Dataset: Historical U.S. Climate Change Investigation table 'state_climate'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Explore the dataset
SELECT * FROM state_climate;

--Calculate the running average temperature over time in each state
SELECT state, year, tempf, tempc,
  AVG(tempc) OVER (PARTITION BY state ORDER BY year) AS 'running_avg_temp'
FROM state_climate;

--Display the lowest temperature for each state alongside the other columns
SELECT state, year, tempf, tempc,
  FIRST_VALUE(tempc) OVER (PARTITION BY state ORDER BY tempc) AS 'lowest_temp'
FROM state_climate;

--Display the highest temperature for each state alongside the other columns
SELECT state, year, tempf, tempc,
  LAST_VALUE(tempc) OVER (
    PARTITION BY state ORDER BY tempc
    RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS 'highest_temp'
FROM state_climate;

--Look at how the temperature changed year-on-year in each state
SELECT state, year, tempf, tempc,
  tempc - LAG(tempc, 1, tempc) OVER (PARTITION BY state ORDER BY year) AS 'change_in_tempc'
FROM state_climate
ORDER BY ABS(change_in_tempc) DESC;

--Display a rank of the coldest temperatures on record
SELECT year, state, tempf, tempc,
  RANK() OVER (ORDER BY tempc) AS 'rank'
FROM state_climate;

--Display a rank of the warmest temperatures for each state on record
SELECT year, state, tempf, tempc,
  RANK() OVER (PARTITION BY state ORDER BY tempc DESC) AS 'rank'
FROM state_climate;

--Display the average yearly temperature partitioned by quartiles for each state
SELECT year, state, tempf, tempc,
  NTILE(4) OVER (PARTITION BY state ORDER BY tempc) AS 'quartile'
FROM state_climate;

--Display the average yearly temperature partitioned by quintiles for the whole country
SELECT year, state, tempf, tempc,
  NTILE(5) OVER (ORDER BY tempc) AS 'quintile'
FROM state_climate;
Social Media Analytics
Aggregate functions provide useful metrics by collapsing several rows of a table into a single value (e.g. the average or sum of a column). However, data scientists often need such a calculation to be performed while still displaying the original rows of data alongside the newly calculated value, and this is exactly what window functions provide. Window functions are used in this project to perform more complex data analysis on various social media platforms.
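As a minimal illustration of the difference, the sketch below uses a small hypothetical 'followers' table (not part of the project dataset): a plain AVG() would collapse the three rows into a single summary value, whereas AVG() with an OVER clause keeps every row and adds a running average alongside it.

--Hypothetical inline data for illustration only (window functions require SQLite 3.25+)
WITH followers (username, month, change_in_followers) AS (
  VALUES ('instagram', '2018-01', 1000),
         ('instagram', '2018-02', 1500),
         ('instagram', '2018-03', 1200)
)
--Removing 'OVER (ORDER BY month)' would turn this into a plain aggregate and return a single row
SELECT username, month, change_in_followers,
  AVG(change_in_followers) OVER (ORDER BY month) AS 'running_avg'
FROM followers;

The project queries below apply the same idea to the 'social_media' and 'streams' tables: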
--Dataset: Social Media Analytics tables 'social_media', 'streams'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Display the running total and the running average of followers for the 'instagram' user
SELECT month, change_in_followers,
  ROUND(SUM(change_in_followers) OVER (ORDER BY month), 2) AS 'running_total',
  ROUND(AVG(change_in_followers) OVER (ORDER BY month), 2) AS 'running_avg'
FROM social_media
WHERE username = 'instagram';

--Display the running total and the running average of followers for all recorded accounts
SELECT username, month, change_in_followers,
  ROUND(SUM(change_in_followers) OVER (PARTITION BY username ORDER BY month), 2) AS 'running_total_followers_change',
  ROUND(AVG(change_in_followers) OVER (PARTITION BY username ORDER BY month), 2) AS 'running_average_followers_change'
FROM social_media;

--Display the least number of posts recorded for a given month, per user, alongside their monthly number of posts
SELECT username, posts,
  FIRST_VALUE(posts) OVER (PARTITION BY username ORDER BY posts) AS fewest_posts
FROM social_media;

--Display the most number of posts recorded for a given month, per user, alongside their monthly number of posts
SELECT username, posts,
  LAST_VALUE(posts) OVER (
    PARTITION BY username ORDER BY posts
    RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS most_posts
FROM social_media;

--Display Lady Gaga's monthly change in number of streams and monthly change in chart position. Use LAG() functions
SELECT artist, week, streams_millions,
  ROUND(streams_millions - LAG(streams_millions, 1, streams_millions) OVER (PARTITION BY artist ORDER BY week), 2) AS 'streams_millions_change',
  chart_position,
  LAG(chart_position, 1, chart_position) OVER (PARTITION BY artist ORDER BY week) - chart_position AS chart_position_change
FROM streams
WHERE artist = 'Lady Gaga';

--Alter the previous query to calculate the same new columns but now for all artists and using LEAD() functions
SELECT artist, week, streams_millions,
  ROUND((LEAD(streams_millions, 1) OVER (PARTITION BY artist ORDER BY week) - streams_millions), 2) AS 'streams_millions_change',
  chart_position,
  chart_position - LEAD(chart_position, 1) OVER (PARTITION BY artist ORDER BY week) AS 'chart_position_change'
FROM streams;

--Use window functions to add row numbers to an ordered list of weekly streams per artist.
--Number 1 in the list should be the artist (and week) with the least streams
SELECT ROW_NUMBER() OVER (ORDER BY streams_millions) AS 'row_num',
  artist, week, streams_millions
FROM streams;

--Alter the previous query to display rank instead of row number. Tied numbers of streams should display the same rank
SELECT RANK() OVER (ORDER BY streams_millions) AS 'rank',
  artist, week, streams_millions
FROM streams;

--Alter the previous query to display the ranked list of artists partitioned by week so we can see the most streamed artist of each week
SELECT RANK() OVER (PARTITION BY week ORDER BY streams_millions DESC) AS 'rank',
  artist, week, streams_millions
FROM streams;

--Break the data into quintiles according to their number of streams per week
SELECT NTILE(5) OVER (ORDER BY streams_millions DESC) AS 'quintile',
  artist, week, streams_millions
FROM streams;

--Break the data into quartiles for each week according to the number of streams per week,
--i.e. each quartile contains two artists because we have eight artists in total for each week in the dataset
SELECT NTILE(4) OVER (PARTITION BY week ORDER BY streams_millions DESC) AS 'quartile',
  artist, week, streams_millions
FROM streams;
Marketing Attribution
CoolTShirts Marketing Campaign
A t-shirts company recently started various marketing campaigns to increase website traffic and revenue. They would like assistance in mapping their customers' journey through their website using touch attribution:
--Dataset: CoolTShirts Marketing Campaign table 'page_visits'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Become familiar with the data. Find the number of distinct campaigns
SELECT COUNT(DISTINCT utm_campaign) FROM page_visits;

--Find the number of distinct UTM sources
SELECT COUNT(DISTINCT utm_source) FROM page_visits;

--Find how many campaigns there are per source
SELECT utm_source, COUNT(DISTINCT utm_campaign) AS 'campaigns_per_source'
FROM page_visits
GROUP BY utm_source;

--Identify the pages that are on the CoolTShirts website
SELECT DISTINCT page_name FROM page_visits;

--How many first touches is each campaign responsible for?
WITH first_touch AS (
  SELECT user_id, MIN(timestamp) AS first_touch_at
  FROM page_visits
  GROUP BY user_id
)
SELECT pv.utm_campaign, COUNT(*)
FROM first_touch ft
JOIN page_visits pv
  ON ft.user_id = pv.user_id
  AND ft.first_touch_at = pv.timestamp
GROUP BY pv.utm_campaign;

--How many last touches is each campaign responsible for?
WITH last_touch AS (
  SELECT user_id, MAX(timestamp) AS last_touch_at
  FROM page_visits
  GROUP BY user_id
)
SELECT pv.utm_campaign, COUNT(*)
FROM last_touch lt
JOIN page_visits pv
  ON lt.user_id = pv.user_id
  AND lt.last_touch_at = pv.timestamp
GROUP BY pv.utm_campaign;

--How many visitors make a purchase? (the purchase page is labeled '4 - purchase')
SELECT COUNT(DISTINCT user_id) FROM page_visits WHERE page_name LIKE '4%';

--How many last touches on the purchase page is each campaign responsible for?
WITH last_touch AS (
  SELECT user_id, MAX(timestamp) AS last_touch_at
  FROM page_visits
  WHERE page_name LIKE '4%'
  GROUP BY user_id
)
SELECT pv.utm_campaign, COUNT(*)
FROM last_touch lt
JOIN page_visits pv
  ON lt.user_id = pv.user_id
  AND lt.last_touch_at = pv.timestamp
GROUP BY pv.utm_campaign;
First and Last Touch Attribution with UTM Parameters
Web developers use UTM parameters in the ads they display on other websites to determine the source of the traffic to their own website. For example, for an e-commerce website advertising on social media, it is helpful to know where a user was browsing when they first clicked on one of its ads ('first-touch attribution'). However, that user may have left the online store and only completed their purchase later, after clicking on a subsequent ad. The visit that preceded the purchase, and where its traffic originated from, is also recorded ('last-touch attribution'). Analyse these results to improve the company's marketing and online presence.
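As a minimal sketch of the idea on hypothetical inline data (not the project's 'page_visits' table), the earliest timestamp per user marks the first touch and the latest marks the last touch:

--Hypothetical inline data for illustration only
WITH visits (user_id, utm_source, ts) AS (
  VALUES (1, 'facebook', '2021-01-01 09:00:00'),
         (1, 'email',    '2021-01-03 18:30:00'),
         (1, 'google',   '2021-01-05 21:15:00')
)
SELECT user_id,
  MIN(ts) AS 'first_touch_at',  --the 'facebook' visit
  MAX(ts) AS 'last_touch_at'    --the 'google' visit
FROM visits
GROUP BY user_id;

The attribution queries below join these first- and last-touch timestamps back to 'page_visits' to recover the UTM source of each touch: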
--Dataset: First and Last Touch Attribution with UTM Parameters table 'page_visits'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Write a first-touch attribution query for the users stored in the database. Display the associated UTM source
WITH first_touch AS (
  SELECT user_id, MIN(timestamp) AS 'first_touch_at'
  FROM page_visits
  GROUP BY user_id
)
SELECT ft.user_id, ft.first_touch_at, pv.utm_source
FROM first_touch AS 'ft'
JOIN page_visits AS 'pv'
  ON ft.user_id = pv.user_id
  AND ft.first_touch_at = pv.timestamp;

--Write a last-touch attribution query for the users stored in the database. Display the associated UTM source
WITH last_touch AS (
  SELECT user_id, MAX(timestamp) AS 'last_touch_at'
  FROM page_visits
  GROUP BY user_id
)
SELECT lt.user_id, lt.last_touch_at, pv.utm_source
FROM last_touch AS 'lt'
JOIN page_visits AS 'pv'
  ON lt.user_id = pv.user_id
  AND lt.last_touch_at = pv.timestamp;
Churn Rate
Codeflix Video Streaming SaaS Churn Rate
A common revenue model for SaaS (software as a service) companies is to charge a monthly subscription fee for access to their product. Churn rate is the ratio of subscribers who cancel within a given period, usually a month, to the total number of active subscribers in that period. The queries that follow calculate this value for the required months for the video streaming company Codeflix.
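First, as a quick sanity check of the definition, a minimal sketch with hypothetical figures (not taken from the Codeflix dataset), where 50 of 1,000 active subscribers cancel during the month:

--Hypothetical figures for illustration only: 1,000 active subscribers, 50 cancellations
SELECT ROUND(1.0 * 50 / 1000, 3) AS 'churn_rate';  --returns 0.05, i.e. 5% monthly churn

The Codeflix calculations below derive the active and cancelled counts from the 'subscriptions' table instead of hard-coding them: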
--Dataset: Codeflix Video Streaming SaaS Churn Rate table 'subscriptions'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Calculate the churn for the month of January 2017 directly with subqueries
SELECT ROUND(1.0 *
  (SELECT COUNT(*) FROM subscriptions
   WHERE subscription_start < '2017-01-01'
   AND (subscription_end BETWEEN '2017-01-01' AND '2017-01-31')) /
  (SELECT COUNT(*) FROM subscriptions
   WHERE subscription_start < '2017-01-01'
   AND ((subscription_end >= '2017-01-01') OR (subscription_end IS NULL))),
  3) AS result;

--Recalculate the churn for January 2017, now using a system that can then be scaled to multiple months
WITH enrollments AS (
  SELECT * FROM subscriptions
  WHERE subscription_start < '2017-01-01'
  AND ((subscription_end >= '2017-01-01') OR (subscription_end IS NULL))
),
status AS (
  SELECT
    CASE WHEN (subscription_end > '2017-01-31') OR (subscription_end IS NULL) THEN 0 ELSE 1 END AS is_canceled,
    CASE WHEN subscription_start < '2017-01-01'
         AND ((subscription_end >= '2017-01-01') OR (subscription_end IS NULL)) THEN 1 ELSE 0 END AS is_active
  FROM enrollments
)
SELECT ROUND((1.0 * SUM(is_canceled) / SUM(is_active)), 3) AS 'churn_january_2017'
FROM status;

--Calculate the churn rate for multiple months (January 2017, February 2017 and March 2017) by adapting the previous query
WITH months AS (
  SELECT '2017-01-01' AS first_day, '2017-01-31' AS last_day
  UNION
  SELECT '2017-02-01' AS first_day, '2017-02-28' AS last_day
  UNION
  SELECT '2017-03-01' AS first_day, '2017-03-31' AS last_day
),
cross_join AS (
  SELECT * FROM subscriptions CROSS JOIN months
),
status AS (
  SELECT id, first_day AS month,
    CASE WHEN (subscription_start < first_day)
         AND (subscription_end > first_day OR subscription_end IS NULL) THEN 1 ELSE 0 END AS is_active,
    CASE WHEN subscription_end BETWEEN first_day AND last_day THEN 1 ELSE 0 END AS is_canceled
  FROM cross_join
),
status_aggregate AS (
  SELECT month, SUM(is_active) AS active, SUM(is_canceled) AS canceled
  FROM status
  GROUP BY month
)
SELECT month, ROUND((1.0 * canceled / active), 3) AS churn_rate
FROM status_aggregate;
A/B Testing User Segments
A company provides a subscription-based service to a range of customers. Customers were acquired through two distinct subscription channels (labelled '87' and '30'). The marketing department would like to compare the two user segments by analysing their churn rates for the months provided in the dataset:
--Dataset: A/B Testing User Segments table 'subscriptions'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Explore the dataset
SELECT * FROM subscriptions LIMIT 20;

--Determine the range of months of data provided
SELECT MIN(subscription_start), MAX(subscription_start), MIN(subscription_end), MAX(subscription_end)
FROM subscriptions;

--Calculate the churn rate for each user segment and for multiple months (January 2017, February 2017 and March 2017)
WITH months AS (
  SELECT '2017-01-01' AS first_day, '2017-01-31' AS last_day
  UNION
  SELECT '2017-02-01' AS first_day, '2017-02-28' AS last_day
  UNION
  SELECT '2017-03-01' AS first_day, '2017-03-31' AS last_day
),
cross_join AS (
  SELECT * FROM subscriptions CROSS JOIN months
),
status AS (
  SELECT id, first_day AS month,
    CASE WHEN segment = '87' AND subscription_start < first_day
         AND (subscription_end > last_day OR subscription_end IS NULL) THEN 1 ELSE 0 END AS 'is_active_87',
    CASE WHEN segment = '87' AND subscription_end BETWEEN first_day AND last_day THEN 1 ELSE 0 END AS 'is_canceled_87',
    CASE WHEN segment = '30' AND subscription_start < first_day
         AND (subscription_end > last_day OR subscription_end IS NULL) THEN 1 ELSE 0 END AS 'is_active_30',
    CASE WHEN segment = '30' AND subscription_end BETWEEN first_day AND last_day THEN 1 ELSE 0 END AS 'is_canceled_30'
  FROM cross_join
),
status_aggregate AS (
  SELECT month,
    SUM(is_active_87) AS 'sum_active_87',
    SUM(is_active_30) AS 'sum_active_30',
    SUM(is_canceled_87) AS 'sum_canceled_87',
    SUM(is_canceled_30) AS 'sum_canceled_30'
  FROM status
  GROUP BY month
)
SELECT month,
  ROUND((1.0 * sum_canceled_30 / sum_active_30), 3) AS 'churn_30',
  ROUND((1.0 * sum_canceled_87 / sum_active_87), 3) AS 'churn_87'
FROM status_aggregate;
Funnels
Usage Funnels with Warby Parker
Warby Parker is a transformative lifestyle brand in the eyeglasses and sunglasses industry. Analyse different Warby Parker marketing funnels in order to provide them with actionable insights:
--Dataset: Usage Funnels with Warby Parker tables 'survey', 'quiz', 'home_try_on', 'purchase'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Join the three tables to determine which users move on from the quiz stage to the try-on stage, and then to the purchase stage.
--Use 'IS NOT NULL' as a 'marker' for which customers remain at each stage
SELECT q.user_id,
  h.user_id IS NOT NULL AS 'is_home_try_on',
  h.number_of_pairs,
  p.user_id IS NOT NULL AS 'is_purchase'
FROM quiz q
LEFT JOIN home_try_on h ON h.user_id = q.user_id
LEFT JOIN purchase p ON p.user_id = h.user_id;

--Calculate the conversion of customers from one stage to the next. Write this as a standalone query
WITH funnel_calcs_table AS (
  SELECT q.user_id,
    h.user_id IS NOT NULL AS 'is_home_try_on',
    h.number_of_pairs,
    p.user_id IS NOT NULL AS 'is_purchase'
  FROM quiz q
  LEFT JOIN home_try_on h ON h.user_id = q.user_id
  LEFT JOIN purchase p ON p.user_id = h.user_id
)
SELECT COUNT(user_id) AS 'users_quizzed',
  SUM(is_home_try_on) AS 'users_tried_on',
  SUM(is_purchase) AS 'users_purchased'
FROM funnel_calcs_table;

--CUSTOMER SURVEY FUNNEL
--Review table
SELECT * FROM survey LIMIT 10;

--Create a funnel for Warby Parker which displays the number of users that answer each question of a survey
SELECT question, COUNT(DISTINCT user_id)
FROM survey
GROUP BY question;
Funnel for Mattresses and More
The management team would like to see the conversion of users from the browse stage to the checkout stage, and from the checkout stage to the purchase stage. Use SQL queries to analyse the data and provide the required insight:
--Dataset: Mattresses and More tables 'browse', 'checkout', 'purchase'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Create a funnel for Mattresses and More which displays the number of users at each stage (browse, checkout and purchase) for each date available
WITH funnels AS (
  SELECT DISTINCT b.browse_date, b.user_id,
    c.user_id IS NOT NULL AS 'is_checkout',
    p.user_id IS NOT NULL AS 'is_purchase'
  FROM browse AS 'b'
  LEFT JOIN checkout AS 'c' ON c.user_id = b.user_id
  LEFT JOIN purchase AS 'p' ON p.user_id = c.user_id
)
SELECT browse_date,
  COUNT(*) AS 'num_browse',
  SUM(is_checkout) AS 'num_checkout',
  SUM(is_purchase) AS 'num_purchase',
  ROUND(1.0 * SUM(is_checkout) / COUNT(user_id), 3) AS 'browse_to_checkout',
  ROUND(1.0 * SUM(is_purchase) / SUM(is_checkout), 3) AS 'checkout_to_purchase'
FROM funnels
GROUP BY browse_date
ORDER BY browse_date;
A/B Testing Modal Popups
A company shows modal popups to new users of its website. The product team has created a new popup design and would like to test how this variant compares with the control version in terms of how many users progress through the funnel to the final page of the website (the checkout stage):
--Dataset: A/B Testing Modal Popups table 'onboarding_modals'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Familiarise yourself with the table
SELECT * FROM onboarding_modals LIMIT 10;

--Calculate the number of distinct user ids for each modal text displayed. This calculates the number of users at each step of the funnel
SELECT modal_text, COUNT(DISTINCT user_id)
FROM onboarding_modals
GROUP BY modal_text
ORDER BY modal_text;

--Calculate the number of users at each step of the funnel, but this time separate them into the 'control' and 'variant' groups
SELECT modal_text,
  COUNT(DISTINCT CASE WHEN ab_group = 'control' THEN user_id END) AS 'control_clicks',
  COUNT(DISTINCT CASE WHEN ab_group = 'variant' THEN user_id END) AS 'variant_clicks'
FROM onboarding_modals
GROUP BY 1
ORDER BY 1;
Subqueries
VR Startup Talent Pool
As a member of the Project Completion team, my responsibility is to analyse our current talent pool to ensure each project can be completed effectively:
--Dataset: VR Startup Talent Pool tables 'employees', 'projects'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Display the names of projects that were not chosen by any employees
SELECT * FROM projects
WHERE project_id NOT IN (
  SELECT current_project FROM employees WHERE current_project IS NOT NULL
);

--What is the name of the project chosen by the most employees?
SELECT projects.project_name, COUNT(*) AS 'No. of employees who chose project'
FROM employees
JOIN projects ON employees.current_project = projects.project_id
WHERE employees.current_project IS NOT NULL
GROUP BY projects.project_name
ORDER BY 2 DESC
LIMIT 1;

--Which projects were chosen by multiple employees?
SELECT projects.project_name, COUNT(*) AS 'No. of employees who chose project'
FROM employees
JOIN projects ON employees.current_project = projects.project_id
WHERE employees.current_project IS NOT NULL
GROUP BY projects.project_name
HAVING COUNT(*) > 1
ORDER BY 2 DESC;

--If each project needs at least 2 developers, how many available project positions are there for developers?
SELECT (COUNT(*) * 2) - (
  SELECT COUNT(*) FROM employees
  WHERE current_project IS NOT NULL AND position = 'Developer'
) AS 'Count'
FROM projects;

--When employees are hired, they are given the Myers-Briggs personality test. Which personality is the most common among the employees?
SELECT personality, COUNT(*)
FROM employees
GROUP BY personality
ORDER BY 2 DESC;

--What are the names of projects chosen by employees with the most common personality type?
SELECT project_name FROM projects
WHERE project_id IN (
  SELECT current_project FROM employees
  WHERE current_project IS NOT NULL
  AND personality IN (
    SELECT personality FROM employees
    GROUP BY personality
    ORDER BY COUNT(*) DESC
    LIMIT 1
  )
);

--Find the personality type most represented by employees with a selected project
SELECT first_name, last_name, personality, projects.project_name
FROM employees
JOIN projects ON employees.current_project = projects.project_id
WHERE personality = (
  SELECT personality FROM employees
  WHERE current_project IS NOT NULL
  GROUP BY personality
  ORDER BY COUNT(personality) DESC
  LIMIT 1
);

--For each employee, provide their name, personality, the names of any projects they've chosen, and the number of incompatible co-workers
--(according to Myers-Briggs, some personality types are incompatible with others)
SELECT last_name, first_name, personality, project_name,
  CASE
    WHEN personality = 'INFP' OR personality = 'ENFP' OR personality = 'INFJ' OR personality = 'ENFJ'
      THEN (SELECT COUNT(*) FROM employees
            WHERE personality IN ('ISFP', 'ESFP', 'ISTP', 'ESTP', 'ISFJ', 'ESFJ', 'ISTJ', 'ESTJ'))
    WHEN personality = 'ISFP'
      THEN (SELECT COUNT(*) FROM employees
            WHERE personality IN ('INFP', 'ENFP', 'INFJ'))
    WHEN personality = 'ESFP' OR personality = 'ISTP' OR personality = 'ESTP' OR personality = 'ISFJ'
         OR personality = 'ESFJ' OR personality = 'ISTJ' OR personality = 'ESTJ'
      THEN (SELECT COUNT(*) FROM employees
            WHERE personality IN ('INFP', 'ENFP', 'INFJ', 'ENFJ'))
    ELSE 0
  END AS 'No. of Incompatible Colleagues'
FROM employees
LEFT JOIN projects ON employees.current_project = projects.project_id;
Welp Reviews
There’s an exciting new app called Welp which people are using to leave reviews of the businesses they have visited. Analysis is required to draw insights from these reviews:
--Dataset: Welp Reviews tables 'places', 'reviews'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Familiarise yourself with the tables
SELECT * FROM places;
SELECT * FROM reviews;

--Find all places that cost $20 or less (each '$' sign represents $10)
SELECT * FROM places
WHERE price_point = '$$' OR price_point = '$';

--Show all reviews for restaurants that have at least one review
SELECT * FROM places
INNER JOIN reviews ON places.id = reviews.place_id;

--Find the id and name of all the places without reviews in the database
SELECT id, name FROM places
WHERE id NOT IN (SELECT place_id FROM reviews);

--Select all the reviews that happened in 2020
WITH temporary_table AS (
  SELECT * FROM reviews
  WHERE strftime('%Y', review_date) = '2020'
)
SELECT * FROM places
JOIN temporary_table ON places.id = temporary_table.place_id;

--Find the reviewers with the most reviews that are rated below the average rating for that place
WITH full_temp_table AS (
  SELECT * FROM places
  JOIN reviews ON places.id = reviews.place_id
)
SELECT username, COUNT(*)
FROM full_temp_table
WHERE rating < average_rating
GROUP BY username
ORDER BY 2 DESC;
Multiple Tables
Songify Music Streaming
A music streaming company offers both a free product and a premium paid product. Assistance is needed to analyse their customer database, which consists of six tables:
--Dataset: Songify Music Streaming tables 'plans', 'users', 'premium_users', 'songs', 'months', 'plays'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Let's see which plans are used by which premium members!
SELECT premium_users.user_id, plans.description
FROM premium_users
JOIN plans ON premium_users.membership_plan_id = plans.id;

--Let's see the titles of songs that were played by each user!
SELECT plays.user_id, plays.play_date, songs.title
FROM plays
JOIN songs ON plays.song_id = songs.id;

--Which users aren't premium users?
SELECT users.id
FROM users
LEFT JOIN premium_users ON users.id = premium_users.user_id
WHERE premium_users.user_id IS NULL;

--Find which users played songs in January, but not February
WITH january AS (
  SELECT * FROM plays WHERE strftime('%m', play_date) = '01'
),
february AS (
  SELECT * FROM plays WHERE strftime('%m', play_date) = '02'
)
SELECT january.user_id
FROM january
LEFT JOIN february ON january.user_id = february.user_id
WHERE february.user_id IS NULL;

--For each month, we want to know if each user in 'premium_users' was active or canceled.
SELECT premium_users.user_id, months.months,
  CASE WHEN (premium_users.purchase_date <= months.months)
       AND (premium_users.cancel_date >= months.months OR premium_users.cancel_date IS NULL)
       THEN 'active' ELSE 'not_active' END AS 'status'
FROM premium_users
CROSS JOIN months;

--Songify has added some new songs to their catalog. Combine the existing songs with the new bonus songs
SELECT * FROM songs
UNION
SELECT * FROM bonus_songs
LIMIT 10;

--Find how many times each song was played, displaying the title, artist and play count
WITH play_count AS (
  SELECT song_id, COUNT(*) AS 'times_played'
  FROM plays
  GROUP BY song_id
)
SELECT songs.title, songs.artist, play_count.times_played
FROM songs
JOIN play_count ON songs.id = play_count.song_id;
Fictional Reddit Analytics
Reddit is organised into subreddits, or topics, where users can contribute through posts. A database of fictional Reddit data, spread across multiple tables, needs to be analysed using queries, subqueries and joins:
--Dataset: Fictional Reddit Analytics tables 'users', 'posts', 'subreddits'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Join the data from the different tables to find how many posts each user has made
SELECT username, COUNT(*)
FROM users
LEFT JOIN posts ON users.id = posts.user_id
GROUP BY username
ORDER BY 2 DESC;

--Join the data to show only existing posts where users are still active
SELECT * FROM posts
INNER JOIN users ON posts.user_id = users.id;

--Some new posts have been added to Reddit. Concatenate them with the old posts into a single table
SELECT * FROM posts
UNION
SELECT * FROM posts2;

--Find which subreddits have the most popular posts (i.e. posts with a score of at least 5000)
WITH popular_posts AS (
  SELECT * FROM posts WHERE score >= 5000
)
SELECT * FROM subreddits
INNER JOIN popular_posts ON subreddits.id = popular_posts.subreddit_id
ORDER BY popular_posts.score DESC;

--Now find the most popular post for each subreddit
SELECT subreddits.name AS 'subreddit_name', title AS 'post_highest_score', MAX(score) AS 'score'
FROM posts
INNER JOIN subreddits ON posts.subreddit_id = subreddits.id
GROUP BY subreddits.name;

--Calculate the average score of all the posts for each subreddit
SELECT subreddits.name AS 'subreddit_name', ROUND(AVG(score), 0) AS 'avg. score'
FROM posts
INNER JOIN subreddits ON posts.subreddit_id = subreddits.id
GROUP BY subreddits.name;
Aggregate Functions
Codeflix Video Streaming
A new streaming video company needs assistance analysing their user database:
--Dataset: Codeflix Video Streaming tables 'users', 'payments', 'watch_history'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Determine the number of users that have an email ending in '.com'
SELECT COUNT(*) FROM users WHERE email LIKE '%.com';

--What are the most popular first names on Codeflix?
SELECT first_name, COUNT(*) AS 'Count'
FROM users
GROUP BY first_name
ORDER BY 2 DESC;

--The UX Research team wants to see a distribution of watch durations
SELECT ROUND(watch_duration_in_minutes) AS 'duration', COUNT(*) AS 'count'
FROM watch_history
GROUP BY 1
ORDER BY 1;

--Find all the users that have successfully made a payment to Codeflix and find their total amount paid
SELECT user_id, SUM(amount) AS 'total per user'
FROM payments
WHERE status = 'paid'
GROUP BY user_id
ORDER BY 2 DESC;

--Generate a table of user ids and total watch duration for users who watched more than 400 minutes of content
SELECT user_id, SUM(watch_duration_in_minutes) AS 'total watch time'
FROM watch_history
GROUP BY user_id
HAVING SUM(watch_duration_in_minutes) > 400;

--To the nearest minute, how many minutes of content were streamed on Codeflix?
SELECT ROUND(SUM(watch_duration_in_minutes), 0) FROM watch_history;

--Which days in this period did Codeflix collect the most money?
SELECT pay_date, SUM(amount) AS 'total per day'
FROM payments
WHERE status = 'paid'
GROUP BY pay_date
ORDER BY 2 DESC;

--When users successfully pay Codeflix, what is the average payment amount?
SELECT AVG(amount) FROM payments WHERE status = 'paid';

--Of all the events in the 'watch_history' table, what is the duration of the longest individual watch event? What is the duration of the shortest?
SELECT MAX(watch_duration_in_minutes) AS 'max', MIN(watch_duration_in_minutes) AS 'min'
FROM watch_history;
Hacker News Trends
Analyse trends in the news stories published to Hacker News, a tech news website run by Y Combinator, since 2007:
--Dataset: Hacker News table 'hacker_news'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Find which sites feed Hacker News the most between GitHub, Medium and The New York Times.
SELECT CASE
    WHEN url LIKE '%github%' THEN 'GitHub'
    WHEN url LIKE '%medium%' THEN 'Medium'
    WHEN url LIKE '%nytimes%' THEN 'New York Times'
    ELSE 'Other'
  END AS 'Source',
  COUNT(*)
FROM hacker_news
GROUP BY 1;

--What's the best time to post a story?
SELECT strftime('%H', timestamp) AS 'Hours',
  ROUND(AVG(score), 1) AS 'AVG Score',
  COUNT(*) AS 'No. of Stories'
FROM hacker_news
WHERE timestamp IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC;
The Metropolitan Museum of Art
The Metropolitan Museum of Art is one of the world's largest art museums; assistance is needed to analyse its collection database of American Decorative Arts:
--Dataset: The Metropolitan Museum of Art table 'met'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Familiarise yourself with the table
SELECT * FROM met LIMIT 10;

--How many pieces are in the American Decorative Arts collection?
SELECT COUNT(*) FROM met WHERE department = 'American Decorative Arts';

--Count the number of pieces where the category includes 'celery'
SELECT COUNT(*) FROM met WHERE category LIKE '%celery%';

--Find the title and medium of the oldest piece(s) in the collection
SELECT title, medium, date FROM met ORDER BY date LIMIT 5;

--Results show oldest pieces are from the 1600s. Display all from this era
SELECT title, medium, date FROM met WHERE date LIKE '%1600%';

--Find the top 10 countries with the most pieces in the collection
SELECT country, COUNT(*)
FROM met
WHERE country IS NOT NULL
GROUP BY country
ORDER BY 2 DESC
LIMIT 10;

--Find the categories HAVING more than 100 pieces
SELECT category, COUNT(*)
FROM met
WHERE category IS NOT NULL
GROUP BY category
HAVING COUNT(*) > 100
ORDER BY 2 DESC;

--Count the number of pieces where the medium contains 'gold' or 'silver'
SELECT CASE
    WHEN medium LIKE '%gold%' THEN 'Gold'
    WHEN medium LIKE '%silver%' THEN 'Silver'
    ELSE NULL
  END AS 'Bling',
  COUNT(*)
FROM met
WHERE Bling IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC;
Basic Queries
New York Restaurants
Use SQL queries to find the best dinner places in your area:
--Dataset: New York Restaurants table 'nomnom'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

--Review nomnom table
SELECT * FROM nomnom;

--Find distinct neighbourhoods
SELECT DISTINCT neighborhood FROM nomnom;

--Find distinct cuisine types
SELECT DISTINCT cuisine FROM nomnom;

--Find a list of Chinese restaurants in the dataset
SELECT * FROM nomnom WHERE cuisine = 'Chinese';

--Find all restaurants rated 4 stars and above
SELECT * FROM nomnom WHERE review >= 4;

--Return a list of all fancy ('$$$') Italian restaurants
SELECT * FROM nomnom WHERE cuisine = 'Italian' AND price = '$$$';

--Find the restaurant which you remember had the word 'meatball' in its name
SELECT * FROM nomnom WHERE name LIKE '%meatball%';

--Find all your nearby restaurants by location
SELECT * FROM nomnom
WHERE neighborhood = 'Midtown' OR neighborhood = 'Downtown' OR neighborhood = 'Chinatown';

--Only display restaurants which are pending a health rating
SELECT * FROM nomnom WHERE health IS NULL;

--Create a table with the Top 10 restaurants based on review score
SELECT * FROM nomnom ORDER BY review DESC LIMIT 10;

--Create a table where the rating system is changed from score numbers to words
SELECT name,
  CASE
    WHEN review > 4.5 THEN 'Extraordinary'
    WHEN review > 4 THEN 'Excellent'
    WHEN review > 3 THEN 'Good'
    WHEN review > 2 THEN 'Fair'
    ELSE 'Poor'
  END AS 'Review'
FROM nomnom;
Reputable Product Agency - Fraud Detection and Customer Segmentation
Help the finance department of the RPA to identify potentially fraudulent transactions. Then assist the marketing department in segmenting the company's users according to different criteria:
--Dataset: Reputable Product Agency (RPA) Fraud Detection tables 'transaction_data' and 'users'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

-- FRAUD DETECTION

-- What are the column names?
SELECT * FROM transaction_data LIMIT 10;

-- Find the full_names and emails
-- of the transactions listing 20252 as the zip code.
SELECT full_name, email FROM transaction_data WHERE zip = 20252;

-- Use a query to find the names
-- and emails associated with these transactions.
SELECT full_name, email FROM transaction_data
WHERE full_name = 'Art Vandelay' OR full_name LIKE '% der %';

-- Find the ip_addresses and emails listed with these transactions.
SELECT ip_address, email FROM transaction_data WHERE ip_address LIKE '10.%';

-- Find the emails in transaction_data with
-- 'temp_email.com' as a domain.
SELECT email FROM transaction_data WHERE email LIKE '%temp_email.com';

-- The finance department is looking for a specific transaction.
-- They know that the transaction occurred from an ip address starting
-- with '120.' and their full name starts with 'John'.
-- Can you find the transaction?
SELECT * FROM transaction_data WHERE ip_address LIKE '120%' AND full_name LIKE 'John%';

-- ///////////////////////////////////////////////////////////////////
-- CUSTOMER SEGMENTATION

-- What are the column names?
SELECT * FROM users LIMIT 20;

-- Find the email addresses and birthdays of users whose
-- birthday is between 1980-01-01 and 1989-12-31.
SELECT email, birthday FROM users
WHERE birthday BETWEEN '1980-01-01' AND '1989-12-31';

-- Find the emails and creation date of users
-- whose created_at date matches this condition.
SELECT email, created_at FROM users WHERE created_at < '2017-05-01';

-- Find the emails of the users who received the 'bears' test.
SELECT email FROM users WHERE test = 'bears';

-- Find all the emails of all users who
-- received a campaign on website BBB.
SELECT email FROM users WHERE campaign LIKE 'BBB%';

-- Find all the emails of all users who received ad copy 2 in
-- their campaign.
SELECT email FROM users WHERE campaign LIKE '%-2';

-- Find the emails for all users who received both a campaign and a test.
-- These users will have non-empty entries in the
-- campaign and test columns.
SELECT email FROM users WHERE campaign IS NOT NULL AND test IS NOT NULL;

-- One of the members of the marketing team had an idea of calculating
-- how old users were when they signed up.
SELECT email, (created_at - birthday) AS age_at_creation FROM users LIMIT 20;
Davie’s Burgers Subway Ad
Help Davie’s Burgers to search through their past orders to find interesting material that can be used for a marketing campaign:
--Dataset: Davie's Burgers Restaurant table 'orders'
--Source: Analyse Data with SQL Skill Path Course, Codecademy

-- What are the column names?
SELECT * FROM orders LIMIT 20;

-- How recent is this data?
SELECT DISTINCT(order_date) FROM orders ORDER BY order_date DESC;

-- Instead of selecting all the columns using *,
-- write a query that selects only the special_instructions column.
SELECT special_instructions
FROM orders
-- Limit the result to 20 rows.
LIMIT 20;

-- Can you edit the query so that we are only
-- returning the special instructions that are not empty?
SELECT special_instructions FROM orders
WHERE special_instructions IS NOT NULL;

-- Let's go even further and sort the instructions
-- in alphabetical order (A-Z).
SELECT special_instructions FROM orders
WHERE special_instructions IS NOT NULL
ORDER BY special_instructions;

-- Let's search for special instructions that have the word 'sauce'.
SELECT special_instructions FROM orders
WHERE special_instructions LIKE '%sauce%';

-- Let's search for special instructions that have the word 'door'.
-- Any funny or interesting ones?
SELECT special_instructions FROM orders
WHERE special_instructions LIKE '%door%';

-- Let's search for special instructions that have the word 'box'.
-- Any funny or interesting ones?
SELECT special_instructions FROM orders
WHERE special_instructions LIKE '%box%';

-- Instead of just returning the special instructions, also return their order ids.
-- For more readability:
-- Rename id as '#'
-- Rename special_instructions as 'Notes'
SELECT id AS '#', special_instructions AS 'Notes'
FROM orders
WHERE special_instructions LIKE '%box%';