Step by Step

  1. Navigate to the Amazon S3 service and look at the *-datalakebucket-* bucket subfolder processed/instacart.

    Notice that the data is in files, rather than in a database running on a server. It’s going to be fairly difficult to analyse these files without knowing their structure. We can solve this problem by using AWS Glue!

    explore-data

  2. Now we need to use the AWS Glue service to extract data and tables from these files:

    • Navigate to the AWS Glue service
    • Select Crawlers
    • Select DatalakeCrawler

    Notice the S3 locations it crawls through. The crawler looks through data files to create a database

    explore-data

  3. Lets look at the database created by the AWS Glue crawler:

    • Select Databases
    • Select the datalake database
    • Select Tables in datalake

    The crawler went over our data files in our S3 Bucket. A database with tables were created from our data files. How cool is that!

    explore-data explore-data

  4. We now need to determine the top 8 products ordered by our customers across all departments. These are the products we would like to provide to our most loyal customers (generous, I know!). To do this:

    • Navigate to the AWS Athena service
    • Select the Data Source as AwsDataCatalog
    • Select the Database as datalake
    • Copy the SQL Query text below
    • Paste the text into the New query 1 window
    • Press Ctrl + Enter to run the query

    SELECT
      "product_id"
    , "product_name"
    , "department"
    , "orders"
    , "ranking"
    FROM
    (
    SELECT
        "count"(1) "orders"
    , "op"."product_id"
    , "p"."product_name"
    , "d"."department"
    , "row_number"() OVER (PARTITION BY "d"."department" ORDER BY "count"(1) DESC) "ranking"
    FROM
        ((((orders o
    INNER JOIN users u ON ("u"."user_id" = "o"."user_id"))
    INNER JOIN order_products op ON ("op"."order_id" = "o"."order_id"))
    INNER JOIN products p ON ("p"."product_id" = "op"."product_id"))
    INNER JOIN departments d ON ("d"."department_id" = "p"."department_id"))
    GROUP BY "op"."product_id", "p"."product_name", "d"."department"
    ) 
    WHERE ("ranking" = 1)
    ORDER BY "orders" DESC

explore-data

  1. Great, so we have the top items across each department!

    Looking at the results, there’s probably a few departments that we want to remove. Baby food isn’t exactly the ideal surprise gift for a loyal customer!

    Let’s modify the SQL statement to remove the results for unwanted departments. We will also limit the results to the top 8 products.

    • Copy the SQL Query text below
    • Select “+” to create a New query window
    • Paste the SQL text in the New query 2 window
    • Press Ctrl + Enter to run the query

SELECT
"product_id"
, "product_name"
, "department"
, "orders"
, "ranking"
FROM
(
SELECT
"count"(1) "orders"
, "op"."product_id"
, "p"."product_name"
, "d"."department"
, "row_number"() OVER (PARTITION BY "d"."department" ORDER BY "count"(1) DESC) "ranking"
FROM
((((orders o
INNER JOIN users u ON ("u"."user_id" = "o"."user_id"))
INNER JOIN order_products op ON ("op"."order_id" = "o"."order_id"))
INNER JOIN products p ON ("p"."product_id" = "op"."product_id"))
INNER JOIN departments d ON ("d"."department_id" = "p"."department_id"))
GROUP BY "op"."product_id", "p"."product_name", "d"."department"
) 
WHERE ("ranking" = 1) and department not in ('bulk', 'deli', 'babies', 'dairy eggs', 'dry goods pasta')
ORDER BY "orders" DESC
LIMIT 8

explore-data

  1. Excellent! Recycled Paper Towels, Now that’s a loyalty gift!

    Rather than writing down the top 8 products, let’s save the search as a Saved query so we can run it on demand later:

    • Click on Save as
    • Name the statement most_ordered_products-team#Click to copy

explore-data

Replace team# with your unique team number

Bonus Challenge!

Now that you know how to find the most ordered products, can you write an SQL statement to find the least popular products?

If you work it out, be sure to save the search like you did in step 6 to show participants after.

Bonus answer hidden here