Navigate to the Amazon S3 service and look at the *-datalakebucket-* bucket subfolder processed/instacart.
Notice that the data is in files, rather than in a database running on a server. It’s going to be fairly difficult to analyse these files without knowing their structure. We can solve this problem by using AWS Glue!
Now we need to use the AWS Glue service to extract data and tables from these files:
Notice the S3 locations it crawls through. The crawler looks through data files to create a database
Lets look at the database created by the AWS Glue crawler:
The crawler went over our data files in our S3 Bucket. A database with tables were created from our data files. How cool is that!
We now need to determine the top 8 products ordered by our customers across all departments. These are the products we would like to provide to our most loyal customers (generous, I know!). To do this:
SELECT
"product_id"
, "product_name"
, "department"
, "orders"
, "ranking"
FROM
(
SELECT
"count"(1) "orders"
, "op"."product_id"
, "p"."product_name"
, "d"."department"
, "row_number"() OVER (PARTITION BY "d"."department" ORDER BY "count"(1) DESC) "ranking"
FROM
((((orders o
INNER JOIN users u ON ("u"."user_id" = "o"."user_id"))
INNER JOIN order_products op ON ("op"."order_id" = "o"."order_id"))
INNER JOIN products p ON ("p"."product_id" = "op"."product_id"))
INNER JOIN departments d ON ("d"."department_id" = "p"."department_id"))
GROUP BY "op"."product_id", "p"."product_name", "d"."department"
)
WHERE ("ranking" = 1)
ORDER BY "orders" DESC
Great, so we have the top items across each department!
Looking at the results, there’s probably a few departments that we want to remove. Baby food isn’t exactly the ideal surprise gift for a loyal customer!
Let’s modify the SQL statement to remove the results for unwanted departments. We will also limit the results to the top 8 products.
SELECT
"product_id"
, "product_name"
, "department"
, "orders"
, "ranking"
FROM
(
SELECT
"count"(1) "orders"
, "op"."product_id"
, "p"."product_name"
, "d"."department"
, "row_number"() OVER (PARTITION BY "d"."department" ORDER BY "count"(1) DESC) "ranking"
FROM
((((orders o
INNER JOIN users u ON ("u"."user_id" = "o"."user_id"))
INNER JOIN order_products op ON ("op"."order_id" = "o"."order_id"))
INNER JOIN products p ON ("p"."product_id" = "op"."product_id"))
INNER JOIN departments d ON ("d"."department_id" = "p"."department_id"))
GROUP BY "op"."product_id", "p"."product_name", "d"."department"
)
WHERE ("ranking" = 1) and department not in ('bulk', 'deli', 'babies', 'dairy eggs', 'dry goods pasta')
ORDER BY "orders" DESC
LIMIT 8
Excellent! Recycled Paper Towels, Now that’s a loyalty gift!
Rather than writing down the top 8 products, let’s save the search as a Saved query so we can run it on demand later:
Replace team#
with your unique team number
Now that you know how to find the most ordered products, can you write an SQL statement to find the least popular products?
If you work it out, be sure to save the search like you did in step 6 to show participants after.