Sampling With Replacement In Presto
While analyzing an experiment data, I encountered an interesting brain teaser. I wanted to use the bootstrap method and for that needed to sample my data with replacement. After some iterations, I got the perfect way to do it. The trick is using UNNEST with sequence operation to duplicate the data. Below is an example.
In the below example, I assume we have two different treatment groups (A and B), and we need 3 samples of data. In each sample, we need 40% of users i.e. (2 of 5 users from each treatment group).
The most important step is UNNEST("sequence"(0, 3, 1)) T1(x)
which duplicates each user 3 times. Then we use NTILE operator to split the data into 5 groups and thereafter select two buckets for each group. Since there are only 5 users per treatment group, each tile will only have one user. If you have more than 100 users, then set the number of tiles to 100 and replace tile < 3 with tile < 40 to select 40% of users. You can read more about how to use NTILE operator for sampling in my earlier blog over here
WITH dataset AS (
SELECT *
FROM (
VALUES
(1, 'A'),
(2, 'A'),
(3, 'A'),
(4, 'A'),
(5, 'A'),
(6, 'B'),
(7, 'B'),
(8, 'B'),
(9, 'B'),
(10, 'B')
) AS t(user_uuid, treatment_group)
)
SELECT
user_uuid
, treatment_group
, iterNo
FROM
(
SELECT
user_uuid
, treatment_group
, iterNo
, NTILE(5) OVER (PARTITION BY treatment_group, iterNo ORDER BY rnd) as tile
FROM
(
SELECT
user_uuid
, treatment_group
, x as iterNo
, RAND() as rnd
FROM dataset
CROSS JOIN UNNEST("sequence"(0, 3, 1)) T1(x)
) A
) B
WHERE tile < 3
Originally published at http://ragrawal.wordpress.com on June 12, 2019.