177 points by thunderbong 3 days ago | 43 comments
tmoertel 3 days ago
indoordin0saur 2 days ago
tmoertel 2 days ago
Based on what I glean about your scenario, I suspect you’d be better served by some kind of exponentially decaying weight based on age (i.e., age = current_timestamp - created_at). For example, if you wanted to make your rows half as likely to be selected every time they grew another hour older, you could use `POW(2.0, -age / 3600)` as your weight, where age is given in seconds.
duckdb> WITH Ages AS (
SELECT age FROM UNNEST([0, 3600, 7200]) AS t(age)
)
SELECT age, POW(2.0, -age / 3600) AS w FROM Ages;
┌──────┬──────┐
│  age ┆    w │
╞══════╪══════╡
│    0 ┆  1.0 │
│ 3600 ┆  0.5 │
│ 7200 ┆ 0.25 │
└──────┴──────┘
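If it helps, here's a minimal sketch of plugging that decaying weight straight into the weighted-sampling ORDER BY from the article. (Hypothetical names: an Items table with a created_at timestamp; DuckDB-flavored, with DATE_DIFF giving the age in seconds.)

SELECT id
FROM Items
ORDER BY -LN(1.0 - RANDOM())
         / POW(2.0, -DATE_DIFF('second', created_at, NOW()) / 3600)
LIMIT 100 -- Sample size.

One nice property: aging every row by the same amount scales all the weights by the same constant factor, so the rows' relative chances of selection don't depend on when you run the query.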
swasheck 2 days ago
I do have a question of clarification: this is built on weighted sampling, with the weights already in the dataset. Does that imply some sort of preprocessing to arrive at the weights?
tmoertel 2 days ago
For example, say you run a website like Wikipedia and want to estimate the percentage of page views that go to pages that are dangerously misleading. You have a dataset that contains all of your site’s pages, and you have logs that indicate which pages users have viewed. What you don’t have is a flag for each page that indicates whether it’s “dangerously misleading”; you’ll have to create it. And that’s likely to require a panel of expert human reviewers. Since it’s not practical to have your experts review every single page, you’ll want to review only a sample of pages. And since each page spreads misleading information only to the degree it’s viewed by users, you’ll want to weight each page by its view count. To get those counts, you’ll have to process the site logs, and count how many times each page was viewed. Then you can take a sample of pages weighted by those counts.
That’s a pretty typical scenario. You think about the question you’re trying to answer. You think about the data you have and the data you’ll need. Then you figure out how to get the data you need, hopefully without too much trouble.
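To make that concrete with a rough sketch (hypothetical names: a PageViewLogs table with one row per view and a page_id column), the preprocessing plus the weighted sample might look like this:

-- Preprocessing: turn raw view logs into per-page weights.
CREATE TABLE PageViewCounts AS
SELECT page_id, COUNT(*) AS views
FROM PageViewLogs
GROUP BY page_id;

-- Weighted sample of pages for the expert review panel.
SELECT page_id
FROM PageViewCounts
WHERE views > 0
ORDER BY -LN(1.0 - RANDOM()) / views
LIMIT 500 -- Review budget.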
swasheck 2 days ago
thank you!
rhymer 2 days ago
See "Remarks on some misconceptions about unequal probability sampling without replacement" by Yves Tillé.
Quoted from Tillé's conclusion: "There is a multitude of correct and fast method of sampling... there is no reason to use an incorrect method like weighted random sampling where we do not control the inclusion probabilities"
It's not clear to me how easy it is to implement the "multitude of correct and fast methods" in SQL, though. Would love to see some reference implementation.
apwheele 2 days ago
tmoertel 2 days ago
In practice, however, when your samples are small, your populations are large, and your populations' weights are not concentrated in a small number of members, you can use Algorithm A and just pretend that you have a sample of i.i.d. draws. In these circumstances, the potential for bias is generally not worth worrying about.
But when you cannot play so fast and loose, you can produce unbiased estimates from an Algorithm A sample by using an ordered estimator, such as Des Raj's ordered estimator [1].
You could alternatively use a different sampling method, one that does produce inclusion probabilities. But these tend to be implemented only in more sophisticated statistical systems (e.g., R's "survey" package [2]) and thus not useful unless you can fit your dataset into those systems.
For very large datasets, then, I end up using Algorithm A to take samples and (when it matters) Des Raj's ordered estimator to make estimates.
(I was planning a follow-up post to my blog on this subject.)
[1] The original papers are hard to come by online, so I suggest starting with an introduction like https://home.iitk.ac.in/~shalab/sampling/chapter7-sampling-v..., starting at page 11.
[2] https://cran.r-project.org/web/packages/survey/survey.pdf
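If it helps to see it spelled out, here's a rough SQL sketch of Des Raj's ordered estimator of a population total, assuming a hypothetical table SampleDraws(draw_order, y, p) that holds the Algorithm A draws in the order they were taken, with y the measured value and p the unit's single-draw selection probability (its weight divided by the population's total weight):

WITH Ordered AS (
  SELECT
    y,
    p,
    COALESCE(SUM(y) OVER w, 0) AS y_before, -- values drawn before this one
    COALESCE(SUM(p) OVER w, 0) AS p_before  -- selection probability already "used up"
  FROM SampleDraws
  WINDOW w AS (ORDER BY draw_order
               ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
)
SELECT AVG(y_before + (y / p) * (1 - p_before)) AS estimated_total
FROM Ordered;

Each row's term corresponds to Des Raj's t_i; averaging them gives the estimate of the total. Treat this as a sketch to check against the chapter linked in [1], not a drop-in implementation.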
Horffupolde 2 days ago
tmoertel 2 days ago
Only Part 1 requires running Algorithm A and its ORDER/LIMIT logic. And, when you run that logic, you only need to read two tiny columns from the table you want to sample: the weight and the primary key (any candidate key will work). So it looks like this:
SELECT pk
FROM Population
WHERE weight > 0
ORDER BY -LN(1.0 - RANDOM()) / weight
LIMIT 1000 -- Sample size.
This operation will be fast whenever your data is stored in a column-oriented format or when you have indexed the weight.

I trust that you will accept as self-evident that column stores are very fast when reading two small columns, since they don't make you pay to read the columns you don't need.
So I'll focus on the row-store case. If you have indexed your weight, you basically have a tiny table of (weight, primary key) pairs sorted by weight. If you then run a query that needs to read only the weight and primary key, the query planner can fulfill your query by scanning the index, not the full population table. (The fact that the index is sorted doesn't matter. The query planner isn't taking advantage of the index ordering, just the fact that the index has everything it needs and is way smaller than the full population table. Most modern row-oriented database stores support index-only scans, including PostgreSQL [2] and SQLite [3].)
[1] https://blog.moertel.com/posts/2024-08-23-sampling-with-sql....
[2] https://www.postgresql.org/docs/current/indexes-index-only-s...
[3] https://www.sqlite.org/optoverview.html#covering_indexes
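To make the index-only-scan point concrete, here's a sketch in PostgreSQL syntax (the table and column names are the placeholders from above):

-- Covering index over the only two columns the sampling query reads.
CREATE INDEX population_weight_pk ON Population (weight, pk);

-- The planner can now answer this from the index alone (an index-only scan),
-- provided the visibility map is reasonably up to date:
SELECT pk
FROM Population
WHERE weight > 0
ORDER BY -LN(1.0 - RANDOM()) / weight
LIMIT 1000 -- Sample size.

The ORDER BY expression still forces a top-N sort, but the inputs to that sort come from the small index rather than the wide table.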
tmoertel 2 days ago
crazygringo 2 days ago
If you pull the highest 5 product IDs, sorting by product ID, you're right that it only reads those 5 entries from disk.
But when you ORDER BY RANDOM(), it's forced to do a full-table scan, because RANDOM() can't be indexed.
tmoertel 2 days ago
No, even without indexes, you only need O(N) memory to find the top N records under any ordering, including on random sort keys. You do have to examine every row, but if you are using a column store or have indexed the weight column, the scan will require only a small fraction of the work that a read-everything table scan would require.
crazygringo 2 days ago
But that's because memory usage isn't the relevant bottleneck here -- it's disk IO. It's reading the entire table from disk, or only the entire column(s) if applicable.
That's not something you ever want to do in production. Not even if it's only a single column. That will destroy performance on any reasonably large table.
If a full-table scan of all columns takes 180 seconds, then a "small fraction" to scan a single column might take 10 seconds, but queries running on a production database need to be measured in small numbers of milliseconds.
swasheck 2 days ago
tmoertel 2 days ago
To determine which rows are in the sample, you need to consider only two columns: the weight and (a) primary key.
If your population lives in a traditional row-oriented store, then you're going to have to read every row. But if you index your weights, you only need to scan the index, not the underlying population table, to identify the sample.
If your population lives in a column-oriented store, identifying the sample is fast, again because you only need to read two small columns to do it. Column stores are optimized for this case.
mbb70 2 days ago
Unless you know the id range and select N ids ahead of time, then use an index to pull those. But I think the assumption of the article is you can't do that.
tmoertel 2 days ago
crazygringo 2 days ago
And indeed, if your data doesn't naturally have contiguous IDs, you might create an AUTOINCREMENT primary key precisely to create those contiguous IDs. And then, of course, if you have a reasonable but not overwhelming number of deletions, you can get away with retrying a new random number every time you don't get a hit, up to a max number of retries, after which you just return an error and ask the user to attempt the operation again.
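A minimal sketch of that retry approach (hypothetical Products table with a dense auto-incrementing id; PostgreSQL-flavored, since SQLite's RANDOM() returns a 64-bit integer rather than a fraction):

-- Try one uniformly random id in [1, MAX(id)]; this is an index lookup, not a scan.
SELECT *
FROM Products
WHERE id = (SELECT 1 + CAST(FLOOR(RANDOM() * MAX(id)) AS INTEGER) FROM Products);

-- If no row comes back (that id was deleted), the application draws a new
-- random id and retries, up to a fixed limit, then reports an error.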
crazygringo 2 days ago
You just need to make sure you're running these queries against a database used solely for occasional analytics, where it's no problem to be saturating the disk for two minutes or whatever, because it won't bother anybody else.
natmaka 3 days ago