My last two posts looked at how to infer a user’s interests from their browsing history.
This post looks at how to translate that into personalised results. At Metro, we’d like to do this for our news feed results (generated by a top secret algorithm).
The most accurate results could be achieved by passing a weight for every category available. (The weights could represent the probability of a particular category being read). For example:
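Concretely, the request could carry one query parameter per category. A sketch, where the endpoint, category names and weights are all hypothetical:

```python
from urllib.parse import urlencode

# Hypothetical per-category read probabilities for one user
weights = {"news": 0.62, "showbiz": 0.25, "sport": 0.10, "tech": 0.03}

# One query parameter per category, in a fixed (alphabetical) order
query = urlencode(sorted(weights.items()))
url = f"https://api.example.com/feed?{query}"
print(url)  # https://api.example.com/feed?news=0.62&showbiz=0.25&sport=0.1&tech=0.03
```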
(Internally we could join these onto our existing data).
This doesn’t play very nicely with CDN caching though (almost every user would generate a unique URL), and our top secret algorithm is fairly intensive.
Alternative: top-level categories only
One option is to only pass top-level categories. (Metro has only a handful of top-level categories). We can also reduce the resolution of the weights (e.g. 0-19, 20-39, 40plus)…
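Reducing the resolution could be a simple threshold function. A minimal sketch, assuming weights are integers in the range 0-100:

```python
def bucket(weight):
    """Map a 0-100 category weight onto one of the three reduced-resolution buckets."""
    if weight < 20:
        return "0-19"
    if weight < 40:
        return "20-39"
    return "40plus"

# e.g. bucket(7) == "0-19", bucket(33) == "20-39", bucket(85) == "40plus"
```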
This gives us (3^5) = 243 different URLs. Caching for 5 minutes means (243 / 5) = 49 different URLs per minute. (In practice the number is greater due to requests from different edges).
A 400 error could be returned if the parameters are passed in the wrong order.
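Without a canonical order, `?news=40plus&sport=0-19` and `?sport=0-19&news=40plus` would be cached as two different URLs. A sketch of the check (the parameter shape is hypothetical):

```python
def in_canonical_order(params):
    """Return True if the (category, bucket) parameters arrive alphabetically
    by category name, so each weight profile maps to exactly one URL."""
    names = [name for name, _ in params]
    return names == sorted(names)

# in_canonical_order([("news", "40plus"), ("sport", "0-19")]) -> True: serve the request
# in_canonical_order([("sport", "0-19"), ("news", "40plus")]) -> False: return a 400
```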
Alternative: hot categories only
It’s a shame not to pass sub-categories.
We could pass only the user’s hottest 3 (sub)categories. Metro has around 20 categories, which gives at most (20^3) = 8000 possible URLs (fewer still if we enforce a canonical order). Caching for 5 minutes means at most (8000 / 5) = 1600 origin requests per minute.
Passing only the hottest 2 categories reduces this to 80 per minute.
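Picking the hottest categories is a small sort. Breaking ties deterministically (here alphabetically) matters, because it keeps equivalent profiles on the same URL. A sketch, with hypothetical category names:

```python
def hottest(weights, n=3):
    """Return the n highest-weighted categories, ties broken alphabetically
    so equal profiles always produce the same URL."""
    return sorted(weights, key=lambda category: (-weights[category], category))[:n]

weights = {"football": 55, "tv": 40, "politics": 12, "tech": 40}
print(hottest(weights))  # ['football', 'tech', 'tv']
```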
Alternative: cache in-application
At Metro, we generally only consider the last few days’ worth of articles. These could all be cached in-application (ElastiCache?) rather than querying the database on every request.
Performance bottlenecks are typically IO-bound: in-app caching would minimise network IO to the database.
Applications are easier to scale out than databases.