83 points by smitty1e 8 months ago | 20 comments
grogers 8 months ago
It will save you a lot of headaches to make the hash key the actual userid (e.g. 'user!abcdef123456'). This makes it more expensive if you do occasionally need to scan all users, but not drastically so. You can either do a full scan and ignore the items you don't care about, or maintain an index that contains just the userids (with a similar hash/range key scheme as in the article) and then do point gets for each userid to fetch the actual data. This spreads the load of those scans out better, because the range scan touches very little data compared to storing all the user data under the range key.
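The two-part layout suggested above can be sketched as follows. This is an illustrative sketch, not a real API: `userKey`, `indexPartition`, and the `useridx` prefix are all assumptions, and the index-partition count is arbitrary.

```typescript
// Primary table: hash key is the full userid, so writes spread evenly.
const userKey = (userId: string): string => `user!${userId}`;

// Optional "index" items: a small, fixed number of static partitions holding
// only userids, so a full-user scan reads little data per partition. Scan
// these, then point-get each `user!<id>` item for the actual data.
const INDEX_PARTITIONS = 16;

// Map a userid to one of the index partitions (simple FNV-1a hash here).
const indexPartition = (userId: string): string => {
  let h = 0x811c9dc5;
  for (const c of userId) {
    h ^= c.charCodeAt(0);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return `useridx!${h % INDEX_PARTITIONS}`;
};
```

A scan of all users then reads 16 small partitions of bare userids rather than one giant range of full user records.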
karmaniverous 8 months ago
wmfiv 8 months ago
Because you've introduced static hash keys ("user", "email", etc.), you've had to manually partition what DDB should partition for you automatically. And while you covered the partition size limit, you're also likely to hit write performance issues, because writes to the "user" and "email" hash keys aren't distributed.
Single-table design should distribute writes and minimize roundtrips to the database. user#12345 as a hash key, with range keys of 'User', 'Email#jo@email.com', 'Email#joe@email.com', etc., achieves both goals. If you need to query and/or sort on a large number of attributes, it's going to be easier, faster, and probably cheaper to stream data into Elasticsearch or similar to support those queries.
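The item layout described above might look like the following sketch. The names (`pk`, `sk`, `userItems`) are assumptions for illustration, not a real API.

```typescript
// One partition per user: the user record and each email address live under
// distinct range keys in the same partition.
interface Item {
  pk: string; // hash key: user#<id>, so writes distribute across users
  sk: string; // range key: 'User' or 'Email#<address>'
  [attr: string]: unknown;
}

const userItems = (userId: string, name: string, emails: string[]): Item[] => [
  { pk: `user#${userId}`, sk: "User", name },
  ...emails.map((email) => ({ pk: `user#${userId}`, sk: `Email#${email}` })),
];
```

A single Query on pk = 'user#12345' then returns the user record and all of its emails in one roundtrip, and a condition like begins_with(sk, 'Email#') narrows it to just the emails.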
kellengreen 8 months ago
librasteve 8 months ago
karmaniverous 8 months ago
Entity Manager is a framework for defining, managing, and most importantly QUERYING an entity model with DynamoDB. It's actually platform-generic, so the DynamoDB-specific machinery is implemented at https://github.com/karmaniverous/entity-client-dynamodb
Entity Manager's most important feature is that it permits a simple, scheduled partition sharding configuration and then transparent, multi-index querying of data across shards with a very compact, fluent query API.
This resolves the biggest challenge of using DynamoDB at scale, which is that very large data sets MUST be sharded, and a given query can ONLY operate against a single shard. If you're querying on the basis of a related record, you won't know which shard your results will be on so you must query ALL shards.
Entity Manager reduces this to an effortless operation: once you've defined your sharding strategy for a given entity, you can forget sharding is even a thing.
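To make the problem concrete, here is a hypothetical illustration (not Entity Manager's actual API) of what sharded hash keys look like: a short shard key derived from the record id spreads writes across partitions, and a query that can't be pinned to one record must fan out across every shard key.

```typescript
const SHARD_CHARS = 2; // two hex characters => 256 shards

// Derive a deterministic shard key from the record id (simple 32-bit hash).
const shardKey = (id: string): string => {
  let h = 0;
  for (const c of id) h = (Math.imul(h, 31) + c.charCodeAt(0)) >>> 0;
  return h.toString(16).padStart(8, "0").slice(0, SHARD_CHARS).toUpperCase();
};

const hashKey = (entity: string, id: string): string =>
  `${entity}!${shardKey(id)}`;

// Every hash key a cross-shard query must cover.
const allHashKeys = (entity: string): string[] =>
  Array.from({ length: 16 ** SHARD_CHARS }, (_, s) =>
    `${entity}!${s.toString(16).padStart(SHARD_CHARS, "0").toUpperCase()}`);
```

Managing that fan-out by hand for every index and every query is the tedium a framework can hide behind a config item.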
For some more color on Entity Manager within the context of SQL vs NoSQL databases, please review this (much shorter!) article: https://karmanivero.us/projects/entity-manager/sql-vs-nosql
vdvsvwvwvwvwv 8 months ago
Like most, I've been burned and frustrated by this sort of sharded DB and by unperformant queries when you don't or can't use the partition key. Usually you need a second table, amirite :) And I've seen others burned too.
Especially when your intro to it is a Jira ticket and you're learning on the job!
karmaniverous 8 months ago
I've actually been using the JS version of EM in production for over a year. It's been working flawlessly.
The TS version is a complete rewrite that factors in a BUNCH of lessons learned and is completely--maybe obsessively lol--type-safe. The query builder got a LOT of attention, and the fluent API reduces even complex queries to a super-compact, declarative coding experience.
I'm pushing a big update tonight and will then resume my focus on the demo & docs, basically the companion stuff to this one. Should be ready for use in a couple of weeks.
Thanks for the interest, it really means a lot to me!
librasteve 8 months ago
mustime 8 months ago
qaq 8 months ago
karmaniverous 8 months ago
qaq 8 months ago
Onavo 8 months ago
qaq 8 months ago
exabrial 8 months ago
karmaniverous 8 months ago
By "regular ole" I presume you mean some flavor of RDBMS. Those have significant issues at scale that the newfangled platforms don't... but as you see there's a price to be paid at design time.
If I do my job right, you get to have your cake & eat it too!
usagisushi 8 months ago
Noob question: For DynamoDB, I think internal partitions are automatically split based on usage. Is it still worthwhile to apply double-sharding at the application layer as well? Is cross-partition querying a key factor here?
karmaniverous 8 months ago
Recall that, in DDB, your index has two parts: hash & range key. If you want to have many entities in the same table, then you need a way of distinguishing between different entities, and a way of locating an individual record. In your primary index, those account for your hash and range keys, respectively: the hash key is your entity differentiator, and the range key is your entity id (which may come from a different record property from one entity to another). If you follow the development of the article, you’ll see how this plays out with variously constructed keys across different indexes.
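That primary-index layout can be sketched like this (property and function names are illustrative assumptions, not the article's or EM's actual code): the hash key differentiates the entity, and the range key holds that entity's id, which comes from a different property per entity.

```typescript
interface Rec {
  entity: string;
  [k: string]: unknown;
}

const primaryKey = (rec: Rec): { hashKey: string; rangeKey: string } => {
  switch (rec.entity) {
    case "user":
      // Users are located by userId...
      return { hashKey: "user", rangeKey: `userId#${String(rec.userId)}` };
    case "email":
      // ...while emails are located by the address itself.
      return { hashKey: "email", rangeKey: `email#${String(rec.email)}` };
    default:
      throw new Error(`unknown entity: ${rec.entity}`);
  }
};
```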
Now, forget EM sharding for a minute and let DDB manage your sharding. Say you launch your application with little data and a single shard. Over time your data scales & spills over onto additional shards. When you perform a search, DDB has no way of knowing which shards are relevant so it has to search ALL of them.
But from the application side, your data scaled over TIME. Therefore, if you know which shards were created when, you could limit a time-based search only to the shards that are relevant to the search parameters. And a LOT of searches involve a time window.
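The scheduled-bump idea can be sketched as follows. This is a hypothetical shape, not Entity Manager's actual config format: each bump says how many hex characters of shard key apply to records created from that moment on, so a time-bounded search only needs the hash keys of the bumps its window overlaps. The dates and key formats here are invented for illustration.

```typescript
interface ShardBump {
  from: number;  // epoch ms when this bump takes effect
  chars: number; // shard-key length in hex characters
}

const bumps: ShardBump[] = [
  { from: 0, chars: 0 },                    // launch: a single unsharded key
  { from: Date.UTC(2024, 0, 1), chars: 1 }, // 16 shards from 2024 onward
];

// All hash keys a query over [windowStart, windowEnd] must cover.
const shardKeysFor = (
  entity: string,
  windowStart: number,
  windowEnd: number,
): string[] => {
  const keys: string[] = [];
  for (let i = 0; i < bumps.length; i++) {
    const b = bumps[i];
    const bumpEnd = bumps[i + 1]?.from ?? Infinity;
    if (b.from > windowEnd || bumpEnd <= windowStart) continue; // no overlap
    if (b.chars === 0) {
      keys.push(entity);
      continue;
    }
    for (let s = 0; s < 16 ** b.chars; s++)
      keys.push(`${entity}!${s.toString(16).padStart(b.chars, "0")}`);
  }
  return keys;
};
```

A search over 2022 data hits one hash key; a search spanning the bump hits seventeen; a 2024-only search hits sixteen. Either way, only the shards that can contain matching records are queried.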
Within the context of EM, when I say a “shard”, I am talking about a unique hash key value like `user!1F`, where `user` is the entity type and `1F` is the shard key. These may or may not map to physical DDB shards, and the good news is that you don’t NEED to care… DDB will flex if you don’t.
EM has a lot of features that greatly streamline the dev experience when operating against a DDB table with a multi-entity data model. You don’t HAVE to use the sharding feature… it’s literally just a config item, everything else happens behind the scenes. But when you DO use it, EM splits a search across sharded data into MANY parallel searches, one per shard, then assembles the returns into a coherent result with a “page key” that is actually a compressed map of ALL the underlying page keys. You don’t have to care about THAT, either… just pass the compressed string back to EM and it will rehydrate the page keys & perform the next set of searches.
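The fan-out-and-fold behavior described above can be sketched conceptually like this (a hypothetical helper, not Entity Manager's actual API; a real implementation would compress the token, while this sketch uses plain JSON):

```typescript
type PageKey = Record<string, unknown> | undefined;
type ShardQuery = (
  hashKey: string,
  pageKey: PageKey,
) => Promise<{ items: unknown[]; pageKey: PageKey }>;

const queryAllShards = async (
  hashKeys: string[],
  query: ShardQuery,
  token?: string, // folded per-shard page keys from a prior call
): Promise<{ items: unknown[]; token?: string }> => {
  // Rehydrate the per-shard page keys from the prior call's token.
  const pageKeys: Record<string, PageKey> = token ? JSON.parse(token) : {};
  // One query per shard, all in parallel.
  const results = await Promise.all(
    hashKeys.map(async (hk) => ({ hk, ...(await query(hk, pageKeys[hk])) })),
  );
  // Keep only shards with more pages; fold their page keys into one token.
  const next = Object.fromEntries(
    results.filter((r) => r.pageKey).map((r) => [r.hk, r.pageKey]),
  );
  return {
    items: results.flatMap((r) => r.items),
    token: Object.keys(next).length ? JSON.stringify(next) : undefined,
  };
};
```

The caller just passes the opaque token back in to get the next page; which shards are exhausted and which still have data is handled internally.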
So you get to choose your own adventure… you can run every entity on a single “shard” or run in parallel. I’d just keep an eye out for any drop in performance at scale and add a shard bump when I see it.
Also worth noting: EM is actually platform-agnostic. There is a companion repo that contains the DDB-specific client. This is still a bit in flux btw so be kind lol. Anyway, the point is that other platforms without AWS's resource footprint may not handle sharding as well on their own, and EM can deliver effectively the same result there.
Hope that answers your question!
P.S. Worth noting: in addition to searching across multiple SHARDS, an EM query can also search across multiple INDEXES. Say you want to query on “name” and you want to query both your firstName and lastName indexes with the same “name” value. With EM, this is a SINGLE query that returns a combined, paged, deduped, sorted result set. Handy.
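Conceptually, that multi-index merge looks something like this sketch (hypothetical helper, not Entity Manager's actual API): combine the result sets from each index, dedupe by record id, and sort on one attribute.

```typescript
interface Rec {
  id: string;
  [k: string]: unknown;
}

const mergeIndexResults = (resultSets: Rec[][], sortBy: string): Rec[] => {
  const byId = new Map<string, Rec>();
  for (const set of resultSets)
    for (const rec of set) byId.set(rec.id, rec); // dedupe on record id
  return [...byId.values()].sort((a, b) =>
    String(a[sortBy]).localeCompare(String(b[sortBy])),
  );
};
```

So a record matching on both firstName and lastName appears once in the combined result, in its proper sort position.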
preaching5271 8 months ago
jeroen79 8 months ago