Hacker Remix

Evolving a NoSQL Database Schema

83 points by smitty1e 8 months ago | 20 comments

grogers 8 months ago

I would strongly recommend against using a fixed key 'user' for the hash key on dynamodb, with the range key being used to select the actual record. DDB does not handle splitting by range key very well, so you will run into load balance and throttling issues even with the sharding scheme (i.e. 'user!2') mentioned later.

It will save you a lot of headaches to make the hash key the actual userid (e.g. 'user!abcdef123456'). This will make it more expensive if you do need to occasionally scan all users, but it's not drastically so. You can either do scan and ignore stuff you don't care about, or maintain an index that just contains the userids (in a similar hash/range key as the article) and then do point gets for each userid for the actual data. This will spread the load of these scans out better, because the range scan contains little data compared to if all user data is stored in the range key.

karmaniverous 8 months ago

This approach is not really compatible with the single-table design pattern, which has some significant advantages. The point where performance degrades due to the issues you mentioned would be a good point to start applying sharding.

wmfiv 8 months ago

With due respect, I think you've misunderstood the single-table design pattern.

Because you've introduced static hash keys ("user", "email", etc) you've had to manually partition which DDB should do for you automatically. And while you covered the partition size limit you're also likely to have write performance issues because you're not distributing writes to the "user" and "email" hash keys.

Single-table design should distribute writes and minimize roundtrips to the database. user#12345 as a hash key and range keys of 'User', 'Email#jo@email.com', 'Email#joe@email.com', etc achieve those goals. If you need to query and/or sort on a large number of attributes it's going to be easier, faster, and probably cheaper to stream data into Elasticsearch or similar to support those queries.

kellengreen 8 months ago

Agreed, this approach will not scale very well in DynamoDB.

librasteve 8 months ago

Very informative read … however, I am left not really understanding what the Entity Manager is … perhaps a migration tool to move data from NoSQL to RDBMS, perhaps a front end helper service? Is it FOSS, a PAYG service? How can I deploy it? Perhaps this article long enough already and is intended to whet our appetite.

karmaniverous 8 months ago

I'm the author, and thanks for asking! The article is really just background, submitted by a friend. Still building the demo & documentation, so my apologies for the confusion.

Entity Manager is a framework for defining, managing, and most importantly QUERYING an entity model with DynamoDB. It's actually platform-generic, so the DynamoDB-specific machinery is implemented at https://github.com/karmaniverous/entity-client-dynamodb

Entity Manager's most important feature is that it permits a simple, scheduled partition sharding configuration and then transparent, multi-index querying of data across shards with a very compact, fluent query API.

This resolves the biggest challenge of using DynamoDB at scale, which is that very large data sets MUST be sharded, and a given query can ONLY operate against a single shard. If you're querying on the basis of a related record, you won't know which shard your results will be on so you must query ALL shards.

Entity Manager reduces this to an effortless operation: once you've defined your sharding strategy for a given entity, you can forget sharding is even a thing.

For some more color on Entity Manager within the context of SQL vs NoSQL databases, please review this (much shorter!) article: https://karmanivero.us/projects/entity-manager/sql-vs-nosql

vdvsvwvwvwvwv 8 months ago

That makes me want to learn more. I will read this.

Like most been burned and frustrated by this sort of sharded DB and unperformant queries if you dont or cant use the partition key. Usually you need a second table amirite :) And seen others burned.

Especially when your intro is a Jira ticket learning on the job!

karmaniverous 8 months ago

Very cool of you to say so!

I've actually been using the JS version of EM in production for over a year. It's been working flawlessly.

The TS version is a complete rewrite that factors in a BUNCH of lessons learned and is completely--maybe obsessively lol--type-safe. The query builder got a LOT of attention, and the fluent API reduces even complex queries to a super-compact, declarative coding experience.

I'm pushing a big update tonight and will then resume my focus on the demo & docs, basically the companion stuff to this one. Should be ready for use in a couple of weeks.

Thanks for the interest, it really means a lot to me!

librasteve 8 months ago

thanks … that’s really helpful

mustime 8 months ago

[dead]

qaq 8 months ago

Save yourself a ton of grief and don't use things like dynamodb unless you really have to.

karmaniverous 8 months ago

I mean srsly why use computers at all right.

qaq 8 months ago

DynamoDB is a very crippled product what does it have to do with using computers.

Onavo 8 months ago

What about ScyllaDB

qaq 8 months ago

Can't really comment have not ever used it. In general I'd err on the side of using ACID SQL DB over NoSQL unless you have some very specific use case that will highly benefit from NoSQL.

exabrial 8 months ago

A lot of effort to not use a regular ole database as a database... when you know, you could just use a database.

karmaniverous 8 months ago

Well, that's the nice thing about a library... when it works, all that effort happens under the hood. :)

By "regular ole" I presume you mean some flavor of RDBMS. Those have significant issues at scale that the newfangled platforms don't... but as you see there's a price to be paid at design time.

If I do my job right, you get to have your cake & eat it too!

usagisushi 8 months ago

Great write-up! Recently, I struggled with DynamoDB and eventually opted for a library called ElectroDB, which supports a type-safe way to model entities in Single Table Design.

Noob question: For DynamoDB, I think internal partitions are automatically split based on usage. Is it still worthwhile to apply double-sharding at the application layer as well? Is cross-partition querying a key factor here?

karmaniverous 8 months ago

The key focus of EM is the implementation of a multi-entity data model along the lines of the Single Table Design Pattern.

Recall that, in DDB, your index has two parts: hash & range key. If you want to have many entities in the same table, then you need a way of distinguishing between different entities, and a way of locating an individual record. In your primary index, those account for your hash and range keys, respectively: the hash key is your entity differentiator, and the range key is your entity id (which may come from a different record property from one entity to another). If you follow the development of the article, you’ll see how this plays out with variously constructed keys across different indexes.

Now, forget EM sharding for a minute and let DDB manage your sharding. Say you launch your application with little data and a single shard. Over time your data scales & spills over onto additional shards. When you perform a search, DDB has no way of knowing which shards are relevant so it has to search ALL of them.

But from the application side, your data scaled over TIME. Therefore, if you know which shards were created when, you could limit a time-based search only to the shards that are relevant to the search parameters. And a LOT of searches involve a time window.

Within the context of EM, when I say a “shard”, I am talking about a unique hash key value like `user!1F`, where `user` is the entity type and `1F` is the shard key. These may or may not map to physical DDB shards, and the good news is that you don’t NEED to care… DDB will flex if you don’t.

EM has a lot of features that greatly streamline the dev experience when operating against a DDB table with a multi-entity data model. You don’t HAVE to use the sharding feature… it’s literally just a config item, everything else happens behind the scenes. But when you DO use it, EM splits a search across sharded data into MANY parallel searches, one per shard, then assembles the returns into a coherent result with a “page key” that is actually a compressed map of ALL the underlying page keys. You don’t have to care about THAT, either… just pass the compressed string back to EM and it will rehydrate the page keys & perform the next set of searches.

So you get to choose your own adventure… you can run every entity on a single “shard” or run in parallel. I’d just keep an eye out for any drop in performance at scale and add a shard bump when I see it.

Also worth noting: EM is actually platform-agnostic. There is a companion repo that contains the DDB-specific client. This is still a bit in flux btw so be kind lol. Anyway the point is that other platforms that don’t have AWS’ resource footprint may not handle sharding as well, and EM will be able to render effectively the same result.

Hope that answers your question!

P.S. Worth noting: in addition to searching across multiple SHARDS, an EM query can also search across multiple INDEXES. Say you want to query on “name” and you want to query both your firstName and lastName indexes with the same “name” value. With EM, this is a SINGLE query that returns a combined, paged, deduped, sorted result set. Handy.

preaching5271 8 months ago

Enjoyed it

jeroen79 8 months ago

I would evolve it into an SQL DB :-p