82 points by nbsande 7 months ago | 46 comments
While this was initially built with enhancing CPU utilization for FastAPI servers in mind, the approach can be used with more general async programs too.
If you’re interested in diving deeper into the details, I’ve written a blog post about it here: https://www.neilbotelho.com/blog/multithreaded-async.html
noident 7 months ago
There is built-in support for this. Take a look at loop.run_in_executor. You can await something scheduled in a separate Thread/ProcessPoolExecutor.
Granted, this is different from making the async library end-to-end multi-threaded as you seem to be trying to do, but it does seem worth mentioning in this context. You _can_ have async and multiple threads at the same time!
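A minimal sketch of that built-in route, assuming a blocking helper (here a hypothetical `blocking_work`) that you want off the event loop:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def blocking_work(n):
    # Stand-in for a blocking call (file I/O, a C extension, etc.)
    return sum(i * i for i in range(n))

async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Runs blocking_work in a worker thread; the event loop stays free
        # to service other coroutines while we await the result.
        return await loop.run_in_executor(pool, blocking_work, 1_000)

print(asyncio.run(main()))
```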
zo1 7 months ago
https://www.tornadoweb.org/en/stable/concurrent.html#tornado...
noident 7 months ago
But run_in_executor achieves that as well! If you use no-GIL Python and a thread pool (or GIL Python with a Process pool), you will utilize more CPU cores.
noident 7 months ago
That isn't true. If you use a ProcessPoolExecutor as the target instead of the default executor, you will use multiple processes in pure Python code.
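A sketch of that variant, assuming `cpu_bound` stands in for real pure-Python work (note that functions passed to a process pool must be defined at module level so they can be pickled):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def cpu_bound(n):
    # Pure-Python loop that would otherwise hold the GIL
    return sum(i * i for i in range(n))

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # Four chunks run in separate processes, in parallel,
        # while the event loop stays free for other tasks.
        return await asyncio.gather(
            *(loop.run_in_executor(pool, cpu_bound, 10_000) for _ in range(4))
        )

if __name__ == "__main__":
    print(asyncio.run(main()))
```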
d0mine 7 months ago
https://docs.python.org/3/library/asyncio-eventloop.html#asy...
game_the0ry 7 months ago
When I am trying to solve a technical problem, the problem is going to dictate my choice of tooling.
If I am doing some fast scripting or I need to write some glue code, python is my go-to. But if I need resource efficiency, multi-threading, non-blocking async I/O, and/or high performance, I would not consider python - I would probably pick the JVM over the best python option.
Don't get me wrong, I think it's worthwhile to explore this, and I certainly do not think it's a wasted effort (quite the opposite, this gets my upvote). I just don't think I would ever use it if I had a use case demanding perf and resource efficiency.
mywittyname 7 months ago
I've been using ThreadPoolExecutors in Python for a while now. They seem to work pretty well for my use cases. Granted, my use cases don't require things like shared memory segments; I use as_* functions under concurrent.futures to recombine the data as needed. Honestly, I prefer the futures functions as I don't need to think about deadlocks.
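For reference, the pattern described above might look roughly like this (`fetch` is a hypothetical stand-in for an I/O-bound call):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # Stand-in for an I/O-bound call (HTTP request, DB query, ...)
    return len(url)

urls = ["https://example.com/a", "https://example.com/bb"]

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(fetch, u): u for u in urls}
    results = {}
    for fut in as_completed(futures):
        # Results arrive as each future completes, in no guaranteed order;
        # no locks to manage, so no deadlocks to reason about.
        results[futures[fut]] = fut.result()

print(results)
```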
game_the0ry 7 months ago
I agree with this - it's a fair trade-off - but it's not the direction I would go, as a matter of preference.
maest 7 months ago
1. Rewrite the whole thing, or
2. Carve out the high-perf component into a separate system and also deal with the overhead of marshalling data between two different systems?
trashtester 7 months ago
And in many teams, just having to worry about python makes it easier to keep team members productive if they're not expected to handle several different languages productively.
game_the0ry 7 months ago
> And in many teams, just having to worry about python makes it easier to keep team members productive if they're not expected to handle several different languages productively.
I think this makes sense for individuals and teams, but for an org or company I think having specialist teams makes sense, where teams that require perf use the JVM and teams that build business-ware or devops tooling or something not perf-critical use python.
trashtester 7 months ago
But plenty of tasks can be done beautifully in Python. That's especially true in a data processing or ML setting where most of the heavy lifting is done in libraries such as numpy, spark, pytorch etc. (Also Python is the industry standard for such teams).
Still, even for such teams there are times where you want to do SOME heavier compute tasks within the core language, and offloading this to some other dev team simply doesn't scale.
The usual solution is to use multiprocessing instead of multithreading, but this workaround is quite inflexible.
Some dev teams may have developers that can deliver this in scala (especially if spark is involved). Others may have the ability to build C++ (or CUDA) libraries to add to the python environment.
But being able to run somewhat heavier processing within python itself, beyond what a single process can achieve, is often much better.
Cost wise it also makes a lot of sense. Such steps may often find themselves on some large compute cluster where you have tens or hundreds of processors (or more) available. If a single step in a processing pipeline on such a cluster can be cut from 2 hours to 1 minute, it can be a large saving. Taking it from 1 minute to 10 seconds means a lot less.
Btw, and with all due respect, the part about teams that require perf using JVM doesn't really match my experience. Where I come from, the Java devs tend to produce the slowest code of all, mostly because every step of the processing is serialized/deserialized as microservices talk to each other for each data element.
Even python based code is often faster (sometimes by orders of magnitude). Partly because of cultural differences between teams (the python code, even when exposed as microservices, tends to work with larger blocks of data), and partly because the work really is processed in C++ based libraries within python, which still have a significant edge on JVM based code.
Don't get me wrong: Java has a lot of advantages for many types of business applications, where the business logic complexity can be abstracted in well organized and standardized ways. But it's not typically the go-to language when seeking maximum performance in heavy compute or massive data volume scenarios.
gloryjulio 7 months ago
An example is facebook's php to hack compiler
nbsande 7 months ago
I have an example of a very basic ASGI server that does just that towards the end of the blog
game_the0ry 7 months ago
> They want some kind of performance uplift without rewriting the whole python code base.
In order to take advantage of multi-threading and/or async I/O, you would need to rewrite your code anyway, right? And at that point, wouldn't rewriting in a different language be an option?
biorach 7 months ago
Heavily restructure, sure. Rewrite? Probably not.
gloryjulio 7 months ago
Upgrading the language version, however, is way easier, and you usually have an official upgrade guide about what to do. It's also much safer and easier to deploy and test.
Once we have the sane multi threading path in python, there would be even less incentive to rewrite the code
trashtester 7 months ago
Not to mention that python is actually a good language choice for many types of environments, and basically the industry standard for fields like ML/AI and supporting data pipelines.
Wherever python is used for heavy duty number crunching or large data volumes, most processing is handled by libraries written in other languages, while python is handling program flow and some small parts that need custom code. The latter can currently be quite expensive.
Migrating the whole codebase to another language for such setups would simply be absurd.
Still, for the small percentage of such codebases that DOES do semi-heavy data crunching, real multithreading would be nice so one can avoid resorting to multi-processing or implementing these parts as custom C++ libraries, or similar.
m11a 7 months ago
My understanding was that tasks in an event loop should yield after they dispatch IO tasks, which means whatever remains on the event loop should be CPU-bound, right? If so, multithreading should not help much in theory?
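A toy illustration of that model - an awaiting task yields so another can run, so only CPU work keeps the loop busy - using nothing beyond stdlib asyncio:

```python
import asyncio

order = []

async def io_task(name):
    order.append(f"{name} start")
    # The await yields control back to the event loop,
    # letting the other task run before this one resumes.
    await asyncio.sleep(0)
    order.append(f"{name} end")

async def main():
    await asyncio.gather(io_task("a"), io_task("b"))

asyncio.run(main())
print(order)  # tasks interleave at the await points
```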
spiffytech 7 months ago
I've seen code that spends disproportionate CPU time on, e.g., JSON (de)serialization of large objects, or converting Postgres result sets into native data structures, but sometimes it's just plain ol' business logic. And with enough traffic, any app gets too busy for one core.
Single-threaded langs get around this by deploying multiple copies of the app on each server to use up the cores. But that's less efficient than a single, parallel runtime, and eliminates some architectural options.
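One common mitigation for the JSON case above is pushing the serialization off the event loop; a minimal sketch, assuming `serialize_off_loop` (a hypothetical helper) and stdlib `json` only:

```python
import asyncio
import json

async def serialize_off_loop(obj):
    loop = asyncio.get_running_loop()
    # json.dumps of a large object can stall the event loop;
    # handing it to the default thread pool keeps the loop responsive
    # (though under the GIL this helps latency more than throughput).
    return await loop.run_in_executor(None, json.dumps, obj)

async def main():
    big = {"rows": list(range(1000))}
    payload = await serialize_off_loop(big)
    return len(payload)

print(asyncio.run(main()))
```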
bastawhiz 7 months ago
Which is to say, why even bother with async if you want your code to be fully threaded? Async is an abstraction designed specifically to address the case where you're dealing with blocking IO on a single thread. If you're fully threaded, the problems async addresses don't exist anymore. So why bother?
m11a 7 months ago
I suppose it’d only really be useful if you have more tasks than you can have OS threads (due to the memory overhead of an OS thread), then maybe 10,000 tasks can run in 16 OS threads.
If that’s the case, then is this useful in any application other than when you have way too many threads to feasibly make each task an OS thread?
btown 7 months ago
My startup has been using it in production for years. It excels at I/O bound workflows where you have highly concurrent real-time usage of slow/unpredictable partner APIs. You just write normal (non-async) Python code and the patched system internals create yields to the event loop whenever you’d be waiting for I/O, giving you essentially unlimited concurrency (as long as all pending requests and their context fit in RAM).
https://github.com/gfmio/asyncio-gevent does exist to let you use asyncio code in a gevent context, but it’s far less battle-tested and we’ve avoided using it so far.
sidmitra 7 months ago
Does your app have a lot of dependencies that run background threads? Like LaunchDarkly (feature flags), redis, spyne (RPC), and so on.