Search this blog

25 February, 2009

Game multithreading laws

Personal note: My eeepc still has no keyboard (now it's totally dead), so I'm writing from my girlfriend's laptop... I should learn not to use netbooks in the tub.
I'll cut it short this time. Explicit multithreading is too hard. Actually I think it's the hardest thing in computer science.
Parallel programming articles are always a fascinating read (I strongly advise, strongly, to read Joe Duffy's blog, and of course Herb Sutter's one), but the truth is, when it comes to real work, you want to minimize the exposure you have to it.
And yet, doing games, you want to make your CPU go over its peak gflops rating. How to? Those are my personal laws (not that there's anything revolutionary really, so don't be surprised if you're already following them):
  • Data parallellism is your only God. It will feed your long, starving pipelines, hide your latencies and vectorize your computation.
  • Embrace Stream computing, love Map/Reduce, study ParallelFX, OpenMp and CUDA and finally implement and use a ParallelFor primitive with a Thread Pool.
  • That shall be the only primitive you routinely use for multithreading.
  • Avoid explicit threads.
  • Avoid explicit locks/syncronization primitives.
  • Avoid all forms of data sharing.
  • Enforce the non sharing rule. Use smart pointers and reference counting base classes. Assert that shared smart pointers are acquired/released always on the same thread.
  • You don't need locks in your libraries, because you don't want to share. Re-entrancy is the key, if you can't achieve that, just say it, don't lock. Locks do NOT compose.
  • You don't need exotic parallel data structures or syncronization primitives, because you're not sharing.
  • The only time you have to think about syncronization shall be when communicating with the GPU.
  • You might not have more than explicit threads than the fingers of one hand.
  • Any explicit thread will only depend on one other (i.e. ai->game->rendering).
  • In runtime there will be only one syncronization point, when communication happens by passing an updated buffer of data to the thread that it needs it. That should be done with a queque. Locks contention is practically absent.
  • Communication is always mono-directional, one thread always writes stuff for the dependant thread that always reads it (or, one thread will own only one side of the communication queque).

Follow those rules, and you'll be happy looking at the guy who has implemented that super cool lockless actor based future system working all the weekends...

I will be writing more on this, focusing on how to make the render engine parallell, but it will take a while because my plan is to describe and then publish the code of an old (but not too old) engine I wrote for a series of articles that had to appear on a magazine, but never did...


Hugo Gomes said...

Or you can use Haskell...

DEADC0DE said...

I think that if and when people will really use immutable data structures to solve their multithreading problems, they'll notice that it doesn't really work :)

If I have to look at something more than parallelfor and stream computing, I'll place my bets on actors and futures, is nice

stevec said...

"You don't need thread safe libraries, because you're not sharing."

This doesn't make sense to me. A non-thread-safe library may be non-thread-safe precisely because the library itself uses static data.

Example: strchr() from libc.

Unless by "you don't need thread-safe libraries" you do not mean "you can get away with non-thread-safe libraries" but instead, "you don't need ANY libraries," which would be just weird.

DEADC0DE said...

stevec: you're right and I wrote it wrong (modified now). What I intented to say is that you don't want to use locks in your libraries, as syncronization primitives do not compose and give you a false sense of safety. Reentrancy on the other hand, is totally needed.

Nick said...

Like where you're going with this.
Could you expand on what you mean by sharing and not sharing?

Pretty sure you mean don't share between explicit threads, but your rules simply say "don't share".

I'm also wondering if "don't synchronize except with the GPU" is overstating? Or are you suggesting that if there is dependent data that might be computed in parallel, such as collision results, the consumer of the collision result should be on the same explicit thread?

deadc0de said...

Nick: I didn't write "don't syncronize except with the GPU" but "the only time you need to think about...".

Obviously you'll need to have sync points, either implicitly by dependencies between parallel computations or explicitly, usually waiting for a parallel computation to end.

But those are easy, with the GPU you have to double buffer, fence, sometimes in a direction, sometimes in the other. That requires thinking, if you find yourself thinking about syncronization mostly or only when dealing with the GPU, then your architecture is fine.

About sharing, that's intended as shared state. That should be avoided as a general design principle (i.e. singletons, statics and everything anyone can access, that have global visibility) and in particular when dealing with multithreading. It makes your code both complex and slow (as it creates contention).

If you go towards data driven parallelism and stream computing, all the other "laws" are implicit anyways... Stream computing is practically the only way to have the kind of data that long vector pipelines are good at digesting, so in a way or another, is what your system should look like :)

Nick said...

Ah, I see what you mean. Thanks for the explanation!

Daniele said...

"I think that if and when people will really use immutable data structures to solve their multithreading problems, they'll notice that it doesn't really work."

Erlang uses only immutable data structures, and it is used in production for big concurrent systems (e.g: mochiads, facebook chat, Ericsson telecom infrastructure.)
At least for some scenarios, we can say that immutable data structures actually work.


DEADC0DE said...

Daniele: yep, but Erlang does not rely on immutable structures to achieve its parallelism, or at least, not only.

Erlang uses the actor model to achieve its parallelism, and that does not require functional purity, I guess that with a lot of hacks and ugly code you could do the same in C++ and many other languages support actors without being pure (again, see AliceML, I've posted the link in the first comment).

Immutability does not help in any case, because either your threads do not share data, and in that case you can do whatever you want with your private memory, or they do need to share data, and in that case the fact that they work on immutable structures does not alleviate from the needs of synchronization, in fact, it makes it much harder...