Once you have sorted out the concurrency problems for your CPU threads, established a safe way to generate data on the CPU for the GPU, and set up your object pipelines, you will end up hitting the problem of multithreaded draw calls.
At the moment, such a thing is not possible on any mainstream platform: you have to issue all the draw calls from a single thread that owns the rendering device. The usual solution to this problem is to bake command buffers (display lists in OpenGL terminology) on the non-rendering threads and then pass them to the rendering thread, which executes them.
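As a rough illustration of the baking idea, here is a minimal C++ sketch. Everything in it is hypothetical: `Device` is a stand-in that only logs what would be submitted, and `CommandBuffer` stores closures rather than native GPU commands, which is not how a real driver-level command buffer works. It only shows the shape of the approach: workers record, the rendering thread replays.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the rendering device. Only the rendering
// thread is supposed to touch it. It logs calls instead of drawing.
struct Device {
    std::vector<std::string> log;
    void setMaterial(std::uint32_t id) { log.push_back("SetMaterial " + std::to_string(id)); }
    void draw(std::uint32_t mesh)      { log.push_back("Draw " + std::to_string(mesh)); }
};

// A command buffer owned by one worker thread while it is being recorded.
// Once baked it is read-only and can be handed to the rendering thread.
class CommandBuffer {
    std::vector<std::function<void(Device&)>> commands;
    bool baked = false;

    void record(std::function<void(Device&)> command) {
        assert(!baked && "cannot record into a baked buffer");
        commands.push_back(std::move(command));
    }

public:
    void setMaterial(std::uint32_t id) { record([id](Device& d) { d.setMaterial(id); }); }
    void draw(std::uint32_t mesh)      { record([mesh](Device& d) { d.draw(mesh); }); }
    void bake() { baked = true; }

    // Rendering thread only: replay the commands exactly as recorded.
    void execute(Device& device) const {
        assert(baked && "only baked buffers may be executed");
        for (const auto& command : commands) command(device);
    }
};

// Rendering thread: buffers are replayed one after the other, in the order
// they were handed over. Nothing is reordered across buffers.
void submitAll(Device& device, const std::vector<CommandBuffer>& buffers) {
    for (const CommandBuffer& buffer : buffers) buffer.execute(device);
}
```

Note that `submitAll` can only concatenate the buffers: if two workers each set a different material, the state changes end up in the final stream in whatever order the buffers arrive.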
The problem with this approach is that you can't sort, optimize or otherwise reorder primitives across threads: all the draw calls are already baked and are just copied into the main ring buffer. Of course you can always organize your rendering objects in lists that are bound to a given render buffer/pass and process those lists in parallel. Doing so, you can be fairly sure that there is no optimization left beyond what you can perform within a single list. A problem arises when you have a few big lists and all the others are much smaller: in that case, processing one list per thread does not give you optimal load balancing. So other solutions may be more suitable, depending on the context.
A very common one is to work with higher-level rendering data, usually meshes with materials and all the context needed to draw them. Those primitives/contexts are added by the various threads to a render queue that is then sorted (by render target, pass, material, etc.) and used to generate state changes and draw calls. The state API is hidden and used only by the rendering thread. Using a lock-free stack for the queue helps.
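A minimal sketch of such a render queue, in C++, could look like the following. The names (`RenderItem`, `RenderQueue`, `flush`), the bit layout of the sort key and the string "commands" are all assumptions made for illustration; a real engine would emit native API calls instead of strings. The stack is a simple push-only lock-free stack that the rendering thread empties in one go, which is enough here because items are sorted afterwards anyway.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical high-level item: what to draw and in which context.
struct RenderItem {
    std::uint32_t renderTarget, pass, material, mesh;

    // Pack the ordering criteria into one integer, most significant first.
    std::uint32_t sortKey() const {
        assert(renderTarget < 256 && pass < 256 && material < 65536);
        return (renderTarget << 24) | (pass << 16) | material;
    }
};

// Push-only lock-free stack. Any thread may push; only the rendering
// thread drains, taking the whole list with a single atomic exchange.
class RenderQueue {
    struct Node { RenderItem item; Node* next; };
    std::atomic<Node*> head{nullptr};

public:
    void push(const RenderItem& item) {
        Node* node = new Node{item, head.load(std::memory_order_relaxed)};
        // On failure node->next is refreshed with the current head; retry.
        while (!head.compare_exchange_weak(node->next, node,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {}
    }

    std::vector<RenderItem> drain() {
        Node* node = head.exchange(nullptr, std::memory_order_acquire);
        std::vector<RenderItem> items;
        while (node) {
            items.push_back(node->item);
            Node* next = node->next;
            delete node;
            node = next;
        }
        return items;
    }

    ~RenderQueue() { drain(); }
};

// Rendering thread: sort the items, then emit a state change only when
// the state differs from that of the previous item.
std::vector<std::string> flush(RenderQueue& queue) {
    std::vector<RenderItem> items = queue.drain();
    std::sort(items.begin(), items.end(),
              [](const RenderItem& a, const RenderItem& b) { return a.sortKey() < b.sortKey(); });

    std::vector<std::string> commands;  // stands in for native API calls
    const RenderItem* prev = nullptr;
    for (const RenderItem& it : items) {
        const bool newTarget   = !prev || prev->renderTarget != it.renderTarget;
        const bool newPass     = newTarget || prev->pass != it.pass;
        const bool newMaterial = newPass || prev->material != it.material;
        if (newTarget)   commands.push_back("SetRenderTarget " + std::to_string(it.renderTarget));
        if (newPass)     commands.push_back("SetPass " + std::to_string(it.pass));
        if (newMaterial) commands.push_back("SetMaterial " + std::to_string(it.material));
        commands.push_back("Draw " + std::to_string(it.mesh));
        prev = &it;
    }
    return commands;
}
```

Worker threads would only ever call `push`, while `flush` runs once per frame on the rendering thread. Items with identical keys come out in no particular order, which is fine as long as they share the same state.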
Another interesting solution is the one employed by Capcom's MT engine (Lost Planet). It works like a cross-platform command buffer API, where the commands carry hints on their ordering (render target, pass, etc.) and are issued in parallel by multiple threads, then sorted in each thread, gathered and merge-sorted together in the rendering thread, and finally converted into actual draw calls. This is a hybrid between a high-level submission API and a native command buffer API: you can still do every kind of inter-thread optimization, but in a very fast way, without hiding the state API, and doing only a simple translation of the commands to the native API ones in the main thread.
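The sort-then-merge step can be sketched in C++ as below. This is not Capcom's actual code: the `Command` layout, the `makeKey` bit packing and the function names are assumptions chosen only to show the idea of sorting locally on each worker and merging the already sorted lists on the rendering thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical cross-platform command: an ordering hint plus a payload.
struct Command {
    std::uint32_t key;   // ordering hint, smaller keys are executed first
    std::uint32_t mesh;  // payload, here just a mesh id to draw
};

// Assumed key layout: render target in the high bits, pass in the low bits.
std::uint32_t makeKey(std::uint8_t renderTarget, std::uint8_t pass) {
    return (std::uint32_t(renderTarget) << 8) | pass;
}

bool byKey(const Command& a, const Command& b) { return a.key < b.key; }

// Worker thread: sort the commands it recorded itself. No sharing, no
// locks. stable_sort keeps submission order among commands with equal keys.
std::vector<Command> sortLocal(std::vector<Command> recorded) {
    std::stable_sort(recorded.begin(), recorded.end(), byKey);
    return recorded;
}

// Rendering thread: merge the per-thread lists, which are already sorted,
// into a single ordered stream ready to be translated to native calls.
std::vector<Command> mergeSorted(const std::vector<std::vector<Command>>& lists) {
    std::vector<Command> merged;
    for (const auto& list : lists) {
        assert(std::is_sorted(list.begin(), list.end(), byKey));
        std::vector<Command> next;
        next.reserve(merged.size() + list.size());
        std::merge(merged.begin(), merged.end(), list.begin(), list.end(),
                   std::back_inserter(next), byKey);
        merged.swap(next);
    }
    return merged;
}
```

The expensive part, the sorting, happens in parallel on the workers, and the rendering thread only pays for linear merges. Merging the lists one after the other is the simplest option; with many worker threads a k-way merge would touch each command fewer times.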