The type of CPU can usually be identified simply by looking at the motherboard. Very often it is an "off-the-shelf" part from a well-known manufacturer, and not something developed exclusively for a particular console (there are exceptions though). These manufacturers publish data sheets, manuals etc. because they want system builders and programmers to be able to use the CPU.
From that you can build a CPU emulator, and guess at the memory map (i.e. which parts of the system are connected to which CPU addresses). For example the 6502 has a "zero page" in the first 256 bytes ($0000-$00FF) which is fast to access, a stack in the next 256 bytes ($0100-$01FF), and fixed vectors (NMI, reset, IRQ) in the last six bytes of the address space ($FFFA-$FFFF). From that you can assume that there's RAM mapped at the beginning and ROM at the end of the address space.
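To make the address-map guess above concrete, here is a minimal sketch in Python. The 6502 facts it relies on are the reset vector at $FFFC/$FFFD (little-endian) and the zero page and stack sitting in the lowest 512 bytes. Everything else is an assumption for illustration only: the RAM size, the ROM start address, the 0xFF value for unmapped reads, and the names `GuessedBus`, `reset_address` and `demo_rom` are all hypothetical and do not describe any particular console.

```python
RAM_SIZE = 0x0800      # assumption: 2 KiB of RAM at the bottom of the map
ROM_START = 0x8000     # assumption: 32 KiB of ROM at the top of the map
RESET_VECTOR = 0xFFFC  # 6502 fact: reset vector is stored at $FFFC/$FFFD


class GuessedBus:
    """A first guess at a 6502 machine: RAM low, ROM high, nothing between."""

    def __init__(self, rom: bytes) -> None:
        if len(rom) != 0x10000 - ROM_START:
            raise ValueError("ROM image must fill ROM_START..$FFFF in this sketch")
        self.ram = bytearray(RAM_SIZE)
        self.rom = bytes(rom)

    def read(self, addr: int) -> int:
        addr &= 0xFFFF  # the 6502 has a 16-bit address bus
        if addr < RAM_SIZE:
            return self.ram[addr]
        if addr >= ROM_START:
            return self.rom[addr - ROM_START]
        return 0xFF  # placeholder for unmapped space (assumption)

    def write(self, addr: int, value: int) -> None:
        addr &= 0xFFFF
        if addr < RAM_SIZE:
            self.ram[addr] = value & 0xFF
        # Writes to ROM or unmapped space are ignored in this sketch.


def reset_address(bus: GuessedBus) -> int:
    """Address the CPU starts executing from after reset (little-endian)."""
    return bus.read(RESET_VECTOR) | (bus.read(RESET_VECTOR + 1) << 8)


def demo_rom() -> bytes:
    """Hypothetical all-zero ROM whose reset vector points at ROM_START."""
    rom = bytearray(0x10000 - ROM_START)
    rom[RESET_VECTOR - ROM_START] = ROM_START & 0xFF
    rom[RESET_VECTOR + 1 - ROM_START] = ROM_START >> 8
    return bytes(rom)
```

With a real ROM dump in place of `demo_rom()`, `reset_address(bus)` tells you where to point the emulated program counter first. Reads that land in the unmapped gap are hints about where the rest of the hardware (video, sound, controllers) might be connected.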
Once you manage to get the emulated CPU to run the first few bytes, you can follow the game's binary code and see what it's doing. By writing test ROMs and running them on the real hardware you can explore the machine's behavior.
Shaders.
What difficulties?
When this question comes up, people are essentially saying that parts of the emulated system should be separated and distributed across several CPU cores. However, a CPU core works best when it can run independently of the others, because then it doesn't have to wait for data to arrive or wait for synchronization (the other core must finish its current task, check for waiting queries, and send its "ready!" signal). As you can see from these numbers, synchronization (e.g. mutexes) is relatively slow. If I can fit most of the system state into the L1 cache and much of the data/instruction stream into the L2 cache, I'm going to be more than 4 times faster working with L2 cache data than working with another core.
GPUs take the "separate the cores" approach even further: their SMs / CUs are each slower than a CPU core and have less cache associated with them. This doesn't help for a console like the SNES, where every part of the system can influence another part of the system at a rate of ~21MHz (every ~46.56ns).
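The ~46.56ns figure is just the period of the master clock, and it also gives a rough feel for the synchronization budget. The sketch below assumes the commonly cited NTSC master clock of 21.477272 MHz. The helper names and the idea of charging exactly one cross-core synchronization per master tick are simplifying assumptions, and any `sync_cost_ns` you pass in is a hypothetical value, not a measurement.

```python
MASTER_CLOCK_HZ = 21_477_272  # assumption: commonly cited NTSC master clock


def tick_period_ns(clock_hz: float = MASTER_CLOCK_HZ) -> float:
    """Length of one master clock tick in nanoseconds."""
    return 1e9 / clock_hz


def realtime_fraction(sync_cost_ns: float,
                      clock_hz: float = MASTER_CLOCK_HZ) -> float:
    """Upper bound on emulation speed (1.0 = full speed) if every master
    tick needed one cross-core synchronization costing sync_cost_ns and
    the emulation itself took no time at all."""
    if sync_cost_ns <= 0:
        raise ValueError("sync_cost_ns must be positive")
    return min(1.0, tick_period_ns(clock_hz) / sync_cost_ns)


if __name__ == "__main__":
    print(f"tick period: {tick_period_ns():.2f} ns")
    # 100 ns is a purely hypothetical synchronization cost.
    print(f"speed bound at 100 ns per sync: {realtime_fraction(100.0):.0%}")
```

Real emulators do not synchronize on every single tick, so this is a worst-case bound rather than a prediction. It only illustrates why a per-tick handshake between cores gets expensive long before any actual emulation work is done.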