This change muddled the code too much and gave only a marginal performance improvement overall. Other ways to parallelize MCTS should be nicer to implement and could yield better results.
Replaces sqrt with rsqrt and div with mul. Reduces latency from 23 to 8 cycles.
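
A minimal sketch of the kind of rewrite this describes, not the actual diff. It assumes the hot expression has the form `scale / sqrt(visits)`; the function names `fast_rsqrt`, `exploration_exact`, and `exploration_fast` are hypothetical. On x86 with SSE the approximate reciprocal square root comes from `_mm_rsqrt_ss`; on other targets the sketch falls back to the exact computation so it still compiles.

```cpp
#include <cassert>
#include <cmath>
#if defined(__SSE__)
#include <xmmintrin.h>  // _mm_rsqrt_ss, _mm_set_ss, _mm_cvtss_f32
#endif

// Approximate 1/sqrt(x). With SSE this uses the rsqrtss instruction,
// whose relative error is at most 1.5 * 2^-12. Without SSE it falls
// back to the exact value (assumption: portability matters more than
// speed on those targets).
inline float fast_rsqrt(float x) {
    assert(x > 0.0f);  // rsqrt of zero or a negative value is not useful here
#if defined(__SSE__)
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
#else
    return 1.0f / std::sqrt(x);
#endif
}

// Before: one sqrt followed by one divide.
inline float exploration_exact(float scale, float visits) {
    return scale / std::sqrt(visits);
}

// After: one rsqrt followed by one multiply.
inline float exploration_fast(float scale, float visits) {
    return scale * fast_rsqrt(visits);
}
```

The trade-off is precision: the rsqrt result is only accurate to roughly 11 to 12 bits, which is probably acceptable for an exploration bonus but worth checking if the value feeds anything that needs exact comparisons.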