Previously there was one memory arena for all threads, making it
the bottleneck for multi-threaded performance. As the number of
threads increased, the contention for the lock on the arena would
grow, causing other threads to wait to acquire it.
This commit makes it use 8 memory arenas, and round-robbins how
they are assigned to threads. Threads keep track of the index that
they should use into the arena array, assigned the first time the
arena is accessed on a given thread.
When an image is first created, it is allocated from an arena.
When the logic to have multiple arenas is enabled, it then keeps
track of the index on the image, so that when deleted it can be
returned to the correct arena.
Effectively this means that in single-threaded programs, this
should not really have an effect. We also do not do this logic if
the GIL is enabled, as it effectively acts as the lock on the
default arena for us.
As expected, this approach has no real noticable effect on regular
CPython. On free-threaded CPython, however, there is a massive
difference (measuring up to about 70%).