A goal of the MTA is that porting codes from other machines should be straightforward yet still yield good performance.
A parallelizing FORTRAN compiler can produce high performance for some codes with little manual intervention.
Where manual porting is required, the simple and fine-grained synchronization model often allows programmers to write code the "obvious" way yet achieve good performance.
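The text above does not describe the synchronization model in detail. As an illustration only, the sketch below assumes the per-word full/empty-bit scheme usually associated with the MTA, and emulates it in software: a write waits until the word is empty and then marks it full, and a read waits until the word is full and then marks it empty. The class `FullEmptyCell`, its method names, and the `handoff` helper are hypothetical names chosen for this example, not part of any MTA toolchain.

```python
import threading


class FullEmptyCell:
    """One memory word guarded by a full/empty flag (software emulation, assumed model)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write_when_empty(self, value):
        """Block until the word is empty, store the value, then mark the word full."""
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_when_full(self):
        """Block until the word is full, take the value, then mark the word empty."""
        with self._cond:
            while not self._full:
                self._cond.wait()
            value = self._value
            self._full = False
            self._cond.notify_all()
            return value


def handoff(items):
    """Pass every item from a producer thread to a consumer thread through one cell."""
    items = list(items)
    cell = FullEmptyCell()
    received = []

    def producer():
        for item in items:
            cell.write_when_empty(item)

    def consumer():
        for _ in range(len(items)):
            received.append(cell.read_when_full())

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

Because each word carries its own flag, the producer and consumer are written the "obvious" way, with no separate lock or queue object. On hardware with such a flag per word, the waiting would be done by the memory system rather than by a condition variable as in this emulation.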
Only a few systems have been deployed, and only MTA-2 benchmarks have been widely reported, making performance comparisons difficult.
Second, memory references take 150-170 cycles,[4][5] a much higher latency than even a slow cache, thus increasing the number of runnable threads required to keep the ALU busy.
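The link between memory latency and the number of runnable threads can be made concrete with a rough estimate. The sketch below assumes a deliberately simplified model that is not taken from the MTA documentation: every thread alternates between a fixed number of busy cycles on the ALU and one memory reference that stalls it for the full latency. The function name and its parameters are hypothetical.

```python
def threads_to_hide_latency(latency_cycles, busy_cycles_per_reference):
    """Estimate how many runnable threads keep the ALU busy every cycle.

    Assumed model: each thread does `busy_cycles_per_reference` cycles of
    ALU work, then issues one memory reference and stalls for
    `latency_cycles`. One thread therefore uses the ALU for w cycles out
    of every (w + L), so N threads cover the ALU when N * w >= w + L.
    """
    if latency_cycles < 0:
        raise ValueError("latency_cycles must be non-negative")
    if busy_cycles_per_reference <= 0:
        raise ValueError("busy_cycles_per_reference must be positive")
    w = busy_cycles_per_reference
    total = latency_cycles + w
    # Integer ceiling of total / w, avoiding floating-point rounding.
    return (total + w - 1) // w


if __name__ == "__main__":
    for latency in (150, 170):
        for busy in (1, 10):
            n = threads_to_hide_latency(latency, busy)
            print(f"latency={latency} busy={busy} -> {n} threads")
```

Under this model, a thread that references memory after every cycle of work needs about 151 to 171 runnable peers to cover a 150-170 cycle latency, while a thread that does 10 cycles of work per reference needs about 16 to 18. Real hardware adds constraints this estimate ignores, such as limits on how often a single thread may issue.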
The MTA-4 will have a non-coherent cache, which can be used for read-only and unshared data (such as non-shared stack frames) but requires software coherency, for example if a thread is migrated between CPUs.
Data cache competition is often a performance bottleneck for highly concurrent processors, and sometimes even for 2-core systems; however, by using the cache for data that is either highly shared or has very high locality (stack frames), competition between threads can be kept low.
The latency-tolerance hardware may be wasted on regular calculations, including those whose latency is high but which can be scheduled easily.