[1] Another definition of granularity takes into account the communication overhead between multiple processors or processing elements.
In fine-grained parallelism, a program is broken down into a large number of small tasks.
A shared-memory architecture, which has low communication overhead, is best suited to fine-grained parallelism.
[4] Connection Machine (CM-2) and J-Machine are examples of fine-grain parallel computers that have grain size in the range of 4-5 μs.
Coarse-grained decomposition, however, might result in load imbalance, wherein certain tasks process the bulk of the data while others sit idle.
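A minimal sketch of this imbalance, using assumed task sizes (the work units and processor count below are illustrative, not from the source):

```python
# Hypothetical example: four coarse tasks, one of which holds the bulk
# of the data. Task sizes are in arbitrary units of work.
task_sizes = [70, 10, 10, 10]
n_procs = 4

# With one task per processor, the finish time (makespan) is set by
# the largest task, while the other processors go idle early.
makespan = max(task_sizes)           # time with this imbalanced split
ideal = sum(task_sizes) / n_procs    # time with a perfectly even split
utilization = ideal / makespan       # fraction of processor time used

print(makespan, ideal, utilization)
```

With these numbers the run takes 70 units instead of the ideal 25, so average processor utilization is only about 36%.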
A message-passing architecture takes a long time to communicate data among processes, which makes it suitable for coarse-grained parallelism.
[1] The Cray Y-MP is an example of a coarse-grained parallel computer, with a grain size of about 20 s.
[4] The Intel iPSC is an example of a medium-grained parallel computer, with a grain size of about 10 ms.
Medium-grained parallelism: The image is split into four quarters (5x5 pixels each), and each quarter is processed individually by one processor, taking 25 clock cycles.
Coarse-grained parallelism: The full image is processed by a single processor, taking 100 clock cycles.
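The image example can be sketched as follows, assuming (as the text implies) a 10x10 image, one clock cycle per pixel, and one processor per task, so the parallel time equals the pixels assigned to each task:

```python
# 10x10 image, 1 clock cycle per pixel, one task per processor.
pixels = 10 * 10

# Parallel time in clock cycles = pixels per task.
fine = pixels // 100   # fine-grained: 1 pixel per task, 100 processors
medium = pixels // 4   # medium-grained: 5x5 quarter per task, 4 processors
coarse = pixels // 1   # coarse-grained: whole image, 1 processor

print(fine, medium, coarse)
```

This reproduces the 25 and 100 clock-cycle figures above; the finer the grain, the shorter the compute phase per task, but the more tasks that must coordinate.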
The goal should be to maximize parallelization (split the work into enough units to distribute it evenly across the available processors) while minimizing communication overhead (the ratio of time spent on communication to time spent on computation).
[1] At the sub-routine (or procedure) level the grain size is typically a few thousand instructions.
Finding the best grain size depends on a number of factors and varies greatly from problem to problem.