Level 1 - Block Tiling (Data Distribution):
Blocks are sized to fit in the L1 and L2 memory tiles, which also reduces the bandwidth needed for each transfer.
This distributes data evenly across cores, enabling parallel processing.
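As a rough illustration of block tiling, the sketch below splits a 2D array into fixed-size blocks and hands them to cores round-robin. The function names, the 256x256 array, the 64x64 block size, the 4-core count, and the 16 KiB memory budget are all hypothetical values chosen for the example, not figures from any particular device.

```python
def block_tile(rows, cols, block_rows, block_cols, num_cores):
    """Assign the top-left corner of each fixed-size block to a core, round-robin."""
    if rows % block_rows or cols % block_cols:
        raise ValueError("block size must divide the array evenly")
    assignments = {core: [] for core in range(num_cores)}
    block_id = 0
    for r in range(0, rows, block_rows):
        for c in range(0, cols, block_cols):
            assignments[block_id % num_cores].append((r, c))
            block_id += 1
    return assignments


def block_bytes(block_rows, block_cols, elem_bytes):
    """Memory footprint of one block, used to check it fits a local memory budget."""
    return block_rows * block_cols * elem_bytes


# Hypothetical sizes for illustration only.
LOCAL_BUDGET_BYTES = 16 * 1024
tiles = block_tile(256, 256, 64, 64, num_cores=4)
fits = block_bytes(64, 64, elem_bytes=1) <= LOCAL_BUDGET_BYTES
```

With these assumed sizes, the 16 blocks split into 4 per core, and each int8 block occupies 4096 bytes. Requiring the block size to divide the array evenly is what keeps the per-core load balanced.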
Level 2 - Vector Tiling:
Tiles are sized to match the AIE vector registers, enabling 512 int8 or 64 int16 multiply-accumulate operations in one cycle.
Data movement from L2 → L1 is therefore optimized by chunking the data into fixed-size blocks, which distributes the workload evenly across the available cores and keeps bandwidth use consistent. Within each core, every block divides evenly into vector-sized units that the VLIW processor can operate on directly. The result is parallelism at both the system level (across cores) and the hardware level (within each core).
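The two levels can be sketched together as a nested loop: an outer loop over fixed-size blocks (standing in for L2 → L1 transfers) and an inner loop over vector-width chunks (standing in for one vector multiply-accumulate each). This is a plain Python model, not AIE code; `tiled_dot`, `vector_chunks`, the block length of 64, and the 32-lane width are hypothetical choices for illustration.

```python
def vector_chunks(block, lanes):
    """Yield consecutive lanes-wide slices of a flat block."""
    if len(block) % lanes:
        raise ValueError("lanes must divide the block length evenly")
    for start in range(0, len(block), lanes):
        yield block[start:start + lanes]


def tiled_dot(a, b, block_len, lanes):
    """Dot product computed block by block, then vector chunk by vector chunk.

    Returns the result and the number of vector-wide operations issued.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if len(a) % block_len:
        raise ValueError("block_len must divide the input length evenly")
    total = 0
    vector_ops = 0
    for start in range(0, len(a), block_len):      # level 1: block tiling
        a_block = a[start:start + block_len]
        b_block = b[start:start + block_len]
        chunks = zip(vector_chunks(a_block, lanes), vector_chunks(b_block, lanes))
        for va, vb in chunks:                      # level 2: vector tiling
            total += sum(x * y for x, y in zip(va, vb))  # models one vector MAC
            vector_ops += 1
    return total, vector_ops


a = list(range(256))
b = [1] * 256
result, ops = tiled_dot(a, b, block_len=64, lanes=32)
```

Because both tile sizes divide their parent evenly, there is no remainder handling at either level, so every block and every vector operation does the same amount of work.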