In contrast, the VLIW method depends on the program (and ultimately its compiler) providing all the decisions about which instructions execute simultaneously and how conflicts are resolved.
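For illustration only, a minimal C sketch of the idea follows; the field names and widths are hypothetical and do not correspond to any real instruction set. A VLIW instruction is a fixed bundle of slots, one per execution unit, and the compiler alone decides what goes in each slot, inserting no-ops where it finds nothing to issue.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical bundle layout: one slot per execution unit.      */
    /* The compiler fills every slot at compile time; the hardware   */
    /* issues all four operations together without checking for      */
    /* conflicts between them.                                        */
    struct vliw_bundle {
        uint16_t int_alu_op;   /* integer ALU slot                   */
        uint16_t fp_op;        /* floating-point slot                */
        uint16_t mem_op;       /* load/store slot                    */
        uint16_t branch_op;    /* branch slot (no-op if unused)      */
    };

    int main(void) {
        /* The whole bundle is one fixed-width "very long" word.     */
        printf("bundle width: %zu bits\n", sizeof(struct vliw_bundle) * 8);
        return 0;
    }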
Before VLIW, the notion of prescheduling execution units and instruction-level parallelism in software was well established in the practice of developing horizontal microcode.
His ideas were later developed and published as part of a textbook[4] two years before Fisher's seminal paper, but because of the Iron Curtain, and because Kartsev's work was mostly military-related, it remained largely unknown in the West.
He realized that to get good performance and target a wide-issue machine, it would be necessary to find parallelism beyond what is generally available within a basic block.
Fisher developed a set of principles characterizing a proper VLIW design, such as self-draining pipelines, wide multi-port register files, and memory architectures.
[clarification needed] The major semiconductor companies recognized the value of Multiflow technology in this context, so the compiler and architecture were subsequently licensed to most of these firms.
In contrast, VLIW executes operations in parallel, based on a fixed schedule, determined when programs are compiled.
All of this fits in one 48-bit instruction:

    f12 = f0 * f4, f8 = f8 + f12, f0 = dm(i0, m3), f4 = pm(i8, m9);

Since the earliest days of computer architecture,[6] some CPUs have added several arithmetic logic units (ALUs) to run in parallel.
If the CPU guesses wrong, all of these instructions and their context need to be flushed and the correct ones loaded, which takes time.
This has led to increasingly complex instruction-dispatch logic that attempts to guess correctly, and the simplicity of the original reduced instruction set computing (RISC) designs has been eroded.
If the branch takes an unexpected path, the compiler has already generated compensating code to discard the speculative results, preserving program semantics.
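As a rough illustration, consider the hypothetical source fragment below (it is not taken from any particular compiler, and it ignores the fault-safety machinery that real speculative loads require). The compiler hoists a load above a branch so it can be issued in an otherwise empty slot; if the branch goes the unexpected way, the compensation consists simply of never using the speculative value, so the original semantics are preserved.

    #include <stdio.h>

    /* Original (hypothetical) source: the load happens only on the       */
    /* path the compiler expects to be taken.                             */
    int original(const int *p, int cond, int fallback) {
        if (cond)
            return *p + 1;      /* expected, "on-trace" path               */
        return fallback;        /* unexpected path                         */
    }

    /* After speculative scheduling: the load is issued before the        */
    /* branch so it can overlap with other work. If the branch goes the   */
    /* unexpected way, the speculative value t is simply discarded.       */
    int scheduled(const int *p, int cond, int fallback) {
        int t = *p;             /* speculative load, hoisted above branch  */
        if (cond)
            return t + 1;
        return fallback;        /* speculative result discarded            */
    }

    int main(void) {
        int x = 41;
        /* Both versions compute the same results: prints "42 42".        */
        printf("%d %d\n", original(&x, 1, 0), scheduled(&x, 1, 0));
        return 0;
    }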
[citation needed] Cydrome was a company producing VLIW numeric processors using emitter-coupled logic (ECL) integrated circuits in the same timeframe (late 1980s).
In VLIW mode, the i860 could sustain floating-point performance in the range of 20-40 double-precision MFLOPS, a very high figure for its time and for a processor running at 25-50 MHz.
Examples of contemporary VLIW CPUs include the TriMedia media processors by NXP (formerly Philips Semiconductors), the Super Harvard Architecture Single-Chip Computer (SHARC) DSP by Analog Devices, the ST200 family by STMicroelectronics based on the Lx architecture (designed in Josh Fisher's HP lab by Paolo Faraboschi), the FR-V from Fujitsu, the BSP15/16[8] from Pixelworks, the CEVA-X DSP from CEVA, the Jazz DSP from Improv Systems, the HiveFlex[9] series from Silicon Hive, and the MPPA Manycore family by Kalray.
It uses a similar code density improvement method called configurable long instruction word (CLIW).
[10] Outside embedded processing markets, Intel's Itanium IA-64 explicitly parallel instruction computing (EPIC) and the Elbrus 2000 appear as the only examples of widely used VLIW CPU architectures.
VLIWs also gained significant consumer penetration in the graphics processing unit (GPU) market, though both Nvidia and AMD have since moved to RISC architectures to improve performance on non-graphics workloads.
Transmeta addressed this issue by including a binary-to-binary software compiler layer (termed code morphing) in their Crusoe implementation of the x86 architecture.
This mechanism was advertised as essentially recompiling, optimizing, and translating x86 opcodes at runtime into the CPU's internal machine code.
Thus, the Transmeta chip is internally a VLIW processor, effectively decoupled from the x86 CISC instruction set that it executes.
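A toy sketch of the caching aspect of that idea is shown below; the names and structure are hypothetical, and Transmeta's actual translator was far more sophisticated. Guest code blocks are translated into native routines the first time they are executed, and the cached translation is reused on every later execution.

    #include <stdio.h>

    #define MAX_BLOCKS 16

    /* A translated guest block, represented here as a plain native       */
    /* function; in a real code-morphing system this would be generated   */
    /* VLIW machine code.                                                  */
    typedef long (*native_fn)(long);

    static native_fn cache[MAX_BLOCKS];   /* translation cache, keyed by block id */

    /* Canned stand-ins for two "translated" guest blocks.                */
    static long block_even(long x) { return x + 1; }
    static long block_odd(long x)  { return x * 2; }

    /* "Translate" a guest block: here we just pick a canned routine,     */
    /* standing in for the decode-optimize-encode work done at runtime.   */
    static native_fn translate(int block_id) {
        return (block_id % 2 == 0) ? block_even : block_odd;
    }

    static long execute(int block_id, long x) {
        if (cache[block_id] == NULL)      /* miss: translate once...       */
            cache[block_id] = translate(block_id);
        return cache[block_id](x);        /* ...then reuse the translation */
    }

    int main(void) {
        long v = 3;
        v = execute(0, v);                /* first use: translate + run    */
        v = execute(1, v);
        v = execute(0, v);                /* cache hit: no retranslation   */
        printf("%ld\n", v);               /* prints 9                      */
        return 0;
    }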
VLIW architectures are growing in popularity, especially in the embedded system market, where it is possible to customize a processor for an application in a system-on-a-chip.