A DMA engine usually receives "instructions" with 4 fields:
- source address
- destination address
- size/length
- options/details/opcode
With a 32-bit processor, each field is 32 bits wide and the overall instruction length is 128 bits. Most of the design decisions initially concern the "opcode" field.
The destination address field holds 32 bits, which can be a main memory address or an I/O port address.
The source address field can be either of these, or an immediate value for "memset" operations.
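As a rough illustration of the 128-bit layout described above, here is a minimal C sketch. The field order, the opcode flag bits and the names `dma_insn_t` and `dma_make_fill` are all assumptions made for the example; nothing above fixes an actual encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcode flags: the bit assignments are assumptions. */
#define DMA_OP_SRC_IMMEDIATE (1u << 0) /* src holds a fill value, not an address */
#define DMA_OP_SRC_IO        (1u << 1) /* src is an I/O port address */
#define DMA_OP_DST_IO        (1u << 2) /* dst is an I/O port address */

/* One DMA "instruction": four 32-bit fields, 128 bits in total. */
typedef struct {
    uint32_t src;    /* memory address, I/O port, or immediate value */
    uint32_t dst;    /* memory address or I/O port */
    uint32_t length; /* transfer size in bytes */
    uint32_t opcode; /* options/details */
} dma_insn_t;

static_assert(sizeof(dma_insn_t) == 16, "a DMA instruction must be 128 bits");

/* Build a "memset"-style instruction: the source field carries the value. */
static inline dma_insn_t dma_make_fill(uint32_t dst, uint32_t value,
                                       uint32_t length)
{
    dma_insn_t insn = { value, dst, length, DMA_OP_SRC_IMMEDIATE };
    return insn;
}
```

Keeping the "kind" of each address in the opcode field means the address fields themselves stay plain 32-bit values, whatever they point at.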
I/O ports can map to special CPU/configuration registers, I/O coprocessors (serial, SPI or Ethernet, for example) or computational processors (block/stream compression, crypto, DSP, DCT...).
Instructions can be loaded by a CPU, or possibly indirectly through a "link" address (a 5th field to build linked lists?).
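The linked-list idea could look like the following host-side simulation. It is only a sketch under several assumptions: the link is an index into a descriptor table rather than a raw address, `DMA_LINK_END` is an invented end-of-chain marker, and only memory-to-memory copies inside a simulated memory array are modelled.

```c
#include <stdint.h>
#include <string.h>

#define DMA_LINK_END 0xFFFFFFFFu /* hypothetical end-of-chain marker */

/* Descriptor with a 5th "link" field (assumed here to be a table index). */
typedef struct {
    uint32_t src, dst, length, opcode;
    uint32_t link; /* index of the next descriptor, or DMA_LINK_END */
} dma_linked_insn_t;

/* Run a chain of copies inside the simulated memory `mem`.
 * Returns the number of descriptors executed, or -1 on a bad link,
 * an out-of-range transfer, or a chain that loops back on itself. */
static int dma_run_chain(uint8_t *mem, uint32_t mem_size,
                         const dma_linked_insn_t *table, uint32_t table_len,
                         uint32_t first)
{
    int executed = 0;
    uint32_t idx = first;

    while (idx != DMA_LINK_END) {
        /* A loop-free chain can visit at most table_len descriptors. */
        if (idx >= table_len || (uint32_t)executed >= table_len)
            return -1;

        const dma_linked_insn_t *d = &table[idx];
        if (d->length > mem_size ||
            d->src > mem_size - d->length ||
            d->dst > mem_size - d->length)
            return -1;

        memmove(mem + d->dst, mem + d->src, d->length);
        executed++;
        idx = d->link;
    }
    return executed;
}
```

In this model the CPU only writes the first index; the engine then fetches the following instructions by itself.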
Early termination is an optional feature (for example, stop when a given number of "0" bytes have been read).
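One possible reading of early termination, sketched in C: the copy stops once a given count of zero bytes has been read. Whether the zeros must be consecutive, and whether the last zero is itself copied, is not specified above; this sketch assumes a running total and copies the terminating byte. The function name and the `zero_limit` parameter are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy up to `length` bytes from src to dst. Stop early once `zero_limit`
 * zero bytes have been read in total (the last zero is copied too).
 * zero_limit == 0 disables early termination.
 * Returns the number of bytes actually copied. */
static size_t dma_copy_until_zeros(uint8_t *dst, const uint8_t *src,
                                   size_t length, unsigned zero_limit)
{
    unsigned zeros = 0;

    for (size_t i = 0; i < length; i++) {
        dst[i] = src[i];
        if (src[i] == 0 && zero_limit != 0 && ++zeros == zero_limit)
            return i + 1;
    }
    return length;
}
```

With `zero_limit` set to 1 this behaves like a bounded string copy, which is probably the most common use.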
Completion of each instruction could raise an IRQ signal, and the state of all channels must be easy to probe/read.
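A simple way to make the state of all channels readable at once is one bit per channel in a few status words. The register set below (`busy`, `done`, `irq_enable`) and the read-and-clear acknowledge are assumptions for illustration, not a defined interface.

```c
#include <stdint.h>

/* Hypothetical status block: bit n of each word refers to channel n. */
typedef struct {
    uint32_t busy;       /* channel is currently transferring */
    uint32_t done;       /* channel finished, not yet acknowledged */
    uint32_t irq_enable; /* completion of this channel raises the IRQ */
} dma_status_t;

/* Mark channel `ch` as started. */
static void dma_start(dma_status_t *s, unsigned ch)
{
    s->busy |= 1u << ch;
}

/* Mark channel `ch` as completed. */
static void dma_complete(dma_status_t *s, unsigned ch)
{
    s->busy &= ~(1u << ch);
    s->done |= 1u << ch;
}

/* Level of the shared IRQ line: high while an enabled channel is done. */
static int dma_irq_line(const dma_status_t *s)
{
    return (s->done & s->irq_enable) != 0;
}

/* Probe all channels in one read and acknowledge the finished ones. */
static uint32_t dma_read_and_clear_done(dma_status_t *s)
{
    uint32_t finished = s->done;
    s->done = 0;
    return finished;
}
```

A single read then tells the CPU which channels finished, without polling each channel separately.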
Priorities and "ready flags" are used to interleave accesses when, for example, slow I/O provides intermittent data that is mixed with large block transfers.
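The interleaving could be decided by a small arbiter that, on each bus slot, picks the ready channel with the highest priority. This is a sketch assuming fixed priorities where a larger number wins and ties go to the lowest channel index; the structure and names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint8_t  priority;  /* larger value wins (assumption) */
    uint8_t  ready;     /* "ready flag": data or space is available now */
    uint32_t remaining; /* bytes left to transfer */
} dma_channel_t;

/* Pick the channel to serve for the next access.
 * Returns the channel index, or -1 if no channel can make progress. */
static int dma_arbitrate(const dma_channel_t *ch, int n)
{
    int best = -1;

    for (int i = 0; i < n; i++) {
        if (!ch[i].ready || ch[i].remaining == 0)
            continue;
        if (best < 0 || ch[i].priority > ch[best].priority)
            best = i;
    }
    return best;
}
```

A slow peripheral channel with a high priority only wins the slots where its ready flag is set, so a large block transfer on a low-priority channel fills all the remaining slots.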
Ideally, byte and half-word packing would also be handled, for unaligned blocks or for peripherals that send sub-words (serial or SPI, for example).
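Byte packing can be pictured as a small accumulator that collects sub-words from a peripheral and emits a full 32-bit word once four bytes have arrived. The sketch below assumes little-endian packing (first byte in the low bits); the type and function names are invented for the example.

```c
#include <stdint.h>

/* Accumulator for bytes arriving one at a time (e.g. from a serial port). */
typedef struct {
    uint32_t word;  /* partially assembled word */
    unsigned count; /* number of bytes collected so far (0..3) */
} dma_packer_t;

/* Push one byte. Returns 1 and writes the completed word to *out when
 * four bytes have been collected, otherwise returns 0. */
static int dma_pack_byte(dma_packer_t *p, uint8_t b, uint32_t *out)
{
    p->word |= (uint32_t)b << (8u * p->count);
    if (++p->count == 4) {
        *out = p->word;
        p->word = 0;
        p->count = 0;
        return 1;
    }
    return 0;
}
```

The same accumulator handles an unaligned block start if `count` is preset to the byte offset, although the first partial word then needs a byte-masked write, which is not shown here.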