IBM e-business Cryptographic Accelerator, AKA Leedslite

Leedslite DD for Linux kernel README
Copyright (c) International Business Machines Corp., 2001
Author: Jon Grimm


IBM e-business Cryptographic Accelerator (ICA)

	Leedslite was the internal name for the ICA Adapter.  
	ICA is a PCI based cryptographic accelerator.   The adapter 
provides the following functionality:
	RSA - CRT and non-CRT, 256-2048 bit key lengths
	DES - CBC & ECB Modes
	TDES (EDE) - CBC & ECB Modes
	SHA1 - secure hash
	Random Number Generation


	The main controllers on the ICA adapter are:
	Merlin - crypto controller
	Piuma - PCI interface
	Five UltraCyphers - crypto processors


	To program the Leedslite DD the following resources were consulted:
	Leedslite Adapter Functional Specification
	IBM UltraCypher Cryptographic Engine Specification  
	Applied Cryptography, Second Edition, by Bruce Schneier

	The ICA adapter is currently marketed as an Option for the IBM
	RS/6000 (TM).  

Driver Design

	The Leedslite driver is really divided into two components, devica and
leedslite.   Not surprisingly, their respective source code lies in 'devica.c'
and 'leedslite.c'.  The drivers can be compiled statically into the kernel
or as dynamically loadable kernel modules.

	The leedslite module is the driver for the "physical adapter".   The
devica module provides the function of a "virtual adapter" behind which
multiple leedslite physical adapters can be hidden.   Work can be distributed
across the registered leedslite instances by the devica component.

	NOTE:  Here are some up-front warnings:   This driver has only been
tested on i386 (32-bit) uniprocessor platforms with no DEVFS support.  There
has been some limited testing on a 2-way SMP box with no apparent I/O problems.
Porting to other platforms is greatly desired, but is certainly limited in
that this is mostly an OEM-able adapter from IBM.   With sufficient interest
this situation could change. (05/27/2003) -- This has now been tested on
ppc64 SMP. - Kent Yoder <yoder1@us.ibm.com>


Devica Module

	Devica can provide multiple "virtual adapters", though the default
configuration is a single instance which delivers work to all registered
leedslite instances.

	The registration functions are:
		ica_register_worker( )
		ica_unregister_worker( )

	Linux linked lists are used pervasively throughout both drivers.   Each
"virtual adapter" has its own list of workers to distribute work to.  The
registration functions are used by the leedslite device driver to register
instances of "physical adapters" for work.

	The work distribution is currently round robin.  A more complex
scheduling algorithm (possibly based on run-time statistics or other criteria)
could be inserted here.
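
	As an illustration only, round-robin selection can be sketched in
user space as follows.  The 'worker' and 'virtual_adapter' structures and the
'next_worker' function are hypothetical names, and an array stands in for the
Linux linked list that devica actually uses.

```c
#include <stddef.h>

/* Hypothetical model of one registered worker (one physical adapter). */
struct worker {
	int id;
};

/* Hypothetical model of one "virtual adapter" and its worker list. */
struct virtual_adapter {
	struct worker *workers;	/* registered workers */
	size_t count;		/* number of registered workers */
	size_t next;		/* index of the worker to use next */
};

/*
 * Return the next worker in round-robin order, or NULL if no
 * worker is registered.
 */
struct worker *next_worker(struct virtual_adapter *va)
{
	struct worker *w;

	if (va->count == 0)
		return NULL;

	w = &va->workers[va->next];
	va->next = (va->next + 1) % va->count;
	return w;
}
```

	A smarter scheduler would only need to replace the body of
'next_worker'; the callers would not change.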

	An interesting feature of devica is the auto-loading of
worker modules.   Upon init, the devica driver will attempt to use kmod
to load worker modules such as leedslite.  'maxmodules' controls the
number of aliases that devica will attempt to load via kmod.  The aliases to
be loaded are named 'ica-slot-%d', where '%d' is replaced by the integers
0 through maxmodules - 1.   For example, the default 'maxmodules' value is 1;
devica will try to load 'ica-slot-0'.   If modules.conf has an entry of:

	'alias ica-slot-0 leedslite'

the leedslite module will get autoloaded.   
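
	A minimal user-space sketch of how such an alias name can be built
is shown here.  'ica_slot_alias' is a hypothetical helper, not a function
from devica; in the kernel, the resulting name would presumably be handed to
the kmod request interface.

```c
#include <stdio.h>

/*
 * Format the alias for slot 'slot' into 'buf'.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int ica_slot_alias(char *buf, size_t len, int slot)
{
	int n = snprintf(buf, len, "ica-slot-%d", slot);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```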


Devica Module Parms

	maxmodules (default 1): number of worker modules to attempt autoload

	maxdevices (default 1): number of "virtual adapters" to support

	driver_major (default 0): device major number to be used


Leedslite Module

	This driver provides the interface to the ICA Adapter
functionality.    The driver can be accessed either directly from an
application or indirectly via the devica work distributor.

Initialization

	The module can support up to 'maxdevices' number of adapters.   
Originally, a driver supporting the PCI bus would need to walk the PCI 
configuration space itself to find its adapter.   The Linux Kernel now 
supports services to register callbacks for PCI device discovery.   This 
is also the interface used by PCI Hotplug.   As the PCI services discover
PCI functions, they call back to the registered driver.  The leedslite
driver uses these newer (and preferred) services.   In this fashion, the
driver should be PCI hot-plug compatible (though this has not been tested
since I do not have a box supporting this).

	Once an adapter is discovered, the driver allocates memory for data
structures and DMA buffers, resets the adapter, and initializes the card per
the hardware specifications.


Binding
	
	By default, the driver auto-registers with devica on the first 
virtual adapter.   The 'devica' parameter can be used to disable the 
auto-registration.   Once the leedslite driver is loaded there are 'bind'
and 'unbind' IOCtls which can manipulate this relationship, specifying the
"virtual adapter" to bind/unbind with.   It is not required that the
leedslite adapter be bound to devica, though the auto-registration is the
expected normal configuration.

	This is a good place to discuss MODULE use count.  Typically, one will
see MOD_INC_USE_COUNT in the 'open' processing and MOD_DEC_USE_COUNT in
the 'close' processing.   These two macros are used to track the number of
users of a module to allow its safe removal from the system.   As some
applications may be accessing the leedslite module via devica, the use count
macros are used at the beginning and end of the functional entrypoints,
as opposed to the open/close processing.  So when an RNG read entrypoint is
called, the use count is incremented; upon exit, the use count is
decremented.

RNG Operations

	The random number generation function of the adapter can be accessed
via the 'read' system call.  

	The card itself generates 8 random bytes per RNG interrupt.   If 
there are multiple consumers for the RNG data, the leedslite driver will
service the consumers in a round-robin fashion for each 8 bytes of new RNG
data.

	The card is programmed in a continuously running mode.  As soon as
all 8 bytes are read, the card arms itself to generate the next RNG
interrupt.   Care is taken in the driver to serialize/share data between
the interrupt handler and the consumer threads; this is achieved through the
use of Linux 'wait_queues' and atomic operations.

	Two important variables are:

	rng_current_wait:   This is where tasks wait for rng data to be 
		available.  The task at the front of this list will 
		receive the next (up to) 8 bytes of RNG data.

	entropy_available:  This communicates the presence of RNG data.
		This is set by the ISR and cleared by the task reading the
		RNG.   This prevents an accidentally awakened task from
		reading possibly stale or incorrect RNG data from the card.
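
	The role of the flag can be modelled in user space with C11 atomics.
This is only a sketch: 'rng_isr_mark_ready' and 'rng_try_claim' are
hypothetical names, and the driver itself uses the kernel's atomic
operations together with a wait queue rather than <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* 1 when fresh RNG data is ready to be read, 0 otherwise. */
static atomic_int entropy_available;

/* Interrupt side: mark new RNG data as available. */
void rng_isr_mark_ready(void)
{
	atomic_store(&entropy_available, 1);
}

/*
 * Consumer side: atomically test and clear the flag.  Returns true
 * only for the one caller that actually claims the fresh data, so a
 * task that was woken by accident sees false and goes back to sleep.
 */
bool rng_try_claim(void)
{
	return atomic_exchange(&entropy_available, 0) == 1;
}
```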

	Note: there is currently no 'poll' support, though this should be 
relatively easy to add.  



Non-RSA Operations

	The leedslite adapter can perform only 1 non-RSA cryptographic 
operation at a time.   This includes DES, TDES, SHA, and DESMAC operations.
    
	As such the driver needs to serialize and control access to the
leedslite hardware.    There is a two tier control mechanism based on two 
variables, des_wait and des_current_wait.

	des_wait:  This is a mutex controlling access to the second tier.
		Once past this point, the task has access to the adapter.

	des_current_wait:  This is where the task which currently owns
		access to the adapter waits until the programmed
		operation completes and it is woken up by the ISR.


	The driver currently does double-buffering.   Data is copied to and
from the 'desbuffer'.   The adapter is programmed with the bus address for
this buffer.  The size of 'desbuffer' can be tuned via module parameters.

	Each non-RSA operation (SHA, DES, TDES, DESMAC) sets up the
leedslite adapter a little bit differently.


RSA Operations

	The ICA Adapter has 5 UltraCypher cryptographic processors.
To minimize software overhead, the Leedslite provides a facility to post up
to 64 operations to an ICA Adapter.   The control functions on the card
manage the busmastering of the requests/data to and from the card.

	Three data structures are used to control this mechanism: the RIP, ROP,
and RSA_VF (VFIFO).   The RIP, or RSA Input Buffer, is used to define the
operation to be performed, its data, and its operation parameters.  The ROP, or
RSA Output Buffer, is used as the destination for the operation's output.  The
RSA_VF, or RSA Virtual FIFO, structure is used to relay the status for that
operation.  The bus addresses for these structures must be programmed into the
leedslite adapter upon initialization.

	The RIP, ROP, and RSA_VF are each 64 entry structures (actually this 
can be reduced via the 'rsabufs' module parameter if needed).   The index 
into the RIP and ROP is referred to as the RSAopID.  The RSA_VF is a circular
buffer.  

	Operationally, to submit a request to the adapter one must:

	1) Locate an unused entry (or wait for one)
	2) Fill out the RIP entry with operation data and parameters
	3) Submit the request to the adapter by programming the RSA Command Register
	4) Wait for completion
	5) If successful, return results to the requester from the ROP

	The adapter notifies the device driver through an interrupt that
n RSA operations are complete.   The interrupt handler reads n entries
of the RSA_VF FIFO.  Each entry returns the RSAopID of the completed
operation, as well as that specific operation's status.
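
	Draining n completions from the circular RSA_VF can be sketched as
shown here.  The entry layout ('vf_entry'), the 'rsa_state' bookkeeping and
the function name are assumptions made for illustration; the real layout is
defined by the functional specification.  'lwp' is taken to be the index of
the last entry already consumed.

```c
#include <stddef.h>

#define VFIFO_ENTRIES 64

/* Hypothetical layout of one RSA_VF entry. */
struct vf_entry {
	unsigned char opid;	/* RSAopID of the completed operation */
	unsigned char status;	/* status of that operation */
};

/* Hypothetical driver state for RSA completions. */
struct rsa_state {
	struct vf_entry vfifo[VFIFO_ENTRIES];
	size_t lwp;				/* last vfifo entry consumed */
	int done[VFIFO_ENTRIES];		/* indexed by RSAopID */
	unsigned char status[VFIFO_ENTRIES];	/* indexed by RSAopID */
};

/* Consume 'n' newly completed entries, wrapping around the buffer. */
void rsa_drain_vfifo(struct rsa_state *s, size_t n)
{
	while (n--) {
		const struct vf_entry *e;

		s->lwp = (s->lwp + 1) % VFIFO_ENTRIES;
		e = &s->vfifo[s->lwp];
		if (e->opid >= VFIFO_ENTRIES)
			continue;	/* ignore an out-of-range RSAopID */
		s->status[e->opid] = e->status;
		s->done[e->opid] = 1;	/* the waiter would be woken here */
	}
}
```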

	For a more detailed description of programming the ICA Adapter 
for RSA operations see the functional specification or refer to the source 
code.   

	RIP and ROP usage is managed through a list of "free" entries.  This 
list is labelled 'rsa_freelist_head' and contains entries of 'rsa_free_t' 
structures. One 'rsa_free_t' is allocated for this list for each entry in the
RIP/ROP/RSA_VF (typically 64).  Entries are removed from this list for use and
returned upon completion.  Within the device driver, an 'rsa_free_t' is
the structure used to track an individual operation, including its status, 
its index (or RSAopID), and a wait queue to sleep on and be woken from.

	To locate an unused RIP/ROP entry, one would look to the
free list.   Additionally, and for simplicity, a semaphore is used to
control access to this list.  The semaphore allows up to 'rsabufs' threads
in to allocate a free entry.  A thread that has passed the semaphore should
never find the free list empty, as the semaphore controls the number of used
entries.  Any further thread will wait at the semaphore until an entry has
been put back on the free list.
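
	The free list itself can be modelled as shown here.  This is a
simplified user-space sketch: 'rsa_entry' stands in for 'rsa_free_t', the
function names are hypothetical, and the counting semaphore is left out, so
an empty list is reported with NULL where the driver would instead block.

```c
#include <stddef.h>

#define RSABUFS 64	/* default 'rsabufs' value */

/* Simplified stand-in for the driver's 'rsa_free_t'. */
struct rsa_entry {
	int opid;		/* index into RIP/ROP (RSAopID) */
	struct rsa_entry *next;	/* next free entry */
};

static struct rsa_entry entries[RSABUFS];
static struct rsa_entry *free_head;

/* Put every entry on the free list, lowest RSAopID first. */
void rsa_freelist_init(void)
{
	free_head = NULL;
	for (int i = RSABUFS - 1; i >= 0; i--) {
		entries[i].opid = i;
		entries[i].next = free_head;
		free_head = &entries[i];
	}
}

/* Take one entry off the free list; NULL if none is free. */
struct rsa_entry *rsa_entry_get(void)
{
	struct rsa_entry *e = free_head;

	if (e)
		free_head = e->next;
	return e;
}

/* Return an entry to the free list once its operation completes. */
void rsa_entry_put(struct rsa_entry *e)
{
	e->next = free_head;
	free_head = e;
}
```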

     Here are some interesting RSA related fields:

     rsa_wait:  semaphore used for controlling access to free list
     rsa_freelist: saved allocation handle to memory used for entries
     rsa_freelist_head: the actual list head for entries
     rip:   RIP buffer
     rop:   ROP buffer
     vfifo: RSA_VF buffer
     lwp:   last word pointer; last entry used in vfifo
     

Interrupt Handling

	The leedslite adapter uses an interrupt to notify the host of a variety
of events, including completion and errors.

	A given interrupt actually returns a number of events.  The interrupt
handler reads the Piuma Interrupt Register, or PIR, to determine what events
have occurred and farms the processing out to the appropriate functions.   See
the 'leedslite_interrupt' function for the initial interrupt processing. 
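
	The dispatch pattern can be sketched as shown here.  The PIR bit
assignments are invented for illustration (the real ones are in the
functional specification), and counters stand in for the per-event handler
functions.

```c
#include <stdint.h>

/* Hypothetical PIR bit assignments, for illustration only. */
#define PIR_RNG	(1u << 0)	/* RNG data ready */
#define PIR_DES	(1u << 1)	/* non-RSA operation complete */
#define PIR_RSA	(1u << 2)	/* one or more RSA operations complete */
#define PIR_PAO	(1u << 3)	/* PCI Abort Occurrence */
#define PIR_MI	(1u << 4)	/* Merlin Interrupt */
#define PIR_BEO	(1u << 5)	/* Busmaster Error Occurrence */

/* Counters standing in for the per-event handler functions. */
struct irq_events {
	unsigned rng, des, rsa, pao, mi, beo;
};

/* Handle every event flagged in 'pir'; returns how many were handled. */
unsigned pir_dispatch(uint32_t pir, struct irq_events *ev)
{
	unsigned handled = 0;

	if (pir & PIR_RNG) { ev->rng++; handled++; }
	if (pir & PIR_DES) { ev->des++; handled++; }
	if (pir & PIR_RSA) { ev->rsa++; handled++; }
	if (pir & PIR_PAO) { ev->pao++; handled++; }
	if (pir & PIR_MI)  { ev->mi++;  handled++; }
	if (pir & PIR_BEO) { ev->beo++; handled++; }

	return handled;
}
```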

Error Recovery

	Different error recovery mechanisms are needed for a variety of
situations.  In general, the driver recovers where possible, but the affected
operation still fails.   The error is logged with a KERN_ERR level klog message.

	PCI Abort Occurrence, or PAO:  Currently, the driver just clears this
	condition.
	
	Merlin Interrupt, or MI:  This can happen for a variety of reasons.
	One of the simplest examples is passing illegal RSA
	operands (e.g. RSA modulus length > 8).   The error recovery for
	this case is quite painful and requires a reset of the card.
	Additionally, the driver fails all outstanding requests.

	Busmaster Error Occurrence, or BEO:  This should not happen, but there
	is a defined recovery mechanism (though also painful).  The procedure
	requires resetting the internal Piuma FIFOs and resetting the
	adapter.    There were early adapters which also generated BEO
	errors that were not correctable via this mechanism.  This should
	not be the case for the production hardware.
	
	

DMA Timeout

	When working with device drivers, expect hardware failure.   DMA
operations are used to move data independently of the host CPU.  Typically,
this is between memory and a peripheral.   Once a DMA operation is set in
motion, the memory is off-limits until the operation is complete.  With the
Leedslite, the adapter should generate an interrupt to indicate such a
completion.  As the driver has no way to interrupt the current operation,
the current task waits in the UNINTERRUPTIBLE state.

	However, if the hardware malfunctions, the interrupt may not get
generated.  The 'error_timeout' specifies how long we are willing to 
wait before the driver considers the hardware to have malfunctioned and
assumes the desbuffer is now safe for reuse.   Additionally, the
problem will be logged and the current task will return with an error.   



Teardown

	Teardown very simply reverses the resource allocations used during
the initialization and operation of the Leedslite device driver.  No
attempt is made to put the adapter in any known state; instead, this is
assumed to occur upon device driver initialization.



Leedslite Module Parms

	Note:  All values are specified as positive integers

	maxdevices (default 12):  maximum number of adapters to support

	devica (default 1): should the driver auto-register with devica

	desbuffersize (default 8192): number of bytes to allocate for DES DMA buffer

	rsabufs (default 64): number of entries allocated for RSA operations up to a maximum of 64

	sam (default 0): Speed Adjustment Mechanism, 0x00-0x3f; the lower the number, the higher the power consumption and performance

	pmwi (default 1):  Enable pseudo-MWI.   Configurable just because it is.

	error_timeout (default 60):  Minutes before a DMA operation is considered completely errant.  This should _not_ happen.   



Programming with the Leedslite Driver

	I prefer to have device drivers with simple interfaces,
especially in the absence of requirements otherwise.  As there were no
predefined interfaces for the cryptographic accelerator, the driver's software
interfaces often directly expose the Leedslite adapter internals.   Typically
an application would not use these interfaces directly, but would instead
access the adapter through higher level libraries, such as OpenSSL or PKCS11
implementations.  These higher level libraries would use the low-level
software interfaces to build their needed function.


open, close, read
	Typical filesystem calls are used to access the driver: open/close
to get access to the device, and 'read' to access the RNG read
function of the device driver.   For example, simple utilities such as 'cat'
can be used to access the RNG function of the adapter.
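
	From an application, reading RNG data might look like the sketch
shown here.  'rng_read' is a hypothetical helper, and the device node path
is left to the caller because it depends on how the node was created.  The
loop is there because a single 'read' may return fewer bytes than were asked
for.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Read exactly 'len' bytes from the device node at 'path' into 'buf'.
 * Returns 0 on success, -1 on error or premature end of file.
 */
int rng_read(const char *path, unsigned char *buf, size_t len)
{
	size_t got = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;

	while (got < len) {
		ssize_t n = read(fd, buf + got, len - got);

		if (n <= 0) {		/* error or end of file */
			close(fd);
			return -1;
		}
		got += (size_t)n;
	}

	close(fd);
	return 0;
}
```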

IOCtls (See <linux/icaioctl.h> )

	If enough interest is generated from this driver, detailed 
programming interfaces may be published.   There are testcases which 
demonstrate these functions and can be made available upon request.   
Otherwise, programming an application for this device driver will require 
access to the Leedslite Functional Specification.  If one actually has a 
Leedslite adapter, one likely has access to this specification.   Other 
examples will be viewable from the implementation of the libica library.   

	For RSA operations, all operands must be big-endian and pre-padded to
'inputdatalength' size.   Additionally, there are minimum and maximum
allowable sizes.   The limitations basically correspond to those required by
the adapter itself.
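
	Padding an operand could be done as in the sketch shown here.  It
assumes that "pre-padded" means leading zero bytes, which leaves the value of
a big-endian number unchanged; 'rsa_pad_operand' is a hypothetical helper,
not part of the driver or of libica.

```c
#include <string.h>

/*
 * Copy a big-endian operand of 'len' bytes into 'out', left-padding
 * with zero bytes up to 'padded_len'.  Returns 0 on success, -1 if
 * the operand does not fit.
 */
int rsa_pad_operand(unsigned char *out, size_t padded_len,
		    const unsigned char *in, size_t len)
{
	if (len > padded_len)
		return -1;

	memset(out, 0, padded_len - len);
	memcpy(out + (padded_len - len), in, len);
	return 0;
}
```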

	For DES operations, 


ICASETBIND
	Bind a physical adapter to a virtual devica adapter.
ICAGETBIND
	Determine the virtual adapter a physical adapter is bound to.
ICARSAMODEXPO
	Perform RSA (modular exponentiation) operation.  
ICARSACRT
	Perform RSA with CRT (Chinese Remainder Theorem) operation.
ICARSAMODMULT
	Perform modular multiplication operation
ICADES
	Perform DES operation
ICATDES
	Perform Triple-DES operation
ICADESMAC
	Perform DESMAC (SHA1 digest of DES result) operation.
ICATDESSHA
	Perform DES and SHA in parallel operations.  Unsupported.
ICATDESMAC
	Perform Triple-DES followed by SHA1 digest operation.
ICASHA1
	Perform SHA-1 message digest operation.   
ICARNG
	Read bytes from random number generator.


Miscellaneous Comments:
	The current interfaces are 1) synchronous and 2) built with applications
in mind.   It would be a very interesting task to enhance the driver to
support asynchronous interfaces.   Additionally, there are users, such as
IPSEC, that would need much more kernel friendly interfaces.

	The /dev/random could be enhanced with an interface to allow input by
a truly random number generator such as leedslite, or possibly be enhanced to
farm out RNG requests to such sources.   I've toyed with Jeff Garzik's
intel-rng-tools, which has a daemon to periodically read RNG data from a
device and then send this down to /dev/random as another RNG data source, and
this seems to work fine for the Leedslite device driver.

	I've used the devfs interfaces for registering the driver with the
kernel.   HOWEVER, I've never actually tested that this works.   This
would be a fantastic thing to actually try out.   I'm _not_ in favor of
writing code that I cannot test.   The rest of the driver has this discipline.
There will be at least one bug here, but fixing it should not take more than
one or two days' effort, assuming one knows how to debug kernel code and can
configure devfs.


Building the Device Driver

	The directions in this section assume someone fluent in building
both kernels and device drivers.

	The driver's compilation can be enabled through a config option of
'CONFIG_ICA_LEEDSLITE'.   You can access this through the xconfig menus:

	Character devices->IBM Leedslite Crypto Accel (EXPERIMENTAL)
	
however, this option depends on CONFIG_EXPERIMENTAL also being enabled.  This
driver can be compiled either statically in the kernel or as a module.

	Note:  No major number has been secured for the driver.  One can
either be configured as a module parameter, or this information can be gleaned
from /proc/devices.    Device nodes will need to be created for the driver's
major/minor numbers.   These words should make sense for those interested
in this driver.
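
	Looking up a major number in a listing of the /proc/devices format
can be sketched as shown here.  'find_major' is a hypothetical helper, and
the name under which the driver registers itself is not stated here, so the
caller has to supply it.

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan a /proc/devices style listing for a driver called 'name'.
 * Returns the major number of the first match, or -1 if none is found.
 */
int find_major(FILE *fp, const char *name)
{
	char line[128];
	char dev[64];
	int major;

	while (fgets(line, sizeof line, fp)) {
		/* Section headers and blank lines fail to parse; skip them. */
		if (sscanf(line, "%d %63s", &major, dev) == 2 &&
		    strcmp(dev, name) == 0)
			return major;
	}
	return -1;
}
```

	With the major number in hand, the device node can then be created
with 'mknod' in the usual way.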
 
	Again, if this adapter becomes widespread, more detailed information
and/or facilities may be supplied.
